Famous SaaS Outages and What We Learned

A look at major SaaS outages from Fastly, AWS, Facebook, Slack, and Cloudflare. What happened, how long they lasted, and what we can learn from each.

Every major internet service has gone down. It does not matter how large the team, how deep the pockets, or how sophisticated the infrastructure. Outages happen to everyone, and when they happen to services that millions of people and businesses depend on, the ripple effects are massive.

These incidents are not just war stories. Each one reveals something about how modern internet infrastructure fails and what you can do to protect your own business when a service you depend on goes dark. For a complete framework for handling vendor outages, see the SaaS outage survival guide.

Fastly CDN Outage -- June 2021

What Happened

On June 8, 2021, a valid configuration change made by a customer of Fastly, a major content delivery network, triggered a latent software bug that caused 85% of Fastly's network to return 503 errors. The outage began at approximately 09:47 UTC and affected websites and services globally.

Who Was Affected

Because Fastly is a CDN used by some of the largest websites in the world, the impact was enormous. Amazon, Reddit, Twitch, Spotify, the New York Times, BBC, the UK government website (gov.uk), Pinterest, and thousands of other sites went down simultaneously. Users worldwide saw 503 Service Unavailable errors or blank pages.

How Long It Lasted

Fastly detected the disruption within one minute, then identified the cause and disabled the configuration that triggered it. Within 49 minutes of the initial impact, 95% of its network was operating normally, and most affected sites were back within about an hour.

What We Learned

A single CDN configuration change took down a significant portion of the internet. The incident highlighted the concentration risk of relying on a small number of CDN providers. It also showed how fast a well-prepared team can recover. Fastly's detection and response time was impressive, but the blast radius of the failure reminded everyone that CDN infrastructure is a single point of failure for many websites.

Takeaway for your business: Know which CDN your critical vendors use. If multiple services you depend on share the same CDN, an outage there can take down everything at once. Consider whether your own site can serve stale content or fall back to the origin during a CDN outage.
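
One way to picture the "serve stale or fall back to the origin" idea is the minimal Python sketch below. It is an illustration under assumptions, not a description of how Fastly or any particular CDN works: the names `StaleCache` and `fetch_with_fallback`, the one-hour staleness limit, and the two fetcher callables are all hypothetical stand-ins for your own HTTP client code.

```python
import time
from typing import Callable, Dict, Optional, Tuple

# Hypothetical fetcher type: takes a path, returns the body, raises on failure.
Fetcher = Callable[[str], str]


class StaleCache:
    """Keeps the last good response per path so it can be served during an outage."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[float, str]] = {}

    def put(self, path: str, body: str) -> None:
        self._entries[path] = (time.time(), body)

    def get(self, path: str, max_age_seconds: float) -> Optional[str]:
        entry = self._entries.get(path)
        if entry is None:
            return None
        stored_at, body = entry
        if time.time() - stored_at > max_age_seconds:
            return None  # too old to serve, even in an emergency
        return body


def fetch_with_fallback(
    path: str,
    via_cdn: Fetcher,
    via_origin: Fetcher,
    cache: StaleCache,
    max_stale_seconds: float = 3600.0,  # assumed limit; tune for your content
) -> Tuple[str, str]:
    """Try the CDN, then the origin, then the last good copy. Returns (body, source)."""
    for source, fetch in (("cdn", via_cdn), ("origin", via_origin)):
        try:
            body = fetch(path)
        except Exception:
            continue  # this source is down; try the next one
        cache.put(path, body)
        return body, source
    stale = cache.get(path, max_stale_seconds)
    if stale is not None:
        return stale, "stale"
    raise RuntimeError(f"no source available for {path}")
```

The second element of the return value records which source answered, so a rising share of "origin" or "stale" responses can itself serve as an early signal that the CDN is in trouble.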

AWS US-East-1 Outage -- December 2021

What Happened

On December 7, 2021, Amazon Web Services experienced a major outage in its US-East-1 region, the largest and most heavily used AWS region. The root cause was an issue with the internal network that interconnects AWS services within the region. Automated systems that manage network devices triggered excessive activity, overwhelming the internal network.

Who Was Affected

The outage cascaded across dozens of AWS services: EC2, ECS, Lambda, DynamoDB, RDS, and the AWS Management Console itself. Because US-East-1 is the default region for many AWS services, and the control planes of some services operate only from US-East-1, the impact extended far beyond companies that intentionally chose that region.

Consumer services affected included Disney+, Ticketmaster, Venmo, the McDonald's app, Instacart, and many more. AWS's own status dashboard was impacted, making it difficult for customers to get accurate information about the outage.

How Long It Lasted

The outage lasted approximately 10 hours before services were fully restored. Some services recovered earlier, but intermittent issues persisted throughout the day.

What We Learned

US-East-1 is a single point of failure for a surprising amount of the internet. Even companies that thought they were multi-region often had dependencies on US-East-1 for control plane operations, authentication, or specific services.

The fact that AWS's own status page was affected highlighted the importance of status monitoring that does not depend on the same infrastructure as the services it monitors.

Takeaway for your business: Review your dependencies on specific cloud regions. Ensure your monitoring and status systems operate independently of your primary infrastructure. Use Is That Down or similar external monitoring to track the services you depend on.

Facebook, Instagram, WhatsApp Outage -- October 2021

What Happened

On October 4, 2021, Facebook, Instagram, WhatsApp, and Messenger went completely offline for approximately six hours. The cause was a BGP (Border Gateway Protocol) configuration change that disconnected Facebook's DNS servers from the internet. Facebook's domains stopped resolving entirely, as if the company's entire online presence had been erased from the internet's routing tables.

Who Was Affected

Approximately 3.5 billion people use at least one Facebook-owned service. The outage affected personal communication (WhatsApp is the primary messaging platform in many countries), business operations (companies that rely on Facebook Messenger for customer service, WhatsApp for business communication), and advertising (Facebook Ads stopped serving).

The economic impact was estimated at over $60 million in lost ad revenue for Facebook alone. Businesses that depend on Facebook's platforms for sales and customer communication lost most of a business day.

How Long It Lasted

The outage lasted approximately six hours. Recovery was complicated by the fact that Facebook's internal tools depended on the same network, so engineers reportedly had difficulty accessing their own systems to fix the problem. There were reports of engineers physically going to data centers because remote access was unavailable.

What We Learned

A single routing configuration error took down an entire portfolio of services serving billions of people. The incident was a stark reminder that even the largest technology companies can experience catastrophic failures.

The most notable lesson was the cascading effect of internal dependency. Facebook's internal communication, access systems, and engineering tools all depended on the same infrastructure that went down. When everything relies on the same foundation, a foundation failure is total.

Takeaway for your business: If your business communicates with customers through a single platform (WhatsApp, Facebook Messenger, or any other), have a backup communication channel ready. Do not let a vendor outage cut off your customer communication entirely. See the vendor monitoring guide for a structured approach.

Slack Outage -- February 2022

What Happened

On February 22, 2022, Slack experienced a widespread outage that prevented users from connecting, sending messages, and accessing channels. The root cause was related to a database infrastructure change that went wrong during a routine operation.

Who Was Affected

Slack is the primary communication tool for millions of knowledge workers. When it goes down, team communication stops. Companies that use Slack as their central hub for conversation, alerts, and coordination lose their primary real-time communication channel.

The outage particularly affected remote and hybrid teams, where Slack is often the substitute for in-person conversation. Engineering teams that route monitoring alerts through Slack lost visibility into their own infrastructure health during the outage.

How Long It Lasted

Significant disruption lasted approximately 5 hours, with degraded performance for several hours afterward. Some users experienced intermittent issues for most of the workday.

What We Learned

Communication tools are critical infrastructure for modern businesses, and they are also single points of failure. When Slack goes down, teams cannot coordinate, alerts do not reach engineers, and remote workers are isolated.

The incident reinforced the importance of having a backup communication channel. A Slack outage is a predictable event. Having a documented fallback (email distribution list, Microsoft Teams room, Zoom bridge, or even a group text thread) means the team can keep working.

Takeaway for your business: Set up a backup communication channel for your team that does not depend on Slack. Document it so everyone knows where to go during an outage. Monitor Slack's status with Is That Down so you know about outages before they disrupt your workflow.
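
For alerts in particular, the fallback can be built into the code that sends them. The Python sketch below is a minimal illustration, assuming each channel is wrapped in a function that raises an exception when delivery fails; `send_with_fallback` and the channel names are hypothetical, and real Slack or email senders would replace the stand-ins.

```python
from typing import Callable, List, Sequence, Tuple

# A channel is a (name, send) pair; send is assumed to raise if delivery fails.
Channel = Tuple[str, Callable[[str], None]]


def send_with_fallback(message: str, channels: Sequence[Channel]) -> str:
    """Try each channel in priority order; return the name of the one that delivered."""
    errors: List[str] = []
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception as exc:
            errors.append(f"{name}: {exc}")  # remember why this channel failed
    raise RuntimeError("all channels failed: " + "; ".join(errors))


# Example with stand-in senders: the primary channel is down, the backup works.
delivered: List[str] = []


def slack_stub(msg: str) -> None:
    raise ConnectionError("chat service unreachable")


def email_stub(msg: str) -> None:
    delivered.append(msg)


used = send_with_fallback("database latency high", [("slack", slack_stub), ("email", email_stub)])
```

Listing the channels in one ordered sequence also doubles as documentation: the order in the code is the order the team should expect alerts to arrive in during an outage.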

Cloudflare Outage -- June 2022

What Happened

On June 21, 2022, Cloudflare experienced an outage affecting 19 of its data centers, including some of its largest and busiest locations. The root cause was a change to network configuration that was part of a long-running project to increase resilience in Cloudflare's busiest data centers.

A configuration error caused a subset of the network to become unreachable, affecting traffic routed through those 19 data centers. Because these included major hubs, a disproportionate amount of global traffic was affected.

Who Was Affected

Cloudflare handles a significant percentage of global internet traffic. The 19 affected data centers served high-traffic regions, so the impact was felt broadly. Websites using Cloudflare for DNS, CDN, or security services experienced errors or were unreachable. Discord, Shopify, Fitbit, and many other services were affected.

How Long It Lasted

The outage lasted roughly 75 to 90 minutes for the affected data centers; Cloudflare's post-incident timeline puts the impact at about 06:27 to 07:42 UTC. Because only a subset of data centers was impacted, the outage was regional rather than global. Users in some locations experienced no disruption while others were completely unable to access affected sites.

What We Learned

The incident showed that infrastructure improvements can introduce new risks. Cloudflare was working to make its network more resilient, and a mistake during that process caused the very kind of outage the project was meant to prevent. This is a common pattern: changes intended to improve reliability carry the risk of causing failures during implementation.

Cloudflare's transparent post-incident report detailed exactly what went wrong, step by step. This level of transparency set a good example for the industry.

Takeaway for your business: Even services actively working to improve reliability can have outages. Monitoring is not paranoia; it is preparation. Set up alerts for the services you depend on so you know within minutes when something breaks. See how to check if a service is down and outage alerts setup.
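
"Within minutes" usually comes down to two numbers: how often you probe and how many failures in a row you require before alerting. The Python sketch below shows that logic in isolation. It is a hypothetical illustration (the `OutageDetector` name and the threshold of three are assumptions), and it leaves out the probe itself, which could be an HTTP request or a status page check.

```python
from dataclasses import dataclass


@dataclass
class OutageDetector:
    """Turns a stream of pass/fail probe results into alert and recovery events."""

    failures_to_alert: int = 3  # assumed threshold; lower is faster but noisier
    consecutive_failures: int = 0
    alerting: bool = False

    def record(self, check_passed: bool) -> str:
        """Feed one probe result; returns 'alert', 'recovered', or 'none'."""
        if check_passed:
            self.consecutive_failures = 0
            if self.alerting:
                self.alerting = False
                return "recovered"
            return "none"
        self.consecutive_failures += 1
        if not self.alerting and self.consecutive_failures >= self.failures_to_alert:
            self.alerting = True
            return "alert"  # fires once per outage, not on every failed probe
        return "none"


detector = OutageDetector(failures_to_alert=3)
events = [detector.record(ok) for ok in [True, False, False, False, False, True, True]]
# events == ["none", "none", "none", "alert", "none", "recovered", "none"]
```

With a probe every 60 seconds and a threshold of three, the alert fires about three minutes into an outage, and a single failed probe does not page anyone.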

Every major outage listed here was caused by a configuration change or internal error, not by external attacks. The biggest risk to internet services is the complexity of their own systems. Your job is not to prevent vendor outages (you cannot). Your job is to detect them quickly and have a plan for when they happen.

Patterns Across Major Outages

Looking at these incidents together, several patterns emerge:

Configuration changes are the leading cause. All five outages above were triggered by internal changes, not external events.

Blast radius scales with centralization. The more services that depend on a single provider, the wider the impact when that provider fails.

Status pages are often affected. When a service's own infrastructure goes down, its status page and internal tools may go down too, creating an information vacuum.

Recovery depends on access. Engineers need to access systems to fix them. When the outage prevents access, recovery takes longer.

Transparent post-mortems build trust. Companies that publish detailed post-incident reports maintain credibility. Those that stay silent lose trust.

What You Can Do

You cannot prevent vendor outages. But you can prepare:

  • Monitor your critical vendors with an independent tool that alerts you when they report incidents. Is That Down automates this.
  • Map your vendor dependencies to understand your exposure. See SaaS dependency mapping.
  • Have a response playbook for each critical vendor. What do you do when Slack is down? When AWS is down? When your CDN is down? See the vendor outage response playbook.
  • Communicate with your own users when a vendor outage affects your service. See the outage communication guide.
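
The dependency-mapping step can start as something very small. The Python sketch below groups vendors by the underlying providers they rely on and reports every provider shared by two or more of them, which is where a single outage would hit several tools at once. The function name and all vendor and provider names are made up for illustration; the real mapping has to come from your vendors' documentation or status pages.

```python
from collections import defaultdict
from typing import Dict, List, Set


def shared_dependencies(vendor_deps: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each underlying provider to the vendors relying on it (shared ones only)."""
    by_provider: Dict[str, List[str]] = defaultdict(list)
    for vendor in sorted(vendor_deps):  # sorted so the vendor lists are stable
        for provider in vendor_deps[vendor]:
            by_provider[provider].append(vendor)
    return {p: vendors for p, vendors in by_provider.items() if len(vendors) >= 2}


# Hypothetical inventory: which infrastructure each vendor is believed to rely on.
inventory = {
    "chat-tool": {"cloud-a/us-east"},
    "billing": {"cloud-a/us-east", "cdn-x"},
    "helpdesk": {"cdn-x"},
    "crm": {"cloud-b"},
}

overlaps = shared_dependencies(inventory)
# overlaps == {"cloud-a/us-east": ["billing", "chat-tool"], "cdn-x": ["billing", "helpdesk"]}
```

Each key in the result is a candidate for its own response playbook, because an outage there affects every vendor in the list at the same time.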

Know when your vendors go down

Is That Down monitors the status pages of the services your business depends on and sends you alerts the moment an incident is reported.
