SaaS Outage Survival Guide: Preparing for Third-Party Downtime
A complete guide to surviving SaaS and third-party outages. Covers preparation, dependency mapping, incident response, post-incident review, SLA credits, and building long-term resilience.
The average company depends on dozens of SaaS tools. Slack for communication. Stripe for payments. AWS for hosting. Shopify for commerce. Cloudflare for performance. When any of these goes down, your business feels it, even if your own infrastructure is running perfectly.
The question is not whether your vendors will have outages. They will. Even the most reliable services experience downtime. AWS had a 7-hour outage in December 2021 that cascaded across thousands of businesses. [1] Fastly's June 2021 incident took down Amazon, Reddit, and the New York Times for nearly an hour. [2] Facebook's October 2021 outage, in which a backbone maintenance error led to its BGP routes being withdrawn, made the entire platform unreachable for over 5 hours. [3]
The question is whether you will be ready when it happens. This guide covers the three phases of outage survival: preparing before outages happen, responding effectively during an outage, and learning from incidents afterward. It also covers the long-term architectural and organizational practices that make your business resilient to third-party failures.
Before the outage: Preparation
The work you do before an outage determines how well you handle it. Teams that prepare recover faster, communicate better, and suffer less business impact.
Dependency mapping
You cannot prepare for failures you do not know about. The first step is building a complete map of your third-party dependencies.
Direct dependencies are the services you use intentionally: your hosting provider, payment processor, email service, CRM, analytics platform, and so on.
Indirect dependencies are harder to spot. Your hosting provider depends on a DNS provider. Your payment processor depends on banking infrastructure. Your CDN depends on network providers. An outage in any of these affects you, even though you have no direct relationship with them.
Hidden dependencies are the ones nobody tracks: the free tier of a monitoring service, a library that phones home to a CDN, a font service that loads on every page, a third-party chat widget.
To build your dependency map:
- Inventory every external service. Go through your DNS records, your billing statements, your codebase imports, and your network traffic. If it is not on your servers, it is a dependency.
- Categorize by criticality. Which services, if they went down right now, would prevent you from operating? Those are critical. Which would degrade the experience but not stop operations? Those are important. Which would barely be noticed? Those are low-priority.
- Document the dependency chain. For critical services, understand their dependencies too. If your site runs on Vercel, and Vercel uses AWS, an AWS outage affects you even though you are not an AWS customer.
For a structured approach, see SaaS dependency mapping.
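The inventory, criticality, and dependency-chain steps above can be captured in a small data structure. This is a minimal sketch, not a prescribed format: the `Dependency` class, the `INVENTORY` entries, and both helper functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Dependency:
    name: str
    criticality: str  # "critical", "important", or "low"
    depends_on: list[str] = field(default_factory=list)


# Hypothetical inventory; replace with the services you actually find.
INVENTORY = {
    "vercel": Dependency("vercel", "critical", ["aws"]),
    "stripe": Dependency("stripe", "critical"),
    "aws": Dependency("aws", "critical"),  # indirect: reached through vercel
    "analytics": Dependency("analytics", "low"),
}


def upstream_of(name: str, inventory: dict[str, Dependency]) -> set[str]:
    """Return every service `name` depends on, directly or indirectly."""
    seen: set[str] = set()
    stack = list(inventory[name].depends_on)
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        if current in inventory:
            stack.extend(inventory[current].depends_on)
    return seen


def affected_by(failed: str, inventory: dict[str, Dependency]) -> list[str]:
    """List inventory entries hit when `failed` goes down."""
    return sorted(
        name for name in inventory
        if name == failed or failed in upstream_of(name, inventory)
    )
```

With this inventory, `affected_by("aws", INVENTORY)` includes `vercel`, which is the indirect exposure the dependency chain step is meant to surface.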
Fallback plans
For each critical dependency, define what you will do when it fails.
Payment processor down. Can you queue orders and process payments when the service recovers? Can you switch to a backup processor? Can you display a message letting customers know the issue is temporary?
Communication tool down (Slack, Teams). Where does your team communicate during the outage? Email? A backup Slack workspace on a different plan? Phone calls? Define the fallback before you need it. See incident communication templates for pre-written messages.
Hosting provider down. Can you fail over to a secondary provider? Can you serve a static maintenance page from a different host? Is your DNS configured for quick failover?
Email service down. Can transactional emails (password resets, order confirmations) queue and send later? Do your customers know about the delay?
Analytics down. This one rarely needs a fallback because the business continues without it. But know that you will have a gap in your data.
Not every dependency needs an elaborate fallback. The cost of the fallback should be proportional to the cost of the outage. A 1-hour analytics gap costs nothing. A 1-hour payment processing outage costs real revenue.
Communication templates
When an outage happens, you need to communicate fast. Writing clear, empathetic messages under pressure is hard. Writing them in advance is easy.
Prepare templates for:
- Internal notification. "We have detected an outage with [Service]. [Impact description]. We are monitoring the situation. Updates every [interval]."
- Customer notification (initial). "We are currently experiencing issues with [affected feature]. Our team is working on it. We will update you as soon as we have more information."
- Customer notification (update). "Update: the issue is caused by a [vendor] outage affecting [scope]. We expect resolution by [time if known, or 'We are monitoring for updates']."
- Customer notification (resolution). "The issue affecting [feature] has been resolved. [Brief explanation]. We apologize for the inconvenience."
- Post-incident summary. A template for the public post-mortem that includes timeline, impact, root cause, and preventive measures.
See the outage communication guide for detailed guidance on tone, timing, and channels.
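Storing the templates above as data lets you fill them in under pressure without retyping. A minimal sketch using Python's standard `string.Template`; the `TEMPLATES` keys and the `render` helper are hypothetical names, and only two of the templates are shown.

```python
from string import Template

# Placeholders use $name; the wording follows the templates listed above.
TEMPLATES = {
    "internal": Template(
        "We have detected an outage with $service. $impact. "
        "We are monitoring the situation. Updates every $interval."
    ),
    "customer_initial": Template(
        "We are currently experiencing issues with $feature. "
        "Our team is working on it. We will update you as soon as "
        "we have more information."
    ),
}


def render(kind: str, **fields: str) -> str:
    """Fill in a template; raises KeyError if a placeholder is missing."""
    # substitute() (unlike safe_substitute()) refuses to leave a blank
    # placeholder in a message that is about to go to customers.
    return TEMPLATES[kind].substitute(**fields)
```

For example, `render("internal", service="Stripe", impact="Checkout is failing", interval="15 minutes")` produces a ready-to-send internal notice.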
Monitoring your dependencies
You need to know when a vendor goes down before your customers tell you. Relying on vendor status pages is insufficient because status pages are often delayed (vendors are slow to acknowledge issues) and sometimes inaccurate (showing "operational" during active outages).
External monitoring. Use tools that check your vendors' availability independently. Monitor the specific endpoints you depend on, not just the vendor's homepage.
Synthetic transaction monitoring. For critical integrations (payment processing, API calls), run synthetic transactions at regular intervals. These detect functional failures that simple HTTP checks miss.
Social monitoring. Twitter/X and Reddit often surface outage reports faster than official status pages. Monitor for your vendor's name plus "down" or "outage."
Vendor status page subscriptions. Despite their limitations, subscribe to official status pages for email, webhook, or RSS notifications. They are useful as one signal among several.
For setting up monitoring, see outage alerts setup and what is vendor monitoring.
The goal of vendor monitoring is not to detect every outage. It is to detect outages that affect you before your customers notice. A 30-second Stripe blip that does not affect any transactions does not need a response. A 5-minute Stripe outage during peak checkout hours does.
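One way to ignore short blips while still catching real outages is to alert only after several consecutive failed checks. The sketch below assumes a `probe` callable that you supply (for example, an HTTP request against the specific endpoint you depend on, or a synthetic transaction); the class name and the threshold of 3 are illustrative assumptions, not recommendations.

```python
from typing import Callable


class DependencyCheck:
    """Tracks consecutive probe failures for one vendor endpoint."""

    def __init__(self, name: str, probe: Callable[[], bool],
                 failures_before_alert: int = 3) -> None:
        self.name = name
        self._probe = probe
        self._threshold = failures_before_alert
        self._consecutive_failures = 0

    def run_once(self) -> str:
        """Run the probe once and return "ok", "failing", or "alert"."""
        try:
            healthy = bool(self._probe())
        except Exception:
            healthy = False  # a probe that raises counts as a failed check
        if healthy:
            self._consecutive_failures = 0
            return "ok"
        self._consecutive_failures += 1
        if self._consecutive_failures >= self._threshold:
            return "alert"  # sustained failure: worth a response
        return "failing"  # could still be a short blip
```

Run `run_once()` on a schedule. With a 30-second interval and a threshold of 3, a single failed check never pages anyone, while an outage lasting a few minutes does.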
During the outage: Response
When a vendor outage hits, your response follows five steps: confirm, assess, communicate, activate fallbacks, and monitor for resolution.
Step 1: Confirm the outage
Before mobilizing your team, confirm that the problem is actually a vendor outage and not something on your end.
- Check the vendor's status page. If it shows an incident, you have confirmation (but remember, status pages lag behind reality).
- Check third-party monitoring. Services like Is That Down aggregate reports from multiple sources.
- Check your own monitoring. Is the failure consistent across all your monitoring locations, or is it localized?
- Check social media. Are other customers reporting the same issue?
- Test manually. Try the failing operation yourself from a different network or device.
The confirmation step takes 2 to 5 minutes and prevents you from wasting time debugging your own systems when the problem is upstream.
Step 2: Assess the impact
Determine what the outage means for your business right now:
- Which features are affected? A Stripe outage affects checkout. A Cloudflare outage might affect your entire site. An analytics outage affects nothing user-facing.
- How many users are impacted? Is this peak traffic time or a quiet period?
- Is the impact total or partial? Can users still browse, just not check out? Can they use the site from some regions but not others?
- What is the revenue impact? For e-commerce, calculate the rough cost per hour of the affected functionality being down.
- Are there workarounds? Can users accomplish their goal through an alternative path?
Step 3: Communicate
Based on your assessment, follow your communication plan:
Internal communication:
- Notify the relevant team(s) through your designated backup channel if the primary is affected
- Assign an incident commander who owns the response
- Set an update cadence (every 15 minutes for critical issues, every 30 minutes for important ones)
External communication:
- If the impact is user-visible, post an update on your own status page or social channels
- Use your prepared templates, customized to the specific situation
- Be honest about the cause ("a third-party payment processing issue") without blaming the vendor publicly
- Set expectations about resolution timeline: "The vendor reports this is being investigated" is better than "This will be fixed in 30 minutes" if you do not know
What not to do:
- Do not stay silent and hope nobody notices (they will, and the silence makes it worse)
- Do not provide overly optimistic timelines
- Do not publicly blame the vendor (deal with accountability privately)
Step 4: Activate fallbacks
If you prepared fallback plans, now is the time to use them:
- Switch to backup payment processor if available
- Enable static/cached versions of affected pages
- Route traffic away from the affected vendor
- Queue operations for later processing
If you did not prepare fallbacks, your options are limited to communicating and waiting. This is why Phase 1 (preparation) matters so much.
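"Queue operations for later processing" can be as simple as holding failed operations and replaying them in order once the vendor recovers. This is a minimal in-memory sketch; `DeferredQueue` and the `send` callable are hypothetical, and a production version would need durable storage and, for payments, idempotency so a replay cannot charge twice.

```python
from collections import deque
from typing import Any, Callable


class DeferredQueue:
    """Sends items through `send`, holding them back while the vendor is down."""

    def __init__(self, send: Callable[[Any], None]) -> None:
        self._send = send
        self._pending: deque = deque()

    @property
    def pending(self) -> int:
        return len(self._pending)

    def submit(self, item: Any) -> bool:
        """Try to send now; return False if the item was queued instead."""
        if self._pending:
            # Keep ordering: never jump ahead of items already waiting.
            self._pending.append(item)
            return False
        try:
            self._send(item)
            return True
        except Exception:
            self._pending.append(item)
            return False

    def flush(self) -> int:
        """Replay queued items in order; stop at the first failure."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except Exception:
                break  # vendor still failing; keep the rest queued
            self._pending.popleft()
            sent += 1
        return sent
```

Call `flush()` when your monitoring shows the vendor has recovered, and check `pending` afterward to confirm nothing is left behind.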
Step 5: Monitor for resolution
- Subscribe to vendor status page updates if you have not already
- Monitor your own systems for recovery signals (successful API calls, restored connectivity)
- Test the affected functionality manually when the vendor reports resolution
- Do not declare the incident resolved until you have confirmed recovery in your own environment (vendor "resolved" does not always mean your specific integration is working)
After the outage: Review and accountability
The hours and days after an outage are where you extract lasting value from a bad experience.
Internal post-incident review
Conduct a blameless post-incident review (also called a retrospective or post-mortem) within 48 hours while the details are fresh.
Timeline reconstruction: Build a minute-by-minute timeline from detection to resolution. When did the outage start? When did you detect it? When did you communicate? When was it resolved?
Detection analysis: How did you find out? Did your monitoring catch it, or did a customer report it? If monitoring missed it, why?
Response assessment: How quickly did you respond? Were the communication templates useful? Did the fallback plan work? What would you do differently?
Impact measurement: Total duration, number of affected users, revenue impact, support ticket volume.
Action items: What specific changes will prevent this from happening again (or reduce its impact next time)?
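The timeline questions above reduce to a few durations you can compute and compare across incidents. A minimal sketch; the function name and the example timestamps in the note that follows are hypothetical.

```python
from datetime import datetime


def incident_metrics(started: datetime, detected: datetime,
                     communicated: datetime, resolved: datetime) -> dict:
    """Return key incident durations in minutes."""

    def minutes(later: datetime, earlier: datetime) -> float:
        return (later - earlier).total_seconds() / 60

    return {
        "time_to_detect": minutes(detected, started),
        "time_to_communicate": minutes(communicated, detected),
        "total_duration": minutes(resolved, started),
    }
```

For an outage that started at 14:00, was detected at 14:07, communicated at 14:20, and resolved at 15:30, this gives 7 minutes to detect, 13 minutes to communicate, and 90 minutes in total. Tracking these per incident shows whether your monitoring and communication are improving.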
Vendor accountability
After the incident, engage with your vendor:
- Request a post-incident report. Reputable vendors publish these. If yours does not, ask for one directly.
- Review the root cause. Was this a one-time event or a systemic issue? How confident are you in the vendor's fix?
- Claim SLA credits. If the outage breached your SLA, file a credit claim within the required window. You typically need to provide evidence (timestamps, monitoring data). Do not expect the vendor to offer credits proactively.
- Evaluate the response. How quickly did the vendor detect, communicate, and resolve the issue? Was their communication honest and timely?
- Update your vendor scorecard. Track each vendor's incident history, response quality, and reliability trend over time. See the vendor reliability scorecard for a framework.
SLA credits: What to expect
SLA credits are the contractual remedy when a vendor fails to meet their uptime commitment. Here is the reality:
- Credits are usually small. A typical SLA credit is 10-25% of one month's fee. If you pay $500/month and the vendor had an 8-hour outage, you might get $50-125 in credits.
- Credits do not cover business losses. If the outage cost you $50,000 in lost revenue, a $125 credit is symbolic.
- You must claim them. Most vendors require you to file a claim within 30 days with evidence of the outage. No claim, no credit.
- The calculation matters. Read your SLA carefully. Some vendors measure uptime per-region, per-service, or per-instance. Your overall experience may have been degraded, but by their measurement, the SLA was not breached.
SLA credits are a baseline accountability mechanism, not insurance. They keep vendors honest about their commitments, but they do not make your business whole after a major outage.
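To see how the $500/month example works out, here is a sketch of a credit calculation. The tier schedule is an assumption made up for illustration (10% below 99.9% monthly uptime, 25% below 99.0%); your vendor's actual thresholds, measurement window, and exclusions are in its SLA document.

```python
# Hypothetical schedule: (uptime threshold in percent, credit as a fraction
# of the monthly fee), ordered from most to least severe breach.
CREDIT_TIERS = [(99.0, 0.25), (99.9, 0.10)]


def monthly_uptime_percent(downtime_minutes: float,
                           days_in_month: int = 30) -> float:
    """Uptime over the month as a percentage."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def sla_credit(monthly_fee: float, downtime_minutes: float,
               days_in_month: int = 30) -> float:
    """Credit owed under the hypothetical tier schedule above."""
    uptime = monthly_uptime_percent(downtime_minutes, days_in_month)
    for threshold, fraction in CREDIT_TIERS:
        if uptime < threshold:
            return monthly_fee * fraction
    return 0.0  # SLA not breached by this measurement
```

Under these assumed tiers, an 8-hour outage in a 30-day month is about 98.89% uptime, so `sla_credit(500, 480)` lands at the top of the $50-125 range, while a 30-minute outage (about 99.93% uptime) earns nothing.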
For more on evaluating vendor SLAs, see how uptime SLAs work.
The most common mistake after a vendor outage is doing nothing. You weather the storm, service restores, and you move on. Then the same vendor has another outage three months later, and you are equally unprepared. The post-incident review and the action items that come out of it are what prevent recurring impact.
Building long-term resilience
Surviving individual outages is tactical. Building a business that is structurally resilient to third-party failures is strategic.
Multi-vendor strategy
For your most critical dependencies, having a backup vendor is the most effective resilience strategy.
When multi-vendor makes sense:
- Payment processing (primary + backup processor)
- DNS (multiple providers for redundancy, see DNS monitoring)
- CDN (primary + backup, or a multi-CDN strategy)
- Communication (Slack + email + a backup channel)
- Cloud hosting (multi-region or multi-cloud for the highest availability requirements)
When multi-vendor is overkill:
- Analytics (a gap in data is annoying but not harmful)
- Project management tools (you can survive a day without Jira)
- Non-critical SaaS (the cost of maintaining two vendors exceeds the cost of occasional downtime)
The key question: what is the cost of this service being unavailable for 4 hours versus the cost of maintaining a backup? If the outage cost is significantly higher, a backup is justified.
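That comparison can be made explicit with rough numbers. A minimal sketch, assuming you can estimate revenue lost per hour and the yearly cost of keeping a backup vendor; the function name, parameters, and example figures are hypothetical.

```python
def backup_is_justified(revenue_loss_per_hour: float,
                        backup_cost_per_year: float,
                        outage_hours: float = 4.0,
                        outages_per_year: float = 1.0) -> bool:
    """Compare expected yearly outage cost with the yearly cost of a backup."""
    expected_outage_cost = (
        revenue_loss_per_hour * outage_hours * outages_per_year
    )
    return expected_outage_cost > backup_cost_per_year
```

A payment processor whose downtime costs $5,000 per hour justifies a $6,000 per year backup after a single 4-hour outage; an analytics tool with no revenue impact does not. Since the text asks for the outage cost to be significantly higher, you may want to require a margin rather than a bare comparison.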
Graceful degradation
Graceful degradation means your application continues to function (possibly with reduced capabilities) when a dependency fails, rather than crashing entirely.
Examples of graceful degradation:
- If your recommendation engine is down, show popular products instead of personalized recommendations
- If your analytics service is down, queue events locally and send them when the service recovers
- If your third-party search is down, fall back to basic database-powered search
- If your chat widget provider is down, show a "Contact us by email" message instead of a broken chat interface
- If your CDN is down, serve assets directly from your origin server (slower but functional)
Implementing graceful degradation:
- Identify the failure mode. For each third-party integration, what happens when it returns errors or times out?
- Define the fallback behavior. What should the user see instead?
- Implement timeouts. Never let a third-party call block your page indefinitely. Set aggressive timeouts (2-5 seconds for non-critical services, 10-15 seconds for critical ones).
- Test the fallback. Deliberately simulate vendor failures in staging and verify the degraded experience is acceptable.
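The steps above (a timeout plus a defined fallback) can be sketched with the standard library. This is an illustration under stated assumptions, not a production implementation: `call_with_fallback`, `personalized_recommendations`, and `POPULAR` are hypothetical names, and the slow vendor is simulated with `time.sleep`.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for third-party calls.
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_fallback(fn, fallback, timeout_s: float):
    """Run fn(); return `fallback` if it raises or exceeds timeout_s."""
    future = _executor.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The worker thread keeps running; cancel() only helps if the
        # call has not started yet. We simply stop waiting for it.
        future.cancel()
        return fallback
    except Exception:
        return fallback  # vendor returned an error


def personalized_recommendations():
    time.sleep(0.3)  # simulates a slow recommendation vendor
    return ["personalized-1"]


POPULAR = ["popular-1", "popular-2"]  # degraded but acceptable result
```

Here `call_with_fallback(personalized_recommendations, POPULAR, timeout_s=0.05)` returns the popular list instead of blocking the page. Where your HTTP client supports its own timeout setting, prefer that, since it also releases the underlying connection.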
Circuit breakers
A circuit breaker is a software pattern that detects when a third-party service is failing and automatically stops sending requests to it. [4] This prevents cascading failures where a slow or failing external service causes your own application to slow down or crash.
How circuit breakers work:
- Closed state (normal). Requests flow through to the third-party service as usual.
- Open state (tripped). After a threshold of failures (e.g., 5 consecutive errors), the circuit breaker "opens" and immediately returns a fallback response without calling the external service.
- Half-open state (testing). After a timeout period, the circuit breaker allows a limited number of requests through to test whether the external service has recovered. If they succeed, the circuit closes again. If they fail, it returns to the open state and the timeout restarts.
Circuit breakers prevent your application from:
- Wasting resources on requests that will fail
- Accumulating timeouts that slow down your entire application
- Overloading a recovering service with a flood of queued requests
Most language ecosystems have circuit breaker libraries available: Resilience4j for Java (Netflix's Hystrix is in maintenance mode), Polly for .NET, and opossum for Node.js.
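To make the three states concrete, here is a minimal single-threaded sketch of the pattern. It is not how any particular library implements it: the class, its parameters, and the injectable `clock` are assumptions for illustration, and unlike a real library it does not limit how many trial requests pass through while half-open.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5,
                 reset_timeout_s: float = 30.0,
                 clock=time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"
        return "open"

    def call(self, fn, fallback):
        """Call fn() unless the circuit is open; return fallback on failure."""
        if self.state == "open":
            return fallback  # fail fast without touching the vendor
        try:
            result = fn()
        except Exception:
            self._record_failure()
            return fallback
        # Success in closed or half-open state closes the circuit.
        self._failures = 0
        self._opened_at = None
        return result

    def _record_failure(self) -> None:
        self._failures += 1
        half_open_trial_failed = self._opened_at is not None
        if half_open_trial_failed or self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # open, or re-open, the circuit
```

Wrap each vendor call as `breaker.call(fetch_from_vendor, fallback_value)`, with one breaker instance per dependency so a failing vendor does not trip the circuit for a healthy one.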
Caching as resilience
Aggressive caching can keep your site functional during vendor outages:
- CDN caching. If your CDN caches your pages, users continue to see cached content even if your origin is down.
- API response caching. Cache responses from third-party APIs with appropriate TTLs. When the API is down, serve cached data (clearly marked as potentially stale if necessary).
- Service worker caching. For Progressive Web Apps, service workers can serve cached content when the network is unavailable.
The tradeoff is freshness. Cached data may be stale. For content sites, staleness of a few hours is usually acceptable. For real-time data (stock prices, inventory counts), you need a different approach.
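API response caching with a stale fallback can be sketched as follows. The `StaleOnErrorCache` class and its `fetch` callable are hypothetical; the idea is to serve fresh data within the TTL, refetch after it, and fall back to the last good value (flagged as stale) when the refetch fails.

```python
import time


class StaleOnErrorCache:
    def __init__(self, fetch, ttl_s: float, clock=time.monotonic) -> None:
        self._fetch = fetch  # fetch(key) -> value; may raise when vendor is down
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}  # key -> (value, stored_at)

    def get(self, key):
        """Return (value, is_stale)."""
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl_s:
            return entry[0], False  # still within TTL
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None:
                return entry[0], True  # vendor down: serve last good value
            raise  # nothing cached, so there is nothing to fall back to
        self._entries[key] = (value, now)
        return value, False
```

The `is_stale` flag lets the caller decide whether to show the data with a warning or hide it, which matters for the real-time cases mentioned above.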
Dependency health dashboards
Create a single dashboard that shows the health of all your critical dependencies:
- Current status of each vendor (from your monitoring, not their status page)
- Recent incident history
- SLA compliance over the past 30/90 days
- Time since last outage
This dashboard serves as an early warning system and as a management tool for vendor accountability. For a framework, see vendor reliability scorecard. For background on crowd-sourced status data, see what is Downdetector.
Outage response playbook
Here is a consolidated playbook you can adapt for your organization.
Severity levels
| Severity | Criteria | Response |
|----------|----------|----------|
| S1 - Critical | Revenue-impacting, all users affected | Full incident response, executive notification, customer communication |
| S2 - Major | Significant degradation, many users affected | Incident commander assigned, engineering response, customer communication if extended |
| S3 - Minor | Limited impact, workaround available | Engineering notified, monitored, communicated if user-visible |
| S4 - Informational | Vendor issue detected, no user impact yet | Monitored, no immediate action needed |
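Encoding the severity criteria as a function keeps classification consistent when people are under pressure. This is one possible reading of the table, not a standard: the function name, the 95% and 25% cut-offs, and the rule that limited impact without a workaround counts as S2 are all assumptions to adjust for your organization.

```python
def classify_severity(user_visible: bool,
                      revenue_impacting: bool,
                      share_of_users_affected: float,
                      workaround_available: bool) -> str:
    """Map an impact assessment to S1-S4 using hypothetical thresholds."""
    if not user_visible:
        return "S4"  # vendor issue detected, no user impact yet
    if revenue_impacting and share_of_users_affected >= 0.95:
        return "S1"  # revenue-impacting, effectively all users
    if share_of_users_affected < 0.25 and workaround_available:
        return "S3"  # limited impact with a workaround
    return "S2"  # everything else that users can see
```

The inputs match the questions in the impact assessment step: which features are affected, how many users, and whether a workaround exists.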
Response timeline
0-5 minutes: Confirm the outage (check vendor status, your monitoring, social media)
5-10 minutes: Assess impact and assign severity level
10-15 minutes: Activate fallback plan (if S1/S2) and notify team
15-30 minutes: Send first customer communication (if S1/S2)
Every 15-30 minutes: Provide updates internally and externally
Upon resolution: Verify recovery in your environment, send all-clear communication
Within 48 hours: Conduct post-incident review
Within 7 days: Complete action items and file SLA credit claims
For a detailed playbook, see vendor outage response playbook.
Choosing resilient vendors
Prevention starts with vendor selection. When evaluating SaaS vendors, look beyond features and pricing. See choosing reliable SaaS vendors for a complete evaluation framework.
Historical reliability. Check the vendor's status page archive. How often do they have incidents? How long do incidents last? How transparent are their post-mortems?
Architecture transparency. Does the vendor publish information about their infrastructure? Multi-region deployment, redundancy, and disaster recovery capabilities are positive signals.
SLA terms. Read the actual SLA document, not the marketing page. What is the uptime commitment? What are the exclusions? What are the credit amounts?
Communication quality. During past incidents, how quickly did the vendor communicate? Were updates honest and useful, or vague and delayed?
Dependency risk. Is the vendor itself dependent on a single cloud provider? If so, that provider's outages become your outages.
The most expensive vendor is not always the most reliable. And the cheapest vendor is not always the least reliable. Evaluate based on actual track record and architectural choices.
References
[1] AWS, "Summary of the AWS Service Event in the Northern Virginia (US-EAST-1) Region," aws.amazon.com/message/12721/, December 2021.
[2] Fastly, "Summary of June 8 outage," fastly.com/blog/summary-of-june-8-outage, June 2021.
[3] Facebook Engineering, "More details about the October 4 outage," engineering.fb.com/2021/10/05/networking-traffic/outage-details/, October 2021.
[4] M. Nygard, "Release It! Design and Deploy Production-Ready Software," Pragmatic Bookshelf, 2018. The definitive reference on circuit breakers and other stability patterns.
[5] Productiv, "2023 State of SaaS," productiv.com/blog/state-of-saas-2023. Reports the average enterprise uses 371 SaaS applications.
[6] Gartner, "The Cost of Downtime," referenced in multiple Gartner research publications and widely cited across the industry as $5,600 per minute average.
[7] PagerDuty, "Incident Response Best Practices," pagerduty.com/resources/learn/incident-response-best-practices.
[8] Google, "Site Reliability Engineering: Postmortem Culture," sre.google/sre-book/postmortem-culture/.