The AWS and Azure Outages of October 2025: Analysis, Lessons, and Resilience Strategies

On Décrypte: the AWS and Azure outages of October 2025

How could these two cloud giants go down?

The AWS Outage of October 20, 2025: The cascading disaster scenario

On October 20, 2025, AWS experienced one of its most severe regional outages in years. In the early hours of the morning, a latent defect in the automation that manages DNS records for DynamoDB (one of AWS's core managed database services) was triggered, setting off a cascade of failures that affected 113 AWS services for roughly 15 hours.

The mechanism behind this disaster is revealing: the defect corrupted the DNS entries for DynamoDB's regional endpoint (DNS being the infrastructure that translates domain names into IP addresses). Unable to resolve that endpoint, dependent AWS services could no longer reach the infrastructure they relied on, and the failure spread progressively. ThousandEyes, a network monitoring platform, detected massive packet loss at AWS edge nodes located in Northern Virginia, the first observable symptom of the disaster.

The consequences were global: services such as Snapchat, Pinterest, Fortnite, Roblox, Reddit, Venmo, Disney+, Canva, and even Amazon's own retail and support systems were knocked offline. The outage was centered on the us-east-1 region, the heart of AWS's North American infrastructure.

What makes this incident especially concerning is that it fits a well-known risk-management model, the "Swiss cheese model": multiple minor failures line up to produce a major one. In this case, the initial defect combined with weaknesses in systemic redundancy, and even the recovery process acted as an "additional layer of Swiss cheese."

The Microsoft Azure Outage of October 29, 2025: The centralized configuration error

Barely a week after the AWS incident, Microsoft experienced a very similar one. On October 29, 2025, around noon Eastern time, an inadvertent configuration change within Azure Front Door (AFD, Microsoft's global content delivery and load-balancing layer) triggered a massive outage that lasted roughly 9 hours.

Azure Front Door is the backbone of traffic routing for Azure and Microsoft 365. A single faulty configuration change prevented AFD nodes across the entire global fleet from loading their configuration properly. This created a traffic imbalance: the "unhealthy" nodes had to be withdrawn from rotation, overloading the remaining healthy ones.

The impact was devastating and widely distributed. Azure Active Directory B2C, Azure Databricks, Azure SQL Database, Azure Virtual Desktop, Microsoft Sentinel, and even the Azure portal itself became inaccessible. Outside Microsoft, Alaska Airlines and Hawaiian Airlines ran into problems with online check-in. The Scottish Parliament suspended its live voting. More than 16,000 Azure outage reports and 9,000 Microsoft 365 outage reports were logged on Downdetector.

The critical finding: this was neither a cyberattack nor a hardware failure. It was a human error in configuration management. Microsoft had to block all further configuration changes to keep the defective state from spreading, then gradually roll the "last known good configuration" out across its global fleet in controlled phases.

Can we expect more incidents of this kind?

These two incidents echo an event that marked the industry a year earlier. In July 2024, CrowdStrike (a global leader in endpoint detection and response (EDR), with an 18% global market share) deployed an update containing a critical bug. The update crippled 8.5 million Windows machines worldwide and disrupted global air traffic for several days.

The CrowdStrike event played out differently from the recent cloud outages, but its implications were just as severe. The incident revealed several systemic failures:

  • Lack of adequate change testing: updates were not sufficiently validated before global deployment
  • Lack of tested redundancy: even organizations following best practices (maintaining a delayed production version) discovered that critical system components did not have the expected protections
  • Lack of supply-chain risk assessment: organizations had not mapped critical dependencies on third-party security solutions
  • Insufficient incident response plans: even with a disaster recovery plan, organizations struggled to implement remediation steps

This 2024 incident, combined with the 2025 AWS and Azure outages, establishes a clear pattern: the vulnerability is not a technical inevitability; it is organizational. Failures occur not because the technology lacks robustness, but because change-management processes and pre-deployment validation are insufficient.

Future risks: a statistical certainty

The key question is not “if” another outage will occur, but “when.” The cloud industry now operates at a scale where errors affect millions of users and thousands of businesses almost instantly. Risk factors persist:

  • Configuration and deployment: As infrastructure expands, the number of possible failure points multiplies. Centralized control planes such as Azure Front Door or AWS's internal DNS automation remain potential single points of failure (SPOF).
  • Hidden dependencies: As seen in the CrowdStrike outage, organizations often underestimate the extent of their dependencies on third-party providers.
  • Accumulation of layered systems: Modern cloud environments are built from stacked, shared layers (DNS, load balancing, virtualization, etc.). A defect in one layer affects everything built on top of it, creating a chain reaction when something goes wrong.

How can companies protect themselves?

For organizations consuming services from major cloud providers like AWS or Microsoft, the approach must be multilayered. The message is clear: never assume your provider is immune to outages, and identify your critical services and the infrastructure they rely on (keeping in mind that many SaaS solutions themselves run on AWS or Microsoft Azure).

Three essential strategies exist:

1. Set up multi-region architecture with automatic failover

For critical services, duplicate your infrastructure across several AWS or Azure regions. If one region experiences an outage, traffic automatically switches to a secondary region.

How to implement it? 

  • Use AWS Route 53 for intelligent DNS routing and automatic failover, or Azure Traffic Manager for Microsoft solutions
  • Set up continuous health checks to detect failures, with a robust observability strategy for critical systems

The advantage: even if a region like us-east-1 (for AWS) goes down, as in the October incident, your traffic automatically reroutes to us-west-2 or a European region.
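As a rough sketch, here is what such a failover pair could look like with boto3 and Route 53, assuming a hosted zone, a health-checked primary endpoint in us-east-1, and a secondary in us-west-2. The zone ID, domain name, and IP addresses are placeholders, not values from the incidents described above.

```python
import boto3

route53 = boto3.client("route53")

# Placeholder values -- replace with your own hosted zone and endpoints.
HOSTED_ZONE_ID = "Z0000000000000000000"
DOMAIN = "app.example.com"
PRIMARY_IP = "203.0.113.10"    # e.g. us-east-1 endpoint
SECONDARY_IP = "203.0.113.20"  # e.g. us-west-2 endpoint

# 1. Health check that continuously probes the primary endpoint.
health_check = route53.create_health_check(
    CallerReference="primary-endpoint-check-001",
    HealthCheckConfig={
        "Type": "HTTPS",
        "IPAddress": PRIMARY_IP,
        "Port": 443,
        "ResourcePath": "/health",
        "RequestInterval": 30,   # seconds between probes
        "FailureThreshold": 3,   # consecutive failures before failover
    },
)

# 2. PRIMARY / SECONDARY failover records with a low TTL.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": DOMAIN,
                    "Type": "A",
                    "SetIdentifier": "primary",
                    "Failover": "PRIMARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": PRIMARY_IP}],
                    "HealthCheckId": health_check["HealthCheck"]["Id"],
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": DOMAIN,
                    "Type": "A",
                    "SetIdentifier": "secondary",
                    "Failover": "SECONDARY",
                    "TTL": 60,
                    "ResourceRecords": [{"Value": SECONDARY_IP}],
                },
            },
        ]
    },
)
```

With this kind of record pair, Route 53 answers with the primary address as long as its health check passes and automatically serves the secondary one when it does not; Azure Traffic Manager offers an equivalent priority-based routing mode.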

2. Diversify your multi-cloud approach for critical services

Identify your most critical services from a business continuity perspective (payments, authentication, transactions, etc.), and consider a multi-cloud presence to reduce the potential impact surface. This means limiting vendor lock-in, i.e. depending on a single primary provider for most of your services and infrastructure.

What this means: 

  • Essential capabilities are deployed on both AWS and Microsoft Azure
  • Less critical services can remain “single cloud” to control infrastructure complexity and manage cloud costs

Obstacles to consider: 

  • Increased operational complexity and costs
  • Need to abstract cloud dependencies (see the sketch at the end of this section)
  • More support teams required

E-commerce companies, financial services, and telecommunications providers should absolutely consider this approach.
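To make the "abstract cloud dependencies" point concrete, here is a minimal, hypothetical sketch in Python: the application codes against a small storage interface, with one implementation per cloud. The bucket, container, and connection values are placeholders, and a real abstraction layer would cover more than object storage.

```python
from abc import ABC, abstractmethod

import boto3                                       # AWS SDK
from azure.storage.blob import BlobServiceClient   # Azure SDK


class ObjectStore(ABC):
    """Provider-neutral interface the application codes against."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class S3Store(ObjectStore):
    def __init__(self, bucket: str):
        self._s3 = boto3.client("s3")
        self._bucket = bucket

    def put(self, key: str, data: bytes) -> None:
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()


class AzureBlobStore(ObjectStore):
    def __init__(self, connection_string: str, container: str):
        self._svc = BlobServiceClient.from_connection_string(connection_string)
        self._container = container

    def put(self, key: str, data: bytes) -> None:
        self._svc.get_blob_client(self._container, key).upload_blob(data, overwrite=True)

    def get(self, key: str) -> bytes:
        return self._svc.get_blob_client(self._container, key).download_blob().readall()


# The critical path only sees the interface; which cloud backs it
# becomes a deployment/runbook decision rather than a code change.
def archive_transaction(store: ObjectStore, tx_id: str, payload: bytes) -> None:
    store.put(f"transactions/{tx_id}.json", payload)
```

The point of the pattern is that switching providers for a critical function during an outage is a configuration or runbook decision, not an emergency rewrite.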

3. Continuously test recovery plans and cloud services 

DNS and caching redundancy 

DNS failed during the AWS outage, and Azure Front Door (whose traffic routing also relies on DNS) failed during the Microsoft incident.

What to do? 

  • Use low TTLs (Time To Live) for critical DNS records. This enables faster switching if endpoints must be changed
  • Maintain tested alternate DNS paths that you can quickly switch to during an incident
  • Be aware that reducing TTLs increases DNS query volume, which can create other issues

Important limitations: Upstream DNS resolvers and CDN caches may retain old responses regardless of your TTL, creating variability in client recovery time.
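To see how much of your TTL actually reaches clients, a small probe can compare the remaining TTL reported by different resolvers. The sketch below uses the third-party dnspython library with a placeholder record name; the public resolvers shown are only examples.

```python
import dns.resolver  # pip install dnspython

RECORD = "app.example.com"   # placeholder: your critical DNS record

# Public resolvers used as examples; add the resolvers your users typically hit.
RESOLVERS = {"Google": "8.8.8.8", "Cloudflare": "1.1.1.1"}

for name, ip in RESOLVERS.items():
    resolver = dns.resolver.Resolver()
    resolver.nameservers = [ip]
    try:
        answer = resolver.resolve(RECORD, "A")
        # The TTL reported here is what this resolver currently has cached,
        # which may differ from the authoritative value you configured.
        print(f"{name}: {answer.rrset.ttl}s remaining, "
              f"records={[r.to_text() for r in answer]}")
    except Exception as exc:  # NXDOMAIN, timeout, etc.
        print(f"{name}: lookup failed ({exc})")
```

Running this during a failover drill gives a realistic picture of how long clients behind different resolvers keep pointing at the old endpoint.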

Architectural patterns and “circuit breakers” 

Implement the "circuit breaker" architectural pattern: when a dependency begins to fail, the circuit "opens" and calls to that service are stopped rather than retried indefinitely. Studies suggest this pattern reduces error rates by 58% and improves system availability by 10%.

Concrete example: If Azure SQL Database becomes inaccessible, your circuit breaker detects this after a few failed attempts and stops sending requests. Your application switches to a degraded mode (read-only from cache, for example) instead of retrying indefinitely.
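A minimal, illustrative circuit breaker in Python might look like the sketch below. The thresholds and the database/cache functions are placeholder assumptions, not a reference to any specific library.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds before a trial call
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, fallback=None, **kwargs):
        # Open circuit: short-circuit to the fallback until the timeout expires.
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback(*args, **kwargs) if fallback else None
            self.opened_at = None  # half-open: allow one trial call through

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback(*args, **kwargs) if fallback else None
        else:
            self.failures = 0  # a healthy call resets the count
            return result


# Hypothetical usage: protect a database read with a cache-based fallback.
breaker = CircuitBreaker(failure_threshold=3, reset_timeout=60)

def read_orders_from_sql(customer_id):
    raise ConnectionError("database unreachable")  # simulated outage

def read_orders_from_cache(customer_id):
    return ["cached-order-1", "cached-order-2"]    # degraded, read-only data

print(breaker.call(read_orders_from_sql, "c-42", fallback=read_orders_from_cache))
```

Mature libraries and service meshes offer hardened versions of this pattern; the value of the sketch is simply to show that the degraded mode is an explicit, pre-decided code path rather than an accident.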

Regularly tested business continuity plans 

  • Identify critical functions: What are the 3 to 5 processes without which your business stops?
  • Evaluate dependencies: What infrastructure supports each function?
  • Define recovery/backup objectives:
    • RTO (Recovery Time Objective): maximum acceptable downtime
    • RPO (Recovery Point Objective): maximum acceptable data loss
  • Regular testing processes: simulations at least quarterly

Common mistake: many organizations create a recovery plan and file it away. Six months later, when they test it, they discover the plan is outdated or doesn’t work properly.
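As a small illustration of "test it, don't file it away", a scheduled check like the hypothetical sketch below can verify every day that the latest backup still meets your RPO and that the last restore drill stayed within your RTO (the objectives and example inputs are placeholders).

```python
from datetime import datetime, timedelta, timezone

# Hypothetical objectives agreed with the business.
RPO = timedelta(hours=1)     # maximum acceptable data loss
RTO = timedelta(minutes=30)  # maximum acceptable downtime

def meets_rpo(last_backup_at: datetime) -> bool:
    """True if the most recent backup is fresh enough to meet the RPO."""
    return datetime.now(timezone.utc) - last_backup_at <= RPO

def meets_rto(last_drill_duration: timedelta) -> bool:
    """True if the most recent restore drill completed within the RTO."""
    return last_drill_duration <= RTO

# Example inputs; in practice, pulled from your backup tool and drill reports.
last_backup_at = datetime.now(timezone.utc) - timedelta(minutes=45)
last_drill_duration = timedelta(minutes=42)

if not meets_rpo(last_backup_at):
    print("ALERT: latest backup is older than the RPO")
if not meets_rto(last_drill_duration):
    print("ALERT: last restore drill exceeded the RTO")
```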

Coordinated multi-region failover strategies 

Failover strategies must go beyond simple component redundancy:

  • Component-level failover: the most granular; each service has its own failover plan
  • Application-level failover: groups of applications fail over together, respecting dependencies
  • Dependency-graph failover: explicit mapping of dependencies; failover follows that order (see the sketch after this list)
  • Portfolio-level failover: the most coordinated approach, where the entire application portfolio fails over together
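For the dependency-graph approach, the failover order can be derived directly from a dependency map. The sketch below uses Python's standard-library graphlib (Python 3.9+) with hypothetical service names: each service lists what it depends on, and the topological order guarantees a dependency is failed over before anything that relies on it.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service -> the services it depends on.
dependencies = {
    "web-frontend": {"orders-api", "auth-service"},
    "orders-api": {"orders-db", "payments-api"},
    "payments-api": {"payments-db", "auth-service"},
    "auth-service": {"auth-db"},
    "orders-db": set(),
    "payments-db": set(),
    "auth-db": set(),
}

# static_order() yields dependencies before dependents, which is the order
# in which they must be failed over to the secondary region.
failover_order = list(TopologicalSorter(dependencies).static_order())
print(failover_order)
# e.g. ['orders-db', 'payments-db', 'auth-db', 'auth-service',
#       'payments-api', 'orders-api', 'web-frontend']
```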

Proactive monitoring and alerts 

  • Use network monitoring tools like ThousandEyes to detect anomalies at the cloud infrastructure level, not just within your applications
  • Configure alerts that activate not when things have already broken, but when abnormal signals begin to appear
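An early-warning probe does not need to be sophisticated, and it does not replace infrastructure-level monitoring such as ThousandEyes. The sketch below (endpoint URL, interval, and thresholds are placeholders) measures response time for a critical dependency and flags a sustained drift above the recent baseline before the endpoint actually fails.

```python
import statistics
import time
import urllib.request
from collections import deque

ENDPOINT = "https://status-probe.example.com/health"  # placeholder URL
BASELINE_WINDOW = 30   # number of recent samples forming the baseline
DRIFT_FACTOR = 3.0     # warn when latency exceeds 3x the baseline median

samples = deque(maxlen=BASELINE_WINDOW)

def probe_once() -> float:
    """Return response time in seconds (raises on hard failure)."""
    start = time.monotonic()
    with urllib.request.urlopen(ENDPOINT, timeout=5):
        pass
    return time.monotonic() - start

while True:
    try:
        latency = probe_once()
    except Exception as exc:
        print(f"ALERT: probe failed outright: {exc}")
    else:
        if len(samples) == BASELINE_WINDOW:
            baseline = statistics.median(samples)
            if latency > DRIFT_FACTOR * baseline:
                # Abnormal signal: the endpoint still answers, but much slower.
                print(f"WARNING: latency {latency:.2f}s vs baseline {baseline:.2f}s")
        samples.append(latency)
    time.sleep(10)  # probe interval
```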

From pessimism to action

The AWS and Microsoft Azure outages of October 2025 are not anomalies. They are predictable manifestations of the extreme complexity of cloud systems. They also show that resilience is not automatic: it is the result of intentional architectural choices, disciplined processes, and continuous investment in testing and recovery planning.

For cloud providers, this means strengthening change management and increasing testing. For organizations consuming these services, it means accepting an uncomfortable truth: no provider can guarantee zero downtime. Resilience comes from a combination of diversification, redundancy, automated failover, and regularly tested emergency plans.

As shown by the 2024 CrowdStrike experience and now the AWS and Azure outages, the industry is finally learning this lesson. The question now is not how to avoid outages, but how to survive them with minimal impact on your operations and your customers.

Without being fatalistic, let’s not forget that the cloud remains an incredible opportunity for new use cases and accelerators, supported by extremely robust providers.

Watch our On Décrypte video capsule by our Cloud Architecture, AI and Digital Transformation Leader, Thibault Blaizot. (French only)
