A Single Automated Process Broke Dozens of Major Services. Inside the AWS us-east-1 Outage.
At 7:30 AM PST on December 7, 2021, an automated scaling activity inside AWS triggered a cascade that took down DoorDash, Netflix, Disney+, and dozens of other services for hours. The root cause wasn't a cyberattack. It wasn't a hardware failure. It was AWS's own automation doing exactly what it was designed to do, at a scale nobody anticipated.

This outage is one of the most instructive failures in recent cloud history. Not because it was the biggest, but because it exposes a tension that I keep running into after 14+ years of building and operating systems on cloud infrastructure: the automation we depend on for reliability can become the single point of failure that destroys it.
What Actually Happened: The Chain Reaction
To understand this outage, you need to understand a piece of AWS's internal architecture that most people never think about. AWS runs two networks. The main AWS network is where customer workloads and most AWS services live. Behind it sits an internal network that hosts foundational services: monitoring, internal DNS, authorization, and parts of the EC2 control plane.

These two networks are connected by networking devices that handle routing and network address translation. According to AWS's official post-mortem, at 7:30 AM PST, an automated activity designed to scale capacity on the main network "triggered an unexpected behavior from a large number of clients inside the internal network." That behavior caused a massive surge of connection activity that overwhelmed the networking devices bridging the two networks.
Then the real damage started. The congestion caused latency and errors for services trying to communicate across the boundary. Those services responded the way any well-designed client would: they retried. And retried. And retried. Each retry added more load to the already overwhelmed devices. Textbook retry storm. The system didn't just fail. It actively made itself worse.
I've watched this exact pattern play out at smaller scales in my own work. Retry logic is essential for resilient distributed systems. But without proper backoff, jitter, and circuit breaking, retries become an amplifier for failure. AWS's internal clients apparently lacked sufficient safeguards against this scenario. When the bridge between networks started choking, every retry was another brick on the pile.
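The mitigation for this pattern is well known even if it's unevenly applied: exponential backoff with jitter. Here's a minimal sketch in Python (the function name and parameters are my own, not anything from AWS's systems). The jitter is the part people skip, and it's the part that matters most here: without it, thousands of clients that failed at the same instant will all retry at the same instant, re-creating the spike they're retrying around.

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base=0.1, cap=10.0):
    """Retry a flaky call with capped exponential backoff and full jitter.

    `base` is the initial backoff window in seconds; `cap` bounds the
    window so late attempts don't wait unboundedly long.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            # Full jitter: sleep a random amount within the capped
            # exponential window, spreading retries across time so
            # correlated failures don't produce correlated retries.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

A circuit breaker on top of this (failing fast once a dependency is clearly down) is what turns "polite retries" into actual protection for the overwhelmed service.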
Flying Blind: When Monitoring Goes Down With the Ship
The most damaging part of this outage wasn't the initial congestion. It was what the congestion took down with it.

Because AWS's monitoring infrastructure lived on the same internal network that was being overwhelmed, the operations team lost visibility into what was happening. Their real-time dashboards went dark. The people responsible for fixing the problem couldn't see the problem.
When your monitoring depends on the same infrastructure it's supposed to monitor, you don't have monitoring. You have a shared fate dependency.
This is the kind of architectural coupling that looks fine on paper until it doesn't. AWS's engineers were forced to fall back on logs. Diagnosing a massive distributed systems failure by reading text files. As Simon Sharwood at The Register noted, the "automation that runs the automated cloud fell over," and the humans left to clean up the mess were working without their primary tools.
The team initially identified elevated internal DNS errors and focused on rerouting that traffic away from the congested paths. According to the AWS post-mortem, this DNS remediation was completed at 9:28 AM PST, roughly two hours in. DNS resolution recovered, which helped some services, but the underlying congestion persisted. And monitoring data was still gone.
I've been in incident rooms where observability degrades during the incident itself. It's genuinely terrifying. You're making decisions on incomplete data, guessing at root causes, hoping your remediation doesn't make things worse. The lesson isn't new, but AWS demonstrated it at a scale that's hard to wave away: your monitoring plane must be independent of your data plane. Full stop.
The Blast Radius Problem: Why us-east-1 Is Different
The outage didn't just hit services running in us-east-1. It rippled outward because of how AWS's architecture concentrates critical control planes in that region.
us-east-1 isn't just another AWS region. It's the oldest, the largest, and the one where many global control plane operations are anchored. When the internal network in us-east-1 went down, it didn't just take out EC2 instances in Virginia. It hit the AWS Management Console, parts of IAM, CloudWatch, and other global services with dependencies on that region's internal infrastructure.
AWS acknowledged this in the post-mortem, noting the failure occurred on the internal network hosting "foundational services including monitoring, internal DNS, authorization services, and parts of the EC2 control plane." The community response was swift and pointed: concentrating this many critical dependencies in a single region creates an unacceptable blast radius, no matter how well-provisioned it is.
I've written about this dynamic before in the context of geopolitical risk and cloud region decisions. The assumption that a single region can be made reliable enough to anchor global services is a bet that works until it doesn't. And when it doesn't, the blast radius isn't proportional to the region. It's proportional to everything that depends on it.
The downstream impact was brutal. DoorDash drivers couldn't receive orders. Netflix and Disney+ experienced disruptions. Amazon's own retail operations, including Alexa and Ring, were affected. The outage lasted roughly seven hours total, with different services recovering at different rates as AWS engineers methodically worked through the congestion without their normal tooling.
Automation as a Double-Edged Sword
Here's the uncomfortable part of this outage, the part that gets glossed over in most of the coverage: it was caused by automation doing its job.
The automated scaling activity that triggered the cascade wasn't a rogue process or a misconfiguration. It was a routine capacity scaling operation that exposed an unexpected behavior in internal clients. The system was working as designed. The design just didn't account for this interaction at this scale.
Think about the irony for a second. AWS sells automation as a core value proposition. Auto-scaling, self-healing infrastructure, managed services that abstract away operational complexity. And it was AWS's own internal automation that triggered the most significant outage of 2021. The mechanism meant to improve reliability became the vector for failure.
This isn't a reason to abandon automation. Obviously. But it's a reason to think much harder about how automated systems interact with each other at scale. I've shipped enough distributed systems to know that the most dangerous failures aren't the ones you can predict. They're the emergent behaviors that show up when multiple automated systems interact in ways nobody modeled.
As the industry pushes toward more autonomous AI-driven systems, this tension is only going to get worse. We're stacking layers of automation on top of other layers of automation. Each layer is individually well-tested and reliable. But the interactions between layers create a combinatorial explosion of possible failure modes that no amount of unit testing will catch.
What AWS Changed (and What You Should Change)
To their credit, AWS's post-mortem was detailed and included specific remediation steps. They committed to adding additional capacity to the networking devices connecting the internal and main networks, implementing better traffic management to prevent retry storms, and decoupling their monitoring systems from the infrastructure they monitor.
The bigger lessons apply to anyone building on cloud infrastructure.
Test your failure modes, not just your features. This outage wasn't caused by a bug in the traditional sense. It was an emergent behavior from the interaction between a scaling process and internal network clients. If you're not running chaos engineering experiments that test how your systems behave when dependencies degrade, you're hoping instead of engineering.
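You don't need a chaos engineering platform to start. A minimal sketch of the idea, using names I've made up for illustration: wrap a dependency call so that a configurable fraction of invocations fail, then assert in a test that your fallback path actually works.

```python
import random

def chaos_wrap(fn, failure_rate=0.3, seed=None):
    """Wrap a dependency call so a fraction of invocations raise,
    for fault-injection tests (not for production traffic).

    `seed` makes the injected failures reproducible in CI.
    """
    rng = random.Random(seed)

    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise TimeoutError("injected fault")
        return fn(*args, **kwargs)

    return wrapped
```

The test then becomes: with the dependency failing 100% of the time, does the caller degrade to its cache or default, or does it fall over? If you can't answer that in a test, you'll answer it in an incident.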
Make your observability plane independent. If your monitoring shares infrastructure with the thing it monitors, it will fail exactly when you need it most. Doesn't matter if you're on AWS, GCP, Azure, or bare metal. Your ability to see problems must survive the problems themselves.
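The simplest version of an independent observability plane is a black-box probe that runs somewhere else entirely: a different cloud, a different region, an on-prem box. A rough sketch of what I mean (the function and its fields are hypothetical, and where you ship the result matters more than the probe itself):

```python
import time
import urllib.request

def probe(url, timeout=5.0):
    """Black-box HTTP probe meant to run OUTSIDE the infrastructure
    it checks. Ship the result to a separate alerting path that
    shares nothing with the monitored stack.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except Exception:
        # Any failure (DNS, connect, timeout, 5xx) counts as down;
        # the probe's job is visibility, not diagnosis.
        ok = False
    return {"url": url, "ok": ok, "latency_s": round(time.monotonic() - start, 3)}
```

It's crude, but a crude probe that survives the outage beats a beautiful dashboard that goes dark with it.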
Treat retry logic as a potential weapon. Every retry without exponential backoff and jitter is a potential contributor to a storm. After shipping enough distributed services, I've learned that the default retry behavior in most HTTP clients is dangerously aggressive. Configure it explicitly. Add circuit breakers. Test what happens when thousands of clients retry simultaneously. You will not like the results.
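A circuit breaker is the complement to backoff: once a dependency has clearly failed, stop calling it at all for a while. Here's a deliberately minimal sketch (the class and thresholds are illustrative, not any particular library's API); production systems should reach for a maintained implementation, but the state machine is this simple at its core:

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, fail fast for
    `cooldown` seconds instead of hammering the dependency."""

    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                # Open: reject immediately, shedding load from the
                # struggling dependency instead of adding to it.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one probe through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success closes the circuit
        return result
```

Note what the open state buys you: the overwhelmed dependency gets breathing room to recover, which is exactly what AWS's congested networking devices never got.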
Question single-region dependencies for global services. If your architecture has global services anchored in a single region, you've accepted a blast radius that extends far beyond that region's borders. That can be a legitimate engineering tradeoff, but make sure it's a conscious decision and not an accident of history. I dug into this more when looking at why your cloud region is a geopolitical decision, and the same logic applies to reliability planning.
What This Really Means
Every piece of this system was working correctly. The automation was doing what it was supposed to do. The clients were retrying as configured. The monitoring was running on infrastructure provisioned for high availability. And the interaction between all of these correct behaviors produced a seven-hour outage that affected millions of users.
This is one of those things where the boring answer is actually the right one: invest in understanding how your systems interact under stress. Build independent observability. Treat every automated process as a potential source of cascading failure. Don't assume that because each piece works individually, the whole system works together.
The next major cloud outage won't be caused by a server catching fire. It'll be caused by automation doing exactly what it was told to do. The question is whether you'll be able to see it happening when it does.


