Amazon’s Christmas Eve Outage Teaches Recovery Lessons

Amazon’s explanation for the problem that took down Netflix and other sites on Christmas Eve: human error.

The Web giant blamed an unnamed developer who ran a maintenance process against state data used by the company’s Elastic Load Balancers, or ELBs. That mistake cascaded into other areas. At its peak, 6.8 percent of the company’s ELBs were affected, which might not sound like a lot, but each of those ELBs was balancing traffic across multiple servers.

Netflix was forced to apologize for the outage, publicly pinning the blame on AWS infrastructure. That was small consolation to anyone seeking to escape from family holiday duties with a streaming marathon of American Horror Story.

Amazon also apologized. “We know how critical our services are to our customers’ businesses, and we know this disruption came at an inopportune time for some of our customers,” it wrote. “We will do everything we can to learn from this event and use it to drive further improvement in the ELB service.”

Amazon’s U.S. East region has been bitten by several small outages over the past several months. A June 2012 electrical storm, for example, knocked high-profile clients such as Instagram and Netflix offline. Amazon’s other U.S. data centers, including ones in Oregon and California, haven’t suffered from widespread outages.

The Problem

The service disruption began at 12:24 PM PST on December 24th, when the aforementioned developer accidentally triggered a maintenance program that erased state data used to manage the region’s load balancers. That generated a high number of API errors, and, in an odd twist, customers could still create and manage new load balancers, just not the ones that had been created before the data loss.

“During this event, because the ELB control plane lacked some of the necessary ELB state data to successfully make these changes, load balancers that were modified were improperly configured by the control plane,” Amazon wrote. “This resulted in degraded performance and errors for customer applications using these modified load balancers.”
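To make the failure mode concrete, here is a minimal sketch (hypothetical code, not Amazon’s internals) of why a control plane that merges requested changes into stored state emits a broken configuration once that state has been erased:

```python
# Hypothetical sketch of the failure mode described above: a control
# plane merges a requested change into stored ELB state. When the
# state record has been deleted, it silently treats the record as
# empty, so the resulting config "forgets" the ELB's old backends.
# All names here are illustrative assumptions.

state_store = {}  # ELB state data; the maintenance process wiped it

def modify_load_balancer(elb_id, change):
    # Missing state is treated as a fresh, empty record.
    current = state_store.get(elb_id, {"backends": []})
    current.update(change)
    state_store[elb_id] = current
    return current

# A pre-existing ELB whose record was erased: a routine modification
# now yields a config with no backends registered.
config = modify_load_balancer("elb-legacy", {"idle_timeout": 60})
print(config)  # {'backends': [], 'idle_timeout': 60}
```

Only modified load balancers were corrupted this way, which matches Amazon’s account: untouched ELBs kept running on their last-known-good configuration.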

Amazon disabled several ELB control plane workflows at 5:28 PM Christmas Eve, and worked through the night to manually bring back some of the affected ELBs. Amazon also attempted an automated restore of the ELBs to their state just before the outage, which would have solved the problem outright, but it could not produce a workable snapshot of the data; an alternate solution eventually succeeded. It was 12:05 PM PST on Christmas Day before the service returned to normal.
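Point-in-time recovery of the kind Amazon attempted depends on having a consistent snapshot taken before the damaging event. A minimal sketch, with illustrative data structures that are assumptions rather than AWS internals:

```python
# Hypothetical sketch of point-in-time recovery: pick the newest
# snapshot of the state store taken strictly before the data loss.
# The snapshot list and timestamps are illustrative assumptions.
import bisect

snapshots = [  # (timestamp, full copy of state data), kept sorted
    (100, {"elb-a": {"backends": ["i-1"]}}),
    (200, {"elb-a": {"backends": ["i-1", "i-2"]}}),
]

def restore_before(event_time):
    """Return the newest snapshot taken strictly before event_time."""
    times = [t for t, _ in snapshots]
    idx = bisect.bisect_left(times, event_time) - 1
    if idx < 0:
        raise RuntimeError("no usable snapshot before the event")
    return snapshots[idx][1]

state_store = restore_before(250)        # data loss happened at t=250
print(state_store["elb-a"]["backends"])  # ['i-1', 'i-2']
```

If no snapshot predating the event is usable, as happened here, recovery falls back to slower manual reconstruction, which is exactly the overnight slog Amazon described.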

Lessons Learned

Amazon’s mea culpa highlights two areas in which the company can improve: access to its infrastructure, and disaster recovery (even if that disaster was self-inflicted).

Data center operators running a private cloud will undoubtedly get a bit of a chuckle from Amazon’s woes; although companies operating a private cloud must bear the costs of infrastructure and deployment, in theory they have the ability to manage access in a way that Amazon does not. Amazon said that is one of the practices it will change: access to production ELB state data will be limited to prevent inadvertent modification without specific Change Management (CM) approval. The manual maintenance processes involved are also transitioning to automated tooling under Amazon’s direct control.
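The access-control fix amounts to gating writes against production state behind an approval check. A minimal sketch, assuming a simple ticket set standing in for a real Change Management system (all names hypothetical):

```python
# Hypothetical sketch of CM-gated writes: mutations to production
# state data require an approved Change Management ticket. This is
# purely illustrative, not AWS's actual mechanism.

approved_changes = {"CM-1234"}  # tickets approved via change management

def write_state(store, key, value, cm_ticket=None):
    """Write to production state only under an approved CM ticket."""
    if cm_ticket not in approved_changes:
        raise PermissionError(
            f"write to {key!r} requires an approved CM ticket")
    store[key] = value

prod_state = {}
write_state(prod_state, "elb-7", {"backends": ["i-9"]},
            cm_ticket="CM-1234")
# write_state(prod_state, "elb-7", {})  -> raises PermissionError
```

The design point is that the check sits in the write path itself, so an operator running an ad-hoc maintenance script cannot bypass it by accident.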

Amazon also tacitly acknowledged that its recovery strategy could have been better implemented. However, the company said it had learned from its mistake. “We believe that we can reprogram our ELB control plane workflows to more thoughtfully reconcile the central service data with the current load balancer state,” it said. “This would allow the service to recover automatically from logical data loss or corruption without needing manual data restoration.”
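The reconciliation idea Amazon describes can be sketched as follows: when a central record is missing, rebuild it from the load balancer’s observed running state rather than waiting on a manual restore. A hypothetical illustration with invented names:

```python
# Hypothetical sketch of state reconciliation: fill gaps in the
# central service data from the live fleet's observed state, so the
# service recovers from logical data loss without a manual restore.
# Names and structures are illustrative assumptions.

def reconcile(central, running):
    """Recreate missing central records from observed running state."""
    recovered = []
    for elb_id, observed in running.items():
        if elb_id not in central:
            central[elb_id] = observed  # trust the live LB's own state
            recovered.append(elb_id)
    return recovered

central = {"elb-a": {"backends": ["i-1"]}}          # after data loss
running = {"elb-a": {"backends": ["i-1"]},
           "elb-b": {"backends": ["i-2", "i-3"]}}   # live fleet
print(reconcile(central, running))  # ['elb-b']
```

Run periodically, a loop like this makes the authoritative store self-healing: corruption or loss in the central data converges back to what the fleet is actually doing.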

As the operator of arguably the highest-profile public cloud, Amazon has its services closely scrutinized. But even the smallest data-center provider can take away some key lessons, not the least of which is that disaster-recovery strategies need to be as fine-grained, and as fine-tuned, as possible.