Amazon's Dec. 24th Outage: A Closer Look

On Christmas Eve, Amazon Web Services experienced an outage at its Northern Virginia data center. In a prompt follow-up, it issued an explanation on Dec. 29, apologized to customers and said it wouldn't happen again. It was the fourth outage of the year in its most heavily trafficked data center complex.

Explanations in the press of what happened, based on the Dec. 29 statement, were relatively brief. The Wall Street Journal, for example, stated that Amazon spokesmen blamed the outage "on a developer who accidentally deleted some key data ... Amazon said the disruption affected its Elastic Load Balancing Service, which distributes incoming data from applications to be handled by different computing hardware."

To an IT manager thinking of using Amazon, that leaves as much unexplained as explained. A developer disrupted running production systems? Development and production are kept distinctly separate in enterprise data centers for exactly the reason demonstrated in the Dec. 24 outage. The developer, Amazon took pains to explain, was "one of a very small number of developers who have access to this production environment." Amazon is a large organization with many developers; how many developers had access?

The developer launched a maintenance process against the running production system which deleted the state information needed by load balancers. "Unfortunately, the developer did not realize the mistake at the time. After this data was deleted, the ELB control plane began experiencing high latency and error rates for API calls to manage ELB load balancers," the Amazon team's statement said.

The cloud promises greater efficiency than enterprise data centers because it offers both a more uniform and more automated environment. However, when something unexpected goes wrong, as Amazon customers saw in the 2011 Easter weekend "remirroring storm," automation takes over and can amplify the error. That started to happen Christmas Eve around 12:30 p.m. Pacific time.

The AWS troubleshooters spotted the error rates for API calls, but a larger underlying problem was developing out of sight. When a customer sought to modify a load balancer configuration, the Elastic Load Balancer control plane needed the state information that had been deleted. "Load balancers that were modified (by customers) were improperly configured by the control plane. This resulted in degraded performance and errors for customer applications," and the problem began to spread.

The AWS troubleshooters noticed that more load balancers were showing increased error rates and realized some sort of infection was spreading. It didn't affect newly created load balancers, only those that had been operating prior to the developer's maintenance procedure. They dug "deeply into these degraded load balancers (and) identified the missing ELB state data."

At that point, it became a containment and recovery problem. After 4.5 hours of disruption, with 6.8% of load balancers affected, the team disabled the control plane workflows that could spread the problem. Other running load balancers couldn't scale up or be modified by customers, a serious setback on the final day of the Christmas shopping season. Netflix customers who wished to spend Christmas Eve watching "It's A Wonderful Life" or "Miracle on 34th Street" found they weren't able to access the films.

"The team was able to manually recover some of the affected running load balancers" by that evening and "worked through the night to try to restore the missing ELB state data." But the initial effort went awry: it consumed several more hours and "failed to provide a usable snapshot of the data."

A second recovery attempt worked. At 2:45 a.m. Pacific Dec. 25, or more than 14 hours after the disruption started, the missing state data was re-established, but even this was a near thing. The recovery occurred "just before the data was deleted," the Amazon statement acknowledged. The troubleshooters merged the state data back into the control plane "carefully" to avoid disrupting any running load balancers. By 10:30 a.m. Pacific Dec. 25, 22 hours after its start, most load balancers were back in normal operation.

AWS continued to monitor its running load balancers closely and waited until 12:05 p.m. Pacific before announcing that operations were back to normal.
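The elapsed times quoted above can be checked with a few lines of arithmetic. The sketch below uses only the clock times given in this article; the year 2012 is inferred from the article's other dates and does not affect the result, and the milestone labels are illustrative.

```python
from datetime import datetime

# Clock times as reported in the article, all Pacific time.
# The year (2012) is an inference from the article's other dates.
start = datetime(2012, 12, 24, 12, 30)

milestones = {
    "state data restored": datetime(2012, 12, 25, 2, 45),
    "most load balancers normal": datetime(2012, 12, 25, 10, 30),
    "all-clear announced": datetime(2012, 12, 25, 12, 5),
}


def hours_since_start(moment):
    """Return the hours elapsed between the start of the outage and `moment`."""
    return (moment - start).total_seconds() / 3600


elapsed = {name: hours_since_start(t) for name, t in milestones.items()}

for name, hours in elapsed.items():
    print(f"{name}: {hours:.2f} hours after start")
```

The first two figures agree with the "more than 14 hours" and "22 hours" cited above; the all-clear came roughly 23 and a half hours after the disruption began.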

There was a greater degree of transparency into this event than in some previous AWS outages. Immediately after a 2011 power outage at its Dublin, Ireland, data center, Amazon officials stated the local power utility said a lightning strike had been responsible. As it turned out, the utility later reported no strike ever occurred. In Amazon's Dec. 29 statement, the human error is there for all to see, along with the fits and jerks of the response. In this explanation, Amazon's response more closely resembles the standard set by Microsoft in its explanation of its own Windows Azure Leap Day bug Feb. 29 last year.

Even so, potential cloud users will frown on the fact that some developers have some access to running EC2 production systems. "We have made a number of changes to protect the ELB service from this sort of disruption in the future," the AWS team stated.

Normally a developer has one-time access to run a process. The developer in question had a more persistent level of access, which Amazon is revoking to make each case subject to administrative approval, controlled by a change management process. "This would have prevented the ELB state data from being deleted. This is a protection we use across all of our services that has prevented this sort of problem in the past, but was not appropriately enabled for this ELB state data."
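Amazon did not publish how its change management controls are built. As a rough illustration of the idea only, the sketch below refuses to run a maintenance action unless an approved ticket covers that exact operation and target. Every name in it (ChangeTicket, ApprovalRequired, run_maintenance) is hypothetical and not part of any AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeTicket:
    """Hypothetical record of a per-case administrative approval."""
    ticket_id: str
    operation: str
    target: str
    approved: bool


class ApprovalRequired(Exception):
    """Raised when an action is attempted without a matching approval."""


def run_maintenance(operation, target, ticket, action):
    """Run `action` only if an approved ticket covers this operation and target."""
    if ticket is None or not ticket.approved:
        raise ApprovalRequired(
            f"{operation} on {target} needs an approved change ticket"
        )
    if (ticket.operation, ticket.target) != (operation, target):
        raise ApprovalRequired("ticket does not cover this operation and target")
    return action()
```

Under this scheme, a developer holding a ticket approved for a test environment could not reuse it against production, which is the distinction between one-time and persistent access that the statement draws.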

So the official explanation says an unexpected event occurred but it won't occur elsewhere, due to protections already in place. The team said it could now reprogram the control plane workflows "to more thoughtfully reconcile current load balancer state" with a central service running the cloud.
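What a "more thoughtful" reconciliation might look like is not spelled out in the statement. One plausible reading, offered here purely as an assumption, is that a workflow should leave a running load balancer alone when its stored record is missing, rather than reconfiguring it from incomplete data. The function and parameter names below are illustrative, not Amazon's.

```python
def reconcile(running, recorded):
    """Compare running load balancers against the control plane's stored records.

    running:  dict of load balancer id -> configuration currently in effect
    recorded: dict of load balancer id -> configuration the control plane stored

    Returns (to_update, to_hold): configurations that are safe to push, and
    ids that should be left untouched because their stored record is missing.
    """
    to_update = {}
    to_hold = []
    for lb_id, live_config in running.items():
        if lb_id not in recorded:
            # No stored state: hold for an operator instead of guessing.
            to_hold.append(lb_id)
        elif recorded[lb_id] != live_config:
            # Stored state exists and differs: safe to apply the record.
            to_update[lb_id] = recorded[lb_id]
    return to_update, to_hold
```

In the Christmas Eve scenario, load balancers whose state had been deleted would land in the hold list and keep running on their existing configuration instead of being improperly reconfigured.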

Both the solution and the choice of words describing it illustrate that cloud operations are complex and service providers have attempted to think of everything. That they couldn't do so at this stage of cloud computing is illustrated by the disruption and the "more thoughtful" approach that followed it.

"We want to apologize. We know how critical our services are to our customers' businesses, and we know this disruption came at an inopportune time. We will do everything we can to learn from this event and use it to drive further improvement in the ELB service."

In the past Amazon's explanations of what's gone wrong have been terse to the point of being difficult to comprehend. The apologetic note following a clear, technical explanation parallels the pattern set in Microsoft's Leap Day event.

Bill Laing, Microsoft's corporate VP for servers and Azure, wrote after the Leap Day incident: "We know that many of our customers were impacted by this event. We sincerely apologize for the disruption, downtime, and inconvenience this incident has caused. Rest assured that we are already hard at work using our learnings to improve Windows Azure."

The idea that cloud computing requires transparency, particularly when something goes wrong, is catching on and may yet become a standard of operation across service providers. Microsoft is moving toward offering more infrastructure as a service on top of Azure's platform as a service, a form of computing more oriented toward developers. Infrastructure as a service needs to attract enterprise workloads, and to do so, must establish enterprise trust. Amazon, despite the outage, is trying to move in that direction.

Well, I always enjoy Mr. Babcock's writing and this article is additional proof that he is one of the best in the business. As for AWS, no other company has done more to advance cloud computing than Amazon. With 70 percent of the market, AWS has become the gold standard in cloud computing. AWS could also be getting too big and growing too quickly, which could account for some of the mishaps they have experienced in their US East center in northern Virginia. Operating and maintaining a warehouse-scale computing system calls for higher levels of system intelligence in order to correctly identify when the system needs to take corrective action to protect itself. So far we have seen both EBS and ELB fail to "understand" what had happened and take the proper corrective action in response to human errors. In fact, the system's response made the problems worse in both of these situations. Amazon has a lot of bright people working there. The situation can be improved and it will have to be improved if AWS intends to remain in business because the competition is out there.

We learn that human error is the ultimate cause of the failure, and that processes and procedures that should have been in place to protect against this kind of mistake weren't implemented. That can't be reassuring to federal agencies in Washington DC and northern Virginia that presumably are (or would be) served by this same data center. The White House's "cloud first" policy requires agencies to ramp up their use of cloud services, yet one of the leading cloud service providers suffered at least two local outages in 2012. At the same time, the feds are closing hundreds of their own data centers, so they may have no choice but to use commercial services/centers like Amazon's. Among the lessons learned: government IT teams had better devote adequate time and attention to SLAs, cloud architecture, and redundancy.

Enterprise cloud adoption has evolved to the point where hybrid public/private cloud designs and the use of multiple providers are common. Who among us has mastered provisioning resources in different clouds, allocating the right resources to each application, and assigning applications to the "best" cloud provider based on performance or reliability requirements?