Post Mortem: When Amazon's Cloud Turned On Itself

For the cloud to be a permanent platform for enterprise computing, it can't be an environment where both computing and errors just occur on a larger scale.

The snafus in the cloud, it turns out, aren't so different from those occurring in the overworked, under-automated and undocumented processes of the average data center. According to Amazon's post mortem explanation of its recent hours-long outage, the failure was apparently triggered by a human error.

If so, processes susceptible to human error are not going to be good enough in the future, if the cloud is going to be a permanent platform for enterprise computing.

The cause of Amazon's recent outage, which would have been more of a disaster than it was but for the low Easter holiday traffic, was the result of a configuration error in a scheduled network update. The change was attempted in the middle of the night--at 3:47 a.m. in Northern Virginia-"as part of our normal scaling activity," according to the official explanation. That sounds like the EC2 data center was anticipating the start of early morning activity, where big customers such as Bizo or Reddit start refreshing hundreds of websites in preparation to meet the day's earliest readers.

The primary network serving one of the four availability zones in EC2's U.S. East-1 data center needed more network capacity. The attempt to provide it mistakenly shifted the traffic off a primary network onto a secondary and lower bandwidth network used for backup purposes. This is a change that has been probably correctly implemented thousands of times. It's the kind of error an operator could makes as a wrong choice on a menu or the entry of the name of the last network worked on instead of the one needed. In short, it was a human error that's all too likely to occur with anyone momentarily preoccupied with the price of mangoes or a flare up with a spouse.

However, I thought the Amazon Web Services cloud used more automated procedures than that. I thought clearly obvious errors had been anticipated and worked through, with defenses in place. Two lines of logic, checking the operator's decision, would have halted him in his tracks. A simple network configuration error should not be the source of a monumental hit to confidence in cloud computing. But apparently it is.

What happened next is not so different from what we speculated in Cloud Takes A Hit; Amazon Must Fix EC2 a week ago, based on the cryptic postings on the Services Health Dashboard. Eight minutes after the change marked the start of what the Amazon Service Health Watch dashboard described as "a networking event." The misconfiguration choked the backup network, which caused "a large number of EBS nodes in a single EBS cluster lost connection to their replicas."

An EBS cluster is servers and disk serving as short-term storage for running workloads in a given availability zone. The preceding description doesn't sound like much of an event, but in the cloud, it triggers a massive response. Suddenly large sets of data no longer knew whether their backup copy still existed on the cluster, and a central tenet of the cluster's operation is that a backup copy is always available--in case of a hardware failure.

The networking error in itself was relatively minor and easily rectified. But the error set up a massive "re-mirroring storm," a new and valuable addition to computing lexicon's already long list of disaster terms. So many Elastic Block Store volumes were trying to find disk space on which to recreate themselves that when they failed to find it, they aggressively tried again, tying up disk operations in a zone. You get the picture.

Enterprise cloud adoption has evolved to the point where hybrid public/private cloud designs and use of multiple providers is common. Who among us has mastered provisioning resources in different clouds; allocating the right resources to each application; assigning applications to the "best" cloud provider based on performance or reliability requirements.