Post Mortem: When Amazon's Cloud Turned On Itself

For the cloud to be a permanent platform for enterprise computing, it can't be an environment where both computing and errors just occur on a larger scale.

In building high availability into cloud software, we've escaped the confines of hardware failures that brought running systems to a halt. In the cloud, the hardware may fail and everything else keeps running. On the other hand, we've discovered that we've entered a higher atmosphere of operations and a larger plane on which potential failures may occur.

The new architecture works great when only one disk or server fails, a predictable event when running tens of thousands of devices. But the solution itself doesn't work if it thinks hundreds of servers or thousands of disks have failed all at once, taking valuable data with them. That's an unanticipated event in cloud architecture because it isn't supposed to happen. Nor did it happen last week. But the governing cloud software thought it had, and triggered a massive recovery effort. That effort in turn froze EBS and Relational Database Service in place. Server instances continued running in U.S. East-1, but they couldn't access anything and more servers couldn't be initiated. For all practical purposes, the cloud ceased functioning in one of its availability zones for over 12 hours.

The accounts that I have paid the most attention to in the aftermath have been those whose operations didn't fail, despite the Amazon architecture's breakdown. Accounts like the one from Donnie Flood, VP of engineering at Bizo, or Oren Michels, CEO of the Mashery. Jesse Lipson, CEO of ShareFile, an original EC2 beta customer in 2008 and still a customer, told me, "We're pretty paranoid about betting on any company, even if it's Amazon." His firm invoked the option of redirecting its traffic to Amazon's West Coast data center when it found its servers failing. ShareFile, which supplies a file sharing and storage service to businesses, maintains its own "heartbeat" monitoring system for its servers, and the system detected ShareFile servers disappearing after the "network event" in EC2. The system automatically shifted ShareFile traffic toward the servers that were in the West Coast data center.
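To make the heartbeat idea concrete, here is a minimal sketch of how such a monitor might decide when to shift traffic. It is not ShareFile's actual system, whose internals weren't described to me. The names HeartbeatMonitor and pick_region, the region labels, and the ten-second timeout are all hypothetical, chosen only for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Records the last heartbeat seen from each server, keyed by region."""
    timeout: float  # seconds of silence before a server is treated as gone
    last_seen: dict = field(default_factory=dict)  # (region, server) -> timestamp

    def beat(self, region, server, now=None):
        # Record a heartbeat; `now` can be supplied to make testing deterministic.
        stamp = time.monotonic() if now is None else now
        self.last_seen[(region, server)] = stamp

    def healthy(self, region, now=None):
        # A server is healthy if its last heartbeat falls within the timeout.
        now = time.monotonic() if now is None else now
        return [
            server
            for (r, server), seen in self.last_seen.items()
            if r == region and now - seen <= self.timeout
        ]


def pick_region(monitor, preferred, fallback, min_healthy=1, now=None):
    """Route to the preferred region unless too few of its servers respond."""
    if len(monitor.healthy(preferred, now)) >= min_healthy:
        return preferred
    return fallback


# Hypothetical usage: the East server goes silent while West keeps beating.
monitor = HeartbeatMonitor(timeout=10.0)
monitor.beat("east", "web-1", now=0.0)
monitor.beat("west", "web-2", now=0.0)
monitor.beat("west", "web-2", now=20.0)
target = pick_region(monitor, "east", "west", now=25.0)  # east has gone quiet
```

The point of the design is that the customer, not the provider, decides when a region counts as down, which is what kept firms like ShareFile online.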

I think Amazon itself should have a traffic-shifting system that reroutes the bulk of customer traffic when an availability zone or whole data center is no longer available. It should shift it, as individual customers did, from East to West, degrading service no doubt, but keeping customers online. Lipson points out, however, that linking data centers might allow the harm to spread. Inside the Northern Virginia data center, the trouble spread like a contagion across availability zones, which are subdivisions of the data center that are supposed to operate independently. Backup measures that worked in individual cases or across a small set cascaded out of control when invoked on a scale that had previously been unanticipated.

Despite that risk, I still think Amazon must link data centers, but it must also include a circuit breaker that queues up traffic or shunts it away if it turns into a threat to the functioning facility. Within a data center, availability zones need to be, well, available, even if there is trouble in one of them. I think that means architecting services so that they operate in some isolation in one zone from troubles in another. As it was, the EBS and RDS services operated across availability zones, and freezing them in one froze them in all.
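What might such a circuit breaker look like? The sketch below is one simple reading of the idea, not anything Amazon has described: a healthy facility admits redirected traffic up to a fixed capacity, queues a bounded overflow, and shunts away the rest. The class name TrafficBreaker and its capacity and queue_limit parameters are hypothetical.

```python
from collections import deque


class TrafficBreaker:
    """Admits requests up to capacity, queues a bounded overflow, sheds the rest."""

    def __init__(self, capacity, queue_limit):
        self.capacity = capacity        # requests the facility can serve at once
        self.queue_limit = queue_limit  # how many more may wait their turn
        self.in_flight = 0
        self.queue = deque()

    def offer(self, request):
        # Serve immediately while there is spare capacity.
        if self.in_flight < self.capacity:
            self.in_flight += 1
            return "admitted"
        # Otherwise hold the request if the waiting line has room.
        if len(self.queue) < self.queue_limit:
            self.queue.append(request)
            return "queued"
        # Beyond that, turn the request away to protect the facility.
        return "shunted"

    def complete(self):
        # One request finished. If something is waiting, it takes the freed
        # slot, so in_flight stays the same; otherwise the slot is released.
        if self.queue:
            return self.queue.popleft()
        if self.in_flight > 0:
            self.in_flight -= 1
        return None


# Hypothetical usage: four requests arrive at a facility that can serve two.
breaker = TrafficBreaker(capacity=2, queue_limit=1)
outcomes = [breaker.offer(n) for n in range(4)]
```

The bound on the queue matters as much as the bound on capacity. An unbounded queue would simply move the overload from the servers to the waiting line.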

All of this is much easier said than done when operating on the scale and complexity of Amazon's EC2. Amazon has done such a good job of pioneering the cloud that there is an immense reservoir of faith among its customers that it will eventually get it right. No one I've talked to says they're willing to switch. Cloud computing may have had a setback, but it will make a quick comeback. There is a widespread belief that when it does, it will be better. Still, it remains to be said: Amazon has got to do better than this. It has got to get it right.
