Short notes and essays about stuff that interests me (mostly technical stuff).

Tuesday, May 3, 2011

Learning from the AWS outage

Generally, in computer science as in other scientific fields, we study success. We gather the best known algorithms for solving certain problems, study them, consider ways to improve them, and publish and share the best ones we can find.

However, it is just as essential, perhaps even more so, to study failure. Cryptographers analyze security breaks, to understand where the reasoning contained flaws. Wise engineers know that when you find one bug, it's worth looking around for other similar problems in the code. We do the best that we can in our designs and implementations, but, as Richard Feynman said in his report on the explosion of the space shuttle Challenger:

For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.

These past weeks have provided a wonderful new opportunity to study failure, in the form of the recent Amazon Web Services Elastic Block Store (EBS) re-mirroring outage. Even more interestingly, a number of organizations have published analyses of their own mistakes, errors, and missteps in using AWS EBS to build their applications.

Here's a brief selection of some of the more interesting post-mortems that I've seen so far:

The basic cause of the outage was that a configuration change was made incorrectly, routing a large amount of traffic from a primary network path to a secondary one, which couldn't handle the load.

This caused nodes to lose connection to their replicas, and to initiate re-mirroring.

In order for the re-mirroring to succeed, a significant amount of extra disk storage had to be provisioned into the data center.

Over-provisioning resources in the data center could have prevented some of the failures: "We now understand the amount of capacity needed for large recovery events."

Keeping enough of the system running in order to work on the failing parts is very challenging: "Our initial attempts to bring API access online to the impacted Availability Zone centered on throttling the state propagation to avoid overwhelming the EBS control plane. ... We rapidly developed throttles that turned out to be too coarse-grained to permit the right requests to pass through and stabilize the system. Through the evening of April 22nd into the morning of April 23rd, we worked on developing finer-grain throttles"
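To make the idea of "finer-grain throttles" concrete, here is a minimal sketch of a throttle that limits each class of request separately instead of applying one coarse limit to everything. This is not Amazon's implementation, which the post-mortem does not describe in detail; the token-bucket technique, the class names `TokenBucket` and `ClassThrottle`, and the request classes and rates in the usage note are all assumptions chosen for illustration.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity      # start with a full bucket
        self._now = now             # injectable clock, handy for testing
        self._last = now()

    def allow(self):
        # Refill tokens for the time elapsed since the previous call.
        current = self._now()
        elapsed = current - self._last
        self._last = current
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class ClassThrottle:
    """One bucket per request class, so each class can be tuned on its own."""

    def __init__(self, limits, now=time.monotonic):
        # `limits` maps a (hypothetical) request class to (rate, capacity).
        self.buckets = {
            name: TokenBucket(rate, capacity, now)
            for name, (rate, capacity) in limits.items()
        }

    def allow(self, request_class):
        bucket = self.buckets.get(request_class)
        # Unknown request classes are rejected rather than let through.
        return bucket is not None and bucket.allow()
```

With a configuration such as `ClassThrottle({"describe": (10, 2), "create_volume": (1, 1)})`, cheap read-style requests could keep flowing while expensive ones are held back, which is roughly the property a single coarse throttle cannot provide.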

The Netflix team also talked about the benefits of over-provisioning capacity. They also described the complexity of trying to perform massive online reconfiguration: "While we have tools to change individual aspects of our AWS deployment and configuration they are not currently designed to enact wholesale changes, such as moving sets of services out of a zone completely." To simulate these sorts of situations for testing, Netflix are considering replacing their "Chaos Monkey" with a "Chaos Gorilla"!

The Conversations Network team (a large podcasting site) described an interesting rule of thumb for determining when to initiate manual disaster recovery schemes: "Once the length of the outage exceeds the age of the backups, it makes more sense to switch to the backups. If the backups are six hours old, then after six hours of downtime, it makes sense to restart from backups." However, they also commented that they had overlooked the need to test their backups: it turned out that one of their crucial data sets was not being backed up on the same schedule as the others.
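The rule of thumb is simple enough to write down as code. The sketch below is my own reading of it, not anything The Conversations Network published: the function name `should_switch_to_backups` and its parameters are hypothetical, and I have assumed that the stalest data set is the one that governs the decision, since a backup that is older than you thought is exactly the surprise they ran into.

```python
from datetime import datetime


def should_switch_to_backups(outage_start, now, last_backup_times):
    """Return True once the outage has lasted longer than the backups are old.

    `last_backup_times` maps a data set name to the time of its most recent
    backup. The oldest of those backups determines how much work is lost by
    restoring, so it is the one compared against the outage length.
    """
    outage_length = now - outage_start
    oldest_backup = min(last_backup_times.values())
    backup_age = outage_start - oldest_backup
    return outage_length > backup_age


outage_start = datetime(2011, 4, 21, 8, 0)
backups = {
    "database": datetime(2011, 4, 21, 2, 0),   # six hours old at outage start
    "media": datetime(2011, 4, 21, 2, 0),
}

# Five hours in: still cheaper to wait.
print(should_switch_to_backups(outage_start, datetime(2011, 4, 21, 13, 0), backups))
# Six and a half hours in: restoring now loses less than waiting already has.
print(should_switch_to_backups(outage_start, datetime(2011, 4, 21, 14, 30), backups))
```

Note that the answer changes if any one data set turns out to be on a slower backup schedule, which is a good argument for checking the real backup times rather than the intended ones.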

Bryan Cantrill of Joyent talked about the danger of adopting a cloud strategy that leads to "the concentration of load and risk in a single unit (even one that is putatively highly available)."

The Heroku team pointed out that not all Amazon Web Services are the same, particularly when it comes to availability: "EC2, S3, and other AWS services have grown much more stable, reliable, and performant over the four years we've been using them. EBS, unfortunately, has not improved much, and in fact has possibly gotten worse. Amazon employs some of the best infrastructure engineers in the world: if they can't make it work, then probably no one can."

The SimpleGeo team pointed out one of the reasons that over-provisioning is crucial: "When EBS service degradation occurred in one AZ, I suspect that customers started provisioning replacement volumes elsewhere via the AWS API. This is largely unavoidable. The only way to address this issue is through careful capacity planning -- over-provisioning isolated infrastructure so that it can absorb capacity from sub-components that might fail together. This is precisely what we do, and it's one of the reasons we love Amazon. AWS has reduced the lead time for provisioning infrastructure from weeks to minutes while simultaneously reducing costs to the point where maintaining slack capacity is feasible."
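A little arithmetic shows how much slack this kind of capacity planning implies. The sketch below is not SimpleGeo's method; it assumes, for illustration only, that load is spread evenly across zones and that the surviving zones must absorb all of the load from the zones that fail. The function names and the example numbers are hypothetical.

```python
def required_capacity_per_zone(total_load, zones, zones_lost=1):
    """Capacity each zone needs so the survivors can carry the whole load."""
    survivors = zones - zones_lost
    if survivors <= 0:
        raise ValueError("at least one zone must survive")
    return total_load / survivors


def overprovision_factor(zones, zones_lost=1):
    """Ratio of provisioned capacity to what is needed when nothing fails."""
    survivors = zones - zones_lost
    if survivors <= 0:
        raise ValueError("at least one zone must survive")
    return zones / survivors


# Example: 900 units of load spread over 3 zones, planning to lose 1 zone.
print(required_capacity_per_zone(900, 3))   # capacity to provision per zone
print(overprovision_factor(3))              # multiple of the no-failure capacity
```

Under these assumptions, three zones that each normally carry 300 units need to be sized for 450 each, so the slack is half again the normal capacity. Spreading across more zones lowers that factor, which is one reason cheap, quickly provisioned capacity makes this approach feasible.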

At my day job I've been spending a lot of time recently thinking about how to build reliable, dependable, scalable distributed systems. It's a big problem, and one that takes years to address.

Building distributed systems is extremely hard; building reliable, high-performing, highly-available systems is harder still. There is still much to learn, so Amazon are to be commended and thanked for their openness and for their timely release of detailed information about the failure.