CIO analysis: Examining Amazon's cloud failure

Amazon's recent web services outage took down a number of high-profile websites, including Reddit, Foursquare, and Quora. Given the coverage, speculation, and FUD surrounding this situation, I thought a CIO perspective would be useful.

The issues affecting EC2 customers last week primarily involved a subset of the Amazon Elastic Block Store (“EBS”) volumes in a single Availability Zone within the US East Region that became unable to service read and write operations. In this document, we will refer to these as “stuck” volumes. This caused instances trying to use these affected volumes to also get “stuck” when they attempted to read or write to them. In order to restore these volumes and stabilize the EBS cluster in that Availability Zone, we disabled all control APIs (e.g. Create Volume, Attach Volume, Detach Volume, and Create Snapshot) for EBS in the affected Availability Zone for much of the duration of the event. For two periods during the first day of the issue, the degraded EBS cluster affected the EBS APIs and caused high error rates and latencies for EBS calls to these APIs across the entire US East Region. As with any complicated operational issue, this one was caused by several root causes interacting with one another and therefore gives us many opportunities to protect the service against any similar event reoccurring.

It's a detailed techno-geek discussion, but in plain English, Amazon's systems experienced local congestion that cascaded out of control, causing a loss of availability. Even worse, Amazon actually lost some customers' data.

From a CIO perspective, we should keep several points in mind:

1. The sky did not fall and cloud is here to stay

This outage may cause a short-term chilling effect on cloud adoption. As The Economist, a general business publication, reports, such outages:

have raised the question of whether customers can really trust the basic idea behind the cloud—that you can buy computing services from the internet, just like gas or water from a utility.

However, anecdotal evidence and CIO mind share both suggest the increasing importance of cloud computing and SaaS. A recent study from Computer Economics shines the light of empirical research on growth and investment in SaaS, as shown in the following chart:

Centralized service facilities are a core tenet of cloud computing and SaaS. Cloud vendors, such as Amazon, share their infrastructure among many customers on a "rental" basis. These shared facilities create economies of scale and facilitate the leverage that makes cloud computing financially attractive.

However, centralization also has a dark side -- a negative magnifying effect occurs when the central facilities break down. In the old days, every major organization maintained its own infrastructure, reducing the risk that many companies would go down simultaneously from a single point of failure. Although traditional infrastructure can be expensive, inefficient, and redundant, the cloud creates a broader failure footprint when things go bad.

Nonetheless, for most organizations, cloud and SaaS offer many compelling economic benefits combined with an excellent business case. As another article in The Economist stated:

[A]s countless individuals and companies have come to find that the benefits of doing things online greatly outweigh the risks.

That said, smart organizations recognize the risks inherent in centralizing key components and systems in the cloud, and they plan accordingly. Gartner analyst Lydia Leong offers sage comments on this point:

There are a lot of moving parts in cloud IaaS [Infrastructure as a Service]. Any one of them going wrong can bork your entire site/application. Your real problem is appropriate risk mitigation — the risk of downtime and its attendant losses, versus the complications and technical challenges and costs created by infrastructure redundancy.

3. Cloud is still maturing and evolving

From both business and technology perspectives, the world of cloud is changing and evolving over time. Amazon itself is still evaluating how to plan better and make their own systems more robust.

It’s a little odd to see that when nodes became unavailable, Amazon almost began to mount a denial-of-service attack within its own environment. Amazon now claims that this aspect of its crisis-related actions has been set right, but one may have to wait until the next outage to see what else could give way. It may be noted that Amazon cloud services suffered a major outage in 2008, and the failure pattern looks somewhat similar upon diagnosis.

Clearly, the systems need to operate differently under different circumstances: while it’s normal for nodes to keep replicating when storage or access concerns arise, the system ought to behave differently in a different kind of crisis. With the increasing adoption of public cloud services, the volume, complexity, and range of workloads will certainly increase, and the systems will be tested for availability and reliability under varying circumstances. Business and IT users will seek answers to such questions as they consider moving their workloads onto the cloud.
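One common way to keep retries from snowballing into that kind of self-inflicted denial of service is capped exponential backoff with random jitter. The following is a minimal sketch of the general technique, not a description of what Amazon actually implemented; the function names, the default timings, and the use of ConnectionError as the retryable failure are all assumptions made for illustration.

```python
import random
import time


def backoff_delays(base=0.5, cap=60.0, attempts=6, rng=None):
    """Return randomized wait times (in seconds) for successive retries.

    The ceiling doubles on each attempt until it reaches `cap`, and each
    delay is drawn uniformly between 0 and that ceiling ("full jitter"),
    so many clients retrying at once do not all hit the service together.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


def retry_with_backoff(operation, delays, sleep=time.sleep):
    """Call `operation` once, then retry after each delay until it succeeds.

    `operation` is a hypothetical zero-argument callable standing in for a
    storage or API request. Only ConnectionError is treated as retryable.
    """
    last_error = None
    for delay in [0.0] + list(delays):
        if delay:
            sleep(delay)
        try:
            return operation()
        except ConnectionError as err:
            last_error = err
    raise last_error
```

The point of the jitter is that a thousand nodes that lose their peer at the same moment do not all come back looking for a replacement at the same moment too.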

4. Self-reliance and proper planning are critical

Outsourcing does not relieve enterprise buyers of responsibility to manage their own destiny. For cloud computing, this means the enterprise must design applications for resiliency while planning for disaster recovery.

George Reese, CTO of a cloud automation company, suggests "designing for failure" to give applications a degree of independence from data center interruptions:

Under the design for failure model, combinations of your software and management tools take responsibility for application availability. The actual infrastructure availability is entirely irrelevant to your application availability. 100% uptime should be achievable even when your cloud provider has a massive, data-center-wide outage.

The advantage of the traditional model is that any application can be deployed into it and assigned the level of redundancy appropriate to its function. The downside is that the traditional model is heavily constrained by geography. It would not have helped you survive this level of cloud provider (public or private) outage.

The advantage of the "design for failure" model is that the application developer has total control of their availability with only their data model and volume imposing geographical limitations. The downside of the "design for failure" model is that you must "design for failure" up front.
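In code, the core of the "design for failure" idea is that the application itself decides where to send work when one location stops responding. The sketch below is illustrative only: the Replica type, the zone labels, and the handler callables are hypothetical stand-ins for real service endpoints in separate zones or regions, and a production system would also need health checks, timeouts, and data replication.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Replica:
    zone: str                      # hypothetical label such as "zone-a"
    handler: Callable[[str], str]  # stand-in for a network call to that zone


class AllReplicasFailed(Exception):
    """Raised when every replica has been tried and none answered."""


def call_with_failover(replicas: Sequence[Replica], request: str) -> Tuple[str, str]:
    """Try each replica in order and return (zone, response) from the first
    one that answers. Failures are collected so the final error explains
    what went wrong in every zone."""
    errors: List[str] = []
    for replica in replicas:
        try:
            return replica.zone, replica.handler(request)
        except Exception as err:  # any failure means "try the next zone"
            errors.append(f"{replica.zone}: {err}")
    raise AllReplicasFailed("; ".join(errors))
```

The application only goes down when every listed location is down, which is why the replicas need to live in failure domains that are truly independent of one another.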

Companies need the cloud equivalent of offsite backup. At a minimum, you need to be sure you can get access to a backup of your infrastructure: all the AMIs and data needed to restart. Storage is cheap. Heck, if you’re totally paranoid, turn the tables and back up the cloud to your own datacenter, which consists of just the backup infrastructure. At least that way you’ve always got the data. Yes, there will be latency issues and that data will not be up to the minute. But look at all that’s happened. Suppose you could’ve spun up in another region having lost 2 hours of data. Not good, not good at all. But is it really worse than waiting over 24 hours, or would you be feeling blessed about now if you could’ve done it 2 hours into the emergency?

These are the kinds of trade-offs to be thinking about for disaster recovery. It’s chewing gum and baling wire until you get an architecture that’s more resilient, but it sure beats not having any choices and waiting.
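To make that trade-off concrete, the comparison can be reduced to simple arithmetic. In the sketch below, the 24-hour and 2-hour figures come from the scenario above; everything else, including the hourly cost rates and the assumption that waiting for the provider loses no data, is a hypothetical placeholder that each organization would replace with its own numbers.

```python
from dataclasses import dataclass


@dataclass
class RecoveryOption:
    name: str
    downtime_hours: float    # how long the service is unavailable
    data_loss_hours: float   # how much recent data is gone after recovery


def outage_cost(option, downtime_cost_per_hour, data_loss_cost_per_hour):
    """Total cost of an outage under one recovery option."""
    return (option.downtime_hours * downtime_cost_per_hour
            + option.data_loss_hours * data_loss_cost_per_hour)


# Scenario figures from the text; zero data loss while waiting is an assumption.
wait = RecoveryOption("wait for the provider", downtime_hours=24, data_loss_hours=0)
restore = RecoveryOption("restore stale backup elsewhere", downtime_hours=2, data_loss_hours=2)

# Hypothetical cost rates, for illustration only.
DOWNTIME_RATE = 1000
DATA_LOSS_RATE = 3000

cheaper = min((wait, restore),
              key=lambda o: outage_cost(o, DOWNTIME_RATE, DATA_LOSS_RATE))
```

With these placeholder rates, waiting costs 24,000 and restoring from the stale backup costs 8,000, so the imperfect backup wins. The answer flips only when an hour of lost data is worth far more than an hour of downtime.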

5. Learn from success stories

Some Amazon customers made it through the event relatively unscathed. Learn from their lessons to help your organization become better prepared.

Why were some websites impacted while others were not? For Netflix, the short answer is that our systems are designed explicitly for these sorts of failures. When we re-designed for the cloud this Amazon failure was exactly the sort of issue that we wanted to be resilient to. Our architecture avoids using EBS as our main data storage service, and the SimpleDB, S3 and Cassandra services that we do depend upon were not affected by the outage.

At SimpleGeo failure is a first class citizen. We talk a lot about it in design discussions, it influences our operational procedures, we think about it when we're coding, and we joke about it at lunch. I believe that this emphasis on understanding system failure mechanisms and being open about them is the first step towards dealing with them. Before we introduce a new component into our infrastructure we plan how we'll deal with it when it inevitably fails.

ADVICE TO ENTERPRISE CIOs

For many organizations, moving at least some apps and systems to SaaS and the cloud is inevitable. The questions are which applications to move and when, how to perform the migration, and how to develop the right skills internally.

The Amazon situation demonstrates that cloud migration is not an all-or-nothing proposition. In this case, organizations with better planning and architecture design survived while others went down. In addition, the importance of sophisticated technical knowledge cannot be overemphasized.

CIOs should decide where their organization fits on the cloud investment / risk curve. As in other areas of business, increasing investment in robust design can reduce operational risk, but at higher financial cost. Every CIO must evaluate his or her own organization to determine the right tradeoffs between investment, technical development, and risk.