Amazon EC2 outage: how it affected us, and what we’ve done since

As most of you know by now, we were affected by the Amazon EC2 outage, which resulted in approximately a day of on/off downtime for Pagelime. We’ve communicated openly about it with anyone who reached out, and we sent out a mass email with our personal cell numbers and personal email addresses. We wanted to stay open and available whenever you needed to reach us. I’m going to use this post to explain the details behind the outage.

Here’s what happened, how it affected us, and what we’ve done since to mitigate this issue:

The Northern Virginia region of Amazon’s Elastic Compute Cloud (EC2) infrastructure had a major set of issues with its storage: the Elastic Block Store (EBS). EBS is meant to be a highly redundant form of storage with very low rates of failure, where any single disk failure should not affect availability of the actual data. It turns out this isn’t exactly the case: the entire block store seems to have become unavailable across a number of availability zones in that region.

A number of web companies were affected, including Foursquare, Reddit, Quora, and HootSuite, to name a few. Like many other web apps, we assumed the issue would be resolved promptly.

Amazon took about a day to repair the issue, at which point service was restored, and things began to operate normally. This puts our current uptime at 99.4% for the year. We need this to be better for both our users and our peace of mind.
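To put that uptime figure in perspective, here is a minimal sketch of how much downtime 99.4% implies. The post does not say which window "for the year" refers to, so both a year-to-date window (January 1 through the date of this post) and a trailing 365-day window are shown as assumptions; the function and variable names are ours, for illustration only.

```python
from datetime import date


def downtime_hours(uptime_percent: float, window_days: int) -> float:
    """Hours of downtime implied by an uptime percentage over a window of days."""
    return (1 - uptime_percent / 100) * window_days * 24


# Assumption: "the year" may mean year-to-date or a trailing 365 days.
post_date = date(2011, 4, 23)
days_year_to_date = (post_date - date(2011, 1, 1)).days + 1  # inclusive count

ytd = downtime_hours(99.4, days_year_to_date)
trailing = downtime_hours(99.4, 365)

print(f"Year to date ({days_year_to_date} days): {ytd:.1f} hours of downtime")
print(f"Trailing 365 days: {trailing:.1f} hours of downtime")
```

Under the year-to-date assumption, 99.4% works out to roughly 16 hours of downtime, which is in line with about a day of on/off outage.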

Here’s why we use Amazon AWS and not a custom-built hosting solution:

Amazon AWS is fast. It gives us really good response times, and the storage performs very well for the price. That allows us to keep costs down for our users while providing the best service.

Amazon AWS is highly available. It allows us to host servers in Virginia, Northern California, Ireland, and Asia Pacific at the same time, so that a Pagelime server can always be available close to where you are. The same goes for their Simple Storage Service (S3).

Amazon AWS is highly scalable. We can provision new computing and storage resources very quickly. It puts scaling into our hands.

An ideal setup with AWS should not have failed even in the extraordinary scenario we had over the past two days. Here’s why ours failed:

All of our data, both in our databases and in the data files we keep in S3, is stored in multiple availability zones (around the world). This worked like a charm: our database immediately failed over to an available instance, and our data was unaffected. This is good.

The same goes for servers hosted in multiple availability zones. Only the ones in US-East were affected. This is good.

However, for speed of use, Pagelime caches all of the data files, such as content, images, and documents, on Elastic Block Store volumes… the very volumes that failed completely. This cache allows us to quickly publish content without a lot of round-tripping between the database and the servers. Losing it crippled us for a day, and we’re fixing it.

Pagelime also runs the publish engine from ONE single location. The reason is that we want publishing to always originate from one IP address, so that firewalls and hosts can whitelist us. This publish engine happened to be in the affected zone. We’re fixing this as well.

Soon after the outage happened, we initiated plan B, and began to migrate all of our cache/engine to a different availability zone. This was great as an emergency response, but we want to be resilient to these failures in the future. Here’s what we’re doing to prevent this from happening again:

We are purging the publish cache. From now on, the data will be published directly from the data store. This may result in longer load times when you press the publish button, or when you publish an image gallery, but it should reduce the potential for future failures. We unfortunately have to cut this performance optimization for the sake of reliability.

We are adding code to the Pagelime application so that the software itself will fail over to a different storage model should one appear to be failing.

We are creating a backup publish engine in a different part of the world. For those of you who have whitelisted our IP address in a firewall, we will send out the new IP as well, so that it can be added to your web host’s firewall.
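To give a rough idea of what failing over in software can look like, here is a minimal sketch. It is an illustration only, not the actual Pagelime code: the exception, the function names, and the two stand-in backends are all hypothetical. The idea is simply to try each storage backend in order and use the first one that responds.

```python
from typing import Callable, Sequence


class StorageUnavailable(Exception):
    """Raised by a backend that cannot serve a request right now."""


# Hypothetical interface: a backend is any function that maps a key to bytes.
Backend = Callable[[str], bytes]


def read_with_failover(key: str, backends: Sequence[Backend]) -> bytes:
    """Try each backend in order and return the first successful read."""
    errors = []
    for backend in backends:
        try:
            return backend(key)
        except StorageUnavailable as exc:
            errors.append(exc)  # remember the failure, move on to the next backend
    raise StorageUnavailable(f"all {len(backends)} backends failed for {key!r}: {errors}")


# Stand-ins for real storage: a cache that is down and a data store that works.
def failing_cache(key: str) -> bytes:
    raise StorageUnavailable("cache volume not reachable")


def data_store(key: str) -> bytes:
    return b"content for " + key.encode()


result = read_with_failover("pages/home.html", [failing_cache, data_store])
print(result)
```

The trade-off is the same one described above: a read that has to fall through to the slower backend takes longer, but it still succeeds.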

We’ve learned a lot from this. We were really proud of our cloud infrastructure, and of the speed and reliability we were getting for the price. After this incident we’re a bit sobered, and we realize that we need to put even more effort into it.

We’re grateful for the outpouring of support we’ve received from you via email. Thanks for standing by us – we’ll make sure to pay it back in kind.

This entry was posted on Saturday, April 23rd, 2011 at 11:33 am.