
AWS Outage - October 10, 2012

Last night at about midnight Eastern time, Newstex began receiving alerts from PagerDuty warning about service issues it had detected. We always begin our investigation with a few simple steps.

Don't Panic

The number one rule during any potential crisis is always, "Don't Panic". The worst thing that can happen is a panicked engineer rushing out during a perceived crisis and causing more issues than there originally were, if there was even an issue at all.

Discovery Phase

The first phase in any crisis situation is the Discovery Phase. This is where you monitor your systems and attempt to discover the cause of the alerts being sent out.

Verify the alert

First, it's important to verify that the alert wasn't erroneous, or already resolved by the time you received it. Temporary issues are quite common, and although alerts are only supposed to fire once a problem has been confirmed, there's always the chance that an alert was misconfigured, or that the alerting system itself is having an issue. It's important to verify that what the alert is telling you is true. In our case, it was telling us that several of our FTP servers were having issues. This is easy to verify by simply attempting to log into those FTP servers with known good accounts (we keep several testing accounts just for this purpose).
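A check like this is easy to script. The sketch below assumes hypothetical hostnames and test credentials; substitute the known-good accounts you keep for this purpose. Injecting the FTP class also makes the check easy to test without a live server.

```python
from ftplib import FTP, all_errors

# Hypothetical hosts and test credentials -- substitute your own
# known-good accounts kept for exactly this purpose.
TEST_ACCOUNTS = [
    ("ftp1.example.com", "healthcheck", "secret"),
    ("ftp2.example.com", "healthcheck", "secret"),
]

def verify_ftp(host, user, password, timeout=10, ftp_factory=FTP):
    """Return True if a test login to the FTP server succeeds."""
    try:
        conn = ftp_factory(host, timeout=timeout)
        conn.login(user, password)
        conn.quit()
        return True
    except all_errors:
        return False

if __name__ == "__main__":
    for host, user, password in TEST_ACCOUNTS:
        print(host, "OK" if verify_ftp(host, user, password) else "DOWN")
```

If every test login fails at once, the problem is more likely upstream (network, shared datastore) than on any single FTP host.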

Check the AWS Status Page

After you've verified that there is an actual issue, it's important to check the AWS Status Page. In our case, this turned out to be the end of the discovery phase: we noticed immediately that SimpleDB was having major issues, and that was the cause of ours. If the status page confirms the problem, move straight on to mitigation. If it doesn't, it's important to check your log files to identify what the issue might be.
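The status page publishes an RSS feed per service and region, so this check can be automated too. The feed URL below is an assumption for illustration; use the feed linked from the status page that matches your own stack.

```python
import urllib.request
import xml.etree.ElementTree as ET

# The per-service feed URL is an assumption -- the status page links to
# an RSS feed for each service/region; pick the one matching your stack.
STATUS_FEED = "http://status.aws.amazon.com/rss/simpledb-us-east-1.rss"

def latest_event(rss_xml):
    """Return the title of the newest item in a status RSS feed,
    or None if the feed lists no events."""
    root = ET.fromstring(rss_xml)
    item = root.find("./channel/item")
    return item.findtext("title") if item is not None else None

if __name__ == "__main__":
    with urllib.request.urlopen(STATUS_FEED, timeout=10) as resp:
        print(latest_event(resp.read()) or "No events reported")
```

An empty feed means AWS is reporting no events, which pushes the investigation back to your own logs.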

Recovery Phase

Once you have discovered the issue, it's time to go about mitigating the impact and fixing the situation.

Mitigate the impact on customers

Customers generally don't care why a system is down, they only care that it is down. Blame may work in politics, but it doesn't comfort the user to know that you're down "because of Amazon". That just ends up putting more work on your shoulders, as now they will simply question why you are running on AWS at all. Instead, it's more important to restore services as quickly as possible, or at the very least mitigate the impact on your customers. In our case, although we couldn't fully restore all services, we were able to keep our FTP servers accepting new connections and files, focusing on the real-time needs of our customers. We knew it wasn't important to keep our internal administration system operating, nor was it important to keep our delayed feeds running. The most important aspects of our system, the ones requiring real-time delivery, were our top and only priority.
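One way to keep the front end accepting files while a backing datastore is down is a local spool-and-replay fallback. This is a sketch of the general pattern, not our exact implementation; `save_to_db` stands in for whatever your real write path is (for us, a SimpleDB put).

```python
import json
import os
import time

def record_file(metadata, save_to_db, spool_dir="/var/spool/ingest"):
    """Try the primary datastore first; if the write fails, spool the
    record locally so the FTP front end can keep accepting files.
    Spooled records get replayed once the datastore recovers."""
    try:
        save_to_db(metadata)
        return "stored"
    except Exception:
        os.makedirs(spool_dir, exist_ok=True)
        path = os.path.join(spool_dir, "%d.json" % time.time_ns())
        with open(path, "w") as fh:
            json.dump(metadata, fh)
        return "spooled"
```

The customer-visible behavior (uploads succeed) stays intact; only the internal bookkeeping is deferred.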

Monitor and Recover

During any crisis phase, it's important to continually monitor the situation, and be prepared to escalate to another step. In our example, we had already begun to prepare databases in us-west-1 in the event the issues extended beyond 4am. Fortunately, at around 3am Eastern time, the services were fully restored.
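The monitor-with-a-deadline pattern can be sketched as a simple loop. `check_service` and `failover` are stand-ins for your own health probe (e.g. the FTP login test) and your region cut-over procedure, both assumptions here.

```python
import time
from datetime import datetime, timedelta

def monitor(check_service, failover, deadline_hours=4, interval=60):
    """Poll a health check; if the outage outlasts the deadline,
    escalate by failing over to the standby region."""
    deadline = datetime.now() + timedelta(hours=deadline_hours)
    while not check_service():
        if datetime.now() >= deadline:
            failover()  # escalate: cut over to the prepared standby
            return "failed over"
        time.sleep(interval)
    return "recovered"
```

The point of deciding the deadline up front is that the escalation call gets made calmly, before the crisis, rather than under pressure in the middle of it.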

After the services were fully restored, we went through each of our services one-by-one and verified manually that everything was working again. We continued to monitor through the morning hours to make sure everything stayed stable.

Post-Mortem

The last phase of any crisis situation is the Post-Mortem. This is the time to go back through your logs in detail and identify exactly why the issue escalated into an outage. It's at this point that you want to take your time, find out exactly what triggered the domino effect, and work on potential solutions to those issues. You don't necessarily need to make any changes during this phase, but you do need to at least present the root cause and explain why it escalated into an actual outage.

After you have identified the issue and potential solutions, it's up to you to determine the cost-benefit of implementing those solutions. Remember that every issue comes with an associated risk, which can be calculated as a combination of two factors: how likely the issue is to occur, and how big its impact will be. This means that something lower-impact but much more likely to occur will be a higher priority than something extremely unlikely to occur but with a much higher impact.
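In its simplest form, this is an expected-cost calculation. The numbers below are purely illustrative, not figures from the outage:

```python
def risk_score(likelihood, impact):
    """Expected cost of an issue: probability it occurs (0-1) times
    the impact when it does (in whatever unit you prioritize by)."""
    return likelihood * impact

# Illustrative numbers only:
frequent_minor = risk_score(0.30, 2.0)   # likely, small impact
rare_major = risk_score(0.01, 20.0)      # unlikely, large impact
```

Here the frequent minor issue scores higher than the rare major one, so it gets fixed first, even though its individual incidents hurt less.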

Great support staff is key

In our case, Newstex was fully prepared to roll over to a different region, but we determined the impact of doing so would be greater than simply waiting for Amazon to fix the root cause. Our monitoring and support staff were fully on top of the situation and handled it incredibly well, minimizing the impact so that in most cases customers never even knew there was an issue. Having a great support staff is absolutely key. You don't want your customers to be the ones waking you up to tell you the system is down; you have to know about and resolve these issues before your customers notice.
