What Other Services Fail with EBS?

Amazon had another outage today which impacted users of EBS in a single availability zone of the us-east-1 region. On the plus side, this outage did not seem to cascade and cause availability problems for EBS users in other availability zones. But there were a number of other things which failed across the Amazon services within that availability zone. Looking at them gives us some interesting insight into dependencies within the services provided by AWS.

The first and most obvious casualties of EBS problems are the APIs used to interact with AWS, the AWS console, and Elastic Beanstalk. During these events, automated failover mechanisms, both those provided by Amazon and custom solutions built by AWS users, drive the load on the API up drastically. This means that automated failover is actually less helpful than one might hope. To remain up during one of these events, you need enough capacity to sustain the loss of an availability zone without provisioning additional instances or changing the configuration of AWS services (such as ELBs).
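As a rough illustration of that capacity planning, here is a minimal sketch; the function name and the "survive one AZ" framing are my own, not anything AWS provides:

```python
# Hypothetical sketch: how many instances to run in each AZ so that
# losing any single AZ leaves enough remaining capacity to serve peak
# load, with no API calls or reconfiguration needed during the outage.
import math

def instances_per_az(peak_load_instances: int, num_azs: int) -> int:
    """Instances to run in each AZ so the surviving AZs can absorb
    the full peak load if one AZ is lost."""
    surviving = num_azs - 1
    return math.ceil(peak_load_instances / surviving)

# Example: a peak load needing 12 instances spread across 3 AZs means
# running ceil(12 / 2) = 6 per AZ, i.e. 18 total (50% headroom).
```

The headroom is the price of not depending on the control plane mid-outage: with only "just enough" capacity, failover requires exactly the API calls that tend to be unavailable.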

Another big component which tends to fail in concert with EBS is RDS. While not surprising, especially given the announcement of provisioned IOPS support for RDS, this drives home the fact that RDS depends on EBS for its data storage. And while Multi-AZ support promises to handle failover in these cases, the reality is that, for at least some users, it gets stuck without failing over during these outages.

Two of Amazon’s newer services also saw some downtime today — CloudSearch and ElastiCache. CloudSearch provides searching and indexing functionality, so its disk usage will be similar to that of a database and EBS makes some sense there. ElastiCache, on the other hand, provides a memcached-compatible server for your in-memory caching needs. Why does an in-memory cache need disk at all, and especially disk with the persistence guarantees of EBS as opposed to ephemeral disk? I suspect Amazon is using EBS there to provide a volume to snapshot, making it easier to scale the cache out to new instances.

Did you notice problems with other services? Interested in helping us build tools to monitor and deal with these problems? Join us!

4 thoughts on “What Other Services Fail with EBS?”

That’s a good point, Jeremy.
That’s why in AWS it is worth having an active-passive or active-active infrastructure spread across different availability zones or regions. If you’re having issues with one of the zones, don’t waste your time; just switch to another zone.

I don’t think users can fully rely on AWS monitoring tools like CloudWatch. Monitoring stats are not always up to the minute. Use additional monitoring tools to stay ahead of problems.

@Ev — very true but the problem that many people saw last week is that they weren’t able to switch to other zones as the Amazon API control plane was overloaded and unavailable. In their post-mortem (https://aws.amazon.com/message/680342/), Amazon describes how they continue to work to keep the API fully available during outage scenarios but things just aren’t entirely there yet.

Sorry, I missed your original point about API accessibility. I wonder what you would suggest to protect against such risks.
Externalize DNS and spread DNS masters and slaves across different providers to minimize the risk of failure? In that case you’ll at least be able to redirect traffic.
Latency-based DNS routing? Any other ideas?

External DNS servers are probably the best bet for anything public-facing. For internal pieces of your architecture, you could coordinate access to services in a variety of other ways, from the simplest (using /etc/hosts to point at the active database server) to something like ZooKeeper to handle the coordination.
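The /etc/hosts approach can be sketched as below; this is a hypothetical illustration, and the "db-master" name is an assumed convention, not anything the post prescribes:

```python
# Hypothetical sketch of the /etc/hosts approach: repoint a well-known
# name at the currently active database server, so clients fail over
# to the new host without any code or DNS changes.
def repoint_host(hosts_text: str, name: str, new_ip: str) -> str:
    """Return hosts-file text with `name` mapped to `new_ip`,
    replacing any existing entry for that name."""
    lines = []
    replaced = False
    for line in hosts_text.splitlines():
        fields = line.split()
        # A hosts line is "IP name [aliases...]"; match on the names.
        if len(fields) >= 2 and name in fields[1:]:
            lines.append(f"{new_ip} {name}")
            replaced = True
        else:
            lines.append(line)
    if not replaced:
        lines.append(f"{new_ip} {name}")
    return "\n".join(lines) + "\n"
```

A configuration-management tool (or a small agent) would push the rewritten file to every client during a failover; ZooKeeper replaces that push with clients watching a znode that names the active server.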