AWS us-east outage: one month later

Exactly one month ago I published a post-mortem of the large-scale outage which affected our primary infrastructure on the east coast over the weekend of June 30th 2012. I offered an apology to the many dotCloud customers who were affected, and I identified 4 root causes along with corrective actions.

After 4 weeks of hard work some of these corrective actions are still underway, but I would like to offer a first report of our progress.

Root cause 1: API availability

Even when your application is up and running, it can be distressing to be unable to interact with it through the dotCloud developer tools. Although application uptime is always a priority, we are working to improve the reliability of our API and developer tools as well, to guarantee maximum comfort even during hardware failures:

Our product process has been adapted to speed up the fixing of bugs which may affect API availability.

Certain timeouts were caused by upgrades to the API throughout the day. We have grouped routine updates to the API in weekly bundles, to minimize service disruption.

Work in progress: certain command-line operations rely on components which are less resilient to hardware failure. We are improving the design of these components to make them more reliable.

Root cause 2: Uptime of sandbox applications

Any application is only as reliable as its weakest link. When underlying hardware fails, individual application containers fail as well; that is a fact that cannot be avoided. To increase application reliability, dotCloud services can be scaled horizontally, which deploys them to multiple containers, across multiple machines and facilities. Horizontal scaling is a feature of our Live offering. Unfortunately, during the outage we received requests from many dotCloud users who could have avoided downtime if they had been better informed of the need to use the Live flavor combined with horizontal scaling.

Corrective actions:

We now invest more time in support explaining how scaling works and why it’s important.

Work in progress: we are working on an upgrade to the dashboard that will display warnings when applications are not scaled properly.

Work in progress: we want to make scaling even more straightforward and accessible. If you have feedback on how we can make scaling easier for you, please get in touch!
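
To make the idea of "not scaled properly" concrete, here is a minimal sketch of the kind of check a dashboard warning could run. It is only an illustration: the `Container` and `Service` types, the `scaling_warnings` function and the warning wording are all hypothetical, and none of it describes how the dotCloud dashboard is actually built.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Container:
    machine: str  # hypothetical identifier of the host machine
    zone: str     # hypothetical identifier of the facility


@dataclass
class Service:
    name: str
    containers: List[Container]


def scaling_warnings(services: List[Service]) -> List[str]:
    """Return one warning per service that a single failure could take down."""
    warnings = []
    for service in services:
        machines = {c.machine for c in service.containers}
        zones = {c.zone for c in service.containers}
        if len(service.containers) < 2:
            warnings.append(f"{service.name}: fewer than two containers")
        elif len(machines) < 2:
            warnings.append(f"{service.name}: all containers share one machine")
        elif len(zones) < 2:
            warnings.append(f"{service.name}: all containers share one facility")
    return warnings
```

A service spread over two machines in two facilities produces no warning; anything less is flagged with the narrowest failure domain it depends on.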

Root cause 3: Communication

An important part of managing an incident is communicating with your customers as frequently and transparently as possible. That is particularly important for us at dotCloud because our entire business is based on earning the long-term trust of our customers. Throughout the weekend our engineering and support teams pulled ridiculous hours to answer the hundreds of support requests filed by anxious developers. The result was higher-than-usual response times, and lower-than-usual support satisfaction. That is not acceptable and we have taken several measures to improve how we communicate the state of the platform during large-scale incidents:

Status API: We created a new API to expose the status of the platform. The status API exposes an “outage flag” which indicates whether the platform is currently experiencing a large-scale outage. It also exposes a stream of recent status updates, which can be consumed programmatically. You can find out more about the status API by reading the docs.

Outage alerts on website: The dotCloud website is now outfitted with a special banner which will automatically display alerts in the case of an outage.

Our internal communication processes and tools have been upgraded. On-call engineers can now use a single tool to propagate status updates to all channels. Our on-call procedure has been changed and now designates a communication operator, tasked with producing timely and detailed status updates while their teammates fix problems.

Work in progress: several customers have asked that we open our priority support offering to smaller applications. We have heard you and are working on a new support plan which will allow anyone, for a fee, to get priority treatment from our awesome support team. More on that soon.
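
As an illustration of consuming the status API programmatically, here is a minimal sketch of a client that reads the outage flag and the stream of recent updates. The endpoint URL, the JSON field names (`outage`, `updates`, `timestamp`, `message`) and the helper names are all assumptions made for this example; the real format is described in the docs.

```python
import json
from dataclasses import dataclass
from typing import List
from urllib.request import urlopen

# Placeholder endpoint: an assumption for this sketch, not the real URL.
STATUS_URL = "https://status.example.com/api/status"


@dataclass
class StatusUpdate:
    timestamp: str
    message: str


@dataclass
class PlatformStatus:
    outage: bool                 # the "outage flag"
    updates: List[StatusUpdate]  # recent status updates


def parse_status(payload: str) -> PlatformStatus:
    """Parse a JSON document using the assumed field names."""
    data = json.loads(payload)
    updates = [
        StatusUpdate(u["timestamp"], u["message"])
        for u in data.get("updates", [])
    ]
    return PlatformStatus(outage=bool(data.get("outage", False)), updates=updates)


def fetch_status(url: str = STATUS_URL, timeout: float = 5.0) -> PlatformStatus:
    """Fetch the status document over HTTP and parse it."""
    with urlopen(url, timeout=timeout) as response:
        return parse_status(response.read().decode("utf-8"))


def should_pause_deploys(status: PlatformStatus) -> bool:
    """Example policy: hold automated deployments while the flag is set."""
    return status.outage
```

A deployment script could call `fetch_status()` before pushing and skip the push when `should_pause_deploys` returns true, printing the latest update message so the developer knows why.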

Root cause 4: Dependency on AWS us-east

Last but not least: a great many people have asked if we could reduce our exposure to the AWS us-east region.

As I have already explained, I don’t believe in blaming AWS for outages. AWS is simply a hosting provider: all hosting providers experience failures and it is the platform’s job to protect the application against these failures. However, owning up to our responsibility doesn’t mean we should put all our eggs in the same basket. dotCloud already takes advantage of availability zones to minimize exposure to outages, but this last outage has made it clear that cross-AZ deployment is not a silver bullet. Over the coming months we will be taking gradual steps to reduce our exposure to any single infrastructure provider:

As a first step, we have moved our status page (http://status.dotcloud.com) away from its multi-AZ deployment, and onto a different hosting provider.

Work in progress: we are in the process of migrating critical platform components out of AWS.

Conclusion

More than any technology, it’s the way we treat our customers that will determine our success and longevity as a business. Nothing is more important than earning your trust. And our plan for earning your trust is simple: “say what you do, do what you say”. In other words: we strive to be transparent about our plans and actions, without over-promising; and we work hard to deliver on our promises. By maintaining this discipline I believe we can deliver a service that you can trust – every time.