System Down! An Application Outage Survival Guide

A major system outage is every business’s worst nightmare, especially if you rely on applications to generate revenue. Depending on the scale of the outage, this can be a very expensive problem: a large eCommerce site can lose millions in sales for every hour of downtime. Furthermore, frequent public outages can damage credibility, causing customers to search for more reliable alternatives.

Unfortunately, outages are inevitable. But an outage doesn’t have to be the end of the world. With a little planning and communication you can significantly mitigate the effects of a major outage on your brand and your business.

Before The Outage

It’s impossible to foresee every scenario that may lead to a disaster for your application, but that doesn’t mean you can’t prepare. Companies that plan for failure respond faster when a problem arises and can often reduce their chances of facing a major outage by preventing smaller problems from snowballing.

Invest in IT. If IT is underfunded, chances are they’ll be forced to cut corners and delay important software or hardware upgrades. In other words, they’re incurring technical debt—leaving important stuff to be done later when the time or resources are available. This may make it easier for performance problems to arise and go unnoticed, especially if the tooling for monitoring the application is inadequate or missing altogether.

Break down the silos. When developers and operations work together and are involved in every stage of the application lifecycle, it’s much easier for them to troubleshoot problems. ExactTarget, for example, set up monitoring screens displaying dashboards from every department across the organization to help give everyone visibility into the entire application, making it easier to identify and troubleshoot problems.

Plan for failure. Netflix knows that the best way to prepare for failure is to experience it. That’s why they constantly simulate failure with what they call their Simian Army, a set of tools that randomly induces certain failure conditions by killing off nodes/availability zones, creating artificial latency and more.
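To make the idea concrete, here is a minimal sketch of random failure injection in Python. It is not Netflix's tooling and does not use the Simian Army's code; the names with_chaos, InjectedFailure, fetch_recommendations and recommendations_with_fallback are all hypothetical, invented for illustration. It assumes you wrap a call to a dependency so that some calls fail or slow down at random, then check that the caller degrades gracefully instead of crashing.

```python
import random
import time


class InjectedFailure(Exception):
    """Raised in place of a real call to simulate a dead dependency."""


def with_chaos(func, failure_rate=0.1, max_latency_s=0.0, rng=None, sleep=time.sleep):
    """Wrap func so that some calls fail or are delayed at random.

    failure_rate is the probability (0.0 to 1.0) that a call raises
    InjectedFailure; max_latency_s is the upper bound on added delay.
    """
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        # Simulate a killed node: fail before the real call runs.
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        # Simulate a slow network: wait a random amount of time.
        if max_latency_s > 0:
            sleep(rng.uniform(0, max_latency_s))
        return func(*args, **kwargs)

    return wrapper


def fetch_recommendations(user_id):
    # Hypothetical stand-in for a call to a downstream service.
    return [f"item-{user_id}-{n}" for n in range(3)]


def recommendations_with_fallback(fetch, user_id):
    """Degrade gracefully: return an empty list if the dependency fails."""
    try:
        return fetch(user_id)
    except InjectedFailure:
        return []


if __name__ == "__main__":
    flaky_fetch = with_chaos(fetch_recommendations, failure_rate=0.3, max_latency_s=0.2)
    for user_id in range(5):
        print(user_id, recommendations_with_fallback(flaky_fetch, user_id))
```

Running a wrapper like this in a test or staging environment shows whether the calling code actually has a fallback path. A real chaos tool works at the infrastructure level (terminating instances, cutting network links) rather than inside application code, so treat this only as a small-scale illustration of the principle.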

During The Outage

An outage can be a very stressful situation for everyone involved, but it’s important that everyone keeps their wits about them—now is your opportunity to prove to your customers and the world that you can handle disaster calmly and gracefully.

Communicate early and often. You may be tempted to pretend nothing’s wrong so as not to draw more attention to yourself, but with websites like downforeveryoneorjustme.com it’s easy for anyone to call your bluff. Be the one to tell your users what’s happening. Don’t make them resort to Twitter to get news and vent their frustrations (or crack jokes).

Argue about data, not feelings. Edmunds.com has a “data-driven” DevOps culture that allows them to respond quickly to problems, often without even calling a war room session: a couple of people get together, look at their monitoring tool, and find the problem.

Two pairs of eyes are better than one. No single person or team should be the sole keeper of application data. The more people who have access and visibility, the better. Care.com unites their tooling between Dev, Ops and QA so that everyone’s looking at the same information. This way it’s much easier to find problems before they bring down the application.

After The Outage

Once you’ve resolved the problem and your app is back up and running, your instinct will be to return to business as usual and pretend nothing happened. However, this is the best time to talk about what went wrong and re-establish credibility with your customer base. All you have to do is be completely honest.

Have a blameless postmortem. If people feel like they’ll be punished for speaking up about their own mistakes, they probably won’t, leaving you in the dark and often causing the same problems to happen again and again. Etsy is well known for its blameless postmortems, where it encourages employees to explain their actions and rationales without fear of retribution.

Be transparent. When you’re done with the postmortem, publish the results on your blog. This makes you appear more trustworthy as a brand: your end users know that even if it happens again, you will be honest about what went wrong. A great example of this is Skype’s postmortem blog post for a 2010 outage.

Apologize. Issuing an apology and taking the blame can help you get back your credibility and show your customers that you care. MailChimp did a great job with this. Be careful with your tone, however. If you come across as flippant or accusatory you might end up angering your customers even more, like DreamHost did after a billing issue.

Outages are a fact of life for any business that depends on web applications. But they don’t have to be a disaster—if you plan well and are transparent about what’s going on you can mitigate the effects of failure and maintain credibility in your users’ eyes.