How to handle downtimes

Sep 14 2014 :: by Max

Every app is going to go down at some point. No matter how good your team is, you are going to have a major downtime. It’s not a question of “if” but “when”. Even Google goes down from time to time. And when downtimes do happen, things get ugly quickly. These are going to be the hardest times for you and your customer support department.

Every serious outage causes losses. How much you lose depends directly on how your customer support agents handle it.

Get ahead of the downtime

Groove, one of our competitors, had a major downtime a couple of months back. 11 hours passed between the time when the downtime happened and when they noticed it. I'm not trying to make fun of a competitor – I really feel for them. But 11 hours is freaking crazy, sorry.

You don't want to be in their shoes. To prevent this, you can use tools such as Pingdom and PagerDuty, which monitor server uptime and notify you as soon as your app goes down. When something goes wrong, our phones start blowing up with automated text messages and calls. The chances that the entire team will sleep through these alerts are pretty slim (hey, a side-benefit of having your team spread across different time zones).
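If you are curious what such a monitor does under the hood, here is a minimal sketch in Python. It is not how Pingdom or PagerDuty actually work; the function names and the three-failures threshold are assumptions made up for illustration. The idea is simply to poll a URL and raise an alert only after several consecutive failed checks, so that one network blip doesn't wake anyone up.

```python
import urllib.error
import urllib.request


def is_up(url, timeout=5.0):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP error statuses, refused connections, timeouts
        # and malformed URLs: all of them count as "down".
        return False


def should_alert(results, threshold=3):
    """Return True once the last `threshold` checks have all failed.

    `results` is a list of booleans, oldest first (True means "up").
    The threshold of 3 is an arbitrary, hypothetical default.
    """
    if len(results) < threshold:
        return False
    return not any(results[-threshold:])
```

In practice you would call is_up on a schedule (say, once a minute), append each result to a list, and page someone when should_alert returns True. A hosted service is still the better choice, because it checks from outside your own infrastructure and keeps working when your servers don't.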

During the downtime: communicate continuously

You need to have a central place where you can continuously communicate with your customers throughout the downtime. Like most companies, we use Twitter for this. Some big companies have dedicated status pages (you can use StatusPage.io, if you want one).

When tickets start pouring in, you need to apologize, let customers know you're handling it, and send everyone to that place (hey, another side-benefit: downtimes are great for getting new Twitter followers). Post regular updates afterwards. Here is what customers want to know:

That you are aware of the downtime and you're working on it.

What the downtime means for them. Was their data lost?

What the next steps are for them. Should they do something immediately? Change their passwords, for example.

After the downtime: tell users what the problem was

After everything is settled, write a blog post. Users need to know what happened, how it affects them and what measures you have taken to prevent this from happening in the future. If the outage affected all of your users, you may also want to send an email to all customers.

Here are some general guidelines for the blog post:

Don't point fingers. Even if the downtime was not your fault, try not to shift the blame. Don't say something like "That was totally [a third-party service]'s fault, we have nothing to do with that."

Don't try to be funny. If you have a B2B app, people probably lost money because of the downtime. This is a very bad time to make jokes.

Don't pretend that the downtime was not a big deal. When you say something like "very few customers were affected", it sounds like you think it was not that big of a deal. Guess what, it was a huge deal for those who were affected.

Don't forget to say that you're sorry. Preferably multiple times. Check out our guide on how to apologize.