A guide to handling incidents, downtime and outages

Outages and downtime are inevitable. Designing your systems to handle failure is a key part of modern infrastructure architecture, and it makes it possible to survive most problems. However, there will still be incidents you didn’t anticipate, software bugs you didn’t catch and other events that result in downtime for your service.

Microsoft, Amazon and Google spend billions of dollars every quarter and even they still have outages. How much do you spend?

Some companies seem to have problems constantly and suffer unnecessarily as a result. Regular outages ultimately become unacceptable, but if you adopt a few key principles and design your systems properly, customers can forgive the few service incidents you do have.

Step 1: Planning

If critical alerts result in panic and chaos then you deserve to suffer from the incident! There are a number of things you can do in advance to ensure that when something does go wrong, everyone on your team knows what they should be doing.

Use proper config management, be it Puppet, Chef, Ansible, SaltStack or some other system, so that you can make mass changes to your infrastructure in a controlled manner. It also helps your team understand novel issues because the code that defines the setup is easily accessible.

Unexpected failures

Be aware of your whole system. Unexpected failures can come from unusual places. Are you hosted on AWS? What happens if AWS suffers an outage and the tools you rely on for internal communication, such as Slack or HipChat, are affected at the same time? Are you hosted on Google Cloud? What happens if Gmail is unavailable during a Google Cloud outage? Are you using a data center in the city you live in? What happens if a weather event knocks out the phone service?
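One way to make these questions concrete is to write down what each tool depends on and look for overlap with your production hosting. The following is a minimal sketch, not a real inventory: the service names, provider names and the `shared_failure_points` helper are all hypothetical and would need to be replaced with what you actually know about your own stack.

```python
# Hypothetical inventory: each system or tool mapped to the providers it relies on.
# These entries are illustrative assumptions, not statements about real vendors.
DEPENDENCIES = {
    "production": {"aws"},
    "team_chat": {"aws"},
    "email": {"google"},
    "status_page": {"other_host"},
}


def shared_failure_points(critical, fallbacks, deps=DEPENDENCIES):
    """Return the fallback tools that share at least one provider with `critical`.

    The result maps each affected tool to the sorted list of shared providers.
    """
    critical_providers = deps[critical]
    overlaps = {}
    for name in fallbacks:
        shared = deps[name] & critical_providers
        if shared:
            overlaps[name] = sorted(shared)
    return overlaps


# Which of the tools we would reach for during an outage might go down with us?
at_risk = shared_failure_points("production", ["team_chat", "email", "status_page"])
```

Anything that shows up in `at_risk` is a tool you probably should not rely on as your only communication channel when production is down.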

Step 2: Be ready to handle the alerts

Some people hate being on call, others love it! Either way, you need a system to handle on-call rotations, escalate issues to other members of the team, plan for reachability and allow people to go off call after incidents. We use PagerDuty on a weekly rotation through the team and consider things like who is available, internet connectivity, illness and holidays. We also loop in product engineering so that issues waking people up can be resolved quickly.
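To illustrate the idea of a weekly rotation with escalation, here is a minimal sketch. It is not how PagerDuty works internally; the `on_call_for` function, the team names and the rotation start date are assumptions made up for this example. It picks a primary and a secondary responder for a given day and skips anyone marked as unavailable (ill, on holiday, without connectivity).

```python
from datetime import date


def on_call_for(day, team, rotation_start, unavailable=frozenset()):
    """Return (primary, secondary) for `day` on a weekly rotation.

    `team` is an ordered list of names, `rotation_start` is the date the first
    person's week began, and `unavailable` is a set of people to skip.
    The secondary is the escalation contact, or None if nobody else is free.
    """
    weeks_elapsed = (day - rotation_start).days // 7
    first = weeks_elapsed % len(team)
    # Rotate the list so this week's scheduled person comes first.
    ordered = team[first:] + team[:first]
    available = [person for person in ordered if person not in unavailable]
    if not available:
        raise ValueError("nobody is available to be on call")
    primary = available[0]
    secondary = available[1] if len(available) > 1 else None
    return primary, secondary


# Hypothetical team and start date (2024-01-01 was a Monday).
team = ["ana", "ben", "cy"]
start = date(2024, 1, 1)
primary, secondary = on_call_for(date(2024, 1, 8), team, start, unavailable={"ben"})
```

In the second week "ben" would normally be primary, but because he is marked unavailable the rotation falls through to the next person and escalation goes to the one after that.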

Step 3: Deal with it, using checklists

Have a defined process in place, ready to run through whenever the alerts go off. Using a checklist removes unnecessary thinking so you can focus on the real problem, and ensures key actions are taken and not forgotten. Have a channel for communication both internally and externally. There’s nothing worse than being the customer of a service that is down and having no idea whether anyone is working on it.
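A checklist can be as simple as a shared document, but a small sketch shows the shape of the idea: a fixed list of steps, a record of when each was completed and a way to see what is still outstanding. The step wording and the `IncidentChecklist` class below are hypothetical examples, not a prescribed process.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical steps; replace these with your own incident process.
EXAMPLE_STEPS = [
    "Acknowledge the alert",
    "Open the internal incident channel",
    "Post an initial public status update",
    "Identify and mitigate the cause",
    "Post a resolution update",
]


@dataclass
class IncidentChecklist:
    steps: list
    completed: dict = field(default_factory=dict)

    def complete(self, step, when=None):
        """Mark a step as done and record when, which helps with the later write-up."""
        if step not in self.steps:
            raise ValueError(f"unknown step: {step}")
        self.completed[step] = when or datetime.now(timezone.utc)

    def remaining(self):
        """Return the steps not yet completed, in their original order."""
        return [step for step in self.steps if step not in self.completed]


checklist = IncidentChecklist(list(EXAMPLE_STEPS))
checklist.complete("Acknowledge the alert")
```

Recording a timestamp for each step means the timeline is already half written by the time the incident is over.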

Step 4: Write up a detailed postmortem

This is the opportunity to win back trust. If you followed the steps above and provided accurate, useful information during the outage, people already know what was going on. The postmortem is your chance to write it up: explain what happened, what went wrong and, crucially, what you are going to do to prevent it from happening again. Outages highlight unknown system flaws, and it’s important to tell your users that the hole no longer exists or is in the process of being closed.