What is a post-mortem?

When you have a service outage, or any form of unexpected service disruption, you first resolve the issue and then proceed to discuss what happened, why and how.

The process of discussion is referred to as a ‘post-mortem’.

How do we make post-mortems effective?

You shouldn’t just get a bunch of people together in a room to discuss an incident. You need to:

Schedule the meeting at the right time.

Invite the right people.

Have the right attitude.

Talk about the right things.

Know the right follow-up actions to take.

Notify the right people.

Let’s go over each of these items briefly, they might sound obvious but I’ve seen each of them carried out in vastly different ways (hint: not all of them good)…

Schedule the meeting at the right time

You should schedule the post-mortem to occur as soon as possible after the event. DO NOT have this discussion two weeks after the incident because people will forget important details. Have this discussion while everything is still fresh in people’s minds.

The trouble with getting a post-mortem scheduled at the right time is that you’ll need a prepared document (a post-mortem template) that is filled in with as much detail (as can be reasonably recalled from memory †) almost immediately after the incident. This document doesn’t just magically appear, it’s someones responsibility to create it.

† this is harder than you imagine, even with modern day communication tools like Slack, there are so many things going on at once (multiple people chatting and trying to isolate the problem, then finding a quick and safe solution) that it’s easy to forget things or not notice them happening (hence a post-mortem helps bring all this information together).

You should have the post-mortem document filled in (as much as you can) before having the post-mortem as a way to help drive the conversation, and (ideally) whoever was on-call at the time of the incident would be marked as the point-person for prepping this post-mortem document.

One last useful point is to make sure all those who need to be involved in the post-mortem are reminded of the meeting a day before. This allows them to read through the post-mortem document and prep accordingly (and generally makes for a smoother discussion).

Invite the right people

This is a tricky one to get right, and will vary depending on your organisational structure (and even how the incident itself unfolded).

You shouldn’t invite only those people involved during the incident because there are insights that can be gained from Product Managers (just as an example) who might not have been online at the time of the incident, and who might be able to elucidate certain aspects that the engineers/support team were not aware of.

You also don’t necessarily want to invite too many people outside of those involved in the incident. As the old saying goes: “too many cooks spoil the broth”. You could find the ‘noise to signal’ ratio goes up.

Have the right attitude

Post-mortems should be blameless. Do not come into the meeting with an ‘axe to grind’. People make mistakes, we’re dealing with software and often-times complex distributed systems so we should have an attitude of support, understanding, patience and a willingness to want to genuinely improve things.

Talk about the right things

This is where the post-mortem ‘template’ comes in (I’ve linked to it at the top and bottom of this post), as it includes different topics that can help steer the conversation in the right direction.

Things like: observations, symptoms, hypothesis, remediation, impact etc. You should also identify who the meeting moderator is (the person responsible for keeping the meeting on track and not falling into a war of words) and also who the notetaker is (they can’t be the same person).

It’s also probably worth mentioning that you shouldn’t be having a 3 hour meeting (of any kind, let alone a post-mortem), so make sure the short time you have together is yielding the right feedback and information.

Know the right follow-up actions to take

Have real actionable tasks as takeways from the post-mortem. You don’t want to just discuss how things could be improved, you want to actually improve them. It’s very important you take this opportunity to identity things you can do to prevent this issue from occuring again.

If service downtime/disruption is important to your business (and let’s face it, when is it not), then you need to take incidents seriously and put the time and effort into ensuring stability for the future.

Once you have the post-mortem document filled in fully, reviewed and all takeway tasks have been actioned (or at least scheduled to be actioned in the near future), then you can finish up this whole process by sharing what you learnt with your colleagues who were not involved (which leads us onto the next point).

Notify the right people

We generally take the finished post-mortem document and email it around to our development mailing list so that all engineers in the organsation get to see what happened, why, and how it was resolved.

This is super important because there are so many benefits that can come from this sharing of knowledge. Probably top of that list would be that it supports the notion that your organisation respects ‘accountability’ and that it’s honest about mistakes that are made.

It highlights to others (not just tech and engineering) that these mistakes aren’t punished but treated as opportunities for growth and learning. As well as exposing engineers (of varying skill levels) to different architectures, systems and services that are in place and helps them not only understand them a little better but encourages them to investigate more.

Because of all that (and more), who you share the post-mortem document with really does depend on you and your company’s values/standards.

A template for success

Below I link to a Google document we use as a template for our post mortem meetings, and I’ve included the relevant sections below just for an ‘at a glance’ view.

Note: you don’t have to include all of these if you don’t want. Take what is useful to you and leave whatever isn’t.

Observations

What empircal things did people see?

e.g. System X was Y, which indicated Z.

Symptoms

What was the physical outcome (the user experience) of these observations?

e.g. Users were seeing X output.

Hypothesis

What do we believe to be the contributing factors?

Note: there’s typically never a single ‘root cause’

Remediation

What did we do to resolve the incident?

Note: this isn’t the same as ‘fixing’ the problem, which suggests something more long term was put in place.

e.g.

We did X to resolve the problem.

We also did Y to resolve the problem.

Impact

What was the ‘cost’ of this incident?

Note: this isn’t the same thing as the symptoms.

e.g.

Thing A broke for N hours.

Thing B broke and all entered data for User C, during the incident, was lost.

Key Points

There’s lots of interactions during the incident, what were the most important things that happened?

e.g.

Person A identified incident at N time.

Person B notified stakeholders at N time.

Person C ramped up resources at N time.

Participants

Who was involved and what are their roles in the company.

Note: you don’t know everyone and what they do, so make it easier for people reading the post-mortem to understand the breadth of skills involved.

e.g.

Jane Doe (Engineer)

Joe Bloggs (Support Staff)

Timeline

Describe what happened and when. This should be much more detail than those extracted for the ‘key points’ section.

Note: if you work for a distributed company, then identifying the timezone of the incident (or at least the timezone you’re reporting it from) can help clarify when these things happened.

e.g.

Timezone: BST
2018-12-22 (Saturday)

03:00 - Joan said a thing happened.

03:10 – Joe notified the stakeholders.

Additional Details

Not everything said is going to have happened within the incident channel you were looking at during the incident (maybe it happened in an email thread between support staff, product and other stakeholders). This is something that is likely to be fleshed out during the post-mortem itself as more people who were having those conversations give their perspective.

e.g.

Bilbo Baggins (Product Manager): said something useful that not everyone would have seen

Communication

Where was the ‘war room’ (i.e. the incident channel where everyone was gathered to help problem solve)?

e.g.

#some-slack-channel

Link to Logs

Link to Dashboards

Images

If you have any screen shots of the broken system, then that’s useful for people not involved to understand the impact more visually. But also, linking to a specific point in time of a graph that shows a spike (and hopefully dip later) in errors can also be useful to reference back to.

Tasks

Put together a list of tasks for people to complete. Doesn’t have to be done at the time of sending out the post-mortem document to the wider organisation, but ideally you’d have those things done before you sent it out.

Jane: responsible for doing this task.

Another task that has no specific person assigned to it.

Confirm to Bilbo that we’ve done everything we can.

Questions

During the post-mortem questions will be raised and they should be placed here. You’re not necessarily going to have all the answers during that meeting.

Ideally you’d have answers tied to this questions before you sent out the post-mortem document to the wider organisation though.

What we did well

A list of things the team did well during the incident.

e.g.

The incident was identified quickly

The solution was quickly rolled out

No one panicked

Cross team communication was exceptional

What we didn’t do so well

A list of things the team didn’t do so well during the incident.

e.g.

It took a long time to understand what the alarm meant and who was affected

The on-call person didn’t acknowledge the initial alarm and so it kept firing

We didn’t setup an incident slack channel so conversations were happening everywhere

What can we learn from in future?

A list of the things we should pro-actively improve upon. Generally this will be resolutions to the things that didn’t go well.