Atlassian Incident Handbook

Incident postmortems

We practice blameless postmortems at Atlassian to ensure we understand and remediate the root cause of every incident with a severity of level 2 or higher. Here's a summarized version of our internal documentation describing how we run postmortems at Atlassian.

Overview

At Atlassian, we track all postmortems with Jira issues to ensure they are completed and approved. You may decide to use a simpler system, like a Confluence page for each postmortem, if your needs are less complex.

The goals of a postmortem are to understand all contributing root causes, document the incident for future reference and pattern discovery, and enact effective preventative actions to reduce the likelihood or impact of recurrence.

For postmortems to be effective at reducing repeat incidents, the review process has to incentivize teams to identify root causes and fix them. The exact method depends on your team culture; at Atlassian, we've found a combination of methods that work for our incident response teams:

Face-to-face meetings help drive appropriate analysis and align the team on what needs fixing.

Postmortem approvals by delivery and operations team managers incentivize teams to do them thoroughly.

Designated "priority actions" have an agreed Service Level Objective (SLO) which is either 4 or 8 weeks, depending on the service, with reminders and reports to ensure they are completed.

Attending to this process and making sure it is effective requires commitment at all levels in the organization. Our engineering directors and managers decided on the approvers and SLOs for action resolution in their areas. This system just encodes and tries to enforce their decisions.

The delivery team for the service that failed (the team that owns the "Faulty Service" on the incident issue) is responsible for completing the postmortem. That team selects the postmortem owner and assigns them the postmortem issue.

The postmortem owner drives the postmortem through drafting and approval, all the way until it's published. They are accountable for completion of the postmortem.

One or more postmortem approvers review and approve the postmortem, and are expected to prioritize follow-up actions in their backlog.

We have a Confluence page which lists the postmortem approvers (mandatory and optional) by service group, which generally corresponds to an Atlassian product (e.g. Bitbucket Cloud).

We built some custom reporting using the Jira REST APIs to track how many incidents of each severity have not had their root causes fixed via the priority actions on the postmortem. The engineering managers for each department review this list regularly.
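As an illustration (not Atlassian's actual implementation), a report like this can be derived from the issues array returned by Jira's /rest/api/2/search endpoint. The "severity" and "priority_actions" field names below are hypothetical stand-ins for whatever custom fields and issue links your Jira instance uses:

```python
from collections import Counter

def unresolved_by_severity(issues):
    """Count postmortems whose linked priority actions are not all done.

    `issues` is the parsed "issues" array from Jira's /rest/api/2/search
    response; the field names here are illustrative, not Atlassian's schema.
    """
    counts = Counter()
    for issue in issues:
        severity = issue["fields"]["severity"]                 # hypothetical custom field
        actions = issue["fields"].get("priority_actions", [])  # hypothetical linked actions
        if any(a["status"] != "Done" for a in actions):
            counts[severity] += 1
    return dict(counts)

# Sample payload shaped like a (simplified) Jira search result:
sample = [
    {"fields": {"severity": 1, "priority_actions": [{"status": "Done"}]}},
    {"fields": {"severity": 2, "priority_actions": [{"status": "In Progress"}]}},
    {"fields": {"severity": 2, "priority_actions": [{"status": "To Do"}, {"status": "Done"}]}},
]
print(unresolved_by_severity(sample))  # -> {2: 2}
```

A real report would feed this from a JQL search scoped to postmortem issues and group the result per department for the managers' review.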

Follow up with the responsible dev managers to get the commitment to specific actions that will prevent this class of incident.

Raise a Jira issue for each action in the backlogs of the team(s) that own them. Link them from the postmortem issue as "Priority Action" (for root cause fixes) or "Improvement Action" (for other improvements).

Look up the appropriate approvers in Confluence and add them to the "Approvers" field on the postmortem.

Select the "Request Approval" transition to request approval from the nominated approvers. Automation will comment on the issue with instructions for approvers.

Follow up as needed until the postmortem is approved.
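The create-and-link steps above map onto standard Jira REST endpoints: POST /rest/api/2/issue to raise an action, POST /rest/api/2/issueLink to link it to the postmortem, and POST /rest/api/2/issue/{key}/transitions to trigger the approval transition. A minimal sketch of the request payloads, assuming a "Priority Action" link type and a workflow-specific transition id are configured in your Jira:

```python
# Illustrative payload builders for the Jira REST calls described above.
# Project keys, the "Priority Action" link type name, and the transition id
# are assumptions about your Jira configuration, not fixed values.

def action_payload(project_key: str, summary: str) -> dict:
    """Body for POST /rest/api/2/issue: raise a follow-up action
    in the owning team's backlog."""
    return {
        "fields": {
            "project": {"key": project_key},
            "summary": summary,
            "issuetype": {"name": "Task"},
        }
    }

def link_payload(action_key: str, postmortem_key: str,
                 link_type: str = "Priority Action") -> dict:
    """Body for POST /rest/api/2/issueLink: link the action
    back to the postmortem issue."""
    return {
        "type": {"name": link_type},
        "inwardIssue": {"key": action_key},
        "outwardIssue": {"key": postmortem_key},
    }

def approval_payload(transition_id: str) -> dict:
    """Body for POST /rest/api/2/issue/{key}/transitions: trigger the
    "Request Approval" workflow transition (the id is workflow-specific)."""
    return {"transition": {"id": transition_id}}

print(link_payload("TEAM-123", "PM-45")["type"]["name"])  # -> Priority Action
```

Sending these bodies with an authenticated HTTP client against your Jira site performs the steps; the builders are kept pure here so the shapes are easy to inspect.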

When the postmortem is approved, automation creates a draft postmortem blog in Confluence for you to edit and publish. Blogging postmortems shares your hard-earned lessons, multiplying their value.

Once the postmortem process is done, the actions are prioritized by the development team as part of their normal backlog according to the team's SLO.

We find that gathering the team to discuss learnings together results in deeper analysis of root causes. This usually happens over video conference because our teams are distributed, and sometimes in larger group sessions when an incident involves many people.

Our suggested agenda:

Remind the team that postmortems are blameless, and why

Confirm the timeline of events

Confirm the root causes

Generate actions using "open thinking" - "What could we do to prevent this class of incident in the future?"

Ask the team "What went well / What could have gone better / Where did we get lucky"

Suggested calendar booking template:

Please join me for a blameless postmortem of <link to incident>, where we <summary of incident>.

The goals of a postmortem are to understand all contributing root causes, document the incident for future reference and pattern discovery, and enact effective preventative actions to reduce the likelihood or impact of recurrence.

In this meeting we'll seek to determine the root causes and decide on actions to mitigate them.

If you don't have the responsible dev managers in the room, avoid committing to specific actions in the meeting, because it is a poor context for prioritization decisions: people will feel pressured to commit without complete information. Instead, follow up with the responsible managers after the meeting to get commitment to fix the priority actions identified.

Our postmortem issue has an extensive series of fields to encourage collecting all the important details about the incident before holding the postmortem meeting. Below are some examples of how we fill out these fields.

Field

Instructions

Example

Incident summary

Summarize the incident in a few sentences. Include what the severity was, why, and how long impact lasted.

Between <time range of incident, e.g. 14:30 and 15:00> on <date>, <number> customers experienced <event symptoms>. The event was triggered by a deployment at <time of deployment or change that caused the incident>. The deployment contained a code change for <description of or reason for the change>. The bug in this deployment caused <description of the problem>.

The event was detected by <system>. We mitigated the event by <resolution actions taken>.

This <severity level> incident affected X% of customers.

<Number of support tickets and/or social media posts> were raised in relation to this incident.

Leadup

Describe the circumstances that led to this incident, for example, prior changes that introduced latent bugs.

At <time> on <date>, (<amount of time before customer impact>), a change was introduced to <product or service> to ... <description of the changes that led to the incident>. The change caused ... <description of the impact of the changes>.

Fault

Describe what didn't work as expected. Attach screenshots of relevant graphs or data showing the fault.

<Number> responses were incorrectly sent to X% of requests over the course of <time period>.

Impact

Describe what internal and external customers saw during the incident. Include how many support cases were raised.

For <length of time> between <time range> on <date>, <incident summary> was experienced.

This affected <number> customers (X% of all <system or service> customers), who encountered <description of symptoms experienced by customers>.

<Number of support tickets and social media posts> were raised.

Detection

How and when did Atlassian detect the incident?

How could time to detection be improved? As a thought exercise, how would you have cut the time in half?

The incident was detected when the <type of alert> was triggered and <team or person paged> were paged. They then had to page <secondary response person or team> because they didn't own the service writing to the disk, delaying the response by <length of time>.

<Description of the improvement> will be set up by <team owning the improvement> so that <impact of improvement>.

Response

Who responded, when and how? Were there any delays or barriers to our response?

After being paged at 14:34 UTC, the on-call KITT engineer came online at 14:38 in the incident chat room. However, they did not have sufficient background on the Escalator autoscaler, so a further alert was sent at 14:50, bringing a senior KITT engineer into the room at 14:58.

Recovery

Describe how and when service was restored. How did you reach the point where you knew how to mitigate the impact?

Additional questions to ask, depending on the scenario: How could time to mitigation be improved? As a thought exercise, how would you have cut the time in half?

Recovery was a three-pronged response:

Increasing the size of the BuildEng EC2 ASG to increase the number of nodes available to service the workload and reduce the likelihood of scheduling on oversubscribed nodes

Disabling the Escalator autoscaler to prevent the cluster from aggressively scaling down

Recurrence

Has this incident (with the same root cause) occurred before? If so, why did it happen again?

This same root cause resulted in incidents HOT-13432, HOT-14932 and HOT-19452.

Not specifically: improvements to flow typing were known ongoing tasks with rituals in place (e.g. adding flow types when changing or creating a file), and tickets for fixing up integration tests had been raised but were not successful when attempted.

Lessons learned

What have we learned?

Discuss what went well, what could have gone better, and where we got lucky, to find improvement opportunities.

We need a unit test to verify that the rate-limiter for work has been properly maintained

Bulk operation workloads which are atypical of normal operation should be reviewed

Bulk operations should start slowly and be monitored, increasing only when service metrics appear nominal

Corrective actions

What are we going to do to make sure this class of incident doesn't happen again? Who will take the actions and by when?

When you're writing or reading a postmortem, it's necessary to distinguish between the proximate and root causes.

Proximate causes are reasons that directly led to this incident.

Root causes are reasons at the optimal place in the chain of events where making a change will prevent this entire class of incident.

A postmortem seeks to discover root causes and decide how to best mitigate them. Finding that optimal place in the chain of events is the real art of a postmortem. Use a technique like Five Whys to go "up the chain" and find root causes.

Here are a few select examples of proximate and root causes:

Scenario: Stride "Red Dawn" squad's services did not have Datadog monitors and on-call alerts for their services, or they were not properly configured.
Proximate cause: Team members did not configure monitoring and alerting for new services.
Proximate action: Configure it for these services.
Root cause: There is no process for standing up new services which includes monitoring and alerting.
Root cause mitigation: Create a process for standing up new services and teach the team to follow it.

Scenario: Stride unusable on IE11 due to an upgrade to Fabric Editor that doesn't work on this browser version.
Proximate cause: An upgrade of a dependency.
Proximate action: Revert the upgrade.
Root cause: Lack of cross-browser compatibility testing.
Root cause mitigation: Automate continuous cross-browser compatibility testing.

Scenario: Logs from Micros EU were not reaching the logging service.
Proximate cause: The role provided to Micros to send logs with was incorrect.
Proximate action: Correct the role.
Root cause: We can't tell when logging from an environment isn't working.
Root cause mitigation: Add monitoring and alerting on missing logs for any environment.

Scenario: Triggered by an earlier AWS incident, Confluence Vertigo nodes exhausted their connection pool to Media, leading to intermittent attachment and media errors for customers.
Proximate cause: AWS fault.
Proximate action: Get the AWS postmortem.
Root cause: A bug in Confluence connection pool handling leaked connections under failure conditions, combined with a lack of visibility into connection state.
Root cause mitigation: Fix the bug and add monitoring that will detect similar situations before they have an impact.

When your service has an incident because a dependency fails, where the fault lies and what the root cause is depend on whether the dependency is internal to Atlassian or a 3rd party, and on what the reasonable expectation of the dependency's performance is.

If it's an internal dependency, ask "what is the dependency's Service Level Objective (SLO)?":

Did the dependency breach their SLO?

The fault lies with the dependency and they need to increase their reliability.

Did the dependency stay within their SLO, but your service failed anyway?

Your service needs to increase its resilience.

Does the dependency not have an SLO?

They need one!
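To make SLO conversations concrete, it helps to translate an availability percentage into the downtime it actually permits. A quick sketch (the targets shown are arbitrary examples, not Atlassian SLOs):

```python
# Convert an availability SLO into allowed downtime per period.

def allowed_downtime_minutes(slo_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime a dependency can incur per period while
    still staying within its availability SLO."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)

for slo in (99.0, 99.9, 99.95):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} min/month")
# 99.0%  -> 432.0 min/month
# 99.9%  -> 43.2 min/month
# 99.95% -> 21.6 min/month
```

Framing the discussion in minutes of downtime makes it much easier to judge whether a dependency breached its SLO, or whether your service failed while the dependency stayed within it.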

If it's a 3rd party dependency, ask "what is our reasonable expectation* of the 3rd party dependency's availability/latency/etc?"

Did the 3rd party dependency exceed our expectation (in a bad way)?

Our expectation was incorrect.

Are we confident it won't happen again? E.g. We review and agree with their RCA. In this case, the action is their RCA.

Or, do we need to adjust our expectations? In this case, the actions are to increase our resilience and adjust our expectations.

Are our adjusted expectations unacceptable? In this case, we need to resolve the disconnect between requirements and solution somehow, e.g. by finding another supplier.

Did the 3rd party dependency stay within our expectation, but your service failed anyway?

In this case, your service needs to increase its resilience.

Do we not really have an expectation?

The owner of the 3rd party dependency needs to establish this, and share it with teams so they know what level of resilience they need to build into their dependent services.

*Why "expectation"? Don't we have SLAs with 3rd parties? In reality, contractual SLAs with 3rd parties are too low to be useful in determining fault and mitigation. For example, AWS publishes almost no SLA for EC2. Therefore, when we're depending on a 3rd party service, we have to make a decision about what level of reliability, availability, performance, or another key metric we reasonably expect them to deliver.

Wording postmortem actions:

The right wording for a postmortem action can make the difference between an easy completion and indefinite delay due to infeasibility or procrastination. A well-crafted postmortem action should have these properties:

Actionable: Phrase each action as a sentence starting with a verb. The action should result in a useful outcome, not a process. For example, “Enumerate the list of critical dependencies” is a good action, while “Investigate dependencies” is not.

Specific: Define each action's scope as narrowly as possible, making clear what is and what is not included in the work.

Bounded: Word each action to indicate how to tell when it is finished, as opposed to leaving the action open-ended or ongoing.

From: Investigate monitoring for this scenario.

To (actionable): Add alerting for all cases where this service returns >1% errors.

Atlassian uses a Jira workflow with an approval step to ensure postmortems are approved. Approvers are generally service owners or other managers with responsibility for the operation of a service. Approval for a postmortem indicates:

Agreement with the findings of the post-incident review, including what the root cause was; and

Agreement that the linked "Priority Action" actions are an acceptable way to address the root cause.

Our approvers will often request additional actions or identify a certain chain of causation that is not being addressed by the proposed actions. In this way, we see approvals adding a lot of value to our postmortem process at Atlassian.

In teams with fewer incidents or less complex infrastructure, postmortem approvals may not be necessary.

When things go wrong, looking for someone to blame is a natural human tendency. It's in Atlassian's best interests to avoid this, though, so when you're running a postmortem you need to consciously overcome it. We assume good intentions on the part of our staff and never blame people for faults. The postmortem needs to honestly and objectively examine the circumstances that led to the fault so we can find the true root cause(s) and mitigate them. Blaming people jeopardizes this because:

When people perceive a risk to their standing in the eyes of their peers or to their career prospects, that risk usually outranks "my employer's corporate best interests" in their personal hierarchy, so they will naturally dissemble or hide the truth to protect themselves.

Even if a person took an action that directly led to an incident, what we should ask is not "why did individual x do this", but "why did the system allow them to do this, or lead them to believe this was the right thing to do".

Blaming individuals is unkind and, if repeated often enough, will create a culture of fear and distrust.

In our postmortems, we use these techniques to create personal safety for all participants:

Open the postmortem meeting by stating that this is a blameless postmortem and why

Refer to individuals by role (e.g. "the on-call Widgets engineer") instead of name, while remaining clear and unambiguous about the facts

Ensure that the postmortem timeline, causal chain, and mitigations are framed in the context of systems, process, and roles, not individuals.