After It Broke: Executing Good Postmortems

No matter how much automation, redundancy, and protection you build into your systems, things are always going to break. It might be a change breaking an API to another system. It might be a change in a metric. Perhaps you just experienced massive hardware failure. Many IT organizations have traditionally had a postmortem, or root cause analysis, process to try to improve the overall quality of their processes. The major problem with most postmortem processes is that they devolve into circular finger-pointing matches. The database team blames the storage team, who in turn blames the network team, and everyone walks out of the meeting angry.

As I’m writing this article, I’m working on a system where someone restarted a database server in the middle of a large operation, causing database corruption. This is a classic example of an event that might trigger a postmortem. In this scenario, we moved to new hardware and no one tested the restore times of the largest databases. This is currently problematic, as the database restore is still running a few hours after I started this article. Other candidates for a postmortem include unexpected data loss, on-call pages, or a monitoring failure that didn’t capture a major system fault.

How can we do a better postmortem? The first thing to do is execute blameless postmortems. This process assumes that everyone involved in an incident had good intentions and acted reasonably based on the information available at the time. This technique originates in medicine and aviation, where human lives are at stake. Instead of assigning blame to any one person or team, the situation is analyzed with an eye toward figuring out what happened. Writing a blameless postmortem can be hard, but the outcome is more openness in your organization. You don’t want engineers trying to hide outages to avoid an ugly, blame-filled process.

Some common talking points for your postmortems include:

Was enough data collected to determine the root cause of the incident?

Would more monitoring data help with the process analysis?

Is the impact of the incident clearly defined?

Was the outcome shared with stakeholders?
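One way to make sure every review covers these talking points is to capture them in a lightweight, structured template. Here's a minimal sketch in Python; the field names and example values are my own illustration, not any standard postmortem schema:

```python
from dataclasses import dataclass, field

# Hypothetical minimal postmortem record. Each field maps to one of the
# talking points above; adapt the names to your own process.
@dataclass
class Postmortem:
    title: str
    impact: str                # is the impact of the incident clearly defined?
    root_cause: str            # was enough data collected to determine it?
    monitoring_gaps: list = field(default_factory=list)   # would more data have helped?
    stakeholders_notified: bool = False                   # was the outcome shared?
    action_items: list = field(default_factory=list)

# Example drawn from the database-restart scenario described earlier.
pm = Postmortem(
    title="DB corruption after mid-operation restart",
    impact="Largest database offline for hours during restore",
    root_cause="Server restarted during a large operation; restore times untested",
    monitoring_gaps=["no alert on long-running restores"],
    action_items=["test restore times of the largest databases after hardware moves"],
)
```

A structured record like this also makes it obvious when a review skipped a step, such as never notifying stakeholders.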

In the past, many organizations did not share a postmortem outside of the core engineering team. This is a process that has changed in recent years. Many organizations like Microsoft and Amazon, because of the nature of their hosting businesses, have made postmortems public. By sharing with the widest possible audience, especially in your IT organization, you can garner more comments and deeper insights into a given problem.

One scenario referenced in Site Reliability Engineering by Google is the notion of integrating postmortems into disaster recovery activities. By incorporating these real-world failures, you make your disaster recovery testing as real as possible.

If your organization isn’t currently conducting postmortems, or only conducts them for major outages, think about introducing them more frequently for smaller problems. As mentioned above, paged incidents are a good starting point. It gets you thinking about how to automate responses to common problems, and it helps ensure the process can be followed correctly, so that when a major issue occurs you’re not focused on how to conduct the postmortem, but on finding the real root cause of the problem.

Being professional and non-accusatory, staying positive, and keeping the attitude of "what did we learn from this outage that we can use to reduce downtime or eliminate future outages?" is one of the most important things you can do.

Obviously, sharing what was learned AND documenting it for future similar issues will bring the most value to the process.

Some outages are out of our control, and we seek to remove their causes.

Some causes of outages are human errors and mistakes in judgment, and we strive to reduce them and improve our actions so we always (believe we) know what will happen when we hit that ENTER key.

I work in a 7x24 critical care hospital organization, and we've seen outages decrease steadily over the fifteen years I've been here. A great portion of that improvement comes from having friendly, efficient PIRs (Post Incident Reviews) and RCAs (Root Cause Analyses) that identify the details of what went wrong, without pointing fingers and without threatening anyone with retribution, censure, loss of job, etc.

I used to think of admitting mistakes as "falling on my own sword," and guiltily feeling I was "taking one for the team." I don't feel that way any longer. Instead of holding on to that adolescent, immature attitude, my team has matured into adults who readily admit their mistakes and culpability. Our manager does not act emotionally or irrationally, and moves forward with the understanding that everyone learns from the process.

Once everyone understands they're not going to lose their job for isolating a data center or a hospital, we can move forward with a work environment that's professional AND friendly. We all do our jobs better when we don't have to deal with that additional stressor.

I don't think of my environment as "blameless"; I think that term isn't realistic. If I made the mistake, I own the problem; the blame is on my shoulders. But there's no retribution for my incorrect action if I caused an outage. Instead, the change comes in the training, in the notifications to the customers and support teams, and in our understanding of the additional steps we may need to take before making changes. These can include, but are not limited to:

Thoroughly understanding what's going to happen as the result of any command entered--BEFORE the ENTER key is pushed

Purchasing (AND USING) a test lab / sandbox environment to test changes before they go into production

Identifying every customer and support team that a change may impact, and notifying them and scheduling the change to meet their needs

Using Change Management EVERY TIME. This results in the rest of the organization trusting my Network Analysts aren't trying to slide one under the rug, aren't doing "Cowboy Networking" (like Captain James T. Kirk used to use "Cowboy Diplomacy" and violate the Prime Directive), and can be trusted to know what we're doing.

Following up with the customers immediately after the change, to ensure things are working as expected

Keeping the Help Desk notified at major steps: immediately prior to the change being made, and after the change is complete

Someone from the right team (mine, or other I.T. Support teams) being available 7x24 during the changes

We appreciate that our network is the lifeline in a critical health care environment. And because we follow the rules, dot the "i's" and cross the "t's", we are given the budget we request every year. We're trusted to build that Info Highway with the right growth capability and alternate routes, so that we can make major changes, even shut down major routes on that highway during business hours, and our customers know their traffic will still flow as needed.

Another step is to add monitoring, if it was missing, so that you get early warning that trouble may be on the way next time. How long do those database backups take?

At my previous gig, we simply added a notification step to the scheduler system that sent a syslog message at the beginning and end of each job. Then we made some SAM templates that watched the Orion syslog tables for those messages to show up.
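The start/end notification step described above can be sketched with Python's standard library. This is a minimal illustration, not the original scheduler integration: the job name, message format, and syslog collector address (localhost:514 over UDP) are all assumptions you would adjust for your environment:

```python
import logging
import logging.handlers
import time
from contextlib import contextmanager

# Send job audit messages to a syslog collector so a monitor (for
# example, a SAM template watching the Orion syslog tables) can alert
# when a job runs long or never logs its END message.
logger = logging.getLogger("job-audit")
logger.setLevel(logging.INFO)
# Assumed collector address; point this at your real syslog server.
logger.addHandler(logging.handlers.SysLogHandler(address=("localhost", 514)))

@contextmanager
def audited_job(name):
    """Emit a syslog message at the start and end of a scheduled job."""
    start = time.monotonic()
    logger.info("JOB START %s", name)
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        logger.info("JOB END %s elapsed=%.1fs", name, elapsed)

# Usage: wrap the body of the scheduled task.
with audited_job("nightly-db-backup"):
    pass  # run the real backup here
```

Because the END message includes the elapsed time, the same data answers the "how long do those backups take?" question without any extra instrumentation.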

For me, the sessions I have participated in that worked best are the ones where it is said up front that this is not a blame session and that all egos are left outside the room. Those sessions are usually the most honest and generally provide a foundation for lasting teamwork moving forward. As a pre-school/kindergarten teacher friend of mine is fond of saying, "we call them mistakes... not purposes." We learn from our mistakes; it is human nature. The true test is "what did we miss, and how do we not miss that or something worse next time?" Lastly, publish the results across the organization; it builds credibility, understanding, and goodwill.

Unfortunately, management here just wants to point the finger and blame someone, as long as it isn't them. They fail to see that they are actually the problem. Not enough staff, too much work, and unreasonable deadlines all lead to fatigue and mistakes caused by too many interruptions to critical tasks. Root cause documents just get filed in the round filing cabinet, and no lessons are learned. Then they come up with the great idea of bringing in some contractors to sort things out. Guess who they expect to train the contractors. Yep. No work gets done while we train the contractors; then they leave, because it is an insane place to work, and we have to make up for the lost time as well as doing the normal workload. The downward spiral continues.

The postmortem is so valuable, if the information is then used. I've seen times where the team would look at what we could do better and then build a plan to execute it. But I've also seen times where the question of what we need is reflected upon and then forgotten.

After CIS/MIS, contracted for DISA in the mid-'90s
Worked with Toyota for 3 years
Worked for GE for 4 years (Here was where I first found SW products)
Did a bunch of various network engineering projects
Contracted for GE for 4 more years
Currently working with General Dynamics Mission Systems
William Eckler
