Four Questions to ask for an effective Technical Post Mortem

A technical post mortem (TPM) is a retrospective analysis of the observable events that preceded and influenced a technical failure. Its purpose is to identify those influences and find out what went wrong and why, so that we learn from the experience and proactively make changes going forward. A technical post mortem identifies trouble areas, determines what can be done to prevent future failures, and informs process improvements that mitigate future risks and establish best practices for your business.

This outline is not meant to be comprehensive; it is a starting point for your technical post mortems. The questions are meant to generate discussion about what went well, what the team struggled with during the failure, and what the team would do differently moving forward. The four questions we must ask are:

1. What happened? You can’t analyze what you don’t understand, so establishing a clear understanding of what went wrong is crucial.

2. Why did it happen? Identify the major events that led to the failure and try to isolate the root cause of each. Determine whether each event is the underlying cause of the failure or whether it initiated a process that led to the failure. Low-hanging fruit includes defects in design or process and poor maintenance practices. In addition to the strictly technical causes, examine the underlying organizational, management, and team environment. Be aware that some team members may ignore warning signs of impending failure because of the organizational culture; see the wiki for the NASA Challenger disaster for an example.

3. How did we respond and recover? How we responded to the failure can determine how quickly we identified the root cause and put fixes in place. A major technical failure can have a direct impact on shareholder value, revenues, market share and brand equity, so a quick recovery is paramount. A useful technical post mortem requires a reasonable level of honesty, insight, and cooperation from the organization. The technical post mortem should recommend how to continue the things that worked and how to fix the things that didn’t. Remember, the idea is to learn from your successes and failures, not just to document them.

4. How can we prevent similar unexpected issues from occurring again?

Unexpected technical issues do arise in mission-critical or complex hardware systems. The key to prevention is technical planning that keeps narrow problems from propagating through the entire system. Each of the failures uncovered in question 2 represents a risk going forward, so schedule regular inspections or system checks in your CMMS. When a risk is detected in the future, certain actions should immediately and automatically go into effect to prevent a similar failure. Planning must also consider the business process and the management responses the team initiates when a failure occurs. A complete post mortem addresses both technical and management issues.
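As a rough illustration of that planning, the sketch below tracks one root cause as a risk with a recurring inspection and a list of responses that run as soon as the risk is detected. Everything in it is an assumption made for the example: the `Risk` class, the 30-day interval, and the `isolate_pump` and `notify_supervisor` functions are hypothetical and do not correspond to any particular CMMS.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Callable, List


@dataclass
class Risk:
    """A root cause from question 2, tracked as a risk going forward."""
    name: str
    inspection_interval_days: int
    last_inspected: date
    # Responses that run automatically as soon as the risk is detected.
    responses: List[Callable[[], str]] = field(default_factory=list)

    def next_inspection(self) -> date:
        return self.last_inspected + timedelta(days=self.inspection_interval_days)

    def is_overdue(self, today: date) -> bool:
        return today > self.next_inspection()

    def trigger(self) -> List[str]:
        """Run every registered response and report what each one did."""
        return [respond() for respond in self.responses]


# Hypothetical responses; a real system would call into its own tooling.
def isolate_pump() -> str:
    return "pump isolated from the rest of the line"


def notify_supervisor() -> str:
    return "maintenance supervisor notified"


risk = Risk(
    name="bearing overheating",
    inspection_interval_days=30,
    last_inspected=date(2024, 1, 1),
    responses=[isolate_pump, notify_supervisor],
)

print(risk.next_inspection())             # 2024-01-31
print(risk.is_overdue(date(2024, 2, 5)))  # True
print(risk.trigger())
```

Keeping the responses next to the risk they belong to means the technical reaction and the management notification are planned together, which is the point of addressing both kinds of issue in one post mortem.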

Sadly, technical post mortems have a habit of turning into “the blame game”. A bad post mortem can create dissension and institutionalize mistakes. If you want honest post mortems, management has to develop a reputation for listening openly to input and not punishing people for being honest. Well-run post mortems help a maintenance team create a culture of continuous improvement.

Summary

Ensure your technical post mortem will be successful by carefully preparing in advance, analyzing the failure systematically, producing actionable findings, and actively sharing the results. A solid post mortem can help your organization become more effective going forward, helping you learn from mistakes and focus on what worked best. Don’t let memories fade by scheduling the post mortem too long after the failure; it should occur within 1-2 weeks of the technical failure. A post mortem stored in a filing cabinet somewhere does no one any good. Store your post mortems in the asset record in your CMMS so they can be easily found in the future.
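One way to picture these points is a small record that holds the answers to the four questions, knows its own review deadline, and is filed under the asset it concerns. This is only a sketch under stated assumptions: the `PostMortem` class, the `file_post_mortem` helper and the `PUMP-07` asset ID are hypothetical, the in-memory `records` dictionary merely stands in for the asset records of a CMMS, and the 14-day window is the upper end of the 1-2 week guideline.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List

REVIEW_WINDOW = timedelta(days=14)  # assumed: upper end of the 1-2 week guideline


@dataclass
class PostMortem:
    asset_id: str
    failure_date: date
    what_happened: str          # question 1
    why_it_happened: str        # question 2
    response_and_recovery: str  # question 3
    prevention: str             # question 4

    def review_deadline(self) -> date:
        """Latest date on which the post mortem should be held."""
        return self.failure_date + REVIEW_WINDOW

    def is_complete(self) -> bool:
        """True only when all four questions have a non-empty answer."""
        answers = [
            self.what_happened,
            self.why_it_happened,
            self.response_and_recovery,
            self.prevention,
        ]
        return all(answer.strip() for answer in answers)


# Stand-in for CMMS asset records: asset ID -> post mortems filed for it.
records: Dict[str, List[PostMortem]] = {}


def file_post_mortem(pm: PostMortem) -> None:
    """File a finished post mortem under its asset so it can be found later."""
    if not pm.is_complete():
        raise ValueError("all four questions need an answer before filing")
    records.setdefault(pm.asset_id, []).append(pm)


pm = PostMortem(
    asset_id="PUMP-07",
    failure_date=date(2024, 3, 4),
    what_happened="Pump seized during the night shift.",
    why_it_happened="Bearing overheated after a missed lubrication check.",
    response_and_recovery="Line stopped, spare pump installed within 6 hours.",
    prevention="Monthly lubrication inspection scheduled for this asset.",
)
file_post_mortem(pm)
print(pm.review_deadline())     # 2024-03-18
print(len(records["PUMP-07"]))  # 1
```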
