This blog is more or less a copy and paste of a wiki page that my team at work use as part of our Problem Management process. It is heavily inspired by lots of good writing about blameless postmortems, for example from Etsy and the Beyond Blame book. I hope you find it useful.

RCA Approach

This page describes a 7-step approach to performing RCAs. The process belongs to all of us, so please feel free to update it.

The name “Root Cause Analysis” implies there is one root cause. In practice there is often a cocktail of contributing causes, as well as negative (and sometimes positive) outcomes.

The name also implies that we are on a hunt for a cause. We are on a hunt for causes, but only to help us identify preventative actions, not just to solve a mystery or, worse, to find an offender to punish.

Therefore RCA is proposed to stand for Recurrence Countermeasure Analysis.

Step 1: Establish “the motive”

Ask the following:

Question: Does anyone think anyone in our team did something deliberately malicious to cause this? i.e. they consciously carried out actions that they knew would cause this or something of similar negative consequences or they clearly understood the risks but cared so little that they weren’t deterred?

and

Question: Does anyone think anyone outside our team… (as above).

The assumption here is that the answer is “NO” to both questions. If it is “NO”, we can now proceed in a blameless manner, i.e. never stopping our analysis at a point where a person should (or could) have done something different.

If either answer is “YES”, this is beyond the scope of this approach.

Step 2: Restate our meaning of “Blameless”

Read aloud the following to everyone participating in the RCA:

“We have established that we don’t blame any individual, either internal or external to our organisation, for the incident that has triggered this exercise. Our process has failed us and needs our collective input to improve it. If at any point during the process anyone starts to doubt this statement, or acts like they no longer believe it, we must return to Step 1. Everyone is responsible for enforcing this.

What is at stake here is not just getting to the bottom of this incident, it’s getting to the bottom of this incident and every future occurrence of the same incident. If anyone feels mistreated by this process, by human nature they will take actions in the future to disguise their actions to limit blame and this will damage our ability to continuously improve.”

Step 3: Restate the rules

During this process we will follow these rules:

Facts must not be subjective. If an assertion of fact cannot be 100% validated, we should agree and capture our confidence level (e.g. High, Medium, Low). We must also capture the actions that we could take to validate it.

If we don’t have enough facts, we will prioritise the facts that we need to go away and validate before reconvening to continue. Before suspending the process, agree a full list of “Things we wish we knew but don’t know”, capture the actions that we could take to validate them and prioritise the discovery.

If anyone feels uncomfortable during the process due to:

Blame

Concerns with the process

Language or tones of voice

Their ability to have their voice heard

they must raise it immediately.

We are looking for causes only to inform what we can do to prevent re-occurrence, not to apportion blame.
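As an illustration of the first two rules, here is a minimal sketch of how a team might record facts during the session. Everything in it is an assumption for the sake of example: the `Fact` and `Confidence` names, the extra `VALIDATED` and `UNKNOWN` levels, the sample statements, and the choice to prioritise the least certain items first. Use whatever tool and ordering your team prefers.

```python
from dataclasses import dataclass, field
from enum import Enum


class Confidence(Enum):
    """Agreed confidence in a statement (levels beyond High/Medium/Low are assumptions)."""
    VALIDATED = 4   # 100% validated, nothing left to check
    HIGH = 3
    MEDIUM = 2
    LOW = 1
    UNKNOWN = 0     # a "thing we wish we knew but don't know"


@dataclass
class Fact:
    statement: str
    confidence: Confidence
    # Actions we could take to validate the statement
    validation_actions: list[str] = field(default_factory=list)


def discovery_backlog(facts: list[Fact]) -> list[Fact]:
    """Return every fact that is not fully validated, least certain first."""
    pending = [f for f in facts if f.confidence is not Confidence.VALIDATED]
    return sorted(pending, key=lambda f: f.confidence.value)


# Hypothetical entries, invented purely for illustration
facts = [
    Fact("The alert fired before customers noticed", Confidence.MEDIUM,
         ["Compare alert timestamps with support tickets"]),
    Fact("The deployment finished successfully", Confidence.VALIDATED),
    Fact("Which servers received the change", Confidence.UNKNOWN,
         ["Check the deployment logs"]),
]

backlog = discovery_backlog(facts)
for fact in backlog:
    print(fact.confidence.name, "-", fact.statement, "->", fact.validation_actions)
```

If the backlog is too long to resolve in the room, it doubles as the list of actions to take away before reconvening.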

Step 4: Agree a statement to describe the incident that warranted this RCA

Using an open discussion, attempt to reach a consensus on a statement that describes the incident that warranted this RCA. This must identify the thing (or things) that we don’t want to happen again (including all negative side-effects). Don’t forget the impact on people, e.g. having to work late to fix something. Don’t forget to capture the problem from all perspectives.

Write this down somewhere everyone can see.

Step 5: Mark up the problem statement

Look at the problem statement and underline every aspect of it that someone could ask “Why” about. Try to take an outsider’s view: even if you know the answer or think something cannot be challenged, it is still in scope for being underlined.

Step 6: Perform the analysis

Document the “Why” question related to each underlined aspect in the problem statement.

For each “Why” question, attempt to agree on one direct answer. If you find you have more than one direct answer, split your “Why” question into more specific “Why” questions until each question maps directly to a single answer.

Mark up the answers as you did in Step 5.

Repeat this step until you’ve built up a tree with at least 5 answers per branch and at least 3 branches. If you can’t find at least 3 branches, you need to ask more fundamental “Why” questions about your problem statement and answers. If you can’t ask and answer at least 5 “Why”s per branch, you are possibly taking steps that are too large.

Do not stop this process with any branch ending on a statement that could be classified “human error”. (Refer to what we agreed at step 1).

Do not stop this process at something that could be described as a “third party error”. Whilst the actions of third parties may not be directly under our control, we have to maintain a sense of accountability for the problem statement: where necessary, we should have implemented measures to protect ourselves from the third party.
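The stopping rules in this step can be sketched as a small check over the “Why” tree. This is only a rough illustration under my own assumptions: a “branch” is read as a path from a top-level “Why” down to its final answer, the `Why`, `branches` and `review` names are hypothetical, the example questions are invented, and spotting “human error” by matching text is a crude stand-in for the judgement of the people in the room.

```python
from dataclasses import dataclass, field


@dataclass
class Why:
    question: str
    answer: str
    children: list["Why"] = field(default_factory=list)  # follow-up Whys about the answer


def branches(node, path=()):
    """Yield every path from this node down to a final answer."""
    path = path + (node,)
    if not node.children:
        yield path
    else:
        for child in node.children:
            yield from branches(child, path)


# Phrases a branch must not end on (see the two rules above)
FORBIDDEN_ENDINGS = ("human error", "third party error")


def review(roots, min_branches=3, min_answers=5):
    """Return a list of reasons why the analysis should continue."""
    problems = []
    all_branches = [b for root in roots for b in branches(root)]
    if len(all_branches) < min_branches:
        problems.append(f"Only {len(all_branches)} branch(es), need {min_branches}")
    for branch in all_branches:
        leaf = branch[-1]
        if len(branch) < min_answers:
            problems.append(f"Branch ending at {leaf.answer!r} has only {len(branch)} answer(s)")
        if any(phrase in leaf.answer.lower() for phrase in FORBIDDEN_ENDINGS):
            problems.append(f"Branch stops at a forbidden ending: {leaf.answer!r}")
    return problems


# Invented example: one short branch that stops on human error
root = Why(
    "Why did the release fail?",
    "The config file was missing a setting",
    [Why("Why was the setting missing?", "Human error during a manual edit")],
)

problems = review([root])
for problem in problems:
    print(problem)
```

An empty list from `review` only means the tree meets the minimum shape. It says nothing about whether the answers are good ones.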

Step 7: Form Countermeasure Hypotheses

Review the end points of your analysis tree and make hypotheses about actions that could be taken to prevent future re-occurrences. Like all good hypotheses, these should be specific and testable.

Use whatever mechanism you have for capturing and prioritising the proposed work to track the identified actions and get them implemented. Use your normal approach to stating acceptance criteria and don’t close the actions unless they satisfy the tests that they have been effective.

Last week I spent an inspiring 3 days at the DevOps Enterprise Summit (DOES15) in San Francisco. I had the pleasure of speaking but, most importantly, of learning from everyone I heard present and chatted with. The most interesting thing about events like this is that they can change your perspective on things you felt you knew well. For example, fundamentals such as “What is DevOps these days?”.

We all like to create taxonomies to make sense of things and I found myself grouping the practitioners I spoke to into 3 categories.

1. People working for the DevOps poster children (Netflix, Google etc.), who are an inspiration to us all through what they achieve both with IT and through their willingness to be open and share.

2. People working for large enterprises who are on tremendous journeys of DevOps transformation, have fantastic stories to tell, and are still living day-to-day around many things they would like to change dramatically.

3. People who haven’t yet built up momentum around DevOps and seemed almost overwhelmed by the stories and performance of people in categories 1 and 2.

Naturally it was category 3 that I felt most drawn to understanding, and talking to them inspired me to write this post.

Home truth #1: Improving IT is not at all new to DevOps(!)

Whether you have just heard the name, or have been doing it for several years, if you are ambitious and passionate about what you do, you are without a doubt already committed to improving the IT function (and hence directly the businesses) in which you operate.

Home truth #2: Writing off DevOps as just a fashionable name for improving IT is a mistake.

I believe that “doing DevOps” is something every organisation must consciously start doing – today (if they haven’t already). It doesn’t take everyone (at first), or even everyone in a particular business unit, department or team. It just takes at least two people to grit their teeth and agree that they are going to consciously make a collaborative effort to improve IT with a new level of energy, ambition, and a “new” name…

So here is what will be different once you start “doing DevOps”.

Just the act of starting something new and exciting will hopefully immediately inspire new levels of energy, motivation, ambition, and sense of purpose (perhaps even create flow).

You now have a useful name for your efforts to improve IT and one you can research to tap into the wealth of blogs, podcasts, meetups, conferences, Open Source, tools, and lessons learnt out there.

You can now relate the things you are doing (and trying to do) to the practices demonstrated by DevOps poster children.

You are now part of the huge support network that is the DevOps community, which has grown dramatically, built on a solid foundation of inclusivity and sharing.

Your new community is filled with individuals and companies fully motivated by the opportunity to share their experiences for the greater good of our industry and the greater good for society and humanity.

You have a better chance than ever of getting internal investment in your cause (DevOps being in vogue has advantages).

By stating (especially in public) your ambition and commitment to build a lean, automated, responsive, reliable IT organisation, you are now more likely to be able to grow an inspired workforce and more likely to attract talent from outside.

So my advice (especially to people who identify with Category 3) is as follows:

Don’t let anyone tell you that you aren’t doing DevOps (it’s a journey).

If you are doing DevOps on any scale in your company, don’t let anyone convince you that you aren’t key to the future success of the organisation (YOU ARE!)

Don’t feel disheartened by where you think your organisation is today relative to some kind of DevOps utopia / companies you read about / your perceived view of your peers. It’s the rate at which you can learn to continuously improve IT within your organisation that will secure your organisation’s future, not precisely where you all are today.

On the subject of Continuous Delivery, where the intention is to fail fast, it’s actually rather sloppy of me to defer talking about people until my fourth blog on this. When it comes to implementing Continuous Delivery there is nothing more potentially obstructive than people. Or to put things more positively, nothing can have a more positive impact than people!

Here are my top 4 reasons that people could cause impedance.

#1 Ignorance

A lack of understanding and appreciation of Continuous Delivery, even among a small but perhaps vocal or influential minority, can be a large source of impedance. Many Developers and Operators (and a new species of cross-breeds!) have heard of Continuous Delivery and DevOps, but often Project Managers, Architects, Testers and Management/Leadership have not. Continuous Delivery is like Agile in that it needs to be embraced by an organisation as a whole, simply because anyone in an organisation is capable of causing impedance by their actions and the decisions they make. For example, the timelines set by a project manager simply may not support taking time to automate. A software package selected by an architect could cause a lot of pain to everyone with an interest in automation.

A solution to this that I’ve seen work well has been awareness sessions. Whatever format works best for sharing knowledge (brownbag lunches, webinars, communication sessions, memos, the pub etc.) should be used to make people aware of what Continuous Delivery can do, how it works, why it is important, and what all the various terminology means.

I once spent a week doing this and talked to around 10 different projects in an organisation and hundreds of people. It was a very rewarding experience and by the end of it we’d gone from knowing 1 or 2 interested people to scores. It was also great to make connections with people already starting to do great things towards the cause. We even created a social media group to share ideas and research.

#2 Ambivalence?

As I’ve discussed before, some people reject Continuous Delivery because they see it as un-achievable and / or inappropriate for their organisation. (Often I’ve seen this being due to confusion with Continuous Deployment.) Also, don’t overlook a cultural aversion to automation. In my experience it’s only been around 5 years since the majority of people “in charge” were still very skeptical about the concept of automating the full software stack.

A solution here (assuming you’ve revisited the awareness sessions where necessary) is to organise demos of any aspects of Continuous Delivery already adopted and demonstrate that it is real and already adding value.

#3 Obedience

Another source of impedance could be a misguided perception that Continuous Delivery is actually forbidden in a particular organisation, so people impede it in a misinformed attempt at obedience to the management/leadership. Perhaps a management steer to focus only tactically on “delivery, delivery, delivery” does not allow room for automation. Or perhaps management take a very strong interest in how everything works and haven’t yet spoken about Continuous Delivery practices, or even oppose certain important techniques like Continuous Integration. Or perhaps a leadership mandate to cut costs makes strategic tasks like automation seem frivolous or impossible.

A solution here is for management/leadership to publicly endorse Continuous Delivery and cite it as the core strategy / methodology for ongoing delivery. Getting them along to the above-mentioned awareness sessions can help a lot. Getting them to blog about it is good, as is setting up demos with them to highlight the benefits of automation already developed. Working Continuous Delivery into the recognition and rewards processes could also be effective (if you please, C-suite!).

#4 Disobedience

Finally, if people know what Continuous Delivery is, they want it, and they know they are allowed it, why would they then disobey and not do it? Firstly, it could be down to other sources of impedance that make it difficult even for the most determined (e.g. Infrastructure). But it could also easily be a lack of time, resources, budget or skills.

Skills are relatively easy to address so long as you make time. Depending on where you live there could be masses of good MeetUps to go and learn at. There are superb tutorials online for all of the open source tools. #FreeNode is packed with good IRC channels of supportive individuals. The list goes on.

Another thing to consider here is governance. As I’ve confessed before, some people like me really like things like pipeline orchestration, configuration management, automated deployments etc. But this is not the norm. It is very common for such concerns to be unloved and to slip through the cracks with no-one feeling accountable. Making sure there is a clear owner for all of these is a very important step. Personally I am always more than happy to take this accountability on projects as opposed to seeing them sit unloved and ignored.

Finally, as I’ve said before, DevOps discussions often focus around the idea that an organisation has just two silos: Development and Operations. But in my experience, things are usually a lot more complex, with multiple silos (perhaps by technology, release, department etc.), multiple vendors, multiple suppliers, you name it. Putting a DevOps team in place to help get started towards Continuous Delivery can be one effective way of ensuring there is ownership, dedicated focus and skills ready to work with others to overcome people impedance. Of course, heed the warnings.

Obviously overall People Impedance is a huge subject. I hope this has been of some use. Please let me know your own experiences.