The 5 whys: Another attempt

Towards the end of the week before last and the beginning of last week we’d been having quite a few problems with our QA environment to the point where we were unable to deploy anything to it for 3 days.

It still wasn’t quite spot on due to some mistakes I made while facilitating but these were my observations:

Once we got to answering the whys for the 4th and 5th first level whys the whiteboard was way too cluttered and it had become quite difficult to see exactly where we'd got up to.
As a result we lost the discipline around answering the question why and drifted off into general discussion around the original question but stopped drilling down further looking for a potential root cause.
The next time I think it would probably work better to look for the first why and collect any potential other whys on the same level in a 'parking lot' type area which we could then go to later on.

Having said that, a neat thing about having the whys alongside each other was that we were able to see that the first two whys were linked to each other.
Both changes had been done by someone in the operations team based on conversations they had with people on our team.
We realised that our communication with the operations team hadn't been entirely clear and had left room for doubt which had led to unexpected changes being made to the servers.
This was an example of us stopping before we'd drilled down to 5 levels having realised that we could influence the situation positively even if we hadn't found the root cause of the problem.

Drilling down into the 'System tests failing' led to the most interesting insights:

System tests failing

Noone cares about them

We can push to QA even if they're broken

Used to them failing

Perception amongst devs that they're flaky

There had previously been a time when data changed frequently and broke them.

Seen as being owned by the QAs

The tests were defined by QAs

The time from checkin to system tests failing is quite long

Looking back at this now we probably should have drilled a bit further down on some of the whys.

We actually ended up discussing the perception amongst the developers that the tests were flaky and it was pointed out that most of the failures were actually real.

We don’t currently have a ‘stop the line’ mentality if the systems tests fail but have agreed to adopt that approach for the next iteration and check at the end of this week to see if we’ve improved.

Even though I didn’t facilitate the exercise perfectly I think there was still a far greater level of analysis done by the team in this exercise than in others that I’ve seen.

I’ve noticed that a lot of retrospective type exercises tend to only encourage surface level analysis so we never really go deeper into a subject and see if we can actually make some useful changes to the way that we work.