Technical musings, things of interest, etc.

Menu

Monthly Archives: March 2018

A typical post-mortem meeting delves into Five Whys, not assigning blame, figuring out what mistakes were made that lead to the outage or whatever, and then assigning and prioritizing the work to correct those mistakes and, most importantly, to learn from them.

The post-mortem itself presumes a mistake, that we must find that mistake and learn from it. I wonder then, is it possible to have an incident where the cause is not a mistake, but even maybe a correct decision?

Let us imagine a very simple game. The house flips a coin. You wager 50c – if the house wins, they keep your 50c. If the house loses, they pay you $2. You give the house your 50c and play. So we figure, half the time we lose 50c, but half the time we gain $2. The expected value (or EV) of an iteration of the game is $1.50. Since the game only costs 50c to play, it has a positive EV. EV (given a tolerable amount of catastrophic risk) gives us a simple mathematical basis for evaluating our decisions.

Anyway, so you decide that since the game is positive EV, it is correct to play (assuming you like money). And then, you lose. Later, you have a post mortem for the game, and you decide that the thing that caused the losing was to play to begin with, and the best way to avoid that in the future would be to not play anymore, or to play a different game, perhaps.

In games involving variance, this post-hoc analysis of decisions is referred to pejoratively as “results-oriented thinking.” It boils down to overweighting our agency, and the belief that each time we realize risk, our decision was poorly conceived.

a method of analyzing a poker play based on the outcome as opposed to the merits of the play. (source)

Trying to be good at games involving variance (like poker, or Magic: the Gathering) is one way to figure out how miserable humans really are at this. There are times when you (correctly) estimate you are 9:1 against your opponent, and you convince them to go all-in, and then they suck out and you lose (this happens exactly 10% of the time, by my math).

Here is the horrible truth about games of variance: sometimes, making the correct decision will straight up cause you to lose. Perhaps more terrifying: sometimes making the wrong decision will cause you to win.

It is not fun, and if no one has explained this to you, you might think that poker is just not your game. But, if you make this play at every opportunity it is presented to you, in the long run, you will win a lot of money.

This relates to post-mortems in myriad ways, some of which I will have to address in future posts. The most obvious is answering the question “how could we have prevented this?” is not an adequate post-mortem. It’s not even a reasonable way to start. We have to evaluate the decisions we made, on their own merits, ignoring their outcome. This is admittedly tough to do, since in most cases, we only have post-mortems when things go bad.

Here are a few suggestions for post-mortem questions that nobody’s asking:

Should we have actually prevented this?

Post-mortems assume this. I think assuming this will make you risk-averse. Not all outages are worth attempting to prevent. If you think the decisions that led to an outage had a fair risk component, you basically agree that those decisions were fine.

Should we try to prevent this in the future?

This is a different question. We know different things now. The stakes have changed, and the odds are different, and the costs are lower.

Critically, every post-mortem involves a cohort of stakeholders, sometimes ones with actual stakes (and pitchforks and torches). If your system experiences a failure mode, your customers will assume that it makes sense that you will adjust your process so that you will never experience this failure mode again (regardless of the costs, etc). Therefore, this failure mode will have a greater cost the next time it occurs.

In our favor though, we may potentially understand the failure mode better, and we may have an easier solution for it than if we’d gone out looking for dragons early in the process (we have somewhat validated a risk), which may have taught us that the odds of this failure mode are (perhaps!) higher than expected and a solution for this actual failure mode is cheaper than a solution for many theoretical failure modes.

Did we understand this risk up front or was it emergent?

Did we just come out on the bad side of a calculated risk? Was this an unknown risk that we could have known about if we’d done reasonable due diligence (that is, due diligence that was likely to be worth it)?

How should our decision making change in the future?

Is there a critical dependency that is too poorly understood? Maybe we should develop some more expertise here.

Are we taking too little risk? Too much risk? It’s tough to answer this question honestly during a post-mortem, but it’s something worth keeping in mind.

Final Notes

There is a big world of thinking about decisions that I didn’t cover here. This mode of thinking can improve your decision-making in post-mortems, software design, career decisions, relationships, finances, and pretty much any area where risk and reward are traded off.

Find ways to think harder about the decisions you make than the outcomes you experience.