The STEMpunk Project: Performing A Failure Autopsy

What follows is an edited version of an exercise I performed about a month ago following an embarrassing error cascade. I call it a ‘failure autopsy’, and on one level it’s basically the same thing as an NFL player taping his games and analyzing them later, looking for places to improve.

But the aspiring rationalist wishing to do something similar faces a more difficult problem, for a few reasons:

First, the movements of a mind can’t be seen in the same way the movements of a body can, meaning a different approach must be taken when doing granular analysis of mistaken cognition.

Second, learning to control the mind is simply much harder than learning to control the body.

And third, to my knowledge, nobody has really even tried to develop a framework for doing with rationality what an NFL player does with football, so someone like me has to pretty much invent the technique from scratch on the fly.

I took a stab at doing that, and I think the result provides some tantalizing hints at what more mature, more powerful versions of this technique might look like. Further, I think it illustrates the need for what I’ve been calling a “Dictionary of Internal Events”, or a better vocabulary for describing what happens between your ears.

Process:

Performing a failure autopsy involves the following operations:

List out the bare steps of whatever it was you were doing, mistakes and successes alike.

Identify the points at which mistakes were made.

Categorize the nature of those mistakes.

Repeatedly visualize yourself making the correct judgment, at the actual location, if possible.

(Optional) explicitly try to either analogize this context to others where the same mistake may occur, or develop toy models of the error cascade which you can use to template onto possible future contexts.

In my case, I was troubleshooting an air conditioner failure[1].

The garage I was working at has two five-ton air conditioning units sitting outside the building, with two wall-mounted thermostats on the inside of the building.

Here is a list of the steps my employee and I went through in our troubleshooting efforts:

a) Notice that the right thermostat is malfunctioning.

b) Decide to turn both AC units off[2] at the breaker[3] instead of at the thermostat.

c) Decide to change the batteries in both thermostats.

d) Take both thermostats off the wall at the same time, in order to change their batteries.

e) Instruct employee to carry both thermostats to the house where the batteries are stored. This involves going outside into the cold.

The only non-mistakes were a) and c), with every other step involving an error of some sort. Here is my breakdown:

*b1) We didn’t first check to see if the actual unit was working; we just noticed the thermostat was malfunctioning and skipped straight to taking action. I don’t have a nice term for this, but it’s something like Grounding Failure.

*b2) We decided to turn both units off at the breaker, but it never occurred to us that abruptly cutting off power might stress some of the internal components of the air conditioner. Call this “implication blindness” or Implicasia.

*b3) Turning both units off at the same time, instead of doing one and then the other, introduced extra variables that made downstream diagnostic efforts muddier and harder to perform. Call this Increasing Causal Opacity (ICO).

*d) We took both thermostats off the wall at the same time. It never occurred to us that thermostat position might matter, i.e. that putting the right thermostat in the slot where the left used to go or vice versa might be problematic, so this is Implicasia. Further, taking both down at the same time is ICO.

*e) Taking warm thermostats outside on a frigid night might cause water to condense on the inside, damaging the electrical components. This possibility didn’t occur to me (Implicasia).

In case this isn’t clear, here are two separate diagrammatic representations of the process. They convey the same content, but the first is computer-generated and cleaner while the second is handwritten and contains a good deal of exposition:

***

Interventions:

So far all this amounts to is a tedious analysis of an unfolding disaster. What I did after I got this down on paper was try and re-live each step, visualizing myself performing the correct mental action.

So it begins with noticing that the thermostat is malfunctioning. In my simulation I’m looking at the thermostat with my employee, we see the failure, and the first thought that pops into my simulated head is to have him go outside and determine whether or not the AC unit is working.

I repeat this step a few times, performing repetitions the same way you might do in the gym.

Next, in my simulation I assume that the unit was not working (remember that in real life we never checked and don’t know), and so I simulate having two consecutive thoughts: “let’s shut down just the one unit, so as not to ICO” and “but we’ll start at the thermostat instead of at the breaker, so that the unit shuts down slowly before we cut power altogether. I don’t want to fall victim to Implicasia and assume an abrupt shut-down won’t mess something up”.

The second part of the second thought is important. I don’t know that turning an AC off at the breaker will hurt anything, but the point is that I don’t know that it won’t, which means I should proceed with caution.

As before, I repeat this visualization five times or so.

Finally, I perform this operation with both *d) and *e), in each case imagining myself having the kinds of thoughts that would have resulted in success rather than failure.

Broader Considerations:

The way I see it, this error cascade resulted from impoverished system models and from a failure to invoke appropriate rationalist protocols. I would be willing to bet that lots of error cascades stem from the same deficiencies.

Building better models of the systems relevant to your work is an ongoing task that combines learning from books and tinkering with the actual devices and objects involved.

But consistently invoking the correct rationalist protocols is a tougher problem. The world is still in the process of figuring out what those protocols should be, to say nothing of actually getting people to use them in real time. Exercises like this one will hopefully contribute something to the former effort, and a combination of mantras or visualization exercises is bound to help with the latter.

This failure autopsy also provides some clarity on the STEMpunk project: the object-level goals of the project correspond to building richer system models, while the meta-level goals will help me develop and invoke the protocols required to reason about the problems I’m likely to encounter.

Future Research:

While this took the better part of 90 minutes to perform, spread out over two days, I’m sure it’s like the first plodding efforts of a novice chess player analyzing bad games. Eventually it will become second nature and I’ll be doing it on the fly in my head without even trying.

But that’s a ways off.

I think that if one built up a large enough catalog of failure autopsies they’d eventually be able to collate the results into something like a cognitive troubleshooting flowchart.
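As a rough illustration of what collating might look like, here is a minimal Python sketch that records the air conditioner autopsy as data and tallies the error categories. The steps and category names come from the breakdown above; the `Step` class, the `AC_AUTOPSY` list, and the `tally_errors` function are hypothetical names I made up for this example, and a real catalog would probably need richer fields.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step of an autopsy; an empty errors list means no mistake was made."""
    label: str
    action: str
    errors: list[str] = field(default_factory=list)


# Steps a) through e) from the troubleshooting list, tagged with the
# categories assigned in the breakdown (b1, b2, b3 are folded into step b).
AC_AUTOPSY = [
    Step("a", "Notice the right thermostat is malfunctioning"),
    Step("b", "Turn both AC units off at the breaker",
         ["Grounding Failure", "Implicasia", "ICO"]),
    Step("c", "Decide to change the batteries in both thermostats"),
    Step("d", "Take both thermostats off the wall at the same time",
         ["Implicasia", "ICO"]),
    Step("e", "Carry both thermostats outside into the cold",
         ["Implicasia"]),
]


def tally_errors(catalog):
    """Count how often each error category appears across a list of autopsies."""
    counts = Counter()
    for autopsy in catalog:
        for step in autopsy:
            counts.update(step.errors)
    return counts


if __name__ == "__main__":
    for category, count in tally_errors([AC_AUTOPSY]).most_common():
        print(f"{category}: {count}")
```

With only one autopsy in the catalog the tally is not very informative, but with dozens of them the most common categories would be natural candidates for the first branches of a flowchart.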

You could also develop a toy model of the problem (e.g. troubleshooting a circuit that lights up two LEDs, reasoning deliberately to avoid Implicasia and changing one thing at a time to avoid ICO).
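To make the two-LED idea a bit more concrete, here is a small Python sketch of such a toy model. Everything in it is an assumption for illustration: the circuit layout (one battery feeding two independent wire-plus-LED branches), the component names, and the `observe` and `diagnose_one_at_a_time` functions. It also assumes at most one fault per branch, since two faults in the same branch would mask each other.

```python
# Hypothetical circuit: one battery feeding two branches, each a wire plus an LED.
COMPONENTS = ["battery", "wire_left", "led_left", "wire_right", "led_right"]


def observe(circuit):
    """Return (left_lit, right_lit) for a dict mapping component name to working/broken."""
    left = circuit["battery"] and circuit["wire_left"] and circuit["led_left"]
    right = circuit["battery"] and circuit["wire_right"] and circuit["led_right"]
    return left, right


def diagnose_one_at_a_time(circuit):
    """Swap in one known-good part at a time and record which swaps change the lights.

    Changing a single variable per trial keeps the cause of any change
    unambiguous, which is the opposite of Increasing Causal Opacity.
    """
    culprits = []
    for name in COMPONENTS:
        before = observe(circuit)
        original = circuit[name]
        circuit[name] = True              # swap in a known-good part
        if observe(circuit) != before:
            culprits.append(name)         # the swap mattered, so keep the fix
        else:
            circuit[name] = original      # no effect, so put the old part back
    return culprits


if __name__ == "__main__":
    faulty = {name: True for name in COMPONENTS}
    faulty["led_right"] = False
    print(diagnose_one_at_a_time(faulty))
```

Replacing every part at once would also get both LEDs lit, but it would tell you nothing about which part had failed, which is roughly what happened when both thermostats came off the wall together.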

Or, you could try to identify a handful of the causal systems around you where error cascades like this one might crop up, and try to preemptively reason about them.

I plan on exploring all this more in the future.

Notes:

[1] I’m not an HVAC technician, but I have worked with one and so I know enough to solve some very basic problems.

[2] Why even consider turning off a functioning AC? The interior of the garage has a lot of heavy machinery in it and thus gets pretty warm, especially on hot days, and if the ACs run continuously, the freon lines will eventually frost over and the unit will shut down. So, if you know the units have been working hard all day, it’s often wise to manually shut one or both units down for ten minutes to give the lines a chance to defrost, and then manually turn them back on.

[3] Why even consider shutting off an AC at the breaker instead of the thermostat? The same reason that you sometimes have to shut an entire computer down and turn it back on when troubleshooting. Sometimes you have no idea what’s wrong, so a restart is the only reasonable next step.