It's about what broke, not who broke it

I used to teach a class to new people who had just joined the company.
Over the span of almost four years, it moved around a bit in terms of
which week I'd see them (third, first, then second), but the class
always focused on the same thing: troubleshooting and outages.

I got into it almost by accident. One of my teammates had created it,
and had a slide deck and a bunch of interesting topics to share. He
would teach it about once every two weeks. Then, one summer, he went on
vacation and needed coverage, so I offered to take it for him. It turned
out that I enjoyed it so much that I asked to do more, even when he was
still available.

Over the months and years that followed, it acquired a life of its own.

While there are details I'll probably never be able to share, there are
ways of talking about what went on in that class. For starters, I would
usually mic myself up a few minutes early, and would chat with the
people who had arrived before the start time.

I'd ask them "who here has heard the rumor that (something that seems
completely outlandish including who supposedly did it) happened, and it
took down (way more things than you'd imagine)?" Some hands would go
up, some people would murmur. Then I'd say "is it true or false?", and
it was always interesting to see which way the room would go. Depending
on what they had heard and how talkative they were, it might go either
way.

I'd then say "it's true" and later, during the actual class, would tell
them the story of a time someone did something innocent, tripped over a
three-year-old bug, and managed to effectively unplug everything.

I then had to tell them that this person still worked there, that I
would not say who it was, and that it didn't matter. It wasn't that
person's fault that something broke. They had tripped over something
that had been written into the code years before they probably ever
thought of joining the company, and they were doing something that
should have worked when it fell over.

The important thing at the time was that we cared about what broke, not
who broke it. Who broke it is frequently just a roll of the dice: who
got that particular task, bug, or ticket assigned to them, and happened
to run this valid command instead of that also-valid command? Why would
you ever assign blame based on that?

If anything, you'd want to find out the general pattern of what had
broken and then go scour the code base to see if it had happened
anywhere else. Chances are, someone didn't just come up with that
particular string-mangling "wizardry" out of thin air, and they either
picked it up from some other part of the code, or someone else later
copied from them and did the same stuff in their own code.

Maybe they never got the memo that you're really not supposed to be
doing old-school C-style char* manipulation with [] accesses and pointer
math and all of this in C++ code that isn't a particular "hot spot" in
the system. There are reasons that we like using actual strings and
things with bounds-checking, right?
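
To make that concrete, here is a small sketch of the difference. It is
not the code from any real outage: the "key=value" format and the names
value_of_unchecked, value_of, and nth_char are all made up for
illustration. The first function is the pointer-walking style and
assumes its input is well-formed; the other two lean on std::string,
which either reports the bad case or throws instead of reading off the
end.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical C-style helper, shown only for contrast. It assumes '='
// is always present. Given input without '=', the loop walks past the
// terminating NUL, which is undefined behavior.
inline const char* value_of_unchecked(const char* kv) {
    while (*kv != '=') {
        ++kv;  // nothing here checks for '\0'
    }
    return kv + 1;
}

// Same job with std::string. The "no '=' at all" case becomes a return
// value the caller has to look at, rather than a read out of bounds.
inline bool value_of(const std::string& kv, std::string& out) {
    const std::string::size_type pos = kv.find('=');
    if (pos == std::string::npos) {
        return false;
    }
    out = kv.substr(pos + 1);  // pos + 1 <= kv.size(), so this is in range
    return true;
}

// std::string::at() is the bounds-checked counterpart of operator[]:
// it throws std::out_of_range when i >= s.size().
inline char nth_char(const std::string& s, std::size_t i) {
    return s.at(i);
}
```

Both versions agree on well-formed input. They only diverge on the
input nobody expected, which is exactly the input that tends to show up
years later.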

In any case, however it got there, it has to be found and eradicated in
the code. Then it has to actually get built and pushed. (Far too many
outages occur because the fix is only "in trunk" and never got pushed in
time.) After that, the follow-up involves some kind of static code
analysis, lint rules, or whatever else is necessary to positively keep
people from putting it back in.

Teaching is not enough. There are far too many people going through the
revolving doors of these companies to ever think that you could possibly
get them all and keep it all fresh in their heads. You have to design a
system such that the natural thing to do yields a good result and
doesn't put anyone in harm's way.

"What broke, not who broke it" is another one of those cultural
touchstones within a technical environment. You should keep an eye on
it and see if it's still being honored, or if it starts being ignored.
When things change, be prepared to change with them, or be prepared to
suffer.