The Day I Broke Production

One of my more exciting days as a developer was the day I broke the production environment.
Although I was fortunate to escape unscathed, it was a formative experience; I'll never be quite as cavalier
with my typing or my thinking in the production environment again.

Our product is an amalgamation of Erlang, Java and PHP, and figuring out a sane deployment approach had quite a learning curve.
PHP upgrades were usually a no-op, though they sometimes required an Apache restart.
Java upgrades involved killing the running process and then starting a new one with the updated
code. Erlang upgrades involved killing the beams and then running the start scripts again.
For Java wrapped by Erlang, an upgrade meant killing and restarting the Erlang process.

Well, that's how it was supposed to work anyway.

The Problem

After some, but not all, deployments we encountered a peculiar problem: dead processes would be left
hanging around in pg2. This would cause gen_server:call to fail whenever pg2:get_closest_pid happened
to select one of those no-longer-alive processes, which in turn would cause the callers to blow up. Generally
these pushes would douse our house of cards with gasoline, and the next incoming request would provide the spark
that set the entire system alight.

It was hardly surprising that there would be some rough patches when deploying an alpha product,
but it nonetheless got somewhat grating to have the problem pop up after some but never all pushes.
While we eventually figured out better interim solutions and finally understood the underlying problem
(more on that later), in the meantime we had a manual cleanup process.

The first few times this happened the cleanup was very manual, meaning that we typed some commands
into the Erlang shell and then mashed the return key. Surprisingly enough, this brings us to the story of the day I broke production.

Worse than the Disease

After one push it immediately became clear that the dead process plague was visiting our fair system. At that point
two of the other engineers had been taking care of the zombies, but one of them was on a month-long vacation and the
other wasn't immediately available, so I had two choices: leave quietly for a long lunch, or try to fix it myself.

So I fixed the problem. Sort of. What I wrote was something along the lines of:

Which happens to have one minor flaw: it removed all living processes from the process groups, leaving only
dead processes in the process groups to service requests. Oops.

From there I messaged the QA lead to kindly ignore the incoming wave of test failures that were about to
be unleashed, and went about fixing my fix. First, I deleted and recreated all the existing process groups
on all the nodes to clean out the dead processes, and then I had to go around to each of the nodes and restart
the applications running on them (which would cause them to rejoin the new process groups, thus populating
them with live processes).

Since we only had about eight nodes at that point it only took a few minutes to get it all sorted, but they
were some rather tense minutes.
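
For reference, a cleanup that removes only the dead members could look roughly like the sketch below. This is not the snippet from that day; the module name pg2_cleanup and both function names are made up for illustration. It also assumes an OTP release that still ships pg2 (the module was removed in OTP 24 in favor of pg).

```erlang
-module(pg2_cleanup).
-export([remove_dead/0, is_alive/1]).

%% Walk every pg2 group and remove only the members that are no longer alive.
%% The `not` in the filter is the important part: live members are left alone.
remove_dead() ->
    [pg2:leave(Group, Pid)
     || Group <- pg2:which_groups(),
        Pid <- pg2:get_members(Group),
        not is_alive(Pid)],
    ok.

%% erlang:is_process_alive/1 only accepts local pids, so remote pids are
%% checked on the node that owns them.
is_alive(Pid) when node(Pid) =:= node() ->
    erlang:is_process_alive(Pid);
is_alive(Pid) ->
    case rpc:call(node(Pid), erlang, is_process_alive, [Pid]) of
        true -> true;
        _ -> false   % false, or {badrpc, _} if the node is unreachable
    end.
```

Treating an unreachable node's processes as dead is a judgment call in this sketch; on a flaky network it may be safer to skip those pids and look at them by hand.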

Various and Sundry Solutions

Since that incident we've improved the situation in quite a few ways, in particular:

As we began to investigate the underlying issue, it turned out that some of our gen_server implementations
weren't trapping exits, and thus their terminate functions weren't being called and they
were never leaving their process groups after stopping.
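
A minimal sketch of the pattern described above, assuming a worker that joins a single pg2 group. The module name group_worker and the group name passed to start_link/1 are hypothetical, and this is an illustration rather than our actual code. Like any pg2 code, it needs an OTP release older than 24.

```erlang
-module(group_worker).
-behaviour(gen_server).

-export([start_link/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2]).

%% Group is a hypothetical pg2 group name, e.g. my_service.
start_link(Group) ->
    gen_server:start_link(?MODULE, Group, []).

init(Group) ->
    %% Without trap_exit, a shutdown exit signal from the parent kills the
    %% process outright and terminate/2 is never called.
    process_flag(trap_exit, true),
    ok = pg2:create(Group),          % returns ok even if the group exists
    ok = pg2:join(Group, self()),
    {ok, Group}.

handle_call(_Request, _From, Group) -> {reply, ok, Group}.
handle_cast(_Msg, Group) -> {noreply, Group}.

%% With trap_exit set, exits from other linked processes arrive here as
%% {'EXIT', Pid, Reason} messages; this sketch simply ignores them.
handle_info(_Info, Group) -> {noreply, Group}.

%% Leave the group on the way out so no dead pid is left behind.
terminate(_Reason, Group) ->
    pg2:leave(Group, self()),
    ok.
```

Something like group_worker:start_link(my_service) would start a worker that removes itself from my_service when it is shut down.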

Many experienced Erlangers might identify the underlying problem as
deploying Erlang code the wrong way. On the simple side there is the l function,
which reloads a module from the shell, and on the more sophisticated side is appup, which helps
manage upgrading Erlang applications.
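
To make those two options concrete: in the shell, l(my_server). purges the old code for a module and loads the newly compiled version. An appup file describes the same kind of change per application release. The fragment below is only a sketch; my_app, my_server and the version strings are hypothetical names, not anything from our system.

```erlang
%% my_app.appup: how to move between hypothetical versions 1.0.0 and 1.0.1.
{"1.0.1",
 %% Upgrading from 1.0.0: just load the new version of my_server.
 [{"1.0.0", [{load_module, my_server}]}],
 %% Downgrading to 1.0.0: load the old version of my_server back.
 [{"1.0.0", [{load_module, my_server}]}]}.
```

A load_module instruction suits changes that leave a process's state format alone; changes to the state of a running gen_server would call for an update instruction and a code_change callback instead.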

I think what we were doing was unquestionably wrong for a pure Erlang system, or for a situation where the entire
team and/or company is Erlang-fluent, but I'm less willing to concede that it doesn't make sense
for a larger corporation that has its own deployment mechanisms, which everyone is
already familiar with. Large companies are always concerned about the cost of adopting
new things, and the key to faster adoption is reducing that cost, often at the expense of purity.
(This topic probably merits a real discussion rather than just an afterthought.)

Although the system got away unscathed, this day was definitely a teachable moment for me.
Even simple and obvious solutions have a way of blowing up in the furnace of production deployment,
and a combination of more defensive coding and more extensive failure testing (a kind of
testing that I find many developers neglect to a fault) would have prevented the entire
situation from arising.

What were your closest misses with breaking a running system?

Hi folks. I'm Will, known as @lethain on Twitter.
I write about software and management topics,
and love email at lethain[at]gmail.