Hut 8 Labs

How to Square the Circle, Achieve Perpetual Motion, and Tune Your Alert Emails Just Right

“You can tune a piano,” goes the old joke, “but you can’t tuna fish.” I’ve
come to believe that you can’t really tune the automated alerts you get from
your monitoring systems either—at least in the sense that we usually mean when
we complain that “our email alerts are out of control and need to be tuned
properly.” Instead we should be spending our time and attention tuning a
larger, more complex system—of which alerts are just one part.

When you tune a piano string, you make small adjustments to return it toward
the ideal pitch from which it has drifted (say, 440 Hz for an A). It’s
overwhelmingly impractical to tune a real piano string to A perfectly,
because of the messy nature of the physical world—but the ideal you’re trying
for is straightforward and unambiguous.1 Unfortunately, there’s no such
ideal for alerts.

“Of course there is,” says a guy in back, “I’ve read about it in tons of blog
posts. The ideal for alerts is this: every time I get an alert, it should
indicate an actual problem that requires my attention.”

OK, guy at the back, that’s pretty easy to do—let’s just turn off the alerting
system altogether, and I promise that every alert you receive will be
indicative of a problem requiring your attention.

“Don’t play dumb,” he retorts. “We still want to get alerts whenever something
is wrong—as soon as we know it’s wrong, in fact—we don’t want to be sitting
around thinking things are all hunky dory when production is on fire. So
that’s the ideal towards which we tune alerts—we want alerts immediately
when there’s an actual problem, and only then.”

Fair enough, but that’s a very different kind of ideal than a string vibrating
at precisely 440 Hz. In fact it’s not even clear that this can be called an
“ideal” at all, because under this definition an alert can drift from ideal in
at least two, often contradictory directions. That is, it can fire when
there isn’t an actual problem (a false positive) or not fire when there is (a
false negative)—and when you make one of these problems better for a given
alert, you tend to make the other worse.

For example, if you’ve ever monitored CPU usage of a production system, you’ll
be familiar with the false positive alerts you get when the system becomes
briefly and legitimately busy doing a burst of real work. So you tune the
alert to back off a bit—perhaps you will tolerate up to 80% utilization
instead of 70% before alerting—only to find that some truly nasty condition
occasionally pegs the CPU right around 72% forever, causing all sorts of other
problems in the meantime. All right, you think, I’ll set the CPU threshold
back to 70% but not alert until we’ve exceeded that for 30 minutes, which is
longer than any legitimate work spike—but now you’ve guaranteed that you won’t
find out about actual problems for at least 30 minutes. So you set a new rule,
which etc. etc. etc.
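To make the tradeoff concrete, here's a minimal sketch of that threshold-plus-duration rule (the 70% threshold and 30-minute window are the example values from above; the sample data is invented for illustration):

```python
def should_alert(cpu_samples, threshold=0.70, window=30):
    """Fire only if every one of the last `window` one-minute CPU
    samples exceeds `threshold` -- the "70% for 30 minutes" rule."""
    if len(cpu_samples) < window:
        return False
    return all(s > threshold for s in cpu_samples[-window:])

# A legitimate 20-minute work spike never fires (false positive avoided)...
spike = [0.30] * 10 + [0.95] * 20

# ...but a nasty condition pegging the CPU at 72% is only reported after
# a full 30 minutes of elevated samples (guaranteed detection lag).
nasty = [0.72] * 30
```

Every knob here trades one failure mode for the other: raising `threshold` or widening `window` suppresses more false positives, while also lengthening the period during which a real problem goes unreported.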

As tricky as this balancing act is (and, if you’ve ever struggled with this in
the real world, you know that it is plenty tricky), there’s a subtle (and even
more dastardly) problem buried in this discussion so far: the tossing around
we’ve been doing of that term “an actual problem.”

What makes one condition of your system “an actual problem” and another “not an
actual problem?” There’s no unambiguous, measurable criterion you can
reference to answer that question. If you polled different people in your
business, you’d almost certainly get vastly different answers—one of the folks
in marketing might not care, for example, if page load times were above 1
second for their latest micro-site, but might care very much if it’s down,
while another might think anything over 500 millis should be synonymous with
downtime—for the very same site. So what we mean by saying an alert caught
“an actual problem” ends up being something like: “when the alert fired, it
alerted me to a situation that, with my own infinitely flexible and
idiosyncratic human knowledge and judgment, I was glad to know about at just
that time and no later.”2

This is a pretty bad situation we find ourselves in—not because there’s some
unattainable perfection our alerts can’t ever realistically achieve, but
because we don’t even have a good enough definition of perfection to let us say
whether a change to one of our alerts is actually making it better or worse.
That’s a horrific state of affairs, because it means that as we attempt to
steadily improve our operations in nice small steps, we’re just going to end up
endlessly jerking away from whichever flavor of catastrophe last burned us: our
alert volume will build up until it’s just background noise, which will lead to
an alert on an “actual problem” being ignored, which will lead to a fantastic
blamefest that results in our cutting a bunch of alerts, which will lead to an
“actual problem” that generates no alert, which will lead to another blamefest
resulting in a buildup of alerts … and round and round we go, with no end in sight.

OK, then … is all hope lost?

So how do we get out of this spiral of blame and abject existential horror?
Here’s a hint: how would we design our alerts if we had access to an infinite
supply of brilliant, unsleeping, free interns (who were also intimately
familiar with our systems and business) to respond to them? Well, assuming
we’re completely heartless3, we’d make those alerts sensitive as all hell,
because if it’s free (and we’re heartless), why not throw human intelligence
and attention at every little blip and bump we monitor to see if there’s an
“actual problem?”

What if, on the other hand, the condition on which we were alerting was
extremely benign—for example, the website on which we host our high-school
poetry going down? We’d make that alert extremely insensitive, because
honestly: who cares if it goes down for a day or two, or even a week?4

Economics to the rescue

In other words, we can recast the idea of an alert as something that doesn’t
even have a Platonic ideal in and of itself—but which is one piece of an
economic equation, with an associated cost and benefit profile.

I mean those words literally, by the way: each alert has some cost and benefit
in some actual number of probabilistic dollars and cents, where the cost is
dominated by the investment of human attention and intelligence that it
occasions, and the benefit is equal to the cost of an “actual problem” (times
the probability that the alert has identified such a problem).

Now we can start comparing the relative badness of a given false positive and
its corresponding false negative, and to see the outlines of a system that can
actually be tuned towards an unambiguous ideal—the absolute minimum overall
cost.5
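As a sketch of what that equation might look like in practice (every dollar figure below is invented for illustration, not drawn from any real system), here is how two sensitivities for the same alert could be compared:

```python
def expected_monthly_cost(alerts_per_month, triage_cost,
                          miss_prob_per_month, incident_cost):
    # Cost of the human attention spent triaging every alert (true or
    # false), plus the expected cost of incidents the alert misses.
    return alerts_per_month * triage_cost + miss_prob_per_month * incident_cost

# A sensitive alert: pages often, almost never misses a real incident.
sensitive = expected_monthly_cost(alerts_per_month=60, triage_cost=15,
                                  miss_prob_per_month=0.01,
                                  incident_cost=50_000)

# An insensitive alert: pages rarely, but misses incidents more often.
insensitive = expected_monthly_cost(alerts_per_month=5, triage_cost=15,
                                    miss_prob_per_month=0.20,
                                    incident_cost=50_000)
```

Under these made-up numbers the noisier alert is the cheaper one overall, which is exactly the kind of comparative, order-of-magnitude conclusion this framing is after.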

A gentle objection from the guy at the back

“All right,” says the guy at the back, “I’ve sat quietly for a bit, but now I
think I’ve got you. Maybe you’re right that the Platonic ideal was a little
too simplistic, but what you’re proposing is way too complex—there’s just no
way in hell you can actually figure out the cost of human attention and
intelligence and the probability of a false negative etc. etc. etc. and get all
that down to dollars and cents. In practice it’s just impossible.”

Well, guy at the back, we agree about one thing: we’re never going to calculate
those costs down to the cent, or even the dollar. But—and here’s the great
part—in practice we don’t have to—we just need back-of-the-envelope,
order-of-magnitude estimates that allow us to compare a couple of choices
(e.g. making an alert more or less sensitive) and say, relatively, which
course is probably better.

“But,” says the guy, “you’re still asking me to put a cost on things like
minutes of downtime. The leaders of my business are never going to do
that—downtime is one of those things that just can’t ever happen.”

Oh, guy at the back, let me buy you a beer or ten.

The spirit behind that “can’t ever happen” is a gigantic problem—it’s
equivalent to saying “our perpetual motion machine just can’t ever run down,
because we love our customers and failure is not an option.” This is, at its
heart, a moral argument, where blame and punishment are what’s under
consideration—and in that mindset, people often get genuinely angry if you
even suggest that there’s an economic tradeoff to consider.6

If the above sounds uncomfortably close to your own situation, you have much
more profound problems than a simple pass at your Nagios configuration could
ever solve. If you operate the software of your business, then you must be
able to reason about the economics of what you do—be it preventing downtime,
investing in backups, or even speeding up deploys—in at least comparative
orders of magnitude. If the leaders in your business refuse to partner with
you on that, then you don’t have a ton of great options.7 In this situation,
as my friend and colleague Dan Milstein says—and I don’t repeat this
lightly—“maybe the world is telling you to brush up your LinkedIn profile.”

But in my experience—and I hope that experience is far from unique—the
leaders of a business usually welcome the chance to have an economic discussion
around such operational risks and investments, as it represents a chance to
better understand (and inform) some aspects of the business’s economic equation
that are often opaque to them, and remove some of the fear and anxiety that
this opacity creates.

In Conclusion …

In practice I think you’ll find that, far from our original false Platonic
ideal of “every alert indicates an ‘actual problem,’” you’ll end up happily and
profitably tolerating some number of false positives, since for most businesses
they tend to be considerably cheaper than a single false negative.8

But thinking about alerts in their larger economic context also lets us improve
our overall economics by means of other investments too. Besides just changing
the sensitivity of our alerts, we can also improve the overall economic
equation by:

driving down the cost of receiving an alert—for example by making alerts
easier and quicker to digest and disregard if appropriate, so they consume
less human attention and intelligence

driving down the cost of the failures we want to alert on, by providing
backup or alternative systems that make failures less expensive (for example,
providing materials that allow cashiers to take those old paper impressions
of credit cards if the electronic system goes down)

By recognizing our alerts as part of a broader economic system, in other words,
we’re setting ourselves up for a world with an actual forward direction and a
lot more options for how to travel in that direction—and, correspondingly, a
lot less shaking our fist at all those email alerts in impotent rage.

1. I was originally too ambitious here, stating that there is a perfect
tuning for a piano. It turns out that, as several early readers have
pointed out to me, there’s no way to tune an entire piano perfectly. So,
we’ll stick with a string being tuned to a single note, unrelated to all
others. ↩

2. Yes, in some respect, creating a “perfect” alerting system would mean
designing an AI that contained, besides its electronic sensors, an exact,
evolving, realtime copy of all your wisdom, experience, judgment, domain
knowledge, preferences, etc.—and alerted you when it calculated, with 100%
certainty, that you would want to handle a situation—because, paradoxically,
while possessing all your wisdom, experience, judgment, domain knowledge,
preferences, etc.—as well as on-board networking and a faster CPU than
yours—the AI was somehow unable to address the situation itself. ↩

3. Yes, there’s a real moral question here about making the lives of these
poor interns unbearably miserable, but … interns. ↩

4. In my case, I’d probably want to be alerted if it somehow ever
accidentally came up, so that I could immediately shut it down again. ↩

5. Or, if you’re an optimist, the maximum overall benefit—but if you’re an
optimist, what are you doing working in operations anyway? ↩

6. For more on the moral vs. economic mindset, see Hut 8’s own Dan Milstein
talk about post-mortems, axe murderers and the stupidity of our future selves,
available as video or slides. ↩

7. In particular, it’s tempting but inadvisable to take it upon yourself to
translate this moral viewpoint into an economic one. You’d essentially just be
assigning an infinite cost to downtime, which leaves you just as lost as the
false Platonic ideal of “all alerts must indicate an actual problem” did when
it implicitly assigned an infinite cost to wasted human attention. ↩

8. There’s a corollary to this, too: often when you join an organization you
will initially experience their alert volume as horrifically out of control,
and grow to understand it as you become better acquainted with both the
workings and the economics of the systems you’re monitoring. ↩
