A Sure-Fire Recipe For Monitoring Disaster

Here’s a recipe for a threshold-based alert that will go horribly wrong, beyond a shadow of a doubt.

You install some package of plugins for monitoring MySQL’s replication status.

One of the alerts is based on replication delay relative to the master. You can choose thresholds for warning and critical delays, just like all Nagios threshold-based checks.

You feel sure that the delay behind the master is very important to monitor, so clearly you must choose not one, but two thresholds. What are the right numbers?

You reason that the server ought to run with no delay in normal circumstances. But just to give it lots of room for an occasional abnormality, you set a warning at 1 minute, and a critical alert at 5 minutes. That seems too lax, but you’re afraid to set the tolerances any tighter.
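To make the setup concrete, here is a minimal sketch of what a two-threshold check boils down to. The function name `classify_delay` and its parameters are made up for illustration and do not come from any real plugin. The exit codes follow the standard Nagios plugin convention (0 for OK, 1 for WARNING, 2 for CRITICAL), and the sketch assumes a status fires only when the delay is strictly greater than its threshold.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

def classify_delay(delay_seconds, warn=60, crit=300):
    """Map a replication delay in seconds to a Nagios status code.

    Hypothetical helper: the defaults are the 1-minute warning and
    5-minute critical thresholds chosen in the story.
    """
    if delay_seconds > crit:
        return CRITICAL
    if delay_seconds > warn:
        return WARNING
    return OK

print(classify_delay(0))    # 0: no delay, all clear
print(classify_delay(90))   # 1: past the warning threshold
print(classify_delay(400))  # 2: past the critical threshold
```

Everything the check knows is in those two numbers, which is why choosing them feels so important.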

What happens next? You deploy the checks and you don’t see any alerts. You leave work thinking that tomorrow you should probably tighten the thresholds.

Because really, if your application is reading data that’s 5 seconds stale, something is wrong and someone’s got to do something about it.

At 3am the nightly batch job kicks off and replication gets delayed by 18 minutes, catching back up at 4:30. You get paged. You learn something new: replication gets badly delayed every night. Who knew!

And during the day? Replication delay never exceeds a second. How are you supposed to set a threshold in circumstances like this? You can’t. If you’re a competent Nagios admin, the best you can do is suppress the alert between 3 and 5am every day. Problem “solved.”
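The suppression workaround looks roughly like this. In Nagios itself you would do it in configuration rather than code, so treat this as a sketch of the logic only; `should_page` and its parameters are hypothetical names, and the 3am to 5am window and 5-minute critical threshold come from the story.

```python
from datetime import time

def should_page(delay_seconds, at, crit=300,
                quiet_start=time(3, 0), quiet_end=time(5, 0)):
    """Return True if a critical delay should page someone.

    Hypothetical helper: pages are suppressed inside the quiet window
    [quiet_start, quiet_end), no matter how large the delay is.
    """
    in_quiet_window = quiet_start <= at < quiet_end
    return delay_seconds > crit and not in_quiet_window

print(should_page(18 * 60, time(3, 30)))  # False: nightly batch job, suppressed
print(should_page(18 * 60, time(14, 0)))  # True: the same delay at 2pm pages
print(should_page(1, time(14, 0)))        # False: normal daytime delay
```

Notice what the window hides: a delay from a genuine failure at 3:30am is silenced just like the one from the batch job.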

There’s already a lot wrong with this scenario, but it gets worse. The business analytics team needs a batch job with intensive queries that runs every 5 minutes all day long, and it usually delays replication by 5 to 10 seconds for a quarter of a minute or so. Occasionally it spikes further, into the 30-second range or longer, and when it crosses the one-minute mark you get a warning message from Nagios, followed unhelpfully by an all-clear message. You try to optimize the queries, but they can’t be shortened. At first the alert doesn’t happen often, but then it’s happening twice a day, and then ten times a day. You open up the Nagios config and set the warning to 5 minutes and critical to 15 minutes. Some time later you have to increase this even further, because “rare” long delays aren’t rare any more.

At this point a sane person asks what use the Nagios check is, since it’s telling you about a non-problem you’re unable to solve. If you’re smart, you simply remove the useless alert on replication delay. But the real problem isn’t replication delay; it’s thresholds in general.

Anytime you configure an alert like the above, you’ve just picked a fight, and you’re on the losing team. I will talk about the reasons for this in my next post.

By the way, the story above? It’s not just a story; it really happened. I did that myself in 2007. Face, meet palm.