December 18, 2014

3am Pages Suck!

As sysadmins, we all know the pain that comes from getting paged at 3am because some computery thing somewhere has caught on fire. It’s dark, you were having a perfectly pleasant dream about saving the world from killer robots, or cake, or something, when all of a sudden your phone starts making a noise like a car alarm. It’s your good friend Nagios, disturbing your slumber once again with word of a problem and very little else.

We might hate it for being the bearer of bad news, but Nagios is a well-known and time-tested monitoring and alerting tool. It does its job well: it runs the checks we tell it to, when we tell it to, and it dutifully whines when those checks fail. The problem with its whining, however, is that by default there is very little context around it.

Adding Context to 3am

As an example, let’s take a look at everyone’s favorite thing to get woken up by, the disk space check.

From a default Nagios notification, we know that the disk space has just crossed the warning threshold. We know the amount and percentage of free space on this volume. We know which volume is having this issue, and what time the notification was sent. But that is all it tells us. Was this volume gradually getting close to the threshold and just happened to go over it during the night? If so, we probably don’t care at 3am - a nice slow increase means that it won’t explode before morning and can be fixed then instead. On the other hand, was there a sudden drastic increase in disk usage? That’s another matter entirely, and something that someone probably should get out of bed for.

This kind of additional context provides really valuable information as to how actionable this alert is. And when we get disk space alerts, one of the first things we do is to check how quickly the disk has been filling up. But in the middle of the night, that’s asking an awful lot - getting out of bed to find a laptop, maybe arguing with a VPN, finding the right graphite or ganglia graph - who wants to do all that when what we really want to do is go back to sleep?

Now picture the same alert with a bunch of the most relevant context added in for us. We start with a visual indicator of the problematic volume and how full it is, so eyes bleary from sleep can easily grok the severity of the situation. Next is a ganglia graph of the volume over the past day, to give an idea of how fast it has been filling up (and if there was a sudden jump, when it happened, which can often help in tracking down the source of a problem). The threshold is there as well, so we can tell if a critical alert is just barely over the threshold or OH HEY THIS IS REALLY SUPER SERIOUSLY CRITICAL GET UP AND PAY ATTENTION TO IT. Finally, we have alert frequency, to know easily if this is a box that frequently cries wolf or one that might require more attention.

Introducing Formatters

All this is done by way of formatters used by nagios-herald. nagios-herald is itself just a Nagios notification script, but these formatters can be used to do the heavy lifting of adding as much context to an alert as can be dreamt up (or at least automated). The Formatter::Base class defines a variety of methods that make up the core of nagios-herald’s formatting. More information on these methods can be found in their documentation, but to name a few:
* add_text can be used to add any block of plain text to an alert - this could be used to add information such as which team to contact if this alert fires, whether or not the service is customer-impacting, or anything else that might assist the on-call person who receives the alert.
* add_html can add any arbitrary HTML - this could be a link to a run-book with more detailed troubleshooting or resolution information, it could add an image (maybe a graph, or just a funny cat picture), or just turn the alert text different colors for added emphasis.
* ack_info can be used to format information about who acknowledged the alert and when, which can be especially useful on larger or distributed teams where other people might already be working on an issue (maybe that lets you know that somebody else is so on top of things that you can go back to sleep and wait until morning!).
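
To make those helpers a little more concrete, here is a toy sketch in Ruby of how they might fit together. ToyFormatter is a made-up stand-in, not nagios-herald's actual base class: the method arguments, the vars hash and its keys, and the run-book URL are all assumptions for illustration, so check the project's documentation for the real signatures.

```ruby
# Toy stand-in for the formatter base class. This is NOT nagios-herald's
# real implementation; it only mimics the helpers named above.
class ToyFormatter
  attr_reader :text, :html

  # vars is a hypothetical hash of values describing the alert.
  def initialize(vars = {})
    @vars = vars
    @text = +""
    @html = +""
  end

  # Append a block of plain text to the alert body.
  def add_text(str)
    @text << str
  end

  # Append arbitrary HTML to the alert body.
  def add_html(str)
    @html << str
  end

  # Add a line saying who acknowledged the alert and when, if anyone did.
  def ack_info
    author = @vars["ack_author"]
    return if author.nil?
    add_text("Acknowledged by #{author} at #{@vars["ack_time"]}\n")
  end
end

formatter = ToyFormatter.new({ "ack_author" => "alice", "ack_time" => "03:07" })
formatter.add_text("Owner: storage team. Customer-impacting: no.\n")
formatter.add_html(%(<a href="https://wiki.example.com/runbooks/disk">Run-book</a><br>))
formatter.ack_info

puts formatter.text
puts formatter.html
```

Keeping the plain-text and HTML bodies separate in the sketch is a design assumption too; it just makes it easy to see which helper contributed what.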

All of the methods in the formatter base class can be overridden in any subclass that inherits from it, so the only limit is your imagination. For example, we have several checks that look at graphite graphs and alert (or not) based on their value. Those checks use the check_graphite_graph formatter, which overrides the additional_info base formatter method to add the relevant graph to the Nagios alert.

In that override, the formatter calls other methods from the base formatter class, such as add_html or add_attachment, to pull in all the relevant information we wanted to add for these graphite-based checks.
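
As a rough idea of what such an override can look like, here is a small self-contained Ruby sketch. It is not the real check_graphite_graph formatter: ToyBaseFormatter, GraphSketchFormatter, the constructor arguments, and the graph file path are hypothetical, and the sketch assumes the graph image has already been saved to disk rather than fetching it from graphite.

```ruby
# Toy stand-in for the base formatter; nagios-herald's real class differs.
class ToyBaseFormatter
  attr_reader :html, :attachments

  def initialize(check_output)
    @check_output = check_output
    @html = +""
    @attachments = []
  end

  # Append arbitrary HTML to the alert body.
  def add_html(str)
    @html << str
  end

  # Record a file to attach to the notification.
  def add_attachment(path)
    @attachments << path
  end

  # Default behavior: just echo the check's output.
  def additional_info
    add_html("<b>Additional info:</b> #{@check_output}<br>")
  end
end

# Hypothetical subclass for graph-based checks.
class GraphSketchFormatter < ToyBaseFormatter
  def initialize(check_output, graph_path)
    super(check_output)
    @graph_path = graph_path
  end

  # Override: keep the default output, then attach and embed the graph.
  def additional_info
    super
    add_attachment(@graph_path)
    add_html(%(<img src="#{File.basename(@graph_path)}" alt="graph"><br>))
  end
end

formatter = GraphSketchFormatter.new("value 97 exceeds threshold 90", "/tmp/graph.png")
formatter.additional_info

puts formatter.html
puts formatter.attachments.inspect
```

Calling super first keeps whatever the base class already adds, so the subclass only has to describe what is different about graph-based checks.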

Now What?

If you’re using Nagios and wish its alerts were a little more helpful, go ahead and install nagios-herald and give it a try! From there, you can start customizing your own alerts by writing your own formatters - and we love feedback and pull requests. You’ll have to wrangle some Ruby, but it’s totally worth it for how much more useful your alerts will be. Getting paged in the middle of the night still won’t be particularly fun, but with nagios-herald, at least you can know that the computers are pulling their weight as well. And really, if they’re going to be so demanding and interrupt our sleep, shouldn’t they at least do a little bit of work for us when they do?