
nagios-herald

nagios-herald is a project that aims to make it easy to provide context in Nagios alerts.

It was created from a desire to supplement an on-call engineer's awareness of conditions surrounding a notifying event. In other words, if a computer is going to page me at 3 AM, I expect it to do some work to help me understand what's failing.

Customizing Nagios Alerts

Nagios is a time-tested monitoring and alerting tool used by many Operations teams to keep an eye
on the shop. It does an excellent job of executing scheduled checks, determining when a threshold has been exceeded, and sending alerts.

Past experience with Nagios has shown that, typically, those alerts provide little information beyond the fact that a host is down or a service is not responding as defined by check thresholds. It's bad enough to be woken up by an alert; it would make the on-call experience more bearable if the alerts could tell the engineer more about what's going on. But what's useful in an alert?

When notified, an engineer often performs a set of procedures to gather information about the event before attempting to correct it. Imagine being able to automatically perform those procedures (or some subset) at the time of the alert. Imagine further that the results of those procedures are embedded in the alert!

Enter nagios-herald!

Generic Nagios Alert

While a generic Nagios alert does provide the necessary information, it could be formatted for better legibility. For example,
the following line, which contains the information we need, is dense and may be difficult to
parse in the wee hours of the morning:

Common questions would be "Which volume is problematic?" or
"Why is this considered a 'WARNING' alert?" In this example, it's not readily apparent what
those answers are. Let's add that context with nagios-herald.

Nagios Alert with Context

nagios-herald can highlight and colorize text, embed images (such as Ganglia graphs), include search results, and much more.

The previous disk space alert example can be tailored to look like this:

Notice the handy stack bar that clearly illustrates the problematic volume? See that Ganglia graph
showing disk space utilization for the node in the last 24 hours? Curious why the alert fired? Check
the highlighted df output that neatly defines which threshold was exceeded and why.

NOTE: In this example, the Nagios check ran df and supplied that input.

More Examples

For more examples of nagios-herald in action, see the example alerts page.

This is possible because nagios-herald provides extensible formatters.

Formatters

Adding context to alerts is done by the formatters. Formatters generate all the content that may
be used by one or more message types. For example, text returned by a Nagios check
can be highlighted to grab the operator's attention.
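To make the highlighting idea concrete, here is a minimal Ruby sketch. It is not nagios-herald's formatter API: the `highlight_df_lines` helper, its parameters, the `<b>` markers, and the sample df output are all hypothetical, chosen only to illustrate how text returned by a check could be marked up before it is placed in a message.

```ruby
# Hypothetical helper for illustration; not part of nagios-herald's API.
# Wraps each df output line whose Use% meets or exceeds the threshold
# in the given markers so it stands out in the alert body.
def highlight_df_lines(df_output, threshold_pct, open_tag: "<b>", close_tag: "</b>")
  df_output.lines.map do |line|
    text = line.chomp
    # Look for a percentage such as "95%"; the header's "Use%" has no digits.
    match = text.match(/(\d+)%/)
    if match && match[1].to_i >= threshold_pct
      "#{open_tag}#{text}#{close_tag}"
    else
      text
    end
  end.join("\n")
end

# Sample df output, invented for this example.
df = <<~DF
  Filesystem      Size  Used Avail Use% Mounted on
  /dev/sda1        50G   20G   30G  40% /
  /dev/sdb1       100G   95G    5G  95% /var
DF

puts highlight_df_lines(df, 90)
```

With a threshold of 90, only the `/var` line is wrapped in the markers; the header and the `/` volume pass through unchanged.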

Installing nagios-herald

Installation of nagios-herald is as easy as cloning this repository to a location of your choice.
To enable nagios-herald to send notifications, configure Nagios and,
optionally, write a config.yml file. At a minimum, specify the logfile configuration
variable.
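As a rough illustration of the minimum configuration, the following Ruby sketch parses a config.yml-style document and checks that `logfile` is set. The `load_config` helper and the log path are assumptions made for this example; nagios-herald's own configuration loading may work differently.

```ruby
require "yaml"

# Hypothetical validation helper; nagios-herald's own loader may differ.
# Parses YAML text and insists that the `logfile` variable is present.
def load_config(yaml_text)
  config = YAML.safe_load(yaml_text) || {}
  raise ArgumentError, "config must be a YAML mapping" unless config.is_a?(Hash)

  logfile = config["logfile"]
  if logfile.nil? || logfile.to_s.strip.empty?
    raise ArgumentError, "config.yml must set 'logfile'"
  end
  config
end

# Example config.yml contents; the path is illustrative, not a project default.
sample = <<~CONFIG
  logfile: /tmp/nagios-herald.log
CONFIG

config = load_config(sample)
puts config["logfile"]
```

Failing early on a missing `logfile` keeps a misconfigured notification from failing silently when an alert fires.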

Dependencies

Ruby Gems

nagios-herald and its tools depend on the following Ruby gems:

app_conf

choice

mail

Stack Bars

Generating stack bars requires the following (which are included in this project for your convenience):