The sysadmin community loves monitoring tools. Partly because of Tom Limoncelli’s time management book, I’ve been using Nagios for several years now to monitor all sorts of IT hardware and software. It includes features for scheduling check scripts and contact rules (page me during weekdays, page someone else at night), and dashboard views for the visually oriented (curiously unnecessary once you have alerting configured well). I’ve found it takes an investment of time and attention to configure all the checks you want, but it’s worth it. Because of its open-source, extensible nature, Nagios is especially good for monitoring weird things that other enterprise monitoring systems aren’t even aware of. There are a lot of books about Nagios available, and I’d recommend them for anyone new to Nagios. I found James Turnbull’s book published by Apress particularly useful, though it looks like it might be due for a new edition by now.

One strategy from the security world on which sysadmins and developers both agree is enumerating goodness (take note: “antivirus” is Doing It Wrong). Basically, it’s too hard to guess all the myriad ways your technology might fail and then write monitoring or test scripts for each of them. Instead, focus your monitoring and testing on covering the important functionality of your product.

But at what level of detail should we monitor? When the monitoring system wakes you up at 2 am, which do you want it to say?

1. CRITICAL. No users can log in to http://myapp/login.

2. CRITICAL. The app can’t contact the database.

3. CRITICAL. The database server is down/unpingable.

My preference is “4. All of the above”, because the user-facing effect is sometimes hard to guess from the sysadmin-facing event (and vice versa). Nagios is good at #3 and maybe #2. #1 is more in the realm of functional testing, and your app may already have functional test scripts that could provide this sort of information.

check_tap.pl

Plugin Documentation (more or less straight from check_tap.pl --help)

check_tap.pl

This plugin allows Nagios to check the output of anything that emits
Test Anything Protocol output. So you can wed Nagios’s monitoring and
alerting infrastructure to your unit and functional tests for deep
application-level monitoring in development or even in production.

-e, --exec="/path/to/executable args"

-w, --warning=INTEGER:INTEGER

Minimum and maximum number of allowable test FAILURES, outside of which a
warning will be generated. Default is 0 tolerable failures.

-c, --critical=INTEGER:INTEGER

Minimum and maximum number of allowable test FAILURES, outside of
which a critical will be generated. Default is 0 tolerable failures.

-t, --timeout=INTEGER

Seconds before plugin times out (default: 15)

-v, --verbose

Show details for command-line debugging (can repeat up to 3 times)

Verbosity

Use -v to see a bit more info in the one line, including the first
test that failed. This is especially useful because Nagios will
include it in the alert/notification.

Use -vv to see test summary and failures.

Use -vvv to see full test script output.

Warning and Critical Thresholds

THRESHOLDs for -w and -c specify the allowable amount of test failures
before the plugin returns WARNING or CRITICAL. Use ‘max’ or ‘min:max’.

The default of 0 tolerated failures is good for people like you who
have high standards. But you might want to crank up the CRITICAL
threshold if you want to differentiate between WARNING and CRITICAL
amounts of fail.

See more threshold examples at
http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT

Examples:

check_tap.pl -s /full/path/to/testfoo.pl

will run ‘testfoo.pl’ and return OK if 0 tests fail, but CRITICAL if
any fail. Excluding TODO or SKIPped tests, of course.

check_tap.pl -s /full/path/to/testfoo.pl -c 2

will return OK if 0 tests fail, WARNING if more than 0 tests fail,
and CRITICAL if more than 2 fail.

Non-Perl and remote test scripts

check_tap.pl -e '/usr/bin/ruby -w' -s /full/path/to/testfoo.r

will run ‘testfoo.r’ using Ruby with the -w flag.

You can use any shell command and argument which produces TAP output,
for example:

check_tap.pl -e '/usr/bin/curl -sk' -s 'http://url/to/mytest.php'

check_tap.pl -e '/usr/bin/cat' -s '/path/to/testoutput.tap'

In fact, anything TAP::Harness or prove regards as a source or
executable.

Remember that Nagios or NRPE will likely be running this command as a
different, less-privileged user than you’re using now.

License

This nagios plugin is free software, and comes with ABSOLUTELY NO WARRANTY.
It may be used, redistributed and/or modified under the terms of the GNU
General Public Licence (see http://www.fsf.org/licensing/licenses/gpl.txt).

Installation and Nagios configuration

Save check_tap.pl anywhere that makes sense and that the nagios user can access. The default plugin location depends on your distribution. For RHEL and its ilk, plugins are in /usr/lib/nagios/plugins/.
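As a rough sketch of the Nagios side, command and service definitions along these lines might work, assuming the plugin was saved in the directory that the standard $USER1$ macro points to. The host name myapp-server, the generic-service template, the service description, and the two-argument layout are assumptions for illustration, so adjust them to your own configuration.

```cfg
# Hypothetical command definition: $ARG1$ is the test script path,
# $ARG2$ is the maximum number of failures tolerated before CRITICAL.
define command {
    command_name    check_tap
    command_line    $USER1$/check_tap.pl -s $ARG1$ -c $ARG2$
}

# Hypothetical service: host name and template are placeholders.
define service {
    use                     generic-service
    host_name               myapp-server
    service_description     App functional tests
    check_command           check_tap!/full/path/to/testfoo.pl!2
}
```

Passing the script path and threshold as arguments lets one command definition serve several test scripts with different tolerances.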

You can also use NRPE for *nix, or NSClient++/nscp or NC_Net for Windows (you’ll also need Perl, like Strawberry Perl), to run check_tap.pl on a machine other than the Nagios server. I’ll leave that as an exercise for the reader, or maybe a future blog post.

Doughnuts for Developers

Wow! Now I can use Nagios’s monitoring, alerting, performance logging, acknowledging, scheduling, dashboarding, and thresholding features for my unit and functional tests! I can run them with different rules in build and production! I can detect random and senseless acts of system administration! I can detect new bugs as soon as I write them! I can hook Nagios up to the sprinkler system to put a damper on that hotheaded programmer in the next cube! I can write frontend functional tests with Selenium or AutoIt or AppleScript and have Nagios alert (the sysadmins of course) when any of my app’s core functionality breaks!!!

There are plenty of other test harnesses out there, but Nagios has some big advantages, especially for monitoring production systems.

Shortcuts for Sysadmins

Two last things which are more good news for sysadmins:

Custom test scripts for things Nagios won’t easily check

Writing your own Nagios plugins is a little bit hard, even with the example plugin I worked on ever so long ago. Writing TAP test scripts, on the other hand, is easy! So, after you’ve set up Nagios monitoring for all the low-hanging fruit like hardware utilization and network availability, write a test script for some of the harder to monitor signs of enumerable goodness. Here’s an example:

#!/usr/bin/perl
use warnings;
use strict;

# This allows the script to run as a CGI or on the command line.
# We have to output the header in a BEGIN block or Test::Simple will
# output the plan too soon.
BEGIN {
    if ($ENV{REQUEST_METHOD}) {
        # require (not use) so CGI.pm is only loaded when running as a CGI.
        require CGI;
        print CGI::header("text/plain");
    }
}

use Test::Simple tests => 4;
use Test::File;

# These use Test::File to test file permissions and size.
# See http://search.cpan.org/perldoc?Test::File
file_writeable_ok "path/to/cache";
file_not_writeable_ok "path/to/index.html";

my $backup_file = "/usr/local/backups/myapp-backup.tar.bz2";
file_min_size_ok $backup_file, 911377;

# -M is a file test operator returning the file's last modified age in days.
# See http://perldoc.perl.org/functions/-X.html
ok( -M $backup_file < 1, "Backup file is newer than 1 day old" );

Just run that script as a CGI or with NRPE etc. and you’ve got instant monitoring of things like backup success, which is otherwise kind of hard to hook in to Nagios. You could get even fancier with checksums or whatever you want.
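To illustrate the checksum idea, here is a minimal sketch that compares a file's SHA-256 digest with an expected value. The helper name sha256_of is hypothetical, and a temporary file stands in for a real backup so the sketch runs anywhere; in practice the expected digest would come from wherever your backup job records it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Test::More tests => 2;
use Digest::SHA;
use File::Temp qw(tempfile);

# Hypothetical helper: hex SHA-256 digest of a file, read in binary mode.
sub sha256_of {
    my ($path) = @_;
    return Digest::SHA->new(256)->addfile($path, 'b')->hexdigest;
}

# Stand-in for a real backup file (assumption, for illustration only).
my ($fh, $file) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} 'abc';
close $fh or die "Cannot close $file: $!";

# Well-known SHA-256 digest of the three bytes "abc".
my $expected = 'ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad';

ok( -s $file, "Backup stand-in exists and is not empty" );
is( sha256_of($file), $expected, "Backup stand-in matches its recorded checksum" );
```

Because the script emits TAP, it can be handed to check_tap.pl like any other test script.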

Tolerating ambiguity

In some cases, a certain amount of failure is acceptable. I’ve been monitoring connectivity between crucial hosts with a test script. I don’t want to be bothered by this check if one of our Citrix VMs isn’t available to the gateway server for a while, but it’s a problem if they’re ALL inaccessible because of a firewall issue or something. The configurable warning/critical thresholds on check_tap.pl allow me to define how much aggregate failure is worth getting excited about, which is hard to do with vanilla Nagios configuration.

Here’s the gist of my test script. ping_ok() and service_ok() come from a module I wrote 50 years ago to check basic connectivity/responsiveness with Net::Ping.

diag "Checking Citrix farm connectivity from DMZ\n";
my @stas = qw(
    citrixdc01.mydomain.com
    citrixdc02.mydomain.com
);
foreach (@stas) {
    ping_ok($_, 'http');
    service_ok($_, 'http', "XML service on $_ is responding in some way");
}
my @citrix_farm = qw(
    citrixps01.mydomain.com
    citrixps02.mydomain.com
    ...
    citrixps29.mydomain.com
    citrixps30.mydomain.com
);
foreach (@stas, @citrix_farm) {
    service_ok($_, '1494', "ICA service on $_ is responding in some way");
}
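To make the threshold idea concrete, here is a minimal sketch of how a failure count could map to a Nagios status under the simple 'max' form of -w and -c, following the behavior described in the plugin documentation. The function status_for is hypothetical and is not taken from check_tap.pl's source, and the 'min:max' form is not handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper, not check_tap.pl's actual code: map a count of failed
# tests to a Nagios status name, given the maximum tolerated failures for
# WARNING and CRITICAL (the simple 'max' threshold form only).
sub status_for {
    my ($failures, $warn_max, $crit_max) = @_;
    return 'CRITICAL' if $failures > $crit_max;
    return 'WARNING'  if $failures > $warn_max;
    return 'OK';
}

# Mirrors the documented "-c 2" example: default warning threshold of 0,
# critical threshold of 2.
for my $failures (0 .. 3) {
    printf "%d failed => %s\n", $failures, status_for($failures, 0, 2);
}
```

For a farm check like the one above, raising both thresholds (say, a hypothetical -w 2 -c 10) would keep one or two unreachable VMs quiet while still going CRITICAL when a firewall problem fails most of the checks.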

5 Responses

Ranjib Dey · June 8, 2011 at 7:32 pm

Well, monitoring tools like Nagios are mostly used against production or mission-critical infrastructure and apps, while unit tests are part of development practice, and there are dedicated tools (called Continuous Integration tools) for checking unit test output. Jenkins/Hudson, CruiseControl, Goldberg, Go, etc. are a few of them. These tools can monitor a code repository (git/svn/hg etc.), pull in new commits, run the unit/functional tests automatically, and then notify/alert accordingly.

If your unit tests are failing, then it means they are not going via a CI tool. To me testing and monitoring are pretty different.

Nevertheless, I like the idea :-). We do use nagios-cucumber for workflow monitoring against some of our apps.

Cool! Nagios-cucumber looks like a great tool. I knew I wasn’t the only one thinking along those lines.

You’re right that Nagios isn’t a Continuous Integration tool. I’m just thinking that functionality can suffer because of things other than new commits, especially on a production app. So reusing existing functional tests for monitoring seems like a good way to increase monitoring coverage.

And I think sysadmins could benefit from using test tools to prove assertions about the environment and infrastructure underneath the application.

Thanks for your feedback, Ranjib!

Henry · August 8, 2013 at 12:59 pm

You guys are not alone. I have recently been trying to run some web automation tests and would like to integrate them with Nagios, because I don’t really need the Git pulling part. I just need the tests to run every minute and notify me when things fail. But I do totally see the need to bring the best of the two worlds together here. I would use Jenkins and integrate it with Nagios if I had to.

But kudos to Nathan for creating this plugin so I can actually use only Nagios and not have to spin up an additional server for Jenkins.

Nathan,
Thanks for the link to your slide. That’s exactly what I am currently working on. I originally set up Selenium to work with Jenkins, but I would like to have Nagios trigger the test instead of Jenkins.

Can you tell me if there is a way to trigger check B after check A, and only when check A was successful? This was easy to do with Jenkins, but if I can do the same with Nagios, that would be the best. I am using check_mk on top of Nagios.
