Archive for
October, 2010

Haven’t updated in a while, so I’ll write a good post to make up for it.

Recently, I’ve encountered the need to set up fully redundant nagios servers in a typical pair setup. However, reading the documentation [here], the solutions seemed lacking. The official method is to simply run two different machines with the same configuration. The master should do everything a normal nagios server should do, but the slave should have its host & service checks and notifications disabled. Then, a separate mechanism is set up so that the slave can “watch” the master, and enable the aforementioned features if the master goes down. Then, still checking, the slave will disable those features again when the master comes back alive.

Well, this solution sounds just fine in theory, but in practice it really creates several more problems than it solves. For instance, acknowledgements, scheduled downtimes, and comments made do NOT synchronize with the official method. Their mechanism does not allow for this, as it uses the obsessive-compulsive service and host parameters, which can be executed after every single check is run. It therefore has no access to the comment/acknowledgement data, so it simply cannot synchronize.

So can this data be synchronized? The short answer is yes, but I’ll explain the game plan before we dive in. Nagios can (unknowingly) provide us with another synchronization method, its own internal retention.dat file. This file is, by default, written to every 90 minutes by the nagios process, and contains all of the information necessary to restore nagios to the state it was in when it exited. Sounds like exactly what we need! So we’re going to now stop thinking about running nagios, and start thinking about how we can take this blob of ASCII data, and ensure it never gets corrupted and is as frequently as possible being updated. This is, after all, the true goal of the situation.

First and foremost, we will need a nagios installation! You can follow their documentation on this one and set one up for yourself. It’s ok, I can wait.

Second, we need another nagios installation on a second server! Hop to it.

Third, all of the nagios configuration files need to be constantly synchronized between these two hosts. I use puppet to synchronize my config files over the servers I administer, so unfortunately my implementation is highly specific. You may need to find other ways to synchronize your config files, but this should not be terribly difficult (perhaps a Makefile with a versioning repository and ssh keys?). This is needed in the official nagios failover deployment as well. Anyway, one “gotcha” I faced was the need to change the configuration parameter “retain_state_information=90” to “retain_state_information=1”. Do not forget to do this or else synchronizations will only occur once every 90 minutes.

Fourth, you will need to deploy this script on both hosts, and configure the requirements. You will see embedded ERB syntax in this script, that is because puppet allows me to configure discrepancies in my deployment inline, as the final configuration files are generated on-the-fly, then pushed to the clients.

# This is where we point the servers at each-other (configure this properly in your deployment!)…
<% if fqdn == “nagios1.example.com” %>
MASTERHOST=192.168.1.2
<% else %>
MASTERHOST=192.168.1.1
<% end %>

# Ensure only one copy can run at a time…
PIDFILE=/var/run/nagios-watchdog.pid
if [ -e ${PIDFILE} ]; then
exit 1;
else
touch ${PIDFILE};
fi

There is a single requirement to this script, you must give no-password ssh keys to the nagios accounts on each host, but you can use those securely by using the allowed commands directives of the authorized_keys file.

Fifth, and finally, we must implement a mutex operation around running nagios processes. Recall that we are synchronizing copies of nagios internal state data, and having a running nagios process is just a luxury. If you look at the script above, it simply ensures that nagios is running one, but not both servers, and ensures that the newest retention.dat file always has priority. The mutex operation doesn’t need to be infinetely accurate, I used the following relatively barbaric solution:

Sixth, and optionally, as you can see above, I’ve also set up redundant reporting. We do a similar test to ensure that at a maximum, only one email report is dispatched for the given timeframe. In this solution, reports could be theoretically lost forever if a specific set of circumstances is met, but that was deemed acceptable in this deployment. To see the real magic behind that script:

And voila, of course nagios-reporter.pl could be any report generation tool you wish, just be sure to call it in the method that suits your reporting needs.

Seventh, and convenient to have, I also wrote these two quick PHP scripts. Throw them in /var/www/html on each nagios box and do not redirect straight to nagios. Then setup DNS in a round-robin multiple A-record fashion. That is,

So now, you can simply visit http://nagios.example.com and nagios will always be displayed, unless a particularly bizarre set of circumstances has occurred. If this is the case, don’t panic! Remember, we set our minds correctly in the beginning and the integrity of the retention.dat files is not in question. The scipts may just take a minute or two to adjust themselves properly. For those that worry the DNS failover wouldn’t work, I’ve verified that it does on some of the popular browsers. There is no 90-second timeout delay, either, as in all but the rarest circumstances. I verified that a timeout can occur if the first connect() call’s SYN packets are dropped completely, but this is the aforementioned rare circumstance. Most testing on this is done at the iptables level, but be sure to REJECT the SYN packets (not DROP!) if you want an accurate account of the speed of the failover in your web-browser during a real-life outage. Also ensure that your router will send proper ICMP Host Unreachable responses if one of the addressed hosts is offline.

I think that’s pretty much all you need to get going. This deployment has been running for a little while now in a production environment, and has been rock-solid. It’s a bit more work than the official solution, but it solved my monitoring needs and extensive testing, both real-world and artificial, has not yet revealed any issue with this solution.