We run a virtual hosting service and use Nagios to monitor both the virtual machines and the physical servers upon which they live.

If there’s a network outage, we only get notifications about the physical servers being down; because of how we’ve set up host dependencies, the VMs are merely unreachable. We pay per SMS for notifications, so we don’t want to overdo it.

If there’s a power outage the same thing happens, but when the physical servers come back it takes time to get the VMs back. During that time Nagios sees that the physical servers are reachable but the VMs are not, decides that the VMs are now in the DOWN state, and starts sending out all the notifications we didn’t want to get. Not great, and it can get costly in terms of utterly pointless text messages.

I can see what Nagios is doing and why it’s doing it, but I don’t know how to configure it so that it doesn’t, i.e. so that it recognises that those VMs being “down” is part of the same outage in which they were just “unreachable”. Has anybody got any good ideas on how to do this?

You can set higher values for these directives in your virtual host definitions (depending on how long the VMs need to boot up and become reachable):

max_check_attempts #
check_interval #
retry_interval #

For example, if it takes 5 minutes for the virtuals to come up after the real host returns to the UP state, set max_check_attempts to 6 and set check_interval and retry_interval to 1. If the interval_length variable in nagios.cfg is set to 60, those intervals are in minutes. That way the virtual will be checked six times, one minute apart, before Nagios treats it as DOWN and notifies, and within those 6 minutes or so it should come up. If it takes 10 minutes for them to come up, combine the values so that the checks span slightly more than 10 minutes, for example max_check_attempts 6 with check_interval and retry_interval 2, or max_check_attempts 12 with check_interval and retry_interval 1.
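As a rough sketch of the first example, a VM host definition might look something like the one below. The host name, address, and the generic-host template are placeholders I have assumed for illustration (the template is assumed to supply contacts, check command, and notification periods), so adapt them to your own config:

```
# Hypothetical VM host definition; names and address are placeholders.
define host {
    use                 generic-host   ; assumed template with contacts, check_command, periods
    host_name           vm01           ; hypothetical VM name
    address             192.0.2.10     ; documentation-range address, replace with the VM's IP
    check_interval      1              ; 1 x interval_length (60s) = 1 minute between regular checks
    retry_interval      1              ; 1 minute between rechecks after a failed check
    max_check_attempts  6              ; six consecutive failures before a hard DOWN and notification
}
```

For the 10-minute case, the same definition with max_check_attempts 12, or with both intervals set to 2, should give the longer window.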

Keep in mind that these settings apply all the time. Even if the real host is not down, when one of your virtual hosts goes down it will take about 6 or 12 minutes (per the examples above) to get the first notification about it.

Thanks. Yeah, I don’t really want to have to compromise on the quick response if an individual VM drops. Maybe I just need to either take the hit on the notifications front (it really shouldn’t happen anyway) or find a better SMS solution.