Mozilla IT & Operations

Monitoring with “Non-Obvious” Nagios

John’s viewpoints and religion (you may not agree):
Monitoring is Exceptions, Trending, History
UNIX philosophy: Effective tools, not kitchen sink – Choose the best tool(s) for the job
SNMP is Your Friend – Use it whenever you can
Solve any problem in computer science with another level of indirection

Nagios has:
Discrete components
Well-defined interfaces
Great documentation
Nagios core just schedules and executes
It’s just an engine – that’s the simple genius of it

Well-defined and simple interfaces between all the parts; that’s part of the brilliance.

He talks about compiling Nagios, and goes on to say most folks use packages. Then John mentions monitoring status.dat, maybe with check_file_age, to make sure that Nagios is still running. And then he talks about the basics of configuration, which anyone familiar with Nagios already knows.
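The status.dat idea can be sketched in a few lines of Python. This is not check_file_age itself, just a hypothetical stand-in showing the logic: if the status file has not been modified recently, Nagios has probably stopped. The function name and the threshold values in the example are assumptions.

```python
import os
import time

# Standard plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

def check_age(path, warn_seconds, crit_seconds, now=None):
    """Return (exit_code, message) based on how long ago `path` was modified."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError as exc:
        # A missing status file is treated as a failure of Nagios itself.
        return CRITICAL, f"CRITICAL - cannot stat {path}: {exc}"
    if age >= crit_seconds:
        return CRITICAL, f"CRITICAL - {path} is {age:.0f}s old"
    if age >= warn_seconds:
        return WARNING, f"WARNING - {path} is {age:.0f}s old"
    return OK, f"OK - {path} is {age:.0f}s old"
```

A real check would print the message and exit with the returned code, and it would have to run from somewhere other than the Nagios instance being watched (cron, or a second Nagios server).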

Plugins: use the Nagios Plugins, or you can just write one; it’s not a hard thing to do. There are helpers like Nagios::Plugin in Perl. Then there’s more talk about plugin basics, how they work and performance data.
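To show how little a plugin needs, here is a minimal sketch in Python (not from the talk): one line of output, optional performance data after a “|”, and an exit code of 0, 1, 2 or 3 for OK, WARNING, CRITICAL or UNKNOWN. The load check and the thresholds of 3 and 5 are only an illustration.

```python
import os

# Standard plugin exit codes and their labels.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

def evaluate(load1, warn, crit):
    """Map a 1-minute load average to (exit_code, output_line)."""
    if load1 >= crit:
        code = 2
    elif load1 >= warn:
        code = 1
    else:
        code = 0
    # Performance data follows the "|": label=value;warn;crit;min
    perfdata = f"load1={load1:.2f};{warn};{crit};0"
    return code, f"LOAD {STATES[code]} - load1 is {load1:.2f} | {perfdata}"

def main(warn=3, crit=5):
    """Print the plugin output line and return the exit code."""
    try:
        load1 = os.getloadavg()[0]
    except (AttributeError, OSError):
        print("LOAD UNKNOWN - load average not available")
        return 3
    code, line = evaluate(load1, warn, crit)
    print(line)
    return code

# A real plugin would finish with: sys.exit(main())
```

Keeping the threshold logic in its own function makes it easy to test without touching the machine’s real load.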

Put secrets in an extra-opts file (plugins built with --enable-extra-opts read it via --extra-opts) – they’re then hidden from ps on the server.

If a plugin does something you want, but you want a little more, you can write a wrapper around an existing script.
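A wrapper can be sketched like this in Python. It runs an existing plugin, keeps its exit code and performance data, and adds a note to the text part of the output. The function name and the idea of appending a note are assumptions for illustration; the command you wrap is whatever plugin you already use.

```python
import subprocess

def run_wrapped(command, note):
    """Run an existing plugin and append `note` to its first output line."""
    try:
        proc = subprocess.run(command, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return 3, f"UNKNOWN - could not run {command[0]}: {exc}"
    lines = proc.stdout.splitlines() or ["(no output)"]
    # Split the human-readable text from the performance data, if any.
    text, sep, perfdata = lines[0].partition("|")
    first = f"{text.strip()} ({note})"
    if sep:
        first += f" | {perfdata.strip()}"
    # Anything outside 0..3 (e.g. killed by a signal) is reported as UNKNOWN.
    code = proc.returncode if proc.returncode in (0, 1, 2, 3) else 3
    return code, first
```

The wrapper would then print the returned line and exit with the returned code, so Nagios sees it as just another plugin.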

He goes on to talk about plugin basics, there are lots available (Nagios Exchange – 3rd party plugins), explains the difference between local/remote checks, and active/passive checks. John doesn’t use NRPE, he relies on SNMP, but that’s his self-admitted religion.

Configuration:
Talks about a bit more of the basics, especially the files themselves, that variables are case-sensitive, etc.

check_result_reaper_frequency – how often Nagios reaps check results: by default, every 10 seconds it looks at each fork (plugin) to see if it’s finished. If you set this lower, there’s no harm [says John], and a child waiting to be reaped is part of the CPU LOAD count on a unix machine – so it’s worth it, and things will finish quicker.

There’s also a variable for the frequency of checking the contents of a directory [missed which], again this can be lowered.

Set cfg_dir to point to where the object definitions are, then leave it alone and put the files in there, as opposed to updating the cfg_file variable every time you add/remove a file. The order and location of things in config files are irrelevant.

The last value of a directive/variable (same thing) in a definition is used, so you can define a directive more than once, but only the last one counts. You can also inherit from templates, including multi-template inheritance (comma-separated list). Templates are usually specified with “register 0”. You can also use a registered object as a template if you want, but it’s a bad idea: the point is that you have either a template or an instantiation, and if you’re in that situation you can probably abstract the common parts out.

“use” is evaluated first, no matter where in the definition it appears.

Nagios object template inheritance is a directed acyclic graph. Slide 25 is pretty useful, shows precedence and inheritance. Of course if you arrange templates such that the order doesn’t matter, then you’re golden no matter what.
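A simplified model of that precedence, written in Python rather than Nagios config, may help. It is only a sketch under stated assumptions: local values beat inherited ones, the first template in the “use” list beats later ones, and each template’s own ancestors are searched before moving on to the next template. The object names and the dict layout are made up, and Nagios itself handles more cases (such as “+” appending) than this does.

```python
def resolve(name, objects, _seen=()):
    """Return the effective directives of `name` after template inheritance."""
    if name in _seen:
        raise ValueError(f"inheritance cycle through {name!r}")
    obj = objects[name]
    merged = {}
    # Apply later templates first so that earlier ones overwrite them.
    for parent in reversed(obj.get("use", [])):
        merged.update(resolve(parent, objects, _seen + (name,)))
    # "register" belongs to the template itself and is not passed down.
    merged.pop("register", None)
    # Local values always win over inherited ones.
    merged.update({k: v for k, v in obj.items() if k != "use"})
    return merged

objects = {
    "generic-host": {"register": "0", "check_interval": "5", "contact_groups": "sysadmin"},
    "big-box": {"register": "0", "check_interval": "10", "_load_warn": "8"},
    "db1": {"use": ["big-box", "generic-host"], "contact_groups": "dba"},
}
effective = resolve("db1", objects)
```

Here db1 ends up with its own contact_groups, and with check_interval from big-box because big-box is listed first.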

You can append to an inherited value with +, for instance “directive +value”. There’s no subtraction, but with host groups and hosts you can use ! for exclusion, e.g. “mysql_slaves,!db3”.

Custom directives start with an underscore (_) and are case-insensitive. For example, you might want one for your SNMP community string. In your generic host template, define _snmp_community, and if there’s a different one for a specific host, you can define that in the host. And instead of “-c password”, you do “-c $_HOSTSNMP_COMMUNITY$”.

Custom object variables: define a variable called “_load_warn” as 3 and “_load_crit” as 5 in the template, so the default check has those. On big machines, change them in the host definition. Refer to them as macros or environment variables, e.g.:

macro $_HOSTBLOOP$
environment variable NAGIOS__HOSTBLOOP
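From a script’s point of view, the environment form looks like this. The sketch below (Python, with a made-up helper name) assumes environment macros are enabled on the Nagios server and follows the naming shown above: NAGIOS_ plus _HOST plus the upper-cased variable name.

```python
import os

def custom_host_var(name, default=None, environ=None):
    """Look up a custom host variable such as _load_warn in the environment."""
    environ = os.environ if environ is None else environ
    # _load_warn on the host becomes NAGIOS__HOSTLOAD_WARN (note the double underscore).
    key = "NAGIOS__HOST" + name.lstrip("_").upper()
    return environ.get(key, default)
```

Values arrive as strings, so a plugin would convert them (for example with float()) before comparing against a measurement.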

Implied Inheritance
Nagios will sometimes assume a value from a related object
Service objects will inherit from the associated host: contact_groups, notification_interval, notification_period
Hostescalations and serviceescalations will similarly inherit as well, except notification_period becomes escalation_period

Timeperiods are nice b/c you can exclude time periods (like holidays). There are lots of examples.

Command/service definition vs. quoting – Command line quoting is sometimes challenging, so try to avoid special characters in your arguments. John’s advice, put quotes in the service definition only.

host notification options:
Don’t have “u” in things that page you; that way, if things are unreachable you won’t get paged for them, only for the firewall (or whatever) that is actually down. Also take off the “s” unless you want to be paged when scheduled downtime happens. John uses d,r.

Escalations are to a contactgroup.

Host definitions: host has a host_name and address (IP or FQDN). An IP address avoids alerts if DNS fails, but is harder to maintain. John recommends using FQDNs and having locally cached DNS on the Nagios box.

check_command is used to see if a host is up, usually it’s a ping of some kind, and it’s only checked if a service on the host fails. parents are a list of routers and gateways between Nagios and the servers. One tip is to define a “google” machine, and if Nagios can’t get to Google, all heck has broken loose b/c the whole network is gone.

hostgroups are useful for admin grouping of hosts.

Services:
In Nagios terms, a “service” could be an aspect of a running system, like disk capacity, or memory utilization. A “service” needn’t be offered externally to a device

Nagios tests services based on:
max_check_attempts — how many times to check a service before concluding it is actually down – e.g. maybe a mail queue peaks and that’s OK, but not for more than 30 min. used in conjunction with the next 2.

normal_check_interval — how many “time units” to wait between regular service checks

retry_check_interval – how many “time units” to wait before checking a service that is not “OK”

contact_groups — who to complain to in case of a problem
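The way these settings combine is simple arithmetic, sketched here in Python. It assumes the first failed check counts as attempt 1, that retries then run every retry_check_interval, and that a “time unit” is the default 60 seconds; scheduling latency is ignored, and the function name is made up.

```python
def minutes_until_hard_state(max_check_attempts, retry_check_interval, interval_length=60):
    """Minutes from the first failed check until Nagios concludes the service is down."""
    retries = max_check_attempts - 1  # the first failure is attempt 1
    return retries * retry_check_interval * interval_length / 60

# Tolerate a mail queue peak for 30 minutes: 7 attempts, retried every 5 time units.
mail_queue_grace = minutes_until_hard_state(max_check_attempts=7, retry_check_interval=5)
```

With those numbers the grace period comes out to 30 minutes, matching the mail queue example.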

———
You can attach a URL for notes (notes_url) or for an action (action_url). Interesting, so if you have a frequent problem you could put the relevant URLs here.

dependencies & escalations are a good thing. With escalations, only add contactgroups, so that the oncall person doesn’t think it’s fixed when the manager starts getting paged.

Avoid repetition:
General rule: anywhere you can list a host_name or hostgroup_name you can:
– use a comma-separated list of hosts/groups
– exclude with !
– use a wildcard host_name of “*”, meaning “all hosts”, to have it apply (or not) to multiple hosts

e.g. A service definition for the HTTP service might include “hostgroup_name webservers” to cause the service to be defined for all hosts in the webservers hostgroup

Generic Notification Author (genoa) – uses environment variables.

Most macros are added to the environment, e.g. NAGIOS_SERVICESTATE, including any custom variables.
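A notification script built on the environment can be sketched as follows. This is not genoa, just a hypothetical Python illustration of the approach: collect everything that starts with NAGIOS_ and build a message from it. The subject layout is an assumption.

```python
import os

def build_message(environ=None):
    """Build (subject, body) for a notification from NAGIOS_* environment variables."""
    environ = os.environ if environ is None else environ
    prefix = "NAGIOS_"
    macros = {k[len(prefix):]: v for k, v in environ.items() if k.startswith(prefix)}
    subject = "{} {}/{} is {}".format(
        macros.get("NOTIFICATIONTYPE", "?"),
        macros.get("HOSTNAME", "?"),
        macros.get("SERVICEDESC", "?"),
        macros.get("SERVICESTATE", "?"),
    )
    # Dump every macro, custom variables included, so nothing has to be passed as arguments.
    body = "\n".join(f"{name}: {value}" for name, value in sorted(macros.items()))
    return subject, body
```

Because the script reads the environment, the notification command definition stays short and never needs quoting of macro values.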

“On-Demand Macros” allow you to refer to values from other config settings e.g.
$SERVICESTATEID:novellserver:DS Database$

e.g. db1 is down but db2 is still up.

“On-Demand Group Macros” get you a comma-separated list of all values in a host, service or contact group e.g.
$HOSTSTATEID:hg1:,$

Stalking – If enabled, stalking logs any change in plugin output, even with no state change, for later review/analysis. It also turns off the acknowledgement, so you need to ack again, because it’s a new problem.
– e.g. RAID check was “1 disk dead” and is now “2 disks dead”

Volatile services – Something that resets to OK after each check. For things that need attention every time there is a problem. Notification and event handler happen once per failure – e.g. with an intrusion detection system, you want to know about every event.

If you define your topology (e.g. parents) it’s easier to find the root cause of stuff.

execution_failure_criteria and notification_failure_criteria determine what we do if something we depend on fails, e.g. if file server down, don’t execute web check and don’t notify me about web problem

Cached check results are used only for “On-Demand Checks”, e.g. checking that a host is up if a service fails, checking topological reachability, “predictive dependency checks”, and checking for “collateral damage”. Lower overhead, good results. You should enable and tune the cache.

Event Handlers:
In a perfect world, nothing would ever go wrong. In a semi-perfect world, problems would fix themselves. Event handlers are one of Nagios’ ways of moving closer to perfection.

An event handler is a command that is run in response to a state change. Canonical example: restart httpd if WWW service fails

But you could do things like open a trouble ticket on failure. You can have global and specific host and service event handlers.

Complications: runs as the nagios user, on the nagios server.
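The canonical example can be sketched in Python. It assumes the event handler command passes the service state, state type and attempt number as arguments, and the restart policy (act on a HARD CRITICAL, or on the third SOFT attempt) is just one possible choice. The restart command is hypothetical; since the handler runs as the nagios user on the Nagios server, a real one would need sudo, ssh or similar to reach the web server.

```python
import subprocess

# Hypothetical restart command; adjust for where and how httpd really runs.
RESTART_COMMAND = ["sudo", "/sbin/service", "httpd", "restart"]

def should_restart(state, state_type, attempt):
    """Decide whether this state change warrants a restart."""
    if state != "CRITICAL":
        return False
    return state_type == "HARD" or (state_type == "SOFT" and attempt >= 3)

def handle(state, state_type, attempt, run=subprocess.call):
    """Run the restart command if the policy says so; return its exit code or None."""
    if should_restart(state, state_type, int(attempt)):
        return run(RESTART_COMMAND)
    return None
```

Passing the runner in as a parameter keeps the decision logic testable without restarting anything.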

External Commands

The Nagios server maintains a named pipe in the file system for accepting various commands from other processes. External commands are used most often by the web interface to record information and modify Nagios’ behaviour. But you can do lots of things from shell scripts. Some of the available functionality:
– Add/delete host or service comments
– Schedule downtime, enable/disable notifications
– Reschedule host or service checks
– Submit passive service check results
– Restart or stop the Nagios server

Nagios can accept service check results from other programs. Since Nagios did not initiate the check, these are called “passive service checks”. Useful for embedded Nagios, asynchronous events, results from other, existing programs.
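Submitting a passive result is a one-line write to the command pipe. The Python sketch below formats a PROCESS_SERVICE_CHECK_RESULT external command; the helper names are made up, and the location of the pipe is whatever command_file is set to in your Nagios configuration.

```python
import time

def passive_result_line(host, service, code, output, now=None):
    """Format a passive service check result as an external command."""
    timestamp = int(time.time() if now is None else now)
    return f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{output}\n"

def submit(command_file, line):
    """Write one command to the Nagios command pipe."""
    # Opening a named pipe for writing blocks until Nagios has it open for reading.
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

For example, a backup script could finish with submit(path_to_command_file, passive_result_line("db1", "backup", 0, "OK - backup finished")), where the host and service names match definitions Nagios already has.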

Nagios supports distributed monitoring of a certain style. Remote Nagios servers are essentially probe engines, submitting their results to a central server with passive service check results. The configuration on the remote servers is a subset of the central configuration. The central server is configured to notice if the passive results stop coming from the remote server.

The “central aggregation” approach is used by a number of more recent tools, such as Nagios Fusion, Thruk (slide 104), MNTOS (slide 104), and Multisite (slide 104).

Adaptive Monitoring – Can change things during runtime via external commands – e.g. schedule changes, or from an exception handler. I wonder if this could be useful for oncall rotation

Scaling Up
– Nagios can handle a lot without much effort
– As you get larger, advanced features are more important
– Use parent/child and host/service dependencies
– More efficient for humans and machines
– You will need to be more rigorous in your configuration
– Consistency, completeness, tuning
– Version 3 adds scalability and tuning features

Tips/tricks:
– Use the parent/child topology
– Pre Nagios 3, host checks are not parallelized
– Host checks of a down segment can block all other checks
– Be consistent and use templates and groups
– Make it easy to add another similar host
– Make it easy to add a service to a group of hosts
– Smarter plugins make life (configuration) easier (e.g. default thresholds)

With multiple Nagios servers use allow_empty_hostgroup_assignment=1 – You can define machine types as common hostgroups, even if you don’t have every type on every Nagios server. So nagios1 might not have a web server b/c it’s in an office and it won’t refuse to start nagios because a type isn’t used.

Organize Your Config Files
– Put files in different directories
– One host per config file
– Generate configs from other information you already have
– Or use a script to generate from a list
– Take advantage of your naming convention
– Wildcards in host names based on FQDNs
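Generating one config file per host from a list can be sketched in Python like this. The template name generic-host, the alias choice and the file naming are assumptions; the list of FQDNs would come from whatever inventory you already have.

```python
import os

def host_definition(fqdn, template="generic-host"):
    """Render a host object definition that inherits from `template`."""
    short_name = fqdn.split(".")[0]
    return (
        "define host{\n"
        f"    use        {template}\n"
        f"    host_name  {fqdn}\n"
        f"    alias      {short_name}\n"
        f"    address    {fqdn}\n"
        "}\n"
    )

def write_host_configs(fqdns, directory):
    """Write one <fqdn>.cfg file per host and return the paths written."""
    os.makedirs(directory, exist_ok=True)
    paths = []
    for fqdn in sorted(set(fqdns)):  # duplicates in the inventory are ignored
        path = os.path.join(directory, f"{fqdn}.cfg")
        with open(path, "w") as handle:
            handle.write(host_definition(fqdn))
        paths.append(path)
    return paths
```

If the target directory is one that Nagios already reads as a config directory, adding a host is just a matter of regenerating and reloading.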

Contacts: sysadmin, sysadmin-email, sysadmin-page for different levels of contacting.

check_allstorage plugin – made by John – you don’t need to set limits in the Nagios config. It gets the list of filesystems from the device, caches it in a /tmp dir, and estimates thresholds based on current usage. NICE. The Nagios checks are on his resources page.

Web server monitoring hack – Got a visible web server that can run PHP or CGI? Set up a “hidden” web page to run your check. Use Auth or allow/deny rules to limit access. Use check_http to look for a regular expression. Get remote status over port 80.

Hosts don’t actually have to exist – you can make up a service check like “mail” that will hit a generic MX record.