Chapter 4: Nagios Basics

Chapter 4 - from the book Nagios: System and Network Monitoring by Wolfgang Barth -- Reprinted by permission from No Starch Press and Open Source Press. Available at booksellers now. Full book details are at the bottom of the article.

Nagios Basics

The fact that a host can be reached, in itself, has little meaning if no service is
running on it on which somebody or something relies. Accordingly, everything in
Nagios revolves around service checks. After all, no service can run without a host.
If the host computer fails, it also cannot provide the desired service.

Things get slightly more complicated if, for example, a router lies between the
users and the system providing the services. If the router fails, the desired
service may still be running on the target host, but the user can no longer
reach it.

Nagios is in a position to reproduce such dependencies and to precisely inform
the administrator of the failure of an important network component, instead of
flooding the administrator with irrelevant error messages concerning services that
cannot be reached. An understanding of such dependencies is essential for the
smooth operation of Nagios, which is why Section 4.1 will look in more detail at
these dependencies and the way Nagios works.

Another important item is the state of a host or service. On the one hand, Nagios
allows a much finer distinction than just "ok" or "not ok"; on the other hand,
the distinction between soft state and hard state means that the administrator
does not have to deal with short-term disruptions that have long since disappeared
by the time the administrator has received the information. These states also
influence the intensity of the service checks. How this functions in detail is
described in Section 4.3.

4.1 Taking into Account the Network Topology

How Nagios handles dependencies of hosts and services can be best illustrated
with an example. Figure 4.1 represents a small network in which the Domain Name
Service on proxy is to be monitored.

Figure 4.1: Topology of an example network

The service check always serves as the starting point for monitoring that is regularly
performed by the system. As long as the service can be reached, Nagios takes no
further steps; that is, it does not perform any host checks. For switch1, switch2,
and proxy, such a check would be pointless anyway, because if the DNS service
on proxy responds, then the hosts mentioned are automatically accessible.

If the name service fails, however, Nagios tests the computer involved with a host
check, to see whether the service or the host is causing the problem. If proxy
cannot be reached, Nagios might test the parent hosts entered in the configuration
(Figure 4.2). With the parents parameter of the host definition, the administrator
can give Nagios information on the network topology.

Figure 4.2: The order of tests performed after a service failure.

When doing this, the administrator enters as the parent of each host only its
direct neighbor on the path to the Nagios server. Hosts located in the same
network segment as the Nagios server itself are defined without a parent. For the
network topology from Figure 4.1, the corresponding configuration (reduced to the
host name and parent) names switch2 as the parent of proxy and switch1 as the
parent of switch2.

switch1 is located in the same network segment as the Nagios server, so it is not
allocated a parent computer. What belongs to a network segment is a matter of
opinion: if you interpret the switches as the segment boundary, as is the case
here, this has the advantage of isolating a disruption more closely. But you can
also take a different view and interpret an IP subnetwork as a segment. Then a
router would form the segment boundary; in our example, proxy would then count as
being in the same network as the Nagios server. However, it would no longer be
possible to distinguish between a failure of proxy and a failure of switch1 or
switch2.

Figure 4.3: Classification of individual network nodes by Nagios.

If switch2 in the example fails, Figure 4.3 shows the sequence in which Nagios
proceeds: first the system, when checking the DNS service on proxy, determines
that this service is no longer reachable (1). To differentiate, it now performs a host
check to see what the state of the proxy computer is (2). Since proxy cannot be
reached, but it has switch2 as a parent, Nagios also subjects switch2 to a host
check (3). If this switch also cannot be reached, the system checks its parent,
switch1 (4).

If Nagios can establish contact with switch1, the cause for the failure of the DNS
service on proxy can be isolated to switch2. The system accordingly specifies the
states of the host: switch1 is UP, switch2 DOWN; proxy, on the other hand, is UNREACHABLE. Through a suitable configuration of the Nagios messaging system (see
Section 12.3 on page 217) you can use this distinction to determine, for example,
that the administrator is informed only about the host that is in the DOWN state
and represents the actual problem, but not about the hosts that are dependent on
the down host.
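
The check sequence described above can be sketched in a few lines of Python. This is only an illustrative model, not Nagios code: the PARENTS mapping mirrors the parent relationships from the example, while check_path and the reachable set are hypothetical names introduced here to stand in for real host checks.

```python
# Illustrative model of the host-check sequence; not Nagios code.
# Parent relationships from the example network (child -> parent).
# switch1 sits in the Nagios server's segment and therefore has no parent.
PARENTS = {"proxy": "switch2", "switch2": "switch1", "switch1": None}


def check_path(host, reachable):
    """Walk from a host toward the Nagios server after a service failure.

    `reachable` is the set of hosts that answer a host check sent by the
    Nagios server (an assumption replacing real checks). Returns the hosts
    in the order they were checked and the state assigned to each.
    """
    order = []
    current = host
    # Check the host itself, then each parent, until one responds
    # or the path to the Nagios server is exhausted.
    while current is not None:
        order.append(current)
        if current in reachable:
            break
        current = PARENTS[current]

    states = {}
    for name in order:
        parent = PARENTS[name]
        if name in reachable:
            states[name] = "UP"
        elif parent is None or parent in reachable:
            # The nearest working neighbor is fine, so this host is the problem.
            states[name] = "DOWN"
        else:
            # The parent does not respond either, so the host is only cut off.
            states[name] = "UNREACHABLE"
    return order, states
```

For the failure discussed in the text, check_path("proxy", {"switch1"}) checks proxy, switch2, and switch1 in that order and reports switch1 as UP, switch2 as DOWN, and proxy as UNREACHABLE.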

In a further step, Nagios can determine other topology-specific failures in the network (so-called network outages). proxy is the parent of gate, so gate is also
represented as UNREACHABLE (5). gate in turn also functions as a parent; the
Internet server that depends on it is also classified as UNREACHABLE.

This "intelligence", which distinguishes Nagios, becomes more helpful the more
hosts and services depend on a failed component. For a router in the backbone,
on which hundreds of hosts and services depend, the system informs administrators
of the specific disruption instead of sending them hundreds of error messages
that are not wrong in principle but do little to help eliminate the disruption.
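
The effect on notifications can be illustrated with a small model of the whole example network. This is a simplified sketch, not the way Nagios is implemented: it takes the set of hosts that have really failed as given, and the names classify, hosts_to_notify, and "internet-server" (a placeholder for the unnamed Internet server behind gate) are assumptions made for this example.

```python
# Illustrative model only; not Nagios code.
# child -> parent. "internet-server" is a placeholder name for the
# unnamed Internet server that has gate as its parent.
PARENTS = {
    "switch1": None,
    "switch2": "switch1",
    "proxy": "switch2",
    "gate": "proxy",
    "internet-server": "gate",
}


def classify(failed):
    """Assign UP, DOWN, or UNREACHABLE to every host.

    `failed` is the set of hosts that have actually failed; this is a
    simplification, since Nagios has to find that out through checks.
    """
    def blocked(host):
        # True if any ancestor on the path to the Nagios server has failed.
        parent = PARENTS[host]
        while parent is not None:
            if parent in failed:
                return True
            parent = PARENTS[parent]
        return False

    states = {}
    for host in PARENTS:
        if blocked(host):
            states[host] = "UNREACHABLE"
        elif host in failed:
            states[host] = "DOWN"
        else:
            states[host] = "UP"
    return states


def hosts_to_notify(states):
    """Keep only the hosts that represent the actual problem."""
    return sorted(host for host, state in states.items() if state == "DOWN")
```

With switch2 as the only failed host, classify({"switch2"}) marks proxy, gate, and the Internet server as UNREACHABLE, and hosts_to_notify returns just switch2, so a single message replaces four.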

Comments

You might consider a quickstart guide in the book. Most people who purchase a book like this are interested in getting up and running, even in a minimal configuration, first... not memorizing a plethora of detail beforehand.

While manually going through the book, following it step by step to configure nagios, the daemon complained because there were missing pieces such as defining 24x7 "somewhere", which is not clearly explained. Details like that can throw a new reader off very easily.

Quote: Although the check_interval parameter provides a way of forcing regular host checks, there is no real reason to do this.

This is not true. Example: Mail Server serving up IMAP on port 143 goes DOWN due to having the power go out. When the machine gets turned back on the IMAP service is not turned on by default (or insert whatever scenario that would make the IMAP service non-functional now, iptables, hosts.deny, etc.). Nagios continues to check for port 143 listening on this server and NOT whether the machine responds or not. This machine will continue to show as DOWN as long as the service is non-responsive.

There are only two fixes that I have found for this. 1. Turn on aggressive_host_checking, which will kill any machine with more than 1000 active service checks. 2. Use a host checking mechanism as a service, preferably a quick check that sends a single ICMP packet.
