The Nagios Setup Explained

In this article, we shall discuss Nagios, an open source tool deployed in many data centres to monitor a variety of system and network parameters.

The practice of appointing prefects is a time-honoured one. In schools, colleges, armies and societies alike, overseers (or prefects) such as district magistrates and priests have, from time immemorial, played an important part in regulating and monitoring the performance and day-to-day activities of the groups of people they oversee.

Information Technology is no different — an overseer/monitor is required to constantly regulate and monitor the health of the hardware, software and network in modern-day data centres. These “prefects” of the data centre provide vital information to systems administrators — such as the amount of free disk space, network outages, application and hardware downtime, and even server-room temperatures.

Nagios (a recursive acronym for “Nagios Ain’t Gonna Insist On Sainthood”) has been one of the most favoured “prefects” of the data centre. It monitors system status (whether a system is up and running, CPU/memory/disk usage, etc.), service status (whether a service such as DNS, a Web server or a mail server is up and running), and many other factors, including room temperature and even humidity! It can generate alerts (through email/SMS) when the monitored parameters exceed preset thresholds.

As I sit down to write this article, I am glad to share with you that this “perfect prefect” has saved many of my clients hundreds of hours of downtime. Just recently, a customer decided to move a problematic database from a central database server host, because Nagios had alerted us about a possible problem with one of the schemas, which was adversely affecting the overall health of the database server, and could have severely affected other mission-critical production database schemas on the same host.

In this article, I shall try to dispel a commonly held myth: that Nagios is difficult to install. I distinctly remember that about five years ago, a senior manager in a big IT firm called me and mentioned this as one of the reasons why the company planned to outsource the installation to us. It might have required a bit of tweaking then, but that is no longer the case; you can easily install and configure it to meet your requirements.

Let us install and configure Nagios to monitor a sample service, and hence get an idea of how Nagios can benefit you and your organisation.

We’ll install Nagios on an RHEL 5 host called prefect.knafl.org. We will use it to monitor itself — whenever it is available — and send alerts to [email protected] in case of an outage. In a future article, perhaps, we will look at monitoring remote hosts and services.

Installation

On Red Hat Enterprise Linux, Nagios can be easily installed using the EPEL Repository. To the uninitiated, EPEL is: “Extra Packages for Enterprise Linux (or EPEL), a Fedora Special Interest Group that creates, maintains, and manages a high-quality set of additional packages for Enterprise Linux, including, but not limited to, Red Hat Enterprise Linux (RHEL), CentOS and Scientific Linux (SL).”

To ensure that Nagios is available in the EPEL repository, let’s browse the relevant repository (since ours is a 64-bit host, we’re looking at the x86_64 EPEL repository). On jumping to packages whose names begin with “N”, we can see that (as of this writing) there are 65 Nagios packages (RPMs) available for 64-bit RHEL 5. We can check this using the following command (on the URL for the group of packages we just mentioned):
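As a minimal sketch, assuming the EPEL repository has already been enabled on the host, the availability check and the installation could be done with yum as shown below. Note that this queries yum rather than the repository URL, and that the package names nagios and nagios-plugins-ping are assumptions based on EPEL's usual naming; adjust them to whatever the listing shows. The run helper is hypothetical and only prints each command unless DRY_RUN=0 is set, so the snippet is safe to try as a non-root user.

```shell
# Hypothetical helper: print the command by default, execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Assumed package names; confirm them against the repository listing first.
NAGIOS_PACKAGES="nagios nagios-plugins-ping"

# List the Nagios packages that the enabled repositories offer.
run yum list available 'nagios*'

# Install the core package and the ping plugin (needs root when DRY_RUN=0).
# The variable is left unquoted on purpose so that it splits into two names.
run yum install -y $NAGIOS_PACKAGES
```

Running it with DRY_RUN=0 as root performs the actual query and installation.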

Hopefully, the above packages will be installed on your system without any hiccups. Any doubts about Nagios installation being complex should now be removed.

To configure Nagios, you first need to find its configuration files — which is simple with the rpm tool’s switches. To locate configuration files provided by the Nagios package, simply run the following command:
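With rpm, the -q (query) and -c (list configuration files) switches do this. The sketch below wraps the query in a small function (the function name is hypothetical) so that it reports cleanly on hosts where rpm or the nagios package is absent.

```shell
# List the configuration files shipped by the nagios package.
# list_nagios_config_files is a hypothetical helper name.
list_nagios_config_files() {
    if ! command -v rpm >/dev/null 2>&1; then
        echo "rpm is not available on this host"
    elif ! rpm -q nagios >/dev/null 2>&1; then
        echo "the nagios package is not installed"
    else
        # -q queries the package, -c restricts the listing to config files
        rpm -qc nagios
    fi
}

list_nagios_config_files
```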

The Apache configuration file (/etc/httpd/conf.d/nagios.conf) contains the directives for the URLs http://<nagios-host>/nagios/ and http://<nagios-host>/nagios/cgi-bin/, whereas /etc/logrotate.d/nagios is the log rotation configuration file.

The main configuration file

The /etc/nagios/nagios.cfg file controls the behaviour of the Nagios process and also the CGIs. There are many configuration directives in this file, and all of them are well documented. Let us look at some of the more important ones to get our basic configuration going:

Log file: This should be the first directive in the file; it names the log file where host and service events are logged. Make sure that the file is accessible and writeable by the nagios user:

log_file=<path-of-log-file>

Nagios user and group: These are the user and group names under which the nagios process runs. The yum installation, as above, creates both a user and a group named nagios, which we will use:

nagios_user=nagios
nagios_group=nagios

Object definition file(s): This parameter can be specified multiple times. These files contain definitions for each host and service, as well as groups of hosts and services. As an example, the yum installation creates two object configuration files: commands.cfg and localhost.cfg. We will look at these a little later. The parameter syntax is as follows:
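As a sketch, the directive takes the form cfg_file=<path-of-object-file>, given once per file. The snippet below writes a sample fragment to a temporary file and lists the object files it names; the two paths are the ones used in this article, and the temporary file merely stands in for /etc/nagios/nagios.cfg.

```shell
# Temporary stand-in for /etc/nagios/nagios.cfg (sample content only).
sample_cfg=$(mktemp)
cat > "$sample_cfg" <<'EOF'
# One cfg_file line per object definition file
cfg_file=/etc/nagios/commands.cfg
cfg_file=/etc/nagios/localhost.cfg
EOF

# Extract the path from every cfg_file directive.
object_files=$(grep '^cfg_file=' "$sample_cfg" | cut -d= -f2)
echo "$object_files"

rm -f "$sample_cfg"
```

Pointing the same grep at the real /etc/nagios/nagios.cfg shows which object files an installation actually loads.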

Object cache file (object_cache_file): To speed up operations, the nagios service caches the object definitions and configuration it has read in a cache file, which is then read by the CGIs. This also prevents inconsistencies, such as when an object file is being modified and is saved before all changes are completed.

Status file (status_file): This file is where the status of all monitored hosts and services is stored by Nagios, to be processed by the CGI scripts.

Resource file (resource_file): This parameter too can be specified multiple times. Resource files contain macros that are expanded by Nagios when executing a command found in the commands file. We can look at this in detail below. The CGIs do not read these files, and they can contain sensitive information such as user names and passwords. Therefore, restrictive permissions such as 600 (only the owner can read/write) should be placed on these files. As you can see, the Nagios RPMs install these files in a separate directory, /etc/nagios/private, which is owned by the root user and readable by the nagios group:
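A minimal sketch of such a resource file and its permissions follows. It uses a temporary file in place of the real one under /etc/nagios/private, and the $USER1$ macro pointing at the plugin directory is an assumption based on the stock sample configuration; the plugin path is the one used on this 64-bit host.

```shell
# Temporary stand-in for a resource file under /etc/nagios/private.
resource_file=$(mktemp)
cat > "$resource_file" <<'EOF'
# $USER1$ is assumed to hold the plugin directory, as in the stock sample
$USER1$=/usr/lib64/nagios/plugins
EOF

# Only the owner may read or write the file.
chmod 600 "$resource_file"

# Show the resulting mode string (first ten characters of ls -l).
resource_perms=$(ls -l "$resource_file" | cut -c1-10)
echo "$resource_perms"

rm -f "$resource_file"
```

On the real host, ls -l /etc/nagios/private shows the ownership and mode that the RPM chose.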

The object and resource definition files

Objects are entities that need to be monitored, or are used for monitoring. Some examples are commands, hosts, groups, services and contacts. Let us explore a host object and a command object in this article.

Host object definitions are used to define a particular host that is being monitored; the mandatory directives are:

host_name: a short name for the host. Multiple services can be monitored on a single host. Normally, the FQDN is used.

alias: a longer description.

address: the IP address of the host being monitored.

max_check_attempts: the maximum number of times the host check is attempted when it returns a non-OK state.

check_period: the name of the time period (defined as a separate object) during which checks should be made.

contact_groups: the contact groups (people to be contacted) in case of problems (or recoveries) with this host.

notification_interval: the time interval (by default, in minutes) after which a notification is sent again if the host is still down.

notification_period: the time period in which notifications may be sent. If the host is down outside this period, no notifications are sent during that time.

notification_options: This directive can have the following values:

d: send notifications when the host is down

u: send notifications if the host is unreachable

r: send notifications on recoveries

f: send notifications when the host starts and stops flapping (flapping occurs when a service/host changes state too frequently; detecting it helps determine whether the service/host is stable)

n: no notifications will be sent

A more efficient way to use host definitions is to define templates and use them. A snippet from the file /etc/nagios/localhost.cfg that defines a template, and then uses it for a host object definition, is shown below:

define host{
    use         linux-server    ; Name of host template to use
                                ; This host definition will inherit all variables that are defined
                                ; in (or inherited by) the linux-server host template definition.
    host_name   localhost
    alias       localhost
    address     127.0.0.1
}

The use statement above specifies that this host definition uses a template called linux-server. It is defined in the same file, as follows:

define host{
    name                    linux-server      ; The name of this host template
    use                     generic-host      ; This template inherits other values from the generic-host template
    check_period            24x7              ; By default, Linux hosts are checked round the clock
    max_check_attempts      10                ; Check each Linux host 10 times (max)
    check_command           check-host-alive  ; Default command to check Linux hosts
    notification_period     workhours         ; Linux admins hate to be woken; only notify in the day
                                              ; Note that notification_period overrides the value
                                              ; inherited from the generic-host template!
    notification_interval   120               ; Resend notification every 2 hours
    notification_options    d,u,r             ; Only send notifications for specific host states
    contact_groups          admins            ; Notifications get sent to the admins by default
    register                0                 ; DONT REGISTER THIS TEMPLATE DEFINITION!
}

This template further uses a template called generic-host, which is also defined in the same file, as:
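Since the template lives in the same file, it can be printed straight from there. The helper below is a hypothetical sketch (show_template is not part of Nagios): it prints every define block whose name directive matches the requested template. The demonstration runs on a temporary file holding a shortened copy of the template definitions from this article.

```shell
# Print each "define ...{ ... }" block whose "name" directive equals $2.
# show_template is a hypothetical helper, not a Nagios command.
show_template() {
    awk -v want="$2" '
        $1 == "define" { block = ""; inblock = 1; hit = 0 }
        inblock {
            block = block $0 "\n"
            if ($1 == "name" && $2 == want) hit = 1
            if ($1 == "}") {
                if (hit) printf "%s", block
                inblock = 0
            }
        }
    ' "$1"
}

# Shortened sample based on the template definitions in this article.
sample_objects=$(mktemp)
cat > "$sample_objects" <<'EOF'
define host{
    name            linux-server
    use             generic-host
    check_command   check-host-alive
    register        0
    }
define service{
    name            local-service
    use             generic-service
    register        0
    }
EOF

template_text=$(show_template "$sample_objects" linux-server)
echo "$template_text"
rm -f "$sample_objects"
```

On the Nagios host, show_template /etc/nagios/localhost.cfg generic-host prints the generic-host template itself.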

Command definitions are used to define commands that Nagios will use. They can include macros from resource definition files. The command used in the localhost.cfg file for localhost is defined in /etc/nagios/commands.cfg as:
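To illustrate how macros are filled in, the sketch below expands a command line by hand. The command_line shown is an assumed form of the check-host-alive definition (the thresholds and packet count may differ in your commands.cfg); $USER1$ is assumed to be set to the plugin directory in the resource file, and $HOSTADDRESS$ comes from the address directive of the host being checked. Nagios performs this substitution internally; the sed call only imitates it.

```shell
# Assumed command_line for check-host-alive; check commands.cfg for the real one.
command_line='$USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1'

# Macro values: plugin directory from the resource file, address from the host object.
USER1=/usr/lib64/nagios/plugins
HOSTADDRESS=127.0.0.1

# Imitate the macro substitution that Nagios does before running the check.
expanded=$(printf '%s\n' "$command_line" |
    sed -e 's|\$USER1\$|'"$USER1"'|g' -e 's|\$HOSTADDRESS\$|'"$HOSTADDRESS"'|g')
echo "$expanded"
```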

Once the host, host-groups, commands and time periods have been defined, it is time to define services. For the purpose of this introductory article, we will use only the ping service. Again, the service definition sections in the sample configuration file listed below are self-explanatory.

define service{
    use                     local-service     ; Name of service template to use
    host_name               localhost
    service_description     PING
    check_command           check_ping!100.0,20%!500.0,60%
}

This definition uses a template, local-service, defined as:

define service{
    name                    local-service     ; The name of this service template
    use                     generic-service   ; Inherit default values from the generic-service definition
    check_period            24x7              ; The service can be checked at any time of the day
    max_check_attempts      4                 ; Re-check up to 4 times to determine final (hard) state
    normal_check_interval   5                 ; Check service every 5 minutes normally
    retry_check_interval    1                 ; Re-check every minute until a hard state can be determined
    contact_groups          admins            ; Send notifications to all in the 'admins' group
    notification_options    w,u,c,r           ; Send warning, unknown, critical, and recovery notifications
    notification_interval   60                ; Re-notify about problems every hour
    notification_period     24x7              ; Notifications can be sent out at any time
    register                0                 ; DONT REGISTER THIS TEMPLATE DEFINITION!
}

The local-service template further uses a template, generic-service. For our use-case scenario, please ensure that you comment out all other service definitions in this configuration file.
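The check_command value in the PING service definition packs the command name and its arguments into one string, separated by exclamation marks. The sketch below splits it in the same way that Nagios does when it fills the $ARG1$ and $ARG2$ macros; the final command line assumes that check_ping is defined in commands.cfg with -w $ARG1$ -c $ARG2$, which is the usual form but should be verified on your host.

```shell
# check_command value taken from the PING service definition.
check_command='check_ping!100.0,20%!500.0,60%'

# Split on "!" into the command name and its two arguments.
IFS='!' read -r command_name arg1 arg2 <<EOF
$check_command
EOF

echo "command: $command_name"
echo "ARG1 (warning):  $arg1"   # round-trip time in ms, packet loss in %
echo "ARG2 (critical): $arg2"

# Assumed resulting plugin call for localhost.
ping_call="check_ping -H 127.0.0.1 -w $arg1 -c $arg2"
echo "$ping_call"
```

With these values, a warning is raised when the average round-trip time exceeds 100 ms or packet loss reaches 20%, and a critical state when it exceeds 500 ms or packet loss reaches 60%.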

To sum up the various files used: based on the default configuration, our Nagios instance is set up as follows:

It monitors localhost (IP address 127.0.0.1).

The host is monitored 24×7.

The host is checked using the command /usr/lib64/nagios/plugins/check_ping.

Notifications are sent if the host is down, is unreachable or has recovered.

Notifications go to [email protected], but are sent only during workhours, and are resent every two hours if the host is still down or unreachable.

The CGI configuration file

The CGI configuration file (/etc/nagios/cgi.cfg) configures the CGI scripts and the Web GUI of Nagios. The significant parameters are:

main_config_file: The path of the main Nagios configuration file, which tells the CGI scripts where to find it.

physical_html_path: The filesystem path for Nagios HTML files.

url_html_path: The URL portion appended to the base URL, that will access the Nagios HTML files.

refresh_rate: Specifies the refresh rate (in seconds) for various CGIs such as status.

use_authentication: Specifies whether the CGI scripts should use authentication.
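A sketch of how these directives might look in cgi.cfg follows. The values are assumptions for illustration (in particular the HTML path and the 90-second refresh rate), and the temporary file stands in for /etc/nagios/cgi.cfg. The small cfg_get function is a hypothetical helper for reading a single directive.

```shell
# Temporary stand-in for /etc/nagios/cgi.cfg with assumed sample values.
sample_cgi=$(mktemp)
cat > "$sample_cgi" <<'EOF'
main_config_file=/etc/nagios/nagios.cfg
physical_html_path=/usr/share/nagios
url_html_path=/nagios
refresh_rate=90
use_authentication=1
EOF

# Hypothetical helper: print the value of directive $1 from file $2.
cfg_get() {
    sed -n "s|^$1=||p" "$2"
}

url_path=$(cfg_get url_html_path "$sample_cgi")
auth=$(cfg_get use_authentication "$sample_cgi")
echo "Web GUI path: $url_path, authentication enabled: $auth"

rm -f "$sample_cgi"
```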

Once Nagios has been configured, you will need to add an authentication file to be able to access Nagios pages. By default, the Apache configuration directives (specified in /etc/httpd/conf.d/nagios.conf) rely on basic authentication, and allow access only from localhost. The user authentication file /etc/nagios/passwd needs to be created. You can do this using the htpasswd command:
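A minimal sketch of this step, and of bringing the services up afterwards, follows. The file path and the nagios-admin user are the ones used in this article; the service commands assume RHEL 5 style init scripts named nagios and httpd. As the commands need root, the hypothetical run helper only prints them unless DRY_RUN=0 is set.

```shell
# Hypothetical helper: print the command by default, execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create the password file and add the nagios-admin user (prompts for a password).
run htpasswd -c /etc/nagios/passwd nagios-admin

# Verify the configuration before starting anything.
run nagios -v /etc/nagios/nagios.cfg

# Start Nagios and restart Apache so that nagios.conf takes effect.
run service nagios start
run service httpd restart
```

Note that htpasswd -c creates the file afresh, so the -c switch should be dropped when adding further users to an existing file.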

If the Nagios logs are fine, you should now open your browser and connect to http://localhost/nagios/, authenticate as nagios-admin and check the Host summary. The configured host, localhost, should be up.

There is a wealth of information available on Nagios, and the documentation provided along with the installation is also quite good. Go on, build your prefect and manage your data centre.

Varad has over 15 years of experience as a system architect, where his bread-and-butter business is the architecture design, configuration and deployment of Linux clusters, load balancers, messaging solutions, Linux-based domain controllers and LDAP, virtualisation (KVM), databases, network architectures, etc. When he has some time off from his 18-hours-a-day work grind, he loves to fill in as a trainer and writer.


Open Source For You is Asia's leading IT publication focused on open source technologies. Launched in February 2003 (as Linux For You), the magazine aims to help techies avail the benefits of open source software and solutions. Techies that connect with the magazine include software developers, IT managers, CIOs, hackers, etc. A free DVD, which contains the latest open source software and Linux distributions/OS, accompanies each issue of Open Source For You. The magazine is also associated with different events and online webinars on open source and related technologies.