Network and System Monitoring Primer

Recently I had a discussion with a colleague about the monitoring
of our systems and network devices. I showed him everything we measure,
and he wondered whether it was overkill. I told him that some things
might be, but that good monitoring is the first step to knowing
what happens in your network, and that knowing what happens in your
network is the first step to being able to isolate problems when they
arise.

So the question is: "What do you need to monitor?" The answer
is easy: "Everything". That is a pretty big amount. Let reality
kick in and rephrase it to: "Everything you can think of".
That is too vague and misses the final goal, so make it: "Everything you
assume to be the normal conditions for a system or service to run
properly".

"Everything you assume to be the normal conditions for a system
or service to run properly": that is quite a lot, and it is different
for each service you provide and each device you have on the network. It
will force you to dig into your network and servers to
find out what is going on.

This also means that you need to find out how your network and
services behave. In the beginning you will get a number of alerts
that are false positives, and you will need to tune your alert
settings. Or you will get the same alerts every day at the same
time: they are normal system behaviour. You can choose between
getting these messages every day and changing the alert settings so
that you won't get them anymore. History shows that the latter option
isn't the smartest one, because it will hide possible issues from
you.

A lot of the items described in this primer can be considered
too nitpicky, which might or might not be an issue for you at
this moment. Keep in mind that you should monitor for normal
operations! Everything that happens in your network and systems
and isn't normal is worth investigating!

Software

This primer tries to be generic, but is based on my experience with
Nagios. At the end there will be a link to the scripts I use for
non-standard Nagios features.

Nagios is described as "a host, service and network monitoring
program". If you are a beginner, you will find its configuration
files horrible, but once you get through that, it is easy to expand.
Nagios does all its checking by executing scripts in its
libexec/ directory. That doesn't mean it is limited
to checking services that are reachable over the network on other hosts.
For checks that have to run on the remote host itself there is NRPE,
the Nagios Remote Plugin Executor. It runs as a daemon on the
remote hosts and executes the same Nagios scripts, installed locally there.
(FreeBSD: net-mgmt/nagios and net-mgmt/nrpe2)

Systems monitoring

There are several components which need to be monitored on a system:
hardware (disks, CPU), the operating system and the services.

Hardware

These days you can get a lot of information about the components
of your motherboard: CPU temperature, internal temperature,
fan speeds and power voltages. Higher temperatures are bad for
your motherboard. Fan speeds which are suddenly much higher
or lower indicate that a fan is broken, which might cause
higher temperatures. And power voltage changes indicate problems
with your power supply.
On Linux, this information can be gathered from /proc/acpi.
On FreeBSD this can be gathered via sysutils/healthd.

IDE hard disk characteristics can be monitored via the SMART interface
(Self-Monitoring, Analysis and Reporting Technology), for example
the temperature of the disks and a handful of counters:
Reallocated Sector Count, Seek Error Rate, Spin Retry Count,
Calibration Retry Count, Reallocated Event Count, Current Pending
Sector and UDMA CRC Error Count. If these counters go up, there
might be a problem with your hard disk.
On Linux and FreeBSD this data can be gathered via the smartmontools
software (FreeBSD: sysutils/smartmontools).
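
As a rough sketch of how "counters going up" could be detected, the Python below parses the attribute table that smartctl -A prints and compares the raw values with a previously stored set. The function names and the idea of keeping the previous values somewhere are assumptions made for illustration, and the raw value of some attributes (Seek Error Rate in particular) is vendor specific, so treat the list as a starting point.

```python
import re

# Attribute names as smartctl prints them (underscored, sometimes abbreviated).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Seek_Error_Rate",
    "Spin_Retry_Count",
    "Calibration_Retry_Count",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "UDMA_CRC_Error_Count",
}


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} for the watched attributes."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the raw value is the tenth column.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            match = re.match(r"\d+", fields[9])
            if match:
                values[fields[1]] = int(match.group())
    return values


def increased_counters(previous, current):
    """Names of counters whose raw value is higher than last time.

    A counter that was not seen before is treated as a new baseline.
    """
    return sorted(
        name for name, value in current.items()
        if value > previous.get(name, value)
    )
```

The text passed to parse_smart_attributes would come from running smartctl -A on the disk; how the previous values are stored between runs is left open here.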

RAID hardware is beautiful: a broken hard disk won't wake you
up in the middle of the night anymore (but two broken hard disks
will, so it had better be monitored). There are various ways to check
it, and every vendor seems to have its own software. The
following works on FreeBSD:

The second one is the media status: how the device is talking
to the switch. On Linux this can be found in the output of
mii-tool; on FreeBSD it can be found on the
media line in the output of ifconfig:
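
A check on that media line could look roughly like the following Python sketch. It assumes the FreeBSD layout "media: Ethernet autoselect (1000baseT <full-duplex>)", where the part in parentheses is the negotiated media; the function names and the default expected value are made up for this example.

```python
import re

# "media: <configured> (<active>)"; the parenthesised part is optional.
MEDIA_RE = re.compile(
    r"^\s*media:\s+(?P<configured>.*?)(?:\s+\((?P<active>[^)]*)\))?\s*$"
)


def active_media(ifconfig_output):
    """Return the negotiated media string, or None if there is no media line."""
    for line in ifconfig_output.splitlines():
        match = MEDIA_RE.match(line)
        if match:
            # Without parentheses the configured media is also the active one.
            return match.group("active") or match.group("configured")
    return None


def media_ok(ifconfig_output, expected="1000baseT <full-duplex>"):
    """True when the interface negotiated exactly the expected media."""
    return active_media(ifconfig_output) == expected
```

The exact comparison is deliberate: an interface that fell back to 100baseTX or half-duplex should raise an alert, not pass silently.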

If you have a UPS, see if you can get its status. Information
from APC UPSes can be gathered via apcupsd and
apcaccess:

STATUS : ONLINE

Do not just check for the presence of ONLINE; check that ONLINE
is the only string, because

STATUS : ONLINE REPLACEBATT

can be valid too! (FreeBSD: sysutils/apcupsd)
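
The difference between "contains ONLINE" and "is exactly ONLINE" can be sketched in Python as follows. The function names are hypothetical, and the input is assumed to be the text printed by apcaccess, with a STATUS line as shown above.

```python
def ups_status_flags(apcaccess_output):
    """Return the words on the STATUS line, e.g. ['ONLINE', 'REPLACEBATT']."""
    for line in apcaccess_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "STATUS":
            return value.split()
    return []


def ups_is_healthy(apcaccess_output):
    """Healthy only when ONLINE is the one and only status flag."""
    return ups_status_flags(apcaccess_output) == ["ONLINE"]
```

A substring test such as `"ONLINE" in line` would happily accept "ONLINE REPLACEBATT", which is exactly the case the check is supposed to catch.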

The Operating System

Disk space information per partition can be obtained from the
output of df, which gives you the free disk
space. Another important piece of information is the number of
inodes you have available (df -i), because if you don't
have any inodes free, you can't create any more files.
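
If you prefer not to parse df output, the same two numbers can be read directly. The following Python sketch uses os.statvfs (available on Unix-like systems only); the function name and the returned dictionary keys are assumptions for this example, and the percentages are computed the way df does it, from the used and the available amounts.

```python
import os


def filesystem_usage(path):
    """Return the percentage of space and of inodes in use for a mount point."""
    st = os.statvfs(path)

    used_blocks = st.f_blocks - st.f_bfree
    usable_blocks = used_blocks + st.f_bavail  # what a non-root user can use
    space_pct = 100.0 * used_blocks / usable_blocks if usable_blocks else 0.0

    if st.f_files == 0:
        # Some file systems do not report inode counts at all.
        inode_pct = None
    else:
        used_inodes = st.f_files - st.f_ffree
        usable_inodes = used_inodes + st.f_favail
        inode_pct = 100.0 * used_inodes / usable_inodes if usable_inodes else 0.0

    return {"space_used_pct": space_pct, "inode_used_pct": inode_pct}
```

Both values deserve their own threshold: a partition with plenty of free space but no free inodes is just as full.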

Of course you need to check first if the PID file exists.
If the service is a network-based service, then besides checking
that the PID file and the processes exist, you should also
check that the service actually works to some extent: for ssh you should
get the SSH banner, for ntpd you can check whether the service is
synced.
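
The three levels of checking (PID file, process, service answer) could be sketched in Python as below. The function names are hypothetical, and the banner test assumes the server sends its "SSH-" identification string as the first thing on the connection, which is the usual behaviour.

```python
import os
import socket


def pid_from_file(path):
    """Return the PID stored in a PID file, or None if it is missing or invalid."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def process_exists(pid):
    """True if a process with this PID exists (Unix-like systems)."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only tests for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def ssh_banner_ok(host, port=22, timeout=5.0):
    """True if the service answers with something that looks like an SSH banner."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            banner = sock.recv(256)
    except OSError:
        return False
    return banner.startswith(b"SSH-")
```

A stale PID file (the file exists, the process does not) is a finding of its own and worth a different alert text than a missing file.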

If the server is supposed to transport emails (both mail servers
and application servers), then check if the mail-queue is more
or less empty:

If you use dynamic routing in your network, one other thing you
need to do on one or more machines is to check if you have all
your networks in the routing table. Missing one means that you
can't reach that network from that machine!
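
A minimal Python sketch of such a check is shown below. It assumes you already have the destinations of the routing table as a list of strings (for example taken from netstat -rn) and a hand-maintained list of the networks you expect; both lists and the function name are assumptions for this example.

```python
import ipaddress


def _parse(destination):
    """Parse a routing table destination; ignore entries such as 'default'."""
    try:
        return ipaddress.ip_network(destination, strict=False)
    except ValueError:
        return None


def missing_networks(expected, routing_table):
    """Return the expected networks that are not present in the routing table."""
    present = {net for net in map(_parse, routing_table) if net is not None}
    return [net for net in expected if ipaddress.ip_network(net) not in present]
```

Only exact prefix matches count here; a network that is merely covered by a default route or a shorter prefix is reported as missing, which is usually what you want when dynamic routing should have announced it.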

Number of users, total processes and swap: easy to measure, and
a deviation might be an indication that there is something wrong. For
machines which are servers, the number of users logged in
shouldn't be too high: unless work is being done on them, nobody
should be logged in.
Swap is preferably not in use.

Uptime. Don't rely on the host not responding to ping to
determine if the machine has been rebooted. With today's hardware
and background file system checks the machine is back before the
ping-timeout threshold has been reached. If the uptime has been
reset, then something has happened!
Note that if you monitor this via SNMP, the
system.sysUpTime OID returns the time (in hundredths of a second)
since snmpd became active, not the time
since the machine booted. Restarting snmpd will
reset this counter!

SNMPv2-MIB::sysUpTime = Timeticks: (760344619) 88 days, 0:04:06.19
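
The number in parentheses is a TimeTicks value, counted in hundredths of a second, as the sample shows (760344619 ticks are 88 days). A sketch in Python of how a monitoring script could interpret it and notice a reset follows; the function names are hypothetical, and storing the previous reading between runs is left to the caller.

```python
def format_uptime(ticks):
    """Render TimeTicks (hundredths of a second) the way snmpget prints them."""
    seconds, hundredths = divmod(ticks, 100)
    days, rest = divmod(seconds, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{days} days, {hours}:{minutes:02d}:{secs:02d}.{hundredths:02d}"


def has_restarted(previous_ticks, current_ticks):
    """True when the counter went backwards since the previous reading."""
    return current_ticks < previous_ticks
```

Keep in mind that TimeTicks is a 32-bit counter, so it wraps to zero after roughly 497 days; a wrap looks the same as a restart to this comparison.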

FreeBSD Jails are great for setting up small isolated environments,
for example webservers and SMTP servers. The server which hosts
all jails should check for them, and warn if one is missing or
if an unknown one has popped up:
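
The comparison itself is a matter of two set differences. In this Python sketch the list of running jails is assumed to have been collected already (for example from the output of jls), and the jail names are invented.

```python
def compare_jails(expected, running):
    """Return (missing, unknown): expected but not running, running but not expected."""
    expected_set, running_set = set(expected), set(running)
    missing = sorted(expected_set - running_set)
    unknown = sorted(running_set - expected_set)
    return missing, unknown


# Hypothetical example: "smtp" is down and somebody started "test".
missing, unknown = compare_jails(["www", "smtp"], ["www", "test"])
```

Both lists should be empty in normal operation; anything else is worth an alert.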

SMTP / spam checker / greylisting / virus scanner
SMTP servers (check if the SMTP server is running) listen for
network connections on port 25 (smtp) and port 587 (submission).
Incoming SMTP traffic might be greylisted (check if the
greylist daemon is running). The email received goes through
a virus scanner: make sure that the virus scanner daemon is
running, that the signatures are up to date and, if you are
using a commercial package, that your license hasn't
expired:

Then the email goes through the spam checker (make sure that
the daemon is running) and then into the mail folder.
Email can come in bulk. That means that one moment your queue
is empty, and the next moment there are 500 messages in the
queue. If your users get a mailing like this every day
at 17:00, then you will get a daily alert about it.

NAT gateways
Check the size of the NAT table. The expected size depends
on the policy of your network. If your network is open (no proxy
server, no restrictions on traffic), then the NAT table will
be very big.
If you have a regulated network (HTTP has to go via the proxy
server, email has to be delivered to the local SMTP servers, DNS
requests have to go to the local DNS server etc), then it
will be relatively small. A change in the size can show that there
is something wrong.

[~] root@freebsd>ipnat -l | wc -l
300
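
Because a table that is suddenly much smaller is as suspicious as one that is much bigger, a check on this number wants a lower and an upper bound. The Python sketch below maps a value to the exit codes Nagios plugins use (0 for OK, 1 for WARNING, 2 for CRITICAL); the function names and the thresholds in the usage line are made up and have to be replaced by what is normal in your network.

```python
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def check_range(value, warn, crit):
    """warn and crit are (low, high) tuples; outside crit is worse than outside warn."""
    if not crit[0] <= value <= crit[1]:
        return CRITICAL
    if not warn[0] <= value <= warn[1]:
        return WARNING
    return OK


def status_line(value, warn, crit):
    """Return the exit code and the one-line message a plugin would print."""
    state = check_range(value, warn, crit)
    return state, f"NAT {LABELS[state]} - {value} entries in table"


# Hypothetical thresholds for a regulated network.
state, message = status_line(300, warn=(100, 500), crit=(20, 2000))
```

A real plugin would print the message and call sys.exit() with the state.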

Database replication
Not only is the consistency of the data in a database very
important, so is its replication, which should be
as close to real time as possible. Slony, the replication service for
PostgreSQL, gives these statistics via the sl_status table:

With the SIP and IAX status, not only the OK status is important
but also the time for the answer.

Network device monitoring

Gathering information for network device monitoring is a little bit
trickier than systems monitoring, because you can't run these fancy
scripts on your routers and switches. Often you can only get
information via SNMP...

System Uptime: Embedded devices are often very fast with their
reboots, so they can reboot several times without you even
noticing. With the system.sysUpTime OID you can
get the uptime:

SNMPv2-MIB::sysUpTime = Timeticks: (760344619) 88 days, 0:04:06.19

If you have a clean network, and have your network devices
and user devices separated from each other, then there is
a clear border showing where the responsibility lies. And it
gives you an easy way to check if all interfaces on your devices
are in the state you expect them to be in.

If an ifSpeed is suddenly 100Mbps instead of 1Gbps, you know
that there is something wrong. If an ifOperStatus is down instead
of up, you know that there is a problem. If you have redundancy
in your network, these issues might have stayed hidden because
the remote subnet has never been unreachable.
Routers can "suddenly" have more or fewer interfaces, for example when
you create or delete a VLAN. So you have to monitor for the
absence of expected VLANs and the presence of unknown VLANs.
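
One way to express "the state you expect" is a small table per device that is compared with what SNMP returns. In the Python sketch below the interface names, the layout of the tables and the function name are assumptions; ifSpeed is given in bits per second, as the OID reports it.

```python
def interface_deviations(expected, observed):
    """Compare expected and observed interface attributes; return a list of problems."""
    problems = []
    for name, wanted in sorted(expected.items()):
        found = observed.get(name)
        if found is None:
            problems.append(f"{name}: missing")
            continue
        for key, value in sorted(wanted.items()):
            if found.get(key) != value:
                problems.append(
                    f"{name}: {key} is {found.get(key)!r}, expected {value!r}"
                )
    # Interfaces (or VLANs) nobody told the monitoring about.
    for name in sorted(set(observed) - set(expected)):
        problems.append(f"{name}: unknown interface")
    return problems


# Hypothetical device: one uplink and one VLAN interface are expected.
EXPECTED = {
    "Gi0/1": {"ifSpeed": 1_000_000_000, "ifOperStatus": "up"},
    "Vlan10": {"ifOperStatus": "up"},
}
```

An empty list means the device looks the way you expect; every entry in the list is a line for the alert message.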