Monitoring your HDD using SMART and Nagios

Monitoring of your computer systems is a good idea. There are many tools
that let you verify that specified services are running, and available for
clients. I use Nagios. You can check that Apache is still running,
Postfix is still accepting mail, and various other things. If you can
write a test, Nagios can monitor it.
Typically, people monitor network connections, applications, and bandwidth
consumption. Until recently, I did not monitor disk health. That has now
changed.
I started using three new tools: smartmontools, nagios-check_smartmon, and
munin.

In this article I’ll show you how I added SMART monitoring to my Nagios
installation. munin is straightforward to install, but it is outside the
scope of this article and is a topic for another time.
This article also assumes you have Nagios installed and nrpe running
on the host you are monitoring. I am using Fruity for my Nagios
configuration, so I will be glossing over that too.

SMART

Self-Monitoring, Analysis, and Reporting Technology, or S.M.A.R.T.
(sometimes written as SMART), is a monitoring system for computer hard
disks to detect and report on various indicators of reliability, in the
hope of anticipating failures.

My first real introduction to SMART came from reading
Watching a hard drive die
by Greg Smith. Greg is present on the PostgreSQL Performance mailing list.
He knows a lot about hardware and how to get the best out of it. As I was
setting up a 10TB file server, I wanted to start monitoring the health of
those disks.

smartmontools

To install smartmontools:

cd /usr/ports/sysutils/smartmontools/
make install clean

To have smartd start at boot:

echo 'smartd_enable="YES"' >> /etc/rc.conf

I used the default configuration file, but you could get more specific if you
wanted:

cp -i /usr/local/etc/smartd.conf.sample /usr/local/etc/smartd.conf

To start smartd now:

# /usr/local/etc/rc.d/smartd start
Starting smartd.

This system has two HDDs, so I added the following to /etc/periodic.conf to
include drive health information in my daily status reports:

daily_status_smart_devices="/dev/ad0 /dev/ad2"
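
Before relying on the daily report, it can help to check a drive by hand.
The following is a minimal sketch, not part of the original setup: the
helper name smart_health is hypothetical, and it assumes the ATA-style
"overall-health self-assessment test result" line that smartctl -H prints.
Other drive types report health in a different format and will show up as
UNKNOWN here.

```shell
# Hypothetical helper: read `smartctl -H` output on stdin and print the
# overall health verdict (for example PASSED), or UNKNOWN if the
# expected line is not present.
smart_health() {
    awk -F': ' '
        /overall-health self-assessment test result/ { print $2; found = 1 }
        END { if (!found) print "UNKNOWN" }
    '
}

# Usage (as root, on a host with smartmontools installed):
#   smartctl -H /dev/ad0 | smart_health
```

For the full attribute table, including temperature, smartctl -A /dev/ad0
prints the raw values that the health verdict is based on.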

nagios-check_smartmon

nagios-check_smartmon is a Nagios plugin that allows you to access
smartmontools from within Nagios. To install it:
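
A minimal sketch, following the same pattern used for smartmontools above.
It assumes the plugin lives in the ports tree as
net-mgmt/nagios-check_smartmon; verify the location in your own tree. The
helper name install_port is hypothetical and only wraps the usual cd and
make steps.

```shell
# Hypothetical helper: build and install a port from /usr/ports.
# Returns non-zero if the port directory does not exist.
install_port() {
    port_dir="/usr/ports/$1"
    if [ ! -d "$port_dir" ]; then
        echo "no such port: $1" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    (cd "$port_dir" && make install clean)
}

# Usage (as root); the port path is an assumption:
#   install_port net-mgmt/nagios-check_smartmon
```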

nrpe changes

The plugin must run with sufficient permission to access the disk device,
but under net-mgmt/nrpe the command runs as the Nagios user.
The following is the entry I add to /usr/local/etc/nrpe.cfg to monitor the
two HDDs in this system:
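
As an illustration only, entries of this kind could be generated as shown
below. Everything specific here is an assumption rather than the author's
actual configuration: the plugin path /usr/local/libexec/nagios/check_smartmon,
its -d option for selecting the device, the use of sudo at
/usr/local/bin/sudo to gain device access, and the command names. The
helper nrpe_smart_entries is hypothetical.

```shell
# Hypothetical helper: print one nrpe.cfg command definition per device.
# Assumes sudo is installed and the nagios user is allowed to run the
# plugin through it; both paths below are assumptions.
nrpe_smart_entries() {
    for dev in "$@"; do
        name=${dev##*/}    # /dev/ad0 -> ad0
        printf 'command[check_smartmon_%s]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_smartmon -d %s\n' \
            "$name" "$dev"
    done
}

# Usage: review the output, then append it to /usr/local/etc/nrpe.cfg
#   nrpe_smart_entries /dev/ad0 /dev/ad2
```

After changing nrpe.cfg, nrpe needs to be restarted before the new
commands are available to the Nagios server.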

Nice.
After manually checking the HDD temperature, by putting my hand on each
drive, I determined all were of a similar temperature. I concluded SMART was
wrong, which is not unheard of. I changed nrpe.cfg to compensate for the
higher reading:
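
The effect of raising the thresholds can be sketched as follows. This is
not the plugin's own code: temp_state is a hypothetical helper that maps a
temperature to the standard Nagios states and exit codes (0 OK, 1 WARNING,
2 CRITICAL). The threshold values, the device name, and the -w and -c
options in the commented entry are assumptions for illustration.

```shell
# Hypothetical helper: classify a temperature against warning and
# critical thresholds, using the Nagios exit-code convention.
# usage: temp_state TEMP WARN CRIT
temp_state() {
    if [ "$1" -ge "$3" ]; then
        echo CRITICAL; return 2
    elif [ "$1" -ge "$2" ]; then
        echo WARNING; return 1
    else
        echo OK; return 0
    fi
}

# A drive that always reads 57 trips a warning threshold of 55 ...
#   temp_state 57 55 60
# ... but not one raised to 60:
#   temp_state 57 60 65
#
# Assumed form of an adjusted nrpe.cfg entry (values are hypothetical):
#   command[check_smartmon_ad2]=... check_smartmon -d /dev/ad2 -w 60 -c 65
```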