It is intended as a last resort
for maintaining a system's availability and, at the very least, to
ensure that the administrator can remotely log in to diagnose and
fix faults of a non-persistent nature. Obviously it won't stop a
hardware fault from breaking a system, nor is it any good against a
persistent software problem, but for a system that is generally well
behaved (and particularly if it is located at a remote site and/or
is otherwise essential for operations) it serves to improve the
overall availability of the system.

If your application cannot tolerate a short outage, then a watchdog
alone is not going to be enough; you need to look at other
high-availability solutions for hardware (e.g. RAID for disk error
protection) and software (clustering & application mirroring)
that will provide an acceptable degree of overall system
availability.

With the Linux operating system there are two parts to the watchdog:

The actual hardware timer and kernel driver module that can force a
hard reset, and;

The user-space background
daemon that refreshes the timer and provides a wider range
of health monitoring and recovery options.

Both can function independently, but clearly they are designed to
operate together for maximum protection.

The Watchdog Module

Normally the hardware support for a watchdog is simply a timer that
is set to some reasonable time-out, and then periodically refreshed
by the running software. If for any reason the software stops
refreshing the hardware (and has not explicitly shut it down) then
the timer expires and performs a hardware reset of the computer. In
this way even faults such as a kernel panic can usually be recovered
from. Often
the chip sets that provide system monitoring (temperature, supply
voltages, fan speeds, etc) have a watchdog timer, though one can
never be sure if the motherboard manufacturer will have used it!

In the context of the Linux operating system, there is a standard
interface to the watchdog hardware, provided by the corresponding
kernel device driver (module) and exposed as /dev/watchdog
(checking that this exists is a simple test of the module being
loaded). However, such a driver is not usually loaded by default, so
you may have to manually configure your system to load it. Typically
this is done by adding the module name to /etc/modules or (better
still, so it is loaded on demand) to /etc/default/watchdog by editing
watchdog_module="none" to contain the module name.
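
The "simple test" mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name is made up for this example, and it assumes the driver exposes the usual /dev/watchdog character device. It only inspects the device node and never opens it, since opening the device would normally arm the timer.

```python
import os
import stat


def watchdog_device_present(path="/dev/watchdog"):
    """Return True if `path` exists and is a character device.

    Hypothetical helper: it uses os.stat() only, so the device is
    never opened and the watchdog timer is not started by this check.
    """
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing node (or no permission to stat it): treat as absent.
        return False
    return stat.S_ISCHR(mode)
```

If this returns False, the most likely explanation is that no watchdog driver module has been loaded yet.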

Linux also provides a software watchdog by means of the 'softdog'
module. While this is better than nothing, it is far less effective
than hardware! Basically, if the kernel fails, so does your means of
recovery in this case.

The watchdog hardware + driver module provides the most basic of
protection. It is started by any program that opens /dev/watchdog
and periodically writes to it, and if that refresh stops for any
reason the watchdog hardware times out and the machine is rebooted
by means of a hard reset.

However, a hard reset is something that is normally undesirable as
it risks file system corruption, so it is much better if you can
perform a clean reboot instead.
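
As a rough illustration of the most basic protection described above, the following Python sketch refreshes the timer by writing to the device at a fixed interval. It is not how the watchdog daemon itself is implemented. The function name, the 10 second interval and the `count` parameter are assumptions made for this example, and the interval must in practice be shorter than your hardware's time-out. The final 'V' write relies on the "magic close" convention of the Linux watchdog API, which, as far as I know, only disarms the timer if the driver supports it and the "nowayout" option is not set.

```python
import time


def keep_alive(path="/dev/watchdog", interval=10.0, count=None,
               magic_close=True):
    """Refresh the watchdog by writing to `path` every `interval` seconds.

    `count` limits the number of refreshes (None means run forever).
    Returns the number of refresh writes performed.
    """
    # Unbuffered binary mode so each write reaches the driver at once.
    with open(path, "wb", buffering=0) as dev:
        done = 0
        while count is None or done < count:
            dev.write(b"\0")  # any write refreshes the timer
            done += 1
            if count is None or done < count:
                time.sleep(interval)
        if magic_close:
            # Assumption: the driver honours the 'V' magic-close
            # character; otherwise closing leaves the timer armed.
            dev.write(b"V")
    return done
```

Be careful when trying this on a real machine: if the process is killed, or the device is closed without the magic character being honoured, the hardware will reset the system once the time-out expires. That is, of course, the whole point.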

The Watchdog Daemon

To operate the watchdog device, there is normally a background
daemon that can open the device and provide the periodic refresh
activity. However, a machine can also get into a very unusable
state without actually terminating the background daemon's
operation, therefore the watchdog daemon for Linux can be configured
to periodically run a number of basic tests to verify that
the machine looks OK.

On failing such tests (possibly with a certain amount of re-try
behaviour to avoid being too "trigger happy") the daemon can reboot
the machine in a moderately orderly manner in order to keep a log of
why it happened, and hopefully avoid file system problems, etc.
While doing so, it also has the "insurance" of the hardware timer so
if it fails to reboot nicely, there is a hardware reset to follow
that up.
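
The re-try idea can be sketched as a simple "strike count" per test. This is a simplified model and not the daemon's actual logic: the function name, the `retries` and `rounds` parameters and the dictionary of checks are all hypothetical, and a real monitor would also sleep between rounds and refresh the hardware timer as it goes.

```python
def run_checks(checks, retries=3, rounds=10):
    """Run every check once per round for `rounds` rounds.

    `checks` maps a name to a callable returning True when healthy.
    Returns the name of the first check that fails more than `retries`
    times in a row, or None if no check exceeded its allowance.
    """
    failures = {name: 0 for name in checks}
    for _ in range(rounds):
        for name, check in checks.items():
            if check():
                failures[name] = 0  # a pass clears the strike count
            else:
                failures[name] += 1
                if failures[name] > retries:
                    return name  # persistent failure: time to act
    return None
```

A check that fails once or twice and then recovers never triggers any action, while one that keeps failing is reported after `retries + 1` consecutive failures, at which point the caller would begin the orderly reboot.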

This “moderately orderly” shut down is not the normal init-based
shut down approach where the proper sequence of shut-down scripts
is executed, as that is very likely to fail in a number of the
conditions for which watchdog action is needed (e.g. system out
of memory, out of process table space, etc).

So instead it performs the “blunderbuss approach” to stopping all
processes by signalling everything with SIGTERM and then after 5
seconds with the non-ignorable SIGKILL, then it tries to update wtmp
(so the shut down is recorded), update the random seed (to preserve
entropy), sync the CMOS clock to system time (to help ensure the
system time is reasonable on reboot), and finally sync and un-mount
the file systems before it attempts reset by means of the hardware
timer (if that is possible).
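
The SIGTERM-then-SIGKILL escalation at the start of that sequence can be illustrated safely on a single child process. The daemon signals every process on the machine, which is not something to try casually, so this sketch (with a made-up function name) applies the same pattern to one `subprocess.Popen` object only and leaves out the later steps (wtmp, random seed, clock, un-mounting).

```python
import signal
import subprocess


def terminate_then_kill(proc, grace=5.0):
    """Ask `proc` to exit with SIGTERM, then force it with SIGKILL.

    Waits up to `grace` seconds after SIGTERM (5 seconds mirrors the
    delay described in the text). Returns the last signal sent.
    """
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=grace)
        return signal.SIGTERM
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        proc.wait()  # reap the child so it does not linger as a zombie
        return signal.SIGKILL
```

A well-behaved process gets the chance to clean up on SIGTERM, while one that ignores the request (or is simply stuck) is still removed once the grace period runs out.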

The hardware reset approach is preferred over the kernel's reboot
API as the kernel stops the watchdog hardware on a normal shut-down
or reboot, and thus could hang just after that point without any
means of automatic recovery (e.g. a hung RAID card or similar).

'watchdog' provides the driver open/refresh/close actions
along with various other system checks.

When the system boots, it starts wd_keepalive as early as possible
to protect against serious faults during booting, then once other
services are up it switches to running the full watchdog. The normal
watchdog cannot be started early because some of the tests it could
perform might depend on resources that start later in the chain
(e.g. network file system, other daemons to monitor, etc). Similarly
on shut-down the main watchdog is stopped early and wd_keepalive
started in its place to deal gracefully with the stopping of
services that might be monitored.

Do I need a Watchdog?

From the introduction it can be
seen that most systems that are used "interactively", like a home
PC, don't really need it. Basically if it crashes while you are
using it then you typically try Ctrl+Alt+Del (maybe also Ctrl+Alt+F1
to try text-mode login) and, if that fails, then simply push the
reset button (or hold down the power button for 5 sec) to recover
the machine.

Where the watchdog is most useful is situations like ours where you
have hardware control computers running continuously or, more
commonly, servers operating at remote sites. Both are situations
where you may be sleeping or on holiday when it goes wrong and/or
recovery involves a tiresome trip to the site. In such cases the
last resort of an automated reboot is quite valuable.

Last Updated on 26-Jan-2016 by Paul Crawford
Copyright (c) 2013-16 by Paul S. Crawford. All rights reserved.
Email psc(at)sat(dot)dundee(dot)ac(dot)uk
Absolutely no warranty, use this information at your own risk.