Building a Self-Healing Network

Computer immunology is a hot topic in system administration. Wouldn't it be great to have our servers solve their own problems? System administrators would be free to work proactively, rather than reactively, to improve the quality of the network.

This is a noble goal, but few solutions have made it out of the lab and into the real world. Most real-world environments automate service monitoring, then notify a human to repair any detected fault. Other sites invest a large amount of time creating and maintaining a custom patchwork of scripts for detecting and repairing frequently recurring faults. This article demonstrates how to build a self-healing network infrastructure using mature open source software components that are widely used by system administrators. These components are NAGIOS and Cfengine.

NAGIOS is a network monitoring system with a web-based interface that tracks the health of servers and the services they provide. It does this by periodically polling the server/service with a health-checking script. If it detects what it believes is a failure state based on repeated health-check failures, it will note the specific server and take actions such as paging and emailing system administrators.

Cfengine is a policy engine that detects a delta (difference) between a system's current configuration state and the optimal configuration state defined by policy. It was developed by Mark Burgess of Oslo University College. Cfengine has many functions that facilitate self-healing. However, Cfengine runs only periodically because its delta detection process is too computationally intensive to run continuously. In most deployments, Cfengine runs once an hour.
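The detect-and-repair cycle described above can be illustrated in plain shell. This is only a conceptual sketch, not how Cfengine itself is implemented; the function name converge and its two arguments (a check command and a repair command) are hypothetical.

```shell
#!/bin/sh
# converge CHECK_CMD REPAIR_CMD   (hypothetical helper, for illustration)
# Runs CHECK_CMD to see whether the system already matches the desired
# state. Only if the check fails is REPAIR_CMD run to close the gap.
converge() {
    check_cmd=$1
    repair_cmd=$2
    if sh -c "$check_cmd" >/dev/null 2>&1; then
        # Current state already matches policy: nothing to do.
        echo "compliant"
    elif sh -c "$repair_cmd" >/dev/null 2>&1; then
        # A difference was found and the repair command succeeded.
        echo "repaired"
    else
        # A difference was found but the repair command failed.
        echo "failed"
        return 1
    fi
}
```

For example, calling converge with "pgrep -x httpd" as the check and "/etc/init.d/httpd start" as the repair (both assumptions about your system) would do nothing on a healthy server and start Apache on a broken one. Running it repeatedly is harmless, which is the property that makes periodic policy runs safe.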

By combining these two software packages, you can create a self-healing capability on your network. First, configure NAGIOS to do health checking on a server and, in the event of a failure, to invoke Cfengine on the remote server to repair the fault. The system will operate in a secure manner with little system or network overhead.

Implementation

The network for the example configuration is fairly straightforward and you'll find it easy to tailor to your specific environment. The network has a monitor host (named monitor, at 192.168.0.10) running NAGIOS, and a web server (named webserver, at 192.168.0.20) running an Apache HTTP server. The goal is for the Apache server to continue to serve pages to hypothetical users, and for any fault that occurs to be rectified in short order. For clarity I've split these functions across two hosts, but there is no reason that both functions could not run on the same host.

The example network runs Fedora Core 3. Installation should be very similar on any other Red Hat/RPM-based system. If you are comfortable installing and configuring software on your preferred flavor of Linux, you can easily adapt the steps to other distributions. Once the software is installed, the configurations should work across all platforms with few modifications.

The concept is simple. NAGIOS detects a fault with the HTTP service. As part of its event handling system, it requests remote execution of Cfengine via the cfrun utility. Cfengine runs, detects the missing httpd process, and restarts it. Voilà!
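The decision an event handler has to make can be sketched in a few lines of shell. This is a minimal sketch under stated assumptions, not the handler used in this article: the function names should_repair and handle_event, the CFRUN variable, the /usr/sbin/cfrun path, and the argument order are all hypothetical. It acts only on a confirmed (HARD) CRITICAL state, in keeping with NAGIOS declaring a failure after repeated health-check failures.

```shell
#!/bin/sh
# Sketch of a NAGIOS event handler that asks Cfengine to repair a host.
# Hypothetical names: should_repair, handle_event, CFRUN.

# Path to cfrun; an assumption, adjust it for your installation.
CFRUN=${CFRUN:-/usr/sbin/cfrun}

# should_repair STATE STATETYPE
# Succeeds (exit 0) only when the service is CRITICAL and NAGIOS has
# confirmed the failure (HARD state), so transient blips are ignored.
should_repair() {
    [ "$1" = "CRITICAL" ] && [ "$2" = "HARD" ]
}

# handle_event STATE STATETYPE ATTEMPT HOST
# Prints what it decided, and runs cfrun against HOST when a repair
# is warranted.
handle_event() {
    if should_repair "$1" "$2"; then
        echo "repair requested for $4"
        "$CFRUN" "$4"
    else
        echo "no action for $4 ($1/$2, attempt $3)"
    fi
}
```

A NAGIOS command definition could pass the macros $SERVICESTATE$, $SERVICESTATETYPE$, $SERVICEATTEMPT$, and $HOSTNAME$ as the four arguments, for example: handle_event CRITICAL HARD 3 webserver.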

Download & Installation

Both NAGIOS and Cfengine are available from the DAG Repository for all versions of Red Hat and Fedora. If your package manager is configured for DAG, installation on the monitor host is as simple as:

yum -y install nagios nagios-plugins cfengine

For the web server (assuming you also need Apache):

yum -y install cfengine httpd

To find out how to configure your package manager to use DAG, visit the Dag FAQ. If you're a build-from-source person, visit the Cfengine and NAGIOS websites to download the source tarballs directly. The Cfengine Wiki has more details on other subjects.

Configuring Cfengine

In this case you will set up a very simple Cfengine instance whose sole purpose is to restart a failed HTTP server. Cfengine can do many more worthwhile things; to learn about them, I recommend Luke A. Kanies' excellent articles Introducing Cfengine and Integrating Cfengine with CVS.

Cfengine keeps its configuration data in /var/cfengine/inputs. There are a few key files you will put into this directory to get your Cfengine instance up and running. On your web server, cfagent.conf should contain:
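A minimal sketch of such a file is written out by the hypothetical helper below. It assumes Cfengine 2 syntax and the stock Fedora init script path /etc/init.d/httpd; both are assumptions, and this is not necessarily the exact configuration used for this article. The helper name write_cfagent_conf is also made up for illustration.

```shell
#!/bin/sh
# write_cfagent_conf [DIR]   (hypothetical helper)
# Writes a minimal cfagent.conf into DIR, defaulting to the Cfengine
# inputs directory named in the text.
write_cfagent_conf() {
    dir=${1:-/var/cfengine/inputs}
    cat > "$dir/cfagent.conf" <<'EOF'
# Assumed minimal policy: only the process checks are run.
control:
   actionsequence = ( processes )

# If no process matching "httpd" is found, run the restart command.
# The init script path is an assumption for Fedora-style systems.
processes:
   "httpd" restart "/etc/init.d/httpd start"
EOF
}
```

Run as root, write_cfagent_conf with no argument places the file in /var/cfengine/inputs; pass a scratch directory first if you want to inspect the result before installing it.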

Make sure that your Cfengine config parses properly by running Cfengine from the command line:

/usr/sbin/cfagent -qIv

This produces verbose output. Without the v flag, cfagent reports only the differences it finds between the system state and Cfengine policy. For example, if you execute:

killall httpd;/usr/sbin/cfagent -qI

you'll see that cfagent restarts the httpd daemon. Now that you have it installed, start up all your Cfengine services:

for i in cfenvd cfservd cfexecd; do
    chkconfig "$i" on
    service "$i" restart
done

Now your Cfengine config works: it returns your system to the desired state, a live httpd server, via Cfengine policy. This Cfengine rule, executed once an hour by cfexecd, will restart the httpd server if it's down. However, if you want an automated, dynamic response to failure, you need to add a second component that monitors the httpd server and kicks off Cfengine as soon as a failure occurs.