Every host on a network has downtime, from the coolest RaQ
to the lowest NT 486. The job of keeping downtime to a minimum
falls to the system administrator. Various solutions are available,
spanning the range of needs and budgets. One way is to use
high-availability servers with Fibre Channel RAID arrays, multiple
redundant CPUs and power supplies, and a transaction-oriented file
system. The servers can be arranged behind $50,000 load-balancing
and failover systems to swap out servers automatically upon
failure.

A solution at the other end of the cost spectrum is running a
backup server which is manually switched with the primary server
when necessary. In this scenario, if a server fails unexpectedly,
it can be many minutes or hours before the poor system
administrator can make the switch. This solution is both inelegant
and widely used in company networks.

A third way is “server clustering”, or making multiple
servers appear to users as if they were the same server, for
fault-tolerance and load-balancing purposes. Very interesting
efforts are underway to offer completely Linux-based server
clustering solutions. These include the open-source Linux Virtual
Server, and other work being done by the High Availability Linux
Project. These projects show great promise, and they may be the
right answer for sites wishing to be close to the bleeding edge.
However, small businesses need fully supported solutions that do
not require substantial modification to their existing, possibly
heterogeneous, networks. This is the gap which Polyserve hopes to
fill with Understudy. As you will see below, I think it does the
job nicely.

Understudy is a software-based server clustering utility that
implements load balancing and failover protection for Linux (Red
Hat, Debian and Slackware), Solaris, Cobalt, FreeBSD and Windows
NT. It supports between two and ten heterogeneous servers in a
cluster, all of which must be located on the same IP subnet.
Polyserve hopes to release a newer version soon that circumvents
the single subnet requirement. A cluster of servers can provide any
service, including web, mail, news or file sharing.

When a server goes down, it is marked inactive within the
cluster and another server takes its place in seconds. When the
server comes back up, it is immediately reintegrated into the
cluster. By using Understudy in conjunction with a load-rotation
scheme called “round robin DNS”, a site can also provide simple
load balancing. Load balancing requires one additional IP address
for each server in the cluster. Simple failover requires only one
IP address for a “virtual host”, which is how users see the
cluster.

Installation

Installation of the Red Hat Linux version was simple. After
reading the release notes, I wouldn't expect major difficulties on
other platforms. Understudy provides a “quick-start” white paper
on their web site which is recommended reading, along with the
white papers on web server specifics and on round robin DNS. They
are easily understood if you have ever configured a web server or
changed your DNS configuration.

I downloaded the RPM for the free 30-day trial and ran
rpm as root to install it. After
installing files, Understudy started its dæmon and reminded
me to assign a password for the administration tool, which I did. I
repeated this process on each of the four servers that would make
up the cluster.
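For concreteness, the whole installation amounts to something like the following, run as root (the package filename here is illustrative, not the actual name of the trial RPM):

```shell
# Install the downloaded trial package
# (filename shown is an example, not the real package name):
rpm -ivh understudy-trial.i386.rpm

# The daemon starts on its own after installation; the installer
# then reminds you to set a password for the administration tool.
# Repeat the same steps on every server that will join the cluster.
```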

Configuration: A Tour

The four servers were administered remotely, so I could not
run the graphical local administration tool, which requires X on
the server. However, Polyserve also offers a graphical remote
administration tool, available for either Red Hat or Windows 98/NT.
I downloaded and installed the RPM on my local Debian system using
Alien, the Debian package-format converter. There were no serious problems,
although I needed to modify the startup script it created to
properly point to the copy of the Java Runtime Environment (1.1.7)
and the libraries it also installed. It filed everything away in
/usr/local/polyserve with a startup script in /usr/bin.

Figure 1.

Next, I set up my first cluster with failover protection
using a pair of servers. This requires a single “virtual host”,
which is simply an unattached IP address in the same subnet as the
real hosts. This was a straightforward process, following the
instructions in the Quick Start Guide. When I fired up the graphical
administration tool, it prompted me for a cluster IP and password.
I supplied the IP of the first server and was presented with the
main window (Figure 1). The main box, titled “Cluster status”,
listed the name of the server I supplied, with the reassuring
status of “OK”. The menus include “File”, “Cluster”, “Server
Log” and “Help”. “Cluster” has the most interesting choices:
“Add Server”, “Add Virtual Host”, “Add Service Monitor to
Selected Host”, “Delete Selected Item” and “Update Selected
Virtual Host”. I chose “Cluster --> Add
Server” and was prompted for a server name or IP. I filled in my
second server. Voilà: the “Cluster status” told me both
servers were okay. So far, so good.

Now, to add my first “virtual host”. This requires adding a
new host in your DNS tables (such as in /var/named on your DNS
server):

virtual1 60 IN A 150.1.1.1

This simply adds a new host name with an address (A) record and a
Time To Live (TTL) of 60 seconds.

I added this new line with an appropriate IP address for my
subnet, and restarted the named dæmon. Back in the
administration tool, I selected “Add Virtual Host”. It prompted
me for the name or IP of the virtual host, and listed selection
boxes to determine which real server was to be the primary server
and which was to be the backup. I entered my information.

Figure 2.

At this point, the Cluster status looked a bit more
interesting (Figure 2). It listed both real servers, and
subheadings showed that the first server was Active for the
virtual host, and the second server was Inactive. I tried to telnet
to the virtual host. It connected me to the first server. I went
back to the administration tool and deleted the virtual host. I
re-added it, but this time, decided that the second server would be
the primary server for this virtual host. The display reflected the
change immediately. I telnetted to the virtual host. Sure enough,
it connected me to the second server.

What's happening behind the scenes is something like this:
Understudy runs as a dæmon on each server. The IP address of
the virtual host is automatically aliased to the primary server. A
small amount of traffic is constantly passed between the real
hosts, via broadcast ARP messages. Through the dæmon, each
host knows which is acting as primary. When the primary server goes
down, the backup immediately reassigns the virtual host's IP
address to itself. It continues to listen, so it can release the IP
address when the primary server comes back up.
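On Linux, the takeover the dæmons perform can be pictured as an interface alias plus a gratuitous ARP announcement. This is only a sketch of the mechanism, not Understudy's actual implementation; the interface name and address are made up:

```shell
# Backup server claims the virtual host's address as an alias
# (interface and address are examples only):
ifconfig eth0:1 150.1.1.1 netmask 255.255.255.0 up

# Announce the new IP-to-MAC mapping with gratuitous ARP so that
# neighbors update their caches immediately:
arping -U -c 3 -I eth0 150.1.1.1

# When the primary comes back up, the backup releases the alias:
ifconfig eth0:1 down
```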

Note that Understudy will not allow you to use an IP address
already assigned or aliased to a real host as a virtual host. I
imagine that otherwise it would be easy to “hijack” the IP
address of someone else's host in your subnet.

Figure 3.

Next, I set up a “service monitor” (Figure 3). This allowed
me to choose particular ports to monitor, such as for mail, web,
FTP or TELNET. If the active server does not respond at that port,
the inactive server will step in. I selected HTTP, and the Cluster
Status reported that the web server was up on both real servers. I
verified, using Lynx, that requests to the virtual host went to the
primary server, unless a service it was monitoring was down, in
which case requests went to the backup server. In all cases, Lynx
showed the URL of the virtual host name, as expected.
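A port-based service monitor of this kind boils down to a periodic TCP connect attempt. The following is a minimal sketch of that idea, not Understudy's code; host and port are whatever you would monitor:

```python
import socket

def service_up(host, port, timeout=5.0):
    """Return True if something accepts a TCP connection on host:port.

    This is the essence of a port-based service monitor: it tells
    you the service is answering, not that it is behaving correctly.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `service_up("www.example.com", 80)` answers roughly the same question the monitor's HTTP check does.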

For the next test, I set up round robin DNS. Round robin DNS
is a feature built into name servers such as BIND (versions 4.9
and up). Round robin allows servers to share loads transparently by
rotating between any number of IPs for a given host name. The only
problem is that no correction is made if one or more servers go
down, so out of every cycle of requests, some are sent to a dud
server. With Understudy, this is no longer a problem. You can set
up round robin DNS for a number of virtual hosts, where each
virtual host has a different primary. If any server is down, its
requests are sent to the next secondary. Full examples for doing
this are available in the Understudy documentation. These
instructions were reasonably clear and easy to follow. At the
conclusion of a couple of hours of work, I had a fully redundant set
of servers with no interruption to existing services on the
servers.
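Such a configuration might look like the following zone file fragment, with names and addresses purely illustrative:

```
; "www" rotates between two virtual-host addresses (round robin).
; Each address is an Understudy virtual host with a different
; primary, so a dead server's share of requests fails over to
; that host's backup instead of being lost.
www      60   IN   A   150.1.1.1
www      60   IN   A   150.1.1.2
```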

One final function of the administration tool is a server
log, which accesses dæmon messages for each server in a
cluster. This brings me to a minor complaint: the logs are somewhat
difficult to parse. It would be nice to see an integrated cluster
log, providing a summary of the server logs.