I wasn't bored: I don't have time to be
bored. The Texas Agricultural Extension Service operates a fairly large
enterprise-wide network that stretches across hell's half acre,
otherwise known as Texas. We have around 3,000 users in 249
counties and 12 district offices who expect to get their e-mail and
files across our Wide Area Network. Some users actually expect the
network to work most of the time. We use Ethernet networking with
Novell servers at some 35 locations; about 15 of these have routers
connected via a mixture of 56Kb circuits, fractional T1,
frame-relay and radio links. We are not currently using barbed wire
fences for our network, no matter what you may have heard.

I am privileged to be part of the team that set up and
maintains the network. We do not live in a perfect network
world—things happen. Scarcely a day goes by that we do not have
one or more WAN link outages, usually of short duration. We
sometimes have our hands full just keeping all the pieces
connected. Did I mention that the users expect the mail and other
software to actually work?

Cruising the USENET newsgroups, I read a posting about “Big
Brother, a solution to the problem of Unix Systems Monitoring”
written by Sean MacGuire of Montréal, Canada. I was intrigued to
notice that Big Brother was a collection of shell scripts and
simple C programs designed to monitor a bunch of Unix machines on a
network. So what if most of our mission-critical servers were
Novell-based? Who cares if some of our web servers run on
Macintosh, OS/2, Windows 95 or NT? We use both Linux and various
flavors of Unix in a surprisingly large number of places.

System administrators often reported difficult installations
and software incompatibilities with the monitoring software; thus,
frustrated users often gave us our first hint that all was not
well. We had cooked up a number of homemade monitoring systems;
pinging and tracerouting to all the servers can be very
informative. We even looked at a bunch of proprietary (and
expensive) network monitoring systems. It is amazing how much money
these systems can cost.

According to the blurb by Sean MacGuire on Big
Brother:

Big Brother is a loosely-coupled distributed set
of tools for monitoring and displaying the current status of an
entire Unix network and notifying the system administrator should
need be. It came about as the result of automating the day to day
tasks encountered while actively administering Unix systems.

The USENET news article provided a URL to the home site of
Big Brother, http://www.iti.qc.ca/iti/users/sean/bb-dnld/. I
pointed my browser to it and was rewarded with a blue image of a
sinister face peering out under the caption “big brother is
watching” against a purple background. After my initial shock, I
learned that Big Brother featured:

Web-based status display

Configurable warning and panic levels

Notification via pager or e-mail

Free and includes source code

I was fascinated, especially by the last item: “Free and
includes source code.” (I often tell people that Linux isn't free,
but priceless.) So what could a priceless package do for me? What
does Big Brother check?

Connectivity via ping

HTTP servers up and running

Disk space usage

Uptime and CPU usage

Essential processes still running

System-generated messages and warnings

Overall, very sensible. Looking for some “gotchas”, I found
I would need a Unix-based machine, a functioning web server and
browser (for the display), a compiler, Kermit and a modem line (for
the pager). A web server was no problem, as we run many. A C
compiler came with Linux, and we use Kermit on many machines with
modems. So far, so good.
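To make the idea of configurable warning and panic levels concrete, here is a minimal Python sketch of a disk-space check. It is not Big Brother's own code (which is shell scripts and C); the threshold values and the function names `classify` and `disk_percent_used` are my own assumptions for illustration.

```python
import shutil

# Hypothetical thresholds (percent of disk in use). The article does not
# give Big Brother's real defaults.
WARN_PERCENT = 90
PANIC_PERCENT = 95


def classify(percent_used, warn=WARN_PERCENT, panic=PANIC_PERCENT):
    """Map a usage percentage to a traffic-signal color."""
    if percent_used >= panic:
        return "red"
    if percent_used >= warn:
        return "yellow"
    return "green"


def disk_percent_used(path="/"):
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    pct = disk_percent_used("/")
    print(f"/ is {pct:.1f}% full: {classify(pct)}")
```

Keeping the thresholds as parameters means each host or filesystem could be given its own warning and panic levels.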

The Big Brother web site provided links to a few
demonstration sites, and a link to download the program as well. I
connected to a demonstration site and was greeted with an amazing
display:

As you can see, Big Brother is watching. While enduring the
scrutiny of the Orwellian face peering out at me, I examined the
rest of the display. It is colored like a traffic signal
(green/yellow/red), and the update time is clearly displayed
beneath it. To the right of “Big Brother” are four buttons,
clearly marked Help, Info, Page and View. Beneath the header area
is a table with six column headings and three rows,
each neatly labelled with a computer host name. The boxes formed by
the intersection of the rows and columns contain attractive green
and yellow balls. The overall effect is like a decorated tree. The
left side of the screen has a yellow tint, gradually becoming black
at the center.
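The tint at the edge of the page appears to follow the most severe color in the table. That is my reading of the display, not something the Big Brother documentation states here. Under that assumption, the roll-up could be sketched in Python as follows; the `SEVERITY` ordering and the function name `worst_color` are hypothetical.

```python
# Assumed ordering of the traffic-signal colors, least to most severe.
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def worst_color(statuses):
    """Return the most severe color in a {host: {check: color}} table.

    An empty table is treated as "green" (nothing is known to be wrong).
    """
    colors = [color
              for checks in statuses.values()
              for color in checks.values()]
    return max(colors, key=SEVERITY.__getitem__, default="green")


if __name__ == "__main__":
    table = {"iti-s01": {"conn": "green", "procs": "yellow"}}
    print(worst_color(table))
```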

Selecting the Help button gives a brief
explanation of Big Brother. Choosing the Info button provides a
much longer and more detailed explanation of the system, including
a graphic that really is worth a thousand words. The Page button sends a
signal to a radio-linked pager—not at all what I had expected.
Finally, the View selection provides a brief but
perhaps more useful view of the information, isolating only the
systems with problems.

In my case, only the “iti-s01” system was displayed. My
browser cursor indicated a link as it passed over each colored dot,
so I clicked on the blinking yellow dot and received this
message:

This puzzled me at first. How on earth could it know that? It
turns out that Big Brother (BB) checks the system /var/log/messages
file periodically and alerts on any line that begins with either
WARNING or NOTICE. As I am certain Sean MacGuire is very
conscientious, I suspect he adds that line to his message file, so
the viewer can see how Big Brother reports its findings.
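The message check described above can be approximated in a few lines of Python. This is only a sketch of the rule as I understood it (flag lines that begin with WARNING or NOTICE), not Big Brother's actual script; the function names are hypothetical.

```python
# Prefixes that trigger an alert, as described in the text.
ALERT_PREFIXES = ("WARNING", "NOTICE")


def alert_lines(lines, prefixes=ALERT_PREFIXES):
    """Return the lines that begin with one of the alert prefixes."""
    return [line.rstrip("\n") for line in lines
            if line.startswith(prefixes)]


def scan_log(path="/var/log/messages"):
    """Scan a log file and return its alert lines.

    Reading /var/log/messages usually requires root privileges.
    """
    with open(path, errors="replace") as log:
        return alert_lines(log)


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "NOTICE this is a test line\n",
        "Jan  1 00:00:00 host kernel: all is well\n",
        "WARNING something needs attention\n",
    ]
    for line in alert_lines(sample):
        print(line)
```

A periodic job would call `scan_log()` every few minutes and turn the column yellow if the returned list is not empty.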

Suddenly, my screen spontaneously updated. The update time
had changed by five minutes, and a blinking yellow dot appeared
under the column labelled procs. I clicked on
the blinking yellow dot and was informed that the
sendmail process was not running. This got me
really interested—Big Brother can monitor whether selected
processes are running.
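A process check of this kind is easy to picture. The Python sketch below assumes a list of required process names and compares it with what `ps` reports. The function names and the example list are mine, not Big Brother's, and the command names `ps` prints vary between systems (some truncate them, some show a full path).

```python
import subprocess


def running_commands():
    """Return the set of command names reported by `ps -e -o comm=`."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return {line.strip() for line in result.stdout.splitlines()
            if line.strip()}


def missing_processes(required, running):
    """Return the required process names that are not in `running`."""
    return [name for name in required if name not in running]


if __name__ == "__main__":
    # Hypothetical list of essential processes for one host.
    essential = ["sendmail", "cron"]
    for name in missing_processes(essential, running_commands()):
        print(f"{name} is not running")
```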

Being a little puzzled about the screen's ability to update
itself, I viewed the document source and discovered some HTML
commands that were new to me:

The first META line instructs browsers to
get an update every 120 seconds. The second tells the browser to
get a new copy after the expiration time and date—very clever.
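For readers who want to try this on their own status pages, here is a Python sketch that generates a pair of META lines of this kind. The exact lines from the Big Brother page are not reproduced here; the sketch assumes the common `HTTP-EQUIV` refresh and expires idiom, and the function name `refresh_meta_tags` is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def refresh_meta_tags(seconds=120, now=None):
    """Build META lines asking a browser to reload after `seconds`.

    `now` must be a timezone-aware UTC datetime; it defaults to the
    current time. The expiry is set to the moment of the next refresh.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    expires = format_datetime(now + timedelta(seconds=seconds), usegmt=True)
    return [
        f'<META HTTP-EQUIV="REFRESH" CONTENT="{seconds}">',
        f'<META HTTP-EQUIV="EXPIRES" CONTENT="{expires}">',
    ]


if __name__ == "__main__":
    for tag in refresh_meta_tags():
        print(tag)
```

The two lines would go in the HEAD section of the generated page, so the script that rebuilds the display only has to rewrite the file on each pass.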

I returned to the graphics window and discovered that the
yellow area on the left had changed to red. A new host name row
appeared with a blinking red dot under the column labelled
conn. I clicked on the blinking red dot and read
this message:

The connection to the machine called
router-000 had been interrupted, and the
administrator had been paged. Amazingly, while in Texas, I had
become aware of a network outage in Montréal, Canada. This really
had possibilities—perhaps someday I may get to take a vacation.