My, you're getting old!

Aging is an interesting phenomenon. We're all aware of what
aging brings about: aches and pains, fading memory, the discovery
that you have a favorite easy chair, a new-found interest in
*redacted*. Yet we don't see the effects of the aging process on a
day-to-day basis. Like evolution, it takes a long time for
age-related deterioration to become apparent. And when it does, the
discovery can be an unpleasant experience -- or at least an
instructive one.

Computer hardware ages too, and deteriorates from constant use.
Case in point: I was contacted the other day by a colleague whose
client had recently experienced a mysterious rebooting of their
OpenServer host. This is a machine that our company had built and
installed early in 1998. By today's standards, this server is ready
for an easy chair. Yet, in more than 5-1/2 years of continuous
operation, the client had experienced absolutely no unannounced
downtime with this unit, which says a lot about the general
stability of the UNIX environment -- and, of course, the quality of
our products :-).

My first thought was that this was just another situation where
the power had failed long enough to cause the uninterruptible power
supply (UPS) to trigger an automatic shutdown due to a low battery.
The server lives in an office in Chicago, long famous for deep-dish
pizza, losing baseball teams and unreliable electrical service.
Power blips, sags, surges, spikes and outright failures are the
norm during the summer, especially given all the air conditioning
machinery in operation.

However, the client was absolutely certain that no protracted
power outage had occurred. This piqued my curiosity, given the
total lack of problems experienced with this machine. Logging into
the server, I rummaged around for a while but could not find any
kind of obvious problem. syslog had nothing in it that
caught my attention, other than an occasional complaint about no
tape being loaded at backup time (as an aside, syslog had
grown to be quite huge, which suggested to me that the client was
not reviewing the log at regular intervals). syslog did show
that a normal shutdown had occurred around the time in question,
but, of course, didn't enumerate a cause.

Not finding any kind of smoking gun to say otherwise, I was
reasonably certain that UNIX itself was not the culprit. On this
machine, a kernel panic would not have caused an unannounced
restart, as the PANICBOOT option in /etc/default/boot was set to
NO. Plus, whatever led up to the panic might have logged
something into syslog -- no such entry existed. While
the notion of a power-induced shutdown was still dancing in my
head, I was also wondering if maybe I was seeing some sort of
imminent hardware failure looming.
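
For anyone who would rather script that PANICBOOT check than eyeball the file, here is a minimal sketch. It assumes /etc/default/boot holds simple KEY=VALUE lines, and it treats a missing PANICBOOT entry as NO, which is my assumption rather than documented behavior. The function names and the sample contents are made up for illustration and were not copied from the client's server.

```python
def read_boot_defaults(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '=' at all
            settings[key.strip()] = value.strip()
    return settings


def reboots_after_panic(settings):
    """True only when PANICBOOT is explicitly YES (assumed default: NO)."""
    return settings.get("PANICBOOT", "NO").upper() == "YES"


# Hypothetical file contents, for illustration only.
sample = """\
AUTOBOOT=YES
PANICBOOT=NO
"""

if __name__ == "__main__":
    print(reboots_after_panic(read_boot_defaults(sample)))
```

On a live system you would pass in the contents of the real file instead of the sample string. A result of False means a panic would leave the machine halted rather than quietly restarting it.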

A problem experienced in some modern servers is microprocessor
overheating. Today's CPU speeds are much higher than the screaming
200 MHz of this relatively old machine (although 200 MHz did seem
awfully fast when this unit first went into service), and thermal
factors are much more important than they were in the past. I have
repaired relatively new servers that had unexpectedly gone down
because the CPU had overheated, often the result of a dirt-clogged
CPU cooler.

However, that was not a likely scenario with this particular
machine. Its processor simply doesn't run all that hot, its CPU
cooler has relatively coarse fins that don't tend to collect dirt,
and there is no automatic thermal shutdown feature in the
motherboard hardware -- the CPU would simply go up in smoke if it
did overheat.

I also considered possible memory or other hardware problems,
for example, a failing host adapter. None of these were likely, as
they would have resulted in the kernel voicing complaints to
various logs.

Given all this, it was hard for me to ignore my original
conclusion: extended AC power failure had occurred and the UPS had
shut down the server. Yet the client insisted that wasn't the case.
This seemed to make no sense to me at all, as this particular UPS
-- an industrial strength ferroresonant unit -- was capable of
sustaining operation on battery for close to an hour, assuming a
fully charged battery. Nevertheless, I combed through the UPS log
and, sure enough, a short duration power loss had occurred around
the time the mysterious reboot took place.

Unlike the inexpensive line-interactive and double-conversion
UPS's that have flooded the marketplace (a classic example of
caveat emptor, if there ever was one), a ferroresonant UPS
isolates its load from the power line with a special transformer
whose design tends to make it act like the electrical equivalent of
a flywheel. When a momentary reduction or loss of power line
voltage occurs, the massive ferroresonant transformer replaces the
missing AC cycles from energy stored within the transformer's iron
core. Aside from totally shielding the load from power line cruft,
this flywheel action gives the battery/inverter segment of the UPS
more than adequate time to come on-line and sustain the output,
resulting in no measurable break in the output. Naturally, the
transformer can only replace a small number of AC cycles, which
means unless power comes back relatively quickly, the battery must
be ready to immediately shoulder the load.

That is what should have happened in this case. However, the UPS
log showed that the UPS shutdown came within a few seconds of
the power interruption. The cause of the mysterious restart quickly
became apparent: the UPS's battery was simply unable to produce
enough output. As it turned out, this was the original battery and
thus was nearly six years old. Father Time had managed to do what
Chicago power could not do in the past: trigger an unannounced
server shutdown.

The moral of this story: load test your UPS at least every six
months to determine if it can sustain operation following power
failure. Change out your UPS batteries at least once every four
years, or better yet, every three years. Naturally, if you are not
comfortable with working inside a UPS, seek someone who is. Better
he should get the daylights shocked out of him than you!
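
If you look after more than a handful of these units, the schedule above is easy to turn into a reminder. What follows is only a sketch: the function name, the constants and the dates are invented for illustration, and it uses the stricter three-year replacement figure. It does not talk to a UPS; it just does the calendar arithmetic.

```python
from datetime import date

LOAD_TEST_INTERVAL_DAYS = 183  # roughly six months
BATTERY_MAX_AGE_YEARS = 3      # the stricter of the two replacement figures


def ups_maintenance_due(battery_installed, last_load_test, today):
    """Return the list of overdue tasks for one UPS."""
    tasks = []
    if (today - last_load_test).days > LOAD_TEST_INTERVAL_DAYS:
        tasks.append("load test")
    age_years = (today - battery_installed).days / 365.25
    if age_years >= BATTERY_MAX_AGE_YEARS:
        tasks.append("replace battery")
    return tasks


if __name__ == "__main__":
    # Hypothetical dates: an original battery that was never load tested.
    installed = date(1998, 2, 1)
    print(ups_maintenance_due(installed, installed, date(2003, 9, 1)))
```

Feed it the real installation and test dates from your own records and run it from cron, and nobody has to remember how old the battery is.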

While on the subject of aging, another gotcha involves hard
drives. The server in question was built with a Seagate Barracuda
SCSI hard drive, long considered one of the best and most
trustworthy designs ever conceived. Seagate's specifications for
the Barracuda line at the time stated that these drives had a five
year service life. Service life is determined by a combination of
past experience, wear measurements taken on laboratory samples and
failed units returned under warranty, mathematical extrapolations,
educated guessing, and possibly by tarot card reading. In other
words, if the manufacturer says the drive will survive five years
of service, consider that to be an optimistic estimate, not a
take-it-to-the-bank fact.

Generally speaking, expected service life will decrease as the
drive is subjected to more start/stop cycles. Start/stop cycles
cause wear to the heads and media, an unavoidable consequence of
media contact as the heads land in the parking zone. However, in
the case of most servers, start/stop cycles tend to be infrequent.
For example, this client's server has averaged only two to three
cycles per year. Therefore, drive aging has been primarily the
result of spindle bearing and head actuator mechanism wear.

Modern SCSI drives are able to automatically spare out bad
blocks that develop as the unit ages (resulting in what are
referred to as "grown" defects). As a result, most of the effects
of wear tend to not be noticed at all, something that is not true
of IDE drives. In a continuously running drive, spindle bearing and
head actuator mechanism wear is quite gradual. Like a slowly
weakening UPS battery, the effects of such wear tend to go
undetected until either the drive becomes noticeably noisy or an
abrupt failure occurs. As a rule, such failures result in immediate
and total loss of all data (your backups are current, right?).

The only sure way to avoid a catastrophic, wear-induced drive
failure is to replace the drive before it reaches the end of its
published service life. If you have any clients running on older
servers, be sure to get this information in front of them so they
can plan for the inevitable. After all, putting new hardware into
service is always more fun than dealing with the aches and pains of
age.

"Nevertheless, I combed through the UPS log and, sure enough, a short duration power loss had occurred around the time the mysterious reboot took place."

I neglected to mention that the "UPS log" referred to in the article is not a log on the server but one in the UPS itself. BEST Technology (now Powerware) ferroresonant UPS's are intelligent devices, capable of logging a wide variety of power related events. The software provided with the UPS allows one to log in to the UPS, interrogate operational parameters, read the log, load test the battery, make coffee, etc.