From the Publisher

The end of March is quickly approaching
as I write. Here at SSC, March is always an exciting time as it is
the end of our accounting year. For the last two weeks Gena
Shurtleff and I (with help from others) have been working on a
budget for the next year of LJ, something that
has made some of us very grouchy. [mainly, our publisher --Ed.]

To make this process even more fun, there have been a lot of
computer-related problems—all related to our Linux systems. [Note:
if you are humor-impaired you may want to skip a lot of this.]
First, our main server failed. Linux apparently caused a pin to
break off the cable to the external SCSI disk drive. Then, a week
or so later, Linux broke a head on the disk drive in our firewall.
Next, the fan in our editor's computer started growling at her. On
top of all that, various systems in the office were mysteriously
crashing or exhibiting very strange behavior in general. The end
result was too many hours of downtime, lost e-mail and an unhappy
working relationship with our computers.

We began to wonder if Linux was good enough for us, and if
perhaps Windows NT might support “automatic pin re-soldering” and
“disk-head replacement”.

Once things calmed down we took a closer look at the
problems. By the way, “we” refers specifically
to Jay Painter (our new systems administrator), Peter Struijk and
me. First, we concluded that it probably wasn't the fault of Linux
that hardware was breaking. As scary as it is that multitasking
operating systems write to disks whenever they think it's a good
time, this really isn't a reason for a cable to break.

To address the issue of being down for longer periods of time
than we thought appropriate, let's look at some specific
cases.

Note: at this time, our various systems ran a host of
different software versions (kernels and C libraries), and we were
in the midst of converting them to Debian in order to have
consistency across all machines.

The Server Failure

The failure of the cable caused the data on the server disks
to be scrambled. Fortunately, we had back-up copies of the files,
so it seemed like a good time to do an upgrade. We made the logical
decision to reload a standard Slackware system (rather than try to
change to Debian), restore the user files and be on our way. It
turned out to be not so easy. The new load of Slackware had more
differences from the old version than we expected. Libraries were
different. NFS and NIS were different. Adobe fonts we use for doing
reference cards had to be reloaded. Configuration files for
groff had to be updated. A lot of work was done
to get the new configuration talking to all the old systems and to
get everything tuned.

Firewall Failure

Possibly inspired by the extra work the firewall was doing
during the time the server was down, a head died on the disk in the
firewall the next weekend, resulting in the loss of a lot of mail.
Why was it lost? Why wasn't it queued and then forwarded? Was this
another Linux shortcoming?

On investigation we found that backup MX (mail exchanger)
records were in place to take care of this very problem; however,
they were pointing to the wrong machine. Again, the problem could
not be pinned on Linux; it was an administrative error by a
previous systems administrator. The mistake went undetected because
this backup had never been exercised before, since Linux had been
working flawlessly.
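
A check of this kind can be scripted so that a misdirected backup is caught before it is needed. What follows is only a minimal sketch in Python under stated assumptions: the zone-file lines, the host names under example.com and the helper names parse_mx and misdirected_backups are all invented for illustration and are not taken from our configuration. It reads MX lines in the usual zone-file form and reports any backup exchanger that is not on the list of hosts we expect.

```python
# Hypothetical zone-file excerpt; every name here is a placeholder.
ZONE = """\
example.com.  IN MX 10 mail.example.com.
example.com.  IN MX 20 oldbox.example.com.  ; backup exchanger
"""


def parse_mx(zone_text):
    """Return (preference, host) pairs from MX lines, lowest preference first."""
    records = []
    for line in zone_text.splitlines():
        # Drop any trailing comment, then split on whitespace.
        fields = line.split(";", 1)[0].split()
        if "MX" not in fields:
            continue
        i = fields.index("MX")
        if len(fields) < i + 3:
            continue  # malformed line: no preference or host after "MX"
        preference = int(fields[i + 1])
        host = fields[i + 2].rstrip(".").lower()
        records.append((preference, host))
    return sorted(records)


def misdirected_backups(zone_text, expected_backups):
    """Return backup exchangers (every MX after the primary) that are not expected."""
    records = parse_mx(zone_text)
    return [host for _, host in records[1:] if host not in expected_backups]


if __name__ == "__main__":
    for host in misdirected_backups(ZONE, {"backup.example.com"}):
        print("unexpected backup MX:", host)
```

The sketch only inspects text it is given; it makes no DNS queries, so it would have to be fed the zone data actually being served to be of any use.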

Strange Software Problems

Let's move on to those strange software
problems I mentioned. Surely we can find something to
pin on Linux here.

One machine, used as our DNS name server, had been less than
reliable. Two things in particular happened quite regularly. The
first was that syslogd, the system log daemon,
would hang in a loop eating up all available CPU time. While this
problem appeared to be triggered when the location of the log file
became unreachable (after a reboot of the file server or a network
problem), we haven't been able to fix it. However, it doesn't
appear to happen in newer Linux releases (our problem machine is
running 1.2.13) and, while it is irritating, it does not cause the
machine to crash—just to run slower than normal.

The other problem on this same machine was stranger, although
it turned out to be fixable. Multiple copies of
crond (the cron daemon) kept appearing on the
machine even though only one was initiated at boot time. One day, I
found 13 crond jobs running, killed 12 and, a
few hours later, found three still running.
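
Counting copies by hand gets old quickly. As a minimal sketch, and assuming a process listing with one "PID COMMAND" pair per line under a header row, the following Python reports any daemon that should run once but appears several times. The sample listing, the function name duplicated_daemons and the default list of daemon names are hypothetical.

```python
from collections import Counter

# Hypothetical process listing: a header row, then one "PID COMMAND" pair per line.
SAMPLE_PS = """\
  PID COMMAND
    1 init
   42 syslogd
   57 crond
  311 crond
  498 crond
"""


def duplicated_daemons(ps_text, single_instance=("crond", "syslogd")):
    """Return {name: count} for daemons that appear more than once in ps_text."""
    counts = Counter()
    for line in ps_text.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) >= 2:
            counts[parts[1]] += 1
    return {name: counts[name] for name in single_instance if counts[name] > 1}


if __name__ == "__main__":
    for name, count in duplicated_daemons(SAMPLE_PS).items():
        print(name, "is running", count, "times")
```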

At this point Jay jokingly said, “Maybe there is a cron job
starting cron jobs.” Well, since there were processes being
started by cron that didn't exist on the other machines, I started
looking around for suspicious jobs. The first couple of extraneous
jobs I found were benign, but then I found one that made both of us
realize that Jay's attempt at a joke wasn't a joke at all. There
was, in fact, a cron job that initiated another cron job. Or, more
accurately, a cron job that grepped for everything but a cron job,
attempted to kill all cron jobs and then started a new cron job. In
other words, it looked like a partially written script to do “who
knows what”, and nothing that would actually work. It was signed and
dated, so we could see both who wrote it and that the creation date
was about the time when stability problems first appeared. Again,
we had found “pilot error”, not a problem with Linux.
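
The hunt for suspicious jobs can be partly automated. The Python below is a minimal sketch, not the script we found and not a complete audit: it assumes the common crontab layout of five schedule fields followed by a command, and it simply flags any command that mentions cron. The sample crontab, its paths and the function name suspicious_entries are made up for illustration.

```python
# Hypothetical crontab; the paths and commands are placeholders.
CRONTAB = """\
# m h dom mon dow command
MAILTO=root
0 * * * * /usr/local/bin/rotate-logs
0,30 * * * * /usr/local/bin/fix-cron.sh
30 2 * * * killall crond; /usr/sbin/crond
"""


def suspicious_entries(crontab_text):
    """Return the commands of crontab entries that mention cron themselves."""
    flagged = []
    for line in crontab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # blank line or comment
        fields = stripped.split(None, 5)
        if len(fields) < 6:
            continue  # not five schedule fields plus a command (e.g. MAILTO=root)
        command = fields[5]
        if "cron" in command.lower():
            flagged.append(command)
    return flagged


if __name__ == "__main__":
    for command in suspicious_entries(CRONTAB):
        print("check this job:", command)
```

A match is only a hint that a job deserves a look; a harmless job may mention cron too, and a harmful one may hide it inside a script with an innocent name.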

There are more of these stories than there is space to tell
them. Basically what we found out was that even though various
distributions may have some kinks in them like a wrong file
permission at install time, they do all install. That's true for
Caldera, Linux FT, Debian, Red Hat, Slackware, Yggdrasil and all
the others. Software does not wear out. If the
system is running it is not likely to stop aside from hardware
errors. As a case in point, I still have a 0.99 kernel running on
the main machine at fylz.com that was installed in August, 1993. It
is an NFS server with three 38.8KB modems on it. The hardware is a
386DX40 with 8MB of RAM. Why haven't I upgraded it? It works, and
it is extremely stable. The last reboot was in November 1996, when
I turned off the machine to remove a Zip drive from the SCSI
bus.
