Tales from the Server Room - Panic on the Streets of London

I've always thought it's better to learn from someone else's mistakes
than from my own. In this column, Kyle Rankin or Bill Childers will tell
a story from their years as systems administrators while the other will
chime in from time to time. It's a win-win: you get to learn from our
experiences, and we get to make snide comments to each other. Kyle tells the
first story in this series.

I was pretty excited about my first trip to the London data center. I
had been to London before on vacation, but this was the first time I
would visit our colocation facility on business. What's more, it was the
first remote data-center trip I was to take by myself. Because I still
was relatively new to the company and the junior-most sysadmin at the time,
this was the perfect opportunity to prove that I knew what I was doing and
could be trusted for future trips.

The Best Laid Plans of a Sysadmin

The maintenance was relatively straightforward. A few
machines needed a fresh Linux install, plus I would troubleshoot
an unresponsive server, audit our serial console connections, and do a
few other odds and ends. We estimated it was a two-day job, but just in
case, we added an extra provisional day.

[Bill: If I remember right, I had to fight to get that extra day tacked
onto the trip for you. We'd learned from past experience that nothing
at that place seemed easy at face value.]

Even with an extra day, I wanted this trip to go smoothly, so I came up
with a comprehensive plan. I ordered the tasks by priority and gave each
one a detailed list of the commands and procedures I would use to
accomplish it. I even set up an itemized checklist of everything
I needed to take with me.

[Bill: I remember thinking that you were taking it way too
seriously—after all, it was just a kickstart of a few new machines. What could
possibly go wrong? In hindsight, I'm glad you made all those lists.]

The first day I arrived at the data center, I knew exactly what I needed
to do. Once I got my badge and was escorted through multiple levels of
security to our colocation cages, I would kickstart each of the servers
on my list one by one and perform all the manual configuration steps
they needed. If I had time, I could finish the rest of the maintenance;
otherwise, I'd leave any other tasks for the next day.

Now, it's worth noting that at this time we didn't have a sophisticated
kickstart system in place, nor did we have advanced lights-out
management, just a serial console and a remotely controlled power system.
Although our data center did have a kickstart server with a package repository,
we still had to connect each server to a monitor and keyboard, boot from
an install CD and manually type in the URL to the kickstart file.

[Bill: I think this experience is what started us down the path of a
lights-out management solution. I remember pitching it to the executives
as “administering from the Bahamas”, and relaying this story to them
was one of the key reasons that pitch was successful.]

Kicking Servers Like Charlie Brown Kicks Footballs

After I had connected everything to the first server, I inserted the CD,
booted the system and typed in my kickstart URL according to my detailed
plans. Immediately I saw the kernel load, and the kickstart process
was under way. Wow, if everything keeps going this way, I might even
get this done early, I thought. Before I could start making plans
for my extra days in London though, I saw the kickstart red screen of
death. The kickstart logs showed that for some reason, it wasn't able to
retrieve some of the files it needed from the kickstart server.

Great, now I needed to troubleshoot a broken kickstart server. Luckily,
I had brought my laptop with me, and the troubleshooting was
straightforward. I connected my laptop to the network, eventually got a
DHCP lease, pointed the browser to the kickstart server, and sure enough,
I was able to see my kickstart configuration files and browse through
my package repository with no problems.

I wasn't exactly sure what was wrong, but I chalked it up to a momentary
blip and decided to try again. This time, the kickstart failed, but at
a different point in the install. I tried a third time, and it failed
at the original point in the install. I repeated the kickstart process
multiple times, trying to see some sort of pattern, but all I saw was the
kickstart failing at a few different points.

The most maddening thing about this problem was the inconsistency. What's
worse, even though I had more days to work on this, the kickstart of this
first server was the most important task to get done immediately. In a
few hours, I would have a team of people waiting on the server so they
could set it up as a database system.

If at First You Don't Succeed

Here I was, thousands of miles away from home, breathing in the warm
exhaust from a rack full of servers, trying to bring a stubborn
server back to life. I wasn't completely without options just yet. I had
a hunch the problem was related to DHCP, so I pored through the logs on
my DHCP server and confirmed that, yes, I could see leases being granted
to the server, and, yes, there were ample spare leases to hand out. I even
restarted the DHCP service for good measure.

Finally, I decided to watch the DHCP logs during a kickstart. I would
start the kickstart process, see the machine get its lease, either
the first time or when I told it to retry, then watch it fail later on in
the install. I had a log full of successful DHCP requests with no explanation
of why it didn't work. Then I had my first real clue: during one of the
kickstarts, I noticed that the server had actually requested a DHCP lease
multiple times.
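
The clue here, one machine asking for a lease several times during a single install, is the kind of thing a short script can pull out of a busy log. What follows is a minimal Python sketch, not something that was run in the data center. It assumes syslog lines in the style ISC dhcpd writes ("DHCPDISCOVER from <MAC> via <interface>"); the function name, hostnames, and sample lines are invented for illustration.

```python
import re
from collections import Counter

# Assumed log style (ISC dhcpd via syslog):
#   "... dhcpd: DHCPDISCOVER from 00:16:3e:aa:bb:cc via eth0"
DISCOVER_RE = re.compile(
    r"DHCPDISCOVER from ((?:[0-9a-f]{2}:){5}[0-9a-f]{2})", re.IGNORECASE
)


def count_discovers(lines):
    """Count DHCPDISCOVER messages per client MAC address."""
    counts = Counter()
    for line in lines:
        match = DISCOVER_RE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


# Hypothetical sample lines standing in for a real log file.
sample_log = [
    "Jan 10 09:00:01 ks dhcpd: DHCPDISCOVER from 00:16:3e:aa:bb:cc via eth0",
    "Jan 10 09:00:02 ks dhcpd: DHCPOFFER on 10.0.0.50 to 00:16:3e:aa:bb:cc via eth0",
    "Jan 10 09:00:40 ks dhcpd: DHCPDISCOVER from 00:16:3e:aa:bb:cc via eth0",
    "Jan 10 09:01:15 ks dhcpd: DHCPDISCOVER from 00:16:3e:aa:bb:cc via eth0",
    "Jan 10 09:01:20 ks dhcpd: DHCPDISCOVER from 00:16:3e:11:22:33 via eth0",
]

for mac, count in count_discovers(sample_log).most_common():
    flag = "  <-- asked more than once" if count > 1 else ""
    print(f"{mac}: {count}{flag}")
```

A client that shows up several times within one install window is worth a second look, even when every individual request in the log appears to succeed.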

Even with this clue, I started running out of explanations. The DHCP server
seemed to be healthy. After all, my laptop was able to use it just fine,
and I had a log file full of successful DHCP requests. Here I turned to
the next phase of troubleshooting: the guessing game. I swapped cables,
changed which NIC was connected and even changed the switch port. After
all of that, I still had the same issue. I had kickstarted the machine
so many times now, I had the entire list of arguments memorized. I was
running out of options, patience and, most important, time.

[Bill: I remember seeing an e-mail or two about this. I was comfortably
ensconced at the corporate HQ in California, and you were working on
this while I was asleep. I'm sure I'd have been able to help more if
I'd been awake. I'm glad you were on the case though.]

Not So Fast

I was now at the next phase of troubleshooting: prayer. Somewhere around
this time, I had my big breakthrough. While I was swapping all
the cables around, I noticed something interesting on the switch—the
LEDs for the port I was using went amber when I first plugged in the
cable, and it took quite a bit of time to turn green. I noticed that the
same thing happened when I kickstarted my machine and again
later on during the install. It looked as though every time the server
brought up its network interface, it would cause the switch to reset
the port. When I watched this carefully, I saw during one install that
the server errored out of the install while the port was still amber
and just before it turned green!

What did all of this mean? Although it was true that the DHCP server was
functioning correctly, DHCP requests typically time out with an error
after 30 seconds. It turned out that this switch took right around
30 seconds to bring a port up. When the port came up in less than
30 seconds, I would get a lease; when it didn't, I wouldn't. Even
though I had found the cause of the problem, it didn't do me much good. Because
the installer appeared to reset its port at least three times, there was
almost no chance I would be lucky enough to get three
consecutive sub-30-second port resets. I had to figure out another way,
yet I didn't manage the networking gear, and the people who did wouldn't
be awake for hours (see sidebar).
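
To put rough numbers on "three consecutive sub-30-second port resets", here is a small Python sketch. It assumes each port reset independently beats the DHCP timeout with the same probability; the per-reset probabilities are illustrative guesses, not measurements from this switch, and the function name is made up.

```python
def chance_all_resets_succeed(p_single, resets):
    """Probability that every one of `resets` independent port bring-ups
    finishes before the DHCP timeout, given a per-reset chance p_single."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if resets < 0:
        raise ValueError("resets must not be negative")
    return p_single ** resets


# Assumed per-reset chances of the port coming up in under 30 seconds.
for p in (0.5, 0.3, 0.1):
    print(f"per-reset chance {p:.1f}: "
          f"whole install succeeds {chance_all_resets_succeed(p, 3):.3f}")
```

Even if each reset were a coin flip, only one install in eight would get through all three resets under this model, and the odds fall off quickly as the per-reset chance drops.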

The ultimate cause of the problem was that
every time the port was reset, the switch recalculated the spanning tree for
the network, which sometimes can take a minute or more. The long-term
solution was to make sure that all ports we intended to kickstart were
set with the portfast option so that they came up within a few seconds.

[Bill: One of the guys I worked with right out of college always told
me “Start your troubleshooting with the cabling.” When troubleshooting
networking issues, it's easy to forget about things that can affect the
link layer, so I check those as part of the cabling now. It doesn't take
long and can save tons of time.]

The Solution Always Is Right in Front of You

I started reviewing my options. I needed some way to take the switch
out of the equation. In all of my planning for this trip, I happened
to bring quite a toolkit of MacGyver sysadmin gear, including a short
handmade crossover cable and a coupler. I needed to keep the original
kickstart server on the network, but I realized if I could clone all of
the kickstart configurations, DHCP settings and package repositories
to my laptop, I could connect to the machine with a crossover cable and
complete the kickstart that way.
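
As one illustration of the file-serving half of that idea, the sketch below uses Python's standard http.server module to serve a directory of cloned kickstart files over HTTP. It is not what was actually set up on the laptop (the tools used were apt-get and rsync, and the laptop also needed its own DHCP service, which this does not cover). The function name, directory path, and address are hypothetical.

```python
import functools
import http.server


def make_kickstart_server(root_dir, host="127.0.0.1", port=0):
    """Build an HTTP server that serves the files under root_dir.

    port=0 asks the operating system for any free port;
    server.server_address reports the one that was chosen.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=root_dir
    )
    return http.server.ThreadingHTTPServer((host, port), handler)


# Example use (hypothetical path and address; blocks until interrupted):
#   server = make_kickstart_server("/srv/kickstart", host="192.168.50.1", port=8000)
#   server.serve_forever()
```

With something like this listening on the laptop's end of the crossover cable, the installer could be pointed at a kickstart URL on the laptop instead of the original server.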

After a few apt-gets, rsyncs, and some tweaking and tuning on
the server room floor, I had my Frankenstein kickstart server ready to
go. As I had hoped, the kickstart completed without a hitch. I was
then able to repeat the same task on the other two servers in no time
and was relieved to send the e-mail to the rest of the team saying that
all of their servers were ready for them, right on schedule. On the next
day of the trip, I was able to knock out all of my tasks early so I could
spend the final provisional day sightseeing around London. It all goes
to show that although a good plan is important, you also should stay flexible
for when something inevitably falls outside your plan.

[Bill: I'm glad you planned like you did, but it also highlights how
important being observant and having a good troubleshooting methodology
are. Although you were able to duct-tape a new kickstart server out of your
laptop, you could have spent infinitely longer chasing the issue. It's
just as important to know when to stop chasing a problem and put a
band-aid in place as it is to fix the problem in the first place.]

Kyle Rankin is a Systems Architect in the San Francisco Bay Area and the
author of a number of books, including The Official Ubuntu Server Book,
Knoppix Hacks and Ubuntu Hacks. He is currently the president of the North
Bay Linux Users' Group.

Bill Childers is an IT Manager in Silicon Valley, where he lives with his
wife and two children. He enjoys Linux far too much, and he probably should
get more sun from time to time. In his spare time, he does work with the
Gilroy Garlic Festival, but he does not smell like garlic.