jueves, 4 de agosto de 2011

Unix Horror Stories: The good thing about Unix, is when it screws up, it does so very quickly

The project to deploy a new, multi-million-dollar commercial system on two big, brand-new HP-UX servers at a brewing company that shall not be named, had been running on time and within budgets for several months. Just a few steps remained, among them, the migration of users from the old servers to the new ones.

The task was going to be simple: just copy the home directories of each user from the old server to the new ones, and a simple script to change the owner so as to make sure that each home directory was owned by the correct user. The script went something like this:

As you see, this script is pretty simple: obtain the user and the home directory from the password file, and then execute the chown command recursively on the home directory. I copied the files, executed the script, and thought, great, just 10 minutes and all is done.

That's when the calls started.

It turns out that while I was executing those seemingly harmless commands, the server was under acceptance test. You see, we were just one week away from going live and the final touches were everything that was required. So the users in the brewing company started testing if everything they needed was working like in the old servers. And suddenly, the users noticed that their system was malfunctioning and started making furious phone calls to my boss and then my boss started to call me.

And then I realized I had thrashed the server. Completely. My console was still open and I could see that the processes started failing, one by one, reporting very strange messages to the console, that didn't look any good. I started to panic. My workmate Ayelen and I (who just copied my script and executed it in the mirror server) realized only too late that the home directory of the root user was / -the root filesystem- so we changed the owner of every single file in the filesystem to root!!! That's what I love about Unix: when it screws up, it does so very quickly, and thoroughly.

There must be a way to fix this, I thought. HP-UX has a package installer like any modern Linux/Unix distribution, that is swinstall. That utility has a repair command, swrepair. So the following command put the system files back in order, needing a few permission changes on the application directories that weren't installed with the package manager:

swrepair -F

But the story doesn't end here. The next week, we were going live, and I knew that the migration of the users would be for real this time, not just a test. My boss and I were going to the brewing company, and he receives a phone call. Then he turns to me and asks me, "What was the command that you used last week?". I told him and I noticed that he was dictating it very carefully. When we arrived, we saw why: before the final deployment, a Unix administrator from the company did the same mistake I did, but this time, people from the whole country were connecting to the system, and he received phone calls from a lot of angry users. Luckily, the mistake could be fixed, and we all, young and old, went back to reading the HP-UX manual. Those things can come handy sometimes!

Morale of this story: before doing something on the users directories, take the time to see which is the User ID of actual users - which start usually in 500 but it's configuration-dependent - because system users's IDs are lower than that.

Send in your Unix horror story, and it will be featured here in the blog!