Recovering from a Hard Drive Failure

Have you ever woken up in the morning and said to yourself, “Today is the day that I'm finally going to back up my workstation!” only to find out that you're a day late and about 320GB short? Well, that's about what happened to me recently, but don't worry, the story has a happy ending. I'm getting ahead of myself, though.

Most people's excuse for not performing routine maintenance or regular backups is that they just don't have time. So when I discovered that I had some down time, I decided to take care of a few issues on my workstation. I performed a system update. Since I leave my system on all the time, I decided to upgrade the kernel and try to get software suspend working so I could cut down on energy consumption and heat production in my office. Finally, I resolved to finish backing up my home directory.

The system update went without incident and the kernel compiled and installed without error. The next step was to reboot into the new kernel. When the kernel panic'ed, I figured that I had missed something in the kernel configuration, so I rebooted back to my older kernel, which also panic'ed. Since this system had been running not 15 minutes ago, I knew things were about to get ugly.

At this point, I remembered that I had been doing some testing with an Ubuntu live CD, so I booted the live CD. At least now, I could get some work done, even though my workstation was “toes up.” This would also give me a platform from which to work on my regular hard drive, or so I thought. When I attempted to mount /dev/sda3, I was told that it didn't exist. Fdisk told me that my partition table was mostly gone! All that was left was /dev/sda1, where I keep my kernel, and /dev/sda2, which is where I swap. I posted a message describing my situation to the Gentoo user's group and was told that I should look into a program called testdisk.
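For readers wondering what fdisk is actually looking at, here is a minimal Python sketch of reading the four primary entries from a classic MBR partition table. It assumes the disk uses an MBR (DOS) label, which the article does not state, and the function name read_mbr_partitions is made up for this illustration. It only reads; it repairs nothing.

```python
import struct

SECTOR_SIZE = 512
TABLE_OFFSET = 446           # the partition table starts here in a classic MBR
ENTRY_FORMAT = "<B3sB3sII"   # status, CHS start, type, CHS end, first LBA, sector count


def read_mbr_partitions(sector):
    """Return (slot, type_id, first_lba, sector_count) for each used primary entry."""
    if len(sector) < SECTOR_SIZE or sector[510:512] != b"\x55\xaa":
        raise ValueError("no MBR boot signature found")
    partitions = []
    for slot in range(4):
        start = TABLE_OFFSET + slot * 16
        _status, _chs_first, type_id, _chs_last, first_lba, count = struct.unpack(
            ENTRY_FORMAT, sector[start:start + 16])
        if type_id != 0:  # a type of 0 marks an unused slot
            partitions.append((slot + 1, type_id, first_lba, count))
    return partitions
```

Given the first 512 bytes of a disk (read as root from a device such as /dev/sda), a table in the state described above would yield only two entries, for example a Linux partition (type 0x83) and a swap partition (type 0x82).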

I figured that I should at least assess /dev/sda1, so I tried to mount it. No such luck. The filesystem wasn't recognized. A quick look at /proc/filesystems told me that Ubuntu hadn't loaded ext2 support into the kernel. Further investigation revealed that Ubuntu loaded all of its drivers from an initial ram disk and they weren't immediately available in /lib/modules. I couldn't bring myself to dissect an initial ram disk image on a system that was RUNNING on a ram disk, so out came the Gentoo installation CD.
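The /proc/filesystems check is easy to script. The following is a small Python sketch, with a made-up function name, that turns the contents of that file into a set of filesystem names. It relies only on the file's usual layout: an optional "nodev" flag followed by the name.

```python
def supported_filesystems(proc_text):
    """Parse the text of /proc/filesystems into a set of filesystem names."""
    names = set()
    for line in proc_text.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[-1])  # an optional "nodev" flag precedes the name
    return names
```

Calling it on the text read from /proc/filesystems and testing for "ext2" in the result reproduces the check described above. A missing name only means the running kernel has not registered that filesystem yet; the driver may still exist as a module that is not loaded.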

It was while watching the Gentoo CD boot that I saw the IDE disk seek error messages for the first time. I don't reboot my system very often and the Ubuntu live CD hides those messages from you, so who knows how long I'd been working with a drive that needed to be replaced?

Once the Gentoo CD had booted, it was time to try to recover my system. I discovered that testdisk wasn't installed on the CD, so I had to wget and untar it first. Oddly enough, I had to run testdisk and reboot a couple of times before I had a partition table that looked sane. When I tried to mount the filesystem, I was told that mount couldn't find a valid filesystem. As a last-ditch effort, I decided to try to fsck the filesystem anyway. The fsck program reported that it couldn't find a superblock, but this was the first good news I had received so far; I knew I could use the -b parameter and ask fsck to use a backup superblock. At least fsck hadn't choked completely. So, I issued a command like fsck -y -t ext2 -b 8192 /dev/sda3 to see what would happen. When fsck started to spew error messages indicating fix-ups it was performing, I decided that the process would take a while and went to bed for the night.
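Where the backup superblocks live depends on the filesystem's block size, so the number passed to -b varies from one filesystem to the next. The e2fsck manual page gives 8193 for 1 KiB blocks and 32768 for 4 KiB blocks. The Python sketch below estimates the full list under stated assumptions: default mke2fs geometry (8 × block size blocks per group) and the sparse_super feature, which keeps backups only in group 1 and in groups that are powers of 3, 5, or 7. The function name is made up for this illustration, and the output is an estimate, not a substitute for asking the tools about the real filesystem.

```python
def backup_superblock_locations(block_size, total_blocks):
    """Estimate the block numbers of ext2 backup superblocks (sparse_super layout)."""
    blocks_per_group = 8 * block_size  # one bitmap block tracks a whole group
    first_data_block = 1 if block_size == 1024 else 0
    # Ceiling division: a partial last group still counts as a group.
    group_count = -(-(total_blocks - first_data_block) // blocks_per_group)

    def is_power_of(value, base):
        while value % base == 0:
            value //= base
        return value == 1  # also true for group 1 (base ** 0)

    return [first_data_block + group * blocks_per_group
            for group in range(1, group_count)
            if any(is_power_of(group, base) for base in (3, 5, 7))]
```

With a 1 KiB block size the first entry is 8193, and with 4 KiB blocks it is 32768, so a value remembered from one filesystem may not fit another. Running mke2fs with the -n option (a dry run that writes nothing) on the partition prints the locations that would be used with the current defaults.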

When I woke up, I found that fsck had finished, so I mounted the resulting filesystem. I was really hoping to see all of my files intact, but no, all I saw was /lost+found. When I cd'ed into the lost+found directory, I got my first glimpse of just how bad things had been. The fsck program had done its job and recovered my filesystem, but it was unable to recover any of the file names at the root of the partition, so it moved the files to the lost+found directory and renamed each file after its inode number. All I had was a list of files and directories with names resembling #19539303. And the directory list was several screens in length; I usually keep a pretty clean / directory, so obviously, fsck had encountered a lot of trouble.

One of these oddly-named directories was my /home directory. I made an educated guess as to which one that was and sure enough, I had user directories. (My /home directory was the one reported with the largest file size.) Deeper inspection revealed that most of my files seemed to be there, and they were properly named! I was in business!
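The guesswork of picking /home out of a screenful of inode-numbered names can be narrowed down with a script. The Python sketch below ranks the directories in lost+found by the total size of the regular files beneath them. That is a slightly different measure from the size shown in a directory listing, which is what I went by, and both function names are made up for this illustration.

```python
import os


def tree_size(path):
    """Total size in bytes of the regular files below path, skipping unreadable ones."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                pass  # damaged entries are expected on a recovered filesystem
    return total


def rank_recovered_directories(lost_found):
    """Return (size, name) pairs for directories in lost_found, largest first."""
    sizes = []
    with os.scandir(lost_found) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes.append((tree_size(entry.path), entry.name))
    return sorted(sizes, reverse=True)
```

The first few entries of the result are the likeliest candidates for /home or /usr; a quick look inside each one confirms the guess.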

When my new disk arrived, I installed it and started copying my old files onto the new drive. I was immediately struck by how slow this process was going. It was as if I were transferring the files over a dial-up modem! It didn't help that the IDE subsystem had reset a few times in the process. At this rate the new drive would be out of warranty by the time my file recovery was complete, so I had to do something. It turns out that I had accumulated a lot of files in my home directory that I really didn't need. I had downloaded games and other software and simply built them in my home directory rather than installed them on the system. After I had pruned out all of the files and directories that I didn't care about, I was able to recover the rest of my /home directory.

So there you have it. When I started, I had a dead machine, a failing hard drive, a corrupt partition table, and a corrupt filesystem. When I had finished, I had at least recovered the important files from the system and had been able to carry on my day-to-day work without too much interruption, thanks to the Live CD. But there are some lessons to be learned here, which is why I chose to write about my experience.

I should have backed up yesterday. But for the record, my business files were on my server and I have redundant, off-site backups of them. I was mostly interested in recovering my password wallet, a few pictures and videos that I'd saved, and a few miscellaneous documents. OK, lesson learned.

But there's more. I was grateful to be able to keep running using a Live CD. However, I'm a KDE user and the Ubuntu CD that I had was Gnome-based. I got my work done, but it would have been nice to be in an environment that I was accustomed to using. In the future, I'll be keeping a Knoppix or Kubuntu CD handy.

I also found that my Gentoo CD just wasn't up to the task of system recovery. I'll be burning a genuine recovery disk, as soon as I have a system on which to burn CDs.

I really needed to have a set of emergency CDs handy for this situation. I could see having a CD wallet that had a Live CD, a Recovery CD, and an Installation CD. Having these CDs handy would have saved me a lot of time.

That said, I have to say that I'm glad to have been able to recover my data and that I wasn't down too terribly long in the process. I also wanted to mention how helpful the Linux Community is in times like this. I'm a fairly experienced Linux user, but it was sure nice to be able to ask questions before I actually committed changes to disk. I hope my tale of woe serves as both warning and encouragement to you; stuff happens, and you can recover from it.

______________________

Mike Diehl is a freelance Computer Nerd specializing in Linux administration, programming, and VoIP. Mike lives in Albuquerque, NM, with his wife and 3 sons. He can be reached at mdiehl@diehlnet.com.

Congrats on getting your data back. Whenever this happens to me (sometimes I use old hard disks from the office...), I turn the computer off, grab a cup of coffee and think. Then I boot my computer with my Parted Magic USB stick and connect an extra hard disk by USB (if pressed for hard disk space). And then I image the faulty disk with dd (noerror option!) to a file on the USB hard disk. Then I run fsck on this file (you could make a second copy of this file, to stay on the safe side), thus not further degrading the data on the disk.

With a live CD, an external USB HDD and some patience, you can save yourself a lot of data recovery headaches. Use dd to make an image of the whole drive and gzip it to your external HDD. That way you can restore the image back to the HDD (or to another drive) and then start with various recovery techniques until you have what you want back.

First, I would power the system off and probably let the disk cool down. Then, only upon arrival of the new disk, I would copy the old one block-by-block to the new one with ddrescue or dd_rescue/gddrescue. In one pass. Then put that old disk aside forever. Then, if any partitions are lost, I would search for them with testdisk or gpart. Then fsck and whatever. The number of read or write operations on the old failed disk should indeed be minimized.

But instead, from the start, it should be known that using a single large disk for your important data is certainly a recipe for disaster in the long run. And it should never hurt to spend on 2 or more disks and make RAID1 or RAID5 array out of them. And to implement some backup strategy, of course, too.

I have to echo, doing fsck with repair is about the worst thing you can do and anyone following this advice can do infinitely more damage than has already occurred. (been there, done that :( ). Stick with dd and do dissection on the image.

My 2 cents worth: distros such as RIPlinux (recovery is possible), SystemRescue, TRK (TrinityRescueKit) etc and to a lesser extent GParted and PartedMagic, were created for times like yours/those; (and whilst on the subject, rebuilding the LiveCD/USB with sleuthkit and autopsy added, is not a bad idea.)

No, definitely don't try fsck on a dead disk. If fsck notices problems in a partition, it will usually try to fix them. This means that you'll just lose more data. Don't fiddle with dead disks; you could end up losing otherwise recoverable data.

The first thing you should do is use dd or ddrescue to get as much as possible off the disk, and into an image on another. Then you can play around with the data without risk of losing the whole lot. There are great tools that can recover partition information, and others, like foremost, that can attempt to recover files from a disk image. This is much safer than playing around with a partially damaged, dying disk.
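To make the idea concrete, here is a minimal Python sketch of the copy loop behind that advice: read the source block by block, and when a block cannot be read, record its offset and write zeros in its place so everything after it stays at the right position in the image. This is roughly what dd does with conv=noerror,sync. The function name and the read_block parameter are made up for this illustration, and a real tool such as ddrescue does far more (retries, a map file, smarter ordering), so treat this as an explanation rather than a replacement.

```python
import os


def rescue_copy(src_fd, dst_fd, total_size, block_size=4096, read_block=os.pread):
    """Copy total_size bytes between file descriptors, zero-filling unreadable blocks.

    Returns the list of source offsets that could not be read.
    """
    bad_offsets = []
    offset = 0
    while offset < total_size:
        length = min(block_size, total_size - offset)
        try:
            data = read_block(src_fd, length, offset)
        except OSError:
            data = b""  # nothing usable came back from this block
            bad_offsets.append(offset)
        # Pad failed or short reads so later blocks keep their offsets.
        os.pwrite(dst_fd, data.ljust(length, b"\0"), offset)
        offset += length
    return bad_offsets
```

A small block size loses less data around each bad spot but makes more read attempts on the failing drive, which is the trade-off the comments above are warning about. All further work (testdisk, fsck, file carving) then happens on the image, not on the dying disk.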

If the kernel had just received an audit notification from the IRS, then "panicked" would be correct, and in this case "panicked" is not incorrect, but since he's talking about a "kernel panic" this is poetic license. E.g. "I ssh'd into the system" or "I googled it".

The great thing about English is that there's no noun that can't be verbed.

While we have been known to make mistakes, our editors are in fact so multi-talented that they speak Geek as well as English, and thus are able to discern the finer differences between actual mistakes and the intentional alterations of "geek-speak."

In this case "panic'ed" is accepted terminology. It refers to a kernel panic.

So the rest of the misspellings and bad grammar are just geek-speak? Oh I know, it's better to spend all kinds of time making dopey excuses than learning correct English. Sorry, I thought LJ was a professional publication, not some pre-teen's basement blog.

So you're posting in the sysadmin category, and list yourself as a self-employed administrator, and you are only just learning this now? Backups are so easy to do these days, and a 1TB drive is now $100.

But I think the biggest problem here is the order of operations:
Step 1. Upgrade kernel
Step 2. Get suspend working
Step 3. Back stuff up (in case step 1 or 2 goes bad??)

Any kind of maintenance/upgrades are exactly the sort of thing you typically need backups for.

1. My WORK data was backed up off-site. I was hoping to recover some of my PERSONAL data.

2. I started with a corrupt partition table, a corrupt root filesystem, and misnamed files in /lost+found, and recovery was still possible. I think that's nice to know!

3. The kernel upgrade didn't cause the problem. A hard disk that had probably failed WEEKS ago caused the problem. As mentioned in the article, I was able to fall back to the previous version of the kernel.

4. The system continued to run, even though the drive was in quite bad shape.

The point of posting this in Sys Admin wasn't to serve as an example of how to run, say, a data center. The point was to demonstrate that even when bad things catch you by surprise, they may not be as bad as they appear.

Still, I hope it was an interesting read.
