Mater artium necessitas

I recently had an issue at a remote location (12,000 km away) where the old multi-purpose Linux server that had been working for the past 5 years wouldn’t boot again after a nasty power failure.
The server was used as a firewall, a local email store, a file server and a backup server, so its failure was a big deal for the small business that was using it.

RAID configurations explained

You can’t always have complete redundancy, so the occasional bad crash is to be expected over the years. Fortunately, I always build my servers around a simple software RAID1 array, and that leaves some hope for recovery.
In this instance, the server would start and then fail miserably in a way that suggested a hardware failure of some sort. Since I could not be physically present and there was no dedicated system admin on location, I directed the most knowledgeable person there to use a spare internet router to restore Internet connectivity, and then to connect one of the disks to another Linux server (their fax server) through a USB external drive enclosure.

With that done, I was able to connect remotely to the working server, mount the disk and access the data.

Salvaging the data

Once one of the RAID1 drives was placed into the USB enclosure and connected to the other available Linux box, it was easy to reassemble and mount its arrays.

fdisk will tell us which partitions are interesting, assuming that /dev/sdc is our USB hard drive:
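As a rough sketch of that step, the helper below filters fdisk output down to the partitions flagged as RAID members. The function name raid_partitions is my own invention for illustration, and it assumes the partition type column contains the word "raid" (as in "Linux raid autodetect"), which may differ on your system.

```shell
# Hypothetical helper: read `fdisk -l` output on stdin and print only the
# device names whose line mentions a RAID partition type.
# Assumption: the type column contains "raid" (e.g. "Linux raid autodetect").
raid_partitions() {
    grep -i 'raid' | awk '/^\/dev\// { print $1 }'
}

# Usage (as root, with the USB drive assumed to be /dev/sdc):
#   fdisk -l /dev/sdc | raid_partitions
```

Running mdadm --examine on one of the listed partitions is another way to confirm that it really carries a RAID superblock.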

The --run argument forces the RAID array to be assembled and started; otherwise, mdadm will complain that only a single drive is available instead of the two (or more) it expects.
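A minimal sketch of the assembly step follows. The array name /dev/md6 is taken from the mount example in this article, but the member partition /dev/sdc6 is only an assumption on my part, so substitute whatever fdisk reported on your drive. The DRY_RUN switch and the function name are also just illustrative conveniences that let you preview the command before running it as root.

```shell
# Assemble a RAID1 array from a single surviving member.
# Assumptions: /dev/sdc6 is the member partition (hypothetical name);
# DRY_RUN is an illustrative switch, not an mdadm feature.
assemble_degraded() {
    md_dev=$1
    member=$2
    # --run starts the array even though only one of its mirrors is present.
    # With DRY_RUN set, the command is printed instead of executed.
    ${DRY_RUN:+echo} mdadm --assemble --run "$md_dev" "$member"
}

# Preview the command without touching any device:
DRY_RUN=1 assemble_degraded /dev/md6 /dev/sdc6
# prints: mdadm --assemble --run /dev/md6 /dev/sdc6
```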

Now simply mount the assembled partition to make it accessible in /mnt for instance:

[root@fax ~]# mount /dev/md6 /mnt

We can now salvage our data by repeating this process for every partition.
Using RAID1 means you have at least 2 disks to choose from, so if one is damaged beyond repair, you may be lucky and find that its mirror on the other drive still works.
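Repeating the process by hand gets tedious, so here is one way it could be scripted. This is only a sketch: the array and partition pairs, the mount points and the /srv/salvage destination are all hypothetical names, and mounting read-only and copying with cp -a are my own choices rather than anything the recovery requires. With DRY_RUN set, the steps are printed instead of executed.

```shell
# Print (or run) the recovery steps for one array.
# Usage: salvage_array <md device> <member partition> <mount point> <destination>
salvage_array() {
    run=${DRY_RUN:+echo}          # "echo" in dry-run mode, empty otherwise
    $run mdadm --assemble --run "$1" "$2" &&
    $run mkdir -p "$3" "$4" &&
    $run mount -o ro "$1" "$3" && # read-only, to avoid touching the damaged data
    $run cp -a "$3/." "$4/" &&    # copy everything, preserving permissions
    $run umount "$3" &&
    $run mdadm --stop "$1"
}

# Hypothetical array:partition pairs; adjust to what fdisk reports.
DRY_RUN=1
for pair in md5:sdc5 md6:sdc6; do
    md=${pair%%:*}
    part=${pair##*:}
    salvage_array "/dev/$md" "/dev/$part" "/mnt/$md" "/srv/salvage/$md"
done
```

Leave DRY_RUN unset (and run as root) once the printed commands look right for your layout.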

If the drives are not physically damaged but they won’t boot, you can always use a pair (or more) of USB HDD enclosures and reconstruct the RAID arrays manually from another Linux box.
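With both drives attached, mdadm --assemble can be given every member partition at once, and /proc/mdstat then shows whether the mirror came up complete. The small filter below is a sketch for spotting arrays that are still running degraded; the function name is hypothetical, and it assumes the usual /proc/mdstat layout where a missing member shows up as an underscore in the status field (for example [U_]).

```shell
# Hypothetical helper: read /proc/mdstat-style text on stdin and print the
# names of arrays that have a missing member (an "_" in the [UU] status field).
degraded_arrays() {
    awk '/^md/ { name = $1 }
         /\[[U_]*_[U_]*\]/ { print name }'
}

# Usage, after assembling from both enclosures (device names are assumptions):
#   mdadm --assemble /dev/md6 /dev/sdc6 /dev/sdd6
#   degraded_arrays < /proc/mdstat
```

An empty result means every assembled array reports all of its members as up.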

Planning for disasters

The lesson here is about planning: you can’t foresee every possible event and have contingencies for each one of them, either because of complexity or cost, but you can easily make your life much easier by planning ahead a little bit.

Most small businesses cannot afford dedicated IT staff, so they will usually end up having the least IT-phobic person become their ‘system administrator’.
It’s your job as a consultant or technical support contact to ensure that they have the minimum tools at hand to perform emergency recovery, especially if you cannot intervene quickly yourself.

On-Site emergency tools

Every small business’s spare parts closet should contain at least:

A spare Linux box whenever possible, even if it’s built from older salvaged components (like a decommissioned PC). Put a generic Linux install on it and make sure it is configured so that it can simply be plugged in and accessed from the network.

A spare US$50 router, preferably pre-configured to be a temporary drop-in replacement for the existing router/firewall. Ideally, configure it to forward port 22 (SSH) to the spare Linux box so you can easily access that box from outside.

A USB external hard drive enclosure.

A spare PC power supply.

Some network cables and a couple of screwdrivers.

There are many more tools that can be kept on site, such as rescue CDs (like bootable Linux distributions), spare HDDs, etc., but remember that your point of contact needs to be your eyes and hands, so the tools you provide should match their technical abilities.
Don’t forget to clearly label confusing things like network ports (LAN, WAN) on routers, cables and PCs.

The point is that if you can’t be on site within a short period of time, then having these cheap tools and accessories already on site means that your customers can quickly recover just by following your instructions over the phone.
Once everything is plugged in, you should be able to carry out most repairs remotely.

@RobNY: if you just follow the instructions in the article you should have no problem recovering your data (as long as the drive has not physically failed and the data is not irremediably corrupted, of course).

@ ‘IGadget’: that picture has been on the internets for a while and I couldn’t find the original for it.
If anybody does, please let me know so I can give proper credit.
Just search for water+cooler+raid in Google…
