Posts Tagged ‘server’

I’ve been supporting RAID and servers for longer than I care to remember… I think the earliest server (excluding ICL mainframes) that I worked on was an ICL Intel 486 system, back in something like 1992.

This was before Windows and, as we know, things have moved on significantly since then. Servers were complicated beasts back then; now they're even more complex. One thing that really annoys me is 1st/2nd line technical support staff who, rather than admit they don't know what they're talking about, will recommend the wrong course of action because they don't know any different.

Here's my response to an email from a customer of ours with a 7-disk RAID 5 server that has significant bad sector problems across several of the disks in the volume. The customer has been told by his tech support to rebuild the array and everything will work fine… WRONG: the rebuild will fail because of the bad sectors across multiple drives, and that will cause a huge amount of irreversible data loss for the customer, who is a professional video editor.

Hi <X>,

My colleague <Y> has just informed me you’ve been in touch after speaking to tech support regarding your RAID.

The rebuild procedure they suggest will unfortunately not complete successfully because several of your hard drives have bad sector issues. Rebuilding is an automated software task that relies on every drive involved being free from bad sectors; it cannot cope with bad sectors, which are a physical problem. This is why drives with bad sectors have to be recovered using hardware rather than software. I wrote a blog post about this some time ago, titled something like '5 things you mustn't do if your RAID fails' – take a look at all of it, especially the last point: http://www.dataclinic.co.uk/raid-or-server-failure-the-top-5-things-to-avoid/

As you know, we’ve been doing this long enough to know what we’re talking about, so may I suggest two possible courses of action –

1. We complete the recovery as planned.

Or

2. We first clone all your hard drives (effectively copying them) before returning them to you, so you can then try the rebuild. Having cloned your hard drives means we can go back to the copies and perform the recovery when the rebuild doesn't work.

Unsuccessful rebuilds result in massive data loss across the entire RAID, and the loss is irreversible because the old (good) data is overwritten by the new (corrupt) data. It's one of the biggest causes of data loss on any type of RAID 5 system, and we wish tech support companies would stop recommending it: they are assuming all the hard drives in the array are free from bad sectors, when bad sectors are the reason your RAID fell over in the first place.
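To see why bad sectors on the surviving drives are fatal to a rebuild, here's a minimal sketch of RAID 5 parity reconstruction. This is a toy single-stripe model in Python, not anything a real controller runs; the drive contents, block sizes and the simulated bad sector are all invented for illustration:

```python
# Toy model of a single RAID 5 stripe: parity is the XOR of the data
# blocks, and a missing block is rebuilt by XORing everything that's left.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Three data drives; the parity block is the XOR of all data blocks.
drives = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(drives)

# Drive 0 fails outright: rebuild it from the survivors plus parity.
rebuilt = xor_blocks([drives[1], drives[2], parity])
assert rebuilt == drives[0]  # rebuild succeeds

# But if a *surviving* drive has an unreadable bad sector, the XOR has
# a missing input and the rebuild cannot complete for that stripe.
def read_block(data, bad=False):
    if bad:
        raise IOError("unrecoverable read error (bad sector)")
    return data

try:
    xor_blocks([read_block(drives[1], bad=True),
                read_block(drives[2]), parity])
except IOError as err:
    print("rebuild failed:", err)
```

The point the email makes falls straight out of the maths: reconstructing one drive requires reading every sector of every other drive, so bad sectors on a second drive stop the controller dead, and by then it has already begun overwriting the array.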

As a contractor in the computer support industry I come into contact with a lot of servers and RAID arrays. In fact, my main job is looking after the data held on SAN servers and other forms of network attached storage. I work for companies and government institutions as a sort of freelance computer troubleshooter, mostly using IBM, Dell and HP server equipment. The Dell servers are typically from the PowerEdge series and the HP kit is mainly ProLiant. Again, the equipment is hooked up to a SAN data network.

Redundant equipment is a big problem of mine: it's what happens when I inherit old legacy systems that really should have been decommissioned years ago but, because of budgetary constraints, have continued in use. I work on several HP ProLiant and Dell SAN servers that I'd love to switch off, migrating the data onto something far more up to date like a Dell blade or IBM System x server. Unfortunately, I don't really have any say in buying new equipment.

Older servers and computer equipment fail more regularly – they just do. Things wear out: hard drives fail, memory goes bad and UPSes die. What greeted me when I came into work last Monday was a failed SAN server array – 12 disks running in a RAID 5 configuration with a hot spare. Analysis of the server logs showed that one of the hard drives had dropped out of the array on Saturday, causing the hot spare to kick in. This had seemingly worked fine – the hot spare should simply have been rebuilt back into the array – but instead the whole array had fallen over.

In the server room the SAN's RAID BIOS reported that three of the hard drives had now dropped out of the array, which explained why the SAN server was no longer bringing the array online. What had caused the three drives to fail was at this point a mystery. The server in question ran part of the council payroll, so it was obviously important to get the SAN back up and running as soon as possible – but it had to be done in a way that followed best practice. It became my task, and no data could be lost in recovering the SAN either.

Now, I'm good at IT and SAN server support, I'll admit, but when I discovered that 2 of the 3 drives that had dropped from the array had mechanical faults, the problem was beyond my abilities. I had used a data recovery company a few years back, but they were no more. Searching online pointed me to a specialist SAN recovery company called RAID and Server Data Recovery; an online review or two told me they could be trusted and came recommended, so I called them.

I spoke to RAID and Server Data Recovery's specialist SAN recovery team, who confirmed what I already suspected: some of the drives had mechanical damage and would need clean room attention to progress the data recovery attempt. I got clearance for the costs from finance, loaded the SAN server into the car and drove it down to the recovery company.

Analysis showed one drive had suffered a head crash, while the other two had firmware issues. Firmware is the low-level code that runs the hard drive itself – effectively the drive's own operating system. It can become corrupted, and when it does, the drive fails. It turned out that this firmware problem was what had caused the SAN to crash, and all that needed fixing was the firmware on the two failed drives. This was indeed the case: after the repairs to the hard disks were completed and the drives re-integrated into the SAN's RAID BIOS, the SAN came back online and the data was accessible again. Panic over – the data was fully restored, which was the outcome everyone had wanted.