I got a call this morning that people couldn't access data and email was down so I popped over to the office to see two bays on my R510 blinking amber. I found that I have two failed drives on a RAID5 array. If this were a single drive I'd hotswap the bad one and let the RAID rebuild. With two I'm not even sure what to do.

At first one drive was blinking amber and read "Failed" in the H700 utility while the other just showed as missing. After a reboot the "failed" drive now shows as "missing" as well. I have our technology provider getting back to me with support, but I'm in panic mode. On boot the adapter reports "Foreign Configurations" and the server won't even boot to ESXi like before. That array housed our ERP, email, SharePoint, and tons of other data.

I'd appreciate any advice as my position (and skillset) is meant to maintain the systems and this is somewhat over my head.

Subbing to this thread and taking notes. I too have a RAID 5 array running that's been acting a bit strange. All systems are go for now, but I'm preparing for a crash.

Best of luck to you.

I hope you have a good backup in place; if not, you need to make one ASAP (just in case it does crash).

That's the thing. I don't. Our files are stored on external disks and cloud storage, but a proper backup is nonexistent.

I linked my boss to this thread as well, and I'm looking for a backup solution NOW. But I digress here, I'm not going to hijack this post.

You are not. The OP can serve as a bad example you can learn from. Step #1 is to get a backup by any means necessary - N O W ! ! !

The OP has said he is in over his head and needs help. Without regard for their feelings, here is a way to proceed. (Feelings don't matter; my advice is sincere, having been in his shoes and learned from it.)

Yes, the OP should contact Dell - IF they had a backup. Dell, although great for RAID and server support, isn't going to do much to recover data. Dell is still useful, as I explain below.

To the OP: STOP HELPING! All you are doing is making the RAID 5 impossible to recover. Write the controller and drives off now to be sent out for recovery. It's your only hope of recovering anything off the drives aside from dumb luck - and reaching for dumb luck has a good chance of making recovery impossible.

As the data appears to not be backed up, the OP needs to stop messing with it! (Preferable in cases like this is to shut it off and leave it off until there is a plan in place.) As it's up and running, leave it alone while you call Kroll Ontrack or another data recovery firm. Get them on the phone, get a plan, then have the boss sign off on the cost. You see, every second the drives and controller are powered, they want to do housekeeping. And the OP keeps taking extreme risks by punching buttons and swapping stuff without a plan or a clue as to how to recover the data. Being powered on reduces your recovery chances. Trying things without a data recovery plan can wipe or corrupt the entire remaining disks with one wrong choice. Forcing a drive online forces corrupted data into the array, and swapping disks out just makes a mess as the data on them becomes outdated and unusable to the RAID 5.

Your choice is clear - a data recovery company and a plan to maybe recover from this point, or screw around till they fire you and your replacement curses you for messing around with an array that might have been recoverable by a data recovery company. You still have time to come out a hero on this. Next, if the data is gone, you have to have a plan to put things back together, including obtaining support from the vendors whose maintenance was allowed to lapse. If the company blames you for this event (beyond the oversight of backups), walk away - learn from it, but walk away.

The main point here is you need to have a plan and then delegate things like six-hour drives to get parts. If the server is under warranty, Dell can get parts next day, or in 4 hours if you have critical warranty coverage. Dell would have sent you new drives, and you could have sent the old drives out for recovery and, worst case, later paid Dell for the non-returned drives. This can still be part of your plan depending on what the data recovery company can do or needs.

Do you have another server or 'something' with enough drive space to hold a 'data dump' of this drive array? A PC with a huge hard drive can work in this emergency...
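If you do go the "data dump" route, here is a rough sketch of what imaging the array (or an individual suspect disk) to a file could look like, assuming the volume is still visible to an OS as a block device. The device and destination paths are made-up examples, and in practice a purpose-built tool like GNU ddrescue is the safer choice - this is only to show the idea of copying what you can and skipping what you can't.

```python
#!/usr/bin/env python3
# Rough sketch of a "data dump": image a block device to a file on a
# machine with plenty of free space, skipping unreadable regions instead
# of aborting. SRC and DST are made-up example paths; in practice a
# purpose-built tool such as GNU ddrescue is the safer choice.

import os

SRC = "/dev/sdb"                  # hypothetical: the exposed array / suspect disk
DST = "/mnt/bigdisk/array.img"    # hypothetical: destination with enough space
CHUNK = 1024 * 1024               # copy 1 MiB at a time

def image_device(src: str, dst: str, chunk: int = CHUNK) -> None:
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        size = fin.seek(0, os.SEEK_END)   # block devices report their size via seek
        offset, bad = 0, 0
        while offset < size:
            want = min(chunk, size - offset)
            fin.seek(offset)
            try:
                data = fin.read(want)
            except OSError:
                data = b"\x00" * want     # unreadable region: pad to keep offsets aligned
                bad += 1
            fout.seek(offset)
            fout.write(data)
            offset += want
        print(f"copied {offset} bytes, hit {bad} unreadable chunk(s)")

if __name__ == "__main__":
    image_device(SRC, DST)
```

Even a rough image like this gives a recovery firm (or a later you) something untouched to work from while you experiment elsewhere.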

I agree with Gary and Robert's observations. If the RAID 5 was just housing ESXi (which seems like an inordinate amount of space for it), you could just reinstall ESXi on a flash drive and get things running again.

You could try powering the server off and back on and forcing the array back online. I've had that work with people who had a RAID 5 failure and no backups. There were corrupt files, but a lot was salvaged. Here's hoping you have good backups.

Perhaps this will be a good example to some of the SW members who think there is next to zero chance of seeing a two-drive RAID 5 failure. We aren't kidding when we say we've seen it happen.

- We have backups but apparently my predecessor only set it to back up the "data" drives of some VMs and not the C: drives. Big problem for our ERP system. I can't even process starting from scratch on that.

- On boot it says it has detected and can import a foreign config on disks. Not sure if that is a good or bad idea.

- Our support company guy said that I can put an identical drive in the bay of the drive that failed first and there's a 75% chance it can rebuild, and that having the 512GB CacheCade helps my chances. Of course this was before, when one of the drives showed up as "failed" instead of both now being "missing" with foreign configs.

- I'm driving to Atlanta today (6 hrs round trip) since it's amazingly the closest place that carries 512GB SSDs in stock.

- The majority of the VMs had complete backups, so not only our data but our server applications should be fine when they bring them in today. I'll pull them over to another host to get them rolling. SharePoint and Exchange should be good.

- The one VM that was only backing up its "D:" drive and not its main OS was our ERP system.

I'm banking on about a 35% chance I'll get fired. Although I can't conceive as to why my predecessor backed the servers up in this manner, I also had the responsibility to check his work when I came in last year. Damn assumptions.

If all else fails I may have to send the disks off to a RAID data recovery service to get the small (but integral) amount of data we need. FML.

I'm banking on about a 35% chance I'll get fired. Although I can't conceive as to why my predecessor backed the servers up in this manner, I also had the responsibility to check his work when I came in last year. Damn assumptions.

You made a mistake. That's not relevant right now; you've learnt a lesson and the company is down.

You're employed now and your responsibility now is to get things working that you can, find out what is broken and see what can be done to fix it.

- On boot it says it has detected and can import a foreign config on disks. Not sure if that is a good or bad idea.

Should be fine. Making a clone of the disks is not a bad idea, but it's important to know what DATA is on that RAID 5 area.

briansorrells wrote:

- We have backups but apparently my predecessor only set it to back up the "data" drives of some VMs and not the C: drives. Big problem for our ERP system. I can't even process starting from scratch on that.

What do you have of the ERP system right now? Where is its data - on the RAID 5 array or the RAID 10?

briansorrells wrote:

- I'm driving to Atlanta today (6hrs roundtrip) since it's amazingly the closest place that carries 512GB SSDs in stock.

Why are you focused on getting replacement 512GB SSDs? That RAID 5 array is gone. CacheCade is not part of the RAID; it won't help you fix the issue.

briansorrells wrote:

- The majority of the VMs had complete backups, so not only our data but our server applications should be fine when they bring them in today. I'll pull them over to another host to get them rolling. SharePoint and Exchange should be good.

Exchange and SharePoint will be dead in the water if the DC cannot be recovered - unless you already have a DC up and running somewhere else?

IMO, the key response now isn't to be driving around the country to buy SSDs but to be getting the machines on the RAID 10 array up and running.

I am sorry to hear your pain. While it may be cold comfort - there are lessons to be learned.

re: "On boot it says it has detected and can import a foreign config on disks. Not sure if that is a good or bad idea." - the only time I ever saw this was when I lost a RAID 5 array. :-( We came up after a power failure, and I said yes. Wrong answer - the entire array was lost. I did get all the data back after sending it to a professional data recovery firm. I am paranoid and would not move beyond that step; I would seek professional help.

As to getting fired, you inherited a mess - and that's not really your fault. You need to get as much back as possible - and the above sounds like a great start. Concentrate on getting things back up. It sounds like ERP is your only real sticking point. Blame the foul-up on your predecessor's failure to plan for all eventualities. And put it back to your boss, who presumably had approved your predecessor's design? In terms of getting fired, it would also be a huge mistake from the company's perspective. YOU have intimate knowledge of how to get things back on the road, and that must be important. Additionally, you are best placed to be able to update things so this never happens again.

Obviously, you need to put what you have learned into practice and re-think your backup strategy. Clearly the one you have at present failed - so how can you make it better? In doing that, you have some freedom to look at things like backup to the cloud, or DR to the cloud. And while you are at it, isn't this a great time to re-architect the core applications to be more fault tolerant?
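As one concrete piece of that re-think: a big part of the failure here was that nobody noticed only the "data" drives were being captured. Below is a minimal, hypothetical sketch of the kind of scheduled sanity check that flags missing or stale backups before you need them - the paths, VM names and one-folder-per-VM layout are illustrative assumptions only, not the OP's actual environment.

```python
#!/usr/bin/env python3
# Hypothetical sketch of a backup sanity check: flag VMs whose newest
# backup file is missing or older than a threshold. The paths, VM names
# and one-folder-per-VM layout are illustrative assumptions only; a real
# check should also confirm that every virtual disk (C: and D:) is in
# the backup set, not just that "a" backup exists.

import os
import time

BACKUP_ROOT = "/backups/vms"                  # hypothetical backup destination
EXPECTED_VMS = ["ERP01", "EXCH01", "SP01"]    # hypothetical VM names
MAX_AGE_DAYS = 2

def newest_backup_age_days(vm_dir):
    """Age in days of the newest file under vm_dir, or None if empty."""
    newest = None
    for root, _dirs, files in os.walk(vm_dir):
        for name in files:
            mtime = os.path.getmtime(os.path.join(root, name))
            newest = mtime if newest is None else max(newest, mtime)
    if newest is None:
        return None
    return (time.time() - newest) / 86400

def main():
    for vm in EXPECTED_VMS:
        vm_dir = os.path.join(BACKUP_ROOT, vm)
        if not os.path.isdir(vm_dir):
            print(f"ALERT: no backup folder at all for {vm}")
            continue
        age = newest_backup_age_days(vm_dir)
        if age is None:
            print(f"ALERT: backup folder for {vm} is empty")
        elif age > MAX_AGE_DAYS:
            print(f"ALERT: newest backup for {vm} is {age:.1f} days old")
        else:
            print(f"OK: {vm} backed up {age:.1f} days ago")

if __name__ == "__main__":
    main()
```

The point isn't the script itself - any monitoring tool can do this - it's that backup coverage and freshness get checked by something other than an assumption.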

Regarding the trip to Atlanta. My advice - and take it for what it's worth - don't drive. You are stressed, and 6 hours in US traffic is hard work. You are not in the best condition. Find a motorcycle courier and authorise them to get the parts to you. That will, if nothing else, cut the time in half - as you do not need to drive there AND back. And tell the courier company that fast delivery will be compensated. Someone who knows the road can get it to you FAST. I did this once - had a hard disk motorcycled from South London to Swindon - we had it just an hour or so later. I did not ask too many questions, but the chap got a jolly nice tip. So save yourself the stress and let someone else do the driving.

After some further thinking on this, and still wondering what was on the RAID 5, I was wondering this. If you have backups of everything and the ERP is your most critical business system, why have you not been working to get that up and running first and foremost on any piece of hardware you can put together? Most ERPs run off of some sort of database. If you have a full backup of the database and it is a verified good backup, get with the ERP vendor/consultant and get it running. You are running around like a chicken with your head cut off chasing the wrong solution. With all the time spent since your initial posting you could have had some sort of infrastructure running at this point, and Monday morning isn't too far off.

Two drives failing at once is highly suspect. I'd power off completely so the controller and backplane can reset, reseat the drives while it's powered down just to ensure there isn't an intermittent connection somewhere, and go from there. I wouldn't reseat the drives with it powered on, as the controller will assume a drive failure.

Yes, you're in a tight spot. It happens to everyone at one point, no matter what level of expertise they have.

You're a help desk guy hired to run an entire plant's IT. They knew what they were getting and knew you'd have to grow into the job. You've been picking away at things a little at a time. If something was going to fail, it was very likely it was going to be something you hadn't gotten to yet.

Rockn is right - make sure you're concentrating on the correct solution, not necessarily the obvious one. Your mission is to get the systems back up. Come up with 2 or 3 different ways to do that and begin doing them.

Firstly, you should be getting 4 new SSDs, not 1 or 2. If you plan to save the array, you need to take it offline NOW. Plan to rebuild and restore from backups to a clean disk array. If you can't get SSDs today, skip it. Use Winchesters (plain spinning disks). You just need an array, not necessarily the same one.

In retrospect, protecting the installations and data should have been your first priority - not low hanging fruit like cable management. But, that's done and the real challenge is forward, not backward.

Run your restore scenarios in parallel. Get help if you need it. Keep your chain of command informed. Don't let them wonder what the status is. If you're 6 hours from Atlanta, you have access to a heck of a lot of IT support in your area. Make some calls if you need to.

Work as if you're NOT going to get fired. And if you do, you can say, "And I did all I could to get us back up and running. Boy, did I learn a costly lesson." I would hire a guy who learned - but not one who gave up.

- We have backups but apparently my predecessor only set it to back up the "data" drives of some VMs and not the C: drives. Big problem for our ERP system. I can't even process starting from scratch on that.

First priority, then, is to rebuild the OS and install the ERP software. Since you haven't answered our question about what is on the drives, I'm assuming you don't know. Put a bootable disk in there and find out. If your data is intact on the RAID 10 array, start planning your recovery.

- Our support company guy said that I can put an identical drive in the bay of the drive that failed first and there's a 75% chance it can rebuild, and that having the 512GB CacheCade helps my chances. Of course this was before, when one of the drives showed up as "failed" instead of both now being "missing" with foreign configs.

I have no idea what this means. Before you do anything this guy tells you to, make damn sure you aren't going to lose your data forever.

- The majority of the VMs had complete backups, so not only our data but our server applications should be fine when they bring them in today. I'll pull them over to another host to get them rolling. SharePoint and Exchange should be good.

- The one VM that was only backing up its "D:" drive and not its main OS was our ERP system.

I think your recovery is in sight as long as you don't do anything stupid - meaning without thinking or by listening to your MSP who will later say, "Oh, we thought you meant..."

Explain to everyone in a calm voice that restoration is a process, not a single moment in time. And you are following that process to ensure an orderly recovery. Then do it. You're on the right track.

- We have backups but apparently my predecessor only set it to back up the "data" drives of some VMs and not the C: drives. Big problem for our ERP system. I can't even process starting from scratch on that.

[ ... ]

Unless you've signed something saying he passed the system to you in an acceptable state, you're safe. Blame him and relax; no CV-update-triggering events detected. RAID 5 can fail, and it did fail. With 5-7 SSDs it's OK and reasonable to use RAID 5, so... no big deal! Recover from backups and take as much time as you need.

- This is a 12-bay server. Six 2TB drives in RAID 10 and six 512GB SSDs in RAID 5. There are seven virtual disks, two of which were located on the RAID 5 array. Those two VDs house the handful of VMs that we've lost.

- The ERP system is apparently one of those systems that takes months to set up and is insanely complex. The backups handled the ERP's "data drive" but not the C: drive with the programs on it. So we have the database and such, just not much else. They also let their maintenance expire, so any support we buy will be 3rd party.

- We have a DC running at another location so we're fine on that front.

- The storage expert dude is convinced that since they didn't fail at the same time, replacing the first one to go bad and doing "something" to the other missing disk will allow the RAID to rebuild itself. I just hope the attempt will not thwart outside recovery efforts should it not succeed. I'd rather send it to a recovery service than start over.

I appreciate the good advice. I hope this guy knows what he's doing. The company is super reputable, we use them for many services, and they rarely disappoint. Right now I'm waiting on the storage dude to call me back so he can hopefully walk me through some steps.

Not sure I understand your setup, but it seems sort of odd.

Personally, if I had set this up, I would have mirrored the OS HDDs if at all possible. If you use a flash card, etc. for the hypervisor, that's no issue; if you run a server OS and then the hypervisor on top of it, it's iffy. Ideally, I'd mirror the host HDDs, maybe put a couple of VMs' system drives on those, set up another mirrored set for the other higher-use VMs, then use RAID 5 or preferably RAID 6 for the data.

The logic for using the mirror/RAID combo is that if an OS HDD goes out, you reboot and should be running again. This also gives you the ability to stay running if the RAID array goes out - generally it would still work in a degraded state; replace the HDD when you can and rebuild.

It almost sounds like your predecessor put the VM OS drives and data on the same RAID volume, or at least the same disks. Either way, that is kind of silly.

Multiple simultaneous drive failures are most often caused by the drives reaching the end of their rated service life (MTBF), or by a fault in that production batch. Either way, you need to do a restore from backup. If you want the upper hand, proactively replace drives one at a time, spacing replacements out by a week or two. That will reduce the likelihood of multiple simultaneous drive failures. Also, if you are planning to continue using RAID-5, don't - go to RAID-6 (a second parity stripe, so the array can survive two drive failures).
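On the proactive side, drive health is something you can script checks for. The sketch below assumes smartctl (from smartmontools) is installed and that the physical disks behind the PERC are addressable with the "-d megaraid,<N>" device type - both assumptions to confirm against your own controller before relying on it; the device node and disk IDs are hypothetical.

```python
#!/usr/bin/env python3
# Sketch of a proactive drive-health check using smartctl (smartmontools).
# Assumes smartctl is installed and that disks behind the PERC/MegaRAID
# controller are addressable as "-d megaraid,<N>" against the controller's
# device node - confirm both assumptions for your own hardware first.

import subprocess

CONTROLLER_DEV = "/dev/sda"   # hypothetical: device node the controller exposes
DISK_IDS = range(6)           # hypothetical: six physical disks, IDs 0-5

def drive_health(disk_id):
    cmd = ["smartctl", "-H", "-d", f"megaraid,{disk_id}", CONTROLLER_DEV]
    result = subprocess.run(cmd, capture_output=True, text=True)
    out = result.stdout.strip()
    if "PASSED" in out or "OK" in out:
        return "healthy"
    if not out:
        return "CHECK ME: no output from smartctl"
    return "CHECK ME: " + out.splitlines()[-1]

def main():
    for disk_id in DISK_IDS:
        print(f"disk {disk_id}: {drive_health(disk_id)}")

if __name__ == "__main__":
    main()
```

Run something like this on a schedule (or let your monitoring platform do the equivalent) and a marginal drive is far more likely to be swapped before its neighbour fails too.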

There are lots of ways the drive and controllers can fail and taking a closer look may give options.

If the drives didn't fail at the same time then whatever is contained on the first one to fail is pretty much worthless.

The 2nd drive, however, being an SSD: if it has suffered an electronic death you may be in trouble; if it was removed by the RAID controller for exceeding a fault or time tolerance, you may be able to get most of your data back.