We have multiple machines with lots of different drives in them. For the most part, though, we are now trying to standardise on IBM X series servers. At present I've not figured out how to disable the onboard RAID controller. As a result, many of our servers expose two RAID 0 arrays (one per drive), and we run software RAID on top of those.

Verify the drive is dead

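In the example used throughout this page, the software RAID status shows that md1 is in a degraded state and that /dev/sdb2 is the failed drive. Notice that /dev/sdb1 (same physical drive as /dev/sdb2) is not marked as failed, and /dev/md0 (not yet degraded) is showing a good state. This is because /dev/md0 is /boot, which is rarely written to, so md has not yet had a reason to notice the problem. If you run: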

touch /boot/t
sync
rm /boot/t

That should make /dev/md0 notice that its drive has also failed. If it does not fail, it's possible the drive is fine and that some blip caused it to get flagged as dead. It is also worthwhile to log in to xenX-mgmt to determine whether the RSAII adapter has noticed that the drive is dead.

If you think the drive just had a blip and is fine, see "Re-adding" below.
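
The check described above can be scripted. What follows is a minimal sketch, assuming the usual /proc/mdstat layout in which each array's status line ends in brackets such as [UU] or [U_] and failed members carry an (F) marker. The function names degraded_arrays and failed_members are made up for this example.

```shell
# List md arrays that the mdstat file reports as degraded.
# An array counts as degraded when its status brackets, e.g. [U_],
# contain an underscore (a missing or failed member).
degraded_arrays() {
    # $1: file to read; defaults to /proc/mdstat. A saved copy also works.
    awk '
        /^md[0-9]+ :/ { name = $1 }
        /\[[U_]*_[U_]*\]/ { if (name != "") print name }
    ' "${1:-/proc/mdstat}"
}

# List members flagged as failed, one "array member" pair per line.
failed_members() {
    awk '
        /^md[0-9]+ :/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /\(F\)$/) {
                    dev = $i
                    sub(/\[.*/, "", dev)   # strip the "[n](F)" suffix
                    print $1, dev
                }
        }
    ' "${1:-/proc/mdstat}"
}
```

Running degraded_arrays and failed_members with no arguments reads the live /proc/mdstat. Running them before and after the touch/sync/rm test shows whether /dev/md0 has picked up the failure.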

Re-adding a drive (poor man's fix)

Basically what we're doing here is making sure the drive is, in fact, dead. Obviously you don't want to do this more than once on a drive. If it continues to fail, replace it.
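
The remove and re-add cycle can be sketched with mdadm's --remove and --add options. This is only an illustration, using the /dev/md1 and /dev/sdb2 names from the example above. readd_commands is a hypothetical helper, and it prints the commands for review rather than running them.

```shell
# Print (not run) the mdadm commands for re-adding a member to an array.
# readd_commands is a hypothetical helper name, not part of mdadm.
readd_commands() {
    array="$1"    # e.g. /dev/md1
    member="$2"   # e.g. /dev/sdb2, already marked failed by md
    # Remove the failed member from the array, then add it back,
    # which starts a resync onto that member.
    echo "mdadm $array --remove $member"
    echo "mdadm $array --add $member"
    # Recovery progress shows up in /proc/mdstat.
    echo "cat /proc/mdstat"
}

readd_commands /dev/md1 /dev/sdb2
```

Once the printed commands look right, run them as root one at a time.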

So we removed the bad drive and added it again, and you can now see the recovery status. Watch it carefully. If it fails again, it is time for a drive replacement.

Actually replacing the drive

Actually replacing the drive is a bit of a to-do. If the box is in a RH owned location, we'll have to file a ticket and get someone access to the colo. If it is at another location, we may be able to just ship the drive there and have someone do it on site. Please follow the steps below for drive replacement.

Data

There's a not insignificant amount of data you'll need in order to place the call. Please have the following information handy:

There are two ways to get the drive stats. You can get some of this information via hal, but for the complete information you need to either have someone physically go look at the drive (some of which is in inventory) or use RaidMan. See "Installing RaidMan" below for more information on how to install RaidMan.

Specifically you need:

Drive Size (in G)
Drive Type (SAS or SATA?)
Drive Model
Drive Vendor

To get this information run:

# cd /usr/RaidMan/
# ./arcconf GETCONFIG 1

4) The phone number and address of the building where the drive is currently located. This will go to the RH cage.

This information is located in contacts.txt in the private git repo on puppet1 (only available to sysadmin-main people).

Call IBM

Call 1-770-858-5079 and follow the directions they give you. You'll need to use the M/T above to get to the correct rep. They will ask you for the information above (you wrote it down, right?)

When they agree to replace the drive, make sure to tell them you need the shipping number of the drive as well as the name of the tech who will do the drive replacement. Sometimes the tech will just bring the drive. If not though, you need to open a ticket with the colo to let them know a drive is coming.

Get the package, give access to the tech

As SOON as you get this information, open a ticket with RH at is-ops-tickets redhat.com and request a ticket ID. If the tech has any issues getting into the colo, you can give the AT&T ticket request to the tech to get them in.

NOTE: this can often take hours. We have a 4 hour on-site response time from IBM. This time goes very quickly, so you may sometimes need to page someone in IS to make sure the ticket gets created quickly. To get this pager information see contacts.txt in puppet1's private repo (if puppet1 is down for some reason see the dr copy on backup2.fedoraproject.org:/srv/

Prepwork before the tech arrives

Really the big thing here is to remove the broken drive from the array. In our earlier example we found /dev/sdb failed. We'll want to remove it from both arrays:
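
One way to do this is sketched below, assuming the standard /proc/mdstat layout. removal_commands is a hypothetical helper name. It prints one mdadm command per array that contains a partition of the given disk, so the list can be reviewed before anything is run. A member that md still considers healthy (such as /dev/sdb1 in /dev/md0 in our example) has to be marked failed before it can be removed, which is why each command uses --fail followed by --remove.

```shell
# Print the mdadm commands that fail and remove every partition of one
# disk from the md arrays listed in the mdstat file.
# removal_commands is a hypothetical helper name.
removal_commands() {
    disk="$1"                    # bare disk name, e.g. sdb
    mdstat="${2:-/proc/mdstat}"  # a saved copy can be passed instead
    awk -v disk="$disk" '
        /^md[0-9]+ :/ {
            for (i = 3; i <= NF; i++) {
                if ($i ~ ("^" disk "[0-9]+\\[")) {
                    part = $i
                    sub(/\[.*/, "", part)   # strip the "[n](F)" suffix
                    printf "mdadm /dev/%s --fail /dev/%s --remove /dev/%s\n", $1, part, part
                }
            }
        }
    ' "$mdstat"
}
```

For the example above, run removal_commands sdb and then execute the printed commands as root.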

Next get the current state of the drives and save it somewhere. See "Installing RaidMan" for more information if RaidMan is not installed.

# cd /usr/RaidMan
# ./arcconf GETCONFIG 1 > /tmp/raid1.txt

Copy /tmp/raid1.txt off to some other device and save it until the tech is on site. It should contain information about the failed drive.

Tech on site

When the tech is on site you may have to give him the rack location. All of our Mesa servers are in one location, "the same room that the desk is in", in either the first rack on the left labeled "01 2 55" or "01 2 58". You may also have to give him the serial number of the server, or possibly make it blink.

Once he's replaced the drive, he'll have you verify. Use the RaidMan tools to do the following:

First we're going to re-scan the array for the new drive. Then we'll re-get the configs and save them to /tmp/raid2.txt. Compare /tmp/raid2.txt to /tmp/raid1.txt and verify that the bad drive is fixed and that it has a different serial number. Also make sure it's the correct size.

The final command creates a new logical drive from the physical drive. "Simple_volume" tells it to create a raid0 array of one drive. The size was pulled out of our initial /tmp/raid1.txt (it should match the other drive). The last two numbers are the Channel and ID of the new drive.

Once everything checks out, thank the tech and send him on his way.
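
The comparison of the two saved dumps can be partly scripted. The sketch below assumes that each drive's serial appears in the GETCONFIG output on a line containing the words "Serial number". Check that against your own dump and adjust the pattern if needed. The helper names serials and serials_changed are made up for this example.

```shell
# Print the serial-number lines from a saved GETCONFIG dump.
# Assumption: the dump labels each drive's serial with the words
# "Serial number"; adjust the pattern if your arcconf output differs.
serials() {
    grep -i 'serial number' "$1" | sed 's/^[[:space:]]*//'
}

# Succeeds (exit 0) when the two dumps list different serial numbers,
# which is what we expect after a drive has been swapped.
serials_changed() {
    before="$(serials "$1")"
    after="$(serials "$2")"
    [ "$before" != "$after" ]
}
```

For example, serials_changed /tmp/raid1.txt /tmp/raid2.txt && echo "serial numbers differ". A plain diff of /tmp/raid1.txt and /tmp/raid2.txt is still worth reading in full, since this check says nothing about drive size or state.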

Rebuild the array

Now that the disk has been replaced we need to put a partition table on the new drive and add it to the array:
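
A common way to do this is to copy the partition table from the surviving drive with sfdisk and then add each new partition back to its array with mdadm. The sketch below assumes /dev/sda is the healthy drive, /dev/sdb is the replacement, and both should have the same layout. The array-to-partition pairing (md0 with sdb1, md1 with sdb2) follows the example used on this page. rebuild_commands is a hypothetical helper that only prints the commands so they can be reviewed first.

```shell
# Print (not run) the commands to clone the partition table from the
# healthy drive and re-add the new partitions to their arrays.
# rebuild_commands is a hypothetical helper name.
rebuild_commands() {
    good="$1"   # healthy drive, e.g. /dev/sda
    new="$2"    # replacement drive, e.g. /dev/sdb
    shift 2
    # Dump the healthy drive's partition table and write it to the new drive.
    echo "sfdisk -d $good | sfdisk $new"
    # Remaining arguments are array:partition pairs.
    for pair in "$@"; do
        echo "mdadm ${pair%%:*} --add ${pair#*:}"
    done
}

rebuild_commands /dev/sda /dev/sdb /dev/md0:/dev/sdb1 /dev/md1:/dev/sdb2
```

Double-check which drive is the source and which is the target before running the sfdisk line, since writing the table to the wrong drive would destroy the good copy. Rebuild progress can then be followed in /proc/mdstat.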

Installing RaidMan

Unfortunately there is no feasible alternative to RaidMan for managing IBM RAID arrays without causing downtime. You can manage the arrays via the pre-POST interface, but this requires downtime and, if the first drive is the failed drive, may result in a non-booting system. So for now RaidMan it is, until we can figure out how to get rid of the RAID controllers in these boxes completely.
