What is a punctured RAID array?

What is a punctured stripe or a punctured RAID array, and how do you recover from it?

To understand the concept of a punctured stripe, we first need to understand what a RAID array is and how information is stored on the disks in a RAID configuration. In this post I use RAID 5 with three drives as an example and explain how a puncture happens and how to get rid of it.

What is RAID 5?

In RAID 5, data is striped across all the member disks along with distributed parity. If one of the drives goes bad, its data can be rebuilt by recalculating it from the data and parity on the remaining drives.

But if two drives go bad, there is no way to rebuild the data back to its original state. On most LSI-based controllers, whenever one disk in a container (virtual disk) fails, the controller marks that virtual disk as degraded.
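As a simplified sketch of the parity idea (real controllers operate on whole stripes with rotating parity, not single byte strings), RAID 5 parity is an XOR across the data blocks, so any one missing block can be recomputed from the others:

```python
# Simplified RAID 5 parity demo: the parity block is the XOR of the
# data blocks, so any single missing block can be recomputed.

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# Data blocks on drive 0 and drive 1; parity stored on drive 2.
d0 = b"HELLO123"
d1 = b"WORLD456"
parity = xor_blocks(d0, d1)

# Drive 0 fails: rebuild its block from drive 1 and the parity.
rebuilt_d0 = xor_blocks(d1, parity)
assert rebuilt_d0 == d0  # one failed drive is fully recoverable
```

This is also why two failed drives are unrecoverable: with only one surviving block per stripe, the XOR cannot be solved for the two missing ones.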

What causes a puncture?

Several things can cause a puncture, but it usually starts with a failed drive. For instance:

John is a busy system admin whose job is to monitor a Dell PE 1950 with a PERC 5/i controller installed (RAID 5 with three disks). He did not bother to do anything unless there was an amber light reporting an error on the front LCD panel. One ugly Monday he came to work and saw the drive in slot 0 blinking amber. He called support and ordered a new drive. Once he received it, he yanked the bad hard drive out and put the new one in. As soon as he did, the array started rebuilding, and within an hour or so all the drives were green again.

What did John do wrong?
Most of us will say he didn’t do anything wrong. So let’s move forward.

A couple of days later, John found that the drive in slot 1 was now blinking amber. Oh! Bummer. He called support again, got another drive, and went through the same routine.

What did John do wrong this time?
Hmmm, let’s say nothing, because multiple drives can plausibly fail within a week of each other. No big deal.

One day some of the users were experiencing disk issues, and John thought it might be because the server had been up and running for several months. So he rebooted the server, and during POST he saw the message “One of the virtual disks is in an offline state. System halted.” John called support again, and after some troubleshooting and a look at the controller logs, the tech on the phone told him he had a punctured array.

In the above scenario John did almost everything correctly, except for one little thing. When he realized that the drive in slot 0 was bad, he quickly grabbed a new one and put it in without checking the state of the other disks. While the disk in slot 0 was blinking amber, some bad logical blocks already existed on the disk in slot 1 (in the controller logs this usually appears as a medium error: sense key 3/11).

Medium errors and bad blocks are common in controller logs, and with each patrol read the controller tries to correct those medium errors itself. If a block is unrecoverable, the controller tries to relocate its data to a new location. When the number of unrecoverable bad blocks exceeds a certain threshold, the controller first marks that drive as “Predictive Failure” and eventually sets it to failed.

When John replaced disk 0 with a new one, the rebuild started. During the rebuild, the data and parity on drives 1 and 2 are combined to reconstruct the data on drive 0. Because drive 1 had unreadable blocks, the affected stripes could not be reconstructed, and the bad logical blocks from drive 1 were effectively propagated to disks 0 and 2. After the rebuild completed, John had three drives with bad blocks instead of one.
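A toy model can make this concrete (a hypothetical simplification: one fixed parity drive, one block per stripe per drive, and `None` standing in for an unreadable block). A stripe whose source block cannot be read cannot be reconstructed, so the new drive ends up with that stripe marked bad too — the puncture:

```python
# Toy model of a RAID 5 rebuild that creates a puncture. This is a
# simplified sketch, not how a real controller is implemented.

def rebuild_drive0(drive1, parity):
    """Rebuild the replaced drive 0, stripe by stripe.

    None represents an unreadable (bad) block. If a source block cannot
    be read, the stripe cannot be reconstructed, so it is marked bad on
    the new drive as well -- the stripe is "punctured".
    """
    new_drive0 = []
    for d1_block, p_block in zip(drive1, parity):
        if d1_block is None or p_block is None:
            new_drive0.append(None)          # punctured stripe
        else:
            new_drive0.append(d1_block ^ p_block)
    return new_drive0

# drive 0 originally held [1, 2, 3] but has failed and been replaced.
drive1 = [4, None, 6]            # stripe 1 has a medium error
parity = [1 ^ 4, 2 ^ 5, 3 ^ 6]   # parity = d0 XOR d1 (d1 stripe 1 was 5)

rebuilt = rebuild_drive0(drive1, parity)
print(rebuilt)  # [1, None, 3] -- stripe 1 is now lost on the new drive too
```

Stripes 0 and 2 come back intact, but stripe 1 is permanently lost: it is bad on drive 1 and now bad on the new drive 0 as well.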

So if we look back at the very first day, when John’s server lost one drive, the status of the drives was:
Drive 0 : Failed
Drive 1 : Online but with lots of medium error (Predictive Failure)
Drive 2 : Online
From the status above, it is clear that the RAID array had two drives that were about to die, so it was basically time to start everything from scratch anyway. It should never have reached this point in the first place if John had been monitoring the server properly.

Precautionary measures:

Always keep the firmware on the drives and the controller up to date

Make sure Patrol Read is turned on

Periodically run a consistency check on the virtual disks

Always review controller logs before replacing a drive

Does that mean that all the three hard drives are bad?
No. The drives are not physically bad; it’s the logical blocks on the drives that are marked as bad.

How to remove a punctured stripe or a punctured array?

The best way to get rid of a punctured array is to delete the configuration from the drives and the controller and start everything from scratch, but with precautions. If not handled properly, this problem can come back to haunt you later with the same level of destruction.

Note 1: Before following the steps mentioned below, back up all your data if your OS is still accessible.

Note 2: For the initial configuration we are assuming the server has 3 drives in a RAID 5 configuration.

Step 1: In a puncture scenario, the first thing you need to find is the drive that has the bad logical blocks and is the root cause of the puncture (in the scenario above, the drive in slot 1). The best way to find that information is by going through the controller logs. On an LSI-based controller, the logs can be gathered using the MegaCli utility on a Linux box. If you are running a Dell server, the DSET utility is another option; it saves the controller logs in: data\dell\RAID Controllers\Controller0.log

Here is a list of keywords and their output from an actual controller log:

Note: On older PERC controllers (PERC 5/i and PERC 6/i running older firmware), the controller logs will sometimes show no relevant output when searching for the keyword Punctur, even if there is a puncture, because the older firmware is not smart enough to recognize one. In that situation, the best way to find a puncture is by analyzing the LBAs (logical block addresses): if multiple drives report a bad block at the same LBA, that indicates a puncture.
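That cross-drive LBA check can be scripted. The log lines below are made up for illustration — the exact format of PERC/MegaCli event logs varies by controller and firmware — but the idea is the same: group medium errors by LBA and flag any LBA reported by more than one drive:

```python
import re
from collections import defaultdict

# Hypothetical log excerpt for illustration only; real controller logs
# use a different (firmware-dependent) format.
log = """
T5: Medium Error on PD 00(e0x20/s0) at LBA 0x1a2b3c
T6: Medium Error on PD 01(e0x20/s1) at LBA 0x1a2b3c
T7: Medium Error on PD 01(e0x20/s1) at LBA 0x99aa00
"""

pattern = re.compile(r"PD (\d+).*LBA (0x[0-9a-fA-F]+)")

# Collect which drives report a bad block at each LBA.
drives_per_lba = defaultdict(set)
for line in log.splitlines():
    m = pattern.search(line)
    if m:
        drive, lba = m.groups()
        drives_per_lba[lba].add(drive)

# The same LBA bad on more than one drive suggests a punctured stripe.
punctures = {lba: sorted(d) for lba, d in drives_per_lba.items() if len(d) > 1}
print(punctures)  # -> {'0x1a2b3c': ['00', '01']}
```

In this made-up excerpt, LBA 0x1a2b3c is reported bad on both PD 00 and PD 01, which is exactly the pattern that points to a puncture; 0x99aa00 appears on only one drive and is just an ordinary medium error.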

Step 2: After finding the culprit, boot the server into the controller utility (by pressing a specific key combination, usually prompted during POST; on a Dell PERC controller it is Ctrl+R). In the controller utility, select the first line in the VD Mgmt tab, press the F2 key, and select Reset Config.

Step 3: Once the configuration is cleared, replace the drive in slot 1 with a new drive. Verify that the new drive shows a Ready state in the PD Mgmt tab (to navigate to the next tab press Ctrl+N, and to the previous tab press Ctrl+P). After verifying that, go back to the VD Mgmt tab and create a RAID 0 on each of the hard drives using the following steps:

a. Press F2 on the very first line under VD Mgmt and select Create New VD.

b. In the next dialog box, select the options as shown in the figure:

c. Once you press OK, a warning will appear reporting that the newly created drive needs to be initialized. Acknowledge the warning, then on the next screen select the newly created virtual disk (in our case Virtual Disk 0) and press F2 > Initialization > Start Init.

Note: The step above performs a full initialization on the hard drive, clearing all the bad blocks that were copied during the rebuild process. Repeat the steps above for all three hard drives.

Step 4: Once the initialization completes, clear the configuration again, create a RAID 5 with the three hard drives, and do a full initialization as explained in the step above.

Step 5: Now clear the configuration again (Step 2) and change the physical locations of the drives: put drive 0 in slot 1, drive 1 in slot 2, and drive 2 in slot 0. Go back to the VD Mgmt tab, create a RAID 5, and do a full initialization as discussed in Step 3.

Step 6: Let the background initialization finish, then analyze the controller logs to verify that there are no medium errors on the same logical block across multiple hard drives.

There are several other situations in which a puncture can occur, but the one described above is the most common in my experience. So never replace a hard drive until you are sure that no other drive in the array has bad blocks.

FAQ

Can we get rid of this puncture in the same setup without losing the data?
The answer is no. We cannot replace two disks at the same time in a RAID 5 set.