Overview
In two previous posts I demonstrated how to create a RAID 5 device with mdadm and format it with ext4. Using RAID 5 and ext4 together can increase the fault tolerance of your storage, but good administrative practice also calls for testing the failure process and becoming familiar with the recovery steps before a real failure occurs.

The steps that follow show how to verify that an array is healthy, simulate a failure, identify that a failure has occurred, and recover from it.

This guide was tested on a Debian 6.0 "Squeeze" system (Linux debian 2.6.32-5-amd64) with a RAID 5 array created from five 5GB disks, one of which is a hot spare, holding an ext4 filesystem as the primary partition. Root access is required for the steps in this guide.

The steps outlined below increase the risk of data loss. Do not run them on a production system without fully understanding the process and testing in a development environment.

These instructions are not meant to be exhaustive and may not be appropriate for your environment. Always check with your hardware and software vendors for the appropriate steps to manage your infrastructure.

Formatting:
Instructions and information are detailed in black font with no decoration.

Code and shell text are in black font, gray background, and a dashed border. Input is green.
Literal keys are enclosed in brackets such as [enter], [shift], and [ctrl+c].

The /proc/mdstat file shows that md0 is the active RAID 5 device composed of five devices: sdb, sdc, sdd, sde, and sdf. Device sdf is a hot spare. The end of the third line states that all four data disks are online and up, indicated by the [4/4] and [UUUU].
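On the test system described above, the check looks like this (the sample output in the comments is illustrative, reconstructed from the array layout; your device names, chunk size, and block counts will differ):

```shell
# Display the status of all active md devices
cat /proc/mdstat
# Personalities : [raid5]
# md0 : active raid5 sdf[4](S) sde[3] sdd[2] sdc[1] sdb[0]
#       15724032 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
```

The (S) suffix marks sdf as the spare.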

Output from the mdadm command provides similar details to mdstat. The first column is the device number, the second is the device's major number, the third its minor number, the fourth is the RAID device number, then the status and the device name. The first four disks are healthy because they are active and sync'd. The last disk is a hot spare and currently holds no data from the array.
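The same information is available from mdadm itself (the device path assumes the test system described above):

```shell
# Show detailed state of the array, including the per-device table
mdadm --detail /dev/md0
```

The device table at the end of the output contains the Number, Major, Minor, RaidDevice, and State columns described above.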

Everything shows green checkmarks, and with this information the array could be rebuilt in the event of a failure.

This mdadm command switches the tool to manage mode, specifies /dev/md0 as the md device, and marks the target device /dev/sdc as faulty.
The second line of output confirms the command has been processed.
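On a test array, the failure can be triggered like so (the comment shows mdadm's standard confirmation message):

```shell
# Switch to manage mode and mark /dev/sdc as faulty in /dev/md0
mdadm --manage /dev/md0 --fail /dev/sdc
# mdadm: set /dev/sdc faulty in /dev/md0
```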

The mdstat file shows that the sdc disk has failed and sdf is no longer marked as a spare. The [4/3] and the [U_UU] on the next line confirm the overall status: one of the disks is not fully sync'd, and it is the second disk that is marked as down. The next line is new and shows the recovery progress: currently 2.6% complete, with 140032 of 5241344 blocks rebuilt, an estimated 4.2 minutes to completion, and the I/O speed.
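The rebuild can be followed as it runs; a simple approach is to re-read mdstat every couple of seconds:

```shell
# Refresh the array status every 2 seconds until interrupted with [ctrl+c]
watch -n 2 cat /proc/mdstat
```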

mdadm --detail displays that the RAID disk 1 has been changed to /dev/sdf (it was /dev/sdc), and is currently rebuilding. The spare, which was /dev/sdf, has been changed to /dev/sdc and it is marked as faulty.

The kernel has reported to syslog that changes have occurred on md0. The first grouping shows that in the original configuration one device, dev:sdc, is no longer online. The second grouping shows the disk has been removed from the array, while the last group shows that dev:sdf has been added and is online. The final line shows that md is currently recovering the RAID array md0.
The timestamps are not shown (they make everything ugly). It is interesting to note that on my machine, the time from when the kernel detected the error to when recovery started was about one-hundredth of a second: 0.011.
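To review these messages yourself (on Debian Squeeze the kernel log lands in /var/log/syslog; the path may differ on other distributions):

```shell
# Pull the md-related kernel messages out of the system log
grep md0 /var/log/syslog
# The kernel ring buffer shows the same events without locating the log file
dmesg | grep md0
```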

Output from df and ls show that the filesystem is still mounted and usable.
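A quick spot check, assuming the array is mounted at /mnt/md0 (a hypothetical mount point; substitute your own):

```shell
# Verify the filesystem is still mounted and reporting free space
df -h /mnt/md0
# Verify files are still readable during the rebuild
ls -l /mnt/md0
```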

The array itself remains untouched in this step, but mdadm now has /dev/sdc designated as a hot spare in the event another disk fails.
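Once the rebuild completes, the failed disk is removed from the array and re-added; because the data disks are already in sync, it rejoins as a hot spare (after a real failure you would physically replace the disk before re-adding it):

```shell
# Remove the faulty disk from the array
mdadm --manage /dev/md0 --remove /dev/sdc
# Re-add it; with all data disks in sync it becomes the new hot spare
mdadm --manage /dev/md0 --add /dev/sdc
```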

Conclusion
If you performed the steps in this guide, you verified that a RAID array was healthy, broke it by manually failing a drive, monitored the rebuild process, and re-added the drive to the hot spare pool.

Testing and becoming familiar with the recovery process can help you respond in a level-headed manner and increase the likelihood of successful problem resolution. You will also see how long the process takes, so you will know whether something seems wrong when you have to perform these actions in a production environment.

Eric Wamsley
Howdy! I am a technology dude based in the USA. My goal is to combine data, technology, and people; then document the process here so we can all learn from my errors and maybe even get a smile or two.