How to Use MDADM Linux Raid
A highly resilient raid solution! By Red Squirrel

Dealing With Drive Failures
While today's drives are much faster and bigger than before, this comes at a cost: reliability. While drives may be more reliable on a per-MB basis, they are less reliable on a per-drive basis. While ANY drive can fail, newer ones are more likely to, especially when you have a lot of them, as you are simply increasing your chances of a single failure. Thankfully, with any raid level above 0 you can survive at least one failure. You just need to be ready to deal with it.

Unless you have a fancy disk controller and case where an LED goes red when a drive fails, it is very important to document your drives upon installation. Using the smartctl command you can find out the serial number of a particular drive. What you want to do is insert the drives one by one, run dmesg -c to find out the name the system gave the drive, then use smartctl -a /dev/[name] to get the serial number. Make a spreadsheet of the drive bays in your case, and enter the serial number as well as other info such as the size. You can even note which array each drive belongs to if there are multiple arrays, but the important piece of information is the serial number.
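As a sketch, the per-drive documentation steps look like this (both commands need root, and the device name /dev/sdb is an assumption; yours will differ):

```shell
# Clear the kernel ring buffer BEFORE inserting the new drive
dmesg -c

# After inserting the drive, the fresh messages show the name it was given (e.g. sdb)
dmesg

# Pull the serial number (plus model and size) for your spreadsheet
smartctl -a /dev/sdb | grep -i serial
```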

With alerting set up and your drives properly documented, you can have peace of mind that when a drive fails you will know about it, and you will be ready to deal with it. A drive failure email will look something like this:

Let it rebuild, and you will have a fully redundant array again. If another drive fails before a new drive is added and the rebuild is complete, you could lose all your data. There are several ways around this. One way to at least minimize this chance is to add a hot spare, which will cause the rebuild process to start right away if a drive fails, minimizing the "danger time". All you need to do is simply add another drive (as we just did) while the array already has enough working drives. It will automatically get picked up as a spare, and a rebuild will occur if a drive fails. Another solution is to use a raid level such as raid 6, which can survive 2 failures. Raid 10 can also work, but it cannot always survive 2 failures, as it depends on which drives go. If you are very paranoid, you can combine raid levels, such as a raid 61. Basically, this is two raid 6 arrays that are then combined into a raid 1. So think of the two raid 6 arrays as two hard drives, then take those two hard drives and make a raid 1. This is kind of wasteful though: you lose over half of all your drives' worth of space. You are better off sticking with just raid 6 and making sure you have good backups. If you want to play around though, you could try it.
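Adding a hot spare is just another --add while the array already has its full complement of disks; mdadm holds the extra drive as a spare automatically. A minimal sketch (the names /dev/md0 and /dev/sde1 are assumptions):

```shell
# Add a drive to an array that is already complete; mdadm keeps it as a
# hot spare and rebuilds onto it automatically when a member fails
mdadm --add /dev/md0 /dev/sde1

# Confirm the drive shows up flagged as a spare, e.g. "sde1[4](S)"
cat /proc/mdstat
```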

Let's just stick to raid 6. First we need to add another drive, then transform the array. We will not gain any disk space from this, but we will gain redundancy.
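A sketch of the conversion, assuming a 4-drive raid 5 on /dev/md0 and a new drive at /dev/sde1 (the device names and backup path are assumptions):

```shell
# Add the new drive first; it sits as a spare until the reshape claims it
mdadm --add /dev/md0 /dev/sde1

# Convert raid 5 -> raid 6 across 5 drives (requires mdadm 3.1+).
# The backup file must live somewhere OUTSIDE the array being reshaped.
mdadm --grow /dev/md0 --level=6 --raid-devices=5 --backup-file=/root/md0-reshape.bak
```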

Note the backup file argument. You will want to specify a location that is NOT on the array. I am honestly not sure how much space is really required for this backup file; to be safe I would just use the disk with the most free space. It will be deleted after the reshape is complete. You also need mdadm 3.1 or higher for this feature to work. It is rather new, and you should really back up your stuff before making this (or ANY) change to a raid array. Expect this reshape to take a while compared to a normal rebuild; I found it took about 2x as long. As far as I know, if a drive were to fail here, you would probably be safe, as it is still effectively a raid 5 during the reshape. But I must repeat: back up your data! Raid is not a backup! Here's what it will look like when done:
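The finished state can be inspected with the usual status commands (output omitted here; the exact layout varies by mdadm version, and /dev/md0 is an assumption):

```shell
# Quick one-line view of every array, its level, and its member drives
cat /proc/mdstat

# Full detail: level, state, and the role of each drive
mdadm --detail /dev/md0
```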

If for whatever reason you want to go from 6 to 5, you can also do that. You'll then end up with a hot spare.
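A sketch of the reverse conversion, under the same assumed names; the drive freed by dropping a parity level is left as a hot spare:

```shell
# Convert raid 6 back to raid 5; the now-surplus drive stays on as a hot spare
mdadm --grow /dev/md0 --level=5 --raid-devices=4 --backup-file=/root/md0-reshape.bak
```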

Growing an array

Another nice thing about mdadm is the ease of growing an array. Say you start off with 5 drives in a raid 6 and you are running out of space: no problem. Simply add another drive and grow the array. The new drive needs to be the same size as all the others, of course. You can also add multiple drives to grow it much bigger in one shot. Let's add 2GB to this array by adding 2 more drives.
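A sketch of growing the 5-drive raid 6 by two drives (device names and the backup path are assumptions; with 1GB members this adds 2GB of usable space):

```shell
# Add the two new drives; they start out as spares
mdadm --add /dev/md0 /dev/sdf1 /dev/sdg1

# Reshape the array so all 7 drives are active members
mdadm --grow /dev/md0 --raid-devices=7 --backup-file=/root/md0-grow.bak
```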

Now remember, a raid device is like a hard drive. Just because your hard drive is bigger does not mean your file system sees that extra space! We now need to expand the file system volume. This procedure has nothing to do with mdadm and depends on the file system. We will use the resize2fs command, which is fairly simple:
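Assuming an ext3/ext4 file system on /dev/md0 mounted at /mnt/raid (both names are assumptions), the grow can be done online; with no size argument, resize2fs expands the file system to fill the device:

```shell
# Grow the mounted file system to fill the newly enlarged raid device
resize2fs /dev/md0

# Confirm the file system now reports the extra space
df -h /mnt/raid
```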

There you have it: we made a 4-disk raid 5 array, dealt with a disk failure, converted to raid 6, and increased its size by 2GB. All this was done without ever bringing the file system offline. You could have VMs or any other data running on there during any of these procedures. These procedures could take place over any span of time, such as several years, and you never need to bring the array offline.

There is much more to explore, such as chunk sizes and other settings, but this article was only really meant to scratch the surface of mdadm. I have been using mdadm for over 5 years without any issues. My data has survived some pretty harsh crashes due to various hardware failures, hard power downs due to a failing UPS, and other such events. At one point I even lost 2 drives out of a raid 5, forcing the array offline, due to some unexplainable hardware issue. The drives had not actually failed; there was some other hardware problem. I was later able to bring both drives back into the array, rebuild, run a fsck on the file system, and short of VMs and other active files, all the data was fine. I restored the backups of the VMs and was ready to rock and roll. It is truly a resilient system, and I actually recommend it over hardware raid for mass data storage, simply because you are not relying on a specific controller, and you can even span arrays across multiple controllers.