This post is allegedly about surprising facts. But this fact should not surprise you. Still, some folks invest in RAID with the expectation that it removes the burden of backing up their files. Don’t make that mistake.

Instead, you should think about RAID as the first tier in your data protection strategy. When you consider which RAID configuration you might use, think about the coming day when the entire volume will be corrupted. I’m not talking about just one disk. I mean the whole shebang.

What’s your data recovery plan? How often are you willing to put up with the hassle of restoring your data? Some RAID implementations are aimed at making failure an infrequent event. Other configurations significantly increase the chance of failure.

Make sure you have a backup plan, then consider your RAID options in that context.

Fact 1 may not have been so surprising. Here are a few more that might actually surprise you:

2. Software RAID is almost always a better choice than hardware RAID

Software RAID has advanced significantly in the last few years (as of 2012). Hardware RAID still has the three key vulnerabilities it has always had: First, it is expensive. Second, if your RAID card fails, your RAID volume fails; it is a single point of failure. Third, if your RAID card fails, you must find an exact replacement for that card to recover your data.

On the other hand, software RAID costs nothing, and if your controller card or motherboard fails, you can just move your disks to another machine and set up the appropriate software to read them.

Yes, hardware RAID can be faster than software RAID, but that gap is closing, and the flexibility and reliability offered by software RAID outweigh that single advantage. The only case where hardware RAID is the right choice is when absolute speed is the only priority, and you’re willing to take risks with your data.

There are some articles on the web that compare hardware versus software RAID. They are good reading, but in some cases the information they contain is old. You should be sure to make your decisions on the state of the art. As of 2012 there are a number of new capabilities offered by software RAID that make it worth considering:

Hot swapping works with software RAID. SATA 3G and SATA 6G made that possible. If a disk goes bad, swap it out, no down time.

Software RAID consumes only a small slice of CPU cycles. In my tests with mdadm on Ubuntu I saw only 2% to 4% of one CPU dedicated to RAID. On a multi-core machine this is nothing.

Software RAID works with SSD caching. The most used data migrates to a very fast cache.

Software RAID supports variable size volumes that can be extended by adding more disks (specifically ZFS supports this, maybe others do as well).

3. Some “RAID cards” aren’t hardware RAID

Over the last few years SATA disk controller cards and motherboards have come out that claim to offer hardware RAID. They are really just disk controllers with BIOS that implements RAID in software.

How can you detect these cards and motherboards? Usually price is the giveaway. A $20.00 card is not likely to implement true hardware RAID. Also, these cards usually offer Windows-only support. Here’s a good writeup.

4. On-disk data compression can make your RAID volume faster

That seems counter-intuitive because it takes computing power and time to compress data. Here’s why it can make your disk performance faster: The bottleneck in disk IO is bandwidth to the disk (your SATA pipe). If the data is compressed before writing, there’s less of it to write, so it moves more quickly to the disk.

ZFS offers compressed volumes. I’m sure other software RAID implementations offer it as well. Here’s some discussion of this topic.
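
To get a rough feel for how much less data reaches the disk, here is a minimal Python sketch. It uses the standard zlib module purely as a stand-in (ZFS applies its own compression algorithms inside the filesystem), and the sample data is made up for illustration. Real gains depend on how compressible your data is and on having spare CPU.

```python
import zlib


def compressed_ratio(payload: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    if not payload:
        raise ValueError("payload must not be empty")
    return len(zlib.compress(payload, level)) / len(payload)


# Hypothetical sample: repetitive, log-like text, which compresses well.
text_block = b"2012-05-01 INFO request served in 12 ms\n" * 1000

ratio = compressed_ratio(text_block)
print(f"bytes to write after compression: {ratio:.1%} of the original")
```

Already-compressed data (video, JPEG images, archives) will show a ratio close to 1.0, in which case compression costs CPU time without saving any disk bandwidth.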

5. Maybe you don’t need RAID at all

Consider an SSD instead. Yes, SSDs are expensive, but a single SSD is less expensive than the multiple disks you’d need to build a comparably fast RAID volume. For example, as of this writing (May 2012) an Intel 250GB SSD prices in at $350 and it’s faster than many RAID configurations built with spinning disks. See one of my other posts for details on SSD speeds compared with RAID on SATA 3G.

SSDs can also be used as cache for a RAID volume. For our next server I’m contemplating a two disk mirror for reliability, augmented with an SSD cache for speed. This can be done easily with ZFS.

And yes, for ultra crazy speed, you can build a RAID volume out of SSDs.

6. The hottest new RAID tech comes from Oracle, and it is open source!

Nick Black of sprezzatech.com pointed me towards ZFS. I’ve looked at it deeply and decided it’s the way to go for our data. ZFS’s designers prioritized reliability and scalability, and it beats nearly all existing filesystems and RAID implementations on those points. ZFS was built by Sun Microsystems for their Solaris OS. They released it under an open source license and it is now available for Windows, Mac OS and Linux.

Oracle acquired ZFS through their acquisition of Sun. ZFS’s features are touted by Oracle for their hardware solutions. I’ll bet Oracle hates that Sun open sourced ZFS, but that’s a story for another blog.

Stay tuned for a blog from me on ZFS. For now, some rules of thumb for choosing RAID levels:

7. If speed is the only priority, choose RAID0

In RAID0 the data is split or “striped” across the multiple disks and written (or read) in parallel. With N disks, the speedup is up to N times for reading and N times for writing. Here’s the downside though: total failure of your RAID volume is roughly N times more likely, because losing any one disk loses the whole volume. You should assume that RAID volume failure is an absolute certainty.

Choose RAID0 only if you can easily rebuild your RAID volume. Make sure you have a strong backup workflow.
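
The “N times more likely” figure is an approximation that holds while the per-disk failure chance is small. A short Python sketch, assuming disks fail independently and using a made-up 3% chance of failure per disk over some period, shows the exact calculation:

```python
def raid0_failure_probability(disk_failure_prob: float, disks: int) -> float:
    """Chance that at least one of `disks` independent disks fails.

    A RAID0 volume is lost as soon as any single member disk is lost.
    """
    if not 0.0 <= disk_failure_prob <= 1.0:
        raise ValueError("disk_failure_prob must be between 0 and 1")
    if disks < 1:
        raise ValueError("disks must be at least 1")
    return 1.0 - (1.0 - disk_failure_prob) ** disks


# Hypothetical per-disk failure chance; substitute your own estimate.
ASSUMED_DISK_FAILURE_PROB = 0.03

for n in (1, 2, 4, 8):
    p = raid0_failure_probability(ASSUMED_DISK_FAILURE_PROB, n)
    print(f"{n} disk(s): {p:.1%} chance of losing the volume")
```

For small per-disk probabilities the result stays close to N times the single-disk figure, which is where the rule of thumb comes from.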

8. If reliability is the only priority, choose RAID1

RAID1 is called “mirroring.” The data is fully duplicated on two (or more) disks. If one disk fails everything is OK; the RAID volume will continue operating, and it can be rebuilt when you replace the defective disk. RAID1 also offers N-times speed up for reads, but no speed up for writes.

9. In nearly all other cases, RAID10 is the way to go

In RAID10, pairs of disks are mirrored to create reliable volumes (RAID1), then those reliable volumes are combined via RAID0 for speed. Four disks combined in this way offer 2 times speedup over a single disk for reads and writes (reads can do better, since either disk of a mirror can serve them), yet they can also sustain loss of two disks and still operate, as long as the two disks are not from the same mirrored pair. You get both speed and reliability.
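
Whether a four-disk RAID10 survives a second failure depends on which disks fail. A minimal Python sketch, assuming disks 0 and 1 form one mirror and disks 2 and 3 the other (the numbering is hypothetical), enumerates every possible two-disk failure:

```python
from itertools import combinations


def raid10_survives(failed: set, mirror_pairs: list) -> bool:
    """A RAID10 volume survives while every mirror keeps at least one disk."""
    return all(not (a in failed and b in failed) for a, b in mirror_pairs)


# Assumed layout: two mirrored pairs striped together.
mirror_pairs = [(0, 1), (2, 3)]
disks = range(4)

outcomes = {
    pair: raid10_survives(set(pair), mirror_pairs)
    for pair in combinations(disks, 2)
}
survivable = sum(outcomes.values())

for pair, ok in outcomes.items():
    print(f"disks {pair} fail: {'volume survives' if ok else 'volume lost'}")
print(f"{survivable} of {len(outcomes)} two-disk failures are survivable")
```

The two fatal cases are the ones where both halves of the same mirror are lost.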

Many folks would consider RAID5 for these applications. I think it’s a bad choice nowadays in comparison to RAID10 because RAID5 is subject to very slow write speeds; sometimes slower than writing to a single disk. See my tests here. Also, RAID5 can only survive the loss of one disk. In a 4 disk setup RAID10 can sustain two disk losses, provided they hit different mirrors.

The main advantage of RAID5 is that it offers more total storage than RAID10. But the speed and reliability of RAID10 more than offset that advantage.
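
The storage trade-off is simple arithmetic. This Python sketch, which assumes equal-sized disks and uses a hypothetical helper name, counts how many disks’ worth of space each level leaves for data:

```python
def usable_disks(level: str, disks: int) -> int:
    """Disks' worth of usable space for an array of equal-sized disks."""
    if level == "raid0":
        if disks < 2:
            raise ValueError("RAID0 needs at least 2 disks")
        return disks              # no redundancy: all space is usable
    if level == "raid1":
        if disks < 2:
            raise ValueError("RAID1 needs at least 2 disks")
        return 1                  # every disk holds the same data
    if level == "raid5":
        if disks < 3:
            raise ValueError("RAID5 needs at least 3 disks")
        return disks - 1          # one disk's worth of parity
    if level == "raid6":
        if disks < 4:
            raise ValueError("RAID6 needs at least 4 disks")
        return disks - 2          # two disks' worth of parity
    if level == "raid10":
        if disks < 4 or disks % 2:
            raise ValueError("RAID10 needs an even number of disks, 4 or more")
        return disks // 2         # half the disks are mirror copies
    raise ValueError(f"unknown RAID level: {level}")


for level in ("raid0", "raid5", "raid6", "raid10"):
    print(f"{level} with 4 disks: {usable_disks(level, 4)} disks of usable space")
```

With four disks, RAID5 keeps three disks of space while RAID10 keeps two, which is the storage advantage mentioned above.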

What about the other RAID levels? The three I mention above cover 99% of RAID use cases. The situations in which other RAID levels are useful are limited. If you want to dive in though, here’s a good starting point.

10. And the tenth surprising fact about RAID: There are only 9 surprising facts about RAID!

Tucker, I assume you’ve found the ashift property for zpools? That’s critical for good performance on 4k drives; it can usually be detected, but not always. Also, you can only set it upon zpool creation. It’s thus nice to learn about that sooner rather than later.

Be sure you’re doing a scrub operation periodically. The list seems to go with a month (this is also mdadm’s default data check period on Debian), though I take my usual leave-cron-to-the-astrologers stance and do one for every 2^40 bytes written. This is based off practical realities of my write loads and up-to-date research on silent corruption.

Honor the 4G+1G rule: 4G RAM to use ZFS, plus 1G per TB that’s using compression or deduplication.

I’m sure you know that ZFS wants raw block devices rather than (either GPT or MBR) partitions. This is so that it can safely turn on write-caching. Recent Linuxes, however, seem to be turning the write cache on for SATA disks by default (hdparm -W), so you won’t see a performance win here (you will, of course, be able to *safely* use said write cache, unlike in the ext/mdraid case). Also, this way you don’t run the risk of suboptimal (read: misaligned) partitioning. Partition tables are for suckers and little old women.

I’ve been turning on the read-write-verify flag on disks which support it (hdparm -R), and haven’t noticed any problems, though I’ve not yet done any rigorous testing and my day-to-day disk loads are pretty forgiving.

Growlight (http://nick-black.com/dankwiki/index.php/Growlight) shows you the status of these flags, and will warn you of any suboptimal partitioning, ZFS use, ashift settings, W$ settings (whether the issue be performance or safety). It will also, of course, create only optimal aggregations (unless you curse at it and demand stupid things via arguments against which the man page cautions). And it’s pretty to boot!

“Software RAID works with SSD caching. The most used data migrates to a very fast cache.”
Can you point me to a link with a good explanation of how to build that? I love software RAID, I just need SSD caching to be fulfilled…

It is worth emphasising that in a four-disk configuration RAID10 can only recover from a two-disk failure about two-thirds of the time (4 of the 6 possible pairs of failed disks). If two drives in the same RAID1 mirror die then it is game over.

By comparison, a four disk RAID6 array will provide the same amount of disk space as a RAID10 array but will be able to cope with the loss of /any/ two drives. This is not to say that it is a superior solution — the write performance of RAID6 is markedly worse — but one that is worthy of consideration.

It is also worth highlighting that RAID10 can provide four times the read performance of a single drive. This is because read operations can be distributed between the two drives in each of the RAID1 arrays. (Similarly, a two disk RAID1 volume will yield twice the read performance of a single drive.)

Finally, for sequential writes one can expect RAID0 to provide ~N times the performance of a single drive.

As an aside I encourage people to think of the redundancy provided by some RAID levels as /a means of minimizing downtime/. Nothing more, cf. point 1.

Interesting points you make about RAID. However, in real world scenarios, you’ve got some flaws in your recommendations. As Freddie stated previously, there are other RAID levels available. Choosing a RAID level is not absolute; there are different RAID levels to meet different needs. While it’s true that RAID0 is for speed and RAID1 is for reliability, that only applies to a 2 disk array. When you move beyond 2 disks, there are other options available that make more sense. RAID10 is not the best way to go in almost all circumstances since it comes at a huge space and price cost: it’s a 50% loss in space, so you have to buy twice the raw disk capacity to get your desired space.

And with regards to software RAID vs hardware RAID, in a production environment, I would never ever consider software RAID for tier1 level storage. I don’t care how good it looks on paper or what the numbers say, it will never be as reliable or stable as hardware RAID. Your point about SSD caching is moot, as hardware RAID also has that available with on-board cache cards.

I’m familiar with a number of real world scenarios that use exactly this approach.

Your statement suggesting that “software RAID will never be as reliable as hardware RAID” is odd and not compelling because it is identical to suggesting that “on motherboard SATA ports will never be as reliable as hardware RAID cards.”

“If you think hardware RAID is better, wait until the controller card dies…”

There, fixed it for you. The issue with power failures is moot. All mission critical systems (at least in my experience) have dual PSUs both of which are connected to (usually different) UPS units. Even in the non-redundant case you are just as likely to experience a failure of a battery backed RAID card as you are a PSU/UPS.

On a more fundamental level all applications which care about data integrity (read: databases) will take care to ensure that data is flushed to disk to ensure integrity. At this point data caches become a moot point.

So your controller card dies, what do you do? You replace it, with either an identical model or one from the same line. Doesn’t even need to be the same series most of the time. Hardware RAID card manufactures like LSI make sure almost all of their RAID cards are of the same specification. Generally speaking, and with few exceptions, you can replace any LSI RAID card with any other LSI RAID card with the same or greater number of connectivity ports…

Only downside to hardware RAID is cost. And it’s a small price to pay for better .

I would never use software RAID for any mission critical server. It adds a layer of vulnerability that a hardware array doesn’t. For all the servers I have built and managed, I have never lost a RAID array to either a failed card or a failed drive. If the card fails there is info on the drives that can reconstitute the RAID array. As for RAID level, the best is 5 or 6, and you are not giving up that much speed for the protection of the data. Also, software always needs to be patched and upgraded, as with SAN devices, where I had to worry every time I updated the OS. I will always favor hardware over software.

Thanks James for the comment. I think you and your “hardware raid” compatriots are a dying breed. Either that or you’re talking about Windows servers, in which case, yes hardware is always more reliable than software.

I personally prefer hardware raid for convenience in auto deployment. Since hardware RAID will display the device as a single drive, you are hardware independent in your kickstarts. Let’s take that situation:

Now, the first two drives must be partitioned for the OS (/boot, swap, /, etc.) while the second contains the sensitive data that must NOT be touched during installation (we talk here about 33 TB).

In software RAID, I pretty much need the same hardware to test the kickstart. Why? Because I need to determine by hand which drive to use for the OS and how to partition. I also need to know beforehand which drives should not be touched and make sure that they don’t change. So I have a catch 22: either 1) I test my kickstart on a non-representative machine and change it before using it in prod (dangerous) or 2) I reproduce the exact same machine in development to make sure it works perfectly (costly).

In hardware RAID, it’s easy: I configure my RAID and hardware presents me the virtual disk as sda (first mirror) and sdb (RAID6). I can easily reproduce in lab environment using only 2 drives as I know both will be presented the same to my kickstart: sda and sdb.

I’m not saying there is no way to tweak it and make it work in software RAID. But when you have that amount of data, the last thing you want is to lose it because of an “oops”.

* RAID10 is too “random” to really be considered for sensitive data IMO. The reason is simple: no matter the amount of drives, you can only safely sustain 1 drive failure. Worst case scenario is you lose the second drive of a mirroring set and your RAID just failed. YES, the chances are low (1/[n-f] where f=number of failed drives. if f>50%n, probability is 100%), but they are not null. RAID6 is a sure “3 strikes you’re out” scenario. And as of write/read speed, you give a little on read but write is faster if you don’t include the parity calculations (see your wiki link comparison). I do not have real life values as we chose RAID6 for reliability reason, not for performance, and didn’t take the time to do tests. For the 4 drives situation (easy to compare):

“Thanks James for the comment. I think you and your “hardware raid” compatriots are a dying breed. Either that or you’re talking about Windows servers, in which case, yes hardware is always more reliable than software.”

Nice article. Reinforced my assumptions about hardware/software raid. With software raid there is always the possibility to mount the harddisks on another machine if the hardware dies. Good luck with a hardware raid controller failure – if you don’t have the exact same hardware raid controller available you will have to consider some serious downtime or even complete data loss.

But however your view on hardware/software-raid may be, the worst of the two worlds are clearly combined by windows “fakeraid” devices. Horrible idea to use this stuff at all!

Please tell me how to get a safe 1GB RAM BBWC using software RAID. I rely on that to handle high bursts of random IO. That’s my main reason to use hardware RAID, a safe cache. The only alternative is to have software RAID setup with a mirrored SSD cache (which is slower than a RAM based cache).

Power failing is a far more frequent event than RAID card failure (which are solid state). For instance, Sparky gets in the datacenter and accidentally pulls A and B circuits. Oops. Or the last install accidentally mislabeled a server, so someone yanks it and powers off the wrong server. These oops are more likely to happen than a RAID card exploding.

I agree that there is not an exact match in software RAID for Battery Backed Write Cache (BBWC). ZFS offers something close, namely the ZIL, which is essentially a write cache on SSD. It also supports SSD for read cache.

Right, and write cache on SSD means you should have a mirrored SSD array for write caching, and it’s still not remotely on the same level as RAM. So the assertion that traditional RAID controllers are on the way out is … just out of touch with reality. Not to mention there’s no officially supported ZFS module (thanks to Sun’s licensing) for Linux or Windows.

Proxmox runs fine on Linux software RAID. Some guys at Proxmox just don’t recommend it, but *not* because there are problems with it. If you look around the forum, there are a lot of admins happily using this kind of setup. I’m one of them.

I’ve been using Windows 2008 R2 and Windows 2012 software RAID 0, just cause it’s a test box AND I WANT SPEED…
But I’ve noticed something funny: after a minute of moving a 500GB VHD, one of the disks will jump to 100% and the other drops to 2%, and the transfer rate dives to 2 MB… then after a few seconds it comes back up. VERY STRANGE.
Didn’t have this issue with the same drives on a HP 410 hardware raid. Just saying.
(if you know why windows is doing this please let me know)
James

Article author is either misinformed or has some personal vendetta against RAID hardware manufacturers…

– For RAID 0/1 (speed/reliability), Software RAID is fine.
– RAID 1+0, besides being costly, cannot sustain multiple drive failures from the same RAID subgroup. RAID 5/6 are much better options when you need both speed increase and the ability to sustain failures of ANY one/two drive(s) across the array.
– RAID 5/6 read/write speeds improve greatly when you add more drives. Speed difference between R1+0 and R5/6 much less an issue.
– Software RAID 5/6 is terrible, don’t waste time with it. Parity being calculated on a dedicated processor on a hardware RAID card is much more efficient than wasting processor cycles on a general CPU. That CPU was likely specifically purchased with those specs for processing other tasks on the server.

I do admit it is cheaper to replace a motherboard than a RAID card but unless everyone’s only concern is reliability, hardware RAID is here to stay. If software RAID was really that much better than hardware RAID, Adaptec/PMC/LSI/Areca/etc would all be bankrupt by now.

Still sitting on the fence with software raid, zfs certainly seems impressive but still has licensing issues on linux.
I have been building servers with both software and hardware RAID and have used LSI, Adaptec, Areca, Dell, HP, Intel and 3ware RAID cards over the years. Every single brand has failed at some point, and almost every failure has been recoverable except for one. I have had several very messy software RAID failures too.

Power failures can occur even with UPS systems. Human error, plugs work their way out or are not properly replaced after maintenance. ATS switches can fail, network problems mean systems don’t shut down when the UPS is low, etc. All of them can probably be blamed on bad planning or design or human error, but they still occur. Copy-on-write filesystems such as ZFS and btrfs are making it safer to use software RAID compared with BBU-supported RAID cards. As ZFS has license issues I am waiting for btrfs to mature before using software RAID everywhere.

A major advantage of software RAID is that it is possible to rebuild software RAID systems while ignoring errors during the rebuild. I might be wrong, but I still haven’t seen a RAID card that will continue a rebuild after an error has been detected during the rebuild. This is an important issue with arrays using 2TB+ drives. Even with RAID6 the chance of a secondary error is significant when there are more than 8 drives in the array. Using 3 parity disks and other solutions are being discussed. I would be willing to rebuild an array with a few errors and restore the damaged files from backup rather than have to restore the whole thing. Downtime restoring large volumes is bad.
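
To put a rough number on the chance of hitting a read error during a rebuild, here is a back-of-the-envelope Python sketch. The error rate of one unrecoverable read error per 10^14 bits is an assumption taken from typical consumer drive spec sheets, not a measurement, and the model treats every bit read as independent, which real drives do not strictly obey.

```python
import math

# Assumed spec-sheet figure: one unrecoverable read error per 1e14 bits read.
ASSUMED_URE_PER_BIT = 1e-14


def read_error_probability(terabytes_read: float,
                           ure_per_bit: float = ASSUMED_URE_PER_BIT) -> float:
    """Chance of at least one unrecoverable read error while reading the data."""
    if terabytes_read < 0:
        raise ValueError("terabytes_read must not be negative")
    bits = terabytes_read * 1e12 * 8
    # 1 - (1 - r)**bits, computed with log1p/expm1 to avoid rounding loss.
    return -math.expm1(bits * math.log1p(-ure_per_bit))


# Hypothetical rebuild: 7 surviving 2 TB drives must be read in full.
surviving_drives = 7
drive_size_tb = 2.0
p = read_error_probability(surviving_drives * drive_size_tb)
print(f"chance of at least one read error during the rebuild: {p:.0%}")
```

Under these assumptions the chance grows quickly with array size, which is why a rebuild that can skip a bad sector and carry on is so attractive for large arrays.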

HW RAID is also much more expensive and I have not found any brand really reliable. We are obliged to keep spare cards of the same model on hand in case of failure, as cards get older it can be hard to find a compatible replacement model if we don’t have spares.

To put all that simply, I am waiting for a mature software RAID aware filesystem that is not ZFS and after 20 years of using hardware RAID its reliability has been pretty crap.

Very productive discussion found here. What do you think about these issues:

(1) Corrupted sectors become critical once a RAID array loses all its redundancy, i.e. loses two drives in RAID6 or one in RAID5, 1 or 1+0. I have observed hundreds of cases where the rebuild eventually failed due to corrupted sectors on one of the disks in the array. Proactively fixing corrupted sectors is very helpful for avoiding this scenario. In hardware RAID, at least 3ware/LSI controllers have such a function, called “auto verification” (3ware) and “read patrol”. I couldn’t find a similar feature in software RAID. I would like to learn if there is one.

(2) Time Limited Error Recovery (TLER): hardware RAID requires it, so that each disk gives up quickly on a read error and the controller reads the data from the other disk(s) instead. Taking advantage of it requires a TLER-capable drive, such as an enterprise or NAS drive (more cost). As far as I know, software RAID has no TLER requirement, so read errors are handled by the individual drives, which often go into a deep recovery mode lasting minutes, and the array often becomes degraded. Does anyone know how to avoid this?

It would be great if software raid can address these issues. Maybe, there are and I just don’t know. I appreciate your reading and consideration in advance.

Thanks for an informative article.
I don’t want to start a fight, but I got a little tickled by the hardware raid folks. It’s all software raid (what do you think runs on the little controller card). The question is: is the hardware raid accelerator needed, and what benefit does it provide – or – have the CPUs gotten fast enough and primary memory big enough that the hardware accelerator is a moot point?