For a long time I've heard that a large (>5TB?) RAID-5 array is a bad idea, simply because there's a high risk of another drive failing during the rebuild.

Has RAID-Z1 managed to remedy this for an array of any size (if you absolutely need a number consider 4x2TB or 5x2TB)? Maybe a safer way to re-replicate the data that isn't as intense on all the drives?

3 Answers

Even given what one of the other answers here laid out, namely that ZFS only works with actual used blocks and not empty space, yes, it is still dangerous to make a large RAIDZ1 vdev. Most pools end up at least 30-50% utilized, and many go right up to the recommended maximum of 80% (some go past it; I strongly recommend against that, for performance reasons), so the fact that ZFS deals only with used blocks is not a huge win.

Also, some of the other answers make it sound like a bad read is what causes the problem. This is not so. Bit rot inside a block is usually not what's going to hurt you here. What kills the pool is another disk flat out going bad while the resilver from the first failure is still running. On 3 TB disks in a large raidz1, resilvering onto a new disk can take days, even weeks, so the chance of that happening is not insignificant.
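To put a rough number on the "second disk dies mid-resilver" risk, here is a minimal sketch. It assumes independent disk failures at a constant rate, which likely understates the real risk, since disks from the same batch under resilver load do not fail independently. The function name and the 3% annual failure rate are hypothetical values chosen for illustration, not figures from this answer.

```python
import math

def p_second_failure(remaining_disks: int, resilver_days: float, afr: float) -> float:
    """Chance that at least one surviving disk fails during the resilver.

    Assumes independent failures at a constant rate derived from the
    annual failure rate (afr); this is a simplification.
    """
    # Convert the annual failure probability into a per-day failure rate.
    rate_per_day = -math.log1p(-afr) / 365.0
    # P(at least one failure) = 1 - exp(-rate * disks * days)
    return -math.expm1(-rate_per_day * remaining_disks * resilver_days)

# Hypothetical example: 5-disk raidz1 (4 survivors), assumed AFR of 3%.
for days in (1, 7, 21):
    print(f"{days:>2} day resilver: {p_second_failure(4, days, 0.03):.3%}")
```

The point of the sketch is the scaling: the risk grows with both the number of surviving disks and the length of the resilver window, which is why wide vdevs of large disks are the worst case.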

My personal recommendation to customers is to never use RAIDZ1 (RAID5 equivalent) at all with > 750 GB disks, ever, just to avoid a lot of potential unpleasantness. I've been OK with them breaking this rule because of other reasons (the system has a backup somewhere else, the data isn't that important, etc), but usually I do my best to push for RAIDZ2 as a minimum option with large disks.

Also, for a number of reasons, I usually recommend not going beyond 8-12 disks in a raidz2 stripe or 11-15 disks in a raidz3 stripe. You should be on the low end of those ranges with 3 TB disks, and could maybe be OK on the high end with 1 TB disks. Reducing the chance that more disks fail while a resilver is in progress is only one of those reasons, but a big one.

If you're looking for some sane rules of thumb (edit 04/10/15 - I wrote these rules with only spinning disks in mind - because they're also logical [why would you do less than 3 disks in a raidz1] they make some sense even for SSD pools, but all-SSD pools were not a thing in my head when I wrote these down):

With the changes in disk sizes and performance, would you still recommend the same rules of thumb? (2014)
–
Lord Loh. Aug 26 '14 at 22:13

Any source or motivation for the rules of thumb?
–
Kenny Evitt Dec 26 '14 at 0:55

The source is the experience of myself and coworkers across thousands of ZFS deployments at Nexenta. As for an update -- the rules stand (04/10/15); nothing has changed that makes me want to edit the bullet points, though I WOULD say I wrote those rules without SSDs in mind. The rules are not necessarily the same for SSDs, depending on circumstantial factors. With them you've got some other considerations, too, like HBA bottlenecking.
–
Nex7 Apr 10 at 22:48

@Nex7, what is the logic for this in your blog article? "8. RAIDZ - Even/Odd Disk Counts: Try (and not very hard) to keep the number of data disks in a raidz vdev to an even number"
–
Costin Gușă Jun 22 at 17:01

RAID-Z is aware of blank spots on the drives, where R5 is not. So RAID-Z only has to read the areas with data to recover the missing disk. Also, data isn't necessarily striped across all the disks. A very small file might reside on just a single disk, with the parity on another disk. Because of this, RAID-Z only has to read as much data as the space used on the array (if 1 MB is used on a 5 TB array, then a rebuild only needs to read 1 MB).

Going the other way, if most of a large array is full, then most of the data will need to be read off all the disks. Compare that to R1 or R10, where the data only needs to be pulled off exactly one disk per failed disk (if multiple disks fail, this holds only in situations where the array is still recoverable).

What you're worrying about is the fact that with every sector read there's a chance you'll hit a sector that wasn't written correctly or is no longer readable. For a typical drive these days the rated unrecoverable read error rate is around 1x10^-16 per bit read (not all drives are equal, so look up the specs on your drives to find their rating). This is incredibly infrequent, but it comes out to about one error per 1 PB read; for a 10 TB array, that is roughly a 1% chance that your array is toast and you won't know it until you try to recover it.
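The arithmetic behind those figures can be checked with a short sketch. It assumes the drive's rating is a per-bit error probability, that errors are independent, and that sizes are decimal (1 TB = 10^12 bytes); the function name is made up for illustration.

```python
import math

def p_unrecoverable_read(bytes_read: float, error_rate_per_bit: float = 1e-16) -> float:
    """Chance of at least one unrecoverable read error while reading bytes_read bytes.

    Assumes independent errors at the drive's rated per-bit error rate.
    """
    bits = bytes_read * 8
    # 1 - (1 - rate)^bits, computed with log1p/expm1 so tiny rates don't round to zero.
    return -math.expm1(bits * math.log1p(-error_rate_per_bit))

TB = 1e12  # decimal terabyte, as used on drive labels

print(f"10 TB at 1e-16: {p_unrecoverable_read(10 * TB):.2%}")          # about 0.8%
print(f"10 TB at 1e-14: {p_unrecoverable_read(10 * TB, 1e-14):.2%}")   # a drive rated two orders of magnitude worse
```

Swapping in the rating from your own drives' spec sheet changes the result by orders of magnitude, which is why it is worth looking up rather than assuming.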

ZFS also helps mitigate this risk, since most unreadable sectors can be noticed before you start trying to rebuild your array. If you scrub your ZFS array on a regular basis, the scrub operation will pick up these errors and work around them (or alert you so you can replace the disk, if that's how you roll). The usual recommendation is to scrub enterprise-grade disks about one to four times a month, and consumer-grade drives at least once a week.

+1 Also, the per-block checksums allow ZFS, should it find corruption in an array, to single out the affected files. Most R5 HBAs will simply mark the whole volume as corrupted, or report back to the OS that a sector is corrupted. Either way, the HBA has no way of knowing which disk is wrong in a corruption scenario.
–
Chris S Mar 14 '12 at 13:01