This post started life earlier this year as a piece on the death of RAID-5 being signaled by the arrival of 3TB drives. The point was that you can’t afford to be exposed to a second drive failure for 2 or 3 whole days, especially given the stress those drives are under during the rebuild period.

But the more I thought about RAID rebuild times the more I realized how little I actually knew about it and how little most other people know about it. I realized that what I knew was based a little too much on snippets of data, unreliable sources and too many assumptions and extrapolations. Everybody thinks they know something about disk rebuilds, but most people don’t really know much about it at all and thinking you know something is worse than knowing you don’t.

Anyway, you’d think that the folks who should know the real answers would be operational IT staff, who watch rebuilds nervously to make sure their systems stay up, and maybe vendor lab staff, who you would think might get the time and resources to test these things. But I have found it surprisingly hard to find any systematic information.

I plan to add to this post as information comes to hand (new content in green) but let’s examine what I have been able to find so far:

Netapp points out that there are many variables to consider, including the setting of raid.reconstruct.perf_impact at either low, medium or high, and they warn that a single reconstruction effectively doubles the I/O occurring on the stack/loop, which becomes a problem when the baseline workload is more than 50%.
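To put rough numbers on Netapp’s warning, here is a minimal sketch in Python. It assumes baseline load can be expressed as a fraction of what the stack/loop can carry and that one reconstruction simply doubles it, which is my simplification of Netapp’s statement. The function names are made up for illustration and have nothing to do with any Netapp tooling.

```python
def utilisation_during_reconstruct(baseline, multiplier=2.0):
    """Estimated stack/loop utilisation while one reconstruction runs.

    baseline   -- normal workload as a fraction of loop capacity (0.0 to 1.0)
    multiplier -- assumed I/O amplification; 2.0 reflects the "effectively
                  doubles" statement and is a simplification, not a measurement
    """
    if baseline < 0:
        raise ValueError("baseline must not be negative")
    return baseline * multiplier


def is_saturated(baseline, multiplier=2.0):
    """True if the estimated utilisation exceeds 100% of loop capacity."""
    return utilisation_during_reconstruct(baseline, multiplier) > 1.0


if __name__ == "__main__":
    for baseline in (0.30, 0.50, 0.60):
        load = utilisation_during_reconstruct(baseline)
        print(f"baseline {baseline:.0%} -> {load:.0%} during rebuild, "
              f"saturated: {is_saturated(baseline)}")
```

Anything above a 50% baseline tips over 100% in this model, which is exactly the threshold Netapp warns about. In practice the perf_impact setting would throttle the rebuild rather than let the loop saturate, so read this as a headroom check only.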

Netapp also says that rebuild times of 10-15 hours are normal for 500GB drives, and 10-30 hours for 1TB drives.

I’m not sure how we project this onto larger drive sizes without more lab data. In these two examples there was little difference between N Series 14+2 146GB and DS5000 14+2 300GB, but the common belief is that rebuild times rise in proportion to drive size. The 2008 Hitachi whitepaper “Why Growing Businesses Need RAID 6 Storage”, however, mentions a minimum of 24 hours for a rebuild of an array with just 11 x 1TB drives in it on an otherwise idle disk system.

What both IBM and Netapp seem to advise is that rebuild time is fairly flat until you get above 16 drives, although Netapp seems to be increasingly comfortable with larger RAID sets as well.

3. A 2008 post from Tony Pearson suggests that “In a typical RAID environment, say 7+P RAID-5, you might have to read 7 drives to rebuild one drive, and in the case of a 14+2 RAID-6, reading 15 drives to rebuild one drive. It turns out the performance bottleneck is the one drive to write, and today’s systems can rebuild faster Fibre Channel (FC) drives at about 50-55 MB/sec, and slower ATA disk at around 40-42 MB/sec. At these rates, a 750GB SATA rebuild would take at least 5 hours.”

Extrapolating from that would suggest that a RAID5 1TB rebuild is going to take at least 9 hours, 2TB 18 hours, and 3TB 27 hours. The Hitachi whitepaper figure seems to be a high outlier, perhaps dependent on something specific to the Hitachi USP architecture.
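The arithmetic behind this kind of extrapolation is simple enough to sketch. The Python below assumes, as Tony does, that the write rate to the one replacement drive is the only bottleneck, that the whole drive is rewritten, and that drive sizes are decimal (1GB = 1000MB). The helper name rebuild_hours is mine, purely for illustration. Note that straight linear scaling of Tony’s 40-42 MB/sec gives about 6.6 to 6.9 hours per TB, so the 9, 18 and 27 hour figures above correspond to an effective rate of about 31 MB/sec.

```python
def rebuild_hours(capacity_gb, write_mb_per_sec):
    """Minimum rebuild time in hours for one drive.

    Assumes the sustained write rate to the replacement drive is the only
    bottleneck and the full capacity is rewritten (decimal units).
    """
    if write_mb_per_sec <= 0:
        raise ValueError("write rate must be positive")
    seconds = capacity_gb * 1000 / write_mb_per_sec
    return seconds / 3600


if __name__ == "__main__":
    # 42 MB/sec is the top of the quoted ATA range; 31 MB/sec is the rate
    # implied by the "9 hours per TB" extrapolation.
    for rate in (42, 31):
        for capacity_gb in (750, 1000, 2000, 3000):
            hours = rebuild_hours(capacity_gb, rate)
            print(f"{capacity_gb} GB at {rate} MB/sec: {hours:.1f} hours")
```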

Tony does point out that his explanation is a deliberate over-simplification for the purposes of accessibility; perhaps that’s why it doesn’t explain why there might be step increases in drive rebuild times at 8 and 16 drives.

4. The IBM DS8000 Performance Monitoring and Tuning redbook states “RAID 6 rebuild times are close to RAID 5 rebuild times (for the same size disk drive modules (DDMs)), because rebuild times are primarily limited by the achievable write throughput to the spare disk during data reconstruction.” and also “For array rebuilds, RAID 5, RAID 6, and RAID 10 require approximately the same elapsed time, although RAID 5 and RAID 6 require significantly more disk operations and therefore are more likely to impact other disk activity on the same disk array.”

The below image just came to hand. It shows how the new predictive rebuilds feature on DS8000 can reduce rebuild times. Netapp do a similar thing I believe. Interesting that it does show a much higher rebuild rate than the 50MB/sec that is usually talked about.

5. The EMC whitepaper “The Effect of Priorities on LUN Management Operations” focuses on the effect of assigned priority, as one would expect, but is nonetheless very useful in helping to understand generic rebuild times (although it does contain a strange assertion that SATA drives rebuild faster than 10KRPM drives, which I assume must be a transposition error). Anyway, the doc broadly reinforces the data from IBM and Netapp, including this table.

This seems to show that the increase in rebuild times is more linear as the RAID sets get bigger, as compared to IBM’s data which showed steps at 8 and 16. One person with CX4 experience reported to me that you’d be lucky to get close to 30MB/sec on a RAID5 rebuild on a typical working system, and that when a vault drive is rebuilding with priority set to ASAP not much else gets done on the system at all. It remains unclear to me how much of the vendor variation I am seeing is due to differences in reporting and level of detail versus architectural differences.

6. IBM SONAS 1.3 reports a rebuild time of only 9.8 hours for a 3TB drive RAID6 8+2 on an idle system, and 6.1 hours on a 2TB drive (down from 12 hours in SONAS 1.2). This change from 12 hours down to 6.1 comes simply from a code update, so I guess this highlights that not all constraints on rebuild are physical or vendor-generic.
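Turning those reported times back into an average rate makes them easier to compare with the 40-55 MB/sec figures quoted earlier. This is a rough sketch only: it assumes the full nominal capacity of the drive is rebuilt and uses decimal units, and the helper name is mine.

```python
def implied_mb_per_sec(capacity_gb, hours):
    """Average rebuild rate implied by a reported rebuild time.

    Assumes the whole nominal capacity was rebuilt (decimal units).
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    return capacity_gb * 1000 / (hours * 3600)


# (label, drive capacity in GB, reported rebuild hours) from the SONAS figures
REPORTS = [
    ("SONAS 1.3, 3TB", 3000, 9.8),
    ("SONAS 1.3, 2TB", 2000, 6.1),
    ("SONAS 1.2, 2TB", 2000, 12.0),
]

if __name__ == "__main__":
    for label, capacity_gb, hours in REPORTS:
        rate = implied_mb_per_sec(capacity_gb, hours)
        print(f"{label}: {rate:.0f} MB/sec")
```

On those assumptions the SONAS 1.3 figures work out at roughly 85 to 91 MB/sec, against about 46 MB/sec for SONAS 1.2.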

7. March 2012: I just found this pic from the IBM Advanced Technical Skills team in the US. This gives me the clearest measure yet of rebuild times on IBM’s Storwize V7000. Immediately obvious is that the Nearline drive rebuild times stretch out a lot when the target rebuild rate is limited so as to reduce host I/O impact, but the SAS and SSD drive rebuild times are pretty impressive. The table also came with a comment estimating that 600GB SAS drives would take twice the rebuild time of the 300GB SAS drives shown.

~

In 2006 Hu Yoshida posted that “it is time to replace 20 year old RAID architectures with something that does not impact I/O as much as it does today with our larger capacity disks. This is a challenge for our developers and researchers in Hitachi.”

I haven’t seen any sign of that from Hitachi, but IBM’s XIV RAID-X system is perhaps the kind of thing he was contemplating. RAID-X achieves re-protection rates of more than 1TB of actual data per hour and there is no real reason why other disk systems couldn’t implement the scattered RAID-X approach that XIV uses to bring a large number of drives into play on data rebuilds, where protection is about making another copy of data blocks as quickly as possible, not about drive substitution.
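To see why spreading the work matters, here is a back-of-envelope sketch. It assumes a 180-drive system with the re-copy load spread perfectly evenly across the 179 surviving drives. Both of those are my assumptions for illustration, not published XIV internals, and the function names are hypothetical.

```python
def aggregate_mb_per_sec(tb_per_hour):
    """Convert a re-protection rate in TB/hour to MB/sec (decimal units)."""
    return tb_per_hour * 1_000_000 / 3600


def per_drive_mb_per_sec(tb_per_hour, surviving_drives):
    """Each surviving drive's share of the copy rate, assuming an even spread."""
    if surviving_drives <= 0:
        raise ValueError("need at least one surviving drive")
    return aggregate_mb_per_sec(tb_per_hour) / surviving_drives


if __name__ == "__main__":
    total = aggregate_mb_per_sec(1.0)
    share = per_drive_mb_per_sec(1.0, 179)  # 180 drives minus the failed one
    print(f"aggregate: {total:.0f} MB/sec, per drive: {share:.2f} MB/sec")
```

Under those assumptions 1TB per hour is a little under 280 MB/sec in aggregate, yet only around 1.5 MB/sec of copy traffic per drive, compared with a single target drive being written at 40-55 MB/sec in a traditional rebuild.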

So that’s about as much as I know about RAID rebuilds. Please feel free to send me your own rebuild experiences and measurements if you have any.

9 Responses

Hmmm.. correct me if I’m wrong, but don’t shorter rebuild times on XIV come with a proportionally higher chance of data loss during the rebuild?
For a 180-disk XIV, data from each disk is spread across all the other disks, so failure of ANY of them during the short rebuild will mean data loss.
For the same configuration (180 drives in RAID10), during the longer rebuild, if another disk fails there is only a 1/179 chance that the newly failed disk is the mirror partner of the disk that failed first.

A standard RAID5/6/10 rebuild will take you 20 times as long on a traditional system as it will on XIV.

The I/O being done on XIV is shared by 180 drives, but in a traditional RAID10 it might all be to a single drive.

You have assumed that drives fail randomly, but drives are most likely to fail when under heavy load, so not only will it take 20 times as long on a traditional system, but during that time the single drive load is very high, especially if you consider normal daily I/O load might have been high already.

I don’t want to over-state the risks, but a 3 day rebuild time on 3TB drives is surely an uncomfortable place to be?
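Both sides of this exchange can be put into one crude model. The sketch below assumes drives fail independently at a constant rate, which is precisely the assumption I am questioning above, so treat it as the random-failure baseline only. The 3% annual failure rate and the 20-hour traditional rebuild are hypothetical placeholders. The 1/179 figure and the 20 times speed-up come from the comments above. The model also says nothing about how much data is lost, only whether any is.

```python
import math


def p_any_failure(drives, hours, afr):
    """Probability that at least one of `drives` fails within `hours`,
    assuming independent failures at a constant annual failure rate `afr`."""
    rate_per_hour = -math.log(1.0 - afr) / 8760.0
    return 1.0 - math.exp(-rate_per_hour * drives * hours)


def p_data_loss(drives_left, hours, afr, fatal_fraction):
    """Chance that a second failure occurs in the window AND hits a drive
    holding the only remaining copy of some data."""
    return p_any_failure(drives_left, hours, afr) * fatal_fraction


AFR = 0.03                # hypothetical annual failure rate per drive
TRADITIONAL_HOURS = 20.0  # hypothetical RAID10 rebuild window
SPEEDUP = 20.0            # "20 times as long" from the reply above

raid10 = p_data_loss(179, TRADITIONAL_HOURS, AFR, 1 / 179)
wide = p_data_loss(179, TRADITIONAL_HOURS / SPEEDUP, AFR, 1.0)

if __name__ == "__main__":
    print(f"RAID10:       {raid10:.6f}")
    print(f"wide-striped: {wide:.6f}")
    print(f"ratio:        {wide / raid10:.1f}")
```

With purely random failures the ratio comes out at about 9 (roughly 179 divided by 20) in favour of RAID10, which is the commenter’s point. My argument is that the constant-rate assumption is the weak link: if the failure rate rises on a drive that is being hammered for the whole rebuild window, the traditional figure gets worse and the gap narrows.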

I agree with you, and I prefer the XIV approach (generally I prefer wide-stripe arrays), but I wanted to point out that there are always some drawbacks: in XIV you have a shorter rebuild time, but with a higher chance of data loss. This is what XIV marketing always “forgot” to mention when they speak about rebuild times, but it is something we all have to be aware of :)

I don’t think it’s sound to just assume that there is a higher chance of data loss from double disk failure on XIV compared to traditional disk. The two approaches are so different and there are so many variables that sometimes it’s best just to rely on field data. One of our US-based XIV specialists pointed out to me yesterday that after four years in the field and thousands of systems deployed there has never been an incident of data loss from double disk failure on XIV. Let’s maybe just call it an honest mistake on the part of the vendors who promised calamity from double disk failure to anyone who went with XIV. Maybe, just maybe, the XIV designers knew what they were doing after all?

So, a few interesting items… (all based on conversations over the last few years, no hands on experience so take it as you wish).

#1. There was a Google study a few years back that showed very little correlation between drive activity and drive failure outside of the first year (which was termed infant mortality for drives). So, while it sounds logical that drives would fail significantly more often under load, the statistics don’t seem to back that up. (ref: http://labs.google.com/papers/disk_failures.pdf).

#2. Regarding XIV double drive failure: As more and more arrays embrace wide-striping across 100s of drives in an array, the XIV “risk” is becoming more common. The real data loss scenario under XIV, as I understand it, has more to do with the power subsystem and where metadata is stored (I’ve heard of at least 2 instances via twitter where the power supplies died, and with them, all the data).

#4. Prior to a “tipping” point, you’d need to lose 1 drive in an access node and 1 drive in a data node to incur data loss. Data isn’t distributed between access nodes, and until the access nodes hit a certain point, it’s my understanding that M0 gets placed on the access node that takes the incoming write IO and M1 gets distributed to a data node. After this “tipping point”, M0 and M1 get distributed among different data nodes. (ref: XIV redbook IIRC)

#5. You can lose an entire XIV node/shelf without data loss (but the data evacuation time is horrid, from what I understand).

#6. The amount of data theoretically lost during a dual drive event goes down the further along the rebuild it is.

1) A rebuild is not just drive activity – on a traditional disk system it’s drive hammering. I’m not sure Google was really testing that. I can imagine that a certain level of activity is actually good for a drive, it’s the extremes of inactivity and hammering that might cause a problem. So I’m not sure what level of activity they were plotting, and you imply that even the Google study showed a correlation in the first year (the first year is surely important?). I have seen a disproportionate number of drive failures on DS4000/LSI/Engenio/Netapp-E-Series during rebuilds, and have heard about the same effect on CLARiiON for example. Check out Hu Yoshida’s latest post. http://blogs.hds.com/hu/2012/01/why-raid-and-erasure-codes-need-to-be-considered-in-disk-purchases.html

2) Power loss on any disk system can cause an outage, which is temporary loss of access to data, but I don’t see how it would cause loss of data on the hard drives?

3) Other risks are fire, malice etc. XIV replication function is included with every XIV, no extra licensing required. If you want your data to be safe from risks like these then buy two and replicate.

4) Correct. No block is ever mirrored across two I/O nodes. Statistically I/O modules have a slight extra chance of failure since they have more complex code running on them and higher workload, so at least one of the locations for any given data chunk will be a data module.

5) Firstly there is no data evacuation. The data is already on other drives. It’s data re-copying that we do. I’d say the time to do it is rather impressive – we’re talking about 4 hours for a whole 12-drive module – much less time than it takes to rebuild a drive on most disk systems.

6) I’m told that there has never been a double drive failure causing data loss on XIV.

Fortunately, most double drive failures on an XIV would result in zero data loss, and the majority of the rest involved just a few GB of data that can be restored in less time than a typical RAID-5 double-drive failure incident. Here is my blog post that goes into full detail:

[…] I’ve been too busy to blog recently, but I have just paused long enough to add a significant update regarding IBM Storwize V7000 drive rebuild times to my earlier post on RAID rebuild times. Rather than start a new post I thought it best to keep it all together in the one place, so I have added it as point 7 in “Hu’s on first, Tony’s on second, I Don’t Know’s on third” […]