This article discusses how storage capacity keeps growing while performance and hard error rates have not improved significantly in years, and what that means for protecting data in large storage systems. "The concept of parity-based RAID (levels 3, 5 and 6) is now pretty old in technological terms, and the technology's limitations will become pretty clear in the not-too-distant future, and are probably obvious to some users already. In my opinion, RAID-6 is a reliability Band-Aid for RAID-5, and going from one parity drive to two is simply delaying the inevitable."

That only works if you have relatively small amounts of data that are not modified frequently. If you run a large operation with frequently changing data, the potential for failure increases significantly.

I am not a fan of cheap storage: what most people think they are saving, they end up paying for in management, performance and reliability. I don't care for SATA storage because I am not convinced it will work as reliably in the long run as SAS or Fibre Channel.

While the possibility of a massive failure is slight in most cases, it depends entirely on how the solution is deployed and what mechanisms are in place to protect the data. Having enough redundant storage to prevent data loss can get really expensive.

"It is not the end of the world. The mainframe solved this one decades ago."

You seem to have missed the point. If a drive fails in a RAID, it takes hours to rebuild the array, and the larger the drives, the longer the rebuild takes. A 1 TB drive might take 10 hours; 2 TB, maybe 24 hours; 4 TB, maybe two days; 8 TB, perhaps a week. That's because drives aren't getting much faster, only larger.
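A back-of-envelope sketch of that scaling (the 30 MB/s effective rebuild rate is an assumption for an array that is also serving production I/O, not a measured figure):

```python
# Rough rebuild-time model: time = capacity / effective rebuild rate.
# 30 MB/s is an ASSUMED effective rate for a rebuild competing with
# live traffic; sequential bursts are faster, but rebuilds rarely are.
def rebuild_hours(capacity_tb, rate_mb_s=30):
    return capacity_tb * 1e12 / (rate_mb_s * 1e6) / 3600

for tb in (1, 2, 4, 8):
    print(f"{tb} TB drive: ~{rebuild_hours(tb):.0f} hours")
```

Because per-drive throughput grows far more slowly than capacity, the hours roughly double with every capacity generation, which is exactly the point being made here.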

At some point, rebuilding the array will take a frighteningly long time. Say it takes a week. A rebuild stresses the remaining discs heavily, to the point where it is common for another disc to break during it, and that happens more often than you would think. If a second disc fails, you are screwed.

Therefore you use RAID-6, which tolerates two failed discs. But there is a real likelihood that both redundant drives fail during a rebuild. At some point in the future, discs will be so large that another disc fails as fast as you can rebuild a broken one. This is a direct consequence of ever-larger drives.

A decade ago, mainframes didn't have drives this large. Rebuilding an array was no problem; it went very quickly. Today it takes a very long time. So no, mainframes have not solved this problem. This is why people say RAID-5 will soon be obsolete, and it is what the article is about.

Also, enter silent corruption. Discs will read and write bits erroneously without even noticing! You will not get any notification that there was an error, which is a bad thing. Around 20% of a disc's surface is dedicated to error-correcting codes, and those codes can neither fix every error nor even detect every one. Lots and lots of errors in every read and write get corrected on the fly, but sometimes an error can neither be corrected by the disc nor even detected. It is like the lamp on the oven saying it is off when the oven is in fact on: the hardware doesn't detect the fault, so it lies to you. Look at the spec sheet of a new drive and it says something like "unrecoverable error: 1 in 10^14 bits", and beyond those there are errors that don't get detected at all. ZFS, however, detects these errors and also recovers the data, so the "1 in 10^14" figure doesn't apply in the same way: ZFS detects and corrects them.
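To see why the "1 in 10^14" figure matters at today's capacities, here is a small probability sketch (the 12 TB read volume is an illustrative rebuild workload I chose, not a number from the article):

```python
import math

# Probability of at least one unrecoverable read error (URE) while
# reading n_bytes, given the spec-sheet rate of one URE per 1e14 bits.
# Uses a Poisson approximation: P = 1 - exp(-bits_read * rate).
def p_ure(n_bytes, rate_per_bit=1e-14):
    return 1 - math.exp(-n_bytes * 8 * rate_per_bit)

# Reading 12 TB back during a large-array rebuild:
print(f"{p_ure(12e12):.0%} chance of hitting at least one URE")
```

At that spec-sheet rate, a rebuild that reads 12 TB has roughly a three-in-five chance of tripping over an unrecoverable sector, which is why single-parity rebuilds on big drives are so nerve-wracking.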

SUN knows about these problems, and ZFS does fix them. ZFS also allows other, safer configurations than RAID-5 and RAID-6, which makes it less susceptible to long rebuilds. For instance, three discs may fail in a raidz3 config, or you can mirror lots of discs and combine them into flexible layouts. Note that ZFS does NOT like HW RAID; RAID controllers only disturb it. So ditch HW RAID while you can still get a fair price for it, and use ZFS for a cheaper and safer solution. A box with 48 SATA 7200 rpm discs reads 2-3 GB/sec and writes 1 GB/sec, and the data is safe too.

CERN did a study on silent corruption, and the takeaway was that "one error in 10^14" is optimistic. In practice, errors occur more frequently than the spec sheets claim: http://storagemojo.com/2007/09/19/cerns-data-corruption-research/
Your data is at risk. Silent corruption and bit rot eat your data, silently, without the hardware telling you. The hardware doesn't even notice.

Clearly, something has to be done to cope with the errors that large drives and future filesystems will face. The main architect of ZFS explains some of the problems that will become more and more common: http://queue.acm.org/detail.cfm?id=1317400

Maybe if you are the CEO of a company whose critical system is worth billions of dollars, or thousands of lives, you are willing to take any measure to minimize the risk of losing data? And ZFS doesn't require extra, specialized hardware: just a SATA controller card with no RAID functionality plus 7200 rpm SATA discs.

RAID was supposed to mean a Redundant Array of Inexpensive Discs; with ZFS that finally becomes true. A good HW RAID card costs a lot, and what happens if the vendor goes bankrupt? Where do you find a replacement card? You are locked in.

The ZFS code is open and you can do whatever you want with it. ZFS is future-proof, and it doesn't cost anything. Move your discs to a Mac OS X, FreeBSD, Solaris SPARC or Solaris x86 machine, type "zpool import", and the migration is done. All data is stored endian-neutral.

To me it is a no-brainer: why not use ZFS? It is better, safer, easier to administer, and free. I've heard that creating a RAID with Linux and LVM takes something like 30 commands. With ZFS you type "zpool create tank raidz1 disc0 disc1 disc2 disc3" and you are done. No formatting; copy your data immediately. There is no fsck, and all data is always online.

But these great advantages of ZFS are nothing new for SUN's technology. DTrace is just as good, as are Niagara SPARC, Zones, etc. They are all open tech, and good tech.

I am a little sleepy, so I may be looking at this wrong, but wouldn't RAIDZ, with its checksumming and all that, mitigate some of the issues here? I also wonder how the time to rebuild a raidz volume compares to the figures the author presented.

I also wonder how the time to rebuild a raidz volume compares to the figures the author presented.

The time to rebuild a RAIDZ array depends not on the array's size but on the amount of data actually in use. RAIDZ rebuilds only the blocks referenced by the higher levels of the filesystem stack, which is possible because the RAID layer is integrated with the filesystem itself.
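A rough model of that distinction (the 30 MB/s effective copy rate, the 2 TB drive, and the 25% pool occupancy are all illustrative assumptions):

```python
# A traditional RAID rebuild reads every sector of the replacement
# drive's capacity; a RAIDZ resilver touches only live data.
def hours(bytes_to_copy, rate_mb_s=30):
    return bytes_to_copy / (rate_mb_s * 1e6) / 3600

capacity = 2e12        # 2 TB drive (assumed)
used_fraction = 0.25   # pool is 25% full (assumed)

print(f"whole-disk rebuild: ~{hours(capacity):.0f} h")
print(f"raidz resilver:     ~{hours(capacity * used_fraction):.0f} h")
```

The caveat, of course, is that a nearly full pool gains little from this: resilvering a 95%-full raidz takes about as long as a whole-disk rebuild.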

So you basically still have to tailor your solution to a specific project's needs, balancing speed, security, etc. I could double the number of SATA drives for RAID 10 and still be waaaay under the cost of the SAS solution, and might not even need another box to hold the drives. I think array performance is a bigger factor in the decision, because the cost of being overly redundant with SATA is still waaaay cheap.

You're right on the price, but there is one point where the SAS drives win: eight 10k SAS drives are way faster than three of any 1 TB SATA drive on the market. And even leaving rotation speed aside, SAS is a more efficient interface than SATA.

I thought modern RAID systems used error-correcting codes, which means even random bad bits would not irreversibly damage the data if there were a drive failure.

What am I missing?

PS. My days of studying RAID systems were over a decade ago.

True, if you are running a system that actually does checksumming, like NetApp or ZFS. But many lower-end systems do not. And even if you have checksumming and so can detect, say, a flipped bit in a sector that you need for reconstructing a stripe, you would need a second parity stream to reconstruct that stripe (one for the dead disk's missing sectors, one for the "bad" sector from the otherwise "good" disk). So IMHO multiple-parity RAID plus checksumming is a great idea, if you have to use parity RAID at all.
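A toy illustration of why one parity stream can solve for only one unknown (plain XOR parity over small integers, not a real RAID implementation):

```python
from functools import reduce
from operator import xor

# Single XOR parity over a three-member stripe.
data = [0b1011, 0b0110, 0b1100]
parity = reduce(xor, data)

# One member lost (disk died): XOR of parity and the survivors
# recovers the missing block exactly.
recovered = parity ^ data[0] ^ data[2]
assert recovered == data[1]

# But if a checksum now also flags data[2] as corrupt, there are two
# unknowns and only one parity equation, so single-parity RAID is
# stuck; that is what the second, independent parity stream in
# RAID-6 (or raidz2) buys you.
```

The demo is deliberately minimal: real RAID-6 uses a second, mathematically independent code (e.g. Reed-Solomon) rather than a second XOR, precisely so the two equations can be solved for two unknowns.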

My main computer at work has 3 TB of disk space, spread over 21 35 GB, 78 17.5 GB, and 14 70 GB drives (113 drives in total).

We use RAID-5 sets of up to 14 drives each. With this configuration, rebuild times for a failed drive are short.

We use i5/OS, which uses a scatter/pack storage method (data is written to the drives in such a way that all disks stay at the same used percentage). Each RAID controller has its own CPU to handle parity generation and checking, which offloads a lot of processing from the main CPUs (4 in my case). It also means that we actually do parity checks/validation on both reads and writes.

PS: Each ASP (Auxiliary Storage Pool) can be up to 16 TB, and we can have 32 ASPs per box.

We can also add/remove drives from an ASP while the system is active. For example, to replace some of the 17.5 GB drives with 35 GB drives, I would: 1) identify the drives to be replaced, 2) issue the command to remove them from the ASP (this stops new data from being written to them and starts migrating their data to other drives), 3) physically replace the drives, 4) initialize the new disks, 5) attach the new disks to the ASP, 6) issue the command to re-balance the drives (spread the data equally across them).

And did you know that your HW RAID doesn't protect against silent corruption? Errors may be introduced by the drive or the controller without the hardware even noticing.

CERN did another study: they wrote a special 1 MB bit pattern every second on 3000 servers running Linux and HW RAID. After three weeks they found 152 instances where a file showed errors. They knew what the file should look like, and it didn't. Everything seemed fine, but it wasn't. This is called silent corruption. To counter it you need end-to-end checksums, and HW RAID doesn't do that.

We do end-to-end checksums; that's one of the reasons for having a CPU on each RAID card. On write, the card does a read-back to validate the data.

This is a massive multi-user box: we average around 2000 concurrent users, and it also back-ends 1500 time clocks. It's not unusual to see 4000 programs running at the same time, so disk read/write speed is very important; the more disk arms you have, the faster data can move to and from the drives. Three 1 TB drives would be too slow, since each drive can service only one request at a time with its single arm (assuming RAID; even three arms would be slow).
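A crude spindle-count model of that effect (the seek time and rotation speed are generic assumptions for illustration, not measurements from this box):

```python
# Each disk arm services roughly one random I/O per (seek time plus
# half a rotation); aggregate random IOPS scales with the arm count.
def array_iops(n_drives, seek_ms=5.0, rpm=15000):
    half_rotation_ms = 60000 / rpm / 2
    return n_drives * 1000 / (seek_ms + half_rotation_ms)

print(f"113 small drives: ~{array_iops(113):.0f} IOPS")
print(f"  3 large drives: ~{array_iops(3):.0f} IOPS")
```

Under this simple model, many small drives beat a few large ones by a factor equal to the spindle ratio for random workloads, regardless of total capacity.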

Holy moley! Those are some impressive numbers. The geek in me likes it. :o)

If we're talking about end-to-end integrity: I have asked lots of knowledgeable people, and it turns out that HW RAID doesn't do any good end-to-end data checks. Sure, HW RAID does some basic checks, but that is not much. I suggest you look into exactly how good that end-to-end check really is. Did you see the CERN studies of HW RAID?

This would have been much more interesting 4-7 years ago, but with the prevalence of SATA, a faster SATA interface on the horizon, and 512 GB SSDs available since late last year, I have to say, "So what?"

There are ways to solve this problem using SSDs; although hard disks typically provide more storage density per unit area, that's not true once you get into the enterprise space - have a look at what the largest SAS disk is and what it'll cost you.

And you might even save money right off the bat when you factor in the lower cooling and power requirements of SSDs.

Is it? I can't think of many companies that are betting the farm on SATA storage for performance or reliability. How are 7,200 RPM SATA disks going to compare to a 4 Gb Fibre Channel array using 15,000 RPM disks? They're not. SATA only beats FC in capacity and cost.

SSDs are also an immature technology that is only beginning to be integrated into devices like Sun's Unified Storage System 7000 series arrays, and even then they are used as cache, not primary storage. While this might not be the case in five to ten years, SSDs are not quite ready for prime-time data center use just yet.

Is it? I can't think of many companies that are betting the farm on SATA storage for performance or reliability. How are 7,200 RPM SATA disks going to compare to a 4 Gb Fibre Channel array using 15,000 RPM disks? They're not. SATA only beats FC in capacity and cost.

I wasn't separating SATA from SSDs in this context. I was implying that despite the high cost of SSDs, you could balance that against using a cheaper interface and still win on almost all counts.

However, there is no reason why SSDs can't be made with SAS interfaces. And don't knock SATA hard disks too much: I've seen data from several sources showing that they have lower failure rates than some of the higher-end drives, even if they're not as fast.

SSDs are also an immature technology that is only beginning to be integrated into devices like Sun's Unified Storage System 7000 series arrays, and even then they are used as cache, not primary storage. While this might not be the case in five to ten years, SSDs are not quite ready for prime-time data center use just yet.

SSDs have been around a long time (see the articles at www.storagesearch.com). It's not a matter of being ready for the data center; it's enterprise resistance to change and the big players' slower adoption of newer tech.

There are plenty of consumer-grade SSDs that might work well for those who don't expect high-throughput I/O and a long life cycle. When you start paying $20,000 and up for a storage array, everything changes.

I am sure I am not the only one who thinks SSDs still need early adopters and risk takers in the field to shake them out before they can be recommended as part of a replacement or upgrade for existing storage.

Is it? I can't think of many companies that are betting the farm on SATA storage for performance or reliability. How are 7,200 RPM SATA disks going to compare to a 4 Gb Fibre Channel array using 15,000 RPM disks? They're not. SATA only beats FC in capacity and cost.

If your budget (in money, power, and space) is truly unlimited, you're right. But what if you must choose between, say, mirrored SATA and parity-RAID fibre given the constraints at hand? What if the choice is SATA at half full versus fibre at 80%? What if you can afford more SATA spindles? I don't think it is that clear-cut under normal constraints.

Budget does come into play. Our history with SATA storage here has not been a good one: we bought storage from a vendor that decided after six months to get out of the SATA storage game and dropped support for the devices. We now have over a hundred 500 GB SATA drives that we pulled from the arrays; we junked the rest.

We also have SATA solutions from other vendors (NetApp, HP), and the jury is still out. Power, AC, and space do come into consideration, but some of the people I work with also expect a level of performance that SATA might not meet; factor in, too, that the idea of using SATA instead of SCSI or FC makes some people uncomfortable.

Most of the server rooms I have worked in are near or over capacity for power and AC, but new ideas are usually the hardest sell.

Budget does come into play. Our history with SATA storage here has not been a good one: we bought storage from a vendor that decided after six months to get out of the SATA storage game and dropped support for the devices. We now have over a hundred 500 GB SATA drives that we pulled from the arrays; we junked the rest.

That's pretty bad. But it's in line with what I mean: SATA is just one piece of the puzzle. If the rest of the stack is junk, it almost doesn't matter what the drive interface is. From your other posts I think you're saying the same thing; I just wouldn't blame SATA, rather the junky arrays.

We also have SATA solutions from other vendors (NetApp, HP), and the jury is still out. Power, AC, and space do come into consideration, but some of the people I work with also expect a level of performance that SATA might not meet; factor in, too, that the idea of using SATA instead of SCSI or FC makes some people uncomfortable.

Most of the server rooms I have worked in are near or over capacity for power and AC, but new ideas are usually the hardest sell.

True. It's kind of funny that by that mentality anybody would accept anything but direct-attach storage. Just because the SAN controller has fibre ports on both sides doesn't mean there isn't a very complicated black box in the middle. Thinking of it as "fibre from host to spindle" is sort of meaningless when there is no direct path from host to physical disk.

Our problem is that the Government wants to build a SAN, but they want to use existing components that are already in production (a bad idea), and I really don't think it has sunk in that mixing components is not a good idea (we have 1 Gb, 2 Gb and 4 Gb FC arrays and libraries). Our storage is direct-attach at the moment, which works but is not flexible.

Unfortunately, this is what happens when you build something piecemeal and buy the key pieces (the FC switches) last.

In Germany there is a saying: "Just because everybody has ridden a train, everybody believes they are an expert in the railway business." Or, for other countries: just having flown on an airplane doesn't make you an aviation business expert.

That said, many people tend to think that just because they have one or two disks in their server at home, they are storage experts. Take the SATA vs. SAS thing: of course SAS itself doesn't make storage more available, but better mechanics do, and you often find those mechanics in SAS drives. Nor do such people think about the implications of the operating-hours specification.