Keeping RAID Alive

When it comes to protecting data on disk, few technologies are more universal than RAID; it faces challenges in the future data center, but is hardly alone in that.

When it comes to protecting data on disk, it seems that no technology is more universally applied than a Redundant Array of Independent Disks (RAID). RAID has two problems, however, that lead many to think it has a limited future in tomorrow's data center. Vendors are doing their best to address these issues, and there are some sensible workarounds that will keep RAID a viable option for the foreseeable future.

The goal of RAID is to protect you from data loss if a drive fails, and to provide that protection in a cost- and capacity-efficient way. RAID does this by striping your data, along with parity information, across a group of drives. If one of those drives fails, data is still available to users and applications because the other drives can calculate what was supposed to be on the failed drive. The failed drive can then be replaced, and the data that was on it can be rebuilt from the surviving drives.
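To make "the other drives can calculate what was supposed to be on the failed drive" concrete, here is a minimal sketch of single-parity reconstruction using XOR. The drive contents, block size, and the helper name xor_blocks are all illustrative assumptions; real arrays work on much larger stripes, and many use other layouts or codes.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data drives, each holding one 4-byte block of a stripe.
data_drives = [b"RAID", b"disk", b"data"]

# The parity drive stores the XOR of every data block in the stripe.
parity = xor_blocks(data_drives)

# Simulate losing the second drive: XOR the surviving blocks with parity
# to recover exactly what the failed drive held.
survivors = [data_drives[0], data_drives[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data_drives[1])
```

Because XOR is its own inverse, the same routine both generates parity and recovers a missing block, which is why one extra drive is enough to survive one failure in this scheme.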

RAID has two problems that some think may make the technology unusable in the future. First, as the capacity per drive continues to increase, the time required to rebuild a drive grows with it and is now measured in days. This matters because, to avoid data loss, the rebuild has to complete before another drive fails. The rebuild process also usually takes a significant performance toll on the applications using that storage. This ties into the second problem, which is that the likelihood of a multiple-drive failure increases as capacity increases. Drives carry a statistic called the bit error rate, and as drive capacities increase, the chance of hitting an error somewhere on a drive increases as well. The combination of longer rebuild times and an increasing likelihood of failures creates a perfect opportunity for data loss and has led many to pronounce RAID dead. The suppliers, though, are saying "not so fast."
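The capacity effect can be illustrated with some rough arithmetic. The sketch below estimates the chance of hitting at least one read error while reading every surviving drive in full during a rebuild. The bit error rate of 1 in 10^14 bits, the five surviving drives, and the assumption that errors are independent are all illustrative choices, not figures from any particular vendor.

```python
import math

def p_read_error(bytes_read, bit_error_rate):
    """Chance of at least one read error while reading bytes_read bytes,
    assuming every bit fails independently at bit_error_rate."""
    bits = bytes_read * 8
    # 1 - (1 - rate)^bits, written with log1p/expm1 to stay accurate
    # when the rate is tiny.
    return -math.expm1(bits * math.log1p(-bit_error_rate))

ASSUMED_BER = 1e-14       # assumption: 1 error per 10^14 bits read
SURVIVING_DRIVES = 5      # assumption: drives that must be read in full

for capacity_tb in (1, 4, 16):
    bytes_read = SURVIVING_DRIVES * capacity_tb * 10**12
    chance = p_read_error(bytes_read, ASSUMED_BER)
    print(f"{capacity_tb:>2} TB drives: {chance:.0%} chance of a read error")
```

Under these assumptions, even 1 TB drives give roughly a one-in-three chance of an error during the rebuild, and the figure climbs quickly as capacity grows. Real drives often do better than their quoted rate, so treat this as a way to see the trend rather than a prediction.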

Faster Storage Controllers

When a drive fails in a RAID group, a race starts to rebuild the data before another drive (or, with RAID 6, two more drives) fails. A series of mathematical calculations is made to determine what was on the failed drive, and that data is then written to the replacement drive. The faster the calculations and the writes are made, the more overall system performance is impacted. With most RAID systems, you can throttle the rebuild rate down so that application performance is not impacted, but you do so at the risk of a longer window of exposure. A simple way around this is for manufacturers to supply systems with controllers that have excess performance capability and can rebuild at full speed with little impact on application performance. While this is an improvement, you still have to deal with some period of time during which you are exposed.
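The throttling trade-off is simple arithmetic: the smaller the share of controller bandwidth given to the rebuild, the longer the exposure window. The sketch below uses a hypothetical 4 TB drive and an assumed full-speed rebuild rate of 100 MB/s; both numbers, and the function name exposure_window_hours, are placeholders to swap for your own.

```python
def exposure_window_hours(capacity_tb, full_rate_mb_s, rebuild_share):
    """Hours needed to rewrite one drive when the rebuild is limited to
    rebuild_share (0 < share <= 1) of the full-speed rebuild rate."""
    if not 0 < rebuild_share <= 1:
        raise ValueError("rebuild_share must be in the range (0, 1]")
    capacity_mb = capacity_tb * 1_000_000          # decimal TB to MB
    effective_rate = full_rate_mb_s * rebuild_share
    return capacity_mb / effective_rate / 3600     # seconds to hours

# Assumed example: 4 TB drive, 100 MB/s when rebuilding flat out.
for share in (1.0, 0.5, 0.25, 0.1):
    hours = exposure_window_hours(4, 100, share)
    print(f"rebuild at {share:.0%} of full speed: {hours:.1f} hours exposed")
```

This model ignores everything except raw write rate, so a real rebuild under application load will usually take longer than it suggests.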

More Storage Intelligence

Another option is to be more intelligent about the failure itself. For example, most arrays will start a rebuild after a drive has reached a certain threshold of errors. In this process, the whole drive is marked bad and taken offline, and then the rebuild starts. In a way, data is unnecessarily put at risk while the rebuild happens. Instead of failing the drive before the rebuild, some systems have the intelligence to mark only certain sections, or even platters, of the drive bad. Additionally, some systems can keep the drive online and copy its good data to a new drive before failing the old one; the new drive then replaces the old, error-prone drive in the RAID group. With this capability, data is safer and the copy is faster, as no mathematical calculations need to be made.
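The copy-before-fail idea can be sketched as a per-block decision: copy every block the ailing drive can still read, and fall back to parity reconstruction only for the blocks it cannot. The sketch assumes a single-parity XOR group, and the names proactive_replace and bad_blocks are hypothetical; vendors implement this in their own ways.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def proactive_replace(failing, peers, bad_blocks):
    """Build a replacement for a drive that is still online.

    failing:    blocks on the error-prone drive
    peers:      the other drives in the group, parity included
    bad_blocks: indices on the failing drive that can no longer be read
    Returns (new_drive, blocks_copied, blocks_reconstructed).
    """
    new_drive, copied, reconstructed = [], 0, 0
    for i, block in enumerate(failing):
        if i in bad_blocks:
            # Unreadable block: recompute it from the rest of the group.
            new_drive.append(xor_blocks([peer[i] for peer in peers]))
            reconstructed += 1
        else:
            # Readable block: a straight copy, no parity math required.
            new_drive.append(block)
            copied += 1
    return new_drive, copied, reconstructed

# Hypothetical group: two data drives plus one parity drive.
drive_a = [b"aa", b"bb", b"cc"]
drive_b = [b"11", b"22", b"33"]
parity = [xor_blocks([a, b]) for a, b in zip(drive_a, drive_b)]

new_b, copied, reconstructed = proactive_replace(drive_b, [drive_a, parity], {1})
print(new_b == drive_b, copied, reconstructed)
```

Only the one bad block touches the other drives here, which is where the speed and safety benefit described above comes from.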

Flash Storage

A final option may be the use of solid-state storage instead of mechanical hard drives. RAID rebuilds are read- and write-heavy operations that, in the mechanical world, involve high-capacity drives. Solid-state storage is well suited to read- and write-heavy operations, and the capacity per drive or module is typically smaller. We have seen reported numbers of less than 30 minutes to rebuild a failed Flash module in a 5TB Flash storage volume. There has been some concern that RAID on Flash increases the chance of wearing out the memory cells, but several vendors have now designed Flash-specific RAID algorithms that protect against component failure and balance the write workload across the memory modules.

RAID may have its challenges in the future data center, but it is not alone. As the state of the art in storage emphasizes greater performance and higher capacity in the same physical space, error rates will continue to increase. It is up to the storage suppliers to address these issues through additional controller and system intelligence, so that users are protected while still benefiting from advances in the technology. The user's job is to understand what options are available and how each vendor is implementing them.