Transcription

2. In the beginning
- There was replication. Long before advanced data protection techniques were known, data was simply copied.
- Replication is wasteful: to survive N faults, N+1 copies are needed.
- Applied to disks, that means (N+1) times the hardware, power, floor space, and cooling.
- Not cheap. Not green. Not performant.
2010 Storage Developer Conference. Insert Your Company Name. All Rights Reserved.

3. Enter RAID
- In the 1980s, RAID was invented.
- By storing a little extra information about a larger set of information, errors can be corrected.
- RAID 5 stores parity information. Parity is the property denoting even or odd: if the number of 1's across a set of drives is even, the parity bit is set to 0; if odd, it is set to 1.
- If any disk is lost, the parity along with the bits on the surviving disks will yield the content of the lost disk.
- Example RAID 5 recovery: with data bits 1, 0, X, 1 and parity P, the lost bit is X = P XOR 1 XOR 0 XOR 1.
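The XOR recovery above can be sketched in a few lines of Python (a minimal illustration of the parity idea, not tied to any real RAID implementation):

```python
from functools import reduce
from operator import xor

def parity(bits):
    """Parity bit: 0 if the number of 1's is even, 1 if odd."""
    return reduce(xor, bits, 0)

# Four data drives hold the bits 1, 0, 0, 1 at some position.
data = [1, 0, 0, 1]
p = parity(data)            # even number of 1's -> parity bit is 0

# The drive holding the third bit fails; XOR the survivors with parity.
surviving = [1, 0, 1]
recovered = parity(surviving) ^ p
print(recovered)            # -> 0, the lost bit
```

Real arrays apply the same XOR sector-by-sector across whole drives, but the recovery rule is identical.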

4. Paradise Lost
- RAID 5 was great: it gave protection similar to making one copy, yet the overhead was significantly less.
- For example: using 3 disks for data and 1 for parity, the overhead is only 33%.
- However, two factors would conspire to destroy the practical usefulness of RAID 5:
  - Disk capacity outpacing performance
  - The growing chance of Latent Sector Errors (LSEs), which increases with disk capacity

6. Impact on RAID 5
- RAID 5 can tolerate only one error at a time.
- After the first failure, data is in a vulnerable state: no additional redundancy exists, and a secondary disk failure causes irrecoverable loss.
- This was exceedingly unlikely when a disk could be rebuilt in minutes (as was the case in 1991). Today, disks can take hours or days to rebuild.
- Longer rebuild time means the chance of a secondary failure is ~500 times greater.

7. Disks can fail in many ways (Jon Elerath, 2007 [2])
- Outright disk failure is just one possibility.
- More commonly, one or more sectors may be found unreadable at some future time.
- A latent failure while rebuilding RAID 5 will cause data loss.

8. Chance of an LSE during a rebuild
- Drive manufacturers often report LSE rates of 1 per every 10^14 to 10^15 bits (roughly 12.5 TB to 125 TB) read.
- When disks were only a couple of MB or GB, this probability was negligible.
- Consider a RAID 5 array using 2 TB disks: after a disk failure, all other disks must be read flawlessly, without encountering an LSE. For a 4-disk array, 6 TB of data must be read.
- This works out to a 41% chance of an LSE during the rebuild, assuming an LSE rate of 1 in 10^14 bits (5% if 1 in 10^15).
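The 41% figure can be reproduced with a short calculation. This sketch assumes binary terabytes (2^40 bytes), which is what makes the slide's arithmetic come out to 41%:

```python
import math

def lse_probability(tb_read, bit_error_rate):
    """Chance of hitting at least one LSE while reading
    tb_read terabytes (2^40 bytes each) end to end."""
    bits = tb_read * 2**40 * 8
    # expm1/log1p keep the result accurate despite the tiny per-bit rate
    return -math.expm1(bits * math.log1p(-bit_error_rate))

print(round(lse_probability(6, 1e-14), 2))  # -> 0.41
print(round(lse_probability(6, 1e-15), 2))  # -> 0.05
```

Note how a single order of magnitude in the manufacturer's error rate is the difference between a rebuild that usually succeeds and one that fails two times in five.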

9. Impact of an LSE during rebuild
- A disk sector is corrupted (usually 512 bytes).
- The effect may be minor, even unnoticed; other times it may lead to corruption of a file.
- If the sector contained critical metadata, it may result in severe file system corruption.
- In some cases, especially with desktop-class drives, the drive may spend many minutes in a recovery mode, causing it to be kicked from the array and thus failing the whole rebuild.

10. Quantifying Risk
- We now know: bigger disks = increased risk. But how significant is this risk? How much data is expected to be lost?
- Fortunately, there are techniques for calculating these risks if one knows the disk's:
  - Mean Time To Failure (MTTF)
  - Capacity and performance
  - Rate of Latent Sector Errors

11. Mean Time To Failure (MTTF)
- The average time between failures over the useful life of a component. Not to be confused with expected life.
- A 30-year-old human has an MTTF of 900 years [3]. This doesn't imply they will live another 900 years; it implies a 1 in 900 chance of failing over 1 year.
- Example application of MTTF: assume a drive has an MTTF of 20 years, and we operate 1,000 such drives over 6 months. This works out to 500 drive-years, so we should expect 500 / 20 = 25 failures.
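The drive-years arithmetic from the example above is a one-liner:

```python
def expected_failures(num_drives, years, mttf_years):
    """Expected failure count = accumulated drive-years / MTTF."""
    return num_drives * years / mttf_years

# 1,000 drives for 6 months = 500 drive-years; at a 20-year MTTF:
print(expected_failures(1000, 0.5, 20))  # -> 25.0
```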

12. Mean Time To Repair (MTTR)
- The average time to fully repair a failed component, including:
  - Time for an operator to replace the failed drive
  - Time to rebuild the lost data on the new drive
- Time to replace can vary significantly: it may be hours or days, or zero with hot spares.
- Time to rebuild is often estimated by taking a drive's capacity and dividing by its throughput. This is a best-case scenario: in practice, rebuilds compete with normal I/O requests, so Capacity / ((1/3) * Throughput) is more realistic [4].
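As a sketch of the rebuild-time estimate, with an assumed (hypothetical) 100 MB/s drive throughput:

```python
def rebuild_hours(capacity_tb, throughput_mb_s, usable_fraction=1.0):
    """Estimated rebuild time. usable_fraction < 1 models the rebuild
    competing with normal I/O (the slide suggests ~1/3 is realistic)."""
    seconds = capacity_tb * 1e12 / (throughput_mb_s * 1e6 * usable_fraction)
    return seconds / 3600

# A 2 TB drive at an assumed 100 MB/s sequential throughput:
print(round(rebuild_hours(2, 100), 1))         # -> 5.6  (best case, idle array)
print(round(rebuild_hours(2, 100, 1/3), 1))    # -> 16.7 (array under load)
```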

13. Estimating time to data loss
- The MTTFs of sub-components can be combined to yield the MTTF for the system as a whole, e.g. for a computer:
  MTTF_sys = 1 / (1/MTTF_cpu + 1/MTTF_mem + 1/MTTF_psu)
- Essentially, the inverse of the sum of the inverses, also known as the harmonic sum.
- When the MTTFs are identical, a shortcut exists: MTTF_sys = MTTF_sc / N, where N is the number of sub-components.
- This explains why RAID 0 is so unreliable: MTTF_RAID0 = MTTF_disk / NumDisks, only a fraction of the reliability of an individual disk.
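The harmonic sum, as a small helper:

```python
def harmonic_mttf(*mttfs):
    """System MTTF: the inverse of the sum of the inverses
    of the sub-component MTTFs."""
    return 1 / sum(1 / m for m in mttfs)

# Identical components reduce to MTTF / N: a 4-disk RAID 0
# of 20-year disks has only a 5-year MTTF.
print(round(harmonic_mttf(20, 20, 20, 20), 6))  # -> 5.0
```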

14. Estimating time to loss in RAID 5
- There are two paths to data loss in RAID 5:
  - Disk failure followed by another disk failure during rebuild
  - Disk failure followed by an LSE while rebuilding
- We know how to predict the time to the first failure: MTTFirstFailure = MTTF_disk / NumDisks.
- This doesn't imply data loss, only that a rebuild must occur. We must estimate the likelihood of a secondary failure.
- Assume the array had N disks to start. After the first failure, N-1 disks remain, and one of these must fail during the rebuild to cause data loss.

17. Combining paths to loss
- There are two paths to data loss in RAID 5:
  - Disk failure followed by another disk failure during rebuild
  - Disk failure followed by an LSE while rebuilding
- We can now calculate the MTTF for each path, but how can they be combined into a single estimate?
- We simply use the harmonic sum, as we learned before:
  MTTDL_RAID5 = 1 / (1/MTTDL_RAID5_DF + 1/MTTDL_RAID5_LSE)
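Putting the pieces together. The per-path formulas below are the standard approximations (the slides deriving them are not part of this transcription), and all the numbers are illustrative assumptions:

```python
# Illustrative assumptions, not figures from the slides:
MTTF = 20 * 8760          # disk MTTF in hours (20 years)
MTTR = 24                 # rebuild time in hours
N = 4                     # disks in the RAID 5 array
P_LSE = 0.41              # chance of an LSE during a full rebuild

# Standard approximations for the two paths to loss:
mttdl_df = MTTF**2 / (N * (N - 1) * MTTR)   # a second disk dies mid-rebuild
mttdl_lse = (MTTF / N) / P_LSE              # the rebuild hits a latent error

# The harmonic sum combines the two paths into one estimate:
mttdl = 1 / (1 / mttdl_df + 1 / mttdl_lse)
print(round(mttdl / 8760, 1))  # -> 12.2 (years; dominated by the LSE path)
```

Note how the LSE path, not the double-disk-failure path, dominates the result: this is exactly the shift that killed RAID 5 in practice.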

18. What good is a MTTF number?
- The MTTF statistic on its own is not very meaningful. However, it can be used to generate actionable information, such as the chance of data loss, or the expected amount of data loss, over a period of time.
- Failures can be assumed to be random processes; constant failure rates imply a Poisson distribution:
  FailureChanceOverTime(t) = 1 - e^(-t / MTTDL)
  where e is Euler's number, ~2.718.
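The formula above in code, applied to an assumed (hypothetical) 100-year MTTDL:

```python
import math

def failure_chance(t, mttdl):
    """Probability of at least one data-loss event within time t,
    assuming a constant failure rate (Poisson process)."""
    return 1 - math.exp(-t / mttdl)

# An array with a 100-year MTTDL, operated for 10 years:
print(round(failure_chance(10, 100), 3))  # -> 0.095
```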

23. Why RAID 6 is so much better
- Every additional tolerated failure increases MTTF by a factor of roughly MTTF / (MTTR * N).
- MTTF is usually many years, while MTTR is a matter of hours. With current disk MTTF and MTTR times, each additional tolerated failure increases reliability by a factor of several hundred to a few thousand!
- Reliability metrics for a RAID 6 array (6+2):
  FailureChanceOverTime(10 years) = 0.13%
  EDL_total(10 years) = 7.20 MB
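A quick sanity check on the "several hundred" claim, with assumed figures (20-year MTTF, 24-hour MTTR, 8 disks):

```python
# Rough improvement factor per additional tolerated failure:
MTTF_H = 20 * 8760   # disk MTTF in hours (assumed)
MTTR_H = 24          # rebuild time in hours (assumed)
N = 8                # disks in the array

print(round(MTTF_H / (MTTR_H * N)))  # -> 912
```

So under these assumptions, going from RAID 5 to RAID 6 buys roughly a 900x longer mean time to data loss at the cost of one extra parity disk.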

24. Problem Solved?
- For that RAID 6 system, the chance of data loss over 10 years is about 1 in 780.
- It would seem the data loss daemon has been slain. However, there are two factors not accounted for:
  - Some storage systems are massive (in the PB scale)
  - Disk capacities keep doubling

25. Issues of Scale
- Large systems require a large number of arrays. One cannot simply create one enormous RAID 6 array: too many disks would have to be touched for each update, and the chance of tertiary failures would be too great.
- Each array has its own independent chance of failure. Recall that MTTF_sys = MTTF_sc / N; it's true whether the component is a disk or an array.
- Consider a 5 PB storage system. Assuming 2 TB disks in a 6+2 configuration, this requires 427 individual RAID 6 arrays.
- Failure of any array causes irrecoverable data loss.
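The array count checks out, assuming binary petabytes and terabytes (2^50 and 2^40 bytes), which is what the slide's figure of 427 implies:

```python
import math

# 5 PB of usable space, built from 6+2 arrays of 2 TB disks:
usable_per_array = 6 * 2 * 2**40           # six data disks of 2 TB each
arrays = math.ceil(5 * 2**50 / usable_per_array)
print(arrays)  # -> 427

# Each array is an independent point of failure, so the system-level
# MTTDL shrinks by the same harmonic-sum rule:
#   MTTDL_sys = MTTDL_array / 427
```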

28. Is Replication The Answer?
- When spending millions of dollars on a storage system, who wants to double or triple that cost?
- Instead, we can take the same path that was taken from RAID 5 to RAID 6: scale out fault tolerance while maintaining the same level of storage efficiency.
- The only additional cost: increased processing.

29. Reliability for arbitrary K-of-N
- Where K is the number of data disks and N is the total number of disks in the array, the system tolerates N - K failures without loss [7].
- The RAID 5 estimates generalize. For the disk-failure path:
  MTTDL_DF = MTTF^(N-K+1) / (N * (N-1) * ... * K * MTTR^(N-K))
  The LSE path is analogous, with the final rebuild's failure term replaced by the chance of an LSE during that rebuild.
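The slide's own formulas did not survive transcription; the function below implements the standard approximation for the disk-failure path, which reduces to the RAID 5 formula when f = 1:

```python
from math import prod

def mttdl_disk_failures(mttf, mttr, n, k):
    """Standard approximation for the disk-failure path of a K-of-N
    array tolerating f = N - K concurrent failures: each extra
    tolerated failure multiplies MTTDL by roughly MTTF / (MTTR * N)."""
    f = n - k
    return mttf**(f + 1) / (mttr**f * prod(n - i for i in range(f + 1)))

# With an assumed 20-year MTTF and 24-hour MTTR, in years:
print(round(mttdl_disk_failures(175200, 24, 4, 3) / 8760))  # 4-disk RAID 5
print(round(mttdl_disk_failures(175200, 24, 8, 6) / 8760))  # 6+2 RAID 6
```

For f = 1 this is MTTF^2 / (N * (N-1) * MTTR), matching the RAID 5 path used earlier.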
