Data Integrity and Availability:
The Challenge of Scale for Modern Storage Systems

Guest Editor's Introduction • Sundara Nagarajan • May 2012

Recent reports from Amazon Web Services (AWS) indicate that the company's S3 storage service will soon have more than a trillion objects in storage and be capable of handling a million requests per second. Clearly, we are living through an era of transformation in storage system architectures designed to deliver continuously scalable service. Consumers and enterprise users want most of their data to be stored in the most economical manner, with a small part stored for rapid access as needed. Yet, even if a storage solution is available free of cost, users are uncompromising on a key property: access to their data on demand without fail. This invariant raises several challenges for storage system designers. Faults in storage systems can cause latent errors that remain long undetected until access uncovers them as failures. Hardware and software defects can cause faults that have a significant negative impact on reliability and availability, leading to data loss or delays in getting to the data.

For this issue of Computing Now, I collected a set of articles that highlight the design trade-offs and choices for modern-day storage system architectures with regard to consistency, data integrity, and availability. This theme explores real-world problems and solutions associated with contemporary storage systems.

Storage System Components

A storage system is characterized by a stack of hardware and software components, including:

Data-protection mechanisms, such as RAID or erasure codes at the device level;

Block and file semantics for storage organization;

Data management entities, such as snapshots and clones;

Provisioning of space;

Network access protocols, and so on.

These elements are combined with storage-efficiency techniques such as deduplication and compression to deliver the best balance of access speed and cost per unit of capacity. Modern storage systems deploy different kinds of storage devices with different cost-to-performance and reliability characteristics. These systems are essentially networks of components interconnected via high-speed links and working together to deliver storage services. Time to data (that is, availability or latency) and data loss are two important quality-of-service parameters that every storage system should ideally minimize, driving both toward zero.

Hardware storage devices have evolved rapidly in terms of the amount of data they can store. However, their reliability has not progressed at the same rate. Techniques such as RAID provide the first line of defense at the storage subsystem level, but are these techniques scalable to terabytes of capacity and beyond? Augmenting device capabilities and low-level error correction techniques, sophisticated file system designs offer data organization, space provisioning, and data management functionality. The primary goal of this layer is to meet the applications' performance and cost-efficiency needs. Such designs use metadata to store and retrieve blocks of data rapidly, and they achieve consistency by ensuring that the metadata and data reflect the latest updates. As the amount of data stored in a file system grows to petabytes and exabytes, corresponding consistency characteristics vary by file system design.
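To make the RAID idea concrete, here is a minimal sketch of single-parity protection, the scheme behind RAID levels that keep one XOR parity block per stripe. The function names (xor_blocks, make_stripe, rebuild) and the tiny four-byte blocks are illustrative assumptions, not taken from any of the articles in this issue or from a real RAID implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must be the same size")
    # zip(*blocks) walks the blocks column by column, one byte position at a time.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block (hypothetical layout)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost_index):
    """Recompute the block at lost_index by XOR-ing every surviving block."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


if __name__ == "__main__":
    stripe = make_stripe([b"AAAA", b"BBBB", b"CCCC"])
    # Simulate losing the second device and rebuilding its block.
    assert rebuild(stripe, 1) == b"BBBB"
```

Note that rebuild has to read every surviving block in the stripe, and that a second loss in the same stripe cannot be recovered with a single parity block. Both properties are part of why the scalability question above is a real one as device capacities grow.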

Traditional solutions to hard crashes, malfunctioning hardware systems, software defects, and other such threats to data integrity are no longer sufficient. The growing demand for storage capacity in the IT infrastructure poses new technical challenges. To fulfill this demand effectively and efficiently, academic and industrial researchers must find innovative solutions that can scale to petabytes and beyond.

Theme Articles

At the Symposium on Principles of Distributed Computing in 2000, Eric Brewer spoke about what came to be known as the CAP theorem: a conjecture that distributed systems can't simultaneously guarantee consistency, availability, and partition tolerance. Given that contemporary storage system architectures are inherently large-scale distributed systems, the theorem has attracted a lot of attention from storage system designers since then. In the first article in this month's theme, Brewer reviews the state of the art in "CAP Twelve Years Later: How the 'Rules' Have Changed." In this lucidly written paper, he gives guidance for designing large-scale systems with consistency and availability.

In "Loris — A Dependable, Modular File-Based Storage Stack," Raja Appuswamy, David van Moolenbroek, and Andrew Tanenbaum evaluate the traditional storage stack for reliability, heterogeneity and flexibility. They then propose a redesign in which most of the storage stack deals with finer-grained failure domains. As a result, it can potentially handle threats to data integrity — that is, data corruption, system failures, and device failures — more effectively than the traditional stack.

Storage efficiency is an important design consideration for modern storage systems. Deduplication reduces the space required for storage by eliminating redundant chunks of data, whereas fault tolerance relies on increased redundancy to handle faults in a system. The evident contradiction between the two is the subject of Eric Rozier and his colleagues' very interesting article, "Modeling the Fault-tolerance Consequences of De-duplication."
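The tension between deduplication and redundancy can be seen in a toy content-addressed chunk store. The sketch below is only an illustration under simple assumptions (fixed-size chunks, SHA-256 digests as chunk keys, a hypothetical DedupStore class); it is not the model used by Rozier and his colleagues.

```python
import hashlib


class DedupStore:
    """Toy deduplicating store: each distinct chunk is kept exactly once."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}  # digest -> chunk bytes, stored once
        self.files = {}   # file name -> ordered list of chunk digests

    def put(self, name, data):
        digests = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            # setdefault keeps the first copy and skips redundant ones.
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.files[name] = digests

    def get(self, name):
        return b"".join(self.chunks[d] for d in self.files[name])

    def files_affected_by_loss(self, digest):
        """Names of all files that reference the chunk with this digest."""
        return sorted(n for n, ds in self.files.items() if digest in ds)


if __name__ == "__main__":
    store = DedupStore(chunk_size=4)
    store.put("a.txt", b"HEADbodyA")
    store.put("b.txt", b"HEADbodyB")
    store.put("c.txt", b"HEADtail")
    shared = hashlib.sha256(b"HEAD").hexdigest()
    # Eight chunk references are served by only five stored chunks,
    # but losing the single shared chunk damages all three files.
    print(len(store.chunks), store.files_affected_by_loss(shared))
```

The space saving and the wider blast radius come from the same design choice: the more files share a chunk, the more data depends on that one stored copy.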

In addition to the articles in this month's theme, we include video interviews with two thought leaders from the storage domain. Tanya Shastri spoke with Steve Kleiman, senior vice president and chief scientist at NetApp, about the industry's vision of things to come as we face mounting demands on data integrity and availability at increasing scale and diversity of devices. Prof. Remzi Arpaci-Dusseau, of the University of Wisconsin-Madison, has been studying the data integrity characteristics of storage devices and file systems. In a video interview on our theme, he shares his thoughts on the state of the art in data integrity research. We thank them both for sharing their insights.

Data Integrity and Availability in Storage Systems,
An Interview with Steve Kleiman

An Interview with Remzi H. Arpaci-Dusseau

Sundara Nagarajan ("SN") is a technical director at NetApp and a visiting professor at International Institute of Information Technology, Bangalore, India. He is also CN's regional liaison to IEEE Computer Society activities in India. Contact him at s dot nagarajan at computer dot org.
