Thank you for reading our Data Lake 3.0 series! In Part I of the series, we introduced what a Data Lake 3.0 is. In Part II of the series, we talked about how a multi-colored YARN will play a critical role in building a successful Data Lake 3.0. In Part III of the series, we looked under the hood and demonstrated a Deep Learning framework (TensorFlow) assembly on Apache Hadoop YARN. In this blog, we will talk about how to create a Big Data archive to power the Data Lake 3.0 storage.

Apache Hadoop provides a cheap and effective way to store vast amounts of data on commodity hardware via the Hadoop Distributed File System (HDFS). Robust features, such as fault-tolerance and data locality, are enabled by the default 3x replication factor. The simplicity of this replication policy comes at the cost of a 200% storage overhead. As organizations continue to onboard more data and workloads to Hadoop, the storage overhead has become more burdensome, particularly with data sets that are infrequently accessed. This has left many organizations with the dilemma of either expanding their clusters to keep lightly used data accessible or archiving it off-cluster.

In this blog, we will introduce HDFS Erasure Coding as a solution to this quandary, enabling efficient storage (< 50% overhead) and thus effectively providing archival storage within a cluster while keeping the data accessible for analytics and maintaining the same fault tolerant properties of 3x replication.

HDFS Storage Evolution

Heterogeneous storage was introduced in Hadoop 2.3, enabling cluster disks to be labeled according to a set of predefined tiers (RAM_DISK, SSD, DISK, and ARCHIVE). For example, standard data nodes (i.e., 12 x 2TB) would have volumes labeled as “DISK” whereas the drives in a more storage-dense and compute-light node (60 x 4TB) would be labeled as “ARCHIVE”. This enabled a single cluster to be backed by data nodes that were, potentially, a collection of heterogeneous storage devices.

Heterogeneous storage laid the foundation to support data movement among tiers based on access patterns and “data temperature.” Hadoop 2.6 introduced storage policies that assign block replicas based on predefined temperatures. For example, all data is considered HOT by default and thus all replicas are stored in the DISK tier. On the other hand, data that is labeled WARM will have one replica in the DISK tier and the other replicas moved to the ARCHIVE tier. An HDFS directory can flow between tiers by assigning the directory a new tier and then invoking the HDFS Mover which reassigns replicas appropriately.

Figure 1: An example of how data would move through storage tiers as its temperature changes.

Heterogeneous storage tiers and storage policies enable organizations to more effectively control the growth of storage capacity independent of computational resources. However, the storage tiers of Hadoop 2.x still leverage replicas for redundancy and thus come with the inherent storage overhead. The next logical step is to introduce an additional tier that encompasses data that is “sealed” and can be encoded via erasure coding to achieve deep, archival storage on-cluster.

Erasure Coding Basics

Erasure codes (EC), or error correcting codes, are a process for encoding a message with additional data such that a portion of the encoded message can be lost or corrupted and the original message can still be recovered. More specifically, if a block of data contains k bits of data, then we can encode it into a longer message of size n such that the original message can be recovered from a subset of the n bits of data.

Figure 2: An illustration of the XOR encoding and recovery process in a RAID-5 array with 3 drives.

A simple example of error-correcting codes is the XOR parity (used in a RAID-5 array). This RAID configuration will stripe blocks of data across an array of physical disks where one of the disks is dedicated to storing the parity. For an array of 3 disks, a stripe of data would be two blocks and the parity block is computed as the bitwise XOR of the two blocks. The three blocks are stored on three different physical disks, which means that we can tolerate the failure of any one of the drives. The original data can be recovered by performing another bitwise XOR among the data on the remaining drives.

Erasure codes have a wide range of applications ranging from communications to storage. In communications, encoded messages allow a receiver to tolerate a noisy medium so they can detect and potentially correct transmission errors at the receiver without requesting the sender to resend. Similarly, data that is encoded and persisted to a storage medium can tolerate a level of failure or data loss (i.e., media or hardware failures) while still being able to recover the original data. In both situations, the error tolerance benefits come at the cost of a storage or transmission overhead. Typically, this overhead is significantly less than the overhead of pure replica-based strategy.

The EC implementation for HDFS is designed to support pluggable codes and, for the initial implementation, Reed-Solomon was selected. Reed-Solomon is a robust code that is widely used in a variety of technologies including DVDs, CDs, and Blu-Ray. The Reed-Solomon encoder is specified by two parameters, k and m, which we will refer to by the shorthand, RS(k,m). Here, k represents the number of blocks of data to be encoded and m is the resulting parity blocks that will be computed. Together, these two parameters allow the user to control the data durability and storage efficiency of the encoder. The most frequently used variants are RS(6,3) and RS(10,4) since they provide robust durability while also supporting efficient storage utilization.

When comparing storage strategies, durability and storage efficiency are two common metrics that enable you to quickly grasp the tradeoffs. Durability refers to the number of simultaneous failures that can be tolerated without sustaining data loss while storage efficiency refers to how much space is consumed by the actual data relative to the redundant bits. In the table below you can see that Erasure Coding via Reed-Solomon achieves greater durability and storage efficiency than the default 3-replica scheme.

Design And Implementation in HDFS

A number of design and implementation decisions were made to support EC in HDFS. For the purpose of this blog post, we are going to focus on the block layout design choice as it has the greatest impact to the end-user in terms of how and when EC should be leveraged.

Figure 3: Contiguous Block EC.

Files that are stored in HDFS are typically broken into logical blocks (i.e., 128M blocks) that are then physically stored in a corresponding storage block among data nodes in a cluster. This one-to-one mapping of logical blocks to physical storage blocks is referred to as a contiguous mapping and is both simple to understand and implement. For example, to read a file, you simply read each block in order. This approach works well for standard HDFS with replicas as it enables data locality for massive throughput and parallelism on large volumes of data, however, utilizing this same approach for EC implies certain drawbacks:

Lack of support for online encoding. RS(6,3) encoding would require a client to potentially process multiple gigabytes of data to compute the parity blocks for six blocks of data. At best this will introduce a significant write delay for clients and in many cases, this may simply be infeasible given the client hardware.

Significant storage overhead for small files. RS(6,3) would require writing three full parity blocks even if a file itself only breaks down into one or two logical blocks (i.e., 128M file will need three 128MB parity blocks, which is a 300% storage overhead — worse than 3x replicas). This could be solved by combining blocks from multiple files into a single contiguous stripe but it complicates the implementation, especially for delete operations.

Figure 4: Striped Block EC.

An alternative approach is to break the logical blocks into smaller cells (e.g., 1M) and then strip the cells across the physical storage blocks in a round-robin fashion. Smaller cells mean that a client can compute parity and write the data online instead of requiring an offline process. Similarly, since we are breaking down the blocks into smaller cells, we are as efficient at storing smaller files as we are with larger ones. That said, reading a file becomes more complicated as a single logical block is spread across multiple physical storage blocks. This approach does lose the benefits of a contiguous layout, namely data locality which leads to increased read latencies and a more complicated data recovery process.

In order to enable the community with EC as soon as possible, the implementation has been broken into two phases by the Apache Hadoop community. The first phase implements the foundation needed to support EC zones in HDFS via data striping. This means that directories in HDFS can be marked as an EC zone (similar to an encryption zone with HDFS encryption) and any data subsequently written to that directory will be encoded. The second phase of the implementation focuses on the contiguous block layout and the automated movement from an EC zone to a standard HDFS tier via a storage policy and the HDFS Mover tool. This phase will also see the additional benefits of data locality for large files.

Use Cases

Many of our customers are anxiously anticipating the release of Hadoop 3.0 and, specifically, the new capabilities surrounding EC. We outline a few of the most common use cases based on discussions with customers across a number of industry verticals.

Scenario 1: Archival Storage

A very common data pipeline for Hadoop involves ingesting raw data from a variety of sources and periodically running a series of ETL operations to cleanse, quality check, and publish data via Apache Hive or another repository. The raw source data is rarely needed again but we certainly do not want to delete it in case we need to replay the ingest pipeline or make the data available to an analyst. So, instead of maintaining 3 replicas of rarely-accessed raw source data, we can keep this data in an EC zone where it is still online and accessible if needed and keep our storage costs down.

Scenario 2: Backup and Disaster Recovery

A robust DR strategy for Hadoop often involves maintaining a geographically disparate cluster with a copy of the production data. Often, this cluster does not need to operate on the data so maintaining the data in traditional replica-based HDFS is expensive and wasteful. With EC, data can be backed up to DR cluster and persisted to an EC zone. If the cluster is dual-purpose (i.e., data science and ad-hoc analytics) then data can be read directly from the EC zone with a performance hit or moved out into HOT or WARM tiers as needed.

Scenario 3: Efficient Utilization of Heterogeneous Hardware

The ultimate vision for EC in HDFS is to complete the picture for heterogeneous storage support in Hadoop and provide transparent data movement among storage tiers. In this scenario, the “temperature” (i.e., access frequency) can be monitored and, over time, the HDFS mover will trigger data movement across the storage tiers as dictated by the storage policies specified by an administrator.