Facebook shares lessons learnt from deploying Raid in HDFS clusters

Facebook deployed Raid in large Hadoop
Distributed File System (HDFS) clusters last year, freeing up tens of petabytes of capacity by
reducing data replication. But the engineering team faced a number of challenges along the way,
such as data corruption and the difficulty of implementing Raid across very large directories.

The social network implemented the technology, which adds erasure
codes to HDFS, to reduce the replication factor of the data it stores.

Raid
(redundant array of independent disks) is a way of storing the same data in different places (thus,
redundantly) on multiple hard disks. The Hadoop
Distributed File System (HDFS) is the primary storage system used by Hadoop applications. It
provides high-performance access to data across Hadoop
clusters and has become a key tool in enterprises for managing big
data and supporting big data
analytics.

The default replication of a file in a Hadoop Distributed File System is three, which can lead
to a lot of space overhead, according to Facebook’s engineering team. HDFS Raid technology helped
the social network cut data replication and effectively reduce this space overhead.
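
As a rough illustration, assuming a Reed-Solomon code with 10 data blocks and four parity blocks (the blog post does not spell out Facebook's exact coding parameters): under three-way replication, a 10-block file occupies 30 blocks of raw storage, while keeping a single copy of the 10 data blocks plus four parity blocks stores only 14 blocks, an effective replication factor of about 1.4 instead of 3.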

“With Raid fully rolled out to our warehouse HDFS clusters last year, the cluster’s overall
replication factor was reduced enough to represent tens of petabytes of capacity savings by the end
of 2013,” the engineering team wrote on Facebook’s official blog.

But deploying
Raid in large HDFS clusters with hundreds of petabytes of data presents a lot of challenges.
“We wanted to share the lessons we learned along the way,” Facebook engineers said on the blog.

Space-saving lessons

When Facebook deployed Raid to production, it saved much less space than expected. “After
investigating, we found a significant issue in Raid, which we christened the ‘small file problem’,”
the engineers said.

The engineering team found that files with 10 logical blocks offer the best opportunity for
space-saving. The fewer blocks a file has, the less space Raid can save; if a file has fewer
than three blocks, Raid cannot save any space at all.

The team’s analysis found that more than 50% of the files in the production clusters were small
(fewer than three blocks).
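
A minimal sketch of the arithmetic behind the small file problem, assuming Reed-Solomon striping with 10 data and four parity blocks, all stored at a single replica (the real HDFS Raid policies and thresholds may differ):

    // Illustrative sketch only: estimates the effective replication factor of a
    // Raided file, assuming Reed-Solomon (10, 4) striping with single-replica
    // data and parity blocks. Real HDFS Raid policies and numbers may differ.
    public class RaidSavingsSketch {
        static final int STRIPE_LENGTH = 10;  // assumed data blocks per stripe
        static final int PARITY_LENGTH = 4;   // assumed parity blocks per stripe
        static final int DEFAULT_REPLICATION = 3;

        /** Effective replication = blocks actually stored / logical blocks. */
        static double effectiveReplication(int logicalBlocks) {
            int stripes = (logicalBlocks + STRIPE_LENGTH - 1) / STRIPE_LENGTH;
            int storedBlocks = logicalBlocks + stripes * PARITY_LENGTH;
            return (double) storedBlocks / logicalBlocks;
        }

        public static void main(String[] args) {
            for (int blocks : new int[] {1, 2, 3, 5, 10}) {
                System.out.printf("%2d blocks: raided %.2fx vs replicated %dx%n",
                        blocks, effectiveReplication(blocks), DEFAULT_REPLICATION);
            }
            // A 10-block file drops from 3x to 1.4x, but a one- or two-block file
            // stores as many blocks as (or more than) plain replication would,
            // which is the "small file problem" described above.
        }
    }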

To overcome this problem, the engineering team grouped the blocks of many files together. “We developed directory Raid to
address the small file problem based on one simple observation: in Hive, files under the leaf
directory are rarely changed after creation. So, why not treat the whole leaf directory as a file
with many blocks and then Raid it?”
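
As a sketch of that idea (the class and method names here are hypothetical, not Facebook's directory-Raid code), the blocks of every file directly under a Hive leaf directory can be counted and treated as one long logical block sequence to stripe and encode:

    // Illustrative sketch only: totals the logical blocks of all files under a
    // leaf directory, the unit that directory Raid would stripe and encode as
    // if it were a single large file. Names here are hypothetical.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DirectoryRaidSketch {
        static long directoryBlockCount(FileSystem fs, Path leafDir) throws IOException {
            long blocks = 0;
            for (FileStatus file : fs.listStatus(leafDir)) {
                if (file.isFile()) {
                    long blockSize = file.getBlockSize();
                    blocks += (file.getLen() + blockSize - 1) / blockSize;
                }
            }
            return blocks;
        }

        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            Path leafDir = new Path(args[0]);
            // Many one- or two-block files that could not be Raided on their own
            // now form a directory-level group long enough to stripe as a unit.
            System.out.println("Logical blocks under " + leafDir + ": "
                    + directoryBlockCount(fs, leafDir));
        }
    }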

Preventing data corruption

Another challenge in using HDFS Raid is data corruption caused by a bug in the Raid
reconstruction logic. To prevent corruption, the team calculated and stored CRC
checksums of blocks in MySQL while deploying Raid, so that whenever the system
reconstructs a missing block, it compares the checksum with the one in MySQL to verify that the
data is correct, the team said.
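
A minimal sketch of that verification step is below; the class name is hypothetical, and an in-memory map stands in for the MySQL table the team describes.

    // Illustrative sketch only: checks a reconstructed block against a CRC
    // recorded when the block was first Raided. A map stands in for the MySQL
    // table used in production; the class and method names are hypothetical.
    import java.util.Map;
    import java.util.zip.CRC32;

    public class BlockChecksumVerifier {
        private final Map<String, Long> storedCrcs; // stand-in for the MySQL table

        public BlockChecksumVerifier(Map<String, Long> storedCrcs) {
            this.storedCrcs = storedCrcs;
        }

        /** Returns true only if the reconstructed bytes match the recorded CRC. */
        public boolean verify(String blockId, byte[] reconstructed) {
            Long expected = storedCrcs.get(blockId);
            if (expected == null) {
                return false; // no recorded checksum, so refuse to trust the block
            }
            CRC32 crc = new CRC32();
            crc.update(reconstructed, 0, reconstructed.length);
            return crc.getValue() == expected;
        }
    }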

Another challenge was that implementing Raid across a directory with more than 10,000 files
would take a whole day to finish.

“If there’s any failure during the Raid process, the whole process will fail, and the CPU time
spent to that point will be wasted,” the engineers said. To overcome this issue, they “parallelised
the Raid” using a mapper-only MapReduce job, where each mapper Raids only one portion of the
directory.

“In this way, failure could be tolerated by simply retrying the affected mappers. With
MapReduce, we are able to Raid a large directory in a few hours,” the team said.
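
A hedged sketch of such a mapper-only job follows, using the standard Hadoop MapReduce API; raidFile() is a hypothetical stand-in for the actual encoding step, and the input is assumed to be a text file listing one source path per line.

    // Illustrative sketch only: each mapper Raids one slice of a large directory,
    // so a failed slice can be retried without redoing the whole directory.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class ParallelRaidJob {
        public static class RaidMapper
                extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
            @Override
            protected void map(LongWritable offset, Text filePath, Context context)
                    throws IOException, InterruptedException {
                // Hypothetical per-file encoding step; a failure here only fails
                // (and re-runs) this mapper's slice of the directory.
                raidFile(new Path(filePath.toString()), context.getConfiguration());
            }

            private void raidFile(Path source, Configuration conf) {
                // ... read source blocks, compute parity, write the parity file ...
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "parallel-raid");
            job.setJarByClass(ParallelRaidJob.class);
            job.setMapperClass(RaidMapper.class);
            job.setNumReduceTasks(0);                 // mapper-only job
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(NullOutputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }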

Currently, at Facebook, the HDFS Raid technology is implemented as a separate layer on top of
HDFS. But this presents a few problems for the team, such as wasted network bandwidth and
disk I/O. Also, when files are replicated across clusters, parity files are sometimes not moved
together with their source files, which often leads to data loss.

So, Facebook engineers are building Raid to be natively supported in HDFS. “A file could be
Raided when it is first created in HDFS, saving disk I/Os,” the engineers said.

“Once deployed, the NameNode
will keep the file Raid information and schedule block-fixing when Raid blocks become missing,
while the DataNode will be
responsible for block reconstruction.”
