Data Lake Showdown: Object Store or HDFS?

Alex Woodie

The explosion of data is causing people to rethink their long-term storage strategies. Most agree that distributed systems, one way or another, will be involved. But when it comes down to picking the distributed system, be it a file-based system like HDFS or an object store such as Amazon S3, the agreement ends and the debate begins.

The Hadoop Distributed File System (HDFS) has emerged as a top contender for building a data lake. The scalability, reliability, and cost-effectiveness of Hadoop make it a good place to land data before you know exactly what value it holds. Combine that with the ecosystem growing around Hadoop and the rich tapestry of analytic tools that are available, and it’s not hard to see why many organizations are looking at Hadoop as a long-term answer for their big data storage and processing needs.

At the other end of the spectrum are modern object storage systems, which can also scale out on commodity hardware and deliver storage costs measured in the cents-per-gigabyte range. Many large Web-scale companies, including Amazon, Google, and Facebook, use object stores to efficiently hold petabytes of unstructured data spread across trillions of objects.

But where do you use HDFS and where do you use object stores? In what situations will one approach be better than the other? We’ll try to break this down for you a little and show the benefits touted by both.

Why You Should Use Object-Based Storage

According to the folks at Storiant, a provider of object-based storage software, object stores are gaining ground among large companies in highly regulated industries that need greater assurances that no data will be lost.

“They’re looking at Hadoop to analyze the data, but they’re not looking at it as a way to store it long term,” says John Hogan, Storiant’s vice president of engineering and product management. “Hadoop is designed to pore through a large data set that you’ve spread out across a lot of compute. But it doesn’t have the reliability, compliance, and power attributes that make it appropriate to store it in the data lake for the long term.”

Object-based storage systems such as Storiant’s offer superior long-term data storage reliability compared to Hadoop for several reasons, Hogan says. For starters, they use a type of algorithm called erasure encoding that spreads the data out across any number of commodity disks. Object stores like Storiant’s also build spare drives into their architectures to handle unexpected drive failures, and rely on the erasure encoding to automatically rebuild the data volumes upon failure.
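To make the idea of rebuilding data from surviving pieces concrete, here is a minimal sketch of the simplest possible erasure code: split the data into shards, add one XOR parity shard, and recover any single lost shard from the rest. This is an illustration only, not Storiant's algorithm. Production object stores generally use stronger codes that tolerate several simultaneous failures, and the function names below are hypothetical.

```python
from functools import reduce
from typing import Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal-size shards and append one XOR parity shard."""
    shard_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(shard_len * k, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = reduce(xor_bytes, shards)
    return shards + [parity]


def rebuild(shards: list[Optional[bytes]]) -> list[bytes]:
    """Recover at most one missing shard (marked None) by XOR-ing the survivors."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one lost shard")
    repaired = list(shards)
    if missing:
        survivors = [s for s in shards if s is not None]
        repaired[missing[0]] = reduce(xor_bytes, survivors)
    return repaired


def decode(shards: list[bytes], original_len: int) -> bytes:
    """Drop the parity shard, join the data shards, and strip the padding."""
    return b"".join(shards[:-1])[:original_len]


if __name__ == "__main__":
    payload = b"data lake showdown"
    stored = encode(payload, k=4)   # 4 data shards + 1 parity shard
    stored[2] = None                # simulate a failed drive
    print(decode(rebuild(stored), len(payload)))
```

With four data shards and one parity shard, the overhead is 25 percent of the raw data, compared with 200 percent for three full copies. That trade-off is the usual argument for erasure codes over replication.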

If you use Hadoop’s default setting, everything is stored three times, which delivers five 9s of reliability, once the gold standard for enterprise computing. Hortonworks architect Arun Murthy, who helped develop Hadoop while at Yahoo, pointed out at the recent Hadoop Summit that storing everything only twice in HDFS takes one 9 off the reliability, giving you four 9s. That certainly sounds good.

But one of the problems with big data is the law of large numbers. As the amount of data you’re storing creeps up into the petabyte range, the chance of losing a single byte of data suddenly becomes significant.

“When you do the math, when you get to 1PB, the equation changes,” Hogan says. If you have a system with 1PB of data on it, and you’re running a system with five 9s of reliability, your chance of losing data in a year is 12 percent, he says. While the odds of any one file being lost are incredibly small, you don’t get to pick which file is going to be lost, and that worries big companies.
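The article does not spell out how that 12 percent is derived, but the general shape of the arithmetic can be sketched. Assume, purely for illustration, that the data is divided into independent units, each of which survives a year with a probability written as a string of 9s. The chance of losing at least one unit is then 1 - (1 - p)^n. The unit count used below is a hypothetical value chosen to land near the quoted figure, not something taken from Storiant.

```python
import math


def annual_loss_probability(nines: int, units: int) -> float:
    """Chance that at least one of `units` independent pieces is lost in a year,
    when each piece survives the year with probability 0.99...9 (`nines` nines)."""
    per_unit_loss = 10.0 ** -nines
    # 1 - (1 - p)^n, computed via log1p/expm1 so tiny values of p are not
    # rounded away by floating-point arithmetic.
    return -math.expm1(units * math.log1p(-per_unit_loss))


if __name__ == "__main__":
    assumed_units = 12_783  # hypothetical number of independent units
    for nines in (4, 5, 18):
        p = annual_loss_probability(nines, assumed_units)
        print(f"{nines} nines over {assumed_units} units: {p:.3g}")
```

Under these assumptions, roughly 12,800 units at five 9s each gives a yearly loss probability of about 12 percent. Whatever the real unit of failure is, the shape of the curve is the point: the more pieces you store, the more 9s each piece needs.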

Storiant recommends that customers store two copies of data, which translates to a mind-boggling 18 9s of reliability. “That’s the thing–as you start talking about large data, even if you have good reliability, you’re still going to lose data… You need a dozen or more 9s as it gets bigger, and some mechanism to know if you’re losing data.”

Bit loss rears its ugly little head when the data sets get really big. “You need to proactively manage that stuff isn’t silently disappearing even if you have three copies,” Hogan says. “Are those three copies coordinating with one another and restoring data that’s disappeared from one of the copies with bit loss? The answer in Hadoop is no, it’s not doing that. The bigger the data, the more important the reliability story and the more difficult the reliability story becomes. Unless you’re going to extreme measures to proactively fix that, then it’s just going to be gone and you won’t even know.”
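What copies "coordinating with one another" might look like can be shown with a small sketch: hash every replica, treat the majority digest as the truth, and overwrite any copy that disagrees. This is a hypothetical illustration rather than a description of how Storiant or Hadoop works. Real systems more often record a checksum at write time and verify against that instead of voting.

```python
import hashlib
from collections import Counter


def scrub(replicas: list[bytes]) -> tuple[list[bytes], list[int]]:
    """Compare replica digests, take the majority as correct, and repair outliers.

    Returns the repaired replica list and the indexes that had to be rewritten.
    """
    digests = [hashlib.sha256(r).hexdigest() for r in replicas]
    winner, votes = Counter(digests).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority; cannot tell which copy is intact")
    good_copy = replicas[digests.index(winner)]
    damaged = [i for i, d in enumerate(digests) if d != winner]
    return [good_copy] * len(replicas), damaged


if __name__ == "__main__":
    copies = [b"quarterly ledger", b"quarterly ledgeR", b"quarterly ledger"]
    repaired, damaged = scrub(copies)
    print("repaired replica indexes:", damaged)
```

Run periodically across every stored object, a scrub like this turns silent corruption into a detected and corrected event, which is the "mechanism to know if you’re losing data" that Hogan describes.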

Why You Should Use Hadoop for Your Next Data Lake

While the prospect of bit loss is enough to wake even the most hardened of CIOs up in the middle of the night, there are also some pretty good reasons to build your data lake on Hadoop. Consider these points that Soam Acharya, head of application architecture at Hadoop-as-a-service provider Altiscale, made in a recent blog post.

“Many people choose Object Stores because they are marketed as convenient, scalable, and cheap,” Acharya writes. “However, if you want to do more than just park your data, HDFS is the better choice.”

For starters, HDFS was specifically designed to support the high-bandwidth access patterns that big data workloads demand. If you want to actually do data science on the data—as opposed to just having it sit there as an archived copy—then you need to be able to get at it easily and manipulate it.

“Data science involves constant inspection and transformation of large data sets spanning multiple files,” he writes. “To this end, being able to manipulate directories as well as files is important. While the names of files in an Object Store can have slashes in them, Object Stores do not truly support directories.”
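The practical difference shows up when you try to rename a "directory." In a flat key namespace, a path like logs/2015/a.csv is just a string, so moving everything under logs/2015/ means rewriting every matching key one at a time. The sketch below models an object store as a plain Python dictionary to show that cost. The function name and keys are hypothetical, and no real object store API is being called.

```python
def rename_prefix(store: dict[str, bytes], old: str, new: str) -> int:
    """Emulate a directory rename in a flat key/value namespace.

    Every key that starts with `old` is re-inserted under `new` and the old
    key is removed. Returns the number of objects that had to be touched.
    """
    moved = 0
    # Snapshot the matching keys first so the dict is not mutated mid-iteration.
    for key in [k for k in store if k.startswith(old)]:
        store[new + key[len(old):]] = store.pop(key)
        moved += 1
    return moved


if __name__ == "__main__":
    bucket = {
        "logs/2015/a.csv": b"1",
        "logs/2015/b.csv": b"2",
        "raw/c.csv": b"3",
    }
    touched = rename_prefix(bucket, "logs/2015/", "logs/archive/")
    print(touched, sorted(bucket))
```

The work grows with the number of objects under the prefix. In a file system with real directories such as HDFS, the same rename is a single metadata operation regardless of how many files the directory holds.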

You just can’t get the kind of intelligent data storage you can get from HDFS from an object store, he says. “Object Stores are great for objects, e.g., photos, Word documents, and videos,” Acharya writes. “But for interactive, high-bandwidth, sophisticated analysis of very large data sets, HDFS can’t be beat.”

Where We Go From Here

There are pros and cons to both technologies. There appears to be real momentum among large firms, particularly those running private clouds, to use object stores as a long-term repository for massive, unstructured data that must be kept for compliance reasons. Using HDFS for this task would not seem to make much sense.

On the other hand, object stores can’t deliver the richness of functionality that HDFS offers. Modern object stores are typically accessed via a REST API, which ensures that the system will be open and the data accessible to a broad range of applications. But if you’re doing big data analytics and trying to iterate rapidly, the idea of extracting data via a Web service call sounds far-fetched.

One of Hadoop’s strengths is how it lets you bring the compute to the data, whereas object stores rely on fast networks to move lots of data to the compute. That architecture reflects traditional HPC as practiced at supercomputing sites, not modern, Web-scale systems.

There are several projects in the works that seek to combine the power of both approaches. One of these is Ozone, an object store designed to extend HDFS to support the concept of “bucket spaces.” Hortonworks launched Ozone last year and the project is now incubating. Storiant is also working with the Hadoop distributor to make its object store look like HDFS, thereby enabling users to work with the data as it sits in the object store.