An IT industry insider's perspective on information, technology and customer challenges.

January 31, 2012

HDFS: Coming Soon To An Array Near You

As enterprises around the globe are beginning to exploit the value in their unstructured data, Hadoop is starting to find its way into more enterprise settings.

Enterprise IT buyers are justifiably a demanding lot, which creates plenty of opportunity for IT vendors such as EMC to create products that meet their newer needs.

Such is the case with the new native HDFS (Hadoop Distributed File System) capabilities from EMC's Isilon group.

As you'll see, it makes exploiting the power of Hadoop and large unstructured data sets far more efficient and powerful.

And before too long, I suspect other storage vendors will scramble to offer something very similar :)

What's This All About?

Big data processing and data science are quickly finding their way into more mainstream IT settings. Like the prospectors of yesteryear searching for minerals buried in the ground, this new crowd is searching for "digital wealth" -- a continual stream of fresh insights to power their businesses.

Loosely speaking, large data sets come in two primary flavors: structured and unstructured. And Hadoop has quickly emerged as the toolset of choice for exploiting unstructured and semi-structured data.

Hadoop is actually a rather large collection of tools, but one key storage-related component -- the Hadoop Distributed File System -- is responsible for managing very large datasets and delivering them at very high performance.

"Classic" HDFS has many familiar limitations, providing ample opportunity for EMC and others to innovate. Such is the case with the new native HDFS feature just announced by Isilon.

Traditional HDFS

In a traditional Hadoop implementation, HDFS assumes quintessentially "dumb" storage. Data is typically copied three times (twice in the rack, once in a separate rack). Many enterprises also keep a "safety copy" of their datasets on more traditional storage, so you're usually talking a 4x multiplier on basic capacity.
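To put rough numbers on that multiplier, here's a back-of-the-envelope sketch in Python. The function name and the 500 TB dataset are made up for illustration; the only inputs taken from the discussion above are the three HDFS copies and the one optional "safety copy".

```python
def raw_capacity_tb(dataset_tb, hdfs_replicas=3, safety_copies=1):
    """Raw capacity consumed by one dataset, in TB.

    hdfs_replicas: copies HDFS keeps (three, as described above).
    safety_copies: extra copies kept on traditional storage (assumed to be one).
    """
    return dataset_tb * (hdfs_replicas + safety_copies)


# Hypothetical 500 TB dataset, for illustration only.
print(raw_capacity_tb(500))                   # 2000: HDFS replicas plus a safety copy
print(raw_capacity_tb(500, safety_copies=0))  # 1500: HDFS replicas alone
```

In other words, every terabyte of actual data can end up consuming three or four terabytes of purchased capacity before any analysis happens.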

HDFS metadata is managed by a "namenode". It's a well-known single point of failure. Although a second namenode can be configured, it's really just a metadata logger that assists with recovery, not the failover we'd all prefer to see. When the namenode is down (or recovering), work comes to a screeching halt. In larger environments, a namenode failure can take down the entire environment for quite a while.

Data has to be moved into -- and out of -- HDFS environments.

Typically, data is captured using traditional NFS, moved to HDFS where it is analyzed, and -- frequently -- then made available to analysts using Windows/CIFS. Moving a few gigabytes around is no big deal; moving hundreds of terabytes or petabytes can be a real sore spot, especially if you're doing it continually.

And, of course, you're not getting any useful work done while you're waiting for a 36 hour copy job to complete :)
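A quick way to see how copy jobs get that long is to divide dataset size by sustained throughput. The sketch below is only arithmetic; the 100 TB and 800 MB/s inputs are hypothetical and not measurements from any real environment.

```python
def copy_hours(dataset_tb, throughput_mb_per_s):
    """Hours needed to move dataset_tb terabytes at a sustained rate in MB/s.

    Uses decimal units (1 TB = 1,000,000 MB) and ignores protocol overhead,
    small-file penalties and retries, so real jobs will likely run longer.
    """
    seconds = dataset_tb * 1_000_000 / throughput_mb_per_s
    return seconds / 3600


# Hypothetical: 100 TB at a sustained 800 MB/s.
print(round(copy_hours(100, 800), 1))  # 34.7
```

Double the dataset or halve the throughput and the job stretches to nearly three days, and that's for a single pass in one direction.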

There's nothing approaching modern data protection in most Hadoop environments: no snaps, no remote replication, etc. That might be tolerable if you can afford a very lengthy data outage; less tolerable if you're counting on the system to deliver useful results day-in and day-out.

Going further, there are no real tiering or archiving capabilities in HDFS, so users have to roll their own. And -- in the subtle-but-important category -- there's no simple way to independently vary compute and storage performance or capacity. Everything is typically configured in identical storage/compute nodes.

Choose wisely :)

There's more, but you get the idea. When you consider Hadoop -- and HDFS -- in enterprise settings, there's plenty of opportunity to do things in a vastly better way.

The Isilon Solution

If you're not familiar with Isilon, maybe you should be. It's the leading and fastest-growing scale-out NAS product in the market, finding plenty of homes in both industry-specific and (more recently) enterprise settings. It's architecturally different than traditional NAS products from EMC (VNX) as well as NetApp and others.

Simply put, an Isilon cluster can now expose any file over NFS, CIFS -- and now HDFS. The same files can be accessed using different protocols at different stages in the workflow.

No copying. That's big.
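To make the "same file, different protocol" idea concrete, here's a small Python sketch that prints three addresses for one stored file. Every name in it is a placeholder I've assumed for illustration: the cluster hostname, the NFS mount point, the share name and the HDFS port are not product defaults, and the sketch also assumes the NFS export, the CIFS share and the HDFS root all point at the same directory.

```python
# All values below are hypothetical placeholders, not Isilon defaults.
CLUSTER = "isilon.example.com"   # assumed cluster hostname
NFS_MOUNT = "/mnt/isilon"        # assumed client-side NFS mount point
SMB_SHARE = "analytics"          # assumed CIFS share name
HDFS_PORT = 8020                 # assumed HDFS port


def addresses(relative_path):
    """Return three ways a client might address the same stored file."""
    parts = [p for p in relative_path.split("/") if p]
    return {
        "nfs": NFS_MOUNT + "/" + "/".join(parts),
        "cifs": "\\\\" + CLUSTER + "\\" + SMB_SHARE + "\\" + "\\".join(parts),
        "hdfs": "hdfs://" + CLUSTER + ":" + str(HDFS_PORT) + "/" + "/".join(parts),
    }


for protocol, address in addresses("logs/2012/clicks.log").items():
    print(protocol, address)
```

The point is that the three strings name one file in one place: intake writes it over NFS, the Hadoop job reads it over HDFS, and the analyst opens it over CIFS.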

All the different Isilon protection flavors are available. No more 4x copies -- unless that's what you *really* want. Less money on storage means you can field far larger storage farms with the same budget.

Any Isilon node can function as a namenode on behalf of the cluster. That means that failover semantics are pretty much as you'd expect -- better availability with far less hassle.

And, of course, all the Isilon local and remote replication capabilities are inherited. Real, enterprise-class data protection if you need it. The popular Isilon performance and capacity management tools "just work". And so on.

Because storage is separate from compute, administrators can "tune" storage capacity and performance separately from processing performance. Less waste of one or the other -- depending.

You're the one in charge of design, integration, support, maintenance, capacity, performance, etc.

The classic "one man band" :)

You now have a new, attractive approach you can consider -- creating a large, scale-out, self-managing and self-optimizing pool of "file capacity" that's transparently shared between intake (NFS), processing (HDFS) and analyzing (CIFS).

Even the hard core Hadoop shops have looked at what we're doing, and often cast a wistful eye -- if only we had made this available before they invested in all that gear.

We do take trade-ins, folks :)

The Greenplum Connection

Elsewhere at EMC, our Greenplum division is tearing it up with advanced capabilities to support the next generation of data scientists.

As part of the new Greenplum UAP (unified analytics platform), Greenplum HD offers an enhanced enterprise distribution of Apache Hadoop.

That means that the new Isilon capability was developed with real-world knowledge of how proficient users actually use Hadoop. It means that EMC can offer enterprise-class support for both storage and software.

And for customers who prefer a one-throat-to-choke approach, the Greenplum data computing appliance offers a complete, turnkey solution based on the very latest technologies.

I've found that when executives get the data science bug, they want to move fast. And we're prepared to help them do just that.

Do You Want This On Your Next Storage Array?

Maybe you're reading this and thinking, "gee, that's cool, but we're not using Hadoop yet".

Given that most storage purchases sit on the floor for three years or more, I'd encourage you to look out a bit further.

One part I like about this new capability is that it basically "future proofs" your investment in file system capacity -- should you wake up one morning and find yourself in a meeting to discuss a new Hadoop project :)

From a pure storage administration perspective, Hadoop (and HDFS) is no big deal if you're using an Isilon array. It's just another access mechanism to the exact same data.

Nothing really changes in the environment. Nothing new to buy. Nothing new to do. Current Isilon customers get native HDFS support at no charge as part of OneFS 6.5.

It just works.

Industry Implications?

If you follow the storage industry like I do, there's been a lot of bantering back-and-forth over the years around what "unified storage" might really mean.

Regardless of the past, the future is now becoming clearer. It's quickly becoming a big data world that inherently favors purpose-built storage architectures that scale both performance and capacity with ease.

And, perhaps, it's also quickly becoming about the storage protocols and certified software stacks needed in this new big data world.

Comments

Chuck - Great Post and WOW - the possibilities of HDFS with Isilon are a-(wait for it)-mazing! I will be interested to follow this as it matures and see what solutions we can provide based on HDFS/Isilon combo.
Big Data is real, with great financial benefits if you can intelligently sort through it all (cough, Greenplum!); it's not just a buzzword. I hear and have these conversations every day with our EMC customers.
Great slides and another great read!
S~

Very interesting indeed, the ability to have data readily available across the "big three" (NFS, CIFS, HDFS) is compelling for sure. Especially if the administrative overhead is minimized! I will be bothering my local EMC folks about this.

Yet again, a vendor is bastardizing a clean model so it can shove its existing product into it (and sell more), effectively compromising the architecture, performance and continued lifecycle.