Is Hadoop a New Storage Paradigm?

John Webster
Contributor. Industry analyst focused on cloud, storage technologies and AI. Opinions expressed by Forbes Contributors are their own.

According to historians of economic theory, there are two ways to define creative destruction. One is negative—that capitalist systems unleash forces that lead to their own demise. The other is positive—that those forces result in better products and services that replace the ones that were destroyed. It’s the second one we've latched on to. Creative destruction is a good thing. For the computing industry, I can think of no better example of creative destruction than the open source movement.


I believe that the storage industry is ripe for at least one creatively destructive event. Not surprisingly, it comes from the open source community. I’m talking about the emergence and current proliferation of Apache Hadoop. Modern storage systems, despite their many advancements—including the adoption of solid state—are still tethered to the past. They’re proprietary and expensive, particularly at the petabyte and even exabyte scale requirements we can already see on the horizon. (EMC recently surmised it would soon ship an exabyte of storage to a single customer, albeit in the form of a train-load of boxes.)

While Hadoop is commonly seen as a big data analytics system, I believe it can also be seen as a storage platform. Here’s why:

Vendors now commonly speak of Hadoop for the enterprise as the “big data lake.” This is particularly true of vendors with a traditional database and/or data warehouse axe to grind. It’s a common repository for all enterprise data, structured and unstructured. These same vendors are more than happy to show prospective users how Hadoop can pre-stage data before it’s fed into an existing data warehouse, as well as serve as an active data archive after the data warehousing process. In the big data lake scenario, Hadoop is very scalable and intelligent storage for the big data version of the data warehouse.

It’s built on a file system (the Hadoop Distributed File System, or HDFS), as are many modern storage platforms (EMC Isilon OneFS and Oracle’s Sun ZFS are examples), and one that scales to petabytes at relatively low cost—certainly much lower than the NAS, SAN, and even scale-out and object store alternatives.

It emerged from the Internet data center environment as storage (HDFS) with a programmable analytics engine (MapReduce) built on top. As a result, there is a tendency to see Hadoop as a place to do data analytics in ways that the traditional data warehouse can’t, and at the scale and low cost that Hadoop offers. However, one of the problems the enterprise has with putting Hadoop into production business intelligence (BI) environments is a lack of business user-friendly, Hadoop-based applications that can leverage the power of Hadoop. Vendors have seen this and are building those applications. Once they proliferate, Hadoop will be seen by users, developers, and enterprise IT administrators as a storage platform with unique, innate intelligence that is perfectly suited to these applications. (More on this subject after I attend the Big Data Innovation Summit in Boston.)
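The storage-plus-analytics split described above can be sketched with ordinary Unix tools. Hadoop Streaming runs this same map-sort-reduce pattern across a cluster, with each stage running as a separate process on the nodes that hold the data; the pipeline below is only a single-machine illustration (the input text is made up):

```shell
#!/bin/sh
# A local sketch of the word-count pattern that MapReduce runs at
# cluster scale (e.g. via Hadoop Streaming). The stages map onto
# ordinary Unix tools:
#   map:     tr emits one word per line
#   shuffle: sort groups identical keys together
#   reduce:  uniq -c aggregates a count per key
printf 'data lake data warehouse data\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c \
  | sort -rn
```

The pipeline prints a count per word, most frequent first. The point of Hadoop’s design is that the map and reduce stages move to the machines where the HDFS blocks already live, rather than moving petabytes of data to the computation.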

The storage industry is answering the demand for large-scale, unstructured data storage with scale-out, object-based storage systems. However, these systems need a storage-system-internal search process to quickly find the files an application wants at any given point in time. Hadoop is already there, and then some. It can not only quickly find specific files among billions; it can also offer overlying applications more intelligence that sits closer to the data.

In fact, the idea of applying storage-resident intelligence to data in ways that relieve the processing load on a server running an application was established decades ago. We know this intelligence today as storage-based services that include remote replication, snapshot copy, thin provisioning, data deduplication, and storage tiering to name a few.
As a highly scalable, low-cost storage platform, the Apache Software Foundation’s Hadoop is starting to acquire the storage services common to modern storage systems. Version 2 adds file system-level snapshot copy, for example. Other Hadoop distributions add more: WANdisco’s NonStop NameNode adds active-active Hadoop NameNode failover, snapshot copy, and remote data replication. Distributions of Hadoop (there are currently six) that blend many storage services common to modern enterprise storage with the advancing analytics intelligence of Hadoop are either here now or are coming.
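The snapshot support added in Hadoop 2 is exposed through the ordinary HDFS command line. A minimal sketch, assuming a running Hadoop 2.x cluster; the /data/warehouse directory and the snapshot name are hypothetical:

```shell
# Mark the directory as snapshottable (an administrator operation).
hdfs dfsadmin -allowSnapshot /data/warehouse

# Take a read-only, point-in-time snapshot named "pre-load".
hdfs dfs -createSnapshot /data/warehouse pre-load

# Snapshots appear under the hidden .snapshot path of the directory.
hdfs dfs -ls /data/warehouse/.snapshot

# Drop the snapshot when it is no longer needed.
hdfs dfs -deleteSnapshot /data/warehouse pre-load
```

This is exactly the kind of storage service—a consistent copy taken without interrupting readers and writers—that enterprise arrays have offered for years, now surfacing at the file-system level in Hadoop itself.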

Hadoop, iconographed as a child’s toy elephant, is poised to disrupt the status quo within the storage industry—a place that has remained surprisingly immune to the creatively destructive forces of open source. I noted in an earlier blog post that EMC’s new Pivotal Labs division sees this coming when it says “We’re all in on HDFS.” Storage vendors ignore the power of Hadoop as an advanced storage platform at their peril. Hadoop challenges us to throw away our traditional notions of what storage is—and that’s totally in keeping with creative destruction.