The Enterprise Data Hub

Just a little under five years ago, we packaged up and shipped the very first release of Cloudera’s Distribution including Apache Hadoop, or CDH. We were very proud of it: It was the only commercially available, vendor-supported platform on the market. It could store more data, in greater variety, more affordably and reliably than anything that had come before. It could process and analyze that data in parallel and at scale using MapReduce. We bundled a handful of other components (notably Pig and Hive) for data manipulation.

Over the years, CDH has become steadily more capable and more powerful with the addition of new components: Sqoop and Avro for data integration, HBase for NoSQL record management, Hue to provide a graphical interface for end users, ZooKeeper for synchronization, Impala for real-time SQL support, the SolrCloud-based Search package for document indexing and search services, and more. Early feature gaps got filled in: performance sped up dramatically across the stack by way of solid ongoing engineering; reliability and true high availability were added; the platform was secured with encryption, authentication, and role- and user-based access controls (most recently with the release of the Sentry project); and so on.

The maturation of the platform has been relentless. It has, release by release, gotten more secure, more reliable and more real-time. All the original power and flexibility of the platform live on, of course. It handles petabytes of data affordably. It can do batch transformation and analysis of that data quickly. The roots are there, and strong, but it’s grown well beyond them.

As the platform has gotten more secure, more reliable, more powerful and (especially) more real-time, its role has changed. It’s no longer an ancillary system, off in the corner, used for big batch jobs. Instead, it has become the first place that data lands. It scales and it can store anything. It’s used to pre-process data before delivering it to an enterprise data warehouse, a document repository, an analytic engine, a CRM or ERP application, or other specialized system. Most significantly, because it can do real-time search and analysis on the data directly, in place, it has begun to take over some of the work previously done by those traditional platforms.

It’s emerged as the hub through which data flows to the rest of the enterprise. It’s a sophisticated engineered system on its own, of course, and can do useful work. As its capabilities continue to improve, enterprises will have the opportunity to continue to move workloads around their data centers, to the platform best able to handle them.

An enterprise data hub has to do these things:

Store any kind of data, in any volume, in full fidelity, for as long as you need it. Forever, if you like.

Offer a rich (and, over time, growing) set of tools for processing and analyzing data, in place. It must support popular ways to get at data like SQL, NoSQL and search, as well as a rich set of analytic engines that can serve particular business problems, like numerical analysis and machine learning.

Connect to the databases, data warehouses, document repositories and other systems that enterprises use to manage data. The connections must be bidirectional; data has to be able to flow both ways. It must connect, as well, to people. It must support the tools and applications that business users, analysts and administrators rely on to work with data. It should, over time, support a steadily richer set of tools and applications, such as better visualization on top of the better analytic engines in the hub. The data hub has to deliver steadily more value from your data. It has to get better over time.

You might look at that list and conclude that we defined the enterprise data hub to be exactly the product we built. That’s backward, though. We built exactly this product because we believed, from the very first days of Cloudera, that something like a data hub had to exist.

If you took away any of those properties, you could still have an interesting product, but it wouldn’t be an enterprise data hub. Limited storage? Couldn’t make it the center of your data strategy. Poor security? No one would put important data into it. Lack of processing and analytic engines? You might as well buy a filer. Poor connectivity? You wouldn’t be able to get the data to the place you needed it, on demand.

A platform that can do all of those things, though, is both transformative (you can do new things) and disruptive (you can do old things bigger, faster and in new ways). Of course, we don’t expect you to retire the existing infrastructure you rely on. Your data warehouse, your operational data store and your document management system all provide high-level interfaces and analytic services that you need. You can, however, think in new ways about what jobs you need to run in the current systems and what can move to the hub, where there’s more data and more processing power available.

Five years ago, when we started working with customers at Cloudera, we already believed that the enterprise data hub was necessary. That name has emerged as the market has grown, but the concept dates to our earliest days. It’s wonderful to see the broader community and the big data ecosystem innovate, and to see customers adopt the enterprise data hub at last.