MongoDB, Cloudera Form Big Data Partnership

MongoDB and Cloudera, leaders in the NoSQL and Hadoop markets, respectively, will co-market and co-sell their offerings. One goal: Ease customer confusion about big data.


MongoDB and Cloudera are the leading vendors in the NoSQL and Hadoop markets, respectively, but both firms figure they could be even more successful if would-be customers weren't so confused about big data.

That's the reasoning behind a deeper alliance between the two companies announced Tuesday. As part of the partnership, MongoDB and Cloudera say they will co-market and co-sell their software as complementary big-data technologies. In case you couldn't guess, MongoDB will be pitched as an operational database for high-scale applications, while Cloudera's Hadoop-based Enterprise Data Hub will be positioned as an analytical platform.

"I realized we needed to do something after I spoke at the Strata Conference last year on the topic of MongoDB and Hadoop working together," said Matt Asay, MongoDB's VP of marketing, business development, and corporate strategy, in a phone interview with InformationWeek. "Afterward I was blistered by people who said, 'I thought MongoDB and Hadoop were competitors.' "

You might think that anybody confused about the appropriate uses of NoSQL and Hadoop simply needs to do more research, but there are genuine gray areas between the two platforms, such as HBase, the NoSQL database that's part of Hadoop. HBase is suited to super-high-scale but rather simplistic use cases, while MongoDB supports much more complex data modeling, according to Yuri Bukhan, director of the ISV Alliances Program at Cloudera.

Bukhan cites online behavior analysis as common ground where HBase and MongoDB serve distinct roles. "If you're looking at simple user clicks or sessions, HBase offers very fast random reads and random writes if you want to look up users on a particular key, but MongoDB provides a much richer model through which you could track user behavior all the way through an online application."
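
The contrast Bukhan draws can be sketched in a few lines. This is purely illustrative; plain Python dicts stand in for both databases, and the sample keys and documents are hypothetical:

```python
# Illustrative contrast between the two access patterns Bukhan describes:
# a flat key lookup (HBase-style) versus a query over a richer, nested
# document (MongoDB-style). Plain Python stands in for both databases.

# HBase-style: flat rows looked up by a single composite key -- very fast
# random reads, but a simplistic model.
clicks_by_key = {
    "user123|2014-04-29T10:00": {"page": "/home"},
    "user123|2014-04-29T10:01": {"page": "/pricing"},
}
row = clicks_by_key["user123|2014-04-29T10:01"]  # direct random read

# MongoDB-style: one nested document models the whole user journey.
user_doc = {
    "_id": "user123",
    "sessions": [
        {"start": "2014-04-29T10:00",
         "events": [{"page": "/home"}, {"page": "/pricing"}]},
    ],
}
# Query within the document: which pages did this user visit, in order?
pages = [e["page"] for s in user_doc["sessions"] for e in s["events"]]
print(row["page"], pages)
```

The point of the sketch: the HBase-style store answers "give me row X" extremely quickly, while the document model lets you ask questions about a user's behavior across an entire session in one place.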

MongoDB and Cloudera already have bi-directional data connectors, but Asay and Bukhan said the two firms are preparing a deeper integration whereby the live, operational data in MongoDB can be snapshotted into Cloudera's data hub in parallel for analysis. This analysis can happen in near-real-time through the Spark framework or Impala and then be passed back to MongoDB to trigger the display of personalized content or a most-appropriate offer based on the analysis within Hadoop.
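
The snapshot-analyze-push-back loop the executives describe can be sketched as follows. Since the integration hadn't shipped at press time, this is a stand-in, not the actual connector: in-memory lists and dicts play the roles of MongoDB and the Hadoop-side store, and the click documents are invented for illustration:

```python
# Illustrative sketch of the snapshot-then-analyze loop described above.
# Plain Python objects stand in for MongoDB and the Hadoop-side store; the
# real integration would use the bi-directional connectors and a
# low-latency engine such as Impala.

from collections import Counter

# Operational data living in "MongoDB": one document per user click.
mongo_clicks = [
    {"user": "alice", "page": "/pricing"},
    {"user": "bob",   "page": "/docs"},
    {"user": "alice", "page": "/pricing"},
    {"user": "alice", "page": "/signup"},
]

def snapshot(collection):
    """Copy the live collection into the analytics store (Hadoop stand-in)."""
    return [dict(doc) for doc in collection]

def analyze(snapshot_docs):
    """Batch/near-real-time analysis: each user's most-visited page."""
    per_user = {}
    for doc in snapshot_docs:
        per_user.setdefault(doc["user"], Counter())[doc["page"]] += 1
    return {user: pages.most_common(1)[0][0] for user, pages in per_user.items()}

def push_back(results):
    """Write results back so the online app can serve a personalized offer."""
    return {user: {"recommended_offer": page} for user, page in results.items()}

offers = push_back(analyze(snapshot(mongo_clicks)))
print(offers["alice"])  # alice's top page drives her offer
```

The design point is the round trip: the operational store stays focused on serving the application, the analytics side does the heavy lifting on a snapshot, and only the small result set flows back.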

This integration, which is expected to be demoed at MongoDB World in New York in June, will run on YARN, the new resource management layer introduced with Hadoop 2.0. But there was no talk of running MongoDB and Cloudera on the same cluster of servers -- a leap ahead that would just confuse matters.

For now the MongoDB-Cloudera partnership is one of convenience, allowing two successful companies to paint a simple NoSQL-for-operational-database, Hadoop-for-analytics picture of the big-data market. Why Cloudera and not the entire Hadoop community?

"This is one of the fantastic things about open-source," says Asay. "This development is being done out in the open, and much of what we do will be available to all, so the other Hadoop vendors will be able to use it."

Another NoSQL vendor, like DataStax, might not paint quite as clean a demarcation between NoSQL and Hadoop roles. DataStax's software distribution, for example, includes both the Cassandra NoSQL database and Hadoop, and they both run on the same cluster. What's more, DataStax and other high-scale database vendors have been busy adding to and touting the analytic query capabilities of their databases.

MongoDB and Cloudera are already jointly selling their software in the field, according to Asay, and they've "aligned" their salesforces to offer consistent messaging on the best uses of their respective products. Things might get messier down the road, but having recently landed massive venture capital infusions, MongoDB and Cloudera apparently feel confident (and flush) enough to divide and conquer the big-data market.


Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise.

Tomer, real-time movement of data from MongoDB into Hadoop is exactly what these partners were talking about with the new, deeper integration described in the article above. They said it will take snapshots of the data in MongoDB and replicate it in Hadoop using parallel processing. Execs didn't specify whether the access method would be HBase, but they did say the analysis could be done through a low-latency tool such as Spark or Impala. We'll learn more in June, when this deeper MongoDB/Hadoop integration is set to be introduced in beta form.

HBase is an important component in the Hadoop stack. Many of our customers use both HBase and MongoDB in their organizations. In fact, HBase will serve a key role in providing real-time integration between MongoDB and Hadoop. There's a need to move beyond batch exports from MongoDB to Hadoop, and instead adopt a real-time, log-based replication approach (similar to GoldenGate or Informatica IDR in the relational world). There are two ways that can work:

Replicate from Mongo to static files in Hadoop. The data will be streamed from Mongo into Parquet files, with some 'schema discovery' that then populates the Hive metastore with the columns discovered in the Mongo collection. This will work for some use cases, but there are also some challenges. The main challenge is that a new form of compactions will need to be introduced, because updates and deletes in Mongo can't be applied to the Hadoop files directly. In addition, the data in Hadoop won't be available until the files are closed, and the schema in Hadoop/Hive will need to remain in sync, which could be a challenge over time.

Replicate from Mongo to HBase tables. The data will be streamed from Mongo into HBase tables, and the data can be queried directly from HBase. This approach is closer to real-time, and will be easier to manage. The HBase table will be a mirror of the Mongo collection at all times, with no need to do extra 'compactions' on the Hadoop side.
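
The second approach, log-based replication into a mutable target, can be sketched as follows. This is a minimal illustration, not real connector code: dicts stand in for both the HBase table and the oplog, and the event shapes are invented for the example:

```python
# Minimal sketch of log-based replication (the second approach above):
# oplog-style events from "Mongo" are applied to an "HBase" mirror keyed
# by row key. Dicts stand in for both systems; event shapes are invented.

hbase_mirror = {}  # row key -> column/value map

def apply_event(event):
    """Apply one replication event so the mirror tracks the source."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        # Upserts map naturally onto HBase puts -- no compaction needed,
        # which is why this path is easier to manage than static files.
        hbase_mirror.setdefault(key, {}).update(event["doc"])
    elif op == "delete":
        # Deletes apply directly too, unlike with immutable Parquet files.
        hbase_mirror.pop(key, None)

oplog = [
    {"op": "insert", "key": "user:1", "doc": {"name": "Ada", "plan": "free"}},
    {"op": "update", "key": "user:1", "doc": {"plan": "pro"}},
    {"op": "insert", "key": "user:2", "doc": {"name": "Bob"}},
    {"op": "delete", "key": "user:2"},
]
for event in oplog:
    apply_event(event)

print(hbase_mirror)  # the mirror reflects the source after every event
```

The contrast with the static-file approach falls out of the code: because the target is mutable, updates and deletes are applied in place as they arrive, so the mirror never needs a separate compaction pass.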

The other thought I had about this partnership is that Cloudera is NOT being very ambitious about the use of HBase -- perhaps in deference to MongoDB. Maybe MapR or another distributor might suggest that you can do more with data on a single cluster?

I'd say it speaks more to the duplicate functional tech needs that exist in both the operational database and data warehouse worlds. For operational databases like Cassandra (and legacy RDBMSes like SQL Server, etc.), there will always be the need to analyze and search that data in the context of the online apps they serve, which is why we enable that in our platform. The same needs for analysis and search also exist in the data warehouse/lake worlds that Hadoop is now playing in. However, the use cases and apps that an operational DB and a data warehouse/lake serve are still quite different, which is why the divide between the two still exists in the NoSQL market just as it does in the traditional RDBMS world. In other words, in the same way an RDBMS guru doesn't use Teradata for online/transactional apps, none of our customers use our platform as a Hadoop data warehouse system.

Just a quick clarifying note: DataStax is not a Hadoop vendor, but instead we focus on serving the database requirements of modern online applications - those that are always-on, distribute data around the globe, and need to scale without limits. These applications often have the need to run analytics and search operations on their online data, so we allow for that in our NoSQL platform by integrating analytics and search technologies that function across a distributed shared nothing architecture that can span multiple data centers and cloud availability zones. For more details on how this works, please see the following post: http://www.datastax.com/2013/06/why-hadoop-and-solr-in-datastax-enterprise.

Is it quite as clear as NoSQL-is-for-this and Hadoop-is-for-that? MongoDB and Cloudera are betting they can divide and conquer the market, but practitioners and competitors may have different ideas about the overlaps between the two platforms. During my call with MongoDB and Cloudera, for example, Bukhan said HBase is better suited to high-scale applications, but then Asay of MongoDB did some backpedaling, pointing out that MongoDB can handle petabyte-scale applications. That's one place where the lines might blur. DataStax (of Cassandra fame), Couchbase, and Basho (which backs Riak) might have different takes on the best uses of NoSQL vs. Hadoop. By all means, share your perspectives here.