When it comes to Hadoop distributions, enterprises care about a number of things. Among them are high performance, high availability, and API compatibility. MapR, a San Jose, Calif.-based start-up, is betting that enterprises are less concerned with whether the distribution is purely open source or if it includes proprietary components. That’s according to Jack Norris, MapR’s vice president of marketing. He said MapR is the market leader in all three of the top Hadoop priorities – performance, availability, and API compatibility – and it has the customers to prove it.

Currently MapR has between 40 and 50 paying customers using its enterprise M5 Hadoop distribution, which, as everyone knows by now, includes the proprietary NFS storage layer. They include comScore, the online market intelligence firm, which recently dumped Cloudera’s Hadoop distribution for M5. In addition, the company’s free community distribution, M3, has been downloaded thousands of times, according to Norris.

MapR’s performance and availability advantages over competing Hadoop distributions, Norris explained, are due in part to:

M5’s distributed namenode architecture, which removes the single point of failure that plagues HDFS;

Its ability to run the equivalent number of jobs on fewer nodes, which results in overall lower TCO.

Figure 1 - MapR Hadoop Stack. Source: MapR 2011

But it’s the open source issue where MapR takes a lot of heat. Norris argues that MapR’s approach – improving upon an open source core with proprietary value-add components and services – is a pretty “standard” model in the commercial open source world. While that is a common commercial open source business model, many would argue that the storage layer in a Hadoop distribution is the core, not an add-on.

Norris also said what’s important is not that a given Hadoop distribution is purely open source or not, but that it is 100% API compatible with the Apache distribution, which M5 is. This, he said, means that while developers can’t fiddle with NFS, they can easily integrate MapR’s distribution with HBase, HDFS, and other Apache Hadoop components, as well as move data in and out of NFS should they choose to tap a different Hadoop distribution. This last point is particularly important. It means, according to MapR, that there is no greater risk for vendor lock-in with its Hadoop distribution than with any other.
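To make the data-portability point concrete, here is a minimal sketch of what "moving data out" could look like if the cluster's storage layer is mounted on a client machine as an ordinary NFS path. That mount, and the function names `sha256_of` and `export_and_verify`, are assumptions for illustration; none of this is MapR's own tooling. The sketch copies one file out of the mount and checks that the copy is byte-identical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def export_and_verify(source: Path, destination: Path) -> bool:
    """Copy one file out of the (assumed) NFS mount and confirm the copy matches.

    `source` is a hypothetical path under the mounted cluster;
    `destination` is any path outside it.
    """
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    return sha256_of(source) == sha256_of(destination)
```

An evaluation team might run `export_and_verify` over a sample of files, with a source under the mount and a destination on local disk, and time the copies. That would be one rough proof-point for the claim that data can leave the cluster without special tooling.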

MapR’s focus on performance, availability, and API compatibility over open source code also comes through in its go-to-market strategy. MapR is not interested in educating the wider market about the benefits of Hadoop, as Cloudera and Hortonworks seem to be, according to Norris. Rather, MapR is targeting companies that are already using Hadoop or have made the decision to deploy Hadoop and are evaluating their distribution options. MapR also has a relationship with EMC to ship parts of its distribution with EMC Greenplum’s Hadoop offering.

Norris said MapR is targeting customers who already understand what Hadoop can do and want a highly available, enterprise-ready version that they can quickly deploy and easily integrate with other big data tools and technologies through open APIs. MapR’s target customers already did the experimenting with Cloudera or Apache, Norris explained, and are now ready to move Hadoop into production.

Fact Checking MapR’s Approach

Let’s consider MapR’s claims one-by-one.

API compatibility is more important than open source code. As Hadoop goes mainstream, traditional enterprise users will be more interested in deploying stable, high-performance, enterprise-ready big data stacks than in hacking the Hadoop core. In the meantime, however, big data application developers are adamant that they have access to the source code to integrate their wares seamlessly with Hadoop. In the long term, this claim is probably accurate, but while Hadoop continues its rapid development, open source code remains a critical element for many.

MapR provides better performance and availability than competing Hadoop distributions. It is certainly true that MapR’s distribution has demonstrated significant performance and speed improvements over “vanilla” Hadoop. That said, CIOs are increasingly less interested in “speeds and feeds” and more interested in how Hadoop can deliver real business value.

Enterprises are at no higher risk for vendor lock-in with MapR than with competing Hadoop distributions. It will prove reassuring to potential MapR customers that moving data out of M5, should they choose to move to a different distribution, is no more difficult than with any other distribution thanks to M5’s API compatibility. Still, like Cloudera’s enterprise Hadoop distribution, M5 costs money. How much money an enterprise sinks into an M5 deployment will determine the cost-effectiveness of moving to a competing distribution. So the risk of vendor lock-in with MapR is probably on par with that of Cloudera, but higher than that of Hortonworks' distribution or the straight Apache Hadoop distribution.

MapR’s strategy carries with it a number of risks. The biggest is that Apache Hadoop catches up to M5 in performance and availability before M5 gains widespread adoption, thus nullifying MapR’s entire value proposition. Indeed, Apache contributors recently introduced HDFS federation to tackle the single-point-of-failure issue “by adding support for multiple namenodes/namespaces to HDFS file system.”

Norris said that while MapR respects the competition, he doesn’t believe the Apache distribution is even close to reaching performance parity with M5. When it comes to the single-point-of-failure issue, for example, MapR’s distributed namenode is superior to namenode federation in that M5 “is self-healing, and no user intervention is needed at any point.” In any event, that is a judgment the community will make.

Another risk is that its message of performance/availability/compatibility over open source code never reaches CIOs, drowned out by the fervent open source Hadoop community as well as by marketing from competitors. Hortonworks, like most Benchmark-funded start-ups, is a marketing and PR machine, while Cloudera, with more than 100 paying customers, is double MapR’s size and is on the verge of becoming the de facto Hadoop distribution.

And don’t forget support services. Enterprises that deploy Hadoop want assurance that if there’s an issue with their cluster, the vendor is there ready and waiting to put out the fire with fast technical support and intervention.

The $10 billion question, then, is which of the three Hadoop distribution models will enterprises embrace. Cloudera differentiates its core open source Hadoop distribution with its proprietary management console, which the company updated just last week. Hortonworks is going to market as the only 100% open source commercial Apache Hadoop distribution and plans to make money on technical support services. MapR is betting enterprises serious about Hadoop will value its performance and availability advantages over open source code, with its API compatibility assuaging vendor lock-in concerns.

The race is on. For MapR to remain competitive, I believe it must take the following steps:

Develop deep and real partnerships with big data application vendors. Enterprises looking to capitalize on big data analytics are increasingly looking to application vendors that promise to deliver real business value from Hadoop. The more application vendors work closely with MapR, the more likely these vendors are to recommend MapR as the underlying Hadoop infrastructure.

Aggressively take its message of performance/availability/compatibility over open source code to enterprise CIOs and even CEOs, who are more interested in enterprise stability and performance than whether a technology is open source or not. If MapR can convince executives that its Hadoop distribution is more powerful, safe and cost-effective than competing distributions, it has a chance to slow Cloudera’s and Hortonworks’ momentum and give itself a fighting chance to win the market.

Action Item: Enterprises evaluating MapR’s Hadoop distribution should demand proof-points/customer references from the vendor that include illustrations of its open API claims, including the ability to easily move data into and out of its cluster. Enterprises looking to navigate the larger Hadoop distribution market should focus on which of the competing Hadoop approaches – Cloudera, Hortonworks, or MapR – brings the greatest business value with the lowest cost and least risk. As we’ve written before, for some enterprises the value of fast business impact on revenue or profit offered by MapR will outweigh the risks of higher capex and the inability to customize the code. For enterprises just beginning to learn about Big Data and the benefits of Hadoop, it may make more sense to adopt Cloudera’s or Hortonworks’ more open approach, betting that performance improvements the community will develop over time and the flexibility offered by an open source distribution will prove more valuable in the long term. Whatever option enterprises choose, they should stay up to speed with developments in the Hadoop community, as both open and proprietary improvements that can deliver real business value are being made to the technology at a fast clip.

Footnotes:

I also think that you missed some critical capabilities. Big data systems can lose data in a variety of ways. These ways include human error, application error (which is just human error by proxy), disk or node failure and data-center failure. The competitive versions of Hadoop only address the node and disk failure scenarios while human error in both forms is the dominant data loss risk in large clusters.

MapR addresses all of these data loss risks. Many of our customers have stated that they couldn't care less about MapR's much lower TCO but instead consider it the only viable Hadoop distribution due to superior data integrity features.