
Hadoop’s civil war: Does it matter who contributes most?

If you were going to buy a service contract for your open-source software, would you prefer your service provider actually be the certifiable authority on that very software? If “yes,” then you understand why Cloudera and Hortonworks have been playing a game of one-upmanship over the past few weeks in an attempt to prove whose contributions to the Apache Hadoop project matter most. However, while reputation matters to both companies, it might not matter as much as fending off encroachments to their common turf.

A few weeks ago, Hortonworks, the Hadoop startup that spun out of Yahoo (s yhoo) in June, published a blog post highlighting Yahoo’s — and, by proxy, Hortonworks’ — impressive contributions to the Hadoop code. Early this week, Cloudera CEO Mike Olson countered with gusto, laying out a strong case for why Cloudera’s contributions are just as meaningful, maybe more so. Yesterday, it was Hortonworks CEO Eric Baldeschwieler firing back with even more evidence showing that, nope, Yahoo/Hortonworks is actually the best contributor. The heated textual exchange is just the latest salvo in the always somewhat-acrimonious relationship between Yahoo and Cloudera, but now that Team Yahoo is in Hadoop to make money, he who claims the most expertise might also claim the most revenue.

From Olson's post. From Baldeschwieler's post.

Hortonworks is betting its entire existence on it. With the company likely not offering its own distribution, Hortonworks will rely almost exclusively on its ability to support the Apache Hadoop code (and perhaps some forthcoming management software) for bringing in customers. This is a risky move.

To make a Linux analogy, Hortonworks is playing the role of a company focused on supporting the official Linux kernel, while Cloudera is left playing the role of Red Hat (s rht), selling and supporting its own open-source, but enterprise-grade, distribution. Maybe Hortonworks should try to be Hadoop’s version of Novell. Whatever you think about the companies’ respective business models, though, it’s clear why reputation matters.

However, I’ve been told by a couple of people deeply involved in the big data world that perhaps Hortonworks and Cloudera would be better served if they spent their energies worrying about a common enemy by the name of MapR. MapR is the Hadoop startup that has replaced the Hadoop Distributed File System with its own file system that it claims far outperforms HDFS and is much more reliable, and that already has a major OEM partner in EMC (s emc).

Ryan Rawson, director of engineering at Drawn to Scale and an architect working on HBase, told me he’s very impressed with MapR and that it could prove very disruptive in a Hadoop space that has thus far been dominated by Cloudera and core Apache. “The MapR guys definitely have a better architecture [than HDFS],” he said, with significant performance increases to match.

Rawson’s rationale for finding such promise in MapR is hard to argue with. As he noted, “garage hobbyists” aren’t building out large Hadoop clusters, but rather, real companies doing real business. If MapR’s file system outperforms HDFS by 3x, that might mean one-third the hardware investment and fewer management hassles. These things matter, he said, and everyone knows that there’s no such thing as a free lunch: Even if they give away the software, Cloudera and Hortonworks still sell products in the form of services.
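To make the "3x faster, one-third the hardware" reasoning concrete, here is a back-of-the-envelope sketch. It assumes the cluster is limited by file system throughput and that capacity scales linearly with node count, neither of which is guaranteed in practice. The workload and per-node figures are hypothetical and chosen only to keep the arithmetic readable.

```python
import math


def nodes_needed(workload_units: float, per_node_throughput: float) -> int:
    """Smallest whole number of nodes that can serve the workload."""
    return math.ceil(workload_units / per_node_throughput)


# Hypothetical figures for illustration only; none come from a benchmark.
workload = 300.0         # arbitrary units of I/O work per hour
baseline_per_node = 1.0  # what one node handles on the baseline file system
speedup = 3.0            # the "3x" figure discussed above

baseline_nodes = nodes_needed(workload, baseline_per_node)
faster_nodes = nodes_needed(workload, baseline_per_node * speedup)

print(baseline_nodes, faster_nodes)
```

If the cluster is sized for storage capacity rather than throughput, the node count would not shrink this way, which is one reason the savings are a "might" rather than a given.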

It’s not just MapR that’s trying to get a piece of Apache Hadoop’s big data market share, either. As I explained earlier this week, there are and will continue to be alternative big data platforms that might start looking more appealing to customers if Hadoop fails to meet their expectations.

The Apache Hadoop community, led for the most part by Hortonworks and Cloudera, has some major improvements in the works that will help it address many of its criticisms, but they’re not here yet. Does it matter which company drives the code and patches for those improvements? Yes, it does. But maybe not as much as burying the hatchet and making sure the Apache Hadoop they both rely on remains worth using.

20 Responses to “Hadoop’s civil war: Does it matter who contributes most?”

Nobody is going to argue that Hadoop Job scheduling can’t be improved, and I’m not going to criticise Platform Computing’s schedulers. They have a good track record for scheduling classic “grid style” applications where CPU and memory capacity were considered more important than closeness to data. It’s a shame that all the grid engines from Condor onwards were built on the assumption that moving data around wasn’t hard. While you can do it with a good SAN, that itself is a SPOF and the $/PB puts a financial limit on your maximum storage capacity sooner than either HDFS or MapR FS will (exception: when you are paying Oracle for your Hadoop cluster).

The roadmap for the MRv2 resource manager is looking like early 2012 for a release; people like myself are running it locally. You are free to say that MRv2 isn’t production ready -yet- but that’s not a sustainable criticism. You are going to have to get enough market share that end users can say whether or not it is better -and whether or not it is worth the premium over what is nearly free. You are also going to have to move up from simple Job Submission API compatibility to providing an implementation of the YARN APIs, because layers above the MRv2 engine are being built against it. Rather than spread doubt about the 0.23 scheduler, it may have been better to actually join in the development of those APIs so that you can be confident that they will work well with your technologies. Just saying negative things about future releases of Apache Hadoop may help in the short term but it is not going to help raise issues you have with those future APIs with us “amateur and egocentric” Hadoop developers. Sorry.

SteveL “amateur and egocentric” committer on the Apache Hadoop project.

A couple notes on MapR and official Hadoop …
Faster I/O of course matters, but I think the strongest point in MapR’s proposition is the enterprise features that official Hadoop still lacks: HA, including NFS HA, and DFS snapshots. This definitely could be a game changer, especially taking into account how long it usually takes to add new features to a Hadoop release (it takes years). The major problem of Apache Hadoop projects is that there are too many developers involved and too little code/features produced. MapR easily proves that a small dedicated team of professionals can easily outperform large and bloated OS community infested by amateur coders and egocentric leaders.

“MapR easily proves that a small dedicated team of professionals can easily outperform large and bloated OS community infested by amateur coders and egocentric leaders.”

I’m sorry, are you implying that we are amateur coders? I like that: funny. Quaint even. Especially as both MapR and Hadoop effectively say “use Linux underneath”, and not, say, Windows. Where do I fit in? Amateur or egocentric? Both? Where does the Facebook dev team fit in?

An in house team can be more agile, as they don’t have the extended community decision making process that creates inertia. They can also write good products. In-house teams can also produce stuff whose insides would scare you if you could see it. (I am thinking of NTFS and HFS+ here). It’s a lot easier for closed source projects to point to public issues w/ open source apps (the JIRAs are there, after all), than it is for the other teams to do the same.

What is hard for any DFS to do is stabilise at scale. It takes time, you have to discover the quirks of the real world, and learn those lessons. Until EMC announced they were setting up a 1000-node cluster, I didn’t believe any of the claims of scalability from MapR as they were impossible to substantiate. HDFS: I know its limits; everyone knows its limits. It’s a shame it has them, but we know where they are.

That said, Vladimir’s point about NFS and the like is valid. Performance is a tricky one as it’s often driven by disk bandwidth, and that is the same regardless of how you access it. We’d love someone to write a better NFS bridge to HDFS.

I think it’s more about the focus of development as opposed to the quality of the developers. The Hadoop distribution is growing in its various components and no one company is providing a robust solution for all of the components of the distribution. Enterprises would be well advised to evaluate best of breed components. While MapR is definitely improving the file system performance, the Job Tracker issues continue to be inherited from the weak scheduling and job management architecture in Hadoop. Platform Computing is already providing the replacement JobTracker capability with its Platform MapReduce implementation, compatible with all Hadoop Applications, but providing for separation of the JobTracker and Resource Manager – allowing 300+ JobTrackers in a production environment providing high availability, better management and visibility into running jobs, better SLA and cluster utilization. While the Open Source community is talking about a release to fix the issues (NextGen MapReduce) – a 1.0 release of this functionality from the community will not provide the maturity and sophistication enterprises need to be able to put this in production anytime soon. Check out our webinar on this topic to learn how you can get this functionality today. http://info.platform.com/EA.FY12.Q2.WebinarSharedServicesMapReduce_LongReg.html.

MapR’s Hadoop distribution is the only one that provides any business continuity (HA, data protection, disaster recovery). The distributed HA architecture provides no single points of failure and completely distributed metadata. Snapshots provide point-in-time recovery, so users can recover from user and application errors. Mirroring provides asynchronous cross-datacenter disaster recovery and remote backup.

The MapR distribution is 2-5x faster than any other distribution today. This is both on standard benchmarks (terasort, DFSIO and YCSB benchmarks) and in customer deployments. While some performance improvements will be introduced in stock Hadoop in 2012, the performance improvements that will be released by MapR in 2012 are dramatic and will extend the performance gap beyond today’s 2-5x.

Here at MapR, our sole focus is on developing innovative technology to provide real value to the community of Hadoop users. With the massive adoption of MapR in the last few months, I think it’s safe to say that this approach has proven itself.

As a long time user of Hadoop and HDFS with terabytes of data and having looked at CassandraFS and MapR as HDFS replacements, I’d have to say that it makes me uncomfortable putting huge amounts of data in a closed source system offered by a relatively small company. In that respect, I feel more confident with Apache Hadoop & HDFS knowing that there will always be community support.
So I don’t think MapR is a good comparison to Apache HDFS.
Greenplum + EMC + MapR is definitely something to watch closely.

steve – a combo of lack of append, and known bugs. Eg: the bug where the snn will upload a corrupt image to the nn. Mostly fixed, but from what I understand there is no actual protection against the snn uploading a vastly corrupt image. The patch fixed the underlying snn bug, but didn’t add protection against the general issue coming up again.

But the key point here is that HDFS has lost data, even at yahoo. Claiming it’s bulletproof is not really winning hearts and minds of people who have suffered at the hands of HDFS.

The “High Availability for HDFS”, when it is alpha’ed in mid-2012 or so, will still require that the end-user deploy a pair of Netapp filers in HA mode (about $300K with all the appropriate licenses included).

The HDFS NameNode will still suffer from the Java garbage-collection-of-death, causing periodic cluster-wide outages in spite of spending the $300k.

There is no comparison: with MapR, for that price, you can buy a second cluster of approx. 50-60 new machines which are quite beefy. MapR’s HA has been in production for almost a year now, while HDFS HA is all brand-new code, and will require a year to stabilize.

Srivas – maybe you misread, or maybe it’s conveniently ignored… IAC, HA for HDFS is coming in hadoop-0.23.1 later this year, not mid-2012. hadoop-0.23.0 is coming in days/weeks. Cloudera has also talked about releasing CDH4 beta with HA this year. I’m sure you’ve seen this on Apache Hadoop mailing lists, given that you are fairly active there. ;-)

Regarding its stability – Apache Hadoop has been in production for nearly 4 years now at Yahoo, Facebook etc. with hundreds of petabytes of data on tens of thousands of machines. By the same token MapR will need another 3 years to attain stability of HDFS. Or, by the same metric, it will require more than a year for the non-existent MapR secureFS to be developed and to attain stability of secure HDFS which has been in production for over a year now! *smile*

As SteveL said to another vendor – it’s unfortunate proprietary vendors talk ill of Apache Hadoop to sell their own, it’s just something we in the OSS live with; especially with the information asymmetry – we don’t have access to the information in your bug-tracking systems.

Ah, I think Derrick has overstated my position. I am merely one of the architects of HBase, not the chief architect, and I didn’t claim so during our conversation.

As for the rest, I am excited that the various Hadoop companies are really meeting the challenge head on and looking at ways of improving HDFS and Hadoop – all things I have been saying for over 2 years now. I’ve never been in a position to hire and run a HDFS team, so I have been limited to asking others to chip in and make a better HDFS for us all. And now it is happening.

“Ryan Rawson, director of engineering at Drawn to Scale and chief architect for HBase”

Ryan has been a valued engineer in the HBase community for some time, but is neither the “chief architect” of HBase (no such person exists), nor director of engineering at Drawn to Scale (unless this is a very recent change — he joined cx.com about 7 months ago).

Those interested in how Cloudera, Hortonworks, Yahoo, and others are continuing to work together are encouraged to follow the open source work going on on the ASF JIRA. Particularly with regards to performance as mentioned above, please refer to https://issues.apache.org/jira/browse/HADOOP-7714 for example. This patch was developed at Cloudera, experimented with at Yahoo, and I believe Hortonworks will help with some additional benchmarking and code review. Open source cooperation at its best.