2013-03-10

Enterprise Hadoop: yes, but how are you going to fix it?

EMC's Pivotal HD has started a lot of debate as to whether building on top of Hadoop can be considered being part of the Hadoop ecosystem or whether it's an attempt to co-opt it: to do something and claim that it is part of a bigger system.

Can you say you are "part of the Hadoop stack" when all you are
doing is shipping a closed source layer on top? I think that's quite nuanced, and
depends on what you do -and how it's interpreted.

The Apache License grants everyone the freedom to take the source away and do anything they want with it.

There is no requirement for you to contribute a single line of code back, or even to file a single bug report.

This is a difference between the ASF license and GPL-licensed software which you redistribute: with GPL code the changes must (somehow) be published.

Other aspects of the ASF license:

You can't abuse ASF brand names, which in the Apache Hadoop world means you can't use Apache Hadoop, Apache HBase, Apache Mahout, Giraph, Apache Pig, Apache Hive, etc in your product names. There are some excellent guidelines on this in the wiki page Defining Hadoop -and if you want actual feedback, email the trademarks@ list. It may seem that doing so removes the secrecy/surprise factor of your product announcement, but it's better that than a hurried renaming of all your products and documentation.

If you sue other users of the product over patents of yours that you believe apply to the technology -you lose the patent licenses that the Apache License granted you for that software. I haven't known that to happen with Apache products -though the Oracle/Google lawsuit did cover copyright of APIs and reimplementations thereof. If APIs ever become copyrightable, then decades of progress in the computing industry will grind to a halt.

People are also free to look at Apache APIs and clean-room re-implement them; you just can't use the Apache product names at that point. Asserting compatibility becomes indefensible: if you look at the ASF JIRAs, even 100% compatibility across versions is hard to achieve -and that's with the same source tree. It's not the binary signature that is (usually) the problem, it's what happens afterwards that's trouble. Little things like whether renaming a file is atomic, or what happens when you ask for the block locations of a directory.
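To make "what happens afterwards" a bit more concrete, here is a minimal sketch of the kind of behavioural check a compatibility claim would have to pass. It is an assumption-laden illustration: it uses Python's standard library against the local filesystem rather than the Hadoop FileSystem API, and the function names are hypothetical. It only checks the observable outcome of a rename; it does not (and a single-threaded test cannot) prove that the rename is atomic.

```python
import os
import tempfile


def check_rename_contract(directory):
    """Record what a rename leaves behind, not just that the call returned.

    Hypothetical helper: works on the local filesystem under `directory`.
    """
    src = os.path.join(directory, "source.txt")
    dst = os.path.join(directory, "dest.txt")
    with open(src, "w") as f:
        f.write("payload")

    os.rename(src, dst)

    with open(dst) as f:
        content = f.read()
    return {
        "source_gone": not os.path.exists(src),
        "dest_exists": os.path.exists(dst),
        "content_preserved": content == "payload",
    }


def rename_missing_source_fails(directory):
    """A rename of a nonexistent file should raise, not silently succeed."""
    try:
        os.rename(os.path.join(directory, "no-such-file"),
                  os.path.join(directory, "anything"))
    except FileNotFoundError:
        return True
    return False


with tempfile.TemporaryDirectory() as tmp:
    observations = check_rename_contract(tmp)
    missing_fails = rename_missing_source_fails(tmp)
```

Two implementations can expose an identical rename call and still disagree on any one of these observations, which is why matching the method signatures is the easy part.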

Now, what about introducing a closed source product on top of Hadoop and saying you are part of the Hadoop ecosystem, that you have x-hundred people working on Hadoop?

This is where it gets tricky.

Some people say "it's like building on Linux" -and there are some very big closed applications that run on Linux. A big one that springs to mind is Oracle RDBMS.

Are the thousands of people who work on Oracle-on-Linux "working on Linux"? Are they working on "Oracle on Linux", or are they working "on Oracle", on Linux?

Whatever way you look at it, those people aren't working in the Linux OS, just on something that runs on top of it. Would you call it part of the Linux "stack", the way MySQL and Apache HTTPD are?

Personally: I have no idea.

What probably doesn't happen from Oracle's work is any direct feedback from their application into the OS. [Correction: it does, thx @tlipcon]. I also doubt that RedHat, Novell and others regression test Oracle RDBMS on their latest builds of Linux. By their very nature, closed-source applications fall out of the normal OSS regression and release test processes, which rely not only on the open source trees, but also on the open test suites. This is also why Oracle's decision not to release all the tests for MySQL seems so short-sighted: it may hurt MariaDB, but it also hinders Linux regression testing.

Breaking that link between the OS and the application means that Oracle have not been in a position to adapt rapidly to problems in the OS and filesystem, because there's no way to push their issues back upstream, to get changes in, to get new releases out in a hurry to fix a problem with their application or hardware. Instead the onus falls on the application developers to deal with the problem themselves.

How have Oracle handled this? Eventually, by getting into the Linux distribution business itself, with Oracle Unbreakable Linux. By releasing a complete OS build, they can coordinate OS and application releases, and they can fix their version of the OS to handle problems that surface in Oracle's applications -on a timetable that works for them. They also get to handle Oracle hardware support in a timely manner, and collect support revenue from users.

That works -at a cost. By forking RedHat Linux, Oracle have taken on all the maintenance and testing costs themselves.

The amount that Oracle charges has to cover those costs, or the quality of the Oracle fork of Linux degrades relative to the reference points of RHEL and Debian.

For Oracle, the combined OS+11g+Exadata deal has enough margin in the database that they can come up with a price that is less than ({HP | Dell}-RHEL-Oracle11g), and so presumably those costs can be covered. What's not clear is this: did Oracle get into the business of selling a supported Linux because they saw money in it, or because they concluded that their hardware and database products effectively mandated it?

Other companies getting into the business of redistributing Hadoop-derived products to customers who are paying those companies in the expectation of support are going to have to start thinking about this.

If you have just sold something that has some Hadoop JARs in it -code that the customer depends on- and they have a problem, how are you going to fix it?

Here are some strategies:

1. Hope it won't be a problem. Take the Apache artifacts, ship as is. It is, in the opinions of myself and my Hortonworks colleagues, production ready. Push customers with problems to issues.apache.org, or forward the issues yourself. You could do the same with CDH, which in the opinions of my friends at Cloudera is also production ready. Do that, though, and issues filed on the Apache JIRA will be ignored unless you can replicate them on the ASF artifacts.

2. Build your own expertise: this takes time, and while that happens you aren't in a position to field support calls. If you make your own releases, you end up needing your own test infrastructure, QA'ing it, and tracking the changes in Hadoop trunk and branch-1.

3. Partner with the experts: work with people who have an in-depth understanding of the code, its history, why decisions were made, and experience in cutting production scale releases suitable for use in web companies and enterprises. That means Hortonworks and Cloudera. Many of the enterprise vendors do this, because they've realised it is the best option.

The web companies, the early adopters, went for #1 and ended up with #2: build your own expertise. This is effectively what I did in my HPLabs work on dynamic in-cloud Hadoop. You can see my journeys through the source -while working on big things, little things crop up, especially problems related to networking in a virtual world, configuration in a dynamically configured space, and recovery/sync problems that my service model discovered. I still know my way through only a fraction of the code, but every project I work on builds up my understanding, and contributes stuff back to the core, including things like better specifications of the filesystem API's semantics, and the tests to go with them.

That trail of JIRAs related to my work shows up something else: if you are delving deep into Hadoop, your reading of the code alone should be enough to get you filing bugs against minor issues, niggles, potential synchronization, cleanup or robustness problems. If you are pushing the envelope in what Hadoop can do: bigger issues.

We are starting to see some involvement in hadoop-core from Intel, though apart from the encryption contribs, it still appears to be at an early stage -though Andrew Purtell has long been busy in HBase. We do see a lot of activity from Junping Du of VMware -not just the topology work, but other big virtualisation features, and the day-to-day niggles and test problems you get working with trunk. Conclusion: at least one person in VMware is full time on Hadoop. Which is great: the more bugs that get reported, the more patches, the better Hadoop becomes. Participating in the core code development project develops your expertise while ensuring that the Apache (hence Hortonworks and Cloudera) artifacts meet your needs.

Are there other contributors from EMC? Intel? I have no idea. You can't tell from gmail & ymail addresses alone; you'd have to deanonymize them by going via LinkedIn. That's not just name match; you can use the LI "find your contacts" scanner to go through those people's email addresses and reverse lookup their names. Same for twitter. I may just do that for a nice little article on "practical deanonymization".

In the meantime, whenever someone comes to you with a product containing the Apache Hadoop stack, say "if there is a problem in the Hadoop JARs - how are you going to fix it?"

[Artwork: See no evil by Inkie, co-organiser of the See No Evil event. Clearly painted with the aid of a cherry picker]

1 comment:

Intel will have three active committers working full time on community centered work by April or so. We have a larger team that will also be focused on contributing to various projects in the Apache Hadoop ecosystem, some as individual JIRAs, some as projects now in incubation. I think you will come to know them over time; their @intel.com addresses may be a reliable indication, but it is my hope that high quality work will be another, and over time they may become invested in the Apache Way and demonstrate sufficient merit and contribution to become committers as well.

Encryption is indeed a start, a specific requirement and a cleanly separable set of tasks where we can bring some initial value to the community; but it is also a prelude to interesting work on larger security related concerns: labeling, admission control, audit, policy. We are also interested in working with the community on addressing broader and larger challenges, such as updating the Hadoop ecosystem's runtime for new classes of (nearly) affordable storage and networking. This is another interesting cross-cutting challenge, across at least the OS, Java virtual machine, and Hadoop common, as well as HDFS and HBase. Expect to see JIRAs from us on these things and more as we develop strategies and get internal commitments in place.