Tag Archives: hadoop

Hadoop MapReduce is a batch-processing system. Why? Because that’s the way Google described their MapReduce implementation.

But it doesn’t have to be that way. Introducing HOP: the Hadoop Online Prototype [updated link to the final NSDI ’10 version]. With modest changes to the structure of Hadoop, we were able to convert it from a batch-processing system into an interactive, online system that can provide features like “early returns” from big jobs and continuous data stream processing, while preserving the simple MapReduce programming and fault-tolerance models popularized by Google and Hadoop. And by the way, it exposes pipeline parallelism that can make even batch jobs finish faster. This is a project led by Tyson Condie, in collaboration with folks at Berkeley and Yahoo! Research.
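To make the “early returns” idea concrete, here is a hypothetical sketch (not HOP’s actual code) of a pipelined word count: map output is pushed straight into the running reduce state as it is produced, and the job periodically emits a snapshot of partial results instead of waiting for all maps to finish. The function names and snapshot interval are illustrative.

```python
from collections import Counter

def mapper(line):
    # Standard word-count map function: emit (word, 1) pairs.
    for word in line.split():
        yield word, 1

def pipelined_job(lines, snapshot_every=2):
    counts = Counter()   # running reduce state, updated as map output arrives
    snapshots = []
    for i, line in enumerate(lines, 1):
        for word, n in mapper(line):   # map output pipelined into the reducer
            counts[word] += n
        if i % snapshot_every == 0:    # "early return": snapshot of partial results
            snapshots.append(dict(counts))
    return snapshots, dict(counts)

snaps, final = pipelined_job(["a b a", "b c", "a a", "c c b"])
```

In a batch-mode job, only `final` would ever be visible; the point of pipelining is that `snaps` gives the user progressively refined answers while the job is still running.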

For the last year or so, my team at Berkeley — in collaboration with Yahoo Research — has been undertaking an aggressive experiment in programming. The challenge is to design a radically easier programming model for infrastructure and applications in the next computing platform: The Cloud. We call this the Berkeley Orders Of Magnitude (BOOM) project: enabling programmers to develop OOM bigger systems in OOM less code.

To kick this off we built something we call BOOM Analytics [link updated to the EuroSys ’10 final version]: a clone of Hadoop and HDFS built largely in Overlog, a declarative language we developed some years back for network protocols. BOOM Analytics is just as fast and scalable as Hadoop, but radically simpler in its structure. As a result we were able — with amazingly little effort — to turbocharge our incarnation of the elephant with features that would be enormous upgrades to Hadoop’s Java codebase. Two of the fanciest are covered in the full post.

Just for the record, I think MapReduce is fine, but not especially interesting technology. The thing is, the “teachable moment” it presents is really great stuff, because it is bringing people toward data-centric parallel programming. So it’s good for the data-centric research business in general, and especially for data-centric approaches to parallelism.

One thing I plan to do here is jot down ideas I don’t have time to work on myself. Here’s the first installment in what will hopefully be a running series of “Research Gimmes.” Anybody who wants to run with this, I’d love to hear what you’re up to.

So… who’s going to re-examine Online Aggregation in the Hadoop context? Goodness knows it’d be useful. It will require moving Hadoop beyond a slavish implementation of the Google MapReduce paper. That’s got to be a good thing… Here’s the start of the program:
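For readers unfamiliar with Online Aggregation, here is a hypothetical sketch of the core idea for an AVG query: scan tuples in random order and report a running estimate with a rough 95% confidence interval that tightens as more data is seen. The function name, reporting interval, and normal-approximation interval are all illustrative assumptions, not any particular system’s API.

```python
import math
import random

def online_avg(values, report_every=100):
    # Welford's algorithm for a running mean and variance.
    n = 0
    mean = 0.0
    m2 = 0.0
    reports = []
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n % report_every == 0 and n > 1:
            var = m2 / (n - 1)
            half = 1.96 * math.sqrt(var / n)   # normal-approx 95% CI half-width
            reports.append((n, mean, half))
    return reports

random.seed(0)
data = [random.gauss(10, 2) for _ in range(1000)]
for n, est, ci in online_avg(data, report_every=250):
    print(f"after {n} rows: AVG \u2248 {est:.2f} \u00b1 {ci:.2f}")
```

The user can stop the query as soon as the interval is tight enough — which is exactly the kind of “early answer” interaction a pipelined Hadoop could support, since reducers would see a growing sample of the input instead of waiting for the whole scan.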