Making sense of MapReduce

Last night I went to hear Ken Krugler of Bixolabs talk about Hadoop at the monthly meeting of the Software Developers Forum. Maybe it was because Ken is an unusually lucid speaker, or maybe I had just reached some sort of cumulative tipping point thanks to the prep work of all the patient people who have tried to help me in the past; either way, I think I finally get Hadoop and MapReduce.

First of all, it’s MapReduce and then Hadoop: not Hadoop with MapReduce. The way to think about Hadoop is that it is the open source choice for implementing MapReduce-like algorithms. Hadoop may provide technical advantages over its competitors (Twister, Greenplum and others) about which I am in no position to comment, but it is nevertheless only one of several MapReduce frameworks.

The second big “revelation” was the nature of MapReduce itself. When dealing with massive data, it is very likely that the size of the data to be processed is many orders of magnitude larger than the size of the software required to process it. The revelation is: not only does it make sense to move the code to the data, but one ought also to do as much parallel processing as possible. Looked at this way, MapReduce is a variation of the flow-based programming techniques developed by mainframe programmers in the ’60s and ’70s, which were part of a tremendous effort to understand parallel programming during that period. In fact, the tuples of Linda spaces, the precursor of the Network Spaces that form the conceptual foundation of Revolution’s parallel R programming package, ParallelR, are probably a direct ancestor of MapReduce’s key-value pairs. Furthermore, the buzz in the R community about Hadoop makes sense if one thinks of Hadoop as the open source way to implement sophisticated R-based MapReduce algorithms. R should be an ideal choice for implementing parallel algorithms that work on independent chunks of data.
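The key-value idea at the heart of MapReduce can be sketched in a few lines. This is a toy illustration (in Python, not Hadoop or R code; the function names are my own invention) of how a word count splits into a map step that runs independently on each chunk, a shuffle step that groups values by key, and a reduce step that combines each key’s values:

```python
from collections import defaultdict

def map_words(chunk):
    # Map: emit a (word, 1) key-value pair for every word in a chunk.
    # Each chunk can be processed independently, so this step parallelizes.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Shuffle: group all emitted values under their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(groups):
    # Reduce: combine each key's values; keys are independent of one
    # another, so this step parallelizes as well.
    return {key: sum(values) for key, values in groups.items()}

# Three "chunks" standing in for data blocks spread across machines.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for chunk in chunks for pair in map_words(chunk)]
counts = reduce_counts(shuffle(pairs))
print(counts["the"])  # -> 3
print(counts["fox"])  # -> 2
```

The point is not the word count itself but the shape of the computation: the framework (Hadoop, in practice) handles the shuffle and the distribution of chunks, while the programmer supplies only the map and reduce functions.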

To leave a comment for the author, please follow the link and comment on his blog: Revolutions.