Over the past few months I’ve been hacking together scripts to distribute data-parallel jobs. However, it’s always nice when somebody else has done the work. In this case, Hadoop is an implementation of the map/reduce framework from Google. As Yahoo and others have shown, it’s an extremely scalable framework, and when coupled with Amazon’s EC2, it’s an extremely powerful system for processing large datasets.

I’ve been hearing a lot about Hadoop from my brother, who is working on linking R with Hadoop, and I thought that this would be a good time to try it out for myself. So the first task was to convert the canonical word-counting example to something closer to my interests – counting the occurrences of elements in a collection of SMILES. This is a relatively easy example, since SMILES files are line-oriented, so it’s simply a matter of reworking the WordCount example that comes with the Hadoop distribution.

For now, I run Hadoop 0.20.0 on my MacBook Pro, following these instructions on setting up a single-node Hadoop system. I also put the bin/ directory of the Hadoop distribution in my PATH. The code employs the CDK to parse a SMILES string and identify each element.
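In outline, the job is just the WordCount skeleton with a CDK-based mapper. A minimal sketch against the 0.20 mapreduce API might look like the following (ElementCount is a placeholder class name; I’m assuming a recent CDK, and that the SMILES is the first whitespace-delimited token on each line):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.openscience.cdk.DefaultChemObjectBuilder;
import org.openscience.cdk.interfaces.IAtom;
import org.openscience.cdk.interfaces.IAtomContainer;
import org.openscience.cdk.smiles.SmilesParser;

public class ElementCount {

    // Mapper: parse each SMILES line with the CDK and emit (element symbol, 1)
    public static class ElementMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text symbol = new Text();
        private final SmilesParser parser =
                new SmilesParser(DefaultChemObjectBuilder.getInstance());

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                String smiles = value.toString().trim().split("\\s+")[0];
                IAtomContainer mol = parser.parseSmiles(smiles);
                for (IAtom atom : mol.atoms()) {
                    symbol.set(atom.getSymbol());
                    context.write(symbol, ONE);
                }
            } catch (Exception e) {
                // skip SMILES that the CDK cannot parse
            }
        }
    }

    // Reducer: sum the counts for each element, exactly as in WordCount
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "element count");
        job.setJarByClass(ElementCount.class);
        job.setMapperClass(ElementMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```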

This is compiled in the usual manner and converted to a jar file. For some reason, Hadoop wouldn’t run the class unless the CDK classes were also included in the jar file (i.e., the -libjars argument didn’t seem to let me specify the CDK libraries separately from my code). So the end result was to include the whole CDK in my Hadoop program jar.
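For what it’s worth, the packaging step looked roughly like this (the jar and class names are placeholders; the trick is simply to unpack the CDK classes into the build tree before jarring everything up):

```
javac -classpath $HADOOP_HOME/hadoop-0.20.0-core.jar:cdk.jar -d classes ElementCount.java
(cd classes && jar xf ../cdk.jar)   # unpack the CDK classes alongside my own
jar cf elementcount.jar -C classes .
```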

OK, the next thing was to create an input file. I extracted 10,000 SMILES from PubChem and copied them into my local HDFS.
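The copy (and later, running the job and pulling back the output) went something like this – the file, jar, and class names are placeholders for what I actually used:

```
hadoop fs -mkdir input
hadoop fs -put pubchem.smi input/
hadoop jar elementcount.jar ElementCount input output
hadoop fs -cat output/part-r-00000
```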

Thus, across the entire 10,000 molecules, there were 13 occurrences of Ag, 1,881,452 occurrences of carbon, and so on.

Something useful?

OK, so this is a rather trivial example. But it was quite simple to create and, more importantly, I should be able to take this jar file, run it on a proper multi-node Hadoop cluster, and work with the entire PubChem collection.

A more realistic use case is to do SMARTS searching. In this case, the mapper would simply emit the molecule title along with an indication of whether it matched the supplied pattern (say 1 for a match, 0 otherwise), and the reducer would then collect the key/value pairs for which the value was 1. Since this can work directly off SMILES input, it’s quite simple to implement, as the sketch below suggests.
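As a rough sketch (SmartsSearch and the "smarts.pattern" configuration key are my own placeholder names, and I’m assuming the single-argument SMARTSQueryTool constructor from the CDK along with a "SMILES title" line layout), the mapper and reducer might look like:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.openscience.cdk.DefaultChemObjectBuilder;
import org.openscience.cdk.smiles.SmilesParser;
import org.openscience.cdk.smiles.smarts.SMARTSQueryTool;

public class SmartsSearch {

    // Mapper: emit (title, 1) if the molecule matches the SMARTS pattern, (title, 0) otherwise
    public static class MatchMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final SmilesParser parser =
                new SmilesParser(DefaultChemObjectBuilder.getInstance());
        private SMARTSQueryTool queryTool;

        protected void setup(Context context) throws IOException {
            try {
                // the pattern is passed in via the job configuration
                queryTool = new SMARTSQueryTool(context.getConfiguration().get("smarts.pattern"));
            } catch (Exception e) {
                throw new IOException(e);
            }
        }

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] tokens = value.toString().trim().split("\\s+", 2); // "SMILES title"
            if (tokens.length < 2) return;
            try {
                boolean hit = queryTool.matches(parser.parseSmiles(tokens[0]));
                context.write(new Text(tokens[1]), new IntWritable(hit ? 1 : 0));
            } catch (Exception e) {
                // skip SMILES that the CDK cannot parse
            }
        }
    }

    // Reducer: keep only the titles that matched
    public static class MatchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            for (IntWritable v : values) {
                if (v.get() == 1) {
                    context.write(key, v);
                    break;
                }
            }
        }
    }
}
```

The driver would then just do something like conf.set("smarts.pattern", "c1ccccc1") before submitting the job.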

A slightly non-trivial task is to apply this framework to SD files. My motivation is that I’d like to run pharmacophore searches across large collections, without having to split up large SD files by hand. Hadoop is an excellent framework for this. The problem is that most Hadoop examples work with line-oriented data files. SD files are composed of multi-line records and so this type of input requires some extra work, which I’ll describe in my next post.

Rich, good question. However, not having used CouchDB (and having only just started with Hadoop last night), I probably can’t give a very good answer.

My understanding is that Hadoop is a general purpose framework for map/reduce style tasks. In that sense, one could build something like CouchDB on top of the Hadoop framework.

However, a few points.

First – from the point of view of computation, Hadoop seems preferable, in the sense that one can bring the computation to the data. If one were to store the data in an RDBMS, it would imply doing arbitrary computation within the DB (which is unwieldy in many RDBMSs). So if the scenario is such that you are performing arbitrary computations across large datasets and want to process every object, Hadoop seems a good choice.

(One thing that I will be working on is to do pharmacophore searches with Hadoop. While one certainly could do this in an RDBMS, I think it’d be a bit of a hack. In contrast, I can simply rework preexisting pharmacophore code into the Hadoop framework with minimal effort.)

Second – I think unless we’re talking about parallel RDBMSs, it doesn’t really make sense to compare the two. An interesting discussion of map/reduce versus parallel RDBMSs is at http://tinyurl.com/cdy5rl. Of course, if the application doesn’t require any significant relational algebra, then an RDBMS is definitely overkill.

One of the big differences between an RDBMS and plain old map/reduce might be the way indexing is done – in effect, an RDBMS will pre-compute stuff. I don’t know whether CouchDB can do indexing on arbitrary fields (my understanding is that it basically acts like a big hash table?)