Thursday, November 09, 2006

Hadoop on Amazon EC2

The details of setting this up are available at the node "AmazonEC2" on the Lucene-Hadoop Wiki at Apache.org.

When looking for more about this, I noticed that the hyped-but-not-launched natural language search engine Powerset appears to be leading the charge on using Hadoop on EC2. From the Hadoop mailing list:

At Powerset we have used EC2 and Hadoop with a large number of nodes, successfully running Map/Reduce computations and HDFS. Pretty much like you describe, we use HDFS for intermediate results and caching, and periodically extract data to our local network. We are not really using S3 at the moment for persistent storage.

A nice feature of Hadoop as measured against our use of EC2 has been the capability of fluidly changing the number of instances that are part of the cluster. Our instances are set up to join the cluster and the DFS as soon as they are activated and when - for any reason - we lose those machines, the overall process doesn't suffer. We have been quite happy with this, even at significant number of instances.

That is an interesting detail on the recent announcement that Powerset is a heavy user of Amazon's EC2.

I am not sure I have an immediate use for Hadoop on EC2, but it is nice to see. Developers may now be able to rapidly bring up hundreds of servers, run a massive parallel computation on them using Hadoop's MapReduce implementation, and then shut down all the instances, all with low effort and at low cost. Very cool.