Friday, February 19, 2010

Here at Bizo, the combination of Hudson for cron management, Hive for report generation, and Elastic MapReduce for provisioning compute power has greatly simplified our data processing. Our Hudson cron instance periodically generates Hive scripts for us and automatically launches them on EC2.

The main inconvenience with this process is that the results of our Hive jobs are left as one or more obscurely named files in S3. These often need some post-processing to put them into a friendlier form. Unfortunately, EMR doesn't have an easy hook for launching these post-processing tasks -- although we could implement them as MapReduce steps, we'd need to write our own workflows, losing the simplicity of EMR's "--hive-script" flag.

Our solution is to use SimpleDB to store some basic metadata about jobs. Using this metadata, a Hudson job periodically checks the EMR API to determine whether tasks have completed. If so, it then triggers other Hudson jobs that are responsible for processing the results.
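As a rough sketch of the idea, here is what the metadata record for one job might look like. The attribute names are illustrative assumptions; the post doesn't spell out a schema.

```python
def emr_job_metadata(jobflow_id, hudson_job, result_prefix):
    """Attributes stored in SimpleDB for one tracked EMR job.

    All attribute names here are hypothetical; SimpleDB just stores
    string attributes per item, so any schema works.
    """
    return {
        "jobflow_id": jobflow_id,        # id the EMR API returns at launch
        "hudson_job": hudson_job,        # Hudson job to trigger on completion
        "result_prefix": result_prefix,  # S3 prefix where Hive wrote results
    }

# With a library like boto, for example, the record could be written as:
#   domain = boto.connect_sdb().create_domain("emr-jobs")
#   domain.put_attributes(jobflow_id, emr_job_metadata(...))
```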

The Hudson parameterized build feature. It's not really feasible to create a new Hudson job for each individual report that runs, so we pass parameters to a single post-processing job, which uses them to figure out where the results are in S3 and what to do with them. It's not well-documented how to trigger a parameterized build programmatically (as opposed to from the web interface); the solution is to send some JSON to the build url.
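A minimal sketch of that trigger, assuming a hypothetical server and parameter name: Hudson accepts a POST to the job's build URL with a form field named `json` describing the parameters.

```python
import json

def hudson_build_request(base_url, job, params):
    """Build the URL and form body for triggering a parameterized
    Hudson job remotely.

    Hudson expects the parameters as a JSON object in a form field
    named 'json', POSTed to the job's build URL.
    """
    url = "%s/job/%s/build" % (base_url.rstrip("/"), job)
    payload = {
        "parameter": [{"name": k, "value": v}
                      for k, v in sorted(params.items())]
    }
    return url, {"json": json.dumps(payload)}

# Usage (server name and parameter are assumptions):
#   url, form = hudson_build_request("http://hudson:8080", "process-report",
#                                    {"RESULT_PREFIX": "s3://bucket/out/"})
#   urllib.urlopen(url, urllib.urlencode(form))  # fires the build
```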

The Trigger Script. This is the script that periodically runs on our cron server to check if a post-processing step should be triggered. The JSON format for parameterized jobs is described in the comments of this file.
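The core decision the trigger script makes can be sketched like this, using the job flow states the EMR API reports (the helper name and return values are assumptions):

```python
# Terminal job flow states reported by EMR's DescribeJobFlows call.
TERMINAL_STATES = ("COMPLETED", "FAILED", "TERMINATED")

def next_action(state):
    """Decide what to do with one tracked job.

    Returns 'trigger' when the post-processing Hudson job should fire,
    'forget' when the job ended without success, and 'wait' otherwise.
    """
    if state == "COMPLETED":
        return "trigger"
    if state in TERMINAL_STATES:
        return "forget"
    return "wait"
```

The script would loop over the SimpleDB records, call `next_action` on each job's current state, and fire the parameterized Hudson build for any job that returns 'trigger'.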

The end result is that a job can run an EMR job and configure a post-processing step for itself with the following commands:
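As an illustrative stand-in (only the --hive-script flag comes from the post; the other flags and names are hypothetical), the launch side might look like:

```python
def emr_launch_command(name, hive_script_s3):
    """Command line for Amazon's elastic-mapreduce CLI tool.

    Only --hive-script is taken from the post; --create, --name, and
    the job name are illustrative assumptions.
    """
    return ["elastic-mapreduce", "--create", "--name", name,
            "--hive-script", hive_script_s3]

# A job would run this (e.g. via subprocess), capture the job flow id
# from the CLI's output, and record the SimpleDB metadata so the trigger
# script knows which Hudson job to fire when the flow completes.
```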