Map Reduce Jobs in Ruby with Wukong

I recently got to work on some really interesting, big data problems at Highgroove. One of our clients needed to record every api call and analyze specific time periods for averages and usage metrics. Hadoop fits this use case pretty well with a distributed file system and map reduce framework built in. But, I’ll be honest, I wasn’t too happy about having to write map reduce jobs in Java. While looking for alternatives I found Wukong; a gem that adds a Ruby wrapper around the Hadoop Streaming utility. Here’s an example of how easy it makes writing map reduce jobs.

Now there are always trade offs. Streaming queries, especially to a Ruby script, are definitely slower than their pure Java counterparts. But, between the base startup time of any map reduce job (about 30 seconds) and the speed up from running all jobs in parallel across a cluster of workers, it’s really not that bad and totally worth the improvements in readability and maintenance. Also, because this script runs over the entire cluster, you have to make sure that each machine has the right ruby version and and all necessary gems installed, but we easily solved this with a few lines in a Chef recipe.

We’ve really only touched the surface of what is included with Wukong. There are plenty of different mapper and reducer classes depending on how you have your data structured and how you want to query it. With the script class provided by Wukong, you can run scripts across an entire Hadoop cluster with a single command. It even includes support for writing pig latin scripts in Ruby. So if you know Ruby and want to get involved with some big data projects, Wukong + Hadoop is a great way to get started.