Thursday, January 5, 2012

Running a Ruby map-reduce job with Hadoop

I am currently developing an app in my spare time and needed to join a file with itself to find movies that are both Romance and Comedy.

The file looked something like this:

movie-a	Comedy
movie-b	Comedy
movie-a	Romance

I wanted to produce a file of the form

movie-a [Comedy,Romance]

Movies that don't include both genres should be ignored.

I immediately thought of using Hadoop. Even though the file was not huge, the map-reduce model seemed a good fit for the problem.

I have done some small work with Hadoop in Java, but in this case my project was Ruby based and I wanted to keep using Ruby even for my Hadoop job, so I used Hadoop's streaming API to solve the problem. I needed a fast and easy solution, and after consulting some bibliography and the Web I came to a very simple one. What follows is the code:

map.rb

ARGF.each do |line|
  begin
    parts = line.chomp.split("\t")
    # Emit the movie name and its genre (first and last fields), tab separated
    puts parts[0] + "\t" + parts[parts.size - 1]
  rescue
    puts 'error'
  end
end

reduce.rb

current_key = nil
current_key_values = []

ARGF.each do |line|
  key, value = line.chomp.split(/\t/)
  current_key = key if current_key.nil?
  if current_key == key
    current_key_values << value
  else
    # A new key has started: flush the previous group if it has both genres
    if current_key_values.include?("Comedy") && current_key_values.include?("Romance")
      puts current_key + "\t[" + current_key_values.join(",") + "]"
    end
    current_key = key
    current_key_values = [value]
  end
end

# The last group is never followed by a new key, so it has to be flushed here as well
if current_key && current_key_values.include?("Comedy") && current_key_values.include?("Romance")
  puts current_key + "\t[" + current_key_values.join(",") + "]"
end
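Since streaming is nothing more than stdin and stdout, the whole job can be simulated locally before touching Hadoop. Here is a minimal in-memory sketch of the pipeline (the sample lines are made up for illustration):

```ruby
# Simulate the streaming pipeline in memory: map, sort, reduce.
input = [
  "movie-a\tComedy",
  "movie-b\tComedy",
  "movie-a\tRomance"
]

# Map phase: emit "name\tgenre", same as map.rb.
mapped = input.map do |line|
  parts = line.chomp.split("\t")
  parts[0] + "\t" + parts[parts.size - 1]
end

# Hadoop sorts the mapper output by key before handing it to the reducer.
sorted = mapped.sort

# Reduce phase: group consecutive lines sharing a key,
# keep only movies that have both genres.
results = []
sorted.chunk_while { |a, b| a.split("\t")[0] == b.split("\t")[0] }.each do |group|
  key = group.first.split("\t")[0]
  values = group.map { |l| l.split("\t")[1] }
  if values.include?("Comedy") && values.include?("Romance")
    results << key + "\t[" + values.join(",") + "]"
  end
end

puts results  # => movie-a	[Comedy,Romance]
```

The real scripts can be checked the same way from the shell, piping the input file through ruby map.rb, sort, and ruby reduce.rb.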

hadoop_run.sh

#!/bin/bash
HADOOP_HOME=/home/cscarioni/programs/hadoop-0.22.0
JAR=contrib/streaming/hadoop-0.22.0-streaming.jar
STREAMCOMMAND="$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/$JAR"

$STREAMCOMMAND \
    -mapper 'ruby map.rb' \
    -reducer 'ruby reduce.rb' \
    -file map.rb \
    -file reduce.rb \
    -input '/home/cscarioni/Downloads/comedy_romance_movies' \
    -output /home/cscarioni/Downloads/comedy_romance_movies_results

I consider these to be the main differences between the streaming API and the Java API:

1. The streaming API passes everything through stdin and stdout between the scripts (take a look at the use of ARGF and puts in both map.rb and reduce.rb), much like piping between commands on the command line.

2. In the Java API, the results from the map phase are grouped together by key, so in our case the reduce phase would directly receive movie-a with the values [Comedy,Romance]. In the streaming API, on the other hand, the grouping has to be done manually; what we get is a list sorted by key (so all the movie-a lines are next to each other).
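The grouping the Java API performs for you can be pictured with Ruby's group_by. This snippet is just an illustration of the difference, not part of the job, and the sample lines are made up:

```ruby
# What the Java API hands the reducer: each key with all of its values.
lines = ["movie-a\tComedy", "movie-b\tComedy", "movie-a\tRomance"]
grouped = lines.map { |l| l.split("\t") }.group_by { |key, _| key }
grouped.each do |key, pairs|
  values = pairs.map { |_, v| v }
  puts key + " " + values.inspect
end
# movie-a ["Comedy", "Romance"]
# movie-b ["Comedy"]
```

In streaming, the sorted-by-key ordering is the contract, and the reducer has to rebuild these groups itself by watching for the key to change, as reduce.rb does.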

So there we have it: a small and functional map-reduce job in Ruby with Hadoop.