A Guide to Python Frameworks for Hadoop

I recently joined Cloudera after working in computational biology/genomics for close to a decade. My analytical work is primarily performed in Python, along with its fantastic scientific stack. It was quite jarring to find out that the Apache Hadoop ecosystem is primarily written in/for Java. So my first order of business was to investigate some of the options that exist for working with Hadoop from Python.

In this post, I will provide an unscientific, ad hoc review of my experiences with some of the Python frameworks that exist for working with Hadoop, including:

Hadoop Streaming

mrjob

dumbo

hadoopy

pydoop

and others

Ultimately, in my analysis, Hadoop Streaming is the fastest and most transparent option, and the best one for text processing. mrjob is best for rapidly working on Amazon EMR, but incurs a significant performance penalty. dumbo is convenient for more complex jobs (objects as keys; multistep MapReduce) without incurring as much overhead as mrjob, but it’s still slower than Streaming.

Read on for implementation details, performance comparisons, and feature comparisons.

Toy Problem Definition

To test out the different frameworks, we will not be doing “word count”. Instead, we will be transforming the Google Books Ngram data. An n-gram is a synonym for a tuple of n words. The n-gram data set provides counts for every single 1-, 2-, 3-, 4-, and 5-gram observed in the Google Books corpus grouped by year. Each row in the n-gram data set is composed of 3 fields: the n-gram, the year, and the number of observations. (You can explore the data interactively here.)

We would like to aggregate the data to count the number of times any pair of words are observed near each other, grouped by year. This would allow us to determine if any pair of words are statistically near each other more often than we would expect by chance. Two words are “near” if they are observed within 4 words of each other. Or equivalently, two words are near each other if they appear together in any 2-, 3-, 4-, or 5-gram. So a row in the resulting data set would be comprised of a 2-gram, a year, and a count.

There is one subtlety that must be addressed. The n-gram data set for each value of n is computed across the whole Google Books corpus. In principle, given the 5-gram data set, I could compute the 4-, 3-, and 2-gram data sets simply by aggregating over the correct n-grams. For example, if the 5-gram data set contains

    (the, cat, in, the, hat)        1999    20
    (the, cat, is, on, youtube)     1999    13
    (how, are, you, doing, today)   1986    5000

then we could aggregate this into 2-grams which would result in records like

    (the, cat)    1999    33    // i.e., 20 + 13

However, in practice, Google only includes an n-gram if it is observed more than 40 times across the entire corpus. So while a particular 5-gram may be too rare to meet the 40-occurrence threshold, the 2-grams it is composed of may be common enough to exceed the threshold in the Google-supplied 2-gram data. For this reason, we use the 2-gram data for words that are next to each other, the 3-gram data for pairs of words that are separated by one word, the 4-gram data for pairs of words that are separated by two words, etc. In other words, given the 2-gram data, the only additional information the 3-gram data provide is the pair of outermost words of each 3-gram. In addition to being more sensitive to potentially rare n-grams, using only the outermost words of the n-grams helps ensure we avoid double counting. In total, we will be running our computation on the combination of the 2-, 3-, 4-, and 5-gram data sets.
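To make the outermost-words scheme concrete, here is a tiny helper (my own illustration, not code from the actual jobs):

```python
def outer_pair(ngram):
    """Return the sorted pair of outermost words in an n-gram,
    plus the number of words separating them."""
    words = ngram.split()
    word1, word2 = sorted((words[0], words[-1]))
    gap = len(words) - 2  # 2-gram -> adjacent; 5-gram -> 3 words apart
    return (word1, word2), gap

# a 5-gram contributes only its first and last words:
pair, gap = outer_pair("the cat in the hat")
```

So the 5-gram data contribute only the pair (hat, the) at a gap of three words; the inner pairs like (the, cat) are picked up from the lower-order data sets instead.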

The MapReduce pseudocode to implement this solution would look like so:

    def map(record):
        (ngram, year, count) = unpack(record)
        // ensure word1 has the lexicographically first word:
        (word1, word2) = sorted(ngram[first], ngram[last])
        key = (word1, word2, year)
        emit(key, count)

    def reduce(key, values):
        emit(key, sum(values))
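As a sanity check, the pseudocode can be fleshed out into a runnable, in-memory Python sketch (no Hadoop involved; unpack and emit are replaced with ordinary returns):

```python
from collections import defaultdict

def map_record(record):
    """record is (ngram, year, count), where ngram is a tuple of words."""
    ngram, year, count = record
    # ensure word1 is the lexicographically first word
    word1, word2 = sorted((ngram[0], ngram[-1]))
    return ((word1, word2, year), count)

def reduce_records(mapped):
    """Group mapped (key, count) pairs and sum the counts per key."""
    sums = defaultdict(int)
    for key, count in mapped:
        sums[key] += count
    return dict(sums)

records = [
    (("the", "cat", "is", "on", "youtube"), 1999, 13),
    (("youtube", "is", "killing", "the"), 1999, 20),
]
result = reduce_records(map_record(r) for r in records)
```

Both sample records collapse onto the key ("the", "youtube", 1999), and their counts are summed just as the real reducer would.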

Hardware

These MapReduce jobs are executed on a ~20 GB random subset of the data. The full data set is split across 1500 files; we select a random subset of the files using this script. The filenames remain intact, which is important because the filename identifies the value of n in the n-grams for that chunk of data.

The Hadoop cluster comprises five virtual nodes running CentOS 6.2 x64, each with 4 CPUs, 10 GB RAM, 100 GB disk, running CDH4. The cluster can execute 20 maps at a time, and each job is set to run with 10 reducers.

The software versions I worked with on the cluster were as follows:

Hadoop: 2.0.0-cdh4.1.2

Python: 2.6.6

mrjob: 0.4-dev

dumbo: 0.21.36

hadoopy: 0.6.0

pydoop: 0.7 (PyPI) and the latest version on git repository

Java: 1.6

Implementations

Most of the Python frameworks wrap Hadoop Streaming, while others wrap Hadoop Pipes or implement their own alternatives. Below, I will discuss my experience with a number of tools for using Python to write Hadoop jobs, along with a final comparison of performance and features. One of the features I am interested in is the ease of getting up and running, so I did not attempt to optimize the performance of the individual packages.

As with every large data set, there are bad records. We check for a few kinds of errors in each record including missing fields and wrong n-gram size. For the latter case, we must know the name of the file that is being processed in order to determine the expected n-gram size.
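A minimal version of those checks might look like the following sketch (the filename pattern follows the Google data files, e.g. googlebooks-eng-all-5gram-20090715-0.csv, but the helpers themselves are my own illustration):

```python
import re

def expected_n(filename):
    """Parse the n-gram size out of a data-file name,
    e.g. 'googlebooks-eng-all-5gram-20090715-0.csv' -> 5."""
    m = re.search(r"(\d)gram", filename)
    return int(m.group(1)) if m else None

def is_valid(line, n):
    """A record must have 3 tab-separated fields, an n-gram of
    exactly n words, and numeric year and count fields."""
    fields = line.split("\t")
    if len(fields) != 3:
        return False
    ngram, year, count = fields
    if len(ngram.split()) != n:
        return False
    return year.isdigit() and count.isdigit()
```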

Hadoop Streaming

Hadoop Streaming is the canonical way of supplying any executable to Hadoop as a mapper or reducer, including standard Unix tools or Python scripts. The executable must read from stdin and write to stdout using agreed-upon semantics. One of the disadvantages of using Streaming directly is that while the inputs to the reducer are grouped by key, they are still iterated over line-by-line, and the boundaries between keys must be detected by the user.
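The standard trick is to lean on itertools.groupby, since Streaming hands the reducer its lines already sorted by key. Here is a sketch of that pattern (not the exact reducer used for the n-gram job):

```python
from itertools import groupby
from operator import itemgetter

def parse(stream):
    """Yield (key, count); the key is everything before the last tab."""
    for line in stream:
        key, _, value = line.rstrip("\n").rpartition("\t")
        yield key, int(value)

def summed(pairs):
    # Streaming delivers lines sorted by key, so consecutive lines
    # with equal keys form one group -- no dict of all keys needed.
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield key, sum(v for _, v in group)

# in a real job, the stream would be sys.stdin and the results
# would be written back to sys.stdout:
lines = ["cat\tthe\t1999\t20\n", "cat\tthe\t1999\t13\n", "dog\tthe\t2000\t7\n"]
totals = list(summed(parse(lines)))
```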

Here is the code for the mapper:

    #! /usr/bin/env python

    import os
    import re
    import sys

    # determine value of n in the current block of ngrams by parsing the filename
    ...

Hadoop Streaming separates the key and value with a tab character by default. Because we also separate the fields of our key with tab characters, we must tell Hadoop that the first three fields are all part of the key by setting the stream.num.map.output.key.fields option to 3.

Note that the files mapper.py and reducer.py must be specified twice on the command line: the first time points Hadoop at the executables, while the second time tells Hadoop to distribute the executables around to all the nodes in the cluster.
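For reference, a complete invocation looks roughly like this (the jar path is illustrative and varies by distribution; note that generic -D options must precede the Streaming-specific ones):

```shell
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=3 \
    -input /ngrams \
    -output /output-streaming \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
```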

Hadoop Streaming is clean and very obvious/precise about what is happening under the hood. In contrast, the Python frameworks all perform their own serialization/deserialization that can consume additional resources in a non-transparent way. Also, if there is a functioning Hadoop distribution, then Streaming should just work, without having to configure another framework on top of it. Finally, it’s trivial to send Unix commands and/or Java classes as mappers/reducers.

The disadvantage of Streaming is that everything must be done manually. The user must decide how to encode objects as keys/values (e.g., as JSON objects). Also, support for binary data is not trivial. And as mentioned above, the reducer must keep track of key boundaries manually, which can be prone to errors.

mrjob

mrjob is an open-source Python framework that wraps Hadoop Streaming and is actively developed by Yelp. Since Yelp operates entirely inside Amazon Web Services, mrjob’s integration with EMR is incredibly smooth and easy (using the boto package).

mrjob provides a pythonic API to work with Hadoop Streaming, and allows the user to work with any objects as keys and mappers. By default, these objects are serialized as JSON objects internally, but there is also support for pickled objects. There are no other binary I/O formats available out of the box, but there is a mechanism to implement a custom serializer.
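To see what that costs, here is roughly what a JSON protocol does to every record crossing the Streaming boundary (a sketch of the idea, not mrjob's actual internals):

```python
import json

def encode(key, value):
    # serialize key and value to JSON, joined by a tab for Streaming
    return "%s\t%s" % (json.dumps(key), json.dumps(value))

def decode(line):
    # reverse the trip on the way back in
    raw_key, _, raw_value = line.partition("\t")
    return json.loads(raw_key), json.loads(raw_value)

line = encode(["the", "cat", 1999], 33)
key, value = decode(line)
```

Every single record pays this dumps/loads round-trip in both the mapper and the reducer, which is where much of the overhead shows up in profiles.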

Significantly, mrjob appears to be very actively developed, and has great documentation.

As with all the Python frameworks, the implementation looks like pseudocode:

    #! /usr/bin/env python

    import os
    import re

    from mrjob.job import MRJob
    from mrjob.protocol import RawProtocol, ReprProtocol

    class NgramNeighbors(MRJob):

        # mrjob allows you to specify input/intermediate/output serialization
        # default output protocol is JSON; here we set it to text
        OUTPUT_PROTOCOL = RawProtocol

        def mapper_init(self):
            # determine value of n in the current block of ngrams by parsing filename
            ...

Writing MapReduce jobs is incredibly intuitive and simple. However, there is a significant cost incurred by the internal serialization scheme. A binary scheme would most likely need to be implemented by the user (e.g., to support typedbytes). There are also some built-in utilities for log file parsing. Finally, mrjob allows the user to write multi-step MapReduce workflows, where intermediate output from one MapReduce job is automatically used as input into another MapReduce job.
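To illustrate the multistep idea without pulling in the framework, here is a toy two-step flow in plain Python (this is not mrjob's API; mrjob expresses the same chaining declaratively):

```python
from collections import defaultdict

def run_step(records, mapper, reducer):
    """Toy MapReduce step: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    for key in sorted(groups):
        for out in reducer(key, groups[key]):
            yield out

# step 1: sum counts per (pair, year)
def mapper1(record):
    pair, year, count = record
    yield (pair, year), count

def reducer1(key, values):
    yield key, sum(values)

# step 2: consume step 1's output, totaling each pair across years
def mapper2(record):
    (pair, year), count = record
    yield pair, count

def reducer2(key, values):
    yield key, sum(values)

raw_records = [
    (("cat", "the"), 1999, 20),
    (("cat", "the"), 1999, 13),
    (("cat", "the"), 2000, 7),
]
step1 = run_step(raw_records, mapper1, reducer1)
totals = dict(run_step(step1, mapper2, reducer2))
```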

(Note: The rest of the implementations are all highly similar, aside from package-specific implementation details. They can all be found here.)

dumbo

dumbo is another Python framework that wraps Hadoop Streaming. It seems to enjoy relatively broad usage, but is not developed as actively as mrjob at this point. It is one of the earlier Python Hadoop APIs, and is very mature. However, its documentation is lacking, which makes it a bit harder to use.

It performs serialization with typedbytes, which allows for more compact data transfer with Hadoop, and can natively read SequenceFiles or any other file type by specifying a Java InputFormat. In fact, dumbo enables the user to execute code from any Python egg or Java JAR file.
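For a flavor of why typedbytes is compact: each value is written as a one-byte type code followed by a fixed-size or length-prefixed binary payload. This sketch encodes two of the types by hand (my reading of the wire format; the real typedbytes module covers many more types):

```python
import struct

def tb_string(s):
    # type code 7: UTF-8 string, prefixed by a 4-byte big-endian length
    data = s.encode("utf-8")
    return struct.pack(">bi", 7, len(data)) + data

def tb_int(i):
    # type code 3: 4-byte big-endian signed integer
    return struct.pack(">bi", 3, i)

encoded = tb_string("the cat") + tb_int(33)
```

A string costs its UTF-8 bytes plus 5 bytes of framing, and an int always costs 5 bytes, versus the variable-width text and delimiters that plain Streaming ships around.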

In my experience, I had to manually install dumbo on each node of my cluster for it to work. It only worked if typedbytes and dumbo were built as Python eggs. Finally, it failed to run with a combiner, as it would terminate on MemoryErrors.

The command to run the job with dumbo is

    dumbo start ngrams.py \
        -hadoop /usr \
        -hadooplib /usr/lib/hadoop-0.20-mapreduce/contrib/streaming \
        -numreducetasks 10 \
        -input hdfs:///ngrams \
        -output hdfs:///output-dumbo \
        -outputformat text \
        -inputformat text

hadoopy

hadoopy is another Streaming wrapper that is compatible with dumbo. Similarly, it focuses on typedbytes serialization of data, and directly writes typedbytes to HDFS.

It has a nice debugging feature, in which it can directly write messages to stdout/stderr without disrupting the Streaming process. It feels similar to dumbo, but the documentation is better. The documentation also mentions experimental Apache HBase integration.

With hadoopy, there are two ways to launch jobs:

launch requires Python/hadoopy to be installed on each node in the cluster, but has very little overhead after that.

launch_frozen does not even require that Python is installed on the nodes, but it incurs a ~15 second penalty for PyInstaller to work. (It’s claimed that this can be somewhat mitigated by optimizations and caching tricks.)

Jobs in hadoopy must be launched from within a Python program. There is no built-in command line utility.

I launch hadoopy via the launch_frozen scheme using my own Python script:

    python launch_hadoopy.py

After running it with launch_frozen, I installed hadoopy on all nodes and used the launch method instead. The performance was not significantly different.

pydoop

In contrast to the other frameworks, pydoop wraps Hadoop Pipes, which is a C++ API into Hadoop. The project claims that they can provide a richer interface with Hadoop and HDFS because of this, as well as better performance, but this is not clear to me. However, one advantage is the ability to implement a Python Partitioner, RecordReader, and RecordWriter. All input/output must be strings.

Most importantly, I could not successfully build pydoop via pip or directly from source.

Others

happy is a framework for writing Hadoop jobs through Jython, but seems to be dead.

Disco is a full-blown non-Hadoop reimplementation of MapReduce. Its core is written in Erlang, with the primary API in Python. It is developed at Nokia, but is much less used than Hadoop.

octopy is a reimplementation of MapReduce purely in Python in a single source file. It is not intended for “serious” computation.

Mortar is another option for working with Python that was just recently launched. Through a web app, the user can submit Apache Pig or Python jobs to manipulate data sitting in Amazon S3.

There are several higher-level interfaces into the Hadoop ecosystem, such as Apache Hive and Pig. Pig provides the facility to write user-defined-functions with Python, but it appears to run them through Jython. Hive also has a Python wrapper called hipy.

(Added Jan. 7, 2013) Luigi is a Python framework for managing multistep batch job pipelines/workflows. It is somewhat similar to Apache Oozie, but it has some built-in functionality for wrapping Hadoop Streaming jobs (though the wrapper appears to be light). Luigi has a nice feature of extracting a Python traceback when your Python code crashes a job, and it also has nice command-line features. It has a great introductory README but seems to lack comprehensive reference documentation. Luigi is actively developed and heavily used at Spotify.

Native Java

Finally, I implemented the MR job using the new Hadoop Java API. After building it, I ran it with the standard hadoop jar invocation.

A Note About Counters

In my initial implementations of these MR jobs, I used counters to keep track of the number of bad records. In Streaming, this requires writing messages to stderr. It turns out this incurs a significant overhead: the Streaming job took 3.4x longer than the native Java job. The frameworks were similarly penalized.

Performance Comparison

The MapReduce job was also implemented in Java as a baseline for performance. All values for the Python frameworks are ratios relative to the corresponding Java performance.

Java is obviously the fastest, with Streaming taking 50% longer, and the Python frameworks taking substantially longer still. From a profile of the mrjob mapper, it appears a substantial amount of time is spent in serialization/deserialization. The binary formats in dumbo and hadoopy may ameliorate the problem. The dumbo implementation may have been faster if the combiner was allowed to run.

Feature Comparison

Mostly gleaned from the respective packages’ documentation or code repositories.

Conclusions

Streaming appears to be the fastest Python solution, without any magic under the hood. However, it requires care when implementing the reducer, and also when working with more complex objects.

All the Python frameworks look like pseudocode, which is a huge plus.

mrjob seems highly active, easy-to-use, and mature. It makes multistep MapReduce flows easy, and can easily work with complex objects. It also works seamlessly with EMR. But it appears to perform the slowest.

The other Python frameworks appear to be somewhat less popular. Their main advantage appears to be built-in support for binary formats, but this is probably something that can be implemented by the user, if it matters.

So for the time being:

Prefer Hadoop Streaming if possible. It’s easy enough, as long as care is taken with the reducer.

Prefer mrjob to rapidly get on Amazon EMR, at the cost of significant computational overhead.

Prefer dumbo for more complex jobs that may include complex keys and multistep MapReduce workflows; it’s slower than Streaming but faster than mrjob.

If you have your own observations based on practice, or for that matter any errors to point out, please do so in comments.

Update (10/15/2014): See the presentation below for updates about this topic:

Comments

That’s the first time we see anyone trying this. Mac OS X isn’t a supported Hadoop or Pydoop platform, hence the compilation issues.

I looked at the setup output you posted. There seems to be some sort of incompatibility in boost-python. Unfortunately I don’t have a Mac to try debugging your problem.

Nevertheless, I’ve searched the Internet for your error and it seems that you’re not the first one to encounter it. One project got around the problem simply reordering their #includes: https://groups.google.com/d/msg/pythonvision/eVu5I4vzDfw/pi894YBwEMgJ
A stab in the dark could be to edit pydoop/src/pipes_context.hpp and reverse the order of the lines
#include
#include

For what it’s worth, dumbo is a lot faster when using ctypedbytes and the memory limit safeguards can easily be tweaked.

I’d also be interested to know what things in particular you found to be missing in the docs. I don’t think we’ll have every little detail documented anytime soon, but maybe there’s some low hanging fruit that could be rectified with realistic effort…

Regarding serialization performance, Steve Johnson points out in the HN comments this is likely caused by the default Python JSON library. Switching to simplejson may speed things up significantly, and it is what we use at Yelp.

Hi Uri.
Could you please give us details on the errors you had on the CentOS machines? We have multiple installations running on configurations nominally similar to the one you tried, so it would be very interesting to know what went wrong in your case.

By the way, as Simone was saying, pydoop 0.8 supports OS X Mountain Lion so it would be very nice if you could try a re-install.

Do you know if it’s possible to write and run a MapReduce using Jython and the standard Java Mapreduce API? It is frequently suggested as a viable option but I haven’t found anyone who has actually done it. As you point out, the happy project appears to be dead. Additionally, the word_count.py example included in most hadoop distributions does not seem to work. It relies on jythonc, which was deprecated years ago. I got the old version of Jython with jythonc and compiled it anyway, but the resultant jar would not run on the TaskTrackers.

I believe that Pig lets you write Python UDFs that get run through Jython. I think requiring Jython is a huge roadblock for many people since they want to use their favorite Python modules that are probably not compatible (e.g., NumPy). Mortar data is a new player that makes it easy to write Pig scripts and Python to work on data in S3. According to their website, it should be possible to use all of the Python scientific stack (NumPy, SciPy, etc.).

Thanks for the info. I totally agree that Jython is limited in many respects. I was trying to test the basic Jython MRs for the sake of completeness, but I don’t think they are usable. Is it safe to assume that the included examples (WordCount.py and JythonAbacus.py) are deprecated and Jython (outside of Pig) is not supported?

I liked this writeup when I first saw it and have come back to try to get mrjob working. I keep getting this error however:
IOError: Could not check path hdfs:///test/pg5000.txt
Any suggestions on what could be causing this?
thanks,

Glad the post is helpful. I looked briefly at the mrjob code, and seems that the call to invoke_hadoop is raising a CalledProcessError, which leads to your error. I imagine this could happen if you can’t successfully run hadoop fs -ls in the context that the mrjob code is running. Or perhaps your hadoop configuration files (e.g., site.xml or something like that) are not configured to correctly point to the cluster. Perhaps it would help to explicitly add the hostname of the hadoop cluster to the hdfs:// path? Hope this helps! Also, if you have followup questions, would you mind moving this to one of the user mailing lists? (either mrjob’s mailing list or github issue tracker, or hadoop-users or cdh-users)? Thanks! –Uri

I could not get pydoop to install with Python 2.7.6 — I get an error message about missing python-boost header files.

I have no interest in tracking this down, so this more of a rant than a request for help. I’ve been experimenting with Hadoop and various APIs for a couple of years. I’ve never encountered such bug-ridden code in both Hadoop-core and the APIs for different languages. Programming with Hadoop is a huge time-sink and it gets to the point where it isn’t worth the effort. I’ll wait for something better to come along.