Overview

Pangool is an open-source implementation of what we call
Tuple MapReduce, based on the Hadoop Java MapReduce API.

Introduction

Pangool is a low-level Java MapReduce API. It aims to be a replacement for the
Hadoop Java MapReduce API. By implementing an intermediate Tuple-based schema and
configuring a Job conveniently, many of the accidental complexities that arise from using
the Hadoop Java MapReduce API disappear. Things like secondary sort and reduce-side joins
become extremely easy to implement and understand. Pangool's performance is comparable to that
of the Hadoop Java MapReduce API. Pangool also augments Hadoop's API by making multiple outputs
and inputs first-class and allowing instance-based configuration.

Compatibility

Pangool is compatible with Hadoop 0.20.X, 1.X and YARN.
Pangool has been tested to run properly on EMR.
For clusters that use some versions of the community Hadoop distribution, adjustments must
be made to Hadoop's classpath to include a newer version of Jackson (1.7.6).
Sometimes, the version of Jackson that Hadoop depends on is quite old (1.0.1) and clashes with
the one that Avro uses. This is a well-known problem (see MAPREDUCE-1700 and MAPREDUCE-1938).
One workaround is to copy the newer Jackson JARs into hadoop/lib and remove the older ones.
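The workaround can be sketched as a shell snippet. The JAR file names and the location of the Hadoop installation are assumptions and will vary between distributions; the snippet below builds a throwaway directory layout purely for illustration:

```shell
# Illustration only: in a real cluster HADOOP_HOME would point at the
# actual Hadoop installation (e.g. /usr/lib/hadoop), and the Jackson
# JARs would come from your build's dependency directory.
HADOOP_HOME=$(mktemp -d)
mkdir -p "$HADOOP_HOME/lib"
# Simulate the old, clashing Jackson 1.0.1 JARs shipped with Hadoop.
touch "$HADOOP_HOME/lib/jackson-core-asl-1.0.1.jar" \
      "$HADOOP_HOME/lib/jackson-mapper-asl-1.0.1.jar"
# Simulate the newer Jackson JARs to be installed.
touch jackson-core-asl-1.7.6.jar jackson-mapper-asl-1.7.6.jar

# The actual workaround: remove the old Jackson JARs from hadoop/lib
# and copy in the newer ones so they win on Hadoop's classpath.
rm -f "$HADOOP_HOME"/lib/jackson-core-asl-1.0.1.jar \
      "$HADOOP_HOME"/lib/jackson-mapper-asl-1.0.1.jar
cp jackson-core-asl-1.7.6.jar jackson-mapper-asl-1.7.6.jar "$HADOOP_HOME/lib/"

ls "$HADOOP_HOME/lib"
```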

Features

Intermediate Tuple-based serialization

By using Tuples instead of (key, value) pairs, users are not forced to write custom
data types (e.g. Writables) or rely on external serialization libraries when working
with more than two fields.
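The idea can be illustrated with a self-contained sketch; the Tuple class and SCHEMA below are simplified stand-ins, not Pangool's actual API. Instead of packing several fields into a hand-written Writable, records are tuples whose fields are declared once in a schema and addressed by name:

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the concept; Pangool's real Schema/Tuple API differs.
class TupleSketch {

    // A schema declares the field names once, for the whole job.
    static final List<String> SCHEMA = Arrays.asList("user", "country", "name");

    // A tuple is just values addressed by field name; no custom Writable needed.
    static class Tuple {
        final Object[] values = new Object[SCHEMA.size()];

        void set(String field, Object value) {
            values[SCHEMA.indexOf(field)] = value;
        }

        Object get(String field) {
            return values[SCHEMA.indexOf(field)];
        }
    }

    public static void main(String[] args) {
        Tuple t = new Tuple();
        t.set("user", "alice");
        t.set("country", "ES");
        t.set("name", "Alice");
        System.out.println(t.get("user") + " / " + t.get("country")); // alice / ES
    }
}
```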

Efficient, easy-to-use secondary sorting

In Pangool you can say groupBy("user", "country"), sortBy("user", "country", "name"). Underneath,
Pangool will use an intelligent and efficient Partitioner, Sort Comparator and Group Comparator,
just as an advanced user would with the plain Hadoop MapReduce API.
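What Pangool derives from such a groupBy/sortBy pair can be sketched in plain Java (illustrative only, not Pangool's internals): the group comparator compares just the groupBy prefix of the fields the sort comparator uses, so records reach the reducer grouped by (user, country) while being ordered by name within each group:

```java
import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch of the comparators implied by
// groupBy("user", "country"), sortBy("user", "country", "name").
class SecondarySortSketch {

    // A record is { user, country, name }.
    static final Comparator<String[]> SORT =
        Comparator.comparing((String[] r) -> r[0])   // user
                  .thenComparing((String[] r) -> r[1])  // country
                  .thenComparing((String[] r) -> r[2]); // name

    // The group comparator only looks at the groupBy prefix.
    static final Comparator<String[]> GROUP =
        Comparator.comparing((String[] r) -> r[0])   // user
                  .thenComparing((String[] r) -> r[1]); // country

    public static void main(String[] args) {
        String[][] records = {
            {"alice", "ES", "Zoe"}, {"alice", "ES", "Ana"}, {"bob", "US", "Max"}
        };
        Arrays.sort(records, SORT);
        // Same-group records are now adjacent and ordered by name.
        System.out.println(Arrays.deepToString(records));
        // prints [[alice, ES, Ana], [alice, ES, Zoe], [bob, US, Max]]
    }
}
```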

Efficient, easy-to-use reduce-side joins

Doing reduce-side joins with Pangool is as simple as it can get. By using
Tuples and configuring your MapReduce jobs properly, you can easily join various
datasets and perform arbitrary business logic on them. Again, Pangool handles the
partitioning, sorting and grouping underneath in an efficient way.
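The pattern behind a reduce-side join can be sketched in plain Java (again illustrative, not Pangool's API): each record is tagged with the dataset it came from, and sorting by (join key, source id) makes one dataset's record lead each reduce group so the other side can be joined against it:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative sketch of a reduce-side join between a "users" dataset
// and an "orders" dataset, joined on a user id.
class ReduceSideJoinSketch {

    static class Tagged {
        final String key;     // join key, e.g. user id
        final int source;     // 0 = users dataset, 1 = orders dataset
        final String payload;
        Tagged(String key, int source, String payload) {
            this.key = key; this.source = source; this.payload = payload;
        }
    }

    static List<String> join(List<Tagged> records) {
        // What the shuffle does: group by key, order sources within a group.
        records.sort(Comparator.comparing((Tagged t) -> t.key)
                               .thenComparingInt(t -> t.source));
        List<String> out = new ArrayList<>();
        String currentKey = null, user = null;
        for (Tagged t : records) {
            if (!t.key.equals(currentKey)) { // a new reduce group starts here
                currentKey = t.key;
                user = null;
            }
            if (t.source == 0) user = t.payload;            // users side first
            else if (user != null) out.add(user + ": " + t.payload);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tagged> recs = new ArrayList<>(List.of(
            new Tagged("u1", 1, "order-42"),
            new Tagged("u1", 0, "Alice"),
            new Tagged("u2", 0, "Bob"),
            new Tagged("u2", 1, "order-7")));
        System.out.println(join(recs)); // [Alice: order-42, Bob: order-7]
    }
}
```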

Instance-based configuration

Mappers, Combiners, Reducers, Input/Output Formats and Comparators can be passed
as object instances. Pangool serializes the instance into the DistributedCache
and re-instantiates it when needed, so boilerplate configuration code is
no longer necessary.
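The mechanism can be sketched as a plain serialization round-trip (the DistributedCache transport is replaced here by an in-memory byte array, and MyMapper is a made-up example class): a configured object is serialized on the client and re-instantiated where it runs, instead of being rebuilt from string properties:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch of instance-based configuration: the configured instance itself
// travels to the task, state included.
class InstanceConfigSketch {

    static class MyMapper implements Serializable {
        final String separator;               // instance state travels along
        MyMapper(String separator) { this.separator = separator; }
        String map(String line) { return line.split(separator)[0]; }
    }

    static byte[] serialize(Serializable obj) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        }
        return bytes.toByteArray();
    }

    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] shipped = serialize(new MyMapper(","));       // client side
        MyMapper mapper = (MyMapper) deserialize(shipped);   // task side
        System.out.println(mapper.map("a,b,c"));             // prints a
    }
}
```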

First-class multiple inputs / outputs

Multiple inputs and outputs are part of Pangool's standard API.

Input / Output Tuple formats

Tuples may be persisted and used as input to other Jobs by using
TupleOutputFormat / TupleInputFormat.

Performance and flexibility

Pangool is an alternative to the Hadoop Java MapReduce API; the same things can be
achieved with either. Pangool's performance is quite close to that of
Hadoop's MapReduce API (see our benchmark with other tools for a reference). Pangool
simply makes life easier for those who require the efficiency and flexibility of
the plain Hadoop Java MapReduce API.