Pangool Benchmark

To demonstrate that Pangool performs comparably to Hadoop, we designed a benchmark and ran it on EC2.
This benchmark was last updated on March 12, 2012.

We take this benchmark seriously, so we have made all of its details publicly available. You can run it at any time by following its instructions; if you find something wrong, notice anything weird, or have any comment, please tell us or submit a pull request.

We used Whirr 0.7.1 to instantiate a CDH3 cluster with four m1.large slave nodes and one master node, with the following properties:

We executed three different examples with three different input sizes. Each example was also implemented comparably
in Hadoop, Crunch and Cascading; you can see the implementations in the pangool-benchmark project.
If you find something wrong, please send us your feedback. You are free to suggest alternative implementations that could perform better.

We included Crunch and Cascading in this benchmark in order to have more data points.
They are, however, different APIs that were conceived with other requirements in mind. Pangool aims to be a replacement for the low-level MapReduce Hadoop Java API, whereas both Crunch and Cascading abstract the user away from MapReduce, which allows them to implement easier flow management, something that Pangool doesn't (yet) aim for.

For Cascading, we used the 1.2.5 stable release. For Crunch, we built the 0.2.0 version in February 2012 and used Avro serialization
because we found it to be much more efficient than TupleWritables.

You'll find a script in pangool-benchmark that launches the full benchmark. Please read its header before running it to
make sure you meet all the conditions for launching it.

These are the results that we obtained in time (seconds) and IO (bytes):

We can see that all implementations had a comparable IO footprint. However, Pangool is clearly much closer to Hadoop in efficiency than Crunch or Cascading. In this case Pangool performed 8% worse than Hadoop, Crunch 27% worse and Cascading 70% worse.

These are the results that we obtained in time (seconds) and IO (bytes):

In this example Hadoop and Pangool are close in both efficiency and IO footprint, while Cascading deviates considerably in both aspects. Pangool performed 5% worse than Hadoop, Crunch 28% worse and Cascading 181% worse.

Standard Word Count

We used the default Word Count implementations of all APIs for this last benchmark. Because Word Count is a simple problem that involves neither joins nor secondary sort, it measures the most basic API overhead.
See the implementations for Pangool, Hadoop, Crunch and Cascading in the pangool-benchmark project.
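For readers unfamiliar with the pattern, here is a minimal, in-memory sketch of the map/combine/reduce steps that Word Count exercises. This is illustrative plain Java with no Hadoop dependencies; it is not any of the benchmarked implementations, and the class and method names are ours.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of Word Count's map/combine/reduce flow,
// not the actual benchmarked Hadoop/Pangool/Crunch/Cascading code.
public class WordCountSketch {

    // Map + combine: tokenize each input line and aggregate counts
    // locally, as a Combiner would before the shuffle phase.
    public static Map<String, Integer> mapAndCombine(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Reduce: merge the partial counts produced by each map task.
    public static Map<String, Integer> reduce(Iterable<Map<String, Integer>> partials) {
        Map<String, Integer> result = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> result.merge(word, n, Integer::sum));
        }
        return result;
    }
}
```

Because the combiner pre-aggregates inside each map task, less data crosses the shuffle, which is why all three combiner-enabled implementations in this benchmark stay close in IO.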

To generate the input datasets, we used the following commands (you'll find them in the launch-benchmark.sh script mentioned above):

These are the results that we obtained in time (seconds) and IO (bytes):

We can see that in this case Hadoop, Pangool (5% worse) and Crunch (9% worse) performed similarly; all three implementations use Combiners. Cascading had 241% overhead even with the use of its in-memory combiners.

Conclusions

We have seen that Pangool performs comparably to Hadoop in three different cases, running only 5 to 8% slower
according to this benchmark. Except for the word count example - where the Pangool code is longer than Hadoop's
because we can't reuse the Combiner, as we are not writing Tuples as output - the Pangool implementations are much
more concise and easier to follow than those of Hadoop, which are extremely complex because of the use of custom
Comparators, custom data types and other boilerplate code.
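To give a flavor of that boilerplate: a secondary sort in the plain MapReduce API needs records grouped by one field but ordered by another within each group, which forces you to write composite keys and custom comparators. The sketch below shows only the comparator logic in plain Java; the Pair type and field names are illustrative, not taken from the benchmark code.

```java
import java.util.Comparator;

// Illustrative composite-key comparator for a secondary sort.
// In raw Hadoop this logic lives in custom WritableComparators
// (one for sorting, one for grouping) plus a custom key type.
public class SecondarySortSketch {

    public static final class Pair {
        public final String group;   // grouping field (e.g. a URL or user id)
        public final long timestamp; // sorting field within the group
        public Pair(String group, long timestamp) {
            this.group = group;
            this.timestamp = timestamp;
        }
    }

    // Sort comparator: order by group first, then by timestamp within
    // each group. A separate grouping comparator would compare only
    // the group field so one reduce call sees the whole sorted group.
    public static final Comparator<Pair> SORT =
            Comparator.comparing((Pair p) -> p.group)
                      .thenComparingLong(p -> p.timestamp);
}
```

Pangool's tuple schemas let you declare the group-by and sort-by fields instead of hand-writing this machinery, which is where most of the conciseness gain in the Secondary Sort example comes from.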

Overall, Cascading wins on code conciseness in all of the examples, but its performance is much worse than that of Hadoop, Pangool or Crunch. This is understandable, since Cascading is a higher-level API that solves many problems that neither Hadoop nor Pangool solves.

On the other hand, we found Crunch to perform very well when used in conjunction with Avro. It is an interesting API because it solves higher-level problems like flow management while retaining decent performance. However, we found the resulting code for both the URL Resolution and Secondary Sort examples to be extremely complex and hard to maintain.