Monday, March 26, 2012

Comparison of Hadoop Frameworks

I had to do simple processing of log files in a Hadoop cluster. Writing Hadoop MapReduce classes in Java is the assembly code of Big Data. There are several high-level Hadoop frameworks that make Hadoop programming easier. Here is the list of Hadoop frameworks I tried:

Pig

Scalding

Scoobi

Hive

Spark

Scrunch

Cascalog

The task was to read log files, join them with other data, and do some statistics on arrays of doubles. Programming this without Hadoop is simple, but it caused me some grief with Hadoop.

This blog post is not a full review, but my first impressions of these Hadoop frameworks.

Pig

Pig is a data flow language / ETL system. It works at a much higher level than programming Hadoop directly in Java.
You are working with named tuples. It is mildly typed, meaning you can define a type for each field in a tuple, or it will default to byte array.

Pig is well documented

Pig scripts are concise

You can use it for both scripts and interactive sessions

You can embed Pig scripts in Python and run them with Jython

Pig is a popular part of the Hadoop ecosystem

Issues

Pig Latin is a new language to learn

Pig Latin is not a full programming language, only a data flow language

You have to write a User Defined Function (UDF) in Java if you want to do something that the language does not support directly

Hive

Hive works on tables made of named tuples with types. It does not check types at write time; you just copy files into the directory that represents a table. Writing to Hive is therefore very fast, but types are checked at read time.

You can run JDBC queries against Hive, as in the sketch below.
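
A minimal sketch of querying Hive over JDBC from Scala, using the original HiveServer driver of that era; the host, port, and logs table are assumptions:

    import java.sql.DriverManager

    object HiveQuery {
      def main(args: Array[String]) {
        // Driver class and URL for the original HiveServer; the newer HiveServer2
        // uses org.apache.hive.jdbc.HiveDriver and jdbc:hive2:// URLs instead.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver")
        val conn = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "")
        val stmt = conn.createStatement()
        // A hypothetical logs table with a user column
        val rs = stmt.executeQuery("SELECT user, COUNT(1) FROM logs GROUP BY user")
        while (rs.next()) {
          println(rs.getString(1) + "\t" + rs.getLong(2))
        }
        conn.close()
      }
    }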

It was easy to get Hive running and I really liked it.

Issues

A problem was that Hive only understands a few file formats:

Text format

Tab delimited format

Hadoop SequenceFile format

Starting from Hive version 0.9, it has support for the Avro file format, which can be used from different languages.

In order to do sum by group I would have to create a User Defined Aggregation Function (UDAF). It turns out that UDFs and UDAFs are badly documented. I did not find any examples of how to write them for arrays of doubles.

Spark

I was very impressed by Spark. It was easy to build with SBT. It was very simple to write my program. It was trivial to define group-by sum for arrays of doubles, just by defining a vector class with addition, as in the sketch below.
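
A minimal sketch of that approach in Scala, using the pre-Apache Spark package names of that era (spark._; current Spark uses org.apache.spark); the input path and record layout are assumptions:

    import spark.SparkContext
    import spark.SparkContext._   // brings reduceByKey into scope on pair RDDs

    // A vector of doubles with element-wise addition, so arrays can be summed.
    case class Vec(values: Array[Double]) {
      def +(other: Vec) = Vec(values.zip(other.values).map { case (a, b) => a + b })
    }

    object LogStats {
      def main(args: Array[String]) {
        val sc = new SparkContext("local", "LogStats")
        sc.textFile("logs.tsv")                              // one log record per line
          .map { line =>
            val fields = line.split("\t")
            // key -> the remaining fields as a vector of doubles
            (fields(0), Vec(fields.drop(1).map(_.toDouble)))
          }
          .reduceByKey(_ + _)                                // element-wise sum per key
          .saveAsTextFile("stats-out")
      }
    }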

Issues

I tested my program in local mode and was very happy that I had a workable solution. Then I investigated moving it to a Hadoop cluster. For this, Spark had a dependency on the Apache Mesos cluster manager. Mesos is a thin virtualization layer that Hadoop can run on top of.

It turned out that Spark does not actually run on Hadoop MapReduce. It runs on HDFS and is an alternative to Hadoop MapReduce. Spark can run side by side with Hadoop if you have Apache Mesos installed.

Spark is an interesting framework that can outperform Hadoop for certain calculations. It uses the same code for running in-memory calculations and for running on a big HDFS cluster.

If you have full control over your cluster Spark could be a good option, but if you have to run on an established Hadoop cluster it is very invasive.


Cascalog

Cascalog was created by Nathan Marz from Twitter. It is written in Clojure, a modern Lisp dialect.

Like Scalding, Cascalog is built on Cascading.

Easy to build with the Clojure build system Leiningen.

It is used as a replacement for Pig. You can run it from the Clojure REPL or run scripts, and get a full and consistent language.

I tried Cascalog and was impressed. It is a good option if you are working in Clojure.

Hadoop vs. Storm

I had to solve the same log file statistics problem in real-time using the Storm framework and this was much simpler.

Why is Hadoop so hard?

Hadoop is solving a hard problem

Hadoop is a big software stack with a lot of dependencies

Libraries only work with specific versions of Hadoop

Serialization is adding complexity, see next section

The technology is still not mature

Looks like some of these problems are getting addressed. Hadoop should be more stable now that Hadoop 1.0 has been released.

Serialization in Hadoop

Java has a built-in serialization format, but it is not memory efficient. Serialization in Hadoop has to:

Be memory efficient

Use compression

Support self-healing

Support splitting a file into several parts

The Hadoop SequenceFile format has these properties, but unfortunately it does not speak so well with the rest of the world.

Serialization adds complexity to Hadoop. One reason Storm is simpler is that it just uses the Kryo Java serialization library to send objects over the wire, as in the sketch below.
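
A minimal sketch of a Kryo round trip in Scala (Kryo 2.x style API); the array-of-doubles payload is just for illustration:

    import com.esotericsoftware.kryo.Kryo
    import com.esotericsoftware.kryo.io.{Input, Output}
    import java.io.ByteArrayOutputStream

    object KryoRoundTrip {
      def main(args: Array[String]) {
        val kryo = new Kryo()

        // Serialize an array of doubles to a compact binary form.
        val bytes = new ByteArrayOutputStream()
        val output = new Output(bytes)
        kryo.writeObject(output, Array(1.0, 2.0, 3.0))
        output.close()

        // Deserialize it back; no schema files or generated code needed.
        val input = new Input(bytes.toByteArray)
        val restored = kryo.readObject(input, classOf[Array[Double]])
        input.close()

        println(restored.mkString(", "))
      }
    }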

Apache Avro is a newer file format that does everything Hadoop needs, and it speaks well with other languages as well. Avro is supported in Pig 0.9 and should be in Hive 0.9. A small write example follows.
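
A minimal sketch of writing an Avro data file from Scala; the schema and record values are made up for illustration:

    import java.io.File
    import org.apache.avro.Schema
    import org.apache.avro.file.DataFileWriter
    import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

    object AvroWrite {
      def main(args: Array[String]) {
        // A hypothetical schema for one log statistics record.
        val schema = new Schema.Parser().parse(
          """{"type": "record", "name": "Stat", "fields": [
            |  {"name": "user",  "type": "string"},
            |  {"name": "value", "type": "double"}
            |]}""".stripMargin)

        val record = new GenericData.Record(schema)
        record.put("user", "sami")
        record.put("value", 1.5)

        // Avro data files are self-describing and splittable across mappers.
        val writer = new DataFileWriter(new GenericDatumWriter[GenericRecord](schema))
        writer.create(schema, new File("stats.avro"))
        writer.append(record)
        writer.close()
      }
    }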

High-level Hadoop frameworks in Java

You do not have to use Scala or Clojure to do high-level Hadoop; you can stay in Java. Cascading and Crunch are two Java-based high-level Hadoop frameworks. They are both based on the idea that you set up a Hadoop data flow with pipes.

Functional constructs are clumsy in Java. It is a nuisance but doable. When you deploy code to a Hadoop cluster you have to pack all your library dependencies into one super jar. When you are using Scala or Clojure you also need to package the whole language runtime into this super jar, which adds complexity. So using Java is a perfectly reasonable choice. A sketch of building such a super jar for Scala follows.
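
For Scala, a hedged sketch of building the super jar with the sbt-assembly plugin; the plugin version and setup style are assumptions based on the SBT of that era:

    // project/plugins.sbt -- the plugin version is an assumption
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.8.8")

    // build.sbt -- pull in the plugin's settings, then run: sbt assembly
    import AssemblyKeys._

    assemblySettings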


Conclusion

I liked all of the Hadoop frameworks I tried, but there is a learning curve and I found problems with all of them.

Extract Transform Load

For ETL, Hive and Pig are my top picks. They are easy to use, well supported, and part of the Hadoop ecosystem. It is simple to integrate prebuilt MapReduce classes into a data flow in both. It is trivial to join data sources. This is hard to do in plain Hadoop.

Cascalog is a serious contender for ETL if you like Lisp / Clojure.

Hive vs. Pig

I prefer Hive. It is based on SQL. You can use your database intuition, and you can access it through JDBC.

Scala-based Hadoop frameworks

They all made Hadoop programming look remarkably close to normal Scala programming.

For programming Hadoop, Scalding is my top pick since I like the named fields; a small sketch follows.
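
A minimal sketch of a Scalding job using named fields; the input layout ('user, 'value) and argument names are assumptions, and API details vary between Scalding versions:

    import com.twitter.scalding._

    class LogStats(args: Args) extends Job(args) {
      // Read tab-separated input and give the columns names.
      Tsv(args("input"), ('user, 'value))
        .groupBy('user) { _.sum('value -> 'total) }  // newer versions want _.sum[Double]
        .write(Tsv(args("output")))
    }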

Both Scrunch and Scoobi are simple and powerful Scala-based Hadoop frameworks. They require Cloudera's Hadoop distribution, which is a very popular distribution.

Comments

You should also have a look at Stratosphere (stratosphere.eu). It has both a Java and a native Scala API. It is probably comparable to Spark, but it has an optimizer, better support for iterative algorithms, and (probably) better support for out-of-core execution.

Since Stratosphere runs with Hadoop YARN, you don't need full control of the cluster.

You have introduced me to some additional tools in the Hadoop toolkit.

Here are some corrections/my two cents.

In Pig and Hive you can add FIELDS TERMINATED BY to the CREATE or LOAD statement to use CSV or whatever. Hadoop MapReduce and HDFS are a performance package; using one without the other doesn't make much sense performance-wise, otherwise you are limited to streaming all of your input and output. Cloudera's Spark distribution combines all the layers so you don't have to do the integration work yourself. Cloudera Manager will do the installs for you, and their Hue package will put a web interface bow on it. Cloudera's Impala can take those Hive queries and speed them up by avoiding the MapReduce overhead and keeping the data in memory between intermediate steps on the distributed nodes.

Hortonworks' YARN is an alternative to the above, and they push SparkR and MapR as R rather than SQL alternatives to Cloudera's SQL-based Impala and R-based Spark / Apache Mahout. Yes, Spark is a distribution and a package.

I'm just getting started with Hadoop and have only begun to discover the tools available and how to use them.

Some useful tips: The RANK statement combined with the FILTER statement in Pig is good for getting rid of headers, uniquely identifying rows, and subsetting loaded data. The latest versions of Hive and Impala CREATE TABLE statements can have a skip.header.line.count table property, saving you from having to throw away valuable information and preprocess input.

About Me

My interests are natural language processing, machine learning, programming language design, artificial intelligence, and science didactics.
I am the author of an open source image processing project called ShapeLogic: https://github.com/sami-badawi/shapelogic-scala.
I have worked in NLP for several years, but spent many years working in cubicles at: Goldman Sachs with market risk, Fitch / Algorithmics with operational risk, BlackRock with mortgage-backed securities, DoubleClick with Internet advertisement infrastructure, and Zyrinx / Scavenger with game development. I have a Master of Science in mathematics and computer science from the University of Copenhagen. For work I have used these programming languages: Scala, Python, Java, C++, C, C#, F#, Mathematica, Haskell, JavaScript, TypeScript, Clojure, Perl, R, Ruby, Slang, Ab Initio (ETL), and VBA. Plus many more programming languages for play.