Archive

The Hadoop ecosystem contains many subprojects; HBase and Pig are just two of them.

HBase is the Hadoop database. It lets you manage your data as tables rather than as files.

Pig is a scripting language that generates MapReduce jobs on the fly to get the data you need. It is very compact compared to hand-written MapReduce jobs.

One of the nice things about Pig and HBase is that they can be integrated, thanks to a recently committed patch.

The documentation is not up to date yet (currently it mostly relates to the patch itself). Some can be found in posts like here, but they all lack detailed explanations. Even the Cloudera distribution CDH3 indicates support for this integration, but no sample can be found.

Below I describe the installation and configuration steps needed to make the integration work, provide an example, and finally point out some of the limits of the current release (0.8).

First, install the MapReduce components (JobTracker and TaskTracker): one JobTracker, and as many TaskTrackers as you have data nodes. Each distribution may provide a different installation procedure; I’m using the Cloudera CDH3 distribution, whose MapReduce installation is well documented.

Now proceed with the Pig installation, which is also easy as long as you are not trying the integration with HBase. You only need to install Pig on the client side: not on each DataNode, nor on the NameNode, just on the machine where you want to run the Pig program.

Check your installation by entering the grunt shell (just enter ‘pig’ from the shell).

Now the tricky part: in order to use the Pig/HBase integration you need to make MapReduce jobs aware of the HBase classes. Otherwise you will get a “ClassNotFoundException” or, worse, a ZooKeeper exception like “org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase” during execution. The easy way to do this, without copying the HBase configuration into your Hadoop configuration dir, is to have hadoop-env.sh ask hbase to print its own classpath.
So add the following to your hadoop-env.sh file:
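A minimal sketch of that hadoop-env.sh addition, assuming the hbase launcher script is on the PATH of the user that starts Hadoop (adjust the command to a full path if it is not). The `hbase classpath` command prints the classpath HBase itself runs with, configuration directory included:

```shell
# Prepend HBase's own classpath (jars and conf dir) to Hadoop's classpath.
# Assumption: the `hbase` launcher is on the PATH; the guard keeps
# hadoop-env.sh harmless on machines where HBase is not installed.
if command -v hbase >/dev/null 2>&1; then
  export HADOOP_CLASSPATH="$(hbase classpath)${HADOOP_CLASSPATH:+:$HADOOP_CLASSPATH}"
fi
```

hadoop-env.sh is read when the Hadoop scripts start, so the TaskTrackers most likely need a restart before running jobs see the new classpath.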

You will also need Pig to be aware of the HBase configuration. For this you can use the HBASE_CONF_DIR environment variable (for the CDH release), which is configured by default to be /etc/hbase/conf.
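If the variable is not already set in your environment, something along these lines in the shell that launches Pig should be enough. The path is the CDH default quoted above; any other location is up to your installation:

```shell
# Point Pig at the HBase client configuration directory.
# Keep a value that is already set; otherwise fall back to the CDH default.
export HBASE_CONF_DIR="${HBASE_CONF_DIR:-/etc/hbase/conf}"
```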

OK, your installation should be fine now, so let’s do an example. Let’s assume we have stored in HBase a table named TestTable with a column family named A, that it holds several fields named field0, field1, …, and that we want to extract this information and store it into ‘results/extract’. In this case the Pig script will look like:
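A sketch of such a script, reconstructed from the description in this post rather than copied from a tested listing. The file name extract.pig is made up for illustration, and the hbase:// prefix and the HBaseStorage arguments are written as I understand them for Pig 0.8, so check them against your release:

```shell
# Write the Pig script to a file (extract.pig is a hypothetical name).
cat > extract.pig <<'EOF'
-- Load columns field0 and field1 of family A from TestTable.
-- -loadKey adds the row key as the first field of each tuple.
my_data = LOAD 'hbase://TestTable'
          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('A:field0 A:field1', '-loadKey')
          AS (id, field0, field1);

-- Store the tuples as semicolon-separated text.
STORE my_data INTO 'results/extract' USING PigStorage(';');
EOF

# Run it from the client machine once the cluster is reachable:
# pig extract.pig
```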

So the above script indicates that the my_data relation will contain the fields “field0, field1” and the ID (due to the -loadKey parameter). These fields will be stored as id, field0, field1 under the ‘results/extract’ folder, and the values will be separated by semicolons.

You can also use comparison operators on the key. The operators currently supported are lt, lte, gt and gte, for lower than, lower than or equal, greater than, and greater than or equal.

Note: There is no support for logical operators. You can use more than one comparison operator, and they are chained with AND.
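As an illustration, a key range could be requested by adding the operators to the option string. The keys row100 and row200 and the file name are invented for the example, and the `-gte=value` spelling is the one I know from the HBaseStorage documentation, so treat it as an assumption for your version. Row keys are compared as bytes, so the bounds are lexicographic:

```shell
# Write a variant of the script that restricts the scan to a key range
# (extract_range.pig, row100 and row200 are hypothetical).
cat > extract_range.pig <<'EOF'
-- Keep only rows whose key k satisfies 'row100' <= k < 'row200'.
-- The two bounds are combined with AND.
my_data = LOAD 'hbase://TestTable'
          USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('A:field0 A:field1', '-loadKey -gte=row100 -lt=row200')
          AS (id, field0, field1);

STORE my_data INTO 'results/extract_range' USING PigStorage(';');
EOF

# pig extract_range.pig
```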

Limitations:

The current HBaseStorage does not allow wildcards; that is, if you need all the fields in a row, you need to enumerate them. Wildcards are supported in version 0.9.

You can also use HBaseStorage to store records back into HBase. Nevertheless, its usage is inconsistent, and a bug has already been opened on this.

Jar files (Java ARchives) are very convenient containers: you can pack everything your application needs (at least classes and resources), put the jar on the target environment, and just run java -cp <myapp.jar> <appMain> <command line args> to execute your program.

With a jar file you don’t need scripts or long command lines to set up your classpath for execution. Still, you can do better than configuring the classpath and the main class from the command line: you can use the manifest file for this. Doing so, you can just type java -jar <myapp.jar> <command line args>

The manifest is a text file (property-like) containing information about the archive. As part of this information you can define the main class of the archive and the classpath (as long as you did not pack other jars inside it).

To do so, define the ‘Class-Path’ and ‘Main-Class’ attributes in the manifest. Here is a sample:
Main-Class: sample.package.MyMain
Class-Path: directory-one/sub-directory-one/referenced.jar directory-two/

Keep in mind that:

You specify several directories and/or referenced jars using a space as the delimiter

References to directories and other jars are relative to the location of the jar

A jar referenced through the Class-Path attribute cannot be packed inside your original archive (unless you use a special classloader)

If you have resources in some directory, don’t forget the slash at the end; otherwise the content of the directory is not seen.
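To tie the points above together, here is a small sketch that writes such a manifest from the shell. The file names manifest.txt and myapp.jar and the classes directory are placeholders; the attribute values are the ones from the sample:

```shell
# Write the manifest. Each attribute sits on its own line, and the file must
# end with a newline, because the jar tool ignores an unterminated last line.
printf '%s\n' \
  'Main-Class: sample.package.MyMain' \
  'Class-Path: directory-one/sub-directory-one/referenced.jar directory-two/' \
  > manifest.txt

# Package and run (left commented out: needs a JDK and compiled classes
# in ./classes, both of which are assumptions of this sketch):
# jar cfm myapp.jar manifest.txt -C classes .
# java -jar myapp.jar arg1 arg2
```

At run time, directory-one and directory-two are expected next to myapp.jar, since the entries are resolved relative to the jar.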

Ant is a powerful build and scripting tool provided by the Apache Software Foundation. In a recent project I used the exec task and needed to allow spaces in a command-line argument of the called executable. A legitimate request, but if you don’t pay attention to the different ‘exec’ parameter syntaxes, you may waste a lot of time…

Handling spaces in the “exec” task’s arguments.

The exec task of Ant lets you execute a system command. The arguments of the command are passed as arg sub-tags, each one carrying an attribute named value, line or path. If your argument contains spaces (for example a file path), do not use the line attribute; use value or path instead. The line attribute treats spaces as separators between the command-line arguments of the executed program.

For example, say you need to pass a text file as a command-line argument to a document editing application, and the file is located under C:\Documents and settings\msthoughts\doc.txt

Passing that path through the line attribute will produce an error, since the command line is interpreted as 3 args, each one denoted here with brackets: [C:\Documents] [and] [settings\msthoughts\doc.txt]
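The same splitting can be reproduced in a plain shell, which may help to see what the called program receives. This is only an analogy, not Ant itself: an unquoted variable is split on spaces much like the line attribute, while a quoted one stays whole like the value attribute. The helper count_args is made up for the demonstration:

```shell
# Print how many arguments a command actually receives.
count_args() { echo "$#"; }

doc_path='C:\Documents and settings\msthoughts\doc.txt'

# Unquoted: the shell splits the path on its two spaces.
split_count=$(count_args $doc_path)

# Quoted: the whole path is passed as a single argument.
whole_count=$(count_args "$doc_path")

echo "split: $split_count, whole: $whole_count"   # split: 3, whole: 1
```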

To resolve this in a command-line window you usually wrap the argument containing spaces in quotes. But using quotes from Ant quickly leads to other problems. The better solution is to use the value attribute instead of the line attribute.
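A sketch of a build file using the value attribute, written out from the shell. The project and target names are invented, and notepad.exe merely stands in for the document editing application:

```shell
# Write a minimal build.xml (exec-spaces and open-doc are hypothetical names).
cat > build.xml <<'EOF'
<project name="exec-spaces" default="open-doc">
  <target name="open-doc">
    <!-- notepad.exe stands in for the document editing application -->
    <exec executable="notepad.exe">
      <!-- value passes the whole string as ONE argument, spaces included -->
      <arg value="C:\Documents and settings\msthoughts\doc.txt"/>
    </exec>
  </target>
</project>
EOF

# Run it on a machine where Ant and the executable are available:
# ant open-doc
```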