Category Archives: Hadoop/Pig

In one of my projects, we had a huge number of Pig scripts that dealt with data from a single source. The schema for this common data source is quite complex and changes every few months. Since the schema was repeated in every Pig file, updating all the scripts whenever it changed was a real pain.

I was looking for a way to move the schema into its own Pig file and then include it in all the other Pig scripts, much like importing a class in Java, instead of copy-pasting it into every file.

After some quick web searches, I found that this feature is available in Pig itself from version 0.9 onwards, as part of macro support. All you need to do is add the following line to your Pig script at the point where the other file should be included.

import 'other-file.pig';

You can either give a relative path in the above line or configure a search path (the pig.import.search.path property) that Pig uses to locate the imported scripts.

When I wrote about using Python to write UDFs for Pig, I mentioned that Pig internally uses Jython to run the code, but that 99% of the time this shouldn’t be an issue. Well, I hit the other 1% recently 🙂

I had a small piece of Python code that used the built-in json module to parse JSON data. I converted it into a UDF, and when I tried to call it from Pig I got a “module not found” exception. After some quick checks, I found that the latest stable version of Jython is 2.5.x, while the json module was only added to Python in 2.6.

After some web searches, I came across jyson through a blog post about using JSON in Jython. jyson is a Java implementation of a JSON codec for Jython 2.5 that can be used as a drop-in replacement for Python’s built-in json module.

I downloaded the jyson jar and added it to Pig through the -Dpig.additional.jars property. In the Python code, I changed the import statement to import com.xhaus.jyson.JysonCodec as json. After that everything started to work again 🙂
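A minimal sketch of that import change, written so the same file also runs under regular Python for local testing. The parse_json helper is hypothetical, and the sketch assumes the jyson codec exposes loads the same way the built-in json module does.

```python
# Prefer the built-in json module (Python 2.6+). On Jython 2.5 the
# built-in module does not exist, so fall back to jyson.
try:
    import json
except ImportError:
    # Assumes the jyson jar has been made available to Pig and that
    # JysonCodec provides loads(), mirroring the json module.
    import com.xhaus.jyson.JysonCodec as json


def parse_json(text):
    """Hypothetical helper: decode a JSON string into Python objects."""
    return json.loads(text)
```

With this fallback the rest of the UDF code keeps calling json.loads and does not need to know which implementation was imported.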

Recently I was working with Pig (the Apache one, not the animal 😉 ) and needed to implement some complex logic. Instead of struggling to write it in Pig, I decided to write a UDF (User Defined Function). I was also too lazy to copy-paste the boilerplate needed to write the UDF in Java, so I decided to write it in Python. Long-time readers might know that ever since I learned Python (around 7 years ago), I have been a huge fan.

In the end, I found that writing UDFs in Python is very easy compared with writing them in Java. I thought of writing about it here so that it can act as a starting point for people who also want to write their own UDFs in Python.

Python vs Jython

Before we start, one thing to keep in mind is that even though we write our code in Python, Pig internally executes it using Jython. 99% of the time there will not be any difference, but it is good to be aware of it.

Python code

On the Python side, all we need to do to expose a Python function as a UDF is to add a decorator to it.

Let’s say we have the following Python function that returns the length of the argument that is passed to it.

def get_length(data):
    return len(data)

All we need to expose this function as a UDF is to add the @outputSchema decorator. So the code becomes

@outputSchema("num:long")
def get_length(data):
    return len(data)

When data is passed from Pig to Python, it arrives as a bytearray. Most of the time this is not a problem, but when the code expects a real string we can convert the value into one before we consume it.
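A minimal sketch of that conversion, assuming the value arrives as a sequence of integer byte values. The to_text helper is hypothetical, and the fallback definition of outputSchema is only there so the file can be imported and tested outside Pig, which normally provides the decorator itself.

```python
# Pig makes outputSchema available when it loads the script through
# Jython. This stand-in is only used when the name is missing, for
# example when testing the file with a plain Python interpreter.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


def to_text(data):
    """Hypothetical helper: turn a byte sequence into a plain string."""
    if isinstance(data, str):
        return data
    # Assumes each element is an integer byte value, as in a bytearray.
    # Masking with 0xFF also copes with signed bytes.
    return ''.join([chr(b & 0xFF) for b in data])


@outputSchema("num:long")
def get_length(data):
    return len(to_text(data))
```

Keeping the conversion in a separate helper means every UDF in the file can reuse it.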

Pig code

On the Pig side, we need to do two things.

Register the UDF

Call the UDF 😉

Register the UDF

As I said in the beginning, Pig internally uses Jython to run the Python code, so we first need to register our Python file using the REGISTER statement. We can just say REGISTER 'udf.py' USING jython AS pyudf; where pyudf is the alias under which the functions in the file become available.

Call UDF

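Once we register the UDF using the REGISTER statement, we can call the function through the alias that we created, in the form alias.function_name. With the examples above, that would be pyudf.get_length(...) inside a FOREACH ... GENERATE statement.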

In pretty much every Pig script that you write, you will have to specify at least two locations: the input and the output. If you use multiple inputs or have to register multiple jars for UDFs, this number is bound to increase.

I run most of my Pig scripts through a shell script and was looking for a way to pass in these locations at runtime instead of hard-coding them in the Pig script. After a bit of research, I found that Pig accepts command-line parameters and that there are in fact multiple ways to pass them. I thought of documenting them here so that I know where to look when I need to 🙂

Parameter Placeholder

First, we need to create a placeholder inside the Pig script for the parameter that needs to be replaced. Let’s say you have the following line in your Pig script where you are loading an input file.

INPUT = LOAD '/data/input/20130326'

In the above statement, if you want to replace the date part dynamically, then you have to create a placeholder for it.

INPUT = LOAD '/data/input/$date'
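To illustrate the effect of the substitution, here is a small Python sketch. It only models what happens to the line and is not how Pig implements it; the substitute function is hypothetical and relies on Python's string.Template, which happens to use the same $name placeholder style.

```python
from string import Template


def substitute(line, params):
    """Replace $name placeholders in one script line with their values."""
    return Template(line).substitute(params)


line = "INPUT = LOAD '/data/input/$date'"
print(substitute(line, {"date": "20130326"}))
# prints: INPUT = LOAD '/data/input/20130326'
```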

Individual Parameters

To pass individual parameters to the Pig script we can use the -param option while invoking the Pig script. So the syntax would be

pig -param date=20130326 -f myfile.pig

If you want to pass two parameters then you can add one more -param option.

pig -param date=20130326 -param date2=20130426 -f myfile.pig

Param File

If there are a lot of parameters to pass, or if we need a more flexible way to pass them, we can place all of them in a single file and pass the file name using the -param_file option.

The param file uses the simple ini file format where every line contains the param name and the value. We can specify comments using the # character.

date=20130326
date2=20130426

We can pass the param file using the following syntax

pig -param_file myfile.ini -f myfile.pig
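If the wrapper around Pig is written in Python rather than shell, the command line can be assembled as a list of arguments. This is only a sketch: build_pig_command is a hypothetical helper, it uses just the options described in this post, and it returns the argument list without running anything.

```python
def build_pig_command(script, params=None, param_files=None):
    """Build the argument list for invoking Pig with parameters."""
    cmd = ["pig"]
    # Param files come first; each option and its value are separate items.
    for path in (param_files or []):
        cmd += ["-param_file", path]
    # Individual parameters, sorted by name so the result is predictable.
    for name, value in sorted((params or {}).items()):
        cmd += ["-param", "%s=%s" % (name, value)]
    cmd += ["-f", script]
    return cmd


print(build_pig_command("myfile.pig", {"date": "20130326"}, ["myfile.ini"]))
```

The resulting list can be handed to subprocess.call when the script should actually be run.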

Default Statement

We can also assign a default value to a parameter inside the Pig script using the %default statement, like below.

%default date '20130326'

Processing Order

One good thing about parameter substitution in Pig is that you can pass in a value for the same parameter through multiple options at the same time. Pig resolves them in the following order, from lowest to highest precedence.

The default statement takes the lowest precedence.

The values passed using -param_file take the next precedence.

If there are multiple entries for the same param in a file, then the one that comes later takes precedence.

If there are multiple param files, then the files that are specified later take precedence.

The values passed using the -param option take the highest precedence.

If multiple values are specified for the same param, then the one specified later takes precedence.
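The rules above can be modelled in a few lines of Python. This is a sketch of the precedence as described here, not Pig's own code; parse_param_file and resolve_params are hypothetical names, and the param file contents are passed in as strings to keep the example self-contained.

```python
def parse_param_file(text):
    """Parse name=value lines, skipping blank lines and # comments."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        # A later entry for the same name overwrites an earlier one.
        params[name.strip()] = value.strip()
    return params


def resolve_params(defaults, param_files, cli_params):
    """Apply defaults, then param files, then -param values, in order."""
    resolved = dict(defaults)            # lowest precedence
    for text in param_files:             # later files win
        resolved.update(parse_param_file(text))
    for name, value in cli_params:       # later -param values win
        resolved[name] = value
    return resolved


print(resolve_params(
    {"date": "20130101"},
    ["date=20130326\ndate2=20130426\n"],
    [("date", "20130401")],
))
```

In this example date ends up as 20130401 because the -param value overrides both the file and the default, while date2 keeps the value from the file.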

Debugging

Sometimes the precedence can be a little confusing, especially if you have multiple files and multiple params. Pig provides a -debug option to debug this kind of scenario. If you invoke Pig with this option, it generates a file with the extension .substituted in the current directory, with the placeholders replaced by the resolved values.

What do I use?

I follow this convention while passing params in Pig and it has worked nicely for me so far.

I specify a default value using the default statement and then pass the actual values using the -param_file option. If I am in a hurry and just want to test something locally, I use the -param option, but generally I try to put the values in a separate ini file so that I can check in the options as well.