Pig Frustrations

My desires to implement better scalability through pre-processing reports via the Grid have lead me to Pig. Unfortunately, while Pig does remove some of the difficulties of writing for Hadoop (you no longer have to write all of the map-reduce jobs yourself in java), it has many limitations.

The biggest limitation I’ve found so far is just lack of documentation. There are a few tutorials, and a language reference, but not really all that much more. Even internally at Yahoo there is only a few more bits of documentation than found at the official apache site. I know that Pig is very new and hasn’t even hit a 1.0 release yet, however the documentation really has to improve in order for people to start using it more.

Even when looking at the language reference itself I find there are limitations. For instance, you cannot use a FOREACH inside of a FOREACH. There’s no nested looping like that. This makes working with complex data structures quite difficult. The system seems to be based off of doing simple sorting, filtering, and grouping. However if you want to group multiple times to create more advanced data structures: bags within bags within bags, and then sort the inner most bag, there isn’t a construct for that that I have found.

Also I’ve run into error messages that do not really explain what went wrong. Since there is little documentation and even users so far, it’s not easy to resolve some of these errors. A quick search will usually result in a snippet of code where the error was defined or a list of error codes on a web page with no real descriptions to speak of. Definitely no possible solutions.

Despite those flaws, it still seems beneficial to write in Pig rather than writing manual MapReduce jobs.

It’s also possible that some of these flaws can be fixed by writing extensions in Java to call from my Pig scripts. I haven’t gathered up all of the requirements to build and compile custom functions into a jar yet, and perhaps that will be the next step. It’s just unfortunate that some of these things which I feel should be fairly basic are not included in the base package.

I’m still researching however and its possible I just haven’t discovered the way to do what I need. If that changes I’ll be sure to update.

I believe one of the features that would help the most is in the pipeline currently: nested Foreach loops. Some of the difficulties I think are related mostly to the documentation however. I plan on writing up my findings after I’ve completed my current task to improve the Pig documentation for other users in the future.