1. Pig is not good at type casting, for instance from bytearray to chararray (String), so it’s always best to be as explicit as you can. For instance, in the ETLProjector, one can specify the type of each field directly: ETLProject('size:int, title:chararray'). Avoid defaulting to bytearray, since it’s the least informative of all types. Also, I don’t think there is a filter function that checks whether a given field can be properly cast to a different type, so don’t bother looking for one.
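A minimal sketch of the same idea in plain Pig Latin, assuming a tab-separated input read with PigStorage. The path and relation names are hypothetical, and PigStorage stands in for whatever loader you actually use:

```pig
-- Hypothetical input path; the field names follow the example above.
-- Declaring types in the AS clause keeps fields from defaulting to bytearray.
typed = LOAD 'input/items.tsv' USING PigStorage('\t')
        AS (size:int, title:chararray);

-- If the data was loaded without a schema, explicit casts do the same job.
untyped = LOAD 'input/items.tsv' USING PigStorage('\t');
recast  = FOREACH untyped GENERATE (int)$0 AS size, (chararray)$1 AS title;
```

Declaring the types at load time is usually the tidier option, since every later statement then sees the intended types.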

2. If the data has already been generated, you can always go back to the .pig_schema file and change the types by hand. For instance, the numeric code for the bytearray type is 50. If your schema should contain only chararray fields, which happens a lot when you don’t care about arithmetic operations on your data, then simply search and replace the type code 50 with 55 (chararray), taking care not to touch any other occurrence of 50 in the file, and your next Pig script will thank you for it.
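One way to check the result of such an edit is to load the data again and print the schema. This is only a sketch: it assumes the data was stored with PigStorage and its '-schema' option, and the path is hypothetical.

```pig
-- Hypothetical output directory that contains the edited .pig_schema file.
-- The '-schema' option asks PigStorage to read the stored schema.
items = LOAD 'output/items' USING PigStorage('\t', '-schema');

-- Print the schema Pig picked up; the edited fields should now be chararray.
DESCRIBE items;
```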

3. Sometimes you simply have too much data. In that case it’s best to filter it down with stringent conditions. Always include a filter of the form $field IS NOT NULL, just to be on the safe side.
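A small sketch of such a filter. The input path, the field names, and the size threshold are hypothetical and only there for illustration:

```pig
-- Hypothetical input with three typed fields.
f = LOAD 'input/shapes.tsv' USING PigStorage('\t')
    AS (color:chararray, size:long, shape:chararray);

-- Drop rows with missing values first, then apply a stricter condition.
-- The threshold on size is an arbitrary example.
f_clean = FILTER f BY color IS NOT NULL
                  AND size IS NOT NULL
                  AND size > 0L;
```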

4. Another common reason for a Pig script to fail is simply running out of memory. So be reasonable about how much data you access, and don’t take more than you need.
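One way to take less data, sketched under the assumption that only some of the columns are needed downstream, is to project the required fields as early as possible. The path and names below are hypothetical:

```pig
-- Hypothetical input with three typed fields.
f = LOAD 'input/shapes.tsv' USING PigStorage('\t')
    AS (color:chararray, size:long, shape:chararray);

-- Keep only the columns the rest of the script uses, before any GROUP.
slim = FOREACH f GENERATE color, size;

by_color = GROUP slim BY color;
totals   = FOREACH by_color GENERATE group AS color, SUM(slim.size) AS total_size;
```

Filtering rows before the GROUP, as in the previous tip, shrinks the grouped bags in the same way.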

5. Finally, a common Pig idiom that comes up over and over is the following:

Let’s say f has 3 fields (color:chararray, size:long, shape:chararray),

f2 = GROUP f BY color;
f3 = FOREACH f2 {
    shapes = DISTINCT f.shape;
    GENERATE group AS color, SUM(f.size) AS total_size, COUNT(shapes) AS num_uniq_shapes;
}

This gives you the expected result: one record per distinct color, containing the sum of size over all rows with that color and the number of distinct shapes among those rows.