Records and bytes written reported by pig are wrong in a multi-store program

Description

The stats feature checked in as part of PIG-626 (reporting the number of records and bytes written at the end of the query) prints wrong values (often, but not always, 0) when the Pig script being run contains more than one store.

Alan Gates
added a comment - 02/Jun/09 21:22
There are a couple of issues going on here.
One, PigStats looks through the plan until it finds the first root and then stops. So for multi-store scripts that have multiple roots in their plans, this does not work.
Two, Hadoop does not return accurate numbers for records written in many cases. I do not know if this is a bug in Hadoop or a bug in the output format Pig uses when doing multiple stores in one job.
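
To illustrate the first issue, here is a minimal sketch of the difference between stopping at the first root and collecting every root. It does not use Pig's real plan classes: the PlanRoots class, its method names, the job names, and the choice to model a root as a job with no dependencies are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not Pig's real plan classes: each job name maps to
// the names of the jobs it depends on. A job with no dependencies is
// treated as a root (an assumption of this sketch).
final class PlanRoots {

    // Mirrors the behaviour described above: return as soon as one root is found.
    static List<String> firstRootOnly(Map<String, List<String>> deps) {
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            if (e.getValue().isEmpty()) {
                return Collections.singletonList(e.getKey());
            }
        }
        return Collections.emptyList();
    }

    // Visits every entry so that no root is missed.
    static List<String> allRoots(Map<String, List<String>> deps) {
        List<String> roots = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            if (e.getValue().isEmpty()) {
                roots.add(e.getKey());
            }
        }
        return roots;
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is deterministic.
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("job1", Collections.<String>emptyList());
        deps.put("job2", Collections.<String>emptyList());
        deps.put("job3", Arrays.asList("job1", "job2"));

        System.out.println(firstRootOnly(deps)); // prints [job1]
        System.out.println(allRoots(deps));      // prints [job1, job2]
    }
}
```

With two roots in the plan, the first variant never looks at job2, so any statistics attached to it would be missed.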

Alan Gates
added a comment - 03/Jun/09 15:56
This patch addresses the two problems listed above. It changes the stats patch to collect all root MR jobs instead of just the first it encounters. The second issue (that MR returns bogus results for multi-store scripts) is addressed by having Pig not report records written in this case.
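
The second part of that approach, not reporting a value that cannot be trusted, might look roughly like the sketch below. This is not the code from the patch: the StoreStats class, its method names, and the message wording are hypothetical, and the store count and counter value are assumed to be supplied by the caller.

```java
import java.util.OptionalLong;

// Hypothetical helper, not taken from the patch.
final class StoreStats {

    // Returns the counter value only when the script has a single store.
    // For multi-store scripts the value is withheld rather than reported wrong.
    static OptionalLong recordsWrittenToReport(int storeCount, long counterValue) {
        if (storeCount > 1) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(counterValue);
    }

    // Turns the optional value into a line suitable for end-of-query output.
    static String describe(OptionalLong recordsWritten) {
        if (recordsWritten.isPresent()) {
            return "Records written: " + recordsWritten.getAsLong();
        }
        return "Records written: not reported for multi-store scripts";
    }

    public static void main(String[] args) {
        System.out.println(describe(recordsWrittenToReport(1, 42L)));
        System.out.println(describe(recordsWrittenToReport(2, 0L)));
    }
}
```

Using an empty optional instead of 0 keeps "no reliable value" distinct from "zero records were written".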

Hudson
added a comment - 06/Jun/09 12:40
Integrated in Pig-trunk #465 (see http://hudson.zones.apache.org/hudson/job/Pig-trunk/465/): Turned off reporting of records and bytes written for multi-store queries as the returned results are confusing and wrong.