Description

2008-10-14 15:11:07,639 [main] ERROR org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher - Error message from task (reduce) task_200809241441_9923_r_000000
java.lang.NullPointerException
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:183)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:215)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:166)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.runPipeline(PigMapReduce.java:252)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:222)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Reduce.reduce(PigMapReduce.java:134)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)

Pradeep suspects that the problem is in src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POProject.java, line 374.

Activity


Alan Gates
added a comment - 19/Oct/10 18:49

If you run a script like the above now (on version 0.7), it does not fail, but instead gives the error message "ERROR 1026: Attempt to fetch field 0 from schema of size 0". This is at least a decent error message. The remaining problem is that we do not allow positional notation to work in cases where the schema is undefined, which it is when you say bag{}. So $0 should work.
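The behavior described above can be illustrated with a small Python sketch; this is a hypothetical model of the projection check, not Pig's actual POProject code (the function name and signature are assumptions):

```python
def project(schema_size, index):
    """Model of positional projection ($index) against a schema.

    schema_size is None when the schema is undefined (e.g. bag{});
    per the comment above, projection should then be allowed rather
    than rejected, since the field count is simply unknown.
    """
    if schema_size is not None and index >= schema_size:
        raise ValueError(
            f"ERROR 1026: Attempt to fetch field {index} "
            f"from schema of size {schema_size}")
    return index  # stand-in for the projected field


result = project(None, 0)  # undefined schema: $0 should work
```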


Daniel Dai
added a comment - 11/Jan/11 01:43

We need to decide how to load an empty bag, e.g.:

A = load 'data.txt' as (x: bag{});

Currently, we load x as a bag and do no interpretation inside x. So what we load is a bag of bytearrays.

This, however, causes problems when we do further processing on this bag. Assume that in data.txt the bag actually contains three-item tuples:

B = foreach A generate x.($1, $2);

We expect this to project the 2nd and 3rd fields of each tuple. But in the current code, x is a bag of single-field bytearrays, so this results in an error.

B = foreach A generate flatten(x);

We expect this to flatten x into 3 fields. But in the current code, we cannot even flatten x, since x does not contain tuples.

The problem stems from two sources:
1. Currently a bag requires tuples in some cases but not in others. This is inconsistent. We should make it a rule: loading a bag always means loading a bag of tuples.
2. When we load a tuple with an unknown number of fields (the tuple's inner schema is unknown), we assume it contains only one bytearray field. However, it is not possible to cast one bytearray field to multiple fields later. Recall what happens when we load a file with an unknown schema:

A = load 'data.txt';

We actually load multiple fields separated by the delimiter, each field of type bytearray. When we load an empty bag, we can mimic this behavior.

So I propose two changes:
1. Loading a bag implies loading a bag of tuples, even when the bag's inner schema is empty.
2. When we convert a bytearray to a tuple with no inner schema, we no longer assume one field. We will use comma as the delimiter (in the case of UTF8StorageConverter) and produce a tuple of multiple bytearray fields.
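A minimal sketch of the second proposed change, in Python rather than Pig's Java (the function name and the plain comma split are assumptions; the real UTF8StorageConverter would also need to handle nested tuples and bags, which a naive split would mishandle):

```python
def bytes_to_tuple_no_schema(raw: bytes) -> tuple:
    """Convert a bytearray to a tuple when the inner schema is unknown.

    Old behavior: treat the whole value as a single bytearray field.
    Proposed behavior: split on commas and keep each piece as a
    separate bytearray field.
    """
    return tuple(raw.split(b","))


# Old behavior would assume one field; the proposal yields three:
fields = bytes_to_tuple_no_schema(b"1,2,3")
```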
Assume data.txt is:

{(1,2,3),(4,5,6)}

After this change:

A = load 'data.txt' as (x: bag{});

describe A:
We get: bag{}

dump A:
We get: {(1,2,3),(4,5,6)}

which is not a bag of bytearrays, but a bag of three-field tuples.
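The dump above can be modeled with a small Python sketch; this is an illustrative parser for the flat, non-nested sample record only (the helper name is an assumption, not Pig code):

```python
def parse_bag(text: str):
    """Parse a simple bag literal like '{(1,2,3),(4,5,6)}' into a
    list of tuples of string fields. Handles only flat, non-nested
    records like the data.txt sample above."""
    body = text.strip()[1:-1]          # drop the surrounding {}
    bag = []
    for part in body.split("),"):      # split between tuples
        part = part.strip("()")        # drop the tuple parens
        bag.append(tuple(part.split(",")))
    return bag


bag = parse_bag("{(1,2,3),(4,5,6)}")
# each element is a three-field tuple, not a single bytearray
```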