Prashant Kommireddi
added a comment - 08/Feb/12 19:49 Also, this might confuse Pig users. During testing, a user might expect both manipulate(foo1) and foo1 from the previous relation to be output, in order to check that "manipulate" works correctly. It might be better, as Thejas suggested, to project the required fields.

Thejas M Nair
added a comment - 09/Feb/12 01:39 The project-range jira (PIG-1693) started as 'There needs to be a way in foreach to indicate "and all the rest of the fields"'. But then it turned out that project-range semantics would be easier to understand.
But when I look at your example, it strikes me that we could implement something like an @ operator, which projects all input fields that are not already present in the schema of the relation being generated.
grunt> describe inp;
inp: {a,b,c,d}
grunt> f = foreach inp generate myudf(a,b,c,2) as c, @;
-- in this case @ could mean all columns in input except for column c, as there is already a column c projected here.
grunt> describe f;
f: {c,a,b,d}
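
To make the proposed rule concrete, here is a minimal Python sketch (not Pig code, and not an existing Pig feature) of the semantics described above: @ expands to every input field whose name is not already among the explicitly generated aliases. The function name expand_at is made up for illustration.

```python
def expand_at(input_schema, generated_aliases):
    """Model the proposed '@': explicit aliases come first, followed by every
    input field whose name does not collide with one of those aliases."""
    taken = set(generated_aliases)
    rest = [name for name in input_schema if name not in taken]
    return list(generated_aliases) + rest


# inp: {a,b,c,d};  f = foreach inp generate myudf(a,b,c,2) as c, @;
f_schema = expand_at(["a", "b", "c", "d"], ["c"])
print(f_schema)
```

Under this reading the result depends only on alias names, not on which inputs the UDF consumed.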

Dmitriy V. Ryaboy
added a comment - 10/Feb/12 00:12 Couple of things:
First, you are using the tag "newbie" incorrectly. I know you are trying to be self-effacing and say "I am a newbie" but actually that tag is intended to be "If I am a newbie and want to contribute to Pig, what JIRAs should I tackle first while I get the lay of the land?". This is clearly not one of them.
Second, I think this feature would be oh-my-god confusing to users. The example Thejas used above illustrates the point nicely, actually – we use a udf of a, b, c, and some constant to get a field called c, then project "the rest" – with "the rest" being defined as "anything that doesn't have a name conflict". But "the rest" could just as easily mean "the rest of the columns I didn't use" (so, just d). It also changes the script if you rename one alias – say, you realize you didn't want to call the result of the udf c, but instead want to call it processed_c, and all of a sudden the number of columns produced changes, and their respective ordinals shift. It'll be a nightmare.
Just use a new name when generating your derived column. It's derived, after all.
I'd be ok with some syntax that would indicate columns to not generate ("generate *^a^b"?), but the proposed syntax is fraught with peril.

Russell Jurney
added a comment - 10/Feb/12 00:52 This JIRA is actually, "this bugs me every day. I think I should file a JIRA."
Do you have a problem with the initial proposal, or with Thejas's suggestion?

Dmitriy V. Ryaboy
added a comment - 10/Feb/12 02:02 I have a problem with the semantics you propose because I have a feeling they'll cause more problems than they will solve, as described above.
a = { cat, dog, parrot, fox }
b = foreach a generate cat as dog, @;
-- what do you expect?
c = foreach a generate TOTUPLE(cat, dog) as fox, @;
-- what do you expect?
d = foreach a generate TOTUPLE(cat, dog) as catdog, @;
-- what do you expect?
I think this would be totally confusing, especially when we consider script maintenance, not just script writing. Explicitly specifying what not to project instead of relying on some column name matching voodoo is much more preferable, imo.
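
The ambiguity in these three examples can be spelled out with a small Python model. This is only a sketch of two plausible readings of @ discussed in this thread ("no name conflict" versus "inputs I did not use"); neither is implemented in Pig, and the helper name expand is hypothetical.

```python
def expand(schema, alias, used):
    """Return the output schema under two readings of '@'.

    by_name:  '@' = input fields whose name differs from the generated alias.
    by_usage: '@' = input fields the generated expression did not read.
    """
    by_name = [alias] + [f for f in schema if f != alias]
    by_usage = [alias] + [f for f in schema if f not in set(used)]
    return by_name, by_usage


schema = ["cat", "dog", "parrot", "fox"]
results = {
    "b": expand(schema, "dog", ["cat"]),            # cat as dog, @
    "c": expand(schema, "fox", ["cat", "dog"]),     # TOTUPLE(cat, dog) as fox, @
    "d": expand(schema, "catdog", ["cat", "dog"]),  # TOTUPLE(cat, dog) as catdog, @
}
for relation, (by_name, by_usage) in results.items():
    print(relation, by_name, by_usage)
```

The two readings disagree on every example, and the "usage" reading even yields duplicate column names for b and c, which is the kind of surprise the comment above warns about.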

Thejas M Nair
added a comment - 10/Feb/12 02:18 "I think this would be totally confusing, especially when we consider script maintenance, not just script writing."
Yes, I do worry that pig queries will become like many perl programs that I have seen!
I don't find it very confusing to use "anything that doesn't have a name conflict" as the meaning of @. I think the script should be maintainable as long as you refer to columns by name and not by position. Specifying what not to project is also fine; it is just a few extra keystrokes.
But I value the opinions of Dmitriy and Russell more than mine as they have more experience actually writing pig queries and maintaining them!

Prashant Kommireddi
added a comment - 10/Feb/12 03:06 Having read through the comments above, I kind of want to agree with Dmitriy here. It seems to me that this feature would not be straightforward for most users to understand, and might add some complexity. It does not come naturally to me; e.g., most languages would support functionality such as (pseudo-code):
1. FOREACH A GENERATE a THROUGH z
2. FOREACH A GENERATE $0 THROUGH $5
3. FOREACH A GENERATE $5 THROUGH END
4. FOREACH A GENERATE func(a) as a, func(b) as b;
In all of the above, a user clearly understands what he wants from the syntax. But it gets complicated once there is hidden logic in the syntax (FOREACH A GENERATE func(a) as a, and the rest excluding a since it's already been projected).
Also, how does a user write a statement when he wants both "manipulate(foo1)" and "foo1" to be projected? I guess one would have to say, FOREACH my_relation GENERATE manipulate(foo1) as foo1, foo1 as foo2, * ? Which really becomes the exact opposite case of this JIRA.

I agree with Prashant about avoiding hidden logic. The answers to Dmitriy's examples are not obvious and syntactic sugar should be dead obvious, otherwise it fails. I think a better approach would be to just be explicit with project-range semantics.

Russell Jurney
added a comment - 10/Feb/12 17:36 The semantics I laid out are simple: don't overload a field if you want it auto-included.
Taking a step back... let me state the problem: my pig scripts look more complicated than they are. When I join, or manipulate my data, if I don't explicitly do:
foreach my foo generate foo as foo, bar as bar, etc. then the DESCRIBE of that relation is unreadable. I don't want to rename fields in each and every foreach if the field's function/identity hasn't changed. So I can't use *. Listing out fields to generate AS themselves feels wrong because it is so verbose. And yet I need my clean DESCRIBEs and consistent column names. So my code balloons. I don't know a better way, but maybe there is one.
There are a couple of issues in there, but that is what I'd like to address.
What about a udf? rest(), or others() as in: generate LOWER(foo) as foo, others(); Can the UDF get the input schema and... emit those fields? I don't know if a udf can do that.

Russell Jurney
added a comment - 10/Feb/12 17:44 Prashant, the user would do: foreach A generate change(foo1) as new_thing, foo1;
I really wish there was a way to strip relation:: unless there are dupe fields. That is really what my pain point is.
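
The wish above (strip the relation:: prefix unless that would create duplicate field names) can be sketched in a few lines of Python. This only models the naming rule; it is not how Pig resolves aliases, and strip_prefixes is a made-up helper name.

```python
from collections import Counter


def strip_prefixes(schema):
    """Drop the 'relation::' prefix from each field unless the resulting
    short name would appear more than once in the schema."""
    short_names = [name.split("::")[-1] for name in schema]
    counts = Counter(short_names)
    return [
        short if counts[short] == 1 else full
        for full, short in zip(schema, short_names)
    ]


joined = ["A::id", "A::name", "B::id", "B::score"]
print(strip_prefixes(joined))
```

Only the genuinely ambiguous fields (here the two id columns) keep their prefix; everything else gets a stable short name.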

Scott Carey
added a comment - 10/Feb/12 19:01 I have the same pain point. 65% of the LOC of my pig (1300+ lines in one script ...) are projection related for alias renaming ONLY.
Here is a real chunk with only slightly obscured/shortened names:
P_SOP1 = FOREACH P_SOP GENERATE
F_P2::s as s, o as o, day as day, hour as hour,
datetime as datetime, c as c, ex_s as ex_s,
u as u, ex_u as ex_u, mvtt as mvtt, tid as tid,
vgid as vgid, vcid as vcid, ex_pid as ex_pid,
pid as pid, pgid as pgid, p_q as q, pc as pc,
p_tc as tc;
In order for a script to be maintainable certain aliases need to be 'stable' and NOT contain any XYZPDQ:: prefixes. Otherwise, downstream consumers of the alias will BREAK if the upstream data flow related to XYZPDQ change at all.
I am not so sure I like the exact semantics of the proposed @ operator, but I would use it. The 'continuation' style projections are for a different purpose entirely. One problem is "I need to project 20 of these 40 fields and create a couple derived ones". Another is "alias name cleanup and normalization" – when you do not want to remove or add fields at all, but need to rename a couple and keep the rest. The same feature should not try and do both, it will end up being confusing. Do one thing and do it well. One feature for easy relabeling, one for column pruning/projection. These might combine together to do both in one step, but it wouldn't be so bad if they were two steps as long as they were both very easy to use.
My instinct also says to be careful introducing another operator starting with a new character. Perhaps "* " is better than "@". All alias manipulation built-ins could start with " " as a classification hint.

Scott Carey
added a comment - 10/Feb/12 19:04 Wiki text got the better of my last comment. Here is my last sentence again:
** is better than @. All alias manipulation built-ins could start with * as a classification hint.

Dmitriy V. Ryaboy
added a comment - 10/Feb/12 22:31 Scott, the proposed operator wouldn't solve your problem (getting rid of deduplication prefixes) at all, since the proposed logic would look at a schema "foo::x, bar::x, bar::z" and given "generate foo::x as x, @" would NOT match x in the original schema – the original schema having deduplication prefixes – and happily generate x, foo::x, bar::x, bar::z.
My proposal, however, would: generate foo::x as x, *^foo::x^bar::x;
Pig should already "just work" when you use field aliases without a prefix, except for cases where there is a conflict without the deduplication. File bugs when you encounter cases where it doesn't.
We should be able to drop the prefixes up front when they are not needed. That's a good suggestion, let's do that.
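
A short Python model of the point above, assuming the name-matching reading of @ and the explicit-exclusion reading of *^ (both hypothetical syntax from this thread, not existing Pig features). The helper project_except is made up for illustration.

```python
def project_except(schema, excluded):
    """Keep every field whose full name is not in the excluded list."""
    excluded = set(excluded)
    return [f for f in schema if f not in excluded]


schema = ["foo::x", "bar::x", "bar::z"]

# generate foo::x as x, @;
# '@' excludes fields named like the new alias 'x'; no prefixed name matches.
at_result = ["x"] + project_except(schema, ["x"])

# generate foo::x as x, *^foo::x^bar::x;
# the excluded fields are spelled out, so the prefixed duplicates are dropped.
explicit_result = ["x"] + project_except(schema, ["foo::x", "bar::x"])

print(at_result)
print(explicit_result)
```

The filtering logic is identical in both cases; what differs is where the exclusion list comes from, which is why the implicit version silently keeps the prefixed columns.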

Raghu Angadi
added a comment - 10/Feb/12 22:35 I agree that 'renaming just a couple of fields out of many fields' is a major pain. PIG-1693 should help a lot.
A new statement that lets you rename some of the fields would be helpful.
That said, a new operator like '@' that means "everything else" intuitively feels pretty confusing and error-prone.

Scott Carey
added a comment - 11/Feb/12 00:37
Pig should already "just work" when you use field aliases without a prefix, except for cases where there is a conflict without the deduplication. File bugs when you encounter cases where it doesn't.
Part of it is that in the past, when the bulk of our large scripts were written, it didn't just work. Rather than wrestle with figuring out where the issues were coming from, it became easier to just rename after every group or join.
Now, even if all of those bugs are gone, there is still an issue – even for an alias with no ambiguities at all (or after any inner join), you must still proactively rename columns in order for the script to be maintainable and not leak alias names past their usefulness.
We should be able to drop the prefixes up front when they are not needed. That's a good suggestion, let's do that.
That would be awesome, although there are some script backwards-compatibility details to work out. Combine that with * exclusion syntax and at least you can have an upper bound of N/2 columns to type when only dropping fields.
Might I suggest a syntax more like:
> generate foo::x as x, *^(foo::x, bar::x, baz::x);
instead of
> generate foo::x as x, *^foo::x^bar::x^baz::x;
This is because column names often appear in comma-separated lists in pig, sometimes within parentheses. Delimiting by ^ is less consistent and less pig-like. Also, you can copy-paste lists of columns this way, or break them across lines if there are many.
Another bit of sugar might be wildcards inside that:
> generate foo::x as X, *^::x;
or even
> generate foo::x as x, *^x;
(english -> foo::x as x, and everything else without the short name x)
I agree 'renaming a just a couple of fields out of many fields' is a major pain. PIG-1693 should help a lot.
PIG-1693 looks nice for positional field users, but for those that use names it isn't nearly as powerful. names and positional order don't mix. Naming is great because in larger scripts you can address fields by name and not have to worry about fields being added or removed or re-ordered.
We have aliases that take >100 lines of pig to create, that are shared by many downstream users each with 10 to 100 line scripts. The contract of the alias is its fields and their names. As long as downstream users use names and not positions, upstream changes are safe. Only downstream users that need to know about additional or removed fields are impacted by changes to the upstream script.
In order to use PIG-1693 , order would have to become part of the contract, which is unacceptable for maintenance purposes in very large script collections.
What if you had 30 columns and need to remove 10 of them, randomly distributed? PIG-1693 doesn't work so well with that. A 'project all except' operator would be far more straightforward and clear as to what is happening: *^(drop these columns).
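
As a rough illustration of the 'project all except' idea, including the short-name wildcard sugar, here is a Python sketch. The matching rules are an assumption drawn from the examples in this comment (a prefixed pattern matches one full name; a bare x or ::x matches any field whose short name is x); star_except is a hypothetical helper, not a Pig API.

```python
def star_except(schema, patterns):
    """Model '*^(p1, p2, ...)': keep every field that no pattern excludes."""

    def is_excluded(field):
        short = field.split("::")[-1]
        for pattern in patterns:
            if "::" in pattern and not pattern.startswith("::"):
                # Fully qualified pattern: must equal the full field name.
                if field == pattern:
                    return True
            elif short == pattern.lstrip(":"):
                # Bare 'x' or '::x': match on the short name only.
                return True
        return False

    return [f for f in schema if not is_excluded(f)]


schema = ["foo::x", "bar::x", "baz::x", "foo::y", "bar::z"]
explicit = star_except(schema, ["foo::x", "bar::x", "baz::x"])  # *^(foo::x, bar::x, baz::x)
wildcard = star_except(schema, ["x"])                           # *^x
print(explicit, wildcard)
```

Because exclusion is by name, the result does not change when upstream fields are added or reordered, which is the maintenance property argued for above.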

Russell Jurney
added a comment - 11/Feb/12 01:59 Another thing that would help alleviate these pains would be a functioning and enhanced PigPen of some form, showing the fields at each step with sample records from illustrate in a table, coloring fields that changed in some way, clickable ranges, etc.
In my opinion, visualization is not optional for complex dataflow programming. We're just holding the flows in our heads instead of visualizing them.

Scott Carey
added a comment - 22/Mar/12 17:59 This one is annoying me again today.
I have a relation inbound with ~20 fields. One of them is a bag of about 100 tuples. All I want to do is flatten it and project out two tuples.
B = FOREACH A GENERATE *, FLATTEN(x.(foo, bar)) as flatx;
Ok, now I have a problem:
The bag of 100 is still in the relation, copied 100 times. To get rid of it I need to list every field one by one instead of using *. No, PIG-1693 is not useful. The field order is subject to change. This chunk needs to be resilient to changes in the inbound aliases that do not change the semantic meaning of fields.
Then the next step is to project out the foo and bar from flatx, which will require listing the 20 fields AGAIN.
This issue is generally worse when you are using FLATTEN than simple projection, since it is much more important to drop the fields for performance reasons. Some sane syntax here could easily cut the size of most of my scripts by more than half!

Scott Carey
added a comment - 22/Mar/12 18:04 "All I want to do is flatten it and project out two tuples."
I meant "All I want to do is flatten the tuple, project out two of its fields, and leave every other field alone."
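
The cost described in this comment can be seen in a small Python model, where a row is a dict and the bag is a list of dicts. This is a sketch of the data shapes only, not of Pig's execution; both helper names are invented for illustration.

```python
def flatten_keep_all(row, bag_field, keep):
    """Model 'GENERATE *, FLATTEN(bag.(keep...))': one output row per bag
    tuple, each still carrying every original field, including the bag."""
    return [{**row, **{k: t[k] for k in keep}} for t in row[bag_field]]


def flatten_drop_bag(row, bag_field, keep):
    """Same flatten, but with the bag field excluded from the output, as an
    'all except' projection would allow without listing the other fields."""
    base = {k: v for k, v in row.items() if k != bag_field}
    return [{**base, **{k: t[k] for k in keep}} for t in row[bag_field]]


row = {"id": 1, "x": [{"foo": i, "bar": -i, "other": 0} for i in range(100)]}
kept = flatten_keep_all(row, "x", ["foo", "bar"])
dropped = flatten_drop_bag(row, "x", ["foo", "bar"])
print(len(kept), len(kept[0]["x"]), sorted(dropped[0]))
```

In the first variant every one of the 100 output rows still references the full 100-tuple bag; the second keeps only the scalar fields plus foo and bar.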

Dmitriy V. Ryaboy
added a comment - 29/Apr/12 22:48 I like Scott's proposal of this new syntax:
generate foo::x as x, *^(foo::x, bar::x, baz::x);
Still not clear:
how this should work when projecting all fields inside a tuple, for example after a COGROUP.
should this work for prefixes, eg, "generate someUDF(foo::x) as x, foo::*^(x, y, z)" ?