If we include the piggybank functions in the default import list, we need to make sure that they are compiled and tested in the default build, and that the releases will be blocked due to them not compiling etc. Is that the intention ?

Milind Bhandarkar
added a comment - 18/Jun/09 18:30 If we include the piggybank functions in the default import list, we need to make sure that they are compiled and tested in the default build, and that the releases will be blocked due to them not compiling etc. Is that the intention ?

Daniel Dai
added a comment - 18/Jun/09 18:42 How about this syntax:
Add a new java property: udf.import.list. We can put package name to import to this property. We can put multiple packages, separated by colon.
User runs pig like this:
java -Dudf.import.list=com.xxx.udf1:com.xxx.udf2 ......
For that, user can refer to UDFs in these two packages without specify the package name.

Olga Natkovich
added a comment - 18/Jun/09 18:44 In response to Milind. I don't think we are committing to more support for piggybank. All this does is, if you do use UDFs from piggybank, you don't need to use full package name.

That way, I only add/override the default udf import list. Other wise, we will have two variables - import.list, and udf.import.list, and resolving a udf name will have to check both in specific order.

Milind Bhandarkar
added a comment - 18/Jun/09 18:53 Can we have:
java -Dimport.list += com.xxx.udf1 ...
That way, I only add/override the default udf import list. Other wise, we will have two variables - import.list, and udf.import.list, and resolving a udf name will have to check both in specific order.

Olga, what I am saying is to have a default import list: which contains default UDFs (tokenize, Max, Min, flatten), followed by piggybank contribs. And the same list can be added to / overridden on the command-line. This has several advantages. Pig built-ins do not have to be reserved words, and can be overridden. For example, recent mails on pig-users have mentioned that tokenize+flatten should be a single udf. This can be done by providing a flatten (which is null), and tokenize, which does tokenize+flatten, and existing scripts will still work. This simplifies pig grammar as well. Users can create udf libraries, and use them with:

Milind Bhandarkar
added a comment - 18/Jun/09 19:22 Olga, what I am saying is to have a default import list: which contains default UDFs (tokenize, Max, Min, flatten), followed by piggybank contribs. And the same list can be added to / overridden on the command-line. This has several advantages. Pig built-ins do not have to be reserved words, and can be overridden. For example, recent mails on pig-users have mentioned that tokenize+flatten should be a single udf. This can be done by providing a flatten (which is null), and tokenize, which does tokenize+flatten, and existing scripts will still work. This simplifies pig grammar as well. Users can create udf libraries, and use them with:
java -Dimport.list += `cat my-udf-lib. import `
Thoughts ?

My thinking is that most users will use build-in UDFs. So it is better to be straight-forward to the majority. One thing is, import list is ordered. We can put the buildin UDFs in the end. So if user provide udf with the same name, Pig takes the user defined udf first. How is that?

Daniel Dai
added a comment - 18/Jun/09 19:40 My thinking is that most users will use build-in UDFs. So it is better to be straight-forward to the majority. One thing is, import list is ordered. We can put the buildin UDFs in the end. So if user provide udf with the same name, Pig takes the user defined udf first. How is that?

Instead of a list, if you make it a map (i.e. short name -> fully qualified class name), it will be much easier, as it will guarantee that each name has exactly one udf class associated with it. It will also allow users to use udfs that have class names which are pig reserved words. For example, If I have an existing UDF with a class name such as load or store, I can still use them with a different name like myload, without having to rename the class.

Milind Bhandarkar
added a comment - 18/Jun/09 19:53 Instead of a list, if you make it a map (i.e. short name -> fully qualified class name), it will be much easier, as it will guarantee that each name has exactly one udf class associated with it. It will also allow users to use udfs that have class names which are pig reserved words. For example, If I have an existing UDF with a class name such as load or store, I can still use them with a different name like myload, without having to rename the class.
So, I suggest:
java -jar pig.jar -Dimport.list+=MyLoad:com.xxxx.Load,Flatten:com.xxxx.Flatten,...
If I do not specify -Dimport.list on the pig command line, then the default import.list is used.
Thoughts ?

(1) Builtin UDFs are not reserved words. (Flatten is reserved but it is not a UDF) The issue we have seen is users creating UDFs that had reserved words in the package name and if the package name is registered as proposed in this JIRa, their problem will go away.
(2) I don't think we need to allow to overwrite the defaults. We are not planning to expand the list beyond default distribution (builtins + piggybank.) The plan is to hardwire this values in the code since they are not likely to change
(3) Our plan is to keep it simple and to just allow users to add packages based on what they use in their UDFs.

Olga Natkovich
added a comment - 18/Jun/09 19:55 Milind,
Couple of comments and clarifications:
(1) Builtin UDFs are not reserved words. (Flatten is reserved but it is not a UDF) The issue we have seen is users creating UDFs that had reserved words in the package name and if the package name is registered as proposed in this JIRa, their problem will go away.
(2) I don't think we need to allow to overwrite the defaults. We are not planning to expand the list beyond default distribution (builtins + piggybank.) The plan is to hardwire this values in the code since they are not likely to change
(3) Our plan is to keep it simple and to just allow users to add packages based on what they use in their UDFs.

Olga Natkovich
added a comment - 18/Jun/09 20:10 Also think you are suggesting UDF aliasing on command line which I am not sure is the right place for it.
The scope of this work is just to make it easier for users to refer to their UDFs.

Hardwiring names that are not reserved words is likely to cause more pain in the long run. What Daniel is proposing seems to be right for your assumptions 1, 2, and 3, though. But one can easily think of several use cases where overriding on the command-line eases a lot of pain. Assume that someone writes a new superfast PigStorage UDF, and wants to compare its performance with the default PigStorage provided with Pig. Instead of modifying the entire benchmark suite PigMix to use the new storage UDF, he/she can just make PigStorage point to his own UDF on the commandline and run PigMix. It saves a lot of Pain.

Milind Bhandarkar
added a comment - 18/Jun/09 20:39 Hardwiring names that are not reserved words is likely to cause more pain in the long run. What Daniel is proposing seems to be right for your assumptions 1, 2, and 3, though. But one can easily think of several use cases where overriding on the command-line eases a lot of pain. Assume that someone writes a new superfast PigStorage UDF, and wants to compare its performance with the default PigStorage provided with Pig. Instead of modifying the entire benchmark suite PigMix to use the new storage UDF, he/she can just make PigStorage point to his own UDF on the commandline and run PigMix. It saves a lot of Pain.

Hi, Milind, in the use case you mentioned, he/she can write his own PigStorage, put the jar in the import list. Pig will take user supplied UDF first, thus override the buildin PigStorage. How is this?

Daniel Dai
added a comment - 18/Jun/09 21:23 Hi, Milind, in the use case you mentioned, he/she can write his own PigStorage, put the jar in the import list. Pig will take user supplied UDF first, thus override the buildin PigStorage. How is this?

Olga Natkovich
added a comment - 18/Jun/09 21:52 Milind, we have parameter substitution for what you are mentioning as example.
My proposal would be to keep this issue strictly for the packaging thing. This will already make a lot of people happy and users asked for just that.
We can discuss and understand more user requirements regarding aliases in a separate thread.

Daniel: For that to work, user's class will have to be called PigStorage. And also, inserting user's jars before pig jar for looking up methods can have major unintended consequences. pig.jar should always be the first in the classpath.

Olga: My use case cannot use parameter substitution, because PigMix scrips does not specify PigStorage as, say, $storage. The solution I proposed is as simple to implement as Daniel's original proposal (+= is a syntactic sugar. even = can be used with the same effect.), and it fixes a specific ask, and also allows for extensibility. Am I missing something here ?

Milind Bhandarkar
added a comment - 18/Jun/09 22:08 Daniel: For that to work, user's class will have to be called PigStorage. And also, inserting user's jars before pig jar for looking up methods can have major unintended consequences. pig.jar should always be the first in the classpath.
Olga: My use case cannot use parameter substitution, because PigMix scrips does not specify PigStorage as, say, $storage. The solution I proposed is as simple to implement as Daniel's original proposal (+= is a syntactic sugar. even = can be used with the same effect.), and it fixes a specific ask, and also allows for extensibility. Am I missing something here ?

Hi, Milind,
For your first comment, yes, user's class have to be PigStorage. For your second comment, we do not put user's jar before pig.jar. We put their udf search path first. Let's say user put "-Dudf.import.list=com.xxx.udf1:com.xxx.udf2", when we see an unknown UDF, we first search in the package "com.xxx.udf1", then "com.xxx.udf2", then "org.apache.pig.builtin". We build this policy in our code. It's not put user.jar in front of pig.jar.

Daniel Dai
added a comment - 18/Jun/09 22:28 Hi, Milind,
For your first comment, yes, user's class have to be PigStorage. For your second comment, we do not put user's jar before pig.jar. We put their udf search path first. Let's say user put "-Dudf.import.list=com.xxx.udf1:com.xxx.udf2", when we see an unknown UDF, we first search in the package "com.xxx.udf1", then "com.xxx.udf2", then "org.apache.pig.builtin". We build this policy in our code. It's not put user.jar in front of pig.jar.

Issue is not the complexity of implementation but that I am not sure we want to support command line aliasing and I want to discuss and understand the use cases for it separately. And we can parameterize PigMix if we needed to - that was just an example of an alternative solution for the issue you specified.

I looking for a list of requirements - not a solution.

Another comment is I don't think the solution you are proposing would work. The way the list is used to by prepending the package name to the function name to see if the function exist. It deos not do anything with function name itself.

Olga Natkovich
added a comment - 18/Jun/09 22:32 Milind,
Issue is not the complexity of implementation but that I am not sure we want to support command line aliasing and I want to discuss and understand the use cases for it separately. And we can parameterize PigMix if we needed to - that was just an example of an alternative solution for the issue you specified.
I looking for a list of requirements - not a solution.
Another comment is I don't think the solution you are proposing would work. The way the list is used to by prepending the package name to the function name to see if the function exist. It deos not do anything with function name itself.

Milind Bhandarkar
added a comment - 18/Jun/09 22:53 Olga, specifying a list of packages as a path list will have the same issues as
import com.xyz. package .*;
in java, where it is considered to be a bad practice. So, in the solution that I have proposed, I am assuming the class name is specified on the commandline and not the package name.

Milind Bhandarkar
added a comment - 18/Jun/09 23:08 Daniel,
>>>>Hi, Milind, If a user wrote 10 UDFs, I guess he/she does not suppose to put 10 entries in the command line, right?
No, thats why I have a `cat myudflist` allowed on the command-line.

yes, `cat myudflist` is a way to get around. However, in my humble opinion, this syntax is not very intuitive to the ordinary user. Many users may have the impression that they have to put their UDFs one by one.

Daniel Dai
added a comment - 18/Jun/09 23:22 yes, `cat myudflist` is a way to get around. However, in my humble opinion, this syntax is not very intuitive to the ordinary user. Many users may have the impression that they have to put their UDFs one by one.

Pig streaming already uses backquotes for executing external programs. So, users are familiar with this syntax. I believe an ordinary pig user already knows about doing such things in unix shells. But anyway, as Olga said, she is looking for requirements, and not solutions, so, here is a requirement:

I have two jars: xyz.jar, and abc.jar. I am using two UDFs in my scripts. I want to use function1 from xyz.jar, and function2 from abc.jar. How do I use function2 from abc.jar with full confidence that xyz.jar does not contain a UDF named function2? How do you propose I do that without modifying a whole bunch of pig scripts that I am testing for my functions ?

In the solution that I proposed, I can just change function2 mapping by including "-Dimport.list=function2:com.yahoo.milind.function2" on the command-line.

Milind Bhandarkar
added a comment - 19/Jun/09 00:25 Daniel,
Pig streaming already uses backquotes for executing external programs. So, users are familiar with this syntax. I believe an ordinary pig user already knows about doing such things in unix shells. But anyway, as Olga said, she is looking for requirements, and not solutions, so, here is a requirement:
I have two jars: xyz.jar, and abc.jar. I am using two UDFs in my scripts. I want to use function1 from xyz.jar, and function2 from abc.jar. How do I use function2 from abc.jar with full confidence that xyz.jar does not contain a UDF named function2? How do you propose I do that without modifying a whole bunch of pig scripts that I am testing for my functions ?
In the solution that I proposed, I can just change function2 mapping by including "-Dimport.list=function2:com.yahoo.milind.function2" on the command-line.

Olga Natkovich
added a comment - 19/Jun/09 00:30 You use a fully qualified name for the other one.
I would like for us to continue on our original plan. It might not solve all the issues but it certainly helps and it is a very small change to the current implementation.
We can discuss improvements in a separate JIRA.

As long the suggested improvements do not result in redundancy / make the original solutions obsolete, its fine. But I believe that the core issue, which is, "how does pig resolve UDFs?", is not addressed properly in the "small change to current implementation".

Milind Bhandarkar
added a comment - 19/Jun/09 00:47 Olga,
As long the suggested improvements do not result in redundancy / make the original solutions obsolete, its fine. But I believe that the core issue, which is, "how does pig resolve UDFs?", is not addressed properly in the "small change to current implementation".

One comment - I think the Yahoo line that you commented out should be removed

One question - the way this is implemented, the builtins will take precedence over user defined functions in case of the conflict. I think this is the right approach - I think overwriting builting should be explicit via fully qualified names but I wanted to see what others thought.

Olga Natkovich
added a comment - 23/Jun/09 16:53 Hi Daniel,
The patch looks good.
One comment - I think the Yahoo line that you commented out should be removed
One question - the way this is implemented, the builtins will take precedence over user defined functions in case of the conflict. I think this is the right approach - I think overwriting builting should be explicit via fully qualified names but I wanted to see what others thought.