UDFs in scripting languages

Description

It should be possible to write UDFs in scripting languages such as Python, Ruby, etc. This frees users from needing to compile Java, generate a jar, and so on. It also opens Pig to programmers who prefer scripting languages over Java.

Alan Gates
added a comment - 19/Aug/09 18:22 Attaching some preliminary work by Kishore Gopalakrishna on this. This code is a good start, but not ready for inclusion. It needs to be cleaned up, put in our class structure, etc.
Comments from Kishore:
It contains all the libraries required and also the GenericEval UDF and
GenericFilter UDF.
I didn't get a chance to get the Algebraic function working.
To test it, just unzip the package and run:
rm -rf wordcount/output;
pig -x local wordcount.pig ---> to test eval
pig -x local wordcount_filter.pig ---> to test filter [sorry, it should
be named filter.pig]
cat wordcount/output

Alan Gates
added a comment - 15/Oct/09 02:58 Questions that we need to answer to get this patch ready for commit:
1) How do we do type conversion? The current patch assumes a single string input and output. We'll want to be able to do conversions from scripting languages to pig types that make sense. How this can be done is tied up with #2 below.
2) Do we do this using the Bean Scripting Framework or with specific bindings for each language? This patch shows how to do the specific bindings for Groovy. It can be done for Jython, and I'm reasonably sure it can be done for JRuby. The obvious advantage of using the BSF is we get all the languages they support for free. We need to understand the performance costs of each choice. We should be able to use the existing patch to test the difference between using the BSF and direct Groovy bindings. Also, it seems like type conversions will be much easier to do if we use specific bindings, as we can do explicit type mappings for each language. Perhaps this is possible with BSF, but I'm not sure how.
3) Grammar for how to declare these. I propose that we allow two options: inlined in define and file referenced in define. So these would roughly look like:
define myudf ScriptUDF('groovy', 'return input.get(0).split();');
define myudf ScriptUDF('python', 'myudf.py');
We could also support inlining in the Pig Latin itself, something like:
B = foreach A generate
    {'groovy', 'return input.get(0).split();'}
;
I'm not a fan of this type of inlining, as I think it makes the code hard to read.

Alan Gates
added a comment - 16/Oct/09 23:28 I ran some quick and sloppy performance tests on this. I ran it using both BSF and direct bindings to groovy. I also ran it using the builtin TOKENIZE function in Pig. I had it read 5000 lines of text. The groovy (or TOKENIZE) functions handle splitting the line, then we do a standard group/count to count the words. I got the following results:
Groovy using BSF: 55.070 seconds
Groovy direct bindings: 58.560 seconds
TOKENIZE: 2.554 seconds
So a 30x slowdown using this. That's pretty painful. I know string translation between languages can be bad. I don't know how much of this is inter-language bindings and how much is Groovy. When I get a chance I'll try this in Python and see if I get similar numbers.

Ashutosh Chauhan
added a comment - 16/Oct/09 23:35 30x is indeed too slow. But between BSF and direct bindings, I would have imagined direct bindings should be more performant, since BSF adds an extra layer of translation. Isn't that so?

Alan Gates
added a comment - 16/Oct/09 23:48 I expected to see the direct bindings to be faster as well, but the tests didn't show that. In the code contributed by Kishore the type translation was done the same regardless of the bindings used. Perhaps there would be a more efficient way to do the type translation for direct bindings.

Ashutosh Chauhan
added a comment - 17/Oct/09 00:04 Though, one good learning from this test is that BSF is not slower than direct bindings (needs additional verification though). So this feature could be implemented with a lot less code and complexity using BSF, as opposed to using different direct bindings for different languages. On the other hand, the only useful language BSF currently supports is Ruby. Not sure how many people using Pig will also be interested in Groovy, JavaScript, etc. (the other languages supported by BSF).

Ashutosh Chauhan
added a comment - 17/Oct/09 00:39 Right, I overlooked it. I think Ruby and Python are the two most widely used scripting languages, and both are supported by BSF. So, comparing BSF with direct bindings:
1) Performance: initial tests show them almost equal.
2) Support for multiple languages.
3) Ease of implementation.
To me, BSF seems to be the way to go for this, at least for the first cut. Implementing this feature using BSF will allow us to expose it to users quickly, and if many people are using it and find one particular language to be slow, then we can explore language bindings for that particular language. Thoughts?

Alan Gates
added a comment - 17/Oct/09 01:55 A couple thoughts:
1) I still have to figure out how to do type translation in BSF. The current patch just assumes one string argument and then does reflection on the fly on return to figure out what it is returning. We may or may not be able to expose schemas to scripted UDFs (a la outputSchema and argToFuncMapping), but we at least need to handle multiple and non-string arguments. I need to do more digging in order to understand how to do this type translation in BSF.
2) For at least one of either Jython or JRuby we've got to show better than a 30x differential. There are some products you're just too embarrassed to sell. We may be able to speed this up some by having the framework figure out the return type for this UDF and always convert the returned object based on that return type rather than trying to do reflection.
I don't know Ruby or Python, and I don't have time at the moment to go learn either. If someone is willing to give me snippets of Python and/or Ruby that mimic the split functionality given in the patch, I'm happy to test against those two in BSF and see what happens.
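For reference, a plain-Python sketch of the split functionality from the patch might look like the following. The Groovy body 'return input.get(0).split();' takes the first field of the input tuple (a single chararray) and tokenizes it on whitespace; the function and parameter names here are illustrative, not from the patch.

```python
# Hypothetical Python equivalent of the Groovy split UDF: the input
# tuple carries one chararray field, and the UDF returns the list of
# whitespace-separated tokens.
def split_udf(input_tuple):
    # input_tuple mimics a one-field Pig tuple
    return input_tuple[0].split()

print(split_udf(("the quick brown fox",)))
```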

Ashutosh Chauhan
added a comment - 17/Oct/09 10:25 I did a little research on the topic, and it turned out there is a third option for doing it. JSR-223 [1], "Scripting for Java", has been approved through the JCP and is now part of the Java platform, in the form of javax.script [2], as of Java 6. It seems to aim to provide a consistent API through the Java language itself. No bindings needed, no BSF; all one needs is a "scripting engine". And they claim to have a very long list of supported languages, including awk, Python, Ruby, Groovy, JavaScript, Scheme, PHP, Smalltalk, etc.
It will be interesting to explore this since:
1) Support from the Java platform implies no dependencies on BSF or language-binding jars.
2) Possibly more performant.
3) One consistent API for all scripting languages.
4) A longer list of supported languages.
I am currently reading the APIs; if I get something to work, I will post back here.
[1] http://www.jcp.org/en/jsr/detail?id=223
[2] http://java.sun.com/javase/6/docs/api/javax/script/package-summary.html
[3] https://scripting.dev.java.net/

Ashutosh Chauhan
added a comment - 18/Oct/09 00:45 I did some quick benchmarking using the BSF approach for UDFs written in Ruby, Python, and Groovy, against the native builtin in Pig. It's a standard wordcount example where the UDF tokenizes an input string into words. I used the Pig sources (src/org/apache/pig) as input, which is more than 210K lines. Since I haven't yet figured out type translation, to be consistent across the experiment I passed the data as a String argument with return type Object[] in all languages. Following are the numbers I got, averaged over 3 runs:
Language   Time (seconds)   Factor
Pig              17           1
Ruby            155           9.1
Python          178          10.4
Groovy         1460          85
This shows the Groovy-BSF combo is super slow, while Ruby and Python are much better. These numbers should be seen as an absolute worst case. I believe type translation, compiling the script in the constructor, and using the compiled version instead of evaluating the script in every exec() call will give much better performance. There might also be other optimizations.
Sometime next week, I will try to repeat the same experiment with javax.script.
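The "compile in the constructor" optimization mentioned above can be sketched in plain Python (class and method names are illustrative, not from the patch): the script source is parsed once when the UDF is constructed, and each per-tuple exec() call only evaluates the precompiled code object.

```python
# Sketch of compile-once, evaluate-many: avoid re-parsing the script
# source on every exec() call.
class ScriptEvalUDF:
    def __init__(self, source):
        # one-time compile, analogous to doing this work in the UDF constructor
        self._code = compile(source, "<script-udf>", "eval")

    def exec(self, value):
        # per-tuple call: only evaluates the precompiled code object,
        # exposing the argument under the name 'input'
        return eval(self._code, {"input": value})

udf = ScriptEvalUDF("input.split()")
print(udf.exec("one two three"))
```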

Woody Anderson
added a comment - 04/Feb/10 04:24 Slight error in the js_wc.js script. Change line 9 to:
X = foreach a GENERATE spig_split($0);
and, if you want schema info in the JS impl, change 'bag' to 'b:{tt:(t:chararray)}' on line 4.
setenv PIG_HEAPSIZE 2048
time pig -x local tokenize.pig
41.724u 2.046s 0:30.52 143.3% 0+0k 0+16io 8pf+0w
time pig -x local js_wc.pig
72.079u 2.905s 0:54.50 137.5% 0+0k 0+46io 14pf+0w
time pig -x local pjy_wc.pig
41.588u 2.155s 0:33.58 130.2% 0+0k 0+6io 8pf+0w
So the testing indicates that with this implementation Jython is fairly on par with the Java TOKENIZE impl, and JS is just shy of twice as slow.
There are a lot of reasons the performance of this implementation is startlingly better than the previous numbers, mostly to do with caching the functions, and jython 2.5.1 perhaps being better than whatever Python variant was tried above.
This impl also adheres to the schema system for output data, which does cost some CPU, but is generally not too bad.
The scripter converter does not have a JS handler, but it does convert inlined Jython code (anything between @@ jython @@ and a subsequent @@).
For example (taken from pjy_wc.pjy):
@@ jython @@
def split(a):
    """ @return b:{tt:(t:chararray)} """
    return a.split()
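The inlined Jython block above carries the output schema in the function's docstring. A plain-Python sketch of how a harness might read that annotation back out (the parsing details here are assumptions for illustration, not taken from Woody's implementation):

```python
# Declare the output schema in the docstring, then recover it via the
# function object's __doc__ attribute.
def split(a):
    """@return b:{tt:(t:chararray)}"""
    return a.split()

def declared_output_schema(fn):
    # pull the schema annotation out of the docstring; '@return' marker
    # format is assumed
    doc = fn.__doc__ or ""
    marker = "@return"
    if marker not in doc:
        return None
    return doc.split(marker, 1)[1].strip()

print(declared_output_schema(split))
print(split("one two"))
```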
Anyway, I'd like to discuss these approaches moving into Pig with more out-of-the-box support.
The package org/apache/pig/scripting is meant to be the harness that I'd like to see as part of Pig (or something very like that package).
The packages org/apache/pig/scripting/js and org/apache/pig/scripting/jython are implementations that I think are pretty useful, but could be improved. Distributing these with Pig is certainly debatable; esp. since jython requires jython.jar to function, and the JS implementation is really just a proof of concept for a second language impl (I didn't even make a FilterFunc yet).
The scripter functionality is something I'd like to see supported by the Pig parser as much as possible, but I don't have a great idea of how to do that yet. Perhaps a new statement allowing a user to register a language-pack jar could hook it into the parser to handle file references etc., as manually handling the dependency graph is a major pain. The creation of a Code jar and the invocation of javac (the latter in particular may not be needed) are pretty arduous, so it'd be nice for a general system to make this work.
I tried to write the script so that you could add new language handlers to it, and it would process functions of the form {lang}.{function}(args) and convert appropriately. But I only implemented jython, so the language separation may not be entirely complete; e.g. a language with a very different structure may require some other modifications to the script.
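The {lang}.{function}(args) convention can be sketched as a small dispatcher (all names here are illustrative): a registry maps each language to its handlers, and a reference string like "jython.split" is routed to the right one. Only a jython-style handler is registered, mirroring the text.

```python
# Toy dispatcher for {lang}.{function}(args): split the reference on
# the first '.' to get language and function, then look both up.
HANDLERS = {
    "jython": {
        "split": lambda a: a.split(),
    }
}

def dispatch(ref, *args):
    # "jython.split" -> language "jython", function "split"
    lang, func = ref.split(".", 1)
    return HANDLERS[lang][func](*args)

print(dispatch("jython.split", "x y z"))
```

Adding a new language then only means registering another handler table, which is the extensibility property the comment is after.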
I want to close by saying that the initial inspiration for this work, and the idea of the pre-process script, came from a blog post about a project called baconsnake (http://arnab.org/blog/baconsnake) by Arnab Nandi. That post put me on the track of using Jython from Java code for the first time, and of making the actual script-injecting language tolerable. Many thanks.

Woody Anderson
added a comment - 18/Feb/10 20:52 Did a bit more classloader work, and I removed the need for the rather ugly javac hack.
So now the command line is:
scripter --jars '/tmp/jython.jar:spig.jar:pjy.jar:pjs.jar' -c ./Code.jar -w ./tmp/ -o pjy_wc.pig pjy_wc.pjy
If https://issues.apache.org/jira/browse/PIG-1242 were accomplished, the Code.jar could be omitted in favor of 'register jython_code.py;', which would be even nicer.

Ashutosh Chauhan
added a comment - 20/Feb/10 02:53 Hey Woody,
Great work!! This will definitely be useful for a lot of Pig users. I just hastily looked at your work. One question that struck me: you are doing a lot of heavy lifting to provide multi-language support, figuring out which language the user is asking for and then doing reflection to load the appropriate interpreter and so on. I think it might be easier to use one of the frameworks here (BSF or javax.script) which hide this and handle multiple languages transparently (at least, that's what they claim to do). Have you taken a look at them? These frameworks would arguably help us provide support for more languages without maintaining a lot of code on our part. Though, I am sure they will come at a performance cost (certainly CPU, and possibly memory too).

Woody Anderson
added a comment - 24/Feb/10 19:56 Yes, I've looked at both javax.script and BSF, and neither is well designed for this scenario (in my opinion).
This comes mostly from their extreme generality, and the fact that they do not seem to provide a way to access, and subsequently stash, a consistent reference to a particular function (i.e. a function pointer).
This is partly what allows direct use of the jython interpreter to be so fast. Each invocation uses a function object directly; it does not have to give a name to an 'engine' which looks up the function and decides the appropriate call context, object context, etc.
Those things are great, but not if you don't need them.
Perhaps someone can show me how to use those systems much better than I have been able to, but this approach allows the impl to be agnostic to these frameworks in a way that can boost performance.
As you may have noticed, the JS example uses javax.script, which BSF3 now conforms to. This impl must populate an engine and then use the function name over and over; this involves more function-name lookups and is less conducive to lambda functions etc.
BSF is also extremely easy to integrate under the hood in the same way; it has the same perf costs as javax.script due to the hoop jumping. I tried this out while trying to make Perl work, but the Perl engine is 6 years old and I was unable to get it to work; the BSF binding part worked well enough though.
The reflection overhead is pretty minimal, and not really needed if the user writes the code directly (they can simply use the appropriate package directly).
e.g.
define spig_println_Tchararray_P1 org.apache.pig.scripting.Eval('js','println_Tchararray_P1','chararray','var println_Tchararray_P1 = function(a0) { println(a0); };');
vs.
define spig_println_Tchararray_P1 org.apache.pig.scripting.js.Eval('println_Tchararray_P1','chararray','var println_Tchararray_P1 = function(a0) { println(a0); };');
The top-level Eval is there simply to allow factory-based performance improvements that can be created by knowledgeable implementers.
If the script-engine frameworks provided better access to functions and better call patterns, it would have been preferable to use them.

Prasen Mukherjee
added a comment - 26/Feb/10 03:30 Just curious to know: can we not implement it along the lines of DEFINE commands? In that case we will let the shell take care of scripting issues, with no need to include scripting-specific jars (jython etc.). That might require code changes in core Pig and couldn't be implemented as a separate UDF package, though.

Ashutosh Chauhan
added a comment - 04/Mar/10 03:04 @Prasen
can we not implement it along the lines of DEFINE commands.
Ya, this functionality could be partially simulated using a DEFINE/streaming combination. But that may not be the most efficient way to achieve it. First of all, the streaming script would run in a separate process (as opposed to the same JVM in the approaches discussed above), so there will be CPU cost involved in getting data between the Java process and the streaming-script process. Then there is the cost of serialization and deserialization of parameters: you lose all the type information of the parameters. Once you are in the same runtime, you can start doing interesting things. Also, having scripts in define statements will get kludgy as soon as you start to do complicated things there.
no need to include scripting-specific jars (jython etc.)
Do you mean include them in the Pig distribution, or in Pig's classpath at runtime? In either case that is not necessarily a problem. For the first part, we can use Ivy to pull the jars for us instead of including them in the distribution; for the second, we can ship all the jars required by Pig to the compute nodes.

Ashutosh Chauhan
added a comment - 04/Mar/10 03:16 @Woody
I agree the frameworks will not be performant. I think their usefulness depends on what we want to achieve. If we want to support many different languages, then they might prove useful; if we are only interested in supporting a language or two (it seems Python and Ruby are the most popular), then it won't make sense to pay the overhead associated with them.

Dmitriy V. Ryaboy
added a comment - 04/Mar/10 16:50 FWIW, I would rather a few languages were supported, and were fast, than support a lot of languages that are all unusably slow. Ten times slower than Pig is in the unusable range, imo.

Alan Gates
added a comment - 04/Mar/10 17:06 FWIW - I would rather few languages were supported, and were fast, than support a lot of languages that are all unusably slow. Ten times slower than Pig is in the unusable range, imo.
+1
I think if we can get Python going and make it easy to add Ruby, we'll have satisfied 90% of the potential users. I've had a number of people ask me directly if they could program in either of those languages. I've never had anyone say they wish they could write UDFs in Groovy or JavaScript. I think people will pay a 2x cost for Python or Ruby. I don't think they'll pay 10x.

@Ashutosh
I don't think there is any measurable overhead to the reflection mechanism in the example I provided. The objects are allocated "a few" times due to the schema interrogation logic of Pig (something that might deserve an entirely separate bug thread of discussion, as I have no idea why X copies of a UDF have to be allocated for this).
When it comes time to run (i.e. where it really counts), there is a single invocation of the factory pattern followed by a "huge" (data-set-derived) number of calls to that function. The UDF that is called is fully built and fully initialized with final variables etc., facilitating maximally streamlined execution.
There are certainly questionable things about the approach I took, but language-selection overhead is not one of them. If you have profiling numbers that suggest otherwise, I'd be suitably surprised.

A secondary point to the whole idea of needing some script language code other than, say BSF or javax.script is the idea of type coercion. BSF/javax is not usable in a drop in manner. Each engine unfortunately consumes and produces objects in its own object model. If either of these frameworks had bothered to mandate converting input/output to java.util things would at least be easier, b/c we could convert from that to DataBag/Tuple in a unified manner, but this isn't the case. Thus conversion must be implemented per Engine, at which point, a conversion from PyArray to Tuple is more appropriate than PyArray -> List -> Tuple for performance concerns.
But, even for rudimentary correctness, type conversion must be implemented for each, at which point, a wrapping pattern that selects an appropriate function factory is a necessary pattern anyway.

@Alan/@Dmitriy
Orthogonal to the above point: The idea of trying to support multiple script languages vs. a few. I am personally not of the same mind as you guys i guess.
I think there is near zero 'overhead' perf cost for supporting some unspecified language. Languages continually evolve and new languages emerge that utilize the JVM better and better. I certainly agree that, at this time, jython and jruby seem the best. However, to say that clojure or javascript, or whatever are not going to move forward and potentially become more effectively integrated with the JVM is a bit premature.

I would make the sacrifice if the ability to support multiple languages was actually that hard, or had an actual serious performance cost.
I just don't think those two issues are real.

The performance costs come from the individual scripting engine features with respect to byte-code compliation, function referencing, string manipulation, execution caching etc., and their type coercion complexities.
That is completely different than the cost of PIG supporting multiple languages.
Also, supporting multiple languages is also not that hard. Arnab has thought about this, as have I. I think his ideas, while not perfect, offer a good avenue of exploration and moving forward that offers integration of PIG with any script language. It (importantly) offers to put those languages in PIG instead of the other way around, and it allows for multiple interpreter contexts and even multiple languages.

I'll quote Arnab's quick description here:

DEFINE CMD `SCRIPTBLOCK` script('javascript')
This is identical to the commandline streaming syntax, and follows gracefully in the style of the "ship" and "cache" keywords.

` script('JAVASCRIPT');
Note the use of backticks is consistent with the current syntax, and is unlikely to occur in common scripts, so it saves us the escaping. Also it allows newlines in the code.
The goal is to create namespaces – you can now call your function as "JSBlock.split(a)". This allows us to have multiple functions in one block.

This idea, coupled with the ability to register files and directories directly (e.g. register foo.py provides the ability to load code into an arbitrary namespace/interpreter-scope, load it for an arbitrary language etc.
and the invocation syntax is nice and clean Block.foo() calls a method named foo in the interpreter.
To allow for the easy invocation syntax to perform well, we would need to cause it to execute in the same was as:
define spig_split org.apache.pig.scripting.Eval('jython','split','b:

{tt:(t:chararray)}

');

i don't see that as particularly difficult modification of the function rationalization logic of pig. Actually, i think it's a general improvement as it cuts down on object allocations.

In the event that this methodology is adopted, you are then still free to write projects that stuff PIG inside python or ruby etc. But PIG itself remains an environment that plays well with multiple script engines.

conclusion:
I see it as quite achievable to support any given language with near zero overhead above the lang's scriptengine,
I thing it's quite doable to do this in a flexible model that allows them to be mixed together, even within the same script
I think that, overall this is highly preferable to a single or otherwise finite language situation (though i advocate possibly auto-supporting jython/jruby)

Woody Anderson
added a comment - 04/Mar/10 23:12 @Ashutosh
I don't think there is any measurable overhead to the reflection mechanism in the example I provided. The objects are allocated "a few" times due to the schema interrogation logic of Pig (something that might deserve an entire other bug thread of discussion, as I have no idea why X copies of a UDF have to be allocated for this).
When it comes time to run (i.e. where it really counts), there is a single invocation of the factory pattern followed by a "huge" (data set derived) number of calls to that function. The UDF that is called is fully built and fully initialized with final variables etc., facilitating maximally streamlined execution.
There are certainly things to question about the approach I took, but language selection overhead is not one of them. If you have profiling numbers that suggest otherwise I'd be suitably surprised.
A secondary point to the whole idea of needing some script language code other than, say, BSF or javax.script is the idea of type coercion. BSF/javax is not usable in a drop-in manner. Each engine unfortunately consumes and produces objects in its own object model. If either of these frameworks had bothered to mandate converting input/output to java.util things would at least be easier, because we could convert from that to DataBag/Tuple in a unified manner, but this isn't the case. Thus conversion must be implemented per engine, at which point a conversion from PyArray to Tuple is more appropriate than PyArray -> List -> Tuple for performance concerns.
But, even for rudimentary correctness, type conversion must be implemented for each, at which point, a wrapping pattern that selects an appropriate function factory is a necessary pattern anyway.
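The per-engine, one-pass conversion being argued for here can be sketched in plain Python. Everything below is illustrative: PigTuple and PigBag are hypothetical stand-ins for Pig's Java-side Tuple and DataBag (the real conversion would be Java code written once per scripting engine), but the shape of the idea — converting straight to the Pig data model with no intermediate java.util copy — is the same.

```python
# Illustrative sketch: PigTuple/PigBag are hypothetical stand-ins for
# Pig's Tuple/DataBag. A real implementation would be Java code written
# once per scripting engine, since each engine has its own object model.

class PigTuple:
    def __init__(self, fields):
        self.fields = list(fields)

class PigBag:
    def __init__(self, tuples):
        self.tuples = list(tuples)

def from_python(obj):
    # One-pass conversion straight to the Pig data model, without an
    # intermediate java.util representation (the PyArray -> Tuple rather
    # than PyArray -> List -> Tuple point made above).
    if isinstance(obj, tuple):
        return PigTuple(from_python(f) for f in obj)
    if isinstance(obj, list):
        return PigBag(from_python(t) for t in obj)
    return obj  # scalars pass through unchanged

bag = from_python([("a", 1), ("b", 2)])
```

The recursion is what makes the wrapping pattern per-engine: a JRuby or Rhino engine would need its own version of the two isinstance branches for its native array and hash types.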
@Alan/@Dmitriy
Orthogonal to the above point: The idea of trying to support multiple script languages vs. a few. I am personally not of the same mind as you guys i guess.
I think there is near zero 'overhead' perf cost for supporting some unspecified language. Languages continually evolve and new languages emerge that utilize the JVM better and better. I certainly agree that, at this time, jython and jruby seem the best. However, to say that clojure or javascript, or whatever are not going to move forward and potentially become more effectively integrated with the JVM is a bit premature.
I would make the sacrifice if the ability to support multiple languages was actually that hard, or had an actual serious performance cost.
I just don't think those two issues are real.
The performance costs come from the individual scripting engine features with respect to byte-code compilation, function referencing, string manipulation, execution caching etc., and their type coercion complexities.
That is completely different than the cost of PIG supporting multiple languages.
Also, supporting multiple languages is also not that hard. Arnab has thought about this, as have I. I think his ideas, while not perfect, offer a good avenue of exploration and moving forward that offers integration of PIG with any script language. It (importantly) offers to put those languages in PIG instead of the other way around, and it allows for multiple interpreter contexts and even multiple languages.
I'll quote Arnab's quick description here:
DEFINE CMD `SCRIPTBLOCK` script('javascript')
This is identical to the commandline streaming syntax, and follows gracefully in the style of the "ship" and "cache" keywords.
Thus your javascript example becomes
DEFINE JSBlock `
function split(a)
{
return a.split(" ");
}
` script('JAVASCRIPT');
Note the use of backticks is consistent with the current syntax, and is unlikely to occur in common scripts, so it saves us the escaping. Also it allows newlines in the code.
The goal is to create namespaces – you can now call your function as "JSBlock.split(a)". This allows us to have multiple functions in one block.
This idea, coupled with the ability to register files and directories directly (e.g. register foo.py), provides the ability to load code into an arbitrary namespace/interpreter-scope, load it for an arbitrary language, etc.
And the invocation syntax is nice and clean: Block.foo() calls a method named foo in the interpreter.
To allow the easy invocation syntax to perform well, we would need to cause it to execute in the same way as:
define spig_split org.apache.pig.scripting.Eval('jython','split','b:{tt:(t:chararray)}');
I don't see that as a particularly difficult modification of the function rationalization logic of Pig. Actually, I think it's a general improvement, as it cuts down on object allocations.
In the event that this methodology is adopted, you are then still free to write projects that stuff PIG inside python or ruby etc. But PIG itself remains an environment that plays well with multiple script engines.
conclusion:
I see it as quite achievable to support any given language with near zero overhead above the language's script engine.
I think it's quite doable to do this in a flexible model that allows languages to be mixed together, even within the same script.
I think that, overall, this is highly preferable to a single or otherwise finite language situation (though I advocate possibly auto-supporting jython/jruby).
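For concreteness, the jython 'split' function referenced by the spig_split define in this comment might look like the following. This is a hypothetical body: the schema string b:{tt:(t:chararray)} declares a bag of one-field chararray tuples, which maps naturally to a Python list of one-element tuples.

```python
def split(s):
    # returns a bag of one-field tuples, matching the declared
    # output schema b:{tt:(t:chararray)}
    return [(word,) for word in s.split(" ")]

result = split("hello world")  # -> [("hello",), ("world",)]
```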


Dmitriy V. Ryaboy
added a comment - 04/Mar/10 23:56 Woody, what I meant by my remark was that I disagree with Ashutosh and agree with you, not that I only want to support Python. If using a framework meant we could support 100 jvm-based languages and your approach meant we could support 2, I'd still go with what actually works.
By the way, we should adapt this to create a reflection UDF to call out to Java libraries, so we don't have to wrap things like String.split anymore.


Woody Anderson
added a comment - 05/Mar/10 22:17 Java reflection is very doable, it's kind of a pain i guess, but you could definitely do it. I think using BeanShell might be a way to use java syntax if you want to, but jython and jruby also are quite good at allowing you to call java code very easily and naturally.
What kind of reflection system are you thinking of? Passing a string as input to some function? Or finding some way to assume you can make certain method calls on the objects that represent various data objects in Pig, e.g. $0.split("."), assuming $0 is a chararray/string.
or are you thinking something that equates to:
def splitter java.util.regex.Pattern("\.");
A = foreach B generate splitter.split($0);
to have it perform at 'peak', you'd need to wrap the reflection into the constructor and cache the java.lang.reflect.Method object.
it wouldn't be too hard to write (the assumed impl uses constructor args to determine the correct Method via reflection):
def split org.apache.pig.scripting.Eval('reflect', 'java.util.regex.Pattern', 'split', "\.", 'String', 'b:{tt:(t:chararray)}');
A = foreach B generate split($0);
to be more 'generic' but less performant, you could do it more like this (the assumed impl uses less info to simply reflect a particular object):
def split org.apache.pig.scripting.Eval('reflect', 'java.util.regex.Pattern', 'split', "\.");
A = foreach B generate split('split', $0);
The issue here is that each invocation has to determine the correct Method object (after the first it's probably highly cacheable); also, since the method might change as a result of a different name or different args, the lookup might also produce a different output schema. At any rate, I think you could write reasonably performant caching code for this solution, but it'd be more complicated and a tad slower than the former approach.
Mainly i've tried in all of my impls to do as little as possible in the exec() method, and try to make most objects in use final and immutable (e.g. build them all in the constructor).
You could of course go so far as to delay the creation of the actual Pattern object (i.e. where you first present the split pattern "\."). Again, it lends itself to performance-degrading coding patterns, but if you're careful with your actions, I think you could get most of it back with appropriately cached objects. Doing this in a completely generic fashion... I'll think about it, I guess; I think there's more overhead here than in the other approaches, but if your lib function is more than 'split', the overhead might not be noticeable. Of course, you could implement each of these abstraction levels and use them judiciously.
anyway, there are a lot of options here, are these in line with what you were thinking?
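The "resolve the Method once in the constructor, keep exec() minimal" pattern described above has a direct Python analogue via getattr. This is only a sketch of the idea — ReflectEval is a made-up name, not Pig's actual scripting class:

```python
import re

class ReflectEval:
    # Sketch of the cached-lookup pattern: resolve the target object and
    # its method once, up front, so each exec() call is a plain call.
    def __init__(self, factory, method_name, *factory_args):
        instance = factory(*factory_args)               # built once
        self._method = getattr(instance, method_name)   # looked up once

    def exec(self, *args):
        return self._method(*args)                      # hot path

# rough analogue of:
#   def split org.apache.pig.scripting.Eval('reflect',
#       'java.util.regex.Pattern', 'split', "\.")
split = ReflectEval(re.compile, "split", r"\.")
print(split.exec("a.b.c"))  # -> ['a', 'b', 'c']
```

The "more generic but less performant" variant discussed above would move the getattr lookup into exec(), which is exactly the per-call cost the constructor-time binding avoids.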

Julien Le Dem
added a comment - 13/Mar/10 22:33 Hi,
I'm attaching something I implemented last year. I cleaned it up and updated the dependency to Pig 0.6.0 for the occasion.
There's probably some overlap with previous posts, sorry about the late submission.
Here is my approach.
I wanted to make a couple of things easier:
writing programs that require multiple calls to pig
UDFs
parameter passing to Pig
So I integrated Pig with Jython so that the whole program (main program, UDFs, Pig scripts) could be in one python script.
example: python/tc.py in the attachment
The script defines Python functions that are available as UDFs to pig automatically. The decorator @outputSchema is an easy way to specify what the output schema of the UDF is.
example (see script): @outputSchema("relationships:{t:(target:chararray, candidate:chararray)}")
Also notice that the UDFs use the standard Python constructs: tuple, dictionary and list. They are converted to Pig constructs on the fly. This makes the definition of UDFs in Python very easy. Notice that the UDF takes a list of arguments, not a tuple: the input tuple gets automatically mapped to the arguments.
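The decorator itself can be tiny. A minimal sketch of what @outputSchema might do (a hypothetical implementation, not necessarily the attached one) is simply to attach the schema string to the function, where the UDF wrapper can read it later:

```python
def outputSchema(schema):
    # attach the schema string to the function; the Java-side UDF
    # wrapper would read it (once) to build the output schema
    def decorator(func):
        func.outputSchema = schema
        return func
    return decorator

@outputSchema("relationships:{t:(target:chararray, candidate:chararray)}")
def related(target, candidate):
    # input tuple fields arrive as plain arguments; a list of tuples
    # plays the role of a bag of tuples
    return [(target, candidate)]
```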
Then the script defines a main() function that will be the main program executed on the client.
In the main the Python program has access to a global pig variable that provides two methods (for now) and is designed to be an equivalent to PigServer.
List<ExecJob> executeScript(String script)
to execute a pig script in-lined in Python
deleteFile(String filename)
to delete a file
This looks a little bit like the JDBC approach where you "query" Pig and then can process the data.
Also, you can embed Python expressions in the Pig statements using ${ ... }, for example ${n - 1}. They get executed in the current scope and replaced in the script.
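A minimal sketch of that substitution step (illustrative only; the attachment's actual implementation may differ): each ${...} body is evaluated against the caller's scope and the result is spliced back into the Pig text.

```python
import re

def expand(script, scope):
    # evaluate each ${expression} in the given scope and splice the
    # result into the Pig statement as text
    return re.sub(r"\$\{(.+?)\}",
                  lambda m: str(eval(m.group(1), dict(scope))),
                  script)

n = 3
print(expand("A = LIMIT B ${n - 1};", {"n": n}))  # -> A = LIMIT B 2;
```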
To run the example (assuming javac, jar and java are in your PATH):
tar xzvf pyg.tgz
add pig-0.6.0-core.jar to the lib folder
./makejar.sh
./runme.sh
It runs the following:
org.apache.pig.pyg.Pyg local tc.py
tc.py is a Python script that performs a transitive closure on a list of relations using an iterative algorithm. It defines Python functions.
Limitations:
You cannot include other Python scripts, but this should be doable.
I haven't spent much time testing performance. I suspect the Pig<->Python type conversion to be a little slow as it creates many new objects. It could possibly be improved by making the Pig objects implement the Python interfaces.
(the attachment contains jython.jar 2.5.0 for simplicity)
Best regards, Julien

Julien Le Dem
added a comment - 14/Mar/10 03:17 Hi Woody,
Some comments:
Schema parsing:
I notice that you wrote a Schema parser in EvalBase.
It took me a while to figure out but you can do that with the following Pig class
org.apache.pig.impl.logicalLayer.parser.QueryParser
using the following code:
QueryParser parser = new QueryParser(new StringReader(schema));
result = parser.TupleSchema();
for example:
String schema = "relationships:{t:(target:chararray, candidate:chararray)}";
and you get a Schema instance back.
Different options for passing the Python code to the hadoop nodes:
I notice you pass the Python functions by creating a .py file included in the jar which is then loaded through the class loader.
I pass the python code to the nodes by adding it as a parameter of my UDF constructor (encoded in a string). The drawback is that it is verbose as it gets included for every function.

Woody Anderson
added a comment - 15/Mar/10 17:34 @julien
have read over your code.
1. schema parsing: yup, i much prefer re-using the parser, i wasn't able to find that impl, but should have been more diligent in looking for it.
2. i love the outputSchema decorator pattern that you use.
3. code via a .py file vs. string literal in the constructor. The .py file is a definite win when dealing with encoding issues (quotes, newlines etc). It's also a cleaner way to import larger blocks of code, and works for jython files etc. that are used indirectly etc. The constructor pattern is still supported in my approach, i just use it exclusively for lambda functions.
4. the pyObjToObj code is simpler in your approach, but limits the integration flexibility. i.e. you explicitly tie tuple:tuple, list:bag. Also, it's not clear how well this would handle sequences and iterators etc. I personally prefer using the schema to disambiguate the conversion, so that existing python code can be used to generate bags/tuples etc. via the schema rather than having to convert python objects using wrapper code.
5. the outputSchema logic is nice (as i said in #2, i love the decorator thing), but the schema should be cached if it is not a function. If it's a function, then the ref should be cached. This is particularly important if you're using the schema to inform the python -> pig data coercion.
6. as i said in prev comments, the scope of the interpreter is important. If you have two different UDFs that you want to share any state (such as counters), then a shared interpreter is a good idea. There are also memory gains from sharing etc. In general, i think you rarely want a distinct interpreter, and as such it should be possible, but not the default.
Anyway, thanks for attaching the submission, i think there are lots of great ideas in your project. It makes me wish i'd known about it sooner, parsing the pig schema system was not a fun day, though i guess i did learn a bit from it. The decorator thing is lovely. I'll probably borrow those and produce a tighter jython and scripting harness at some point.
Overall, i'm still firmly in the multi-language camp, but i think this provides nice improvements for a jython impl, and can clearly still swallow whatever language support pig introduces for anyone who wants to drive pig from python. So i think it should still be useful as a standalone project/harness.

Julien Le Dem
added a comment - 21/Mar/10 21:49 @Woody
The main advantage of embedding Pig calls in the scripting language is that it enables iterative algorithms, which Pig is not very good at currently. Why would we limit users to UDFs when they can have their whole program in their scripting language of choice?
4. Python is a very interesting language to integrate with Pig because it has all the same native data structures (tuple:tuple, list:bag, dictionary:map) which makes the UDFs compact and easy to code. That said, in scripting languages that don't match as well as Python to the Pig types, using the schema to disambiguate will be a must have.
When do we need to convert sequences and iterators? Pig has only tuple, bag and map as complex types, AFAIK.
5. Agreed, it should be cached or initialised at the beginning.
3. and 6. I'll investigate passing the main script through the classpath when I have time. One interpreter would be nice to save memory and initialization time. I'm not sure the shared state is such an advantage as UDFs should not rely on being run in the same process. Maybe I'm just missing something.
About the multi language: I'm not against it, but there's not that much code to share.
The scripting<->pig type conversion is specific to each language as you mentioned. also calling functions, getting a list of functions, defining output schemas will be specific.
How I see the multilanguage:
pig local|mapred -script {language} {scriptfile}
main program:
generic: loads the script file
generic: makes the script available in the classpath of the tasks (through a jar generated on the fly?)
specific: initializes the interpreter for the scripting language
specific: adds the global variables defined by pig for the main (in my case: decorators, pig server instance)
generic: loads the script in the interpreter
specific: figures out the list of functions and registers them automatically as UDFs in PIG using a dedicated UDF wrapper class
specific: run the main
Pig execute call from the script:
generic: parse the Pig string to replace ${expression} with the value of the expression as evaluated by the interpreter in the local scope.
UDF init:
generic: loads the script from the classpath
specific: initializes the interpreter for the scripting language
specific: add the global variables defined by pig for the UDFs (in my case: decorators)
generic: loads the script in the interpreter
specific: figures out the runtime for the outputSchema: function call or static schema (parsing of schema generic)
UDF call:
specific: convert a pig tuple to a parameter list in the scripting language types
specific: call the function with the parameters
specific: convert the result to Pig types
generic: return the result
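The generic/specific split sketched above maps naturally onto a small template-method class: the base class owns the generic steps and subclasses supply the language-specific ones. Names here are illustrative (not Pig's actual API), with a toy in-process Python engine standing in for a real Jython or JRuby one:

```python
from abc import ABC, abstractmethod

class ScriptEngine(ABC):
    # generic driver: the same sequence for every language
    def register(self, script_text):
        self.init_interpreter()        # specific
        self.load(script_text)         # specific
        return self.list_functions()   # specific

    @abstractmethod
    def init_interpreter(self): ...

    @abstractmethod
    def load(self, script_text): ...

    @abstractmethod
    def list_functions(self): ...

class ToyPythonEngine(ScriptEngine):
    # the 'specific' part for one language, using a plain exec() scope
    def init_interpreter(self):
        self.scope = {}

    def load(self, script_text):
        exec(script_text, self.scope)

    def list_functions(self):
        # these are the names that would be auto-registered as UDFs
        return [name for name, value in self.scope.items()
                if callable(value) and not name.startswith("__")]

engine = ToyPythonEngine()
udfs = engine.register("def upper(s):\n    return s.upper()")
print(udfs)  # -> ['upper']
```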

Dmitriy V. Ryaboy
added a comment - 05/Apr/10 17:56 Woody,
I submitted my attempt at generic Java invocation in PIG-1354. Would appreciate feedback. It's fairly limited (it only works for methods that return one of the classes that have a Pig equivalent, and take parameters of the same), but I've already found it quite useful, even in its limited state. Had to break out a separate class for each return type; Pig was giving me trouble otherwise.

Julien Le Dem
added a comment - 29/Apr/10 02:03 I implemented the modifications mentioned in my previous comment:
https://issues.apache.org/jira/browse/PIG-928?focusedCommentId=12847986&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12847986
To run the example (assuming javac, jar and java are in your PATH):
tar xzvf pyg.tgz
add pig-0.6.0-core.jar to the lib folder
./makejar.sh
./runme.sh
The Python implementation is now decoupled from the generic code; the script code is passed through the classpath.
To implement other scripting languages, extend org.apache.pig.greek.ScriptEngine.
I renamed this thing Pig-Greek.

Nicolas Torzec
added a comment - 05/May/10 19:50 On the benchmarking side,
I had a look at the benchmark comparing native Pig built-in functions with UDFs written in Ruby, Python and Groovy using the BSF approach.
For the sake of comprehensiveness, couldn't we also compare it with Pig streaming through Ruby, Python and Groovy?

Arnab Nandi
added a comment - 24/May/10 11:40 Building on Julien's and Woody's code, this patch provides pluggable scripting support in native Pig.
##Syntax:##
register 'test.py' USING org.apache.pig.scripting.jython.JythonScriptEngine;
This makes all functions inside test.py available as Pig functions.
##Things in this patch: ##
1. Modifications to parser .jjt file
2. ScriptEngine abstract class and Jython instantiation.
3. Ability to ship .py files similar to .jars, loaded on demand.
4. Input checking and Schema support.
##Things NOT in this patch: ##
1. Inline code support: (Replace 'test.py' with `multiline inline code`, prefer to submit as separate bug)
2. Scripting engines and examples other than Jython (e.g. BeanShell and Rhino)
3. Junit-based test harness (provided as test.zip)
4. Python<->Pig Object transforms are not very efficient (see calltrace.zip). Preferred the cleaner implementation first. (non-obvious optimizations such as object reuse can be introduced as separate bug)
##Notes: ##
1. I went with "register" instead of "define" since files can contain multiple functions, similar to .jars. imho this makes more sense, using define would introduce the concept of "codeblock aliases" and function names would look like "alias.functionName()", which is possible but inconsistent since we cannot have "alias2.functionName()" (which would require separate interpreter instances, etc etc).
2. This has been tested both locally and in mapred mode.
3. We assume .py files are simply a list of functions. Since the entire file is loaded, you can have dependent functions. No effort is made to resolve imports, though.
4. You'll need to add jython.jar into classpath, or compile it into pig.jar.
Would love comments and code-followups!
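The "whole file is loaded, all functions become UDFs" behavior described in point 3 can be sketched in plain Python; `discover_functions` and the sample script below are hypothetical, not the patch's actual code:

```python
# Hypothetical discover_functions: execute the script source and collect
# every top-level callable, so dependent functions in the same file work.
SCRIPT = """
def helloworld(name):
    return 'hello ' + name

def shout(s):
    return shout_impl(s)       # dependent function in the same file

def shout_impl(s):
    return s.upper()

GREETING = 'hi'                # not callable, so not registered
"""

def discover_functions(source):
    namespace = {}
    exec(source, namespace)
    return {name: obj for name, obj in namespace.items()
            if callable(obj) and not name.startswith("__")}

udfs = discover_functions(SCRIPT)
assert sorted(udfs) == ["helloworld", "shout", "shout_impl"]
assert udfs["shout"]("ok") == "OK"
```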

Dmitriy V. Ryaboy
added a comment - 24/May/10 18:51 I've found that using lazy conversion from objects to tuples can save significant amounts of time when records get later filtered out, only parts of the output used, etc. Perhaps this is something to try if you say pythonToPig is slow?
Here's what I did with Protocol Buffers: http://github.com/dvryaboy/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/util/ProtobufTuple.java
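The lazy-conversion idea can be sketched in a few lines of Python; `LazyTuple` here is a hypothetical stand-in for a Tuple implementation like ProtobufTuple:

```python
# Hypothetical LazyTuple: convert a field only on first access, so records
# that get filtered out (or fields that are never read) cost nothing.
class LazyTuple:
    def __init__(self, raw_fields, converters):
        self._raw = list(raw_fields)
        self._conv = converters            # one converter per field
        self._cache = {}                   # index -> converted value
        self.conversions = 0               # counter, just to show laziness

    def get(self, i):
        if i not in self._cache:
            self._cache[i] = self._conv[i](self._raw[i])
            self.conversions += 1
        return self._cache[i]

t = LazyTuple(["42", "3.14", "never read"], [int, float, str])
assert t.get(0) == 42
assert t.conversions == 1   # the other two fields were never converted
```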

Arnab Nandi
added a comment - 25/May/10 07:48 Thanks Dmitriy! Lazy objects are a great idea. Note that I'm not saying that pythontoPig is slow per se – it's just the biggest part of the profiler trace, and would be a great place for optimization. I ran some numbers on the patch, and it looks like outside of the runtime instantiation, there is a fairly small performance penalty with the current code (1.2x slower).
WordCount example from Alan's package.zip:
Data size    Native    Jython    Factor
10K          9s        18s       2
50K          14s       19s       1.35
500K         54s       64s       1.19
(Full Data: 8x"War & Peace" from Proj. Gutenberg, 500K lines, 24MB)
(TOKENIZE was modified to spaces-only, both implementations have identical output)
Python code:
@outputSchema("s:{d:(word:chararray)}")
def tokenize(word):
    if word is not None:
        return word.split(' ')

Ashutosh Chauhan
added a comment - 26/May/10 01:15 Arnab,
Thanks for putting together a patch for this. One question I have is about register Vs define. Currently you are auto-registering all the functions in the script file and then they are available for later use in script. But I am not sure how we will handle the case for inlined functions. For inline functions define seems to be a natural choice as noted in previous comments of the jira. And if so, then we need to modify define to support that use case. Wondering to remain consistent, we always use define to define <non-native> functions instead of auto registering them. I also didn't get why there will be need for separate interpreter instances in that case.

Arnab Nandi
added a comment - 27/May/10 00:22 Thanks for looking into the patch Ashutosh! Very good question, short answer: I couldn't come up with an elegant solution using define
I spent a bunch of time thinking about the "right thing to do" before going this way. As Woody mentioned, my initial instinct was to do this in define, but I kept hitting roadblocks when working with define:
I came up with the analogy that "register" is like "import" in java, and "define" is like "alias" in bash. In this interpretation, whenever you want to introduce new code, you register it with Pig. Whenever you want to alias anything for convenience or to add meta-information, you define it.
Define is not amenable to multiple functions in the same script.
For example, to follow the stream convention, {define X 'x.py' [inputoutputspec] [schemaspec] ;}. Which function is the input/output spec for? A solution like { [func1():schemaspec1,func2:schemaspec2] } is... ugly.
Further, how do we access these functions? One solution is to have the namespace as a codeblock, e.g. X.func1(), which is doable by registering functions as "X.func1", but we're (mis)leading the user to believe there is some sort of real namespacing going on. I foresee multi-function files as a very common use case; people could have a "util.py" with their commonly used suite of functions instead of forcing 1 file per 2-3 line function.
Note that Julien's @decorator idea cleanly solves this problem and I think it'll work for all languages.
With inline define, most languages have the convention of writing function definitions with the function name, input references & return schema spec; it seems redundant to force the user to break this convention and have something like {define x as script('def X(a,b): return a + b;');}, and have x.X(). Lambdas can solve this problem halfway; you'll then need to worry about the schema spec and we're back at a kludgy solution!
My plan for inline functions is to write all of them to a temp file (1 per script engine) and then deal with them as registering a file.
Jython code runs in its own interpreter because I couldn't figure out how to load Jython bytecode into Java; this has something to do with the lack of a jythonc, afaik (I may be wrong). There will be one interpreter per non-compilable script engine; for others (Janino, Groovy), we load the class directly into the runtime.
From a code-writing perspective, overloading define to tack on a third use-case would involve an overhaul to the POStream physical operator and felt very inelegant; register, on the other hand, is well contained to a single purpose – including files for UDFs.
Consider the use of Janino as a ScriptEngine. Unlike the Jython script engine, this loads Java UDFs into the native runtime and doesn't translate objects, so we're looking at potentially zero loss of performance for inline UDFs (or register 'UDF.java';). The difference between native and script code gets blurry here...
[tl;dr] ...and then I thought fair enough, let's just go with register!
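The temp-file plan for inline functions might look like this in Python (`register_inline` is hypothetical; a real engine would hand the resulting path to the normal register machinery):

```python
# Hypothetical register_inline: dump inline code to a temp file, then treat
# that file exactly like a registered script.
import os
import tempfile

def register_inline(code):
    fd, path = tempfile.mkstemp(suffix=".py")
    with os.fdopen(fd, "w") as f:
        f.write(code)
    return path   # a real engine would register this path as a script file

path = register_inline("def add(a, b):\n    return a + b\n")
namespace = {}
with open(path) as f:
    exec(f.read(), namespace)
assert namespace["add"](2, 3) == 5
os.unlink(path)
```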

Julien Le Dem
added a comment - 03/Jun/10 23:58 I like Register better as well.
With java UDFs, you REGISTER a jar.
Then you can use the classes in the jar using their fully qualified class name.
Optionally you can use DEFINE to alias the functions or pass extra initialization parameters.
With scripting as implemented by Arnab, you REGISTER a script file (adding the script language information, as it is not only Java anymore) and you can use all the functions in it (just like you do in Java).
Then I would say you should be able to alias them using DEFINE and define a closure by passing extra parameters, DEFINE log2 logn(2, $0); (maybe I am asking too much here).

Issues (as per the current implementation):
1. Flat namespace: this consumes the UDF namespace. Do we need to have test.py.helloworld?
2. No way to find the signature: we do not verify the signature of helloworld in the front end, so the user has no feedback about UDF signatures.
3. Dependencies: no ship clause.

Optional command:

describe 'test.py';
helloworld{x:chararray};
complex{i:int};

Changes needed: ScriptEngine needs a function that, for a given script and funcspec, dumps the function signature if the function is present in the script (given its path).
abstract void dumpFunction(String path, FuncSpec funcSpec, PigContext pigContext);

2. Registration of a single UDF from a script:
test.py has helloworld, which has dependencies in '1.py' and '2.py'.
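On the "no way to find signature" point: on the Python side, a `dumpFunction`-style helper could recover a function's signature via the standard `inspect` module. The `describe` helper below is hypothetical, and a real implementation would report the declared output schema rather than the raw Python signature:

```python
# Hypothetical describe helper: inspect recovers a Python function's
# signature, which the front end could surface for a registered script.
import inspect

def helloworld(x):
    return "hello " + x

def describe(func):
    return func.__name__ + str(inspect.signature(func))

assert describe(helloworld) == "helloworld(x)"
```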

Aniket Mokashi
added a comment - 11/Jun/10 01:43 I support above comment.
Also, in favor of not breaking old code. I think, we should avoid introducing new keywords.
In the above proposal, by adding python as a lang-keyword I meant to hide the extensibility of the ScriptEngine interface by natively supporting python. If we have to allow users to add support for other languages, we need to allow "using org.apache.pig.scripting.jython.JythonScriptEngine". But this will require us to document the ScriptEngine interface.
The following seems to be a more suitable choice. Comments?
-- register all UDFs inside test.py using custom (or builtin) ScriptEngine
register 'test.py' using org.apache.pig.scripting.jython.JythonScriptEngine ship ('1.py', '2.py');
-- namespace? test.helloworld?
b = foreach a generate helloworld(a.$0), complex(a.$1);
-- register helloworld UDF as hello using JythonScriptEngine
define hello using org.apache.pig.scripting.jython.JythonScriptEngine from 'test.py'#helloworld ship ('1.py', '2.py');
b = foreach a generate helloworld(a.$0);
Also, register scalascript.jar would not be necessary if getStandardScriptJarPath() returns the path of the jar.

Alan Gates
added a comment - 16/Jun/10 00:33 I propose the following syntax for register:
REGISTER _filename_ [USING _class_ [AS _namespace_]]
This is backwards compatible with the current version of register.
class in the USING clause would need to implement a new interface ScriptEngine (or something) which would be used to interpret the file. If no USING clause is
given, then it is assumed that filename is a jar. I like this better than the 'lang python' option we had earlier because it allows users to add new engines
without modifying the parser. We should however provide a pre-defined set of scripting engines and names, so that for example python translates to
org.apache.pig.script.jython.JythonScriptingEngine
If the AS clause is not given, then the basename of filename defines the namespace name for all functions defined in that file. This allows us to avoid
function name clashes. If the AS clause is given, this defines an alternate namespace. This allows us to avoid name clashes for filenames. Functions would
have to be referenced by full namespace names, though aliases can be given via DEFINE.
Note that the AS clause is a sub-clause of the USING clause, and cannot be used alone, so there is no ability to give namespaces to jars.
As far as I can tell there is no need for a SHIP clause in the register. Additional python modules that are needed can be registered. As long as Pig lazily
searches for functions and does not automatically find every function in every file we register, this will work fine.
So taken altogether, this would look like the following. Assume we have two python files /home/alan/myfuncs.py
import mymod
def a():
    ...
def b():
    ...
and /home/bob/myfuncs.py:
def a():
    ...
def c():
    ...
and the following Pig Latin
REGISTER /home/alan/myfuncs.py USING python;
REGISTER /home/alan/mymod.py; -- no need for USING since I won't be looking in here for files, it just has to be moved over
REGISTER /home/bob/myfuncs.py USING python AS hisfuncs;
DEFINE b myfuncs.b();
A = LOAD 'mydata' as (x, y, z);
B = FOREACH A GENERATE myfuncs.a(x), b(y), hisfuncs.a(z);
...
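The proposed namespace rules can be sketched as a small registry in Python; `register` and the registry dictionary are illustrative, not Pig's implementation:

```python
# Hypothetical registry keyed by "<namespace>.<name>"; the namespace defaults
# to the file's basename and can be overridden (the AS clause).
import os

registry = {}

def register(path, functions, namespace=None):
    ns = namespace or os.path.splitext(os.path.basename(path))[0]
    for name, func in functions.items():
        registry[ns + "." + name] = func

# both files define a(), but the namespaces keep them apart
register("/home/alan/myfuncs.py", {"a": lambda x: "alan:" + x})
register("/home/bob/myfuncs.py", {"a": lambda x: "bob:" + x},
         namespace="hisfuncs")

assert registry["myfuncs.a"]("x") == "alan:x"
assert registry["hisfuncs.a"]("z") == "bob:z"
```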

Julien Le Dem
added a comment - 16/Jun/10 01:24 I like the suggestion. However I would prefer not to use namespaces by default.
Most likely users will register a few functions and use namespaces only when conflicts happen.
The shortest syntax should be used for the most common use case.
most of the time:
REGISTER /home/alan/myfuncs.py USING python;
B = FOREACH A GENERATE a;
when it is needed:
REGISTER /home/alan/myfuncs.py USING python AS myfuncs;
B = FOREACH A GENERATE myfuncs.a;
Also register jar does not prefix classes by the jar name so that would be inconsistent.
REGISTER /home/alan/myfuncs.jar;

Aniket Mokashi
added a comment - 17/Jun/10 02:37 I have attached the patch for proposed changes.
A few points to note:
1. As a jar is treated differently from other files (searched in system resources, loaded via the classloader, etc.), we differentiate a jar by its extension.
2. The namespace defaults to "" as per the above comment; this is implemented as part of the registerFunctions interface of ScriptEngine, so that different engines can have different behavior as necessary.
3. The keyword python is supported along with custom script engine names.

Dmitriy V. Ryaboy
added a comment - 02/Jul/10 20:06 I rebased the patch and made it pull jython down via maven. 2.5.1 doesn't appear to be available right now, so this pulls down 2.5.0. Hope that's ok.
Looks like the tabulation is wrong in most of this patch... someone please hit ctrl-a, ctrl-i next time.
Needless to say, this thing needs tests, desperately.
Also imho in order for it to make it into trunk, it should be a compile-time option to support (and pull down) jython or jruby or whatnot, not a default option. Otherwise we are well on our way to making people pull down the internet in order to compile pig.

Aniket Mokashi
added a comment - 02/Jul/10 23:50 The fix needed some changes in queryparser to support namespace, I found this in test cases I added.
Current EvalFuncSpec logic is convoluted, I replaced it with a cleaner one.
I have attached the updated patch with changes mentioned above.
I am not sure what needs to be done for jython.jar, my guess was to check-in that in /lib. Thoughts?

Dmitriy V. Ryaboy
added a comment - 02/Jul/10 23:56 Aniket, I already made the changes you need to pull down jython – take a look at the patch I attached.
One more general note – let's say jython instead of python (in the grammar, the keywords, everywhere), as there may be slight incompatibilities between the two and we want to be clear on what we are using.

Aniket Mokashi
added a comment - 03/Jul/10 00:28 I had added an interface, getStandardScriptJarPath, to find the path of the jython jar to be shipped as part of job.jar only when the user uses this feature. How do I incorporate this into the new changes?
Do we want to go for compile time support option?

Julien Le Dem
added a comment - 03/Jul/10 00:49 Aniket, this is assuming the ScriptEngine requires only one jar.
I would suggest instead having a method ScriptEngine.init(PigContext) that would be called after the ScriptEngine instance has been retrieved from the factory.
That would let the script engine add whatever is needed to the job.
if (scriptingLang != null) {
    ScriptEngine se = ScriptEngine.getInstance(scriptingLang);
    //pigContext.scriptJars.add(se.getStandardScriptJarPath());
    se.init(pigContext);
    se.registerFunctions(path, namespace, pigContext);
}
Have a good weekend, Julien

Julien Le Dem
added a comment - 06/Jul/10 19:48 Actually, I retract the init() method, as it seems this could all happen in registerFunctions():
public void registerFunctions(String path, String namespace, PigContext pigContext)
        throws IOException {
    pigContext.addJar(JAR_PATH);
    ...
}
Also, I was suggesting this way of automatically figuring out the jar path for a class:
/**
 * figure out the jar location from the class
 * @param clazz
 * @return the jar file location, null if the class was not loaded from a jar
 */
protected static String getJar(Class<?> clazz) {
    URL resource = clazz.getClassLoader().getResource(clazz.getCanonicalName().replace(".", "/") + ".class");
    if (resource.getProtocol().equals("jar")) {
        return resource.getPath().substring(resource.getPath().indexOf(':') + 1, resource.getPath().indexOf('!'));
    }
    return null;
}
Otherwise the code depends on the path it is run from.

ScriptEvalFunc does not do much anymore; I would suggest removing it.
If we want to keep it in order to add shared code in the future, then remove its constructor, as it forces the schema to be fixed.
The output schema may depend on the input schema in some cases.

Aniket Mokashi
added a comment - 08/Jul/10 01:19 I got what you mean; if a user needs a generic square function, he can write:
#!/usr/bin/python
@outputSchemaFunction("squareSchema")
def square(number):
    return (number * number)
def squareSchema(input):
    return input
I will make changes so that I can use a similar approach as pig-greek. Since outputSchema needs to know both the input and the name of the outputSchemaFunction, the current code needs further changes.

Aniket Mokashi
added a comment - 08/Jul/10 22:57 Added support for the decorator outputSchemaFunction, which points to a function that defines the schema for the UDF.
Also, in the case of a function with no decorator, the schema is assumed to be databytearray.
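A sketch, in plain Python, of how the two decorators could work by simply attaching schema metadata to the function object; the mechanics here are illustrative, not the patch's actual implementation:

```python
# Hypothetical decorator mechanics: each decorator attaches an output_schema
# callback to the function for the engine to read later.
def outputSchema(schema):
    def wrap(func):
        func.output_schema = lambda input_schema: schema   # static schema
        return func
    return wrap

def outputSchemaFunction(name):
    def wrap(func):
        # look the schema function up by name at call time, so it may be
        # defined later in the same script
        func.output_schema = lambda input_schema: globals()[name](input_schema)
        return func
    return wrap

@outputSchemaFunction("squareSchema")
def square(number):
    return number * number

def squareSchema(input_schema):
    return input_schema   # square's output type mirrors its input type

assert square(4) == 16
assert square.output_schema("int") == "int"
```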

Hadoop QA
added a comment - 09/Jul/10 22:17 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12449105/RegisterPythonUDFFinale4.patch
against trunk revision 962628.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/365/console
This message is automatically generated.

Alan Gates
added a comment - 13/Jul/10 23:37 ScriptEngine is a new public interface for Pig once we commit this patch. We need to declare it as public and set its stability level (evolving, I'm guessing, since it's new, but I'm open to arguments for other levels). See PIG-1311 for info on how to do this.

Do you want to allow: register myJavaUDFs.jar using 'java' as 'javaNameSpace'? The use-case could be that if we are allowing namespaces for non-Java, why not allow them for Java UDFs as well. But then define is exactly for this purpose. So, it may make sense to throw an exception for such a case.

In ScriptEngine.getJarPath(), shouldn't you throw a FileNotFoundException instead of returning null?

Don't gobble up Checked Exceptions and then rethrow RuntimeExceptions. Throw checked exceptions, if you need to.

ScriptEngine.getInstance() should be a singleton, no?

In JythonScriptEngine.getFunction(), I think you should check if interpreter.get(functionName) != null and then return it, and call Interpreter.init(path) only if it's null.

In JythonUtils, for doing type conversion you should make use of both input and output schemas (whenever they are available) and avoid doing reflection for every element. You can get hold of the input schema through outputSchema() of EvalFunc and then do UDFContext magic to use it. If schema == null || schema == bytearray, you need to resort to reflection. Similarly, if the outputSchema is available via decorators, use it to do type conversions.

In JythonUtils.pythonToPig(), in the case of Tuple, you first create Object[] and then do Arrays.asList(); you can directly create List<Object> and avoid unnecessary casting. In the same method, you are only checking for long; don't you need to check for int, String, etc. and then cast appropriately? Also, in the default case I think we can't let the object pass through as-is using Object.class; it could be an object of any type and may cause cryptic errors in the pipeline if let through. We should throw an exception if we don't know what type of object it is. Similar argument for the default case of pigToPython().
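The explicit-mapping suggestion can be sketched in plain Python; `python_to_pig` below is a simplified, hypothetical stand-in for JythonUtils.pythonToPig, and the (type, value) pairs it returns are purely illustrative:

```python
# Hypothetical python_to_pig: map each value via explicit type checks and
# fail loudly on anything unknown, instead of letting an arbitrary object
# slip into the pipeline.
def python_to_pig(value):
    if value is None:
        return None
    if isinstance(value, bool):                 # bool is an int subclass
        raise TypeError("no Pig mapping for bool in this sketch")
    if isinstance(value, int):
        return ("long", value)
    if isinstance(value, float):
        return ("double", value)
    if isinstance(value, str):
        return ("chararray", value)
    if isinstance(value, (list, tuple)):
        return ("tuple", [python_to_pig(v) for v in value])
    raise TypeError("cannot map %r to a Pig type" % type(value))

assert python_to_pig(1) == ("long", 1)
assert python_to_pig("x") == ("chararray", "x")
```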

I didn't get why the changes are required in POUserFunc. Can you explain and also add it as comments in the code.

Testing:

This is a big enough feature to warrant its own test file, so consider adding a new one (maybe TestNonJavaUDF). Additionally, we see frequent timeouts on TestEvalPipeline; we don't want it to run any longer.

Instead of adding the query through the pigServer.registerCode() API, add it through pigServer.registerQuery("register myscript.py using 'jython'"). This will make sure we are testing the changes in QueryParser.jjt as well.

Add more tests. Specifically, test complex types passed to the UDFs (like a bag) and returning a bag. You can get bags after doing a group-by. You can also take a look at Julien's original patch, which contained a Python script; those were, I think, at the right level of complexity to be added as test cases in our JUnit tests.

Nit-picks:

Unnecessary import in JythonFunction.java

In PigContext.java, you are using Vector and LinkedList instead of the usual ArrayList. Any particular reason for it? Just curious.

Ashutosh Chauhan
added a comment - 14/Jul/10 00:21
More documentation (in QueryParser.jjt, ScriptEngine, and JythonScriptEngine, specifically for outputSchema, outputSchemaFunction, and schemaFunction).

Also keep an eye on the recent "mavenization" efforts in Pig; depending on when that gets checked in, you may (or may not) need to make changes to ivy.

Do you want to allow: register myJavaUDFs.jar using 'java' as 'javaNameSpace'? A use case could be: if we are allowing namespaces for non-Java UDFs, why not allow them for Java UDFs as well? But then define exists exactly for this purpose, so it may make sense to throw an exception in such a case.

myJavaUDFs.jar can itself have a package structure that defines its own namespace; for example, maths.jar has the function math.sin, etc. I will throw a ParseException for such a case.

ScriptEngine.getInstance() should be a singleton, no?

getInstance is a factory method that returns an instance of a ScriptEngine based on its type. We create a new instance of the ScriptEngine so that if registerCode is called simultaneously, we can create a different interpreter for each invocation to register these scripts with Pig.
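
The factory behavior being described can be sketched roughly as below. The class and method names are illustrative stand-ins, not Pig's actual ScriptEngine API; the point is that getInstance(), keyed by engine type, returns a fresh instance per call, so concurrent registerCode() invocations each get their own interpreter rather than sharing a singleton.

```java
import java.util.Locale;

// Hypothetical sketch of a type-keyed factory that deliberately is NOT a
// singleton: every call constructs a new engine instance.
abstract class SketchScriptEngine {
    abstract String name();

    static SketchScriptEngine getInstance(String type) {
        if ("jython".equals(type.toLowerCase(Locale.ROOT))) {
            // new instance on every call, so simultaneous registrations
            // each get an independent interpreter
            return new SketchScriptEngine() {
                String name() { return "jython"; }
            };
        }
        throw new IllegalArgumentException("unknown script engine type: " + type);
    }
}
```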

In JythonScriptEngine.getFunction(), I think you should check whether interpreter.get(functionName) != null and return it, calling Interpreter.init(path) only if it's null.

This behavior is consistent with the interpreter.get method, which returns null if some resource is not found inside the script. Callers of this function handle RuntimeExceptions. Also, we will fail much earlier if we try to access functions that are not already present/registered, so it should be safe.
Also, the interpreter is never null because it is a static member of JythonScriptEngine, instantiated statically.

I didn't get why the changes are required in POUserFunc. Can you explain and also add it as comments in the code.

POUserFunc has a possible bug: it checks res.result != null when res.result is always null at this point. If the expected return type is bytearray, we cast the returned object to byte[] with toString().getBytes() (which was never hit due to the bug mentioned above), but when the return type is byte[] we need special handling (this is not the case for other EvalFuncs, as they generally return Pig types).

Instead of adding the query through the pigServer.registerCode() API, add it through pigServer.registerQuery("register myscript.py using 'jython'"). This will make sure we are testing the changes in QueryParser.jjt as well.

Aniket Mokashi
added a comment - 14/Jul/10 08:21 Thanks for your comments. I will make the required changes.
register is a Grunt command parsed by the GruntParser, hence it doesn't go through the QueryParser. We directly call registerCode from the GruntParser. Also, the parsing logic is trivial.

Although this one defines its output schema as bytearray, we fail it because we do not know how to deserialize Student. Clearly, this is due to the bug in POUserFunc, which fails to convert to bytearray. Hence, res.result != null should be changed to result.result != null.

I am still not convinced about the changes required in POUserFunc. That logic should really be part of pythonToPig(pyObject). If a Python UDF returns a byte[], it should be turned into a DataByteArray before it gets back into Pig's pipeline. And if we do that conversion in pythonToPig() (which is the right place to do it), we will need no changes in POUserFunc.

As I suggested in a previous comment, in the same method you should avoid first creating an array and then turning that array into a list; you can create a list up front and use it.

Instead of instanceof, doing a class equality test will be a wee bit faster: instead of (pyObject instanceof PyDictionary), do pyObject.getClass() == PyDictionary.class. Obviously, this only works when you know the exact target class, and not for derived ones.
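
A minimal illustration of the trade-off: an exact getClass() comparison is marginally cheaper but fails for subclasses, which is exactly the PyTupleDerived concern raised in the reply below. PyBase and PyDerived here are hypothetical stand-ins for such a Jython class pair, not real Jython classes.

```java
// Hypothetical stand-ins for a Jython base/derived pair such as
// PyTuple / PyTupleDerived.
class ClassCheckSketch {
    static class PyBase {}
    static class PyDerived extends PyBase {}

    // exact-class test: slightly faster, but misses derived classes
    static boolean exactMatch(Object o) {
        return o.getClass() == PyBase.class;
    }

    // instanceof: matches the class and anything derived from it
    static boolean instanceMatch(Object o) {
        return o instanceof PyBase;
    }
}
```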

parseSchema(String schema) already exists in the org.apache.pig.impl.util.Utils class, so there is no need for it in ScriptEngine.

For the register command, we need to test not only for functionality but for regressions as well. Look at TestGrunt.java in the test package to get an idea of how to write a test for it.

Ashutosh Chauhan
added a comment - 21/Jul/10 20:27 Thanks, Aniket, for making those changes. It's getting closer.


Ashutosh Chauhan
added a comment - 21/Jul/10 20:30 Addendum:
Also, what will happen if the user returns a nil Python object (the null equivalent of Java) from a UDF? It looks to me like that will result in an NPE. Can you add a test for that, and a similar test case for pigToPython()?

I am still not convinced about the changes required in POUserFunc. That logic should really be a part of pythonToPig(pyObject). If python UDF is returning byte[], it should be turned into DataByteArray before it gets back into Pig's pipeline. And if we do that conversion in pythonToPig() (which is a right place to do it) we will need no changes in POUserFunc.

I agree that it is better to move the computation to the JythonFunction side (JythonUtils) for type checking; that should provide more type safety and avoid the complexity of user-defined types. But I would still go for the changes in POUserFunc for result.result, for the case described in the example above (removing the byte[] scenario).

Instead of instanceof, doing a class equality test will be a wee bit faster: instead of (pyObject instanceof PyDictionary), do pyObject.getClass() == PyDictionary.class. Obviously, this only works when you know the exact target class, and not for derived ones.

Jython has derived classes for each of the basic Jython types; though they aren't used for most of the types as of now, Jython may start returning these derived objects (e.g., PyTupleDerived) in a future implementation, in which case we might break our code. Also, PyLongDerived is already used inside the code. The __tojava__ function just returns the proxy Java object until we ask for a specific type of object. I think it's better to use instanceof instead of class equality here.

For the register command, we need to test not only for functionality but for regressions as well. Look at TestGrunt.java in the test package to get an idea of how to write a test for it.

The code path for .jar registration is identical to the old code, except that it doesn't "use" any engine or namespace.

Also, what will happen if the user returns a nil Python object (the null equivalent of Java) from a UDF? It looks to me like that will result in an NPE. Can you add a test for that, and a similar test case for pigToPython()?

A Java null object will be turned into a PyNone object, but the __tojava__ function always returns the special object Py.NoConversion if the PyObject cannot be converted to the desired Java class.
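
The None/null round trip under discussion can be sketched as below. NullHandlingSketch and PY_NONE are hypothetical stand-ins (PY_NONE playing the role of Jython's Py.None singleton), showing the defensive mapping that keeps a UDF returning None from triggering an NPE in the pipeline.

```java
// Hypothetical sketch: map Java null to a None sentinel on the way into the
// script engine, and the sentinel (or null) back to Java null on the way out.
class NullHandlingSketch {
    // stand-in for Jython's Py.None singleton
    static final Object PY_NONE = new Object();

    // Java null crossing into the script side becomes the None sentinel
    static Object pigToPython(Object v) {
        return v == null ? PY_NONE : v;
    }

    // None (or null) coming back becomes Java null instead of causing an NPE
    static Object pythonToPig(Object v) {
        return (v == null || v == PY_NONE) ? null : v;
    }
}
```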

Aniket Mokashi
added a comment - 22/Jul/10 22:52