Embed Pig in scripting languages

Description

It should be possible to embed Pig calls in a scripting language and make functions defined in the same script available as UDFs.
This is a spin-off of https://issues.apache.org/jira/browse/PIG-928, which lets users define UDFs in scripting languages.

Richard Ding
added a comment - 10/Sep/10 22:05
Thanks Julien. I rebased the patch on the latest trunk and added an option (-greek) to the Main class.
A "PIG-Greek" script can now be run with the following command:
java -cp pig.jar:<jython jar>:<hadoop config dir> org.apache.pig.Main -g <pig-greek script>
or in local mode:
java -cp pig.jar:<jython jar> org.apache.pig.Main -x local -g <pig-greek script>

Richard Ding
added a comment - 14/Sep/10 23:03
In the previous patch, the executeScript method on ScriptPigServer returned a list of ExecJobs (one for each store statement in the script). Unfortunately, the order of the ExecJobs in the list is indeterminate.
This patch fixes the problem by making the executeScript method return a PigStats object. One can then retrieve an output result by the alias of the corresponding store statement.
Here is an example:
P = pig.executeScript("""
A = load '${input}';
... ...
store G into '${output}'; """)
output = P.result("G")  # an OutputStats object
iter = output.iterator()
if iter.hasNext():
    pass  # do something
else:
    pass  # do something else

Julien Le Dem
added a comment - 15/Sep/10 21:35
The -g parameter on the command line should take two arguments: the scripting implementation instance name and the script itself.
That way we can have several scripting implementations.
java -cp pig.jar:<jython jar> org.apache.pig.Main -x local -g jython script/tc.py
case GREEK: {
    ScriptEngine scriptEngine = ScriptEngine.getInstance(instanceName);
    scriptEngine.run(new PigServer(pigContext), file);
    return ReturnCode.SUCCESS;
}

Julien Le Dem
added a comment - 15/Sep/10 21:41
The end-of-loop condition in the script can just test whether to_join_n is empty. It was testing both results because it did not know which one was to_join_n.
if (not P.result("to_join_n").iterator().hasNext()):

Richard Ding
added a comment - 17/Sep/10 17:52
Attaching the test script, modified based on Julien's comment. As for the command-line option -g, it could also take a single parameter (the script file name) and let Pig determine the script engine from the file extension.

Julien Le Dem
added a comment - 23/Sep/10 00:05
Using the file extension requires a registration mechanism (or a hard-coded list), so if it is supported it would be nice to be able to provide the class name of the scripting implementation as well.
I would like to use my own implementation of the scripting engine (let's say JavaScript) by specifying the class name on the command line.
This would be similar to the mechanism for including UDFs:
http://wiki.apache.org/pig/UDFsUsingScriptingLanguages
Register 'test.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myfuncs;
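
The two lookup strategies discussed here (file extension versus an explicit class name) could coexist. The following is a minimal Python sketch of that idea, not Pig's actual code: `ENGINES_BY_EXTENSION`, `resolve_engine`, and the `my.custom.JavaScriptEngine` class name used in the usage note are hypothetical names chosen for illustration; only `org.apache.pig.scripting.jython.JythonScriptEngine` appears in this thread.

```python
import os

# Hypothetical built-in registry: file extension -> engine class name.
ENGINES_BY_EXTENSION = {
    ".py": "org.apache.pig.scripting.jython.JythonScriptEngine",
}


def resolve_engine(script_path, engine_class=None):
    """Return the engine class name to use for a script.

    An explicit class name always wins, so a user-supplied engine can be
    used without touching the built-in registry.
    """
    if engine_class:
        return engine_class
    extension = os.path.splitext(script_path)[1].lower()
    try:
        return ENGINES_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError("no script engine registered for %r" % extension)
```

Under these assumptions, `resolve_engine("script/tc.py")` falls back to the registry, while `resolve_engine("tc.js", "my.custom.JavaScriptEngine")` uses the class supplied by the user.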

Richard Ding
added a comment - 04/Nov/10 22:13
Alan has posted a proposal on the Pig wiki that includes embedding Pig in scripting languages (http://wiki.apache.org/pig/TuringCompletePig). The proposal builds on the implementation here and uses a JDBC-like compile, bind, run model.

Richard Ding
added a comment - 13/Nov/10 00:27
Attaching the initial patch, which aims to implement the embedding part of the above proposal.
Notes about the patch:
Pig executes the top-level Jython statements in the script, so there is no need to write a main() function.
You can invoke a Jython script from the command line the same way you invoke a standard Pig script, as long as the first line of the script is #! /usr/bin/python.
Example:
java -cp jython.jar:pig.jar myscript.py
The run method on ScriptEngine returns a Map<String, PigStats> with one entry for each runtime Pig pipeline. For a named pipeline, the map key is the given pipeline name.
The proposed API is implemented in two classes: ScriptPigServer and PigPipeline.
The compile method is currently a no-op and will be implemented later.

Julien Le Dem
added a comment - 21/Nov/10 20:08
Hi Richard,
Some comments about PIG-1479_3.patch:
The ScriptEngine implementations that can be used are still hardwired. As a user, I would want to add a parameter to the command line to use my own (adding it to the classpath and providing the class name). For example, I'm working on a JavaScript implementation for Pig-Greek. Currently I have no way of using it without modifying Pig's code.
I like not having to define a main() function for the top-level code. However, using regular expressions to separate functions from the main code (in JythonScriptEngine.getFunctions(InputStream)) seems at high risk of failing in many cases. It would be better to rely on an actual Python parser or to leave it as it was, requiring a main() function.
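
To illustrate the "actual Python parser" alternative, here is a minimal sketch that uses CPython's standard `ast` module to find top-level function definitions. It is not what JythonScriptEngine does; `split_functions` and `SCRIPT` are hypothetical names, and an engine running under Jython would need an equivalent parser to be available there.

```python
import ast


def split_functions(source):
    """Separate top-level function definitions from other statements.

    Returns (function_names, other_statement_count).
    """
    tree = ast.parse(source)
    function_names = []
    other_statements = 0
    for node in tree.body:  # only top-level statements are inspected
        if isinstance(node, ast.FunctionDef):
            function_names.append(node.name)
        else:
            other_statements += 1
    return function_names, other_statements


# A script where text that merely looks like a definition appears inside
# a string literal and a comment.
SCRIPT = '''
def square(x):
    # def not_a_function(): appears only in this comment
    return "def fake(): " + str(x * x)

n = 3
print(square(n))
'''
```

Because the parser works on syntax rather than on text patterns, the `def` inside the string and the comment is not reported as a function, which is the kind of case where a regular expression is at risk of going wrong.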

Richard Ding
added a comment - 02/Dec/10 20:20
Thanks Julien.
As for the second comment, there is a third option: separate the front end (control flow code) from the back end (scripting UDFs) by putting them in different files, and require the control flow writer to explicitly register the UDFs in the script. For example, in the control flow file script.py:
pig.registerUDF("myudfs.py", "mynamespace")
# control flow and Pig pipelines that use UDFs defined in myudfs.py
The advantage is that only the UDF files are shipped to the back end, while the control flow file (and its dependencies) remains on the front end. The obvious disadvantage is that you can't put everything in one file.

Alan Gates
added a comment - 08/Dec/10 19:22
Comments and questions:
This patch makes changes to the public interface PigProgressNotificationListener. It's ok, since it's marked evolving. Do we know how many people are using this and what we'll need to do to mitigate the changes for them?
PigPipeline needs better javadoc comments at the class level. The current javadocs confuse it with the defined Pig class.
Rather than the Pig class detailed in the design doc, this patch has ScriptPigServer, which has a slightly different interface. Does this represent a change to the design, or is there a yet-to-be-built Pig class?
Do we need two classes, BoundPipeline and MultiBoundPipeline? Could we instead have just BoundPipeline, and then for each run method there would be:
public List<PigStats> run()
public PigStats runSingle() {
    if (multijob) throw ...
    return run().get(0);
}
Then run is a valid call whether this is a single or multi-job situation, which means users don't have to write their code differently in situations where they are using both single and multi-job binds. In simple cases where users know they only have one thing bound they can use the simpler runSingle call. Calling runSingle when multiple things are bound would be an error.
We need to mark the availability and stability of the ScriptEngine interface. I suspect it is Public Evolving.
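
The run/runSingle suggestion above can be sketched in Python as follows. This is an illustration only, not the patch's BoundPipeline: the constructor arguments (`bindings`, `execute`) and the method name `run_single` are hypothetical, and the results are plain values rather than PigStats objects.

```python
class BoundPipeline(object):
    """Hypothetical stand-in for the proposed single BoundPipeline class."""

    def __init__(self, bindings, execute):
        # bindings: one dict of parameters per bound instance of the pipeline.
        # execute: callable that runs one binding and returns its result.
        self._bindings = list(bindings)
        self._execute = execute

    def run(self):
        """Run every bound instance; always returns a list of results."""
        return [self._execute(binding) for binding in self._bindings]

    def run_single(self):
        """Convenience for the one-binding case; an error otherwise."""
        if len(self._bindings) != 1:
            raise RuntimeError(
                "run_single() requires exactly one binding, got %d"
                % len(self._bindings))
        return self.run()[0]
```

With this shape, code that calls `run()` works unchanged for one or many bindings, and `run_single()` stays available as a shortcut that fails loudly when more than one binding is present.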

Richard Ding
added a comment - 09/Dec/10 00:18
Thanks Alan,
This patch makes changes to the public interface PigProgressNotificationListener. It's ok, since it's marked evolving. Do we know how many people are using this and what we'll need to do to mitigate the changes for them?
This interface is available only in Pig 0.8, which is just about to be released, so not many people are using it. On the other hand, it is too late to get this change into 0.8. The reason for the change is that an embedded script can contain multiple Pig scripts, and the Pig runtime needs to tell users which script a notification came from.
PigPipeline needs better javadoc comments at the class level. The current javadocs confuse it with the defined Pig class.
Will do.
Rather than the Pig class detailed in the design doc this patch has ScriptPigServer, which has a slightly different interface. Does this represent a change to the design or is there a yet to be built Pig class?
The patch breaks the Pig class interface into several classes. ScriptPigServer handles register and define in the global scope and compiles a Pig Latin script into a PigPipeline object. PigPipeline binds a set of variables and generates a BoundPipeline object, which then runs the bound pipeline. Embedded script writers will have access to a ScriptPigServer object called "pig" in the script.
Do we need two classes BoundPipeline and MultiBoundPipeline? Could we instead have just BoundPipeline, and then for each run method there would be: ...
I went back and forth between these two approaches. I'm fine with a single BoundPipeline class with two methods, run and runSingle.
We need to mark the availability and stability of the ScriptEngine interface. I suspect it is Public Evolving.
Will do.

Julien Le Dem
added a comment - 10/Dec/10 06:49
Hi Richard,
Thank you for the updated patch.
My comments follow, all related to usability:
Pig script invocation
The main invocation mechanism is as follows:
results = pig.compile("<Pig Latin>").bind({param:value, ...}).run()
I was proposing to also bind parameters automatically to local variables in the current scope.
results = pig.compile("<Pig Latin>").bindToLocal().run()
or more simply
results = pig.run("<Pig Latin>")
(as implemented in the original submission)
I understand that not all languages may allow that, but all scripting languages I can think of do. Only compiled languages strip variable names. This could be optional for the implementation.
Although the explicit bind() step is useful in some situations and is more generic, it is not the most frequent use case.
Implicit binding to local variables is an important feature. As the Pig script is embedded in a particular context, in most use cases the parameters will have the same names as the local variables used to populate them.
The goal is to embed Pig and make the integration seamless. Most cases won't need the indirection of giving parameters different names from the local variables, so requiring it is a burden for the developer.
Ability to have the main program and the UDFs in the same script
This was the main reason I started this work. The goal was to have everything in one script. The fact that the UDFs are run on the slaves should not force the user to put them in a separate file. The main goal is to have the entire algorithm in the same place without arbitrary separations like this one.
If the choice is between having a main() function and not being able to have UDFs in the same file, I will definitely choose to have a main() function.
Just embedding Pig without having UDFs in the same file is not very different from running the Pig command line from a script.
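
The implicit binding described above can be sketched in plain Python. This is an illustration under stated assumptions, not the patch's implementation: `bind_to_local` and `example` are hypothetical names, parameters are assumed to use the `${name}` form shown earlier in this thread, and the lookup relies on CPython's `inspect.currentframe()`, which may behave differently on other Python implementations.

```python
import inspect
from string import Template


def bind_to_local(pig_latin):
    """Substitute ${name} parameters from the caller's local variables.

    Parameters with no matching local variable are left untouched.
    """
    caller_locals = inspect.currentframe().f_back.f_locals
    return Template(pig_latin).safe_substitute(caller_locals)


def example():
    # The local names match the parameter names used in the script,
    # so no explicit {param: value} mapping is needed.
    input_path = "data/in"
    output_path = "data/out"
    return bind_to_local(
        "A = load '${input_path}'; store A into '${output_path}';")
```

Here the script text picks up `input_path` and `output_path` directly from the enclosing function, which is the convenience being argued for; an explicit mapping remains the more general option when names differ.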

Richard Ding
added a comment - 14/Dec/10 01:45
Thanks Julien. How about the following proposal?
Pig script invocation:
Pig will use the bind() method, called with no arguments, to implicitly bind parameters to local variables in the current scope. It will do an implicit mapping of variables in the host language to parameters in Pig Latin:
results = pig.compile("<Pig Latin>").bind().run()
Ability to have the control flow program and the UDFs in the same script:
I agree that it's good to have everything in one script. Since I can't think of a way to execute only the function definitions in Python, I'll go back to using a simple parser to separate the functions from the control flow program, so that UDFs can be registered before the control flow program runs.
A related issue is Python import statements. Users will be responsible for shipping the imported modules to the backend servers. Pig won't automatically resolve the module paths and ship the files to the backend.

Alan Gates
added a comment - 14/Dec/10 16:43
+1 to using a fuzzy parser. I agree that being able to have the Python UDFs in the same file is important, and in user reviews others have voiced the same opinion. But forcing Python users to have a main function is going to seem very unnatural to them. So I think the fuzzy parsing is the best compromise.

Richard Ding
added a comment - 23/Dec/10 22:00
Based on the feedback, the new patch contains the following changes:
Support the main program and the UDFs in the same script. However, when mixing Jython functions with top-level control flow code, the script must use the idiomatic "conditional script" stanza:
def udf1():
    ...
def udf2():
    ...
if __name__ == '__main__':
    # control flow code
Support explicitly registering scripting UDFs:
Pig.registerUDF("udfs.py", "")
# control flow code
Conform the Pig scripting API to the specification (http://wiki.apache.org/pig/TuringCompletePig). The main change is that scripts now need to explicitly import the Pig class:
from org.apache.pig.scripting import Pig
... ...
results = Pig.compile("<Pig Latin>").bind().run()
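
One plausible reading of why the `__name__` stanza is required is that it lets the same file be loaded for its function definitions without running the control flow. The sketch below demonstrates that general Python behavior with `exec`; it does not show how Pig's Jython engine actually loads scripts, and `SCRIPT`, `load`, and the `"udf_loader"` module name are hypothetical.

```python
SCRIPT = '''
def udf1(x):
    return x + 1

ran_control_flow = False
if __name__ == '__main__':
    ran_control_flow = True   # stands in for the control flow code
'''


def load(run_name):
    """Execute SCRIPT in a fresh namespace under the given module name."""
    namespace = {"__name__": run_name}
    exec(compile(SCRIPT, "<embedded script>", "exec"), namespace)
    return namespace


as_main = load("__main__")     # run as a program: control flow executes
as_udfs = load("udf_loader")   # loaded for UDFs: only definitions execute
```

In both cases `udf1` is defined, but `ran_control_flow` is set to `True` only when the script is executed under the name `__main__`.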

Alan Gates
added a comment - 06/Jan/11 20:56
Latest patch looks good. I just have one question. Why do we need the synchronous implementation of PigProgressNotificationListener (SyncProgressNotificationAdaptor)? In what case do we expect Pig to be notifying in parallel? I am assuming that we want to allow user scripts to be multi-threaded, but do we expect multiple threads to use the same PigProgressNotificationListener?

Richard Ding
added a comment - 06/Jan/11 22:08
It is for parallel execution of a pipeline. The user registers a listener through the PigRunner API:
public static PigStats run(String[] args, PigProgressNotificationListener listener);
The same listener is expected to be used in parallel by all the threads, each of which executes an instance of the pipeline.
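
The idea behind a synchronizing adaptor can be sketched in Python as a wrapper that serializes calls to one shared listener. This is not the SyncProgressNotificationAdaptor from the patch: every class, function, and method name below (`SyncListenerAdaptor`, `RecordingListener`, `run_pipelines`, `progress_updated`) is hypothetical, and the listener interface is reduced to a single callback.

```python
import threading


class SyncListenerAdaptor(object):
    """Hypothetical wrapper that serializes calls to a shared listener."""

    def __init__(self, listener):
        self._listener = listener
        self._lock = threading.Lock()

    def progress_updated(self, script_id, progress):
        # Only one thread at a time may enter the wrapped listener.
        with self._lock:
            self._listener.progress_updated(script_id, progress)


class RecordingListener(object):
    """Toy listener that records the notifications it receives."""

    def __init__(self):
        self.events = []

    def progress_updated(self, script_id, progress):
        self.events.append((script_id, progress))


def run_pipelines(listener, count):
    """Simulate `count` pipeline instances notifying the same listener."""
    threads = [
        threading.Thread(target=listener.progress_updated, args=(i, 100))
        for i in range(count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
```

For example, `run_pipelines(SyncListenerAdaptor(RecordingListener()), 8)` delivers one notification per simulated pipeline, with the lock ensuring the user's listener never runs concurrently with itself.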