Need a way for Pig to take an alternative property file

Details

Description

Currently, Pig read the first ever pig.properties in CLASSPATH. Pig has a default pig.properties and if user have a different pig.properties, there will be a conflict since we can only read one. There are couple of ways to solve it:

1. Give a command line option for user to pass an additional property file
2. Change the name for default pig.properties to pig-default.properties, and user can give a pig.properties to override
3. Further, can we consider to use pig-default.xml/pig-site.xml, which seems to be more natural for hadoop community. If so, we shall provide backward compatibility to also read pig.properties, pig-cluster-hadoop-site.xml.

Activity

2. Change the name for default pig.properties to pig-default.properties, and user can give a pig.properties to override

3. Further, can we consider to use pig-default.xml/pig-site.xml, which seems to be more natural for hadoop community. If so, we shall provide backward compatibility to also read pig.properties, pig-cluster-hadoop-site.xml.

+1 to both the above approaches. In fact, I see that both are almost similar in the sense that user has the option to provide a set of properties in a file.

Since this is something that pig source-code provides, i think it makes sense to keep the filename (the one in which users can specify their own properties) fixed, rather than having it as a command line option.

V.V.Chaitanya Krishna
added a comment - 23/Apr/10 11:43 2. Change the name for default pig.properties to pig-default.properties, and user can give a pig.properties to override
3. Further, can we consider to use pig-default.xml/pig-site.xml, which seems to be more natural for hadoop community. If so, we shall provide backward compatibility to also read pig.properties, pig-cluster-hadoop-site.xml.
+1 to both the above approaches. In fact, I see that both are almost similar in the sense that user has the option to provide a set of properties in a file.
Since this is something that pig source-code provides, i think it makes sense to keep the filename (the one in which users can specify their own properties) fixed, rather than having it as a command line option.
Thoughts?

Do we need to have two different property files ? One possibility is to not package pig.properties in the pig.jar and then include it in the classpath while invoking Pig. (We can modify pig shell script to include it in the path by default). Then, user can add/delete/modify the pig.properties as he wish as well override default properties.

Disadvantage of two property files, is sometimes its confusing which property is getting picked up (one in default or one in user specified). If there is only one property file, there is only one way to specify the properties to Pig which I think is better way of doing it.

Ashutosh Chauhan
added a comment - 27/Apr/10 01:27 Do we need to have two different property files ? One possibility is to not package pig.properties in the pig.jar and then include it in the classpath while invoking Pig. (We can modify pig shell script to include it in the path by default). Then, user can add/delete/modify the pig.properties as he wish as well override default properties.
Disadvantage of two property files, is sometimes its confusing which property is getting picked up (one in default or one in user specified). If there is only one property file, there is only one way to specify the properties to Pig which I think is better way of doing it.

Do we need to have two different property files ? One possibility is to not package pig.properties in the pig.jar and then include it in the classpath while invoking Pig. (We can modify pig shell script to include it in the path by default). Then, user can add/delete/modify the pig.properties as he wish as well override default properties.

This might lead to overriding of some properties which might actually be unacceptable. Also, in the long run, we might want to have a configuration file with properties that are not supposed to be changed (similar to what happened in case of Hadoop project)

Disadvantage of two property files, is sometimes its confusing which property is getting picked up (one in default or one in user specified). If there is only one property file, there is only one way to specify the properties to Pig which I think is better way of doing it.

Since the processing of properties' files is sequential (i.e., one file after another), we can be sure that the latest occuring value is taken for a give property. For example, we can load the default properties' file first and then followed by the one in which users give their own set of properties. This way, we could provide preference to users' settings as well.

V.V.Chaitanya Krishna
added a comment - 27/Apr/10 11:10 Do we need to have two different property files ? One possibility is to not package pig.properties in the pig.jar and then include it in the classpath while invoking Pig. (We can modify pig shell script to include it in the path by default). Then, user can add/delete/modify the pig.properties as he wish as well override default properties.
This might lead to overriding of some properties which might actually be unacceptable. Also, in the long run, we might want to have a configuration file with properties that are not supposed to be changed (similar to what happened in case of Hadoop project)
Disadvantage of two property files, is sometimes its confusing which property is getting picked up (one in default or one in user specified). If there is only one property file, there is only one way to specify the properties to Pig which I think is better way of doing it.
Since the processing of properties' files is sequential (i.e., one file after another), we can be sure that the latest occuring value is taken for a give property. For example, we can load the default properties' file first and then followed by the one in which users give their own set of properties. This way, we could provide preference to users' settings as well.
Thoughts?

Option 3 also entails moving from key=value format to xml format, correct? Is there any value to the xml format beyond conformity with the rest of Hadoop? I think key=value is easier for people to understand.

For option 2 I think we'd need to provide an empty pig.properties file (just as hadoop provides an empty hadoop-site.xml).

Alan Gates
added a comment - 29/Apr/10 01:27 Option 3 also entails moving from key=value format to xml format, correct? Is there any value to the xml format beyond conformity with the rest of Hadoop? I think key=value is easier for people to understand.
For option 2 I think we'd need to provide an empty pig.properties file (just as hadoop provides an empty hadoop-site.xml).

Yes. I think it is a valid way of putting up properties. But I think xml format gives more structured-ness to the properties file, no? But however, this would also mean a significant code change in terms of the approach for loading the properties (things like xml parsing etc. might have to be taken care of).

For option 2 I think we'd need to provide an empty pig.properties file (just as hadoop provides an empty hadoop-site.xml).

We can provide the current pig.properties file itself, with the default values. So it would be somewhat analogous to hadoop-default.xml. I would also propose that this be bundled in the jar itself and thus avoid it to be changed by users. (because they would have a different properties file already to put up their settings).

V.V.Chaitanya Krishna
added a comment - 29/Apr/10 03:27 I think key=value is easier for people to understand.
Yes. I think it is a valid way of putting up properties. But I think xml format gives more structured-ness to the properties file, no? But however, this would also mean a significant code change in terms of the approach for loading the properties (things like xml parsing etc. might have to be taken care of).
For option 2 I think we'd need to provide an empty pig.properties file (just as hadoop provides an empty hadoop-site.xml).
We can provide the current pig.properties file itself, with the default values. So it would be somewhat analogous to hadoop-default.xml. I would also propose that this be bundled in the jar itself and thus avoid it to be changed by users. (because they would have a different properties file already to put up their settings).
Thoughts?

For option 3, yes, we need to move key value pair to xml. I checked current pig.properties, seems moving all properties is straightforward. Option 2 and 3 are functionally equivalent. The advantage for 2 is simplicity. The advantage for 3 is more hadoop compliant, one work with hadoop will be clear what pig-default.xml and pig-site-xml means.

Daniel Dai
added a comment - 29/Apr/10 05:30 For option 3, yes, we need to move key value pair to xml. I checked current pig.properties, seems moving all properties is straightforward. Option 2 and 3 are functionally equivalent. The advantage for 2 is simplicity. The advantage for 3 is more hadoop compliant, one work with hadoop will be clear what pig-default.xml and pig-site-xml means.

We have an internal discussion and decide to take option 1. In the command line, we take -propertyfile <name> for the alternative property file name. Pig will first load pig.properties, and then load alternative property file in classpath. Later entry will overwrite the previous seen one, ie. alternative property file will take precedence over pig.properties.

Daniel Dai
added a comment - 30/Apr/10 18:31 We have an internal discussion and decide to take option 1. In the command line, we take -propertyfile <name> for the alternative property file name. Pig will first load pig.properties, and then load alternative property file in classpath. Later entry will overwrite the previous seen one, ie. alternative property file will take precedence over pig.properties.

Olga Natkovich
added a comment - 04/May/10 18:33 I would also like to propose that we address issue raised in https://issues.apache.org/jira/browse/PIG-602 as part of this JIRA.
I think what is asked in PIG-602 can be accomplished by making the properties passed in available to all UDFs vi UDFContext. The UDFs can decide what they pull out.

Hi, V.V.Chaitanya,
Are you still working on this issue? We want to do both option 1 & 2. That is:
1. We provide command line option -propertyfile <name>
2. We read both pig-default.properties and pig.properties. pig.properties will override pig-default.properties. When release, we will bundle pig-default.properties in pig.jar and user can give a pig.properties in classpath

Daniel Dai
added a comment - 11/May/10 08:22 Hi, V.V.Chaitanya,
Are you still working on this issue? We want to do both option 1 & 2. That is:
1. We provide command line option -propertyfile <name>
2. We read both pig-default.properties and pig.properties. pig.properties will override pig-default.properties. When release, we will bundle pig-default.properties in pig.jar and user can give a pig.properties in classpath

Hadoop QA
added a comment - 12/May/10 19:56 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12444330/PIG-1381-3.patch
against trunk revision 943578.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/315/console
This message is automatically generated.

+1 on the changes.
For completeness, we can also check in an empty pig.properties in the conf dir and then add comments in both pig.properties and pig-default.properties that if user wants to pass some properties doing it through pig-default.properties will have no effect and instead they should add extra properties they want to add/override in pig.properties.

Ashutosh Chauhan
added a comment - 13/May/10 20:14 +1 on the changes.
For completeness, we can also check in an empty pig.properties in the conf dir and then add comments in both pig.properties and pig-default.properties that if user wants to pass some properties doing it through pig-default.properties will have no effect and instead they should add extra properties they want to add/override in pig.properties.

Apologies for coming in so late. I was on vacation and didnt have access to internet.
I have done the coding part for option 1. I need to write unit test cases for it. I should be submitting the patch positively by saturday.

V.V.Chaitanya Krishna
added a comment - 14/May/10 04:36 Apologies for coming in so late. I was on vacation and didnt have access to internet.
I have done the coding part for option 1. I need to write unit test cases for it. I should be submitting the patch positively by saturday.

Refactoring the code implementing the loading of properties from pig-default.properties and pig.properties (to avoid code duplication).

Extracting the code that loads properties from the deprecated .pigrc file. This makes it easier to use the method again to load properties from the user-specified properties' file.

load the properties from deprecated .pigrc file before the other default files (i.e., pig-default.properties and pig.properties). This will make the code simpler as we dont need to check for the existence of property before loading it from .pigrc file, because it will later get overriden by the value in pid-default.properties or pig.properties.

V.V.Chaitanya Krishna
added a comment - 24/May/10 10:43 Uploading patch that implements option 1 of command-line option of providing a properties' file by user.
The patch has the following changes:
Renaming PropertiesUtil.loadPropertiesFromFile to PropertiesUtil.loadDefaultProperties
Refactoring the code implementing the loading of properties from pig-default.properties and pig.properties (to avoid code duplication).
Extracting the code that loads properties from the deprecated .pigrc file. This makes it easier to use the method again to load properties from the user-specified properties' file.
load the properties from deprecated .pigrc file before the other default files (i.e., pig-default.properties and pig.properties). This will make the code simpler as we dont need to check for the existence of property before loading it from .pigrc file, because it will later get overriden by the value in pid-default.properties or pig.properties.

I reviewed the patch. Command line properties file will override default properties, and we can have multiple number of -propertyFile entry in command line. Command line switch is -P or -propertyFile. That's good.

I have a comment for the line:
opts.registerOpt('P', "propertyFile", CmdLineParser.ValueExpected.OPTIONAL);

I think value of "perpertyFile" perperty is not OPTIONAL, should change it to REQUIRED.

Daniel Dai
added a comment - 24/May/10 20:26 I reviewed the patch. Command line properties file will override default properties, and we can have multiple number of -propertyFile entry in command line. Command line switch is -P or -propertyFile. That's good.
I have a comment for the line:
opts.registerOpt('P', "propertyFile", CmdLineParser.ValueExpected.OPTIONAL);
I think value of "perpertyFile" perperty is not OPTIONAL, should change it to REQUIRED.

V.V.Chaitanya Krishna
added a comment - 25/May/10 04:24 I think value of "perpertyFile" perperty is not OPTIONAL, should change it to REQUIRED
If the option is made mandatory for the user, then the following scenarios might occur:
User who want to run just with the default properties and does not need any properties to be set will still be forced to submit a blank properties file.
If this is made mandatory, then the presence of pig.properties might not make much sense. I believe this option should be an alternative to pig.properties in user providing properties.
Ideally, I think users should be able to work without submitting their own set of properties.

Daniel Dai
added a comment - 25/May/10 04:41 Hi, V.V. Chaitanya,
It is "ValueExpected.REQUIRED". It does not require that the option appeared in command line. But if it does, then you you must give a value after "-propertyfile/-P".