This discussion started in HADOOP-8924 <https://issues.apache.org/jira/browse/HADOOP-8924>, where it was proposed to replace the build-time utility "saveVersion.sh" with a python script. This would require Python as a build-time dependency. Here's the background:

Those of us involved in the branch-1-win port of Hadoop to Windows without use of Cygwin have faced the issue of frequent use of shell scripts throughout the system, both at build time (e.g., the utility "saveVersion.sh") and at run time (config files like "hadoop-env.sh" and the start/stop scripts in "bin/*"). Similar usages exist throughout the Hadoop stack, in all projects.

The vast majority of these shell scripts do not do anything platform-specific; they can be expressed in a posix-conforming way. Therefore, it seems to us that it makes sense to start using a cross-platform scripting language, such as python, in place of shell for these purposes. For those rare occasions where platform-specific functionality really is needed, python also supports quite a lot of platform-specific functionality on both Linux and Windows; but where that is inadequate, one could still conditionally invoke a platform-specific module written in shell (for Linux/*nix) or powershell or bat (for Windows).
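
As a rough illustration of the approach described above, here is a minimal sketch in which the common logic is portable and only a small dispatcher chooses a platform-specific helper. The helper file names (set_limits.ps1, set_limits.sh) and the function names are hypothetical and do not correspond to anything in the Hadoop tree.

```python
import os
import platform
import subprocess


def build_classpath(jar_dir, jar_names):
    """Portable logic: os.pathsep is ';' on Windows and ':' elsewhere."""
    return os.pathsep.join(os.path.join(jar_dir, name) for name in jar_names)


def helper_command(system=None):
    """Choose the argv for a platform-specific helper script.

    The helper script names are hypothetical placeholders.
    """
    if system is None:
        system = platform.system()
    if system == "Windows":
        return ["powershell", "-File", "set_limits.ps1"]
    return ["sh", "set_limits.sh"]


def run_platform_helper():
    """Invoke the helper for the current platform and return its exit code."""
    return subprocess.call(helper_command())
```

Only helper_command needs to know about the platform; everything else runs unchanged on Linux and Windows.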

The primary motive for moving to a cross-platform scripting language is maintainability. The alternative would be to maintain two complete suites of scripts, one for Linux and one for Windows (and perhaps others in the future). We want to avoid the need to update dual modules in two different languages when functionality changes, especially given that many Linux developers are not familiar with powershell or bat, and many Windows developers are not familiar with shell or bash.

Regarding the choice of python:

- There are already a few instances of python usage in Hadoop, such as the utility (currently broken) "relnotes.py", and massive usage of python in the examples/ and contrib/ directories.
- Python is also used in Bigtop build-time.
- The Python language is available for free on essentially all platforms, under an Apache-compatible license <http://www.apache.org/legal/resolved.html>.
- It is supported in Eclipse and similar IDEs.
- Most importantly, it is widely accepted as a reasonably good OO scripting language, and it is easily learned by anyone who already knows shell or perl, or other common scripting languages.
- On the Tiobe index of programming language popularity <http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html>, which seeks to measure the relative number of software engineers who know and use each language, Python far exceeds Perl and Ruby. The only more well-known scripting languages are PHP and Visual Basic, neither of which seems a prime candidate for this use.

For build-time usage, I think we should immediately approve python as a build-time dependency, and allow people who are motivated to do so to open jiras for migrating existing build-time shell scripts to python.

For run-time, there is likely to be a lot more discussion. Lots of folks, including me, aren't really happy with the use of active scripts for configuration, and various others, including I believe some of the Bigtop folks, have issues with the way the start/stop scripts work. Nevertheless, all those scripts exist today and are widely used. And they present an impediment to porting to Windows-without-cygwin.

Nothing about run-time use of scripts has changed significantly over the past three years, and I don't think we should hold up the Windows port while we have a huge discussion about issues that veer dangerously into religious/aesthetic domains. It would be fun to have that discussion, but I don't want this decision to be dependent on it!

So I propose that we go ahead and also approve python as a run-time dependency, and allow the inclusion of python scripts in place of current shell-based functionality. The unpleasant alternative is to spawn a bunch of powershell scripts in parallel to the current shell scripts, with a very negative impact on maintainability. The Windows port must, after all, be allowed to proceed.

Let's have a discussion, and then I'll put both issues, separately, to a vote (unless we miraculously achieve consensus without a vote :-)

I also encourage members of the other Hadoop-related projects to carry this discussion into those forums. It would be very cool to agree on a whole-stack solution for the scripting problem.

Best regards,

-

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack


Alejandro

-

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Hi Alejandro,

For build-time issues in branch-2 and beyond, this may make sense (although I'm concerned about obscuring functionality in a way that only maven experts will be able to understand). In the particular case of saveVersion.sh, I'd be happy to see it done automatically by the build tools.

However, for build-time issues in the non-mavenized branch-1, and for run-time issues in both worlds, the need for cross-platform scripting remains.

> Hey Matt,
>
> We already require java/mvn/protoc/cmake/forrest (forrest is hopefully on
> its way out with the move of docs to APT)
>
> Why not do a maven-plugin to do that?
>
> Colin already has something to simplify all the cmake calls from the builds
> using a maven-plugin (https://issues.apache.org/jira/browse/HADOOP-8887)
>
> We could do the same with protoc, thus simplifying the POMs.
>
> The saveVersion.sh seems like another prime candidate for a maven plugin,
> and in this case it would not require external tools.
>
> Does this make sense?
>
> Thx
>
> --
> Alejandro

-

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack


-- Alejandro

-

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

I like Alejandro's idea about Maven for a few reasons:
- bringing in a scripting environment which is known for its inter-version idiosyncrasies just because Windows can't handle trivial shell scripting looks like overkill to me
- related to the above, there's a chance that Python's prerequisites used in Hadoop might conflict with some other components in the stack. This will be a nightmare for the integrator projects, i.e. Bigtop
- Maven is the de-facto standard for Java stacks
- Maven has a built-in scripting language (Groovy) if some plugins aren't sufficient for achieving whatever goals

Addressing Matt's later point about the non-Mavenized Hadoop-1 line: it uses Maven stuff such as deploy/install via custom ant tasks. The same approach would work for saveVersion.sh and others, I am sure.

Cos

On Wed, Nov 21, 2012 at 11:25AM, Alejandro Abdelnur wrote:
> Hey Matt,
>
> We already require java/mvn/protoc/cmake/forrest (forrest is hopefully on
> its way out with the move of docs to APT)
>
> Why not do a maven-plugin to do that?
>
> Colin already has something to simplify all the cmake calls from the builds
> using a maven-plugin (https://issues.apache.org/jira/browse/HADOOP-8887)
>
> We could do the same with protoc, thus simplifying the POMs.
>
> The saveVersion.sh seems like another prime candidate for a maven plugin,
> and in this case it would not require external tools.
>
> Does this make sense?
>
> Thx
>
> On Wed, Nov 21, 2012 at 11:15 AM, Matt Foley <[EMAIL PROTECTED]> wrote:
>
> > This discussion started in
> > HADOOP-8924<https://issues.apache.org/jira/browse/HADOOP-8924>
> > , where it was proposed to replace the build-time utility "saveVersion.sh"
> > with a python script. This would require Python as a build-time
> > dependency. Here's the background:

-

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

I worked on some of the Python build scripting that currently resides in branch-trunk-win. Initially, my goal was to keep a "pure" Maven implementation to the greatest degree possible without external scripting, but I encountered a few problems:

1. One approach is to try to express all of the build logic with existing Maven plugins. This turned out to be infeasible in some cases. I don't know of an existing plugin that does anything like the logic in saveVersion.sh/.py for walking the source tree and checksumming the files. For protoc, I saw a proposed plugin in open source, but it hadn't reached release status yet. For creation of the distribution tarballs, the Maven Ant Plugin (and actually the underlying Ant tool) cannot preserve file permissions or symlinks.

2. Considering that the first approach isn't possible, another possibility is to write custom Maven plugins. This would require significantly more engineering time to write and test the code. I think there are some legitimate concerns too about supportability, because this approach would put significant build logic into Maven plugin code instead of something more easily visible to release engineers, like pom.xml and external scripts. Also, I'm actually not sure that we can implement everything with a Maven plugin. For example, I mentioned the problem of preserving file permissions and symlinks in the distribution tarballs. Ant hasn't been able to fix that problem due to a Java limitation, so our Maven plugins coded in Java (or another JVM language) likely would suffer the same fate. We might be stuck with some amount of external scripting no matter what.
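
To make the "walking the source tree and checksumming the files" step concrete, here is a minimal Python sketch of that kind of logic. It is not the actual saveVersion.sh/.py code: the function name, the ".java" filter, and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib
import os


def source_checksum(root, suffix=".java"):
    """Return one hex digest covering every file under root ending in suffix.

    Directories and files are visited in sorted order so the result does
    not depend on the order in which the OS lists directory entries.
    """
    digest = hashlib.sha256()  # hash choice is an assumption
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort steers os.walk's traversal order
        for name in sorted(filenames):
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                # Read in chunks so large files are not loaded whole.
                for chunk in iter(lambda: handle.read(65536), b""):
                    digest.update(chunk)
    return digest.hexdigest()
```

The same file runs unchanged on Linux and Windows, since the path handling goes through os.path.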


-

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

> I like Alejandro's idea about Maven for a few reasons:
> - bringing in a scripting environment which is known for its inter-version
> idiosyncrasies just because Windows can't handle trivial shell scripting
> looks like an overkill to me

Excuse me? Can we at least try not to belittle other people's platforms on a public Apache forum? There's nothing trivial about implementing shell on Windows, as cygwin regrettably proved.

> - relative to above, there's a chance that Python's pre-requisites used in
> Hadoop might get into a conflict with some other components in the stack.
> This will be a nightmare for the integrator projects i.e. Bigtop

Are you proposing Groovy as a better scripting language than Python?

> Addressing Matt's later point about non-Mavenized Hadoop-1 line: it uses
> Maven stuff such as deploy/install via custom ant tasks. Same approach would
> work for saveVersion.sh and others, I am sure.

Current ant scripts in Hadoop seem to use maven only for artifact management via the maven repository. If I'm missing something, please point it out. The ant build task currently calls out to saveVersion.sh. Having it call out to maven, which then calls out to a plug-in and/or a Groovy script, doesn't sound like an improvement to me. And it's a way different use of maven than currently in the Hadoop-1 line, not a continuation of established practice.

--Matt

-

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Sorry, to clarify my point a little more, Ant does allow you to make declarations to explicitly set the desired file permissions via the fileMode attribute of a tarfileset. However, it does not have the capability to preserve whatever permissions were naturally created on files earlier in the build process. This is a difference in maintainability, as adding new files to the build may then require extra maintenance of the Ant directives to apply the desired fileMode. This is an easy thing to overlook. A solution that preserves the natural permissions requires less maintenance overhead.

I couldn't find a way to make the assembly plugin preserve permissions like this either. It just has explicit fileMode directives similar to Ant. (Let me know if I missed something though.)

To see symlinks show up in distribution tarballs, you need to build with the native components, like libhadoop.so or bundled Snappy.
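
For comparison, here is a minimal sketch of a tarball step that keeps whatever permissions and symlinks already exist on disk, using Python's standard tarfile module. The function names are hypothetical, and the sketch assumes a POSIX filesystem (creating symlinks on Windows may require extra privileges) and a tar_path outside source_dir.

```python
import os
import tarfile


def make_dist_tarball(source_dir, tar_path):
    """Archive source_dir as a gzipped tar.

    tarfile records each file's existing mode bits and, because
    dereference defaults to False, stores symlinks as symlinks.
    """
    top = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(source_dir, arcname=top)
    return tar_path


def list_members(tar_path):
    """Return {name: (permission bits, symlink target or None)}."""
    result = {}
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar.getmembers():
            target = member.linkname if member.issym() else None
            result[member.name] = (member.mode & 0o777, target)
    return result
```

No per-file mode directives are needed: a new executable added to the tree keeps its bits in the archive without any change to the build script.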

On Wed, Nov 21, 2012 at 01:14PM, Matt Foley wrote:
> Cos,
> Please see in-line.
>
> On Wed, Nov 21, 2012 at 12:00 PM, Konstantin Boudnik <[EMAIL PROTECTED]> wrote:
>
> > I like Alejandro's idea about Maven for a few of reasons:
> > - bringing in a scripting environment which is known for its
> > inter-version idiosyncrasies just because Windows can't handle trivial
> > shell scripting looks like an overkill to me
>
> Excuse me? Can we at least try not to belittle other people's platforms on
> a public Apache forum? There's nothing trivial about implementing shell on
> Windows, as cygwin regrettably proved.

Belittle? Hardly ;) Because we all know very well why shell is so awkward to implement on any Windows system.

> > - relative to above, there's a chance that Python's pre-requisites used
> > in Hadoop might get into a conflict with some other components in the
> > stack. This will be a nightmare for the integrator projects i.e. Bigtop
>
> Said Bigtop project actually uses python, does it not?

It does, Matt. The main concern I have is that at some point Hadoop's Python might all of a sudden be of a different version than the one in Bigtop. And all hell will break loose compatibility-wise. What would be the solution then?

> > - Maven is de-facto standard for Java stacks
>
> Sure -- except for when Ant was the de-facto standard for Java stacks. And

I am proposing that Groovy is a better language than Python. Because, in part, it goes far beyond scripting. And it doesn't have permanent runtime backward compatibility issues. When was the last time the JDK had backward compatibility problems?

> > Addressing Matt's later point about non-Mavenized Hadoop-1 line: it uses
> > Maven stuff suchs as deploy/install via custom ant tasks. Same approach
> > would work for saveVersion.sh and others, I am sure.
>
> Current ant scripts in Hadoop seem to use maven only for artifact
> management via the maven repository. If I'm missing something, please
> point it out. The ant build task currently calls out to saveVersion.sh.
> Having it call out to maven, which then calls out to a plug-in and/or a
> Groovy script, doesn't sound like an improvement to me. And it's a way

At least it is guaranteed to work everywhere. And all we need in this case is an extra jar file that can be pulled down through the same ivy/maven dependency mechanism.

In the case of Python you'd have to make sure that you have the right version of the interpreter and runtime. And you will have to do it manually or have an extra requirement expressed via a system maintenance DSL.

> different use of maven than currently in the Hadoop-1 line, not a
> continuation of established practice.

The main point of my argument expressed in fewer than 100 words: adding Python, which is inconsistent across different Linux distros and has a history of backward incompatibilities (2.6 vs 2.5, 3.0 vs earlier, etc.), doesn't seem to be justified by the benefit of having a somewhat easier build on Windows.

Perhaps we can do a more formal benefit analysis by just comparing the number of Hadoop installations on MS Win vs. Unix.

Cos


-

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

On Wed, Nov 21, 2012 at 1:50 PM, Konstantin Boudnik <[EMAIL PROTECTED]> wrote:
> The main point of my argument expressed in a lesser than 100 words: adding
> Python that is inconsistent across different Linux distros and has a history
> of backward incompatibilities (2.6 vs 2.5, 3.0 vs earlier, etc.) doesn't seem
> to leverage the benefit of having a somewhat easier build in Windows.

This seems mostly like a red herring to me. I'll grant that it's hard to build a complete Python stack that's compatible between Python 2.x and 2.y, but it's very easy by following best practices to keep python *scripts* compatible across all reasonable Python 2.x versions. Simply pick an oldest-supported-version like 2.4 and be reasonably disciplined to not use newer constructs or libraries. I wouldn't wish to try to build a complete system in such a limited dialect [1], but for "we need a reasonable replacement for /bin/sh scripts" it's just fine.

Ignore Python 3 for the time being, it's a completely different language with incompatible syntax and semantics that doesn't support several currently-important platforms. Maybe in a few years sane people can consider moving to it, but for now it's best to just stick with the compatible subset of Python 2.x.

[1] the Mercurial project has had a pretty good experience with this scheme (http://mercurial.selenic.com/wiki/SupportedPythonVersions); they currently support 2.4 - 2.7 with a few required libraries. They dropped 2.2 and 2.3 support a few years ago due to specific shortcomings on those versions.
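
One way to enforce an oldest-supported-version policy like the one described above is a guard at the top of each script. The sketch below is only an illustration: the names are hypothetical, and the (2, 4) floor is taken from the example in the text rather than from any project decision. It deliberately sticks to constructs (%-formatting, tuple comparison) that are valid across Python versions.

```python
import sys

# Hypothetical policy constant: the oldest interpreter the scripts promise
# to run on. The value mirrors the 2.4 example in the text.
OLDEST_SUPPORTED = (2, 4)


def meets_minimum(version_info=None, minimum=OLDEST_SUPPORTED):
    """Return True if (major, minor) of version_info is at least minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


def describe(version_info):
    """Format a version tuple as 'major.minor'."""
    return "%d.%d" % (version_info[0], version_info[1])


def require_supported_interpreter():
    """Exit with a clear message instead of failing later on a syntax error."""
    if not meets_minimum():
        sys.stderr.write("Python %s or newer is required, found %s\n"
                         % (describe(OLDEST_SUPPORTED),
                            describe(sys.version_info)))
        sys.exit(1)
```

A script would call require_supported_interpreter() before doing any real work, so a too-old interpreter is reported up front.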

-andy

-

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

On 21.11.2012 22:44, Chris Nauroth wrote:
> Sorry, to clarify my point a little more, Ant does allow you to make
> declarations to explicitly set the desired file permissions via the
> fileMode attribute of a tarfileset.

There are just 2 directories, /bin and /sbin, with executable files. It's probably possible to set the file mode per directory in the maven assembly plugin.

-

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

> Ignore Python 3 for the time being, it's a completely different language
> with incompatible syntax and semantics that doesn't support several
> currently-important platforms. Maybe in a few years sane people can
> consider moving to it, but for now it's best to just stick with the
> compatible subset of Python 2.x.
>
> [1] the Mercurial project has had a pretty good experience with this
> scheme; http://mercurial.selenic.com/wiki/SupportedPythonVersions they
> currently support 2.4 - 2.7 with a few required libraries. They dropped
> 2.2 and 2.3 support a few years ago due to specific shortcomings on those
> versions.

I know that Python compatibility can be worked around. I used Python for a few years and wrote about 70k LOC in it, until it started to irritate me that every new version has incompatibilities (such as 2.4 vs 2.3 vs 2.5), which makes maintaining and testing way harder than it should be. It's not just compatibility with missing library functions; sometimes even an expression evaluates to a different value under a new version. This was similar to the PHP 4 to PHP 5 migration. Today I have 3 versions of Python installed because of software requirements.

For simple scripts it can probably work if you stick to some common subset.

Unfortunately, there are a couple of spots where it gets really messy and directory-wide rules fail to cover it. The trickiest maintenance issue is hadoop-hdfs-httpfs, where we unpack and repack a Tomcat. Initially, I tried to do this using only the ant plugin, but I wound up with a ton of different tarfileset directives with different fileMode values to reapply the same permissions that were present in the original Tomcat distribution. This also would have been a brittle solution, because changes in the Tomcat package would risk invalidating our ant rules. A solution that preserves the original permissions reduces this kind of maintenance work.

> On 21.11.2012 22:44, Chris Nauroth wrote:
>
>> Sorry, to clarify my point a little more, Ant does allow you to make
>> declarations to explicitly set the desired file permissions via the
>> fileMode attribute of a tarfileset.
>
> there are just 2 directories /bin and /sbin with executable files. Its
> probably possible to set file mode per directory in maven assembly plugin.

-

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

On Thu, Nov 22, 2012 at 12:58AM, Radim Kolar wrote:
> I know that Python compatibility can be worked around. I used Python
> for a few years and wrote about 70k LOC in it until it started to
> irritate me that every new version has incompatibilities such as 2.4
> vs 2.3 vs 2.5 and it makes maintaining and testing way harder than
> it should be. It's not just compatibility with missing library
> functions; sometimes even an expression evaluates to a different value
> under a new version. This was similar to the PHP 4 to PHP 5 migration.
> Today I have 3 versions of Python installed because of software
> requirements.
>
> For simple scripts it can probably work if you stick to some common subset.
>
> Scripting via a maven plugin has the advantage that the user does not need
> to install anything; there are a couple of languages available: scala,
> groovy, jelly, jruby. Maybe jython too.

pretty much all of the j* in JSR223 land is an abomination of one sort or another, actually :)

Cos

-

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

Dne 22.11.2012 1:14, Chris Nauroth napsal(a):> The trickiest maintenance issue is hadoop-hdfs-httpfs, where we unpack and repack a Tomcat.why its not possible to just ship WAR file? Its seems to be special purpose app and they needs hand security setup anyway and intergration with existing firewall/web infrastructure.

> pretty much all of the j* in JSR223 land is abomination of one sort or
> another, actually :)

JRuby is good because you can run a Rails application on standard Java infrastructure, which is way easier to maintain than obscure Ruby application servers.

-

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

This predates me, so I don't know the rationale for repackaging Tomcat inside HTTPFS. I suspect that there was a desire to create a fully stand-alone distribution package, including a full web server. The Maven Jetty plugin isn't directly applicable to this use case. I don't know why it was decided to use Tomcat instead of Jetty. (If anyone else out there has the background, please respond.) Regardless, if the desire is to package a full web server instead of just the war, then switching to Jetty would not change the challenges of the build process. We'd still need to preserve whatever permissions are present in the Jetty distribution.

In general, when I was working on this, I did not question whether the current packaging was "correct". I assumed that whatever changes I made for Windows compatibility must yield the exact same distribution without changes on currently supported platforms like Linux. If there are questions around actually changing the output of the build process, then that will steer the conversation in another direction and increase the scope of this effort.

It seems like the trickiest issue is preservation of permissions and symlinks in tar files. I suspect that any JVM-based solution like custom Maven plugins, Groovy, or jtar would be limited in this respect. According to Ant documentation, it's a JDK limitation, so I suspect all of these would have the same problem. I haven't tried any of them though. (If there was a feasible solution, then Ant likely would have incorporated it long ago.) If anyone wants to try though, we might learn something from that.

> On 22.11.2012 1:14, Chris Nauroth wrote:
>> The trickiest maintenance issue is hadoop-hdfs-httpfs, where we unpack
>> and repack a Tomcat.
>
> Why is it not possible to just ship a WAR file? It seems to be a special
> purpose app, and such apps need hand security setup anyway, plus integration
> with existing firewall/web infrastructure.
>
> Did you consider using Jetty? It has really good maven support:
> http://wiki.eclipse.org/Jetty/Feature/Jetty_Maven_Plugin
> I am using jetty 8 instead of tomcat and run it with java -jar start.jar;
> no extra file permissions like the x bit are needed.
>
> If you really need to create a tar by hand, there is a java library for doing
> it, http://code.google.com/p/jtar/, and it can be used from any JVM-based
> scripting language, so you have plenty of choices.

-

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

> It seems like the trickiest issue is preservation of permissions and
> symlinks in tar files. I suspect that any JVM-based solution like custom
> Maven plugins, Groovy, or jtar would be limited in this respect. According
> to Ant documentation, it's a JDK limitation, so I suspect all of these
> would have the same problem. I haven't tried any of them though. (If
> there was a feasible solution, then Ant likely would have incorporated it
> long ago.) If anyone wants to try though, we might learn something from
> that.
>
> Thank you,
> --Chris

You are limited by what File.canRead(), canWrite() and canExecute() tell you.

The absence of a way to detect file permissions in Java is because of the lowest-common-denominator approach of the Java FS APIs, supporting FAT32 (odd case logic, no perms or symlinks), NTFS (odd case logic, ACLs over perms, symlinks historically very hard to create), HFS+ (case insensitive unix fs!) as well as classic unixy filesystems.

Ant <tarfileset> filesets in <tar> let you spec permissions on the filesets you pull into the tar. They are generated x-platform, which is the other reason why you declare them in <tar>: you can generate proper tar files even if you use a Windows box.
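The same "declare the mode, don't read it from disk" idea can be sketched in Python for comparison. The rule table below (prefixes bin/ and sbin/ executable, everything else 0644) and the function names are assumptions for illustration only, loosely following the two executable directories mentioned earlier in the thread.

```python
import io
import tarfile

# Hypothetical rules: archive-path prefix -> mode to record in the tar.
MODE_RULES = [("bin/", 0o755), ("sbin/", 0o755)]
DEFAULT_MODE = 0o644


def mode_for(path):
    """Return the declared mode for an archive path."""
    for prefix, mode in MODE_RULES:
        if path.startswith(prefix):
            return mode
    return DEFAULT_MODE


def build_tar(files):
    """Build an uncompressed tar from a dict of archive path -> bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            # The mode comes from the rule table, never from the local
            # filesystem, so a Windows build box records the same bits.
            info.mode = mode_for(name)
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()
```

The trade-off is the one already described for tarfileset: the rules are portable, but they have to be kept in step with whatever the packaged distribution actually contains.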

Symlinks are problematic: even detecting them cross platform is pretty unreliable. To really do them you'd need to add a new <symlinkfileset> entity for <tar>, that would take the link declaration. I could imagine how to do that, and if stuck into the hadoop tools JAR, it wouldn't even depend on a new version of Ant.
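For comparison only, this is roughly what "taking the link declaration" looks like with Python's tarfile module, where a symlink entry is written from a declared path and target without any symlink existing on the build machine. The function name and the example paths are hypothetical; this is not a proposal for how the Ant entity would be implemented.

```python
import io
import tarfile


def build_tar_with_links(files, symlinks):
    """Build a tar from declared content.

    files:    archive path -> bytes
    symlinks: archive path of the link -> link target (both hypothetical)
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in sorted(files.items()):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
        for link_path, target in sorted(symlinks.items()):
            info = tarfile.TarInfo(link_path)
            info.type = tarfile.SYMTYPE   # mark the entry as a symlink
            info.linkname = target        # declared, not detected on disk
            tar.addfile(info)             # symlink entries carry no data
    return buf.getvalue()
```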

Maven just adds extra layers in the way.

-Steve

-

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

> This discussion started in
>
> Those of us involved in the branch-1-win port of Hadoop to Windows without
> use of Cygwin, have faced the issue of frequent use of shell scripts
> throughout the system, both in build time (eg, the utility
> "saveVersion.sh"),
> and run time (config files like "hadoop-env.sh" and the start/stop scripts
> in "bin/*" ). Similar usages exist throughout the Hadoop stack, in all
> projects.
>
> The vast majority of these shell scripts do not do anything platform
> specific; they can be expressed in a posix-conforming way. Therefore, it
> seems to us that it makes sense to start using a cross-platform scripting
> language, such as python, in place of shell for these purposes. For those
> rare occasions where platform-specific functionality really is needed,
> python also supports quite a lot of platform-specific functionality on both
> Linux and Windows; but where that is inadequate, one could still
> conditionally invoke a platform-specific module written in shell (for
> Linux/*nix) or powershell or bat (for Windows).
>
> The primary motive for moving to a cross-platform scripting language is
> maintainability. The alternative would be to maintain two complete suites
> of scripts, one for Linux and one for Windows (and perhaps others in the
> future). We want to avoid the need to update dual modules in two different
> languages when functionality changes, especially given that many Linux
> developers are not familiar with powershell or bat, and many Windows
> developers are not familiar with shell or bash.

I'd argue that a lot of Hadoop Java developers aren't that familiar with bash. It's only in the last six months that I've come to hate it properly.

In the ant project, it was the launcher scripts that had the worst bugrep:line ratio, due to:

- variations in .sh behaviour, especially under cygwin, but also things that weren't bash (AIX, ...)
- requirements of the entire unix command set for real work
- variants in the parameters/behaviour of those commands between Linux and other widely used Unix systems (e.g. OSX)
- lack of inclusion of the .sh scripts in the junit test suite
- lack of understanding of bash.

In the ant project we added a Python launcher in, what, 2001, based on the Perl launcher supplied by one [EMAIL PROTECTED]ceforge

> For run-time, there is likely to be a lot more discussion. Lots of folks,
> including me, aren't real happy with use of active scripts for
> configuration, and various others, including I believe some of the Bigtop
> folks, have issues with the way the start/stop scripts work. Nevertheless,
> all those scripts exist today and are widely used. And they present an
> impediment to porting to Windows-without-cygwin.

They're a maintenance and support cost on Unix. Too many scripts, even more in Yarn, weakly-nondeterministic logic for loading env variables, especially between init.d and bin/hadoop; not much diagnostics. And as with Ant, a relatively under-comprehended language with no unit test coverage.

I'd replace the bash logic with python for Unix dev and maintenance alone. You could put your logic into a shared python module in usr/lib/hadoop/bin, and have PyUnit test the inner functions as part of the build and test process (& jenkins).

> Nothing about run-time use of scripts has changed significantly over the
> past three years, and I don't think we should hold up the Windows port
> while we have a huge discussion about issues that veer dangerously into
> religious/aesthetic domains. It would be fun to have that discussion, but I
> don't want this decision to be dependent on it!

With Yarn it's got more complex. More env variables to set, more support calls when they aren't.

> So I propose that we go ahead and also approve python as a run-time
> dependency, and allow the inclusion of python scripts in place of current
> shell-based functionality. The unpleasant alternative is to spawn a bunch

+1 to any vote to allow .py at run time as a new feature

=0 to ripping out and replacing the existing .sh scripts with python code, as even though I don't like the scripts, replacing them could be traumatic downstream.

+1 to a gradual migration to .py for new code, starting with the yarn scripts.
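To make the "shared python module with PyUnit tests" idea above concrete, here is a minimal sketch. The function names, the error message, and the way the command line is assembled are all assumptions for illustration; this is not how bin/hadoop actually builds its command line.

```python
import os
import shlex
import unittest


def resolve_java(env):
    """Locate the java executable from JAVA_HOME, failing loudly if unset."""
    java_home = env.get("JAVA_HOME")
    if not java_home:
        raise ValueError("JAVA_HOME is not set")
    return os.path.join(java_home, "bin", "java")


def build_command(env, main_class, args):
    """Assemble a launch command as a list (hypothetical layout).

    Taking the environment as a plain dict, instead of reading os.environ
    directly, is what makes this testable without touching the real shell.
    """
    cmd = [resolve_java(env)]
    cmd.extend(shlex.split(env.get("HADOOP_OPTS", "")))
    cmd.append(main_class)
    cmd.extend(args)
    return cmd


class LauncherTest(unittest.TestCase):
    def test_missing_java_home_is_diagnosed(self):
        with self.assertRaises(ValueError):
            resolve_java({})

    def test_options_are_split_and_ordered(self):
        env = {"JAVA_HOME": "/opt/jdk", "HADOOP_OPTS": "-Xmx1g -Dfoo=bar"}
        cmd = build_command(env, "org.example.Main", ["status"])
        self.assertEqual(
            cmd[1:], ["-Xmx1g", "-Dfoo=bar", "org.example.Main", "status"])
```

If this were saved as a module, the tests could be run with python -m unittest as one step of the build, which is the kind of coverage the .sh launchers never had.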

-

Re: [PROPOSAL] introduce Python as build-time and run-time dependency for Hadoop and throughout Hadoop stack

The JSR-223 bridge comes with a javascript interpreter built in, BTW. You can actually use it in ant's <script> and <scriptdef> tasks without even having to stick a new Jar on the CP. That doesn't mean it's ideal.

There was recent discussion on bigtop dev about moving to a later version of groovy; Roman found they ran into some problem where the maven groovy code was reluctant to upgrade:

> We'd still need to preserve whatever permissions are present in the Jetty distribution.

In the jetty distribution there is just one shell startup script, and you can even run jetty without it using the autostartable jar. A requirement to preserve permissions is overkill; at most you need to chmod +x one script. In tomcat it would be similar.

> Maven plugins, Groovy, or jtar would be limited in this respect.

In jtar you are manipulating the resulting tar file directly:

If we decide to go with Maven then there's no point to complicate the picture with jython. This time I will keep the offensive about *yton to myself ;)

Cos

On Sat, Nov 24, 2012 at 10:26PM, Radim Kolar wrote:
> We have not discussed the advantages of stand-alone python vs
> jython-in-maven pom:
>
> http://code.google.com/p/jy-maven-plugin/
>
> The language is about the same, and it does not need to be installed,
> which is an advantage on windows.
