JobConf.findContainingJar unescapes unnecessarily on Linux

Details

Description

In JobConf.findContainingJar, the path name is decoded using URLDecoder.decode(...). This was done by Doug in r381794 (commit msg "Un-escape containing jar's path, which is URL-encoded. This fixes things primarily on Windows, where paths are likely to contain spaces.") Unfortunately, jar paths do not appear to be URL encoded on Linux. If you try to use "hadoop jar" on a jar with a "+" in it, this function decodes it to a space and then the job cannot be submitted.

Todd Lipcon
added a comment - 07/Jul/09 02:22 I just confirmed this on Windows vs Linux. On Windows, the URL that you get back from ClassLoader.getResource has spaces encoded as "%20". On Linux it doesn't.
Anyone have a creative solution to deal with this? We'd like to have +s in our version numbers due to standards in RPM and Debian land, but this is blocking that.

I spent about 4 hours trying to find a portable non-klugey fix. The trouble is that the behavior is different on Windows compared to Linux. On a Windows JVM, it encodes spaces as "%20" and +s as "%2B" and on Linux it does neither, best I can tell.

I definitely agree that this fix is not optimal, but I think '+' is the most common case for a "weird" character in a JAR name. In Debian and RPM packages, the non-alphanumeric characters allowed in version numbers are [+-~.:] and I think all of those should be fine after this patch.

Todd Lipcon
added a comment - 08/Jul/09 18:18 Owen: that's the only difference I know of that's mentioned in the spec: http://www.w3.org/MarkUp/html-spec/html-spec_8.html#SEC8.2.1
I spent about 4 hours trying to find a portable non-klugey fix. The trouble is that the behavior is different on Windows compared to Linux. On a Windows JVM, it encodes spaces as "%20" and +s as "%2B" and on Linux it does neither, best I can tell.
I definitely agree that this fix is not optimal, but I think '+' is the most common case for a "weird" character in a JAR name. In Debian and RPM packages, the non-alphanumeric characters allowed in version numbers are [+-~.:] and I think all of those should be fine after this patch.

Owen O'Malley
added a comment - 08/Jul/09 18:36 Of course, you could detect the operating system like Path does.
Look at the definition of Path.WINDOWS. Of course, we probably should make a method that checks that boolean...
That said, I'm ok with the quoting as long as it works on both operating systems. (I agree, it is kludgy, but...)

[exec] -1 overall. [exec][exec] +1 @author. The patch does not contain any @author tags.[exec][exec] +1 tests included. The patch appears to include 3 new or modified tests.[exec][exec] +1 javadoc. The javadoc tool did not generate any warning messages.[exec][exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.[exec][exec] -1 findbugs. The patch appears to introduce 13 new Findbugs warnings.[exec][exec] -1 release audit. The applied patch generated 2 release audit warnings (more than the trunk's current 1 warnings).[exec][exec] +1 system test framework. The patch passed system test framework compile.

The findbugs warnings are bogus (none related to this patch). The release audit warnings are also unrelated ("smoke-tests" file and "robots.txt" file). See MAPREDUCE-2172.

Eli Collins
added a comment - 29/Dec/10 20:41 +1
This patch looks good to me. Looks like there may be other potential incorrect uses of URLDecode in the HarFileSystem, minding filling a jira to fix those?

Todd Lipcon
added a comment - 29/Dec/10 20:50 Looks like there may be other potential incorrect uses of URLDecode in the HarFileSystem, minding filling a jira to fix those?
Looking at HarFileSystem, it seems like it's actually safe - it's using URLDecoder to decode something that was encoded using URLEncoder in HadoopArchives.java. Using these as a pair is OK:
groovy:000> URLDecoder.decode(URLEncoder.encode( "foo+bar baz 100%" ))
===> foo+bar baz 100%