User supplied dependencies may conflict with MapReduce system JARs

Details

Description

If user code has a dependency on a version of a JAR that is different to the one that happens to be used by Hadoop, then it may not work correctly. This happened with user code using a different version of Avro, as reported here.

The problem is analogous to the one that application servers have with WAR loading. Using a specialized classloader in the Child JVM is probably the way to solve this.

Activity

Scott Carey
added a comment - 14/Apr/10 17:58

I'm glad you filed this. I was just getting frustrated with this issue myself in the last couple of weeks and have various thoughts on the issue. Some of these ideas are raw and flawed, but here is what I have been thinking:

Ideally, the framework would limit the classes visible to a job to the minimum required for job execution. A job could then bring in its own dependencies. Also, if there was a built-in Hadoop dependency hidden by default that a job wanted, it could request access to it.

Similarly frustrating, and related, is how an M/R job has to submit its whole job jar to the cluster each time. I have a 28MB jar, and a workflow of about 35 dependent M/R jobs (a DAG of them). Towards the end of this chain, the jobs get smaller and smaller in data size (the end ones are joining, augmenting, transforming and sorting data aggregated by the earlier jobs).

Two big things account for more clock time than the 'heavy lifting' work of the initial 'big data' jobs – job submission time and scheduling inefficiencies. The former is related to dependency management.

If the framework could support installing jars into an 'application' classloader space and then having jobs reference that space, task latency could be reduced significantly, as each job submission would not need to also submit all its dependency jars. In my case, the job jar would probably become a couple hundred K instead of almost 30MB – or even zero K if the jobs could just be stored and called. TaskTracker nodes could cache these application library spaces to reduce job start-up time.

In some ways, the dependency management above is like an application server. Each 'application' has its own classloader space, and there might be several different jobs available in an 'application' – analogous to several servlets available in a web app. Like an app server, there will probably be a need for a lib directory that is global, one that is exclusive to the framework, and a per-application space.

There are some questions about static variables related to such classloader partitioning. With shared JVMs across tasks, users expect statics to live from one task to another in the same job. This means the classloader in a JVM corresponds with the Job ID and whether it is an M or R. Per-Job classloaders could enable JVM recycling across jobs in the distant future, because disposing of a Job's classloader will free its static variables. That in turn leads to the possibility of future reductions in start-up time and per-task costs.

Arun C Murthy
added a comment - 14/Apr/10 21:33

If the framework could support installing jars into an 'application' classloader space and then jobs reference that space, task latency could be reduced significantly as each job submission would not need to also submit all its dependency jars.

Scott, this is precisely what the DistributedCache was designed for. Please load your jars to HDFS, add your jars to the DistributedCache, and then they are 'localized' once per tasktracker and all jobs can use the same:
http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/filecache/DistributedCache.html

Scott Carey
added a comment - 15/Apr/10 00:34

The documentation for DistributedCache says:

"Its efficiency stems from the fact that the files are only copied once per job and the ability to cache archives which are un-archived on the slaves."

Is the documentation wrong? Or is the claim that the distribution happens once per tasktracker and multiple jobs can use it incorrect?

The documentation above is ambiguous – does it copy items once per job, un-archiving once per slave per job? Or does it cache un-archived data on slaves across a longer period of time?

What I am suggesting is not a Job-scope cache, but something that has a much longer scope – days, weeks, months – to share between many different jobs without per-job copying or unpacking unless the contents have changed. It is unclear from the documentation on DistributedCache if there is any optimization outside of the Job scope. If it had this sort of optimization already, that would be great.

Steve Loughran
added a comment - 15/Apr/10 14:19

Getting custom classloaders right is one of the hardest things to do in Java. Whoever volunteers to do this (and I opt to run away from it) had better talk to the experts in the area. If it is purely for short-lived standalone tasks things would be simpler (less classloader leakage risk), but you still have to be very good at handling the problems a CL tree brings to the table:

- returning anything loaded by a custom CL means the CL and all loaded classes hang around in the VM, and reloading becomes tricky.
- multiple singletons in a single JVM.
- object equality tests fail.
- weird errors that you had better log rather than hope don't happen.

I've always felt that the ASF ought to have an "understands classloaders" qualification; if you don't pass it, you don't get to submit classloaders to the codebase.
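The "multiple singletons" and "object equality tests fail" hazards above can be demonstrated with nothing but the JDK. The sketch below (all class and method names here are mine, not from any patch) defines the same class a second time through a custom classloader; the two Class objects are unequal, so casts, `instanceof` checks, and any statics between them are disjoint:

```java
import java.io.InputStream;

public class ClassIdentityDemo {
    // A loader that defines Payload itself instead of delegating to its parent,
    // standing in for a per-job classloader.
    static class IsolatingLoader extends ClassLoader {
        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (name.equals("Payload")) {
                try (InputStream in = getResourceAsStream("Payload.class")) {
                    byte[] bytes = in.readAllBytes();
                    return defineClass(name, bytes, 0, bytes.length);
                } catch (Exception e) {
                    throw new ClassNotFoundException(name, e);
                }
            }
            return super.loadClass(name, resolve);
        }
    }

    public static void main(String[] args) throws Exception {
        Class<?> a = Class.forName("Payload");                   // system classloader
        Class<?> b = new IsolatingLoader().loadClass("Payload"); // custom classloader
        // Same bytes, same name, but two distinct classes in one JVM:
        System.out.println(a == b);                          // false
        System.out.println(a.getName().equals(b.getName())); // true
        System.out.println(a.isAssignableFrom(b));           // false: a cast between them fails
    }
}

class Payload {
}
```

Each static field of `Payload` exists once per defining loader, which is exactly why per-job classloaders both cause the "two singletons" surprise and, when disposed, free a job's statics.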

David Rosenstrauch
added a comment - 12/Aug/10 21:11

I also just ran into this issue. (Again, due to using a recent release of avro + jackson.)

Is there any workaround for this? (Short of having to go into every node on the cluster and removing the jackson jar from the hadoop installation?)

Henning Blohm
added a comment - 04/Nov/10 13:35

The patch in MAPREDUCE-1938 unfortunately does not solve the issue when the job implementation uses custom class loaders to load dependency classes. The proposed patch only addresses the issue when no custom class loaders are in the picture.

As a first step, it would really help (us, anyway) if jobs were not started with stuff on the classpath that is not at all required for job execution per se (e.g. jetty libs, the eclipse java compiler, jasper...).

Secondly, hadoop could actually start with only the hadoop API types on the classpath plus a small launcher that would load hadoop's implementation in an isolated child loader, so that implementation dependencies do not leak through to the job's implementation. I am not sure if the hadoop implementation is ready for implementation/API separation via class loaders though.

I patched hadoop 0.20.2 to exclude all libs in lib/server from the job's classpath and moved all non-job-related jars into that server folder in my hadoop installation. That helped somewhat.

Tom White
added a comment - 29/Aug/12 12:24

Here's a proof of concept for isolated classloaders in YARN. This approach uses OSGi for isolation. The idea is that the task JVM uses a Felix container to load the job JAR (which is an OSGi bundle) so that user code can use whichever libraries it likes, even if they conflict with system JARs.

In this example I have created a fictitious library with two incompatible versions. Version 1 is used by the system (in YarnChild) while version 2 is used by the example Mapper. Without isolation the job fails with a java.lang.NoSuchMethodError - regardless of whether the user JARs are first or second on the classpath. When run using isolation, the job succeeds and we can see that both version 1 and version 2 of the library are used:
/tmp/logs//application_1346151477167_0001/container_1346151477167_0001_01_000002/stdout:message 2
/tmp/logs//application_1346151477167_0001/container_1346151477167_0001_01_000002/syslog:2012-08-28 11:58:52,317 INFO [main] org.apache.hadoop.mapred.YarnChild: message 1
To run:
Check out a revision of trunk that doesn't have MAPREDUCE-4068 ('svn up -r 1376252')
Apply the patch
Run 'mvn versions:set -DnewVersion=3.0.0' to change the version numbers to non-SNAPSHOT values, since OSGi doesn't like them.
Build:
(cd hadoop-mapreduce-project/hadoop-mapreduce-examples/lib-v1; mvn install)
(cd hadoop-mapreduce-project/hadoop-mapreduce-examples/lib-v2; mvn install)
mvn clean install -DskipTests
(cd hadoop-mapreduce-project/hadoop-mapreduce-examples/class-isolation-example/; mvn install)
mvn package -Pdist -DskipTests -Dtar
Install the tarball and run
bin/hadoop fs -mkdir -p input
bin/hadoop fs -put /usr/share/dict/words input
bin/hadoop jar ~/.m2/repository/org/apache/hadoop/class-isolation-example/1.0-SNAPSHOT/class-isolation-example-1.0-SNAPSHOT.jar org.apache.hadoop.examples.classisolation.Driver input output
Still to do/future improvements:
Make compatible with MAPREDUCE-4068.
Write a unit test.
Currently only the Mapper is loaded using an OSGi service - extend the approach for all user-defined classes in a MR job.
Use OSGi fragments so that user job JARs don't need a Registrar class, since it would be a part of the host bundle that the job JAR extends.
Write a utility to convert existing job JARs to OSGi bundles (or fragments).

Tom White
added a comment - 04/Sep/12 15:38

New patch with a unit test. The test isn't integrated into the build yet, so you have to build the class-isolation-example module manually first. I've also removed the fictitious libs and instead used Guava as an example of an incompatibility.

Arun C Murthy
added a comment - 04/Sep/12 19:33

Tom, I don't understand the specific advantages of OSGi or Felix, so please pardon some of my questions.

However, with MR being an application in YARN (see MAPREDUCE-4421) we can just add user jars in front of the classpath for the tasks (we already allow it). This isn't the same "Map/Reduce child inherits the TT classpath" problem as in MR1 (actually, even in MR1 you have been able to put child jars ahead in the classpath for a long while now). Given this, do we need to bring in OSGi or Felix, and what else do they provide? Thanks.

Steve Loughran
added a comment - 04/Sep/12 20:45

Arun,

I see where Tom is coming from. Irrespective of how the Hadoop services are deployed, you need to be able to do things like submit jobs from OSGi containers (e.g. Spring & others), which is what this patch appears to offer. And if Oracle finally commits to OSGi now that Java 8 is being redefined, it'd be good for all clients.

I would like to see a way to support this which doesn't put an OSGi JAR on the classpath of everything.

Tom, is there a way to abstract away OSGi support so that it's optional, even if it's a subclass of JobSubmitter? An org.apache.hadoop.mapreduce.osgi.OSGiJobSubmitter could override some new specific protected methods to enable this.

Scott Carey
added a comment - 05/Sep/12 00:19

Putting user jars before/after the application dependencies doesn't actually solve the problem:

- The conflict might require a user jar that is not compatible with one needed by the framework; either order breaks something.
- The user might override a system jar and alter functionality in a way that breaks the framework, or subverts security.

Both the host container and the user code need to be able to be certain of what code they are executing without stepping on each other's toes. This is not possible with one classpath.

Scott Carey
added a comment - 05/Sep/12 00:43

If we are lucky, project Jigsaw will be pulled back into Java 8. According to http://mreinhold.org/blog/late-for-the-train-qa it has not yet been decided.

If it is brought back in, then perhaps we can wait until Java has a module system 1 to 1.5 years from now. If not, I do not think Hadoop can wait until Java 9, sometime 2015 to 2016 ish.

Tom White
added a comment - 05/Sep/12 10:30

Scott makes a good case for why some kind of classloader isolation is needed.

The patch is still a work in progress, but the idea is that the OSGi support is optional - so if you use a regular (non-OSGi) job JAR then it works like it does today, while if your job JAR is an OSGi bundle (basically a JAR with extra headers in the manifest, and possibly some embedded dependencies) then it is loaded in an OSGi container in the task JVM. This allows folks who want to use OSGi to do so while not impacting others. (Hopefully this answers Steve's question.)

From the point of view of this JIRA, OSGi is simply a means to ensure classloader isolation. That means that if Jigsaw became a reality, then we could use that instead or as well. OSGi has many other features, but they are not used for this change. (Note that there are other ongoing efforts to make Hadoop more OSGi-friendly, covered in HADOOP-7977, and while some might be helpful for this JIRA (such as HADOOP-6484), none is required.)

Also, in the future OSGi containers could improve container reuse by providing better isolation between jobs, since bundles can be unloaded, although I haven't spent any time looking at how that would work in the context of MR.

Luke Lu
added a comment - 05/Sep/12 17:55

The conflict might require a user jar that is not compatible with one needed by the framework, either order breaks something

You can always change the client framework and make it work with user code, per job, with class path ordering. There is currently always a way in both Hadoop 1 and 2 to submit a job with arbitrary dependencies, even though it might not be pretty (may require change to client framework).

The user might override a system jar and alter functionality in a way that breaks the framework, or subverts security.

The client framework code can always be changed per job to accommodate new dependencies. MR security is done at the protocol level, i.e. no amount of class path ordering can subvert security.

I agree with Arun that this is a nice-to-have feature to improve usability. Advanced users can already achieve whatever can be achieved (including running an OSGi container) per job.

Scott Carey
added a comment - 06/Sep/12 02:17
You can always change the client framework and make it work with user code, per job, with class path ordering. There is currently always a way in both Hadoop 1 and 2 to submit a job with arbitrary dependencies, even though it might not be pretty (may require change to client framework).
Without a user doing classloader gymnastics and fancy packaging themselves, there is not always a way. A user cannot simply package a jar up and ask hadoop to execute it and expose to the user's execution environment only the public Hadoop API.

Luke Lu
added a comment - 06/Sep/12 06:20

Without a user doing classloader gymnastics and fancy packaging themselves, there is not always a way.

That's an interesting way to say that, except for some ways that would always work, there is not always a way. Using the standard task API to bootstrap an OSGi container is reasonably straightforward.

A user cannot simply package a jar up and ask hadoop to execute it and expose to the user's execution environment only the public Hadoop API.

I do agree that there is a usability issue for certain (and arguably less common) use cases, where a user wants to use dependencies that conflict with the client framework. However, the proposed OSGi approach makes usability worse for the common cases: you'll always need OSGi bundles, which are a form of "fancy packaging", to run your jobs.

A more reasonable (and less heavyweight) solution would not require users to make any change (including adding metadata to their jars) to their existing code.

Tom White
added a comment - 06/Sep/12 16:45

Prompted by this discussion, I had a look at using a classloader approach similar to how servlet containers are implemented. The servlet spec says that classes in the WEB-INF/classes directory and JARs in the WEB-INF/lib directory are loaded in preference to system classes. I found this page about classloading in Jetty useful: http://docs.codehaus.org/display/JETTY/Classloading

The attached patch does a similar thing for the Hadoop task classpath by using a custom classloader for classes instantiated by reflection in MapTask. The unit test from the previous patch passes with this implementation. I think this is worth exploring further.
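A minimal sketch of such a parent-last ("webapp-style") loader follows. The class name and the parent-first prefix list are illustrative assumptions, not what the attached patch necessarily uses; the point is only the lookup order: job JARs first, except for framework and core classes, which always come from the parent so user jars cannot override them:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Child-first URLClassLoader: looks in the job's own JARs before delegating to
// the parent, like WEB-INF/lib. Classes under the "system" prefixes are always
// taken from the parent so user jars cannot shadow the framework itself.
public class JobClassLoader extends URLClassLoader {
    private static final String[] SYSTEM_PREFIXES = { "java.", "javax.", "org.apache.hadoop." };

    public JobClassLoader(URL[] jobUrls, ClassLoader parent) {
        super(jobUrls, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> c = findLoadedClass(name);
            if (c == null && !isSystemClass(name)) {
                try {
                    c = findClass(name); // search the job JARs first
                } catch (ClassNotFoundException e) {
                    // not in the job JARs; fall back to the parent below
                }
            }
            if (c == null) {
                c = super.loadClass(name, false); // normal parent-first delegation
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }

    private static boolean isSystemClass(String name) {
        for (String prefix : SYSTEM_PREFIXES) {
            if (name.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

With this ordering, a job shipping its own Avro or Jackson wins for its own code, while anything under a protected prefix still resolves to the framework's copy.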

Steve Loughran
added a comment - 06/Sep/12 21:50

No, we don't want to go anywhere near servlet classloaders, because you end up in WAR, EAR, and app server trees. The app server takes priority, except in the special case of JBoss in the past, which shared classes across webapps:

https://community.jboss.org/wiki/JBossClassLoaderHistory
http://docs.jboss.org/jbossweb/2.1.x/class-loader-howto.html

People will hit walls when they try to do things like upgrade the XML parser or try to add a new URL handler.

I'll look at the patch, but classloaders are a mine of grief. That's the strength of OSGi: the grief is standardised and someone else has done the grief mining already.

Luke Lu
added a comment - 06/Sep/12 22:39

There is no need to be scared of classloaders, especially for the simple "load only and then exit" scenarios that we're talking about. Most of the class loader issues stem from long running containers that need to dynamically load/unload classes. OSGi is overkill for MR tasks. To be clear, I'm not anti-OSGi, which I think is perfectly fine for managing server-side plugins.

Most of the class loader issues stem from long running containers that need to dynamically load/unload classes.

Also, the case we are talking about does not have the complex classloader trees that app servers have, so there are no sibling class sharing issues. In the task JVM there is only a single user app, so the classloader hierarchy is linear (boot, extension, system, job).

There are a few cases where certain APIs make assumptions about which classloader to use:

The system classloader. For example, URL stream handlers are loaded by the classloader that loaded java.net.URL (boot), or the system classloader. So if a task registered a URL stream handler and it was in the job JAR, then it wouldn't be found since it was loaded by the job classloader, not the system classloader. In this case, the workaround is to implement a factory and call URL.setURLStreamHandlerFactory().

The caller's current classloader. For example, java.util.ResourceBundle uses the caller's current classloader, so if the framework tries to load a bundle then the bundle (e.g. a localization bundle) would not be found if it were in the job JAR, since the system classloader (which loaded the framework class) can't see the job classloader's classes. As it happens, MR counters use resource bundles; however, they explicitly use the context classloader, so this problem doesn't occur (see org.apache.hadoop.mapreduce.util.ResourceBundles). (Also, I imagine the use of resource bundles to localize counter names in the job JAR is very rare.)

The context classloader. For example, JAXP uses the context classloader to load the DocumentBuilderFactory specified in a system property. This case is covered by setting the context classloader to be the job classloader for the duration of the task (my latest patch does this). Most APIs that involve classloaders use the context classloader these days.

So all of these cases can be handled. Also note that by default the job classloader is not used, to enable it you need to set mapreduce.job.isolated.classloader to true for your job.

The latest patch handles the case of embedded lib and classes directories in the JAR, as well as distributed cache files and archives. The unit test passes (and fails with a NoSuchMethodError due to the class incompatibility if mapreduce.job.isolated.classloader is set to false). So I think it is pretty close now - the main thing left to do is sort out the build for the test, which relies on the MR examples module.

Tom White
added a comment - 18/Sep/12 12:45

What I would add is the capability of blacklisting packages. That is, if a package is blacklisted and a class under that package hierarchy is found in the job JARs, the job should fail. This is something available in webapp classloaders to prevent webapps from bundling things like servlet/JSP JARs that would break things. In our case we would blacklist the common/hdfs/yarn/mapred packages and log4j (the factory is a singleton, and if present in the job JARs it will trash the log configuration of Hadoop). I could see other JARs fitting this blacklist, so I'd suggest we have a config property with the list of blacklisted packages.

This isolates MR jobs from Hadoop JARs. I think we should do the same at the YARN level to isolate YARN JARs from AM JARs. Because of this, the JobClassLoader should live in common and probably have a different name, like IsolationClassLoader. It should also receive the blacklist in its constructor.

Alejandro Abdelnur
added a comment - 04/Oct/12 08:32 Nice.

Tom, one thing I forgot to mention in my previous comment: we should see how to enable the classloader on the client side as well, as it may be required (to use different JARs) for the submission code. Maybe as another JIRA.

Also, I don't recall now if it is there or not, but we may want to have a job config property to disable it in case some app runs into funny issues with it.

Alejandro Abdelnur
added a comment - 08/Oct/12 20:48

What I would add is the capability of blacklisting packages.

I think that is a good idea. Servlet containers do this - e.g. system classes in Jetty are always loaded from the parent (http://docs.codehaus.org/display/JETTY/Classloading). Rather than failing the job when a system class is found on the job classpath (as you suggested), I think it would be acceptable to log a warning but load it from the system classpath. I expect the default blacklist would be java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop..
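A Jetty-style system-class check reduces to prefix matching on the fully qualified class name. A minimal sketch, assuming a comma-separated config value like the default list above (the class name here is hypothetical, not from the patch):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative prefix matcher in the spirit of Jetty's "system classes":
// names matching any configured prefix are always delegated to the parent
// (system) classloader rather than loaded from the job classpath.
public class SystemClassFilter {
  private final List<String> systemClasses;

  public SystemClassFilter(String commaSeparated) {
    this.systemClasses = Arrays.asList(commaSeparated.split(","));
  }

  public boolean isSystemClass(String className) {
    for (String prefix : systemClasses) {
      if (className.startsWith(prefix)) {
        return true;
      }
    }
    return false;
  }
}
```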

I think we should do the same at YARN level to isolate YARN JARs from AM JARs. Because of this, the JobClassLoader should be in common and probably have a different name, like IsolationClassLoader.

Other YARN apps might benefit from this work, so perhaps we should add the classloader to YARN (not Common, since HDFS shouldn't need it), and the MR-specific parts would stay in MR, of course.

we should see how to enable the classloader on the client side as well as it may be required (to use different JARs) for the submission code. May be as another JIRA.

I think this is a slightly different problem, since users generally have more control over the JVM they submit from than the JVM the task runs in. So, yes, another JIRA would be appropriate.

Also, don't recall now if it is there or not, we may want o have a job config property to disable it in case some app runs into funny issues with it.

Agreed. It's off by default in the current patch.

Tom White
added a comment - 07/Nov/12 16:50 Thanks for the comments Alejandro.

Tom White
added a comment - 21/Nov/12 11:01 New patch, which moves the classloader to YARN (renamed ApplicationClassLoader) and adds the ability to blacklist system classes, which are never loaded by the application classloader.

Task should get the "APP_CLASSPATH" string from ApplicationConstants.

The test dir logic won't work on Windows if test.build.data isn't set: the "/tmp" default in System.getProperty("test.build.data", "/tmp") should be replaced with System.getProperty("java.io.tmpdir").

In ApplicationClassLoader.loadClass() it looks to me like it is possible to reach the if (c == null) throw ex; clause with c == null && ex == null, if parent.loadClass() returns null. A check for a null ex value (and setting it to something) would avoid this.

The tests should check resource loading too, just to be thorough.

Other than that - with my finite classloader knowledge - it looks good. Someone who understands OSGi should do a quick review too.
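The null-guard suggested above can be sketched as follows. This is an illustrative child-first loader, not the actual ApplicationClassLoader code; the point is the fallback ClassNotFoundException when both c and ex end up null:

```java
// Sketch of child-first loading with an explicit guard against reaching
// "throw ex" while ex is still null.
public class IsolatedClassLoaderSketch extends ClassLoader {
  private final ClassLoader parent;

  public IsolatedClassLoaderSketch(ClassLoader parent) {
    super(parent);
    this.parent = parent;
  }

  @Override
  public Class<?> loadClass(String name) throws ClassNotFoundException {
    Class<?> c = null;
    ClassNotFoundException ex = null;
    try {
      c = findClass(name);            // try the job classpath first
    } catch (ClassNotFoundException e) {
      ex = e;
    }
    if (c == null) {
      try {
        c = parent.loadClass(name);   // fall back to the parent
      } catch (ClassNotFoundException e) {
        if (ex == null) {
          ex = e;
        }
      }
    }
    if (c == null) {
      // Guard: never throw a null if both lookups came back empty-handed.
      throw ex != null ? ex : new ClassNotFoundException(name);
    }
    return c;
  }
}
```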

Steve Loughran
added a comment - 21/Nov/12 12:24

Now that we have a much better way of dealing with dependency conflicts, what will be the fate of the "mapreduce.job.user.classpath.first" feature? Is there any use case where that feature works but the CCL approach doesn't, or where it is somehow preferred over CCL? If none, shall we deprecate it?

Kihwal Lee
added a comment - 04/Dec/12 17:36

Tom White
added a comment - 04/Dec/12 17:46 I don't know of a reason that "mapreduce.job.user.classpath.first" would be preferable to CCL. However, I'd suggest waiting a release or so before deprecating it, so we can see how CCL fares.

What will be the fate of the "mapreduce.job.user.classpath.first" feature?

I think we should still keep it as an "expert" feature, as it can be used to replace the implementation of the job/app classloader itself in rare cases. We probably should print a WARNING when the feature is used. The new job/app classloader behavior can be used as a much saner default.

Luke Lu
added a comment - 04/Dec/12 17:56

Tom, one thing I forgot to mention in my previous comment: we should see how to enable the classloader on the client side as well, as it may be required (to use different JARs) for the submission code.

I think this is a slightly different problem, since users generally have more control over the JVM they submit from than the JVM the task runs in. So, yes, another JIRA would be appropriate.

I think the AM also runs user code, if a custom output format is defined.

Kihwal Lee
added a comment - 07/Dec/12 21:17

Tom White
added a comment - 21/Dec/12 14:34 Kihwal, that's true - thanks for pointing it out. I've modified the patch to take care of that case by setting the classloader for the MRAppMaster (when configured, of course).
I've also created YARN-286 for the YARN part of this patch so it can be committed separately.
This patch is a combined patch so that Jenkins can test it as a whole.

Hi Tom, can you elaborate on why "org.apache.commons.logging.,org.apache.log4j." should be blacklisted? We are trying to do a similar classloader thing for another piece of software, so we want to learn from your experience here.

James Xu
added a comment - 08/Oct/13 07:19

James Xu, the set of packages to blacklist came from Jetty (http://docs.codehaus.org/display/JETTY/Classloading), which is why Commons Logging and Log4j were included. Excluding logging classes prevents inadvertent double initialization of the logging system - once when the task JVM starts and again when the user code is loaded. Note that you can change the system default by setting mapreduce.job.classloader.system.classes.

Tom White
added a comment - 08/Oct/13 12:22