MR279: MRReliabilityTest job fails because of missing job-file.

Description

The ApplicationReport should have the jobFile (e.g. hdfs://localhost:9000/tmp/hadoop-<USER>/mapred/staging/<USER>/.staging/job_201107121640_0001/job.xml)

Without it, jobs such as MRReliabilityTest fail with the following error, because jobFile is hardcoded to "" in TypeConverter.java:
java.lang.IllegalArgumentException: Can not create a Path from an empty string
at org.apache.hadoop.fs.Path.checkPathArg(Path.java:88)
at org.apache.hadoop.fs.Path.<init>(Path.java:96)
at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:445)
at org.apache.hadoop.mapreduce.Cluster.getJobs(Cluster.java:104)
at org.apache.hadoop.mapreduce.Cluster.getAllJobs(Cluster.java:218)
at org.apache.hadoop.mapred.JobClient.getAllJobs(JobClient.java:757)
at org.apache.hadoop.mapred.JobClient.jobsToComplete(JobClient.java:741)
at org.apache.hadoop.mapred.ReliabilityTest.runTest(ReliabilityTest.java:219)
at org.apache.hadoop.mapred.ReliabilityTest.runSleepJobTest(ReliabilityTest.java:133)
at org.apache.hadoop.mapred.ReliabilityTest.run(ReliabilityTest.java:116)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
at org.apache.hadoop.mapred.ReliabilityTest.main(ReliabilityTest.java:504)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
at org.apache.hadoop.test.MapredTestDriver.run(MapredTestDriver.java:111)
at org.apache.hadoop.test.MapredTestDriver.main(MapredTestDriver.java:118)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
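
The top of the trace can be illustrated in isolation. The following is a minimal sketch, not Hadoop code: `checkPathArg` here is a stand-in written for illustration and modelled only on the exception message above, and the class name `EmptyJobFileSketch` is hypothetical.

```java
// Hypothetical stand-in for the path validation seen in the trace above;
// this is not the Hadoop Path class.
final class EmptyJobFileSketch {

    // Rejects null or empty path strings; the empty-string message mirrors the trace.
    static String checkPathArg(String path) {
        if (path == null) {
            throw new IllegalArgumentException("Can not create a Path from a null string");
        }
        if (path.isEmpty()) {
            throw new IllegalArgumentException("Can not create a Path from an empty string");
        }
        return path;
    }

    public static void main(String[] args) {
        String jobFile = ""; // what a hardcoded empty jobFile looks like to the caller
        try {
            checkPathArg(jobFile);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Any caller that turns the job-file string into a path hits this check, which is why an empty jobFile fails as soon as the job list is read.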

Hadoop QA
added a comment - 25/Jul/11 17:54 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12487736/MAPREDUCE-2716-v2.patch
against trunk revision 1150533.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 12 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/501//console
This message is automatically generated.

Vinod Kumar Vavilapalli
added a comment - 26/Jul/11 03:40 Hey Jeffrey, apologies for not noticing this earlier, but this is an incorrect fix. YARN (as opposed to MapReduce) doesn't have any notion of a jobFile. The job-file is a MapReduce concept and so should not be added to ApplicationReport (which is generic).
The correct fix is to do something similar to how ClientServiceDelegate.getJobStatus() constructs the job-file and populates JobStatus.

Jeffrey Naisbitt
added a comment - 28/Jul/11 19:00 New patch that uses the getJobStatus() method to populate the jobFile and other values. MRReliabilityTest gets past the jobfile issue, but there are still some functions that call the fromYarn method that sets the jobFile to "", so I wonder if those need to be changed too.

Vinod Kumar Vavilapalli
added a comment - 29/Jul/11 12:05 That's more like it.
But this patch has one issue in ClientServiceDelegate.getAllJobs(). After getting the app-list from the resource-manager, we also make one more trip per job to the RM via getJobState(JobId oldJobId) (ClientServiceDelegate.java +342 after applying your patch), which is unnecessary.
Instead, we can just get the JobStatus using the already present TypeConverter.fromYarn(ApplicationReport report) and set the job-file on the JobStatus to conf.get("yarn.apps.stagingDir") + "/" + jobId.toString(). Once you do this, you will also not need the new fromYarn(ApplicationReport report, JobStatus status) utility in TypeConverter.
Regarding your other comment about the other places that use TypeConverter.fromYarn(ApplicationReport report): yes, that is a problem. So I propose we change it to TypeConverter.fromYarn(ApplicationReport report, Configuration conf), set the job-file using conf as described above, and use this method everywhere.

Jeffrey Naisbitt
added a comment - 29/Jul/11 21:55 Thanks for the review, Vinod. That makes sense.
Another question/concern though: I see that JobSubmitter.java (submitJobInternal) originally generates the "submitJobDir" from a call to JobSubmissionFiles.getStagingDir and the jobId (where getStagingDir ultimately goes to YARNRunner.getStagingAreaDir(), which goes to the ResourceMgrDelegate and then MRApps). That staging area directory seems to include the user name and ".staging" as well as the value of "yarn.apps.stagingDir". Is that the value we should be using instead of just getting the setting from conf? If so, maybe it would make sense to pass in the actual staging directory instead of the Configuration object.

Vinod Kumar Vavilapalli
added a comment - 03/Aug/11 14:11 Apologies for the delayed reply, I was tied up with one of those bigger patches.
.. that staging area directory seems to include the user name and ".staging" as well as the value from the "yarn.apps.stagingDir"
Good catch, that's correct! In fact, given this, ClientServiceDelegate.getJobStatus() using the key from configuration directly is actually a bug.
To fix this, we should obtain the job-file by simply using the same API from MRApps in both getJobStatus() and getAllJobs(). Or maybe just add another API: MRApps.getJobFile(Configuration conf, String user, JobID jobID)
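
A helper along the lines proposed here might look roughly like the sketch below. It is only an illustration under stated assumptions: a plain `Map` stands in for Hadoop's `Configuration`, the job ID is passed as a string, the class name `JobFileSketch` is hypothetical, and the path layout (staging dir, then user, then `.staging`, then the job ID, then `job.xml`) is taken from the description in this thread rather than from the MRApps source. The staging value `/user` and the user `alice` are example values.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: a plain Map stands in for Hadoop's Configuration, and this
// class and method are hypothetical, not the real MRApps API.
final class JobFileSketch {

    // Configuration key quoted in the discussion above.
    static final String STAGING_DIR_KEY = "yarn.apps.stagingDir";

    // Builds <stagingDir>/<user>/.staging/<jobId>/job.xml.
    static String getJobFile(Map<String, String> conf, String user, String jobId) {
        String stagingDir = conf.get(STAGING_DIR_KEY);
        if (stagingDir == null || stagingDir.isEmpty()) {
            throw new IllegalStateException(STAGING_DIR_KEY + " is not set");
        }
        if (user == null || user.isEmpty()) {
            throw new IllegalArgumentException("user must be set to build a job-file path");
        }
        return stagingDir + "/" + user + "/.staging/" + jobId + "/job.xml";
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(STAGING_DIR_KEY, "/user"); // example value only
        System.out.println(getJobFile(conf, "alice", "job_201107121640_0001"));
    }
}
```

Routing both getJobStatus() and getAllJobs() through one helper of this shape would keep the two code paths from disagreeing about the user and ".staging" components.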

Jeffrey Naisbitt
added a comment - 11/Aug/11 15:12 Aargh... I forgot to add the jobId, and I was assigning the same jobFile to multiple apps/jobs. This patch addresses these issues.
I'm currently using the JobID since that's what fromYarn(applicationId) returns, but I am curious if I should be trying to use JobId instead.

Jeffrey Naisbitt
added a comment - 11/Aug/11 16:50 We found some more bugs in the existing code and new code that are causing the jobFile to use the incorrect user. I will update the patch to address these issues.

Jeffrey Naisbitt
added a comment - 15/Aug/11 21:31 After running some more tests, there are still some problems with the latest patch (v7).
First, the createFakeJobReport calls in ClientServiceDelegate.getJobStatus seem to need a valid jobfile since the MRReliability test uses that codepath and expects a valid jobFile.
Also, I noticed that with adding the ".staging" and user (the way MRApps generates the stagingAreaDir) to the current setting in yarn-default.xml (/tmp/hadoop-yarn/${user.name}/staging), we end up with duplicated information in the path. I get this:
/tmp/hadoop-yarn/<username>/staging/<username>/.staging/job_XXX/job.xml
(Notice the duplicated <username> and the fact that we have both staging and .staging. I'm not sure how this would work when another user's jobfile is requested)
It seems like either the default should not include "<username>/staging", or we shouldn't be adding both later (unless I'm missing something here). For consistency, shouldn't we have the jobfile path & stagingAreaDir match what they were in previous versions?
Looking at a log from a mapreduce job on trunk (not MR-279), I see this:
hdfs://<hostname>/user/<username>/.staging/job_XXXX/job.xml

Jeffrey Naisbitt
added a comment - 18/Aug/11 20:30 I think you can ignore the part above about adding ".staging" and user. The weird part is just because the default setting includes both, but I see on live clusters that they are still using "/user" for that setting (so it will look normal). I still need to fix the issue with createFakeJobReport needing a valid jobFile.

Hadoop QA
added a comment - 26/Aug/11 20:32 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12490454/MAPREDUCE-2716-v6.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/534//console
This message is automatically generated.

Vinod Kumar Vavilapalli
added a comment - 29/Aug/11 15:50 Should replace "mapreduce/mr-client" with hadoop-mapreduce-client in the patch.
Even after that, ClientServiceDelegate changed a lot because of MAPREDUCE-2807, so the patch needs to be regenerated against it.
First, the createFakeJobReport calls in ClientServiceDelegate.getJobStatus seem to need a valid jobfile since the MRReliability test uses that codepath and expects a valid jobFile.
I guess this should be easy to fix, shouldn't it? Or is there some complication? Note that the NotRunningJob class is the new fake report where this change should go.

Vinod Kumar Vavilapalli
added a comment - 02/Sep/11 04:56 Modified the log statements to print the exception trace in the test, and found this:
Call: protocol=org.apache.hadoop.yarn.proto.ClientRMProtocol.ClientRMProtocolService.BlockingInterface, method=getApplicationReport
2011-09-02 10:19:56,878 INFO mapred.ClientServiceDelegate (ClientServiceDelegate.java:invoke(234)) - Failed to contact AM for job job_201103121733_0001 Will retry..
java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:228)
at org.apache.hadoop.mapred.ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:293)
at org.apache.hadoop.mapred.YARNRunner.getJobStatus(YARNRunner.java:534)
at org.apache.hadoop.mapreduce.Cluster.getJob(Cluster.java:153)
at org.apache.hadoop.mapred.TestClientRedirect.testRedirect(TestClientRedirect.java:164)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:35)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:115)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:97)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.maven.surefire.booter.ProviderFactory$ClassLoaderProxy.invoke(ProviderFactory.java:103)
at $Proxy0.invoke(Unknown Source)
at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:150)
at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcess(SurefireStarter.java:74)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:69)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.mapreduce.v2.proto.MRProtos$JobReportProto$Builder.setUser(MRProtos.java:7747)
at org.apache.hadoop.mapreduce.v2.api.records.impl.pb.JobReportPBImpl.setUser(JobReportPBImpl.java:194)
at org.apache.hadoop.mapred.NotRunningJob.getJobReport(NotRunningJob.java:110)
... 37 more

Jeffrey Naisbitt
added a comment - 02/Sep/11 15:47 v8 fixes the issue with TestClientRedirect's infinite loop and adds a null check to make future debugging easier and to prevent the loop in cases like this (where the application does not have the user set).
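
The kind of null check mentioned here could be sketched as follows. This is an assumption-laden illustration, not the v8 patch: `FakeJobReport` is a hypothetical stand-in for the fake report built for jobs that are not running, and the thread does not say whether the real check throws or substitutes a default value.

```java
import java.util.Objects;

// Sketch only: FakeJobReport is a hypothetical stand-in; it is not the actual patch.
final class FakeJobReportSketch {

    static final class FakeJobReport {
        final String user;
        final String jobFile;

        FakeJobReport(String user, String jobFile) {
            // Fail fast with a descriptive message rather than letting a bare
            // NullPointerException surface from deep inside a generated setter.
            this.user = Objects.requireNonNull(user, "application report has no user set");
            this.jobFile = Objects.requireNonNull(jobFile, "job-file must not be null");
        }
    }

    public static void main(String[] args) {
        try {
            new FakeJobReport(null, "/user/alice/.staging/job_1/job.xml");
        } catch (NullPointerException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A message that names the missing field makes the cause visible in the first log line, instead of only in the nested "Caused by" section of a retried invocation.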