mr279 job history handling after killing application

Description

The handling of the job history/application tracking URL during a kill is not consistent. Currently, if you kill a running job, the tracking URL points to job history, but the job history server doesn't have the job.

Eric Payne
added a comment - 06/Oct/11 17:59

It looks like we are talking only about the rare case where the AppMaster dies somehow, right? For failed jobs, the 'Tracking UI' column looks like it is set correctly to point to the job history page for that job.

When a job fails, it is the AM that sends the unregister event to the RM, telling the RM to change the tracking URL. However, in the use case we are addressing, the AM has died. I've looked into alternative ways to get the job history URL for a job in this case, but I think it would involve having other daemons try to recreate the AM's unregister event.

Since this is a narrow use case, I think it is sufficient to just "null-out" the tracking URL, which will cause the scheduler UI to put UNASSIGNED in the 'Tracking UI' column as plain text rather than a link.
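
As a rough illustration of the "null-out" idea, the sketch below models how a 'Tracking UI' cell could be rendered from a tracking URL that may have been cleared. The class name `TrackingUiSketch`, the method `renderTrackingCell`, the HTML shape, and the example URL are all assumptions made for illustration; this is not the actual scheduler UI code.

```java
// Hypothetical sketch; not the actual YARN scheduler UI code.
final class TrackingUiSketch {

    // Returns the content of the 'Tracking UI' cell for one application.
    // A null or empty URL yields plain text, so no stale link is rendered.
    static String renderTrackingCell(String trackingUrl, String label) {
        if (trackingUrl == null || trackingUrl.isEmpty()) {
            return "UNASSIGNED";
        }
        return "<a href=\"" + trackingUrl + "\">" + label + "</a>";
    }

    public static void main(String[] args) {
        // Cleared URL: the cell is plain text.
        System.out.println(renderTrackingCell(null, "History"));
        // Made-up URL: the cell is a link with the given label.
        System.out.println(renderTrackingCell("http://history.example/job_x", "History"));
    }
}
```

The point of the design is that the UI needs no special case for a dead AM: once the URL is cleared, the same rendering path that handles a not-yet-assigned URL covers it.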

Hadoop QA
added a comment - 06/Oct/11 22:09

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12498063/MAPREDUCE-2783.v1.txt
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in .
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/959//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/959//console
This message is automatically generated.

Eric Payne
added a comment - 06/Oct/11 23:18

Fixed event handling within RMAppAttemptImpl to empty out the stale tracking URL field so that the scheduler UI does not point to a stale link. No unit test is feasible.

Manual tests were as follows:

1) Start task (wordcount) and kill -9 MRAppMaster. Result: the scheduler UI shows 'UNASSIGNED' in the 'Tracking UI' column. 'UNASSIGNED' is plain text, not a stale link.
2) Start task (wordcount) and kill using 'bin/mapred job kill'. Result: the scheduler UI shows 'History' in the 'Tracking UI' column. 'History' is a link to the job history page for the killed job.
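
The two manual scenarios can be mimicked with a small, simplified model. This is only a sketch under stated assumptions: `AttemptTrackingSketch` and its methods are hypothetical and do not reflect the real RMAppAttemptImpl state machine, the URLs are made up, and it assumes that a client-side kill still lets the AM unregister and report a final tracking URL.

```java
// Hypothetical, simplified model; not the real RMAppAttemptImpl.
final class AttemptTrackingSketch {

    private String trackingUrl;

    AttemptTrackingSketch(String runningUrl) {
        this.trackingUrl = runningUrl; // URL served by the live AM
    }

    // The AM unregisters and reports its final tracking URL,
    // for example the job history page (assumed for a client-side kill).
    void onAmUnregistered(String finalTrackingUrl) {
        this.trackingUrl = finalTrackingUrl;
    }

    // The AM died without unregistering (e.g. kill -9), so the old URL
    // is stale and is cleared instead of being left in place.
    void onAmExitedWithoutUnregistering() {
        this.trackingUrl = null;
    }

    // What the 'Tracking UI' column would show in this model.
    String trackingUiCell() {
        return trackingUrl == null ? "UNASSIGNED" : trackingUrl;
    }

    public static void main(String[] args) {
        // Scenario 1: AM killed with kill -9.
        AttemptTrackingSketch crashed = new AttemptTrackingSketch("http://am-host.example/");
        crashed.onAmExitedWithoutUnregistering();
        System.out.println(crashed.trackingUiCell());

        // Scenario 2: job killed through the client; the AM unregisters.
        AttemptTrackingSketch killed = new AttemptTrackingSketch("http://am-host.example/");
        killed.onAmUnregistered("http://history.example/job_x");
        System.out.println(killed.trackingUiCell());
    }
}
```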

Vinod Kumar Vavilapalli
added a comment - 07/Oct/11 10:16

The bug related to absent history files for killed jobs got fixed via one of the other patches.
I also manually verified the above behaviour on my single node setup.
The fix for the corner case looks good. +1.