Make yarn-common services robust

Details

Description

Review the yarn common services (CompositeService, AbstractLivelinessMonitor and make their service startup and especially shutdown more robust against out-of-lifecycle invocation and partially complete initialization.

Activity

I'm going to submit an initial patch soon, but want to flag up something first. Should CompositeService attempt to stop() all its child services when told to stop() itself - regardless of what state the it is in? This would be consistent with a philosophy of "support the stop operation regardless of your state".

Without that you can't reliably clean up during failed start operations -which the CompositeService itself tries to do by stopping all child services if any one of them fails to start - including the service that failed.

Steve Loughran
added a comment - 15/Mar/12 21:14 I'm going to submit an initial patch soon, but want to flag up something first. Should CompositeService attempt to stop() all its child services when told to stop() itself - regardless of what state the it is in? This would be consistent with a philosophy of "support the stop operation regardless of your state".
Without that you can't reliably clean up during failed start operations -which the CompositeService itself tries to do by stopping all child services if any one of them fails to start - including the service that failed.

I've looked at the code and written the tests, the CompositeService seems good apart from MAPREDUCE-4016, where I need to verify that sharing of configurations is not a feature people actually wanted. If MAPREDUCE-4015 was included -a YarnException subclass specifically for illegal state changes, CompositeService.init() could be extended to relay this exception up, rather than wrap it.

The patch I'm submitting has merged in the AbstractService changes to aid making service stop operations robust regardless of service state; these were taken from MAPREDUCE-3502. When committed, and MAPREDUCE-4015/MAPREDUCE-4016 resolved, this should complete the yarn-common changes and then specific services can be reviewed.

Steve Loughran
added a comment - 15/Mar/12 21:40 I've looked at the code and written the tests, the CompositeService seems good apart from MAPREDUCE-4016 , where I need to verify that sharing of configurations is not a feature people actually wanted. If MAPREDUCE-4015 was included -a YarnException subclass specifically for illegal state changes, CompositeService.init() could be extended to relay this exception up, rather than wrap it.
The patch I'm submitting has merged in the AbstractService changes to aid making service stop operations robust regardless of service state; these were taken from MAPREDUCE-3502 . When committed, and MAPREDUCE-4015 / MAPREDUCE-4016 resolved, this should complete the yarn-common changes and then specific services can be reviewed.

One of the tests in MAPREDUCE-4014 also demonstrates the current sharing of configurations -setting a property on the root service sets it in all the children. While this could be a feature, it could leak configuration between children.

Steve Loughran
added a comment - 15/Mar/12 21:45 One of the tests in MAPREDUCE-4014 also demonstrates the current sharing of configurations -setting a property on the root service sets it in all the children. While this could be a feature, it could leak configuration between children.
conf.set( "t2" , "should not be shared" );
assertServiceConfigurationContains(svc, "t2" );
assertServiceConfigurationContains(firstChild, "t2" );
assertServiceConfigurationContains(lastChild, "t2" );

Hadoop QA
added a comment - 15/Mar/12 23:06 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12518548/MAPREDUCE-4014.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 8 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.mapred.TestWritableJobConf
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2061//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2061//console
This message is automatically generated.

Steve, apologies for the delayed feedback. I was wondering which parts of MAPREDUCE-3939 are planned for this patch.
"Static methods to choreograph of lifecycle operations" seems to be covered.
Will the others, specifically "AbstractService doesn't prevent duplicate state change requests", "AbstractService state change doesn't defend against race conditions", "state model prevents stopped state being entered if you could not successfully start the service" be separate patches? Definitely looks like they should go in.

state model prevents stopped state being entered if you could not successfully start the service.

I don't believe resources which require an explicit release are meant to be obtained in the init() stage - but that may not always be the case. I'd agree with allowing stop() from any state, as well as a CompositeService attempting to stop all child services when told to stop() (instead of just the ones which have started).
Currently, a failed start() on a composite service will stop() services which had started and move them to STOPPED state, attempt to stop() the failed service but leave it in INITED state, and leave remaining services in INITED state - which doesn't seem correct.

Feedback on this patch

interruptAndJoinThread(Thread target) - could have a joinTimeout parameter as well.

stopIPCServer, stopWebApp, interrupt* - should these be in AbstractService ? or a separate helper class.

Siddharth Seth
added a comment - 28/Jun/12 23:14 Steve, apologies for the delayed feedback. I was wondering which parts of MAPREDUCE-3939 are planned for this patch.
"Static methods to choreograph of lifecycle operations" seems to be covered.
Will the others, specifically "AbstractService doesn't prevent duplicate state change requests", "AbstractService state change doesn't defend against race conditions", "state model prevents stopped state being entered if you could not successfully start the service" be separate patches? Definitely looks like they should go in.
state model prevents stopped state being entered if you could not successfully start the service.
I don't believe resources which require an explicit release are meant to be obtained in the init() stage - but that may not always be the case. I'd agree with allowing stop() from any state, as well as a CompositeService attempting to stop all child services when told to stop() (instead of just the ones which have started).
Currently, a failed start() on a composite service will stop() services which had started and move them to STOPPED state, attempt to stop() the failed service but leave it in INITED state, and leave remaining services in INITED state - which doesn't seem correct.
Feedback on this patch
interruptAndJoinThread(Thread target) - could have a joinTimeout parameter as well.
stopIPCServer, stopWebApp, interrupt* - should these be in AbstractService ? or a separate helper class.
toString() in AbstractService - the text is missing a closing quote.