Core changes in NodeManager to support re-initialization of Containers with new launchContext


Description

This JIRA proposes to modify the ContainerManager (and other core classes) to support upgrading a running container with a new ContainerLaunchContext, as well as the ability to roll back the upgrade if the container is not able to restart using the new launch context.


Arun Suresh
added a comment - 06/Sep/16 23:03

Attaching initial patch based on some offline ideas from Jian He, Vinod Kumar Vavilapalli, and others.

I haven't included the API changes with this patch. I have just added upgradeContainer and commitUpgrade methods to ContainerManagerImpl to test the end-to-end flow via test cases.

The patch assumes the following:

The container is restarted only after ALL the required resources are localized.

If the relaunch of the container with the new bits fails, the container will be rolled back.

Rollback involves reverting to the old launch context and restarting.

It is up to the AM to call commitUpgrade once the container has completed the upgrade, to ensure that if the container fails after the upgrade, it is not rolled back. This is required because, if the container fails for some reason after the upgrade, there is no way to distinguish whether the failure is due to the upgrade or to some other cause.


Jian He
added a comment - 07/Sep/16 17:08

Thanks Arun, some questions on the API:

The COMMIT_UPGRADE API: I don't quite get the necessity of this API. Could you explain under what scenario the user should call this API?

The ROLLBACK_UPGRADE API: I think it should be possible to roll back to any previous version, rather than only the immediately previous one. In some sense, it's the same as upgrade. IMHO, we can probably use one API, restartContainer(context), for both upgrade and downgrade; if the upgrade fails, the user should be notified, and the AM (in our case Slider) can, on the user's instruction, choose to restartContainer with a different version.

Also, forcing containers to be restarted with the previous version if an upgrade fails may not be suitable in all cases: containers are often upgraded in waves, and the user may want to troubleshoot the failure first before triggering a new wave of restarts.

IMO, as a first-cut implementation, we can fail the container if the upgrade fails. We can later add retry, rollback, or release-the-container as a RetryPolicy on failure. Your opinion?
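Jian He's idea of a single entry point for both directions could be sketched roughly as follows. This is a minimal, self-contained illustration of the proposal, not YARN code: ContainerHandle and the string contexts are hypothetical stand-ins for a container and its ContainerLaunchContext, and the version bookkeeping is deliberately kept on the AM side as suggested.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: one restartContainer(context) call serves both
// upgrade and downgrade; the AM (e.g. Slider) keeps the version history.
class ContainerHandle {
    private String activeContext;           // stands in for ContainerLaunchContext
    private final Deque<String> amHistory = new ArrayDeque<>(); // AM-side, not NM-side

    ContainerHandle(String initialContext) {
        this.activeContext = initialContext;
    }

    // Single API for upgrade and downgrade alike.
    void restartContainer(String newContext) {
        amHistory.push(activeContext);      // AM remembers prior versions
        activeContext = newContext;         // NM only tracks the active one
    }

    // Downgrade is just a restart with a context the AM kept around.
    void downgradeToPrevious() {
        restartContainer(amHistory.pop());
    }

    String activeContext() { return activeContext; }
}
```

Under this model a failed upgrade would surface to the AM, which could then issue another restartContainer with whichever prior context it chooses.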


Arun Suresh
added a comment - 07/Sep/16 21:07 - edited

Thanks for the review Jian He.

> The COMMIT_UPGRADE API: I don't quite get the necessity of this API. Could you explain under what scenario the user should call this API?

Consider an AM that upgrades a container with a new binary, and the process is subsequently restarted. Now, after say around 10 minutes, the process dies. There is no way for the NM to know whether the process died because of the upgrade (a memory leak?) or due to some transient failure, and therefore it cannot decide whether to retry the process or roll back the upgrade. Only the AM knows whether the upgrade is actually successful. Essentially, the commit API should be used by the AM to notify the NM that the upgrade is fine and that any subsequent failure can be handled by the existing retry policy, AFTER it has performed some upgrade diagnostics on the container. We can provide an autoCommit convenience method that clubs upgrade + commit, but I feel it is important that we keep the explicit commit API.

> The ROLLBACK_UPGRADE API: I think it should be possible to roll back to any previous version, rather than only the immediately previous one. In some sense, it's the same as upgrade.

I agree the AM should be able to move to any previous version, but:

I feel the versioning should NOT be managed by the NM, since a) the launch context is provided and managed by the AM, so the AM should take care of tying the context to the version, and b) there are (possibly huge) storage implications the NM would have to deal with to keep track of all the earlier versions.

It should not be called rollback. The AM should call upgradeContainer(launchContext) with some previous context.

> IMHO, we can probably use one API restartContainer(context) for both upgrade and downgrade.

I agree that both rollback (explicit rollback via API) and upgrade can be implemented as wrappers over restartContainer(launchContext). But in my opinion rollback should not be provided with an explicit launchContext; it should always be the immediately previous context.

> Also, forcing containers to be restarted with the previous version if an upgrade fails may not be suitable in all cases. The user may want to troubleshoot the failure first before triggering a new wave of restarts.

Agreed. I can include an UpgradePolicy which allows users to terminate or roll back (implicit rollback) on failure. COMMIT is also useful here if the user wants to verify that one wave has successfully upgraded, commit the upgrade in those instances, and then move on to the next wave.

> IMO, as a first-cut implementation, we can fail the container if the upgrade fails. We can later add retry, rollback, or release-the-container as a RetryPolicy on failure. Your opinion?

Yup, will include a policy as I mentioned above. I don't think retry makes sense, though.
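The commit semantics Arun argues for could be modeled as follows. This is a hypothetical sketch of the decision rule, not the real NM classes; UpgradeTracker and Action are illustrative names.

```java
// Hypothetical sketch of the commit decision: before commitUpgrade, a
// container failure triggers rollback; after commit, the normal retry
// policy applies. This models the reasoning, not the actual NM code.
class UpgradeTracker {
    enum Action { ROLLBACK, APPLY_RETRY_POLICY }

    private boolean upgraded;
    private boolean committed;

    void upgradeContainer() { upgraded = true; committed = false; }
    void commitUpgrade()    { if (upgraded) committed = true; }

    // What should the NM do when the container process dies?
    Action onContainerFailure() {
        // Uncommitted upgrade: the failure may be caused by the new bits,
        // so revert to the previous launch context.
        if (upgraded && !committed) {
            return Action.ROLLBACK;
        }
        // Committed (or never upgraded): treat it as a runtime failure.
        return Action.APPLY_RETRY_POLICY;
    }
}
```

An autoCommit convenience, as mentioned above, would simply call commitUpgrade immediately after upgradeContainer.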


Wangda Tan
added a comment - 07/Sep/16 22:28

Arun Suresh,

In YARN-5221 we have merged updating container resource and execution type. I think it's better to merge the API on the NM side as well. We have updated the NM for container resizing in YARN-1643/YARN-1449, and there's one pending ticket for updating cgroups: YARN-4166.

I would suggest having:

A unified API in the AM-NM protocol to update a container
A unified implementation inside to update the container manager / cgroups

to avoid making the API/implementation fragmented. Thoughts?


Arun Suresh
added a comment - 07/Sep/16 22:42 - edited

Wangda Tan, I feel that YARN-5221 and this are orthogonal.

YARN-5221 deals with updating a container allocation: properties of the container that are of interest to the Scheduler for allocating the resource request, like resource size and ExecutionType.

This JIRA pertains to upgrade/rollback of the container process: properties that are of interest only to the NM for running the container, which are encapsulated in the ContainerLaunchContext and the resources that need to be localized. The ContainerLaunchContext and LocalResources are not even known to the RM/Scheduler.

With regard to the AM-NM protocol, I feel we should keep the two different:

One for updating the container allocation, which is handled by YARN-5221. Maybe we should rename those to updateContainerAllocation rather than just updateContainer?

One for updating the container process, which this JIRA proposes to call upgrade and rollback.

Wangda Tan
added a comment - 07/Sep/16 22:53

Arun Suresh,

I still think the two APIs should be merged. YARN-5221 updates properties of the container in the RM, like resource / execution type, so I think we should have a symmetric API in the NM to update properties of the container in the NM. To me, localized resources, resource size, execution type, and any fields in the ContainerLaunchContext are all properties of a container in the NM, and we should be able to use a unified API to update them.

And should we expose a rollback API to the application?

Just had an offline chat with Arun Suresh. Since this is an implementation-only change, and a container change (like updating the binary) is orthogonal to an allocation change (like updating resources), I'm OK with the overall approach.


Jian He
added a comment - 08/Sep/16 05:03 - edited

Arun Suresh, thanks for the explanation.

> Only the AM knows if the upgrade is actually successful.

How does the AM determine whether the upgrade is successful (i.e., what kind of signal should the AM depend on)? I feel that once the container starts running, even for the AM it's hard to distinguish whether a failure is caused by the upgrade or by the runtime. IMO, if the container fails to launch on upgrade, it should be considered an upgrade failure; once the container starts running, if the container then fails, it can be considered a runtime failure. If the user does want to roll back, the user calls the upgradeContainer/rollback command again to roll back.

> But, in my opinion rollback should not be provided with an explicit launchContext, it should always be the just-previous context.

I also agree the AM can take care of tying the context to the version. In our case, the Slider AM (also YARN code) will have the prior context and call upgradeContainer with the corresponding context, so the NM does not need to remember the prior context.

I think upgrade by itself is enough work for a single JIRA, with enough corner cases, and we have consensus on that. Could you separate the patch to include only the upgrade piece in this JIRA? That also makes review easier.


Arun Suresh
added a comment - 08/Sep/16 06:14 - edited

Jian He, as per your suggestion, I am uploading a patch with just the restart-container piece, for your review convenience. I renamed it reInitialize to signify that the restart depends on the container being re-initialized with new bits.

But, as per my previous comments, I do believe that we should not expose an upgrade without a rollback to the just-previous launch context (both implicit, based on a failure policy, as well as an explicit rollback API).

I would thus prefer to update the same JIRA with the rollback and commit calls (once you are satisfied with the restart flow) rather than open separate JIRAs.

> the Slider AM (also YARN code) will have the prior context and call upgradeContainer with the corresponding context, so the NM does not need to remember the prior context.

Hmmmm... I still believe rollback to the just-prior version should be supported by the NM, and for rolling upgrades, at least in the production environments I have had experience with, it is an absolute requirement. The AM (Slider in our case) can subsequently reinitialize to any version it chooses later on if it wants.
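Arun's position, that the NM remembers only the immediately previous context, could be sketched like this. A hypothetical, simplified model: NmContainerState and the string contexts are illustrative names, not the actual ContainerImpl code.

```java
// Hypothetical sketch: the NM stores only the single previous launch
// context, so implicit rollback is cheap and bounded. Any deeper version
// history stays on the AM side, as discussed above.
class NmContainerState {
    private String currentContext;   // stands in for ContainerLaunchContext
    private String previousContext;  // single slot: the just-prior version only

    NmContainerState(String initialContext) {
        this.currentContext = initialContext;
    }

    void reInitialize(String newContext) {
        previousContext = currentContext; // overwrite: no unbounded history
        currentContext = newContext;
    }

    // Rollback takes no context argument: always the just-previous one.
    void rollback() {
        if (previousContext == null) {
            throw new IllegalStateException("nothing to roll back to");
        }
        currentContext = previousContext;
        previousContext = null;           // a rollback cannot itself be rolled back
    }

    String currentContext() { return currentContext; }
}
```

The single slot addresses the storage concern raised earlier: the NM never tracks more than one prior version.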


Varun Vasudev
added a comment - 08/Sep/16 11:08

Thanks for the patch Arun Suresh!

1)
- containermanager.container.ContainerState.RUNNING)) {
+ containermanager.container.ContainerState.RUNNING)
+     || container.isReInitializing()) {
Minor nit: can we add a function called container.isRunning() and move the running-state check into that function? Then the condition becomes container.isRunning() || container.isReInitializing().

2)
+ private void preUpgradeCheck(ContainerId containerId, String op)
Maybe switch to an enum instead of String for the op?

3)
+ container.launchContext = container.reInitContext.newLaunchContext;
+ container.resourceSet.merge(container.reInitContext.resourceSet);
+
+ container.sendLaunchEvent();
Instead of a launch event we should send a relaunch event; the relaunch takes care of trying to run in the same work dir as the earlier attempt, etc.

4)
+ public void transition(ContainerImpl container, ContainerEvent event) {
+     container.reInitContext = createReInitContext(container, event);
Should there be a guard against calling reinit if a reinit is already in progress? Could we end up with the ReInitContext in an odd state?

5)
+ List<String> l = resourceSet.resourceLocalized(
+     rsrcEvent.getResource(), rsrcEvent.getLocation());
+ if (l != null) {
+     links.addAll(l);
+ }
Do we need to de-dup here? It's possible that the same link gets added twice.

6)
> How does the AM determine whether the upgrade is successful? ... If the user does want to roll back, the user calls the upgradeContainer/rollback command again to roll back.

I think both Jian He and Arun Suresh raise valid points. I think an explicit commit API (with auto-commit being the default option) should satisfy both use cases.
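Varun's de-dup concern (point 5) could be addressed with an order-preserving set. A minimal standalone sketch; LinkCollector is a hypothetical name, and the link strings are illustrative, not real localizer output.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: collect symlink names into a LinkedHashSet so a
// link reported twice (e.g. once for the running container's resource set
// and once for the re-init resource set) is only recorded once, while
// preserving the order in which links were first seen.
class LinkCollector {
    private final Set<String> links = new LinkedHashSet<>();

    void addLinks(List<String> localized) {
        if (localized != null) {   // mirrors the null check in the patch snippet
            links.addAll(localized);
        }
    }

    List<String> uniqueLinks() {
        return new ArrayList<>(links);
    }
}
```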


Jian He
added a comment - 08/Sep/16 14:24

Thanks Arun, a few comments on the upgrade flow:

The addResource had better be done in the upgrade API itself, so that if a passed local resource is not valid, the API can throw an exception directly and the AM gets notified. Right now, the failure is not surfaced to the AM.

Map<LocalResourceVisibility, Collection<LocalResourceRequest>> req =
    container.reInitContext.resourceSet.addResources(getResourcesToLocalize(event));

While the AM issues the upgrade command, the container could exit with success or failure. In this case, should we still continue the upgrade process, i.e., restart the container with the new launch context? After all, we wanted the old process to exit anyway.
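Jian's first point, validating resources synchronously inside the upgrade call so the AM sees the error immediately, could be sketched as follows. UpgradeApi and the string URIs are hypothetical stand-ins for the real request types; only the fail-fast shape is the point.

```java
import java.util.List;

// Hypothetical sketch: validate the new resources inside the upgrade API
// itself, so an invalid resource is reported to the caller (the AM) as a
// thrown exception, rather than being swallowed later by an asynchronous
// localization event handler.
class UpgradeApi {
    static void upgradeContainer(List<String> resourceUris) {
        // Fail fast, before any asynchronous localization is kicked off.
        for (String uri : resourceUris) {
            if (uri == null || uri.isEmpty()) {
                throw new IllegalArgumentException(
                    "invalid local resource in upgrade request: " + uri);
            }
        }
        // ...only now hand the request off to the localization service...
    }
}
```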


Arun Suresh
added a comment - 08/Sep/16 16:50

Uploading a patch addressing most of Varun Vasudev's and Jian He's suggestions. Thanks for the comments!

Varun Vasudev,

> Should there be a guard against calling reinit if a reinit is already in progress? Could we end up with the ReInitContext in an odd state?

There is already a guard in the ContainerManager API, but I have included an additional check in the transition in the new patch, as per your suggestion.

> Instead of a launch event we should send a relaunch event; the relaunch takes care of trying to run in the same work dir as the earlier attempt, etc.

I actually tried using relaunch initially, but it looks like the pid has to be running for the relaunch to work correctly. Also, it looks like we would need an intermediate state there too, which would result in the same (or a larger) amount of code change. I would actually prefer to use launch itself, since I am more confident of how it works. I have also updated the test case to verify that the upgraded container has access to, and is able to read, files created by the previous process in the working directory.

> I think an explicit commit API (with auto-commit being the default option) should satisfy both use cases.

Thanks, will update the patch with it once we agree that the reinit flow is fine.

Jian He,

> While the AM issues the upgrade command, the container could exit with success or failure. In this case, should we still continue the upgrade process?

I am nullifying the reInitContext in the event of an explicit kill, or if the process completed successfully during the reinit; the upgrade should thus be cancelled. Do take a look at the latest patch and let me know if you think I've covered all the cases.
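Arun's cancellation rule, that an explicit kill or a clean exit during re-init clears the pending upgrade, could be modeled as follows. A hypothetical sketch; ReInitLifecycle is an illustrative name and the exit-code convention is an assumption, not the patch itself.

```java
// Hypothetical sketch: a pending re-init is represented by a non-null
// context; an explicit kill, or a successful exit while the re-init is
// pending, clears it, which cancels the upgrade.
class ReInitLifecycle {
    private String reInitContext;   // non-null while an upgrade is pending

    void startReInit(String newContext) { reInitContext = newContext; }

    void onExplicitKill() { reInitContext = null; }

    void onProcessExit(int exitCode) {
        if (exitCode == 0) {        // clean exit during re-init: cancel upgrade
            reInitContext = null;
        }
    }

    boolean upgradePending() { return reInitContext != null; }
}
```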


Jian He
added a comment - 09/Sep/16 16:20

Thanks Arun, some more comments and questions:

The reInitContext is updated asynchronously via an event, but here it's being checked synchronously in the upgrade API:

if (!container.isRunning() || container.isReInitializing()) {

Also, if it's ignored here, then it appears to the AM that the upgrade call was somehow ignored, and the user who issued the upgrade command will be confused:

if (container.reInitContext != null) {
    container.addDiagnostics("Container [" + container.getContainerId()
        + "] ReInitialization already in progress !!");
    return;
}

Overall, I think we can move the logic of ReInitializeContainerTransition to the upgrade API. All it does is resend the events to the containerLauncher or ResourceLocalizationService, which can be done in the API. This also has the benefit of rejecting the upgrade call while the previous upgrade is in progress. The current solution has a race condition: if a previous upgrade is in progress, the second one may be ignored instead of rejected, and the user will not be notified that the previous upgrade is in progress. One other potential race is that the relocalize API has a chance to go through instead of being rejected while upgrading: since the reInitContext is updated asynchronously, those requested resources will then be considered upgrade resources.

Why is the checkAndUpdatePending method needed?

checkAndUpdatePending(rsrcEvent, container.resourceSet, links);
if (container.isReInitializing()) {
    checkAndUpdatePending(
        rsrcEvent, container.reInitContext.resourceSet, links);
}

Why do we set the reInitContext to null once resource localization fails?

if (container.isReInitializing() &&
    container.reInitContext.resourceSet.getPendingResources()
        .containsKey(failedEvent.getResource())) {
    LOG.error("Container [" + container.getContainerId() + "] Re-init" +
        " failed !! Resource [" + failedEvent.getResource() + "] could" +
        " not be localized !!");
    container.reInitContext = null;
}

In ResourceLocalizedWhileRunningTransition, the symlink-creation part is not needed for reinit, because it will be done as part of the container launch.

Given so many if (reinitializing) conditions in ContainerImpl, should we consider adding a new state?

When launching the container, we need to cleanupPreviousContainerFiles as done in ContainerRelaunch, right?


Arun Suresh
added a comment - 09/Sep/16 18:47 - edited Jian He , thanks for thoughtful comments.
Also, if it’s ignored here, then it appears to AM that the upgrade call is somehow ignored
So in the ReInitializeContainerTransition , I do the following
if (container.reInitContext != null) {
container.addDiagnostics("Container [" + container.getContainerId()
+ "] ReInitialization already in progress !!");
return;
}
So a new upgrade is not ignored..
I think we can move the logic of ReInitializeContainerTransition to the upgrade API ? All it does is to resend the events to containerLauncher or ResourceLocalizationService , which can be done in the API. Also, this has the benefit of rejecting the upgrade call while the previous upgrade is in_progress.
Actually, in one of my initial versions, I was setting the reInitContext on the container object in the upgrade API itself. But then I realized that all state change on the Container is affected via a state machine transition and I did not want to break that. I think we should keep the state change of a Container limited to state machine transitions as the code is cleaner and we can guard against other synchronization issues.
Also, I think we should have a consistent way to letting the AM know about errors. I prefer we use the diagnostic messages (with maybe an error code) to let the AM know if failures. I guess you are already doing it for Failed Localization. This way, the checks and enforcements are done only in transitions. I agree the AM will not know immediately of an error, but subsequent 'getContainerStatus' calls should notify the AM.
One other potential race is that the relocalize API has a chance to go through instead of rejected, while upgrading,
I understand.. I was actually thinking we have a set of reInitializingContainerIds in the NM context. We can check if the containerId is present in the set in the upgrade and relocalize APIs, and add it to the set if not. We then remove it from the set once the Container moves from LOCALIZED to RUNNING (this signifies that reinitialization is complete). Thoughts?
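The guard proposed above can be sketched as a small self-contained class. The class and method names below are illustrative assumptions for this discussion, not the actual NMContext API:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Minimal sketch of the proposed guard: a concurrent set of container ids
 * that are currently re-initializing. All names here are assumptions for
 * illustration, not the real YARN NMContext interface.
 */
public class ReInitGuard {
  private final Set<String> reInitializingContainerIds =
      ConcurrentHashMap.newKeySet();

  /**
   * Atomically claim a container for re-initialization; returns false if a
   * re-init is already in progress, so the upgrade/relocalize API can
   * reject the call.
   */
  public boolean startReInitialization(String containerId) {
    return reInitializingContainerIds.add(containerId);
  }

  /** Called once the container moves from LOCALIZED back to RUNNING. */
  public void completeReInitialization(String containerId) {
    reInitializingContainerIds.remove(containerId);
  }

  public boolean isReInitializing(String containerId) {
    return reInitializingContainerIds.contains(containerId);
  }
}
```

Using Set.add's boolean return gives the check-and-add in one atomic step, which avoids the separate contains-then-add race the thread is worried about.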
Given so many if(reinitializing) conditions in containerImpl, should we consider adding a new state?
I tried that at first... but I preferred having more if-thens over extra Container states. I can provide a version with different container states though...
when launching the container, we need to cleanupPreviousContainerFiles as done in ContainerRelaunch, right?
This is actually something I wanted to discuss. Do you think we should clean up the old script file? If the upgrade uses the same script file name, it will be overwritten, right? The token file is anyway overwritten, right?

Arun Suresh
added a comment - 09/Sep/16 19:00 Missed this:
why do we set the reInitContext to be null if once resource localization failed ?
So once the localization required for the reinit is complete, a CONTAINER_CLEANUP event is sent to the ContainerLaunch. The ContainerLaunch can only interpret this as FORCE_KILLED or TERMINATED, in which case the only event it can fire is CONTAINER_KILLED_ON_REQUEST, and the only transition that handles this while the container is in the RUNNING state is the KilledExternallyTransition. The KilledExternallyTransition needs to know if the kill needs to result in a reinit or not... which it does by checking if reInitContext is null.
I understand we can maybe avoid this with a new RE_INIT state (and thereby do away with the if check). But I have a feeling the state machine might get more complicated... I will give the refactor a shot.
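The branching described above can be sketched in isolation. All names below are illustrative stand-ins; the real logic lives in ContainerImpl's KilledExternallyTransition and operates on the container state machine:

```java
/**
 * Sketch of how the kill handler decides between a re-init relaunch and a
 * genuine failure by checking whether a reInitContext is pending. This is
 * an illustration of the idea, not the actual ContainerImpl code.
 */
public class KilledExternallySketch {

  /** Stand-in for the container's pending re-initialization context. */
  static class ReInitContext { }

  enum Outcome { RELAUNCH_WITH_NEW_CONTEXT, EXITED_WITH_FAILURE }

  static Outcome onKilledOnRequest(ReInitContext reInitContext) {
    if (reInitContext != null) {
      // The kill was requested precisely so the container can restart
      // with the new launch context.
      return Outcome.RELAUNCH_WITH_NEW_CONTEXT;
    }
    // No re-init pending: this was a genuine external kill.
    return Outcome.EXITED_WITH_FAILURE;
  }
}
```

This also shows why clearing reInitContext on localization failure matters: once it is null, a subsequent kill is treated as a plain failure rather than a relaunch.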

Arun Suresh
added a comment - 11/Sep/16 00:31 Uploading patch addressing Jian He's suggestions.
Refactored to use a new REINITIALIZING state.
Handled race conditions to properly disallow relocalization and reinitialization while a container is undergoing reinitialization.

Jian He
added a comment - 12/Sep/16 05:38 Arun, thanks for updating! Looks good to me overall; a few more comments:
Unused parameter container in createReInitContext
We can use the ResourceSet#getAllResourcesByVisibility method instead, and so the getLocalPendingRequests method and the new constructor in ContainerLocalizationRequestEvent is not needed
Collection<LocalResourceRequest> reqs = getLocalPendingRequests(event);
if (!reqs.isEmpty()) {
  container.dispatcher.getEventHandler().handle(
      new ContainerLocalizationRequestEvent(container, reqs));
In ResourceLocalizedWhileReInitTransition, why is checkAndUpdatePending needed? Why do we need to pass in a hashset to store the links? I think the symlinks should always be distinct. I wonder if we can just call resourceSet.resourceLocalized.
The newly added interfaces (startReInitialization/completeReInitialization/isReInitializing) in NMContext: thinking if we can have a flag in ContainerImpl for this.. the advantage is that it won't cause an object leak if for any reason we forget to remove the container from the set. Else, we can expose a single getReinitializingContainers() in the NMContext instead of three, as a getter API conforms more with the other APIs in the NMContext.
We lose the tracking of the failed resource if we set the reInitContext to null? Probably we should add the failed resource to ContainerImpl.resourceSet:
LOG.error("Container [" + container.getContainerId() + "] Re-init" +
    " failed !! Resource [" + failedEvent.getResource() + "] could" +
    " not be localized !!");
container.reInitContext = null;
Looks like we don’t need the killedForReInitialization flag in ContainerLaunch, because the container_killed event can already be distinguished based on whether the container is in the REINITIALIZING or RUNNING state.
nit: fix the format of the new imports in ContainerManagerImpl
the preUpgradeCheck method can also be reused by the localize API.
The tests look good, only some very minor things:
could you add comments to the prepareInitialContainer/prepareContainerUpgrade methods about what they are doing? That helps people understand without reading through the code
"Wait for new processStartfile to be creases" has a typo: it should be "to be created"
Do you think we should cleanup the old script file ? If the upgrade uses the same script file name, it will be overwritten right ? the token file is anyway overwritten right ?
I agree we don't need to clean up old files, if it can be overwritten.

Arun Suresh
added a comment - 12/Sep/16 05:53 Thanks Jian He ..
Will update patch shortly with your suggestions. But, with regard to this :
Looks like we don’t need the killedForReInitialization flag in ContainerLaunch, because container_killed event can already be distinguished based on whether container is at Reinit or running state.
We actually do need this flag.. since even while the container is re-initing (while resources are localizing but before the container_cleanup signal is sent), the container should be killable explicitly via an external signal. The ContainerLaunch should be able to tell the difference, and after killing, it should fire either ContainerEventType.CONTAINER_KILLED_ON_REQUEST or ContainerEventType.CONTAINER_KILLED_FOR_REINIT. That reminds me... I have to add the following transition as well:
.addTransition(ContainerState.REINITIALIZING,
    ContainerState.EXITED_WITH_FAILURE,
    ContainerEventType.CONTAINER_KILLED_ON_REQUEST,
    new KilledExternallyTransition())

Jian He
added a comment - 12/Sep/16 06:52 the container should be killable explicitly via an external signal.
IIUC, in this case, the ContainerImpl will receive the KILL event first and move to the KILLING state, and the CONTAINER_KILLED_ON_REQUEST will be sent to the container at KILLING state.

Jian He
added a comment - 12/Sep/16 07:08 One more thing about the test.. In testContainerUpgradeSuccess, could you make newStartFile a new upgrade resource, and verify the output is written into it, this verifies the part about the localization part as well.

Arun Suresh
added a comment - 12/Sep/16 07:34 - edited Updating patch.
Addressing Jian He 's latest comments
some javadoc, checkstyle and javac fixes
IIUC, in this case, the ContainerImpl will receive the KILL event first and move to the KILLING state, and the CONTAINER_KILLED_ON_REQUEST will be sent to the container at KILLING state..
It goes to the KILLING state only if the AM explicitly sends a kill signal or the RM asks the NM to kill. It is also possible that an admin logs into the NM and does a 'kill -9', which will also cause the ContainerLaunch to send CONTAINER_KILLED_ON_REQUEST, but it won't be in the KILLING state.. right?
..In testContainerUpgradeSuccess, could you make newStartFile a new upgrade resource, and verify the output is written into it, this verifies the part about the localization part as well.
Actually, if you look at the prepareContainerUpgrade() function, we create a new script file scriptFile_new which is passed into the prepareContainerLaunchContext() function, which associates the new file with a new dest_file_new location.. this should verify that the upgrade needed a new localized resource. The output of the script is also written to a new start_file_n.txt, which we read and verify to check if the new process has actually started.
Also by the way:
We can use the ResourceSet#getAllResourcesByVisibility method instead, and so the getLocalPendingRequests method and the new constructor in ContainerLocalizationRequestEvent is not needed
The problem with getAllResourcesByVisibility is that it gets all resources. I just need the pending resources... So if you are ok with it, I'd like to keep it as is..

Jian He
added a comment - 12/Sep/16 09:02 - edited It is also possible that the an admin logs into the NM and does a 'kill -9' which will also cause the ContainerLaunch to send CONTAINER_KILLED_ON_REQUEST but it wont be in KILLING state.. right ?
I guess in this case it’s also fine to do the upgrade… because the upgrade API did accept it, and it’s hard to distinguish which one should go first.. It's also likely that the reverse can happen, because it's transient: if setKillForReInitialization is called first and then the container process is killed, it will be considered a re-init even though it was killed by an external signal. So keep it consistent?
Actually if you look at the prepareContainerUpgrade() function,
Ah, yes, I misread it. Thank you!
The problem with getAllResourcesByVisibility, is it gets all resources. I just need the pending resources
In this case, the pendingResources is the same as getAllResourcesByVisibility, right? Basically, I meant something like below, so the newly added methods would not be needed:
Map<LocalResourceVisibility, Collection<LocalResourceRequest>>
    pendingResources = ((ContainerReInitEvent) event).getResourceSet()
        .getAllResourcesByVisibility();
if (!pendingResources.isEmpty()) {
  container.dispatcher.getEventHandler().handle(
      new ContainerLocalizationRequestEvent(container, pendingResources));
} else {
Forgot to say: similarly, is the change in ResourceLocalizedWhileRunningTransition required, as the symlinks are also already distinct?

Arun Suresh
added a comment - 12/Sep/16 14:07 Thanks Jian He .. Uploading patch (v011) with the changes.
I left the CLEANUP_CONTAINER_FOR_REINIT there, even though it does the same thing as CLEANUP_CONTAINER. Since it is sent by a different source, it can be used for debugging etc.

Jian He
added a comment - 12/Sep/16 14:39 Arun, thank you very much for the prompt response..
I think you forgot to remove the unused checkAndUpdatePending method; given that we anyway need one more patch, a few nits:
reInitContext does not need to be volatile
remove unused imports in ContainerLocalizationRequestEvent
The latest patch looks good to me.
Varun Vasudev, want to take a look?

Jian He
added a comment - 13/Sep/16 09:04
Arun Suresh , found a couple more issues while looking at the other patch:
Not sure if I missed something, but it looks like the ReInitializationContext#resourceSet is never initialized with the pending resources?
Also, the symlinks for the upgraded resources are not overridden correctly.
we probably need to change this to be a map of (symlink -> localizedPath) so that the same symlink can be overwritten to point to the new resource. If this involves a chain of caller changes, it can be done separately.
private Map<Path, List<String>> localizedResources =
    new ConcurrentHashMap<>();

Arun Suresh
added a comment - 13/Sep/16 13:11 Jian He , thanks for checking.. Please find updated patch (v013)...
ReInitializationContext#resourceSet is never initialized with the pending resources ?
Fixed this.. looks like the test case was passing because the localization request was fired from the resourceSet in the event, so it did get localized.
Also, the symlinks for the upgraded resources are not overridden correctly. we probably need to change this to be a map of (symlink -> localizedPath)
Can you take a look at the changes to ResourceSet in the latest patch? The mergeWith function merges the upgraded resource set with the existing one and creates a new ResourceSet. It takes care of overwriting the symlinks.
I changed the localizedResources map from <Path, List<Links>> to <Path, Set<Links>> since it is easier to de-dup.. like you said, I kept the call chain intact (we can handle it in another JIRA) by converting it back to <Path, List<Links>> for the callers. Also, this happens only once when the container is started and on every upgrade.. so I am guessing it should be fine.
Also.. I pulled in a change from YARN-5637 here. Essentially, if the LocalResourceTracker gets a REQUEST for a resource that already exists, it just ignores the request.. I modified it slightly to ensure that the container receives a Resource Localized message even if the resource exists, since re-initialization can be kicked off only after the container receives a Resource Localized message for ALL its required resources.
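The overwrite semantics of the merge can be sketched with plain maps of symlink to localized path (mirroring the later change to Map<String, Path>). This is an illustration of the idea under assumed names, not the actual ResourceSet.merge signature:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch of merging resource sets during upgrade: later sets
 * win, so an upgraded resource registered under the same symlink replaces
 * the original. Not the actual ResourceSet API.
 */
public class ResourceSetMergeSketch {
  public static Map<String, String> merge(List<Map<String, String>> sets) {
    Map<String, String> merged = new LinkedHashMap<>();
    for (Map<String, String> rs : sets) {
      // putAll overwrites entries with the same symlink key, pointing
      // them at the newly localized path.
      merged.putAll(rs);
    }
    return merged;
  }
}
```

Keeping this as a static utility (as discussed below) also lets the same helper merge the existing and upgraded sets at kill-for-reinit time.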

Arun Suresh
added a comment - 13/Sep/16 16:25 Reverting this:
Also.. I pulled in a change from YARN-5637 here. Essentially, if the LocalResourceTracker gets a REQUEST for a resource and if it already exists, then it just ignores the request..
Since this is already taken care of..

Arun Suresh
added a comment - 13/Sep/16 22:18 Uploading patch with minor refactoring:
ResourceSet::localizedResources is now a Map<String, Path>
Make sure that the RetryContext of the Container is updated after re-initialization

Jian He
added a comment - 14/Sep/16 05:55 Thanks Arun for updating,
The merged variable is not initialized with the existing resources, right? We can probably create a new constructor in ResourceSet that takes a ResourceSet as a parameter.
ResourceSet merged = new ResourceSet();
for (ResourceSet rs : resourceSets) {
  // This should overwrite existing symlinks
  merged.localizedResources.putAll(rs.localizedResources);

Arun Suresh
added a comment - 14/Sep/16 06:10 - edited Jian He, I intentionally wanted to keep the merge function as a static utility function and would prefer not adding a new constructor... Also, if you notice, in the KilledExternallyForReInitTransition we use this function to merge the existing ResourceSet and the upgraded ResourceSet and set container.resourceSet to the merged result.
I can change it if you are particular... but I would prefer to keep it as is.

Jian He
added a comment - 14/Sep/16 06:34 KilledExternallyForReInitTransition we use this function to merge the existing ResourceSet and upgraded ResourceSe
I see, I overlooked that it's passing both parameters. Thanks!

Arun Suresh
added a comment - 14/Sep/16 11:26 The previous Jenkins run looks bad because I had manually cancelled it.
The run before that ran fine with the same patch.
The testcase failure is not related.
Jian He / Varun Vasudev.. if you are fine with the latest patch, can we push this to trunk so we can start Jenkins against YARN-5637?

Varun Vasudev
added a comment - 15/Sep/16 10:11 Thanks for the patch, Arun Suresh. +1 except for some minor comment fixes.
1)
+ * Resource is localized while the container is running - create symlinks.
Comment is the same for two transition handlers - maybe change slightly to provide more context?
2)
+ // If Container died during an upgrade, dont bother retrying.
What is this comment for? There's no change in the code, and it looks like we're just going through the regular retry mechanism.
3)
+ static class KilledExternallyForReInitTransition extends ContainerTransition {
Maybe this should be renamed? My understanding is that this really is “Killed by YARN framework to restart container"

Arun Suresh
added a comment - 15/Sep/16 13:38 Thanks for the review Varun Vasudev ..
Uploading final patch with the changes in comments and class rename you suggested.
Will commit after a good Jenkins run

Li Lu
added a comment - 16/Sep/16 18:24 YARN-5657 blocks ongoing Jenkins testing to some NM/container related new code (noticed this when testing YARN-5638 ). Also, I noticed there were several trivial complaints from checkstyle in the last Jenkins run. Were those warnings addressed when the code got committed? There's no note about this either...

Arun Suresh
added a comment - 16/Sep/16 18:36 The test failed locally for me when I reverted the patch, so I assumed it was unrelated... I'll take a look.
There's no note about this either...
I mentioned I will be cleaning up the imports when I check in; the remaining ones are switch-case indentation issues, which are generally ignored since they conflict with the existing indentation.

Arun Suresh,
I think we need one more JIRA to release the older-than-previous resources if the container is upgraded from v1 -> v2 -> v3. The v1 resources are not referenced by the container any more, so they need to be released. I briefly checked the code; the code below in ResourceLocalizationService may need to be invoked for the v1 resources.

Arun Suresh
added a comment - 26/Sep/16 16:35 I agree.. but what if the resource sets for v2 and v3 are only incremental resourceSets (we merge the newResourceSet with the current one)? In that case, v3 might still reference v1 resources.

Jian He
added a comment - 26/Sep/16 16:43 In that case, IIUC, if a LocalizedResource is referenced by both v1 and v3, then when we release the v1 resources, the refCount (LocalizedResource#ref) is decremented, but the LocalizedResource is not actually released because v3 is still referencing it.
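The reference-counting behavior described above can be illustrated with a tiny sketch. The names are assumptions for illustration; the real count lives in LocalizedResource (the LocalizedResource#ref field) inside the ResourceLocalizationService:

```java
/**
 * Tiny sketch of the ref-count behavior: releasing the v1 reference
 * decrements the count, but the resource only becomes removable once no
 * container version references it. Illustrative names, not the real API.
 */
public class RefCountedResourceSketch {
  private int refCount = 0;

  /** A container version (e.g. v1 or v3) starts referencing the resource. */
  public void reference() { refCount++; }

  /** A container version releases its reference (e.g. releasing v1). */
  public void release() { refCount--; }

  /** True only when no version still references the resource. */
  public boolean canBeRemoved() { return refCount == 0; }
}
```

So releasing the v1 set is safe even for resources shared with v3: the shared resource's count drops but stays above zero until v3 releases it too.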