There are two issues here. They might have the same root cause, so for now, I'm filing one issue. If it turns out they don't, it's more important to address the primary bug, and we can split the secondary one off to be dealt with separately.

When Node Manager gets a stop signal like SIGINT, the daemon shuts down, along with all the monitors it's created. However, actors started in launcher.main() (TimedCallbackActor, the list pollers, the node update actor) never stop, causing the Node Manager process to stay alive.
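For context on why unstopped actors keep the process alive, here is a stdlib-only sketch (an assumption about the mechanics, not Pykka source): a threading-based actor runs its message loop on a non-daemon thread, and the Python interpreter will not exit while any non-daemon thread is still alive.

```python
import queue
import threading

class ActorLike:
    """Stand-in for a threading actor: a message loop on a non-daemon thread."""

    def __init__(self):
        self.inbox = queue.Queue()
        self.thread = threading.Thread(target=self._loop)  # non-daemon by default
        self.thread.start()

    def _loop(self):
        # Block on the inbox until a stop message arrives.
        while self.inbox.get() != "stop":
            pass  # ordinary messages would be handled here

    def stop(self):
        # Analogous to actor_ref.stop(): the request travels through the inbox.
        self.inbox.put("stop")
        self.thread.join()

actor = ActorLike()
assert actor.thread.is_alive()      # this thread alone pins the process
actor.stop()
assert not actor.thread.is_alive()  # only now can the interpreter exit
```

If any actor never receives or never processes its stop message, its thread stays alive and the process hangs, which matches the observed behavior.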

The logs show the daemon actor stopping in Pykka, followed by many monitor actors shutting down. Since the daemon actor has no logic to shut down monitor actors, it looks like we are reaching the stop_all() call; otherwise the monitor actors would have no way to stop. Yet the remaining actors apparently aren't being stopped by that same call.

Node Manager is supposed to implement an escalating shutdown procedure when it receives repeated stop signals, eventually forcing an exit. See launcher.shutdown_signal(). However, subsequent signals seem to have no effect.

Related issues

Related to Arvados - Story #8543: [NodeManager] Don't use Futures when not expecting a reply

History

One possibility: the docs for stop_all say actors are stopped in LIFO order. If that's true, there are two reasons to believe ComputeNodeUpdateActor is the culprit: it sleeps, and it's the last actor started in launcher.main(). If it has a large queue of sync_node requests that all start by sleeping, it could be some time before it finally gets and processes its stop message.
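The theory can be sketched with stdlib primitives (assumed mechanics, not Pykka source): the stop request is just another inbox message, so it waits in line behind every queued sync_node-style request, each of which starts with a sleep.

```python
import queue
import threading
import time

inbox = queue.Ueue() if False else queue.Queue()  # actor's mailbox
for _ in range(5):
    inbox.put(("sync_node", 0.01))  # stand-in for a request that sleeps first
inbox.put(("stop", None))           # stop_all()'s message lands at the back

handled = []

def actor_loop():
    while True:
        kind, delay = inbox.get()
        if kind == "stop":
            return
        time.sleep(delay)           # the sleep that delays shutdown
        handled.append(kind)

worker = threading.Thread(target=actor_loop)
worker.start()
worker.join()
print(len(handled))  # → 5: every queued request ran before stop was seen
```

With real sleeps measured in minutes rather than hundredths of a second, draining such a backlog could keep the actor, and therefore a blocking stop_all(), busy for a long time.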

The logs we have are consistent with this theory. After the daemon shuts down, the logs show exceptions like #6225 being raised by the update actor, every 3 minutes (the default maximum wait time between requests in the actor).

If this is the cause, we might need to re-think the implementation of ComputeNodeUpdateActor. We could potentially fix this issue, plus things like #6225 in one fell swoop…