Deadlock in blacklisting when executor is removed due to a heartbeat timeout

Description

I ran into a deadlock with blacklisting where the application was stuck: it thought it had executors to schedule on, but those executors were actually dead. Every live executor was blacklisted, so no tasks could run.

Note that for this to happen, the application has to be in a state where many tasks are failing (at least enough to cause all executors to be blacklisted) and an executor has to hit a heartbeat timeout.

Small disclaimer: I ran into this on Spark 2.1 with the blacklisting feature backported. Looking at the code on the master branch, I don't see this fixed there either.

Normally, TaskSetManager.abortIfCompletelyBlacklisted prevents the app from deadlocking when no executors are available. Unfortunately it did not take effect here, because the dead executor was still in TaskSchedulerImpl.hostToExecutors, and its host wasn't blacklisted (it had merely hit a heartbeat timeout), so the scheduler thought it still had an executor to run on.
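The failure mode can be modeled outside Spark. Here is a small Python sketch (illustrative only, not Spark code; the names just mirror the Scala fields) showing why the abort check never fires: the abort path only triggers when every known executor is unschedulable, so one stale entry in hostToExecutors defeats it.

```python
# Simplified model of the state TaskSetManager.abortIfCompletelyBlacklisted
# consults. Illustrative Python, not Spark code.

def should_abort(host_to_executors, blacklisted_hosts, blacklisted_executors):
    """Abort only if every known executor is unschedulable (blacklisted)."""
    for host, executors in host_to_executors.items():
        if host in blacklisted_hosts:
            continue
        for executor in executors:
            if executor not in blacklisted_executors:
                # A seemingly-usable executor exists, so don't abort --
                # even if that executor is actually dead.
                return False
    return True

# Executor "137" is dead (heartbeat timeout) but was re-added to
# hostToExecutors by the race, and its host was never blacklisted.
host_to_executors = {"foo1.com": {"137"}, "foo2.com": {"42"}}
blacklisted_hosts = set()
blacklisted_executors = {"42"}   # every *live* executor is blacklisted

print(should_abort(host_to_executors, blacklisted_hosts, blacklisted_executors))
# False: the stale entry for dead executor 137 keeps the app alive-but-stuck
```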

Now how it got into this situation:

The HeartbeatReceiver timed the executor out because it received no heartbeat. This called TaskSchedulerImpl.executorLost, which called TaskSchedulerImpl.removeExecutor, removing the executor from executorIdToRunningTaskIds and hostToExecutors. It does other handling here and sends a CompletionEvent via DAG.taskEnded. TaskSchedulerImpl.removeExecutor returns, and TaskSchedulerImpl.executorLost then calls into DAG.executorLost and CoarseGrainedSchedulerBackend.reviveOffers, both of which send messages asynchronously. This returns, and HeartbeatReceiver then calls sc.killAndReplaceExecutor asynchronously.

Now there are a bunch of messages being processed by other threads; you can see the order in which things ran from the logs below. In this situation, CoarseGrainedSchedulerBackend.makeOffers is called before sc.killAndReplaceExecutor can add the executor to executorsPendingToRemove. CoarseGrainedSchedulerBackend.makeOffers uses CoarseGrainedSchedulerBackend.executorDataMap and executorsPendingToRemove to check whether the executor exists and is alive. At this point it is still in executorDataMap and not pending removal, so makeOffers calls into TaskSchedulerImpl.resourceOffers, which adds the executor back into executorIdToRunningTaskIds and hostToExecutors. Nothing after this removes it from hostToExecutors again, so TaskSetManager.abortIfCompletelyBlacklisted always sees an executor that is actually dead.
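To make the interleaving concrete, here is a hedged Python replay of the event ordering, with plain dicts and sets standing in for the scheduler state (again illustrative, not Spark code):

```python
# Illustrative replay of the race; sets stand in for the scheduler's maps.
host_to_executors = {"foo1.com": {"137"}}
executor_data_map = {"137": "data"}   # backend's view; never cleaned up
executors_pending_to_remove = set()

# 1. HeartbeatReceiver times out executor 137:
#    TaskSchedulerImpl.removeExecutor drops it from the scheduler's maps.
host_to_executors["foo1.com"].discard("137")

# 2. reviveOffers -> makeOffers runs BEFORE killAndReplaceExecutor.
#    makeOffers considers 137 alive: still in executorDataMap, not pending removal.
alive = [e for e in executor_data_map
         if e not in executors_pending_to_remove]

# 3. makeOffers hands 137 to resourceOffers, which re-registers it.
for e in alive:
    host_to_executors.setdefault("foo1.com", set()).add(e)

# 4. killAndReplaceExecutor finally runs -- too late.
executors_pending_to_remove.add("137")

# The dead executor is back in hostToExecutors and nothing removes it again.
print(host_to_executors)   # {'foo1.com': {'137'}}
```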

While looking at this I also found a bug where the executor is never removed from CoarseGrainedSchedulerBackend.executorDataMap either. Normally the call to CoarseGrainedSchedulerBackend.removeExecutor would remove it, but since this was a heartbeat timeout that is never called. The only way that would have helped in this situation is if the executor had been removed from executorDataMap before the CoarseGrainedSchedulerBackend.makeOffers -> TaskSchedulerImpl.resourceOffers call, in which case it would never have been added back to executorIdToRunningTaskIds and hostToExecutors. Note that other stats like totalCoreCount and totalRegisteredExecutors will also be off, since removeExecutor is not called.
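A sketch of that bookkeeping drift (names mirror the Scala fields; illustrative Python, and the cleanup shown is what the skipped removeExecutor call would normally do):

```python
# Backend bookkeeping normally maintained by
# CoarseGrainedSchedulerBackend.removeExecutor.
executor_data_map = {"137": {"cores": 4}, "42": {"cores": 4}}
total_core_count = 8
total_registered_executors = 2

# Heartbeat-timeout path: TaskSchedulerImpl.removeExecutor runs, but
# CoarseGrainedSchedulerBackend.removeExecutor never does, so none of the
# state above is updated when executor 137 dies -- the map entry leaks
# and the counters stay inflated.

# What the missing cleanup *would* have done:
dead = executor_data_map.pop("137")
total_core_count -= dead["cores"]
total_registered_executors -= 1

print(total_core_count, total_registered_executors)   # 4 1
```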

Below are the logs that show the race. Notice how "Requesting to kill executor(s) 137" came in the middle of the removal (this is in CoarseGrainedSchedulerBackend.killExecutors), but it doesn't actually get as far as adding the executor to executorsPendingToRemove ("Actual list of executor(s) to be killed is 137") until after the executor was added back to the hostToExecutors list ("Host added was in lost list earlier: foo1.com").

Note there may be another bug here with speculation as well, because some time later I see this same host in a log message that makes it sound like the scheduler thinks a task is still running on that dead host:

We can split this off into another JIRA if we find it's actually a bug. It looks like the issue there is that TaskSetManager.copiesRunning is not decremented because we are using the external shuffle service, but it needs further investigation.