Potential NN deadlock in processDistributedUpgradeCommand

Description

I haven't seen this in practice, but the lock order is inconsistent. processReport locks FSNamesystem, then calls UpgradeManager.startUpgrade, getUpgradeState, and getUpgradeStatus (each of which locks the UpgradeManager). In the other direction, FSNamesystem.processDistributedUpgradeCommand calls upgradeManager.processUpgradeCommand, which is synchronized on UpgradeManager and can call FSNamesystem.leaveSafeMode, which synchronizes on FSNamesystem.
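To make the inversion concrete, here is a minimal standalone sketch, not Hadoop code. The two plain objects are hypothetical stand-ins for the FSNamesystem and UpgradeManager monitors, and the class and method names (LockOrderDemo, reproduceDeadlock, lockInOrder) are invented for illustration. One thread takes the locks in the processReport order and the other in the processUpgradeCommand order; a latch forces both to hold their first lock before either tries its second, so the deadlock is reproduced every time instead of occasionally.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

class LockOrderDemo {

    /** Returns true if the JVM reports a monitor deadlock between the two threads. */
    static boolean reproduceDeadlock() {
        // Hypothetical stand-ins for the FSNamesystem and UpgradeManager monitors.
        final Object fsNamesystem = new Object();
        final Object upgradeManager = new Object();
        final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        // Same order as processReport: FSNamesystem first, then UpgradeManager.
        Thread report = new Thread(
            () -> lockInOrder(fsNamesystem, upgradeManager, bothHoldFirstLock),
            "processReport");
        // Same order as processUpgradeCommand -> leaveSafeMode: UpgradeManager first.
        Thread upgrade = new Thread(
            () -> lockInOrder(upgradeManager, fsNamesystem, bothHoldFirstLock),
            "processUpgradeCommand");

        // Daemon threads, so the stuck threads do not keep the JVM alive.
        report.setDaemon(true);
        upgrade.setDaemon(true);
        report.start();
        upgrade.start();

        // Poll the JVM's own deadlock detector for up to 10 seconds.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long deadline = System.currentTimeMillis() + 10_000;
        while (System.currentTimeMillis() < deadline) {
            long[] deadlocked = threads.findMonitorDeadlockedThreads();
            if (deadlocked != null) {
                return true;
            }
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    private static void lockInOrder(Object first, Object second, CountDownLatch latch) {
        synchronized (first) {
            latch.countDown();
            try {
                latch.await(); // wait until the other thread holds its first lock too
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) {
                // Never reached: the other thread already holds this monitor.
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlock detected: " + reproduceDeadlock());
    }
}
```

If both threads took the two monitors in the same order, the second thread would simply wait for the first to finish and no cycle could form.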

Andrey Klochkov
added a comment - 10/Sep/12 21:43
Confirming that this happens in practice, at least in tests. The TestDistributedUpgrade test is flaky for this reason. We're capturing thread dumps of tests that fail due to timeouts (HADOOP-8755), and here's the thread dump of a TestDistributedUpgrade failure (see attachment). Thread #110 is blocked by #107 (or #109), and in turn #107 (or #109?) is blocked by #110. The first one acquired a monitor on the UpgradeManagerNamenode instance and the second one got an fsLock, so both are waiting for each other. The test fails to start the cluster because DN heartbeats can't be processed by the NN.

Todd Lipcon
added a comment - 11/Sep/12 04:58
Given that we removed the "distributed upgrade" code recently, maybe we should just backport that patch to earlier branches to avoid this issue entirely? Thanks for digging into this, Andrey!