Description

When a large number of nodes are removed by refreshing the node lists, the network topology is updated. If the refresh happens at the right moment, the replication monitor thread may get stuck in the while loop of chooseRandom(), because the cached cluster size is used in the loop's termination check. This usually happens when a block with a high replication factor is being processed. Since the replicas/rack limit is also calculated beforehand, no node choice may satisfy the goodness criteria if the refresh removed racks.

All nodes will end up in the excluded list, but the excluded list will still be smaller than the cached cluster size, so the loop runs forever. This was observed in a production environment.
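
To illustrate the mechanism, here is a minimal, self-contained simulation of the stuck loop (hypothetical class and node names; a sketch of the failure shape, not the actual BlockPlacementPolicyDefault code):

import java.util.*;

// Minimal simulation of the pre-fix chooseRandom() loop. The cached node
// count was taken before a refreshNodes shrank the cluster from 10 nodes
// to 4. WARNING: this loop never terminates; that is the point of the demo.
public class StuckLoopDemo {
  public static void main(String[] args) {
    int cachedAvailableNodes = 10;  // stale count, cached before refresh
    List<String> liveNodes = Arrays.asList("d1", "d2", "d3", "d4");
    Set<String> excludedNodes = new HashSet<>();
    int numOfReplicas = 6;          // high replication factor
    Random r = new Random();
    while (numOfReplicas > 0 && cachedAvailableNodes > 0) {
      String chosen = liveNodes.get(r.nextInt(liveNodes.size()));
      if (excludedNodes.add(chosen)) {  // a node we have not tried yet
        cachedAvailableNodes--;
        // Assume the precomputed replicas/rack limit rejects every
        // remaining node, so numOfReplicas is never decremented.
      }
      // Once all 4 live nodes are excluded, cachedAvailableNodes is
      // stuck at 6 (> 0) and numOfReplicas at 6 (> 0): infinite loop.
    }
  }
}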

Activity

Kihwal Lee
added a comment - 25/Jun/13 22:00 This can mostly be avoided by decommissioning nodes in smaller batches, which is the recommended practice. But in this particular case, the operator added a large number of new nodes and decommissioned old nodes.

Kihwal Lee
added a comment - 01/Jul/13 16:22 Even then it was not able to choose at least from them?
It couldn't pick enough nodes because the max replicas/rack was already calculated. I think it worked fine for the majority of blocks with 3 replicas, since the cluster had more than 3 racks even after the refresh. The issue was with blocks with many more replicas. But picking enough nodes is just one condition; the other is the check for exhaustion of candidate nodes. It would have bailed out of the while loop if the cached cluster size had been updated inside the loop.
To avoid a frequent cluster-size refresh for this rare condition, we can make it update the cached value after dfs.replication.max iterations, within which most blocks should find all the nodes they need. If the NN hits this issue, it will loop dfs.replication.max times and then break out. I prefer this over adding locking, which would slow down the normal case.
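
In sketch form, assuming the counter is seeded from dfs.replication.max as described, the same simulation from the description now bails out (hypothetical names; the actual patch is quoted in a later comment):

import java.util.*;

// Self-contained sketch of the counter idea: refresh the cached count
// after a bounded number of picks so the loop can notice the shrunken
// cluster. Hypothetical names and numbers; not the actual patch.
public class RefreshCounterDemo {
  public static void main(String[] args) {
    List<String> liveNodes = Arrays.asList("d1", "d2", "d3", "d4");
    Set<String> excludedNodes = new HashSet<>();
    int cachedAvailableNodes = 10;   // stale, pre-refresh count
    int numOfReplicas = 6;
    int refreshCounter = 512;        // dfs.replication.max default
    Random r = new Random();
    while (numOfReplicas > 0 && cachedAvailableNodes > 0) {
      String chosen = liveNodes.get(r.nextInt(liveNodes.size()));
      if (excludedNodes.add(chosen)) {
        cachedAvailableNodes--;      // still rejects every node, as before
      }
      if (--refreshCounter == 0) {
        // Re-count against the current topology: 4 live nodes, all of
        // them already excluded, so zero candidates remain.
        int remaining = liveNodes.size() - excludedNodes.size();
        refreshCounter = remaining;
        if (refreshCounter <= excludedNodes.size()) {
          break;                     // bounded: at most 512 extra picks
        }
      }
    }
    System.out.println("Bailed out with " + numOfReplicas + " unplaced replicas.");
  }
}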

Hadoop QA
added a comment - 01/Aug/13 19:43 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12595453/HDFS-4937.patch
against trunk revision.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/4754//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/4754//console
This message is automatically generated.

Brahma Reddy Battula
added a comment - 31/Oct/15 06:00 I don't know if I can give a -1. But shall we revert this? A lot of tests are broken because of it.
662 int refreshCounter = numOfAvailableNodes;
...
671 while (numOfReplicas > 0 && numOfAvailableNodes > 0) {
672   DatanodeDescriptor chosenNode = chooseDataNode(scope);
673   if (excludedNodes.add(chosenNode)) { // was not in the excluded list
674     if (LOG.isDebugEnabled()) {
675       builder.append("\nNode ").append(NodeBase.getPath(chosenNode)).append(" [");
676     }
677     numOfAvailableNodes--;
678     DatanodeStorageInfo storage = null;
679     if (isGoodDatanode(chosenNode, maxNodesPerRack, considerLoad,
...
711   }
712   // Refresh the node count. If the live node count became smaller,
713   // but it is not reflected in this loop, it may loop forever in case
714   // the replicas/rack cannot be satisfied.
715   if (--refreshCounter == 0) {
716     refreshCounter = clusterMap.countNumOfAvailableNodes(scope,
717         excludedNodes);
718     // It has already gone through enough number of nodes.
719     if (refreshCounter <= excludedNodes.size()) {
720       break;
721     }
722   }
723 }
At line 672, chooseDataNode(scope) picks randomly; if chosenNode happens to be an already-excluded one, execution won't reach line 674, but refreshCounter is still decremented.
If we are out of luck and chooseDataNode(scope) returns an already-excluded node too many times, we reach the re-count at line 716 and break at line 720.
Then we end up choosing fewer than numOfReplicas nodes when, in fact, we could have found enough.

Yi Liu
added a comment - 31/Oct/15 08:44 - edited I did consider the situation you mentioned, but I thought that in a real environment the NN could find other racks/DNs once it has gone through enough (though not all) of the nodes. I missed the fact that many tests contain only a few available DNs, so refreshCounter <= excludedNodes.size() will be true; in a real environment this may also happen if the total number of DNs is small. So the patch is not correct for these cases; let's revert it.
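
A small worked example of that failure mode, with hypothetical numbers mirroring the quoted code: suppose a test cluster has 3 live DNs and the three random picks happen to be d1, d1, d2.

import java.util.*;

// Walk-through of the premature break with only 3 DNs (hypothetical
// names and picks; numbers mirror the quoted code, not a real test).
public class FewNodesDemo {
  public static void main(String[] args) {
    int refreshCounter = 3;                      // line 662: 3 live DNs
    Set<String> excludedNodes = new HashSet<>();
    for (String pick : new String[] {"d1", "d1", "d2"}) {
      excludedNodes.add(pick);                   // duplicate add is a no-op
      if (--refreshCounter == 0) {
        int remaining = 3 - excludedNodes.size(); // 1 usable node (d3) left
        refreshCounter = remaining;
        if (refreshCounter <= excludedNodes.size()) { // 1 <= 2: true
          System.out.println("breaks with d3 still usable");
        }
      }
    }
  }
}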

Kihwal Lee
added a comment - 02/Nov/15 16:44 So sorry about the spectacular 118 test failures! It should have refreshed the count with an empty excluded-node set to obtain the correct count. It looks like a few of the failed test cases are now passing with the change. Let's see if the precommit agrees.
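
A sketch of the correction being described, assuming countNumOfAvailableNodes accepts any node set (a fragment for inside the loop; hypothetical, not necessarily the committed code):

// Sketch: refresh with an empty exclusion set so the re-count yields the
// total live node count, which is the right quantity to compare against
// excludedNodes.size().
if (--refreshCounter == 0) {
  refreshCounter = clusterMap.countNumOfAvailableNodes(
      scope, Collections.<Node>emptySet());
  if (refreshCounter <= excludedNodes.size()) {
    break;  // every live node has already been tried
  }
}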

Kihwal Lee
added a comment - 03/Nov/15 14:42 The test failures are definitely related. When I run TestReplicationPolicy, different cases fail depending on the test ordering. One failure might be affecting the other cases.

Kihwal Lee
added a comment - 03/Nov/15 22:56 - edited Actually, the patch I wrote 2 years ago and the rebased one are correct. My subsequent attempt to "improve" it was based on my recent, incorrect understanding of the patch. Now I remember why I did it that way.
After a sufficient number of random, potentially duplicate picks have been tried, the total candidate node count is refreshed. The refreshed number will not include nodes already tried and excluded, so it is truly the remaining candidate count based on the list of nodes already tried and the latest network topology. The loop will continue until all candidate nodes are exhausted or enough replicas have been picked.
Resubmitting v1, the rebased original patch.
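
Restated as comments, my paraphrase of the termination argument (not text from the patch):

// Paraphrase of the argument above (not patch code):
//  - Every refreshCounter-many random (possibly duplicate) picks, the
//    candidate count is recomputed against the latest topology.
//  - The recomputed count omits nodes already tried and excluded, so it
//    is the true remaining-candidate count at that moment.
//  - Hence the loop ends either when numOfReplicas hits 0 (enough picks)
//    or when the remaining candidates show exhaustion; it can no longer
//    spin forever on a stale cached cluster size.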

First of all, the precommit build ran 4,075 test cases, so I think it ran all of them this time.

The test failures are not related to the patch. I've rerun the failed tests, and only TestSeveralNameNodes was failing occasionally. It was timing out waiting for a thread to finish writing. This test has been failing in other precommit builds as well. When I increased the timeout, it passed 100% of the time. I will file a jira for this.

Siva Teja Patibandla
added a comment - 09/Mar/17 21:48 Hi Kihwal, was the v3 patch tested? It seems the whole chooseRandom() function was rewritten in later releases, so the fix may not have gotten much test mileage; I'm wondering whether I should use it or not.