Details

Description

When a datanode heartbeat times out namenode marks it dead. The subsequent heartbeat from the datanode is rejected with a command to datanode to re-register. However namenode accepts block report from the datanode although it is marked dead.

Suresh Srinivas
added a comment - 18/Jun/10 22:04 This creates several problems:
Blocks from dead datanode is added to block map. Opening a file returns blocks from the dead datanode. The client experience failure to read it repeatedly.
Decommissioning the datanode does not resolve this problem as the blocks continue to remain in the block map. Only way to solve this problem is by restarting namenode.
This problem was observed on a yahoo cluster, where a datanode was marked dead, however a block report from it made it to namenode.

Hudson is stuck. I ran testpatch; here is the result:[exec] -1 overall. [exec][exec] +1 @author. The patch does not contain any @author tags.[exec][exec] -1 tests included. The patch doesn't appear to include any new or modified tests.[exec] Please justify why no new tests are needed for this patch.[exec] Also please list what manual steps were performed to verify this patch.[exec][exec] +1 javadoc. The javadoc tool did not generate any warning messages.[exec][exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.[exec][exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.[exec][exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

This change just tags the code with interface classification. Hence tests are not included. I also ran unit tests and it ran without failures.

Suresh Srinivas
added a comment - 18/Jun/10 23:29 Hudson is stuck. I ran testpatch; here is the result:
[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no new tests are needed for this patch.
[exec] Also please list what manual steps were performed to verify this patch.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
This change just tags the code with interface classification. Hence tests are not included. I also ran unit tests and it ran without failures.

Currently when an unknown datanode sends blockReport(), namenode rejects it with an IOException. Same is done when a dead datanode sends block report.

Currently when an unknown datanode sends blockReceived() request, namenode rejects it with IllegalArgumentException. I am changing this to IOException, to be consistent with blockReport(). Same IOException is thrown when a dead datanode sends blockReceived(), IOException.

I have added a new test to ensure the following requests are rejected from dead datanodes:

Suresh Srinivas
added a comment - 24/Jun/10 23:15 The attached patch makes the following changes:
Currently when an unknown datanode sends blockReport(), namenode rejects it with an IOException. Same is done when a dead datanode sends block report.
Currently when an unknown datanode sends blockReceived() request, namenode rejects it with IllegalArgumentException. I am changing this to IOException, to be consistent with blockReport(). Same IOException is thrown when a dead datanode sends blockReceived(), IOException.
I have added a new test to ensure the following requests are rejected from dead datanodes:
blockReceived()
blockReport()
sendHeartBeat()

Konstantin Shvachko
added a comment - 25/Jun/10 22:51
Some more candidates, could you please verify and add isAlive() condition if necessary:
commitBlockSynchronization()
updatePipeline()
You used "dead or unknown" twice and "unknown or dead" in one other place. Would be better to have it consistent.

We need to add the same checks commitBlockSynchronization(), reportBadBlocks() methods. Currently these methods do not have DatanodeRegistration parameter in them. Without that checking if the node is registered or alive is not possible. I will create a separate jira to track this.

Note that the check to see if the node is registered/alive is not done for the method errorReport() as it is done for logging an error.

Suresh Srinivas
added a comment - 25/Jun/10 23:35 updatePipeLine() is not a DatanodeProtocol method.
We need to add the same checks commitBlockSynchronization(), reportBadBlocks() methods. Currently these methods do not have DatanodeRegistration parameter in them. Without that checking if the node is registered or alive is not possible. I will create a separate jira to track this.
Note that the check to see if the node is registered/alive is not done for the method errorReport() as it is done for logging an error.

Konstantin Shvachko
added a comment - 26/Jun/10 00:29 Yes I agree commitBlockSynchronization(), and reportBadBlocks() should have registration parameter and check isAlive(), to be fixed in HDFS-1270 .
+1 for the patch.