dhruba borthakur
added a comment - 03/Jan/11 07:53 I committed this to trunk only. I did not commit it to 0.22 because it is not a regression. But I am not opposed if the 0.22 release manager wants to pull it to 0.22 branch.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

Hadoop QA
added a comment - 30/Dec/10 11:23 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12467161/datanodeException5.txt
against trunk revision 1053214.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these core unit tests:
org.apache.hadoop.hdfs.server.namenode.TestStorageRestore
org.apache.hadoop.hdfs.TestFileConcurrentReader
-1 contrib tests. The patch failed contrib unit tests.
+1 system test framework. The patch passed system test framework compile.
Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/60//testReport/
Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/60//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/60//console
This message is automatically generated.

Konstantin Shvachko
added a comment - 29/Dec/10 19:45 I meant that LOG.warn(msg,e) is equivalent to LOG.warn(msg + stringifyException(e)) . Both print the stack trace for e, but LOG.warn(msg,e) looks better in the logs - more standard way.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

Hadoop QA
added a comment - 28/Dec/10 23:21 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12467071/datanodeException4.txt
against trunk revision 1053214.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these core unit tests:
org.apache.hadoop.hdfs.server.namenode.TestStorageRestore
org.apache.hadoop.hdfs.TestFileConcurrentReader
-1 contrib tests. The patch failed contrib unit tests.
+1 system test framework. The patch passed system test framework compile.
Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/53//testReport/
Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/53//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/53//console
This message is automatically generated.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

Hadoop QA
added a comment - 28/Dec/10 22:55 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12467070/datanodeException3.txt
against trunk revision 1053214.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these core unit tests:
org.apache.hadoop.hdfs.server.namenode.TestStorageRestore
org.apache.hadoop.hdfs.TestFileConcurrentReader
-1 contrib tests. The patch failed contrib unit tests.
+1 system test framework. The patch passed system test framework compile.
Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/52//testReport/
Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/52//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/52//console
This message is automatically generated.

Konstantin Shvachko
added a comment - 28/Dec/10 22:46 A nit. Would you use LOG.info(msg, e) instead of stringifyException . It really looks better in the logs that way.
The rest looks good to me.
+1
Could you please fill the missing jira fields (versions, components).

Also you probably need similar logic for the handshake (versionRequest()). Otherwise DNs that are in the handshaking stage will fail the way you described.
The second catch in my version is catching IOException rather Exception, because we probably want to DN fail RuntimeException, say IllegalArgumentException.

Konstantin Shvachko
added a comment - 28/Dec/10 21:42 Also you probably need similar logic for the handshake (versionRequest()). Otherwise DNs that are in the handshaking stage will fail the way you described.
The second catch in my version is catching IOException rather Exception, because we probably want to DN fail RuntimeException, say IllegalArgumentException.

Konstantin Shvachko
added a comment - 28/Dec/10 21:23 From the stack trace I understand you restarted the name-node, which caused failure of DNs that were registering.
May I propose a variant, which easier than commenting on every line.
} catch (RemoteException re) {
IOException ue = re.unwrapRemoteException(
UnregisteredNodeException.class,
DisallowedDatanodeException.class,
IncorrectVersionException.class);
if (ue != re) throw ue;
} catch (IOException e) { // namenode cannot be contacted
LOG.info( "Problem connecting to server: " + getNameNodeAddr(), e);
You need to throw rather than break in order to avoid the registration logic after the loop.
It might be useful to write a test case for that.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

Hadoop QA
added a comment - 28/Dec/10 20:03 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12467058/datanodeException2.txt
against trunk revision 1053214.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these core unit tests:
org.apache.hadoop.hdfs.server.namenode.TestStorageRestore
org.apache.hadoop.hdfs.TestFileConcurrentReader
-1 contrib tests. The patch failed contrib unit tests.
+1 system test framework. The patch passed system test framework compile.
Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/50//testReport/
Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/50//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/50//console
This message is automatically generated.

Thanks Todd for the review. I meant to do this for the register call, but my merge from our 0.20 based tree into apache trunk caused the earlier error. For background: when we restarted one of our namenodes, many datanodes exited out with the following stack trace:

Retrying connect to server: zzz.com/yyy:9005. Already tried 5 time(s).
org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call to yyy.com/:9005 failed on local exception: java.io.IOException: Connection reset by peer
at org.apache.hadoop.ipc.Client.wrapException(Client.java:796)
at org.apache.hadoop.ipc.Client.call(Client.java:764)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy4.register(Unknown Source)
at org.apache.hadoop.hdfs.server.datanode.DataNode.register(DataNode.java:554)
at org.apache.hadoop.hdfs.server.datanode.DataNode.runDatanodeDaemon(DataNode.java:1281)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1320)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1441)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
at sun.nio.ch.IOUtil.read(IOUtil.java:206)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)

This caused the restart to get slowed down because many blocks were missing and the NN refused to exit safemode automatically as expected.

The fix here is to avoid the datanode from exiting itself unless a explicit shutdown-exception is thrown by the NN. Otherwise, the datanode should never exit and should continuously try to establish contact with the NN.

dhruba borthakur
added a comment - 28/Dec/10 18:38 - edited Patch also uploaded at https://reviews.apache.org/r/195/
Thanks Todd for the review. I meant to do this for the register call, but my merge from our 0.20 based tree into apache trunk caused the earlier error. For background: when we restarted one of our namenodes, many datanodes exited out with the following stack trace:
Retrying connect to server: zzz.com/yyy:9005. Already tried 5 time(s).
org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Call to yyy.com/:9005 failed on local exception: java.io.IOException: Connection reset by peer
at org.apache.hadoop.ipc.Client.wrapException(Client.java:796)
at org.apache.hadoop.ipc.Client.call(Client.java:764)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy4.register(Unknown Source)
at org.apache.hadoop.hdfs.server.datanode.DataNode.register(DataNode.java:554)
at org.apache.hadoop.hdfs.server.datanode.DataNode.runDatanodeDaemon(DataNode.java:1281)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1320)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1441)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
at sun.nio.ch.IOUtil.read(IOUtil.java:206)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
This caused the restart to get slowed down because many blocks were missing and the NN refused to exit safemode automatically as expected.
The fix here is to avoid the datanode from exiting itself unless a explicit shutdown-exception is thrown by the NN. Otherwise, the datanode should never exit and should continuously try to establish contact with the NN.

Konstantin Shvachko
added a comment - 28/Dec/10 01:29 Same here. Looks like you remove the timeout exception, which we still need to track. Besides, when DN receives the timeout exception it doesn't die, but retries. Which version is it targeted for?

Hi Dhruba. I'm a little confused by this patch. When can versionRequest throw exceptions like UnregisteredNodeException or DisallowedDatanodeException? These exceptions are thrown by registerDatanode, not by versionRequest.

I think maybe we should just widen SocketTimeoutException to IOException overall?

Todd Lipcon
added a comment - 28/Dec/10 00:05 Hi Dhruba. I'm a little confused by this patch. When can versionRequest throw exceptions like UnregisteredNodeException or DisallowedDatanodeException? These exceptions are thrown by registerDatanode, not by versionRequest.
I think maybe we should just widen SocketTimeoutException to IOException overall?

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

Hadoop QA
added a comment - 27/Dec/10 23:53 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12466370/datanodeException1.txt
against trunk revision 1053203.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these core unit tests:
org.apache.hadoop.hdfs.server.namenode.TestStorageRestore
org.apache.hadoop.hdfs.TestFileConcurrentReader
-1 contrib tests. The patch failed contrib unit tests.
+1 system test framework. The patch passed system test framework compile.
Test results: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/48//testReport/
Findbugs warnings: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/48//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: https://hudson.apache.org/hudson/job/PreCommit-HDFS-Build/48//console
This message is automatically generated.

If the namenode.register calls throws a UnregisteredNodeException, DisallowedDatanodeException or a IncorrectVersionException then bail out, otherwise continue to retry the register call over and over again.

dhruba borthakur
added a comment - 16/Dec/10 06:07 If the namenode.register calls throws a UnregisteredNodeException, DisallowedDatanodeException or a IncorrectVersionException then bail out, otherwise continue to retry the register call over and over again.