SocketTimeoutException in unit tests

Description

TestJobStatusPersistency failed, and its output contained DataNode stack traces similar to the following:

2008-03-07 21:27:00,410 ERROR dfs.DataNode (DataNode.java:run(976)) - 127.0.0.1:57790:DataXceiver: java.net.SocketTimeoutException: 0 millis timeout while waiting for Unknown Addr (local: /127.0.0.1:57790) to be ready for read
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:188)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:135)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:121)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at org.apache.hadoop.dfs.DataNode$BlockReceiver.receiveBlock(DataNode.java:2434)
at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1170)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:953)
at java.lang.Thread.run(Thread.java:619)

This is most likely related to HADOOP-2346. The error is strange: socket.getRemoteSocketAddress() returned null, implying that this socket is not connected yet. But we have already read a few bytes from it!
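For reference, the null return described above is what java.net.Socket reports for a socket that has never been connected. The sketch below only illustrates that JDK behaviour; the class name RemoteAddressCheck and the helper describePeer are hypothetical, and the "Unknown Addr" placeholder is copied from the log line above rather than taken from the Hadoop source.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketAddress;

public class RemoteAddressCheck {
    /**
     * Hypothetical helper: describes a socket's peer, falling back to a
     * placeholder when getRemoteSocketAddress() returns null.
     */
    static String describePeer(Socket socket) {
        SocketAddress remote = socket.getRemoteSocketAddress();
        return remote == null ? "Unknown Addr" : remote.toString();
    }

    public static void main(String[] args) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the OS for any free port.
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket()) {
            // Never connected: getRemoteSocketAddress() is null here.
            System.out.println("before connect: " + describePeer(client));
            client.connect(new InetSocketAddress(loopback, server.getLocalPort()));
            // Connected: the remote address is now set.
            System.out.println("after connect:  " + describePeer(client));
        }
    }
}
```

A socket that has delivered data should normally be in the second state, which is why the "Unknown Addr" in the log is surprising.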

Raghu Angadi added a comment - 08/Mar/08 01:32 - edited
I thought I could avoid calling System.currentTimeMillis() while waiting and depend on select(). Tough luck.
The attached patch polls in a loop until the timeout passes. It also removes a large block for setting "channeStr"; we use channel.toString() instead.
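The polling approach described in this comment can be sketched roughly as follows. This is not the attached patch: the class TimedSelect and the method selectWithDeadline are hypothetical names, and the sketch assumes a non-blocking channel and treats a non-positive timeout as already expired. It only shows the idea of re-checking the clock after every select() call instead of trusting a 0 return to mean the timeout has elapsed.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.SocketTimeoutException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class TimedSelect {
    /**
     * Hypothetical helper: waits until the channel is ready for the given ops.
     * Returns the number of ready keys, or throws SocketTimeoutException once
     * timeoutMillis has really elapsed. The channel must be non-blocking.
     */
    static int selectWithDeadline(SelectableChannel channel, int ops, long timeoutMillis)
            throws IOException {
        try (Selector selector = Selector.open()) {
            channel.register(selector, ops);
            long start = System.currentTimeMillis();
            long remaining = timeoutMillis;
            // select(0) would block forever, so only call it with a positive value.
            while (remaining > 0) {
                int ready = selector.select(remaining);
                if (ready > 0) {
                    return ready;
                }
                if (Thread.currentThread().isInterrupted()) {
                    throw new InterruptedIOException("interrupted while waiting for " + channel);
                }
                // select() returned 0: do not assume the timeout passed, measure it.
                remaining = timeoutMillis - (System.currentTimeMillis() - start);
            }
        }
        throw new SocketTimeoutException(
                timeoutMillis + " millis timeout while waiting for " + channel);
    }

    public static void main(String[] args) throws IOException {
        Pipe pipe = Pipe.open();
        pipe.source().configureBlocking(false);
        try {
            selectWithDeadline(pipe.source(), SelectionKey.OP_READ, 200);
        } catch (SocketTimeoutException e) {
            System.out.println("timed out: " + e.getMessage());
        }
        pipe.sink().write(ByteBuffer.wrap(new byte[] {1}));
        System.out.println("ready keys: "
                + selectWithDeadline(pipe.source(), SelectionKey.OP_READ, 1000));
        pipe.sink().close();
        pipe.source().close();
    }
}
```

The sketch uses System.currentTimeMillis() because that is the call mentioned in the comment; System.nanoTime() would be a reasonable alternative where wall-clock adjustments are a concern.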

Raghu Angadi added a comment - 08/Mar/08 00:55
I am willing to bet that, for random reasons, Java select() returns 0 irrespective of the timeout. So we need to keep track of how long we have waited. Oddly, when the test passes there are no instances of these errors, but when the test fails there are a lot of them.