Matt Foley
added a comment - 03/May/11 22:04 The following test has failed on all or most PreCommit-HDFS-Build builds since #427, but not before that. #427 was the build for HDFS-1052 .
org.apache.hadoop.hdfs.TestDFSStorageStateRecovery.testBlockPoolStorageStates 427-446

I took a look into this. Turns out the failure is due to running out of available FDs in the test process. Each of these three test cases leak FDs, but the two DN-related tests leak more than the NN test. If the two DN tests are both run, the one run second will fail at some random point when it runs out of available FDs.

The other things that made this tough to track down is that the exception which triggered the failure was some times being masked by two unnecessary "catch (Exception e)

{ // ignore }

" blocks, and a spurious NPE. I'll be attaching a patch shortly which addresses these issues.

I tried running this test with the patch posted in HADOOP-7146 and can confirm that it passes reliably, and is no longer leaking FDs test-to-test.

Aaron T. Myers
added a comment - 20/May/11 21:26 I took a look into this. Turns out the failure is due to running out of available FDs in the test process. Each of these three test cases leak FDs, but the two DN-related tests leak more than the NN test. If the two DN tests are both run, the one run second will fail at some random point when it runs out of available FDs.
The other things that made this tough to track down is that the exception which triggered the failure was some times being masked by two unnecessary "catch (Exception e)
{ // ignore }
" blocks, and a spurious NPE. I'll be attaching a patch shortly which addresses these issues.
I tried running this test with the patch posted in HADOOP-7146 and can confirm that it passes reliably, and is no longer leaking FDs test-to-test.

It is not. Note that the test implies both that it should and should not be. The original version has assert(...)}}s right after the statement that should theoretically be throwing, which wouldn't be reached if it were in fact throwing. Perhaps it should in fact be throwing, but in that case we should be putting {{fail(...)}}s right after the {{cluster.startDataNodes(...) calls, and the asserts in the catch(...) clauses.

Aaron T. Myers
added a comment - 20/May/11 22:07 It is not. Note that the test implies both that it should and should not be. The original version has assert(...)}}s right after the statement that should theoretically be throwing, which wouldn't be reached if it were in fact throwing. Perhaps it should in fact be throwing, but in that case we should be putting {{fail(...)}}s right after the {{cluster.startDataNodes(...) calls, and the asserts in the catch(...) clauses.

Hadoop QA
added a comment - 27/May/11 07:32 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12479960/hdfs-1884.0.patch
against trunk revision 1128009.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/hudson/job/PreCommit-HDFS-Build/643//console
This message is automatically generated.

Aaron T. Myers
added a comment - 27/May/11 20:55 The only test which failed was TestFiDataTransferProtocol2 . I just ran this test locally on my box and it passed just fine. I'm betting the failure was unrelated.