Description

Our secondary name node is not able to start on NullPointerException:
ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1232)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1221)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:776)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:590)
at
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:473)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:350)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:314)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:225)
at java.lang.Thread.run(Thread.java:619)

Activity

The edit log showed that the file that caused the problem was moved to the trash right before it was open for read and thus reset its access time. But how would open succeed? It turned out to be a synchronization problem. NameNode fetches the inode of the file to be open without holding fsnamesystem lock.

This was approximately what happened. Moving the file to trash and open requests arrived at NN simultaneously. So NN
1. fetched the inode for the file;
2. moved the file to trash;
3. opened the file and set its access time.

Although 2 removed the file's inode, 3 succeeded because it accessed the file through the inode fetched in 1.

Hairong Kuang
added a comment - 08/Oct/09 22:01 The edit log showed that the file that caused the problem was moved to the trash right before it was open for read and thus reset its access time. But how would open succeed? It turned out to be a synchronization problem. NameNode fetches the inode of the file to be open without holding fsnamesystem lock.
This was approximately what happened. Moving the file to trash and open requests arrived at NN simultaneously. So NN
1. fetched the inode for the file;
2. moved the file to trash;
3. opened the file and set its access time.
Although 2 removed the file's inode, 3 succeeded because it accessed the file through the inode fetched in 1.

Looks that Hudson have not picked up the submitted patch yet. I ran ant test-patch on my local machine:[exec] +1 overall.[exec][exec] +1 @author. The patch does not contain any @author tags.[exec][exec] -1 tests included. The patch doesn't appear to include any new or modified tests.[exec] Please justify why no new tests are needed for this patch.[exec] Also please list what manual steps were performed to verify this patch.[exec][exec] +1 javadoc. The javadoc tool did not generate any warning messages.[exec][exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.[exec][exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.[exec][exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

No test is included because it is very hard to write tests for race conditions and the fix is simple too.

Hairong Kuang
added a comment - 13/Oct/09 19:17 Looks that Hudson have not picked up the submitted patch yet. I ran ant test-patch on my local machine:
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no new tests are needed for this patch.
[exec] Also please list what manual steps were performed to verify this patch.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
No test is included because it is very hard to write tests for race conditions and the fix is simple too.

Hairon, for accounting purposes I'd like to have a comment on why no tests were included. It might be obvious right now, but later on it will require an extra effort to find out why -1'ed patch has been committed.

Konstantin Boudnik
added a comment - 14/Oct/09 03:16 Hairon, for accounting purposes I'd like to have a comment on why no tests were included. It might be obvious right now, but later on it will require an extra effort to find out why -1'ed patch has been committed.

Hairong Kuang
added a comment - 14/Oct/09 17:54 i already explained why I committed the patch without a test:
> No test is included because it is very hard to write tests for race conditions and the fix is simple too.

Hairong Kuang
added a comment - 14/Oct/09 18:01 If you already experienced this problem either when starting namenode or secondary namenode, you can apply the attached patch to get your node up again. This is a patch for 0.20.

ryan rawson
added a comment - 18/Feb/10 02:56 Hey guys, I ran into this on branch-0.21, probably because the patch contained in the file "nullSetTime.patch" was never committed!
Why would this issue be closed?!

here is my stack trace:
2010-02-13 21:46:54,748 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Throwable Exception in doCheckpoint:
2010-02-13 21:46:54,749 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1399)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1387)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:672)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:401)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:368)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1172)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:594)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:476)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:353)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:317)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:219)
at java.lang.Thread.run(Thread.java:619)

Hairong Kuang
added a comment - 25/Feb/10 19:23 nullSetTime.patch is not supposed to be committed. It is used for those users who came across the problem and get their the NN or SNN up.
Ryan, if you continue to see this bug in 0.21, please file another jira and provide mroe details on your edit log & NN logs. There might be another race condition in the code.
Let me close this jira for now.

2010-02-24 05:02:48,054 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1152)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1164)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1067)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:213)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:511)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:401)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:368)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1172)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:594)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:476)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:353)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:317)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:219)
at java.lang.Thread.run(Thread.java:619)

ryan rawson
added a comment - 25/Feb/10 19:46 this just happened again on me:
2010-02-24 05:02:48,054 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1152)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1164)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addNode(FSDirectory.java:1067)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedAddFile(FSDirectory.java:213)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:511)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:401)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:368)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1172)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:594)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:476)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:353)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:317)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:219)
at java.lang.Thread.run(Thread.java:619)
I can provide image and edits.

2010-02-25 03:24:11,725 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1006)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:992)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:966)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:953)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:696)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1001)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

Trying to recover a corrupted FS image from snapshot. Currently a no-go.

Andrew Purtell
added a comment - 25/Feb/10 19:54 We saw this also today I think:
2010-02-25 03:24:11,725 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1006)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:992)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:966)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:953)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:696)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1001)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
Trying to recover a corrupted FS image from snapshot. Currently a no-go.

The fix is in 0.20, 0.21, and the trunk. Please check if the version that you are running has the fix: openNPE.patch.

If yes, as I said, the bug may be caused by a different race condition from the one that I fixed. Need more information from the edit log & NameNode log to figure out the cause the problem, please open a different jira for this.

Hairong Kuang
added a comment - 25/Feb/10 22:06 The fix is in 0.20, 0.21, and the trunk. Please check if the version that you are running has the fix: openNPE.patch.
If yes, as I said, the bug may be caused by a different race condition from the one that I fixed. Need more information from the edit log & NameNode log to figure out the cause the problem, please open a different jira for this.

2010-02-26 05:47:17,598 ERROR org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException: Unexpected block size: -3664558185340993536
2010-02-26 05:47:17,600 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1007)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:993)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:967)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:954)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:696)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1001)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:88)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965) It doesn't work.

Btw, we tried to skip all of these corrupt record by modifying Block.java. However, we then saw this:

2010-02-26 03:34:58,627 ERROR org.apache.hadoop.hdfs.server.common.Storage: java.lang.IllegalArgumentException: No enum const class org.apache.hadoop.hdfs.protocol.DatanodeInfo$AdminStates.hive_2009_05_31_VSAPI_001_12.VSAPI_001.10637.0.archive.pb$����&;=��ٴ��
2010-02-26 03:34:58,629 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1007)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:993)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:967)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:954)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:696)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1001)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:88)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)

It looks like some field in the record is not a valid value. Do you have any idea?

Can we provide you the snapshot to take a look at? We would really like to recover this volume.

Andrew Purtell
added a comment - 26/Feb/10 17:39 Hairong,
We are using 0.20.
We are trying to recover a corrupt volume with about 3 TB of data we'd like to get back. We have followed the steps on the page http://hadoop.apache.org/common/docs/current/hdfs_user_guide.html#Secondary+NameNode but are getting exceptions that look very similar to the ones posted on this issue.
We tried to use the openNEP patch. The result looks the same:
2010-02-26 05:47:17,598 ERROR org.apache.hadoop.hdfs.server.common.Storage: java.io.IOException: Unexpected block size: -3664558185340993536
2010-02-26 05:47:17,600 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1007)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:993)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:967)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:954)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:696)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1001)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:88)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965) It doesn't work.
Btw, we tried to skip all of these corrupt record by modifying Block.java. However, we then saw this:
2010-02-26 03:34:58,627 ERROR org.apache.hadoop.hdfs.server.common.Storage: java.lang.IllegalArgumentException: No enum const class org.apache.hadoop.hdfs.protocol.DatanodeInfo$AdminStates.hive_2009_05_31_VSAPI_001_12.VSAPI_001.10637.0.archive.pb$����&;=��ٴ��
2010-02-26 03:34:58,629 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:1007)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.addChild(FSDirectory.java:993)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:967)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedMkdir(FSDirectory.java:954)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:696)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1001)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:88)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:292)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:201)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:279)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:956)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:965)
It looks like some field in the record is not a valid value. Do you have any idea?
Can we provide you the snapshot to take a look at? We would really like to recover this volume.

2010-03-25 19:57:33,538 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1399)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1387)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:672)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:401)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:368)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1172)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:594)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:476)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:353)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:317)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:219)
at java.lang.Thread.run(Thread.java:619)

Not sure if it matters, but according to my Nagios monitor, one of the data nodes (web211, 136.156.30.211) run out of disk space at 2010-03-25 17:55:26 (about a minute before the edit log roll started)

Carlos Valiente
added a comment - 26/Mar/10 09:20 I'm seeing this problem on a 4-node cluster running 0.21.0, revision 916530:
2010-03-25 19:57:33,538 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1399)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1387)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadEditRecords(FSEditLog.java:672)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:401)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:368)
at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:1172)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.doMerge(SecondaryNameNode.java:594)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$CheckpointStorage.access$000(SecondaryNameNode.java:476)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:353)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:317)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:219)
at java.lang.Thread.run(Thread.java:619)
The content of one of my dfs.name.dir is available at http://static.ecmwf.int/hadoop/dfs-2010-03-26-08-44.tar.bz2 .
I've put a consolidated log file of all Hadoop processes at http://static.ecmwf.int/hadoop/hadoop-2010-03-25.log.bz2 . The NPE happens at line 206027. When the NPE was thrown, there were serveral writes taking place.
Not sure if it matters, but according to my Nagios monitor, one of the data nodes (web211, 136.156.30.211) run out of disk space at 2010-03-25 17:55:26 (about a minute before the edit log roll started)

Hi Andrew, it looks that the stack trace that you posted is more like the one described in HDFS-1002. From the image & edits that Ryan posted to HDFS-1002, it seems to me that HDFS may have missed some edit logs. So I am guessing that it was caused by HDFS-955.

Carlos, thanks a lot for all the information that you provided. Please give me sometime to exam it. Since this jira is closed, could you please open a new jira or we could continue discussion in HDFS-1002.

Hairong Kuang
added a comment - 26/Mar/10 17:28 Hi Andrew, it looks that the stack trace that you posted is more like the one described in HDFS-1002 . From the image & edits that Ryan posted to HDFS-1002 , it seems to me that HDFS may have missed some edit logs. So I am guessing that it was caused by HDFS-955 .
Carlos, thanks a lot for all the information that you provided. Please give me sometime to exam it. Since this jira is closed, could you please open a new jira or we could continue discussion in HDFS-1002 .