FSNamesystem#delete drops the FSN lock between removing INodes from the tree and deleting them from the inode map

Description

After HDFS-6527, we had not seen the edit log corruption for weeks on multiple clusters, until yesterday. Previously, we would see it within 30 minutes on a cluster.

But the same condition was reproduced even with HDFS-6527. The only explanation is that the RPC handler thread serving addBlock() was accessing a stale parent value. Although nulling out the parent is done inside the FSNamesystem and FSDirectory write locks, there is no memory barrier because no "synchronized" block is involved in the process.
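A minimal sketch of the visibility hazard described above (class and method names here are hypothetical, not the actual HDFS code): a writer nulls out an inode's parent while holding the write lock, but a reader that consults the field with no lock and no volatile has no happens-before edge, so the Java memory model permits it to observe a stale, non-null parent. Reading under the same lock restores the edge.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for an inode with a parent pointer.
class INodeSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private INodeSketch parent;  // deliberately not volatile

    INodeSketch(INodeSketch parent) {
        this.parent = parent;
    }

    void unlinkFromParent() {
        lock.writeLock().lock();
        try {
            parent = null;  // visible only to readers that also synchronize
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Unsafe: how addBlock() effectively read the field before the fix.
    INodeSketch parentUnsafe() {
        return parent;  // no lock, no barrier: may be stale in another thread
    }

    // Safe: pairing the read with the same lock establishes happens-before.
    INodeSketch parentSafe() {
        lock.readLock().lock();
        try {
            return parent;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```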

Jing Zhao
added a comment - 01/Jul/14 18:58 If the corruption is caused by the same issue, can we move the following code into the writelock of FSNamesystem in deleteInternal?
dir.writeLock();
try {
  dir.removeFromInodeMap(removedINodes);
} finally {
  dir.writeUnlock();
}

Kihwal Lee
added a comment - 01/Jul/14 20:01 - edited can we move the following code into the writelock of FSNamesystem in deleteInternal?
That can degrade performance when deleting a large tree. I am also worried about a similar kind of race during rename and other operations.

Hadoop QA
added a comment - 01/Jul/14 21:06 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12653420/HDFS-6618.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7263//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7263//console
This message is automatically generated.

Kihwal Lee
added a comment - 01/Jul/14 21:46 I guess we can move it inside the first lock, since it is already holding the directory write lock. Not many types of ops will go through anyway. But if we remove them as we unlink inodes, instead of building up a potentially huge data structure and doing it all at once, it may be faster and cheaper.
Is there a clean way to remove each inode from the inode map from destroyAndCollectBlocks() of INodeFile and INodeDirectory?

Kihwal Lee
added a comment - 02/Jul/14 04:59 Here is a new patch that implements INodeRemover. It replaces the removedINodes list, and remove() is called instead of add(). There is no longer deferred removal of inodes from the inodeMap.
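A hedged sketch of the INodeRemover idea (only the INodeRemover name and its remove()-instead-of-add() behavior come from the comment above; everything else here is an illustrative stand-in, not the actual patch): instead of accumulating unlinked inodes in a removedINodes list and sweeping them out of the inodeMap afterwards, the tree walk is handed a remover that deletes each inode from the map the moment it is unlinked.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the inode map and the remover handed to the
// destroyAndCollectBlocks()-style tree walk.
class InodeMapSketch {
    interface INodeRemover {
        void remove(long inodeId);
    }

    private final Map<Long, String> inodeMap = new HashMap<>();  // id -> inode stand-in

    void add(long id, String inode) {
        inodeMap.put(id, inode);
    }

    int size() {
        return inodeMap.size();
    }

    // Callers invoke remove() inline as inodes are unlinked, so no deferred
    // list is ever built.
    INodeRemover remover() {
        return inodeMap::remove;
    }
}
```

The design trade-off debated below is that inline removal avoids building a large temporary list, but mutates the map before the whole delete is known to succeed.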

Hadoop QA
added a comment - 02/Jul/14 16:57 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12653582/HDFS-6618.inodeRemover.v2.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 2 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFS
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7272//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7272//console
This message is automatically generated.

One thing that I'm a bit afraid of here is that previously, if we threw an exception while building up the List<INode> removedINodes, we'd avoid modifying the INodeMap. But now, we may leave the INodeMap in an inconsistent state.

A lot of these functions you're passing the remover to are declared to throw checked exceptions. Like this one:

What happens if a QuotaExceededException is thrown here after we modified the INodeMap, but before we invalidated the blocks (which are still collected in collectedBlocks)? Seems like we'll be in an inconsistent state.

Maybe I'm missing something obvious, but can't we just move the dir.removeFromInodeMap line into the first writeLock block? Then we could keep the "add inodes to a list" approach (deferred approach) rather than directly mutating the inodeMap.

Kihwal Lee
added a comment - 03/Jul/14 21:34 - edited Colin Patrick McCabe, thanks for the review.
What happens if a QuotaExceededException is thrown here .....
This is indeed problematic, but it is also the case for the existing code and for what you are suggesting. If an exception is thrown in the middle of deleting, the partial delete is not undone. The inode at the top of the tree being deleted, and potentially more, will have already been unlinked; the rest will remain linked, but unreachable. If inodes are removed altogether at the end, none of the inodes will be removed from the inodeMap when an exception is thrown. This will cause inodes and blocks to leak. If we remove inodes as we go, at least some inodes will get removed in the same situation. Either way things will leak, but to a lesser degree in the latter case. I wouldn't say the latter is superior because of this difference; I am just saying it's no worse.
One of the key motivations for removing inodes inline was to avoid the overhead of building up a large data structure when deleting a large tree. Although it is now backed by ChunkedArrayList, there will be lots of reallocation and quite a bit of memory consumption. All or part of it may be promoted and remain in the heap until the next old-gen collection. This may be acceptable if we are doing deferred removal outside the lock. But since we are trying to do it inside both the FSNamesystem and FSDirectory locks, building the list is just a waste.
About leaking inodes and blocks:
Inodes were removed from the inodeMap but blocks weren't. This includes adding a block after the inode is deleted, due to the delete-addBlock race. Since the block is not removed from the blocksMap, but still holds a reference to its block collection (i.e. the inode), the block will look valid to the BlockManager. This causes a memory leak, which disappears when the namenode is restarted.
Unlinked/deleted inodes were not deleted from the inodeMap. The deleted inodes will remain in memory. If their blocks were also not removed from the blocksMap, the blocks will remain in memory too. If blocks were collected but not removed from the blocksMap, they will disappear after a restart. When saving the fsimage, the orphaned inodes will be saved in the inode section. The way the INodeDirectorySection is saved also causes all leaked (still linked) children and blocks to be saved, so when the fsimage is loaded, the leak is recreated in memory.
I am a bit depressed after writing this. Let's fix things one at a time...
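The first leak scenario above can be sketched in miniature (all names here are hypothetical stand-ins for the inodeMap/blocksMap structures, not the real HDFS classes): when the inode is removed from the inode map but its block is left in the blocks map, the block's back-reference keeps the deleted inode reachable, so it still "looks valid" to the block manager.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the two namenode maps and their cross-references.
class LeakSketch {
    static class Inode {
        final long id;
        Inode(long id) { this.id = id; }
    }

    static class Block {
        Inode blockCollection;  // back-reference that keeps the inode alive
        Block(Inode bc) { this.blockCollection = bc; }
    }

    final Map<Long, Inode> inodeMap = new HashMap<>();
    final Map<Long, Block> blocksMap = new HashMap<>();

    // Buggy delete path: removes the inode but never touches the blocksMap,
    // reproducing "inodes were removed from inodeMap but blocks weren't".
    void deleteInodeOnly(long inodeId) {
        inodeMap.remove(inodeId);
    }
}
```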

Colin Patrick McCabe
added a comment - 03/Jul/14 21:48 Thanks for that explanation. If we're just trying to make a minimal change to put out the fire without fixing the (existing) leak issue, why not just move the call to dir.removeFromInodeMap(removedINodes) up inside the first try... catch block? Then we can file a JIRA for some refactoring which fixes the leak issue (we'll need to discuss how)

Jing Zhao
added a comment - 07/Jul/14 18:56 If we're just trying to make a minimal change to put out the fire without fixing the (existing) leak issue, why not just move the call to dir.removeFromInodeMap(removedINodes) up inside the first try... catch block? Then we can file a JIRA for some refactoring which fixes the leak issue
I agree with Colin. Maybe we can first fix the bug here by moving the removeFromInodeMap call into the FSNamesystem lock, and fix the remaining issues in a separate jira(s).
Since the block is not removed from blocksMap, but the block still has reference to the block collection (i.e. inode), the block will look valid to BlockManager. This will cause memory leak, which will disappear when namenode is restarted.
The block deletion has to happen after the logSync call here, thus I guess it's hard to avoid this kind of leak (or we would have to change the blocksMap structure, etc.). Since the memory leak goes away after restarting, maybe we do not need to worry about this part too much right now and can focus on the leak in the inodeMap.

Kihwal Lee
added a comment - 09/Jul/14 18:21 With the new patch, removedINodes is passed to FSNamesystem#removePathAndBlocks() while in the write lock. The method was modified to conditionally acquire the directory lock. I didn't move the removal to FSDirectory, since we may want to do something with the inodes in FSNamesystem later as part of failure handling in a separate jira.
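A hedged sketch of the conditional-locking shape this comment describes (the real HDFS signatures differ; this only illustrates the pattern): a caller that already holds the directory write lock passes acquireLock=false, while other callers let the method take and release the lock itself.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the method that conditionally acquires the
// FSDirectory write lock around the inode-map removal.
class DirSketch {
    private final ReentrantReadWriteLock dirLock = new ReentrantReadWriteLock();
    int removedCount = 0;

    void removePathAndBlocks(boolean acquireLock) {
        if (acquireLock) {
            dirLock.writeLock().lock();
        }
        try {
            removedCount++;  // stand-in for removing inodes from the inode map
        } finally {
            if (acquireLock) {
                dirLock.writeLock().unlock();
            }
        }
    }
}
```

An alternative shape is to drop the boolean and have each call site wrap the call in its own lock()/unlock() block, keeping the lock discipline visible at the caller.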

Hadoop QA
added a comment - 09/Jul/14 18:34 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12654834/HDFS-6618.simpler.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
-1 javac. The patch appears to cause the build to fail.
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7306//console
This message is automatically generated.

Colin Patrick McCabe
added a comment - 09/Jul/14 20:01 +1 for the patch.
Just one small note... I'd prefer to see lock... unlock blocks around removePathAndBlocks when appropriate, rather than a boolean "lock me" passed in, but we can address that in the refactoring, I guess

Hadoop QA
added a comment - 09/Jul/14 22:43 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12654834/HDFS-6618.simpler.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-hdfs-project/hadoop-hdfs.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/7309//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/7309//console
This message is automatically generated.