Details

Description

Currently for snapshot files, fsck -list-corruptfileblocks shows corrupt blocks with the original file dir instead of the snapshot dir, and fsck -list-corruptfileblocks -includeSnapshots behave the same.
This can be confusing because even when the original file is deleted, fsck will still show that deleted file as corrupted, although what's actually corrupted is the snapshot.

As a side note, fsck -files -includeSnapshots shows the snapshot dirs.

Activity

fsck from command line with -includeSnapshots will also show full dir of snapshots

fsck from command line without -includeSnapshots behavior unchanged

NameNode WebUI always show full dir of snapshots

Some refactoring to reuse getSnapshottableDirs and ListCorruptFileBlocksWithSnapshot

Getting all possible snapshots is not so efficient, but considering fsck should not be performed frequently and nothing is added into listCorruptFileBlocks where fslock is used, the impact should be minimal.

Xiao Chen
added a comment - 14/Oct/15 06:01 Attached patch 001. Description below:
fsck from command line with -includeSnapshots will also show full dir of snapshots
fsck from command line without -includeSnapshots behavior unchanged
NameNode WebUI always show full dir of snapshots
Some refactoring to reuse getSnapshottableDirs and ListCorruptFileBlocksWithSnapshot
Getting all possible snapshots is not so efficient, but considering fsck should not be performed frequently and nothing is added into listCorruptFileBlocks where fslock is used, the impact should be minimal.

Xiao Chen
added a comment - 14/Oct/15 23:33 The pre-patch Findbugs warning is unrelated. (see HDFS-9242 )
The checkstyle issues are not introduced by this patch.
The test failures are unrelated, and passed locally.

Xiao Chen
added a comment - 21/Oct/15 17:43 patch 004 is attached. I think I should re-summarize it below:
fsck from command line with -includeSnapshots will also show full dir of snapshots
fsck from command line without -includeSnapshots behavior unchanged
NameNode WebUI's way of showing corrupted files/blocks unchanged.
Added a sentence in NN WebUI to hint the admin to run fsck with -includeSnapshots if there're snapshots present in the system.
Some refactoring to reuse existing code in new methods getSnapshottableDirs and ListCorruptFileBlocksWithSnapshot
The reasoning of keep minimal change to NN WebUI and fsck without -includeSnapshots is that getting all possible snapshots may be slow, since it's user configured.

The test failures look unrelated and passed locally.
Findbugs seems to be having problem.

Exception in thread "main" java.io.FileNotFoundException: /home/jenkins/jenkins-slave/workspace/PreCommit-HDFS-Build/patchprocess/trunkFindbugsWarningshadoop-hdfs.xml (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at edu.umd.cs.findbugs.SortedBugCollection.progessMonitoredInputStream(SortedBugCollection.java:1231)
at edu.umd.cs.findbugs.SortedBugCollection.readXML(SortedBugCollection.java:308)
at edu.umd.cs.findbugs.SortedBugCollection.readXML(SortedBugCollection.java:295)
at edu.umd.cs.findbugs.workflow.Filter.main(Filter.java:712)
Pre-patch trunk findbugs is broken?

Thanks for reporting the issue and the patch. Good to have discussed in person. The patch looks good in general. I have some comments here:

1. The description is not quite accurate per our discussion, suggest to modify. Especially the patch actually does change (and fix) the behavior when without -includeSnapshots.
2. A possible optimization in FSDirSnapshotOp#getSnapshotFiles. It seems that the sf variable could be calculated in caller for once before the loop in the caller, and pass to this method.
3. final INodesInPath iip = fsd.getINodesInPath4Write(snap, false); maybe substituted with call to getINodesInPath
4. The check if (!corruptFileBlocks.isEmpty()) in listCorruptFileBlocksWithSnapshot is not needed
5. Add comment in listCorruptFileBlocks() before the call namenode.getNamesystem().listCorruptFileBlocksWithSnapshot, to indicate that snapshottableDirs is only relevant when -includeSnapshots is specified.
6. In listCorruptFileBlocksWithSnapshot, we can add

Yongjun Zhang
added a comment - 22/Oct/15 22:46 Hi Xiao Chen ,
Thanks for reporting the issue and the patch. Good to have discussed in person. The patch looks good in general. I have some comments here:
1. The description is not quite accurate per our discussion, suggest to modify. Especially the patch actually does change (and fix) the behavior when without -includeSnapshots.
2. A possible optimization in FSDirSnapshotOp#getSnapshotFiles. It seems that the sf variable could be calculated in caller for once before the loop in the caller, and pass to this method.
3. final INodesInPath iip = fsd.getINodesInPath4Write(snap, false); maybe substituted with call to getINodesInPath
4. The check if (!corruptFileBlocks.isEmpty()) in listCorruptFileBlocksWithSnapshot is not needed
5. Add comment in listCorruptFileBlocks() before the call namenode.getNamesystem().listCorruptFileBlocksWithSnapshot , to indicate that snapshottableDirs is only relevant when -includeSnapshots is specified.
6. In listCorruptFileBlocksWithSnapshot , we can add
if (snapshottableDirs == null ) {
continue ;
}
to avoid the call to getSnapshotFiles .
Thanks.

Good catch! I updated the code to call getINode which invokes getINodesInPath.

4. The check if (!corruptFileBlocks.isEmpty()) in listCorruptFileBlocksWithSnapshot is not needed

Good call. Fixed.

5. Add comment in listCorruptFileBlocks() before the call namenode.getNamesystem().listCorruptFileBlocksWithSnapshot, to indicate that snapshottableDirs is only relevant when -includeSnapshots is specified.

Added a link to FSNamesystem#listCorruptFileBlocksWithSnapshot which explains that parameter in javadoc.

6. In listCorruptFileBlocksWithSnapshot, we can add

if (snapshottableDirs == null) {
continue;
}

to avoid the call to getSnapshotFiles.

I'm not sure this is necessary. On one hand, it definitely saves 1 call stack. On the other hand, with the existence of all those loops and checks, I think the performance gain of saving 1 call stack would be trivial. And the nullity check of snapshottableDirs is already performed as a first step in getSnapshotFiles.

Xiao Chen
added a comment - 23/Oct/15 17:56 Thanks a lot for the review Yongjun Zhang !
1. The description is not quite accurate per our discussion, suggest to modify. Especially the patch actually does change (and fix) the behavior when without -includeSnapshots.
It was great to talk to you. I have updated the description. Modified patch summary in the end of this comment.
2. A possible optimization in FSDirSnapshotOp#getSnapshotFiles. It seems that the sf variable could be calculated in caller for once before the loop in the caller, and pass to this method.
My apologies for the confusion, I added some comments in this method. But getting sf for each snapshottable dir is needed, since /d1 and /d2 have different snapshotlist.
3. final INodesInPath iip = fsd.getINodesInPath4Write(snap, false); maybe substituted with call to getINodesInPath
Good catch! I updated the code to call getINode which invokes getINodesInPath .
4. The check if (!corruptFileBlocks.isEmpty()) in listCorruptFileBlocksWithSnapshot is not needed
Good call. Fixed.
5. Add comment in listCorruptFileBlocks() before the call namenode.getNamesystem().listCorruptFileBlocksWithSnapshot, to indicate that snapshottableDirs is only relevant when -includeSnapshots is specified.
Added a link to FSNamesystem#listCorruptFileBlocksWithSnapshot which explains that parameter in javadoc.
6. In listCorruptFileBlocksWithSnapshot, we can add
if (snapshottableDirs == null ) {
continue ;
}
to avoid the call to getSnapshotFiles.
I'm not sure this is necessary. On one hand, it definitely saves 1 call stack. On the other hand, with the existence of all those loops and checks, I think the performance gain of saving 1 call stack would be trivial. And the nullity check of snapshottableDirs is already performed as a first step in getSnapshotFiles .
Attached patch 005 with the above modifications. Updated summary below:
fsck -list-corruptfileblocks -includeSnapshots will also show full dir of snapshots
fsck -list-corruptfileblocks without -includeSnapshots will not show corrupt blocks that only have snapshot files
NameNode WebUI's way of showing corrupted files/blocks unchanged.
Added a sentence in NN WebUI to hint the admin to run fsck with -includeSnapshots, if there're snapshots present in the system.
Some refactoring to reuse existing code in new methods getSnapshottableDirs and ListCorruptFileBlocksWithSnapshot
The reasoning of keep minimal change to NN WebUI and fsck without -includeSnapshots is that getting all possible snapshots may be slow.

My apologies for the confusion, I added some comments in this method. But getting sf for each snapshottable dir is needed, since /d1 and /d2 have different snapshotlist.

I meant to precalcuate a list of "sf"s, and pass the list to FSDirSnapshotOp#getSnapshotFiles so to avoid doing it once for each file. Sorry for not having been clear.

2.

Added a link to FSNamesystem#listCorruptFileBlocksWithSnapshot which explains that parameter in javadoc.

I think it's ok to not have this link since it looks a bit redundant. So consider I took back my earlier comment since you have javadoc in FSNamesystem#listCorruptFileBlocksWithSnapshot. sorry about that.

3.

Good catch! I updated the code to call getINode which invokes getINodesInPath.

Good change to use getINode.

4. In {{testFsckListCorruptSnapshotFiles() }}
Suggest to introduce two variables, numFiles, numSnapshotFiles, initially numFile = 3 and numSnapshotFiles = 0, then when you create /corruptData/file, increment numFiles by 1, when you create snapshot, set numSnapshotFiles to 4. when deleting the file, decrement numFile by 1. When gets to assertion part, the first one check against (numFiles + numSnapshotFiles) (in place of the "7"), the second one check against numFiles.

Yongjun Zhang
added a comment - 23/Oct/15 19:22 Hi Xiao Chen ,
Thanks for the new rev. Some further comments:
1.
My apologies for the confusion, I added some comments in this method. But getting sf for each snapshottable dir is needed, since /d1 and /d2 have different snapshotlist.
I meant to precalcuate a list of "sf"s, and pass the list to FSDirSnapshotOp#getSnapshotFiles so to avoid doing it once for each file. Sorry for not having been clear.
2.
Added a link to FSNamesystem#listCorruptFileBlocksWithSnapshot which explains that parameter in javadoc.
I think it's ok to not have this link since it looks a bit redundant. So consider I took back my earlier comment since you have javadoc in FSNamesystem#listCorruptFileBlocksWithSnapshot. sorry about that.
3.
Good catch! I updated the code to call getINode which invokes getINodesInPath.
Good change to use getINode .
4. In {{testFsckListCorruptSnapshotFiles() }}
Suggest to introduce two variables, numFiles, numSnapshotFiles, initially numFile = 3 and numSnapshotFiles = 0, then when you create /corruptData/file, increment numFiles by 1, when you create snapshot, set numSnapshotFiles to 4. when deleting the file, decrement numFile by 1. When gets to assertion part, the first one check against (numFiles + numSnapshotFiles) (in place of the "7"), the second one check against numFiles.
Thanks.

Per the offline talk, we can make FSDirSnapshotOp#getSnapshotFiles more elegant by adding an interface on Snapshot to get the full snapshot root path, hence eliminating the need of pass in snapshottableDir. Attached patch 07 to do this. Thanks Yongjun.

Xiao Chen
added a comment - 23/Oct/15 21:57 Per the offline talk, we can make FSDirSnapshotOp#getSnapshotFiles more elegant by adding an interface on Snapshot to get the full snapshot root path, hence eliminating the need of pass in snapshottableDir. Attached patch 07 to do this. Thanks Yongjun.

Mingliang Liu
added a comment - 04/Nov/15 22:09 I found a findbugs warning in FSDirSnapshotOp probably brought by this patch, tracked by HDFS-9377 . Please close that jira if it's wrong, or review that otherwise. Thanks.

Mingliang Liu
added a comment - 04/Nov/15 23:05 Thanks for your quick reply, Yongjun Zhang and Xiao Chen .
Actually the Jenkins test-patch report is not showing this findbugs warning and it's understood to miss it. I also hope the HADOOP-12517 be resolved soon, if there is any fundamental issue.