Description

After doing a lot of fault testing, I noticed that the JNs had left behind a number of temporary files in their journal directories that were no longer within the retention period. For example, if a JN crashes in the middle of recovery, it can leave behind a file like edits_inprogress_123.epoch=10. These files are handy to keep for forensics/debugging while they are still within the retention period, but we should not keep them forever. The normal purging policy should apply to them as well.
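To illustrate the intended behavior, here is a minimal sketch of purging such leftover files. It is not the attached patch: the class name StaleTempSegmentPurger, the method purgeOlderThan, and the file-name pattern are hypothetical, inferred only from the example file name above. It also assumes the number after edits_inprogress_ is the segment's first transaction ID and that retention is expressed as a minimum transaction ID to keep.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StaleTempSegmentPurger {
    // Hypothetical pattern for names such as edits_inprogress_123.epoch=10.
    // Group 1 is assumed to be the first txid of the segment, group 2 the epoch.
    private static final Pattern TEMP_SEGMENT =
        Pattern.compile("edits_inprogress_(\\d+)\\.epoch=(\\d+)");

    /**
     * Deletes leftover temporary segments whose first txid is below
     * minTxIdToKeep and returns the paths that were removed.
     */
    public static List<Path> purgeOlderThan(Path journalDir, long minTxIdToKeep)
            throws IOException {
        // Collect candidates first so nothing is deleted while the
        // directory is still being iterated.
        List<Path> stale = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(journalDir)) {
            for (Path entry : entries) {
                Matcher m = TEMP_SEGMENT.matcher(entry.getFileName().toString());
                if (m.matches() && Long.parseLong(m.group(1)) < minTxIdToKeep) {
                    stale.add(entry);
                }
            }
        }
        for (Path p : stale) {
            Files.delete(p);
        }
        return stale;
    }
}
```

Files that do not match the pattern, or whose first txid is at or above the threshold, are left untouched, so anything still inside the retention window remains available for debugging.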

Todd Lipcon added a comment - 19/Sep/12 04:35
Attached patch fixes the issue.
Testing:
I added some new files to the existing purging test.
I fixed a bug whereby the random fault test wasn't actually purging the files: since it called purgeLogsOlderThan before recoverUnclosedSegments, the purge request was being rejected. Now it properly purges them, and I verified the purging behavior by running watch 'find ./build/test/data/dfs/journalnode-2 | sort' during the test run.
I ran 5000 instances of the random fault test and it passed with no AssertionErrors.
This applies on top of HDFS-3950 and HDFS-3955.