staging directory deletion fails because delegation tokens have been cancelled

Description

In a secure setup, the JobTracker needs the job's delegation tokens to delete the staging directory. MAPREDUCE-4850 made the staging directory deletion during job cleanup asynchronous so that it could be ordered with the system directory deletion. This introduced a race: a job's delegation tokens can be cancelled before the cleanup thread gets around to deleting the staging directory, causing the deletion to fail.
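
The ordering problem described above can be illustrated with a small, self-contained sketch. It does not use any Hadoop API: `FakeSecureFs`, `StagingRaceDemo`, the token string, and the path are all hypothetical stand-ins, and the only assumption modelled is that a delete is rejected once the token has been cancelled.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a secure file system: a delete needs a live token. */
class FakeSecureFs {
    private final Set<String> liveTokens = new HashSet<>();
    private final Set<String> paths = new HashSet<>();

    void issueToken(String token) { liveTokens.add(token); }
    void cancelToken(String token) { liveTokens.remove(token); }
    void create(String path) { paths.add(path); }
    boolean exists(String path) { return paths.contains(path); }

    /** Returns true if the path was deleted, false if the token was already cancelled. */
    boolean delete(String path, String token) {
        if (!liveTokens.contains(token)) {
            return false; // models the failed deletion from the description
        }
        return paths.remove(path);
    }
}

class StagingRaceDemo {
    /** Runs cancel and delete in the given order; reports whether the directory survives. */
    static boolean stagingDirLeftBehind(boolean cancelFirst) {
        FakeSecureFs fs = new FakeSecureFs();
        fs.issueToken("job_1-token");
        fs.create("/staging/job_1");
        if (cancelFirst) {
            // Token cancelled before the async cleanup thread runs.
            fs.cancelToken("job_1-token");
            fs.delete("/staging/job_1", "job_1-token");
        } else {
            // Deletion completes while the token is still valid.
            fs.delete("/staging/job_1", "job_1-token");
            fs.cancelToken("job_1-token");
        }
        return fs.exists("/staging/job_1");
    }

    public static void main(String[] args) {
        System.out.println("cancel first -> left behind: " + stagingDirLeftBehind(true));
        System.out.println("delete first -> left behind: " + stagingDirLeftBehind(false));
    }
}
```

In this model the staging directory is left behind only when the cancellation runs first, which is the case the issue reports.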

Sandy Ryza
added a comment - 17/Apr/13 02:16 Uploading a patch that cancels the delegation tokens asynchronously as well. This required modifying CleanupQueue to accept delegation tokens to cancel in addition to files to delete.
Both TestJobRecovery and TestJobCleanup pass.

Arun C Murthy
added a comment - 01/May/13 00:58 Sandy Ryza, wouldn't it be simpler to just pass the delegation token in to the PathCleanupContext and have it (optionally) cancel the token inline, i.e. after the delete?
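
A minimal sketch of the inline approach suggested here, under stated assumptions: `DeletionContextSketch`, `CleanupQueueSketch`, and `InlineCancelDemo` are hypothetical names, not the actual Hadoop `CleanupQueue` classes, and the delete and cancel steps are plain `Runnable`s that only record what they would do. The point it shows is that a single worker handling each context as "delete, then optionally cancel" keeps the cancellation behind the deletion.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical deletion context: deletes a path and then, if a cancel
 * action was supplied, cancels the job's token. Cancellation always
 * comes after the deletion.
 */
class DeletionContextSketch {
    private final Runnable deleteAction;
    private final Runnable cancelTokenAction; // null when there is nothing to cancel

    DeletionContextSketch(Runnable deleteAction, Runnable cancelTokenAction) {
        this.deleteAction = deleteAction;
        this.cancelTokenAction = cancelTokenAction;
    }

    void run() {
        deleteAction.run();
        if (cancelTokenAction != null) {
            cancelTokenAction.run();
        }
    }
}

/** Hypothetical async queue: one worker thread processes contexts in FIFO order. */
class CleanupQueueSketch {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    void add(DeletionContextSketch ctx) {
        worker.execute(ctx::run);
    }

    void shutdownAndWait() {
        worker.shutdown();
        try {
            worker.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for cleanup", e);
        }
    }
}

class InlineCancelDemo {
    /** Queues a staging and a system directory deletion; returns the recorded events. */
    static List<String> runOnce() {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        CleanupQueueSketch queue = new CleanupQueueSketch();
        queue.add(new DeletionContextSketch(
                () -> events.add("delete /staging/job_1"),
                () -> events.add("cancel token job_1")));
        queue.add(new DeletionContextSketch(
                () -> events.add("delete /system/job_1"),
                null)); // no token attached to this deletion
        queue.shutdownAndWait();
        return events;
    }

    public static void main(String[] args) {
        runOnce().forEach(System.out::println);
    }
}
```

Whether the token should still be cancelled when the delete itself fails is a separate design choice that this sketch leaves open.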

Sandy Ryza
added a comment - 01/May/13 01:21 I considered that, and my thought for the separate async call was that I didn't see the token cancellation as related to deletions in a general sense. But your approach seems reasonable to me too, and has the advantage of stirring up less code, so +1.
Nit: it might be good to add a comment to the new PathDeletionContext that explains what it's used for and makes it clear that the token cancellation comes after the deletion.