Details

Fixes bugs in linux task controller and TaskRunner.setupWorkDir() related to handling of symlinks.

Description

With JVM reuse, TaskRunner.setupWorkDir() lists the contents of workDir and does a fs.delete on each path listed. If the listed file is a symlink to directory, it will delete the contents of those linked directories. This would delete files from distributed cache and jars directory,if mapred.create.symlink is true.
Changing ownership/permissions of symlinks through ENABLE_TASK_FOR_CLEANUP would change ownership/permissions of underlying files.

This is observed by Karam while running streaming jobs with DistributedCache and jvm reuse.

Ravi Gummadi
added a comment - 02/Feb/10 09:43 Attaching patch for trunk fixing the issues in TaskRunner.setupWorkDir() and in task-controller.c
This patch assumes that the api FileUtil.fullyDeleteContents() is available(that is part of HADOOP-6531 ).
Please review and provide your comments.

As discussed offline, we probably need a unit test coverage for the changes in TaskRunner.setupWorkDir as well. Can you please include that ? Also, we may want to add a public distributed cache file in the streaming test case and ensure its permissions do not change too.

Hemanth Yamijala
added a comment - 03/Feb/10 04:35 As discussed offline, we probably need a unit test coverage for the changes in TaskRunner.setupWorkDir as well. Can you please include that ? Also, we may want to add a public distributed cache file in the streaming test case and ensure its permissions do not change too.

checkPermissionsOnPrivateDistCache: Do we expect at least one of the directories to have the file. If yes, we should check for that; otherwise we will mask errors. Similarly for checkPermissionsOnPublicDistCache

checkPublicFilePermissions seems to be verifying permissions only for others. Should we also check permissions for owner and group ?

Comments in createSubDirsAndSymLinks should be dir1/subDir, dir1/file

Do we need DefaultTaskController.enableTaskForCleanup to give rwx access to group ? Setting the permissions for user should be sufficient, I think.

Hemanth Yamijala
added a comment - 06/Feb/10 09:55 A few minor nits:
Typo: contets in comments of ClusterWithLinuxTaskController.
checkPermissionsOnPrivateDistCache: Do we expect at least one of the directories to have the file. If yes, we should check for that; otherwise we will mask errors. Similarly for checkPermissionsOnPublicDistCache
checkPublicFilePermissions seems to be verifying permissions only for others. Should we also check permissions for owner and group ?
Comments in createSubDirsAndSymLinks should be dir1/subDir, dir1/file
Do we need DefaultTaskController.enableTaskForCleanup to give rwx access to group ? Setting the permissions for user should be sufficient, I think.

checkPermissionsOnPrivateDistCache: Do we expect at least one of the directories to have the file. If yes, we should check for that; otherwise we will mask errors. Similarly for checkPermissionsOnPublicDistCache

I have introduced new APIs in ClusterWithLinuxTaskController to check that a list of file names are present among the localized files. Essentially, I separated the permissions checking with the presence checks.

checkPublicFilePermissions seems to be verifying permissions only for others. Should we also check permissions for owner and group ?

Similar code existed in TestTrackerDistributedCacheManager. I refactored this code to a new API in the same class so that it can be called from outside and I am using it in checkPublicFilePermissions. I added checks for owner and group in TestTrackerDistributedCacheManager.

Hemanth Yamijala
added a comment - 02/Mar/10 19:02 Attached patch addresses my review comments.
checkPermissionsOnPrivateDistCache: Do we expect at least one of the directories to have the file. If yes, we should check for that; otherwise we will mask errors. Similarly for checkPermissionsOnPublicDistCache
I have introduced new APIs in ClusterWithLinuxTaskController to check that a list of file names are present among the localized files. Essentially, I separated the permissions checking with the presence checks.
checkPublicFilePermissions seems to be verifying permissions only for others. Should we also check permissions for owner and group ?
Similar code existed in TestTrackerDistributedCacheManager. I refactored this code to a new API in the same class so that it can be called from outside and I am using it in checkPublicFilePermissions. I added checks for owner and group in TestTrackerDistributedCacheManager.

In TestStreamingAsDifferentUser, group ownership is compared with primary group of the launcher. It would fail if task-controller binary is owned by secondary group of the launcher. It can be compared with ClusterWithLinuxTaskController.taskTrackerSpecialGroup directly, which introduced in MAPREDUCE-1421.

Amareshwari Sriramadasu
added a comment - 03/Mar/10 08:00 In TestStreamingAsDifferentUser, group ownership is compared with primary group of the launcher. It would fail if task-controller binary is owned by secondary group of the launcher. It can be compared with ClusterWithLinuxTaskController.taskTrackerSpecialGroup directly, which introduced in MAPREDUCE-1421 .

Would it be better if we check if the owner and group of public distributed cache files are correct also ?

Once we add that also to the checks of public distributed cache files, then ClusterWithLinuxTaskController.checkPermissionsOnDir() can be reused for these checks also and can avoid TestTrackerDistributedCacheManager.checkPublicFilePermissions() possibly.

Ravi Gummadi
added a comment - 03/Mar/10 08:12 Would it be better if we check if the owner and group of public distributed cache files are correct also ?
Once we add that also to the checks of public distributed cache files, then ClusterWithLinuxTaskController.checkPermissionsOnDir() can be reused for these checks also and can avoid TestTrackerDistributedCacheManager.checkPublicFilePermissions() possibly.

I am now using ClusterWithLinuxTaskController.taskTrackerSpecialGroup as the expected group for private distributed cache files.

Added ownership and group ownership checks for public distributed cache files. Group owner for public distributed cache is the primary owner of the tasktracker. I added a ClusterWithLinuxTaskController.taskTrackerPrimaryGroup on similar lines as ClusterWithLinuxTaskController.taskTrackerSpecialGroup.

However,

Once we add that also to the checks of public distributed cache files, then ClusterWithLinuxTaskController.checkPermissionsOnDir() can be reused for these checks also and can avoid TestTrackerDistributedCacheManager.checkPublicFilePermissions() possibly.

I have not done the above. This is because the checks for permissions of private distributed cache files includes exact match of all the permissions for owner, group and others. For public distributed cache files, the code only adds 'read' and 'execute' bits for all users. Specifically, it does not modify the 'write' bits. This means that the write permissions are indeterminate (for e.g. they could depend on permissions of files in an archive which are unarchived in distributed cache). Hence, instead of reusing the model for checking permissions, I have retained the original model for checking permissions of the public cache files.

Hemanth Yamijala
added a comment - 04/Mar/10 09:33 Patch incorporates review comments from Amarsri and Ravi. Changes are:
I am now using ClusterWithLinuxTaskController.taskTrackerSpecialGroup as the expected group for private distributed cache files.
Added ownership and group ownership checks for public distributed cache files. Group owner for public distributed cache is the primary owner of the tasktracker. I added a ClusterWithLinuxTaskController.taskTrackerPrimaryGroup on similar lines as ClusterWithLinuxTaskController.taskTrackerSpecialGroup.
However,
Once we add that also to the checks of public distributed cache files, then ClusterWithLinuxTaskController.checkPermissionsOnDir() can be reused for these checks also and can avoid TestTrackerDistributedCacheManager.checkPublicFilePermissions() possibly.
I have not done the above. This is because the checks for permissions of private distributed cache files includes exact match of all the permissions for owner, group and others. For public distributed cache files, the code only adds 'read' and 'execute' bits for all users. Specifically, it does not modify the 'write' bits. This means that the write permissions are indeterminate (for e.g. they could depend on permissions of files in an archive which are unarchived in distributed cache). Hence, instead of reusing the model for checking permissions, I have retained the original model for checking permissions of the public cache files.
I ran all task-controller tests on this, and they passed.

[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 18 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

Hemanth Yamijala
added a comment - 04/Mar/10 10:26 Output of test-patch:
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 18 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

This patch is over MR-1435-y20s.patch for earlier version of Hadoop. It improves test cases by adding a couple of more tests, synchronizing this with trunk's patch. I am running tests. This patch is not for commit here.

Hemanth Yamijala
added a comment - 08/Mar/10 10:21 This patch is over MR-1435-y20s.patch for earlier version of Hadoop. It improves test cases by adding a couple of more tests, synchronizing this with trunk's patch. I am running tests. This patch is not for commit here.