Activity

Sreekanth Ramakrishnan
added a comment - 06/Mar/09 06:30 Currently, the LinuxTaskController binary does not kill the process group; it kills only the process id. As a result, for jobs which fork off child processes, the children won't be killed when the job/task is killed.
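A minimal sketch of the fix the comment calls for (names are hypothetical, not the actual task-controller code): signal the whole process group rather than a single pid. This assumes the task JVM was launched via setsid(), so its pid doubles as its process-group id and forked children stay in the group.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch: deliver a signal to the task's whole process
 * group instead of just the task pid. Assumes the task was started
 * with setsid(), making its pid the process-group id. */
int kill_task_group(pid_t task_pid, int sig) {
    /* A negative pid makes kill(2) signal every member of the
     * process group, so forked children are covered too. */
    return kill(-task_pid, sig);
}
```

With `kill(task_pid, sig)` only the leader dies and orphaned children linger; the negative-pid form is the one-line difference being requested.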

Sreekanth Ramakrishnan
added a comment - 06/Mar/09 06:36 Attaching a patch to this issue; added a new configuration variable to mimic mapred.tasktracker.tasks.sleeptime-before-sigkill. If the configuration is not present, it defaults to 5 seconds before sending SIGKILL.
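The terminate-then-kill sequence described here can be sketched as follows (a hedged illustration with made-up names, not the patch itself): SIGTERM first, wait out the grace period, then SIGKILL the group if anything is still alive.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>   /* used by the usage demo */
#include <unistd.h>

#define DEFAULT_SLEEP_BEFORE_SIGKILL 5  /* seconds, the patch's default */

/* Hypothetical sketch of the kill sequence: polite SIGTERM, a
 * configurable grace period, then a forced SIGKILL if the process
 * group is still alive. */
void finish_task_group(pid_t pgrp, unsigned grace_seconds) {
    if (grace_seconds == 0)
        grace_seconds = DEFAULT_SLEEP_BEFORE_SIGKILL;
    kill(-pgrp, SIGTERM);          /* ask the group to exit */
    sleep(grace_seconds);
    if (kill(-pgrp, 0) == 0)       /* signal 0 only probes liveness */
        kill(-pgrp, SIGKILL);      /* force out any stragglers */
}
```

A task that ignores SIGTERM still disappears after the grace period, since SIGKILL cannot be caught or ignored.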

Vinod Kumar Vavilapalli
added a comment - 09/Apr/09 07:15 Looked at the patch. I think we should set the default value for mapred.tasktracker.tasks.sleeptime-before-sigkill to be 5 seconds in the taskcontroller.cfg itself so that it serves as a source of documentation for the default value.
Another point, which may be the content of another JIRA: the issue, though not directly related to this patch, gets a bit amplified by it — namely that we have a separate configuration for the task-controller. mapred.tasktracker.tasks.sleeptime-before-sigkill is already provided via the mapred configuration, so admins will have two places to choose from depending on the task-controller in use. Further, in the current code, this parameter is job-configurable, and as part of HADOOP-5614 , Ravi is planning to have a job-configurable mapred.job.tasks.sleeptime-before-sigkill and a TT-level mapred.job.tasks.sleeptime-before-sigkill .

Vinod Kumar Vavilapalli
added a comment - 09/Apr/09 09:33 +1 for the patch.
One observation is that we are not specifically setting the LOG file in the TT via -l option and so all the log statements are directed to stdout which are duly ignored. Unless we plan to have the log statements redirected somewhere, I propose we remove the log statements and the LOG file altogether. We already have error codes to look at to diagnose any problems.

Hemanth Yamijala
added a comment - 12/Apr/09 08:49 Further, in the current code, this parameter is job configurable ...
Sigh. I didn't notice this. In that case, I am not comfortable with this patch. Configuring a value at two places is inconvenient, but does not violate correctness. However, a per job value being overridden by a cluster wide configuration just in the case of LinuxTaskController does not seem right. Further, with the kind of plans mentioned in the comment above (though I saw nothing in HADOOP-5614 - having the thoughts recorded there will help), I think we should relook at how this part is designed.
For the short term, it is probably still OK to not have code corresponding to the SIGKILL code at all in the task-controller. Then the issue of how to handle SIGKILL can be addressed as a follow-up, preferably after HADOOP-4490 is committed.

Sreekanth Ramakrishnan
added a comment - 13/Apr/09 05:24 Attaching new patch.
Removed the configuration variable from the configuration file.
Modified binary to accept a new argument which is sleep time before it issues SIGKILL.
The argument has to be passed via LinuxTaskController in HADOOP-4490 ; the corresponding modification is made in that JIRA. When we go for a job-specific kill time, we can make the appropriate modification in the Java implementation, thereby following the implementation that the default task controller provides.

Vinod Kumar Vavilapalli
added a comment - 08/May/09 06:35 Looked at the patch. Some comments:
The patch needs sync-up with the latest patch on HADOOP-5771
Overall, I think the terms terminate and kill are better than the current usage of kill and destroy.
TaskController.java
killTask, destroyTask can better be terminateTask, killTask. And accordingly javadoc should be more explicit.
killTaskJVM: The null check of context.task belongs to LinuxTaskController; it needs to be moved there.
DefaultTaskController.java
For WINDOWS process.destroy() is again not needed in destroyTask()
LinuxTaskController.java
We completely bring down the TT if we fail to write the pidFile to any one of the disks. This should change. We should fail only if we cannot write to any of the disks.
javadoc of setup() is incomplete. It should also quote the additional steps of writing TT pid-file.
Log message at +384 can be better: LOG.info("Written :: " + pid + " to file " + ttPidFile);
TT's pidfile should not be made world writable (changeDirectoryPermissions does this). In fact, it can be set 000 permissions right-away. This would still work as we run as root to read it.
killHelper can better be something like finishTask()
What should we do when the commands for SIGTERM/SIGKILL fail?
buildTaskControllerExecutor() : For taskControllerCmd, you can instead use an Array
Why the usage of command.ordinal() in this method?
I think it would be clearer to have a separate buildKillTaskArgs(). It would definitely help to also 'javadoc' the task-controller's command-line syntax for buildLaunchTaskArgs and this new buildKillTaskArgs(). With this you can completely remove the killHelper() method.
ProcessTree.java
killProcessGroup() can be renamed to terminateProcessGroup(), killProcess() to terminateProcess(), sigKillProcessGroup to killProcessGroup() and sigKillProcess to killProcess().
Also, I think, with these semantics, we need a ProcessTree.isProcessGroupAlive(pgrpid) along with the current ProcessTree.isAlive(pid)
main.c
The argc length check should be command specific.
job_pid var can better be task_pid
You have a printf statement in the case DESTROY_TASK_JVM. This can be removed or redirected to log if needed.
Return values of verify_parent like UNABLE_TO_READ_TT_PID_FILE, OUT_OF_MEMORY are lost, as we are only returning INVALID_PROCESS_LAUNCHING_TASKCONTROLLER in error scenarios. Can't we do something here?
taskcontroller.c
Documentation of kill_user_task() should be fixed. Currently, it only refers to SIGTERM.
The get_pid_path() function can be removed. Also references to it from code documentation of run_task_as_user, kill_user_task can be removed too along with the TT_SYS_DIR var from taskcontroller.h
+343 "//Don't continue if the process is not alive anymore." should instead be " // Don't continue if the process-group is not alive anymore."
TestKillSubProcessesWithTaskController
Rename to TestKillSubProcessesWithLinuxTaskController
Remove unused imports
Add Apache license header and add javadoc similar to other tests with LinuxTaskController
Just like in HADOOP-5771 , we may want to verify job-ownership in this test too.
The test ran successfully, but it is running the jobs as the same user as the user running the cluster. Need to fix this.
Also, I see a lot of messages like this : "WARN mapred.LinuxTaskController (LinuxTaskController.java:destroyTask(454)) - Exception thrown while sending destroy to the Task VM org.apache.hadoop.util.Shell$ExitCodeException:". Need to debug the cause for these.
We also need a test to verify cleanup of tasks that ignore SIGTERM.

Sreekanth Ramakrishnan
added a comment - 11/May/09 07:12 Attaching the latest patch incorporating Vinod's comments:
Changed TestKillSubProcesses to use both LocalFileSystem and HDFSFileSystem
Added a new method TestKillSubProcesses.isAlive(pid) instead of using ProcessTree.isAlive(pid), because in the case of TestKillSubProcessesWithLinuxTaskController it would always return false, since the tasks are run as a different user.
Changed all the child processes so that they now ignore SIGTERM and exit only when SIGKILL is sent.
Patch applies to trunk after HADOOP-5771 patch is applied.

Vinod Kumar Vavilapalli
added a comment - 13/May/09 10:14 Patch is coming out good. Some more comments, mostly minor:
main.c
We should have a separate exit-code for the case when pid!=ppid in verify_parent() function.
I don't see the per-command checks for the number of arguments. atoi(argv[optind++]) will segfault if we go beyond arg-length. Correspondingly, the null checks and pid<=0 can be done away with in task-controller.c once we have per-command checks for the number of arguments.
The following code-comment has to be removed. And the check should be only for 2 args : user and ttroot
// when we support additional commands without ttroot, this check
// may become command specific.
if (argc < 4) {
    display_usage(stderr);
    return INVALID_ARGUMENT_NUMBER;
}
taskcontroller.c
javadoc for kill_user_task should be "Function used to terminate/kill a task launched by the user."
taskcontroller.h
Consistent with the terminology elsewhere, KILL_TASK and DESTROY_TASK_JVM should instead be TERMINATE_TASK and KILL_TASK
ProcessTree.java
killProcessGroup(String pid, boolean isProcessGroup) isn't used anywhere, can be removed.
Can the terminate*, kill*, destroy* and sigKill* methods and the is*Alive methods be grouped together?
Need to fix the javadoc of terminateProcess, killProcess, terminateProcessGroup, killProcessGroup, isAlive() and isProcessGroupAlive()
sigKillInCurrentThread: The following code is misplaced, it should be inside the "if (isProcessGroup || ProcessTree.isAlive(pid))" check (+68).
if (isProcessGroup) {
    killProcessGroup(pid);
} else {
    killProcess(pid);
}
TaskController.java
Rename killTaskJvm to cleanupTaskJvm ??
Javadoc of this method should instead read "Sends a kill signal to forcefully kill the task JVM and all its sub-processes"
DefaultTaskController.java
killTask(): put a comment for the WINDOWS case
killTask(): "else" not needed
LinuxTaskController.java
Consistent with the terminology elsewhere, KILL_TASK_JVM and DESTROY_TASK_JVM should be TERMINATE_TASK_JVM and KILL_TASK_JVM
TT is not brought down when it fails to write pid on all the configured local directories.
Log statement should be added so that we know on which directories TT could write the pid file.
When pid files are not created on some directory because of a transient error, tasks launched on that directory will not run. Should we read the pid from other disks?
TestKillSubProcessesWithLinuxTaskController
Remove blank lines
TestKillSubProcesses:
In runJob, why is the check needed for the inDir and outDir?
In isAlive(), if we get an IOException while running the ps command, we need to re-throw this exception instead of just logging it.
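The per-command argc check asked for above could look like the sketch below (command names and argument counts are illustrative, not the real task-controller's): validate argc for the specific command before touching argv, so a missing argument never leads to indexing past argc.

```c
#include <assert.h>

/* Hypothetical sketch of a per-command argument-count check, so argv
 * is never indexed past argc (atoi on a missing argument would
 * dereference the NULL terminator of argv). Counts are illustrative. */
enum tc_command { LAUNCH_TASK_JVM, TERMINATE_TASK_JVM, KILL_TASK_JVM };

static int expected_argc(enum tc_command cmd) {
    switch (cmd) {
    case LAUNCH_TASK_JVM:
        return 7;  /* binary, user, command, tt-root, job-id, task-id, attempt */
    case TERMINATE_TASK_JVM:
    case KILL_TASK_JVM:
        return 4;  /* binary, user, command, task-pid */
    default:
        return -1; /* unknown command */
    }
}

int check_arg_count(enum tc_command cmd, int argc) {
    int want = expected_argc(cmd);
    return want > 0 && argc == want;
}
```

With this in place, the redundant null checks and the pid<=0 guard downstream become unnecessary, as the review suggests.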

Sreekanth Ramakrishnan
added a comment - 14/May/09 06:09 Attaching file incorporating Vinod's comments:
In case of transient failures while writing the TT pid, we store the failed disk; the next time a task is launched/killed with that tt root, we try to write the task-tracker pid to the same disk. If we fail again, we throw an IOException since the disk is not writable for the task-tracker; otherwise we remove it from the global failed list.
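The failed-disk bookkeeping described here can be sketched as follows (a hedged illustration with invented names and a fixed-size table; the real patch does this on the Java side):

```c
#include <assert.h>
#include <string.h>

#define MAX_DISKS 16

/* Hypothetical sketch: remember disks where writing the TT pid
 * failed. A second consecutive failure on the same disk is fatal;
 * a success clears the entry. */
static char failed_disks[MAX_DISKS][256];
static int num_failed = 0;

static int find_failed(const char *disk) {
    for (int i = 0; i < num_failed; i++)
        if (strcmp(failed_disks[i], disk) == 0)
            return i;
    return -1;
}

/* Returns 0 on success, 1 when a first failure is recorded, and -1
 * when the same disk fails twice in a row (caller raises an error). */
int record_pid_write(const char *disk, int write_succeeded) {
    int idx = find_failed(disk);
    if (write_succeeded) {
        if (idx >= 0) {             /* transient error has cleared */
            memmove(failed_disks[idx], failed_disks[num_failed - 1],
                    sizeof failed_disks[0]);
            num_failed--;
        }
        return 0;
    }
    if (idx >= 0)
        return -1;                  /* second failure: disk unusable */
    if (num_failed < MAX_DISKS) {
        strncpy(failed_disks[num_failed], disk, 255);
        failed_disks[num_failed][255] = '\0';
        num_failed++;
    }
    return 1;                       /* first failure: remember it */
}
```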

Vinod Kumar Vavilapalli
added a comment - 19/May/09 10:10
I think TaskController.terminateTask() and TaskController.killTask() are the better names than the ones the latest patch introduced. We are actually cleaning up the task process-trees and just not the JVMs.
Instead of the latest approach w.r.t encountering disk with no pid written, it would be better for the LinuxTaskController binary to look in all the disks configured via mapred.local.dir. It would be simpler to understand this way.

Sreekanth Ramakrishnan
added a comment - 19/May/09 10:38 Attaching patch modifying task controller based on the offline discussion with Hemanth and Vinod.
Changed the verify_parent method in task-controller.c to not take tt_root but rather find it from the configured mapred.local.dir.
Added new function in configuration.c to return an array of values of a configuration key.
Added log statements in LinuxTaskController to log the exception which can occur during launch and kill of task.
Cleaned up exit codes in task-controller.h.
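The mapred.local.dir scan described above can be sketched like this (function and parameter names are made up for illustration, not the task-controller's actual API): probe each configured local dir in order and use the first one that actually holds the pid file.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: instead of taking a single tt_root, look for
 * the pid file under every configured local dir, returning the first
 * path that exists (malloc'd; caller frees), or NULL. */
char *find_pid_path(const char **local_dirs, const char *rel_pid_path) {
    for (int i = 0; local_dirs != NULL && local_dirs[i] != NULL; i++) {
        size_t len = strlen(local_dirs[i]) + strlen(rel_pid_path) + 2;
        char *candidate = malloc(len);
        if (candidate == NULL)
            return NULL;
        snprintf(candidate, len, "%s/%s", local_dirs[i], rel_pid_path);
        FILE *f = fopen(candidate, "r");
        if (f != NULL) {            /* first disk holding the file wins */
            fclose(f);
            return candidate;
        }
        free(candidate);
    }
    return NULL;                    /* pid file found on none of the disks */
}
```

This is the "simpler to understand" shape suggested in the preceding review comment: no per-disk failure state, just a scan at kill time.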

Sreekanth Ramakrishnan
added a comment - 26/May/09 06:01 Attaching a new patch with a few changes to the code based on internal testing results:
It was found that users could execute arbitrary scripts using the task-controller binary by directly invoking the binary in their tasks and using relative paths to point to those scripts. So the following checks are added: all constructed paths in the binary are resolved and checked to verify that the resolved path and the absolute path are one and the same. A second check was added to verify that taskjvm.sh is actually owned by the task-tracker; if the file is not owned by the task-tracker, the binary will not execute the script.
Changed documentation to note that the binary's group ownership should be that of the task-tracker, and that cluster users should not be part of this group. The suggested permission of the task-controller has been changed to 4510.
Removed pid writing by task tracker.

Vinod Kumar Vavilapalli
added a comment - 04/Jun/09 07:33 Looked at the patch. Some comments, hopefully the last lap..
task-controller.h:
Plz knock off the un-used error codes
OUT_OF_MEMORY, //5
INVALID_TT_PID, //11
task-controller.c
In check_tt_root(), the line "const char ** mapred_local_dir = get_values(TT_SYS_DIR_KEY);" has to be moved after NULL check for tt_root, because otherwise there is a memory leak.
Plz add a comment as to why we do check_tt_root before changing user in run_task_as_user
Fix comments of kill_user_task. No need of the mention for tt_root here any longer.
TaskController.java
(Don't hit me!) Rename TaskController.cleanupTaskJvm() to destroyTaskJvm(). This is consistent with terminology in ProcessTree.java

Sreekanth Ramakrishnan
added a comment - 12/Jun/09 05:08 Result of the ant test-patch is as follows:
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 6 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 Eclipse classpath. The patch retains Eclipse classpath integrity.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
Running ant test.

Hemanth Yamijala
added a comment - 15/Jul/09 17:51 This issue is marked as Fixed. Open a new JIRA with a description about the issue and attach the patch.
Rajiv, the reason why there is no new jira is because this is a bug that occurs only on the Yahoo! distribution and not on trunk.
To give an explanation, the patch HADOOP-5420-v20.patch introduced some code to make sure the pid files written by the task-controller were owned by the tasktracker process, as a security check. Inadvertently, in this patch, we removed some code that changed the ownership of the pid file (which was written as root) to be owned by the TT user. As a result, pid files were created as root, but the new check introduced in the patch failed during kill because it found the PID files were not owned by the TT user and hence treated them as suspect. Hence tasks failed to be killed, causing runaway processes on the cluster.
The attached patch re-introduces the code that changes ownership of the pid file to the TT user so that during killing the security check would pass and processes would be killed.
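The re-introduced fix amounts to a single ownership hand-off, sketched below with illustrative names (not the actual patch code): after root writes the pid file, chown it to the tasktracker user so the kill-time ownership check passes.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sketch of the fix: hand ownership of the root-written
 * pid file back to the tasktracker user, so the security check that
 * runs at kill time accepts the file. Returns 0 on success. */
int chown_pid_file_to_tt(const char *pid_path, uid_t tt_uid, gid_t tt_gid) {
    return chown(pid_path, tt_uid, tt_gid);
}
```

Without this call the file stays owned by root, the ownership check treats it as suspect, and the SIGKILL path is never taken — exactly the runaway-process symptom described above.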

Sreekanth Ramakrishnan
added a comment - 11/Sep/09 06:11 There is a double-free exception thrown randomly when mapred_local_dir is freed while it is null. Fixing this issue by keeping a copy of the pointer and using it for freeing.