Details

Fixed a race condition involving JvmRunner.kill() and KillTaskAction, which was leading to an NullPointerException causing a transient inconsistent state in JvmManager and failure of tasks.

Description

In an environment where many jobs are killed simultaneously, NPEs are observed in the TT/JT logs when a task fails. The situation is aggravated when the taskcontroller.cfg is not configured properly. Below is the exception obtained:

INFO org.apache.hadoop.mapred.TaskInProgress: Error from <attempt_ID>:
java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:529)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.getDetails(JvmManager.java:329)
at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.reapJvm(JvmManager.java:315)
at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.access$000(JvmManager.java:146)
at org.apache.hadoop.mapred.JvmManager.launchJvm(JvmManager.java:109)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:502)

Activity

After looking at the TaskTracker logs, we found the problem is as follows:
One of the task attempts failed to launch jvm. Finally block of JvmRunner.runChild() calls kill(), which calls terminateTask() which also fails. Then it will sleep for configured duration (default, 5 seconds) and then calls killTask(). Then it removes the jvmid mapping from jvmIdToRunner map.
Meanwhile, there was a killTaskAction for the same attempt from TaskTracker. This call removes the jvmId mapping from jvmToRunningTask. Then, it sees that JvmRunner.kill() is already called and it goes ahead and releases slot.
As there are free slots, TaskTracker tries to launch a task and finds the JvmManager in inconsistent state, since the jvm is not yet removed from jvmIdToRunner map. When it tries to find the details through getDetails(), it gets NullPointerException since jvmToRunningTask does not have an entry for the same.

I think JvmRunner.kill() should not do a back call to JvmManager for removing jvmid mapping from jvmIdToRunner map. The removal should be done by the callers of kill(). i.e. killJvm(), stop() and reapJvm(). JvmRunner.runChild() already does from UpdateOnJvmExit(), in next method call after kill().
Thoughts?

Amareshwari Sriramadasu
added a comment - 22/Jan/10 11:16 After looking at the TaskTracker logs, we found the problem is as follows:
One of the task attempts failed to launch jvm. Finally block of JvmRunner.runChild() calls kill(), which calls terminateTask() which also fails. Then it will sleep for configured duration (default, 5 seconds) and then calls killTask(). Then it removes the jvmid mapping from jvmIdToRunner map.
Meanwhile, there was a killTaskAction for the same attempt from TaskTracker. This call removes the jvmId mapping from jvmToRunningTask. Then, it sees that JvmRunner.kill() is already called and it goes ahead and releases slot.
As there are free slots, TaskTracker tries to launch a task and finds the JvmManager in inconsistent state, since the jvm is not yet removed from jvmIdToRunner map. When it tries to find the details through getDetails(), it gets NullPointerException since jvmToRunningTask does not have an entry for the same.
I think JvmRunner.kill() should not do a back call to JvmManager for removing jvmid mapping from jvmIdToRunner map. The removal should be done by the callers of kill(). i.e. killJvm(), stop() and reapJvm(). JvmRunner.runChild() already does from UpdateOnJvmExit(), in next method call after kill().
Thoughts?

Amareshwari Sriramadasu
added a comment - 22/Jan/10 11:28 One of the task attempts failed to launch jvm. Finally block of JvmRunner.runChild() calls kill(), which calls terminateTask() which also fails.
The failures for launching jvm and terminateTask() were because of misconfigured taskcontroller.cfg.

The patch still has other problems. Even though with the attached patch JvmManager state becomes consistent, because even though JvmRuner.kill() is in progress, other threads trying to kill a JVM will return immediately and may cause other inconsistencies. Fundamentally, I think all threads trying to do a kill on the same JVM should block till the killing finishes.

Argh.. with this bug in view, I see so many design problems in JvmManager. Time for some refactoring!

Vinod Kumar Vavilapalli
added a comment - 13/Apr/10 12:09 The patch still has other problems. Even though with the attached patch JvmManager state becomes consistent, because even though JvmRuner.kill() is in progress, other threads trying to kill a JVM will return immediately and may cause other inconsistencies. Fundamentally, I think all threads trying to do a kill on the same JVM should block till the killing finishes.
Argh.. with this bug in view, I see so many design problems in JvmManager. Time for some refactoring!

Patch makes the method JvmRunner.kill() synchronized so that if kill is in-progress, the caller will wait till kill completes. Making the method JvmRunner.kill() synchronized is fine because all the external callers are already synchronized on JvmManager. Now patch makes it synchronized on JvmRunner also.

Amareshwari Sriramadasu
added a comment - 13/Apr/10 12:47 Patch makes the method JvmRunner.kill() synchronized so that if kill is in-progress, the caller will wait till kill completes. Making the method JvmRunner.kill() synchronized is fine because all the external callers are already synchronized on JvmManager. Now patch makes it synchronized on JvmRunner also.

Patch looks good. I myself reviewed the patch for ydist and the fact that the same patch applies cleanly on trunk too helps!

But like I pointed before the interactions of JvmManager with TaskRunner, TaskTracker and TaskController are really nasty and unmaintainable. Will open a JIRA for refactoring so as to align these interfaces better.

We can't wait for the refactor, I am going to commit this patch for now.

Vinod Kumar Vavilapalli
added a comment - 27/Apr/10 10:50 Patch looks good. I myself reviewed the patch for ydist and the fact that the same patch applies cleanly on trunk too helps!
But like I pointed before the interactions of JvmManager with TaskRunner, TaskTracker and TaskController are really nasty and unmaintainable. Will open a JIRA for refactoring so as to align these interfaces better.
We can't wait for the refactor, I am going to commit this patch for now.