Details

Description

We saw an instance where the RM stopped launching Application Masters. We found that the launcher thread was hung because something weird/bad happened to the NM node. Currently there is only 1 launcher thread (jira 4061 tracks fixing that). We need this not to happen: even once we increase the number of threads to > 1, if that many nodes go bad the RM would still be stuck. Note that the RM was stuck like this for approximately 9 hours.

locked <0x00002aab05a4f3f0> (a org.apache.hadoop.ipc.Client$Call)
at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:135)
at $Proxy76.startContainer(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagerPBClientImpl.startContainer(ContainerManagerPBClientImpl.java:87)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:118)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:265)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
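The stack above shows the launcher blocked forever inside a remote startContainer call that has no timeout. As a rough standalone illustration (not the actual Hadoop RPC code), a blocking socket read with no timeout hangs indefinitely when the peer goes silent, while a socket-level timeout turns the hang into a catchable exception the caller can recover from:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class TimeoutDemo {
    public static void main(String[] args) throws IOException {
        // Local server that accepts a connection but never responds,
        // standing in for an NM whose startContainer handler has hung.
        try (ServerSocket server = new ServerSocket(0);
             Socket client = new Socket("localhost", server.getLocalPort());
             Socket serverSide = server.accept()) {
            client.setSoTimeout(1000); // without this, read() below blocks forever
            try {
                client.getInputStream().read();
                System.out.println("unexpected response");
            } catch (SocketTimeoutException e) {
                System.out.println("read timed out; caller can fail the launch and retry");
            }
        }
    }
}
```

This is the essence of what an rpcTimeout on the proxy buys: the launcher thread gets control back instead of parking in the stack above for hours.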

Activity


Thomas Graves
added a comment - 29/Mar/12 20:21
This seems to be the same issue that was seen when the AM hung launching containers in MAPREDUCE-3228. I'm investigating using an rpcTimeout when ContainerManagerPBClientImpl creates the proxy. If anyone knows a reason not to use the rpcTimeout please let me know.


Jason Lowe
added a comment - 30/Mar/12 22:14
We've seen an issue where using both an RPC-level timeout (in this case the ping timeout) and the timer task can cause the AM to lose track of a container and hang the job. Here's the relevant part of the AM log:
2012-03-29 07:32:17,794 ERROR [ContainerLauncher #199] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:container_1333003059741_0010_01_003408 (auth:SIMPLE) cause:java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:x remote=x.x.x.x.x/x.x.x.x:x]
2012-03-29 07:32:17,794 WARN [ContainerLauncher #199] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:x remote=x.x.x.x.x/x.x.x.x:x]
2012-03-29 07:32:17,794 ERROR [ContainerLauncher #199] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:container_1333003059741_0010_01_003408 (auth:SIMPLE) cause:java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/x.x.x.x:x remote=x.x.x.x.x/x.x.x.x:x]
2012-03-29 07:32:17,795 WARN [Timer-1] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Couldn't complete CONTAINER_REMOTE_CLEANUP on container_1333003059741_0010_01_003408/attempt_1333003059741_0010_m_003097_0. Interrupting and returning
2012-03-29 07:32:17,798 INFO [Timer-1] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Stack trace of the command-thread:
at java.util.Arrays.copyOf(Arrays.java:2882)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
at java.lang.StringBuilder.append(StringBuilder.java:119)
at java.lang.StackTraceElement.toString(StackTraceElement.java:157)
at java.lang.String.valueOf(String.java:2826)
at java.lang.StringBuilder.append(StringBuilder.java:115)
at java.lang.Throwable.printStackTrace(Throwable.java:512)
at org.apache.hadoop.util.StringUtils.stringifyException(StringUtils.java:64)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:260)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:479)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
2012-03-29 07:32:17,800 WARN [ContainerLauncher #199] org.apache.hadoop.yarn.event.AsyncDispatcher: AsyncDispatcher thread interrupted
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1199)
at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:312)
at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:294)
at org.apache.hadoop.yarn.event.AsyncDispatcher$GenericEventHandler.handle(AsyncDispatcher.java:179)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.kill(ContainerLauncherImpl.java:263)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:479)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Looks like the socket timeout and the timer task timeout occurred almost simultaneously. The socket exception was caught first, and during the catch clause we fielded the interrupted exception. That broke us out of the handling of the socket exception and we never marked the container status properly before leaving.
So I'm +1 on removing the timer task and relying on the rpcTimeout, unless there are other cases besides RPC where we'd expect the container launcher to get stuck.
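The race described above can be sketched in miniature. The sketch below is illustrative only (the class, queue, and timings are made up, not the ContainerLauncherImpl code): a thread that is already handling a socket timeout does interruptible work in its catch clause, the "timer" interrupts it mid-handling, and the status update is skipped:

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.ArrayBlockingQueue;

public class InterruptRaceSketch {
    static volatile String containerStatus = "RUNNING";

    public static void main(String[] args) throws Exception {
        Thread launcher = new Thread(() -> {
            try {
                // The RPC layer reports the 60s socket timeout...
                throw new SocketTimeoutException("60000 millis timeout");
            } catch (SocketTimeoutException e) {
                try {
                    // ...and while handling it we do interruptible work,
                    // e.g. putting a cleanup event on a blocking queue.
                    ArrayBlockingQueue<String> events = new ArrayBlockingQueue<>(1);
                    events.put("CONTAINER_REMOTE_CLEANUP");
                    events.put("blocks: queue is full"); // interrupted while blocked here
                    containerStatus = "FAILED";          // never reached
                } catch (InterruptedException ie) {
                    // The timer task's interrupt lands inside the catch clause,
                    // so we bail out without ever updating the container status.
                }
            }
        });
        launcher.start();
        Thread.sleep(200);     // the timer task fires "almost simultaneously"
        launcher.interrupt();
        launcher.join();
        System.out.println(containerStatus); // still RUNNING: the AM loses track of it
    }
}
```

With a single timeout mechanism (rpcTimeout alone) there is no second, competing signal to interrupt the handler mid-cleanup.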

Hadoop QA
added a comment - 02/Apr/12 19:36
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12521004/MAPREDUCE-4062.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
-1 patch. The patch command could not apply the patch.
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2127//console
This message is automatically generated.


Vinod Kumar Vavilapalli
added a comment - 02/Apr/12 19:38
> unless there are other cases besides RPC where we'd expect the container launcher to get stuck.
Currently the timer task's purpose is only to catch RPCs that get stuck.
When I originally did the timer task, I didn't really think that rpcTimeout could work. If you feel that works, sure, +1. You should definitely modify TestContainerLauncher to verify the changes with the rpc-timeout.


Thomas Graves
added a comment - 02/Apr/12 22:15
I'll see if I can add something to TestContainerLauncher also. It is tested indirectly via TestContainerLaunchRPC; I figured that one covers both the RM launching an AM and the AM launching containers.
I also tested things manually on a two-node cluster by adding a sleep into the ContainerManager to simulate a hang. I verified that it properly errors out in both the RM-launching-AM and the AM-launching-containers cases. In all cases it properly retried when the retry values were > 1.
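That manual test can be sketched in miniature. Everything below is hypothetical (the class name, the 200 ms timeout, and the retry count are illustrative, not the actual patch or test): a handler that sleeps to simulate a hung startContainer, and a caller that times out, aborts the call, and retries instead of hanging:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HangRetryDemo {
    // Stand-in for a remote startContainer whose handler has hung.
    static String slowStartContainer() throws InterruptedException {
        Thread.sleep(5_000); // simulated hang, far longer than our timeout
        return "started";
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        int maxRetries = 3;
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            Future<String> call = pool.submit(HangRetryDemo::slowStartContainer);
            try {
                System.out.println(call.get(200, TimeUnit.MILLISECONDS));
                break; // launch succeeded
            } catch (TimeoutException e) {
                call.cancel(true); // abort the hung call, as an rpcTimeout would
                System.out.println("attempt " + attempt + " timed out, retrying");
            }
        }
        pool.shutdownNow();
    }
}
```

Each attempt errors out after the short timeout rather than parking forever, which is the behavior being verified for both the RM-to-AM and AM-to-container launch paths.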

Hadoop QA
added a comment - 03/Apr/12 17:26
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12521159/MAPREDUCE-4062.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests:
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService
org.apache.hadoop.yarn.server.resourcemanager.resourcetracker.TestNMExpiry
org.apache.hadoop.yarn.server.resourcemanager.TestAMAuthorization
org.apache.hadoop.yarn.server.resourcemanager.TestApplicationACLs
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2133//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/2133//console
This message is automatically generated.

Robert Joseph Evans
added a comment - 03/Apr/12 17:46
I reviewed the code for trunk/branch-2 and it looks good to me. I like how there is lots of code being deleted and almost only tests being added.
I am going to look at the branch-0.23 patch now.
