Description

Found this:

Java stack information for the threads listed above:
===================================================
"Thread-45":
at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.getApplicationId(ApplicationAttemptIdPBImpl.java:101)
- waiting to lock <0xb6a43ba0> (a org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl)
at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.compareTo(ApplicationAttemptIdPBImpl.java:144)
- locked <0xb6a443a0> (a org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl)
at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.compareTo(ApplicationAttemptIdPBImpl.java:31)
at org.apache.hadoop.yarn.api.records.impl.pb.ContainerIdPBImpl.compareTo(ContainerIdPBImpl.java:215)
at org.apache.hadoop.yarn.api.records.impl.pb.ContainerIdPBImpl.compareTo(ContainerIdPBImpl.java:34)
at java.util.concurrent.ConcurrentSkipListMap.doGet(ConcurrentSkipListMap.java:797)
at java.util.concurrent.ConcurrentSkipListMap.get(ConcurrentSkipListMap.java:1640)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:360)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:355)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:113)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
at java.lang.Thread.run(Thread.java:619)
"Thread-30":
at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.getApplicationId(ApplicationAttemptIdPBImpl.java:101)
- waiting to lock <0xb6a443a0> (a org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl)
at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.compareTo(ApplicationAttemptIdPBImpl.java:144)
- locked <0xb6a43ba0> (a org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl)
at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.compareTo(ApplicationAttemptIdPBImpl.java:31)
at org.apache.hadoop.yarn.api.records.impl.pb.ContainerIdPBImpl.compareTo(ContainerIdPBImpl.java:215)
at org.apache.hadoop.yarn.api.records.impl.pb.ContainerIdPBImpl.compareTo(ContainerIdPBImpl.java:34)
at java.util.concurrent.ConcurrentSkipListMap.doRemove(ConcurrentSkipListMap.java:1078)
at java.util.concurrent.ConcurrentSkipListMap.remove(ConcurrentSkipListMap.java:1673)
at java.util.concurrent.ConcurrentSkipListMap$Iter.remove(ConcurrentSkipListMap.java:2256)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:223)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:62)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:262)
Found 1 deadlock.
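A minimal, self-contained Java sketch of the lock inversion the trace above shows (hypothetical class names, not the actual YARN code): each thread enters the synchronized compareTo of its own id and then blocks on the other id's synchronized getter, in opposite orders.

// Hypothetical stand-in for ApplicationAttemptIdPBImpl: a synchronized
// getter plus a synchronized compareTo that calls it on another instance.
class SyncId implements Comparable<SyncId> {
    private final int value;
    SyncId(int value) { this.value = value; }

    // Mirrors the synchronized getApplicationId()/getAttemptId().
    synchronized int getValue() { return value; }

    // Mirrors the synchronized compareTo: this object's monitor is held
    // while a synchronized method on the other object is invoked.
    public synchronized int compareTo(SyncId other) {
        return Integer.compare(getValue(), other.getValue());
    }
}

public class DeadlockDemo {
    public static void main(String[] args) {
        SyncId a = new SyncId(1);
        SyncId b = new SyncId(2);
        // Thread-45 analogue: locks a, then waits for b's monitor.
        new Thread(() -> { while (true) a.compareTo(b); }).start();
        // Thread-30 analogue: locks b, then waits for a's monitor.
        new Thread(() -> { while (true) b.compareTo(a); }).start();
        // Within a few iterations each thread holds one monitor and waits
        // on the other; jstack then reports exactly this kind of deadlock.
    }
}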

Vinod Kumar Vavilapalli
added a comment - 08/Sep/11 13:40 Because of this, the (only) MR AM had one of its maps stuck in the SUCCESS_CONTAINER_CLEANUP state. On the NM, the stopContainer() request from this AM was stuck on ApplicationAttemptId too.
"IPC Server handler 3 on 45450" daemon prio=10 tid=0xafd29000 nid=0x68a3 waiting for monitor entry [0xaf00b000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.getApplicationId(ApplicationAttemptIdPBImpl.java:101)
- waiting to lock <0xb6a43ba0> (a org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl)
at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.compareTo(ApplicationAttemptIdPBImpl.java:144)
- locked <0xb6bceac8> (a org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl)
at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.compareTo(ApplicationAttemptIdPBImpl.java:31)
at org.apache.hadoop.yarn.api.records.impl.pb.ContainerIdPBImpl.compareTo(ContainerIdPBImpl.java:215)
at org.apache.hadoop.yarn.api.records.impl.pb.ContainerIdPBImpl.compareTo(ContainerIdPBImpl.java:34)
at java.util.concurrent.ConcurrentSkipListMap.doGet(ConcurrentSkipListMap.java:797)
at java.util.concurrent.ConcurrentSkipListMap.get(ConcurrentSkipListMap.java:1640)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainer(ContainerManagerImpl.java:311)
at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagerPBServiceImpl.stopContainer(ContainerManagerPBServiceImpl.java:80)
at org.apache.hadoop.yarn.proto.ContainerManager$ContainerManagerService$2.callBlockingMethod(ContainerManager.java:85)
at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Server.call(ProtoOverHadoopRpcEngine.java:337)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1496)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1492)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1135)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1490)
So because of this, the map got stuck, all the reducers were spinning for TaskCompletionEvents and the world came to a halt.

Java stack information for the threads listed above:
===================================================
"Thread-45":
at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.getAttemptId(ApplicationAttemptIdPBImpl.java:90)
- waiting to lock <0xb5e2d1b0> (a org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl)
at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.compareTo(ApplicationAttemptIdPBImpl.java:147)
- locked <0xb5e2cb28> (a org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl)
at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.compareTo(ApplicationAttemptIdPBImpl.java:31)
at org.apache.hadoop.yarn.api.records.impl.pb.ContainerIdPBImpl.compareTo(ContainerIdPBImpl.java:215)
at org.apache.hadoop.yarn.api.records.impl.pb.ContainerIdPBImpl.compareTo(ContainerIdPBImpl.java:34)
at java.util.concurrent.ConcurrentSkipListMap.doGet(ConcurrentSkipListMap.java:797)
at java.util.concurrent.ConcurrentSkipListMap.get(ConcurrentSkipListMap.java:1640)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:360)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:355)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:113)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75)
at java.lang.Thread.run(Thread.java:619)
"Thread-30":
at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.getApplicationId(ApplicationAttemptIdPBImpl.java:101)
- waiting to lock <0xb5e2cb28> (a org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl)
at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.compareTo(ApplicationAttemptIdPBImpl.java:144)
- locked <0xb5e2d1b0> (a org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl)
at org.apache.hadoop.yarn.api.records.impl.pb.ApplicationAttemptIdPBImpl.compareTo(ApplicationAttemptIdPBImpl.java:31)
at org.apache.hadoop.yarn.api.records.impl.pb.ContainerIdPBImpl.compareTo(ContainerIdPBImpl.java:215)
at org.apache.hadoop.yarn.api.records.impl.pb.ContainerIdPBImpl.compareTo(ContainerIdPBImpl.java:34)
at java.util.concurrent.ConcurrentSkipListMap.doRemove(ConcurrentSkipListMap.java:1078)
at java.util.concurrent.ConcurrentSkipListMap.remove(ConcurrentSkipListMap.java:1673)
at java.util.concurrent.ConcurrentSkipListMap$Iter.remove(ConcurrentSkipListMap.java:2256)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.getNodeStatus(NodeStatusUpdaterImpl.java:223)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.access$300(NodeStatusUpdaterImpl.java:62)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:262)
Found 1 deadlock.

Siddharth Seth
added a comment - 09/Sep/11 01:19 compareTo and equals should not be synchronized in ApplicationAttemptId
The patch fixes this and adds synchronization for other methods in ApplicationId and ContainerId.
hashCode, equals, toString, compareTo implemented for ApplicationId, ApplicationAttemptId and ContainerId - so that the backing ProtoBuf object is not serialized for each of these calls.
Moved these methods up one level (AppId, AppAttemptId, CId changed to abstract classes).
The tests are kind of lame. Can be dropped if they're not required.
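Roughly the shape of the change described above, as an illustrative sketch rather than the actual patch (field names are simplified stand-ins): the methods live on the abstract record class and read plain getters, so no monitor is held and the backing proto is never serialized per call.

// Illustrative abstract record in the spirit of the patch description;
// the real ApplicationId/ApplicationAttemptId/ContainerId differ.
public abstract class RecordIdSketch implements Comparable<RecordIdSketch> {

    public abstract long getClusterTimestamp(); // RM start time, in ms
    public abstract int getId();                // per-RM sequence number

    @Override
    public int compareTo(RecordIdSketch other) {
        // Unsynchronized: no monitor is held while reading the other
        // object, so the opposite-order compareTo deadlock cannot occur.
        int c = Long.compare(getClusterTimestamp(), other.getClusterTimestamp());
        return c != 0 ? c : Integer.compare(getId(), other.getId());
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof RecordIdSketch)) return false;
        RecordIdSketch o = (RecordIdSketch) obj;
        return getId() == o.getId()
            && getClusterTimestamp() == o.getClusterTimestamp();
    }

    @Override
    public int hashCode() {
        // Derived from the id fields directly; the PB object is untouched.
        long ts = getClusterTimestamp();
        return 31 * (int) (ts ^ (ts >>> 32)) + getId();
    }
}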

Vinod Kumar Vavilapalli
added a comment - 09/Sep/11 05:06 Patch looks good overall.
Assuming the hashCode() methods are generated (eclipse?) and good enough.
I also like the fact that now equals(), hashCode(), and compareTo() are tied to the records themselves instead of the PB implementations. We should do this for other records too, but that's for another ticket.
+1 for the patch.

Siddharth Seth
added a comment - 09/Sep/11 05:13 hashCode() wasn't generated by eclipse. They've taken elements from the MRv1 JobID etc. ApplicationId specifically is something which may need more looking into (post RM re-factor and for JobHistory).
Agree with the bit about tying equals(), hashCode(), compareTo(), and also toString() to all the records. ProtoBase is more of a convenience to provide this functionality in all records - and likely the reason for most of the races and sync in PBImpls (serializing the proto object for each equals, hashCode). This, along with some other PB performance-related changes, needs to be made sometime later.

Vinod Kumar Vavilapalli
added a comment - 09/Sep/11 06:26 hashCode() methods generated by eclipse have better null checks etc. and also have double the product-sums.
Attaching patch using the eclipse-generated hashes. We should be able to do better if we analyse more on our IDs, but this should work for now.
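For reference, a sketch of the eclipse-style shape being discussed, over illustrative fields (not the actual patch): seed with 1, then fold each field in as prime * result + field. A matching equals() over the same fields is omitted here.

class IdHashSketch {
    private final long clusterTimestamp; // in ms
    private final int id;

    IdHashSketch(long clusterTimestamp, int id) {
        this.clusterTimestamp = clusterTimestamp;
        this.id = id;
    }

    @Override
    public int hashCode() {
        final int prime = 31; // eclipse's default multiplier
        int result = 1;
        // eclipse folds a long in by xor-ing its two 32-bit halves.
        result = prime * result + (int) (clusterTimestamp ^ (clusterTimestamp >>> 32));
        result = prime * result + id;
        return result;
    }
}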

Siddharth Seth
added a comment - 09/Sep/11 06:48 Looks ok - but I am not sure about the large prime - it will almost definitely cause the hashCode to wrap around the integer range, which is likely not a problem. We could revert to the eclipse-generated default of 31.
We should be able to do better if we analyse more on our IDs, but this should work for now.
Completely agree with this though - clusterTimestamp is in ms, and there's unlikely to be a very large number of attemptIds and containers per app.
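On the wrap-around point: Java int arithmetic overflows by wrapping (two's complement) rather than failing, so a large multiplier only scrambles the bits, and equal ids still produce equal hashes. A quick illustration with a made-up multiplier:

public class HashWrapDemo {
    public static void main(String[] args) {
        final int largePrime = 2147017; // hypothetical large multiplier
        int h = largePrime * 1500000 + 7; // product exceeds Integer.MAX_VALUE
        // Wraps deterministically (possibly to a negative value), so it
        // is still a valid, stable hash value.
        System.out.println(h);
    }
}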