Bikas Saha
added a comment - 24/Feb/12 01:56 A potential solution would be the following
1) have the scheduler interface return the set of bad nodes on which it has stopped scheduling. This keeps the decision of which node is bad in the scheduler. The scheduler is the ultimate authority on what runs on a node and should tell its clients about the nodes that it is not considering for scheduling.
2) 1) above could be done as another interface API or piggybacked on the scheduler.allocate() API.
3) The response could contain all the known bad nodes or deltas to the previous response. Deltas are cheaper to send but are susceptible to message loss and retransmission. Also, deltas would have to be divided into new bad nodes and new good nodes.
4) The AM might want to know the type of bad node, say lost or unhealthy. The bad nodes information could be enhanced by querying the RMNode object for the actual reason/health.
As an enhancement, we could add a new RMNodeManager entity that manages all the RMNodes. The above functionality could move from the scheduler into RMNodeManager (though it would need to be in sync with the scheduler). After that, getting detailed information may not need direct access to the RMNode object. Potentially, other interactions with RMNode could be forwarded through the RMNodeManager. But this would be a fairly significant refactoring that's best left to a separate future work item.

Bikas Saha
added a comment - 24/Feb/12 02:51 Not doing deltas on the RM-AM channel does not seem viable because of the high-frequency message traffic. Sending information about 100 bad nodes at 100 bytes per node to 1000 AMs every second is about 10MB/s of traffic.
Sending deltas means tracking last and current states on the RM on a per-AM-attempt basis. That would not be good to do in the scheduler because it's not the scheduler's responsibility. So this needs to be done on each RMAttempt object. The RMAttempt object gets the current list of bad nodes and compares it with its last known list of bad nodes. Additions and deletions are sent to the AM as new bad and good nodes.
Alternatively, each RMNode could send an event to each RMAppAttempt for healthy->unhealthy and vice versa transitions. These events could be accumulated and copied to the AM via the allocate response.
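The delta bookkeeping described above is plain set arithmetic. A minimal sketch, assuming hypothetical names (NodeDeltaSketch, computeDelta; the actual YARN classes differ):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the per-attempt delta computation described above. The names
// here are illustrative, not actual YARN APIs.
public class NodeDeltaSketch {

    static final class NodeDelta {
        final Set<String> newlyBad;   // sent to the AM as new bad nodes
        final Set<String> newlyGood;  // sent to the AM as new good nodes
        NodeDelta(Set<String> newlyBad, Set<String> newlyGood) {
            this.newlyBad = newlyBad;
            this.newlyGood = newlyGood;
        }
    }

    // Compare the current bad-node set with the last one sent to this AM.
    static NodeDelta computeDelta(Set<String> lastBad, Set<String> currentBad) {
        Set<String> newlyBad = new HashSet<>(currentBad);
        newlyBad.removeAll(lastBad);          // bad now, but not before
        Set<String> newlyGood = new HashSet<>(lastBad);
        newlyGood.removeAll(currentBad);      // bad before, recovered since
        return new NodeDelta(newlyBad, newlyGood);
    }
}
```

After each allocate response the attempt would store currentBad as its new lastBad, ready for the next comparison.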

Bikas Saha
added a comment - 24/Feb/12 20:59 1) Add a ClusterManager to RM that currently only provides bad node information. All node management will incrementally be moved to it. Eg. RmContext.RMnodes and RMContext.invalidNodes could move into ClusterManager. This could be a new class or be an enhancement of NodesListManager.
2) A node is bad when it sends a heartbeat with health=unhealthy or when the liveness monitor reports it as lost. In both cases a SchedulerNodeRemove event is issued, and similarly for nodes becoming healthy. Additionally, these transitions will now emit corresponding events to update the ClusterManager.
3) In AppMasterService, after calling Scheduler.allocate(), it will call ClusterManager.getUnusableNodes(). These will be passed to the RMAttempt object, which will calculate the delta against the previously known bad machines. The delta will be sent with the allocate response and the current list will be saved for calculating the next delta.
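Steps 1 and 2 amount to an event-driven set of unusable nodes with a snapshot accessor. A small sketch under assumed names (ClusterManagerSketch, NodeEventType; the real YARN event types and dispatcher wiring differ):

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative stand-in for the proposed ClusterManager: it receives node
// state-change events and answers getUnusableNodes(). Not actual YARN code.
public class ClusterManagerSketch {
    enum NodeEventType { NODE_UNUSABLE, NODE_USABLE }

    // Concurrent set: events arrive from the dispatcher thread while
    // AppMasterService reads snapshots on allocate calls.
    private final Set<String> unusable =
        Collections.newSetFromMap(new ConcurrentHashMap<>());

    // Raised from RMNode transitions: an unhealthy heartbeat or a lost node
    // makes the node unusable; turning healthy makes it usable again.
    void handle(NodeEventType type, String nodeId) {
        if (type == NodeEventType.NODE_UNUSABLE) {
            unusable.add(nodeId);
        } else {
            unusable.remove(nodeId);
        }
    }

    // Snapshot read by AppMasterService after Scheduler.allocate().
    Set<String> getUnusableNodes() {
        return Set.copyOf(unusable);
    }
}
```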

Vinod Kumar Vavilapalli
added a comment - 24/Feb/12 21:49 +1 for the latest proposal; we can use NodesListManager itself instead of a new ClusterManager.
In addition, we need AMs to act on the nodes information. Filing a separate ticket.

Bikas Saha
added a comment - 01/Mar/12 18:53 Attaching patch. This passes all tests locally and has a functional test for the main changes.
1) providing all unusable nodes on app attempt registration
2) providing delta updates of nodes thereafter to a running app
I will scan through the changes for final clean ups and look at adding some more tests if necessary.

Bikas Saha
added a comment - 02/Mar/12 00:55 1) Fixed the javadoc warning.
2) The javac warnings are because of events handlers being called in NodesListManager.java and are similar to pre-existing warnings.
===========
[WARNING] /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java: [142,19] [unchecked] unchecked call to handle(T) as a member of the raw type org.apache.hadoop.yarn.event.EventHandler
[WARNING] /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java: [155,21] [unchecked] unchecked call to handle(T) as a member of the raw type org.apache.hadoop.yarn.event.EventHandler
[WARNING] /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java: [244,35] [unchecked] unchecked call to handle(T) as a member of the raw type org.apache.hadoop.yarn.event.EventHandler
===========
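For context, this class of warning is produced whenever handle(T) is invoked through a raw EventHandler reference, as dispatchers often must. A minimal, hypothetical reproduction (not the Hadoop code itself, though org.apache.hadoop.yarn.event.EventHandler is analogous):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal reproduction of the unchecked-call warning: invoking handle(T)
// through a raw EventHandler reference. Illustrative only.
public class RawHandlerDemo {
    interface EventHandler<T> {
        void handle(T event);
    }

    static final List<String> seen = new ArrayList<>();

    @SuppressWarnings({"rawtypes", "unchecked"})
    static void dispatch(Object event) {
        EventHandler<String> typed = e -> seen.add(e);
        EventHandler raw = typed;   // raw type, as in a dispatcher registry
        raw.handle(event);          // javac: unchecked call to handle(T)
    }
}
```

Without the @SuppressWarnings, javac emits exactly the "unchecked call to handle(T) as a member of the raw type EventHandler" warning quoted above.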

Amol Kekre
added a comment - 05/Mar/12 02:04 Not sure why this jira is marked critical. This only has an impact if a node goes bad during the AM's life span, right? If so, given 3 attempts by MR, how important is this jira (Major?).

Arun C Murthy
added a comment - 13/Mar/12 21:44 Looks good, some minor nits:
RegisterApplicationMasterResponse.(get,set)UnusableNodes is not used right now, so let's add it later when we need it.
AMResponse.getUpdatedNodes could use javadocs.
BuilderUtils.createNodeReport
RMContextImpl.applications should be changed to ConcurrentSkipListMap to be safe - we should open a separate jira to fix the signature of RMContext.get* which return ConcurrentMap
RMAppImpl.pullRMNodeUpdates needs a writeLock since it's clearing
RMAppImpl should ignore NodeUpdate in COMPLETED state (thus we can remove the 'if' condition in RMAppNodeUpdateTransition).

Bikas Saha
added a comment - 16/Mar/12 19:29 New patch. I have not changed the ConcurrentMap implementations. They use ConcurrentHashMap, which has safe iterators, although the javadocs state that an iterator itself should only be accessed from one thread.
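The safe-iterator claim can be illustrated directly: ConcurrentHashMap iterators are weakly consistent and tolerate concurrent mutation, unlike HashMap's fail-fast iterators. A small sketch (not Hadoop code; the map contents are made up):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrates why iterating a ConcurrentHashMap while it is mutated is safe:
// its iterators are weakly consistent and never throw
// ConcurrentModificationException (a plain HashMap would throw here).
// A given iterator instance should still be confined to a single thread.
public class SafeIteratorDemo {
    public static int iterateWhileMutating() {
        Map<String, String> apps = new ConcurrentHashMap<>();
        apps.put("app_1", "RUNNING");
        apps.put("app_2", "RUNNING");
        for (Map.Entry<String, String> e : apps.entrySet()) {
            apps.put("app_3", "NEW"); // mutate mid-iteration: no exception
        }
        return apps.size(); // 3
    }
}
```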

Bikas Saha
added a comment - 20/Mar/12 17:13 The javac warnings have already been clarified in the first patch submission. Pasting from an earlier comment.
>>>
The javac warnings are because of events handlers being called in NodesListManager.java and are similar to pre-existing warnings.
===========
[WARNING] /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java: [142,19] [unchecked] unchecked call to handle(T) as a member of the raw type org.apache.hadoop.yarn.event.EventHandler
[WARNING] /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/NodesListManager.java: [155,21] [unchecked] unchecked call to handle(T) as a member of the raw type org.apache.hadoop.yarn.event.EventHandler
[WARNING] /home/jenkins/jenkins-slave/workspace/PreCommit-MAPREDUCE-Build/trunk/hadoop-mapreduce-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmcontainer/RMContainerImpl.java: [244,35] [unchecked] unchecked call to handle(T) as a member of the raw type org.apache.hadoop.yarn.event.EventHandler
===========