Support delay scheduling for node locality in MR2's capacity scheduler

Description

The capacity scheduler in MR2 doesn't support delay scheduling for achieving node-level locality, so jobs exhibit poor node-level data locality even when they have good rack locality. On clusters where disk throughput is much better than network capacity, this especially hurts overall job performance. We should optionally support node-level delay-scheduling heuristics similar to those the fair scheduler implements in MR1.
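A minimal sketch of the MR1 fair-scheduler-style heuristic referred to above: a job tracks how many scheduling opportunities it has passed up, and only relaxes to worse locality once a delay threshold is exceeded. All class, field, and method names here are illustrative, not actual Hadoop APIs.

```java
// Illustrative sketch of per-job delay scheduling (not real Hadoop code).
public class DelaySchedulingSketch {
    public enum LocalityLevel { NODE, RACK, ANY }

    private final int nodeDelay;   // opportunities to wait for node-local
    private final int rackDelay;   // further opportunities to wait for rack-local
    private int missedOpportunities = 0;

    public DelaySchedulingSketch(int nodeDelay, int rackDelay) {
        this.nodeDelay = nodeDelay;
        this.rackDelay = rackDelay;
    }

    /** Called each time the job is offered a slot it cannot use node-locally. */
    public void recordMissedOpportunity() {
        missedOpportunities++;
    }

    /** Called when a task launches at the desired locality level. */
    public void reset() {
        missedOpportunities = 0;
    }

    /** The worst locality level this job is currently willing to accept. */
    public LocalityLevel allowedLevel() {
        if (missedOpportunities < nodeDelay) {
            return LocalityLevel.NODE;
        }
        if (missedOpportunities < nodeDelay + rackDelay) {
            return LocalityLevel.RACK;
        }
        return LocalityLevel.ANY;
    }
}
```

The key property is that waiting is bounded: after `nodeDelay + rackDelay` missed opportunities the job accepts any slot, so locality never costs unbounded latency.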

Todd Lipcon
added a comment - 20/Oct/11 23:23 Turns out the major locality issues I was seeing were related to data locality not being respected at all. This was fixed by MAPREDUCE-2693 (see also MAPREDUCE-3234).

NO NAME
added a comment - 03/Jan/12 04:53 I'm going to be addressing this as part of MAPREDUCE-3601 and can probably just add it to the Capacity scheduler as well.
Delay scheduling is going to be less efficient in MR2 due to the resource request model. Right now, when a map task needs to run, the MR AM creates three separate resource requests to the scheduler: one for a node-local container, one for a rack-local container, and another for an "any" container. However, the scheduler can't associate these requests in any way.
In the MR1 Fair scheduler, we basically triage a given request and accept worse levels of locality as time goes on - this won't be possible. In MR2, I don't see a better way than introducing some type of global delay for "any" requests and rack-local requests (the former exists already). It seems like this could lead to undesirable behaviour depending on the order in which resource requests arrive.
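As an illustration of the request model described above, here is a hypothetical sketch (not the real YARN API) of how one map task fans out into three unassociated requests:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of how the MR AM expands one map task's split
// locations into separate, unassociated resource requests.
public class ResourceRequestSketch {
    public static List<String> requestsForTask(List<String> hosts, List<String> racks) {
        List<String> requests = new ArrayList<>();
        for (String h : hosts) {
            requests.add("node:" + h);   // node-local request per host
        }
        for (String r : racks) {
            requests.add("rack:" + r);   // rack-local request per rack
        }
        requests.add("*");               // the "any" request
        // Nothing ties these entries together: the scheduler may satisfy
        // several of them even though they back the same underlying task.
        return requests;
    }
}
```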

NO NAME
added a comment - 03/Jan/12 05:14 Just to be clear what I mean:
The current approach is to only schedule "any" requests once the scheduler has failed to allocate a node or rack local container anywhere for several NM check-ins. The corresponding approach for rack-locality is to only schedule rack-local once we've had a given number of global failures scheduling node-local requests.
My concerns are:
1) If the scheduler falls back to rack locality, it might fulfil a request for a rack-local container that has already been taken care of via a node-local request. That container will be returned to the AM, which will have no use for it and will release it. It might take a number of rounds of offers to the AM for things to shake out correctly.
2) If a single rack is busy, it might take a long time to trigger the global failover to "any" requests.
Anyways, maybe these won't be a big deal. The first step is to just go ahead and do this and see how good an approximation it is of a model where we have associations between resource requests.
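Concern #1 can be pictured with a small hypothetical sketch: when a container arrives for a task that an earlier allocation already satisfied, the AM has nothing to assign it to and must release it. The matching-by-task-id scheme here is purely illustrative, not how the real AM tracks containers.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative AM-side triage: a late rack-local container for an
// already-satisfied task is released rather than used.
public class ContainerTriageSketch {
    private final Set<String> pendingTasks = new HashSet<>();

    public ContainerTriageSketch(Set<String> tasks) {
        pendingTasks.addAll(tasks);
    }

    /** Returns true if the container is assigned, false if it must be released. */
    public boolean offerContainer(String taskId) {
        // remove() is true only the first time: a second container for the
        // same task finds nothing pending and gets released.
        return pendingTasks.remove(taskId);
    }
}
```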

Robert Joseph Evans
added a comment - 03/Jan/12 15:35 Your concern #1 is already happening. With MRv2 right now, all the requests - global, rack-local, and node-specific - are made at once. This means that on an underused cluster all of them might be fulfilled and returned to the AM. If the AM can make use of one of the containers it will; otherwise it will release it.
Perhaps the better way to do this is to have the AM be responsible for making the requests at different times. For example, on the first heartbeat after a container is needed, only the node-local request is made. If it is not granted after a specific timeout (one heartbeat by default), then a rack-local request is added, and finally the global request is added after another timeout.
It would be nice to have this be more generic so that the requests are somehow tied together, but that would require an API change and may not be simple to do in the short term.
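The AM-side escalation suggested above could be sketched roughly as follows. The names and the heartbeat-count timeouts are illustrative assumptions, not actual MR AM code:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: the AM widens a task's requests one locality level
// per timeout instead of issuing node, rack, and "any" all at once.
public class EscalatingRequestSketch {
    private int heartbeatsWaited = 0;
    private final int rackTimeout;  // heartbeats before adding the rack request
    private final int anyTimeout;   // heartbeats before adding the "any" request

    public EscalatingRequestSketch(int rackTimeout, int anyTimeout) {
        this.rackTimeout = rackTimeout;
        this.anyTimeout = anyTimeout;
    }

    /** Requests to send on the next heartbeat for a still-unscheduled task. */
    public List<String> onHeartbeat(String host, String rack) {
        List<String> requests = new ArrayList<>();
        requests.add("node:" + host);              // always ask node-local
        if (heartbeatsWaited >= rackTimeout) {
            requests.add("rack:" + rack);          // escalate to rack-local
        }
        if (heartbeatsWaited >= anyTimeout) {
            requests.add("*");                     // finally allow any node
        }
        heartbeatsWaited++;
        return requests;
    }
}
```

Because the scheduler never even sees the rack or "any" requests until the AM has waited, the duplicate-fulfilment problem from concern #1 largely disappears, at the cost of extra heartbeat round-trips.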

Hadoop QA
added a comment - 04/Sep/12 22:29 -1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12543531/YARN-80.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
-1 core tests. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestLeafQueue
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/18//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/18//console
This message is automatically generated.

Hadoop QA
added a comment - 05/Sep/12 08:54 +1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12543823/YARN-80.patch
against trunk revision .
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
+1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager.
+1 contrib tests. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/24//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/24//console
This message is automatically generated.

Harsh J
added a comment - 10/Sep/12 06:55 Hi Arun,
Thanks very much for doing this! We could probably address this in a new JIRA, but I had two questions:
Why was it decided that the feature be disabled by default?
Is there a way to avoid making people change configuration based on their number of racks (i.e., automate it)?
Karthik Kambatla (Inactive)
added a comment - 10/Sep/12 07:08 Perhaps the better way to do this is to have the AM be responsible for making the requests at different times. So for example on the first heartbeat after a container is needed only the node local request is made. If it does not get it after a specific timeout (1 heartbeat by default) then a rack local request is added, and finally the global request is added after another timeout.
+1. Should we create a JIRA for this to make sure we don't lose track of it?

mck
added a comment - 07/Nov/13 13:29 How can one debug this process?
It was easy before with just `grep "Choosing" hadoop-xxx-jobtracker.log`.
I can't find any similar information in the YARN log files.
Background: I just upgraded to YARN (hadoop-2.2.0), and despite setting yarn.scheduler.capacity.node-locality-delay=3 in capacity-scheduler.xml, data locality is poor. (It was 100% with hadoop-0.22 and the fair scheduler.)