Description

We are seeing the preemption monitor preempt containers from queue A, and then the capacity scheduler immediately give them back to queue A. This happens quite often and causes a lot of churn.

One app is running in each queue. Both apps are asking for more resources, but they have each reached their user limit, so even though both are asking for more and there are resources available, no more resources are allocated to either app.

The preemption monitor will see that B is asking for a lot more resources, and it will see that B is more underserved than A, so the preemption monitor will try to make the queues balance by preempting resources (10, for example) from A.

queue | capacity | max | pending | used | user limit
root  | 100      | 100 | 50      | 80   | N/A
A     | 10       | 100 | 30      | 60   | 70
B     | 10       | 100 | 20      | 20   | 20
However, when the capacity scheduler tries to give that container to the app in B, the app will recognize that it has no headroom, and refuse the container. So the capacity scheduler offers the container again to the app in A, which accepts it because it has headroom now, and the process starts over again.

Note that this happens even when used cluster resources are below 100% because the used + pending for the cluster would put it above 100%.
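
For context, a minimal capacity-scheduler property sketch that could set up a scenario like the one in the tables below. The property names are the standard CapacityScheduler ones, but the specific values, the filler "default" queue (added only so the children sum to 100), and the user-limit-factor settings chosen to yield user limits of roughly 70 for A and 20 for B are assumptions for illustration, not taken from the reporter's cluster:

  yarn.resourcemanager.scheduler.monitor.enable = true
  yarn.scheduler.capacity.root.queues = A,B,default
  yarn.scheduler.capacity.root.A.capacity = 10
  yarn.scheduler.capacity.root.A.maximum-capacity = 100
  yarn.scheduler.capacity.root.A.user-limit-factor = 7
  yarn.scheduler.capacity.root.B.capacity = 10
  yarn.scheduler.capacity.root.B.maximum-capacity = 100
  yarn.scheduler.capacity.root.B.user-limit-factor = 2
  yarn.scheduler.capacity.root.default.capacity = 80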

Eric Payne
added a comment - 04/Jun/15 21:28 The following configuration will cause this:
queue | capacity | max | pending | used | user limit
root  | 100      | 100 | 40      | 90   | N/A
A     | 10       | 100 | 20      | 70   | 70
B     | 10       | 100 | 20      | 20   | 20
One app is running in each queue. Both apps are asking for more resources, but they have each reached their user limit, so even though both are asking for more and there are resources available, no more resources are allocated to either app.
The preemption monitor will see that B is asking for a lot more resources, and it will see that B is more underserved than A, so the preemption monitor will try to make the queues balance by preempting resources (10, for example) from A.
queue | capacity | max | pending | used | user limit
root  | 100      | 100 | 50      | 80   | N/A
A     | 10       | 100 | 30      | 60   | 70
B     | 10       | 100 | 20      | 20   | 20
However, when the capacity scheduler tries to give that container to the app in B, the app will recognize that it has no headroom, and refuse the container. So the capacity scheduler offers the container again to the app in A, which accepts it because it has headroom now, and the process starts over again.
Note that this happens even when used cluster resources are below 100% because the used + pending for the cluster would put it above 100%.

Wangda Tan
added a comment - 04/Jun/15 21:44 Eric Payne,
This is a very interesting problem; actually, user-limit is not the only thing that causes it.
For example, fair ordering (YARN-3306), hard locality requirements (I want resources from rackA and nodeX only), the AM resource limit, and, in the near future, constraints (YARN-3409) can all lead to a resource being preempted from one queue while the other queue cannot use it because of its specific resource requirements and limits.
One thing I've thought about for a while is adding a "lazy preemption" mechanism, which is: when a container is marked preempted and has waited for max_wait_before_time, it becomes a "can_be_killed" container. If another queue can allocate on a node with a "can_be_killed" container, that container will be killed immediately to make room for the new containers.
With this mechanism, the preemption policy does not need to consider complex resource requirements and limits inside a queue, and it also avoids killing unnecessary containers.
If you think it's fine, could I take a shot at it?
Thoughts? Vinod Kumar Vavilapalli.
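
A rough sketch of what the "lazy preemption" idea could look like; the class and method names below are invented for illustration only and are not from any patch on this JIRA:

  import java.util.HashMap;
  import java.util.Map;
  import org.apache.hadoop.yarn.api.records.ContainerId;

  // Hypothetical sketch: the preemption monitor only marks containers, and the
  // scheduler kills a marked container only when another queue actually has a
  // request it can place on that container's node.
  class LazyPreemptionTracker {

    enum State { MARKED_PREEMPTED, CAN_BE_KILLED }

    private final long maxWaitBeforeKillMs;  // the "max_wait_before_time" above
    private final Map<ContainerId, Long> markTime = new HashMap<>();
    private final Map<ContainerId, State> state = new HashMap<>();

    LazyPreemptionTracker(long maxWaitBeforeKillMs) {
      this.maxWaitBeforeKillMs = maxWaitBeforeKillMs;
    }

    // The preemption monitor marks a container instead of killing it outright.
    void markPreempted(ContainerId id, long nowMs) {
      markTime.putIfAbsent(id, nowMs);
      state.putIfAbsent(id, State.MARKED_PREEMPTED);
    }

    // Periodically promote containers that have waited long enough.
    void promoteExpired(long nowMs) {
      for (Map.Entry<ContainerId, Long> e : markTime.entrySet()) {
        if (nowMs - e.getValue() >= maxWaitBeforeKillMs) {
          state.put(e.getKey(), State.CAN_BE_KILLED);
        }
      }
    }

    // The scheduler consults this only when another queue has a request it
    // could place on the container's node; only then is the container killed.
    boolean canKillNow(ContainerId id) {
      return state.get(id) == State.CAN_BE_KILLED;
    }
  }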

Eric Payne
added a comment - 04/Jun/15 21:58 Wangda Tan,
One thing I've thought about for a while is adding a "lazy preemption" mechanism, which is: when a container is marked preempted and has waited for max_wait_before_time, it becomes a "can_be_killed" container. If another queue can allocate on a node with a "can_be_killed" container, that container will be killed immediately to make room for the new containers.
IIUC, in your proposal, the preemption monitor would mark the containers as preemptable, and then after some configurable wait period, the capacity scheduler would be the one to do the killing if it finds that it needs the resources on that node. Is my understanding correct?

MENG DING
added a comment - 24/Aug/15 18:55 Wangda Tan, for better tracking purposes, would it be better to update the title of this JIRA to something more general, e.g., "CapacityScheduler: Improve preemption to preempt only those containers that would satisfy the incoming request" (similar to YARN-2154)? This ticket could then be used to address the preemption ping-pong issue for both new container requests and container resource increase requests.
Besides the proposal that you have presented, an alternative solution to consider is: once we collect the list of preemptable containers, we immediately do a dry run of the scheduling algorithm to match the preemptable resources against outstanding new/increase resource requests. We then only preempt the resources that can find a match.
Thoughts?
Meng
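
A rough sketch of the dry-run idea; ContainerCandidate and OutstandingRequest are placeholder types invented for illustration and are not from any patch:

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.yarn.api.records.Resource;

  // Hypothetical placeholder types for the "dry run" alternative.
  interface ContainerCandidate {
    String getNodeId();
    Resource getResource();
  }

  interface OutstandingRequest {
    // Would check user-limit headroom, locality, and whether the freed
    // resource is large enough for this (new or increase) request.
    boolean canUse(String nodeId, Resource freed);
    // Record the match so the same request is not satisfied twice.
    void reserve(Resource freed);
  }

  class PreemptionDryRun {
    // Keep only the candidates that some outstanding request could actually use.
    static List<ContainerCandidate> filter(List<ContainerCandidate> candidates,
        List<OutstandingRequest> requests) {
      List<ContainerCandidate> toPreempt = new ArrayList<>();
      for (ContainerCandidate c : candidates) {
        for (OutstandingRequest r : requests) {
          if (r.canUse(c.getNodeId(), c.getResource())) {
            toPreempt.add(c);
            r.reserve(c.getResource());
            break;
          }
        }
      }
      return toPreempt;
    }
  }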

Eric Payne
added a comment - 31/Aug/15 19:20
One thing I've thought about for a while is adding a "lazy preemption" mechanism, which is: when a container is marked preempted and has waited for max_wait_before_time, it becomes a "can_be_killed" container. If another queue can allocate on a node with a "can_be_killed" container, that container will be killed immediately to make room for the new containers.
I will upload a design doc shortly for review.
Wangda Tan, because it's been a couple of months since the last activity on this JIRA, would it be better to use this JIRA for the purpose of making the preemption monitor "user-limit" aware, and open a separate JIRA to address a redesign?
Towards that end, I am uploading a couple of patches:
YARN-3769.001.branch-2.7.patch is a patch to 2.7 (and also 2.6) which we have been using internally. This fix has dramatically reduced the instances of "ping-pong"-ing as I outlined in the comment above.
YARN-3769.001.branch-2.8.patch is similar to the fix made in 2.7, but it also takes into consideration node label partitions.
Thanks for your help and please let me know what you think.

Wangda Tan
added a comment - 02/Sep/15 23:55 Eric Payne,
Thanks for working on the patch; the approach generally looks good. A few comments on the implementation:
getTotalResourcePending is misleading; I suggest renaming it to something like getTotalResourcePendingConsideredUserLimit, and adding a comment to indicate that it will only be used by the preemption policy.
And for the implementation:
I think there's no need to store an appsPerUser map. It would be an O(apps-in-the-queue) memory cost, and you would need O(apps-in-the-queue) insert operations as well. Instead, you can do the following:
Map<UserName, Headroom> userNameToHeadroom;
Resource userLimit = computeUserLimit(partition);
Resource pendingAndPreemptable = 0;
for (app in apps) {
  if (!userNameToHeadroom.contains(app.getUser())) {
    userNameToHeadroom.put(app.getUser(), userLimit - app.getUser().getUsed(partition));
  }
  pendingAndPreemptable += min(userNameToHeadroom.get(app.getUser()), app.getPending(partition));
  userNameToHeadroom.get(app.getUser()) -= app.getPending(partition);
}
return pendingAndPreemptable;
And could you add a test to verify it works?

Eric Payne
added a comment - 09/Sep/15 22:09 Thanks very much, Wangda Tan!
I think the above is much more efficient, but it needs one small tweak. On this line:
userNameToHeadroom.get(app.getUser()) -= app.getPending(partition);
If app.getPending(partition) is larger than userNameToHeadroom.get(app.getUser()), then userNameToHeadroom.get(app.getUser()) could easily go negative. I think what we may want is something like this:
Map<UserName, Headroom> userNameToHeadroom;
Resource userLimit = computeUserLimit(partition);
Resource pendingAndPreemptable = 0;
for (app in apps) {
  if (!userNameToHeadroom.contains(app.getUser())) {
    userNameToHeadroom.put(app.getUser(), userLimit - app.getUser().getUsed(partition));
  }
  Resource minPendingAndPreemptable = min(userNameToHeadroom.get(app.getUser()), app.getPending(partition));
  pendingAndPreemptable += minPendingAndPreemptable;
  userNameToHeadroom.get(app.getUser()) -= minPendingAndPreemptable;
}
return pendingAndPreemptable;
Also, I will work on adding a test case.
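
For illustration, plugging in queue A's numbers from the tables above (user limit 70, used 60, pending 30): the user's headroom starts at 70 - 60 = 10, the app contributes min(10, 30) = 10 to pendingAndPreemptable, and the tweaked line leaves the remaining headroom at 10 - 10 = 0, whereas the original line would have driven it to 10 - 30 = -20.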

Eric Payne
added a comment - 03/Oct/15 22:18 Thank you very much, Wangda Tan, for your suggestions and help reviewing this patch. I am attaching an updated patch (version 002) for both branch-2.7 and branch-2.

Eric Payne
added a comment - 05/Oct/15 15:31
Subsystem      | Report/Notes
Patch URL      | http://issues.apache.org/jira/secure/attachment/12764916/YARN-3769-branch-2.7.002.patch
Optional Tests | javadoc javac unit findbugs checkstyle
git revision   | trunk / 3b85bd7
It looks like the build tried to apply the branch-2.7 version of this patch to trunk. I will cancel the patch and re-upload the branch-2 version of the patch so that Hadoopqa will run the 2.8 build and comment on that patch.

Eric Payne
added a comment - 10/Oct/15 15:51 Wangda Tan, thanks for all of your help on this JIRA.
Attaching version 003.
YARN-3769.003.patch applies to both trunk and branch-2.
YARN-3769-branch-2.7.003.patch applies to branch-2.7.

Wangda Tan
added a comment - 15/Oct/15 21:49 Eric Payne, some quick comments:
Why is this needed? MAX_PENDING_OVER_CAPACITY. I think this could be problematic: for example, if a queue has capacity = 50, its usage is 10, and it has 55 pending resources, then with MAX_PENDING_OVER_CAPACITY=0.1 the queue cannot preempt resources from other queues.
In LeafQueue, the patch uses getHeadroom() to compute how many resources the user can use, but I think that may not be correct. getHeadroom is computed as:
 * Headroom is:
 *   min(
 *     min(userLimit, queueMaxCap) - userConsumed,
 *     queueMaxLimit - queueUsedResources
 *   )
(Please note that the actual code is slightly different from this comment; it uses the queue's MaxLimit instead of the queue's max resource.)
One negative example is:
a (max=100, used=100, configured=100)
  a.a1 (max=100, used=30, configured=40)
  a.a2 (max=100, used=70, configured=60)
For the above queue status, the headroom for a.a1 is 0, since queue a's currentResourceLimit is 0.
So instead of using headroom, I think you can use computed-user-limit - user.usage(partition) as the headroom. You don't need to consider the queue's max capacity here, since we will consider the queue's max capacity in the following logic of PCPP (the ProportionalCapacityPreemptionPolicy).
Thoughts?

Eric Payne
added a comment - 02/Nov/15 17:36 Wangda Tan, thank you for your review, and sorry for the late reply.
Why is this needed? MAX_PENDING_OVER_CAPACITY. I think this could be problematic: for example, if a queue has capacity = 50, its usage is 10, and it has 45 pending resources, then with MAX_PENDING_OVER_CAPACITY=0.1 the queue cannot preempt resources from other queues.
Sorry for the poor naming convention. It is not really being used to check against the queue's capacity; it is used to check for a percentage over the currently used resources. I changed the name to MAX_PENDING_OVER_CURRENT.
As you know, there are multiple reasons why preemption could unnecessarily preempt resources (I call it "flapping"), only one of which is the lack of consideration for the user limit factor. Another is that an app could be requesting an 8-gig container, and the preemption monitor could conceivably preempt eight one-gig containers, which would then be rejected by the requesting AM and potentially given right back to the preempted app.
The MAX_PENDING_OVER_CURRENT buffer is an attempt to alleviate that particular flapping situation by giving a buffer zone above the currently used resources on a particular queue. That is to say, the preemption monitor shouldn't consider that queue B is asking for pending resources unless the pending resources on queue B are above a configured percentage of the currently used resources on queue B.
If you want, we can pull this out and put it as part of a different JIRA so we can document and discuss that particular flapping situation separately.
In LeafQueue, it uses getHeadroom() to compute how many resources the user can use. But I think it may not be correct: ... For the above queue status, the headroom for a.a1 is 0, since queue a's currentResourceLimit is 0.
So instead of using headroom, I think you can use computed-user-limit - user.usage(partition) as the headroom. You don't need to consider the queue's max capacity here, since we will consider the queue's max capacity in the following logic of PCPP.
Yes, you are correct. getHeadroom could be calculating zero headroom when we don't want it to. And, I agree that we don't need to limit pending resources to max queue capacity when calculating pending resources. The concern for this fix is that user limit factor should be considered and limit the pending value. The max queue capacity will be considered during the offer stage of the preemption calculations.
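
A minimal sketch of the intent behind MAX_PENDING_OVER_CURRENT as described above; this helper is hypothetical, not the patch code, and only the Resources/ResourceCalculator utility calls are real:

  import org.apache.hadoop.yarn.api.records.Resource;
  import org.apache.hadoop.yarn.util.resource.ResourceCalculator;
  import org.apache.hadoop.yarn.util.resource.Resources;

  // Hypothetical illustration: only treat a queue as "asking" if its pending
  // demand exceeds a configured fraction of what it already uses, to damp
  // small requests that tend to cause flapping.
  class PendingOverCurrentCheck {
    static boolean shouldConsiderPending(ResourceCalculator rc,
        Resource clusterResource, Resource pending, Resource used,
        double maxPendingOverCurrent) {
      Resource threshold = Resources.multiply(used, maxPendingOverCurrent);
      return Resources.greaterThan(rc, clusterResource, pending, threshold);
    }
  }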

Eric Payne
added a comment - 03/Nov/15 18:46 Tests hadoop.yarn.server.resourcemanager.TestClientRMTokens and hadoop.yarn.server.resourcemanager.TestAMAuthorization are not failing for me in my own build environment.

Wangda Tan
added a comment - 03/Nov/15 22:04 Eric Payne, thanks for the update.
If you want, we can pull this out and put it as part of a different JIRA so we can document and discuss that particular flapping situation separately.
I would prefer to make it a separate JIRA, since it is not a directly related fix. I will review PCPP after you separate out those changes (since you're OK with making it separate).
Yes, you are correct. getHeadroom could be calculating zero headroom when we don't want it to. And, I agree that we don't need to limit pending resources to max queue capacity when calculating pending resources. The concern for this fix is that user limit factor should be considered and limit the pending value. The max queue capacity will be considered during the offer stage of the preemption calculations.
I agree with your existing approach; user-limit should be capped by the max queue capacity as well.
One nit for LeafQueue changes:
minPendingAndPreemptable =
    Resources.componentwiseMax(Resources.none(),
        Resources.subtract(
            userNameToHeadroom.get(userName), minPendingAndPreemptable));
You don't need to do componentwiseMax here, since minPendingAndPreemptable <= headroom, and you can use subtractFrom to make the code simpler.
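
Roughly, the simplification being suggested (a fragment assuming the same variables as the snippet above; it also assumes the map holds mutable Resource objects, since Resources.subtractFrom modifies its first argument in place):

  // Since minPendingAndPreemptable was computed as min(headroom, pending),
  // the subtraction can never go negative, so the componentwiseMax clamp is
  // unnecessary and the user's remaining headroom can be reduced in place.
  Resources.subtractFrom(userNameToHeadroom.get(userName), minPendingAndPreemptable);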

Eric Payne
added a comment - 09/Nov/15 16:57 You don't need to do componentwiseMax here, since minPendingAndPreemptable <= headroom, and you can use subtractFrom to make the code simpler.
Wangda Tan, you are right, we do know that minPendingAndPreemptable <= headroom. Thanks for the catch. I will make those changes.

Eric Payne
added a comment - 12/Nov/15 03:10 Wangda Tan, attaching YARN-3769.005.patch with the changes we discussed.
I have another question that may be an enhancement:
In LeafQueue#getTotalPendingResourcesConsideringUserLimit, the calculation of headroom is as follows in this patch:
Resource headroom = Resources.subtract(
    computeUserLimit(app, resources, user, partition,
        SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY),
    user.getUsed(partition));
Would it be more efficient to just do the following?
Resource headroom =
    Resources.subtract(user.getUserResourceLimit(), user.getUsed());

Wangda Tan
added a comment - 16/Nov/15 18:37 Eric Payne, thanks for the update:
Would it be more efficient to just do the following? ...
The problem is that getUserResourceLimit is not always updated by the scheduler. If a queue is not traversed by the scheduler, or the apps of a queue-user have a long heartbeat interval, the user resource limit could be stale.
I found that the 0005 patch for trunk computes the user limit every time, while the 0005 patch for 2.7 uses getUserResourceLimit.
Thoughts?

Eric Payne
added a comment - 17/Nov/15 21:35 Wangda Tan, thanks for your comments.
The problem is that getUserResourceLimit is not always updated by the scheduler. If a queue is not traversed by the scheduler, or the apps of a queue-user have a long heartbeat interval, the user resource limit could be stale.
Got it.
I found that the 0005 patch for trunk computes the user limit every time, while the 0005 patch for 2.7 uses getUserResourceLimit.
Yes, I was concerned about using the 2.7 version of computeUserLimit. It is different from the branch-2 and trunk versions, and it expects a required parameter which, in 2.7, is calculated in assignContainers based on an app's capability requests for a given container priority. I noticed that in branch-2 and trunk, this required parameter is just given the value of minimumAllocation.
So, in YARN-3769-branch-2.7.006.patch, I passed minimumAllocation as the required parameter of computeUserLimit.

Eric Payne
added a comment - 18/Nov/15 22:09 Attaching YARN-3769-branch-2.7.007.patch.
TestLeafQueue was failing for the previous patch. The other failing tests all pass for me in my local build environment except TestResourceTrackerService, which may be related to YARN-4317.

Eric Payne
added a comment - 26/Nov/15 16:10 Could you check if the 2.7 commit applies cleanly to branch-2.6? If not, it would be great if you could post a 2.6 patch. Thanks.
Sangjin Lee, sure, I can do that.

Eric Payne
added a comment - 01/Dec/15 19:21 Attaching YARN-3769-branch-2.6.001.patch for backport to branch-2.6.
The TestLeafQueue unit test for multiple apps by multiple users had to be modified to allow all apps to be active at the same time, since the way active apps are calculated differs between 2.6 and 2.7.

Eric Payne
added a comment - 04/Feb/16 16:33 Hi Eric Payne, it looks like the patch for branch-2.6 has some problem:
Sorry about that, Junping Du. I built and tested against a stale branch-2.6 branch.
This new patch (YARN-3769-branch-2.6.002.patch) should apply cleanly and works well for me.