Details

Description

Since YARN is a general-purpose system, it would be useful for several applications (MPI et al.) to specify not just memory but also CPU (cores) in their resource requirements. Thus, the CapacityScheduler should account for both.

Sub-Tasks

Activity

Eli Collins
added a comment - 15/Jan/13 18:56 Could we change the name of ResourceMemoryCpuComparator to something more like DefaultMultiResourceComparator? I think ResourceMemoryCpuNetworkBandwidthDiskStorageGPUComparator is a bit long, but it is the direction we are headed in.
The problem with naming a class "DefaultResource" is that changing the default in the future is a pain. While I prefer MemoryResourceCalculator (it's explicit, and I don't think we'll see lots of different resource calculators) something like SingleResourceComparator fixes the naming issue and also won't have the issue of a new class for every policy. I filed YARN-340 to address this.

Arun C Murthy
added a comment - 09/Jan/13 05:17 I just committed this. Thanks for all the discussion, reviews and feedback. Please feel free to open follow-ups and I'll promise to be very attentive. Thanks again!

Arun C Murthy
added a comment - 08/Jan/13 12:15 Fixed TestRMWebServicesCapacitySched (had to fix the test) - any final comments?
I think it's good to go, for now I'll commit after jenkins okays it since it's getting harder to maintain this largish patch. We can fix nits etc. post-commit. Thanks.

Arun C Murthy
added a comment - 29/Dec/12 03:47 Sandy Ryza See the discussion between Bikas & I w.r.t definition of a virtual-core. That should take care of your concerns.
Here is AWS's take on an EC2 Compute Unit viz. similar to the 'virtual core': http://aws.amazon.com/ec2/faqs/#What_is_an_EC2_Compute_Unit_and_why_did_you_introduce_it
However, I don't think we are in a position, yet, to define a virtual-core with certainty (should we choose a 2010 Xeon or a bleeding edge 2012 X5000 series?) - hence the leeway for admins until we get more experience with this.
IAC, as Robert Joseph Evans pointed out, we can debate endlessly or run with something reasonable for now - the apis are marked @Evolving, and we'll likely change once we learn more. Essentially, cpu-scheduling is an experimental feature for now.

Karthik Kambatla (Inactive)
added a comment - 28/Dec/12 18:52
for consistency, we would also need to report this measure wherever resource requests and consumption are reported (web UI, metrics, command line). Once we expect the user to think about it in a certain way, is there a strong reason for having a different model internally?
Very valid points, Sandy. I agree we should have a strong motivation to support a model with particular usecases and advantages. Let me think more about this.
I like Bikas' idea of supporting a baseline virtual core (1 GHz core) - that doesn't need us to maintain two different versions (one for the user, and one internally), and it also brings standardization to otherwise heterogeneous clusters.
Let us think more about this and start a new JIRA with a relatively more concrete proposal. We have already hijacked this JIRA.

Sandy Ryza
added a comment - 28/Dec/12 17:55 The idea of virtual cores seems unintuitive to me. Choosing how much of each resource to request is a difficult, somewhat undefined, task already. If I want to try to decide how much CPU to request for my job/task the thing I'd think to do would be to run it locally and see what percentage it takes up, which wouldn't be a perfect measure for a number of reasons, but would probably suffice most of the time. To have to look up how many virtual cores the machines are assigned and then try to translate to an integer factor of that seems unnecessary and confusing.
This becomes even more difficult if I want to run a job against multiple clusters, each with different numbers of virtual cores per node. While vmem-to-pmem can also vary across clusters, it is tied directly as a knob to oversubscription, likely does not vary by orders of magnitude, and has a clear meaning in terms of what is going on in an operating system. On the other hand, virtual cores conflate oversubscription with request granularity - on one cluster my request for a virtual core might mean a quarter of the CPU that it does on another cluster, because the former wants to support finer granularity.
As Karthik says, we might be able to provide a different view for job submission that translates some more intuitive measure to virtual cores, but for consistency, we would also need to report this measure wherever resource requests and consumption are reported (web UI, metrics, command line). Once we expect the user to think about it in a certain way, is there a strong reason for having a different model internally?

Karthik Kambatla (Inactive)
added a comment - 27/Dec/12 20:38 By the way, the API could be still be using a float. And, users can specify share in terms of a physical core. We can translate it to the appropriate virtual cores. For instance, if a user asks for 0.5 core on a node with 8 cores, it turns out to be 64 virtual cores.
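
Karthik's arithmetic above can be sketched as a tiny helper (a hypothetical illustration; the class and method names are not from the patch, and the 1024 virtual cores per node is the figure used in this discussion):

```java
// Hypothetical sketch of translating a fractional physical-core request
// into integral virtual cores, per the example in the comment above.
public class VirtualCoreTranslator {

    // Assumed total virtual cores per node (cgroups-style parity).
    static final int VIRTUAL_CORES_PER_NODE = 1024;

    // 0.5 physical core on an 8-core node is 1/16 of the node,
    // i.e. (0.5 / 8) * 1024 = 64 virtual cores.
    static int toVirtualCores(float physicalCores, int physicalCoresOnNode) {
        return Math.round(
            (physicalCores / physicalCoresOnNode) * VIRTUAL_CORES_PER_NODE);
    }

    public static void main(String[] args) {
        System.out.println(toVirtualCores(0.5f, 8)); // prints 64
    }
}
```

This keeps the user-facing API in fractional physical cores while the scheduler works on integers internally, which is exactly the translation Karthik proposes.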

Karthik Kambatla (Inactive)
added a comment - 27/Dec/12 20:32 Thanks Arun, that should be okay. Created YARN-294 for the same, and added you as a watcher there.
Just to continue discussion on cpu shares (I understand we are not addressing this here in this JIRA):
IIUC, picking a high number of virtual cores (2^x, e.g. x=10) helps with:
Requesting less than a full core. The higher the number, the smaller the share one can get. We could achieve this through a float as well, but we know the issues with float precision and floating-point operations.
Asking for more resources than available - for x=10, all 1024 shares can be requested, leading to sharing of the physical cores.
Weighted sharing of the resources asked for. Say A asks for 256 and B asks for 768; the cores can be shared in a 1:3 ratio, as they should be.
I agree this is not the only way to achieve these. Also, I agree that the number of virtual cores need not be 1024 as in cgroups; if we go this path, it doesn't hurt to pick that number though.
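
The weighted-sharing point can be illustrated with a minimal sketch (the class and method names are invented for illustration; the 1024 total comes from the comment):

```java
// Sketch of weighted sharing with virtual-core "shares": when a node
// is fully subscribed, each request receives CPU in proportion to the
// shares it asked for relative to the total outstanding shares.
public class WeightedShares {

    static final int TOTAL_SHARES = 1024; // 2^10, per the comment

    // Fraction of the node's CPU a request receives.
    static double cpuFraction(int requestedShares, int totalRequestedShares) {
        return (double) requestedShares / totalRequestedShares;
    }

    public static void main(String[] args) {
        // A asks for 256 and B asks for 768 -> the CPU splits 1:3.
        System.out.println(cpuFraction(256, 256 + 768)); // 0.25
        System.out.println(cpuFraction(768, 256 + 768)); // 0.75
    }
}
```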

Arun C Murthy
added a comment - 27/Dec/12 19:49 Anyway, thanks Bobby - looks like we are mostly on the same page.
Appreciate a final set of reviews - like I said, it looks good to go.
Karthik Kambatla let's revisit your final point later, IAC, we are debating a private implementation function, not an api.

Bikas Saha
added a comment - 27/Dec/12 19:03 I am not against shares if there are technical reasons going for it. However, "same terminology as cgroups" makes me wary of introducing more Linux concepts into Hadoop code.
Arun, do we have a baseline value for what a virtual core is (e.g. a 1 GHz Intel Xeon 2010 core)? Like Amazon's classification of CPU for their VMs. Is it published in the docs or somewhere in the code? Or are we planning to make it configurable, so that 1 virtual core on cluster A is not comparable to 1 virtual core on cluster B? I am trying to understand whether virtual cores are a standardization or merely a multiplicative artifact that gives us finer granularity.

Alejandro Abdelnur
added a comment - 27/Dec/12 18:17 Is there any reason not to use shares instead of CPUs? By using shares:
We use the same terminology as the cgroup CPU controller, which we will leverage for enforcing CPU utilization.
It allows giving a % of a CPU.
It allows over-subscribing (by issuing more shares).

Robert Joseph Evans
added a comment - 27/Dec/12 18:06 I chatted with Arun off line a bit about this, and he pointed out to me that the APIs are marked as Evolving, I should read the patch more closely next time. So I am OK with putting it in with the API as it is. I still think that having a float for the API is preferable, but until we actually start using it in practice we will not know what the real issues are.

Karthik Kambatla (Inactive)
added a comment - 27/Dec/12 17:25 Sure, we can fix the nits in follow up JIRAs. I am not very particular, but the following one comment would be nice to fix here (for fixed API), unless you think it is not a good idea.
Following up on earlier discussions about this, passing int dominance instead of boolean dominant to #getResourceAsValue would address both performance and API concerns. For now, we can check for (dominance == 1).
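
A hedged sketch of the suggested signature change - an int rank in place of the boolean flag - might look like the following (the class name, parameters, and numbers are illustrative, not from the patch):

```java
// Sketch of Karthik's suggestion: an int "dominance" rank
// (0 = most dominant resource, 1 = the next, ...) extends cleanly
// to more than two resources, unlike a boolean flag.
public class DominanceSketch {

    // Returns the share of cluster capacity for the resource at the
    // given dominance rank; just memory and cpu here for illustration.
    static float getResourceAsValue(int memory, int cpu,
                                    int clusterMemory, int clusterCpu,
                                    int dominance) {
        float memShare = (float) memory / clusterMemory;
        float cpuShare = (float) cpu / clusterCpu;
        // dominance == 0 -> dominant share; dominance == 1 -> the other.
        return (dominance == 0) ? Math.max(memShare, cpuShare)
                                : Math.min(memShare, cpuShare);
    }

    public static void main(String[] args) {
        // 4 GB / 2 vcores on a 16 GB / 16-vcore cluster:
        // the memory share (0.25) dominates the cpu share (0.125).
        System.out.println(getResourceAsValue(4, 2, 16, 16, 0)); // 0.25
        System.out.println(getResourceAsValue(4, 2, 16, 16, 1)); // 0.125
    }
}
```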

Karthik Kambatla (Inactive)
added a comment - 27/Dec/12 17:19
If all we do is have the request be a float and then round up to an int for processing I am fine with that, but I don't want to lock us into such a coarse granularity in the future.
Makes sense. +1

Robert Joseph Evans
added a comment - 27/Dec/12 17:11 Arun,
I still have bigger concerns than just "naming nits etc." If you want to put this in I am OK with the implementation, I am OK with allowing overcommitment on cores, but I really do not want to see the API stuck with the request for a core being an int. If all we do is have the request be a float and then round up to an int for processing I am fine with that, but I don't want to lock us into such a coarse granularity in the future.

Arun C Murthy
added a comment - 27/Dec/12 17:02 Thanks for catching the bug in CSQueueUtils Karthik Kambatla , I've fixed computeMaxActiveApplications to use ResourceCalculator.ratio.
I've also added javadocs for ResourceCalculator which should help future maintenance.
Overall, I think the patch is ready to go - we can fix naming nits etc. separately - I'm having a hard time keeping a giant patch up-to-date. Thanks.

Arun C Murthy
added a comment - 27/Dec/12 16:30 Bikas Saha the original critique of using 'integral' cores was that it would lead to under-utilization of CPUs if certain workloads were very CPU-light, hence the need for a floating-point spec. The one problem with that spec is that it gets very hard to deal with heterogeneous clusters, i.e. you need to be able to say "I need 0.25 CPU at 2.0GHz" or "I need 1.5 CPU at 2.5GHz." Furthermore, similar to memory, you need a minimum CPU spec, e.g. 0.25 CPUs, to ensure we don't fragment resources too finely.
So, rather than make a more complicated spec (#cpus and cpu-frequency etc. and a minimum #cpus/cpu-freq) I propose we normalize to an integral number of 'virtual cores'. This way we get the required 'minimum', i.e. 1 virtual-core, and built-in support for heterogeneous systems and over-subscription, i.e. we can control the #virtual-cores on each node depending on their individual characteristics.

Bikas Saha
added a comment - 26/Dec/12 02:24 Could you please elaborate on your proposal of virtual cores a bit more. Specifically, around your ideas for heterogeneous cores and over-subscription. That may help clarify some of the questions raised in other comments.
Though I don't understand the suggestions fully, I would be wary of implicitly linking RM logic with cgroups. Other than the unwritten dependency, it also might make life harder for the ongoing Windows port in YARN-191.
Also, other than the case for ~0 CPU tasks, what are the other scenarios for floating cores? IMO we could just specify 0 cores for such tasks. It's safe because we cannot run an infinite number of them due to other resource constraints like memory. I am not quite sure how/when a non-integral CPU requirement would be needed.

Karthik Kambatla (Inactive)
added a comment - 25/Dec/12 19:13 Thanks Arun. The patch looks great, quite excited to see multi-resource scheduling.
Few comments (mostly nits):
Should we set the total # of virtual cores to 1024 (parity with cgroups), and hence set the virtual-to-physical ratio statically at startup? For instance, on a machine with 8 cores, we would set the ratio to 128.
ResourceCalculator#divideAndCeil: LOG.debug() seems more appropriate than LOG.info()
Should we rename getResourceAsValue() to getResourceAsNormalizedValue() for better readability?
Following up on earlier discussions about this, passing int dominanceLevel instead of boolean dominant would address both performance and API concerns. For now, we can check for (dominanceLevel == 1).
MultiResourceCalculator does show different implementations of divide() and ratio(), but it is hard to understand the difference from the code or comments. Maybe we need better names to describe exactly what they do.
Also, DefaultResourceCalculator defines ratio() in terms of divide(). Theoretically, shouldn't it be the other way round? I might be wrong here, but I am still confused about the two.
CSQueueUtils and LeafQueue seem to be using divide() and ratio() interchangeably.
Should make the stepFactor argument names consistent across ResourceCalculator, DefaultCalculator and MultiResourceCalculator. stepFactor and factor might mean different things.
FairScheduler comments - shouldn't the comment be saying "ensuring" instead of "insuring"?

Arun C Murthy
added a comment - 24/Dec/12 04:40 Andrew - thanks for the review, I've incorporated most of your comments. I have a couple more to go, but I wanted to put it up to discuss. In particular, thanks for catching SchedulerUtils.normalizeRequests.
Some responses:
MultiResourceCalculator: I've chosen the current impl. to provide a slight perf advantage wherein I don't need to compare the 2nd-most dominant resource if I don't have to. Once we add more resources it will be harder/uglier to do the same, but for now it seems worth it.
I thought I already responded to Vinod, but I missed it - ratio and divide are actually very different - one uses the notion of 'dominant resource' but the other doesn't (see MultiResourceCalculator) and hence the need for two different apis.
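
One possible reading of the ratio/divide distinction, sketched with invented helpers (this is an illustration of the idea, not the patch's code; the numbers are made up):

```java
// Hedged sketch of why ratio() and divide() can give different answers
// in the multi-resource case: one works on dominant shares relative to
// the cluster, the other compares resources component-wise.
public class RatioVsDivide {

    // Divide using the dominant-resource notion: reduce each operand
    // to its dominant share of the cluster, then divide those shares.
    static float divide(int clusterMem, int clusterCpu,
                        int memA, int cpuA, int memB, int cpuB) {
        float domA = Math.max((float) memA / clusterMem, (float) cpuA / clusterCpu);
        float domB = Math.max((float) memB / clusterMem, (float) cpuB / clusterCpu);
        return domA / domB;
    }

    // Straight component-wise ratio, taking the larger component;
    // no cluster-relative dominance involved.
    static float ratio(int memA, int cpuA, int memB, int cpuB) {
        return Math.max((float) memA / memB, (float) cpuA / cpuB);
    }

    public static void main(String[] args) {
        // Cluster: 100 GB / 10 cores. A = 10 GB / 4 cores, B = 40 GB / 1 core.
        // Both have dominant share 0.4, so divide() says 1.0 ...
        System.out.println(divide(100, 10, 10, 4, 40, 1)); // 1.0
        // ... while the component-wise ratio is dominated by cpu: 4/1.
        System.out.println(ratio(10, 4, 40, 1)); // 4.0
    }
}
```

The example shows why two separate APIs can be needed: the two computations agree only in the single-resource case.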

Arun C Murthy
added a comment - 24/Dec/12 04:34 Ok, I finally got around to finishing this up.
W.r.t cores, on some more thinking, I'm inclined to go along with the concept of integral 'virtual cores' instead of a float-precision 'cores' for the following reasons:
It provides a level of indirection to deal with heterogeneous cores, which is much more important for CPUs (as opposed to memory, disk b/w etc.). I've also added a notion of physical-to-virtual cores translation (yarn.nodemanager.vcores-pcores-ratio), per NodeManager, similar to the physical-to-virtual memory translation (yarn.nodemanager.vmem-pmem-ratio) that we already have in place.
It ensures we do minimal floating-point operations in the inner-most loop, which are very expensive (e.g. we dropped usage of Math.ceil in MAPREDUCE-1354 for the JobTracker - Math.ceil is a JNI call). This is something I've been very focused on since the dawn, which explains the integral implementations of divideAndCeil we've had since the beginning.
To make it clear, I've also renamed the apis to be (get,set)VirtualCores and marked them evolving - in future we can add (get,set)Cores after we finalize how we specify not just #cores, but also their capabilities (gigahertz?).
Thoughts?
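
The per-node translation Arun describes might be sketched as follows (the property name yarn.nodemanager.vcores-pcores-ratio comes from the comment; the helper and numbers are assumptions for illustration):

```java
// Minimal sketch of a NodeManager advertising virtual cores via a
// physical-to-virtual ratio, analogous to the vmem-pmem ratio for
// memory. Each node picks its own ratio based on its characteristics,
// which is how heterogeneity and over-subscription are handled.
public class VcoreAdvertisement {

    static int advertisedVirtualCores(int physicalCores, float vcoresPcoresRatio) {
        return Math.round(physicalCores * vcoresPcoresRatio);
    }

    public static void main(String[] args) {
        // An 8-core node with a ratio of 2.0 advertises 16 virtual cores,
        // over-subscribing; a weaker node could advertise fewer by using
        // a lower ratio.
        System.out.println(advertisedVirtualCores(8, 2.0f)); // 16
    }
}
```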

Andrew Ferguson
added a comment - 21/Oct/12 22:42 hi Arun,
this patch is looking GREAT! in particular, the ResourceCalculator class is super useful – I really like it. my version, without it, is definitely much harder to follow...
before some specific feedback, I want to say that I agree that cores should be floats/fractional-units for three reasons:
they make sense for long-running services, which may require little CPU, but should be available on each node, with the ease of having been scheduled by YARN.
this gives us a fine-grained knob for implementing dynamic re-adjustment one day; ie, I may want to increase an executing job's weight by 10%, or decrease by 15%, etc.
the publicly released traces of resource requests & usage in Google's cluster (to my knowledge, the only traces of their kind) include fractional amounts for CPU; having fractional CPU requests in YARN may make it easier to translate insights from that dataset to making better resource requests in a YARN cluster.
ok, here are some specific comments on the patch:
YarnConfiguration.java : duplicate import of com.google.common.base.Joiner
DefaultContainer.java : divideAndCeil explicitly uses the two-argument form of createResource to create a resource with 0 cores, whereas other Resources created in this calculator create resources with 1 core. this seems counter-intuitive to me, as divideAndCeil tends to result in an overestimate of resource consumption, rather than an underestimate . either way, perhaps a comment would be helpful, as it is the only time this method is used this way in the memory-only comparator
MultiResourceCalculator.java : in compare() , you are looking to order the resources by how dominant they are, and then compare by most-dominant resource, second most-dominant, etc. ... I think the boolean flag to getResourceAsValue() doesn't make this clear. with the flag, the question in my mind would be "wait, why would I want the non-dominant resource?". simply having a boolean flag makes extending to three or more resources less clean. I implemented this by treating each resource request as a vector, normalizing by clusterResources, and then sorting the components by dominance.
MultiResourceCalcuator.java , DefaultCalculator.java , Resources.java : for the multiplyAndNormalizeUp and multiplyAndNormalizeDown methods, consider renaming the third argument to "stepping" instead of "factor" is it's not a factor used for the multiplication, rather it's a unit of discretization to round to ("stepping" may not be the best word, but perhaps it's closer). just a thought...
CSQueueUtils.java : extra spaces in front of @Lock(CSQueue.class)
CapacityScheduler.java : in the allocate() method, there's a call to normalize the request (after a comment about sanity checks). currently, it only normalizes the memory; I think the patch should also normalize the number of CPUs requested, no?
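The normalization being asked for can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual CapacityScheduler code: round both memory and cores up to multiples of the configured minimum allocation.

```java
// Illustrative sketch: normalize a request in both dimensions, not just memory.
public class NormalizeRequest {
    // Round value up to the next multiple of step.
    static int roundUp(int value, int step) {
        return ((value + step - 1) / step) * step;
    }

    // Returns {memory, cores}, each raised to at least the minimum allocation
    // and rounded up to a multiple of it.
    static int[] normalize(int mem, int cores, int minMem, int minCores) {
        return new int[] { roundUp(Math.max(mem, minMem), minMem),
                           roundUp(Math.max(cores, minCores), minCores) };
    }

    public static void main(String[] args) {
        // With minimum-allocation-mb=1024 and minimum-allocation-cores=1:
        int[] r = normalize(1500, 3, 1024, 1);
        if (r[0] != 2048 || r[1] != 3) throw new AssertionError();
    }
}
```

A zero or unset request normalizes to the minimum allocation, which also covers applications that have not been updated to request CPU.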
LeafQueue.java : in assignReservedContainer consider changing Resources.divide to Resources.ratio when calculating potentialNewCapacity (and the current capacity). While both calls "should" give the same result, ratio has fewer floating-point operations, and, better yet, is semantically what is meant in this case – we're calculating the ratio between (used + requested) and available. Frankly, this is perhaps something to take a closer look at (as Vinod Kumar Vavilapalli pointed out): whether both divide and ratio are needed, and if so, which should be used in each case.
Also, both ContainerTokenIdentifier.java and BuilderUtils.java assume that memory is the only resource; I'm not certain they should be updated, but I wanted to mention them just in case.
Oh, and should yarn-default.xml be updated with values for yarn.scheduler.minimum-allocation-cores and yarn.scheduler.maximum-allocation-cores ?
Hope this helps, Arun! depending on how the discussion of integral vs fractional cores shakes out, I think this patch is good to go.
cheers,
Andrew

I agree with Bobby's case as it maps cleanly to cgroups CPU assignment, which is based on CPU shares. Another alternative would be to directly use the concept of CPU shares as well; then it can be an INT and the default value would be 1024 (same as cgroups). What would be provisioned to the RM as node capacity would be a total number of shares, typically max(1024, 1024 * (# cores - 1)) (reserving 1 core's shares for the OS/NM/DN). And if you want over-subscription, you just bump up the number of shares available. Thoughts?
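A tiny sketch of the shares arithmetic proposed here. The class and method names are hypothetical, and the node formula is read as max(...) since the stated intent is to reserve one core's shares while still exposing some capacity on a one-core box:

```java
// Illustrative cgroup CPU-shares arithmetic, per the comment's proposal.
public class CpuShares {
    // Node capacity in shares, reserving one core's worth for the OS/NM/DN
    // (assumed reading of the formula in the comment above).
    static int nodeShares(int cores) {
        return Math.max(1024, 1024 * (cores - 1));
    }

    // A container asking for c cores would receive c * 1024 shares,
    // matching the cgroups default of 1024 shares per core.
    static int containerShares(int cores) {
        return cores * 1024;
    }

    public static void main(String[] args) {
        if (nodeShares(8) != 7168) throw new AssertionError();
        if (nodeShares(1) != 1024) throw new AssertionError();
        if (containerShares(2) != 2048) throw new AssertionError();
    }
}
```

Over-subscription under this scheme is just provisioning more total shares than 1024 * physical cores; shares are relative weights, not hard caps.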

Alejandro Abdelnur
added a comment - 17/Oct/12 18:03

What does requesting 1 CPU really mean and how is it different from requesting 1.8? To me 1 CPU means that for this particular container I want to be guaranteed that it gets at least 1 full CPU core to itself for computation at any point in time it needs it, very similar to what requesting 3000MB of memory does. It is a bit more ambiguous because 1 CPU on box A is not necessarily equivalent to 1 CPU on box B. But this JIRA already makes the assumption that they are close enough to being equivalent. It gives me as a user of the container a chance to set a lower bound on the amount of resources that I am guaranteed to be able to use. In practice this probably means that the kernel will give at least X% of the available CPU time to the processes running in that container, if those processes are runnable, where X = CPU requested/Total CPU cores on the box.

1.8 CPUs to me means a few things. First, the person requesting this was either a machine or was overly ambitious in trying to get an exact value. Second, the container will probably get 2 CPU cores, because just like with memory I would expect the scheduler to round it up to the nearest multiple of a scheduling unit. I proposed initially that quarter or even half CPU marks are probably sufficient. We can always round up and remove precision with a float. It is very hard to go back the other way and add precision to an int. I am fine with the CPU number being a float and the scheduling unit being 1 CPU for the first go-around. I just want the door left open so we can easily adjust things if we find a need to.
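As a rough illustration of the quarter-core rounding proposed above (a hypothetical helper, not part of any patch):

```java
// Illustrative sketch: round a fractional CPU request up to the nearest
// quarter core, the smallest scheduling unit Bobby suggests is sufficient.
public class QuarterCore {
    static float roundToQuarter(float cores) {
        return (float) (Math.ceil(cores * 4) / 4.0);
    }

    public static void main(String[] args) {
        if (roundToQuarter(1.8f) != 2.0f) throw new AssertionError();
        if (roundToQuarter(0.1f) != 0.25f) throw new AssertionError();
        if (roundToQuarter(1.0f) != 1.0f) throw new AssertionError();
    }
}
```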

Over-subscribing makes sense, but it also has a lot of pitfalls. You have to take into account that resource utilization is not constant. A process can use very little of a resource and then all of a sudden start to use lots of it. Is the Resource request a guarantee of those resources, or is it just a best effort to provide them? I see situations where users would want both, and perhaps if we do support over-subscribing we need to support something like nice on POSIX.

Robert Joseph Evans
added a comment - 17/Oct/12 16:41 Arun, I still disagree with the #cores being an int.

Arun C Murthy
added a comment - 17/Oct/12 08:46 Thanks for the f/b Vinod & Bobby, I've incorporated most of your f/b:
Renamed the interface and implementations
Removed clusterResource from ResourceCalculator
I've kept the apis in Resources to ensure consistency of apis and to provide an indirection to fix stuff later.
W.r.t #cores, I spent more time thinking about this - it seems to me we are better off leaving the cores as integers, dynamically monitoring container usage, and then over-subscribing as necessary.
I don't see how users/applications can specify 1.8 cores or 1.38 cores etc.
My thinking is that they can ask for integer cores and then the system (RM/NM) can monitor usage and dynamically over-subscribe. We can also discuss this separately in another jira. Thoughts?

I'd prefer to use a float instead of an int because it will avoid user confusion about what the integer part of the number means, especially when dealing with clusters with large numbers of cores in their nodes.

NM memory and CPU configs

Currently these values are coming from the config of the NM; we should be able to obtain those values from the OS (ie, in the case of Linux from /proc/meminfo & /proc/cpuinfo). As this is highly OS dependent, we should have an interface that obtains this information. In addition, implementations of this interface should be able to specify a mem/cpu offset (the amount of mem/cpu not to be made available as YARN resources), which would allow reserving mem/cpu for the OS and other services outside of YARN containers.
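A minimal sketch of what the Linux side of such an interface might look like. This is entirely hypothetical; a real implementation would need to be more robust and pluggable per OS:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical sketch: detect node memory and cores from the OS, then
// subtract a configured offset reserved for the OS/NM/DN.
public class LinuxNodeResources {
    // MemTotal from /proc/meminfo, in kB.
    static long memInfoKb() throws IOException {
        try (BufferedReader r = new BufferedReader(new FileReader("/proc/meminfo"))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.startsWith("MemTotal:")) {
                    return Long.parseLong(line.replaceAll("\\D+", ""));
                }
            }
        }
        return 0;
    }

    static int coreCount() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Capacity exposed to YARN = detected capacity minus the reserved offset.
    static long yarnMemoryMb(long offsetMb) throws IOException {
        return memInfoKb() / 1024 - offsetMb;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("cores=" + coreCount()
            + " yarnMemMb=" + yarnMemoryMb(1024));
    }
}
```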

I did a quick search and I couldn't find a JIRA for this, opening YARN-160.

Alejandro Abdelnur
added a comment - 15/Oct/12 17:37

Num cores to be a float to allow a bit of over-subscription like it is possible today without any concept of cores. Granted this can be simulated by artificially increasing the number of cores in configuration, but we should think of a more appropriate way.

I'd vote for doing this sooner rather than later. Simulation doesn't seem very clean because it would force us to expose the lie to the applications requesting resources (right?). Even if we keep it as an int and express it in tenths, I think that would be quite sufficient. Also, it's not entirely about over-subscription: small tasks which spend a lot of time sitting around waiting for things to happen can't accurately reflect that otherwise.
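The tenths representation suggested here could look like this (illustrative helpers only, not from any patch):

```java
// Illustrative sketch: keep the API integral by counting tenths of a core,
// which still allows sub-core requests for mostly-idle tasks.
public class TenthCores {
    // Round to the nearest tenth of a core.
    static int toTenths(double cores) {
        return (int) Math.round(cores * 10);
    }

    static double fromTenths(int tenths) {
        return tenths / 10.0;
    }

    public static void main(String[] args) {
        if (toTenths(1.8) != 18) throw new AssertionError();
        if (fromTenths(18) != 1.8) throw new AssertionError();
    }
}
```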

Nathan Roberts
added a comment - 27/Sep/12 19:09

Alright, the patch needs some minor upmerge to the latest trunk. Other than that, a few comments: lots of text, but mostly minor issues:

api.records.Resource should perhaps no longer be Comparable after these changes?

ResourceComparator:

It seems unnatural for the comparator to store clusterResource; is it possible to just pass in clusterResource wherever needed? I see compare() -> getResourceAsValue() and divide() -> getResourceAsValue() using this. For all the traces of divide(), I verified that clusterResource can be passed all the way through. For the compare() method, see below.

Haven't read the paper yet, but in the current impl, we may end up comparing cpus to memory when comparing two resources. Is that intended? If not, even compare() won't need to read clusterResource.

ResourceComparator is doing much more than comparisons - maybe call it something like ResourceSchedulingCalculator?

divide() seems to be completely unrelated to divideAndCeil() looking at the implementation. One of the names needs to change, not sure which.

Not sure how to differentiate between ratio() and divide() either. The implementations do differ a lot; I couldn't figure out which old references of the division operator were replaced by divide() and which ones by ratio(). The names can be made more explicit; failing that, at least the documentation should be.

Resources.java

multiplyAndNormalizeUp() and multiplyAndNormalizeDown(), roundUp(), roundDown(), ratio(), divide(), divideAndCeil() and equals() are an unnecessary level of indirection to ResourceComparator and can be removed? lessThan(), lessThanOrEqual(), greaterThan(), greaterThanOrEqual(), min(), max() are useful though.

createResource(int memory) constructor is going to be a problem when other resources/scheduling-algos come in. We should rename it, but okay doing it later.

Could we change the name of ResourceMemoryCpuComparator to something more like DefaultMultiResourceComparator? I think ResourceMemoryCpuNetworkBandwidthDiskStorageGPUComparator is a bit long, but it is the direction we are headed in.

I too agree. Given this is also going to be public (albeit admin facing) configuration, I am +1 for something like MultiResourceComparator. Thoughts?

Problems worth considering in follow-up JIRAs:

Num cores to be a float to allow a bit of over-subscription like it is possible today without any concept of cores. Granted this can be simulated by artificially increasing the number of cores in configuration, but we should think of a more appropriate way.

Some calculations, like those in LeafQueue, have become very hard to read; we can rewrite them to calculate one item per line

Vinod Kumar Vavilapalli
added a comment - 09/Sep/12 03:00

Arun C Murthy
added a comment - 21/Aug/12 06:31 Thanks for the fix Junping, I've incorporated it.
The latest patch looks good to go, all unit tests work and I've run sample jobs on a real cluster.
Can anyone pls take a look? Tx.

For the two @Ignore issues in TestFairScheduler, I found they reveal two tiny bugs:
testChoiceOfPreemptedContainers: There is something wrong with Resource.equals() when it compares two objects where one is a ResourcePBImpl (a normal resource object) and the other is a Resource (Resource.NONE). We should check that the compared obj is an instance of Resource rather than requiring them to be the same class (that is the general way of implementing equals(), but it is not suitable here).
testPreemptionDecision: There is something wrong with Resources.createResource(memory): if memory == 0, then we should set cores = 0 there, as the caller is trying to create a none resource. Thoughts?
Based on these two findings, I updated the code a bit and attached it as YARN-2-help.patch, which passes local unit tests (with the YARN-2 patch applied). Arun, please feel free to merge it into your patch if you think it is helpful.
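The equals() change described here can be sketched with simplified stand-in classes (these are not the real YARN record classes):

```java
// Illustrative sketch of the instanceof-based equals() fix: a ResourcePBImpl
// should compare equal to a plain Resource with the same values, which an
// exact getClass() check would forbid.
public class ResourceEquals {
    static class Resource {
        int memory, cores;
        Resource(int m, int c) { memory = m; cores = c; }
        @Override public boolean equals(Object obj) {
            if (!(obj instanceof Resource)) return false;  // not getClass() ==
            Resource other = (Resource) obj;
            return memory == other.memory && cores == other.cores;
        }
        @Override public int hashCode() { return 31 * memory + cores; }
    }
    static class ResourcePBImpl extends Resource {
        ResourcePBImpl(int m, int c) { super(m, c); }
    }

    public static void main(String[] args) {
        Resource none = new Resource(0, 0);
        // With a getClass() check this would be false despite equal fields.
        if (!new ResourcePBImpl(0, 0).equals(none)) throw new AssertionError();
    }
}
```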

Junping Du
added a comment - 14/Aug/12 06:51

Arun C Murthy
added a comment - 02/Aug/12 19:14 Oops, my bad. I forgot to point that out.
I will need some help from someone on FairScheduler - I don't know enough about it and am not sure why those tests failed, given my (almost) non-existent changes there in this patch. Thanks.

Thanks Arun. The changes look good. I still want something that would allow a task that uses almost no CPU to indicate that. I don't think float is the correct solution, but we need something. I am fine if you want to punt on that, but I would like to see us mark the API as unstable until we can come up with some sort of a solution.

I am also concerned about the addition of @Ignore to some of the tests.

Robert Joseph Evans
added a comment - 02/Aug/12 17:55

also, I'm happy to port pieces of my ginormous patch over to this if you'd like – while the majority of the patch I posted is test cases (which may or may not match the semantics of your DRF implementation due to decisions about edge cases), other pieces such as the FIFO support, the web GUI, and the metrics code might save you some time.

Andrew Ferguson
added a comment - 02/Aug/12 00:51 @acmurthy: you bet! I should have time this week to read this over.
cheers,
Andrew

Arun C Murthy
added a comment - 02/Aug/12 00:34 Thanks for reviews Bobby. I've incorporated all except the CPU one - not sure if fraction is the right one to go for right now...
Andrew - if you have time, could you pls take a look too? Thanks.

My comments are mostly the same ones that I had for the previous patches.

I don't really like having the resource comparator class configuration be specific to the scheduler. I would prefer to see it be available for both the fifo and the capacity scheduler. This is very minor, but it enforces consistency between the schedulers.

The more I think about it, the more I want to see the ability to request only part of a core. I don't think we need to make it a true float. Perhaps we should round up to the closest quarter of a core, but requiring full-core increments is too coarse a measure. I think we are going to get bad cluster utilization unless we can do a more fine-grained approach.

Also inside LeafQueue.java and ParentQueue.java there is some code that was refactored to use the new Resources Methods, but the original code is still there, just commented out. Please clean this up.

Robert Joseph Evans
added a comment - 25/Jul/12 19:27 I have looked through the patch somewhat quickly.

Arun C Murthy
added a comment - 24/Jul/12 20:54 Here is an updated patch which is complete.
It's already too large, and since it doesn't modify existing behaviour, I think it can go in to unblock other patches while I add more unit tests via an aux-jira.

awesome, thanks for the update Arun. I just finished reading through your commits. so far, your patch looks a lot like mine, which is great! hopefully that means our logic is correct. I like that you pulled more of the division and rounding code into the ResourceComparator, and out of CSQueueUtils to keep it modular; I didn't think to do that.

I have a few suggestions for you (all of which I learned after writing test cases):

1) In ResourceMemoryCpuComparator (renamed "DefaultMultiResourceComparator" in my patch), I found that a simple "if lhs.equals(rhs) return 0;" was needed at the start – after dividing by the cluster resources, two identical resource requests might appear to be different due to floating point issues.

2) In the same class, I found that I needed to normalize the resources (by the cluster's resources), and then sort them to compare two resources which consume the same amount of their most-dominant resource, but differing amounts of their 2nd-most-dominant resource. This is important when checking that you don't exceed a resource limit (eg, "greaterThan(comparator, consumed, limit)") – it may be that I'm within the limit for CPUs (which is the dominant resource), but exceeding the limit for memory (which is not my dominant resource).

3) In resourcemanager.resource.Resources, when multiplying CPUs by a float, because CPUs is an int, I needed two versions: one which rounded-up, and one which rounded-down. Calculating queueMaxCap was the only time I needed the round-down version. Technically, this is also needed for memory (since it is also an int), but as long as we only allocate memory in units of at least, say, 128 MB (as is current practice in the code), the extra bits in the int (0 bytes - 128 MB) are actually serving as a store for the fractional part! and thus, the existing roundUp() and roundDown() functions (from CSQueueUtils) suffice.
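Suggestion 3 can be illustrated with a pair of hypothetical helpers (names assumed; the stepping unit here plays the role of the discretization unit discussed earlier in the thread):

```java
// Illustrative sketch: round-up and round-down variants of multiplying an
// int-valued resource by a float, normalized to a stepping unit. The round-up
// form suits limits a request must fit under; the round-down form suits caps
// like queueMaxCap that must never be exceeded.
public class RoundedMultiply {
    static int multiplyAndNormalizeUp(int value, float by, int step) {
        return (int) (Math.ceil((double) value * by / step) * step);
    }

    static int multiplyAndNormalizeDown(int value, float by, int step) {
        return (int) (Math.floor((double) value * by / step) * step);
    }

    public static void main(String[] args) {
        // 4096 MB * 0.3 = 1228.8 MB; with a 128 MB stepping:
        if (multiplyAndNormalizeUp(4096, 0.3f, 128) != 1280) throw new AssertionError();
        if (multiplyAndNormalizeDown(4096, 0.3f, 128) != 1152) throw new AssertionError();
    }
}
```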

Andrew Ferguson
added a comment - 16/Jul/12 22:21
cheers,
Andrew

Arun C Murthy
added a comment - 07/Jul/12 04:38 Andrew - I have an updated version of my CS patch which differs significantly. I'll post it over the weekend and you can review and provide f/b. Ok? Thanks.

I did update the Capacity Scheduler to schedule and account for CPU in the most recent patch. The key updates are in CSQueueUtils, LeafQueue, and ParentQueue, and they are quite heavily tested by the fully updated Capacity Scheduler test suite.

To update the Capacity Scheduler, I followed the logic from DRF, taking your dominant resource's share as the capacity you are consuming.

best,
Andrew

ps – Do you mind if I re-set the Patch Available flag? While this patch passes the tests I ran locally (`mvn test` in hadoop-mapreduce-project/hadoop-yarn/), I am curious what the Apache buildbot thinks of it. thanks!

Andrew Ferguson
added a comment - 07/Jul/12 02:58 hi Arun,
a branch sounds like a great idea. thanks!

Arun C Murthy
added a comment - 07/Jul/12 01:42 Andrew - I think we should break this down into multiple jiras and probably even work on a branch.
I'll open new jiras and assign some over while I finish up the CS, ok? Thanks.

This extended and updated version now includes tests, and support for CPU cores information throughout the resource manager.

It also incorporates the feedback from Robert above.

Although this patch is very large, the bulk of the code is either 1) new and updated tests, or 2) updates to the RM and NM webapps, queue metrics, etc., which all need to be updated to display CPU cores as well.

While obviously it would be easier to read this patch if it were split into pieces, the new tests for CPU as a schedulable resource require the updated queue metrics and accounting, creating an inter-dependency. I am certainly open to suggestions from anyone who sees how to split this patch into chunks!

I have tested this patch locally, and it appears to pass the YARN and MapReduce test suites.

Andrew Ferguson
added a comment - 06/Jul/12 23:44
your comments and patience appreciated.
thanks,
Andrew

Thanks for your feedback! since I posted the earlier update, I've been pushing it to completion: adding CPU core information to the queue metrics, resource manager web interface, etc. I've also been adding test cases and ensuring that the new patch passes existing test cases as well. currently, the patch is failing just a few unit tests, but I expect it will be done in a day or two.

as the patch has grown quite large (the diff is pushing 7000 lines..), it's clear we want to minimize the cost of adding a third resource. as it is, most of the diff is new testing. I will strive to keep function calls as general as possible (eg, "Resource r" instead of "int memory, float cores"), but there are quite a few places where we want to consider each resource separately since the math can be different, and it should be clear to anyone adding additional resources that they need to consider something in that function's logic.

Regarding applications which haven't been updated for CPU cores, and might submit a request with 0 or NULL, my current patch does round the request to the minimum resource request, so those applications will be fine. (not sure if the currently attached patch does this)

Regarding "spare capacity" – I think this is one of the differences between the capacity scheduler and the fair scheduler. should the capacity not in use (or leftover capacity from queues which can't fill it because of the new multi-dimensional nature of resources) be simply split over the queues based on their capacity percentages? or should that capacity be treated as a single pool, and allocations be made treating the capacity percentages as weights? (this is more of a Fair Sched approach). anyway, I agree, that should probably be left as a separate JIRA, or perhaps simply left to the Fair Scheduler.

Andrew Ferguson
added a comment - 05/Jul/12 20:10 hi Robert,
Thanks for you feedback! since I posted the earlier update, I've been pushing it to completion: adding CPU core information to the queue metrics, resource manager web interface, etc. I've also been adding test cases and ensuring that the new patch passes existing test cases as well. currently, the patch is failing just a few unit tests, but I expect it will be done in a day or two.
as the patch has grown quite large (the diff is pushing 7000 lines..), it's clear we want to minimize the cost of adding a third resource. as it is, most of the diff is new testing. I will strive to keep function calls as general as possible (eg, "Resource r" instead of "int memory, float cores"), but there are quite a few places where we want to consider each resource separately since the math can be different, and it should be clear to anyone adding additional resources that they need to consider something in that function's logic.
Regarding applications which haven't been updated for CPU cores and might submit a request with 0 or NULL: my current patch does round the request up to the minimum resource request, so those applications will be fine. (I'm not sure if the currently attached patch does this.)
Regarding "spare capacity" – I think this is one of the differences between the capacity scheduler and the fair scheduler. Should the capacity not in use (or leftover capacity from queues which can't fill it because of the new multi-dimensional nature of resources) be simply split over the queues based on their capacity percentages? Or should that capacity be treated as a single pool, with allocations made treating the capacity percentages as weights? (This is more of a Fair Scheduler approach.) Anyway, I agree, that should probably be left as a separate JIRA, or perhaps simply left to the Fair Scheduler.
I'll incorporate your other points (e.g., comparator name, ASF license) in my updated patch.
thanks!
Andrew


Robert Joseph Evans
added a comment - 05/Jul/12 18:24 Andrew, sorry it has taken me so long to get to this. Thanks to both you and Arun for taking this up. It is something that is going to be really great when it is done.
I have a few comments on the code.
ResourceComparator.java needs to have the Apache License Header in it.
I don't really like having the resource comparator class configuration be specific to the scheduler. I would prefer to see it be available for both the fifo and the capacity scheduler. This is very minor, but it enforces consistency between the schedulers.
In a few places like SchedulerNode, we still operate on each of the resources separately.
this.availableResource.setMemory(node.getTotalCapability().getMemory());
this.availableResource.setCores(node.getTotalCapability().getCores());
It would really be nice to be able to abstract some of that away, like with the comparisons, so that if we add in new resources in the future, we do not need to change the code again. (This is also very minor)
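One hedged sketch of that abstraction (class and enum names here are illustrative, not YARN's actual API): represent each resource kind as an enum entry, so a single generic copy loop replaces per-resource setter pairs like the one above, and adding a third resource only extends the enum.

```java
import java.util.EnumMap;
import java.util.Map;

public class ResourceVector {
    // Hypothetical resource kinds; add a third entry here and the
    // copy loop below picks it up automatically.
    public enum Kind { MEMORY_MB, CORES }

    private final Map<Kind, Long> values = new EnumMap<>(Kind.class);

    public long get(Kind k) { return values.getOrDefault(k, 0L); }
    public void set(Kind k, long v) { values.put(k, v); }

    // Replaces paired calls like setMemory(...); setCores(...).
    public void copyFrom(ResourceVector other) {
        for (Kind k : Kind.values()) {
            set(k, other.get(k));
        }
    }
}
```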
To answer your question about computeSlotMillis, I would say no, but I am open to others' opinions on this. This counter is there to try and maintain backwards compatibility. It was intended to indicate how much total resources the job used, i.e. how many milliseconds this job held a slot that no one else could use. Because there are no real slots any more, I would prefer to see this metric deprecated, and replaced with something that breaks it down by the resources involved. But that is probably for a separate JIRA, because it is a potentially complex question.
I would like better protections against someone passing in a 0 or null for the number of CPU cores in YARN. For MR I see a new default for the number of cores being requested, but I don't see an equivalent in plain YARN. This is mostly because CPU cores are being added in, and I can see other applications, like the distributed shell, not being updated, which could result in all kinds of issues. It would be great if, when no CPU request is given, we default to 1.
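A minimal sketch of such a guard, with hypothetical names (this is not the patch's actual code): treat a null or non-positive core request as a request for one core.

```java
public class CoreRequestGuard {
    // Fall back to one core when an application that predates the CPU
    // resource submits 0 (or no value at all) for cores.
    public static int normalizeCores(Integer requestedCores) {
        if (requestedCores == null || requestedCores <= 0) {
            return 1;
        }
        return requestedCores;
    }
}
```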
Could we change the name of ResourceMemoryCpuComparator to something more like DefaultMultiResourceComparator? I think ResourceMemoryCpuNetworkBandwithDiskStorageGPUComparator is a bit long, but it is the direction we are headed in.
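Whatever the class ends up being called, the comparison it encapsulates can be sketched as follows. This is a toy illustration of dominant-resource ordering in the spirit of DRF, not the patch's actual comparator; all names are hypothetical.

```java
public class DominantShareSketch {
    // An application's dominant share is its largest usage fraction
    // across all resources (here: memory and cores).
    public static double dominantShare(long memUsed, long coresUsed,
                                       long memTotal, long coresTotal) {
        double memShare = (double) memUsed / memTotal;
        double cpuShare = (double) coresUsed / coresTotal;
        return Math.max(memShare, cpuShare);
    }

    // Orders applications so the one with the smaller dominant share
    // comes first, i.e. is considered for allocation first.
    public static int compare(long memA, long coresA,
                              long memB, long coresB,
                              long memTotal, long coresTotal) {
        return Double.compare(
            dominantShare(memA, coresA, memTotal, coresTotal),
            dominantShare(memB, coresB, memTotal, coresTotal));
    }
}
```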
Do we want to be able to schedule only part of a core (make the resource a float, not an int)? For a Map or Reduce task we typically only want 1 CPU, but for some things like the MR AM, unless it is a very big job, even 0.5 cores is overkill for what it does.
This is just a cursory look but I like what I see.
To chime in on some of your questions
Are you planning to change the definition of a queue's capacity?
I think this could be something very useful, but should probably be done on a separate JIRA.
Do you plan to change how spare capacity is allocated?
What do you mean by spare capacity? Do you mean capacity that is not currently in use? If so I would love to see a patch that does this, so that I can run gridmix on it both ways and see what the results are.
Are you planning to support priorities or weights within the queues?
I would also like to see something like this happen, but from discussions I have had in the past, at least for the MRV1 case we would need something like preemption to be able to avoid some potential deadlocks. I could be wrong about that here, because the resource allocation now behaves differently with respect to priority, but either way I think that discussion is something that should be done on a separate JIRA to avoid blocking this coming in.
Thanks again to both you and Arun for doing this.

Andrew Ferguson
added a comment - 23/Jun/12 02:40 This is an updated version of my previous patch. Of particular note, it fixes a typo in the original patch in Resources' subtractFrom() – the original patch double-subtracted memory.
I have successfully run MapReduce jobs using this patch, and they will request both memory and cpu cores. So far, I have only tested this with the FIFO scheduler.
When combined with MAPREDUCE-4351 and MAPREDUCE-4334 , the requested cpu share is also enforced.
One discussion point I want to raise is: currently the number of requested cores is an integer. Do we want to support fractional cores as well?
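For concreteness, the subtractFrom() typo mentioned above can be illustrated with a toy version (field names here are hypothetical, not the actual Resources class): the fix is simply to ensure each resource component is decremented exactly once, rather than applying the memory subtraction twice.

```java
public class ResourceSketch {
    long memory;
    long cores;

    public ResourceSketch(long memory, long cores) {
        this.memory = memory;
        this.cores = cores;
    }

    // Corrected behavior: each component is subtracted exactly once.
    public void subtractFrom(ResourceSketch other) {
        this.memory -= other.memory;
        this.cores -= other.cores;
    }
}
```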

Andrew Ferguson
added a comment - 22/Jun/12 18:54 I've amended Arun's original patch to also pass the number of cores via the ContainerStartMonitoringEvent. With this version, the patch in MAPREDUCE-4334 can be used to enforce CPU weights.


Andrew Ferguson
added a comment - 11/Jun/12 19:04 Hi Arun,
I'm excited to see this started – I'm quite interested in the multi-resource scheduling problem. After reading through the patch, I have a few questions for you; hopefully this feedback will be helpful.
First off, I want to confirm my understanding is correct: this patch is designed to allocate resources to jobs within the same capacity queue based on the DRF-inspired ordering of their need for resources. It is not designed to do weighted DRF for the complete cluster. If I'm mistaken, perhaps some of my feedback may not apply.
1) Are you planning to change the definition of a queue's capacity? Currently, it is defined as a fractional percentage of the parent queue's total memory. Alternatively, queues could be specified with a fractional percentage of each resource. E.g., I could have one queue with "75% CPU and 50% RAM" and a second with "25% CPU and 50% RAM".
2) Do you plan to change how spare capacity is allocated? My understanding is that it's currently shared proportionally, based on the queue capacities, an approach that seems like it would be intuitive for cluster operators. With a multi-resource setup, however, running DRF on the pool of spare resources would provide higher utilization. (I can provide an example of this if you'd like.)
3) Are you planning to support priorities or weights within the queues? IIRC, this was supported in the MR1 scheduler, and the DRF paper describes a weighted extension.
4) Lastly, with the increasing flexibility of the YARN scheduler, I think it makes sense to better support heterogeneous clusters. Currently, yarn.nodemanager.resource.memory-mb is a constant across the cluster, but with a scheduler capable of packing differently shaped resource containers onto each node, heterogeneous nodes would be a natural extension. (This is more of an observation than a question.)
Looking forward to further discussions.
cheers,
Andrew

Arun C Murthy
added a comment - 08/Jun/12 08:07 Initial (incomplete) sketch of DRF for YARN CS... hopefully this provides enough context for folks to get a good feel.
Essentially I've added the ability for applications to ask for cores along with memory, and a configurable resource-comparator for the CS to implement DRF-like multi-resource scheduling.
Thoughts?
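As a sketch of what such a configurable comparator could look like to an operator, consider a capacity-scheduler.xml fragment along these lines. The property and class names below are illustrative assumptions rather than what this patch ships; a similar knob did later appear in Hadoop as yarn.scheduler.capacity.resource-calculator.

```xml
<!-- capacity-scheduler.xml: hypothetical example only -->
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <!-- A DRF-style calculator that compares dominant shares across
       memory and cores, instead of comparing memory alone. -->
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```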