*Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
2018-07-30 19:35 ` skannan@ 2018-07-31 7:59 ` Quentin Perret
2018-07-31 19:31 ` skannan
2018-08-02 12:33 ` Peter Zijlstra1 sibling, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-07-31 7:59 UTC (permalink / raw)
To: skannan
Cc: peterz, rjw, linux-kernel, linux-pm, gregkh, mingo,
dietmar.eggemann, morten.rasmussen, chris.redpath,
patrick.bellasi, valentin.schneider, vincent.guittot,
thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
currojerez, javi.merino, linux-pm-owner
On Monday 30 Jul 2018 at 12:35:27 (-0700), skannan@codeaurora.org wrote:
[...]
> If it's going to be a different aggregation from what's done for frequency
> guidance, I don't see the point of having this inside schedutil. Why not
> keep it inside the scheduler files?
This code basically results from a discussion we had with Peter on v4.
Keeping everything centralized can make sense from a maintenance
perspective, I think. That makes it easy to see the impact of any change
to utilization signals for both EAS and schedutil.
> Also, it seems weird to use a governor's
> code when it might not actually be in use. What if someone is using
> ondemand, conservative, performance, etc?
Yeah I thought about that too ... I would say that even if you don't
use schedutil, it is probably a fair assumption from the scheduler's
standpoint to assume that somewhat OPPs follow utilization (in a very
loose way). So yes, if you use ondemand with EAS you won't have a
perfectly consistent match between the frequency requests and what EAS
predicts, and that might result in sub-optimal decisions in some cases,
but I'm not sure if we can do anything better at this stage.
Also, if you do use schedutil, EAS will accurately predict what will be
the frequency _request_, but that gives you no guarantee whatsoever that
you'll actually get it for real (because you're throttled, or because of
thermal capping, or simply because the HW refuses it for some reason ...).
There will be inconsistencies between EAS' predictions and the actual
frequencies, and we have to live with that. The best we can do is make
sure we're at least internally consistent from the scheduler's
standpoint, and that's why I think it can make sense to factorize as
many things as possible with schedutil where applicable.
> > + if (type == frequency_util) {
> > + /*
> > + * Bandwidth required by DEADLINE must always be granted
> > + * while, for FAIR and RT, we use blocked utilization of
> > + * IDLE CPUs as a mechanism to gracefully reduce the
> > + * frequency when no tasks show up for longer periods of
> > + * time.
> > + *
> > + * Ideally we would like to set bw_dl as min/guaranteed
> > + * freq and util + bw_dl as requested freq. However,
> > + * cpufreq is not yet ready for such an interface. So,
> > + * we only do the latter for now.
> > + */
> > + util += cpu_bw_dl(rq);
> > + }
>
> Instead of all this indentation, can't you just return early without doing
> the code inside the if?
But then I'll need to duplicate the 'min' below, so not sure if it's
worth it ?
> > +enum schedutil_type {
> > + frequency_util,
> > + energy_util,
> > +};
>
> Please don't use lower case for enums. It's extremely confusing.
Ok, I'll change that in v6.
Thanks !
Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
2018-07-31 7:59 ` Quentin Perret@ 2018-07-31 19:31 ` skannan
2018-08-01 7:32 ` Rafael J. Wysocki0 siblings, 1 reply; 72+ messages in thread
From: skannan @ 2018-07-31 19:31 UTC (permalink / raw)
To: Quentin Perret
Cc: peterz, rjw, linux-kernel, linux-pm, gregkh, mingo,
dietmar.eggemann, morten.rasmussen, chris.redpath,
patrick.bellasi, valentin.schneider, vincent.guittot,
thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
currojerez, javi.merino, linux-pm-owner
On 2018-07-31 00:59, Quentin Perret wrote:
> On Monday 30 Jul 2018 at 12:35:27 (-0700), skannan@codeaurora.org
> wrote:
> [...]
>> If it's going to be a different aggregation from what's done for
>> frequency
>> guidance, I don't see the point of having this inside schedutil. Why
>> not
>> keep it inside the scheduler files?
>
> This code basically results from a discussion we had with Peter on v4.
> Keeping everything centralized can make sense from a maintenance
> perspective, I think. That makes it easy to see the impact of any
> change
> to utilization signals for both EAS and schedutil.
In that case, I'd argue it makes more sense to keep the code centralized
in the scheduler. The scheduler can let schedutil know about the
utilization after it aggregates them. There's no need for a cpufreq
governor to know that there are scheduling classes or how many there
are. And the scheduler can then choose to aggregate one way for task
packing and another way for frequency guidance.
It just seems so weird to have logic that's very essential for task
placement to be inside a cpufreq governor.
>> Also, it seems weird to use a governor's
>> code when it might not actually be in use. What if someone is using
>> ondemand, conservative, performance, etc?
>
> Yeah I thought about that too ... I would say that even if you don't
> use schedutil, it is probably a fair assumption from the scheduler's
> standpoint to assume that somewhat OPPs follow utilization (in a very
> loose way). So yes, if you use ondemand with EAS you won't have a
> perfectly consistent match between the frequency requests and what EAS
> predicts, and that might result in sub-optimal decisions in some cases,
> but I'm not sure if we can do anything better at this stage.
>
> Also, if you do use schedutil, EAS will accurately predict what will be
> the frequency _request_, but that gives you no guarantee whatsoever
> that
> you'll actually get it for real (because you're throttled, or because
> of
> thermal capping, or simply because the HW refuses it for some reason
> ...).
>
> There will be inconsistencies between EAS' predictions and the actual
> frequencies, and we have to live with that. The best we can do is make
> sure we're at least internally consistent from the scheduler's
> standpoint, and that's why I think it can make sense to factorize as
> many things as possible with schedutil where applicable.
>
>> > + if (type == frequency_util) {
>> > + /*
>> > + * Bandwidth required by DEADLINE must always be granted
>> > + * while, for FAIR and RT, we use blocked utilization of
>> > + * IDLE CPUs as a mechanism to gracefully reduce the
>> > + * frequency when no tasks show up for longer periods of
>> > + * time.
>> > + *
>> > + * Ideally we would like to set bw_dl as min/guaranteed
>> > + * freq and util + bw_dl as requested freq. However,
>> > + * cpufreq is not yet ready for such an interface. So,
>> > + * we only do the latter for now.
>> > + */
>> > + util += cpu_bw_dl(rq);
>> > + }
>>
>> Instead of all this indentation, can't you just return early without
>> doing
>> the code inside the if?
>
> But then I'll need to duplicate the 'min' below, so not sure if it's
> worth it ?
I feel like less indentation where reasonably possible leads to more
readability. But I don't have a strong opinion in this specific case.
>> > +enum schedutil_type {
>> > + frequency_util,
>> > + energy_util,
>> > +};
>>
>> Please don't use lower case for enums. It's extremely confusing.
>
> Ok, I'll change that in v6.
Thanks.
-Saravana
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
2018-07-31 19:31 ` skannan@ 2018-08-01 7:32 ` Rafael J. Wysocki
2018-08-01 8:23 ` Quentin Perret0 siblings, 1 reply; 72+ messages in thread
From: Rafael J. Wysocki @ 2018-08-01 7:32 UTC (permalink / raw)
To: Saravana Kannan, Quentin Perret
Cc: Peter Zijlstra, Rafael J. Wysocki, Linux Kernel Mailing List,
Linux PM, Greg Kroah-Hartman, Ingo Molnar, Dietmar Eggemann,
Morten Rasmussen, Chris Redpath, Patrick Bellasi,
Valentin Schneider, Vincent Guittot, Thara Gopinath,
Viresh Kumar, Todd Kjos, Joel Fernandes, Steve Muckle, adharmap,
skannan, Pavan Kondeti, Juri Lelli, Eduardo Valentin,
Srinivas Pandruvada, currojerez, Javi Merino, linux-pm-owner
On Tue, Jul 31, 2018 at 9:31 PM, <skannan@codeaurora.org> wrote:
> On 2018-07-31 00:59, Quentin Perret wrote:
>>
>> On Monday 30 Jul 2018 at 12:35:27 (-0700), skannan@codeaurora.org wrote:
>> [...]
>>>
>>> If it's going to be a different aggregation from what's done for
>>> frequency
>>> guidance, I don't see the point of having this inside schedutil. Why not
>>> keep it inside the scheduler files?
>>
>>
>> This code basically results from a discussion we had with Peter on v4.
>> Keeping everything centralized can make sense from a maintenance
>> perspective, I think. That makes it easy to see the impact of any change
>> to utilization signals for both EAS and schedutil.
>
>
> In that case, I'd argue it makes more sense to keep the code centralized in
> the scheduler. The scheduler can let schedutil know about the utilization
> after it aggregates them. There's no need for a cpufreq governor to know
> that there are scheduling classes or how many there are. And the scheduler
> can then choose to aggregate one way for task packing and another way for
> frequency guidance.
Also the aggregate utilization may be used by cpuidle governors in
principle to decide how deep they can go with idle state selection.
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
2018-08-01 8:35 ` Rafael J. Wysocki@ 2018-08-01 9:23 ` Quentin Perret
2018-08-01 9:40 ` Rafael J. Wysocki
2018-08-02 13:04 ` Peter Zijlstra0 siblings, 2 replies; 72+ messages in thread
From: Quentin Perret @ 2018-08-01 9:23 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Saravana Kannan, Peter Zijlstra, Rafael J. Wysocki,
Linux Kernel Mailing List, Linux PM, Greg Kroah-Hartman,
Ingo Molnar, Dietmar Eggemann, Morten Rasmussen, Chris Redpath,
Patrick Bellasi, Valentin Schneider, Vincent Guittot,
Thara Gopinath, Viresh Kumar, Todd Kjos, Joel Fernandes,
Steve Muckle, adharmap, skannan, Pavan Kondeti, Juri Lelli,
Eduardo Valentin, Srinivas Pandruvada, currojerez, Javi Merino,
linux-pm-owner
On Wednesday 01 Aug 2018 at 10:35:32 (+0200), Rafael J. Wysocki wrote:
> On Wed, Aug 1, 2018 at 10:23 AM, Quentin Perret <quentin.perret@arm.com> wrote:
> > On Wednesday 01 Aug 2018 at 09:32:49 (+0200), Rafael J. Wysocki wrote:
> >> On Tue, Jul 31, 2018 at 9:31 PM, <skannan@codeaurora.org> wrote:
> >> >> On Monday 30 Jul 2018 at 12:35:27 (-0700), skannan@codeaurora.org wrote:
> >> >>> If it's going to be a different aggregation from what's done for
> >> >>> frequency
> >> >>> guidance, I don't see the point of having this inside schedutil. Why not
> >> >>> keep it inside the scheduler files?
> >> >>
> >> >> This code basically results from a discussion we had with Peter on v4.
> >> >> Keeping everything centralized can make sense from a maintenance
> >> >> perspective, I think. That makes it easy to see the impact of any change
> >> >> to utilization signals for both EAS and schedutil.
> >> >
> >> > In that case, I'd argue it makes more sense to keep the code centralized in
> >> > the scheduler. The scheduler can let schedutil know about the utilization
> >> > after it aggregates them. There's no need for a cpufreq governor to know
> >> > that there are scheduling classes or how many there are. And the scheduler
> >> > can then choose to aggregate one way for task packing and another way for
> >> > frequency guidance.
> >>
> >> Also the aggregate utilization may be used by cpuidle governors in
> >> principle to decide how deep they can go with idle state selection.
> >
> > The only issue I see with this right now is that some of the things done
> > in this function are policy decisions which really belong to the governor,
> > I think.
>
> Well, the scheduler makes policy decisions too, in quite a few places. :-)
That is true ... ;-) But not so much about frequency selection yet I guess
> The really important consideration here is whether or not there may be
> multiple governors making different policy decisions in that respect.
> If not, then where exactly the single policy decision is made doesn't
> particularly matter IMO.
I think some users of the aggregated utilization signal do want to make
slightly different decisions (I'm thinking about the RT-go-to-max thing
again which makes perfect sense in sugov, but could possibly hurt EAS).
So the "hard" part of this work is to figure out what really is a
governor-specific policy decision, and what is common between all users.
I put "hard" between quotes because I only see the case of RT as truly
sugov-specific for now.
If we also want a special case for DL, Peter's enum should work OK, and
enable to add more special cases for new users (cpuidle ?) if needed.
But maybe that is something for later ?
Thanks,
Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
2018-08-01 9:23 ` Quentin Perret@ 2018-08-01 9:40 ` Rafael J. Wysocki
2018-08-02 13:04 ` Peter Zijlstra1 sibling, 0 replies; 72+ messages in thread
From: Rafael J. Wysocki @ 2018-08-01 9:40 UTC (permalink / raw)
To: Quentin Perret
Cc: Rafael J. Wysocki, Saravana Kannan, Peter Zijlstra,
Rafael J. Wysocki, Linux Kernel Mailing List, Linux PM,
Greg Kroah-Hartman, Ingo Molnar, Dietmar Eggemann,
Morten Rasmussen, Chris Redpath, Patrick Bellasi,
Valentin Schneider, Vincent Guittot, Thara Gopinath,
Viresh Kumar, Todd Kjos, Joel Fernandes, Steve Muckle, adharmap,
skannan, Pavan Kondeti, Juri Lelli, Eduardo Valentin,
Srinivas Pandruvada, currojerez, Javi Merino, linux-pm-owner
On Wed, Aug 1, 2018 at 11:23 AM, Quentin Perret <quentin.perret@arm.com> wrote:
> On Wednesday 01 Aug 2018 at 10:35:32 (+0200), Rafael J. Wysocki wrote:
>> On Wed, Aug 1, 2018 at 10:23 AM, Quentin Perret <quentin.perret@arm.com> wrote:
>> > On Wednesday 01 Aug 2018 at 09:32:49 (+0200), Rafael J. Wysocki wrote:
>> >> On Tue, Jul 31, 2018 at 9:31 PM, <skannan@codeaurora.org> wrote:
>> >> >> On Monday 30 Jul 2018 at 12:35:27 (-0700), skannan@codeaurora.org wrote:
>> >> >>> If it's going to be a different aggregation from what's done for
>> >> >>> frequency
>> >> >>> guidance, I don't see the point of having this inside schedutil. Why not
>> >> >>> keep it inside the scheduler files?
>> >> >>
>> >> >> This code basically results from a discussion we had with Peter on v4.
>> >> >> Keeping everything centralized can make sense from a maintenance
>> >> >> perspective, I think. That makes it easy to see the impact of any change
>> >> >> to utilization signals for both EAS and schedutil.
>> >> >
>> >> > In that case, I'd argue it makes more sense to keep the code centralized in
>> >> > the scheduler. The scheduler can let schedutil know about the utilization
>> >> > after it aggregates them. There's no need for a cpufreq governor to know
>> >> > that there are scheduling classes or how many there are. And the scheduler
>> >> > can then choose to aggregate one way for task packing and another way for
>> >> > frequency guidance.
>> >>
>> >> Also the aggregate utilization may be used by cpuidle governors in
>> >> principle to decide how deep they can go with idle state selection.
>> >
>> > The only issue I see with this right now is that some of the things done
>> > in this function are policy decisions which really belong to the governor,
>> > I think.
>>
>> Well, the scheduler makes policy decisions too, in quite a few places. :-)
>
> That is true ... ;-) But not so much about frequency selection yet I guess
>
>> The really important consideration here is whether or not there may be
>> multiple governors making different policy decisions in that respect.
>> If not, then where exactly the single policy decision is made doesn't
>> particularly matter IMO.
>
> I think some users of the aggregated utilization signal do want to make
> slightly different decisions (I'm thinking about the RT-go-to-max thing
> again which makes perfect sense in sugov, but could possibly hurt EAS).
Fair enough.
> So the "hard" part of this work is to figure out what really is a
> governor-specific policy decision, and what is common between all users.
> I put "hard" between quotes because I only see the case of RT as truly
> sugov-specific for now.
OK
> If we also want a special case for DL, Peter's enum should work OK, and
> enable to add more special cases for new users (cpuidle ?) if needed.
> But maybe that is something for later ?
I agree. That can be done later if need be.
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
2018-08-01 9:23 ` Quentin Perret
2018-08-01 9:40 ` Rafael J. Wysocki@ 2018-08-02 13:04 ` Peter Zijlstra
2018-08-02 15:39 ` Quentin Perret1 sibling, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2018-08-02 13:04 UTC (permalink / raw)
To: Quentin Perret
Cc: Rafael J. Wysocki, Saravana Kannan, Rafael J. Wysocki,
Linux Kernel Mailing List, Linux PM, Greg Kroah-Hartman,
Ingo Molnar, Dietmar Eggemann, Morten Rasmussen, Chris Redpath,
Patrick Bellasi, Valentin Schneider, Vincent Guittot,
Thara Gopinath, Viresh Kumar, Todd Kjos, Joel Fernandes,
Steve Muckle, adharmap, skannan, Pavan Kondeti, Juri Lelli,
Eduardo Valentin, Srinivas Pandruvada, currojerez, Javi Merino,
linux-pm-owner
On Wed, Aug 01, 2018 at 10:23:27AM +0100, Quentin Perret wrote:
> On Wednesday 01 Aug 2018 at 10:35:32 (+0200), Rafael J. Wysocki wrote:
> > On Wed, Aug 1, 2018 at 10:23 AM, Quentin Perret <quentin.perret@arm.com> wrote:
> > > On Wednesday 01 Aug 2018 at 09:32:49 (+0200), Rafael J. Wysocki wrote:
> > >> On Tue, Jul 31, 2018 at 9:31 PM, <skannan@codeaurora.org> wrote:
> > >> >> On Monday 30 Jul 2018 at 12:35:27 (-0700), skannan@codeaurora.org wrote:
> > >> >>> If it's going to be a different aggregation from what's done for
> > >> >>> frequency
> > >> >>> guidance, I don't see the point of having this inside schedutil. Why not
> > >> >>> keep it inside the scheduler files?
> > >> >>
> > >> >> This code basically results from a discussion we had with Peter on v4.
> > >> >> Keeping everything centralized can make sense from a maintenance
> > >> >> perspective, I think. That makes it easy to see the impact of any change
> > >> >> to utilization signals for both EAS and schedutil.
> > >> >
> > >> > In that case, I'd argue it makes more sense to keep the code centralized in
> > >> > the scheduler. The scheduler can let schedutil know about the utilization
> > >> > after it aggregates them. There's no need for a cpufreq governor to know
> > >> > that there are scheduling classes or how many there are. And the scheduler
> > >> > can then choose to aggregate one way for task packing and another way for
> > >> > frequency guidance.
> > >>
> > >> Also the aggregate utilization may be used by cpuidle governors in
> > >> principle to decide how deep they can go with idle state selection.
> > >
> > > The only issue I see with this right now is that some of the things done
> > > in this function are policy decisions which really belong to the governor,
> > > I think.
> >
> > Well, the scheduler makes policy decisions too, in quite a few places. :-)
>
> That is true ... ;-) But not so much about frequency selection yet I guess
Well, sugov is part of the scheduler :-) It being so allows for the
co-ordinated decision making required for EAS.
> > The really important consideration here is whether or not there may be
> > multiple governors making different policy decisions in that respect.
> > If not, then where exactly the single policy decision is made doesn't
> > particularly matter IMO.
>
> I think some users of the aggregated utilization signal do want to make
> slightly different decisions (I'm thinking about the RT-go-to-max thing
> again which makes perfect sense in sugov, but could possibly hurt EAS).
>
> So the "hard" part of this work is to figure out what really is a
> governor-specific policy decision, and what is common between all users.
> I put "hard" between quotes because I only see the case of RT as truly
> sugov-specific for now.
>
> If we also want a special case for DL, Peter's enum should work OK, and
> enable to add more special cases for new users (cpuidle ?) if needed.
> But maybe that is something for later ?
Right, I don't mind moving the function. What I do oppose is having two
very similar functions in different translation units -- because then
they _will_ diverge and result in 'funny' things.
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
2018-08-02 12:45 ` Peter Zijlstra@ 2018-08-02 15:21 ` Quentin Perret
2018-08-02 17:36 ` Peter Zijlstra0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-02 15:21 UTC (permalink / raw)
To: Peter Zijlstra
Cc: skannan, rjw, linux-kernel, linux-pm, gregkh, mingo,
dietmar.eggemann, morten.rasmussen, chris.redpath,
patrick.bellasi, valentin.schneider, vincent.guittot,
thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
currojerez, javi.merino, linux-pm-owner
On Thursday 02 Aug 2018 at 14:45:11 (+0200), Peter Zijlstra wrote:
> On Thu, Aug 02, 2018 at 02:33:15PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 30, 2018 at 12:35:27PM -0700, skannan@codeaurora.org wrote:
> > > On 2018-07-24 05:25, Quentin Perret wrote:
> > > If it's going to be a different aggregation from what's done for frequency
> > > guidance, I don't see the point of having this inside schedutil. Why not
> > > keep it inside the scheduler files? Also, it seems weird to use a governor's
> > > code when it might not actually be in use. What if someone is using
> > > ondemand, conservative, performance, etc?
> >
> > EAS hard relies on schedutil -- I suppose we need a check for that
> > somewhere and maybe some infrastructure to pin the cpufreq governor.
>
> Either that or disable EAS when another governor is selected.
>
> > We're simply not going to support it for anything else.
>
> To clarify, it makes absolutely no sense what so ever to attempt EAS
> when the DVFS control is not coordinated.
I tend to agree with that, but at the same time even if we create a very
strong dependency on schedutil, we will have no guarantee that the actual
frequencies used on the platform are the ones we predicted in EAS.
There are a number of reasons why a frequency request might not be served
(throttling, thermal capping, something HW-related, ...), so it's hard
to enforce the EAS model in practice.
The way I see things, EAS needs to assume that OPPs follow utilization.
Sugov does something that looks like that too, and it's also in the
scheduler, so that makes sense to try and factorize things, especially
for maintenance purpose. But I feel like the correlation between the two
could stop here.
If you use some sort HW governor that tries to always have some idle time
on the CPUs, then the assumption that OPPs follow utilization isn't _totally_
wrong. There should be a (loose) relation between what EAS 'thinks'
and the reality. And if this isn't true, then you might make slightly
sub-optimal decisions, but I'm not sure if there is anything we can do
about it :/
The scheduler works with various models which, by definition, don't
always perfectly reflect the reality. But those models are useful
because they enable us to reason about things and make decisions. EAS uses
a model where OPPs follow utilization. I think it's just another model
to the list, and we can't really enforce it strictly in practice anyway,
so we will have to live with its inaccuracies I suppose ...
I hope that makes sense :-)
Thanks,
Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
2018-08-02 15:14 ` Vincent Guittot@ 2018-08-02 15:30 ` Quentin Perret
2018-08-02 15:55 ` Vincent Guittot0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-02 15:30 UTC (permalink / raw)
To: Vincent Guittot
Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
Morten Rasmussen, Chris Redpath, Patrick Bellasi,
Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
currojerez, Javi Merino
On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > Good point, setting the util_avg to 0 for new tasks should help
> > filtering out those tiny tasks too. And that would match with the idea
> > of letting tasks build their history before looking at their util_avg ...
> >
> > But there is one difference w.r.t frequency selection. The current code
> > won't mark the system overutilized, but will let sugov raise the
> > frequency when a new task is enqueued. So in case of a fork bomb, we
>
> If the initial value of util_avg is 0, we should not have any impact
> on the util_avg of the cfs rq on which the task is attached, isn't it
> ? so this should not impact both the over utilization state and the
> frequency selected by sugov or I'm missing something ?
What I tried to say is that setting util_avg to 0 for new tasks will
prevent schedutil from raising the frequency in case of a fork bomb, and
I think that could be an issue. And I think this isn't an issue with the
patch as-is ...
Sorry if that wasn't clear
> Then, select_task_rq_fair is called for a new task but util_avg is
> still 0 at that time in the current code so you will have consistent
> util_avg of the new task before and after calling
> find_energy_efficient_cpu
New tasks don't go in find_energy_efficient_cpu(), because, as you said,
they have no consistent util_avg yet when select_task_rq_fair() is called
for the first time.
Thanks,
Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
2018-08-02 13:04 ` Peter Zijlstra@ 2018-08-02 15:39 ` Quentin Perret
2018-08-03 13:04 ` Quentin Perret0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-02 15:39 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Rafael J. Wysocki, Saravana Kannan, Rafael J. Wysocki,
Linux Kernel Mailing List, Linux PM, Greg Kroah-Hartman,
Ingo Molnar, Dietmar Eggemann, Morten Rasmussen, Chris Redpath,
Patrick Bellasi, Valentin Schneider, Vincent Guittot,
Thara Gopinath, Viresh Kumar, Todd Kjos, Joel Fernandes,
Steve Muckle, adharmap, skannan, Pavan Kondeti, Juri Lelli,
Eduardo Valentin, Srinivas Pandruvada, currojerez, Javi Merino,
linux-pm-owner
On Thursday 02 Aug 2018 at 15:04:40 (+0200), Peter Zijlstra wrote:
> On Wed, Aug 01, 2018 at 10:23:27AM +0100, Quentin Perret wrote:
> > On Wednesday 01 Aug 2018 at 10:35:32 (+0200), Rafael J. Wysocki wrote:
> > > On Wed, Aug 1, 2018 at 10:23 AM, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > On Wednesday 01 Aug 2018 at 09:32:49 (+0200), Rafael J. Wysocki wrote:
> > > >> On Tue, Jul 31, 2018 at 9:31 PM, <skannan@codeaurora.org> wrote:
> > > >> >> On Monday 30 Jul 2018 at 12:35:27 (-0700), skannan@codeaurora.org wrote:
> > > >> >>> If it's going to be a different aggregation from what's done for
> > > >> >>> frequency
> > > >> >>> guidance, I don't see the point of having this inside schedutil. Why not
> > > >> >>> keep it inside the scheduler files?
> > > >> >>
> > > >> >> This code basically results from a discussion we had with Peter on v4.
> > > >> >> Keeping everything centralized can make sense from a maintenance
> > > >> >> perspective, I think. That makes it easy to see the impact of any change
> > > >> >> to utilization signals for both EAS and schedutil.
> > > >> >
> > > >> > In that case, I'd argue it makes more sense to keep the code centralized in
> > > >> > the scheduler. The scheduler can let schedutil know about the utilization
> > > >> > after it aggregates them. There's no need for a cpufreq governor to know
> > > >> > that there are scheduling classes or how many there are. And the scheduler
> > > >> > can then choose to aggregate one way for task packing and another way for
> > > >> > frequency guidance.
> > > >>
> > > >> Also the aggregate utilization may be used by cpuidle governors in
> > > >> principle to decide how deep they can go with idle state selection.
> > > >
> > > > The only issue I see with this right now is that some of the things done
> > > > in this function are policy decisions which really belong to the governor,
> > > > I think.
> > >
> > > Well, the scheduler makes policy decisions too, in quite a few places. :-)
> >
> > That is true ... ;-) But not so much about frequency selection yet I guess
>
> Well, sugov is part of the scheduler :-) It being so allows for the
> co-ordinated decision making required for EAS.
>
> > > The really important consideration here is whether or not there may be
> > > multiple governors making different policy decisions in that respect.
> > > If not, then where exactly the single policy decision is made doesn't
> > > particularly matter IMO.
> >
> > I think some users of the aggregated utilization signal do want to make
> > slightly different decisions (I'm thinking about the RT-go-to-max thing
> > again which makes perfect sense in sugov, but could possibly hurt EAS).
> >
> > So the "hard" part of this work is to figure out what really is a
> > governor-specific policy decision, and what is common between all users.
> > I put "hard" between quotes because I only see the case of RT as truly
> > sugov-specific for now.
> >
> > If we also want a special case for DL, Peter's enum should work OK, and
> > enable to add more special cases for new users (cpuidle ?) if needed.
> > But maybe that is something for later ?
>
> Right, I don't mind moving the function. What I do oppose is having two
> very similar functions in different translation units -- because then
> they _will_ diverge and result in 'funny' things.
Sounds good :-) Would kernel/sched/pelt.c be the right place then ? It's
cross-class and kinda pelt-related I guess
Thanks,
Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
2018-08-02 15:30 ` Quentin Perret@ 2018-08-02 15:55 ` Vincent Guittot
2018-08-02 16:00 ` Quentin Perret0 siblings, 1 reply; 72+ messages in thread
From: Vincent Guittot @ 2018-08-02 15:55 UTC (permalink / raw)
To: Quentin Perret
Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
Morten Rasmussen, Chris Redpath, Patrick Bellasi,
Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
currojerez, Javi Merino
On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > Good point, setting the util_avg to 0 for new tasks should help
> > > filtering out those tiny tasks too. And that would match with the idea
> > > of letting tasks build their history before looking at their util_avg ...
> > >
> > > But there is one difference w.r.t frequency selection. The current code
> > > won't mark the system overutilized, but will let sugov raise the
> > > frequency when a new task is enqueued. So in case of a fork bomb, we
> >
> > If the initial value of util_avg is 0, we should not have any impact
> > on the util_avg of the cfs rq on which the task is attached, isn't it
> > ? so this should not impact both the over utilization state and the
> > frequency selected by sugov or I'm missing something ?
>
> What I tried to say is that setting util_avg to 0 for new tasks will
> prevent schedutil from raising the frequency in case of a fork bomb, and
> I think that could be an issue. And I think this isn't an issue with the
> patch as-is ...
ok. So you also want to deal with fork bomb
Not sure that you don't have some problem with current proposal too
because select_task_rq_fair will always return prev_cpu because
util_avg and util_est are 0 at that time
>
> Sorry if that wasn't clear
>
> > Then, select_task_rq_fair is called for a new task but util_avg is
> > still 0 at that time in the current code so you will have consistent
> > util_avg of the new task before and after calling
> > find_energy_efficient_cpu
>
> New tasks don't go in find_energy_efficient_cpu(), because, as you said,
> they have no consistent util_avg yet when select_task_rq_fair() is called
> for the first time.
>
> Thanks,
> Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
2018-08-02 15:55 ` Vincent Guittot@ 2018-08-02 16:00 ` Quentin Perret
2018-08-02 16:07 ` Vincent Guittot0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-02 16:00 UTC (permalink / raw)
To: Vincent Guittot
Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
Morten Rasmussen, Chris Redpath, Patrick Bellasi,
Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
currojerez, Javi Merino
On Thursday 02 Aug 2018 at 17:55:24 (+0200), Vincent Guittot wrote:
> On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > Good point, setting the util_avg to 0 for new tasks should help
> > > > filtering out those tiny tasks too. And that would match with the idea
> > > > of letting tasks build their history before looking at their util_avg ...
> > > >
> > > > But there is one difference w.r.t frequency selection. The current code
> > > > won't mark the system overutilized, but will let sugov raise the
> > > > frequency when a new task is enqueued. So in case of a fork bomb, we
> > >
> > > If the initial value of util_avg is 0, we should not have any impact
> > > on the util_avg of the cfs rq on which the task is attached, isn't it
> > > ? so this should not impact both the over utilization state and the
> > > frequency selected by sugov or I'm missing something ?
> >
> > What I tried to say is that setting util_avg to 0 for new tasks will
> > prevent schedutil from raising the frequency in case of a fork bomb, and
> > I think that could be an issue. And I think this isn't an issue with the
> > patch as-is ...
>
> ok. So you also want to deal with fork bomb
> Not sure that you don't have some problem with current proposal too
> because select_task_rq_fair will always return prev_cpu because
> util_avg and util_est are 0 at that time
But find_idlest_cpu() should select a CPU using load in case of a forkee
no ?
Thanks,
Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
2018-08-02 16:00 ` Quentin Perret@ 2018-08-02 16:07 ` Vincent Guittot
2018-08-02 16:10 ` Quentin Perret0 siblings, 1 reply; 72+ messages in thread
From: Vincent Guittot @ 2018-08-02 16:07 UTC (permalink / raw)
To: Quentin Perret
Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
Morten Rasmussen, Chris Redpath, Patrick Bellasi,
Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
currojerez, Javi Merino
On Thu, 2 Aug 2018 at 18:00, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Thursday 02 Aug 2018 at 17:55:24 (+0200), Vincent Guittot wrote:
> > On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
> > >
> > > On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > > > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > Good point, setting the util_avg to 0 for new tasks should help
> > > > > filtering out those tiny tasks too. And that would match with the idea
> > > > > of letting tasks build their history before looking at their util_avg ...
> > > > >
> > > > > But there is one difference w.r.t frequency selection. The current code
> > > > > won't mark the system overutilized, but will let sugov raise the
> > > > > frequency when a new task is enqueued. So in case of a fork bomb, we
> > > >
> > > > If the initial value of util_avg is 0, we should not have any impact
> > > > on the util_avg of the cfs rq on which the task is attached, isn't it
> > > > ? so this should not impact both the over utilization state and the
> > > > frequency selected by sugov or I'm missing something ?
> > >
> > > What I tried to say is that setting util_avg to 0 for new tasks will
> > > prevent schedutil from raising the frequency in case of a fork bomb, and
> > > I think that could be an issue. And I think this isn't an issue with the
> > > patch as-is ...
> >
> > ok. So you also want to deal with fork bomb
> > Not sure that you don't have some problem with current proposal too
> > because select_task_rq_fair will always return prev_cpu because
> > util_avg and util_est are 0 at that time
>
> But find_idlest_cpu() should select a CPU using load in case of a forkee
> no ?
So you have to wait for the next tick that will set the overutilized
and disable the want_energy. Until this point, all new tasks will be
put on the current cpu
>
> Thanks,
> Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
2018-08-02 16:07 ` Vincent Guittot@ 2018-08-02 16:10 ` Quentin Perret
2018-08-02 16:38 ` Vincent Guittot0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-02 16:10 UTC (permalink / raw)
To: Vincent Guittot
Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
Morten Rasmussen, Chris Redpath, Patrick Bellasi,
Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
currojerez, Javi Merino
On Thursday 02 Aug 2018 at 18:07:49 (+0200), Vincent Guittot wrote:
> On Thu, 2 Aug 2018 at 18:00, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Thursday 02 Aug 2018 at 17:55:24 (+0200), Vincent Guittot wrote:
> > > On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
> > > >
> > > > On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > > > > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > > Good point, setting the util_avg to 0 for new tasks should help
> > > > > > filtering out those tiny tasks too. And that would match with the idea
> > > > > > of letting tasks build their history before looking at their util_avg ...
> > > > > >
> > > > > > But there is one difference w.r.t frequency selection. The current code
> > > > > > won't mark the system overutilized, but will let sugov raise the
> > > > > > frequency when a new task is enqueued. So in case of a fork bomb, we
> > > > >
> > > > > If the initial value of util_avg is 0, we should not have any impact
> > > > > on the util_avg of the cfs rq on which the task is attached, isn't it
> > > > > ? so this should not impact both the over utilization state and the
> > > > > frequency selected by sugov or I'm missing something ?
> > > >
> > > > What I tried to say is that setting util_avg to 0 for new tasks will
> > > > prevent schedutil from raising the frequency in case of a fork bomb, and
> > > > I think that could be an issue. And I think this isn't an issue with the
> > > > patch as-is ...
> > >
> > > ok. So you also want to deal with fork bomb
> > > Not sure that you don't have some problem with current proposal too
> > > because select_task_rq_fair will always return prev_cpu because
> > > util_avg and util_est are 0 at that time
> >
> > But find_idlest_cpu() should select a CPU using load in case of a forkee
> > no ?
>
> So you have to wait for the next tick that will set the overutilized
> and disable the want_energy. Until this point, all new tasks will be
> put on the current cpu
want_energy should always be false for forkees, because we set it only
for SD_BALANCE_WAKE.
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
2018-08-02 16:38 ` Vincent Guittot@ 2018-08-02 16:59 ` Quentin Perret
2018-08-03 7:48 ` Vincent Guittot0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-02 16:59 UTC (permalink / raw)
To: Vincent Guittot
Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
Morten Rasmussen, Chris Redpath, Patrick Bellasi,
Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
currojerez, Javi Merino
On Thursday 02 Aug 2018 at 18:38:01 (+0200), Vincent Guittot wrote:
> On Thu, 2 Aug 2018 at 18:10, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Thursday 02 Aug 2018 at 18:07:49 (+0200), Vincent Guittot wrote:
> > > On Thu, 2 Aug 2018 at 18:00, Quentin Perret <quentin.perret@arm.com> wrote:
> > > >
> > > > On Thursday 02 Aug 2018 at 17:55:24 (+0200), Vincent Guittot wrote:
> > > > > On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > >
> > > > > > On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > > > > > > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > > > > Good point, setting the util_avg to 0 for new tasks should help
> > > > > > > > filtering out those tiny tasks too. And that would match with the idea
> > > > > > > > of letting tasks build their history before looking at their util_avg ...
> > > > > > > >
> > > > > > > > But there is one difference w.r.t frequency selection. The current code
> > > > > > > > won't mark the system overutilized, but will let sugov raise the
> > > > > > > > frequency when a new task is enqueued. So in case of a fork bomb, we
> > > > > > >
> > > > > > > If the initial value of util_avg is 0, we should not have any impact
> > > > > > > on the util_avg of the cfs rq on which the task is attached, isn't it
> > > > > > > ? so this should not impact both the over utilization state and the
> > > > > > > frequency selected by sugov or I'm missing something ?
> > > > > >
> > > > > > What I tried to say is that setting util_avg to 0 for new tasks will
> > > > > > prevent schedutil from raising the frequency in case of a fork bomb, and
> > > > > > I think that could be an issue. And I think this isn't an issue with the
> > > > > > patch as-is ...
> > > > >
> > > > > ok. So you also want to deal with fork bomb
> > > > > Not sure that you don't have some problem with current proposal too
> > > > > because select_task_rq_fair will always return prev_cpu because
> > > > > util_avg and util_est are 0 at that time
> > > >
> > > > But find_idlest_cpu() should select a CPU using load in case of a forkee
> > > > no ?
> > >
> > > So you have to wait for the next tick that will set the overutilized
> > > and disable the want_energy. Until this point, all new tasks will be
> > > put on the current cpu
> >
> > want_energy should always be false for forkees, because we set it only
> > for SD_BALANCE_WAKE.
>
> Ah yes I forgot that point.
> But doesn't this break the EAS policy ? I mean each time a new task is
> created, we use the load to select the best CPU
If you really keep spawning new tasks all the time, yes EAS won't help
you, but there isn't a lot we can do :/. We need to have an idea of how
big a task is for EAS, and we obviously don't know that for new tasks, so
it's hard/dangerous to make assumptions.
So the proposal here is that if you only have forkees once in a while,
then those new tasks (and those new tasks only) will be placed using load
the first time, and then they'll fall under EAS control has soon as they
have at least a little bit of history. This _should_ happen without
re-enabling load balance spuriously too often, and that _should_ prevent
it from ruining the placement of existing tasks ...
As Peter already mentioned, a better way of solving this issue would be
to try to find the moment when the utilization signal has converged to
something stable (assuming that it converges), but that, I think, isn't
straightforward at all ...
Does that make any sense ?
Thanks,
Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
2018-08-02 15:21 ` Quentin Perret@ 2018-08-02 17:36 ` Peter Zijlstra
2018-08-03 12:42 ` Quentin Perret0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2018-08-02 17:36 UTC (permalink / raw)
To: Quentin Perret
Cc: skannan, rjw, linux-kernel, linux-pm, gregkh, mingo,
dietmar.eggemann, morten.rasmussen, chris.redpath,
patrick.bellasi, valentin.schneider, vincent.guittot,
thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
currojerez, javi.merino, linux-pm-owner
On Thu, Aug 02, 2018 at 04:21:11PM +0100, Quentin Perret wrote:
> On Thursday 02 Aug 2018 at 14:45:11 (+0200), Peter Zijlstra wrote:
> > To clarify, it makes absolutely no sense what so ever to attempt EAS
> > when the DVFS control is not coordinated.
>
> I tend to agree with that, but at the same time even if we create a very
> strong dependency on schedutil, we will have no guarantee that the actual
> frequencies used on the platform are the ones we predicted in EAS.
Sure; on x86 for example our micro-code does whatever. But using
schedutil we at least 'guide' it in the general direction we'd expect
with the control that is available.
Using a !schedutil governor doesn't even get us that and we're basically
running on random input without any feedback to close the loop. Not
something I feel we should support or care for.
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
2018-08-02 16:59 ` Quentin Perret@ 2018-08-03 7:48 ` Vincent Guittot
2018-08-03 8:18 ` Quentin Perret0 siblings, 1 reply; 72+ messages in thread
From: Vincent Guittot @ 2018-08-03 7:48 UTC (permalink / raw)
To: Quentin Perret
Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
Morten Rasmussen, Chris Redpath, Patrick Bellasi,
Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
currojerez, Javi Merino
On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Thursday 02 Aug 2018 at 18:38:01 (+0200), Vincent Guittot wrote:
> > On Thu, 2 Aug 2018 at 18:10, Quentin Perret <quentin.perret@arm.com> wrote:
> > >
> > > On Thursday 02 Aug 2018 at 18:07:49 (+0200), Vincent Guittot wrote:
> > > > On Thu, 2 Aug 2018 at 18:00, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > >
> > > > > On Thursday 02 Aug 2018 at 17:55:24 (+0200), Vincent Guittot wrote:
> > > > > > On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > > >
> > > > > > > On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > > > > > > > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > > > > > Good point, setting the util_avg to 0 for new tasks should help
> > > > > > > > > filtering out those tiny tasks too. And that would match with the idea
> > > > > > > > > of letting tasks build their history before looking at their util_avg ...
> > > > > > > > >
> > > > > > > > > But there is one difference w.r.t frequency selection. The current code
> > > > > > > > > won't mark the system overutilized, but will let sugov raise the
> > > > > > > > > frequency when a new task is enqueued. So in case of a fork bomb, we
> > > > > > > >
> > > > > > > > If the initial value of util_avg is 0, we should not have any impact
> > > > > > > > on the util_avg of the cfs rq on which the task is attached, isn't it
> > > > > > > > ? so this should not impact both the over utilization state and the
> > > > > > > > frequency selected by sugov or I'm missing something ?
> > > > > > >
> > > > > > > What I tried to say is that setting util_avg to 0 for new tasks will
> > > > > > > prevent schedutil from raising the frequency in case of a fork bomb, and
> > > > > > > I think that could be an issue. And I think this isn't an issue with the
> > > > > > > patch as-is ...
> > > > > >
> > > > > > ok. So you also want to deal with fork bomb
> > > > > > Not sure that you don't have some problem with current proposal too
> > > > > > because select_task_rq_fair will always return prev_cpu because
> > > > > > util_avg and util_est are 0 at that time
> > > > >
> > > > > But find_idlest_cpu() should select a CPU using load in case of a forkee
> > > > > no ?
> > > >
> > > > So you have to wait for the next tick that will set the overutilized
> > > > and disable the want_energy. Until this point, all new tasks will be
> > > > put on the current cpu
> > >
> > > want_energy should always be false for forkees, because we set it only
> > > for SD_BALANCE_WAKE.
> >
> > Ah yes I forgot that point.
> > But doesn't this break the EAS policy ? I mean each time a new task is
> > created, we use the load to select the best CPU
>
> If you really keep spawning new tasks all the time, yes EAS won't help
> you, but there isn't a lot we can do :/. We need to have an idea of how
My point was more that it's also happen for every single new task and
not only with fork bomb
> big a task is for EAS, and we obviously don't know that for new tasks, so
> it's hard/dangerous to make assumptions.
But by not making any assumption, the new tasks are placed outside EAS
control and can easily break what EAS tries to achieve because it
looks for the idlest cpu which is unluckily most probably a CPU that
EAS doesn't want to use
>
> So the proposal here is that if you only have forkees once in a while,
> then those new tasks (and those new tasks only) will be placed using load
> the first time, and then they'll fall under EAS control has soon as they
> have at least a little bit of history. This _should_ happen without
> re-enabling load balance spuriously too often, and that _should_ prevent
I'm not really concerned about re-enabling load balance but more that
the effort of packing of tasks in few cpus/clusters that EAS tries to
do can be broken for every new task.
So I wonder what is better for EAS : Make sure to efficiently spread
newly created tasks in cas of fork bomb or try to not break EAS task
placement with every newly created tasks
Vincent
> it from ruining the placement of existing tasks ...
>
> As Peter already mentioned, a better way of solving this issue would be
> to try to find the moment when the utilization signal has converged to
> something stable (assuming that it converges), but that, I think, isn't
> straightforward at all ...
>
> Does that make any sense ?
>
> Thanks,
> Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
2018-08-03 7:48 ` Vincent Guittot@ 2018-08-03 8:18 ` Quentin Perret
2018-08-03 13:49 ` Vincent Guittot0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-03 8:18 UTC (permalink / raw)
To: Vincent Guittot
Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
Morten Rasmussen, Chris Redpath, Patrick Bellasi,
Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
currojerez, Javi Merino
On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
> I'm not really concerned about re-enabling load balance but more that
> the effort of packing of tasks in few cpus/clusters that EAS tries to
> do can be broken for every new task.
Well, re-enabling load balance immediately would break the nice placement
that EAS found, because it would shuffle all tasks around and break the
packing strategy. Letting that sole new task go in find_idlest_cpu()
shouldn't impact the placement of existing tasks. That might have an energy
cost for that one task, yes, but it's really hard to do anything smarter
with new tasks IMO ... EAS simply can't work without a utilization value.
> So I wonder what is better for EAS : Make sure to efficiently spread
> newly created tasks in cas of fork bomb or try to not break EAS task
> placement with every newly created tasks
That shouldn't break the placement per se, we're just making one
temporary exception for new tasks. What do you think 'the right thing'
to do is ? To just put new tasks on prev_cpu or something like that ?
That might help some use-cases I suppose, but will probably harm others ...
I'm just not too keen on making assumptions about the size of new tasks,
that's all. But I'm definitely open to ideas if there is something
smarter we can do.
Thanks,
Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
2018-08-03 8:18 ` Quentin Perret@ 2018-08-03 13:49 ` Vincent Guittot
2018-08-03 14:21 ` Vincent Guittot
2018-08-03 15:55 ` Quentin Perret0 siblings, 2 replies; 72+ messages in thread
From: Vincent Guittot @ 2018-08-03 13:49 UTC (permalink / raw)
To: Quentin Perret
Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
Morten Rasmussen, Chris Redpath, Patrick Bellasi,
Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
currojerez, Javi Merino
On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> > On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
> > I'm not really concerned about re-enabling load balance but more that
> > the effort of packing of tasks in few cpus/clusters that EAS tries to
> > do can be broken for every new task.
>
> Well, re-enabling load balance immediately would break the nice placement
> that EAS found, because it would shuffle all tasks around and break the
> packing strategy. Letting that sole new task go in find_idlest_cpu()
Sorry I was not clear in my explanation. Re enabling load balance
would be a problem of course. I wanted to say that there is few chance
that this will re-enable the load balance immediately and break EAS
and I'm not worried by this case. But i'm only concerned by the new
task being put outside EAS policy.
For example, if you run on hikey960 the simple script below, which
can't really be seen as a fork bomb IMHO, you will see threads
scheduled on big cores every 0.5 seconds whereas everything should be
packed on little core :
for i in {0..10}; do
echo "t"$i
sleep 0.5
done
> shouldn't impact the placement of existing tasks. That might have an energy
> cost for that one task, yes, but it's really hard to do anything smarter
> with new tasks IMO ... EAS simply can't work without a utilization value.
>
> > So I wonder what is better for EAS : Make sure to efficiently spread
> > newly created tasks in cas of fork bomb or try to not break EAS task
> > placement with every newly created tasks
>
> That shouldn't break the placement per se, we're just making one
> temporary exception for new tasks. What do you think 'the right thing'
> to do is ? To just put new tasks on prev_cpu or something like that ?
I think that EAS, which is about saving power, could be a bit power
friendly when it has to make some assumptions about new task.
>
> That might help some use-cases I suppose, but will probably harm others ...
> I'm just not too keen on making assumptions about the size of new tasks,
But you are already doing some assumptions by letting the default
mode, which use load_avg, selecting the task for you. The comment of
the init function of load_avg states:
void init_entity_runnable_average()
{
...
/*
* Tasks are intialized with full load to be seen as heavy tasks until
* they get a chance to stabilize to their real load level.
* Group entities are intialized with zero load to reflect the fact that
* nothing has been attached to the task group yet.
*/
So it means that EAS makes the assumption that new task are heavy
tasks until they get a chance to stabilize
Regards,
Vincent
> that's all. But I'm definitely open to ideas if there is something
> smarter we can do.
>
> Thanks,
> Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
2018-08-03 13:49 ` Vincent Guittot@ 2018-08-03 14:21 ` Vincent Guittot
2018-08-03 15:55 ` Quentin Perret1 sibling, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2018-08-03 14:21 UTC (permalink / raw)
To: Quentin Perret
Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
Morten Rasmussen, Chris Redpath, Patrick Bellasi,
Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
currojerez, Javi Merino
On Fri, 3 Aug 2018 at 15:49, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> > > On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
> > > I'm not really concerned about re-enabling load balance but more that
> > > the effort of packing of tasks in few cpus/clusters that EAS tries to
> > > do can be broken for every new task.
> >
> > Well, re-enabling load balance immediately would break the nice placement
> > that EAS found, because it would shuffle all tasks around and break the
> > packing strategy. Letting that sole new task go in find_idlest_cpu()
>
> Sorry I was not clear in my explanation. Re enabling load balance
> would be a problem of course. I wanted to say that there is few chance
> that this will re-enable the load balance immediately and break EAS
> and I'm not worried by this case. But i'm only concerned by the new
> task being put outside EAS policy.
>
> For example, if you run on hikey960 the simple script below, which
> can't really be seen as a fork bomb IMHO, you will see threads
> scheduled on big cores every 0.5 seconds whereas everything should be
> packed on little core :
> for i in {0..10}; do
> echo "t"$i
> sleep 0.5
> done
>
> > shouldn't impact the placement of existing tasks. That might have an energy
> > cost for that one task, yes, but it's really hard to do anything smarter
> > with new tasks IMO ... EAS simply can't work without a utilization value.
> >
> > > So I wonder what is better for EAS : Make sure to efficiently spread
> > > newly created tasks in cas of fork bomb or try to not break EAS task
> > > placement with every newly created tasks
> >
> > That shouldn't break the placement per se, we're just making one
> > temporary exception for new tasks. What do you think 'the right thing'
> > to do is ? To just put new tasks on prev_cpu or something like that ?
>
> I think that EAS, which is about saving power, could be a bit power
> friendly when it has to make some assumptions about new task.
>
> >
> > That might help some use-cases I suppose, but will probably harm others ...
> > I'm just not too keen on making assumptions about the size of new tasks,
>
> But you are already doing some assumptions by letting the default
> mode, which use load_avg, selecting the task for you. The comment of
s/selecting the task/selecting the cpu/
> the init function of load_avg states:
>
> void init_entity_runnable_average()
> {
> ...
> /*
> * Tasks are intialized with full load to be seen as heavy tasks until
> * they get a chance to stabilize to their real load level.
> * Group entities are intialized with zero load to reflect the fact that
> * nothing has been attached to the task group yet.
> */
>
> So it means that EAS makes the assumption that new task are heavy
> tasks until they get a chance to stabilize
>
> Regards,
> Vincent
>
> > that's all. But I'm definitely open to ideas if there is something
> > smarter we can do.
> >
> > Thanks,
> > Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
2018-08-03 13:49 ` Vincent Guittot
2018-08-03 14:21 ` Vincent Guittot@ 2018-08-03 15:55 ` Quentin Perret
2018-08-06 8:40 ` Vincent Guittot1 sibling, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-03 15:55 UTC (permalink / raw)
To: Vincent Guittot
Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
Morten Rasmussen, Chris Redpath, Patrick Bellasi,
Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
currojerez, Javi Merino
On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> > > On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
> > > I'm not really concerned about re-enabling load balance but more that
> > > the effort of packing of tasks in few cpus/clusters that EAS tries to
> > > do can be broken for every new task.
> >
> > Well, re-enabling load balance immediately would break the nice placement
> > that EAS found, because it would shuffle all tasks around and break the
> > packing strategy. Letting that sole new task go in find_idlest_cpu()
>
> Sorry I was not clear in my explanation. Re enabling load balance
> would be a problem of course. I wanted to say that there is few chance
> that this will re-enable the load balance immediately and break EAS
> and I'm not worried by this case. But i'm only concerned by the new
> task being put outside EAS policy.
>
> For example, if you run on hikey960 the simple script below, which
> can't really be seen as a fork bomb IMHO, you will see threads
> scheduled on big cores every 0.5 seconds whereas everything should be
> packed on little core
I guess that also depends on what's running on the little cores, but I
see your point.
I think we're discussing two different things right now:
1. Should forkees go in find_energy_efficient_cpu() ?
2. Should forkees have 0 of initial util_avg when EAS is enabled ?
For 1, that would mean all forkees go on prev_cpu. I can see how that
can be more energy-efficient in some use-cases (the one you described
for example), but that also has drawbacks. Placing the task on a big
CPU can have an energy cost, but that should also help the task build
it's utilization faster, which is what we want to make smart decisions
with EAS. Also, it isn't always true that going on the little CPUs is
more energy efficient, only the Energy Model can tell. There is just no
perfect solution, so I'm still not fully decided on that one ...
For 2, I'm a little bit more reluctant, because that has more
implications ... That could probably harm some fairly standard use
cases (an simple app-launch for example). Enqueueing something new on a
CPU would go unnoticed, which might be fine for a very small task, but
probably a major issue if the task is actually big. I'd be more
comfortable with 2 only if we also speed-up the PELT half-life TBH ...
Is there a 3 that I missed ?
Thanks,
Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
2018-08-03 15:55 ` Quentin Perret@ 2018-08-06 8:40 ` Vincent Guittot
2018-08-06 9:43 ` Quentin Perret
2018-08-06 10:08 ` Dietmar Eggemann0 siblings, 2 replies; 72+ messages in thread
From: Vincent Guittot @ 2018-08-06 8:40 UTC (permalink / raw)
To: Quentin Perret
Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
Morten Rasmussen, Chris Redpath, Patrick Bellasi,
Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
currojerez, Javi Merino
On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
> > On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
> > >
> > > On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> > > > On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > I'm not really concerned about re-enabling load balance but more that
> > > > the effort of packing of tasks in few cpus/clusters that EAS tries to
> > > > do can be broken for every new task.
> > >
> > > Well, re-enabling load balance immediately would break the nice placement
> > > that EAS found, because it would shuffle all tasks around and break the
> > > packing strategy. Letting that sole new task go in find_idlest_cpu()
> >
> > Sorry I was not clear in my explanation. Re enabling load balance
> > would be a problem of course. I wanted to say that there is few chance
> > that this will re-enable the load balance immediately and break EAS
> > and I'm not worried by this case. But i'm only concerned by the new
> > task being put outside EAS policy.
> >
> > For example, if you run on hikey960 the simple script below, which
> > can't really be seen as a fork bomb IMHO, you will see threads
> > scheduled on big cores every 0.5 seconds whereas everything should be
> > packed on little core
>
> I guess that also depends on what's running on the little cores, but I
> see your point.
In my case, the system was idle and nothing else than this script was running
>
> I think we're discussing two different things right now:
> 1. Should forkees go in find_energy_efficient_cpu() ?
> 2. Should forkees have 0 of initial util_avg when EAS is enabled ?
It's the same topic: How EAS should consider a newly created task ?
For now, we let the "performance" mode selects a CPU. This CPU will
most probably be worst CPU from a EAS pov because it's the idlest CPU
in the idlest group which is the opposite of what EAS tries to do
The current behavior is :
For every new task, the cpu selection is done assuming it's a heavy
task with the max possible load_avg, and it looks for the idlest cpu.
This means that if the system is lightly loaded, scheduler will select
most probably a idle big core.
The utilization of this new task is then set to half of the remaining
capacity of the selected CPU which means that the idlest you are, the
biggest the task will be initialized to. This can easily be half a big
core which can be bigger than the max capacity of a little like on
hikey960. Then, util_est will keep track of this value for a while
which will make this task like a big one.
>
> For 1, that would mean all forkees go on prev_cpu. I can see how that
> can be more energy-efficient in some use-cases (the one you described
> for example), but that also has drawbacks. Placing the task on a big
> CPU can have an energy cost, but that should also help the task build
> it's utilization faster, which is what we want to make smart decisions
With current behavior, little task are seen as big for a long time
which is not really help the task to build its utilization faster
IMHO.
> with EAS. Also, it isn't always true that going on the little CPUs is
> more energy efficient, only the Energy Model can tell. There is just no
selecting big or Little is not the problem here. The problem is that
we don't use Energy Model so we will most probably do the wrong
choice. Nevertheless, putting a task on big is clearly the wrong
choice in the case I mentioned above " shell script on hikey960".
> perfect solution, so I'm still not fully decided on that one ...
>
> For 2, I'm a little bit more reluctant, because that has more
> implications ... That could probably harm some fairly standard use
> cases (an simple app-launch for example). Enqueueing something new on a
> CPU would go unnoticed, which might be fine for a very small task, but
> probably a major issue if the task is actually big. I'd be more
> comfortable with 2 only if we also speed-up the PELT half-life TBH ...
>
> Is there a 3 that I missed ?
Having something in the middle like taking into account load and/org
utilization of the parent in order to mitigate big task starting with
small utilization and small task starting with big utilization.
It's probably not perfect because big tasks can create small ones and
the opposite but if there are already big tasks, assuming that the new
one is also a big one should have less power impact as we are already
consuming power for the current bigs. At the opposite, if little are
running, assuming that new task is little will not harm the power
consumption unnecessarily.
My main concern is that by making no choice, you clearly make the most
power consumption choice which is a bit awkward for a policy that
wants to minimize power consumption.
Regards,
Vincent
>
> Thanks,
> Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
2018-08-06 8:40 ` Vincent Guittot@ 2018-08-06 9:43 ` Quentin Perret
2018-08-06 10:45 ` Vincent Guittot
2018-08-06 10:08 ` Dietmar Eggemann1 sibling, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-06 9:43 UTC (permalink / raw)
To: Vincent Guittot
Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
Morten Rasmussen, Chris Redpath, Patrick Bellasi,
Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
currojerez, Javi Merino
On Monday 06 Aug 2018 at 10:40:46 (+0200), Vincent Guittot wrote:
> On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
> For every new task, the cpu selection is done assuming it's a heavy
> task with the max possible load_avg, and it looks for the idlest cpu.
> This means that if the system is lightly loaded, scheduler will select
> most probably a idle big core.
Agreed, that is what should happen if the system is lightly loaded.
However, I'm still not totally convinced this is wrong. It's
definitely not _always_ wrong, at least. Just like starting new tasks
on little CPUs isn't always wrong either.
> selecting big or Little is not the problem here. The problem is that
> we don't use Energy Model so we will most probably do the wrong
> choice. Nevertheless, putting a task on big is clearly the wrong
> choice in the case I mentioned above " shell script on hikey960".
_You_ can say it's wrong because _you_ know the task composition. The
scheduler has no way to tell. You could come up with a script that
spawns heavy tasks every once in a while, and in this case putting
those on big cores would be beneficial ...
> Having something in the middle like taking into account load and/org
> utilization of the parent in order to mitigate big task starting with
> small utilization and small task starting with big utilization.
> It's probably not perfect because big tasks can create small ones and
> the opposite but if there are already big tasks, assuming that the new
> one is also a big one should have less power impact as we are already
> consuming power for the current bigs. At the opposite, if little are
> running, assuming that new task is little will not harm the power
> consumption unnecessarily.
Right, we can definitely come up with something more conservative than
what I'm currently proposing. I had a quick chat with Morten about this
the other day and one suggestion he had was to pick the CPU with the max
spare cap in the frequency domain in which the parent task is running ...
In any case, I really feel like there isn't an obvious right decision
here, so I'd prefer to keep things simple for now. This patch-set is a
first step, and fine-grained tuning for new tasks is probably something
that can be done later, if need be. What do you think ?
Thanks,
Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
2018-08-06 8:40 ` Vincent Guittot
2018-08-06 9:43 ` Quentin Perret@ 2018-08-06 10:08 ` Dietmar Eggemann
2018-08-06 10:33 ` Vincent Guittot1 sibling, 1 reply; 72+ messages in thread
From: Dietmar Eggemann @ 2018-08-06 10:08 UTC (permalink / raw)
To: Vincent Guittot, Quentin Perret
Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
open list:THERMAL, gregkh, Ingo Molnar, Morten Rasmussen,
Chris Redpath, Patrick Bellasi, Valentin Schneider,
Thara Gopinath, viresh kumar, Todd Kjos, Joel Fernandes,
Cc: Steve Muckle, adharmap, Kannan, Saravana, pkondeti,
Juri Lelli, Eduardo Valentin, Srinivas Pandruvada, currojerez,
Javi Merino
On 08/06/2018 10:40 AM, Vincent Guittot wrote:
> On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
>>
>> On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
>>> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
>>>>
>>>> On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
>>>>> On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
[...]
>> I think we're discussing two different things right now:
>> 1. Should forkees go in find_energy_efficient_cpu() ?
>> 2. Should forkees have 0 of initial util_avg when EAS is enabled ?
>
> It's the same topic: How EAS should consider a newly created task ?
>
> For now, we let the "performance" mode selects a CPU. This CPU will
> most probably be worst CPU from a EAS pov because it's the idlest CPU
> in the idlest group which is the opposite of what EAS tries to do
>
> The current behavior is :
> For every new task, the cpu selection is done assuming it's a heavy
> task with the max possible load_avg, and it looks for the idlest cpu.
> This means that if the system is lightly loaded, scheduler will select
> most probably a idle big core.
AFAICS, task load doesn't seem to be used for find_idlest_cpu() (
find_idlest_group() and find_idlest_group_cpu()). So the forkee
(SD_BALANCE_FORK) is placed independently of his task load.
Task load (task_h_load(p)) is used in
wake_affine()->wake_affine_weight() but for this to be called it has to
be a wakeup (SD_BALANCE_WAKE).
> The utilization of this new task is then set to half of the remaining
> capacity of the selected CPU which means that the idlest you are, the
> biggest the task will be initialized to. This can easily be half a big
> core which can be bigger than the max capacity of a little like on
> hikey960. Then, util_est will keep track of this value for a while
> which will make this task like a big one.
[...]
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
2018-08-06 10:08 ` Dietmar Eggemann@ 2018-08-06 10:33 ` Vincent Guittot
2018-08-06 12:29 ` Dietmar Eggemann0 siblings, 1 reply; 72+ messages in thread
From: Vincent Guittot @ 2018-08-06 10:33 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: Quentin Perret, Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
open list:THERMAL, gregkh, Ingo Molnar, Morten Rasmussen,
Chris Redpath, Patrick Bellasi, Valentin Schneider,
Thara Gopinath, viresh kumar, Todd Kjos, Joel Fernandes,
Cc: Steve Muckle, adharmap, Kannan, Saravana, pkondeti,
Juri Lelli, Eduardo Valentin, Srinivas Pandruvada, currojerez,
Javi Merino
On Mon, 6 Aug 2018 at 12:08, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 08/06/2018 10:40 AM, Vincent Guittot wrote:
> > On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
> >>
> >> On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
> >>> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
> >>>>
> >>>> On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> >>>>> On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
>
> [...]
>
> >> I think we're discussing two different things right now:
> >> 1. Should forkees go in find_energy_efficient_cpu() ?
> >> 2. Should forkees have 0 of initial util_avg when EAS is enabled ?
> >
> > It's the same topic: How EAS should consider a newly created task ?
> >
> > For now, we let the "performance" mode selects a CPU. This CPU will
> > most probably be worst CPU from a EAS pov because it's the idlest CPU
> > in the idlest group which is the opposite of what EAS tries to do
> >
> > The current behavior is :
> > For every new task, the cpu selection is done assuming it's a heavy
> > task with the max possible load_avg, and it looks for the idlest cpu.
> > This means that if the system is lightly loaded, scheduler will select
> > most probably a idle big core.
>
> AFAICS, task load doesn't seem to be used for find_idlest_cpu() (
> find_idlest_group() and find_idlest_group_cpu()). So the forkee
> (SD_BALANCE_FORK) is placed independently of his task load.
hmm ... so what is used if load or runnable load are not used ?
find_idlest_group() uses load and runnable load but skip spare
capacity in case of fork
> Task load (task_h_load(p)) is used in
> wake_affine()->wake_affine_weight() but for this to be called it has to
> be a wakeup (SD_BALANCE_WAKE).
>
> > The utilization of this new task is then set to half of the remaining
> > capacity of the selected CPU which means that the idlest you are, the
> > biggest the task will be initialized to. This can easily be half a big
> > core which can be bigger than the max capacity of a little like on
> > hikey960. Then, util_est will keep track of this value for a while
> > which will make this task like a big one.
>
> [...]
>
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
2018-08-06 9:43 ` Quentin Perret@ 2018-08-06 10:45 ` Vincent Guittot
2018-08-06 11:02 ` Quentin Perret0 siblings, 1 reply; 72+ messages in thread
From: Vincent Guittot @ 2018-08-06 10:45 UTC (permalink / raw)
To: Quentin Perret
Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
Morten Rasmussen, Chris Redpath, Patrick Bellasi,
Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
currojerez, Javi Merino
On Mon, 6 Aug 2018 at 11:43, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Monday 06 Aug 2018 at 10:40:46 (+0200), Vincent Guittot wrote:
> > On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
> > For every new task, the cpu selection is done assuming it's a heavy
> > task with the max possible load_avg, and it looks for the idlest cpu.
> > This means that if the system is lightly loaded, scheduler will select
> > most probably a idle big core.
>
> Agreed, that is what should happen if the system is lightly loaded.
> However, I'm still not totally convinced this is wrong. It's
> definitely not _always_ wrong, at least. Just like starting new tasks
> on little CPUs isn't always wrong either.
As explained before, IMHO, this is not wrong if you looks for
performance but this is wrong if you looks for power saving
>
> > selecting big or Little is not the problem here. The problem is that
> > we don't use Energy Model so we will most probably do the wrong
> > choice. Nevertheless, putting a task on big is clearly the wrong
> > choice in the case I mentioned above " shell script on hikey960".
>
> _You_ can say it's wrong because _you_ know the task composition. The
> scheduler has no way to tell. You could come up with a script that
> spawns heavy tasks every once in a while, and in this case putting
> those on big cores would be beneficial ...
>
> > Having something in the middle like taking into account load and/org
> > utilization of the parent in order to mitigate big task starting with
> > small utilization and small task starting with big utilization.
> > It's probably not perfect because big tasks can create small ones and
> > the opposite but if there are already big tasks, assuming that the new
> > one is also a big one should have less power impact as we are already
> > consuming power for the current bigs. At the opposite, if little are
> > running, assuming that new task is little will not harm the power
> > consumption unnecessarily.
>
> Right, we can definitely come up with something more conservative than
> what I'm currently proposing. I had a quick chat with Morten about this
> the other day and one suggestion he had was to pick the CPU with the max
> spare cap in the frequency domain in which the parent task is running ...
>
> In any case, I really feel like there isn't an obvious right decision
> here, so I'd prefer to keep things simple for now. This patch-set is a
> first step, and fine-grained tuning for new tasks is probably something
> that can be done later, if need be. What do you think ?
I would have preferred to have a full power policy for all task when
EAS is in used by default and then see if there is any performance
problem instead of letting some UC unclear but that's a personal
opinion.
so IMO, the minimum is to add a comment in the code that describes
this behavior for fork tasks so people will understand why EAS puts
newly created task on not "EAS friendly" cpus when they will look at
the code trying to understand the behavior
>
> Thanks,
> Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 03/14] PM: Introduce an Energy Model management framework
2018-07-24 12:25 ` [PATCH v5 03/14] PM: Introduce an Energy Model management framework Quentin Perret
@ 2018-08-09 21:52 ` Rafael J. Wysocki
2018-08-10 8:15 ` Quentin Perret0 siblings, 1 reply; 72+ messages in thread
From: Rafael J. Wysocki @ 2018-08-09 21:52 UTC (permalink / raw)
To: Quentin Perret
Cc: peterz, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
morten.rasmussen, chris.redpath, patrick.bellasi,
valentin.schneider, vincent.guittot, thara.gopinath,
viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
juri.lelli, edubezval, srinivas.pandruvada, currojerez,
javi.merino
On Tuesday, July 24, 2018 2:25:10 PM CEST Quentin Perret wrote:
> Several subsystems in the kernel (task scheduler and/or thermal at the
> time of writing) can benefit from knowing about the energy consumed by
> CPUs. Yet, this information can come from different sources (DT or
> firmware for example), in different formats, hence making it hard to
> exploit without a standard API.
>
> As an attempt to address this, introduce a centralized Energy Model
> (EM) management framework which aggregates the power values provided
> by drivers into a table for each frequency domain in the system. The
> power cost tables are made available to interested clients (e.g. task
> scheduler or thermal) via platform-agnostic APIs. The overall design
> is represented by the diagram below (focused on Arm-related drivers as
> an example, but applicable to any architecture):
>
> +---------------+ +-----------------+ +-------------+
> | Thermal (IPA) | | Scheduler (EAS) | | Other |
> +---------------+ +-----------------+ +-------------+
> | | em_fd_energy() |
> | | em_cpu_get() |
> +-----------+ | +----------+
> | | |
> v v v
> +---------------------+
> | |
> | Energy Model |
> | |
> | Framework |
> | |
> +---------------------+
> ^ ^ ^
> | | | em_register_freq_domain()
> +----------+ | +---------+
> | | |
> +---------------+ +---------------+ +--------------+
> | cpufreq-dt | | arm_scmi | | Other |
> +---------------+ +---------------+ +--------------+
> ^ ^ ^
> | | |
> +--------------+ +---------------+ +--------------+
> | Device Tree | | Firmware | | ? |
> +--------------+ +---------------+ +--------------+
>
> Drivers (typically, but not limited to, CPUFreq drivers) can register
> data in the EM framework using the em_register_freq_domain() API. The
> calling driver must provide a callback function with a standardized
> signature that will be used by the EM framework to build the power
> cost tables of the frequency domain. This design should offer a lot of
> flexibility to calling drivers which are free of reading information
> from any location and to use any technique to compute power costs.
> Moreover, the capacity states registered by drivers in the EM framework
> are not required to match real performance states of the target. This
> is particularly important on targets where the performance states are
> not known by the OS.
>
> On the client side, the EM framework offers APIs to access the power
> cost tables of a CPU (em_cpu_get()), and to estimate the energy
> consumed by the CPUs of a frequency domain (em_fd_energy()). Clients
> such as the task scheduler can then use these APIs to access the shared
> data structures holding the Energy Model of CPUs.
I'm a bit concerned that the code here appears to be designed around the
frequency domains concept which seems to be a limitation and which probably
is related to the properties of the current generation of hardware.
Assumptions like that tend to get tangled into the code tightly over time
and they may be hard to untangle from it when new use cases arise later.
For example, there probably will be more firmware involvement in future
systems and the firmware may not be willing to expose "raw" frequency
domains to the OS. That already is the case with P-states on Intel HW and
with ACPI CPPC in general.
IMO, frequency domains in your current code could be replaced with something
more general, like "performance domains" providing the scheduler with the
(relative) cost of running a task on a busy (non-idle) CPU (and, analogously,
"idle domains" that would provide the scheduler with the - relative - cost
of waking up an idle CPU to run a task on it or, the other way around, the
possible relative gain from taking all tasks away from a CPU in order to make
it go idle).
Also bear in mind that the CPUs the scheduler deals with are logical ones,
so they may be like hardware threads within a single core, for example.
Thanks,
Rafael
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 03/14] PM: Introduce an Energy Model management framework
2018-08-09 21:52 ` Rafael J. Wysocki@ 2018-08-10 8:15 ` Quentin Perret
2018-08-10 8:41 ` Rafael J. Wysocki0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-10 8:15 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: peterz, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
morten.rasmussen, chris.redpath, patrick.bellasi,
valentin.schneider, vincent.guittot, thara.gopinath,
viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
juri.lelli, edubezval, srinivas.pandruvada, currojerez,
javi.merino
Hi Rafael,
On Thursday 09 Aug 2018 at 23:52:29 (+0200), Rafael J. Wysocki wrote:
> I'm a bit concerned that the code here appears to be designed around the
> frequency domains concept which seems to be a limitation and which probably
> is related to the properties of the current generation of hardware.
That's correct. I went for 'frequency domains' only because this is what
EAS and IPA are interested in, as of today at least. And both of them
are somewhat dependent on CPU-Freq, which is called CPU-*Freq*, not
CPU-Perf after all :-)
> Assumptions like that tend to get tangled into the code tightly over time
> and they may be hard to untangle from it when new use cases arise later.
>
> For example, there probably will be more firmware involvement in future
> systems and the firmware may not be willing to expose "raw" frequency
> domains to the OS. That already is the case with P-states on Intel HW and
> with ACPI CPPC in general.
Agreed, and embedded/mobile systems are going in that direction too ...
> IMO, frequency domains in your current code could be replaced with something
> more general, like "performance domains"
I don't mind using a more abstract name as long as we keep the same
assumptions, and especially that all CPUs in a perf. domain *must* have
the same micro-architecture. From that assumption result several
properties that EAS (in its current) form needs. The first one is that
all CPUs of a performance domain have the same capacity at any possible
performance state. And the second is that they all consume the same
amount of (active) power.
I know it is theoretically possible to mix CPU types in the same perf
domain, but that becomes nightmare-ish to manage in EAS, and I don't
think there is a single platform like this on the market today. And I
hope nobody will build one. Peter wanted to mandate that too, I think.
> providing the scheduler with the (relative) cost of running a task
What do you mean by relative ? That we should normalize the power costs ?
Or just use an abstract scale, without specifying the unit ?
The main reason I'm a bit reluctant to do that just now is because IPA
needs to compare the power of CPUs with the power of other components
(GPUs, for example). And the power of those other components is specified
with a specific unit too. So, not imposing a comparable unit for the
power of CPUs will result in an unspecified behaviour in IPA, and that
will break things for sure. I would very much like to avoid that, of
course.
What I am currently proposing is to keep the unit (mW) in the EM
framework so that migrating IPA to using it can be done in a (relatively)
painless way. On a system where drivers don't know the exact wattage,
then they should just 'lie' to the EM framework, but it's their job to
lie coherently to all subsystems and keep things consistent, because all
subsystems have specified power in comparable units.
Another solution to solve this problem could be to extend the EM
framework introduced by this patch and make it manage the EM of any
device, not just CPUs. Then we could just specify that all power costs
must be in the same scale, regardless of the actual unit, and register
the EM of CPUs, GPUs, ...
However, I was hoping that this patch as-is was enough for a first step,
and that this extension of the framework could be done in a second step ?
Thoughts ?
In any case, if we decide to keep the mW unit for now, I should at least
explain clearly why in the commit message.
> on a busy (non-idle) CPU (and, analogously,
> "idle domains" that would provide the scheduler with the - relative - cost
> of waking up an idle CPU to run a task on it or, the other way around, the
> possible relative gain from taking all tasks away from a CPU in order to make
> it go idle).
+1 for idle costs as a second type of 'domains' which could be managed
by the EM framework, alongside the 'perf' domains. I don't think we have
users of that just now (or providers of idle costs ?) so maybe that is
for later too ?
What do you think ?
Thank you very much for the feedback,
Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 03/14] PM: Introduce an Energy Model management framework
2018-08-10 8:15 ` Quentin Perret@ 2018-08-10 8:41 ` Rafael J. Wysocki
2018-08-10 9:12 ` Quentin Perret0 siblings, 1 reply; 72+ messages in thread
From: Rafael J. Wysocki @ 2018-08-10 8:41 UTC (permalink / raw)
To: Quentin Perret
Cc: peterz, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
morten.rasmussen, chris.redpath, patrick.bellasi,
valentin.schneider, vincent.guittot, thara.gopinath,
viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
juri.lelli, edubezval, srinivas.pandruvada, currojerez,
javi.merino
Hi Quentin,
On Friday, August 10, 2018 10:15:39 AM CEST Quentin Perret wrote:
> Hi Rafael,
>
> On Thursday 09 Aug 2018 at 23:52:29 (+0200), Rafael J. Wysocki wrote:
> > I'm a bit concerned that the code here appears to be designed around the
> > frequency domains concept which seems to be a limitation and which probably
> > is related to the properties of the current generation of hardware.
>
> That's correct. I went for 'frequency domains' only because this is what
> EAS and IPA are interested in, as of today at least. And both of them
> are somewhat dependent on CPU-Freq, which is called CPU-*Freq*, not
> CPU-Perf after all :-)
Still, cpufreq manages CPU performance scaling really.
A cpufreq policy may represent a frequency domain or generally a group of
related CPUs and what matters is that there is a coordination between them
and not how that coordination happens at the hardware/firmware level etc.
> > Assumptions like that tend to get tangled into the code tightly over time
> > and they may be hard to untangle from it when new use cases arise later.
> >
> > For example, there probably will be more firmware involvement in future
> > systems and the firmware may not be willing to expose "raw" frequency
> > domains to the OS. That already is the case with P-states on Intel HW and
> > with ACPI CPPC in general.
>
> Agreed, and embedded/mobile systems are going in that direction too ...
>
> > IMO, frequency domains in your current code could be replaced with something
> > more general, like "performance domains"
>
> I don't mind using a more abstract name as long as we keep the same
> assumptions, and especially that all CPUs in a perf. domain *must* have
> the same micro-architecture.
That's fair enough I think.
> From that assumption result several
> properties that EAS (in its current) form needs. The first one is that
> all CPUs of a performance domain have the same capacity at any possible
> performance state. And the second is that they all consume the same
> amount of (active) power.
>
> I know it is theoretically possible to mix CPU types in the same perf
> domain, but that becomes nightmare-ish to manage in EAS, and I don't
> think there is a single platform like this on the market today. And I
> hope nobody will build one. Peter wanted to mandate that too, I think.
There are departures, say, at least as far as the capacity is concerned.
The uarch is the same for all of them, but the max capacity may vary
between them.
> > providing the scheduler with the (relative) cost of running a task
>
> What do you mean by relative ? That we should normalize the power costs ?
> Or just use an abstract scale, without specifying the unit ?
I mean "relative with respect to the other choices"; not absolute.
> The main reason I'm a bit reluctant to do that just now is because IPA
> needs to compare the power of CPUs with the power of other components
> (GPUs, for example). And the power of those other components is specified
> with a specific unit too. So, not imposing a comparable unit for the
> power of CPUs will result in an unspecified behaviour in IPA, and that
> will break things for sure. I would very much like to avoid that, of
> course.
The absolute power numbers are generally hard to get though.
In the majority of cases you can figure out what the extra cost of X with
respect to (alternative) Y is (in certain units), but you may not be able
to say what X and Y are equal to in absolute terms (for example, there
may be an unknown component in both X and Y that you cannot measure, but
it may not be relevant for the outcome of the computation).
> What I am currently proposing is to keep the unit (mW) in the EM
> framework so that migrating IPA to using it can be done in a (relatively)
> painless way. On a system where drivers don't know the exact wattage,
> then they should just 'lie' to the EM framework, but it's their job to
> lie coherently to all subsystems and keep things consistent, because all
> subsystems have specified power in comparable units.
Alternatively, there could be a translation layer between EM and IPA.
From my experience, if you want people to come up with some numbers,
they will just choose them to game the system this way or another
unless those numbers can be measured directly or are clearly documented.
And if that happens and then you want to make any significant changes,
you'll need to deal with "regressions" occuring because someone chose
the numbers to make the system behave in a specific way and your changes
break that.
As a rule, I rather avoid requesting unknown numbers from people. :-)
> Another solution to solve this problem could be to extend the EM
> framework introduced by this patch and make it manage the EM of any
> device, not just CPUs. Then we could just specify that all power costs
> must be in the same scale, regardless of the actual unit, and register
> the EM of CPUs, GPUs, ...
> However, I was hoping that this patch as-is was enough for a first step,
> and that this extension of the framework could be done in a second step ?
> Thoughts ?
>
> In any case, if we decide to keep the mW unit for now, I should at least
> explain clearly why in the commit message.
Right.
Actually, the unit is as good as any other, but you need to bear in mind that
the numbers provided may not be realistic.
> > on a busy (non-idle) CPU (and, analogously,
> > "idle domains" that would provide the scheduler with the - relative - cost
> > of waking up an idle CPU to run a task on it or, the other way around, the
> > possible relative gain from taking all tasks away from a CPU in order to make
> > it go idle).
>
> +1 for idle costs as a second type of 'domains' which could be managed
> by the EM framework, alongside the 'perf' domains. I don't think we have
> users of that just now (or providers of idle costs ?) so maybe that is
> for later too ?
Yes, this is for later IMO.
Thanks,
Rafael
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 03/14] PM: Introduce an Energy Model management framework
2018-08-10 8:41 ` Rafael J. Wysocki@ 2018-08-10 9:12 ` Quentin Perret
2018-08-10 11:13 ` Rafael J. Wysocki0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-10 9:12 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: peterz, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
morten.rasmussen, chris.redpath, patrick.bellasi,
valentin.schneider, vincent.guittot, thara.gopinath,
viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
juri.lelli, edubezval, srinivas.pandruvada, currojerez,
javi.merino
On Friday 10 Aug 2018 at 10:41:56 (+0200), Rafael J. Wysocki wrote:
> On Friday, August 10, 2018 10:15:39 AM CEST Quentin Perret wrote:
> > On Thursday 09 Aug 2018 at 23:52:29 (+0200), Rafael J. Wysocki wrote:
> > > I'm a bit concerned that the code here appears to be designed around the
> > > frequency domains concept which seems to be a limitation and which probably
> > > is related to the properties of the current generation of hardware.
> >
> > That's correct. I went for 'frequency domains' only because this is what
> > EAS and IPA are interested in, as of today at least. And both of them
> > are somewhat dependent on CPU-Freq, which is called CPU-*Freq*, not
> > CPU-Perf after all :-)
>
> Still, cpufreq manages CPU performance scaling really.
>
> A cpufreq policy may represent a frequency domain or generally a group of
> related CPUs and what matters is that there is a coordination between them
> and not how that coordination happens at the hardware/firmware level etc.
Fair enough :-)
>
> > > Assumptions like that tend to get tangled into the code tightly over time
> > > and they may be hard to untangle from it when new use cases arise later.
> > >
> > > For example, there probably will be more firmware involvement in future
> > > systems and the firmware may not be willing to expose "raw" frequency
> > > domains to the OS. That already is the case with P-states on Intel HW and
> > > with ACPI CPPC in general.
> >
> > Agreed, and embedded/mobile systems are going in that direction too ...
> >
> > > IMO, frequency domains in your current code could be replaced with something
> > > more general, like "performance domains"
> >
> > I don't mind using a more abstract name as long as we keep the same
> > assumptions, and especially that all CPUs in a perf. domain *must* have
> > the same micro-architecture.
>
> That's fair enough I think.
Good.
> > From that assumption result several
> > properties that EAS (in its current) form needs. The first one is that
> > all CPUs of a performance domain have the same capacity at any possible
> > performance state. And the second is that they all consume the same
> > amount of (active) power.
> >
> > I know it is theoretically possible to mix CPU types in the same perf
> > domain, but that becomes nightmare-ish to manage in EAS, and I don't
> > think there is a single platform like this on the market today. And I
> > hope nobody will build one. Peter wanted to mandate that too, I think.
>
> There are departures, say, at least as far as the capacity is concerned.
>
> The uarch is the same for all of them, but the max capacity may vary
> between them.
I assume you're thinking about ITMT and things like that for example ?
That's an interesting case indeed, but yes, being able to reach higher
freqs for single-threaded workloads shouldn't violate the assumption, I
think.
> > > providing the scheduler with the (relative) cost of running a task
> >
> > What do you mean by relative ? That we should normalize the power costs ?
> > Or just use an abstract scale, without specifying the unit ?
>
> I mean "relative with respect to the other choices"; not absolute.
>
> > The main reason I'm a bit reluctant to do that just now is because IPA
> > needs to compare the power of CPUs with the power of other components
> > (GPUs, for example). And the power of those other components is specified
> > with a specific unit too. So, not imposing a comparable unit for the
> > power of CPUs will result in an unspecified behaviour in IPA, and that
> > will break things for sure. I would very much like to avoid that, of
> > course.
>
> The absolute power numbers are generally hard to get though.
>
> In the majority of cases you can figure out what the extra cost of X with
> respect to (alternative) Y is (in certain units), but you may not be able
> to say what X and Y are equal to in absolute terms (for example, there
> may be an unknown component in both X and Y that you cannot measure, but
> it may not be relevant for the outcome of the computation).
Agreed. EAS and IPA don't care about the absolute real power values, all
they care about is relative correctness. But what I really want to avoid
is having IPA getting the power of the GPUs in mW, and the power of CPUs
in an abstract scale without unit. That _will_ create problems eventually
IMO, because the behaviour is undefined. Specifying a unit everywhere is
an easy way to enforce a consistent design across sub-systems, that's
all.
> > What I am currently proposing is to keep the unit (mW) in the EM
> > framework so that migrating IPA to using it can be done in a (relatively)
> > painless way. On a system where drivers don't know the exact wattage,
> > then they should just 'lie' to the EM framework, but it's their job to
> > lie coherently to all subsystems and keep things consistent, because all
> > subsystems have specified power in comparable units.
>
> Alternatively, there could be a translation layer between EM and IPA.
Hmm, interesting... What do you have in mind exactly ? What would you
put in that layer ?
> From my experience, if you want people to come up with some numbers,
> they will just choose them to game the system this way or another
> unless those numbers can be measured directly or are clearly documented.
>
> And if that happens and then you want to make any significant changes,
> you'll need to deal with "regressions" occuring because someone chose
> the numbers to make the system behave in a specific way and your changes
> break that.
>
> As a rule, I rather avoid requesting unknown numbers from people. :-)
>
> > Another solution to solve this problem could be to extend the EM
> > framework introduced by this patch and make it manage the EM of any
> > device, not just CPUs. Then we could just specify that all power costs
> > must be in the same scale, regardless of the actual unit, and register
> > the EM of CPUs, GPUs, ...
> > However, I was hoping that this patch as-is was enough for a first step,
> > and that this extension of the framework could be done in a second step ?
> > Thoughts ?
> >
> > In any case, if we decide to keep the mW unit for now, I should at least
> > explain clearly why in the commit message.
>
> Right.
>
> Actually, the unit is as good as any other, but you need to bear in mind that
> the numbers provided may not be realistic.
As long as they're all correct in a relative way, that's fine by me :-)
Thanks,
Quentin
^permalinkrawreply [flat|nested] 72+ messages in thread

*Re: [PATCH v5 03/14] PM: Introduce an Energy Model management framework
2018-08-10 9:12 ` Quentin Perret@ 2018-08-10 11:13 ` Rafael J. Wysocki
2018-08-10 12:30 ` Quentin Perret0 siblings, 1 reply; 72+ messages in thread
From: Rafael J. Wysocki @ 2018-08-10 11:13 UTC (permalink / raw)
To: Quentin Perret
Cc: peterz, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
morten.rasmussen, chris.redpath, patrick.bellasi,
valentin.schneider, vincent.guittot, thara.gopinath,
viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
juri.lelli, edubezval, srinivas.pandruvada, currojerez,
javi.merino
On Friday, August 10, 2018 11:12:18 AM CEST Quentin Perret wrote:
> On Friday 10 Aug 2018 at 10:41:56 (+0200), Rafael J. Wysocki wrote:
> > On Friday, August 10, 2018 10:15:39 AM CEST Quentin Perret wrote:
[cut]
> Agreed. EAS and IPA don't care about the absolute real power values, all
> they care about is relative correctness. But what I really want to avoid
> is having IPA getting the power of the GPUs in mW, and the power of CPUs
> in an abstract scale without unit. That _will_ create problems eventually
> IMO, because the behaviour is undefined. Specifying a unit everywhere is
> an easy way to enforce a consistent design across sub-systems, that's
> all.
OK
> > > What I am currently proposing is to keep the unit (mW) in the EM
> > > framework so that migrating IPA to using it can be done in a (relatively)
> > > painless way. On a system where drivers don't know the exact wattage,
> > > then they should just 'lie' to the EM framework, but it's their job to
> > > lie coherently to all subsystems and keep things consistent, because all
> > > subsystems have specified power in comparable units.
> >
> > Alternatively, there could be a translation layer between EM and IPA.
>
> Hmm, interesting... What do you have in mind exactly ? What would you
> put in that layer ?
Something able to say how the numbers used by EM and IPA are related. :-)
Do you think that IPA and EM will always need to use the same set of data for
the CPU?
> > From my experience, if you want people to come up with some numbers,
> > they will just choose them to game the system this way or another
> > unless those numbers can be measured directly or are clearly documented.
> >
> > And if that happens and then you want to make any significant changes,
> > you'll need to deal with "regressions" occuring because someone chose
> > the numbers to make the system behave in a specific way and your changes
> > break that.
> >
> > As a rule, I rather avoid requesting unknown numbers from people. :-)
> >
> > > Another solution to solve this problem could be to extend the EM
> > > framework introduced by this patch and make it manage the EM of any
> > > device, not just CPUs. Then we could just specify that all power costs
> > > must be in the same scale, regardless of the actual unit, and register
> > > the EM of CPUs, GPUs, ...
> > > However, I was hoping that this patch as-is was enough for a first step,
> > > and that this extension of the framework could be done in a second step ?
> > > Thoughts ?
> > >
> > > In any case, if we decide to keep the mW unit for now, I should at least
> > > explain clearly why in the commit message.
> >
> > Right.
> >
> > Actually, the unit is as good as any other, but you need to bear in mind that
> > the numbers provided may not be realistic.
>
> As long as they're all correct in a relative way, that's fine by me :-)
OK
Thanks,
Rafael
^permalinkrawreply [flat|nested] 72+ messages in thread