> On 04.05.2012 [22:14:16 +0200], Peter Zijlstra wrote:
>> On Sat, 2012-05-05 at 01:28 +0530, Srivatsa S. Bhat wrote:
>>> On 05/05/2012 12:54 AM, Peter Zijlstra wrote:
>>>>
>>>>> Documentation/cgroups/cpusets.txt |   43 +++--
>>>>> include/linux/cpuset.h            |    4
>>>>> kernel/cpuset.c                   |  317 ++++++++++++++++++++++++++++---------
>>>>> kernel/sched/core.c               |    4
>>>>> 4 files changed, 274 insertions(+), 94 deletions(-)
>>>>
>>>> Bah, I really hate this complexity you've created for a problem that
>>>> really doesn't exist.
>>>>
>>>
>>> Doesn't exist? Well, I believe we do have a problem and a serious one
>>> at that too!
>>
>> Still not convinced,..
>>
>>> The heart of the problem can be summarized in 2 sentences:
>>>
>>> o During a CPU hotplug, tasks can move between cpusets, and never
>>>   come back to their original cpuset.
>>
>> This is a feature! You cannot say a task is part of a cpuset and then
>> run it elsewhere just because things don't work out.
>>
>> That's actively violating the meaning of cpusets.
>
> Tbh, I agree with you Peter, as I think that's how cpusets *should*
> work.

I agree that that's how cpusets must and should work in usual scenarios.
Otherwise, the whole concept of cpusets wouldn't make much sense.

However, in the face of hotplug, there are examples in the existing kernel
itself where that principle is 'violated' in the strictest sense.

sched_setaffinity(): It calls cpuset_cpus_allowed() to find out what cpus
are allowed for that task (looking at the cpuset it is attached to), so
that it can validate or reduce the newly requested mask, keeping the
allowed cpus in mind.

But how does cpuset_cpus_allowed() calculate the "allowed cpus" for this
task? It calls guarantee_online_cpus(), which does exactly what I tried to
do in this patchset! That is, if the task's cpuset doesn't have any online
cpus, it goes up the cpuset hierarchy, trying to find a parent cpuset that
does have some online cpus, and returns that mask! That too, without
complaining!

So it looks like the kernel already has relaxations with respect to cpusets
or allowed cpus when it is faced with hotplug..

> But I'll also reference `man cpuset`:
>
>     Not all allocations of system memory are constrained by cpusets,
>     for the following reasons.
>
>     If hot-plug functionality is used to remove all the CPUs that
>     are currently assigned to a cpuset, then the kernel will
>     automatically update the cpus_allowed of all processes attached
>     to CPUs in that cpuset to allow all CPUs. When memory hot-plug
>     functionality for removing memory nodes is available, a
>     similar exception is expected to apply there as well. In
>     general, the kernel prefers to violate cpuset placement, rather
>     than starving a process that has had all its allowed CPUs or
>     memory nodes taken offline. User code should reconfigure
>     cpusets to only refer to online CPUs and memory nodes when using
>     hot-plug to add or remove such resources.
>
> So cpusets are, per their own documentation, not hard-limits in the face
> of hotplug.
>

Right. So it is up to us to strike a balance in whatever way we choose -

o just kill those tasks and be done with it
o or come up with nice variants

(It is worth noting that the documentation is flexible in the sense that it
doesn't imply any hard-and-fast rule as to how exactly we should implement
the nice variants.)

> I, personally, think we should just kill off tasks in cpuset-constrained
> environments that are nonsensical (no memory, no cpus, etc.).

Even I think just killing the tasks, or maybe even preventing such
destructive hotplug (last cpu in a cpuset going offline), would have been
much easier to handle and also more logical.. and userspace would have been
more cautious while dealing with cpusets, from the beginning.

> But, it
> would seem we've already supported this (inherit the parent in the face
> of hotplug) behavior in the past. Not sure we should break it ... at
> least on the surface.
>

Yes. Now that the kernel has already supported a nice variant for a long
time, it wouldn't be good to break that, IMHO. But the question is, is that
particular nice variant/feature (move tasks to another cpuset, so that we
strictly follow the cpuset concept no matter what) really that good of a
compromise? IOW, if we came up with such a nice variant to be
good/accommodating to users, have we really achieved that goal, or have we
created more woes instead?

So, I think, considering the following 2 factors, namely:

o cpusets are not hard-limits in the face of hotplug, as per their own
  documentation, and the doc itself promises flexible nice variants, whose
  implementation we are free to choose.
o sched_setaffinity() already does what I intended to do for cpusets with
  this patchset.

Considering these, I think it is OK to revisit and rework how we deal with
cpusets during hotplug...

Oh, by the way, my patchset also exposes a new file that shows exactly what
cpus the tasks in a cpuset are allowed to run on - so that we are not doing
anything sneaky under the hood, without the user's knowledge. So it is also
easy for userspace to check if things deviated from the original
configuration, and establish a new configuration if needed, by writing to
the cpuset.cpus file, which the kernel will immediately honour.

>>> o Tasks might get pinned to lesser number of cpus, unreasonably.
>>
>> -ENOPARSE, are you trying to say that when the set contains 4 cpus and
>> you unplug one its left with 3? Sounds like pretty damn obvious, that's
>> what unplug does, it takes a cpu away.
>
> I think he's saying that it's pinned to 3 forever, even if that 4th CPU
> is re-plugged.
>

Yes, I meant that. Sorry for not being clear.

<snip>

>
> So I can see several solutions:
>
> - Rework cpusets to not be so nice to the user and kill off tasks that
>   run in stupid cpusets. (to be written)
> - Keep current behavior to be nice to the user, but make it much noisier
>   when the cpuset rules are being broken because they are stupid (do
>   nothing choice)
> - Track/restore the user's setup when it's possible to do so. (this
>   patchset)
>
> I'm not sure any of these is "better" than the rest, but they probably
> all have distinct merits.
>

Yep, and it makes sense to choose the one which the kernel is willing to
support, within its constraints. And if that happens to be a nice variant
anyway, why not choose the one which actually helps...?