On Thu, 2007-10-25 at 17:28 -0700, Christoph Lameter wrote:
> On Thu, 25 Oct 2007, David Rientjes wrote:
>
> > The problem occurs when you add cpusets into the mix and permit the
> > allowed nodes to change without knowledge to the application. Right now,
> > a simple remap is done so if the cardinality of the set of nodes
> > decreases, you're interleaving over a smaller number of nodes. If the
> > cardinality increases, your interleaved nodemask isn't expanded. That's
> > the problem that we're facing. The remap itself is troublesome because it
> > doesn't take into account the user's desire for a custom nodemask to be
> > used anyway; it could remap an interleaved policy over several nodes that
> > will already be contended with one another.
>
> Right. So I think we are fine if the application cannot set up boundaries
> for interleave.
>
> > Normally, MPOL_INTERLEAVE is used to reduce bus contention to improve the
> > throughput of the application. If you remap the number of nodes to
> > interleave over, which is currently how it's done when mems_allowed
> > changes, you could actually be increasing latency because you're
> > interleaving over the same bus.
>
> Well you may hit some nodes more than others, so a slight performance
> degradation.
>
> > This isn't a memory policy problem because all it does is effect a
> > specific policy over a set of nodes. With my change, cpusets are required
> > to update the interleaved nodemask if the user specified that they desire
> > the feature with interleave_over_allowed. Cpusets are, after all, the
> > ones that changed the mems_allowed in the first place and invalidated our
> > custom interleave policy. We simply can't make inferences about what we
> > should do, so we allow the creator of the cpuset to specify it for us. So
> > the proper place to modify an interleaved policy is in cpusets and not
> > mempolicy itself.
>
> With that MPOL_INTERLEAVE would be context dependent and no longer
> needs translation. Lee had similar ideas. Lee: Could we make
> MPOL_INTERLEAVE generally cpuset context dependent?

That's what my "cpuset-independent interleave" patch does. David doesn't like the "null node mask" interface because it doesn't work with libnuma. I plan to fix that, but I'm chasing other issues. I should get back to the mempol work after today.

What I like about the cpuset-independent interleave is that the "policy remap" when cpusets are changed is a no-op--no need to change the policy. Just as the "preferred local" policy chooses the node where the allocation occurs, my cpuset-independent interleave patch interleaves across the set of nodes available at the time of the allocation. The application has to specifically ask for this behavior by the null/empty nodemask or the TBD libnuma API. IMO, this is the only reasonable interleave policy for apps running in dynamic cpusets.

An aside: if David et al [at google] are using cpusets on fake numa for resource management [I don't know this is the case, but saw some discussions way back that indicate it might be?], then maybe this becomes less of an issue when control groups [a.k.a. containers] and memory resource controls come to fruition?