The following series is an RFC for some code I wrote in conjunction with some rt/cfs load-balancing enhancements. The enhancements aren't quite ready to see the light of day yet, but this particular fix is ready for comment. It applies to sched-devel.

This series addresses a problem that I discovered while working on the rt/cfs load-balancer, but it appears it could affect upstream too (though it's much less likely to ever occur there).

Patches 1 & 2 move the existing balancer data into a "sched_balancer" container called "group_balancer". Patch #3 then adds a new type of balancer called a "core balancer".

Here is the problem statement (also included in Documentation/scheduler):

Core Balancing
--------------

The standard group_balancer manages SCHED_OTHER tasks based on a hierarchy of sched_domains and sched_groups as dictated by the physical cache/node topology of the hardware. Each group may contain one or more cores which have a specific relationship to other members of the group. Balancing is always performed on an inter-group basis.

For example, consider a quad-core, dual-socket Intel Xeon system. It has a total of 8 cores across one logical NUMA node, with a cache shared between core pairs [0,2], [1,3], [4,6], and [5,7]. From a sched_domain/group perspective on core 0, this looks like the following:

  domain-0: groups [0], [2]
  domain-1: groups [0,2], [1,3], [4,6], [5,7]
  domain-2: group  [0-7]

Recall that balancing is always inter-group, and is more aggressive in the lower domains than in the higher ones. The balancing logic will attempt to balance between [0],[2] first, [0,2], [1,3], [4,6], [5,7] second, and [0-7] last. Note that since domain-2 consists of only 1 group, it will never result in a balance decision, since there must be at least two groups to consider.
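
To make the ordering concrete, here is a small stand-alone sketch (plain user-space C with made-up structures and loads; it is not the kernel's actual sched_domain/sched_group code) that walks the three domain levels above and shows why a level with only one group can never produce a balance decision:

#include <stdio.h>

/* Simplified stand-ins for sched_group/sched_domain (illustration only). */
struct group {
	const char   *name;   /* e.g. "[0,2]"                      */
	unsigned long load;   /* aggregate load of the cores in it */
};

struct domain {
	const char   *name;   /* e.g. "domain-1"                   */
	struct group *groups;
	int           nr_groups;
};

/*
 * Inter-group balancing: compare the local group against the busiest
 * other group at this level.  With only one group there is nothing to
 * compare against, so no balance decision can ever be made.
 */
static void balance_level(struct domain *d)
{
	struct group *busiest = NULL;
	int i;

	for (i = 1; i < d->nr_groups; i++)	/* group 0 is "local" here */
		if (!busiest || d->groups[i].load > busiest->load)
			busiest = &d->groups[i];

	if (!busiest) {
		printf("%s: single group, nothing to balance\n", d->name);
		return;
	}
	printf("%s: local %s (load %lu) vs busiest %s (load %lu)\n",
	       d->name, d->groups[0].name, d->groups[0].load,
	       busiest->name, busiest->load);
}

int main(void)
{
	struct group d0[] = { { "[0]", 3 }, { "[2]", 1 } };
	struct group d1[] = { { "[0,2]", 4 }, { "[1,3]", 2 },
			      { "[4,6]", 5 }, { "[5,7]", 0 } };
	struct group d2[] = { { "[0-7]", 11 } };
	struct domain levels[] = {
		{ "domain-0", d0, 2 },	/* most aggressive           */
		{ "domain-1", d1, 4 },
		{ "domain-2", d2, 1 },	/* never balances: one group */
	};
	int i;

	for (i = 0; i < 3; i++)
		balance_level(&levels[i]);
	return 0;
}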

This layout is quite logical. The idea is that [0] and [2] can balance between each other aggressively in a very efficient manner since they share a cache. Once the load is equalized between the two cache-peers, domain-1 can spread the load out between the other peer-groups. This represents a pretty good way to structure the balancing operations.

However, there is one slight problem with the group_balancer: since we always balance inter-group, intra-group imbalances may result in suboptimal behavior if we hit the condition where the lower-level domains (domain-0 in this example) are ineffective. This condition can arise whenever a domain-level imbalance cannot be resolved, leaving the group with a high aggregate load rating while some of its cores remain relatively idle.

For example, if a core has a large but affined load, or otherwise untouchable tasks (e.g. RT tasks), SCHED_OTHER will not be able to equalize the load. The net result is that one or more members of the group may remain relatively unloaded, while the load rating for the entire group is high. The higher layer domains will only consider the group as a whole, and the lower level domains are left powerless to equalize the vacuum.

To address this concern, core_balancer adds the concept of a new grouping of cores at each domain-level: a per-core grouping (each core in its own unique group). This "core_balancer" group is configured to run much less aggressively than its topologically relevant brother: "group_balancer". Core_balancer will sweep through the cores every so often, correcting intra-group vacuums left over from the lower-level domains. In most cases, the group_balancer should have already established equilibrium, thereby benefiting from the hardware's natural affinity hierarchy. In the cases where it cannot achieve equilibrium, the core_balancer tries to take it one step closer.
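
A rough sketch of that idea follows (again hypothetical user-space C, not the actual patch; the structure and function names are invented for illustration). Every core is treated as its own group, and an occasional sweep pulls movable load toward a core that the regular group_balancer left idle behind a pinned load:

#include <stdio.h>

#define NR_CORES 8

/*
 * Hypothetical illustration of the core_balancer idea: in addition to
 * the topological groups, treat every core as its own group and sweep
 * them on a long interval, filling in cores left idle by group_balancer.
 */
struct core {
	int           id;
	unsigned long load;      /* migratable (SCHED_OTHER) load    */
	unsigned long pinned;    /* affined/RT load that cannot move */
};

/* One sweep across per-core "groups": move work toward an idle core. */
static void core_balance_sweep(struct core *cores, int nr)
{
	int i, busiest = 0, idlest = 0;

	for (i = 1; i < nr; i++) {
		if (cores[i].load + cores[i].pinned >
		    cores[busiest].load + cores[busiest].pinned)
			busiest = i;
		if (cores[i].load + cores[i].pinned <
		    cores[idlest].load + cores[idlest].pinned)
			idlest = i;
	}

	if (busiest == idlest || !cores[busiest].load)
		return;		/* nothing movable, or already level */

	cores[busiest].load--;	/* migrate one unit of SCHED_OTHER load */
	cores[idlest].load++;
	printf("moved one task: core %d -> core %d\n",
	       cores[busiest].id, cores[idlest].id);
}

int main(void)
{
	/*
	 * Core 0 carries a pinned RT load while core 2 sits nearly idle,
	 * yet their group's aggregate load looks as busy as its peers.
	 */
	struct core cores[NR_CORES] = {
		{ 0, 1, 4 }, { 1, 2, 0 }, { 2, 0, 0 }, { 3, 2, 0 },
		{ 4, 2, 0 }, { 5, 2, 0 }, { 6, 2, 0 }, { 7, 2, 0 },
	};

	core_balance_sweep(cores, NR_CORES);
	return 0;
}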

By default, group_balancer runs at sd->min_interval, whereas core_balancer starts at sd->max_interval (both of which can be reprogrammed dynamically). Both employ a multiplicative backoff algorithm when faced with repeated migration failures.
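
For illustration, the interval handling can be pictured roughly like this (a hypothetical sketch of the backoff scheme described above, with made-up interval values and invented names; the real intervals live in the sched_domain and are tunable at run time):

#include <stdio.h>

/* Illustrative numbers only; real values come from the sched_domain.  */
#define MIN_INTERVAL_MS	    8	/* group_balancer starts here          */
#define MAX_INTERVAL_MS	  512	/* core_balancer starts here           */
#define BACKOFF_LIMIT_MS 4096	/* cap for the multiplicative backoff  */

struct balancer {
	const char  *name;
	unsigned int interval_ms;	/* time until the next balance pass */
};

/* Double the interval after a failed migration, up to a fixed cap. */
static void balance_tick(struct balancer *b, int migration_failed)
{
	if (migration_failed) {
		b->interval_ms *= 2;
		if (b->interval_ms > BACKOFF_LIMIT_MS)
			b->interval_ms = BACKOFF_LIMIT_MS;
	}
	printf("%s: next pass in %u ms\n", b->name, b->interval_ms);
}

int main(void)
{
	struct balancer group = { "group_balancer", MIN_INTERVAL_MS };
	struct balancer core  = { "core_balancer",  MAX_INTERVAL_MS };
	int i;

	/* Repeated migration failures back both balancers off. */
	for (i = 0; i < 4; i++) {
		balance_tick(&group, /* migration_failed = */ 1);
		balance_tick(&core,  /* migration_failed = */ 1);
	}
	return 0;
}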