Hi all,
With the fresh release of the 2.6.25 kernel, Linux cgroups (fka process
containers) are getting more attention. For those unfamiliar with the
concept, Control Groups are "a generic framework where
several 'resource controllers' can plug in and manage different
resources of the system such as process scheduling or memory
allocation. [They] also offer a unified user interface, based on a
virtual filesystem where administrators can assign arbitrary resource
constraints to a group of chosen tasks."
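To make that interface concrete, here's a minimal sketch of what the virtual filesystem looks like (the mount point and group name are arbitrary, everything needs root, and this assumes a kernel built with CONFIG_CGROUPS and the relevant controllers):

```shell
# Mount the cgroup filesystem with the cpu and memory controllers attached
mkdir -p /dev/cgroup
mount -t cgroup -o cpu,memory none /dev/cgroup

# Creating a directory creates a group; the kernel populates it
# with the controllers' control files
mkdir /dev/cgroup/users

# Move the current shell (and its future children) into the group
echo $$ > /dev/cgroup/users/tasks

# The 'tasks' file lists the PIDs currently in the group
cat /dev/cgroup/users/tasks
```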
2.6.24 introduced a CPU bandwidth allocation controller, and today,
2.6.25 features a memory resource controller. Patches for network and
block I/O bandwidth control have also been submitted. So it looks to me
like everything is in place to create real process containers, able to
hold individual users' jobs and to keep them within defined limits. At
the kernel level.
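As an example of the CPU side: my understanding is that the 2.6.24
controller exposes a cpu.shares file giving each group a proportional
scheduling weight (the default being 1024). A rough sketch, with
made-up group names:

```shell
# Mount the cgroup filesystem with the cpu controller
mkdir -p /dev/cgroup
mount -t cgroup -o cpu none /dev/cgroup

# Under contention, userA's tasks get twice the CPU time of userB's
mkdir /dev/cgroup/userA /dev/cgroup/userB
echo 2048 > /dev/cgroup/userA/cpu.shares
echo 1024 > /dev/cgroup/userB/cpu.shares
```

Note that shares are relative weights, only enforced when the CPUs are
actually contended -- a proportional guarantee, not a hard cap.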
One of the limitations of current resource schedulers is CPU usage
limit enforcement on multi-core systems. On non-NUMA systems (hello
Intel! :)), there's no mechanism to prevent a user from submitting a
job which asks for, say, one core on an 8-core machine, and then
spawning 8 threads which will be spread over the 8 cores and make
exclusive use of all the machine's CPU resources. This impacts the
performance of other users' jobs in a sneaky way, and, as a
rigid^Wrighteous sysadmin, I can't tolerate it.
I've been looking for the longest time for a way to "pin" a group of
processes to a specific *number* of cores, rather than to a specific
list of cores (i.e. I don't want to limit a process to run on cores 0
and 1, but rather to say that this process should use at most 2 cores
on the system, whichever they are). And it looks like cgroups would be
a good candidate to achieve this.
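For contrast, the cpuset controller (now reachable through the same
cgroup filesystem) is exactly the list-based pinning I'd like to avoid
-- you have to name the cores explicitly (exact file names may vary
between kernel versions):

```shell
# cpuset takes an explicit CPU (and memory node) list:
# this pins tasks to cores 0-1 specifically, not to "any 2 cores"
mkdir -p /dev/cpuset
mount -t cgroup -o cpuset none /dev/cpuset
mkdir /dev/cpuset/twocores
echo 0-1 > /dev/cpuset/twocores/cpuset.cpus
echo 0   > /dev/cpuset/twocores/cpuset.mems
echo $$  > /dev/cpuset/twocores/tasks
```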
Another benefit would be memory resource allocation. Our current
scheduler (and it's probably the case for the others as well) enforces
memory limits by accounting for the memory used by jobs every x
minutes. So a job with peak memory bursts can easily go unnoticed and
continue to run, even though it may already have triggered the OOM
killer or prevented another process' memory allocation. If the
enforcement is done at the kernel level, I assume it happens in real
time, and this kind of problem would be avoided.
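If the memory controller works the way I expect, the enforcement side
would be as simple as this (the 512M limit, group name, and $JOB_PID
are made-up examples; I believe the limit file accepts k/M/G suffixes,
but I haven't verified that on 2.6.25):

```shell
# Mount the cgroup filesystem with the memory controller
mkdir -p /dev/cgroup
mount -t cgroup -o memory none /dev/cgroup

# Hard-cap a job's memory at the kernel level: allocations beyond
# the limit are reclaimed or OOM-killed within the group
mkdir /dev/cgroup/job42
echo 512M > /dev/cgroup/job42/memory.limit_in_bytes
echo $JOB_PID > /dev/cgroup/job42/tasks

# Current usage, accounted by the kernel as it happens
cat /dev/cgroup/job42/memory.usage_in_bytes
```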
I have yet to try implementing cgroups to see whether they could be
used in an HPC environment to enforce reliable resource allocation
limits, but I was wondering if anybody has tried this already,
especially the integration with existing schedulers, or if anyone has
ideas on the subject.
Thanks,
--
Kilian