cgroups: so close and yet so far away from per-user fair scheduling

June 17, 2011

Suppose, not entirely hypothetically, that you have some shared,
multiuser compute servers. Further suppose that sometimes, person A is
running a single compute process while person B is running, oh, nine.
Since Linux divides CPU time among processes without caring who owns
them, this means that person A is getting 1/10th of the CPU while person
B is getting 9/10ths of it. This doesn't seem entirely fair; it would
be better if A and B split the CPU 50/50 regardless of how many compute
jobs each ran.

(Since these are all multi-CPU machines, the real examples are more
complicated. But yes, periodically there are more compute jobs than
there are cores.)
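
To make the arithmetic concrete, here is a toy calculation (nothing the
kernel actually runs). It assumes a single CPU and processes that are
CPU-bound all the time; the function names are made up for illustration.

```python
from collections import Counter
from fractions import Fraction


def per_process_shares(owners):
    """CPU fraction each user gets when every process gets an equal slice."""
    counts = Counter(owners)
    total = len(owners)
    return {user: Fraction(n, total) for user, n in counts.items()}


def per_user_shares(owners):
    """CPU fraction each user gets when every active user gets an equal slice."""
    users = set(owners)
    return {user: Fraction(1, len(users)) for user in users}


# One process owned by A, nine owned by B.
owners = ["A"] + ["B"] * 9

print(per_process_shares(owners))  # A gets 1/10, B gets 9/10
print(per_user_shares(owners))     # A and B each get 1/2
```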

Modern Linux kernels come with support for cgroups, which are designed
to enable this sort of thing. I'll cut to the chase: cgroups can, at
least in theory, do exactly the per-user fair scheduling that we want
here. In practice Linux is let down by the current state of the user
tools, which lack the features you need to make this feasible.

Using cgroups to get per-user fair scheduling is pretty
straightforward: you put each user into their own cgroup (or at
least each real user; you might want to do something different with
system daemon UIDs) and give each user cgroup the same cpu.shares
value. The system will then evenly divide the available CPU up between
all users with active processes. The obvious place to manage all of this
is in a PAM module, which can create the per-user cgroup on the fly the
first time it's necessary and so on.
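
A minimal sketch of what such on-the-fly setup boils down to, assuming
the cgroup filesystem interface with the cpu controller mounted at
/sys/fs/cgroup/cpu (the mount point varies between systems) and run as
root. The `users/` parent group and the function names are hypothetical,
and a real PAM module would be written in C; this only shows the steps.

```python
import os

# Assumption: where the cpu controller hierarchy is mounted on this system.
CGROUP_CPU_ROOT = "/sys/fs/cgroup/cpu"
# Every user gets the same value, so active users split the CPU evenly.
USER_SHARES = 1024


def user_cgroup_path(username, root=CGROUP_CPU_ROOT):
    """Directory of the (hypothetical) per-user cgroup under users/."""
    return os.path.join(root, "users", username)


def ensure_user_cgroup(username, root=CGROUP_CPU_ROOT, shares=USER_SHARES):
    """Create the user's cgroup if it is missing and set its cpu.shares."""
    path = user_cgroup_path(username, root)
    os.makedirs(path, exist_ok=True)  # making the directory creates the cgroup
    with open(os.path.join(path, "cpu.shares"), "w") as f:
        f.write("%d\n" % shares)
    return path


def move_process(pid, cgroup_path):
    """Put a task into the cgroup; children it forks later stay there."""
    with open(os.path.join(cgroup_path, "tasks"), "w") as f:
        f.write("%d\n" % pid)
```

At login time the module would call something like
`move_process(os.getpid(), ensure_user_cgroup("alice"))` before starting
the user's session, so that everything the session runs lands in that
user's cgroup.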

The kernel support for all of this is there, as are most of the
user-level tools you'd need (in the form of libcg and associated programs); there's
even a PAM module, which classifies users into cgroups based on a
configuration file or two. However, what the tools don't have is
any ability to have generic entries in the configuration files for
creating cgroups and assigning users to them. If you want to have
one cgroup per user, you get to write them out explicitly (and then
the tools will create them all ahead of time). Oh sure, you can generate
the config files with a script, but you also have to poke various
daemons every time you want your config file changes to take effect.
Things get annoying fast.
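
For illustration, the generating script could look roughly like this.
The output syntax is an assumption based on the usual layout of the
libcg config files (group definitions with a cpu.shares setting, and
rules lines of user, controller, destination); check it against the
man pages for your installed version. The `users/` prefix and the
function names are hypothetical.

```python
def cgconfig_entry(username, shares=1024):
    """One explicit group definition for the cgroup config file."""
    return (
        "group users/%s {\n"
        "    cpu {\n"
        "        cpu.shares = %d;\n"
        "    }\n"
        "}\n" % (username, shares)
    )


def cgrules_entry(username):
    """One rules line: user, controller, destination cgroup."""
    return "%s\tcpu\tusers/%s\n" % (username, username)


def generate(usernames, shares=1024):
    """Return (group definitions, rules) text for every listed user."""
    names = sorted(set(usernames))  # one entry per user, in stable order
    cgconfig = "".join(cgconfig_entry(u, shares) for u in names)
    cgrules = "".join(cgrules_entry(u) for u in names)
    return cgconfig, cgrules
```

The list of names could come from `pwd.getpwall()`, filtered to real
users by whatever UID cutoff your site uses. None of this removes the
need to regenerate the files and poke the daemons whenever a user is
added.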

(I also wonder how happy the kernel will be to have a thousand or so
cgroups, almost all of which are unused at any given time, since only
a handful of our users will log on to a compute server at once.)

PS: the tragic thing is that a hard-coded PAM module would be almost
trivial (and I've written PAM modules before). But that would mean
building and maintaining a custom PAM module, and this issue is not
quite important enough here to justify that.

(Like most sysadmins, we get a modest amount of hives at locally
developed software. The closer we can be to stock systems the happier we
are, because it means that someone else is maintaining the software.)

Sidebar: systemd, cgroups, and per-user fair scheduling

It appears that this entire issue will be rendered moot for us if and
when Ubuntu does an LTS release that's based on systemd. Per this blog
posting and more documentation, current versions of systemd can already
put users into per-user cgroups for you with the right options set on
the systemd PAM module.

I admit that I'm kind of looking forward to this.

(Well, I'm not looking forward to yet another init replacement and
init system, but there doesn't seem to be anything I can do about
that. Maybe systemd will be the last one for a while.)