On Monday 08 October 2007, Olli-Pekka Lehto wrote:
> Hello,
>
> I'm interested in hearing some best practice solutions in fine-grained
> management memory resources on clusters of SMPs. How do you enforce real
> memory usage inside (RHEL-based) cluster nodes running multiple serial
> jobs simultaneously? More specifically, how to do this efficiently when
> some of the jobs map copious amounts of virtual memory but have only a
> fraction of it resident at any given time?
>
> As SMP systems keep getting constantly fatter (and the potential for
> users interfering with each others' jobs increasing) it would be great
> to have something like AIX's WLM (Workload Manager) on Linux to
> effectively manage intra-SMP resources.
>
> Olli-Pekka
Ah, a family of issues near to my heart. :)
I'll ask a broader question: How do you enforce real memory usage in modern
Linux *at all*?
We were interested in this because we were having user jobs regularly cause
nodes to go into an Out Of Memory (OOM) state, triggering the kernel's
oom_killer. The oom_killer would sometimes kill system processes, which
sometimes caused subsequent jobs to die. Even if subsequent jobs didn't
die, recovery required that we manually close the node, reboot it when
running jobs finished, then reopen it. This gets to be pretty dreary after
a while.
Our problem is somewhat different from your interests, but some of the same
issues come into play. See below for the partially satisfying solution
that we put in place for our OOM woes. First a review of the problem
landscape as I understand it.
You can try to enforce memory limits with a daemon, but you risk missing
important events, including a badly behaved process suddenly using a whole
lot of memory all at once. If that happens, your daemon is nearly useless
since swapping and/or oom_killer will be running, and not your daemon.
Your node may lock up for a while, which was what the daemon was supposed
to prevent.
I think you really want to do it in the kernel, so that badly behaved
requests for memory (allocation and/or writing) can be cut off before they
affect anyone else.
But the kernel doesn't really enforce anything useful. It doesn't enforce a
resident set size (RSS) limit, even though setrlimit() will let you request
such a limit (RLIMIT_RSS). As I understand it, modern Linux reports a
per-process RSS figure but doesn't try to enforce a limit on it, because
the semantics of RSS are unclear given modern memory management methods.
RSS probably isn't even what you want -- you probably want to limit the
amount of physical memory each job uses, keeping the sum of the limits at
or below the amount of total RAM, to avoid swapping. There is no way to
communicate this limit to the kernel; I suspect the kernel doesn't even
track physical memory use except globally.
The kernel *is* able to enforce the amount of virtual memory allocated per
process (set with setrlimit()), but as you noted, that is of limited value
when different applications can have very different overcommit percentages
(virtual memory allocated beyond the amount actually used).
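
As a minimal sketch of such a per-process limit, assuming Linux and Python's
standard resource module: RLIMIT_AS is the setrlimit() resource that caps a
process's virtual address space. The helper name run_with_vm_limit and the
512 MiB / 1 GiB figures are made up for illustration.

```python
import resource
import subprocess
import sys


def run_with_vm_limit(argv, limit_bytes):
    """Run argv with RLIMIT_AS (virtual address space) capped at limit_bytes."""
    def set_limit():
        # Runs in the child between fork() and exec(). The limit applies
        # per process and is inherited by anything the child forks later.
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

    return subprocess.run(argv, preexec_fn=set_limit,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # Hypothetical demo: the child asks for 1 GiB under a 512 MiB cap.
    child_code = "b = bytearray(1 << 30)"
    result = run_with_vm_limit([sys.executable, "-c", child_code], 512 << 20)
    lines = result.stderr.strip().splitlines()
    print("exit status:", result.returncode)
    print("last stderr line:", lines[-1] if lines else "(no stderr)")
```

Note that the cap counts address space that is mapped, not memory that is
touched, so a program that maps far more than it uses hits the limit long
before it comes close to exhausting RAM.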
But take a step back from considering the limits you can place on a given
process. You probably want a policy that limits memory use at the job
level, not at the process level, regardless of whether you have one job or
multiple jobs running on a node. There is no kernel mechanism for that
either.
Seems your best bet might be to write a daemon, and hope that actual use
patterns don't cause swapping or OOM before the daemon can act.
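
The polling core of such a daemon might look roughly like the sketch below.
It assumes the VmRSS line in /proc/<pid>/status is a good enough measure, and
that the batch system can supply the list of PIDs belonging to each job; the
function names and the jobs mapping are hypothetical. Summing VmRSS counts
shared pages once per process, so the total is an overestimate.

```python
import re


def parse_vmrss_kib(status_text):
    """Return VmRSS in KiB from /proc/<pid>/status text, or 0 if absent."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def read_proc_status(pid):
    """Read /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as handle:
        return handle.read()


def job_rss_kib(pids, read_status=read_proc_status):
    """Sum VmRSS over the processes that make up one job."""
    total = 0
    for pid in pids:
        try:
            total += parse_vmrss_kib(read_status(pid))
        except OSError:
            # The process exited between listing and reading; skip it.
            pass
    return total


def jobs_over_limit(jobs, limit_kib, read_status=read_proc_status):
    """Return the ids of jobs whose summed RSS exceeds limit_kib.

    jobs maps a job id to the list of PIDs belonging to that job.
    """
    return [job_id for job_id, pids in jobs.items()
            if job_rss_kib(pids, read_status) > limit_kib]
```

A daemon would call jobs_over_limit() on a timer and act on the result.
Anything that happens between two polls goes unseen, which is exactly the
weakness described above.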
To end our OOM problems, we took a different route. The job launch
mechanism (via LSF) sets the per-process virtual-memory-allocation limit on
each user job process. We can prevent OOM this way, unless a job both uses
non-standard job launch methods and has runaway memory use (which is rare
in our experience).
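
How the limit value gets chosen isn't spelled out here. Purely as an
illustration, one hypothetical way to derive it is to split usable RAM
across job slots and scale by an assumed overcommit ratio. The function
name, the reserved-memory parameter, and the 1.5 default are all
assumptions, not the algorithm actually in use.

```python
def vm_limit_bytes(ram_bytes, reserved_bytes, slots, overcommit=1.5):
    """Hypothetical per-process RLIMIT_AS value for one of `slots` job slots.

    overcommit is the assumed ratio of virtual memory allocated to memory
    actually used; 1.5 is a placeholder, not a measured value.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    if overcommit < 1.0:
        raise ValueError("overcommit ratio must be at least 1.0")
    usable = ram_bytes - reserved_bytes
    if usable <= 0:
        raise ValueError("reserved memory exceeds RAM")
    return int(usable / slots * overcommit)


if __name__ == "__main__":
    gib = 1 << 30
    # 16 GiB node, 1 GiB held back for the system, 4 job slots.
    limit = vm_limit_bytes(16 * gib, 1 * gib, slots=4)
    print(f"per-process VM limit: {limit / gib:.3f} GiB")
```

With a rule like this, a job whose real overcommit is larger than the
assumed ratio gets cut off before it has used its share of RAM, and a
multi-process job can collectively exceed its share, because the limit is
per process.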
Other weaknesses of our method include:
* It does not prevent heavy swapping (which would be nice to have, but at
least the offending user bears most of the consequences).
* It can prevent a job from using all available RAM if the job has a larger
overcommit than our algorithm assumes.
* When the VM allocation limit is reached, the errors are often cryptic.
Nothing appears in syslog (unlike segfaults, which are logged at least on
x86_64) -- the kernel patch to enable logging is likely pretty trivial,
but stock kernels don't do it. A malloc() will return NULL with errno set
to ENOMEM, which many programs and libraries don't handle properly (or
indeed handle at all -- how many programmers omit checking the return value
or errno?), so the user doesn't get a useful error message. A failed stack
expansion will cause a segfault (as I recall), which is also cryptic to the
user. At least segfaults get logged...
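
One possible mitigation for the cryptic errors, sketched here as an idea
rather than something already in place, is for the job launcher to inspect
how the job ended and attach a hint about the limit. The function
describe_exit and the strings it searches for are assumptions; "Cannot
allocate memory" is the glibc message for ENOMEM, and the other two markers
are what Python and C++ programs typically print on allocation failure.

```python
import signal

# Text that commonly shows up on stderr when an allocation fails.
ALLOC_FAILURE_MARKERS = ("Cannot allocate memory", "MemoryError", "bad_alloc")


def describe_exit(returncode, stderr_text, limit_bytes):
    """Turn a job's exit status into a message mentioning the VM limit.

    returncode follows the subprocess convention: a negative value -N
    means the process was killed by signal N.
    """
    hint = f"per-process virtual memory limit is {limit_bytes >> 20} MiB"
    if returncode == 0:
        return "job completed"
    if returncode == -signal.SIGSEGV:
        return ("job died with SIGSEGV; possibly a failed stack expansion "
                f"or an unchecked NULL from malloc() ({hint})")
    if any(marker in stderr_text for marker in ALLOC_FAILURE_MARKERS):
        return f"job failed to allocate memory ({hint})"
    return f"job exited with status {returncode}"


if __name__ == "__main__":
    print(describe_exit(-signal.SIGSEGV, "", 512 << 20))
    print(describe_exit(1, "foo: Cannot allocate memory", 512 << 20))
```

This is only a heuristic: a segfault can have nothing to do with the limit,
and a program that swallows the error and exits quietly leaves nothing to
match on.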
I'd love to hear other approaches to this family of problems.
David