On Wednesday 10 October 2007 12:23:14 am Tim Cutts wrote:
> We then have a default memory limit on the queues which
> is really very low indeed (1.9 GB, typically, because we have 2 GB
> RAM per core on our nodes). If the user wants more memory, they have
> to set a new higher limit themselves.
I'm also relying on LSF's LSB_MEMLIMIT_ENFORCE option to take care of
memory-greedy jobs.
Before that, I tried to modify the default VM overcommit behavior on
individual nodes, playing with the vm.overcommit_memory and
vm.overcommit_ratio sysctl values.
By setting overcommit_memory=2 and an appropriate overcommit_ratio, you
can basically prevent any swapping. The result is that processes'
malloc()s going beyond the limits are denied. This is cool from the
sysadmin standpoint, since the greedy applications are killed before
bringing the machine to its knees. But it may just as well happen that an
application trying to use the last few available MBs gets killed, while
another one has already allocated several GBs, which is not especially
fair. On top of that, most scientific applications are not very
careful about checking errors, so a denied malloc() usually ends in a
crash with no useful message. So our users were beginning to complain
that their applications were crashing for no apparent reason when they
were reaching the overcommit limits. Which made me realize that this
solution was probably not optimal.
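To make the fairness problem concrete, here is a rough sketch of the
accounting in strict overcommit mode (overcommit_memory=2). It assumes
the usual commit limit of swap plus overcommit_ratio percent of RAM and
ignores huge pages; the function names and the sample numbers are made
up for illustration.

```python
def commit_limit_kb(ram_kb, swap_kb, overcommit_ratio):
    """Approximate commit limit in strict mode: swap + ratio% of RAM.

    Simplification: huge pages and kernel reserves are ignored.
    """
    return swap_kb + ram_kb * overcommit_ratio // 100


def allocation_allowed(committed_kb, request_kb, limit_kb):
    """A new allocation is denied once total commitments would pass the limit."""
    return committed_kb + request_kb <= limit_kb


# Hypothetical node: 4 GB RAM, no swap, overcommit_ratio=100.
limit = commit_limit_kb(4 * 1024 * 1024, 0, 100)

# A greedy job already holds 3900 MB; a small job now asks for 300 MB.
greedy_kb = 3900 * 1024
small_request_kb = 300 * 1024

# The small job is the one that gets refused, not the greedy one.
print(allocation_allowed(greedy_kb, small_request_kb, limit))
```

The kernel does not care who already holds the memory, only whether the
next request fits, which is why the latecomer takes the hit.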
So LSF's per-job memory limit enforcement did the trick for us: an esub
script checks that users can't request funny limits, and jobs using
more than requested get killed. That's good for serial jobs.
But parallel (read MPI) jobs are a different can of worms. Say you have
2 dual-cpu nodes, with 4GB each. A user can submit a job using 4 CPUs
and 6GB of memory without any problem as long as those 6GB are equally
balanced between the two nodes. But since LSF's conception of memory
limits is *per job*, it means that, for this specific job, we need to
set -M6000000 if we want it to run. And this limit won't prevent a
process of this job from using more than 4GB on the first node, making
it unusable...
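A small sketch of why the per-job limit misses this case. The node
names, the usage figures and both helper functions are hypothetical;
memory is in KB to match -M6000000, with node RAM rounded to 4000000 KB.

```python
def per_job_limit_ok(usage_per_node_kb, job_limit_kb):
    """Per-job check: only the sum over all nodes is compared to the limit."""
    return sum(usage_per_node_kb.values()) <= job_limit_kb


def overloaded_nodes(usage_per_node_kb, node_ram_kb):
    """Nodes where the job's local usage exceeds the physical RAM."""
    return [node for node, used in usage_per_node_kb.items()
            if used > node_ram_kb]


JOB_LIMIT_KB = 6_000_000   # the -M6000000 from the example
NODE_RAM_KB = 4_000_000    # 4GB per node, rounded

balanced = {"node1": 3_000_000, "node2": 3_000_000}
skewed = {"node1": 4_500_000, "node2": 1_000_000}

for name, usage in (("balanced", balanced), ("skewed", skewed)):
    print(name,
          "within job limit:", per_job_limit_ok(usage, JOB_LIMIT_KB),
          "overloaded:", overloaded_nodes(usage, NODE_RAM_KB))
```

The skewed job stays under the per-job limit while still pushing node1
past its physical memory, which is exactly the case the limit cannot
catch.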
So anyway, no solution is perfect. I guess that what the Linux kernel
really misses are memory quotas. Per user. Exactly like disk quotas.
That would be *really* neat and solve a whole range of problems.
> When they do that, we have
> supplied LSF with an esub script which then checks that the user has
> supplied both the new memory, and a suitable resource selection and
> reservation option. If they have not, the job is rejected. So for
> example, if the user asks for a 6 GB memory limit, the esub will
> check that they have requested a machine with at least 6GB of free
> memory, and then reserve that memory with the scheduler. For
> example:
>> -M6000000 -R"select[mem>6000] rusage[mem=6000]"
I'm not 100% certain here, but I would have assumed that it would be the
scheduler's job to select a host with enough resources to run the job.
So from my understanding, specifying -R"rusage[mem=6000]" would be
sufficient to select a machine with 6GB available. But I may have
missed some LSF subtleties. :)
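For what it's worth, the consistency check itself is simple. Below is a
sketch of the idea only: it parses a plain option string with regular
expressions and is NOT the real esub interface. The function name is
made up, and the KB (for -M) versus MB (for rusage) units are inferred
from the 6000000 / 6000 pair in the example above.

```python
import re


def limit_matches_reservation(options):
    """Return True if the -M limit (KB) is covered by rusage[mem=...] (MB).

    Hypothetical helper working on a raw option string; a real esub
    would read the submission parameters LSF hands to it instead.
    """
    limit = re.search(r"-M\s*(\d+)", options)
    reserved = re.search(r"rusage\[mem=(\d+)\]", options)
    if limit is None or reserved is None:
        return False  # one of the two options is missing: reject
    limit_kb = int(limit.group(1))
    reserved_mb = int(reserved.group(1))
    # Same factor of 1000 as in the quoted example (6000000 KB <-> 6000 MB).
    return reserved_mb * 1000 >= limit_kb


print(limit_matches_reservation('-M6000000 -R"rusage[mem=6000]"'))
print(limit_matches_reservation('-M6000000'))
```

Note that this accepts a request with rusage alone, without the
select[] part, which is the variant I was wondering about.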
Cheers,
--
Kilian