Hello all:
I've got some problems with my users and RAM usage. One in particular
is doing some algorithm design, and knows his code is buggy. He's
trying to debug it, but needs to do so on a cluster. As such, he's
trying to set memory limits to ensure he doesn't crash nodes (about
50% of his jobs end up crashing their first node, which leaves about
half the cluster "locked" until I catch it and restart the crashed
node). However, we've had lots of problems getting memory limits to
work correctly.
What we need is a way to specify that the first "node" / "process" be
given one limit, and the remainder be given a different limit (e.g.,
first process gets 2GB, the remainder get 1GB). If we just assign 2GB
across the cluster, then half the cluster would go unused (as we have
8GB RAM per node and 8 cores per node).
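To spell out the arithmetic, here is a rough sketch using only the
figures in this post. It assumes one process per core and ignores
whatever RAM the OS itself needs, so the real numbers would be a bit
lower. The function names are just illustrative.

```python
# Figures from this post: 24 nodes, each with 8 GB RAM and 8 cores.
# Assumption: one process per core, no RAM reserved for the OS.
NODES = 24
RAM_PER_NODE_GB = 8
CORES_PER_NODE = 8


def procs_per_node(limit_gb):
    """Processes that fit on one node at a uniform per-process limit."""
    return min(CORES_PER_NODE, RAM_PER_NODE_GB // limit_gb)


def mixed_total(first_gb, rest_gb):
    """Total processes if only the first process gets the larger limit."""
    # First node: one big process, then fill what RAM and cores remain.
    others_on_first = min(CORES_PER_NODE - 1,
                          (RAM_PER_NODE_GB - first_gb) // rest_gb)
    first_node = 1 + others_on_first
    return first_node + (NODES - 1) * procs_per_node(rest_gb)


uniform_2gb = procs_per_node(2) * NODES  # 4 per node: half the cores idle
uniform_1gb = procs_per_node(1) * NODES  # 8 per node: every core used
mixed = mixed_total(2, 1)                # 2GB for the first, 1GB for the rest

print(uniform_2gb, uniform_1gb, mixed)
```

So a uniform 2GB limit costs us half the cores, while a mixed limit
would cost only one core on the first node.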
Also, we've had some problems correctly enforcing memory limits. We
want per-process limits, not job-total limits, and when we tried the
various mem= options, we got results that differed from what we
expected based on reading the appropriate man page.
Any suggestions? This is currently the single largest problem facing
our ROCKS cluster, and has posed a significant reliability problem.
Thanks!
--Jim
Admin of "aeolus", 24 8-core, 8GB nodes.