Friday, 20 July 2012

Containing jobs with memory leaks

This week some sites suffered from extremely memory-hungry jobs using
up to 16GB of memory and killing the nodes. These were most likely
due to memory leaks. The user cancelled all of them before he was
even contacted, but not before he had created some annoyance.

We have had some discussion about how to fix this, and so far ATLAS
has asked us not to limit on memory because their jobs use more than
what is officially requested for brief periods of time. And this is
true: in fact most of their jobs do this. According to the logs, the
production jobs use up to ~3.5GB mem and slightly less than 5GB
vmem. See the plot below for one random day (other days are similar).

To avoid killing everything while still putting a barrier against
memory leaks, what I'm going to do in Manchester is set a limit of
4GB for mem and a limit of 5GB for vmem.

If you are worried about memory leaks you might want to go through a
similar check. If you are not monitoring your memory consumption on
a per job basis you can parse your logs. For PBS I used this command
to produce the plot above
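The original command isn't reproduced here, but a sketch of that kind of log parsing might look like the following. The log path, record format, group names, and sample values below are illustrative assumptions in the style of torque accounting logs, not the original data:

```shell
# Illustrative sample of torque accounting records ('E' = job end); real logs
# live in something like /var/spool/torque/server_priv/accounting/YYYYMMDD.
cat > sample_accounting.log <<'EOF'
07/20/2012 10:00:01;E;123.ce.example.org;user=atlas001 group=atlprd resources_used.mem=3500000kb resources_used.vmem=4800000kb
07/20/2012 11:00:01;E;124.ce.example.org;user=lhcb001 group=lhcbprd resources_used.mem=900000kb resources_used.vmem=1200000kb
EOF

# Keep the job-end records for the atlprd group, extract mem in kb, and sort
# numerically so the last line is the highest value a job used that day.
grep ';E;' sample_accounting.log | grep 'group=atlprd' \
  | sed 's/.*resources_used\.mem=\([0-9]*\)kb.*/\1/' | sort -n
```

The same pipeline with `resources_used.vmem` extracts the vmem numbers.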

The numbers are already sorted in numerical order, so the last one is
the highest (mem, vmem) a job has used that day. atlprd is the
ATLAS production group, which you can replace with other groups.
ATLAS user jobs have similar usage up to a point, and then every day
you might find a handful of crazy numbers like 85GB vmem and 40GB mem.
These are the jobs we aim at killing.

I thought the batch system was the simplest way because it is only two commands in PBS, but after a lot of reading and a week of testing it turns out it is not possible to over-allocate memory without affecting the scheduling and ending up with fewer jobs on the nodes. This is what I found out:

There are various memory parameters that can be set in PBS:

(p)vmem: virtual memory. PBS doesn't interpret vmem as the almost unlimited address space. If you set this value it will interpret it, for scheduling purposes, as memory + swap available. It might be different with later versions, but that's what happens in torque 2.3.6.

(p)mem: physical memory: that's your RAM.

When there is a p in front it means the limit applies per process rather than per job.

If you set them what happens is as follows:

ALL: if a job arrives without memory settings, the batch system will assign these limits as allocated memory for the job, not only as a limit the job must not exceed.

ALL: if a job arrives with memory resource settings that exceed the limits, it will be rejected.

(p)vmem, pmem: if a job exceeds the settings at run time it will be killed, as these parameters set limits at OS level.

mem: if a job exceeds this limit at run time it will not get killed. This is apparently due to a change in the libraries.
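Since the (p)vmem and pmem limits are enforced at OS level, the kill behaviour can be reproduced locally with a plain ulimit. A minimal sketch, assuming perl is available (the sizes here are illustrative):

```shell
# Run under a ~100MB virtual-memory ulimit (value in kb) inside a subshell,
# then try to build a ~200MB string: the allocation fails and the process
# dies, which is what happens to a job exceeding (p)vmem under PBS.
( ulimit -v 102400
  perl -e '$x = "a" x 200000000; print "survived\n"' ) 2>/dev/null \
  || echo "killed by the vmem limit"
```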
To check how the different parameters affect the jobs, you can submit a csh command directly to PBS and play with the parameters.
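A hypothetical submission of that kind (the queue name, script, and allocation size below are all assumptions for illustration, not the original command) might look like:

```shell
# Submit a csh job to the 'long' queue that tries to grab roughly 6GB of
# virtual memory and hold it for a minute, to see whether the queue settings
# reject it at submission, kill it at run time, or let it finish.
qsub -q long <<'EOF'
#!/bin/csh
perl -e '$x = "a" x 6000000000; sleep 60'
EOF
```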

qmgr
qmgr: set queue long resources_max.vmem = 5gb
qmgr: set queue long resources_max.mem = 4gb
qmgr: set queue long resources_max.pmem = 4gb

These settings will affect the whole queue, so if you are worried
about other VOs you might want to check what sort of memory usage
they have. Although I think only CMS might have a similar usage; I know for sure LHCb uses less. And as said above, this will affect the scheduling.

Update 02/08/2012

RAL and Nikhef use a maui parameter to correct the over-allocation problem:

NODEMEMOVERCOMMITFACTOR 1.5

This will cause maui to allocate up to 1.5 times more memory than there is on the nodes. So if a machine has 2GB of memory, a 1.5 factor allows it to allocate 3GB. The same applies to the other memory parameters described above. The factor can of course be tailored to your site.

On the ATLAS side there is a memory parameter that can be set in panda. It sets
a ulimit on vmem on a per-process basis in the panda wrapper. It didn't
seem to have an effect on the memory seen by the batch system, but that
might be because forked processes are double counted by PBS, which opens a whole different can of worms.
