Hi there
I am a little confused about how Maui and Torque interpret vmem,
particularly in the case of multithreaded jobs.
I have a tiny cluster of virtual machines for experimenting with
scheduling policies. The cluster uses Rocks 5.3 with the Torque roll
that provides Torque 2.4.6 and Maui 3.2.6p21. The only twist is that I
rebuilt the Torque rpm so that I could add --enable-cpuset to the
configure script.
Each compute node has 2 CPUs and 2 GB of memory, and I have been
experimenting with a multithreaded program (it uses Intel MKL or ATLAS)
that requires 1.5 GB of memory. I submit the job using nodes=1:ppn=2 and
vmem=1600mb. If I don't ask Maui to enforce any resource limits and
leave that to Torque and rlimit then everything is ok and the program
runs to completion. I then considered using Maui to enforce the
resource limits...
ENFORCERESOURCELIMITS ON
RESOURCELIMITPOLICY SWAP:ALWAYS:CANCEL
This was mainly because I was interested in providing some better
feedback to users when they exceed their requested memory. So with
Maui in charge of vmem resource limits, and ignvmem=true for Torque, I
submitted my job again. However, this time the job was killed and the
following appeared in the Maui log...
job 56 exceeds requested swap limit (1548 > 800)
job '56' in state 'Running' has exceeded SWAP resource limit (1548 > 800)
(action CANCEL will be taken)
If I try with pvmem=1600mb then the job will never run because there is
no compute node with 2 x 1600mb of memory.
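For what it's worth, the 800 in the log is exactly half of my 1600mb
request, so my guess (and it is only a guess) is that Maui divides vmem
by the two tasks from ppn=2 and then compares that figure against the
usage of the whole job. A small Python sketch of the arithmetic follows.
The function names are made up for illustration, and nothing here calls
Maui or Torque:

```python
# Illustrative arithmetic only. The even split across tasks is an
# assumption inferred from the log line "(1548 > 800)", not confirmed
# Maui behaviour.

def per_task_limit_mb(job_limit_mb, tasks):
    """Job-wide limit divided evenly across tasks (assumed)."""
    return job_limit_mb // tasks

def job_total_mb(per_process_limit_mb, tasks):
    """Total memory a node must offer for a per-process limit."""
    return per_process_limit_mb * tasks

NODE_MEMORY_MB = 2 * 1024   # each compute node has 2 GB
TASKS = 2                   # from nodes=1:ppn=2
MEASURED_USAGE_MB = 1548    # usage reported in the Maui log

# vmem=1600mb: the limit Maui appears to apply per task
vmem_per_task = per_task_limit_mb(1600, TASKS)
vmem_exceeded = MEASURED_USAGE_MB > vmem_per_task

# pvmem=1600mb: what the node would need to provide in total
pvmem_total = job_total_mb(1600, TASKS)
pvmem_fits_node = pvmem_total <= NODE_MEMORY_MB

print(vmem_per_task, vmem_exceeded)    # 800 True
print(pvmem_total, pvmem_fits_node)    # 3200 False
```

If that reading is right, it would explain both symptoms: the vmem job
is cancelled at 800, and the pvmem job asks for more than a node has.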
Interestingly, if I ask Maui to enforce limits on MEM rather than SWAP,
and I use mem instead of vmem, then everything appears to be ok.
However, I can see problems ahead if the jobs were constrained by mem
rather than vmem, so I don't particularly want to go in that direction.
Can anyone please identify my embarrassing mistake?
Many thanks
Martin
Torque config:
#
# Create queues and set their attributes.
#
#
# Create and define queue default
#
create queue default
set queue default queue_type = Execution
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = False
set server acl_hosts = XXX.XXX.XXX.XXX
set server managers = maui@XXX.XXX.XXX.XXX
set server managers += root@XXX.XXX.XXX.XXX
set server default_queue = default
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.walltime = 01:00:00
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 80
Maui config:
RMPOLLINTERVAL 00:00:15
SERVERHOST XXX.XXX.XXX.XXX
SERVERPORT 42559
SERVERMODE NORMAL
RMCFG[base] TYPE=PBS
ADMIN1 maui root
ADMIN3 ALL
LOGFILE maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3
QUEUETIMEWEIGHT 1
BACKFILLPOLICY FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST
NODEALLOCATIONPOLICY MINRESOURCE
ENFORCERESOURCELIMITS ON
#RESOURCELIMITPOLICY MEM:ALWAYS:CANCEL
RESOURCELIMITPOLICY SWAP:ALWAYS:CANCEL