Thanks James,
Each node in our cluster has 12Gb of memory and 8 processors. The resources the job tried to request were something like ' -l mem=12gb, nodes=1:ppn=4' (the Java app uses 4 processors).
My problems were:
(1) Torque did not let me use -l nodes=1:ppn=4, mem=12gb, vmem=12gb, pmem=12gb at all; it reported 'getsize() failed for mem/pmem in mom_set_limits' and exited.
(2) If I did not set mem=12gb, i.e., used '-l nodes=1:ppn=4' only, the Java app (which explicitly requires 12Gb) would run without any problem, but Torque did not prevent another new job (also requiring 12Gb) from starting on that node, which caused the application to fail.
I should be able to use '-l nodes=1:ppn=8' (not tried, though) to stop a new job from starting on that node, because no cores are free until the current job terminates. This does not look like an optimal solution to me, as the Java app only needs 4 cores, and the remaining 4 cores + 3~4Gb of memory on that node could be used by other small apps running through Torque. Also, performance profiling showed that the application runs fastest on 4 cores; assigning 8 cores to it would not make it run any faster.
Setting an unlimited stack size also did not help solve my problem.
Any suggestions? Thank you again for the help! I am really getting frustrated :(
________________________________
From: "Coyle, James J [ITACD]" <jjc at iastate.edu>
To: Fan Dong <fan.dong at ymail.com>; "torqueusers at supercluster.org" <torqueusers at supercluster.org>
Sent: Tue, April 13, 2010 11:29:53 AM
Subject: RE: [torqueusers] Torque memory allocation
Fan,
You probably are having problems with default settings for pmem
and vmem, which you are not setting.
The defaults are probably 4GB.
I’ll assume that you have nodes with 16 processors and
with 16Gb of memory, (1GB/processor on average)
and that the Java app is a single process, so you are only
reserving 1 processor with nodes=1:ppn=1
so that your reservation looks something like:
#PBS -lmem=12Gb,nodes=1:ppn=1,walltime=1:00:00
If so, I’d suggest instead using pmem and vmem also, and reserving
enough processors on that node so that that number of processors with
the average memory will satisfy your memory needs. In this case,
12GB at 1GB per processor means reserve 12 processors.
#PBS -lvmem=12GB,pmem=12Gb,mem=12Gb,nodes=1:ppn=12,walltime=1:00:00
Then 12/16ths of the memory is being used, so reserve 12/16ths
of the CPUs on that node.
So two of these jobs cannot fit onto one node, and if the
process is being killed for virtual memory (vmem)
or for process size (pmem), this should take care of that.
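
The proportional rule described above (reserve the same fraction of a node's processors as the fraction of its memory the job needs) can be sketched as a small shell helper. This is only an illustration: ppn_for_mem is a hypothetical function name, not a Torque command, and the node sizes passed to it are the ones mentioned in this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of Torque): number of processors to
# reserve so that the job's share of cores matches its share of memory.
# usage: ppn_for_mem JOB_MEM_GB NODE_MEM_GB NODE_CORES
ppn_for_mem() {
    job_mem=$1
    node_mem=$2
    node_cores=$3
    # Integer ceiling of job_mem * node_cores / node_mem
    ppn=$(( (job_mem * node_cores + node_mem - 1) / node_mem ))
    # Never ask for more cores than the node has
    if [ "$ppn" -gt "$node_cores" ]; then
        ppn=$node_cores
    fi
    echo "$ppn"
}

ppn_for_mem 12 16 16   # 12GB job on a 16GB, 16-processor node: prints 12
ppn_for_mem 12 12 8    # 12GB job on a 12GB, 8-processor node: prints 8
```

The result would then go into the ppn= part of the -l resource list. Note that this only keeps a second large job off the node as a side effect of occupying processors; it does not make Torque itself track memory.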
Also, if you are using only a single node and using tcsh or csh, I’d
place the command
unlimit stacksize
in the script before the memory-intensive command (look up the
equivalent command if you are in a Bourne shell like bash).
If you use multiple nodes, put this command in your ~/.cshrc
file.
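
For Bourne-family shells such as bash, the counterpart of the csh command "unlimit stacksize" is the ulimit builtin. A minimal sketch, assuming the job script runs under sh or bash and that the site allows the soft limit to be raised:

```shell
# Place this before the memory-intensive command in the job script.
# Raise the soft stack limit as far as the hard limit allows.
# "ulimit -S -s unlimited" fails when the hard limit is finite, so fall
# back to whatever "ulimit -H -s" reports as the hard limit.
ulimit -S -s unlimited 2>/dev/null || ulimit -S -s "$(ulimit -H -s)"

# Show the resulting soft limit (a size in KB, or "unlimited").
ulimit -S -s
```

For multi-node jobs, the analogous place for this line would be a shell startup file such as ~/.bashrc, on the assumption that the remote shells read it.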
James Coyle, PhD
High Performance Computing Group
115 Durham Center
Iowa State Univ.
Ames, Iowa 50011
web: http://www.public.iastate.edu/~jjc
From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Fan Dong
Sent: Monday, April 12, 2010 9:21 PM
To: torqueusers at supercluster.org
Subject: [torqueusers] Torque memory allocation
Hi there,
I am running into the following problem:
1) We have some memory-intensive Java jobs to run through
Torque. Each job requires 12Gb of memory, and each node in the cluster
has 16Gb of memory.
2) When a job is running on one of the nodes, Torque does not
prevent a new job (also requiring 12Gb of memory) from starting on the same
node, causing that new job to fail because there is not enough memory.
(We already let Torque scatter the jobs across the nodes, but this still
happens when there are more jobs than nodes.)
3) I tried using -l mem=12gb, but it did not work. Torque
seems to have a 4Gb limit for this setting.
I was wondering if there is any solution for that. We
are not using Moab or Maui.
Any input is highly appreciated.