Job management commands

Computational jobs on Guillimin run in batch mode (deferred execution, as opposed to interactive use). Jobs are submitted to queues and then run depending on a number of factors controlled by the scheduler. The scheduler (MOAB) ensures that cluster resources are used in the most effective way and, at the same time, are shared between users in a fair manner. You manage your jobs on the cluster (submit, check status, kill, etc.) ONLY through a special set of scheduler commands.

We summarize the basic MOAB and Torque commands below (for further details, see man qsub or the man page of any command listed).

qsub script_file - submits a new job, where script_file is a text file containing the name of your executable program and execution options (number of cores requested, working directory, job name, etc.). An example of script_file is given below. Can be used with the following arguments:

qsub -q qname - submits your job to a particular queue qname

qsub -I (capital i) - submits an interactive job; you have to wait until you get a shell running on a worker node

qstat - shows a current list of submitted jobs.

showq - shows a current list of submitted jobs; note: it can take a few minutes before your job shows up in showq. Can be used with the following arguments:

showq -v - shows a full detailed list of submitted jobs

showq -u username - shows the list of submitted jobs for your username

showq -r - the same as the previous, but with the list of assigned nodes shown for each job

checkjob -v job_ID - shows why the job is waiting for execution

canceljob job_ID or qdel job_ID - kills the job or removes it from the queue. The job_ID can be obtained with the qstat and showq commands.

showstart job_ID - shows when the scheduler estimates the job will start (not very reliable)
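A typical interaction with the scheduler then looks like the following sketch, where 12345 stands in for the job ID that qsub prints and script_file is your own submission script:

```shell
qsub script_file        # submit the job; prints a job ID such as 12345
showq -u $USER          # check whether your jobs are queued or running
checkjob -v 12345       # if a job is still queued, see why it is waiting
canceljob 12345         # remove the job if it is no longer needed
```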

Job queues on Guillimin

The computing nodes of the Guillimin cluster are grouped into 4 partitions. Note: the Westmere nodes have been decommissioned, so ppn=12 no longer has a special meaning and should no longer be used, except in serial jobs with nodes=1.

Partition | Memory per core on Sandy Bridge nodes (ppn=16)
Serial Workload (SW2) - for jobs not requiring a large memory footprint | about 3.8 GB

The queues on Guillimin are organized accordingly. Depending on the type and requirements of your code you submit your job to one of the following queues:

metaq (default)

debug (made of three nodes in the SW2 partition)

In general, the default queue (no explicit queue name specified) is recommended to minimize waiting times. The scheduler steers the job to a fitting node based on 3 parameters: the walltime, the number of processors per node (ppn), and the minimum amount of memory needed per core (pmem).

The debug queue is a special one, created to let you test your code before you submit it for a long run. Jobs submitted to the "debug" queue should normally start almost immediately, so you can quickly see whether your program behaves as expected. There are strict resource and time limits for this queue, though: the default running time is only 30 minutes and the maximum is 2 hours. If the parameters in your submission file exceed these limits, your job will be rejected!
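To stay within these limits, the queue and walltime can be requested explicitly at submission time; a minimal sketch, with script_file standing in for your own submission script:

```shell
# Request the debug queue with a 30-minute walltime
# (the default; up to 2 hours is allowed)
qsub -q debug -l walltime=00:30:00 script_file
```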

IMPORTANT:

The computing nodes assigned to the partitions mentioned above have different amounts of memory! The memory per CPU core for the SW2 and LM2 partitions is about 3.8 GB and 7.8 GB respectively. Be aware that if your job exceeds these limits, it will be killed automatically. Also, if you decide to explicitly specify the memory requirements for the job in your submission file (normally not required) and these requirements exceed the limits above, your job will be automatically blocked.

If you are using thread-type parallelization (like OpenMP), the default queue will usually steer your job to an SW node.

It is ALWAYS a good idea to submit a short test job to the "debug" queue before a long production run. This way you will immediately know whether your program works as expected. However, please do NOT use the "debug" queue for real production runs! Remember that your job will be killed automatically when it reaches the queue's walltime limit (30 minutes by default)!

The DEFAULT walltime for each of the other queues is 3 hours, with a MAXIMUM allowed walltime of 14 days for contributors, except for XLM2 nodes for which it is 7 days. All other users are now limited to 12 hours or less per job.

For jobs using GPUs or Xeon Phis, the pmem parameter denotes memory per node, NOT per core.

Tables showing how queues map to nodes

For beginning users, we advise reading the sections below to get started; for more advanced users it is useful to know which queues map to which compute nodes. The first table shows the mapping at the highest level. In general, jobs are categorized as "serial" jobs of fewer than 16 cores on a single node (nodes=1:ppn<16), parallel jobs with 16 cores per node (ppn=16) on Sandy Bridge nodes, and parallel jobs with an unspecified number of cores per node (procs=n).

Queue | nodes=1:ppn<16 | ppn=16 | procs=n, n≥16
default (metaq) | SW2, LM2, XLM2, AW | SW2, LM2, XLM2, AW | SW2, LM2, XLM2

The behaviour of the default queue depends first on whether the job is a "serial" (<16 cores) job or not. A serial job will most likely run on one of around 45 Sandy Bridge SW2 nodes or, for shorter-duration jobs, one of around 80 Sandy Bridge AW nodes. Depending on the walltime and pmem value, the job is routed to the internal serial-short, sw-serial, or lm2-serial queue. Note: you should not submit directly to these internal queues, but you will see them appear in checkjob and other commands.

PBS -l value where n<16 | pmem value | walltime ≤ 36h (internal queue name) | walltime > 36h (internal queue name)
nodes=1:ppn=n | 3700m | SW2, AW (serial-short) | SW2 (sw-serial)
nodes=1:ppn=n | 7700m | LM2 (lm2-serial) | LM2 (lm2-serial)

For parallel jobs using 16 cores or more, the behaviour depends on the pmem value (minimum number of required megabytes per core), the ppn value, and the walltime. Shorter jobs with low memory requirements can run almost anywhere on the cluster. Longer-duration jobs will only run on the smallest node type that fits the given memory requirements, so that they do not occupy over-specified nodes. Note that jobs with high memory requirements (>=7700m) always have a higher priority than jobs with lower memory requirements, so that even short-duration low-memory jobs are admitted last on LM2 nodes.

pmem value (internal queue) | walltime ≤ 72h, ppn=16 | walltime ≤ 72h, procs | walltime > 72h, ppn=16 | walltime > 72h, procs
3700m(‡) (sw2plus, sw2-parallel) | SW2, LM2 | SW2, LM2 | SW2 | SW2
7700m (lm2) | LM2 | LM2 | LM2 | LM2
>7800m (xlm2, see below) | XLM2 | XLM2 | XLM2 | XLM2

(‡) default

Job Accounting

For any user with an NRAC/LRAC allocation, a special account with the Resource Allocation Project identifier (RAPid) has been created within the Compute Canada Database. These accounts have correspondingly been set up within the scheduler environment of Guillimin and must be specified as part of the job submission for users to access their NRAC/LRAC allocation. If no RAPid is specified, the job will be rejected unless you are a member of only one RAPid. Specifying the RAPid for your project is important so that the job is scheduled with the priority assigned to the project.

The RAPid can be specified as either part of the job submission script, or on the command line as an option to the qsub command. For example, if your RAPid is xyz-123-ab you could include the following line at the beginning of your job script:

#PBS -A xyz-123-ab

You could also specify the RAPid on the command line as an argument to the qsub command:

qsub -A xyz-123-ab script_file

Examples

Submitting a single-processor job

You would create a job submission file, e.g. script_file, with the following content:

#PBS -l nodes=1:ppn=1 - you are requesting one CPU core on one computational node

#PBS -l walltime=12:00:00 - sets the execution time limit for your job (12 hours in this case; the system sets only 3 hours if this parameter is omitted)

#PBS -A xyz-123-ab - see the "Job Accounting" section above

#PBS -o outputfile - the stdout of your code will be redirected to the file outputfile

#PBS -e errorfile - the stderr of your code will be redirected to the file errorfile

#PBS -N jobname - your job will have the name jobname

module load ... - loads any modules required by the job

cd $PBS_O_WORKDIR - changes directory to the same place you were in when you submitted your job

./your_app - starts execution of your_app, which is an executable file

You submit your job with the following command:

qsub script_file
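Putting the directives above together, a complete script_file for a serial job might look like the following sketch (your_app, jobname, and xyz-123-ab are placeholders for your own executable, job name, and RAPid):

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=12:00:00
#PBS -A xyz-123-ab
#PBS -o outputfile
#PBS -e errorfile
#PBS -N jobname

# Load any modules required by the job, e.g.:
# module load <compiler_module>

cd $PBS_O_WORKDIR
./your_app
```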

Submitting an OpenMP job

The submission of programs compiled with OpenMP options is similar to that of serial jobs. The difference is that you now reserve more than 1 core on the node. The following sample script submits OpenMP code to be executed on 4 cores.

The number of nodes for an OpenMP job is always 1 (you cannot go beyond 1 node with OpenMP), and the number of cores (ppn) should never exceed the number of physical CPU cores of that node (16 in our case). So, the largest job you can run is "nodes=1:ppn=16".

The OMP_NUM_THREADS variable should correspond to the number of requested CPU cores. It should NEVER exceed the number of cores you reserved!

OpenMP jobs should normally be submitted to the default queue. In special cases when your code needs a large memory footprint, you can submit OpenMP jobs with a higher pmem value, e.g. "nodes=1:ppn=16,pmem=7700m" (the full 6 or 8 GB cannot be used because the operating system also takes space).

For programs compiled with MPI support, it is also necessary to set the IPATH_NO_CPUAFFINITY variable, for instance using export IPATH_NO_CPUAFFINITY=1.
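The sample script mentioned above, for an OpenMP run on 4 cores, might look like the following sketch (your_omp_app and xyz-123-ab are placeholders for your own executable and RAPid):

```shell
#!/bin/bash
#PBS -l nodes=1:ppn=4
#PBS -l walltime=03:00:00
#PBS -A xyz-123-ab
#PBS -N omp_job

cd $PBS_O_WORKDIR
# Must match the number of cores requested with ppn, and never exceed it
export OMP_NUM_THREADS=4
# Only needed if the code was also compiled with MPI support
export IPATH_NO_CPUAFFINITY=1
./your_omp_app
```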

Submitting MPI jobs

Parallel programs compiled with MPI libraries are conceptually different from serial or OpenMP codes. They normally run across multiple computing nodes, with data and instruction exchange performed via the cluster network. Therefore more than 1 physical computing node is usually reserved for the job, and the executable itself is started with the help of a special launcher, which manages the data exchange between the processes.

Here is an example of a submission script for an MPI job that sends your program to be executed on 48 (3*16) CPU cores:
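A sketch of such a script, with your_mpi_app and xyz-123-ab as placeholders for your own executable and RAPid (the individual lines are explained below):

```shell
#!/bin/bash
#PBS -l procs=48
#PBS -l pmem=1700m
#PBS -l walltime=12:00:00
#PBS -A xyz-123-ab
#PBS -N mpi_job

# Load the modules for the compiler and MPI package used to build the code, e.g.:
# module load <compiler_module> <mpi_module>

cd $PBS_O_WORKDIR
mpiexec -n 48 ./your_mpi_app
```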

The line "#PBS -l procs=48" asks the scheduler to reserve 48 CPU cores for your job on 3 nodes in the cluster, depending on availability. If this number is not a multiple of 16 it will be rounded up.

If instead, you use "#PBS -l nodes=3:ppn=16" you ask the scheduler to reserve 3*16=48 CPU cores for your job on 3 Sandy Bridge nodes in the cluster.

The line "#PBS -l pmem=1700m" corresponds to the default memory reserved per core; if you need more memory this value should be higher. Recommended pmem values are 3700m (default), and 7700m (only run on LM2); see below for the extra large memory XLM2 nodes.

mpiexec -n 48 ./your_mpi_app - starts the program your_mpi_app, compiled with MPI, in parallel on 48 cores. The program "mpiexec" is the launcher mentioned above, which "organizes" all communication between the MPI processes. The parameter -n should NEVER be larger than "nodes"*"ppn" or "procs" in the "#PBS -l" line.

We no longer recommend using "ppn" values less than 16 for multi-node jobs.

IMPORTANT:

Although many third-party user application packages provide MPI sources as part of the package, we strongly advise building your application with our MPI packages, which are already installed on the system and accessible through the "module" environment. These packages are specially built with InfiniBand libraries and ensure that the MPI traffic of your parallel application goes through the InfiniBand network.

Inside your MPI job, please make sure that the software modules corresponding to your compiler and MPI package are loaded. This way you can be sure that these modules are loaded on every computing node where your job is sent.

We no longer recommend adding corresponding "module load ..." lines to your .bashrc file, or using "#PBS -V". A job is most reliably re-run if it is self-contained and has all information needed, instead of relying on the environment.

Submitting hybrid (MPI and OpenMP) jobs

For parallel programs compiled with MPI libraries that use OpenMP for intra-node communication, the above example needs to be modified. For 48 CPU cores on 3 nodes with 16 cores each, it reads as follows:
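A sketch of the modified script, assuming one MPI process per node and 16 OpenMP threads each (your_hybrid_app and xyz-123-ab are placeholders; the flag for placing one process per node varies between MPI implementations, so check the documentation of your mpiexec):

```shell
#!/bin/bash
#PBS -l nodes=3:ppn=16
#PBS -l walltime=12:00:00
#PBS -A xyz-123-ab
#PBS -N hybrid_job
#PBS -j oe

cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=16       # one OpenMP thread per core on each node
export IPATH_NO_CPUAFFINITY=1   # let the threads spread over the cores
# 3 MPI processes, one per node; -npernode is the Open MPI spelling
mpiexec -n 3 -npernode 1 ./your_hybrid_app
```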

#PBS -j oe - merges the output and error streams into a single output file. Since -o and -e are not given, the output file will be "jobname.oJOBID"

os.getenv('PBS_O_WORKDIR') - as in Bash, the environment variable $PBS_O_WORKDIR is the directory from which you submitted the job. The Python function getenv() lets you provide a default value in case the variable is not set.

os.environ['var_name'] = 'value' - to set an environment variable

subprocess.call("command arg1 arg2 ... > results.txt", shell=True) - to run a shell command. While the output of the Python script will be in jobname.oJOBID, the output of command will be in results.txt.

How to submit jobs from the worker nodes (jobs within jobs)

From within your job’s submission script, the job can spawn a child job using the following qsub command: