How to get started with distributed computing using qsub?

The FieldTrip qsub toolbox is a small stand-alone toolbox to facilitate distributed computing. The idea of the qsub toolbox is to provide you with an easy MATLAB interface to distribute your jobs and not have to go to the Linux command-line to use the qsub command from there. Besides the Torque cluster (which we have at the Donders in Nijmegen) it also supports Linux clusters with other PBS versions, Sun Grid Engine (SGE), Oracle Grid Engine, SLURM and LSF as batch queueing systems.

Submitting a single MATLAB job to the cluster

To submit a job to the cluster, you use qsubfeval, which stands for "qsub feval", i.e., qsub function evaluation. As input you specify the name of your function, its input arguments, and the time and memory requirements of the job (see below).

Try the following:

qsubfeval('rand', 100, 'timreq', 60, 'memreq', 1024)

Besides the memory required for your computation, MATLAB also requires memory for itself. The qsubfeval and qsubcellfun functions have the memoverhead option for this, which defaults to 1GB. The memreq option itself has no default value. The Torque job is started with a memory reservation of memreq+memoverhead.
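As a sketch (the function and the numbers are illustrative, not a recommendation), a job whose computation needs about 2GB could be submitted like this; Torque then reserves memreq plus memoverhead, i.e., 3GB in total:

```matlab
% hypothetical example: 2GB for the computation itself, plus the
% default 1GB memoverhead, gives a 3GB reservation for the Torque job
jobid = qsubfeval('rand', 100, 'timreq', 60, ...
                  'memreq', 2*1024^3, 'memoverhead', 1*1024^3);
```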

The qsubfeval command creates a number of temporary files in your working directory. STDIN.oXXX contains the standard output, i.e., the output that MATLAB normally prints in the command window. STDIN.eXXX contains the error messages; for the job to complete successfully, this file should be empty.

Submitting a batch of jobs

To execute several jobs in parallel as a batch, you use qsubcellfun. It is very similar to qsubfeval, but instead of a single set of input arguments, you specify a cell array of arguments. qsubcellfun then evaluates your function on each element of the array; in fact, it calls qsubfeval as many times as there are elements in the array.

The qsubcellfun function is similar to the standard MATLAB cellfun function.
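For example, a minimal sketch (the values are illustrative) that adds two cell arrays element-wise on the cluster, analogous to cellfun(@plus, a, b):

```matlab
% submit three small jobs, one per pair of elements
a = {1, 2, 3};
b = {10, 20, 30};
y = qsubcellfun(@plus, a, b, 'timreq', 60, 'memreq', 1024^3);
% y is a cell array with the results, here {11, 21, 31}
```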

qsubcellfun works as a wrapper around qsubfeval. If you use qsubcellfun, all the temporary files created by qsubfeval are automatically deleted when the jobs complete, when the batch is terminated with Ctrl+C, or when it stops with an error.

Time and memory management

You will have noticed that you have to specify the time and memory requirements for the individual jobs using the 'timreq' and 'memreq' arguments to qsubcellfun. These requirements are passed to the batch queueing system, which uses them to find an appropriate execution host (i.e., one that has enough free memory) and to monitor the usage.

Do not set the requirements too tight, because a job that exceeds the requested resources will be killed. However, if you grossly overestimate them, your jobs will be scheduled in a "slow" queue, where only a few jobs can run simultaneously. The queueing and throttling policies on the number and size of jobs are there to prevent a few large jobs from a single user from blocking all computational resources of the cluster. The optimal approach to getting your jobs executed is therefore to estimate the memory and time requirements as accurately as you can.
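One way to arrive at such an estimate, sketched here with hypothetical names (myfunction and data are placeholders for your own code), is to run a single representative job locally and measure its time and output size:

```matlab
% hypothetical: myfunction and data stand in for your own analysis
tic
out = myfunction(data{1});
t = toc;            % elapsed time in seconds for one job
s = whos('out');
m = s.bytes;        % size of the result in bytes
% add a generous safety margin before submitting the batch, e.g.
% timreq = 2*t, and memreq well above m plus your input data size
```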

The help of qsubcellfun lists some suggestions on how to estimate the time and memory.

Stacking of jobs

The execution of each job involves writing the input arguments to a file, submitting the job to Torque, starting MATLAB, reading the file, evaluating the function, writing the output arguments to a file, and at the end collecting the output arguments of all jobs and rearranging them. Starting MATLAB for each job imposes considerable overhead if the jobs are small, which is why qsubcellfun implements "stacking" to combine multiple MATLAB jobs into one job for the Linux cluster. If the jobs that you pass to qsubcellfun are small (less than 180 seconds), they are stacked automatically. You can control this in detail with the 'stack' option in qsubcellfun.
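For example, a sketch (values illustrative, assuming the 'stack' option accepts the number of MATLAB evaluations per cluster job, as described in the qsubcellfun help):

```matlab
% four short jobs; with 'stack', 2 they would be combined into
% two cluster jobs of two MATLAB evaluations each
y = qsubcellfun(@rand, {1, 2, 3, 4}, 'timreq', 10, ...
                'memreq', 1024^2, 'stack', 2);
```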

Note that the stacking implementation is not yet ideal: with the default option it distributes 4 jobs into 3+1, whereas 2+2 would be better.

Submitting a batch without waiting in MATLAB for the jobs to return

If you run your interactive MATLAB session on a Torque execution host with a limited walltime and want to submit a batch of jobs with qsubcellfun, you don't know in advance when the batch will finish. Consequently, you cannot predict the walltime that your interactive session needs in order to see all jobs returning.

Rather than waiting for all jobs to return, you can submit the batch and close the interactive MATLAB session. The next day or week, when all batch jobs have finished (use “qstat” to check on them) you can start MATLAB again to collect the results.
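A sketch of this workflow, assuming the qsubget function from the same toolbox is used to collect the results (the loop, function, and requirements are illustrative):

```matlab
% session 1: submit the jobs and remember their identifiers
jobids = cell(1, 4);
for i = 1:4
  jobids{i} = qsubfeval('rand', i, 'timreq', 60, 'memreq', 1024^3);
end
save jobids.mat jobids
% ... now you can close this MATLAB session

% session 2, days later: collect the results
load jobids.mat
result = cell(size(jobids));
for i = 1:numel(jobids)
  result{i} = qsubget(jobids{i});
end
```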