Northwestern KB

Managing Jobs on Quest

Please note: Job resource requirements in the submission script, such as the number of cores or nodes or the amount of memory, are recorded by the scheduler when the job is submitted. Any changes made to the job script (jobscript.sh in the example on Submitting a Job on Quest) after the job has been submitted with msub will have no effect on the job. After submitting a job, you can only Hold, Release, Kill, and Modify job parameters using the Moab commands in the list below.

In most cases, the job ID will be only numbers. In some cases, the job ID may start with "Moab.", for example:

Moab.13870586

"Moab." is part of the job ID. There are two common cases when this might occur:

Jobs submitted as part of a job array will be assigned a job ID prefixed by "Moab." This is normal.

Jobs that are rejected by the Torque resource manager may be assigned a job ID prefixed by "Moab." This can happen, for example, if you request more cores per node than are available on Quest. If your job ID begins with "Moab." and you aren't using a job array, then you should use the checkjob command to investigate problems with your job. Also see Troubleshooting Jobs on Quest for additional details.
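As a rough illustration of the two job ID forms described above, the following sketch classifies an ID by its prefix. The function name classify_job_id is hypothetical and not part of Moab; it only inspects the string and does not contact the scheduler.

```shell
# Classify a job ID by its form.
# Prints "moab-prefixed" for IDs such as Moab.13870586,
# "numeric" for all-digit IDs, and "unknown" for anything else.
classify_job_id() {
    case "$1" in
        Moab.*)       echo "moab-prefixed" ;;
        ''|*[!0-9]*)  echo "unknown" ;;
        *)            echo "numeric" ;;
    esac
}

classify_job_id "Moab.13870586"   # prints: moab-prefixed
classify_job_id "19936802"        # prints: numeric
```

A "moab-prefixed" result for a job that is not part of a job array is the cue to run checkjob on that ID.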

Job Status

After submitting a job, you can execute the showq command or the checkjob command to check the status of your job. On Quest, submitted jobs are analyzed and queued by the scheduler. When a job is sent to the scheduler, it is first checked by a resource manager. The resource manager ensures that you have enough resources, particularly compute hours, on the system in order to run your job.

If enough resources exist in your allocation, the job is forwarded to the scheduler to be placed in the queue. Note that if there is a typo in your job submission script, it may be flagged by the resource manager, and your job will be rejected and placed on BatchHold.

When the scheduler receives a job, it will prioritize your job relative to other jobs currently in the queue. The accounting system assumes that your job will run with the amount of time and number of cores that you specified in your job submission script. If your job requires less time than you specified, the accounting system only charges you for the time used on the system.
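The difference between what the accounting system assumes at submission and what it charges afterwards can be sketched with simple arithmetic. This example assumes that compute hours are counted as cores multiplied by wall-clock hours, which this page does not state explicitly; the job sizes and the helper name core_minutes are made up for illustration.

```shell
# ASSUMPTION: compute hours = cores x wall-clock hours.
# core_minutes NODES PPN MINUTES -> core-minutes (integer arithmetic only)
core_minutes() {
    echo $(( $1 * $2 * $3 ))
}

requested=$(core_minutes 1 4 300)   # 1 node x 4 cores, 5 hours requested
used=$(core_minutes 1 4 120)        # the same job finishing after 2 hours

echo "assumed at submission: $(( requested / 60 )) core-hours"   # 20
echo "charged after completion: $(( used / 60 )) core-hours"     # 8
```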

If you do not have enough compute hours left in your allocation to run your job, it will be placed on BatchHold or SystemHold. Jobs in a BatchHold or SystemHold state will remain in that state until you cancel the job or a system administrator intervenes, either by adding enough compute hours for your job to run or by redirecting your job to another account that you can access. If your job is under a BatchHold or SystemHold and you need assistance from a system administrator, please contact quest-help@northwestern.edu.

Generally, the more resources a job requires, the longer it may sit in the queue until the necessary resources become free and can be scheduled. Full access nodes are dedicated resources, so the access criteria, queues, job duration, and job size limits for these nodes are different. See Full Access Job Commands for specialized information.

Commonly Used Commands

The showq Command

The showq command (without any options) displays the job queues for all users on Quest. To quickly access information about your specific job(s), there are options to filter the results (you can combine multiple options):

Command                         Description
showq -u <netID>                Show only jobs belonging to the specified user
showq -r                        Show running jobs
showq -i                        Show idle jobs
showq -b                        Show blocked jobs
showq -w acct=<allocationID>    Show only jobs belonging to the specified account
showq --help                    See documentation and additional options
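Because the options listed above can be combined, a small helper can assemble the command line for one user and one job state. This is only a sketch: showq_cmd is a hypothetical helper, abc123 is a placeholder NetID, and the function prints the command instead of running it so that it can be tried on a machine without Moab installed.

```shell
# Print a showq command line for one user and one job state.
# STATE is one of: running, idle, blocked. Only the flags listed above are used.
showq_cmd() {
    user=$1
    case "$2" in
        running) flag="-r" ;;
        idle)    flag="-i" ;;
        blocked) flag="-b" ;;
        *) echo "unknown state: $2" >&2; return 1 ;;
    esac
    echo "showq -u $user $flag"
}

showq_cmd abc123 blocked   # prints: showq -u abc123 -b
```

On a Quest login node you would type the printed command directly.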

The output of the showq command groups jobs into three categories: active, eligible, and blocked. Active jobs are running. Eligible jobs are currently idle and will be considered by the scheduler when additional computing resources become available. There is a limit of 30 idle jobs per user. If you have submitted more than 30 jobs and they were not scheduled immediately, some of the jobs will appear in the Blocked list; these jobs will be moved to the Eligible list as space becomes available. Jobs may also appear on the Blocked list if they were submitted to an expired allocation, if there are insufficient compute hours on an allocation to complete a job, or if the job has other errors or resource limit issues. Use the checkjob command to get more information.

The checkjob Command

The checkjob command displays detailed information about a submitted job’s status and diagnostic information that can be useful for troubleshooting submission issues. It can also be used to obtain useful information about completed jobs such as the allocated nodes, resources used, and exit codes. The -v flag is useful for gathering additional diagnostic information about your job.

Example usage:

checkjob -v <jobID>

where you can get your <jobID> using the showq commands above.

Example for a Successfully Running Job

Note in the output below that:

The State is listed as Running (State: Running)

The amount of walltime used is listed with the amount of walltime requested at the bottom (Reservation '19936802' (-00:00:25 -> 00:04:35 Duration: 00:05:00))
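To make the Reservation line easier to read, the sketch below splits it into its three time fields. It assumes the line has exactly the layout quoted above and interprets the first time as walltime used so far, the second as walltime remaining, and Duration as the walltime requested; real checkjob output may differ in spacing, and reservation_times is a hypothetical helper name.

```shell
# Read checkjob output on stdin and print the times from the Reservation line.
# ASSUMPTION: the line looks like the example quoted in this section.
reservation_times() {
    sed -n 's/.*(-\([0-9:]*\) -> \([0-9:]*\)  *Duration: \([0-9:]*\)).*/elapsed=\1 remaining=\2 requested=\3/p'
}

echo "Reservation '19936802' (-00:00:25 -> 00:04:35 Duration: 00:05:00)" | reservation_times
# prints: elapsed=00:00:25 remaining=00:04:35 requested=00:05:00
```

On Quest the input would come from checkjob itself, for example by piping checkjob -v <jobID> into reservation_times.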

Example for a Job with an Error

This example is for a job that requested more cores per node than are available on Quest. See Quest Technical Specifications for details on the nodes. In the submission script, 30 cores per node were requested with the line:

#MSUB -l nodes=1:ppn=30

The first indication of a problem with the job was that the job ID begins with "Moab."

Note in the output below that:

The State is listed as Idle (State: Idle)

The NOTE: entries at the bottom tell you that the number of requested tasks/procs (cores) is greater than the maximum number of cores per node (PPN) for each of Quest's partitions.

This job will remain in the idle state indefinitely and will never run. You must cancel this job.
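A mistake like this can be caught before submission by reading the ppn value out of the script and comparing it with the per-node core count. The sketch below is one way to do that; requested_ppn and check_ppn are hypothetical helpers, and the limit of 20 is a placeholder chosen only for the example, not Quest's actual limit (see Quest Technical Specifications for the real values).

```shell
# Print the ppn value from the first "#MSUB ... ppn=N" line read on stdin.
requested_ppn() {
    sed -n 's/^#MSUB .*ppn=\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# check_ppn REQUESTED MAX: report whether REQUESTED fits within MAX cores per node.
# Returns non-zero when the request is too large.
check_ppn() {
    if [ "$1" -gt "$2" ]; then
        echo "ppn=$1 exceeds the limit of $2 cores per node"
        return 1
    fi
    echo "ppn=$1 is within the limit of $2 cores per node"
}

ppn=$(echo '#MSUB -l nodes=1:ppn=30' | requested_ppn)
max_ppn=20   # placeholder limit for illustration only
if ! check_ppn "$ppn" "$max_ppn"; then
    echo "fix the submission script before running msub"
fi
```

For a real script, feed the file to the first helper instead, as in requested_ppn < jobscript.sh.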

Cancelling Jobs

You can cancel one or all of your jobs with mjobctl. Proceed with caution, as this cannot be undone, and you will not be prompted for confirmation after issuing the command.

Cancel a single job using the job number:

mjobctl -c <jobID>

Cancel all of your jobs:

mjobctl -c -w user=<your_netID>
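Since mjobctl does not ask for confirmation, you may want to wrap the single-job form in a prompt of your own. The following is a minimal sketch, not a Quest-provided tool: cancel_job and the DRY_RUN variable are hypothetical, and with DRY_RUN left at its default of 1 the mjobctl command is only printed rather than executed.

```shell
# Ask for confirmation before cancelling one job.
# With DRY_RUN=1 (the default here) the command is printed, not run.
cancel_job() {
    job_id=$1
    printf 'Cancel job %s? [y/N] ' "$job_id" >&2
    read -r answer
    case "$answer" in
        y|Y) ;;
        *) echo "not cancelled"; return 0 ;;
    esac
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "mjobctl -c $job_id"
    else
        mjobctl -c "$job_id"
    fi
}

echo y | cancel_job 19936802   # prints: mjobctl -c 19936802
```

Setting DRY_RUN=0 on a Quest login node makes the function run the cancellation after you answer y.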

Additional mjobctl Commands

The Moab job control command (mjobctl) is used for holding, releasing, and canceling jobs, or changing the parameters of a submitted job. You can place your job in a “user hold” state after the job has been submitted with

mjobctl -h <jobID>

Jobs placed in a “user hold” state will appear in the output of the showq and checkjob commands. You can then release your job with

mjobctl -r <jobID>

Moab permits modification of some job parameters after job submission and before the job starts running. These parameters include: