Submitting Jobs using loadleveler on AIX Clusters

LoadLeveler on all LONI p5 machines employs a backfill scheduler. The objective of backfill scheduling is to maximize the use of resources to achieve the highest system efficiency, while preventing potentially excessive delays in starting jobs with large resource requirements.

The benefit of backfill scheduling is two-fold. On one hand, large jobs can run because the backfill scheduler does not allow jobs with smaller resource requirements to continuously consume resources before the larger jobs can accumulate enough resources to run. On the other hand, smaller jobs can be backfilled onto the resources that larger jobs are accumulating, provided they can finish before the projected start time of the larger jobs.

For this reason, it helps to shorten the wait time of your job to specify wall_clock_limit, the maximum time your job will run, whenever possible. The shorter it is, the better the chance that your job will be backfilled. Below is an example:

Another user's job that requests 8 nodes is waiting in the queue and is projected to start in 10 hours (see Useful LoadLeveler Commands for information on how to check the estimated start time of a job). Now suppose that 5 of those 8 nodes are idle when you submit your own job, which will run for 2 hours on 2 nodes. If you set wall_clock_limit to a value less than 10 hours, LoadLeveler will see the chance to backfill and your job will run immediately. However, if you do not specify wall_clock_limit, the system will assume the default value, which is currently 5 days for most queues, and your job will have to wait in the queue.
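As a minimal sketch (the file names and the 2-hour limit are illustrative), a serial job script that sets wall_clock_limit might look like:

```shell
#!/bin/sh
# @ job_type = serial
# @ class = single
# @ wall_clock_limit = 2:00:00
# @ output = myjob.out
# @ error = myjob.err
# @ queue
./a.out
```

With the 2-hour limit stated explicitly, LoadLeveler can fit this job into the 10-hour backfill window in the example above.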

LoadLeveler Command File Syntax


This document is meant to be a condensed yet complete introduction to LoadLeveler's Job Command files and is specific to the LONI P5 environment. If LoadLeveler keywords or options that you know from other LoadLeveler environments cannot be found on this page, it is because those items are not currently applicable on LONI P5s. Full documentation for Job Command files can be found in the online book, Using and Administering LoadLeveler (in pdf format).

1. Introduction

A Job Command file is simply a text file that may contain the following different types of information:

LoadLeveler keyword statements

Shell command statements

LoadLeveler variables

Comment Statements

The following rules dictate the form of your Job Command file:

Keyword statements begin with # @. There can be any number of blanks between the # and the @.

Comments begin with a # just as they do for shell scripts.

Statement components are separated by blanks.

The backslash \ is the line continuation character. The continued line must not begin with # @.

Keywords are case insensitive.

Your Job Command file may tell LoadLeveler what you want to run through the executable keyword or you may have your Job Command file serve as the executable by not specifying an executable or by explicitly setting the executable to be the Job Command file itself. Therefore, your Job Command file may be a shell script that drives the programs that you want to execute.

The Job Command file may also contain different job steps that together constitute your job. You may name each job step and you must have a queue keyword entry for each job step you want to run. Unless otherwise noted, the keywords you set for the first job step will be inherited by all subsequent job steps. By default, LoadLeveler will view each job step as an independent entity but, by using the dependency keyword, you can conditionally execute different programs depending on the return value of the previous job steps.
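As a sketch (step names and executables are illustrative), a two-step job in which the second step runs only if the first exits successfully could look like:

```shell
# @ step_name = prepare
# @ executable = prepare_data
# @ queue
#
# @ step_name = compute
# The second step runs only if the "prepare" step returned 0.
# @ dependency = (prepare == 0)
# @ executable = compute_results
# @ queue
```

Each queue statement closes one job step; the dependency keyword in the second step references the first step by its step_name.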

2. Job Command File Keywords

account_no: Allows you to specify an allocation to associate with a job. The user's default allocation is used if none is specified.

# @ account_no = string

arguments: Specify a list of arguments to pass to the program you list as the executable. Default is no arguments.

# @ arguments = arg1 arg2 ....

blocking: Blocking specifies that tasks be assigned to machines in multiples of a certain integer. This may be an integer value or unlimited. Unlimited blocking specifies that tasks be assigned to each machine until it runs out of initiators, at which time tasks will be assigned to the machine which is next in the order of priority. By default, nothing is set.

# @ blocking = unlimited

checkpoint: A value of interval, yes or no indicates if a job is able to be checkpointed. Checkpointing a job is a way of saving the state of the job so that if the job does not complete it can be restarted from the saved state rather than starting the job from the beginning. Default value is no.

# @ checkpoint = yes

class: Specify the job class (queue) you want to submit your job to. The command llclass will list the available classes on LONI P5s and show how busy each class is currently. The current choices are: checkpt, workq, preempt, single, or interactive. The default is checkpt.

# @ class = workq

comment: Associate a user defined text comment with the job.

# @ comment = this is my very first job

core_limit: Set the hard and/or soft limits for the size of the core file that may result from your job. The value is a comma-separated pair of integer resource values. The units are described in the Resource Limits section below.

# @ core_limit = 100mb

cpu_limit: Set the hard and/or soft limit for the amount of cpu time a submitted job step can use. The value is a comma separated pair of times in the form hh:mm:ss.ff. Default is unlimited.

# @ cpu_limit = 01:20:00,01:00:00

data_limit: Set the hard and/or soft limit for the size of the data segment to be used by the job step. The values are comma-separated integers in the units shown in the Resource Limits section below. The default is unlimited.

# @ data_limit = 16mw,12mw

dependency: Specifies the dependencies between job steps. A job step that depends on a previous job step may evaluate logical expressions based on the previous job step's return values.

# @ dependency = ...

environment: Specify your initial environment variables when your job starts. Separate environment specifications may be separated with semicolons. Special values are as follows:

COPY_ALL All environment variables from your shell are copied. This is the default.

$var The variable var is copied into your job's environment.

!var The variable var is NOT to be copied.

var=value The variable var is set to value and copied into your job's environment

# @ environment = COPY_ALL; !var2; var1=up

error: Set the name of the file that will hold standard error from your job. If you do not specify a file, standard error goes to std.err.

# @ error = filename

executable: For serial jobs, the executable keyword gives the name of the program to run. For parallel jobs, the executable should be /usr/bin/poe or a shell script that itself invokes poe. If executable is not set, the executable is assumed to be the Job Command file itself. The executable keyword statement sets the $(base_executable) to the executable filename without the directory portion as a side effect.

# @ executable = a.out

file_limit: Specifies the hard and/or soft limit for the size of a file. The value is a pair of comma-separated resource values (see the Resource Limits section below). Default is no set values.

# @ file_limit = 4gb

hold: Specifies whether you want to place a hold on your job when you submit it. For this purpose you should select a user hold as this is the type of hold that you have permission to remove. The hold will remain in place until you release it with the llhold -r command. Holds may be of type user, system or usersys (both a user and a system hold). Default is no hold setting.

# @ hold = usersys

image_size: Tells LoadLeveler the maximum virtual image size, in kilobytes, that your program will occupy during execution. If you do not specify the image size, the size will be that of the executable. If you specify a size that is too small, your job may be dispatched to a machine with insufficient resources and may crash. Conversely, overestimating the size may result in LoadLeveler having difficulty finding a machine with sufficient resources to run your job. Default is the size of the executable file.

# @ image_size = 200

initialdir: The pathname of the directory that will serve as the initial working directory for your job. If not specified, initialdir defaults to the current working directory at the time your job is submitted to LoadLeveler. File names in the Job Command file that are not absolute path names (do not begin with a /) are relative to initialdir.

# @ initialdir = pathname

input: Specify the name of the file to use as standard input when your job runs. If not specified, /dev/null is used.

# @ input = filename

job_cpu_limit: Set the hard and/or soft limit for the CPU time to be used by all processes in a job. If a job starts a process that itself forks other processes the sum total of CPU time consumed will be regulated by this limit. The value is a pair of comma-delimited time values. The valid units and special values for limits are described in this table. The default is unlimited.

# @ job_cpu_limit = 20:00,15:30

job_name: Set the name of the job. The job name may be set only once, subsequent instances are ignored. The job_name will only appear in the long reports from the llq, llstatus, and llsummary commands and in email notifications about the job. There is no default value.

# @ job_name = my_first_job

job_type: Tells LoadLeveler what type of job you want to run. Valid choices are serial and parallel. If you select serial, you may not specify any of the following keywords: node, tasks_per_node, total_tasks, network.LAPI, network.MPI, or other keyword related to parallel processing. The default value is serial.

# @ job_type = parallel

max_processors: Specify the maximum number of nodes for a parallel job, regardless of the number of processors in the node. You may not specify both max_processors and node and we encourage users to use the node keyword instead of max_processors. There is no default value.

# @ max_processors = 64

min_processors: Specify the minimum number of nodes for a parallel job, regardless of the number of processors in the node. You may not specify both min_processors and node and we encourage users to use the node keyword instead of min_processors. There is no default value.

# @ min_processors = 32

network: For parallel jobs only, tell LoadLeveler how you want your tasks to communicate with one another. The value specified is a comma-delimited triplet of network_type, usage, and mode. The full keyword name includes a protocol specifier:

network.MPI: Message Passing Interface

network.LAPI: Low-Level Application Programming Interface

Users who are not developing their own communications protocols will probably want to use the MPI protocol. The network_type may be ethernet or sn_single (sn_all). In most instances you will want sn_single (sn_all) for communications, as it offers the highest bandwidth and no other network traffic uses the switch; the switch interface is reserved exclusively for parallel applications. The peak bidirectional bandwidths for the interfaces available on the Parallel Execution Nodes are:

Ethernet: 1,000

sn_single: 4,000

The usage qualifier describes whether the network adapter can be shared by other tasks. Possible values are shared and not_shared. The default is shared. The mode qualifier specifies the communication mode you wish to use. You may select either IP (the Internet Protocol) or US (User Space). Of these, US is a lower-level communication protocol that should yield higher performance for your applications. The network keyword cannot be set if you have already set an Adapter requirement or preference. Also, the value you set for the network keyword in a job step is not inherited by subsequent job steps within the same Job Command file. An example of the network keyword for a job on the Parallel Execution Nodes in LONI P5s that uses MPI and communicates over the Federation switch, sharing the adapter and using the User Space subsystem, reads as follows (this is the default):

# @ network.MPI = sn_single,shared,US

node: Tells LoadLeveler the minimum and maximum number of nodes you require for a given job step. The value is not inherited by subsequent job steps. The default minimum is 1 and the default maximum is the number of nodes that service the particular job class (queue). The syntax is node = [min][,max]. To use the node keyword together with the total_tasks keyword, the min and max values must be equal or you must specify only one value. To specify that a job should run on at least 6 nodes and on up to 10 nodes, if they are available, the Job Command file would have a node keyword entry that looks like:

# @ node = 6,10

node_usage: Specify whether this job step shares nodes with other job steps. Possible values are shared and not_shared. shared means that the nodes can be shared with tasks of other job steps; not_shared means that no other job steps are scheduled to run on the node. Only the Parallel Execution nodes may have more than one task executing on a node, so this option only concerns jobs submitted to the P1 and PM job classes. The default value is shared.

# @ node_usage = not_shared

notification: Tells LoadLeveler when you want the user specified by the notify_user keyword to be sent email. The following options are supported:

always: Notify when the job begins, ends and if it has an error

error: Notify only if job fails

start: Notify only when the job begins

never: deafening silence from LoadLeveler

complete: Notify when the job ends, default value

# @ notification = always

notify_user: The email address to which you wish LoadLeveler to send email notifications. The default is your pelican_login@ianx, where ianx is LoadLeveler's name for the interactive node (ian1 or ian2) from which you submitted the job. Please be advised that mail sent to your account at casper.lsu.edu will not be deliverable; use your PAWS email account or some other valid email account for notifications.

# @ notify_user = user@host

output: The file that will hold the standard output from your job. If no file is specified, std.out is used.

# @ output = filename

preferences: Specifies the characteristics that you prefer be available on the machine that executes the job steps. LoadLeveler attempts to run the job steps on machines that meet your preferences. If such a machine is not available LoadLeveler will then assign your job to machines which meet only your requirements. There are no default preferences. String values must be quoted. Examples:

# @ preferences = (Memory >= 64) && (Arch == "Power5")

or

# @ preferences = Machine == "l1f1n14"

queue: Instructs LoadLeveler to place one copy of the job step in the queue. This statement is required. It marks the end of a job step.

requirements: Specifies the characteristics that must be available on the machine that executes the job step. String values must be quoted. Commonly used requirement variables are:

Disk: The amount of disk space, in kilobytes, you expect the job to require.

Machine: The name of the machine you must run on.

Memory: The amount of physical memory, in Megabytes, that your job requires.

OpSys: Operating System. All LONI P5 nodes are kept at the same version of the AIX operating system to avoid conflicts between submitting and executing machines.

The following is an example of specifying requirements (you could use the same statement with the preferences keyword instead). To specify that your serial job requires a Power5 processor with at least 64 Megabytes of memory and at least 512 kilobytes of disk space, include the following in your Job Command file:

# @ requirements = (Arch == "Power5") && (Memory >= 64) && (Disk >= 512)

restart: Tells LoadLeveler whether you want your job to be restarted if it cannot be completed on the machine it was originally dispatched to. If you specify restart = no, your job will be canceled if it cannot complete. A checkpointed job is always considered restartable.

# @ restart = yes | no #Default value: yes

rss_limit: Set the hard and/or soft limit for the resident set size. The syntax is rss_limit = hardlimit,softlimit.

# @ rss_limit = 120,100 #Default value: unlimited

shell: The name of the shell to use to run a job step. The default value is your login shell.

# @ shell = /bin/ksh #Default value: /bin/bash

stack_limit: Set the hard and/or soft limit for the size of the stack segment that your job may use. The syntax is stack_limit = hardlimit,softlimit.

# @ stack_limit = 512mb #Default value: unlimited

startdate: Tells LoadLeveler when you want your job to run. The default value is the current date and time. The syntax is:

# @ startdate = date time

where date is expressed as MM/DD/YY and time is expressed as HH:mm[:ss]. If you specify a start date in the future, your job is kept in the deferred state until the start time.

# @ startdate = 04/05/06 22:00:00

step_name: Gives a name to a job step within your Job Command file. You may use any combination of alphanumeric characters and underscores '_' and periods '.' with the following exceptions. You may not name a job step T or F or use a number as the first character in the name. The default name for the first job step is the character '0' and subsequent job steps in the Job Command file are named sequentially.

# @ step_name = my_job.step

tasks_per_node: For a parallel job, the number of tasks you want to run on a node. This keyword is used in conjunction with the node keyword and its value is not inherited by subsequent job steps. The default value is 1 task per node.

# @ tasks_per_node = number #Default value: The default is one task per node.

total_tasks: For a parallel job, specifies the total number of tasks you want to run on all available nodes. The total_tasks keyword is used in conjunction with the node keyword, and the node keyword must specify a single value (minimum and maximum numbers must be identical). This keyword is not inherited by subsequent job steps. You may not specify both total_tasks and tasks_per_node.

# @ total_tasks = number #Default value: No default is set.

user_priority: Set the priority of your job step relative to other job steps, in the same job class, owned by you. This keyword does not impact your jobs priority relative to jobs owned by other users. The user_priority may take on values from 0 to 100 with the higher value having the higher priority for being dispatched to run by LoadLeveler. The default value is 50.

# @ user_priority = number #Default value: The default priority is 50

wall_clock_limit: Set the hard and/or soft limit for the elapsed time for which a job can run. LoadLeveler reckons the wall clock time for a job from the time the job is dispatched to a machine to execute. The default value for the wall_clock_limit is 30 minutes. The syntax is wall_clock_limit = hardlimit,softlimit. The valid units and special values for limits are described in this table.

3. Job Command File Variables

LoadLeveler creates several variables that you may use within your Job Command File in, for example, the name of your output file. The variable names are case insensitive but you must reference them with the following syntax:

$(variable_name)

In addition some keyword statements set variables that you may reference. The following is a listing of the LoadLeveler variables.

$(host): The hostname of the machine from which you submitted the job. Equivalent to the $(hostname) variable.

$(domain): The domain of the host from which you submitted the job.

$(jobid): A sequential number assigned to the job by the submitting machine. Equivalent to the $(cluster) variable.

$(stepid): The sequential number assigned to a job step when more than one queue statement appears in the Job Command file. The $(stepid) and $(process) variables are equivalent.

$(executable): Contains the name of the executable if you set the executable keyword.

$(base_executable): Contains the name of the executable without the directory path if you set the executable keyword.

$(class): Contains the name of the job class that your job has been submitted to if you set the class keyword.

$(comment): Contains the comment text if you set the comment keyword.

$(job_name): Contains the job name text if you set the job_name keyword.

$(step_name): Contains the step name text if you set the step_name keyword.

4. Resource Limits

Resource limits, with the exception of the wall_clock_limit, are unlimited for all users by default. The time limits for the different job classes are listed in this table. If your job exceeds a soft limit that you have imposed on it, a trappable signal is sent to your processes. If your job exceeds a hard limit, a non-trappable signal is sent to terminate your processes. The units for space-related limits may be any of the following; the default unit is bytes.

b: bytes

w: words (4 bytes)

kb: kilobytes (2^10 bytes)

kw: kilowords (2^10 words)

mb: megabytes (2^20 bytes)

mw: megawords (2^20 words)

gb: gigabytes (2^30 bytes)

gw: gigawords (2^30 words)

For the time related limits, cpu_limit, job_cpu_limit, and wall_clock_limit the hard limit and soft limit are expressed as:

[[hours:]minutes:]seconds[.fraction]

So the default format is a number of seconds, but you may specify time limits in hours, minutes, and seconds, or in minutes and seconds, as well. The fractional number of seconds is rounded.

You may also specify any of the three following strings with all the limit keywords. rlim_infinity and unlimited both set the limit to be the largest representable positive integer. copy copies the resource limit in place for your user id (as reported by ulimit -a for Korn shell and limit for C shell) at the time the job is submitted.
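Putting the units, time formats, and special strings together, a few illustrative limit settings (the particular values are examples, not recommendations):

```shell
# @ data_limit = 1gb,768mb     # hard limit 1 gigabyte, soft limit 768 megabytes
# @ cpu_limit = 1:30:00        # one hour and thirty minutes of CPU time
# @ wall_clock_limit = 300     # a plain number is interpreted as seconds
# @ core_limit = copy          # copy the limit in place for your user id at submission
```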

Note: #@ executable and #@ arguments, if included, are passed on to poe.

You should specify wall_clock_limit, the maximum estimated time your job will run, whenever possible, because it may shorten the time that your job has to wait in the queue as well as improve the utilization of the cluster.

In the example above, there should be nothing after #@ queue. However, if one would like to issue a series of shell commands, a script such as the following can be used. The only difference is that the #@ executable and #@ arguments keywords are omitted, so LoadLeveler looks after #@ queue for the commands to run. Also, in this mode one must explicitly invoke /usr/bin/poe in order to provide the environment for a parallel job.
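A sketch of such a script (the executable name and resource settings are illustrative): the shell commands after #@ queue form the body of the job, and poe is invoked explicitly for the parallel run.

```shell
#!/bin/sh
# @ job_type = parallel
# @ class = checkpt
# @ node = 2
# @ tasks_per_node = 8
# @ wall_clock_limit = 2:00:00
# @ network.MPI = sn_single,shared,US
# @ queue
# No executable keyword is set, so the commands below form the job body.
cd /work/myusername/myjob
/usr/bin/poe ./a.out
```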

Note: On LONI machines the serial jobs are restricted to a specific node (usually node 14), which is also open for parallel jobs. Therefore, when parallel jobs are running on that node, all serial jobs will have to wait in the queue even if there are other nodes available.

5. Submission Script for Pandora

Pandora is the new IBM POWER7 cluster. Please note a few differences in the LoadLeveler submit script compared to the POWER5 clusters that we supported in the past:

Arch in the requirements directive is "POWER7" (all caps), not "Power5".

LoadLeveler now requires you to specify consumable resources via the resources directive. You must specify both how much memory each task uses (ConsumableMemory) and how many CPUs each task uses (ConsumableCpus). In general, you will want ConsumableCpus(1), and will instead increase the number of tasks based on your code's scalability: 8 tasks for an 8-way job, 32 tasks for a 32-way job, and so on.

As an example, if you request 1 node with 32 tasks and ConsumableCpus(32), you are requesting 1024 total processors and 32 times the amount of RAM; Pandora cannot provide this.

The network directive can either be network.MPI_LAPI or network.MPI, except when you are running GAMESS. For GAMESS, it must be network.MPI_LAPI.
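A sketch of a Pandora submit script under these rules (the task count, memory value, and executable name are illustrative, not prescribed values):

```shell
#!/bin/sh
# @ job_type = parallel
# @ node = 1
# @ tasks_per_node = 8
# @ requirements = (Arch == "POWER7")
# Each task uses 1 CPU; scale the job by the number of tasks instead.
# @ resources = ConsumableCpus(1) ConsumableMemory(2gb)
# @ network.MPI = sn_single,shared,US
# @ wall_clock_limit = 1:00:00
# @ queue
/usr/bin/poe ./a.out
```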

LoadLeveler Job Chains and Dependencies


LoadLeveler allows one to specify multiple, independent job steps per LL queue script. They are run concurrently as long as the resources are available. Likewise, LoadLeveler provides for the specification of dependencies among job steps, so that job chains may be set up depending on the return status of a previously run step.

1. Independent Jobs

Independent job steps are specified using the step_name directive. In this example, the environment directive applies to all stanzas.
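A sketch of what such a script might look like (step names and executables are illustrative); the environment directive set in the first stanza applies to every subsequent stanza:

```shell
# @ environment = COPY_ALL
# @ step_name = step_a
# @ executable = prog_a
# @ queue
#
# @ step_name = step_b
# @ executable = prog_b
# @ queue
```

Because no dependency keyword links the steps, both are eligible to run concurrently if resources allow.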

Interactive Parallel Sessions


An interactive parallel session on one of our HPCs allows one to operate interactively (via a shell, etc.) while taking advantage of multiple processors/nodes. This is useful for development, debugging, and testing. DO NOT RUN YOUR PRODUCTION CODE INTERACTIVELY, because it will have a negative impact on other users. The following is meant to be a quick reference on how to obtain such a session on various LSU/LONI resources:

In general, these instructions should work for any AIX box running LoadLeveler with interactive queues enabled.

1. Terminal Only

From the head node:

% poe ./a.out -rmpool 1 -nodes 1 -procs 8

The above command requests 8 processors on 1 node to run a.out interactively; -rmpool 1 directs the interactive job to LoadLeveler pool 1.

Currently, Pelican has only one node in the interactive queue; hence the maximum node count is 1 and the maximum processor count is 8.

2. Terminal with an X11 Session

The example above isn't sufficient when one wants a command-line interface inside the parallel environment. To achieve this, one must establish an X11 session inside the parallel environment set up by LoadLeveler that is forwarded to the user's local computer.

We are currently trying to figure out how to do this - it is required for interactive parallel debugging on the P5s.

3. View Job Status

3.1. All Jobs in the Queue

# llq

3.2. All of One's Own Jobs

# llq -u username

3.3. Details About Why A Job Has Not Yet Started

# llq -s job-id

The key information is located at the end of the output, and will look similar to the following:

==================== EVALUATIONS FOR JOB STEP l1f1n01.4604.0 ====================
The class of this job step is "workq".
Total number of available initiators of this class on all machines in the cluster: 0
Minimum number of initiators of this class required by job step: 4
The number of available initiators of this class is not sufficient for this job step.
Not enough resources to start now.
Not enough resources for this step as backfill.

Or it will tell you the estimated start time:

==================== EVALUATIONS FOR JOB STEP l1f1n01.8207.0 ====================
The class of this job step is "checkpt".
Total number of available initiators of this class on all machines in the cluster: 8
Minimum number of initiators of this class required by job step: 32
The number of available initiators of this class is not sufficient for this job step.
Not enough resources to start now.
This step is top-dog.
Considered at: Fri Jul 13 12:12:04 2007
Will start by: Tue Jul 17 18:10:32 2007

3.4. Generate a long listing rather than the standard one

# llq -l job-id

This command will give you detailed job information.

3.5. Job Status States

Canceled (CA): The job has been canceled by the llcancel command.

Completed (C): The job has completed.

Complete Pending (CP): The job is completing; some tasks have finished.

Deferred (D): The job will not be assigned until a specified date. The start date may have been specified by the user in the Job Command file, or it may have been set by LoadLeveler because a parallel job could not obtain enough machines to run.

Idle (I): The job is being considered to run on a machine, though no machine has been selected yet.

NotQueued (NQ): The job is not being considered to run. A job may enter this state due to an error in the command file or because LoadLeveler cannot obtain information that it needs to act on the request.

Not Run (NR): The job will never run because a stated dependency in the Job Command file evaluated to false.

Pending (P): The job is in the process of starting on one or more machines. The request to start the job has been sent but has not yet been acknowledged.

Rejected (X): The job did not start because there was a mismatch between the requirements of your job and the resources on the target machine, or because the user does not have a valid ID on the target machine.

Reject Pending (XP): The job is in the process of being rejected.

Removed (RM): The job was canceled by either LoadLeveler or the owner of the job.

Remove Pending (RP): The job is in the process of being removed.

Running (R): The job is running.

Starting (ST): The job is starting.

Submission Error (SX): The job cannot start due to a submission error. Please notify the Bluedawg administration team if you encounter this error.

System Hold (S): The job has been put on hold by a system administrator.

System User Hold (HS): Both the user and a system administrator have put the job on hold.

Terminated (TX): The job was terminated, presumably by means beyond LoadLeveler's control. Please notify the Bluedawg administration team if you encounter this error.

User Hold (H): The job has been put on hold by the owner.

Vacated (V): The started job did not complete. The job will be scheduled again, provided that it may be rescheduled.