All memory requests are '''per core'''. One of the more common scenarios is that somebody needs, say, 20 cores and 400 GB of memory, so they make a request like '<tt>-pe single 20 -l mem=400G</tt>'. This will never run, because what you are really requesting is 20 cores and 8,000 GB of memory (20 * 400). Since we have no nodes with 8 terabytes of memory, the job will never run. In this case, you should divide the 400 GB total memory request by the number of cores (20), so the correct request would be '<tt>-pe single 20 -l mem=20G</tt>'.
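
As a quick sanity check, the per-core figure can be computed from the total; the values below are the ones from the example above:

```shell
# SGE's mem request is per core, so divide the total memory by the core count
TOTAL_MEM_GB=400
CORES=20
PER_CORE_GB=$(( TOTAL_MEM_GB / CORES ))
echo "-pe single ${CORES} -l mem=${PER_CORE_GB}G"
```

This prints the corrected request, '<tt>-pe single 20 -l mem=20G</tt>'.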

== Other Handy SGE Features ==

=== Email status changes ===

One of the most commonly used options when submitting jobs, aside from resource requests, is to have SGE email you when a job changes its status. This takes two directives to qsub: '<tt>-M ''someone@somewhere.com''</tt>' gives the email address to which to send status updates, and '<tt>-m abe</tt>' is probably the most common directive for ''when'' to send updates. This will send email messages when a job (a)borts, (b)egins, or (e)nds. Other possibilities are (s)uspended and (n)ever.

=== Job Naming ===

If you have several jobs in the queue, running the same script with different parameters, it's handy to have a different name for each job as it shows up in the queue. This is accomplished with the '<tt>-N ''JobName''</tt>' qsub directive.

=== Combining Output Streams ===

Normally, SGE will create two files for output. One will be .e''jobnumber'' and the other .o''jobnumber''. If you want both of these to be combined into a single file, you can use the qsub directive '<tt>-j y</tt>'.

=== Running from the Current Directory ===

By default, jobs run from your home directory. Many programs incorrectly assume that you are running the script from the current directory. You can use the '<tt>-cwd</tt>' directive to change to the "current working directory" you used when submitting the job.

== Running from a qsub Submit Script ==

No doubt after you've run a few jobs you will get tired of typing something like '<tt>qsub -l mem=2G,h_rt=10:00 -pe single 8 -N MyJobTitle MyScript.sh</tt>'. How are you supposed to remember all of these every time? The answer is to create a 'submit script', which outlines all of these for you. Below is a sample submit script, which you can modify and use for your own purposes.

<syntaxhighlight lang="bash">
#!/bin/bash

## A Sample qsub script created by Kyle Hutson
##
## Note: Usually a '#' at the beginning of the line is ignored. However, in
## the case of qsub, lines beginning with #$ are commands for qsub itself, so
## I have taken the convention here of starting *every* line with a '#'. Just
## delete the first one if you want to use that line, and then modify it to
## your own purposes. The only exception here is the first line, which *must*
## be #!/bin/bash (or another valid shell).

## Specify the amount of RAM needed _per_core_. Default is 1G
##$ -l mem=1G

## Specify the maximum runtime. Default is 1 hour (1:00:00)
##$ -l h_rt=1:00:00

## Require the use of infiniband. If you don't know what this is, you probably
## don't need it. Default is "FALSE"
##$ -l ib=TRUE

## CUDA directive. If you don't know what this is, you probably don't need it.
## Default is "FALSE"
##$ -l cuda=TRUE

## Parallel environment. Syntax is '-pe Environment NumberOfCores'. A list of
## valid environments can be found at
## http://support.cis.ksu.edu/BeocatDocs/SunGridEngine (section 3.2). One
## quick note here: jobs requesting 16 or fewer cores tend to get scheduled
## fairly quickly. If you need a job that requires more than that, you might
## benefit from emailing us at beocat@cis.ksu.edu to see how we can assist in
## getting your job scheduled in a reasonable amount of time. Default is
## "single 1"
##$ -pe single 12
##$ -pe mpi-1 2
##$ -pe mpi-fill 20
##$ -pe mpi-spread 16

## Checkpointing. Options are BLCR or dmtcp. Default is no checkpointing.
##$ -ckpt dmtcp

## Use the current working directory instead of your home directory
##$ -cwd

## Merge output and error text streams into a single stream
##$ -j y

## Name my job, to make it easier to find in the queue
##$ -N MyJobTitle

## And finally, we run the job we came here to do.
## $HOME/ProgramDir/ProgramName ProgramArguments

## OR, for the case of MPI-capable jobs
## mpirun $HOME/path/MpiJobName

## Send email when a job is aborted (a), begins (b), and/or ends (e)
##$ -m abe

## Email address to send the email to based on the above line.
##$ -M myemail@ksu.edu
</syntaxhighlight>

== Array Jobs ==


One of SGE's useful options is the ability to run "Array Jobs". It can be used with the following option to qsub:

 -t n[-m[:s]]

Submits a so-called Array Job, i.e., an array of identical tasks differentiated only by an index number and treated by Grid Engine almost like a series of jobs. The option argument to -t specifies the number of array job tasks and the index numbers which will be associated with the tasks. The index numbers will be exported to the job tasks via the environment variable SGE_TASK_ID. The option arguments n, m, and s will be available through the environment variables SGE_TASK_FIRST, SGE_TASK_LAST, and SGE_TASK_STEPSIZE.

The following restrictions apply to the values n and m:

 1 <= n <= 1,000,000
 1 <= m <= 1,000,000
 n <= m

The task id range specified in the option argument may be a single number, a simple range of the form n-m, or a range with a step size. Hence, the task id range 2-10:2 would result in the task id indexes 2, 4, 6, 8, and 10, for a total of 5 identical tasks, each with the environment variable SGE_TASK_ID containing one of the 5 index numbers.

Array jobs are commonly used to execute the same type of operation on varying input data sets correlated with the task index number. The number of tasks in an array job is unlimited.

STDOUT and STDERR of array job tasks will be written into different files with the default location

 <jobname>.['e'|'o']<job_id>'.'<task_id>
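
The index arithmetic is easy to preview outside of SGE; for instance, the ids generated by a 2-10:2 range match the output of seq:

```shell
# seq FIRST STEP LAST mirrors SGE's n-m:s task id expansion
seq 2 2 10
```

This prints 2, 4, 6, 8, and 10, one per line.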

=== Examples ===

==== Change the Size of the Run ====

Array Jobs have a variety of uses; one of the easiest to comprehend is the following:

I have an application, app1, that I need to run the exact same way, on the same data set, with only the size of the run changing.

My original script looks like this:

<syntaxhighlight lang="bash">
#!/bin/bash
RUNSIZE=50
#RUNSIZE=100
#RUNSIZE=150
#RUNSIZE=200
app1 $RUNSIZE dataset.txt
</syntaxhighlight>

For every run of that job I have to change the RUNSIZE variable, and submit each script. This gets tedious.

With Array Jobs the script can be written like so:

<syntaxhighlight lang="bash">
#!/bin/bash
#$ -t 50-200:50
RUNSIZE=$SGE_TASK_ID
app1 $RUNSIZE dataset.txt
</syntaxhighlight>

I then submit that job, and SGE understands that it needs to run it 4 times, once for each task. It also knows that it can and should run these tasks in parallel.
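
What SGE does with that script can be sketched locally: each of the four tasks gets its own SGE_TASK_ID, so the loop below mimics the four runs (app1 is stubbed out with echo, since it is specific to the example):

```shell
# Mimic '-t 50-200:50': one iteration per task, each with its own SGE_TASK_ID
for SGE_TASK_ID in $(seq 50 50 200); do
    RUNSIZE=$SGE_TASK_ID
    echo "app1 $RUNSIZE dataset.txt"
done
```

Four command lines are produced, one per task, with RUNSIZE set to 50, 100, 150, and 200.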

==== Choosing a Dataset ====

A slightly more complex use of Array Jobs is the following:

I have an application, app2, that needs to be run against every line of my dataset. Every line changes how app2 runs slightly, but I need to compare the runs against each other.

Originally I had to take each line of my dataset and generate a new submit script and submit the job. This was done with yet another script:

<syntaxhighlight lang="bash">
#!/bin/bash
DATASET=dataset.txt
scriptnum=0
while read LINE
do
  echo "app2 $LINE" > ${scriptnum}.sh
  qsub ${scriptnum}.sh
  scriptnum=$(( $scriptnum + 1 ))
done < $DATASET
</syntaxhighlight>

Not only is this needlessly complex, it is also slow, as qsub has to verify each job as it is submitted. This can be done easily with array jobs, as long as you know the number of lines in the dataset. This number can be obtained with '<tt>wc -l dataset.txt</tt>'; in this case, let's call it 5000.

<syntaxhighlight lang="bash">
#!/bin/bash
#$ -t 1-5000
app2 `sed -n "${SGE_TASK_ID}p" dataset.txt`
</syntaxhighlight>

This uses a subshell via backticks, and has the sed command print out only line number $SGE_TASK_ID of the file dataset.txt.
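
The sed trick is easy to verify on a throwaway file (the file name and contents here are made up for the demo):

```shell
# Build a 3-line demo file, then extract line 2 the same way the script does
printf 'alpha\nbeta\ngamma\n' > /tmp/sed_demo.txt
SGE_TASK_ID=2
sed -n "${SGE_TASK_ID}p" /tmp/sed_demo.txt    # prints 'beta'
```

Inside a real array job, SGE sets SGE_TASK_ID for you, so each task pulls a different line.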

Not only is this a smaller script, it is also faster to submit because it is one job instead of 5000, so qsub doesn't have to verify as many.

To give you an idea about the time saved: submitting 1 job takes 1-2 seconds, so by extension, submitting 5000 jobs takes 5,000-10,000 seconds, or roughly 1.5-3 hours.


== Resource Requests ==

Aside from the time, RAM, and CPU requirements listed on the SGEBasics page, we have several other requestable resources. Generally, if you don't know whether you need a particular resource, you should use the default. The list below can be generated with the command

 qconf -sc | awk '{ if ($5 != "NO") { print }}'

{| class="wikitable"
! name !! shortcut !! type !! relop !! requestable !! consumable !! default !! urgency
|-
| arch || a || RESTRING || == || YES || NO || NONE || 0
|-
| avx || avx || BOOL || == || YES || NO || FALSE || 0
|-
| calendar || c || RESTRING || == || YES || NO || NONE || 0
|-
| cpu || cpu || DOUBLE || >= || YES || NO || 0 || 0
|-
| cpu_flags || c_f || STRING || == || YES || NO || NONE || 0
|-
| cuda || cuda || INT || <= || YES || JOB || 0 || 0
|-
| display_win_gui || dwg || BOOL || == || YES || NO || 0 || 0
|-
| exclusive || excl || BOOL || EXCL || YES || YES || 0 || 1000
|-
| h_core || h_core || MEMORY || <= || YES || NO || 0 || 0
|-
| h_cpu || h_cpu || TIME || <= || YES || NO || 0:0:0 || 0
|-
| h_data || h_data || MEMORY || <= || YES || NO || 0 || 0
|-
| h_fsize || h_fsize || MEMORY || <= || YES || NO || 0 || 0
|-
| h_rss || h_rss || MEMORY || <= || YES || NO || 0 || 0
|-
| h_rt || h_rt || TIME || <= || FORCED || NO || 0:0:0 || 0
|-
| h_stack || h_stack || MEMORY || <= || YES || NO || 0 || 0
|-
| h_vmem || h_vmem || MEMORY || <= || YES || NO || 0 || 0
|-
| hostname || h || HOST || == || YES || NO || NONE || 0
|-
| infiniband || ib || BOOL || == || YES || NO || FALSE || 0
|-
| m_core || core || INT || <= || YES || NO || 0 || 0
|-
| m_socket || socket || INT || <= || YES || NO || 0 || 0
|-
| m_thread || thread || INT || <= || YES || NO || 0 || 0
|-
| m_topology || topo || RESTRING || == || YES || NO || NONE || 0
|-
| m_topology_inuse || utopo || RESTRING || == || YES || NO || NONE || 0
|-
| mem_free || mf || MEMORY || <= || YES || NO || 0 || 0
|-
| mem_total || mt || MEMORY || <= || YES || NO || 0 || 0
|-
| mem_used || mu || MEMORY || >= || YES || NO || 0 || 0
|-
| memory || mem || MEMORY || <= || FORCED || YES || 0 || 0
|-
| num_proc || p || INT || == || YES || NO || 0 || 0
|-
| qname || q || RESTRING || == || YES || NO || NONE || 0
|-
| s_core || s_core || MEMORY || <= || YES || NO || 0 || 0
|-
| s_cpu || s_cpu || TIME || <= || YES || NO || 0:0:0 || 0
|-
| s_data || s_data || MEMORY || <= || YES || NO || 0 || 0
|-
| s_fsize || s_fsize || MEMORY || <= || YES || NO || 0 || 0
|-
| s_rss || s_rss || MEMORY || <= || YES || NO || 0 || 0
|-
| s_rt || s_rt || TIME || <= || YES || NO || 0:0:0 || 0
|-
| s_stack || s_stack || MEMORY || <= || YES || NO || 0 || 0
|-
| s_vmem || s_vmem || MEMORY || <= || YES || NO || 0 || 0
|-
| slots || s || INT || <= || YES || YES || 1 || 1000
|-
| swap_free || sf || MEMORY || <= || YES || NO || 0 || 0
|-
| swap_rate || sr || MEMORY || >= || YES || NO || 0 || 0
|-
| swap_rsvd || srsv || MEMORY || >= || YES || NO || 0 || 0
|-
| swap_total || st || MEMORY || <= || YES || NO || 0 || 0
|-
| swap_used || su || MEMORY || >= || YES || NO || 0 || 0
|-
| virtual_free || vf || MEMORY || <= || YES || NO || 0 || 0
|-
| virtual_total || vt || MEMORY || <= || YES || NO || 0 || 0
|-
| virtual_used || vu || MEMORY || >= || YES || NO || 0 || 0
|}

The good news is that nobody ever uses most of these. There are a couple of exceptions, though:

=== Infiniband ===

First of all, let me state that just because it sounds "cool" doesn't mean you need it or even want it. Infiniband does absolutely no good if running in a 'single' parallel environment. Infiniband is a high-speed host-to-host communication fabric. It is used in conjunction with MPI jobs (discussed below). Several times we have had jobs which could run just fine, except that the submitter requested Infiniband, and all the nodes with Infiniband were currently busy. In fact, some of our fastest nodes do not have Infiniband, so by requesting it when you don't need it, you are actually slowing down your job. To request Infiniband, add '<tt>-l ib=true</tt>' to your qsub command line.

=== CUDA ===

CUDA is the resource required for GPU computing. We have a very small number of nodes which have GPUs installed. To request one of these nodes, add '<tt>-l cuda=true</tt>' to your qsub command line.

=== Exclusive ===

Some programs just don't play nicely with others. They will attempt to use all available memory or try to use all the cores they can. The way to be a nice neighbor if your program has this problem is to request exclusive use of a node with '<tt>-l excl=true</tt>'. This can also be useful for benchmarking, where you can be sure that no other jobs are interfering with yours.

== Parallel Jobs ==

There are two ways jobs can run in parallel: intranode and internode. Note: Beocat will not automatically make a job run in parallel. Have I said that enough? It's a common misconception.

=== Intranode jobs ===

Intranode jobs are easier to code and can take advantage of many common libraries, such as OpenMP or Java's threads. Many times, your program will need to know how many cores you want it to use; many will use all available cores if not told explicitly otherwise. This can be a problem when you are sharing resources, as Beocat does. To request multiple cores, use the qsub directive '<tt>-pe single n</tt>', where n is the number of cores you wish to use. If your command can take an environment variable, you can use $NSLOTS to tell how many cores you've been allocated.
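
For example, an OpenMP-based program is usually told its thread count via the conventional OMP_NUM_THREADS variable, so a submit script might wire the allocated core count through like this (the program name is a placeholder, and NSLOTS is defaulted so the sketch also runs outside a job):

```shell
# Inside an SGE job, NSLOTS holds the core count granted by '-pe single n';
# default it here so this sketch also runs outside a job
NSLOTS=${NSLOTS:-8}
export OMP_NUM_THREADS=$NSLOTS
echo "would run: ./my_threaded_app with $OMP_NUM_THREADS threads"
```

This keeps the program's thread count in step with what the scheduler actually granted, instead of letting it grab every core on the node.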

=== Internode (MPI) jobs ===

"Talking" between nodes is trickier than talking between cores on the same node. The specification for doing so is called the "Message Passing Interface", or MPI. We have OpenMPI installed on Beocat for this purpose. Most programs written to take advantage of large multi-node systems will use MPI. You can tell if you have an MPI-enabled program because its directions will tell you to run '<tt>mpirun program</tt>'. Requesting MPI resources is only mildly more difficult than requesting single-node jobs. Instead of using '<tt>-pe single n</tt>' for your qsub request, you will use one of the following:

{| class="wikitable"
! Parallel Environment !! Description
|-
| mpi-fill || This environment will use as many slots on each node as it can until it reaches the number of cores you have requested.
|-
| mpi-spread || This environment will spread itself out over as many nodes as possible until it reaches the number of cores you have requested.
|-
| mpi-1 || This environment will allocate the slots you've requested 1 per node.
|-
| mpi-2 || This environment will allocate the slots you've requested 2 per node. You must request cores as a multiple of 2.
|-
| mpi-4 || This environment will allocate the slots you've requested 4 per node. You must request cores as a multiple of 4.
|-
| mpi-8 || This environment will allocate the slots you've requested 8 per node. You must request cores as a multiple of 8.
|-
| mpi-10 || This environment will allocate the slots you've requested 10 per node. You must request cores as a multiple of 10.
|-
| mpi-12 || This environment will allocate the slots you've requested 12 per node. You must request cores as a multiple of 12.
|-
| mpi-16 || This environment will allocate the slots you've requested 16 per node. You must request cores as a multiple of 16.
|-
| mpi-80 || This environment will allocate the slots you've requested 80 per node. You must request cores as a multiple of 80.
|}

Some quick examples:

* '<tt>-pe mpi-4 16</tt>' will give you 4 chunks of 4 cores apiece. They might all happen to be allocated on the same node (16 cores), on 4 different nodes (4 cores each), on 3 nodes (8 cores on one and 4 cores on the other two), or on 2 nodes (8 cores each).
* '<tt>-pe mpi-fill 40</tt>' will give you 40 cores, but will attempt to get them all on the same node.
* '<tt>-pe mpi-fill 100</tt>' will give you 100 cores, and place them on as few nodes as possible. In this case it's likely you would get a full mage (80 cores) and either part of another mage (the remaining 20 cores) or one of the 20-core elves.
* '<tt>-pe mpi-spread 40</tt>' will give you 40 cores, and will attempt to place each on a separate node.
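
The chunk arithmetic behind the fixed-size environments can be checked with plain shell; here for mpi-4 with 16 cores requested:

```shell
# '-pe mpi-N total' requires total to be a multiple of N; the request is then
# handed out as total/N chunks of N cores each
CHUNK=4
TOTAL=16
if [ $(( TOTAL % CHUNK )) -eq 0 ]; then
    echo "$(( TOTAL / CHUNK )) chunks of $CHUNK cores"
else
    echo "invalid: $TOTAL is not a multiple of $CHUNK"
fi
```

This prints "4 chunks of 4 cores", matching the first example above.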


== Running jobs interactively ==

Some jobs just don't behave like we think they should, or need to be run with somebody sitting at the keyboard and typing in response to the output the computers are generating. Beocat has a facility for this, called '<tt>qrsh</tt>'. qrsh uses the exact same command-line arguments as qsub. If no node is available with your resource requirements, qrsh will tell you:

 Your "qrsh" request could not be scheduled, try again later.

Note that, like qsub, your interactive job will time out after your allotted time has passed.

== Job Accounting ==

Some people may find it useful to know what their job did during its run. The '<tt>qacct</tt>' tool will read SGE's accounting file and give you summarized or detailed views on jobs that have run within Beocat.

=== qacct ===

This data can usually be used to diagnose two very common job failures.

==== Job debugging ====

It is simplest if you know the job number of the job you are trying to get information on.

<syntaxhighlight lang="bash">
# if you know the jobid, put it here:
qacct -j 1122334455

# if you don't know the job id, you can look at your jobs over some number of
# days, in this case the past 14 days:
qacct -o $USER -d 14 -j
</syntaxhighlight>

If you look at the line showing ru_wallclock, you can see that it shows 1s. This means that the job started and then promptly ended. This points to something being wrong with your submission script; perhaps there is a typo somewhere in it.

If you look at the lines showing failed, ru_wallclock, and category, we can see some pointers to the issue. The job didn't finish because the scheduler (qmaster) enforced some limit. If you look at the category line, the only limit requested was h_rt, so it was a runtime (wallclock) limit. Comparing ru_wallclock and the h_rt request, we can see that the job ran until the h_rt time was hit, and then the scheduler enforced the limit and killed the job. You will need to resubmit the job and ask for more time next time.
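
The h_rt comparison described above amounts to a simple check; a sketch with made-up numbers (an h_rt request of 1:00:00, expressed in seconds):

```shell
# Hypothetical values: h_rt was requested as 1:00:00 (3600 s) and qacct's
# ru_wallclock came back as 3600 s, i.e. the job ran right up to the limit
H_RT=3600
RU_WALLCLOCK=3600
if [ "$RU_WALLCLOCK" -ge "$H_RT" ]; then
    echo "job was killed at its h_rt limit; resubmit with a larger h_rt"
fi
```

If ru_wallclock lands well under the requested h_rt instead, look for a failure in the job itself rather than a scheduler-enforced limit.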