
== Using the CentOS/Slurm test nodes ==

We will be converting Beocat from Gentoo Linux to CentOS Linux on December 26th. This means you will need to recompile any applications in your home directory. We are also converting from the SGE scheduler to Slurm, so you will also need to convert all of your qsub scripts to sbatch scripts. We have developed tools to make this process as easy as possible, and we have test nodes available now so that you can compile and test your code before the transition. To access these, start by ssh'ing to the Eunomia head node from either Beocat head node.

clymene> ssh eunomia

Once on Eunomia, you can use the Slurm version of kstat (kstat.slurm for now) to see what jobs are running on the nodes. For now, kstat.slurm will show all nodes even though only a few (1-2 of each type) are accessible from the CentOS side.

eunomia> kstat.slurm --help
eunomia> kstat.slurm

If you already have a qsub script, you can use kstat.convert, a new Perl program we have created that will automatically convert your qsub script to an sbatch script.

kstat.convert --sge qsub_script.sh --slurm slurm_script.sh

Below is an example of a simple qsub script and the resulting sbatch script.
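A simple qsub script (a sketch; the directives and values here are illustrative, and kstat.convert's exact output may differ in detail):

#!/bin/bash
#$ -N myjob
#$ -cwd
#$ -pe single 1
#$ -l h_rt=1:00:00
#$ -l mem=1G
./myprogram

The resulting sbatch script:

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=1:00:00
#SBATCH --mem-per-cpu=1G
./myprogram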

== Submitting your first job ==

To submit a job to run under Slurm, we use the <code>sbatch</code> command. sbatch (submit batch) takes the commands you give it and runs them through the scheduler, which finds the optimum place for your job to run. With over 300 nodes and 7,500 cores to schedule, as well as differing priorities, hardware, and individual resources, the scheduler's job is not trivial.

There are a few things you'll need to know before running sbatch (the sketch after this list shows how these requests look inside a job script):

* How many cores you need. Note that unless your program is written to use multiple cores (called "threading"), asking for more cores will not speed up your job; this is a common misconception. Beocat will not magically make your program use multiple cores! For this reason the default is 1 core.
* How much time you need. Many users beginning on Beocat neglect to specify a time requirement, and then ask us why their job died after one hour (the default). We usually point them to the FAQ.
* How much memory you need. The default is 1 GB. If your job uses significantly more than you ask for, it will be killed off.
* Any advanced options. See the AdvancedSlurm page for these requests. For our basic examples here, we will ignore these.
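For reference, the first three requests can be written directly into a job script as #SBATCH directives, so you don't have to retype them on every submission. A minimal sketch (the values here are placeholders, not recommendations):

#!/bin/sh
# one core
#SBATCH --cpus-per-task=1
# one hour of run time
#SBATCH --time=1:00:00
# 1 GB of memory per core
#SBATCH --mem-per-cpu=1G
srun hostname

sbatch reads these directives from the top of the script, so submitting it with no extra arguments requests the same resources as the equivalent command-line options.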

So let's now create a small script to test our ability to submit jobs. Create the following file (either by copying it to Beocat or by editing a text file there), and we'll name it myhost.sh. Both of these methods are documented on our LinuxBasics page.

#!/bin/sh
srun hostname

Be sure to make it executable:

chmod u+x myhost.sh

So now let's submit it as a job and see what happens. Here I'm going to use five options:

* <code>--mem-per-cpu=</code> tells how much memory I need. In my example, I'm using our system minimum of 512 MB, which is more than enough. Note that your memory request is per core, which doesn't make much difference for this example, but will as you submit more complex jobs.
* <code>--time=</code> tells how much runtime I need. This can be in the form of "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes", or "days-hours:minutes:seconds". This is a very short job, so 1 minute should be plenty. This can't be changed after the job has started, so please make sure you have requested a sufficient amount of time.
* <code>--cpus-per-task=1</code> tells Slurm that I need only a single core per task. The AdvancedSlurm page has much more on the "cpus-per-task" switch.
* <code>--ntasks=1</code> tells Slurm that I only need to run 1 task. The AdvancedSlurm page has much more on the "ntasks" switch.
* <code>--nodes=1</code> tells Slurm that this must be run on one machine. The AdvancedSlurm page has much more on the "nodes" switch.
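Putting those five options together, the submission looks like this (a representative invocation; the job ID in sbatch's response will differ):

eunomia> sbatch --mem-per-cpu=512M --time=1 --cpus-per-task=1 --ntasks=1 --nodes=1 ./myhost.sh
Submitted batch job 1234

When the job finishes, its output (here, the name of the node it ran on) lands in Slurm's default output file, slurm-<jobid>.out, in the directory you submitted from.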

Once your job is running, you can watch it with kstat. Each job line will contain the username, program name, job ID number, number of cores, maximum memory used, whether the job is killable, and the amount of time the job has run. If the job is still in the queue, it may instead contain information on the requested run time and memory per core, and the time shown is how long the job has been in the queue.

In this case, I have 2 jobs running on Hero43: unafold is using 1 core while octopus is using 16 cores. The most useful information here is the memory being used in each case. While unafold is taking very little memory, octopus is using 125 GB, and the red font indicates that it is close to the amount requested. If a job's memory use goes over the requested amount, it will have a red background, and you should request more memory in future runs. If the memory is flashing with a red background, you are more than 50% over your requested amount and your code will be forced to use disk swap, which can slow it down enormously; you're usually better off killing the job and restarting it with an appropriate memory request. If the code accesses large files, there may be an IO value reported, though this number is not very accurate.

kstat -d 7
This will show you information about the jobs that have completed in the last 7 days.

kstat -c
This provides a global view of Beocat showing how many cores each person is using.

Generally speaking, jobs that are higher on the list will start running before the ones lower on the list, so you can see your relative position. Another useful tool for seeing how busy Beocat is can be found at http://ganglia.beocat.ksu.edu/. A job you submit may start immediately or may wait up to several weeks, depending on the priority of your job, the resources available, and the requested resources of the jobs ahead of you in the queue.