SAIL Compute Cluster

Overview

The Stanford AI Lab cluster combines several groups of machines: Gorgon, Dag, Deep, Visionlab, NLP (Jude), Jag (Jagupard), John, Ladon, CVGL, and Atlas. Together, these systems currently provide 3232 processor cores and 17.6TB of memory. The cluster is accessed through a batch queue system that coordinates all jobs running on it. Do not access the nodes directly, because the scheduler allocates CPUs exclusively to each job.

Once you have access to the cluster, you can submit, monitor, and cancel jobs from the head node, scail. Do not use this machine for compute-intensive work; instead, get a shell on a compute node by starting an interactive job.

You can use the cluster by starting batch jobs or interactive jobs.

Job Submissions

Use of the cluster is coordinated by a batch queue scheduler, which assigns compute nodes to jobs in an order that depends on the time submitted, the number of nodes requested, and the level of recent usage.

You can submit two kinds of jobs to the cluster: interactive and batch.

Interactive jobs give you access to a shell on one of the nodes, from which you can execute commands by hand, whereas batch jobs run from a given shell script in the background and automatically terminate when finished.

Generally speaking, interactive jobs are used for building and testing, while batch jobs are used thereafter.

Batch Jobs

Batch jobs are the most common way to interact with the cluster, and are useful when you do not need to interact with the shell to perform the desired task. Two clear advantages are that your job is managed automatically after submission, and that placing your setup commands in a shell script lets you efficiently dispatch multiple similar jobs. To start a batch job on a queue (the one belonging to the group you work with), ssh into scail and type:

qsub -q myqueue my_script.sh

The command will immediately return with a job ID, which you can use to monitor or delete the job. Your script will run in a shell with your AFS permissions, though the working directory may be different. You can obtain the directory in which you ran qsub from $PBS_O_WORKDIR. If you want to request that multiple nodes be allocated to your job, use the following syntax:

qsub -q myqueue -l nodes=4 my_script.sh

specifying the desired number of nodes. A file listing the nodes allocated to the job is located at the path $PBS_NODEFILE. If you request multiple nodes, your script will run on only one of them, and you can ssh into the others to start processes there. If you need multiple CPUs or cores on the same node, specify how many processors per node (ppn) you need. For example, the following will allocate 10 cores on a single node:

qsub -q myqueue -l nodes=1:ppn=10 my_script.sh
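As an illustration of the working-directory note above, a batch script might begin by changing to the directory from which qsub was run. This is only a minimal sketch: the function name enter_submit_dir and the commented-out workload are hypothetical, and the fallback to the current directory is an assumption added so the script can also be tried outside the scheduler.

```shell
#!/bin/sh
# my_script.sh: hypothetical batch script skeleton.

# enter_submit_dir: change to the directory qsub was run from.
# PBS_O_WORKDIR is set by the scheduler; outside a job it is unset,
# so fall back to the current directory (an assumption for local testing).
enter_submit_dir() {
    cd "${PBS_O_WORKDIR:-$PWD}" || return 1
}

enter_submit_dir || exit 1
echo "Job running on $(hostname) in $(pwd)"

# Place the real workload here, for example (hypothetical program):
# ./my_program input.dat > output.log
```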

If you are running the same script on multiple data units, a convenient way to queue a job for each data unit is to specify the path in an environment variable that is exported to your job. Your shell script can access the variable and act accordingly. To submit a job with environment variables, use the -v option with a comma-delimited list of variable names:

setenv DATA_PATH ~/data1
qsub -v DATA_PATH,OTHER_VAR my_script.sh
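To queue one job per data unit, the export-and-submit step can be wrapped in a loop. The sketch below uses Bourne shell syntax (export rather than setenv); the function name submit_each, the QSUB variable, and the data directories are hypothetical. By default it only prints the qsub commands it would run, which is an assumption made so the loop can be checked before anything is submitted.

```shell
#!/bin/sh
# Dry run by default: print each qsub command instead of running it.
# Set QSUB=qsub in the environment to actually submit the jobs.
QSUB="${QSUB:-echo qsub}"

# submit_each QUEUE SCRIPT DIR...: queue one job per data directory.
submit_each() {
    queue="$1"
    script="$2"
    shift 2
    for dir in "$@"; do
        # Export the variable so that -v DATA_PATH copies it into the job.
        DATA_PATH="$dir"
        export DATA_PATH
        $QSUB -q "$queue" -v DATA_PATH "$script"
    done
}

# Hypothetical queue, script, and data directories.
submit_each myqueue my_script.sh "$HOME/data1" "$HOME/data2"
```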

Alternatively, you can specify the value of a variable inline with the command using:

qsub -v DATA_PATH=$HOME/data1 my_script.sh

Interactive Jobs

Interactive jobs are useful for compiling and testing code intended to run on the cluster, performing one-time tasks, and executing software that requires runtime feedback. To start an interactive job, ssh into scail and type:

qsub -I -q myqueue

The command will block until a node is available, at which point you will be dropped into a shell on a compute node. Your Kerberos/AFS credentials will be transferred to the new session, so you can access files as usual for the maximum renewable life of your Kerberos tickets (30 days). When you are finished with the session, type exit at the shell, or delete the job.

If you want to request that multiple nodes be allocated to you, use the following syntax:

qsub -I -q myqueue -l nodes=4

specifying the desired number of nodes. You will be dropped into a shell on one of these nodes, and the list of nodes allocated to your job can be viewed by typing:

cat $PBS_NODEFILE

Each line in this file corresponds to one CPU, so a host may appear twice if your job has been assigned to both cores of a given node.
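Because each line stands for one CPU, counting repeated lines gives the number of slots allocated on each host. The following is a small sketch of that idea; the function name slots_per_host is hypothetical, and the guard around the final call is an assumption so the snippet does nothing outside a job.

```shell
#!/bin/sh
# slots_per_host FILE: print "host count" for each host in a node file,
# where count is the number of lines (CPU slots) naming that host.
slots_per_host() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Inside a job, report on the real node file.
if [ -n "${PBS_NODEFILE:-}" ]; then
    slots_per_host "$PBS_NODEFILE"
fi
```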

If you are running multi-threaded programs that use multiple CPUs, including multi-threaded Matlab, you must request those CPUs with the -l option (for example, -l nodes=XX). Before letting your program use that many CPUs, check $PBS_NODEFILE to confirm that you were actually allocated the number of processors you requested.
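The check described above could be scripted at the top of a job, roughly as follows. The function names and the WANTED value are hypothetical, and the sketch counts every line in the node file regardless of host, so treat it as a starting point rather than a complete safeguard.

```shell
#!/bin/sh
# allocated_slots FILE: number of CPU slots (lines) in a node file.
allocated_slots() {
    awk 'END { print NR }' "$1"
}

# have_slots WANTED FILE: succeed only if at least WANTED slots are listed.
have_slots() {
    [ "$(allocated_slots "$2")" -ge "$1" ]
}

WANTED=4   # hypothetical number of threads the program will start

# Only enforce the check inside a job, where PBS_NODEFILE is set.
if [ -n "${PBS_NODEFILE:-}" ] && ! have_slots "$WANTED" "$PBS_NODEFILE"; then
    echo "Fewer than $WANTED CPUs allocated; not starting $WANTED threads" >&2
    exit 1
fi
```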

To request a particular machine, you can also specify:

qsub -I -q gorgon -l host=gorgon5

Normally hosts are allocated in sequential order. A new host is allocated once all slots on the preceding host are filled. The host flag is particularly useful if you need a machine that is not currently running any jobs (say, for requesting 4 CPUs). You can check which machines are being used by running qstat -f | grep exec_host.
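To turn the exec_host lines into a plain list of busy machines, the qstat output can be filtered a little further. This sketch assumes each exec_host value has the form host/slot+host/slot and fits on one line; qstat -f may wrap long values, in which case the filter would miss the continuation lines. The function name busy_hosts is hypothetical.

```shell
#!/bin/sh
# busy_hosts: read `qstat -f` output on stdin and print each host that
# appears in an exec_host attribute, one per line, without duplicates.
busy_hosts() {
    grep 'exec_host = ' |
        sed 's/.*exec_host = //' |
        tr '+' '\n' |
        sed 's|/.*||' |
        sort -u
}

# Run against the live queue only where qstat is installed.
if command -v qstat >/dev/null 2>&1; then
    qstat -f | busy_hosts
fi
```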

Managing Jobs

You can view a list of all jobs running on the cluster by typing:

qstat

Or, for a full-screen representation of the queue and entire compute cluster, you can run:

pbstop

You can view detailed information for a specific job by typing:

qstat -f job_identifier

where job_identifier is the job ID that qsub printed when you submitted the job.

To cancel a job you started, type:

qdel job_identifier

If for some reason you want to restart a job that is already running, type:

qrerun job_identifier
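When monitoring jobs from a script, it can help to pull a single field out of the detailed qstat output. The sketch below extracts the job_state attribute (for example Q for queued or R for running); the function name job_state and the JOB_ID environment variable are hypothetical, and the pattern assumes the usual "job_state = X" line printed by qstat -f.

```shell
#!/bin/sh
# job_state: read `qstat -f job_identifier` output on stdin and print
# the value of its job_state attribute.
job_state() {
    sed -n 's/^[[:space:]]*job_state = //p'
}

# Query a real job only when JOB_ID is set in the environment,
# for example: JOB_ID=12345.scail sh this_script.sh
if [ -n "${JOB_ID:-}" ]; then
    qstat -f "$JOB_ID" | job_state
fi
```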

Storage

There are several storage options for the scail cluster:

AFS

All scail cluster nodes mount CS AFS volumes such as your home directory and scratch volumes. This is a good option for jobs that are not I/O-intensive and that finish quickly.

If your jobs need access to AFS for longer than 24 hours, you need to run 'reauth' (which should be in your PATH as /afs/cs/software/bin/reauth) before you run qsub. 'reauth' will renew your AFS tokens periodically for up to 7 days.

NFS

All NLP NFS filesystems, including /scr and /u/nlp, are automounted on scail and the NLP nodes.

/deep is NFS mounted to all gorgon and deep nodes, as well as scail.

/atlas is NFS mounted to all atlas nodes, as well as scail.

At the moment, /scail/scratch has 77TB of usable storage and /deep has 72TB. Neither is backed up at all, so please consider all data on them scratch data that may be lost.

/scail/data has 27TB usable storage and gets a daily snapshot backup which is kept for a while in case file(s) get clobbered or deleted by mistake. To request a restore of such files from previous versions, fill out the form at cs.stanford.edu/restore.

/scail/data quotas are set up for each group (/scail/data/groups) and the user's CSID has to be in the right netgroup for access.

Queues

These are the queues currently enabled on scail. Please submit jobs only to the queue that your group owns. See the Starting Jobs section for instructions.