Introduction

This manual provides a brief introduction to the usage of IIGB's Linux cluster, Biocluster. All servers and compute resources of the IIGB bioinformatics facility are available to
researchers from all departments and colleges at UC Riverside for a minimal
recharge fee (see rates).
To request an account, please contact Rakesh Kaundal (rkaundal@ucr.edu). The latest hardware/facility description for grant applications is available here: Facility Description [pdf].

Secondary function: submitting jobs to the queuing system (Torque/Maui)

Worker Nodes

High-Memory nodes

m01-m03: each 32-64 cores and 256-512 GB memory

Compute nodes

n01-n32: 8 cores and 16 GB memory each

n33, n34: 48 cores and 64 GB memory each

Current status of Biocluster nodes

Getting Started

The initial log-in brings users to the Biocluster head node. From there, users can submit jobs via qsub to the compute nodes or log into owl to perform memory-intensive tasks. Since all machines mount a centralized file system, users will always see the same home directory on all systems, so there is no need to copy files from one machine to another.

Unloading Software

Sometimes you no longer want a piece of software in your path. To do
this, unload the module by running:

module unload <software name>

Additional Features

There are additional features and operations that can be done with the
module command. Please run the following to get more information:

module help

Quotas

CPU

Currently, the maximum number of CPU cores a user can use simultaneously on Biocluster is 128 when the load on the cluster is below 30%, and 64 when the load is above 30%. If a user submits jobs requesting more than 128/64 CPU cores, the additional requests are queued until resources within the user's CPU quota become available. Upon request, a user's upper CPU quota can be extended temporarily, but only if sufficient CPU resources are available.

To avoid monopolization of the cluster by a small number of users, the high-load CPU quota of 64 cores is dynamically readjusted by an algorithm that considers the number of CPU hours accumulated by each user over a period of 3 months, along with the current overall CPU usage on the cluster. If a user's CPU-hour average over the 3-month window exceeds an allowable amount, that user's default CPU quota is reduced to 32 CPU cores; if it exceeds the allowable amount two-fold, it is reduced to 16 CPU cores. Once the average usage of a heavy user drops back below those limits, the upper CPU limit is raised accordingly.

Note: when the overall CPU load on the cluster is below 70%, the dynamically readjusted CPU quotas are not applied. At those low-load times every user has the same CPU quota: 128 CPU cores at <30% load and 64 CPU cores at 30-70% load.
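The quota rules above amount to a small decision function. The sketch below is purely illustrative (the scheduler's actual implementation is not public); the thresholds and core counts come from the text, while the function name and arguments are invented for the example:

```shell
# Illustrative sketch of the CPU quota rules above; not the real scheduler code.
#   $1 = current overall cluster load in percent
#   $2 = user's average CPU hours over the last 3 months
#   $3 = allowable 3-month CPU-hour average (set by the facility)
cpu_quota() {
    local load=$1 avg=$2 allow=$3
    if [ "$load" -lt 30 ]; then
        echo 128                        # low load: flat 128-core quota
    elif [ "$load" -lt 70 ]; then
        echo 64                         # moderate load: flat 64-core quota
    elif [ "$avg" -ge $(( 2 * allow )) ]; then
        echo 16                         # heavy user, two-fold over the allowance
    elif [ "$avg" -ge "$allow" ]; then
        echo 32                         # heavy user, over the allowance
    else
        echo 64                         # normal user at high load
    fi
}
```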

Data Storage

A standard user account has a storage quota of 20GB. Much more storage space, in the range of many TBs, can be made available in a user account's bigdata directory. The amount of storage space available in bigdata depends on a user group's annual subscription. The pricing for extending the storage space in the bigdata directory is available here.

Memory

From the Biocluster head node users can submit jobs to the batch queue or the highmem queue. The nodes (n01-n34) associated with the batch queue are mainly for CPU intensive tasks, while the nodes (m01-m03) of the highmem queue are dedicated to memory intensive tasks. The batch nodes have 16-64GB RAM each and the highmem nodes have 256-512GB RAM.

Using a Script

When using the cluster it quickly becomes useful to be able to run multiple commands as part of a single job. To solve this we write scripts, which are invoked as the last argument to qsub.

qsub <script name>

A script is just a set of commands that we want to make happen once the job runs. Below is an example script that does the same thing that we do with Exercise 5 in the Linux Basics Manual.

# Start preparing to do a blast run
# Use awk to grab a number of proteins and then put them in a file.
echo "Generating a set of IDs"
awk --posix -v RS='>' '/W.H..(H){1,2}/ { print ">" $0;}' AE004437.faa | grep '^>' | awk --posix -v FS='|' '{print $4;}' > my_IDs

So if this script was called blast_AE004437.sh we could run the following to make all of those steps happen.

qsub blast_AE004437.sh

Tracking Jobs

Now that we have a job in the queue, how do we know if it is running? For that, there is a command called qstat, which provides the current state of all the jobs running or queued to run on the cluster. The following is an example of that output:

The R in the S column means a job is running and a Q means that the job is queued waiting to run. Jobs get queued for a number of reasons, the most common are:

A job scheduling run has not yet completed. Scheduling runs take place approximately every 15 seconds.

The queue is at ~75% capacity and the job is requesting a significant amount of walltime.

The queue is at 100% capacity and the job has no place it can be started.

The job is requesting specific resources, such as 8 processors, and there is no place the system is able to fit it.

The user submitting the job has reached a resource maximum for that queue and cannot start any more jobs running until other jobs have finished.

There are additional flags that can be passed to qstat to get more information about the state of the cluster, including the -u flag that will only display the status of jobs for a particular user.
Once a job has finished, it will no longer show up in this listing.

Job Results

By default, results from the jobs come out two different ways.

The system sends STDOUT and STDERR to files called <job_name>.o<job_number> and <job_name>.e<job_number>.

Any output created by your script, like the blastp.out in the example above.

For example, if you ran the example from above and got a job number of 679746, you would end up with a file called blast_AE004437.sh.o679746 and a file called blast_AE004437.sh.e679746 in the directory where you ran qsub. Additionally, if your script creates a directory using the PBS_JOBID variable, you would have a directory in your home directory called 679746.torque01.

Deleting Jobs

Sometimes you need to delete a job. You may need to do this if you accidentally submitted something that will run longer than you want, or perhaps you submitted the wrong script. To delete a job, use the qdel command. If you wanted to delete job number 679440, you would run:

qdel 679440

Please be aware that you can only delete jobs that you own.

Delete all jobs of one user:

qselect -u $USER | xargs qdel

Delete all jobs running by one user:

qselect -u $USER -s R | xargs qdel

Delete all queued jobs of one user:

qselect -u $USER -s Q | xargs qdel

Advanced Usage

There are a number of additional things you can do with qsub to take better advantage of the cluster.

To view qsub options please visit the online manual, or run the following:

man qsub

Requesting Additional Resources

Frequently, there is a need to use more than one processor or to specify some amount of memory. The qsub command has a -l flag that allows you to do just that.

Example: Requesting A Single Node with 8 Processors

Let's assume that the script we used above was multi-threaded and spins up 8 different processes to do work. If you wanted to ask for the processors required to do that, you would run the following:

qsub -l nodes=1:ppn=8 blast_AE004437.sh

This tells the system that your job needs 8 processors and it allocates them to you.

Example: Requesting 16GB of RAM for a Job

Using the same script as above, let's instead assume that this is just a monolithic process, but we know that it will need about 16GB of RAM. Below is an example of how that is done:

qsub -l mem=16gb blast_AE004437.sh

Example: Requesting 2 Weeks of Walltime for a Job

Using the same script as above, let's instead assume that it is going to run for close to 2 weeks. There are 7 days in a week and 24 hours in a day, so 2 weeks is 2 * 7 * 24 = 336 hours. Below is an example of requesting that a job be allowed to run for 336 hours.

qsub -l walltime=336:00:00 blast_AE004437.sh

Example: Requesting Specific Node(s)

The following requests 8 CPU cores and 16GB of RAM on high memory node m01 for 220 hours:
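The command itself is missing from this copy of the manual; on a Torque system, a request along these lines should match the description (the exact resource-list syntax may vary with your Torque version):

```shell
qsub -l nodes=m01:ppn=8,mem=16gb,walltime=220:00:00 blast_AE004437.sh
```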

Interactive Jobs

Sometimes, when testing, it is useful to run commands interactively instead of with a script. To do this you would run:

qsub -I

Just like scripts though, you may need additional resources. To solve this, specify resources, just like you would above:

qsub -l mem=16gb -I

Array Jobs

Many tasks in Bioinformatics need to be parallelized to be efficient. One of the ways we address this is with array jobs. An array job executes the same script a number of times depending on what arguments are passed. To specify that an array should be used, you use the -t flag. For example, if you wanted a ten-element array, you would pass -t 1-10 to qsub. You can also specify arbitrary numbers in the array. Assume for a second that array elements 3 and 5-7 failed for some unknown reason in your last run; you could specify -t 3,5-7 and re-run just those elements.
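Assuming the array-enabled script from the section below is named blast_AE004437-multi.sh, those two cases would be submitted like this:

```shell
qsub -t 1-10 blast_AE004437-multi.sh   # run array elements 1 through 10
qsub -t 3,5-7 blast_AE004437-multi.sh  # re-run only elements 3, 5, 6 and 7
```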

Below is an example that does the same thing as the basic example above, except that it spreads the workload across seven different processes. This technique is particularly useful when dealing with much larger datasets.

Prepare Dataset

The script below creates a working directory and builds out the usable dataset. The job is passed to qsub with no arguments.

#!/bin/bash
# Create a directory for us to do work in.
mkdir blast_AE004437
# Change to that new directory
cd blast_AE004437
# Copy the proteome of Halobacterium spec.
cp /srv/projects/db/ex/AE004437.faa .
# Do some basic analysis
# The echo command prints info to our output file
echo "How many predicted proteins are there?"
grep '^>' AE004437.faa --count
echo "How many proteins contain the pattern \"WxHxxH\" or \"WxHxxHH\"?"
egrep 'W.H..H{1,2}' AE004437.faa --count
# Start preparing to do a blast run
# Use awk to grab a number of proteins and then put them in a file.
echo "Generating a set of IDs"
awk --posix -v RS='>' '/W.H..(H){1,2}/ { print ">" $0;}' AE004437.faa | grep '^>' | awk --posix -v FS='|' '{print $4;}' > my_IDs
# Make the proteome blastable
echo "Making a blastable database"
formatdb -i AE004437.faa -p T -o

Analyze the Dataset

The script below will do the actual analysis. Assuming the name is blast_AE004437-multi.sh, the command to submit it would be qsub -t 1-7 blast_AE004437-multi.sh.

#!/bin/bash
# Specify the number of array runs. This means we are going to specify -t 1-7
# when calling qsub.
NUM=7
# Change to that new directory
cd blast_AE004437
# Do some math based on the number of runs we are going to do to figure out how
# many lines, and which lines should be in this run.
LINES=$(wc -l < my_IDs)
MULTIPLIER=$(( $LINES / $NUM ))
SUB=$(( $MULTIPLIER - 1 ))
END=$(( $PBS_ARRAYID * $MULTIPLIER ))
START=$(( $END - $SUB ))
# Grab the IDs that are going to be part of each blast run
awk "NR==$START,NR==$END" my_IDs > $PBS_ARRAYID.IDs
# Make blastable IDs
echo "Making a set of blastable IDs"
fastacmd -d AE004437.faa -i $PBS_ARRAYID.IDs > $PBS_ARRAYID.fasta
# Run blast
echo "Running blast"
blastall -p blastp -i $PBS_ARRAYID.fasta -d AE004437.faa -o $PBS_ARRAYID.blastp.out -e 1e-6 -v 10 -b 10
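To see how the arithmetic in the script divides the work, here is a worked example with made-up numbers: 70 IDs split across 7 array elements gives 10 lines each, so array element 3 processes lines 21-30.

```shell
NUM=7                                 # number of array elements
LINES=70                              # pretend my_IDs contains 70 lines
MULTIPLIER=$(( LINES / NUM ))         # 10 lines per array element
SUB=$(( MULTIPLIER - 1 ))             # 9
PBS_ARRAYID=3                         # pretend we are array element 3
END=$(( PBS_ARRAYID * MULTIPLIER ))   # last line: 30
START=$(( END - SUB ))                # first line: 21
echo "element $PBS_ARRAYID handles lines $START-$END"
```

Note that the integer division drops any remainder, so if the number of IDs is not evenly divisible by NUM, the last few IDs are never assigned to an array element.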

Specifying Queues

Queues provide access to additional resources or allow use of resources in different ways. To take advantage of the queues, you will need to specify the -q option with the queue name on the command line.

For example, if you would like to run a job that consumes 16GB of memory, you should submit this job to the highmem queue:

qsub -q highmem myJob.sh

Troubleshooting

If a job has not started, or has been in a queued state for a long period of time, users should try the following.
Check which nodes have available resources (CPU, memory, walltime, etc.):

showbf -S | less

Check how many processors are immediately available per walltime window on the batch queue:

showbf -f batch

Check earliest start and completion times (should not be infinity):

showstart JOBID

Check if a job is held:

showhold | grep JOBID

Check status of job and display reason for failure (if applicable):

checkjob JOBID

Data Storage

Biocluster users are able to check on their home and bigdata storage usage from the Biocluster Dashboard.

Storage Locations

Home Directories

Home directories are where you place the scripts and various things you are working on, on biocluster. This space is very limited. Please see the Quotas section above for the space that is allocated per user.

Path

/rhome/<username> (ex: /rhome/tgirke)

User Availability

All Users

Node Availability

All Nodes

Quota Responsibility

User

Big Data

Big data is an area where large amounts of storage can be made available to users. A lab purchases big data space separately from access to the cluster. This space is then made available to the lab via a shared directory and individual directories for each user.

Lab Shared Space

This directory can be accessed by the lab as a whole.

Path

/shared/<labname> (ex: /shared/girkelab)

User Availability

Labs that have purchased space.

Node Availability

All Nodes

Quota Responsibility

Lab

Individual User Space

This directory can be accessed by specific lab members.

Path

/bigdata/<username> (ex: /bigdata/tgirke)

User Availability

Labs that have purchased space.

Node Availability

All Nodes

Quota Responsibility

Lab

Non-Persistent Space

Frequently, there is a need to output a significant amount of intermediate data during a job, to access a dataset from a faster medium than bigdata or the home directories, or to write out lock files. These tasks are well suited to non-persistent spaces. Below are the filesystems available on Biocluster.

Memory Backed Space

This type of space takes away from physical memory but allows extremely fast access to the files located on it. You will need to factor in the space you are using in RAM as well. For example, if you have a dataset that is 1G in size and use this space, it will take 1G of RAM.
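For example, a job could stage a dataset in /dev/shm, run its analysis against the RAM-backed copy, and clean up when done (the file name here reuses the example dataset from earlier sections):

```shell
cp AE004437.faa /dev/shm/       # stage the dataset in RAM (counts against RAM 1:1)
# ... run your analysis against /dev/shm/AE004437.faa here ...
rm /dev/shm/AE004437.faa        # free the memory when finished
```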

Path

/dev/shm

User Availability

All Users

Node Availability

All Nodes

Quota Responsibility

N/A

Temporary Space

This is the standard space available on all Linux systems. Please be aware that it is limited to the amount of free disk space on the node you are running on.

Path

/tmp

User Availability

All Users

Node Availability

All Nodes

Quota Responsibility

N/A

SSD Backed Space

This space is much faster than the standard temporary space, but slower than memory-backed space.

Path

/scratch

User Availability

All Users

Node Availability

High Mem Nodes

Quota Responsibility

N/A

Sharing data with other users

It is useful to share data and results with other users on the cluster, and we encourage collaboration. The easiest way to share a file is to place it in a location that both users can access; the second user can then simply copy it to a location of their choice. However, this requires that the file permissions permit the second user to read the file.

Basic file permissions on Linux and other Unix-like systems are composed of three groups: owner, group, and other. These represent the permissions for the user who owns the file, all members of the group that owns the file, and everyone else, respectively. Each group has 3 permissions: read, write, and execute, represented as r, w, and x. For example, the following file is owned by the user 'bragr' (with read, write, and execute) and the group 'operations' (with read and execute), and everyone else cannot access it.

bragr@biocluster:~$ ls -l randomFileName

-rwxr-x--- 1 bragr operations 1.6K Nov 19 12:32 randomFileName

If you wanted to share this file with someone outside the 'operations' group, read permissions must be added to the file for 'other'.

Set Default Permissions

In Linux, it is possible to set the default file permissions for new files. This is useful if you are collaborating on a project or frequently share files and do not want to constantly adjust permissions. The command responsible for this is called 'umask'. You should first check what your default permissions currently are by running 'umask -S'.

bragr@biocluster:~$ umask -S

u=rwx,g=rx,o=rx

To set your default permissions, simply run umask with the correct options. Please note, that this does not change permissions on any existing files, only new files created after you update the default permissions. For instance, if you wanted to set your default permissions to you having full control, your group being able to read and execute your files, and no one else to have access, you would run:

bragr@biocluster:~$ umask u=rwx,g=rx,o=

It is also important to note that these settings only affect your current session. If you log out and log back in, these settings will be reset. To make your changes permanent you need to add them to your '.bashrc' file, which is a hidden file in your home directory (if you do not have a '.bashrc' file, create an empty file called '.bashrc' in your home directory). Adding umask to your .bashrc file is as simple as adding your umask command (such as 'umask u=rwx,g=rx,o=r') to the end of the file. Then simply log out and back in for the changes to take effect. You can double-check that the settings have taken effect by running 'umask -S'.
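For example, appending the umask line to the end of your .bashrc can be done with a single command (this uses the umask value from the text above):

```shell
echo 'umask u=rwx,g=rx,o=r' >> ~/.bashrc   # persist the default permissions
tail -n 1 ~/.bashrc                        # verify the line was added
```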

Copying large folders to and from Biocluster

To perform over-the-network transfers, it is always recommended that you run the rsync command from your local machine (laptop or workstation).

On your computer open the Terminal and run:

rsync -ai FOLDER_A/ biocluster.ucr.edu:FOLDER_A/

Or:

rsync -ai biocluster.ucr.edu:FOLDER_B/ FOLDER_B/

Rsync runs over SSH and will ask you for your Biocluster password, just as ssh or scp would.

If your connection breaks, rsync can pick up where it left off: simply run the same command again.

Rsync is not available natively on Windows; only macOS and Linux ship it out of the box.

Always put a trailing slash after both folder names (e.g. FOLDER_B/). Failing to do so will nest the folders every time you try to resume: without the slash you end up with a second FOLDER_B inside the first, i.e. FOLDER_B/FOLDER_B/.

Rsync does not move but only copies.

man rsync

Copying large folders on Biocluster between Directories

Rsync does not move but only copies. You will need to delete the source yourself once you confirm that everything has been transferred.

This is the rare case where you would run rsync on Biocluster rather than on your own computer (laptop or workstation). The format in this case is:

rsync -ai FOLDER_A/ X/FOLDER_A/

where X is a different folder (e.g. a Bigdata folder)

Once the rsync command is done, run it again. The second run will be short and is just a check: if there was no output, nothing changed, and it is safe to delete the original location.
Specifically, running rsync a second time ensures that everything has been transferred correctly. The -i (--itemize-changes) option asks rsync to report (output) all the changes that occur on the filesystem during the sync. No output = no changes = the folder has been transferred safely.

All the bullets in the section above (Copying large folders to and from Biocluster) apply to this section.

Copying large folders between Biocluster and other servers

This is a very rare case where you would run rsync on Biocluster rather than on your own computer (laptop or workstation). The format in this case is:

rsync -ai FOLDER_A/ server2.xyz.edu:FOLDER_A/

where server2.xyz.edu is a placeholder for a different server that accepts SSH connections.

All the bullets in the sections above (Copying large folders to and from Biocluster) apply to this section.

Home Directories

Home directories are where you start each session on biocluster and where your jobs start when running on the cluster. They are automatically mounted when you log in and can be found at /rhome/<your username>.

Please remember: the default storage space quota per user account is 20 GB. A buffer of 10 GB is there to help with temporary overages but should not be used for permanent storage.

To get the current usage, run the following command in your home directory:

du -sh .

To calculate the sizes of each separate folder in your home, run:

du -sch ~/*

This will take some time to complete, please be patient.

For more information on your home directory, please see the Orientation section in the Linux Basics manual.

Backups

Biocluster has backups but you may want to periodically make copies of your critical data to your own storage device.
Please remember, Biocluster is a
production system for research computations with a very expensive
high-performance SAN storage infrastructure. It is not a data archiving
system.

Usually, we store the most recent release and 2-3 previous releases of each database. This way time consuming projects can use the same database version throughout their lifetime without always updating to the latest releases.
Requests for additional databases should be sent to support@biocluster.ucr.edu.

Parallelization Software

Introduction

The low-latency interconnect provides speeds that average 30 microseconds per message during high loads. This interconnect provides breakthrough performance for computational jobs that can run
in parallel on multiple compute nodes but require frequent node-to-node
communication.

MVAPICH1
Link: http://mvapich.cse.ohio-state.edu
Note: MVAPICH1 runs over the InfiniBand RDMA hardware, so it is potentially faster than OpenMPI.
Include directory: /opt/mvapich1-1.4.1/include
Compilers: /opt/mvapich1-1.4.1/bin/{mpicc,mpicxx,mpif77}
Launcher: /opt/mvapich1-1.4.1/bin/mpirun_rsh

Unavailable and untested examples: HMMER, ABySS Parallel Assembler

Monitoring the Load History

Several utilities are available for obtaining information about the history of the cluster load by users, labs and individual nodes.

The R Function plotCPUhours
When this function is executed from the R console on the biocluster head node, it will return an overview of the CPU hour history by users and PIs, as well as the average load of the entire cluster.