Project description

Cirrus-NGS

Introduction

Bioinformatic analysis of large-scale next-generation sequencing (NGS) data requires significant compute resources. While cloud computing makes such processing power available on demand, administering dynamic compute clusters is a daunting task for most working biologists. To address this pain point, the Center for Computational Biology and Bioinformatics at the University of California, San Diego, has developed Cirrus-NGS, a turn-key solution for common NGS analyses using Amazon Web Services (AWS).

Cirrus-NGS provides primary analysis pipelines for RNA-Seq, miRNA-Seq, ChIP-Seq, and whole-genome/whole-exome sequencing data. Cirrus users need not have any bioinformatics tools for these pipelines installed on their local machine, as all computation is performed on AWS clusters and all results are uploaded to AWS's S3 remote storage. The clusters are created dynamically by Cirrus-NGS from a custom machine image with all the necessary software pre-installed. Users manage clusters and run pipelines from within a web browser, using flexible but lightweight Jupyter notebooks.

Installing pip

Installing conda

While Cirrus-NGS itself is provided through GitHub, its supporting libraries are best installed through conda, a cross-platform package manager. If you do not already have conda installed, download the appropriate Python-3.6-based installer from the links below:

Note the URL at which the server is running. It is usually http://localhost:8888, but if your system already has something running on port 8888, it may default to something else (e.g., in the example above, it is http://localhost:8889).
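If you are unsure which port the server landed on, you can check whether a given port is already occupied. The helper below is a hypothetical troubleshooting aid, not part of Cirrus-NGS:

```python
import socket

def port_in_use(port, host="localhost"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) == 0

# Jupyter tries port 8888 first and falls back to the next free port.
for port in (8888, 8889, 8890):
    print(port, "in use" if port_in_use(port) else "free")
```

If 8888 shows as "in use" before you start the server, expect Jupyter to pick the next port up.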

Do not shut down the Jupyter notebook server or close the terminal window in which it is running until you are done using Cirrus-NGS, as it must be running in order for the Cirrus-NGS notebooks to function. When you are done using Cirrus and ready to shut down the notebook server, simply type Control-C in the terminal window where it is running and follow the prompts.

Running Cirrus-NGS

General Overview

A user begins by defining a compute cluster. After this, s/he decides upon the pipeline to run, which determines the exact inputs required, and then creates (in a spreadsheet program or text editor) a design file that specifies the details of the input sequencing data. S/he starts the Jupyter notebook for the chosen pipeline, inputs the path to the design file and other required inputs (such as AWS credentials), then executes the notebook code to start the pipeline processing.

The notebook creates a yaml file summarizing all of the user input and transfers that file to the cluster, where it directs cluster-native code to sequentially execute the analysis steps specified by the user in a distributed fashion. Upon completion of each step, the output is uploaded to the user's specified S3 output bucket, from which it can be accessed at any point. During processing, the user can check the status of the pipeline run from within the notebook.
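As a rough illustration only (every field name below is hypothetical and does not reflect Cirrus-NGS's actual schema), the generated yaml summary might resemble:

project_name: myproject
pipeline: RNASeq
workflow: star_htseq
design_file: /path/to/design.txt
output_bucket: s3://my-output-bucket/results

You never write this file yourself; the notebook generates and transfers it for you.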

A Note About Terminology: Throughout, the term "pipeline" is used to describe functionality for analyzing a general type of sequencing (such as RNA-Seq or whole-genome sequencing). Each pipeline can have multiple, more granular "workflows" that support different approaches to processing the same type of sequencing data; for instance, the RNA-Seq pipeline contains four different workflows (star_gatk, star_htseq, star_rsem, and kallisto).

Gathering Required Inputs

The following pieces of information are required by all functionality in Cirrus-NGS:

If you encounter a Permission denied (publickey) error, note that the permissions on your key (.pem) file must be set so that it is not public, e.g. by running chmod 0400 <mypemfilename>.pem in the directory where the .pem file is located.
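For example, the permission fix can be applied as follows (mykey.pem is a stand-in name; substitute your own key file):

```shell
# Stand-in key file so this snippet runs as-is; in practice you
# already have your .pem file and should skip this line.
touch mykey.pem

# Restrict the key so that only its owner can read it
chmod 0400 mykey.pem

# Verify: the permissions column should read -r--------
ls -l mykey.pem
```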

Additional inputs are required for specific actions, and are discussed in the relevant sections below.

Defining a Cluster

Run the following steps to define a compute cluster. You may then use that cluster for all Cirrus-NGS work going forward without having to rerun these steps (although you may also rerun them to create additional clusters, if desired).

Gather the required settings information; in addition to the values described above in the Gathering Required Inputs section, you will need:

Generally speaking, the master node will need to run on an instance with at least 2 vCPUs and at least 8 GiB of memory (such as an m4.large).

The instance type to use for the compute nodes of your cluster

These nodes are dynamically added to and removed from the cluster to match the computational load the cluster is asked to handle.

Generally speaking, the compute nodes will need to have CPU and memory capacity greater than or equal to that of the master node.

These nodes are run with AWS's discounted "spot" pricing (see below for details).

The spot price for the compute nodes of your cluster

If AWS has left-over capacity of a particular instance type after filling all on-demand requests for that type, it sells the remaining capacity at a dynamically-set lower price; the "spot price" you specify is the maximum hourly cost, in US dollars, that you are willing to pay for each such instance (see https://aws.amazon.com/ec2/faqs/ for details).

The going spot rate for a particular instance type in a particular region is constantly changing based on demand. Visit https://aws.amazon.com/ec2/spot/pricing/ (remember to select the correct region, as described above!) to view the current spot prices for instances available in your region.

If the going spot rate for your chosen compute instance type goes above your specified spot price, you will not be charged more, but your compute nodes will shut down without warning as they are re-allocated to another user. Therefore, it is wise to choose a spot price that represents the maximum you are willing to pay, rather than under-bidding.
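To budget for a cluster, a simple worst-case estimate multiplies your bid by the node count. The helper below is a hypothetical illustration, not part of Cirrus-NGS, and the ~$0.10/hr m4.large on-demand figure is only a ballpark (actual rates vary by region and over time):

```python
def max_hourly_spend(spot_bid_usd, num_compute_nodes, master_on_demand_usd):
    """Worst-case hourly cost: every compute node billed at your full
    spot bid, plus the master node at its fixed on-demand rate."""
    return spot_bid_usd * num_compute_nodes + master_on_demand_usd

# e.g. 10 compute nodes bid at $0.20/hr plus an m4.large master at ~$0.10/hr
print(round(max_hourly_spend(0.20, 10, 0.10), 2))  # prints 2.1
```

In practice you will usually pay less than this, since spot instances are billed at the going rate rather than at your bid.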

The number of compute nodes in your cluster when it is first started

If you do not have a good idea of how many you will need, it is fine to use zero by default.

The size in GiB of the virtual storage disk for the cluster

The VPC (Virtual Private Cloud) id for your AWS account

If you do not know the VPC id, you can find it by going part-way through the process of launching an AWS compute instance. Follow the instructions at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/launching-instance.html up until step 6 (it does not matter what values you select during these steps, as you won't need to actually launch the instance). Step 6 brings you to the "Configure Instance Details" screen, on which you can find your VPC id, as well as your subnet ids (see next item):

The subnet id for your AWS account for the region in which you wish to create the cluster (see above for how to find this value)

In your browser, visit the address at which the Jupyter notebook server is running

Running a Pipeline

The design file is a tab-separated text file with either two or three columns (depending on the pipeline chosen) that specifies the names of the sequence files to process and the necessary metadata describing them.

Building a Design File

The design file is the primary user input to Cirrus-NGS. It is a tab-separated text file that specifies the names of the sequence files to process and the necessary metadata describing them. For the RNA-Seq and miRNA-Seq pipelines, it contains two columns, while for the ChIP-Seq and whole-exome/whole-genome sequencing pipelines, it contains three columns (of which the first two are the same as in the two-column format). The design file has no header line.

Two-column format

In the two-column format, the first column is the filename of the sample (with extension, e.g. .fastq.gz), while the second column is the name of the group associated with that sample. (Group names are up to the user, but they are generally set to experimental conditions; for example, all control samples could be given the group name "control".)

For example, a two-column design file for two single-end-sequenced samples named mysample1 and mysample2 might look like:

mysample1.fastq.gz groupA
mysample2.fastq.gz groupB

If the sequencing data is paired-end, the first column includes the name of the forward sequencing file, followed by a comma, followed by the name of the reverse sequencing file (note that there must not be any spaces between these two file names--only a comma!). An example two-column design file for two paired-end-sequenced samples named mysample1 and mysample2 might look like:
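Assuming the conventional _R1/_R2 suffixes for forward and reverse read files (hypothetical names; substitute your own filenames), such a file might contain:

mysample1_R1.fastq.gz,mysample1_R2.fastq.gz groupA
mysample2_R1.fastq.gz,mysample2_R2.fastq.gz groupB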

Three-column format

The three-column format has the same first two columns as the two-column format, but adds a third column containing a pipeline-specific identifier that differentiates sample types for the ChIP-Seq and whole-genome/whole-exome sequencing pipelines.

For the ChIP-Seq pipeline, each sample is identified as either Chip or Input.

For the whole-genome/whole-exome sequencing pipeline, each sample is identified as either Tumor or Normal.

The following constraints apply to three-column design files:

The sample type identifiers are case-sensitive.
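The rules above can be sketched as a quick pre-flight check. This is a hypothetical helper, not part of Cirrus-NGS; it assumes the headerless, tab-separated format described in this section:

```python
# Allowed third-column identifiers per pipeline (case-sensitive)
VALID_TYPES = {
    "chipseq": {"Chip", "Input"},
    "wgs_wes": {"Tumor", "Normal"},
}

def validate_design(lines, pipeline):
    """Check headerless, tab-separated three-column design rows.
    Returns a list of (line_number, problem) tuples; empty means OK."""
    problems = []
    for i, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            problems.append((i, "expected 3 tab-separated columns"))
            continue
        filenames, group, sample_type = fields
        if " " in filenames:
            problems.append((i, "no spaces allowed in the filename column"))
        if sample_type not in VALID_TYPES[pipeline]:
            problems.append((i, f"bad sample type {sample_type!r}"))
    return problems

rows = [
    "mysample1.fastq.gz\tgroupA\tChip",
    "mysample2.fastq.gz\tgroupB\tinput",  # wrong case: identifiers are case-sensitive
]
print(validate_design(rows, "chipseq"))  # → [(2, "bad sample type 'input'")]
```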

For example, a three-column design file for two paired-end-sequenced ChIP-Seq samples named mysample1 and mysample2 might look like:
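Again assuming hypothetical _R1/_R2 suffixes for the paired read files (substitute your own filenames):

mysample1_R1.fastq.gz,mysample1_R2.fastq.gz groupA Chip
mysample2_R1.fastq.gz,mysample2_R2.fastq.gz groupB Input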