Identify and visualize expressed transcripts in RNA-seq data

Summary

Which genes are highly expressed, based on my RNA-seq data? Are any of the highly expressed genes also differentially expressed?

This recipe provides an outline of one method to identify and visualize genes and isoforms that are highly expressed in RNA-seq data. In particular, this recipe utilizes an analysis pipeline, allowing a user to chain together multiple analysis steps into one workflow that can be run in one step. An example use of this recipe is a case where an investigator wants to process several datasets in the same way, in which case the pipeline will allow the investigator to re-use the same modules and parameters, over and over again.

Given a set of raw RNA-seq reads, the goal is to align the reads to a reference genome, estimate expression abundance levels for reference genes and isoforms, filter out low-expressed genes and isoforms, and visualize the read alignments and their expression levels. In particular, this recipe uses the UCSC Table Browser to retrieve a reference genome to align RNA-seq reads against. We also uses several modules in GenePattern to align the reads against the reference genome, and to identify differentially expressed genes when comparing two conditions. Finally, we use IGV to visualize the differentially expressed genes.

Why differential expression analysis? We assume that most genes are not expressed all the time, but rather are expressed in specific tissues, stages of development, or under certain conditions. Genes which are expressed in one condition, such as cancerous tissue, are said to be differentially expressed when compared to normal conditions. To identify which genes change in response to specific conditions (e.g. cancer), we must filter or process the dataset to remove genes which are not informative.

Inputs

To complete this recipe, we will need RNA-seq reads, and reference gene annotations to align the reads against. In this example, we examine human RNA-seq data. The RNA-seq reads are in FASTQ/FASTA > format, and can be gzipped. We will need the following datasets, which can be downloaded from the GenomeSpace Public folder.

In this step, we use the UCSC Table Browser to retrieve reference gene annotations corresponding to the reference genome for our example data. If you are using your own data, you may already have a reference gene annotation file, or you may need to search for one matching your reference genome here.

Launch UCSC Table Browser from GenomeSpace by clicking on the icon.

Download the reference gene annotations from UCSC. For the example dataset, enter the following parameters:

clade: Mammal

genome: Human

assembly: Feb. 2009 (GRCh37/hg19)

group: Genes and Gene Predictions

track: UCSC Genes

region: genome

output format: GTF - gene transfer format

Send output to: Select the GenomeSpace checkbox.

output file: Give the output a name. For the example data, we name it UCSC_hg19.gtf.

Click get output to retrieve the file. This will take you to a new page which loads the file to GenomeSpace. The output file should appear in your GenomeSpace home directory.

In this recipe, we want to perform multiple analyses using GenePattern tools, specifically TopHat and Cufflinks. We will use TopHat in GenePattern to align the RNA-seq raw reads to a reference genome. Next, we will use Cufflinks to assemble transcripts and estimate the expression levels. We can combine these modules into a pipeline to perform the analysis in an efficient manner. Once saved, a pipeline may be reused or re-purposed for subsequent analyses.

NOTE: If you have not yet associated your GenomeSpace account with your GenePattern account, you will be asked to do so. If you do not yet have a GenePattern account, you can automatically generate a new account that will be associated with your GenomeSpace account.

Creating a new pipeline

Launch GenePattern from GenomeSpace by clicking on the icon.

Navigate to the Modules & Pipelines tab on the GenePattern toolbar. Do not click; instead, hover your cursor over the tab and click New Pipeline in the drop-down menu to load the Pipeline Designer interface.

OPTIONAL: In the right-hand information panel, change the Pipeline Name to, e.g., RNA_Exp. Other descriptive information (Description, Author, etc.) may be added by the user as desired.

Adding TopHat to the pipeline

Using either the Search Modules field or the Browse Modules button, select the TopHat module. A purple box titled TopHat will appear in the designer space while the right-hand information panel will change to display TopHat-specific parameters.

Adding Picard.SortSam to the pipeline

Using either the Search Modules field or the Browse Modules button, select the Picard.SortSam module. A purple box titled Picard.SortSam will appear in the designer space while the right-hand information panel will change to display Picard.SortSam-specific parameters.

In the output.format context menu, choose BAM.

Connect the TopHat and Picard.SortSam modules by clicking on the arrow button in the bottom-right corner of the TopHat box (output) and dragging to the arrow button at the top-left corner of the Picard.SortSam box next to input.file. A purple arrow should appear, connecting the two modules.

Click on the context menu at the bottom of the Picard.SortSam output box and choose bam.

Adding Cufflinks to the pipeline

Using either the Search Modules field or the Browse Modules button, select the Cufflinks module. A purple box titled Cufflinks will appear in the designer space while the right-hand information panel will change to display Cufflinks-specific parameters.

Select GTF.file to have the pipeline prompt the user to provide this file when run.

Connect the Picard.SortSam and Cufflinks modules by clicking on the arrow button in the bottom-right corner of the Picard.SortSam box (output) and dragging to the arrow button at the top-left corner of the Cufflinks box next to input.file. A purple arrow should appear, connecting the two modules.

Click on the context menu at the bottom of the Cufflinks output box and choose genes.fpkm_tracking.

Adding Fpkm_trackingToGct to the pipeline

Using either the Search Modules field or the Browse Modules button, select the Fpkm_trackingToGct module. A purple box titled Fpkm_trackingToGct will appear in the designer space while the right-hand information panel will change to display Fpkm_trackingToGct-specific parameters.

Connect the Cufflinks and Fpkm_trackingToGct modules by clicking on the arrow button in the bottom-right corner of the Cufflinks box (output) and dragging to the arrow button at the top-left corner of the Fpkm_trackingToGct box next to input.file. A purple arrow should appear, connecting the two buttons.

We will use IGV to visualize the differential expression and abundance levels in the genes.gct file, and check the quality of the mapped alignments from the BAM files.

Loading reference data into IGV.

Launch IGV from GenomeSpace by clicking on the context menu and choosing Launch, prompting the download of a .jnlp file. Double-click the file to launch IGV.

To load a pre-built reference dataset into IGV, navigate to File > Load from Server....

Select the 'UCSC Genes' reference dataset by expanding the following drop-down menus: Annotations > Genes. Choose the UCSC Genes checkbox.NOTE: Check only the 'UCSC Genes' box. Checking the 'Genes' or 'Annotations' checkboxes will check all boxes under that heading, i.e., it will load additional datasets that are not needed for this recipe

Click OK to load the dataset.

Loading RNA-seq data into IGV.

Load a reference genome into IGV by using the IGV genome selection drop-down menu. Make sure the reference genome matches the samples you are using, e.g., H. sapiens samples should use the Human hg19 genome, as in this example.

Use the GenomeSpace tab to import files by clicking on the Load File from GenomeSpace... option.

To load files, navigate to the directory in GenomeSpace which contains the file, then click the file to select it, and then click Open. Load the following files into IGV:NOTE: each file is loaded separately

To visualize the FPKM counts for specific genes, we use the genes.gct file and the UCSC gene annotation information.

From the GenomeSpace homepage, navigate to and right-click on the genes.gct file. Select Preview from the context menu.

In the preview window, copy a UCSC tracking id (under tracking_id). In this example, we use uc001aal.

Return to IGV and paste this tracking id into the search bar at the top of the window. Hit enter or click the Go button. A visual representation of the mapped reads will appear.

Display the FPKM values of the transcript by hovering the mouse over FPKM bar.

Results Interpretation

This is an example interpretation of the results from this recipe. First, we used GenePattern to create and run a pipeline which aligned raw reads from RNA-seq to a reference genome (pre-built in Bowtie) using TopHat, then identified differentially expressed genes using Cufflinks. We then used IGV to visualize the FPKM counts of the aligned reads for an example gene, uc001aal.

The wt_FPKM track is red in the region of the uc001aal gene, which indicates a high FPKM count for that gene. If we examine the *.accepted_hits.bam track in IGV, we can see that we have several reads aligning to this particular gene in the genome. However, the results in this example are not necessarily significant and are only a simple representation of possible results.