Find differentially expressed genes in RNA-seq data

Summary

This recipe provides one method to identify differentially expressed genes in RNA-seq read data. In particular, this recipe uses the UCSC Table Browser to retrieve a reference genome to align RNA-seq reads against. We also used several modules in GenePattern to align the reads against the reference genome, and to identify differentially expressed genes when comparing two conditions. Finally, we use IGV to visualize the differentially expressed genes.

Why differential expression analysis? We assume that most genes are not expressed all the time, but rather are expressed in specific tissues, stages of development, or under certain conditions. Genes which are expressed in one condition, such as cancerous tissue, are said to be differentially expressed when compared to normal conditions. To identify which genes change in response to specific conditions (e.g. cancer), we must filter or process the dataset to remove genes which are not informative.

Inputs

To complete this recipe, we will need RNA-seq reads, and reference gene annotations to align the reads against. In this example, we are validating a Saccharomyces cerevisiae knockout of the SFP1/YLR403W gene using paired 3’ Digital Gene Expression (DGE) RNA-seq reads. The RNA-seq reads are in FASTQ/FASTA format, and can be gzipped. We will need the following datasets, which can be downloaded from GenomeSpace's Public folder:

In this step, we use the UCSC Table Browser to retrieve reference gene annotations corresponding to the reference genome for our example data. If you are using your own data, you may already have a reference gene annotation file, or you may need to search for one matching your reference genome here.

Launch UCSC Table Browser from GenomeSpace by clicking on the icon.

Download the reference gene annotations from UCSC. For the example dataset, enter the following parameters:

clade: Other

genome: S. cerevisiae

assembly: Apr. 2011 (SacCer_Apr2011/sacCer3)

group: Genes and Gene Predictions

track: SGD Genes

table: sgdGene

region: genome

output format: GTF - gene transfer format

Send output to: Select the GenomeSpace checkbox.

output file: Give the output a name. For the example data, we name it sacCer3_SGD.gtf.

Click get output to retrieve the file. This will take you to a new page which loads the file to GenomeSpace. The output file should appear in your GenomeSpace home directory. You may click and drag the file to place it into your directory.

NOTE: If you have not yet associated your GenomeSpace account with your GenePattern account, you will be asked to do so. If you do not yet have a GenePattern account, you can automatically generate a new account that will be associated with your GenomeSpace account.

Click on the file (e.g., WT.fastq.gz) in GenomeSpace, then use the GenePattern context menu and click Launch on File

Click on the file (e.g., WT.fastq.gz) in GenomeSpace, then drag it to the GenePattern icon to launch

Open GenePattern from GenomeSpace, navigate to the GenomeSpace tab, then navigate to your personal directory.

We will use TopHat in GenePattern to align the RNA-seq raw reads to a reference genome. In this recipe, we specifically want to identify existing splice junctions, and not novel splice junctions. This module uses the FASTA/FASTQ files (e.g., WT.fastq.gz and SFP1KO.fastq.gz), the reference gene annotations (e.g., sacCer3.gtf) and a reference genome with an index (pre-built in GenePattern). If you are using your own data, you can load your own reference genome instead.

Change to the Modules tab, and search for "TopHat". Change the following parameters:

bowtie index: choose a genome from the drop-down menu. In this example we use Saccharomyces_cerevisiae_sacCer3_UCSC

reads pair 1: load the raw RNA-seq data, e.g., SFP1KO.fastq.gz. To do this, use the GenomeSpace tab to navigate to the directory containing your FASTQ file, then drag the file to the input box.

GTF file: load the GTF file. To do this, first click the Upload your own file button. This creates a new area to drag-and-drop files. Next, navigate to the GenomeSpace tab, and navigate to the folder containing the GTF file. Load the file into the GTF file parameter by clicking on the file and choosing Send to GTF, or by dragging the file to the GTF file input box.

find novel junctions: no

output prefix: choose a filename prefix, e.g., sfp1ko.reads

Click Run to run TopHat. This will generate several files; in this recipe, we will use the *.accepted_hits.bam fileNOTE: Each file of reads must be aligned separately; i.e., run TopHat on WT.fastq.gz first, then again on another dataset, e.g., SFP1KO.fastq.gz. It may take several hours to run TopHat, depending on the size of the files and whether you are using the GenePattern public server.

To run another FASTA/FASTQ file, use the following steps:

Click on the TopHat job number to bring up the context menu for jobs.

Click on Reload job, which will reload the job with the same parameters.

Remove the original FASTA/FASTQ reads file from the job, by clicking on the next to the file in the reads pair 1 parameter.

Using the GenomeSpace tab, navigate to the second FASTA/FASTQ file (e.g., WT.fastq.gz). Load it into the reads pair 1 parameter by dragging the file to the input box.

Remember to change the output prefix to a new filename, e.g., wt.reads

We will use the Cufflinks.cuffdiff module to test differential expression between the two samples (e.g., WT.fastq.gz and SFP1KO.fastq.gz). This module uses the aligned data output from TopHat, and the reference gene annotation GTF file.

Then, we convert the FPKM file to a GCT file using the Fpkm_trackingToGct module in GenePattern.

Change to the Modules tab, and search for "Cuffdiff". Once the module is loaded, change the following parameters:

aligned files: Create a new condition for the first FASTA/FASTQ file, e.g., wt. Load the appropriate file (e.g., wt.reads.accepted_hits.bam) into the aligned files parameter by clicking and dragging the file into the input box.

Click the Add Another Condition button.

aligned files: Create a second condition for the first FASTA/FASTQ file, e.g., sfp1ko. Load the appropriate file (e.g., sfp1ko.accepted_hits.bam) into the aligned files parameter by clicking and dragging the file into the input box.

GTF file: Click on the Upload your own file button, then upload GCT file into the Cufflinks.cuffdiff module using the GenomeSpace tab. Navigate to the directory containing the GCT file (e.g. sacCer3_SGD.gct). Load it into the Cufflinks.cuffdiff module by dragging the file to the GTF file parameter.

Click Run to submit your job. This will generate several files; in this recipe, we will use the genes.fpkm_tracking file.

Change to the Modules tab, and search for "Fpkm_trackingToGct".

Once the module is loaded, change the following parameters:

input file: load the genes.fpkm_tracking file. To do this, navigate to the GenomeSpace tab, and navigating to the folder containing the file. Load the file into the input file parameter by clicking on the file and choosing Send to Input, or by dragging the file to the input file input box.

We will use the Picard.SortSam module to prepare the aligned read data for viewing in IGV. In order to visualize BAM alignment files in IGV, we need a companion index file (BAI) which is located in the same directory. The Picard.SortSam module will sort the file and generate the index. This module uses the *.accepted_hits.bam files.

Change to the Modules tab, and search for "Picard.SortSam".

Once the module is loaded, change the following parameters:

input file: load one of the TopHat aligned reads outputs into the parameter, e.g. wt.reads.accepted_hits.bam. To do this, use the Jobs tab, navigate to one of the previous TopHat jobs, and load the file into the Picard.SortSam module by dragging the file into the input file parameter.

output format: BAM

Click Run to submit your job. This will generate a new sorted BAM file, as well as an index BAI file.

To run another file, use the following steps:

Click on the Picard.SortSam job number to bring up the context menu for jobs.

Click on Reload job, which will reload the job with the same parameters.

Remove the original file from the job, by clicking on the next to the file in the input file parameter.

Using the Jobs tab, navigate to other TopHat job, then load the other aligned reads file, e.g. sfp1ko.reads.accepted_hits.bam, into the input file parameter by dragging the file to the input box.

We will use IGV to visualize the differential expression and abundance levels in the genes.gct file, and check the quality of the mapped alignments from the BAM files. In this recipe, we will also validate that the gene of interest (SFP1) was knocked-out in the SFP1KO mutant strain of S. cerevisiae.

Launch IGV from GenomeSpace by clicking on the context menu and choosing Launch, prompting the download of a .jnlp file. Double-click the file to launch IGV.

Load a reference genome into IGV by using the IGV genome selection drop-down menu. Make sure the reference genome matches the samples you are using, e.g., S. cerevisiae samples should use the sacCer3 genome, as in this example.

Use the GenomeSpace tab to import files by clicking on the Load File from GenomeSpace... option.

To load files, navigate to the directory in GenomeSpace which contains the file, then click the file to select it, and then click Open. Load the following files into IGV:NOTE: each file is loaded separately

wt.reads.accepted_hits.sorted.bam

sfp1ko.reads.accepted_hits.sorted.bam

genes.gct

Now we can visualize the knock-out gene in our datasets. If you are using the example data, type "YLR403W" into the search box, and click Go or hit enter. Note that there are no reads aligned to the gene in the SFP1KO data and zero expression visible in the sfp1ko_FPKM track. There are reads aligned in the WT sample (although mostly at the 3' end, as expected with 3' DGE data) and a high expression level is visible in the wt_FPKM track.

Results Interpretation

This is an example interpretation of the results from this recipe. First, we used TopHat in GenePattern to align raw reads from RNA-seq to a reference genome (pre-built in Bowtie). After aligning both a wild-type and mutant strain of S. cerevisiae, we used Cufflinks to determine the genes which were differentially expressed when comparing the two phenotypes. We then use IGV to visualize the difference between these two datasets; in particular, we validate that the SFP1KO mutant strain of S. cerevisiae lacks the SFP1 gene.

The wt_FPKM track is red in the region of the SFP1 gene, whereas the sfp1ko_FPKM track is white. This is an indicator of gene expression, and shows that the SFP1 gene is not being transcribed in the SFP1KO mutant strain. We can see an additional indicator of this when we compare the wt.readpairs.accepted_hits.sorted.bam file to the sfp1ko.readpairs.accepted_hits.sorted.bam file. These are the tracks which have red and blue boxes to indicate where reads aligned against the reference genome. We can see there are several reads aligned in the WT strain to the SFP1 gene, but the area is blank in the SPF1KO mutant, showing that no reads from that gene aligned against the reference genome. These results confirm that the SFP1 gene was knocked-out in the SFP1KO mutant strain of S. cerevisiae. However, the results in this example are not necessarily significant and are only a simple representation of possible results.

Is GenePattern working ok??? I'm trying to use it but I keep getting an error message...
"GenePattern Server error preparing job 1314526 for execution.org.genepattern.server.executor.CommandExecutorException: Error adding job to queue, gpJobNo=1314526, DrmaaException=denied: parallel environment "smp" does not exist
job rejected: the requested parallel environment "smp" does not exist"
Can someone help me???

Posted by ted on June 30, 2016 16:44

I forwarded your comment to the GenePattern team at gp-help@broadinstitute.org but they could not find your job on the public server. Could you contact them directly to see if they can help sort this out?