Normalization and Differential Expression in RNA-Seq Data

Introduction

What is edgeR?

edgeR is a package written for [R] that performs differential tag/gene expression using count data under a negative binomial model. The methods used in edgeR do NOT support FPKM, RPKM or other types of data that are not counts.
You can obtain more information about edgeR from the bioconductor website. Here you’ll find a user’s guide/vignette with detailed examples, a reference manual for all edgeR functions, and the [R] code for installing the package. A useful discussion on a forum about edgeR can be found on a SeqAnswers Thread .

Installing edgeR

With an internet connection, edgeR can be easily installed by running the following commands in [R]

source("http://www.bioconductor.org/biocLite.R")
biocLite("edgeR")

Make sure you have the latest version of R (2.15.3), or some of the initial commands will not work.

The Data

Usually, a RNA-Seq data analysis starts with a set of FASTQ files which contain information on both the sequence and the quality of the short reads.

There are several tools to align the reads to the reference genome (e.g. Bowtie, TopHat, ...).
You will see in the next tutorial how to align the reads to a reference genome using Tophat. For now, assume we already performed this step and we produced a table of read counts for each gene.

We will consider an example based on the data analyzed in [1]. The Sherlock Lab in the Department of Genetics at Stanford sequenced 10 strains of Saccharomyces Cerevisiae grown in three media, namely YPD, Delft and Glycerol each with 3-4 biological replicates.
The samples were sequenced using Illumina’s Genome Analyzer II yielding 36 bp-long single-end reads. Reads were mapped to the reference genome (SGD release 64) using Bowtie, considering only unique mapping and allowing up to two mismatches.
The read count for a given gene is defined as the number of reads with 5’-end falling within the corresponding region. The gene-level counts for this example are provided in the yeastRNASeqRisso2011 R package.

In addition to the read counts, the R package contains some sample and gene information, such as GC-content and gene length.

To install the package you need to download the file from this link and in the directory where you downloaded the file type

Looking at the data

One of the main characteristics of RNA-Seq data is that each library will generally have a different sequencing depth.
We can visualize the total number of mapped reads to known genes with the barplot function. To check for systematic effects we can color-code the plot by different biological or technical variables

The boxplot function provides an easy way to visualize the difference in distribution between each experiment.

boxplot(log2(geneLevelCounts+1), las=2, col=colors[laneInfo[,4]])

A two-class comparison

As a first example, let's focus on the comparison between the three Delft and the three Glycerol yeast libraries.

Building the edgeR object

DGEList is the function that coverts the count matrix into an edgeR object.
In addition to the counts, we need to group the samples according to the variable of interest in our experiment.
We can then see the elements that the object contains by using the names function

These elements can be accessed using the $ symbol. We give some examples here of looking at what is saved in this object.

head(cds$counts) # original count matrix
cds$samples # contains a summary of your samples
sum(cds$all.zeros) # How many genes have 0 counts across all samples

Normalization

We can calculate the normalization factors which correct for the different sequencing depth of each library (or library size).

cds <- calcNormFactors(cds)
cds$samples

By default, the function calcNormFactors normalize the data using the "weighted trimmed mean of M-values" (TMM) method, proposed by [2]. Other options are RLE [3] and upper-quartile [4]. If we want to use the upper-quartile to normalize, we can add an extra argument to the function

cds <- calcNormFactors(cds, method="upperquartile")
cds$samples

If we want the normalized pseudo-counts, useful for instance for cluster analysis, we can get them with the following commands.

Estimating Dispersion

First, we need to estimate the common dispersion: this assumes that all the genes have the same dispersion.

cds <- estimateCommonDisp(cds, verbose=TRUE)

To understand what this value means, recall the parameterization for the variance of the negative binomial is ν(μ) = μ+μ2 ·φ. For poisson it’s ν(μ) = μ. The implied standard deviations are the square-roots of the variances. Now, suppose a gene had an average count of 200. Then the sd’s under the two models would be

In real experiments, the assumption of common dispersion is almost never met. Often, we observe a relation between mean counts and dispersion, i.e., the more expressed genes have less dispersion.

The way edgeR estimates a tagwise (i.e. gene-wise) dispersion parameter is by "shrinking" the gene-wise dispersions toward a common value (the common dispersion estimated in the previous step). Alternatively, one can shrink the gene-wise estimates to a common trend, by estimating a smooth function prior to the shrinkage (using the estimateTrendedDisp function). Here we keep things simple and shrink the estimates to the common value

cds <- estimateTagwiseDisp(cds)

We can plot the biological coefficient of variation (i.e. the square root of the dispersions) with the following

Testing for DE

The function exactTest performs pair-wise tests for differential expression between two groups. The important parameter is pair which indicates which two groups should be compared. The output of exactTest is a list of elements: we can get the table of the results with the topTags function.

Outputting the results

We can write the results to a file that we can later open in excel or in other programs

write.table(top, file="two-class-results.txt", sep='\t', quote=FALSE)

A three-class comparison

In some experiments, the design is more complicated than a two-class comparison. edgeR can deal with complicated designs via a Generalized Linear Model (GLM) approach.
As an example, we will consider the three-class comparison of YPD, Delft and Glycerol in the yeast data.

The main ingredient that we will need in the GLM approach is the design matrix. This matrix tells the model what are the important variables in the experiment. These can be either biological (e.g., growth condition, KO/WT, disease status, ...) or technical (e.g., library prep protocol, flow-cell, ...). To keep things simple, we will model only the growth condition.

Building the edgeR object

Again, we need to build an edgeR object (this time with all the samples)

Testing for DE

The function glmFit fits a GLM to the data and the function glmLRT performs a Likelihood Ratio Test (LRT) for the null hypothesis that the variable of interest does not influence the gene expression. The coef argument refers to the column numbers of the design matrix to be tested. Alternatively, one can use contrasts to test a specific linear combination of the variables (see the package vignette for details).

fit <- glmFit(cds, design)
lrt <- glmLRT(fit, coef=2:3)

The topTags function can still be used to extract the significant genes.

top <- topTags(lrt, n=nrow(cds$counts))$table
head(top)

We can store the ID of the DE genes and look at the distribution of the p-values.