Differential expression analysis for sequence count data.

Abstract

High-throughput sequencing assays such as RNA-Seq, ChIP-Seq or barcode counting provide quantitative readouts in the form of count data. To infer differential signal in such data correctly and with good statistical power, estimation of data variability throughout the dynamic range and a suitable error model are required. We propose a method based on the negative binomial distribution, with variance and mean linked by local regression and present an implementation, DESeq, as an R/Bioconductor package.

Dependence of the variance on the mean for condition A in the fly RNA-Seq data. (a) The scatter plot shows the common-scale sample variances (Equation (7)) plotted against the common-scale means (Equation (6)). The orange line is the fit w(q). The purple lines show the variance implied by the Poisson distribution for each of the two samples, that is, . The dashed orange line is the variance estimate used by edgeR. (b) Same data as in (a), with the y-axis rescaled to show the squared coefficient of variation (SCV), that is all quantities are divided by the square of the mean. In (b), the solid orange line incorporated the bias correction described in Supplementary Note C in Additional file . (The plot only shows SCV values in the range [0, 0.2]. For a zoom-out to the full range, see Supplementary Figure S9 in Additional file .)

Type-I error control. The panels show empirical cumulative distribution functions (ECDFs) for P values from a comparison of one replicate from condition A of the fly RNA-Seq data with the other one. No genes are truly differentially expressed, and the ECDF curves (blue) should remain below the diagonal (gray). Panel (a): top row corresponds to DESeq, middle row to edgeR and bottom row to a Poisson-based χ2 test. The right column shows the distributions for all genes, the left and middle columns show them separately for genes below and above a mean of 100. Panel (b) shows the same data, but zooms into the range of small P values. The plots indicate that edgeR and DESeq control type I error at (and in fact slightly below) the nominal rate, while the Poisson-based χ2 test fails to do so. edgeR has an excess of small P values for low counts: the blue line lies above the diagonal. This excess is, however, compensated by the method being more conservative for high counts. All methods show a point mass at p = 1, this is due to the discreteness of the data, whose effect is particularly evident at low counts.

Distribution of hits through the dynamic range. The density of common-scale mean values qi for all genes in the fly data (gray line, scaled down by a factor of seven), and for the hits reported by DESeq (red line) and by edgeR at a false discovery rate of 10% (dark blue line: with tag-wise dispersion estimation; light blue line: common dispersion mode).

Sample clustering for the neural cell data of Kasowski et al. []. A common variance function was estimated for all samples and used to apply a variance-stabilizing transformation. The heat map shows a false colour representation of the Euclidean distance matrix (from dark blue for zero distance to orange for large distance), and the dendrogram represents a hierarchical clustering. Two GNS samples were derived from the same patient (marked with '(*)') and show the highest degree of similarity. The two other GNS samples (including one with atypically large cells, marked '(L)') are as dissimilar from the former as the two NS samples.

Application to ChIP-Seq data. Shown are ECDF curves for P values resulting from comparisons of Pol-II ChIP-Seq data between replicates of the same individual (first and second column) and between two different individuals (third and forth column). The upper row corresponds to an analysis with DESeq ('D'), the lower row to one based on Poisson GLMs ('P'). If no true differential occupation exists (that is, when comparing replicates), the ECDF (blue) should stay below the diagonal (gray), which corresponds to uniform P values. In the first column, two replicates from HapMap individual GM12878 (A1) were compared against two further replicates from the same individual (A2). Similarly, in the second column, two replicates from individual GM12891 (B1) were compared against two further replicates from the same individual (B2). For DESeq, no excess of low P values was seen, as expected when comparing replicates. In contrast, the Poisson GLM analysis produced strong enrichments of small P values; this is a reflection of overdispersion in the data, that is, the variance in the data was larger than what the Poisson GLM assumes (see also Section Choice of distribution). The third column compares two replicates from individual GM12878 (A1) against two from the other individual (B1). True occupation differences are expected, and both methods result in enrichment of small P values. The forth column shows the comparison of four replicates of GM12878 (A1 combined with A2) against four replicates of GM12891 (B1, B2); increased sample size leads to higher detection power and hence smaller P values.

Noise estimates for the data of Nagalakshmi et al. []. The data allow assessment of technical variability (between library preparations from aliquots of the same yeast culture) and biological variability (between two independently grown cultures). The blue curves depict the squared coefficient of variation at the common scale, wρ(q)/q2 (see Equation (9)) for technical replicates, the red curves for biological replicates (solid lines, dT data set, dashed lines, RH data set). The data density is shown by the histogram in the top panel. The purple area marks the range of the shot noise for the range of size factors in the data set. One can see that the noise between technical replicates follows closely the shot noise limit, while the noise between biological replicates exceeds shot noise already for low count values.