Abstract

We propose a novel, efficient and intuitive approach of estimating mRNA abundances from the whole transcriptome shotgun sequencing (RNA-Seq) data. Our method, NEUMA (Normalization by Expected Uniquely Mappable Area), is based on effective length normalization using uniquely mappable areas of gene and mRNA isoform models. Using the known transcriptome sequence model such as RefSeq, NEUMA pre-computes the numbers of all possible gene-wise and isoform-wise informative reads: the former being sequences mapped to all mRNA isoforms of a single gene exclusively and the latter uniquely mapped to a single mRNA isoform. The results are used to estimate the effective length of genes and transcripts, taking experimental distributions of fragment size into consideration. Quantitative RT-PCR based on 27 randomly selected genes in two human cell lines and computer simulation experiments demonstrated superior accuracy of NEUMA over other recently developed methods. NEUMA covers a large proportion of genes and mRNA isoforms and offers a measure of consistency ('consistency coefficient') for each gene between an independently measured gene-wise level and the sum of the isoform levels. NEUMA is applicable to both paired-end and single-end RNA-Seq data. We propose that NEUMA could make a standard method in quantifying gene transcript levels from RNA-Seq data.

Algorithm overview, for paired-end RNA-Seq. (a) Calculation of gU and iU tables. First, all possible APEs are computationally made from the transcriptome sequence. The length d of an APE is fixed at each round. APEs are mapped back to the transcriptome sequence and classified into groups representing gene-wise (orange) and isoform-wise (green and violet) informative reads. APEs mapped on multiple genes (grey) are not used. For each mRNA isoform, APEs specific to the isoform are counted (iUd,i). For each gene, gene-wise informative APEs are counted (gUd,g). This procedure (from extraction of APEs to calculation of iUd,i and gUd,g) is repeated for every d, ranging from 37 to 250 bp in case of the 36-bp data. As a result, we obtain matrices gUd,g and iUd,i. (b) Calculation of EUMA and expression levels. Real RNA-Seq reads are mapped to the transcriptome sequence. For each gene gEUMA is computed by averaging gUd,i over all d, with weight P(d). iEUMA is computed likewise. P(d) is the probability distribution obtained from all mapped reads from the experiment. Then, for each gene, reads that are mapped to all of the gene's mRNA isoforms and not mapped to any other mRNA isoforms are counted (gNIR). Likewise, for each mRNA isoform, reads that are specifically mapped to the mRNA isoform are counted (iNIR). Finally, gNIR and iNIR are divided by gEUMA and iEUMA, to produce the mRNA abundance at the gene and isoform levels, respectively.

Comparison of four methods in prediction accuracy as a function of total number of reads. Prediction accuracy was defined as the Pearson correlation coefficient between true and estimated mRNA abundances. The x-axis denotes the total number of reads generated in each simulation for technical replicates. (a) Gene-level estimation for 50-bp paired-end RNA-Seq data. (b) Isoform-level estimation for 50-bp paired-end RNA-Seq data. (c and d) Gene and isoform-level estimation for 36-bp single-end RNA-Seq data. ERANGE does not report mRNA isoform abundances and was excluded from isoform analyses.

Plot of prediction accuracy versus consistency coefficient for technical replicates of four simulated 36- and 50-bp paired-end RNA-Seq samples. Each data point represents different number of sequence reads generated (labeled in million).