GTEx is throwing away 90% of their data

The Genotype-Tissue Expression (GTEx) project is an NIH initiative to catalog human tissue-specific expression patterns in order to better understand gene regulation (see the initial press release). The project is an RNA-Seq tour-de-force: RNA extracted from multiple tissues of more than 900 individuals has been quantified in more than 1,800 RNA-Seq experiments. An initial paper describing the experiments was published in Nature Genetics earlier this year, and the full dataset is currently being analyzed by a large consortium of scientists.

I have been thinking recently about how to analyze genotype-tissue expression data, and have been looking forward to testing some ideas. But I have not yet become involved directly with the data, and in fact have not even submitted a request to analyze it. Given the number of samples, I’d been hoping that some basic mapping/quantification had already been done so that I could build on the work of the consortium. But, alas, this past week I got some bad news.

The Montgomery et al. paper contains one figure, the “FluxCapacitor outline”. It is completely useless for actually providing insight into what Flux Capacitor does:

Modification of the top half of Supplementary Figure 23 from Montgomery et al. (2010), titled “Flux Capacitor Outline” (although it actually shows a splice graph, if one corrects the errors as I have done in red).

The methods description in the Online Methods of Montgomery et al. can only be (politely) described as word salad. Consider for example the sentence:

In our approach we estimate the biases characteristic of each experiment by collecting read distribution profiles in non-overlapping transcripts, binned by several transcript lengths and expression levels. From these profiles, we estimate for each edge and transcript a flux correction factor that following the language of hydro-dynamic flow networks, we denote as the capacity of the edge, as the area under the transcript profile between the edge boundaries (Supplementary Fig. 23).

The indices i and j are never defined, but more importantly it is completely unclear what the correction factor actually is, how it is estimated, and how it is used (this should be compared to the current sophistication of other methods). The program website offers no coherent information either. Here is an example:

The resulting graph with edges labelled by the number of reads can be interpreted as a flow network where each transcript representing a transportation path from its start to its end and consequently each edge a possibly shared segment of transportation along which a certain number of reads per nucleotide — i.e., a flux — is observed.

I downloaded the code and it is undocumented, to the point that it is not even clear what the input needs to be or what the output means. No example is provided with the software for testing the program.

I therefore became curious why GTEx chose Flux Capacitor over the many other freely available tools for RNA-Seq (e.g. ALEXA-Seq, CLIIQ, Cufflinks, eXpress, iReckon, IsoEM, IsoformEx, MISO, NEUMA, RSEM, rSEQ, rQuant, SLIDE, TIGAR, …). Although many of these programs are not suitable for production-scale analysis, Cufflinks and RSEM certainly are, and eXpress was specifically designed for efficient quantification (linear time in the number of mapped reads and constant memory). I looked around and found that no benchmark of Flux Capacitor has ever been performed; there is literally not even a mention of it in any paper other than in manuscripts by Sammeth, Guigó or Dermitzakis. So I thought that after four years of repeated use of the program in high profile projects, I would take a look for myself:

After fumbling about with the barely usable Flux Capacitor software, I finally managed to run it on simulated data generated for my paper: Adam Roberts and Lior Pachter, Streaming fragment assignment for real time analysis of sequencing experiments, Nature Methods 10 (2013), 71–73. One example of the state of the software is its example page (the required sorted file is posted there, but downloading it requires realizing that it is linked from a non-obviously placed paperclip). Fortunately, I was using my own reads and the UCSC annotation. The Roberts-Pachter simulation is explained in the Online Methods of our paper (section “Simulation RNA-Seq study”). It consists of 75bp paired-end reads simulated according to parameters mimicking real data from an ENCODE embryonic stem cell line. I tested Flux Capacitor with both 10 million and 100 million simulated reads; the results are shown in the figure below:

Flux Capacitor accuracy on simulations with 10 million and 100 million reads. The top panels show scatterplots of estimated transcript abundance vs. true transcript abundance. The lower panels show the same data with both axes logged.

For comparison, the next figure shows the results of RSEM, Cufflinks and eXpress on a range of simulations (up to a billion reads) from the Roberts-Pachter paper (Figure 2a):

Flux Capacitor has very poor performance. With 100 million reads, its accuracy is equivalent to that of the other software programs at 10 million reads, and similarly, with 10 million reads it has the performance of the other programs at 1 million reads. I think it's fair to say that

Using Flux Capacitor is equivalent to throwing out 90% of the data!

The simulation is a best-case scenario. It adheres to the standard model for RNA-Seq, in which fragments are generated uniformly at random, with lengths chosen from a distribution, and with errors. As explained above, all these parameters were set according to an actual ENCODE dataset, so that the difficulty of the problem corresponds to realistic RNA-Seq data. I can't explain the poor performance of Flux Capacitor because I don't understand the method. However, my best guess is that it is somehow solving min-flow using linear programming, along the lines of the properly formulated ideas in E. Bernard, L. Jacob, J. Mairal and J.-P. Vert, Efficient RNA isoform identification and quantification from RNA-seq data with network flows, Technical Report HAL-00803134, March 2013. If this is the case, the poor performance might result from difficulties caused by the minimization of isoforms, reflected in the (incorrectly estimated) stripes on the left and bottom of the log-log plots. That is not to say that the conclusions of the papers where Flux Capacitor is used are wrong. As can be seen from our benchmark, although performance is degraded with Flux Capacitor, the quantifications are not all wrong. For example, abundant transcripts are less likely to be affected by Flux Capacitor's obviously poor quantification. Still, the use of Flux Capacitor greatly reduces resolution of lowly expressed genes and, as mentioned previously, is effectively equivalent to throwing out 90% of the data.
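For readers unfamiliar with it, the "standard model" for RNA-Seq amounts to a simple generative process: pick a transcript in proportion to abundance times length, draw a fragment start uniformly, draw a length from a distribution, and introduce point errors. Here is a minimal single-end sketch; the function and parameter names are mine, and real simulators (including the one used for our paper) additionally model paired ends and empirically fit parameters:

```python
import random

def simulate_fragments(transcripts, n_fragments, mean_len=200, sd_len=20,
                       error_rate=0.005):
    """Sketch of the standard RNA-Seq model. transcripts maps a name to
    {"seq": str, "abundance": float}. Returns (name, start, read) tuples."""
    names = list(transcripts)
    # A transcript's sampling weight is abundance * length.
    weights = [transcripts[t]["abundance"] * len(transcripts[t]["seq"])
               for t in names]
    fragments = []
    for _ in range(n_fragments):
        t = random.choices(names, weights=weights)[0]
        seq = transcripts[t]["seq"]
        # Fragment length from a (truncated) normal distribution.
        flen = max(1, min(len(seq), round(random.gauss(mean_len, sd_len))))
        # Uniformly random start position.
        start = random.randrange(0, len(seq) - flen + 1)
        frag = list(seq[start:start + flen])
        # Independent point sequencing errors.
        for i in range(len(frag)):
            if random.random() < error_rate:
                frag[i] = random.choice([b for b in "ACGT" if b != frag[i]])
        fragments.append((t, start, "".join(frag)))
    return fragments
```

The quantification problem is the inverse of this process: recover the abundances from the fragments, despite errors and sequence shared between isoforms.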

As far as GTEx is concerned, I’ve been told that a significant amount of the analysis is based on raw counts obtained from reads uniquely mapping to the genome (this approach appears to have also been used in many of the other papers where Flux Capacitor was used). Adam Roberts and I examined the performance of raw counts in the eXpress paper (Figure S8, reproduced below):

Figure S8 from A. Roberts and L. Pachter, Nature Methods (2013) showing the limits of quantification when ignoring ambiguous reads. NEUMA (Normalization by Expected Uniquely Mappable Areas) calculates an effective length for each transcript in order to normalize counts based on uniquely mappable areas of transcripts. We modified NEUMA to allow for errors, thereby increasing the accuracy of the method considerably, but its accuracy remains inferior to eXpress, which does consider ambiguous reads. Furthermore, NEUMA is unable to produce abundance estimates for targets without sufficient amounts of unique sequence. The EM algorithm is superior because it can take advantage of different combinations of shared sequence among multiple targets to produce estimates. The accuracy was calculated using only the subset of transcripts (77% of total) that NEUMA quantifies.

Quantification with raw counts is even worse than with Flux Capacitor. 23% of transcripts cannot be quantified at all (due to insufficient uniquely mapping reads). This is why in the figure above the eXpress results are better than on the entire transcriptome (third figure of this post). The solid line shows that on the (raw count) quantifiable part of the transcriptome, quantification by raw counting is again equivalent to throwing out about 90% of the data. The dashed line is our own improvement of NEUMA (which required modifying the source code) to allow for errors in the reads. This leads to an improvement in performance, but the results still don't match eXpress (or RSEM and Cufflinks), and are worse than even Flux Capacitor once the unquantifiable transcripts are taken into account. In the recent Cufflinks 2 paper, we show that raw counts also cannot be used for differential analysis (“wrong does not cancel out wrong”; see my previous post on this).
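To make the "uniquely mappable area" idea concrete, here is a toy computation of a NEUMA-style effective length: for each transcript, count the read-length windows that occur in no other transcript. This is my own simplification for illustration, not NEUMA's actual code (which, as noted above, we had to modify even to handle errors):

```python
def unique_effective_length(transcripts, read_len):
    """For each transcript (name -> sequence), count start positions whose
    read-length window occurs in no other transcript: the 'uniquely mappable
    area' that raw unique counts would be normalized by."""
    windows = {}
    for name, seq in transcripts.items():
        for i in range(len(seq) - read_len + 1):
            windows.setdefault(seq[i:i + read_len], set()).add(name)
    eff = {name: 0 for name in transcripts}
    for name, seq in transcripts.items():
        for i in range(len(seq) - read_len + 1):
            if windows[seq[i:i + read_len]] == {name}:
                eff[name] += 1
    return eff
```

A transcript whose sequence is almost entirely shared with another isoform gets an effective length near zero, which is exactly why unique-count methods cannot quantify a sizable fraction of the transcriptome.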

One criticism of my simulation study could be that I am not impartial. After all, Cufflinks and eXpress were developed in my group, and the primary developer of RSEM, Bo Li, is now my postdoc. I agree with this criticism! This study should have been undertaken a long time ago and subjected to peer review by the author(s?) of Flux Capacitor and not by me. The fact that I have had to do it is a failure on their part, not mine. Moreover, it is outrageous that multiple journals and consortia have published work based on a method that is essentially a black box. This degrades the quality of the science and undermines scientists who do work hard to diligently validate, benchmark and publish their methods. Open source (the Flux Capacitor source code is, in fact, available for download) is not open science. Methods matter.

19 comments

We would like to thank Lior Pachter for giving us advance warning a few days ago of this post. This involves GTEx as well as other high profile projects.
Naturally, we stand by our methods and choices and are preparing a detailed response on the technical issues and assessment of the performance of the methods that led our groups to make this choice.

Given the sensitivity of the issue and the need to consult all our collaborators before releasing a response on their behalf and on behalf of the consortia as a whole, we anticipate this response in a week to 10 days from now.

Manolis Dermitzakis, University of Geneva, on behalf of GTEx and GEUVADIS consortia

1. Why use Spearman correlation as the important measure of agreement?

> When writing the eXpress paper, Adam Roberts and I thought a lot about how to choose a measure of agreement. Of course a single number cannot capture the extent of differences between a “true” and estimated transcriptome, any more than a single number can summarize a distribution. One useful thing to look at is a stratification of performance by abundance level (we did this in a supplementary figure at the request of a helpful reviewer). However, we also wanted to summarize performance with a single number in order to display accuracy as a function of the number of reads in a single plot. We initially considered Pearson correlation, but it is dominated by a few very highly expressed genes. Correlation of log-abundances is problematic for the opposite reason: the measure is dominated by very lowly expressed genes, which are unlikely to be of biological interest. We eventually settled on Spearman rank correlation as a robust measure that seems to capture what would be of typical biological interest (the ranking of the transcripts). We also experimented with KL divergence and obtained very similar results; some of our figures use it.
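The point about Pearson being dominated by a few highly expressed transcripts is easy to demonstrate with a small self-contained example (the numbers are invented for illustration):

```python
from statistics import mean

def rank(xs):
    """Average ranks, 1-based; ties share the mean of their rank positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        r = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = r
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))

# One extremely abundant transcript dominates Pearson but not Spearman:
truth = [1, 2, 3, 4, 10000]
est = [4, 3, 2, 1, 10000]   # low-abundance transcripts ranked in reverse
# pearson(truth, est) is close to 1, while spearman(truth, est) is 0.
```

An estimator that gets the low-abundance transcripts exactly backwards still achieves near-perfect Pearson correlation here, while the rank correlation correctly reports no agreement.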

For the Flux Capacitor benchmark I chose Spearman correlation because that is what was used in Figure 2 of the eXpress paper. I would encourage others to examine KL divergence and other measures. I don't think it's my job to do all this work on behalf of Flux Capacitor, and I've lost a lot of motivation to help them after a senior GTEx scientist accused me yesterday of behaving like a “14 year old bully”. Having said that, I don't believe the performance relative to other programs will change much if one were to switch to something like KL divergence.

2. What is the correlation between replicates?

> In our simulation model, reads are uniformly distributed along transcripts with errors but without sequence-specific bias. The variance of counts per transcript is therefore approximately equal to abundance × length × number of reads in the experiment.
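Concretely, under binomial sampling in which each read lands on transcript t with probability p_t proportional to abundance_t × length_t (my notation; W below is the normalizing total weight), the count moments are:

```python
def count_moments(abundance, length, n_reads, total_weight):
    """Mean and variance of a transcript's read count under binomial
    sampling with p = abundance * length / total_weight. For small p the
    variance is close to the mean (Poisson-like), i.e. proportional to
    abundance * length * number of reads."""
    p = abundance * length / total_weight
    return n_reads * p, n_reads * p * (1 - p)
```

For example, a transcript with p = 0.02 in a 10-million-read experiment has an expected count of 200,000 with variance only slightly below that, so the relative noise shrinks as reads accumulate, which is why correlation between replicates improves with depth.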

3. What parameters did he use for the Flux calculation?

> The default parameters. The program was downloaded and run according to the instructions on the website.

4. Where is his code so we can see if there were any bugs (I’m sure he is willing to share, but I don’t see a link)?

> The simulation data used is available from the website http://bio.math.berkeley.edu/eXpress/simdata/ (this link is published in the Online Methods of the eXpress paper). The reads provided, and “true” abundances used to simulate the data are all that is needed to verify the plot I posted. The only other code is that of Flux Capacitor, eXpress, RSEM and Cufflinks all available from their respective websites. A full description of how the simulated data was generated is provided in the eXpress paper (it is also easy to verify that the data is consistent with the model purported to have generated it).

5. That 90% number seems very high, I wonder if varying the simulation approach/parameter settings/etc. would show it isn’t quite that bad

> I don't think 90% is high. This is because the underlying method in Flux Capacitor (from what I can guess about it from reading the now-updated website) is not good. Specifically, the construction of the splice graph followed by linear programming to compute smoothed counts is not going to be as accurate as maximum likelihood. There are other issues: I don't think Flux Capacitor is dealing with reads that map ambiguously to different genomic locations. And there is another issue: I don't believe the program is even doing what the author intended it to. Software programs for RNA-Seq are fairly complex, and there are inevitably many bugs. In my projects (Cufflinks, eXpress) we benefited enormously from lots of feedback from the community (one can see this by looking at the software websites and the number of bugs that were fixed over time). I know this to also be true of other similar projects. This obviously hasn't happened with Flux Capacitor, and therefore I'm skeptical that the code is reliable.
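To make the contrast concrete, the maximum-likelihood approach taken by RSEM, eXpress and Cufflinks can be sketched as a toy EM over ambiguous reads. This is a stripped-down illustration of the idea, not any program's actual implementation: it ignores fragment lengths, positions and bias, all of which matter in practice:

```python
def em_quantify(read_compat, n_iters=200):
    """Toy EM for relative transcript abundances. read_compat is a list
    with, for each read, the list of transcripts it is compatible with.
    Returns a dict of estimated relative abundances."""
    txs = sorted({t for compat in read_compat for t in compat})
    theta = {t: 1.0 / len(txs) for t in txs}
    for _ in range(n_iters):
        counts = dict.fromkeys(txs, 0.0)
        for compat in read_compat:            # E-step: split each read
            z = sum(theta[t] for t in compat) # across compatible transcripts
            for t in compat:                  # in proportion to current theta
                counts[t] += theta[t] / z
        theta = {t: c / len(read_compat)      # M-step: renormalize
                 for t, c in counts.items()}
    return theta
```

With six reads unique to transcript A, two unique to B and four ambiguous between the two, the estimates converge to 0.75 and 0.25: the ambiguous reads are apportioned in proportion to the evidence from the unique reads, rather than discarded or smoothed by a flow heuristic.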

6. Throwing away 90% of your data might not matter if you get the right answer to the question you care about at the end. Can we evaluate something closer to what we care about? A list of DE genes/transcripts, for example?

> If the question of interest required only 10% of the data that might be true. But then what was the point of sequencing the remaining 90%?

With respect to point 4, I think Jeff was asking for a script that details the exact commands that were used to run Flux Capacitor, parse its results and generate the plots shown in your post (i.e. a replicable pipeline). I know generating such a script may be more work than you are willing to put into this issue, but doing so can be very useful in silencing the “you didn’t run my code properly” criticism or for spotting any potential technical errors that might have arisen along the way.

The simulations were trained on experimental data in order to be as realistic as possible. They captured likely expression levels, read non-uniformity and non-normality of insert sizes. There were several simulation setups, some of them designed to favor Cufflinks & FluxCapacitor (methods allowing for expression to be exactly 0). In all setups FluxCapacitor was the worst performer (we were able to get it to run, and it runs quite fast, though admittedly documentation could be better).

A note regarding throwing away data. Most methods literally throw away data, as they only count pairwise exon connections. One really wants to consider the full exon path, i.e. the whole series of exons visited by each read pair (or sub-exons, when exons partially overlap). Clearly assigning (probabilistically) reads to isoforms can be done more precisely when we know all the exons visited by that read. In practice, we found that modelling full exon paths provides a significant boost in estimation precision (measured as Mean Absolute Error and Spearman Correlation).

An implication is that RNA-seq data should be summarized by counting exon paths, rather than pairwise exon connections (which I agree is indeed better than just getting exon counts). Otherwise we're throwing away valuable information contained in the data…
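A toy example of the information carried by full paths (the exon numbering and isoform structures here are hypothetical):

```python
def compatible_isoforms(path, isoforms):
    """Return the isoforms (name -> tuple of exon ids) that contain the
    read's exon path as a contiguous run of exons."""
    path, k = tuple(path), len(path)
    return {name for name, exons in isoforms.items()
            if any(tuple(exons[i:i + k]) == path
                   for i in range(len(exons) - k + 1))}

# Three hypothetical isoforms sharing exon 2:
isoforms = {"A": (1, 2, 3), "B": (1, 2, 4), "C": (5, 2, 3)}

# A read whose full path is (1, 2, 3) is unique to isoform A, but an
# aggregate table of pairwise junction counts would record one (1, 2)
# junction (compatible with A and B) and one (2, 3) junction (compatible
# with A and C), losing the fact that both junctions came from one read.
```

Per-read, the full path resolves the isoform; once junctions are pooled into counts, that linkage is gone, which is the information loss being described.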

Thanks for sharing your preprint. At first glance it seems very well written and I look forward to reading it carefully. I completely agree with your comments regarding the counting of pairwise exon connections and throwing away data. That is why we transitioned to mapping reads directly to transcripts (rather than the genome) in eXpress, as was done in RSEM. I also noticed your shrinkage argument which seems interesting and a valuable idea for all methods developers to consider.

Thanks for the interesting comparison, I need to read it more carefully as well!

If your main criticism against BitSeq is speed and you are only interested in expression estimation (not differential expression analysis), you should definitely check out our new variational algorithm which gives comparable accuracy to original BitSeq in a fraction of the time (preprint at http://arxiv.org/abs/1308.5953 ).

Thanks for the update on the faster BitSeq version. I very much liked the fact that it models the whole data, so in principle there’s no loss of information and inference should be more precise than any method using data summaries. Of course, a practical argument is that data summaries can be very convenient, e.g. easier to post online than the full raw data.

Some more food for thought. I was also surprised at how useful a little bit of shrinkage can be for mixture models. Interestingly, BitSeq does include shrinkage via the prior. Actually for some reason in our simulations it seemed to over-shrink, which hurt the estimation precision (numerous estimates collapsed to the prior mode).

Mapping to the transcriptome vs. the genome makes for a very nice discussion. The former certainly seems more sensible. However, I'm unsure how many contaminating sequences (e.g. pre-mRNA, introns) might end up mapped to the wrong transcript. These issues would be lessened when aligning to the genome. Also, mapping to the known transcriptome might be a limitation for de novo discovery?

Regarding the transcriptomic mapping, it’s quite nice to include the pre-mRNA in the transcript set and also map to that, sweeping up some of those intronic reads. Antti and I have been doing that in some time-series work and it seems a nice feature of RNA-Seq that there is often a good pre-mRNA signal in the data (but I agree that mapping to the known transcriptome is a limitation where there are non-canonical splice variants).

On a different issue, an addition to my original comment, to avoid being unfairly critical of FluxCapacitor: Roderic Guigo's group has made many contributions on alternative splicing over the years. Regardless of the relative performance of any single method, their work certainly made a difference and served as inspiration to many (including me, certainly)!

If I understand the method correctly, one of the main differences is not in the basic setup of FluxCapacitor but in its using a method of moments-type estimator (by solving a deterministic system of linear equations) as opposed to likelihood (or posterior) based inference. Probably not too hard to adjust or add as an option…

We were a bit surprised by the relatively poor performance of BitSeq in the Casper paper, so we repeated the analysis of the same data; David kindly sent us all the code required. It turns out that there is a problem with Bowtie1 (at least using default parameters) leading to an incredibly low mapping rate (about 1-2%) on these datasets. This results in BitSeq having about 2% of the read data that Casper and Cufflinks are using in the comparisons. We therefore repeated the same benchmarking using Bowtie2, which has mapping rates similar to the Tophat2 used by the other two methods. We also compared to Sailfish, eXpress, RSEM and TIGAR2, and did some additional simulation studies to provide a more comprehensive benchmark, including also our faster VB version of BitSeq.

The paper is about to be submitted but we have put a preprint copy on the arxiv (http://arxiv.org/abs/1412.5995) and we will provide all the code (to include with the journal submission) in a week or so to reproduce all the plots. Future referees, please give us your feedback now!

Thanks for posting the link to your preprint here. I haven’t read the paper carefully but the simulation parameters:
μ_m ∼ U(10, 210), m = 1, …, M
RPK_{jm} ∼ NB(μ_m, 20), m = 1, …, M, j = 1, 2, 3,
seem to me to be unrealistic. Specifically, the uniform distribution on transcript abundances is problematic. I guess it is interesting to see performance also in such a case, but “real” RNA-Seq looks very different. You’ll find almost all RNA-Seq datasets have a long (rare) tail. Typically we determine parameters approximately based on “real” data (and in turn simulate from those parameters).
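As a quick illustration of the difference (the distributions and parameters below are illustrative, not fit to any dataset):

```python
import random

random.seed(1)
M = 10000
# Uniform abundances, roughly the preprint's setup (my reading of mu_m ~ U(10, 210)):
uniform = [random.uniform(10, 210) for _ in range(M)]
# A long-tailed alternative (log-normal with illustrative parameters),
# qualitatively closer to real RNA-Seq libraries:
lognorm = [random.lognormvariate(1.0, 2.0) for _ in range(M)]

def top_fraction(xs, frac=0.01):
    """Share of total abundance carried by the most abundant `frac` of transcripts."""
    xs = sorted(xs, reverse=True)
    k = max(1, int(len(xs) * frac))
    return sum(xs[:k]) / sum(xs)
```

Under the uniform draw the top 1% of transcripts carries only about 2% of the total abundance; under the long-tailed draw it carries a large share, with most transcripts rare, and it is exactly that rare tail where quantification methods differ most.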

Wow, now that is super-fast feedback (and isn't it 2:14am in California?). This seems to me a very sensible suggestion; we can indeed use the real data to inform the simulations. Let us try that out to see if it makes a difference to the simulated-data results.

Can we use the effective counts from eXpress to reliably calculate the relative isoform usage for each gene?
I'm asking this because the confidence intervals on the isoform proportions (ψ in MISO) are not very small (which makes conclusions drawn from the proportions not very reliable); on the other hand, your paper mentions that the isoform count estimates are given within a 5.4% interval.

Out of curiosity: does the simulated RNA-seq data represent mRNA only (i.e., a polyA-tail selection type experiment) or total RNA? Do any key simulation parameters differ between these two types of experiments?