Engineering, Science, and Society

Bayesian model of differential gene expression

2016 November 5

by Daniel Lakeland

My wife has been working on a project where she's looking at 3 time-points during a healing process and doing RNA-seq. I'm analyzing the known secreted proteins for differential expression, using a fully Bayesian model in Stan:

Red genes have 90% posterior probability of up or down regulation relative to week 0 controls, at least according to the current overnight Stan run.

Fun, but non-trivial. There are around 6000 genes being considered, and there are 5 or 6 of these parameter vectors... 36000 parameters to be estimated from 60000 data points, and the parameters all have fairly long tails... Cauchy priors and so forth. Then I need a couple hundred samples, so Stan is outputting something like 5-10 million numbers. Warmup has been a problem here, and the long tails have required re-parameterizations and tuning. Fun stuff.
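The re-parameterizations mentioned above can be illustrated with the standard non-centered trick for heavy-tailed distributions: sample a well-behaved uniform variable and push it through a deterministic transform to get the long tails. This is a Python sketch of the idea, not the actual Stan code from the run (the function name is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def cauchy_noncentered(mu, scale, size, rng):
    # If u ~ Uniform(-pi/2, pi/2), then mu + scale*tan(u) ~ Cauchy(mu, scale).
    # The sampler only ever sees the well-behaved uniform u; the long
    # tails come from the deterministic tan() transform.
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    return mu + scale * np.tan(u)

draws = cauchy_noncentered(0.0, 1.0, 100_000, rng)
print(np.median(draws))  # near 0, the center of Cauchy(0, 1)
```

In Stan the same trick is usually expressed by declaring the uniform variable as the parameter and computing the Cauchy quantity in the transformed parameters block, which avoids the divergences that sampling a raw Cauchy parameter tends to produce.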

This is something I am interested in. Would you elaborate on the tools/pipeline/methodology you used to do this analysis, please? I am a biologist and I am aware of the common tools used in this analysis. I am not of the statistical variety, so I have been wondering if anybody used Stan for the analysis, and this is where I found it!

This is the first real RNA-seq analysis I've done. My wife is the biologist. I'm not that familiar with the "standard" tools, but I have seen enough to know that they're usually frequentist and, in my opinion, usually going down the wrong road. (For example, the Receiver Operating Characteristic (ROC) is a statistic used to compare methods... and I think this is highly mistaken; RNA-seq is not a transmission line from the cell to your ear telling you "which are the true positives," which is the philosophy behind ROC.) The goal in an RNA-seq analysis should be to measure the expression profile given that you have a noisy measurement process, and that most of the noise comes from biological replication, not measurement error in a single sample.

In any case, the theory I'm using here is something like this:

A single set of cells is sent through the RNA-seq machinery and a set of gene counts comes back. If you divide all the counts by the total count, this is in essence a single vector-sample from a distribution over a categorical frequency distribution. A typical distribution over categorical frequency distributions is the Dirichlet distribution. So, our Bayesian goal in doing this repeatedly with different biological samples is to find a single vector parameter Ceff (for "effective counts") so that our data vectors look like samples from dirichlet_rng(Ceff).
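As a toy illustration of that generative picture, here is the same idea in Python with a hypothetical 4-gene Ceff (this is a sketch of the data model, not the Stan code itself):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "effective counts" for 4 genes (the Ceff vector above).
Ceff = np.array([10.0, 10.0, 20.0, 10.0])

# Each biological replicate's normalized count vector is modeled as one
# draw from dirichlet(Ceff): a frequency vector whose entries sum to 1.
replicates = rng.dirichlet(Ceff, size=5)

print(replicates.sum(axis=1))   # every row sums to 1
print(replicates.mean(axis=0))  # rows cluster around Ceff / Ceff.sum()
```

Inference runs this in reverse: given the observed frequency vectors, find the Ceff that makes draws like these look like the data.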

Then, we can construct priors over the Ceff based on our biological model for what's going on (for example in my wife's case these are the same cell type at different time points in a healing process, so the expression profile should mostly be the same, but certain important genes could be up or down regulated).

I write a prior over log(Ceff) by specifying a prior over the baseline day-zero expression profile, and priors over "expression differences" for the different time points. The expression differences are Cauchy-distributed, reflecting the fact that most genes are very similar between the time points, especially those with low counts, and yet some small set of genes can be many factors bigger or smaller than the control.
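A minimal sketch of that prior construction, with made-up numbers and only 6 genes (the variable names and the specific scales here are illustrative, not taken from the actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes = 6  # toy version; the real model has ~6000

# Baseline day-zero log expression profile (hypothetical prior draw).
log_base = rng.normal(loc=5.0, scale=2.0, size=n_genes)

# Per-timepoint "expression differences" on the log scale:
# Cauchy-distributed, so most genes barely move while a few
# can shift by large factors.
delta = 0.1 * rng.standard_cauchy(n_genes)

Ceff_week0 = np.exp(log_base)
Ceff_week2 = np.exp(log_base + delta)

# exp(delta) is the fold change relative to the week-0 control.
fold_change = Ceff_week2 / Ceff_week0
print(fold_change)
```

Working on the log scale is what makes "up or down regulation" additive: a gene's delta directly measures its log fold change against the control.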

Finally, I run the Stan code and get estimates of these differences in logarithms and these differences define whether and how much I'm estimating that a gene is up or down-regulated.

Runs take many hours, but I am confident at least that I understand the model, as opposed to some canned piece of software based on some basic theoretical ideas I have no confidence in.

Ok. I was aware that most of the tools are frequentist tools, hence my interest in your post. I am a biologist not well versed in statistics, so the theory you described will take a while to sink in. In the meantime I have a couple more things to add.

There are multiple sources of error in RNA-seq and the data is inherently noisy. You already know that. Methods-development people have been fiercely arguing about the noise and its incorporation into the data analysis. Hence most of the methods/tools try to minimize the (biological and non-biological) errors as part of data cleaning/quality assessment.

Most of the tools need a lot of computational resources, making it prohibitive for bench scientists to try them out. The reasoning goes like, 'if I have to go to the cloud to do the analysis then I might as well pass it on to the statistician down the hall'. Unfortunately that is not good from an experimentalist's point of view: the one generating the data will have insights on the relevant biology and should be doing the analysis. This is where methods like Kallisto/Sleuth and Salmon come in - they are reportedly fast and need only resources available to all; they are supposed to run on laptops in reasonable time. I was wondering if you had a chance to use them.

I've long been an advocate of bench scientists and mathematical modelers working together actively. It's no good to pass off your data to some stats guy who will run some canned analysis; you need to describe what you did and then have them build you a custom mathematical model. Unfortunately, frequentist statistics usually sell themselves as canned all-in-one analyses.

The error and noise in RNA-seq is something that needs to be acknowledged. In a Frequentist analysis you "clean" the data to make it behave more like some default assumptions. In a Bayesian analysis you try to actually describe what the noise looks like and then infer what the observed values mean about the unknown things you want to estimate.

As such, I very simply take the total counts for each gene, add 1 so that none of them are exactly zero, and then divide by the total. There is no "cleaning" the data; there is only describing what noise I expect to have in the data. I think this is the only way to go. Specifically, in my case, the Dirichlet distribution describes the noise in the data, and I need to find out which parameters to put into the Dirichlet to get output that "looks like" the actual data.
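In code, that entire "preprocessing" step amounts to the following (a Python sketch with hypothetical counts):

```python
import numpy as np

# Hypothetical raw counts for 5 genes in one sample.
raw = np.array([0, 12, 503, 7, 88])

# Add 1 so that no gene has an exactly-zero count, then normalize.
# No reads are discarded; every gene keeps a positive frequency.
counts = raw + 1
freqs = counts / counts.sum()

print(freqs.sum())  # 1.0 -- a frequency vector, ready for the Dirichlet model
```

Everything else about the noise lives in the model, not in the data preparation.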

In R you can see how Dirichlet distributions work as follows. Imagine you have only 4 genes of interest. After

library(gtools)

you can see some examples of frequency distributions (the rows) that occur when you randomly generate from a Dirichlet distribution with parameter c(1,1,2,1) vs. when you use parameter c(100,100,200,100):

rdirichlet(20,c(1,1,2,1))

vs

rdirichlet(20,c(1,1,2,1)*100)

In the first case, the output will cluster around the vector c(1/5,1/5,2/5,1/5), but it will vary from that base value a fair amount, whereas in the second case the rows are all much closer to c(1/5,1/5,2/5,1/5).

The output of a Dirichlet distribution is always a frequency distribution, that is, a vector of positive numbers that add to 1. The Dirichlet distribution itself represents uncertainty about what the underlying frequencies will be.

I agree the error and noise in RNA-seq need to be accounted for in the analysis. The commonly used frequentist approaches in the genomics arena clean them off. However, some of the noise comes from the fact that at the bench level the methods being used are inconsistent from experiment to experiment. The underlying molecular processes perform inconsistently, and that fact gets listed somewhere obscure in the original and subsequent publications. The methods themselves are complex, and despite being used extensively they are by no means consistent. These errors make it complicated to draw inferences. I feel Bayesian methods may be useful because, I believe, they do not discard data.

Are you planning on writing up some of those scripts here? Here is a motivation from the biology side: I had not heard of the "Dirichlet distribution" until you wrote it above!

I think you're on the right track in realizing that the noise comes in the repeated application of the same procedure to different biological samples, with different people doing the extractions, different enzyme kits, different times of year and hence lab temperatures, different this and that... My impression is that it's maybe less common these days for people to do just one biological replicate, which is good, but they're still tending to do a pretty small number, maybe 2!

I would like to write up a discussion on this and get it published, I'll have to work with my wife on how to get that done, as it's her data and it was collected in collaboration with another organization. So, it might be a little while before it becomes available.

A big problem is that biological researchers are still thinking along the lines of creating a "standardized pipeline for discovery" as if somehow you can discover things in an automated way without ever accounting for the specifics of what biological experiment you actually ran! My goal would be to turn that conversation back towards "here are tools for understanding data that should be put together in ways that help you make sense of the specific biological situation you are studying"

I'm interested in this as well. I've toyed around with this kind of analysis (bespoke hierarchical modeling with BUGS/JAGS/Stan), but never had time to flesh it out because:

1) figured fitting would take an inordinate amount of time with parameter estimates for 20-40K+ transcripts plus internal parameters.

2) didn't want to re-invent the mean-variance / overdispersion modeling methods covered by canned procedures in the literature. Granted, these usually aren't that complicated to represent, but I always wonder if it's worth it if #1 is limiting.

Maybe it's gotten better with the HMC or VB procedures now. Any chance you could share some of this on github?

It's absolutely the case that taking this to the full set of 20-40k genes would be computationally problematic. I am doing a model here for 6000 genes of interest plus "everything else" (that is, all the counts that classify as other genes not in the 6000 genes of interest).

If it was important enough, I think you could spin up a hefty preemptible Google Compute machine with 200 GB of RAM and 32 cores or whatnot and make it work for not much money (it'd cost about $12 to do 30 hrs of computing). My model runs on 4 cores and a machine that has 64 GB of RAM in 4 to 6 hours to get 600 samples per chain (300 warmup and 300 regular). If you don't save too many samples, it would be tractable on today's machines to do the full 40k genes in the genome.

The advantage of bespoke analysis is putting in information about your specific situation. For example because these are both the same tissues at different timepoints, I expect much less difference in expression than if these were say two different cancers from two different tissues in two different patients.

Sharing specifics of this will require getting buy-in from other people, but I'm sharing the general idea here because I really do think it's something people should be doing, namely thinking about their data and building models. Looking for a canned utility to solve your problem is itself a problem in many cases.

I understand the predicament computing needs will put a biologist in - I am one, so I should know. Most of the time when RNA-seq is done, the focus is on the entire transcriptome, which is 20-22k genes for humans. When you factor in splicing for most of those genes, the actual number of transcripts that need to be counted gets much higher. The tools in current usage, such as HISAT -> StringTie -> Ballgown or some other combination, do (1) the alignment of the reads to a reference, (2) the assembly of transcripts/splice-forms for each gene, and (3) the counting of each transcript. From there on, statistical analysis to determine differential expression takes over. The HISAT-StringTie-Ballgown combination is supposed to be quite fast and computationally approachable for a wet lab (http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html?cookies=accepted). On the other hand, pseudoaligners like Salmon or Kallisto make it possible on general laptops/desktops. How does the Bayesian analysis compare to them when you have to check on >25k transcripts and many samples?

I like to be able to develop models of my data and would like to factor in the variation in the data rather than discarding reads. Current methods do not remove reads wholesale simply because they do not align; the bulk of it is a direct result of inherent limitations of the equipment/sequencers as well as library construction methods. The field has come to this stage the hard way - committing the errors and figuring them out. However, Bayesian methods have yet to impact us in a big way, and that is disappointing. They would be greatly helpful when single-cell data are involved - there is no replication possible, each cell is different, which means the inherent noise level is sky high. I will not be very happy to discard reads due to the noise if there is a way to factor them in.

I get the difficulty you have with putting the current project's analysis on GitHub. I probably can't do that for my own data either - too many people involved and too many approvals needed. However, hopefully it should be possible to put up a general description of the various steps.
Cheers

So, there's no principled reason for not using horseshoe+; it mainly comes down to the fact that I have only vaguely heard of horseshoe+ and don't know much about its properties. Can you give me some background here? I take it its main purpose is to deal with long vectors with sparse large components and lots of noisy near-zero components?

I can't do better than to link you to the paper. But here, have the abstract with no clicks:

We propose a new prior for ultra-sparse signal detection that we term the "horseshoe+ prior." The horseshoe+ prior is a natural extension of the horseshoe prior that has achieved success in the estimation and detection of sparse signals and has been shown to possess a number of desirable theoretical properties while enjoying computational feasibility in high dimensions. The horseshoe+ prior builds upon these advantages. Our work proves that the horseshoe+ posterior concentrates at a rate faster than that of the horseshoe in the Kullback-Leibler (K-L) sense. We also establish theoretically that the proposed estimator has lower posterior mean squared error in estimating signals compared to the horseshoe and achieves the optimal Bayes risk in testing up to a constant. For global-local scale mixture priors, we develop a new technique for analyzing the marginal sparse prior densities using the class of Meijer-G functions. In simulations, the horseshoe+ estimator demonstrates superior performance in a standard design setting against competing methods, including the horseshoe and Dirichlet-Laplace estimators. We conclude with an illustration on a prostate cancer data set and by pointing out some directions for future research.
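For readers who want the structure concretely: in one common parameterization the horseshoe+ is a global-local scale mixture, eta_i ~ C+(0,1), lambda_i ~ C+(0, eta_i), beta_i ~ Normal(0, tau*lambda_i), where C+ denotes a half-Cauchy. Here is a Python sketch of drawing from that prior (an illustration based on the abstract, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(7)

def half_cauchy(scale, size, rng):
    # |Cauchy(0, scale)| draws via numpy's standard Cauchy.
    return np.abs(scale * rng.standard_cauchy(size))

def horseshoe_plus_draws(n, tau, rng):
    # Horseshoe+ as a global-local scale mixture:
    # eta_i ~ C+(0, 1), lambda_i ~ C+(0, eta_i),
    # beta_i ~ Normal(0, tau * lambda_i).
    eta = half_cauchy(1.0, n, rng)
    lam = half_cauchy(eta, n, rng)
    return rng.normal(0.0, tau * lam)

beta = horseshoe_plus_draws(10_000, 0.01, rng)
# Most draws are squeezed toward zero, while the extra layer of
# half-Cauchy scales leaves very heavy tails for the few "signals".
print(np.mean(np.abs(beta) < 0.05))
```

The extra half-Cauchy layer (eta) is what distinguishes it from the plain horseshoe, which fixes eta_i = 1.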

Cool, it seems easy enough to implement, I wonder how well it samples in Stan for my problem. I've had some divergent transition issues with the existing code. I'm going to have to adjust my model on the basis of some biology my wife told me about recently so I'll run with horseshoe+ too.