Thinking statistically about array results

A few days ago I posted about a problem we're having interpreting some microarray data. That problem could be simply resolved by finding the missing notebook, or activating the relevant memory synapses in the brain of the student who did the work. But there's a bigger and more intrinsic problem with this type of experiment, and that's deciding whether observed differences are due to antibiotic effects or to chance.

Consider data from a single array slide, comparing the amounts of mRNA from each of ~1700 genes from antibiotic-treated and untreated cells. If the antibiotic has no effect on a gene we expect the amounts from treated and untreated cells to be 'similar'. We don't expect them to be exactly the same (a ratio of exactly 1.00) because all sorts of factors generate random variation in signal intensity. So we need to spell out criteria that let us distinguish differences due to these chance factors from differences caused by effects of the antibiotic treatment. Here are some (better and worse) ways to do this:

1. The simplest way is to specify a threshold for significance, for example declaring that only ratios larger than 2.0 (or smaller than 0.5) will be considered as significant (i.e. not due to chance). But this isn't a good way to decide.

2. We could instead use the standard statistical cutoff, declaring that only the 5% highest and 5% lowest values would be considered significant. One problem here is that this criterion would tell us that some differences were significant (5% of 1700 x 2 =170) even if in fact the antibiotic treatment had absolutely no effect.

3. We could improve this by doing a control experiment to see how big we expect chance effects to be. One simple control is to use a microarray where both RNAs are from the same treatment. We can then use a scatter plot to compare the scores for each point. The diagonal line then represents the 1.00 ratio expected in the absence of random factors, and the degree of scatter of the ratios away from 1.00 tells us how big our chance effects typically are.

We could then decide to consider as significant only effects that are bigger than anything seen in the control. Sometimes this could be easy. For example, if expression of one gene was 20-fold higher after antibiotic treatment, whereas control effects never exceeded 5-fold, we'd be pretty sure the 20-fold increase was caused by the antibiotic.

But what if the control effects were all below 2-fold, and two of the genes in our treated RNA sample were up 2.5-fold? Does this mean the antibiotic caused this increase? How to decide?

We need to do enough replicates that the probability of any gene's expression being above our cutoff just by chance is very small. For example, we could require that a gene be expressed above the 2-fold cutoff in each of four independent replicates. Even if our arrays had a lot of random variation, comparing replicates can find the significant effects. Say that our controls tell us that 10% of the genes are likely to score above the 2-fold cutoff just by chance, and we do see about 170 scoring that high in our first real experimental array (comparing treated and non-treated cells). If we do another experimental array, we again expect 10% to score above the cutoff, but if only chance determines score, only about 17 of these are expected to have also scored this high in the first replicate. If we do a third replicate, we only expect 1 or 2 of the genes scoring above 2-fold to have scored this high in both of the first two replicates. You see the pattern. So, if our real four replicates include 6 genes that scored above 2 in all four, we can be quite confident that this is not due to chance. Instead the antibiotic treatment must have caused their expression to be increased.

The real analysis is more complicated, in part because different replicates have different amounts of error, and in part because we might want to take into account the actual expression levels rather than using the simplistic cutoff of 2-fold. Our preliminary look at the real replicate data last week suggested that a few genes were consistently (well, semi-consistently) increased. Tomorrow we're going to take another look at the real data and see if we think this is a result we have confidence in.

4 comments:

Hello Rosie, I recently started reading your blog after mndoci pointed it out. ..since I an old fashioned reductionist . I always wondered how many replicates go into these experiments. I cannot wait for your next post..and wonder if you have ever considered screencasting your analysis..it will be fun to watch and listen as well

I hate it when I see a paper that just picks two-fold as their arbitrary limit for all genes. I mean, different genes have different levels of regulation, right?

The only way to do it properly would be to do enough replicate arrays that you have a measure of the variation in the treated and untreated condition for each gene, and then do the proper stats on a gene by gene basis. Rather than talk about 2-fold up or down, talk about standard deviations and percentiles.

Of course, you still don't know the biological relevance of the increase or decrease, but at least you can then say with a defined level of certainty that the treatment caused the change in gene expression of the significantly regulated genes. Then you can start looking at the ones which went up or down the most, relative to their own intrinsic variability.

Then you can do validation by doing the analysis on the whole set of arrays, leaving out one. Then you look at how your list of most upregulated genes changes depending on which array you leave out of the analysis.

I'd argue that once you find a list of genes that remain significantly different in treated vs. control, and retain their significance even if you leave out some of the arrays(there are statistical treatments for determining the level of cross-validation), then you have the real list of genes that were affected by the treatment.

After doing all that number-crunching, It's tempting to try to set a cut-off, but you really have to consider the phenotypic effects of the gene expression. You can't assume that a change in gene expression, even if it's 10 stddevs from control, means anything. Nor is it correct to assume that the biggest changes, even relative to intrinsic variability, are the most relevant.

Now, if you have a hypothesis about what antibiotic treatment is going to do, and you see a real change in gene expression that was predicted by your hypothesis, then you're getting somewhere. You can then proceed to get fancy and develop a model where you can predict the antibiotic response based on the gene expression, but you have to have the phenotypes to correlate with in order to do this.

My apologies for the lecture, you probably understand all this far better than I. It's just kinda an issue with people in my field doing arrays for something like "stemness" and then trying to claim that because they found 350+ genes over their upregulation cut-off that it means something, when they really have no evidence that if they repeated the experiment that they would find the same list of 350 genes, in anything resembling the same order.

Mr. Gunn is quite right, and if we had unlimited resources that's what we'd do. The next post about this problem tries to explain what we're actually going to do, given our limited resources and the collaborative nature of this project.

You know, Rosie, what I was proposing isn't all that different from what you're talking about. I must have made it sound more complicated than it was. At this point in the open notebook experiment, one must take every precaution to make sure that their contribution is to signal and not noise, because the signal is low enough as it is, so I'll try to do better.

The overall point was that by using replicate RNAs from the same treatment to set your threshhold is, in fact, the way to do it.

The bit about cross-validation was simply to say that you know you have enough replicates when you can leave out a certain number of replicate arrays(in any combination) and not have your cutoff for a given gene change significantly.