Instructor

Jeff Leek, PhD

Transcript

As we've seen in many of the analyses we've talked about throughout this class, there are a large number of steps involved in doing a statistical genomics project, from pre-processing and normalization, to statistical modeling, to post hoc analyses of the results that you get. So I wanted to talk a little bit about researcher degrees of freedom. This is an idea that was originally proposed in psychology, and there was this paper that said, basically, undisclosed flexibility in data collection and analysis allows for presenting anything as statistically significant. And so, what are they talking about here? They're talking about how there's a large number of steps in the data analytic pipeline. They go from experimental design, all the way from the raw data to the summary statistics, and then finally there's a p-value at the end. Now usually when people are talking about statistical significance, they talk about p-values or multiple testing corrected p-values, and often a lot depends on that p-value being small enough that a journal will publish the paper, or something like that. That dependence is going down a little bit over time, but historically there's been a lot of focus on it. But there are a lot of steps underneath that process, before you get to a p-value, that could change what the p-value is. So, for example, if you throw out a particular outlier, or if you normalize the data a little bit differently, you might get different results. And so, there are lots of different ways you can analyze data. And the danger here is that, when they were talking about it in this paper, they were talking about a nefarious case where you keep doing everything you can until you get a p-value that's significant, but you could imagine doing this just by accident. You make a large number of choices when doing a genomic data analysis, and once you've made those choices, you get some result. 
And maybe you don't like that result, so you redo the analysis. So one thing that you have to be very careful about when doing genomic analysis is redoing the analysis too many times. It makes sense to redo the analysis when there's new updated software, or there's new biological or scientific knowledge that's been brought to bear. But if you keep redoing it over and over again, you fall into this trap. And you can imagine how that would happen with different teams. So, this comes from a recent analysis. This isn't an analysis in genomics, but it illustrates the point: 29 different research teams were asked to see if referees were more likely to give red cards to dark-skinned players. And so each team analyzed the data a little bit differently. And here you can see the dots represent the different effect sizes that the different teams estimated, and you can see that they're all different. And then the confidence intervals, or uncertainty intervals, for each of these different estimates are also different from each other. And so, while they're comfortingly similar for many of the estimates here in the middle, you can get quite big variability just by changing the way that you analyze the data. And so, you have to be careful to make sure that you don't do this over and over and over again until you find just the one case where you get a large estimate of the effect, when it's probably not due to anything other than the way that you analyzed the data. And the difficult thing about this is that if you do a different analysis, particularly if you adjust for different covariates, you actually are answering different questions. So the question you're answering is going to be conditional on what your model is. 
And so if you have a whole bunch of extra covariates in the model, then you're asking, is there a difference in gene expression after I account for all of these other variables? That's a very different question than, is there just a gene expression difference overall, which might mean something totally different. And so you have to be a little bit careful about this idea of researcher degrees of freedom as it relates to knowing what question it is that you're answering. And so this whole idea was summarized in this paper by Andrew Gelman and Eric Loken, where they talk about the garden of forking paths. What they mean by that is basically that you start off planning an analysis when you haven't seen the data, and maybe you have an analysis plan in mind. Then once you collect the data you realize, oh, there's a problem of a particular type. This happens all the time in genomic data. And then you start making decisions based on the data that you've observed, and once you start doing that you start playing into this researcher degrees of freedom idea. You're basically changing the way that you're analyzing the data based on the data, and you can end up in a little bit of trouble. So the key is to be thinking ahead right from the beginning: how am I going to analyze these data, what decisions am I going to make before looking at the data, so that you're not driven by the data and end up chasing a false positive. So the key take-home message here is: have a specific hypothesis that you're looking for. With genomic data there's this tendency to just do discovery for the sake of doing discovery, without a specific hypothesis. And that can often lead toward this garden of forking paths, or these researcher degrees of freedom. Another thing that you can do is pre-specify your analysis plan, even if it's just internally to you: say, this is the way we're going to analyze the data, and we're going to stick to it. 
And then even if you end up adapting it later, it's good to analyze the data once exactly how you planned on analyzing it, even if it has problems, just so you know what would have happened, and can see whether there are big differences and why those differences might be there. Another thing that you can do if you have enough data, although that's often not the case in genomics, is use training and test sets: the idea is that you split your data up into a first analysis data set, and then you validate the results that you get in the remaining data. And then analyze your data once. A very common temptation with genomics is to add increasingly complicated models until you find more and more things, and that often leads to false positives. The other thing that you can do, if you're going to do multiple analyses, is report all of those analyses; that gives people the opportunity to understand whether there's potential for data dredging or researcher degrees of freedom in your analysis. So this is a cautionary note that genomic data is complicated, and if you add complicated analysis on top, you can often run into extra false positives.
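
To make the point about redoing an analysis concrete, here is a minimal simulation sketch; it is not from the lecture. It generates two groups with no true difference, then compares the false positive rate of one pre-specified analysis against the rate you get by trying several analyses and keeping the smallest p-value. The four analysis choices, the sample size, and the function names are all hypothetical, and the p-value uses a large-sample normal approximation rather than an exact t-test, for simplicity.

```python
import random
from statistics import NormalDist, mean, variance


def z_test_p(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def trim(x):
    """Drop the smallest and largest value (one possible 'outlier removal')."""
    return sorted(x)[1:-1]


# Hypothetical analysis choices a researcher might try on the same data.
# The first entry plays the role of the pre-specified plan.
ANALYSES = {
    "as planned": lambda a, b: z_test_p(a, b),
    "trim extremes": lambda a, b: z_test_p(trim(a), trim(b)),
    "first half only": lambda a, b: z_test_p(a[: len(a) // 2], b[: len(b) // 2]),
    "second half only": lambda a, b: z_test_p(a[len(a) // 2 :], b[len(b) // 2 :]),
}


def simulate(n_sims=2000, n=40, alpha=0.05, seed=1):
    """Return (false positive rate of the plan, rate when keeping the best p-value)."""
    rng = random.Random(seed)
    planned_hits = 0
    any_hits = 0
    for _ in range(n_sims):
        # Both groups come from the same distribution: every "hit" is a false positive.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        pvals = [analysis(a, b) for analysis in ANALYSES.values()]
        planned_hits += pvals[0] < alpha
        any_hits += min(pvals) < alpha
    return planned_hits / n_sims, any_hits / n_sims


planned_rate, best_of_rate = simulate()
print(f"pre-specified analysis only: {planned_rate:.3f}")
print(f"best of {len(ANALYSES)} analyses:         {best_of_rate:.3f}")
```

Under these assumptions, the pre-specified analysis should reject at roughly the nominal rate, while the "keep the best result" strategy rejects more often even though nothing is going on in the data. Reporting all four p-values, rather than only the smallest, is what lets a reader see that this happened.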