Instructors

Martin Lindquist, PhD, MSc

Professor, Biostatistics

Tor Wager, PhD

Transcript

Hi, in this module we're going to be talking about the multiple comparison problem in fMRI. To recap what we talked about a few modules ago: when we want to fit the GLM in order to localize areas that are active in response to a task, we begin by constructing a model for each voxel of the brain. This is typically done using the mass univariate approach, where every voxel has a separate model, and we usually use a GLM-type approach. Here is a cartoon showing how we can create a design matrix for two different conditions, A and B, which we then put into the GLM as follows. Once we do this and estimate the parameters of the model, we can perform a statistical test to determine whether or not there's task-related activation present in the voxel. Typically we test some hypothesis of the form c transpose beta equals 0, where c transpose beta is some linear combination of the beta parameters. For example, we might test that condition A minus condition B is equal to 0, and check this against the alternative that the difference is not equal to 0. We do this at every voxel of the brain, and then we can summarize the results, that is, the t-statistics that we obtain by performing this hypothesis test, in a statistical image such as the one shown here. Here each voxel has a value corresponding to the t-statistic of the statistical test at that voxel. Now, that's a nice map and all, but we want to determine which voxels are active and which are not. So we need to find a way to threshold this t-map in order to find significant voxels and get a statistical parametric map, such as the one seen here. Here each significant voxel is color-coded according to the size of its p-value. So the question is, how do we determine this threshold?
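To make the recap concrete, here is a minimal sketch of the single-voxel GLM and the contrast test for A minus B. It is an illustration under simplifying assumptions, not the course's implementation: the design uses plain boxcar regressors with no hemodynamic response convolution, the noise is assumed i.i.d., and the function name `contrast_t_stat`, the block layout, and the effect sizes are all made up for the example.

```python
import numpy as np

def contrast_t_stat(y, X, c):
    """t-statistic for H0: c' beta = 0 in the GLM y = X beta + noise.

    Assumes i.i.d. noise; real fMRI analyses usually also model
    temporal autocorrelation, which this sketch ignores.
    """
    y, X, c = (np.asarray(a, dtype=float) for a in (y, X, c))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)      # least-squares estimate
    resid = y - X @ beta
    dof = X.shape[0] - np.linalg.matrix_rank(X)       # residual degrees of freedom
    sigma2 = resid @ resid / dof                      # noise variance estimate
    var_c = sigma2 * (c @ np.linalg.pinv(X.T @ X) @ c)
    return (c @ beta) / np.sqrt(var_c), dof

# Hypothetical block design: rest, A, B repeated, 10 scans per block.
rng = np.random.default_rng(0)
n_scans = 120
block = np.arange(n_scans) // 10
cond_a = (block % 3 == 1).astype(float)
cond_b = (block % 3 == 2).astype(float)
X = np.column_stack([cond_a, cond_b, np.ones(n_scans)])  # A, B, intercept
c = np.array([1.0, -1.0, 0.0])                            # contrast: A minus B

# One simulated voxel that responds more to A than to B, and one pure-noise voxel.
y_active = 2.0 * cond_a + 0.5 * cond_b + rng.normal(size=n_scans)
y_null = rng.normal(size=n_scans)

t_active, dof = contrast_t_stat(y_active, X, c)
t_null, _ = contrast_t_stat(y_null, X, c)
print(f"dof = {dof}, t (active voxel) = {t_active:.2f}, t (null voxel) = {t_null:.2f}")
```

Running the same function over every voxel's time series is what produces the t-map described above.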
Before we start talking about this and the multiple comparison problem that it entails, let's go over some basic nomenclature for hypothesis testing. The null hypothesis, H nought, is a statement of no effect. For example, we typically want to test the hypothesis that beta 1 - beta 2 = 0. We then try to see if we can reject this null hypothesis and say that they're indeed different from each other. The way we do this is through a test statistic T, which measures the compatibility between the null hypothesis and the data. To see whether or not they're compatible, we calculate something called the p-value. The p-value is the probability that the test statistic would take a value as extreme as, or more extreme than, the one actually observed if H nought is true. Mathematically, we can write this as the probability that T is bigger than little t given the null hypothesis, where little t is the observed value of the test statistic. So basically this distribution is showing us the feasible values that the test statistic can take if the null hypothesis were indeed true. If the p-value is small, that says that our test statistic is lying far out in the tails of the plausible values. So the smaller the p-value, the less we believe that the null hypothesis holds, and in that case we might choose to reject the null hypothesis. Typically, we decide on a fixed significance level alpha and choose a threshold u of alpha which controls the false positive rate at that level. That is, we want to find some threshold u of alpha such that the probability that the test statistic lies above that value, when the null hypothesis is true, is equal to alpha, where 0.05 is often used. In that case we're controlling the probability of making a false positive at 5%. So, whenever we're doing hypothesis testing, we're ultimately making a binary decision.
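As a small worked example of the p-value and the threshold u of alpha, the sketch below assumes the test statistic follows a standard normal distribution under the null hypothesis. That is an assumption made for simplicity (it is a reasonable approximation to a t-distribution with many degrees of freedom); the helper names `p_value` and `threshold` are illustrative.

```python
from statistics import NormalDist

# Assumed null distribution of the test statistic: standard normal.
null_dist = NormalDist(mu=0.0, sigma=1.0)

def p_value(t_obs):
    """One-sided p-value: P(T > t_obs | H0)."""
    return 1.0 - null_dist.cdf(t_obs)

def threshold(alpha):
    """u_alpha such that P(T > u_alpha | H0) = alpha."""
    return null_dist.inv_cdf(1.0 - alpha)

u_05 = threshold(0.05)
print(f"u_alpha for alpha = 0.05: {u_05:.3f}")  # about 1.645

for t_obs in (1.0, 2.0, 3.0):
    p = p_value(t_obs)
    decision = "reject H0" if t_obs > u_05 else "fail to reject H0"
    print(f"t = {t_obs:.1f}: p = {p:.4f} -> {decision}")
```

Comparing the statistic to u of alpha and comparing the p-value to alpha lead to the same decision.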
The decision is: should we reject the null hypothesis, yes or no? When we're making decisions like this, there are two types of errors that we can make. One is called a Type I error. That happens if the null hypothesis is true, but we mistakenly reject it. This is also called a false positive. We can control this through the significance level alpha. So if we want to guard against false positives, we can make the alpha level very, very small, which means we need a lot of evidence to reject the null hypothesis. The other is a Type II error. Here we assume that the null hypothesis is false, but we fail to reject it. This is a false negative. In this case, we really should be rejecting the null hypothesis, but we don't because we don't have enough evidence to do so. Which of the two errors is more serious will depend on the situation. The probability that a hypothesis test correctly rejects a false null hypothesis, which is a good thing, is called the power of the test. We want a test that's very powerful, because if the null hypothesis is false, we want to be able to reject it. So these are terms that are often used when talking about hypothesis testing. Now, choosing an appropriate threshold is complicated in fMRI by the fact that we're dealing with a family of tests. If more than one hypothesis test is performed, the risk of making at least one Type I error is inflated: it's going to be greater than the alpha level of a single test. For example, say we control the Type I error rate at 0.05; that's the rate for a single test. But if we perform hundreds of tests, and there's a 5% chance of making a mistake on each test where the null hypothesis is true, then eventually we're going to wind up making a mistake.
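The inflation can be put into numbers with a short calculation. The sketch assumes the tests are independent and that every null hypothesis is true, in which case the probability of at least one Type I error across m tests is 1 - (1 - alpha)^m. Neighboring voxels in fMRI are correlated, so treat this as an idealized illustration; the function name is made up for the example.

```python
def prob_at_least_one_false_positive(alpha, n_tests):
    """P(at least one Type I error) across n_tests independent tests,
    each run at level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests

for n_tests in (1, 10, 100, 1000):
    prob = prob_at_least_one_false_positive(0.05, n_tests)
    print(f"{n_tests:>5} tests: P(at least one false positive) = {prob:.4f}")
```

With a single test the probability is 0.05, with 10 tests it is already about 0.40, and with 100 tests it exceeds 0.99.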
So the more tests one performs, the greater the likelihood of getting at least one false positive. When we're actually performing, say, 100,000 tests, it's very likely that we'll get false positives if we don't control for this appropriately. So again, which of these 100,000 voxels are significant in this statistical map? We've now performed 100,000 different hypothesis tests. If we were to just assume that they were all independent, with no true activation anywhere, and we controlled each test at the 0.05 level, then we'd expect about 5,000 false positive voxels, because one out of every 20 times we would make a mistake. Those 5,000 false positive voxels could be entire regions of the brain that are deemed active even though they shouldn't have been, and this can be a very serious problem. So choosing a threshold is ultimately a balance between sensitivity, which is the true positive rate, and specificity, which is the true negative rate. We looked at this little example in an earlier module, but I think it's worth looking at again. We could threshold this statistical map at any given level. Here I show five examples, with thresholds at 1, 2, 3, 4, and 5. You see that if you choose a low threshold, then you get a lot of active voxels. In this case, you're probably finding all the truly active voxels of the brain. However, you're probably also declaring a lot of voxels active that shouldn't have been, so that's no good. On the other hand, if you choose a very stringent threshold, say 5, you're pretty sure that the regions that are declared active are truly active, but you can't shake the feeling that you've missed a couple of activations. So we have to find some middle ground. How do we choose the threshold to determine which voxels are active and which are not in a principled way that we can defend and believe in?
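Both points in this passage can be checked with a small simulation. This is a sketch on made-up data, not the map shown in the lecture: the statistics are drawn as independent standard normal values, and in the second part 5% of the voxels are assumed to be truly active with a mean shift of 3. Those numbers, the seed, and the helper name are all assumptions chosen for illustration.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(42)
n_voxels = 100_000
u_05 = NormalDist().inv_cdf(0.95)   # single-test threshold for alpha = 0.05

# Part 1: no true activation anywhere, so every detection is a false positive.
z_null = rng.standard_normal(n_voxels)
n_false_pos = int(np.sum(z_null > u_05))
print(f"false positives at alpha = 0.05: {n_false_pos} (expected about 5,000)")

# Part 2: hypothetical map in which 5% of voxels are truly active.
is_active = np.zeros(n_voxels, dtype=bool)
is_active[:5_000] = True
z = rng.standard_normal(n_voxels) + 3.0 * is_active

def sensitivity_specificity(z, is_active, u):
    """True positive rate and true negative rate when thresholding at u."""
    detected = z > u
    sensitivity = float(np.mean(detected[is_active]))
    specificity = float(np.mean(~detected[~is_active]))
    return sensitivity, specificity

results = {u: sensitivity_specificity(z, is_active, u) for u in (1, 2, 3, 4, 5)}
for u, (sens, spec) in results.items():
    print(f"threshold {u}: sensitivity = {sens:.3f}, specificity = {spec:.5f}")
```

Raising the threshold can only lower sensitivity and raise specificity, which is the trade-off between the lenient and stringent maps.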
So that's what the next couple of modules are about. There exist several different ways of quantifying the likelihood of obtaining false positives. One way is to control what's called the family-wise error rate. The family-wise error rate is the probability of making one or more false positives across the whole family of tests. This provides a very strict control over multiple comparisons: we want to guard against making any false positives at all. A somewhat more lenient approach, which is becoming increasingly popular, is to control what's called the false discovery rate, or FDR. The false discovery rate is the expected proportion of false positives among all rejected tests. In the coming modules we'll talk about the family-wise error rate and the false discovery rate in turn. So, that's the end of this module. This was just a brief introduction to the problem at hand with multiple comparisons. In the next couple of modules, we'll go into detail and talk about methods for controlling the family-wise error rate and the false discovery rate. See you then, bye.
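To give a feel for the difference between the two error rates, here is a sketch using one standard procedure for each: the Bonferroni correction for the family-wise error rate and the Benjamini-Hochberg procedure for the false discovery rate. The transcript does not say which methods the following modules cover, so these are chosen here only as common textbook examples, and the p-values are made up.

```python
import numpy as np

def bonferroni_reject(p_values, alpha=0.05):
    """Controls the family-wise error rate: reject when p <= alpha / m."""
    p = np.asarray(p_values, dtype=float)
    return p <= alpha / p.size

def benjamini_hochberg_reject(p_values, q=0.05):
    """Controls the false discovery rate at level q (Benjamini-Hochberg).

    Finds the largest k with p_(k) <= q * k / m, where p_(k) is the k-th
    smallest p-value, and rejects the k smallest p-values.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = int(np.max(np.nonzero(passed)[0]))   # index of largest passing p-value
        reject[order[: k + 1]] = True
    return reject

# Made-up p-values for six tests.
p_values = [0.001, 0.008, 0.012, 0.041, 0.2, 0.6]
fwer_reject = bonferroni_reject(p_values)
fdr_reject = benjamini_hochberg_reject(p_values)
print("Bonferroni rejects:        ", fwer_reject.tolist())
print("Benjamini-Hochberg rejects:", fdr_reject.tolist())
```

On these six p-values the Bonferroni correction rejects two tests and Benjamini-Hochberg rejects three, which reflects the more lenient control that the false discovery rate provides.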