Tuesday, August 18, 2015

demo: permutation tests for within-subjects cross-validation

This R code demonstrates how to carry out a permutation test at the group level, when using within-subjects cross-validation. I describe permutation schemes in a pair of PRNI conference papers; see DOI:10.1109/PRNI.2013.44 for an introduction and single subjects, and this one for the group level, including the fully balanced within-subjects approach used in the demo. A blog post from a few years ago also describes some of the issues, using examples structured quite similarly to this demo.

For this demo I used part of the dataset from doi:10.1093/cercor/bhu327, which can be downloaded from the OSF site. I did a lot of temporal compression with this dataset, which is useful for the demo, since only about 18 MB of files need to be downloaded. Unfortunately, none of the analyses we did for the paper are quite suitable to demonstrate simple permutation testing with within-subjects cross-validation, so this demo performs a new analysis. The demo analysis is valid, just not really sensible for the paper's hypotheses (so, don't be confused when you can't find it in the paper!).

The above figure is generated by the demo code, and shows the results of the test. The demo uses 18 subjects' data, and their null distributions are shown as blue histograms. The true-labeled accuracy for each person is plotted as a red line, and listed in the title, along with its p-value, calculated from the shown null distribution (the best-possible p-value, 1/2906, rounds to 0).

The dataset used for the demo has no missing data: each of the people has six runs, with four examples (two of each class) in each run. Thus, I can use a single set of relabelings for the permutation test, carrying out the relabelings and classifications in each person individually (since it's within-subjects cross-validation), but with the null distribution for each person built from the same relabelings. Using the same relabelings in each person allows the group-level null distribution (green, in the image above) to be built from the across-subjects average accuracy for each relabeling. In a previous post I called this fully-balanced strategy "single corresponding stripes", illustrated with the image below; see that post (or the demo code) for more detail.
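The group-level null described above can be sketched in a few lines. This is a Python illustration, not the demo's R code; the function name `group_null` and the toy accuracies are made up for the example. The key requirement is that position i in every subject's list comes from the same relabeling i.

```python
from statistics import mean

def group_null(subject_nulls):
    """Average accuracy across subjects for each shared relabeling.

    subject_nulls: one list per subject; entry i of every list must be
    the accuracy obtained with the SAME relabeling i.
    """
    n_perms = len(subject_nulls[0])
    if any(len(s) != n_perms for s in subject_nulls):
        raise ValueError("every subject needs the same set of relabelings")
    return [mean(s[i] for s in subject_nulls) for i in range(n_perms)]

# Made-up accuracies (not real data): 3 subjects x 4 shared relabelings.
toy_nulls = [
    [0.50, 0.25, 0.75, 1.00],
    [0.75, 0.50, 0.25, 0.50],
    [0.25, 0.75, 0.50, 0.75],
]
toy_group = group_null(toy_nulls)
print(toy_group)
```

If the subjects had been permuted with different relabelings, averaging column by column like this would not be meaningful, which is why the shared set of relabelings matters.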

The histogram for the across-subjects means (green histogram; black column) is narrower than the individual subjects' histograms. This is sensible: for any particular permutation relabeling, one person might have a high accuracy and another person a low accuracy; averaging the values together gives a value closer to chance. Put another way, each individual has at least one permutation with very low (0.2) accuracy (as can be seen in the blue histograms), but different relabelings produced that low accuracy in each person, so the lowest group-mean accuracy was 0.4.
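A small simulation shows the same narrowing. It is a sketch under stated assumptions, not a reanalysis of the dataset: each permuted-label accuracy is modeled as coin flips over 24 test examples (six runs of four), subjects are treated as independent, and the 500 relabelings and the random seed are arbitrary choices.

```python
import random
from statistics import mean, pstdev

random.seed(0)
N_SUBJECTS, N_PERMS, N_TRIALS = 18, 500, 24  # N_PERMS is arbitrary here

def chance_accuracy():
    """Accuracy of a coin-flip classifier on N_TRIALS test examples."""
    return sum(random.random() < 0.5 for _ in range(N_TRIALS)) / N_TRIALS

# nulls[s][i]: simulated accuracy for subject s under relabeling i
nulls = [[chance_accuracy() for _ in range(N_PERMS)] for _ in range(N_SUBJECTS)]
group = [mean(s[i] for s in nulls) for i in range(N_PERMS)]

single_sd = mean(pstdev(s) for s in nulls)  # typical spread within one subject
group_sd = pstdev(group)                    # spread of the group means
print(round(single_sd, 3), round(group_sd, 3))
print(min(min(s) for s in nulls), min(group))
```

Under these assumptions the spread of the group means should be roughly the single-subject spread divided by the square root of the number of subjects, and the lowest group mean sits much closer to chance than the lowest individual accuracy.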

The group mean of 0.69 was higher than all the permuted-label group means, giving a p-value of 1/2906 ≈ 0.0003 (all 2906 possible permutation relabelings were run). The equivalent t-test is shown in the last panel, and also produces a very significant p-value.
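The p-value calculation can be sketched as the proportion of the null distribution at or above the true-labeled statistic. This Python sketch assumes the true labeling is counted as one of the 2906 relabelings (consistent with a best-possible p-value of 1/2906); the function name and the placeholder null values of 0.5 are made up for illustration.

```python
def permutation_p_value(true_stat, null_stats):
    """Proportion of null statistics at least as large as the true one.

    Assumes null_stats includes the true labeling's own statistic,
    so the smallest possible p-value is 1 / len(null_stats).
    """
    n_as_extreme = sum(1 for s in null_stats if s >= true_stat)
    return n_as_extreme / len(null_stats)

# Placeholder null: 2905 permuted-label group means (all set to 0.5 here,
# NOT the real values) plus the true-labeled group mean of 0.69.
toy_null = [0.5] * 2905 + [0.69]
p = permutation_p_value(0.69, toy_null)
print(round(p, 4))  # 1/2906, which rounds to 0.0003
```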

Cross-validation is necessary with MVPA because otherwise performance can be really inflated. With linear SVMs (the most commonly-used algorithm), you need to fit the separating hyperplane from the training data. Since in MVPA-type datasets we generally have way more dimensions (voxels) than examples, it's very often possible to separate the training set perfectly. So we need a testing set (e.g., cross-validation) to get a proper estimate of classification performance, even before going to statistical testing.
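The inflation is easy to reproduce with pure noise. The sketch below swaps in a simple perceptron for the linear SVM (so it runs with only the Python standard library) and uses made-up sizes: 12 training examples, 200 "voxels", and 200 held-out examples. The labels carry no information, yet the training set is separated perfectly.

```python
import random

random.seed(1)
N_TRAIN, N_TEST, N_VOXELS = 12, 200, 200  # hypothetical sizes

def noise_examples(n):
    """Pure-noise 'voxel' patterns with balanced, meaningless labels (+1/-1)."""
    X = [[random.gauss(0, 1) for _ in range(N_VOXELS)] for _ in range(n)]
    y = [1 if i % 2 == 0 else -1 for i in range(n)]
    return X, y

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

def train_perceptron(X, y, max_epochs=1000):
    """Update the weights on every mistake until the training set is separated."""
    w = [0.0] * N_VOXELS
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(X, y):
            if predict(w, x) != label:
                w = [wi + label * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:
            break
    return w

def accuracy(w, X, y):
    return sum(predict(w, x) == label for x, label in zip(X, y)) / len(y)

X_train, y_train = noise_examples(N_TRAIN)
X_test, y_test = noise_examples(N_TEST)
w = train_perceptron(X_train, y_train)
train_acc = accuracy(w, X_train, y_train)
test_acc = accuracy(w, X_test, y_test)
print(train_acc, test_acc)
```

Training accuracy reaches 1.0 even though there is nothing to learn, while accuracy on the held-out examples stays near chance; only the held-out estimate is worth carrying forward to a statistical test.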

For any permutation test, a critical step is to precisely define how you got to the statistic you want to test. For example, is it from a single person, or averaged across people? From cross-validation, or not? Whatever you did on the dataset to get the statistic you're testing should also be done to the permuted-label datasets.

For how to permute the labels, look to the natural dependencies/structure in the dataset and decide on the "exchangeability blocks". This can be tricky in psychology-type datasets, because we have multiple layers of stratification (people within diagnoses, fMRI scanner runs within people, etc.). But once you figure out where it's valid to exchange labels, it's usually straightforward to actually do the test.
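As one concrete example of an exchangeability block, suppose labels may only be exchanged within a scanner run. The Python sketch below enumerates every distinct relabeling for a toy layout of two runs with four examples each; the layout and the function name are hypothetical, and this is not necessarily how the demo enumerates its own relabelings.

```python
from itertools import permutations, product

def within_run_relabelings(run_labels):
    """All distinct relabelings that shuffle labels only within each run."""
    # Distinct orderings of each run's labels (duplicates removed via set).
    per_run = [sorted(set(permutations(labels))) for labels in run_labels]
    # Combine one ordering per run in every possible way.
    return [list(combo) for combo in product(*per_run)]

# Toy layout: 2 runs, each with two examples of class "a" and two of "b".
toy_runs = [("a", "a", "b", "b"), ("a", "a", "b", "b")]
relabelings = within_run_relabelings(toy_runs)
print(len(relabelings))  # 6 orderings per run, 2 runs: 6 * 6 = 36
print(relabelings[0])    # the first entry is the original labeling
```

Because labels never cross run boundaries, every relabeling keeps two examples of each class in each run, so the cross-validation folds stay balanced under permutation.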

First, I just wanted to thank you for these posts - sorting things out online I often find them the clearest explanations of things.

Second, I wanted to ask - where have you found it easiest to implement these methods in searchlights? Especially for the 'single-corresponding' permutations? I am likely to build some PyMVPA code in the meantime but was curious.