First, as you'd guess from the title, the paper mostly describes a MATLAB package for performing RSA. I could easily download the package and start looking at the demos and documentation, but there is a lot in the package, and understanding what all it's capable of (and how exactly it's doing everything) is not a job for an hour or two. It certainly looks worth careful examination, though; I'm particularly interested in the statistical inference functions.

The part I mostly want to comment on is separate from the MATLAB package: the paper suggests using a linear discriminant analysis t-value as a measure of discriminability instead of Pearson correlation (1 - Pearson correlation was suggested in Kriegeskorte 2008). Here's how they describe the method (there's a bit more in the supplemental):

"We first divide the data into two independent sets. For each pair of stimuli, we then fit a Fisher linear discriminant to one set, project the other set onto that discriminant dimension, and compute the t value reflecting the discriminability between the two stimuli. We call this multivariate separation measure the linear-discriminant t (LD-t) value."

This is dense. To unpack it a bit, the idea is that you're using a statistic derived from a classification analysis for the distance metric. They suggest using Fisher linear discriminant analysis (LDA) for the classification algorithm, with two-fold cross-validation, averaging results across the folds. LDA strikes me as a reasonable suggestion, and I assume any sort of reasonable cross-validation scheme (e.g. leave-one-run-out) would be fine.

But, how to derive a t-value from the cross-validated LDA? The paper's description wasn't detailed enough for me, so I poked around in the toolbox code and found the fishAtestB_optShrinkageCov_C function in /Engines/fisherDiscrTRDM.m. It looks like they're fitting the discriminant to the training dataset, projecting the test dataset onto the discriminant, then computing a t-value from the test data projected on the discriminant. The function code does everything with linear algebra; my MATLAB (and linear algebra) is too rusty for it all to be obvious (e.g. which step, if any, corresponds to the coefficients produced by the R lda command? Is it a two-sided t-test against zero?). See this new post.
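To make my reading of the method concrete, here's a sketch of one cross-validation fold in numpy. This is my interpretation, not the toolbox's exact code: the toolbox uses an optimally-shrunk covariance estimate, which this sketch omits, and the function name ld_t is mine.

```python
import numpy as np

def ld_t(train_a, train_b, test_a, test_b):
    """One fold of the LD-t, as I understand it: fit a Fisher linear
    discriminant on the training split, project the held-out split onto
    it, and compute a two-sample t on the projections.
    Inputs are (examples x voxels) arrays for stimuli a and b."""
    mu_a, mu_b = train_a.mean(axis=0), train_b.mean(axis=0)
    resid = np.vstack([train_a - mu_a, train_b - mu_b])
    n_tr = len(train_a) + len(train_b)
    cov = resid.T @ resid / (n_tr - 2)        # pooled within-class covariance
    w = np.linalg.solve(cov, mu_a - mu_b)     # Fisher discriminant weights
    pa, pb = test_a @ w, test_b @ w           # project held-out examples
    na, nb = len(pa), len(pb)
    sp2 = (((pa - pa.mean()) ** 2).sum() +
           ((pb - pb.mean()) ** 2).sum()) / (na + nb - 2)  # pooled variance
    return (pa.mean() - pb.mean()) / np.sqrt(sp2 * (1 / na + 1 / nb))
```

In the two-fold scheme the paper describes, you'd also compute the fold with the train and test halves swapped and average the two t values. Note that swapping the roles of a and b flips the sign of both the discriminant and the projections, so the t-value comes out the same either way - which is what lets it fill a symmetric RDM.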

Anyway, the idea of using a classification-derived distance metric for RSA is appealing, particularly because it gives a consistent and predictable zero when stimuli are truly unrelated (fMRI examples are often a bit correlated, making correlation-based RSA comparisons sometimes between "not that correlated" and "somewhat correlated", rather than the more interpretable "nothing" and "something").

Which brings me to what I realized I had wrong about RSA. To do cross-validation, you need multiple examples of the same stimulus, and at the end you have a single number (accuracy, LD-t, whatever). RSA is accordingly not done between examples (e.g. individual trials) but between stimulus types (classes with lots of examples; what we classify).

This RSA matrix (the official term is "RDM") is from a previous post, which I described as "an RSA matrix for a dataset with six examples in each of two classes (w and f)." While the matrix is sensible (w-f cells are oranger - less correlated - than w-w and f-f cells), the matrix should properly be a single value: the distance between w and f.

In other words, to make an RSA matrix (RDM) I needed at least three classes, not multiple examples of two classes. Say the new class is 'n'. Then my RSA matrix would have w, f, and n along each axis, and we can ask questions like, "is w more similar to f or n?". That RSA matrix would have just three numbers: the distances between w and f, w and n, and f and n. If using Pearson correlation, we'd calculate those three numbers by averaging (or some other sort of temporal compression, such as fitting a linear model) across the examples of each class (here, w1, w2, w3, w4, w5, w6) to get one example per class, then correlating these vectors (e.g. w with f). If using LDA, we'd (for example) use the first three w and f examples to train the classifier, then test on the last three of each (and the reverse), then calculate the LD-t. (To be clear, you can calculate the LD-t with just two classes, but it won't really look like an RDM since you'd have just one value: w-f.)
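The correlation-distance version of the three-class case above can be sketched like this. The class names and sizes follow the example; the patterns are simulated random data, not real fMRI, and the voxel count is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n_examples, n_voxels = 6, 50

# Simulated (examples x voxels) patterns for the three classes w, f, n:
# each class gets its own random "true" pattern plus per-example noise.
classes = {c: rng.normal(0, 1, n_voxels) +
              rng.normal(0, 1, (n_examples, n_voxels))
           for c in ("w", "f", "n")}

# Step 1: compress across examples (here, a simple average) so each
# class contributes one pattern vector.
means = {c: x.mean(axis=0) for c, x in classes.items()}

# Step 2: fill the 3 x 3 RDM with 1 - Pearson correlation.
names = list(means)
rdm = np.array([[1 - np.corrcoef(means[a], means[b])[0, 1]
                 for b in names] for a in names])
# The RDM is symmetric with a zero diagonal, so it really holds just
# the three numbers described above: w-f, w-n, and f-n.
```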

UPDATE 17 July 2014: Changed a bit of the text (strikeouts) in response to helpful comments from Hamed Nili. He also pointed out this page, which describes a few other aspects of the paper and toolbox.

UPDATE 21 August 2014: see this post for a detailed description of how to calculate the LD-t.

2 comments:

I have to say this is a very nice post about RSA and the toolbox. Thanks! Also thanks to the authors who put a lot of effort into making it available. There are just a few aspects here which I think deserve a bit of elaboration. I have struggled a lot with some of the concepts that you describe, and I just want to add my own two cents. What you say about the classes 'w' and 'f' is true, however I think it would be good to agree on what a ‘class’ is. If a ‘class’ is something like, say, faces, then you can investigate this class with different exemplars of faces, with trial repetitions for each exemplar. In this example case of a single-‘class’ RSA investigation, the trial repetitions of each particular individual face will enable estimating the ld-t. If you had say 12 different faces to make this class, then you would end up with a 12 x 12 RDM, and the elements of the RDM would tell you what the discriminability is between each pair of individual faces in the region you are currently looking at. This would be an example of a single-‘class’ RSA study. This particular example with faces has been particularly challenging by the way, but that is a different story. One of the powerful aspects of RSA in general is that it enables you to investigate ‘classes’ of stimuli without assuming said ‘classes’ a priori. This is the case for a condition-rich design like the one described in Kriegeskorte et al. 2008. We could say that this study is composed of, roughly speaking, 6 ‘classes’ with stimuli drawn from human bodies, human faces, animal bodies, animal faces, natural inanimate objects and artificial inanimate objects. But what the report from Kriegeskorte et al. showed is that human inferior temporal cortex (and monkey IT!) emphasised particular class distinctions in the context of this condition-rich design.
Just by eye-balling the RDM, you can see that there is information in hIT about these ‘classes’, where exemplars within the ‘classes’ lead to more similar activity patterns, and exemplars that belong to different classes, like say animate and inanimate objects, lead to dissimilar activity patterns. This is different from many studies in which the experimenters assumed such ‘classes’, and averaged the activity patterns across different exemplars of the ‘classes’ before comparing them. This comment is not a rant per se about averaging; I would say that the two methods enable answering different questions. But in the case of RSA, if there is information related to a particular ‘class’, then the RDM should show it and, with the RSA toolbox that you very well describe in your post, hypotheses about ‘classes’ can be tested with models and statistical inference. Thank you again for the wonderful blog post! Ian Charest

Thanks for the comment! I totally agree that terminology is part of the confusion, and that different methods (averaging, etc) are proper for addressing different questions.

I often think of "classes" as the "discrete units we want to make conclusions about". In classification terms, classes are usually the targets - what the algorithm is learning to distinguish ("w and f can be significantly classified in ROI A").

If classes are the "discrete units we want to make conclusions about", then I wouldn't call your example of a 12-faces RSA a single-class analysis, but rather a 12-class one (the unique faces).

Obviously, the 12 faces all belong to a group - faces! But the 12x12 RDM is intrinsically also about the individual stimuli (as you phrased it, "the discriminability between each pair of individual faces") - using 12 different faces would result in different RDMs. Hopefully, the relationship between the RDMs for stimulus super-classes would be the same after changing the stimulus images (e.g. the ROI has information about faces but not artificial objects).

This is a bit like the mass-univariate fMRI "first-level" and "second-level" analysis concept - some of the details that matter in the first level hopefully don't in the second. But we certainly need better terminology (and I'm not recommending "super-class"); something that makes the "level" of analysis clearer.