Monday, May 14, 2012

high dimension, low sample size

I'm accustomed to thinking of my images as points in hyperspace (number of dimensions == number of voxels). Currently I'm working with a dataset that classifies well (two classes, linear SVM, ROI-based). There is one training set and two testing sets, one of which is classified more accurately than the other. We want to describe the characteristics of the testing sets that underlie this difference.

A first idea was to compute the distances between the points within each class and each dataset: are the points in the more-accurate dataset closer together than those in the less-accurate dataset? Or are the class centroids further apart? But calculating the distances hasn't been very helpful: the two datasets look extremely similar.
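For concreteness, the distance comparison described above could be sketched roughly as follows. This is only an illustration, not the code I actually ran: it assumes each dataset is a NumPy array `X` with one row per example and one column per voxel, plus a two-class label vector `y`, and the function names `pairwise_distances` and `describe_dataset` are made up for this sketch.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distances between all distinct pairs of rows."""
    # Gram-matrix form avoids building an (n, n, voxels) array.
    sq_norms = (points ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    upper = np.triu_indices(len(points), k=1)  # each pair counted once
    # Clip tiny negative values caused by floating-point rounding.
    return np.sqrt(np.maximum(sq_dists[upper], 0.0))

def describe_dataset(X, y):
    """Mean within-class distance per class, and the distance between centroids.

    Assumes exactly two classes, as in the dataset discussed here.
    """
    labels = np.unique(y)
    within = {label: pairwise_distances(X[y == label]).mean() for label in labels}
    centroids = [X[y == label].mean(axis=0) for label in labels]
    between = np.linalg.norm(centroids[0] - centroids[1])
    return within, between

# Toy example: two examples per class, two "voxels".
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
y_toy = np.array([0, 0, 1, 1])
within_toy, between_toy = describe_dataset(X_toy, y_toy)
print(within_toy, between_toy)
```

Running `describe_dataset` on each testing set separately gives the within-class spread and centroid separation to compare between the two.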

Perhaps I'm running into the phenomenon described by Hall et al. (2005) and Durrant & Kaban (2009): as the number of dimensions grows, the pairwise distances between points can all asymptotically approach the same value, so that the points become nearly equidistant. Durrant & Kaban also discuss "distance concentration": some datasets have all points quite evenly spaced while others (also HDLSS) do not, and the datasets with less concentration are easier to classify and characterize. Perhaps looking at the distance concentration of fMRI datasets could help explain why some datasets are extremely difficult to classify, or show very low classification accuracy ("anti-learning").
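A quick simulation shows the flavor of this effect. The sketch below is my own illustration under strong assumptions: the "data" are independent standard-normal noise rather than fMRI images, and the summary statistic (standard deviation of the pairwise distances divided by their mean, computed by a hypothetical helper `relative_spread`) is just one simple way to look at concentration, not necessarily the measure used in either paper.

```python
import numpy as np

def relative_spread(points):
    """Std / mean of the pairwise Euclidean distances between rows."""
    sq_norms = (points ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    upper = np.triu_indices(len(points), k=1)  # each pair counted once
    dists = np.sqrt(np.maximum(sq_dists[upper], 0.0))
    return dists.std() / dists.mean()

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n_samples = 20                  # "low sample size"
spreads = {}
for n_dims in (2, 20, 200, 2000, 20000):
    # Assumption: i.i.d. Gaussian noise stands in for real images.
    points = rng.standard_normal((n_samples, n_dims))
    spreads[n_dims] = relative_spread(points)
    print(f"{n_dims:>6} dimensions: std/mean of distances = {spreads[n_dims]:.4f}")
```

In this noise-only setting the ratio shrinks steadily as the number of dimensions grows, which is the "all points equidistant" picture. Computing the same ratio on real testing sets would be one rough way to ask whether one dataset is more concentrated than another.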

One other discussion that struck me in Hall, Marron, & Neeman (2005) was the observation that many different classifiers can perform similarly on HDLSS datasets. Many people use linear SVMs as a sort of default for fMRI MVPA, partly because comparisons with other classifiers don't usually show a large difference in performance. Perhaps these HDLSS phenomena are part of the reason why performance can be so similar.