Big Numbers

I mention this Discover article because it’s really an applied math story being popularized as a biology story. Noam Sobel and his team from the Weizmann Institute made a database of thousands of smell-producing molecules, along with their properties (like size, or how tightly the molecules pack). They then ran Principal Components Analysis on the data to find the critical properties, and observed that a single dimension, or score, encapsulated most of the variation, and could even predict which smells humans would find more and less appealing.

What I want to harp on is the use of PCA. To those of us who actually do math and science full time, it’s a pretty familiar, standard technique. It’s over a hundred years old. It’s very simple. It doesn’t even work in the difficult cases (a lot of modern random matrix theory is about identifying when PCA doesn’t work). But I don’t think Principal Components Analysis (or multidimensional eigenvector analysis generally) gets the attention it deserves in the popular imagination. Here, given a glut of data, is a tool that can tell you which ways of measuring it are the most useful. We’re used to science giving us measurements, but this is science telling us which measurements are important. Sobel’s research isn’t really a biological discovery at all; it’s a data analysis discovery.
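To make "a single dimension encapsulated most of the variation" concrete, here is a minimal sketch of PCA in Python with NumPy. It is not Sobel's actual analysis: the data is synthetic, and the hidden score, the loadings, and the noise level are all assumptions I made up for illustration. I also skip standardizing the features, which you would usually want when real properties come in different units.

```python
import numpy as np

def pca(X):
    """PCA via SVD. X has shape (n_samples, n_features).

    Returns (components, explained_variance_ratio); the rows of
    `components` are the principal directions, strongest first.
    """
    Xc = X - X.mean(axis=0)                      # center each feature
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = S**2 / (X.shape[0] - 1)                # variance along each direction
    return Vt, var / var.sum()

# Synthetic data (an assumption, not real measurements): four "properties"
# that are all driven by one hidden score, plus a little independent noise.
rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=n)                      # the hidden one-dimensional score
loadings = np.array([2.0, -1.5, 1.0, 0.5])       # how each property depends on it
X = np.outer(latent, loadings) + 0.3 * rng.normal(size=(n, 4))

components, ratio = pca(X)
print("fraction of variance per component:", np.round(ratio, 3))
print("first principal direction:", np.round(components[0], 3))
```

If the first entry of the printed ratio dominates, one axis is carrying most of the variation, and the first principal direction tells you which combination of the original measurements that axis is (up to an overall sign, which PCA does not fix).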

And when you’re used to thinking in the language of dimensionality reduction, you suddenly see where it’s needed but missing: a gap, like “Oh, this really should be a dimensionality reduction problem.” For instance, I was reading some of Simon Baron-Cohen’s research about autism; for the unfamiliar, his theory is that people lie on a spectrum between “systematizing” and “empathizing” thoughts, with autistics on the “systematizing” extreme, much better at logical puzzles than interpersonal communication. I was reading his papers and I immediately thought “Something’s missing here.” If I wanted to know whether a single axis distinguished autistics from non-autistics, well, I’d look at all kinds of neurological and psychological properties, get a big cloud of data from autistics and non-autistics, and see if most of the variation was due to this empathizing/systematizing business.
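Here is roughly what that check could look like, as a sketch in Python with NumPy. Everything in it is hypothetical: the data is synthetic, the five features are unnamed, and the "hypothesized axis" is just a direction I picked to stand in for an empathizing/systematizing score. It says nothing about real autism data. The idea is to ask two things: how much of the total variance lies along the proposed axis, and does that axis line up with the first principal component?

```python
import numpy as np

def variance_fraction_along(X, axis):
    """Fraction of the total variance of X that lies along `axis`."""
    Xc = X - X.mean(axis=0)
    u = axis / np.linalg.norm(axis)              # unit vector for the axis
    projected = Xc @ u                           # each sample's score on the axis
    return projected.var(ddof=1) / Xc.var(axis=0, ddof=1).sum()

def first_principal_direction(X):
    """Unit vector along which X varies the most."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

# Synthetic cloud (an assumption): two equal groups that differ only
# along one direction, plus unit noise in all five features.
rng = np.random.default_rng(1)
n = 300
group = np.repeat([0.0, 1.0], n // 2)            # 0 = one group, 1 = the other
hypothesized = np.array([1.0, -1.0, 0.0, 0.0, 0.0])   # hypothetical axis
unit = hypothesized / np.linalg.norm(hypothesized)
X = np.outer(6.0 * group, unit) + rng.normal(size=(n, 5))

fraction = variance_fraction_along(X, hypothesized)
alignment = abs(first_principal_direction(X) @ unit)   # 1.0 means same direction
print("variance along hypothesized axis:", round(float(fraction), 3))
print("alignment with first principal component:", round(float(alignment), 3))
```

In this made-up cloud the proposed axis carries most of the variance and nearly coincides with the first principal component, because I built it that way. With real data either number could come out small, and that is exactly the thing I would want to see reported.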

Now, I’m not remotely a psychologist, but it looks like Baron-Cohen doesn’t do that. He gives out a survey, observes that autistics are more systematizing and less empathizing than non-autistics, and essentially says “Ta-Da!” And I go, “Huh?” How on earth does he know that this is the main difference between autistics and non-autistics? Where are all the correlations and variances? Now, maybe that’s standard practice in psychology. I’ve read a bit more in the social sciences, and it’s standard practice there. But to me, this kind of research is just crying out for a little data analysis. I think there are whole areas of research where people aren’t yet thinking in PCA. Maybe that should change.