Saturday, December 15, 2012

Another look at MNIST

I'm a bit obsessed with MNIST.
Mainly because I think it should not be used in any papers any more - it is weird for a lot of reasons.
When preparing the workshop we held yesterday I noticed one that I wasn't aware of yet: most of the 1-vs-1 subproblems, are really easy!

Basically all pairs of numbers can be separated perfectly using a linear classifier!
And even you you just do a PCA to two dimensions, they can pretty much still be linearly separated! It doesn't get much easier than that. This makes me even more sceptical about "feature learning" results on this dataset.

To illustrate my point, here are all pairwise PCA projections. The image is pretty huge. Otherwise you wouldn't be able to make out individual data points.
You can generate it using this very simple gist.

There are some classes that are not obviously separated: 3 vs 5, 4 vs 9, 5 vs 8 and 7 vs 9. But keep in mind, this is just a PCA to two dimensions. It doesn't mean that they couldn't be separated linarly in the original space.

Interestingly the "1"s are very easy to identify, even with seven and nine there is basically no way to confuse them. The ones have a somewhat peculiar shape, though. It would be fun to see what a tour along the "bow" (see img at [2, 2]) would look like.
Manifold-people should be delighted ;)

I think this plot emphasizes again: look at your data!
I hope you enjoyed this perspective.

What we see from here is that _most_ of the samples are easily separable, but not all of them. This is very common in practice that given some realistic dataset you can easily reach 90% recognition rate with the simplest algorithm you have, but every next percent is very difficult to achieve.

I think it's a valid point. MNIST is really not that challenging compared to a "generic" classification problem, because the classes form nice clusters. I am looking for a problem that had more convex regions, and that is probably what happens when you aggregate 0-4 and 5-9 as pointed out before.

One important thing to notice here is that you re-calculated the PCA for each pair here. So you found the one single optimal direction to classify between each pair. But what that means in the original domain is that you need all the 10*9 = 90 optimal features to make your classification. And indeed, 90 components seem to be the point you start to "explain" less than 90% of the variance. Having even more possible negative samples also probably makes the problem a little more challenging, e.g. mixing some letters or CIFAR images. It's more noise to filter out...