Pages

Tuesday, January 25, 2011

ALENEX: Experiments with Johnson-Lindenstrauss

I'm three days behind on my postings, so I have the luxury of looking back and attempting a larger perspective.

As Dana Randall put it at the business meeting, this is the Johnson-Lindenstrauss SODA. And it seems apropos to start with our paper at ALENEX. This was work done by my student Qiushi Wang, who's applying for grad schools (Admit him ! or email me for more info!)

The Johnson-Lindenstraus Lemma is one of the most powerful tools in the theory of metric embeddings and dimensionality reduction. Simply put, it says that given any set of $n$ points in a Euclidean space, there exists a linear mapping into roughly $O(log n)$ dimensional Euclidean space that preserves all distances approximately.

There's a long series of proofs of this lemma: all of them yield essentially the same bound on the number of dimensions and the same dependence on the error term, and so the main efforts have focused on improving the running time of the mapping itself. If we're mapping from d dimensions to k, a linear mapping can take time $O(kd)$ per point, and this is the time to beat.

There are two strands of research along these lines: the first family of methods tries to speed things up by sparsifying the projection matrix to speed up the transformation. You can make the matrix quite sparse this way, but there's a limit on what you can do, because if the input vector being projected is itself quite sparse, then the resulting vector has mostly zeros in it, destroying its norm (and any hope of preserving distances)

The trick, which leads to the second strand of research, is to "precondition" the input. The idea is quite elegant: if you apply what is essentially a random rotation to the vector, it becomes dense w.h.p, where density intuitively means that no one coordinate is very large (we assume unit norm vectors w.l.o.g). Once you do this, the resulting projection matrix can be made quite sparse.

There's a catch though: you're now using two matrices instead of one, so you end up spending d^2 time on the first part, which dominates the original $kd$ time. The second trick you need then is a special random rotation that can be applied very quickly. Essentially, you need the walsh-hadamard transform. This is the core idea behind the 2006 paper by Ailon and Chazelle, and there's been much work since on improving the bounds and the preconditioner construction. A third line of work combines the two strands is to sparsify (by subsampling) a special code matrix that has a "preconditioning" effect.

But in all of this, no one has really looked at the actual behavior of these algorithms in practice. There are a number of reasons to do this: first of all $O(\log n)$ dimensions isn't so hot if the constant is large. Secondly the algorithm is randomized, which tends to give practitioners the heebie-jeebies. And finally, the dizzying array of algorithm options available is just plain confusing.

The constant in the dimension of the embedding is small. It's essentially 1 * log P/epsilon^2, where P is the number of "norm probes" you require (P = n^2 for distances and n for norms). This is good, because it means that there are no hidden large constants.

The quality of all algorithms is basically the same, and is very consistent. In other words, the fact that JL is randomized (which often causes a lot of concern in practice), is not a problem for its use (unless you working in a distributed environment and need to share randomness - pointed out to me by TS Jayram).

The distortion error itself is very nicely concentrated (normally) around 1. Unless you have highly clustered data, in which case the distortion distribution looks like a superposition of shifted Gaussians, one for each cluster center.

Since all algorithms behave essentially the same on quality, speed is the main differentiator. Here, the 'best in class' depends heavily on what you know about the data. For dense data, you can be pretty sparse (as predicted by some of the papers) and the embedding is fast. For sparse data, it turns out that at least in MATLAB, and for small dimensions, the dense method work better (a little ironic considering that much of recent work was designed to deal with the sparse case). This is because of MATLAB's heavy optimization for dense matrix multiplication.

Of course, your dimensionality might be too high to store a dense matrix, or you might not even know what the data profile is like. In that case, preconditioning methods like the original Ailon/Chazelle method work great. and there are only small differences between the methods as d increases.

We're not even close to being done with our explorations: there are at least four or five new questions to explore based on feedback we got at SODA. But it's been an illuminating experience, and I've been heartened by all the interest the community has shown in this research, based on the feedback I got.