Last week’s applied math seminar was Arthur Szlam, talking about alternating methods for SVD. I’m putting up some notes here; several results are presented without proof.

Given an m x n matrix A, we want to find the best rank k approximation to A.
There is an explicit solution to this problem, known as the singular value decomposition. We write
where is diagonal, are orthogonal. The columns of are the eigenvectors of and the columns of are the eigenvectors of . This costs operations, and it’s a completely explicit solution. To get a rank-k approximation, we take
where the “hat” means taking the first k columns of each matrix.
This is the best rank-k approximation, both in the sense of Frobenius norm and operator norm.

However, if , time is wasteful. There are recent attempts to do this faster, in time.

We fix k, and a small oversampling parameter p, and let . Draw an random Gaussian matrix , and set .
Calculate the SVD, , and set to be the first k columns of , etc.

Why does this work?

Well, pretend A is really of rank 5, for instance. Hit A with a random matrix. It is unlikely that any vector will be in the kernel of A. Whatever's left is in the range. So we're very likely to have the full range of A; we probably preserve the SVD. However, in real life we don't have a known rank k matrix.

What we have is a guarantee of quality, in terms of the decay of the singular values:

Here the expected value is taken over the random Gaussian draw, the norm is the operator norm, and is the k + 1th singular value; if A is really rank k, this is zero, and the expected error is therefore zero. If is small, then the bound is good. We're in danger if m is large and is not super small.

This actually happens, unfortunately. For example, in term document matrices, where a collection of text documents form the rows, and words form the columns, with a word count in each matrix entry indicating how often a word was used in each document. (A little imagination should indicate the importance of these kinds of matrices for semantic analysis.) These matrices are sparse and large, but their singular values decay slowly. So what can we do here?

We hit by some more powers of A, to drive down the smaller eigenvalues. In practice, 10 or 15 times.
Now we have a new estimate, depending on the power q.

This works, but it can be memory intensive to save all the intermediate powers. This is where the "alternating" method comes in.

Here we look at a related problem: matrix completion. You're given a subset of the entries of a matrix A, and you know A is low rank. How do you fill in the rest of the entries? The simplest approach is alternating least squares.

Write where P is m-by-k and Q is n-by-k.
Alternate between updates of P and Q:
These are least-squares problems, and can be solved explicitly using

Coming back to our own problem, we can use the same formula to calculate .

This turns out to work well in practice; the example set was a database of labeled faces, which have the first 20 eigenvalues decaying fast, then the next decaying very slowly. This kind of alternating SVD works well for faces.

A very useful introduction to what’s going on recently in random matrix theory in Roman Vershynin’s slides here. What amazed me, seeing the big picture, was that so many of the fundamental results are quite recent. Universality of the Wigner semicircle law was only proven in 1973 and the circular law not till ’84. And everything I really want to know — what to do with sparse matrices, perturbed random matrices, Fourier matrices — is largely open.
More on what all this has to do with compressed sensing.

I just saw a talk by Raj Rao Nadakuditi about random matrix theory and the informational limit of eigen-analysis.

The issue is this. Lots of techniques for dealing with large data sets — PCA, SVD, signal detection — call for analyzing the eigenvectors and eigenvalues of a large matrix and reducing the dimensionality. But when do these friendly techniques fail us? It turns out that there is a phase transition, a signal-to-noise ratio below which we cannot distinguish the signal eigenvalues and eigenvectors from the noise. No matter how much data we collect, eigenvector-based dimensionality techniques will not distinguish signal from noise.

The simplest example Nadakuditi deals with is a Gaussian matrix perturbed by a rank-one matrix. is a Gaussian random matrix, is a normalized symmetric random matrix, and is the perturbed matrix, where is a rank-one perturbation.

What is the largest eigenvalue? What about the corresponding eigenvector? Well, the Wigner semicircle law tells us that the noise eigenvalues follow a semicircular distribution, and the signal eigenvalue falls outside the semicircle. If you look at the top eigenvalue as a function of , there’s a critical value of 1 where the top eigenvalue begins to escape the semicircle. Below that critical value, there’s no way to distinguish the signal eigenvalue.

Peche and Feral developed a closed-form formula for the top eigenvalue: if is greater than one, and otherwise.

But you can extend this to a broader range of situations. We can consider n x m matrices, and let
and
This models a sample-covariance matrix. Again we have a phase transition: if is greater than and

Nadakuti’s theorem is in a more general case. If we assume we have a distribution of noise eigenvalues that converges, as the size of the matrix becomes large, to some continuous distribution limited to the interval [a, b] then if is less than and b otherwise.

Here, refers to the Cauchy transform

The sketch of the proof is fairly simple. is an eigenvalue of iff 1 is an eigenvalue of

Equivalently,

Equivalently, the eigenvalues satisfy. Since the are assumed to be distributed uniformly on the sphere, this means that

So the moral of the story is that we can determine, rather specifically, at what point a set of data is too noisy for dimensionality reduction techniques to have any hope of accuracy. There are analogues to this in the work on sparse approximation of Donoho and Tanner, and in free probability theory.

Currently I’ve been interested in precisely the kinds of results mentioned in the article — I’m using them for my cryo-EM project. I don’t have much to add, since Gelman and Buchanan both do a beautiful job of exposition, except to say that this is the kind of writing about mathematics that I want to see more of. It’s written for a popular audience but doesn’t misrepresent the theory — moreover, it conveys the point that the math has implications for how we come to believe things. It’s not merely a toolbox, it’s a way of revising how we draw conclusions from data, which affects how we predict things like economic trends. And math also informs the other sciences about the limits of what they can infer.

Related is this recent paper, which deals with perturbations of random matrices — we can think of this as a signal matrix plus a noisy random matrix — and shows that if the signal is low rank and sufficiently strong compared to the noise, then the top eigenvalues do correspond to signal rather than noise. It’s a sort of optimistic counterpoint to the pessimistic results of Bouchaud and others mentioned in the article.