Big Numbers

Imagine the space of all policies, where one point in that space is the current status quo policy. To a first approximation, policy insight consists of learning which directions from that point are “up” as opposed to “down.” This space is huge, with thousands or millions of dimensions. And while some dimensions may be more important than others, because those changes are easier to implement or have a larger slope, there are a great many important dimensions.

In practice, however, most policy debate focuses on a few dimensions, such as the abortion rate, the overall tax rate, more versus less regulation, for or against more racial equality, or a pro versus anti US stance. In fact, political scientists Keith Poole and Howard Rosenthal are famous for showing that one can explain 85% of the variation in US Congressional votes by a single underlying dimension, where there are two separated clumps. Most of the remaining variation is explained by one more dimension. Similar results have since been found for many other nations and eras.

This sounds, to me, like the main insight of dimensionality reduction. How do you know you’ve picked a good basis? Is the set of coordinates you choose to measure actually the set of coordinates that most efficiently explain the data?

Maybe policy outcomes really are nearly clustered along one axis, and maybe that axis is the Democrat/Republican one. Possibly. But we’d have to check. That’s what rank estimation is for. (See Kritchman and Nadler, who do it with a matrix perturbation approach.)
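Checking is cheap, at least in toy form. Here's a sketch (Python, with entirely synthetic "votes" — the data, sizes, and noise level are all invented for illustration) of how one measures how much variance a single dimension explains, in the spirit of the Poole–Rosenthal result:

```python
import numpy as np

# Toy version of the Poole-Rosenthal observation: generate fake "votes"
# that mostly vary along one hidden axis, then check how much variance
# the leading principal component explains. All data here is synthetic.
rng = np.random.default_rng(0)
n_members, n_votes = 200, 50
ideology = rng.standard_normal(n_members)    # hidden 1-d positions
loadings = rng.standard_normal(n_votes)      # how each vote maps onto ideology
votes = np.outer(ideology, loadings) + 0.3 * rng.standard_normal((n_members, n_votes))

# PCA via SVD of the centered data matrix
centered = votes - votes.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
explained = s**2 / np.sum(s**2)
print(f"first component explains {explained[0]:.0%} of the variance")
```

With real roll-call data you would of course not know the hidden axis in advance; the point is that the spectrum tells you whether one exists.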

I’d like to see that particular insight trickle into society more broadly. There are objective ways to compare the usefulness of coordinate systems. Put another way, if you want to play Twenty Questions with the universe, some questions are better than others. And there is always a possibility that the ones we’re using aren’t so good.

When this year is over — and I’m simultaneously thrilled and nostalgic about leaving college — I’m going to become a little math bookworm. I have all kinds of goodies freshly ordered from Amazon. Hopefully I can arrive at grad school a little less ignorant.

Via Kevin Kelly. The GRAIL lab at the University of Washington made a 3-d movie of the Colosseum from tourist photos on Flickr. They call it “building Rome in a day.”

In this project, we consider the problem of reconstructing entire cities from images harvested from the web. Our aim is to build a parallel distributed system that downloads all the images associated with a city, say Rome, from Flickr.com. After downloading, it matches these images to find common points and uses this information to compute the three dimensional structure of the city and the pose of the cameras that captured these images. All this to be done in a day.

So, how do they do it? The challenge in this process is to combine photographs taken at different angles and viewpoints, which on the surface look quite similar. The paper explains that the researchers treat each image as a “bag of words” — discrete visual features — and distances between images are found by taking inner products between the vectors that describe their features. They build a graph out of the images, with an edge connecting them if their features are close.
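A minimal numeric sketch of that similarity graph, with made-up feature histograms standing in for real extracted features (the vocabulary size, counts, and threshold here are all my own inventions — a real pipeline would quantize SIFT-like descriptors against a learned vocabulary):

```python
import numpy as np

# Toy "bag of visual words": each image is a histogram over a vocabulary
# of quantized local features. The data here is synthetic.
rng = np.random.default_rng(1)
n_images, vocab_size = 6, 50
histograms = rng.poisson(1.0, size=(n_images, vocab_size)).astype(float)

# Normalize so that inner products become cosine similarities
histograms /= np.linalg.norm(histograms, axis=1, keepdims=True)

# Distance between images = inner product between their feature vectors
similarity = histograms @ histograms.T

# Connect two images with an edge if their features are close enough
threshold = 0.5
adjacency = (similarity > threshold) & ~np.eye(n_images, dtype=bool)
print(adjacency.astype(int))
```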

Reconstructing the actual 3d structure is done using techniques of Structure from Motion. A nice tutorial is here. Essentially, given two (or three) images we can compute the matrix (or tensor) that would transform one to another.

If all this sounds like a similar problem to cryo-EM, it’s because it is. We’re getting 3-d structure from 2-d images. The main difference is that, while cryo-EM was based around the group SO(3), the symmetries of the sphere, this Rome project is based on the Euclidean group, the rigid motions in space. The image on a visual camera depends on the distance from the object as well as the two spherical angles; the image of a protein from an electron microscope does not depend on distance, as it’s simply a line integral through the protein.

The computer science approach is fascinating, but the thing is, it treats building the graph and reconstructing the 3-d model as two separate problems. The features are simply “words” encoding no spatial information. There’s no exploitation of the relation between the underlying group and the graph of images. I don’t know if putting a little representation theory in this project would make it more effective practically — the model already looks pretty good — but it would be mathematically prettier. By that I mean, the researchers are taking features, which actually have physical significance, as objects in space, and throwing away all the physical information by regarding them as arbitrary words and building a graph that encodes no spatial information. The Structure from Motion part, as I understand, can only be handled by looking at two or three images at a time; so it seems we’ve thrown away a lot of the structure. And, to my eyes at least, it’s a more elegant approach to exploit the fact that the data is organized around a real physical structure. As always, though, take my musings with the grain of salt that I’m an ignorant beginner.

Alexander Bronstein et al have an interesting preprint about exploring the intrinsic symmetries of non-rigid shapes.

For non-rigid shapes, you can’t just define symmetry in terms of rigid motions; you need some way to identify intrinsic symmetries that remain constant up to distortions. This has sometimes been handled with geodesic distances: the eigenfunctions of the Laplace–Beltrami operator on the surface embed it in a feature space, transforming intrinsic symmetries into (approximate) Euclidean symmetries of the embedding. The problem is that this method is not robust to topological noise, i.e. noise that changes the connectivity of the surface. (Such noise is common, e.g. in protein modeling — I’ve often seen biologists bemoan the fact that a peninsula often gets confused for an island and vice versa.)

The authors suggest as a solution the diffusion distance, recently applied to data analysis by Lafon and Coifman, which measures the distance between two points in terms of the coefficients of the heat kernel. That is, we look at

$$k_t(x, y) = \sum_{i \ge 0} e^{-\lambda_i t}\, \phi_i(x)\, \phi_i(y),$$

where $\lambda_i$ are the eigenvalues and $\phi_i$ the eigenfunctions of the Laplace–Beltrami operator.

Then we compare points by the diffusion distance

$$d_t^2(x, y) = k_t(x, x) + k_t(y, y) - 2 k_t(x, y) = \sum_{i \ge 0} e^{-\lambda_i t}\, \big( \phi_i(x) - \phi_i(y) \big)^2.$$

The strength of the diffusion distance is that, while the geodesic distance only looks at the shortest path, the diffusion distance looks at all paths between two points, and concludes that they’re close together if there are many short paths between them. It can be thought of as the limit of diffusions on discrete graphs.
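A quick numeric illustration of that "many short paths" intuition on discrete graphs (the two toy graphs and the time parameter are my own choices): between the endpoints of a 3-node path there is one two-hop route, while between opposite corners of a 4-cycle there are two, so the diffusion distance is smaller in the cycle even though the hop distance is the same.

```python
import numpy as np

# Diffusion distance on a discrete graph:
# d_t^2(x, y) = sum_i exp(-lambda_i t) * (phi_i(x) - phi_i(y))^2,
# using eigenpairs of the graph Laplacian.
def diffusion_distance(L, t, x, y):
    lam, phi = np.linalg.eigh(L)   # eigenvalues / eigenvectors of the Laplacian
    diff = phi[x] - phi[y]         # eigenfunction values at the two nodes
    return np.sqrt(np.sum(np.exp(-lam * t) * diff**2))

# Path 0-1-2: exactly one two-hop route between the endpoints
L_path = np.array([[ 1, -1,  0],
                   [-1,  2, -1],
                   [ 0, -1,  1]], dtype=float)

# Cycle 0-1-2-3: two parallel two-hop routes between nodes 0 and 2
L_cycle = np.array([[ 2, -1,  0, -1],
                    [-1,  2, -1,  0],
                    [ 0, -1,  2, -1],
                    [-1,  0, -1,  2]], dtype=float)

# Same geodesic (hop) distance, but more paths => smaller diffusion distance
print(diffusion_distance(L_path, 1.0, 0, 2))
print(diffusion_distance(L_cycle, 1.0, 0, 2))
```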

Using this framework we can talk about approximate symmetries. These are distortion functions on the surface that leave the surface nearly invariant, as measured by the diffusion metric. We can also define the maximum asymmetry at a point $x$ under one of these approximate symmetries $g$:

$$a_g(x) = \max_{y} \big| d_t(x, y) - d_t(g(x), g(y)) \big|.$$
Local minima of the distortion function are good candidates for approximate symmetry.

The authors use this framework to compare different surfaces using the Gromov-Hausdorff metric. They do some experimental comparisons and find that the diffusion distance is much better at identifying approximate symmetries than competing metrics.

Without being an expert, I have the sense that much more can be done in this field to deal with topology. The paper doesn’t actually address topology itself. But the issue seems to be that in noisy images, pieces break off — peninsulas become islands and so on. I’ve seen a sort of hierarchical method of dealing with this: define the coarseness at which the homology changes, and make a branch there, and then check further (finer) degrees of coarseness, and make a tree that way. Then you have some metric for the similarity of trees that you can use to compare images.

The thing is, what these trees essentially tell you is something about the intrinsic geometry of a surface — after all, it’s the protrusions that are most likely to become islands at coarser levels of visualization. I’d guess there’s actually a theorem to be found (or already exists) relating these topological tree structures to geometric notions like geodesic distance or diffusion distance — something along the lines of “the closer the trees, the smaller the Gromov-Hausdorff distance between the surfaces.”

A very useful introduction to what’s going on recently in random matrix theory is in Roman Vershynin’s slides here. What amazed me, seeing the big picture, was that so many of the fundamental results are quite recent. Universality of the Wigner semicircle law was only proven in 1973 and the circular law not till ’84. And everything I really want to know — what to do with sparse matrices, perturbed random matrices, Fourier matrices — is largely open.
More on what all this has to do with compressed sensing.

I just saw a talk by Raj Rao Nadakuditi about random matrix theory and the informational limit of eigen-analysis.

The issue is this. Lots of techniques for dealing with large data sets — PCA, SVD, signal detection — call for analyzing the eigenvectors and eigenvalues of a large matrix and reducing the dimensionality. But when do these friendly techniques fail us? It turns out that there is a phase transition, a signal-to-noise ratio below which we cannot distinguish the signal eigenvalues and eigenvectors from the noise. No matter how much data we collect, eigenvector-based dimensionality techniques will not distinguish signal from noise.

The simplest example Nadakuditi deals with is a Gaussian matrix perturbed by a rank-one matrix. Let $X$ be an $n \times n$ Gaussian random matrix, let $W = (X + X^T)/\sqrt{2n}$ be the normalized symmetric random matrix, and let $\tilde{W} = W + \theta\, u u^T$ be the perturbed matrix, where $\theta\, u u^T$ is a rank-one perturbation of strength $\theta$.

What is the largest eigenvalue? What about the corresponding eigenvector? Well, the Wigner semicircle law tells us that the noise eigenvalues follow a semicircular distribution on $[-2, 2]$, and the signal eigenvalue falls outside the semicircle. If you look at the top eigenvalue as a function of $\theta$, there’s a critical value of 1 where the top eigenvalue begins to escape the semicircle. Below that critical value, there’s no way to distinguish the signal eigenvalue.

Féral and Péché developed a closed-form formula for the top eigenvalue: it converges to $\theta + 1/\theta$ if $\theta$ is greater than one, and to $2$ (the edge of the semicircle) otherwise.
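That formula is easy to check by simulation; here's a sketch (matrix size, seed, and the particular values of the perturbation strength are my own arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

def top_eigenvalue(theta):
    # Normalized symmetric Gaussian matrix: bulk spectrum fills [-2, 2]
    X = rng.standard_normal((n, n))
    W = (X + X.T) / np.sqrt(2 * n)
    # Rank-one perturbation of strength theta in a random direction
    u = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    return np.linalg.eigvalsh(W + theta * np.outer(u, u))[-1]

for theta in [0.5, 2.0, 4.0]:
    predicted = theta + 1 / theta if theta > 1 else 2.0
    print(f"theta={theta}: top eigenvalue {top_eigenvalue(theta):.3f}, "
          f"predicted {predicted:.3f}")
```

Below the critical value the top eigenvalue sticks to the bulk edge at 2, no matter how the perturbation is oriented.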

But you can extend this to a broader range of situations. Consider $n \times m$ matrices: let $X$ have i.i.d. Gaussian entries and let

$$S = \frac{1}{m} X X^T, \qquad c = n/m.$$

This models a sample-covariance matrix. Again we have a phase transition: if the population covariance has a single spiked eigenvalue $\ell > 1$, the top sample eigenvalue converges to $\ell \left( 1 + \frac{c}{\ell - 1} \right)$ when $\ell - 1$ is greater than $\sqrt{c}$, and to the bulk edge $(1 + \sqrt{c})^2$ otherwise.

Nadakuditi’s theorem covers a more general case. If we assume we have a distribution of noise eigenvalues that converges, as the size of the matrix becomes large, to some continuous distribution supported on the interval $[a, b]$, then the top eigenvalue of the perturbed matrix converges to $G^{-1}(1/\theta)$ (with $\theta$ the strength of the rank-one perturbation) if $1/\theta$ is less than $G(b^+)$, and to $b$ otherwise.

Here, $G$ refers to the Cauchy transform of the limiting noise spectrum $\mu$:

$$G(z) = \int_a^b \frac{d\mu(x)}{z - x}.$$

The sketch of the proof is fairly simple. $z$ is an eigenvalue of $W + \theta\, u u^T$ (but not of $W$) iff $1$ is an eigenvalue of

$$\theta \,(zI - W)^{-1} u u^T.$$

Equivalently, since $u u^T$ has rank one,

$$\theta\, u^T (zI - W)^{-1} u = 1.$$

Since the $u$ are assumed to be distributed uniformly on the sphere, $u^T (zI - W)^{-1} u$ concentrates around $\frac{1}{n} \operatorname{tr}\,(zI - W)^{-1} \to G(z)$, so the outlier satisfies $\theta\, G(z) = 1$, i.e. $z = G^{-1}(1/\theta)$.
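As a sanity check on the general recipe: for the semicircle law the Cauchy transform is $G(z) = (z - \sqrt{z^2 - 4})/2$ for $z > 2$, and inverting $G(z) = 1/\theta$ lands exactly at $z = \theta + 1/\theta$, recovering the rank-one formula above. A small numeric check:

```python
import numpy as np

# Cauchy (Stieltjes) transform of the Wigner semicircle law on [-2, 2],
# valid for real z > 2
def G(z):
    return (z - np.sqrt(z**2 - 4)) / 2

# Inverting G(z) = 1/theta should give z = theta + 1/theta
for theta in [1.5, 2.0, 5.0]:
    z = theta + 1 / theta
    print(theta, G(z), 1 / theta)  # G(z) equals 1/theta
```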

So the moral of the story is that we can determine, rather specifically, at what point a set of data is too noisy for dimensionality reduction techniques to have any hope of accuracy. There are analogues to this in the work on sparse approximation of Donoho and Tanner, and in free probability theory.

Currently I’ve been interested in precisely the kinds of results mentioned in the article — I’m using them for my cryo-EM project. I don’t have much to add, since Gelman and Buchanan both do a beautiful job of exposition, except to say that this is the kind of writing about mathematics that I want to see more of. It’s written for a popular audience but doesn’t misrepresent the theory — moreover, it conveys the point that the math has implications for how we come to believe things. It’s not merely a toolbox, it’s a way of revising how we draw conclusions from data, which affects how we predict things like economic trends. And math also informs the other sciences about the limits of what they can infer.

Related is this recent paper, which deals with perturbations of random matrices — we can think of this as a signal matrix plus a noisy random matrix — and shows that if the signal is low rank and sufficiently strong compared to the noise, then the top eigenvalues do correspond to signal rather than noise. It’s a sort of optimistic counterpoint to the pessimistic results of Bouchaud and others mentioned in the article.

I’m going to a seminar on representation theory taught by Shamgar Gurevich, in particular as it applies to the cryo-electron microscopy problem.

Cryo-EM is actually what I’ve been spending this whole year on, for my senior thesis. The short story is this. One method of determining protein structure is to encase the proteins in a thin film of ice, stick them in an electron microscope, and take pictures. (Really they aren’t photographs but measurements of electrical potential; but this gives a picture of the density of a cross section of the molecule.) The trouble is, these are two-dimensional images, and we want a three-dimensional model.

It would be straightforward to do this if we knew the direction the images were taken at; we could reconstruct the 3-d structure using the Fourier Projection Slice Theorem, the same way tomography works. But we don’t know the angles. Another hurdle is that experimentally, the images are extremely noisy, with an SNR of about 1/60. So, we need mathematics to come to the rescue. It turns out to be very pretty math, involving graph theory, the eigenvalues of large sparse matrices, and some random matrix theory.
For more thorough explanations, check out some of my advisor’s recent papers.

Anyhow, all that is by way of introduction. Cryo-EM is also intimately related to representation theory, since it deals with the symmetries of groups. For example, if the molecule happens to have internal symmetries, and many such molecules do, then the simplest version of the reconstruction algorithm falls apart. We need representation theory to explain generalizations of the problem.

So I’m learning. At the moment we’re only working with finite groups; today we proved Schur’s Lemma, which gives an important relationship — a little like orthogonality, to my mind — between irreducible finite-dimensional representations of a group. If the two representations are not isomorphic then there is no linear map between them that commutes with the action of the group; if they are isomorphic, then the linear map is a scalar operator.

For us, the motivation is that this allows us to diagonalize matrices. If an operator T from V to V is diagonalizable, and V has a group representation, and T commutes with the representation, then T preserves all the irreducible subrepresentations. This, and Schur’s lemma, allows us to conclude that T restricted to a subrepresentation is some (complex) eigenvalue multiple of the identity on that subrepresentation.
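A concrete instance of this, with a toy example of my own choosing: for the cyclic group $\mathbb{Z}/n$ acting on $\mathbb{C}^n$ by shifts, the irreducible subrepresentations are the one-dimensional Fourier modes, so Schur's lemma says any operator commuting with the shift acts as a scalar on each mode — that is, it is diagonalized by the DFT.

```python
import numpy as np

# Z/n acts on C^n by cyclic shifts; S is the group generator.
n = 8
S = np.roll(np.eye(n), 1, axis=0)

# Any polynomial in S (a circulant matrix) commutes with the representation.
T = 2 * np.eye(n) + 3 * S + 0.5 * np.linalg.matrix_power(S, 3)
assert np.allclose(T @ S, S @ T)

# Conjugate T into the Fourier basis: it should come out diagonal,
# acting as a scalar on each irreducible (Fourier) subrepresentation.
F = np.fft.fft(np.eye(n)) / np.sqrt(n)   # unitary DFT matrix
D = F @ T @ F.conj().T
off_diag = D - np.diag(np.diag(D))
print(np.max(np.abs(off_diag)))          # essentially zero: T is diagonal
```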

We like diagonalizing big matrices for cryo-EM because a key method is to make enormous matrices based on the correlations between images, and determine the viewing directions of the images from the eigenvectors of those matrices. These matrices are sparse, which makes computation easier, but they have to be very large to achieve accuracy. So methods of diagonalization — and methods of identifying irreducible subrepresentations, a related problem — are invaluable.

I learned this in my probability class and thought it was pretty neat.

The Dirichlet problem is a standard problem in differential equations.
Given a region $D$ in the plane, its boundary $\partial D$, and a function $f$ on the boundary, we want to solve for $u$ such that $u$ is harmonic on $D$, and $u = f$ on $\partial D$. (Harmonic just has the usual meaning that the Laplacian is zero: $\Delta u = u_{xx} + u_{yy} = 0$.)

The probabilistic approach is to take a Brownian motion $B_t$ starting at a point $x \in D$. Let $\tau$ be the first moment in time when $B_t \in \partial D$. Consider the function $u(x) = E_x[f(B_\tau)]$. We claim that this is a solution to the Dirichlet problem.

To show this we need to determine that:
1. $u$ is harmonic.
2. Under some condition on the boundary, $u(x) \to f(y)$ as $x \to y \in \partial D$.

To prove 1, we use a mean value property

$$u(x) = \frac{1}{2\pi} \int_0^{2\pi} u(x + r e^{i\theta})\, d\theta,$$

which just says that the average of the function around a circle is the value of the function at the center. This is a consequence of the strong Markov property for Brownian motion: the first hitting point on a small circle around $x$ is uniformly distributed, and the motion starts afresh from there.

Expanding $u$ in a Taylor series gives us

$$u(x + r e^{i\theta}) = u(x) + r\,(u_x \cos\theta + u_y \sin\theta) + \frac{r^2}{2}\,(u_{xx} \cos^2\theta + 2 u_{xy} \cos\theta \sin\theta + u_{yy} \sin^2\theta) + O(r^3).$$

Using the integral and noting that odd functions integrate to 0, we get

$$u(x) = u(x) + \frac{r^2}{4}\,(u_{xx} + u_{yy}) + O(r^3),$$

so $\Delta u = 0$, which shows that $u$ is harmonic.

To prove 2, we use the fact that for a given point in the plane, the probability of any Brownian motion passing through that point is zero.

We claim that we have convergence under the following condition: the boundary point $y$ is regular, meaning that Brownian motion started at $y$ leaves $D$ immediately (the first exit time is $0$ almost surely). Then we have $u(x) \to f(y)$ as $x \to y$.

To do this, we use some computations with integrals and inequalities; I’ve been trying to put this up but WordPress and LaTeX hate me and are giving me all kinds of errors, so this part will sadly have to be sans proof.

I like this little thing because it illustrates the relationship between harmonic functions and diffusion.
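To make the connection concrete, here's a Monte Carlo sketch (boundary data, step size, and walk count are all my own choices) that estimates the solution on the unit disk by averaging boundary values at Brownian exit points. With boundary data $f(x, y) = x$ the harmonic extension is just $u(x, y) = x$, so the estimate can be checked against the exact answer.

```python
import numpy as np

rng = np.random.default_rng(0)

def dirichlet_mc(x0, y0, n_walks=4000, step=0.05):
    # Run n_walks approximate Brownian motions from (x0, y0) until they
    # exit the unit disk, then average the boundary data f(x, y) = x
    # at the exit points.
    pos = np.tile([float(x0), float(y0)], (n_walks, 1))
    exit_vals = np.full(n_walks, np.nan)
    alive = np.ones(n_walks, dtype=bool)
    while alive.any():
        # Brownian increments, approximated by small Gaussian steps
        pos[alive] += step * rng.standard_normal((alive.sum(), 2))
        r = np.linalg.norm(pos, axis=1)
        hit = alive & (r >= 1.0)
        # First exit: evaluate f at the exit point, projected onto the circle
        exit_vals[hit] = pos[hit, 0] / r[hit]
        alive &= ~hit
    return exit_vals.mean()

print(dirichlet_mc(0.3, 0.2))  # should be close to the exact value 0.3
```

The discretization introduces a small bias at the boundary, but for a quick illustration of "harmonic = diffusion average" it does the job.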

(Note: this is my first time using LaTeX in WordPress! I’m so happy. I use Sitmo.)

Lo and behold, the Shannon entropy of Pictish inscriptions turned out to be what one would expect from a written language, and not from other symbolic representations such as heraldry. “The paper shows that the Pictish symbols are characters of a lexicographic written language,” Lee says, “as opposed to the most general form of writing, which includes things like the [non-verbal] instructions on your Ikea flat packs.”