Via Kevin Kelly. The Grail lab at the University of Washington made a 3-d movie of the Colosseum from tourist photos on Flickr. They call it “building Rome in a day.”

In this project, we consider the problem of reconstructing entire cities from images harvested from the web. Our aim is to build a parallel distributed system that downloads all the images associated with a city, say Rome, from Flickr.com. After downloading, it matches these images to find common points and uses this information to compute the three dimensional structure of the city and the pose of the cameras that captured these images. All this to be done in a day.

So, how do they do it? The challenge is to combine photographs taken at different angles and from different viewpoints, many of which look quite similar on the surface. The paper explains that the researchers treat each image as a “bag of words” (a collection of discrete visual features), and the similarity between two images is found by taking the inner product of the vectors that describe their features. They build a graph out of the images, with an edge connecting two images if their features are close.
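To make the bag-of-words idea concrete, here is a minimal sketch in Python. It is not the researchers' code: the image names, the visual-word IDs, and the 0.5 threshold are all made up for illustration, and a real system would get its words by quantizing local feature descriptors and would not compare every pair of images. Each image becomes a normalised count vector, the inner product gives a similarity score, and two images get an edge when the score clears the threshold.

```python
import math
from collections import Counter
from itertools import combinations

def bow_vector(words):
    """Turn a list of visual-word IDs into an L2-normalised sparse vector."""
    counts = Counter(words)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {w: c / norm for w, c in counts.items()}

def similarity(u, v):
    """Inner product of two sparse vectors (cosine similarity once normalised)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(x * v.get(w, 0.0) for w, x in u.items())

def match_graph(images, threshold=0.5):
    """Return the set of image pairs whose similarity is at least `threshold`."""
    vectors = {name: bow_vector(words) for name, words in images.items()}
    edges = set()
    for a, b in combinations(sorted(vectors), 2):
        if similarity(vectors[a], vectors[b]) >= threshold:
            edges.add((a, b))
    return edges

# Hypothetical data: each image is a list of visual-word IDs.
images = {
    "colosseum_1": [3, 3, 7, 12, 12, 12],
    "colosseum_2": [3, 7, 7, 12, 12],
    "trevi_1": [21, 21, 40, 41],
}
edges = match_graph(images)
print(edges)
```

The two Colosseum photos share most of their words and end up connected, while the Trevi photo shares none and stays isolated. Note that nothing in these vectors records where in the image a word appeared.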

Reconstructing the actual 3-d structure is done using techniques of Structure from Motion. A nice tutorial is here. Essentially, given two (or three) images, we can compute the matrix (or tensor) that relates points in one image to the corresponding points in another.
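For two views, the matrix in question is usually called the fundamental matrix (or the essential matrix, when the cameras are calibrated). The defining property is the epipolar constraint: if x1 and x2 are the homogeneous image coordinates of the same 3-d point in the two views, then x2ᵀ E x1 = 0. The sketch below only checks that constraint on a toy setup. The assumptions are mine, not the paper's: calibrated pinhole cameras, no rotation between the views, and a known translation t, so that E is just the cross-product matrix of t. Estimating E from real matches is the harder part and is not shown.

```python
def cross_matrix(t):
    """Matrix [t]_x such that [t]_x v equals the cross product t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]

def project(point):
    """Pinhole projection to homogeneous image coordinates (focal length 1)."""
    x, y, z = point
    return (x / z, y / z, 1.0)

def epipolar_residual(E, x1, x2):
    """Compute x2^T E x1; this is zero for a true correspondence."""
    Ex1 = [sum(E[i][j] * x1[j] for j in range(3)) for i in range(3)]
    return sum(x2[i] * Ex1[i] for i in range(3))

# Assumed setup: second camera frame is X2 = X1 + t (identity rotation),
# so the essential matrix reduces to E = [t]_x.
t = (1.0, 0.0, 0.0)
E = cross_matrix(t)

points = [(0.5, -0.2, 4.0), (-1.0, 0.7, 6.0), (2.0, 1.5, 3.0)]
residuals = []
for X in points:
    x1 = project(X)
    x2 = project((X[0] + t[0], X[1] + t[1], X[2] + t[2]))
    residuals.append(epipolar_residual(E, x1, x2))

# A deliberately wrong pairing violates the constraint.
bad_residual = epipolar_residual(E, (0.0, 0.0, 1.0), (0.0, 0.5, 1.0))
print(residuals, bad_residual)
```

Each correspondence gives one equation of this form, which is why a handful of matched points between two images is enough to pin the matrix down up to scale.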

If all this sounds like a similar problem to cryo-EM, it’s because it is. We’re getting 3-d structure from 2-d images. The main difference is that, while cryo-EM was based around the group SO(3), the symmetries of the sphere, this Rome project is based on the Euclidean group, the rigid motions in space. The image on a visual camera depends on the distance from the object as well as the two spherical angles; the image of a protein from an electron microscope does not depend on distance, as it’s simply a line integral through the protein.
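The distance point can be seen in a few lines of Python. This is a toy comparison under my own simplifying assumptions: an ideal pinhole camera with focal length 1 on one side, and on the other a parallel projection that just drops the depth coordinate, standing in for where a line integral along the beam would land on the detector.

```python
def perspective(point, focal=1.0):
    """Pinhole camera: image position scales with 1 / depth."""
    x, y, z = point
    return (focal * x / z, focal * y / z)

def parallel(point):
    """Parallel projection along z: depth has no effect on image position."""
    x, y, _ = point
    return (x, y)

# The same offset from the optical axis, seen at two different depths.
near = (1.0, 2.0, 5.0)
far = (1.0, 2.0, 10.0)

print(perspective(near), perspective(far))  # the far point lands closer to the centre
print(parallel(near), parallel(far))        # identical in both cases
```

Moving the object twice as far from the pinhole camera halves its image, so translations matter and the full Euclidean group shows up; under parallel projection only the orientation matters.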

The computer science approach is fascinating, but it treats building the graph and reconstructing the 3-d model as two separate problems. The features are simply “words” encoding no spatial information, and nothing exploits the relation between the underlying group and the graph of images. I don’t know if putting a little representation theory into this project would make it more effective in practice (the model already looks pretty good), but it would be mathematically prettier. By that I mean the researchers are taking features that have real physical significance as objects in space, then throwing that information away by regarding them as arbitrary words and building a graph that encodes no spatial information. The Structure from Motion part, as I understand it, can only be handled by looking at two or three images at a time, so it seems we’ve thrown away a lot of the structure. To my eyes at least, it would be more elegant to exploit the fact that the data is organized around a real physical structure. As always, though, take my musings with a grain of salt: I’m an ignorant beginner.