
Earthmover Distance

Problem: Compute distance between points with uncertain locations (given by samples, or differing observations, or clusters).

For example, if I have the following three “points” in the plane, as indicated by their colors, which is closer, blue to green, or blue to red?

It’s not obvious, and there are multiple factors at work: the red points have fewer samples, but we can be more certain about the position; the blue points are less certain, but the closest non-blue point to a blue point is green; and the green points are equally plausibly “close to red” and “close to blue.” The centers of mass of the three sample sets are close to an equilateral triangle. In our example the “points” don’t overlap, but of course they could. And in particular, there should probably be a nonzero distance between two points whose sample sets have the same center of mass, as below. The distance quantifies the uncertainty.

All this is to say that it’s not obvious how to define a distance measure that is consistent with perceptual ideas of what geometry and distance should be.

Solution (Earthmover distance): Treat each sample set $A$ corresponding to a “point” as a discrete probability distribution, so that each sample $a \in A$ has probability mass $1/|A|$. The distance between $A$ and $B$ is the optimal solution to the following linear program.

Each $a \in A$ corresponds to a pile of dirt of height $1/|A|$, and each $b \in B$ corresponds to a hole of depth $1/|B|$. The cost of moving a unit of dirt from $a$ to $b$ is the Euclidean distance $d(a, b)$ between the points (or whatever hipster metric you want to use).

Let $z_{a,b}$ be a real variable corresponding to an amount of dirt to move from $a \in A$ to $b \in B$, with cost $d(a, b)$. Then the constraints are: each $z_{a,b} \geq 0$; every pile must be emptied, $\sum_{b \in B} z_{a,b} = 1/|A|$ for each $a$; and every hole must be filled, $\sum_{a \in A} z_{a,b} = 1/|B|$ for each $b$. The objective is to minimize the total cost $\sum_{a, b} d(a, b)\, z_{a,b}$.
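This LP can be handed straight to an off-the-shelf solver. Here is a minimal sketch using `scipy.optimize.linprog` (the function name `earthmover_distance` and the flattening scheme are my own choices, not from the post):

```python
import numpy as np
from scipy.optimize import linprog

def earthmover_distance(A, B):
    """Earthmover distance between two sample sets in the plane,
    each treated as a uniform discrete distribution."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    n, m = len(A), len(B)

    # cost[i, j] = Euclidean distance from pile A[i] to hole B[j]
    cost = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

    # variables z[i, j] flattened row-major as z[i * m + j];
    # one equality row per pile ("empty pile i") and per hole ("fill hole j")
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # sum_j z[i, j] = 1/n
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # sum_i z[i, j] = 1/m
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])

    result = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return result.fun
```

For example, a single pile at the origin and a single hole at $(3, 4)$ gives distance $5$, and two half-piles at $0$ and $2$ feeding one hole at $1$ gives total cost $1$.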

Discussion: I’ve heard about this metric many times as a way to compare probability distributions. For example, it shows up in an influential paper about fairness in machine learning, and a few other CS theory papers related to distribution testing.

One might ask: why not use other measures of dissimilarity for probability distributions (Chi-squared statistic, Kullback-Leibler divergence, etc.)? One answer is that these other measures only give useful information for pairs of distributions with the same support. An example from a talk of Justin Solomon succinctly clarifies what Earthmover distance achieves.
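The disjoint-support failure is easy to see concretely. In this sketch (using scipy's 1-d `wasserstein_distance` and `entropy`, the latter of which computes KL divergence when given two distributions), KL blows up to infinity no matter how close the supports are, while the Earthmover distance shrinks smoothly:

```python
import math
from scipy.stats import entropy, wasserstein_distance

# Two unit point masses with disjoint supports: one at 0, one at delta.
delta = 0.5
emd = wasserstein_distance([0.0], [delta])  # mass slides a distance of delta
kl = entropy([1.0, 0.0], [0.0, 1.0])        # infinite: supports are disjoint
print(emd, kl)
```

As `delta` shrinks, `emd` goes to zero while `kl` stays infinite, so only the Earthmover distance reflects the geometry of "how far apart" the two distributions are.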

Also, why not just model the samples using, say, a normal distribution, and then compute the distance based on the parameters of the distributions? That is possible, and in fact makes for a potentially more efficient technique, but you lose some information by doing this. Ignoring that your data might not be approximately normal (it might have some curvature), with Earthmover distance, you get point-by-point details about how each data point affects the outcome.
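To illustrate the parametric shortcut: for the 2-Wasserstein variant of this distance (not the linear-program version above), two one-dimensional normal distributions admit a closed form in their parameters alone. A sketch, with the function name mine:

```python
import math

def gaussian_w2(mu1, sigma1, mu2, sigma2):
    # Closed-form 2-Wasserstein distance between two 1-d normals:
    # W2(N(mu1, s1^2), N(mu2, s2^2))^2 = (mu1 - mu2)^2 + (s1 - s2)^2.
    # Fast, but only the fitted parameters survive; the point-by-point
    # detail of the samples is gone.
    return math.hypot(mu1 - mu2, sigma1 - sigma2)
```

This is the trade-off in miniature: constant-time evaluation from two fitted parameters, versus solving an LP over every pair of samples.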

This has the potential to be useful in redistricting because of the nature of the redistricting problem. As I wrote previously, discussions of redistricting are chock-full of geometry—or at least geometric-sounding language—and people are very concerned with the apparent “compactness” of a districting plan. But the underlying data used to perform redistricting isn’t very accurate. The people who build the maps don’t have precise data on voting habits, or even locations where people live. Census tracts might not be perfectly aligned, and data can just plain have errors and uncertainty in other respects. So the data that district-map-drawers care about is uncertain much like our point clouds. With a theory of geometry that accounts for uncertainty (and the Earthmover distance is the “distance” part of that), one can come up with more robust, better tools for redistricting.

Solomon’s website has a ton of resources about this, under the names of “optimal transport” and “Wasserstein metric,” and his work extends beyond computing distances to important geometric quantities like the barycenter, and to computational advantages like parallelism.

Others in the field have come up with transparency techniques to make it clearer how the Earthmover distance relates to the geometry of the underlying space. This one is particularly fun because the explanations result in a path traveled from the start to the finish, and by setting up the underlying metric in just such a way, you can watch the distribution navigate a maze to get to its target. I like to imagine tiny ants carrying all that dirt.

Finally, the work of Shirdhonkar and Jacobs provides approximation algorithms that allow linear-time computation, instead of the worst-case cubic runtime of a linear solver.

hi. non-math comment/question. the rendering of the math is pretty bad on my linux laptop (with both webkit-based and firefox browsers), but fine on my ipad. mathjax normally renders well on my laptop. i’m curious how you produce your html? no mathjax? cheers.

I would expect an optimal solution to this kind of problem to have some property about encoding of positions. If I apply a function f(x,y) to every point, then how does the distance change? Here f is some perturbation: translation, rotation, change of scale, symmetry.

What are the properties of the earth-mover distance with respect to a family of perturbation of the data?

A good distance should be stable with respect to noise with small variance. If you have prior knowledge of the distribution of the noise, then the optimal distance seems to have the property that the variance of the distance between the perturbed points is minimal, that is, the variance of X = sum(d((xi+nxi, yi+nyi), (xj+nxj, yj+nyj))) is minimal, where nxi is the noise associated with xi, for example nxi = random_normal(0, sigma) if the noise doesn’t depend on the coordinates of the points.

I agree with your sentiment. For rigid transformations of the data it seems clear to me that the distance is unchanged. Scaling all the underlying point distances by a constant should scale the distance by that same constant (with discrepancies depending on whether points overlap?)
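The scaling claim is easy to spot-check numerically in one dimension, where scipy’s `wasserstein_distance` computes the Earthmover distance between empirical samples:

```python
from scipy.stats import wasserstein_distance

# Empirical check of scale equivariance: scaling every coordinate by c
# should scale the Earthmover distance by the same factor c.
a, b = [0.0, 1.0], [2.0, 5.0]
c = 3.0
base = wasserstein_distance(a, b)
scaled = wasserstein_distance([c * x for x in a], [c * x for x in b])
print(base, scaled)  # scaled should equal c * base
```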

I don’t know any theorems about noise models off-hand, but I would be surprised if small noise caused a big change in the Earthmover distance. I tried to look for theorems in the literature about this, but the problem is that “stability” and “perturbation” turn up wildly different measure-theoretic convergence theorems that I think are unrelated to this question, and that probably mask more relevant work.

Noise related: “Fast and Robust Earth Mover’s Distances,” http://leibniz.cs.huji.ac.il/tr/1143.pdf. They use a saturated distance, d_t(a,b) = min(d(a,b), t), which provides robustness against noise; they prove it is a distance and give a fast algorithm with applications.
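The thresholded ground metric is one line to sketch (the helper names here are mine, not from the paper):

```python
def saturated(d, t):
    # d_t(a, b) = min(d(a, b), t): caps the per-unit cost of moving dirt,
    # so a handful of far-away noise points can't dominate the total.
    return lambda a, b: min(d(a, b), t)

def euclidean_1d(a, b):
    return abs(a - b)

d_t = saturated(euclidean_1d, 2.0)
print(d_t(0.0, 10.0), d_t(0.0, 1.0))  # distant outlier capped at 2.0
```

Plugging `d_t` in as the ground metric of the LP (in place of plain Euclidean distance) gives the robust variant.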

Thank you for an interesting read. The EM distance was all the rage, about a year ago, in the context of training Generative Adversarial Networks – GANs (google for Wasserstein GAN or WGAN).

EM was used in training GANs for the same reason mentioned here: dealing with distributions that (might) have non-overlapping supports. Interestingly, EM was not calculated directly, but rather via its dual problem. I am not an expert on the mathematics, but those interested may find a thorough and well-written discussion here: https://vincentherrmann.github.io/blog/wasserstein/
