This is a short introduction to the notion of the Wasserstein metric, presented in a way that explicitly highlights the earth mover intuition behind the definition.

1. the cost of moving one particle

Throughout the post, we fix a metric space $(X,d)$. Suppose a particle of mass $m$ is placed at position $x \in X$. The cost of moving this particle to a new position $y \in X$ is defined to be $m\,d(x,y)$. With this definition, you see:

The cost of moving a particle from a place $x$ to the same place $x$ is zero.

The cost of moving a particle is proportional to its mass. (This makes sense: a particle of mass $m$ at position $x$ is and should be cost-wise equivalent to two particles each of mass $m/2$ at the same position $x$.)

The cost of moving a particle of unit mass from $x$ to $y$ is $d(x,y)$.
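In code, this cost model is a one-liner. Here is a minimal sketch, assuming $X$ is the Euclidean plane with its usual distance (the masses and positions are made-up example data):

```python
import math

def move_cost(mass, x, y):
    """Cost of moving a particle of `mass` from x to y: mass * d(x, y).
    Here X is the Euclidean plane, so d is the Euclidean distance."""
    return mass * math.dist(x, y)

# Moving a particle to where it already sits costs nothing.
print(move_cost(0.5, (0.0, 0.0), (0.0, 0.0)))  # 0.0

# Cost is proportional to mass: one particle of mass 1 costs the same
# as two particles each of mass 0.5 moved along the same route.
whole = move_cost(1.0, (0.0, 0.0), (3.0, 4.0))
halves = 2 * move_cost(0.5, (0.0, 0.0), (3.0, 4.0))
print(whole, halves)  # 5.0 5.0
```

The second check is the mass-proportionality property above: splitting a particle in two before moving it does not change the total cost.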

2. The cost of moving a finite collection of particles of total mass 1.

Imagine a configuration of three particles of total mass 1 placed in $X$. For each $i = 1,2,3$, the particle $i$ has mass $m_i$ and is placed at $x_i$. Since the total mass is 1, we require $m_1 + m_2 + m_3 = 1$. The mass distribution of this configuration may be written as $\mu = m_1 \delta_{x_1} + m_2 \delta_{x_2} + m_3 \delta_{x_3}$, where $\delta_x$ denotes the Dirac delta function centered at $x$. (Alternatively, you could interpret the sum as a formal sum, or use Dirac measures instead.)

Now consider another configuration, this time of four particles of total mass 1, placed in the same space $X$. In this second configuration, for each $j = 1,2,3,4$, there is a particle of mass $m'_j$ at position $y_j$. The mass distribution of this configuration may be written as $\nu = m'_1 \delta_{y_1} + \cdots + m'_4 \delta_{y_4}$.

Given the first configuration (three particles), we can split some or all of the three particles into smaller particles of various smaller masses and move them in a way that results in the second configuration. There is more than one way to do this.

Start with the first configuration. Suppose, for each $i = 1,2,3$, we split the particle of mass $m_i$ at position $x_i$ into four smaller particles of masses $m_{i1}, m_{i2}, m_{i3}, m_{i4}$. The total number of split particles is now $3 \times 4 = 12$ and $m_i = m_{i1} + m_{i2} + m_{i3} + m_{i4}$ for each $i$. Move the split particle of mass $m_{ij}$ from the old position $x_i$ to the new position $y_j$. The total mass of all particles at the new position $y_j$ is $m_{1j} + m_{2j} + m_{3j}$, and we require this to be equal to $m'_j$, so that they merge to become a particle of mass $m'_j$ at position $y_j$.

Any matrix $(m_{ij})$ of nonnegative numbers such that $\sum_j m_{ij} = m_i$ for all $i$ and $\sum_i m_{ij} = m'_j$ for all $j$ specifies a split-move-merge procedure that transforms the first configuration into the second configuration.

The cost of a split-move-merge procedure is defined to be its total cost of moving: moving the particle of mass $m_{ij}$ from $x_i$ to $y_j$ costs $m_{ij}\,d(x_i, y_j)$, and so the total cost is the sum $\sum_{i,j} m_{ij}\,d(x_i, y_j)$.

We define the cost of transforming the mass distribution $\mu$ into the mass distribution $\nu$ to be the minimum possible cost over all split-move-merge procedures from the first configuration to the second configuration.

It is an easy check that this definition of distance between $\mu$ and $\nu$ does not depend on how you write $\mu$ and $\nu$. (In other words, whether you see a particle at position $x$ of mass $m$ or see two particles at the same position $x$ each of mass $m/2$ does not affect costs.)
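The minimization over all matrices $(m_{ij})$ is a finite linear program, so it can be solved directly. A sketch using SciPy's `linprog`, assuming $X$ is the real line with $d(x,y) = |x - y|$; the positions and masses are made-up example data, not from the post:

```python
import numpy as np
from scipy.optimize import linprog

# First configuration: masses 0.5, 0.3, 0.2 at positions 0, 1, 3.
xs, ms = np.array([0.0, 1.0, 3.0]), np.array([0.5, 0.3, 0.2])
# Second configuration: four particles of mass 0.25 at 0.5, 2.0, 2.5, 4.0.
ys, mps = np.array([0.5, 2.0, 2.5, 4.0]), np.array([0.25, 0.25, 0.25, 0.25])

n, k = len(xs), len(ys)
D = np.abs(xs[:, None] - ys[None, :])         # cost matrix d(x_i, y_j)

# Equality constraints on the flattened matrix (m_ij):
# row sums equal m_i, column sums equal m'_j.
A_eq, b_eq = [], []
for i in range(n):                            # sum_j m_ij = m_i
    row = np.zeros((n, k)); row[i, :] = 1
    A_eq.append(row.ravel()); b_eq.append(ms[i])
for j in range(k):                            # sum_i m_ij = m'_j
    col = np.zeros((n, k)); col[:, j] = 1
    A_eq.append(col.ravel()); b_eq.append(mps[j])

res = linprog(D.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=(0, None))               # m_ij >= 0
print(res.fun)                                # minimal split-move-merge cost: 1.35
```

On the real line the optimal cost can be cross-checked by hand as the area between the two cumulative distribution functions, which for this data is 1.35.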

3. coupling

Each split-move-merge procedure $(m_{ij})$ defines a distribution $\gamma = \sum_{i,j} m_{ij}\,\delta_{(x_i, y_j)}$ on $X \times X$. This distribution is a coupling of $\mu$ and $\nu$ in the sense that it projects to $\mu$ and $\nu$. Indeed, if you project it to the first coordinate, you would get $\sum_i \big(\sum_j m_{ij}\big)\,\delta_{x_i} = \sum_i m_i\,\delta_{x_i} = \mu$.

You could think of the coupling $\gamma$ as a distribution on the space of arrows, where you think of each $(x,y) \in X \times X$ as an abstract arrow from $x$ to $y$.

The integral of the function $d(x,y)$ (whose domain is the space of arrows) with respect to the distribution $\gamma$ works out to be the same as the cost $\sum_{i,j} m_{ij}\,d(x_i, y_j)$ of the procedure.

Conversely, each coupling $\gamma$ of $\mu$ and $\nu$ gives rise to a split-move-merge procedure that transforms $\mu$ into $\nu$ whose cost is equal to the integral of $d(x,y)$ with respect to $\gamma$.

So the cost of transforming $\mu$ to $\nu$ is the same as the minimum possible expected value of the length $d(x,y)$ of the arrows $(x,y)$ over all possible couplings of $\mu$ and $\nu$.

4. composing

The distance between $\mu$ and $\nu$ is a metric. The triangle inequality on $X$ transfers to the triangle inequality for the distance between mass distributions because you can compose a transformation from $\mu_1$ to $\mu_2$ and another from $\mu_2$ to $\mu_3$ to get a third transformation from $\mu_1$ to $\mu_3$ whose cost is at most the sum of the costs of the two former transformations.

This property is more easily seen if you think of $\mu_1$ as piles of dirt (of unit total mass). In fact, suppose $\mu_1$ is three piles of dirt: pile $i$ of mass $m_i$ at position $x_i$. Suppose pile 1 is red dirt, pile 2 is yellow dirt, and pile 3 is green dirt. Move these three piles of dirt to form $\mu_2$ using some transformation $(m_{ij})$. Each of the resulting piles may be a mixture of different colors. Now use some other transformation to move the piles again to form $\mu_3$. The net effect is a transformation of $\mu_1$ into $\mu_3$, and the corresponding composition can be obtained by examining the percentages of the different colors of dirt in each of the final piles. (The output is actually not unique here, but there is an obvious canonical choice.)

5. definition of the Wasserstein distance

What we have defined is known as the 1st Wasserstein distance between $\mu$ and $\nu$ and is usually denoted $W_1(\mu, \nu)$. For general (non-discrete) probability measures $\mu$ and $\nu$ on $X$, the definition is done using couplings of $\mu$ and $\nu$, but you will have to take the infimum instead of the minimum, and the distance can be infinite.
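For measures on the real line, SciPy exposes this quantity directly as `scipy.stats.wasserstein_distance`, which accepts weighted discrete supports. A small sketch with made-up data:

```python
from scipy.stats import wasserstein_distance

# Discrete measures on the real line, d(x, y) = |x - y|:
# mu puts mass 0.5, 0.3, 0.2 at 0, 1, 3; nu is uniform on 0.5, 2.0, 2.5, 4.0.
w1 = wasserstein_distance(
    u_values=[0.0, 1.0, 3.0], v_values=[0.5, 2.0, 2.5, 4.0],
    u_weights=[0.5, 0.3, 0.2], v_weights=[0.25, 0.25, 0.25, 0.25],
)
print(w1)  # 1.35
```

Internally this uses the one-dimensional closed form (the area between the two cumulative distribution functions) rather than solving the linear program, but the answer is the same minimal split-move-merge cost.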

6. Moving a collection of particles of equal mass

Suppose we are to move a collection of $n$ particles each of mass $1/n$. In the first configuration, the $n$ particles are at the $n$ places $x_1, \dots, x_n$ (some of which may be the same place). In the second configuration, the new $n$ places are $y_1, \dots, y_n$. It is known that in this case there is always a split-move-merge procedure (between the two configurations) that minimizes the cost but also skips the split and the merge steps. The procedures without the split and merge steps are specified by bijections from $\{1, \dots, n\}$ to itself. If $\sigma$ is such a bijection, then it specifies a procedure in which you move the old particle of mass $1/n$ at the old position $x_i$ to the new position $y_{\sigma(i)}$ (at the cost of $\frac{1}{n} d(x_i, y_{\sigma(i)})$) for $i = 1, \dots, n$.

So the cost of transforming $\frac{1}{n}\sum_i \delta_{x_i}$ to $\frac{1}{n}\sum_j \delta_{y_j}$ in this case is the same as the minimum of the average $\frac{1}{n}\sum_{i=1}^n d(x_i, y_{\sigma(i)})$ over all permutations $\sigma$ on $\{1, \dots, n\}$.

The fact that the split and merge steps can be discarded is known as the Birkhoff–von Neumann theorem, which actually states that any doubly stochastic matrix (in our case, $n$ times the matrix $(m_{ij})$ of a split-move-merge procedure) can be written as a convex combination of permutation matrices.
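In this equal-mass case the minimization over permutations is exactly an assignment problem, which `scipy.optimize.linear_sum_assignment` (the Hungarian method) solves. A sketch on the real line with made-up points:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Three particles of mass 1/3 each, before and after.
xs = np.array([0.0, 1.0, 4.0])
ys = np.array([0.5, 3.0, 5.0])

D = np.abs(xs[:, None] - ys[None, :])   # cost matrix d(x_i, y_j)
rows, cols = linear_sum_assignment(D)   # optimal permutation sigma as cols[i]
w1 = D[rows, cols].mean()               # average (1/n) * sum_i d(x_i, y_sigma(i))
print(cols, w1)                         # optimal matching and its average cost
```

On the real line the optimal permutation simply matches the points in sorted order, so here $\sigma$ is the identity and the cost is $(0.5 + 2 + 1)/3$.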

It is known that, in a metric space that is nice enough, any probability measure on it can be approximated by discrete distributions of the form $\frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ in an appropriate sense.

7. Some properties

(some lower bounds) If two probability measures $\mu$ and $\nu$ are supported on subsets $A, B \subseteq X$ respectively and if $d(A, B) \geq \epsilon$, then the distance between $\mu$ and $\nu$ is also $\geq \epsilon$. If we relax the hypothesis to $\mu(A) \geq 1 - \delta$ and $\nu(B) \geq 1 - \delta$, then we still have the approximate lower bound $W_1(\mu, \nu) \geq (1 - 2\delta)\epsilon$.

(upper bounds) If $\nu$ is the pushforward image of $\mu$ under some $f : X \to X$ and if $f$ is such that $d(x, f(x)) \leq \epsilon$ for all $x$, then $W_1(\mu, \nu) \leq \epsilon$. If we relax this to just $d(x, f(x)) \leq \epsilon$ for all $x$ (uniformly) in some big set of $\mu$-measure $\geq 1 - \delta$, and the diameter of $X$ is $O(1)$, then we still obtain the approximate conclusion $W_1(\mu, \nu) \leq \epsilon + O(\delta)$.
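Both kinds of bounds are easy to check numerically on the real line with SciPy's one-dimensional `wasserstein_distance` (the supports and the map $f(x) = x + \epsilon$ below are made-up examples):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Lower bound: mu is uniform on A = {0, 1}, nu is uniform on B = {2, 3},
# and d(A, B) = 1, so W1(mu, nu) >= 1 (every unit of mass travels >= 1).
low = wasserstein_distance([0.0, 1.0], [2.0, 3.0])
print(low)   # 2.0, comfortably >= 1

# Upper bound: nu is the pushforward of mu under f(x) = x + eps,
# and d(x, f(x)) = eps for all x, so W1(mu, nu) <= eps.
xs = np.array([0.0, 1.0, 2.5, 4.0])   # support of mu (uniform weights)
eps = 0.1
up = wasserstein_distance(xs, xs + eps)
print(up)    # 0.1 (a translation attains the bound)
```

In the upper-bound example the bound is attained because a translation moves every particle by exactly $\epsilon$ and no cheaper rearrangement exists on the line.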