Computing the Earth Mover's Distance under Transformations

The ideas and results contained in this document are part of my thesis,
which will be published as a Stanford computer science technical report
in June 1999. Similar ideas applied to the EMD under translation have
already been published in the technical report
[1].

The Earth Mover's Distance (EMD) is a distance measure between discrete,
finite distributions

x = { (x1,w1), (x2,w2),
..., (xm,wm) }

and

y = { (y1,u1), (y2,u2),
..., (yn,un) }.

The x distribution has an amount of mass or weight wi at
position xi in RK, i=1,...,m, while the y
distribution has weight uj at position yj, j=1,...,n.
An example pair of distributions in R2 is shown below.

Here x has m=2 masses and y has n=3 masses. The circle centers
are the points (mass locations) of the distributions. The area of a circle
is proportional to the weight at its center point. The total weight of
x is wS=sumi=1..m wi, and the total
weight of y is uS=sumj=1..n uj. In
the example above, the distributions have equal total weight
wS=uS=1.

Although the EMD does not require equal-weight distributions, let us begin
our discussion with this assumption. The EMD between two equal-weight
distributions is proportional to the amount of work needed to morph one
distribution into the other. We morph x into y by transporting
mass from the x mass locations to the y mass locations until
x has been rearranged to look exactly like y. An example
morph is shown below.

The amount of mass transported from xi to yj is
denoted fij, and is called a flow between xi
and yj. The work done to transport an amount of mass
fij from xi to yj is the product of the
fij and the distance dij=d(xi,yj)
between xi and yj.
The total amount of work to morph x into y by the flow
F=(fij) is the sum of the individual works:

WORK(F,x,y) = sumi=1..m, j=1..n
fij d(xi,yj).

In the above flow example, the total amount
of work done is 0.23*155.7 + 0.51*252.3 + 0.26*316.3 = 246.7.
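As a quick sanity check, the work sum above can be reproduced in a few lines; the flow amounts and distances are simply the numbers quoted from the example:

```python
# work done by a flow: sum of (mass moved) * (distance moved)
flows = [0.23, 0.51, 0.26]        # f_ij values from the example flow
dists = [155.7, 252.3, 316.3]     # corresponding distances d(x_i, y_j)
work = sum(f * d for f, d in zip(flows, dists))
print(round(work, 1))             # 246.7
```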

The flow in the previous example is not an
optimal flow between the given distributions. A minimum work flow is shown
below.

The total amount of work done by this flow is
0.23*155.7 + 0.26*277.0 + 0.25*252.3 + 0.26*198.2 = 222.4.
The EMD between equal-weight distributions is the minimum work to morph
one into the other, divided by the total weight of the distributions. The
normalization by the total weight makes the EMD equal to the average
distance travelled by mass during an optimal (i.e. work minimizing) flow.
The EMD does not change if all the weights in both distributions are scaled
by the same factor. In the previous example, the total weight is 1, so the
EMD is equal to the minimum amount of work: EMD(x,y)=222.4.

When the distributions do not have equal total weights, it is not possible
to rearrange the mass in one so that it exactly matches the other. The EMD
between unequal-weight distributions is proportional to the minimum amount
of work needed to cover the mass in the lighter distribution by mass from
the heavier distribution. An example of a flow between unequal-weight
distributions is given below.

Here wS=1 and uS=0.74, so x is heavier than
y. The flow covers or matches all the mass in y with mass
from x. The total amount of work to cover y by this flow
is 0.51*252.3 + 0.23*292.9 = 196.0.
In this example, 0.23 of the mass at x1 and 0.03 of the mass at
x2 are not used in matching all the y mass.

In general, some of the mass (wS-uS if x is
heavier than y) in the heavier distribution is not needed to
match all the mass in the lighter distribution. For this reason, the case of
matching unequal-weight distributions is called the partial matching
case. The case when the distributions are equal-weight is called the
complete matching case because all the mass in one distribution is
matched to all the mass in the other distribution. Another way to visualize
matching distributions is to imagine piles of dirt and holes in the ground.
The dirt piles are located at the points in the heavier distribution, and
the holes are located at the points of the lighter distribution. The volume
of a dirt pile or the volume of dirt missing from a hole is equal to the
weight of its point. Matching distributions means filling all the holes with
dirt. In the partial matching case, there will be some dirt leftover after
all the holes are filled.

The flow between the unequal-weight distributions in the
previous example
is not optimal. A minimum work flow is shown below.

The total amount of work for this flow to cover y is
0.23*155.7 + 0.25*252.3 + 0.26*198.2 = 150.4.
The EMD is the minimum amount of work to cover the mass in the lighter
distribution by mass from the heavier distribution, divided by the
weight of the lighter distribution (which is the total amount of mass
moved). As in the complete matching case,
the normalization of the minimum work makes the EMD equal to the average
distance mass travels during an optimal flow. Again, scaling the weights
in both distributions by a constant factor does not change the EMD. In
the above example, the weight of the lighter distribution is
uS=0.74, so EMD(x,y)= 150.4/0.74 = 203.3.

Formal Definition

Although we have described and informally defined the EMD for the two
separate cases of (i) equal-weight distributions, and (ii) unequal-weight
distributions, only one linear program is needed to formally define the
EMD. Recall the notation for the input distributions:

x = { (x1,w1), (x2,w2),
..., (xm,wm) }

and

y = { (y1,u1), (y2,u2),
..., (yn,un) },

where wS is the total weight of x, and uS is the
total weight of y. The EMD is defined by the linear program

EMD(x,y) = minF in F(x,y) WORK(F,x,y) / min(wS,uS)

= minF=(fij) in F(x,y) sumi=1..m, j=1..n fij d(xi,yj) / min(wS,uS),

where the minimum is taken over the set of feasible flows
F(x,y) defined by

(1)   fij >= 0,   i=1..m, j=1..n,

(2)   sumj=1..n fij <= wi,   i=1..m,

(3)   sumi=1..m fij <= uj,   j=1..n,

(4)   sumi=1..m, j=1..n fij = min(wS,uS).

Here the flow variable fij represents the amount of mass
matched between xi and yj. Constraint (1) forces
this amount to be nonnegative. Constraint (2) requires that the total
amount of y mass matched to x mass at xi does not
exceed the amount wi of x mass at this location. Similarly,
constraint (3) requires that the total amount of x mass matched to
y mass at yj does not exceed the amount uj of
y mass at this location. Finally, constraint (4) forces the total
amount of mass matched to be the amount of mass in the lighter distribution.
Without this constraint, the minimum would be achieved by setting
fij=0 for all i,j.

The distance function d between points is called the ground distance.
In contrast, the EMD is a distance between distributions which is built
on top of this point distance function. The total amount of mass moved
during any feasible flow (i.e. a flow satisfying (1), (2), (3), and (4))
is equal to the amount of mass min(wS,uS) in the lighter
distribution. The EMD is equal to the average distance travelled (in ground
distance units) by mass during an optimal flow. A simple units analysis shows
that EMD units are the same as ground distance units.
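For readers who want to experiment, the defining linear program can be handed directly to a general-purpose LP solver. The sketch below is not Rubner's transportation simplex code; it is a hypothetical emd helper that encodes constraints (1)-(4) for scipy.optimize.linprog, with the Euclidean distance between points as the ground distance:

```python
import numpy as np
from scipy.optimize import linprog

def emd(xs, w, ys, u):
    """EMD between x = {(x_i, w_i)} and y = {(y_j, u_j)}.
    xs is an (m, K) array of points, ys an (n, K) array."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    w, u = np.asarray(w, float), np.asarray(u, float)
    m, n = len(w), len(u)
    # ground distances d_ij = ||x_i - y_j||, flattened row-major like f_ij
    D = np.linalg.norm(xs[:, None, :] - ys[None, :, :], axis=2)
    # constraints (2) and (3): row sums <= w_i, column sums <= u_j
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_ub[m + j, j::n] = 1.0
    # constraint (4): total flow equals the weight of the lighter distribution
    A_eq = np.ones((1, m * n))
    b_eq = [min(w.sum(), u.sum())]
    res = linprog(D.ravel(), A_ub=A_ub, b_ub=np.concatenate([w, u]),
                  A_eq=A_eq, b_eq=b_eq, bounds=(0, None))  # (1): f_ij >= 0
    return res.fun / min(w.sum(), u.sum())
```

For example, emd([[0.0]], [1.0], [[3.0]], [1.0]) returns 3.0: all the unit mass moves a distance of 3, and the normalization divides by a total weight of 1.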

When x and y are equal-weight distributions, all the mass
in x must be matched to mass in y, and vice-versa. In this case,
the previously given definition for the set of feasible flows
F(x,y) can be rewritten as

(1)   fij >= 0,   i=1..m, j=1..n,

(5)   sumj=1..n fij = wi,   i=1..m,

(6)   sumi=1..m fij = uj,   j=1..n.

From (5) and (6), it follows that the total amount of mass matched in
the equal-weight case is sumi=1..m, j=1..n fij =
wS=uS(=min(wS,uS)).

The linear program (LP) which defines the EMD is a special type of LP
known as the transportation problem. The transportation
simplex algorithm is a specialization of the simplex algorithm for solving
LPs that takes advantage of the structure present in transportation problems
([3]).
Y. Rubner has
written a
C code
package to compute the EMD using the transportation simplex algorithm.

Some Notes

An excellent source for applications of the EMD in content-based image
retrieval is the work of
Y. Rubner. He has
successfully applied the EMD to measure global color similarity
between images
([5,6,7]) and texture similarity
([4,5,7]). In these
cases, the feature space in which the distribution points live and
the ground distance used to compare the points are different, but the same
EMD framework is applied.

The EMD is a metric between equal-weight distributions whenever the
underlying ground distance is a metric between points. The difficult part
in the proof of this statement is to show that the triangle inequality
holds. Assume that distributions have total weight one, so that the EMD
is exactly the minimum amount of work to change one distribution into
another. One way to morph x into z is to use a minimum work
morph of x into y, and then use a minimum work morph of
y into z. The cost of the morph through y
is EMD(x,y)+EMD(y,z). Since
EMD(x,z) is the minimum amount of work required to change
x into z, it cannot be more than the work for the particular
morph through y. A formal proof of the metric property can be found
in [5].

Another interesting property of the EMD is that the ground distance between
the centroids of equal-weight distributions is a lower bound on their EMD
for any ground distance between points induced by a vector norm (e.g. the
Euclidean distance between points is the Euclidean norm of their
difference). There are also lower bounds for the
equal-weight case which use the EMD between projections of the distributions
onto lines through the origin. The EMD between equal-weight 1D distributions
can be computed much more efficiently than the EMD between more general
distributions. Some information on the EMD in 1D can be found in this
document, in the section on the EMD under
translation in 1D.
The technical report [1] covers the
previously mentioned lower bounds, as well as centroid-based and
projection-based lower bounds which can be applied in the partial matching
case.
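The centroid lower bound is easy to verify numerically in 1D, where scipy.stats.wasserstein_distance computes the equal-weight EMD with d=L1; the point sets below are made up for illustration:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# equal-weight 1D distributions whose means coincide but whose EMD is large
x_pts, w = np.array([0.0, 10.0]), np.array([0.5, 0.5])
y_pts, u = np.array([4.0, 6.0]), np.array([0.5, 0.5])

emd = wasserstein_distance(x_pts, y_pts, w, u)
bound = abs(np.average(x_pts, weights=w) - np.average(y_pts, weights=u))
assert bound <= emd   # the centroid distance never exceeds the EMD
```

Here the centroids coincide, so the bound is 0, while the EMD is 4: the bound can be arbitrarily loose, but it is cheap to compute.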

The EMD is more general than the way I have described it thus far.
There is actually nothing in its formulation that assumes
the xi's and the yj's in distributions x
and y are points. The defining linear program uses only the distance
d(xi,yj), so xi and yj could
be any objects between which a distance can be computed. It is not
necessary that these objects have a point representation, or that the
ground distance is some standard point distance such as L2 or
L1. In general, the EMD is a distance measure between two sets
of weighted objects which is built upon a distance between individual
objects.

The Earth Mover's Distance Under Transformations

We begin with a statement of the problem and the basic motivation for
its study.

In words, we want to find the transformation of one distribution that
minimizes its EMD to another distribution.

The general motivation for allowing transformations is to have a distance
function which measures similarity modulo some factors. Consider the
example below in which the two distributions have equal total weight.

The EMD between the distributions on the left is large because mass in
the darker distribution must be transported large distances to change the
darker distribution into the lighter distribution (lighter in terms of
gray level, not total weight). The two distributions, however, appear
visually similar. If we allow the darker distribution to be translated
before comparing it to the lighter distribution, then the EMD is small.
This is clearly shown in the right half of the
above figure.
We shall give more motivation for allowing transformations in the next
section.

One class of distribution transformations contains transformations g that
change a distribution's weights, but leave its points fixed:

g(y) = g(Y,u) = (Y,g(u)).

Here y=(Y,u) splits the distribution into its points
contained in the matrix Y, and its weights contained in the vector u.
A useful transformation set of this class arises in the
scale estimation phase of the
SEDL image retrieval system. In this
setting, transformations are parametrized by a scale factor c, and

gc(y) = gc(Y,u) = (Y,cu).

Here, the distributions x and y are the color distributions
for a database image and a query pattern, respectively, and the goal is to
compute an estimate for the scale (size) at which the pattern may occur
within the database image. See the description of the
SEDL scale estimation phase for more details.

Another class of distribution transformations consists of transformations g
that change the points of a distribution, but leave its weights fixed:

g(y) = g(Y,u) = (g(Y),u).

For example, suppose G=E, the group of Euclidean
transformations. Transformations in E are defined by a rotation matrix
R and a translation vector t, and

gR,t(Y) = gR,t([y1 ... yn]) =
[Ry1+t ... Ryn+t].

In words, gR,t rotates each distribution point by R, and then
translates the result by t. For the case when transformations change only
distribution points, we shall give an iteration that is guaranteed to
converge, although it may converge to only a locally optimal transformation.
Before we do this,
however, let us give some more motivation for allowing transformations.

Motivation from Content-Based Image Retrieval

Suppose we are comparing scenes using the EMD between color distributions.
One problem is that the EMD can be large between
color distributions for two images of the same scene taken under different
illuminants, even if the camera location and orientation is fixed. This
is because the pixel colors in the images can be quite different, as
illustrated in the images below.

Under certain assumptions on the reflectance functions of scene objects,
a change in the spectral power distribution (SPD) from a(w) to b(w) causes a
linear transformation Aa,b of image pixel colors
([2]).

By allowing for a linear transformation in the comparison between
color distributions, the EMD can show, for example, that the scene under
the white illuminant is similar to the scene under the red illuminant.

Texture comparison is another example in which allowing transformations is
useful. Excellent results using the EMD to compare textures were obtained
by Y. Rubner
([4]).
The main idea is to summarize
a texture by a distribution of energy in the spatial frequency plane.
A distribution point xi is a point in the spatial frequency plane,
and its weight wi is the fraction of the total energy at that
frequency. The textures shown below contain energy at only one spatial
frequency, but this will be enough to make our point clear.

Suppose we want the EMD to be small between the energy distributions for
the left and right textures because these differ only by a scaling and
rotation. As shown above, let q=(fx,fy) denote a
point in spatial frequency space. If we work in log-polar spatial frequency
space, recording

p=(log ||q||,angle(q)),

then scaling and rotating the texture results in a translation of the point p.

By allowing for a translation in log-polar spatial frequency space, the
EMD captures the similarity between textures that differ primarily in scale
and orientation.
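This translation property can be checked on a single frequency point; the scale factor s and angle theta below are arbitrary (chosen so that arctan2 does not wrap around):

```python
import numpy as np

def log_polar(q):
    """Map a frequency-plane point q = (fx, fy) to p = (log ||q||, angle(q))."""
    return np.array([np.log(np.hypot(q[0], q[1])), np.arctan2(q[1], q[0])])

q = np.array([3.0, 4.0])          # an arbitrary frequency-plane point
s, theta = 2.0, np.pi / 6         # arbitrary scale factor and rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
shift = log_polar(s * (R @ q)) - log_polar(q)
# shift equals (log s, theta): scaling and rotation in the frequency plane
# become a pure translation in log-polar coordinates
```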

The need for transformations might be more direct than in the previous
two applications, in the sense that distribution points may be points in
the image plane instead of points in a color space or a spatial frequency
space. Suppose for example, that we wish to match features in a stereo
pair of images as shown below.

The change in location of a feature point can be modelled (approximately)
by an affine transformation

pl = A pr + t

if the thickness of the object is small in comparison to its distance
from the camera center of projection.

A Convergent Iteration

In this section, we present an iteration to search for an optimal
transformation in the case that transformations change the points in a
distribution, but leave the weights fixed. Since it is the weights in the
distributions which determine the set of feasible flows,

F(x,g(y)) = F(x,y)   if g does not change the weights in x and y.

The idea is to find the best flow for a fixed transformation,
the best transformation for the previously computed flow, the best flow
for the previously computed transformation, and so on:

(1)   F(k) = arg minF in F(x,y) WORK(F,x,g(k)(y)), and

(2)   g(k+1) = arg ming in G WORK(F(k),x,g(y)).

As shown in (1), the kth flow iterate is defined as
a flow which minimizes the amount of work needed to match x and
g(k)(y), where g(k) is the
kth transformation iterate. As shown in (2), the
(k+1)st
transformation iterate is defined as a transformation g which minimizes
the amount of work done by F(k) to match x and g(y).
The iteration begins with an initial transformation g(0), and
the order of evaluation is

g(0) --> F(0) --> g(1) --> F(1)
--> ... .

An example will help clarify the iteration.

The left half of the figure below shows a dark and light distribution that
we will match under translation using the previously given iteration.

The initial translation in this example is g(0)=0, and we shall
translate the dark distribution. The best flow F(0) for
g(0) is shown by the labelled arcs connecting dark and light
masses in the left half of the diagram. This flow F(0) matches
half (.5) the mass over a relatively large distance. When we ask for the
best translation for this flow, we should expect that translation to move
the .7 weight dark mass closer to the .8 weight light mass in order to
decrease the total amount of work done by F(0). Indeed, the
optimal translation g(1) aligns these two masses as shown in the
right half of the figure. The best flow F(1) for this translation
matches all of the .7 weight dark mass to the
.8 weight light mass. Since these masses are on top of
each other, it costs zero work to match 70% of the distributions. No
further translation of the dark distribution improves the work --
g(2)=g(1) and the iteration converges.

The expression WORK(F,x,g(y)) defines a surface over the
space FxG of all possible (flow,transformation)=(F,g) pairs.
Our iteration provides a downhill path over this surface. The flow and
transformation iterates define WORK iterates

WORK(k) = WORK(F(k),x,g(k)(y)).

From (2), we know that

(3)   WORK(F(k),x,g(k+1)(y)) <= WORK(F(k),x,g(k)(y)) = WORK(k)

because g(k+1) is an optimal transformation for F(k) (and
therefore yields a work value no greater than that of g(k) for
F(k)).
Similarly, (1) implies

(4)   WORK(k+1) = WORK(F(k+1),x,g(k+1)(y)) <= WORK(F(k),x,g(k+1)(y)),

since F(k+1) is an optimal flow for g(k+1). Combining
(3) and (4) shows

WORK(k+1) <= WORK(k).

Since the WORK sequence is nonincreasing and bounded below (by zero), it must
converge. The corresponding transformation sequence may, however, converge
to only a locally optimal transformation. Therefore, the iteration should be
run with different initial transformations in search of the global
minimum.
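A minimal sketch of the iteration, assuming equal-weight distributions, G = the group of translations of y, and the squared Euclidean ground distance d=(L2)2 so that step (2) has a closed-form solution (the function names are mine, not part of any published package):

```python
import numpy as np
from scipy.optimize import linprog

def min_work_flow(X, w, Y, u):
    """Step (1): minimum-work flow for fixed positions, as a transportation
    LP with equality constraints (5) and (6) (equal-weight case)."""
    m, n = len(w), len(u)
    D = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)   # d = (L2)^2
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0      # (5): row sums = w_i
    for j in range(n):
        A_eq[m + j, j::n] = 1.0               # (6): column sums = u_j
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=np.concatenate([w, u]),
                  bounds=(0, None))
    return res.x.reshape(m, n), res.fun

def emd_under_translation(X, w, Y, u, iters=5):
    """Alternate steps (1) and (2) for G = translations of y."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    w, u = np.asarray(w, float), np.asarray(u, float)
    t = np.zeros(X.shape[1])                  # g^(0): zero translation
    works = []
    for _ in range(iters):
        F, work = min_work_flow(X, w, Y + t, u)              # step (1)
        works.append(work)
        # step (2): for d=(L2)^2 the optimal translation for a fixed flow
        # has the closed form t = sum_ij f_ij (x_i - y_j) / sum_ij f_ij
        t = (F[:, :, None] * (X[:, None, :] - Y[None, :, :])).sum((0, 1)) / F.sum()
    return t, works

X, w = np.array([[0.0, 0.0], [1.0, 0.0]]), np.array([0.5, 0.5])
Y, u = np.array([[5.0, 1.0], [6.0, 1.0]]), np.array([0.5, 0.5])
t, works = emd_under_translation(X, w, Y, u)
```

On this small example the WORK sequence is nonincreasing and the translation settles after the first update, since for this ground distance the optimal translation turns out to be the same for every feasible flow.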

In just a couple of sections, we shall discuss
two specific cases in which
the EMD under transformation problem can be solved directly, without the
aid of our iteration:

(i) equal-weight distributions under translation with the squared
Euclidean distance (d=(L2)2) as the ground distance, and

(ii) equal-weight 1D distributions under translation with the absolute value
(d=L1) as the ground distance.

Although our iteration is not needed in these cases, it is guaranteed to
converge to the global minimum of the WORK function in case (i).

Our iteration is very general, in that it can be applied for
a number of different (ground distance d, transformation set G) pairs.
The optimal flow step (1) is a transportation problem,
which can be solved with the transportation simplex algorithm
([3]). We call the
problem in step (2) the optimal transformation
problem. Our iteration can be applied
whenever the optimal transformation problem can be solved.

For translations with the ground distance d=(L2)2, the optimal
transformation problem in step (2) becomes the weighted least squares problem

(7)   mint sumr=1..N cr || ar - (br + t) ||2,

where each index r stands for a pair (i,j), cr=fij,
ar=xi, br=yj, and N=mn.
It is well known (and easily proven using standard calculus) that the
unique optimal translation in the least squares problem (7) is the
translation that lines up the centroids of the weighted point sets
{(c1,a1), ..., (cN,aN)} and
{(c1,b1), ..., (cN,bN)}:

t* = centroid(c,a) - centroid(c,b),

where

centroid(c,a) = sumr=1..N cr ar / cS,

centroid(c,b) = sumr=1..N cr br / cS,   and

cS = sumr=1..N cr.

In terms of the original flow variables fij and distributions
x and y, this solution becomes

(8)   t* = sumi=1..m, j=1..n fij (xi - yj) / min(wS,uS),

where we have used the fact that sumi=1..m, j=1..n fij =
min(wS,uS) for any feasible flow F between x
and y.
In the next section, we shall discuss an interesting property of this
solution when
the distributions x and y have the same total weight.

Two Specific Cases

In this section, we examine two specific cases in which there is a
direct solution to the EMD under transformation problem.

We first consider matching equal-weight distributions under translation,
with the squared Euclidean distance d=(L2)2 as the ground
distance. In this case,

(9)   sumj=1..n fij = wi, i=1..m,   and   sumi=1..m fij = uj, j=1..n,

for any feasible flow between x and y. This is because all
the mass in both distributions must be matched when the two distributions
have the same total weight. Substituting (9) into (8), we see that

t* = centroid(x) - centroid(y),

where

centroid(x) = sumi=1..m wi xi / wS,   and

centroid(y) = sumj=1..n uj yj / uS.

Here we have also used the fact that x and y are equal-weight
distributions: wS=uS=min(wS,uS).

We have shown that the optimal translation in this case does not depend on
the feasible flow used. The translation t* which lines up the
centroids of x and y is the unique optimal translation for
every feasible flow. To compute the EMD under translation
between equal-weight distributions with d=(L2)2,
we simply compute
EMD(x,y+centroid(x)-centroid(y)). Although our
iteration is not needed here, it is guaranteed to converge to
the global minimum of the WORK function in this case. In fact, it will
converge in only a couple of steps since
g(2)=g(1)=t*.
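In code, the direct solution amounts to two weighted averages; the centroid helper below is a hypothetical sketch, and the point sets are made up for illustration:

```python
import numpy as np

def centroid(P, weights):
    """Weighted centroid sum_i w_i p_i / w_S of a distribution (P, weights)."""
    P, weights = np.asarray(P, float), np.asarray(weights, float)
    return (weights[:, None] * P).sum(axis=0) / weights.sum()

# optimal translation of y toward x for d=(L2)^2, equal-weight case
X, w = [[0.0, 0.0], [2.0, 0.0]], [0.5, 0.5]
Y, u = [[4.0, 3.0], [6.0, 3.0]], [0.5, 0.5]
t_star = centroid(X, w) - centroid(Y, u)   # lines up the two centroids
```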

In this section, we consider matching equal-weight distributions on
the real line under translation, with the absolute value (d=L1)
as the ground distance. Consider the distributions shown in the left
half of the figure below, where x is the dark distribution and
y is the light distribution.


It turns out that there is a simple solution to computing the EMD
between equal weight distributions in 1D that involves the cumulative
distribution functions (CDFs) of the distributions. The CDF for x
starts at 0, increases an amount wi at each point xi,
and eventually becomes constant=wS at the largest point
xm. The CDF for x is the bold staircase graph, while
the CDF for y is the regular thickness staircase graph. Since
x and y are equal-weight distributions, the two CDFs become
constant at the same value wS=uS. The EMD between
x and y is equal to the area between the CDFs of x and
y divided by the total weight wS=uS. The area
between the CDFs is shaded, and the corresponding optimal flow is
indicated with arrows. This optimal flow is called the CDF flow, and
is denoted FCDF.

The CDF flow is given by

fij = | [Wi-1, Wi] ∩ [Uj-1, Uj] |,

where Wi = sumk=1..i wk and Uj = sumk=1..j uk
are the partial sums of the distribution weight vectors w and u,
W0 = U0 = 0, and |X| denotes the size of interval X.
The partial sums U0, U1, ..., Un are the
same for y and every translated version of y.
This means that the CDF flow is an optimal flow
between x and y+t for every translation t. An EMD
computation between x and a translation of y is shown in the
right half of the above figure.
To compute the EMD under translation
between equal-weight distributions in 1D with d=L1,
we simply solve the optimal translation problem for the fixed
flow F=FCDF. The optimal translation problem with d=L1
is covered in [1].
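The CDF-area rule for equal-weight 1D distributions is short enough to write out directly; the helper name below is mine:

```python
import numpy as np

def emd_1d(xs, w, ys, u):
    """EMD between equal-weight 1D distributions: the area between their
    CDFs, divided by the common total weight."""
    xs, w = np.asarray(xs, float), np.asarray(w, float)
    ys, u = np.asarray(ys, float), np.asarray(u, float)
    pts = np.sort(np.concatenate([xs, ys]))   # breakpoints of both CDFs
    area = 0.0
    for a, b in zip(pts[:-1], pts[1:]):
        Fx = w[xs <= a].sum()                 # CDF of x on [a, b)
        Fy = u[ys <= a].sum()                 # CDF of y on [a, b)
        area += abs(Fx - Fy) * (b - a)
    return area / w.sum()
```

For x = {(0, 0.5), (3, 0.5)} and y = {(1, 1)}, the optimal flow moves 0.5 a distance of 1 and 0.5 a distance of 2, so emd_1d([0.0, 3.0], [0.5, 0.5], [1.0], [1.0]) returns 1.5, matching the area between the two CDFs.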