Slide 1: Motivation

Histograms are everywhere in vision:
- object recognition / classification
- appearance-based tracking

How do we compare two histograms p_i, q_j? Information-theoretic measures such as chi-square, the Bhattacharyya coefficient, and KL-divergence are very prevalent. They are based on bin-to-bin comparisons of mass. Example: the Bhattacharyya coefficient, shown below.
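For two normalized histograms p and q, the Bhattacharyya coefficient (not written out in the extracted slide) is a bin-to-bin sum:

    BC(p, q) = \sum_i \sqrt{p_i \, q_i}

Larger values mean more overlap; it is often turned into a distance such as \sqrt{1 - BC(p, q)}.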
Slide 2: Motivation

Problem: the bin-to-bin comparison measures are sensitive to the binning of the data, and also to shifts of data across bins (say, due to an intensity gain/offset). Example: which of these is more similar to the black circle?

[Figure: three intensity histograms over the 0-255 range, shifted relative to one another.]
Slide 3: Earth Mover's Distance

[Figure; example borrowed from Efros at CMU.]

Slide 4: Earth Mover's Distance

[Figure continues.]

Slide 5: Earth Mover's Distance

[Figure continues.]
Slide 6: The Difference?

(amount moved)

Slide 7: The Difference?

(amount moved) x (distance moved)
8Thought Experiment
- move the books on your bookshelf one space to
the right - you are lazy, so want to minimize sum of
distances moved
Slide 9: Thought Experiment

More than one minimal solution. Not unique!
- dist = 4 (move one book four spaces)
- dist = 1 + 1 + 1 + 1 = 4 (move each of four books one space)
Slide 10: Thought Experiment

Now minimize the sum of squared distances instead.
- Strategy 1 (move one book four spaces): |dist| = 4; squared dist = 4^2 = 16
- Strategy 2 (move each of four books one space): |dist| = 1 + 1 + 1 + 1 = 4; squared dist = 1^2 + 1^2 + 1^2 + 1^2 = 4
Slide 11: How Do We Know?

How do we know those are the minimal solutions? Is that all of them? Let's go back to absolute distance |new - old|.

Form a table of distances |new - old| (rows: old positions A-E; columns: new positions A-E):

          new:   A   B   C   D   E
  old A:         0   1   2   3   4
  old B:         1   0   1   2   3
  old C:         2   1   0   1   2
  old D:         3   2   1   0   1
  old E:         4   3   2   1   0
Slide 12: How Do We Know?

How do we know those are the minimal solutions? Is that all of them? Let's go back to absolute distance |new - old|.

Form a table of distances |new - old|, and cross off the entries that are not admissible (in this example the four books start at old positions A-D and must end at new positions B-E):

          new:   A   B   C   D   E
  old A:         x   1   2   3   4
  old B:         x   0   1   2   3
  old C:         x   1   0   1   2
  old D:         x   2   1   0   1
  old E:         x   x   x   x   x
Slide 13: How Do We Know?

How do we know those are the minimal solutions? Is that all of them? Let's go back to absolute distance |new - old|.

Consider all permutations where there is a single 1 in each admissible row and column. (Same admissibility table as on the previous slide.)
Slide 14: How Do We Know?

Consider all permutations where there is a single 1 in each admissible row and column. (Same table, with one permutation highlighted.)

This permutation picks entries summing to 1 + 3 + 0 + 1 = 5.
Slide 15: How Do We Know?

Consider all permutations where there is a single 1 in each admissible row and column. (Same table, another permutation highlighted.)

This permutation picks entries summing to 2 + 2 + 2 + 2 = 8.
Slide 16: How Do We Know?

How do we know those are the minimal solutions? Let's go back to using absolute distance |new - old|.

Consider all permutations where there is a single 1 in each admissible row and column, and try to find the minimum one. There are 4 x 3 x 2 x 1 = 24 permutations in this example, so we can try them all. (The permutation highlighted here picks entries summing to 4 + 2 + 0 + 2 = 8.)
Slide 17: 8 Minimal Solutions!

It turns out that lots of solutions are minimal when we use absolute distance.
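To confirm the count, here is a minimal brute-force sketch (mine, not from the slides) that enumerates all 24 admissible assignments, assuming as above that the four books start at A-D and end at B-E:

```python
from itertools import permutations

# cost[i][j] = |new - old| for old position i (A-D) and new position j (B-E)
cost = [[1, 2, 3, 4],   # old A
        [0, 1, 2, 3],   # old B
        [1, 0, 1, 2],   # old C
        [2, 1, 0, 1]]   # old D

# Total moving distance for every one-to-one assignment of books to slots.
sums = {p: sum(cost[i][p[i]] for i in range(4)) for p in permutations(range(4))}
best = min(sums.values())
print(best, sum(1 for s in sums.values() if s == best))   # prints: 4 8
```

The minimum cost is 4, and exactly 8 of the 24 assignments achieve it.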
Slide 18: How Do We Know?

The two solutions we had before are there, but there are others! Reading each row as the new slot (B, C, D, E numbered 1-4) assigned to old books A-D in order, the slide shows, for example:
- 4 1 2 3  (strategy 1: one book jumps to the end, the rest stay put)
- 1 2 3 4  (strategy 2: every book shifts by one)
- 3 1 2 4  (another assignment, also of total cost 4)
Slide 19: Recall Thought Experiment

Now minimize the sum of squared distances instead.
- Strategy 1 (move one book four spaces): |dist| = 4; squared dist = 4^2 = 16
- Strategy 2 (move each of four books one space): |dist| = 1 + 1 + 1 + 1 = 4; squared dist = 1^2 + 1^2 + 1^2 + 1^2 = 4
Slide 20

Form the same table using squared distances (new - old)^2:

          new:   A   B   C   D   E
  old A:         x   1   4   9  16
  old B:         x   0   1   4   9
  old C:         x   1   0   1   4
  old D:         x   4   1   0   1
  old E:         x   x   x   x   x

There is only one unique minimal solution when we use (new - old)^2. This turns out to be the case for |new - old|^p for any p > 1, because then the cost function is strictly convex.
Slide 21: Other Ways to Look at It

The way we've set it up so far, this problem is equivalent to the linear assignment problem. We can therefore solve it using the Hungarian algorithm.
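As a sketch (mine, not from the slides), SciPy's linear assignment solver finds one of the minimal solutions for the book cost table directly:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Rows: old positions A-D; columns: admissible new positions B-E.
cost = np.array([[1, 2, 3, 4],    # old A -> new B, C, D, E
                 [0, 1, 2, 3],    # old B
                 [1, 0, 1, 2],    # old C
                 [2, 1, 0, 1]])   # old D

row_ind, col_ind = linear_sum_assignment(cost)     # Hungarian-style solver
print(cost[row_ind, col_ind].sum())                # 4: one minimal solution
```

The solver returns a single optimal assignment; as the previous slides show, several other assignments tie it at cost 4.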
Slide 22: Other Ways to Look at It

We can also look at it as a min-cost flow problem on a bipartite graph: the old positions A-E are sources (each with supply 1) and the new positions A-E are sinks (each with demand 1), connected by edges with cost(old, new) = |new - old|^p, e.g. cost(A,B), ..., cost(D,E).

Instead of books, we can think of these nodes as factories and consumers, or whatever. Why? We can then think about relaxing the problem to consider fractional assignments between old and new positions (e.g. half of A goes to B, and the other half goes to C). More about this in a moment.
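Here is a hedged sketch (mine, not from the slides) of the same book example posed as a min-cost flow problem with networkx; the node names and shelf coordinates are my own choices, and integer supplies/weights are assumed:

```python
import networkx as nx

G = nx.DiGraph()
old_positions = ["A", "B", "C", "D"]      # each old slot supplies one book
new_positions = ["B", "C", "D", "E"]      # each new slot demands one book

for o in old_positions:
    G.add_node(f"old_{o}", demand=-1)     # negative demand = source
for n in new_positions:
    G.add_node(f"new_{n}", demand=1)      # positive demand = sink

# One edge per (old, new) pair, weighted by |new - old| in shelf positions.
pos = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4}
for o in old_positions:
    for n in new_positions:
        G.add_edge(f"old_{o}", f"new_{n}",
                   weight=abs(pos[n] - pos[o]), capacity=1)

flow = nx.min_cost_flow(G)                # dict of dicts: flow[u][v]
print(nx.cost_of_flow(G, flow))           # 4 for this example
```

Allowing the flow variables to take fractional values is exactly the relaxation mentioned above.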
Slide 23: Monge-Kantorovich Transportation Problem
Slide 24: Mallows (Wasserstein) Distance

Let X and Y be d-dimensional random variables. The probability distribution of X is P, and the distribution of Y is Q. Also consider some unknown distribution F over the two of them taken jointly, (X, Y).

In words: we are trying to find the minimum expected value of the distance between X and Y. The expected value is taken over some unknown joint distribution F, and F is constrained so that its marginal with respect to X is P, and its marginal with respect to Y is Q.
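Written out (my reconstruction of the formula the slide describes, in the notation above), the Mallows distance of order p is:

    M_p(P, Q) = \Big( \min_{F} \; \mathbb{E}_F\big[ \, \|X - Y\|^p \, \big] \Big)^{1/p},
    \quad \text{subject to } (X, Y) \sim F, \;\; X \sim P, \;\; Y \sim Q.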
Slide 25: Understanding Mallows Distance

For discrete variables, with ground costs d_ij between bin i of P and bin j of Q.
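The slide's equation did not survive extraction; the standard discrete form is:

    M_p^p(P, Q) = \min_{f_{ij} \ge 0} \sum_{i,j} f_{ij} \, d_{ij}
    \quad \text{s.t.} \quad \sum_j f_{ij} = p_i, \qquad \sum_i f_{ij} = q_j,

with d_{ij} = \|x_i - y_j\|^p, so the joint probabilities f_{ij} play the role of the flows in the transportation problem.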
Slide 26: Mallows Versus EMD

For probability distributions, EMD and Mallows are the same. They are also the same whenever the total masses are equal.
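For reference (the slide's equations did not survive extraction), the commonly used EMD formulation is:

    \mathrm{EMD}(P, Q) = \frac{\min_{f_{ij} \ge 0} \sum_{i,j} f_{ij} \, d_{ij}}{\sum_{i,j} f_{ij}},
    \quad \sum_j f_{ij} \le p_i, \;\; \sum_i f_{ij} \le q_j, \;\;
    \sum_{i,j} f_{ij} = \min\Big(\sum_i p_i, \; \sum_j q_j\Big),

whereas the Mallows problem above forces the flows to match both marginals exactly. When both histograms are normalized (or have equal total mass), the two minimizations coincide.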
Slide 27: Mallows vs. EMD

Main difference: EMD allows partial matches in the case of unequal masses.

[Figure: an example with unequal total masses where EMD = 0 but Mallows = 1/2, using the L1 norm.]

As the paper points out, you have to be careful when allowing partial matches to make sure what you are doing is sensible.
Slide 28: Linear Programming

Mallows/EMD for general d-dimensional data is solved via linear programming, for example by the simplex algorithm. This makes it OK for low values of d (up to dozens), but unsuitable for very large d. As a result, EMD is typically applied after clustering the data (say, using k-means) into a smaller set of clusters. The coarse descriptors based on clusters are often called signatures.
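A minimal sketch (mine, not from the slides) of the transportation LP for two small 1D histograms, solved with SciPy's linprog; the histograms mirror the CDF example near the end of the talk:

```python
import numpy as np
from scipy.optimize import linprog

p = np.array([0.25, 0.25, 0.25, 0.25, 0.0])   # source histogram
q = np.array([0.0, 0.25, 0.25, 0.25, 0.25])   # target histogram
bins = np.arange(5)
d = np.abs(bins[:, None] - bins[None, :])      # ground distances d_ij

m, n = len(p), len(q)
# Equality constraints: row sums of the flow equal p, column sums equal q.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1
for j in range(n):
    A_eq[m + j, j::n] = 1
b_eq = np.concatenate([p, q])

res = linprog(d.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
print(res.fun)   # 1.0, matching the CDF computation on slide 41
```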
Slide 29: Transportation Problem

Mallows is a special case of the linear programming transportation problem, formulated as a min-cost flow problem in a graph: sources with supplies p_1, ..., p_m and sinks with demands q_1, ..., q_n.
Slide 30: Assignment Problem

Some discrete cases (like our book example) simplify further to the assignment problem, formulated as a min-cost flow problem in a graph with unit supplies and demands: all x_ij are 0 or 1, and there is only one 1 in each row or column.
Slide 31: Linear Programming

Mallows/EMD for general d-dimensional data is solved via linear programming, for example by the simplex algorithm. This makes it OK for low values of d (up to dozens), but unsuitable for very large d. As a result, EMD is typically applied after clustering the data (say, using k-means) into a smaller set of clusters. The coarse descriptors based on clusters are often called signatures.

However, if we use marginal distributions, so that we have 1D histograms, something wonderful happens!
32One-Dimensional Data
one dimensional data (like weve been using for
illustration duringthis whole talk) is an
important special case. Mallows/EMD distance
computation greatly simplifies! First of all,
for 1D, we can represent densities by their
cumulativedistribution functions
33One-Dimensional Data
one dimensional data (like weve been using for
illustration duringthis whole talk) is an
important special case. Mallows/EMD distance
computation greatly simplifies! First of all,
for 1D, we can represent densities by their
cumulativedistribution functions and the min
distance can be computed as
(x)
x
F(x) G(x) dx
34One-Dimensional Data
G(x)
G-1(t)
1
1
F(x)
F-1(t)
t
t
0
0
x
x
0
255
0
255
intensity (for example)
intensity (for example)
just area between the two cumulative distribution
function curves
35Proof?
It is easy to find papers that state the previous
1D simplified solution, but quite hard to find
one with a proof! One is
but you still have to work at it. I did, one
week, and here is what I came up with
First, recall the quantile transform given a cdf
F(x), we can generate samples from it by
uniformly sampling t U(0,1) and then outputting
F-1(t)
F(x)
1
t0
ti U(0,1) gt xi F
t
0
x0
255
0
intensity (for example)
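A minimal sketch (mine, not from the slides) of the quantile transform for a discrete histogram:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.25, 0.25, 0.25, 0.25, 0.0])   # discrete density over bins 0-4
F = np.cumsum(p)                               # its cdf

t = rng.uniform(size=10_000)                   # t ~ U(0, 1)
samples = np.searchsorted(F, t)                # F^{-1}(t) for a step cdf
print(np.bincount(samples, minlength=5) / len(t))   # approximately recovers p
```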
Slide 36: Proof?

This allows us to understand that sampling X = F^{-1}(t) and Y = G^{-1}(t) from the same uniform t produces a joint distribution whose marginals are P and Q, and whose expected distance E|X - Y| is exactly the area between the two cdf curves.
Slide 37: But So What? Why Does This Minimize the Mallows Distance?
Slide 38

The expected cost is the sum over the 4x4 array of products p_ij x d_ij. To compute the Mallows distance, we want to choose the joint probabilities p_ij to minimize this expected cost.
Slide 39

P. Major says that in a minimum solution, for any p_ab and p_cd on opposite sides of the diagonal, one or both of them should be zero. If not, we can construct a lower-cost solution. Example: moving an amount a of mass off the two cells with costs 9 and 4 and onto the two cells with costs 0 and 1 changes the cost by a(0 + 1) - a(9 + 4) = -12a, so the new arrangement is a lower-cost solution.
Slide 40

Connection (and a missing piece of the proof in P. Major's paper): the above procedure serves to concentrate all the mass of the joint distribution along the diagonal, and apparently also yields the min-cost solution. However, concentration of mass along the diagonal is also a property of joint distributions of correlated random variables. Therefore, generating maximally correlated random variables via the quantile transformation should serve to generate a joint distribution clustered as tightly as possible around the diagonal of the cost matrix, and therefore should yield the minimum expected cost. QED!
Slide 41: Example CDF Distance

[Figure: two histograms over bins 1-5, each placing mass 0.25 in four adjacent bins, offset from each other by one bin. Black bars: P_i, the cdf of p; white bars: Q_i, the cdf of q.]

    \sum_i |P_i - Q_i| = 0.25 + 0.25 + 0.25 + 0.25 + 0 = 1

Note: we get 1 instead of 4, the number we got earlier for the books, because earlier we didn't divide by the total mass (4).
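The same number falls out directly in code (a sketch, not from the slides):

```python
import numpy as np

p = np.array([0.25, 0.25, 0.25, 0.25, 0.0])   # histogram p over bins 1-5
q = np.array([0.0, 0.25, 0.25, 0.25, 0.25])   # histogram q over bins 1-5

P, Q = np.cumsum(p), np.cumsum(q)              # the two cdfs
print(np.abs(P - Q).sum())                     # 0.25 * 4 = 1.0
```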
Slide 42: Example Application

- Convert 3D color data into three 1D marginals.
- Compute the CDF of the marginal color data in a circular region.
- Compute the CDF of the marginal color data in a ring around that circle.
- Compare the two CDFs using the Mallows distance (a sketch of this step follows the list).
- Select peaks in the distance function as interest regions.
- Repeat, at a range of scales...
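As a hedged sketch of the comparison step only (the function and variable names are my own; the talk does not specify an implementation, and `disc_pixels` / `ring_pixels` are hypothetical 1D arrays of one color channel's values):

```python
import numpy as np

def marginal_cdf_distance(disc_pixels, ring_pixels, bins=256):
    """L1 distance between the normalized cdfs of two pixel populations,
    i.e. the 1D Mallows/EMD distance between their marginal histograms."""
    h1, _ = np.histogram(disc_pixels, bins=bins, range=(0, 256))
    h2, _ = np.histogram(ring_pixels, bins=bins, range=(0, 256))
    P = np.cumsum(h1) / max(h1.sum(), 1)
    Q = np.cumsum(h2) / max(h2.sum(), 1)
    return np.abs(P - Q).sum()
```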