Title: Density estimation in linear time (approximating L1-distances)
Satyaki Mahalanabis, Daniel Štefankovič
University of Rochester
Slide 2: Density estimation
Given DATA (samples), pick a density from F, a family of densities.
[Figure: data points surrounded by candidate densities f1, ..., f6.]
Slide 3: Density estimation - example
Data: 0.418974, 0.848565, 1.73705, 1.59579, -1.18767, -1.05573, -1.36625.
F: the family of normal densities N(μ, 1) with σ = 1; estimate μ.
Slide 4: Measure of quality
g = TRUTH, f = OUTPUT.
L1 distance from the truth: ‖f − g‖₁ = ∫ |f(x) − g(x)| dx.
Why L1?
1) small L1 ⇒ all events estimated with small additive error;
2) scale invariant.
Slide 5: Obstacles to quality
1) bad DATA;
2) weak class of densities F, measured by dist₁(g,F), the L1 distance from the truth g to the closest member of F.
Slide 6: What is bad data?
g = TRUTH, h = DATA (empirical density). The data are bad if
2 · max_{A ∈ Y(F)} |h(A) − g(A)|
is large, where Y(F), the Yatracos class of F, consists of the sets
A_ij = { x : f_i(x) > f_j(x) }.
[Figure: densities f1, f2, f3 and the sets A12, A13, A23.]
Slide 7: Density estimation
Given DATA (empirical density h), output f ∈ F with small ‖g − f‖₁, assuming that dist₁(g,F) and 2 · max_{A∈Y(F)} |h(A) − g(A)| are both small.
Slide 8: Why would these be small?
They will be if:
1) we pick a large enough F (so dist₁(g,F) is small);
2) we pick a small enough F, so that the VC-dimension of Y(F) is small;
3) the data are i.i.d. from g.
Theorem (Haussler; Dudley; Vapnik, Chervonenkis):
E[ max_{A∈Y} |h(A) − g(A)| ] ≤ c · sqrt( VC(Y) / samples ).
Slides 9-11: How to choose from 2 densities?
[Figures (build-up): densities f1, f2; a test function T taking values +1 and −1; the quantities T·f1, T·f2, T·h.]
Slide 12: How to choose from 2 densities?
[Figure: T takes value +1 where f1 > f2 and −1 elsewhere.]
Scheffé test: if T·h > T·(f1 + f2)/2, output f1; else output f2.
Theorem (see DL01): ‖f − g‖₁ ≤ 3·dist₁(g,F) + 2Δ, where Δ = max_{A∈Y(F)} |h(A) − g(A)|.
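The Scheffé test above can be sketched in a few lines. This is an illustrative sketch with names of my own choosing: the two densities are given on a common discretization grid, so the integrals T·f become sums, and T·h is the average of T over the samples.

```python
# Illustrative sketch of the Scheffé test between two candidate densities
# f1, f2 (tabulated on a grid with cell width dx), given i.i.d. samples.
def scheffe_test(f1, f2, samples, grid, dx):
    # T(x) = +1 where f1(x) > f2(x), -1 elsewhere
    T = [1.0 if a > b else -1.0 for a, b in zip(f1, f2)]
    # T·f1 and T·f2 by numerical integration on the grid
    Tf1 = sum(t * a for t, a in zip(T, f1)) * dx
    Tf2 = sum(t * b for t, b in zip(T, f2)) * dx
    # T·h for the empirical density h: average of T at the sample points
    def T_at(x):
        i = min(range(len(grid)), key=lambda j: abs(grid[j] - x))
        return T[i]
    Th = sum(T_at(x) for x in samples) / len(samples)
    # Scheffé: output f1 iff T·h > T·(f1+f2)/2
    return 1 if Th > (Tf1 + Tf2) / 2 else 2
```

With f1 uniform on [0, 1/2] and f2 uniform on [0, 1], samples concentrated below 1/2 select f1, and spread-out samples select f2.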
Slide 13: Density estimation (recap)
Given DATA (empirical density h), output f ∈ F with small ‖g − f‖₁, assuming that dist₁(g,F) and 2 · max_{A∈Y(F)} |h(A) − g(A)| are both small.
Slide 14: Test functions
F = {f1, f2, ..., fN}.  T_ij(x) = sgn(f_i(x) − f_j(x)).
T_ij·(f_i − f_j) = ∫ (f_i − f_j) sgn(f_i − f_j) = ‖f_i − f_j‖₁.
Compare T_ij·h with T_ij·f_i and T_ij·f_j: f_i wins if T_ij·h is closer to T_ij·f_i, and f_j wins if it is closer to T_ij·f_j.
Slide 15: Density estimation algorithms
Scheffé tournament: pick the density with the most wins.
Theorem (DL01): ‖f − g‖₁ ≤ 9·dist₁(g,F) + 8Δ, where Δ = max_{A∈Y(F)} |h(A) − g(A)|.  (n² tests)
Minimum distance estimate (Y85): output the f_k ∈ F that minimizes max_{i,j} |(f_k − h)·T_ij|.
Theorem (DL01): ‖f − g‖₁ ≤ 3·dist₁(g,F) + 2Δ.  (n³ time)
Slide 17: Our algorithm - efficient minimum loss-weight
Repeat until one distribution is left:
1) pick the pair of distributions in F that are furthest apart (in L1);
2) eliminate the loser - take the most discriminative action.
Theorem (MS08): ‖f − g‖₁ ≤ 3·dist₁(g,F) + 2Δ, where Δ = max_{A∈Y(F)} |h(A) − g(A)|.  (n tests, after preprocessing F)
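A minimal sketch of the elimination loop, assuming the pairwise L1 distances `dist` have been precomputed and `beats(i, j)` runs the Scheffé test and returns the winner's index (both helpers are hypothetical placeholders):

```python
# Sketch of the "efficient minimum loss-weight" elimination loop.
# dist[i][j] holds ||f_i - f_j||_1 (precomputed); beats(i, j) runs a
# Scheffé-style test and returns the index of the winner.
def minimum_loss_weight(n, dist, beats):
    alive = set(range(n))
    while len(alive) > 1:
        # 1) pick the pair of surviving densities furthest apart in L1
        i, j = max(((i, j) for i in alive for j in alive if i < j),
                   key=lambda p: dist[p[0]][p[1]])
        # 2) eliminate the loser of the test on that pair
        alive.discard(i if beats(i, j) == j else j)
    return alive.pop()
```

The naive furthest-pair search here is quadratic per round; the tournament revelation problem on the next slides is exactly about answering those furthest-pair queries faster after preprocessing.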
Slide 18: Tournament revelation problem
INPUT: a weighted undirected graph G (wlog all edge-weights distinct).
OUTPUT: REPORT the heaviest edge {u1,v1} in G; the ADVERSARY eliminates u1 or v1 → G1;
REPORT the heaviest edge {u2,v2} in G1; the ADVERSARY eliminates u2 or v2 → G2; ...
OBJECTIVE: minimize the total time spent generating reports.
Slides 19-23: Tournament revelation problem (example)
[Figure: graph on vertices A, B, C, D with edge weights 1-6. Report the heaviest edge BC; the adversary eliminates B; report the heaviest remaining edge AD; the adversary eliminates A; report the heaviest remaining edge CD.]
Slide 24: Tournament revelation problem
[Figure: the decision tree of possible report sequences (BC at the root, then AD or BD, and so on, depending on the adversary's eliminations).]
2^{O(|F|)} preprocessing → O(|F|) run-time; O(|F|² log |F|) preprocessing → O(|F|²) run-time.
WE DO NOT KNOW: can we get O(|F|) run-time with polynomial preprocessing?
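One way to realize the O(|F|² log |F|) preprocessing / O(|F|²) run-time trade-off on this slide is to sort all edges once and, for each report, scan forward past edges with an eliminated endpoint; the scan pointer only ever moves forward, so all reports together cost time linear in the number of edges. A sketch (class and method names are mine):

```python
# Sketch of a sort-once strategy for the tournament revelation problem:
# one O(E log E) sort up front, then O(E) total time across all reports,
# because the scan position never moves backwards.
class TournamentRevealer:
    def __init__(self, weighted_edges):
        # weighted_edges: list of (weight, u, v), all weights distinct
        self.edges = sorted(weighted_edges, reverse=True)
        self.dead = set()
        self.pos = 0

    def report_heaviest(self):
        # advance past edges that touch an eliminated vertex
        while self.pos < len(self.edges):
            w, u, v = self.edges[self.pos]
            if u not in self.dead and v not in self.dead:
                return (u, v)
            self.pos += 1
        return None

    def eliminate(self, vertex):
        self.dead.add(vertex)
```

With |F| densities there are E = O(|F|²) edges, matching the stated O(|F|² log |F|) preprocessing and O(|F|²) total run-time.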
Slide 25: Efficient minimum loss-weight
Repeat until one distribution is left:
1) pick the pair of distributions that are furthest apart (in L1);
2) eliminate the loser (in practice, step 2 is the more costly one).
2^{O(|F|)} preprocessing → O(|F|) run-time; O(|F|² log |F|) preprocessing → O(|F|²) run-time.
WE DO NOT KNOW: can we get O(|F|) run-time with polynomial preprocessing?
Slide 26: Efficient minimum loss-weight
Repeat until one distribution is left: 1) pick the pair of distributions that are furthest apart (in L1); 2) eliminate the loser.
Theorem: ‖f − g‖₁ ≤ 3·dist₁(g,F) + 2Δ, where Δ = max_{A∈Y(F)} |h(A) − g(A)|.  (n tests)
Proof idea ("that guy lost even more badly"): for every f′ to which the output f loses,
‖f − f′‖₁ ≤ max_{f″ : f loses to f″} ‖f − f″‖₁.
Slide 27: Proof
"That guy lost even more badly": for every f′ to which the output f loses,
‖f − f′‖₁ ≤ max_{f″ : f loses to f″} ‖f − f″‖₁.
Key inequalities (with BEST = f2; the loss of f2 to f3 is a bad loss):
2·h·T23 ≥ f2·T23 + f3·T23,
(f1 − f2)·T12 ≤ (f2 − f3)·T23,
(f4 − h)·T23 ≤ Δ,
(f_i − f_j)·(T_ij − T_kl) ≥ 0,
leading to ‖f1 − g‖₁ ≤ 3·‖f2 − g‖₁ + 2Δ.
Slide 28: Application - kernel density estimates (Akaike '54, Parzen '62, Rosenblatt '56)
K: kernel; h: density. The kernel is used to smooth the empirical density g (x1, x2, ..., xn i.i.d. samples from h):
(g * K)(y) = (1/n) Σ_{i=1}^{n} K(y − x_i),  and  g * K → h * K  as n → ∞.
Slide 29: What K should we choose?
(g * K)(y) = (1/n) Σ_{i=1}^{n} K(y − x_i) → h * K as n → ∞.
A Dirac δ would be good (h * δ = h); a Dirac δ is not good (no smoothing).
Something in-between: bandwidth selection for kernel density estimates,
K_s(x) = K(x/s) / s;  as s → 0, K_s → Dirac δ.
Theorem (see DL01): as s → 0 with sn → ∞,  ‖g * K_s − h‖₁ → 0.
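As a concrete instance of K_s with a piecewise uniform kernel, here is a sketch of g * K_s with K the box kernel, uniform on [−1/2, 1/2] (function names are mine):

```python
# Sketch of a kernel density estimate with a box (uniform) kernel and
# bandwidth s: K_s(x) = K(x/s)/s, where K is uniform on [-1/2, 1/2].
def kde_box(samples, s):
    n = len(samples)
    def g(y):
        # average of K_s(y - x_i): count samples within s/2 of y, scale by 1/(n*s)
        return sum(1.0 for x in samples if abs(y - x) <= s / 2) / (n * s)
    return g
```

For example, with samples {0, 1} and s = 1, the estimate is 1/2 near each sample and 1 midway between them, where both boxes overlap.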
Slide 30: Data splitting methods for kernel density estimates
How to pick the smoothing factor s in
g_s(y) = (1/(ns)) Σ_{i=1}^{n} K((y − x_i)/s) ?
Split x1, x2, ..., xn: build candidate estimates
f_s(y) = (1/((n−m)s)) Σ_{i=1}^{n−m} K((y − x_i)/s)
from x1, ..., x_{n−m}, and choose s using density estimation on the held-out samples x_{n−m+1}, ..., xn.
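The splitting scheme itself is tiny; this sketch leaves the selection rule as a placeholder callback, since the slides plug a density estimation algorithm (e.g. the Scheffé tournament or minimum loss-weight) in there. All names are mine:

```python
# Sketch of the data-splitting scheme: build candidate kernel estimates
# f_s from the first n-m points and hand them, with the m held-out
# points, to a density-estimation selection rule ("select" is a
# placeholder for e.g. a Scheffé-style tournament).
def split_and_select(xs, m, bandwidths, make_kde, select):
    train, holdout = xs[:-m], xs[-m:]
    candidates = {s: make_kde(train, s) for s in bandwidths}
    return select(candidates, holdout)
```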
Slide 31: Kernels we will use
f_s(y) = (1/(ns)) Σ_{i=1}^{n} K((y − x_i)/s), with K piecewise uniform or piecewise linear.
Slide 32: Bandwidth selection for uniform kernels
N distributions, each piecewise uniform with n pieces; m datapoints. (E.g. N ≈ n^{1/2}, m ≈ n^{5/4}.)
Goal: run the density estimation algorithm efficiently.
TIME:
- (f_i + f_j)·T_ij and h·T_ij: N of them for EMLW, N² for MD; cost nm·log n each.
- (f_k − h)·T_kj (for MD): N², cost nm·log n each.
- ‖f_i − f_j‖₁ (for EMLW): N² pairs, cost n each.
Slide 33: Bandwidth selection for uniform kernels
Same setting as above. Can we speed this up?
Slide 34: Bandwidth selection for uniform kernels
Same setting. For approximating the distances ‖f_i − f_j‖₁: absolute error is bad, relative error is good.
Slide 35: Approximating L1-distances between distributions
N piecewise uniform densities (each with n pieces).
TRIVIAL (exact): N²·n.
WE WILL DO: O((N² + Nn)(log N)/ε²).
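The TRIVIAL baseline is easy to make concrete; a sketch assuming the N densities are tabulated on a common grid of n cells (names mine):

```python
# The "trivial (exact)" baseline: all pairwise L1 distances between N
# piecewise uniform densities on a common grid of n cells, by direct
# summation -- O(N^2 * n) time (cell_width turns sums into integrals).
def all_pairs_l1(densities, cell_width):
    N = len(densities)
    d = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(i + 1, N):
            d[i][j] = d[j][i] = sum(
                abs(a - b) for a, b in zip(densities[i], densities[j])
            ) * cell_width
    return d
```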
Slide 36: Dimension reduction for L2
S ⊆ ℝⁿ. Johnson-Lindenstrauss Lemma ('82): there is φ: L2 → L2^t with t = O(ε^{−2} ln n) such that
(∀ x,y ∈ S)  d(x,y) ≤ d(φ(x), φ(y)) ≤ (1+ε)·d(x,y).
φ can be a random linear map with i.i.d. N(0, t^{−1/2}) entries.
Slide 37: Dimension reduction for L1
S ⊆ ℝⁿ. Cauchy random projection (Indyk '00): there is φ: L1 → L1^t with t = O(ε^{−2} ln n) such that
(∀ x,y ∈ S)  d(x,y) ≤ est(φ(x), φ(y)) ≤ (1+ε)·d(x,y),
where φ has i.i.d. C(0, 1/t) entries.
(Charikar, Brinkman '03: est cannot be replaced by a metric d.)
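A sketch of the Cauchy random projection estimator: each coordinate of the projected difference φ(x) − φ(y) is Cauchy with scale ‖x − y‖₁, so the median of the absolute values of the t projected differences serves as the nonlinear estimator "est". For simplicity this sketch uses standard C(0,1) entries instead of C(0, 1/t); t and the test tolerance are arbitrary choices of mine.

```python
import math, random

# Cauchy random projection for L1 (illustrative sketch).
def make_projection(t, n, rng):
    # standard Cauchy samples via inverse CDF: tan(pi*(U - 1/2)) ~ C(0,1)
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
            for _ in range(t)]

def cauchy_project(vec, proj):
    return [sum(c * v for c, v in zip(row, vec)) for row in proj]

def est_l1(px, py):
    # each coordinate of px - py is ~ C(0, ||x - y||_1); take the median
    diffs = sorted(abs(a - b) for a, b in zip(px, py))
    return diffs[len(diffs) // 2]
```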
Slide 38: Cauchy distribution C(0,1)
Density function: 1 / (π(1 + x²)).
FACTS:
X ~ C(0,1) ⇒ aX ~ C(0,a);
X ~ C(0,a), Y ~ C(0,b) independent ⇒ X + Y ~ C(0, a+b).
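The two closure facts can be checked numerically: the median of |C(0, a)| samples is a, which lets us read off the scale parameter (the sample sizes and scales below are my own choices):

```python
import math, random

def cauchy(rng, a):
    # inverse-CDF sampling: a * tan(pi*(U - 1/2)) ~ C(0, a)
    return a * math.tan(math.pi * (rng.random() - 0.5))

def abs_median(samples):
    # median of |C(0, a)| equals a, so this estimates the scale
    s = sorted(abs(x) for x in samples)
    return s[len(s) // 2]

rng = random.Random(1)
# X ~ C(0,1)  =>  3*X ~ C(0,3)
scaled = [3 * cauchy(rng, 1.0) for _ in range(100000)]
# X ~ C(0,1), Y ~ C(0,2) independent  =>  X+Y ~ C(0,3)
summed = [cauchy(rng, 1.0) + cauchy(rng, 2.0) for _ in range(100000)]
```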
Slide 39: Cauchy random projection for L1 (Indyk '00)
[Figure: the line cut into pieces carrying independent Cauchy increments X1, ..., X9; X_i ~ C(0, z) for a piece of length z. For densities A and B: A = X2 + X3, B = X5 + X6 + X7 + X8.]
Slide 40: Cauchy random projection for L1 (Indyk '00)
[Figure as before: A = X2 + X3, B = X5 + X6 + X7 + X8, D = X1 + X2 + ... + X8 + X9.]
The projected difference of two densities μ, ν is distributed as Cauchy(0, ‖μ − ν‖₁).
Slide 41: All pairs L1-distances - piecewise linear densities
Slide 42: All pairs L1-distances - piecewise linear densities
R = (3/4)X1 + (1/4)X2,  B = (3/4)X2 + (1/4)X1, with X1, X2 ~ C(0, 1/2) independent;
then R − B = (1/2)(X1 − X2) ~ C(0, 1/2).
Slide 43: All pairs L1-distances - piecewise linear densities
Problem: too many intersections!
Solution: cut into even smaller pieces! Stochastic measures are useful.
Slide 44: Brownian motion and Cauchy motion
Brownian motion: increment density (1/(2π)^{1/2}) · exp(−x²/2).
Cauchy motion: increment density 1 / (π(1 + x²)).
Slide 45: Brownian motion
Increment density (1/(2π)^{1/2}) · exp(−x²/2). Computing integrals is easy:
for f: ℝ → ℝ^d,  ∫ f dL = Y ~ N(0, Σ).
Slide 46: Cauchy motion
Increment density 1 / (π(1 + x²)). Computing integrals is easy for d = 1:
for f: ℝ → ℝ,  ∫ f dL = Y ~ C(0, s).
Computing integrals is hard for d > 1: obtaining an explicit expression for the density.
Slide 47: What were we doing?
[Figure: increments X1, ..., X9.]
Computing ∫ (f1, f2, f3) dL = ((w1)1, (w2)1, (w3)1).
Slide 48: What were we doing? (continued)
Can we efficiently compute integrals ∫ · dL for piecewise linear functions?
Slide 49: Can we efficiently compute integrals ∫ · dL for piecewise linear functions?
φ: ℝ → ℝ²,  φ(z) = (1, z),  (X, Y) = ∫ φ dL.
Slide 50: φ: ℝ → ℝ²,  φ(z) = (1, z),  (X, Y) = ∫ φ dL.
With the substitution (u+v, u−v), the pair (2(X−Y), 2Y) has an explicit density.
Slide 51: Results
All pairs L1-distances for mixtures of uniform densities in time O((N² + Nn)(log N)/ε²).
All pairs L1-distances for piecewise linear densities in time O((N² + Nn)(log N)/ε²).
Slide 52: QUESTIONS
1) φ: ℝ → ℝ³, φ(z) = (1, z, z²), (X, Y, Z) = ∫ φ dL - an explicit density?
2) Higher dimensions?