How slow is the k-means method?

1
How slow is the k-means method?
  • David Arthur Sergei Vassilvitskii
  • Stanford University

2
The k-means Problem
  • Given an integer k and n data points in R^d
  • Partition points into k clusters
  • Choose k centers and partition points according
    to closest center
  • Try to minimize $f = \sum_{x} \lVert x - c(x) \rVert^2$,
    where c(x) is the center closest to x
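
As a minimal sketch of the objective on this slide, the snippet below sums the squared distance from each point to its closest center. The function names and the sample data are illustrative assumptions, not part of the talk.

```python
import math

def closest_center(x, centers):
    """Return the center nearest to point x (Euclidean distance)."""
    return min(centers, key=lambda c: math.dist(x, c))

def kmeans_cost(points, centers):
    """Sum over all points of the squared distance to the closest center."""
    return sum(math.dist(x, closest_center(x, centers)) ** 2 for x in points)

# Hypothetical data: four corners of a 10 x 2 rectangle, one center per short side.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
print(kmeans_cost(points, centers))  # every point is at distance 1 from its center
```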

3
Lloyd's Algorithm (1982)
  • Simply called the k-means method
  • Choose k starting centers
  • Uniformly at random usually
  • Repeat until stable
  • Assign each point to the closest center
  • Set each center to be center of mass of points
    assigned to it
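
The loop described on this slide can be sketched as follows. This is an illustrative pure-Python version, not the authors' implementation; the function names, the `max_iter` cap, and the choice to leave the center of an empty cluster in place are all assumptions.

```python
import math
import random

def assign(points, centers):
    """For each point, the index of its closest center."""
    return [min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
            for x in points]

def recompute(points, labels, centers):
    """Move each center to the center of mass of the points assigned to it."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [x for x, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centers.append(old)  # assumption: an empty cluster keeps its center
    return new_centers

def lloyd(points, centers, max_iter=1000):
    """Repeat assign/recompute until the assignment stops changing.

    Returns (centers, labels, rounds), where rounds counts the iterations
    in which the assignment changed.
    """
    labels = None
    for it in range(1, max_iter + 1):
        new_labels = assign(points, centers)
        if new_labels == labels:
            return centers, labels, it - 1
        labels = new_labels
        centers = recompute(points, labels, centers)
    return centers, labels, max_iter

# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (100.0, 0.0), (101.0, 0.0), (100.0, 1.0)]
start = random.Random(0).sample(points, 2)  # k = 2 starting centers, uniformly at random
print(lloyd(points, start))
```

Passing the starting centers explicitly keeps the random choice separate from the iteration itself, which makes it easy to replay a specific initial configuration.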

4
Example
Cluster boundary
Data Point
Cluster center
First choose k arbitrary centers
Assign points to closest centers
Recompute centers
k-means has now stabilized
5
About k-means
  • It always terminates
  • Each step decreases f
  • At most k^n configurations
  • It can stop with arbitrarily bad clusterings
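
A standard illustration of the last bullet (an assumed example, not taken from the slides): place four points at the corners of a w-by-1 rectangle and run k-means with k = 2. Starting with one center on each long side gives a stable configuration whose cost grows with w, while the best clustering has constant cost.

```latex
% Illustrative example (assumed, not from the slides): a w-by-1 rectangle, k = 2.
\[
X = \{(0,0),\ (0,1),\ (w,0),\ (w,1)\}, \qquad w > 1 .
\]
% Stable but bad: one center in the middle of each long side.
% Each point is at distance w/2 from the center on its own side and at
% distance \sqrt{w^2/4 + 1} > w/2 from the other center, so nothing is reassigned.
\[
c_1 = (w/2,\, 0),\quad c_2 = (w/2,\, 1)
\;\Longrightarrow\;
f_{\mathrm{bad}} = 4 \cdot (w/2)^2 = w^2 .
\]
% Optimal: one center in the middle of each short side.
\[
c_1 = (0,\, 1/2),\quad c_2 = (w,\, 1/2)
\;\Longrightarrow\;
f_{\mathrm{opt}} = 4 \cdot (1/2)^2 = 1 .
\]
\[
\frac{f_{\mathrm{bad}}}{f_{\mathrm{opt}}} = w^2 \to \infty
\quad \text{as } w \to \infty .
\]
```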

6
About k-means
  • Widely used because it is fast
  • Usually far fewer than n iterations
  • How do you formalize this?
  • Just look at worst-case performance?

7
k-means (Worst case iterations)
  • Counting number of configurations
  • Already showed O(k^n)
  • Inaba et al. (SOCG 94): O(n^{kd})
  • One dimension:
  • Dasgupta (COLT 03): O(n)
  • Har-Peled, Sadri (SODA 05): O(nΔ^2)
  • Δ = ratio of largest distance to smallest
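
The spread in the Har-Peled and Sadri bound (written Δ here) is the ratio of the largest pairwise distance to the smallest. A small sketch of how it could be computed, assuming all points are distinct; the function name and sample points are illustrative only.

```python
import math
from itertools import combinations

def spread(points):
    """Ratio of the largest pairwise distance to the smallest.

    Assumes at least two points and no duplicates (otherwise the
    smallest distance is zero and the ratio is undefined).
    """
    dists = [math.dist(p, q) for p, q in combinations(points, 2)]
    return max(dists) / min(dists)

# Hypothetical one-dimensional input: pairwise distances are 1, 3 and 4.
print(spread([(0.0,), (1.0,), (4.0,)]))
```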

8
Our Main Result
  • Worst case 2^{Ω(√n)}
  • k-means is superpolynomial!

9
Proof High Level
  • Start with configuration M with n points, which
    requires T iterations
  • Add O(1) clusters, O(k) points
  • These reset initial configuration M
  • M stabilizes to M'
  • New clusters, points reset M' to M
  • M now has to stabilize to M' again
  • Now requires at least 2T iterations

10
Proof High Level
  • Repeat reset construction m times
  • O(m^2) points
  • O(m) clusters
  • 2^m iterations
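
The counts on this slide combine into a bound in terms of n alone. The derivation below is a sketch of that arithmetic; T_j and the constant c are notation introduced here for illustration.

```latex
% T_j: iterations required after j applications of the reset construction.
% Each reset at least doubles the iteration count.
\[
T_{j+1} \ge 2\,T_j, \quad T_0 \ge 1
\quad\Longrightarrow\quad
T_m \ge 2^m .
\]
% After m resets the instance has n = O(m^2) points, i.e. n \le c\,m^2
% for some constant c > 0.
\[
n \le c\,m^2
\;\Longrightarrow\;
m \ge \sqrt{n/c}
\;\Longrightarrow\;
2^m \ge 2^{\sqrt{n/c}} = 2^{\Omega(\sqrt{n})} .
\]
```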

11
Main Construction (Overview)
[Figure: clusters C_i]
The original k-means configuration, M
12
Main Construction (Overview)
[Figure: clusters C_i with new clusters G and H]
Note the horizontal symmetry
O(1) new clusters, O(k) new points
13
Main Construction (Overview)
[Figure: clusters C_i, G, H]
  • Points p_i: absorbed by C_i when M completes
  • Points q_i: absorbed by C_i after p_i, which resets
    the center of C_i
  • Everything else: balances the important points
  • Helps H absorb p_i and q_i right after C_i does,
    completing the reset
14
Main Construction (Zoomed in)
[Figure: zoomed-in view of C_i, G, and H; distance labels shown: 0.989d, 0.989d, de, e, 0.2d]
Zoomed in and more to scale
Some distances shown (d >> e)
15
Main Construction (t = 0)
[Figure: clusters C_i, G, H]
We trace k-means from this initial configuration
16
Main Construction (t = 0...T)
[Figure: clusters C_i, G, H]
Push new points far enough away
New clusters are stable while M executes
17
Main Construction (t = T+1): Reassigning points to clusters
[Figure: clusters C_i, G, H and point p_i]
Take p_i to be a direct lift of the final center of C_i
At time T+1, C_i is closer to taking p_i than ever
Can position G so p_i is absorbed by C_i at time T+1
18
Main Construction (t = T+1): Reassigning points to clusters
[Figure: clusters C_i, G, H]
Nasty detail: have to position G to work for each i simultaneously
19
Main Construction (t = T+1): Reassigning points to clusters
[Figure: clusters C_i, G, H]
Basic idea: perturb final C_i centers onto a hypersphere, and align G with center
20
Main Construction (t = T+1): Recomputing centers
[Figure: clusters C_i, G, H]
Center of G moves further away
Centers of C_i stable by symmetry
21
Main Construction (t = T+2): Reassigning points to clusters
[Figure: clusters C_i, G, H and point q_i]
G's center is far away: it loses points
q_i switches to some C_j; we want it to be C_i regardless of q_i's position in base space
22
Main Construction (t = T+2): Reassigning points to clusters
[Figure: clusters C_i, G, H]
Basic idea: translate each (p_i, q_i) along a new dimension
q_i is now closer to C_i than to any other C_j
23
Main Construction (t = T+2): Recomputing centers
[Figure: clusters C_i, G, H; callout: Centers reset to t = 0!]
Symmetry: C_i centers not lifted towards G
Can now choose q_i's coordinate in base space to reset C_i
24
Main Construction (t = T+3): Reassigning points to clusters
[Figure: clusters G, H, C_i; callout: Same clusters as t = 1]
C_i centers have not moved closer to p_i, q_i
But H has
25
Main Construction (t = T+3): Recomputing centers
[Figure: clusters G, H, C_i; callout: Same centers as t = 1]
26
Main Construction (t = T+4): Reassigning points to clusters
[Figure: clusters G, H, C_i; callout: Same clusters as t = 2]
27
Main Construction (t = T+4): Recomputing centers
[Figure: clusters G, H, C_i; callout: Same centers as t = 2]
Success!
New clusters now totally stable
C_i free to proceed another T-2 steps
28
Summary
  • Some configurations take 2^{Ω(√n)} iterations
  • (Yes, we have actually implemented this!)
  • What now?
  • Lower bound is too precise to arise in practice
  • How do you formalize that?

29
The Big Question
  • How to guarantee good speed?
  • Choose initial centers randomly?
  • Nope: can force starting configuration w.h.p.
  • Har-Peled, Sadri (SODA 05): Poly spread?
  • Nope: can make spread n by adding 1 dim.
  • Low dimension?
  • Open: we conjecture poly only if d = 1
  • But k-means fast in practice even in high dim

30
The Big Question
  • How to guarantee good speed?
  • We suggest smoothed analysis of Spielman and Teng
  • Perturb each data point using normal distribution
  • We recently showed O(n^k) and O(2^{n/d})
  • Recall worst case bound O(n^{kd})
  • Still open!
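
The perturbation step in a smoothed analysis can be sketched as below: each coordinate of each data point gets independent Gaussian noise before the algorithm runs. The function name and the parameters `sigma` and `seed` are illustrative assumptions, not notation from the talk.

```python
import random

def perturb(points, sigma, seed=0):
    """Add independent N(0, sigma^2) noise to every coordinate of every point."""
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    return [tuple(coord + rng.gauss(0.0, sigma) for coord in x) for x in points]

# Hypothetical input: three points in the plane, perturbed with sigma = 0.1.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(perturb(points, sigma=0.1))
```

The smoothed running time is then the expected number of iterations on the perturbed input, taken in the worst case over the original input.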

31
Thanks for listening!