Title: Grid-based Coresets for Clustering Problems
1 Grid-based Coresets for Clustering Problems
- Christian Sohler
- Universität Paderborn
- (joint work with Gereon Frahling)
2 Introduction: Clustering
- Clustering
  - Partition the input into sets (clusters) such that
    - Objects in the same cluster are similar
    - Objects in different clusters are dissimilar
- Goal
  - Simplification
  - Discovery of patterns
- Procedure
  - Map objects to Euclidean space -> point set P
  - Points in the same cluster are close
  - Points in different clusters are far away from each other
3 Introduction: k-means clustering
- Clustering with prototypes
  - One prototype (center) for each cluster
- k-median clustering
  - k clusters $C_1, \dots, C_k$
  - One center $c_i$ for each cluster $C_i$
  - Minimize $\sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)$
6 Introduction: Simplification / Lossy Compression
9 Introduction: Properties of k-means
- Simple property of k-median
  - Point set P
  - Set of centers C
  - Best clustering: assign each point to its nearest center
- Notation: cost(P,C) denotes the cost of the solution defined this way
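The notation cost(P,C) can be made concrete with a minimal sketch. The function name `cost`, the optional `weights` argument (useful later for weighted point sets), and the example points are assumptions made for illustration; they are not taken from the slides.

```python
import math

def cost(points, centers, weights=None):
    """k-median cost: every point pays its weighted distance to the nearest center."""
    if weights is None:
        weights = [1.0] * len(points)  # unweighted input: each point counts once
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

# Hypothetical example: two tight pairs of points, one center on each pair.
P = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
C = [(0.0, 0.0), (10.0, 0.0)]
print(cost(P, C))  # 2.0
```

Assigning each point to its nearest center is what the inner `min` does, so no explicit partition into clusters is needed to evaluate the cost.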
14 Introduction: Coresets
- Definition (coreset for k-median) [HM04]
  - A weighted point set S is called an $\varepsilon$-coreset for P if for every set C of k centers we have
    $(1-\varepsilon)\,\mathrm{cost}(P,C) \le \mathrm{cost}(S,C) \le (1+\varepsilon)\,\mathrm{cost}(P,C)$
15 Introduction: Coresets
- Replace the point set by a few weighted points (red in the figure; weights shown: 3, 4, 5, 5, 4)
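To illustrate the coreset definition, the following sketch tests the two-sided bound for a single set of centers. The helper names `cost` and `satisfies_bound` and all example data are hypothetical. Note that the definition demands the bound for every set of k centers, so a check like this can refute the coreset property but cannot prove it.

```python
import math

def cost(points, centers, weights=None):
    """Weighted k-median cost with nearest-center assignment."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def satisfies_bound(P, S, S_weights, centers, eps):
    """Check (1-eps)*cost(P,C) <= cost(S,C) <= (1+eps)*cost(P,C) for ONE center set C."""
    cp = cost(P, centers)
    cs = cost(S, centers, S_weights)
    return (1 - eps) * cp <= cs <= (1 + eps) * cp

# Hypothetical data: two tight pairs, each replaced by one point of weight 2.
P = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0)]
S = [(0.05, 0.0), (10.05, 0.0)]
W = [2.0, 2.0]

far_centers = [(0.0, 5.0), (10.0, 5.0)]
near_centers = [(0.05, 0.0), (10.05, 0.0)]
print(satisfies_bound(P, S, W, far_centers, eps=0.1))   # True
print(satisfies_bound(P, S, W, near_centers, eps=0.1))  # False
```

The second call fails because the centers coincide with the weighted points, so cost(S,C) is 0 while cost(P,C) is not. This is why the quantifier "for every set C" makes the definition demanding.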
18 Introduction: Related work
- Coresets for clustering problems
  - k-center, k-median [Badoiu, Indyk, Har-Peled, 2002]: existence of coresets, size independent of dimension
  - Projective clustering [Har-Peled, Varadarajan, 2002]: existence of coresets for projective clustering, faster algorithms
  - k-median, k-means [Har-Peled, Mazumdar, 2004]: faster algorithms, data streaming, different definition of coresets
  - k-median, k-means [Har-Peled, Kushal, 2004]: coresets of constant size
  - k-median [Chen, 2005]: coreset with size polynomial in dimension
  - k-median, k-means, MaxCut [Frahling, Sohler, 2005]: oblivious coreset construction, dynamic data streams
19 Introduction: Related work (continued)
- Coresets for clustering problems
  - k-means [Frahling, Sohler, 2006]: efficient implementation
  - k-line median [Fiat, Feldman, Sharir, 2006]: coresets for low dimensions
  - k-median, k-means [Feldman, Monemizadeh, Sohler, 2006]: weak coresets, size independent of n and d
20 Coreset construction: First try
- Our approach
  - Partition the input space into regions
  - For each region R
    - Count the number w(R) of points in R
    - Choose one representative point p from R
    - Assign weight w(R) to p
    - Remove all other points from R
- Analysis
  - Moving a point by distance d changes cost(P,C) by at most d
  - Sum up the movement over all regions
  - Show: overall movement is at most $\varepsilon \cdot \mathrm{cost}(P,C)$
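The analysis outline can be written out as a short derivation. The notation is an assumption made here: $p'$ is the position to which a point $p$ is moved (its region's representative), $P'$ is the resulting multiset, and $d(p,C) = \min_{c \in C} \|p - c\|$.

```latex
\begin{align*}
|d(p, C) - d(p', C)| &\le \|p - p'\|
  && \text{(triangle inequality)} \\
|\mathrm{cost}(P, C) - \mathrm{cost}(P', C)| &\le \sum_{p \in P} \|p - p'\|
  && \text{(sum over all points)} \\
\sum_{p \in P} \|p - p'\| \le \varepsilon \cdot \mathrm{cost}(P, C)
  &\;\Longrightarrow\;
  (1-\varepsilon)\,\mathrm{cost}(P, C) \le \mathrm{cost}(P', C) \le (1+\varepsilon)\,\mathrm{cost}(P, C)
\end{align*}
```

Since moving all points of a region onto one representative is the same as replacing them by a single point of weight w(R), bounding the total movement is enough to obtain the coreset inequality.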
21 Coreset construction: First try
- Only question: how to find the regions?
24 Coreset construction: First try
- First try: regular grid with cell width W
- Error per cell
  - $O(W \cdot \#\text{points in cell})$
  - $W \le \varepsilon \cdot \mathrm{cost}(P,C)/n$
  - Too many cells
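The regular-grid attempt can be sketched as follows. The function name `grid_coreset`, the choice of the first point seen as the representative, and the sample data are assumptions for illustration; the slides only require some representative per region.

```python
import math
from collections import defaultdict

def grid_coreset(points, width):
    """Bucket points into a regular grid and keep one weighted point per non-empty cell."""
    cells = defaultdict(list)
    for p in points:
        # Cell index: integer grid coordinates of the cell containing p.
        key = tuple(math.floor(x / width) for x in p)
        cells[key].append(p)
    # Representative = first point seen in the cell (assumption), weight = w(R).
    return [(pts[0], len(pts)) for pts in cells.values()]

# Hypothetical input: five points falling into two unit cells.
P = [(0.1, 0.1), (0.4, 0.3), (2.2, 0.1), (2.3, 0.4), (2.4, 0.2)]
print(grid_coreset(P, width=1.0))
# [((0.1, 0.1), 2), ((2.2, 0.1), 3)]
```

Each point moves by at most the cell diameter, which is proportional to W, so the total error grows like W times the number of points. That is what forces W to be very small and the number of cells to be large.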
27 Coreset construction: First try
- Second try: refine the grid until each cell contains at most R points
- Error per cell
  - $O(\text{cell width} \cdot R)$
  - There can be a point at distance Opt
  - $R \le \varepsilon$
  - Too many cells
28 Coreset construction: Some definitions
- Assumptions
  - The cost Opt of an optimal k-median solution is known
  - Grid i has cell width $\mathrm{Opt}/2^i$
  - O(log n) levels
- Definition
  - A cell in grid i is called heavy if it contains more than $\delta \cdot 2^i$ points
  - A cell that is not heavy is light
- Observation
  - The movement cost for light cells is $O(\delta \cdot \mathrm{Opt})$
- Construction
  - Put a coreset point in every light cell whose parent cell is heavy
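The observation about light cells follows from multiplying the maximum number of points in a light cell by the cell diameter. The sketch below assumes that $\delta$ is the heaviness threshold parameter from the definition and that the dimension $d$ is a constant, so the factor $\sqrt{d}$ disappears in the O-notation.

```latex
% A cell of width w in dimension d has diameter \sqrt{d}\, w.
\begin{align*}
\text{movement in one light cell of grid } i
  &\le \underbrace{\delta \cdot 2^i}_{\text{points in the cell}} \cdot
       \underbrace{\sqrt{d} \cdot \frac{\mathrm{Opt}}{2^i}}_{\text{cell diameter}} \\
  &= \sqrt{d} \cdot \delta \cdot \mathrm{Opt} \;=\; O(\delta \cdot \mathrm{Opt}).
\end{align*}
```

The factor $2^i$ cancels, which is why the threshold doubles when the cell width halves: the movement bound per light cell is the same at every level.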
29 Coreset construction: The algorithm
- Computation of coreset points
- [Figure sequence, slides 29-36: nested grids starting at cell width Opt; numeric labels (1, 2, 3, 5) mark cells in the later steps]
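The construction rule (a coreset point in every light cell whose parent cell is heavy) can be sketched recursively. Several details are assumptions made for this sketch and are not stated in the slides: the grids are anchored at the origin, the level-0 cells are treated as children of a heavy virtual root, a cell that is still heavy at `max_level` is emitted anyway, and the representative is the first point seen in the cell. The names `heavy_light_coreset`, `opt`, `delta`, and `max_level` are hypothetical.

```python
import math
from collections import defaultdict

def heavy_light_coreset(points, opt, delta, max_level):
    """Return (representative, weight) pairs for light cells with a heavy parent."""
    coreset = []

    def refine(pts, level):
        # Bucket the points of one heavy parent cell into cells of grid `level`.
        width = opt / 2 ** level
        cells = defaultdict(list)
        for p in pts:
            cells[tuple(math.floor(x / width) for x in p)].append(p)
        for cell_pts in cells.values():
            heavy = len(cell_pts) > delta * 2 ** level
            if heavy and level < max_level:
                refine(cell_pts, level + 1)   # heavy: look at its children
            else:
                # Light cell below a heavy parent (or recursion cut-off).
                coreset.append((cell_pts[0], len(cell_pts)))

    refine(list(points), 0)
    return coreset

# Hypothetical input: three nearby points and one far point.
P = [(0.5, 0.5), (0.6, 0.6), (0.7, 0.7), (3.5, 3.5)]
print(heavy_light_coreset(P, opt=4.0, delta=1.0, max_level=3))
# [((0.5, 0.5), 3), ((3.5, 3.5), 1)]
```

In the example the single level-0 cell is heavy (4 > 1), its child containing the three nearby points is heavy at level 1 (3 > 2) and light at level 2 (3 <= 4), and the far point sits in a light level-1 cell. The total weight always equals the number of input points.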
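39 Coreset construction: Analysis
- Coreset size $\le 2^d \cdot \#\text{heavy cells}$
- Contribution of an outer heavy cell: $\delta/\varepsilon \cdot \mathrm{cost}(P,C)$
- Number of outer heavy cells per grid $\le \varepsilon/\delta$
- Number of inner heavy cells per grid
  - $k/\varepsilon^d$ (volume argument)
- [Figure label: $1/\varepsilon \cdot$ cell width]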
44 Coreset construction: Analysis
- Coreset size $\le O(\log n \cdot (\varepsilon/\delta + k/\varepsilon^d))$
- Inner cells: cost per cell $\delta \cdot \mathrm{Opt}$
- [Figure label: $1/\varepsilon \cdot$ cell width]
- #inner cells $\le k/\varepsilon^d$ $\Rightarrow$ $\delta = \varepsilon^{d+1} / \log n$
- Outer cells: movement can be charged to contribution $\Rightarrow$ overall cost $\varepsilon \cdot \mathrm{cost}(P,C)$
45 Coreset: Summary
- Theorem
  - Our construction gives a coreset of size $O(k \log n / \varepsilon^d)$
- Dynamic geometric data streams
  - Stream of Insert(p)/Delete(p) operations, $p \in \{1, \dots, \Delta\}^d$
  - The stream is consistent: no Delete(p) if p is not in the current set
- Algorithm
  - Output: set of k centers
  - Maintains a coreset
  - Computes the centers from the coreset using a $(1+\varepsilon)$-approximation algorithm
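The stream model can be illustrated with a deliberately naive sketch that keeps an exact counter for every non-empty cell on every grid level. This is not the small-space structure the talk is about; it only shows how Insert(p) and Delete(p) affect the cell counts from which heavy cells are determined. The class name `GridCounts` and its methods are hypothetical, and a consistent stream is assumed (no deletion of an absent point).

```python
import math
from collections import Counter

class GridCounts:
    """Exact per-cell point counts on nested grids under insertions and deletions."""

    def __init__(self, opt, levels):
        self.opt = opt
        self.counts = [Counter() for _ in range(levels)]  # one counter per grid level

    def _key(self, p, level):
        width = self.opt / 2 ** level  # grid i has cell width Opt / 2^i
        return tuple(math.floor(x / width) for x in p)

    def insert(self, p):
        for level, counter in enumerate(self.counts):
            counter[self._key(p, level)] += 1

    def delete(self, p):
        # Assumes a consistent stream: p is currently in the set.
        for level, counter in enumerate(self.counts):
            key = self._key(p, level)
            counter[key] -= 1
            if counter[key] == 0:
                del counter[key]

    def heavy_cells(self, level, delta):
        return {k for k, n in self.counts[level].items() if n > delta * 2 ** level}

g = GridCounts(opt=4.0, levels=2)
for p in [(0.5, 0.5), (0.6, 0.6), (3.5, 3.5)]:
    g.insert(p)
g.delete((0.6, 0.6))
print(g.heavy_cells(level=0, delta=1.0))  # {(0, 0)}
```

The space used here grows with the number of non-empty cells, which is exactly what a streaming algorithm has to avoid.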
46 Streaming: Coreset maintenance
- How to maintain the coreset
  - A $(1+\varepsilon)$-approximation of the number of points in heavy cells is sufficient
  - For grids with cell width $\mathrm{Opt}/2^i$ we need an approximation for all cells with more than $\delta \cdot 2^i$ points
- Solution
  - Uniform random sampling will do
- Reason
  - The size of a grid cell imposes a restriction on the distribution
  - The sample hits only a few cells
  - So small space suffices
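The idea of estimating cell counts from a uniform sample can be sketched on a static point set. How the sample is maintained under deletions is not covered by the slides and is not attempted here. The function name `estimate_cell_counts`, the parameter `rate`, and the fixed seed are assumptions for illustration.

```python
import math
import random
from collections import Counter

def estimate_cell_counts(points, width, rate, seed=0):
    """Estimate the number of points per grid cell from a uniform random sample."""
    rng = random.Random(seed)
    # Keep each point independently with probability `rate`.
    sample = [p for p in points if rng.random() < rate]
    hits = Counter(tuple(math.floor(x / width) for x in p) for p in sample)
    # Scale the sample counts back up; cells missed by the sample are not reported.
    return {cell: n / rate for cell, n in hits.items()}

# Hypothetical input: with rate 1.0 every point is kept, so the estimate is exact.
P = [(0.1, 0.1), (0.4, 0.3), (2.2, 0.1), (2.3, 0.4), (2.4, 0.2)]
print(estimate_cell_counts(P, width=1.0, rate=1.0))
# {(0, 0): 2.0, (2, 0): 3.0}
```

For smaller rates the estimates are only reliable for cells holding many points, which matches the requirement on the slide: good approximations are needed only for cells above the heaviness threshold.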
47 Conclusions
- Summary
  - Streaming algorithm for insertions and deletions
  - Maintains a coreset
  - Computes a $(1+\varepsilon)$-approximation from the coreset
- Some more progress on
  - High-dimensional dynamic data streams
  - Sliding window model (low-dimensional)
48 Thank you!
Christian Sohler
Heinz Nixdorf Institut, Institut für Informatik
Universität Paderborn
Fürstenallee 11, 33102 Paderborn, Germany
Tel.: +49 (0) 52 51/60 64 27
Fax: +49 (0) 52 51/62 64 82
E-Mail: csohler_at_upb.de
http://www.upb.de/cs/ag-madh