Grid-based Coresets for Clustering Problems

1
Grid-based Coresets for Clustering Problems
  • Christian Sohler
  • Universität Paderborn
  • (joint work with Gereon Frahling)

2
Introduction: Clustering
  • Clustering
  • Partition the input into sets (clusters) such that
    - objects in the same cluster are similar
    - objects in different clusters are dissimilar
  • Goal
  • Simplification
  • Discovery of patterns
  • Procedure
  • Map objects to Euclidean space → point set P
  • Points in the same cluster are close
  • Points in different clusters are far away from
    each other

3
Introduction: k-means clustering
  • Clustering with Prototypes
  • One prototype (center) for each cluster
  • k-Median Clustering
  • k clusters $C_1, \dots, C_k$
  • One center $c_i$ for each cluster $C_i$
  • Minimize $\sum_{i=1}^{k} \sum_{p \in C_i} d(p, c_i)$
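
The objective on this slide can be sketched in a few lines. This is only an illustration, assuming points and centers are coordinate tuples and d is the Euclidean distance; the name `kmedian_objective` and the sample data are made up for the example and do not come from the slides.

```python
import math

def kmedian_objective(clusters, centers):
    """Sum, over all clusters C_i, of the distances d(p, c_i) for p in C_i."""
    total = 0.0
    for cluster, center in zip(clusters, centers):
        for p in cluster:
            total += math.dist(p, center)  # Euclidean distance (assumption)
    return total

# Two clusters with one center each (hypothetical data).
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0)]]
centers = [(1.0, 0.0), (10.0, 0.0)]
print(kmedian_objective(clusters, centers))  # 2.0
```
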
6
Introduction: Simplification / Lossy Compression
9
Introduction: Properties of k-means
  • Simple property of k-median
  • Point set P
  • Set of centers C
  • Best clustering: assign each point to its nearest
    center

Notation: cost(P,C) denotes the cost of the
solution defined this way, i.e. with each point of P assigned to its nearest center in C
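
A small sketch of cost(P,C) under the nearest-center rule, assuming Euclidean distance. The optional `weights` argument is an addition for this example (it makes the same function usable for weighted point sets); it is not part of the notation on the slide.

```python
import math

def cost(points, centers, weights=None):
    """cost(P, C): every point pays (its weight times) the distance to its nearest center."""
    if weights is None:
        weights = [1.0] * len(points)  # unweighted point set
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

P = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
C = [(1.0, 0.0), (10.0, 0.0)]
print(cost(P, C))  # 2.0
```
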
14
Introduction: Coresets
  • Definition (Coreset for k-median) [HM04]
  • A weighted point set S is called an $\varepsilon$-coreset for P
    if for every set C of k centers we have
  • $(1-\varepsilon) \cdot \mathrm{cost}(P,C) \le \mathrm{cost}(S,C) \le (1+\varepsilon) \cdot \mathrm{cost}(P,C)$
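
To make the definition concrete, the sketch below tests the two-sided bound for a single center set C. It assumes Euclidean distance; the helper names and the data are hypothetical. Passing the check for one C does not make S a coreset, since the definition quantifies over every set of k centers, as the second call shows.

```python
import math

def cost(points, centers, weights=None):
    """Weighted cost with each point assigned to its nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def satisfies_coreset_bound(P, S, S_weights, C, eps):
    """Check (1-eps)*cost(P,C) <= cost(S,C) <= (1+eps)*cost(P,C) for ONE center set C."""
    full = cost(P, C)
    small = cost(S, C, S_weights)
    return (1 - eps) * full <= small <= (1 + eps) * full

P = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
S = [(0.0, 0.5), (10.0, 0.5)]   # two representatives ...
S_weights = [2.0, 2.0]          # ... each standing for two points

far_centers = [(0.0, 5.0), (10.0, 5.0)]
near_centers = [(0.0, 0.5), (10.0, 0.5)]
print(satisfies_coreset_bound(P, S, S_weights, far_centers, eps=0.1))   # True
print(satisfies_coreset_bound(P, S, S_weights, near_centers, eps=0.1))  # False
```
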

15
Introduction: Coresets
  • Replace the point set by a few weighted points (red)

[Figure: weighted points (red) labeled 3, 4, 5, 5, 4]

18
Introduction: Related work
  • Coresets for Clustering Problems
  • k-center, k-median [Badoiu, Indyk, Har-Peled, 2002]:
    existence of coresets, size independent of
    dimension
  • Projective clustering [Har-Peled, Varadarajan, 2002]:
    existence of coresets for projective
    clustering, faster algorithms
  • k-median, k-means [Har-Peled, Mazumdar, 2004]:
    faster algorithms, data streaming,
    different definition of coresets
  • k-median, k-means [Har-Peled, Kushal, 2004]:
    coresets of constant size
  • k-median [Chen, 2005]: coreset with size
    polynomial in dimension
  • k-median, k-means, MaxCut [Frahling, Sohler, 2005]:
    oblivious coreset construction, dynamic
    data streams

19
  • Coresets for clustering problems
  • k-means [Frahling, Sohler, 2006]: efficient
    implementation
  • k-line median [Fiat, Feldman, Sharir, 2006]:
    coresets for low dimensions
  • k-median, k-means [Feldman, Monemizadeh, Sohler,
    2006]: weak coresets, size independent of n and d

20
Coreset construction: First try
  • Our Approach
  • Partition the input space into regions
  • For each region R
  • Count the number w(R) of points in R
  • Choose one representative point p from R
  • Assign weight w(R) to p
  • Remove all other points from R
  • Analysis
  • Moving a point by distance d changes cost(P,C) by
    at most d
  • Sum up the movement over all regions
  • Show: overall movement is at most $\varepsilon \cdot \mathrm{cost}(P,C)$
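
The first analysis step follows from the triangle inequality. A short derivation, written with symbols that are not on the slide: $p'$ is the position a point $p$ is moved to, $P'$ is the moved point set, $d(p, C) = \min_{c \in C} d(p, c)$, and $c^*$ is the center nearest to $p$.

```latex
\begin{align*}
d(p', C) &\le d(p', c^*) \le d(p', p) + d(p, c^*) = d(p, p') + d(p, C), \\
d(p, C)  &\le d(p, p') + d(p', C) \quad \text{(same argument with the roles of $p$ and $p'$ swapped)}, \\
\Rightarrow\ \bigl| d(p, C) - d(p', C) \bigr| &\le d(p, p'), \\
\bigl| \mathrm{cost}(P, C) - \mathrm{cost}(P', C) \bigr| &\le \sum_{p \in P} d(p, p').
\end{align*}
```

The last line is the "sum up the movement" step: the total change in cost is bounded by the total distance moved.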

21
Coreset construction: First try
Only question: How to find the regions?
24
Coreset construction: First try
First try: regular grid with cell width W

  • Error per cell
  • $O(W \cdot \#\text{points in cell})$
  • $W \le \varepsilon \cdot \mathrm{cost}(P,C)/n$
  • Too many cells
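
A sketch of the regular-grid attempt, assuming points are coordinate tuples. Taking the first point seen in a cell as its representative is an arbitrary choice made for this example; the function name and data are hypothetical.

```python
import math
from collections import defaultdict

def grid_coreset(points, width):
    """One weighted representative per non-empty cell of a regular grid with the given cell width."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(math.floor(x / width) for x in p)  # index of the cell containing p
        cells[key].append(p)
    # Representative: first point seen in the cell; weight: number of points in the cell.
    return [(members[0], len(members)) for members in cells.values()]

P = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.9), (5.0, 5.0)]
coreset = grid_coreset(P, width=1.0)
print(coreset)  # [((0.1, 0.1), 3), ((5.0, 5.0), 1)]
```

With a small W the movement per point is small, but nearly every point ends up alone in its cell, which is the "too many cells" problem noted above.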

27
Coreset construction: First try
Second try: refine the grid until each cell contains at most
R points

  • Error per cell
  • $O(\text{cell width} \cdot R)$
  • There can be a point at distance Opt
  • $R \le \varepsilon$
  • Too many cells
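
The refinement idea can be sketched as a recursive split of a square cell into halves along every axis. This is an illustration only: the function name, the `max_depth` safeguard (which stops the recursion when many points coincide), and the data are assumptions of the example.

```python
def refine(points, lo, width, max_points, max_depth=32):
    """Split the cell [lo, lo + width)^d until each non-empty cell holds at most max_points points.

    Returns a list of (lower_corner, cell_width, points_in_cell) for the non-empty leaf cells.
    """
    if not points:
        return []
    if len(points) <= max_points or max_depth == 0:
        return [(lo, width, points)]
    half = width / 2
    children = {}
    for p in points:
        # Entry j of idx is 1 if p lies in the upper half of the cell along axis j.
        idx = tuple(int(x >= l + half) for x, l in zip(p, lo))
        children.setdefault(idx, []).append(p)
    leaves = []
    for idx, members in children.items():
        child_lo = tuple(l + b * half for l, b in zip(lo, idx))
        leaves.extend(refine(members, child_lo, half, max_points, max_depth - 1))
    return leaves

P = [(0.1, 0.1), (0.2, 0.2), (0.9, 0.9), (0.6, 0.1)]
leaves = refine(P, lo=(0.0, 0.0), width=1.0, max_points=1)
coreset = [(members[0], len(members)) for _, _, members in leaves]
print(len(leaves), coreset)
```

Dense regions are split more often than sparse ones, so the leaf cells have different widths.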

28
Coreset construction: Some definitions
  • Assumptions
  • Cost Opt of an optimal k-median solution is known
  • Grid i has cell width $\mathrm{Opt}/2^i$
  • $O(\log n)$ levels
  • Definition
  • A cell in grid i is called heavy if it contains
    more than $\delta \cdot 2^i$ points.
  • A cell that is not heavy is light.
  • Observation
  • Movement cost for light cells is $O(\delta \cdot \mathrm{Opt})$
  • Construction
  • Put a coreset point in every light cell whose
    parent cell is heavy
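
A rough sketch of the construction rule, reading the heaviness threshold as delta * 2**i and the cell width of grid i as opt / 2**i. Several details are assumptions of this example and not taken from the slides: every cell of grid 0 is examined as if its parent were heavy, cells that are still heavy at the last level get a representative anyway, and the representative is the first point seen in the cell.

```python
import math
from collections import defaultdict

def heavy_cell_coreset(points, opt, delta, levels):
    """Place a weighted point in every light cell whose parent cell is heavy."""
    def cells_at(level):
        width = opt / 2 ** level
        cells = defaultdict(list)
        for p in points:
            cells[tuple(math.floor(x / width) for x in p)].append(p)
        return cells

    coreset = []
    heavy_parents = None  # None: grid 0 has no parent grid, so all its cells are examined
    for i in range(levels + 1):
        next_heavy = set()
        for key, members in cells_at(i).items():
            parent = tuple(k // 2 for k in key)  # cell of grid i - 1 containing this cell
            if heavy_parents is not None and parent not in heavy_parents:
                continue  # an ancestor was light, so these points are already represented
            if len(members) > delta * 2 ** i and i < levels:
                next_heavy.add(key)  # heavy: look at its children in grid i + 1
            else:
                coreset.append((members[0], len(members)))
        heavy_parents = next_heavy
    return coreset

P = [(0.5, 0.5), (1.5, 0.5), (0.5, 1.5), (1.5, 1.5), (7.0, 7.0)]
coreset = heavy_cell_coreset(P, opt=8.0, delta=1, levels=3)
print(coreset)  # [((7.0, 7.0), 1), ((0.5, 0.5), 4)]
```

In the example the isolated point becomes light already in grid 1, while the four nearby points stay in a heavy cell one level longer and are then merged into a single point of weight 4.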
29
Coreset construction: The algorithm
Computation of coreset points
[Figure label: Opt]
36
Coreset construction: The algorithm
Computation of coreset points
[Figure: coreset points labeled 1, 1, 1, 5, 5, 1, 5, 2, 3, 1, 3, 1, 1, 1, 1]
39
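Coreset construction: Analysis
  • Coreset size $\le 2^d \cdot \#\text{heavy cells}$
  • Contribution of an outer heavy cell: $\frac{\delta}{\varepsilon} \cdot \mathrm{cost}(P,C)$
  • Number of outer heavy cells per grid: $\le \varepsilon/\delta$
  • Number of inner heavy cells per grid:
    $k/\varepsilon^d$ (volume argument)

[Figure label: $\frac{1}{\varepsilon} \cdot$ cell width]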
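
One way to read the bound on the number of outer heavy cells is as a counting argument. It is sketched here under two assumptions that the slide does not spell out: the stated contribution is a lower bound on the share of cost(P,C) paid by the points of a single outer heavy cell, and the cells of one grid are disjoint. The symbol $m_i$ (number of outer heavy cells in grid i) is introduced for this sketch, and $Z$ ranges over those cells.

```latex
\begin{align*}
m_i \cdot \frac{\delta}{\varepsilon} \cdot \mathrm{cost}(P, C)
  \;\le\; \sum_{Z} \sum_{p \in Z} d(p, C)
  \;\le\; \mathrm{cost}(P, C)
  \quad\Longrightarrow\quad
  m_i \le \frac{\varepsilon}{\delta}
  \qquad (\text{for } \mathrm{cost}(P, C) > 0).
\end{align*}
```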
44
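Coreset construction: Analysis
  • Coreset size $\le O(\log n \cdot (\varepsilon/\delta + k/\varepsilon^d))$

[Figure label: $\frac{1}{\varepsilon} \cdot$ cell width]
Inner cells: cost per cell $\delta \cdot \mathrm{Opt}$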
inner cells ? k/e ? de / log n
Outer cells: movement can be charged to
contribution $\Rightarrow$ overall cost $\varepsilon \cdot \mathrm{cost}(P,C)$
d1
45
Coreset Summary
  • Theorem
  • Our construction gives a coreset of size $O(k \log n / \varepsilon^d)$
  • Dynamic geometric data streams
  • Stream of Insert(p)/Delete(p) operations, $p \in \{1, \dots, D\}^d$
  • Stream is consistent: no Delete(p) if p is not in the
    current set
  • Algorithm
  • Output: set of k centers
  • Maintains a coreset
  • Computes centers from the coreset using a $(1+\varepsilon)$-approximation
    algorithm
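
To illustrate the stream model, the sketch below keeps exact point counts per grid cell under Insert/Delete and rejects an inconsistent Delete. It is not the space-efficient method of the talk: it stores the whole current point set, and the class name and interface are hypothetical.

```python
import math
from collections import Counter

class CellCounters:
    """Exact per-cell point counts for grids 0..levels under Insert(p)/Delete(p)."""

    def __init__(self, opt, levels):
        self.widths = [opt / 2 ** i for i in range(levels + 1)]  # cell width of grid i
        self.counts = [Counter() for _ in self.widths]
        self.current = Counter()  # multiset of points currently in the set

    def _key(self, p, width):
        return tuple(math.floor(x / width) for x in p)

    def insert(self, p):
        self.current[p] += 1
        for width, counter in zip(self.widths, self.counts):
            counter[self._key(p, width)] += 1

    def delete(self, p):
        if self.current[p] == 0:
            raise ValueError("inconsistent stream: Delete(p) for a point not in the current set")
        self.current[p] -= 1
        for width, counter in zip(self.widths, self.counts):
            key = self._key(p, width)
            counter[key] -= 1
            if counter[key] == 0:
                del counter[key]  # keep only non-empty cells

cc = CellCounters(opt=8.0, levels=1)
for p in [(1, 1), (5, 5), (1, 2)]:
    cc.insert(p)
cc.delete((5, 5))
print(dict(cc.counts[1]))  # {(0, 0): 2}
```
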
46
Streaming: Coreset maintenance
  • How to maintain the coreset
  • A $(1+\varepsilon)$-approximation of the number of points in heavy cells
    is sufficient
  • For grids with cell width $\mathrm{Opt}/2^i$ we need an
    approximation for all cells with more than $\delta \cdot 2^i$
    points
  • Solution
  • Uniform random sampling will do
  • Reason
  • Size of a grid cell imposes a restriction on the
    distribution
  • Sample hits only few cells
  • So, small space suffices
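
The sampling idea can be illustrated on a static point set: keep each point independently with probability q and scale the per-cell sample counts by 1/q. This is a simplified sketch with hypothetical names. It does not handle deletions, which would at least require that the keep/drop decision for a point is the same on Insert and on Delete.

```python
import math
import random
from collections import Counter

def estimate_cell_counts(points, width, q, seed=0):
    """Estimate the number of points per grid cell from a uniform sample kept with probability q."""
    rng = random.Random(seed)
    sampled = Counter()
    for p in points:
        if rng.random() < q:  # keep p with probability q
            sampled[tuple(math.floor(x / width) for x in p)] += 1
    # Each sampled point stands for 1/q original points.
    return {cell: count / q for cell, count in sampled.items()}

P = [(0.5, 0.5)] * 3 + [(1.5, 0.5)]
print(estimate_cell_counts(P, width=1.0, q=1.0))  # {(0, 0): 3.0, (1, 0): 1.0}
```

With q = 1 the estimate is exact. For smaller q only cells containing many points are likely to appear in the sample, which is the case the slide says needs to be approximated.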
47
Conclusions
  • Summary
  • Streaming algorithm for insertions and deletions
  • Maintains coreset
  • Computes a $(1+\varepsilon)$-approximation from the coreset
  • Some more progress on
  • High dimensional dynamic data streams
  • Sliding window model (low dimensional)

48
Thank you!
Christian Sohler
Heinz Nixdorf Institut
Institut für Informatik
Universität Paderborn
Fürstenallee 11, 33102 Paderborn, Germany
Tel. +49 (0) 52 51/60 64 27
Fax +49 (0) 52 51/62 64 82
E-Mail: csohler_at_upb.de
http://www.upb.de/cs/ag-madh