1
CURE: An Efficient Clustering Algorithm for Large
Databases
CHAN Siu Lung, Daniel; CHAN Wai Kin, Ken;
CHOW Chin Hung, Victor; KOON Ping Yin, Bob
2
Content
  1. Problems with traditional clustering methods
  2. Basic idea of CURE clustering
  3. Improved CURE
  4. Summary
  5. References

3
Problems with traditional clustering methods
4
Partitional Clustering
Problems with traditional clustering methods
  • Partitional Clustering
  • This category of clustering methods tries to
    partition the data set into k clusters based on
    some criterion function.
  • The most common criterion is the square-error
    criterion.
  • These methods favor clusters whose data points are
    as compact and as well separated as possible (see
    the sketch below).
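To make the square-error criterion concrete, here is a minimal k-means-style partitional sketch in Python; the use of numpy and all function names are our own assumptions, not part of the original slides.

import numpy as np

def square_error(data, labels, centroids):
    # Sum of squared distances from each point to its cluster centroid.
    return sum(np.sum((data[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))

def partition_kmeans(data, k, iters=20, seed=0):
    # Iteratively reassign points and move centroids so that the
    # square-error criterion decreases.
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :],
                               axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([data[labels == i].mean(axis=0)
                              if np.any(labels == i) else centroids[i]
                              for i in range(k)])
    return labels, centroids

On compact, well-separated, spherical clusters this behaves well; the next slide shows how minimizing the same criterion can misbehave.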

5
Partitional Clustering
Problems with traditional clustering methods
  • Errors can arise because the square error may be
    reduced by splitting a large cluster in order to
    favor some other group.

Figure: splitting of a large cluster by a partitional method
6
Hierarchical Clustering
Problems with traditional clustering methods
  • Hierarchical Clustering
  • This category of clustering methods merges
    sequences of disjoint clusters into the target k
    clusters, based on the minimum distance between
    two clusters.
  • The distance between two clusters can be measured
    as:
  • d_mean: the distance between the cluster means
  • d_ave: the average distance between pairs of
    points, one from each cluster
  • d_min: the distance between the two nearest
    points, one from each cluster (see the sketch
    below)
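A small sketch of the three inter-cluster distances, where a and b are numpy arrays of shape (points, dims); the function names are our own.

import numpy as np

def d_mean(a, b):
    # Distance between the two cluster means.
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

def d_ave(a, b):
    # Average distance over all cross-cluster point pairs.
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2).mean()

def d_min(a, b):
    # Distance between the two closest points, one from each cluster.
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2).min()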

7
Hierarchical Clustering
Problems with traditional clustering methods
  • These methods favor hyper-spherical clusters and
    uniformly sized data.
  • Let's take some elongated data as an example.
  • Result of d_mean:

8
Hierarchical Clustering
Problems with traditional clustering methods
  • Result of d_min:

9
Problems summary
Problems with traditional clustering methods
  1. Traditional clustering mainly favors spherical
    shapes.
  2. Data points within a cluster must be compact.
  3. Clusters must be separated far enough from each
    other.
  4. Cluster sizes must be uniform.
  5. Outliers greatly disturb the clustering result.

10
Basic idea of CURE clustering
11
General CURE clustering procedure
Basic idea of CURE clustering
  1. It is similar to the hierarchical clustering
    approach, but it uses a set of sample points as
    the cluster representative rather than every
    point in the cluster.
  2. First, set a target number of representatives c.
    Then try to select c well-scattered sample points
    from the cluster.
  3. The chosen scattered points are shrunk toward the
    centroid by a fraction α, where 0 < α < 1 (see
    the sketch below).
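A sketch of steps 2 and 3 in Python, using a greedy farthest-point heuristic to pick the c scattered points; treat the exact selection rule and names as our own reading, not the paper's verbatim procedure.

import numpy as np

def select_scattered(points, c):
    # Greedy farthest-point selection: start with the point farthest
    # from the centroid, then repeatedly add the point farthest from
    # the ones already chosen.
    centroid = points.mean(axis=0)
    chosen = [points[np.linalg.norm(points - centroid, axis=1).argmax()]]
    while len(chosen) < min(c, len(points)):
        d = np.min([np.linalg.norm(points - p, axis=1) for p in chosen],
                   axis=0)
        chosen.append(points[d.argmax()])
    return np.array(chosen)

def shrink(reps, centroid, alpha):
    # Move each representative a fraction alpha of the way toward
    # the centroid; alpha in (0, 1).
    return reps + alpha * (centroid - reps)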

12
General CURE clustering procedure
Basic idea of CURE clustering
  1. These points are used as the representatives of
    the clusters and serve as the points in the d_min
    cluster-merging approach.
  2. After each merge, c sample points are selected
    from the original representatives of the previous
    clusters to represent the new cluster.
  3. Cluster merging stops once the target of k
    clusters is reached.

13
Pseudo function of CURE
Basic idea of CURE clustering
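A simplified, runnable Python reconstruction of the merge loop described on the previous slides; it uses a brute-force closest-pair search instead of the paper's k-d tree and heap, and reuses select_scattered() and shrink() from the sketch above. The structure and names are our own, not the paper's exact pseudo function.

import numpy as np

def cure_cluster(points, k, c=4, alpha=0.5):
    # Every point starts as its own cluster; each cluster keeps its
    # member points and its shrunken representatives.
    clusters = [{"pts": points[i:i + 1], "reps": points[i:i + 1]}
                for i in range(len(points))]
    while len(clusters) > k:
        # Find the pair of clusters whose representatives are
        # closest under d_min (brute force for clarity).
        best = (np.inf, 0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(clusters[i]["reps"][:, None, :]
                                   - clusters[j]["reps"][None, :, :],
                                   axis=2).min()
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the pair and recompute shrunken representatives.
        pts = np.vstack([clusters[i]["pts"], clusters[j]["pts"]])
        reps = shrink(select_scattered(pts, c), pts.mean(axis=0), alpha)
        clusters = [cl for t, cl in enumerate(clusters) if t not in (i, j)]
        clusters.append({"pts": pts, "reps": reps})
    return clusters

The quadratic pair search here is for readability only; with the k-d tree and heap of the paper, the merge loop achieves the complexity stated on the next slide.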
14
CURE efficiency
Basic idea of CURE clustering
  • The worst-case time complexity is O(n² log n).
  • The space complexity is O(n), thanks to the use of
    a k-d tree and a heap.

15
Improved CURE
16
Random Sampling
Improved CURE
  • When dealing with a large database, we cannot
    store every data point in memory.
  • Handling merges over a large database also takes a
    very long time.
  • We use random sampling to reduce both the time
    complexity and the memory usage.
  • Assume that, to detect a cluster u, we need to
    capture at least a fraction f of its points, i.e.
    at least f·|u| points from u.
  • The required sample size s can then be bounded as
    follows, where N is the total number of points and
    δ the allowed failure probability:
  • s_min = f·N + (N/|u|)·log(1/δ)
    + (N/|u|)·sqrt((log(1/δ))² + 2·f·|u|·log(1/δ))
  • Refer to [GRS97] for the proof. Here we just note
    that we can determine a sample size s_min such
    that the probability of getting enough samples
    from every cluster u is at least 1 − δ (a numeric
    sketch follows).
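A small numeric sketch of that bound as we read it from the CURE paper; the natural logarithm and the exact constants are assumptions to be checked against [GRS97].

import math

def sample_size(N, u_size, f, delta):
    # Sample size s_min such that, with probability >= 1 - delta,
    # a uniform random sample contains at least f * |u| points from
    # a cluster u of u_size points, out of N points in total.
    log_d = math.log(1.0 / delta)
    return (f * N
            + (N / u_size) * log_d
            + (N / u_size) * math.sqrt(log_d ** 2
                                       + 2.0 * f * u_size * log_d))

# Example: N = 100000 points, smallest cluster of 5000 points,
# capture f = 10% of it, with delta = 0.1% failure probability.
print(math.ceil(sample_size(100_000, 5_000, 0.1, 0.001)))

Note that the resulting sample is far smaller than N, which is what makes in-memory clustering feasible.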

17
Partitioning and two-pass clustering
Improved CURE
  • In addition, we use a two-pass approach to reduce
    the computation time.
  • First, we divide the n data points into p
    partitions, each containing n/p data points.
  • We then pre-cluster each partition until the
    number of clusters in it reaches n/(p·q), for some
    q > 1.
  • Each cluster from the first-pass result is then
    used as input to the second clustering pass, which
    forms the final clusters.
  • The time complexity for one partition is
    O((n/p)² log(n/p)).
  • Therefore, the first-pass complexity is
    O((n²/p) log(n/p)).
  • The second pass, run on the remaining n/q
    pre-clusters, costs O((n/q)² log(n/q)).
  • Overall, the time complexity becomes
    O((n²/p) log(n/p) + (n/q)² log(n/q)).

18
Partitioning and two-pass clustering
Improved CURE
  • Compared with the single-pass O(n² log n), the
    dominant first-pass term thus improves by roughly
    a factor of p.
  • Also, to maintain the quality of the clustering,
    we must make sure that n/(p·q) is at least 2 to 3
    times k (a sketch of the two-pass flow follows).
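A sketch of the two-pass flow, reusing the cure_cluster() sketch from slide 13 as the clustering routine; the partitioning scheme and names are our own.

import numpy as np

def two_pass_cure(points, k, p, q):
    # First pass: pre-cluster each of the p partitions down to about
    # n/(p*q) clusters, keeping only points and representatives.
    pre = []
    for part in np.array_split(points, p):
        target = max(k, len(part) // q)  # about n/(p*q) per partition
        pre.extend(cure_cluster(part, target))
    # Second pass: cluster the ~n/q representative points from the
    # pre-clusters down to the final k clusters.
    reps = np.vstack([c["reps"] for c in pre])
    return cure_cluster(reps, k)

The max(k, ...) guard simply keeps the per-partition target sensible; per the slide, n/(p·q) should in practice be 2 to 3 times k.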

19
Outlier elimination
Improved CURE
  • We can introduce outlier elimination by two
    methods.
  • Random sampling: with random sampling, most
    outlier points are filtered out.
  • Outlier elimination: since outliers do not form
    compact groups, their clusters grow very slowly
    during the merge stage. We therefore kick in an
    elimination procedure during merging that removes
    clusters with only 1 to 2 data points from the
    cluster list.
  • To prevent these outliers from merging into proper
    clusters, we must trigger the procedure at the
    proper stage. In general, we trigger it when the
    number of clusters drops to about 1/3 of the
    initial number (see the sketch below).
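A sketch of the second method, using the cluster dictionaries from the earlier sketches; the size threshold and the 1/3 trigger follow the heuristic above.

def eliminate_outliers(clusters, initial_count, max_outlier_size=2):
    # Once merging has reduced the cluster count to about 1/3 of the
    # initial count, drop clusters that are still tiny (1-2 points):
    # outliers do not form compact groups, so their clusters grow
    # very slowly.
    if len(clusters) <= initial_count / 3:
        clusters = [c for c in clusters
                    if len(c["pts"]) > max_outlier_size]
    return clusters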

20
Data labeling
Improved CURE
  • Because we clustered only a random sample, we need
    to label every remaining data point with the
    proper cluster group.
  • Each data point is assigned to the cluster that
    has a representative point nearest to it, as
    sketched below.
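A sketch of the labeling step, again using the cluster dictionaries from the earlier sketches.

import numpy as np

def label_points(points, clusters):
    # Assign every point to the cluster whose nearest shrunken
    # representative is closest to that point.
    labels = np.empty(len(points), dtype=int)
    for idx, x in enumerate(points):
        dists = [np.linalg.norm(c["reps"] - x, axis=1).min()
                 for c in clusters]
        labels[idx] = int(np.argmin(dists))
    return labels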

21
Final overview of the CURE flow
Improved CURE
(Flow diagram: data → random sampling → partitioning and
pre-clustering → outlier elimination → final clustering →
data labeling)
22
Sample results with different parameters
Improved CURE
Different shrinking factor α
23
Sample results with different parameters
Improved CURE
Different number of representatives c
24
Sample results with different parameters
Improved CURE
Relation between execution time, the partition number p,
and the number of sample points s
25
Summary
  • CURE can effectively detect clusters of
    non-spherical shape with the help of scattered
    representative points and centroid shrinking.
  • CURE can reduce computation time and memory load
    with random sampling and two-pass clustering.
  • CURE can effectively remove outliers.
  • The quality and effectiveness of CURE can be tuned
    by varying s, p, c, and α to adapt to different
    input data sets.

26
References
  1. [GRS97] Sudipto Guha, Rajeev Rastogi, and Kyuseok
    Shim. CURE: A Clustering Algorithm for Large
    Databases. Technical report, Bell Laboratories,
    Murray Hill, 1997.
  2. [ZRL96] Tian Zhang, Raghu Ramakrishnan, and Miron
    Livny. BIRCH: An Efficient Data Clustering Method
    for Very Large Databases. ACM SIGMOD Record,
    25(2):103-114, June 1996.
  3. [GRS98] Sudipto Guha, Rajeev Rastogi, and Kyuseok
    Shim. CURE: An Efficient Clustering Algorithm for
    Large Databases. In Proceedings of ACM SIGMOD,
    1998.
