BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies

1
BIRCH: Balanced Iterative Reducing and
Clustering using Hierarchies
  • By Tian Zhang, Raghu Ramakrishnan

Presented by Vladimir Jelic 3218/10, e-mail:
jelicvladimir5_at_gmail.com
2
What is Data Clustering?
  • A cluster is a closely-packed group.
  • A collection of data objects that are similar to
    one another and treated collectively as a group.
  • Data Clustering is the partitioning of a dataset
    into clusters

3
Data Clustering
  • Helps understand the natural grouping or
    structure in a dataset
  • Given a large set of multidimensional data, the
    data space is usually not uniformly occupied
  • Clustering identifies the sparse and crowded
    places
  • Helps visualization

4
Some Clustering Applications
  • Biology: building groups of genes with related
    patterns
  • Marketing: partitioning the population of
    consumers into market segments
  • WWW: division of pages into genres
  • Image segmentation for object recognition
  • Land use: identification of areas of similar
    land use from satellite images

5
Clustering Problems
  • Today many datasets are too large to fit into
    main memory
  • For such datasets the dominating cost of a
    clustering algorithm is I/O, because seek times
    on disk are orders of magnitude higher than RAM
    access times

6
Previous Work
  • Two classes of clustering algorithms:
  • Probability-based
  • Examples: COBWEB and CLASSIT
  • Distance-based
  • Examples: KMEANS, KMEDOIDS, and CLARANS

7
Previous Work: COBWEB
  • Probabilistic approach to make decisions
  • Clusters are represented with a probabilistic
    description
  • Probabilistic representations of clusters are
    expensive to maintain
  • Every instance (data point) translates into a
    terminal node in the hierarchy, so large
    hierarchies tend to overfit the data

8
Previous Work: KMeans
  • Distance-based approach, so a distance measure
    must be defined between any two instances
  • Sensitive to instance order
  • Instances must be stored in memory
  • All instances must be initially available
  • May have exponential run time in the worst case

9
Previous Work: CLARANS
  • Also a distance-based approach, so a distance
    measure must be defined between any two instances
  • Computational complexity of CLARANS is about
    $O(n^2)$
  • Sensitive to instance order
  • Ignores the fact that not all data points in the
    dataset are equally important

10
Contributions of BIRCH
  • Each clustering decision is made without scanning
    all data points
  • BIRCH exploits the observation that the data
    space is usually not uniformly occupied, and
    hence not every data point is equally important
    for clustering purposes
  • BIRCH makes full use of available memory to
    derive the finest possible subclusters (to
    ensure accuracy) while minimizing I/O costs (to
    ensure efficiency)

11
Background Knowledge (1)
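  • Given a cluster of N d-dimensional instances
    $\{\vec{X}_i\}$, $i = 1, 2, \dots, N$, we define

Centroid: $\vec{X}_0 = \frac{1}{N}\sum_{i=1}^{N} \vec{X}_i$
Radius: $R = \left(\frac{1}{N}\sum_{i=1}^{N} (\vec{X}_i - \vec{X}_0)^2\right)^{1/2}$
Diameter: $D = \left(\frac{1}{N(N-1)}\sum_{i=1}^{N}\sum_{j=1}^{N} (\vec{X}_i - \vec{X}_j)^2\right)^{1/2}$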
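The three quantities on this slide can be computed directly from the raw points. The following is a minimal Python sketch, assuming the standard definitions from the BIRCH paper (radius as the root mean squared distance to the centroid, diameter as the root mean squared pairwise distance); the function names are illustrative and not part of the presentation.

```python
import math

def centroid(points):
    """Component-wise mean of the points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def radius(points):
    """Root mean squared distance from the points to their centroid."""
    c = centroid(points)
    return math.sqrt(sum(math.dist(p, c) ** 2 for p in points) / len(points))

def diameter(points):
    """Root mean squared pairwise distance; needs at least two points."""
    n = len(points)
    total = sum(math.dist(p, q) ** 2 for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))

# Four corners of a 2 x 2 square.
pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(centroid(pts), radius(pts), diameter(pts))
```

Computing the diameter this way costs O(N^2) distance evaluations, which is one reason BIRCH works with summaries instead of raw points.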
12
Background Knowledge (2)
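Given two clusters with $N_1$ and $N_2$ points (indexed $1..N_1$ and $N_1+1..N_1+N_2$), centroids $\vec{X}_{0_1}$, $\vec{X}_{0_2}$, and merged centroid $\vec{X}_{0_{12}}$:
Centroid Euclidean distance: $D_0 = \left((\vec{X}_{0_1} - \vec{X}_{0_2})^2\right)^{1/2}$
Centroid Manhattan distance: $D_1 = \sum_{k=1}^{d} \left|\vec{X}_{0_1}^{(k)} - \vec{X}_{0_2}^{(k)}\right|$
Average inter-cluster distance: $D_2 = \left(\frac{1}{N_1 N_2}\sum_{i=1}^{N_1}\sum_{j=N_1+1}^{N_1+N_2} (\vec{X}_i - \vec{X}_j)^2\right)^{1/2}$
Average intra-cluster distance: $D_3 = \left(\frac{1}{(N_1+N_2)(N_1+N_2-1)}\sum_{i=1}^{N_1+N_2}\sum_{j=1}^{N_1+N_2} (\vec{X}_i - \vec{X}_j)^2\right)^{1/2}$
Variance increase distance: $D_4 = \sum_{k=1}^{N_1+N_2} (\vec{X}_k - \vec{X}_{0_{12}})^2 - \sum_{i=1}^{N_1} (\vec{X}_i - \vec{X}_{0_1})^2 - \sum_{j=N_1+1}^{N_1+N_2} (\vec{X}_j - \vec{X}_{0_2})^2$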
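To make the five inter-cluster distances concrete, here is a minimal Python sketch that evaluates them on raw points. It assumes the definitions of D0 to D4 given in the BIRCH paper; the function names `d0` to `d4` and the helper `sse` are illustrative only.

```python
import math

def centroid(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sse(points):
    """Sum of squared distances from the points to their centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) ** 2 for p in points)

def d0(a, b):
    """Centroid Euclidean distance."""
    return math.dist(centroid(a), centroid(b))

def d1(a, b):
    """Centroid Manhattan distance."""
    return sum(abs(x - y) for x, y in zip(centroid(a), centroid(b)))

def d2(a, b):
    """Average inter-cluster distance."""
    total = sum(math.dist(p, q) ** 2 for p in a for q in b)
    return math.sqrt(total / (len(a) * len(b)))

def d3(a, b):
    """Average intra-cluster distance of the merged cluster."""
    merged = a + b
    n = len(merged)
    total = sum(math.dist(p, q) ** 2 for p in merged for q in merged)
    return math.sqrt(total / (n * (n - 1)))

def d4(a, b):
    """Increase in squared error caused by merging the two clusters."""
    return sse(a + b) - sse(a) - sse(b)

# Two small clusters, given as lists so that a + b concatenates them.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(4.0, 0.0), (4.0, 2.0)]
print(d0(a, b), d1(a, b), d2(a, b), d3(a, b), d4(a, b))
```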
13
Clustering Features (CF)
  • The Birch algorithm builds a dendrogram called
    the clustering feature tree (CF tree) while
    scanning the data set.
  • Each entry in the CF tree represents a cluster of
    objects and is characterized by a 3-tuple (N,
    LS, SS), where N is the number of objects in the
    cluster, LS is their linear sum, and SS is their
    square sum.

14
Clustering Feature (CF)
  • Given N d-dimensional data points in a cluster
    $\{\vec{X}_i\}$, where $i = 1, 2, \dots, N$,
  • CF = (N, LS, SS)
  • N is the number of data points in the cluster,
  • LS is the linear sum of the N data points,
    $LS = \sum_{i=1}^{N} \vec{X}_i$,
  • SS is the square sum of the N data points,
    $SS = \sum_{i=1}^{N} \vec{X}_i^2$.
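A minimal Python sketch of the CF triple follows. It assumes SS is kept as a single scalar (the sum of squared norms), as in the BIRCH paper; some implementations keep one value per dimension instead. It also shows that the centroid and radius can be recovered from the triple alone, using R^2 = SS/N - |LS/N|^2. The function names are illustrative.

```python
import math

def clustering_feature(points):
    """Return (N, LS, SS) for a list of d-dimensional points."""
    n = len(points)
    ls = [sum(col) for col in zip(*points)]        # linear sum, one value per dimension
    ss = sum(x * x for p in points for x in p)     # scalar sum of squared norms
    return n, ls, ss

def centroid_from_cf(cf):
    n, ls, _ = cf
    return [x / n for x in ls]

def radius_from_cf(cf):
    n, ls, ss = cf
    # max(..., 0.0) guards against tiny negative values from rounding.
    return math.sqrt(max(ss / n - sum((x / n) ** 2 for x in ls), 0.0))

cf = clustering_feature([(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)])
print(cf, centroid_from_cf(cf), radius_from_cf(cf))
```

Because the triple has a fixed size regardless of N, a cluster of any number of points costs the same amount of memory to summarize.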

15
CF Additivity Theorem (1)
  • If CF1 = (N1, LS1, SS1) and
  • CF2 = (N2, LS2, SS2) are the CF entries of two
    disjoint sub-clusters,
  • then the CF entry of the sub-cluster formed by
    merging the two disjoint sub-clusters is
  • CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)

16
CF Additivity Theorem (2)

Example
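The additivity theorem can be checked on a small case. The sketch below is illustrative, with hypothetical function names and made-up points, and it assumes a scalar SS: it merges the CF entries of two disjoint sub-clusters and compares the result with the CF computed from all points at once.

```python
def cf(points):
    """Return (N, LS, SS) with a scalar SS."""
    ls = [sum(col) for col in zip(*points)]
    ss = sum(x * x for p in points for x in p)
    return len(points), ls, ss

def merge(cf1, cf2):
    """CF of the union of two disjoint sub-clusters."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, [a + b for a, b in zip(ls1, ls2)], ss1 + ss2

c1 = [(3, 4), (2, 6), (4, 5)]
c2 = [(4, 7), (3, 8)]

print(cf(c1))                  # CF of the first sub-cluster
print(cf(c2))                  # CF of the second sub-cluster
print(merge(cf(c1), cf(c2)))   # same triple as cf(c1 + c2)
```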
17
Properties of CF-Tree
  • Each non-leaf node has at most B entries
  • Each leaf node has at most L CF entries, each of
    which must satisfy the threshold T
  • Node size is determined by the dimensionality of
    the data space and the input parameter P (page
    size)

18
CF Tree Insertion
  • Identifying the appropriate leaf: recursively
    descend the CF tree, choosing the closest child
    node according to a chosen distance metric
  • Modifying the leaf: test whether the closest
    leaf entry can absorb the new entry without
    violating the threshold. If not, add a new entry
    to the leaf; if there is no room, split the node
  • Modifying the path: update the CF information
    up the path.
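The three insertion steps can be sketched in Python as follows. This is a simplified illustration, not the presenter's or the paper's implementation: the `Node` class and function names are hypothetical, the threshold is applied to the radius of the merged entry (the diameter could be used instead), the centroid Euclidean distance is used for the descent, and a node split is only signaled to the caller rather than performed.

```python
import math

def cf_of(point):
    return (1, list(point), sum(x * x for x in point))

def cf_add(a, b):
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])

def cf_centroid(cf):
    return [x / cf[0] for x in cf[1]]

def cf_radius(cf):
    n, ls, ss = cf
    return math.sqrt(max(ss / n - sum((x / n) ** 2 for x in ls), 0.0))

class Node:
    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        self.entries = []    # CF triples
        self.children = []   # one child Node per entry (non-leaf nodes only)

def closest(entries, cf):
    c = cf_centroid(cf)
    return min(range(len(entries)),
               key=lambda i: math.dist(cf_centroid(entries[i]), c))

def insert(node, cf, threshold, max_leaf_entries):
    """Insert cf below node; return True if the leaf overflowed and needs a split."""
    if not node.is_leaf:
        i = closest(node.entries, cf)                   # step 1: descend
        overflow = insert(node.children[i], cf, threshold, max_leaf_entries)
        node.entries[i] = cf_add(node.entries[i], cf)   # step 3: update the path
        return overflow
    if node.entries:
        i = closest(node.entries, cf)
        merged = cf_add(node.entries[i], cf)
        if cf_radius(merged) <= threshold:              # step 2: absorb if allowed
            node.entries[i] = merged
            return False
    node.entries.append(cf)                             # otherwise start a new entry
    return len(node.entries) > max_leaf_entries

# A two-leaf tree built by hand for the demonstration.
leaf_a, leaf_b = Node(True), Node(True)
leaf_a.entries = [cf_of((0.0, 0.0))]
leaf_b.entries = [cf_of((10.0, 10.0))]
root = Node(False)
root.entries = [leaf_a.entries[0], leaf_b.entries[0]]
root.children = [leaf_a, leaf_b]

insert(root, cf_of((0.5, 0.0)), threshold=1.0, max_leaf_entries=3)    # absorbed
insert(root, cf_of((10.0, 14.0)), threshold=1.0, max_leaf_entries=3)  # new entry
```

Note that each insertion touches only one root-to-leaf path, so no clustering decision requires a scan of all data points.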

19
Example of the BIRCH Algorithm
[Figure: CF tree with Root and leaf nodes LN1, LN2, LN3 holding subclusters sc1 to sc8, one of which is marked as the new subcluster.]
20
Merge Operation in BIRCH
If the branching factor of a leaf node cannot
exceed 3, then LN1 is split
[Figure: subclusters sc1 to sc8; the Root now points to LN2, LN3, and the two leaf nodes produced by splitting LN1.]
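A leaf split such as the one on this slide can be sketched in Python. The slide does not say how entries are divided; the sketch assumes the scheme described in the BIRCH paper, where the farthest pair of entries is chosen as seeds and every other entry goes to the closer seed. Entries are represented only by their centroids, and the function name is illustrative.

```python
import math

def split_entries(centroids):
    """Return two lists of entry indices, seeded by the farthest pair of centroids."""
    pairs = [(i, j) for i in range(len(centroids))
             for j in range(i + 1, len(centroids))]
    a, b = max(pairs, key=lambda ij: math.dist(centroids[ij[0]], centroids[ij[1]]))
    left, right = [], []
    for k, c in enumerate(centroids):
        # Each entry joins the seed it is closer to; ties go to the first seed.
        if math.dist(c, centroids[a]) <= math.dist(c, centroids[b]):
            left.append(k)
        else:
            right.append(k)
    return left, right

# A leaf that holds four entries although at most three are allowed.
overfull = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (10.0, 9.0)]
left, right = split_entries(overfull)
print(left, right)
```

After the split the parent gains one entry, which is why a split can propagate upward and, at the root, increase the height of the tree.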
21
Merge Operation in BIRCH
If the branching factor of a non-leaf node cannot
exceed 3, then the root is split and the
height of the CF tree increases by one
[Figure: subclusters sc1 to sc8; a new Root points to non-leaf nodes NLN1 and NLN2, which point to the leaf nodes.]
22
Merge Operation in BIRCH
Assume that the subclusters are numbered
according to the order of formation
[Figure: subclusters sc1 to sc6 and a CF tree whose root points to leaf nodes LN1 and LN2.]
23
Merge Operation in BIRCH
If the branching factor of a leaf node cannot
exceed 3, then LN2 is split
[Figure: subclusters sc1 to sc6; the root points to LN1 and the leaf nodes produced by splitting LN2.]
24
Merge Operation in BIRCH
LN2 and LN1 will be merged, and the newly formed
node will be split immediately
[Figure: subclusters sc1 to sc6 and the CF tree after the merge and resplit, with leaf nodes labelled LN2 and LN3.]
25
Birch Clustering Algorithm (1)
  • Phase 1: Scan all data and build an initial
    in-memory CF tree.
  • Phase 2 (optional): Condense the tree into a
    desirable range by building a smaller CF tree.
  • Phase 3: Global clustering.
  • Phase 4 (optional): Cluster refining, which
    requires more passes over the data to refine the
    results.

26
Birch Clustering Algorithm (2)
27
Birch Phase 1
  • Start with an initial threshold and insert points
    into the tree
  • If memory runs out, increase the threshold value
    and rebuild a smaller tree by reinserting the
    entries of the old tree, then continue with the
    remaining points
  • A good initial threshold is important but hard to
    figure out
  • Outlier removal: outliers can be removed while
    the tree is rebuilt
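The rebuild loop of Phase 1 can be sketched in Python. This is a deliberately simplified illustration: a flat list of CF entries stands in for the CF tree, `max_entries` stands in for the memory limit, the threshold is applied to the radius, and the threshold is simply doubled on overflow (the `growth` parameter is an assumption; the paper uses a heuristic to choose the next value). Outlier handling is omitted.

```python
import math

def cf_add(a, b):
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])

def cf_centroid(cf):
    return [x / cf[0] for x in cf[1]]

def cf_radius(cf):
    n, ls, ss = cf
    return math.sqrt(max(ss / n - sum((x / n) ** 2 for x in ls), 0.0))

def absorb(entries, cf, threshold):
    """Merge cf into the closest entry if the merged radius stays within threshold."""
    if entries:
        c = cf_centroid(cf)
        i = min(range(len(entries)),
                key=lambda k: math.dist(cf_centroid(entries[k]), c))
        merged = cf_add(entries[i], cf)
        if cf_radius(merged) <= threshold:
            entries[i] = merged
            return
    entries.append(cf)

def phase1(points, threshold, max_entries, growth=2.0):
    if threshold <= 0 or growth <= 1 or max_entries < 1:
        raise ValueError("need threshold > 0, growth > 1, max_entries >= 1")
    entries = []
    for p in points:
        absorb(entries, (1, list(p), sum(x * x for x in p)), threshold)
        while len(entries) > max_entries:      # "out of memory"
            threshold *= growth                # coarser threshold
            old, entries = entries, []
            for cf in old:                     # reinsert summaries, not raw points
                absorb(entries, cf, threshold)
    return entries, threshold

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.1, 0.0)]
entries, final_threshold = phase1(points, threshold=0.01, max_entries=3)
print(len(entries), final_threshold)
```

The rebuild reinserts only the existing CF entries, so it never needs to reread data points that were already scanned.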

28
Birch Phase 2
  • Optional
  • The Phase 3 algorithm sometimes has an input
    size range in which it performs well, so Phase 2
    prepares the tree for Phase 3.
  • BIRCH applies a (selected) clustering algorithm
    to cluster the leaf nodes of the CF tree, which
    removes sparse clusters as outliers and groups
    dense clusters into larger ones.

29
Birch Phase 3
  • Problems after phase 1
  • Input order affects results
  • Splitting is triggered by node size, not by
    the natural structure of the data
  • Phase 3
  • Cluster all leaf entries by their CF values
    using an existing algorithm
  • Algorithm used here: agglomerative hierarchical
    clustering
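Global clustering over CF entries can be sketched in Python as below. This is a minimal illustration under stated assumptions: the leaf entries are given as (N, LS, SS) triples with a scalar SS, the centroid Euclidean distance is the merge criterion, and merging stops when `k` clusters remain. The function names and sample entries are hypothetical.

```python
import math

def cf_add(a, b):
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])

def cf_centroid(cf):
    return [x / cf[0] for x in cf[1]]

def agglomerate(entries, k):
    """Repeatedly merge the two entries with the closest centroids until k remain."""
    clusters = list(entries)
    while len(clusters) > k:
        i, j = min(
            ((a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda ab: math.dist(cf_centroid(clusters[ab[0]]),
                                     cf_centroid(clusters[ab[1]])),
        )
        merged = cf_add(clusters[i], clusters[j])   # CF additivity: no raw points needed
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    return clusters

# Four leaf entries: two near the origin, two near (10, 10).
leaf_entries = [
    (2, [0.1, 0.0], 0.01),
    (3, [3.0, 0.3], 3.03),
    (2, [20.0, 20.0], 400.0),
    (1, [11.0, 10.0], 221.0),
]
result = agglomerate(leaf_entries, k=2)
print([c[0] for c in result])
```

Because the input is the set of leaf entries rather than the data points, the quadratic cost of agglomerative clustering applies to a much smaller set.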

30
Birch Phase 4
  • Optional
  • Make additional passes over the dataset and
    reassign each data point to the closest centroid
    from phase 3
  • Recalculate the centroids and redistribute
    the items.
  • Always converges (no matter how many times phase
    4 is repeated)
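One refinement pass can be sketched in Python as follows. This is an illustrative sketch only: the seeds are assumed to be the cluster centroids produced by phase 3, and the function and parameter names are hypothetical.

```python
import math

def refine(points, seeds, passes=1):
    """Reassign points to the closest centroid, then recompute the centroids."""
    centroids = [list(s) for s in seeds]
    labels = []
    for _ in range(passes):
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if no point was assigned
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

points = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0), (10.0, 0.0)]
labels, centroids = refine(points, seeds=[(2.0, 0.0), (7.0, 0.0)])
print(labels, centroids)
```

Each pass rereads the full dataset, which is the extra I/O cost that makes this phase optional.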

31
Conclusions (1)
  • Birch performs faster than existing algorithms
    (CLARANS and KMEANS) on large datasets
  • Needs only a single scan of the data to produce
    a clustering (the optional phase 4 adds passes)
  • Handles outliers better
  • Superior to the compared algorithms in stability
    and scalability

32
Conclusions (2)
  • Since each node in a CF tree can hold only a
    limited number of entries due to its size, a CF
    tree node does not always correspond to what a
    user may consider a natural cluster. Moreover,
    if the clusters are not spherical in shape,
    BIRCH does not perform well, because it uses the
    notion of radius or diameter to control the
    boundary of a cluster

33
References
  • T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH:
    An efficient data clustering method for very
    large databases. SIGMOD '96.
  • Jan Oberst. Efficient Data Clustering and How to
    Groom Fast-Growing Trees.
  • Tan, Steinbach, Kumar. Introduction to Data
    Mining.