Clustering Large Datasets in Arbitrary Metric Space

1
Clustering Large Datasets in Arbitrary Metric Space
by Muralikrishna Achari
2
Contents
  • Introduction to Clustering
  • Problems in Traditional Clustering
  • Clustering Large Datasets
  • BIRCH
  • BUBBLE
  • BUBBLE-FM
  • Scalability
  • Conclusion

3
Traditional Clustering
  • Unsupervised learning.
  • A process of grouping similar objects into groups.
  • The distance between objects is used as a common
    metric to assess similarity.

4
Types of Clustering Algorithms
  • Hierarchical clustering
    e.g. Minimal Spanning Tree method, BIRCH, BUBBLE
  • Partition-based clustering
    e.g. k-means, CLARANS

5
Hierarchical Clustering
  • A crude division of instances into groups at the
    top level; each of these groups is then refined
    further, perhaps all the way down to the
    individual instances.

6
Partition based clustering
  • A desired number of clusters is assumed at the
    start, and instances are allocated among the
    clusters so that a particular clustering criterion
    is optimized (e.g. minimizing the variability
    within clusters), as in the sketch below.
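
The following is not from the original deck: a minimal sketch of the partition-based idea using a plain k-means loop. The function name, the fixed k, and the toy data are illustrative assumptions only.

```python
import random

def kmeans(points, k, iters=20):
    """Toy k-means: repeatedly assign each point to its closest center,
    then move each center to its cluster mean, reducing the
    within-cluster variability."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment step: closest center by squared Euclidean distance.
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: recompute each non-empty cluster's mean.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers, clusters

points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.1), (4.8, 5.3)]
centers, clusters = kmeans(points, k=2)
print(centers)
```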

7
Applications
  • Marketing: finding groups of customers with
    similar behavior.
  • Landscapes: characterizing different regions.
  • Biology: classification of plants and animals
    given their features.
  • Earthquake studies: clustering observed
    earthquake epicenters to identify dangerous
    zones.
  • WWW: document classification; clustering weblog
    data to discover groups of similar access
    patterns.

8
Problem with Traditional Clustering
  • Dealing with a large number of dimensions and a
    large number of data items can be problematic
    because of time complexity.

9
Requirements for a Good Clustering Algorithm
  • Scalability.
  • Dealing with different types of attributes.
  • Discovering clusters with arbitrary shape.
  • Minimal requirements for domain knowledge to
    determine input parameters.
  • Ability to deal with noise and outliers.

10
Clustering Large Datasets
11
Clustering Large Datasets
  • CLARANS
    • Assumes all objects fit in main memory; it is
      sensitive to the input order.
    • R*-trees have been used to improve its
      efficiency.
  • BIRCH
    • Minimizes memory usage and scans the data only
      once from disk.
    • Uses cluster representatives instead of the
      actual data points.
    • The first algorithm proposed in the database
      area that addresses outliers.
  • DBSCAN
    • Uses a density-based notion of clusters to
      discover clusters of arbitrary shape.
    • Sensitive to its input parameters and incurs
      substantial I/O cost.

12
Drawbacks
  • Both BIRCH and CLARANS work well for clusters
    with a spherical or convex shape and uniform size,
    and are unsuitable when clusters have different
    sizes or are non-spherical.
  • All three algorithms rely on vector operations
    that are only defined in a coordinate space, and
    are therefore unsuitable for datasets in a
    distance space.

13
Proposed Approach
  • Two algorithms for clustering large datasets, both
    based on the BIRCH framework:
  • BUBBLE
  • BUBBLE-FM

14
BIRCH
  • Balanced Iterative Reducing and Clustering using
    Hierarchies
  • BIRCH is a generalized framework for incremental
    clustering algorithms.
  • Its components can be instantiated to generate
    concrete clustering algorithms.

15
BIRCH Components
  • Cluster Feature (CF)
    A summarized representation of a cluster.
  • Cluster Feature tree (CF-tree)
    A height-balanced tree of CFs.

16
Clustering Feature
  • CFs are summarized representations of clusters.
  • Requirements:
  • Incrementally maintainable when a new object is
    inserted.
  • Contain sufficient information to compute
    distances between clusters and objects (see the
    sketch below).
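
A sketch, not from the slides, of how both requirements are met in BIRCH's original coordinate-space setting, where a CF is the triple (n, LS, SS); the class and method names are assumed for illustration (the metric-space analogue used by BUBBLE appears later).

```python
import math

class VectorCF:
    """Coordinate-space cluster feature CF = (n, LS, SS):
    n = number of points, LS = linear sum, SS = sum of squared norms."""
    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim
        self.ss = 0.0

    def insert(self, x):
        # Incremental maintenance: only the summary is updated.
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def centroid(self):
        return [v / self.n for v in self.ls]

    def distance_to(self, x):
        # Sufficient information: the centroid-to-object Euclidean
        # distance is computable from the summary alone.
        c = self.centroid()
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(c, x)))

cf = VectorCF(dim=2)
cf.insert([1.0, 2.0]); cf.insert([3.0, 4.0])
print(cf.centroid(), cf.distance_to([2.0, 3.0]))
```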

17
CF-Tree
  • A height-balanced tree.
  • Two parameters:
  • 1. Branching factor, B
  • 2. Threshold, T
  • A non-leaf node has at most B entries of the form
    [CF_i, child_i], i = 1..B, where
  • CF_i is the CF of the sub-cluster represented by
    the ith child, and
  • child_i is a pointer to the ith child node
    (a data-structure sketch follows).
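
A small assumed data-structure sketch of a node and its [CF_i, child_i] entries, with the branching factor B as a module-level constant:

```python
from dataclasses import dataclass, field
from typing import List, Optional

B = 4  # branching factor: a non-leaf node holds at most B entries

@dataclass
class Node:
    is_leaf: bool
    entries: List["Entry"] = field(default_factory=list)

@dataclass
class Entry:
    cf: object                    # summary (CF) of the sub-cluster
    child: Optional[Node] = None  # pointer to the ith child; None in a leaf

def is_full(node: Node) -> bool:
    """True when inserting one more entry would exceed the branching factor."""
    return len(node.entries) >= B
```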

18
CF-tree
  • Leaf node
  • Satisfies the threshold T, which controls the
    tightness and quality of the leaf clusters.
  • Diameter (or radius) of each leaf cluster < T.
  • The tree size is a function of T: as T increases,
    the tree size decreases.

19
CF Tree
20
Functionality of CF-tree
  • Directs a new object, O, to the cluster closest to
    it.
  • Non-leaf nodes exist to guide new objects to the
    appropriate leaf clusters.
  • A leaf node absorbs the new object.

21
BIRCH Mechanism
  • Starts with an initial threshold T.
  • Scans the data and inserts the objects into the
    tree.
  • During the scan, existing clusters are updated and
    new clusters are formed.
  • If it runs out of memory (limit M), it increases T
    and rebuilds a smaller CF-tree from the existing
    leaf entries.
  • After re-inserting the old leaf entries, it resumes
    scanning from the point at which it was
    interrupted (see the sketch below).
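
A simplified, flat (tree-less) sketch of this control loop, not the actual BIRCH algorithm: leaf summaries are kept in a plain list, a maximum list length stands in for the memory limit M, and the threshold is simply doubled on each rebuild. All names are assumptions.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class Summary:
    """Tiny leaf-cluster summary: object count and linear sum."""
    def __init__(self, point, n=1):
        self.n = n
        self.ls = [v * n for v in point]
    def centroid(self):
        return [v / self.n for v in self.ls]
    def absorb(self, point, n=1):
        self.n += n
        self.ls = [a + b * n for a, b in zip(self.ls, point)]

def insert(leaves, point, T, n=1):
    """Absorb into the closest summary if it lies within threshold T,
    otherwise start a new leaf cluster."""
    best = min(leaves, key=lambda s: dist(s.centroid(), point), default=None)
    if best is not None and dist(best.centroid(), point) <= T:
        best.absorb(point, n)
    else:
        leaves.append(Summary(point, n))

def cluster(stream, T, max_leaves):
    leaves = []
    for p in stream:
        insert(leaves, p, T)
        if len(leaves) > max_leaves:      # stand-in for "out of memory M"
            T *= 2                        # coarser threshold
            old, leaves = leaves, []
            for s in old:                 # re-insert the old leaf entries
                insert(leaves, s.centroid(), T, n=s.n)
    return leaves, T

leaves, final_T = cluster([(0, 0), (0.1, 0), (5, 5), (5.2, 4.9), (9, 9)],
                          T=0.5, max_leaves=3)
print(len(leaves), final_T)
```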

22
CF-tree insertion
  • The CF-tree insertion mechanism is the same as
    that of B+-trees.
  • Each new object, O
  • Reaches a leaf node, L.
  • Is inserted into the closest cluster C if the
    threshold T is not violated; otherwise it forms a
    new cluster.
  • If there is not enough space in L, the node is
    split into two leaf nodes and the entries are
    distributed between them.
  • As in a B+-tree, node splits may propagate up to
    the root.
  • The path from the root to the leaf is updated to
    reflect the insertion (see the split sketch
    below).
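
A self-contained sketch of the node-split step only, under two assumptions that are mine rather than the paper's: entries are represented here by their centroids, and the split seeds are the two farthest-apart entries (a common CF-tree splitting policy). In the real tree the split may then propagate upward as described above.

```python
import math

def cdist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def split_entries(entries):
    """Split an overfull node's entries into two groups: pick the two
    entries farthest apart as seeds, then attach every other entry to
    the closer seed."""
    scores = [(cdist(a, b), i, j)
              for i, a in enumerate(entries)
              for j, b in enumerate(entries) if i < j]
    _, i, j = max(scores)
    group1, group2 = [entries[i]], [entries[j]]
    for k, e in enumerate(entries):
        if k in (i, j):
            continue
        if cdist(e, entries[i]) <= cdist(e, entries[j]):
            group1.append(e)
        else:
            group2.append(e)
    return group1, group2

# Example: five entry centroids that no longer fit in one node.
print(split_entries([(0, 0), (0, 1), (1, 0), (9, 9), (8, 9)]))
```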

23
BIRCH Instantiation Summary
  • Cluster features at the leaf and non-leaf levels.
  • Incremental maintenance of cluster features at the
    leaf and non-leaf nodes.
  • Distance measures between a CF and an object, and
    between CFs.
  • Threshold requirement.

24
BUBBLE
  • BIRCH instantiated in a distance space.
  • No concept of centroids.
  • For a given set of objects O = {O_1, ..., O_n} it
    defines:
  • RowSum(O) = Σ_{i=1..n} d(O, O_i)² for an object
    O in O.
  • Clustroid Ô: the object in O with the least
    RowSum value.
  • Radius r(O) = sqrt(RowSum(Ô) / n).
  • Clustroid distance D0(O_1, O_2) = d(Ô_1, Ô_2),
    the distance between the two clusters' clustroids
    (transcribed into code below).
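
Not part of the original slides: a direct, brute-force transcription of these definitions in Python for any user-supplied distance function d. The function names are mine; the O(n²) cost of recomputing RowSum values from scratch is exactly what the incremental maintenance described later avoids.

```python
import math

def rowsum(o, objects, d):
    """RowSum(o): sum of squared distances from o to every object in the set."""
    return sum(d(o, x) ** 2 for x in objects)

def clustroid(objects, d):
    """The clustroid is the object with the least RowSum value."""
    return min(objects, key=lambda o: rowsum(o, objects, d))

def radius(objects, d):
    """r(O) = sqrt(RowSum(clustroid) / n)."""
    c = clustroid(objects, d)
    return math.sqrt(rowsum(c, objects, d) / len(objects))

def cluster_distance(objects1, objects2, d):
    """D0: the distance between the two clusters' clustroids."""
    return d(clustroid(objects1, d), clustroid(objects2, d))

# Works with any metric, e.g. absolute difference on numbers here;
# an edit distance on strings would plug in the same way.
d = lambda a, b: abs(a - b)
print(clustroid([1, 2, 3, 10], d), round(radius([1, 2, 3, 10], d), 3))
```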

25
BUBBLE CF at leaf nodes
  • For a set of objects O = {O_1, ..., O_n} forming a
    cluster C, the CF is a five-tuple (n, Ô, R, RS, r):
  • n: number of objects in C.
  • Ô: clustroid of C.
  • R: representative objects of the cluster C
    (R ⊆ C).
  • RS: the RowSum values of the representatives.
  • r: radius of the cluster C (see the dataclass
    sketch below).
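
The five-tuple written out as an assumed Python dataclass, purely for illustration:

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class LeafCF:
    """Leaf-level cluster feature (n, Ô, R, RS, r) as listed above."""
    n: int                      # number of objects in the cluster C
    clustroid: Any              # Ô: the object with the least RowSum
    reps: List[Any]             # R: representative objects, a subset of C
    rep_rowsums: List[float]    # RS: RowSum value of each representative
    radius: float               # r: radius of the cluster C
```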

26
BUBBLE CF at non-leaf node
  • A set of sample objects, S(NL_i), randomly
    collected from the subtree rooted at the ith
    child, forms the entry CF_i of a non-leaf node NL.
  • The sample set of NL is the union of its entries'
    samples: S(NL) = ∪_i S(NL_i).
  • Each child node has at least one representative in
    S(NL).
  • If the ith child is a leaf node, the objects in
    S(NL_i) are randomly picked from the clustroids of
    its CF entries.

27
BUBBLE Incremental Maintenance of CF at leaf
  • Types of insertions:
  • Type I: insertion of a single object.
  • Type II: insertion of a cluster of objects.

28
Type I Insertion
  • Inserting a single object into a leaf cluster C:
  • If C is small, maintain all the cluster's objects
    and compute the new clustroid exactly.
  • If C is large, maintain only a subset of C (the
    representative set R) consisting of objects close
    to the clustroid, and update the clustroid from it
    (see the sketch below).
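
A deliberately simplified sketch of this policy, not the paper's exact update rules: the cutoff for "small", the number of representatives kept, and the recomputation over the representative set are all assumptions.

```python
MAX_EXACT = 50   # assumed cutoff below which a cluster counts as "small"
NUM_REPS = 10    # assumed number of representatives kept for large clusters

def clustroid(objs, d):
    # Object with the least sum of squared distances to the others.
    return min(objs, key=lambda o: sum(d(o, x) ** 2 for x in objs))

def insert_object(cluster_objects, reps, new_obj, d):
    """Simplified Type I maintenance: small clusters keep every object and
    recompute the clustroid exactly; large clusters recompute it only over
    the kept representatives plus the new object."""
    if len(cluster_objects) < MAX_EXACT:
        cluster_objects.append(new_obj)
        c = clustroid(cluster_objects, d)
        pool = cluster_objects
    else:
        pool = reps + [new_obj]
        c = clustroid(pool, d)
    # Keep the NUM_REPS objects closest to the (new) clustroid as R.
    reps[:] = sorted(pool, key=lambda o: d(o, c))[:NUM_REPS]
    return c
```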

29
Type II Insertion
  • Inserting a cluster of objects:
  • C_1 and C_2 must be non-overlapping but close
    clusters.
  • The location of the new clustroid lies between the
    two old clustroids.
  • By additionally maintaining a few objects far away
    from the clustroids of C_1 and C_2, the new
    clustroid can be calculated.

30
Incremental Maintenance of CF at non leaf
  • The sample objects at a non-leaf entry are updated
    whenever its child node splits.
  • The distribution of clusters changes significantly
    whenever a node splits.
  • To reflect the changed distribution at all child
    nodes, the sample objects at all entries of NL are
    updated.

31
Drawbacks of BUBBLE
  • BUBBLE computes distances between sample objects,
    which can be expensive.
  • E.g. edit distance on strings.

32
BUBBLE-FM
  • Uses FastMap to transform the distance space into
    an approximate vector space.
  • Maintains the sample objects at each non-leaf node
    as vectors in that space.
  • For a new object O, transforms O into the vector
    space and uses the Euclidean distance metric
    there.
  • Does not use the transformation at the leaf nodes
    (see the FastMap sketch below).
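
A compact sketch of the FastMap projection this relies on (Faloutsos and Lin's cosine-law formula); the pivot-selection heuristic is simplified and all names, as well as the toy stand-in metric in the example, are assumptions. Once the sample objects have k-dimensional images, cheap Euclidean distances replace calls to the expensive metric.

```python
import math, random

def fastmap(objects, d, k, seed=0):
    """Map objects of an arbitrary metric space into k-dimensional
    Euclidean coordinates that approximately preserve d."""
    rng = random.Random(seed)
    coords = {o: [] for o in objects}

    def current_d(a, b, dim):
        # Residual distance after removing the first `dim` coordinates.
        d2 = d(a, b) ** 2 - sum((coords[a][t] - coords[b][t]) ** 2
                                for t in range(dim))
        return math.sqrt(max(d2, 0.0))

    for dim in range(k):
        # Simplified pivot choice: start from a random object, take the
        # farthest object as pivot a, then the farthest from a as pivot b.
        o = rng.choice(objects)
        a = max(objects, key=lambda x: current_d(o, x, dim))
        b = max(objects, key=lambda x: current_d(a, x, dim))
        dab = current_d(a, b, dim)
        for x in objects:
            if dab == 0.0:
                coords[x].append(0.0)
            else:
                # Cosine-law projection onto the line through the pivots.
                xi = (current_d(a, x, dim) ** 2 + dab ** 2
                      - current_d(b, x, dim) ** 2) / (2 * dab)
                coords[x].append(xi)
    return coords

# Example: strings under a crude length-difference metric, standing in
# for an expensive edit distance.
objs = ["cat", "cart", "dog", "dodge"]
images = fastmap(objs, lambda a, b: abs(len(a) - len(b)), k=2)
euclid = lambda u, v: math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))
print(euclid(images["cat"], images["dog"]))
```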

33
Scalability
34
Scalability
35
Conclusion
  • Presented the BIRCH framework for scalable,
    incremental pre-clustering algorithms.
  • BUBBLE for datasets in an arbitrary metric space.
  • BUBBLE-FM additionally uses FastMap to reduce the
    number of calls to an expensive distance function.

36
References
  • Primary source
  • Clustering Large Datasets in Arbitrary Metric
    Spaces (1999), Venkatesh Ganti, Raghu Ramakrishnan,
    Johannes Gehrke.
  • Secondary sources
  • BIRCH: An Efficient Data Clustering Method for
    Very Large Databases (1996), Tian Zhang, Raghu
    Ramakrishnan, Miron Livny.
  • CURE: An Efficient Clustering Algorithm for Large
    Databases (1998), Sudipto Guha, Rajeev Rastogi,
    Kyuseok Shim.