Clustering Large Datasets in Arbitrary Metric Space

1
Clustering Large Datasets in Arbitrary Metric Space
by Muralikrishna Achari
2
Contents
  • Introduction to Clustering
  • Problems in Traditional Clustering
  • Clustering Large Datasets
  • BIRCH
  • BUBBLE
  • BUBBLE-FM
  • Scalability
  • Conclusion

3
Traditional Clustering
  • Unsupervised learning.
  • A process of grouping similar objects into groups.
  • The distance between objects is used as a common
    metric to assess similarity.

4
Types of Clustering Algorithms
  • Hierarchical clustering
    e.g. Minimal Spanning Tree method, BIRCH, BUBBLE
  • Partition-based clustering
    e.g. k-means, CLARANS

5
Hierarchical Clustering
  • A crude division of instances into groups at the
    top level; each of these groups is then refined
    further, perhaps all the way down to the
    individual instances.

6
Partition based clustering
  • A desired number of clusters is assumed at the
    start, and instances are allocated among the
    clusters so that a particular clustering criterion
    is optimized (e.g. minimizing the variability
    within clusters), as in the sketch below.
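
The following is not from the original deck: a minimal sketch of the partition-based idea using a plain k-means loop. The function name, the fixed k, and the toy data are illustrative assumptions only.

```python
import random

def kmeans(points, k, iters=20):
    """Toy k-means: repeatedly assign each point to its closest center,
    then move each center to its cluster mean, reducing the
    within-cluster variability."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment step: closest center by squared Euclidean distance.
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Update step: recompute each non-empty cluster's mean.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers, clusters

points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.1), (4.8, 5.3)]
centers, clusters = kmeans(points, k=2)
print(centers)
```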

7
Applications
  • Marketing: finding groups of customers with
    similar behavior.
  • Landscapes: characterizing different regions.
  • Biology: classification of plants and animals
    given their features.
  • Earthquake studies: clustering observed
    earthquake epicenters to identify dangerous
    zones.
  • WWW: document classification; clustering weblog
    data to discover groups of similar access
    patterns.

8
Problem with Traditional Clustering
  • Dealing with a large number of dimensions and a
    large number of data items can be problematic
    because of time complexity.

9
Requirements for a Good Clustering Algorithm
  • Scalability.
  • Dealing with different types of attributes.
  • Discovering clusters with arbitrary shape.
  • Minimal requirements for domain knowledge to
    determine input parameters.
  • Ability to deal with noise and outliers.

10
Clustering Large Datasets
11
Clustering Large Datasets
  • CLARANS
    • Assumes all objects fit in main memory; it is
      sensitive to the input order.
    • R*-trees have been used to improve its
      efficiency.
  • BIRCH
    • Minimizes memory usage and scans the data only
      once from disk.
    • Uses cluster representatives instead of the
      actual data points.
    • The first algorithm proposed in the database
      area that addresses outliers.
  • DBSCAN
    • Uses a density-based notion of clusters to
      discover clusters of arbitrary shape.
    • Sensitive to its input parameters and incurs
      substantial I/O cost.

12
Drawbacks
  • Both BIRCH and CLARANS work well for clusters
    with a spherical or convex shape and uniform size,
    and are unsuitable when clusters have different
    sizes or are non-spherical.
  • All three algorithms rely on vector operations
    that are only defined in a coordinate space, and
    are therefore unsuitable for datasets in a
    distance space.

13
Proposed Approach
  • Two algorithms for clustering large datasets, both
    based on the BIRCH framework:
  • BUBBLE
  • BUBBLE-FM

14
BIRCH
  • Balanced Iterative Reducing and Clustering using
    Hierarchies
  • BIRCH is a generalized framework for incremental
    clustering algorithms.
  • Its components can be instantiated to generate
    concrete clustering algorithms.

15
BIRCH Components
  • Cluster Feature (CF)
    A summarized representation of a cluster.
  • Cluster Feature tree (CF-tree)
    A height-balanced tree of CFs.

16
Clustering Feature
  • CFs are summarized representations of clusters.
  • Requirements:
  • Incrementally maintainable when a new object is
    inserted.
  • Contain sufficient information to compute
    distances between clusters and objects (see the
    sketch below).
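
A sketch, not from the slides, of how both requirements are met in BIRCH's original coordinate-space setting, where a CF is the triple (n, LS, SS); the class and method names are assumed for illustration (the metric-space analogue used by BUBBLE appears later).

```python
import math

class VectorCF:
    """Coordinate-space cluster feature CF = (n, LS, SS):
    n = number of points, LS = linear sum, SS = sum of squared norms."""
    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim
        self.ss = 0.0

    def insert(self, x):
        # Incremental maintenance: only the summary is updated.
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def centroid(self):
        return [v / self.n for v in self.ls]

    def distance_to(self, x):
        # Sufficient information: the centroid-to-object Euclidean
        # distance is computable from the summary alone.
        c = self.centroid()
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(c, x)))

cf = VectorCF(dim=2)
cf.insert([1.0, 2.0]); cf.insert([3.0, 4.0])
print(cf.centroid(), cf.distance_to([2.0, 3.0]))
```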

17
CF-Tree
  • A height-balanced tree.
  • Two parameters:
  • 1. Branching factor, B
  • 2. Threshold, T
  • A non-leaf node has at most B entries of the form
    [CF_i, child_i], i = 1..B, where
  • CF_i is the CF of the sub-cluster represented by
    the ith child, and
  • child_i is a pointer to the ith child node
    (a data-structure sketch follows).
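
A small assumed data-structure sketch of a node and its [CF_i, child_i] entries, with the branching factor B as a module-level constant:

```python
from dataclasses import dataclass, field
from typing import List, Optional

B = 4  # branching factor: a non-leaf node holds at most B entries

@dataclass
class Node:
    is_leaf: bool
    entries: List["Entry"] = field(default_factory=list)

@dataclass
class Entry:
    cf: object                    # summary (CF) of the sub-cluster
    child: Optional[Node] = None  # pointer to the ith child; None in a leaf

def is_full(node: Node) -> bool:
    """True when inserting one more entry would exceed the branching factor."""
    return len(node.entries) >= B
```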

18
CF-tree
  • Leaf node
  • Satisfies the threshold T, which controls the
    tightness and quality of the leaf clusters.
  • Diameter (or radius) of each leaf cluster < T.
  • The tree size is a function of T: as T increases,
    the tree size decreases.

19
CF Tree
20
Functionality of CF-tree
  • Directs a new object, O, to the cluster closest to
    it.
  • Non-leaf nodes exist to guide new objects to the
    appropriate leaf clusters.
  • A leaf node absorbs the new object.

21
BIRCH Mechanism
  • Starts with an initial threshold T.
  • Scans the data and inserts the objects into the
    tree.
  • During the scan, existing clusters are updated and
    new clusters are formed.
  • If it runs out of memory (limit M), it increases T
    and rebuilds a smaller CF-tree from the existing
    leaf entries.
  • After re-inserting the old leaf entries, it resumes
    scanning from the point at which it was
    interrupted (see the sketch below).
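
A simplified, flat (tree-less) sketch of this control loop, not the actual BIRCH algorithm: leaf summaries are kept in a plain list, a maximum list length stands in for the memory limit M, and the threshold is simply doubled on each rebuild. All names are assumptions.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class Summary:
    """Tiny leaf-cluster summary: object count and linear sum."""
    def __init__(self, point, n=1):
        self.n = n
        self.ls = [v * n for v in point]
    def centroid(self):
        return [v / self.n for v in self.ls]
    def absorb(self, point, n=1):
        self.n += n
        self.ls = [a + b * n for a, b in zip(self.ls, point)]

def insert(leaves, point, T, n=1):
    """Absorb into the closest summary if it lies within threshold T,
    otherwise start a new leaf cluster."""
    best = min(leaves, key=lambda s: dist(s.centroid(), point), default=None)
    if best is not None and dist(best.centroid(), point) <= T:
        best.absorb(point, n)
    else:
        leaves.append(Summary(point, n))

def cluster(stream, T, max_leaves):
    leaves = []
    for p in stream:
        insert(leaves, p, T)
        if len(leaves) > max_leaves:      # stand-in for "out of memory M"
            T *= 2                        # coarser threshold
            old, leaves = leaves, []
            for s in old:                 # re-insert the old leaf entries
                insert(leaves, s.centroid(), T, n=s.n)
    return leaves, T

leaves, final_T = cluster([(0, 0), (0.1, 0), (5, 5), (5.2, 4.9), (9, 9)],
                          T=0.5, max_leaves=3)
print(len(leaves), final_T)
```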

22
CF-tree insertion
  • The CF-tree insertion mechanism is the same as
    that of B+-trees.
  • Each new object, O
  • Reaches a leaf node, L.
  • Is inserted into the closest cluster C if the
    threshold T is not violated; otherwise it forms a
    new cluster.
  • If there is not enough space in L, the node is
    split into two leaf nodes and the entries are
    distributed between them.
  • As in a B+-tree, node splits may propagate up to
    the root.
  • The path from the root to the leaf is updated to
    reflect the insertion (see the split sketch
    below).
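
A self-contained sketch of the node-split step only, under two assumptions that are mine rather than the paper's: entries are represented here by their centroids, and the split seeds are the two farthest-apart entries (a common CF-tree splitting policy). In the real tree the split may then propagate upward as described above.

```python
import math

def cdist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def split_entries(entries):
    """Split an overfull node's entries into two groups: pick the two
    entries farthest apart as seeds, then attach every other entry to
    the closer seed."""
    scores = [(cdist(a, b), i, j)
              for i, a in enumerate(entries)
              for j, b in enumerate(entries) if i < j]
    _, i, j = max(scores)
    group1, group2 = [entries[i]], [entries[j]]
    for k, e in enumerate(entries):
        if k in (i, j):
            continue
        if cdist(e, entries[i]) <= cdist(e, entries[j]):
            group1.append(e)
        else:
            group2.append(e)
    return group1, group2

# Example: five entry centroids that no longer fit in one node.
print(split_entries([(0, 0), (0, 1), (1, 0), (9, 9), (8, 9)]))
```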

23
BIRCH Instantiation Summary
  • Cluster features at the leaf and non-leaf levels.
  • Incremental maintenance of cluster features at the
    leaf and non-leaf nodes.
  • Distance measures between a CF and an object, and
    between CFs.
  • Threshold requirement.

24
BUBBLE
  • BIRCH instantiated in a distance space.
  • No concept of centroids.
  • For a given set of objects O = {O_1, ..., O_n} it
    defines:
  • RowSum(O) = Σ_{i=1..n} d(O, O_i)² for an object
    O in O.
  • Clustroid Ô: the object in O with the least
    RowSum value.
  • Radius r(O) = sqrt(RowSum(Ô) / n).
  • Clustroid distance D0(O_1, O_2) = d(Ô_1, Ô_2),
    the distance between the two clusters' clustroids
    (transcribed into code below).
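
Not part of the original slides: a direct, brute-force transcription of these definitions in Python for any user-supplied distance function d. The function names are mine; the O(n²) cost of recomputing RowSum values from scratch is exactly what the incremental maintenance described later avoids.

```python
import math

def rowsum(o, objects, d):
    """RowSum(o): sum of squared distances from o to every object in the set."""
    return sum(d(o, x) ** 2 for x in objects)

def clustroid(objects, d):
    """The clustroid is the object with the least RowSum value."""
    return min(objects, key=lambda o: rowsum(o, objects, d))

def radius(objects, d):
    """r(O) = sqrt(RowSum(clustroid) / n)."""
    c = clustroid(objects, d)
    return math.sqrt(rowsum(c, objects, d) / len(objects))

def cluster_distance(objects1, objects2, d):
    """D0: the distance between the two clusters' clustroids."""
    return d(clustroid(objects1, d), clustroid(objects2, d))

# Works with any metric, e.g. absolute difference on numbers here;
# an edit distance on strings would plug in the same way.
d = lambda a, b: abs(a - b)
print(clustroid([1, 2, 3, 10], d), round(radius([1, 2, 3, 10], d), 3))
```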

25
BUBBLE CF at leaf nodes
  • For a set of objects O = {O_1, ..., O_n} forming a
    cluster C, the CF is a five-tuple (n, Ô, R, RS, r):
  • n: number of objects in C.
  • Ô: clustroid of C.
  • R: representative objects of the cluster C
    (R ⊆ C).
  • RS: the RowSum values of the representatives.
  • r: radius of the cluster C (see the dataclass
    sketch below).
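
The five-tuple written out as an assumed Python dataclass, purely for illustration:

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class LeafCF:
    """Leaf-level cluster feature (n, Ô, R, RS, r) as listed above."""
    n: int                      # number of objects in the cluster C
    clustroid: Any              # Ô: the object with the least RowSum
    reps: List[Any]             # R: representative objects, a subset of C
    rep_rowsums: List[float]    # RS: RowSum value of each representative
    radius: float               # r: radius of the cluster C
```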

26
BUBBLE CF at non-leaf node
  • A set of sample objects, S(NL_i), randomly
    collected from the subtree rooted at the ith
    child, forms the entry CF_i of a non-leaf node NL.
  • The sample set of NL is the union of its entries'
    samples: S(NL) = ∪_i S(NL_i).
  • Each child node has at least one representative in
    S(NL).
  • If the ith child is a leaf node, the objects in
    S(NL_i) are randomly picked from the clustroids of
    its CF entries.

27
BUBBLE Incremental Maintenance of CF at leaf
  • Types of insertions:
  • Type I: insertion of a single object.
  • Type II: insertion of a cluster of objects.

28
Type I Insertion
  • Inserting a single object into a leaf cluster C:
  • If C is small, maintain all the cluster's objects
    and compute the new clustroid exactly.
  • If C is large, maintain only a subset of C (the
    representative set R) consisting of objects close
    to the clustroid, and update the clustroid from it
    (see the sketch below).
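
A deliberately simplified sketch of this policy, not the paper's exact update rules: the cutoff for "small", the number of representatives kept, and the recomputation over the representative set are all assumptions.

```python
MAX_EXACT = 50   # assumed cutoff below which a cluster counts as "small"
NUM_REPS = 10    # assumed number of representatives kept for large clusters

def clustroid(objs, d):
    # Object with the least sum of squared distances to the others.
    return min(objs, key=lambda o: sum(d(o, x) ** 2 for x in objs))

def insert_object(cluster_objects, reps, new_obj, d):
    """Simplified Type I maintenance: small clusters keep every object and
    recompute the clustroid exactly; large clusters recompute it only over
    the kept representatives plus the new object."""
    if len(cluster_objects) < MAX_EXACT:
        cluster_objects.append(new_obj)
        c = clustroid(cluster_objects, d)
        pool = cluster_objects
    else:
        pool = reps + [new_obj]
        c = clustroid(pool, d)
    # Keep the NUM_REPS objects closest to the (new) clustroid as R.
    reps[:] = sorted(pool, key=lambda o: d(o, c))[:NUM_REPS]
    return c
```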

29
Type II Insertion
  • Inserting a cluster of objects:
  • C_1 and C_2 must be non-overlapping but close
    clusters.
  • The location of the new clustroid lies between the
    two old clustroids.
  • By additionally maintaining a few objects far away
    from the clustroids of C_1 and C_2, the new
    clustroid can be calculated.

30
Incremental Maintenance of CF at non leaf
  • The sample objects at a non-leaf entry are updated
    whenever its child node splits.
  • The distribution of clusters changes significantly
    whenever a node splits.
  • To reflect the changed distribution at all child
    nodes, the sample objects at all entries of NL are
    updated.

31
Drawbacks of BUBBLE
  • BUBBLE computes distances between sample objects,
    which can be expensive.
  • E.g. edit distance on strings.

32
BUBBLE-FM
  • Uses FastMap to transform the distance space into
    an approximate vector space.
  • Maintains the sample objects at each non-leaf node
    as vectors in that space.
  • For a new object O, transforms O into the vector
    space and uses the Euclidean distance metric
    there.
  • Does not use the transformation at the leaf nodes
    (see the FastMap sketch below).
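
A compact sketch of the FastMap projection this relies on (Faloutsos and Lin's cosine-law formula); the pivot-selection heuristic is simplified and all names, as well as the toy stand-in metric in the example, are assumptions. Once the sample objects have k-dimensional images, cheap Euclidean distances replace calls to the expensive metric.

```python
import math, random

def fastmap(objects, d, k, seed=0):
    """Map objects of an arbitrary metric space into k-dimensional
    Euclidean coordinates that approximately preserve d."""
    rng = random.Random(seed)
    coords = {o: [] for o in objects}

    def current_d(a, b, dim):
        # Residual distance after removing the first `dim` coordinates.
        d2 = d(a, b) ** 2 - sum((coords[a][t] - coords[b][t]) ** 2
                                for t in range(dim))
        return math.sqrt(max(d2, 0.0))

    for dim in range(k):
        # Simplified pivot choice: start from a random object, take the
        # farthest object as pivot a, then the farthest from a as pivot b.
        o = rng.choice(objects)
        a = max(objects, key=lambda x: current_d(o, x, dim))
        b = max(objects, key=lambda x: current_d(a, x, dim))
        dab = current_d(a, b, dim)
        for x in objects:
            if dab == 0.0:
                coords[x].append(0.0)
            else:
                # Cosine-law projection onto the line through the pivots.
                xi = (current_d(a, x, dim) ** 2 + dab ** 2
                      - current_d(b, x, dim) ** 2) / (2 * dab)
                coords[x].append(xi)
    return coords

# Example: strings under a crude length-difference metric, standing in
# for an expensive edit distance.
objs = ["cat", "cart", "dog", "dodge"]
images = fastmap(objs, lambda a, b: abs(len(a) - len(b)), k=2)
euclid = lambda u, v: math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))
print(euclid(images["cat"], images["dog"]))
```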

33
Scalability
34
Scalability
35
Conclusion
  • Presented the BIRCH framework for scalable,
    incremental pre-clustering algorithms.
  • BUBBLE for datasets in an arbitrary metric space.
  • BUBBLE-FM additionally uses FastMap to reduce the
    number of calls to an expensive distance function.

36
References
  • Primary source
  • Clustering Large Datasets in Arbitrary Metric
    Spaces (1999), Venkatesh Ganti, Raghu Ramakrishnan,
    Johannes Gehrke.
  • Secondary sources
  • BIRCH: An Efficient Data Clustering Method for
    Very Large Databases (1996), Tian Zhang, Raghu
    Ramakrishnan, Miron Livny.
  • CURE: An Efficient Clustering Algorithm for Large
    Databases (1998), Sudipto Guha, Rajeev Rastogi,
    Kyuseok Shim.