Title: CSE 634 Data Mining Techniques
1 CSE 634 Data Mining Techniques
- CLUSTERING - Part 2 (Group no. 1)
- By Anushree Shibani Shivaprakash and Fatima Zarinni
- Spring 2006
- Professor Anita Wasilewska, SUNY Stony Brook
2 References
- Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8). Morgan Kaufmann, 2002.
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96. http://ifsc.ualr.edu/xwxu/publications/kdd-96.pdf
- How to explain hierarchical clustering. http://www.analytictech.com/networks/hiclus.htm
- Tian Zhang, Raghu Ramakrishnan, Miron Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.
- Margaret H. Dunham. Data Mining: Introductory and Advanced Topics.
- http://cs.sunysb.edu/cse634/, Presentation 9: Cluster Analysis
3 Introduction
- Major clustering methods:
- Partitioning methods
- Hierarchical methods
- Density-based methods
- Grid-based methods
4 Hierarchical methods
- Here we group data objects into a tree of clusters.
- There are two types of hierarchical clustering:
- Agglomerative hierarchical clustering
- Divisive hierarchical clustering
5 Agglomerative hierarchical clustering
- Groups data objects in a bottom-up fashion.
- Initially each data object is in its own cluster.
- We then merge these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied.
- A user can specify the desired number of clusters as a termination condition.
6 Divisive hierarchical clustering
- Groups data objects in a top-down fashion.
- Initially all data objects are in one cluster.
- We then subdivide the cluster into smaller and smaller clusters, until each object forms a cluster on its own or a termination condition is satisfied, such as reaching a desired number of clusters.
7 AGNES and DIANA
- Application of AGNES (AGglomerative NESting) and DIANA (DIvisive ANAlysis) to a data set of five objects: a, b, c, d, e.
8 AGNES Explored
- Given a set of N items to be clustered, and an N x N distance (or similarity) matrix, the basic process of Johnson's (1967) hierarchical clustering is as follows.
- Step 1: Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
- Step 2: Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one less cluster.
9 AGNES Explored (continued)
- Step 3: Compute the distances (similarities) between the new cluster and each of the old clusters.
- Step 4: Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
- Step 3 can be done in different ways, and this is what distinguishes single-link from complete-link and average-link clustering.
10 Similarity/Distance metrics
- Single-link clustering: the distance between two clusters is the shortest distance from any member of one cluster to any member of the other cluster.
- Complete-link clustering: the distance is the longest distance from any member of one cluster to any member of the other cluster.
- Average-link clustering: the distance is the average distance from any member of one cluster to any member of the other cluster.
- (A minimal code sketch of agglomerative clustering under these three linkage choices follows below.)
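To make the linkage choices concrete, here is a minimal Python sketch of the agglomerative (AGNES-style) procedure described above. The function names, the naive O(N^3) search, and the use of Euclidean distance are our own assumptions for illustration, not code from the referenced sources.

```python
import numpy as np

def agnes(points, k, linkage="single"):
    """Naive agglomerative (AGNES-style) clustering.

    points: (N, d) array of data objects.
    k: desired number of clusters (termination condition).
    linkage: 'single', 'complete', or 'average'.
    Returns a list of clusters, each a list of point indices.
    """
    points = np.asarray(points, dtype=float)
    # Step 1: every item starts in its own cluster.
    clusters = [[i] for i in range(len(points))]
    # Pairwise distances between the original items.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

    def cluster_distance(a, b):
        # Distances from any member of one cluster to any member of the other.
        d = dist[np.ix_(a, b)]
        if linkage == "single":      # shortest distance
            return d.min()
        if linkage == "complete":    # longest distance
            return d.max()
        return d.mean()              # average distance

    while len(clusters) > k:
        # Step 2: find the closest pair of clusters and merge them.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        # Step 3: the merged cluster replaces the two old ones; distances to the
        # remaining clusters are recomputed on demand by cluster_distance.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters
```

Calling, e.g., agnes(points, k=3, linkage="complete") returns the three complete-link clusters as lists of point indices.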
11-15 Single Linkage Hierarchical Clustering (shown step by step over five slides)
- Start: every point is its own cluster.
- Find the most similar pair of clusters.
- Merge them into a parent cluster.
- Repeat until the desired number of clusters remains.
- (A usage example with SciPy's hierarchical clustering routines follows below.)
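In practice the merging loop is rarely hand-coded. Assuming SciPy is available, scipy.cluster.hierarchy performs the same single-link (or complete/average-link) merging and can cut the resulting tree at a desired number of clusters; the five points below are illustrative values only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five 2-D points standing in for objects a, b, c, d, e (illustrative values).
X = np.array([[0.0, 0.0], [0.5, 0.2], [3.0, 3.0], [3.2, 2.8], [6.0, 0.5]])

Z = linkage(X, method="single")                   # 'complete' or 'average' also work
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
```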
16 DIANA (DIvisive ANAlysis)
- Introduced in Kaufman and Rousseeuw (1990).
- Works in the inverse order of AGNES.
- Eventually each node forms a cluster on its own.
17 Overview
- Divisive clustering starts by placing all objects into a single group. Before we start the procedure, we need to decide on a threshold distance.
- The procedure is as follows:
- Step 1: The distance between all pairs of objects within the same group is determined, and the pair with the largest distance is selected.
18 Overview (continued)
- Step 2: This maximum distance is compared to the threshold distance.
- If it is larger than the threshold, the group is divided in two. This is done by placing the selected pair into different groups and using them as seed points. All other objects in this group are examined and placed into the new group with the closest seed point. The procedure then returns to Step 1.
- If the distance between the selected objects is less than the threshold, the divisive clustering stops.
- To run a divisive clustering, you simply need to decide upon a method of measuring the distance between two objects. (A minimal sketch of this threshold-based procedure follows below.)
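The following is a minimal Python sketch of the threshold-based procedure just described, not the original DIANA implementation; the function name, the stack-based control flow, and the use of Euclidean distance are our own assumptions.

```python
import numpy as np

def divisive_clustering(points, threshold):
    """Split groups recursively until no within-group pair exceeds `threshold`.

    points: (N, d) array; returns a list of groups of point indices.
    """
    points = np.asarray(points, dtype=float)
    groups = [list(range(len(points)))]
    finished = []
    while groups:
        group = groups.pop()
        if len(group) < 2:
            finished.append(group)
            continue
        # Step 1: find the pair with the largest distance inside the group.
        sub = points[group]
        d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
        i, j = np.unravel_index(np.argmax(d), d.shape)
        # Step 2: compare the maximum distance with the threshold.
        if d[i, j] <= threshold:
            finished.append(group)            # splitting stops for this group
            continue
        # Use the two far-apart objects as seed points and reassign the rest.
        seed_a, seed_b = group[i], group[j]
        part_a, part_b = [seed_a], [seed_b]
        for idx in group:
            if idx in (seed_a, seed_b):
                continue
            da = np.linalg.norm(points[idx] - points[seed_a])
            db = np.linalg.norm(points[idx] - points[seed_b])
            (part_a if da <= db else part_b).append(idx)
        groups.extend([part_a, part_b])       # both halves may be split further
    return finished
```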
19 DIANA Explored
- In DIANA, a divisive hierarchical clustering method, all of the objects initially form one cluster.
- The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster.
- The cluster-splitting process repeats until, eventually, each new cluster contains a single object or a termination condition is met.
20 Difficulties with hierarchical clustering
- It encounters difficulties regarding the selection of merge and split points.
- Such a decision is critical, because once a group of objects is merged or split, the process at the next step operates on the newly generated clusters.
- It will not undo what was done previously.
- Thus, merge or split decisions, if not well chosen at some step, may lead to low-quality clusters.
21 Solutions to improve hierarchical clustering
- One promising direction for improving the clustering quality of hierarchical methods is to integrate hierarchical clustering with other clustering techniques. A few such methods are:
- BIRCH
- CURE
- Chameleon
22 BIRCH: An Efficient Data Clustering Method for Very Large Databases
- Tian Zhang, Computer Sciences Dept., University of Wisconsin-Madison, zhang_at_cs.wisc.edu
- Raghu Ramakrishnan, Computer Sciences Dept., University of Wisconsin-Madison, raghu_at_cs.wisc.edu
- Miron Livny, Computer Sciences Dept., University of Wisconsin-Madison, miron_at_cs.wisc.edu
- In Proceedings of the International Conference on Management of Data (ACM SIGMOD), pages 103-114, Montreal, Canada, June 1996.
23 Reference for Paper
- www2.informatik.hu-berlin.de/wm/mldm2004/zhang96birch.pdf
24 BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
- A hierarchical clustering method.
- It introduces two concepts:
- Clustering feature (CF)
- Clustering feature tree (CF tree)
- These structures help the clustering method achieve good speed and scalability in large databases.
25 Clustering Feature: Definition
- Given N d-dimensional data points X_i (i = 1, 2, ..., N) in a cluster:
- CF = (N, LS, SS), where
- N is the number of data points in the cluster,
- LS is the linear sum of the N data points,
- SS is the square sum of the N data points.
26 Clustering feature concepts
- Each record (data object) is a tuple of attribute values and is here called a vector: O_i = (V_{i1}, V_{i2}, \ldots, V_{id}).
- Over a database of N such vectors, the linear sum is defined as
  LS = \sum_{i=1}^{N} O_i = \left( \sum_{i=1}^{N} V_{i1}, \; \sum_{i=1}^{N} V_{i2}, \; \ldots, \; \sum_{i=1}^{N} V_{id} \right)
27 Square sum
- The square sum is defined componentwise as
  SS = \sum_{i=1}^{N} O_i^2 = \left( \sum_{i=1}^{N} V_{i1}^2, \; \sum_{i=1}^{N} V_{i2}^2, \; \ldots, \; \sum_{i=1}^{N} V_{id}^2 \right)
28 Example of a case
- Assume N = 5 and d = 2.
- Linear sum: LS = \sum_{i=1}^{5} O_i = \left( \sum_{i=1}^{5} V_{i1}, \; \sum_{i=1}^{5} V_{i2} \right)
- Square sum: SS = \left( \sum_{i=1}^{5} V_{i1}^2, \; \sum_{i=1}^{5} V_{i2}^2 \right)
29 Example 2
- Clustering feature: CF = (N, LS, SS), with N = 5, LS = (16, 30), SS = (54, 190).
- So CF = (5, (16, 30), (54, 190)). (A short sketch that computes a CF from raw points follows below.)
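As an illustration, a CF with exactly these values arises from the five 2-D points (3,4), (2,6), (4,5), (4,7), (3,8); these points are an assumption chosen to be consistent with the stated sums. A minimal sketch of the computation:

```python
import numpy as np

def clustering_feature(points):
    """Return the BIRCH clustering feature CF = (N, LS, SS) of a set of points,
    with SS kept per dimension as in the slides."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    ls = pts.sum(axis=0)          # linear sum, one entry per attribute
    ss = (pts ** 2).sum(axis=0)   # square sum, one entry per attribute
    return n, ls, ss

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
n, ls, ss = clustering_feature(points)
print(n, ls, ss)   # 5 [16. 30.] [ 54. 190.]
```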
30 CF-Tree
- A CF-tree is a height-balanced tree with two parameters: the branching factor (B for nonleaf nodes, L for leaf nodes) and the threshold T.
- Each entry in a nonleaf node has the form [CF_i, child_i].
- Each entry in a leaf node is a CF; each leaf node also has two pointers, 'prev' and 'next'.
- The CF tree is basically a tree used to store all the clustering features.
31 CF Tree (figure)
- Figure: a CF tree with a root, nonleaf nodes whose entries [CF_1, child_1], ..., [CF_5, child_5] point to child nodes, and leaf nodes holding CF entries (CF_1, CF_2, ..., CF_6) that are chained together by 'prev' and 'next' pointers.
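One property worth noting from the BIRCH paper is that clustering features are additive: merging two subclusters simply adds their CF components, which is what lets a nonleaf entry summarize its whole subtree. The sketch below uses a hypothetical CF class of our own; the radius formula follows directly from (N, LS, SS).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CF:
    n: int              # number of points summarized
    ls: np.ndarray      # linear sum per dimension
    ss: np.ndarray      # square sum per dimension

    def __add__(self, other):
        # CF additivity: the CF of a merged cluster is the entrywise sum.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Root-mean-square distance of member points from the centroid,
        # computable from (N, LS, SS) alone -- no need to revisit raw points.
        return np.sqrt(np.maximum(self.ss.sum() / self.n
                                  - (self.ls ** 2).sum() / self.n ** 2, 0.0))

cf1 = CF(5, np.array([16.0, 30.0]), np.array([54.0, 190.0]))
cf2 = CF(3, np.array([9.0, 12.0]), np.array([29.0, 50.0]))
print((cf1 + cf2).centroid(), (cf1 + cf2).radius())
```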
32 BIRCH Clustering
- Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data).
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree. (A usage example with an off-the-shelf implementation follows below.)
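As a usage example (assuming scikit-learn is installed), its Birch estimator follows the same two-phase idea: it builds a CF tree governed by a threshold and branching factor, then runs a global clustering step over the leaf entries. The synthetic blobs below are illustrative only.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Three blobs of 2-D points standing in for a large database.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in [(0, 0), (4, 4), (0, 5)]])

model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)          # Phase 1 builds the CF tree, Phase 2 clusters its leaves
print(len(model.subcluster_centers_))  # number of leaf CF entries (subclusters)
```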
33 BIRCH Algorithm Overview (figure)
34 Summary of BIRCH
- Scales linearly: a single scan already gives good clustering, and the quality improves with a few additional scans.
- It handles noise (data points that are not part of the underlying pattern) effectively.
35 Density-Based Clustering Methods
- Clustering based on density (density-connected points) rather than on a distance metric alone.
- A cluster is a set of density-connected points.
- Major features:
- Discovers clusters of arbitrary shape
- Handles noise
- Needs density parameters as a termination condition (when no new objects can be added to any cluster)
- Examples:
- DBSCAN (Ester et al., 1996)
- OPTICS (Ankerst et al., 1999)
- DENCLUE (Hinneburg and Keim, 1998)
36 Density-Based Clustering: Background
- Eps-neighborhood: the neighborhood within a radius Eps of a given object.
- MinPts: the minimum number of points required in the Eps-neighborhood of an object.
- Core object: if the Eps-neighborhood of an object contains at least MinPts points, the object is a core object.
- Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps and MinPts if
- 1) p is within the Eps-neighborhood of q, and
- 2) q is a core object.
37 Figure: density reachability and density connectivity in density-based clustering
- M, P, O, R and S are core objects, since each has an Eps-neighborhood containing at least 3 points.
- (MinPts = 3; Eps = the radius of the circles.)
38 Directly density-reachable
- Q is directly density-reachable from M. M is directly density-reachable from P, and vice versa.
39 Indirectly density-reachable
- Q is indirectly density-reachable from P, since Q is directly density-reachable from M and M is directly density-reachable from P. But P is not density-reachable from Q, since Q is not a core object.
40 Core, border, and noise points
- DBSCAN is a density-based algorithm.
- Density: the number of points within a specified radius (Eps).
- A point is a core point if it has at least a specified number of points (MinPts) within Eps; these are points in the interior of a cluster.
- A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
- A noise point is any point that is neither a core point nor a border point. (A short classification sketch follows below.)
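A minimal sketch of this core/border/noise classification; the function name and the brute-force neighborhood computation are our own choices.

```python
import numpy as np

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' under Eps/MinPts."""
    pts = np.asarray(points, dtype=float)
    # Eps-neighborhoods (each point counts itself as a neighbor here).
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(row <= eps) for row in dist]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif core.intersection(nb):        # in the neighborhood of a core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels
```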
41 DBSCAN (Density-Based Spatial Clustering of Applications with Noise): The Algorithm
- Arbitrarily select a point p.
- Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
- If p is a core point, a cluster is formed.
- If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
- Continue the process until all of the points have been processed. (A sketch of the full loop follows below.)
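A minimal, brute-force sketch of the DBSCAN loop just described (cluster ids are non-negative integers and -1 marks noise; the helper names and data structures are our own):

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Brute-force DBSCAN sketch: returns a cluster id per point, -1 for noise."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    # Eps-neighborhood of each point (the point itself is included).
    neighbors = [list(np.flatnonzero(row <= eps)) for row in dist]
    labels = [None] * n                        # None = unvisited, -1 = noise
    cluster_id = -1
    for p in range(n):                         # arbitrarily select the next unvisited point
        if labels[p] is not None:
            continue
        if len(neighbors[p]) < min_pts:        # not a core point: tentatively noise
            labels[p] = -1
            continue
        cluster_id += 1                        # p is a core point: start a new cluster
        labels[p] = cluster_id
        queue = [q for q in neighbors[p] if q != p]
        while queue:                           # collect everything density-reachable from p
            q = queue.pop()
            if labels[q] == -1:                # noise reached from a core point -> border point
                labels[q] = cluster_id
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            if len(neighbors[q]) >= min_pts:   # q is also a core point: keep expanding
                queue.extend(neighbors[q])
    return labels
```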
42 Conclusions
- We discussed two hierarchical clustering approaches: agglomerative and divisive.
- We also discussed BIRCH, a hierarchical clustering method that produces good clustering from a single scan and better clustering with a few additional scans.
- DBSCAN is a density-based clustering algorithm that discovers clusters of arbitrary shape; unlike the hierarchical methods, it defines clusters by density rather than by distance alone.
43 GRID-BASED CLUSTERING METHODS
- This is the approach in which we quantize the space into a finite number of cells that form a grid structure, on which all of the clustering operations are performed.
- So, for example, assume that we have a set of records and we want to cluster them with respect to two attributes; then we divide the related space (the plane) into a grid structure and find the clusters there.
44 The clustering space (figure)
- Figure: the clustering space is the plane of salary (in units of 10,000, axis 0-8) versus age (axis 20-60), divided into grid cells.
45 Techniques for Grid-Based Clustering
- The following are some techniques that are used to perform grid-based clustering:
- CLIQUE (CLustering In QUEst)
- STING (STatistical INformation Grid)
- WaveCluster
46 Looking at CLIQUE as an Example
- CLIQUE is used for the clustering of high-dimensional data stored in large tables. By high-dimensional data we mean records that have many attributes.
- CLIQUE identifies the dense units in the subspaces of the high-dimensional data space, and uses these subspaces to provide more efficient clustering.
47 Definitions That Need to Be Known
- Unit: after forming a grid structure on the space, each rectangular cell is called a unit.
- Dense: a unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter.
- Cluster: a cluster is defined as a maximal set of connected dense units.
48 How Does CLIQUE Work?
- Let us say that we have a set of records that we would like to cluster in terms of n attributes.
- So, we are dealing with an n-dimensional space.
- MAJOR STEPS:
- CLIQUE partitions each 1-dimensional subspace into the same number of equal-length intervals.
- Using this as a basis, it partitions the n-dimensional data space into non-overlapping rectangular units.
49 CLIQUE Major Steps (Cont.)
- Now CLIQUE's goal is to identify the dense n-dimensional units.
- It does this in the following way:
- CLIQUE finds dense units of higher dimensionality by finding the dense units in the subspaces.
- So, for example, if we are dealing with a 3-dimensional space, CLIQUE finds the dense units in the 3 related planes (2-dimensional subspaces).
- It then intersects the extensions of the subspaces representing the dense units to form a candidate search space in which dense units of higher dimensionality may exist.
50 CLIQUE Major Steps (Cont.)
- Each maximal set of connected dense units is considered a cluster.
- Using this definition, the dense units in the subspaces are examined in order to find clusters in the subspaces.
- The information about the subspaces is then used to find clusters in the n-dimensional space.
- It must be noted that all cluster boundaries are either horizontal or vertical; this is due to the nature of the rectangular grid cells. (A small sketch of the bottom-up dense-unit search follows below.)
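Below is a heavily simplified sketch of the bottom-up, Apriori-style dense-unit search for the 2-D case (dense 1-D intervals first, then candidate 2-D units formed only from dense 1-D units). The names, the binning scheme, and the density-threshold handling are our own simplifications, not the original CLIQUE code.

```python
from collections import Counter
from itertools import combinations
import numpy as np

def dense_units_2d(points, n_intervals=10, density_threshold=0.05):
    """Find dense 1-D intervals per attribute, then dense 2-D grid units
    among the candidates formed by intersecting dense 1-D units."""
    pts = np.asarray(points, dtype=float)
    n, d = pts.shape
    # Map every value to an interval index in [0, n_intervals) per dimension.
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    bins = np.minimum(((pts - lo) / (hi - lo + 1e-12) * n_intervals).astype(int),
                      n_intervals - 1)

    # Pass 1: dense 1-D units (attribute, interval).
    dense_1d = set()
    for dim in range(d):
        counts = Counter(bins[:, dim])
        dense_1d |= {(dim, b) for b, c in counts.items() if c / n > density_threshold}

    # Pass 2: candidate 2-D units come only from pairs of dense 1-D units
    # on different attributes (the Apriori-style pruning).
    dense_2d = set()
    for (d1, b1), (d2, b2) in combinations(sorted(dense_1d), 2):
        if d1 == d2:
            continue
        inside = np.sum((bins[:, d1] == b1) & (bins[:, d2] == b2))
        if inside / n > density_threshold:
            dense_2d.add(((d1, b1), (d2, b2)))
    return dense_1d, dense_2d
```

Connected dense 2-D units would then be grouped into clusters, and the same candidate-generation step extends to higher dimensions.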
51 Example for CLIQUE
- Let us say that we want to cluster a set of records that have three attributes, namely salary, vacation, and age.
- The data space for this data would be 3-dimensional (figure axes: vacation, age, salary).
52 Example (Cont.)
- After plotting the data objects, each dimension (i.e., salary, vacation, and age) is split into intervals of equal length.
- Then we form a 3-dimensional grid on the space, each unit of which is a 3-D rectangle.
- Now, our goal is to find the dense 3-D rectangular units.
53 Example (Cont.)
- To do this, we find the dense units of the subspaces of this 3-D space.
- So, we find the dense units with respect to age and salary. This means that we look at the salary-age plane and find all the 2-D rectangular units that are dense.
- We also find the dense 2-D rectangular units for the vacation-age plane.
54 Example (figure)
55 Example (Cont.)
- Now let us try to visualize the dense units of the two planes on the following 3-D figure.
56 Example (Cont.)
- We can extend the dense areas in the vacation-age plane inwards.
- We can extend the dense areas in the salary-age plane upwards.
- The intersection of these two spaces gives us a candidate search space in which 3-dimensional dense units may exist.
- We then find the dense units in the salary-vacation plane and form an extension of the subspace that represents these dense units.
57 Example (Cont.)
- Now, we intersect the candidate search space with the extension of the dense units of the salary-vacation plane, in order to get all the 3-D dense units.
- So, what was the main idea?
- We used the dense units in subspaces in order to find the dense units in the 3-dimensional space.
- After finding the dense units, it is very easy to find the clusters.
58 Reflecting upon CLIQUE
- Why does CLIQUE confine its search for dense units in high dimensions to the intersection of dense units in subspaces?
- Because the Apriori property employs prior knowledge of the items in the search space, so that portions of the space can be pruned.
- The property, applied to CLIQUE, says that if a k-dimensional unit is dense, then so are its projections in (k-1)-dimensional space.
59 Strengths and Weaknesses of CLIQUE
- Strengths:
- It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
- It is quite efficient.
- It is insensitive to the order of records in the input and does not presume any canonical data distribution.
- It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases.
- Weakness:
- The accuracy of the clustering result may be degraded; this is the price paid for the simplicity of the method.
60 STING: A Statistical Information Grid Approach to Spatial Data Mining
- Wei Wang, Department of Computer Science, University of California, Los Angeles, CA 90095, U.S.A., weiwang_at_cs.ucla.edu
- Jiong Yang, Department of Computer Science, University of California, Los Angeles, CA 90095, U.S.A., jyang_at_cs.ucla.edu
- Richard Muntz, Department of Computer Science, University of California, Los Angeles, CA 90095, U.S.A., muntz_at_cs.ucla.edu
- VLDB Conference, Athens, Greece, 1997.
61 Reference for Paper
- http://georges.gardarin.free.fr/Cours_XMLDM_Master2/Sting.PDF
62 Definitions That Need to Be Known
- Spatial data: data that have a spatial or location component. These are objects that are themselves located in physical space. Examples: my house, Lake Geneva, New York City, etc.
- Spatial area: the area that encompasses the locations of all the spatial data is called the spatial area.
63 STING (Introduction)
- STING is used for performing clustering on spatial data.
- STING uses a hierarchical multi-resolution grid data structure to partition the spatial area.
- STING's big benefit is that it processes many common region-oriented queries on a set of points efficiently.
- We want to cluster the records that are in a spatial table in terms of location.
- The placement of a record in a grid cell is completely determined by its physical location.
64 Hierarchical Structure of the Grid Cells
- The spatial area is divided into rectangular cells (using latitude and longitude).
- The cells form a hierarchical structure.
- This means that each cell at a higher level is further partitioned into 4 smaller cells at the next lower level.
- In other words, each cell at the i-th level (except the leaves) has 4 children at the (i+1)-th level.
- The union of the 4 children cells gives back the parent cell at the level above them.
65 Hierarchical Structure of Cells (Cont.)
- The size of the leaf-level cells and the number of layers depend upon how much granularity the user wants.
- So, why do we have a hierarchical structure of cells?
- We have it in order to provide better granularity, i.e., higher resolution.
66 A Hierarchical Structure for STING Clustering (figure)
67 Statistical Parameters Stored in Each Cell
- For each cell in each layer we store attribute-dependent and attribute-independent parameters.
- Attribute-independent parameter:
- Count: the number of records in this cell.
- Attribute-dependent parameters:
- (We are assuming that our attribute values are real numbers.)
68 Statistical Parameters (Cont.)
- For each attribute of each cell we store the following parameters:
- M: the mean of all values of the attribute in this cell.
- S: the standard deviation of all values of the attribute in this cell.
- Min: the minimum value of the attribute in this cell.
- Max: the maximum value of the attribute in this cell.
- Distribution: the type of distribution that the attribute values in this cell follow (e.g., normal, exponential, etc.). 'None' is assigned if the distribution is unknown.
69 Storing of Statistical Parameters
- Statistical information regarding the attributes in each grid cell, for each layer, is pre-computed and stored beforehand.
- The statistical parameters for the cells in the lowest layer are computed directly from the values that are present in the table.
- The statistical parameters for the cells in all the other levels are computed from their respective children cells in the level below. (A small sketch of this bottom-up aggregation follows below.)
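A minimal sketch of how a parent cell's count, mean, standard deviation, min and max can be derived from its children without touching the raw records; the CellStats class and the use of per-cell sums of squares are our own assumptions, and STING's distribution-type parameter is omitted.

```python
from dataclasses import dataclass
import math

@dataclass
class CellStats:
    count: int      # number of records in the cell
    mean: float     # mean of the attribute in the cell
    std: float      # standard deviation of the attribute in the cell
    min: float
    max: float

def merge_children(children):
    """Compute a parent cell's statistics from its (four) child cells."""
    n = sum(c.count for c in children)
    if n == 0:
        return CellStats(0, 0.0, 0.0, float("inf"), float("-inf"))
    # Weighted mean of the children means.
    mean = sum(c.count * c.mean for c in children) / n
    # Recover each child's sum of squares from its mean and std, then combine.
    sum_sq = sum(c.count * (c.std ** 2 + c.mean ** 2) for c in children)
    std = math.sqrt(max(sum_sq / n - mean ** 2, 0.0))
    return CellStats(n, mean, std,
                     min(c.min for c in children),
                     max(c.max for c in children))

children = [CellStats(10, 5.0, 1.0, 3.0, 7.0), CellStats(20, 8.0, 2.0, 4.0, 12.0),
            CellStats(0, 0.0, 0.0, float("inf"), float("-inf")), CellStats(5, 6.0, 0.5, 5.0, 7.0)]
print(merge_children(children))
```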
70 How Are Queries Processed?
- STING can answer many queries (especially region queries) efficiently, because we do not have to access the full database.
- How are spatial data queries processed?
- We use a top-down approach to answer spatial data queries.
- Start from a pre-selected layer, typically one with a small number of cells.
- The pre-selected layer does not have to be the topmost layer.
- For each cell in the current layer, compute the confidence interval (or estimated range of probability) reflecting the cell's relevance to the given query.
71 Query Processing (Cont.)
- The confidence interval is calculated by using the statistical parameters of each cell.
- Remove irrelevant cells from further consideration.
- When finished with the current layer, proceed to the next lower level.
- Processing of the next lower level examines only the children of the remaining relevant cells.
- Repeat this process until the bottom layer is reached. (A sketch of this top-down filtering follows below.)
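A minimal sketch of the top-down filtering loop; the Cell class, the is_relevant predicate, and the way relevance is decided from the stored statistics are placeholders of our own, since the actual confidence-interval test depends on the query.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Cell:
    count: int                        # stands in for the pre-computed statistics
    children: List["Cell"] = field(default_factory=list)

def sting_query(start_cells, is_relevant):
    """Drill down from a pre-selected layer, keeping only relevant cells.

    is_relevant(cell) -> bool stands in for the confidence-interval test
    against the query; only children of relevant cells are ever examined.
    Returns the relevant cells of the bottom layer.
    """
    layer = [c for c in start_cells if is_relevant(c)]
    while layer and layer[0].children:                 # not yet at the bottom layer
        next_layer = [child for cell in layer for child in cell.children]
        layer = [c for c in next_layer if is_relevant(c)]
    return layer

# Tiny two-layer example: a cell is considered relevant if it contains any records.
leaves = [Cell(0), Cell(3), Cell(5), Cell(0)]
top = [Cell(3, leaves[:2]), Cell(5, leaves[2:])]
print(sting_query(top, lambda c: c.count > 0))         # -> the two non-empty leaf cells
```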
72 Different Grid Levels during Query Processing (figure)
73 Sample Query Examples
- Assume that the spatial area is the map of the regions of Long Island, Brooklyn, and Queens.
- Our records represent apartments that are present throughout the above region.
- Query: find all the apartments that are for rent near Stony Brook University and that have a rent in the range of $800 to $1000.
- The above query depends upon the parameter "near". For our example, "near" means within 15 miles of Stony Brook University.
74 Advantages and Disadvantages of STING
- ADVANTAGES:
- Very efficient.
- The computational complexity is O(k), where k is the number of grid cells at the lowest level. Usually k << N, where N is the number of records.
- STING is a query-independent approach, since the statistical information exists independently of queries.
- Incremental update.
- DISADVANTAGES:
- All cluster boundaries are either horizontal or vertical; no diagonal boundary is detected.