Title: CSE 634 Data Mining Techniques
1 CSE 634 Data Mining Techniques
- CLUSTERING - Part 2 (Group no. 1)
- By Anushree Shibani Shivaprakash and Fatima Zarinni
- Spring 2006
- Professor Anita Wasilewska, SUNY Stony Brook
2 References
- Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques (Chapter 8). Morgan Kaufmann, 2002.
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96. http://ifsc.ualr.edu/xwxu/publications/kdd-96.pdf
- How to explain hierarchical clustering. http://www.analytictech.com/networks/hiclus.htm
- Tian Zhang, Raghu Ramakrishnan, Miron Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.
- Margaret H. Dunham. Data Mining: Introductory and Advanced Topics.
- http://cs.sunysb.edu/cse634/, Presentation 9: Cluster Analysis
3 Introduction
- Major clustering methods:
- Partitioning methods
- Hierarchical methods
- Density-based methods
- Grid-based methods
4 Hierarchical methods
- Here we group data objects into a tree of clusters.
- There are two types of hierarchical clustering:
- Agglomerative hierarchical clustering
- Divisive hierarchical clustering
5 Agglomerative hierarchical clustering
- Groups data objects in a bottom-up fashion.
- Initially each data object is in its own cluster.
- We then merge these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied.
- A user can specify the desired number of clusters as a termination condition.
6 Divisive hierarchical clustering
- Groups data objects in a top-down fashion.
- Initially all data objects are in one cluster.
- We then subdivide the cluster into smaller and smaller clusters, until each object forms a cluster on its own or a termination condition is satisfied, such as reaching a desired number of clusters.
7 AGNES and DIANA
- Application of AGNES (AGglomerative NESting) and DIANA (DIvisive ANAlysis) to a data set of five objects: a, b, c, d, e.
8 AGNES Explored
- Given a set of N items to be clustered, and an N x N distance (or similarity) matrix, the basic process of Johnson's (1967) hierarchical clustering is as follows.
- Step 1: Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
- Step 2: Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one less cluster.
9 AGNES Explored (continued)
- Step 3: Compute the distances (similarities) between the new cluster and each of the old clusters.
- Step 4: Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
- Step 3 can be done in different ways, and this is what distinguishes single-link from complete-link and average-link clustering.
10 Similarity/Distance metrics
- Single-link clustering: the distance between two clusters is the shortest distance from any member of one cluster to any member of the other cluster.
- Complete-link clustering: the distance is the longest distance from any member of one cluster to any member of the other cluster.
- Average-link clustering: the distance is the average distance from any member of one cluster to any member of the other cluster.
- (A minimal code sketch of agglomerative clustering under these three linkage choices follows below.)
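To make the linkage choices concrete, here is a minimal Python sketch of the agglomerative (AGNES-style) procedure described above. The function names, the naive O(N^3) search, and the use of Euclidean distance are our own assumptions for illustration, not code from the referenced sources.

```python
import numpy as np

def agnes(points, k, linkage="single"):
    """Naive agglomerative (AGNES-style) clustering.

    points: (N, d) array of data objects.
    k: desired number of clusters (termination condition).
    linkage: 'single', 'complete', or 'average'.
    Returns a list of clusters, each a list of point indices.
    """
    points = np.asarray(points, dtype=float)
    # Step 1: every item starts in its own cluster.
    clusters = [[i] for i in range(len(points))]
    # Pairwise distances between the original items.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

    def cluster_distance(a, b):
        # Distances from any member of one cluster to any member of the other.
        d = dist[np.ix_(a, b)]
        if linkage == "single":      # shortest distance
            return d.min()
        if linkage == "complete":    # longest distance
            return d.max()
        return d.mean()              # average distance

    while len(clusters) > k:
        # Step 2: find the closest pair of clusters and merge them.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        # Step 3: the merged cluster replaces the two old ones; distances to the
        # remaining clusters are recomputed on demand by cluster_distance.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters
```

Calling, e.g., agnes(points, k=3, linkage="complete") returns the three complete-link clusters as lists of point indices.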
11-15 Single Linkage Hierarchical Clustering (shown step by step over five slides)
- Start: every point is its own cluster.
- Find the most similar pair of clusters.
- Merge them into a parent cluster.
- Repeat until the desired number of clusters remains.
- (A usage example with SciPy's hierarchical clustering routines follows below.)
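In practice the merging loop is rarely hand-coded. Assuming SciPy is available, scipy.cluster.hierarchy performs the same single-link (or complete/average-link) merging and can cut the resulting tree at a desired number of clusters; the five points below are illustrative values only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five 2-D points standing in for objects a, b, c, d, e (illustrative values).
X = np.array([[0.0, 0.0], [0.5, 0.2], [3.0, 3.0], [3.2, 2.8], [6.0, 0.5]])

Z = linkage(X, method="single")                   # 'complete' or 'average' also work
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
print(labels)
```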
16 DIANA (DIvisive ANAlysis)
- Introduced in Kaufman and Rousseeuw (1990).
- Works in the inverse order of AGNES.
- Eventually each node forms a cluster on its own.
17 Overview
- Divisive clustering starts by placing all objects into a single group. Before we start the procedure, we need to decide on a threshold distance.
- The procedure is as follows:
- Step 1: The distance between all pairs of objects within the same group is determined, and the pair with the largest distance is selected.
18 Overview (continued)
- Step 2: This maximum distance is compared to the threshold distance.
- If it is larger than the threshold, the group is divided in two. This is done by placing the selected pair into different groups and using them as seed points. All other objects in this group are examined and placed into the new group with the closest seed point. The procedure then returns to Step 1.
- If the distance between the selected objects is less than the threshold, the divisive clustering stops.
- To run a divisive clustering, you simply need to decide upon a method of measuring the distance between two objects. (A minimal sketch of this threshold-based procedure follows below.)
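The following is a minimal Python sketch of the threshold-based procedure just described, not the original DIANA implementation; the function name, the stack-based control flow, and the use of Euclidean distance are our own assumptions.

```python
import numpy as np

def divisive_clustering(points, threshold):
    """Split groups recursively until no within-group pair exceeds `threshold`.

    points: (N, d) array; returns a list of groups of point indices.
    """
    points = np.asarray(points, dtype=float)
    groups = [list(range(len(points)))]
    finished = []
    while groups:
        group = groups.pop()
        if len(group) < 2:
            finished.append(group)
            continue
        # Step 1: find the pair with the largest distance inside the group.
        sub = points[group]
        d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
        i, j = np.unravel_index(np.argmax(d), d.shape)
        # Step 2: compare the maximum distance with the threshold.
        if d[i, j] <= threshold:
            finished.append(group)            # splitting stops for this group
            continue
        # Use the two far-apart objects as seed points and reassign the rest.
        seed_a, seed_b = group[i], group[j]
        part_a, part_b = [seed_a], [seed_b]
        for idx in group:
            if idx in (seed_a, seed_b):
                continue
            da = np.linalg.norm(points[idx] - points[seed_a])
            db = np.linalg.norm(points[idx] - points[seed_b])
            (part_a if da <= db else part_b).append(idx)
        groups.extend([part_a, part_b])       # both halves may be split further
    return finished
```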
19 DIANA Explored
- In DIANA, a divisive hierarchical clustering method, all of the objects initially form one cluster.
- The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster.
- The cluster-splitting process repeats until, eventually, each new cluster contains a single object or a termination condition is met.
20 Difficulties with hierarchical clustering
- It encounters difficulties regarding the selection of merge and split points.
- Such a decision is critical, because once a group of objects is merged or split, the process at the next step operates on the newly generated clusters.
- It will not undo what was done previously.
- Thus, merge or split decisions, if not well chosen at some step, may lead to low-quality clusters.
21 Solutions to improve hierarchical clustering
- One promising direction for improving the clustering quality of hierarchical methods is to integrate hierarchical clustering with other clustering techniques. A few such methods are:
- BIRCH
- CURE
- Chameleon
22 BIRCH: An Efficient Data Clustering Method for Very Large Databases
- Tian Zhang, Computer Sciences Dept., University of Wisconsin-Madison, zhang_at_cs.wisc.edu
- Raghu Ramakrishnan, Computer Sciences Dept., University of Wisconsin-Madison, raghu_at_cs.wisc.edu
- Miron Livny, Computer Sciences Dept., University of Wisconsin-Madison, miron_at_cs.wisc.edu
- In Proceedings of the International Conference on Management of Data (ACM SIGMOD), pages 103-114, Montreal, Canada, June 1996.
23 Reference for Paper
- www2.informatik.hu-berlin.de/wm/mldm2004/zhang96birch.pdf
24 BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
- A hierarchical clustering method.
- It introduces two concepts:
- Clustering feature (CF)
- Clustering feature tree (CF tree)
- These structures help the clustering method achieve good speed and scalability in large databases.
25 Clustering Feature: Definition
- Given N d-dimensional data points X_i (i = 1, 2, ..., N) in a cluster:
- CF = (N, LS, SS), where
- N is the number of data points in the cluster,
- LS is the linear sum of the N data points,
- SS is the square sum of the N data points.
26 Clustering feature concepts
- Each record (data object) is a tuple of attribute values and is here called a vector: O_i = (V_{i1}, V_{i2}, \ldots, V_{id}).
- Over a database of N such vectors, the linear sum is defined as
  LS = \sum_{i=1}^{N} O_i = \left( \sum_{i=1}^{N} V_{i1}, \; \sum_{i=1}^{N} V_{i2}, \; \ldots, \; \sum_{i=1}^{N} V_{id} \right)
27 Square sum
- The square sum is defined componentwise as
  SS = \sum_{i=1}^{N} O_i^2 = \left( \sum_{i=1}^{N} V_{i1}^2, \; \sum_{i=1}^{N} V_{i2}^2, \; \ldots, \; \sum_{i=1}^{N} V_{id}^2 \right)
28 Example of a case
- Assume N = 5 and d = 2.
- Linear sum: LS = \sum_{i=1}^{5} O_i = \left( \sum_{i=1}^{5} V_{i1}, \; \sum_{i=1}^{5} V_{i2} \right)
- Square sum: SS = \left( \sum_{i=1}^{5} V_{i1}^2, \; \sum_{i=1}^{5} V_{i2}^2 \right)
29 Example 2
- Clustering feature: CF = (N, LS, SS), with N = 5, LS = (16, 30), SS = (54, 190).
- So CF = (5, (16, 30), (54, 190)). (A short sketch that computes a CF from raw points follows below.)
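As an illustration, a CF with exactly these values arises from the five 2-D points (3,4), (2,6), (4,5), (4,7), (3,8); these points are an assumption chosen to be consistent with the stated sums. A minimal sketch of the computation:

```python
import numpy as np

def clustering_feature(points):
    """Return the BIRCH clustering feature CF = (N, LS, SS) of a set of points,
    with SS kept per dimension as in the slides."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    ls = pts.sum(axis=0)          # linear sum, one entry per attribute
    ss = (pts ** 2).sum(axis=0)   # square sum, one entry per attribute
    return n, ls, ss

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
n, ls, ss = clustering_feature(points)
print(n, ls, ss)   # 5 [16. 30.] [ 54. 190.]
```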
30 CF-Tree
- A CF-tree is a height-balanced tree with two parameters: the branching factor (B for nonleaf nodes, L for leaf nodes) and the threshold T.
- Each entry in a nonleaf node has the form [CF_i, child_i].
- Each entry in a leaf node is a CF; each leaf node also has two pointers, 'prev' and 'next'.
- The CF tree is basically a tree used to store all the clustering features.
31 CF Tree (figure)
- Figure: a CF tree with a root, nonleaf nodes whose entries [CF_1, child_1], ..., [CF_5, child_5] point to child nodes, and leaf nodes holding CF entries (CF_1, CF_2, ..., CF_6) that are chained together by 'prev' and 'next' pointers.
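One property worth noting from the BIRCH paper is that clustering features are additive: merging two subclusters simply adds their CF components, which is what lets a nonleaf entry summarize its whole subtree. The sketch below uses a hypothetical CF class of our own; the radius formula follows directly from (N, LS, SS).

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CF:
    n: int              # number of points summarized
    ls: np.ndarray      # linear sum per dimension
    ss: np.ndarray      # square sum per dimension

    def __add__(self, other):
        # CF additivity: the CF of a merged cluster is the entrywise sum.
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Root-mean-square distance of member points from the centroid,
        # computable from (N, LS, SS) alone -- no need to revisit raw points.
        return np.sqrt(np.maximum(self.ss.sum() / self.n
                                  - (self.ls ** 2).sum() / self.n ** 2, 0.0))

cf1 = CF(5, np.array([16.0, 30.0]), np.array([54.0, 190.0]))
cf2 = CF(3, np.array([9.0, 12.0]), np.array([29.0, 50.0]))
print((cf1 + cf2).centroid(), (cf1 + cf2).radius())
```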
32 BIRCH Clustering
- Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data).
- Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree. (A usage example with an off-the-shelf implementation follows below.)
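As a usage example (assuming scikit-learn is installed), its Birch estimator follows the same two-phase idea: it builds a CF tree governed by a threshold and branching factor, then runs a global clustering step over the leaf entries. The synthetic blobs below are illustrative only.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Three blobs of 2-D points standing in for a large database.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in [(0, 0), (4, 4), (0, 5)]])

model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)          # Phase 1 builds the CF tree, Phase 2 clusters its leaves
print(len(model.subcluster_centers_))  # number of leaf CF entries (subclusters)
```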
33 BIRCH Algorithm Overview (figure)
34 Summary of BIRCH
- Scales linearly: a single scan already gives good clustering, and the quality improves with a few additional scans.
- It handles noise (data points that are not part of the underlying pattern) effectively.
35 Density-Based Clustering Methods
- Clustering based on density (density-connected points) rather than on a distance metric alone.
- A cluster is a set of density-connected points.
- Major features:
- Discovers clusters of arbitrary shape
- Handles noise
- Needs density parameters as a termination condition (when no new objects can be added to any cluster)
- Examples:
- DBSCAN (Ester et al., 1996)
- OPTICS (Ankerst et al., 1999)
- DENCLUE (Hinneburg and Keim, 1998)
36 Density-Based Clustering: Background
- Eps-neighborhood: the neighborhood within a radius Eps of a given object.
- MinPts: the minimum number of points required in the Eps-neighborhood of an object.
- Core object: if the Eps-neighborhood of an object contains at least MinPts points, the object is a core object.
- Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps and MinPts if
- 1) p is within the Eps-neighborhood of q, and
- 2) q is a core object.
37 Figure: density reachability and density connectivity in density-based clustering
- M, P, O, R and S are core objects, since each has an Eps-neighborhood containing at least 3 points.
- (MinPts = 3; Eps = the radius of the circles.)
38 Directly density-reachable
- Q is directly density-reachable from M. M is directly density-reachable from P, and vice versa.
39 Indirectly density-reachable
- Q is indirectly density-reachable from P, since Q is directly density-reachable from M and M is directly density-reachable from P. But P is not density-reachable from Q, since Q is not a core object.
40 Core, border, and noise points
- DBSCAN is a density-based algorithm.
- Density: the number of points within a specified radius (Eps).
- A point is a core point if it has at least a specified number of points (MinPts) within Eps; these are points in the interior of a cluster.
- A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
- A noise point is any point that is neither a core point nor a border point. (A short classification sketch follows below.)
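A minimal sketch of this core/border/noise classification; the function name and the brute-force neighborhood computation are our own choices.

```python
import numpy as np

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' under Eps/MinPts."""
    pts = np.asarray(points, dtype=float)
    # Eps-neighborhoods (each point counts itself as a neighbor here).
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(row <= eps) for row in dist]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif core.intersection(nb):        # in the neighborhood of a core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels
```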
41 DBSCAN (Density-Based Spatial Clustering of Applications with Noise): The Algorithm
- Arbitrarily select a point p.
- Retrieve all points density-reachable from p w.r.t. Eps and MinPts.
- If p is a core point, a cluster is formed.
- If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
- Continue the process until all of the points have been processed. (A sketch of the full loop follows below.)
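A minimal, brute-force sketch of the DBSCAN loop just described (cluster ids are non-negative integers and -1 marks noise; the helper names and data structures are our own):

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Brute-force DBSCAN sketch: returns a cluster id per point, -1 for noise."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    # Eps-neighborhood of each point (the point itself is included).
    neighbors = [list(np.flatnonzero(row <= eps)) for row in dist]
    labels = [None] * n                        # None = unvisited, -1 = noise
    cluster_id = -1
    for p in range(n):                         # arbitrarily select the next unvisited point
        if labels[p] is not None:
            continue
        if len(neighbors[p]) < min_pts:        # not a core point: tentatively noise
            labels[p] = -1
            continue
        cluster_id += 1                        # p is a core point: start a new cluster
        labels[p] = cluster_id
        queue = [q for q in neighbors[p] if q != p]
        while queue:                           # collect everything density-reachable from p
            q = queue.pop()
            if labels[q] == -1:                # noise reached from a core point -> border point
                labels[q] = cluster_id
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            if len(neighbors[q]) >= min_pts:   # q is also a core point: keep expanding
                queue.extend(neighbors[q])
    return labels
```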
42 Conclusions
- We discussed two hierarchical clustering approaches: agglomerative and divisive.
- We also discussed BIRCH, a hierarchical clustering method that produces good clustering from a single scan and better clustering with a few additional scans.
- DBSCAN is a density-based clustering algorithm that discovers clusters of arbitrary shape; unlike the hierarchical methods, it defines clusters by density rather than by distance alone.
43 GRID-BASED CLUSTERING METHODS
- This is the approach in which we quantize the space into a finite number of cells that form a grid structure, on which all of the clustering operations are performed.
- So, for example, assume that we have a set of records and we want to cluster them with respect to two attributes; then we divide the related space (the plane) into a grid structure and find the clusters there.
44 The clustering space (figure)
- Figure: the clustering space is the plane of salary (in units of 10,000, axis 0-8) versus age (axis 20-60), divided into grid cells.
45 Techniques for Grid-Based Clustering
- The following are some techniques that are used to perform grid-based clustering:
- CLIQUE (CLustering In QUEst)
- STING (STatistical INformation Grid)
- WaveCluster
46 Looking at CLIQUE as an Example
- CLIQUE is used for the clustering of high-dimensional data stored in large tables. By high-dimensional data we mean records that have many attributes.
- CLIQUE identifies the dense units in the subspaces of the high-dimensional data space, and uses these subspaces to provide more efficient clustering.
47 Definitions That Need to Be Known
- Unit: after forming a grid structure on the space, each rectangular cell is called a unit.
- Dense: a unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter.
- Cluster: a cluster is defined as a maximal set of connected dense units.
48 How Does CLIQUE Work?
- Let us say that we have a set of records that we would like to cluster in terms of n attributes.
- So, we are dealing with an n-dimensional space.
- MAJOR STEPS:
- CLIQUE partitions each 1-dimensional subspace into the same number of equal-length intervals.
- Using this as a basis, it partitions the n-dimensional data space into non-overlapping rectangular units.
49 CLIQUE Major Steps (Cont.)
- Now CLIQUE's goal is to identify the dense n-dimensional units.
- It does this in the following way:
- CLIQUE finds dense units of higher dimensionality by finding the dense units in the subspaces.
- So, for example, if we are dealing with a 3-dimensional space, CLIQUE finds the dense units in the 3 related planes (2-dimensional subspaces).
- It then intersects the extensions of the subspaces representing the dense units to form a candidate search space in which dense units of higher dimensionality may exist.
50 CLIQUE Major Steps (Cont.)
- Each maximal set of connected dense units is considered a cluster.
- Using this definition, the dense units in the subspaces are examined in order to find clusters in the subspaces.
- The information about the subspaces is then used to find clusters in the n-dimensional space.
- It must be noted that all cluster boundaries are either horizontal or vertical; this is due to the nature of the rectangular grid cells. (A small sketch of the bottom-up dense-unit search follows below.)
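Below is a heavily simplified sketch of the bottom-up, Apriori-style dense-unit search for the 2-D case (dense 1-D intervals first, then candidate 2-D units formed only from dense 1-D units). The names, the binning scheme, and the density-threshold handling are our own simplifications, not the original CLIQUE code.

```python
from collections import Counter
from itertools import combinations
import numpy as np

def dense_units_2d(points, n_intervals=10, density_threshold=0.05):
    """Find dense 1-D intervals per attribute, then dense 2-D grid units
    among the candidates formed by intersecting dense 1-D units."""
    pts = np.asarray(points, dtype=float)
    n, d = pts.shape
    # Map every value to an interval index in [0, n_intervals) per dimension.
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    bins = np.minimum(((pts - lo) / (hi - lo + 1e-12) * n_intervals).astype(int),
                      n_intervals - 1)

    # Pass 1: dense 1-D units (attribute, interval).
    dense_1d = set()
    for dim in range(d):
        counts = Counter(bins[:, dim])
        dense_1d |= {(dim, b) for b, c in counts.items() if c / n > density_threshold}

    # Pass 2: candidate 2-D units come only from pairs of dense 1-D units
    # on different attributes (the Apriori-style pruning).
    dense_2d = set()
    for (d1, b1), (d2, b2) in combinations(sorted(dense_1d), 2):
        if d1 == d2:
            continue
        inside = np.sum((bins[:, d1] == b1) & (bins[:, d2] == b2))
        if inside / n > density_threshold:
            dense_2d.add(((d1, b1), (d2, b2)))
    return dense_1d, dense_2d
```

Connected dense 2-D units would then be grouped into clusters, and the same candidate-generation step extends to higher dimensions.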
51 Example for CLIQUE
- Let us say that we want to cluster a set of records that have three attributes, namely salary, vacation, and age.
- The data space for this data would be 3-dimensional (figure axes: vacation, age, salary).
52 Example (Cont.)
- After plotting the data objects, each dimension (i.e., salary, vacation, and age) is split into intervals of equal length.
- Then we form a 3-dimensional grid on the space, each unit of which is a 3-D rectangle.
- Now, our goal is to find the dense 3-D rectangular units.
53 Example (Cont.)
- To do this, we find the dense units of the subspaces of this 3-D space.
- So, we find the dense units with respect to age and salary. This means that we look at the salary-age plane and find all the 2-D rectangular units that are dense.
- We also find the dense 2-D rectangular units for the vacation-age plane.
54 Example (figure)
55 Example (Cont.)
- Now let us try to visualize the dense units of the two planes on the following 3-D figure.
56 Example (Cont.)
- We can extend the dense areas in the vacation-age plane inwards.
- We can extend the dense areas in the salary-age plane upwards.
- The intersection of these two spaces gives us a candidate search space in which 3-dimensional dense units may exist.
- We then find the dense units in the salary-vacation plane and form an extension of the subspace that represents these dense units.
57 Example (Cont.)
- Now, we intersect the candidate search space with the extension of the dense units of the salary-vacation plane, in order to get all the 3-D dense units.
- So, what was the main idea?
- We used the dense units in subspaces in order to find the dense units in the 3-dimensional space.
- After finding the dense units, it is very easy to find the clusters.
58 Reflecting upon CLIQUE
- Why does CLIQUE confine its search for dense units in high dimensions to the intersection of dense units in subspaces?
- Because the Apriori property employs prior knowledge of the items in the search space, so that portions of the space can be pruned.
- The property, applied to CLIQUE, says that if a k-dimensional unit is dense, then so are its projections in (k-1)-dimensional space.
59 Strengths and Weaknesses of CLIQUE
- Strengths:
- It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces.
- It is quite efficient.
- It is insensitive to the order of records in the input and does not presume any canonical data distribution.
- It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases.
- Weakness:
- The accuracy of the clustering result may be degraded; this is the price paid for the simplicity of the method.
60 STING: A Statistical Information Grid Approach to Spatial Data Mining
- Wei Wang, Department of Computer Science, University of California, Los Angeles, CA 90095, U.S.A., weiwang_at_cs.ucla.edu
- Jiong Yang, Department of Computer Science, University of California, Los Angeles, CA 90095, U.S.A., jyang_at_cs.ucla.edu
- Richard Muntz, Department of Computer Science, University of California, Los Angeles, CA 90095, U.S.A., muntz_at_cs.ucla.edu
- VLDB Conference, Athens, Greece, 1997.
61 Reference for Paper
- http://georges.gardarin.free.fr/Cours_XMLDM_Master2/Sting.PDF
62 Definitions That Need to Be Known
- Spatial data: data that have a spatial or location component. These are objects that are themselves located in physical space. Examples: my house, Lake Geneva, New York City, etc.
- Spatial area: the area that encompasses the locations of all the spatial data is called the spatial area.
63 STING (Introduction)
- STING is used for performing clustering on spatial data.
- STING uses a hierarchical multi-resolution grid data structure to partition the spatial area.
- STING's big benefit is that it processes many common region-oriented queries on a set of points efficiently.
- We want to cluster the records that are in a spatial table in terms of location.
- The placement of a record in a grid cell is completely determined by its physical location.
64 Hierarchical Structure of the Grid Cells
- The spatial area is divided into rectangular cells (using latitude and longitude).
- The cells form a hierarchical structure.
- This means that each cell at a higher level is further partitioned into 4 smaller cells at the next lower level.
- In other words, each cell at the i-th level (except the leaves) has 4 children at the (i+1)-th level.
- The union of the 4 children cells gives back the parent cell at the level above them.
65 Hierarchical Structure of Cells (Cont.)
- The size of the leaf-level cells and the number of layers depend upon how much granularity the user wants.
- So, why do we have a hierarchical structure of cells?
- We have it in order to provide better granularity, i.e., higher resolution.
66 A Hierarchical Structure for STING Clustering (figure)
67 Statistical Parameters Stored in Each Cell
- For each cell in each layer we store attribute-dependent and attribute-independent parameters.
- Attribute-independent parameter:
- Count: the number of records in this cell.
- Attribute-dependent parameters:
- (We are assuming that our attribute values are real numbers.)
68 Statistical Parameters (Cont.)
- For each attribute of each cell we store the following parameters:
- M: the mean of all values of the attribute in this cell.
- S: the standard deviation of all values of the attribute in this cell.
- Min: the minimum value of the attribute in this cell.
- Max: the maximum value of the attribute in this cell.
- Distribution: the type of distribution that the attribute values in this cell follow (e.g., normal, exponential, etc.). 'None' is assigned if the distribution is unknown.
69 Storing of Statistical Parameters
- Statistical information regarding the attributes in each grid cell, for each layer, is pre-computed and stored beforehand.
- The statistical parameters for the cells in the lowest layer are computed directly from the values that are present in the table.
- The statistical parameters for the cells in all the other levels are computed from their respective children cells in the level below. (A small sketch of this bottom-up aggregation follows below.)
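A minimal sketch of how a parent cell's count, mean, standard deviation, min and max can be derived from its children without touching the raw records; the CellStats class and the use of per-cell sums of squares are our own assumptions, and STING's distribution-type parameter is omitted.

```python
from dataclasses import dataclass
import math

@dataclass
class CellStats:
    count: int      # number of records in the cell
    mean: float     # mean of the attribute in the cell
    std: float      # standard deviation of the attribute in the cell
    min: float
    max: float

def merge_children(children):
    """Compute a parent cell's statistics from its (four) child cells."""
    n = sum(c.count for c in children)
    if n == 0:
        return CellStats(0, 0.0, 0.0, float("inf"), float("-inf"))
    # Weighted mean of the children means.
    mean = sum(c.count * c.mean for c in children) / n
    # Recover each child's sum of squares from its mean and std, then combine.
    sum_sq = sum(c.count * (c.std ** 2 + c.mean ** 2) for c in children)
    std = math.sqrt(max(sum_sq / n - mean ** 2, 0.0))
    return CellStats(n, mean, std,
                     min(c.min for c in children),
                     max(c.max for c in children))

children = [CellStats(10, 5.0, 1.0, 3.0, 7.0), CellStats(20, 8.0, 2.0, 4.0, 12.0),
            CellStats(0, 0.0, 0.0, float("inf"), float("-inf")), CellStats(5, 6.0, 0.5, 5.0, 7.0)]
print(merge_children(children))
```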
70 How Are Queries Processed?
- STING can answer many queries (especially region queries) efficiently, because we do not have to access the full database.
- How are spatial data queries processed?
- We use a top-down approach to answer spatial data queries.
- Start from a pre-selected layer, typically one with a small number of cells.
- The pre-selected layer does not have to be the topmost layer.
- For each cell in the current layer, compute the confidence interval (or estimated range of probability) reflecting the cell's relevance to the given query.
71 Query Processing (Cont.)
- The confidence interval is calculated by using the statistical parameters of each cell.
- Remove irrelevant cells from further consideration.
- When finished with the current layer, proceed to the next lower level.
- Processing of the next lower level examines only the children of the remaining relevant cells.
- Repeat this process until the bottom layer is reached. (A sketch of this top-down filtering follows below.)
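A minimal sketch of the top-down filtering loop; the Cell class, the is_relevant predicate, and the way relevance is decided from the stored statistics are placeholders of our own, since the actual confidence-interval test depends on the query.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Cell:
    count: int                        # stands in for the pre-computed statistics
    children: List["Cell"] = field(default_factory=list)

def sting_query(start_cells, is_relevant):
    """Drill down from a pre-selected layer, keeping only relevant cells.

    is_relevant(cell) -> bool stands in for the confidence-interval test
    against the query; only children of relevant cells are ever examined.
    Returns the relevant cells of the bottom layer.
    """
    layer = [c for c in start_cells if is_relevant(c)]
    while layer and layer[0].children:                 # not yet at the bottom layer
        next_layer = [child for cell in layer for child in cell.children]
        layer = [c for c in next_layer if is_relevant(c)]
    return layer

# Tiny two-layer example: a cell is considered relevant if it contains any records.
leaves = [Cell(0), Cell(3), Cell(5), Cell(0)]
top = [Cell(3, leaves[:2]), Cell(5, leaves[2:])]
print(sting_query(top, lambda c: c.count > 0))         # -> the two non-empty leaf cells
```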
72 Different Grid Levels during Query Processing (figure)
73 Sample Query Examples
- Assume that the spatial area is the map of the regions of Long Island, Brooklyn, and Queens.
- Our records represent apartments that are present throughout the above region.
- Query: find all the apartments that are for rent near Stony Brook University and that have a rent in the range of $800 to $1000.
- The above query depends upon the parameter "near". For our example, "near" means within 15 miles of Stony Brook University.
74 Advantages and Disadvantages of STING
- ADVANTAGES:
- Very efficient.
- The computational complexity is O(k), where k is the number of grid cells at the lowest level. Usually k << N, where N is the number of records.
- STING is a query-independent approach, since the statistical information exists independently of queries.
- Incremental update.
- DISADVANTAGES:
- All cluster boundaries are either horizontal or vertical; no diagonal boundary is detected.