Title: Cluster Analysis
1 Cluster Analysis
2 Midterm: Monday Oct 29, 4 PM
- Lecture notes from Sept 5, 2007 until Oct 15, 2007; chapters from the textbook and papers discussed in class (see the detailed list below)
- Specific readings
  - Textbook
    - Chapter 1
    - Chapter 2: 2.1-2.4
    - Chapter 3: 3.1-3.4
    - Chapter 4: 4.1.1-4.1.2, 4.2.1
    - Chapter 5
    - Chapter 6: 6.1-6.5, 6.9.1, 6.12, 6.13, 6.14
    - Chapter 7: 7.1-7.4
  - Papers
    - Apriori paper: R. Agrawal, R. Srikant. Fast Algorithms for Mining Association Rules. VLDB 1994
    - MaxMiner paper: R. J. Bayardo Jr. Efficiently Mining Long Patterns from Databases. SIGMOD 1998
    - SLIQ paper: M. Mehta, R. Agrawal, J. Rissanen. SLIQ: A Fast Scalable Classifier for Data Mining. EDBT 1996
3 Cluster Analysis
- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Clustering Methods
4 Clustering High-Dimensional Data
- Clustering high-dimensional data
  - Many applications: text documents, DNA micro-array data
- Major challenges
  - Many irrelevant dimensions may mask clusters
  - Distance measures become meaningless due to equi-distance
  - Clusters may exist only in some subspaces
- Methods
  - Feature transformation: only effective if most dimensions are relevant
    - PCA and SVD are useful only when features are highly correlated/redundant
  - Feature selection: wrapper or filter approaches
    - Useful to find a subspace where the data have nice clusters
  - Subspace clustering: find clusters in all the possible subspaces
    - CLIQUE, ProClus, and frequent pattern-based clustering
5 The Curse of Dimensionality (graphs adapted from Parsons et al., SIGKDD Explorations 2004)
- Data in only one dimension is relatively packed
- Adding a dimension stretches the points across that dimension, pushing them further apart
- Adding more dimensions makes the points even further apart: high-dimensional data is extremely sparse
- Distance measures become meaningless due to equi-distance, as the demo below illustrates
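The equi-distance claim can be checked empirically. The quick sketch below (my own illustration, not from the slides; NumPy assumed) samples points in a unit hypercube and prints the relative contrast, (max distance - min distance) / min distance, from one reference point; it shrinks toward 0 as the dimensionality grows, which is exactly what "equi-distance" means here.

```python
import numpy as np

rng = np.random.default_rng(42)
for d in (1, 2, 10, 100, 1000):
    X = rng.uniform(size=(1000, d))              # random points in a unit cube
    dist = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from one point
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:4d}  relative contrast = {contrast:.3f}")
```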
6 Why Subspace Clustering? (adapted from Parsons et al., SIGKDD Explorations 2004)
- Clusters may exist only in some subspaces
- Subspace clustering: find clusters in some of the subspaces
7 CLIQUE (Clustering In QUEst)
- Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD 1998)
- Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
- CLIQUE can be considered both density-based and grid-based
  - It partitions each dimension into the same number of equal-length intervals
  - It partitions an m-dimensional data space into non-overlapping rectangular units
  - A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter
  - A cluster is a maximal set of connected dense units within a subspace
8 CLIQUE: The Major Steps
- Partition the data space and find the number of points that lie inside each cell of the partition
- Identify the subspaces that contain clusters using the Apriori principle (see the sketch after this list)
- Identify clusters
  - Determine dense units in all subspaces of interest
  - Determine connected dense units in all subspaces of interest
- Generate a minimal description for the clusters
  - Determine maximal regions that cover a cluster of connected dense units, for each cluster
  - Determine the minimal cover for each cluster
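The sketch below is a minimal illustration of the first two steps, not the original CLIQUE implementation; the function names, `xi` (intervals per dimension), and `tau` (density threshold as a fraction of all points) are my own choices. It counts points per grid cell and grows dense subspaces bottom-up with the Apriori principle.

```python
from collections import Counter
from itertools import combinations

def dense_units(X, dims, xi, tau, lo, hi):
    """Return the dense grid cells of subspace `dims` (tuple of dimension ids)."""
    def bucket(v, d):
        span = (hi[d] - lo[d]) or 1.0
        return min(xi - 1, int(xi * (v - lo[d]) / span))
    counts = Counter(tuple(bucket(row[d], d) for d in dims) for row in X)
    return {cell for cell, c in counts.items() if c / len(X) > tau}

def clique_subspaces(X, xi=10, tau=0.05):
    m = len(X[0])
    lo = [min(r[d] for r in X) for d in range(m)]
    hi = [max(r[d] for r in X) for d in range(m)]
    # Step 1: dense units in every 1-dimensional subspace.
    level = {(d,): u for d in range(m)
             if (u := dense_units(X, (d,), xi, tau, lo, hi))}
    result, k = dict(level), 1
    # Step 2: grow subspaces bottom-up. By the Apriori principle, a
    # (k+1)-dim subspace can contain dense units only if all of its
    # k-dim sub-subspaces already do.
    while level:
        cands = {tuple(sorted(set(a) | set(b)))
                 for a in level for b in level if len(set(a) | set(b)) == k + 1}
        cands = {c for c in cands if all(s in level for s in combinations(c, k))}
        level = {c: u for c in cands
                 if (u := dense_units(X, c, xi, tau, lo, hi))}
        result.update(level)
        k += 1
    return result  # {subspace dims: set of dense cells in that subspace}
```

The remaining steps, merging connected dense units into clusters and generating minimal descriptions, would amount to a flood-fill over adjacent dense cells; they are omitted here for brevity.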
9 [Figure: CLIQUE dense-unit example in the (age, salary) plane; y-axis: salary in units of $10,000 (0-7), x-axis: age (20-60)]
10 Strengths and Weaknesses of CLIQUE
- Strengths
  - Automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
  - Insensitive to the order of records in the input; does not presume any canonical data distribution
  - Scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases
- Weaknesses
  - The accuracy of the clustering result may be degraded at the expense of the simplicity of the method
11 Frequent Pattern-Based Approach
- Clustering high-dimensional spaces (e.g., clustering text documents, microarray data)
  - Projected subspace clustering: which dimensions should the data be projected on?
    - CLIQUE, ProClus
  - Feature extraction: costly and may not be effective?
  - Using frequent patterns as features
    - Frequent patterns are inherent features
    - Mining frequent patterns may not be so expensive
- Typical methods
  - Frequent-term-based document clustering
  - Clustering by pattern similarity in micro-array data (pClustering)
12 Clustering by Pattern Similarity (p-Clustering)
- Right: the raw micro-array data shows 3 genes and their values in a multi-dimensional space
  - Difficult to find their patterns
- Bottom: some subsets of dimensions form nice shift and scaling patterns
13 Why p-Clustering?
- Microarray data analysis may need to
  - Cluster on thousands of dimensions (attributes)
  - Discover both shift and scaling patterns
- Clustering with a Euclidean distance measure? Cannot find shift patterns
- Clustering on derived attributes A_ij = a_i - a_j? Introduces N(N-1) dimensions
- Bi-cluster: uses the mean-squared residue score of a submatrix (I, J)
  - H(I, J) = (1 / (|I||J|)) * Σ_{i∈I, j∈J} (d_ij - d_iJ - d_Ij + d_IJ)²
  - where d_iJ is the mean of row i over the columns in J, d_Ij is the mean of column j over the rows in I, and d_IJ is the mean of the whole submatrix
- A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0
- Problems with bi-cluster
  - No downward closure property
  - Due to averaging, a submatrix may contain outliers yet still stay within the δ-threshold
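As a concrete reading of the definition above, here is a small sketch (function and variable names are mine; NumPy assumed) that computes H(I, J) for a submatrix; a perfect shift pattern scores 0.

```python
import numpy as np

def msr(D, rows, cols):
    S = D[np.ix_(rows, cols)]                  # submatrix (I, J)
    row_mean = S.mean(axis=1, keepdims=True)   # d_iJ
    col_mean = S.mean(axis=0, keepdims=True)   # d_Ij
    all_mean = S.mean()                        # d_IJ
    return ((S - row_mean - col_mean + all_mean) ** 2).mean()  # H(I, J)

# Row 2 = row 1 + 1 is a pure shift pattern, so its residue is 0.
D = np.array([[1.0, 3.0, 5.0],
              [2.0, 4.0, 6.0]])
print(msr(D, [0, 1], [0, 1, 2]))   # 0.0
```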
14 p-Clustering
- Given objects x, y in O and features a, b in T, a pCluster is defined on the 2-by-2 submatrix
  - X = [[d_xa, d_xb], [d_ya, d_yb]], with pScore(X) = |(d_xa - d_xb) - (d_ya - d_yb)|
- A pair (O, T) is a δ-pCluster if for any 2-by-2 submatrix X in (O, T), pScore(X) ≤ δ for some δ > 0
- Properties of δ-pCluster
  - Downward closure
  - Clusters are more homogeneous than bi-clusters (thus the name: pair-wise Cluster)
- A pattern-growth algorithm has been developed for efficient mining
- For scaling patterns, one can observe that taking the logarithm of the values turns a ratio (scaling) condition into the pScore (shift) form
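A minimal sketch of the pScore and the δ-pCluster test (the function names and the matrix layout D[object][feature] are my assumptions):

```python
from itertools import combinations

def p_score(d_xa, d_xb, d_ya, d_yb):
    # How far the 2x2 submatrix deviates from a perfect shift pattern.
    return abs((d_xa - d_xb) - (d_ya - d_yb))

def is_delta_pcluster(D, objects, features, delta):
    # (O, T) qualifies iff *every* 2x2 submatrix has pScore <= delta.
    return all(p_score(D[x][a], D[x][b], D[y][a], D[y][b]) <= delta
               for x, y in combinations(objects, 2)
               for a, b in combinations(features, 2))
```

Because every 2-by-2 submatrix of a sub-pair of (O, T) is also a submatrix of (O, T), any sub-pair of a δ-pCluster is itself a δ-pCluster; this is the downward closure property that the pattern-growth mining exploits.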
15 Cluster Analysis
- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Clustering Methods
16 Model-Based Clustering
- Assume the data is generated from K probability distributions
- Typically Gaussian distributions; a soft or probabilistic version of K-means clustering
- Need to find the distribution parameters
- EM Algorithm
17 EM Algorithm
- Initialize K cluster centers
- Iterate between two steps
  - Expectation step: assign points to clusters (probabilistically)
  - Maximization step: estimate model parameters
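A compact sketch of EM for a mixture of K spherical Gaussians (a simplification; the general algorithm fits full covariance matrices). NumPy is assumed, and the function name and defaults are mine.

```python
import numpy as np

def em_gmm(X, k, n_iter=100, seed=0):
    """Fit a k-component spherical Gaussian mixture to X by EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]   # init: k random points
    var = np.full(k, X.var())                      # per-component variance
    pi = np.full(k, 1.0 / k)                       # mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = P(component j | point i).
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_p = np.log(pi) - 0.5 * (sq / var + d * np.log(2 * np.pi * var))
        log_p -= log_p.max(axis=1, keepdims=True)  # for numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments.
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        var = (r * sq).sum(axis=0) / (d * nk)
    return pi, mu, var
```

With the responsibilities replaced by hard 0/1 assignments and the variances held fixed, the loop reduces to ordinary K-means, which is why the slides call this a soft version of K-means.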
18 Cluster Analysis
- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Clustering Methods
- Cluster Validity
19 Cluster Validity
- For supervised classification, we have a variety of measures to evaluate how good our model is
  - Accuracy, precision, recall
- For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters?
- But "clusters are in the eye of the beholder"!
- Then why do we want to evaluate them?
  - To avoid finding patterns in noise
  - To compare clustering algorithms
  - To compare two sets of clusters
  - To compare two clusters
20 Clusters Found in Random Data
Random Points
21 Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data
2. Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels
3. Evaluating how well the results of a cluster analysis fit the data without reference to external information (use only the data)
4. Comparing the results of two different sets of cluster analyses to determine which is better
5. Determining the correct number of clusters
- For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters
22 Measures of Cluster Validity
- Numerical measures that are applied to judge various aspects of cluster validity are classified into the following three types
  - External index: measures the extent to which cluster labels match externally supplied class labels
    - Example: entropy
  - Internal index: measures the goodness of a clustering structure without respect to external information
    - Example: Sum of Squared Error (SSE)
  - Relative index: compares two different clusterings or clusters
    - Often an external or internal index is used for this function, e.g., SSE or entropy
- Sometimes these are referred to as criteria instead of indices
  - However, sometimes "criterion" is the general strategy and "index" is the numerical measure that implements the criterion
23 Measuring Cluster Validity via Correlation
- Two matrices
  - Proximity matrix
  - Incidence matrix
    - One row and one column for each data point
    - An entry is 1 if the associated pair of points belongs to the same cluster
    - An entry is 0 if the associated pair of points belongs to different clusters
- Compute the correlation between the two matrices (see the sketch after this list)
  - Since the matrices are symmetric, only the correlation between the n(n-1)/2 entries above the diagonal needs to be calculated
- High correlation indicates that points that belong to the same cluster are close to each other
- Not a good measure for some density- or contiguity-based clusters
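A sketch of the computation (SciPy/NumPy assumed; function name is mine). Since the proximity matrix here holds distances, a good clustering yields a strongly negative correlation, as on the next slide.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def validity_correlation(X, labels):
    labels = np.asarray(labels)
    proximity = squareform(pdist(X))        # pairwise distance matrix
    incidence = (labels[:, None] == labels[None, :]).astype(float)
    iu = np.triu_indices(len(labels), k=1)  # the n(n-1)/2 unique entries
    return np.corrcoef(proximity[iu], incidence[iu])[0, 1]
```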
24 Measuring Cluster Validity via Correlation
- Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets
  - Corr = -0.9235
  - Corr = -0.5810
25 Using Similarity Matrix for Cluster Validation
- Order the similarity matrix with respect to cluster labels and inspect visually (a plotting sketch follows)
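A plotting sketch (matplotlib and SciPy assumed; the distance-to-similarity mapping is an arbitrary choice of mine). Good clusterings show bright blocks along the diagonal of the reordered matrix; random data does not.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform

def plot_ordered_similarity(X, labels):
    order = np.argsort(labels)                 # group points by cluster label
    dist = squareform(pdist(np.asarray(X)[order]))
    sim = 1 - dist / dist.max()                # simple similarity transform
    plt.imshow(sim, cmap="viridis")
    plt.colorbar(label="similarity")
    plt.show()
```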
26 Using Similarity Matrix for Cluster Validation
- Clusters in random data are not so crisp
DBSCAN
27 Using Similarity Matrix for Cluster Validation
- Clusters in random data are not so crisp
K-means
28 Using Similarity Matrix for Cluster Validation
- Clusters in random data are not so crisp
Complete Link
29 Using Similarity Matrix for Cluster Validation
DBSCAN
30 Internal Measures: SSE
- Clusters in more complicated figures aren't well separated
- Internal index: measures the goodness of a clustering structure without respect to external information
  - SSE = Σ_i Σ_{x∈C_i} dist(x, m_i)², where m_i is the centroid of cluster C_i
- SSE is good for comparing two clusterings or two clusters (average SSE)
- Can also be used to estimate the number of clusters (see the sketch after this list)
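A sketch of the "elbow" use of SSE (scikit-learn assumed; its fitted KMeans exposes the SSE of the clustering as `inertia_`; the function name is mine).

```python
from sklearn.cluster import KMeans

def sse_curve(X, k_values):
    """SSE of the best-of-10 K-means run for each candidate K."""
    return {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in k_values}

# Look for the "knee" where adding clusters stops reducing SSE much:
# for k, sse in sse_curve(X, range(2, 11)).items(): print(k, round(sse, 1))
```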
31 Internal Measures: SSE
- SSE curve for a more complicated data set
SSE of clusters found using K-means
32 Framework for Cluster Validity
- Need a framework to interpret any measure
  - For example, if our measure of evaluation has the value 10, is that good, fair, or poor?
- Statistics provide a framework for cluster validity
  - The more "atypical" a clustering result is, the more likely it represents valid structure in the data
  - Can compare the values of an index that result from random data or clusterings to those of a clustering result
    - If the value of the index is unlikely, then the cluster results are valid
  - These approaches are more complicated and harder to understand
- For comparing the results of two different sets of cluster analyses, a framework is less necessary
  - However, there is the question of whether the difference between two index values is significant
33 Internal Measures: Cohesion and Separation
- Cluster cohesion: measures how closely related the objects in a cluster are
  - Example: SSE
- Cluster separation: measures how distinct or well-separated a cluster is from other clusters
  - Example: squared error
    - Cohesion is measured by the within-cluster sum of squares: WSS = Σ_i Σ_{x∈C_i} (x - m_i)²
    - Separation is measured by the between-cluster sum of squares: BSS = Σ_i |C_i| (m - m_i)², where |C_i| is the size of cluster i, m_i is its centroid, and m is the overall mean (see the sketch after this list)
34 Internal Measures: Cohesion and Separation
- A proximity-graph-based approach can also be used for cohesion and separation
  - Cluster cohesion is the sum of the weights of all links within a cluster
  - Cluster separation is the sum of the weights between nodes in the cluster and nodes outside the cluster
[Figure: within-cluster edges illustrate cohesion; between-cluster edges illustrate separation]
35 Internal Measures: Silhouette Coefficient
- The silhouette coefficient combines ideas of both cohesion and separation, but for individual points, as well as clusters and clusterings
- For an individual point i
  - Calculate a = average distance of i to the points in its own cluster
  - Calculate b = min (average distance of i to the points in another cluster)
  - The silhouette coefficient for the point is then s = 1 - a/b if a < b (or s = b/a - 1 if a ≥ b, not the usual case)
  - Typically between 0 and 1
  - The closer to 1 the better
- Can calculate the average silhouette width for a cluster or a clustering (see the sketch after this list)
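A sketch from the definition above (X and labels as NumPy arrays; function names are mine; in practice `sklearn.metrics.silhouette_score` computes the same average directly). The piecewise form used here is equivalent to the common expression s = (b - a) / max(a, b).

```python
import numpy as np

def silhouette(X, labels, i):
    """Silhouette coefficient of point i (assumes clusters with >= 2 points)."""
    d = np.linalg.norm(X - X[i], axis=1)          # distances from point i
    own = (labels == labels[i]) & (np.arange(len(X)) != i)
    a = d[own].mean()                             # avg intra-cluster distance
    b = min(d[labels == c].mean()                 # closest other cluster
            for c in np.unique(labels) if c != labels[i])
    return 1 - a / b if a < b else b / a - 1

def avg_silhouette(X, labels):
    return np.mean([silhouette(X, labels, i) for i in range(len(X))])
```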
36 Final Comment on Cluster Validity
- "The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage."
  - Algorithms for Clustering Data, Jain and Dubes