Title: Clustering Analysis
1. Clustering Analysis
- Presented by Ching-Pin Kao
2. Problem Description
- Cluster analysis is the classification of items into a number of groups, or clusters.
- Clusters are collections of objects whose intra-class similarity is high and inter-class similarity is low.
- Tasks of clustering analysis
- Similarity measurements
- Clustering methods
- Validation techniques
3. Similarity Measurements
- Types of similarity measurements
- Distance measurements
- Correlation coefficients
- Association coefficients
- Probabilistic similarity coefficients
4. Similarity Measurements: Correlation Coefficients
- The most popular correlation coefficient is the Pearson correlation coefficient (1892).
- Correlation between X = (X_1, X_2, ..., X_n) and Y = (Y_1, Y_2, ..., Y_n):
  $$r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2}}$$
- where $\bar{X}$ and $\bar{Y}$ are the means of the $X_i$ and $Y_i$.
5. Similarity Measurements: Correlation Coefficients (Cont.)
- It captures the similarity of the shapes of two expression profiles, and ignores differences between their magnitudes.
- (Figure: two profiles with the same shape, r = 1.0)
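As a sanity check on the formula, here is a minimal Python sketch of the Pearson coefficient applied to two made-up profiles that share a shape but differ in magnitude:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = math.sqrt(sum((xi - mx) ** 2 for xi in x)
                    * sum((yi - my) ** 2 for yi in y))
    return num / den

# Same shape, different magnitude: the coefficient ignores scale.
print(pearson([1, 2, 3, 4], [10, 20, 30, 40]))  # 1.0
```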
6. Concept: Correlation Coefficients
7. Validation Techniques
- Types of validation techniques
- External indices
- based on some gold standard
- Matching coefficient, Jaccard coefficient
- Internal indices
- based on some statistics of the results
- Hubert's Γ statistic
8. Validation Techniques: Hubert's Γ Statistic
- X = [X(i, j)] and Y = [Y(i, j)] are two n × n matrices
- X(i, j): similarity of object i and object j
- Y(i, j): the counterpart derived from the clustering result (e.g., 1 if objects i and j fall in the same cluster, 0 otherwise)
- Hubert's Γ statistic represents the point serial correlation:
  $$\Gamma = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} X(i, j)\,Y(i, j)$$
- where M = n(n - 1) / 2 is the number of entries in the double sum
- A higher value of Γ represents better clustering quality.
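A minimal Python sketch of the statistic, assuming X is a precomputed similarity matrix and Y the 0/1 same-cluster matrix described above (both as nested lists):

```python
def hubert_gamma(X, Y):
    """Hubert's Gamma: average of X(i,j) * Y(i,j) over the upper triangle."""
    n = len(X)
    M = n * (n - 1) // 2  # number of entries in the double sum
    return sum(X[i][j] * Y[i][j]
               for i in range(n - 1)
               for j in range(i + 1, n)) / M

# Hypothetical usage: derive Y from a vector of cluster labels.
labels = [0, 0, 1, 1]
Y = [[1 if labels[i] == labels[j] else 0 for j in range(4)] for i in range(4)]
```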
9. Concept: Hubert's Γ Statistic
10. Validation Techniques: External Indices
- Given two binary matrices A and B of the same dimensions, count the entries where
- a: A = 1 and B = 1, b: A = 1 and B = 0, c: A = 0 and B = 1, d: A = 0 and B = 0
- Matching coefficient = (a + d) / (a + b + c + d)
- Jaccard coefficient = a / (a + b + c)
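Under the common reading that A(i, j) and B(i, j) record whether objects i and j are paired by the clustering and by the gold standard respectively, both indices can be computed as in this sketch:

```python
def external_indices(A, B):
    """Matching and Jaccard coefficients of two binary matrices A and B."""
    a = b = c = d = 0
    for row_a, row_b in zip(A, B):
        for ea, eb in zip(row_a, row_b):
            if ea and eb:
                a += 1          # 1 in both matrices
            elif ea:
                b += 1          # 1 only in A
            elif eb:
                c += 1          # 1 only in B
            else:
                d += 1          # 0 in both
    return (a + d) / (a + b + c + d), a / (a + b + c)
```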
11. Types of Clustering Methods
- Partitioning
- K-Means, K-Medoids, PAM, CLARA, CLARANS, CAST, ...
- Hierarchical
- HAC, BIRCH, CURE, ROCK, CHAMELEON, ...
- Density-based
- DBSCAN, OPTICS, CLIQUE, WaveCluster, ...
- Grid-based
- STING, CLIQUE, WaveCluster, ...
- Model-based
- COBWEB, SOM, CLASSIT, AutoClass, ...
- Outlier analysis
- OLAP, ...
12. Hierarchical
- bottom-up (agglomerative)
- top-down (divisive)
- Distinction between agglomerative and divisive techniques
14. HAC (Hierarchical Agglomerative Clustering)
- Proximity matrix
- Single link (minimum inter-cluster distance) vs. complete link (maximum inter-cluster distance)
15. Partitioning
- (Figure: a partition with n = 20 and k = 3)
16. k-Means Clustering
- 1. Select an initial partition with K clusters. Repeat steps 2-5 until the cluster membership stabilizes.
- 2. Generate a new partition by assigning each pattern to its closest cluster centre.
- 3. Compute new cluster centres as the centroids of the clusters.
- 4. Repeat steps 2 and 3 until an optimum value of the criterion function is found.
- 5. Adjust the number of clusters by merging and splitting existing clusters, or by removing small clusters or outliers (a sketch of steps 1-4 follows below).
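A minimal sketch of steps 1-4 for 2-D points; step 5 (the merge/split adjustment) is omitted, and an empty cluster simply keeps its old centre:

```python
import random

def k_means(points, k, iters=100):
    """Alternate assignment (step 2) and centroid update (step 3)."""
    centres = random.sample(points, k)       # step 1: initial centres
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x, y in points:                  # step 2: assign to closest centre
            i = min(range(k),
                    key=lambda c: (x - centres[c][0]) ** 2
                                  + (y - centres[c][1]) ** 2)
            clusters[i].append((x, y))
        new_centres = [(sum(x for x, _ in cl) / len(cl),
                        sum(y for _, y in cl) / len(cl)) if cl else centres[i]
                       for i, cl in enumerate(clusters)]  # step 3: centroids
        if new_centres == centres:           # step 4: membership stabilized
            break
        centres = new_centres
    return centres
```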
17. k-Means Clustering (Cont.)
18. k-Medoid Methods
- Medoid: the optimal representative object for each cluster (the most centrally located object within the cluster)
- k-medoid methods: methods that partition around medoids.
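"Most centrally located" can be made precise as the object minimizing the total distance to the other members of its cluster. A one-line sketch, where d is any dissimilarity function:

```python
def medoid(cluster, d):
    """Object of the cluster with the smallest total distance to the rest."""
    return min(cluster, key=lambda o: sum(d(o, p) for p in cluster))
```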
19. PAM (Partitioning Around Medoids)
- Based on the k-medoid model
- Oi: a selected object (medoid)
- Oh: a nonselected object
- Swap: Oi is replaced by Oh as a medoid
- Consider another nonselected object Oj and calculate its cost (contribution) Cjih to the swap
20. PAM (Cont.)
- a. Oj belongs to a cluster other than the one represented by Oi. Let Ol be the representative object of that cluster.
- a1. d(Oj, Oh) ≥ d(Oj, Ol)
- After swapping, Oj would still belong to the cluster represented by Ol.
- Cjih = 0
- a2. d(Oj, Oh) < d(Oj, Ol)
- After swapping, Oj would belong to the cluster represented by Oh.
- Cjih = d(Oj, Oh) - d(Oj, Ol) < 0
21. PAM (Cont.)
- b. Oj belongs to the cluster represented by Oi. Let Oj,2 be the second most similar medoid to Oj.
- b1. d(Oj, Oh) ≥ d(Oj, Oj,2)
- After swapping, Oj would belong to the cluster represented by Oj,2.
- Cjih = d(Oj, Oj,2) - d(Oj, Oi) ≥ 0
- b2. d(Oj, Oh) < d(Oj, Oj,2)
- After swapping, Oj would belong to the cluster represented by Oh.
- Cjih = d(Oj, Oh) - d(Oj, Oi)
22. PAM (Cont.)
- Combining cases a and b, the total cost of replacing Oi with Oh is TCih = Σj Cjih (a sketch follows below).
- (Figures: the swap cases, showing the roles of Oi, Oh, Ol, Oj,2, and Oj)
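The case analysis of slides 20-21 translates directly into code. A sketch, where medoids is the current set of medoids, d any dissimilarity function, and the helper names are mine:

```python
def swap_cost(Oj, Oi, Oh, medoids, d):
    """Contribution Cjih of nonselected object Oj to the swap of Oi for Oh."""
    nearest_other = min((m for m in medoids if m != Oi), key=lambda m: d(Oj, m))
    if d(Oj, nearest_other) < d(Oj, Oi):
        # Case a: Oj belongs to the cluster of some other medoid Ol.
        Ol = nearest_other
        return min(d(Oj, Oh) - d(Oj, Ol), 0)       # a1 gives 0, a2 is negative
    # Case b: Oj belongs to Oi's cluster; Oj2 is its second-best medoid.
    Oj2 = nearest_other
    return min(d(Oj, Oh), d(Oj, Oj2)) - d(Oj, Oi)  # b1 or b2

def total_swap_cost(nonselected, Oi, Oh, medoids, d):
    """TCih = sum of Cjih over all nonselected objects Oj."""
    return sum(swap_cost(Oj, Oi, Oh, medoids, d) for Oj in nonselected)
```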
23. Algorithm PAM
24. CLARA (Clustering LARge Applications)
- CLARA draws a sample of the data set, applies PAM to the sample, and finds the medoids of the sample.
- CLARA draws multiple samples and gives the best clustering as the output.
- Experiments indicate that 5 samples of size 40 + 2k give satisfactory results.
25. Algorithm CLARA
- 1. For i = 1 to 5, repeat the following steps:
- 2. Draw a sample of 40 + 2k objects randomly from the entire data set [1], and call Algorithm PAM to find k medoids of the sample.
- [1] Apart from the first sample, subsequent samples include the best set of medoids found so far.
26. Algorithm CLARA (Cont.)
- 3. For each object Oj in the entire data set, determine which of the k medoids is the most similar to Oj.
- 4. Calculate the average distance of the clustering obtained in the previous step. If this value is less than the current minimum, use this value as the current minimum, and retain the k medoids found in Step (2) as the best set of medoids obtained so far.
- 5. Return to Step (1) to start the next iteration (a sketch of the whole loop follows below).
27. Density-based: DBSCAN
- Eps: the radius that defines the neighborhood of an object
- MinPts: the threshold that the cardinality of the neighborhood has to exceed
- directly density-reachable
- p ∈ NEps(q)
- Card(NEps(q)) ≥ MinPts
- density-reachable
- p >D q: there is a chain of objects from q to p, each directly density-reachable from the previous one
- (Figure: p directly density-reachable from q)
28. DBSCAN (Cont.)
- density-connected
- p and q are density-connected if there is an object o with p >D o and q >D o
- cluster C
- Maximality: ∀ p, q ∈ D: if p ∈ C and q >D p, then q ∈ C
- Connectivity: ∀ p, q ∈ C: p is density-connected to q in C
- noise: {p ∈ D | ∀i: p ∉ Ci}
29. DBSCAN (Cont.)
- two different kinds of objects in a clustering
- core objects
- non-core objects: border objects and noise objects
30. DBSCAN Algorithm
- Algorithm DBSCAN(D, Eps, MinPts)
-   // Precondition: All objects in D are unclassified.
-   FORALL objects o in D DO
-     IF o is unclassified
-       call function expand_cluster to construct a cluster wrt. Eps and MinPts containing o.
31. DBSCAN Algorithm (Cont.)
- FUNCTION expand_cluster(o, D, Eps, MinPts)
-   retrieve the Eps-neighborhood NEps(o) of o
-   IF |NEps(o)| < MinPts  // i.e. o is not a core object
-     mark o as noise and RETURN
-   ELSE  // i.e. o is a core object
-     select a new cluster-id and mark all objects in NEps(o) with this cluster-id
-     push all objects from NEps(o) \ {o} onto the stack seeds
-     WHILE NOT seeds.empty() DO
-       currentObject := seeds.top()
-       seeds.pop()
-       retrieve the Eps-neighborhood NEps(currentObject) of currentObject
-       IF |NEps(currentObject)| ≥ MinPts
-         select all objects in NEps(currentObject) that are not yet classified or are marked as noise,
-         push the unclassified ones onto seeds, and mark all of these objects with the current cluster-id
-     RETURN
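The pseudocode maps almost line for line onto Python. A minimal sketch over a list of objects D with an arbitrary distance function d (a real implementation would use a spatial index for the neighborhood queries):

```python
def dbscan(D, eps, min_pts, d):
    """Return one label per object: a cluster id (0, 1, ...) or -1 for noise."""
    UNCLASSIFIED, NOISE = None, -1
    labels = [UNCLASSIFIED] * len(D)

    def neighborhood(i):
        return [j for j in range(len(D)) if d(D[i], D[j]) <= eps]

    cluster_id = 0
    for i in range(len(D)):
        if labels[i] is not UNCLASSIFIED:
            continue
        nbrs = neighborhood(i)
        if len(nbrs) < min_pts:            # i is not a core object
            labels[i] = NOISE
            continue
        for j in nbrs:                     # new cluster around core object i
            labels[j] = cluster_id
        seeds = [j for j in nbrs if j != i]
        while seeds:
            cur = seeds.pop()
            cur_nbrs = neighborhood(cur)
            if len(cur_nbrs) >= min_pts:   # cur is also a core object
                for j in cur_nbrs:
                    if labels[j] is UNCLASSIFIED:
                        seeds.append(j)
                        labels[j] = cluster_id
                    elif labels[j] == NOISE:   # noise becomes a border object
                        labels[j] = cluster_id
        cluster_id += 1
    return labels
```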
32. CAST
- The clusters are constructed one at a time.
- The currently constructed cluster is denoted by Copen.
- Affinity of element x: a(x) = Σ S(x, y) over all y ∈ Copen
- high affinity: a(x) ≥ t · |Copen|
- low affinity: a(x) < t · |Copen|
- CAST alternates between adding high-affinity elements to Copen and removing low-affinity elements from it (see the sketch below).
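A simplified Python sketch of this loop, assuming S is an n × n similarity matrix with S[x][x] included in the affinities (as in the example on the next slides) and t the affinity threshold; the full algorithm adds safeguards against ADD/REMOVE oscillation that are omitted here:

```python
def cast(S, t):
    """Build clusters one at a time from a similarity matrix S."""
    n = len(S)
    U = set(range(n))                    # elements not yet clustered
    clusters = []
    while U:
        c_open = set()                   # the cluster under construction
        a = [0.0] * n                    # a(x) = sum of S[x][y] for y in c_open
        changed = True
        while changed:
            changed = False
            if U:
                x = max(U, key=lambda e: a[e])
                # ADD the highest-affinity free element while it is "high"
                if not c_open or a[x] >= t * len(c_open):
                    U.remove(x)
                    c_open.add(x)
                    for e in range(n):
                        a[e] += S[e][x]
                    changed = True
                    continue
            x = min(c_open, key=lambda e: a[e])
            # REMOVE the lowest-affinity member while it is "low"
            if a[x] < t * len(c_open):
                c_open.remove(x)
                U.add(x)
                for e in range(n):
                    a[e] -= S[e][x]
                changed = True
        clusters.append(sorted(c_open))
    return clusters
```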
34. CAST Example
- 1. C = Ø, U = {1, 2, 3, ..., 10}
- 2. Copen = Ø, a(·) = 0
- 3. ADD
- 3.1 Arbitrarily choose an element, say 1 ∈ U
- Copen = {1}, U = {2, 3, ..., 10}
- a(1) = S(1, 1), ..., a(10) = S(10, 1)
- 3.2 If element 3 ∈ U has the maximum (high) affinity
- Copen = {1, 3}, U = {2, 4, ..., 10}
- a(1) ← a(1) + S(1, 3), ..., a(10) ← a(10) + S(10, 3)
- 3.3 Repeat ADD until all elements in U have low affinity
35. CAST Example (Cont.)
- 4. Suppose Copen = {1, 2, 3, 7, 10} and U = {4, 5, 6, 8, 9} after the ADD procedure.
- 5. REMOVE
- 5.1 If element 2 ∈ Copen has the minimum (low) affinity
- Copen = {1, 3, 7, 10}, U = {2, 4, 5, 6, 8, 9}
- a(1) ← a(1) - S(1, 2), ..., a(10) ← a(10) - S(10, 2)
- 5.2 Repeat REMOVE until all elements in Copen have high affinity
- 6. C = C ∪ {Copen}
- 7. Repeat from step 2 until U = Ø