Title: Clustering Analysis
1 Clustering Analysis
- Presented by Ching-Pin Kao
2 Problem Description
- Cluster analysis is the classification of items into a number of groups, or clusters.
- Clusters: collections of objects whose intra-class similarity is high and inter-class similarity is low.
- Tasks of clustering analysis
  - Similarity measurements
  - Clustering methods
  - Validation techniques
3 Similarity Measurements
- Types of similarity measurements
  - Distance measurements
  - Correlation coefficients
  - Association coefficients
  - Probabilistic similarity coefficients
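4 Similarity Measurements: Correlation Coefficients
- The most popular correlation coefficient is the Pearson correlation coefficient (1892).
- Correlation between $X = \{X_1, X_2, \ldots, X_n\}$ and $Y = \{Y_1, Y_2, \ldots, Y_n\}$:
  $$r = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{\sigma_X} \right) \left( \frac{Y_i - \bar{Y}}{\sigma_Y} \right)$$
- where $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ and $\sigma_X = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2}$, and likewise for $\bar{Y}$ and $\sigma_Y$.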
5 Similarity Measurements: Correlation Coefficients (Cont.)
- It captures similarity of the shapes of two expression profiles, and ignores differences between their magnitudes.
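The point about shape versus magnitude can be checked with a small sketch. It assumes the standard population form of the Pearson coefficient (division by n); the function name `pearson` and the sample profiles are illustrative only and do not come from the slides.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (population form)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population standard deviations (divide by n, not n - 1).
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return covariance / (sd_x * sd_y)

# Hypothetical expression profiles.
profile = [1.0, 3.0, 2.0, 5.0]
scaled = [10.0 * v + 7.0 for v in profile]  # same shape, different magnitude
flipped = [-v for v in profile]             # mirrored shape

r_same = pearson(profile, scaled)
r_opposite = pearson(profile, flipped)
```

Here `r_same` comes out at 1 (up to rounding) even though the magnitudes differ by a factor of ten, while the mirrored profile gives -1.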
6 Concept: Correlation Coefficients
7 Validation Techniques
- Types of validation techniques
  - External indices
    - based on some gold standards
    - Matching coefficient, Jaccard coefficient
  - Internal indices
    - based on some statistics of the results
    - Hubert's Γ statistic
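8 Validation Techniques: Hubert's Γ Statistic
- $X = [X(i, j)]$ and $Y = [Y(i, j)]$ are two $n \times n$ matrices
  - $X(i, j)$: similarity of object i and object j
- Hubert's Γ statistic represents the point serial correlation:
  $$\Gamma = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left( \frac{X(i, j) - \bar{X}}{\sigma_X} \right) \left( \frac{Y(i, j) - \bar{Y}}{\sigma_Y} \right)$$
- where $M = n(n - 1)/2$ is the number of entries in the double sum, and $\bar{X}$, $\bar{Y}$, $\sigma_X$, $\sigma_Y$ are the means and standard deviations of those entries.
- A higher value of Γ represents better clustering quality.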
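A minimal sketch of the normalized Γ statistic follows. The slide text does not preserve the definition of Y(i, j), so the sketch assumes the common choice Y(i, j) = 1 when objects i and j share a cluster and 0 otherwise. The function name `hubert_gamma` and the similarity 1 / (1 + distance) are likewise assumptions made for illustration.

```python
import math

def hubert_gamma(X, labels):
    """Normalized Hubert's Gamma over the M = n(n-1)/2 pairs with i < j.

    X is an n x n similarity matrix. Y(i, j) is ASSUMED to be 1 when
    objects i and j carry the same cluster label, and 0 otherwise.
    """
    n = len(labels)
    xs, ys = [], []
    for i in range(n - 1):
        for j in range(i + 1, n):
            xs.append(X[i][j])
            ys.append(1.0 if labels[i] == labels[j] else 0.0)
    m = len(xs)  # equals n * (n - 1) / 2
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in xs) / m)
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in ys) / m)
    total = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    return total / (m * sd_x * sd_y)

# Hypothetical data: four objects on a line, in two tight groups.
values = [0.0, 1.0, 10.0, 11.0]
X = [[1.0 / (1.0 + abs(a - b)) for b in values] for a in values]

gamma_good = hubert_gamma(X, [0, 0, 1, 1])  # groups match the data
gamma_bad = hubert_gamma(X, [0, 1, 0, 1])   # groups cut across the data
```

The labelling that follows the two tight groups scores close to 1, while the labelling that cuts across them scores below 0, which matches the reading that a higher Γ indicates better clustering.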
9 Concept: Hubert's Γ Statistic
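10 Validation Techniques: External Indices
- Given two binary matrices A and B of the same dimensions, count the entries where
  - a: A = 1 and B = 1
  - b: A = 1 and B = 0
  - c: A = 0 and B = 1
  - d: A = 0 and B = 0
- Matching coefficient: (a + d) / (a + b + c + d)
- Jaccard coefficient: a / (a + b + c)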
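Both indices can be sketched in a few lines. The sketch assumes the usual reading of the counts: a is the number of entries where both matrices hold 1, d where both hold 0, and b and c the two kinds of mismatch. The function names and the example matrices (a clustering's co-membership matrix against a hypothetical gold standard) are illustrative only.

```python
def agreement_counts(A, B):
    """Count entry-wise agreements between two binary matrices of equal shape."""
    a = b = c = d = 0
    for row_a, row_b in zip(A, B):
        for x, y in zip(row_a, row_b):
            if x == 1 and y == 1:
                a += 1      # both 1
            elif x == 1 and y == 0:
                b += 1      # only A is 1
            elif x == 0 and y == 1:
                c += 1      # only B is 1
            else:
                d += 1      # both 0
    return a, b, c, d

def matching_coefficient(A, B):
    a, b, c, d = agreement_counts(A, B)
    return (a + d) / (a + b + c + d)

def jaccard_coefficient(A, B):
    a, b, c, _ = agreement_counts(A, B)
    return a / (a + b + c)

# Hypothetical co-membership matrices for three objects.
A = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]  # clustering result
B = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # gold standard

matching = matching_coefficient(A, B)  # (3 + 4) / 9
jaccard = jaccard_coefficient(A, B)    # 3 / (3 + 2 + 0)
```

The Jaccard coefficient ignores the entries where both matrices are 0, so it is the stricter of the two when most object pairs are in different clusters.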
11 Types of Clustering Methods
- Partitioning
  - K-Means, K-Medoids, PAM, CLARA, CLARANS, CAST, ...
- Hierarchical
- Density-based
- Grid-based
  - STING, CLIQUE, WaveCluster, ...
- Model-based
- Outlier analysis
12 Distinction between Agglomerative and Divisive
14 HAC (Hierarchical Agglomerative Clustering)
- Proximity matrix
- Single link, complete link
15 A partition with n = 20 and k = 3
16 k-Means Clustering
- 1. Select an initial partition with K clusters. Repeat steps 2-5 until the cluster membership stabilizes.
- 2. Generate a new partition by assigning each pattern to its closest cluster centre.
- 3. Compute new cluster centres as the centroids of the clusters.
- 4. Repeat steps 2 and 3 until an optimum value of the criterion function is found.
- 5. Adjust the number of clusters by merging and splitting existing clusters or by removing small clusters, or outliers.
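The core loop (steps 2 to 4) can be sketched as follows. This is a minimal illustration, not the presenter's implementation: it takes the initial centres as an argument, uses squared Euclidean distance, and leaves out the optional step 5 (merging, splitting, or dropping clusters). The function names and the sample points are assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, initial_centres, max_iter=100):
    """Steps 2-4 only; step 5 (adjusting the number of clusters) is omitted."""
    centres = list(initial_centres)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each pattern to its closest cluster centre.
        new_assignment = [
            min(range(len(centres)), key=lambda c: squared_distance(p, centres[c]))
            for p in points
        ]
        if new_assignment == assignment:  # cluster membership has stabilized
            break
        assignment = new_assignment
        # Step 3: recompute each centre as the centroid of its cluster.
        for c in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centre
                centres[c] = centroid(members)
    return assignment, centres

# Hypothetical 2-D patterns forming two groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
assignment, centres = k_means(points, [points[0], points[3]])
```

Stopping when the assignment no longer changes is one way to read "until the cluster membership stabilizes"; a tolerance on the criterion function would be an equally reasonable stopping rule.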
17 k-Means Clustering (Cont.)
18 k-Medoid Methods
- Medoid: the optimal representative object for each cluster (the most centrally located object within the cluster)
- k-medoid methods: methods that partition the data around medoids.
19 PAM (Partitioning Around Medoids)
- Based on the k-medoid model
- Oi: a selected object
- Oh: a nonselected object
- Swap: Oi is replaced by Oh as a medoid
- Consider another nonselected object Oj and calculate its cost (contribution) Cjih to the swap
20 PAM (Cont.)
- a. Oj belongs to a cluster other than the one represented by Oi. Let Ol be the representative object of that cluster.
  - a1. d(Oj, Oh) ≥ d(Oj, Ol)
    - After swapping, Oj would belong to the cluster represented by Ol.
    - Cjih = 0
  - a2. d(Oj, Oh) < d(Oj, Ol)
    - After swapping, Oj would belong to the cluster represented by Oh.
    - Cjih = d(Oj, Oh) - d(Oj, Ol) < 0
21 PAM (Cont.)
- b. Oj belongs to the cluster represented by Oi. Let Oj,2 be the second most similar medoid to Oj.
  - b1. d(Oj, Oh) ≥ d(Oj, Oj,2)
    - After swapping, Oj would belong to the cluster represented by Oj,2.
    - Cjih = d(Oj, Oj,2) - d(Oj, Oi) ≥ 0
  - b2. d(Oj, Oh) < d(Oj, Oj,2)
    - After swapping, Oj would belong to the cluster represented by Oh.
    - Cjih = d(Oj, Oh) - d(Oj, Oi)
22 PAM (Cont.)
- [Figure: illustrations of cases a. and b.]
- The total cost of replacing Oi with Oh: TCih = Σj Cjih
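The case analysis for Cjih and its sum over all other nonselected objects Oj can be sketched as below. The sketch assumes at least two medoids, identifies objects by their index in a list, and uses a caller-supplied distance function. The name `swap_cost` and the one-dimensional data are illustrative, not taken from the slides.

```python
def swap_cost(points, medoids, i, h, dist):
    """Total cost of replacing medoid index i with nonselected index h.

    Sums Cjih over every other nonselected object Oj, following cases
    a1, a2, b1 and b2. Assumes len(medoids) >= 2.
    """
    remaining = [m for m in medoids if m != i]
    total = 0.0
    for j in range(len(points)):
        if j in medoids or j == h:
            continue
        d_ji = dist(points[j], points[i])
        d_jh = dist(points[j], points[h])
        # Closest medoid other than Oi: Ol in case a, Oj,2 in case b.
        d_other = min(dist(points[j], points[m]) for m in remaining)
        if d_ji > d_other:
            # Case a: Oj currently belongs to another medoid Ol.
            # a1 contributes 0, a2 contributes d(Oj, Oh) - d(Oj, Ol) < 0.
            total += min(d_jh - d_other, 0.0)
        else:
            # Case b: Oj currently belongs to Oi.
            # b1 moves Oj to Oj,2, b2 moves Oj to Oh.
            total += min(d_jh, d_other) - d_ji
    return total

# Hypothetical 1-D objects; current medoids are the values 0 and 10.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
cost = swap_cost(points, medoids=[0, 3], i=0, h=1, dist=lambda a, b: abs(a - b))
```

In this example only the object with value 2 changes its distance (from 2 to 1), so the total is -1 and the swap would be accepted as an improvement.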
23 Algorithm PAM
24 CLARA (Clustering LARge Applications)
- CLARA draws a sample of the data set, applies PAM on the sample, and finds the medoids of the sample.
- CLARA draws multiple samples and gives the best clustering as the output.
- Experiments indicate that 5 samples of size 40 + 2k give satisfactory results.
25 Algorithm CLARA
- 1. For i = 1 to 5, repeat the following steps.
- 2. Draw a sample of 40 + 2k objects randomly from the entire data set [1], and call Algorithm PAM to find k medoids of the sample.
- [1] Apart from the first sample, subsequent samples include the best set of medoids found so far.
26 Algorithm CLARA (Cont.)
- 3. For each object Oj in the entire data set, determine which of the k medoids is the most similar to Oj.
- 4. Calculate the average distance of the clustering obtained in the previous step. If this value is less than the current minimum, use this value as the current minimum, and retain the k medoids found in Step (2) as the best set of medoids obtained so far.
- 5. Return to Step (1) to start the next iteration.
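Steps 1 to 5 can be sketched as follows. The `pam` function here is a simplified greedy swap search standing in for Algorithm PAM, whose details are not reproduced in this transcript. All names, the fixed random seed, and the sample data are assumptions, and the data values are assumed to be distinct.

```python
import random

def total_distance(data, medoids, dist):
    """Sum over all objects of the distance to the closest medoid."""
    return sum(min(dist(x, m) for m in medoids) for x in data)

def pam(sample, k, dist, initial=None):
    """Simplified stand-in for Algorithm PAM: accept improving swaps until none is left."""
    medoids = list(initial) if initial else sample[:k]
    best = total_distance(sample, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for h in sample:
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                cost = total_distance(sample, candidate, dist)
                if cost < best:
                    medoids, best, improved = candidate, cost, True
    return medoids

def clara(data, k, dist, n_samples=5, seed=0):
    rng = random.Random(seed)
    size = min(len(data), 40 + 2 * k)
    best_medoids, best_avg = None, float("inf")
    for _ in range(n_samples):                                 # Step 1
        sample = rng.sample(data, size)                        # Step 2
        if best_medoids is not None:
            # Footnote [1]: later samples include the best medoids so far.
            sample = best_medoids + [x for x in sample if x not in best_medoids]
        medoids = pam(sample, k, dist, initial=best_medoids)
        avg = total_distance(data, medoids, dist) / len(data)  # Steps 3-4
        if avg < best_avg:
            best_medoids, best_avg = medoids, avg
    return best_medoids, best_avg                              # after Step 5

# Hypothetical 1-D data: three groups of 20 distinct values each.
data = [float(base + v) for base in (0, 100, 200) for v in range(20)]
medoids, avg_distance = clara(data, k=3, dist=lambda a, b: abs(a - b))
```

PAM only ever sees the 40 + 2k sampled objects, while the quality of each candidate set of medoids is judged on the entire data set, which is what keeps CLARA affordable on large inputs.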
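27 Density-based: DBSCAN
- Eps: the radius of the neighborhood NEps(q) of an object q
- MinPts: the threshold that the cardinality of the neighborhood has to reach
- directly density-reachable: p is directly density-reachable from q if
  - p ∈ NEps(q)
  - Card(NEps(q)) ≥ MinPts
- density-reachable
  - p >D q: there is a chain of objects from q to p in which each object is directly density-reachable from the previous one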
28 DBSCAN (Cont.)
- density-connected: p and q are density-connected if there is an object o such that
  - p >D o
  - q >D o
- cluster C
  - Maximality: ∀ p, q ∈ D, if p ∈ C and q >D p, then q ∈ C
  - Connectivity: ∀ p, q ∈ C, p is density-connected to q in C
- noise = {p ∈ D | ∀ i: p ∉ Ci}
29 DBSCAN (Cont.)
- two different kinds of objects in a clustering
  - core object
  - non-core object
    - border object
    - noise object
30 DBSCAN Algorithm
- Algorithm DBSCAN(D, Eps, MinPts)
  - // Precondition: all objects in D are unclassified.
  - FORALL objects o in D DO
    - IF o is unclassified
      - call function expand_cluster to construct a cluster wrt. Eps and MinPts containing o.
31 DBSCAN Algorithm (Cont.)
- FUNCTION expand_cluster(o, D, Eps, MinPts)
  - retrieve the Eps-neighborhood NEps(o) of o
  - IF |NEps(o)| < MinPts // i.e. o is not a core object
    - mark o as noise and RETURN
  - ELSE // i.e. o is a core object
    - select a new cluster-id and mark all objects in NEps(o) with this current cluster-id
    - push all objects from NEps(o)\{o} onto the stack seeds
    - WHILE NOT seeds.empty() DO
      - currentObject := seeds.top()
      - seeds.pop() // remove currentObject before new objects are pushed
      - retrieve the Eps-neighborhood NEps(currentObject) of currentObject
      - IF |NEps(currentObject)| ≥ MinPts
        - select all objects in NEps(currentObject) that are not yet classified or are marked as noise,
        - push the unclassified objects onto seeds and mark all of these objects with the current cluster-id
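The pseudocode can be turned into a short runnable sketch. It deviates from the slide in two deliberate, labelled ways: the current object is popped from the stack before its neighbours are pushed, and an object that already belongs to an earlier cluster is never relabelled. The names `dbscan`, `region_query`, and `NOISE`, and the sample data, are assumptions made for illustration.

```python
NOISE = -1

def region_query(points, idx, eps, dist):
    """Indices of all objects in the Eps-neighborhood of points[idx] (itself included)."""
    return [j for j in range(len(points)) if dist(points[idx], points[j]) <= eps]

def dbscan(points, eps, min_pts, dist):
    labels = [None] * len(points)  # None means "unclassified"
    cluster_id = 0
    for o in range(len(points)):
        if labels[o] is not None:
            continue
        neighbours = region_query(points, o, eps, dist)
        if len(neighbours) < min_pts:  # o is not a core object
            labels[o] = NOISE
            continue
        # o is a core object: start a new cluster around it.
        labels[o] = cluster_id
        seeds = []
        for j in neighbours:
            # Assumption: objects of earlier clusters are left untouched.
            if j != o and labels[j] in (None, NOISE):
                labels[j] = cluster_id
                seeds.append(j)
        while seeds:
            current = seeds.pop()
            current_neighbours = region_query(points, current, eps, dist)
            if len(current_neighbours) >= min_pts:  # current is a core object
                for j in current_neighbours:
                    if labels[j] is None:
                        seeds.append(j)  # only unclassified objects are expanded later
                    if labels[j] in (None, NOISE):
                        labels[j] = cluster_id
        cluster_id += 1
    return labels

# Hypothetical 1-D objects: two dense runs and one isolated value.
points = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 50.0]
labels = dbscan(points, eps=1.5, min_pts=3, dist=lambda a, b: abs(a - b))
```

In this run the values 0 and 10 are first marked as noise and later absorbed as border objects of their clusters, while 50 stays noise.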
32 CAST
- The clusters are constructed one at a time.
- The currently constructed cluster is denoted by Copen.
- Affinity of element x: a(x) = the sum of S(x, y) over all y ∈ Copen
  - high affinity: a(x) ≥ t |Copen|
  - low affinity: a(x) < t |Copen|
- CAST alternates between adding high-affinity elements to Copen, and removing low-affinity elements from it.
34 CAST-Example
- 1. C = Ø, U = {1, 2, 3, ..., 10}
- 2. Copen = Ø, a(·) = 0
- 3. ADD
  - 3.1 Arbitrarily choose an element, say 1 ∈ U
    - Copen = {1}, U = {2, 3, ..., 10}
    - a(1) = S(1, 1), ..., a(10) = S(10, 1)
  - 3.2 If element 3 ∈ U has the maximum high affinity
    - Copen = {1, 3}, U = {2, 4, ..., 10}
    - a(1) += S(1, 3), ..., a(10) += S(10, 3)
  - 3.3 Repeat ADD until all elements ∈ U have low affinity
35 CAST-Example (Cont.)
- 4. Suppose Copen = {1, 2, 3, 7, 10}, U = {4, 5, 6, 8, 9} after the ADD procedure
- 5. REMOVE
  - 5.1 If element 2 ∈ Copen has the minimum low affinity
    - Copen = {1, 3, 7, 10}, U = {2, 4, 5, 6, 8, 9}
    - a(1) -= S(1, 2), ..., a(10) -= S(10, 2)
  - 5.2 Repeat REMOVE until all elements ∈ Copen have high affinity
- 6. C = C ∪ {Copen}
- 7. Repeat from step 2 until U = Ø
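The walk-through above can be sketched as a small routine. It takes a similarity matrix S and an affinity threshold t, and treats a(x) as the sum of S(x, y) over the elements y of Copen. The `max_rounds` guard, the singleton fallback, the tie-breaking by smallest index, and the sample matrix are assumptions added so that the sketch always terminates deterministically. Elements are indexed from 0 here rather than from 1.

```python
def cast(S, t, max_rounds=100):
    """Sketch of CAST on an n x n similarity matrix S with affinity threshold t."""
    n = len(S)
    clusters = []               # C
    unassigned = set(range(n))  # U
    while unassigned:
        c_open = set()          # Copen
        affinity = [0.0] * n    # a(x), reset for every new cluster
        for _ in range(max_rounds):  # assumed guard against endless alternation
            changed = False
            # ADD: move the element of U with maximum affinity while it is high.
            while unassigned:
                best = max(sorted(unassigned), key=lambda x: affinity[x])
                if c_open and affinity[best] < t * len(c_open):
                    break
                unassigned.remove(best)
                c_open.add(best)
                for x in range(n):
                    affinity[x] += S[x][best]
                changed = True
            # REMOVE: move the element of Copen with minimum affinity while it is low.
            while c_open:
                worst = min(sorted(c_open), key=lambda x: affinity[x])
                if affinity[worst] >= t * len(c_open):
                    break
                c_open.remove(worst)
                unassigned.add(worst)
                for x in range(n):
                    affinity[x] -= S[x][worst]
                changed = True
            if not changed:
                break
        if not c_open:  # assumed fallback: close a singleton so U always shrinks
            first = min(unassigned)
            unassigned.remove(first)
            c_open.add(first)
        clusters.append(sorted(c_open))
    return clusters

# Hypothetical similarities: 0.9 within a group, 0.1 across groups.
groups = [0, 0, 0, 1, 1]
S = [[1.0 if i == j else (0.9 if groups[i] == groups[j] else 0.1)
      for j in range(5)] for i in range(5)]
clusters = cast(S, t=0.5)
```

With t = 0.5 the first cluster closes as [0, 1, 2] because the remaining elements have affinity 0.3, below the required 1.5, and the second pass collects [3, 4].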