Title: Cluster Analysis
1. Cluster Analysis
- Lecture Notes for Chapter 8
2. Review
- Quiz: all clustering lectures, including today's
- Problems with K-means
  - Initial centroid selection
    - Solutions?
  - Different shapes, densities, sizes
    - Why?
    - Solution?
  - Empty clusters
    - Solution?
- Unsupervised cluster validation
  - Reasons
3. Hierarchical Clustering
- Produces a set of nested clusters organized as a hierarchical tree
- Can be visualized as a dendrogram
  - A tree-like diagram that records the sequences of merges or splits
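
A dendrogram like this can be generated directly from data. Below is a minimal sketch using SciPy's `linkage` and `dendrogram`; the six random points and the single-link choice are illustrative assumptions, not from the lecture.

```python
# Minimal dendrogram sketch (illustrative data, not from the lecture).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
points = rng.random((6, 2))           # six 2-D points, as in the slide examples

Z = linkage(points, method="single")  # sequence of agglomerative merges (MIN)
dendrogram(Z)                         # tree diagram recording the merges
plt.xlabel("point index")
plt.ylabel("merge distance")
plt.show()
```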
4. Strengths of Hierarchical Clustering
- Do not have to assume any particular number of clusters
  - Any desired number of clusters can be obtained by cutting the dendrogram at the proper level
- The clusters may correspond to meaningful taxonomies
  - Examples in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction)
5. The Hyaluronidases: A Chemical, Biological and Clinical Overview
- Hyaluronidases are a group of neglected enzymes that have recently taken on greater significance
- Alignments of selected hyaluronidases from various species were performed using amino acid sequences
6. Hierarchical Clustering
- Two main types of hierarchical clustering
  - Agglomerative (bottom-up)
    - Start with the points as individual clusters
    - At each step, merge the closest pair of clusters until only one cluster is (or k clusters are) left
  - Divisive (top-down)
    - Start with one, all-inclusive cluster
    - At each step, split a cluster until each cluster contains a single point (or there are k clusters)
- Traditional hierarchical algorithms use a similarity or distance matrix (proximity matrix)
  - Merge or split one cluster at a time
7. Proximity Matrix
(Figure: an example proximity matrix over the data points)
9. Agglomerative Clustering Algorithm
- The more popular hierarchical clustering technique
- Basic algorithm is straightforward (see the sketch after this slide)
  - Compute the proximity matrix
  - Let each data point be a cluster
  - Repeat
    - Merge the two closest clusters
    - Update the proximity matrix
  - Until only a single cluster remains
- Key operation is the computation of the proximity of two clusters
  - Different approaches to defining the distance between clusters distinguish the different algorithms
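
As a concrete rendering of the steps above, here is a minimal sketch in Python. It assumes Euclidean distance and single-link (MIN) inter-cluster proximity; the function name and the NumPy bookkeeping are illustrative choices, not the lecture's code.

```python
# Naive agglomerative clustering sketch: O(n^3), single-link (MIN) proximity.
import numpy as np

def agglomerate(points, k=1):
    """Merge clusters until only k remain; return clusters and merge history."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Compute the proximity matrix (pairwise Euclidean distances).
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)          # a cluster is never merged with itself
    clusters = [[i] for i in range(n)]   # let each data point be a cluster
    merges = []
    while len(clusters) > k:
        # Merge the two closest clusters.
        i, j = np.unravel_index(np.argmin(D), D.shape)
        i, j = (i, j) if i < j else (j, i)
        merges.append((list(clusters[i]), list(clusters[j]), D[i, j]))
        # Update the proximity matrix: row/column i becomes the elementwise
        # MIN of the two merged rows; row/column j is deleted.
        D[i, :] = np.minimum(D[i, :], D[j, :])
        D[:, i] = D[i, :]
        D[i, i] = np.inf
        D = np.delete(np.delete(D, j, axis=0), j, axis=1)
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters, merges
```

Calling `agglomerate(points, k=2)`, for instance, returns the two clusters left after n - 2 merges, and the recorded merge distances are exactly what a dendrogram plots.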
10. Starting Situation
- Start with clusters of individual points and a proximity matrix
(Figure: individual points and their proximity matrix)
11. Intermediate Situation
- After some merging steps, we have some clusters
(Figure: clusters C1, C2, C3, C4, C5 and their proximity matrix)
12. Intermediate Situation
- We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
(Figure: clusters C1, C2, C3, C4, C5 and their proximity matrix, with C2 and C5 about to be merged)
13. After Merging
- The question is: how do we update the proximity matrix?
Proximity matrix after merging C2 and C5 (entries involving the merged cluster are unknown):

          C1   C2 U C5   C3   C4
C1              ?
C2 U C5    ?    ?         ?    ?
C3              ?
C4              ?
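
One standard answer, sketched minimally below: under single link (MIN), the new row for the merged cluster is the elementwise minimum of the two old rows (complete link would use the maximum). The 4x4 matrix values here are hypothetical.

```python
# Filling in the "?" row after a merge, assuming single-link (MIN) proximity.
import numpy as np

def merged_row(D, i, j):
    """Proximity of cluster i U j to every other cluster (single link)."""
    return np.minimum(D[i], D[j])   # complete link (MAX) would use np.maximum

# Hypothetical symmetric proximity matrix over clusters [C1, C2, C3, C4]:
D = np.array([[0.0, 0.5, 0.9, 0.4],
              [0.5, 0.0, 0.3, 0.7],
              [0.9, 0.3, 0.0, 0.6],
              [0.4, 0.7, 0.6, 0.0]])
# New row for C2 U C3; the entries at positions 1 and 2 refer to the merged
# cluster itself and are dropped when the matrix is shrunk.
print(merged_row(D, 1, 2))   # [0.5 0.  0.  0.6]
```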
14. How to Define Inter-Cluster Similarity
- MIN
- MAX
- Group Average
- Distance Between Centroids
- Other methods driven by an objective function
  - Ward's Method uses squared error
(Figure: two clusters and the proximity matrix; a sketch of these definitions follows below)
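
For concreteness, the first four definitions can be written as small functions over arrays of points. This is an illustrative sketch with assumed names, not the lecture's code; Ward's method is different in kind, scoring a merge by the increase in total squared error rather than by a single point-to-point distance.

```python
# Four inter-cluster proximity definitions (illustrative sketch).
import numpy as np

def pairwise(A, B):
    """All point-to-point Euclidean distances between clusters A and B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def d_min(A, B):             # MIN / single link: closest pair of points
    return pairwise(A, B).min()

def d_max(A, B):             # MAX / complete link: most distant pair of points
    return pairwise(A, B).max()

def d_group_average(A, B):   # mean of all pairwise distances
    return pairwise(A, B).mean()

def d_centroid(A, B):        # distance between the cluster centroids
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```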
20. Cluster Similarity: MIN or Single Link
- Distance between two clusters is based on the two most similar (closest) points in the different clusters
21. Hierarchical Clustering: MIN
(Figure: nested clusters and the corresponding dendrogram)

dist({3,6}, {2,5}) = min(dist(3,2), dist(6,2), dist(3,5), dist(6,5))
                   = min(0.15, 0.25, 0.28, 0.39)
                   = 0.15
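
The arithmetic on this slide is easy to re-check; the snippet below just re-evaluates it from the quoted distances (the underlying point coordinates are not in the transcript).

```python
# Re-checking the single-link computation from the quoted distances.
dist_36_25 = min(0.15, 0.25, 0.28, 0.39)   # dist(3,2), dist(6,2), dist(3,5), dist(6,5)
print(dist_36_25)                          # 0.15, as on the slide
```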
22. Cluster Similarity: MAX or Complete Linkage
- Distance between two clusters is based on the two least similar (most distant) points in the different clusters
23. Hierarchical Clustering: MAX
(Figure: nested clusters and the corresponding dendrogram)

dist({3,6}, {2,5}) = max(dist(3,2), dist(6,2), dist(3,5), dist(6,5))
                   = max(0.15, 0.25, 0.28, 0.39)
                   = 0.39

dist({3,6}, {4}) = max(dist(3,4), dist(6,4))
                 = max(0.15, 0.22)
                 = 0.22

dist({3,6}, {1}) = max(dist(3,1), dist(6,1))
                 = max(0.22, 0.23)
                 = 0.23
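
The same kind of re-check for complete link, using only the distances quoted on the slide:

```python
# Re-checking the complete-link computations from the quoted distances.
print(max(0.15, 0.25, 0.28, 0.39))   # dist({3,6}, {2,5}) = 0.39
print(max(0.15, 0.22))               # dist({3,6}, {4})   = 0.22
print(max(0.22, 0.23))               # dist({3,6}, {1})   = 0.23
```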
24. Strength of MIN
(Figure: original points and the resulting MIN clustering)
- Can handle non-elliptical shapes
25. Limitations of MIN
(Figure: original points and the resulting MIN clustering)
- Sensitive to noise and outliers
26. Strength of MAX
(Figure: original points and the resulting MAX clustering)
- Less susceptible to noise and outliers
27. Limitations of MAX
(Figure: original points and the resulting MAX clustering)
- Tends to break large clusters
- Biased towards globular clusters