Title: Unsupervised learning
1. Unsupervised Learning: Cluster Analysis, Basic Concepts and Algorithms
Assaf Gottlieb
Some of the slides are taken from Introduction to Data Mining, by Tan, Steinbach, and Kumar
2. What is unsupervised learning / cluster analysis?
- "Learning without a priori knowledge about the classification of samples; learning without a teacher." - Kohonen (1995), Self-Organizing Maps
- "Cluster analysis is a set of methods for constructing a (hopefully) sensible and informative classification of an initially unclassified set of data, using the variable values observed on each individual." - B. S. Everitt (1998), The Cambridge Dictionary of Statistics
3. What do we cluster?
- Features/Variables
- Samples/Instances
4. Applications of Cluster Analysis
- Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
- Data Exploration
- Get insight into data distribution
- Understand patterns in the data
- Summarization: reduce the size of large data sets; a preprocessing step
5. Objectives of Cluster Analysis
- Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
- Competing objectives:
- Inter-cluster distances are maximized
- Intra-cluster distances are minimized
6. Notion of a Cluster can be Ambiguous
- Depends on resolution!
7. Prerequisites
- Understand the nature of your problem, the type of features, etc.
- The metric that you choose for similarity (for example, Euclidean distance or Pearson correlation) often impacts the clusters you recover.
8. Similarity/Distance measures
- Euclidean Distance
- Highly depends on the scale of features; may require normalization
- City Block
9. [Figure: three pairs of expression profiles, with Euclidean distances d_euc = 0.5846, 1.1345, and 2.6115]
These examples of Euclidean distance match our intuition of dissimilarity pretty well.
10. [Figure: two pairs of expression profiles, with d_euc = 1.41 and d_euc = 1.22]
But what about these? What might be going on with the expression profiles on the left? On the right?
11. Similarity/Distance measures
- Cosine
- Pearson Correlation
- Invariant to scaling (Pearson also to addition)
- Spearman correlation for ranks
12. Similarity/Distance measures
- Jaccard similarity: |X ∩ Y| / |X ∪ Y|
- When interested in the intersection size
[Figure: Venn diagram of sets X and Y]
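A minimal sketch of how these measures might be computed, assuming NumPy and SciPy are available; the vectors x, y and the sets X, Y are hypothetical examples.

# Sketch: common similarity/distance measures for two hypothetical profiles.
import numpy as np
from scipy.spatial import distance
from scipy.stats import pearsonr, spearmanr

x = np.array([0.5, 1.2, 3.1, 0.7, 2.2])   # hypothetical expression profile
y = np.array([0.6, 1.0, 2.8, 0.9, 2.0])   # hypothetical expression profile

d_euc  = distance.euclidean(x, y)          # sensitive to scale; normalize first if needed
d_city = distance.cityblock(x, y)          # city block (Manhattan) distance
s_cos  = 1 - distance.cosine(x, y)         # cosine similarity (invariant to scaling)
r_p, _ = pearsonr(x, y)                    # Pearson correlation (invariant to scaling and addition)
r_s, _ = spearmanr(x, y)                   # Spearman correlation (Pearson on ranks)

# Jaccard similarity for two hypothetical sets: |X ∩ Y| / |X ∪ Y|
X, Y = {"g1", "g2", "g3"}, {"g2", "g3", "g4"}
jaccard = len(X & Y) / len(X | Y)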
13. Types of Clusterings
- Important distinction between hierarchical and partitional sets of clusters
- Partitional Clustering
- A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
- Hierarchical clustering
- A set of nested clusters organized as a hierarchical tree
14. Partitional Clustering
[Figure: original points and a partitional clustering of them]
15. Hierarchical Clustering
[Figure: two hierarchical clusterings, shown as Dendrogram 1 and Dendrogram 2]
16. Other Distinctions Between Sets of Clustering Methods
- Exclusive versus non-exclusive
- In non-exclusive clusterings, points may belong to multiple clusters
- Can represent multiple classes or border points
- Fuzzy versus non-fuzzy
- In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
- Weights must sum to 1
- Partial versus complete
- In some cases, we only want to cluster some of the data
- Heterogeneous versus homogeneous
- Clusters of widely different sizes, shapes, and densities
17. Clustering Algorithms
- Hierarchical clustering
- K-means
- Bi-clustering
18. Hierarchical Clustering
- Produces a set of nested clusters organized as a hierarchical tree
- Can be visualized as a dendrogram
- A tree-like diagram that records the sequences of merges or splits
19. Strengths of Hierarchical Clustering
- Do not have to assume any particular number of clusters
- Any desired number of clusters can be obtained by cutting the dendrogram at the proper level
- The clusters may correspond to meaningful taxonomies
- Examples in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, ...)
20. Hierarchical Clustering
- Two main types of hierarchical clustering:
- Agglomerative (bottom up)
- Start with the points as individual clusters
- At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
- Divisive (top down)
- Start with one, all-inclusive cluster
- At each step, split a cluster until each cluster contains a point (or there are k clusters)
- Traditional hierarchical algorithms use a similarity or distance matrix
- Merge or split one cluster at a time
21. Agglomerative Clustering Algorithm
- The more popular hierarchical clustering technique
- Basic algorithm is straightforward (see the sketch below):
- Compute the proximity matrix
- Let each data point be a cluster
- Repeat:
- Merge the two closest clusters
- Update the proximity matrix
- Until only a single cluster remains
- Key operation is the computation of the proximity of two clusters
- Different approaches to defining the distance between clusters distinguish the different algorithms
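A short sketch of this algorithm using SciPy, assuming a small hypothetical data matrix X; the linkage choice corresponds to the inter-cluster proximity definitions discussed on the following slides.

# Sketch: agglomerative clustering with SciPy on a hypothetical data matrix X.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(20, 2)                  # hypothetical points
D = pdist(X, metric="euclidean")           # condensed proximity (distance) matrix

Z = linkage(D, method="average")           # 'single' = MIN, 'complete' = MAX,
                                           # 'average' = group average, 'ward' = Ward's method
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
# dendrogram(Z)                            # visualize the nested merges (needs matplotlib)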
22. Starting Situation
- Start with clusters of individual points and a proximity matrix
[Figure: individual points and their proximity matrix]
23. Intermediate Situation
- After some merging steps, we have some clusters
[Figure: clusters C1-C5 and the current proximity matrix]
24. Intermediate Situation
- We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1-C5 and the current proximity matrix]
25. After Merging
- The question is: how do we update the proximity matrix?
[Figure: clusters C1, C3, C4 and the merged cluster C2 ∪ C5; the proximity-matrix entries involving C2 ∪ C5 are marked with "?"]
26. How to Define Inter-Cluster Similarity?
- MIN
- MAX
- Group Average
- Distance Between Centroids
- Ward's method (not discussed)
[Figure: two clusters and the proximity matrix]
27. How to Define Inter-Cluster Similarity?
- MIN
- MAX
- Group Average
- Distance Between Centroids
- Other methods are driven by an objective function
- Ward's Method uses squared error
[Figure: two clusters and the proximity matrix]
31. Cluster Similarity: MIN or Single Link
- Similarity of two clusters is based on the two most similar (closest) points in the different clusters
- Determined by one pair of points, i.e., by one link in the proximity graph
32. Hierarchical Clustering: MIN
[Figure: nested clusters and the corresponding dendrogram]
33. Strength of MIN
- Can handle non-elliptical shapes
[Figure: original points and the single-link clustering]
34. Limitations of MIN
- Sensitive to noise and outliers
[Figure: original points and the single-link clustering]
35. Cluster Similarity: MAX or Complete Linkage
- Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
- Determined by all pairs of points in the two clusters
36. Hierarchical Clustering: MAX
[Figure: nested clusters and the corresponding dendrogram]
37. Strength of MAX
- Less susceptible to noise and outliers
[Figure: original points and the complete-link clustering]
38. Limitations of MAX
- Tends to break large clusters
- Biased towards globular clusters
[Figure: original points and the complete-link clustering]
39. Cluster Similarity: Group Average
- Proximity of two clusters is the average of the pairwise proximities between points in the two clusters
- Need to use average connectivity for scalability, since total proximity favors large clusters
40. Hierarchical Clustering: Group Average
[Figure: nested clusters and the corresponding dendrogram]
41. Hierarchical Clustering: Group Average
- Compromise between Single and Complete Link
- Strengths
- Less susceptible to noise and outliers
- Limitations
- Biased towards globular clusters
42. Cluster Similarity: Ward's Method
- Similarity of two clusters is based on the increase in squared error when the two clusters are merged
- Similar to group average if the distance between points is the squared distance
- Less susceptible to noise and outliers
- Biased towards globular clusters
- Hierarchical analogue of K-means
- Can be used to initialize K-means
43. Hierarchical Clustering Comparison
[Figure: the same data clustered with MIN, MAX, and Group Average]
44. Hierarchical Clustering: Time and Space Requirements
- O(N²) space, since it uses the proximity matrix
- N is the number of points
- O(N³) time in many cases
- There are N steps, and at each step the proximity matrix, of size N², must be updated and searched
- Complexity can be reduced to O(N² log N) time for some approaches
45. Hierarchical Clustering: Problems and Limitations
- Once a decision is made to combine two clusters, it cannot be undone
- Different schemes have problems with one or more of the following:
- Sensitivity to noise and outliers
- Difficulty handling clusters of different sizes and convex shapes
- Breaking large clusters (divisive)
- The dendrogram corresponding to a given hierarchical clustering is not unique, since for each merge one needs to specify which subtree goes on the left and which on the right
- They impose structure on the data, instead of revealing structure in the data
- How many clusters? (some suggestions later)
46. K-means Clustering
- Partitional clustering approach
- Each cluster is associated with a centroid (center point)
- Each point is assigned to the cluster with the closest centroid
- Number of clusters, K, must be specified
- The basic algorithm is very simple
47. K-means Clustering: Details
- Initial centroids are often chosen randomly
- Clusters produced vary from one run to another
- The centroid is (typically) the mean of the points in the cluster
- Closeness is mostly measured by Euclidean distance (the typical choice), cosine similarity, correlation, etc.
- K-means will converge for the common similarity measures mentioned above
- Most of the convergence happens in the first few iterations
- Often the stopping condition is relaxed to "until relatively few points change clusters"
- Complexity is O(n K I d)
- n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
48. Evaluating K-means Clusters
- The most common measure is the Sum of Squared Error (SSE)
- For each point, the error is the distance to the nearest cluster
- To get the SSE, we square these errors and sum them: SSE = Σi Σx∈Ci dist(x, mi)²
- x is a data point in cluster Ci and mi is the representative point for cluster Ci
- One can show that mi corresponds to the center (mean) of the cluster
- Given two clusterings, we can choose the one with the smallest error (see the sketch below)
- One easy way to reduce the SSE is to increase K, the number of clusters
- However, a good clustering with smaller K can have a lower SSE than a poor clustering with higher K
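A hedged sketch with scikit-learn (assumed available) on hypothetical data; inertia_ is the SSE described above, and a manual computation is shown for comparison.

# Sketch: K-means and its SSE (sum of squared errors) with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 4)                     # hypothetical data: n=100 points, d=4 attributes
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

sse = km.inertia_                              # SSE = sum over clusters of squared distances to the centroid
# Manual check: squared distance of each point to its assigned centroid
manual_sse = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
assert np.isclose(sse, manual_sse)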
49. Issues and Limitations for K-means
- How to choose initial centers?
- How to choose K?
- How to handle outliers?
- Clusters different in:
- Shape
- Density
- Size
- Assumes clusters are spherical in vector space
- Sensitive to coordinate changes
50. Two Different K-means Clusterings
[Figure: original points and two different K-means clusterings of them]
51-54. Importance of Choosing Initial Centroids
[Figures: K-means iterations starting from two different random initializations]
55. Solutions to the Initial Centroids Problem
- Multiple runs
- Sample and use hierarchical clustering to determine initial centroids
- Select more than k initial centroids and then select among these initial centroids
- Select the most widely separated
- Bisecting K-means
- Not as susceptible to initialization issues
56. Bisecting K-means
- Bisecting K-means algorithm
- Variant of K-means that can produce a partitional or a hierarchical clustering (see the sketch below)
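One way to sketch the idea, under the assumption that the cluster with the largest SSE is the one split at each step (other splitting criteria are possible); this is an illustrative sketch, not the exact algorithm from the slides. Assumes scikit-learn.

# Sketch of bisecting K-means: repeatedly split the cluster with the largest SSE using 2-means.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    labels = np.zeros(len(X), dtype=int)        # start with one all-inclusive cluster
    while labels.max() + 1 < k:
        # pick the cluster with the largest SSE around its own mean
        sse = [((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
               for c in range(labels.max() + 1)]
        worst = int(np.argmax(sse))
        idx = np.where(labels == worst)[0]
        split = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[idx])
        labels[idx[split == 1]] = labels.max() + 1   # one half keeps the old id, the other gets a new one
    return labels

labels = bisecting_kmeans(np.random.rand(200, 2), k=4)   # hypothetical data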
57. Bisecting K-means Example
58. Issues and Limitations for K-means
- How to choose initial centers?
- How to choose K?
- Depends on the problem; some suggestions later
- How to handle outliers?
- Preprocessing
- Clusters different in:
- Shape
- Density
- Size
60. Limitations of K-means: Differing Sizes
[Figure: original points vs. K-means (3 clusters)]
61. Limitations of K-means: Differing Density
[Figure: original points vs. K-means (3 clusters)]
62. Limitations of K-means: Non-globular Shapes
[Figure: original points vs. K-means (2 clusters)]
63. Overcoming K-means Limitations
[Figure: original points vs. K-means clusters]
One solution is to use many clusters: K-means then finds parts of the true clusters, which need to be put back together.
64. Overcoming K-means Limitations
[Figure: original points vs. K-means clusters]
65. Overcoming K-means Limitations
[Figure: original points vs. K-means clusters]
66. K-means
- Pros
- Simple
- Fast for low-dimensional data
- Can find pure sub-clusters if a large number of clusters is specified
- Cons
- K-means cannot handle non-globular data or clusters of different sizes and densities
- K-means will not identify outliers
- K-means is restricted to data for which there is a notion of a center (centroid)
67. Biclustering/Co-clustering
[Figure: an expression matrix of N genes × M conditions]
- Two genes can have similar expression patterns only under some conditions
- Similarly, in two related conditions, some genes may exhibit different expression patterns
68. Biclustering
- As a result, each cluster may involve only a subset of genes and a subset of conditions, which form a checkerboard structure
69. Biclustering
- In general a hard task (NP-hard)
- Heuristic algorithms, described briefly:
- Cheng & Church: deletion of rows and columns; biclusters are discovered one at a time
- Order-Preserving Submatrices (OPSM), Ben-Dor et al.
- Coupled Two-Way Clustering (Getz et al.)
- Spectral Co-clustering
70. Cheng and Church
- Objective function for the heuristic method (to minimize): the mean squared residue H(I, J) (see the sketch below)
- Greedy method
- Initialization: the bicluster contains all rows and columns
- Iteration:
- Compute all a_Ij, a_iJ, a_IJ and H(I, J) for reuse
- Remove the row or column that gives the maximum decrease of H
- Termination: when no action will decrease H, or H < δ
- Mask this bicluster and continue
- Problem: avoiding trivial biclusters
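A sketch (NumPy assumed) of the mean squared residue H(I, J) that the Cheng-Church heuristic scores biclusters with; A is a hypothetical expression matrix, and rows/cols index the current bicluster.

# Sketch: mean squared residue H(I, J) used by Cheng & Church.
import numpy as np

def msr(A, rows, cols):
    sub = A[np.ix_(rows, cols)]
    a_iJ = sub.mean(axis=1, keepdims=True)   # row means
    a_Ij = sub.mean(axis=0, keepdims=True)   # column means
    a_IJ = sub.mean()                        # overall mean
    residue = sub - a_iJ - a_Ij + a_IJ
    return (residue ** 2).mean()             # H(I, J)

A = np.random.rand(50, 20)                   # hypothetical genes x conditions matrix
rows, cols = np.arange(50), np.arange(20)    # initialization: all rows and columns
H = msr(A, rows, cols)                       # greedily remove the row/column that lowers H most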
71. Ben-Dor et al. (OPSM)
- Model:
- For a condition set T and a gene g, the conditions in T can be ordered so that the expression values are sorted in ascending order (suppose the values are all unique)
- A submatrix A is a bicluster if there is an ordering (permutation) of T such that the expression values of all genes in G are sorted in ascending order
- Idea of the algorithm: grow partial models until they become complete models

                      t1   t2   t3   t4   t5
  g1                   7   13   19    2   50
  g2                  19   23   39    6   42
  g3                   4    6    8    2   10
  Induced permutation  2    3    4    1    5
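The induced permutation in this table can be reproduced with a short sketch (NumPy assumed): each gene induces an ordering of the conditions, and genes that induce the same ordering support the same order-preserving submatrix.

# Sketch: induced permutation (rank of each condition) for the genes in the table above.
import numpy as np

M = np.array([[ 7, 13, 19,  2, 50],    # g1
              [19, 23, 39,  6, 42],    # g2
              [ 4,  6,  8,  2, 10]])   # g3

ranks = M.argsort(axis=1).argsort(axis=1) + 1   # rank of t1..t5 within each gene
# every row equals [2 3 4 1 5], so g1, g2, g3 support the same order-preserving submatrix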
72. Ben-Dor et al. (OPSM)
73. Getz et al. (CTWC)
- Idea: repeatedly perform one-way clustering on genes/conditions
- Stable clusters of genes are used as the attributes for condition clustering, and vice versa
74. Spectral Co-clustering
- Main idea:
- Normalize the two dimensions
- Form a matrix of size m+n (using SVD)
- Use k-means to cluster both types of data
- http://adios.tau.ac.il/SpectralCoClustering/
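For reference, scikit-learn ships a spectral co-clustering implementation that follows the same normalize / SVD / k-means recipe; a minimal sketch on a hypothetical matrix (this is not the tool linked above):

# Sketch: spectral co-clustering of a hypothetical genes x conditions matrix.
import numpy as np
from sklearn.cluster import SpectralCoclustering

A = np.random.rand(100, 30)                    # hypothetical non-negative data matrix
model = SpectralCoclustering(n_clusters=4, random_state=0).fit(A)

row_clusters = model.row_labels_               # bicluster assignment of each gene (row)
col_clusters = model.column_labels_            # bicluster assignment of each condition (column)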
75. Evaluating Cluster Quality
- Use known classes (pairwise F-measure, best-class F-measure)
- Clusters can be evaluated with internal as well as external measures
- Internal measures are related to the inter/intra-cluster distances
- External measures are related to how representative the current clusters are of the true classes
76. Inter/Intra Cluster Distances
- Intra-cluster distance
- (Sum/Min/Max/Avg) of the (absolute/squared) distances between:
- All pairs of points in the cluster, OR
- The centroid and all points in the cluster, OR
- The medoid and all points in the cluster
- Inter-cluster distance
- Sum the (squared) distances between all pairs of clusters
- Where the distance between two clusters is defined as:
- The distance between their centroids/medoids
- (Spherical clusters)
- The distance between the closest pair of points belonging to the clusters
- (Chain-shaped clusters)
77. Davies-Bouldin Index
- A function of the ratio of the sum of within-cluster (intra-cluster) scatter to between-cluster (inter-cluster) separation
- Let C = {C1, ..., Ck} be a clustering of a set of N objects
- DB = (1/k) Σi Ri, with Ri = max over j≠i of Rij, and Rij = (var(Ci) + var(Cj)) / d(ci, cj)
- where Ci is the i-th cluster and ci is the centroid of cluster i
78. Davies-Bouldin Index: Example
- For the three clusters shown:
- Compute the scatter and centroids:
- var(C1) = 0, var(C2) = 4.5, var(C3) = 2.33
- The centroid is simply the mean here, so c1 = 3, c2 = 8.5, c3 = 18.33
- So R12 = (0 + 4.5)/5.5 ≈ 0.82, R13 = (0 + 2.33)/15.33 ≈ 0.15, R23 = (4.5 + 2.33)/9.83 ≈ 0.69
- Now compute R1 = 0.82 (max of R12 and R13), R2 = 0.82 (max of R21 and R23), R3 = 0.69 (max of R31 and R32)
- Finally, DB = (0.82 + 0.82 + 0.69)/3 ≈ 0.78
79. Davies-Bouldin Index: Example (ctd.)
- For the two clusters shown:
- Compute:
- Only 2 clusters here
- var(C1) = 12.33 and var(C2) = 2.33; c1 = 6.67 and c2 = 18.33
- R12 = (12.33 + 2.33)/11.66 ≈ 1.26
- Now compute: since we have only 2 clusters, R1 = R12 = 1.26 and R2 = R21 = 1.26
- Finally, DB = 1.26
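A sketch of the computation as defined on the previous slides (NumPy assumed), using hypothetical 1-D clusters chosen to be similar in spirit to the worked example above. Note that scikit-learn's davies_bouldin_score uses the mean distance to the centroid as the scatter, so its values differ slightly from this variance-based version.

# Sketch: Davies-Bouldin index with sample variance as the within-cluster scatter.
import numpy as np

def davies_bouldin(clusters):
    cents = [np.mean(c) for c in clusters]
    scat  = [np.var(c, ddof=1) if len(c) > 1 else 0.0 for c in clusters]  # sample variance
    k = len(clusters)
    R = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            if i != j:
                R[i, j] = (scat[i] + scat[j]) / abs(cents[i] - cents[j])
    return np.mean([R[i].max() for i in range(k)])

# hypothetical 1-D clusters
db = davies_bouldin([np.array([3.0]), np.array([7.0, 10.0]), np.array([17.0, 18.0, 20.0])])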
80. Other Criteria
- Dunn method
- δ(Xi, Xj) is the inter-cluster distance between clusters Xi and Xj, and Δ(Xk) is the intra-cluster distance (diameter) of cluster Xk; Dunn = min over i≠j of δ(Xi, Xj) / max over k of Δ(Xk)
- Silhouette method
- Identifying outliers
- C-index
- Compare the sum of distances S over all pairs from the same cluster against the sums of the same number of smallest and largest pairwise distances
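A quick sketch (scikit-learn assumed, hypothetical data) of the silhouette criterion, which the later slides use to compare different numbers of clusters:

# Sketch: compare K-means solutions with the average silhouette coefficient.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 5)                         # hypothetical data
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))          # higher is better; very low values hint at outliers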
81. Example Dataset: AML/ALL (Golub et al.)
- Leukemia
- 72 patients (samples)
- 7129 genes
- 4 groups:
- Two major types: ALL and AML
- T and B cells in ALL
- With/without treatment in AML
82. AML/ALL Dataset
- Davies-Bouldin index: C = 4
- Dunn method: C = 2
- Silhouette method: C = 2
83. Visual evaluation - coherency
84. Cluster quality example: do you see clusters?
[Figure: two data sets; average silhouette for C = 2..10 clusters in each]

  C    Silhouette        C    Silhouette
  2    0.4922            2    0.4863
  3    0.5739            3    0.5762
  4    0.4773            4    0.5957
  5    0.4991            5    0.5351
  6    0.5404            6    0.5701
  7    0.541             7    0.5487
  8    0.5171            8    0.5083
  9    0.5956            9    0.5311
  10   0.6446            10   0.5229
85. Kleinberg's Axioms
- Scale Invariance
- F(λd) = F(d) for all d and all strictly positive λ
- Consistency
- If d′ equals d, except for shrinking distances within clusters of F(d) or stretching between-cluster distances, then F(d′) = F(d)
- Richness
- For any partition P of S, there exists a distance function d over S so that F(d) = P
86. Quality Estimation
- Gamma is the best-performing measure in Milligan's study of 30 internal criteria (Milligan, 1981)
- Let s(−) denote the number of times that two points which were clustered together in C had a distance greater than two points which were not in the same cluster
- Let s(+) denote the opposite result
- Gamma = (s(+) − s(−)) / (s(+) + s(−))
- Gamma satisfies scale-invariance, consistency, richness, and isomorphism invariance
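A brute-force sketch of the Gamma statistic described above (NumPy assumed, hypothetical small data set; the number of comparisons grows roughly as n^4, so this is only for illustration): compare every within-cluster pair distance with every between-cluster pair distance.

# Sketch: Gamma statistic, comparing within-cluster vs. between-cluster pair distances.
import numpy as np
from itertools import combinations

def gamma_index(X, labels):
    pairs = list(combinations(range(len(X)), 2))
    dist = {p: np.linalg.norm(X[p[0]] - X[p[1]]) for p in pairs}
    within  = [dist[p] for p in pairs if labels[p[0]] == labels[p[1]]]
    between = [dist[p] for p in pairs if labels[p[0]] != labels[p[1]]]
    s_plus  = sum(dw < db for dw in within for db in between)   # concordant comparisons
    s_minus = sum(dw > db for dw in within for db in between)   # discordant comparisons
    return (s_plus - s_minus) / (s_plus + s_minus)

X = np.random.rand(30, 2)                     # hypothetical small data set
labels = np.random.randint(0, 3, size=30)     # hypothetical clustering
print(gamma_index(X, labels))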
87. Dimensionality Reduction
- Map points in high-dimensional space to a lower number of dimensions
- Preserve structure: pairwise distances, etc.
- Useful for further processing:
- Less computation, fewer parameters
- Easier to understand, visualize
88. Dimensionality Reduction
- Feature selection vs. feature extraction
- Feature selection: select important features
- Pros:
- Meaningful features
- Less work acquiring
- Unsupervised:
- Variance, Fold
- UFF
89. Dimensionality Reduction
- Feature extraction
- Transforms the entire feature set to a lower dimension
- Pros:
- Uses an objective function to select the best projection
- Sometimes single features are not good enough
- Unsupervised:
- PCA, SVD
90. Principal Components Analysis (PCA)
- Approximating a high-dimensional data set with a lower-dimensional linear subspace
[Figure: data points with the original axes and the fitted principal axes]
91. Singular Value Decomposition
92. Principal Components Analysis (PCA)
- Rules of thumb for selecting the number of components:
- Knee in the scree plot
- Cumulative percentage of variance
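A short sketch (scikit-learn assumed, hypothetical data) of the two rules of thumb: inspect the scree-plot values for a knee, and check the cumulative percentage of variance explained.

# Sketch: choosing the number of principal components.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 20)                    # hypothetical data
pca = PCA().fit(X)                             # fit all components

explained = pca.explained_variance_ratio_      # scree-plot values (look for the "knee")
cumulative = np.cumsum(explained)              # cumulative percentage of variance
n_components = int(np.searchsorted(cumulative, 0.90) + 1)   # e.g. keep 90% of the variance

X_reduced = PCA(n_components=n_components).fit_transform(X)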
93. Tools for clustering
- Matlab: COMPACT
- http://adios.tau.ac.il/compact/
95. Tools for clustering
- Cluster & TreeView (Eisen et al.)
- http://rana.lbl.gov/eisen/?page_id=42
96. Summary
- Clustering is ill-defined and considered an art
- In fact, this means you need to:
- Understand your data beforehand
- Know how to interpret the clusters afterwards
- The problem determines the best solution (which measure, which clustering algorithm); try to experiment with different options.