Title: Clustering Continued
1. Clustering Continued
- Slides modified from Wing Wong's slides
2. Assessing clustering reliability
- How to cut the clustering tree to obtain relatively tight gene or sample clusters
3. Resampling hierarchical clustering to assess reliability
4. Properties of hierarchical clustering
- Bottom-up method, non-iterative
- Produces a hierarchy of clusterings
- Provides an ordering of genes for data-table visualization
- The dendrogram can be useful for assessing cluster strength
6. K-means clustering algorithm
- Start with suitable choices of K cluster mean vectors (17-dimensional vectors for each gene in the alpha-factor cell-cycle example). Then iterate the following two steps (a minimal sketch follows below):
- For each object (a 17-dimensional vector if we are clustering the genes), find the closest mean vector and assign the object to the corresponding cluster.
- For each cluster, update its mean vector according to the current assignments.
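A minimal sketch of this two-step iteration, assuming numpy and a data matrix X whose rows are the objects to cluster (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means; X is an (n, p) array whose rows are the objects."""
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen objects as the initial mean vectors.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Step 1: assign each object to the closest mean vector.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                      # assignments unchanged -> converged
        labels = new_labels
        # Step 2: update each cluster's mean vector from its current members.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```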
7. Issues in the application of K-means
- A) How many clusters are there? (How to choose K?)
- B) Initial values and convergence criterion
- C) Assessment of the quality of each cluster and each cluster assignment
- D) Visualization of results
Note: many of these issues are handled automatically in hierarchical clustering.
8. A) How to choose k?
Milligan & Cooper (1985) compared 30 published rules. Two widely used ones:
1. Calinski & Harabasz (1974): choose the k that maximizes CH(k) = [B(k)/(k-1)] / [W(k)/(n-k)].
2. Hartigan (1975): H(k) = (n-k-1) [W(k)/W(k+1) - 1]; stop adding clusters when H(k) < 10.
Here W(k) is the total sum of squares within clusters and B(k) is the sum of squares between cluster means. A sketch computing both criteria follows below.
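A hedged sketch of how W(k), B(k), and the two criteria could be computed from K-means fits, assuming scikit-learn is available (the helper names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def wk(X, labels, centers):
    """Total within-cluster sum of squares W(k)."""
    return sum(((X[labels == j] - c) ** 2).sum() for j, c in enumerate(centers))

def choose_k(X, k_max=10):
    n = len(X)
    total_ss = ((X - X.mean(axis=0)) ** 2).sum()
    fits = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
            for k in range(1, k_max + 2)}           # need W(k+1) for Hartigan
    W = {k: wk(X, f.labels_, f.cluster_centers_) for k, f in fits.items()}
    for k in range(2, k_max + 1):
        B = total_ss - W[k]                          # between-cluster SS B(k)
        ch = (B / (k - 1)) / (W[k] / (n - k))        # Calinski-Harabasz index
        hart = (W[k] / W[k + 1] - 1) * (n - k - 1)   # Hartigan statistic H(k)
        print(f"k={k}: CH={ch:.1f}  H(k)={hart:.1f}")
```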
12. A promising new approach uses the concept of prediction strength (Tibshirani et al., 2001).
Let X be an n by p data matrix and C a clustering of X. D[C, X] is the n by n matrix whose ij-th entry indicates whether the i-th and j-th elements of X belong to the same cluster according to C (i.e., whether they are nearest to the same cluster centroid). Also, let C(X, k) be the result of applying K-means to X with K = k. To calculate the prediction strength of this clustering, partition the data set into training data Xtr and test data Xte.
13. Apply clustering with K = k separately to the training data and to the test data. Let n_k1, n_k2, ..., n_kk be the numbers of observations in the k test clusters, and let A_k1, A_k2, ..., A_kk be the indices of the test observations in these clusters. Then the prediction strength (for one partition) is
ps(k) = min over j of { 1 / [n_kj (n_kj - 1)] } * sum over pairs i != i' in A_kj of D[C(Xtr, k), Xte]_{ii'}
-- For each test cluster, compute the proportion of observation pairs in that cluster that are also predicted to be co-clustered according to the cluster centroids from the training data.
-- Take the minimum over the k test clusters.
-- Average this index over random partitions into test and training data.
-- The partition is usually 2-fold.
A sketch of this computation follows below.
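A hedged sketch of the prediction-strength calculation for a single 2-fold partition (in practice the index is averaged over random partitions); it assumes scikit-learn's KMeans, and the function names are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def co_membership(labels):
    """D[C, X]: n x n indicator of whether i and j fall in the same cluster."""
    return (labels[:, None] == labels[None, :]).astype(int)

def prediction_strength(X, k, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    tr, te = idx[: len(X) // 2], idx[len(X) // 2:]   # one 2-fold partition
    km_tr = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[tr])
    km_te = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[te])
    # D[C(Xtr, k), Xte]: co-membership of the test points under training centroids
    D = co_membership(km_tr.predict(X[te]))
    strengths = []
    for j in range(k):
        members = np.where(km_te.labels_ == j)[0]    # A_kj: test cluster j
        n_kj = len(members)
        if n_kj < 2:
            continue
        pairs = D[np.ix_(members, members)]
        # proportion of distinct pairs in this test cluster also co-clustered
        strengths.append((pairs.sum() - n_kj) / (n_kj * (n_kj - 1)))
    return min(strengths) if strengths else 0.0
```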
14. (Figure from Tibshirani et al., 2001)
15. (Figure from Tibshirani et al., 2001)
16. Let A_k(i) be the set of other observations in the test cluster that contains observation i. Then the prediction strength for observation i is the proportion of observations in A_k(i) that are also placed with observation i by the training-data centroids, i.e.
ps(i) = ( 1 / |A_k(i)| ) * sum over j in A_k(i) of D[C(Xtr, k), Xte]_{ij}
A per-observation sketch follows below.
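A hedged sketch of this per-observation version, again assuming scikit-learn's KMeans (names are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

def observation_strength(X_tr, X_te, k):
    """For each test observation i: the fraction of the other members of its
    test cluster that the training centroids also place with i."""
    km_tr = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_tr)
    km_te = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_te)
    tr_lab = km_tr.predict(X_te)      # test points assigned to training centroids
    te_lab = km_te.labels_            # test points' own clustering
    ps = np.zeros(len(X_te))
    for i in range(len(X_te)):
        others = np.where((te_lab == te_lab[i]) & (np.arange(len(X_te)) != i))[0]
        if len(others):               # A_k(i): other members of i's test cluster
            ps[i] = np.mean(tr_lab[others] == tr_lab[i])
    return ps
```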
17. (Figure from Tibshirani et al., 2001)
18. (Figure from Tibshirani et al., 2001)
19. (Figure from Tibshirani et al., 2001)
20. Overall, prediction strength works well. However, the results in the case of two elongated clusters suggest that further investigation is needed.
Other issues: How should sporadic cases be handled? How should the data be partitioned: 2-fold or 5-fold? Could the method be modified to allow the training and test samples to overlap?
21. Starting values and convergence
Convergence is easy: stop when the cluster assignments no longer change.
The choice of initial values is more difficult.
22. K-means requires good initial values. Hierarchical clustering could be used to provide them, but it sometimes performs poorly.
(Figure: within-cluster sums of squares for two initializations, X = 965.32, O = 305.09)
23. An improved choice of initial values
- Perform hierarchical clustering.
- Truncate the hierarchical tree early to obtain p × k clusters (k = the number of clusters, p = an integer).
- Among these p × k clusters, choose the k clusters containing the largest numbers of points, and use the cluster centers of these k clusters as the initial values (a sketch follows below).
24. Simulated example: 3 normally distributed clusters (each with 50 points), plus 50 sporadic points added.
25. Judging K-means errors: the R-value is the sum of squared distances from the computed cluster centers to the true centers (a sketch follows below).
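A hedged sketch of such an error measure. The slide does not say how computed centers are matched to true centers; the one-to-one Hungarian matching below is an assumption added for illustration:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def r_value(true_centers, fitted_centers):
    """Sum of squared distances from fitted centers to matched true centers.
    The one-to-one matching is an assumption, not part of the slide."""
    cost = ((fitted_centers[:, None, :] - true_centers[None, :, :]) ** 2).sum(axis=2)
    row, col = linear_sum_assignment(cost)
    return cost[row, col].sum()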
26. K-means error rates using different initial values (100,000 simulations)
Overall, truncating the hierarchical tree at 3 × k clusters provides good initial values.
27. Visualization of clustering results
Myeloma (MM) data set: 5 normal samples, 5 MGUS samples, and 9 MM samples. Genes were filtered by large variation:
- Probe sets: 12,600; 167 genes satisfied the filtering criterion
- Variation across samples: 1.00 < (standard deviation / mean) < 10.00
- P call in the arrays used > 20%
A sketch of such a filter follows below.
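A hedged sketch of this filtering step. The array names (expr, calls) and the exact handling of the present-call criterion are assumptions for illustration, not the authors' code:

```python
import numpy as np

def filter_genes(expr, calls, cv_low=1.0, cv_high=10.0, p_call_frac=0.20):
    """expr: (n_genes, n_samples) expression matrix; calls: same-shape boolean
    array of Affymetrix present ('P') calls. Both names are illustrative."""
    mean = expr.mean(axis=1)
    sd = expr.std(axis=1, ddof=1)
    cv = sd / np.maximum(mean, 1e-12)                # variation across samples
    keep = (cv > cv_low) & (cv < cv_high) & (calls.mean(axis=1) > p_call_frac)
    return expr[keep], keep
```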
28. Sample clustering using the 167 filtered genes. Each gene has its expression values standardized across the samples (red = 3, blue = -3, white = 0). Three major sample clusters appear at the top. A sketch of the standardization and sample clustering follows below.
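A hedged sketch of the per-gene standardization and sample clustering, assuming scipy; the averaging linkage and correlation distance are illustrative choices, not stated on the slide:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.stats import zscore

def cluster_samples(expr):
    """expr: (n_genes, n_samples). Standardize each gene across samples,
    then hierarchically cluster the samples (columns)."""
    z = zscore(expr, axis=1)          # per-gene standardization
    z = np.clip(z, -3, 3)             # matches the red = 3 / blue = -3 color scale
    Z = linkage(z.T, method="average", metric="correlation")
    return dendrogram(Z, no_plot=True)   # leaf order for the heatmap columns
```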
29. Dimension reduction using linear discriminant analysis (LDA, to be introduced)
30. Viewing clustering results through dimension reduction
Clustering of gene expression data. a, Hierarchical clustering dendrogram with the cluster of 19 melanomas at the centre. b, MDS three-dimensional plot of all 31 cutaneous melanoma samples showing the major cluster of 19 samples (blue, within the cylinder) and the remaining 12 samples (gold). Bittner et al., Nature, vol. 406, p. 536, 2000. (A sketch of an MDS view follows below.)
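A hedged sketch of viewing a clustering through a 3-D MDS embedding, assuming scikit-learn and scipy; this is an illustration of the general idea, not the specific method used by Bittner et al.:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.manifold import MDS

def mds_view(X, n_clusters=2, seed=0):
    """Hierarchically cluster the samples, then embed them in 3-D with MDS
    so the cluster structure can be inspected visually."""
    labels = fcluster(linkage(X, method="average"),
                      t=n_clusters, criterion="maxclust")
    coords = MDS(n_components=3, random_state=seed).fit_transform(X)
    return coords, labels   # e.g. scatter-plot coords colored by labels
```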
31. Homework
- Read and execute the K-means clustering program: http://people.revoledu.com/kardi/tutorial/kMean/matlab_kMeans.htm
- Read and execute the Matlab functions for hierarchical and K-means clustering: http://www.mathworks.com/access/helpdesk_r13/help/toolbox/bioinfo/a1060813267b1.html