1
Clustering Continued
  • Slides modified from Wing Wong's slides

2
Assess clustering reliability
  • How to cut the clustering tree to obtain relatively
    tight gene or sample clusters

3
Resampling hierarchical clustering to assess
reliability
4
Properties of hierarchical clustering
  • Bottom-up method, non-iterative
  • Produces a hierarchy of clusterings
  • Provides an ordering of genes for data table visualization
  • The dendrogram can be useful for assessing cluster
    strength (see the sketch below).
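
A minimal sketch of these properties in code (SciPy and the toy data are assumptions here, not tools used in the slides):

    # Hierarchical clustering: bottom-up, non-iterative; gives a tree, a cut, and a leaf order.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    rng = np.random.default_rng(0)
    # Toy "expression" matrix: 30 genes x 17 conditions (as in the alpha factor example).
    X = np.vstack([rng.normal(m, 1.0, size=(10, 17)) for m in (-2.0, 0.0, 2.0)])

    Z = linkage(X, method="average", metric="euclidean")   # bottom-up merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")        # cut the tree into 3 clusters
    order = dendrogram(Z, no_plot=True)["leaves"]          # gene ordering for a data table / heatmap
    print(labels, order[:10])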

5
(No Transcript)
6
K-means clustering algorithm
  • Start with suitable choices of K cluster
    mean vectors (17-dimensional vectors in the
    alpha factor cell cycle example, where each gene
    is a 17-dimensional profile).
    Then iterate the following two steps (sketched below):
  • For each object (a 17-dimensional vector if we are
    clustering the genes), find the closest mean
    vector and assign the object to the
    corresponding cluster.
  • For each cluster, update its mean vector
    according to the current assignments.
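
A minimal NumPy sketch of the two-step iteration just described (the function name and the toy data are illustrative, not from the slides):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        """Plain K-means: alternate the assignment and mean-update steps."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]   # initial mean vectors
        for _ in range(n_iter):
            # Step 1: assign each object to its closest mean vector.
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Step 2: update each cluster's mean vector from the current assignments.
            new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                    else centers[j] for j in range(k)])
            if np.allclose(new_centers, centers):   # stop when nothing changes
                break
            centers = new_centers
        return labels, centers

    X = np.random.default_rng(1).normal(size=(100, 17))   # e.g. 100 genes x 17 time points
    labels, centers = kmeans(X, k=3)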

7
Issues in application of K-means
  • A) How many clusters are there? (How to choose
    K?)
  • B) Initial values and convergence criterion
  • C) Assessment of the quality of each cluster
    and each cluster assignment.
  • D) Visualization of results

Note: many of these issues are handled automatically
in hierarchical clustering.
8
A) How to choose K?
Milligan and Cooper (1985) compared 30 published rules. Two examples:
1. Calinski and Harabasz (1974): choose K to maximize
   CH(k) = [B(k) / (k - 1)] / [W(k) / (n - k)]
2. Hartigan (1975): H(k) = [W(k) / W(k+1) - 1] (n - k - 1); stop when H(k) < 10
W(k): total sum of squares within clusters
B(k): sum of squares between cluster means
(n: the number of observations)
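
Both indices can be computed directly from W(k) and B(k); a rough sketch (SciPy's kmeans2 and the simulated data are assumptions, not part of the slides):

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def within_between_ss(X, labels, centers):
        """W(k): total within-cluster sum of squares; B(k): between-cluster sum of squares."""
        grand = X.mean(axis=0)
        W = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(len(centers)))
        B = sum((labels == j).sum() * ((centers[j] - grand) ** 2).sum()
                for j in range(len(centers)))
        return W, B

    X = np.random.default_rng(0).normal(size=(150, 17))
    n = len(X)
    W = {}
    for k in range(1, 8):
        centers, labels = kmeans2(X, k, seed=0, minit="++")
        W[k], B = within_between_ss(X, labels, centers)
        if k > 1:
            ch = (B / (k - 1)) / (W[k] / (n - k))     # Calinski-Harabasz: pick K maximizing CH(k)
            print("CH", k, round(ch, 2))
    for k in range(1, 7):
        H = (W[k] / W[k + 1] - 1) * (n - k - 1)       # Hartigan: stop when H(k) < 10
        print("H", k, round(H, 2))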
9
(No Transcript)
10
(No Transcript)
11
(No Transcript)
12
A promising new approach uses the concept of
prediction strength (Tibshirani et al., 2001).
Let X be an n by p data matrix and C a clustering
of X. D[C, X] is the n by n matrix whose (i, j)th
entry indicates whether the ith and jth elements of
X belong to the same cluster according to C (i.e.,
whether they are nearest to the same cluster
centroid). Also, let C(X, k) be the result of
applying K-means to X with K = k. To calculate the
prediction strength of this clustering, partition
the data set into a training set Xtr and a test set
Xte.
13
Apply clustering with K = k separately to the
training data and to the test data. Let
n_k1, n_k2, ..., n_kk be the number of observations in
the k test clusters, and let A_k1, A_k2, ..., A_kk be
the indices of the test observations in these
clusters. Then the prediction strength is defined
as follows:
-- For each test cluster, compute the proportion of
observation pairs in the cluster that are also
predicted to be co-clustered according to the
cluster centroids from the training data.
-- Take the minimum over the k test clusters.
-- Average this index over random partitions into
test and training data.
-- The partition is usually 2-fold.
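
A sketch of this computation under one reading of the definition above (the function name, SciPy's kmeans2, and the toy data are all assumptions):

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def prediction_strength(X, k, n_splits=20, seed=0):
        """Minimum over test clusters of the pairwise co-clustering proportion,
        averaged over random 2-fold train/test partitions."""
        rng = np.random.default_rng(seed)
        ps = []
        for _ in range(n_splits):
            idx = rng.permutation(len(X))
            tr, te = idx[: len(X) // 2], idx[len(X) // 2:]
            ctr_tr, _ = kmeans2(X[tr], k, seed=int(rng.integers(1 << 30)), minit="++")
            _, lab_te = kmeans2(X[te], k, seed=int(rng.integers(1 << 30)), minit="++")
            # "Prediction": assign each test point to its nearest training centroid.
            pred = np.linalg.norm(X[te][:, None] - ctr_tr[None], axis=2).argmin(axis=1)
            worst = 1.0
            for j in range(k):
                members = np.where(lab_te == j)[0]
                m = len(members)
                if m < 2:
                    continue
                same = pred[members][:, None] == pred[members][None, :]
                prop = (same.sum() - m) / (m * (m - 1))   # co-clustered pairs / all pairs
                worst = min(worst, prop)
            ps.append(worst)
        return float(np.mean(ps))

    X = np.random.default_rng(1).normal(size=(200, 2))
    print(prediction_strength(X, k=3))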
14
Tibshirani, 2001
15
Tibshirani, 2001
16
Let A_k(i) be the set of other observations in the
test cluster that contains observation i. Then the
prediction strength for observation i is
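
presumably, by analogy with the per-cluster definition above, the fraction of observation i's test-cluster mates that the training centroids place together with i (the exact formula is on the missing slide; this reconstruction is an assumption):

    ps_i = \frac{1}{|A_k(i)|} \sum_{i' \in A_k(i)} D\left[ C(X_{\mathrm{tr}}, k),\, X_{\mathrm{te}} \right]_{i\, i'}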
17
Tibshirani, 2001
18
Tibshirani, 2001
19
Tibshirani, 2001
20
Overall, prediction strength works well. However,
the results in the case of two elongated clusters
suggest that further investigation is needed.
Other issues: How to handle sporadic cases? How
should the data be partitioned: 2-fold or 5-fold?
Could the method be modified to allow the training
and test samples to overlap?
21
Starting values and convergence
Convergence is easy: stop when the cluster
assignments no longer change.
The choice of initial values is more difficult.
22
K-means requires good initial values.
Hierarchical clustering could be used but
sometimes performs poorly.
(Figure: within-cluster sum of squares: X = 965.32, O = 305.09)
23
  • An improved choice of initial values (sketched below):
  • Perform hierarchical clustering.
  • Truncate the hierarchical tree early to
    obtain p × k clusters
    (k: the number of clusters, p: an integer).
  • Among these p × k clusters, choose the k clusters
    containing the largest numbers of points. Use the
    cluster centers of these k clusters as the initial
    values.
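
A minimal sketch of this initialization (SciPy/NumPy; the choice p = 3 anticipates the later slides, and the rest of the setup is illustrative):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.cluster.vq import kmeans2

    def hclust_init(X, k, p=3):
        """Cut an average-linkage tree into p*k clusters and return the centers
        of the k largest ones as K-means starting values."""
        labels = fcluster(linkage(X, method="average"), t=p * k, criterion="maxclust")
        sizes = np.bincount(labels)[1:]               # cluster ids are 1 .. p*k
        largest = np.argsort(sizes)[::-1][:k] + 1     # the k most populated clusters
        return np.array([X[labels == j].mean(axis=0) for j in largest])

    X = np.random.default_rng(0).normal(size=(200, 17))
    init = hclust_init(X, k=3, p=3)
    centers, labels = kmeans2(X, init, minit="matrix")   # run K-means from these centers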

24
Simulated example: 3 normally distributed
clusters (each with 50 points), plus 50 sporadic
points added
25
Judging K-means errors
R-value: the sum of squared distances from the
computed cluster centers to the true centers.
26
K-means error rates using different initial
values (100,000 simulations)
Overall, hierarchical clustering truncated at 3 × k
clusters provides good initial values.
27
Visualization of clustering results
Myeloma (MM) data set: 5 normal samples, 5 MGUS
samples, and 9 MM samples.
Genes filtered by large variation. Filtering output:
12600 probe sets; 167 genes satisfied the filtering criterion.
Variation across samples: 1.00 < standard deviation / mean < 10.00
P (present) call in the arrays used > 20
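
A rough sketch of that variation filter (the thresholds are copied from the slide; the simulated expression values, the fake present/absent calls, and reading "> 20" as 20% of arrays are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    expr = rng.lognormal(mean=5.0, sigma=1.2, size=(12600, 19))   # 12600 probe sets x 19 arrays
    present = rng.random((12600, 19)) < 0.5                       # fake present/absent calls

    cv = expr.std(axis=1) / expr.mean(axis=1)                     # standard deviation / mean
    keep = (cv > 1.00) & (cv < 10.00) & (100 * present.mean(axis=1) > 20)
    print(keep.sum(), "genes satisfy the filtering criterion")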
28
Sample clustering using the 167 filtered genes. Each
gene has its expression values standardized across
rows (samples); red = 3, blue = -3, white = 0.
Three major sample clusters appear on top.
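
A small sketch of that per-gene standardization (NumPy; the matrix orientation and toy numbers are assumptions):

    import numpy as np

    def standardize_genes(E):
        """Standardize each gene's expression values across the samples (mean 0, sd 1)."""
        return (E - E.mean(axis=1, keepdims=True)) / E.std(axis=1, keepdims=True)

    E = np.random.default_rng(0).normal(size=(167, 19))   # 167 filtered genes x 19 samples
    Z = np.clip(standardize_genes(E), -3, 3)              # display scale: blue = -3 ... red = +3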
29
Dimension reduction using Linear discriminant
analysis (LDA, to be introduced)
30
Viewing clustering result through dimension
reduction
Clustering of gene expression data. a,
Hierarchical clustering dendrogram with the
cluster of 19 melanomas at the centre. b, MDS
three-dimensional plot of all 31 cutaneous
melanoma samples showing major cluster of 19
samples (blue, within cylinder), and remaining 12
samples (gold). Bittner et al., Nature, vol. 406,
p. 536, 2000.
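
A sketch of that kind of MDS view (scikit-learn; the random matrix stands in for the melanoma expression profiles):

    import numpy as np
    from sklearn.manifold import MDS

    X = np.random.default_rng(0).normal(size=(31, 167))            # 31 samples x filtered genes
    coords = MDS(n_components=3, random_state=0).fit_transform(X)  # 3-D coordinates for plotting
    print(coords.shape)                                            # (31, 3)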
31
Homework
  • Read and execute the K-means clustering program:
    http://people.revoledu.com/kardi/tutorial/kMean/matlab_kMeans.htm
  • Read and execute the Matlab functions for
    hierarchical and K-means clustering:
    http://www.mathworks.com/access/helpdesk_r13/help/toolbox/bioinfo/a1060813267b1.html