Title: Cluster Analysis for Gene Expression Data
1. Cluster Analysis for Gene Expression Data
- Ka Yee Yeung
- http://staff.washington.edu/kayee/research.html
- Center for Expression Arrays
- Department of Microbiology
- kayee_at_u.washington.edu
2. A gene expression data set
(figure: an n × p matrix — n genes in rows, p experiments in columns, with entry Xij giving the expression of gene i in experiment j)
- Snapshot of activities in the cell
- Each chip represents an experiment
- time course
- tissue samples (normal/cancer)
3. What is clustering?
- Group similar objects together
- Objects in the same cluster (group) are more similar to each other than objects in different clusters
- Exploratory data analysis tool to find patterns in large data sets
- Unsupervised approach: does not make use of prior knowledge of the data
4. Applications of clustering gene expression data
- Cluster the genes → functionally related genes
- Cluster the experiments → discover new subtypes of tissue samples
- Cluster both genes and experiments → find sub-patterns
5. Examples of clustering algorithms
- Hierarchical clustering algorithms, e.g. Eisen et al. 1998
- K-means, e.g. Tavazoie et al. 1999
- Self-organizing maps (SOM), e.g. Tamayo et al. 1999
- CAST, Ben-Dor & Yakhini 1999
- Model-based clustering algorithms, e.g. Yeung et al. 2001
6. Overview
- Similarity/distance measures
- Hierarchical clustering algorithms
- Made popular by Stanford, i.e. Eisen et al. 1998
- K-means
- Made popular by many groups, e.g. Tavazoie et al. 1999
- Model-based clustering algorithms, Yeung et al. 2001
7. How to define similarity?
(figure: the raw n × p data matrix X — genes × experiments — is converted into an n × n pairwise similarity matrix over the genes)
- Similarity measures
- A measure of pairwise similarity or dissimilarity
- Examples
- Correlation coefficient
- Euclidean distance
8. Similarity measures (for those of you who enjoy equations)
- Euclidean distance
- Correlation coefficient
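The slide's original equations were images and did not survive conversion; these are the standard definitions for two gene profiles x and y measured over p experiments:

```latex
d(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}
\qquad
r(x, y) = \frac{\sum_{j=1}^{p} (x_j - \bar{x})(y_j - \bar{y})}
               {\sqrt{\sum_{j=1}^{p} (x_j - \bar{x})^2}\;\sqrt{\sum_{j=1}^{p} (y_j - \bar{y})^2}}
```

Here d is the Euclidean distance and r is the Pearson correlation coefficient, with x̄ and ȳ the means of the two profiles.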
9. Example
- Correlation(X,Y) = 1, Distance(X,Y) = 4
- Correlation(X,Z) = -1, Distance(X,Z) = 2.83
- Correlation(X,W) = 1, Distance(X,W) = 1.41
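The slide's vectors X, Y, Z, W were shown only in a figure, so the sketch below uses its own illustrative vectors; it reproduces the same effect: a shifted copy of X has correlation 1 but nonzero Euclidean distance, while a reversed copy has correlation -1.

```python
import numpy as np

# Illustrative gene-expression profiles (not the slide's exact vectors,
# which were shown only in a lost figure).
X = np.array([1.0, 2.0, 3.0, 4.0])
Y = X + 2.0          # same shape as X, shifted up
Z = X[::-1].copy()   # reversed: opposite direction

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def correlation(a, b):
    return np.corrcoef(a, b)[0, 1]

print(correlation(X, Y), euclidean(X, Y))  # 1.0, 4.0
print(correlation(X, Z), euclidean(X, Z))  # -1.0, ~4.47
```
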
10. Lessons from the example
- Correlation captures direction only
- Euclidean distance captures magnitude and direction
- Array data are noisy → need many experiments to robustly estimate pairwise similarity
11. Clustering algorithms
- From pairwise similarities to groups
- Inputs
- Raw data matrix or similarity matrix
- Number of clusters or some other parameters
12. Hierarchical clustering (Hartigan 1975)
- Agglomerative (bottom-up)
- Algorithm
- Initialize: each item is its own cluster
- Iterate
- select the two most similar clusters
- merge them
- Halt when the required number of clusters is reached
(figure: dendrogram)
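The steps above can be sketched with scipy's hierarchical clustering routines (the slides do not prescribe an implementation; the toy data here are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy expression matrix: 6 "genes" x 4 "experiments", two tight groups.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.1, (3, 4)),
                  rng.normal(5, 0.1, (3, 4))])

# Agglomerative clustering; method can be 'single', 'complete', or 'average'
# (the linkage criteria discussed on the next slides).
Z = linkage(data, method="average", metric="euclidean")

# Cut the dendrogram to get the required number of clusters (here 2).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # first three genes share one label, last three the other
```
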
13. Hierarchical: single link
- cluster similarity = similarity of the two most similar members
- Potentially long and skinny clusters
- Fast
14-16. Example: single link
(figures: five items, labeled 1-5, merged step by step under single-link clustering)
17. Hierarchical: complete link
- cluster similarity = similarity of the two least similar members
- Tight clusters
- Slow
18-20. Example: complete link
(figures: the same five items merged step by step under complete-link clustering)
21. Hierarchical: average link
- cluster similarity = average similarity of all pairs
- Tight clusters
- Slow
22. Software: TreeView (Eisen et al. 1998)
- Fig 1 in Eisen's PNAS 99 paper
- Time course of serum stimulation of primary human fibroblasts
- cDNA arrays with approx. 8600 spots
- Similar to average-link
- Free download at http://rana.lbl.gov/EisenSoftware.htm
23. Overview
- Similarity/distance measures
- Hierarchical clustering algorithms
- Made popular by Stanford, i.e. Eisen et al. 1998
- K-means
- Made popular by many groups, e.g. Tavazoie et al. 1999
- Model-based clustering algorithms, Yeung et al. 2001
24. Partitional: K-means (MacQueen 1965)
(figure: data points partitioned into three clusters, numbered 1-3)
25. Details of k-means
- Iterate until convergence
- Assign each data point to the closest centroid
- Compute new centroids
- Objective function: minimize the sum of squared distances from each point to its cluster centroid
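A minimal sketch of the two alternating steps above (the initialization here — the first k points — is a placeholder; k-means++ is the usual choice in practice):

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain k-means: alternate assignment and centroid update."""
    centroids = X[:k].copy()  # naive init: first k points (k-means++ is better)
    for _ in range(n_iter):
        # Step 1: assign each data point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points.
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):  # converged
            break
        centroids = new
    return labels, centroids

# Two well-separated toy clusters.
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
labels, _ = kmeans(X, k=2)
print(labels)  # first three points in one cluster, last three in the other
```
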
26. Properties of k-means
- Fast
- Proved to converge to a local optimum
- In practice, converges quickly
- Tends to produce spherical, equal-sized clusters
- Related to the model-based approach
- Gavin Sherlock's Xcluster
- http://genome-www.stanford.edu/sherlock/cluster.html
27. What we have seen so far...
- Definition of clustering
- Pairwise similarity
- Correlation
- Euclidean distance
- Clustering algorithms
- Hierarchical agglomerative
- K-means
- Different clustering algorithms → different clusters
- Clustering algorithms always spit out clusters
28. Which clustering algorithm should I use?
- Good question
- No definite answer; on-going research
- Our preference: the model-based approach
29. Model-based clustering (MBC)
- Gaussian mixture model
- Assume each cluster is generated by a multivariate normal distribution
- Each cluster k has parameters:
- Mean vector μk: location of cluster k
- Covariance matrix Σk: volume, shape and orientation of cluster k
- Data transformations help satisfy the normality assumption
30. More on the covariance matrix Σk (volume, orientation, shape)
- Equal volume, spherical (EI)
- Unequal volume, spherical (VI)
- Equal volume, orientation, shape (EEE)
- Diagonal model
- Unconstrained (VVV)
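The slides use mclust (see the final slide); as a stand-in sketch, scikit-learn's Gaussian mixture exposes a coarser set of covariance constraints — the mapping in the comment below is only approximate, and the toy data are made up:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two made-up Gaussian clusters in 2-D, standing in for expression profiles.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.3, (50, 2)),
               rng.normal([4, 4], 0.3, (50, 2))])

# covariance_type maps only roughly onto the slide's models:
#   'spherical' ~ EI/VI, 'tied' ~ EEE, 'diag' ~ diagonal, 'full' ~ VVV
gm = GaussianMixture(n_components=2, covariance_type="full",
                     random_state=0).fit(X)
labels = gm.predict(X)
print(labels)  # the two generated groups receive two distinct labels
```
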
31. Key advantage of the model-based approach: choose the model and the number of clusters
- Bayesian Information Criterion (BIC), Schwarz 1978
- Approximates p(data | model)
- A large BIC score indicates strong evidence for the corresponding model
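The slide's BIC equation was an image; the standard form, in the sign convention mclust uses (larger is better, matching "large BIC = strong evidence"), is:

```latex
\mathrm{BIC}_M \;=\; 2 \log p(\text{data} \mid \hat{\theta}_M, M) \;-\; \nu_M \log n
```

where θ̂M is the maximum-likelihood parameter estimate for model M, νM is the number of free parameters in M, and n is the number of observations.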
32. Gene expression data sets
- Ovary data: Michel Schummer, Institute of Systems Biology
- Subset of data: 235 clones (portions of genes)
- 24 experiments (cancer/normal tissue samples)
- 235 clones correspond to 4 genes (external criterion)
33. BIC analysis: square-root ovary data
- EEE and diagonal models → first local max at 4 clusters
- Global max → VI at 8 clusters
34. How do we know MBC is doing well? Answer: compare to external info
- Adjusted Rand index: max at EEE, 4 clusters (> CAST)
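The adjusted Rand index compares a clustering to external labels, correcting for chance agreement; a minimal sketch with hypothetical labels (not the ovary data's actual annotation):

```python
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels: an external annotation (e.g. which known gene each
# clone belongs to) vs. a clustering result with one item misplaced.
annotation = [0, 0, 0, 1, 1, 1, 2, 2, 2]
clusters   = [0, 0, 1, 1, 1, 1, 2, 2, 2]

print(adjusted_rand_score(annotation, annotation))  # 1.0: perfect agreement
print(adjusted_rand_score(annotation, clusters))    # below 1: one item misplaced
```

A score of 1 means perfect agreement, while values near 0 are what random labelings achieve; this is the external criterion used on this slide to rank the clustering methods.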
35. Take-home messages
- MBC has superior performance on:
- Quality of clusters
- Number of clusters and model chosen (BIC)
- Clusters with high BIC scores tend to produce high agreement with the external information
- MBC tends to produce better clusters than a leading heuristic-based clustering algorithm (CAST)
- S-PLUS or R versions
- http://www.stat.washington.edu/fraley/mclust/