Cluster Analysis of Microarray Gene Expression

1
Cluster Analysis of Microarray Gene Expression
CSCI5980: Functional Genomics, Systems Biology
and Bioinformatics

Rui Kuang and Chad Myers
Department of Computer Science and Engineering
University of Minnesota
kuang_at_cs.umn.edu

2
Hierarchical Clustering

Organize the genes in a hierarchical tree
structure.
Initial step: each gene is regarded as a cluster
with one item.
Find the 2 most similar clusters and merge them
into a common node.
The length of the branch is proportional to the
distance between the merged clusters.
Iterate on merging nodes until all genes are
contained in one cluster: the root of the tree.
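
The merging procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in any particular tool: the slide does not say how the distance between two clusters is measured, so Euclidean distance with average linkage is an assumption here, and the names `hierarchical_clustering` and `average_linkage` and the toy `genes` data are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, profiles):
    """Assumed cluster distance: mean pairwise distance between members."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def hierarchical_clustering(profiles):
    """Return the merge history as (members_a, members_b, distance) tuples."""
    # Initial step: each gene is a cluster with one item.
    clusters = [(i,) for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        # Find the two most similar (closest) clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], profiles)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Merge them into a common node; d plays the role of branch length.
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges  # the last merge is the root of the tree

# Toy data: 5 genes measured in 2 conditions (made-up values).
genes = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 7.0], [20.0, 0.0]]
merges = hierarchical_clustering(genes)
for left, right, dist in merges:
    print(left, right, round(dist, 2))
```

This naive version recomputes every cluster-to-cluster distance at each step, which is fine for a handful of genes but far too slow for thousands.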

3
K-MEANS

The user sets the number of clusters, k.
Initialization: each gene is randomly assigned to
one of the k clusters.
An average expression vector is calculated for each
cluster (the cluster's profile).
Iterate over the genes:
For each gene, compute its similarity to the
cluster profiles.
Move the gene to the cluster it is most similar
to.
Recalculate the cluster profiles.
Stop criterion: further shuffling of genes results
in only minor improvement in the clustering score.
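
As a rough sketch of these steps, the Python below follows the slide's random-assignment initialization. Several details are assumptions rather than anything stated in the slides: similarity is taken to be (negative) squared Euclidean distance, all genes are reassigned in one pass before the profiles are recalculated, the loop stops when the partition no longer changes, and an empty cluster is re-seeded with a random gene. The function name `kmeans` and the toy data are hypothetical.

```python
import random

def mean_vector(vectors):
    """Average expression vector (cluster profile) of a list of profiles."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance; smaller means more similar (assumption)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initialization: each gene is randomly assigned to one of the k clusters.
    labels = [rng.randrange(k) for _ in profiles]
    centroids = [None] * k
    for _ in range(max_iter):
        # Calculate the profile of each cluster from its current members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random gene.
            centroids[c] = mean_vector(members) if members else list(rng.choice(profiles))
        # Move every gene to the cluster whose profile it is most similar to.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in profiles]
        if new_labels == labels:  # partition unchanged: stop
            break
        labels = new_labels
    return labels, centroids

# Toy data: two well-separated groups of 3 genes each (made-up values).
genes = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [100.0, 100.0], [100.0, 101.0], [101.0, 100.0]]
labels, centroids = kmeans(genes, k=2)
print(labels)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful.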

4
How Many Clusters?

Try several parameters and compare the clustering
solutions.
Mathematical criteria for comparison of
clustering solutions (later in the presentation).
PCA (Principal Component Analysis):
A technique for projecting the gene expression
data set onto a reduced (2- or 3-dimensional),
easily visualized space.

5
PCA - Example

Dataset: thousands of genes probed in 5
conditions (time points relative to treatment).
The expression profile of each gene is represented
by the vector of its expression levels X = (X1,
X2, X3, X4, X5).
Imagine each gene X as a point in a 5-dimensional
space.
Each direction/axis corresponds to a specific
condition.
Genes with similar profiles are close to each
other in this space.
PCA: project this dataset onto 2 dimensions,
preserving as much information as possible.
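
One way to make the projection concrete is the sketch below, which centers the data, builds the 5x5 covariance matrix, and projects each gene onto the two directions of largest variance. To stay dependency-free it finds those directions by power iteration with deflation, which is a choice made for this illustration; a numerical library's eigen-solver or SVD would normally be used instead. The function names and the toy data are hypothetical.

```python
import math
import random

def center(rows):
    """Subtract the per-condition mean from every gene profile."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - mu for x, mu in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix between conditions."""
    n, m = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(m)]
            for i in range(m)]

def mat_vec(matrix, v):
    return [sum(a * b for a, b in zip(row, v)) for row in matrix]

def leading_eigenpair(matrix, rng, iterations=200):
    """Power iteration: repeatedly multiply and renormalize a random vector."""
    v = [rng.uniform(-1.0, 1.0) for _ in matrix]
    for _ in range(iterations):
        w = mat_vec(matrix, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(matrix, v)))
    return eigenvalue, v

def pca_project(rows, n_components=2, seed=0):
    """Project each row onto the top principal components."""
    centered = center(rows)
    cov = covariance(centered)
    rng = random.Random(seed)
    axes = []
    for _ in range(n_components):
        lam, v = leading_eigenpair(cov, rng)
        axes.append(v)
        # Deflation: remove the component just found from the covariance.
        size = len(v)
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(size)]
               for i in range(size)]
    return [[sum(x * a for x, a in zip(row, axis)) for axis in axes]
            for row in centered]

# Toy data: 4 genes in 5 conditions (made-up values).
genes = [[3.0, 3.0, 3.0, 3.0, 0.0], [-3.0, -3.0, -3.0, -3.0, 0.0],
         [0.0, 0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0, -1.0]]
projected = pca_project(genes)
for point in projected:
    print([round(c, 3) for c in point])
```

The sign of each axis is arbitrary, so a projection may come out mirrored; distances between genes in the 2-D plot are unaffected.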

6
PCA Example
Visual estimation of the number of clusters in
the data
7
K-MEANS example: 4 clusters
8
[Figure labels: Cluster 1, Cluster 2, Cluster 3, Cluster 4, Mis-classified]
9
K-means example: 3 clusters
10
Too few clusters: K = 2
11
Convergence of K-MEANS

Define the goodness measure G_k of cluster k as the
sum of squared distances of its genes from the
cluster centroid; G is the sum of G_k over all
clusters.
Reassignment never increases G, since
each vector is assigned to the closest centroid.
Stop criteria, e.g.:
A fixed number of iterations.
Gene partition unchanged.
Centroid positions don't change.
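
The convergence argument can be written out in symbols. The notation below is introduced only for this sketch and does not come from the slides: C_k is the set of genes in cluster k, c_k its centroid, a(i) the cluster of gene i before reassignment and a'(i) after. The last line adds the standard fact that recomputing centroids cannot increase G either, because the mean minimizes the sum of squared distances.

```latex
\begin{align*}
% Goodness measure per cluster and in total
G_k &= \sum_{x_i \in C_k} \lVert x_i - c_k \rVert^2,
  \qquad G = \sum_{k=1}^{K} G_k \\
% Reassignment: each gene's term can only shrink or stay the same
\lVert x_i - c_{a'(i)} \rVert^2
  &= \min_{k} \lVert x_i - c_k \rVert^2
  \le \lVert x_i - c_{a(i)} \rVert^2 \\
% Centroid update: the mean minimizes the within-cluster sum of squares
c_k &= \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
  = \arg\min_{c} \sum_{x_i \in C_k} \lVert x_i - c \rVert^2
\end{align*}
```

Since G is bounded below by zero and never increases, and there are only finitely many partitions, the procedure must eventually stop changing.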

12
Time Complexity

Computing the distance between two genes is O(m),
where m is the number of experiments.
Reassigning clusters: O(Kn) distance
computations, or O(Knm), where n is the number of
genes.
Computing centroids: each gene gets added once to
some centroid, O(nm).
Assume these two steps are each done once per
iteration, for I iterations: O(IKnm).
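
To get a feel for the numbers, the small helper below counts per-coordinate operations using the formulas above. It is only an order-of-magnitude illustration: the function name `kmeans_cost` is hypothetical, constant factors are ignored, and the example values (3,000 genes, 15 experiments, 30 clusters, 10 iterations) are chosen for illustration.

```python
def kmeans_cost(n, m, k, iterations):
    """Rough operation count for k-means, ignoring constant factors."""
    reassign = k * n * m   # K distance computations per gene, each O(m)
    recompute = n * m      # each gene is added once to some centroid
    return iterations * (reassign + recompute)

# Illustrative values: n = 3000 genes, m = 15 experiments, K = 30, I = 10.
total = kmeans_cost(n=3000, m=15, k=30, iterations=10)
print(total)  # 13950000
```

The reassignment term dominates, which is why the overall bound is written as O(IKnm).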

13
K-MEANS Example

Tavazoie et al., Nature Genetics 1999
Cell cycle of S. cerevisiae
Measurements of 6,220 ORFs at 15 time points
K-means clustering groups 3,000 genes into 30
clusters
Interpretation with MIPS protein functions and
common transcription factor binding motifs

14
K-MEANS Example
[Figure labels: MluI cell-cycle box; Swi4/6 cell-cycle box]
15
K-means

Advantages:
Simple and fast!
Disadvantages:
The number of gene clusters is unknown and must be
set by the user
No structure between the clusters
Not robust to outliers: a centroid can be greatly
affected by outliers

16
SOMs (Self-Organizing Maps)

The user sets the number of clusters in the form of a
rectangular grid (e.g., 3x2) of map nodes.
Imagine genes as points in (M-dimensional) space.
Initialization: map nodes (corresponding to
clusters) are randomly placed in the data space.
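
The initialization step alone might look like the sketch below. The slides only say that map nodes are placed randomly in the data space, so drawing each coordinate uniformly from the range observed in the data is an assumption, as are the function name `init_som_nodes` and the toy data.

```python
import random

def init_som_nodes(profiles, rows, cols, seed=0):
    """Place one node per grid cell at a random position in the data space."""
    rng = random.Random(seed)
    # Assumption: sample uniformly inside the bounding box of the data.
    lows = [min(col) for col in zip(*profiles)]
    highs = [max(col) for col in zip(*profiles)]
    return {(r, c): [rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
            for r in range(rows) for c in range(cols)}

# Toy data: 4 genes in M = 3 conditions (made-up values), 3x2 grid of nodes.
genes = [[0.0, 1.0, 2.0], [1.0, 0.5, 3.0], [4.0, 2.0, 1.0], [2.0, 2.0, 2.0]]
nodes = init_som_nodes(genes, rows=3, cols=2)
for grid_pos, position in sorted(nodes.items()):
    print(grid_pos, [round(x, 2) for x in position])
```

Each node keeps two coordinates: its fixed cell in the grid, used as the dictionary key, and its movable position in the M-dimensional data space.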

17
Genes: data points
Clusters: map nodes
18
SOM - Scheme