Cluster Analysis of Microarray Gene Expression

1
Cluster Analysis of Microarray Gene Expression
CSCI5980 Functional Genomics, Systems Biology
and Bioinformatics
  • Rui Kuang and Chad Myers
  • Department of Computer Science and Engineering
  • University of Minnesota
  • kuang_at_cs.umn.edu

2
Hierarchical Clustering
  • Organize the genes in a hierarchical tree
    structure
  • Initial step: each gene is regarded as a cluster
    with one item
  • Find the 2 most similar clusters and merge them
    into a common node
  • The length of the branch is proportional to the
    distance between the merged clusters
  • Iterate on merging nodes until all genes are
    contained in one cluster, the root of the tree.
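
The merging procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the slides do not say how cluster similarity is measured, so Euclidean distance and average linkage are assumptions here, and the function names and toy profiles are hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two expression profiles (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles, members_a, members_b):
    """Mean pairwise distance between two clusters (assumed linkage rule)."""
    total = sum(euclidean(profiles[i], profiles[j])
                for i in members_a for j in members_b)
    return total / (len(members_a) * len(members_b))

def agglomerate(profiles):
    """Return the merges as (members_a, members_b, distance), in merge order."""
    clusters = [(i,) for i in range(len(profiles))]  # each gene starts alone
    merges = []
    while len(clusters) > 1:
        # Find the two most similar clusters.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(profiles, *pair))
        merges.append((a, b, average_linkage(profiles, a, b)))
        # Replace them by their union (a new tree node).
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Toy data: four genes measured in two conditions (made-up values).
profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 12.0]]
merges = agglomerate(profiles)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

The recorded distance of each merge is what a dendrogram would draw as branch height. The search over all cluster pairs is repeated at every step, which is fine for a toy example but too slow for thousands of genes.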

3
K-MEANS
  • The user sets the number of clusters, k
  • Initialization: each gene is randomly assigned to
    one of the k clusters
  • The average expression vector is calculated for
    each cluster (the cluster's profile)
  • Iterate over the genes
  • For each gene, compute its similarity to the
    cluster profiles.
  • Move the gene to the cluster it is most similar
    to.
  • Recalculate the cluster profiles.
  • Stop criterion: further shuffling of genes results
    in only minor improvement in the clustering score
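
The steps above can be sketched as follows. This is a minimal batch version under stated assumptions: similarity is taken to be (negative) squared Euclidean distance, the loop stops when the partition no longer changes rather than on a score threshold, and an empty cluster is re-seeded with a random gene. None of these choices comes from the slides, and the names are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance (assumed dissimilarity measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_profile(vectors):
    """Average expression vector of a cluster."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def k_means(genes, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in genes]  # random initial assignment
    profiles = []
    for _ in range(max_iter):
        # Recalculate the cluster profiles from the current assignment.
        profiles = []
        for c in range(k):
            members = [g for g, lab in zip(genes, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random gene.
            profiles.append(mean_profile(members) if members
                            else list(rng.choice(genes)))
        # Move every gene to the cluster whose profile is closest.
        new_labels = [min(range(k), key=lambda c: sq_dist(g, profiles[c]))
                      for g in genes]
        if new_labels == labels:  # partition unchanged: stop
            break
        labels = new_labels
    return labels, profiles

# Toy data: two well-separated groups of three genes (made-up values).
genes = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, profiles = k_means(genes, k=2)
print(labels)
```

Cluster numbers are arbitrary, so only the grouping is meaningful: different seeds may swap the labels 0 and 1.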

4
How Many Clusters?
  • Try several values of k and compare the
    clustering solutions
  • Mathematical criteria for comparison of
    clustering solutions (later in the presentation)
  • PCA (Principal Component Analysis)
  • A technique for projecting the gene expression
    data set onto a reduced (2- or 3-dimensional),
    easily visualized space

5
PCA - Example
  • Dataset: thousands of genes probed in 5
    conditions (time points relative to treatment)
  • The expression profile of each gene is
    represented by the vector of its expression
    levels X = (X1, X2, X3, X4, X5)
  • Imagine each gene X as a point in a 5-dimensional
    space.
  • Each direction/axis corresponds to a specific
    condition
  • Genes with similar profiles are close to each
    other in this space
  • PCA: project this dataset to 2 dimensions,
    preserving as much information as possible
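
One way to picture the projection is the sketch below, which takes "information" to mean variance and finds the two directions of largest variance. To stay dependency-free it uses power iteration with deflation on the covariance matrix, with a fixed iteration count. That is a choice made here for illustration, not something the slides specify, and a numerical library's eigen or SVD routine would normally be used instead. All function names and data values are hypothetical.

```python
import math

def mat_vec(matrix, v):
    return [sum(m * x for m, x in zip(row, v)) for row in matrix]

def center(rows):
    """Subtract the per-condition mean from every gene."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1)
             for j in range(dims)] for i in range(dims)]

def top_component(cov, iterations=200):
    """Leading eigenvector by power iteration (fixed count is an assumption)."""
    v = [1.0 / (i + 1) for i in range(len(cov))]  # arbitrary start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, eigenvalue

def pca_project(rows, n_components=2):
    centered = center(rows)
    cov = covariance(centered)
    components = []
    for _ in range(n_components):
        v, lam = top_component(cov)
        components.append(v)
        # Deflation: remove the direction just found from the covariance.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in centered]

# Toy data: four genes in 5 conditions that vary only in the first two.
genes = [[2.0, 0.0, 1.0, 1.0, 1.0], [0.0, 1.0, 1.0, 1.0, 1.0],
         [-2.0, 0.0, 1.0, 1.0, 1.0], [0.0, -1.0, 1.0, 1.0, 1.0]]
projected = pca_project(genes)
for point in projected:
    print([round(x, 3) for x in point])
```

Because the toy genes vary in only two conditions, the 2-dimensional projection loses nothing here. With real expression data some variance is always discarded.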

6
PCA Example
Visual estimation of the number of clusters in
the data
7
K-MEANS example: 4 clusters
8
[Figure: K-means result with 4 clusters. Labels: Cluster 1, Cluster 2, Cluster 3, Cluster 4, Mis-classified]
9
K-means example: 3 clusters
10
Too few clusters: K = 2
11
Convergence of K-MEANS
  • Define the goodness measure of cluster k as the
    sum of squared distances of its genes from the
    cluster centroid; G is the sum of this measure
    over all clusters.
  • Reassignment monotonically decreases G (it never
    increases it), since each vector is assigned to
    the closest centroid.
  • Stop criteria, e.g.,
  • A fixed number of iterations.
  • Gene partition unchanged.
  • Centroid positions don't change.
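
The goodness measure described above can be written out explicitly. The notation is an assumption made here for illustration: $C_k$ is the set of genes in cluster $k$, $x_i$ is the expression vector of gene $i$, and $c_k$ is the centroid of cluster $k$.

```latex
G_k = \sum_{x_i \in C_k} \lVert x_i - c_k \rVert^2,
\qquad
G = \sum_{k=1}^{K} G_k,
\qquad
c_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i .
```

Each of the two steps can only lower $G$ or leave it unchanged. Reassignment replaces each term $\lVert x_i - c_k \rVert^2$ by the minimum over all centroids, and recomputing $c_k$ as the mean minimizes $G_k$ for fixed membership. Since there are finitely many partitions, the procedure reaches a partition that no longer changes, which may be a local rather than a global minimum of $G$.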

12
Time Complexity
  • Computing the distance between two genes is O(m),
    where m is the number of experiments.
  • Reassigning clusters: O(Kn) distance
    computations, or O(Knm), for n genes and K
    clusters.
  • Computing centroids: each gene gets added once to
    some centroid, O(nm).
  • Assume these two steps are each done once per
    iteration, for I iterations: O(IKnm).
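
To make the bound concrete, the short sketch below counts elementary operations for one set of illustrative sizes. The function name is hypothetical, and the iteration count of 10 is purely an assumed value.

```python
def kmeans_operation_count(n_genes, n_experiments, k, iterations):
    """Rough operation count following the O(IKnm) argument above."""
    reassign = k * n_genes * n_experiments   # K*n distances, each O(m)
    centroids = n_genes * n_experiments      # each gene added once, O(m)
    return iterations * (reassign + centroids)

# Illustrative sizes: 3,000 genes, 15 experiments, 30 clusters, 10 iterations.
total = kmeans_operation_count(3000, 15, 30, 10)
print(total)  # 13950000
```

The reassignment term dominates: it is K times larger than the centroid term, which is why the overall bound is O(IKnm).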

13
K-MEANS Example
  • Tavazoie et al., Nature 1999
  • Cell cycles of S. cerevisiae
  • Measurements of 6,220 ORFs at 15 time points
  • K-means clustering groups 3,000 genes into 30
    clusters
  • Interpretation with MIPS protein functions and
    common transcription factor binding motifs

14
K-MEANS Example
MluI cell-cycle box
Swi 4/6 cell-cycle box
15
K-means
  • Advantages
  • Simple and fast!
  • Disadvantages
  • The number of gene clusters must be set by the
    user, but is unknown in advance
  • No structure between the clusters
  • Not robust to outliers: a centroid can be greatly
    affected by outliers

16
SOMs (Self-Organizing Maps)
  • The user sets the number of clusters in the form
    of a rectangular grid (e.g., 3x2) of map nodes
  • Imagine genes as points in (M-dimensional) space
  • Initialization: map nodes (corresponding to
    clusters) are randomly placed in the data space

17
Genes: data points
Clusters: map nodes
18
SOM - Scheme
  • Randomly choose a data point (gene).
  • Find its closest map node
  • Move this map node towards the data point
  • Move the neighbor map nodes towards this point,
    but to a lesser extent
  • Iterate over data points

19
  • The extent of node displacements decreases as
    the iteration number grows
  • After thousands of iterations:
  • Assign each gene to the map node (cluster) it is
    most similar to
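
The SOM scheme above can be sketched as follows. Several details are assumptions not given in the slides: nodes start at randomly chosen genes, the neighbor pull falls off as a Gaussian of grid distance, and both the learning rate and the neighborhood width shrink linearly with the iteration number. Function and parameter names are hypothetical.

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(genes, rows, cols, iterations=2000, lr0=0.5, sigma0=1.0, seed=0):
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Assumption: "randomly placed" means starting at randomly chosen genes.
    nodes = [list(rng.choice(genes)) for _ in grid]
    for t in range(iterations):
        frac = 1.0 - t / iterations           # shrinks from 1 towards 0
        lr = lr0 * frac                       # displacement relaxes over time
        sigma = max(sigma0 * frac, 1e-3)      # neighborhood narrows over time
        x = rng.choice(genes)                 # randomly choose a data point
        best = min(range(len(nodes)), key=lambda i: sq_dist(x, nodes[i]))
        for i, pos in enumerate(grid):
            # Pull is strongest for the closest node, weaker for grid neighbors.
            pull = lr * math.exp(-sq_dist(pos, grid[best]) / (2 * sigma ** 2))
            nodes[i] = [w + pull * (xi - w) for w, xi in zip(nodes[i], x)]
    return nodes

def assign(genes, nodes):
    """Assign each gene to the index of its most similar (closest) map node."""
    return [min(range(len(nodes)), key=lambda i: sq_dist(g, nodes[i]))
            for g in genes]

# Toy data: two groups of genes in two conditions (made-up values).
genes = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
nodes = train_som(genes, rows=1, cols=2)
labels = assign(genes, nodes)
print(nodes, labels)
```

Unlike K-means, neighboring nodes on the grid are pulled together during training, so adjacent map nodes tend to end up with similar profiles.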

20
SOM Example
  • Tamayo et al., PNAS 1999
  • Cell cycles of S. cerevisiae dataset and human
    hematopoietic differentiation.

21
SOM Example
  • Two cell cycles of S. cerevisiae
  • 6 x 5 SOM

22
How to evaluate a clustering solution
  • S and T are two clustering solutions. Two genes
    are mates in a solution if it places them in the
    same cluster.
  • n11: number of pairs that are mates in both S
    and T
  • n10: number of pairs that are mates only in S
  • n01: number of pairs that are mates only in T
  • Jaccard coefficient: correctly identified mates
    vs. all mates in S and T,
    n11 / (n11 + n10 + n01)
  • Minkowski coefficient: disagreements vs. true
    mates in S,
    sqrt((n10 + n01) / (n11 + n10))
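
The pair counts and both coefficients can be computed directly from two label lists, as in the sketch below. It assumes the usual definitions, Jaccard = n11 / (n11 + n10 + n01) and Minkowski = sqrt((n10 + n01) / (n11 + n10)), with S treated as the reference ("true") solution. The function names and toy labels are hypothetical, and the degenerate case with no mates at all (division by zero) is not handled.

```python
import math
from itertools import combinations

def pair_counts(s, t):
    """Count gene pairs that are mates in both, only in S, or only in T."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(s)), 2):
        mates_s = s[i] == s[j]
        mates_t = t[i] == t[j]
        if mates_s and mates_t:
            n11 += 1
        elif mates_s:
            n10 += 1
        elif mates_t:
            n01 += 1
    return n11, n10, n01

def jaccard(s, t):
    n11, n10, n01 = pair_counts(s, t)
    return n11 / (n11 + n10 + n01)

def minkowski(s, t):
    n11, n10, n01 = pair_counts(s, t)
    return math.sqrt((n10 + n01) / (n11 + n10))

# Toy example: cluster labels of five genes in two solutions.
S = [0, 0, 0, 1, 1]
T = [0, 0, 1, 1, 1]
print(pair_counts(S, T))          # (2, 2, 2)
print(round(jaccard(S, T), 3))    # 0.333
print(minkowski(S, T))            # 1.0
```

A Jaccard coefficient of 1 and a Minkowski coefficient of 0 both indicate identical solutions. Cluster numbering does not matter, because only co-membership of pairs is compared.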