Lecture 5: Intro to Clustering

Provided by: shen161
1
Lecture 5: Intro to Clustering
2
Motivation
  • Discover unexpected groupings of genes and samples
  • Predict the functions of unknown genes from known
    ones
  • Does a sample cluster share similar clinical
    characteristics (e.g. survival, marker status)?
  • Promoter analysis of commonly regulated genes

3
Functionally significant gene clusters
4
Sub-classes of lung cancer types have signature
genes
Bhattacharjee et al. (2001). Classification of
human lung carcinomas by mRNA expression
profiling reveals distinct adenocarcinoma
subclasses. Proc. Natl. Acad. Sci. USA, Vol. 98,
13790-13795
5
Promoter analysis of commonly regulated genes
David J. Lockhart and Elizabeth A. Winzeler, Nature,
Vol. 405, 15 June 2000, p. 827
6
Clustering algorithms
  • Start with a collection of n objects, each
    represented by a p-dimensional feature vector xi,
    i = 1, ..., n.
  • The goal is to divide these n objects into k
    clusters so that objects within a cluster are
    more similar to each other than to objects in
    other clusters. k is usually unknown.
  • Popular methods: hierarchical, k-means, SOM,
    mixture models.
  • For microarrays, we may cluster genes, or
    samples, or both.

7
Example: N = 700 objects, p = 2 measurements, k = 14
clusters
8
Hierarchical Clustering (Cont.)
  • Multilevel clustering, at level 1 we have n
    clusters and at level n we have one cluster.
  • Agglomerative HC starts with singletons and merges
    clusters.
  • Divisive HC starts with one cluster containing all
    samples and splits clusters.

9
Hierarchical Clustering: Nearest Neighbor
Algorithm
  • The Nearest Neighbor Algorithm is an agglomerative
    HC method (bottom-up).
  • The algorithm starts with n nodes (n is the size
    of our sample). At every level the two most
    similar nodes are merged together into one node.
    The algorithm stops when we get the desired
    number of clusters.
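
The nearest neighbor procedure described on this slide can be sketched as follows. This is a minimal illustration, not the lecture's own code: it assumes Euclidean distance as the (dis)similarity, treats "most similar" as "smallest distance between the closest members of two clusters" (single linkage), and uses made-up 2-D points. The function name `nearest_neighbor_clustering` is hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbor_clustering(points, k):
    """Merge the two closest clusters until only k clusters remain.

    The distance between two clusters is the distance between their
    closest pair of members (single linkage).
    """
    clusters = [[i] for i in range(len(points))]  # start with n singletons
    while len(clusters) > k:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    euclidean(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]                          # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]


# Made-up example data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
result = nearest_neighbor_clustering(points, 3)
print(result)
```

The brute-force search over all cluster pairs keeps the sketch short; it is far too slow for thousands of genes, where a library implementation would be the practical choice.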

10
Hierarchical Clustering algorithm
  1. Similarity between all possible pairs of
     profiles is calculated.
  2. Each profile is placed in a separate cluster.
  3. The two most similar clusters are grouped
     together to form a new cluster.
  4. Similarity between the new cluster and all
     remaining clusters is recalculated by a
     user-defined clustering method.
  5. Steps 3 and 4 are repeated until all profiles
     end up in one large cluster.
The researcher defines the following:
  • Similarity Measure (Correlation, Cosine
    correlation, Euclidean, etc.)
  • Clustering Method (Average linkage, Single
    linkage, Complete linkage)
  • Ordering Function (Input rank or Average value)
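
As a rough sketch of steps 1 to 5 with a user-selectable clustering method, the code below takes a precomputed distance matrix (step 1), merges until a single cluster remains, and records the height of each merge, which is the information a dendrogram displays. It assumes distances rather than similarities (smaller means more similar), the function name `agglomerate` and the example matrix are made up for illustration, and the ordering function is not modeled.

```python
def agglomerate(dist, linkage="average"):
    """Agglomerative clustering on a symmetric distance matrix.

    Returns a list of merges as (members_a, members_b, height).
    """
    # Clustering method: how to combine all pairwise distances
    # between the members of two clusters into a single number.
    combine = {
        "single": min,
        "complete": max,
        "average": lambda values: sum(values) / len(values),
    }[linkage]

    clusters = [(i,) for i in range(len(dist))]  # step 2: singletons
    merges = []
    while len(clusters) > 1:                     # step 5: repeat 3 and 4
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                values = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                d = combine(values)              # step 4: recompute distance
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best                           # step 3: closest pair
        merges.append((clusters[a], clusters[b], d))
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges


# Made-up distances between four profiles.
dist = [
    [0, 2, 6, 10],
    [2, 0, 5, 9],
    [6, 5, 0, 4],
    [10, 9, 4, 0],
]
for method in ("single", "average", "complete"):
    heights = [h for _, _, h in agglomerate(dist, method)]
    print(method, heights)
```

On this toy matrix the three methods merge the same pairs in the same order and differ only in the height of the final merge, which shows how the choice of clustering method changes the shape of the tree.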
11
Hierarchical Clustering
Venn Diagram of Clustered Data
Dendrogram
12
Similarity measures
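
The slide body is not in the transcript, so the following is only a sketch of the three measures named earlier in the lecture (correlation, cosine correlation, Euclidean), using their standard definitions. "Cosine correlation" is taken here to mean the uncentered correlation, which is an assumption, and the example profiles are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation: compares the shapes of two profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)  # undefined (division by zero) for flat profiles


def cosine(x, y):
    """Cosine (uncentered) correlation: like Pearson without mean-centering."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def euclidean(x, y):
    """Euclidean distance: sensitive to magnitude as well as shape."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Made-up profiles: g2 is g1 scaled by 2, g3 is g1 reversed.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]
g3 = [4.0, 3.0, 2.0, 1.0]

print(pearson(g1, g2), cosine(g1, g2), euclidean(g1, g2))
print(pearson(g1, g3), cosine(g1, g3), euclidean(g1, g3))
```

Here g1 and g2 are perfectly correlated even though their Euclidean distance is large, which is why correlation-based measures are often preferred when only the shape of an expression profile matters.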
13
(No Transcript)
14
(No Transcript)
15
Unweighted Pair Group Method with Arithmetic Mean (UPGMA)
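
The slide gives only the method's name. As a hedged sketch, the standard update rule is: when clusters A and B merge, the distance from the new cluster to any other cluster C is the size-weighted mean of d(A, C) and d(B, C), which equals the plain average of all member-to-member distances. The function name `upgma_update` and the numbers below are made up for illustration.

```python
def upgma_update(d_ac, d_bc, size_a, size_b):
    """Distance from the merged cluster (A with B) to a third cluster C."""
    return (size_a * d_ac + size_b * d_bc) / (size_a + size_b)


# Made-up distances from profiles 0, 1, 2 to profile 3.
d_to_3 = {0: 10.0, 1: 9.0, 2: 4.0}

# A = {0, 1}, B = {2}, C = {3}.
d_ac = (d_to_3[0] + d_to_3[1]) / 2   # average distance from A to C
d_bc = d_to_3[2]                     # distance from B to C
updated = upgma_update(d_ac, d_bc, size_a=2, size_b=1)

# The same value computed directly as the mean over all members.
direct = sum(d_to_3.values()) / len(d_to_3)
print(updated, direct)
```

Weighting by cluster size is what makes every original profile count equally in the average, regardless of the order in which the clusters were built.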
16
(No Transcript)
17
Data table: m genes, n conditions; Xij, i = 1..m,
j = 1..n
  • Spellman et al. (1998) monitored the expression of
    6108 yeast genes in synchronized yeast cells, at
    7 min intervals for 119 min (approximately 2 cell
    cycles).
  • In this example, m = 6108, n = 17.
  • Xij is the log-ratio for gene i in condition j
    (ratio of the sample in condition j to a
    reference sample from an asynchronous culture).
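
A small sketch of how such a table of log-ratios could be built from raw intensities. The slide does not state the base of the logarithm; base 2 is assumed here. The function name `log_ratio_table`, the layout of one reference value per gene and condition, and the numbers are all made up for illustration.

```python
import math


def log_ratio_table(sample, reference):
    """Return X where X[i][j] = log2(sample[i][j] / reference[i][j]).

    Rows are genes (i = 1..m) and columns are conditions (j = 1..n).
    """
    return [
        [math.log2(s / r) for s, r in zip(sample_row, reference_row)]
        for sample_row, reference_row in zip(sample, reference)
    ]


# Made-up intensities for m = 2 genes and n = 3 conditions.
sample = [
    [200.0, 100.0, 50.0],
    [80.0, 80.0, 160.0],
]
reference = [
    [100.0, 100.0, 100.0],
    [80.0, 80.0, 80.0],
]

X = log_ratio_table(sample, reference)
print(X)
```

With base 2, a value of 1 means a two-fold increase over the reference, 0 means no change, and -1 means a two-fold decrease, so up- and down-regulation are treated symmetrically.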

18
(No Transcript)
19
(No Transcript)
20
Clustering considerations
  • Which genes are used to cluster samples?
  • Genes with large variation across samples
  • Genes that are not inherently variable in
    expression level
  • Exclude genes irrelevant to the question at hand
  • e.g. immunoglobulin genes are variable in lung
    cancer and can drive sample clustering, but are
    not of interest to us
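
One way to act on two of these considerations is sketched below: rank genes by their variance across samples and drop a user-supplied list of genes to exclude before clustering. This is an illustration under assumptions, not the lecture's procedure: the function name `select_genes`, the gene labels, and the values are made up, and the sketch does not try to detect genes that are inherently variable.

```python
from statistics import variance


def select_genes(expr, top_n, exclude=()):
    """Pick the top_n most variable genes, skipping excluded ones.

    expr maps a gene label to its expression values across samples.
    """
    candidates = {g: v for g, v in expr.items() if g not in exclude}
    ranked = sorted(
        candidates,
        key=lambda g: variance(candidates[g]),  # sample variance across samples
        reverse=True,
    )
    return ranked[:top_n]


# Made-up log-ratios for four genes across four samples.
expr = {
    "geneA": [1.0, 1.1, 0.9, 1.0],      # nearly constant
    "geneB": [0.2, 2.5, -1.8, 3.0],     # highly variable
    "geneC": [0.0, 1.0, -1.0, 0.5],     # moderately variable
    "ig_gene": [5.0, -4.0, 6.0, -5.0],  # variable but not of interest
}

selected = select_genes(expr, top_n=2, exclude={"ig_gene"})
print(selected)
```

Without the exclusion list, the made-up "ig_gene" would rank first and dominate any distance computed between samples, which is the effect the slide warns about.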

21
Homework (Due Sept 18)
  • Execute, read and comprehend this program
    (http://www.bioconductor.org/mogr/chapter-code/AnalClust.R)
  • Find the attitude dataset in R. What does the
    dataset represent? Perform a hierarchical
    clustering of the employees. Display the
    clustering result with a dendrogram.