Clustering in Microarray Data-mining and Challenges Beyond

1
Clustering in Microarray Data-mining and
Challenges Beyond
CS491jh presentation March 7, 2002
  • Qing-jun Wang
  • Center for Biophysics and Computational Biology
  • University of Illinois at Urbana-Champaign

2
Clustering
What?
Where?
How?
Challenges beyond clustering
3
Data Acquisition
Data Processing
  • Experimental design
  • MIAME
  • Replicates
  • Single/multiple slides
  • Perform experiment
  • Collect data
  • Grid alignment
  • Data quality
  • e.g. bad data, S/N
  • Missing data
  • Normalization
  • Total intensity normalization
  • Regression techniques
  • Ratio statistics

4
Gene Expression Matrix (Affymetrix GeneChip
oligonucleotide arrays)
5
Gene Expression Matrix (glass slides)
6
Data Acquisition
Data Processing
  • MIAME
  • Experiment design
  • Replicates
  • Single/multiple slides
  • Data quality
  • e.g. bad data, S/N
  • Grid alignment
  • Missing data
  • Normalization
  • Total intensity normalization
  • Regression techniques
  • Ratio statistics

Data Analysis
Re-scale
Distance matrices
Data Validation
  • Supervised analysis (e.g. SVM, K-nearest neighbor,
    decision trees, voted classification, weighted
    gene voting, Bayesian classification)
  • Unsupervised analysis (clustering)
  • Hierarchical
  • Non-hierarchical (e.g. K-means, PCA-based
    clustering, self-organizing maps, block
    clustering, gene-shaving, plaid models)

7
Hierarchical clustering
Protocol
  1. Calculate the pairwise distance matrix
  2. Find the two most similar genes or clusters
  3. Merge the two selected clusters to produce a new
     cluster
  4. Calculate the pairwise distance matrix involving
     the new cluster
  5. Repeat steps 2-4 until all objects are in one
     cluster
  • The clustering sequence is represented by a
    hierarchical tree (dendrogram).
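
The protocol above can be sketched in a few lines of Python. This is only an illustrative sketch, not the presenter's implementation: it assumes Euclidean distance and average linkage, and the names `euclidean`, `average_linkage`, `agglomerate` and the toy `profiles` data are hypothetical. Instead of updating a stored distance matrix, it simply recomputes cluster distances on every pass, which is slow but keeps the steps visible.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, data):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(data):
    """Return the merge sequence as (members_a, members_b, distance) tuples."""
    clusters = [[i] for i in range(len(data))]  # every gene starts alone
    merges = []
    while len(clusters) > 1:
        # Steps 1-2: compute cluster distances and find the closest pair.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], data)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 3: merge the selected pair into a new cluster.
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        # Step 4: distances to the new cluster are recomputed on the next pass.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical toy data: four genes measured in two samples.
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
merges = agglomerate(profiles)
```

The list `merges` records the order in which clusters were joined, which is the information a dendrogram displays.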

8
Hierarchical clustering
Variations differ in how distances are
calculated
  • Single-linkage clustering: minimum distance
  • Complete-linkage clustering: maximum distance
  • Average-linkage clustering (UPGMA)
  • Weighted pair-group average: use size of the
    clusters as the weights in computing averages
  • Within-groups clustering
  • Ward's method: smallest possible increase in the
    sum of squared errors
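
As a rough illustration of the first three variants, the sketch below takes the list of pairwise distances between the members of two clusters and combines them under each rule. The function name `cluster_distance` and the sample distances are assumptions made for this example only.

```python
def cluster_distance(pairwise, method):
    """Combine pairwise distances between two clusters under a linkage rule."""
    if method == "single":      # minimum distance
        return min(pairwise)
    if method == "complete":    # maximum distance
        return max(pairwise)
    if method == "average":     # unweighted mean of all pairs (UPGMA)
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage method: " + method)

# Hypothetical distances between every member pair of two clusters.
pairwise = [1.0, 2.0, 6.0]
single = cluster_distance(pairwise, "single")
complete = cluster_distance(pairwise, "complete")
average = cluster_distance(pairwise, "average")
```

The same pair of clusters can therefore look close under single linkage and far apart under complete linkage, which is why the choice of rule changes the resulting tree.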
9
Hierarchical clustering
  • Bottom-up (agglomerative) approach
  • One-way clustering
  • Deterministic clustering
  • Produces a greater number of clusters than k-means
    clustering: a valuable feature for discovery
  • Produces an order for objects: informative for
    data display
Difficulties
  1. As clusters grow in size, the expression
     vector that represents the cluster might no
     longer represent any of the genes in the
     cluster (an artifact)
  2. If a bad assignment is made early on, it cannot
     be corrected
10
Non-hierarchical clustering
K-means clustering
  • Top-down (divisive) approach
  • Used when the number of clusters is known in
    advance
  • One-way clustering
  • Non-deterministic owing to the random
    initialization
  • Produces tighter clusters than hierarchical
    clustering
Protocol
  1. Initial reference vectors are assigned randomly
     or according to previous knowledge
  2. Assign each object to one of k clusters randomly
  3. Calculate average expression vectors for each
     cluster (as reference vectors) and the distance
     between clusters
  4. Iteratively move objects between clusters; an
     object stays in the new cluster when it is
     closer to the new cluster than to the old
     cluster
  5. Repeat steps 3-4 until convergence, i.e. moving
     any more objects would increase intra-cluster
     distances
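
A minimal sketch of this idea follows. It is an assumption-laden illustration rather than the exact procedure on the slide: it uses the common batch variant in which every object is reassigned to its nearest reference vector in each pass, and the names `kmeans`, `squared_distance`, `mean_vector` and the toy data are hypothetical. Initial reference vectors can be passed in (the "previous knowledge" case) or drawn at random.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(vectors):
    """Average expression vector of a group of objects."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(data, k, centers=None, seed=0, max_iter=100):
    rng = random.Random(seed)
    if centers is None:
        centers = rng.sample(data, k)        # random initialization
    centers = [list(c) for c in centers]     # work on a copy
    labels = None
    for _ in range(max_iter):
        # Assign every object to its closest reference vector.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in data
        ]
        if new_labels == labels:             # no object moved: converged
            break
        labels = new_labels
        # Recompute each reference vector as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:
                centers[c] = mean_vector(members)
    return labels, centers

# Hypothetical toy data with two obvious groups.
data = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]]
labels, centers = kmeans(data, k=2, centers=[[0.0, 0.0], [10.0, 10.0]])
```

With random initialization (omit `centers`), different seeds can give different partitions, which is the non-determinism noted on the slide.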

11
Non-hierarchical clustering
K-means clustering
(Borrowed from Dr. Jiawei Han March 5, 2002)
12
Non-hierarchical clustering
K-means clustering
Difficulty
How to determine whether there are really only k
distinct clusters represented in the data or not.
Solutions
  • Use K-means clustering with principal component
    analysis (PCA), which allows visual estimation of
    the number of clusters represented in the data
  • Try a sequential k-means approach, which finds the
    number of clusters based on the dataset
13
Non-hierarchical clustering
Self-organizing map clustering
  • Top-down (divisive) approach
  • One-way clustering
  • Neural-network-based clustering approach
  • Non-deterministic owing to the random order in
    which genes are used to move the reference
    vectors
  • Similar to k-means clustering except that the
    cluster centers are restricted to lie in a one-
    or two-dimensional manifold
  • Models the complexity within a dataset more
    effectively than k-means clustering
14
Non-hierarchical clustering
Self-organizing map clustering
Protocol
  1. Define a geometric configuration for the
     partitions, e.g. a 2D rectangular or hexagonal
     grid
  2. Construct and assign random vectors to each
     partition
  3. Pick a gene randomly and identify the reference
     vector that is closest to the gene
  4. Adjust the reference vectors so that they are
     more similar to the gene vector
  5. Repeat steps 3-4 until the reference vectors
     converge
  6. Map genes to the relevant partitions based on the
     reference vectors to which they are most similar
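
The steps above can be sketched as follows. Several choices here are assumptions made only for illustration: the partitions form a one-dimensional chain rather than a 2D grid, the learning rate and neighbourhood radius decay linearly, and training runs for a fixed number of steps instead of testing for convergence. The names `train_som`, `map_genes`, `sq_dist` and the toy `genes` data are hypothetical.

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(data, n_nodes, n_steps=2000, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Steps 1-2: a chain of nodes, each with a random reference vector.
    nodes = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for step in range(n_steps):
        frac = step / n_steps
        rate = 0.5 * (1.0 - frac)                        # decaying learning rate
        radius = max(0.5, (n_nodes / 2.0) * (1.0 - frac))  # shrinking neighbourhood
        # Step 3: pick a gene at random and find the closest reference vector.
        gene = rng.choice(data)
        winner = min(range(n_nodes), key=lambda n: sq_dist(gene, nodes[n]))
        # Step 4: pull the winner, and to a lesser extent its neighbours
        # on the chain, toward the gene vector.
        for n in range(n_nodes):
            influence = math.exp(-((n - winner) ** 2) / (2.0 * radius ** 2))
            for d in range(dim):
                nodes[n][d] += rate * influence * (gene[d] - nodes[n][d])
    return nodes

def map_genes(data, nodes):
    """Step 6: assign each gene to its most similar reference vector."""
    return [min(range(len(nodes)), key=lambda n: sq_dist(g, nodes[n]))
            for g in data]

# Hypothetical toy data: five genes measured in two samples.
genes = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.1], [1.0, 0.9], [0.9, 1.0]]
nodes = train_som(genes, n_nodes=3)
assignment = map_genes(genes, nodes)
```

Changing `seed` changes both the initial reference vectors and the order in which genes are presented, so the final map can differ between runs.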

(Borrowed from Joshua Unger Feb. 28, 2002)
15
Non-hierarchical clustering
One-way clustering: used to group genes with
similar behavior across samples, or samples with
similar gene expression vectors
  • Hierarchical clustering
  • K-means clustering
  • Self-organizing maps
Two-way clustering: simultaneously cluster both
genes and samples
  • Block clustering
  • Gene shaving
  • Plaid models
16
Non-hierarchical clustering
Block clustering
  • Top-down approach
  • Two-way clustering
  • Produces a matrix with homogeneous blocks of the
    outcomes
  • Produces hierarchical clustering trees for the
    rows and columns
Protocol
  • Begin with the entire matrix in one block
  • Sort rows and columns by row and column means
  • Find the row or column splits of all existing
    blocks, choosing the one that produces the largest
    reduction in the total within-block variance
  • If there are existing row/column splits that
    intersect the block, one of them must be used.
    Otherwise all split points are tried.
  • The splitting is continued until a large number
    of blocks are obtained
  • Apply weakest link pruning to recombine some of
    the blocks until the optimal number of blocks is
    obtained.
  • The optimal number of blocks is estimated by
    maximum gap approach
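
To make the splitting criterion concrete, here is a small sketch of a single splitting step. It is a simplification under stated assumptions: it considers only row splits of the initial whole-matrix block, measures within-block variance as the sum of squared deviations from the block mean, and leaves out column splits, pruning and the gap estimate. The names `sum_sq`, `best_row_split` and the toy `matrix` are hypothetical.

```python
def sum_sq(block):
    """Sum of squared deviations of all entries from the block mean."""
    values = [v for row in block for v in row]
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_row_split(matrix):
    # Sort rows by their means, as in the protocol.
    rows = sorted(matrix, key=lambda r: sum(r) / len(r))
    total = sum_sq(rows)
    best_cut, best_reduction = None, 0.0
    # Try every cut point and keep the one that most reduces
    # the total within-block sum of squares.
    for cut in range(1, len(rows)):
        reduction = total - (sum_sq(rows[:cut]) + sum_sq(rows[cut:]))
        if reduction > best_reduction:
            best_cut, best_reduction = cut, reduction
    return rows, best_cut, best_reduction

# Hypothetical toy matrix: two low-expression and two high-expression genes.
matrix = [[5.0, 5.0], [1.0, 1.0], [5.0, 5.0], [1.0, 1.0]]
rows, best_cut, best_reduction = best_row_split(matrix)
```

A full implementation would apply the same search to columns and to every existing block, then prune the resulting blocks.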

17
Non-hierarchical clustering
Block clustering
Difficulty
When applied to median-centered data, all row and
column means are approximately zero at the start,
so the procedure has difficulty getting started.
18
Non-hierarchical clustering
Two-way clustering approaches seek a single
re-ordering of the samples for all genes.
However, one set of genes might cluster the
samples in one way while another set of genes
clusters them in a very different way.
The gene shaving approach finds the linear
combination of genes having maximal variation
among samples. This linear combination of genes
is viewed as a "super gene". The genes having the
lowest correlation with the super gene are
removed (shaved). The process is continued until
the subset of genes contains only one gene. This
process produces a sequence of gene blocks, each
containing genes that are similar to one another
and that display large variance across samples.
  • A statistical approach
  • Two-way clustering
  • Identifies subsets of genes with coherent
    expression patterns and large variation across
    conditions
  • A gene may belong to more than one cluster
  • Can be either unsupervised or supervised
19
Gene shaving
Protocol
  1. Start with all data in one block.
  2. Find the first principal component of the genes
  3. For each gene i, compute the absolute value of
     its correlation with the first principal
     component
  4. Remove the fraction α of genes having the
     smallest absolute correlation
  5. Repeat steps 3-4 until only one gene remains
  6. This procedure produces a set of nested gene
     groups G1 ⊃ G2 ⊃ ... ⊃ Gn, from which one group,
     G, is selected as the optimal gene block
     (small ), where the optimal shave size is
     estimated using the maximum gap method.
  7. The rows of the gene expression matrix are
     orthogonalised with respect to the average of all
     genes in cluster G to obtain a new gene
     expression matrix, to encourage discovery of a
     different second cluster.
  8. Repeat steps 2-7 until no interesting gene
     shaves can be found.
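
The shaving loop that builds the nested sequence of gene groups can be sketched as below. This is a hedged illustration, not the published algorithm in full: it recomputes the principal component after every shave, finds it by simple power iteration (assuming the arbitrary starting vector is not orthogonal to the leading direction), and omits the gap estimate and the orthogonalisation step. All function names, the shave fraction `alpha=0.25` and the toy `matrix` are hypothetical.

```python
import math

def center_rows(matrix):
    """Subtract each gene's mean across samples."""
    return [[v - sum(row) / len(row) for v in row] for row in matrix]

def leading_component(rows, n_iter=200):
    """First principal direction across samples, by power iteration."""
    n = len(rows[0])
    v = [1.0 + 0.1 * j for j in range(n)]   # arbitrary non-constant start
    for _ in range(n_iter):
        scores = [sum(r[j] * v[j] for j in range(n)) for r in rows]
        w = [sum(scores[i] * rows[i][j] for i in range(len(rows)))
             for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0.0 or vb == 0.0:
        return 0.0
    return cov / math.sqrt(va * vb)

def shave(matrix, alpha=0.1):
    """Return the nested sequence of gene index sets, largest first."""
    genes = list(range(len(matrix)))
    sequence = [list(genes)]
    while len(genes) > 1:
        rows = center_rows([matrix[g] for g in genes])
        super_gene = leading_component(rows)
        # Rank genes by absolute correlation with the super gene.
        ranked = sorted(genes,
                        key=lambda g: abs(pearson(matrix[g], super_gene)),
                        reverse=True)
        n_drop = max(1, int(alpha * len(genes)))   # shave at least one gene
        genes = sorted(ranked[:len(ranked) - n_drop])
        sequence.append(list(genes))
    return sequence

# Hypothetical toy matrix: genes 0-2 vary coherently, gene 3 does not.
matrix = [
    [-2.0, -1.0, 1.0, 2.0],
    [-4.0, -2.0, 2.0, 4.0],
    [2.0, 1.0, -1.0, -2.0],
    [1.0, -1.0, -1.0, 1.0],
]
sequence = shave(matrix, alpha=0.25)
```

In this toy case the first shave removes gene 3, the only gene uncorrelated with the dominant pattern. Gene 2 is kept even though it is anti-correlated, because the ranking uses the absolute correlation.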

21
Non-hierarchical clustering
A cellular process may involve a relatively small
subset of genes in the dataset. The process may
take place only in a small number of samples.
Therefore, when the full dataset is analyzed, the
signal of this process may be completely
overwhelmed by the noise of the vast majority of
unrelated data. Plaid models search for
interpretable biological structures in microarray
data, i.e. subsets of the genes/samples, one of
which can be used to cluster the other to yield
stable and significant partitions/layers.
  • Two-way clustering
  • Allows a gene to be in more than one cluster or
    in none at all
  • Allows a cluster of genes to be defined with
    respect to only a subset of samples, not
    necessarily all of them
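
The layered structure described above can be illustrated by building an expression matrix as a background level plus a sum of overlapping layers. This sketch is a deliberate simplification: each layer contributes one constant effect to its genes and samples, whereas the plaid model of Lazzeroni and Owen also allows gene-specific and sample-specific terms within a layer. The function `plaid_reconstruct`, the dictionary keys and the numbers are hypothetical.

```python
def plaid_reconstruct(n_genes, n_samples, background, layers):
    """Build a matrix as background plus the sum of all layer effects."""
    matrix = [[background] * n_samples for _ in range(n_genes)]
    for layer in layers:
        # A layer only touches its own subset of genes and samples.
        for i in layer["genes"]:
            for j in layer["samples"]:
                matrix[i][j] += layer["effect"]
    return matrix

# Two hypothetical layers that overlap at gene 1, sample 1.
layers = [
    {"genes": {0, 1}, "samples": {0, 1}, "effect": 2.0},
    {"genes": {1, 2}, "samples": {1, 2}, "effect": -1.0},
]
expr = plaid_reconstruct(4, 3, 0.5, layers)
```

Here gene 1 belongs to both layers, gene 3 belongs to none, and each layer is defined on only two of the three samples, matching the three properties listed above. Fitting a plaid model runs in the opposite direction: the layers are estimated from the observed matrix.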
22
Non-hierarchical clustering
Plaid models
Ideal reordering: every gene and every sample is
in exactly one cluster
23
Non-hierarchical clustering
Plaid models
24
Evaluate clustering
  • Clarity of cluster definitions
  • Computational cost
  • Robustness
  • Reproducibility
Cancer research
  • Cancer typing
  • Correlating whole-genome expression patterns with
    particular clinical implications
  • Distinguishing malignant tissue from normal
    tissue
  • Drug effect study
Pathway discovery
Assign functions of unknown genes
Gene network regulation: metabolism,
photosynthesis, cell cycle
25
Challenges beyond clustering
Understand sources of noise and variations in
microarray experiments
Combine expression data with other sources of
information
  • Published literature
  • DNA and protein sequence databases
  • Protein Data Bank
  • Phylogenetic profiles
  • Metabolic function
  • Annotated experimental functional studies
26
Clustering
Assumption: guilt-by-association
Genes that are contained in a particular pathway,
or that respond to a common environmental
challenge, should be co-regulated and
consequently should show similar patterns of
expression.
This is a controversial hypothesis because of the
existence of
  • Convergent regulation (similar temporal
    expression patterns, different control
    strategies)
  • Divergent regulation (similar control regions,
    different ways to take effect)
27
Challenges beyond clustering
Understand sources of noise and variations in
microarray experiments
Combine expression data with other sources of
information
  • Published literature
  • DNA and protein sequence databases
  • Protein Data Bank
  • Phylogenetic profiles
  • Metabolic function
  • Annotated experimental functional studies
Reconstruct networks of genetic interactions to
create integrated and systematic models of
biological systems
  • Boolean networks
  • Linear modeling
  • Genetic programming
  • Bayesian belief networks
28
References
  • Quackenbush (2001) Nature Reviews Genetics
    2:418-427
  • Altman & Raychaudhuri (2001) Curr. Opin. Struct.
    Biol. 11:340-347
  • Lazzeroni & Owen (2000) Tech. Report, Stanford
    Univ.
  • Aas (2001) SAMBA
  • Tibshirani et al. (1999) Tech. Report, Stanford
    Univ.
  • Hastie et al. (2000) Genome Biol. 1(2)