Clustering in Microarray Datamining and Challenges Beyond - PowerPoint PPT Presentation

Slides: 29

Provided by: danb75

Transcript and Presenter's Notes

Title: Clustering in Microarray Datamining and Challenges Beyond

1
Clustering in Microarray Data-mining and
Challenges Beyond
CS491jh presentation March 7, 2002

Qing-jun Wang
Center for Biophysics and Computational Biology
University of Illinois at Urbana-Champaign

2
Clustering
What?
Where?
How?
Challenges beyond clustering
3
Data Acquisition
Data Processing

Experimental design
MIAME
Replicates
Single/multiple slides
Perform experiment
Collect data

Grid alignment
Data quality
e.g. bad data, S/N
Missing data
Normalization
Total intensity normalization
Regression techniques
Ratio statistics
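
As an illustration of total intensity normalization, here is a minimal sketch. It assumes a two-channel slide where the summed intensity of the sample channel should equal that of the reference channel. The function name, the intensity values and the Cy5/Cy3 labels are hypothetical and not from the slides.

```python
def total_intensity_normalize(sample, reference):
    """Rescale the sample channel so both channels have the same total intensity."""
    total_sample = sum(sample)
    total_reference = sum(reference)
    if total_sample == 0:
        raise ValueError("sample channel has zero total intensity")
    factor = total_reference / total_sample
    return [value * factor for value in sample]


# Hypothetical spot intensities for four genes on one slide.
sample = [200.0, 400.0, 100.0, 300.0]    # assumed sample channel (e.g. Cy5)
reference = [100.0, 250.0, 50.0, 100.0]  # assumed reference channel (e.g. Cy3)

normalized = total_intensity_normalize(sample, reference)
# Expression ratios (sample/reference) are computed after normalization.
ratios = [s / r for s, r in zip(normalized, reference)]
```

Regression techniques and ratio statistics replace the single global factor with a fitted or statistically estimated correction; they are not covered by this sketch.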

4
Gene Expression Matrix (Affymetrix GeneChip
oligonucleotide arrays)
5
Gene Expression Matrix (glass slides)
6
Data Acquisition
Data Processing

MIAME
Experiment design
Replicates
Single/multiple slides

Data quality
e.g. bad data, S/N
Grid alignment
Missing data
Normalization
Total intensity normalization
Regression techniques
Ratio statistics

Data Analysis
Re-scale
Distance matrices
Data Validation
Supervised analysis e.g. SVM, K-nearest neighbor,
decision trees, voted classification, weighted
gene voting, Bayesian classification

Unsupervised analysis (clustering)
Hierarchical
Non-hierarchical (e.g. K-means, PCA-based
clustering, self-organizing maps, block
clustering, gene-shaving, plaid models)
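
Most of the clustering methods on the following slides start from a distance matrix between expression vectors. A minimal sketch of two common choices, Euclidean distance and a Pearson-correlation distance (1 - r), is given here. The slides do not say which metric is used, so both the metrics and the toy data are assumptions.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_distance(a, b):
    """1 - Pearson correlation; 0 for perfectly correlated profiles, 2 for anti-correlated."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    # Note: undefined (division by zero) for a constant profile.
    return 1.0 - cov / math.sqrt(var_a * var_b)


def distance_matrix(rows, metric):
    """Pairwise distances between all rows (genes) of the expression matrix."""
    return [[metric(a, b) for b in rows] for a in rows]


# Hypothetical expression profiles: 3 genes x 3 samples.
genes = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
d_euclid = distance_matrix(genes, euclidean)
d_pearson = distance_matrix(genes, pearson_distance)
```

The correlation distance treats genes 1 and 2 as identical in shape even though their magnitudes differ, which is why the choice of metric and the re-scaling step matter.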

7
Hierarchical clustering
Protocol

Calculate pairwise distance matrix
Find the two most similar genes or clusters
Merge the two selected clusters to produce a new
cluster
Calculate pairwise distance matrix involving the
new cluster
Repeat steps 2-4 until all objects are in one
cluster
The clustering sequence is represented by a
hierarchical tree (dendrogram).
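
The protocol above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and average linkage (the slide leaves both open). It recomputes distances on every pass instead of updating a stored matrix, and the function names and toy data are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_distance(c1, c2, points):
    """Average-linkage distance between two clusters of point indices."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points):
    """Return the merge sequence [(cluster_a, cluster_b, distance), ...]."""
    clusters = [[i] for i in range(len(points))]  # every object starts alone
    merges = []
    while len(clusters) > 1:
        # Steps 1-2: find the two most similar clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 3: merge them; step 4 happens on the next pass of the loop.
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Hypothetical expression vectors for four genes.
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 7.0]]
merges = agglomerate(profiles)
```

The merge list, read in order, is the information a dendrogram displays: which clusters joined and at what distance.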

8
Hierarchical clustering
Variations differ in how distances are calculated
Single-linkage clustering: minimum distance
Complete-linkage clustering: maximum distance
Average-linkage clustering (UPGMA)
Weighted pair-group average: use size of the clusters as the weights in computing averages
Within-groups clustering
Ward's method: smallest possible increase in the sum of squared errors
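
To make the differences concrete, the sketch below computes the single-, complete- and average-linkage distances between two small clusters, along with the increase in the sum of squared errors that Ward's method would minimize. It assumes Euclidean distance; the helper names and the example clusters are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def linkage_distance(c1, c2, method):
    """Distance between two clusters (lists of vectors) under a given linkage."""
    pairwise = [euclidean(a, b) for a in c1 for b in c2]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage method: " + method)


def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def ward_increase(c1, c2):
    """Increase in the sum of squared errors if c1 and c2 were merged."""
    n1, n2 = len(c1), len(c2)
    gap = sum((x - y) ** 2 for x, y in zip(centroid(c1), centroid(c2)))
    return n1 * n2 / (n1 + n2) * gap


cluster_a = [[0.0, 0.0], [0.0, 2.0]]
cluster_b = [[3.0, 0.0], [3.0, 2.0]]
single = linkage_distance(cluster_a, cluster_b, "single")
complete = linkage_distance(cluster_a, cluster_b, "complete")
average = linkage_distance(cluster_a, cluster_b, "average")
ward = ward_increase(cluster_a, cluster_b)
```

Single linkage tends to chain clusters together through their closest members, while complete linkage and Ward's method favor compact clusters.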
9
Hierarchical clustering
Bottom-up (agglomerative) approach
One-way clustering
Deterministic clustering
Produces a greater number of clusters than k-means clustering: a valuable feature for discovery.
Produces an order for objects: informative for data display.
Difficulties
1. As clusters grow in size, the expression vector that represents the cluster might no longer represent any of the genes in the cluster (an artifact).
2. If a bad assignment is made early on, it cannot be corrected.
10
Non-hierarchical clustering
K-means clustering
Top-down (divisive) approach
Used when the number of clusters is known in advance
One-way clustering
Non-deterministic owing to the random initialization
Produces tighter clusters than hierarchical clustering
Protocol

Initial reference vectors are assigned randomly
or according to previous knowledge
Assign each object to one of k clusters randomly
Calculate average expression vectors for each
cluster (as reference vectors) and the distance
between clusters
Iteratively move objects between clusters and the
objects stay in the new cluster when they are
closer to the new cluster than to the old
cluster.
Repeat steps 3-4 until convergence, i.e. until
moving any more objects would increase
intra-cluster distances
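
A minimal sketch of this protocol follows, assuming squared Euclidean distance and a random initial assignment. It moves all objects in one batch per iteration rather than one at a time. The re-seeding of empty clusters, the function names and the toy data are assumptions, not part of the slides.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: assign each object to one of k clusters at random.
    labels = [rng.randrange(k) for _ in points]
    for _ in range(max_iter):
        # Step 3: average expression vector (reference vector) per cluster.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random object.
            centroids.append(mean_vector(members) if members else list(rng.choice(points)))
        # Step 4: move each object to the cluster whose reference vector is closest.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop when no object changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical expression vectors forming two well-separated groups.
profiles = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = k_means(profiles, k=2, seed=1)
```

Because the start is random, different seeds can give different partitions on less clearly separated data, which is the non-determinism noted on the slide.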

11
Non-hierarchical clustering
K-means clustering
(Borrowed from Dr. Jiawei Han March 5, 2002)
12
Non-hierarchical clustering
K-means clustering
Difficulty
How to determine whether there are really only k distinct clusters represented in the data or not.
Solutions
Use K-means clustering with principal component analysis (PCA), which allows visual estimation of the number of clusters represented in the data.
Try the sequential k-means approach, which finds the number of clusters based on the dataset.
13
Non-hierarchical clustering
Self-organizing map clustering
Top-down (divisive) approach
One-way clustering
Neural-network-based clustering approach
Non-deterministic owing to the random order in which genes are used to move the reference vectors.
Similar to k-means clustering except that the cluster centers are restricted to lie in a one- or two-dimensional manifold
Models the complexity within a dataset more effectively than k-means clustering.
14
Non-hierarchical clustering
Self-organizing map clustering
Protocol

Define a geometric configuration for the
partitions, e.g. a 2D rectangular or hexagonal
grid
Construct and assign random vectors to each
partition
Pick a gene randomly and identify the reference
vector that is closest to the gene
Adjust the reference vectors so that they are
more similar to the gene vector
Repeat steps 3-4 until the reference vectors
converge
Map genes to the relevant partitions based on the
reference vectors to which they are most similar
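
The protocol can be sketched as follows. This is a simplified illustration with several assumptions that are not in the slides: a 1-D line of nodes instead of a 2D grid, reference vectors initialized from randomly chosen genes, a fixed number of training steps instead of a convergence test, and a linearly shrinking learning rate and neighborhood radius. All names are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest_node(refs, gene):
    return min(range(len(refs)), key=lambda i: squared_distance(refs[i], gene))


def train_som(genes, n_nodes, n_steps=500, seed=0, rate0=0.5):
    rng = random.Random(seed)
    # Steps 1-2: a 1-D line of nodes, each starting at a randomly chosen gene vector.
    refs = [list(g) for g in rng.sample(genes, n_nodes)]
    for step in range(n_steps):
        shrink = 1.0 - step / n_steps          # decays from 1 towards 0
        rate = rate0 * shrink                  # learning rate
        radius = (n_nodes / 2.0) * shrink      # neighborhood radius on the grid
        # Step 3: pick a gene at random and find the closest reference vector.
        gene = rng.choice(genes)
        winner = closest_node(refs, gene)
        # Step 4: pull the winner and its grid neighbors towards the gene.
        for node in range(n_nodes):
            grid_dist = abs(node - winner)
            if grid_dist <= radius:
                influence = rate * (1.0 - grid_dist / (radius + 1.0))
                refs[node] = [r + influence * (g - r) for r, g in zip(refs[node], gene)]
    return refs


def map_genes(genes, refs):
    """Step 6: map every gene to the node with the most similar reference vector."""
    return [closest_node(refs, g) for g in genes]


# Hypothetical expression vectors forming two well-separated groups.
genes = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [10.0, 10.0], [10.0, 9.0], [9.0, 10.0]]
refs = train_som(genes, n_nodes=2)
assignment = map_genes(genes, refs)
```

The grid neighborhood is what distinguishes this from k-means: nearby nodes are updated together, so neighboring partitions end up holding similar expression patterns.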

(Borrowed from Joshua Unger Feb. 28, 2002)
15
Non-hierarchical clustering
One-way clustering: used to group genes with similar behavior across samples or samples with similar gene expression vectors
Hierarchical clustering
K-means clustering
Self-organizing maps
Two-way clustering: simultaneously cluster both genes and samples
Block clustering
Gene shaving
Plaid models
16
Non-hierarchical clustering
Block clustering
Top-down approach
Two-way clustering
Produces a matrix with homogeneous blocks of the outcomes
Produces hierarchical clustering trees for the rows and columns
Protocol

Begin with the entire matrix in one block
Sort rows and columns by row and column means
Find the row or column splits of all existing
blocks, choosing the one that produces largest
reduction in the total within-block-variance
If there are existing row/column splits that
intersect the block, one of them must be used.
Otherwise all split points are tried.
The splitting is continued until a large number
of blocks are obtained
Apply weakest link pruning to recombine some of
the blocks until the optimal number of blocks is
obtained.
The optimal number of blocks is estimated by
maximum gap approach
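
A minimal sketch of the core splitting step (steps 2-3) for a single block is given below. It only searches row and column splits of one block after sorting by means. The constraint on intersecting splits, the pruning and the gap-based choice of block count are not implemented. The function names and the toy matrix are hypothetical.

```python
def block_sse(matrix, rows, cols):
    """Within-block variance, measured as the sum of squared deviations from the block mean."""
    values = [matrix[r][c] for r in rows for c in cols]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(matrix, rows, cols):
    """Return (reduction, axis, first_part, second_part) for the best split of one block."""
    # Sorting by sums within the block is equivalent to sorting by means.
    row_order = sorted(rows, key=lambda r: sum(matrix[r][c] for c in cols))
    col_order = sorted(cols, key=lambda c: sum(matrix[r][c] for r in rows))
    base = block_sse(matrix, rows, cols)
    best = None
    for axis, order in (("row", row_order), ("column", col_order)):
        for cut in range(1, len(order)):
            first, second = order[:cut], order[cut:]
            if axis == "row":
                sse = block_sse(matrix, first, cols) + block_sse(matrix, second, cols)
            else:
                sse = block_sse(matrix, rows, first) + block_sse(matrix, rows, second)
            reduction = base - sse
            if best is None or reduction > best[0]:
                best = (reduction, axis, first, second)
    return best


# Hypothetical 3 genes x 3 samples matrix: the third gene is clearly different.
expression = [[0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [4.0, 4.0, 4.0]]
reduction, axis, first, second = best_split(expression, [0, 1, 2], [0, 1, 2])
```

Applying the same search recursively to the resulting blocks gives the top-down splitting described in the protocol.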

17
Non-hierarchical clustering
Block clustering
Difficulty
When applied to median-centered data, all row and
column means are approximately zero at the start,
so the procedure has difficulty getting started.
18
Non-hierarchical clustering
The two-way clustering approaches seek a single
re-ordering of the samples for all genes.
However, one set of genes might cluster the
samples in one way while another set of genes
clusters them in a very different way. The gene
shaving approach finds the linear combination of
genes having maximal variation among samples.
This linear combination of genes is viewed as a
super gene. The genes having the lowest
correlation with the super gene are removed
(shaved). The process is continued until the
subset of genes contains only one gene. This
process produces a sequence of gene blocks, each
containing genes that are similar to one another
and that display large variance across samples.
A statistical approach
Two-way clustering
Identifies subsets of genes with coherent expression patterns and large variation across conditions
A gene may belong to more than one cluster
Can be either unsupervised or supervised
19
Gene shaving
Protocol

Start with all data in one block.
Find the first principal component of the genes
For each gene i, compute the absolute value of
its correlation with the first principal
component
Remove the fraction a of genes having the
smallest absolute correlation
Repeat steps 3-4 until only one gene remains
This procedure produces a set of nested gene
groups G1 ⊃ G2 ⊃ ... ⊃ Gk ⊃ ... ⊃ Gn, from which
Gk is selected as the optimal gene block, where
the optimal shave size is estimated using the
maximum gap method.
The rows of the gene expression matrix are
orthogonalised with respect to the average of all
genes in cluster G to obtain a new gene
expression matrix to encourage discovery of a
different second cluster.
Repeat steps 2-7 until no interesting gene shaves
can be found.
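
The shaving loop (steps 1-5) can be sketched as below. This is a simplified illustration with stated assumptions: the first principal component is found by power iteration and recomputed after every shave, and at least one gene is removed per round. The maximum gap selection and the orthogonalisation step are not implemented. All names and data are hypothetical.

```python
import math


def center(row):
    mean = sum(row) / len(row)
    return [x - mean for x in row]


def first_principal_component(rows, n_iter=200):
    """Leading principal direction across samples, by power iteration on X^T X."""
    # Start from the highest-variance (centered) row; an all-ones start would
    # be orthogonal to every centered row.
    v = list(max(rows, key=lambda r: sum(x * x for x in r)))
    for _ in range(n_iter):
        scores = [sum(x * y for x, y in zip(r, v)) for r in rows]                 # X v
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(len(v))]  # X^T (X v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def correlation(a, b):
    a, b = center(a), center(b)
    denom = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0


def shave(expression, alpha=0.25):
    """Return the nested sequence of gene blocks (lists of row indices)."""
    rows = {g: center(r) for g, r in enumerate(expression)}
    blocks = [sorted(rows)]
    while len(rows) > 1:
        pc = first_principal_component(list(rows.values()))
        ranked = sorted(rows, key=lambda g: abs(correlation(rows[g], pc)))
        # Remove the fraction alpha with the smallest absolute correlation,
        # at least one gene per round and never the last remaining gene.
        n_remove = min(max(1, int(alpha * len(rows))), len(rows) - 1)
        for g in ranked[:n_remove]:
            del rows[g]
        blocks.append(sorted(rows))
    return blocks


# Hypothetical data: genes 0-2 follow one trend (gene 2 inverted), gene 3 does not.
expression = [[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],
              [4.0, 3.0, 2.0, 1.0],
              [1.0, -1.0, 1.0, -1.0]]
blocks = shave(expression)
```

Gene 2 stays in the block although it is anti-correlated with genes 0 and 1, because shaving ranks genes by the absolute value of the correlation.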

20
(No Transcript)
21
Non-hierarchical clustering
A cellular process may involve a relatively small
subset of genes in the dataset. The process may
take place only in a small number of samples.
Therefore, when the full dataset is analyzed, the
signal of this process may be completely
overwhelmed by the noise of the vast majority of
unrelated data. Plaid models search for
interpretable biological structures in microarray
data, i.e. subsets of the genes/samples, one of
which can be used to cluster the other to yield
stable and significant partitions/layers.
Two-way clustering
Allows a gene to be in more than one cluster or in none at all
Allows a cluster of genes to be defined with respect to only a subset of samples, not necessarily all of them
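
For reference, the plaid model is usually written as a sum of layers over a background level. The notation below follows the standard plaid-model formulation rather than these slides, so the symbols are assumptions here: Y_ij is the expression of gene i in sample j, mu_0 is the background, and layer k contributes only where both membership indicators are 1.

```latex
Y_{ij} \approx \mu_0 + \sum_{k=1}^{K} \left( \mu_k + \alpha_{ik} + \beta_{jk} \right) \rho_{ik}\, \kappa_{jk},
\qquad \rho_{ik},\ \kappa_{jk} \in \{0, 1\}
```

Here rho_ik = 1 when gene i belongs to layer k and kappa_jk = 1 when sample j does. A gene with rho_ik = 0 for every k is in no cluster, and a gene with several nonzero rho_ik belongs to several, which is how the model allows overlapping and partial clusters.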
22
Non-hierarchical clustering
Plaid models
Ideal reordering: every gene and every sample is
in exactly one cluster
23
Non-hierarchical clustering
Plaid models
24
Evaluate clustering
Clarity of cluster definitions
Computational cost
Robustness
Reproducibility

Cancer research
Cancer typing
Correlating whole-genome expression pattern with particular clinical implication
Diagnose malignant tissue from normal one
Drug effect study

Pathway discovery
Assign functions of unknown genes
Gene network regulation: metabolism, photosynthesis, cell cycle, ...
25
Challenges beyond clustering
Understand sources of noise and variations in
microarray experiments
Combine expression data with other sources of
information
Published literature
DNA and protein sequence databases
Protein Data Bank
Phylogenetic profiles
Metabolic function
Annotated experimental functional studies
26
Clustering
Assumption: guilt-by-association
Genes that are contained in a particular pathway,
or that respond to a common environmental
challenge, should be co-regulated and,
consequently, should show similar patterns of
expression.
This is a controversial hypothesis because of the
existence of
Convergent regulation (similar temporal expression patterns, different control strategies)
Divergent regulation (similar control regions, different ways to take effect)
27
Challenges beyond clustering
Understand sources of noise and variations in
microarray experiments
Combine expression data with other sources of
information
Published literature
DNA and protein sequence databases
Protein Data Bank
Phylogenetic profiles
Metabolic function
Annotated experimental functional studies
Reconstruct networks of genetic interactions to
create integrated and systematic models of
biological systems
Boolean networks
Linear modeling
Genetic programming
Bayesian belief networks
28
References