Title: Untitled slide
1. Clustering of DNA microarray data
Joaquín Dopazo. Bioinformatics Unit, CNIO. http://bioinfo.cnio.es
2. Supervised vs Unsupervised clustering
Sample annotation information (additional rows), e.g. cell type, treatment, disease state, time-course information.
Gene annotation information (additional columns), e.g. gene name, function, genome location.
Gene expression levels.
3. Unsupervised clustering
Look for structure within the gene expression matrix; annotation information is used later.
- Identify co-expressing genes: what do they have in common?
- Given the genes of a class: what profile(s) do they display, and are there more genes like them?
- Molecular classification of samples.
4. Analysis of genes with correlated expression
- Genes with correlated expression
- markers
- functionally related genes
5. Genes of the same functional class have correlated expression patterns
25 out of 40 ORFs belong to the functional class "cytoplasmic degradation" (MIPS); most of them are proteasome subunits.
6. Molecular classification of samples
[Figure: samples A, B, C and D grouped by clustering of their expression profiles.]
7. Taxonomic Relationships Between Normal and Malignant Lymphoid Populations
Alizadeh et al., Nature 2000 (96 samples).
8. The data
[Figure: gene expression matrix with the samples grouped into classes A, B and C.]
- Rows: genes (thousands). Each row is the expression profile of a gene across the experimental conditions.
- Columns: experimental conditions (from tens up to no more than a few hundred). Each column is the expression profile of all the genes for one experimental condition (array). The conditions can belong to different classes, e.g. cancer types, tissues, drug treatments, time points, survival, etc.
- Characteristics of the data:
  - We have many more variables than experiments.
  - Low signal-to-noise ratio.
  - High redundancy and intra-gene correlations.
  - Most of the genes are not informative with respect to the trait we are studying (they account for unrelated physiological conditions, etc.).
  - Many genes have no annotation!
9. Be familiar with the data in your gene expression matrix
- Absolute vs relative gene expression values (i.e. ratios)?
- Relative expression: ratios or log-transformed ratios?
- Normalization between samples: are the columns comparable?
- Set of related hybridizations with a common reference sample, or amalgamated data?
- Gene replicates: duplicates of the same probes, or different probes?
- Sample replicates: the same number for each experimental condition?
10. Gene expression matrix: points of caution for unsupervised clustering
- Many analytical methods are based on log2-ratio expression values and not on absolute values.
- Check for missing values: some analysis methods cannot handle matrices with missing values. Either delete the suspect row or column, or interpolate from known values.
- Reduce the size of your matrix by considering only genes that undergo a specified fold-change in at least one of the samples, or whose levels change significantly over the samples being compared (i.e. remove genes with flat patterns), as in the sketch below.
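A minimal sketch of this filtering step, assuming a genes-by-samples matrix of log2 ratios and a 2-fold threshold (the function name, threshold and toy values are assumptions, not material from the course):

```python
import numpy as np

def filter_expression_matrix(log2_ratios, min_log2_range=1.0):
    """Keep genes (rows) with no missing values and at least `min_log2_range`
    spread across samples (1.0 on the log2 scale = 2-fold change)."""
    x = np.asarray(log2_ratios, dtype=float)
    complete = ~np.isnan(x).any(axis=1)               # drop rows with missing values
    spread = np.nanmax(x, axis=1) - np.nanmin(x, axis=1)
    informative = spread >= min_log2_range            # remove flat patterns
    keep = complete & informative
    return x[keep], keep

# Example: 4 genes x 3 samples; one flat gene, one gene with a missing value
m = np.array([[0.1,    0.2, 0.0],    # flat pattern -> removed
              [1.5,   -0.5, 0.3],    # > 2-fold change -> kept
              [np.nan, 0.4, 1.2],    # missing value -> removed
              [-1.0,   0.8, 0.2]])   # > 2-fold change -> kept
filtered, mask = filter_expression_matrix(m)
print(mask)  # [False  True False  True]
```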
11. Unsupervised clustering: distance
- You do not have external information on how the data are arranged.
- You only have the values measured in the experiment.
- You need to be able to measure the distance between the expression profiles of two genes, or the distance between the gene expression values of two samples.
- The distance measure should be a quantitative and non-subjective measure of the closeness of a pair of data points.
12. Euclidean distance
[Figure: genes A = (x1, x2) and B = (y1, y2) measured at times t1 and t2, plotted in two dimensions with d the distance between them.]
Euclidean distance: d(A, B) = \sqrt{\sum_i (x_i - y_i)^2}
Manhattan distance: d(A, B) = \sum_i |x_i - y_i|
Minkowski distance (generalized): d(A, B) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}
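All three measures can be computed from the same generalized formula by varying p; a short illustrative sketch (the function name and example profiles are assumptions):

```python
import numpy as np

def minkowski(a, b, p=2):
    """Generalized Minkowski distance between two expression profiles.
    p=1 gives the Manhattan distance, p=2 the Euclidean distance."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

A = [0.16, 0.25, 0.40]
B = [0.24, 0.30, -0.38]
print(minkowski(A, B, p=2))  # Euclidean
print(minkowski(A, B, p=1))  # Manhattan
```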
13. Linear correlation
The correlation coefficient between n pairs of observations (x_i, y_i) is
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
The linear correlation coefficient measures the strength of the linear relationship between the paired x and y values in a sample.
[Figure: three scatter plots of y against x, illustrating r = -1, r = 0 and r = 1.]
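An illustrative sketch (not from the slides) of the Pearson coefficient and the derived distance d = (1 - r)/2 used later in the exercise; function names are assumptions:

```python
import numpy as np

def pearson(x, y):
    """Linear (Pearson) correlation coefficient between two profiles."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

def correlation_distance(x, y):
    """Distance in [0, 1]: 0 for perfectly correlated, 1 for anti-correlated."""
    return (1.0 - pearson(x, y)) / 2.0

print(pearson([1, 2, 3], [2, 4, 6]))               # 1.0
print(correlation_distance([1, 2, 3], [3, 2, 1]))  # 1.0
```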
14. Distance types
With differences (Euclidean distance), B and C are the closest profiles; with correlation, A and B are the closest.
15. Different distances account for different properties
[Figure: the same profiles A, B and C clustered with correlation and with Euclidean distance give different trees.]
Correlation captures tendencies; Euclidean distance captures global similarity.
16. Unsupervised clustering: other important choices
- Measurement of pair-wise distances between gene expression values (NEXT STEP).
- Measurement of pair-wise distances between clusters (see the sketch after this list):
  - Single linkage (or nearest neighbour)
  - Complete linkage (or furthest neighbour, or maximum distance)
  - Average linkage:
    - i) average distance between each point in a cluster and every point in the other cluster
      - weighted methods compensate for the size of the cluster (WPGMA)
      - unweighted methods treat clusters of different sizes equally (UPGMA)
    - ii) distance from the mean centroid of each cluster
      - weighted (WPGMC)
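A hedged sketch of choosing between these linkage criteria with SciPy's hierarchical clustering (the toy matrix and cluster count are assumptions, not data from the course):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Illustrative genes-by-samples matrix
genes = np.array([[ 0.1,  0.9,  1.2],
                  [ 0.2,  1.0,  1.1],
                  [-0.8, -0.5,  0.3],
                  [-0.9, -0.4,  0.2]])

dists = pdist(genes, metric='euclidean')            # condensed pair-wise distances
for method in ('single', 'complete', 'average'):    # nearest, furthest, UPGMA
    tree = linkage(dists, method=method)
    print(method, fcluster(tree, t=2, criterion='maxclust'))
```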
17. Unsupervised clustering: which computational algorithm should I use?
- The aim of clustering is to group together genes or samples that have similar expression profiles.
- There are many different computational algorithms for doing this. You can have:
  - hierarchical clustering
    - agglomerative clustering
    - divisive clustering (SOTA)
  - flat (or non-hierarchical) clustering (k-means, SOM)
18. Unsupervised clustering methods
[Diagram: clustering methods, from non-hierarchical (k-means, PCA, SOM), which are quick and robust, to hierarchical (hierarchical clustering, SOTA), which provide different levels of information.]
19. Aggregative hierarchical clustering
Relationships among profiles are represented by branch lengths. The closest pair of profiles is recursively linked until the complete hierarchy is reconstructed. This allows the relationships among groups of related genes to be explored at higher levels.
CLUSTER
20. Aggregative hierarchical clustering
[Figure: dendrogram of profiles c1 to c5.]
The pair of closest profiles is recursively joined until a complete hierarchy is constructed. Branch lengths are proportional to the differences between profiles.
21. Different aggregative criteria
Minimum distance vs maximum distance between clusters.
22. Exercise
- Using real data, try to build a tree with average linkage (a worked check with SciPy is sketched after the table).
- Steps:
  - Construct the distance matrices d_{x,y}: one using the Euclidean distance and one using the correlation-based distance (use d = (1 - r)/2 for correlation).
  - Apply the algorithm: select the closest pair, and collapse their column and row, joining the entries as d_{xy,z} = (d_{x,z} + d_{y,z})/2.

ORF       R1     R2     R3     R4     R5
YHR007C   0.16   0.25   0.40  -0.19  -0.25
YBR218C   0.24   0.30  -0.38  -0.43  -0.33
YAL051W  -0.04   0.40   0.41   0.24   0.17
YAL053W   0.19   0.41   0.23  -0.01  -0.31
YAL054C  -0.67  -0.19   0.00  -0.19  -0.30
YAL055W  -0.56   0.00  -0.13  -0.06  -0.31
YAL056W   0.01   0.65   0.24  -0.00  -0.09
YAL058W   0.04   0.30   0.20   0.05  -0.20
YOL109W   0.63   0.65   0.91   0.55   0.17
YAL065C  -0.13  -0.62   0.18  -0.05  -0.35
YAL066W  -0.58  -0.22   0.03  -0.26  -0.19
YAL067C  -1.12  -0.99  -0.41  -1.03  -0.89
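A possible way to check the exercise with SciPy, assuming average linkage on d = (1 - r)/2 and on the Euclidean distance (a sketch for illustration, not code from the course):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

orfs = ["YHR007C", "YBR218C", "YAL051W", "YAL053W", "YAL054C", "YAL055W",
        "YAL056W", "YAL058W", "YOL109W", "YAL065C", "YAL066W", "YAL067C"]
data = np.array([
    [ 0.16,  0.25,  0.40, -0.19, -0.25],
    [ 0.24,  0.30, -0.38, -0.43, -0.33],
    [-0.04,  0.40,  0.41,  0.24,  0.17],
    [ 0.19,  0.41,  0.23, -0.01, -0.31],
    [-0.67, -0.19,  0.00, -0.19, -0.30],
    [-0.56,  0.00, -0.13, -0.06, -0.31],
    [ 0.01,  0.65,  0.24, -0.00, -0.09],
    [ 0.04,  0.30,  0.20,  0.05, -0.20],
    [ 0.63,  0.65,  0.91,  0.55,  0.17],
    [-0.13, -0.62,  0.18, -0.05, -0.35],
    [-0.58, -0.22,  0.03, -0.26, -0.19],
    [-1.12, -0.99, -0.41, -1.03, -0.89],
])

# Correlation-based distance: SciPy's 'correlation' metric is 1 - r, so halve it
d_corr = pdist(data, metric='correlation') / 2.0
tree_corr = linkage(d_corr, method='average')              # average linkage
tree_eucl = linkage(pdist(data, metric='euclidean'), method='average')

# Compare the leaf orders of the two trees (no plotting needed)
print(dendrogram(tree_corr, labels=orfs, no_plot=True)['ivl'])
print(dendrogram(tree_eucl, labels=orfs, no_plot=True)['ivl'])
```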
23. Differences in clustering of experiments
[Figure: the trees obtained with Euclidean distance and with correlation differ.]
24. Results
Correlation: the best-correlated profile is not the most similar one...
Euclidean: ...and the most similar profile is not the best correlated.
25. Aggregative hierarchical clustering
- Problems:
  - lack of robustness
  - difficult interpretation
  - subjective cluster definition
26. Clustering methods
[Diagram: properties of the methods. Non-hierarchical: k-means, PCA, SOM. Hierarchical: UPGMA, NN, SOTA. Properties compared: deterministic, robust, provides different levels of information.]
27. K-means clustering
The idea is to find the best division of N samples into K clusters C_i such that the total distance between the clustered samples and their respective centers (that is, the total variance) is minimized. This criterion is expressed as
J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
where \mu_i is the center of class i. An analogy with linear regression can be seen: there, the residuals are the distances from each point to the regression line; in clustering, the residuals are the distances between each point and its cluster center. The k-means algorithm starts by randomly assigning instances to the classes, computes the centers according to
\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
then reassigns the instances to the nearest cluster center, recalculates the centers, reassigns the instances, and so on, until J stops decreasing (or the centers stop moving).
[Figure: a two-dimensional example of clustering.]
28. K-means clustering
K-means clustering algorithm (a code sketch follows below):
1. Partition the items randomly into k initial clusters; decide which distance measure to use; determine the centroid (or mean) of each cluster.
2. Then, for each item in turn:
   a) Calculate the distance between the item and all the centroids.
   b) Re-assign the item to the cluster with the closest mean (or centroid).
   c) Recalculate the centroids for the cluster gaining and the cluster losing an item.
3. Repeat step 2 until no more reassignments take place.
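A minimal numpy sketch of this algorithm (function and variable names are assumptions; this is an illustration, not code from the course):

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=0):
    """Plain k-means: random initial partition, then alternate between
    recomputing centroids and reassigning items to the nearest centroid.
    (Empty clusters are not handled; acceptable for a small illustration.)"""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(x))               # step 1: random partition
    centroids = np.array([x[labels == i].mean(axis=0) for i in range(k)])
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)                   # step 2: reassign items
        if np.array_equal(new_labels, labels):              # step 3: stop when stable
            break
        labels = new_labels
        centroids = np.array([x[labels == i].mean(axis=0) for i in range(k)])
    return labels, centroids

rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
labels, centers = kmeans(x, k=2)
print(centers)
```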
29. Self-organising maps (SOM)
A two-dimensional hexagonal or rectangular network of output nodes.
Input: the gene expression matrix

          exp1  exp2  ...  expp
gene1     a11   a12   ...  a1p
gene2     a21   a22   ...  a2p
...
genen     an1   an2   ...  anp
30. SOM: the algorithm
Step 1. Initialize the nodes to random values. Set the initial radius of the neighbourhood.
Step 2. Present a new input and compute the distances to all nodes; Euclidean distances are commonly used.
Step 3. Select the output node j with minimum distance d_j. Update node j and its neighbours: the nodes in the neighbourhood NE_j(t) are updated as
w_{ij}(t+1) = w_{ij}(t) + \alpha(t) (x_i(t) - w_{ij}(t)) for j \in NE_j(t),
where \alpha(t) is a gain term that decreases over time.
Step 4. Repeat from Step 2 until convergence. (A code sketch of one update step follows below.)
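A hedged sketch of a single training step of this update rule on a rectangular grid (the grid layout, learning-rate schedule and variable names are assumptions):

```python
import numpy as np

def som_step(weights, x, t, n_steps, radius0=2.0, alpha0=0.5):
    """One SOM update: find the best-matching node and pull it (and its
    grid neighbours) towards the input profile x.
    weights has shape (rows, cols, n_features)."""
    rows, cols, _ = weights.shape
    alpha = alpha0 * (1.0 - t / n_steps)                  # gain decreases with time
    radius = max(1.0, radius0 * (1.0 - t / n_steps))      # shrinking neighbourhood
    dists = np.linalg.norm(weights - x, axis=2)           # Euclidean distance to x
    r, c = np.unravel_index(dists.argmin(), dists.shape)  # winning node
    for i in range(rows):
        for j in range(cols):
            if np.hypot(i - r, j - c) <= radius:          # inside NE_j(t)
                weights[i, j] += alpha * (x - weights[i, j])
    return weights

w = np.random.rand(4, 4, 5)          # 4x4 map, 5 experimental conditions
x = np.array([0.16, 0.25, 0.40, -0.19, -0.25])
w = som_step(w, x, t=0, n_steps=100)
```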
31. SOM results
DeRisi et al. (1997) Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale. Science, 278, 680-686.
32. SOM: example
Response of human fibroblasts to serum (Iyer et al., 1999, Science 283:83-87).
If a given class is over-represented, it takes over many neurons.
33. Clustering methods
[Diagram: the same comparison of properties as on slide 26. Non-hierarchical: k-means, PCA, SOM. Hierarchical: UPGMA, NN, SOTA. Properties compared: deterministic, robust, provides different levels of information.]
34. SOTA clustering
[Figure: SOTA tree with clusters A to F.]
Interactive, web-based, configurable.
35. SOTA: the algorithm
The Self-Organising Tree Algorithm (SOTA) is a hierarchical divisive method based on a neural network. Unlike other hierarchical methods, SOTA grows from top to bottom until an appropriate level of variability is reached.
Step 1. Initialize the nodes to random values.
Step 2. Present a new input and compute the distances to all terminal nodes.
Step 3. Select the output node j with minimum distance d_j. Update node j and its neighbours: the nodes in the neighbourhood NE_j(t) are updated as
w_{ij}(t+1) = w_{ij}(t) + \alpha(t) (x_i(t) - w_{ij}(t)) for j \in NE_j(t),
where \alpha(t) is a gain term that decreases over time.
Step 4. Repeat from Step 2 until convergence.
Step 5. Reproduce the node with the highest variability.
Dopazo, Carazo (1997); Herrero, Valencia, Dopazo (2001)
36. Advantages of SOTA
- Robustness against noise.
- Divisive algorithm: SOTA grows from top to bottom, and growing can be stopped at any desired level of variability.
- Clusters as patterns: each node of the tree has an associated pattern that corresponds to the cluster beneath it.
- Distribution preserving: the number of clusters depends on the variability of the data.
37. SOTA/SOM vs classical clustering (UPGMA)
38. What have we learned? Lessons from the first-generation algorithms and specific demands for clustering microarray data
- Number of clusters: k-means, SOM and hierarchical methods do not provide any method for defining the true number of clusters.
- The wish list:
  - Methods must be fast
  - Robustness and noise tolerance
  - Deterministic
  - Able to decide the number of clusters automatically