Title: Untitled slide
1. Clustering of DNA microarray data
Joaquín Dopazo. Bioinformatics Unit, CNIO. http://bioinfo.cnio.es
2. Supervised vs Unsupervised clustering
Sample annotation information (additional rows), e.g. cell type, treatment, disease state, time-course information.
Gene annotation information (additional columns), e.g. gene name, function, genome location.
Gene expression levels.
3. Unsupervised clustering
Look for structure within the gene expression matrix; annotation information is used later.
- Identify co-expressing genes: what do they have in common?
- Given the genes of a class: what profile(s) do they display, and are there more genes like them?
- Molecular classification of samples.
4. Analysis of genes with correlated expression
- Genes with correlated expression
- markers
- functionally related genes
5. Genes of the same functional class have correlated expression patterns
25 out of 40 ORFs belong to the functional class "cytoplasmic degradation" (MIPS); most of them are proteasome subunits.
6. Molecular classification of samples
[Figure: samples A, B, C and D grouped by clustering of their expression profiles.]
7. Taxonomic Relationships Between Normal and Malignant Lymphoid Populations
Alizadeh et al., Nature 2000 (96 samples).
8. The data
[Figure: gene expression matrix with the samples grouped into classes A, B and C.]
- Rows: genes (thousands). Each row is the expression profile of a gene across the experimental conditions.
- Columns: experimental conditions (from tens up to no more than a few hundred). Each column is the expression profile of all the genes for one experimental condition (array). The conditions can belong to different classes, e.g. cancer types, tissues, drug treatments, time points, survival, etc.
- Characteristics of the data:
  - We have many more variables than experiments.
  - Low signal-to-noise ratio.
  - High redundancy and intra-gene correlations.
  - Most of the genes are not informative with respect to the trait we are studying (they account for unrelated physiological conditions, etc.).
  - Many genes have no annotation!
9. Be familiar with the data in your gene expression matrix
- Absolute vs relative gene expression values (i.e. ratios)?
- Relative expression: ratios or log-transformed ratios?
- Normalization between samples: are the columns comparable?
- Set of related hybridizations with a common reference sample, or amalgamated data?
- Gene replicates: duplicates of the same probes, or different probes?
- Sample replicates: the same number for each experimental condition?
10. Gene expression matrix: points of caution for unsupervised clustering
- Many analytical methods are based on log2-ratio expression values and not on absolute values.
- Check for missing values: some analysis methods cannot handle matrices with missing values. Either delete the suspect row or column, or interpolate from known values.
- Reduce the size of your matrix by considering only genes that undergo a specified fold-change in at least one of the samples, or whose levels change significantly over the samples being compared (i.e. remove genes with flat patterns), as in the sketch below.
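A minimal sketch of this filtering step, assuming a genes-by-samples matrix of log2 ratios and a 2-fold threshold (the function name, threshold and toy values are assumptions, not material from the course):

```python
import numpy as np

def filter_expression_matrix(log2_ratios, min_log2_range=1.0):
    """Keep genes (rows) with no missing values and at least `min_log2_range`
    spread across samples (1.0 on the log2 scale = 2-fold change)."""
    x = np.asarray(log2_ratios, dtype=float)
    complete = ~np.isnan(x).any(axis=1)               # drop rows with missing values
    spread = np.nanmax(x, axis=1) - np.nanmin(x, axis=1)
    informative = spread >= min_log2_range            # remove flat patterns
    keep = complete & informative
    return x[keep], keep

# Example: 4 genes x 3 samples; one flat gene, one gene with a missing value
m = np.array([[0.1,    0.2, 0.0],    # flat pattern -> removed
              [1.5,   -0.5, 0.3],    # > 2-fold change -> kept
              [np.nan, 0.4, 1.2],    # missing value -> removed
              [-1.0,   0.8, 0.2]])   # > 2-fold change -> kept
filtered, mask = filter_expression_matrix(m)
print(mask)  # [False  True False  True]
```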
11. Unsupervised clustering: distance
- You do not have external information on how the data are arranged.
- You only have the values measured in the experiment.
- You need to be able to measure the distance between the expression profiles of two genes, or the distance between the gene expression values of two samples.
- The distance measure should be a quantitative and non-subjective measure of the closeness of a pair of data points.
12. Euclidean distance
[Figure: genes A = (x1, x2) and B = (y1, y2) measured at times t1 and t2, plotted in two dimensions with d the distance between them.]
Euclidean distance: d(A, B) = \sqrt{\sum_i (x_i - y_i)^2}
Manhattan distance: d(A, B) = \sum_i |x_i - y_i|
Minkowski distance (generalized): d(A, B) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}
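All three measures can be computed from the same generalized formula by varying p; a short illustrative sketch (the function name and example profiles are assumptions):

```python
import numpy as np

def minkowski(a, b, p=2):
    """Generalized Minkowski distance between two expression profiles.
    p=1 gives the Manhattan distance, p=2 the Euclidean distance."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

A = [0.16, 0.25, 0.40]
B = [0.24, 0.30, -0.38]
print(minkowski(A, B, p=2))  # Euclidean
print(minkowski(A, B, p=1))  # Manhattan
```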
13. Linear correlation
The correlation coefficient between n pairs of observations (x_i, y_i) is
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
The linear correlation coefficient measures the strength of the linear relationship between the paired x and y values in a sample.
[Figure: three scatter plots of y against x, illustrating r = -1, r = 0 and r = 1.]
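An illustrative sketch (not from the slides) of the Pearson coefficient and the derived distance d = (1 - r)/2 used later in the exercise; function names are assumptions:

```python
import numpy as np

def pearson(x, y):
    """Linear (Pearson) correlation coefficient between two profiles."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2))

def correlation_distance(x, y):
    """Distance in [0, 1]: 0 for perfectly correlated, 1 for anti-correlated."""
    return (1.0 - pearson(x, y)) / 2.0

print(pearson([1, 2, 3], [2, 4, 6]))               # 1.0
print(correlation_distance([1, 2, 3], [3, 2, 1]))  # 1.0
```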
14. Distance types
With differences (Euclidean distance), B and C are the closest profiles; with correlation, A and B are the closest.
15. Different distances account for different properties
[Figure: the same profiles A, B and C clustered with correlation and with Euclidean distance give different trees.]
Correlation captures tendencies; Euclidean distance captures global similarity.
16. Unsupervised clustering: other important choices
- Measurement of pair-wise distances between gene expression values (NEXT STEP).
- Measurement of pair-wise distances between clusters (see the sketch after this list):
  - Single linkage (or nearest neighbour)
  - Complete linkage (or furthest neighbour, or maximum distance)
  - Average linkage:
    - i) average distance between each point in a cluster and every point in the other cluster
      - weighted methods compensate for the size of the cluster (WPGMA)
      - unweighted methods treat clusters of different sizes equally (UPGMA)
    - ii) distance from the mean centroid of each cluster
      - weighted (WPGMC)
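A hedged sketch of choosing between these linkage criteria with SciPy's hierarchical clustering (the toy matrix and cluster count are assumptions, not data from the course):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Illustrative genes-by-samples matrix
genes = np.array([[ 0.1,  0.9,  1.2],
                  [ 0.2,  1.0,  1.1],
                  [-0.8, -0.5,  0.3],
                  [-0.9, -0.4,  0.2]])

dists = pdist(genes, metric='euclidean')            # condensed pair-wise distances
for method in ('single', 'complete', 'average'):    # nearest, furthest, UPGMA
    tree = linkage(dists, method=method)
    print(method, fcluster(tree, t=2, criterion='maxclust'))
```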
17. Unsupervised clustering: which computational algorithm should I use?
- The aim of clustering is to group together genes or samples that have similar expression profiles.
- There are many different computational algorithms for doing this. You can have:
  - hierarchical clustering
    - agglomerative clustering
    - divisive clustering (SOTA)
  - flat (or non-hierarchical) clustering (k-means, SOM)
18. Unsupervised clustering methods
[Diagram: clustering methods, from non-hierarchical (k-means, PCA, SOM), which are quick and robust, to hierarchical (hierarchical clustering, SOTA), which provide different levels of information.]
19. Aggregative hierarchical clustering
Relationships among profiles are represented by branch lengths. The closest pair of profiles is recursively linked until the complete hierarchy is reconstructed. This allows the relationships among groups of related genes to be explored at higher levels.
CLUSTER
20. Aggregative hierarchical clustering
[Figure: dendrogram of profiles c1 to c5.]
The pair of closest profiles is recursively joined until a complete hierarchy is constructed. Branch lengths are proportional to the differences between profiles.
21. Different aggregative criteria
Minimum distance vs maximum distance between clusters.
22. Exercise
- Using real data, try to build a tree with average linkage (a worked check with SciPy is sketched after the table).
- Steps:
  - Construct the distance matrices d_{x,y}: one using the Euclidean distance and one using the correlation-based distance (use d = (1 - r)/2 for correlation).
  - Apply the algorithm: select the closest pair, and collapse their column and row, joining the entries as d_{xy,z} = (d_{x,z} + d_{y,z})/2.

ORF       R1     R2     R3     R4     R5
YHR007C   0.16   0.25   0.40  -0.19  -0.25
YBR218C   0.24   0.30  -0.38  -0.43  -0.33
YAL051W  -0.04   0.40   0.41   0.24   0.17
YAL053W   0.19   0.41   0.23  -0.01  -0.31
YAL054C  -0.67  -0.19   0.00  -0.19  -0.30
YAL055W  -0.56   0.00  -0.13  -0.06  -0.31
YAL056W   0.01   0.65   0.24  -0.00  -0.09
YAL058W   0.04   0.30   0.20   0.05  -0.20
YOL109W   0.63   0.65   0.91   0.55   0.17
YAL065C  -0.13  -0.62   0.18  -0.05  -0.35
YAL066W  -0.58  -0.22   0.03  -0.26  -0.19
YAL067C  -1.12  -0.99  -0.41  -1.03  -0.89
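A possible way to check the exercise with SciPy, assuming average linkage on d = (1 - r)/2 and on the Euclidean distance (a sketch for illustration, not code from the course):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

orfs = ["YHR007C", "YBR218C", "YAL051W", "YAL053W", "YAL054C", "YAL055W",
        "YAL056W", "YAL058W", "YOL109W", "YAL065C", "YAL066W", "YAL067C"]
data = np.array([
    [ 0.16,  0.25,  0.40, -0.19, -0.25],
    [ 0.24,  0.30, -0.38, -0.43, -0.33],
    [-0.04,  0.40,  0.41,  0.24,  0.17],
    [ 0.19,  0.41,  0.23, -0.01, -0.31],
    [-0.67, -0.19,  0.00, -0.19, -0.30],
    [-0.56,  0.00, -0.13, -0.06, -0.31],
    [ 0.01,  0.65,  0.24, -0.00, -0.09],
    [ 0.04,  0.30,  0.20,  0.05, -0.20],
    [ 0.63,  0.65,  0.91,  0.55,  0.17],
    [-0.13, -0.62,  0.18, -0.05, -0.35],
    [-0.58, -0.22,  0.03, -0.26, -0.19],
    [-1.12, -0.99, -0.41, -1.03, -0.89],
])

# Correlation-based distance: SciPy's 'correlation' metric is 1 - r, so halve it
d_corr = pdist(data, metric='correlation') / 2.0
tree_corr = linkage(d_corr, method='average')              # average linkage
tree_eucl = linkage(pdist(data, metric='euclidean'), method='average')

# Compare the leaf orders of the two trees (no plotting needed)
print(dendrogram(tree_corr, labels=orfs, no_plot=True)['ivl'])
print(dendrogram(tree_eucl, labels=orfs, no_plot=True)['ivl'])
```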
23. Differences in clustering of experiments
[Figure: the trees obtained with Euclidean distance and with correlation differ.]
24. Results
Correlation: the best-correlated profile is not the most similar one...
Euclidean: ...and the most similar profile is not the best correlated.
25. Aggregative hierarchical clustering
- Problems:
  - lack of robustness
  - difficult interpretation
  - subjective cluster definition
26. Clustering methods
[Diagram: properties of the methods. Non-hierarchical: k-means, PCA, SOM. Hierarchical: UPGMA, NN, SOTA. Properties compared: deterministic, robust, provides different levels of information.]
27. K-means clustering
The idea is to find the best division of N samples into K clusters C_i such that the total distance between the clustered samples and their respective centers (that is, the total variance) is minimized. This criterion is expressed as
J = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
where \mu_i is the center of class i. An analogy with linear regression can be seen: there, the residuals are the distances from each point to the regression line; in clustering, the residuals are the distances between each point and its cluster center. The k-means algorithm starts by randomly assigning instances to the classes, computes the centers according to
\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
then reassigns the instances to the nearest cluster center, recalculates the centers, reassigns the instances, and so on, until J stops decreasing (or the centers stop moving).
[Figure: a two-dimensional example of clustering.]
28. K-means clustering
K-means clustering algorithm (a code sketch follows below):
1. Partition the items randomly into k initial clusters; decide which distance measure to use; determine the centroid (or mean) of each cluster.
2. Then, for each item in turn:
   a) Calculate the distance between the item and all the centroids.
   b) Re-assign the item to the cluster with the closest mean (or centroid).
   c) Recalculate the centroids for the cluster gaining and the cluster losing an item.
3. Repeat step 2 until no more reassignments take place.
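A minimal numpy sketch of this algorithm (function and variable names are assumptions; this is an illustration, not code from the course):

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=0):
    """Plain k-means: random initial partition, then alternate between
    recomputing centroids and reassigning items to the nearest centroid.
    (Empty clusters are not handled; acceptable for a small illustration.)"""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(x))               # step 1: random partition
    centroids = np.array([x[labels == i].mean(axis=0) for i in range(k)])
    for _ in range(n_iter):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)                   # step 2: reassign items
        if np.array_equal(new_labels, labels):              # step 3: stop when stable
            break
        labels = new_labels
        centroids = np.array([x[labels == i].mean(axis=0) for i in range(k)])
    return labels, centroids

rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
labels, centers = kmeans(x, k=2)
print(centers)
```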
29. Self-organising maps (SOM)
A two-dimensional hexagonal or rectangular network of output nodes.
Input: the gene expression matrix

          exp1  exp2  ...  expp
gene1     a11   a12   ...  a1p
gene2     a21   a22   ...  a2p
...
genen     an1   an2   ...  anp
30. SOM: the algorithm
Step 1. Initialize the nodes to random values. Set the initial radius of the neighbourhood.
Step 2. Present a new input and compute the distances to all nodes; Euclidean distances are commonly used.
Step 3. Select the output node j with minimum distance d_j. Update node j and its neighbours: the nodes in the neighbourhood NE_j(t) are updated as
w_{ij}(t+1) = w_{ij}(t) + \alpha(t) (x_i(t) - w_{ij}(t)) for j \in NE_j(t),
where \alpha(t) is a gain term that decreases over time.
Step 4. Repeat from Step 2 until convergence. (A code sketch of one update step follows below.)
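A hedged sketch of a single training step of this update rule on a rectangular grid (the grid layout, learning-rate schedule and variable names are assumptions):

```python
import numpy as np

def som_step(weights, x, t, n_steps, radius0=2.0, alpha0=0.5):
    """One SOM update: find the best-matching node and pull it (and its
    grid neighbours) towards the input profile x.
    weights has shape (rows, cols, n_features)."""
    rows, cols, _ = weights.shape
    alpha = alpha0 * (1.0 - t / n_steps)                  # gain decreases with time
    radius = max(1.0, radius0 * (1.0 - t / n_steps))      # shrinking neighbourhood
    dists = np.linalg.norm(weights - x, axis=2)           # Euclidean distance to x
    r, c = np.unravel_index(dists.argmin(), dists.shape)  # winning node
    for i in range(rows):
        for j in range(cols):
            if np.hypot(i - r, j - c) <= radius:          # inside NE_j(t)
                weights[i, j] += alpha * (x - weights[i, j])
    return weights

w = np.random.rand(4, 4, 5)          # 4x4 map, 5 experimental conditions
x = np.array([0.16, 0.25, 0.40, -0.19, -0.25])
w = som_step(w, x, t=0, n_steps=100)
```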
31. SOM results
DeRisi et al. (1997) Exploring the Metabolic and Genetic Control of Gene Expression on a Genomic Scale. Science, 278, 680-686.
32. SOM: example
Response of human fibroblasts to serum (Iyer et al., 1999, Science 283:83-87).
If a given class is over-represented, it takes over many neurons.
33. Clustering methods
[Diagram: the same comparison of properties as on slide 26. Non-hierarchical: k-means, PCA, SOM. Hierarchical: UPGMA, NN, SOTA. Properties compared: deterministic, robust, provides different levels of information.]
34. SOTA clustering
[Figure: SOTA tree with clusters A to F.]
Interactive, web-based, configurable.
35. SOTA: the algorithm
The Self-Organising Tree Algorithm (SOTA) is a hierarchical divisive method based on a neural network. Unlike other hierarchical methods, SOTA grows from top to bottom until an appropriate level of variability is reached.
Step 1. Initialize the nodes to random values.
Step 2. Present a new input and compute the distances to all terminal nodes.
Step 3. Select the output node j with minimum distance d_j. Update node j and its neighbours: the nodes in the neighbourhood NE_j(t) are updated as
w_{ij}(t+1) = w_{ij}(t) + \alpha(t) (x_i(t) - w_{ij}(t)) for j \in NE_j(t),
where \alpha(t) is a gain term that decreases over time.
Step 4. Repeat from Step 2 until convergence.
Step 5. Reproduce the node with the highest variability.
Dopazo, Carazo (1997); Herrero, Valencia, Dopazo (2001)
36. Advantages of SOTA
- Robustness against noise.
- Divisive algorithm: SOTA grows from top to bottom, and growing can be stopped at any desired level of variability.
- Clusters as patterns: each node of the tree has an associated pattern that corresponds to the cluster beneath it.
- Distribution preserving: the number of clusters depends on the variability of the data.
37. SOTA/SOM vs classical clustering (UPGMA)
38. What have we learned? Lessons from the first-generation algorithms and specific demands for clustering microarray data
- Number of clusters: k-means, SOM and hierarchical methods do not provide any method for defining the true number of clusters.
- The wish list:
  - Methods must be fast
  - Robustness and noise tolerance
  - Deterministic
  - Able to decide the number of clusters automatically