Title: Introduction to microarray analysis II
1. Introduction to microarray analysis II
Ariel Chernomoretz, Plateforme de Bioinformatique,
Centre de Recherche du CHUL
2. Relevant question
Signal calculation (Affymetrix GCOS): 1- Each
intensity is corrected for background. 2- An
idealized value for each MM is computed and
subtracted from the corresponding PM value. 3- The
adjusted intensity is log-transformed to
stabilize the variance. 4- A weighted mean is
computed using Tukey's biweight and expressed as
an antilog. 5- Finally, the signal is scaled to a
target mean of 500.
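The five steps can be sketched in code. This is only a rough illustration, not Affymetrix's actual implementation: the `tukey_biweight` helper, its constants, and the simplified ideal-mismatch rule are assumptions made for the sketch.

```python
import numpy as np

def tukey_biweight(x, c=5.0, eps=1e-4):
    """One-step Tukey biweight mean: down-weight points far from the median."""
    m = np.median(x)
    s = np.median(np.abs(x - m))               # MAD, a robust spread estimate
    u = (x - m) / (c * s + eps)
    w = np.where(np.abs(u) < 1, (1 - u**2) ** 2, 0.0)
    return np.sum(w * x) / np.sum(w)

def gcos_signal(pm, mm, bg=0.0):
    """Sketch of the GCOS signal steps for one probe set."""
    pm_c = pm - bg                             # 1. background correction
    im = np.minimum(mm, pm_c - 1e-3)           # 2. idealized mismatch (simplified)
    adj = np.maximum(pm_c - im, 1e-3)
    logv = np.log2(adj)                        # 3. log transform stabilizes variance
    return 2 ** tukey_biweight(logv)           # 4. biweight mean, antilogged
    # 5. per-chip scaling to the target mean of 500 is applied afterwards

sig = gcos_signal(np.array([100.0, 120.0, 110.0, 90.0]),
                  np.array([50.0, 60.0, 55.0, 45.0]))
```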
Experimental design
Affymetrix file types: .EXP .DAT .CEL .CHP .RPT .TXT
Experimental protocol
Biostatistician
Microarray data
Normalization
Algorithms for selecting significantly modulated
genes: 1- Fixed fold change, or variable as a
function of intensity. 2- t-test or ANOVA if the
experiment has replicates.
LFCM, fixed FC, t-test, ANOVA
Significantly modulated genes
Sorting and clustering software: GO classifies
roughly 50% of the genes into 3 groups. Trees,
SOM, K-means and PCA perform different kinds of
sorting based on each gene's profile and
intensity; they classify 100% of the genes on the
list and offer very powerful graphical
interfaces.
- PCA (Principal Component Analysis)
- Neural networks
- Support vector machines
- Bayesian inference
- Clustering
- S.O.M.
- K-means
Unsupervised sorting
Supervised sorting
"Interesting" genes: - Genes already characterized
in the literature (bibliographic
validation). - Known genes never before
associated with the particular conditions of our
experiment. - Unknown genes (ESTs) associated
with known genes (similar expression
pattern). - Unknown genes with a distinctive
expression pattern.
- Pathways
- Ontological classification
- Venn diagrams
- Bibliographic search engines
Retained lists
Validation: every microarray experiment generates
false-positive and false-negative results. At
present there are no established criteria for the
minimal validation requirements of this type of
experiment.
Quantitative RT-PCR; dsRNA-mediated knockdown
(RNAi); knockouts (yeast, Drosophila, C. elegans,
zebrafish, mouse)
Validation
New question
Answer
3. Curse of dimensionality
- After RMA (or MAS 5.0, etc.) the data take the
form of a data matrix - Samples' point of view
- Which experimental conditions have similar effects
across a set of genes? - 10 points in a 10,000-dim space
- Genes' point of view
- Which genes behave similarly across experiments?
- 10,000 points in a 10-dim space
[Figure: data matrix, ~10,000 genes (rows) x ~10 samples (columns)]
4. Curse of dimensionality
- We normally filter out low-quality or
uninformative data - Low-intensity data
- Outliers
- Genes that are not interesting for our study
- Genes that do not change vs. genes that do
change: differential expression
5. Differential Expression
- Detect genes that are expressed at significantly
different levels in one sample compared to another
- Identify a list of genes that act as markers
between different samples
6. Differential Expression: Fold Change
- FC = Experiment/Control
- In the beginning:
- FC > 2 -> upregulation
- FC < 1/2 -> downregulation
- Why 2? The cutoff should depend on intensity!
- More elaborate intensity-dependent methods were
developed.
[Figure: intensity-dependent variation]
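As a minimal illustration of the fixed-cutoff rule (the intensities below are invented):

```python
import numpy as np

def fold_change(expr, ctrl, eps=1e-9):
    """Plain ratio fold change; eps guards against division by zero."""
    return expr / (ctrl + eps)

expr = np.array([500.0, 80.0, 200.0])   # experiment intensities (made up)
ctrl = np.array([100.0, 200.0, 190.0])  # control intensities (made up)
fc = fold_change(expr, ctrl)
up = fc > 2      # fixed cutoff: up-regulated
down = fc < 0.5  # fixed cutoff: down-regulated
```

Working in log2(FC) makes the two cutoffs symmetric (+1 and -1), which is why log ratios are the usual representation.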
7. Differential Expression: t-test
- For each gene we want to know whether the means of
two groups differ
- It is a kind of signal-to-noise calculation: we
compare the distance between the means against the
total variance
- Calculate a p-value: how probable it is that the
estimated means differ
- Assumptions: normal distribution; a large number of
replicates is necessary
group1
group2
8. Differential Expression: t-test
group1
group2
- If t is higher than a certain threshold, the
difference between X and Y can be said to be
significant
- The p-value tells us how probable it is to find a
higher t value by chance if X and Y came from the
same distribution
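A two-group comparison for a single gene can be run with SciPy's independent-samples t-test; the expression values below are invented for illustration:

```python
import numpy as np
from scipy import stats

# log2 expression of one gene across replicates (invented values)
group1 = np.array([7.8, 8.1, 8.3, 7.9, 8.0, 8.2])
group2 = np.array([10.1, 9.8, 10.2, 10.0, 9.9, 10.3])

# |t| is large and p is tiny: the two means clearly differ
t, p = stats.ttest_ind(group1, group2)
```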
9. Differential Expression: ANOVA
- ANOVA tests whether different groups have the same
mean (null hypothesis) by comparing two estimates
of the variance σ²:
- MSE (mean square error): within-group variability
- MSB (mean square between): between-group
variability
- http://www.psych.utah.edu/stat/introstats/anovaflash.html
10. Differential Expression: ANOVA
- The MSE is an estimate of σ² whether or not the
null hypothesis is true.
- The MSB is an estimate of σ² only if the null
hypothesis is true. If the null hypothesis is
false, MSB estimates something larger than σ².
- Therefore, if MSB is sufficiently larger than
MSE, the null hypothesis can be rejected.
- A p-value is calculated. A low p-value means it is
unlikely the means come from the same distribution.
11. ANOVA
- ANOVA
- tests whether different groups have the same mean
(null hypothesis) by comparing two estimates of
the variance σ²:
- MSB (mean square between): between-group
variability
- MSE (mean square error): within-group variability
- Tests whether a factor is 'important', i.e. whether
it can explain the observed variability
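A one-way ANOVA on invented replicate values; the third group's shifted mean inflates MSB relative to MSE, so the F statistic is large:

```python
from scipy import stats

g1 = [8.0, 8.2, 7.9, 8.1]
g2 = [8.1, 8.0, 8.2, 7.8]
g3 = [10.0, 10.3, 9.9, 10.1]  # shifted group drives MSB up

F, p = stats.f_oneway(g1, g2, g3)  # F = MSB/MSE
```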
12. ANOVA
- log(y_ijkg) = µ + A_i + D_j + V_k + G_g + (AG)_ig + (VG)_kg + e_ijkg
µ: overall mean
e_ijkg: random noise
A_i: effect of array i
D_j: effect of dye j
V_k: effect of variate (treatment) k
G_g: effect of gene g
(AG)_ig: array-gene interaction ('spot' effect)
(VG)_kg: variate-gene interaction: differential expression!!
13. ANOVA
- Example: compare two conditions A and B, looking
for differentially expressed genes
- Hybridize A (Cy3-labelled) and B (Cy5-labelled)
on a single array. For a given gene g:
log(y_111g) = µ + A_1 + D_1 + V_1 + G_g + (AG)_1g + (DG)_1g + (VG)_1g + e_111g
log(y_122g) = µ + A_1 + D_2 + V_2 + G_g + (AG)_1g + (DG)_2g + (VG)_2g + e_122g
log(y_111g/y_122g) = (D_1-D_2) + (V_1-V_2) + (DG)_1g-(DG)_2g + (VG)_1g-(VG)_2g + e_g
14. ANOVA
A -> B
log(y_111g/y_122g) = (D_1-D_2) + (V_1-V_2) + (DG)_1g-(DG)_2g + (VG)_1g-(VG)_2g + e_g
Dye-swap experiment A <-> B (two slides): averaging the
two ratios cancels the dye and dye-gene terms.
15. Differential Expression: other methods of gene
selection
- Fisher criterion score
- Entropy measure (information theory)
- χ² measure
- Information gain - Information gain ratio
- Correlation-based feature selection
- Principal Component Analysis (PCA)
- Linear models, Bayesian estimates
- Etc.
16. Differential Expression: multiple hypothesis
testing
- When testing tens of thousands of genes, each
at significance level p, we will get a large
number of errors (false positives)
- For p<0.01, 250 genes out of 25,000 will be found
just by chance!
- Methods to lower the number of predicted FPs:
- Bonferroni: use p' = p/num_tests
- Benjamini-Hochberg
- Holm
-> a long list of DE genes... What next!!??!
17. Differential Expression: multiple hypothesis
testing
- The aforementioned methods provide corrected
p-value cutoffs
[Plot: genes ranked by p-value against the
unadjusted and adjusted cutoff lines]
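The two simplest corrections can be sketched directly; the p-values below are invented:

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject a test only if p < alpha / (number of tests)."""
    p = np.asarray(pvals)
    return p < alpha / p.size

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up: find the largest k with p_(k) <= (k/m)*alpha,
    then reject the k smallest p-values (controls the FDR)."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresh = alpha * (np.arange(1, m + 1) / m)
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.09, 0.2, 0.6]
bonf = bonferroni(pvals)          # very conservative
bh = benjamini_hochberg(pvals)    # less conservative, controls FDR
```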
18. Microarray Data Analysis
- A long list of DE genes is not biological
understanding.
- What's next?
- Select some genes for validation (e.g. by qRT-PCR)
- Do follow-up experiments on some genes?
- Try to learn about all the genes on the
list... (read 500 papers)?
- Try to publish a huge table with all the results.
- ...
19. Microarray Data Analysis
- Look for patterns in your data
- From one-gene to set-of-genes analysis
- Genes in biological pathways
- Genes associated with a particular location in the cell
- Genes having a particular function or involved in
particular processes
- A priori selected genes
20. Pattern recognition
- Find structure in the data that correlates with or
explains some biological behavior
- Which experimental conditions have similar
effects across a set of genes? (disease markers,
cancer subgroup discovery, etc.)
- Which genes behave similarly across experiments?
(gene networks, etc.)
- Clustering: finding groups of genes (experiments)
with similar expression profiles
- Classification: finding models that separate two
or more data classes.
21. Pattern recognition
- Clustering: finding groups of genes (experiments)
with similar expression profiles
- Hierarchical clustering
- K-means
- SOM
- Classification: finding models that separate two
or more classes
- k-Nearest Neighbors (kNN)
- artificial neural networks
- hidden Markov models
- Bayesian methods
22. Clustering
- A cluster is a group of genes (experiments) with
similar expression profiles
- It is an unsupervised procedure. Once the notions
of distance and neighborhood are given, no prior
knowledge is used to find the grouping.
- Several methods:
- Hierarchical clustering (hierarchical method)
- K-means (partitioning method)
- More...
24. Clustering: What is similar?
- How can we quantify the notion of similarity?
- Vector distance measures:
- Euclidean distance
- Manhattan distance
- Pearson correlation
- Spearman's rank correlation
- Mutual information
28. Clustering: What is similar?
- Vector distance measures:
- Euclidean distance
- Manhattan distance
- Pearson correlation
- Spearman's rank correlation: the Pearson
correlation of the ranks
- Mutual information: the amount of information
gained about X when Y is learned
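These measures are easy to compare on a toy pair of profiles. Note that y is just a scaled copy of x, so the correlation measures call them identical while the geometric distances do not:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x  # same shape of profile, twice the magnitude

euclidean = np.sqrt(np.sum((x - y) ** 2))  # sensitive to magnitude
manhattan = np.sum(np.abs(x - y))          # sensitive to magnitude
r, _ = pearsonr(x, y)                      # 1.0: same profile up to scaling
rho, _ = spearmanr(x, y)                   # 1.0: same rank order
```

This is why correlation-based distances are popular for expression profiles: two genes that go up and down together cluster together even if their absolute intensities differ.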
30. Clustering: Hierarchical clustering
Grouping by similarity between samples
Up-regulated genes / down-regulated genes
Grouping genes by similarity of expression
profile
[Heatmap: samples reordered as 1 5 2 9 11 3 4 6 10 7 8]
31. Clustering: Hierarchical clustering
32. Clustering: Hierarchical clustering
[Figure: similarity matrix, from high similarity to
low similarity]
33. Clustering: Hierarchical clustering
- Join A and F. Recalculate the distance matrix.
- Distance to a cluster:
- Single linkage
- Average linkage
- Complete linkage
38. Clustering: Hierarchical clustering
- The resulting figure is known as a hierarchical
tree (dendrogram).
- We obtain a different number of clusters depending
on how deep we cut.
39. Clustering: Hierarchical clustering
- At each step, the order of the two groups being
joined is arbitrary
40. Clustering: Hierarchical clustering
- Pros:
- Useful for providing a view of the data structure
and similarities.
- It is simple.
- It is colorful. It is part of most microarray
studies.
- Cons:
- Be cautious! Anything will cluster, even random
data!
- Clusters are somewhat arbitrary, depending on how
the tree is cut.
- What is a good cluster?
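A sketch of the procedure with SciPy on synthetic data: two well-separated groups of gene profiles, average linkage, and the tree cut into two clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Synthetic expression matrix: 5 genes near 0, 5 genes near 3 (4 samples each)
data = np.vstack([rng.normal(0.0, 0.2, size=(5, 4)),
                  rng.normal(3.0, 0.2, size=(5, 4))])

Z = linkage(data, method='average', metric='euclidean')  # build the tree
labels = fcluster(Z, t=2, criterion='maxclust')          # cut into 2 clusters
```

Cutting the same tree at different depths (`t=3`, `t=4`, ...) yields different partitions, which is exactly the "where to cut" caveat listed above.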
41. Clustering: Hierarchical clustering
- What is a good cluster? Bootstrapping:
- The data are resampled (some experiments taken out
at random and replaced by copies of other
experiments)
- The whole clustering is repeated.
- Clusters that appear often are statistically
safer than others.
42. Clustering: K-means
- A specific number of clusters has to be provided
- Goal: assign elements to clusters
43. Clustering: K-means
- Start by guessing k centers
44. Clustering: K-means
- Assign elements to these centers
45. Clustering: K-means
46. Clustering: K-means
- Reassign elements, and repeat until convergence
- K-means is iterative.
- The outcome depends on initial guesses
- The number of final clusters is an input of the
algorithm
48ClusteringK-mean
- To guess a good number of clusters we can use a
figure of merit (FOM)? - FOM quantifies how good the clusters are
49. Clustering: SOM
- Self-organizing maps
- Start with a given number of clusters
- For each cluster create a node and give the nodes
initial positions
50. Clustering: SOM
51. Clustering: SOM
- Move the nodes toward the selected gene
52. Clustering: SOM
- Pick another gene and move the nodes again
53. Clustering: SOM
- Keep iterating; at each iteration decrease node
mobility
54. Clustering: SOM
- Eventually the nodes will reach stable positions.
Clusters are defined as the sets of genes closest
to each node
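The node-dragging loop above can be sketched as a toy version. A real SOM also pulls each winner's topological neighbors; here the neighborhood function is omitted, and the initialization and learning-rate schedule are assumptions for the sketch:

```python
import numpy as np

def som_sketch(X, n_nodes=2, n_epochs=30, lr0=0.5, seed=0):
    """Toy SOM loop: drag the winning node toward each sample,
    shrinking the step size (mobility) every epoch."""
    rng = np.random.default_rng(seed)
    # spread the initial nodes over the data (simplification)
    nodes = X[np.linspace(0, len(X) - 1, n_nodes).astype(int)].copy()
    for epoch in range(n_epochs):
        lr = lr0 * (1 - epoch / n_epochs)  # mobility decreases each iteration
        for x in X[rng.permutation(len(X))]:
            w = np.argmin(np.linalg.norm(nodes - x, axis=1))  # winning node
            nodes[w] += lr * (x - nodes[w])
    # clusters: each sample joins its closest node
    labels = np.array([np.argmin(np.linalg.norm(nodes - x, axis=1)) for x in X])
    return labels, nodes

X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 5.0)])
labels, nodes = som_sketch(X)
```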
55. Classification
- Classification is the process of finding models
that separate two or more data classes.
- Given classes A and B, can we use them as a basis
to decide whether a new unknown sample is A or B?
- Supervised classification means we use a priori
information to find the different classes
- The methods find the structure in the data that
explains this information
56. Classification
- First, the data are divided into a training set
and a test set
57. Classification
- Learn a classifier with the training set
58. Classification
- Apply the classifier to the test data
- Compare predicted classes with known classes to
assess the performance of the classifier
59. Classification
- Examples of classifiers:
- Linear discriminants
- K-nearest neighbors
- Artificial neural networks
- Decision trees
- Support Vector Machines
- Bayesian methods
- Hidden Markov models
- etc.
60. Classification: K-NN
- Assign a test sample to the class most often
found among its K nearest training samples
- Rule of thumb: K ≈ sqrt(N_training)
- Normally, Euclidean distance is assumed
61. Classification: K-NN
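The rule is a few lines of NumPy; the 2-D training points below are hypothetical, chosen only to form two clear classes:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Classify x by majority vote among its k nearest training samples."""
    d = np.linalg.norm(X_train - x, axis=1)          # Euclidean distances
    nearest = np.argsort(d)[:k]                      # indices of k closest
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]                 # majority class

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], float)
y_train = np.array([0, 0, 0, 1, 1, 1])
pred_a = knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3)
```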
62. Example
- Leukemia classification
- Golub et al., Science 286:531-537 (1999)
- Cancer classification applied to acute leukemias
- Class discovery: recognize previously undefined
tumor subtypes
- Class prediction: assignment of samples to
already defined classes
64. Example
- Ramaswamy et al. (Nature Genetics, 2003): is the
gene expression program of metastasis already
present in primary tumors?
- Used a weighted voting algorithm to find the 128
genes that best distinguished primary tumors from
metastases
65. Example
[Figure legend: horizontal bar: red = recurrent,
black = non-recurrent; vertical bar: red = originally
primary, black = originally metastasis]
- Hierarchical clustering in the space of the 128
genes identifies two main clusters of primary
tumors, highly correlated with the original
primary-tumor vs. metastasis distinction
66. Gene Ontology Consortium
- Goal of the GO consortium: provide a controlled
vocabulary that can be applied to all organisms,
even as knowledge of gene and protein roles is
accumulating and changing.
- GO provides three ontologies:
- MOLECULAR FUNCTION (what?)
- BIOLOGICAL PROCESS (why?)
- CELLULAR COMPONENT (where?)
67. GO
- GO is organized as a Directed Acyclic Graph (DAG)
structure
68. GO
- Each GO node has zero or more ENTREZID
annotations
- Parent terms inherit annotations from their children
- BP: the apoptosis node
71. Differential expression lists and GO
- Are there any GO terms that have a larger-than-
expected subset of our selected genes in their
annotation list?
- If so, these GO terms give us insight into the
functional characteristics of the gene list
- How large is 'larger'?
72. GO as an urn
- The urn contains a ball for each gene in the
universe
- Paint the balls representing genes in our
selected list white, and paint the rest black.
- Testing a GO term amounts to drawing the genes
annotated to it from the urn and tallying white
and black.
73. GO and microarray gene sets
- Is a GO term specific for a set?
2x2 table (Fisher exact test or chi-square test):

                     in DE list   NOT in DE list   total
in GO category            51            416          467
NOT in GO category       125           8588         8713
total                    173           9004         9177

p-value ≈ 8x10^-52
74. Gene products interact... a lot!
75. Gene Networks
Gene networks describe relations among a group of
genes.
76. Relations and Interactions
- Examples:
- A gene expressing a transcription factor that
regulates the expression of a set of other genes
- A gene expressing an enzyme that activates a set
of proteins
- Two proteins binding each other to produce a
functional complex
- A gene expressing an enzyme that catalyses the
production of a metabolic compound, which in turn
inhibits another enzyme
- A gene participating in the same cellular process
as a set of other genes
77. Interaction data
- Protein-protein interactions
- Protein-DNA interactions
- Functional classifications
- Metabolic pathways
- Signalling pathways
- Sequence and structure information
- Other gene expression studies