Title: Introduction to microarray analysis II
1. Introduction to microarray analysis II
Ariel Chernomoretz, Plateforme de Bioinformatique,
Centre de Recherche du CHUL
2. Relevant question
Signal calculation (Affymetrix GCOS): 1- Each
intensity is corrected for background. 2- An
idealized value for each MM is computed and
subtracted from the corresponding PM value. 3- The
adjusted intensity is log-transformed to
stabilize the variance. 4- A weighted mean is
computed using Tukey's biweight and expressed as
an antilog. 5- Finally, the signal is scaled to a
target mean of 500.
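The five steps can be sketched in code. This is only a rough illustration, not Affymetrix's actual implementation: the `tukey_biweight` helper, its constants, and the simplified ideal-mismatch rule are assumptions made for the sketch.

```python
import numpy as np

def tukey_biweight(x, c=5.0, eps=1e-4):
    """One-step Tukey biweight mean: down-weight points far from the median."""
    m = np.median(x)
    s = np.median(np.abs(x - m))               # MAD, a robust spread estimate
    u = (x - m) / (c * s + eps)
    w = np.where(np.abs(u) < 1, (1 - u**2) ** 2, 0.0)
    return np.sum(w * x) / np.sum(w)

def gcos_signal(pm, mm, bg=0.0):
    """Sketch of the GCOS signal steps for one probe set."""
    pm_c = pm - bg                             # 1. background correction
    im = np.minimum(mm, pm_c - 1e-3)           # 2. idealized mismatch (simplified)
    adj = np.maximum(pm_c - im, 1e-3)
    logv = np.log2(adj)                        # 3. log transform stabilizes variance
    return 2 ** tukey_biweight(logv)           # 4. biweight mean, antilogged
    # 5. per-chip scaling to the target mean of 500 is applied afterwards

sig = gcos_signal(np.array([100.0, 120.0, 110.0, 90.0]),
                  np.array([50.0, 60.0, 55.0, 45.0]))
```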
Experimental design
Affymetrix file types: .EXP .DAT .CEL .CHP .RPT .TXT
Experimental protocol
Biostatistician
Microarray data
Normalization
Algorithms for selecting significantly modulated
genes: 1- Fixed fold change, or variable as a
function of intensity. 2- t-test or ANOVA if the
experiment has replicates.
LFCM, fixed FC, t-test, ANOVA
Significantly modulated genes
Sorting and clustering software: GO classifies
roughly 50% of the genes into 3 groups. Trees,
SOM, K-means and PCA perform different kinds of
sorting based on each gene's profile and
intensity; they classify 100% of the genes on the
list and offer very powerful graphical
interfaces.
- PCA (Principal Component Analysis)
- Neural networks
- Support vector machines
- Bayesian inference
- Clustering
- S.O.M.
- K-means
Unsupervised sorting
Supervised sorting
"Interesting" genes: - Genes already characterized
in the literature (bibliographic
validation). - Known genes never before
associated with the particular conditions of our
experiment. - Unknown genes (ESTs) associated
with known genes (similar expression
pattern). - Unknown genes with a distinctive
expression pattern.
- Pathways
- Ontological classification
- Venn diagrams
- Bibliographic search engines
Retained lists
Validation: every microarray experiment generates
false-positive and false-negative results. At
present there are no established criteria for the
minimal validation requirements of this type of
experiment.
Quantitative RT-PCR; dsRNA-mediated knockdown
(RNAi); knockouts (yeast, Drosophila, C. elegans,
zebrafish, mouse)
Validation
New question
Answer
3. Curse of dimensionality
- After RMA (or MAS 5.0, etc.) the data take the
form of a data matrix - Samples' point of view
- Which experimental conditions have similar effects
across a set of genes? - 10 points in a 10,000-dim space
- Genes' point of view
- Which genes behave similarly across experiments?
- 10,000 points in a 10-dim space
[Figure: data matrix, ~10,000 genes (rows) x ~10 samples (columns)]
4. Curse of dimensionality
- We normally filter out low-quality or
uninformative data - Low-intensity data
- Outliers
- Genes that are not interesting for our study
- Genes that do not change vs. genes that do
change: differential expression
5. Differential Expression
- Detect genes that are expressed at significantly
different levels in one sample compared to another
- Identify a list of genes that act as markers
between different samples
6. Differential Expression: Fold Change
- FC = Experiment/Control
- In the beginning:
- FC > 2 -> upregulation
- FC < 1/2 -> downregulation
- Why 2? The cutoff should depend on intensity!
- More elaborate intensity-dependent methods were
developed.
[Figure: intensity-dependent variation]
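As a minimal illustration of the fixed-cutoff rule (the intensities below are invented):

```python
import numpy as np

def fold_change(expr, ctrl, eps=1e-9):
    """Plain ratio fold change; eps guards against division by zero."""
    return expr / (ctrl + eps)

expr = np.array([500.0, 80.0, 200.0])   # experiment intensities (made up)
ctrl = np.array([100.0, 200.0, 190.0])  # control intensities (made up)
fc = fold_change(expr, ctrl)
up = fc > 2      # fixed cutoff: up-regulated
down = fc < 0.5  # fixed cutoff: down-regulated
```

Working in log2(FC) makes the two cutoffs symmetric (+1 and -1), which is why log ratios are the usual representation.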
7. Differential Expression: t-test
- For each gene we want to know whether the means of
two groups differ
- It is a kind of signal-to-noise calculation: we
compare the distance between the means against the
total variance
- Calculate a p-value: how probable it is that the
estimated means differ
- Assumptions: normal distribution; a large number of
replicates is necessary
group1
group2
8. Differential Expression: t-test
group1
group2
- If t is higher than a certain threshold, the
difference between X and Y can be said to be
significant
- The p-value tells us how probable it is to find a
higher t value by chance if X and Y came from the
same distribution
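A two-group comparison for a single gene can be run with SciPy's independent-samples t-test; the expression values below are invented for illustration:

```python
import numpy as np
from scipy import stats

# log2 expression of one gene across replicates (invented values)
group1 = np.array([7.8, 8.1, 8.3, 7.9, 8.0, 8.2])
group2 = np.array([10.1, 9.8, 10.2, 10.0, 9.9, 10.3])

# |t| is large and p is tiny: the two means clearly differ
t, p = stats.ttest_ind(group1, group2)
```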
9. Differential Expression: ANOVA
- ANOVA tests whether different groups have the same
mean (null hypothesis) by comparing two estimates
of the variance σ²:
- MSE (mean square error): within-group variability
- MSB (mean square between): between-group
variability
- http://www.psych.utah.edu/stat/introstats/anovaflash.html
10. Differential Expression: ANOVA
- The MSE is an estimate of σ² whether or not the
null hypothesis is true.
- The MSB is an estimate of σ² only if the null
hypothesis is true. If the null hypothesis is
false, MSB estimates something larger than σ².
- Therefore, if MSB is sufficiently larger than
MSE, the null hypothesis can be rejected.
- A p-value is calculated. A low p-value means it is
unlikely the means come from the same distribution.
11. ANOVA
- ANOVA
- tests whether different groups have the same mean
(null hypothesis) by comparing two estimates of
the variance σ²:
- MSB (mean square between): between-group
variability
- MSE (mean square error): within-group variability
- Tests whether a factor is 'important', i.e. whether
it can explain the observed variability
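A one-way ANOVA on invented replicate values; the third group's shifted mean inflates MSB relative to MSE, so the F statistic is large:

```python
from scipy import stats

g1 = [8.0, 8.2, 7.9, 8.1]
g2 = [8.1, 8.0, 8.2, 7.8]
g3 = [10.0, 10.3, 9.9, 10.1]  # shifted group drives MSB up

F, p = stats.f_oneway(g1, g2, g3)  # F = MSB/MSE
```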
12. ANOVA
- log(y_ijkg) = µ + A_i + D_j + V_k + G_g + (AG)_ig + (VG)_kg + e_ijkg
µ: overall mean
e_ijkg: random noise
A_i: effect of array i
D_j: effect of dye j
V_k: effect of variate (treatment) k
G_g: effect of gene g
(AG)_ig: array-gene interaction ('spot' effect)
(VG)_kg: variate-gene interaction: differential expression!!
13. ANOVA
- Example: compare two conditions A and B, looking
for differentially expressed genes
- Hybridize A (Cy3-labelled) and B (Cy5-labelled)
on a single array. For a given gene g:
log(y_111g) = µ + A_1 + D_1 + V_1 + G_g + (AG)_1g + (DG)_1g + (VG)_1g + e_111g
log(y_122g) = µ + A_1 + D_2 + V_2 + G_g + (AG)_1g + (DG)_2g + (VG)_2g + e_122g
log(y_111g/y_122g) = (D_1-D_2) + (V_1-V_2) + (DG)_1g-(DG)_2g + (VG)_1g-(VG)_2g + e_g
14. ANOVA
A -> B
log(y_111g/y_122g) = (D_1-D_2) + (V_1-V_2) + (DG)_1g-(DG)_2g + (VG)_1g-(VG)_2g + e_g
Dye-swap experiment A <-> B (two slides): averaging the
two ratios cancels the dye and dye-gene terms.
15. Differential Expression: other methods of gene
selection
- Fisher criterion score
- Entropy measure (information theory)
- χ² measure
- Information gain - Information gain ratio
- Correlation-based feature selection
- Principal Component Analysis (PCA)
- Linear models, Bayesian estimates
- Etc.
16. Differential Expression: multiple hypothesis
testing
- When testing tens of thousands of genes, each
at significance level p, we will get a large
number of errors (false positives)
- For p<0.01, 250 genes out of 25,000 will be found
just by chance!
- Methods to lower the number of predicted FPs:
- Bonferroni: use p' = p/num_tests
- Benjamini-Hochberg
- Holm
-> a long list of DE genes... What next!!??!
17. Differential Expression: multiple hypothesis
testing
- The aforementioned methods provide corrected
p-value cutoffs
[Plot: genes ranked by p-value against the
unadjusted and adjusted cutoff lines]
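The two simplest corrections can be sketched directly; the p-values below are invented:

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject a test only if p < alpha / (number of tests)."""
    p = np.asarray(pvals)
    return p < alpha / p.size

def benjamini_hochberg(pvals, alpha=0.05):
    """BH step-up: find the largest k with p_(k) <= (k/m)*alpha,
    then reject the k smallest p-values (controls the FDR)."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresh = alpha * (np.arange(1, m + 1) / m)
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.09, 0.2, 0.6]
bonf = bonferroni(pvals)          # very conservative
bh = benjamini_hochberg(pvals)    # less conservative, controls FDR
```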
18. Microarray Data Analysis
- A long list of DE genes is not biological
understanding.
- What's next?
- Select some genes for validation (e.g. by qRT-PCR)
- Do follow-up experiments on some genes?
- Try to learn about all the genes on the
list... (read 500 papers)?
- Try to publish a huge table with all the results.
- ...
19. Microarray Data Analysis
- Look for patterns in your data
- From one-gene to set-of-genes analysis
- Genes in biological pathways
- Genes associated with a particular location in the cell
- Genes having a particular function or involved in
particular processes
- A priori selected genes
20. Pattern recognition
- Find structure in the data that correlates with or
explains some biological behavior
- Which experimental conditions have similar
effects across a set of genes? (disease markers,
cancer subgroup discovery, etc.)
- Which genes behave similarly across experiments?
(gene networks, etc.)
- Clustering: finding groups of genes (experiments)
with similar expression profiles
- Classification: finding models that separate two
or more data classes.
21. Pattern recognition
- Clustering: finding groups of genes (experiments)
with similar expression profiles
- Hierarchical clustering
- K-means
- SOM
- Classification: finding models that separate two
or more classes
- k-Nearest Neighbors (kNN)
- artificial neural networks
- hidden Markov models
- Bayesian methods
22. Clustering
- A cluster is a group of genes (experiments) with
similar expression profiles
- It is an unsupervised procedure. Once the notions
of distance and neighborhood are given, no prior
knowledge is used to find the grouping.
- Several methods:
- Hierarchical clustering (hierarchical method)
- K-means (partitioning method)
- More...
24. Clustering: What is similar?
- How can we quantify the notion of similarity?
- Vector distance measures:
- Euclidean distance
- Manhattan distance
- Pearson correlation
- Spearman's rank correlation
- Mutual information
28. Clustering: What is similar?
- Vector distance measures:
- Euclidean distance
- Manhattan distance
- Pearson correlation
- Spearman's rank correlation: the Pearson
correlation of the ranks
- Mutual information: the amount of information
gained about X when Y is learned
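These measures are easy to compare on a toy pair of profiles. Note that y is just a scaled copy of x, so the correlation measures call them identical while the geometric distances do not:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x  # same shape of profile, twice the magnitude

euclidean = np.sqrt(np.sum((x - y) ** 2))  # sensitive to magnitude
manhattan = np.sum(np.abs(x - y))          # sensitive to magnitude
r, _ = pearsonr(x, y)                      # 1.0: same profile up to scaling
rho, _ = spearmanr(x, y)                   # 1.0: same rank order
```

This is why correlation-based distances are popular for expression profiles: two genes that go up and down together cluster together even if their absolute intensities differ.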
30. Clustering: Hierarchical clustering
Grouping by similarity between samples
Up-regulated genes / down-regulated genes
Grouping genes by similarity of expression
profile
[Heatmap: samples reordered as 1 5 2 9 11 3 4 6 10 7 8]
31. Clustering: Hierarchical clustering
32. Clustering: Hierarchical clustering
[Figure: similarity matrix, from high similarity to
low similarity]
33. Clustering: Hierarchical clustering
- Join A and F. Recalculate the distance matrix.
- Distance to a cluster:
- Single linkage
- Average linkage
- Complete linkage
38. Clustering: Hierarchical clustering
- The resulting figure is known as a hierarchical
tree (dendrogram).
- We obtain a different number of clusters depending
on how deep we cut.
39. Clustering: Hierarchical clustering
- At each step, the order of the two groups being
joined is arbitrary
40. Clustering: Hierarchical clustering
- Pros:
- Useful for providing a view of the data structure
and similarities.
- It is simple.
- It is colorful. It is part of most microarray
studies.
- Cons:
- Be cautious! Anything will cluster, even random
data!
- Clusters are somewhat arbitrary, depending on how
the tree is cut.
- What is a good cluster?
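A sketch of the procedure with SciPy on synthetic data: two well-separated groups of gene profiles, average linkage, and the tree cut into two clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Synthetic expression matrix: 5 genes near 0, 5 genes near 3 (4 samples each)
data = np.vstack([rng.normal(0.0, 0.2, size=(5, 4)),
                  rng.normal(3.0, 0.2, size=(5, 4))])

Z = linkage(data, method='average', metric='euclidean')  # build the tree
labels = fcluster(Z, t=2, criterion='maxclust')          # cut into 2 clusters
```

Cutting the same tree at different depths (`t=3`, `t=4`, ...) yields different partitions, which is exactly the "where to cut" caveat listed above.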
41. Clustering: Hierarchical clustering
- What is a good cluster? Bootstrapping:
- The data are resampled (some experiments taken out
at random and replaced by copies of other
experiments)
- The whole clustering is repeated.
- Clusters that appear often are statistically
safer than others.
42. Clustering: K-means
- A specific number of clusters has to be provided
- Goal: assign elements to clusters
43. Clustering: K-means
- Start by guessing k centers
44. Clustering: K-means
- Assign elements to these centers
45. Clustering: K-means
46. Clustering: K-means
- Reassign elements, and repeat until convergence
- K-means is iterative.
- The outcome depends on initial guesses
- The number of final clusters is an input of the
algorithm
48ClusteringK-mean
- To guess a good number of clusters we can use a
figure of merit (FOM)? - FOM quantifies how good the clusters are
49. Clustering: SOM
- Self-organizing maps
- Start with a given number of clusters
- For each cluster create a node and give the nodes
initial positions
50. Clustering: SOM
51. Clustering: SOM
- Move the nodes toward the selected gene
52. Clustering: SOM
- Pick another gene and move the nodes again
53. Clustering: SOM
- Keep iterating; at each iteration decrease node
mobility
54. Clustering: SOM
- Eventually the nodes will reach stable positions.
Clusters are defined as the sets of genes closest
to each node
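The node-dragging loop above can be sketched as a toy version. A real SOM also pulls each winner's topological neighbors; here the neighborhood function is omitted, and the initialization and learning-rate schedule are assumptions for the sketch:

```python
import numpy as np

def som_sketch(X, n_nodes=2, n_epochs=30, lr0=0.5, seed=0):
    """Toy SOM loop: drag the winning node toward each sample,
    shrinking the step size (mobility) every epoch."""
    rng = np.random.default_rng(seed)
    # spread the initial nodes over the data (simplification)
    nodes = X[np.linspace(0, len(X) - 1, n_nodes).astype(int)].copy()
    for epoch in range(n_epochs):
        lr = lr0 * (1 - epoch / n_epochs)  # mobility decreases each iteration
        for x in X[rng.permutation(len(X))]:
            w = np.argmin(np.linalg.norm(nodes - x, axis=1))  # winning node
            nodes[w] += lr * (x - nodes[w])
    # clusters: each sample joins its closest node
    labels = np.array([np.argmin(np.linalg.norm(nodes - x, axis=1)) for x in X])
    return labels, nodes

X = np.vstack([np.zeros((5, 2)), np.full((5, 2), 5.0)])
labels, nodes = som_sketch(X)
```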
55. Classification
- Classification is the process of finding models
that separate two or more data classes.
- Given classes A and B, can we use them as a basis
to decide whether a new unknown sample is A or B?
- Supervised classification means we use a priori
information to find the different classes
- The methods find the structure in the data that
explains this information
56. Classification
- First, the data are divided into a training set
and a test set
57. Classification
- Learn a classifier with the training set
58. Classification
- Apply the classifier to the test data
- Compare predicted classes with known classes to
assess the performance of the classifier
59. Classification
- Examples of classifiers:
- Linear discriminants
- K-nearest neighbors
- Artificial neural networks
- Decision trees
- Support Vector Machines
- Bayesian methods
- Hidden Markov models
- etc.
60. Classification: K-NN
- Assign a test sample to the class most often
found among its K nearest training samples
- Rule of thumb: K ≈ sqrt(N_training)
- Normally, Euclidean distance is assumed
61. Classification: K-NN
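The rule is a few lines of NumPy; the 2-D training points below are hypothetical, chosen only to form two clear classes:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Classify x by majority vote among its k nearest training samples."""
    d = np.linalg.norm(X_train - x, axis=1)          # Euclidean distances
    nearest = np.argsort(d)[:k]                      # indices of k closest
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]                 # majority class

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], float)
y_train = np.array([0, 0, 0, 1, 1, 1])
pred_a = knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3)
```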
62. Example
- Leukemia classification
- Golub et al., Science 286:531-537 (1999)
- Cancer classification applied to acute leukemias
- Class discovery: recognize previously undefined
tumor subtypes
- Class prediction: assignment of samples to
already defined classes
64. Example
- Ramaswamy et al. (Nature Genetics, 2003): is the
gene expression program of metastasis already
present in primary tumors?
- Used a weighted voting algorithm to find the 128
genes that best distinguished primary tumors from
metastases
65. Example
[Figure legend: horizontal bar: red = recurrent,
black = non-recurrent; vertical bar: red = originally
primary, black = originally metastasis]
- Hierarchical clustering in the space of the 128
genes identifies two main clusters of primary
tumors, highly correlated with the original
primary-tumor vs. metastasis distinction
66. Gene Ontology Consortium
- Goal of the GO consortium: provide a controlled
vocabulary that can be applied to all organisms,
even as knowledge of gene and protein roles is
accumulating and changing.
- GO provides three ontologies:
- MOLECULAR FUNCTION (what?)
- BIOLOGICAL PROCESS (why?)
- CELLULAR COMPONENT (where?)
67. GO
- GO is organized as a Directed Acyclic Graph (DAG)
structure
68. GO
- Each GO node has zero or more ENTREZID
annotations
- Parent terms inherit annotations from their children
- BP: the apoptosis node
71. Differential expression lists and GO
- Are there any GO terms that have a larger-than-
expected subset of our selected genes in their
annotation list?
- If so, these GO terms give us insight into the
functional characteristics of the gene list
- How large is 'larger'?
72. GO as an urn
- The urn contains a ball for each gene in the
universe
- Paint the balls representing genes in our
selected list white, and paint the rest black.
- Testing a GO term amounts to drawing the genes
annotated to it from the urn and tallying white
and black.
73. GO and microarray gene sets
- Is a GO term specific for a set?
2x2 table (Fisher exact test or chi-square test):

                     in DE list   NOT in DE list   total
in GO category            51            416          467
NOT in GO category       125           8588         8713
total                    173           9004         9177

p-value ≈ 8x10^-52
74. Gene products interact... a lot!
75. Gene Networks
Gene networks describe relations among a group of
genes.
76. Relations and Interactions
- Examples:
- A gene expressing a transcription factor that
regulates the expression of a set of other genes
- A gene expressing an enzyme that activates a set
of proteins
- Two proteins binding each other to produce a
functional complex
- A gene expressing an enzyme that catalyses the
production of a metabolic compound, which in turn
inhibits another enzyme
- A gene participating in the same cellular process
as a set of other genes
77. Interaction data
- Protein-protein interactions
- Protein-DNA interactions
- Functional classifications
- Metabolic pathways
- Signalling pathways
- Sequence and structure information
- Other gene expression studies