Title: Microarray Statistics
1Reverse engineering gene and protein regulatory
networks using Graphical Models. A comparative
evaluation study.
Marco Grzegorczyk Dirk Husmeier Adriano Werhli
3Systems biology: Learning signalling pathways and regulatory networks from postgenomic data
6possibly completely unknown
E.g. flow cytometry experiments → data
Here: concentrations of (phosphorylated) proteins
7possibly completely unknown
E.g. flow cytometry experiments → data
data → machine learning, statistical methods
8extracted network
true network
Is the extracted network a good prediction of the
real relationships?
9extracted network
true network
Evaluation of learning performance
biological knowledge (gold standard network)
10Reverse Engineering of Regulatory Networks
- Can we learn network structures from postgenomic data themselves?
- Are there statistical methods to distinguish between direct and indirect correlations?
- Is it worth applying time-consuming Bayesian network approaches although computationally cheaper methods are available?
- Do active interventions improve the learning performance?
- Gene knockouts (VIGS, RNAi)
11direct interaction
common regulator
indirect interaction
co-regulation
13Three widely applied methodologies
- Relevance networks
- Graphical Gaussian models
- Bayesian networks
15Relevance networks (Butte and Kohane, 2000)
1. Choose a measure of association A(.,.)
2. Define a threshold value t_A
3. For all pairs of domain variables (X,Y) compute their association A(X,Y)
4. Connect those variables (X,Y) by an undirected edge whose association A(X,Y) exceeds the predefined threshold value t_A
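The four steps above can be sketched in a few lines. This is a minimal illustration, assuming the absolute Pearson correlation as the association measure A(.,.) and a hand-picked threshold; the function name `relevance_network` and the toy data are hypothetical and not taken from the slides.

```python
import numpy as np

def relevance_network(data, threshold):
    """Return undirected edges (i, j) whose |Pearson correlation| exceeds threshold.

    data: array of shape (n_observations, n_variables).
    """
    corr = np.corrcoef(data, rowvar=False)  # pairwise association A(X, Y)
    n_vars = corr.shape[0]
    edges = []
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            if abs(corr[i, j]) > threshold:
                edges.append((i, j))
    return edges

# Toy data (hypothetical): X1 is a noisy copy of X0, X2 is independent of both.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = x0 + 0.1 * rng.normal(size=500)
x2 = rng.normal(size=500)
toy = np.column_stack([x0, x1, x2])

edges = relevance_network(toy, threshold=0.5)
print(edges)
```

Other association measures (for example mutual information) could be swapped in for `np.corrcoef` without changing the thresholding step.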
19Pairwise associations without taking the context of the system into consideration
direct interaction
pseudo-correlations between A and B
E.g. correlation between A and C is disturbed (weakened) by the influence of B
20strong correlation σ_12
22Graphical Gaussian Models
Partial correlation, i.e. correlation conditional on all other domain variables:
Corr(X_1, X_2 | X_3, ..., X_n)
strong partial correlation π_12
But usually: number of observations < number of variables
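A small sketch of how partial correlations can be read off the inverse covariance (precision) matrix, using the standard identity π_ij = -ω_ij / sqrt(ω_ii ω_jj). The chain X0 → X1 → X2 and all variable names are hypothetical toy choices, and enough observations are simulated for the covariance matrix to be invertible.

```python
import numpy as np

def partial_correlations(cov):
    """Partial correlation matrix from a covariance matrix via its inverse."""
    precision = np.linalg.inv(cov)
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Toy chain (hypothetical): X0 -> X1 -> X2, so X0 and X2 interact only indirectly.
rng = np.random.default_rng(1)
n = 5000
x0 = rng.normal(size=n)
x1 = x0 + rng.normal(size=n)
x2 = x1 + rng.normal(size=n)
data = np.column_stack([x0, x1, x2])

corr = np.corrcoef(data, rowvar=False)
pcor = partial_correlations(np.cov(data, rowvar=False))
print("correlation X0,X2:        ", round(corr[0, 2], 2))
print("partial correlation X0,X2:", round(pcor[0, 2], 2))
```

In this toy chain the plain correlation between X0 and X2 is clearly non-zero, while their partial correlation is close to zero, which is the distinction between direct and indirect interactions that the GGM exploits.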
23Shrinkage estimation of the covariance matrix (Schäfer and Strimmer, 2005)
0 < λ_0 < 1: estimated (optimal) shrinkage intensity, with [formula not transcribed] where [formula not transcribed] is guaranteed
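The idea of shrinkage can be sketched as a convex combination of the empirical covariance matrix and a simple target. This is only an illustration under stated assumptions: the shrinkage intensity `lam` is fixed by hand and the target is the diagonal of the empirical covariance; the analytic estimate of the optimal intensity from Schäfer and Strimmer (2005) is not reproduced here.

```python
import numpy as np

def shrink_covariance(data, lam):
    """Convex combination of the empirical covariance and its diagonal (assumed target)."""
    s = np.cov(data, rowvar=False)
    target = np.diag(np.diag(s))
    return (1.0 - lam) * s + lam * target

# Hypothetical setting with fewer observations (5) than variables (10).
rng = np.random.default_rng(2)
small = rng.normal(size=(5, 10))

s_emp = np.cov(small, rowvar=False)
s_shrunk = shrink_covariance(small, lam=0.2)

rank_emp = np.linalg.matrix_rank(s_emp)          # rank-deficient, not invertible
min_eig = np.linalg.eigvalsh(s_shrunk).min()     # strictly positive after shrinkage
print(rank_emp, min_eig > 0)
```

With a positive intensity the shrunk matrix is positive definite and can be inverted, so partial correlations remain computable even when there are fewer observations than variables.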
25Graphical Gaussian Models
direct interaction
common regulator
indirect interaction
P(A,B) = P(A)P(B), but P(A,B|C) ≠ P(A|C)P(B|C)
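The co-regulation case can be checked numerically. In the sketch below, A and B are simulated independently and both drive C (a hypothetical toy model, not the Raf data): their marginal correlation is close to zero, but their partial correlation given C is strongly negative, so a GGM would place a spurious edge between A and B.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
a = rng.normal(size=n)
b = rng.normal(size=n)
c = a + b + 0.5 * rng.normal(size=n)   # C is co-regulated by A and B

data = np.column_stack([a, b, c])
marginal_ab = np.corrcoef(a, b)[0, 1]

# Partial correlation of A and B given C, read off the precision matrix.
precision = np.linalg.inv(np.cov(data, rowvar=False))
partial_ab = -precision[0, 1] / np.sqrt(precision[0, 0] * precision[1, 1])

print("marginal correlation A,B:", round(marginal_ab, 2))
print("partial correlation A,B given C:", round(partial_ab, 2))
```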
26Further drawbacks
- Relevance networks and Graphical Gaussian models can extract undirected edges only.
- Bayesian networks promise to extract at least some directed edges. But can we trust these edge directions?
- It may be better to learn undirected edges than to learn directed edges with false orientations.
28Bayesian networks
- Marriage between graph theory and probability theory.
- Directed acyclic graph (DAG) represents conditional independence relations.
- Markov assumption leads to a factorization of the joint probability distribution
[Figure: example DAG with NODES A, B, C, D, E, F connected by directed EDGES]
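For reference, the factorization implied by the Markov assumption can be written in its general form. The notation pa(X_i) for the parents of node X_i in the DAG is an assumption here, since the formula on the slide was not transcribed.

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{pa}(X_i)\bigr)
```

Each node contributes one local conditional distribution given its parents, so a node without parents contributes its marginal distribution.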
29Bayesian networks versus causal networks
Bayesian networks represent conditional
(in)dependency relations - not necessarily causal
interactions.
32Bayesian networks
Parameterisation: Gaussian BGe scoring metric:
data ~ N(µ, Σ) with normal-Wishart distribution of the (unknown) parameters, i.e. µ ~ N(µ_0, (νW)^-1) and W ~ Wishart(T_0)
33Bayesian networks
BGe metric: closed-form solution
34Learning the network structure
graph → score_BGe(graph)
Idea: heuristically search for the graph M that is most supported by the data, P(M|data) > P(graph|data), e.g. greedy search
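A greedy search over DAGs can be sketched as a simple hill climber. Note the assumptions: the BGe score is replaced by a Gaussian BIC score as a stand-in, moves are limited to adding or deleting single edges, and the function names and toy data are hypothetical.

```python
import itertools
import numpy as np

def bic_score(data, parents_of):
    """Gaussian BIC of a DAG given as {node: set_of_parents} (stand-in for BGe)."""
    n = data.shape[0]
    total = 0.0
    for node, parents in parents_of.items():
        y = data[:, node]
        x = np.column_stack([np.ones(n)] + [data[:, p] for p in sorted(parents)])
        beta, *_ = np.linalg.lstsq(x, y, rcond=None)
        resid = y - x @ beta
        sigma2 = resid @ resid / n
        loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
        total += loglik - 0.5 * np.log(n) * (x.shape[1] + 1)
    return total

def is_acyclic(parents_of):
    """Repeatedly strip nodes without remaining parents; failure means a cycle."""
    remaining = {k: set(v) for k, v in parents_of.items()}
    while remaining:
        roots = [k for k, v in remaining.items() if not v]
        if not roots:
            return False
        for r in roots:
            del remaining[r]
        for v in remaining.values():
            v.difference_update(roots)
    return True

def greedy_search(data):
    """First-improvement hill climbing over single-edge additions and deletions."""
    n_vars = data.shape[1]
    graph = {i: set() for i in range(n_vars)}
    best = bic_score(data, graph)
    improved = True
    while improved:
        improved = False
        for child, parent in itertools.permutations(range(n_vars), 2):
            candidate = {k: set(v) for k, v in graph.items()}
            candidate[child] ^= {parent}          # toggle the edge parent -> child
            if not is_acyclic(candidate):
                continue
            score = bic_score(data, candidate)
            if score > best + 1e-9:
                graph, best, improved = candidate, score, True
    return graph, best

# Toy chain (hypothetical): X0 -> X1 -> X2.
rng = np.random.default_rng(5)
x0 = rng.normal(size=5000)
x1 = x0 + rng.normal(size=5000)
x2 = x1 + rng.normal(size=5000)
graph, best = greedy_search(np.column_stack([x0, x1, x2]))
skeleton = {frozenset((c, p)) for c, ps in graph.items() for p in ps}
print(graph)
```

The search returns a single high-scoring graph; the edge directions it reports need not match the generating chain, because several DAGs can score equally well.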
35MCMC sampling of Bayesian networks
- Better idea: Bayesian model averaging via Markov Chain Monte Carlo (MCMC) simulations
- Construct and simulate a Markov chain (M_t)_t in the space of DAGs whose distribution converges to the graph posterior distribution as stationary distribution, i.e.
  P(M_t = graph | data) → P(graph | data) as t → ∞,
  to generate a DAG sample G_1, G_2, G_3, ..., G_T
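The convergence statement can be illustrated with a plain Metropolis-Hastings sampler on a tiny, made-up state space. The three graph labels and their unnormalised log posterior scores are hypothetical, and the proposal simply picks one of the other graphs uniformly; this is neither order MCMC nor a sampler with real DAG neighbourhoods.

```python
import math
import random

def metropolis_hastings(log_scores, n_steps, seed=0):
    """Sample graph labels with probability proportional to exp(log_score)."""
    rng = random.Random(seed)
    states = list(log_scores)
    current = states[0]
    sample = []
    for _ in range(n_steps):
        # Symmetric proposal: pick one of the other graphs uniformly at random.
        proposal = rng.choice([s for s in states if s != current])
        log_ratio = log_scores[proposal] - log_scores[current]
        # Accept with probability min(1, exp(log_ratio)).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current = proposal
        sample.append(current)
    return sample

# Hypothetical unnormalised log posterior scores of three candidate graphs.
log_scores = {"G_a": math.log(0.6), "G_b": math.log(0.3), "G_c": math.log(0.1)}
sample = metropolis_hastings(log_scores, n_steps=20000)
freq = {g: sample.count(g) / len(sample) for g in log_scores}
print(freq)
```

The relative frequencies of the visited graphs approach the normalised posterior weights as the chain gets longer.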
36Order MCMC (Friedman and Koller, 2003)
- Order MCMC generates a sample of node orders from which in a second step DAGs can be sampled
Acceptance probability (Metropolis-Hastings): [formula not transcribed]
G_1, G_2, G_3, ..., G_T: DAG sample
37Equivalence classes of BNs
[Figure: DAGs over the three nodes A, B, C and their CPDAG representations]
Equivalent DAGs: P(A,B) ≠ P(A)P(B), P(A,B|C) = P(A|C)P(B|C)
v-structure: P(A,B) = P(A)P(B), P(A,B|C) ≠ P(A|C)P(B|C)
completed partially directed graphs (CPDAGs)
38CPDAG representations
[Figure: DAGs and their CPDAGs]
Utilise the CPDAG sample for estimating the posterior probability of edge relation features:
P(A→B | data) ≈ (1/T) Σ_{i=1..T} I(G_i)
where I(G_i) is 1 if the CPDAG G_i contains the directed edge A→B, and 0 otherwise
39CPDAG representations
[Figure: DAGs and their CPDAGs; labels: interpretation, superposition]
Utilise the DAG (CPDAG) sample for estimating the posterior probability of edge relation features:
P(A→B | data) ≈ (1/T) Σ_{i=1..T} I(G_i)
where I(G_i) is 1 if the CPDAG of G_i contains the directed edge A→B, and 0 otherwise
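The estimator is just the fraction of sampled graphs that contain the edge. A minimal sketch, assuming each sampled graph has already been converted to its CPDAG and is stored as a set of directed edges (undirected CPDAG edges are left out for simplicity); the sample below is made up.

```python
def edge_posterior(sample, edge):
    """Fraction of sampled graphs that contain the given directed edge."""
    return sum(1 for graph in sample if edge in graph) / len(sample)

# Hypothetical sample of T = 4 graphs, each a set of directed edges (parent, child).
sample = [
    {("A", "B"), ("B", "C")},
    {("A", "B")},
    {("B", "A"), ("B", "C")},
    {("A", "B"), ("C", "B")},
]

p_ab = edge_posterior(sample, ("A", "B"))
p_ba = edge_posterior(sample, ("B", "A"))
print(p_ab, p_ba)
```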
40Interventional data
A and B are correlated
Inhibition of A:
- if A → B: down-regulation of B
- if A ← B: no effect on B
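The argument can be reproduced with a toy simulation. Assumptions: a linear Gaussian relation between A and B, and an ideal intervention that clamps A to zero; the function `simulate` and all numbers are hypothetical.

```python
import numpy as np

def simulate(a_causes_b, inhibit_a, n=2000, seed=0):
    """Simulate A and B under either causal direction, optionally clamping A to 0."""
    rng = np.random.default_rng(seed)
    noise = 0.5 * rng.normal(size=n)
    if a_causes_b:
        a = np.zeros(n) if inhibit_a else rng.normal(loc=2.0, size=n)
        b = a + noise
    else:
        b = rng.normal(loc=2.0, size=n)
        a = np.zeros(n) if inhibit_a else b + noise
    return a, b

mean_b = {}
for a_causes_b in (True, False):
    for inhibit_a in (False, True):
        _, b = simulate(a_causes_b, inhibit_a)
        mean_b[(a_causes_b, inhibit_a)] = b.mean()

print("A -> B: mean of B without / with inhibition:",
      round(mean_b[(True, False)], 2), round(mean_b[(True, True)], 2))
print("A <- B: mean of B without / with inhibition:",
      round(mean_b[(False, False)], 2), round(mean_b[(False, True)], 2))
```

Observational data alone show the same correlation in both cases; only the response of B to the inhibition of A separates the two edge directions.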
41Evaluation of Performance
- Relevance networks and Graphical Gaussian models extract undirected edges (scores: (partial) correlations)
- Bayesian networks extract undirected as well as directed edges (scores: posterior probabilities of edges)
- Undirected edges can be interpreted as superposition of two directed edges with opposite direction.
- How to cross-compare the learning performances when the true regulatory network is known?
- Distinguish between DGE (directed graph evaluation) and UGE (undirected graph evaluation)
42Probabilistic inference - DGE
true regulatory network
data → edge scores (low / high)
Thresholding → concrete network predictions
TP = 1/2, FP = 0/4
TP = 2/2, FP = 1/4
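The thresholding step can be sketched as follows. The three-node network and the edge scores are hypothetical, chosen so that two thresholds reproduce the counts on the slide (2 true edges, 4 possible false edges).

```python
def count_tp_fp(scores, true_edges, threshold):
    """Count true and false positives among directed edges scoring above threshold."""
    predicted = {edge for edge, score in scores.items() if score > threshold}
    tp = len(predicted & true_edges)
    fp = len(predicted - true_edges)
    return tp, fp

# Hypothetical true network A -> B -> C and directed edge scores.
true_edges = {("A", "B"), ("B", "C")}
scores = {
    ("A", "B"): 0.9, ("B", "A"): 0.1,
    ("B", "C"): 0.5, ("C", "B"): 0.4,
    ("A", "C"): 0.45, ("C", "A"): 0.05,
}

strict = count_tp_fp(scores, true_edges, threshold=0.7)    # TP = 1/2, FP = 0/4
lenient = count_tp_fp(scores, true_edges, threshold=0.42)  # TP = 2/2, FP = 1/4
print(strict, lenient)
```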
43Probabilistic inference - UGE
skeleton of true regulatory network
undirected edge scores: add up the scores of directed edges with opposite direction
data
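Adding up the scores of the two opposite directed edges can be sketched with a dictionary keyed by unordered node pairs. The directed scores below are hypothetical.

```python
def undirected_scores(directed_scores):
    """Sum the scores of X->Y and Y->X into one score per undirected edge."""
    result = {}
    for (u, v), score in directed_scores.items():
        key = frozenset((u, v))
        result[key] = result.get(key, 0.0) + score
    return result

# Hypothetical directed edge scores (e.g. posterior probabilities).
directed = {
    ("A", "B"): 0.9, ("B", "A"): 0.1,
    ("B", "C"): 0.5, ("C", "B"): 0.4,
    ("A", "C"): 0.45, ("C", "A"): 0.05,
}

skeleton_scores = undirected_scores(directed)
for pair, score in sorted(skeleton_scores.items(), key=lambda kv: -kv[1]):
    print(sorted(pair), round(score, 2))
```

The resulting scores can then be thresholded against the skeleton of the true network in the same way as the directed scores.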
46Probabilistic inference
skeleton of true regulatory network
data → undirected edge scores (high / low)
Thresholding → concrete network (skeleton) predictions
TP = 1/2, FP = 0/1
TP = 2/2, FP = 1/1
47Evaluation 1: AUC scores
Area under Receiver Operator Characteristic (ROC) curve
[Figure: ROC curves, sensitivity plotted against inverse specificity, for AUC = 0.5, AUC = 1 and 0.5 < AUC < 1]
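One way to compute the AUC without drawing the curve is the rank formulation: the AUC equals the probability that a randomly chosen true edge receives a higher score than a randomly chosen non-edge, with ties counted as one half. The sketch below uses that formulation on made-up scores; the function name `auc` is hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of true and false edges."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical edge scores; label 1 marks an edge of the true network.
perfect = auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
mixed = auc([0.9, 0.3, 0.6, 0.1], [1, 1, 0, 0])
uninformative = auc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
print(perfect, mixed, uninformative)
```

A perfect ranking gives 1, identical scores for all edges give 0.5, and partially correct rankings fall in between.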
48Evaluation 2: TP scores
We set the threshold such that we obtained 5 spurious edges (5 FPs) and counted the corresponding number of true edges (TP count).
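The TP count at a fixed number of false positives can be sketched by walking down the ranked edge list. One assumption is made explicit here: true edges are counted until the sixth false edge would be included, and ties between scores are ignored; the slides do not specify these details.

```python
def tp_at_fp(scores, labels, max_fp=5):
    """Number of true edges ranked above the (max_fp + 1)-th false edge."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    for _, is_true in ranked:
        if is_true:
            tp += 1
        else:
            fp += 1
            if fp > max_fp:
                break
    return tp

# Hypothetical ranked edge scores; label 1 marks an edge of the true network.
scores = [0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45]
labels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1]

tp_5 = tp_at_fp(scores, labels, max_fp=5)
tp_1 = tp_at_fp(scores, labels, max_fp=1)
print(tp_5, tp_1)
```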
50Evaluation 2: TP scores
[Figure: TP counts at 5 FP counts for BN, GGM and RN]
52Evaluation
- On real experimental cytometric data from the RAF signalling pathway, for which a gold standard network is known
- On synthetic data simulated from this gold-standard network topology
54Evaluation: Raf signalling pathway
- Cellular signalling cascade which consists of 11 phosphorylated proteins and phospholipids in human immune system cells
- Deregulation → carcinogenesis
- Extensively studied in the literature → gold standard network
55Gold standard RAF pathway according to Sachs et al. (2005)
56Raf pathway
11 nodes (proteins) and 20 directed edges
57Data
- Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins
- 5400 cells have been measured under 9 different cellular conditions (cues)
- We decided to downsample our test data sets to 100 instances - indicative of microarray experiments
59Two types of experiments
61Evaluation
- On real experimental data, using the gold standard network from the literature
- On synthetic data simulated from this gold-standard network
62Raf pathway
63Gaussian simulated data
64Netbuilder simulated data
DNA + TFs ⇌ DNA·TF → mRNA → protein
65Netbuilder simulated data
DNA + TF_A ⇌ DNA·TF_A (constant K_A)
DNA + TF_B ⇌ DNA·TF_B (constant K_B)
Steady-state approximation
66Netbuilder simulated data
- Generating data using the Netbuilder tool
- The main idea of Netbuilder: instead of solving the steady-state approximation to the ODEs explicitly, we approximate them with a qualitatively equivalent combination of multiplications and sums of sigmoidal transfer functions.
68Synthetic data, observations
69Synthetic data, interventions
70Cytometry data, observations
71Cytometry data, interventions
74How can we explain the difference between synthetic and real data?
75Raf pathway
77Disputed structure of the gold-standard network
Dougherty et al., Regulation of Raf-1 by Direct Feedback Phosphorylation. Molecular Cell, Vol. 17, 2005
78Complications with real data
- Interventions are not ideal owing to negative feedback loops.
- Putative negative feedback loops: Can we trust our gold-standard network?
79Stabilisation through negative feedback loops
inhibition
80Conclusions 1
- BNs and GGMs outperform RNs, most notably on Gaussian data.
- No significant difference between BNs and GGMs on observational data.
- For interventional data, BNs clearly outperform GGMs and RNs, especially when taking the edge direction (DGE score) rather than just the skeleton (UGE score) into account.
81Conclusions 2
- Performance on synthetic data better than on real data.
- Real data more complex
- Real interventions are not ideal
- Errors in the gold-standard network
82Additional analysis I Raf pathway
85CPDAGs of networks
ORIGINAL: 3/20 directed edges
MODIFIED: 13/16 directed edges
86[Figure: results for the ORIGINAL and MODIFIED networks on Gaussian and Netbuilder data]
87Some additional analysis II
88Thank you
89References
- Butte, A.S. and Kohane, I.S. (2003) Relevance networks: A first step toward finding genetic regulatory networks within microarray data. In Parmigiani, G., Garrett, E.S., Irizarry, R.A. and Zeger, S.L., editors, The Analysis of Gene Expression Data, pages 428-446. Springer.
- Dougherty, M.K. et al. (2005) Regulation of Raf-1 by Direct Feedback Phosphorylation. Molecular Cell, 17:215-227.
- Friedman, N. and Koller, D. (2003) Being Bayesian about network structure. Machine Learning, 50:95-126.
- Madigan, D. and York, J. (1995) Bayesian graphical models for discrete data. International Statistical Review, 63:215-232.
- Sachs, K., Perez, O., Pe'er, D., Lauffenburger, D.A., Nolan, G.P. (2005) Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308:523-529.