Title: Gene Expression Analysis Using Bayesian Networks
1Gene Expression Analysis Using Bayesian Networks
- Éric Paquet
- LBIT
- Université de Montréal
2Biological basis
RNA polymerase (copies DNA into RNA)
DNA (storage of genetic information)
Ribosome (translates genetic information into proteins)
mRNA (storage and transport of genetic information)
Proteins (expression of genetic information)
-PDB file 1L3A, Transcriptional Regulator Pbf-2
3Biological basis
- How do proteins get regulated?
- Example: the E. coli lactose operon
- Normally, E. coli uses glucose for energy. How does it react when there is no more glucose, only lactose?
4Biological basis
[Figure: E. coli environment; RNA polymerase on the DNA; protein associated with gene lacI]
The protein associated with gene lacI locks the DNA and blocks the action of the polymerase.
5Biological basis
[Figure: E. coli environment containing lactose; RNA polymerase on the DNA]
Lactose recruits the protein associated with gene lacI, unlocking the DNA, which is then accessible to the polymerase.
6Biological basis
Lactose decomposer (β-galactosidase)
Lactose importer (permease)
7Biological basis
[Figure: E. coli environment containing lactose; RNA polymerase on the DNA]
In the absence of glucose, a "polymerase magnet" binds to the DNA and speeds up production of the gene products that help decompose lactose.
8Biological basis
Lactose decomposer (β-galactosidase)
Lactose importer (permease)
Both are activated.
9Why?
- Get insights about cellular processes
- Help understand diseases
- Find drug targets
10How?
- Using gene expression data and tools for learning
Bayesian networks
[Figure: gene expression data (axes labelled Experiments and mRNA) fed into tools for learning Bayesian networks]
-Spellman et al. (1998) Mol Biol Cell 9:3273-97
11What is gene expression data?
- Data showing the concentration of a specific mRNA at a given time in the life of the cell.
[Figure: expression matrix; axes labelled Experiments and mRNA]
Each real value comes from one spot and tells whether the concentration of a specific mRNA is higher (+) or lower (-) than the normal value.
Each column is the result of one image.
-Spellman et al. (1998) Mol Biol Cell 9:3273-97
12What are Bayesian networks?
- Graphical representation of a joint distribution over a set of random variables.
P(A,B,C,D,E) = P(A) P(B) P(C|A) P(D|A,B) P(E|D)
Nodes represent gene expression while edges encode the interactions (e.g. inhibition, activation)
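The factorization on this slide can be checked numerically. The sketch below assumes five binary variables with the structure A → C, (A, B) → D, D → E; every probability value in it is made up for illustration and does not come from the talk.

```python
from itertools import product

# Hypothetical conditional probability tables (all numbers are illustrative).
p_a = {0: 0.7, 1: 0.3}                      # P(A)
p_b = {0: 0.4, 1: 0.6}                      # P(B)
p_c1_given_a = {0: 0.2, 1: 0.9}             # P(C=1 | A)
p_d1_given_ab = {(0, 0): 0.1, (0, 1): 0.5,
                 (1, 0): 0.4, (1, 1): 0.8}  # P(D=1 | A, B)
p_e1_given_d = {0: 0.3, 1: 0.7}             # P(E=1 | D)


def bernoulli(p_one, value):
    """Probability of `value` (0 or 1) when P(value = 1) is `p_one`."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(A) P(B) P(C|A) P(D|A,B) P(E|D)."""
    return (p_a[a] * p_b[b]
            * bernoulli(p_c1_given_a[a], c)
            * bernoulli(p_d1_given_ab[(a, b)], d)
            * bernoulli(p_e1_given_d[d], e))


# Summing the product of local terms over all 32 assignments gives 1.
total = sum(joint(*values) for values in product((0, 1), repeat=5))
```

The point of the factorization is that five small tables replace one table with 32 entries.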
13A small problem with Bayesian networks
- A Bayesian network must be a DAG (Directed Acyclic Graph), but there are many examples of regulatory networks that contain directed cycles.
-Husmeier D., Bioinformatics, Vol. 19 no. 17 2003, pages 2271-2282
14How can we deal with that?
- Using DBNs (Dynamic Bayesian Networks) and sequential gene expression data
[Figure: the network drawn at time slices t and t+1]
We unfold the network in time.
DBN = BN with constraints on parent and child nodes
-Friedman, Murphy, Russell, Learning the Structure of Dynamic Probabilistic Networks
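A minimal sketch of why unfolding in time removes the cycle problem, assuming the constraint is that every edge goes from a node at time t to a node at time t+1. The function name `unfold` and the two-gene loop are hypothetical.

```python
def unfold(edges, n_slices):
    """Copy each regulatory edge u -> v as (u, t) -> (v, t + 1) for every slice t."""
    return {((u, t), (v, t + 1))
            for (u, v) in edges
            for t in range(n_slices - 1)}


# A feedback loop A -> B -> A is a directed cycle in a static network...
cyclic_edges = {("A", "B"), ("B", "A")}

# ...but once unfolded, every edge points forward in time, so no cycle can close.
unfolded = unfold(cyclic_edges, n_slices=3)
all_forward = all(child_t == parent_t + 1
                  for ((_, parent_t), (_, child_t)) in unfolded)
```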
15What are we searching for?
- A Bayesian network that is most probable given the data D (gene expression)
- We find this BN as follows:
- BN* = argmax_BN P(BN|D)
Where P(BN|D) = P(D|BN) P(BN) / P(D), with
P(D|BN): marginal likelihood
P(BN): prior on network structure
P(D): data probability
Naïve approach to the problem: try all possible DAGs and keep the best one!
16It is impossible to try all possible DAGs because
- The number of DAGs increases super-exponentially with the number of nodes
- n = 3: 25 DAGs
- n = 4: 543 DAGs
- n = 5: 29281 DAGs
- n = 6: 3781503 DAGs
- n = 7: 1138779265 DAGs
- n = 8: 783702329343 DAGs
We are interested in problems with around 60 nodes.
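The counts above can be reproduced with Robinson's recurrence for the number of labelled DAGs. This is a sketch; the function name `count_dags` is an assumption, and `math.comb` needs Python 3.8 or later.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labelled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming edge.
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))


for n in range(3, 9):
    print(n, count_dags(n))
```

Python integers are arbitrary precision, so `count_dags(60)` can be evaluated directly to see how far out of reach exhaustive search is.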
17Learning Bayesian Networks from data?
- Choose a method for searching the space of networks and a representation for the conditional distributions
- Network space search methods (basically add, remove and reverse edges)
  - Greedy hill-climbing
  - Beam search
  - Stochastic hill-climbing
  - Simulated annealing
  - MCMC simulation
- Conditional distribution representation
  - Linear Gaussian
  - Multinomial, binomial
[Figure: network with A and B as parents of C; the unknowns are P(a), P(b) and P(c|a,b)]
19We use three gene expression levels
- Sort the values: -1.06 -0.12 0.18 0.21 1.16 1.19
- Split the data into 3 equal buckets
- Discretized data: 0 0 2 2 1 1
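A minimal sketch of this three-level discretization, assuming equal-frequency buckets assigned by rank. The function name and the original (unsorted) order of the measurements are assumptions: the slide only shows the sorted values, and the order below was chosen so that the output matches the slide's discretized row.

```python
def discretize_three_levels(values):
    """Map each value to 0 (low), 1 (middle) or 2 (high) using equal-size rank buckets."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices from smallest to largest
    levels = [0] * n
    for rank, index in enumerate(order):
        levels[index] = rank * 3 // n                   # the first third gets 0, and so on
    return levels


# Hypothetical original order of the six values shown on the slide.
expression = [-1.06, -0.12, 1.16, 1.19, 0.18, 0.21]
print(discretize_three_levels(expression))  # [0, 0, 2, 2, 1, 1]
```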
20Return on
21Insight on each term
- P(BN): prior on the network
- In our research, we always use a prior equal to 1
- We could incorporate knowledge using it
- E.g. we know that an edge is present: if the edge is in the BN, P(BN) = 1, else P(BN) = 0
- Efforts are made to reduce the search space using knowledge, e.g. by limiting the number of parents or children
22Insight on each term
- P(D|BN): marginal likelihood
- Easy to calculate using a multinomial distribution with a Dirichlet prior
-Heckerman, A Tutorial on Learning With Bayesian Networks; Neapolitan, Learning Bayesian Networks
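With multinomial conditional distributions and Dirichlet priors, the marginal likelihood has a closed form built from Gamma functions, and it decomposes into one term per node. The sketch below assumes a uniform Dirichlet prior with every pseudo-count equal to `alpha`; that choice, the function names and the data layout (one list of discrete levels per sample) are assumptions and not necessarily what was used in this work.

```python
from collections import Counter
from math import lgamma, log


def node_log_marginal(data, node, parents, n_states=3, alpha=1.0):
    """log P(D | BN) contribution of one node given its parent set."""
    counts = Counter()
    for row in data:
        parent_config = tuple(row[p] for p in parents)
        counts[(parent_config, row[node])] += 1
    score = 0.0
    # Parent configurations that never occur contribute a factor of 1, so skip them.
    for config in {config for (config, _) in counts}:
        n_config = sum(counts[(config, k)] for k in range(n_states))
        score += lgamma(n_states * alpha) - lgamma(n_states * alpha + n_config)
        for k in range(n_states):
            score += lgamma(alpha + counts[(config, k)]) - lgamma(alpha)
    return score


def network_log_marginal(data, parents_of, n_states=3, alpha=1.0):
    """Sum of the per-node terms; `parents_of` maps a column index to its parent columns."""
    return sum(node_log_marginal(data, node, parents, n_states, alpha)
               for node, parents in parents_of.items())


# Tiny check: one binary variable, no parents, one observation of each state.
# Sequentially this is 1/2 * 1/3 = 1/6.
toy = node_log_marginal([[0], [1]], node=0, parents=[], n_states=2)
```

Working in log space with `lgamma` avoids overflow when the counts get large.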
23MCMC (Markov Chain Monte Carlo) simulation
- Markov Chain part
- Zoom on a node of the chain
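[Figure: a network over nodes A, B and C and its neighbouring networks; each transition is labelled with P(BNnew), which is 1/5 for five of the neighbours and 0 for one]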
24MCMC (Markov Chain Monte Carlo) simulation
- Monte Carlo part
- Choose next BN with probability P(BNnew)
- Accept the new BN according to the Metropolis-Hastings acceptance criterion
25Monte Carlo part example
- Choose a random path. Each path has a P(BNnew) of 1/5
- Choose another random number. If it is smaller than the Metropolis-Hastings criterion, accept BNnew, else return to BNold
[Figure: same neighbourhood diagram as on slide 23]
26MCMC (Markov Chain Monte Carlo) simulation recap
- Choose a starting BN at random
- Burn-in phase (generate 5N BNs from the MCMC without storing them)
- Storing phase (get 100N BN structures from the MCMC)
[Figure: log(P(D|BN) P(BN)) plotted against iteration, with the burn-in phase followed by the storing phase]
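The recap above can be sketched as a generic structure sampler. Several things here are assumptions rather than the talk's implementation: the chain starts from the empty graph instead of a random BN, a neighbour is proposed uniformly among the legal add/remove/reverse moves, the acceptance probability is the standard Metropolis-Hastings form min(1, [score(new) / score(old)] × [number of neighbours of old / number of neighbours of new]) with score = P(D|BN) P(BN), and `toy_log_score` is a made-up stand-in for log(P(D|BN) P(BN)).

```python
import math
import random
from itertools import permutations


def is_acyclic(nodes, edges):
    """Kahn-style check: repeatedly strip nodes that have no incoming edge."""
    remaining, edges = set(nodes), set(edges)
    while remaining:
        sources = remaining - {v for (_, v) in edges}
        if not sources:
            return False  # every remaining node has a parent, so there is a cycle
        remaining -= sources
        edges = {(u, v) for (u, v) in edges if u not in sources}
    return True


def neighbours(nodes, edges):
    """All DAGs reachable by adding, removing or reversing one edge."""
    result = []
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            result.append(edges - {(u, v)})                 # remove
            flipped = (edges - {(u, v)}) | {(v, u)}         # reverse
            if is_acyclic(nodes, flipped):
                result.append(flipped)
        elif (v, u) not in edges:
            added = edges | {(u, v)}                        # add
            if is_acyclic(nodes, added):
                result.append(added)
    return result


def sample_structures(nodes, log_score, n_burn, n_keep, rng):
    current = frozenset()                                   # assumption: start from the empty graph
    current_score = log_score(current)
    samples = []
    for step in range(n_burn + n_keep):
        options = neighbours(nodes, current)
        proposal = rng.choice(options)
        proposal_score = log_score(proposal)
        # Hastings correction for unequal neighbourhood sizes.
        log_ratio = (proposal_score - current_score
                     + math.log(len(options))
                     - math.log(len(neighbours(nodes, proposal))))
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current, current_score = proposal, proposal_score
        if step >= n_burn:                                  # storing phase
            samples.append(current)
    return samples


def toy_log_score(edges):
    """Hypothetical score: favours the edge A -> B and penalises every other edge."""
    return sum(2.0 if edge == ("A", "B") else -1.0 for edge in edges)


nodes = ["A", "B", "C"]
n_unit = 20                                                 # hypothetical value for N
samples = sample_structures(nodes, toy_log_score, n_burn=5 * n_unit,
                            n_keep=100 * n_unit, rng=random.Random(0))
```

In practice `toy_log_score` would be replaced by the log marginal likelihood plus the log prior of the candidate network.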
27Why 100N BNs and not just one?
- Because we don't have enough data and there are many high-scoring networks
- Instead, we associate a confidence with each edge. E.g. how many times in the sample do we find an edge going from A to B?
- We can fix a threshold on the confidence and retrieve a global network built from the edges whose confidence is over the threshold
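A minimal sketch of the edge-confidence idea, assuming each sampled network is stored as a set of (parent, child) pairs. The function names and the four toy samples are hypothetical.

```python
from collections import Counter


def edge_confidence(samples):
    """Fraction of sampled networks that contain each directed edge."""
    counts = Counter(edge for edges in samples for edge in edges)
    return {edge: count / len(samples) for edge, count in counts.items()}


def consensus_network(samples, threshold):
    """Edges whose confidence is over the threshold."""
    return {edge for edge, confidence in edge_confidence(samples).items()
            if confidence > threshold}


# Four hypothetical sampled networks.
toy_samples = [
    {("A", "B")},
    {("A", "B"), ("B", "C")},
    {("A", "B")},
    set(),
]
confidence = edge_confidence(toy_samples)          # A -> B: 0.75, B -> C: 0.25
network = consensus_network(toy_samples, 0.5)      # {("A", "B")}
```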
28What we are working on
- Mixing both sequential and non-sequential data to retrieve interesting relations between genes
- How?
- Using DBN and MCMC for sequential data, and BN and MCMC for non-sequential data
[Figure: 100N networks from BN and 100N networks from DBN go into an information tuner, which outputs the learned network]
29How to test the approach
- Problem: there is no way to test it on real data because there is no completely known network
- Solution: work on realistic simulations where we know the network structure
- Example:
[Figure: example network with nodes 0 to 13, from which data are simulated]
-Hartemink A., Using Bayesian Network Inference Algorithms to Recover Molecular Genetic Regulatory Networks
30How to test the approach
[Figure: data are simulated from the example network (nodes 0 to 13); sequential data go to DBN MCMC and non-sequential data go to BN MCMC; both feed the info tuner; results are compared using ROC curves]
-Hartemink A., Using Bayesian Network Inference Algorithms to Recover Molecular Genetic Regulatory Networks
31Test description
- Generate 60 sequential data points
- Generate 120 non-sequential data points (the proportion found in reality)
- Run DBN MCMC on the sequential data and keep 100N sampled networks
- Run BN MCMC on the non-sequential data and keep 100N sampled networks
- Test performance using weights on the samples
  - 0 BN, 1 DBN
  - 0.05 BN, 0.95 DBN
  - ...
  - 0.95 BN, 0.05 DBN
  - 1 BN, 0 DBN
- The metric used is the area under the ROC curve. A perfect learner gets 1.0, a random one gets 0.5 and the worst one gets 0.
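A sketch of how such a test could be scored. The talk does not say how the information tuner combines the two samples, so the linear mix of edge confidences below is an assumption, as are the function names and the toy numbers. The area under the ROC curve is computed as the probability that a true edge is ranked above a false one, with ties counting as one half.

```python
def mix_confidences(bn_conf, dbn_conf, w_bn):
    """Assumed combination rule: w_bn * BN confidence + (1 - w_bn) * DBN confidence."""
    edges = set(bn_conf) | set(dbn_conf)
    return {edge: w_bn * bn_conf.get(edge, 0.0) + (1.0 - w_bn) * dbn_conf.get(edge, 0.0)
            for edge in edges}


def roc_auc(scores, true_edges, candidate_edges):
    """Area under the ROC curve for ranking candidate edges by score."""
    positives = [scores.get(e, 0.0) for e in candidate_edges if e in true_edges]
    negatives = [scores.get(e, 0.0) for e in candidate_edges if e not in true_edges]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical example with four candidate edges, two of which are real.
candidates = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]
truth = {("A", "B"), ("B", "C")}
scores = {("A", "B"): 0.9, ("B", "C"): 0.4, ("C", "A"): 0.6, ("A", "C"): 0.1}
auc = roc_auc(scores, truth, candidates)  # 3 of the 4 true/false pairs are ordered correctly
```

Sweeping `w_bn` from 0 to 1 in steps of 0.05 and recording `roc_auc` at each step would produce one curve of the kind described in the test.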
32Results
[Figure: area under ROC curve (y-axis, 0 to 1) against the weighting of the samples (x-axis, labelled 0 BN / 1 DBN)]
33Perspective
- Working on more sophisticated ways to mix sequential and non-sequential data
- Working on real cases
  - Yeast cell cycle
  - Arabidopsis thaliana circadian rhythm
- Real data also means missing values
  - Evaluate solutions for missing values (EM, KNNImpute)
34Acknowledgements
François Major
35Why are there missing data?
36ROC Curve
- Receiver Operating Characteristic curve
-http://gim.unmc.edu/dxtests/roc2.htm
37MCMC simulation and number of sampled networks