Gene Expression Analysis Using Bayesian Networks

Slides: 38

Provided by: iroUmo

Transcript and Presenter's Notes

1
Gene Expression Analysis Using Bayesian Networks

Éric Paquet
LBIT
Université de Montréal

2
Biological basis
RNA Polymerase (copies DNA into RNA)
DNA (storage of genetic information)
Ribosome (translates genetic information into proteins)
mRNA (storage and transport of genetic information)
Proteins (expression of genetic information)
-PDB file 1L3A, Transcriptional Regulator Pbf-2
3
Biological basis

How do proteins get regulated?
E. coli lactose operon example
Under normal conditions, E. coli uses glucose to get energy, but how does it react if there is no more glucose, only lactose?

4
Biological basis
[Figure: E. coli environment, showing RNA polymerase and the protein associated with gene lacI]
Polymerase action is blocked because of a DNA lock
5
Biological basis
[Figure: E. coli environment, showing RNA polymerase and lactose]
Lactose recruits the protein associated with gene lacI, unlocking the DNA, which is then accessible to the polymerase
6
Biological basis
Lactose decomposer (β-galactosidase)
Lactose getter (permease)
7
Biological basis
[Figure: E. coli environment, showing RNA polymerase and lactose]
In the absence of glucose, a "polymerase magnet" binds to the DNA to accelerate expression of the genes that help lactose decomposition
8
Biological basis
Lactose decomposer (β-galactosidase)
Lactose getter (permease)
activate
9
Why?

Get insights about cellular processes
Help understand diseases
Find drug targets

10
How?

Using gene expression data and tools for learning Bayesian networks

[Figure: experiments, mRNA expression data, and tools for learning Bayesian networks]
-Spellman et al. (1998) Mol Biol Cell 9:3273-97
11
What is gene expression data?

Data showing the concentration of a specific mRNA at a given time in the life of the cell.

[Figure: experiments producing an mRNA expression matrix]
Each real value comes from one spot and tells whether the concentration of a specific mRNA is higher (+) or lower (-) than the normal value
Each column is the result of one image
-Spellman et al. (1998) Mol Biol Cell 9:3273-97
12
What is a Bayesian network?

A graphical representation of a joint distribution over a set of random variables.

P(A,B,C,D,E) = P(A) P(B) P(C|A) P(D|A,B) P(E|D)
Nodes represent gene expression while edges encode the interactions (e.g. inhibition, activation)
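
The factorization on this slide can be checked numerically. The sketch below is only an illustration: the variables are assumed to be binary and every probability in the tables is made up, not taken from the presentation.

```python
from itertools import product

# Illustrative (made-up) conditional probability tables for binary variables.
P_A = 0.3                                   # P(A=1)
P_B = 0.6                                   # P(B=1)
P_C_GIVEN_A = {0: 0.1, 1: 0.8}              # P(C=1 | A)
P_D_GIVEN_AB = {(0, 0): 0.05, (0, 1): 0.4,
                (1, 0): 0.5, (1, 1): 0.9}   # P(D=1 | A, B)
P_E_GIVEN_D = {0: 0.2, 1: 0.7}              # P(E=1 | D)


def bernoulli(p_one, value):
    """Probability of `value` (0 or 1) when P(value = 1) is p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d, e):
    """P(A,B,C,D,E) = P(A) P(B) P(C|A) P(D|A,B) P(E|D)."""
    return (bernoulli(P_A, a) * bernoulli(P_B, b)
            * bernoulli(P_C_GIVEN_A[a], c)
            * bernoulli(P_D_GIVEN_AB[(a, b)], d)
            * bernoulli(P_E_GIVEN_D[d], e))


# The factorized joint must sum to 1 over all 2**5 assignments.
total = sum(joint(*values) for values in product((0, 1), repeat=5))
```

Five small tables are enough to define the full joint over 32 assignments, which is the saving the graph structure buys.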
13
A little problem with Bayesian networks

A Bayesian network must be a DAG (Directed Acyclic Graph), but there are many examples of regulatory networks with directed cycles.

-Husmeier D., Bioinformatics, Vol. 19 no. 17 2003, pages 2271-2282
14
How can we deal with that?

Using DBNs (Dynamic Bayesian Networks) and sequential gene expression data

[Figure: the network at time t and at time t+1]
We unfold the network in time
DBN = BN with constraints on parent and child nodes
-Friedman, Murphy, Russell, Learning the Structure of Dynamic Probabilistic Networks
15
What are we searching for?

The Bayesian network that is most probable given the data D (gene expression)
We find this BN as follows:
BN* = argmax_BN P(BN | D)

where P(BN | D) = P(D | BN) P(BN) / P(D), with
P(D | BN): marginal likelihood
P(BN): prior on network structure
P(D): data probability
Naïve approach to the problem: try all possible DAGs and keep the best one!
16
It is impossible to try all possible DAGs because

The number of DAGs increases super-exponentially with the number of nodes
n = 3 → 25 DAGs
n = 4 → 543 DAGs
n = 5 → 29281 DAGs
n = 6 → 3781503 DAGs
n = 7 → 1138779265 DAGs
n = 8 → 783702329343 DAGs

We are interested in problems with around 60 nodes.
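
The counts listed above can be reproduced with Robinson's recurrence for the number of labeled DAGs. The slide does not say how its numbers were obtained, so the sketch below is just one way to check them; the function name is our own.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n nodes, via Robinson's recurrence:
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k(n-k)) * a(n-k), a(0) = 1.
    """
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
               for k in range(1, n + 1))


for n in range(3, 9):
    print(n, count_dags(n))
```

Python integers are arbitrary precision, so `count_dags(60)` also runs and gives a sense of why exhaustive search is out of reach at that size.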
17
Learning Bayesian Networks from data?

Choose a method for searching the space of networks and a representation for the conditional distributions

Network space search methods (basically add, remove and reverse edges)
Greedy hill-climbing
Beam search
Stochastic hill-climbing
Simulated annealing
MCMC simulation

Conditional distribution representation
Linear Gaussian
Multinomial, binomial

[Figure: network with nodes A, B and C; to be learned: P(a) = ?, P(b) = ?, P(c|a,b) = ?]
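
All of the search methods listed on this slide move through network space by adding, removing or reversing one edge at a time. A minimal sketch of that neighbourhood follows, assuming a network is stored as a set of (parent, child) pairs; the helper names are hypothetical and not from the presentation.

```python
from itertools import permutations


def is_acyclic(nodes, edges):
    """Return True if the directed graph has no cycle (Kahn's algorithm)."""
    indegree = {v: 0 for v in nodes}
    for _, child in edges:
        indegree[child] += 1
    frontier = [v for v in nodes if indegree[v] == 0]
    visited = 0
    while frontier:
        v = frontier.pop()
        visited += 1
        for parent, child in edges:
            if parent == v:
                indegree[child] -= 1
                if indegree[child] == 0:
                    frontier.append(child)
    return visited == len(nodes)


def neighbours(nodes, edges):
    """All DAGs reachable by adding, removing, or reversing a single edge."""
    edges = frozenset(edges)
    result = set()
    for x, y in permutations(nodes, 2):
        if (x, y) in edges:
            result.add(edges - {(x, y)})                  # remove x -> y
            result.add((edges - {(x, y)}) | {(y, x)})     # reverse x -> y
        elif (y, x) not in edges:
            result.add(edges | {(x, y)})                  # add x -> y
    # Moves that would create a directed cycle are discarded.
    return [g for g in result if is_acyclic(nodes, g)]
```

For the chain A → B → C, six single-edge moves exist but adding C → A would close a cycle, so `neighbours(["A", "B", "C"], {("A", "B"), ("B", "C")})` keeps five networks.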
19
We use three gene expression levels
Sort the values:
-1.06 -0.12 0.18 0.21 1.16 1.19
Split the data into 3 equal buckets
Discretized data:
0 0 2 2 1 1
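
A minimal sketch of this equal-size bucketing is given below. It assumes that level 0 is the lowest third and level 2 the highest, and it returns the levels in the order of the input values; the transcript does not show the original sample order, so the output need not match the discretized row on the slide.

```python
def discretize(values, n_levels=3):
    """Map each value to a level 0..n_levels-1 using equal-size rank buckets."""
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    levels = [0] * len(values)
    for rank, index in enumerate(order):
        # Ranks 0..len-1 are cut into n_levels consecutive buckets.
        levels[index] = rank * n_levels // len(values)
    return levels


sorted_values = [-1.06, -0.12, 0.18, 0.21, 1.16, 1.19]
levels = discretize(sorted_values)
```

Because the bucketing depends only on ranks, it is insensitive to the scale of the measurements and to outliers.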
20
Return on
21
Insight on each term

P(BN) → prior on the network
In our research, we always use a prior equal to 1
We could use it to incorporate knowledge
E.g. we know that an edge is present: if the edge is in the BN, P(BN) = 1, else P(BN) = 0
Efforts are made to reduce the search space by using knowledge, e.g. limiting the number of parents or children

22
Insight on each term

P(D | BN) → marginal likelihood
Easy to calculate using a multinomial distribution with a Dirichlet prior

-Heckerman, A Tutorial on Learning With Bayesian Networks, and Neapolitan, Learning Bayesian Networks
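
With multinomial conditional distributions and Dirichlet priors, the marginal likelihood has a closed form built from gamma functions. The sketch below assumes a symmetric Dirichlet prior with one pseudo-count per level (`alpha=1.0`), which is our choice for illustration; the slide does not state which hyperparameters were used, and the data layout is hypothetical.

```python
from collections import Counter
from math import lgamma


def log_marginal_likelihood(data, parents, n_levels=3, alpha=1.0):
    """Log P(D | BN) for discrete data under a Dirichlet(alpha, ..., alpha) prior.

    data:    list of dicts mapping node name -> level in 0..n_levels-1
    parents: dict mapping node name -> tuple of parent names
    """
    total = 0.0
    for node, node_parents in parents.items():
        # counts[(parent configuration, child level)] is the usual N_ijk.
        counts = Counter((tuple(row[p] for p in node_parents), row[node])
                         for row in data)
        # Parent configurations never seen in the data contribute log(1) = 0.
        configs = {config for config, _ in counts}
        for config in configs:
            n_ij = sum(counts[(config, k)] for k in range(n_levels))
            total += lgamma(n_levels * alpha) - lgamma(n_levels * alpha + n_ij)
            for k in range(n_levels):
                total += lgamma(alpha + counts[(config, k)]) - lgamma(alpha)
    return total


# Two perfectly correlated binary genes: the structure X -> Y should score higher.
rows = [{"X": 0, "Y": 0}] * 4 + [{"X": 1, "Y": 1}] * 4
with_edge = log_marginal_likelihood(rows, {"X": (), "Y": ("X",)}, n_levels=2)
without_edge = log_marginal_likelihood(rows, {"X": (), "Y": ()}, n_levels=2)
```

Working in log space avoids underflow, since the likelihood of even a modest dataset is a very small number.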
23
MCMC (Markov Chain Monte Carlo) simulation

Markov Chain part
Zoom on a node of the chain

[Figure: a network over nodes A, B and C with its six candidate neighbouring networks; five are reached with probability P(BNnew) = 1/5 and one with probability 0]
24
MCMC (Markov Chain Monte Carlo) simulation

Monte Carlo part
Choose the next BN with probability P(BNnew)
Accept the new BN according to the following Metropolis-Hastings acceptance criterion

25
Monte Carlo part example

Choose a random path, each path having a P(BNnew) of 1/5
Choose another random number. If it is smaller than the Metropolis-Hastings criterion, accept BNnew, else return to BNold

[Figure: the same neighbourhood of networks over nodes A, B and C; five paths with P(BNnew) = 1/5 and one with probability 0]
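
The transcript does not preserve the acceptance formula itself. The sketch below uses the standard Metropolis-Hastings form for a proposal that picks one neighbour uniformly, min(1, [score(new) / score(old)] × [q(old | new) / q(new | old)]), with score = P(D | BN) P(BN); whether the presentation used exactly this form is an assumption, and the function names are our own.

```python
import math
import random


def acceptance_probability(log_score_old, log_score_new,
                           n_neighbours_old, n_neighbours_new):
    """Metropolis-Hastings acceptance probability for a uniform neighbour proposal.

    log_score_* is log(P(D | BN) P(BN)). The proposal picks one of the current
    network's neighbours uniformly, so q(new | old) = 1 / n_neighbours_old and
    q(old | new) = 1 / n_neighbours_new.
    """
    log_ratio = (log_score_new - log_score_old
                 + math.log(n_neighbours_old) - math.log(n_neighbours_new))
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def mh_step(bn_old, bn_new, log_score_old, log_score_new,
            n_neighbours_old, n_neighbours_new, rng=random):
    """Return the network to keep after one accept/reject decision."""
    p_accept = acceptance_probability(log_score_old, log_score_new,
                                      n_neighbours_old, n_neighbours_new)
    # Draw "another random number" and compare it with the criterion.
    return bn_new if rng.random() < p_accept else bn_old
```

When both networks have the same number of neighbours, the proposal terms cancel and a better-scoring network is always accepted, while a worse one is accepted with probability equal to the score ratio.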
26
MCMC (Markov Chain Monte Carlo) simulation recap

Choose a starting BN at random
Burn-in phase (generate 5N BNs from MCMC without storing them)
Storing phase (get 100N BN structures from MCMC)

[Figure: log(P(D | BN) P(BN)) plotted against iteration, with the burn-in phase and the storing phase marked]
27
Why 100N BNs and not only one?

Because we don't have enough data and there are many high-scoring networks
Instead, we associate a confidence with each edge, e.g. how many times in the sample do we find an edge going from A to B?
We could fix a threshold on the confidence and retrieve a global network built from the edges whose confidence is over the threshold
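
A minimal sketch of this edge-confidence idea follows, assuming each sampled network is stored as a set of (parent, child) pairs. The function names and the four toy samples are illustrative only.

```python
from collections import Counter


def edge_confidence(sampled_networks):
    """Fraction of sampled networks that contain each directed edge."""
    counts = Counter(edge for network in sampled_networks for edge in set(network))
    return {edge: n / len(sampled_networks) for edge, n in counts.items()}


def consensus_network(confidence, threshold):
    """Edges whose confidence is strictly above the threshold."""
    return {edge for edge, value in confidence.items() if value > threshold}


# Four toy samples standing in for the stored MCMC networks.
samples = [
    {("A", "B"), ("B", "C")},
    {("A", "B")},
    {("A", "B"), ("C", "B")},
    {("B", "A")},
]
confidence = edge_confidence(samples)
network = consensus_network(confidence, threshold=0.5)
```

Note that a network assembled edge by edge this way is not guaranteed to be acyclic, since each edge is judged independently of the others.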

28
What we are working on

Mixing sequential and non-sequential data to retrieve interesting relations between genes
How?
Using DBN and MCMC for sequential data; BN and MCMC for non-sequential data

[Figure: 100N networks from BN and 100N networks from DBN feed an information tuner, which produces the learned network]
29
How to test the approach

Problem: there is no way to test it on real data because there is no completely known network
Solution: work on a realistic simulation where we know the network structure
Example

[Figure: example network with nodes 0 to 13, from which data are simulated]
-Hartemink A., Using Bayesian Network Inference Algorithms to Recover Molecular Genetic Regulatory Networks
30
How to test the approach
[Figure: data are simulated from the example network with nodes 0 to 13; the sequential data go to DBN MCMC and the non-sequential data go to BN MCMC; both feed the info tuner]
Compare using ROC curves
-Hartemink A., Using Bayesian Network Inference Algorithms to Recover Molecular Genetic Regulatory Networks
31
Test description

Generate 60 sequential data points
Generate 120 non-sequential data points (realistic proportion)
Run DBN MCMC on the sequential data; keep 100N sampled networks
Run BN MCMC on the non-sequential data; keep 100N sampled networks
Test performance using these weights on the samples:
0 BN, 1 DBN
0.05 BN, 0.95 DBN
0.95 BN, 0.05 DBN
1 BN, 0 DBN
The metric used is the area under the ROC curve. A perfect learner gets 1.0, a random one gets 0.5 and the worst one gets 0.
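
The area under the ROC curve can be computed directly from edge confidences and the known edges of the simulated network. The sketch below uses the rank (Mann-Whitney) formulation. The `mix` helper assumes the BN and DBN confidences are combined linearly with the listed weights; that is our guess at how the weighting works, not something the slide states.

```python
def roc_auc(confidence, true_edges, candidate_edges):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    Equals the probability that a randomly chosen true edge receives a higher
    confidence than a randomly chosen absent edge, counting ties as 1/2.
    Edges missing from `confidence` are scored 0.
    """
    positives = [confidence.get(e, 0.0) for e in candidate_edges if e in true_edges]
    negatives = [confidence.get(e, 0.0) for e in candidate_edges if e not in true_edges]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def mix(conf_bn, conf_dbn, weight_bn):
    """Assumed linear combination of BN and DBN edge confidences."""
    edges = set(conf_bn) | set(conf_dbn)
    return {e: weight_bn * conf_bn.get(e, 0.0)
               + (1.0 - weight_bn) * conf_dbn.get(e, 0.0)
            for e in edges}
```

A usage pattern would be `roc_auc(mix(conf_bn, conf_dbn, 0.05), true_edges, candidate_edges)` for each of the four weight settings, where `candidate_edges` lists every ordered pair of distinct nodes.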