Gene Expression Analysis Using Bayesian Networks

1
Gene Expression Analysis Using Bayesian Networks
  • Éric Paquet
  • LBIT
  • Université de Montréal

2
Biological basis
RNA Polymerase (Copies DNA into RNA)
DNA (Storage of Genetic Information)
Ribosome (Translates Genetic Information into Proteins)
mRNA (Storage and Transport of Genetic Information)
Proteins (Expression of Genetic Information)
-PDB file 1L3A, Transcriptional Regulator Pbf-2
3
Biological basis
  • How do proteins get regulated?
  • E. coli lactose operon example
  • Under normal conditions, E. coli uses glucose to get
    energy, but how does it react if there is no more
    glucose but only lactose?

4
Biological basis
E. coli environment
RNA Polymerase
...
...
Gene lacI associated protein
Polymerase action is blocked because of a DNA lock
5
Biological basis
E. coli environment
X
RNA Polymerase
...
...
Lactose
Lactose recruits gene lacI associated protein,
unlocking the DNA that is then accessible to the
polymerase
6
Biological basis
Lactose decomposer (β-galactosidase)
Lactose importer (permease)
7
Biological basis
E. coli environment
X
RNA Polymerase
...
...
Lactose
In the absence of glucose, a polymerase magnet binds
to the DNA to accelerate the production of the gene
products that help lactose decomposition
8
Biological basis
Lactose decomposer (β-galactosidase)
Lactose importer (permease)
activate
9
Why?
  • Get insights about cellular processes
  • Help understand diseases
  • Find drug targets

10
How?
  • Using gene expression data and tools for learning
    Bayesian networks

Experiments


Tools for Learning Bayesian networks
mRNA
-Spellman et al. (1998) Mol Biol Cell 9:3273-97
11
What is gene expression data?
  • Data showing the concentration of a specific mRNA
    at a given time in the life of the cell.

Experiments

mRNA
Each real value comes from one spot and tells whether
the concentration of a specific mRNA is higher (+)
or lower (-) than the normal value
Each column is the result of one image
-Spellman et al. (1998) Mol Biol Cell 9:3273-97
12
What are Bayesian networks?
  • Graphical representation of a joint distribution
    over a set of random variables.

$P(A,B,C,D,E) = P(A)\,P(B)\,P(C \mid A)\,P(D \mid A,B)\,P(E \mid D)$
Nodes represent gene expression while edges
encode the interactions (e.g. inhibition,
activation)
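As a rough illustration of the factorization above, the sketch below evaluates the joint probability of one assignment as a product of per-node terms. The variables are assumed to be binary and every number in the conditional probability tables is invented for the example; none of it comes from the presentation.

```python
from itertools import product

# Parents of each node, matching P(A) P(B) P(C|A) P(D|A,B) P(E|D).
parents = {"A": (), "B": (), "C": ("A",), "D": ("A", "B"), "E": ("D",)}

# Hypothetical tables: P(variable = 1 | parent values). All numbers are invented.
p_one = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0,): 0.2, (1,): 0.7},
    "D": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
    "E": {(0,): 0.25, (1,): 0.8},
}

def joint(assignment):
    """Joint probability as the product of each node's term given its parents."""
    prob = 1.0
    for var, pa in parents.items():
        key = tuple(assignment[p] for p in pa)
        p1 = p_one[var][key]
        prob *= p1 if assignment[var] == 1 else 1.0 - p1
    return prob

# Summing over all 2^5 assignments should give 1 for a valid factorization.
total = sum(joint(dict(zip("ABCDE", values))) for values in product((0, 1), repeat=5))
```

The network needs 1 + 1 + 2 + 4 + 2 = 10 parameters here, instead of the 31 needed for an unconstrained joint table over five binary variables.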
13
Bayesian networks: a small problem
  • A Bayesian network should be a DAG (Directed
    Acyclic Graph), but there are many examples of
    regulatory networks having directed cycles.


-Husmeier D., Bioinformatics, Vol. 19 no. 17 2003,
pages 2271-2282
14
How can we deal with that?
  • Using DBN (Dynamic Bayesian Networks) and
    sequential gene expression data

[Figure: network slices at time t and time t+1]
We unfold the network in time
DBN = BN with constraints on parent and child
nodes
-Friedman, Murphy, Russell, Learning the
Structure of Dynamic Probabilistic Networks
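A minimal sketch of the unfolding idea, assuming the simplest case where every edge X -> Y of the regulatory network becomes an edge from X at time t to Y at time t+1. The three-gene feedback loop and the function names are hypothetical; the point is only that the unfolded graph has no directed cycle.

```python
def unfold(edges):
    """Map each edge X -> Y to X(t) -> Y(t+1), giving a two-slice network."""
    return [((src, "t"), (dst, "t+1")) for src, dst in edges]

def has_cycle(edges):
    """Detect a directed cycle with a depth-first search."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    state = {}  # node -> "visiting" or "done"

    def visit(node):
        state[node] = "visiting"
        for nxt in graph[node]:
            if state.get(nxt) == "visiting":
                return True  # back edge: a directed cycle exists
            if nxt not in state and visit(nxt):
                return True
        state[node] = "done"
        return False

    return any(node not in state and visit(node) for node in graph)

# Hypothetical feedback loop A -> B -> C -> A, which is not a valid BN.
cyclic = [("A", "B"), ("B", "C"), ("C", "A")]
unfolded = unfold(cyclic)
```

Every edge in `unfolded` goes from slice t to slice t+1, so no path can return to its starting node.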
15
What are we searching for?
  • A Bayesian network that is most probable given
    the data D (gene expression)
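  • We find this BN as follows:
  • $BN^* = \arg\max_{BN} P(BN \mid D)$

Where $P(BN \mid D) = \frac{P(D \mid BN)\,P(BN)}{P(D)}$
$P(D \mid BN)$: marginal likelihood
$P(BN)$: prior on network structure
$P(D)$: data probability
Naïve approach to the problem: try all possible
DAGs and keep the best one!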
16
It is impossible to try all possible DAGs because
  • The number of DAGs increases super-exponentially
    with the number of nodes
  • n = 3 → 25 DAGs
  • n = 4 → 543 DAGs
  • n = 5 → 29281 DAGs
  • n = 6 → 3781503 DAGs
  • n = 7 → 1138779265 DAGs
  • n = 8 → 783702329343 DAGs

We are interested in problems having around 60
nodes.
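The counts listed above can be reproduced with Robinson's recurrence for the number of labeled DAGs. The slide does not say how its numbers were obtained, so the recurrence is supplied here as a known result; the function name is illustrative.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    # Inclusion-exclusion over the k nodes that have no incoming edge.
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )

counts = {n: count_dags(n) for n in range(3, 9)}
digits_for_60_nodes = len(str(count_dags(60)))
```

Python integers are unbounded, so `count_dags(60)` can be evaluated exactly; `digits_for_60_nodes` gives a feel for how hopeless exhaustive enumeration is at that size.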
17
Learning Bayesian Networks from data?
  • Choosing a method to search the space of networks and a
    conditional distribution representation
  • Network space search methods
    • Greedy hill-climbing
    • Beam-search
    • Stochastic hill-climbing
    • Simulated annealing
    • MCMC simulation
  • Conditional distribution representation
    • Linear Gaussian
    • Multinomial, binomial

Search moves: basically add, remove and reverse edges
[Figure: network over nodes A, B and C, with the quantities to estimate: P(a) = ?, P(b) = ?, P(c | a,b) = ?]
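The add, remove and reverse moves can be sketched as a function that lists every DAG one edge change away from the current one. This is an illustrative implementation, not the presenter's code; the function names and the three-node example are assumptions.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Kahn's algorithm: a DAG lets us remove every node in topological order."""
    indegree = {n: 0 for n in nodes}
    for _, dst in edges:
        indegree[dst] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for src, dst in edges:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    return removed == len(nodes)

def neighbors(nodes, edges):
    """All DAGs reachable by adding, removing, or reversing a single edge."""
    edges = frozenset(edges)
    result = []
    for src, dst in permutations(nodes, 2):
        if (src, dst) in edges:
            result.append(edges - {(src, dst)})  # remove the edge
            flipped = (edges - {(src, dst)}) | {(dst, src)}  # reverse the edge
            if is_acyclic(nodes, flipped):
                result.append(flipped)
        elif (dst, src) not in edges:
            added = edges | {(src, dst)}  # add a new edge
            if is_acyclic(nodes, added):
                result.append(added)
    return result

# Hypothetical current network: A -> B -> C.
chain_neighbors = neighbors(["A", "B", "C"], [("A", "B"), ("B", "C")])
```

For the chain A -> B -> C this yields five valid neighbors; adding C -> A is rejected because it would close a directed cycle.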
19
We use three gene expression levels
Sort
-1.06 -0.12 0.18 0.21 1.16 1.19
Split data into 3 equal buckets
Discretized data
0 0 2 2 1 1
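The sort-and-split step can be sketched as equal-frequency binning. The slide shows only the sorted values and the discretized result, so the unsorted sample order used below is an assumption chosen to be consistent with both; the function name is illustrative.

```python
def discretize(values, levels=3):
    """Equal-frequency binning: rank the values, then split the ranks into buckets."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, index in enumerate(order):
        # Ranks 0..n-1 are mapped onto bucket labels 0..levels-1.
        labels[index] = rank * levels // len(values)
    return labels

# Assumed original order of the six measurements (not given on the slide).
sample = [-1.06, -0.12, 1.16, 1.19, 0.18, 0.21]
discretized = discretize(sample)
```

With this ordering, `discretized` is `[0, 0, 2, 2, 1, 1]`: the two lowest values get level 0, the two middle values level 1, and the two highest level 2.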
20
Return on
21
Insight on each term
  • P(BN): prior on the network
  • In our research, we always use a prior equal to
    1
  • We could incorporate knowledge using it
  • E.g. we know the presence of an edge. If the
    edge is in the BN, P(BN) = 1, else P(BN) = 0
  • Efforts are made to reduce the search space by
    using knowledge, e.g. limiting the number of parents
    or children

22
Insight on each term
  • P(D | BN): marginal likelihood
  • Easy to calculate using a multinomial distribution
    with a Dirichlet prior

-Heckerman, A Tutorial on Learning With Bayesian
Networks, and Neapolitan, Learning Bayesian Networks
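A hedged sketch of why this term is easy to compute: with multinomial conditional distributions and a Dirichlet prior, the marginal likelihood of one node given its parents has a closed form in terms of counts. The symmetric prior with `alpha = 1`, the three expression levels and the function name are assumptions made for the example; the full network score would be the sum of this quantity over all nodes.

```python
from collections import Counter
from math import exp, lgamma

def log_marginal_likelihood(child, parent_configs, levels=3, alpha=1.0):
    """Log marginal likelihood of one node's data given its parents, using a
    multinomial model with a symmetric Dirichlet(alpha) prior per parent
    configuration."""
    counts = Counter(zip(parent_configs, child))  # N_jk: config j, child value k
    totals = Counter(parent_configs)              # N_j: observations of config j
    score = 0.0
    for config, n_j in totals.items():
        score += lgamma(levels * alpha) - lgamma(levels * alpha + n_j)
        for value in range(levels):
            score += lgamma(alpha + counts[(config, value)]) - lgamma(alpha)
    return score

# Toy data: the child copies its single parent exactly (invented values).
parent = [0, 1, 2] * 10
child = list(parent)
with_parent = log_marginal_likelihood(child, [(p,) for p in parent])
without_parent = log_marginal_likelihood(child, [()] * len(child))
```

On this toy data `with_parent` is larger than `without_parent`, so a structure search would prefer the network that contains the edge.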
23
MCMC (Markov Chain Monte Carlo) simulation
  • Markov Chain part
  • Zoom on a node of the chain

[Figure: the current network over nodes A, B and C and its neighboring networks; five neighbors each have P(BNnew) = 1/5 and one has probability 0]
24
MCMC (Markov Chain Monte Carlo) simulation
  • Monte Carlo part
  • Choose next BN with probability P(BNnew)
  • Accept the new BN with the following
    Metropolis-Hastings acceptance criterion

25
Monte Carlo part example
  1. Choose a random path, each path having a P(BNnew)
    of 1/5
  2. Choose another random number. If it is smaller
    than the Metropolis-Hastings criterion, accept
    BNnew, else return to BNold
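The two steps above can be sketched as a single move of the sampler. The acceptance formula used here is the standard Metropolis-Hastings ratio for a proposal that picks uniformly among neighboring networks; it is an assumption, not necessarily the exact criterion from the presentation. The `neighbors` and `log_score` arguments are hypothetical callables, where `log_score` would return log(P(D | BN) P(BN)).

```python
import math
import random

def acceptance_probability(log_score_old, log_score_new, n_neighbors_old, n_neighbors_new):
    """Metropolis-Hastings acceptance probability when the proposal picks
    uniformly among the neighbors of the current network."""
    log_ratio = (
        (log_score_new - log_score_old)
        + math.log(n_neighbors_old)
        - math.log(n_neighbors_new)
    )
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

def mcmc_step(current, neighbors, log_score, rng):
    """One move: propose a random neighbor, then accept it or stay put."""
    options = neighbors(current)
    proposal = rng.choice(options)  # step 1: each option has probability 1/len(options)
    accept = acceptance_probability(
        log_score(current), log_score(proposal), len(options), len(neighbors(proposal))
    )
    # Step 2: draw a second random number and compare it to the criterion.
    return proposal if rng.random() < accept else current

# With five neighbors on both sides, a proposal scoring half as well as the
# current network is accepted half of the time.
example = acceptance_probability(0.0, math.log(0.5), 5, 5)
```

Working with log scores avoids numerical underflow, since P(D | BN) is typically a very small number.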

[Figure: the current network over nodes A, B and C and its neighboring networks; five neighbors each have P(BNnew) = 1/5 and one has probability 0]
26
MCMC (Markov Chain Monte Carlo) simulation recap
  • Choose a starting BN at random
  • Burn-in phase (generate 5N BNs from MCMC without
    storing them)
  • Storing phase (get 100N BN structures from MCMC)

[Figure: log(P(D | BN) P(BN)) versus iteration, covering the burn-in phase and then the storing phase]
27
Why 100N BNs and not only one?
  • Because we do not have enough data and there are
    many high-scoring networks
  • Instead, we associate a confidence with each edge, e.g.
    how many times in the sample can we find an edge
    going from A to B?
  • We could fix a threshold on confidence and
    retrieve a global network constructed from the edges
    having confidence over the threshold
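Edge confidence and thresholding can be sketched as follows, treating each sampled network as a set of directed edges. The four sampled networks and the function names are invented for illustration.

```python
from collections import Counter

def edge_confidence(sampled_networks):
    """Fraction of sampled networks that contain each directed edge."""
    counts = Counter(edge for network in sampled_networks for edge in set(network))
    return {edge: n / len(sampled_networks) for edge, n in counts.items()}

def consensus_network(sampled_networks, threshold):
    """Edges whose confidence is over the threshold."""
    confidence = edge_confidence(sampled_networks)
    return {edge for edge, value in confidence.items() if value > threshold}

# Hypothetical sample of four networks over genes A, B and C.
samples = [
    {("A", "B")},
    {("A", "B"), ("B", "C")},
    {("A", "B")},
    {("C", "B")},
]
confidence = edge_confidence(samples)
consensus = consensus_network(samples, threshold=0.5)
```

Here A -> B appears in three of the four samples (confidence 0.75) and is the only edge kept at a threshold of 0.5. Note that a network assembled this way is not guaranteed to be acyclic.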

28
What we are working on
  • Mixing both sequential and non-sequential data to
    retrieve interesting relations between genes
  • How?
  • Using DBN and MCMC for sequential data, and BN and
    MCMC for non-sequential data

[Figure: 100N networks from BN and 100N networks from DBN go into the information tuner, which outputs the learned network]
29
How to test the approach
  • Problem: There is no way to test it on real data
    because there is no completely known network
  • Solution: Work on realistic simulations where we
    know the network structure
  • Example


[Figure: example network with nodes numbered 0 to 13, from which data are simulated]
-Hartemink A., Using Bayesian Network Inference
Algorithms to Recover Molecular Genetic
Regulatory Networks
30
How to test the approach
[Figure: data are simulated from the example network (nodes 0 to 13); sequential data go to DBN MCMC and non-sequential data go to BN MCMC, both feed the info tuner, and the results are compared using ROC curves]
-Hartemink A., Using Bayesian Network Inference
Algorithms to Recover Molecular Genetic
Regulatory Networks
31
Test description
  • Generate 60 sequential data points
  • Generate 120 non-sequential data points (realistic
    proportion)
  • Run DBN MCMC on the sequential data; keep 100N sampled
    networks
  • Run BN MCMC on the non-sequential data; keep 100N
    sampled networks
  • Test performance using weights on the samples
  • 0 BN, 1 DBN
  • 0.05 BN, 0.95 DBN
  • 0.95 BN, 0.05 DBN
  • 1 BN, 0 DBN
  • The metric used is the area under the ROC curve.
    A perfect learner gets 1.0, a random one gets 0.5 and
    the worst one gets 0.
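One way to picture the weighting and the metric is sketched below. The presentation does not spell out how the weights are applied, so combining the two samples as a weighted average of edge confidences is an assumption, as are the function names and the toy numbers. The area under the ROC curve is computed as the probability that a true edge receives a higher confidence than a false one.

```python
from itertools import permutations

def mix_confidence(bn_conf, dbn_conf, bn_weight):
    """Weighted average of edge confidences: bn_weight for BN, the rest for DBN."""
    edges = set(bn_conf) | set(dbn_conf)
    return {
        e: bn_weight * bn_conf.get(e, 0.0) + (1.0 - bn_weight) * dbn_conf.get(e, 0.0)
        for e in edges
    }

def roc_auc(confidence, true_edges, candidate_edges):
    """Area under the ROC curve via pairwise ranking (ties count one half)."""
    pos = [confidence.get(e, 0.0) for e in candidate_edges if e in true_edges]
    neg = [confidence.get(e, 0.0) for e in candidate_edges if e not in true_edges]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy setting: known network A -> B -> C, invented confidences.
candidates = list(permutations("ABC", 2))
truth = {("A", "B"), ("B", "C")}
bn_conf = {("A", "B"): 0.9, ("B", "A"): 0.6}
dbn_conf = {("A", "B"): 0.7, ("B", "C"): 0.8, ("C", "A"): 0.1}

auc_by_weight = {
    w: roc_auc(mix_confidence(bn_conf, dbn_conf, w), truth, candidates)
    for w in (0.0, 0.05, 0.95, 1.0)
}
```

Edges missing from a confidence table are treated as having confidence 0, which is why a learner that assigns the same value to every candidate edge scores 0.5.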

32
Results
[Figure: area under the ROC curve (0 to 1) as a function of the BN/DBN sample weighting]
33
Perspective
  • Working on more sophisticated ways to mix
    sequential and non-sequential data
  • Working on real cases
    • Yeast cell cycle
    • Arabidopsis thaliana circadian rhythm
  • Real data also means missing values
  • Evaluate missing-value solutions (EM, KNNImpute)

34
Acknowledgements
François Major
35
Why are there missing data?
36
ROC Curve
  • Receiver Operating Characteristic curve


-http://gim.unmc.edu/dxtests/roc2.htm
37
MCMC simulation and number of sampled networks