Title: CisModule: De novo discovery of cis-regulatory modules by hierarchical mixture modeling
1CisModule: De novo discovery of cis-regulatory
modules by hierarchical mixture modeling
- Qing Zhou and Wing Wong
- Slides by Qiaozhu Mei and Hong Cheng
- Presented by Saurabh Sinha
2Existing Methods
- Existing motif discovery methods
- Experimental methods
- DNase footprinting and gel-mobility shift assay
- Computational methods
- EM algorithm
- Gibbs sampler
- Word enumeration
- Dictionary model
- Many useful TF motifs have been found, but many important TF motifs remain undiscovered.
3Cis-regulatory Modules (CRMs)
- Observation: Most eukaryotic genes are controlled by cis-regulatory modules (CRMs), each consisting of multiple TF-binding sites (TFBSs).
- When no prior knowledge of the TFs is available, we must resort to de novo motif discovery algorithms.
4CRMs Discovery and Motif Estimation
- Greater sensitivity and specificity can be achieved for motif discovery by considering the colocalization of different TFBSs.
- Search for modules and motifs simultaneously.
- Module discovery and motif estimation are tightly coupled.
- Motif patterns and binding sites are essential for predicting regulatory modules.
- Discovery of modules will greatly improve the performance of motif detection.
5Method
- Goal: search for binding sites for K different TFs within the CRMs of a given set of sequences.
- A Hierarchical Mixture (HMx) model generates the sequences.
- 1st level: the sequences are viewed as a mixture of CRMs, each of length l, and pure background sequences outside the modules.
- 2nd level: each module is modeled as a mixture of motifs and within-module background.
- Bayesian inference estimates the locations of modules, TFBSs and motif patterns based on the joint posterior distribution.
6HMx Model as a Stochastic Process
- Treat the HMx model as a stochastic machine that generates sequences.
- Starting from the first sequence position, make a series of random decisions on whether to initiate a module of length l or generate a letter from the background model.
- Inside a module, if a site for the kth motif is initiated at position n, generate wk letters from its PWM and place them at positions n, ..., n + wk - 1; otherwise generate a letter from the background.
- After reaching the end of the current module, decide whether to sample from the background or to initiate a new module.
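The generative process described on this slide can be sketched as a small simulation. This is only an illustration under stated assumptions: the function name `generate_hmx` and its arguments are hypothetical, the background is simplified to independent letters rather than a first-order Markov chain, and a motif site that would not fit before the module end is replaced by a background letter.

```python
import random

ALPHABET = "ACGT"

def sample_letter(probs, rng):
    """Draw one nucleotide from a length-4 probability vector."""
    return rng.choices(ALPHABET, weights=probs, k=1)[0]

def generate_hmx(length, l, r, q, pwms, bg, rng):
    """Generate one sequence from a simplified HMx process.

    length : sequence length
    l      : module length
    r      : probability of starting a module at a background position
    q      : [q0, q1, ..., qK]; q0 weights a within-module background
             letter, qk weights starting a site for motif k
    pwms   : list of K PWMs, each a list of per-column probability vectors
    bg     : background letter probabilities (assumption: independent
             letters instead of a first-order Markov chain)
    Returns (sequence, module_starts, sites), sites as (start, k) pairs.
    """
    seq, module_starts, sites = [], [], []
    n = 0
    while n < length:
        # Level 1: start a module here (if it fits) or emit background.
        if n + l <= length and rng.random() < r:
            module_starts.append(n)
            end = n + l
            while n < end:
                # Level 2: background letter (k = 0) or a site of motif k.
                k = rng.choices(range(len(q)), weights=q, k=1)[0]
                if k > 0 and n + len(pwms[k - 1]) <= end:
                    sites.append((n, k))
                    for col in pwms[k - 1]:
                        seq.append(sample_letter(col, rng))
                    n += len(pwms[k - 1])
                else:
                    seq.append(sample_letter(bg, rng))
                    n += 1
        else:
            seq.append(sample_letter(bg, rng))
            n += 1
    return "".join(seq), module_starts, sites

# Example call with one hypothetical motif of width 4.
example_pwm = [[0.9, 0.03, 0.03, 0.04]] * 4
example = generate_hmx(200, 30, 0.02, [0.8, 0.2], [example_pwm],
                       [0.25] * 4, random.Random(0))
```

The returned module starts and site positions play the role of the hidden indicators that the inference procedure later has to recover from the sequence alone.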
7HMx Illustration
(A) Unaligned motif sites (Consider motifs
independently)
(B) Aligned motif sites represented by a
multinomial model (representation of a motif)
(C) Cis-regulatory regions of coregulated genes
(consider modules and motifs in a hierarchical
manner)
8Inside the Model
- Observed data: S
- S - set of sequences
- Model variables: Ψ
- θ0 - first-order Markov chain that generates the background
- θk - product multinomial parameters (PWM) for motif k
- Θ = (θ0, θ1, θ2, ..., θK)
- r - probability of a module start
- qk - probability of starting a site for motif k
- q = (q0, q1, ..., qK)
- wk - width of motif k
- W = (w1, w2, ..., wK)
- Hidden variables (missing data): M, A
- M - indicators for a module start
- S(M) - sequence of modules; S(Mc) - sequence of background outside modules
- Ak - indicators for start positions of sites of motif k
- A = (A0, A1, ..., AK)
- Model parameters
- l - length of modules
- K - number of motifs
9Inside the Model (cont.)
- Under the HMx model, the complete sequence likelihood is defined with M and A given.
- The joint posterior distribution is proportional to this likelihood multiplied by the probability of M and A and by the priors.
- Priors
- π(θk | wk): a Dirichlet distribution with parameter βk
- π(q): a Dirichlet distribution with parameter α
- π(wk): Poisson(w0)
- π(r): Beta(a, b)
10Bayesian Inference
- Problem: how to estimate Ψ = (Θ, q, W, r)?
- Regard M and A as missing data and use the Gibbs sampler to perform Bayesian inference.
- With a random initialization, the CisModule algorithm iteratively cycles through the steps of parameter update and module-motif detection.
- Given the current modules and motif sites (M and A), update all the parameters: sample Ψ from the conditional distribution P(Ψ | M, A, S).
- Given the current values of the parameters, sample modules and motif sites from the conditional distribution P(M, A | Ψ, S).
11Sampling Ψ given M and A
- Ψ = (Θ, q, r, W): parameters of the model
- Align the binding sites of each motif and calculate the PWM from them to get samples of Θ.
- q (motif transition probabilities) is derived from the total number of sites of each motif.
- r (module probability) is derived from the number of modules prescribed by M.
- W (motif widths) is sampled by a Metropolis strategy.
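A minimal sketch of the count-based updates on this slide, assuming the standard conjugate forms (Dirichlet posterior for each PWM column and for q, Beta posterior for r). The function names and the single `pseudocount` argument are illustrative assumptions, not the authors' implementation, and the Metropolis update of the motif widths W is not shown.

```python
import random

ALPHABET = "ACGT"

def sample_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_pwm(aligned_sites, pseudocount, rng):
    """Sample a PWM column by column from the aligned sites of one motif."""
    width = len(aligned_sites[0])
    pwm = []
    for j in range(width):
        # Letter counts in column j of the alignment.
        counts = [sum(site[j] == b for site in aligned_sites) for b in ALPHABET]
        pwm.append(sample_dirichlet([c + pseudocount for c in counts], rng))
    return pwm

def sample_q(site_counts, pseudocount, rng):
    """Sample q = (q0, ..., qK) from the number of within-module background
    letters (index 0) and the number of sites of each motif (index k)."""
    return sample_dirichlet([c + pseudocount for c in site_counts], rng)

def sample_r(n_modules, n_background, a, b, rng):
    """Sample r from Beta(a + number of modules, b + number of background
    positions outside modules)."""
    return rng.betavariate(a + n_modules, b + n_background)
```

With many aligned sites the sampled PWM concentrates around the observed column frequencies; with few sites the pseudocounts dominate and the draws stay diffuse.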
12Sampling M and A given Ψ
- Need to pick (M, A) with probability Pr(M, A | Ψ, S).
- Use forward summation to compute the sequence likelihood summed over all possible (M, A).
- Then use backward sampling to generate the module indicators (i.e., M) and the site indicators (i.e., A).
13Forward Summation
- Forward summation proceeds position by position along the sequence.
- Each step uses the probability of observing a sequence segment given that it is within a module.
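The within-module probability mentioned here can itself be computed by a forward sum over all ways of tiling the segment with background letters and motif sites. The sketch below is one possible form, not the slide's missing formula: `module_likelihood` is a hypothetical name, the background is treated as independent letters, and boundary effects at the module end are ignored.

```python
def module_likelihood(segment, q, pwms, bg):
    """Probability of `segment` given that it lies inside one module.

    q    : [q0, q1, ..., qK] weights for background letter / motif k site
    pwms : list of K PWMs, each a list of per-column probability vectors
    bg   : background letter probabilities in the order A, C, G, T
    """
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = [index[c] for c in segment]
    m = len(x)
    g = [0.0] * (m + 1)   # g[i] = probability of the first i letters
    g[0] = 1.0
    for i in range(1, m + 1):
        # Case 1: letter i is a within-module background letter.
        total = g[i - 1] * q[0] * bg[x[i - 1]]
        # Case 2: a site of motif k ends at letter i.
        for k, pwm in enumerate(pwms, start=1):
            w = len(pwm)
            if i >= w:
                site = 1.0
                for j in range(w):
                    site *= pwm[j][x[i - w + j]]
                total += g[i - w] * q[k] * site
        g[i] = total
    return g[m]
```

For the segment "AC", a width-2 motif that emits exactly "AC", q = [0.5, 0.5] and a uniform background, the two tilings contribute 0.125 * 0.125 and 0.5, giving 0.515625.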
14Backward Sampling
- Start from n = L, the last sequence position.
- At position n, decide whether it
- (i) is at the last position of a module, or
- (ii) is from the background.
- The probabilities of these events are proportional to An(Ψ) and Bn(Ψ), respectively.
- Depending on whether event (i) or (ii) is chosen, move to position n - l or n - 1.
- Repeat the binary decision process.
- In this way, generate all the module indicators.
- Then generate the motif indicators in a similar manner.
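The forward pass and the backward binary decisions for the module indicators can be sketched as follows. This is a simplified illustration: `bg_prob` and `module_prob` are hypothetical callables standing in for the model's likelihood terms, and plain floating-point products are used, so a real implementation on long sequences would need rescaling or log-space arithmetic.

```python
import random

def forward_backward_modules(seq, l, r, bg_prob, module_prob, rng):
    """Sample module start positions (0-based) given fixed parameters.

    bg_prob(n)     : probability of the letter at index n under the background
    module_prob(s) : probability of the length-l segment starting at index s,
                     given that it is a module
    """
    L = len(seq)
    f = [0.0] * (L + 1)   # f[n] = probability of the first n letters
    a = [0.0] * (L + 1)   # a[n]: a module ends at position n (event i)
    b = [0.0] * (L + 1)   # b[n]: position n is background (event ii)
    f[0] = 1.0
    # Forward summation.
    for n in range(1, L + 1):
        b[n] = f[n - 1] * (1.0 - r) * bg_prob(n - 1)
        if n >= l:
            a[n] = f[n - l] * r * module_prob(n - l)
        f[n] = a[n] + b[n]
    # Backward sampling, starting from the last position.
    starts = []
    n = L
    while n > 0:
        # Choose event (i) with probability a[n] / (a[n] + b[n]).
        if rng.random() * f[n] < a[n]:
            starts.append(n - l)
            n -= l
        else:
            n -= 1
    return sorted(starts)
```

The same two-pass idea can then be applied inside each sampled module to draw the motif site indicators.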
15Algorithm Illustration
Sample M, A from the conditional distribution P(M, A | Ψ, S):
given Ψ, how to decode the sequence? Two-phase sampling.
Sample Ψ from the conditional distribution P(Ψ | M, A, S):
given M, A, how to estimate Ψ? Alignments!
16Parameter selection
- In the previous discussion, the module length l and the TF number K are left as user-input parameters.
- How to determine them when no prior knowledge is provided?
- For l: an extra conditional sampling step by a Metropolis update can be performed to determine the most likely module length.
- The update accepts or rejects a proposed l according to the Metropolis ratio.
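A generic Metropolis step for an integer-valued parameter such as l might look like the sketch below. The proposal (l plus or minus 1) and the callable `log_post`, which returns the log of the unnormalized conditional posterior of l, are assumptions for illustration; the slides do not specify the proposal or the exact ratio.

```python
import math
import random

def metropolis_update_length(l, log_post, l_min, l_max, rng):
    """One Metropolis step for the module length l.

    log_post(l) is a hypothetical callable giving the log of the
    unnormalized conditional posterior of l given everything else.
    """
    proposal = l + rng.choice([-1, 1])   # symmetric random-walk proposal
    if proposal < l_min or proposal > l_max:
        return l                          # out of range: reject
    log_ratio = log_post(proposal) - log_post(l)
    # Accept with probability min(1, ratio).
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal
    return l
```

Because the proposal is symmetric, the acceptance probability depends only on the posterior ratio, and repeated calls drift toward module lengths with high posterior probability.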
17Strategies on l and K
- About the TF number K:
- Formulated as a Bayesian model selection problem.
- Run CisModule with K = 1, ..., Km, where Km is the value at which the algorithm stops detecting new motifs. Take the K in 1, ..., Km-1 that maximizes the posterior odds as the estimated K.
18Results
- Simulation studies
- Motifs: E2F, YY1 and c_MYC
- Background sequences are generated by a first-order Markov chain.
- Module predictions
- Total length: 2,009 and 4,108 bp on average; excess rates: 0.5% and 2.7%
- Coverage of true sites: 84.3% and 94.0%
- Motif discovery
- Comparison with MEME and BioProspector (BP)
- Improvement over MEME and BP
19Output Result
- Using the samples from the joint posterior distribution, the model parameters can be estimated.
- The top wk-mers that are most frequently sampled as sites for the kth motif are aligned as output sites.
- The modules are inferred from the marginal posterior probability of each sequence position being sampled as within a module.
- The positions where this probability is > 0.5 are output as modules.
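The thresholding rule on this slide can be sketched as follows, assuming each Gibbs sample is stored as a list of non-overlapping module start positions for one sequence. The function name `module_regions` and this storage format are illustrative assumptions.

```python
def module_regions(samples, seq_len, l, threshold=0.5):
    """Return per-position marginal probabilities and the (start, end)
    intervals, end exclusive, where the probability of lying within a
    module exceeds `threshold`.

    samples : list of Gibbs samples, each a list of module start positions
    """
    counts = [0] * seq_len
    for starts in samples:
        for s in starts:
            for n in range(s, min(s + l, seq_len)):
                counts[n] += 1
    # Fraction of samples in which each position was inside a module.
    probs = [c / len(samples) for c in counts]
    regions, start = [], None
    for n, p in enumerate(probs):
        if p > threshold and start is None:
            start = n
        elif p <= threshold and start is not None:
            regions.append((start, n))
            start = None
    if start is not None:
        regions.append((start, seq_len))
    return probs, regions
```

For example, with four samples whose module starts are [2], [3], [2] and [10], a sequence of length 20 and l = 4, only positions 3 to 5 are covered in more than half of the samples, so the single output region is (3, 6).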
20Simulation Results
21Homotypic Regulatory Modules in Drosophila
- Motifs
- Bicoid (Bcd), Hunchback (Hb) and Kruppel (Kr)
- Results
22Muscle-Specific Regulatory Regions
- Motif
- Mef-2, TEF and SRF
- Results
23Discussion
- HMx model
- Captures the spatial correlation between different binding sites
- CisModule
- A Bayesian module sampler that infers the motif modules and the binding sites for a set of TFs
- May be trapped in local modes.
- Needs multiple trials.
- Can use available prior information
24Future Work
- Incorporate information from comparative genomics into CisModule.
- Greater prior probabilities for modules and sites can be assigned to regions that are highly conserved across species of appropriate evolutionary distances.
- The HMx model captures the colocalization tendency of cooperating TFBSs but not their order or precise spacing.
- Additional refinements to the model may improve its performance.