CisModule: De novo discovery of cis-regulatory modules by hierarchical mixture modeling

Provided by: Hongc

1
CisModule: De novo discovery of cis-regulatory
modules by hierarchical mixture modeling
  • Qing Zhou and Wing Wong
  • Slides by Qiaozhu Mei and Hong Cheng
  • Presented by Saurabh Sinha

2
Existing Methods
  • Existing motif discovery methods
  • Experimental methods
  • DNase footprinting and gel-mobility shift assay
  • Computational methods
  • EM algorithm
  • Gibbs sampler
  • Word enumeration
  • Dictionary model
  • Many useful TF motifs have been found, but many
    important TF motifs remain unexplored.

3
Cis-regulatory Modules (CRMs)
  • Observation: most eukaryotic genes are controlled
    by cis-regulatory modules (CRMs), each consisting
    of multiple TF-binding sites (TFBSs).
  • When no prior knowledge of the TFs is available, we
    must resort to de novo motif discovery algorithms.

4
CRMs Discovery and Motif Estimation
  • Greater sensitivity and specificity can be
    achieved for motif discovery by considering the
    colocalization of different TFBSs
  • Search for modules and motifs simultaneously.
  • Module discovery and motif estimation are tightly
    coupled
  • Motif patterns and binding sites are essential
    for predicting regulatory modules
  • Discovery of modules will greatly improve the
    performance of motif detection.

5
Method
  • Goal: search for binding sites of K different
    TFs within the CRMs of a given set of sequences
  • A Hierarchical Mixture (HMx) model generates
    the sequences
  • 1st level: the sequences are viewed as a mixture
    of CRMs, each of length l, and pure background
    sequence outside the modules
  • 2nd level: each module is modeled as a mixture of
    motifs and within-module background.
  • Bayesian inference to estimate locations of
    modules, TFBSs and motif patterns based on the
    joint posterior distribution.

6
HMx Model as a Stochastic Process
  • Treat HMx model as a stochastic machinery to
    generate sequences.
  • From the first sequence position, make a series
    of random decisions of whether to initiate a
    module of length l or generate a letter from the
    background model.
  • Inside a module, if a site for the kth motif is
    initiated at position n, generate wk letters
    from its PWM and place them at positions n, ...,
    n + wk - 1; otherwise generate a letter from the
    background.
  • After reaching the end of the current module,
    decide whether to sample from the background or
    initiate a new module.
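
The generative process on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the background is simplified to independent letters (the slides use a first-order Markov chain), and a module or site that would not fit in the remaining space falls back to a background letter.

```python
import random

ALPHABET = "ACGT"

def sample_letter(probs, rng):
    """Draw one nucleotide from a dict {letter: probability}."""
    return rng.choices(ALPHABET, weights=[probs[c] for c in ALPHABET])[0]

def generate_sequence(length, l, r, q, pwms, bg, rng):
    """Generate one sequence; return (sequence, module_starts, site_starts).

    q[0] is the within-module background weight, q[k] the weight of motif k.
    site_starts holds (position, motif_index) pairs, motif_index from 1.
    """
    seq, modules, sites = [], [], []
    while len(seq) < length:
        if rng.random() < r and len(seq) + l <= length:
            # First level: initiate a module of length l.
            start = len(seq)
            modules.append(start)
            while len(seq) < start + l:
                # Second level: choose motif k (k >= 1) or background (k = 0).
                k = rng.choices(range(len(q)), weights=q)[0]
                if k > 0 and len(seq) + len(pwms[k - 1]) <= start + l:
                    sites.append((len(seq), k))
                    seq.extend(sample_letter(col, rng) for col in pwms[k - 1])
                else:
                    seq.append(sample_letter(bg, rng))
        else:
            # Outside modules: one background letter.
            seq.append(sample_letter(bg, rng))
    return "".join(seq), modules, sites
```

A call such as `generate_sequence(200, 50, 0.05, [0.9, 0.1], [pwm], bg, random.Random(0))`, with `pwm` a list of per-column probability dicts and `bg` a single such dict, returns the sequence together with the hidden module and site positions that the sampler later has to recover.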

7
HMx Illustration
(A) Unaligned motif sites (motifs considered independently)
(B) Aligned motif sites represented by a multinomial model (representation of a motif)
(C) Cis-regulatory regions of coregulated genes (modules and motifs considered hierarchically)

8
Inside the Model
  • Observed data: S
  • S - set of sequences
  • Model variables: Ψ
  • θ0 - first-order Markov chain used to generate the
    background
  • θk - product multinomial parameters (PWM) for
    motif k
  • Θ = (θ0, θ1, θ2, ..., θK)
  • r - probability of a module start
  • qk - probability of starting a site for motif k
  • q = (q0, q1, ..., qK)
  • wk - width of motif k
  • W = (w1, w2, ..., wK)
  • Hidden variables (missing data): M and A
  • M - indicators for a module start
  • S(M) - sequence inside modules; S(Mc) - background
    sequence outside modules
  • Ak - indicators for start positions of sites
    of motif k
  • A = (A0, A1, ..., AK)
  • Model parameters
  • l - length of modules
  • K - number of motifs

9
Inside the Model (cont.)
  • Under the HMx model, the complete sequence
    likelihood with M and A given is P(S | M, A, Θ)
  • The joint posterior distribution is P(Ψ, M, A | S)
  • Priors
  • π(θk | wk): a Dirichlet distribution with
    parameter βk
  • π(q): a Dirichlet distribution with parameter α
  • π(wk): Poisson(w0)
  • π(r): Beta(a, b)
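
The equations on this slide did not survive in the transcript. One factorization consistent with the variables and priors listed above is sketched below. It is an assumption built from the slide's definitions, not the authors' exact formula: Ψ is used as shorthand for (Θ, q, W, r), the module length l and the background model θ0 are treated as fixed, and α and β_k are assumed names for the Dirichlet parameters.

```latex
P(\Psi, M, A \mid S) \;\propto\;
  P(S \mid M, A, \Theta)\,
  P(M \mid r)\,
  P(A \mid M, q, W)\,
  \pi(q)\,\pi(r)\,
  \prod_{k=1}^{K} \pi(\theta_k \mid w_k)\,\pi(w_k)
```

Read left to right, the terms are the complete-data likelihood, the probability of the module indicators given r, the probability of the site indicators inside the modules given q and W, and the four priors.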

10
Bayesian Inference
  • Problem: how to estimate Ψ = (Θ, q, W, r)
  • M and A are regarded as missing data, and the
    Gibbs sampler is used to perform Bayesian inference.
  • Starting from a random initialization, the CisModule
    algorithm iteratively cycles through the steps of
    parameter update and module-motif detection.
  • Given the current modules and motif sites (M and A),
    update all the parameters: sample Ψ from the
    conditional distribution P(Ψ | M, A, S)
  • Given the current values of the parameters, sample
    modules and motif sites from the conditional
    distribution P(M, A | Ψ, S)

11
Sampling Ψ given M and A
  • Ψ = (Θ, q, r, W): parameters of the model
  • Align binding sites of each motif, calculate PWM
    from these to get samples of ?
  • q (motif transition probabilities) derived from
    total number of sites of each motif
  • r (module probability) derived from number of
    modules prescribed by M
  • W (motif widths) sampled by Metropolis strategy
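
The first three updates on this slide can be sketched with standard conjugate draws. This is an illustration under assumptions, not the authors' code: the function names and the symmetric pseudocount `pseudo` are hypothetical, and the exact counts that enter the Beta update for r are a guess. The Metropolis update of W is not covered here.

```python
import random

ALPHABET = "ACGT"

def sample_dirichlet(alphas, rng):
    """Draw from Dirichlet(alphas) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_pwm(sites, pseudo, rng):
    """Sample a PWM column by column from aligned sites of equal width."""
    pwm = []
    for column in zip(*sites):
        # Posterior Dirichlet parameters: observed counts plus pseudocount.
        counts = [column.count(c) + pseudo for c in ALPHABET]
        pwm.append(dict(zip(ALPHABET, sample_dirichlet(counts, rng))))
    return pwm

def sample_q(site_counts, pseudo, rng):
    """Sample q from [n_0, n_1, ..., n_K]: within-module background count
    followed by the number of sites of each motif."""
    return sample_dirichlet([n + pseudo for n in site_counts], rng)

def sample_r(n_modules, n_background, a, b, rng):
    """Sample r from Beta(a + module starts, b + background positions)."""
    return rng.betavariate(a + n_modules, b + n_background)
```

With many aligned sites the sampled PWM concentrates near the observed letter frequencies, which is why aligning the current sites is enough to refresh Θ in each Gibbs cycle.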

12
Sampling M and A given Ψ
  • Need to draw (M, A) with probability Pr(M, A | Ψ, S)
  • Use forward summation to compute the sequence
    likelihood summed over all (M, A)
  • Then use backward sampling to generate the
    module indicators (i.e., M) and the site
    indicators (i.e., A)

13
Forward Summation
  • The forward sum is built up position by position;
    at each position it involves the probability of
    observing the length-l segment ending there,
    given that it is within a module.
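
The recursion itself is missing from the transcript. A form consistent with the surrounding slides is sketched below as an assumption, not the authors' exact equation. Here f_n is the probability of the first n letters summed over all module and site configurations, g(·) is the probability of a length-l segment given that it is within a module, and the names A_n and B_n are borrowed from the Backward Sampling slide.

```latex
f_0(\Psi) = 1, \qquad
f_n(\Psi) = A_n(\Psi) + B_n(\Psi), \\
A_n(\Psi) = r \, f_{n-l}(\Psi) \, g\!\left(x_{n-l+1}, \dots, x_n\right)
  \quad (A_n = 0 \text{ for } n < l), \\
B_n(\Psi) = (1 - r) \, f_{n-1}(\Psi) \, p\!\left(x_n \mid x_{n-1}, \theta_0\right)
```

Under this reading, A_n collects every configuration in which a module ends at position n, and B_n collects every configuration in which position n is a background letter.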

14
Backward Sampling
  • Start from n = L, the last sequence position.
  • At position n, decide whether
  • (i) n is the last position of a module,
    or
  • (ii) n is from the background.
  • The probabilities of these events are proportional to
    An(Ψ) and Bn(Ψ), respectively
  • Depending on whether event (i) or (ii) is chosen,
    move to position n - l or n - 1.
  • Repeat the binary decision process.
  • In this way, generate all the module indicators.
  • Then, generate motif indicators in a similar
    manner
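
Forward summation followed by backward sampling of the module indicators can be sketched as below. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, the background emits independent letters rather than following a first-order Markov chain, probabilities are not rescaled (so long sequences would underflow), and sampling the site indicators inside each chosen module is left out.

```python
import random

def segment_prob(seg, pwms, q, bg):
    """Inner forward sum: probability of a module segment, summed over
    all placements of motif sites and within-module background letters."""
    h = [0.0] * (len(seg) + 1)
    h[0] = 1.0
    for m in range(1, len(seg) + 1):
        h[m] = q[0] * h[m - 1] * bg[seg[m - 1]]  # background letter
        for k, pwm in enumerate(pwms, start=1):
            w = len(pwm)
            if m >= w:
                p = 1.0
                for col, ch in zip(pwm, seg[m - w:m]):
                    p *= col[ch]
                h[m] += q[k] * h[m - w] * p  # site of motif k ends at m
    return h[-1]

def forward(seq, l, r, pwms, q, bg):
    """Return (f, A, B): A[n] covers a module ending at position n,
    B[n] a background letter at n, and f[n] = A[n] + B[n]."""
    size = len(seq) + 1
    f, A, B = [0.0] * size, [0.0] * size, [0.0] * size
    f[0] = 1.0
    for n in range(1, size):
        B[n] = (1 - r) * f[n - 1] * bg[seq[n - 1]]
        if n >= l:
            A[n] = r * f[n - l] * segment_prob(seq[n - l:n], pwms, q, bg)
        f[n] = A[n] + B[n]
    return f, A, B

def backward_sample(A, B, l, rng):
    """Walk back from the last position, choosing 'module ends here' with
    probability A[n] / (A[n] + B[n]); return 0-based module starts."""
    n, starts = len(A) - 1, []
    while n > 0:
        if rng.random() * (A[n] + B[n]) < A[n]:
            starts.append(n - l)
            n -= l
        else:
            n -= 1
    return sorted(starts)
```

Because every backward choice is weighted by the forward sums, each call to `backward_sample` is an exact draw of M from its conditional distribution under this simplified model, rather than a greedy decoding.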

15
Algorithm Illustration
Sample M, A from the conditional distribution P(M, A | Ψ, S): given Ψ, how do we decode the sequence? Two-phase sampling.
Sample Ψ from the conditional distribution P(Ψ | M, A, S): given M and A, how do we estimate Ψ? Alignments!

16
Parameter selection
  • In the previous discussion, module length l and
    TF number K are left as user-input parameters.
  • How to determine them when no prior knowledge
    provided?
  • For l: an extra conditional sampling step, a
    Metropolis update, can be performed to determine
    the most likely module length.
  • Metropolis ratio
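
The ratio used on this slide is not preserved in the transcript. As a generic illustration only, a Metropolis update of an integer parameter such as l with a symmetric proposal could look like the sketch below. The names `metropolis_step` and `log_post` are hypothetical, and `log_post` stands in for whatever log posterior (up to an additive constant) the sampler evaluates for a candidate value.

```python
import math
import random

def metropolis_step(current, log_post, rng, step=1, lower=1):
    """One Metropolis update of an integer parameter.

    log_post: callable returning the log posterior up to a constant
    (a placeholder for the model-specific quantity).
    """
    proposal = current + rng.choice([-step, step])
    if proposal < lower:
        return current  # proposals outside the support are rejected
    log_ratio = log_post(proposal) - log_post(current)
    # Accept with probability min(1, ratio). The proposal is symmetric,
    # so no proposal-density correction is needed.
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return proposal
    return current
```

Working with the log of the ratio avoids overflow when the two posterior values differ by many orders of magnitude, which is common for sequence likelihoods.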

17
Strategies on l and K
  • About the TF number K.
  • Formulated as a Bayesian model selection problem.
  • Then run CisModule with K = 1, ..., Km, where Km is
    the value at which the algorithm stops detecting new
    motifs. Take the K in {1, ..., Km - 1} that maximizes
    the posterior odds as the estimated K.

18
Results
  • Simulation Studies
  • Motifs: E2F, YY1 and c_MYC
  • Background sequences are generated by a
    first-order Markov chain.
  • Module predictions
  • Total length: 2,009 and 4,108 bp on average,
    with excess rates of 0.5% and 2.7%
  • Coverage of true sites: 84.3% and 94.0%
  • Motif discovery
  • Comparison with MEME and BioProspector
  • Improvement over MEME and BP

19
Output Result
  • Using the samples from the joint posterior
    distribution, the model parameters can be estimated.
  • The top wk-mers that are most frequently
    sampled as sites for the kth motif are aligned as
    output sites.
  • The modules are inferred from the marginal
    posterior probability of each sequence position
    being sampled as within a module.
  • The positions where this probability is > 0.5 are
    output as modules.
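
The thresholding rule on this slide can be sketched as follows. This is a minimal illustration with hypothetical names: it assumes each retained Gibbs iteration has been stored as a list of 0/1 flags marking the positions sampled as inside a module.

```python
def module_regions(samples, threshold=0.5):
    """Turn posterior samples of within-module flags into module intervals.

    samples: list of equal-length lists of 0/1, one per retained iteration.
    Returns half-open (start, end) intervals of positions whose sampled
    frequency of being within a module exceeds the threshold.
    """
    n_samples = len(samples)
    # Marginal posterior estimate per position: fraction of samples with a 1.
    freq = [sum(col) / n_samples for col in zip(*samples)]
    regions, start = [], None
    for i, p in enumerate(freq):
        if p > threshold and start is None:
            start = i
        elif p <= threshold and start is not None:
            regions.append((start, i))
            start = None
    if start is not None:
        regions.append((start, len(freq)))
    return regions
```

For three samples `[0,1,1,0,0,1]`, `[0,1,1,1,0,1]` and `[0,0,1,1,0,1]`, the per-position frequencies are 0, 2/3, 1, 2/3, 0 and 1, so the reported regions are `(1, 4)` and `(5, 6)`.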

20
Simulation Results
21
Homotypic Regulatory Modules in Drosophila
  • Motifs
  • Bicoid (Bcd), Hunchback (Hb) and Kruppel (Kr)
  • Results

22
Muscle-Specific Regulatory Regions
  • Motifs
  • Mef-2, TEF and SRF
  • Results

23
Discussion
  • HMx model
  • Capture the spatial correlation between different
    binding sites
  • CisModule
  • A Bayesian module sampler to infer the motif
    modules and the binding sites for a set of TFs
  • May be trapped in local modes.
  • Need multiple trials.
  • Can use available prior information

24
Future Work
  • Incorporate the information from comparative
    genomics into CisModule.
  • Greater prior probabilities for modules and sites
    can be assigned to the regions that are highly
    conserved across species of appropriate
    evolutionary distances.
  • The HMx model captures the colocalization
    tendency of cooperating TFBSs but not their order
    or precise spacing.
  • Additional refinements to the model may improve
    its performance.