Stochastic Models For Heterogeneous DNA Sequences - PowerPoint PPT Presentation

1
Stochastic Models For Heterogeneous DNA Sequences
  • Churchill, G.A.
  • Bulletin of Mathematical Biology
  • Vol. 51, pp. 79-94, 1989.

Presented By Matthew McCall
2
DNA Representation
  • A typical representation of a DNA sequence is a
    single strand written from the 5' to 3' end.
  • Consider two binary representations: the
    purine-pyrimidine (AG-TC) and the strong-weak
    hydrogen-bonding (GC-AT) encodings.
  • Together these two yield the same information as
    the single strand representation.
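
The claim that the two binary views jointly recover the strand can be checked with a minimal sketch. The helper names (`encode`, `decode`) and the choice of which class maps to 1 are illustrative assumptions, not taken from the paper.

```python
# Two binary views of a DNA strand. The 0/1 assignment is an arbitrary
# convention chosen for this sketch.
PURINES = {"A", "G"}   # purine -> 1, pyrimidine (T, C) -> 0
STRONG = {"G", "C"}    # strong hydrogen bonding -> 1, weak (A, T) -> 0

def encode(seq):
    """Return (purine_bits, strong_bits) for an upper-case A/C/G/T string."""
    if set(seq) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G, T")
    ry = [1 if b in PURINES else 0 for b in seq]
    sw = [1 if b in STRONG else 0 for b in seq]
    return ry, sw

# Each base corresponds to exactly one (purine, strong) pair.
DECODE = {(1, 0): "A", (1, 1): "G", (0, 0): "T", (0, 1): "C"}

def decode(ry, sw):
    """Rebuild the strand from the two binary sequences."""
    return "".join(DECODE[pair] for pair in zip(ry, sw))

seq = "GATTACA"
ry, sw = encode(seq)
print(ry, sw, decode(ry, sw) == seq)
```

Either bit sequence alone loses information (for example, the strong-weak view cannot tell G from C), which is why the pair is needed.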

3
Stochastic Modeling
  • Stochastic Process - A non-deterministic process:
    the next state of the environment is not
    fully determined by the previous state of the
    environment.
  • Stochastic modeling is used here as a method for
    extracting information from the sequence.
  • It doesn't try to mimic the process followed in
    nature, which is far too complex for any simple
    model.

4
Markov Chains
  • Markov Chains have traditionally been used to
    model sequences.
  • They work well when the sequence composition is
    fairly homogeneous but not when the composition
    is heterogeneous.
  • Compositional variation between segments of the
    sequence is likely to reflect functional or
    structural differences.

5
Proposed Solution
  • Sequences are assumed to be composed of
    homogeneous regions, which differ from one
    another.
  • As such, each region can be classified into one
    of a finite number of states.
  • These states represent the underlying structure
    of the sequence in that region and are assumed to
    develop according to a hidden Markov process.

6
The General Case
  • A sequence of random variables $Y_i$, $i = 1, \ldots, n$
  • Corresponding unobservable states $S_i$
  • Denote the sequence of observed outcomes up to
    time t by $y^t = (y_1, \ldots, y_t)$
  • And similarly the states $s^t = (s_1, \ldots, s_t)$
  • Then the probability of an observation given the
    current state and past observations is
  • $\Pr(y_t \mid s_t, y^{t-1})$
  • This is called the observation equation.

7
The General Case (cont.)
  • The sequence of states, $s^t$, cannot be observed;
    however, the states can be inferred from the
    observations.
  • The states are assumed to evolve according to a
    set of system equations with the Markov property
  • $\Pr(s_t \mid s^{t-1}) = \Pr(s_t \mid s_{t-1})$
  • The problem addressed here is how to estimate the
    states $S_i$ given the observations $Y_i$.
  • The result, the smoothed estimate at time t, will
    be denoted $\Pr(s_t \mid y^n)$.

8
Computing $\Pr(s_t \mid y^n)$
  • $\Pr(s_t \mid y^n) = \Pr(s_t \mid y^t) \int \frac{\Pr(s_{t+1} \mid y^n) \, \Pr(s_{t+1} \mid s_t)}{\Pr(s_{t+1} \mid y^t)} \, ds_{t+1}$
  • We can compute each of these terms, so the
    smoothed estimate at time t can be expressed in
    terms of the system equations, $\Pr(s_{t+1} \mid s_t)$,
    quantities derived from filtering, $\Pr(s_t \mid y^t)$ and
    $\Pr(s_{t+1} \mid y^t)$, and the smoothed estimate at time
    t+1, $\Pr(s_{t+1} \mid y^n)$.
  • So we can employ a recursive algorithm to compute
    the smoothed estimate at each time t.
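
For a finite set of states the integral over the next state becomes a sum, and the recursion can be sketched as a forward filtering pass followed by a backward smoothing pass. The function names, the list-of-lists layout for `emit` and `trans`, and the uniform initial distribution are assumptions made for this illustration; the parameter values are the ones quoted later in the talk for the yeast example, and the toy sequence is made up.

```python
def forward_filter(y, emit, trans, init):
    """Return filtered Pr(s_t | y^t) and one-step predictions Pr(s_{t+1} | y^t).

    emit[j][obs] = Pr(y_t = obs | s_t = j); trans[i][j] = Pr(s_{t+1} = j | s_t = i).
    """
    K = len(init)
    filt, pred = [], []
    prior = list(init)                          # distribution of s_1 before y_1
    for obs in y:
        joint = [prior[j] * emit[j][obs] for j in range(K)]
        total = sum(joint)
        post = [v / total for v in joint]       # Pr(s_t | y^t)
        filt.append(post)
        # Pr(s_{t+1} = j | y^t) = sum_i Pr(s_t = i | y^t) * trans[i][j]
        prior = [sum(post[i] * trans[i][j] for i in range(K)) for j in range(K)]
        pred.append(prior)
    return filt, pred

def smooth(y, emit, trans, init):
    """Return smoothed Pr(s_t | y^n) for every t (assumes all predictions > 0)."""
    filt, pred = forward_filter(y, emit, trans, init)
    n, K = len(y), len(init)
    sm = [None] * n
    sm[-1] = filt[-1]                           # at t = n, smoothed equals filtered
    for t in range(n - 2, -1, -1):
        sm[t] = [
            filt[t][i]
            * sum(sm[t + 1][j] * trans[i][j] / pred[t][j] for j in range(K))
            for i in range(K)
        ]
    return sm

# Toy run: 30 ones followed by 30 alternating outcomes (made-up data).
p0, p1, switch = 0.9, 0.5, 0.01
emit = [[1 - p0, p0], [1 - p1, p1]]
trans = [[1 - switch, switch], [switch, 1 - switch]]
y = [1] * 30 + [0, 1] * 15
profile = smooth(y, emit, trans, init=[0.5, 0.5])
print([round(row[1], 2) for row in profile[::10]])
```

Plotting the second column of `profile` against position gives the kind of state profile the talk refers to in its results.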

9
Parameter Estimation
  • The previous algorithm requires that the
    parameters of the observation and system
    equations be specified.
  • These parameters are typically estimated from the
    data.
  • The paper suggests an EM algorithm for
    determining the approximate maximum likelihood
    estimate of these parameters.
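
As a rough illustration of what such an EM iteration can look like, the sketch below performs one Baum-Welch style update for a hidden Markov model with binary outcomes. It is a generic textbook formulation, not necessarily the exact procedure in the paper; the function name `em_step`, the fixed initial distribution, and the toy data are assumptions.

```python
import math

def em_step(y, p, trans, init):
    """One EM update; p[j] = Pr(y_t = 1 | s_t = j), trans[i][j] = Pr(s_t = j | s_{t-1} = i).

    Returns (new_p, new_trans, log-likelihood under the input parameters).
    The initial distribution `init` is held fixed in this sketch.
    """
    n, K = len(y), len(p)
    emit = [[(p[j] if obs else 1 - p[j]) for j in range(K)] for obs in y]
    # Scaled forward pass: alpha[t][j] = Pr(s_t = j | y^t), c[t] = Pr(y_t | y^{t-1}).
    alpha, c = [], []
    prior = list(init)
    for t in range(n):
        joint = [prior[j] * emit[t][j] for j in range(K)]
        c.append(sum(joint))
        alpha.append([v / c[t] for v in joint])
        prior = [sum(alpha[t][i] * trans[i][j] for i in range(K)) for j in range(K)]
    # Scaled backward pass.
    beta = [[1.0] * K for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [
            sum(trans[i][j] * emit[t + 1][j] * beta[t + 1][j] for j in range(K)) / c[t + 1]
            for i in range(K)
        ]
    # E-step: smoothed state probabilities and expected transition counts.
    gamma = [[alpha[t][j] * beta[t][j] for j in range(K)] for t in range(n)]
    xi = [[0.0] * K for _ in range(K)]
    for t in range(n - 1):
        for i in range(K):
            for j in range(K):
                xi[i][j] += (alpha[t][i] * trans[i][j] * emit[t + 1][j]
                             * beta[t + 1][j] / c[t + 1])
    # M-step: re-estimate outcome and transition probabilities.
    new_p = [
        sum(gamma[t][j] * y[t] for t in range(n)) / sum(gamma[t][j] for t in range(n))
        for j in range(K)
    ]
    new_trans = [[xi[i][j] / sum(xi[i]) for j in range(K)] for i in range(K)]
    return new_p, new_trans, sum(math.log(v) for v in c)

# Made-up data: two 90%-ones segments around a 50%-ones segment.
y = ([1] * 9 + [0]) * 5 + [0, 1] * 25 + ([1] * 9 + [0]) * 5
p, trans, init = [0.8, 0.4], [[0.95, 0.05], [0.05, 0.95]], [0.5, 0.5]
history = []
for _ in range(5):
    p, trans, loglik = em_step(y, p, trans, init)
    history.append(loglik)
print(history)
```

The recorded log-likelihoods should never decrease from one iteration to the next, which is the usual sanity check for an EM implementation. EM finds a local maximum, so the result can depend on the starting values.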

10
Application to DNA Sequences
  • Bases of a sequence are viewed as the observed
    outcomes $Y_i$.
  • The possible values of the states $S_i$ are
    assumed to be fixed and finite in number.
  • Each region then has a specific state.

11
The Simplest Case
  • Think back to the binary representations
    (strong-weak hydrogen-bonding).
  • A sequence of independent binary outcomes
    $y_t \in \{0, 1\}$ which depend on the underlying states
    $s_t \in \{0, 1\}$.
  • Then the observation equation would be binomial
  • $\Pr(y_t \mid s_t = j) = p_j^{y_t} (1 - p_j)^{1 - y_t}$, $j \in \{0, 1\}$
  • where $p_0 = \Pr(y_t = 1 \mid s_t = 0)$ and $p_1 = \Pr(y_t = 1 \mid s_t = 1)$
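
The observation equation can be evaluated directly. The short sketch below compares how well each state explains a window of binary outcomes; the helper names and the example window are hypothetical, and the values 0.9 and 0.5 are the ones quoted later in the talk for the yeast example.

```python
def obs_prob(y_t, p_j):
    """Pr(y_t | s_t = j) = p_j**y_t * (1 - p_j)**(1 - y_t) for y_t in {0, 1}."""
    return p_j ** y_t * (1 - p_j) ** (1 - y_t)

def window_likelihood(window, p_j):
    """Probability of a run of independent outcomes if the state stays at j."""
    lik = 1.0
    for y_t in window:
        lik *= obs_prob(y_t, p_j)
    return lik

window = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]    # made-up outcomes
lik0 = window_likelihood(window, 0.9)      # state 0 with p_0 = 0.9
lik1 = window_likelihood(window, 0.5)      # state 1 with p_1 = 0.5
print(lik0, lik1, lik0 / lik1)
```

A window dominated by ones is far more probable under the state with the larger outcome probability, which is what lets the smoother separate regions.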

12
The Simplest Case (cont.)
  • The system equations would be
  • $\Pr(s_t = j \mid s_{t-1} = i) = \lambda_{ij}$
  • Define the transition probabilities
  • $\lambda = \Pr(s_t = 1 \mid s_{t-1} = 0)$
  • $\tau = \Pr(s_t = 0 \mid s_{t-1} = 1)$
  • Both these are assumed to be small, so states
    tend to persist.
  • In this case, the lengths of the different regions
    are geometrically distributed, with means
    equal to the reciprocals of the transition
    probabilities.
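
The geometric claim follows because a region in a given state ends at each position with the same exit probability q, so its length L satisfies Pr(L = k) = (1 - q)^(k-1) q and has mean 1/q. A small simulation can check this; the function names, seed, and chain length are arbitrary choices for this sketch.

```python
import random

def simulate_states(n, leave0, leave1, seed=0):
    """Simulate a two-state Markov chain started in state 0.

    leave0 and leave1 are the probabilities of leaving state 0 and state 1.
    """
    rng = random.Random(seed)
    s, states = 0, []
    for _ in range(n):
        states.append(s)
        leave = leave0 if s == 0 else leave1
        if rng.random() < leave:
            s = 1 - s
    return states

def run_lengths(states, value):
    """Lengths of maximal runs of `value`; a run cut off by the end is dropped."""
    runs, count = [], 0
    for s in states:
        if s == value:
            count += 1
        elif count:
            runs.append(count)
            count = 0
    return runs

states = simulate_states(200_000, 0.01, 0.01)
runs0 = run_lengths(states, 0)
mean0 = sum(runs0) / len(runs0)
print(len(runs0), mean0)   # the mean should be close to 1 / 0.01 = 100
```

With switching probabilities of 0.01, regions average about 100 bases, which shows how small transition probabilities encode the assumption that states persist.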

13
GC Clusters in Yeast mtDNA
  • The mitochondrial genome of yeast is 85 kb, which
    appears to have three distinct types of regions
    with differing GC content.
  • Intergenic segments consist of large stretches of
    DNA with less than 5% GC content intermingled
    with short stretches with greater than 50% GC
    content.
  • The genes themselves have GC content of 18% to 28%.

14
GC Clusters in Yeast mtDNA
  • Respiration-deficient mutants, ρ⁻, come about
    through a deletion of most of the wild-type DNA
    (ρ⁺).
  • The small amount left is amplified in tandem
    repeats, replicated, and maintained in the
    mitochondrion.
  • Some of the ρ⁻ mutants displace the ρ⁺ DNA in all
    the diploid descendants when mating with
    wild-type strains.
  • These are called hypersuppressive ρ⁻ genomes.

15
GC Clusters in Yeast mtDNA
  • This paper considers two hypersuppressive and two
    non-suppressive phenotypes.
  • Because the segments considered contain no
    coding regions, the binary model is used.
  • The model parameters are set at
  • $p_0 = 0.9$, $p_1 = 0.5$, $\lambda = \tau = 0.01$ (both transition probabilities)

16
Yeast mtDNA Results
  • The profiles of the hypersuppressive segments
    share a general structure of GC-rich regions and
    AT-rich regions that differ from the
    non-suppressive segments.
  • This suggests that the GC content is in some way
    related to the function of these hypersuppressive
    sequences.

17
Conclusion
  • In its simplest form, this algorithm is useful
    for studying compositional heterogeneity in DNA.
  • The stochastic models proposed are useful in
    extracting information from large and complex
    data sets such as DNA sequence data.
  • This algorithm could be used to study
    relationships between DNA primary structure and
    global organization of entire chromosomes.