Title: Stochastic Models For Heterogeneous DNA Sequences

1
Stochastic Models For Heterogeneous DNA Sequences

Churchill, G.A.
Bulletin of Mathematical Biology
Vol. 51, pp. 79-94, 1989.

Presented By Matthew McCall
2
DNA Representation

A typical representation of a DNA sequence is a
single strand written from the 5' to the 3' end.
Consider two binary representations: the
purine-pyrimidine (AG-TC) and the strong-weak
hydrogen-bonding (GC-AT).
Together these two yield the same information as
the single strand representation.
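
As a minimal sketch of why the two binary tracks carry the same information as the strand itself, the snippet below encodes a sequence into both representations and decodes it back. The function names and the choice of which class is labeled 1 are illustrative assumptions, not taken from the paper.

```python
# Two binary encodings of a DNA strand.
# Assumption: purine = 1 / pyrimidine = 0 and strong = 1 / weak = 0;
# the slides do not say which class gets which label.
PURINES = set("AG")
STRONG = set("GC")

# Each (purine bit, strong bit) pair identifies exactly one base.
DECODE = {(1, 1): "G", (1, 0): "A", (0, 1): "C", (0, 0): "T"}


def encode(seq):
    """Return (purine_track, strong_track) as two lists of 0/1."""
    seq = seq.upper()
    purine = [1 if base in PURINES else 0 for base in seq]
    strong = [1 if base in STRONG else 0 for base in seq]
    return purine, strong


def decode(purine, strong):
    """Rebuild the strand from the two binary tracks."""
    return "".join(DECODE[(p, s)] for p, s in zip(purine, strong))
```

Calling `decode(*encode(seq))` returns the original strand for any sequence over A, C, G, T, which is the sense in which the two tracks together are lossless.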

3
Stochastic Modeling

Stochastic Process - A non-deterministic process
in that the next state of the environment is not
fully determined by the previous state of the
environment.
Stochastic modeling is used here as a method for
extracting information from the data.
It doesn't try to mimic the process followed in
nature, which is far too complex for any simple
model.

4
Markov Chains

Markov Chains have traditionally been used to
model sequences.
They work well when the sequence composition is
fairly homogeneous but not when the composition
is heterogeneous.
Compositional variation between segments of the
sequence is likely to reflect functional or
structural differences.

5
Proposed Solution

Sequences are assumed to be composed of
homogeneous regions, which differ from one
another.
As such, each region can be classified into one
of a finite number of states.
These states represent the underlying structure
of the sequence in that region and are assumed to
develop according to a hidden Markov process.

6
The General Case

A sequence of random variables $Y_i$, $i = 1, \ldots, n$
Corresponding unobservable states $S_i$
Denote the sequence of observed outcomes up to
time t by $y^t = (y_1, \ldots, y_t)$
And similarly the states $s^t = (s_1, \ldots, s_t)$
Then the probability of an observation given the
current state and past observations is
$\Pr(y_t \mid s_t, y^{t-1})$
This is called the observation equation.

7
The General Case (cont.)

The sequence of states, $s^t$, cannot be observed;
however, the states can be inferred from the
observations.
The states are assumed to evolve according to a
set of system equations with the Markov property
$\Pr(s_t \mid s^{t-1}) = \Pr(s_t \mid s_{t-1})$
The problem addressed here is how to estimate the
states $S_i$ given the observations $Y_i$.
The result, the smoothed estimate at time t, will
be denoted $\Pr(s_t \mid y^n)$.

8
Computing $\Pr(s_t \mid y^n)$

$\Pr(s_t \mid y^n) = \Pr(s_t \mid y^t) \int \frac{\Pr(s_{t+1} \mid y^n) \, \Pr(s_{t+1} \mid s_t)}{\Pr(s_{t+1} \mid y^t)} \, ds_{t+1}$
We can compute each of these terms, so the
smoothed estimate at time t can be expressed in
terms of the system equations, $\Pr(s_{t+1} \mid s_t)$,
quantities derived from filtering, $\Pr(s_t \mid y^t)$ and
$\Pr(s_{t+1} \mid y^t)$, and the smoothed estimate at time
t+1, $\Pr(s_{t+1} \mid y^n)$.
So we can employ a recursive algorithm to compute
the smoothed estimate at each time t.
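
The recursion can be sketched for a finite number of states, where the integral over the next state becomes a sum. This is a minimal illustration, not the paper's code: the function name, the argument layout, the binary 0/1 observations, and the initial state distribution `init` are all assumptions made for the example, and every transition probability is assumed to be strictly positive so that no division by zero occurs.

```python
def filter_and_smooth(y, p, trans, init):
    """Forward filter and backward smoother for a finite-state hidden chain.

    y     : list of 0/1 observations
    p     : p[j] = Pr(y_t = 1 | s_t = j)
    trans : trans[i][j] = Pr(s_t = j | s_{t-1} = i), all entries > 0 (assumed)
    init  : init[j] = Pr(s_1 = j) (assumed known for this sketch)
    Returns (filtered, smoothed): per-position distributions over states.
    """
    n, k = len(y), len(init)

    def obs(j, yt):
        # Observation probability Pr(y_t | s_t = j) for a binary outcome.
        return p[j] if yt == 1 else 1.0 - p[j]

    predicted = []  # predicted[t][j] = Pr(s_t = j | y^{t-1})
    filtered = []   # filtered[t][j]  = Pr(s_t = j | y^t)
    prior = list(init)
    for t in range(n):
        predicted.append(prior)
        joint = [prior[j] * obs(j, y[t]) for j in range(k)]
        total = sum(joint)
        post = [v / total for v in joint]
        filtered.append(post)
        # One-step-ahead prediction through the system equations.
        prior = [sum(post[i] * trans[i][j] for i in range(k)) for j in range(k)]

    # Backward pass: the smoothed estimate at t uses the smoothed estimate at t+1.
    smoothed = [None] * n
    smoothed[-1] = filtered[-1]
    for t in range(n - 2, -1, -1):
        smoothed[t] = [
            filtered[t][i] * sum(
                smoothed[t + 1][j] * trans[i][j] / predicted[t + 1][j]
                for j in range(k)
            )
            for i in range(k)
        ]
    return filtered, smoothed
```

The last position needs no backward step because its filtered and smoothed estimates coincide; every earlier position is then filled in from right to left.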

9
Parameter Estimation

The previous algorithm requires that the
parameters of the observation and system
equations be specified.
These parameters are typically estimated from the
data.
The paper suggests an EM algorithm for
determining the approximate maximum likelihood
estimate of these parameters.
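
The slides do not give the details of the paper's EM algorithm, so the sketch below substitutes the standard Baum-Welch update for a hidden Markov chain with binary observations. It should be read as an illustration of the EM idea under that assumption, not as the paper's procedure. The function name `em_step` is hypothetical, and the initial state distribution `init` is held fixed rather than re-estimated.

```python
import math


def em_step(y, p, trans, init):
    """One Baum-Welch (EM) update for a binary-observation hidden Markov chain.

    y     : list of 0/1 observations
    p     : p[j] = Pr(y_t = 1 | s_t = j)
    trans : trans[i][j] = Pr(s_t = j | s_{t-1} = i)
    init  : init[j] = Pr(s_1 = j), kept fixed in this sketch
    Returns (new_p, new_trans, loglik); loglik is the log-likelihood
    of y under the *input* parameters.
    """
    n, k = len(y), len(init)
    obs = [[p[j] if yt == 1 else 1.0 - p[j] for j in range(k)] for yt in y]

    # Scaled forward pass: alpha[t][j] = Pr(s_t = j | y^t).
    alpha, scale = [], []
    prior = list(init)
    for t in range(n):
        joint = [prior[j] * obs[t][j] for j in range(k)]
        c = sum(joint)  # Pr(y_t | y^{t-1})
        scale.append(c)
        alpha.append([v / c for v in joint])
        prior = [sum(alpha[t][i] * trans[i][j] for i in range(k)) for j in range(k)]
    loglik = sum(math.log(c) for c in scale)

    # Scaled backward pass.
    beta = [[1.0] * k for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for i in range(k):
            beta[t][i] = sum(
                trans[i][j] * obs[t + 1][j] * beta[t + 1][j] for j in range(k)
            ) / scale[t + 1]

    # E-step: state posteriors (gamma) and expected transition counts (xi).
    gamma = [[alpha[t][j] * beta[t][j] for j in range(k)] for t in range(n)]
    xi = [[0.0] * k for _ in range(k)]
    for t in range(n - 1):
        for i in range(k):
            for j in range(k):
                xi[i][j] += (alpha[t][i] * trans[i][j] * obs[t + 1][j]
                             * beta[t + 1][j] / scale[t + 1])

    # M-step: re-estimate observation and transition probabilities.
    new_p = [
        sum(g[j] for g, yt in zip(gamma, y) if yt == 1) / sum(g[j] for g in gamma)
        for j in range(k)
    ]
    new_trans = [[xi[i][j] / sum(xi[i]) for j in range(k)] for i in range(k)]
    return new_p, new_trans, loglik
```

Repeating `em_step` from a starting guess should give a log-likelihood that never decreases from one iteration to the next, which is a convenient check that the update is working.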

10
Application to DNA Sequences

Bases of a sequence are viewed as the observed
outcomes $Y_i$.
The states $S_i$ are assumed to be fixed and
finite in number.
Each region then has a specific state.

11
The Simplest Case

Think back to the binary representations
(strong-weak hydrogen-bonding).
A sequence of independent binary outcomes
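$y_t \in \{0, 1\}$ which depend on the underlying states
$s_t \in \{0, 1\}$.
Then the observation equation would be binomial
$\Pr(y_t \mid s_t = j) = p_j^{y_t} (1 - p_j)^{1 - y_t}$, $j \in \{0, 1\}$
where $p_0 = \Pr(y_t = 1 \mid s_t = 0)$ and $p_1 = \Pr(y_t = 1 \mid s_t = 1)$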

12
The Simplest Case (cont.)

The system equations would be
$\Pr(s_t = j \mid s_{t-1} = i)$, $i, j \in \{0, 1\}$
The two probabilities of switching state are
$\Pr(s_t = 1 \mid s_{t-1} = 0)$ and
$\Pr(s_t = 0 \mid s_{t-1} = 1)$
Both these are assumed to be small, so states
tend to persist.
In this case, the size of the different regions
will have a geometric distribution with means
equal to the reciprocals of the transition
probabilities.
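
The geometric claim can be checked numerically. If the chain leaves its current state with a fixed probability at each step, a region lasts exactly k positions with probability (1 - leave_prob)^(k-1) * leave_prob, and the mean of that distribution is 1 / leave_prob. The names `region_length_pmf`, `mean_region_length`, and `leave_prob` are hypothetical, introduced only for this sketch, and the sum is truncated at `max_k`, which assumes `leave_prob` is not extremely small.

```python
def region_length_pmf(k, leave_prob):
    """Pr(region length = k): stay k - 1 times, then leave once."""
    return (1.0 - leave_prob) ** (k - 1) * leave_prob


def mean_region_length(leave_prob, max_k=5000):
    """Mean region length from a truncated sum over k = 1..max_k.

    Assumption: leave_prob is large enough that the tail beyond max_k
    is negligible (true for values around 0.01 or larger).
    """
    return sum(k * region_length_pmf(k, leave_prob) for k in range(1, max_k + 1))
```

With a switching probability of 0.01 the truncated sum agrees with the closed form 1 / 0.01, so regions are about 100 positions long on average.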

13
GC Clusters in Yeast mtDNA

The mitochondrial genome of yeast is 85 kb and
appears to have three distinct types of regions
with differing GC content.
Intergenic segments consist of large stretches of
DNA with less than 5% GC content intermingled
with short stretches with greater than 50% GC
content.
The genes themselves have GC content of 18% to 28%.

14
GC Clusters in Yeast mtDNA

Respiration-deficient mutants, ρ-, come about
through a deletion of most of the wild-type DNA
(ρ+).
The small amount left is amplified in tandem
repeats, replicated, and maintained in the
mitochondrion.
Some of the ρ- mutants displace the ρ+ DNA in all
the diploid descendants when mating with
wild-type strains.
These are called hypersuppressive ρ- genomes.

15
GC Clusters in Yeast mtDNA

This paper considers two hypersuppressive and two
non-suppressive phenotypes.
Because the segments they consider contain no
coding regions, they use the binary model.
The model parameters are set at
$p_0 = 0.9$, $p_1 = 0.5$, and both transition
probabilities equal to 0.01

16
Yeast mtDNA Results

The profiles of the hypersuppressive segments
share a general structure of GC-rich regions and
AT-rich regions that differ from the
non-suppressive segments.
This suggests that the GC content is in some way
related to the function of these hypersuppressive
sequences.

17
Conclusion

In its simplest form, this algorithm is useful
for studying compositional heterogeneity in DNA.
The stochastic models proposed are useful in
extracting information from large and complex
data sets such as DNA sequence data.
This algorithm could be used to study
relationships between DNA primary structure and
global organization of entire chromosomes.
