Title: Modeling Sequence Conservation
1Modeling Sequence Conservation
with phylogenomic HMMs
B. Majoros
2Overview of Comparative Genome Analysis
model
Noncomparative
?
f
predicted genomic features
human ATCTCATTCGCGCATTCTGATCCGATCTATC
model
Comparative
?
f
predicted genomic features
human ATCTCATTCGCGCATTCTGATCCGATCTATC
informant genomes
chimp ATCTCATTCGCGCATTCTGATCCGATCTATC mouse
CTCTCATACGCGCCTTCTGTTCCGATGTATC dog
AAGTCATACGGGCAATCTCATGCGAACTACC chicken
GTTTAACTCTCGGATAAATATCCAGCCAACA
3Non-independence of Informants
Due to their common ancestry, the informant
sequences are not independent. We can control for
that non-independence by explicitly modeling
their dependence structure using a phylogenetic
tree
Branch lengths represent evolutionary distance,
which conflates the distinct phenomena of elapsed
time and mutation rate.
We will see later that a phylogenetic tree (or
phylogeny) can be interpreted as a special type
of Bayesian network, in which sequence
conservation probabilities are expressed as a
function of the branch lengths.
4The Utility of Sequence Conservation
Suppose we have a multiple-sequence alignment for
a genomic region of interest
Phylogenomic methods make use of the assumption
that natural selection operates more strongly on
some genomic features than others (i.e., coding
versus noncoding), resulting in a detectable bias
in sequence conservation for the features of
interest. More generally, conservation patterns
may differ between levels of DNA organization
(i.e., amino acids in coding segments, versus
individual nucleotides in conserved noncoding
elements).
5Levels of Conservation
A. fumigatus
A. nidulans
feature amino acid conservation nucleotide conservation
exon 1 100 gt 71
intron 1 14 lt 51
exon 2 98 gt 85
intron 2 29 lt 49
exon 3 97 gt 82
intron 3 9 lt 49
exon 4 96 gt 83
6Phylogenomic Gene Finding
model of gene structure
model of phylogeny
I
A1
GT
AG
Eint
A2
A3
Einit
Efin
Esng
S
I2
I1
ATG
TAG
N
a model of gene structure informed by
observed evolutionary divergence
7PhyloHMMs
A PhyloHMM is an HMM (or GHMM) in which each
state qi has an associated evolution model ?i
describing the expected patterns and rates of
evolution in the class of features represented by
that state.
A1
A2
A3
I
S
I2
I1
In practice, many states will share the same
evolution model (parameter tying). A typical
PhyloHMM will have only two evolution models one
for coding sequence and one for noncoding.
GT
AG
A1
A1
A2
A3
A2
A3
A1
S
I2
I1
S
I2
I1
A2
A3
Eint
S
I2
I1
A1
Einit
Efin
A1
A2
A3
A2
A3
A1
Esng
S
I2
I1
S
I2
I1
A2
A3
S
I2
I1
ATG
TAG
A1
A1
A2
A3
A2
A3
N
S
I2
I1
S
I2
I1
A1
A2
A3
S
I2
I1
8Recall Gene Finding with a GHMM
Recall gene finding with a GHMM involves finding
the parse ? which is most probable, given the
input sequence S
This can be further factored into a product of
emission, transition, and duration probabilities
emission
transition
duration
Methods for efficiently evaluating these terms
and extracting the optimal parse (via dynamic
programming) were described in a previous lecture.
9Incorporating Homology Evidence
Given a target sequence S and a set of homologous
informant sequences I(1),...,I(n) aligned to S,
we wish to find the most probable parse ? of S,
given S and I(1),...,I(n)
where a parse ? (Gi,di)0?iltL is a series of
feature types Gi and their corresponding lengths
di along the sequence in left-to-right order.
10Using a Phylogeny as a Bayesian Network
If the target sequence S and the informant
sequences I(1),...,I(n) are related via a known
phylogeny, then we can re-root that phylogeny to
place S at the root, and then attach matrices to
the branches to describe the mutation
probabilities between ancestral sequences and
their children
reroot
Bayesian network
phylogeny
Re-rooting at S corresponds to conditioning the
informants on S...
11Factoring the Likelihood
We can then use the re-rooted tree as a Bayesian
network in order to factor the P(I(1),...,I(n)S,?
) term
This factorization is possible because of the
common (and entirely justifiable) assumption of
conditional independence in traditional
phylogenetic models. More generally
where we have summed over all possible
assignments to the unobservables (ancestral
vertices).
12Eliminating Useless Ancestors
Any ancestor having only one child can be
eliminated algebraically
A
A
B
B
C
C
This is a form of variable elimination (recall
from the Bayesian networks lecture), and is
trivially justified
Possible only because of the conditional
indepen-dence assumption P(C B, A) P(C
B)
13Evaluating the Likelihood
Given a pre-computed alignment of the target and
informant sequences, and assuming independence
between sites (columns of the alignment), the
likelihood can be computed on a per-nucleotide
basis using a recursion known as Felsensteins
pruning algorithm (FPA)
for C(u) the children of node u, ?(u,a) the
Kronecker match function, and the augmented DNA
alphabet ?A,C,G,T,- where - denotes missing
information, typically due to a gap in the
alignment.
14Evaluating the Likelihood Efficiently
Felsensteins recurrence should be computed using
dynamic programming, for the sake of efficiency.
Lu(a) can be evaluated using bottom-up DP (using
a postorder tree traversal) or via memoization
(computing each value only once and storing it in
a memo). In either case, Lc(b) can be obtained
during all subsequent evaluations via a simple
lookup in a DP matrix
The size of the matrix is ?N, for N the number
of taxa in the tree and ?4 for the DNA
alphabet.
15Applying Felsensteins Algorithm
Using Felsensteins recursion, the conditional
likelihood for a single column j of the alignment
is given by P(I(1) j,...,I(n) jS,?) Lr(S
j) where r is root node of the tree, S j
denotes the jth symbol in the target track of the
alignment, and ? is the set of model parameters.
Evaluating the conditional likelihood of the
entire alignment (assuming independence between
sites) can be accomplished via multiplication
P(I(1),...,I(n)S,?) ?0?jltL Lr(S j) where L
is the length of the alignment.
16Modeling Feature Types
Now let us attend to ?. Recall that for the
P(?)P(S?) term we partitioned ? into a series of
states and their durations. We can do the same
for the informant term
where Ii( j) is the subsequence emitted by qi
(the ith state in ?) into the I( j) track of the
alignment, and we have again employed a
conditional independence assumption between the
features emitted by the different states in the
parse. This decomposition by state allows us to
utilize a different evolution model for different
feature types.
17Combining Terms into the Complete Formula
where the output of the ith state in ? spans the
interval (bi,ei) in the alignment. This
optimization problem can be solved efficiently
using a GHMM decoding algorithm and prefix sum
arrays
eAe1-Ab Computing the product ?b?x?e f (x)
for an arbitrary interval (b,e) within the
sequence can be achieved by simple subtraction
followed by exponentiation. Because this
operation can be performed in constant time, the
use of prefix sum arrays is very fast.
18Evaluating the Probability of a Substitution
P(cbua) is the probability of observing base b
in a child node, given we observe a base a at
that location in the parents genome. We can
model this using a matrix of substitution rates,
parameterized by the evolutionary time t that has
passed between the parent and child species
child
A C G T
parent
A C G T
19Continuous-time Markov Models
Substitution models are typically based on
continuous-time Markov models. The Markov
property for continuous-time Markov chains states
that
That is, the probability of a given substitution
is insensitive to the absolute position along the
time axis (i.e., the substitution rate is
stationary), so that time-dependent substitution
rates are simply compounded via matrix
multiplication.
From this we can derive an instantaneous rate
matrix Q from P(t), where we make use of the
obvious fact that P(0)I
20Obtaining P(t) from Q
eQt (the matrix exponential) denotes a Taylor
expansion, as shown above. In practice, we can
solve this via eigenvector decomposition
What remains is to determine Q and the branch
lengths ti of the phylogeny. We will return to
this in a few minutes.
21Constraining the Form of Q
Two key properties of rate matrices are
reversibility and transition-transversion
modeling. A reversible model is one in
which ?ijpiPij(t)pjPji(t) where pi is the
background frequency of base i. Reversibility
ensures that eigenvalues will be
real. Transition-transversion modeling simply
requires that the model be parameterized so that
transition (R?R, Y?Y) and transversion (R?Y)
rates be differentially expressible.
22Some Common Forms for Q
Jukes-Cantor
Kimura
Felsenstein
Hasegawa, Kishino, Yano
General reversible model
23Estimation of Evolutionary Parameters
Now we return to the final problem of estimating
the evolutionary parameters that determine the
substitution probabilities P(t). Consider an
evolution model ? (?, ?, Q) consisting of a
tree topology ?, a set ? ti 0?iltn of branch
lengths, and a rate matrix Q. It is reasonable to
infer the phylogeny ? first, and then to infer ?
and Q via maximum likelihood, given ?. The
UPGMA algorithm constructs phylogenetic trees
from aligned sequences, as follows 1.
Initialize a population of tree stubs, one stub
per sequence 2. Compute all pairwise sequence
edit distances between stubs 3. Iteratively
combine subtrees 4. Pick the two closest
subtrees Ti and Tj 5. Combine Ti and Tj
into a new subtree Tk 6. Remove Ti and Tj
from the population, add Tk, and recompute
distances between Tk and all other subtrees
by averaging Ti Tj
24UPGMA Example
1
2
3
4
5
1
2
6
5
Start with a population of taxa Compute all
pairwise edit distances and pick the closest pair
(3 4)
4
3
Join the closest pair with a new ancestor (6),
which replaces them in the population
Recompute distances and again combine the closest
subtrees
6
5
7
1
2
4
3
9
Repeat......
7
8
7
8
.....until only one tree remains.
1
2
6
5
1
2
6
5
4
3
4
3
25Evaluating the Accuracy of UPGMA
After simulating the evolution of a random 5000
bp sequence over a randomly generated phylogeny,
using a random HKY matrix
True phylogeny
Phylogeny inferred via UPGMA
26Improving on UPGMA
A better algorithm than UPGMA is the
Neighbor-Joining (NJ) algorithm, which follows
the same logic as UPGMA, but with different
distance formulae. Choosing the nearest trees to
merge is done via Dij
Updating of distances for a new subtree Tk is
done via
Branch lengths for the children (i, j) of new
node k are
27Neighbor-Joining (NJ) Algorithm
Applying the Neighbor-Joining algorithm (plus
arbitrary rooting), rather than UPGMA, produces a
more accuracy topology
True phylogeny
Phylogeny inferred via Neighbor-Joining
28Estimating Q and ti
Now we return to the final problem of determining
Q and the branch lengthsti
for alignment . BFGS algorithm
GNU Scientific Library, GSL routine
gsl_multimin_fdfminimizer_vector_bfgs. This
procedure requires all first partial derivatives
of the objective function, which can be evaluated
via differencingcomputing the full tree
likelihood at two nearby points and taking the
difference ?L(x)/?x ? (L(xdx)-L(x))/dx or we
can compute the partial derivatives
analytically...
29Likelihood Gradient
F fa,b
30Gradient Ascent with the GSL
struct MyObjective public GSLObjectiveFunction
virtual double f(const GSLVector
currentPoint) double
xcurrentPoint0 return
(x-3)(x-3) virtual void
gradient(const GSLVector currentPoint,GSLVect
or gradient) double
xcurrentPoint0 gradient02(x-3)
GSLVector initialPoint(1) initi
alPoint.setAllTo(0) MyObjective
f GSLOptimizer optimizer(GSLBFGS,f,initialPoi
nt,0.01,GSLBY_EITHER,
0.01,100) optimizer.run() coutltlt"optimal point
"ltltoptimizer.getOptimalPoint()ltltendl coutltlt"took
"ltltoptimizer.iterationsUsed()ltlt"
iterations"ltltendl
31Classifying Coding vs. Noncoding DNA
classification accuracy
noncoding mutation rate
Figure 9.24 Classification accuracy (y-axis,
percentages) of a 0th-order PhyloHMM for an exon
identification task. Equal numbers of coding and
noncoding segments were independently evolved
over a simulated phylogeny, using a Jukes-Cantor
model (section 9.6.4) with variable substitution
rates, and then classified via likelihood ratio
P(Scoding)/P(Snoncoding). Coding substitution
rate was fixed at 5 noncoding substitution rate
was varied from 6 to 80 (x-axis). Increasing
the noncoding substitution rate relative to the
coding rate quickly enabled the PhyloHMM to
achieve reliable discrimination between coding
and noncoding elements.
32Modeling Sequential Dependencies
AAC
?
42n matrices
?
...
?
ATG
2nd order REV 1536 free parameters
Probabilities are now conditional on some number
of symbols in preceding columns.
33Summary
- Sequence conservation patterns can differ
between genomic feature types, due to the effects
of natural selection, and these biases can be
used to improve prediction accuracy - PhyloHMMs model conservation patterns in
multi-sequence alignments via substitution rate
matrices evaluated over a phylogeny - The likelihood of a column in an alignment,
given an evolutionary model, can be evaluated via
Felsensteins pruning algorithm - Dependences between columns can be modeled using
higher-order rate matrices