Title: Uncovering evolutionary history: new methods for inferring phylogenies
1Uncovering evolutionary history: new methods for inferring phylogenies
Tom Nye, Wally Gilks (MRC BSU)
Pietro Liò (Computer Laboratory, Cambridge)
- RSS Manchester
- 12 October 2005
2Winding Back the Evolutionary Clock
- Biologists want to reconstruct the evolutionary history of genes, genomes, and species
- Evolutionary history helps us to understand the genomes we see today
- Phylogenetic trees represent evolutionary relationships between species and are a vital ingredient in many biological analyses
3Today's Talk
- An introduction to phylogeny
- Maximum likelihood and distance-matrix frameworks
- A new distance-matrix approach
4Evolutionary Trees
- Evolutionary relationships can be represented by trees, called phylogenies
- Leaf nodes are extant species
- Internal nodes are speciation events
- Branch lengths show evolutionary distance
[Figure: example phylogeny with leaves Moa, Cassowary, Emu, Kiwi, Ostrich, Rhea]
5Biological Data
6Sequence Evolution
AAGCTGATC
7Models of Nucleotide Substitution
- Model sites along the DNA string as evolving independently
- Continuous-time Markov chain with states A, C, G, T
- Define
  - $P_{ij}(t) = \Pr(\text{in state } j \text{ at time } t \mid \text{in state } i \text{ at } t = 0)$
- So that
  - $P(t) = \exp(\mu t Q)$
- where
  - $Q$ is the instantaneous rate matrix
  - $\mu$ is the rate of mutation events; $\mu t$ represents branch length
- Various models are available for $Q$
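As a concrete case of the transition matrix defined above, the following sketch computes the matrix exponential for one particular choice of Q. It assumes the Jukes-Cantor model (all substitutions equally likely), writes the mutation rate as `mu`, and scales Q so that one unit of `mu * t` is one expected substitution per site. These choices and all function names are illustrative assumptions, not taken from the talk.

```python
import math

def jc_rate_matrix():
    """Jukes-Cantor Q (assumed model), scaled to one expected substitution per unit time."""
    return [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(m, squarings=10, terms=12):
    """exp(m) by scaling and squaring with a truncated Taylor series."""
    n = len(m)
    scale = 2.0 ** squarings
    small = [[x / scale for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def transition_matrix(mu, t, q=None):
    """P(t) = exp(mu * t * Q); rows and columns are ordered A, C, G, T."""
    q = jc_rate_matrix() if q is None else q
    return mat_exp([[mu * t * x for x in row] for row in q])

def jc_closed_form(mu, t):
    """Known closed form for Jukes-Cantor, used here as a cross-check."""
    e = math.exp(-4.0 * mu * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]
```

Each row of `transition_matrix(mu, t)` is a probability distribution over the end state, and for large `mu * t` every entry approaches 1/4, which is why long branches carry little information.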
8Molecular Clocks
- Branch lengths represent evolutionary distance (typically number of nucleotide substitutions)
- Rates of change may vary between branches
- Molecular clock: no rate variation
[Figure: two trees drawn against a time axis t, labelled "No Clock" and "With Clock"]
9Tree Likelihood
- Given a tree topology and branch lengths,
evaluate the likelihood of the tree under the
substitution model
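[Figure: tree with leaf sequences AAGCGCATG, AACCTCATC, TAGCCGTTC, TAGCCCTGC, GAGCAGTTC]
$\text{Likelihood} = \Pr(G, T, C, C, A \mid \text{Tree})$
$= \sum_{x} \sum_{y} \sum_{z} \sum_{w} \Pr(w)\, \Pr(z \mid w)\, \Pr(A \mid z)\, \Pr(C \mid z) \times \Pr(y \mid w)\, \Pr(C \mid y) \times \Pr(x \mid y)\, \Pr(G \mid x)\, \Pr(T \mid x)$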
- Full likelihood is obtained by taking product
over nucleotide sites
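The single-site likelihood above can be evaluated either by summing over all internal states directly or by Felsenstein's pruning recursion, which gives the same value at lower cost. The sketch below does both for the fifth alignment column (G, T, C, C, A). The Jukes-Cantor model, the uniform root distribution, the common branch length of 0.1, and the reading of the topology as w = (z, y), z = (A, C), y = (C, x), x = (G, T) are assumptions made for illustration.

```python
import math
from itertools import product

STATES = "ACGT"
ALIGNMENT = ["AAGCGCATG", "AACCTCATC", "TAGCCGTTC", "TAGCCCTGC", "GAGCAGTTC"]
# Assumed topology, as nested tuples of sequence indices:
# root w = (z, y), z = (seq 5, seq 4), y = (seq 3, x), x = (seq 1, seq 2)
TOPOLOGY = ((4, 3), (2, (0, 1)))

def jc_prob(a, b, t):
    """Jukes-Cantor probability of state b after time t, starting from state a."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def brute_force(col, t=0.1):
    """Sum over internal states w, z, y, x, term by term as in the formula."""
    total = 0.0
    for w, z, y, x in product(STATES, repeat=4):
        total += (0.25  # Prob(w): uniform root distribution (assumed)
                  * jc_prob(w, z, t) * jc_prob(z, col[4], t) * jc_prob(z, col[3], t)
                  * jc_prob(w, y, t) * jc_prob(y, col[2], t)
                  * jc_prob(y, x, t) * jc_prob(x, col[0], t) * jc_prob(x, col[1], t))
    return total

def label(tree, col):
    """Replace sequence indices in the topology by the characters of one column."""
    if isinstance(tree, int):
        return col[tree]
    return tuple(label(child, col) for child in tree)

def pruning(tree, t=0.1):
    """Felsenstein pruning; a node is a leaf character or a tuple of subtrees."""
    def partial(node):
        if isinstance(node, str):
            return {s: 1.0 if s == node else 0.0 for s in STATES}
        kids = [partial(child) for child in node]
        return {s: math.prod(sum(jc_prob(s, r, t) * k[r] for r in STATES)
                             for k in kids)
                for s in STATES}
    root = partial(tree)
    return sum(0.25 * root[s] for s in STATES)

def log_likelihood(alignment, topology, t=0.1):
    """Sites are independent, so per-site log-likelihoods add."""
    return sum(math.log(pruning(label(topology, col), t))
               for col in zip(*alignment))
```

Summing logs rather than multiplying raw likelihoods avoids numerical underflow when the alignment has many sites.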
10Likelihood Maximization
- We can search for the maximum likelihood tree
- Pick an initial topology
- Find the optimal set of branch lengths
- Is this the highest likelihood we have seen?
- Pick a new topology
- Tree space is huge: the search is computationally intensive
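To give a sense of how large tree space is, the sketch below counts binary tree topologies on n labelled leaves using the standard double-factorial formulas: (2n-5)!! unrooted and (2n-3)!! rooted. The function names are illustrative.

```python
def num_unrooted_topologies(n):
    """Number of unrooted binary trees on n >= 3 labelled leaves: (2n-5)!!."""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # 3 * 5 * ... * (2n - 5)
        count *= k
    return count

def num_rooted_topologies(n):
    """Number of rooted binary trees on n >= 2 labelled leaves: (2n-3)!!."""
    count = 1
    for k in range(3, 2 * n - 2, 2):  # 3 * 5 * ... * (2n - 3)
        count *= k
    return count

if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        print(n, num_unrooted_topologies(n))
```

With 10 leaves there are already about two million unrooted topologies, so exhaustive search is only feasible for very small trees and heuristic moves between topologies are needed.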
11Distance-Matrix Approaches
- Given a matrix of evolutionary distances, estimate the tree that gave rise to those distances
[Figure: "Tree unknown" alongside "Distance matrix known"]
- Even if we knew the underlying tree, in practice any observed distance matrix will not match it perfectly
  - branch length gives the expected number of substitutions at one site
  - the distance matrix is unlikely to match any tree
12Comments
- Distance matrices
  - The distance matrix summarizes the information in the full sequence data set
  - Data loss is problematic for widely diverged sequences
  - The distance matrix is obtained from sequence data using a substitution model: there are many ways to do this
- Comparison with likelihood
  - Distance matrix methods are less sophisticated...
  - ... but they are much faster!
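One of the many ways to turn sequence data into a distance matrix is the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites. The sketch below uses that model as an assumed example; the function names are illustrative.

```python
import math

def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jc_distance(a, b):
    """Jukes-Cantor estimate of the expected substitutions per site."""
    p = p_distance(a, b)
    if p >= 0.75:
        # Saturation: the sequences look random with respect to each other,
        # so the model cannot give a finite estimate.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(seqs):
    """Symmetric matrix of pairwise Jukes-Cantor distances."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = jc_distance(seqs[i], seqs[j])
    return d
```

The saturation branch illustrates the data-loss point: once sequences have diverged widely, very different evolutionary histories collapse onto the same, uninformative distance.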
13Least Squares Fitting
- Suppose we are given a tree topology and a distance matrix: how would we find branch lengths on the tree?
- For two leaves i,j denote
  - true distances on tree: tij
  - observed distances: dij
- Assume that observed distances are unbiased estimates of the true distances
- Use branch lengths, and hence path lengths tij, that minimize the error term
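A minimal sketch of the fitting step, assuming the error term is the unweighted sum of squares, the sum over pairs i < j of (dij - tij)^2; the talk's exact error term is not given in the text and may carry weights. Each tij is a sum of branch lengths along a path, so the problem is linear least squares. The example hard-codes a four-leaf tree with leaves 0,1 on one side of the internal branch and 2,3 on the other; the names `PATHS` and `fit_branch_lengths` are hypothetical.

```python
import numpy as np

# Rows: leaf pairs (0,1), (0,2), (0,3), (1,2), (1,3), (2,3).
# Columns: branches to leaves 0, 1, 2, 3, then the internal branch.
# An entry is 1 when that branch lies on the path between the pair.
PATHS = np.array([
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
], dtype=float)

def fit_branch_lengths(observed):
    """Return (branch lengths, squared error) minimizing sum (d_ij - t_ij)^2."""
    observed = np.asarray(observed, dtype=float)
    branches, *_ = np.linalg.lstsq(PATHS, observed, rcond=None)
    fitted = PATHS @ branches  # the path lengths t_ij implied by the fit
    return branches, float(np.sum((observed - fitted) ** 2))
```

When the observed distances are exactly tree-like the error is zero and the true branch lengths are recovered; with noisy distances the fit can produce negative branch lengths, which this sketch does not constrain.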
14Neighbour Joining
- Neighbour Joining (NJ) is defined by an agglomerative algorithm
  - Start with star-like topology
  - Consider joining every pair of nodes i,j in turn
  - Calculate branch length estimates via least squares
  - Join the pair i,j that minimizes the sum of branch lengths
  - Replace nodes i,j by a single node u
  - Continue...
[Figure: star-like tree in which nodes i and j are joined and replaced by a single node u]
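The steps above can be sketched in the standard textbook form of NJ, which picks the pair minimizing Q(i,j) = (n-2) dij - ri - rj, where ri is the sum of distances from i to all other nodes. This criterion selects the same pair as minimizing the estimated total branch length. The function name, the return format, and the example matrix are illustrative and not taken from the talk.

```python
def neighbour_joining(matrix, names):
    """Return (joins, final_edge); each join is (i, j, length_i, length_j, new_node)."""
    dist = {(a, b): matrix[p][q]
            for p, a in enumerate(names) for q, b in enumerate(names)}
    nodes = list(names)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        row_sum = {a: sum(dist[a, b] for b in nodes) for a in nodes}
        pairs = [(a, b) for k, a in enumerate(nodes) for b in nodes[k + 1:]]
        # Pair whose join gives the smallest estimated total branch length.
        i, j = min(pairs, key=lambda p: (n - 2) * dist[p]
                   - row_sum[p[0]] - row_sum[p[1]])
        # Branch lengths from i and j to the new node u.
        length_i = 0.5 * dist[i, j] + (row_sum[i] - row_sum[j]) / (2 * (n - 2))
        length_j = dist[i, j] - length_i
        u = f"({i},{j})"
        # Distances from u to every remaining node.
        for k in nodes:
            if k not in (i, j):
                dist[u, k] = dist[k, u] = 0.5 * (dist[i, k] + dist[j, k] - dist[i, j])
        dist[u, u] = 0.0
        nodes = [k for k in nodes if k not in (i, j)] + [u]
        joins.append((i, j, length_i, length_j, u))
    a, b = nodes
    return joins, (a, b, dist[a, b])

# Small example matrix (tree-like by construction).
EXAMPLE = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
joins, last = neighbour_joining(EXAMPLE, ["a", "b", "c", "d", "e"])
```

Ties in the criterion are broken by taking the first pair in iteration order, which is an arbitrary choice in this sketch.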
15Comments
- NJ is hard to justify statistically...
- ... but it works surprisingly well!
- Recent improvements to the algorithm have not
introduced a thorough statistical framework
16Our Methodology
- New distance-matrix method for constructing phylogenies
- Motivated by the example of gene families but also applies to species trees
- Essential ingredients
  - Distribution-free, moment-based approach
  - Handles variance/covariance of distances more thoroughly than existing distance-matrix methods
17Motivation: Families of Paralogs
- Certain genes have many copies within the same genome
  - Examples: olfactory receptors, proteases, kinases
- Appear to have evolved through duplications of individual genes, clusters of genes, and rearrangements within gene clusters
- Phylogenetic tree for these genes corresponds to the history of gene duplication
- Could we construct a more sophisticated history?
  - Block duplications of more than one gene
  - A history of linear arrangement along the genome
18Assumptions (1)
- Molecular clock setting: necessary in order to consider events in which more than one gene is duplicated
- In a block duplication, two or more genes are copied at the same time
- The observed distances dij are the result of a random process perturbing the underlying true tree T
[Figure: true tree T and observed distances D on leaves i, j, k, l]
- dij denotes distance between i,j in D
- tij denotes path length between i,j in T
19Assumptions (2)
- The observed distances dij are the result of a random process perturbing the underlying true tree T, that satisfies a set of moment conditions
[Figure: tree T with leaves i, j, k, l]
- Note that we do not need a complete description of the perturbation process: we just consider moments
20Building trees
- Adopt an agglomerative approach: winding back the clock
  - Try joining every pair of nodes
  - Can also consider block joins
  - Pick the best join according to some statistical score
  - Update the set of distances between nodes
  - Continue recursively
- We need to specify
  - How to score join events
  - How to fix the time for a join
  - How to update the distances at each iteration
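To make the three components concrete, here is a sketch of the generic agglomerative loop under a molecular clock. It is not the method of this talk: as a stand-in it uses the classical UPGMA rules, where the score is the smallest distance, the join time is half that distance, and distances are updated by size-weighted averaging. Block joins are not covered, and all names are illustrative.

```python
def upgma(matrix, names):
    """Return joins as (node_a, node_b, join_time), most recent join first."""
    dist = {(a, b): matrix[p][q]
            for p, a in enumerate(names) for q, b in enumerate(names)}
    size = {name: 1 for name in names}
    nodes = list(names)
    joins = []
    while len(nodes) > 1:
        pairs = [(a, b) for k, a in enumerate(nodes) for b in nodes[k + 1:]]
        # 1. Score join events: UPGMA simply prefers the closest pair.
        a, b = min(pairs, key=lambda p: dist[p])
        # 2. Fix the time for the join: under a clock, half the distance.
        time = dist[a, b] / 2.0
        u = f"({a},{b})"
        # 3. Update the distances: size-weighted average over the two clusters.
        for k in nodes:
            if k not in (a, b):
                merged = (size[a] * dist[a, k] + size[b] * dist[b, k]) / (size[a] + size[b])
                dist[u, k] = dist[k, u] = merged
        size[u] = size[a] + size[b]
        nodes = [k for k in nodes if k not in (a, b)] + [u]
        joins.append((a, b, time))
    return joins
```

Replacing the three numbered rules with a statistical score, a join-time estimate, and a distance update that tracks variances and covariances would keep the same loop structure.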
21Scoring Joins (1)
- Suppose we have constructed T as far back as some
time t. What is the covariance matrix for
distances between nodes?
[Figure: partial tree Tt, constructed back to time t, with node i]
22Scoring Joins (2)
[Figure: partial tree Tt with nodes i and j, time axis t, and label "Time ti"]
- Next consider the expectation and variance of distances between the black nodes
23Scoring Joins (3)
- Score tree using the goodness-of-fit of the calculated distances dt to expected distances
- Under suitable asymptotic assumptions this is a $\chi^2$ statistic
- The distance vector dt is $n \times n$ so the covariance matrix is $n^2 \times n^2$ and inverting it is potentially $O(n^6)$
- However, it can be inverted algebraically in $O(n^2)$ steps, and the score evaluated in $O(n)$ steps
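For reference, a goodness-of-fit score of this kind has the generic quadratic form (d - m)^T S^{-1} (d - m), where m is the vector of expected distances and S is the covariance matrix. The sketch below evaluates that form with a general linear solve, which costs cubic time in the length of the vector. It does not reproduce the algebraic inversion mentioned above, and the function and argument names are assumptions.

```python
import numpy as np

def chi_square_score(observed, expected, cov):
    """Quadratic-form goodness of fit: (d - m)^T S^{-1} (d - m)."""
    resid = np.asarray(observed, dtype=float) - np.asarray(expected, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # Solve S x = resid rather than forming the inverse explicitly.
    return float(resid @ np.linalg.solve(cov, resid))
```

For a vector of length n^2 the general solve is the O(n^6) cost quoted above, which is what makes a structured, algebraic inverse attractive.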
24Results
- Construction of large trees
- Comparison with other methods (NJ) in progress...
- Issues
  - As a purely phylogenetic method it is held back by the molecular clock assumption
  - It is not a complete approach to inferring historical arrangements of paralogous genes, although it can incorporate duplication of more than one gene at a time
25Conclusions
- Full probabilistic models for constructing phylogenies are unsuitable if there are many leaves
- Existing distance-matrix methods could be improved upon
- We have a new distance-matrix approach that improves upon standard approaches to variance / covariance
- Future Work
  - Could we build our approach to covariance into a setting with no molecular clock?
  - Can we develop approaches that combine phylogenetic and arrangement information to build evolutionary histories?