Uncovering evolutionary history: new methods for inferring phylogenies

1
Uncovering evolutionary history: new methods for
inferring phylogenies
Tom Nye, Wally Gilks (MRC BSU)
Pietro Liò (Computer Laboratory, Cambridge)

RSS Manchester
12 October 2005

2
Winding Back the Evolutionary Clock

Biologists want to reconstruct the evolutionary
history of genes, genomes, and species
Evolutionary history helps us to understand the
genomes we see today
Phylogenetic trees represent evolutionary
relationships between species and are a vital
ingredient in many biological analyses

3
Today's Talk

An introduction to phylogeny
Maximum likelihood and distance-matrix frameworks
A new distance-matrix approach

4
Evolutionary Trees

Evolutionary relationships can be represented by
trees, called phylogenies
Leaf nodes are extant species
Internal nodes are speciation events
Branch lengths show evolutionary distance

Figure: example phylogeny with leaves Moa, Cassowary, Emu, Kiwi, Ostrich, Rhea
5
Biological Data
6
Sequence Evolution
AAGCTGATC
7
Models of Nucleotide Substitution

Model sites along the DNA string as evolving
independently
Continuous time Markov chain with states A,C,G,T
Define
$P_{ij}(t) = \Pr(\text{in state } j \text{ at time } t \mid \text{in state } i \text{ at } t = 0)$
So that
$P(t) = \exp(\mu t Q)$
where
$Q$ is the instantaneous rate matrix
$\mu$ is the rate of mutation events, $\mu t$ represents
branch length
Various models available for Q
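
As a minimal sketch of the relation between the rate matrix and the transition probabilities, the example below exponentiates a Jukes-Cantor rate matrix. Jukes-Cantor is only one possible choice of Q, assumed here for illustration and scaled to one expected substitution per unit time. The function names, the Taylor-series exponential and the branch length of 0.5 are all hypothetical choices, not taken from the talk.

```python
import math

STATES = "ACGT"

def jukes_cantor_q():
    """Rate matrix Q: every off-diagonal rate is 1/3, so rows sum to zero."""
    return [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]

def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(m, terms=20, squarings=10):
    """exp(m) by scaling and squaring with a truncated Taylor series."""
    n = len(m)
    scaled = [[x / 2 ** squarings for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term now holds scaled^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def transition_matrix(branch_length):
    """P(t) = exp(branch_length * Q), where branch_length stands for rate * time."""
    q = jukes_cantor_q()
    return mat_exp([[branch_length * x for x in row] for row in q])

p = transition_matrix(0.5)
for state, row in zip(STATES, p):
    print(state, [round(x, 4) for x in row])
```

Each row of the result is a probability distribution over the four nucleotides, and longer branch lengths push every row towards the uniform distribution.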

8
Molecular Clocks

Branch lengths represent evolutionary distance
(typically number of nucleotide substitutions)
Rates of change may vary between branches
Molecular clock: no rate variation

Figure: example trees against a time axis t, labelled No Clock and With Clock
9
Tree Likelihood

Given a tree topology and branch lengths,
evaluate the likelihood of the tree under the
substitution model

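AAGCGCATG AACCTCATC TAGCCGTTC TAGCCCTGC GAGCAGTTC
$$
\begin{aligned}
\text{Likelihood} &= \Pr(G,T,C,C,A \mid \text{Tree}) \\
&= \sum_{x}\sum_{y}\sum_{z}\sum_{w} \Pr(w)\,\Pr(z \mid w)\,\Pr(A \mid z)\,\Pr(C \mid z) \\
&\qquad \times \Pr(y \mid w)\,\Pr(C \mid y) \\
&\qquad \times \Pr(x \mid y)\,\Pr(G \mid x)\,\Pr(T \mid x)
\end{aligned}
$$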

Full likelihood is obtained by taking product
over nucleotide sites
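
The sum over the internal states x, y, z, w can be evaluated without enumerating every combination. The sketch below uses Felsenstein's pruning algorithm, a standard technique that the slide does not name, on the five-leaf tree implied by the factorization above, and checks it against the brute-force sum. The Jukes-Cantor model, the branch lengths of 0.1 and the uniform distribution at the root w are assumptions made purely for illustration.

```python
import math
from itertools import product

STATES = "ACGT"

def jc69(a, b, t):
    """Jukes-Cantor probability of ending in state b after branch length t, starting in a."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

# A leaf is a nucleotide string; an internal node is a list of (child, branch_length).
X = [("G", 0.1), ("T", 0.1)]
Y = [("C", 0.1), (X, 0.1)]
Z = [("A", 0.1), ("C", 0.1)]
W = [(Z, 0.1), (Y, 0.1)]  # root

def partial(node):
    """Return {state: Prob(leaves below node | node is in state)}."""
    if isinstance(node, str):
        return {s: 1.0 if s == node else 0.0 for s in STATES}
    out = {s: 1.0 for s in STATES}
    for child, t in node:
        below = partial(child)
        for s in STATES:
            out[s] *= sum(jc69(s, c, t) * below[c] for c in STATES)
    return out

def site_likelihood(root):
    """Pruning: combine the root partial likelihoods with a uniform root distribution."""
    root_partial = partial(root)
    return sum(0.25 * root_partial[s] for s in STATES)

def brute_force(t=0.1):
    """Direct sum over all 4^4 assignments of the internal states x, y, z, w."""
    total = 0.0
    for x, y, z, w in product(STATES, repeat=4):
        total += (0.25 * jc69(w, z, t) * jc69(z, "A", t) * jc69(z, "C", t)
                  * jc69(w, y, t) * jc69(y, "C", t)
                  * jc69(y, x, t) * jc69(x, "G", t) * jc69(x, "T", t))
    return total

print(site_likelihood(W), brute_force())
```

The two numbers agree; the pruning version scales linearly with the number of leaves, whereas the brute-force sum grows exponentially with the number of internal nodes.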

10
Likelihood Maximization

We can search for the maximum likelihood tree
Pick an initial topology
Find the optimal set of branch lengths
Is this the highest likelihood we have seen?
Pick a new topology

Tree space is huge: the search is computationally
intensive
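
To give a sense of scale, the number of unrooted bifurcating topologies on n labelled leaves is the double factorial (2n - 5)!!. The short sketch below evaluates it; the function name is hypothetical.

```python
def unrooted_topologies(n_leaves):
    """Count unrooted bifurcating topologies on n labelled leaves: (2n - 5)!!."""
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n <= 3).
    for k in range(3, 2 * n_leaves - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20, 50):
    print(n, unrooted_topologies(n))
```

Ten leaves already give about two million topologies, which is why exhaustive search is replaced by heuristic moves between neighbouring topologies.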

11
Distance-Matrix Approaches

Given a matrix of evolutionary distances,
estimate the tree that gave rise to those
distances

Tree unknown
Distance matrix known

Even if we knew the underlying tree, in practice
any observed distance matrix will not match it
perfectly
branch length gives the expected number of
substitutions at one site
the distance matrix is unlikely to match any tree

12
Comments

Distance matrices
The distance matrix summarizes the information in
the full sequence data set
Data loss problematic for widely diverged
sequences
Distance matrix is obtained from sequence data
using a substitution model; there are many ways to do this
Comparison with likelihood
Distance matrix methods are less sophisticated...
... but they are much faster!
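
One of the many ways to obtain distances from sequences is the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), where p is the proportion of sites that differ. The sketch below applies it to the five example sequences from the Tree Likelihood slide. The choice of Jukes-Cantor and the function names are assumptions for illustration, not necessarily what the authors used.

```python
import math

SEQUENCES = ["AAGCGCATG", "AACCTCATC", "TAGCCGTTC", "TAGCCCTGC", "GAGCAGTTC"]

def p_distance(a, b):
    """Proportion of aligned sites at which the two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jc69_distance(a, b):
    """Jukes-Cantor estimate of the expected number of substitutions per site."""
    p = p_distance(a, b)
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

for i in range(len(SEQUENCES)):
    for j in range(i + 1, len(SEQUENCES)):
        d = jc69_distance(SEQUENCES[i], SEQUENCES[j])
        print(i, j, round(d, 4))
```

The corrected distance always exceeds the raw proportion of differences, and it diverges as p approaches 3/4, which illustrates why the summary becomes unreliable for widely diverged sequences.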

13
Least Squares Fitting

Suppose we are given a tree topology and a
distance matrix: how would we find branch lengths
on the tree?
For two leaves i,j denote
true distances on tree: $t_{ij}$
observed distances: $d_{ij}$
Assume that observed distances are unbiased
estimates of the true distances
Choose branch lengths so that the tree distances
$t_{ij}$ minimize the error term
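
A minimal sketch of the fitting step, assuming the error term is the ordinary unweighted sum of squares of d_ij - t_ij over all pairs (the slide's exact error term is not reproduced in this transcript and may have been weighted). The four-leaf topology, the function names and the example distances are hypothetical.

```python
# Quartet topology ((0,1),(2,3)): branches 0-3 lead to leaves 0-3, branch 4 is internal.
PAIRS = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

def path_row(i, j):
    """Indicator vector of the branches on the path between leaves i and j."""
    row = [0.0] * 5
    row[i] = row[j] = 1.0
    if (i < 2) != (j < 2):  # leaves on opposite sides cross the internal branch
        row[4] = 1.0
    return row

def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x

def fit_branch_lengths(observed):
    """observed: dict (i, j) -> d_ij. Returns least-squares branch lengths."""
    a = [path_row(i, j) for i, j in PAIRS]
    d = [observed[pair] for pair in PAIRS]
    # Normal equations: (A^T A) b = A^T d
    ata = [[sum(r[p] * r[q] for r in a) for q in range(5)] for p in range(5)]
    atd = [sum(r[p] * x for r, x in zip(a, d)) for p in range(5)]
    return solve(ata, atd)

# Hypothetical noisy distances, for illustration only.
observed = {(0, 1): 0.31, (0, 2): 0.88, (0, 3): 1.02,
            (1, 2): 0.99, (1, 3): 1.12, (2, 3): 0.69}
print([round(b, 4) for b in fit_branch_lengths(observed)])
```

When the observed distances fit the topology exactly, the fitted values reproduce the true branch lengths; with noisy distances nothing in this unconstrained fit prevents a negative estimate.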

14
Neighbour Joining

Neighbour Joining (NJ) is defined by an
agglomerative algorithm

Figure: nodes i and j joined and replaced by a new node u
Start with star-like topology
Consider joining every pair of nodes i,j in turn
Replace nodes i,j by a single node u
Calculate branch length estimates via least
squares
Join the pair i,j that minimizes the sum of
branch lengths
Continue...
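
The steps above can be sketched as follows. This is a minimal illustration that uses the widely implemented Q-criterion form of NJ, Q(i,j) = (n - 2) d_ij - r_i - r_j with r_i the row sum of distances, which is commonly used as the practical form of the minimum total branch length rule, together with the standard NJ branch length and distance update formulas. The function names and the example matrix are assumptions for illustration.

```python
def matrix_to_dict(names, rows):
    """Convert a square distance matrix into {frozenset({a, b}): distance}."""
    return {frozenset((names[i], names[j])): rows[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}

def neighbour_joining(names, dist):
    """Return the tree as a list of (node, neighbour, branch_length) edges."""
    nodes, d, edges = list(names), dict(dist), []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Pick the pair minimizing Q(i, j) = (n - 2) d_ij - r_i - r_j.
        i, j = min(((a, b) for k, a in enumerate(nodes) for b in nodes[k + 1:]),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        li = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))  # branch from i to u
        lj = dij - li                                   # branch from j to u
        u = f"u{next_id}"
        next_id += 1
        edges += [(i, u, li), (j, u, lj)]
        # Distances from the new node u to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (d[frozenset((i, k))]
                                              + d[frozenset((j, k))] - dij)
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges

names = ["a", "b", "c", "d", "e"]
rows = [[0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0]]
for edge in neighbour_joining(names, matrix_to_dict(names, rows)):
    print(edge)
```

The example matrix is exactly tree-like, so the recovered branch lengths reproduce every input distance; on real data the output is only an estimate.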
15
Comments

NJ is hard to justify statistically...
... but it works surprisingly well!
Recent improvements to the algorithm have not
introduced a thorough statistical framework

16
Our Methodology

New distance-matrix method for constructing
phylogenies
Motivated by the example of gene families but
also applies to species trees
Essential ingredients
Distribution free, moment-based approach
Handles variance/covariance of distances more
thoroughly than existing distance-matrix methods

17
Motivation: Families of Paralogs

Certain genes have many copies within
the same genome
Examples: olfactory receptors, proteases, kinases
Appear to have evolved through duplications of
individual genes, clusters of genes, and
rearrangements within gene clusters
Phylogenetic tree for these genes corresponds to
the history of gene duplication
Could we construct a more sophisticated history?
Block duplications of more than one gene
A history of linear arrangement along the genome

18
Assumptions (1)

Molecular clock setting: necessary in order to
consider events in which more than one gene is
duplicated
In a block duplication, two or more genes are
copied at the same time
The observed distances $d_{ij}$ are the result of a
random process perturbing the underlying true
tree T

Figure: underlying true tree T and observed distances D, each on leaves i, j, k, l
$d_{ij}$ denotes the distance between i,j in D
$t_{ij}$ denotes the length of the path between i,j in T
19
Assumptions (2)

The observed distances $d_{ij}$ are the result of a
random process perturbing the underlying true
tree T, that satisfies

Figure: tree T with leaves i, j, k, l

Note that we do not need a complete description
of the perturbation process: we just consider
moments

20
Building trees

Adopt an agglomerative approach: winding back
the clock

- Try joining every pair of nodes
- Can also consider block joins

Pick the best join according to some statistical
score

- Update the set of distances between nodes
- Continue recursively

We need to specify
How to score join events
How to fix the time for a join
How to update the distances at each iteration
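
To make the three choices concrete, the sketch below fills them in with UPGMA, a classical clock-based agglomerative method. It is explicitly not the authors' method: the score is simply the smallest distance, the join time is half that distance, and the update is a size-weighted average. It also omits block joins. The function name and example distances are hypothetical.

```python
def upgma(names, dist):
    """dist: dict (a, b) -> distance. Returns a list of (left, right, join_time)."""
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    d = {frozenset(pair): value for pair, value in dist.items()}
    joins = []
    while len(sizes) > 1:
        # Score: pick the closest pair of current nodes.
        pair = min(d, key=lambda p: d[p])
        a, b = sorted(pair)
        # Time: under a molecular clock the join sits at half the distance.
        join_time = d[pair] / 2.0
        new = f"({a},{b})"
        # Update: size-weighted average distance to every other node.
        for c in sizes:
            if c not in pair:
                d[frozenset((new, c))] = (
                    sizes[a] * d[frozenset((a, c))] + sizes[b] * d[frozenset((b, c))]
                ) / (sizes[a] + sizes[b])
        new_size = sizes.pop(a) + sizes.pop(b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        sizes[new] = new_size
        joins.append((a, b, join_time))
    return joins

example = {("A", "B"): 2.0, ("C", "D"): 4.0,
           ("A", "C"): 6.0, ("A", "D"): 6.0,
           ("B", "C"): 6.0, ("B", "D"): 6.0}
for join in upgma(["A", "B", "C", "D"], example):
    print(join)
```

Swapping in a different score, join time or update rule changes the method while leaving the loop itself untouched, which is the structure the slide describes.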

21
Scoring Joins (1)

Suppose we have constructed T as far back as some
time t. What is the covariance matrix for
distances between nodes?

Figure: partially constructed tree T_t at time t, with node i
22
Scoring Joins (2)
Figure: tree T_t at time t, with nodes i and j and a time t_i marked
Next consider the expectation and variance of
distances between the black nodes
23
Scoring Joins (3)

Score tree using the goodness-of-fit of the
calculated distances $d_t$ to expected distances
Under suitable asymptotic assumptions this is a
$\chi^2$ statistic
The distance vector $d_t$ is $n \times n$ so the covariance
matrix is $n^2 \times n^2$ and inverting it is potentially
$O(n^6)$
However, it can be inverted algebraically in
$O(n^2)$ steps, and score evaluated in $O(n)$ steps
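
For orientation, a goodness-of-fit score of this kind has the generic form r^T S^{-1} r, where r is the vector of observed minus expected distances and S is their covariance matrix. The sketch below evaluates that quadratic form with a general linear solve, which costs cubic time in the number of distances. It does not reproduce the authors' algebraic inversion, and the function names and numbers are hypothetical.

```python
def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x

def chi_square_score(observed, expected, covariance):
    """Quadratic form r^T S^{-1} r, computed as r . (S^{-1} r) via a linear solve."""
    residual = [o - e for o, e in zip(observed, expected)]
    weighted = solve(covariance, residual)
    return sum(r * w for r, w in zip(residual, weighted))

# Made-up numbers: two distances with unequal variances and no covariance.
score = chi_square_score([3.0, 2.0], [1.0, 1.0], [[2.0, 0.0], [0.0, 0.5]])
print(score)
```

With m distances the general solve takes on the order of m^3 operations, so for m = n^2 pairwise distances this matches the O(n^6) figure quoted on the slide.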

24
Results

Construction of large trees
Comparison with other methods (NJ) in progress...
Issues
As a purely phylogenetic method it is held back
by the molecular clock assumption
It is not a complete approach to inferring
historical arrangements of paralogous genes,
although it can incorporate duplication of more
than one gene at a time

25
Conclusions

Full probabilistic models for constructing
phylogenies are unsuitable if there are many
leaves
Existing distance-matrix methods could be
improved upon
We have a new distance-matrix approach that
improves upon standard approaches to variance /
covariance
Future Work
Could we build our approach to covariance into a
setting with no molecular clock?
Can we develop approaches that combine
phylogenetic and arrangement information to build
evolutionary histories?