Uncovering evolutionary history: new methods for inferring phylogenies


1
Uncovering evolutionary history: new methods for
inferring phylogenies
Tom Nye, Wally Gilks (MRC BSU)
Pietro Liò (Computer Laboratory, Cambridge)
  • RSS Manchester
  • 12 October 2005

2
Winding Back the Evolutionary Clock
  • Biologists want to reconstruct the evolutionary
    history of genes, genomes, and species
  • Evolutionary history helps us to understand the
    genomes we see today
  • Phylogenetic trees represent evolutionary
    relationships between species and are a vital
    ingredient in many biological analyses

3
Today's Talk
  • An introduction to phylogeny
  • Maximum likelihood and distance-matrix frameworks
  • A new distance-matrix approach

4
Evolutionary Trees
  • Evolutionary relationships can be represented by
    trees, called phylogenies
  • Leaf nodes are extant species
  • Internal nodes are speciation events
  • Branch lengths show evolutionary distance

[Figure: example phylogeny with leaves Moa, Cassowary, Emu, Kiwi, Ostrich, Rhea]
5
Biological Data
6
Sequence Evolution
AAGCTGATC
7
Models of Nucleotide Substitution
  • Model sites along the DNA string as evolving
    independently
  • Continuous-time Markov chain with states A, C, G, T
  • Define
  • $P_{ij}(t) = \Pr(\text{in state } j \text{ at time } t \mid \text{in state } i \text{ at } t = 0)$
  • So that
  • $P(t) = \exp(\mu t Q)$
  • where
  • $Q$ is the instantaneous rate matrix
  • $\mu$ is the rate of mutation events, $\mu t$ represents
    branch length
  • Various models available for Q
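
As a minimal sketch of the transition-probability formula on this slide, the block below takes the Jukes-Cantor model as one possible choice of Q (an assumption; the slide does not say which model is used) and approximates the matrix exponential with a truncated Taylor series. The function names and the state order A, C, G, T are illustrative only, and the series is only suitable for short branches.

```python
import math

def jc69_rate_matrix():
    """Jukes-Cantor rate matrix, scaled so one unit of branch length
    equals one expected substitution per site. State order: A, C, G, T."""
    return [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]

def mat_mul(a, b):
    """Product of two small square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(q, branch_length, terms=40):
    """Approximate exp(branch_length * Q) by summing the first `terms`
    terms of the Taylor series. Adequate for short branches only."""
    n = len(q)
    m = [[branch_length * q[i][j] for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # current term, starts at identity
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, m)]  # m^k / k!
        for i in range(n):
            for j in range(n):
                result[i][j] += term[i][j]
    return result

# Transition probabilities across a branch of length 0.1
P = transition_matrix(jc69_rate_matrix(), 0.1)
```

Each row of `P` is a probability distribution over the state at the end of the branch, so rows should sum to 1. For Jukes-Cantor a closed form exists, which makes it a convenient check on the series.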

8
Molecular Clocks
  • Branch lengths represent evolutionary distance
    (typically number of nucleotide substitutions)
  • Rates of change may vary between branches
  • Molecular clock: no rate variation

[Figure: two example trees drawn against a time axis t, one labelled "No Clock" and one labelled "With Clock"]
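
Under a molecular clock every leaf is the same distance from the root, so leaf-to-leaf distances are ultrametric: in any triple of leaves the two largest distances are equal. The sketch below checks that three-point condition; the two small matrices are made-up examples, not data from the talk.

```python
from itertools import combinations

def is_ultrametric(dist, tol=1e-9):
    """Return True if every triple of leaves satisfies the three-point
    condition (the two largest pairwise distances are equal)."""
    for i, j, k in combinations(range(len(dist)), 3):
        smallest, middle, largest = sorted((dist[i][j], dist[i][k], dist[j][k]))
        if abs(largest - middle) > tol:
            return False
    return True

# Hypothetical tree ((A:1, B:1):2, C:3): all leaves are 3 units from the root
with_clock = [[0, 2, 6],
              [2, 0, 6],
              [6, 6, 0]]

# Hypothetical tree ((A:2, B:1):2, C:3): leaf A has evolved faster than leaf B
no_clock = [[0, 3, 7],
            [3, 0, 6],
            [7, 6, 0]]

clock_ok = is_ultrametric(with_clock)
no_clock_ok = is_ultrametric(no_clock)
```
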
9
Tree Likelihood
  • Given a tree topology and branch lengths,
    evaluate the likelihood of the tree under the
    substitution model

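Sequences at the five leaves: AAGCGCATG, AACCTCATC, TAGCCGTTC, TAGCCCTGC, GAGCAGTTC
Internal nodes: x, y, z, w (w is the root)
Likelihood at the fifth site:
$$\Pr(G,T,C,C,A \mid \text{Tree}) = \sum_{x}\sum_{y}\sum_{z}\sum_{w} \Pr(w)\,\Pr(z \mid w)\,\Pr(A \mid z)\,\Pr(C \mid z) \times \Pr(y \mid w)\,\Pr(C \mid y) \times \Pr(x \mid y)\,\Pr(G \mid x)\,\Pr(T \mid x)$$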
  • Full likelihood is obtained by taking product
    over nucleotide sites
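
The sum over internal states can be sketched directly in code. The block below is a brute-force illustration, not an efficient implementation: it assumes Jukes-Cantor transition probabilities, a uniform distribution at the root w, and the same branch length `t` on every branch, none of which are specified on the slide. The grouping of leaves under x, y and z is read off the factors in the slide's sum (G and T under x, one C under y, the other C and A under z).

```python
import math
from itertools import product

STATES = "ACGT"

def jc69_prob(child, parent, t):
    """Jukes-Cantor probability of `child` given `parent` over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if child == parent else 0.25 - 0.25 * e

def site_likelihood(leaves, t=0.1):
    """Likelihood of one alignment column: sum over the unobserved states
    x, y, z, w at the internal nodes of the five-leaf tree."""
    l1, l2, l3, l4, l5 = leaves
    total = 0.0
    for x, y, z, w in product(STATES, repeat=4):
        total += (0.25  # Prob(w): uniform root distribution (assumption)
                  * jc69_prob(z, w, t) * jc69_prob(l5, z, t) * jc69_prob(l4, z, t)
                  * jc69_prob(y, w, t) * jc69_prob(l3, y, t)
                  * jc69_prob(x, y, t) * jc69_prob(l1, x, t) * jc69_prob(l2, x, t))
    return total

def alignment_likelihood(seqs, t=0.1):
    """Full likelihood: product of site likelihoods over alignment columns."""
    like = 1.0
    for column in zip(*seqs):
        like *= site_likelihood(column, t)
    return like

seqs = ["AAGCGCATG", "AACCTCATC", "TAGCCGTTC", "TAGCCCTGC", "GAGCAGTTC"]
site5 = site_likelihood("GTCCA")     # the column used on the slide
full = alignment_likelihood(seqs)    # product over all nine sites
```

The brute-force sum visits 4^4 combinations of internal states per site, which is fine for five leaves but grows exponentially with tree size; practical implementations reorganise the sum so that each internal node is handled once.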

10
Likelihood Maximization
  • We can search for the maximum likelihood tree
  • Pick an initial topology
  • Find the optimal set of branch lengths
  • Is this the highest likelihood we have seen?
  • Pick a new topology
  • Tree space is huge: the search is computationally
    intensive
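
To give a feel for how large tree space is, the short sketch below counts binary tree topologies on n labelled leaves using the standard double-factorial formulas: (2n - 5)!! unrooted and (2n - 3)!! rooted. The function names are illustrative.

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 or 2."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def num_unrooted_trees(n_leaves):
    """Number of unrooted binary topologies on n labelled leaves."""
    return 1 if n_leaves < 3 else double_factorial(2 * n_leaves - 5)

def num_rooted_trees(n_leaves):
    """Number of rooted binary topologies on n labelled leaves."""
    return 1 if n_leaves < 2 else double_factorial(2 * n_leaves - 3)

# Growth of tree space with the number of leaves
counts = {n: num_unrooted_trees(n) for n in (4, 5, 10, 20)}
```

Ten leaves already give 2,027,025 unrooted topologies, so exhaustive evaluation of the likelihood over all topologies is only feasible for very small problems.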

11
Distance-Matrix Approaches
  • Given a matrix of evolutionary distances,
    estimate the tree that gave rise to those
    distances

[Figure: the tree is unknown; the distance matrix is known]
  • Even if we knew the underlying tree, in practice
    any observed distance matrix will not match it
    perfectly
  • Branch length gives the expected number of
    substitutions at one site, so observed distances vary around it
  • The distance matrix is unlikely to match any tree exactly

12
Comments
  • Distance matrices
  • The distance matrix summarizes the information in
    the full sequence data set
  • Data loss problematic for widely diverged
    sequences
  • Distance matrix is obtained from sequence data
    using a substitution model; there are many ways to do this
  • Comparison with likelihood
  • Distance matrix methods are less sophisticated...
  • ... but they are much faster!
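
One of the many ways to obtain a distance matrix from sequence data is the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites. The sketch below applies it to the five example sequences from the Tree Likelihood slide; the choice of Jukes-Cantor is an assumption for illustration, not necessarily what the authors used.

```python
import math

def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def jc69_distance(a, b):
    """Jukes-Cantor estimate of substitutions per site between two sequences."""
    p = p_distance(a, b)
    if p == 0.0:
        return 0.0
    if p >= 0.75:
        return math.inf   # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(seqs):
    """Symmetric matrix of pairwise Jukes-Cantor distances."""
    n = len(seqs)
    return [[jc69_distance(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]

seqs = ["AAGCGCATG", "AACCTCATC", "TAGCCGTTC", "TAGCCCTGC", "GAGCAGTTC"]
D = distance_matrix(seqs)
```

The saturation branch illustrates the data-loss point above: once sequences differ at three quarters of their sites or more, the pairwise summary carries no usable distance information.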

13
Least Squares Fitting
  • Suppose we are given a tree topology and a
    distance matrix: how would we find branch lengths
    on the tree?
  • For two leaves i, j denote
  • true distances on tree: $t_{ij}$
  • observed distances: $d_{ij}$
  • Assume that observed distances are unbiased
    estimates of the true distances
  • Use branch lengths $t_{ij}$ that minimize the error
    term
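
A minimal sketch of least squares fitting on a fixed topology follows. It assumes the simplest error term, the unweighted sum of squared differences between observed and tree distances over all leaf pairs (the slide's exact formula is not preserved in this transcript), and uses a hypothetical four-leaf topology ((a,b),(c,d)) with one internal branch `e`. All names are illustrative.

```python
# Branches of the hypothetical tree ((a,b),(c,d)): four leaf branches plus
# one internal branch "e". PATHS lists the branches on each leaf-to-leaf path.
BRANCHES = ["a", "b", "c", "d", "e"]
PATHS = {
    ("a", "b"): ["a", "b"],
    ("a", "c"): ["a", "e", "c"],
    ("a", "d"): ["a", "e", "d"],
    ("b", "c"): ["b", "e", "c"],
    ("b", "d"): ["b", "e", "d"],
    ("c", "d"): ["c", "d"],
}

def solve(m, rhs):
    """Solve a small square linear system by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(m)
    aug = [row[:] + [r] for row, r in zip(m, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def least_squares_branch_lengths(observed):
    """Minimise the sum over leaf pairs of (d_ij - t_ij)^2, where t_ij is
    the sum of branch lengths on the path from i to j."""
    rows = [[1.0 if b in PATHS[pair] else 0.0 for b in BRANCHES] for pair in PATHS]
    d = [observed[pair] for pair in PATHS]
    k = len(BRANCHES)
    # Normal equations: (A^T A) x = A^T d
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atd = [sum(r[i] * x for r, x in zip(rows, d)) for i in range(k)]
    return dict(zip(BRANCHES, solve(ata, atd)))

# Distances generated exactly from branch lengths a=1, b=2, c=3, d=4, e=0.5
observed = {("a", "b"): 3.0, ("a", "c"): 4.5, ("a", "d"): 5.5,
            ("b", "c"): 5.5, ("b", "d"): 6.5, ("c", "d"): 7.0}
fitted = least_squares_branch_lengths(observed)
```

With noise-free distances the fit recovers the generating branch lengths; with real data the six observed distances generally cannot all be matched by five branch lengths, and the fit returns the closest tree distances in the squared-error sense.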

14
Neighbour Joining
  • Neighbour Joining (NJ) is defined by an
    agglomerative algorithm

[Figure: in a star-like tree, nodes i and j are replaced by a single node u]
  • Start with star-like topology
  • Consider joining every pair of nodes i, j in turn
  • Replace nodes i, j by a single node u
  • Calculate branch length estimates via least
    squares
  • Join the pair i, j that minimizes the sum of
    branch lengths
  • Continue...
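
The steps can be sketched with the usual textbook form of Neighbour Joining, in which the pair to join is chosen by minimising Q(i,j) = (n - 2) d(i,j) - r(i) - r(j), where r(i) is the sum of distances from i to all other nodes. This criterion picks the same pair as minimising the total branch length. The function name, the `u0`, `u1`, ... node labels and the return format are illustrative choices.

```python
def neighbour_joining(names, dist):
    """Textbook Neighbour Joining. Returns (joins, final_edge), where each
    join is (node_i, node_j, new_node, length_i, length_j) and final_edge
    is (node_a, node_b, length) connecting the last two nodes."""
    nodes = list(names)
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Pick the pair minimising the Q criterion (first pair wins ties)
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                a, b = nodes[x], nodes[y]
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        u = f"u{len(joins)}"
        # Branch lengths from the joined nodes to the new node u
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        joins.append((a, b, u, len_a, len_b))
        # Distances from u to every remaining node
        d[u] = {u: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [u]
    a, b = nodes
    return joins, (a, b, d[a][b])

# Distances generated from the tree ((a:1, b:2), (c:3, d:4)) with internal branch 0.5
names = ["a", "b", "c", "d"]
dist = [[0.0, 3.0, 4.5, 5.5],
        [3.0, 0.0, 5.5, 6.5],
        [4.5, 5.5, 0.0, 7.0],
        [5.5, 6.5, 7.0, 0.0]]
joins, final_edge = neighbour_joining(names, dist)
```

On distances that fit a tree exactly, as in this example, the procedure recovers both the topology and the branch lengths.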
15
Comments
  • NJ is hard to justify statistically...
  • ... but it works surprisingly well!
  • Recent improvements to the algorithm have not
    introduced a thorough statistical framework

16
Our Methodology
  • New distance-matrix method for constructing
    phylogenies
  • Motivated by the example of gene families but
    also applies to species trees
  • Essential ingredients
  • Distribution free, moment-based approach
  • Handles variance/covariance of distances more
    thoroughly than existing distance-matrix methods

17
Motivation: Families of Paralogs
  • Certain genes have many copies within
    the same genome
  • Examples: olfactory receptors, proteases, kinases
  • Appear to have evolved through duplications of
    individual genes, clusters of genes, and
    rearrangements within gene clusters
  • Phylogenetic tree for these genes ↔ history of
    gene duplication
  • Could we construct a more sophisticated history?
  • Block duplications of more than one gene
  • A history of linear arrangement along the genome

18
Assumptions (1)
  • Molecular clock setting: necessary in order to
    consider events in which more than one gene is
    duplicated
  • In a block duplication, two or more genes are
    copied at the same time
  • The observed distances $d_{ij}$ are the result of a
    random process perturbing the underlying true
    tree T

[Figure: true tree T and observed distances D, each on leaves i, j, k, l]
$d_{ij}$ denotes distance between i, j in D
$t_{ij}$ denotes path between i, j in T
19
Assumptions (2)
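  • The observed distances $d_{ij}$ are the result of a
    random process perturbing the underlying true
    tree T, that satisfies

[Equations and figure not preserved: conditions on the perturbation process, illustrated on tree T with leaves i, j, k, l]
  • Note that we do not need a complete description
    of the perturbation process: we just consider
    moments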

20
Building trees
  • Adopt an agglomerative approach: winding back
    the clock

- Try joining every pair of nodes
- Can also consider block joins
  • Pick the best join according to some statistical
    score

- Update the set of distances between nodes
- Continue recursively
  • We need to specify:
  • How to score join events
  • How to fix the time for a join
  • How to update the distances at each iteration
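
The overall loop can be sketched as a generic skeleton with the three unspecified ingredients passed in as functions. The placeholder choices below (score a join by the raw distance, place it at half that distance, average the distances on update) are simple stand-ins in the style of average-linkage clustering. They are assumptions for illustration only and are not the statistical score, timing rule or update proposed in the talk; block joins are not handled.

```python
def build_clock_tree(names, dist, score, join_time, update):
    """Generic agglomeration loop. `score`, `join_time` and `update` are
    supplied by the caller; lower scores are treated as better joins.
    Returns the nested-tuple tree and a list of (new_node, time) joins."""
    d = {frozenset((a, b)): dist[i][j]
         for i, a in enumerate(names) for j, b in enumerate(names) if i < j}
    nodes = list(names)
    history = []
    while len(nodes) > 1:
        # Try joining every pair of current nodes and keep the best one
        pairs = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
        a, b = min(pairs, key=lambda p: score(p[0], p[1], nodes, d))
        t = join_time(a, b, d)
        u = (a, b)
        # Update distances from the new node to every remaining node
        for c in nodes:
            if c not in (a, b):
                d[frozenset((u, c))] = update(a, b, c, d)
        nodes = [c for c in nodes if c not in (a, b)] + [u]
        history.append((u, t))
    return nodes[0], history

# Placeholder ingredients (hypothetical, not the method from the talk)
def placeholder_score(a, b, nodes, d):
    return d[frozenset((a, b))]

def placeholder_time(a, b, d):
    return d[frozenset((a, b))] / 2.0

def placeholder_update(a, b, c, d):
    return 0.5 * (d[frozenset((a, c))] + d[frozenset((b, c))])

# Made-up clock-like distances for three leaves
tree, history = build_clock_tree(
    ["A", "B", "C"],
    [[0, 2, 6], [2, 0, 6], [6, 6, 0]],
    placeholder_score, placeholder_time, placeholder_update)
```

Keeping the three ingredients as parameters mirrors the list above: the loop itself is fixed, and a method is defined by how it scores joins, times them, and updates distances.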

21
Scoring Joins (1)
  • Suppose we have constructed T as far back as some
    time t. What is the covariance matrix for
    distances between nodes?

[Figure: the partially constructed tree $T_t$ at time t, with node i]
22
Scoring Joins (2)
[Figure: tree $T_t$ at time t with nodes i and j, and time $t_i$ marked]
Next consider the expectation and variance of
distances between the black nodes
23
Scoring Joins (3)
  • Score tree using the goodness-of-fit of the
    calculated distances $d_t$ to expected distances
  • Under suitable asymptotic assumptions this is a
    $\chi^2$ statistic
  • The distance vector $d_t$ is $n \times n$ so the covariance
    matrix is $n^2 \times n^2$ and inverting it is potentially
    $O(n^6)$
  • However, it can be inverted algebraically in
    $O(n^2)$ steps, and score evaluated in $O(n)$ steps
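
A goodness-of-fit score of this kind is commonly written as the quadratic form (d - e)^T S^{-1} (d - e), with d the observed distances, e their expected values and S their covariance matrix. The sketch below evaluates that generic form with a dense linear solve; it is an assumption about the shape of the statistic and does not reproduce the authors' algebraic inversion. The function names are illustrative.

```python
def solve(m, rhs):
    """Solve a small square linear system by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(m)
    aug = [row[:] + [r] for row, r in zip(m, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def goodness_of_fit(observed, expected, cov):
    """Quadratic form (d - e)^T S^{-1} (d - e), computed by solving
    S x = (d - e) rather than forming the inverse explicitly."""
    resid = [o - e for o, e in zip(observed, expected)]
    x = solve(cov, resid)
    return sum(r * xi for r, xi in zip(resid, x))

# Independent distances: the score is a sum of squared standardised residuals
score_diag = goodness_of_fit([1.0, 2.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 4.0]])

# Correlated distances: the off-diagonal covariance changes the score
score_corr = goodness_of_fit([1.0, 1.0], [0.0, 0.0], [[2.0, 1.0], [1.0, 2.0]])
```

The dense solve costs on the order of m^3 operations for a vector of m distances. With m of order n^2 this is the O(n^6) cost mentioned above, which is why an algebraic inverse matters for large trees.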

24
Results
  • Construction of large trees
  • Comparison with other methods (NJ) in progress...
  • Issues
  • As a purely phylogenetic method it is held back
    by the molecular clock assumption
  • It is not a complete approach to inferring
    historical arrangements of paralogous genes,
    although it can incorporate duplication of more
    than one gene at a time

25
Conclusions
  • Full probabilistic models for constructing
    phylogenies are unsuitable if there are many
    leaves
  • Existing distance-matrix methods could be
    improved upon
  • We have a new distance-matrix approach that
    improves upon standard approaches to variance /
    covariance
  • Future Work
  • Could we build our approach to covariance into a
    setting with no molecular clock?
  • Can we develop approaches that combine
    phylogenetic and arrangement information to build
    evolutionary histories?
