Title: Distance Based Methods for estimating phylogenetic trees
1Distance Based Methods for estimating
phylogenetic trees
Phylogenetics Workshop, 16-18 August 2006
Barbara Holland
[Figure: tree with tips Cat, Rat, Dog and Cow and edge lengths 1, 1, 2, 2 and 4]
2Overview
- How do we get distance data?
- Observed vs. actual distances
- Correcting for hidden changes
- Not all distances are tree-like
- Tree building: clustering methods
  - UPGMA
  - Neighbor-joining
- Tree building: optimality criteria
  - Least Squares
3What do edge lengths represent?
- In some trees edges represent time, in which case
all modern sequences should be the same distance
from the root.
- Sometimes edge lengths represent the product µt
of the rate of change µ and time t in which case
different tips can be different distances from
the root provided that the rate has changed
across the tree.
[Figure: tree with tips Cat, Rat, Dog and Cow and edge lengths 1, 1, 2, 2 and 4]
4Distance matrices
- There are many ways of building phylogenetic
trees; one family of methods uses a distance
matrix as a starting point.
- A distance matrix is a table that indicates
pairwise dissimilarity, for instance...
5Properties of distances
- d(x,x) = 0
- d(x,y) = d(y,x)
- d(x,y) + d(y,z) ≥ d(x,z) (the triangle
inequality)
- The distances used in phylogenetics always have
the first two properties but sometimes not the
third.
6I want to build a tree - will any old distances
do?
- Not all distances will be suitable for building
trees.
- Tree-building methods do not discriminate: they
will return a tree regardless of whether you give
them roadmap distances or distances based on a
sequence alignment.
- Some distances are perfectly tree-like.
7Perfectly tree-like distances
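[Figure: tree with tips Cat, Rat, Dog and Cow and edge lengths 1, 1, 2, 2 and 4; the distance between two taxa is the sum of the edge lengths on the path joining them]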
13The 4-Point Condition
- Distances that fit exactly on a tree can be
characterised by a condition on any quartet i, j,
k, l (i.e. it must hold true for any 4 taxa).
- We write d(x,y) for the distance between x and y.
- Given 4 taxa i, j, k, l, of the 3 sums
  - d(i,j) + d(k,l)
  - d(i,k) + d(j,l)
  - d(i,l) + d(j,k)
  the largest two are equal.
- Distances with this property are called additive,
because the weights on the paths along the tree
add up to the values in the distance matrix.
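The condition above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the helper names `symmetric` and `is_additive` are hypothetical, the first set of distances is the cat/dog/rat/cow matrix implied by the NJ example later in the talk (with d(rat,cow) = 6 inferred from the row sums given there), and the second set is made up here purely to show a failure.

```python
from itertools import combinations

def symmetric(pairs):
    """Make d[x, y] and d[y, x] both available from one entry per pair."""
    d = {}
    for (x, y), value in pairs.items():
        d[x, y] = d[y, x] = value
    return d

def is_additive(d, taxa, tol=1e-9):
    """True if every quartet satisfies the 4-point condition."""
    for i, j, k, l in combinations(taxa, 4):
        sums = sorted([d[i, j] + d[k, l],
                       d[i, k] + d[j, l],
                       d[i, l] + d[j, k]])
        # The two largest of the three sums must be equal.
        if sums[2] - sums[1] > tol:
            return False
    return True

taxa = ["cat", "dog", "rat", "cow"]
tree_like = symmetric({
    ("cat", "dog"): 3, ("cat", "rat"): 4, ("cat", "cow"): 6,
    ("dog", "rat"): 5, ("dog", "cow"): 7, ("rat", "cow"): 6,
})
# Illustrative only: same matrix with d(rat, cow) changed to 9.
not_tree_like = symmetric({
    ("cat", "dog"): 3, ("cat", "rat"): 4, ("cat", "cow"): 6,
    ("dog", "rat"): 5, ("dog", "cow"): 7, ("rat", "cow"): 9,
})

print(is_additive(tree_like, taxa))      # True: sums are 9, 11, 11
print(is_additive(not_tree_like, taxa))  # False: sums are 11, 11, 12
```

A tolerance is used because real distances are rarely integers; with estimated distances the condition will usually hold only approximately, if at all.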
14Why is this true of tree-like distances?
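[Figure: the same quartet tree on i, j, k, l drawn three times, once for each way of pairing up the four taxa]
If the tree groups i with j and k with l, then
d(i,j) + d(k,l) < d(i,k) + d(j,l) = d(i,l) + d(j,k)
Each of the two larger sums counts the internal edge twice, so they are equal; the smallest sum does not include the internal edge at all.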
15Clock-like distances
- An even stricter condition on distances is that
they fit on a clock-like tree.
- Distances with this property are called
ultrametric.
[Figure: rooted clock-like tree on i, j, k drawn against a time axis, with d(i,k) = d(j,k) ≥ d(i,j)]
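The clock-like condition can be tested triple by triple: among the three distances in any triple, the two largest must be equal. A minimal sketch follows, under the assumption that distances are stored per pair; the function names and the toy values are hypothetical and not taken from the slides.

```python
from itertools import combinations

def symmetric(pairs):
    """Make d[x, y] and d[y, x] both available from one entry per pair."""
    d = {}
    for (x, y), value in pairs.items():
        d[x, y] = d[y, x] = value
    return d

def is_ultrametric(d, taxa, tol=1e-9):
    """True if, in every triple, the two largest distances are equal."""
    for i, j, k in combinations(taxa, 3):
        smallest, middle, largest = sorted([d[i, j], d[i, k], d[j, k]])
        if largest - middle > tol:
            return False
    return True

# Toy values: i and j are close, and both are equally far from k.
clock_like = symmetric({("i", "j"): 2, ("i", "k"): 6, ("j", "k"): 6})
# Toy values: three different distances, so no clock-like tree fits.
not_clock_like = symmetric({("i", "j"): 3, ("i", "k"): 4, ("j", "k"): 5})

print(is_ultrametric(clock_like, ["i", "j", "k"]))      # True
print(is_ultrametric(not_clock_like, ["i", "j", "k"]))  # False
```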
16Where do we get distances from?
- Distances can be derived from Multiple Sequence
Alignments (MSAs).
- The most basic distance is just a count of the
number of sites which differ between two
sequences divided by the sequence length. These
are sometimes known as p-distances.
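As a small illustration of the p-distance, here is a sketch that assumes two already-aligned sequences of equal length and treats every site the same way (no special handling of gaps or ambiguous characters, which a real tool would need). The function name `p_distance` is hypothetical.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differing = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differing / len(seq_a)

# Two of the 10-site example sequences used on the following slides.
d = p_distance("ATTTGCGGTA", "ATCTGCGATA")
print(d)  # 0.2 (2 differing sites out of 10)
```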
17Other sources of distances
- Immunological data
  - Similarity between proteins A and B can be
  assessed by how well the immune system responds
  to B after already having seen A.
- DNA/DNA hybridization
  - More similar DNA hybrids "melt" at higher
  temperatures.
- Fragment length polymorphism
  - Chop DNA up using restriction enzymes.
  - Amplify some fragments using PCR.
  - Run the fragments out on an electrophoretic gel.
  - Compare profiles of different genomes.
- BLAST scores
18Observed distances usually underestimate the true
number of changes
Ancestor:    ATTTGCGATA
Descendants: ATTTGCGGTA and ATCTGCGATA
Actual changes: 2. Observed changes: 2.
19- Parallel changes
- Reversals
- Superimposed changes
Ancestor:    ATTCGCGATA
Descendants: ATTTGCGGTA and ATCTGCGATA
Actual changes: 4. Observed changes: 2 (both descendants change C to T at site 4: parallel changes).
20- Parallel changes
- Reversals
- Superimposed changes
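Ancestor:     ATTTGCGATA
Intermediate: ATTCGCGATA
Descendants:  ATTTGCGGTA and ATCTGCGATA
Actual changes: 4. Observed changes: 2 (site 4 changes T to C and then back to T: a reversal).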
21- Parallel changes
- Reversals
- Superimposed changes
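Ancestor:     ATTTGCGATA
Intermediate: ATTTGCGTTA
Descendants:  ATTTGCGGTA and ATCTGCGATA
Actual changes: 3. Observed changes: 2 (site 8 changes A to T and then T to G: superimposed changes).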
22Correcting for hidden changes
- Given a statistical model of how point mutations
occur it is possible to estimate the true genetic
distance from the observed distance.
23Correcting under a simple model
- The Jukes-Cantor model states that all states
A, C, G, T and all changes between states, e.g.
A→C, are equally likely.
[Figure: the four states A, C, G and T, with every change between two states labelled with rate u/3]
As a mathematical convenience imagine we have a
rate 4u/3 of change to a random state; this
includes the possibility of a state changing to
itself.
24A Poisson process
- The probability of no event at a site over time
t is e^(-(4/3)ut).
- The probability of at least one event is then
1 - e^(-(4/3)ut).
- The probability of at least one event that leads
to a different state from the one we started at
is ¾(1 - e^(-(4/3)ut)), as one time out of four we will
mutate to the same base we started with.
- The expected observed distance d given a true
genetic distance of ut is d = ¾(1 - e^(-(4/3)ut)).
- Inverting this formula gives our correction
D = ut = -¾ ln(1 - (4/3)d).
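The inversion in the last bullet is a short piece of algebra. Written out step by step, with d the observed proportion of differing sites and D = ut the corrected distance (same symbols as on the slide):

```latex
\begin{align*}
d &= \tfrac{3}{4}\left(1 - e^{-\frac{4}{3}ut}\right) \\
\tfrac{4}{3}\,d &= 1 - e^{-\frac{4}{3}ut} \\
e^{-\frac{4}{3}ut} &= 1 - \tfrac{4}{3}\,d \\
-\tfrac{4}{3}\,ut &= \ln\left(1 - \tfrac{4}{3}\,d\right) \\
D = ut &= -\tfrac{3}{4}\,\ln\left(1 - \tfrac{4}{3}\,d\right)
\end{align*}
```

The logarithm is only defined while d < 3/4, which is the value the expected observed distance approaches as ut grows.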
25Correcting for hidden changes
- Correction for hidden changes has been shown
(both theoretically and by simulation studies) to
improve accuracy.
- However, this is not universally true.
- If data is clock-like then corrections will not
change the relative size of the distances.
- However, the more complicated the model is the
larger the variance (error) of the distances will
become.
26Under the Jukes-Cantor model, where all point
mutations are equally likely, the correction is
D_actual = -¾ ln(1 - (4/3) d_observed)
28An interesting observation
- Uncorrected distances always obey the triangle
inequality d(x,y) + d(y,z) ≥ d(x,z).
- But corrected distances need not.
- E.g. if sequences a and b differ at 10 / 100
sites and sequences b and c differ at a different
10 / 100 sites, the uncorrected distances are
d(a,b) = d(b,c) = 0.1, d(a,c) = 0.2 and the
corrected distances (under the JC model) are
D(a,b) = D(b,c) = 0.107, D(a,c) = 0.233, so
D(a,b) + D(b,c) = 0.214 < D(a,c).
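The numbers in this example can be reproduced with a few lines. This is a sketch using the Jukes-Cantor correction D = -¾ ln(1 - (4/3)d) from the earlier slide; the function name `jc_distance` is hypothetical.

```python
import math

def jc_distance(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if not 0 <= p < 0.75:
        raise ValueError("the JC correction needs 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

D_ab = jc_distance(0.1)   # a and b differ at 10 of 100 sites
D_bc = jc_distance(0.1)   # b and c differ at 10 other sites
D_ac = jc_distance(0.2)   # so a and c differ at 20 sites

print(round(D_ab, 3), round(D_ac, 3))  # 0.107 0.233
print(D_ab + D_bc >= D_ac)             # False: triangle inequality fails
```

The failure comes from the correction being convex in d: doubling the observed distance more than doubles the corrected one.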
29Tree building - UPGMA
- UPGMA works by progressively clustering the most
similar taxa until all the taxa form a rooted
clock-like tree.
- Find the smallest entry in the distance matrix,
say d(x,y).
- Form a new internal node, z, that is a parent to
x and y, and place z at height half d(x,y) above
the tips (if x and y are single taxa, the edges
from z to x and z to y both have length half d(x,y)).
- Update the distance matrix by setting the
distance from the new node z to each other
taxon w to be the average of the distances from
the members of groups x and y to w.
- REPEAT until all groups have been joined.
30What precisely is meant by the average distance?
- If we are joining two groups i and j that already
have ni and nj members, we update the distance to
each other group k using
d(ij,k) = (ni d(i,k) + nj d(j,k)) / (ni + nj)
31Step 1 Find the smallest entry in the distance
matrix
d(i,j)   A   B   C   D   E   F
A        -
B        2   -
C        4   4   -
D        4   4   2   -
E        7   7   7   7   -
F        5   5   5   5   6   -
G        8   8   8   8   9   5
Step 2 - Cluster taxa A and B, form a new
internal node I. Calculate the lengths of the new
edges: d(A,I) = d(B,I) = ½ d(A,B) = 1
Step 3 Update the distance matrix: d(C,I) =
½(d(A,C) + d(B,C)) = 4 etc...
[Figure: taxa A to G before clustering, and after A and B have been joined at node I by edges of length 1]
32Step 1 Find the smallest entry in the distance
matrix
d(i,j)   I (AB)   C   D   E   F
I (AB)   -
C        4        -
D        4        2   -
E        7        7   7   -
F        5        5   5   6   -
G        8        8   8   9   5
Step 2 - Cluster taxa C and D, form a new
internal node II. Calculate the lengths of the new
edges: d(C,II) = d(D,II) = ½ d(C,D) = 1
Step 3 Update the distance matrix: d(I,II) =
½(d(I,C) + d(I,D)) = 4, d(E,II) = ½(d(E,C) +
d(E,D)) = 7 etc...
[Figure: the taxa before and after C and D have been joined at node II by edges of length 1, alongside A and B joined at node I]
33And so on...
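[Figure: the remaining clustering steps. Nodes I (A,B) and II (C,D) are joined at node III, then F is added at node IV, E at node V and G at node VI, the root. Edge lengths shown: 1 on each edge to A, B, C, D, I and II; 2.5 to F and 0.5 to III; 3.4 to E and 0.9 to IV; 3.8 to G and 0.4 to V.]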
...until we have a rooted tree. But, is it the
right tree?
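The clustering walked through on these slides can be sketched in a few lines. This is a hedged illustration rather than a reference implementation: the function name `upgma`, the Newick-style cluster labels and the tie-breaking rule (take the first smallest pair in sorted label order) are choices made here, not something the slides specify. The update step uses the size-weighted average, and each new node is placed at height half the distance between the clusters it joins.

```python
from itertools import combinations

def upgma(dist):
    """Cluster with UPGMA.

    dist maps frozenset({x, y}) to the distance between taxa x and y.
    Returns a list of (cluster_label, height) in order of merging.
    """
    d = dict(dist)
    size = {t: 1 for pair in dist for t in pair}  # members per cluster
    merges = []
    while len(size) > 1:
        # Step 1: smallest entry (ties go to the first pair found).
        x, y = min(combinations(sorted(size), 2),
                   key=lambda p: d[frozenset(p)])
        z = f"({x},{y})"
        # Step 2: the new node sits at height d(x,y)/2 above the tips.
        merges.append((z, d[frozenset((x, y))] / 2))
        # Step 3: size-weighted average distance to every other cluster.
        nx, ny = size.pop(x), size.pop(y)
        for w in size:
            d[frozenset((z, w))] = (nx * d[frozenset((x, w))]
                                    + ny * d[frozenset((y, w))]) / (nx + ny)
        size[z] = nx + ny
    return merges

# Lower triangle of the 7-taxon matrix from the slides.
taxa = "ABCDEFG"
rows = {
    "B": [2], "C": [4, 4], "D": [4, 4, 2], "E": [7, 7, 7, 7],
    "F": [5, 5, 5, 5, 6], "G": [8, 8, 8, 8, 9, 5],
}
dist = {frozenset((t, taxa[i])): v
        for t, vals in rows.items() for i, v in enumerate(vals)}

merges = upgma(dist)
for label, height in merges:
    print(label, round(height, 2))
```

On this matrix the sketch joins A with B, then C with D, then those two clusters, and then adds F, E and G in turn, at heights 1, 1, 2, 2.5, 3.4 and about 3.83. Those heights agree with the edge lengths in the figure (for example 2.5 - 2 = 0.5 and 3.4 - 2.5 = 0.9).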
34UPGMA is not consistent for additive distances
d(i,j)   A   B   C   D   E   F
A        -
B        2   -
C        4   4   -
D        4   4   2   -
E        7   7   7   7   -
F        5   5   5   5   6   -
G        8   8   8   8   9   5
The tree that matches the distances is not
recovered by UPGMA.
35Inconsistency
- When a method is given perfect data but still
gets the wrong tree it is said to be
inconsistent.
- UPGMA can be inconsistent for data that isn't
ultrametric (clock-like).
- Next we'll look at a method that is consistent
for any additive data.
36Neighbor-joining (NJ)
- NJ works by progressively clustering taxa until
all the taxa form an unrooted tree.
- Rather than using the distance matrix directly to
determine which taxa should be clustered at each
stage, NJ uses the S matrix where
S(i,j) = (N-2) d(i,j) - R(i) - R(j)
  - N is the number of taxa.
  - R(i) is the sum of the ith row in the distance
  matrix.
  - R(j) is the sum of the jth row in the distance
  matrix.
- Find the smallest entry in the S matrix, say
S(x,y).
37- Form a new internal node, z, that is a parent to
x and y and calculate the edge lengths from z to
x and z to y.
  - d(x,z) = 1/(2(N-2)) [(N-2) d(x,y) + R(x) - R(y)]
  - d(y,z) = d(x,y) - d(x,z)
- Update the distance matrix
  - d(w,z) = ½ (d(x,w) + d(y,w) - d(x,y))
- REPEAT until only two things are left to be
joined.
38NJ Example
[Tables: the distance matrix D and the matrix S derived from it]
Step 1
R(cat) = 13, R(dog) = 15, R(rat) = 15, R(cow) = 19
e.g. S(cat,dog) = (4-2)×3 - 13 - 15 = -22
S(cat,rat) = (4-2)×4 - 13 - 15 = -20
39NJ Example
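[Tables: the matrices D and S, with Steps 1 and 2 marked]
Step 3 d(cat,z) = ¼ [2 d(cat,dog) + R(cat) -
R(dog)] = ¼ [6 + 13 - 15] = 1, d(dog,z) = 3 - 1
= 2
[Figure: Cat and Dog joined at the new node z; Rat and Cow not yet attached]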
40Step 4 d(z,rat) = ½ [d(cat,rat) + d(dog,rat) -
d(cat,dog)] = ½ [4 + 5 - 3] = 3, d(z,cow) = ½
[6 + 7 - 3] = 5
[Figure: tree with Cat and Dog joined at z, and Rat and Cow attached]
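The whole example can be run end to end with a short sketch of the NJ steps as stated on the slides. Assumptions to note: the function name `neighbor_joining` and the node names `z1`, `z2` are hypothetical; ties in S are broken by taking the first smallest pair; and the distance d(rat,cow) = 6 is not quoted directly in the worked steps but is inferred here from the row sums R(rat) = 15 and R(cow) = 19.

```python
from itertools import combinations

def neighbor_joining(dist, taxa):
    """Return the tree as a list of (node, node, edge_length) triples.

    dist maps frozenset({x, y}) to the distance between x and y.
    """
    d = dict(dist)
    nodes = list(taxa)
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # R(i): row sums of the current distance matrix.
        r = {i: sum(d[frozenset((i, j))] for j in nodes if j != i)
             for i in nodes}
        # Pick the pair minimising S(i,j) = (N-2) d(i,j) - R(i) - R(j).
        x, y = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dxy = d[frozenset((x, y))]
        z = f"z{len(edges) // 2 + 1}"
        # Edge lengths from the new node z to x and y.
        dxz = ((n - 2) * dxy + r[x] - r[y]) / (2 * (n - 2))
        edges += [(x, z, dxz), (y, z, dxy - dxz)]
        # Replace x and y by z in the distance matrix.
        nodes = [w for w in nodes if w not in (x, y)]
        for w in nodes:
            d[frozenset((w, z))] = 0.5 * (d[frozenset((x, w))]
                                          + d[frozenset((y, w))] - dxy)
        nodes.append(z)
    a, b = nodes  # only two things left: join them directly
    edges.append((a, b, d[frozenset((a, b))]))
    return edges

taxa = ["cat", "dog", "rat", "cow"]
dist = {frozenset(p): v for p, v in {
    ("cat", "dog"): 3, ("cat", "rat"): 4, ("cat", "cow"): 6,
    ("dog", "rat"): 5, ("dog", "cow"): 7,
    ("rat", "cow"): 6,  # inferred from the row sums, see lead-in
}.items()}

edges = neighbor_joining(dist, taxa)
for a, b, length in edges:
    print(a, b, length)
```

With these inputs the sketch joins cat and dog first (edge lengths 1 and 2, as in Step 3), then rat and cow (2 and 4), leaving an internal edge of length 1. Those are the five edge lengths 1, 1, 2, 2 and 4 shown on the example tree.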
41Global vs Local methods
- UPGMA and NJ are local construction methods. At
each step they pick the best pair of taxa to
cluster; once a decision is made it cannot be
unmade. This makes these methods very fast.
- There are also global methods for making trees
based on distances. These evaluate an optimality
criterion on each possible tree and then pick the
tree with the best score. Examples of global
methods for distance data include least squares
and minimum evolution. Because the number of
trees grows very quickly with the number of taxa,
these methods are slow.
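To give a feel for how quickly the number of trees grows, the sketch below uses the standard counting result (not stated on the slides) that there are (2n-5)!! = 1 × 3 × 5 × ... × (2n-5) distinct unrooted binary trees on n labelled taxa, for n ≥ 3. The function name is hypothetical.

```python
def num_unrooted_binary_trees(n):
    """Number of unrooted binary trees on n labelled tips: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```

Four taxa give 3 trees and five give 15, but ten taxa already give 2,027,025, which is why exhaustive evaluation is only feasible for small data sets.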
42Least Squares
- We would like the path lengths on the tree we
choose to be as close as possible to the
corresponding values in the distance matrix.
- With additive data we can always find a tree
where the path length distances and the distance
matrix match exactly. However, most data isn't
perfect...
- We can try to minimise the discrepancy between
the observed distances and the tree distances
using a least squares approach.
43A family of least squares methods
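Minimise the weighted sum of squared differences
between the observed distances Dij and the tree
distances dij:
sum over pairs i<j of wij (Dij - dij)^2
- wij = 1: unweighted least squares
(Cavalli-Sforza and Edwards 1967)
- wij = 1/Dij
- wij = 1/Dij^2 (Fitch and Margoliash 1967)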
44Picking the best weights for a given tree
- The tree distances dij can be represented by the
equation
dij = sum over edges k of xij,k ek
where xij,k is an indicator variable that is 1 if
edge k lies on the path from i to j and 0
otherwise. We want to find edge weights ek that
minimise
sum over pairs i<j of wij (Dij - dij)^2
45The indicator variables can be expressed in
matrix format
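[Matrix X, 10 rows by 7 columns: one row per pair of taxa (AB, AC, AD, AE, BC, BD, BE, CD, CE, DE) and one column per edge (e1 to e7). Its 0/1 entries were flattened in this transcript as:]
1 1 1 0 0 0 0 1 1 0 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0
0 1 0 0 0 1 1 1 0 0 0 0 1 1 0 0 1 0 1 1 0 0 1 0 0
0 0 0 1 0 1 0 1 0 1 1 1 0 0 0 0 0 0 1 1
D = (DAB, DAC, DAD, DAE, DBC, DBD, DBE, DCD, DCE, DDE)
e = (e1, e2, e3, e4, e5, e6, e7)
[Figure: five-taxon tree on A, B, C, D and E with its edges labelled e1 to e7]
Each row of X corresponds to a path in the
tree. We can write D = Xe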
46Experience the joy of linear algebra
- D = Xe
- X^T D = (X^T X) e
- e = (X^T X)^(-1) X^T D
This assumes that the weights wij = 1
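The normal equations above can be tried out on a small case. The sketch below is an illustration under stated assumptions: it uses the four-taxon tree ((cat,dog),(rat,cow)) and the distances from the NJ example (with d(rat,cow) = 6 inferred from the row sums), orders the five edges as cat, dog, internal, rat, cow, and solves (X^T X) e = X^T D with a small hand-written elimination routine (`solve` is a hypothetical helper; a numerical library would normally be used instead). All weights are 1.

```python
from fractions import Fraction

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(bv)]
         for row, bv in zip(a, b)]
    for col in range(n):
        # Bring a row with a non-zero entry in this column into place.
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[-1] for row in m]

# Rows: cat-dog, cat-rat, cat-cow, dog-rat, dog-cow, rat-cow.
# Columns: edge to cat, edge to dog, internal edge, edge to rat, edge to cow.
X = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1],
]
D = [3, 4, 6, 5, 7, 6]

cols = range(len(X[0]))
XtX = [[sum(row[i] * row[j] for row in X) for j in cols] for i in cols]
XtD = [sum(row[i] * dv for row, dv in zip(X, D)) for i in cols]
e = solve(XtX, XtD)
print([str(v) for v in e])  # ['1', '2', '1', '2', '4']
```

Because these distances are additive on this tree, the least squares fit is exact and returns the true edge lengths. With noisy distances the same calculation gives the best-fitting lengths for the chosen tree, which may include negative values.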
47Minimum evolution
- Uses the least squares method to fit the branch
lengths for each tree
- BUT uses a different optimality criterion than
least squares.
- Prefers the tree with the shortest sum of branch
lengths
48Review
- Observed distances derived from sequence
alignments tend to underestimate the true
number of mutations. Hence it is usually a good
idea to correct for these hidden changes.
- Clustering methods like UPGMA and
Neighbor-joining are very fast as they only make
local decisions and never backtrack. These
methods are often used as a starting point for
heuristic searches.
- There are also optimality criteria that use
distances as input, e.g. least squares and
minimum evolution.
49Review
- Not all distances can be fit perfectly onto a
tree.
- Methods can be inconsistent, for example for some
non-clocklike distances UPGMA is guaranteed to
recover the wrong tree.
- UPGMA is consistent for clock-like distances and
NJ is consistent for any additive distances.