Distance Based Methods for estimating phylogenetic trees

1
Distance Based Methods for estimating
phylogenetic trees
Phylogenetics Workshop, 16-18 August 2006
Barbara Holland
[Figure: unrooted tree with leaves Cat, Rat, Dog and Cow and edge lengths 2, 1, 1, 2 and 4]
2
Overview
  • How do we get distance data?
  • Observed vs. actual distances
  • Correcting for hidden changes
  • Not all distances are tree-like
  • Tree building: clustering methods
      • UPGMA
      • Neighbor-joining
  • Tree building: optimality criteria
      • Least Squares

3
What do edge lengths represent?
  • In some trees edges represent time, in which case
    all modern sequences should be the same distance
    from the root.
  • Sometimes edge lengths represent the product µt
    of the rate of change µ and time t in which case
    different tips can be different distances from
    the root provided that the rate has changed
    across the tree.

[Figure: unrooted tree with leaves Cat, Rat, Dog and Cow and edge lengths 2, 1, 1, 2 and 4]
4
Distance matrices
  • There are many ways of building phylogenetic
    trees; one family of methods uses a distance
    matrix as a starting point.
  • A distance matrix is a table that indicates
    pairwise dissimilarity, for instance...

5
Properties of distances
  • $d(x,x) = 0$
  • $d(x,y) = d(y,x)$
  • $d(x,y) + d(y,z) \ge d(x,z)$ (the triangle
    inequality)
  • The distances used in phylogenetics always have
    the first two properties but sometimes not the
    third.

6
I want to build a tree - will any old distances
do?
  • Not all distances will be suitable for building
    trees.
  • Tree-building methods do not discriminate: they
    will return a tree regardless of whether you give
    them roadmap distances or distances based on a
    sequence alignment.
  • Some distances are perfectly tree-like.

7
Perfectly tree-like distances
[Figure: unrooted tree with leaves Cat, Rat, Dog and Cow and edge lengths 2, 1, 1, 2 and 4]
13
The 4-Point Condition
  • Distances that fit exactly on a tree can be
    characterised by a condition on any quartet i, j,
    k, l (i.e. it must hold true for any 4 taxa).
  • We write d(x,y) for the distance between x and y.
  • Given 4 taxa i, j, k, l, of the 3 sums
      • $d(i,j) + d(k,l)$
      • $d(i,k) + d(j,l)$
      • $d(i,l) + d(j,k)$
    the largest two are equal.
  • Distances with this property are called additive,
    because the weights on the paths along the tree
    add up to the values in the distance matrix.
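
The 4-point condition can be checked mechanically. The sketch below is only an illustration: the function name `is_additive`, the tolerance, and the two example matrices are assumptions made here (the first matrix is read off a four-taxon tree with pendant edges cat 1, dog 2, rat 2, cow 4 and an internal edge of 1; the second perturbs one entry).

```python
from itertools import combinations

def dist(d, a, b):
    """Look up a symmetric distance stored under an unordered pair."""
    return d[frozenset((a, b))]

def is_additive(taxa, d, tol=1e-9):
    """Return True if every quartet satisfies the 4-point condition."""
    for i, j, k, l in combinations(taxa, 4):
        sums = sorted([
            dist(d, i, j) + dist(d, k, l),
            dist(d, i, k) + dist(d, j, l),
            dist(d, i, l) + dist(d, j, k),
        ])
        # The two largest of the three sums must be equal.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

taxa = ["cat", "dog", "rat", "cow"]

# Illustrative path lengths on a tree (assumed values).
tree_like = {
    frozenset(("cat", "dog")): 3, frozenset(("cat", "rat")): 4,
    frozenset(("cat", "cow")): 6, frozenset(("dog", "rat")): 5,
    frozenset(("dog", "cow")): 7, frozenset(("rat", "cow")): 6,
}

# Same matrix with one entry changed, so it no longer fits any tree.
perturbed = dict(tree_like)
perturbed[frozenset(("cat", "rat"))] = 5

additive_result = is_additive(taxa, tree_like)
perturbed_result = is_additive(taxa, perturbed)
```

For the first matrix the three sums are 9, 11 and 11, so the condition holds; in the perturbed matrix they are 9, 11 and 12, so it fails.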

14
Why is this true of tree-like distances?
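[Figure: three copies of a quartet tree on i, j, k and l, one for each of the three ways of pairing the taxa]
When i and j are on one side of the internal edge and k and l are on the other:
$d(i,j) + d(k,l) < d(i,k) + d(j,l) = d(i,l) + d(j,k)$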

15
Clock-like distances
  • An even stricter condition on distances is that
    they fit on a clock-like tree.
  • Distances with this property are called
    ultrametric.

[Figure: clock-like rooted tree on i, j and k drawn against a time axis]
$d(i,k) = d(j,k) \ge d(i,j)$
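
The ultrametric property can be tested triple by triple: in every triple of taxa the two largest distances must be equal. The following is a rough sketch under that reading; the function name `is_ultrametric` and both example matrices are illustrative assumptions, not data from the slide.

```python
from itertools import combinations

def is_ultrametric(taxa, d, tol=1e-9):
    """Return True if, in every triple, the two largest distances are equal."""
    for i, j, k in combinations(taxa, 3):
        _, middle, largest = sorted([
            d[frozenset((i, j))],
            d[frozenset((i, k))],
            d[frozenset((j, k))],
        ])
        if abs(largest - middle) > tol:
            return False
    return True

# Assumed clock-like example: two pairs at distance 2, everything else at 4.
clock_like = {
    frozenset(("A", "B")): 2, frozenset(("C", "D")): 2,
    frozenset(("A", "C")): 4, frozenset(("A", "D")): 4,
    frozenset(("B", "C")): 4, frozenset(("B", "D")): 4,
}

# Assumed non-clock-like example: the triple has distances 3, 4 and 5.
not_clock_like = {
    frozenset(("cat", "dog")): 3,
    frozenset(("cat", "rat")): 4,
    frozenset(("dog", "rat")): 5,
}

clock_result = is_ultrametric(["A", "B", "C", "D"], clock_like)
non_clock_result = is_ultrametric(["cat", "dog", "rat"], not_clock_like)
```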
16
Where do we get distances from?
  • Distances can be derived from Multiple Sequence
    Alignments (MSAs).
  • The most basic distance is just a count of the
    number of sites which differ between two
    sequences divided by the sequence length. These
    are sometimes known as p-distances.
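
As a minimal sketch of the p-distance, assuming two aligned sequences of equal length and no special handling of gaps or ambiguous bases (the function name `p_distance` is hypothetical):

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return differences / len(seq_a)

# Two aligned 10-base sequences that differ at sites 3 and 8.
example = p_distance("ATTTGCGGTA", "ATCTGCGATA")
```

Here two of the ten sites differ, so the p-distance is 0.2.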

17
Other sources of distances
  • Immunological data
      • Similarity between proteins A and B can be
        assessed by how well the immune system responds
        to B after already having seen A.
  • DNA/DNA hybridization
      • More similar DNA hybrids "melt" at higher
        temperatures.
  • Fragment length polymorphism
      • Chop DNA up using restriction enzymes.
      • Amplify some fragments using PCR.
      • Run the fragments out on an electrophoretic gel.
      • Compare profiles of different genomes.
  • BLAST scores

18
Observed distances usually underestimate the true
number of changes
ATTTGCGATA
ATTTGCGGTA
ATCTGCGATA
Actual changes: 2. Observed changes: 2.
19
  • Parallel changes
  • Reversals
  • Superimposed changes

ATTCGCGATA
ATTTGCGGTA
ATCTGCGATA
Actual changes: 4. Observed changes: 2.
20
  • Parallel changes
  • Reversals
  • Superimposed changes

ATTTGCGATA
ATTCGCGATA
ATTTGCGGTA
ATCTGCGATA
Actual changes: 4. Observed changes: 2.
21
  • Parallel changes
  • Reversals
  • Superimposed changes

ATTTGCGATA
ATTTGCGTTA
ATTTGCGGTA
ATCTGCGATA
Actual changes: 3. Observed changes: 2.
22
Correcting for hidden changes
  • Given a statistical model of how point mutations
    occur it is possible to estimate the true genetic
    distance from the observed distance.

23
Correcting under a simple model
  • The Jukes-Cantor model states that all states
    {A,C,G,T} and all changes between states, e.g.
    A→C, are equally likely.

[Figure: the four states A, C, G and T, with each change between two states occurring at rate u/3]
As a mathematical convenience, imagine we have a
rate 4u/3 of change to a random state; this
includes the possibility of a state changing to
itself.
24
A Poisson process
  • The probability of no change at a site over time
    t is $e^{-\frac{4}{3}ut}$.
  • The probability of at least one event is then
    $1 - e^{-\frac{4}{3}ut}$.
  • The probability of at least one event that leads
    to a different state from the one we started at
    is $\frac{3}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$, as one time out of four we will
    mutate to the same base we started with.
  • The expected observed distance d given a true
    genetic distance of ut is $d = \frac{3}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$.
  • Inverting this formula gives our correction
    $D = ut = -\frac{3}{4}\ln\left(1 - \frac{4}{3}d\right)$.
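
The two formulas above are inverses of each other, which a few lines of code can confirm. This is a sketch only; the function names `jc_expected_p` and `jc_correct` are made up for illustration, and the guard on the input range is an added assumption (the logarithm is undefined once the observed distance reaches 3/4).

```python
import math

def jc_expected_p(ut):
    """Expected observed distance d for a true genetic distance ut."""
    return 0.75 * (1 - math.exp(-4 * ut / 3))

def jc_correct(d):
    """Jukes-Cantor corrected distance D for an observed distance d."""
    if not 0 <= d < 0.75:
        raise ValueError("observed distance must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * d / 3)

# Round trip: correcting the expected observed distance recovers ut.
recovered = jc_correct(jc_expected_p(0.5))

# The correction is small for small distances and grows quickly after that.
small = jc_correct(0.1)
large = jc_correct(0.6)
```

An observed distance of 0.1 corrects to about 0.107, while 0.6 corrects to about 1.207, roughly double the observed value.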

25
Correcting for hidden changes
  • Correction for hidden changes has been shown
    (both theoretically and by simulation studies) to
    improve accuracy.
  • However, this is not universally true.
  • If data is clock-like then corrections will not
    change the relative size of the distances
  • However, the more complicated the model is the
    larger the variance (error) of the distances will
    become.

26
Under the Jukes-Cantor model, where all point
mutations are equally likely, the correction is
$D_{\text{actual}} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\,d_{\text{observed}}\right)$
28
An interesting observation
  • Uncorrected distances always obey the triangle
    inequality $d(x,y) + d(y,z) \ge d(x,z)$.
  • But corrected distances do not always.
  • E.g. if sequences a and b differ at 10 / 100
    sites and sequences b and c differ at a different
    10 / 100 sites, the uncorrected distances are
    d(a,b) = d(b,c) = 0.1, d(a,c) = 0.2 and the
    corrected distances (under the JC model) are
    D(a,b) = D(b,c) = 0.107, D(a,c) = 0.233, so
    D(a,b) + D(b,c) = 0.214 < D(a,c).
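
The arithmetic in this example can be reproduced directly. The sketch below assumes the Jukes-Cantor correction $D = -\frac{3}{4}\ln(1 - \frac{4}{3}d)$; the helper name `jc_correct` and the variable names are illustrative.

```python
import math

def jc_correct(d):
    """Jukes-Cantor corrected distance for an observed distance d."""
    return -0.75 * math.log(1 - 4 * d / 3)

# Uncorrected distances from the example: 10/100, 10/100 and 20/100 sites.
d_ab, d_bc, d_ac = 0.1, 0.1, 0.2
D_ab, D_bc, D_ac = (jc_correct(d) for d in (d_ab, d_bc, d_ac))

# Triangle inequality before and after correction.
uncorrected_ok = d_ab + d_bc >= d_ac
corrected_ok = D_ab + D_bc >= D_ac
```

The uncorrected distances satisfy the inequality with equality, whereas the corrected ones give roughly 0.107 + 0.107 < 0.233, because the correction is convex and inflates the larger distance proportionally more.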

29
Tree building - UPGMA
  • UPGMA works by progressively clustering the most
    similar taxa until all the taxa form a rooted
    clock-like tree.
  • Find the smallest entry in the distance matrix,
    say d(x,y).
  • Form a new internal node, z, that is a parent to
    x and y and set the edge lengths from z to x and
    z to y to half d(x,y).
  • Update the distance matrix by setting the
    distance from the new node z to each other
    taxon w to the average of the distances from w
    to the members of groups x and y.
  • REPEAT until all groups have been joined.
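
The steps above can be sketched in a few lines. This is an illustrative implementation, not the presenter's code: the function name `upgma`, the cluster labels, the size-weighted update (the usual UPGMA average over all members of the joined groups) and the four-taxon example matrix are assumptions. Ties are broken by taking the first smallest pair encountered.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster taxa by UPGMA.

    dist maps frozenset({a, b}) to a distance. Returns a list of
    (left, right, height) merges, where height is the distance from the
    new node down to its leaves.
    """
    d = dict(dist)
    sizes = {name: 1 for name in names}  # cluster label -> number of members
    merges = []
    while len(sizes) > 1:
        # Find the closest pair of current clusters.
        x, y = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        z = f"({x},{y})"
        nx, ny = sizes.pop(x), sizes.pop(y)
        # Distance from the new cluster to every other cluster is the
        # average over all members of x and y, i.e. weighted by size.
        for w in sizes:
            d[frozenset((w, z))] = (
                nx * d[frozenset((w, x))] + ny * d[frozenset((w, y))]
            ) / (nx + ny)
        sizes[z] = nx + ny
        merges.append((x, y, d[frozenset((x, y))] / 2))
    return merges

# Assumed example: two close pairs, with distance 4 between the pairs.
example = {
    frozenset(("A", "B")): 2, frozenset(("C", "D")): 2,
    frozenset(("A", "C")): 4, frozenset(("A", "D")): 4,
    frozenset(("B", "C")): 4, frozenset(("B", "D")): 4,
}
merges = upgma(["A", "B", "C", "D"], example)
```

On this input A and B are joined at height 1, then C and D at height 1, and finally the two pairs at height 2.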

30
What precisely is meant by the average distance?
  • If we are joining two groups i and j that already
    have $n_i$ and $n_j$ members, we update the distance
    to each other group k using
    $d(k, i \cup j) = \frac{n_i\, d(k,i) + n_j\, d(k,j)}{n_i + n_j}$

31
Step 1: Find the smallest entry in the distance
matrix.

d(i,j)  A  B  C  D  E  F
A       -
B       2  -
C       4  4  -
D       4  4  2  -
E       7  7  7  7  -
F       5  5  5  5  6  -
G       8  8  8  8  9  5

Step 2: Cluster taxa A and B and form a new
internal node I. Calculate the lengths of the new
edges: d(A,I) = d(B,I) = ½ d(A,B) = 1.

Step 3: Update the distance matrix: d(C,I) =
½(d(A,C) + d(B,C)) = 4, etc.

[Figure: taxa A to G, with A and B joined at internal node I by edges of length 1]
32
Step 1: Find the smallest entry in the distance
matrix.

d(i,j)  I (AB)  C  D  E  F
I (AB)  -
C       4       -
D       4       2  -
E       7       7  7  -
F       5       5  5  6  -
G       8       8  8  9  5

Step 2: Cluster taxa C and D and form a new
internal node II. Calculate the lengths of the new
edges: d(C,II) = d(D,II) = ½ d(C,D) = 1.

Step 3: Update the distance matrix: d(I,II) =
½(d(I,C) + d(I,D)) = 4, d(E,II) = ½(d(E,C) +
d(E,D)) = 7, etc.

[Figure: clustering so far, with A and B joined at node I and C and D joined at node II by edges of length 1; E, F and G not yet clustered]
33
And so on...
[Figure: successive UPGMA clustering steps on taxa A to G, forming internal nodes I to VI; edge lengths shown include 1, 2.5, 3.4, 3.8, 0.5, 0.9 and 0.4]
...until we have a rooted tree. But, is it the
right tree?
34
UPGMA is not consistent for additive distances


d(i,j)  A  B  C  D  E  F
A       -
B       2  -
C       4  4  -
D       4  4  2  -
E       7  7  7  7  -
F       5  5  5  5  6  -
G       8  8  8  8  9  5

The tree that matches the distances is not
recovered by UPGMA.
35
Inconsistency
  • When a method is given perfect data but still
    gets the wrong tree it is said to be
    inconsistent.
  • UPGMA is inconsistent for data that isn't
    ultrametric (clock-like).
  • Next we'll look at a method that is consistent
    for any additive data.

36
Neighbor-joining (NJ)
  • NJ works by progressively clustering taxa until
    all the taxa form an unrooted tree.
  • Rather than using the distance matrix directly to
    determine which taxa should be clustered at each
    stage, NJ uses the S matrix, where
    $S(i,j) = (N-2)\,d(i,j) - R(i) - R(j)$
      • N is the number of taxa.
      • R(i) is the sum of the ith row in the distance
        matrix.
      • R(j) is the sum of the jth row in the distance
        matrix.
  • Find the smallest entry in the S matrix, say
    S(x,y).

37
  • Form a new internal node, z, that is a parent to
    x and y and calculate the edge lengths from z to
    x and z to y.
      • $d(x,z) = \frac{1}{2(N-2)}\left[(N-2)\,d(x,y) + R(x) - R(y)\right]$
      • $d(y,z) = d(x,y) - d(x,z)$
  • Update the distance matrix
      • $d(w,z) = \frac{1}{2}\left(d(x,w) + d(y,w) - d(x,y)\right)$
  • REPEAT until only two things are left to be
    joined.
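
The NJ steps can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `neighbor_joining`, the node labels and the tie-breaking rule (first smallest S entry) are assumptions. The example matrix is also an assumption: a four-taxon cat/dog/rat/cow matrix whose row sums are 13, 15, 15 and 19.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Build an unrooted tree by neighbor-joining.

    dist maps frozenset({a, b}) to a distance. Returns a list of
    (node, node, edge_length) tuples.
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums R(i) over the current nodes.
        r = {i: sum(d[frozenset((i, j))] for j in nodes if j != i) for i in nodes}
        # Pick the pair with the smallest S(i,j) = (N-2) d(i,j) - R(i) - R(j).
        x, y = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        dxy = d[frozenset((x, y))]
        dxz = ((n - 2) * dxy + r[x] - r[y]) / (2 * (n - 2))
        z = f"({x},{y})"
        edges += [(x, z, dxz), (y, z, dxy - dxz)]
        # Replace x and y by the new node z and update its distances.
        nodes = [w for w in nodes if w not in (x, y)]
        for w in nodes:
            d[frozenset((w, z))] = (
                d[frozenset((x, w))] + d[frozenset((y, w))] - dxy
            ) / 2
        nodes.append(z)
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges

# Assumed four-taxon example with row sums 13, 15, 15 and 19.
example = {
    frozenset(("cat", "dog")): 3, frozenset(("cat", "rat")): 4,
    frozenset(("cat", "cow")): 6, frozenset(("dog", "rat")): 5,
    frozenset(("dog", "cow")): 7, frozenset(("rat", "cow")): 6,
}
edges = neighbor_joining(["cat", "dog", "rat", "cow"], example)
pendant = {leaf: length for leaf, _, length in edges[:4]}
```

On this input cat and dog are joined first (S = -22), giving pendant edges of 1 and 2, and the remaining join yields rat 2, cow 4 and an internal edge of 1.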

38
NJ Example
[Tables: distance matrix D and the S matrix derived from it]
Step 1
R(cat) = 13, R(dog) = 15, R(rat) = 15, R(cow) = 19
e.g. S(cat,dog) = (4-2)×3 - 13 - 15 = -22
     S(cat,rat) = (4-2)×4 - 13 - 15 = -20
39
NJ Example
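[Tables: distance matrix D and S matrix, labelled Step 1 and Step 2]
Step 3
d(cat,z) = ¼[2 d(cat,dog) + R(cat) - R(dog)] = ¼(6 + 13 - 15) = 1
d(dog,z) = 3 - 1 = 2
[Figure: tree with leaves Cat, Rat, Dog and Cow and the new internal node z]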
40
Step 4
d(z,rat) = ½[d(cat,rat) + d(dog,rat) - d(cat,dog)] = ½(4 + 5 - 3) = 3
d(z,cow) = ½(6 + 7 - 3) = 5
[Figure: tree with leaves Cat, Rat, Dog and Cow and internal node z]
41
Global vs Local methods
  • UPGMA and NJ are local construction methods. At
    each step they pick the best pair of taxa to
    cluster, and once a decision is made it cannot be
    unmade. This makes these methods very fast.
  • There are also global methods for making trees
    based on distances. These evaluate an optimality
    criterion on each possible tree and then pick the
    tree with the best score. Examples of global
    methods for distance data include least squares
    and minimum evolution. Because the number of
    trees grows very quickly with the number of taxa,
    these methods are slow.
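
To give a sense of how quickly the number of trees grows: the number of unrooted binary trees on n labelled taxa is (2n-5)!! = 3 × 5 × ... × (2n-5). The sketch below simply evaluates that product; the function name `num_unrooted_trees` is made up for illustration.

```python
def num_unrooted_trees(n):
    """Number of unrooted binary trees on n labelled taxa, (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-5.
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count

counts = {n: num_unrooted_trees(n) for n in (4, 5, 10, 20)}
```

Four taxa give 3 trees and five give 15, but ten taxa already give 2,027,025, which is why exhaustive evaluation of an optimality criterion is only feasible for small data sets.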

42
Least Squares
  • We would like the path lengths on the tree we
    choose to be as close as possible to the
    corresponding values in the distance matrix.
  • With additive data we can always find a tree
    where the path length distances and the distance
    matrix match exactly. However, most data isn't
    perfect...
  • We can try to minimise the discrepancy between
    the observed distances and the tree distances
    using a least squares approach.

43
A family of least squares methods
  • $w_{ij} = 1$: unweighted least squares
    (Cavalli-Sforza and Edwards 1967)
  • $w_{ij} = 1/D_{ij}$
  • $w_{ij} = 1/D_{ij}^2$ (Fitch and Margoliash 1967)
44
Picking the best weights for a given tree
  • The tree distances $d_{ij}$ can be represented by the
    equation
    $d_{ij} = \sum_k x_{ij,k}\, e_k$

where $x_{ij,k}$ is an indicator variable that is 1 if
edge k lies on the path from i to j and 0
otherwise. We want to find edge weights $e_k$ that
minimise
$\sum_{i<j} w_{ij}\,(D_{ij} - d_{ij})^2$
45
The indicator variables can be expressed in
matrix format

[Figure: tree on taxa A to E with edges labelled e1 to e7]

X       e1 e2 e3 e4 e5 e6 e7
DAB      1  1  1  0  0  0  0
DAC      1  1  0  1  1  0  0
DAD      1  1  0  1  0  0  1
DAE      1  0  0  0  0  1  0
DBC      0  0  1  1  1  0  0
DBD      0  0  1  1  0  0  1
DBE      0  1  1  0  0  1  0
DCD      0  0  0  0  1  0  1
DCE      0  1  0  1  1  1  0
DDE      0  1  0  1  0  1  1

Each row of X corresponds to a path in the
tree. We can write $D = Xe$, where D is the vector
of pairwise distances and e is the vector of edge
lengths.
46
Experience the joy of linear algebra
  • $D = Xe$
  • $X^T D = (X^T X)\,e$
  • $e = (X^T X)^{-1} X^T D$

This assumes that the weights $w_{ij} = 1$.
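
The normal equations can be solved with a few lines of exact arithmetic. The sketch below is an illustration under stated assumptions: a four-taxon tree with edges ordered (cat, dog, internal, rat, cow), an assumed additive distance vector, and hypothetical helper names `solve` and `least_squares_edges`. It uses plain Gauss-Jordan elimination rather than a numerical library.

```python
from fractions import Fraction

def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        # Swap in a row with a non-zero pivot, then normalise it.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def least_squares_edges(X, D):
    """Unweighted least squares edge lengths: solve (X^T X) e = X^T D."""
    cols = range(len(X[0]))
    XtX = [[sum(row[a] * row[b] for row in X) for b in cols] for a in cols]
    XtD = [sum(row[a] * dist for row, dist in zip(X, D)) for a in cols]
    return solve(XtX, XtD)

# Rows are the paths cat-dog, cat-rat, cat-cow, dog-rat, dog-cow, rat-cow;
# columns are the edges cat, dog, internal, rat, cow (assumed topology).
X = [
    [1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 0, 0, 1, 1],
]
D = [3, 4, 6, 5, 7, 6]  # assumed additive distances
edge_lengths = least_squares_edges(X, D)
```

Because these distances fit the tree exactly, the least squares solution reproduces the edge lengths 1, 2, 1, 2 and 4 with zero residual; with noisy distances the same code returns the best-fitting lengths for the given topology, which may include negative values.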
47
Minimum evolution
  • Uses the least squares method to fit the branch
    lengths for each tree
  • BUT uses a different optimality criterion than
    least squares.
  • Prefers the tree with the shortest sum of branch
    lengths

48
Review
  • Observed distances derived from sequence
    alignments tend to underestimate the true
    number of mutations. Hence it is usually a good
    idea to correct for these hidden changes.
  • Clustering methods like UPGMA and
    Neighbor-joining are very fast as they only make
    local decisions and never backtrack. These
    methods are often used as a starting point for
    heuristic searches.
  • There are also optimality criteria that use
    distances as input, e.g. Least squares and
    minimum evolution.

49
Review
  • Not all distances can be fit perfectly onto a
    tree.
  • Methods can be inconsistent; for example, for some
    non-clock-like distances UPGMA is guaranteed to
    recover the wrong tree.
  • UPGMA is consistent for clock-like distances and
    NJ is consistent for any additive distances.