Title: Methods for Determining Trees
1Methods for Determining Trees
- Sequence-based methods
  - Maximum Parsimony
  - Maximum Likelihood
- Distance-based methods
  - UPGMA
  - Neighbor Joining
2Maximum Parsimony
- A phylogeny constructed with the method of maximum parsimony explains evolution with the fewest evolutionary changes.
- A multiple sequence alignment must first be obtained. Each aligned column is a site.
3Maximum Parsimony Example
- 1 A A G A G T G C A
- 2 A G C C G T G C G
- 3 A G A T A T C C A
- 4 A G A G A T C C G
- four sequences, nine sites, three possible
unrooted trees
4Maximum Parsimony Example
Number of mutations: 10
5Maximum Parsimony Example
[Figure: tree with leaves (1) AAGAGTGCA, (2) AGCCGTGCG, (3) AGATATCCA, (4) AGAGATCCG and internal nodes AGGAGTGCA and AGAGGTCCG; mutations per branch: 1, 3, 5, 4, 1]
Number of mutations: 14
6Maximum Parsimony Example
[Figure: tree with leaves (1) AAGAGTGCA, (2) AGCCGTGCG, (3) AGATATCCA, (4) AGAGATCCG and internal nodes AGGAGTGCA and AGATGTCCG; mutations per branch: 1, 3, 5, 5, 2]
Number of mutations: 16
Tree I has the topology with the fewest mutations and is therefore the most parsimonious tree.
7Maximum Parsimony Example
- Some sites are informative, others are not.
- Informative site: there are at least two different kinds of nucleotides at the site, each of which is represented in at least two of the sequences under study.
- Only informative sites are considered.
8Maximum Parsimony Example
- 1 A A G A G T G C A
- 2 A G C C G T G C G
- 3 A G A T A T C C A
- 4 A G A G A T C C G
- Three informative columns
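The definition of an informative site can be checked mechanically. The following is a minimal Python sketch; the function name `informative_sites` and the 1-based column numbering are choices made for this illustration, not part of the slides.

```python
from collections import Counter


def informative_sites(sequences):
    """Return the 1-based indices of parsimony-informative columns."""
    sites = []
    for index, column in enumerate(zip(*sequences), start=1):
        counts = Counter(column)
        # States that are represented in at least two sequences.
        shared = [state for state, n in counts.items() if n >= 2]
        # Informative: at least two such states at this site.
        if len(shared) >= 2:
            sites.append(index)
    return sites


alignment = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]
print(informative_sites(alignment))  # [5, 7, 9]
```

For the four example sequences this reports columns 5, 7, and 9, which agrees with the count of three informative columns.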
9Maximum Parsimony Example
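- The three informative columns (sites 5, 7, and 9):
- 1 G G A
- 2 G G G
- 3 A C A
- 4 A C G
- Number of substitutions required over these columns:
- Tree 1: 4
- Tree 2: 5
- Tree 3: 6
[Figure: the three trees evaluated at Column 1, Column 2, and Column 3; a mark on a branch is a substitution]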
10Maximum Parsimony Problems
- Small Parsimony Problem
  - Given the phylogeny topology, compute the internal nodes to minimize the total number of mutations.
  - Used to evaluate the phylogeny.
  - Solvable in polynomial time.
- Large Parsimony Problem
  - Given that we have a way of determining the score of a given phylogeny, search through all possible phylogenies to find the best one.
  - Proved to be NP-complete.
11Fitch's Algorithm for Small Parsimony Problem
- Consider each site separately.
- Dynamic programming style.
- Constructs a set of possible states (possible nucleotides) for each internal node.
- Start at the leaves of the phylogeny. Each leaf is labeled with the singleton set containing the nucleotide at that particular site.
- Traverse in a postorder manner (all of the children of the current node have been visited before the current node).
12Fitch's Algorithm for Small Parsimony Problem
- Let m be an internal node with children l and r having states $S_l$ and $S_r$ respectively. The state of m, $S_m$, is computed as follows:
  $S_m = S_l \cup S_r$ if $S_l \cap S_r$ is empty
  $S_m = S_l \cap S_r$ otherwise
- Each application of the first rule contributes one count to the number of changes.
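The bottom-up pass can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: trees are written as nested pairs, the unrooted four-taxon trees are rooted on their internal branch (the Fitch count does not depend on where the root is placed), and the names `fitch_site` and `parsimony_score` are hypothetical.

```python
def fitch_site(tree, states):
    """Return (state set, number of changes) for one site.

    `tree` is a leaf name or a pair (left, right);
    `states` maps each leaf name to its nucleotide at this site.
    """
    if not isinstance(tree, tuple):
        return {states[tree]}, 0  # leaf: singleton set, no changes
    left_set, left_changes = fitch_site(tree[0], states)
    right_set, right_changes = fitch_site(tree[1], states)
    changes = left_changes + right_changes
    common = left_set & right_set
    if common:
        return common, changes  # non-empty intersection: no new change
    return left_set | right_set, changes + 1  # union: one more change


def parsimony_score(tree, alignment):
    """Sum the per-site Fitch counts over all columns of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        column = {name: seq[site] for name, seq in alignment.items()}
        total += fitch_site(tree, column)[1]
    return total


# The three informative columns of the example alignment.
informative = {"1": "GGA", "2": "GGG", "3": "ACA", "4": "ACG"}
trees = {
    "Tree 1": (("1", "2"), ("3", "4")),
    "Tree 2": (("1", "3"), ("2", "4")),
    "Tree 3": (("1", "4"), ("2", "3")),
}
for name, tree in trees.items():
    print(name, parsimony_score(tree, informative))  # 4, 5, 6
```

On the informative columns this gives 4, 5, and 6 changes for the three topologies, and on the full nine-site alignment the first topology needs 10 changes.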
13Fitch's Algorithm for Small Parsimony Problem
14Exhaustive Search: Number of Trees
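To give a sense of how quickly the search space grows, the sketch below counts binary tree topologies using the standard double-factorial formulas: (2n - 5)!! unrooted trees and (2n - 3)!! rooted trees for n labeled taxa. The function names are illustrative only.

```python
def count_unrooted_trees(n_taxa):
    """Number of unrooted binary topologies: 1 * 3 * 5 * ... * (2n - 5)."""
    count = 1
    for factor in range(3, 2 * n_taxa - 4, 2):
        count *= factor
    return count


def count_rooted_trees(n_taxa):
    """Number of rooted binary topologies: (2n - 3)!!.

    Rooting is equivalent to attaching one extra leaf to an unrooted tree.
    """
    return count_unrooted_trees(n_taxa + 1)


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Four taxa give the three unrooted trees used in the parsimony example; ten taxa already give more than two million.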
15Branch and Bound for Large Parsimony Problem
- Consider trees of increasing size (starting from 3 species).
- Branch: add one species, check all possible phylogeny topologies.
- Bound: use one solution as the first bound, and update the bound whenever a better one is found.
- Abort an extension if its score already exceeds the current best.
16Branch and Bound for Large Parsimony Problem
- The worst-case time complexity is the same as the complexity of exhaustive search.
- With a wisely chosen bound, many subtrees will be cut and therefore the running time will decrease.
- Sometimes a special traversal order finds better solutions faster.
- A* algorithm?
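A compact Python sketch of this branch-and-bound idea follows. It makes several assumptions that are not in the slides: the first taxon (in sorted order) is kept fixed at the root so that each unrooted tree is visited once, the initial bound is infinity and is replaced by the first complete tree found, and partial trees are scored with Fitch's algorithm. Pruning is valid because adding a taxon can never lower the parsimony score. All function names are hypothetical.

```python
def fitch_score(tree, seqs):
    """Fitch parsimony score of a rooted binary tree, summed over all sites."""
    def visit(node, site):
        if not isinstance(node, tuple):
            return {seqs[node][site]}, 0
        (left, a), (right, b) = visit(node[0], site), visit(node[1], site)
        if left & right:
            return left & right, a + b
        return left | right, a + b + 1

    n_sites = len(next(iter(seqs.values())))
    return sum(visit(tree, site)[1] for site in range(n_sites))


def insertions(tree, leaf):
    """Yield every tree obtained by attaching `leaf` to one branch of `tree`."""
    yield (tree, leaf)
    if isinstance(tree, tuple):
        left, right = tree
        for new_left in insertions(left, leaf):
            yield (new_left, right)
        for new_right in insertions(right, leaf):
            yield (left, new_right)


def branch_and_bound(seqs):
    """Return (best score, best tree) for three or more taxa."""
    taxa = sorted(seqs)
    best = {"score": float("inf"), "tree": None}

    def extend(rest, next_index):
        tree = (taxa[0], rest)
        score = fitch_score(tree, seqs)
        if score >= best["score"]:
            return  # bound: this partial tree cannot beat the current best
        if next_index == len(taxa):
            best["score"], best["tree"] = score, tree  # new, tighter bound
            return
        # Branch: try the next taxon on every branch of the partial tree.
        for candidate in insertions(rest, taxa[next_index]):
            extend(candidate, next_index + 1)

    extend((taxa[1], taxa[2]), 3)
    return best["score"], best["tree"]


alignment = {"1": "AAGAGTGCA", "2": "AGCCGTGCG", "3": "AGATATCCA", "4": "AGAGATCCG"}
print(branch_and_bound(alignment))
```

On the four-sequence example the search returns a score of 10 with sequences 3 and 4 grouped together, which is the topology of Tree I.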
17Maximum Parsimony
- Time-consuming algorithm
- Only works well if the sequences are strongly similar
18Maximum Likelihood
- Evaluates the topologies of different trees and picks the best one according to an optimality criterion, the likelihood score.
- Requires a specified model of the evolutionary process that can account for the conversion of one sequence into another.
19Maximum Likelihood Model
- The model is composed of the composition and the substitution process.
- Composition: frequencies of the character states.
- Substitution process: rate of change from one character state to another character state.
20Maximum Likelihood Model
- For DNA sequences, a simple model is that the rate of change from a to c or vice versa is 0.4, the composition of a is 0.25, and the composition of c is 0.25.
[Equation: matrix P]
21Maximum Likelihood Model
- For nucleotide sequences, there are 16 possible
ways to describe substitutions - a 4x4 matrix.
Each entry in the matrix represents the
substitution rate from nucleotide i to nucleotide
j (rows, and columns, follow the order A, C, G,
T).
22Maximum Likelihood Model
- In this matrix, the probability of an a changing to a c is 0.01 and the probability of a c remaining the same is 0.979, etc.
- The rows of this matrix sum to 1, meaning that for every nucleotide, we have covered all the possibilities of what might happen to it. The columns do not sum to anything in particular.
23Maximum Likelihood Model
- This matrix corresponds to one certain evolutionary distance.
- In the computation of likelihood, we need a matrix that can describe the branch lengths.
- Normally, a model gives another rate matrix, Q, which gives branch lengths in substitutions per site. For a branch length of v, $P(v) = e^{Qv}$.
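For a rate matrix Q and a branch length v, the probability matrix is commonly obtained as the matrix exponential P(v) = e^{Qv}. The sketch below approximates it with a truncated Taylor series in plain Python. Since the rate matrix used in the slides is not available here, the example substitutes the Jukes-Cantor rate matrix (all off-diagonal rates equal, scaled to one substitution per site per unit branch length) as an assumption, because its closed form makes the result easy to check.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, v, terms=30):
    """Approximate P(v) = exp(Q v) by summing the first `terms` Taylor terms."""
    n = len(q)
    qv = [[q[i][j] * v for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes (Qv)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, qv)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


# Jukes-Cantor rate matrix (an assumption made for this example).
Q = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
P = transition_matrix(Q, 0.1)
print([round(x, 4) for x in P[0]])
# Closed form for the diagonal under Jukes-Cantor, for comparison:
print(round(0.25 + 0.75 * math.exp(-4 * 0.1 / 3), 4))
```

Each row of the resulting matrix sums to 1, as required of a probability matrix. A truncated series is adequate for short branches; a numerical library routine would be the usual choice in practice.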
24Maximum Likelihood Example
What is the likelihood of an alignment, given a tree topology with branch lengths, a rate matrix, and a composition?
25Maximum Likelihood Example
26Maximum Likelihood Example
- If we calculate the other three sites in a similar way, we get site likelihoods 0.245, 0.00368, and 0.166. If we multiply all the site likelihoods together, we get a likelihood for the tree of 3.04 × 10^-6.
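The per-site calculation and the final product over sites can be sketched with Felsenstein's pruning algorithm. The example below is an illustration under assumptions, not the model from the slides: it uses the Jukes-Cantor model (equal base frequencies of 0.25 and a closed-form probability matrix), and it represents an internal node as a tuple of `(child, branch_length)` pairs. All names are hypothetical.

```python
import math

BASES = "ACGT"


def jc_prob(i, j, v):
    """Jukes-Cantor probability of state i becoming j over branch length v."""
    decay = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay


def conditional(node, site, seqs):
    """Likelihood of the data below `node` for each possible state at `node`."""
    if isinstance(node, str):  # leaf: the observed nucleotide is certain
        return {x: 1.0 if x == seqs[node][site] else 0.0 for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, v in node:
        below = conditional(child, site, seqs)
        for x in BASES:
            result[x] *= sum(jc_prob(x, y, v) * below[y] for y in BASES)
    return result


def tree_likelihood(tree, seqs):
    """Multiply the site likelihoods, assuming sites evolve independently."""
    n_sites = len(next(iter(seqs.values())))
    likelihood = 1.0
    for site in range(n_sites):
        cond = conditional(tree, site, seqs)
        # Weight each root state by its composition (0.25 under this model).
        likelihood *= sum(0.25 * cond[x] for x in BASES)
    return likelihood


# Hypothetical two-taxon tree with branch lengths 0.1 and 0.2.
tree = (("a", 0.1), ("b", 0.2))
print(tree_likelihood(tree, {"a": "CCAT", "b": "CCGT"}))
```

Because products of many small site likelihoods underflow quickly, practical programs usually sum log-likelihoods instead of multiplying raw values.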
27Maximum Likelihood Example
28Maximum Likelihood Search
- Input
  - Alignment of sequences
  - Model: rate matrix, base frequencies
- Search
  - Go through all possible trees. For each tree, calculate branch lengths and then the likelihood value.
  - Output the tree with the maximum likelihood value.
29Heuristic Search in PAUP
- Stepwise addition
  - As is
  - Random
  - Closest
  - Simple
- Branch swapping
  - Nearest Neighbor Interchange (NNI)
  - Subtree Pruning Regrafting (SPR)
  - Tree Bisection Reconnection (TBR)
30Heuristic Search in PAUP
31Addition Order
- As is: input sequence order
- Random: in each step, select a random taxon
- Closest: in each step, all remaining taxa are considered
- Simple: assume a reference taxon, calculate the distance between this reference taxon and all the other taxa, and add the taxa in increasing order of distance
32Nearest Neighbor Interchange (NNI)
33Subtree Pruning Regrafting (SPR)
34Tree Bisection Reconnection (TBR)
35Heuristic Search in PHYLIP
- Stepwise addition in As Is or Random order.
- In each step, do Local Rearrangement by using Nearest Neighbor Interchange (NNI).
- After all taxa have been added, do Global Rearrangement by using Subtree Pruning Regrafting (SPR).
36Distance Method
37Distance Method
- A distance table is used: a symmetric matrix M that gives the pairwise distances.
- Goal: build an edge-weighted tree where each leaf (external node) corresponds to one object of M, so that the distance measured on the tree between leaves i and j corresponds to $M_{ij}$.
38Example of Distance Analysis
- Distances can be shown as a table
- A ACGCGTTGGGCGATGGCAAC
- B ACGCGTTGGGCGACGGTAAT
- C ACGCATTGAATGATGATAAT
- D ACACATTGAGTGATAATAAT
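One simple way to fill such a table is to count the positions at which two aligned sequences differ. The sketch below assumes this plain mismatch count as the distance measure, which is an assumption made for illustration; the function names are hypothetical.

```python
def mismatches(a, b):
    """Number of aligned positions at which the two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def distance_table(seqs):
    """Return {(name1, name2): distance} for every unordered pair."""
    names = list(seqs)
    return {
        (p, q): mismatches(seqs[p], seqs[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }


sequences = {
    "A": "ACGCGTTGGGCGATGGCAAC",
    "B": "ACGCGTTGGGCGACGGTAAT",
    "C": "ACGCATTGAATGATGATAAT",
    "D": "ACACATTGAGTGATAATAAT",
}
for pair, distance in distance_table(sequences).items():
    print(pair, distance)
# ('A', 'B') 3, ('A', 'C') 7, ('A', 'D') 8,
# ('B', 'C') 6, ('B', 'D') 7, ('C', 'D') 3
```

The smallest distances, A to B and C to D, suggest which sequences should end up as neighbors in the tree.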
39Example of Distance Analysis
- Using this information, a tree can be drawn
- A ACGCGTTGGGCGATGGCAAC
- B ACGCGTTGGGCGACGGTAAT
- C ACGCATTGAATGATGATAAT
- D ACACATTGAGTGATAATAAT
40UPGMA
- Unweighted Pair Group Method Using Arithmetic Averages
- Works by clustering sequences
41UPGMA
- The distance $d_{ij}$ between clusters $C_i$ and $C_j$ is the average distance between pairs of sequences from each cluster:
  $d_{ij} = \frac{1}{|C_i||C_j|} \sum_{p \in C_i,\, q \in C_j} d_{pq}$
- $|C_i|$ and $|C_j|$ are the numbers of sequences in clusters i and j.
42UPGMA
- Initialization steps
- 1. Assign each sequence i to its own cluster $C_i$.
- 2. Define one leaf of the tree T for each sequence.
43UPGMA
- Iteration steps
- 3. Determine the two clusters i and j for which $d_{ij}$ is minimal.
- 4. Define a new cluster k by $C_k = C_i \cup C_j$, and define $d_{kl}$ for all l.
- 5. Add k to the current clusters and remove i and j.
- 6. Repeat steps 3-5 until only two clusters i and j remain.
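The clustering steps can be sketched in Python as below. A few choices here are assumptions not stated in the steps: the new distance to a merged cluster is computed as the size-weighted average of the two old distances (which equals the average over all sequence pairs), each new node is placed at height d_ij / 2, and the loop merges the last two clusters as well so that a single rooted tree is returned. The distances used in the example are mismatch counts for four sequences labeled A to D.

```python
def upgma(names, distances):
    """Cluster with UPGMA and return (tree, height of the root).

    `distances` maps frozenset({a, b}) to the distance between leaves a and b.
    Trees are nested pairs of leaf names.
    """
    d = dict(distances)
    size = {name: 1 for name in names}
    active = list(names)
    height = 0.0
    while len(active) > 1:
        # Find the pair of clusters with minimal distance.
        i, j = min(
            ((a, b) for k, a in enumerate(active) for b in active[k + 1:]),
            key=lambda pair: d[frozenset(pair)],
        )
        new = (i, j)
        # Distance from the merged cluster to every other cluster.
        for other in active:
            if other != i and other != j:
                d[frozenset((new, other))] = (
                    size[i] * d[frozenset((i, other))]
                    + size[j] * d[frozenset((j, other))]
                ) / (size[i] + size[j])
        size[new] = size[i] + size[j]
        height = d[frozenset((i, j))] / 2
        # Replace i and j by the merged cluster.
        active = [c for c in active if c != i and c != j] + [new]
    return active[0], height


pairs = {("A", "B"): 3, ("A", "C"): 7, ("A", "D"): 8,
         ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 3}
dist = {frozenset(pair): float(value) for pair, value in pairs.items()}
print(upgma(["A", "B", "C", "D"], dist))
# ((('A', 'B'), ('C', 'D')), 3.5)
```

Ties are broken by input order in this sketch, so with other data the result may depend on how the sequences are listed.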
44UPGMA
- Example using 5 Sequences
45Neighbor Joining
- Begin with a star topology: no neighbors have been joined
46Neighbor Joining
- Tree modified by joining pairs of sequences
- Pair is chosen by calculating sum of branch
lengths for the corresponding tree
47Neighbor Joining
48Neighbor Joining
- The pair giving the smallest sum of branch lengths is chosen to be joined.
- Calculate new branch lengths.
49Neighbor Joining
- A new distance table is created with the joined sequences entered as a composite.
- Repeat the process to select the next pair to join.
- The process continues until a correctly branched tree and its distances are identified.
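A minimal Python sketch of neighbor joining follows. Instead of computing the sum of branch lengths for every candidate tree directly, it uses the commonly used Q criterion, Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r is the row sum of the distance table; this is the usual computational shortcut for selecting the pair to join, and it is named here as a substitution. The internal node labels `node0`, `node1`, ... are hypothetical, and the example distances are mismatch counts for four sequences labeled A to D.

```python
def neighbor_joining(names, distances):
    """Return the tree as a list of (node, neighbor, branch length) edges.

    `distances` maps frozenset({a, b}) to the distance between taxa a and b.
    Needs at least three taxa.
    """
    d = dict(distances)
    active = list(names)
    edges = []
    next_id = 0
    while len(active) > 3:
        n = len(active)
        r = {a: sum(d[frozenset((a, b))] for b in active if b != a)
             for a in active}
        # Select the pair that minimizes the Q criterion.
        i, j = min(
            ((a, b) for k, a in enumerate(active) for b in active[k + 1:]),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        node = f"node{next_id}"
        next_id += 1
        dij = d[frozenset((i, j))]
        # New branch lengths from i and j to the joining node.
        length_i = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges.append((i, node, length_i))
        edges.append((j, node, dij - length_i))
        # New distance table: the joined pair is entered as a composite.
        for k in active:
            if k != i and k != j:
                d[frozenset((node, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                ) / 2
        active = [a for a in active if a != i and a != j] + [node]
    # Join the last three nodes at a central node.
    a, b, c = active
    center = f"node{next_id}"
    dab, dac, dbc = (d[frozenset((a, b))], d[frozenset((a, c))],
                     d[frozenset((b, c))])
    edges.append((a, center, (dab + dac - dbc) / 2))
    edges.append((b, center, (dab + dbc - dac) / 2))
    edges.append((c, center, (dac + dbc - dab) / 2))
    return edges


pairs = {("A", "B"): 3, ("A", "C"): 7, ("A", "D"): 8,
         ("B", "C"): 6, ("B", "D"): 7, ("C", "D"): 3}
dist = {frozenset(pair): float(value) for pair, value in pairs.items()}
for edge in neighbor_joining(["A", "B", "C", "D"], dist):
    print(edge)
```

For these distances the leaf branches come out as A = 2, B = 1, C = 1, D = 2 with an internal branch of 4, so the path lengths on the tree reproduce the input table exactly (for example A to D is 2 + 4 + 2 = 8).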
50Comparison of Methods