Title: In this approach, trees are constructed by comparing the ..
1Phylogenetic Trees, Lecture 2
Based on Durbin et al. 7.4 and Gusfield 17
2Character-based methods for constructing phylogenies
- In this approach, trees are constructed by comparing the characters of the corresponding species.
- Characters may be morphological (teeth structures) or molecular (homologous DNA sequences). One common approach is Maximum Parsimony.
- Assumptions:
  - Independence of characters (no interactions)
  - Best tree is one where minimal changes take place
31. Maximum Parsimony
Input: four nucleotide sequences AAG, AAA, GGA, AGA taken from four species.
Question: Which evolutionary tree best explains these sequences?
4Example Continued
There are many possible trees. For example:
[Figure: two candidate trees for the four sequences]
The left tree is preferred over the right tree because it requires fewer changes. The total number of changes is called the parsimony score.
5Simple Example
- Suppose we have five species, such that three have C and two have T at a specified position.
- The minimal tree has one evolutionary change.
[Figure: tree on the five species with a single change between T and C marked on one branch]
6Extension to Many Letters
- What is the parsimony score of
  A: CAGGTA
  B: CAGACA
  C: CGGGTA
  D: TGCACT
  E: TGCGTA
We do it character by character; each score is computed independently of the others.
7Fitch's Algorithm for Evaluating Trees
- Assume that a tree is given.
- Traverse the tree from leaves to root, determining the set of possible states (e.g. nucleotides) for each internal node.
- Traverse the tree from root to leaves, picking ancestral states for internal nodes.
8Fitch's Algorithm Step 1
- # of changes = # of union operations
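9Fitch's Algorithm Step 1
- Do a post-order (from leaves to root) traversal of the tree.
- Determine the possible states $R_i$ of internal node i with children j and k:
  $R_i = R_j \cap R_k$ if $R_j \cap R_k \neq \emptyset$, otherwise $R_i = R_j \cup R_k$.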
10Fitch's Algorithm Step 2
11Fitch's Algorithm Step 2
- Do a pre-order (from root to leaves) traversal of the tree.
- Select the state $r_j$ of internal node j with parent i:
  $r_j = r_i$ if $r_i \in R_j$, otherwise $r_j$ is an arbitrary state in $R_j$.
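The two passes can be sketched in Python for a single character. This is a minimal sketch rather than the lecture's own code: the nested-tuple tree encoding, the names `fitch` and `parsimony_score`, the species labels `s1`..`s4`, and the use of `min` to pick an "arbitrary" state are all assumptions made for illustration.

```python
def fitch(tree, leaf_states):
    """Fitch's algorithm for one character on a rooted binary tree.

    tree: nested 2-tuples with leaf names at the tips (assumed encoding).
    leaf_states: dict mapping leaf name -> observed state.
    Returns (number of changes, tree relabeled with the chosen states).
    """
    changes = 0

    def up(node):
        # Step 1 (post-order): compute the candidate state set R_i.
        nonlocal changes
        if not isinstance(node, tuple):
            return ({leaf_states[node]}, node)
        left, right = up(node[0]), up(node[1])
        common = left[0] & right[0]
        if common:
            return (common, left, right)
        changes += 1  # each union operation costs exactly one change
        return (left[0] | right[0], left, right)

    def down(annotated, parent_state):
        # Step 2 (pre-order): keep the parent's state whenever it is allowed.
        states = annotated[0]
        state = parent_state if parent_state in states else min(states)
        if len(annotated) == 2:  # leaf: (state set, name)
            return state
        return (state, down(annotated[1], state), down(annotated[2], state))

    root = up(tree)
    return changes, down(root, None)


def parsimony_score(tree, sequences):
    """Sum the per-site Fitch scores; sites are treated independently."""
    length = len(next(iter(sequences.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in sequences.items()})[0]
        for i in range(length)
    )


# The four sequences from the Maximum Parsimony example (labels are hypothetical).
seqs = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}
print(parsimony_score((("s1", "s2"), ("s3", "s4")), seqs))  # 3
print(parsimony_score((("s1", "s3"), ("s2", "s4")), seqs))  # 4
```

Under this encoding, grouping AAG with AAA gives the lower score, so that topology would be the preferred one of the two.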
12Weighted Version of Fitch's Algorithm
- Instead of assuming all state changes are equally likely, use different costs c(a, b) for different changes.
- The 1st step of the algorithm is to propagate costs up through the tree.
13Weighted Version of Fitch's Algorithm
- Want to determine the minimal cost S(i, a) of assigning character a to node i.
- For leaf nodes i: $S(i, a) = 0$ if leaf i is labeled by a, otherwise $S(i, a) = \infty$.
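14Weighted Version of Fitch's Algorithm
- Want to determine the minimal cost S(i, a) of assigning character a to node i.
- For internal nodes i with children j and k:
  $S(i, a) = \min_x \big(S(j, x) + c(a, x)\big) + \min_y \big(S(k, y) + c(a, y)\big)$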
15Weighted Version of Fitch's Algorithm Step 2
- Do a pre-order (from root to leaves) traversal of the tree.
- Select the minimal cost character for the root.
- For each internal node j, select the character that produced the minimal cost at parent i.
16Weighted Parsimony Scores
- Weighted parsimony score:
  - Each change is weighted by a score c(a, b).
- The weighted parsimony score reduces to the parsimony score when $c(a, a) = 0$ and $c(a, b) = 1$ for all $b \neq a$.
17Evaluating Weighted Parsimony Scores
- Each position is independent and computed by itself.
- Use dynamic programming on a given tree.
- If i is a node with children j and k, then
  $S(i, a) = \min_x \big(S(j, x) + c(a, x)\big) + \min_y \big(S(k, y) + c(a, y)\big)$
  where $S(i, a)$ is the minimum score of the subtree rooted at i when i has character a.
[Figure: node i with table S(i, a) above its children j and k with tables S(j, x) and S(k, y)]
18Evaluating Parsimony Scores
- Dynamic programming on a given tree
- Initialization:
  - For each leaf i set $S(i, a) = 0$ if i is labeled by a, otherwise $S(i, a) = \infty$.
- Iteration:
  - If i is a node with children j and k, then
    $S(i, a) = \min_x \big(S(j, x) + c(a, x)\big) + \min_y \big(S(k, y) + c(a, y)\big)$
- Termination:
  - The cost of the tree is $\min_x S(r, x)$, where r is the root.
Comment: To reconstruct an optimal assignment, we need to keep in each node i and for each character a the two characters x, y that bring about the minimum when i has character a.
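The recurrence above can be sketched in Python as follows. This is an illustrative sketch only: the nested-tuple tree encoding and the names `sankoff`, `unit_cost` and `ti_tv_cost` are assumptions, the transition/transversion costs are made-up example values, and the traceback pointers mentioned in the comment are omitted for brevity.

```python
import math


def sankoff(tree, leaf_states, alphabet, cost):
    """Minimum weighted parsimony cost of one character on a rooted binary tree.

    tree: nested 2-tuples with leaf names at the tips (assumed encoding).
    cost: function c(a, b) giving the cost of a change between a and b.
    """
    def table(node):
        # Returns {a: S(node, a)} for every character a in the alphabet.
        if not isinstance(node, tuple):
            # Initialization: 0 for the observed state, infinity otherwise.
            return {a: 0 if leaf_states[node] == a else math.inf for a in alphabet}
        s_j, s_k = table(node[0]), table(node[1])
        # Iteration: S(i,a) = min_x(S(j,x) + c(a,x)) + min_y(S(k,y) + c(a,y)).
        return {
            a: min(s_j[x] + cost(a, x) for x in alphabet)
            + min(s_k[y] + cost(a, y) for y in alphabet)
            for a in alphabet
        }

    # Termination: the cost of the tree is min_x S(root, x).
    return min(table(tree).values())


def unit_cost(a, b):
    """c(a,a) = 0 and c(a,b) = 1: reduces to the unweighted parsimony score."""
    return 0 if a == b else 1


def ti_tv_cost(a, b):
    """Example weights (assumed): transitions cost 1, transversions cost 2."""
    if a == b:
        return 0
    return 1 if {a, b} in ({"A", "G"}, {"C", "T"}) else 2
```

Each internal node does work proportional to the square of the alphabet size, which is where the $k^2$ factor in the running time comes from.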
19Cost of Evaluating Parsimony for binary trees
- If there are n nodes, m characters, and k possible values for each character, then the complexity is $O(nmk^2)$.
- Of course, we still need to search over ALL possible trees and find the best one. One usually resorts to heuristic search techniques.
20Exploring the Space of Trees
- We've considered how to find the minimum number of changes for a given tree topology.
- Need some search procedure for exploring the space of tree topologies.
- Given n sequences there are $1 \cdot 3 \cdot 5 \cdots (2n-3)$ possible rooted trees.
21Counting Trees
n = 3: one unrooted tree
n = 4: 3 unrooted trees
A rooted tree with n leaves has (2n-1) nodes and (2n-2) edges; discounting the edge to the root, an unrooted tree has (2n-3) edges. A new leaf can be attached to any existing edge, and each additional leaf adds two edges. Therefore we have $1 \cdot 3 \cdot 5 \cdots (2n-5)$ unrooted trees with n leaves. Each such tree has (2n-3) edges, any of which can be chosen as the position of the root of a rooted tree. Hence we have $1 \cdot 3 \cdot 5 \cdots (2n-5) \cdot (2n-3)$ rooted trees with n leaves.
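To see how quickly these products grow, the two counts can be evaluated directly. A small Python sketch; the function names are assumptions made for illustration.

```python
def num_unrooted_trees(n):
    """1 * 3 * 5 * ... * (2n - 5): unrooted binary trees on n >= 3 leaves."""
    count = 1
    for k in range(3, 2 * n - 4, 2):
        count *= k
    return count


def num_rooted_trees(n):
    """Each unrooted tree can be rooted on any of its (2n - 3) edges."""
    return num_unrooted_trees(n) * (2 * n - 3)


for n in (3, 4, 5, 10):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

For n = 10 this already gives 2,027,025 unrooted and 34,459,425 rooted trees.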
22Exploring the Space of Trees
23Maximum Parsimony
Site        1  2  3  4  5  6  7  8  9  10
Species 1   A  G  G  G  T  A  A  C  T  G
Species 2   A  C  G  A  T  T  A  T  T  A
Species 3   A  T  A  A  T  T  G  T  C  T
Species 4   A  A  T  G  T  T  G  T  C  G
How many possible unrooted trees?
25Maximum Parsimony
How many substitutions?
MP
28Maximum Parsimony
Site 2: 1 - G, 2 - C, 3 - T, 4 - A
31Maximum Parsimony
Site 4: 1 - G, 2 - A, 3 - A, 4 - G
[Figure: site 4 mapped onto the three possible unrooted trees, which require 2, 2 and 1 changes]
32Maximum Parsimony
33Maximum Parsimony
Site        1  2  3  4  5  6  7  8  9  10
Species 1   A  G  G  G  T  A  A  C  T  G
Species 2   A  C  G  A  T  T  A  T  T  A
Species 3   A  T  A  A  T  T  G  T  C  T
Species 4   A  A  T  G  T  T  G  T  C  G
Changes     0  3  2  2  0  1  1  1  1  3   (total 14)
34Finding most parsimonious trees - exact solutions
- Exact solutions can only be used for small numbers of taxa.
- Exhaustive search examines all possible trees.
- Typically used for problems with fewer than 10 taxa.
35Finding most parsimonious trees - exhaustive search
(1) Starting tree: any 3 taxa (A, B, C).
(2a)-(2c) Add the fourth taxon (D) in each of the three possible positions: three trees.
Add the fifth taxon (E) in each of the five possible positions on each of the three trees -> 15 trees, and so on.
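The enumeration scheme above can be sketched in Python by attaching each new taxon to every edge of every tree built so far. This is a minimal sketch: the edge-set encoding, the integer labels for internal nodes, and the names `insert_taxon` and `all_unrooted_trees` are assumptions for illustration, and taxon names are assumed to be strings so they cannot clash with the integer internal nodes.

```python
def insert_taxon(tree, taxon, new_node):
    """Yield each tree made by attaching `taxon` to one edge of `tree`.

    tree: frozenset of edges; each edge is a frozenset of two node labels.
    new_node: label of the internal node created on the split edge.
    """
    for edge in tree:
        u, v = tuple(edge)
        yield (tree - {edge}) | {
            frozenset((u, new_node)),
            frozenset((new_node, v)),
            frozenset((new_node, taxon)),
        }


def all_unrooted_trees(taxa):
    """Enumerate all unrooted binary trees on `taxa` (at least 3 names)."""
    # Starting tree: the first three taxa joined at internal node 0.
    trees = [frozenset(frozenset((0, t)) for t in taxa[:3])]
    for step, taxon in enumerate(taxa[3:], start=1):
        trees = [t for tree in trees for t in insert_taxon(tree, taxon, step)]
    return trees


for k in range(3, 7):
    print(k, len(all_unrooted_trees(list("ABCDEF")[:k])))
```

An exhaustive search would score every tree in the returned list, which is only feasible for small numbers of taxa.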
36Finding most parsimonious trees - exact solutions
- Branch and bound saves time by discarding families of trees during tree construction that cannot be smaller than the smallest tree found so far.
- (Here smaller means a smaller score, i.e., more parsimonious.)
- Can be enhanced by specifying an initial upper bound for tree length (total number of changes on the tree), e.g., from a distance method.
- Typically used only for problems with fewer than 20 taxa.
37Finding most parsimonious trees - branch and bound
[Figure: branch-and-bound search tree. The starting three-taxon tree (A) leads to the four-taxon trees B1, B2 and B3, which in turn lead to the five-taxon trees C1.1-C1.5, C2.1-C2.5 and C3.1-C3.5.]
38Finding most parsimonious trees - heuristics
- The number of possible trees increases exponentially with the number of taxa, making exhaustive searches impractical for many data sets (an NP-complete problem).
- Heuristic methods are used to search tree space for most parsimonious trees.
- The trees found are not guaranteed to be the most parsimonious; they are best guesses.
39Finding most parsimonious trees - heuristics
- Stepwise addition
  - Asis: the order in the distance matrix
  - Closest: starts with the shortest 3-taxon tree and adds taxa in the order that produces the least increase in tree length
  - Simple: the first taxon in the matrix is taken as a reference; taxa are added to it in the order of their decreasing similarity to the reference
  - Random: taxa are added in a random sequence; many different sequences can be used
- Recommend random with as many (e.g. 10-100) addition sequences as practical
40Finding most parsimonious trees - heuristics
- Branch Swapping
- Nearest neighbor interchange (NNI)
- Subtree pruning and regrafting (SPR)
- Tree bisection and reconnection (TBR)
41Finding most parsimonious trees - heuristics 1
- Nearest neighbor interchange (NNI)
[Figure: a tree on taxa A-G and two trees obtained from it by nearest neighbor interchange]
42Finding most parsimonious trees - heuristics 2
- Subtree pruning and regrafting (SPR)
[Figure: a tree on taxa A-G, a subtree pruned from it, and the tree obtained by regrafting the subtree elsewhere]
43Finding most parsimonious trees - heuristics 3
- Tree bisection and reconnection (TBR)
[Figure: a tree on taxa A-G bisected into two subtrees, and a tree obtained by reconnecting them]
44Finding most parsimonious trees - heuristics - summary
- Branch Swapping
  - Nearest neighbor interchange (NNI)
  - Subtree pruning and regrafting (SPR)
  - Tree bisection and reconnection (TBR)
- The nature of heuristic searches means we cannot know which method will find the most parsimonious trees or all such trees.
- However, TBR is the most extensive swapping routine and its use with multiple random addition sequences should work well.
45Tree space may be populated by local minima and islands of most parsimonious trees
[Figure: random addition sequence replicates. Tree length is plotted over tree space; branch swapping from three starting trees reaches the global minimum once (SUCCESS) and stops in local minima twice (FAILURE).]
46Multiple most parsimonious trees
- Many parsimony analyses yield multiple equally optimal trees.
- Multiple trees are due to either
  - alternative equally parsimonious optimizations of homoplastic characters,
  - missing data,
  - or both.
- We can further select among these trees with additional criteria, but
- most commonly, relationships common to all the optimal trees are summarized with consensus trees.
47Consensus methods
- A consensus tree is a summary of the agreement among a set of fundamental trees.
- There are many different consensus methods that differ in
  - 1. the kind of agreement
  - 2. the level of agreement
- Consensus methods can be used with any type of tree, not just parsimony trees.
48Strict consensus methods
- Strict consensus methods require agreement across all the fundamental trees.
- They show only those relationships that are unambiguously supported by the parsimonious interpretation of the data.
- The commonest method (strict component consensus) focuses on clades.
- This method produces a consensus tree that includes all and only those clades found in all the fundamental trees.
- Other relationships (those in which the fundamental trees disagree) are shown as unresolved polytomies.
49Strict consensus methods
[Figure: two fundamental trees on taxa A-G and their strict component consensus tree]
50Majority-rule consensus methods
- Majority-rule consensus methods require agreement across a majority of the fundamental trees.
- May include relationships that are not supported by the most parsimonious interpretation of the data.
- The commonest method focuses on clades.
- This method produces a consensus tree that includes all and only those clades found in a majority (> 50%) of the fundamental trees.
- Other relationships are shown as unresolved polytomies.
- Of particular use in bootstrapping.
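The clade-counting step shared by strict and majority-rule component consensus can be sketched in Python. This is a minimal sketch under assumptions: trees are rooted nested tuples, the names `clades` and `consensus_clades` are made up for illustration, and the sketch returns only the retained clades with their frequencies rather than assembling the consensus tree itself.

```python
from collections import Counter


def clades(tree):
    """Return the set of clades (leaf sets of internal nodes) of a rooted tree."""
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def consensus_clades(trees, threshold=0.5):
    """Map each clade found in more than `threshold` of the trees to its frequency.

    threshold=0.5 gives the majority-rule clades; keeping only the clades
    with frequency 1.0 gives the strict component consensus.
    """
    counts = Counter(c for t in trees for c in clades(t))
    return {c: n / len(trees) for c, n in counts.items() if n / len(trees) > threshold}


# Three hypothetical fundamental trees on taxa A-D.
trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "D"), "C"),
    ((("A", "C"), "B"), "D"),
]
majority = consensus_clades(trees)
strict = {c for c, freq in majority.items() if freq == 1.0}
```

In this toy input the clades {A, B} and {A, B, C} each occur in two of the three trees and survive the majority rule, while only the root clade survives the strict rule, so the strict consensus is an unresolved polytomy.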
51Majority rule consensus
[Figure: three fundamental trees on taxa A-G and their majority-rule component consensus tree. Numbers (66 or 100) indicate the frequency (%) of clades in the fundamental trees.]
52Reduced consensus methods
- Focus upon any cladistic relationships (statements that some taxa are more closely related to each other than to some other taxa).
- Reduced consensus methods occur in strict and majority-rule varieties.
- Other relationships are shown as unresolved polytomies.
- May be more sensitive than methods focusing only on clades.
53Reduced consensus methods
[Figure: two fundamental trees on taxa A-G. Their strict component consensus is completely unresolved; the strict reduced cladistic consensus tree excludes taxon G.]
54Consensus methods - 2
[Figure: three fundamental trees for Ochromonas, Symbiodinium, Prorocentrum, Loxodes, Tetrahymena, Spirostomum, Tracheloraphis, Euplotes and Gruberia, with their strict (component) consensus, strict reduced cladistic consensus (Euplotes excluded) and majority-rule consensus (clade frequencies of 100 and 66 shown).]
55Consensus methods
- Use strict methods to identify those relationships unambiguously supported by the parsimonious interpretation of the data.
- Use reduced methods where consensus trees are poorly resolved.
- Use majority-rule methods in bootstrapping.
- Avoid other methods which have ambiguous interpretations.
56Parsimony - advantages
- A simple method with an easily understood operation.
- Does not seem to depend on an explicit model of evolution.
- Gives both trees and associated hypotheses of character evolution.
- Should give reliable results if the data are well structured and homoplasy is either rare or randomly distributed on the tree.
58Parsimony - disadvantages
- May give misleading results if homoplasy is common or concentrated in particular parts of the tree, e.g.
  - thermophilic convergence
  - base composition biases
  - long branch attraction
- Underestimates branch lengths.
- Model of evolution is implicit; the behaviour of the method is not well understood.
- Parsimony is often justified on purely philosophical grounds (we must prefer the simplest hypotheses), particularly by morphologists.
- For most molecular systematists this is uncompelling.
59Parsimony can be inconsistent
- Felsenstein (1978) developed a simple model phylogeny including four taxa and a mixture of short and long branches.
- Under this model parsimony will give the wrong tree.
[Figure: model tree and parsimony tree on taxa A-D. Rates or branch lengths are p on the long branches and q on the short branches, with p >> q. The parsimony tree is wrong: long branches are attracted, but the similarity is homoplastic.]
- With more data the certainty that parsimony will give the wrong tree increases, so that parsimony is statistically inconsistent.
- Advocates of parsimony initially responded by claiming that Felsenstein's result showed only that his model was unrealistic.
- It is now recognized that long-branch attraction (the Felsenstein Zone) is one of the most serious problems in phylogenetic inference.
602. Perfect Phylogeny
- Data on species is given by a Character State Matrix.
- Cell (p, i) has value j iff character i of object (species) p has state j.
- Goal: construct an evolutionary tree for the species.
61Motivation: Evolution Tree
Internal nodes correspond to speciation events, where some character (attribute) is acquired.
Assumptions:
1. No reversals (characters are not lost)
2. No convergences (a character is created only once)
63Perfect Phylogeny for a 0-1 Matrix
- A 0-1 matrix: each character is either 0 (does not exist) or 1 (exists).
- Each of the n objects labels exactly one leaf of T.
- Each of the m characters labels exactly one edge of T.
- Object p has exactly the characters labeling the path from p to the root.
- A perfect phylogeny for the matrix: a tree with no convergence and no reversals.
[Figure: example tree with edges labeled by characters 1-5 and leaves labeled by objects A-E]
64The (Binary) Perfect Phylogeny Problem
- Problem: Given a 0-1 matrix M, determine if it has a perfect phylogeny, and construct one if it does.
- (Note: edges are labeled by characters; an edge labeled by i represents changing character i's state from 0 to 1.)
65Solution to Perfect Phylogeny Problem
- Definition: Given a 0-1 matrix M, $O_k = \{ j : M_{jk} = 1 \}$, i.e., $O_k$ is the set of objects that have character k.
- Theorem: M has a perfect phylogenetic tree iff the sets $O_i$ are laminar, i.e. for all i, j, either $O_i$ and $O_j$ are disjoint, or one includes the other.
[Figure: a laminar and a non-laminar family of sets]
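The laminarity condition in the theorem can be checked directly by comparing all pairs of columns. The Python sketch below is a straightforward pairwise test, not the efficient O(mn) method; the names `object_sets` and `is_laminar` are assumptions, and the matrix is the five-object example used in the efficient implementation slides, written with its columns in the order 1 to 5.

```python
from itertools import combinations


def object_sets(matrix):
    """O_k = set of row indices j with matrix[j][k] == 1, for each column k."""
    n_cols = len(matrix[0])
    return [{j for j, row in enumerate(matrix) if row[k] == 1} for k in range(n_cols)]


def is_laminar(sets):
    """True iff every pair of sets is disjoint or nested."""
    return all(
        a.isdisjoint(b) or a <= b or b <= a for a, b in combinations(sets, 2)
    )


# Rows are objects A-E, columns are characters 1-5.
example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 0, 1],  # C
    [0, 0, 1, 1, 0],  # D
    [0, 1, 0, 0, 0],  # E
]
print(is_laminar(object_sets(example)))  # True
```

A matrix such as `[[1, 1], [1, 0], [0, 1]]` fails the test, because its two columns share the first object without either containing the other.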
66Proof
- ($\Rightarrow$) Assume M has a perfect phylogeny, and let i, j be given.
- Consider the edges labeled i and j.
- Case 1: There is a root-to-leaf path containing both. Then one of $O_i$, $O_j$ is included in the other (2 and 1 below).
- Case 2: Not case 1. Then $O_i$ and $O_j$ are disjoint (2 and 3 below).
[Figure: the example tree with edges labeled 1-5 and leaves A-E]
67Proof (cont.)
- ($\Leftarrow$) Assume for all i, j, either $O_i$ and $O_j$ are disjoint, or one includes the other. We prove by induction on the number of characters that M has a perfect phylogeny.
- Basis: one character. Then there are at most two objects, one with and one without this character.
68Proof (cont.)
- ($\Leftarrow$) Induction step: Assume correctness for n-1 characters, and consider a matrix with n characters (non-zero columns).
- WLOG assume that $O_1$ is not contained in $O_j$ for j > 1.
- Let S1 be the set of objects that have character 1, and S2 be the remaining objects. Then each character belongs to objects in S1 or S2, but not both. By the inductive hypothesis there are trees T1 and T2 for S1 and S2. Combining them as below gives the desired tree.
[Figure: combining T1 and T2 into one tree]
69Efficient Implementation
- 1. Sort the columns by decreasing value when considered as binary numbers. (Time complexity O(mn), using radix sort.)
- Claim: If the binary value of column i is larger than that of column j, then $O_i$ is not a proper subset of $O_j$.
- Proof: $O_i - O_j > 0$ (as binary numbers) means the 1s in $O_i$ are not covered by the 1s in $O_j$.
70Efficient Implementation (2)
- 2. Make a backwards linked list of the 1s in each row (leftmost 1 in each row points at itself). Time complexity O(mn).
Claim: If the columns are sorted, then the set of columns is laminar iff for each column i, all the links leaving column i point at the same column. Can be checked in O(mn) time.
Example matrix with columns sorted by decreasing binary value:
     2  1  3  5  4
A    1  1  0  0  0
B    0  0  1  0  0
C    1  1  0  1  0
D    0  0  1  0  1
E    1  0  0  0  0
71Examples
[Figure: the sorted example matrix (columns 2, 1, 3, 5, 4; rows A-E) with its backward links drawn in; the matrix is laminar.]
72Efficient Implementation (3)
- 3. When the matrix is laminar, the tree edges corresponding to characters are defined by the backwards links in the matrix. Remaining edges and leaves are determined by the characters of each object. Needs O(mn) time.
[Figure: the resulting tree with edges labeled 1-5 and leaves A-E]
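Steps 1 to 3 can be sketched together in Python. This is an illustrative sketch under stated assumptions, not the lecture's implementation: the function name `character_tree` is made up, Python's built-in sort replaces the radix sort (so the sorting step is not O(mn)), `None` stands for the root instead of a self-pointing link, all columns are assumed to be non-zero, and only the edges between characters are returned (attaching the leaves is left out).

```python
def character_tree(matrix):
    """Return {character: parent character, or None for the root} if the 0-1
    matrix has a perfect phylogeny, otherwise return None.

    Characters are 0-based column indices; rows are objects.
    """
    n_cols = len(matrix[0])
    # Step 1: sort columns by decreasing binary value (row 0 = leading bit).
    order = sorted(
        range(n_cols), key=lambda k: [row[k] for row in matrix], reverse=True
    )
    parent = {}
    for row in matrix:
        previous = None  # link target of the leftmost 1 in this row
        for k in order:
            if row[k] == 1:
                # Step 2: every link leaving column k must hit the same column.
                if parent.setdefault(k, previous) != previous:
                    return None  # not laminar: no perfect phylogeny
                previous = k
    # Step 3: the backward links define the edges between characters.
    return parent


# Rows are objects A-E, columns are characters 1-5 (indices 0-4).
example = [
    [1, 1, 0, 0, 0],  # A
    [0, 0, 1, 0, 0],  # B
    [1, 1, 0, 0, 1],  # C
    [0, 0, 1, 1, 0],  # D
    [0, 1, 0, 0, 0],  # E
]
print(character_tree(example))
```

For the example matrix the links place characters 2 and 3 directly below the root, character 1 below 2, character 5 below 1, and character 4 below 3, which matches the sorted column order 2, 1, 3, 5, 4.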