Title: Class 9: Phylogenetic Trees
1. Class 9: Phylogenetic Trees
2. The Tree of Life
[Figure: the tree of life, after Ernst Haeckel, 1891]
3. Evolution
- Many theories of evolution
- Basic idea
  - Speciation events lead to the creation of different species
  - Speciation is caused by physical separation into groups where different genetic variants become dominant
- Any two species share a (possibly distant) common ancestor
4. Phylogenies
- A phylogeny is a tree that describes the sequence of speciation events that led to the formation of a set of present-day species
- Leaves: present-day species
- Nodes: hypothetical most recent common ancestors
- Edge lengths: time from one speciation to the next
[Figure: example tree with leaves Aardvark, Bison, Chimp, Dog, Elephant]
5. Phylogenetic Tree
- Topology: bifurcating
- Leaves: 1…N
- Internal nodes: N+1…2N-2
6. Example: Primate evolution
[Figure: primate tree with divergence times of 20-25 mya, 35-37 mya, and 40-45 mya]
7. How to construct a Phylogeny?
- Until the mid 1950s, phylogenies were constructed by experts based on their opinion (subjective criteria)
- Since then, the focus has been on objective criteria for constructing phylogenetic trees
  - Thousands of articles in the last decades
- Important for many aspects of biology
  - Classification (systematics)
  - Understanding biological mechanisms
8. Morphological vs. Molecular
- Classical phylogenetic analysis: morphological features
  - Number of legs, lengths of legs, etc.
- Modern biological methods allow the use of molecular features
  - Gene sequences
  - Protein sequences
- Analysis based on homologous sequences (e.g., globins) in different species
9. Dangers in Molecular Phylogenies
- We have to remember that gene/protein sequences can be homologous for different reasons
- Orthologs: sequences diverged after a speciation event
- Paralogs: sequences diverged after a duplication event
- Xenologs: sequences diverged after a horizontal transfer (e.g., by a virus)
10. Dangers of Paralogs
[Figure: gene tree with a gene duplication followed by speciation events; leaves 1A, 2A, 3A, 1B, 2B, 3B]
11. Dangers of Paralogs
- If we only consider 1A, 2B, and 3A...
[Figure: the same gene tree (gene duplication, speciation events; leaves 1A, 2A, 3A, 1B, 2B, 3B)]
12. Types of Trees
- A natural model to consider is that of rooted trees
[Figure: rooted tree; the root is labeled "Common Ancestor"]
13. Types of Trees
- Depending on the model, data from present-day species does not distinguish between different placements of the root
[Figure: one root placement vs. another]
14. Types of Trees
- An unrooted tree represents the same phylogeny without the root node
15. Positioning Roots in Unrooted Trees
- We can estimate the position of the root by introducing an outgroup
  - A set of species that are definitely distant from all the species of interest
[Figure: tree with leaves Aardvark, Bison, Chimp, Dog, Elephant and outgroup Falcon, with the proposed root marked]
16. Types of Data
- Distance-based
  - Input is a matrix of distances between species
  - Can be the fraction of residues they disagree on, or the alignment score between them, or ...
- Character-based
  - Examine each character (e.g., residue) separately
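A minimal sketch of the first distance mentioned above, the fraction of residues on which two aligned sequences disagree. The function names are illustrative, not from the slides; the two sequences are species A and B from the deck's "A Simple Example".

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(1 for x, y in zip(seq_a, seq_b) if x != y)
    return mismatches / len(seq_a)


def distance_matrix(seqs):
    """seqs: dict name -> aligned sequence. Returns dict (name, name) -> distance."""
    return {(a, b): p_distance(seqs[a], seqs[b]) for a in seqs for b in seqs}


matrix = distance_matrix({"A": "CAGGTA", "B": "CAGACA"})
```

Here `matrix[("A", "B")]` is 2/6, since the two sequences differ at positions 4 and 5.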
17. Simple Distance-Based Method
- Input: distance matrix between species
- Outline
  - Cluster species together
  - Initially clusters are singletons
  - At each iteration combine the two closest clusters to get a new one
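18. UPGMA Clustering
- Let Ci and Cj be clusters; define the distance between them to be
  $$d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$$
- When combining two clusters, Ci and Cj, to form a new cluster Ck, then for every other cluster Cl
  $$d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$$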
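A minimal sketch of UPGMA, assuming the usual average-linkage definition: the distance between two clusters is the mean distance over all cross-cluster pairs, and a merged cluster's distance to any other cluster is the size-weighted average of its parts. Placing each new node at half the merge distance is the common convention and is an assumption here, as are the function name and the nested-tuple tree format.

```python
def upgma(labels, d):
    """labels: list of N leaf names; d: N x N symmetric list of lists.
    Returns the root as a nested tuple (left_subtree, right_subtree, height)."""
    # active cluster id -> (subtree, number of leaves)
    clusters = {i: (labels[i], 1) for i in range(len(labels))}
    dist = {(i, j): d[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(labels)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)           # two closest clusters
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        height = dist[(i, j)] / 2                # assumed: node at half the distance
        new = {}
        for l in clusters:
            dil = dist[(min(i, l), max(i, l))]
            djl = dist[(min(j, l), max(j, l))]
            # size-weighted average keeps d(Ck, Cl) equal to the mean pairwise distance
            new[(l, next_id)] = (ni * dil + nj * djl) / (ni + nj)
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        dist.update(new)
        clusters[next_id] = ((ti, tj, height), ni + nj)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Toy ultrametric distances (hypothetical data)
tree = upgma(["A", "B", "C"], [[0, 2, 6], [2, 0, 6], [6, 6, 0]])
```

On this toy input A and B are merged first at height 1, and the result is joined with C at height 3.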
19. Molecular Clock
- UPGMA implicitly assumes that all distances measure time in the same way
[Figure: two trees over the leaves 1, 2, 3, 4]
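20. Additivity
- A weaker requirement is additivity
  - In a real tree, the distance between two species is the sum of the distances between the intermediate nodes on the path connecting them
[Figure: tree fragment with nodes i, j, k and edge lengths a, b, c]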
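21. Consequences of Additivity
- Suppose input distances are additive
- For any three leaves i, j, m, there is a node k where the branches to them meet, and
  $$d(i,m) = d(i,k) + d(k,m), \quad d(j,m) = d(j,k) + d(k,m), \quad d(i,j) = d(i,k) + d(k,j)$$
- Thus
  $$d(k,m) = \tfrac{1}{2}\bigl(d(i,m) + d(j,m) - d(i,j)\bigr)$$
[Figure: leaves i, j, m meeting at node k, with edge lengths a, b, c]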
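22. Neighbor Joining
- Can we use this fact to construct trees?
- Let
  $$D(i,j) = d(i,j) - (r_i + r_j)$$
- where
  $$r_i = \frac{1}{|L| - 2} \sum_{k \in L} d(i,k)$$
  and L is the current set of leaves
- Theorem: if D(i,j) is minimal (among all pairs of leaves), then i and j are neighbors in the tree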
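23. Neighbor Joining
- Set L to contain all leaves
- Iteration
  - Choose i, j such that D(i,j) is minimal
  - Create a new node k, and set, for every other m in L,
    $$d(k,m) = \tfrac{1}{2}\bigl(d(i,m) + d(j,m) - d(i,j)\bigr)$$
  - Remove i, j from L, and add k
- Termination: when |L| = 2, connect the two remaining nodes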
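The iteration above can be sketched as follows. This assumes the standard neighbor-joining definitions, D(i,j) = d(i,j) - (r_i + r_j) with r_i the sum of distances from i divided by |L| - 2, and the standard branch-length rule d(i,k) = (d(i,j) + r_i - r_j) / 2; the branch-length rule, the function name, and the toy distances are assumptions, not taken from the slides.

```python
def neighbor_joining(labels, d):
    """labels: leaf names; d: dict with d[(x, y)] == d[(y, x)] for all pairs.
    Returns the unrooted tree as a list of edges (node_u, node_v, length)."""
    d = dict(d)
    L = list(labels)
    edges = []
    next_id = 0
    while len(L) > 2:
        # r_i: average distance from i to the rest of L
        r = {i: sum(d[(i, k)] for k in L if k != i) / (len(L) - 2) for i in L}
        # choose the pair minimizing D(i, j) = d(i, j) - (r_i + r_j)
        i, j = min(((x, y) for a, x in enumerate(L) for y in L[a + 1:]),
                   key=lambda p: d[p] - r[p[0]] - r[p[1]])
        k = f"node{next_id}"
        next_id += 1
        for m in L:
            if m not in (i, j):
                d[(k, m)] = d[(m, k)] = 0.5 * (d[(i, m)] + d[(j, m)] - d[(i, j)])
        d_ik = 0.5 * (d[(i, j)] + r[i] - r[j])   # assumed branch-length rule
        edges.append((i, k, d_ik))
        edges.append((j, k, d[(i, j)] - d_ik))
        L = [m for m in L if m not in (i, j)] + [k]
    edges.append((L[0], L[1], d[(L[0], L[1])]))
    return edges


# Additive toy distances (hypothetical): edges A-x 1, B-x 2, x-y 3, C-y 4, D-y 5
pairs = {("A", "B"): 3, ("A", "C"): 8, ("A", "D"): 9,
         ("B", "C"): 9, ("B", "D"): 10, ("C", "D"): 9}
d = {**pairs, **{(y, x): v for (x, y), v in pairs.items()}}
edges = neighbor_joining(["A", "B", "C", "D"], d)
```

Because the toy distances are additive, the five recovered edge lengths are exactly 1, 2, 3, 4, and 5.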
24. Distance Based Methods
- If we make strong assumptions on distances, we can reconstruct trees
- In real life, distances are not additive
- Sometimes they are close to additive
25. Character Based Methods
- We start with a multiple alignment
- Assumptions
  - All sequences are homologous
  - Each position in the alignment is homologous
  - Positions evolve independently
  - No gaps
- We seek to explain the evolution of each position in the alignment
26. Parsimony
- Character-based method
- A way to score trees (but not to build trees!)
- Assumptions
  - Independence of characters (no interactions)
  - The best tree is the one where the fewest changes take place
27. A Simple Example
- What is the parsimony score of
  A  CAGGTA
  B  CAGACA
  C  CGGGTA
  D  TGCACT
  E  TGCGTA
28. A Simple Example
  A  CAGGTA
  B  CAGACA
  C  CGGGTA
  D  TGCACT
  E  TGCGTA
- Each column is scored separately.
- Let's look at the first column
- The minimal tree has one evolutionary change
[Figure: tree for the first column, with nodes labeled C or T and a single change marked between T and C]
29. Evaluating Parsimony Scores
- How do we compute the parsimony score for a given tree?
- Traditional Parsimony
  - Each base change has a cost of 1
- Weighted Parsimony
  - Each change is weighted by the score c(a,b)
30. Traditional Parsimony
- Solved independently for each position
- Linear time solution
[Figure: tree with nodes labeled a, a, {a,g}, a]
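One well-known linear-time solution for traditional parsimony is Fitch's set-based pass; the slides do not name the algorithm, so treating it as the intended one is an assumption. The sketch below scores a single position on a rooted binary tree given as nested pairs (a hypothetical representation), using the first column of the five-sequence example (C, C, C, T, T).

```python
def fitch(tree):
    """tree: a leaf state such as 'C', or a pair (left, right) of subtrees.
    Returns (candidate_states, number_of_changes) for the subtree."""
    if isinstance(tree, str):
        return {tree}, 0
    left_set, left_cost = fitch(tree[0])
    right_set, right_cost = fitch(tree[1])
    common = left_set & right_set
    if common:
        # children can agree on a state: no change needed at this node
        return common, left_cost + right_cost
    # children disagree: keep all candidates and count one change
    return left_set | right_set, left_cost + right_cost + 1


# First column of sequences A..E on the assumed topology ((A, B), C), (D, E)
states, changes = fitch(((("C", "C"), "C"), ("T", "T")))
```

On this topology the column needs a single change, matching the "one evolutionary change" stated for the first column.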
31. Evaluating Weighted Parsimony
- Dynamic programming on the tree
  - S(i,a) = cost of the tree rooted at i if i is labeled by a
- Initialization
  - For each leaf i, set S(i,a) = 0 if i is labeled by a, otherwise S(i,a) = ∞
- Iteration
  - If k is a node with children i and j, then
    $$S(k,a) = \min_b \bigl(S(i,b) + c(a,b)\bigr) + \min_b \bigl(S(j,b) + c(a,b)\bigr)$$
- Termination
  - The cost of the tree is $\min_a S(r,a)$, where r is the root
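A minimal sketch of this dynamic program for one position. The nested-pair tree format, the function name, and the unit cost function used in the example are illustrative assumptions.

```python
import math


def weighted_parsimony(tree, cost, alphabet="ACGT"):
    """tree: leaf label (one character) or a pair (left, right) of subtrees.
    cost(a, b): substitution cost c(a, b).
    Returns {a: S(node, a)} for the root of `tree`."""
    if isinstance(tree, str):
        # initialization: 0 for the observed label, infinity otherwise
        return {a: (0.0 if a == tree else math.inf) for a in alphabet}
    s_i = weighted_parsimony(tree[0], cost, alphabet)
    s_j = weighted_parsimony(tree[1], cost, alphabet)
    # S(k,a) = min_b (S(i,b) + c(a,b)) + min_b (S(j,b) + c(a,b))
    return {a: min(s_i[b] + cost(a, b) for b in alphabet)
               + min(s_j[b] + cost(a, b) for b in alphabet)
            for a in alphabet}


def unit_cost(a, b):
    return 0 if a == b else 1


root_scores = weighted_parsimony(((("C", "C"), "C"), ("T", "T")), unit_cost)
tree_cost = min(root_scores.values())   # termination: min over root labels
```

With unit costs the recursion reduces to traditional parsimony, so `tree_cost` is 1 for this column.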
32. Cost of Evaluating Parsimony
- The score is evaluated on each position independently. Scores are then summed over all positions.
- If there are n nodes, m characters, and k possible values for each character, then the complexity is O(nmk) for traditional parsimony; the weighted recursion minimizes over b for each value a, so it takes O(nmk^2)
- By keeping traceback information, we can reconstruct the most parsimonious values at each ancestor node
33. Maximum Parsimony
Position    1 2 3 4 5 6 7 8 9 10
Species 1 - A G G G T A A C T G
Species 2 - A C G A T T A T T A
Species 3 - A T A A T T G T C T
Species 4 - A A T G T T G T C G
How many possible unrooted trees?
35. Maximum Parsimony
How many substitutions?
38. Maximum Parsimony
Position 2: 1 - G, 2 - C, 3 - T, 4 - A
41. Maximum Parsimony
Position 4: 1 - G, 2 - A, 3 - A, 4 - G
43. Maximum Parsimony
Position        1 2 3 4 5 6 7 8 9 10
1             - A G G G T A A C T G
2             - A C G A T T A T T A
3             - A T A A T T G T C T
4             - A A T G T T G T C G
Substitutions   0 3 2 2 0 1 1 1 1 3   (total: 14)
44. Searching for Trees
45. Searching for the Optimal Tree
- Exhaustive Search
  - Very intensive
- Branch and Bound
  - A compromise
- Heuristic
  - Fast
  - Usually starts with NJ
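To see why exhaustive search is so intensive, it helps to count the candidates. The number of unrooted bifurcating trees on N labeled leaves is the double factorial (2N-5)!! = 3 · 5 · … · (2N-5); this is a standard counting result that the slides do not state, and the function name below is illustrative.

```python
def num_unrooted_trees(n_leaves):
    """Number of unrooted bifurcating trees on n_leaves labeled leaves."""
    if n_leaves < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for k in range(3, 2 * n_leaves - 4, 2):   # 3 * 5 * ... * (2N - 5)
        count *= k
    return count


counts = {n: num_unrooted_trees(n) for n in (4, 5, 10)}
```

Four species give only 3 unrooted trees, but ten species already give 2,027,025, which is why branch and bound or heuristics are used for larger inputs.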
46. Phylogenetic Tree Assumptions
- Topology: bifurcating
- Leaves: 1…N
- Internal nodes: N+1…2N-2
- Lengths: t = {t_i}, one for each branch
- Phylogenetic tree = (Topology, Lengths) = (T, t)
47. Probabilistic Methods
- The phylogenetic tree represents a generative probabilistic model (like HMMs) for the observed sequences.
- Background probabilities: q(a)
- Mutation probabilities: P(a|b, t)
- Models for evolutionary mutations
  - Jukes Cantor
  - Kimura 2-parameter model
- Such models are used to derive the probabilities
48. Jukes Cantor model
- A model for mutation rates
- Mutation occurs at a constant rate
- Each nucleotide is equally likely to mutate into any other nucleotide with rate α.
49. Kimura 2-parameter model
- Allows different rates for transitions and transversions.
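50. Mutation Probabilities
- The rate matrix R is used to derive the mutation probability matrix S
- S is obtained by integration. For Jukes Cantor, with rate α,
  $$P(a \mid a, t) = \tfrac{1}{4}\bigl(1 + 3e^{-4\alpha t}\bigr), \qquad P(b \mid a, t) = \tfrac{1}{4}\bigl(1 - e^{-4\alpha t}\bigr) \quad (b \neq a)$$
- q can be obtained by setting t to infinity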
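51. Mutation Probabilities
- Both models satisfy the following properties
- Lack of memory
  $$P(c \mid a, t + t') = \sum_b P(b \mid a, t)\, P(c \mid b, t')$$
- Reversibility
  - There exist stationary probabilities $P_a$ s.t.
    $$P_a\, P(b \mid a, t) = P_b\, P(a \mid b, t)$$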
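These properties can be checked numerically for Jukes Cantor. The sketch assumes the standard closed form for the model with rate α: the probability of staying at the same nucleotide after time t is (1 + 3e^{-4αt})/4, and the probability of each specific change is (1 - e^{-4αt})/4. The function name and the default α = 1 are illustrative.

```python
import math


def jukes_cantor(a, b, t, alpha=1.0):
    """Probability of observing b after time t given the ancestor is a."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1 + 3 * decay) if a == b else 0.25 * (1 - decay)


NUC = "ACGT"

# Lack of memory: evolving for 0.2 and then 0.5 equals evolving for 0.7
two_steps = sum(jukes_cantor("A", b, 0.2) * jukes_cantor(b, "G", 0.5) for b in NUC)
one_step = jukes_cantor("A", "G", 0.7)

# Background probabilities q: the limit for very long times
q_limit = jukes_cantor("A", "C", 50.0)
```

`two_steps` and `one_step` agree up to rounding, and `q_limit` is 0.25, the uniform background distribution. Reversibility holds trivially here because the matrix is symmetric and the stationary distribution is uniform.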
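52. Probabilistic Approach
- Given P, q, the tree topology, and branch lengths, we can compute
  $$P(x_1,\dots,x_5 \mid T, t) = q(x_5)\, P(x_4 \mid x_5, t_4)\, P(x_3 \mid x_5, t_3)\, P(x_1 \mid x_4, t_1)\, P(x_2 \mid x_4, t_2)$$
[Figure: tree with root x5; x5 has children x4 (branch t4) and x3 (branch t3); x4 has children x1 (branch t1) and x2 (branch t2)]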
53. Computing the Tree Likelihood
- We are interested in the probability of the observed data given the tree and branch lengths
- Computed by summing over internal nodes
- This can be done efficiently using an upward traversal pass over the tree.
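54. Tree Likelihood Computation
- Define P(L_k | a) = prob. of the leaves below node k given that x_k = a
- Init: for leaves, P(L_k | a) = 1 if x_k = a, 0 otherwise
- Iteration: if k is a node with children i and j, then
  $$P(L_k \mid a) = \sum_b P(b \mid a, t_i)\, P(L_i \mid b) \cdot \sum_c P(c \mid a, t_j)\, P(L_j \mid c)$$
- Termination: the likelihood is
  $$P(x_1,\dots,x_N \mid T, t) = \sum_a q(a)\, P(L_{root} \mid a)$$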
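A minimal sketch of this upward pass for a single alignment position. It assumes the Jukes Cantor closed form for the mutation probabilities, a uniform background q, and a hypothetical tree format in which an internal node is a pair of (subtree, branch_length) entries; none of these choices come from the slides.

```python
import math

NUC = "ACGT"


def jc(a, b, t, alpha=1.0):
    """Jukes Cantor probability of b after time t given ancestor a (assumed model)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 * (1 + 3 * decay) if a == b else 0.25 * (1 - decay)


def below(tree):
    """tree: observed character at a leaf, or ((left, t_i), (right, t_j)).
    Returns {a: P(L_k | a)} for the node at the top of `tree`."""
    if isinstance(tree, str):
        return {a: 1.0 if a == tree else 0.0 for a in NUC}
    (left, t_i), (right, t_j) = tree
    p_i, p_j = below(left), below(right)
    return {a: sum(jc(a, b, t_i) * p_i[b] for b in NUC)
               * sum(jc(a, c, t_j) * p_j[c] for c in NUC)
            for a in NUC}


def likelihood(tree, q=None):
    """Termination: sum over root labels weighted by background probabilities."""
    q = q or {a: 0.25 for a in NUC}
    return sum(q[a] * p for a, p in below(tree).items())


# Two leaves A and C hanging from the root by branches of length 0.1 and 0.2
value = likelihood((("A", 0.1), ("C", 0.2)))
```

For this two-leaf tree, reversibility implies the result equals q(A) times the probability of A changing to C over the total length 0.3, which gives a quick sanity check of the recursion.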
55. Maximum Likelihood (ML)
- Score each tree by
  $$\log P(X_1,\dots,X_n \mid T, t) = \sum_m \log P(X_m \mid T, t)$$
  - Assumption of independent positions
- Branch lengths t can be optimized
  - Gradient ascent
  - EM
- We look for the highest scoring tree
  - Exhaustive search
  - Sampling methods (Metropolis)
56. Optimal Tree Search
- Perform search over possible topologies
[Figure: parametric optimization (EM) in parameter space for each candidate topology, showing local maxima]
57. Computational Problem
- Such procedures are computationally expensive!
  - Computation of optimal parameters, per candidate, requires a non-trivial optimization step.
  - We spend non-negligible computation on a candidate, even if it is a low scoring one.
- In practice, such learning procedures can only consider small sets of candidate structures
58. Structural EM
- Idea: use the parameters found for the current topology to help evaluate new topologies.
- Outline
  - Perform search in (T, t) space.
  - Use EM-like iterations
    - E-step: use the current solution to compute expected sufficient statistics for all topologies
    - M-step: select a new topology based on these expected sufficient statistics
59. The Complete-Data Scenario
- Suppose we observe H, the ancestral sequences.
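60. Expected Likelihood
- Start with a tree (T0, t0)
- Compute
  $$Q(T,t) = E\bigl[\log P(X, H \mid T, t) \,\big|\, X, T_0, t_0\bigr]$$
- Formal justification
  - Define
    $$l(T,t) = \log P(X \mid T, t)$$
  - Theorem
    $$l(T,t) - l(T_0,t_0) \ge Q(T,t) - Q(T_0,t_0)$$
- Consequence: improvement in expected score ⇒ improvement in likelihood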
61. Proof
- Theorem
- Simple application of Jensen's inequality
62. Algorithm Outline
Unlike standard EM for trees, we compute all possible pairwise statistics. Time: O(N^2 M)
63. Algorithm Outline
Pairwise weights
This stage also computes the branch length for each pair (i,j)
64. Algorithm Outline
Max. Spanning Tree
Fast greedy procedure to find the tree. By construction, Q(T,t) ≥ Q(T0,t0). Thus, l(T,t) ≥ l(T0,t0)
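One possible greedy procedure for the maximum spanning tree step is Kruskal's algorithm run on edges sorted by decreasing weight. The slides do not say which procedure is used, so this choice, the function name, and the toy weights are all assumptions.

```python
def max_spanning_tree(n_nodes, weights):
    """weights: dict {(i, j): w} over node indices 0..n_nodes-1.
    Returns the chosen edges as a list of (i, j, w)."""
    parent = list(range(n_nodes))

    def find(x):
        # union-find root lookup with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    # greedily take the heaviest edge that does not close a cycle
    for (i, j), w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((i, j, w))
    return tree


# Hypothetical pairwise weights for four nodes
toy_weights = {(0, 1): 5, (0, 2): 1, (0, 3): 0.5, (1, 2): 4, (1, 3): 2, (2, 3): 3}
mst = max_spanning_tree(4, toy_weights)
```

On the toy weights the procedure keeps edges (0,1), (1,2), and (2,3), with total weight 12.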
65. Algorithm Outline
Fix Tree
Remove redundant nodes. Add nodes to break large degree. This operation preserves likelihood: l(T1,t) = l(T,t) ≥ l(T0,t0)
66. Assessing trees: the Bootstrap
- Often we don't trust the tree found as the correct one.
- Bootstrapping
  - Sample (with replacement) n positions from the alignment
  - Learn the best tree for each sample
  - Look for tree features which are frequent in all trees.
- For some models this procedure approximates the tree posterior P(T | X1,…,Xn)
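The resampling step can be sketched as follows: draw n column indices with replacement and rebuild every sequence from those columns, so that each sample is a new alignment of the same size. The function name, the seed parameter, and the use of the five-sequence example alignment are illustrative; tree learning on each sample is left out.

```python
import random


def bootstrap_alignments(alignment, n_samples, seed=0):
    """alignment: dict name -> aligned sequence (all of equal length n).
    Yields n_samples resampled alignments with n columns each."""
    rng = random.Random(seed)
    n = len(next(iter(alignment.values())))
    for _ in range(n_samples):
        # sample n positions with replacement; whole columns are kept together
        cols = [rng.randrange(n) for _ in range(n)]
        yield {name: "".join(seq[c] for c in cols)
               for name, seq in alignment.items()}


aln = {"A": "CAGGTA", "B": "CAGACA", "C": "CGGGTA", "D": "TGCACT", "E": "TGCGTA"}
samples = list(bootstrap_alignments(aln, 3))
```

Each sample would then be passed to whichever tree-building method is in use, and features such as shared splits counted across the resulting trees.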
67. Algorithm Outline
Construct bifurcation T1
New Tree
Thm: l(T1,t1) ≥ l(T0,t0)
These steps are then repeated until convergence