Title: Phylogenetics I
1Phylogenetics I
2Evolution
- Evolution of new organisms is driven by
- Mutations
- The DNA sequence can be changed due to single base changes, deletion/insertion of DNA segments, etc.
- Selection bias
3Theory of Evolution
- Basic idea
- Speciation events lead to the creation of different species.
- Speciation is caused by physical separation into groups where different genetic variants become dominant.
- Any two species share a (possibly distant) common ancestor.
4The Tree of Life
5Primate evolution
A phylogeny (also called a phylogenetic tree) is a tree that describes the sequence of speciation events that led to the formation of a set of current-day species.
6Morphological vs. Molecular
- Classical phylogenetic analysis: morphological features (number of legs, lengths of legs, etc.)
- Modern biological methods allow the use of molecular features
- Gene sequences
- Protein sequences
7Morphological topology
(Based on McKenna and Bell, 1997)
Archonta
Ungulata
8From sequences to a phylogenetic tree
Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG
There are many possible types of sequences to use
(e.g. Mitochondrial vs Nuclear proteins).
9Mitochondrial topology
(Based on Pupko et al.,)
10Nuclear topology
(Based on Pupko et al. slide)
(tree by Madsenl)
11Phylogenetic trees
- Leaves: current-day species (or taxa, plural of taxon)
- Internal vertices: hypothetical common ancestors
- Edge lengths: time from one speciation to the next
12Twists in molecular phylogenies
- We have to emphasize that gene/protein sequences can be homologous for several different reasons
- Orthologs -- sequences diverged after a speciation event
- Paralogs -- sequences diverged after a duplication event
- Xenologs -- sequences diverged after a horizontal transfer (e.g., by virus)
13Paralogs
Consider an evolutionary tree of three taxa, and assume that at some point in the past a gene duplication event occurred.
[Figure: tree of three taxa; label: Gene Duplication]
14Paralogs
The gene evolution is described by this tree (A,
B are the copies of the same gene).
[Figure: gene tree; labels: Gene Duplication, Speciation events; leaves: 1A, 2A, 3A, 1B, 2B, 3B]
15Paralogs
- If we happen to consider genes 1A, 2B, and 3A of
species 1,2,3, we get a wrong tree that does not
represent the phylogeny of the host species
[Figure: gene tree; labels: Gene Duplication, Speciation events (S); leaves: 1A, 2A, 3A, 1B, 2B, 3B]
16Types of Trees
- A natural model to consider is that of rooted
trees
Common Ancestor
17Types of trees
- Unrooted tree represents the same phylogeny
without the root node
Depending on the model, data from current day
species does not distinguish between different
placements of the root.
18Rooted versus unrooted trees
[Figure: unrooted tree with leaves a, b, c]
Represents the three rooted trees
19Total numbers of trees
- For n taxa,
- Rooted bifurcating trees
- $(2n-3)!! = \frac{(2n-3)!}{2^{n-2}(n-2)!}$
- Unrooted bifurcating trees
- $(2n-5)!!$
- Tree shapes
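The counts above grow very quickly with the number of taxa. A minimal sketch, assuming labeled leaves and strictly bifurcating trees (the function names are illustrative, not from the slides):

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * (k - 4) * ...; defined as 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees on n >= 2 labeled leaves: (2n-3)!!."""
    return double_factorial(2 * n - 3)


def num_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees on n >= 3 labeled leaves: (2n-5)!!."""
    return double_factorial(2 * n - 5)


# Print n, the rooted count, and the unrooted count for a few sizes.
for n in (3, 4, 5, 10):
    print(n, num_rooted_trees(n), num_unrooted_trees(n))
```

The number of unrooted trees on n leaves equals the number of rooted trees on n-1 leaves, which is why exhaustive search over topologies is feasible only for very small n.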
20Positioning Roots in Unrooted Trees
- We can estimate the position of the root by introducing an outgroup
- a set of species that are definitely distant from all the species of interest
[Figure: tree with leaves Falcon, Aardvark, Bison, Chimp, Dog, Elephant; label: Proposed root]
21Type of Data
- Distance-based
- Input is a matrix of distances between species
- Can be the fraction of residues they disagree on, or the alignment score between them, or ...
- Character-based
- Examine each character (e.g., residue) separately
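As one illustration of distance-based input, the sketch below builds a matrix of the fraction of residues on which two sequences disagree, using the four aligned fragments from the slide "From sequences to a phylogenetic tree". The dictionary layout and function names are assumptions made for this example.

```python
from itertools import combinations

# Aligned protein fragments taken from the earlier slide.
sequences = {
    "Rat": "QEPGGLVVPPTDA",
    "Rabbit": "QEPGGMVVPPTDA",
    "Gorilla": "QEPGGLVVPPTDA",
    "Cat": "REPGGLVVPPTEG",
}


def p_distance(a: str, b: str) -> float:
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_matrix(seqs: dict) -> dict:
    """Symmetric distance matrix stored as a dict keyed by (name, name) pairs."""
    d = {(s, s): 0.0 for s in seqs}
    for s, t in combinations(seqs, 2):
        d[s, t] = d[t, s] = p_distance(seqs[s], seqs[t])
    return d


D = distance_matrix(sequences)
for s in sequences:
    print(f"{s:8s}", " ".join(f"{D[s, t]:.3f}" for t in sequences))
```

On such a short fragment Rat and Gorilla are identical, which shows why real analyses use much longer sequences.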
22Two methods of tree Construction
- Distance: A weighted tree that realizes the distances between the objects.
- Parsimony: A tree with a total minimum number of character changes between nodes.
We start with distance-based methods, considering the following question: Given a set of species (leaves in a supposed tree) and distances between them, construct a phylogeny which best fits the distances.
23Distance Matrix
- Given n species, we can compute the n x n distance matrix $D_{ij}$
- $D_{ij}$ may be defined as the edit distance between a gene in species i and species j, where the gene of interest is sequenced for all n species.
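A minimal sketch of one common edit distance, the Levenshtein distance with unit cost for substitutions, insertions, and deletions. The slide does not say which cost scheme is intended, so the unit costs here are an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning the first i characters of a into the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("QEPGGLVVPPTDA", "REPGGLVVPPTEG"))
```

Only two rows of the dynamic programming table are kept, so memory use is linear in the length of the second sequence.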
24The distance between two sequences
- Protein sequences
- PAM
- BLOSUM
- DNA sequences
- Jukes-Cantor
- HKY
- Kimura 2-Parameter
25General Stationary Time-reversible Model
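$$R = \begin{pmatrix}
\cdot & \pi_C r_{CA} & \pi_G r_{GA} & \pi_T r_{TA} \\
\pi_A r_{AC} & \cdot & \pi_G r_{GC} & \pi_T r_{TC} \\
\pi_A r_{AG} & \pi_C r_{CG} & \cdot & \pi_T r_{TG} \\
\pi_A r_{AT} & \pi_C r_{CT} & \pi_G r_{GT} & \cdot
\end{pmatrix}$$
(Diagonal elements such that rows sum to zero)
Time reversibility: $\pi_i r_{ij} = \pi_j r_{ji}$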
26General Stationary Time-reversible Model
- $P(t) = e^{Rt}$
- Given rates, one can find transition
probabilities, and vice-versa.
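The step from rates to transition probabilities can be sketched with a small matrix exponential. This is an illustrative pure-Python approximation (truncated Taylor series with scaling and squaring), not a production routine; the rate value u = 1 and the equal-rates matrix used as input are assumptions chosen because that matrix has a simple closed form to compare against.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=20, squarings=10):
    """Approximate exp(m): Taylor series of m / 2**squarings, then repeated squaring."""
    n = len(m)
    small = [[x / 2 ** squarings for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, small)]  # small**k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_probabilities(rate_matrix, t):
    """P(t) = exp(R t) for a rate matrix R whose rows sum to zero."""
    return mat_exp([[r * t for r in row] for row in rate_matrix])


u = 1.0  # hypothetical overall rate
R = [[-u if i == j else u / 3 for j in range(4)] for i in range(4)]
P = transition_probabilities(R, t=0.5)
for row in P:
    print(" ".join(f"{p:.4f}" for p in row))
```

For this equal-rates input the diagonal entries should agree with 1/4 + 3/4 exp(-4ut/3), which gives a quick sanity check on the numerical result.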
27Jukes-Cantor
$$R = \begin{pmatrix}
\cdot & u/3 & u/3 & u/3 \\
u/3 & \cdot & u/3 & u/3 \\
u/3 & u/3 & \cdot & u/3 \\
u/3 & u/3 & u/3 & \cdot
\end{pmatrix}$$
28Jukes-Cantor
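- $P(\text{no mutation}) = e^{-\frac{4}{3}ut}$
- $P(\text{at least one mutation}) = 1 - e^{-\frac{4}{3}ut}$
- $D_s = \frac{3}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$
- $D = ut = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D_s\right)$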
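The last two formulas are inverses of each other, which the following sketch makes concrete. The function names are illustrative; the correction is only defined when the observed fraction of differing sites is below 3/4.

```python
import math


def expected_difference(ut: float) -> float:
    """Forward direction: Ds = 3/4 * (1 - exp(-4/3 * ut))."""
    return 0.75 * (1.0 - math.exp(-4.0 * ut / 3.0))


def jukes_cantor_distance(ds: float) -> float:
    """Corrected distance D = ut = -3/4 * ln(1 - 4/3 * Ds), for 0 <= Ds < 3/4."""
    if not 0.0 <= ds < 0.75:
        raise ValueError("observed difference must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * ds / 3.0)


for ds in (0.05, 0.30, 0.60):
    print(ds, round(jukes_cantor_distance(ds), 4))
```

The corrected distance is always at least as large as the observed fraction, because multiple substitutions at the same site hide some of the changes.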
29Kimura 2-Parameter
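$$R = \begin{array}{c|cccc}
 & A & C & G & T \\ \hline
A & \cdot & \beta & \alpha & \beta \\
C & \beta & \cdot & \beta & \alpha \\
G & \alpha & \beta & \cdot & \beta \\
T & \beta & \alpha & \beta & \cdot
\end{array}$$
$\alpha/\beta$ = transition/transversion bias; total substitution rate $\alpha + 2\beta = 1$ per unit time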
30Kimura 2-Parameter
31HKY (Hasegawa, Kishino, Yano)
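$$R = \begin{pmatrix}
\cdot & \mu\pi_C & \mu\kappa\pi_G & \mu\pi_T \\
\mu\pi_A & \cdot & \mu\pi_G & \mu\kappa\pi_T \\
\mu\kappa\pi_A & \mu\pi_C & \cdot & \mu\pi_T \\
\mu\pi_A & \mu\kappa\pi_C & \mu\pi_G & \cdot
\end{pmatrix}$$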
$\kappa$ = transition / transversion rate ratio (in the matrix, $\kappa$ multiplies the transitions A↔G and C↔T)
32Distances in Trees
- Edges may have weights reflecting
- Number of mutations on the evolutionary path from one species to another
- Time estimate for evolution of one species into another
- In a tree T, we often compute
- $d_{ij}(T)$: the length of the path between leaves i and j
33Distance in Trees: an Example
$d_{1,4} = 12 + 13 + 14 + 17 + 12 = 68$
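The tree for this example did not survive extraction, so the sketch below uses a hypothetical tree: only the five edge weights on the path from leaf 1 to leaf 4 come from the slide, while the internal vertex names (a to d), the other leaves, and their edge weights are invented for illustration.

```python
def build_tree(edges):
    """Adjacency map {vertex: {neighbor: weight}} from (u, v, weight) triples."""
    tree = {}
    for u, v, w in edges:
        tree.setdefault(u, {})[v] = w
        tree.setdefault(v, {})[u] = w
    return tree


def path_length(tree, source, target):
    """Sum of edge weights on the unique path between two vertices of a tree."""
    stack = [(source, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == target:
            return dist
        for neighbor, weight in tree[node].items():
            if neighbor != parent:  # never walk back along the edge just used
                stack.append((neighbor, node, dist + weight))
    raise ValueError("target not reachable from source")


# Hypothetical topology; the path 1-a-b-c-d-4 carries the weights from the slide.
tree = build_tree([
    ("1", "a", 12), ("a", "b", 13), ("b", "c", 14), ("c", "d", 17), ("d", "4", 12),
    ("2", "a", 5), ("3", "b", 6), ("5", "c", 7), ("6", "d", 8),
])
print(path_length(tree, "1", "4"))
```

Because a tree has exactly one path between any two vertices, a depth-first walk that never returns to its parent is enough; no shortest-path algorithm is needed.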
34Fitting Distance Matrix
- Given n species, we can compute the n x n distance matrix $D_{ij}$
- Evolution of these genes is described by a tree that we don't know.
- We need an algorithm to construct a tree that best fits the distance matrix $D_{ij}$
35Reconstructing a 3 Leaved Tree
- Tree reconstruction for any 3x3 matrix is straightforward
- We have 3 leaves i, j, k and a center vertex c
Observe:
$d_{ic} + d_{jc} = D_{ij}$
$d_{ic} + d_{kc} = D_{ik}$
$d_{jc} + d_{kc} = D_{jk}$
36Reconstructing a 3 Leaved Tree
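Adding two of the three equations and subtracting the third isolates each edge length, for example d_ic = (D_ij + D_ik - D_jk)/2. A minimal sketch of that solution, with made-up input distances:

```python
def three_leaf_tree(d_ij, d_ik, d_jk):
    """Edge lengths (d_ic, d_jc, d_kc) of the star tree with center c.

    Solves d_ic + d_jc = D_ij, d_ic + d_kc = D_ik, d_jc + d_kc = D_jk.
    """
    d_ic = (d_ij + d_ik - d_jk) / 2
    d_jc = (d_ij + d_jk - d_ik) / 2
    d_kc = (d_ik + d_jk - d_ij) / 2
    return d_ic, d_jc, d_kc


# Hypothetical distances D_ij = 5, D_ik = 7, D_jk = 8.
d_ic, d_jc, d_kc = three_leaf_tree(5, 7, 8)
print(d_ic, d_jc, d_kc)
```

The edge lengths are non-negative exactly when the three distances satisfy the triangle inequality.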
37Trees with > 3 Leaves
- A tree with n leaves has 2n-3 edges
- This means fitting a given tree to a distance matrix D requires solving a system of "n choose 2" equations with 2n-3 variables
- This is not always possible to solve for n > 3
38Additive Distance Matrices
Matrix D is ADDITIVE if there exists a tree T with $d_{ij}(T) = D_{ij}$
NON-ADDITIVE otherwise
39Distance Based Phylogeny Problem
- Goal: Reconstruct an evolutionary tree from a distance matrix
- Input: n x n distance matrix $D_{ij}$
- Output: weighted tree T with n leaves fitting D
- If D is additive, this problem has a solution and there is a simple algorithm to solve it
40Using Neighboring Leaves to Construct the Tree
- Find neighboring leaves i and j with parent k
- Remove the rows and columns of i and j
- Add a new row and column corresponding to k,
where the distance from k to any other leaf m can
be computed as
$D_{km} = (D_{im} + D_{jm} - D_{ij})/2$
Compress i and j into k, iterate algorithm for
rest of tree
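One compression step can be sketched as follows. The dictionary representation, the helper names, and the example matrix are assumptions: the matrix comes from a made-up four-leaf tree in which i and j hang off k with edges 2 and 3, m and n hang off another vertex with edges 4 and 5, and the middle edge has length 1.

```python
def symmetric(pairs):
    """Expand {(a, b): value} into a dict holding both orders and a zero diagonal."""
    D = {}
    for (a, b), v in pairs.items():
        D[a, b] = D[b, a] = v
        D[a, a] = D[b, b] = 0.0
    return D


def merge_neighbors(D, i, j, k):
    """Return a new matrix in which neighboring leaves i and j are replaced by k."""
    others = {a for a, _ in D} - {i, j}
    # Drop every row and column entry that involves i or j.
    new = {(a, b): v for (a, b), v in D.items()
           if i not in (a, b) and j not in (a, b)}
    new[k, k] = 0.0
    for m in others:
        new[k, m] = new[m, k] = (D[i, m] + D[j, m] - D[i, j]) / 2
    return new


D = symmetric({("i", "j"): 5, ("i", "m"): 7, ("i", "n"): 8,
               ("j", "m"): 8, ("j", "n"): 9, ("m", "n"): 9})
reduced = merge_neighbors(D, "i", "j", "k")
print(reduced["k", "m"], reduced["k", "n"])
```

In the example the new distances equal the middle edge plus the pendant edges of m and n, as expected for an additive matrix.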
41Finding Neighboring Leaves
- To find neighboring leaves we simply select a
pair of closest leaves.
42Finding Neighboring Leaves
- To find neighboring leaves we simply select a pair of closest leaves.
- WRONG
43Finding Neighboring Leaves
- Closest leaves aren't necessarily neighbors
- i and j are neighbors, but $(d_{ij} = 13) > (d_{jk} = 12)$
- Finding a pair of neighboring leaves is a nontrivial problem!
44Neighbor Joining Algorithm
- In 1987 Naruya Saitou and Masatoshi Nei developed a neighbor joining algorithm for phylogenetic tree reconstruction
- Finds a pair of leaves that are close to each other but far from other leaves: implicitly finds a pair of neighboring leaves
- Advantages: works well for additive and for non-additive matrices, and it does not rely on the flawed molecular clock assumption
45Constructing additive trees: The neighbor joining algorithm
- Let i, j be neighboring leaves in a tree, let k be their parent, and let m be any other vertex.
- The formula $D_{km} = (D_{im} + D_{jm} - D_{ij})/2$
- shows that we can compute the distances of k to all other leaves. This suggests the following method to construct a tree from a distance matrix:
- Find neighboring leaves i, j in the tree,
- Replace i, j by their parent k and recursively construct a tree T for the smaller set.
- Add i, j as children of k in T.
46Neighbor Finding
- How can we find from distances alone a pair of nodes which are neighboring leaves?
- Closest nodes aren't necessarily neighboring leaves.
Next we show one way to find neighbors from distances.
47Neighbor Finding: Saitou-Nei algorithm
Definitions
Theorem (Saitou and Nei): Assume all edge weights are positive. If D(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree.
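The definitions on this slide did not survive extraction, so the sketch below uses the commonly stated form of the neighbor joining criterion as an assumption: for n remaining nodes, choose the pair minimizing (n-2) d(i,j) - r_i - r_j, where r_i is the sum of distances from i to all other nodes. The branch-length formula and the node names (node0, node1, ...) are likewise part of this illustration, not taken from the slides.

```python
from itertools import combinations


def neighbor_joining(names, D):
    """Build an unrooted tree from a symmetric distance dict D[a, b].

    Returns a list of weighted edges (u, v, length).
    """
    D = dict(D)  # work on a copy
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(D[a, b] for b in nodes if b != a) for a in nodes}
        # Pair minimizing the (assumed) selection criterion.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * D[p] - r[p[0]] - r[p[1]])
        k = f"node{next_id}"
        next_id += 1
        d_ik = D[i, j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges.append((i, k, d_ik))
        edges.append((j, k, D[i, j] - d_ik))
        for m in nodes:
            if m not in (i, j):
                D[k, m] = D[m, k] = (D[i, m] + D[j, m] - D[i, j]) / 2
        nodes = [m for m in nodes if m not in (i, j)] + [k]
    a, b = nodes
    edges.append((a, b, D[a, b]))  # join the last two nodes
    return edges


# Additive example: pendant edges i=2, j=3, m=4, n=5 and a middle edge of 1.
pairs = {("i", "j"): 5, ("i", "m"): 7, ("i", "n"): 8,
         ("j", "m"): 8, ("j", "n"): 9, ("m", "n"): 9}
D = {}
for (a, b), v in pairs.items():
    D[a, b] = D[b, a] = v
edges = neighbor_joining(["i", "j", "m", "n"], D)
for edge in edges:
    print(edge)
```

On this additive input the procedure recovers the generating tree exactly; on non-additive input it still returns a tree, but branch lengths are then only estimates and may even be negative.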
48Complexity of Neighbor Joining Algorithm
- Naive Implementation
- Initialization: $\Theta(L^2)$ to compute d(r,i) and C(i,j) for all $i,j \in L$.
- Each Iteration
- $O(L^2)$ to find the maximal C(i,j).
- $O(L)$ to compute $\{C(m,k) : m \in L\}$ for the new node k.
- Total of $O(L^3)$.
[Figure: tree with nodes r, m, k and the value C(m,k)]
49Complexity of Neighbor Joining Algorithm
- Using a heap to store the C(i,j)'s
- Input: Distance matrix D = [d(i,j)], and an arbitrary object r.
- Initialization: $\Theta(L^2)$ to compute and heapify the C(i,j)'s in a heap H.
- Each Iteration
- $O(\log L)$ to find and delete the maximal C(i,j) from H.
- $O(L)$ to add the values d(k,m) to D, for all objects m.
- $O(L)$ to delete d(m,i), d(m,j) from D (for all m).
- $O(L \log L)$ to delete C(i,m), C(j,m) from H and add C(k,m) to H, for all objects m.
- Total of $O(L^2 \log L)$.
- (implementation details are omitted)
50Neighbor Joining Algorithm
- Applicable to matrices which are not additive
- Known to work well in practice
- The algorithm and its variants are the most
widely used distance-based algorithms today.
51The Four Point Condition
Compute: 1. $D_{ij} + D_{kl}$, 2. $D_{ik} + D_{jl}$, 3. $D_{il} + D_{jk}$
[Figure: three panels labeled 2, 3, 1, one per sum]
2 and 3 represent the same number: the length of all edges + the middle edge (it is counted twice)
1 represents a smaller number: the length of all edges - the middle edge
52The Four Point Condition Theorem
- The four point condition for the quartet i,j,k,l is satisfied if two of these sums are the same, with the third sum smaller than these first two
- Theorem: An n x n matrix D is additive if and only if the four point condition holds for every quartet $1 \le i,j,k,l \le n$
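The theorem gives a direct, if slow, additivity test. The sketch below checks every quartet, including quartets with repeated indices as the theorem's range allows. It assumes D is symmetric with a zero diagonal, and it treats the third sum as smaller than or equal to the two equal sums, which is the usual statement and admits a zero-length middle edge; the tolerance parameter is an addition for floating-point input.

```python
from itertools import combinations_with_replacement


def four_point_holds(D, i, j, k, l, tol=1e-9):
    """True if the two largest of the three pairwise sums are equal (within tol)."""
    sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
    return abs(sums[2] - sums[1]) <= tol


def is_additive(D, tol=1e-9):
    """Check the four point condition for every quartet of indices, O(n^4) time."""
    n = len(D)
    return all(four_point_holds(D, *q, tol)
               for q in combinations_with_replacement(range(n), 4))


# Distances read off a made-up four-leaf tree (pendant edges 2, 3, 4, 5; middle edge 1).
additive = [[0, 5, 7, 8],
            [5, 0, 8, 9],
            [7, 8, 0, 9],
            [8, 9, 9, 0]]
# Same matrix with one distance perturbed from 8 to 10.
perturbed = [[0, 5, 7, 10],
             [5, 0, 8, 9],
             [7, 8, 0, 9],
             [10, 9, 9, 0]]
print(is_additive(additive), is_additive(perturbed))
```

For the first matrix the three sums are 14, 16, 16; for the second they are 14, 16, 18, so no tree can realize it.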
53Least Squares Distance Phylogeny Problem
- If the distance matrix D is NOT additive, then we look for a tree T that approximates D the best
- Squared Error: $\sum_{i,j} (d_{ij}(T) - D_{ij})^2$
- Squared Error is a measure of the quality of the fit between the distance matrix and the tree: we want to minimize it.
- Least Squares Distance Phylogeny Problem: finding the best approximation tree T for a non-additive matrix D (NP-hard).
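Evaluating the squared error for one candidate tree is easy; the hard part is searching over trees. A minimal sketch, assuming each unordered pair of leaves is counted once (summing over ordered pairs would double the value without changing which tree is best) and using made-up matrices:

```python
def squared_error(tree_dist, D):
    """Sum over leaf pairs i < j of (d_ij(T) - D_ij)^2."""
    n = len(D)
    return sum((tree_dist[i][j] - D[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


# Path lengths d_ij(T) in a hypothetical four-leaf tree.
tree_dist = [[0, 5, 7, 8],
             [5, 0, 8, 9],
             [7, 8, 0, 9],
             [8, 9, 9, 0]]
# A non-additive observed matrix that differs from the tree in one entry.
observed = [[0, 5, 7, 10],
            [5, 0, 8, 9],
            [7, 8, 0, 9],
            [10, 9, 9, 0]]
print(squared_error(tree_dist, observed))
```

An error of zero means the tree realizes the matrix exactly, which is possible only when the matrix is additive.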