Title: Building phylogenetic trees
1Building phylogenetic trees
- Topics in Computational Biology
- 25.2.2004
- Pia Laine
2Contents
- Phylogeny
- Phylogenetic trees
- How to make a phylogenetic tree from pairwise distances
- UPGMA method (an example)
- Neighbor-Joining method (an example)
- Comparison of methods
- Conclusion
3Phylogeny
- Phylogeny is the evolutionary history of related species or genes
- A phylogenetic tree is a diagram showing the evolutionary lineages of species or genes
- The history of genes and the history of species may be very different
- Genes can be homologous or analogous and still resemble each other
- Homologous sequences can be divided into two groups
- Orthologous sequences diverged by speciation from a common ancestor
- Paralogous sequences evolved by gene duplication within a species
- Analogous sequences may appear and function very similarly, but they do not have a common ancestor
- When we want to explore evolutionary relationships, we need to use orthologous sequences
4Phylogenetic trees
- WHY construct a phylogenetic tree?
- to understand lineage of various species
- to understand how various functions evolved
- to inform multiple alignments
- Trees can be rooted (a common ancestor is known) or unrooted
- Leaves are the terminal nodes that correspond to the observed sequences of genes or species (A, B, C, D)
- Internal nodes are hypothetical ancestral nodes
- All trees will be assumed to be binary, meaning that an edge that branches splits into two daughter edges
- Each edge has a certain amount of evolutionary divergence associated with it, defined by some measure of distance between sequences, or by a model of substitution of residues over the course of evolution
5Phylogenetic trees
- Different ways to represent a phylogenetic tree
(illustrated by Treeview)
6Different algorithms used to infer phylogeny from
sequence data
- Distance methods
- Parsimony
- Likelihood
- Probabilistic methods
- Phylogenetic invariants
7Route from the molecular sequences to the
phylogenetic tree
- Distance methods
- Select a set of related (orthologous) nucleotide or amino acid sequences
- Perform a multiple sequence alignment (the Clustal series is widely used)
- Calculate pairwise distances between the sequences using a chosen evolutionary model of substitution (the distances describe the evolution: the smaller the distance between two sequences, the more closely they are related)
- Select the most suitable algorithm to infer the phylogeny
- View the tree with a suitable program (Treeview, NJPlot, ...)
8Making a tree from pairwise distances
- Distances $d_{ij}$ between each pair of sequences $i$ and $j$ are calculated in the given dataset
- Different ways of defining distances
- For nucleotide sequences
- Jukes-Cantor, Kimura 2-parameter (K2P), HKY (Hasegawa-Kishino-Yano), F84, Tamura-Nei, general time-reversible model, general 12-parameter model
- For amino acid sequences
- PAM matrices, BLOSUM matrices
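As an illustration of the simplest nucleotide model in the list, the Jukes-Cantor distance corrects the observed fraction of differing sites $p$ for multiple substitutions at the same site: $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$. The following is a minimal sketch; the function names and the two short sequences are made up for illustration, and gap handling (skipping any column with a gap) is an assumption rather than something the slides specify.

```python
import math

def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned, gap-free sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: columns containing a gap in either sequence are ignored.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no gap-free sites to compare")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor distance d = -3/4 * ln(1 - 4p/3)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        raise ValueError("p >= 0.75: the Jukes-Cantor distance is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Two hypothetical aligned sequences that differ at 2 of 10 sites.
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
d = jukes_cantor("ACGTACGTAC", "ACGTTCGTAA")
print(p, d)
```

The corrected distance is always at least as large as the raw proportion of differences, because some substitutions are hidden by later changes at the same site.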
9Distance matrix methods
- UPGMA
- Algorithm introduced by Sokal and Michener 1958
- Neighbor-Joining
- Algorithm introduced by Saitou and Nei 1987
- Modified by Studier and Keppler 1988
10Clustering method UPGMA
- UPGMA: unweighted pair group method using arithmetic averages
- Simple method
- It works by clustering the sequences, at each stage connecting two clusters and creating a new node on the tree
- The method assumes an equal rate of evolutionary change along all branches → molecular clock assumption
11UPGMA
(Figure: rooted tree with leaves A, C, B and D)
- UPGMA produces a rooted tree
- Branch lengths satisfy a molecular clock
- → The divergence of sequences is assumed to occur at the same constant rate at all points in the tree
- Trees that are clocklike are rooted, and the total branch length from the root to any leaf is equal
- Such trees are often called ultrametric
- Distances are ultrametric if, for any triplet of sequences $i$, $j$, $k$, either all three distances are equal, $d_{ij} = d_{ik} = d_{jk}$, or two of them are equal and one is smaller, $d_{jk} < d_{ij} = d_{ik}$
- → UPGMA is guaranteed to build the correct tree if the distances are ultrametric
- The method can be used for reconstructing phylogenies if evolutionary rates are assumed to be the same in all lineages → this assumption is criticised in the phylogeny literature
- Suitable for closely related species
- Running time $O(n^2)$
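The ultrametric condition can be tested directly: in every triplet, the two largest of the three distances must be equal. The sketch below assumes distances are stored in a dictionary keyed by `frozenset({x, y})`; that layout, the function name and both small matrices are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def is_ultrametric(names, dist, tol=1e-9):
    """Three-point check: in every triplet the two largest distances must be equal."""
    for i, j, k in combinations(names, 3):
        _, mid, large = sorted(
            (dist[frozenset((i, j))], dist[frozenset((i, k))], dist[frozenset((j, k))])
        )
        if large - mid > tol:
            return False
    return True

# Hypothetical distances, not taken from the slides.
clocklike = {frozenset("AB"): 2, frozenset("AC"): 6, frozenset("BC"): 6}
skewed = {frozenset("AB"): 2, frozenset("AC"): 6, frozenset("BC"): 5}
print(is_ultrametric("ABC", clocklike))
print(is_ultrametric("ABC", skewed))
```

A check like this indicates whether UPGMA's guarantee applies to a given matrix; real data rarely pass it exactly, so a tolerance is usually needed.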
12Algorithm UPGMA
- Initialisation
- Assign each sequence $i$ in the dataset to its own cluster $C_i$
- Define one leaf of $T$ for each sequence, and place it at height zero
- Iteration
- Find the two clusters $i$ and $j$ for which $d_{ij}$ is the smallest (pick randomly if several distances are equal)
- Define a new cluster $(ij)$ by $C_{ij} = C_i \cup C_j$. Cluster $(ij)$ has $n_{ij} = n_i + n_j$ members (initially $n_i = 1$)
- Connect $i$ and $j$ on the tree to a new node $v$, placed at height $d_{ij}/2$
- The branch lengths from the new node $v$ to $i$ and $j$ are $d_{ij}/2$ minus the heights of $i$ and $j$ (so $d_{ij}/2$ each when $i$ and $j$ are leaves)
13Algorithm UPGMA (cont.)
- Iteration (cont.)
- Compute the distances between the new cluster $(ij)$ and each remaining cluster $k$ by using $d_{(ij)k} = \frac{n_i d_{ik} + n_j d_{jk}}{n_i + n_j}$
- Add $(ij)$ to the current clusters and remove $i$ and $j$
- Termination
- When only two clusters $i$ and $j$ remain, place the root at height $d_{ij}/2$
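The initialisation, iteration and termination steps can be sketched in a few lines of Python. This is a naive illustration, not an optimised implementation: it rescans all cluster pairs at every step, so it is slower than the running time quoted earlier. The dictionary layout, the function name and the nested-tuple tree are assumptions. The distance matrix is also an assumption: the matrix figure is not part of this text, so the values below were reconstructed to be consistent with the branch lengths shown in the worked examples.

```python
from itertools import combinations

def upgma(names, dist):
    """Naive UPGMA sketch.

    names: list of leaf labels.
    dist:  dict mapping frozenset({x, y}) to the distance between leaves x and y.
    Returns (root, heights): root is a nested tuple of labels, and heights
    maps every internal node (a nested tuple) to its height in the tree.
    """
    d = dict(dist)
    size = {name: 1 for name in names}      # n_i = 1 for every initial cluster
    heights = {}
    while len(size) > 1:
        # Pair of clusters with the smallest distance; ties go to the first
        # pair found here, whereas the slides break ties at random.
        i, j = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        d_ij = d[frozenset((i, j))]
        new = (i, j)
        heights[new] = d_ij / 2             # new node is placed at height d_ij / 2
        # Size-weighted average distance from the new cluster to each other cluster.
        for k in size:
            if k != i and k != j:
                d[frozenset((new, k))] = (
                    size[i] * d[frozenset((i, k))] + size[j] * d[frozenset((j, k))]
                ) / (size[i] + size[j])
        size[new] = size.pop(i) + size.pop(j)
    (root,) = size
    return root, heights

# Reconstructed example matrix (an assumption, see the note above).
example = {
    frozenset("AB"): 8, frozenset("AC"): 7, frozenset("AD"): 12,
    frozenset("BC"): 9, frozenset("BD"): 14, frozenset("CD"): 11,
}
root, heights = upgma(["A", "B", "C", "D"], example)
for node, h in heights.items():
    print(node, round(h, 2))
```

With this matrix the sketch joins A and C first, then B, then D, at heights 3.5, 4.25 and about 6.17. The last iteration of the loop plays the role of the termination step, since joining the final two clusters creates the root.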
14An example UPGMA (1)
- Distance matrix (arbitrary) for four items (sequences): A, B, C and D
- Strictly speaking, the distances are not ultrametric: the three distances of a triplet are neither all equal ($d_{ij} \neq d_{ik} \neq d_{jk}$), nor are two of them equal with the third smaller ($d_{jk} < d_{ij} = d_{ik}$ does not hold)
- Step 1. Find the smallest distance $d_{ij}$ between two clusters → A and C, where $d_{AC} = 7$
15An example UPGMA (2)
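- Step 2. Define a new cluster $(ij)$, which has $n_{ij} = n_i + n_j$ members (initially $n_i = 1$)
- New cluster → A and C
- $n_{AC} = n_A + n_C = 2$
- Step 3. Connect A and C on the tree to a new node v1
- Step 4. The branch lengths from the new node v1 to A and C are $d_{AC}/2 = 7/2 = 3.5$
(Figure: node v1 joined to leaves A and C, each with branch length 3.5)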
16An example UPGMA (3)
- Step 5. Compute the distances between the new cluster AC and the remaining clusters (B and D)
- Step 6. Delete the columns and rows of the distance matrix that correspond to clusters A and C, and add a column and a row for cluster AC
→ New distance matrix
17An example UPGMA (4)
- 2nd iteration
- Step 1. Find the two clusters $i$ and $j$ for which $d_{ij}$ is the smallest (pick randomly if several distances are equal) → AC and B
- Step 2. Define a new cluster $(ij)$, which has $n_{ij} = n_i + n_j$ members (initially $n_i = 1$). New cluster → AC and B
- $n_{ACB} = n_{AC} + n_B = 2 + 1 = 3$
- Step 3. Connect AC and B on the tree to a new node v2
- Step 4. Compute the branch lengths from the new node v2 to AC and B
(Figure: v1 joined to A and C with branch lengths 3.5; v2 joined to v1 and to B, with branch length 4.25 to B)
18An example UPGMA (5)
- Step 5. Compute the distances between the new cluster and the remaining cluster (D)
- Step 6. Delete the columns and rows of the distance matrix that correspond to clusters AC and B, and add a column and a row for cluster ACB
→ New distance matrix
19An example UPGMA (6)
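- Termination
- Only two clusters (ACB and D) remain
- Place the root at height $d_{(ACB)D}/2 = 6.17$
Original distance matrix and final phylogenetic tree (including the branch lengths)
(Figure: final rooted tree. Branch lengths: 3.5 to A and to C, 0.75 from v1 to v2, 4.25 to B, 1.92 from v2 to the root, 6.17 to D)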
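The branch lengths of the final tree can be cross-checked against the node heights. The short check below uses only numbers shown in the example (3.5, 4.25 and 6.17 as the heights of v1, v2 and the root); the values are rounded to two decimals, as on the slide.

```latex
\begin{align*}
\text{edge } v_1 \to v_2 &: 4.25 - 3.5 = 0.75 \\
\text{edge } v_2 \to \text{root} &: 6.17 - 4.25 = 1.92 \\
\text{root to A (or C)} &: 3.5 + 0.75 + 1.92 = 6.17 \\
\text{root to B} &: 4.25 + 1.92 = 6.17 \\
\text{root to D} &: 6.17
\end{align*}
```

Every leaf lies at the same distance from the root, which is what the molecular clock assumption requires of a UPGMA tree.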
20Neighbor-Joining (N-J)
(Figure: unrooted tree with leaves A, B, C and D)
- Another algorithm that works by clustering the sequences
- Does not assume a molecular clock
- N-J trees are unrooted
- N-J assumes additivity
- Def. Edge lengths are said to be additive if the distance between any pair of leaves is the sum of the lengths of the edges on the path connecting them
- The method uses an approximate algorithm: the tree is built by repeatedly finding a pair of neighboring leaves $i$ and $j$ that minimizes the length of the tree and joining them
- Running time $O(n^3)$ in the standard formulation
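Additivity can be made concrete by summing edge lengths along the path between each pair of leaves. The sketch below does this for the edge lengths of the final tree in the N-J example later in the deck (3, 5, 1, 3 and 8); the node names `v1` and `v2` follow that example, while the function name and the data layout are assumptions made for illustration.

```python
def leaf_distances(edges, leaves):
    """Sum edge lengths along the path between every pair of leaves."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))

    def from_node(start):
        # Depth-first walk; in a tree every node is reached by exactly one path.
        dist = {start: 0.0}
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt, w in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + w
                    stack.append(nxt)
        return dist

    out = {}
    for x in leaves:
        d = from_node(x)
        for y in leaves:
            if x < y:
                out[(x, y)] = d[y]
    return out

# Edge lengths as read from the final N-J tree of the example.
nj_tree = {("A", "v1"): 3, ("B", "v1"): 5, ("v1", "v2"): 1,
           ("C", "v2"): 3, ("D", "v2"): 8}
tree_dist = leaf_distances(nj_tree, ["A", "B", "C", "D"])
for pair, value in sorted(tree_dist.items()):
    print(pair, value)
```

The path from A to C has length 3 + 1 + 3 = 7, which agrees with the smallest distance quoted in the UPGMA example.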
21Algorithm Neighbor-Joining
- Initialisation
- Define $T$ to be the set of leaf nodes, one for each given sequence
- Iteration
- Compute $u_i = \sum_{k \neq i} \frac{d_{ik}}{n - 2}$ for each sequence $i$, where $n$ is the number of sequences in the distance matrix
- Pick a pair $i$ and $j$ for which $d_{ij} - u_i - u_j$ is the smallest (pick randomly if several are equal)
- Join items $i$ and $j$ with a new node $v$
- Compute the branch lengths from the new node $v$ to items $i$ and $j$: $d_{iv} = \frac{1}{2}(d_{ij} + u_i - u_j)$ and $d_{jv} = \frac{1}{2}(d_{ij} + u_j - u_i)$
- Compute the distances between the new node $v$ and each remaining item $k$: $d_{vk} = \frac{1}{2}(d_{ik} + d_{jk} - d_{ij})$
- Remove $i$ and $j$ from the distance matrix and replace them by the new node $v$
- Termination
- When only two items $i$ and $j$ remain, add the remaining edge between $i$ and $j$, with length $d_{ij}$
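The neighbor-joining steps can be sketched as follows. This is an illustrative, unoptimised version in the Studier and Keppler form ($u_i$ as a row sum divided by $n - 2$). The function name, the `v1`, `v2` node labels and the dictionary layout are assumptions, and ties are resolved by taking the first pair found instead of picking at random. The distance matrix is a reconstruction consistent with the edge lengths shown in the worked example, since the original matrix figure is not part of this text.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Return a list of edges (node_a, node_b, length) of the unrooted N-J tree."""
    d = dict(dist)
    items = list(names)
    edges = []
    counter = 0
    while len(items) > 2:
        n = len(items)
        # u_i: sum of distances from i to all other items, divided by n - 2
        u = {i: sum(d[frozenset((i, k))] for k in items if k != i) / (n - 2)
             for i in items}
        # pair minimising d_ij - u_i - u_j (first pair found on ties)
        i, j = min(combinations(items, 2),
                   key=lambda p: d[frozenset(p)] - u[p[0]] - u[p[1]])
        d_ij = d[frozenset((i, j))]
        counter += 1
        v = f"v{counter}"
        edges.append((i, v, (d_ij + u[i] - u[j]) / 2))
        edges.append((j, v, (d_ij + u[j] - u[i]) / 2))
        # distances from the new node to every remaining item
        for k in items:
            if k != i and k != j:
                d[frozenset((v, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij) / 2
        items = [v] + [k for k in items if k != i and k != j]
    i, j = items
    edges.append((i, j, d[frozenset((i, j))]))   # termination: last edge
    return edges

# Reconstructed example matrix (an assumption, see the note above).
example = {
    frozenset("AB"): 8, frozenset("AC"): 7, frozenset("AD"): 12,
    frozenset("BC"): 9, frozenset("BD"): 14, frozenset("CD"): 11,
}
edges = neighbor_joining(["A", "B", "C", "D"], example)
for a, b, length in edges:
    print(a, b, length)
```

With this matrix the sketch joins A and B first (edges of length 3 and 5), then v1 and C (edges of length 1 and 3), and finally connects v2 to D with length 8. Several pairs tie for the minimum in this small example; a different tie-break changes the joining order but leads to the same unrooted tree here.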
22An example N-J (1)
Step 1. Compute $u_i$ for each row in the distance matrix
Step 2. Compute $d_{ij} - u_i - u_j$ (the lower-diagonal matrix) and choose the smallest (most negative) value
23An example N-J (2)
- Step 3. Join A and B together with a new node v1. Compute the edge lengths from A to node v1 and from B to node v1
- Step 4. Compute the distances between the new node v1 and the remaining items (C and D)
(Figure: A and B joined to the new node v1; edge A-v1 has length 3 and edge B-v1 has length 5)
24An example N-J (3)
New reduced distance matrix
- Step 5. Delete A and B from the distance matrix and replace them by the new item AB
- Step 6. Continue from step 1, because more than two items remain
- Step 1. Compute $u_i$ for each row in the distance matrix
- Step 2. Compute $d_{ij} - u_i - u_j$ (the lower-diagonal matrix) and choose the smallest
25An example N-J (4)
- Step 3. Join v1 and C together with a new node v2. Compute the edge lengths from v1 to node v2 and from C to node v2
- Step 4. Compute the distances between the new node v2 and the remaining item (D)
(Figure: A and B joined to v1 with edge lengths 3 and 5; v1 and C joined to v2 with edge lengths 1 and 3)
26An example N-J (5)
- Step 5. Delete AB and C from the distance matrix and replace them by ABC
- Step 6. Only two nodes remain → connect them
Original distance matrix and final phylogenetic tree (including the edge lengths)
(Figure: final unrooted tree. Edge lengths: A-v1 3, B-v1 5, v1-v2 1, C-v2 3, D-v2 8)
27Comparison
- Neighbor-joining
- Unrooted tree, where the direction of evolution is unknown
- Suitable for datasets with widely varying rates of evolution
- Suitable for large datasets
- UPGMA
- The total branch length from the root to any leaf is equal
- Produces a rooted tree, where the root is the hypothesized ancestor of the sequences in the tree
- Suitable for closely related sequences
- Can be used to infer phylogenies if one can assume that evolutionary rates are the same in all lineages
(Figure: the final N-J tree and the final UPGMA tree from the examples, shown side by side with their branch lengths)
28Conclusion
- The UPGMA method constructs a rooted phylogenetic tree correctly if there is a molecular clock with a constant rate of mutation
- The UPGMA method is rarely used, because the molecular clock assumption is not generally true: selection pressures vary across time periods, organisms, genes within organisms, and regions within a gene
- The N-J method produces an unrooted tree without the molecular clock hypothesis
- The N-J method is one of the most popular methods and is widely used by molecular evolutionists
- Distance methods are strongly dependent on the model of evolution used
- Sequence information is reduced when sequence data are transformed into distances
- Distance methods are computationally fast
29Reference
- Durbin, R., Eddy, S., Krogh, A., Mitchison, G. 2003. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press.
- Li, W. 1997. Molecular Evolution. Sinauer Associates, Sunderland, MA. p. 108
- Felsenstein, J. 2003. Inferring Phylogenies. Sinauer Associates, Sunderland, MA. p. 147-170
30Examples of phylogeny programs
- Multiple sequence alignment
- Clustal series (W, V) (free, http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html)
- Phylogeny packages
- PAUP (http://paup.csit.fsu.edu/)
- Phylip (free, http://evolution.gs.washington.edu)
- MEGA (free, http://www.megasoftware.net)
- Viewing/plotting phylogenetic trees
- Treeview (free, http://taxonomy.zoology.gla.ac.uk/rod/treeview.html)
- NJPlot (free, http://pbil.univ-lyon1.fr/software/njplot.html)
31Further reading
- N-J: Saitou, N. and M. Nei. 1987. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4(4): 406-25.
- N-J: Studier, J. A., K. J. Keppler, et al. 1988. A note on the neighbor-joining algorithm of Saitou and Nei. Mol Biol Evol 5(6): 729-31.
- UPGMA: Michener, C. D., and R. R. Sokal. 1957. A quantitative approach to a problem in classification. Evolution 11: 130-162.
- ClustalW: Thompson, J. D., T. J. Gibson, et al. 1997. The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 25(24): 4876-82.