Title: Phylogenetic Tree Reconstruction
1Phylogenetic Tree Reconstruction
- Modified version of Dr. Chun-Chieh Shih
- Institute of Information Sciences
- Academia Sinica
2OUTLINE
Concept of evolutionary trees
Flowchart of phylogenetic analysis
Tree reconstruction methods
Evaluation of reconstructed trees
3Why RECONSTRUCT phylogenetic trees
For Example
Understand evolutionary history
Map pathogen strain diversity for vaccines
Assist in epidemiology of infectious diseases
Aid in prediction of function of novel genes
Biodiversity studies
Understanding microbial ecologies
4Concept of evolutionary trees
Rooted tree
One sequence (the root) is defined to be the common ancestor of all other sequences
Unrooted tree
Indicates evolutionary relationships without revealing the location of the oldest ancestor
If the molecular clock hypothesis holds, it is possible to predict a root
5- Image http://www.ncbi.nlm.nih.gov/About/primer/phylo.html
6Concept of evolutionary trees
Number of Trees
Taken from http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt
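The table for this slide did not survive transcription. As a rough sketch of how such counts are usually obtained, the standard formulas for strictly bifurcating trees with n labeled leaves are (2n-5)!! unrooted and (2n-3)!! rooted topologies. The function names below are illustrative and not taken from the slides.

```python
def odd_double_factorial(m: int) -> int:
    """Return m * (m - 2) * ... * 3 * 1 for odd m, and 1 when m < 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def num_unrooted_trees(n: int) -> int:
    """Unrooted bifurcating topologies on n >= 3 labeled leaves: (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    return odd_double_factorial(2 * n - 5)


def num_rooted_trees(n: int) -> int:
    """Rooted bifurcating topologies on n >= 2 labeled leaves: (2n-3)!!."""
    if n < 2:
        raise ValueError("need at least 2 leaves")
    return odd_double_factorial(2 * n - 3)


if __name__ == "__main__":
    for n in (4, 5, 10, 20, 100):
        print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

The counts grow faster than exponentially, which is why exhaustive search over all trees is only feasible for a handful of sequences.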
7Types of data used in phylogenetic inference
Character-based methods
Use the aligned characters, such as DNA or
protein sequences, directly during tree
inference.
Distance-based methods
Transform the sequence data into pairwise
distances, and use the matrix during tree
building.
9Distance Methods
Calculate the number of changes between each pair in a group of sequences (this is also the first step in producing a multiple sequence alignment)
Find the closest neighbors among the group of sequences
Identify the tree that positions the neighbors correctly and whose branch lengths reproduce the original distances as closely as possible
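A minimal sketch of the first step, assuming the sequences are already aligned and of equal length. It computes the proportion of differing sites (the p-distance) and, optionally, the Jukes-Cantor corrected distance. The toy sequences and function names are made up for illustration, and gap characters are simply treated as ordinary mismatching characters.

```python
import math
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor correction: d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(seqs: dict, correct: bool = False) -> dict:
    """Return a symmetric nested dict d[a][b] of pairwise distances."""
    d = {name: {name: 0.0} for name in seqs}
    for a, b in combinations(seqs, 2):
        p = p_distance(seqs[a], seqs[b])
        d[a][b] = d[b][a] = jukes_cantor(p) if correct else p
    return d


if __name__ == "__main__":
    toy = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACGAACTA"}  # hypothetical data
    for name, row in distance_matrix(toy).items():
        print(name, row)
```

The corrected distance is always at least as large as the raw proportion, because it accounts for multiple substitutions at the same site.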
10Distance Methods - Example
distance table
distances between sequences
11Distance Methods - Example
Distance Programs in Phylip
- FITCH estimates a phylogenetic tree assuming additivity of branch lengths, using the Fitch-Margoliash method
- KITSCH same as FITCH, but under the assumption of a molecular clock
- NEIGHBOR estimates phylogenies using either
  - Neighbor-joining (no molecular clock assumed)
  - Unweighted Pair Group Method with Arithmetic Mean (UPGMA) (molecular clock assumed)
12Distance Methods - UPGMA
Clustering
- All leaves are assigned to a cluster; the clusters are then iteratively merged according to their distance
Construct a distance tree
A -GCTTGTCCGTTACGAT
B ACTTGTCTGTTACGAT
C ACTTGTCCGAAACGAT
D -ACTTGACCGTTTCCTT
E AGATGACCGTTTCGAT
F -ACTACACCCTTATGAG
13Distance Methods - UPGMA
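The distance between two clusters i and j is defined as
$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$
where |C_i| and |C_j| denote the number of sequences in clusters i and j, respectively.
Replacing clusters i and j by their union
$$C_k = C_i \cup C_j$$
The new distances between the new node k and all other clusters l are computed according to
$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$$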
14Distance Methods - UPGMA
Step I: Initialization
- Assign each sequence i to its own cluster C_i.
- Define one leaf of T for each sequence, and place it at height zero.
Step II: Iteration
- Determine the two clusters i, j for which d_ij is minimal
- Define a new cluster k by C_k = C_i ∪ C_j, and define d_kl for all l
- Define a node k with daughter nodes i and j, and place it at height d_ij/2.
- Add k to the current clusters and remove i and j.
Step III: Termination
- When only two clusters i, j remain, place the root at height d_ij/2.
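The three steps can be sketched in Python as follows. The distance table below is an assumption: the table from the example slide was not transcribed, so the values were chosen to agree with the averages quoted for the first round (4, 6, 6, 8). New distances are the size-weighted average of the two merged clusters' distances, and the final merge plays the role of the termination step. Ties between equally close pairs are broken by iteration order, so the merge order may differ from the slides without changing the tree.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster by UPGMA.

    names: leaf labels; dist: dict mapping frozenset({a, b}) -> distance.
    Returns a list of merges (cluster_i, cluster_j, node_height).
    """
    d = dict(dist)
    size = {name: 1 for name in names}  # cluster label -> number of leaves
    merges = []
    while len(size) > 1:
        # Step II: pick the closest pair of clusters.
        i, j = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        dij = d[frozenset((i, j))]
        ni, nj = size.pop(i), size.pop(j)
        k = f"({i},{j})"
        # Distance from the new cluster to every remaining cluster l.
        for l in size:
            d[frozenset((k, l))] = (
                ni * d[frozenset((i, l))] + nj * d[frozenset((j, l))]
            ) / (ni + nj)
        size[k] = ni + nj
        merges.append((i, j, dij / 2))  # node k is placed at height d_ij / 2
    return merges


# Assumed lower-triangular distance table (not taken verbatim from the slides).
NAMES = ["A", "B", "C", "D", "E", "F"]
ROWS = {"B": [2], "C": [4, 4], "D": [6, 6, 6], "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
DIST = {
    frozenset((row, NAMES[col])): float(value)
    for row, values in ROWS.items()
    for col, value in enumerate(values)
}

if __name__ == "__main__":
    for i, j, height in upgma(NAMES, DIST):
        print(f"join {i} and {j} at height {height}")
```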
15Distance Methods - UPGMA
First round
Choose the most similar pair, cluster them together and calculate the new distance matrix.
[Figure: A and B are joined, each with branch length 1]
dist((A,B),C) = (distAC + distBC)/2 = 4
dist((A,B),D) = (distAD + distBD)/2 = 6
dist((A,B),E) = (distAE + distBE)/2 = 6
dist((A,B),F) = (distAF + distBF)/2 = 8
16Distance Methods - UPGMA
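Second round
[Figure: D and E are joined, each with branch length 2]
Third round
[Figure: C (branch length 2) is joined to the (A,B) cluster, whose internal branch has length 1]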
17Distance Methods - UPGMA
Fourth round
Fifth round
18Distance Methods - UPGMA
The UPGMA clustering method is very sensitive to unequal evolutionary rates
- It assumes that the evolutionary rate is the same for all branches
Clustering works only if the data are ultrametric
Ultrametric tree
[Figure: example ultrametric tree with branch lengths]
A special kind of additive tree in which the tips are all equidistant from the root
An additive tree is a cladogram with branch lengths, also called a phylogram or metric tree
19Distance Methods - UPGMA
UPGMA fails when rates of evolution are not constant
[Figure: a true tree with unequal branch lengths compared with its UPGMA reconstruction, which has the wrong topology]
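20Distance Methods - Neighbor Joining
The Four Point Condition
$$d_{AC} + d_{BD} = d_{AD} + d_{BC} = a + b + c + d + 2x = d_{AB} + d_{CD} + 2x$$
where a, b, c, d are the lengths of the branches leading to A, B, C, D and x is the length of the internal branch.
The 4-point condition
$$d_{AB} + d_{CD} < d_{AC} + d_{BD}$$
$$d_{AB} + d_{CD} < d_{AD} + d_{BC}$$
The pairs on the left-hand side, (A,B) and (C,D), are neighbors; the pairs on the right-hand side are non-neighbors
- Neighbors are closer than non-neighbors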
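A small sketch of how the condition can be checked for one quartet: of the three ways to pair four taxa, the pairing with the smallest distance sum identifies the neighbors, and the distances are tree-like (additive) when the two larger sums are equal. The function name, tolerance, and example branch lengths are illustrative assumptions.

```python
def neighbor_pairing(d, quartet, tol=1e-9):
    """Return (pairing_with_smallest_sum, is_additive) for four taxa.

    d is a nested dict of symmetric distances, d[x][y].
    """
    p, q, r, s = quartet
    sums = {
        ((p, q), (r, s)): d[p][q] + d[r][s],
        ((p, r), (q, s)): d[p][r] + d[q][s],
        ((p, s), (q, r)): d[p][s] + d[q][r],
    }
    ordered = sorted(sums.items(), key=lambda item: item[1])
    (best, _), (_, middle), (_, largest) = ordered
    return best, abs(largest - middle) <= tol


def quartet_distances(a, b, c, dd, x):
    """Distances on the tree ((A:a, B:b), (C:c, D:dd)) with internal branch x."""
    pairs = {
        ("A", "B"): a + b, ("C", "D"): c + dd,
        ("A", "C"): a + x + c, ("A", "D"): a + x + dd,
        ("B", "C"): b + x + c, ("B", "D"): b + x + dd,
    }
    d = {t: {} for t in "ABCD"}
    for (u, v), value in pairs.items():
        d[u][v] = d[v][u] = value
    return d


if __name__ == "__main__":
    d = quartet_distances(1.0, 2.0, 3.0, 4.0, 1.5)
    print(neighbor_pairing(d, ("A", "B", "C", "D")))
```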
21Distance Methods - Neighbor Joining
Sequences are chosen to give the best least-squares estimate of branch length
Begin with a star topology: no neighbors have been joined
The tree is modified by joining pairs of sequences
22Distance Methods - Neighbor Joining
The pair is chosen by calculating the sum of branch lengths for the corresponding tree
If A and B are joined
23Distance Methods - Neighbor Joining
Neighbor-Joining approximates the least squares tree, assuming additivity, but without resorting to the assumption of a molecular clock.
Idea: join clusters that are not only close to one another, but are also far from the rest.
In each iteration, find the direct ancestor of two species in the tree → neighboring leaves.
24Distance Methods - Neighbor Joining
Example: neighboring leaves i, j with ancestor k.
Join i and j: remove them from the list of leaf nodes and add k to the list, with distances to the other leaves m defined as
$$d_{km} = \frac{1}{2}\left(d_{im} + d_{jm} - d_{ij}\right)$$
Problem: it is not sufficient to simply pick the two closest leaves
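25Distance Methods - Neighbor Joining
Solution: for node i, define the average distance u_i to all other leaves (normalized by n - 2, where n is the current number of leaves)
$$u_i = \frac{1}{n-2} \sum_{k \neq i} d_{ik}$$
and the corrected distances
$$D_{ij} = d_{ij} - u_i - u_j$$
Minimum-evolution criterion: minimize the sum of all branch lengths. The nodes i and j that are clustered next are those for which D_ij is smallest.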
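26Distance Methods - Neighbor Joining
Initialization
1. Initialize n clusters with the given species, one species per cluster
2. Set the size of each cluster to 1: n_i ← 1
3. In the output tree T, assign a leaf for each species
Iteration
1. For each species i, compute
$$u_i = \frac{1}{n-2} \sum_{k \neq i} d_{ik}$$
2. Choose the i and j for which d_ij - u_i - u_j is smallest.
3. Join clusters i and j into a new cluster with corresponding node k, and set
$$d_{km} = \frac{1}{2}\left(d_{im} + d_{jm} - d_{ij}\right)$$
for every other cluster m. Calculate the branch lengths from i and j to the new node as
$$d_{ik} = \frac{1}{2}\left(d_{ij} + u_i - u_j\right), \qquad d_{jk} = \frac{1}{2}\left(d_{ij} + u_j - u_i\right)$$
4. Delete clusters i and j from T and add k
5. If more than two nodes remain, go back to 1. Otherwise, connect the two remaining nodes by a branch of length d_ij and end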
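The iteration can be sketched as below. Because the formulas on these slides were not transcribed, the sketch assumes the commonly used forms: u_i is the sum of distances from i divided by (n - 2), the pair minimizing d_ij - u_i - u_j is joined, branch lengths are (d_ij + u_i - u_j)/2 and (d_ij + u_j - u_i)/2, and the new node k gets d_km = (d_im + d_jm - d_ij)/2. Internal node labels such as "N0" are made up and are assumed not to collide with leaf names; the test distances come from a hypothetical four-leaf tree.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the unrooted NJ tree as a list of edges (node, node, length).

    dist maps frozenset({a, b}) -> distance between a and b.
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Step 1: normalized total distance of each node to all others.
        u = {
            i: sum(d[frozenset((i, m))] for m in nodes if m != i) / (n - 2)
            for i in nodes
        }
        # Step 2: pick the pair with the smallest corrected distance.
        i, j = min(
            combinations(nodes, 2),
            key=lambda pair: d[frozenset(pair)] - u[pair[0]] - u[pair[1]],
        )
        dij = d[frozenset((i, j))]
        k = f"N{next_id}"  # hypothetical label for the new internal node
        next_id += 1
        # Step 3: branch lengths to k and distances from k to the rest.
        edges.append((i, k, (dij + u[i] - u[j]) / 2))
        edges.append((j, k, (dij + u[j] - u[i]) / 2))
        nodes = [m for m in nodes if m not in (i, j)]
        for m in nodes:
            d[frozenset((k, m))] = (
                d[frozenset((i, m))] + d[frozenset((j, m))] - dij
            ) / 2
        nodes.append(k)
    # Termination: connect the last two nodes directly.
    i, j = nodes
    edges.append((i, j, d[frozenset((i, j))]))
    return edges


if __name__ == "__main__":
    # Additive distances from the tree ((A:1, B:2):1.5, (C:3, D:4)).
    table = {("A", "B"): 3.0, ("A", "C"): 5.5, ("A", "D"): 6.5,
             ("B", "C"): 6.5, ("B", "D"): 7.5, ("C", "D"): 7.0}
    dist = {frozenset(pair): value for pair, value in table.items()}
    for edge in neighbor_joining("ABCD", dist):
        print(edge)
```

On additive input such as this, the recovered branch lengths match the tree that generated the distances, even though no molecular clock is assumed.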
27Maximum Parsimony
Predicts the evolutionary tree by minimizing the number of steps required to generate the observed variation
For each position, the phylogenetic trees that require the smallest number of evolutionary changes to produce the observed sequence changes are identified
Trees producing the smallest number of changes over all sequence positions are identified
Time-consuming algorithm
Only works well if the sequences have strong sequence similarity
28Maximum Parsimony
Step I
Input multiple sequence alignment
Step II
For each aligned position, identify phylogenetic
trees that require the smallest number of
evolutionary changes to produce the observed
sequence changes
Step III
Continue analysis for every position in the
sequence alignment
Step IV
Sequence variations at each site in the alignment
are placed at the tips of the trees
29Maximum Parsimony - Example
positions
Sequences
Informative sites must favor one tree over another → site 5 is informative, but sites 1, 6, 8 are not
To be informative, a site must also have the same sequence character in at least two genomes → only sites 5, 7, and 9 are informative according to this rule
E.g. trees for position 5
Combining sites 5, 7, and 9, the left tree is the
best tree for these 4 sequences
30Maximum Parsimony - Example
What is the parsimony score of
31Maximum Parsimony - Example
How many possible unrooted trees?
32Maximum Parsimony - Example
How many substitutions?
1 change
5 changes
tree
33Maximum Parsimony - Example
Site       1 2 3 4 5 6 7 8 9 10
Species 1  A G G G T A A C T G
Species 2  A C G A T T A T T A
Species 3  A T A A T T G T C T
Species 4  A A T G T T G T C G
Minimum substitutions per site for each of the three possible trees (filled in site by site over slides 33-36), with totals
Tree 1     0 3 2 2 0 1 1 1 1 3    total 14
Tree 2     0 3 2 2 0 1 2 1 2 3    total 16
Tree 3     0 3 2 1 0 1 2 1 2 3    total 15
37Maximum Parsimony Searching for Trees
Imagine how large 10^182 is ...
38Maximum Parsimony
Where maximum parsimony fails
Parsimony can give misleading information when rates of sequence change vary among the branches of the tree represented by the sequence data
Parsimony analysis implicitly assumes that rates of change along all branches of the tree are equal, so when rates differ the tree predicted by parsimony may not be correct.
Real tree: 2 long branches in which G has turned to A independently, possibly with some intermediate steps.
39Standard problem: Maximum Parsimony (Hamming distance Steiner Tree)
Maximum Parsimony - Example
- Input: Set S of n aligned sequences of length k
- Output: A phylogenetic tree T
  - leaf-labeled by sequences in S
  - additional sequences of length k labeling the internal nodes of T
such that
$$\sum_{(u,v) \in E(T)} H(u,v)$$
is minimized, where H(u,v) is the Hamming distance between the sequences labeling u and v.
40Maximum parsimony (example)
Maximum Parsimony - Example
- Input: Four sequences
  - ACT
  - ACA
  - GTT
  - GTA
- Question: which of the three trees has the best MP score?
41All possible unrooted trees
Maximum Parsimony - Example
[Figure: the three unrooted trees on the four leaves: (ACT,ACA | GTT,GTA), (ACT,GTT | ACA,GTA), (ACT,GTA | ACA,GTT)]
42Possible substitutions
Maximum Parsimony - Example
[Figure: the three trees with labeled internal nodes and substitution counts on the edges; the labelings shown give MP scores of 7, 5, and 4]
MP score 4: optimal MP tree
43Maximum Parsimony: computational complexity
Maximum Parsimony - Example
For a fixed tree, the optimal labeling can be computed in linear time O(nk)
[Figure: the optimal tree from the previous slide, MP score 4]
Finding the optimal MP tree is NP-hard
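One well-known way to score a fixed tree in linear time is Fitch's algorithm, sketched here on the four-sequence example (the slides do not say which algorithm they have in mind, so this choice is an assumption). Each unrooted tree is rooted on its internal edge and written as nested tuples; the position of the root does not change the score. The count returned is the minimum over all internal labelings, so it can be lower than the score of any one particular labeling drawn on a tree.

```python
def fitch_site(tree, states):
    """Return (candidate_state_set, min_changes) for one alignment column.

    tree is a leaf name (str) or a pair (left_subtree, right_subtree);
    states maps each leaf name to its character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one substitution is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, seqs):
    """Sum the minimum number of changes over all alignment columns."""
    length = len(next(iter(seqs.values())))
    return sum(
        fitch_site(tree, {name: seq[site] for name, seq in seqs.items()})[1]
        for site in range(length)
    )


# Leaves are named by their own sequences for readability.
SEQS = {s: s for s in ("ACT", "ACA", "GTT", "GTA")}
TREES = {
    "ACT,ACA | GTT,GTA": (("ACT", "ACA"), ("GTT", "GTA")),
    "ACT,GTT | ACA,GTA": (("ACT", "GTT"), ("ACA", "GTA")),
    "ACT,GTA | ACA,GTT": (("ACT", "GTA"), ("ACA", "GTT")),
}

if __name__ == "__main__":
    for label, tree in TREES.items():
        print(label, parsimony_score(tree, SEQS))
```

Scoring every topology this way is only practical for very small inputs; the NP-hardness result concerns the search over trees, not the scoring of a single tree.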
44Maximum likelihood approach
Method uses probability calculations to find
a tree that best accounts for the variation in
a set of sequences
Similar to maximum parsimony method in
that analysis is performed on each column of
a multiple sequence alignment
Start with an evolutionary model of
sequence change that provides estimates of rates
of substitution of one base for
another (transitions and transversions).
45Maximum likelihood approach
Statistical method - powerful and flexible, also
computationally complex
Given a particular tree and a model of
the evolutionary change, calculate the
likelihood of the tree based on data, i.e. the
given multiple sequence alignment
Likelihood(tree | data) is proportional to Probability(data | tree)
46Maximum likelihood approach
Tree with branches; v_k are the branch lengths
Probability of character change P_AC(t) for A → C in time t
We don't know the character states inside the tree (in the past), so calculate for all possibilities, e.g. A, C, G, T
47Maximum likelihood approach
$$L = p(A)\,P_{AA}(v_1)\,P_{AA}(v_2)\,P_{AG}(v_4)\,P_{AA}(v_5)\,P_{AA}(v_6)\,P_{AA}(v_3)\,P_{AA}(v_7)\,P_{AA}(v_8)$$
$$L = p(A)\,P_{AA}(v_1)\,P_{AG}(v_2)\,P_{GG}(v_4)\,P_{GA}(v_5)\,P_{AA}(v_6)\,P_{AA}(v_3)\,P_{AA}(v_7)\,P_{AA}(v_8)$$
48Maximum likelihood approach
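$$L = p(s_0)\,P_{s_0 s_1}(v_1)\,P_{s_1 s_2}(v_2)\,P_{s_2 s_4}(v_4)\,P_{s_2 s_5}(v_5)\,P_{s_1 s_6}(v_6)\,P_{s_0 s_3}(v_3)\,P_{s_3 s_7}(v_7)\,P_{s_3 s_8}(v_8)$$
This term is summed over all possible states of the unobserved internal nodes to give the likelihood of the site
Maximum likelihood does best in simulation but is also the slowest method
A variety of new heuristics find the ML tree faster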
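A sketch of the site likelihood for the tree used on these slides (root s0 with children s1 and s3; s1 with children s2 and s6; s2 with leaves s4 and s5; s3 with leaves s7 and s8). It assumes the Jukes-Cantor model with equal base frequencies, and the branch lengths are made-up values. The first function sums the product over every assignment of states to the internal nodes; the second uses Felsenstein's pruning recursion, which gives the same value with far less work.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(a, b, v):
    """Jukes-Cantor probability of a -> b along a branch of length v."""
    e = math.exp(-4.0 * v / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def brute_force(v, leaves):
    """Sum the per-assignment likelihood over all internal states."""
    total = 0.0
    for s0, s1, s2, s3 in product(BASES, repeat=4):
        total += (
            0.25
            * jc_prob(s0, s1, v[1]) * jc_prob(s1, s2, v[2])
            * jc_prob(s2, leaves["s4"], v[4]) * jc_prob(s2, leaves["s5"], v[5])
            * jc_prob(s1, leaves["s6"], v[6]) * jc_prob(s0, s3, v[3])
            * jc_prob(s3, leaves["s7"], v[7]) * jc_prob(s3, leaves["s8"], v[8])
        )
    return total


def partial(node, leaves):
    """Return {state: P(leaf data below node | node is in state)}."""
    if isinstance(node, str):  # a leaf: certainty about its observed state
        return {s: 1.0 if s == leaves[node] else 0.0 for s in BASES}
    result = {s: 1.0 for s in BASES}
    for child, length in node:  # internal node: tuple of (child, branch length)
        below = partial(child, leaves)
        for s in BASES:
            result[s] *= sum(jc_prob(s, t, length) * below[t] for t in BASES)
    return result


def pruning(root, leaves):
    """Site likelihood via Felsenstein's pruning algorithm."""
    top = partial(root, leaves)
    return sum(0.25 * top[s] for s in BASES)


# Hypothetical branch lengths v1..v8 and the leaf states used on the slide.
V = {1: 0.1, 2: 0.2, 3: 0.15, 4: 0.3, 5: 0.05, 6: 0.25, 7: 0.1, 8: 0.2}
LEAVES = {"s4": "G", "s5": "A", "s6": "A", "s7": "A", "s8": "A"}
S2 = (("s4", V[4]), ("s5", V[5]))
S1 = ((S2, V[2]), ("s6", V[6]))
S3 = (("s7", V[7]), ("s8", V[8]))
ROOT = ((S1, V[1]), (S3, V[3]))

if __name__ == "__main__":
    print(brute_force(V, LEAVES), pruning(ROOT, LEAVES))
```

The brute-force sum has 4^4 terms for this tree and grows exponentially with the number of internal nodes, while the pruning recursion visits each branch once.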
49Maximum Likelihood (ML)
Maximum likelihood approach
- Given: a stochastic model of sequence evolution (e.g. Jukes-Cantor) and a set S of sequences
- Objective: Find a tree T and probabilities p(e) of substitution on each edge that maximize the probability of the data.
- Preferred by some systematists, but even harder than MP in practice.
50Quality of the tree
Phylogenetic trees can vary dramatically with
slight changes in data
We want to know which branches are reliable, and
which branches do not have strong support from
the data
Bootstrapping is the most common method used
A general statistical technique for determining
how much error is in a set of results
51Confidence assessment
Bootstrapping
1. Start from the original data set with n characters and the original analysis, e.g. MP, ML, NJ.
2. Draw n characters randomly with replacement. Repeat m times.
3. This gives m pseudo-replicates, each with n characters.
4. Repeat the original analysis on each of the pseudo-replicate data sets.
5. Evaluate the results from the m analyses.
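The resampling step can be sketched as follows, using the four-species alignment from the parsimony example as data. Columns (characters) are drawn with replacement, so every pseudo-replicate has the same n columns as the original, with some repeated and some missing. The `build_clades` argument is a hypothetical stand-in for whatever tree-building analysis is being assessed; it is assumed to return the set of clades (as frozensets of taxon names) found in its tree.

```python
import random
from collections import Counter


def bootstrap_replicates(alignment, m, seed=None):
    """Yield m pseudo-replicates made by resampling columns with replacement."""
    rng = random.Random(seed)
    names = list(alignment)
    n = len(alignment[names[0]])
    for _ in range(m):
        columns = [rng.randrange(n) for _ in range(n)]
        yield {
            name: "".join(alignment[name][c] for c in columns) for name in names
        }


def bootstrap_support(alignment, build_clades, m=100, seed=0):
    """Fraction of replicates in which each clade of the original tree recurs."""
    reference = build_clades(alignment)
    counts = Counter()
    for replicate in bootstrap_replicates(alignment, m, seed):
        counts.update(reference & build_clades(replicate))
    return {clade: counts[clade] / m for clade in reference}


ALIGNMENT = {
    "Species 1": "AGGGTAACTG",
    "Species 2": "ACGATTATTA",
    "Species 3": "ATAATTGTCT",
    "Species 4": "AATGTTGTCG",
}

if __name__ == "__main__":
    for replicate in bootstrap_replicates(ALIGNMENT, m=3, seed=42):
        print(replicate)
```

The support values returned by `bootstrap_support` are the numbers usually written on the branches of the original tree.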
52Confidence assessment
Bootstrap sampling of phylogenies
53Confidence assessment
What do the bootstrap values mean?
Bootstrap values for phylogenetic trees do not follow proper statistical behavior
A bootstrap value of 95% is actually close to 100% confidence in that branch
A bootstrap value of 75% is often close to 95% confidence
A bootstrap value of 60% indicates much lower confidence
A bootstrap value below 50% means no confidence in that branch over an alternative
54Computer Software for Phylogenetics
- Due to the lack of consensus among evolutionary biologists about basic principles for phylogenetic analysis, it is not surprising that there is a wide array of computer software available for this purpose.
- PHYLIP is a free package that includes 30 programs that compute various phylogenetic algorithms on different kinds of data.
- The GCG package (available at most research institutions) contains a full set of programs for phylogenetic analysis, including simple distance-based clustering and the complex cladistic analysis program PAUP (Phylogenetic Analysis Using Parsimony)
- CLUSTALX is a multiple alignment program that includes the ability to create trees based on Neighbor Joining.
- MacClade is a well designed cladistics program that allows the user to explore possible trees for a data set.
55Phylogenetics on the Web
- There are several phylogenetics servers available on the Web
  - some of these will change or disappear in the near future
  - these programs can be very slow, so keep your sample sets small
- The Institut Pasteur, Paris has a PHYLIP server at
  - http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html
- Louxin Zhang at the Natl. University of Singapore has a WebPhylip server
  - http://sdmc.krdl.org.sg:8080/lxzhang/phylip/
- The Belozersky Institute at Moscow State University has their own "GeneBee" phylogenetics server
  - http://www.genebee.msu.su/services/phtree_reduced.html
- The Phylodendron website is a tree drawing program with a nice user interface and a lot of options; however, the output is limited to GIFs at 72 dpi, which is not publication quality.
  - http://iubio.bio.indiana.edu/treeapp/treeprint-form.html
56Other Web Resources
- Joseph Felsenstein (author of PHYLIP) maintains a comprehensive list of Phylogeny programs at
  - http://evolution.genetics.washington.edu/phylip/software.html
- Introduction to Phylogenetic Systematics, Peter H. Weston and Michael D. Crisp, Society of Australian Systematic Biologists
  - http://www.science.uts.edu.au/sasb/WestonCrisp.html
- University of California, Berkeley Museum of Paleontology (UCMP)
  - http://www.ucmp.berkeley.edu/clad/clad4.html
57Software Hazards
- There are a variety of programs for Macs and PCs, but you can easily tie up your machine for many hours with even moderately sized data sets (e.g. fifty 300 bp sequences)
- Moving sequences into different programs can be a major hassle due to incompatible file formats.
- Just because a program can perform a given computation on a set of data does not mean that it is the appropriate algorithm for that type of data.
58Which Method to Choose?
Depends upon the sequences that are being compared
Strong sequence similarity
→ Maximum parsimony
Clearly recognizable sequence similarity
→ Distance methods
All others
→ Maximum likelihood
Best to choose at least two approaches
Compare the results: if they are similar, you can have more confidence
59Which Method to Choose?
60Comparison of Methods
Tony Weisstein, http://bioquest.org:16080/bedrock/terre_haute_03_04/phylogenetics_1.0.ppt
61More Topics Related to Phylogenetics
62More topics related to Phylogenetics
Supertree / Tree of life
Phylogeny epidemiology
Phylogeography
63Idea of the Tree of Life
The idea that the evolution of life can be
represented as a tree, with leaves corresponding
to extant species and nodes to extinct ancestors,
came from Charles Darwin
The earliest trees formed by Ernst Haeckel and
others were based on a general idea of a
hierarchy of relationships between species and
higher taxa
Gradually, quantitative criteria have been
developed to measure the degree of morphological
difference that was thought to reflect
evolutionary distance
64Winds of Change
In the early days of molecular phylogenetics, a
gene tree was usually equated with the species
tree. This view was typified using ribosomal RNA
(rRNA) sequences as the principal molecular
phylogenetic marker
This resulted in the discovery of a previously
unrecognized domain of life, the Archaea, and in
a tree topology that has been aptly called the
standard model of evolution
This model involves the early descent of the
bacterial clade from the last universal common
ancestor and a subsequent separation of archaea
and eukaryotes.
All this was to change once comparative genomics
yielded more information and multiple complete
genome sequences became available for comparison
65The three domains of Life
Identified by phylogenetic analysis of the
highly conserved 16S ribosomal RNA
66Three strategies for constructing phylogenies
Homologous single-gene data set
- Rely on many taxa for a single gene
- Fewer genes and fewer samples
- Alignment of a large number of sequences
Sequence concatenation
- Combine or concatenate multiple sequences for the same set of species
- Needs close concordance of species sampling among genes, which is difficult because of the hit-or-miss sampling in the databases.
Supertree construction
- Sample multiple genes only for minimally overlapping sets of species
- Tree constructed from a set of subtrees
67Assembling the Tree of Life (ATOL)
What is the computational difficulty?
With current computational tools, phylogenetic analyses for 1,000 species are possible with adequate computer resources
It is currently impossible to reach a reasonable solution for 500,000 species, even with months of computation.
PARALLEL ALGORITHMS FOR GENETICS
Tree of Life (30,000 species)
David Hillis, Science, 2003
69Assembling large data matrices by concatenation
Domination by biological problems
Advantages
- Improve the accuracy of a specific portion of a tree
- The addition of species can be useful in cases of so-called long-branch attraction, in which high substitution rates or long intervals of time can mislead phylogenetic inference methods
Two potential problems
- Multiple genes can mix phylogenetic signals arising from different evolutionary histories
- Some sequences are usually unavailable for some species (missing data), with possible deleterious effects on accuracy
70Reconstruction of trees from large data matrices
Domination by computational problems
Two issues in constructing phylogenetic trees
- Computation time
- Reliability
Two time-consuming computational problems
- Multiple sequence alignment
- Phylogenetic inference
  - Optimal methods (parsimony and maximum likelihood) are time-consuming
  - Even heuristic approaches are time-consuming
Months of processor time were devoted to a heuristic parsimony analysis of the Chase et al. dataset of 500 sequences, and it never ran to completion (Sanderson and Driskell, 2004)
71Synthesis of large trees: supertree
A tree constructed from a set of trees
Advantages
- Independent studies can be combined into a single tree
- Initial trees can be based on different kinds of data
- Initial trees can be obtained by different methodologies
- Initial trees often have been selected from competing trees by professional judgment
- There are most likely no common data for all species
- Methods such as maximum likelihood would not be computationally tractable on such a large dataset
72Synthesis of large trees: supertree
Classification (Wilkinson et al., 2001; Bininda-Emonds et al., 2002)
Supertree technique: past and present (Bininda-Emonds, 2004)
- Present
- Past
73Reconstructing the Tree of Life
Handling large datasets: millions of species
The Tree of Life is not really a tree: reticulate evolution
74Phylogenetic Epidemiology
75Infectious diseases are caused by pathogens
pathogen: a microbe that causes disease
microbe: a microscopic organism
The major classes of disease-causing microbes are viruses, bacteria, and eukaryotes (protists, fungi, and worms)
RNA Viruses
The RNA viruses are more often associated with
epidemic and emerging diseases in humans than DNA
viruses.
The gene sequences of many RNA viruses change so
rapidly that it is possible to watch spatial and
temporal patterns unfold on a real time scale
that is not usually visible in other organisms.
Diseases caused by RNA viruses: avian influenza, HIV, dengue...
76The rapidity of RNA virus evolution is caused by a combination of (Holmes, 2004)
- Extremely high mutation rates
- Short generation times
- Immense population sizes.
These factors produce rates of nucleotide
substitution that are, on average, some six
orders of magnitude higher than those in
eukaryotes and DNA viruses (Jenkins et al. 2002).
The high rates of substitution found in viruses
and bacteria allow phylogenies to be
reconstructed for sequences that have diverged
only recently
Molecular phylogenies have come to play an
increasingly important role in epidemiological
studies of microbial pathogens, as they provide
information about the location, timing, and
mechanisms by which virulent strains arise.
77Guan et al. (2002) Emergence of multiple
genotypes of H5N1 avian influenza viruses in Hong
Kong SAR. Proc Natl Acad Sci U S A, 99,
8950-8955.
78Moya, A., Holmes, E.C., and Gonzalez-Candelas, F.
(2004) The population genetics and evolutionary
epidemiology of RNA viruses. Nat Rev Microbiol,
2, 279-288.
79Rannala, B. 2002. Molecular phylogenies and
virulence evolution. In Adaptive Dynamics of
Infectious Diseases In Pursuit of Virulence
Management
Maximum likelihood estimate of phylogeny of eight
strains of influenza A isolated from humans,
swine, and birds based on an analysis of the HA
gene. The divergence years prior to 1870,
estimated using a partially constrained molecular
clock, are shown at the left of the branch. The
branch lengths (after 1870) are calibrated in
units of years (scale at bottom).
80Difficulties With Phylogenetic Analysis
Garbage in, garbage out! Alignment is crucial
Horizontal or lateral transfer of genetic material (for instance through viruses) makes it difficult to determine the phylogenetic origin of some evolutionary events
Genes under selective pressure can evolve rapidly, masking earlier changes that had occurred phylogenetically
81Difficulties With Phylogenetic Analysis
Two sites within comparative sequences may be evolving at different rates
Rearrangements of genetic material can lead to false conclusions
Duplicated genes can evolve along separate pathways, leading to different functions
82Phylogenetics - Issues
Gene trees vs species trees
- Gene duplication can complicate phylogenetic analysis
- Paralogues (duplicated genes) do not fit in the evolutionary tree
Choice of target sequence type
- Ribosomal RNA (slowest change / mutation rate)
  Use for very long-term evolutionary studies, spanning species boundaries and biological kingdoms
- DNA / RNA (fastest change / mutation rate)
  (a) Use for short-term studies of closely-related species
  (b) Contains more evolutionary information than protein
- Protein (medium change / mutation rate)
  (a) Use for wide species comparisons
  (b) More reliable alignment than DNA
83NO HOMEWORK! Happy??
A problem will appear in the Final Exam
Give an example and design a flowchart to show how to construct a tree
Your answer should include, at least
(a) Where did you find the example? (Google, books, or papers)
(b) Why did you choose this example? (curiosity, simplicity, or no reason?)
(c) Where do you plan to get the sequences? (database in the public domain)
(d) What kind of methods do you plan to use to construct your tree?
(e) Why do you plan not to use other methods?
Just go to Google and find YOUR OWN answer!