Title: IE68 Biological databases Phylogenetic analysis
1IE68 - Biological databases: Phylogenetic analysis
2Phylogenetic analysis
- Phylogeny
- a reconstruction of the evolutionary (genealogical) history of a group of organisms, genes or proteins from biological data
- organisms: populations, species, genera, ... => taxa => operational taxonomic units (OTUs)
- data: molecular, morphological, archaeological, ... => characters
- Phylogenetic tree
- the graphical reconstruction of a phylogeny
- tree structure: phylogram, cladogram
3Phylogenetic tree
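A tree consists of nodes connected by branches.
[Figure: rooted tree with terminal nodes A, B, C, D, E]
- Terminal nodes (A to E) => OTUs for which we have data
- Polytomy: a node from which more than two branches descend (here C, D, E)
- Root (placed by outgroup or midpoint rooting) => ancestor of all the taxa that comprise the tree
- Notation: ((A,B),(C,D,E))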
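The nested-parenthesis notation can be read mechanically. The following is a minimal sketch, assuming labels contain no parentheses or commas and that the string has no branch lengths; the function names `parse_groups` and `leaves` are illustrative and do not come from the lecture.

```python
def parse_groups(text):
    """Parse notation such as '((A,B),(C,D,E))' into nested tuples."""
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip '('
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # skip ')'
            return tuple(children)
        # Otherwise read a leaf label up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if pos != len(text):
        raise ValueError("trailing characters after the tree")
    return tree


def leaves(node):
    """Return the OTU labels below a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [label for child in node for label in leaves(child)]


tree = parse_groups("((A,B),(C,D,E))")
print(tree)                             # nested tuples mirroring the notation
print(leaves(tree))                     # the five OTUs
print([len(child) for child in tree])   # a count above 2 marks a polytomy
```

An internal node with three children, such as (C,D,E), is the polytomy of the slide.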
4Phylogenetics <=> Phenetics
- Phenetics: method of grouping taxa that is based on overall (dis)similarities of characters => with no reference to evolution!
- Phylogenetics: method of grouping taxa that is based on shared derived characters (synapomorphies) or a model of evolution
5Why do we need phylogenies?
- Intrinsic interest in the tree => tree of life
- origin of organisms
6Why do we need phylogenies?
- Phylogenies can also be used as tools for investigating other problems
- e.g. biogeography
- phylogeny reflects the order of separation of the areas the different taxa occupy
7Why do we need phylogenies?
- Phylogenies can also be used as tools for investigating other problems
- e.g. forensic science
9Phylogenetic analysis
- Molecular Phylogenetics
- reconstruction of the evolutionary (genealogical) history of a group of organisms from molecular data, i.e. DNA or protein sequences
- In this lecture, we will focus on phylogenetic analysis of organisms based on DNA sequence data
10Molecular phylogenetics approach
- Step 1: PCR with primers that target cytoplasmic DNA or nuclear loci of taxa, followed by DNA sequence analysis
- Step 2: Multiple DNA sequence alignment
- Step 3: Phylogenetic analysis
11PCR and DNA sequencing
- Which loci?
- DNA sequence information, primers, variability, single or low-copy, orthologous, neutral, recombination, ...
- Gene trees versus organismal trees
- phylogenies for genes do not always match those of their corresponding organisms => analyse more than one gene
12Confounding influence of gene duplication
Two types of homology: orthology (speciation) and paralogy (gene duplication)
13Lineage sorting and coalescence
[Figure labels: species, alleles]
14Molecular phylogenetics approach
- Step 1: PCR with primers that target cytoplasmic DNA or nuclear loci of taxa, followed by DNA sequence analysis
- Step 2: Multiple DNA sequence alignment
- Step 3: Phylogenetic analysis
15Multiple DNA sequence alignment
- Problem: alternative alignments
- possible to align any two sequences by postulating some combination of gaps (insertions/deletions = indels) and substitutions
- => which one to choose?
- Basic task of sequence alignment is to find the alignment with the highest similarity, smallest distance, or lowest overall cost
16Multiple DNA sequence alignment
- 2 sequences + scoring scheme => optimal alignment
- Scoring scheme
- scoring matrix: distance weights or similarity scores for each pair of aligned bases, e.g. transition/transversion matrix

       A  T  G  C
    A  0  5  1  5
    T  5  0  5  1
    G  1  5  0  5
    C  5  1  5  0

- gap weight, cost or penalty
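As a rough illustration of how such a scoring scheme is applied, the sketch below sums the matrix weights over the columns of an already aligned pair. The matrix values are those of the slide; the gap cost of 10 and the second pair of sequences are assumptions made only for this example.

```python
# Transition/transversion weights from the slide:
# 0 = identical bases, 1 = transition, 5 = transversion.
COST = {
    "A": {"A": 0, "T": 5, "G": 1, "C": 5},
    "T": {"A": 5, "T": 0, "G": 5, "C": 1},
    "G": {"A": 1, "T": 5, "G": 0, "C": 5},
    "C": {"A": 5, "T": 1, "G": 5, "C": 0},
}
GAP_COST = 10  # hypothetical value; the slide does not give one


def alignment_cost(row1, row2, cost=COST, gap_cost=GAP_COST):
    """Total cost of two aligned rows of equal length ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += gap_cost      # one gap position
        else:
            total += cost[x][y]    # match, transition or transversion
    return total


print(alignment_cost("AGCGTCT", "AGCGTGT"))    # a single C/G transversion
print(alignment_cost("AGCG-TCT", "AGCATTCT"))  # one transition plus one gap
```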
17Multiple DNA sequence alignment
- Cost of an alignment: $D = s + wg$
- s = number of substitutions, g = total length of gaps
- w = gap penalty = cost of a gap relative to a substitution
- Gap penalty w makes implicit assumptions about how the sequences have evolved
- if indels are thought to be rare, then w should be large (and vice versa)
- => have to use knowledge of biology, e.g. translation (3 bp indel, position), transition <=> transversion, ...
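To make the role of the gap penalty concrete, here is a minimal dynamic-programming sketch that returns the smallest value of D = s + wg over all global alignments of two sequences. It assumes every substitution costs 1 and every gap position costs w; the function name and the test sequences are illustrative only.

```python
def min_alignment_cost(seq1, seq2, w=2.0):
    """Smallest D = s + w*g over all global alignments of seq1 and seq2."""
    n, m = len(seq1), len(seq2)
    # cost[i][j] = cheapest alignment of seq1[:i] with seq2[:j]
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * w          # seq1 prefix aligned to gaps only
    for j in range(1, m + 1):
        cost[0][j] = j * w          # seq2 prefix aligned to gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 1 if seq1[i - 1] != seq2[j - 1] else 0
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # align two bases
                cost[i - 1][j] + w,             # gap in seq2
                cost[i][j - 1] + w,             # gap in seq1
            )
    return cost[n][m]


# A large w favours substitutions, a small w favours gaps.
for w in (0.4, 2.0):
    print(w, min_alignment_cost("ACGT", "AGGT", w))
```

With w = 2 the cheapest alignment of ACGT and AGGT uses one substitution; with w = 0.4 two gap positions become cheaper than that substitution, which is the sense in which w encodes an assumption about how rare indels are.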
18Multiple DNA sequence alignment
- Software programs, e.g. CLUSTALW (global alignment)
- http://www.ebi.ac.uk/clustalw/index.html
- The optimal alignment is not always the true alignment => new developments: phylogenetic analysis without the multiple DNA sequence alignment step
19Molecular phylogenetics approach
- Step 1: PCR with primers that target cytoplasmic DNA or nuclear loci of taxa, followed by DNA sequence analysis
- Step 2: Multiple DNA sequence alignment
- Step 3: Phylogenetic analysis
20Inferring phylogenies from DNA sequences
Sequence alignment (rows = taxa, columns = characters):
  A ..AGCGTCT..
  B ..AGCGTGT..
  C ..AGGAGT..
=> Phylogenetic methods => unrooted tree or rooted tree relating A, B and C
21Phylogenetic methods
- Methods based on an explicit model of evolution
- character-based: Maximum-likelihood methods
- non character-based: Pairwise distance methods
- Methods not based on an explicit model of evolution
- character-based: Maximum parsimony methods
22Pairwise distance methods
Example: 3 taxa, 3 sequences
- Dissimilarity matrix: count the number of differences between all possible pairs of sequences
- Convert dissimilarity to evolutionary distance by correcting for multiple events per site according to a certain model of evolution
- Infer tree topology on the basis of the evolutionary distances by using a clustering algorithm or optimality criterion

Dissimilarity matrix:
       1     2     3
  1
  2  0.26
  3  0.20  0.33

Evolutionary distance matrix (after correction):
       1     2     3
  1
  2  0.32
  3  0.23  0.44

=> tree
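The first step, building the dissimilarity matrix, can be sketched as follows. This is only an illustration: the three sequences are invented, gap positions are simply skipped, and the function names are not from the lecture.

```python
from itertools import combinations


def p_distance(seq1, seq2):
    """Proportion of aligned sites (gap positions skipped) that differ."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def dissimilarity_matrix(alignment):
    """Pairwise p-distances for every pair of taxa in the alignment."""
    return {
        (x, y): p_distance(alignment[x], alignment[y])
        for x, y in combinations(alignment, 2)
    }


# Hypothetical aligned sequences, invented for illustration only.
alignment = {
    "1": "AGCGTCTAAG",
    "2": "AGCGTGTATG",
    "3": "AGGGTCTAAC",
}
for pair, value in dissimilarity_matrix(alignment).items():
    print(pair, value)
```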
23Models of sequence evolution
- Expected difference (linear) ≠ observed difference (not linear) => correction
- Apply a substitution model that tries to estimate the correct number of substitutions
24Models of sequence evolution
- Distance correction methods: convert observed distances into measures that correspond to ACTUAL distances
- Several methods have been proposed, all with different assumptions about the nature of the evolutionary process
- Essentially they differ by the number of parameters they include
- We can use a general framework to show how these models are inter-related
25Substitution models general framework
26Substitution models general framework
27e.g. Model of Jukes Cantor (JC)
- One of the first proposed and perhaps the simplest model of evolution
- Assumes that all four bases have equal frequency and that all substitutions are equally likely
- Under this model, the distance between any two sequences is given by $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where p is the proportion of nucleotides that are different in the two sequences
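A short sketch of the Jukes-Cantor correction, written directly from the formula above. The function name and the sample values of p are illustrative; the guard reflects the fact that the logarithm is undefined once p reaches 3/4.

```python
import math


def jc_distance(p):
    """Jukes-Cantor distance d = -(3/4) ln(1 - (4/3) p)."""
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# The correction grows faster than p itself as sequences diverge.
for p in (0.05, 0.20, 0.26, 0.50):
    print(f"p = {p:.2f} -> d = {jc_distance(p):.3f}")
```

For small p the corrected distance stays close to p, while for larger p the gap widens, which is the non-linear relation between observed and expected differences.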
28e.g. Kimura 2 parameter model (K2P)
- incorporates the observation that transitions accumulate more rapidly than transversions
- assumes all four bases have equal frequencies but that there are 2 rate classes for substitutions
- Under this model, the distance between any two sequences is given by $d = \frac{1}{2}\ln\frac{1}{1-2P-Q} + \frac{1}{4}\ln\frac{1}{1-2Q}$, where P and Q are the proportional differences between the two sequences due to transitions and transversions, respectively
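The K2P formula can be sketched in the same way. The helper that counts transitions and transversions, the function names and the two test sequences are assumptions for illustration; sites containing anything other than A, C, G or T are skipped.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def transition_transversion_proportions(seq1, seq2):
    """Return (P, Q): proportions of sites differing by a transition / transversion."""
    sites = [(a, b) for a, b in zip(seq1, seq2) if a in "ACGT" and b in "ACGT"]
    n = len(sites)
    P = sum((a, b) in TRANSITIONS for a, b in sites) / n
    Q = sum(a != b and (a, b) not in TRANSITIONS for a, b in sites) / n
    return P, Q


def k2p_distance(P, Q):
    """Kimura 2-parameter distance from the proportions P and Q."""
    return 0.5 * math.log(1 / (1 - 2 * P - Q)) + 0.25 * math.log(1 / (1 - 2 * Q))


# Hypothetical pair: 20 sites, 2 transitions (A/G) and 1 transversion (A/C).
seq1 = "A" * 20
seq2 = "GGC" + "A" * 17
P, Q = transition_transversion_proportions(seq1, seq2)
print(P, Q, round(k2p_distance(P, Q), 4))
```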
29Substitution models
- Other models: adding more parameters
- Felsenstein model (F81)
- variation in base composition => base frequencies f = (πA, πC, πG, πT) may vary
- Hasegawa-Kishino-Yano (HKY) model
- unequal base frequencies, transition/transversion rate difference
- General reversible model (REV): unequal base frequencies, all six pairs of substitutions have different rates
- => ideally, we want the simplest model we can get away with that still yields a reasonable estimate
30Substitution models
- Assumptions of these models
- all nucleotide sites change independently
- base composition is at equilibrium
- substitution rate is constant over time and in different lineages
- each site in a sequence is equally likely to undergo substitution => a gamma distribution has a parameter that specifies the range of rate variation among sites (model + Γ)
31Pairwise distance methods
Example: 3 taxa, 3 sequences
- Dissimilarity matrix: count the number of differences between all possible pairs of sequences
- Convert dissimilarity to evolutionary distance by correcting for multiple events per site according to a certain model of evolution
- Infer tree topology on the basis of the evolutionary distances by using a clustering algorithm

Dissimilarity matrix:
       1     2     3
  1
  2  0.26
  3  0.20  0.33

Evolutionary distance matrix (after correction):
       1     2     3
  1
  2  0.32
  3  0.23  0.44

=> tree
32Clustering methods
- Clustering methods follow a set of steps (an algorithm) and arrive at a tree
- UPGMA (Unweighted Pair Group Method using Arithmetic Averages) results in a rooted and additive tree with molecular clock
- Neighbor-joining results in an unrooted and additive tree
- Other approaches: least-squares, Fitch, Kitsch, ...
33UPGMA clustering
Distance matrix:
       A   B   C
  B    2             <= least differences
  C    4   4
  D    6   6   6

Join A and B, each with branch length 1: (A:1,B:1)

Compute new distances between (AB) and other OTUs:
  d(AB)C = (dAC + dBC) / 2 = 4
  d(AB)D = (dAD + dBD) / 2 = 6
34UPGMA clustering
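Reduced distance matrix:
       AB  C
  C    4
  D    6   6

Join (AB) and C: ((A:1,B:1):1,C:2)

Compute new distances between (ABC) and other OTUs:
  d(ABC)D = (d(AB)D + dCD) / 2 = 6
  (strictly, UPGMA weights each cluster by its number of OTUs: (2 d(AB)D + dCD) / 3, which is also 6 here)

Final tree: (((A:1,B:1):1,C:2):1,D:3)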
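The clustering steps above can be sketched as a small program. This is an illustrative implementation, not the lecture's own code: it uses the size-weighted average that defines UPGMA (identical to the simple average whenever the two merged clusters are the same size), names a merged cluster by concatenating the names of its parts, and writes the result in parenthesis notation with branch lengths.

```python
from itertools import combinations


def upgma(distances):
    """distances: {(x, y): d} for every pair of OTUs. Returns a tree string."""
    d = {frozenset(pair): value for pair, value in distances.items()}
    otus = sorted({name for pair in distances for name in pair})
    # cluster name -> (subtree text, number of OTUs, height of its node)
    clusters = {name: (name, 1, 0.0) for name in otus}
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        text_a, size_a, height_a = clusters.pop(a)
        text_b, size_b, height_b = clusters.pop(b)
        height = d[frozenset((a, b))] / 2   # new node lies halfway between them
        merged = a + b                      # illustrative naming scheme
        for other in clusters:
            # Average distance to the new cluster, weighted by cluster size.
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        text = f"({text_a}:{height - height_a:g},{text_b}:{height - height_b:g})"
        clusters[merged] = (text, size_a + size_b, height)
    text, _, _ = next(iter(clusters.values()))
    return text + ";"


# Distance matrix of the worked example.
example = {
    ("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
    ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6,
}
print(upgma(example))
```

The order in which sister groups are written may differ from the slide, but the grouping and branch lengths describe the same tree.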
35Clustering methods
- UPGMA: additive and ultrametric distances => assumes a molecular clock => very sensitive to unequal rates of evolution! => relative-rate test
- Use other clustering methods for phylogeny, e.g. Neighbor-joining
- Goodness-of-fit statistics to select the metric tree that best accounts for the observed distances
36Pairwise distance methods
Example: 3 taxa, 3 sequences
- Dissimilarity matrix: count the number of differences between all possible pairs of sequences
- Convert dissimilarity to evolutionary distance by correcting for multiple events per site according to a certain model of evolution
- Infer tree topology on the basis of the evolutionary distances by using an optimality criterion

Dissimilarity matrix:
       1     2     3
  1
  2  0.26
  3  0.20  0.33

Evolutionary distance matrix (after correction):
       1     2     3
  1
  2  0.32
  3  0.23  0.44

=> tree
37Minimum evolution
- Distance matrix => unrooted metric trees
- Each tree has a length L, which is the sum of all the branch lengths
- Optimality criterion: the minimum evolution tree (ME) is the tree which minimizes L
38Pairwise distance method
- Advantages
- very fast
- based on a model of evolution
- Disadvantages
- sequence information is reduced to one number
- branch lengths may not be biologically interpretable
- most methods provide only one tree topology
- dependent on the model of evolution used
39Phylogenetic methods
- Methods based on an explicit model of evolution
- character-based: Maximum-likelihood methods
- non character-based: Pairwise distance methods
- Methods not based on an explicit model of evolution
- character-based: Maximum parsimony methods
40Character-based methods
- Character-based (discrete) methods operate directly on sequences, rather than on pairwise distances
- Two major discrete methods
- Maximum parsimony (MP): chooses the tree(s) that require the fewest evolutionary changes
- Maximum likelihood (ML): chooses the tree(s) most likely to have produced the observed data
41Maximum parsimony
- Maximum parsimony infers a phylogenetic tree by minimizing the total number of evolutionary steps
- Principle
- Investigate all possible tree topologies
- Reconstruct ancestral sequences
- Choose topology with smallest number of steps
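The principle can be sketched for four taxa with Fitch's counting procedure, one common way to obtain the minimum number of steps under equal weights. The alignment below is invented, the trees are written as rooted nested tuples (under equal weights the step count does not depend on where the root is placed), and all names are illustrative.

```python
def fitch_steps(tree, site):
    """Minimum number of changes at one site on a binary tree (Fitch counting).

    tree: nested 2-tuples of taxon names; site: {taxon: base}.
    """
    steps = 0

    def state_set(node):
        nonlocal steps
        if isinstance(node, str):
            return {site[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:
            return common       # no change needed at this node
        steps += 1              # the two subtrees disagree: one change
        return left | right

    state_set(tree)
    return steps


def tree_length(tree, alignment):
    """Total number of steps over all alignment columns."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_steps(tree, {taxon: seq[i] for taxon, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical alignment for taxa 1-4, invented for illustration.
alignment = {"1": "AAG", "2": "AAC", "3": "GTG", "4": "GTC"}
topologies = {
    "((1,2),(3,4))": (("1", "2"), ("3", "4")),
    "((1,3),(2,4))": (("1", "3"), ("2", "4")),
    "((1,4),(2,3))": (("1", "4"), ("2", "3")),
}
lengths = {name: tree_length(t, alignment) for name, t in topologies.items()}
print(lengths)
print("most parsimonious:", min(lengths, key=lengths.get))
```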
42Maximum parsimony - principle
[Figure: the possible tree topologies A, B and C for four taxa 1, 2, 3, 4]
43Maximum parsimony - principle
44Maximum parsimony - principle
45Maximum parsimony - principle
46Maximum parsimony - generalized
- In the previous example, the cost of each substitution was one step => equal weight
- Instead, we can use different costs for different types of change (e.g. transitions vs transversions) to better match our assumptions about evolutionary processes => weighted parsimony (variants according to Dollo, Wagner, Fitch, ...)
47Maximum parsimony - characters
48Maximum parsimony search methods
- Number of unrooted tree topologies: $N_u = \frac{(2n-5)!}{2^{n-3}(n-3)!}$, i.e. 3 sequences: 1 tree, 4 sequences: 3 trees, 5 sequences: 15, 6 sequences: 105, ... => the more sequences (= taxa), the more trees => computationally expensive
- Finding optimal trees
- Exhaustive search, limited number of taxa (< 10): find the minimum tree of all possible trees
- Branch and bound, small number of taxa (< 18): find the minimum tree without evaluating all trees, by discarding families of trees during tree construction that cannot be shorter than the shortest tree found so far
- Heuristic search: large number of taxa
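The growth in the number of topologies is easy to check numerically. A minimal sketch of the formula for unrooted bifurcating trees, with an illustrative function name:

```python
from math import factorial


def unrooted_tree_count(n):
    """Number of unrooted bifurcating topologies for n taxa (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (3, 4, 5, 6, 10, 18):
    print(n, unrooted_tree_count(n))
```

The counts for 10 and 18 taxa give a feeling for why exhaustive and branch-and-bound searches are restricted to small data sets.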
49Maximum parsimony search methods
- Heuristic search: explore a subset of all possible trees, by using stepwise addition of taxa plus a rearrangement process (branch swapping), but not guaranteed to find the minimal tree
- the search may stop at a local optimum instead of the global optimum
50Maximum parsimony - output
- Consensus tree: MP can yield multiple equally most parsimonious (optimal) trees => relationships common to all the optimal trees are summarized with a consensus tree
- Strict consensus: includes splits found in all trees
- Majority-rule consensus: includes splits found in the majority of the trees (> 50%)
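The two consensus rules can be sketched by counting how often each group occurs. As a simplifying assumption, the trees here are rooted and written as nested tuples, so groups are clades rather than the splits of unrooted trees; the three input trees are invented for illustration.

```python
from collections import Counter


def clades(tree):
    """Set of clades (frozensets of taxon names) in a rooted nested-tuple tree."""
    found = set()

    def collect(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(collect(child) for child in node))
        found.add(members)
        return members

    collect(tree)
    return found


def consensus(trees, strict=True):
    """Clades present in all trees (strict) or in more than half of them."""
    counts = Counter(clade for tree in trees for clade in clades(tree))
    if strict:
        return {clade for clade, k in counts.items() if k == len(trees)}
    return {clade for clade, k in counts.items() if k / len(trees) > 0.5}


# Three hypothetical, equally parsimonious trees.
trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "C"), "B"), ("D", "E")),
    ((("A", "B"), "C"), "D", "E"),
]
print(sorted("".join(sorted(c)) for c in consensus(trees, strict=True)))
print(sorted("".join(sorted(c)) for c in consensus(trees, strict=False)))
```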
51Maximum parsimony - output
- Consistency index (CI), Retention index (RI)
- measures of the parsimony fit of a character to a tree, or of the average fit of all characters to a tree
- more specifically: index of how much homoplasy the constructed tree has
- Value from 0 to 1
- higher value => less homoplasy
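The slide does not give the formulas. The sketch below uses the commonly cited definitions, CI = m/s and RI = (g - s)/(g - m), where m is the minimum conceivable number of steps, s the number of steps required on the tree and g the maximum conceivable number of steps; the step counts are invented for illustration.

```python
def consistency_index(min_steps, observed_steps):
    """CI = m / s."""
    return min_steps / observed_steps


def retention_index(min_steps, observed_steps, max_steps):
    """RI = (g - s) / (g - m)."""
    return (max_steps - observed_steps) / (max_steps - min_steps)


# Hypothetical step counts, chosen only for illustration.
m, s, g = 3, 4, 6
print(f"CI = {consistency_index(m, s):.2f}")
print(f"RI = {retention_index(m, s, g):.2f}")
```

Both indices equal 1 when the tree needs no more steps than the minimum (s = m), i.e. when there is no homoplasy.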
53Parsimony branch support and tree stability
- Bootstrap analysis
- is a resampling technique used to measure sampling error
- gives an idea about the reliability of branches and clusters
- original dataset => resample => construct trees => compare trees to original trees
- > 70%: quite confident of tree topology
- Decay index (Bremer support)
- gives us a sense of how many steps would be required before a grouping collapses
- higher value => better branch support
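The bootstrap procedure can be sketched as resampling alignment columns with replacement and counting how often each group of the original tree reappears. Everything here is illustrative: `build_groups` stands for whatever tree-building method is used, and `closest_pair` is a deliberately crude stand-in that only reports the two most similar taxa.

```python
import random
from itertools import combinations


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement; same length as the original."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {taxon: "".join(seq[i] for i in picks) for taxon, seq in alignment.items()}


def closest_pair(alignment):
    """Toy 'tree builder': the single group formed by the two most similar taxa."""
    def differences(pair):
        first, second = pair
        return sum(a != b for a, b in zip(alignment[first], alignment[second]))
    return {frozenset(min(combinations(sorted(alignment), 2), key=differences))}


def bootstrap_support(alignment, build_groups, replicates=200, seed=1):
    """Percentage of replicates in which each original group is recovered."""
    rng = random.Random(seed)
    original = build_groups(alignment)
    hits = dict.fromkeys(original, 0)
    for _ in range(replicates):
        groups = build_groups(resample_columns(alignment, rng))
        for group in original:
            hits[group] += group in groups
    return {group: 100 * count / replicates for group, count in hits.items()}


# Hypothetical alignment, invented for illustration.
alignment = {"A": "AGCGTCTAAG", "B": "AGCGTGTAAG", "C": "AGGATCTTAC"}
for group, support in bootstrap_support(alignment, closest_pair).items():
    print(sorted(group), f"{support:.0f}%")
```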
54Maximum parsimony
- Advantages
- based on shared derived characters
- evaluates different tree topologies
- does not reduce the information
- Disadvantages
- computationally intensive for large datasets
- no correction for multiple mutations
- sensitive to unequal rates of evolution (long
branch attraction)
55Phylogenetic methods
- Methods based on an explicit model of evolution
- character-based: Maximum-likelihood methods
- non character-based: Pairwise distance methods
- Methods not based on an explicit model of evolution
- character-based: Maximum parsimony methods
56Maximum likelihood
- Statistical method
- If given some data D and a hypothesis H, the likelihood of that data is given by $L_D = \Pr(D \mid H)$
- which is the probability of D given H
57Maximum likelihood
- In the context of molecular phylogenetics
- D is the set of sequences being compared
- H is a phylogenetic tree
- We want to find the likelihood of obtaining the observed data given the tree
- The tree that makes the data the most probable evolutionary outcome is the Maximum Likelihood estimate of the phylogeny
58Maximum likelihood
- In other words: which tree is most likely to have yielded these sequences (observed data) under a given model of evolution (JC, K2P, ...)?
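A full tree search is beyond a short example, but the idea can be sketched for the simplest case of two sequences, where the hypothesis H reduces to a single distance d between them. Assuming the Jukes-Cantor model, the code evaluates the log-likelihood over a grid of candidate distances and keeps the best one; the grid search, the function names and the site counts are illustrative choices.

```python
import math


def jc_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences at distance d under Jukes-Cantor."""
    decay = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * decay   # probability that a site keeps the same base
    p_diff = 0.25 - 0.25 * decay   # probability of one specific different base
    n_same = n_sites - n_diff
    # Each site also carries the equilibrium base frequency 1/4.
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


def ml_distance(n_sites, n_diff, step=0.001, max_d=3.0):
    """Distance on a regular grid that maximizes the likelihood."""
    grid = [step * k for k in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc_log_likelihood(d, n_sites, n_diff))


# Hypothetical data: 100 aligned sites, 26 of which differ.
best = ml_distance(100, 26)
print(f"ML estimate of d: {best:.3f}")
print(f"JC formula:       {-0.75 * math.log(1 - 4 * 0.26 / 3):.3f}")
```

In this two-sequence case the likelihood maximum coincides, up to the grid resolution, with the Jukes-Cantor distance formula; with more taxa the same likelihood calculation is carried out over tree topologies and branch lengths.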
59Maximum likelihood
- Advantages
- Statistically well founded
- Based on a model of evolution
- Evaluates different topologies
- Uses all sequence information
- Often yields estimates that have lower variance than other methods
- Disadvantages
- Very slow (computationally intensive)
- Dependent on the model of evolution used
60Software programs for phylogenetic analysis
- Overview: http://evolution.genetics.washington.edu/phylip/software.html
- Most widely used software programs
- PHYLIP: freely available (downloadable or online: http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html)
- PAUP: user friendly but not freely available
61Phylogenetic information on the internet
- http://tolweb.org/tree/phylogeny.html
- http://www.treebase.org/treebase/
- ....
62If you need more information
- Jacqueline Vander Stappen
- K.U.Leuven
- Laboratory of Gene Technology
- Kasteelpark Arenberg 21
- B-3001 Leuven
- Jacqueline.vanderstappen_at_agr.kuleuven.ac.be