Title: Phylogeny
1Phylogeny
2Reconstructing a phylogeny
- The phylogenetic tree (phylogeny) describes the
evolutionary relationships between the studied
data
- The data must consist of homologous types
- In molecular evolution, the studied data are
homologous DNA/AA sequences
- Phylogeny reconstruction explicitly assumes that
the sequences are aligned
INPUT: MSA
3Reminder: MSA and phylogeny are dependent
MSA
Unaligned sequences
Sequence alignment
Phylogeny reconstruction
Inaccurate guide tree
4Phylogeny representation
Textual representation (Newick format):
((A,C),(B,D));
Visual representation:
[Figure: tree with leaves C, A, D, B]
- Each pair of parentheses () encloses a clade in
the tree
- A comma , separates the members of the
corresponding clade
- A semicolon ; is always the last character
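The three rules above are enough to read a simple Newick string. A minimal sketch, assuming plain leaf names and no branch lengths or support values; the function names parse_newick and leaves are illustrative, not part of any library:

```python
def parse_newick(text):
    """Parse a simple Newick string (no branch lengths) into nested tuples."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with a semicolon")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "(": a new clade starts here
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip ",": next member of the same clade
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # skip ")": the clade is closed
            return tuple(children)
        start = pos
        while text[pos] not in ",();":
            pos += 1
        if pos == start:
            raise ValueError(f"expected a leaf name at position {pos}")
        return text[start:pos]

    tree = parse_node()
    if text[pos] != ";":
        raise ValueError("unexpected characters before the final semicolon")
    return tree


def leaves(tree):
    """List the leaf names of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


tree = parse_newick("((A,C),(B,D));")
print(tree)          # nested tuples, one tuple per clade
print(leaves(tree))  # leaf names in the order they were written
```

Each tuple corresponds to one pair of parentheses, so the nesting of the tuples mirrors the nesting of the clades.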
5Some terminology
monophyletic group (clade)
root
External branches
internal branches (splits)
Neighbors
internal nodes
External nodes (leaves)
6Swapping neighbors is meaningless
All four Newick strings below describe the same tree:
(Gorilla,(Human,Chimp));
(Gorilla,(Chimp,Human));
((Human,Chimp),Gorilla);
((Chimp,Human),Gorilla);
[Figure: two drawings of the Chimp, Human, Gorilla tree with neighbors swapped]
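One way to check that swapping neighbors does not change the tree is to compare trees in a form that ignores the order of children. A minimal sketch, assuming trees are written as nested Python tuples with unique leaf names; the helper name canonical is illustrative:

```python
def canonical(tree):
    """Return a form of the tree that ignores the order of children."""
    if isinstance(tree, str):
        return tree  # a leaf is just its name
    # A frozenset has no order, so swapped neighbors give the same value.
    return frozenset(canonical(child) for child in tree)


same_tree = [
    ("Gorilla", ("Human", "Chimp")),
    ("Gorilla", ("Chimp", "Human")),
    (("Human", "Chimp"), "Gorilla"),
    (("Chimp", "Human"), "Gorilla"),
]
# All four spellings collapse to a single canonical form.
print(len({canonical(t) for t in same_tree}))

# A different grouping is a different tree.
other = ("Human", ("Gorilla", "Chimp"))
print(canonical(other) == canonical(same_tree[0]))
```

The comparison treats the written root as fixed, so it tests equality of rooted trees only.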
7Rooted vs. unrooted
[Figure: tree with leaves A, B, C and labels 1, 2, 3; a "?" mark appears twice in the drawing]
8In Newick format
[Figure: tree with leaves A, B, C and labels 1, 2, 3; a "?" mark appears twice in the drawing]
9How can we root a tree?
10Rooting the tree based on a priori knowledge:
using an outgroup
The outgroup should be close enough for detecting
sequence homology, but far enough to be a clear
outgroup
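Outgroup rooting can be sketched as placing the root on the branch that connects the outgroup to the rest of the tree. The example below assumes the unrooted tree is given as an adjacency dictionary; the leaf names A to D, the internal node names x and y, and the function name root_with_outgroup are all illustrative:

```python
def root_with_outgroup(adjacency, outgroup):
    """Root an unrooted tree on the branch that leads to the outgroup leaf."""
    (neighbor,) = adjacency[outgroup]  # a leaf has exactly one neighbor

    def subtree(node, parent):
        # Walk away from the parent; whatever remains hangs below this node.
        children = [n for n in adjacency[node] if n != parent]
        if not children:
            return node  # leaf
        return tuple(subtree(child, node) for child in children)

    # The new root has two children: the outgroup and everything else.
    return (outgroup, subtree(neighbor, outgroup))


# Unrooted four-leaf tree: A and B attach to x, C and D attach to y.
unrooted = {
    "A": ["x"], "B": ["x"], "C": ["y"], "D": ["y"],
    "x": ["A", "B", "y"],
    "y": ["C", "D", "x"],
}
print(root_with_outgroup(unrooted, "D"))
print(root_with_outgroup(unrooted, "A"))
```

Choosing a different outgroup gives a different rooted tree from the same unrooted topology, which is why the choice of outgroup has to come from prior knowledge.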
11The gene tree is not always identical to the
species tree
Gene tree
Species tree
?
12Phylogeny reconstruction approaches
Distance-based methods: Neighbor Joining
     A    B    C    D    E
A    0    2    3    4    4
B         0    3    4    5
C              0    3    4
D                   0    5
E                        0
     A,B  C    D    E
A,B  0    2.5  4.5  3.5
C         0    3    4
D              0    5
E                   0
The Minimum Evolution (ME) criterion: in each
iteration we separate the two sequences that
result in the minimal sum of branch lengths
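The pair chosen in one Neighbor Joining iteration can be sketched with the standard Q criterion, Q(i,j) = (n-2)d(i,j) - sum_k d(i,k) - sum_k d(j,k), where the pair with the smallest Q is joined. This formula is the textbook formulation rather than something stated on the slide, and the function name nj_pair is illustrative. The sketch only selects the pair and does not compute the reduced matrix:

```python
from itertools import combinations


def nj_pair(names, dist):
    """Return the pair of taxa with the smallest Neighbor Joining Q value."""
    n = len(names)
    # Sum of distances from each taxon to all the others.
    total = {
        a: sum(dist[frozenset((a, b))] for b in names if b != a)
        for a in names
    }

    def q(pair):
        a, b = pair
        return (n - 2) * dist[frozenset(pair)] - total[a] - total[b]

    return min(combinations(names, 2), key=q)


names = ["A", "B", "C", "D", "E"]
# Upper triangle of the first distance matrix from the slide.
upper = {
    ("A", "B"): 2, ("A", "C"): 3, ("A", "D"): 4, ("A", "E"): 4,
    ("B", "C"): 3, ("B", "D"): 4, ("B", "E"): 5,
    ("C", "D"): 3, ("C", "E"): 4,
    ("D", "E"): 5,
}
dist = {frozenset(pair): d for pair, d in upper.items()}
print(nj_pair(names, dist))
```

On this matrix the criterion selects A and B, the same pair that is merged in the second matrix.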
13Phylogeny reconstruction approaches
Topology search methods: MP, ML
Maximum Parsimony: finds the most parsimonious
topology
Maximum Likelihood: finds the most likely topology T,
the one maximizing P(Data | T)
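To make "most parsimonious" concrete, the sketch below scores a fixed topology with Fitch's algorithm, which counts the minimum number of character changes a rooted binary tree needs to explain each alignment column. It is one standard way to compute a parsimony score, not necessarily what any particular program does. The four six-column toy sequences are the ones used in the bootstrap example of these slides; the labels Sp1 to Sp4 and the function names are illustrative:

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one column."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree  # the tree is assumed to be binary
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: at least one change is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch cost over all alignment columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


alignment = {
    "Sp1": "ATCTGA",
    "Sp2": "ATCTGC",
    "Sp3": "ACTTAC",
    "Sp4": "ACCTAT",
}
tree_a = (("Sp1", "Sp2"), ("Sp3", "Sp4"))
tree_b = (("Sp1", "Sp3"), ("Sp2", "Sp4"))
print(parsimony_score(tree_a, alignment))  # 5 changes
print(parsimony_score(tree_b, alignment))  # 7 changes
```

A Maximum Parsimony search would compare such scores across candidate topologies and keep the lowest; here tree_a needs fewer changes than tree_b.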
14Phylogeny reconstruction approaches: summary
- Distance based methods
  - Neighbor Joining (e.g., using ClustalX)
    - Fast
    - Inaccurate
- Topology search methods
  - Maximum parsimony (e.g., using MEGA)
    - Crude
    - Questionable statistical basis
  - Maximum likelihood (e.g., using RAxML, PhyML)
    - Accurate
    - Slow
- Bayesian methods
  - Monte Carlo Markov Chains (MCMC) (e.g., using
    MrBayes)
    - Most accurate
    - Very slow
15How robust is our tree?
16Bootstrap for estimating robustness
- We need some statistical way to estimate the
confidence in the tree topology
- But we don't know anything about the distribution
of tree topologies
- The only data source we have is our data (MSA)
- So, we must rely on our own resources: "pull up
by your own bootstraps"
17Bootstrap
1. Create n (100-1000) new MSAs (pseudo-MSAs) by
randomly sampling K positions from our original
MSA with replacement
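Original MSA (columns 1 2 3 4 5 ... K):
1 ATCTGA
2 ATCTGC
3 ACTTAC
4 ACCTAT
Pseudo-MSA from sampled columns 1, 1, 2, 4, 4, 3:
1 AATTTC
2 AATTTC
3 AACTTT
4 AACTTC
Pseudo-MSA from sampled columns 9, 7, 4, 7, 8, 10:
1 TTTTAT
2 CATACA
3 CATACT
4 AGTGGA
Pseudo-MSA from sampled columns 5, 1, 5, 7, 8, 12:
1 GAGTAT
2 GAGACG
3 AAAACA
4 AAAGGC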
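Step 1 can be sketched as drawing K column indices with replacement and rebuilding every sequence from those columns. The example uses only the first six columns shown for the original MSA; the function names take_columns and bootstrap_msa are illustrative, and the code uses 0-based column indices where the slide uses 1-based ones:

```python
import random


def take_columns(msa, columns):
    """Build a pseudo-MSA from the given (0-based) column indices."""
    return {name: "".join(seq[c] for c in columns) for name, seq in msa.items()}


def bootstrap_msa(msa, rng):
    """Sample K columns with replacement, where K is the MSA length."""
    k = len(next(iter(msa.values())))
    columns = [rng.randrange(k) for _ in range(k)]
    return columns, take_columns(msa, columns)


msa = {"1": "ATCTGA", "2": "ATCTGC", "3": "ACTTAC", "4": "ACCTAT"}

# Columns 1, 1, 2, 4, 4, 3 from the slide, written 0-based.
print(take_columns(msa, [0, 0, 1, 3, 3, 2]))

# A few random pseudo-MSAs; a real analysis would create 100-1000 of them.
rng = random.Random(0)
for _ in range(3):
    columns, pseudo = bootstrap_msa(msa, rng)
    print([c + 1 for c in columns], pseudo)
```

Whole columns are sampled, so every sequence in a pseudo-MSA is built from the same positions and the alignment between sequences is preserved.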
18Bootstrap
2. Reconstruct a pseudo-tree from each pseudo-MSA
with the same method used for reconstructing the
original tree
[Figure: the three pseudo-MSAs from step 1, each shown next to a pseudo-tree with leaves Sp1, Sp2, Sp3, Sp4]
19Bootstrap
3. For each split in our original tree, we count
the number of times it appeared in the
pseudo-trees
[Figure: pseudo-trees with leaves Sp1, Sp2, Sp3, Sp4, and the original tree with the values 67 and 100 on its branches]
In 67% of the pseudo-trees, the split between
Sp1+Sp2 and the rest of the tree was found
In general, bootstrap (bp) support < 80% is considered low
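Step 3 can be sketched by listing the splits of each tree as sets of leaves and counting how many pseudo-trees contain each split of the original tree. The trees are assumed to be nested tuples with unique leaf names; the four pseudo-trees are made up for illustration, and the function names are hypothetical:

```python
def leaf_set(tree):
    """All leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial splits, each stored as the side without a reference leaf."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaf_set(node)
        if reference in side:
            side = all_leaves - side  # always store the same side of a split
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        for child in node:
            visit(child)

    visit(tree)
    return found


def bootstrap_support(original, pseudo_trees):
    """Percentage of pseudo-trees that contain each split of the original."""
    counts = {split: 0 for split in splits(original)}
    for tree in pseudo_trees:
        present = splits(tree)
        for split in counts:
            if split in present:
                counts[split] += 1
    return {s: 100.0 * c / len(pseudo_trees) for s, c in counts.items()}


original = (("Sp1", "Sp2"), ("Sp3", "Sp4"))
pseudo_trees = [
    (("Sp1", "Sp2"), ("Sp3", "Sp4")),
    (("Sp4", "Sp3"), ("Sp2", "Sp1")),  # same split, neighbors swapped
    ("Sp1", ("Sp2", ("Sp3", "Sp4"))),  # same split, drawn with another root
    (("Sp1", "Sp3"), ("Sp2", "Sp4")),  # different split
]
for split, support in bootstrap_support(original, pseudo_trees).items():
    print(sorted(split), support)
```

Splits are compared as unrooted bipartitions, so a pseudo-tree supports a split regardless of how it was rooted or how its neighbors were ordered.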
20ClustalX NJ phylogeny reconstruction
21ClustalX NJ phylogeny reconstruction
22http://phylobench.vital-it.ch/raxml-bb/
24Viewing the tree with njPlot
25Note: unrooted tree
26Defining an outgroup
27Swapping nodes
28Bootstrap support
29FigTree: tree visualization and figure creation
http://tree.bio.ed.ac.uk/software/figtree/
30Reconstructing the tree of life
31Darwin's vision of the tree of life from the
Origin of Species
32The three-domain tree of life based on SSU rRNA
MSA
33But the branching of several kingdoms remains in
dispute
34Lateral Gene Transfer (LGT) challenges the
conceptual basis of phylogenetic classification
36Methodology
- Started with 36 genes universally present in 191
species (spanning all 3 domains of life), for
which orthologs could be unambiguously identified
- Eliminated 5 genes that are LGT suspects (mostly
tRNA synthetases)
- Constructed an MSA for each of the 31 orthogroups
- Concatenated all 31 MSAs to a super-MSA of 8090
columns
- The phylogeny was reconstructed based on the
super-MSA using the maximum likelihood approach
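The concatenation step can be sketched as joining, for each species, its aligned sequence from every per-gene MSA in a fixed gene order. The sketch assumes every species appears in every MSA, as with universally present genes; the toy sequences, species names and the function name concatenate_msas are made up for illustration:

```python
def concatenate_msas(msas):
    """Join per-gene MSAs into one super-MSA, species by species."""
    species = set(msas[0])
    for msa in msas:
        if set(msa) != species:
            raise ValueError("every MSA must contain the same species")
        if len({len(seq) for seq in msa.values()}) != 1:
            raise ValueError("sequences within an MSA must have equal length")
    # Genes are joined in the same order for every species.
    return {name: "".join(msa[name] for msa in msas) for name in sorted(species)}


gene1 = {"sp_a": "MK-L", "sp_b": "MKAL", "sp_c": "MRAL"}
gene2 = {"sp_a": "GTV", "sp_b": "GSV", "sp_c": "G-V"}
super_msa = concatenate_msas([gene1, gene2])
for name, seq in super_msa.items():
    print(name, seq)
print(len(super_msa["sp_a"]))  # total number of columns in the super-MSA
```

The number of columns in the super-MSA is the sum of the column counts of the individual MSAs.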
37http://itol.embl.de
38Tree support
- 81.7% of the splits show bootstrap support of
over 80%
- 65% of the splits show bootstrap support of 100%
- However, several deep splits show low support
39Still, the debate goes on
40Tree of one percent of life
- Ciccarelli et al. on the one hand favor the claim
that bacteria adhere to a bifurcating tree of
life, provided that the small number of LGT genes
is filtered out
- On the other hand, their filtering process left
only 31 proteins, which represent 1% of an
average prokaryotic proteome and 0.1% of a large
eukaryotic proteome
- If throwing out all non-universally distributed
genes and all LGT suspects leaves a 1% tree, then
we should probably abandon the tree as a working
hypothesis