Title: Phylogenetic trees
1 Phylogenetic trees
HIV phylogeny: Simmonds et al. (left), Yamamoto et al. (right)
2 Star vs hierarchical phylogenies
Figure: a hierarchical phylogeny and a star phylogeny
3 Rooted and unrooted trees
Degree of a node: the number of neighbors of
that node. In a binary tree, all nodes have
degree 1 or 3, except the root, which has degree
2 (i.e. each node has 0 or 2 children). A binary
tree with N leaf nodes has N-1 internal
nodes (cf. table-tennis tournament...)
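The N-1 count can be checked on a small example. This is a minimal sketch, assuming a rooted binary tree written as nested Python 2-tuples with strings as leaves; the representation and the helper `count_nodes` are illustrative and not from the slides.

```python
def count_nodes(tree):
    """Return (leaves, internal) for a rooted binary tree.

    A leaf is a string; an internal node is a 2-tuple of subtrees.
    """
    if isinstance(tree, str):
        return 1, 0
    left, right = tree
    l_leaves, l_internal = count_nodes(left)
    r_leaves, r_internal = count_nodes(right)
    # This node is itself one more internal node.
    return l_leaves + r_leaves, l_internal + r_internal + 1


tree = (("A", "B"), ("C", ("D", "E")))
leaves, internal = count_nodes(tree)
print(leaves, internal)  # 5 4
```

Each internal node plays the role of one match in a knockout tournament: every match eliminates one player, so N players need N-1 matches.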
4 Rooted = directed
Figure labels: root node; internal node; clade (subtree); leaf node or taxon (pl. taxa)
Synonyms: phylogeny, tree, dendrogram, cladogram
5 Rooting via outgroups
6 Root can be ambiguous
No outgroup... Earliest studies placed LUCA on
the eukarya-bacteria branch; later studies suggested
bacteria-archaea
7 Ultrametric trees and molecular clocks
Branch lengths are typically in units of average
number of substitutions per site. Thus, branch
lengths of >1 have large estimation errors
Figure: an ultrametric tree and a non-ultrametric tree drawn against a distance axis, with the height of node X marked
Q. Why are non-ultrametric trees necessary?
A. Mutation rate ∝ 1/(generation time). It is also
correlated with other physiological variables (e.g.
metabolic rate). Longitudinal data (e.g. serial
viral sequencing from the same host) can also
generate non-ultrametric trees, since leaf nodes
are not contemporaneous
Wen-Hsiung Li, 1985 (2003 Balzan Prize)
8 Newick format (a.k.a. New Hampshire format)
- Rooted tree topologies: (A,B,(C,D))
- Branch lengths: (A:.1,B:.2,(C:.3,D:.4):.5)
- Internal node names: (A:.1,B:.2,(C:.3,D:.4)E:.5)F
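The three forms above differ only in which optional parts each node carries. This is a minimal sketch of a recursive writer, assuming each node is a hypothetical `(name, branch_length, children)` tuple in which the name may be empty and the length may be `None`.

```python
def to_newick(node):
    """Serialize a (name, branch_length, children) tuple to Newick text."""
    name, length, children = node
    text = ""
    if children:
        # Children go inside parentheses, separated by commas.
        text = "(" + ",".join(to_newick(child) for child in children) + ")"
    text += name
    if length is not None:
        text += ":" + format(length, "g")
    return text


tree = ("F", None, [
    ("A", 0.1, []),
    ("B", 0.2, []),
    ("E", 0.5, [("C", 0.3, []), ("D", 0.4, [])]),
])
print(to_newick(tree) + ";")  # (A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F;
```

The trailing semicolon is added by the caller because it terminates the whole tree rather than any single node.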
9 Algorithms for phylogenetic reconstruction
- Start with a multiple alignment
  - use substitutions to evaluate trees
  - indels informative, but harder to model
- Parsimony
  - find the tree with the fewest substitutions
- Likelihood
  - find the tree with the most likely substitutions (transition/transversion bias, long branches, ...)
  - sum probabilities over unseen ancestral states
  - enumerating all possible tree topologies is sloooooow
- Distance matrix
  - Start by computing all pairwise distances
  - Quick approximation to likelihood methods
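To give a feel for why enumerating topologies is slow: the number of distinct rooted binary trees on n labeled leaves is (2n-3)!! = 3 x 5 x ... x (2n-3). That is a standard counting result which the slide does not state. The helper name below is illustrative.

```python
def num_rooted_binary_trees(n_leaves):
    """Number of rooted binary topologies on n labeled leaves: (2n-3)!!."""
    if n_leaves < 1:
        raise ValueError("need at least one leaf")
    count = 1
    # Multiply the odd factors 3, 5, ..., 2n-3.
    for factor in range(3, 2 * n_leaves - 2, 2):
        count *= factor
    return count


for n in (3, 4, 10):
    print(n, num_rooted_binary_trees(n))
# 3 3
# 4 15
# 10 34459425
```

Ten taxa already give tens of millions of topologies, which is why exhaustive search gives way to heuristics or to distance-based methods.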
10 UPGMA algorithm
- Creates ultrametric trees
- Basic idea
  - Two closest nodes must be siblings
  - Parent is equidistant between siblings
  - Distance from parent to any other node is the average of the distances of the siblings to that node
11 UPGMA algorithm
- Input: a distance matrix, D_{ij} (N^2 entries)
- Let N be the set of nodes to be joined
- Let the height of node i be H_i
- Initialize H_i = 0 for all the leaf nodes in N
- While N contains >1 node (N-1 steps)
  - Find i and j, the two closest nodes in N (N^2 steps)
    - (i,j) = argmin_{i,j} D_{ij}
  - Create a new node, k, the parent of (i,j)
    - Set H_k = 0.5 (H_i + H_j + D_{ij})
    - Branch length k→i is (H_k - H_i), and similarly for k→j
  - For all nodes n in N (excluding i and j)
    - Set D_{kn} = 0.5 (D_{in} + D_{jn})
  - Add k to N; remove i and j
- O(N^3) time. If we maintain argmin_j D_{ij} for each i, then it is O(N^2) time
- O(N^2) memory
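The loop above can be sketched in a few lines. This is a hedged illustration rather than the course's implementation, and it makes three assumptions:

- Distances are stored in a dict keyed by `frozenset` pairs.
- Internal nodes get hypothetical names `node0`, `node1`, ...
- The parent height is set to D_ij / 2, the usual textbook placement when D holds averaged leaf-to-leaf distances. This matches the slide's height rule whenever i and j are both leaves.

The distance update is the simple two-way average given on the slide. Textbook UPGMA weights that average by cluster size.

```python
def upgma(names, dist):
    """Cluster leaves into an ultrametric tree.

    names: list of leaf names (assumed not to clash with 'node0', 'node1', ...).
    dist:  dict mapping frozenset({a, b}) -> distance for every pair of leaves.
    Returns (root, height, children); children maps each internal node to
    ((child_i, branch_i), (child_j, branch_j)).
    """
    d = dict(dist)                      # working copy, extended as nodes are added
    active = list(names)                # the set N of nodes still to be joined
    height = {name: 0.0 for name in names}
    children = {}
    next_id = 0
    while len(active) > 1:
        # Find the two closest active nodes with an O(N^2) scan.
        best = None
        for a_idx in range(len(active)):
            for b_idx in range(a_idx + 1, len(active)):
                a, b = active[a_idx], active[b_idx]
                if best is None or d[frozenset((a, b))] < best[0]:
                    best = (d[frozenset((a, b))], a, b)
        d_ij, i, j = best
        k = f"node{next_id}"            # hypothetical naming scheme
        next_id += 1
        height[k] = 0.5 * d_ij          # assumption: textbook height D_ij / 2
        children[k] = ((i, height[k] - height[i]), (j, height[k] - height[j]))
        active = [n for n in active if n not in (i, j)]
        for n in active:
            # Simple average of the siblings' distances, as on the slide.
            d[frozenset((k, n))] = 0.5 * (d[frozenset((i, n))] + d[frozenset((j, n))])
        active.append(k)
    return active[0], height, children


names = ["A", "B", "C", "D"]
pairs = {("A", "B"): 2.0, ("C", "D"): 4.0, ("A", "C"): 8.0,
         ("A", "D"): 8.0, ("B", "C"): 8.0, ("B", "D"): 8.0}
dist = {frozenset(p): v for p, v in pairs.items()}
root, height, children = upgma(names, dist)
print(root, height[root])   # node2 4.0
print(children[root])       # (('node0', 3.0), ('node1', 2.0))
```

On this ultrametric input every leaf ends up at distance 4 from the root, and every leaf-to-leaf path length reproduces the input distance.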
12 UPGMA in Perl
- Questions
  - How to represent a tree?
    - For each node, need children/parents/both, name, branch length to parent...
  - How to print a tree in Newick format?
    - Recursive (print a particular node)
    - Pre-order traversal (parents before children)
  - How to represent a distance matrix?
- Can side-step some of these...
13 Identify nodes by name, not by number
- Entry D_{ij} of the distance matrix is $distance{$iname}->{$jname}, where $iname is the name of node i
14 Accessing the distance matrix
- Set of all nodes, N: keys(%distance)
- Removing a node from the set: delete $distance{$iname}
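The same name-keyed layout carries over to other languages. This is a minimal Python sketch, using a dict of dicts as an assumed stand-in for the Perl hash of hashes. The helpers `nodes` and `remove_node` are hypothetical. Unlike the one-line delete above, `remove_node` also clears the node's column from the remaining rows.

```python
distance = {
    "A": {"B": 2.0, "C": 8.0},
    "B": {"A": 2.0, "C": 8.0},
    "C": {"A": 8.0, "B": 8.0},
}


def nodes(distance):
    """The set of nodes is just the outer keys."""
    return sorted(distance)


def remove_node(distance, name):
    """Drop a node's row, and its column in every remaining row."""
    del distance[name]
    for row in distance.values():
        row.pop(name, None)


print(distance["A"]["C"])   # 8.0
remove_node(distance, "B")
print(nodes(distance))      # ['A', 'C']
```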
15 Construct the Newick representation on-the-fly
- Siblings (i, j): ($iname, $jname)
- Branch lengths: branch k→i has length $ki; branch k→j has length $kj
- Name of new node (k): "($iname:$ki,$jname:$kj)"
- Then, the Newick-format tree is just the name of the root node (plus a semicolon)
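The naming trick can be tried in isolation. This is a minimal sketch in which `parent_name` is a hypothetical helper and the branch lengths are made-up numbers: each merge produces a new name that is already a valid Newick subtree.

```python
def parent_name(iname, ki, jname, kj):
    """Name a new parent after its two children and their branch lengths."""
    return f"({iname}:{ki:g},{jname}:{kj:g})"


ab = parent_name("A", 1.0, "B", 1.0)
cd = parent_name("C", 2.0, "D", 2.0)
root = parent_name(ab, 3.0, cd, 2.0)
print(root + ";")  # ((A:1,B:1):3,(C:2,D:2):2);
```

Because the name doubles as the subtree, no separate tree data structure or traversal is needed to print the result.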
16 Other phylogeny algorithms
- Neighbor-joining (e.g. neighbor program)
  - Parents not equidistant from siblings
- Weighted neighbor-joining (e.g. weighbor program)
  - Corrects for long-branch estimation error
- Quartet-puzzling (e.g. tree-puzzle program)
  - Looks at sets of 4 nodes, instead of pairs
- MCMC sampling (e.g. MrBayes program)
  - Stochastically explores tree space
  - Slow, but provides much more information (confidence limits, etc.)
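To illustrate how neighbor-joining lets a parent sit at unequal distances from its two children, here is a minimal sketch of a single joining step. The selection criterion and branch-length formula are the standard Saitou-Nei ones, which the slide does not spell out. The function `nj_pick_pair` and the nested-dict input are assumptions for illustration, and this is not the code of the neighbor program.

```python
def nj_pick_pair(names, d):
    """Pick the pair to join and the two branch lengths to their new parent.

    names: list of at least 3 node names; d: nested dict with d[a][b] = distance.
    """
    n = len(names)
    # Row sums r_a = sum over b of d(a, b).
    r = {a: sum(d[a][b] for b in names if b != a) for a in names}
    best = None
    for x in range(n):
        for y in range(x + 1, n):
            a, b = names[x], names[y]
            q = (n - 2) * d[a][b] - r[a] - r[b]   # Q criterion: smaller is better
            if best is None or q < best[0]:
                best = (q, a, b)
    _, a, b = best
    len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
    len_b = d[a][b] - len_a
    return a, b, len_a, len_b


# Distances from a tree with leaf branches A=1, B=3, C=2, D=2 and an internal branch of 1.
d = {
    "A": {"B": 4, "C": 4, "D": 4},
    "B": {"A": 4, "C": 6, "D": 6},
    "C": {"A": 4, "B": 6, "D": 4},
    "D": {"A": 4, "B": 6, "C": 4},
}
print(nj_pick_pair(["A", "B", "C", "D"], d))  # ('A', 'B', 1.0, 3.0)
```

The recovered branch lengths 1 and 3 are unequal. A method that forces the parent to be equidistant from A and B could not reproduce them.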
17 Long branch attraction
- Arises because sequences on long branches share chance similarities
- Some methods (esp. parsimony) interpret this incorrectly as relatedness
- Solutions
  - add more taxa to break up the branches
  - use more realistic likelihood models
18 Confidence estimates
- Bootstrap
  - Sample a random subset of alignment columns (with replacement) and build a tree from those
  - Repeat a large number of times
- Support for a branch
  - defined as the % of trees that include that branch
  - identify a branch by its partitioning of the taxa
- MCMC is a more statistically rigorous way to get confidence estimates for trees
  - because it samples directly from the posterior distribution of trees
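The two bootstrap ingredients, resampling columns and counting how often a branch recurs, can be sketched as follows. The assumptions are labeled here and in the comments:

- The alignment is a dict of equal-length strings.
- Each replicate draws as many columns as the original alignment, which is the usual convention.
- Each branch is represented by a consistently chosen side of its taxon partition, stored as a `frozenset`.

Tree building itself is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (name -> sequence dict)."""
    length = len(next(iter(alignment.values())))
    # Assumption: draw as many columns as the original alignment has.
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def branch_support(split, replicate_splits):
    """Fraction of replicate trees whose set of splits contains `split`."""
    hits = sum(1 for splits in replicate_splits if split in splits)
    return hits / len(replicate_splits)


alignment = {"A": "ACGT", "B": "ACGA", "C": "TCGA"}
resampled = bootstrap_alignment(alignment, random.Random(0))

# Hypothetical splits found in four replicate trees.
replicates = [
    {frozenset({"A", "B"})},
    {frozenset({"A", "C"})},
    {frozenset({"A", "B"})},
    {frozenset({"A", "B"})},
]
print(branch_support(frozenset({"A", "B"}), replicates))  # 0.75
```

Passing in a seeded `random.Random` keeps a bootstrap run reproducible.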
19 Evolutionary linguistics
20 How to estimate distances?
- T. Jukes and C. Cantor
- Berkeley, 1969
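The Jukes-Cantor model turns the observed fraction p of differing sites into an estimated number of substitutions per site, d = -(3/4) ln(1 - (4/3) p). This is a minimal sketch, assuming two aligned DNA strings and skipping any column that contains a character other than A, C, G or T. The function name is illustrative.

```python
import math


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: ignore columns with gaps or ambiguity codes.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        return math.inf   # saturated: the correction is undefined
    return -0.75 * math.log(1 - 4 * p / 3)


d = jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA")  # one difference in ten sites
print(round(d, 4))
```

The corrected distance is slightly larger than p, because some sites will have changed more than once and so look less diverged than they are.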