Provided by: ianho9
1
Phylogenetic trees
  • BioE131/231

HIV phylogeny: Simmonds et al. (left), Yamamoto et al. (right)
2
Star vs hierarchical phylogenies
Hierarchical
Star
3
Rooted and unrooted trees
Degree of a node: the number of neighbors of
that node. In a rooted binary tree, all nodes have
degree 1 or 3, except the root, which has degree
2 (i.e. each node has 0 or 2 children). A rooted
binary tree with N leaf nodes has N-1 internal
nodes (cf. a table-tennis tournament...)
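The N-1 count can be checked by counting edges in two ways. This is a short derivation, assuming a rooted tree in which every internal node (the root included) has exactly two children; the symbol I for the number of internal nodes is introduced here.

```latex
% I = number of internal nodes (root included), N = number of leaf nodes
\begin{align*}
\text{edges} &= 2I          && \text{(each internal node has two child edges)} \\
\text{edges} &= (N + I) - 1 && \text{(every node except the root has one parent edge)} \\
2I = N + I - 1 \;&\Longrightarrow\; I = N - 1
\end{align*}
```

In the tournament picture, each match (internal node) eliminates exactly one player, and N-1 players must be eliminated to leave one winner.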
4
Rooted = directed
Root node
Internal node
Clade, subtree
Leaf node, taxon (pl. taxa)
Synonyms: phylogeny, tree, dendrogram, cladogram
5
Rooting via outgroups
6
Root can be ambiguous
No outgroup... Earliest studies placed LUCA on the
eukarya-bacteria branch; later studies suggested
bacteria-archaea.
7
Ultrametric trees and molecular clocks
Branch lengths are typically in units of average
number of substitutions per site. Thus, branch
lengths of >1 have large estimation errors.
Figure: an ultrametric and a non-ultrametric tree drawn against a distance axis; the height of node X is marked.
Q. Why are non-ultrametric trees necessary?
A. Mutation rate ∝ 1/(generation time). Also
correlated w/ other physiological variables (e.g.
metabolic rate). Longitudinal data (e.g. serial
viral sequencing from the same host) can also
generate non-ultrametric trees, since leaf nodes
are not contemporaneous.
Wen-Hsiung Li, 1985 (2003 Balzan Prize)
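Whether a set of pairwise distances is compatible with an ultrametric tree can be tested with the three-point condition: in every triple of taxa, the two largest distances are equal. The sketch below is in Python (an assumption; the deck does not show code for this), and the function name, tolerance and both example matrices are made up for illustration.

```python
from itertools import combinations

def is_ultrametric(dist, tol=1e-9):
    """Three-point condition: in every triple of taxa, the two largest
    pairwise distances must be equal (within tol)."""
    for a, b, c in combinations(sorted(dist), 3):
        d = sorted([dist[a][b], dist[a][c], dist[b][c]])
        if abs(d[2] - d[1]) > tol:
            return False
    return True

# Hypothetical clock-like distances: A and B are siblings, C is equally far from both.
clock_like = {
    "A": {"B": 2.0, "C": 4.0},
    "B": {"A": 2.0, "C": 4.0},
    "C": {"A": 4.0, "B": 4.0},
}

# Same topology, but the branch leading to B is longer (a faster-evolving lineage).
rate_varied = {
    "A": {"B": 3.0, "C": 4.0},
    "B": {"A": 3.0, "C": 5.0},
    "C": {"A": 4.0, "B": 5.0},
}

print(is_ultrametric(clock_like), is_ultrametric(rate_varied))
```

The second matrix still fits a tree exactly (branch lengths 1, 2 and 3 around one internal node), just not a tree whose leaves are all at the same height.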
8
Newick format, a.k.a. New Hampshire format
  • Rooted tree topologies: (A,B,(C,D));
  • Branch lengths: (A:.1,B:.2,(C:.3,D:.4):.5);
  • Internal node names: (A:.1,B:.2,(C:.3,D:.4)E:.5)F;
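A minimal sketch of reading such strings, written in Python (an assumption, as is the nested-dict layout with keys `name`, `length` and `children`). It covers only the plain subset of Newick used on this slide: unquoted labels, an optional `:length` after each node, and an optional trailing semicolon.

```python
def parse_newick(text):
    """Parse a plain Newick string into nested dicts.
    No quoted labels or comments are supported."""
    text = text.strip()
    if text.endswith(";"):
        text = text[:-1]
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            while True:
                children.append(parse_node())
                if pos >= len(text):
                    raise ValueError("unbalanced parentheses")
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError(f"unexpected {text[pos]!r} at {pos}")
        # Optional label, then optional ":length".
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if pos != len(text):
        raise ValueError(f"trailing text at position {pos}")
    return root

tree = parse_newick("(A:.1,B:.2,(C:.3,D:.4)E:.5)F;")
print(tree["name"], [child["name"] for child in tree["children"]])
```

The colon before each number is what separates a node label from its branch length, so `E:.5` is a node named E on a branch of length 0.5.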

9
Algorithms for phylogenetic reconstruction
  • Start with a multiple alignment
  • use substitutions to evaluate trees
  • indels informative, but harder to model
  • Parsimony
  • find the tree with the fewest substitutions
  • Likelihood
  • find the tree with the most likely
    substitutions (transition/transversion bias, long
    branches, ...)
  • sum probabilities over unseen ancestral states
  • enumerating all possible tree topologies is
    sloooooow
  • Distance matrix
  • Start by computing all pairwise distances
  • Quick approximation to likelihood methods
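To make the parsimony criterion concrete, the sketch below scores a fixed rooted binary topology with Fitch's algorithm, a standard method that the slide does not name. It is written in Python (an assumption); the nested-tuple tree encoding, the function names and the four-taxon alignment are all hypothetical.

```python
def fitch(tree, states):
    """Return (candidate ancestral states, minimum substitutions) for one
    alignment column. `tree` is a leaf name or a 2-tuple of subtrees."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra substitution needed.
        return common, left_cost + right_cost
    # The children disagree: one substitution on one of the two branches.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the per-column Fitch scores over all alignment columns."""
    n_cols = len(next(iter(alignment.values())))
    total = 0
    for col in range(n_cols):
        column = {name: seq[col] for name, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total

# Hypothetical gap-free alignment of four taxa.
alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}

score_ab_cd = parsimony_score((("A", "B"), ("C", "D")), alignment)
score_ac_bd = parsimony_score((("A", "C"), ("B", "D")), alignment)
print(score_ab_cd, score_ac_bd)
```

Parsimony would prefer the first topology here because it explains the alignment with fewer substitutions. Searching over all topologies is the expensive part and is not attempted in this sketch.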

10
UPGMA algorithm
  • Creates ultrametric trees
  • Basic idea
  • Two closest nodes must be siblings
  • Parent is equidistant between siblings
  • Distance from parent to any other node is average
    of distances of siblings to those nodes

11
UPGMA algorithm
  • Input: a distance matrix, D_ij
  • Let N be the set of nodes to be joined
  • Let the height of node i be H_i
  • Initialize H_i = 0 for all the leaf nodes in N
  • While N contains >1 node:
  • Find i and j, the two closest nodes in N
  • (i,j) = argmin_{i,j} D_ij
  • Create a new node, k, the parent of (i,j)
  • Set H_k = 0.5 (H_i + H_j + D_ij)
  • Branch length k→i is (H_k - H_i), and similarly for
    k→j
  • For all nodes n in N (excluding i and j):
  • Set D_kn = 0.5 (D_in + D_jn)
  • Add k to N; remove i and j

N^2 entries in the distance matrix
N-1 steps of the while loop
N^2 steps to find the closest pair
O(N^3) time. If we maintain argmin_j D_ij for each i,
then it is O(N^2) time.
O(N^2) memory
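The loop can be sketched in Python (an assumption; the course itself uses Perl). This is the textbook form of UPGMA rather than a transcription of the slide. The distance from a new node to every other node is averaged with weights equal to the number of leaves under each sibling, which reduces to the plain average when the siblings are the same size. The new node's height is half the distance between its two children. The names `upgma`, `height` and `size`, the use of tuples to identify internal nodes, and the example distances are all illustrative.

```python
def upgma(dist):
    """Textbook UPGMA on a symmetric dict-of-dicts distance matrix.
    Returns (root, height); internal nodes are (child, child) tuples."""
    # Work on a copy without diagonal entries so the input is not modified.
    d = {a: {b: v for b, v in row.items() if b != a} for a, row in dist.items()}
    height = {a: 0.0 for a in d}  # leaves start at height 0
    size = {a: 1 for a in d}      # number of leaves below each node
    while len(d) > 1:
        # Naive O(N^2) scan for the closest pair, so O(N^3) overall.
        i, j = min(((a, b) for a in d for b in d[a]),
                   key=lambda pair: d[pair[0]][pair[1]])
        k = (i, j)
        height[k] = d[i][j] / 2.0
        size[k] = size[i] + size[j]
        others = [n for n in d if n != i and n != j]
        d[k] = {}
        for n in others:
            # Size-weighted average = mean distance over all leaf pairs.
            d[k][n] = d[n][k] = (size[i] * d[i][n] + size[j] * d[j][n]) / size[k]
        # Remove i and j from the working set.
        for gone in (i, j):
            del d[gone]
            for n in others:
                del d[n][gone]
    (root,) = d
    return root, height

# Hypothetical ultrametric distances for four taxa.
dist = {
    "A": {"B": 2.0, "C": 4.0, "D": 6.0},
    "B": {"A": 2.0, "C": 4.0, "D": 6.0},
    "C": {"A": 4.0, "B": 4.0, "D": 6.0},
    "D": {"A": 6.0, "B": 6.0, "C": 6.0},
}
root, height = upgma(dist)
print(root, height[root])
```

The length of the branch from a parent k to a child i is `height[k] - height[i]`, so the returned heights are enough to recover every branch length.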
12
UPGMA in Perl
  • Questions
  • How to represent a tree?
  • For each node, need children/parents/both, name,
    branch length to parent...
  • How to print a tree in Newick format?
  • Recursive (print a particular node)
  • Pre-order traversal (parents before children)
  • How to represent a distance matrix?
  • Can side-step some of these...
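One possible answer to the first two questions, sketched in Python rather than Perl (an assumption). The `Node` class and its field names are hypothetical. Each node stores its name, its branch length to the parent and its children, and a recursive function formats a node together with its subtree.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    name: str = ""
    length: Optional[float] = None  # branch length to the parent, if known
    children: List["Node"] = field(default_factory=list)

def to_newick(node):
    """Format one node and its subtree, without the final semicolon."""
    text = ""
    if node.children:
        text = "(" + ",".join(to_newick(child) for child in node.children) + ")"
    text += node.name
    if node.length is not None:
        text += f":{node.length:g}"
    return text

tree = Node("F", None, [
    Node("A", 0.1),
    Node("B", 0.2),
    Node("E", 0.5, [Node("C", 0.3), Node("D", 0.4)]),
])
newick = to_newick(tree) + ";"
print(newick)
```

The recursion is entered at a parent before it descends into the children, although in the output each parent's label is written after the text of its children.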

13
Identify nodes by name, not by number
  • Entry D_ij of the distance matrix is
    $distance{$iname}->{$jname}, where $iname is the name of node i

14
Accessing the distance matrix
  • Set of all nodes, N: keys (%distance)
  • Removing a node from the set: delete
    $distance{$iname}
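The same name-keyed layout can be sketched with a Python dict of dicts, which plays the role of the Perl hash of hashes (Python is an assumption here, and the example values are made up). The helper `remove_node` is hypothetical. Besides deleting the node's row it also drops the matching column entries, a step the slide does not mention.

```python
# distance[iname][jname] holds D_ij, keyed by node name rather than by number.
distance = {
    "A": {"B": 2.0, "C": 4.0},
    "B": {"A": 2.0, "C": 4.0},
    "C": {"A": 4.0, "B": 4.0},
}

def node_names(distance):
    """The set of nodes is just the set of top-level keys."""
    return set(distance)

def remove_node(distance, name):
    """Delete a node's row, then its entry in every remaining row."""
    del distance[name]
    for row in distance.values():
        row.pop(name, None)

before = node_names(distance)
remove_node(distance, "A")
after = node_names(distance)
print(before, after)
```

Keying by name means no index bookkeeping is needed when nodes are removed or new parent nodes are added.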

15
Construct the Newick representation on-the-fly
  • Siblings (i, j): ($iname, $jname)
  • Branch lengths: branch k→i has length
    $ki; branch k→j has length $kj
  • Name of new node (k): "($iname:$ki,$jname:$kj)"
  • Then, Newick-format tree is just the name of the
    root node (plus a semicolon)
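The naming trick can be sketched in a few lines of Python (an assumption; the slides use Perl string interpolation). The helper `join_name` and the branch lengths are hypothetical. Each new parent is named by the Newick text of its two children, so the name of the last node created is the whole tree.

```python
def join_name(iname, ki, jname, kj):
    """Name a new parent after its children and their branch lengths."""
    return f"({iname}:{ki:g},{jname}:{kj:g})"

# First join: leaves A and B, each on a branch of length 1.
ab = join_name("A", 1.0, "B", 1.0)

# Second join: the (A,B) node on a branch of length 1, leaf C on a branch of length 2.
root = join_name(ab, 1.0, "C", 2.0)

newick = root + ";"
print(newick)
```

No separate tree structure is needed in this scheme, because the node names used as distance-matrix keys already carry the topology and branch lengths.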

16
Other phylogeny algorithms
  • Neighbor-joining (e.g. neighbor program)
  • Parents not equidistant from siblings
  • Weighted neighbor-joining (e.g. weighbor
    program)
  • Corrects for long-branch estimation error
  • Quartet-puzzling (e.g. tree-puzzle program)
  • Looks at sets of 4 nodes, instead of pairs
  • MCMC sampling (e.g. MrBayes program)
  • Stochastically explores tree space
  • Slow, but provides much more information
    (confidence limits, etc.)

17
Long branch attraction
  • Arises because sequences on long branches share
    chance similarities
  • Some methods (esp. parsimony) interpret this
    incorrectly as relatedness
  • Solutions
  • add more taxa to break up the branches
  • use more realistic likelihood models

18
Confidence estimates
  • Bootstrap
  • Sample a random subset of alignment columns (with
    replacement) and build a tree from those
  • Repeat a large number of times
  • Support for a branch
  • defined as the % of trees that include that branch
  • identify a branch by its partitioning of the taxa
  • MCMC is a more statistically rigorous way to get
    confidence estimates for trees
  • because it samples directly from the posterior
    distribution of trees
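A minimal Python sketch of the two bootstrap ingredients: resampling columns and counting how often a branch recurs (Python and all names here are assumptions). Each replicate draws as many columns as the original alignment has, which is the usual convention and is assumed rather than stated on the slide. Tree building is left out, so the replicate trees are hand-written stand-ins. Trees are nested tuples of taxon names, and a branch is identified by the set of taxa on the side that does not contain a fixed reference taxon.

```python
import random
from collections import Counter

def resample_columns(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    n_cols = len(next(iter(alignment.values())))
    picked = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in picked)
            for name, seq in alignment.items()}

def leaf_sets(tree, out):
    """Return the leaves under `tree`, recording each internal clade in `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.add(leaves)
    return leaves

def branch_splits(tree):
    """Non-trivial branches of `tree`, each as one side of its taxon partition."""
    clades = set()
    taxa = leaf_sets(tree, clades)
    ref = min(taxa)
    # Canonical form: the side of the split that does not contain `ref`.
    canon = {c if ref not in c else taxa - c for c in clades}
    # Drop the root and single-leaf branches, which occur in every tree.
    return {s for s in canon if 1 < len(s) < len(taxa) - 1}

def support(reference_tree, replicate_trees):
    """Percentage of replicate trees containing each reference branch."""
    counts = Counter(s for t in replicate_trees for s in branch_splits(t))
    n = len(replicate_trees)
    return {s: 100.0 * counts[s] / n for s in branch_splits(reference_tree)}

# Hypothetical alignment and one bootstrap replicate of it.
alignment = {"A": "AAGT", "B": "AAGC", "C": "CAGT", "D": "CATT"}
replicate = resample_columns(alignment, random.Random(0))

# Stand-ins for trees built from four bootstrap replicates.
reference = (("A", "B"), ("C", "D"))
replicate_trees = [reference, reference, reference, (("A", "C"), ("B", "D"))]
branch_support = support(reference, replicate_trees)
print(branch_support)
```

In a real analysis each resampled alignment would be passed to a tree-building method, and the number of replicates would be in the hundreds or more.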

19
Evolutionary linguistics
20
How to estimate distances?
  • T. Jukes and C. Cantor
  • Berkeley, 1969
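The transcript stops at the names, so the formula below is an assumption about what the slide goes on to show. Under the Jukes-Cantor (1969) model, an observed fraction p of differing sites corresponds to a corrected distance d = -(3/4) ln(1 - 4p/3) substitutions per site. A small Python sketch follows (the language and the function name are assumptions, and skipping gap columns is one simple convention among several).

```python
import math

def jukes_cantor_distance(seq1, seq2):
    """JC69 distance, in expected substitutions per site, between two
    aligned sequences of equal length. Columns with a gap are skipped."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

# Hypothetical sequences differing at 1 site out of 10 (p = 0.1).
d = jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA")
print(round(d, 4))
```

The corrected distance is always at least as large as p, because some sites will have changed more than once, and the gap between the two widens as p approaches 0.75.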