Provided by: utcs8


Understanding sets of trees
  • CS 394C
  • September 10, 2009

Basic challenge
  • Phylogenetic analyses are sometimes based upon a
    single marker, but often based upon many markers
  • Each marker can be analyzed separately, or the
    entire set can be combined into one matrix
  • Each matrix (each dataset) can result in many
    trees (almost no matter how you analyze the data)
  • What to do with huge numbers of trees?

What to do?
  • How to estimate evolutionary history from many
    trees
  • How to efficiently store large sets of trees
  • How to enable efficient queries of the set of
    trees

First, a few questions
  • Why are gene trees different from the species
    tree?
  • Why are estimated gene trees different from the
    true gene tree?
  • Under what conditions is the true evolutionary
    history not a tree? (i.e., what is reticulate
    evolution?)

  • Evolutionary histories can be reticulate (meaning
    not treelike)
  • Horizontal Gene Transfer (HGT)
  • Hybrid speciation
  • Recombination
  • Most phylogeny estimation methods produce trees.
  • Good resource about reticulate phylogenies: the
    book chapter by Luay Nakhleh (see the 394C
    webpage)

  • We will assume that all evolutionary histories
    are treelike for the remainder of today
  • Later in the course we'll discuss reticulate
    histories

Estimated Gene Trees can differ from Species Trees
  • Biological reasons
  • Deep coalescent events (alleles)
  • Gene duplication and loss (gene families)
  • Computational reasons
  • Insufficient time
  • Poor methods (e.g., UPGMA)
  • Poor models (e.g., ML using Jukes-Cantor)
  • Data issues
  • Insufficient data (meaning not enough sites)
  • Poor alignments

Examples of problems
  • When true gene trees can differ from the species
    tree
  • Given a collection of gene trees, find a species
    tree that minimizes the number of deep
    coalescent events
  • When true gene trees should equal the species
    tree
  • Given a collection of gene trees, find a species
    tree that minimizes the total distance to the
    gene trees

When gene trees can differ from species tree
  • Software/Algorithms for deep-coalescent (see
    PhyloNet from Nakhleh's webpage at Rice)
  • GLASS (Roch and Mossel) - distance-based
  • MDC (Than and Nakhleh) - parsimony
  • STEM (Kubatko) - ML
  • BEST (Liu et al.) - Bayesian
  • BUCKy (Ané et al.) - Bayesian
  • Software/Algorithms for duplication-loss
  • NOTUNG (Durand)
  • Duptree (Bansal et al.)
  • Hallett and Lagergren - algorithms/complexity

When gene trees should equal the species tree
  • The problem here is that estimated gene trees can
    differ from the true gene trees.
  • Although the problem is simple, it is still
    interesting -- computationally and
  • Plus, we can still make novel contributions.

The very simplest problem
  • Easiest case
  • One species tree, true gene trees will agree with
    the species tree,
  • Estimated trees are on the full set of taxa
  • Approaches
  • Consensus methods return a tree on the entire
    set S of taxa summarizing the input trees
  • Agreement methods return a tree on a subset of
    the taxa on which the trees agree
  • Clustering, then consensus/agreement

Consensus methods
  • These are the most usual ways of analyzing
    datasets of trees
  • Examples
  • Strict consensus
  • Majority consensus
  • Greedy consensus (aka extended majority)
  • Others less frequently used include Gordon's,
    Adams, the Strict Consensus Supertree, Local
    Consensus methods, and more.
  • See the survey paper by David Bryant for some of
    these
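
As a rough illustration of the first two consensus methods, here is a minimal Python sketch. It assumes rooted trees written as nested tuples of taxon names (a representation chosen here for convenience, not taken from the slides) and returns the clusters that the strict or majority consensus tree would contain, rather than building the tree itself. The function names are hypothetical.

```python
from collections import Counter


def leaf_set(node):
    """Taxa below a node; a leaf is a string, an internal node a tuple."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def clusters(tree):
    """Non-trivial clusters: leaf sets of internal nodes other than the root."""
    found = set()

    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.add(leaf_set(node))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return found


def cluster_counts(trees):
    """Count in how many input trees each cluster appears."""
    return Counter(c for tree in trees for c in clusters(tree))


def strict_consensus(trees):
    """Clusters present in every input tree."""
    return {c for c, k in cluster_counts(trees).items() if k == len(trees)}


def majority_consensus(trees):
    """Clusters present in more than half of the input trees."""
    return {c for c, k in cluster_counts(trees).items() if 2 * k > len(trees)}


# Three made-up input trees on the taxa a..e.
trees = [
    ((("a", "b"), "c"), ("d", "e")),
    (("a", "b"), ("c", ("d", "e"))),
    ((("a", "c"), "b"), ("d", "e")),
]
print(strict_consensus(trees))
print(majority_consensus(trees))
```

Greedy consensus would start from the majority clusters and keep adding the next most frequent cluster that is compatible with those already chosen; that step is left out of this sketch.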

Simplest problems, cont.
  • Agreement methods return trees on subsets of S,
    on which the trees are the same (or compatible)
  • MAST: maximum agreement subtree (used in
    practice, sometimes)
  • MCST: maximum compatible subtree (Ganapathy et
    al., not used in practice)
  • The difference between these is how polytomies
    are handled

Soft vs. hard polytomies
  • Polytomy: a node of high degree (greater than
    three for an unrooted tree)
  • Polytomies arise in estimations when consensus
    methods are used
  • Polytomies also arise when contracting short
    branches in estimated trees
  • Polytomies can be hard (representing true
    radiations) or soft (representing lack of
    resolution)

Compatible source trees
  • Estimated trees can be compatible when we
    interpret polytomies as soft
  • Compatible means that there is a tree which is
    a common refinement.
  • Example 123456, 123456, 123546.
  • We can compute the compatibility tree (when it
    exists) in O(nk) time, where n = |S| and there
    are k source trees
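
One simple way to test compatibility when all source trees are on the same taxon set is to pool their bipartitions and check them pairwise: a set of bipartitions fits on a single tree exactly when every pair is compatible. The Python sketch below does only that pairwise check. It is quadratic in the number of bipartitions, so it is not the O(nk) algorithm mentioned above, and the helper names and the taxa a..f are made up for illustration.

```python
from itertools import combinations


def make_split(side, taxa):
    """Bipartition of `taxa` into `side` and everything else."""
    side = frozenset(side)
    return (side, frozenset(taxa) - side)


def splits_compatible(split1, split2):
    """Two splits are compatible when some pair of sides does not intersect."""
    return any(not (x & y) for x in split1 for y in split2)


def all_compatible(splits):
    """True when every pair of splits is compatible."""
    return all(splits_compatible(s, t) for s, t in combinations(splits, 2))


taxa = "abcdef"
s1 = make_split("ab", taxa)   # ab | cdef
s2 = make_split("abc", taxa)  # abc | def
s3 = make_split("ac", taxa)   # ac | bdef

print(all_compatible([s1, s2]))
print(all_compatible([s1, s2, s3]))
```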

Computational complexity
  • Most consensus methods (which return a tree on
    the entire set S of taxa) are polynomial time.
  • Most agreement methods (which return a tree on
    the largest subset of the taxa on which the
    source trees agree) are based upon NP-hard
    problems. Some (e.g., MAST) have fixed-parameter
    polynomial time solutions.

Supertree problems
  • Realistic complexity: not all the source trees
    are on the same set of taxa.
  • Obvious problems
  • Find the tree on which all the source trees agree
    (if it exists).
  • Find the tree on which a maximum number of the
    source trees agree.
  • Both are NP-hard.

Quartet compatibility
  • Simple case: all the source trees are on four
    taxa
  • We ask: does there exist a tree which agrees with
    all the source trees?
  • NP-hard!

Quartet tree amalgamation
  • Given collection of quartet trees, find a tree
    which agrees with a maximum number of these
    quartet trees
  • NP-hard, since compatibility is NP-hard
  • Hard to approximate, but PTAS if you have a tree
    on every quartet of taxa (Jiang et al.)
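
To make the objective concrete, the following Python sketch scores one candidate tree against a collection of quartet trees by counting how many it agrees with. It does not search for the best tree, which is the NP-hard part. It assumes the tree is given as nested tuples of taxon names and each quartet ab|cd as the pair (("a", "b"), ("c", "d")); both conventions and all function names are assumptions made for this example.

```python
def leaf_set(node):
    """Taxa below a node; a leaf is a string, an internal node a tuple."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def split_sides(tree):
    """One side of each internal edge: leaf sets of non-root internal nodes."""
    sides = []

    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            sides.append(leaf_set(node))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return sides


def displays(sides, quartet):
    """True when some edge separates the two pairs of the quartet."""
    left, right = quartet
    for side in sides:
        left_in = [x in side for x in left]
        right_in = [x in side for x in right]
        if all(left_in) and not any(right_in):
            return True
        if all(right_in) and not any(left_in):
            return True
    return False


def quartet_score(tree, quartets):
    """Number of quartet trees that the candidate tree agrees with."""
    sides = split_sides(tree)
    return sum(displays(sides, q) for q in quartets)


tree = ((("a", "b"), "c"), ("d", "e"))
quartets = [
    (("a", "b"), ("c", "d")),
    (("a", "c"), ("d", "e")),
    (("a", "d"), ("b", "e")),
]
print(quartet_score(tree, quartets))
```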

Quartet amalgamation algorithms
  • Quartet Puzzling (Strimmer and von Haeseler)
  • Q* (Berry et al.)
  • Quartet Cleaning (Berry et al.)
  • Weight Optimization (Ranwez and Gascuel)
  • Quartets MaxCut (Snir and Rao)
  • But see also the paper (St. John et al.)
    evaluating early quartet methods on the CS 394C
    webpage

What about rooted trees?
  • Given a set of rooted source trees, we ask:
  • Is there a tree on which all the rooted source
    trees are correct?

Rooted tree compatibility
  • Aho, Sagiv, Szymanski, and Ullman: polynomial
    time, recursive algorithm
  • If n = 1, return the singleton tree.
  • If n > 1, then compute an equivalence relation on
    the set of taxa as follows.
  • For each rooted triple ((a,b),c) in the set, put
    a and b in the same equivalence class.
  • Compute transitive closure.
  • If only one equivalence class, reject (set is
    incompatible). Otherwise, recurse on each subset,
    and return tree obtained by making all
    recursively computed trees sibling subtrees.
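
The recursion above can be sketched in Python as follows. The input format (a list of rooted triples written as (("a", "b"), "c")), the nested-tuple output, and the function name are assumptions made for this example. As in the usual formulation of the algorithm, each recursive call only uses the triples whose three taxa all lie in the current subset. The transitive closure is computed with a small union-find structure.

```python
def build(taxa, triples):
    """Return a tree (nested tuples) displaying all triples, or None if none exists."""
    taxa = set(taxa)
    if len(taxa) == 1:
        return next(iter(taxa))

    # Only triples entirely inside the current taxon subset matter here.
    relevant = [t for t in triples if {t[0][0], t[0][1], t[1]} <= taxa]

    # Union-find: merge a and b for every triple ((a, b), c).
    parent = {x: x for x in taxa}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), _c in relevant:
        parent[find(a)] = find(b)

    groups = {}
    for x in taxa:
        groups.setdefault(find(x), set()).add(x)
    classes = sorted(groups.values(), key=sorted)  # fixed order for repeatability

    if len(classes) == 1:
        return None  # a single class means the triples are incompatible

    subtrees = []
    for cls in classes:
        sub = build(cls, relevant)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)


compatible = [(("a", "b"), "c"), (("a", "c"), "d")]
incompatible = [(("a", "b"), "c"), (("a", "c"), "b")]
print(build("abcd", compatible))
print(build("abc", incompatible))
```

Source trees with more than three leaves would first be broken into the rooted triples they induce; that preprocessing is not shown.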

Subtree compatibility
  • If source trees are rooted, then compatibility
    can be tested in polynomial time. Optimization
    problems are NP-hard, however.
  • If source trees are unrooted, then compatibility
    is NP-hard. And so optimization problems are
    also NP-hard.

Supertree problems, in practice
  • In practice, the most frequently used supertree
    method is MRP, for Matrix Representation with
    Parsimony
  • There are, however, many other supertree methods!

Many Supertree Methods
  • MRP
  • weighted MRP
  • Min-Cut
  • Modified Min-Cut
  • Semi-strict Supertree
  • MRF
  • MRD
  • QILI
  • SDM
  • Q-imputation
  • PhySIC
  • Majority-Rule Supertrees
  • Maximum Likelihood Supertrees
  • and many more ...

  • Idea: take every source tree, and replace it with
    a matrix of 0, 1, ?.
  • Concatenate the matrices.
  • Apply Maximum Parsimony.
  • If all the source trees are compatible, then an
    exact solution to MRP will return the
    compatibility trees.
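
The encoding step can be sketched in Python as below, assuming rooted source trees written as nested tuples of taxon names. Each internal node other than the root contributes one column: 1 for taxa below that node, 0 for the other taxa of the same source tree, and ? for taxa the source tree does not contain. Only the matrix construction is shown; the maximum parsimony search is left to a phylogenetics package. The function names are hypothetical.

```python
def leaf_set(node):
    """Taxa below a node; a leaf is a string, an internal node a tuple."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def cluster_list(tree):
    """Leaf sets of internal nodes other than the root, in preorder."""
    found = []

    def visit(node, is_root):
        if isinstance(node, str):
            return
        if not is_root:
            found.append(leaf_set(node))
        for child in node:
            visit(child, False)

    visit(tree, True)
    return found


def mrp_matrix(trees):
    """Concatenated 0/1/? matrix as a dict mapping each taxon to its row string."""
    all_taxa = sorted(set().union(*(leaf_set(t) for t in trees)))
    rows = {taxon: "" for taxon in all_taxa}
    for tree in trees:
        present = leaf_set(tree)
        for cluster in cluster_list(tree):
            for taxon in all_taxa:
                if taxon not in present:
                    rows[taxon] += "?"  # taxon missing from this source tree
                elif taxon in cluster:
                    rows[taxon] += "1"
                else:
                    rows[taxon] += "0"
    return rows


# Two made-up source trees with overlapping taxon sets.
source_trees = [
    (("a", "b"), ("c", "d")),
    (("b", "c"), "e"),
]
for taxon, row in mrp_matrix(source_trees).items():
    print(taxon, row)
```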

Homework, due 9/15
  • Read two papers (linked on the webpage)
  • St. John et al., about quartet-based methods
  • Moret et al., about sequence-length requirements
  • Pick one, write summary, and include questions

  • How do you feel about occasionally having class
    on some Monday or Friday, so we can have guest