Phylogenetic inference

Provided by: Guille83

1
Phylogenetic inference
Minicourse "Molecular Phylogenetics: an update of
new methodological developments", given during the 48th
Annual Meeting of the Sociedade Brasileira de
Genetica, Aguas de Lindoia, Sao Paulo, Brazil
(Sept 2002)
  • Many methods available, using different
    techniques, many software packages
  • For molecular data, the trend is towards using
    methods based on explicit models based on
    realistic assumptions
  • New improved methods and tests appear in the
    literature constantly

2
Phylogenetic inference
  • This minicourse will review some of the widely
    used (traditional) approaches and introduce two
    recent developments
      • Bayesian inference
      • Genetic algorithms (MetaGA)
  • Review
      • Algorithmic vs. optimality-criterion approaches
        (parsimony, distance methods and ML)
      • Tree searching (heuristic search)
      • Models (using Modeltest to choose one)

3
Classification of phylogenetic methods
4
Distance and discrete data
5
Algorithms versus optimality criteria
  • Phylogenetic inference is an estimation procedure
    (best estimate)
  • Only have information about the contemporary
    molecules (and organisms)
  • How do we choose a tree from the set of all
    possible trees?
  • Two basic approaches
      • Algorithmic: just follow a sequence of steps
      • Optimality criterion: define how to compare trees

6
Algorithmic methods
  • Combine tree inference and the definition of a
    preferred tree into a single statement
  • Include UPGMA and all forms of pair-group cluster
    analysis, and neighbor joining
  • Computationally fast because they go straight to
    the final solution
  • The task of finding an optimal tree cannot be
    separated from that of evaluating a specific tree

7
Optimality criteria
  • Two logical steps
  • Define an optimality criterion (objective
    function for evaluating trees)
  • Find the tree(s) with the best value for the
    objective function (may use algorithms)
  • Evolutionary assumptions made in the first step
    are decoupled from the computations involved in
    the second step
  • Price for logical clarity is that these methods
    can be very slow

8
Use of algorithms
  • Different use in the two approaches
  • In purely algorithmic methods, the algorithm
    defines the tree selection criterion and is
    fundamental
  • In criterion-based methods, algorithms are merely
    tools used in evaluating and searching for
    optimal trees
  • Reliability of the tree?

9
Optimality criteria
  • Parsimony: select the tree that minimizes the
    total tree length (number of steps or character
    transformations required to explain a given set
    of data)
  • Some methods are based on models of evolutionary
    change; their assumptions are made explicit.
  • Is parsimony's model-free nature an advantage
    or a disadvantage?
  • Parsimony does make assumptions (consistency)
10
Optimality methods (cont.)
  • Maximum likelihood: evaluates the probability
    that a proposed model of evolution and the
    hypothesized history could give rise to the
    observed data (attempts to estimate the actual
    amount of change)
  • Usually gives more consistent estimates with lower
    variance than other methods, and is robust to
    violations of assumptions

11
Optimality criteria (cont.)
  • Pairwise distance methods also minimize the
    effect of multiple hits when an appropriate
    model is used to estimate the true evolutionary
    distance between two sequences (less desirable
    than full ML)
  • Additive and ultrametric distances can be fitted
    to a tree such that each pairwise distance is
    equal to the sum of the branch lengths along the
    path connecting the two sequences in the tree

12
  • Observed distances are obtained directly from the
    sequences themselves; patristic distances are
    obtained from a tree
  • For additive and ultrametric distances, the
    observed and tree distances match exactly

Additive
Ultrametric
For real data this is rarely the case, indicating
that observed distances cannot be represented
exactly by a tree.
13
Classification of phylogenetic methods
14
UPGMA: an algorithmic method
  • Cluster analysis: unweighted pair group method
    using arithmetic averages (Sneath and Sokal 1973)
  • Assumes ultrametricity

15
UPGMA example
  1. Given a matrix of pairwise distances, find the
     clusters (taxa) i and j such that dij is the
     minimum value in the table
  2. Define the depth of the branching between i and j
     (lij) to be dij/2
  3. If i and j were the last two clusters, the tree
     is complete. Otherwise, create a new cluster
     called u
  4. Define the distance from u to each other cluster
     k (with k ≠ i, j) to be the average of the
     distances dki and dkj
  5. Go back to step 1 with one less cluster: clusters
     i and j have been eliminated, and cluster u has
     been added
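
The clustering steps above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name, the nested-tuple tree representation and the example matrix are assumptions. The average is weighted by cluster size, as UPGMA proper does; a plain (dki + dkj)/2 average gives the related WPGMA method, and the two coincide while both clusters are single taxa.

```python
from itertools import combinations


def upgma(distances):
    """distances: {(x, y): d} for every pair of taxon names.
    Returns (tree, depths): tree is a nested tuple, depths maps each
    cluster to the depth of its branching point (dij / 2)."""
    d = {frozenset(pair): value for pair, value in distances.items()}
    sizes = {}
    for pair in distances:
        for name in pair:
            sizes[name] = 1          # every taxon starts as its own cluster
    depths = {}
    while len(sizes) > 1:
        # Find the pair of clusters with the minimum distance.
        i, j = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        u = (i, j)
        # Depth of the branching between i and j.
        depths[u] = d[frozenset((i, j))] / 2
        n_i, n_j = sizes.pop(i), sizes.pop(j)
        # Distance from the new cluster u to every other cluster k.
        for k in sizes:
            d[frozenset((u, k))] = (
                n_i * d[frozenset((i, k))] + n_j * d[frozenset((j, k))]
            ) / (n_i + n_j)
        # Repeat with one less cluster.
        sizes[u] = n_i + n_j
    (root,) = sizes
    return root, depths


# Hypothetical ultrametric distances among four taxa.
example = {
    ("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
    ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10,
}
tree, depths = upgma(example)
print(tree)
```

Because the example matrix is ultrametric, the resulting phenogram reproduces every input distance exactly; with non-ultrametric input the method still returns a tree, but it may not be the correct one.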

16
Distance Matrices and phenogram
17
Classification of phylogenetic methods
18
Parsimony methods
  • The most widely-used method, familiar notion in
    science (simplicity)
  • Shared attributes among taxa are inherited from
    common ancestors
  • When character conflicts occur, ad hoc hypotheses
    cannot be avoided if you want to explain all the
    data, and assumptions of homoplasy must be
    invoked

19
Parsimony
  • From the set of all possible trees, find all trees τ
    such that L(τ) is minimal:
    $L(\tau) = \sum_{k=1}^{B} \sum_{j=1}^{N} w_j \cdot \mathrm{diff}(x_{k'j}, x_{k''j})$
  • B is the number of branches
  • N is the number of characters
  • k' and k'' are the two nodes incident to each
    branch k
  • x_{k'j} and x_{k''j} represent either elements of the
    input matrix or optimal character assignments
    made to internal nodes
  • diff(y, z) is a function specifying the cost of
    transformation from y to z along any branch; for
    unrooted trees diff(y, z) = diff(z, y). diff may be
    defined by a cost matrix
  • The coefficient w_j is the weight assigned to each
    character (a priori or a posteriori)
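
For the simplest case, where every change costs one step (diff is 0 for identical states and 1 otherwise), the tree length can be computed with Fitch's algorithm. The sketch below assumes that equal-cost case; the function names, the nested-tuple tree format and the toy alignment are illustrative assumptions and do not come from the slides.

```python
def fitch(tree, column):
    """Return (state_set, steps) for one character on a binary tree.
    tree: taxon name (str) or a pair of subtrees; column: {taxon: state}."""
    if isinstance(tree, str):
        return {column[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch(left, column)
    right_set, right_steps = fitch(right, column)
    shared = left_set & right_set
    if shared:
        return shared, left_steps + right_steps
    # No shared state: one change is required on this pair of branches.
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, alignment, weights=None):
    """Weighted sum of per-character steps: the L of the formula above
    when every transformation costs 1."""
    n_sites = len(next(iter(alignment.values())))
    if weights is None:
        weights = [1] * n_sites
    total = 0
    for j in range(n_sites):
        column = {taxon: seq[j] for taxon, seq in alignment.items()}
        total += weights[j] * fitch(tree, column)[1]
    return total


# Hypothetical four-taxon alignment.
alignment = {"t1": "ACGT", "t2": "ACGA", "t3": "GCGA", "t4": "GTGA"}
print(tree_length((("t1", "t2"), ("t3", "t4")), alignment))  # 3
print(tree_length((("t1", "t3"), ("t2", "t4")), alignment))  # 4
```

The tree is written as rooted only for convenience; with symmetric costs the length does not depend on where the root is placed.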

20
Other parsimony variants
  • Dollo parsimony: every derived character must be
    uniquely derived (originate only once in the
    tree)
  • Homoplasy: only reversals are allowed (no
    parallelism or convergence)
  • In practice, Dollo parsimony does not require
    inclusion of hypothetical ancestors, just
    character polarity (unrooted Dollo)
  • Convenient for restriction-site characters
    (easier to lose than to gain a site)

21
Dollo parsimony and RFLP data
A relaxed Dollo criterion may be applied using
generalized parsimony
22
Generalized Parsimony
  • All parsimony variants can be subsumed into a
    generalized method that assigns a cost for each
    possible transformation
  • Costs are represented in an m-by-m cost matrix S,
    where each element Sij represents the increase in
    tree length due to a transformation from state i
    to j
  • The cost of each transformation (weight) can be
    determined a priori (e.g. for RFLPs or for
    transition/transversion changes) or a posteriori
    (using the same data, e.g. successive
    approximations method)
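
A cost matrix of this kind is typically evaluated with a dynamic-programming pass over the tree (Sankoff's algorithm). The sketch below is an illustration under stated assumptions: the transition cost of 1 and transversion cost of 2 are arbitrary a priori weights chosen for the example, and all names are hypothetical.

```python
INF = float("inf")
STATES = "ACGT"


def ts_tv_cost(ts=1, tv=2):
    """m-by-m cost matrix S (m = 4): 0 on the diagonal, ts for
    transitions (A<->G, C<->T), tv for transversions."""
    purines = {"A", "G"}
    matrix = []
    for x in STATES:
        row = []
        for y in STATES:
            if x == y:
                row.append(0)
            elif (x in purines) == (y in purines):
                row.append(ts)
            else:
                row.append(tv)
        matrix.append(row)
    return matrix


def sankoff(tree, column, cost):
    """Return, for each possible state at this node, the minimum cost
    of the subtree below it. tree: taxon name or tuple of subtrees."""
    if isinstance(tree, str):
        return [0 if s == column[tree] else INF for s in STATES]
    child_costs = [sankoff(child, column, cost) for child in tree]
    node_costs = []
    for i in range(len(STATES)):
        total = 0
        for child in child_costs:
            # Cheapest way to reach any state j in the child from state i.
            total += min(cost[i][j] + child[j] for j in range(len(STATES)))
        node_costs.append(total)
    return node_costs


# One hypothetical character scored on two alternative trees.
column = {"t1": "A", "t2": "G", "t3": "C", "t4": "T"}
S = ts_tv_cost()
print(min(sankoff((("t1", "t2"), ("t3", "t4")), column, S)))  # 4
print(min(sankoff((("t1", "t3"), ("t2", "t4")), column, S)))  # 5
```

Under equal costs both trees would need three steps for this character; weighting transversions more heavily is what separates them.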

23
Generalized Parsimony Cost matrices
24
Protein parsimony
  • A 20x20 matrix specifies the cost for each
    possible transformation
  • The matrix may be based on the genetic code
    (PROTPARS matrix) and/or the biochemical
    properties of the amino acids themselves (Dayhoff
    matrices)

25
Difference in perspective: MP and ML
  • Parsimony seeks solutions that minimize the
    amount of change required to explain the data
    (underestimates superimposed changes)
  • ML attempts to estimate the actual amount of
    change (by specifying the evolutionary model that
    will account for the data with the highest
    likelihood)
  • Methods that incorporate models of evolutionary
    change can make more efficient use of the data

26
Classification of phylogenetic methods
27
Distance methods
  • Experimentally derived distances are assumed to
    be estimates of true distances
  • We want to fit them to a mathematical model
    (additive tree) and find the optimal values for
    the adjustable parameters
      • Branching pattern
      • Branch lengths
  • Some methods: Fitch-Margoliash, minimum evolution
    (ME)

28
Distance Methods
  • Alternative approach to ML for minimizing the
    impact of the underestimation problem if
    corrected distances are used
  • Corrected distances are assumed to be estimates
    of the true evolutionary distance (between a pair
    of sequences)
  • Distance methods are less desirable
    approximations to a full ML approach, but much
    faster
  • Drawbacks of transforming character data into
    distances include loss of information and
    difficulty in combining two or more data
    sets
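
As a concrete instance of a corrected distance, the Jukes-Cantor (JC69) formula converts the observed proportion of differing sites p into an estimate of the number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The slides name JC only as one of several models; the function names and the sample sequences below are assumptions made for this sketch.

```python
import math


def p_distance(seq1, seq2):
    """Observed (uncorrected) proportion of sites that differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for x, y in zip(seq1, seq2) if x != y)
    return differences / len(seq1)


def jc69_distance(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: the JC69 correction is undefined (saturation)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


p = p_distance("ACGTACGTAC", "ACGTACGTAT")
print(p, jc69_distance(p))
```

The corrected value is always at least as large as p, and the gap widens as sequences diverge, which is the underestimation problem the correction is meant to address.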

29
The problem...
  • We have uncertain data (distance estimates) that
    we want to fit to a particular mathematical model
    (an additive tree) and find optimal values for
    the adjustable parameters
      • The branching pattern
      • The branch lengths

30
An additive distance measure defines a tree...
For any 2 sequences, the value in the distance
matrix should correspond to the sum of the branch
lengths along the path between the 2 sequences on
the tree.
31
When distances are not ultrametric but are still
additive, they can be represented by a tree:
an additive tree
Additive trees represent additive distances
exactly...
32
  • While this tree is additive, it is not
    ultrametric
  • Notice that sequences b and c are the most
    similar (3), but ARE NOT the most closely related
  • Similarity and evolutionary relationship will
    only coincide exactly if the distances are
    ultrametric

33
Additive-tree methods
  • Due to the finite amount of available data,
    stochastic variation will cause deviations of the
    estimated evolutionary distances from perfect
    tree additivity...
  • even when evolution proceeds according to the
    model used for distance correction (JC, K2P,
    HKY85, etc)
  • Many methods exist that derive a tree (with branch
    lengths) from a distance matrix so that the tree
    comes as close as possible to being additive
  • The discrepancy (distortion) between observed
    and tree distances can be used as an indicator
    (optimality criterion) of how well the observed
    distances fit a tree-like representation (but do
    not confuse the criterion with the algorithm)

34
Fitch-Margoliash and related methods
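  • E: definition of the disagreement between data and
    tree
    $E = \sum_{i=1}^{T-1} \sum_{j=i+1}^{T} w_{ij} \, |d_{ij} - p_{ij}|^{\alpha}$
    where T is the number of taxa, d_ij is the observed
    distance and p_ij is the path-length distance on
    the tree
  • α and the weights w_ij must be defined
  • If α = 1, this is an absolute-difference
    criterion
  • If α = 2, this is a least-squares criterion
  • The more commonly used weighting schemes (w_ij) are
    1, 1/d_ij, 1/d_ij^2, and 1/variance(d_ij)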

35
Minimum Evolution Method
  • Use the unweighted least-squares criterion (w = 1,
    α = 2) to fit branch lengths, but a different
    criterion to evaluate and compare trees

Optimality criterion: the sum of the absolute values
of the branch lengths (BL) that minimize the squared
deviations between observed and path-length distances
$S = \sum_{k=1}^{2T-3} |v_k|$
where v_k is the length of branch k and 2T-3 is the
number of independent branches in an unrooted tree
with T taxa
36
Computation and tree-search problems
  • Sometimes negative branch lengths will be assigned
    to optimize the fit (E in the equation). Some
    solutions:
      • Outright rejection of trees with negative
        branch lengths (too drastic)
      • Constrain the optimization process to disallow
        negative branch lengths (set them to zero)
  • Least-squares and minimum-absolute-deviation
    methods assume that each pairwise distance is
    independent (not generally true because of the
    common evolutionary history of the molecules)
  • Also remember loss of information when
    summarizing discrete data as a distance matrix

37
Classification of phylogenetic methods
38
Maximum Likelihood methods
  • Evaluates a hypothesis about evolutionary history
    in terms of the probability that a proposed model
    of the evolutionary process and the hypothesized
    history would give rise to the observed data
  • L = Pr(D | H)
  • A history with a higher probability of giving
    rise to the current state of affairs is preferred
  • Cavalli-Sforza and Edwards (1967) and Felsenstein
    (1981, 1993) and others.

39
ML Objective
  • Data are observed sequences (DNA or prot)
  • Unknowns are branching order (topology) and
    branch lengths of a tree
  • A concrete model of evolution that transforms one
    sequence into another needs to be specified
    (fully defined or with uncertain parameters that
    need to be estimated from the data)
  • L = Pr(D | H)
  • Trees with higher likelihoods are preferred

40
Calculating L for a tree
  • Aligned sequences for 4 taxa
  • We want to evaluate the tree shown
  • What is the prob that this tree generated the
    data?

41
Calculating L for a tree
  • Root the tree at any internal node (models are
    time-reversible)
  • The assumption of independence among sites allows
    L to be calculated for each site separately
  • Then combine the likelihoods into a total value
    at the end

42
Calculating L for a tree
  • To calculate L for some site j, we must consider
    all possible scenarios by which the tip sequences
    could have evolved
  • Specifically, the root (6) may have had A, C, T,
    or G
  • For each of these possibilities, the other
    internal node (5) also might have possessed any
    of the 4 nucleotides

43
Calculating L for a tree
  • Thus, there are 4 × 4 = 16 possibilities to consider

44
Calculating L for a tree
  • Calculate the probability of each and sum them to
    obtain the total probability for site j
  • Assume that the changes along each branch are
    independent (Markov model)
  • Thus, the Pr of any single scenario is equal to
    the product of the Pr of the changes required by
    that scenario

45
Calculating L for a tree
  • Because the Pr of any single observation is an
    extremely small number, we evaluate the log of
    the likelihood instead
  • Probabilities are accumulated as the sum of logs
    of the single-site likelihoods
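
The calculation described on the last few slides can be written out for a four-taxon tree. The sketch below makes several assumptions that are not stated in the transcript: the Jukes-Cantor model with equal base frequencies, tips 1 and 2 attached to internal node 5, tips 3 and 4 attached to the root (node 6), and arbitrary branch lengths in expected substitutions per site. All names are hypothetical.

```python
import math
from itertools import product

NUC = "ACGT"


def p_jc(x, y, t):
    """Jukes-Cantor probability of going from state x to state y
    along a branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def site_likelihood(tips, bl):
    """Likelihood of one site. tips: states of taxa 1-4;
    bl: branch lengths keyed 't1', 't2', 't3', 't4', 'internal'."""
    s1, s2, s3, s4 = tips
    total = 0.0
    # Sum over the 4 x 4 = 16 possible states of nodes 6 (root) and 5.
    for x6, x5 in product(NUC, repeat=2):
        total += (
            0.25                                  # prior on the root state
            * p_jc(x6, x5, bl["internal"])
            * p_jc(x5, s1, bl["t1"]) * p_jc(x5, s2, bl["t2"])
            * p_jc(x6, s3, bl["t3"]) * p_jc(x6, s4, bl["t4"])
        )
    return total


def log_likelihood(sequences, bl):
    """Sum of the logs of the single-site likelihoods.
    sequences: four aligned sequences, in taxon order 1-4."""
    return sum(math.log(site_likelihood(col, bl)) for col in zip(*sequences))


branch_lengths = {"t1": 0.1, "t2": 0.1, "t3": 0.1, "t4": 0.1, "internal": 0.05}
print(log_likelihood(["AAGC", "AAGC", "AGGT", "AGGT"], branch_lengths))
```

A full ML analysis would also optimize the branch lengths for each topology and then compare topologies; here the branch lengths are simply fixed.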

46
Difference in perspective: MP and ML
  • Parsimony seeks solutions that minimize the
    amount of change required to explain the data
    (underestimates superimposed changes)
  • ML attempts to estimate the actual amount of
    change (by specifying the evolutionary model that
    will account for the data with the highest
    likelihood)
  • Methods that incorporate models of evolutionary
    change can make more efficient use of the data

47
Difference in perspective: ML vs MP
48
Parsimony and likelihood
ML and MP scores for all 15 unrooted trees for
mtDNA sequence data
49
MP and Inconsistency
50
Long Branch Attraction
  • The Felsenstein Zone
  • What are the assumptions for MP?
  • How can we tell if there's LBA in our data?

51
Searching for optimal trees
  • Methods with explicit optimality criteria
      • Parsimony
      • Maximum likelihood
      • Additive-tree distance
  • Separate the problem of
      • evaluating the tree
      • finding the optimal tree(s)
  • Can we evaluate all possible trees for a
    particular problem?

52
Searching for optimal trees
  • For small to moderate data sets, with as many as
    8-20 taxa, we can use exact methods
  • Exact methods guarantee the discovery of all
    optimal trees
  • Exact methods include
  • Exhaustive search
  • Branch-and-bound search

53
How many trees?
54
And for more than 10 taxa?
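
The number of strictly bifurcating trees grows as a double factorial: (2n-5)!! unrooted and (2n-3)!! rooted trees for n taxa. The short sketch below (function names are assumptions) tabulates these counts.

```python
def n_unrooted(n):
    """Number of unrooted bifurcating trees for n >= 3 taxa: (2n - 5)!!"""
    count = 1
    for i in range(3, n + 1):
        count *= 2 * i - 5      # taxon i can be attached to any of 2i - 5 branches
    return count


def n_rooted(n):
    """Number of rooted bifurcating trees for n >= 2 taxa: (2n - 3)!!"""
    return n_unrooted(n + 1)    # the root behaves like one extra taxon


for n in (4, 5, 10, 20, 50):
    print(n, n_unrooted(n), n_rooted(n))
```

Five taxa give 15 unrooted trees and ten taxa already give 2,027,025, which is why exhaustive evaluation stops being practical so quickly.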
55
Exhaustive search: enumerate all possible trees
56
Branch-and-bound: does not require exhaustive
search and yet provides an exact solution.
  1. Traverse a search tree in a depth-first
     sequence
  2. Select an upper bound (L) on the optimal
     value of the chosen criterion
  3. Move along a path toward the tips and evaluate
     trees. If a tree already scores worse than L,
     dispense with the rest of that path.
57
Approximate methods
  • For larger data sets computing time becomes
    prohibitive and we only explore some subset of
    all possible trees (hoping that the optimal trees
    will be found in the subset explored)
  • Heuristic approaches sacrifice the guarantee of
    optimality in favor of reduced computer time
  • Use hill-climbing methods: an initial tree starts
    the process, then we seek to improve its score
  • When we can find no way to further improve the
    score, we stop. We don't know whether we have
    reached a local or a global optimum
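
The hill-climbing idea is independent of the criterion and of how neighboring trees are generated. The generic sketch below assumes that lower scores are better (as with tree length) and takes the neighbor generator and the scoring function as arguments; the toy one-dimensional "landscape" stands in for tree space and is purely illustrative.

```python
def hill_climb(start, neighbors, score):
    """Move to a better neighbor until none exists.
    neighbors(x) returns candidate solutions; score(x) is minimized.
    Returns the final solution and its score (possibly a local optimum)."""
    current, best = start, score(start)
    while True:
        improved = False
        for candidate in neighbors(current):
            s = score(candidate)
            if s < best:
                current, best, improved = candidate, s, True
                break               # restart the search from the better solution
        if not improved:
            return current, best


# Toy landscape: positions 0..4 with these hypothetical scores.
scores = [9, 7, 8, 5, 6]


def step(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(scores)]


print(hill_climb(0, step, lambda i: scores[i]))  # (1, 7): stuck on a local optimum
print(hill_climb(4, step, lambda i: scores[i]))  # (3, 5): reaches the global optimum
```

The two runs differ only in their starting point, which is why heuristic searches are usually repeated from several initial trees.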

58
Initial trees
  • May be obtained by stepwise addition, the most
    commonly used method
  • Similar to exhaustive search but evaluate trees
    at every step, each time you add a new taxon and
    only follow the path derived from the optimal
    tree
  • Which taxa do you choose first? Which do you
    connect next?
  • These are greedy algorithms

59
Stepwise addition
60
  • Initial trees also may be obtained by star
    decomposition, another greedy algorithm

61
Branch swapping
  • To improve the initial estimate we can perform
    sets of predefined rearrangements on the tree
  • Any of these rearrangements amounts to a stab in
    the dark
  • Globally optimal trees may be several
    rearrangements away from the starting tree
  • If a better tree is found, a new round of
    rearrangements is then performed in the new tree
  • Several branch-swapping algorithms are available

62
Branch swapping by tree bisection and
reconnection (TBR)
  1. The tree is bisected along a branch, yielding
     two disjoint subtrees
  2. The subtrees are reconnected by joining a pair
     of branches, one from each subtree
  3. All possible bisections and pairwise
     reconnections are evaluated
63
Branch swapping by subtree pruning and
regrafting (SPR)
  1. A subtree is pruned from the tree (e.g. A,B)
  2. The subtree is then regrafted to a different
     location on the tree
  3. All possible subtree removals and reattachment
     points are evaluated
64
Branch swapping by nearest-neighbor interchanges
(NNI)
  1. Each interior branch of the tree defines a
     local region of four subtrees
  2. Interchanging a subtree on one side of the
     branch with one from the other side constitutes
     an NNI
  3. Two such rearrangements are possible for each
     interior branch (all interior branches are
     swapped)
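
The two rearrangements around a single interior branch can be written down directly. In this sketch the four subtrees are passed in explicitly and trees are nested tuples; that representation and the function name are assumptions, and walking over every interior branch of a larger tree is left out.

```python
def nni_swaps(a, b, c, d):
    """The interior branch separates subtrees (a, b) from (c, d).
    Return the two alternative arrangements one NNI away."""
    return [
        ((a, c), (b, d)),   # b exchanged with c
        ((a, d), (b, c)),   # b exchanged with d
    ]


# Four taxa: the two swaps give the other two unrooted topologies.
print(nni_swaps("A", "B", "C", "D"))

# Five taxa, unrooted tree ((A,B),C,(D,E)): two interior branches,
# so four NNI neighbors in total.
neighbors = nni_swaps("A", "B", "C", ("D", "E")) + nni_swaps("D", "E", "C", ("A", "B"))
print(len(neighbors))
```

Each call touches only the four subtrees next to one branch, so NNI explores far fewer trees per round than SPR or TBR.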
65
Landscapes and the problem of islands of trees