Molecular phylogenetics 3

1
Molecular phylogenetics 3
  • Level 3 Molecular Evolution and Bioinformatics
  • Jim Provan

Page and Holmes Sections 6.5-6
2
Maximum likelihood
  • Principle of likelihood suggests that the
    explanation that makes the observed outcome most
    probable is preferred
  • More formally
  • $L_D = \Pr(D \mid H)$
  • In a phylogenetic context
  • D is the set of sequences being compared
  • H is a phylogenetic tree
  • The tree that makes the data the most probable
    evolutionary outcome is the maximum likelihood
    estimate of the phylogeny

3
Models, data and hypotheses
  • Maximum likelihood requires three elements
  • A model of sequence evolution
  • A tree
  • A data set
  • ML methods of tree building must solve two
    problems
  • For a given tree topology, what set of branch
    lengths makes the observed data most likely
  • Which tree has the greatest likelihood

4
Models, data and hypotheses
  • Suppose we have two sequences, 1 and 2, separated
    by an average of d substitutions per site
  • $d = \mu t$ (substitution rate $\mu$ multiplied by time $t$)
  • Given a model of substitution for each site we
    can compute the probability $P_{ij}(d)$ that two
    sequences separated by d would have nucleotides i
    and j at that site
  • For example, if sequence 1 had nucleotide A then
    PAG(d) is the probability that sequence 2 has a G
    in the corresponding position
  • The log likelihood of obtaining the observed
    sequences is the sum of the log likelihoods of
    each individual site
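As a minimal sketch of the idea above, the following assumes the Jukes-Cantor model (the slides do not say which model is used at this point) and searches for the distance d that maximises the summed per-site log likelihood of two aligned sequences. The function names are illustrative, and the two sequences are the Human and Chimp rows of the spectral-analysis table later in this transcript, used purely as toy input.

```python
import math

def jc69_prob(i, j, d):
    """P_ij(d) under Jukes-Cantor; d is expected substitutions per site."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def log_likelihood(seq1, seq2, d):
    """Sum of the per-site log likelihoods for two aligned sequences."""
    # 0.25 is the equilibrium frequency of the nucleotide seen in sequence 1.
    return sum(math.log(0.25 * jc69_prob(a, b, d)) for a, b in zip(seq1, seq2))

def ml_distance(seq1, seq2, lo=1e-6, hi=5.0, iters=200):
    """Ternary search for the d that maximises the log likelihood."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(seq1, seq2, m1) < log_likelihood(seq1, seq2, m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0

human = "GTCATCATCC"
chimp = "ATTACCATTC"
d_hat = ml_distance(human, chimp)

# Under Jukes-Cantor the ML distance also has a closed form,
# which gives a check on the numerical search.
p = sum(a != b for a, b in zip(human, chimp)) / len(human)
closed_form = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
print(round(d_hat, 4), round(closed_form, 4))
```

A full ML tree search repeats this kind of optimisation over every branch of every candidate tree, which is why the method is expensive.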

5
Models, data and hypotheses
  • What model?
  • Transition/transversion ratio
  • Base composition
  • Variation in rate across sites
  • In all but the simplest models (e.g. Jukes-Cantor),
    differences in transition / transversion rates
    can be taken into account
  • Keeping other parameters constant, it is possible
    to calculate ML estimates of individual parameters
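To make the transition / transversion distinction concrete, here is a small sketch that simply counts the two kinds of difference between a pair of aligned sequences. It is not an ML estimate of the ratio; the function name is illustrative, and the input rows are borrowed from the spectral-analysis table later in this transcript.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def count_ts_tv(seq1, seq2):
    """Return (transitions, transversions) between two aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # A transition stays within purines or within pyrimidines;
        # a transversion swaps a purine for a pyrimidine or vice versa.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    return ts, tv

human = "GTCATCATCC"
gorilla = "GTTGTTATTA"
print(count_ts_tv(human, gorilla))
```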

6
Likelihood ratio tests
  • We can test alternative hypotheses concerning the
    same data using a likelihood ratio test
  • Likelihood ratio statistic ($\Delta$) is the ratio of
    the likelihood of the alternative hypothesis (H1)
    to the likelihood of the null hypothesis (H0)
  • Because likelihoods are often very small, it is
    more convenient to use log likelihoods
  • $\Delta = \log L_1 - \log L_0$
  • where
  • $L_1$ is the maximum likelihood of the alternative
    hypothesis H1
  • $L_0$ is the maximum likelihood of the null
    hypothesis H0
  • Can be used to test various hypotheses such as
    whether a particular model of evolution is valid,
    whether a molecular clock adequately describes
    the data or whether one phylogenetic hypothesis
    is better than another
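The test can be sketched in a few lines. This assumes the usual large-sample approximation for nested hypotheses, in which twice the log-likelihood difference is compared with a chi-square distribution whose degrees of freedom equal the number of extra free parameters in H1. The log-likelihood values below are hypothetical, and `chi2_sf` is a hand-written helper rather than a library call.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 1 and df = 2, then step df upwards by 2.
    if df % 2 == 0:
        q, k = math.exp(-half), 2
    else:
        q, k = math.erfc(math.sqrt(half)), 1
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

def likelihood_ratio_test(log_l1, log_l0, df):
    """Return (difference in log likelihood, approximate p-value)."""
    delta = log_l1 - log_l0
    return delta, chi2_sf(2.0 * delta, df)

# Hypothetical maximised log likelihoods; H1 has one extra free parameter.
delta, p_value = likelihood_ratio_test(log_l1=-2620.1, log_l0=-2625.4, df=1)
print(delta, p_value)
```

The chi-square approximation is only appropriate when H0 is a special case of H1; comparing two different tree topologies needs a different test.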

7
Testing models
  • A model can be tested to measure how well it fits
    the observed data by comparing the likelihood that
    a tree and a model confer on the data ($L_{\text{tree}}$)
    with the theoretical best ($L_{\max}$)
  • Likelihood ratio test can be performed to test
    the adequacy of the HKY85 model to describe the
    hominid mtDNA data set

8
Testing rate variation
  • If sequences are evolving at different rates,
    then an ultrametric tree will give a poor
    representation of relationships between taxa
  • $2\Delta = 2(\log L_{\text{no clock}} - \log L_{\text{clock}})$

9
Comparing phylogenetic hypotheses
  • If two trees are not significantly different then
    the sum of the site-by-site differences in log
    likelihood between them will not be significantly
    different from zero
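A rough sketch of this comparison, assuming the differences meant here are the per-site log-likelihood differences between the two trees, as in the Kishino-Hasegawa test. The function name and the per-site values are hypothetical; in practice they would come from an ML program.

```python
import math

def kh_statistic(site_lnl_tree1, site_lnl_tree2):
    """Return (summed difference, its standard error, z-score)."""
    diffs = [a - b for a, b in zip(site_lnl_tree1, site_lnl_tree2)]
    n = len(diffs)
    total = sum(diffs)
    mean = total / n
    # Variance of the sum, estimated from the spread of the per-site values.
    var_total = n / (n - 1) * sum((x - mean) ** 2 for x in diffs)
    se = math.sqrt(var_total)
    z = total / se if se > 0 else float("nan")
    return total, se, z

# Hypothetical per-site log likelihoods under two competing trees.
tree1 = [-2.10, -3.40, -1.95, -2.80, -3.05, -2.60]
tree2 = [-2.20, -3.20, -2.00, -2.80, -3.20, -2.50]
total, se, z = kh_statistic(tree1, tree2)
print(total, se, z)
```

A z-score within about two standard errors of zero would mean the data do not discriminate between the two trees.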
10
Objections to likelihood
  • Requires an explicit model of evolution
  • This is a strength, since it makes us aware of
    the assumptions being made
  • However, dependence on a model raises question of
    which model to use
  • Computationally expensive
  • Finding the best combination of model and tree is
    technically difficult
  • Computing likelihood is also time consuming and
    it may be that there is more than one maximal
    likelihood value for a given tree
  • Suggested that likelihood is better for testing
    models rather than as an all-purpose phylogenetic
    tool

11
Splits
  • In the above example, the split {gorilla,
    orang-utan, gibbon}, {human, chimp} can be
    written as 00011 in binary notation, or 3 in
    decimal notation
  • One advantage is that we can refer to any split
    by a single number

12
Spectral analysis
  • Provides a means of visualising support for each
    split
  • In simple terms, consists of plotting the
    frequencies of each split in the data set
  • Straightforward if there are two states for each
    character

              Sequence                Binary coding
Human         G T C A T C A T C C     1  1  0  1  1  0  1  1  0  1
Chimp         A T T A C C A T T C     0  1  1  1  0  0  1  1  1  1
Gorilla       G T T G T T A T T A     1  1  1  0  1  1  1  1  1  0
Orang-utan    A C C A C T C C C A     0  0  0  1  0  1  0  0  0  0
Gibbon        A C C G C C C C C A     0  0  0  0  0  0  0  0  0  0
Split number                          5  7  6 11  5 12  7  7  6  3
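The recoding in this table can be reproduced with a short sketch. It assumes, as the numbers in the table suggest, that each site is scored 1 where a taxon differs from Gibbon and 0 where it matches, and that the taxa are read as bits with Human lowest. Names are illustrative, and the approach only works for sites with two states.

```python
from collections import Counter

ALIGNMENT = {
    "Human":      "GTCATCATCC",
    "Chimp":      "ATTACCATTC",
    "Gorilla":    "GTTGTTATTA",
    "Orang-utan": "ACCACTCCCA",
    "Gibbon":     "ACCGCCCCCA",
}

def site_split_codes(alignment, reference="Gibbon"):
    """Return one split number per site, scored relative to the reference."""
    names = list(alignment)  # dictionary order defines the bit order
    ref = alignment[reference]
    codes = []
    for pos in range(len(ref)):
        code = 0
        for bit, name in enumerate(names):
            if alignment[name][pos] != ref[pos]:
                code |= 1 << bit
        codes.append(code)
    return codes

codes = site_split_codes(ALIGNMENT)
spectrum = Counter(codes)  # how many sites support each split
print(codes)
print(sorted(spectrum.items()))
```

Plotting the counts in `spectrum` against the split numbers gives the kind of bar chart that spectral analysis uses, here for these ten sites only.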
13
Spectral analysis
14
Spectral analysis
  • Since not all splits can coexist in the same tree,
    some method is needed to decide which splits to
    use to construct the tree
  • Five trivial splits will be in every tree
  • One possible solution is to choose the two
    mutually compatible, non-trivial splits which
    have the best support
  • In this case, the best non-trivial split is
    {Orang-utan, Gibbon}
  • The next best supported split is {Human, Chimp},
    which is compatible with this split
  • This gives the basic topology {{Human, Chimp},
    Gorilla, {Orang-utan, Gibbon}}
  • Problems with spectral analysis
  • Computationally expensive (half a million splits
    for 20 sequences)
  • Potential for more than two character states
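Two points from this slide can be sketched in code: the pairwise compatibility check, and how quickly the number of possible splits grows. The sketch assumes splits are stored as integers with one bit per taxon (Human lowest, Gibbon highest), and uses the standard criterion that two splits fit on the same tree when at least one of the four pairings of their sides has no taxon in common. Function names are illustrative.

```python
def compatible(code_a, code_b, n_taxa=5):
    """True if the two splits can appear together in one tree."""
    full = (1 << n_taxa) - 1
    sides_a = (code_a, full ^ code_a)
    sides_b = (code_b, full ^ code_b)
    # Compatible when some side of A shares no taxon with some side of B.
    return any((x & y) == 0 for x in sides_a for y in sides_b)

def count_splits(n_taxa):
    """Ways to divide n taxa into two non-empty groups."""
    return 2 ** (n_taxa - 1) - 1

ORANG_GIBBON = 0b11000  # {Orang-utan, Gibbon}
HUMAN_CHIMP = 0b00011   # {Human, Chimp}
HUMAN_GORILLA = 0b00101
CHIMP_GORILLA = 0b00110

print(compatible(ORANG_GIBBON, HUMAN_CHIMP))
print(compatible(HUMAN_GORILLA, CHIMP_GORILLA))
print(count_splits(20))
```

The count for 20 sequences comes out at roughly half a million, in line with the figure quoted above.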

15
Split decomposition
Site          1 2 3 4 5 6 7 8 9
Human         T C C T T A A A A
Chimp         T T C T A T A A A
Gorilla       T T A C A A T A A
Orang-utan    C C A C A A A T A
Gibbon        C C A C A A A A T
16
Split decomposition