Molecular Phylogenetics - PowerPoint PPT Presentation

Slides: 52
Provided by: ssi9
Transcript and Presenter's Notes
1
Molecular Phylogenetics
2
Trees
  • Diagram consisting of branches and nodes
  • Species tree (how are my species related?)
  • contains only one representative from each
    species
  • when did speciation take place?
  • all nodes indicate speciation events
  • Gene tree (how are my genes related?)
  • normally contains a number of genes from a single
    species
  • nodes relate either to speciation or gene
    duplication events

3
(No Transcript)
4
The purpose of a phylogenetic tree is to
illustrate how a group of objects (usually genes
or organisms) are related to one another
5
Phylogenetic trees are about visualising
evolutionary relationships
Nothing in Biology Makes Sense Except in the
Light of Evolution Theodosius
Dobzhansky (1900-1975)
6
Terms
  • Phylogeny (phylon = tribe, genesis = origin)
  • Homologue
  • Orthologue
  • Paralogue
  • Tree topology
  • Cladogram
  • Phenogram

7
Cladogram
8
Phenogram
9
Clade: a set of species which includes all of
the species derived from a single common ancestor
10
Molecular Evolution - Li
11
Cladistics and Phenetics
  • Cladistic approach: Trees are drawn based on the
    conserved characters
  • Phenetic approach: Trees are based on some
    measure of distance between the leaves
  • Molecular phylogenies are inferred from molecular
    (usually sequence) data
  • either cladistic (e.g. gene order) or phenetic

12
Classes of algorithm used to infer phylogeny from
sequence
  • Distance methods
  • Parsimony
  • Likelihood
  • Probabilistic methods
  • Phylogenetic invariants

13
Distance methods
  • Calculate the distance, CORRECTING FOR MULTIPLE
    HITS
  • The Distance Matrix (7 taxa)

             Rat     Mouse   Rabbit  Human   Opossum Chicken Frog
    Rat      0.0000  0.0646  0.1434  0.1456  0.3213  0.3213  0.7018
    Mouse    0.0646  0.0000  0.1716  0.1743  0.3253  0.3743  0.7673
    Rabbit   0.1434  0.1716  0.0000  0.0649  0.3582  0.3385  0.7522
    Human    0.1456  0.1743  0.0649  0.0000  0.3299  0.2915  0.7116
    Opossum  0.3213  0.3253  0.3582  0.3299  0.0000  0.3279  0.6653
    Chicken  0.3213  0.3743  0.3385  0.2915  0.3279  0.0000  0.5721
    Frog     0.7018  0.7673  0.7522  0.7116  0.6653  0.5721  0.0000

14
Distance methods
  • Normally fast and simple
  • e.g. UPGMA, Neighbour Joining, Minimum Evolution,
    Fitch-Margoliash

15
Correction for multiple hits
  • Only differences can be observed directly, not
    distances
  • All distance methods rely (crucially) on this
  • A great many models are used for nucleotide
    sequences (e.g. JC, K2P, HKY, Rev, Maximum Likelihood)
  • Amino acid (aa) sequences are far more
    complicated to model!
  • Accuracy falls off drastically for highly
    divergent sequences
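
The correction for multiple hits can be sketched for the two simplest nucleotide models named above, JC and K2P. This is a minimal illustration rather than the implementation used by any particular package; the function names are made up for this example.

```python
import math

def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites (positions with a gap are skipped)."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)

def jc69_distance(p):
    """Jukes-Cantor distance: d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: saturated, the JC correction is undefined")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def k2p_distance(transitions, transversions):
    """Kimura two-parameter distance from the observed proportions of
    transitions (P) and transversions (Q):
    d = -1/2 * ln(1 - 2P - Q) - 1/4 * ln(1 - 2Q)."""
    a = 1.0 - 2.0 * transitions - transversions
    b = 1.0 - 2.0 * transversions
    if a <= 0 or b <= 0:
        raise ValueError("saturated, the K2P correction is undefined")
    return -0.5 * math.log(a) - 0.25 * math.log(b)

# The corrected distance is always at least as large as the observed difference,
# and the gap widens as the sequences diverge.
for p in (0.05, 0.10, 0.30, 0.60):
    print(p, round(jc69_distance(p), 4))
```

The correction grows without bound as p approaches 0.75, which is one way to see why accuracy falls off for highly divergent sequences: small sampling errors in p become large errors in d.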

16
Minimum Evolution
  • The total length of all branches in the tree
    should be a minimum
  • It has been shown that the minimum evolution tree
    is expected to be the true tree, provided the
    branch lengths are corrected for multiple hits

17
Neighbour Joining
(Figure: tree diagram with nodes numbered 1 to 8)
18
Neighbour joining is an approximation to minimum
evolution
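
A minimal sketch of the first neighbour-joining step, using the seven-taxon distance matrix from slide 13. At each step NJ joins the pair that minimises the standard criterion Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of row i. Only the pair selection is shown; the branch-length and matrix-update steps are omitted, and the function name is made up for this example.

```python
NAMES = ["Rat", "Mouse", "Rabbit", "Human", "Opossum", "Chicken", "Frog"]

# Distance matrix from slide 13 (rows and columns in the order of NAMES).
D = [
    [0.0000, 0.0646, 0.1434, 0.1456, 0.3213, 0.3213, 0.7018],
    [0.0646, 0.0000, 0.1716, 0.1743, 0.3253, 0.3743, 0.7673],
    [0.1434, 0.1716, 0.0000, 0.0649, 0.3582, 0.3385, 0.7522],
    [0.1456, 0.1743, 0.0649, 0.0000, 0.3299, 0.2915, 0.7116],
    [0.3213, 0.3253, 0.3582, 0.3299, 0.0000, 0.3279, 0.6653],
    [0.3213, 0.3743, 0.3385, 0.2915, 0.3279, 0.0000, 0.5721],
    [0.7018, 0.7673, 0.7522, 0.7116, 0.6653, 0.5721, 0.0000],
]

def nj_pair(dist):
    """Return (Q, i, j) for the pair of taxa that NJ would join first."""
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    return best

q, i, j = nj_pair(D)
print(NAMES[i], NAMES[j], round(q, 4))
```

On this matrix the first pair joined is Chicken and Frog, not Rat and Mouse, even though Rat and Mouse have the smallest raw distance. The row-sum terms reward pairs that are both far from everything else, which is how NJ differs from simple clustering such as UPGMA.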
19
Maximum Parsimony
  • Occam's Razor
  • Entia non sunt multiplicanda praeter
    necessitatem. (Entities should not be multiplied
    beyond necessity.)
  • William of Occam (1300-1349)

The best tree is the one which requires the fewest
substitutions
20
  • Check each topology
  • Count the minimum number of changes required to
    explain the data
  • Choose the tree with the smallest number of
    changes
  • Usually performs well with closely related
    sequences but often performs badly with very
    distantly related sequences
  • With distantly related sequences, homoplasy (the
    same state arising independently in different
    lineages) becomes a major problem
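
The "count the minimum number of changes" step can be sketched with Fitch's algorithm, a standard way to score one fixed topology. The four-taxon alignment below is hypothetical toy data, and the nested-tuple tree representation is an assumption made for this example.

```python
def fitch_site(tree, states):
    """Fitch's algorithm for one alignment column.

    tree   -- nested tuples of leaf names, e.g. (("A", "B"), ("C", "D"))
    states -- dict mapping each leaf name to its character at this site
    Returns (possible ancestral states, minimum number of changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:  # children agree on at least one state: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the per-site minimum number of changes over all columns."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[site] for name, seq in alignment.items()})[1]
        for site in range(length)
    )

# Hypothetical toy alignment (not from the slides).
ALIGNMENT = {"A": "AGCTA", "B": "AGCCT", "C": "GACTT", "D": "GACCT"}

# The three possible unrooted topologies for four taxa.
TREES = [
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    (("A", "D"), ("B", "C")),
]

for tree in TREES:
    print(tree, parsimony_score(tree, ALIGNMENT))
best = min(TREES, key=lambda t: parsimony_score(t, ALIGNMENT))
```

With four taxa there are only three topologies to check. The number grows super-exponentially with the number of taxa, which is why real parsimony searches rely on heuristics.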

21
Maximum Likelihood
  • Requires a model of evolution
  • Each substitution has an associated likelihood
    given a branch of a certain length
  • A function is derived to represent the likelihood
    of the data given the tree, branch-lengths and
    additional parameters
  • The function is maximized (equivalently, the
    negative log-likelihood is minimized)
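
As a minimal sketch of the likelihood idea, the simplest possible case is two aligned sequences joined by a single branch under the Jukes-Cantor model. The code searches a grid for the branch length with the highest likelihood (equivalently, the lowest negative log-likelihood). Real programs optimise topology, all branch lengths and model parameters together; the site counts and function names below are assumptions made for this example.

```python
import math

def jc69_log_likelihood(d, n_same, n_diff):
    """Log-likelihood of two aligned sequences separated by branch length d
    (expected substitutions per site) under the Jukes-Cantor model."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # a site ends in the same state it started in
    p_diff = 0.25 - 0.25 * e   # a site ends in one specific different state
    # 0.25 is the equilibrium frequency of the starting nucleotide.
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)

def ml_branch_length(n_same, n_diff, step=0.0001, upper=2.0):
    """Grid search for the branch length that maximises the likelihood."""
    candidates = [step * k for k in range(1, int(upper / step) + 1)]
    return max(candidates, key=lambda d: jc69_log_likelihood(d, n_same, n_diff))

# Hypothetical data: 100 aligned sites, 10 of which differ.
best = ml_branch_length(n_same=90, n_diff=10)
print(round(best, 4))
```

For this two-sequence case the maximum coincides with the Jukes-Cantor corrected distance for p = 0.1, so the likelihood and distance approaches agree when the model is the same.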

22
Models can be made more parameter rich to
increase their realism
  • The most common additional parameters are:
  • A correction to allow different substitution
    rates for each type of nucleotide change
  • A correction for the proportion of sites which
    are unable to change
  • A correction for variable site rates at those
    sites which can change
  • The values of the additional parameters will be
    estimated in the process (e.g. PAUP)

23
A gamma distribution can be used to model site
rate heterogeneity
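
A minimal sketch of gamma-distributed site rates. It assumes the usual convention that the distribution is scaled to a mean rate of 1, so a single shape parameter alpha controls how strongly rates vary among sites (the variance is 1/alpha). The function name and the alpha values are illustrative assumptions.

```python
import random
import statistics

def sample_site_rates(alpha, n_sites, seed=0):
    """Draw relative rates for n_sites sites from a gamma distribution
    with shape alpha and scale 1/alpha (mean 1, variance 1/alpha)."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

# Small alpha: most sites evolve very slowly and a few very fast.
# Large alpha: all sites evolve at nearly the same rate.
for alpha in (0.5, 2.0, 50.0):
    rates = sample_site_rates(alpha, 10000)
    print(alpha, round(statistics.mean(rates), 2),
          round(statistics.pvariance(rates), 2))
```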
24
Long Branches Attract
  • In a set of sequences evolving at different
    rates, the sequences evolving rapidly tend to be
    drawn together

25
Comparison of methods
  • Inconsistency
  • Neighbour Joining (NJ) is very fast but depends
    on accurate estimates of distance. This is more
    difficult with very divergent data
  • Parsimony suffers from Long Branch Attraction.
    This may be a particular problem for very
    divergent data
  • NJ can suffer from Long Branch Attraction
  • Parsimony is also computationally intensive
  • Codon usage bias can be a problem for MP and NJ
  • Maximum Likelihood is the most reliable but
    depends on the choice of model and is very slow
  • Methods may be combined

26
The Molecular Clock
  • For a given protein the rate of sequence
    evolution is approximately constant across
    lineages
  • Zuckerkandl and Pauling (1965)

This would allow speciation and duplication
events to be dated accurately based on molecular
data
Local and approximate molecular clocks are more
reasonable
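
As a worked form of the dating argument, assuming a strict clock: if both lineages accumulate substitutions at the same rate r (substitutions per site per unit time), a corrected pairwise distance d corresponds to a divergence time t, and r can be calibrated from one pair whose divergence time is known independently. The symbols t, d, r and the "cal" subscripts are notation introduced here, not taken from the slides.

```latex
t = \frac{d}{2r}, \qquad r = \frac{d_{\mathrm{cal}}}{2\, t_{\mathrm{cal}}}
```

The factor of 2 appears because changes accumulate along both lineages after the split.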
27
Relative Rate Test
  • Test whether sets of sequences are evolving at
    equal rates (local molecular clock hypothesis)

e.g. RRTree, Robinson-Rechavi
http://pbil.univ-lyon1.fr/software/rrtree.html
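
A minimal sketch of one simple relative rate test, Tajima's (1993) test for two sequences and an outgroup. This is not the method implemented in RRTree; it is swapped in here because it needs nothing but site counts. The sequences are hypothetical toy data.

```python
def tajima_relative_rate(seq_a, seq_b, outgroup):
    """Tajima's relative rate test.

    m_a counts sites where only seq_a differs (seq_b matches the outgroup);
    m_b counts sites where only seq_b differs. Under equal rates the two
    counts have the same expectation, and (m_a - m_b)^2 / (m_a + m_b) is
    approximately chi-square distributed with 1 degree of freedom.
    """
    m_a = m_b = 0
    for a, b, o in zip(seq_a, seq_b, outgroup):
        if "-" in (a, b, o):
            continue  # skip alignment gaps
        if a != b and b == o:
            m_a += 1
        elif a != b and a == o:
            m_b += 1
    if m_a + m_b == 0:
        return m_a, m_b, 0.0
    return m_a, m_b, (m_a - m_b) ** 2 / (m_a + m_b)

# Hypothetical toy data: 8 changes unique to A, 2 unique to B.
OUT = "A" * 20
SEQ_A = "C" * 8 + "A" * 12
SEQ_B = "A" * 18 + "G" * 2

m_a, m_b, chi2 = tajima_relative_rate(SEQ_A, SEQ_B, OUT)
# 3.841 is the 5% critical value for chi-square with 1 degree of freedom.
print(m_a, m_b, chi2, "reject equal rates" if chi2 > 3.841 else "cannot reject")
```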
28
Rooting the Tree
  • In an unrooted tree the direction of evolution is
    unknown
  • The root is the hypothesized ancestor of the
    sequences in the tree
  • The root can either be placed on a branch or at a
    node
  • You should start by viewing an unrooted tree

29
(No Transcript)
30
(No Transcript)
31
(No Transcript)
32
Automatic rooting
  • Many software packages will root trees
    automatically (e.g. mid-point rooting in NJPlot)
  • This must involve assumptions: BEWARE!

33
Rooting Using an Outgroup
  • 1. The outgroup should be a sequence (or set of
    sequences) known to be less closely related to
    the rest of the sequences than they are to each
    other
  • 2. It should ideally be as closely related as
    possible to the rest of the sequences while still
    satisfying condition 1
  • The root must be somewhere between the outgroup
    and the rest (either on the node or in a branch)

34
Sometimes two trees may look very different
but, in fact, differ only in the position of the
root
35
What sequences should I use for organism
phylogenies?
  • Slowly evolving / Fast evolving
  • rRNA
  • mitochondrion
  • other

36
How confident am I that my tree is correct?
  • Bootstrap values
  • Bootstrapping is a statistical technique that
    can use random resampling of data to determine
    sampling error for tree topologies

37
Bootstrapping phylogenies
  • Characters are resampled with replacement to
    create many bootstrap replicate data sets
  • Each bootstrap replicate data set is analysed
    (e.g. with parsimony, distance, ML etc.)
  • Agreement among the resulting trees is summarized
    with a majority-rule consensus tree
  • Frequencies of occurrence of groups, bootstrap
    proportions (BPs), are a measure of support for
    those groups
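
The resampling steps above can be sketched as follows. To keep the example short, the "analysis" applied to each replicate is a deliberately crude stand-in (report the pair of sequences with the fewest differences) rather than a real tree-building method, and the alignment is hypothetical toy data. All function names are made up for this example.

```python
import random
from collections import Counter
from itertools import combinations

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

def closest_pair(alignment):
    """Stand-in for tree inference: the pair of sequences with fewest differences."""
    def n_diffs(pair):
        a, b = pair
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return frozenset(min(combinations(sorted(alignment), 2), key=n_diffs))

def bootstrap_proportions(alignment, analyse, n_reps=200, seed=1):
    """Fraction of replicates in which each group is recovered."""
    rng = random.Random(seed)
    counts = Counter(analyse(bootstrap_replicate(alignment, rng))
                     for _ in range(n_reps))
    return {group: n / n_reps for group, n in counts.items()}

# Hypothetical toy alignment (not from the slides).
ALIGNMENT = {
    "A": "ACGTACGTACGTACGTACGT",
    "B": "ACGTACGTACGTACGTACGA",
    "C": "TCGAACCTACGTTCGTACGA",
    "D": "TCGAACCTAGGTTCGTTCGA",
}

support = bootstrap_proportions(ALIGNMENT, closest_pair)
for group, bp in sorted(support.items(), key=lambda kv: -kv[1]):
    print(sorted(group), bp)
```

In a real analysis `analyse` would return every group in the inferred tree, and the tallies would be summarised on a majority-rule consensus tree.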

38
Bootstrapping - an example
Ciliate SSUrDNA - parsimony bootstrap, majority-rule consensus
(Figure: consensus tree of Ochromonas (1), Symbiodinium (2),
Prorocentrum (3), Loxodes (4), Tracheloraphis (5), Spirostomum (6),
Gruberia (7), Euplotes (8) and Tetrahymena (9); the bootstrap
proportions shown on the internal branches are 100, 84, 96, 100, 100
and 100)
Wim de Grave et al. Fiocruz bioinformatics
training course
39
Bootstrapping
Majority-rule consensus (with minority components)
Wim de Grave et al. Fiocruz bioinformatics
training course
40
Bootstrap - interpretation
  • Bootstrapping is a very valuable and widely used
    technique (it is demanded by some journals)
  • BPs give an idea of how likely a given branch
    would be to be unaffected if additional data,
    with the same distribution, became available
  • BPs are not the same as confidence intervals.
    There is no simple mapping between bootstrap
    values and confidence intervals. There is no
    agreement about what constitutes a good
    bootstrap value (> 70%, > 80%, > 85%?)
  • Some theoretical work indicates that BPs can be a
    conservative estimate of confidence intervals
  • If the method used to estimate the tree is
    inconsistent, all the bootstraps in the world
    won't help you.

41
Jack-knifing
  • Jack-knifing is very similar to bootstrapping and
    differs only in the character resampling strategy
  • Jack-knifing is not as widely available or widely
    used as bootstrapping
  • Tends to produce broadly similar results
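
A minimal sketch of the jack-knife resampling step. Unlike the bootstrap, columns are drawn without replacement, so each replicate is a shortened alignment. The proportion of columns kept (50% here) is an assumption; different programs use different proportions.

```python
import random

def jackknife_replicate(alignment, rng, keep_fraction=0.5):
    """Keep a random subset of alignment columns, sampled without replacement."""
    length = len(next(iter(alignment.values())))
    n_keep = max(1, round(length * keep_fraction))
    cols = sorted(rng.sample(range(length), n_keep))  # preserve column order
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

# Hypothetical toy alignment; every column of "X" is distinct so that the
# effect of sampling without replacement is easy to see.
ALIGNMENT = {"X": "ABCDEFGHIJ", "Y": "abcdefghij"}
replicate = jackknife_replicate(ALIGNMENT, random.Random(3))
print(replicate)
```

Each replicate is then analysed and summarised in the same way as a bootstrap replicate.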

42
Likelihood-based tests of topologies
  • Kishino-Hasegawa test
  • Trees specified a priori
  • KH can be used to test whether two competing
    hypotheses have significantly different
    likelihoods
  • NB: should not be used to test trees that have
    been chosen on the basis of the data!
  • Shimodaira-Hasegawa test
  • Can be used to test confidence of ML tree
    compared to related trees (e.g. second most
    likely tree from the data)
  • Andrew Rambaut
    http://evolve.zoo.ox.ac.uk/software/shtests

43
Inferring Sequences at Ancestral Nodes
  • Maximum likelihood estimates of tree topologies
    also provide inferred sequences at ancestral
    nodes
  • Analysis of sequences at ancestral nodes and
    sequence changes along ancestral branches can
    provide information about when a novel trait or
    mutation was acquired
  • PAML (Phylogenetic Analysis using Maximum
    Likelihood)
  • Confidence intervals provided
  • Selection can be inferred

44
Coalescent models
  • Consider genetic lineages going back in time
  • Make inferences from patterns of coalescent
    events (e.g. effective population size,
    migrations etc.)
  • Improved efficiency for sequence simulations
  • Great number of software packages, including
    LAMARC (Kuhner and Felsenstein)
  • Has generated enormous interest and a large body
    of literature (for a review see Rosenberg and
    Nordborg, Nature Reviews Genetics 2002)
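
A minimal sketch of the basic coalescent, assuming the standard Kingman model with constant population size and time measured in units of 2N generations. While k lineages remain, the waiting time to the next coalescent event is exponential with rate k(k-1)/2. The function names and sample sizes are illustrative assumptions.

```python
import random

def simulate_coalescent_times(n_lineages, rng):
    """Waiting times between successive coalescent events, going back in
    time from a sample of n_lineages (time in units of 2N generations)."""
    times = []
    k = n_lineages
    while k > 1:
        rate = k * (k - 1) / 2.0      # number of pairs that could coalesce
        times.append(rng.expovariate(rate))
        k -= 1                         # two lineages merge into one
    return times

def mean_tmrca(n_lineages, n_reps=20000, seed=42):
    """Average time to the most recent common ancestor over many simulations."""
    rng = random.Random(seed)
    total = sum(sum(simulate_coalescent_times(n_lineages, rng))
                for _ in range(n_reps))
    return total / n_reps

# Theory gives an expected TMRCA of 2 * (1 - 1/n) in these units.
for n in (2, 10, 50):
    print(n, round(mean_tmrca(n), 2), round(2 * (1 - 1 / n), 2))
```

Because only the ancestors of the sample are simulated, not the whole population, this is also why coalescent-based sequence simulation is efficient.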

45
  • For an excellent review of recent progress in
    phylogenetics see:
  • Molecular phylogenetics: state-of-the-art methods
    for looking into the past
  • Whelan et al., Trends in Genetics, 2001

46
Inferring Function from Sequence Homology
High throughput Genome Annotation
47
Basic Methods
Taken from JA Eisen, Genome Research, 1998
  • Highest BLAST hit
  • uncharacterised gene is assigned the function of
    the best BLAST hit
  • An additional cut-off value may be used
  • Still very frequently used; sometimes referred
    to as first-pass annotation
  • Top Hits
  • Examine a set of top BLAST hits and consider the
    consensus function
  • COGs (Clusters of Orthologous Groups)
  • Clustering of genes into orthologous groups based
    on similarity scores

48
Phylogenomics
Taken from JA Eisen, Genome Research, 1998
  • Choose genes of interest
  • Identify homologues
  • Align sequences
  • Calculate gene tree
  • Overlay known functions onto tree
  • Infer likely function of genes of interest
  • Note: only proteins with confirmed functions
    should be used (to avoid error propagation)

49
Increased power over similarity methods. However,
increased power comes with increased costs in
terms of labour
50
Duplication and Speciation
  • Gene duplication may accelerate the rate of
    evolution and the rate of functional divergence
  • Orthologues may provide better information about
    function (for a given level of divergence) than
    paralogues
  • Software is available for automatic inference of
    gene duplication events on a phylogenetic tree
    (e.g. SDI, Seán Eddy). This can be used for
    improved automation of phylogenomics (e.g. RIO,
    Eddy et al.)
  • ref: Zmasek and Eddy, BMC Bioinformatics 2002

51
Functional Genomics
  • e.g. Xun Gu, Molecular Biology and Evolution 1999
  • Examine functional divergence after duplication
  • Type I (rates change)
  • Type II (replacement at conserved site)
  • DIVERGE program