IE68 Biological databases Phylogenetic analysis - PowerPoint PPT Presentation

Provided by: agrKule

Transcript and Presenter's Notes

1
IE68 - Biological databases: Phylogenetic analysis
2
Phylogenetic analysis
  • Phylogeny
  • a reconstruction of the evolutionary
    (genealogical) history of a group of
    organisms, genes or proteins from biological data
  • organisms: populations, species, genera, ... ->
    taxa -> operational taxonomic units (OTUs)
  • data: molecular, morphological,
    archaeological, ... -> characters
  • Phylogenetic tree
  • the graphical representation of a phylogeny
  • tree structure: phylogram, cladogram

3
Phylogenetic tree
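A tree consists of nodes connected by branches
  • Terminal nodes (A, B, C, D, E): OTUs for which we have data
  • Root (placed by outgroup or midpoint rooting):
    ancestor of all the taxa that comprise the tree
  • Polytomy: a node with more than two descendant
    branches, e.g. (C,D,E)
  • Notation: ((A,B),(C,D,E))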
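
The parenthesis notation above can be read mechanically. The following is a minimal sketch, assuming a well-formed, topology-only string without whitespace or branch lengths; the function names `parse_newick` and `polytomies` are illustrative and do not come from the lecture.

```python
def parse_newick(text):
    """Parse a topology-only parenthesis string into nested tuples."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # consume "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
            return tuple(children)
        start = pos                       # otherwise read a taxon label
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        return text[start:pos]

    return node()


def polytomies(tree):
    """Return every internal node with more than two children."""
    if isinstance(tree, str):
        return []
    found = [tree] if len(tree) > 2 else []
    for child in tree:
        found.extend(polytomies(child))
    return found


tree = parse_newick("((A,B),(C,D,E))")
print(tree)
print(polytomies(tree))
```

For the tree on this slide, the only polytomy found is the node joining C, D and E.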
4
Phylogenetics versus Phenetics
  • Phenetics: method of grouping taxa that is based
    on overall (dis)similarities of characters ->
    with no reference to evolution!
  • Phylogenetics: method of grouping taxa that is
    based on shared derived characters
    (synapomorphies) or a model of evolution

5
Why do we need phylogenies?
  • Intrinsic interest in the tree -> tree of life
  • origin of organisms

6
Why do we need phylogenies?
  • Phylogenies can also be used as tools for
    investigating other problems
  • e.g. biogeography
  • phylogeny reflects the order of separation of
    the areas the different taxa occupy

7
Why do we need phylogenies?
  • Phylogenies can also be used as tools for
    investigating other problems
  • e.g. forensic science

8
(No Transcript)
9
Phylogenetic analysis
  • Molecular Phylogenetics
  • reconstruction of the evolutionary (genealogical)
    history of a group of organisms from molecular
    data, i.e. DNA or protein sequences
  • In this lecture, we will focus on phylogenetic
    analysis of organisms based on DNA sequence data

10
Molecular phylogenetics approach
  • Step 1: PCR with primers that target cytoplasmic
    DNA or nuclear loci of taxa, followed by DNA
    sequence analysis
  • Step 2: Multiple DNA sequence alignment
  • Step 3: Phylogenetic analysis

11
PCR and DNA sequencing
  • Which loci?
  • DNA sequence information, primers, variability,
    single or low-copy, orthologous, neutral,
    recombination...
  • Gene trees versus organismal trees
  • phylogenies for genes do not always match those
    of their corresponding organisms -> analyse more
    than one gene

12
Confounding influence of gene duplication
Two types of homology: orthology (speciation) and
paralogy (gene duplication)
13
Lineage sorting and coalescence
[Figure labels: species, alleles]
14
Molecular phylogenetics approach
  • Step 1: PCR with primers that target cytoplasmic
    DNA or nuclear loci of taxa, followed by DNA
    sequence analysis
  • Step 2: Multiple DNA sequence alignment
  • Step 3: Phylogenetic analysis

15
Multiple DNA sequence alignment
  • Problem: alternative alignments
  • it is possible to align any two sequences by
    postulating some combination of gaps
    (insertions/deletions, or indels) and substitutions
  • -> which one to choose?
  • The basic task of sequence alignment is to find the
    alignment with the highest similarity, smallest
    distance, or lowest overall cost

16
Multiple DNA sequence alignment
  • 2 sequences + scoring scheme -> optimal alignment
  • Scoring scheme
  • - scoring matrix: distance weights or similarity
    scores for each pair of aligned bases, e.g.
    transition/transversion matrix

          A   T   G   C
      A   0   5   1   5
      T   5   0   5   1
      G   1   5   0   5
      C   5   1   5   0

  • - gap weight, cost or penalty
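
How a scoring matrix and a gap cost together determine an optimal pairwise alignment can be sketched with the standard dynamic-programming (Needleman-Wunsch style) recurrence. This is an illustration only: the gap cost of 4 is an assumed value that does not come from the slide, gaps are charged linearly, and the function name is hypothetical.

```python
# Transition/transversion distance weights from the slide.
COST = {
    "A": {"A": 0, "T": 5, "G": 1, "C": 5},
    "T": {"A": 5, "T": 0, "G": 5, "C": 1},
    "G": {"A": 1, "T": 5, "G": 0, "C": 5},
    "C": {"A": 5, "T": 1, "G": 5, "C": 0},
}


def min_alignment_cost(a, b, gap=4):
    """Lowest total cost of a global alignment of a and b.

    gap is an assumed linear cost per gap position (hypothetical value).
    """
    rows, cols = len(a) + 1, len(b) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = i * gap                # a[:i] aligned against gaps only
    for j in range(1, cols):
        dp[0][j] = j * gap                # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = min(
                dp[i - 1][j - 1] + COST[a[i - 1]][b[j - 1]],  # (mis)match
                dp[i - 1][j] + gap,                           # gap in b
                dp[i][j - 1] + gap,                           # gap in a
            )
    return dp[-1][-1]


print(min_alignment_cost("AGCGTCT", "AGCGTGT"))
```

The two example sequences differ by a single C/G transversion, so the cheapest alignment has no gaps and costs one transversion weight.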

17
Multiple DNA sequence alignment
  • Cost of an alignment: $D = s + w g$
  • $s$ = number of substitutions, $g$ = total length of
    gaps
  • $w$ = gap penalty = cost of a gap relative to a
    substitution
  • The gap penalty $w$ makes implicit assumptions about
    how the sequences have evolved
  • if indels are thought to be rare, then $w$ should
    be large (and vice versa)
  • -> have to use knowledge of biology, e.g.
    translation (3 bp indel, position),
    transition versus transversion, ...
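
The effect of the gap penalty can be made concrete by computing the cost, substitutions plus weighted gap length, for two alternative alignments of the same pair of sequences. This is a sketch under stated assumptions: the aligned rows use `-` for a gap position, every substitution costs 1, and the two alignments are made-up illustrations.

```python
def alignment_cost(row_a, row_b, w):
    """Return (s, g, D) for two aligned rows; D = s + w * g."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    s = g = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            g += 1                        # one more gap position
        elif x != y:
            s += 1                        # one more substitution
    return s, g, s + w * g


# Two hypothetical alignments of AGCGTCT and AGGAGT.
few_gaps = ("AGCGTCT", "AGGAGT-")         # 4 substitutions, 1 gap position
many_gaps = ("AGCG-TCT", "AG-GAGT-")      # 2 substitutions, 3 gap positions

for w in (0.5, 3):
    print(w, alignment_cost(*few_gaps, w), alignment_cost(*many_gaps, w))
```

With a small penalty the gap-rich alignment has the lower cost; with a large penalty the substitution-rich alignment wins, which is why the choice of penalty encodes an assumption about how often indels occur.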

18
Multiple DNA sequence alignment
  • Software programs, e.g. CLUSTALW (global
    alignment)
  • http://www.ebi.ac.uk/clustalw/index.html
  • The optimal alignment is not always the true
    alignment -> new developments: phylogenetic
    analysis without the multiple DNA sequence
    alignment step

19
Molecular phylogenetics approach
  • Step 1: PCR with primers that target cytoplasmic
    DNA or nuclear loci of taxa, followed by DNA
    sequence analysis
  • Step 2: Multiple DNA sequence alignment
  • Step 3: Phylogenetic analysis

20
Inferring phylogenies from DNA sequences
Sequence alignment (rows: taxa, columns: characters)
  A  ..AGCGTCT..
  B  ..AGCGTGT..
  C  ..AGGAGT..
-> Phylogenetic methods -> unrooted tree or rooted tree of A, B and C
21
Phylogenetic methods
                                   Character-based methods      Non character-based methods
Methods based on an explicit       Maximum-likelihood methods   Pairwise distance methods
model of evolution
Methods not based on an explicit   Maximum parsimony methods    -
model of evolution
22
Pairwise distance methods
  • Dissimilarity matrix: count the number of
    differences between all possible pairs of
    sequences
  • Convert dissimilarity to evolutionary distance by
    correcting for multiple events per site according
    to a certain model of evolution
  • Infer tree topology on the basis of the
    evolutionary distances by using a clustering
    algorithm or optimality criterion

Example: 3 taxa, 3 sequences
Dissimilarity matrix:
        1     2
  2   0.26
  3   0.20  0.33
Evolutionary distance matrix:
        1     2
  2   0.32
  3   0.23  0.44
-> tree
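
The first step, building the dissimilarity matrix, can be sketched as follows. The three short sequences are hypothetical and are not the data behind the matrices on the slide; the code also assumes the sequences are already aligned and contain no gaps.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def dissimilarity_matrix(seqs):
    """Map each unordered pair of taxon names to its p-distance."""
    return {
        (m, n): p_distance(seqs[m], seqs[n])
        for m, n in combinations(seqs, 2)
    }


# Hypothetical aligned sequences for three taxa.
alignment = {"1": "AGCGTCTAAG", "2": "AGCGTGTAAG", "3": "AGGGTGTTAG"}
for pair, value in dissimilarity_matrix(alignment).items():
    print(pair, value)
```

Each value is an observed proportion of differing sites; the correction to an evolutionary distance is a separate step that depends on the chosen substitution model.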
23
Models of sequence evolution
Expected difference (linear) ≠ observed difference
(not linear) -> correction
Apply a substitution model that tries to estimate
the actual number of substitutions
24
Models of sequence evolution
  • Distance correction methods: convert observed
    distances into measures that correspond to ACTUAL
    distances
  • Several methods have been proposed, all with
    different assumptions about the nature of the
    evolutionary process
  • Essentially they differ by the number of
    parameters they include
  • We can use a general framework to show how these
    models are inter-related

25
Substitution models: general framework
26
Substitution models: general framework
27
e.g. Jukes-Cantor model (JC)
  • One of the first proposed, and perhaps the simplest,
    model of evolution
  • Assumes that all four bases have equal frequency
    and that all substitutions are equally likely
  • Under this model, the distance between any two
    sequences is given by
    $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where $p$
    is the proportion of nucleotides that are
    different in the two sequences
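
The Jukes-Cantor correction is a one-line function. In this minimal sketch the function name and the input check are additions for illustration; the correction is undefined once the observed proportion of differences reaches 3/4.

```python
import math


def jc_distance(p):
    """Jukes-Cantor distance for an observed proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("JC distance is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.20, 0.26, 0.50):
    print(p, round(jc_distance(p), 3))
```

The corrected distance is always at least as large as the observed proportion, and the gap widens as sequences become more divergent, because more multiple substitutions at the same site go unobserved.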

28
e.g. Kimura 2-parameter model (K2P)
  • incorporates the observation that transitions
    accumulate more rapidly than transversions
  • assumes all four bases have equal frequencies
    but that there are 2 rate classes for
    substitutions
  • Under this model, the distance between any two
    sequences is given by
    $d = \frac{1}{2}\ln\frac{1}{1-2P-Q} + \frac{1}{4}\ln\frac{1}{1-2Q}$,
    where $P$ and $Q$ are the
    proportional differences between the two
    sequences due to transitions and transversions,
    respectively
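
The K2P distance needs the transition and transversion proportions separately. The sketch below counts both from a pair of aligned sequences and then applies the formula; it assumes gap-free sequences over A, C, G, T, and the function names and example sequences are illustrative.

```python
import math

# Purine <-> purine and pyrimidine <-> pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def transition_transversion_proportions(a, b):
    """Return (P, Q): proportions of transition and transversion sites."""
    n = len(a)
    P = sum((x, y) in TRANSITIONS for x, y in zip(a, b)) / n
    Q = sum(x != y and (x, y) not in TRANSITIONS for x, y in zip(a, b)) / n
    return P, Q


def k2p_distance(P, Q):
    """Kimura 2-parameter distance from the proportions P and Q."""
    return 0.5 * math.log(1 / (1 - 2 * P - Q)) + 0.25 * math.log(1 / (1 - 2 * Q))


# Hypothetical pair: one transition (A/G) and one transversion (G/C).
P, Q = transition_transversion_proportions("AGCTAGCTAG", "GGCTAGCTAC")
print(P, Q, round(k2p_distance(P, Q), 4))
```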

29
Substitution models
  • Other models: adding more parameters
  • Felsenstein model (F81)
  • variation in base composition -> base frequencies
    $f = [\pi_A, \pi_C, \pi_G, \pi_T]$ may vary
  • Hasegawa-Kishino-Yano (HKY) model
  • unequal base frequencies, transition/transversion
    bias
  • General reversible model (REV): unequal base
    frequencies, all six pairs of substitutions have
    different rates
  • -> ideally, we want the simplest model we can get
    away with that still yields a reasonable
    estimate

30
Substitution models
  • Assumptions of these models
  • all nucleotide sites change independently
  • base composition is at equilibrium
  • substitution rate is constant over time and in
    different lineages
  • each site in a sequence is equally likely to
    undergo substitution -> if not: a gamma
    distribution has a shape parameter that specifies
    the range of rate variation among sites (model + Γ)

31
Pairwise distance methods
  • Dissimilarity matrix: count the number of
    differences between all possible pairs of
    sequences
  • Convert dissimilarity to evolutionary distance
    by correcting for multiple events per site
    according to a certain model of evolution
  • Infer tree topology on the basis of the
    evolutionary distances by using a clustering
    algorithm

Example: 3 taxa, 3 sequences
Dissimilarity matrix:
        1     2
  2   0.26
  3   0.20  0.33
Evolutionary distance matrix:
        1     2
  2   0.32
  3   0.23  0.44
-> tree
32
Clustering methods
  • Clustering methods follow a set of steps (an
    algorithm) and arrive at a tree
  • UPGMA (Unweighted Pair Group Method using
    Arithmetic Averages): results in a rooted and
    additive tree with a molecular clock
  • Neighbor-joining: results in an unrooted and
    additive tree
  • Other approaches: least-squares, Fitch, Kitsch, ...

33
UPGMA clustering
Distance matrix:
        A   B   C
    B   2             (least differences)
    C   4   4
    D   6   6   6
Join A and B: (A:1,B:1)
Compute new distances between (AB) and other
OTUs: d(AB)C = (dAC + dBC) / 2 = 4, d(AB)D = (dAD
+ dBD) / 2 = 6
34
UPGMA clustering
Distance matrix after joining A and B:
        AB  C
    C   4
    D   6   6
Join (AB) and C: ((A:1,B:1):1,C:2)
Compute new distances between (ABC) and other
OTUs: d(ABC)D = (dAD + dBD + dCD) / 3 = 6
Join (ABC) and D: (((A:1,B:1):1,C:2):1,D:3)
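
The worked UPGMA example for A, B, C and D can be reproduced with a short script. This is a sketch, not a reference implementation: distances between clusters are averaged over all member OTUs (weighted by cluster size), ties are broken by dictionary order, and the data layout and function name are assumptions.

```python
def upgma(names, dist):
    """Cluster OTUs by UPGMA.

    dist maps pairs of names to distances. Returns a list of
    (merged_cluster, node_height) in the order the clusters were joined.
    """
    clusters = {name: 1 for name in names}            # cluster -> OTU count
    d = {frozenset(pair): v for pair, v in dist.items()}
    merges = []
    while len(clusters) > 1:
        closest = min(d, key=d.get)                   # smallest distance
        x, y = sorted(closest, key=str)               # deterministic order
        merged = (x, y)
        merges.append((merged, d[closest] / 2))       # node height = d / 2
        nx, ny = clusters.pop(x), clusters.pop(y)
        for z in clusters:                            # average over all OTUs
            dxz = d.pop(frozenset((x, z)))
            dyz = d.pop(frozenset((y, z)))
            d[frozenset((merged, z))] = (nx * dxz + ny * dyz) / (nx + ny)
        del d[closest]
        clusters[merged] = nx + ny
    return merges


distances = {
    ("A", "B"): 2,
    ("A", "C"): 4, ("B", "C"): 4,
    ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6,
}
for cluster, height in upgma("ABCD", distances):
    print(cluster, height)
```

The three joins happen at heights 1, 2 and 3, matching the depths at which A and B, then C, then D attach in the example.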
35
Clustering methods
  • UPGMA: additive and ultrametric distances ->
    assumes a molecular clock -> very sensitive to
    unequal rates of evolution! -> relative-rate test
  • Use other clustering methods for phylogeny, e.g.
    Neighbor-joining
  • Goodness-of-fit statistics: select the
    metric tree that best accounts for the observed
    distances

36
Pairwise distance methods
  • Dissimilarity matrix: count the number of
    differences between all possible pairs of
    sequences
  • Convert dissimilarity to evolutionary distance
    by correcting for multiple events per site
    according to a certain model of evolution
  • Infer tree topology on the basis of the
    evolutionary distances by using an optimality
    criterion

Example: 3 taxa, 3 sequences
Dissimilarity matrix:
        1     2
  2   0.26
  3   0.20  0.33
Evolutionary distance matrix:
        1     2
  2   0.32
  3   0.23  0.44
-> tree
37
Minimum evolution
  • Distance matrix -> unrooted metric trees
  • Each tree has a length L, which is the sum of all
    the branch lengths
  • Optimality criterion: the minimum evolution (ME)
    tree is the tree that minimizes L

38
Pairwise distance method
  • Advantages
  • very fast
  • based on a model of evolution
  • Disadvantages
  • sequence information is reduced to one number
  • branch lengths may not be biologically
    interpretable
  • most methods provide only one tree topology
  • dependent on the model of evolution used

39
Phylogenetic methods
                                   Character-based methods      Non character-based methods
Methods based on an explicit       Maximum-likelihood methods   Pairwise distance methods
model of evolution
Methods not based on an explicit   Maximum parsimony methods    -
model of evolution
40
Character-based methods
  • Character-based (discrete) methods operate
    directly on sequences, rather than on pairwise
    distances
  • Two major discrete methods
  • Maximum parsimony (MP): chooses the tree(s) that
    require the fewest evolutionary changes
  • Maximum likelihood (ML): chooses the tree(s)
    most likely to have produced the observed
    data

41
Maximum parsimony
  • Maximum parsimony infers a phylogenetic tree by
    minimizing the total number of evolutionary steps
  • Principle
  • Investigate all possible tree topologies
  • Reconstruct ancestral sequences
  • Choose topology with smallest number of steps
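
For a handful of taxa the principle can be followed literally: score every topology and keep the shortest. The sketch below counts steps with Fitch's small-parsimony algorithm, which is one common way to do the counting and is not necessarily the procedure shown in the lecture figures. Trees are written as nested 2-tuples (rooted arbitrarily, which does not change the step count for equally weighted characters), and the five-site alignment is invented for illustration.

```python
def fitch(tree, states):
    """Return (possible ancestral states, steps) for one character."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    lset, lsteps = fitch(left, states)
    rset, rsteps = fitch(right, states)
    common = lset & rset
    if common:                                   # children agree: no change
        return common, lsteps + rsteps
    return lset | rset, lsteps + rsteps + 1      # disagreement: one step


def parsimony_length(tree, alignment):
    """Total number of steps over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {taxon: seq[i] for taxon, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Hypothetical alignment of four taxa.
alignment = {"1": "AAGAA", "2": "AACGA", "3": "GACAG", "4": "GACGG"}
topologies = {
    "((1,2),(3,4))": (("1", "2"), ("3", "4")),
    "((1,3),(2,4))": (("1", "3"), ("2", "4")),
    "((1,4),(2,3))": (("1", "4"), ("2", "3")),
}
for name, tree in topologies.items():
    print(name, parsimony_length(tree, alignment))
```

The topology with the smallest total is the most parsimonious tree for these data.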

42
Maximum parsimony - principle
[Figure: possible tree topologies A, B and C for taxa 1, 2, 3 and 4]
43
Maximum parsimony - principle
44
Maximum parsimony - principle
45
Maximum parsimony - principle
46
Maximum parsimony - generalized
  • In the previous example, the cost of each substitution
    was one step -> equal weight
  • Instead, we can use different costs for different
    types of change (e.g. transitions vs
    transversions) to better match our assumptions
    about evolutionary processes -> weighted
    parsimony, according to Dollo, Wagner, Fitch, ...

47
Maximum parsimony - characters
48
Maximum parsimony search methods
  • Number of tree topologies:
    $N_u = \frac{(2n-5)!}{2^{n-3}(n-3)!}$, i.e. 3 sequences: 1 tree,
    4 seq: 3 trees, 5 seq: 15, 6 seq: 105, ... -> the more
    sequences (= taxa), the more trees ->
    computationally expensive
  • Finding optimal trees
  • Exhaustive search: limited number of taxa
    (< 10); find the minimum tree of all possible trees
  • Branch and bound: small number of taxa (< 18); find
    the minimum tree without evaluating all trees by
    discarding families of trees during tree
    construction that cannot be shorter than the
    shortest tree found so far
  • Heuristic search: large number of taxa
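
How quickly the number of unrooted topologies grows is easy to check numerically. The function below evaluates (2n-5)! / (2^(n-3) (n-3)!) for n taxa; the function name is illustrative.

```python
from math import factorial


def n_unrooted_trees(n):
    """Number of unrooted bifurcating tree topologies for n taxa (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (3, 4, 5, 6, 10, 20):
    print(n, n_unrooted_trees(n))
```

Already at 10 taxa there are more than two million topologies, which is why exhaustive search is restricted to small datasets.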

49
Maximum parsimony search methods
  • Heuristic search: explore a subset of all
    possible trees by using stepwise addition of
    taxa plus a rearrangement process (branch
    swapping); not guaranteed to find the minimal
    tree
[Figure labels: global optimum, local optimum]
50
Maximum parsimony - output
  • Consensus tree: MP can yield multiple equally
    parsimonious (optimal) trees ->
    relationships common to all the optimal trees are
    summarized with a consensus tree
  • Strict consensus: includes splits found in all
    trees
  • Majority-rule consensus: includes splits found in
    the majority of the trees (> 50%)

51
Maximum parsimony - output
  • Consistency index (CI) - Retention index (RI)
  • measures of the parsimony fit of a character to a
    tree, or of the average fit of all characters to
    a tree
  • more specifically: an index of how much homoplasy
    the constructed tree has
  • Value from 0 to 1
  • higher value -> less homoplasy

52
(No Transcript)
53
Parsimony branch support and tree stability
  • Bootstrap analysis
  • is a resampling technique used to measure
    sampling error
  • gives an idea about the reliability of branches
    and clusters
  • original dataset -> resample -> construct trees
    -> compare trees to original trees
  • > 70%: quite confident of tree topology
  • Decay index (Bremer support)
  • gives us a sense of how many steps would be
    required before a grouping collapses
  • higher value -> better branch support
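
The resampling step of a bootstrap analysis draws alignment columns with replacement until the replicate is as long as the original. In the sketch below the tree-building step is replaced by a deliberately crude stand-in (`closest_pair`, which reports only the two most similar taxa) so that the example stays short and runnable; a real analysis would build a full tree from every replicate. All names and the example alignment are hypothetical.

```python
import random
from itertools import combinations


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {t: "".join(seq[c] for c in cols) for t, seq in alignment.items()}


def closest_pair(alignment):
    """Toy stand-in for tree building: the two taxa with fewest differences."""
    def differences(pair):
        a, b = pair
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return frozenset(min(combinations(sorted(alignment), 2), key=differences))


def bootstrap_support(alignment, grouping, replicates=200, seed=1):
    """Percentage of replicates in which the grouping is recovered."""
    rng = random.Random(seed)
    hits = sum(
        closest_pair(bootstrap_alignment(alignment, rng)) == grouping
        for _ in range(replicates)
    )
    return 100 * hits / replicates


alignment = {"1": "AAGTC", "2": "AAGTC", "3": "GAGCC", "4": "GTGCA"}
print(bootstrap_support(alignment, frozenset({"1", "2"})))
print(bootstrap_support(alignment, frozenset({"3", "4"})))
```

The percentage reported for a grouping is what is usually written on the corresponding branch of the tree.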

54
Maximum parsimony
  • Advantages
  • based on shared derived characters
  • evaluates different tree topologies
  • does not reduce the information
  • Disadvantages
  • computationally intensive for large datasets
  • no correction for multiple mutations
  • sensitive to unequal rates of evolution (long
    branch attraction)

55
Phylogenetic methods
                                   Character-based methods      Non character-based methods
Methods based on an explicit       Maximum-likelihood methods   Pairwise distance methods
model of evolution
Methods not based on an explicit   Maximum parsimony methods    -
model of evolution
56
Maximum likelihood
  • Statistical method
  • If given some data D and a hypothesis H, the
    likelihood of that data is given by $L_D = \Pr(D \mid H)$,
  • which is the probability of D given H

57
Maximum likelihood
  • In the context of molecular phylogenetics
  • D is the set of sequences being compared
  • H is a phylogenetic tree
  • We want to find the likelihood of obtaining the
    observed data given the tree
  • The tree that makes the data the most probable
    evolutionary outcome is the Maximum Likelihood
    estimate of the phylogeny

58
Maximum likelihood
  • In other words: which tree is most likely to have
    yielded these sequences (observed data) under a
    given model of evolution (JC, K2P, ...)?
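
The idea can be tried on the smallest possible case: two sequences, where the "tree" is a single branch and the only unknown is its length d. This sketch assumes the Jukes-Cantor model and a simple grid search instead of a proper optimizer; function names, grid step and the counts of 100 sites with 26 differences are illustrative choices.

```python
import math


def jc_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of two aligned sequences separated by distance d (JC)."""
    e = math.exp(-4 * d / 3)
    p_same = 0.25 * (0.25 + 0.75 * e)     # one site with identical bases
    p_diff = 0.25 * (0.25 - 0.25 * e)     # one site with a specific mismatch
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)


def ml_distance(n_sites, n_diff, step=0.001, max_d=3.0):
    """Grid-search the branch length that maximizes the likelihood."""
    grid = [step * i for i in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc_log_likelihood(d, n_sites, n_diff))


best = ml_distance(n_sites=100, n_diff=26)
print(best, jc_log_likelihood(best, 100, 26))
```

For two sequences the branch length found this way agrees, up to the grid resolution, with the Jukes-Cantor distance formula applied to p = 0.26. With more taxa the same likelihood calculation is carried out over whole trees, which is what makes the method slow.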

59
Maximum likelihood
  • Advantages
  • Statistically well founded
  • Based on a model of evolution
  • Evaluates different topologies
  • Uses all sequence information
  • Often yields estimates that have lower variance
    than other methods
  • Disadvantages
  • Very slow (computationally intensive)
  • Dependent on the model of evolution used

60
Software programs for phylogenetic analysis
  • Overview: http://evolution.genetics.washington.edu/phylip/software.html
  • Most widely used software programs
  • PHYLIP: freely available (downloadable, or online at
    http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html)
  • PAUP: user-friendly but not freely available

61
Phylogenetic information on the internet
  • http://tolweb.org/tree/phylogeny.html
  • http://www.treebase.org/treebase/
  • ....

62
If you need more information
  • Jacqueline Vander Stappen
  • K.U.Leuven
  • Laboratory of Gene Technology
  • Kasteelpark Arenberg 21
  • B-3001 Leuven
  • Jacqueline.vanderstappen_at_agr.kuleuven.ac.be