Understanding Molecular Evolution - PowerPoint PPT Presentation

Slides: 61
Provided by: ktl
1
Understanding Molecular Evolution
Marco Salemi, Ph.D.
Dept. Pathology U.F. Gainesville, FL (U.S.A.)
2
Before Darwin
  • The first genealogy: "These are the generations
    of Shem: Shem was an hundred years old, and begat
    Arphaxad two years after the flood: And Shem
    lived after he begat Arphaxad five hundred years,
    and begat sons and daughters. And Arphaxad lived
    five and thirty years, and begat Salah"
    (Genesis 11)
  • Aristotle: animal-plant classification
  • Linnaeus (18th Century): familiar hierarchical
    classification scheme of life: kingdom, class,
    order, family, genus and species

3
The evolutionary thinking
  • Russel Wallace writes to Charles Darwin (June
    17th 1858)
  • Ernst Haeckel (mid-19th Century): the tree of
    life
  • The neo-synthesis (Fisher, Haldane, and Wright,
    1930-1950)

4
The molecular revolution
  • Nuttall, 1904: Serological cross-reactions to
    study phylogenetic relationships among various
    groups of animals.
  • Watson and Crick: beautiful helix!
  • Zuckerkandl and Pauling, 1965: molecular clocks.
  • Fitch and Margoliash, 1967: Construction of
    phylogenetic trees. A method based on mutation
    distances as estimated from cytochrome c
    sequences is of general applicability (Science,
    155:279-284).
  • Kimura, 1968: Evolutionary rate at the molecular
    level (Nature, 217:624-626).

The birth of molecular evolution
5
FUNDAMENTALS OF MOLECULAR EVOLUTION
6
Genetic information
  • The phenotype of an organism is always the result
    of its genetic information and the interaction
    with the environment.
  • Genetic information is stored in double-stranded
    DNA for most organisms, or in RNA for some viruses.
  • This genetic information can be coding (for
    proteins) or non-coding (e.g., rRNA, regulatory
    sequences).
  • The genetic code is universal for all organisms,
    with only a few exceptions such as the
    mitochondrial code.

7
Genetic code
  • The genetic code is a triplet code over 4 possible
    bases (A, G, C, T); thus there are 64 codons: 61
    sense codons encoding 20 amino acids, and 3
    nonsense codons acting as stop codons.
  • The genetic code is degenerate in such a way
    that mutations at the 3rd codon position (cdp)
    in only 30%, at the 2nd cdp always, and at the
    1st cdp in 96% of cases result in an amino acid
    change, causing a non-synonymous mutation. The
    other changes are silent or synonymous.
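
The percentages on this slide can be checked by brute force. The sketch below is not part of the original slides: it assumes the standard genetic code (NCBI translation table 1), starts only from the 61 sense codons, and counts a change to a stop codon as non-synonymous. The function name is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (assumed: NCBI translation table 1),
# codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def nonsynonymous_fraction_by_position():
    """For each codon position, the fraction of single-base changes in sense
    codons that alter the encoded amino acid (or create a stop codon)."""
    changed = [0, 0, 0]
    total = [0, 0, 0]
    for codon, aa in CODON_TABLE.items():
        if aa == "*":  # start only from the 61 sense codons
            continue
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                total[pos] += 1
                changed[pos] += CODON_TABLE[mutant] != aa
    return [c / t for c, t in zip(changed, total)]


if __name__ == "__main__":
    for pos, frac in enumerate(nonsynonymous_fraction_by_position(), start=1):
        print(f"codon position {pos}: {frac:.1%} of changes are non-synonymous")
```

Under these assumptions the counts come out close to the slide's figures; mitochondrial or other variant codes would give slightly different numbers.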

9
Transitions and transversions
[Figure: the four bases A, C, G and T connected by arrows marking transitions and transversions]
  • Transitions are purine-to-purine (A, G) or
    pyrimidine-to-pyrimidine (C, T) mutations:
    Pu-Pu, Py-Py
  • Transversions are purine-to-pyrimidine
    mutations or the reverse (Pu-Py, or Py-Pu).

10
Point mutations and the genetic code
  • 4 possible transitions: A↔G, C↔T
  • 8 possible transversions: A↔C, A↔T,
    G↔C, G↔T
  • Thus, if mutations were random, transversions
    would be 2 times more likely than transitions.
  • Due to steric hindrance, the opposite is true:
    transitions in general occur more often than
    transversions (2-15 times more, depending on the
    gene region and the species).
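
The 4-versus-8 count can be reproduced by enumerating all 12 directed base changes. This is only an illustrative sketch; the function names are hypothetical and not from the slides.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(src, dst):
    """Return 'transition' for Pu-Pu or Py-Py changes, else 'transversion'."""
    same_class = ({src, dst} <= PURINES) or ({src, dst} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"


def count_substitution_types():
    """Count the 12 possible directed base substitutions by type."""
    counts = {"transition": 0, "transversion": 0}
    for src, dst in permutations("ACGT", 2):
        counts[classify(src, dst)] += 1
    return counts


if __name__ == "__main__":
    print(count_substitution_types())
```

If every directed change were equally likely, the 8:4 ratio is where the "2 times more likely" figure comes from.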

12
Transversions result in more disruptive amino
acid changes.
13
Other mutations
  • Insertions and deletions (indels). Usually in
    multiples of 3 nucleotides in coding regions.
  • Recombination. Often in viruses.
  • Gene (or chromosome) duplication
  • Lateral gene transfer

14
Genetic variation in populations
  • Polymorphism: 2 (or more) variants co-exist in a
    population of organisms.
  • The variants at a particular position (or locus)
    are called alleles.
  • Diploid organisms can be homozygous (2 identical
    alleles) or heterozygous (2 different alleles) at
    a particular locus.
  • For viruses, the term quasispecies is often used.
  • The variation in a population can be described in
    allele frequencies or gene frequencies

15
Evolution and fixation of mutations
  • Evolution is the consecutive fixation of
    mutations
  • The fixation rate of such a polymorphism is in
    fact the evolutionary rate. This depends on:
  • Mutation rate
  • Generation time
  • Evolutionary forces, such as fitness, selective
    pressure, population size

16
Population genetics
  • Selective Pressure
  • Effective Population size (Ne)
  • Random Genetic drift

17

Darwinian theory (neo-synthesis)
  • A new mutation can quickly become FIXED in the
    population as a result of natural selection.
  • Natural selection: the differential reproduction
    of genetically distinct individuals or genotypes
    within a population
  • Fitness: a measure of the individual's ability
    to survive and reproduce

18
Population dynamics
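[Figure: allele frequency (from 0 to 1) plotted against time, with three trajectories: fixed mutation (frequency reaches 1), polymorphism maintained (intermediate frequency), and lost mutation (frequency reaches 0)]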
19
Selective pressure
  • Positive selective pressure: mutant is more fit
  • Negative selective pressure: mutant is less fit
  • Balancing selection: heterozygote is more fit
  • Most synonymous mutations can be considered
    neutral
  • Non-synonymous mutations are always subject to
    selective pressure

20
Effective population size
1st generation: N = 10, Ne = 5
2nd generation: N = 10, Ne = 4
3rd generation: N = 10, Ne = 3
4th generation: N = 5, Ne = 2
Bottleneck event
5th generation: N = 3, Ne = 2
Mutation event
6th generation: N = 9, Ne = 4
7th generation: N = 11
21
Deterministic or stochastic model of evolution
  • Deterministic: fixation of mutations is entirely
    dependent on selective pressure. Alleles do not
    get lost or fixed by chance (by accident).
  • Stochastic: fixation is dependent on chance
    events. The chance effect is much larger than
    selective pressure; random genetic drift plays a
    big role.
  • Whether or not selective pressure plays a role
    can be tested by comparing synonymous with
    non-synonymous rates of substitution.
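
The stochastic view can be made concrete with a small simulation. The sketch below is an illustration added here, not the presenter's method: it assumes a Wright-Fisher population of N diploid individuals (2N gene copies), non-overlapping generations, and a simple genic selection coefficient s. All names are hypothetical.

```python
import random


def wright_fisher(n_diploid, s=0.0, seed=0, max_generations=100_000):
    """Follow one new mutant allele until it is lost (0.0) or fixed (1.0).

    n_diploid: number of diploid individuals N (so 2N gene copies).
    s: selective advantage of the mutant; s = 0 is pure drift.
    Returns the list of allele frequencies, one per generation.
    """
    rng = random.Random(seed)
    copies = 2 * n_diploid
    count = 1  # a new mutation starts as a single copy
    trajectory = [count / copies]
    for _ in range(max_generations):
        if count in (0, copies):
            break
        freq = count / copies
        # Selection shifts the expected frequency; binomial sampling adds drift.
        expected = freq * (1 + s) / (1 + s * freq)
        count = sum(rng.random() < expected for _ in range(copies))
        trajectory.append(count / copies)
    return trajectory


def fixation_fraction(n_diploid, s, replicates=2000):
    """Fraction of replicate populations in which the mutant reaches fixation."""
    fixed = sum(wright_fisher(n_diploid, s, seed=i)[-1] == 1.0
                for i in range(replicates))
    return fixed / replicates
```

With s = 0 most new mutants are lost by chance alone, and only a small fraction drift to fixation; raising s relative to 1/(2N) moves the outcome toward the deterministic picture.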

22
Neo-Darwinism vs Neutral evolution
  • Neo-Darwinism:
  • Random mutations are the source of variation.
  • Selective pressure is the cause of adaptive
    evolution: survival of the fittest.
  • Neutral evolution (Kimura):
  • A majority of non-synonymous mutations are
    deleterious; there is a strong negative selective
    pressure.
  • Most mutations that become fixed are neutral;
    positive selective pressure is rarely strong
    enough to fix adaptive mutations. Random genetic
    drift is strong.

23
Adaptation by evolution in a fitness landscape
24
Environmental factors
  • Once an organism has reached the adaptive peak,
    evolution continues mainly according to the
    neutral model.
  • Changes in the environment can reshape the
    adaptive landscape, and stimulate again adaptive
    evolution.
  • Sudden changes in the environment often cause
    intense adaptive evolution and the origin of new
    species. Example: mammalian diversification
    after the disappearance of the dinosaurs.

25
Adaptive evolution
  • Neutral evolution is the major factor
    contributing to variation at the molecular level.
  • The phenotypic effects of adaptive evolution can
    be more easily appreciated when long time periods
    are considered.
  • Usually, only a minority of sites in a
    coding region are subject to positive selection

26

Mutation and evolutionary rate (1)
  • New mutation in a diploid population of N
    individuals
  • Fixation time t (Kimura and Ohta 1969):
  • $t = (2/s) \ln(2N)$ generations (s = selective
    advantage)
  • $t = 4N$ generations for neutral mutations
  • Evolutionary rate (or substitution rate), r:
    number of mutants reaching fixation per unit time
  • Mutation rate, m: rate of mutation at the DNA
    level (biochemical concept)
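
A quick numerical comparison of the two fixation times on this slide. This is a minimal sketch; the function names are hypothetical, and the population size and selective advantage in the usage note are illustrative values only.

```python
import math


def fixation_time_selected(n, s):
    """Mean generations to fixation for an advantageous mutation: (2/s) ln(2N)."""
    return (2.0 / s) * math.log(2 * n)


def fixation_time_neutral(n):
    """Mean generations to fixation for a neutral mutation: 4N."""
    return 4 * n


if __name__ == "__main__":
    n, s = 1000, 0.01  # illustrative values, not from the slides
    print(fixation_time_selected(n, s), fixation_time_neutral(n))
```

For N = 1000 and s = 0.01 the advantageous mutation is expected to fix in roughly 1500 generations, against 4000 for a neutral one.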

27

Mutation and evolutionary rate (2)
  • $q = 1/(2N)$: frequency of a new mutation
  • m: rate of neutral mutations
  • $2Nm$: number of mutants arising per generation
  • Probability of fixation p:
  • $p = \frac{1 - e^{-4Nsq}}{1 - e^{-4Ns}}$
  • When $s \to 0$, $e^{-4Nsq} \approx 1 - 4Nsq$ and
    $e^{-4Ns} \approx 1 - 4Ns$, so
  • $p = q = 1/(2N)$
  • Evolutionary rate = number of mutants arising per
    generation × probability of fixation
  • $r = 2Nm \cdot \frac{1}{2N} = m$ (Kimura, 1968)
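
The neutral limit can also be checked numerically. The sketch below evaluates the fixation probability formula from this slide and shows that, as s approaches 0, the substitution rate no longer depends on N. Function names and the values in the usage lines are assumptions for illustration.

```python
import math


def fixation_probability(n, s):
    """p = (1 - exp(-4Nsq)) / (1 - exp(-4Ns)) with q = 1/(2N)."""
    q = 1.0 / (2 * n)
    if s == 0:
        return q  # neutral limit: p = q = 1/(2N)
    # expm1(x) = exp(x) - 1 stays accurate when x is close to zero
    return math.expm1(-4 * n * s * q) / math.expm1(-4 * n * s)


def substitution_rate(n, m, s=0.0):
    """r = (mutants arising per generation) x (probability of fixation)."""
    return 2 * n * m * fixation_probability(n, s)


if __name__ == "__main__":
    m = 1e-8  # illustrative neutral mutation rate
    for n in (100, 10_000, 1_000_000):
        print(n, substitution_rate(n, m))  # stays at m whatever N is
```

The population size cancels because a larger population produces more mutants (2Nm) but gives each one a proportionally smaller chance (1/2N) of drifting to fixation.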

28

The classic molecular clock
  • The molecular clock hypothesis postulates that
    for any given macromolecule (a protein or a DNA
    sequence) the rate of evolution is approximately
    constant over time in all evolutionary lineages
    (Zuckerkandl and Pauling 1962 and 1965)

29
A global molecular clock?
The hypothesis known as global clock was based on
the observation that a linear relation seems to
exist between the number of amino acid
substitutions between homologous proteins of
different species, and the species divergence
times estimated from paleontological data.
30

Why is the molecular clock attractive ?
  • If macromolecules evolve at constant rates, they
    can be used to date species-divergence times and
    other types of evolutionary events, similar to
    the dating of geological time using radioactive
    elements
  • Phylogenetic reconstruction is much simpler under
    constant rates than under nonconstant rates
  • The degree of rate variation among lineages may
    provide much insight into the mechanisms of
    molecular evolution (e.g. Kimura 1983; Gillespie
    1991; Salemi et al., 1999).

31
Estimating ancestral divergence times under the
clock hypothesis
Knowing a divergence time T, the rate follows as
$r = d_{ac}/(2T)$; all the other divergence dates at
the internal nodes of the tree can then be
estimated using the rate r.
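
A worked example of the calibration step, assuming a strict clock. The distances and the calibration date below are hypothetical numbers chosen only to make the arithmetic visible; the function names are likewise not from the slides.

```python
def clock_rate(distance, divergence_time):
    """r = d / (2T): the distance accumulates along both lineages since the split."""
    return distance / (2.0 * divergence_time)


def divergence_time(distance, rate):
    """Invert the same relation to date another node: T = d / (2r)."""
    return distance / (2.0 * rate)


# Hypothetical numbers, for illustration only.
d_ac = 0.10    # substitutions per site between taxa a and c
t_ac = 5.0e6   # calibration: a and c diverged 5 million years ago
r = clock_rate(d_ac, t_ac)

d_ab = 0.04    # distance between a and b, whose split we want to date
t_ab = divergence_time(d_ab, r)

if __name__ == "__main__":
    print(f"rate = {r:.2e} substitutions/site/year, a-b split = {t_ab:.2e} years ago")
```

The factor 2 appears because the distance between two present-day taxa is the sum of the changes along both branches leading back to their common ancestor.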
32
Global vs Local clocks
  • The molecular clock hypothesis is in perfect
    agreement with the neutral theory of evolution
    (Kimura 1968; Kimura 1983).
  • The existence of a clock seems to be a major
    support of the neutral theory against natural
    selection
  • The global clock hypothesis is today known to be
    untrue.
  • Factors like different metabolic rates, different
    lifespan, and different efficiency in the DNA
    repair mechanisms between distantly related
    species are responsible for different
    evolutionary rates of homologous genes.

33
Local clocks
  • In general the evolutionary rate r can be
    expressed as
  • $r = f\,m$
  • f: fraction of neutral mutations
  • m: mutation rate
  • If f is constant and m is constant, the rate of
    evolution is constant (molecular clock)
  • Local molecular clocks have been demonstrated
    for a number of closely related species (Li,
    1997), human and chimp mtDNA (Ingman et al.,
    2000), and primate retroviruses (Salemi et al.,
    2000).

34
Evolutionary rates of organisms
[Figure: scale of evolutionary rates in nucleotide substitutions per site per year, running from 10^-9 to 10^-1, on which cellular genes, DNA viruses, human mtDNA and RNA viruses are placed]
35
Homology - Similarity
  • Homology:
  • The tree of life implies one origin for all life
    on earth; thus all sequences are in fact
    descended from the same sequence through
    mutations.
  • The term homology is used when two sequences
    share a common ancestor recent enough that
    this is still detectable in their sequence: an
    unambiguous sequence alignment can be obtained.
  • Similarity:
  • Any 2 sequences can be compared and the
    similarity computed (% nt or aa identity).
  • Allowing gaps, 2 non-homologous nt sequences can
    have a similarity of up to 50%; for aa sequences
    this can be up to 20%.

36
The data used for phylogenetic analysis
  • Morphological characters
  • Fossils (not for viruses)
  • Genetic data
  • AA or NT sequences
  • RFLP
  • Allele frequencies
  • ...
  • A combination of these data

37
Homoplasy
  • Arises when the linear relationship between time
    and evolution is disturbed:
  • convergent evolution (for example, HIV immune
    escape in the V3 loop)
  • sequence reversal
  • multiple hits
  • parallel substitution

38
Homology and Homoplasy
  • Homology
  • two identical nt in different sequences are
    homologous if and only if both sequences acquired
    that nt directly from a common ancestor
  • Homoplasy
  • two identical nt in different sequences are
    homoplasious when they evolved independently from
    different ancestors

39
Alignments - Positional homology
40
Pairwise sequence alignments
  • Set up a matrix with the characters of the two
    sequences
  • Score identities in the matrix with 1,
    differences with 0
  • Gap penalties (GP): gap-opening penalty (e.g. -2)
    and gap-extension penalty (e.g. -1):
    $GP = -2 + L(-1)$, with L = length of gap.
  • End gaps are scored (global alignment) or not
    (local alignment)
  • An alignment is a path through the matrix. Its
    overall score is the sum of the scores on its
    path, plus gap penalties
  • The alignment with the best score is chosen as
    the optimal alignment
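
The procedure above can be sketched as a dynamic-programming routine (Needleman-Wunsch). Note the simplifying assumption: this sketch uses a linear gap penalty of -1 per gap position rather than the separate opening and extension penalties described on the slide, and it scores end gaps, so it is a global alignment. The function name and defaults are hypothetical.

```python
def global_align(a, b, match=1, mismatch=0, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). End gaps are penalised,
    as in a global alignment.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal path.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + step:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[len(a)][len(b)], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(global_align("GAACTTA", "ACCTTT"))
```

Several paths can share the best score, so the traceback returns one optimal alignment rather than all of them.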

41
Scoring a path in the matrix
  • Take diagonal steps only
  • Don't use characters twice
  • Skipping characters results in insertions or
    deletions

    G A A C T T A
  A 0 1 1 0 0 0 1
  C 0 0 0 1 0 0 0
  C 0 0 0 1 0 0 0
  T 0 0 0 0 1 1 0
  T 0 0 0 0 1 1 0
  T 0 0 0 0 1 1 0

Gap penalty: -1
GAACTTA
-ACCTTT
Score = 4 - 1 = 3
42
Scoring another path
    G A A C T T A
  A 0 1 1 0 0 0 1
  C 0 0 0 1 0 0 0
  C 0 0 0 1 0 0 0
  T 0 0 0 0 1 1 0
  T 0 0 0 0 1 1 0
  T 0 0 0 0 1 1 0

Gap penalties: -2 and -1
GAAC-TTA
--ACCTTT
Score = 4 - 2 - 1 = 1
43
Multiple sequence alignments
  • Score all possible pairwise alignments: $D_{ij}$
  • Find the best multiple alignment, i.e. the one
    that gives the highest weighted sum of pairs (WSP)
  • $WSP = \sum_i \sum_j W_{ij} D_{ij}$
  • $W_{ij}$ is a weight that can be used e.g. to give
    less weight to overrepresented closely related
    sequences
  • ClustalW:
  • sequences are aligned in pairs to create a
    distance matrix based on their alignment score
  • scores are down-weighted according to how closely
    related they are
  • this distance matrix is used to make a guide tree
    (see phylogenetic analysis methods)
  • the guide tree is used to cluster the sequences
    during the stepwise alignment
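
The weighted sum of pairs is a plain double sum, shown here as a minimal sketch. It assumes symmetric score and weight matrices and counts each pair once (i < j); the matrices in the usage lines are made-up values, and this is not how ClustalW itself is implemented.

```python
def weighted_sum_of_pairs(scores, weights):
    """WSP = sum over all sequence pairs i < j of W[i][j] * D[i][j]."""
    n = len(scores)
    return sum(weights[i][j] * scores[i][j]
               for i in range(n) for j in range(i + 1, n))


if __name__ == "__main__":
    # Hypothetical pairwise scores for three sequences.
    d = [[0, 3, 1],
         [3, 0, 2],
         [1, 2, 0]]
    # Sequences 0 and 1 are close relatives, so their pair is down-weighted.
    w = [[1, 0.5, 1],
         [0.5, 1, 1],
         [1, 1, 1]]
    print(weighted_sum_of_pairs(d, w))
```

Down-weighting closely related pairs keeps a cluster of near-identical sequences from dominating the choice of alignment.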

44
Phylogenetic analysis: reconstructing the
history of a gene
  • Only alignments of homologous positions can be
    used for phylogenetic analysis
  • point mutations
  • indels: add gaps to achieve positional homology
  • Recombination disturbs a phylogenetic analysis,
    since the two parts of the recombined gene have
    different phylogenetic histories.
  • In case of gene duplication (paralogous genes)
    followed by speciation (orthologous genes):
    reconstruct the species tree from orthologous
    genes

45
Rooted phylogenetic tree
Branches can rotate freely. Branching order is
called topology.
External node or Operational Taxonomic Unit,
OTU (or taxon)
Internal node or Hypothetical Taxonomic Unit,
HTU (or ancestor)
[Figure: rooted tree with external nodes A to F and internal nodes G to K, with a node, a branch and the root labelled, drawn along a time axis]
46
Unrooted phylogenetic tree
  • Root node K disappeared
  • To root an unrooted tree:
  • root by outgroup, e.g. use F as outgroup
  • midpoint rooting

Monophyletic taxa
47
Coalescence time on a rooted tree
[Figure: rooted tree along a time axis, marking the most recent common ancestor (MRCA) of all taxa and the coalescence time of all taxa]
48
Trees: combinatorial explosion
  • Number of unrooted trees for n taxa:
  • $N_U = \frac{(2n-5)!}{2^{n-3}(n-3)!}$
  • Number of rooted trees for n taxa:
  • $N_R = \frac{(2n-3)!}{2^{n-2}(n-2)!}$
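
To see how fast these counts grow, the two formulas can be evaluated directly with exact integer arithmetic. A minimal sketch; the function names are hypothetical.

```python
from math import factorial


def unrooted_trees(n):
    """N_U = (2n-5)! / (2^(n-3) (n-3)!), valid for n >= 3 taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def rooted_trees(n):
    """N_R = (2n-3)! / (2^(n-2) (n-2)!), valid for n >= 2 taxa."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


if __name__ == "__main__":
    for n in (4, 10, 20):
        print(n, unrooted_trees(n), rooted_trees(n))
```

Ten taxa already allow about two million unrooted trees, which is why exhaustive tree searches are feasible only for small data sets.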

49
Phylogenetic tree using aligned sequences: gene
tree
  • When built from orthologous genes, the phylogenetic
    tree can be used to reconstruct the species tree
    (cladogram: speciation over time).
  • However, considering polymorphisms, the nodes do
    not necessarily coincide exactly with speciation.
  • Similarly for viruses: if quasispecies variation
    exists, the coalescence time in a transmission
    case does not necessarily coincide with the
    transmission time.
  • Molecular clock calculations to estimate the
    coalescence time can only be used if the clock
    holds

50
Gene divergence may predate species divergence
[Figure: timeline from past to present in which two gene splitting events precede the population splitting, with lineage pairs A B, C D and E F]
51
Gene trees vs Species trees
  • Gene and species trees may be different

52
Substitution models
  • During evolution, multiple hits can have happened
    at a single position: the evolutionary distance
    is almost always larger than the dissimilarity
    (% nt or aa divergence)

[Figure: sequence difference plotted against time/evolutionary distance. The observed difference falls below the expected difference based on the number of mutations that happened; the gap between the two curves is the correction.]
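
As one concrete instance of such a correction, the sketch below applies the Jukes-Cantor model, which the slide does not name; it is chosen here only because it is the simplest substitution model (all changes equally likely). Function names are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Observed proportion of differing sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance: d = -(3/4) ln(1 - 4p/3), for p < 0.75."""
    if p >= 0.75:
        raise ValueError("p >= 0.75: sequences are saturated under this model")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    for p in (0.05, 0.30, 0.60):
        print(f"observed {p:.2f} -> corrected {jukes_cantor(p):.3f}")
```

The corrected distance is always at least as large as the observed one, and the gap widens as sequences diverge, which matches the shape of the curves in the figure.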
53
Phylogenetic signal / phylogenetic noise
  • Before making any phylogenetic inference it is
    necessary to investigate the presence of
    phylogenetic signal in the data set of aligned
    sequences
  • WHY ?
  • Sequences might not be informative enough about
    the phylogeny of the taxa under study
    (phylogenetic signal too low)
  • Sequences could be too divergent (positional
    homology is lost)
  • Sequences could be saturated
  • The data might not have evolved according to a
    strictly bifurcating tree

54
DNA dot matrix
  • A fast way to evaluate whether or not two
    sequences can be unambiguously aligned is to
    perform a DNA dot matrix
  • Homologous regions usually show a clear diagonal
    in the dot matrix

55
DNA dot matrix
A T T C T G A
T C T G A A A
A T T C G G A
A T T C G G A
  • Sequences that can be unambiguously aligned show
    a clear diagonal in the dot matrix
  • Sequences that cannot be unambiguously aligned
    show no diagonal in the dot matrix
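
A dot matrix is simple to compute. The sketch below compares single bases, which is an assumption made for brevity; practical dot-plot tools usually compare sliding windows to suppress noise. The function names are hypothetical.

```python
def dot_matrix(seq_x, seq_y):
    """Rows follow seq_y, columns follow seq_x; True marks identical bases."""
    return [[x == y for x in seq_x] for y in seq_y]


def render(seq_x, seq_y):
    """Plain-text picture of the dot matrix, one '*' per matching cell."""
    lines = ["  " + " ".join(seq_x)]
    for y, row in zip(seq_y, dot_matrix(seq_x, seq_y)):
        lines.append(y + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Sequences taken from the slide.
    print(render("ATTCGGA", "ATTCGGA"))
    print()
    print(render("ATTCTGA", "TCTGAAA"))
```

Identical sequences fill the whole main diagonal; an offset diagonal indicates that one sequence has to be shifted relative to the other, that is, an indel.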

56
HIV/SIV DNA env dot matrix (1)

57
HIV/SIV DNA env dot matrix (2)
58
Substitution Saturation
  • If the sequences in the data set diverged long
    ago, the phylogenetic signal could be lost
  • It can be proved that two random nucleotide
    sequences will share on average 25% similarity
    just by accident if gaps are not allowed, and up
    to 50% allowing gaps. Sequences with similar base
    composition will artificially cluster together,
    irrespective of the true phylogeny
  • Third codon positions usually saturate much
    faster than first or second
  • Transitions usually saturate much faster than
    transversions
  • A fast way to check for substitution saturation
    in a set of aligned sequences is to plot the
    genetic distance for each pairwise comparison
    versus the number of transitions and
    transversions between each pair
  • Usually Ti occur more often than Tv. However,
    when the saturation point is reached, Tv tend to
    outnumber Ti
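
The data for such a saturation plot can be tabulated as follows. This is a sketch under stated assumptions: the alignment is a dictionary of equal-length sequences, the distance is the uncorrected proportion of differing sites over the alignment length, and plotting is left to whatever tool is at hand. All names are hypothetical.

```python
from itertools import combinations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ti_tv_counts(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences."""
    ti = tv = 0
    for x, y in zip(seq_a, seq_b):
        if x == y or x not in "ACGT" or y not in "ACGT":
            continue  # skip identical sites, gaps and ambiguity codes
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            ti += 1
        else:
            tv += 1
    return ti, tv


def saturation_table(alignment):
    """For every pair of sequences: (name_a, name_b, distance, Ti, Tv)."""
    rows = []
    for (name_a, a), (name_b, b) in combinations(alignment.items(), 2):
        ti, tv = ti_tv_counts(a, b)
        rows.append((name_a, name_b, (ti + tv) / len(a), ti, tv))
    return rows


if __name__ == "__main__":
    # Toy alignment, for illustration only.
    toy = {"s1": "AACCGT", "s2": "GATCGA", "s3": "AACCGA"}
    for row in saturation_table(toy):
        print(row)
```

Plotting Ti and Tv against the distance for all pairs shows saturation as a flattening of the transition curve while transversions keep increasing.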

59
[Figure: genetic distance d plotted against divergence time (Mya; axis ticks at 5, 10 and 15), with separate curves for transitions (A↔G, C↔T) and transversions (A↔C, T; C↔A, G)]
60
Implicit and explicit models of evolution
  • Distance matrix methods and maximum likelihood
    methods can correct for homoplasy by defining
    evolutionary parameters such as a nucleotide
    substitution matrix (additional parameters can
    also be e.g. rate heterogeneity). These methods
    therefore use an explicit model of evolution
    whose parameters can be estimated from the data.
  • Parsimony methods cannot correct for homoplasy;
    they use an implicit model of evolution: they
    assume that the most parsimonious tree (that is,
    the tree with the smallest number of mutations)
    is evolutionarily also the most likely one.