Title: Understanding Molecular Evolution
1Understanding Molecular Evolution
Marco Salemi, Ph.D.
Dept. Pathology U.F. Gainesville, FL (U.S.A.)
2Before Darwin
- The first genealogy: "These are the generations of Shem: Shem was an hundred years old, and begat Arphaxad two years after the flood: And Shem lived after he begat Arphaxad five hundred years, and begat sons and daughters. And Arphaxad lived five and thirty years, and begat Salah" (Genesis 11)
- Aristotle: animal-plants classification
- Linnaeus (18th century): the familiar hierarchical classification scheme of life (kingdom, class, order, family, genus and species)
3The evolutionary thinking
- Alfred Russel Wallace writes to Charles Darwin (June 17th 1858)
- Ernst Haeckel (mid-19th century): the tree of life
- The neo-synthesis (Fisher, Haldane, and Wright, 1930-1950)
4The molecular revolution
- Nuttall, 1904: serological cross-reactions to study phylogenetic relationships among various groups of animals.
- Watson and Crick: beautiful helix!
- Zuckerkandl and Pauling, 1965: molecular clocks.
- Fitch and Margoliash, 1967: Construction of phylogenetic trees. A method based on mutation distances as estimated from cytochrome c sequences is of general applicability (Science, 155:279-284).
- Kimura, 1968: Evolutionary rate at the molecular level (Nature, 217:624-626).
The birth of molecular evolution
5FUNDAMENTALS OF MOLECULAR EVOLUTION
6Genetic information
- The phenotype of an organism is always the result of its genetic information and the interaction with the environment.
- Genetic information is stored in double-stranded DNA for most organisms, or RNA for some viruses.
- This genetic information can be coding (for proteins) or non-coding (e.g., rRNA, regulatory sequences).
- The genetic code is universal for all organisms, with only a few exceptions such as the mitochondrial code.
7Genetic code
- The genetic code is read in triplets of 4 possible bases (A, G, C, T), so there are 64 codons: 61 sense codons encoding 20 amino acids, and 3 nonsense (stop) codons.
- The genetic code is degenerate in such a way that mutations at the 3rd codon position (cdp) result in an amino acid change in only about 30% of cases, at the 2nd cdp always, and at the 1st cdp in about 96% of cases. These are non-synonymous mutations; the other changes are silent, or synonymous.
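The per-position percentages above can be checked by enumerating every single-base change in the standard genetic code. The following is a minimal sketch, assuming the standard (nuclear) code and counting a change from a sense codon to a stop codon as non-synonymous; the helper name `nonsynonymous_fraction` is hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def nonsynonymous_fraction(position):
    """Fraction of single-base changes at a codon position (0, 1 or 2)
    that alter the encoded amino acid, taken over the 61 sense codons."""
    changed = total = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # only sense codons are used as starting points
        for base in BASES:
            if base == codon[position]:
                continue
            mutant = codon[:position] + base + codon[position + 1:]
            total += 1
            # Assumption: a mutation to a stop codon counts as non-synonymous.
            changed += CODON_TABLE[mutant] != aa
    return changed / total


fractions = [nonsynonymous_fraction(p) for p in range(3)]
for label, frac in zip(("1st", "2nd", "3rd"), fractions):
    print(f"{label} codon position: {frac:.0%} non-synonymous")
```

Under these assumptions the fractions come out at roughly 96%, 100% and 31% for the 1st, 2nd and 3rd codon positions, in line with the figures quoted on the slide.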
9Transitions and transversions
[Figure: the four bases A, C, G and T connected by arrows marking transitions and transversions]
- Transitions are purine-to-purine (A, G) or pyrimidine-to-pyrimidine (C, T) mutations (Pu-Pu, Py-Py).
- Transversions are purine-to-pyrimidine mutations or the reverse (Pu-Py or Py-Pu).
10Point mutations and the genetic code
- 4 possible transitions: A↔G, C↔T (each direction counted separately)
- 8 possible transversions: A↔C, A↔T, G↔C, G↔T (each direction counted separately)
- Thus, if mutations were random, transversions would be 2 times more likely than transitions.
- Due to steric hindrance the opposite is true: transitions generally occur more often than transversions (2-15 times more, depending on the gene region and the species).
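The classification of point mutations above is easy to express in code. This is a minimal sketch, assuming two pre-aligned DNA sequences of equal length; the function names `classify` and `count_ti_tv` are hypothetical, and gaps or ambiguous characters are simply skipped.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(a, b):
    """Return 'transition', 'transversion', or None (identical or non-ACGT pair)."""
    if a == b or not ({a, b} <= (PURINES | PYRIMIDINES)):
        return None
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def count_ti_tv(seq1, seq2):
    """Count transitions and transversions between two aligned sequences."""
    ti = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        kind = classify(a, b)
        if kind == "transition":
            ti += 1
        elif kind == "transversion":
            tv += 1
    return ti, tv


# Enumerating all 12 directed changes reproduces the 4 : 8 split.
changes = [(a, b) for a in "ACGT" for b in "ACGT" if a != b]
n_ti = sum(classify(a, b) == "transition" for a, b in changes)
n_tv = sum(classify(a, b) == "transversion" for a, b in changes)
print(n_ti, n_tv)
```

Applied to real alignments, the ratio `ti / tv` from `count_ti_tv` is what typically exceeds 1 despite the 4 : 8 split among possible changes.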
12Transversions result in more disruptive amino
acid changes.
13Other mutations
- Insertions and deletions (indels). Usually in multiples of 3 nucleotides in coding regions.
- Recombination. Often in viruses.
- Gene (or chromosome) duplication
- Lateral gene transfer
14Genetic variation in populations
- Polymorphism: 2 (or more) variants co-exist in a population of organisms.
- The variants at a particular position (or locus) are called alleles.
- Diploid organisms can be homozygous (2 identical alleles) or heterozygous (2 different alleles) at a particular locus.
- For viruses, the term quasispecies is often used.
- The variation in a population can be described in allele frequencies or gene frequencies.
15Evolution and fixation of mutations
- Evolution is the consecutive fixation of mutations.
- The fixation rate of such a polymorphism is in fact the evolutionary rate. This is dependent on:
- Mutation rate
- Generation time
- Evolutionary forces, such as fitness, selective pressure, population size
16Population genetics
- Selective Pressure
- Effective Population size (Ne)
- Random Genetic drift
17Darwinian theory (neo-synthesis)
- A new mutation can quickly become FIXED in the population as a result of natural selection.
- Natural selection: the differential reproduction of genetically distinct individuals or genotypes within a population.
- Fitness: a measure of the individual's ability to survive and reproduce.
18Population dynamics
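[Figure: allele frequency (0 to 1) plotted against time, with three example outcomes for a mutation: fixed mutation (frequency 1), polymorphism maintained, and lost mutation (frequency 0)]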
19Selective pressure
- Positive selective pressure: mutant is more fit
- Negative selective pressure: mutant is less fit
- Balancing selection: heterozygote is more fit
- Most synonymous mutations can be considered neutral
- Non-synonymous mutations are always subject to selective pressure
20Effective population size
1st generation: N = 10, Ne = 5
2nd generation: N = 10, Ne = 4
3rd generation: N = 10, Ne = 3
4th generation: N = 5, Ne = 2
Bottleneck event
5th generation: N = 3, Ne = 2
Mutation event
6th generation: N = 9, Ne = 4
7th generation: N = 11
21Deterministic or stochastic model of evolution
- Deterministic: fixation of mutations is entirely dependent on selective pressure. Alleles do not get lost or fixed by chance (by accident).
- Stochastic: fixation is dependent on chance events. The chance effect is much larger than selective pressure; random genetic drift plays a big role.
- Whether or not selective pressure plays a role can be tested by comparing synonymous with non-synonymous rates of substitution.
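The stochastic view can be illustrated with a small simulation of random genetic drift. The slides do not name a specific model; the sketch below assumes a neutral Wright-Fisher population (constant size, non-overlapping generations, random sampling of gene copies), and the function name and parameters are hypothetical.

```python
import random


def wright_fisher(n_diploid, start_freq, generations, seed=0):
    """Simulate the frequency of a neutral allele under random genetic drift.

    Each generation, the 2N gene copies are resampled at random from the
    previous generation (binomial sampling), with no selection.
    """
    rng = random.Random(seed)
    copies = 2 * n_diploid
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        count = sum(rng.random() < freq for _ in range(copies))
        freq = count / copies
        trajectory.append(freq)
    return trajectory


# Small populations drift quickly: the allele tends to be lost (0) or fixed (1).
for seed in range(3):
    traj = wright_fisher(n_diploid=10, start_freq=0.5, generations=200, seed=seed)
    print(seed, traj[-1])
```

Once the frequency hits 0 or 1 it stays there, which corresponds to the "lost mutation" and "fixed mutation" outcomes; repeating the run with a larger `n_diploid` shows how drift weakens as population size grows.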
22Neo-Darwinism vs Neutral evolution
- Neo-Darwinism
- Random mutations are the source of variation.
- Selective pressure is the cause of adaptive evolution: survival of the fittest.
- Neutral evolution (Kimura)
- A majority of non-synonymous mutations are deleterious; there is a strong negative selective pressure.
- Most mutations that become fixed are neutral; positive selective pressure is rarely strong enough to fix adaptive mutations. Random genetic drift is strong.
23Adaptation by evolution in a fitness landscape
24Environmental factors
- Once an organism has reached the adaptive peak, evolution continues mainly according to the neutral model.
- Changes in the environment can reshape the adaptive landscape and stimulate adaptive evolution again.
- Sudden changes in the environment often cause intense adaptive evolution and the origin of new species. Example: mammalian diversification after the disappearance of dinosaurs.
25Adaptive evolution
- Neutral evolution is the major factor contributing to variation at the molecular level.
- The phenotypic effects of adaptive evolution can be more easily appreciated when long time periods are considered.
- Usually, only a minority of sites in a coding region are subject to positive selection.
26Mutation and evolutionary rate (1)
- New mutation in a diploid population of N individuals
- Fixation time t (Kimura and Ohta 1969)
- $t = (2/s)\ln(2N)$ generations ($s$ = selective advantage)
- $t = 4N$ generations for neutral mutations
- Evolutionary rate (or substitution rate), r
- number of mutants reaching fixation per unit time
- Mutation rate, m
- rate of mutation at the DNA level (biochemical concept)
27Mutation and evolutionary rate (2)
- $q = 1/(2N)$: frequency of a new mutation
- $m$: rate of neutral mutations
- $2Nm$: number of mutants arising per generation
- Probability of fixation, $p$:
- $p = \dfrac{1 - e^{-4Nsq}}{1 - e^{-4Ns}}$
- When $s \to 0$, $e^{-4Nsq} \approx 1 - 4Nsq$ and $e^{-4Ns} \approx 1 - 4Ns$
- so $p = q = 1/(2N)$
- Evolutionary rate = number of mutants arising per generation × probability of fixation
- $r = 2Nm \times \dfrac{1}{2N} = m$ (Kimura, 1968)
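The fixation probability, the neutral limit and the resulting evolutionary rate can be checked numerically. This is a minimal sketch of the formulas on these two slides, with hypothetical function names; `math.expm1` is used only to keep the ratio accurate when s is very small.

```python
import math


def fixation_probability(s, n, q=None):
    """p = (1 - exp(-4Nsq)) / (1 - exp(-4Ns)); q defaults to 1/(2N)."""
    if q is None:
        q = 1 / (2 * n)  # frequency of a single new mutant in a diploid population
    if s == 0:
        return q  # neutral limit, p = q
    # expm1(x) = exp(x) - 1, so the two minus signs cancel in the ratio.
    return math.expm1(-4 * n * s * q) / math.expm1(-4 * n * s)


def substitution_rate(m, s, n):
    """r = (mutants arising per generation, 2Nm) x (probability of fixation)."""
    return 2 * n * m * fixation_probability(s, n)


def fixation_time(s, n):
    """Fixation time in generations: 4N if neutral, (2/s) ln(2N) otherwise."""
    if s == 0:
        return 4 * n
    return (2 / s) * math.log(2 * n)


N = 1000   # hypothetical population size
m = 1e-8   # hypothetical neutral mutation rate
print(fixation_probability(0, N))      # neutral: 1/(2N)
print(substitution_rate(m, 0, N))      # neutral: r equals m, independent of N
print(fixation_probability(0.01, N))   # advantageous: close to 2s
print(fixation_time(0.01, N), fixation_time(0, N))
```

For the neutral case the population size cancels out, which is the core of Kimura's result r = m.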
28The classic molecular clock
- The molecular clock hypothesis postulates that
for any given macromolecule (a protein or a DNA
sequence) the rate of evolution is approximately
constant over time in all evolutionary lineages
(Zuckerkandl and Pauling 1962 and 1965)
29A global molecular clock?
The hypothesis known as the global clock was based on the observation that a linear relation seems to exist between the number of amino acid substitutions between homologous proteins of different species and the species divergence times estimated from paleontological data.
30Why is the molecular clock attractive ?
- If macromolecules evolve at constant rates, they can be used to date species-divergence times and other types of evolutionary events, similar to the dating of geological time using radioactive elements.
- Phylogenetic reconstruction is much simpler under constant rates than under nonconstant rates.
- The degree of rate variation among lineages may provide much insight into the mechanisms of molecular evolution (e.g. Kimura 1983; Gillespie 1991; Salemi et al., 1999).
31Estimating ancestral divergence times under the
clock hypothesis
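Knowing a divergence time T, the rate is $r = d_{ac}/(2T)$, with $d_{ac}$ the genetic distance between the two taxa a and c that diverged at time T. All the other divergence dates at the internal nodes of the tree can then be estimated using the rate r.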
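The calibration step can be written out as a two-line calculation. This is a minimal sketch assuming a strict clock; the distances and the calibration date below are hypothetical numbers chosen for illustration, not values from the slides.

```python
import math


def clock_rate(distance, calibration_time):
    """r = d / (2T): both lineages accumulate changes for T years after the split."""
    return distance / (2 * calibration_time)


def date_node(distance, rate):
    """Invert the same relation for an undated node: T = d / (2r)."""
    return distance / (2 * rate)


# Hypothetical calibration: taxa a and c differ by 0.10 substitutions per site
# and are known to have diverged 5 million years ago.
r = clock_rate(0.10, 5e6)

# Hypothetical second pair differing by 0.04 substitutions per site.
t = date_node(0.04, r)
print(r, t)
```

The factor 2 reflects that the distance between two present-day sequences spans both branches back to their common ancestor. The estimate is only meaningful if the clock holds for the lineages involved.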
32Global vs Local clocks
- The molecular clock hypothesis is in perfect agreement with the neutral theory of evolution (Kimura 1968; Kimura 1983).
- The existence of a clock seems to be a major support of the neutral theory against natural selection.
- The global clock hypothesis is today known to be untrue.
- Factors like different metabolic rates, different lifespans, and different efficiency of the DNA repair mechanisms between distantly related species are responsible for different evolutionary rates of homologous genes.
33Local clocks
- In general, the evolutionary rate r can be expressed as:
- $r = f \cdot m$
- $f$: fraction of neutral mutations
- $m$: mutation rate
- If f is constant and m is constant, the rate of evolution is constant (molecular clock).
- Local molecular clocks have been demonstrated for a number of closely related species (Li, 1997), human and chimp mtDNA (Ingman et al., 2000), and primate retroviruses (Salemi et al., 2000).
34Evolutionary rates of organisms
[Figure: evolutionary rates in nucleotide substitutions per site per year, on a logarithmic scale from $10^{-9}$ to $10^{-1}$, for cellular genes, RNA viruses, DNA viruses and human mtDNA]
35Homology - Similarity
- Homology
- The tree of life implies one origin for all life on earth, thus all sequences are in fact descended from the same sequence through mutations.
- The term homology is used when two sequences share a common ancestor recent enough that this is still detectable in their sequence. An unambiguous sequence alignment can be obtained.
- Similarity
- Any 2 sequences can be compared and the similarity computed (% nt or aa identity).
- Allowing gaps, 2 non-homologous nt sequences can have a similarity of up to 50%; for aa sequences this can be up to 20%.
36The data used for phylogenetic analysis
- Morphological characters
- Fossils (not for viruses)
- Genetic data
- AA or NT sequences
- RFLP
- Allele frequencies
- ...
- A combination of these data
37Homoplasy
- When the linear relationship between time and evolution is disturbed:
- convergent evolution (for example, HIV immune escape in the V3 loop)
- sequence reversal
- multiple hits
- parallel substitution
38Homology and Homoplasy
- Homology
- two identical nt in different sequences are
homologous if and only if both sequences acquired
that nt directly from a common ancestor
- Homoplasy
- two identical nt in different sequences are
homoplasious when they evolved independently from
different ancestors
39Alignments - Positional homology
40Pairwise sequence alignments
- Set up a matrix with the characters of the two sequences.
- Score identities in the matrix with 1, differences with 0.
- Gap penalties (GP): open gap penalty (e.g. -2) and extending gap penalty (e.g. -1). $GP = -2 + L \times (-1)$, with L = length of gap.
- End gaps are scored (global alignment) or not (local alignment).
- An alignment is a path through the matrix. Its overall score is the sum of the scores on its path, plus gap penalties.
- The alignment with the best score is chosen as the optimal alignment.
41Scoring a path in the matrix
- Take diagonal steps only
- Don't use characters twice
- Skipping characters results in insertions or deletions
      G A A C T T A
  A   0 1 1 0 0 0 1
  C   0 0 0 1 0 0 0
  C   0 0 0 1 0 0 0
  T   0 0 0 0 1 1 0
  T   0 0 0 0 1 1 0
  T   0 0 0 0 1 1 0
Gap penalty: -1
Alignment:
  GAACTTA
  -ACCTTT
Score: 4 - 1 = 3
42Scoring another path
      G A A C T T A
  A   0 1 1 0 0 0 1
  C   0 0 0 1 0 0 0
  C   0 0 0 1 0 0 0
  T   0 0 0 0 1 1 0
  T   0 0 0 0 1 1 0
  T   0 0 0 0 1 1 0
Gap penalties: -2 and -1
Alignment:
  GAAC-TTA
  --ACCTTT
Score: 4 - 2 - 1 = 1
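Finding the best-scoring path through the matrix is usually done by dynamic programming. The sketch below is a Needleman-Wunsch style global alignment using the identity scoring from the slides (1 for a match, 0 for a difference). As a simplifying assumption it charges a flat -1 per gap position, including end gaps, instead of separate open and extension penalties; the function name is hypothetical.

```python
def global_align(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment with a linear gap cost; returns (score, row_a, row_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace one optimal path back from the bottom-right corner.
    i, j = n, m
    row_a, row_b = [], []
    while i > 0 or j > 0:
        step = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = global_align("GAACTTA", "ACCTTT")
print(best)
print(top)
print(bottom)
```

For the two example sequences the best score under this simplified scheme is 3 (four identities and one gap position). With separate open and extension penalties the recursion needs additional state per cell, which this sketch deliberately leaves out.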
43Multiple sequence alignments
- Score all possible pairwise alignments: $D_{ij}$
- Find the best multiple alignment that gives the highest weighted sum of pairs (WSP)
- $WSP = \sum_{i} \sum_{j} W_{ij} D_{ij}$
- $W_{ij}$ is a weight that can be assigned, e.g. to give less weight to overrepresented closely related sequences
- ClustalW
- sequences are aligned in pairs to create a distance matrix based on their alignment score
- scores are down-weighted according to how closely related they are
- this distance matrix is used to make a guide tree (see phylogenetic analysis methods)
- the guide tree is used to cluster the sequences during the stepwise alignment
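The weighted sum of pairs can be computed directly for a candidate multiple alignment. This is a minimal sketch under stated assumptions: the pairwise score is taken as the number of identical residues in the induced pairwise alignment, gap penalties are ignored, each unordered pair is counted once, and the weights are supplied by the caller. It illustrates the objective only and is not how ClustalW is implemented.

```python
from itertools import combinations


def pair_score(row_a, row_b):
    """Identities (1) vs differences (0) over columns where both rows have a residue."""
    return sum(a == b for a, b in zip(row_a, row_b) if a != "-" and b != "-")


def weighted_sum_of_pairs(alignment, weights=None):
    """Sum of W_ij * D_ij over all pairs i < j of aligned rows."""
    total = 0.0
    for i, j in combinations(range(len(alignment)), 2):
        w = 1.0 if weights is None else weights[(i, j)]
        total += w * pair_score(alignment[i], alignment[j])
    return total


# Hypothetical three-sequence alignment; rows 0 and 2 are identical, so their
# pair is down-weighted to reduce the influence of the overrepresented sequence.
msa = ["GAACTTA", "-ACCTTT", "GAACTTA"]
print(weighted_sum_of_pairs(msa))
print(weighted_sum_of_pairs(msa, {(0, 1): 1.0, (0, 2): 0.5, (1, 2): 1.0}))
```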
44Phylogenetic analysis reconstructing the
history of a gene
- Only alignments of homologous positions can be used for phylogenetic analysis:
- point mutations
- indels: add gaps to achieve positional homology
- Recombinations disturb a phylogenetic analysis, since the two parts of the recombined gene have a different phylogenetic history.
- In case of gene duplication (paralogous genes) followed by speciation (orthologous genes): reconstruct the species tree from orthologous genes.
45Rooted phylogenetic tree
Branches can rotate freely. Branching order is called topology.
External node, or Operational Taxonomic Unit (OTU, or taxon)
Internal node, or Hypothetical Taxonomic Unit (HTU, or ancestor)
[Figure: rooted tree drawn along a time axis, with nodes labelled A to K and the root, an internal node and a branch marked]
46Unrooted phylogenetic tree
- Root node K disappeared
- To root an unrooted tree
- root by outgroup, e.g. use F as outgroup
- midpoint rooting
Monophyletic taxa
47Coalescence time on a rooted tree
Most recent common ancestor of all taxa (MRCA)
TIME
Coalescence time of all taxa
48Trees combinatorial explosions
- Number of unrooted trees for n taxa
- $N_U = \dfrac{(2n-5)!}{2^{n-3}\,(n-3)!}$
- Number of rooted trees for n taxa
- $N_R = \dfrac{(2n-3)!}{2^{n-2}\,(n-2)!}$
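Evaluating the two formulas for a few values of n shows how quickly the number of possible trees grows. This is a minimal sketch with hypothetical function names, using exact integer arithmetic.

```python
from math import factorial


def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 taxa."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in (3, 4, 5, 10, 20):
    print(n, unrooted_trees(n), rooted_trees(n))
```

Already at 10 taxa there are 2,027,025 unrooted and 34,459,425 rooted trees, so exhaustive evaluation of all topologies quickly becomes impractical.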
49Phylogenetic tree using aligned sequences gene
tree
- When built from orthologous genes, the phylogenetic tree can be used to reconstruct the species tree (cladogram: speciation over time).
- However, considering polymorphisms, the nodes do not necessarily coincide exactly with speciation.
- Similarly for viruses: if quasispecies variation exists, the coalescence time in a transmission case does not necessarily coincide with the transmission time.
- Molecular clock calculations to estimate the coalescence time can only be used if the clock holds.
50Gene divergence may predate species divergence
[Figure: time axis from past to present, with gene splitting events and the population splitting marked; lineages labelled A, B, C, D, E and F]
51Gene trees vs Species trees
- Gene and species trees may be different
52Substitution models
- During evolution, multiple hits can have happened at a single position: the evolutionary distance is almost always larger than the dissimilarity (% nt or aa divergence).
[Figure: sequence difference plotted against time/evolutionary distance. The observed difference falls below the expected difference based on the number of mutations that happened; the gap between the two curves is the correction]
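The size of the correction for multiple hits depends on the substitution model. The slides do not name a model at this point; as one illustration, the sketch below uses the Jukes-Cantor (JC69) correction, which assumes equal base frequencies and equal rates for all substitutions. Function names are hypothetical.

```python
import math


def p_distance(seq1, seq2):
    """Observed proportion of differing sites, ignoring columns with gaps."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """JC69 distance d = -(3/4) ln(1 - 4p/3), defined for 0 <= p < 0.75."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75) for the JC69 correction")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.10, 0.30, 0.60):
    print(f"observed {p:.2f} -> corrected {jukes_cantor(p):.3f}")
```

The corrected distance is close to the observed difference for similar sequences and grows much faster as the observed difference approaches 0.75, where the sequences are saturated under this model.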
53Phylogenetic signal / phylogenetic noise
- Before making any phylogenetic inference it is necessary to investigate the presence of phylogenetic signal in the data set of aligned sequences.
- WHY?
- Sequences may not be informative enough about the phylogeny of the taxa under study (phylogenetic signal too low).
- Sequences could be too divergent (positional homology is lost).
- Sequences could be saturated.
- The data may not have evolved according to a strictly bifurcating tree.
54DNA dot matrix
- A fast way to evaluate whether or not two sequences can be unambiguously aligned is to perform a DNA dot matrix.
- Homologous regions usually show a clear diagonal in the dot matrix.
55DNA dot matrix
[Figure: two example dot matrices drawn from short sequences (axis labels ATTCTGA, TCTGAAA, ATTCGGA, ATTCGGA)]
- Sequences that can be unambiguously aligned show
a clear diagonal in the dot matrix
- Sequences that cannot be unambiguously aligned
show no diagonal in the dot matrix
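A dot matrix is simple to build by hand for short sequences. This is a minimal text-based sketch; the `window` parameter (requiring a run of identical characters before a dot is placed) is an assumption added to reduce noise and is not described on the slides.

```python
def dot_matrix(seq1, seq2, window=1):
    """Return rows of '*' (match) and '.' (no match); rows follow seq1, columns seq2.

    A cell is marked when the `window` characters starting at that row and
    column are identical in both sequences.
    """
    rows = []
    for i in range(len(seq1) - window + 1):
        row = ""
        for j in range(len(seq2) - window + 1):
            row += "*" if seq1[i:i + window] == seq2[j:j + window] else "."
        rows.append(row)
    return rows


for line in dot_matrix("ATTCGGA", "ATTCGGA"):
    print(line)
print()
for line in dot_matrix("ATTCTGA", "TCTGAAA"):
    print(line)
```

Identical sequences give an unbroken main diagonal. For real genes a larger `window` is normally needed, because with a four-letter alphabet single-character matches occur by chance in about a quarter of the cells.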
56HIV/SIV DNA env dot matrix (1)
57HIV/SIV DNA env dot matrix (2)
58Substitution Saturation
- If the sequences in the data set diverged long ago, the phylogenetic signal could be lost.
- It can be proved that two random nucleotide sequences will share on average 25% similarity just by accident if gaps are not allowed, and up to 50% allowing gaps. Sequences with similar base composition will artificially cluster together, irrespective of the true phylogeny.
- Third codon positions usually saturate much faster than first or second.
- Transitions usually saturate much faster than transversions.
- A fast way to check for substitution saturation in a set of aligned sequences is to plot the genetic distance for each pairwise comparison versus the number of transitions and transversions between each pair.
- Usually Ti occur more often than Tv. However, when the saturation point is reached, Tv tend to outnumber Ti.
59[Figure: distance d plotted against divergence time (Mya; axis ticks at 5, 10, 15) for transitions (A↔G, C↔T) and transversions (A↔C, T; C↔A, G)]
60Implicit and explicit models of evolution
- Distance matrix methods and maximum likelihood methods can correct for homoplasy by defining evolutionary parameters such as a nucleotide substitution matrix (additional parameters can include e.g. rate heterogeneity). These methods therefore use an explicit model of evolution whose parameters can be estimated from the data.
- Parsimony methods cannot correct for homoplasy. They use an implicit model of evolution: they assume that the most parsimonious tree (that is, the tree with the smallest number of mutations) is also evolutionarily the most likely one.