Title: Bioinformatics and Functional Analysis of the Genome
1Bioinformatics and Functional Analysis of the Genome
Molecular Phylogenetics and Evolutionary Analysis
2Molecular Phylogenetics and Evolutionary Analysis
Tutorial Outline
- Introduction
- Molecular mechanisms underlying genetic evolution
- Homology, Orthology and Paralogy
- DNA vs protein sequences in evolutionary analyses
- Multiple Alignment
- Genetic Distances
- Site-Specific rates
- Molecular Phylogeny
- Software packages for evolutionary analysis
3Molecular Phylogenetics and Evolutionary Analysis
"Nothing in biology makes sense except in the light of evolution"
Theodosius Dobzhansky (1900-1975)
Rosetta Stone
4Mechanism of Evolution
Mutational changes (errors) in genetic
transmission are the basis of the evolutionary
processes that, starting from an ancestral life
form, have produced the amazing diversity of
today's life forms. Basic types of DNA changes
include
- point mutations (nucleotide substitutions)
- insertions
- deletions
- inversions and other kinds of rearrangements
5Mechanism of Evolution
Mutagenesis is the process that gives rise to
mutations. It can be
- Spontaneous, e.g. through errors in the normal
process of DNA replication, mostly due to the
peculiar properties of purine and pyrimidine
bases
- Induced, e.g. through radiation or chemical
damage to DNA
6Mechanism of Evolution
Spontaneous mutagenesis and errors in DNA
replication
- In humans, DNA replication must accurately copy
6 x 10^9 base pairs every time a cell divides
- On average, one new mutation occurs per 10^10 bp
incorporated
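As a rough worked example of the two figures above, the expected number of new mutations per round of replication is simply the genome size multiplied by the per-base error rate. The function name is illustrative; the numbers are the ones quoted on the slide.

```python
def expected_mutations(genome_size_bp: float, error_rate_per_bp: float) -> float:
    """Expected number of new mutations in one round of DNA replication."""
    return genome_size_bp * error_rate_per_bp


# Figures quoted on the slide: 6 x 10^9 bp copied, one error per 10^10 bp.
per_division = expected_mutations(6e9, 1e-10)
print(f"{per_division:.1f} new mutations expected per cell division")
```

Under these figures fewer than one new mutation is expected each time a cell divides.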
7Mechanism of Evolution
DNA polymerases ensure accuracy by two main
mechanisms
(a) Base selection
Only A-T and G-C base pairs fit properly in the
active site of the polymerase
(b) Proofreading and Repair
If a wrong base is inserted, it is removed
and replaced with the correct one.
8Mechanism of Evolution
Point Mutations
Transitions: purine <-> purine (A <-> G) or pyrimidine <-> pyrimidine (C <-> T)
Transversions: purine <-> pyrimidine
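The distinction between the two classes of point mutation can be sketched in a few lines. The function and constant names are illustrative, and only the four unambiguous DNA bases are handled.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a: str, b: str) -> str:
    """Classify a nucleotide pair as identical, transition or transversion."""
    a, b = a.upper(), b.upper()
    if a not in PURINES | PYRIMIDINES or b not in PURINES | PYRIMIDINES:
        raise ValueError("only A, C, G and T are supported")
    if a == b:
        return "identical"
    # A transition stays within one chemical class, a transversion crosses classes.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))
print(classify_substitution("A", "C"))
```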
9Mechanism of Evolution
Spontaneous mutations result from a basic
property of purine and pyrimidine bases, which
can each assume two structures through keto-enol
or amino-imino tautomerism.
[Figure: structures of the keto and enol tautomers of guanine, G (keto) and G (enol)]
10Mechanism of Evolution
Keto-enol and amino-imino tautomerism
Rare isomeric forms (tautomers) of the 4 bases
exist in equilibrium with the major forms (1:1000).
The rare forms alter the base pairing rules.
11Mechanism of Evolution
Base pairing of the rare forms produces transitions
- A (amino) pairs with C (imino); A (imino) pairs with C (amino)
- G (keto) pairs with T (enol); G (enol) pairs with T (keto)
These minor base pairs may escape proofreading
and generate transition mutations in further
rounds of DNA replication
12Mechanism of Evolution
Base pairing of the rare forms can also produce
transversions
Tautomer concentration is 10^-4 (enol) and 10^-5
(imino). If the rare tautomer simultaneously
pairs with a nucleotide in the syn conformation
(0.05-0.1) in the template, then transversion
substitutions (purine <-> pyrimidine) may occur.
13Mechanism of Evolution
Transitions vs Transversions
As expected, transitions are more frequent than
transversions. Observed frequencies of point
mutations (10^-9 to 10^-10 per base incorporated)
are much lower than expected (about 10^-6) because
of repair systems. Point mutations are also
produced by other types of non-canonical
pairing, depurination, oxidative
deamination, etc.
14Mechanism of Evolution
Slippage generates small insertions and deletions
1st replication round
2nd replication round
15Mechanism of Evolution
Unequal crossing-over may generate larger
duplications and insertions
Insertion
Deletion
16Mechanism of Evolution
Mutation and Fixation
To be evolutionarily relevant, a mutation must
be heritable, i.e. occur in germline cells,
and spread through a substantial fraction of the
population, i.e. reach fixation.
17Mechanism of Evolution
Sequence and length changes in evolution
18Homology, Orthology and Paralogy
- Similarity
- - resemblance between two biosequences
- - local or global
- - can be measured
- Homology
- - common evolutionary descent
- - established through an evolutionary analysis
- - yes or no (qualitative, not a matter of degree)
19Orthology vs Paralogy
Both imply homology
- Orthologs: sequences originated from a common ancestor
following a speciation event
- Paralogs: sequences originated from a common ancestor
following a gene duplication event
- Xenologs: sequences originated from a lateral gene transfer
event
20Orthology vs Paralogy
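[Figure: an ancestral gene undergoes a gene duplication into gene A and gene B, followed by speciation; copies of the same gene in different species are orthologous, while gene A and gene B are paralogous]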
21DNA vs Proteins
Sequence conservation increases in the order: DNA <
protein sequence < protein secondary structure
< protein 3D structure
Evolutionary Information
22DNA vs Proteins
Ser  Gly  Arg  His  Lys
UCU  GGU  CGU  CAU  AAA
UCC  GGC  CGC  CAC  AAG
UCG  GGG  CGG
UCA  GGA  CGA
AGU
AGC
Many different coding sequences encode the
same protein.
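To put a number on this degeneracy, the sketch below counts how many distinct coding sequences could encode the five-residue peptide shown on the slide. It assumes the standard genetic code, in which Arg has six codons (the slide lists only four of them); the names are illustrative.

```python
import math

# Number of codons per amino acid in the standard genetic code
# (only the residues used on the slide are listed here).
CODONS_PER_AMINO_ACID = {"Ser": 6, "Gly": 4, "Arg": 6, "His": 2, "Lys": 2}


def count_coding_sequences(peptide):
    """Number of distinct nucleotide sequences encoding the given peptide."""
    return math.prod(CODONS_PER_AMINO_ACID[aa] for aa in peptide)


print(count_coding_sequences(["Ser", "Gly", "Arg", "His", "Lys"]))  # 576
```

Even a five-residue peptide can therefore be encoded in several hundred ways, which is why DNA alignments accumulate many more changes than the corresponding protein alignments.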
23DNA vs Proteins
Protein: 2 changes
DNA: 52 changes
24Protein sequence vs structure
Spinach (1A70) and Azotobacter (7FD1) ferredoxins
25Steps in phylogenetic analysis
Multiple Alignment
Compute genetic distance or other analyses
Get a tree or other evolutionary inference
26Multiple Alignment
The multiple alignment is critical: a bad
alignment produces wrong results. Carefully
evaluate the alignment (usually
computer-generated) before the analysis and,
where necessary, introduce manual adjustments
based on structural or functional information and
remove low-quality regions.
27Genetic Distance
Only 2 of the 7 changes that actually occurred are observed in the alignment
28Genetic Distance
The observed proportion of changes is a poor
estimator of the actual number of evolutionary
changes at increasing divergence
[Figure: expected vs observed difference at increasing divergence; the observed difference levels off (saturation)]
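The observed proportion of changes (often called the p-distance) is the quantity that saturates. A minimal sketch of how it could be computed from two aligned sequences follows; skipping gapped positions is an assumption made here, and the example sequences are invented for illustration.

```python
def p_distance(seq1: str, seq2: str, gap: str = "-") -> float:
    """Observed proportion of differing sites between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only the columns where neither sequence has a gap.
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


print(p_distance("ACGTACGTAC", "ACGTTCGAAT"))  # 3 differences over 10 sites
```

Because repeated substitutions at the same site are counted at most once, this value underestimates the true number of changes as divergence grows.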
29Stochastic Models
Molecular evolution is modeled as a
time-dependent probabilistic process. Several
models have been proposed, differing in which of
the following assumptions they accept
- All nucleotide sites change independently.
- The substitution rate is constant over time and
in different lineages.
- The base composition is at equilibrium.
- The nucleotide substitution rate is the same for
all kinds of change and for all sites.
30Base Composition
Stochastic models assume that base composition is
at equilibrium, i.e. that base composition is
roughly the same over the collection of sequences
being studied. This condition is also known as
Stationarity. Violation of stationarity may lead
to incorrect inferences.
31Base Composition
The stationarity check is mandatory before using
stochastic models.
Puzzle check
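One common way to check stationarity is a chi-square test comparing the base composition of each sequence with the average composition of the whole dataset. The sketch below is a hedged illustration of that idea, not the exact procedure of any particular program; the function name and the 5% threshold for 3 degrees of freedom (7.815) are choices made here.

```python
from collections import Counter

BASES = "ACGT"
CHI2_CRIT_DF3 = 7.815  # 5% critical value of chi-square with 3 degrees of freedom


def composition_chi2(sequences):
    """Chi-square statistic of each sequence against the pooled base composition."""
    counts = {name: Counter(b for b in seq.upper() if b in BASES)
              for name, seq in sequences.items()}
    pooled = sum(counts.values(), Counter())
    grand_total = sum(pooled.values())
    freqs = {b: pooled[b] / grand_total for b in BASES}
    result = {}
    for name, c in counts.items():
        n = sum(c.values())
        # Sum of (observed - expected)^2 / expected over the four bases.
        result[name] = sum((c[b] - n * freqs[b]) ** 2 / (n * freqs[b])
                           for b in BASES if freqs[b] > 0)
    return result


example = {"s1": "ACGT" * 10, "s2": "ACGT" * 10, "s3": "A" * 40}
for name, chi2 in composition_chi2(example).items():
    print(name, round(chi2, 2), "deviates" if chi2 > CHI2_CRIT_DF3 else "ok")
```

Sequences flagged by such a test are candidates for removal, or for analysis with methods that do not assume stationarity.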
32Substitution Model
[Diagram: the four nucleotides A, G, C and T, with arrows marking transitions and transversions]
33Jukes-Cantor Model
Equilibrium base composition: 1/4, 1/4, 1/4, 1/4
One-parameter model: a single substitution rate for all changes among A, C, G and T
34Jukes-Cantor Model
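$q_t$: fraction of identical sites at time $t$; $\lambda$: substitution rate per unit time
[Figure: sequences B and C diverging from a common ancestor A, compared at times $t$ and $t+1$]
Sites with B = C at time $t$ (fraction $q_t$) that remain identical: $(1-\lambda)^2\, q_t$
Sites with B $\neq$ C at time $t$ (fraction $1-q_t$) that become identical: $2\,(\lambda/3)(1-\lambda)(1-q_t)$
Neglecting terms in $\lambda^2$: $q_{t+1} = (1-2\lambda)\, q_t + \tfrac{2}{3}\lambda\,(1-q_t)$
$q_{t+1} - q_t \approx \dfrac{dq}{dt} = \tfrac{2}{3}\lambda - \tfrac{8}{3}\lambda\, q$
With $q = 1$ at $t = 0$, divergence time $T$ and $p = 1 - q$: $d = 2\lambda T = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}p\right)$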
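The final formula of the derivation can be applied directly to an observed proportion of differences p. This is a minimal sketch; the function name is illustrative, and the correction is undefined for p >= 3/4, where the sequences are saturated.

```python
import math


def jc69_distance(p: float) -> float:
    """Jukes-Cantor distance d = -3/4 ln(1 - 4p/3), in substitutions per site."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.25, 0.50):
    print(f"p = {p:.2f}  ->  d = {jc69_distance(p):.4f}")
```

The corrected distance is always at least as large as p, and the gap widens as p approaches saturation.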
35Kimura Model
Equilibrium base composition: 1/4, 1/4, 1/4, 1/4
Two-parameter model: separate rates for transitions and transversions
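For the two-parameter model, the standard distance estimate uses the observed proportions of transitional (P) and transversional (Q) differences separately: d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q). The sketch below assumes this usual form of the Kimura two-parameter distance; the function name is illustrative.

```python
import math


def k2p_distance(P: float, Q: float) -> float:
    """Kimura two-parameter distance from transition (P) and transversion (Q) proportions."""
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError("sequences too divergent for the two-parameter correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


print(round(k2p_distance(0.10, 0.05), 4))
```

When transitions make up exactly one third of the differences (P = Q/2), the estimate coincides with the Jukes-Cantor distance for p = P + Q.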
36General Time Reversible Model
Equilibrium base composition: (qA, qC, qG, qT)
9-parameter model
37Outline of substitution models
38More Realistic Models
Models can be made more parameter rich to
increase their realism
- The most common additional parameters are
- - A correction for the proportion of sites which
are unable to change (invariant sites)
- - A correction for variable site rates at those
sites which can change
39Site Rate Heterogeneity
Different sites in DNA sequences may have quite
different probabilities of change. For this
reason it is advisable to analyze separately
sites presumably subjected to different
evolutionary dynamics (e.g. first and second vs
third codon positions). Furthermore rate
variation among sites can be modeled by a Gamma
distribution whose shape parameter alpha
specifies the range of rate variation among sites.
40Gamma Distribution
Higher alpha: lower rate heterogeneity among sites
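The effect of the shape parameter can be illustrated by drawing site rates from a Gamma distribution with mean 1, i.e. shape alpha and scale 1/alpha, whose variance is 1/alpha. This parameterization is the one commonly used for among-site rate variation and is assumed here; the function name and the seed are illustrative.

```python
import random
import statistics


def sample_site_rates(alpha: float, n_sites: int, seed: int = 0):
    """Draw relative site rates from a Gamma(shape=alpha, scale=1/alpha), mean 1."""
    rng = random.Random(seed)
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]


for alpha in (0.5, 5.0):
    rates = sample_site_rates(alpha, 20000)
    print(f"alpha = {alpha}: mean = {statistics.mean(rates):.2f}, "
          f"variance = {statistics.pvariance(rates):.2f} (theory: {1 / alpha:.2f})")
```

With a small alpha most sites evolve very slowly while a few evolve fast; with a large alpha the rates cluster around the mean.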
41More vs Fewer Parameter Models
Models can be made more parameter rich to
increase their realism, but
- the more parameters you estimate from the data,
the more time is needed for an analysis and the
more sampling error accumulates
- one might have a realistic model but large
sampling errors
- realism comes at a cost in time and precision
- fewer parameters may give an inaccurate estimate,
but more parameters decrease the precision of the
estimate
In general, use the simplest model which fits the
data.
42Parameter Estimation
Use the tree scores function of PAUP to estimate
the model parameters by maximum likelihood (ML)
43Distance measure for protein sequences
When comparing protein sequences a parameter-rich
model is impractical, as proteins use a 20-letter
alphabet. A suitable approximation is given by
the Kimura formula
$d = -\ln\left(1 - p - 0.2\,p^2\right)$
with $p$ the observed proportion of different amino
acids.
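A small sketch of the protein correction, assuming the commonly cited empirical form of Kimura's formula, d = -ln(1 - p - 0.2 p^2). The function name is illustrative, and the formula is only defined while the argument of the logarithm stays positive (p below roughly 0.85).

```python
import math


def kimura_protein_distance(p: float) -> float:
    """Kimura's empirical protein distance from the proportion of differing residues."""
    arg = 1 - p - 0.2 * p * p
    if p < 0 or arg <= 0:
        raise ValueError("p is outside the range where the correction is defined")
    return -math.log(arg)


for p in (0.1, 0.3, 0.5):
    print(f"p = {p:.1f}  ->  d = {kimura_protein_distance(p):.4f}")
```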
44Genetic distances
Genetic distances are measured in
substitutions per site
45Estimating site-specific rate variability
- Rationale
- - Substitutions between closely related sequences
are likely to have occurred at fast evolving
sites
- - Between closely related sequences, substitutions
involving biochemically distinct amino acids will
tend to occur at the least constrained (faster
evolving) sites.
46Estimating site-specific rate variability
The variability of the i-th site in a multiple
alignment of N sequences, L sites long is given
by
- Where d_ij is
- - Nucleotide sequences: 1 if a nucleotide substitution
is observed in the j-th pairwise comparison, 0
otherwise.
- - Protein sequences: a measure of amino acid pair
distance, ranging from 0 (identity) to 1 (least
common substitution).
- and K_j is the overall genetic distance for the
j-th comparison.
- A relative variability index is given by $\gamma_i = \nu_i / \nu_{max}$.
47Human mtDNA D-loop site variability
[Figure: relative variability (0 to 1) along the human mtDNA D-loop, computed with the SiteVar software; labelled features include the Hvr1 region, sites 146, 152 and 195, CSB1, CSB2, CSB3 and OH]
48Molecular Phylogeny
Evolutionary relationships between organisms, or
more generally between homologous genes, can be
suitably represented by phylogenetic trees. A
phylogenetic tree is a graph composed of nodes
and branches in which only one branch connects
two adjacent nodes. The nodes represent taxonomic
units, and the branches define the relationships
among the units in terms of descent and ancestry.
49Molecular Phylogeny: Tree topology
50Molecular Phylogeny: Root of the Tree
Rooted tree
Unrooted tree
51Molecular Phylogeny: Types of Trees
Cladogram: all branch lengths equal
Phylogram: branch lengths proportional to distance
[Figure: the same tree drawn as a cladogram and as a phylogram, with branch lengths between 0.001 and 0.098]
52Molecular Phylogeny: Tree Styles
[Figure: the same primate tree (HUMAN, CHIMP, GORILLA, ORANG, MACAQUE, OWL MONKEY) drawn in different styles]
Newick format
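The Newick format writes a tree as nested parentheses, with sister groups separated by commas and a closing semicolon. The sketch below serializes a nested Python tuple into a Newick string; the topology used for the six primates is an illustrative assumption, not necessarily the one drawn on the slide, and branch lengths are omitted.

```python
def to_newick(node) -> str:
    """Serialize a nested tuple of taxon names into Newick (without the final ';')."""
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


# Illustrative topology: each inner tuple is a clade.
primates = ((((("HUMAN", "CHIMP"), "GORILLA"), "ORANG"), "MACAQUE"), "OWL_MONKEY")
print(to_newick(primates) + ";")
```

Spaces in taxon names are replaced by underscores here, since unquoted Newick labels cannot contain blanks.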
53Molecular Phylogeny: Tree building methods
54Molecular Phylogeny: UPGMA
- Unweighted Pair Group Method with Arithmetic Mean
- uses a distance matrix
- a sequential algorithm identifies local
topological relationships
- tree built in a stepwise manner
- arithmetic mean defines the distances between
taxa (initial or composite)
[Figure: worked example labelled 1 to 5]
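The stepwise procedure listed above can be sketched as follows: at each step the two closest clusters are merged, the new node is placed at half their distance, and distances to the new composite cluster are size-weighted arithmetic means. This is a minimal illustration on an invented distance matrix; the function name and the return format are assumptions made here.

```python
from itertools import combinations


def upgma(names, matrix):
    """Return (newick_topology, merge_heights) from a symmetric distance matrix."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (label, size)
    dist = {frozenset((i, j)): matrix[i][j]
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    heights = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: dist[frozenset(pair)])
        heights.append(dist[frozenset((a, b))] / 2)
        (label_a, size_a), (label_b, size_b) = clusters.pop(a), clusters.pop(b)
        # Distance to the composite cluster: mean weighted by cluster sizes.
        for c in clusters:
            dist[frozenset((next_id, c))] = (
                size_a * dist[frozenset((a, c))] + size_b * dist[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[next_id] = (f"({label_a},{label_b})", size_a + size_b)
        next_id += 1
    label, _ = next(iter(clusters.values()))
    return label + ";", heights


names = ["A", "B", "C", "D"]
matrix = [[0, 2, 4, 6],
          [2, 0, 4, 6],
          [4, 4, 0, 6],
          [6, 6, 6, 0]]
print(upgma(names, matrix))
```

The example matrix is ultrametric, so the clustering recovers it exactly; with data that violate the molecular clock, UPGMA can return a wrong topology.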
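55Molecular Phylogeny: Molecular Clock
UPGMA assumes a homogeneous substitution rate along
all lineages, i.e. the Molecular Clock.
$V = K / 2T$, $T = K / 2V$
(K: genetic distance between two sequences, T: time since their divergence, V: substitution rate per lineage)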
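Under the clock assumption, the two relations V = K/2T and T = K/2V allow a rate to be calibrated on a pair with a known divergence time and then used to date another pair. The numbers below are invented purely for illustration, and the function names are assumptions.

```python
def clock_rate(k_calibration: float, t_calibration: float) -> float:
    """Substitution rate per lineage, V = K / 2T, from a calibration pair."""
    return k_calibration / (2 * t_calibration)


def divergence_time(k: float, rate: float) -> float:
    """Divergence time T = K / 2V for a pair with genetic distance K."""
    return k / (2 * rate)


# Hypothetical calibration: K = 0.1 substitutions/site, split 5 million years ago.
rate = clock_rate(0.1, 5.0)          # substitutions/site per million years
print(divergence_time(0.3, rate))    # time for a pair with K = 0.3
```

The factor of 2 reflects that changes accumulate independently along both lineages since the split.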
56Molecular Phylogeny: Molecular Clock - Divergence Times
calibration
57Molecular Phylogeny: Molecular Clock - Divergence Times
58Molecular Phylogeny: Neighbor Joining
Determines the N-3 internal branches that give
the smallest tree length (sum of all branches)
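In practice Neighbor Joining approaches the smallest tree length greedily: at each step it joins the pair (i, j) that minimizes Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k). The sketch below performs only this first selection step on an invented five-taxon matrix; it is not a full implementation, and the function name is an assumption.

```python
from itertools import combinations


def nj_first_pair(matrix):
    """Return the pair of taxa with the lowest Q value, and that Q value."""
    n = len(matrix)
    row_sums = [sum(row) for row in matrix]

    def q(i, j):
        return (n - 2) * matrix[i][j] - row_sums[i] - row_sums[j]

    i, j = min(combinations(range(n), 2), key=lambda pair: q(*pair))
    return (i, j), q(i, j)


matrix = [[0, 5, 9, 9, 8],
          [5, 0, 10, 10, 9],
          [9, 10, 0, 8, 7],
          [9, 10, 8, 0, 3],
          [8, 9, 7, 3, 0]]
print(nj_first_pair(matrix))
```

Note that the pair chosen is not the one with the smallest raw distance (taxa 3 and 4, distance 3): the row-sum terms correct for unequal rates, which is why the method does not require a molecular clock.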
59Molecular Phylogeny: Drawbacks of clustering methods
Loss of Information
- Uninterpretable branch lengths
- d_tree < d_obs, which is biologically impossible
- occasionally even d_tree < 0
The method does not optimize an objective
function
Clustering methods merely produce a
tree, but do not allow us to evaluate i) the
quality of the tree or ii) competing hypotheses.
60Maximum Parsimony
- Uses the principle of minimum evolution to
identify the tree that requires the minimum number of
evolutionary changes to explain the divergence
between sequences
- Often, no unique tree can be inferred
- Exhaustive search is infeasible for large datasets
61Maximum Parsimony
4 changes
5 changes
6 changes
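Counting the changes a given tree requires for one alignment column can be done with Fitch's algorithm, a standard method for this task that the slides do not name explicitly. The sketch below assumes a rooted binary tree written as nested tuples; names and example data are illustrative.

```python
def fitch(tree, states):
    """Return (possible_states, n_changes) for one site on a binary tree."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    common = left_set & right_set
    if common:
        # The children agree on at least one state: no extra change needed.
        return common, left_changes + right_changes
    # Otherwise one change is required on the way to this node.
    return left_set | right_set, left_changes + right_changes + 1


site = {"A": "A", "B": "A", "C": "G", "D": "G"}
for topology in ((("A", "B"), ("C", "D")), (("A", "C"), ("B", "D"))):
    print(topology, fitch(topology, site)[1], "change(s)")
```

Summing this count over all columns gives the parsimony score of a tree, and the tree (or trees) with the lowest total score is preferred.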
62Maximum Likelihood
Given a dataset D (i.e. a multiple alignment)
- Maximize the likelihood function $L(S, w, R) = \ln P(D \mid S, w, R)$ with
- - Tree topology S
- - Branch lengths w = (w1, w2, w3, w4, w5)
- - Rate matrix R (and other evolutionary parameters,
e.g. Gamma, Inv)
[Figure: four taxa, numbered 1 to 4, arranged in three alternative configurations labelled State 1, State 2 and State 3]
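As a deliberately reduced illustration of likelihood maximization, the sketch below considers only two sequences (so the topology is fixed and there is a single branch length d) under the Jukes-Cantor model, and finds the d that maximizes the log-likelihood by a simple grid search. The function names, the grid search and the example counts are assumptions made for illustration; real programs optimize topology, all branch lengths and model parameters jointly.

```python
import math


def jc69_log_likelihood(d: float, n_same: int, n_diff: int) -> float:
    """Log-likelihood of two aligned sequences separated by distance d under JC69."""
    e = math.exp(-4 * d / 3)
    p_same = 0.25 + 0.75 * e   # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    # Each site contributes (equilibrium frequency 1/4) x (transition probability).
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


def ml_distance(n_same: int, n_diff: int, step: float = 0.001, max_d: float = 3.0) -> float:
    """Grid-search estimate of the distance that maximizes the log-likelihood."""
    grid = [step * k for k in range(1, int(max_d / step) + 1)]
    return max(grid, key=lambda d: jc69_log_likelihood(d, n_same, n_diff))


print(ml_distance(n_same=90, n_diff=10))
```

For this simple case the maximum coincides, up to the grid resolution, with the analytical Jukes-Cantor distance for p = 0.1.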
63Maximum Likelihood
- Very reliable method
- Provides tree topology, branch lengths and parameter
estimates; may account for site variability and
the fraction of invariant sites
- May compare alternative hypotheses
- but
- it is computationally very intensive and
impractical for large datasets, although faster
approaches such as Quartet Puzzling and Bayesian
Inference are available.
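64Maximum Likelihood: Testing alternative hypotheses
H0: no Clock (log-likelihood L0)
H1: Clock (log-likelihood L1)
Likelihood Ratio Test (LRT): twice the log-likelihood difference between the unconstrained (no clock) tree and the clock tree, $2\,(L_0 - L_1)$, is compared with a $\chi^2$ distribution with $k-2$ degrees of freedom ($k$ = number of sequences)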
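A hedged sketch of the clock test: the statistic is twice the difference between the log-likelihood without the clock and the log-likelihood with the clock, compared with a chi-square distribution with k - 2 degrees of freedom for k sequences. To stay free of external libraries, the 5% critical values are taken from a small table for 1 to 10 degrees of freedom; the function name and the example log-likelihoods are invented.

```python
# 5% critical values of the chi-square distribution for 1 to 10 degrees of freedom.
CHI2_CRIT_5PCT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070,
                  6: 12.592, 7: 14.067, 8: 15.507, 9: 16.919, 10: 18.307}


def clock_lrt(lnl_no_clock: float, lnl_clock: float, n_taxa: int):
    """Return (statistic, degrees_of_freedom, clock_rejected_at_5_percent)."""
    stat = 2 * (lnl_no_clock - lnl_clock)
    df = n_taxa - 2
    if df not in CHI2_CRIT_5PCT:
        raise ValueError("critical value not tabulated for this number of taxa")
    return stat, df, stat > CHI2_CRIT_5PCT[df]


# Hypothetical log-likelihoods for a 6-taxon dataset.
print(clock_lrt(-1000.0, -1005.0, 6))
```

If the statistic exceeds the critical value, the clock-constrained tree fits significantly worse and the molecular clock is rejected for that dataset.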
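65Assessing Tree Reliability: Bootstrap
Bootstrap: resampling with replacement
Jackknife: resampling without replacement
Consensus: the trees obtained from the resampled datasets are summarized in a consensus tree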
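Both resampling schemes operate on the columns of the alignment. The sketch below builds one bootstrap pseudo-replicate (columns drawn with replacement, same length as the original) and one jackknife replicate (a subset of columns drawn without replacement). The function names, the 50% jackknife fraction and the toy alignment are assumptions for illustration; tree building on each replicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    n_sites = len(next(iter(alignment.values())))
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


def jackknife_alignment(alignment, rng, keep=0.5):
    """Keep a random subset of columns, drawn without replacement."""
    n_sites = len(next(iter(alignment.values())))
    cols = sorted(rng.sample(range(n_sites), int(n_sites * keep)))
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}


alignment = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "ACCTTC"}
rng = random.Random(1)
print(bootstrap_alignment(alignment, rng))
print(jackknife_alignment(alignment, rng))
```

In a full analysis a tree is inferred from each replicate, and the proportion of replicates supporting each group is reported on the consensus tree.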
66Assessing Tree Reliability: Consensus Tree
67Molecular Phylogeny: Programs and Packages
http://evolution.genetics.washington.edu/phylip/software.html
68Molecular Phylogeny: Major Software