Title: Molecular Evolution
1 Molecular Evolution
- The neutral theory of molecular evolution
- The molecular clock hypothesis
- Positive selection
- Phylogenetic trees
- Evolutionary distance measures
- Distance-based tree construction methods (UPGMA
and neighbour joining)
EPFL Bioinformatics I 16 Jan 2006
2 Neutral theory of molecular evolution
- Historical background (Haldane's paradox)
  - Early population geneticists believed that most polymorphisms are maintained by balancing selection.
  - Balancing selection implies a genetic load because homozygotes are less fit than heterozygotes.
  - When protein electrophoresis became available, it was found that a very large number of genes were actually polymorphic. This appeared to imply an unacceptably high genetic load for the human and other populations.
- Kimura's neutral theory of molecular evolution provides an explanation for Haldane's paradox
  - Claim: The large majority of observed molecular polymorphisms reflect neutral changes. Likewise, most substitutions observed between homologous genes are selectively neutral.
  - Implications: Gene (protein) families evolve through neutral mutations and purifying selection. Most genes (proteins) have not been improved during the period of metazoan evolution.
3 The molecular clock hypothesis
In 1965, Zuckerkandl and Pauling proposed that, for any given lineage, the rate of molecular evolution (amino acid substitutions per year) is constant over time. In other words, there exists a universal molecular clock.
Implications: Mutation and substitution rates are the same in all lineages. This supports the neutral theory of molecular evolution (no dependence on population size and generation time) but is difficult to reconcile with claims that mutation rates differ along chromosomes.
Exceptions and limitations: The primate lineage appears to have a somewhat lower evolutionary rate than other lineages. The theory is not readily testable for lower eukaryotes and bacteria, for which a fossil record is lacking.
Note further: The rate appears to be proportional to time, and not to the number of generations or cell divisions. The independence from generation time speaks against positive selection as a driving force of evolution. The independence from the number of cell cycles suggests that most mutations do not happen during replication.
4 Proteins evolve at different rates
- Making the following assumptions
  - All amino acid replacements are selectively neutral (neutral theory)
  - There is a constant molecular clock
  - A given protein (e.g. an enzyme) has the same function and thus evolves under the same purifying selection conditions in all species
- it follows that
  - a given protein evolves at a constant rate in all lineages
- However
  - Different proteins may evolve at different rates due to varying levels of functional constraints
  - At the nucleotide sequence level, the different rates are primarily reflected by non-silent base substitutions (assuming that silent substitutions are selectively neutral).
  - These predictions are matched by many protein families
5 Positive Selection
- Positive selection may occur
  - When the function of a protein is improved, e.g. the efficiency or substrate specificity of an enzyme
  - When a protein is undergoing adaptation to changes in the environment, e.g. viral surface proteins that evolve to escape the immune system. This case is documented by many examples and may generally be more frequent than functional improvement.
- Potential evidence for positive selection
  - Ratio of silent versus non-silent (amino acid-changing) substitutions
  - Accelerated or population-size-dependent rate of amino acid substitutions in a particular lineage
  - More silent replacements among within-species polymorphisms than among between-species replacements (evidence for positive selection in the past).
6 Phylogenetic trees
[Figure: three example trees relating leaves 1-5 through inferred nodes 6-8. Panels: Rooted tree; Rooted tree satisfying the molecular clock hypothesis (all leaves at the same distance from the root; axis: time); Unrooted tree.]
Note: 1-5 are called leaves, or leaf nodes. 6-8 are inferred nodes corresponding to ancestral species or molecules. Branches are also called edges. The edge lengths reflect evolutionary distances.
7 Phylogenetic trees
A phylogenetic tree is a graph reflecting the approximate distances between a set of objects in a hierarchical fashion. A tree is also called a dendrogram. There are different types of trees:
- Unrooted versus rooted trees: A rooted tree has an additional node representing the origin, in molecular phylogeny the last common ancestor of the sequences analyzed. In general, the root cannot be directly inferred from the data. It may be inferred from the paleontological record, from a trusted outlier (outgroup), or on the basis of the molecular clock hypothesis.
- Scaled and unscaled trees: In an unscaled tree, the lengths of the branches are not important. Only the topology counts. In phylogeny, trees are usually scaled.
- Binary trees: Each node branches into two daughter nodes. Other trees are usually not considered in phylogeny as they can easily be approximated by binary trees with very short edges between nodes.
Note: A rooted (unrooted) tree connecting n objects (leaves) has 2n - 1 (2n - 2) nodes altogether and 2n - 2 (2n - 3) edges.
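As a quick check of these counts, take n = 5 leaves, as in the example trees with leaves 1-5 and inferred nodes 6-8 (the breakdown in parentheses is a worked illustration, not part of the original slide):

```latex
\begin{align*}
\text{rooted:}   \quad & 2n - 1 = 9 \text{ nodes (5 leaves, 3 inferred nodes, 1 root)}, & 2n - 2 &= 8 \text{ edges} \\
\text{unrooted:} \quad & 2n - 2 = 8 \text{ nodes (5 leaves, 3 inferred nodes)},          & 2n - 3 &= 7 \text{ edges}
\end{align*}
```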
8 Phylogenetic tree reconstruction, overview
- Computational challenge: There is an enormous number of different topologies even for a relatively small number of sequences
  - 3 sequences: 1
  - 4 sequences: 3
  - 5 sequences: 15
  - 10 sequences: 2,027,025
  - 20 sequences: 221,643,095,476,699,771,875
- Consequence: Most tree construction algorithms are heuristic methods not guaranteed to find the optimal topology.
- Input data for two major classes of algorithms
  - 1. Input data: distance matrix. Examples: UPGMA, neighbor-joining
  - 2. Input data: multiple alignment. Examples: parsimony, maximum likelihood
- Distance matrix methods use distances computed from pairwise or multiple alignments as input.
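The topology counts listed above agree with the number of unrooted binary trees on n labelled leaves, (2n - 5)!! = 3 x 5 x ... x (2n - 5). The following is a minimal sketch that reproduces them; the function name is hypothetical, and the reading of the table as counting unrooted trees is an assumption based on the numbers matching.

```python
def count_unrooted_topologies(n: int) -> int:
    """Number of unrooted binary tree topologies on n labelled leaves, (2n-5)!!."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1  # exactly one topology for 3 leaves
    # An unrooted tree with k-1 leaves has 2(k-1)-3 = 2k-5 edges, and the
    # k-th leaf can be attached to any one of them.
    for k in range(4, n + 1):
        count *= 2 * k - 5
    return count


for n in (3, 4, 5, 10, 20):
    print(n, count_unrooted_topologies(n))
```

Python integers have arbitrary precision, so the value for 20 sequences is computed exactly.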
9 Distance measures for phylogenetic tree construction
Distance measures respect the following constraints: d = 0 if the sequences are identical, d > 0 if the sequences are different.
Distances between molecular sequences are computed from pair-wise alignment scores.
For closely related DNA sequences, one could simply use f, the fraction of non-identical residues (readily computed from the identity value returned by an alignment program). For more distantly related sequences, the Jukes-Cantor distance, $d_{ij} = -\frac{3}{4}\ln\left(1 - \frac{4f}{3}\right)$, is preferred. This measure is supposed to be proportional to evolutionary time. It takes into account that the percent identity saturates at 25% over time.
For protein sequences aligned with the aid of a substitution matrix, an approximate distance is often computed from the following quantities:
- Sobs: observed pairwise alignment score
- Smax: maximum score (average of the scores of the two sequences aligned against themselves)
- Srand: expected score for random sequences of the same length and composition
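A minimal sketch of these distance measures follows. The function names are hypothetical. The Jukes-Cantor function follows the formula given above. The score-based function is an assumption: it uses a commonly cited form (Feng-Doolittle style), d = -ln((Sobs - Srand) / (Smax - Srand)), which is consistent with the three quantities defined above but is not confirmed by this text.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction f of non-identical residues between two aligned, gap-free sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def jukes_cantor(f: float) -> float:
    """Jukes-Cantor distance d = -(3/4) ln(1 - 4f/3); undefined once f reaches 3/4."""
    if not 0.0 <= f < 0.75:
        raise ValueError("f must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)


def score_distance(s_obs: float, s_max: float, s_rand: float) -> float:
    """ASSUMED form: d = -ln((Sobs - Srand) / (Smax - Srand))."""
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0.0:
        raise ValueError("observed score is not above the random expectation")
    return -math.log(s_eff)


f = p_distance("ACGTACGT", "ACGTACGA")  # one mismatch in eight positions
print(f, jukes_cantor(f))
```

For small f the Jukes-Cantor distance is close to f itself; it grows without bound as f approaches 0.75, i.e. as the identity approaches the 25% expected between unrelated DNA sequences.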
10 Distance matrix-based methods. Example: UPGMA
- Unweighted pair-group method with arithmetic means (UPGMA)
- Initialization
  - assign each sequence to its own cluster
  - define one leaf node for each single-sequence cluster
  - put all leaf nodes at height zero
- Iteration
  - determine the two clusters for which the distance is minimal and combine them in a new cluster.
  - compute the distance between the new cluster and all other clusters by averaging over all pair-wise distances between cluster elements
  - define a new node for the new cluster and place it at a height of half the average distance between the elements of the two merged clusters
- Termination
  - When only two clusters remain, place the root at a height of half the average distance between the elements of the two clusters
- Limitation of UPGMA: The algorithm implicitly assumes a constant evolutionary rate in all branches. It is therefore unfit to test the molecular clock hypothesis.
- An alternative method called neighbour-joining provides more realistic branch lengths.
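The UPGMA steps above can be sketched as follows. This is a minimal illustration, not a reference implementation. The function name, the nested-tuple tree representation and the example matrix are assumptions made for this sketch, and each new node is placed at half the average distance between the two merged clusters.

```python
from itertools import combinations


def upgma(labels, dist):
    """Return (tree, root_height) for a symmetric distance matrix dist.

    A leaf is its label; an inner node is ((child, branch_length), (child, branch_length)).
    """
    # cluster id -> (subtree, indices of member sequences, node height)
    clusters = {i: (labels[i], [i], 0.0) for i in range(len(labels))}
    next_id = len(labels)

    def avg_dist(a, b):
        # Arithmetic mean over all pairwise distances between cluster elements.
        members_a, members_b = clusters[a][1], clusters[b][1]
        total = sum(dist[i][j] for i in members_a for j in members_b)
        return total / (len(members_a) * len(members_b))

    # The last pass of this loop is the termination step that creates the root.
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        height = avg_dist(a, b) / 2.0
        (tree_a, members_a, height_a) = clusters.pop(a)
        (tree_b, members_b, height_b) = clusters.pop(b)
        # Branch length = height of the new node minus height of the child node.
        new_tree = ((tree_a, height - height_a), (tree_b, height - height_b))
        clusters[next_id] = (new_tree, members_a + members_b, height)
        next_id += 1

    tree, _, root_height = next(iter(clusters.values()))
    return tree, root_height


# Hypothetical ultrametric example: A,B join at height 1, C,D at height 2, root at 3.
example = [
    [0, 2, 6, 6],
    [2, 0, 6, 6],
    [6, 6, 0, 4],
    [6, 6, 4, 0],
]
tree, root_height = upgma(["A", "B", "C", "D"], example)
print(tree, root_height)
```

Because every leaf ends up at the same distance from the root, the output always satisfies the molecular clock, whether or not the input data do.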
11 Distance matrix-based methods: Neighbor joining
- Underlying assumptions
  - Additivity: The distance between two leaves is the sum of the lengths of the edges on the path connecting them (may at best be approximately true)
- Motivation
  - Additivity is a less stringent assumption than the molecular clock assumption
  - If different branches of a tree evolve at different rates, the closest pair of leaves may not be neighboring leaves, see example below.
- Output: an unrooted tree
- Principle
  - A modified distance measure $D_{ij}$ is used to detect neighbors, which is obtained by subtracting the average distances to all other leaves from the distance $d_{ij}$
Example: The closest pair of leaves, 1 and 2, are not neighbors: $d_{12} = 0.3$, $d_{13} = 0.5$; however, $D_{12} = -1.1$, $D_{13} = -1.2$.
[Figure: unrooted tree with four leaves. Leaves 1 and 3 are neighbors, as are leaves 2 and 4. The edges to leaves 1 and 2 and the internal edge have length 0.1; the edges to leaves 3 and 4 have length 0.4.]
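The example can be checked numerically. The sketch below assumes the usual neighbor-joining definition D_ij = d_ij - (r_i + r_j), with r_i the sum of distances from leaf i to all other leaves divided by (number of leaves - 2). The function name is hypothetical, and the distance matrix is derived from the edge lengths of the example tree under additivity.

```python
def nj_modified_distances(d):
    """Return the matrix D with D[i][j] = d[i][j] - (r[i] + r[j]) for i != j."""
    n = len(d)
    # r[i]: summed distance from leaf i to all other leaves, divided by n - 2
    r = [sum(row) / (n - 2) for row in d]
    return [
        [d[i][j] - (r[i] + r[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]


# Leaf edges: 0.1 (leaf 1), 0.1 (leaf 2), 0.4 (leaf 3), 0.4 (leaf 4);
# internal edge: 0.1; neighbor pairs are (1, 3) and (2, 4).
d = [
    [0.0, 0.3, 0.5, 0.6],
    [0.3, 0.0, 0.6, 0.5],
    [0.5, 0.6, 0.0, 0.9],
    [0.6, 0.5, 0.9, 0.0],
]
D = nj_modified_distances(d)
print(round(D[0][1], 6), round(D[0][2], 6))  # D12 and D13
```

Leaves 1 and 2 have the smallest raw distance, but the smallest modified distance belongs to the true neighbor pairs (1, 3) and (2, 4), both at about -1.2.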
12 Distance matrix-based methods: Neighbor joining
- Initialization
  - Define T to be the set of leaf nodes
  - Initialize the current set of nodes as L = T
- Iteration
  - Pick a pair i, j from L for which $D_{ij}$ is minimal
  - Define a new node k and set $d_{km} = \frac{1}{2}(d_{im} + d_{jm} - d_{ij})$ for all m in L; also compute $D_{km}$ for all m.
  - Add k to T with edges of length
    - $d_{ik} = \frac{1}{2}(d_{ij} + r_i - r_j)$
    - $d_{jk} = d_{ij} - d_{ik}$
  - Remove i and j from L and add k.
- Termination
  - When L consists of two leaves i and j, add the remaining edge between i and j, with length $d_{ij}$
- Formulae
  - $D_{ij} = d_{ij} - (r_i + r_j)$, with $r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik}$
  - where |L| is the size of the current set of leaf nodes
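The iteration above can be sketched in a few lines. This is a minimal illustration under stated assumptions. It uses the standard definitions D_ij = d_ij - (r_i + r_j) and r_i = sum of d_ik over the current nodes divided by (|L| - 2). The function name, the edge-list output and the internal node names ("node0", "node1", ...) are hypothetical choices for this sketch.

```python
def neighbor_joining(labels, dist):
    """Return the edges (node_a, node_b, length) of an unrooted tree.

    labels must be distinct strings that do not clash with the generated
    internal node names; dist is a symmetric matrix in the order of labels.
    """
    # Working distances, keyed by unordered pairs of node names.
    d = {
        frozenset((labels[i], labels[j])): dist[i][j]
        for i in range(len(labels))
        for j in range(i + 1, len(labels))
    }
    active = list(labels)  # the current node set L
    edges = []             # the edges collected for the tree T
    created = 0

    def dd(a, b):
        return d[frozenset((a, b))]

    while len(active) > 2:
        n = len(active)
        r = {a: sum(dd(a, b) for b in active if b != a) / (n - 2) for a in active}
        pairs = [(a, b) for x, a in enumerate(active) for b in active[x + 1:]]
        # Pick the pair with minimal modified distance D_ij.
        i, j = min(pairs, key=lambda p: dd(*p) - (r[p[0]] + r[p[1]]))
        k = f"node{created}"
        created += 1
        d_ik = 0.5 * (dd(i, j) + r[i] - r[j])
        d_jk = dd(i, j) - d_ik
        edges += [(i, k, d_ik), (j, k, d_jk)]
        # Distances from the new node k to every remaining node m.
        for m in active:
            if m not in (i, j):
                d[frozenset((k, m))] = 0.5 * (dd(i, m) + dd(j, m) - dd(i, j))
        active = [m for m in active if m not in (i, j)] + [k]

    i, j = active
    edges.append((i, j, dd(i, j)))  # termination: the one remaining edge
    return edges


example = [
    [0.0, 0.3, 0.5, 0.6],
    [0.3, 0.0, 0.6, 0.5],
    [0.5, 0.6, 0.0, 0.9],
    [0.6, 0.5, 0.9, 0.0],
]
for edge in neighbor_joining(["1", "2", "3", "4"], example):
    print(edge)
```

On this additive four-leaf input the sketch returns 2n - 3 = 5 edges. Ties between equally good pairs are broken by list order, so the internal node names may differ between equivalent outputs.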