Title: Evolution and Molecular Phylogeny I
MA5232 Lecture 2
- LX Zhang
- Department of Mathematics
- National University of Singapore
- matzlx_at_nus.edu.sg
Outline
- Evolution, genetics and genomics background
- Phylogeny model of evolution
- Mathematical properties of phylogenies
- How to build a molecular phylogeny?
  -- Maximum parsimony
  -- Maximum likelihood
- Tree distances
- Gene trees and species trees
1. A Little Background
- Evolution is a fundamental subject in modern biology, unifying different fields such as genetics and microbiology.
- In 1859, Darwin published On the Origin of Species, which introduced
  -- Natural selection
  -- The tree of life
The Tree of Life
- All living species are the modified descendants of earlier species.
- At the base is the common ancestor in the distant past, and out of it grows a trunk, which splits again and again to create a large bifurcating tree.
- Each branch represents a single species; branching points are where one species becomes two. Most branches eventually come to a dead end as species go extinct, but some reach right to the top; these are the living species.
- Ever since Darwin, the tree concept has been the main principle for understanding all living things.
Figure: The tree of life in Darwin's notebook.
How, then, is a new species formed?
  -- Through speciation events.
Natural Selection
- Evolution is driven by a process of natural selection.
  -- All individuals struggle to survive, but those with small beneficial traits have a greater chance of surviving (ecological selection) or reproducing (sexual selection).
  -- Over long ages, this process of slow evolutionary change causes one species to evolve into another.
  -- A species splits into two in a speciation event.
Molecular Evolution
- Darwinian evolution and genetics together produced modern evolutionary biology.
- The mechanisms of inheritance began to be revealed at the start of the 20th century.
- The structure of DNA was uncovered in 1953.
- As the blueprint of life, DNA must be the material in which the history of life is written.
  -- Starting from a common ancestral sequence, two DNA sequences diverge, and each of them starts to accumulate nucleotide substitutions.
  -- The more closely related two species are, the more similar their DNA ought to be.
- Molecular phylogenetic analysis has proved to be a powerful tool. For instance, it has been used for
  -- Studying human evolution
  -- Classifying the giant panda
  -- Tracing HIV infection
A Primer on the Human Genome
- Cells are the fundamental working units of every living system.
- DNA is made of 4 nucleotide bases. The DNA sequence is the particular side-by-side arrangement of bases along the DNA strand. This order spells out the exact instructions required to create a particular organism.
- The genome is an organism's complete set of DNA.
- Except for mature red blood cells, all human cells contain a complete genome, arranged in 24 distinct chromosomes.
- Each chromosome contains many genes, the basic physical and functional units of heredity.
- Genes are specific sequences of bases that encode instructions on how to make proteins.
- Proteins perform most life functions and even make up the majority of cellular structures.
- Proteins are large, complex molecules made up of smaller subunits called amino acids.
- A protein folds up into a specific three-dimensional structure that defines its particular function in the cell.
How does the human genome stack up?
What does the draft human genome sequence tell us?
By the Numbers
- The human genome contains 3 billion chemical nucleotide bases (A, C, T, and G).
- The average gene consists of 3000 bases, but sizes vary greatly, with the largest known human gene being dystrophin at 2.4 million bases.
- The total number of genes is estimated at around 30,000, much lower than previous estimates of 80,000 to 140,000.
- The functions are unknown for over 50% of discovered genes.
(U.S. Department of Energy Genome Programs, Genomics and Its Impact on Science and Society, 2003)
- Almost all (99.9%) nucleotide bases are exactly the same in all people.
- Scientists have identified about 3 million locations where single-base DNA differences (SNPs) occur in humans. This information promises to revolutionize the processes of finding chromosomal locations for disease-associated sequences and tracing human history.
How It's Arranged
- The human genome's gene-dense "urban centers" are rich in the nucleotides G and C.
- In contrast, the gene-poor "deserts" are rich in the DNA nucleotides A and T.
- Genes appear to be concentrated in random areas along the genome, with vast expanses of noncoding DNA between.
- Chromosome 1 has the most genes (2968), and the Y chromosome has the fewest (231).
The Wheat from the Chaff
- Less than 2% of the genome codes for proteins.
- Repeated sequences that do not code for proteins ("junk DNA") make up at least 50% of the human genome.
- Repetitive sequences shed light on chromosome structure and dynamics. Over time, these repeats reshape the genome by rearranging it, creating entirely new genes, and modifying and reshuffling existing genes.
2. (Molecular) Phylogenetic Tree: Basic Model of Evolution
- A phylogenetic tree summarizes the evolutionary relationships (or differences) among a set of sequences.
- A tree structure is composed of nodes and edges (or branches). Nodes represent the taxonomic units (genes, sequences).
Some concepts
[Figure: a phylogenetic tree with leaves S1, S2, S3, S4, S5]
The topology of a phylogenetic tree is its branching pattern, which is a tree in the graph-theoretic sense.
  -- The degree of a node is the number of branches incident to it. It can be one, two, or three.
  -- Degree-1 nodes are leaves, usually carrying a label.
  -- Non-leaf nodes are internal nodes.
  -- There is at most one node of degree 2; if it exists, it is called the root.
Leaves represent the sequences under comparison, called operational taxonomic units (OTUs) or tips.
Internal nodes represent inferred ancestral sequences, called hypothetical sequences.
Unrooted and rooted phylogenetic trees
An unrooted phylogenetic tree, also called a phenogram, is a tree in which
  --- all nodes represent related descendants,
  --- but there is not enough information about their common ancestor.
A rooted phylogenetic tree, also called a cladogram, is a tree in which
  --- there is a root, representing the common ancestor of the objects represented by the leaves;
  --- the path from the root to a leaf represents an evolutionary path.
Branch length and molecular clock
A branch from a node to its child is often assigned a length or weight, representing the number of mutations occurring in the corresponding course of evolution.
  --- The mutational events under consideration vary from study to study.
  --- The length of a branch may be converted into evolutionary time with a molecular clock, i.e., a mutation rate.
Molecular clock hypothesis
  -- All mutations occur at the same rate in all branches of the tree.
  -- The mutation rate is the same for all positions along a sequence.
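As a small worked example of converting a branch length into time, the sketch below simply divides the branch length by a clock rate. The function name and the numbers are hypothetical, chosen for illustration; they are not values from the lecture.

```python
def branch_time(branch_length, rate):
    """Time spanned by a branch under a strict molecular clock.

    branch_length: expected substitutions per site along the branch
    rate: substitutions per site per year (assumed equal on every branch)
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return branch_length / rate


# Hypothetical numbers: 0.02 substitutions/site at 1e-9 substitutions/site/year.
years = branch_time(0.02, 1e-9)
print(f"{years:.3g} years")
```

The conversion is only as good as the clock hypothesis: if rates differ between branches or between sites, a single rate gives misleading times.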
3. Basic Mathematical Properties
Theorem 1
(a) Each unrooted phylogenetic tree with $n$ leaves has $2n-2$ nodes and $2n-3$ edges, for $n \ge 3$.
(b) Each rooted phylogenetic tree with $n$ leaves has $2n-1$ nodes and $2n-2$ edges, for $n \ge 3$.
Proof.
(a) By induction on $n$.
(b) Appending a leaf to the root of a rooted tree with $n$ leaves gives an unrooted tree with $n+1$ leaves. By (a), that tree has $2n$ nodes and $2n-1$ edges; removing the appended leaf and its edge leaves $2n-1$ nodes and $2n-2$ edges.
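The induction behind part (a) can be mirrored in a few lines. This is a sketch under one assumption not spelled out on the slide: the tree is grown from the 3-leaf star by repeatedly attaching a new leaf to the middle of an existing edge. The function name is hypothetical.

```python
def grow_unrooted_counts(n):
    """Count nodes and edges of an unrooted binary tree with n >= 3 leaves.

    Start from the 3-leaf star (4 nodes, 3 edges). Attaching a new leaf
    subdivides one existing edge, which adds one internal node, one leaf,
    and two edges in total.
    """
    if n < 3:
        raise ValueError("need n >= 3")
    nodes, edges = 4, 3
    for _ in range(n - 3):
        nodes += 2
        edges += 2
    return nodes, edges


for n in range(3, 8):
    # Unrooted case: 2n - 2 nodes and 2n - 3 edges.
    assert grow_unrooted_counts(n) == (2 * n - 2, 2 * n - 3)
    # Rooted case: drop the leaf appended at the root of the (n + 1)-leaf tree.
    nodes, edges = grow_unrooted_counts(n + 1)
    assert (nodes - 1, edges - 1) == (2 * n - 1, 2 * n - 2)
```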
- Each branch in an unrooted phylogenetic tree defines a partition of the set of leaf labels.
Two unrooted trees are identical if their edges induce the same label partitions.
- Each node in a rooted phylogenetic tree defines a subset of leaf labels, composed of the leaves below it. For this reason, rooted trees are also used in taxonomy.
Two rooted trees are identical if their internal nodes induce the same label subsets.
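The identity test for rooted trees can be sketched as follows. The nested-tuple encoding and the function names are assumptions made for this illustration: a tree is either a leaf label (a string) or a tuple of subtrees.

```python
def clusters(tree):
    """Leaf-label subsets induced by the internal nodes of a rooted tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.add(below)
        return below

    visit(tree)
    return found


def same_rooted_tree(t1, t2):
    """Two rooted trees are identical if they induce the same subsets."""
    return clusters(t1) == clusters(t2)


# Hypothetical trees: t1 and t2 differ only in the order the children are written.
t1 = ((("A", "B"), "C"), "D")
t2 = ("D", ("C", ("B", "A")))
t3 = ((("A", "C"), "B"), "D")
print(same_rooted_tree(t1, t2), same_rooted_tree(t1, t3))
```

Comparing sets of subsets ignores the drawing order of children, which is exactly why the subsets, and not the picture, define the tree.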
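Unrooted or rooted phylogenetic trees have labeled leaves and unlabeled internal nodes.
[Figure: example unrooted trees on the leaves 1, 2, 3, ..., "and 13 more for n = 5"]
Rooted case
[Figure: example rooted trees on the leaves 1, 2, 3, 4, "and 13 more for n = 4"]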
Theorem 2
(1) There are $(2n-5)!! = 1 \cdot 3 \cdot 5 \cdots (2n-5)$ unrooted phylogenetic trees with unit-length branches, unlabeled internal nodes and $n$ labeled leaves.
(2) There are $(2n-3)!! = 1 \cdot 3 \cdot 5 \cdots (2n-3)$ rooted phylogenetic trees with unit-length branches, unlabeled internal nodes and $n$ labeled leaves.
There are $(2n-3)$ times as many rooted trees as unrooted trees over $n$ leaves.
Number of all possible trees
n     Unrooted       Rooted
3     1              3
4     3              15
5     15             105
6     105            945
7     945            10,395
8     10,395         135,135
9     135,135        2,027,025
10    2,027,025      34,459,425
11    34,459,425     654,729,075
The tree model is rich.
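The table can be reproduced with a short sketch, assuming the standard closed forms $(2n-5)!!$ for unrooted and $(2n-3)!!$ for rooted binary trees, which agree with every row above. The function names are hypothetical.

```python
def double_factorial_odd(m):
    """Product 1 * 3 * 5 * ... * m for odd m (returns 1 when m <= 1)."""
    result = 1
    for k in range(3, m + 1, 2):
        result *= k
    return result


def count_unrooted(n):
    """Number of unrooted binary trees on n labeled leaves, n >= 3."""
    return double_factorial_odd(2 * n - 5)


def count_rooted(n):
    """Number of rooted binary trees on n labeled leaves, n >= 2."""
    return double_factorial_odd(2 * n - 3)


for n in range(3, 12):
    print(f"{n:>2} {count_unrooted(n):>12,} {count_rooted(n):>13,}")
```

Each rooted count equals the unrooted count for one more leaf, reflecting the fact that a root can be placed on any of the $2n-3$ edges of an unrooted tree.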
(Terminology: edges are also called branches.)
4. Procedure for Building (or Reconstructing) Molecular Phylogenetic Trees
- Methods
- Character-based methods
  -- Maximum parsimony
  -- Maximum likelihood
- Distance-based methods
  -- Neighbor joining
  -- UPGMA method
How to choose a phylogenetic method?
1. Collect a set of DNA or protein sequences.
2. Obtain a multiple sequence alignment.
3. Strong similarity?
   -- Yes: maximum parsimony
   -- No: maximum likelihood or distance methods
4. Validate the result (optional).
4.1 Maximum Parsimony Method
- It assumes that a substitution at one position in a sequence is independent of those occurring at neighboring positions.
- It outputs a tree that requires the minimum number of changes to explain the differences observed in the alignment data.
  -- Ockham's razor principle: the explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference in the observable predictions of the explanatory hypothesis.
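Consider the following DNA data and a proposed tree.
[Figure: DNA data and a proposed tree]
By examining all sites individually, we show that 8 substitutions are needed to explain the data.
[Figures: per-site labelings of the tree, beginning with two alternatives joined by "Or", followed by sites 2 to 6]
In summary,
[Figure: the proposed tree with its nodes labeled by the sequences GCGGTA, GTCGTA, GTCACT, ACGACA, ACGGTA, GTCGTA, ACGGTA, GCGGTA, ACGGTA]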
Changing the tree will alter the number of substitutions required to explain the data. Hence, the following problem arises.
Small Parsimony Problem
Input: A rooted phylogenetic tree with leaves labeled with letters.
Question: Label the internal nodes so as to minimize the number of substitutions over all branches.
Assume that u is connected to v by a branch in a rooted tree. The node u is called a child of v if v is closer to the root than u. Each internal node has exactly two children.
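Fitch Algorithm
- Step 1: Compute a subset $S_x$ of letters for each node $x$ in the tree, working from the leaves up to the root.
  -- For each leaf $x$ having label $l_x$, $S_x = \{l_x\}$.
  -- For each internal node $u$ having children $v$ and $w$, after $S_v$ and $S_w$ are computed, we compute
     $S_u = S_v \cap S_w$ if $S_v \cap S_w \neq \emptyset$, and $S_u = S_v \cup S_w$ otherwise.
[Figure: example tree annotated with sets such as $\{G\} \cup \{A\}$ and $\{G, A\} \cap \{G\}$]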
Fitch Algorithm
- Step 2: Select a letter $l_x$ from $S_x$ to label each node $x$. This is done from the root down to the leaves.
  (1) Select an arbitrary letter from $S_r$ to be $l_r$ for the root $r$.
  (2) Assume $u$ is a child of $v$ and $l_v$ is determined. If $l_v$ belongs to $S_u$, then $l_u = l_v$. Otherwise, select an arbitrary letter from $S_u$ as $l_u$.
[Figure: the example tree with nodes labeled G]
Another Example
[Figure: the algorithm applied to the columns of an alignment]
k is the number of letters appearing in the leaf labels.
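The two steps of the Fitch algorithm can be sketched for a single site as follows. The tree encoding (a leaf name or a pair of subtrees), the function name, and the example letters are assumptions made for this illustration; the "arbitrary" choices of Step 2 are made deterministic by taking the alphabetically smallest letter.

```python
def fitch(tree, leaf_letter):
    """Fitch algorithm on a rooted binary tree for one site.

    tree: a leaf name (str) or a pair (left, right) of subtrees
    leaf_letter: dict mapping leaf name -> letter
    Returns (labeled_tree, substitutions); each node of labeled_tree is
    (letter, leaf_name) for a leaf or (letter, left, right) otherwise.
    """
    cost = 0

    def up(node):  # Step 1: from the leaves up to the root
        nonlocal cost
        if isinstance(node, str):
            return (frozenset([leaf_letter[node]]), node)
        left, right = up(node[0]), up(node[1])
        common = left[0] & right[0]
        if common:
            return (common, left, right)
        cost += 1  # an empty intersection forces one substitution
        return (left[0] | right[0], left, right)

    def down(annotated, parent_letter):  # Step 2: from the root down
        letters = annotated[0]
        letter = parent_letter if parent_letter in letters else min(letters)
        if isinstance(annotated[1], str):
            return (letter, annotated[1])
        return (letter, down(annotated[1], letter), down(annotated[2], letter))

    return down(up(tree), None), cost


# Hypothetical example: five leaves carrying the letters G, A, G, A, A.
example_tree = ((("a", "b"), "c"), ("d", "e"))
example_letters = {"a": "G", "b": "A", "c": "G", "d": "A", "e": "A"}
labeled, substitutions = fitch(example_tree, example_letters)
print(substitutions, labeled)
```

The number of substitutions equals the number of times Step 1 had to take a union, so the score is known before any letter is chosen in Step 2.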
Maximum Parsimony Problem
Input: A multiple sequence alignment.
Solution: A most parsimonious tree, i.e., a tree that explains the sequence data with the minimum number of substitutions.
S1 GCTTTATTCTT
S2 GCTTCATTGAG
S3 GATTCAGTGTG
S4 GCTGTAATGTG
[Figure: a tree on the leaves S1, S3, S4, S2]
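For four sequences the problem can be solved by brute force: score each of the three possible unrooted topologies with the Fitch count and keep the smallest. The sketch below does this for the alignment above, read as S1 = GCTTTATTCTT, S2 = GCTTCATTGAG, S3 = GATTCAGTGTG, S4 = GCTGTAATGTG. The function names are hypothetical, and exhaustive enumeration is only feasible for very small numbers of leaves.

```python
def fitch_score(tree, column):
    """Minimum substitutions for one alignment column on a rooted binary tree."""
    cost = 0

    def up(node):
        nonlocal cost
        if isinstance(node, str):
            return {column[node]}
        left, right = up(node[0]), up(node[1])
        if left & right:
            return left & right
        cost += 1
        return left | right

    up(tree)
    return cost


def parsimony_score(tree, alignment):
    """Sum of the per-column Fitch scores over the whole alignment."""
    names = list(alignment)
    length = len(alignment[names[0]])
    return sum(
        fitch_score(tree, {name: alignment[name][i] for name in names})
        for i in range(length)
    )


alignment = {
    "S1": "GCTTTATTCTT",
    "S2": "GCTTCATTGAG",
    "S3": "GATTCAGTGTG",
    "S4": "GCTGTAATGTG",
}
# The three unrooted topologies on four leaves, each rooted on its internal branch.
topologies = [
    (("S1", "S2"), ("S3", "S4")),
    (("S1", "S3"), ("S2", "S4")),
    (("S1", "S4"), ("S2", "S3")),
]
scores = {tree: parsimony_score(tree, alignment) for tree in topologies}
best = min(scores, key=scores.get)
print(scores[best], best)
```

On this alignment only the fifth column distinguishes the topologies, so the grouping (S1, S4) | (S2, S3) needs 8 substitutions against 9 for the other two.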
4.2 Maximum Likelihood Method
- Given a set of $k$ sequences of $n$ characters each (Data).
- Assume an evolution model $M$ (e.g. Jukes-Cantor), which specifies
  -- a prior distribution of nucleotides at each site at the root;
  -- that the sites evolve independently and identically (i.i.d.);
  -- the probability $p(x \to y \mid t)$ that a letter $x$ is replaced by another letter $y$ on a branch of length $t$, for any $x$ and $y$.
- Find a tree $H$ of $k$ leaves labeled with the sequences that maximizes the conditional probability
  $L = \Pr[\text{Data} \mid H, M]$,
  called the likelihood, under the model $M$.
- Since the sites evolve i.i.d.,
  $L = \prod_{1 \le i \le n} \Pr[\text{Data}(i) \mid H, M]$,
  where Data(i) denotes the $i$-th site.
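Likelihood calculation
Assume that at the $i$-th site the sequences have the letters T, A, A, respectively, and consider the following tree $H$.
[Figure: tree H with root X, internal node Y, leaves T, A, A, and branch lengths t, s, v, u]
  -- The root $X$ may take one of the 4 possible states A, G, C, T.
  -- The internal node $Y$ also has 4 possible states A, G, C, T.
  -- For each pair of specific states, say $X = A$, $Y = G$, the evolution has the probability
     $p(X = A)\, p(A \to A \mid t)\, p(A \to G \mid s)\, p(G \to T \mid v)\, p(G \to A \mid u)$.
  -- Considering all the possibilities, the probability is
     $\Pr[\text{Data}(i) \mid H, M] = \sum_{x \in \Sigma} \sum_{y \in \Sigma} p(X = x)\, p(x \to A \mid t)\, p(x \to y \mid s)\, p(y \to T \mid v)\, p(y \to A \mid u)$,
     where $\Sigma$ is the alphabet containing A, G, C, T.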
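The sum over the states of X and Y can be written out directly. The sketch below assumes the Jukes-Cantor model with a uniform prior at the root and branch lengths measured in expected substitutions per site; the tree shape (X joined to one leaf by a branch of length t and to Y by s, Y joined to the other two leaves by v and u) is read off from the product above. The function names and the branch lengths are hypothetical.

```python
import math
from itertools import product

ALPHABET = "AGCT"


def jc69(x, y, t):
    """Jukes-Cantor probability that letter x becomes y along a branch of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def site_likelihood(leaf1, leaf2, leaf3, t, s, v, u):
    """Probability of one site, summed over the states of root X and node Y.

    Assumed shape: X -> leaf3 (length t), X -> Y (length s),
    Y -> leaf1 (length v), Y -> leaf2 (length u); prior 1/4 at the root.
    """
    total = 0.0
    for x, y in product(ALPHABET, repeat=2):
        total += (0.25 * jc69(x, leaf3, t) * jc69(x, y, s)
                  * jc69(y, leaf1, v) * jc69(y, leaf2, u))
    return total


# Leaves T, A, A as in the example; the branch lengths are made up.
print(site_likelihood("T", "A", "A", t=0.1, s=0.05, v=0.2, u=0.2))
```

The double loop has 16 terms per site. For larger trees the same sum is usually organized node by node rather than enumerated, since the number of terms grows exponentially with the number of internal nodes.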
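For example, for the three sequences
S1 TGGT
S2 AGGT
S3 AGCG
we need to consider the following three different tree topologies.
[Figure: the three tree topologies on S1, S2, S3]
For each tree $H$, say the left one, the likelihood is
$\Pr[\text{Data} \mid H, M] = P_1 \cdot P_2 \cdot P_3 \cdot P_4$,
where $P_i$ is the probability of the $i$-th site.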
For each tree $H$, we need to choose 4 branch lengths to maximize the likelihood $\Pr[\text{Data} \mid H, M]$. When expanded, this likelihood contains 16 x 16 x 16 x 16 terms, because each of the 4 sites contributes 16 terms, each a product of 5 probabilities. So the maximum likelihood method is extremely time-consuming.
- ML is preferred by some systematists, but it is even harder than MP in practice; most systematic biologists use ML only on small datasets.
- Theoretically, it is NP-hard.
- The main challenge is to obtain good ML solutions in reasonable time on large datasets.
5. Consistency among Trees and Distance between Trees
- We often have two or more trees for the same group, often from different types of data or from different methods.
  -- How do we put all the trees together to get one overall estimate of the tree?
  -- How do we measure the extent of the difference between trees?
Robinson-Foulds distance
Partition metric, or Robinson-Foulds distance, for unrooted trees
  -- Each branch in an unrooted tree gives a partition of the set of leaf labels. So an unrooted tree is uniquely defined by the partitions induced by its branches.
  -- The Robinson-Foulds distance between two trees is equal to the number of partitions that are found in one tree but not in the other.
[Figure: example trees with leaves labeled A through G]
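The definition translates into a short sketch: collect the partitions induced by the internal branches of each tree and count those found in only one of them. The encoding is an assumption of this illustration: an unrooted tree is written as nested tuples by rooting it at any convenient place, and each partition is stored as the side that does not contain a fixed reference leaf. The example trees are hypothetical.

```python
def leaf_set(tree):
    """All leaf labels of a tree given as nested tuples of strings."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Nontrivial leaf partitions induced by the branches of the tree."""
    everything = leaf_set(tree)
    reference = min(everything)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Store the side of the partition that avoids the reference leaf.
        side = everything - below if reference in below else below
        if 2 <= len(side) <= len(everything) - 2:
            found.add(side)
        return below

    visit(tree)
    return found


def robinson_foulds(t1, t2):
    """Number of partitions present in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))


tree1 = ((("A", "B"), "C"), ("D", "E"))
tree2 = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(tree1, tree2))
```

Partitions that separate a single leaf from the rest occur in every tree on the same leaves, so leaving them out does not change the distance.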
Nearest Neighbor Interchange Distance
There are 3 unrooted trees formed by grouping 4 leaves in different ways.
In general, each internal branch, i.e., one that connects two internal nodes, is adjacent to 4 other branches (red in the example). These 4 branches are called the nearest neighbors of each other.
Nearest Neighbor Interchange (NNI)
Consider the branch leading to C. It is originally grouped with its nearest neighboring branch, the one leading to (A, B). By interchanging it with one of the other two nearest neighbors, we obtain different trees:
  -- interchange with the one leading to (E, F);
  -- interchange with the one leading to (G, D).
[Figure: the original tree on the leaves A, B, C, D, E, F, G and the two trees obtained by NNI]
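A single NNI can be sketched by naming the four subtrees around one internal branch and regrouping them. The function name and the tuple encoding are assumptions made for this illustration; the example uses the subtrees named in the text, with C and (A, B) on one side of the branch and (E, F) and (G, D) on the other.

```python
def nni_neighbors(a, b, c, d):
    """Trees reachable by one NNI across a single internal branch.

    The branch separates subtrees a and b on one side from c and d on the
    other. Each result lists the pair on one side, then the pair on the other.
    """
    return [
        ((a, c), (b, d)),  # b interchanged with c
        ((a, d), (c, b)),  # b interchanged with d
    ]


ab, c, ef, gd = ("A", "B"), "C", ("E", "F"), ("G", "D")
original = ((ab, c), (ef, gd))
for tree in nni_neighbors(ab, c, ef, gd):
    print(tree)
```

Each internal branch yields exactly two alternative trees, so an unrooted binary tree with n leaves, which has n - 3 internal branches, has 2(n - 3) NNI neighbors.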
A tree can be transformed into any other tree by a series of NNI operations.
Figure: The space of phylogenetic trees with 5 leaves, connected by NNI transformations.
The NNI distance d(S, T) between two unrooted trees over the same set of leaves is the minimum number of NNI operations needed to transform one into the other.
Theorem For any two unrooted trees S and T of n
leaves,
their NNI distance is bounded as
Computing consensus trees
Strict Consensus
  -- Each node represents a subset of leaf labels. Given a set of rooted trees, the strict consensus is the tree that contains exactly those subsets that appear in all of the given trees.
[Figure: trees drawn over the leaf orders A B C D E F G, B C D E F G A, and A B C D E F G]
These trees differ only in the placement of A and are hence quite similar. Since the only common subset is the whole set of labels, their strict consensus is completely unresolved.
This shows the limitation of this consensus method.
Remarks
(1) A consensus tree is not a phylogenetic tree.
(2) A consensus tree can also be defined for a set of unrooted trees.
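The strict consensus amounts to intersecting the subset collections of the input trees. The sketch below uses two hypothetical comb-shaped trees that differ only in where A is attached, in the spirit of the example above; the nested-tuple encoding and the function names are assumptions of this illustration.

```python
def clusters(tree):
    """Leaf-label subsets induced by the internal nodes of a rooted tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.add(below)
        return below

    visit(tree)
    return found


def strict_consensus_clusters(trees):
    """Subsets that appear in every one of the given rooted trees."""
    return set.intersection(*(clusters(tree) for tree in trees))


tree1 = (((((("A", "B"), "C"), "D"), "E"), "F"), "G")
tree2 = (((((("B", "C"), "D"), "E"), "F"), "G"), "A")
common = strict_consensus_clusters([tree1, tree2])
print([sorted(subset) for subset in common])
```

Only the full label set survives, so the consensus is a single node joined to all seven leaves, even though the two inputs agree on everything except the position of A.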
Majority-Rule Consensus
  -- Each node represents a subset of leaf labels. Given a set of trees, under the majority rule the consensus tree contains all the subsets that occur in more than half of the given trees.
Theorem: The majority-rule consensus tree always exists.
The left tree
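A majority-rule consensus can be sketched by counting how often each subset occurs. The sketch uses a strict majority (more than half of the trees), because any two subsets that each occur in more than half of the trees must occur together in at least one tree and therefore cannot conflict; this is what guarantees that the consensus tree exists. The trees, the encoding, and the function names are hypothetical.

```python
from collections import Counter


def clusters(tree):
    """Leaf-label subsets induced by the internal nodes of a rooted tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.add(below)
        return below

    visit(tree)
    return found


def majority_rule_clusters(trees):
    """Subsets occurring in more than half of the given rooted trees."""
    counts = Counter(subset for tree in trees for subset in clusters(tree))
    return {subset for subset, k in counts.items() if 2 * k > len(trees)}


trees = [
    ((("A", "B"), "C"), ("D", "E")),
    ((("A", "B"), "D"), ("C", "E")),
    ((("A", "C"), "B"), ("D", "E")),
]
majority = majority_rule_clusters(trees)
print(sorted("".join(sorted(subset)) for subset in majority))
```

Here the subsets AB, ABC and DE each occur in two of the three trees, so the consensus is the first tree, although no subset other than the full label set is shared by all three.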
6. Species Trees and Gene Trees
- A species tree describes the evolutionary relationships of various species that are believed to have a common ancestor.
- Internal nodes represent speciation events that occurred in the course of evolution.
- A gene tree depicts how a single gene has evolved in a group of related species.
- It provides evidence for the speciation events that are responsible for species evolution. Hence it is often used to estimate the species tree.
- But more and more analyses indicate that gene trees are often inconsistent with the species tree.
  -- A gene tree may have different branch lengths.
  -- It may even have a different topology.
- There are two reasons for this inconsistency.
Reason 1: Genetic drift varies
- Species evolve through speciation events.
- At the same time, genes get passed from generation to generation.
- Each generation has a set of individuals, and each individual may or may not have children in the next generation.
(http://scintilla.nature.com/node/625714)
- When a speciation occurs, two species are formed and the population is divided into two parts.
- Breeding across the parts (species) is no longer possible.
- Therefore, after speciation, no two individuals in different species can have the same parent in the previous generation.
- But this does not mean that the common ancestor of individuals in two separate species lived at the time when the speciation occurred.
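The last point can be illustrated with a small simulation. This is a sketch under assumptions that are not part of the lecture: a haploid population of constant size in which every gene copy picks its parent uniformly at random from the previous generation. Two copies, one from each species, are followed back from the speciation time until they first share a parent. The function name and the population size are hypothetical.

```python
import random


def generations_to_common_ancestor(population_size, rng):
    """Trace two gene copies back in time until they share a parent.

    Each copy picks its parent uniformly at random among the
    population_size individuals of the previous generation.
    """
    generations = 0
    while True:
        generations += 1
        if rng.randrange(population_size) == rng.randrange(population_size):
            return generations


rng = random.Random(0)
waits = [generations_to_common_ancestor(100, rng) for _ in range(2000)]
print(sum(waits) / len(waits))
```

Under these assumptions the common ancestor of the two copies lived, on average, about as many generations before the speciation as there are individuals in the ancestral population, which is why a gene tree can disagree with the species tree in branch lengths and even in topology.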
Current research problem: How to reconstruct the species tree from a set of gene trees?
  -- Different genes tell different stories.
  -- Also, different reconstruction tools often produce different trees.
(Wong et al., 2008, on yeast species)