Evolution and Molecular Phylogeny I

About This Presentation

Slides: 57

Provided by: mat91

Transcript and Presenter's Notes

1
Evolution and Molecular Phylogeny I
MA5232 Lecture 2

LX Zhang
Department of Mathematics
National University of Singapore
matzlx_at_nus.edu.sg

2
Outline

Evolution, genetics and genomics background
Phylogeny model of evolution
Mathematical properties of phylogenies.
How to build molecular phylogeny?
-- Maximum parsimony
-- Maximum likelihood
Tree distances
Gene trees and species trees

3
1. A Little Background

Evolution is a basic subject in modern
biology, unifying different
fields such as genetics and microbiology.
In 1859, Darwin published On the Origin of
Species
-- Natural Selection
-- Tree of life

4
The Tree of Life

All living species are the modified descendants
of earlier species.
At the base is the common ancestor in the distant
past, and out of it grows a trunk, which splits again
and again to create a large bifurcating tree.
Each branch represents a single species;
branching points are where one species
becomes two. Most branches eventually come to
a dead end as species go extinct. But some
reach right to the top; these are the living
species.
Ever since Darwin, the tree concept has been
the main principle for understanding all
living things.

[Figure: the tree of life in Darwin's notebook]
How, then, is a new species formed?
-- Through speciation events.
5
Natural Selection

Evolution is driven by a process of natural
selection.
-- All individuals struggle to survive, but
some with small beneficial traits have a
greater chance of surviving (ecological selection)
or reproducing (sexual selection).
-- Over long ages, this process of slow
evolutionary change causes one
species to evolve into another.
-- A species can also split into two in
a speciation event.

6
Molecular Evolution

Darwinian evolution and genetics together produced
modern evolutionary biology.
The mechanisms of inheritance began to be
revealed at the start of the 20th century.
The structure of DNA was uncovered in 1953.
As the blueprint of life, DNA must be the material
in which the history of life
is written.
-- From a common ancestral sequence, two
DNA sequences diverge,
and each of the two sequences starts to
accumulate nucleotide substitutions.
-- The more closely related two species
are, the more similar their
DNA ought to be.
Molecular phylogenetic analysis has proved to be
a powerful tool. For instance,
it has been used for
-- Studying human evolution
-- Classifying the giant panda
-- Tracing HIV infection

7
A Primer to Human Genome

Cells are the fundamental working units of every
living system.
DNA is made of 4 nucleotide bases. The DNA
sequence is the particular side-by-side
arrangement of bases along the DNA strand. This
order spells out the exact instructions required
to create a particular organism.
The genome is an organism's complete set of DNA.
Except for mature red blood cells, all human
cells contain a complete genome arranged in 24
distinct chromosomes.

Each chromosome contains many genes,
the basic physical and functional units of
heredity.
Genes are specific sequences of bases that
encode instructions on how to make proteins.
Proteins perform most life functions and even
make up the majority of cellular structures.
Proteins are large, complex molecules made
up of smaller subunits called amino acids.
A protein folds up into a specific
three-dimensional structure that defines its
particular function in the cell.

9
How does the human genome stack up?
10
What does the draft human genome sequence tell
us?

By the Numbers
The human genome contains 3 billion chemical nucleotide bases (A, C, T, and G).
The average gene consists of 3000 bases, but
sizes vary greatly, with the largest known human
gene being dystrophin at 2.4 million bases.
The total number of genes is estimated at
around 30,000, much lower than previous estimates
of 80,000 to 140,000.
The functions are unknown for over 50% of discovered genes.
Source: U.S. Department of Energy Genome Programs,
Genomics and Its Impact on Science and Society,
2003
11
Almost all (99.9%) nucleotide bases are exactly
the same in all people. Scientists have
identified about 3 million locations where
single-base DNA differences (SNPs) occur in
humans. This information promises to revolutionize
the processes of finding chromosomal
locations for disease-associated sequences and
tracing human history.
12
What does the draft human genome sequence tell us?
How It's Arranged
The human genome's gene-dense "urban centers" are rich in the DNA nucleotides G and C.
In contrast, the gene-poor "deserts" are
rich in the DNA nucleotides A and T.
13
What does the draft human genome sequence tell us?

Genes appear to be concentrated in random areas
along the genome, with vast expanses of noncoding
DNA between.
Chromosome 1 has the most genes (2968), and the
Y chromosome has the fewest (231).

14
What does the draft human genome sequence tell
us?
The Wheat from the Chaff
Less than 2% of the genome codes for proteins.
Repeated sequences that do not code for proteins ("junk DNA") make
up at least 50% of the human genome. Repetitive
sequences shed light on chromosome structure and
dynamics. Over time, these repeats reshape the
genome by rearranging it, creating entirely new
genes, and modifying and reshuffling existing
genes.
15
2. (Molecular) Phylogenetic Tree: Basic Model of
Evolution

A phylogenetic tree summarizes the evolutionary relationships
(or differences) among a set of sequences.
A tree structure is composed of nodes and edges
(or branches). Nodes represent the taxonomic
units (genes, sequences).

16
Some concepts
[Figure: a phylogenetic tree with leaves S1, S2, S3, S4, S5]
The topology of a phylogenetic tree is the
branching pattern of the tree, which is a tree
in the graph-theoretic sense.
-- The degree of a node is the number of branches incident to it. It can be one, two, or three.
-- Degree-1 nodes are leaves, usually having a label.
-- Non-leaf nodes are internal nodes.
-- There is at most one node of degree 2; if it exists, it is called the root.
Leaves represent the sequences under comparison,
called operational taxonomic units (OTUs) or tips.
Internal nodes represent inferred ancestral
sequences,
called hypothetical sequences.
17
Unrooted and rooted phylogenetic trees
An unrooted phylogenetic tree, also called a
phenogram, is a tree in which
--- all nodes represent related descendants,
--- but there is not enough information about
their common ancestor.
A rooted phylogenetic tree, also called a cladogram,
is a tree in which
--- there is a root, representing the common ancestor of the objects
represented by the leaves.
--- The path from the root to a leaf represents an evolutionary path.
18
Branch length and molecular clock
A branch from a node to its child is often
assigned a length or weight, representing the number
of mutations occurring in the
corresponding course of evolution.
--- The mutational events under consideration
vary from study to study.
--- The length of a branch may be converted into
evolutionary time with a molecular clock,
i.e., a mutation rate.
Molecular clock hypothesis
-- All mutations occur at the same rate in all the tree branches.
-- The rate of mutation is the same for all positions along
a sequence.
19
3. Basic Mathematical Properties
Theorem 1
(a) Each unrooted phylogenetic tree of n leaves has $2n - 2$ nodes and $2n - 3$ edges for $n \ge 3$.
(b) Each rooted phylogenetic tree of n leaves has $2n - 1$ nodes and $2n - 2$ edges for $n \ge 3$.
Proof.
(a) It is proved by induction on n.
(b) It is derived from the following fact: appending a leaf to
the root of a rooted tree with n leaves gives
an unrooted tree with $n + 1$ leaves.
20

Each branch in an unrooted phylogenetic tree
defines a partition of
the set of leaf labels.

Two unrooted trees are identical if
their edges induce the same label partitions.

Each node in a rooted phylogenetic tree
defines a subset of
leaf labels, composed of the leaves below it. So,
rooted trees are also used
in taxonomy.

Two rooted trees are identical if
their internal nodes induce the same label
subsets.
21
Unrooted or rooted phylogenetic trees have
labeled leaves
and unlabeled internal nodes.
[Figure: examples of unrooted trees with labeled leaves]
and 13 more for n = 5
22
Rooted case
[Figure: examples of rooted trees on the leaves 1, 2, 3, 4]
and 13 more for n = 4
23
Theorem 2
(1) There are $(2n-5)!! = 1 \cdot 3 \cdot 5 \cdots (2n-5)$ unrooted phylogenetic trees with unit-length branches,
unlabeled internal nodes and n labeled leaves.
(2) There are $(2n-3)!! = 1 \cdot 3 \cdot 5 \cdots (2n-3)$ rooted phylogenetic trees with unit-length branches,
unlabeled internal nodes and n labeled leaves.

There are $(2n-3)$ times as many rooted trees
as unrooted trees over n leaves.
24
Number of all possible trees

n     Unrooted       Rooted
3     1              3
4     3              15
5     15             105
6     105            945
7     945            10,395
8     10,395         135,135
9     135,135        2,027,025
10    2,027,025      34,459,425
11    34,459,425     654,729,075

The tree model is rich.
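
The counts in the table can be reproduced with a short script. This is a minimal sketch, assuming the standard closed forms (2n-5)!! for unrooted and (2n-3)!! for rooted binary trees on n labeled leaves; the function names are illustrative and do not come from the lecture.

```python
def double_factorial(k: int) -> int:
    """Return k * (k - 2) * (k - 4) * ... down to 1 (1 when k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted(n: int) -> int:
    """Number of unrooted binary trees on n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("n must be at least 3")
    return double_factorial(2 * n - 5)


def count_rooted(n: int) -> int:
    """Number of rooted binary trees on n >= 3 labeled leaves."""
    if n < 3:
        raise ValueError("n must be at least 3")
    return double_factorial(2 * n - 3)


if __name__ == "__main__":
    for n in range(3, 12):
        print(n, count_unrooted(n), count_rooted(n))
```

The rooted count for n leaves equals the unrooted count for n + 1 leaves, which matches the shifted columns of the table.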
25
Edges are also called branches.
26
4. Procedure of Building (or Reconstructing)
Molecular Phylogenetic Trees

Methods
Character-based methods
-- Maximum parsimony
-- Maximum likelihood
Distance-based methods
-- Neighbor joining
-- UPGMA method

27
How to choose a phylogenetic method?
1. Collect a set of DNA or protein sequences.
2. Obtain a multiple sequence alignment.
3. Is there strong similarity?
   -- Yes: Maximum parsimony
   -- No: Maximum likelihood or distance methods
4. Validate the result (optional).
28
4.1 Maximum Parsimony Method

It assumes that a substitution at a position in
a sequence is independent of those
occurring
at neighboring positions.
It outputs a tree that requires the minimum
number of changes to explain the
differences
observed in the alignment data.
-- Ockham's razor principle:
The explanation of any phenomenon
should make as few
assumptions as possible,
eliminating those that make no
difference in the observable
predictions of the explanatory
hypothesis.

29
Consider the following DNA data and a proposed
tree.
[Figure: the DNA data and the proposed tree]
By examining all sites individually, we show that
8 substitutions are needed to explain the data.
30
At site 2
At site 3
At site 4
At site 5
At site 6
31
In summary,
GCGGTA
GTCGTA
GTCACT
ACGACA
ACGGTA
GTCGTA
ACGGTA
GCGGTA
ACGGTA
32
Changing the tree will alter the number of
substitutions required to explain the data. Hence, the following problem arises.
Small Parsimony Problem
Input: A rooted phylogenetic tree with leaves labeled with letters.
Question: Label the internal nodes to minimize the number of
substitutions over all branches.
Assume that u is connected to v by a branch in a
rooted tree. The node u is called a child of v if v is closer to the root than u.
Each internal node has exactly two
children.
33
Fitch Algorithm

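Step 1: Compute a subset $S_x$ of letters for each
node x in the tree.
For each leaf x having label $l_x$, $S_x = \{l_x\}$.
For each internal node u having
children v and w, we compute
$S_u = S_v \cap S_w$ if $S_v \cap S_w \neq \emptyset$, and $S_u = S_v \cup S_w$ otherwise,
after $S_v$ and $S_w$ are computed.

[Figure: example tree showing unions such as {G} ∪ {A} and intersections such as {G, A} ∩ {G}]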
34
Fitch Algorithm

Step 2: Select a letter $l_x$ from $S_x$ to label each
node x. This is done
from the root to the leaves.
(1) Select an arbitrary letter from $S_r$ to
be $l_r$ for the root r.
(2) Assume u is a child of v and $l_v$ is
determined. If $l_v$ belongs to $S_u$, then $l_u = l_v$.
Otherwise, select an arbitrary letter from $S_u$ as
$l_u$.

[Figure: the example tree with its internal nodes labeled G]
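
The two steps of the Fitch algorithm can be sketched in a few lines of Python. This is an illustrative sketch, not the lecturer's code: it assumes a rooted binary tree written as nested 2-tuples whose leaves are single-letter strings, the function names are hypothetical, and the "arbitrary" choice in Step 2 is made with `min` only so that the output is deterministic.

```python
def bottom_up(tree):
    """Step 1: return (letter_set, children, substitutions) for a subtree."""
    if isinstance(tree, str):                 # a leaf labeled with one letter
        return {tree}, (), 0
    left, right = (bottom_up(child) for child in tree)
    common = left[0] & right[0]
    cost = left[2] + right[2]
    if common:                                # non-empty intersection: no new change
        return common, (left, right), cost
    return left[0] | right[0], (left, right), cost + 1   # union: one more change


def top_down(node, parent_letter=None):
    """Step 2: choose one letter per node, from the root to the leaves."""
    letters, children, _ = node
    if parent_letter in letters:
        letter = parent_letter                # reuse the parent's letter if allowed
    else:
        letter = min(letters)                 # arbitrary choice, fixed for repeatability
    if not children:
        return letter
    return (letter, *(top_down(child, letter) for child in children))


def fitch(tree):
    """Return (labeled_tree, minimum_number_of_substitutions)."""
    root = bottom_up(tree)
    return top_down(root), root[2]


if __name__ == "__main__":
    example = ((("G", "A"), "G"), ("G", "A"))   # hypothetical five-leaf tree
    labeled, substitutions = fitch(example)
    print(labeled, substitutions)
```

In the labeled tree each internal node is written as `(letter, left_subtree, right_subtree)`. For the hypothetical example every internal node receives G and two substitutions are needed, one on each branch leading to a leaf labeled A.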
35
Another Example
[Figure: another worked example; the panel labels read "columns"]
k is the number of letters appearing in the leaf
labels.
36
Maximum Parsimony Problem
Input: A multiple sequence alignment.
Solution: A most parsimonious tree, i.e., one that explains the sequence data with the minimum number of substitutions.
S1 GCTTTATTCTT
S2 GCTTCATTGAG
S3 GATTCAGTGTG
S4 GCTGTAATGTG
[Figure: a tree with leaves in the order S1, S3, S4, S2]
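
For four sequences there are only three unrooted topologies, so the maximum parsimony problem can be solved by scoring each of them. The sketch below is an illustration under stated assumptions: each unrooted tree is rooted on its internal branch (the Fitch count does not depend on where the root is placed), trees are nested tuples of sequence names, and the helper names are hypothetical.

```python
def fitch_cost(tree, column):
    """Return (letter_set, substitutions) for one alignment column."""
    if isinstance(tree, str):                     # leaf: look up its letter
        return {column[tree]}, 0
    (lset, lcost), (rset, rcost) = (fitch_cost(child, column) for child in tree)
    common = lset & rset
    if common:
        return common, lcost + rcost
    return lset | rset, lcost + rcost + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch substitution counts over all alignment columns."""
    names = list(alignment)
    length = len(alignment[names[0]])
    return sum(
        fitch_cost(tree, {name: alignment[name][i] for name in names})[1]
        for i in range(length)
    )


alignment = {
    "S1": "GCTTTATTCTT",
    "S2": "GCTTCATTGAG",
    "S3": "GATTCAGTGTG",
    "S4": "GCTGTAATGTG",
}

# The three unrooted topologies on four leaves, rooted on the internal branch.
topologies = [
    (("S1", "S2"), ("S3", "S4")),
    (("S1", "S3"), ("S2", "S4")),
    (("S1", "S4"), ("S2", "S3")),
]

scores = {tree: parsimony_score(tree, alignment) for tree in topologies}
best = min(scores, key=scores.get)

if __name__ == "__main__":
    for tree, score in scores.items():
        print(tree, score)
    print("most parsimonious:", best)
```

On this alignment the grouping of S1 with S4 (and S2 with S3) needs 8 substitutions, while the other two groupings need 9; only the fifth column distinguishes the topologies. Exhaustive scoring like this is feasible only for a handful of sequences, because the number of trees grows as shown in the table of Section 3.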
38
4.2 Maximum Likelihood Method

Given a set of k sequences of n characters
each (Data)
Assume an evolution model M (e.g.
Jukes-Cantor)
-- A prior distribution of nucleotides at
each site at the root
-- The sites evolve independently and
identically (i.i.d.)
-- The probability $p(x \to y \mid t)$ that a
letter x is replaced
by another letter y on a branch of length t,
for any x and y.
Find a tree H of k leaves labeled with the sequences
that maximizes
the conditional probability
$L = \Pr[\text{Data} \mid H, M]$,
called the likelihood, under the model M.
Since sites evolve i.i.d., we use
$L = \prod_{1 \le i \le n} \Pr[\text{Data}(i) \mid H, M]$.

39
Likelihood calculation
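Assume that at the i-th site, the sequences have
letters T, A, A, respectively, and consider the following tree H.
[Figure: tree H. The root X has a leaf labeled A (branch length t) and the internal node Y (branch length s) as children; Y has the leaves labeled T (branch length v) and A (branch length u) as children.]
-- The root X may take one of the 4 possible states A, G, C, T.
-- The internal node Y also has 4 possible states A, G, C, T.
-- For each pair of specific states, say X = A, Y = G, the evolution has the following probability:
$\Pr[X = A] \, p(A \to A \mid t) \, p(A \to G \mid s) \, p(G \to T \mid v) \, p(G \to A \mid u)$.
-- Considering all the possibilities, the probability is
$\Pr[\text{Data}(i) \mid H, M] = \sum_{x \in \Sigma} \sum_{y \in \Sigma} \Pr[X = x] \, p(x \to A \mid t) \, p(x \to y \mid s) \, p(y \to T \mid v) \, p(y \to A \mid u)$,
where $\Sigma$ is the alphabet containing A, G, C, T.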
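
The sum over the states of X and Y can be written out directly. The following is a sketch under explicit assumptions rather than the lecture's own computation: it uses the Jukes-Cantor substitution probabilities with branch lengths measured in expected substitutions per site, a uniform prior of 1/4 at the root, and the tree shape described above (one leaf attached to the root X, two leaves attached to Y). The function names and the branch lengths in the example are hypothetical.

```python
import math
from itertools import product

ALPHABET = "AGCT"


def jc_prob(x, y, t):
    """Jukes-Cantor probability that letter x becomes y along a branch of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay


def site_likelihood(leaf_below_root, leaves_below_y, t, s, v, u):
    """Sum over all states x of the root X and y of the internal node Y."""
    first, second = leaves_below_y
    total = 0.0
    for x, y in product(ALPHABET, repeat=2):
        total += (
            0.25                                  # uniform prior Pr[X = x] (assumption)
            * jc_prob(x, leaf_below_root, t)      # X to the leaf attached to the root
            * jc_prob(x, y, s)                    # X to Y
            * jc_prob(y, first, v)                # Y to its first leaf
            * jc_prob(y, second, u)               # Y to its second leaf
        )
    return total


def alignment_likelihood(seq1, seq2, seq3, t, s, v, u):
    """Product of site likelihoods for the tree ((seq1, seq2), seq3)."""
    return math.prod(
        site_likelihood(c3, (c1, c2), t, s, v, u)
        for c1, c2, c3 in zip(seq1, seq2, seq3)
    )


if __name__ == "__main__":
    # Site with letters T, A, A and arbitrary illustrative branch lengths.
    print(site_likelihood("A", ("T", "A"), t=0.1, s=0.1, v=0.1, u=0.1))
    print(alignment_likelihood("TGGT", "AGGT", "AGCG", 0.1, 0.1, 0.1, 0.1))
```

A maximum likelihood search would wrap `alignment_likelihood` in a numerical optimizer over the four branch lengths and repeat that for every topology; in practice the logarithm of the likelihood is summed over sites to avoid underflow on long alignments.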
40
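For example, for three sequences
S1 TGGT
S2 AGGT
S3 AGCG
we need to consider the following three different tree topologies.
[Figure: the three rooted tree topologies on S1, S2, S3]
For each tree, say the left one, the likelihood is
$\Pr[\text{Data} \mid H, M] = P_1 P_2 P_3 P_4$, where $P_i$ is the likelihood of the i-th site.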
41
For each tree H, we need to choose 4 branch
lengths to maximize the likelihood $\Pr[\text{Data} \mid H, M]$,
which expands into 16 x 16 x 16 x 16 terms: each of the four site likelihoods is a sum of 16 terms,
each of which is a product of 5
probabilities. So, maximum likelihood is
extremely time-consuming.

Maximum likelihood is preferred by some systematists, but it is even harder
than MP in practice. Most systematic
biologists use ML only on small datasets.
Theoretically, it is NP-hard.
The main challenge here is to make it possible
to obtain good solutions to ML in reasonable
time on large datasets.

42
5. Consistency among trees and distance between
trees

It is common to have two or more trees for the
same group, often from different types of data or
from different methods.
-- How to put all the trees together to
get
one overall estimate of the tree?
-- How to measure the extent of
difference between trees?

43
Robinson-Foulds distance
The partition metric, or Robinson-Foulds distance, for
unrooted trees
-- Each branch in an unrooted
tree gives a partition of the set of leaf
labels. So, an unrooted tree is uniquely
defined by the partitions induced by its
branches.
-- The Robinson-Foulds distance between two trees is equal to the number
of partitions that are found in one tree
but not in the other.
[Figure: example unrooted trees with labeled leaves]
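
The definition translates into a comparison of two sets of partitions. The sketch below is illustrative: it assumes each unrooted tree is written as nested tuples of leaf names with an arbitrary root placement, stores every partition as the side that does not contain a fixed reference leaf, and ignores the trivial partitions that separate a single leaf (they are shared by all trees). The function names and the example trees are hypothetical.

```python
def leaf_set(tree):
    """Return the frozenset of leaf names of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Return the non-trivial leaf partitions induced by the branches of a tree."""
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)        # fixed leaf used to pick a canonical side
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if reference in below else below
        if 1 < len(side) < len(all_leaves) - 1:   # skip trivial partitions
            found.add(side)
        return below

    visit(tree)
    return found


def robinson_foulds(tree1, tree2):
    """Count the partitions found in one tree but not in the other."""
    if leaf_set(tree1) != leaf_set(tree2):
        raise ValueError("trees must have the same leaf labels")
    return len(splits(tree1) ^ splits(tree2))


if __name__ == "__main__":
    t1 = ((("A", "B"), "C"), ("D", "E"))
    t2 = ((("A", "C"), "B"), ("D", "E"))
    print(robinson_foulds(t1, t2))
```

Some authors divide this count by two; the sketch follows the definition above and reports the size of the symmetric difference. Because only partitions are compared, two different rootings of the same unrooted tree are at distance 0.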
44
Nearest Neighbor Interchange Distance
There are 3 unrooted trees formed by grouping
4 leaves in different ways.
In general, each internal branch that connects
two internal nodes is adjacent to 4 other branches (red in the example). These 4 branches are called the nearest neighbors of each other.
45
Nearest Neighbor Interchange (NNI)
Consider the branch leading to C. It is
originally grouped with the nearest
neighboring branch leading to (A, B). By
interchanging it with one of the other two nearest
neighbors, we obtain different trees.
-- Interchange with the one leading to (E, F)
-- Interchange with the one leading to (G, D)
[Figure: the two trees over the leaves A, B, C, D, E, F, G obtained by these interchanges]
46
A tree can be transformed into any other by a
series
of NNI operations.
[Figure: the space of phylogenetic trees with 5 leaves, connected by NNI transformations]
The NNI distance d(S, T) between two unrooted
trees over the same set of leaves is the minimum
number of NNI operations needed to
transform one into the other.
Theorem For any two unrooted trees S and T of n
leaves,
their NNI distance is bounded as
47
Computing consensus trees
Strict Consensus
-- Each node represents a subset of leaf labels. Given a set of rooted trees, the strict consensus is the
tree that contains exactly the
subsets that occur in all the given trees.
48
[Figure: three rooted trees with leaf orders A B C D E F G, B C D E F G A, and A B C D E F G]
These trees differ only in the placement of A and
are hence quite similar. Since the only common subset is the whole set of labels, their strict
consensus is completely unresolved.
This shows the limitation of this consensus method.
Remark
(1) A consensus tree is not a phylogenetic tree.
(2) A consensus tree can also be defined for a set of unrooted trees.
49
Majority-Rule Consensus
-- Each node represents a subset of leaf labels. Given a set of trees, under the majority rule,
the consensus tree contains all the
subsets that occur in more than half of the given trees.
Theorem The majority-rule consensus tree
always exists.
[Figure: example trees and their majority-rule consensus]
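
Both consensus rules reduce to counting how often each leaf subset occurs. The sketch below is an illustration with several assumptions: rooted trees are nested tuples of leaf names, the three example trees are hypothetical caterpillar trees that differ only in the placement of A (the trees on the slides are not recoverable from this transcript), and the majority rule uses the strict-majority convention, that is, subsets occurring in more than half of the trees, which is what guarantees that the selected subsets fit into one tree.

```python
from collections import Counter


def clusters(tree):
    """Return the leaf subsets defined by the internal nodes of a rooted tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        found.add(below)
        return below

    visit(tree)
    return found


def cluster_counts(trees):
    """Count in how many trees each leaf subset occurs."""
    return Counter(cluster for tree in trees for cluster in clusters(tree))


def strict_consensus(trees):
    """Subsets that occur in every tree."""
    return {c for c, k in cluster_counts(trees).items() if k == len(trees)}


def majority_rule_consensus(trees):
    """Subsets that occur in more than half of the trees."""
    return {c for c, k in cluster_counts(trees).items() if 2 * k > len(trees)}


# Hypothetical trees: A is placed at the bottom, at the top, and in the middle.
tree1 = (((((("A", "B"), "C"), "D"), "E"), "F"), "G")
tree2 = (((((("B", "C"), "D"), "E"), "F"), "G"), "A")
tree3 = (((((("B", "C"), "D"), "A"), "E"), "F"), "G")
trees = [tree1, tree2, tree3]

if __name__ == "__main__":
    print(sorted("".join(sorted(c)) for c in strict_consensus(trees)))
    print(sorted("".join(sorted(c)) for c in majority_rule_consensus(trees)))
```

For these three trees the strict consensus keeps only the full label set, so it is completely unresolved, while the majority-rule consensus keeps six nested subsets and is fully resolved. The functions return sets of subsets; turning them back into a drawn tree is a separate step not shown here.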
50
6. Species Trees and Gene Trees

A species tree describes the evolutionary
relationship
of various species that are believed to have
a common ancestor.
Internal nodes represent speciation events
that occurred in the course of
evolution.

A gene tree depicts how a single gene has evolved
in a group of related species.
It provides evidence for the speciation events
that
are responsible for species evolution. Hence it
is often used to estimate the species tree.
But more and more analyses indicate that
gene trees are often inconsistent.
-- Gene trees may have different
branch lengths.
-- They may even have different topologies.
There are two reasons for this
inconsistency.

52
Reason 1: Genetic drift varies

Species evolve through
speciation events.
At the same time, genes get
passed from generation to
generation.
Each generation has a set of
individuals, and each individual
may or may not have children in the
next generation.

(http://scintilla.nature.com/node/625714)
53
When a speciation occurs,
two species form and the population
is divided into two parts.
Breeding across the two parts (species)
is no longer possible.

Therefore, after speciation, no
two individuals in different species
can have the same parent in the
previous generation.
But this does not mean that
the common ancestor of
individuals in two separate species
lived at the time when the speciation
occurred.

56
Current research problem: How to reconstruct
the species tree from a set of gene trees?
-- Different genes tell different stories.
-- Also, different reconstruction tools often
produce different trees.
Wong et al., 2008, on yeast species
