Evolution and Molecular Phylogeny I

1
Evolution and Molecular Phylogeny I
MA5232 Lecture 2
  • LX Zhang
  • Department of Mathematics
  • National University of Singapore
  • matzlx_at_nus.edu.sg

2
Outline
  • Evolution, genetics and genomics background
  • Phylogeny model of evolution
  • Mathematical properties of phylogenies.
  • How to build molecular phylogeny?
  • -- Maximum parsimony
  • -- Maximum likelihood
  • Tree distances
  • Gene trees and species trees

3
1. A Little Background
  • Evolution is a basic subject in modern biology, unifying
    different fields such as genetics and microbiology.
  • In 1859, Darwin published On the Origin of Species
  • -- Natural Selection
  • -- Tree of life

4
The Tree of Life
  • All living species are the modified descendants of
    earlier species.
  • At the base is the common ancestor in the distant past,
    and out of it grows a trunk, which splits again and again
    to create a large bifurcating tree.
  • Each branch represents a single species; branching points
    are where one species becomes two. Most branches
    eventually come to a dead end as species go extinct, but
    some reach right to the top: these are living species.
  • Ever since Darwin, the tree concept has been the main
    principle for understanding all living things.

The tree of life in Darwin's notebook.
Then, how is a new species formed?
-- Speciation events.
5
Natural Selection
  • Evolution is driven by a process of natural selection.
  • -- All individuals struggle to survive, but those with
    small beneficial traits have a greater chance of
    surviving (ecological selection) or reproducing (sexual
    selection).
  • -- Over ages, a process of slow evolutionary change
    causes one species to evolve into another.
  • -- But a species splits into two in a speciation event.

6
Molecular Evolution
  • Darwinian evolution and genetics together produced
    modern evolutionary biology.
  • The mechanisms of inheritance began to be revealed at the
    start of the 20th century.
  • The structure of DNA was uncovered in 1953.
  • As the blueprint of life, DNA must be the stuff in which
    the history of life is written.
  • -- From a common ancestral sequence, two DNA sequences
    diverge, and each of the two sequences starts to
    accumulate nucleotide substitutions.
  • -- The more closely related two species are, the more
    similar their DNA ought to be.
  • Molecular phylogenetic analysis has proved to be a
    powerful tool. For instance, it has been used for
  • -- Studying human evolution
  • -- Classifying the giant panda
  • -- Tracing HIV infection

7
A Primer to Human Genome
  • Cells are the fundamental working units of every
    living system.
  • DNA is made of 4 nucleotide bases. The DNA
    sequence is the particular side-by-side
    arrangement of bases along the DNA strand. This
    order spells out the exact instructions required
    to create a particular organism.
  • The genome is an organism's complete set of DNA.
  • Except for mature red blood cells, all human
    cells contain a complete genome arranged in 24
    distinct chromosomes.

8
  • Each chromosome contains many genes, the basic
    physical and functional units of heredity.
  • Genes are specific sequences of bases that
    encode instructions on how to make proteins.
  • Proteins perform most life functions and even
    make up the majority of cellular structures.
  • Proteins are large, complex molecules made
    up of smaller subunits called amino acids.
  • A protein folds up into a specific
    three-dimensional structure that defines its
    particular function in the cell.

9
How does the human genome stack up?
10
What does the draft human genome sequence tell us?

By the Numbers
The human genome contains 3 billion chemical nucleotide bases (A, C, T, and G).
The average gene consists of 3000 bases, but sizes vary greatly, with the largest known human gene being dystrophin at 2.4 million bases.
The total number of genes is estimated at around 30,000, much lower than previous estimates of 80,000 to 140,000.
The functions are unknown for over 50% of discovered genes.
U.S. Department of Energy Genome Programs, Genomics and Its Impact on Science and Society, 2003
11
Almost all (99.9%) nucleotide bases are exactly
the same in all people. Scientists have
identified about 3 million locations where
single-base DNA differences (SNPs) occur in
humans. This information promises to revolutionize
the processes of finding chromosomal
locations for disease-associated sequences and
tracing human history.
12
What does the draft human genome sequence tell us?
How It's Arranged
The human genome's gene-dense "urban centers" are rich in the DNA nucleotides G and C.
In contrast, the gene-poor "deserts" are rich in the DNA nucleotides A and T.
13
What does the draft human genome sequence tell us?
  • Genes appear to be concentrated in random areas
    along the genome, with vast expanses of noncoding
    DNA between. 
  • Chromosome 1 has the most genes (2968), and the
    Y chromosome has the fewest (231).

14
What does the draft human genome sequence tell
us?
The Wheat from the Chaff
Less than 2% of the genome codes for proteins.
Repeated sequences that do not code for proteins ("junk DNA") make
up at least 50% of the human genome. Repetitive
sequences shed light on chromosome structure and
dynamics. Over time, these repeats reshape the
genome by rearranging it, creating entirely new
genes, and modifying and reshuffling existing
genes.
15
2. (Molecular) Phylogenetic Tree: Basic model of evolution
  • It summarizes the evolutionary relationships
    (or differences) among a set of sequences.
  • A tree structure is composed of nodes and edges
    (or branches). Nodes represent the taxonomic
    units (genes, sequences).

16
Some concepts
[Figure: example tree with leaves S1, S2, S3, S4, S5]
Topology of a phylogenetic tree is the branching pattern of the tree, which is a tree in the graph-theoretic sense.
-- The degree of a node is the number of branches incident to it. It can be one, two, or three.
-- Degree-1 nodes are leaves, usually having a label.
-- Non-leaf nodes are internal nodes.
-- There is at most one node of degree 2, called the root if any.
Leaves represent the sequences under comparison, called operational taxonomic units (OTUs) or tips.
Internal nodes represent inferred ancestral sequences, called hypothetical sequences.
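The degree-based definitions above can be checked mechanically. The following is a minimal sketch, assuming the tree is stored as a plain list of branches; the internal node names (r, a, b, c) and the particular shape of the tree are made up for illustration and are not taken from the slide's figure.

```python
from collections import defaultdict

# Hypothetical rooted tree on the leaves S1..S5, given as a list of branches.
# The internal node names r, a, b, c are invented for this illustration.
edges = [("r", "a"), ("r", "b"), ("a", "S1"), ("a", "S2"),
         ("b", "c"), ("b", "S5"), ("c", "S3"), ("c", "S4")]

def degrees(edge_list):
    """Return a dict mapping each node to the number of incident branches."""
    deg = defaultdict(int)
    for u, v in edge_list:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)

deg = degrees(edges)
leaves = sorted(n for n, d in deg.items() if d == 1)    # degree-1 nodes
internal = sorted(n for n, d in deg.items() if d > 1)   # non-leaf nodes
roots = [n for n, d in deg.items() if d == 2]           # at most one such node

print("leaves:", leaves)
print("internal nodes:", internal)
print("root:", roots)
```

In this example the only degree-2 node is r, so the tree is rooted; deleting r and joining a and b directly would give the corresponding unrooted tree.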
17
Unrooted and rooted phylogenetic trees
Unrooted phylogenetic tree, also called phenogram, is a tree in which
--- all nodes represent related descendants,
--- but there is not enough information on their common ancestor.
Rooted phylogenetic tree, also called cladogram, is a tree in which
--- there is a root, representing the common ancestor of the objects represented by leaves.
--- The path from the root to a leaf represents an evolutionary path.
18
Branch length and molecular clock
A branch from a node to its child is often assigned a length or weight, representing the number of mutations occurring in the corresponding course of evolution.
--- The mutational events under consideration vary from study to study.
--- The length of a branch may be converted into evolutionary time with a molecular clock, i.e., a mutational rate.
Molecular clock hypothesis
-- Mutations occur at the same rate in all branches of the tree.
-- The rate of mutation is the same for all positions along a sequence.
19
3. Basic Mathematical Properties
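Theorem 1
(a) Each unrooted phylogenetic tree of n leaves has $2n-2$ nodes and $2n-3$ edges for $n \ge 3$.
(b) Each rooted phylogenetic tree of n leaves has $2n-1$ nodes and $2n-2$ edges for $n \ge 3$.
Proof.
(a) It is proved by induction on n.
(b) It is derived from the following fact: appending a leaf to the root of a rooted tree gives an unrooted tree with $n+1$ leaves. By (a), that unrooted tree has $2(n+1)-2 = 2n$ nodes and $2(n+1)-3 = 2n-1$ edges; removing the appended leaf and its branch leaves $2n-1$ nodes and $2n-2$ edges.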
20
  • Each branch in an unrooted phylogenetic tree
    defines a partition of the set of leaf labels.

Two unrooted trees are identical if
their edges induce the same label partitions.
  • Each node in a rooted phylogenetic tree
    defines a subset of leaf labels, composed of
    the leaves below it. So, rooted trees are also
    used in taxonomy.

Two rooted trees are identical if
their internal nodes induce the same label
subsets.
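The rooted-tree criterion can be sketched in a few lines. This is an illustration only, assuming trees are written as nested Python tuples with leaf labels as strings; the function names and the example trees are hypothetical.

```python
def clusters(tree):
    """Return the set of leaf-label subsets induced by the internal nodes
    of a rooted tree given as nested tuples, e.g. (("A", "B"), "C")."""
    found = set()

    def walk(node):
        if isinstance(node, str):            # a leaf: just its own label
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        found.add(below)                     # leaves below this internal node
        return below

    walk(tree)
    return found

def same_rooted_tree(t1, t2):
    """Two rooted trees are identical if they induce the same subsets."""
    return clusters(t1) == clusters(t2)

# Hypothetical examples: t2 is t1 drawn with the children in another order.
t1 = (("A", "B"), ("C", "D"))
t2 = (("D", "C"), ("B", "A"))
t3 = ((("A", "B"), "C"), "D")

print(same_rooted_tree(t1, t2))
print(same_rooted_tree(t1, t3))
```

Comparing subsets rather than the nested tuples themselves is what makes the test independent of how the tree happens to be drawn.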
21
Unrooted or rooted phylogenetic trees have
labeled leaves
and unlabeled internal nodes
[Figure: example unrooted trees with leaves labeled 1, 2, 3]
and 13 more for n = 5
22
Rooted case
[Figure: example rooted trees with leaves labeled 1, 2, 3, 4]
and 13 more for n = 4
23
Theorem 2
(1) There are $(2n-5)!! = 1 \cdot 3 \cdot 5 \cdots (2n-5)$ unrooted phylogenetic trees with unit-length branches, unlabeled internal nodes and n labeled leaves.
(2) There are $(2n-3)!! = 1 \cdot 3 \cdot 5 \cdots (2n-3)$ rooted phylogenetic trees with unit-length branches, unlabeled internal nodes and n labeled leaves.

There are $(2n-3)$ times as many rooted trees
as unrooted trees over n leaves.
24
Number of all possible trees
 n      Unrooted        Rooted
 3             1             3
 4             3            15
 5            15           105
 6           105           945
 7           945        10,395
 8        10,395       135,135
 9       135,135     2,027,025
10     2,027,025    34,459,425
11    34,459,425   654,729,075
Tree model is rich.
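The numbers in this table follow the double-factorial pattern: $(2n-5)!!$ unrooted and $(2n-3)!!$ rooted binary trees on n labeled leaves. A short sketch that regenerates the table (the function names are chosen here for illustration):

```python
def odd_double_factorial(m):
    """Product 1 * 3 * 5 * ... * m for odd m; returns 1 when m <= 1."""
    result = 1
    for k in range(3, m + 1, 2):
        result *= k
    return result

def num_unrooted(n):
    """Number of unrooted binary trees on n >= 3 labeled leaves."""
    return odd_double_factorial(2 * n - 5)

def num_rooted(n):
    """Number of rooted binary trees on n >= 2 labeled leaves."""
    return odd_double_factorial(2 * n - 3)

print(f"{'n':>2} {'Unrooted':>12} {'Rooted':>12}")
for n in range(3, 12):
    print(f"{n:>2} {num_unrooted(n):>12,} {num_rooted(n):>12,}")
```

Each row's rooted count equals the next row's unrooted count, because rooting an unrooted tree on one of its $2n-3$ branches is the same as attaching an extra leaf there.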
25
edges branches
26
4. Procedure of Building (or Reconstructing)
Molecular Phylogenetic Trees
  • Methods
  • Character based methods
  • -- Maximum parsimony
  • -- Maximum likelihood
  • Distance based methods
  • -- Neighbor joining
  • -- UPGMA method

27
How to choose a phylogenetic method?
Collect a set of DNA or protein sequences
-> Obtain a multiple sequence alignment
-> Strong similarity?
   Yes: Maximum parsimony
   No: Maximum likelihood or distance methods
-> Validate the result (optional)
28
4.1 Maximum Parsimony Method
  • It assumes that a substitution at a position in a
    sequence is independent of those occurring at
    neighboring positions.
  • It outputs a tree that requires the minimum
    number of changes to explain the differences
    observed in the alignment data.
  • -- Ockham's razor principle:
    The explanation of any phenomenon should make as few
    assumptions as possible, eliminating those that make no
    difference in the observable predictions of the
    explanatory hypothesis.

29
Consider the following DNA data and a proposed tree.
By examining all sites individually, we show that 8
substitutions are needed to explain the data.
30
At site 2
At site 3
At site 4
At site 5
At site 6
31
In summary,
GCGGTA
GTCGTA
GTCACT
ACGACA
ACGGTA
GTCGTA
ACGGTA
GCGGTA
ACGGTA
32
Changing the tree will alter the number of substitutions required to explain the data. Hence, the following problem arises.
Small Parsimony Problem
Input: A rooted phylogenetic tree with leaves labeled with letters.
Question: Label the internal nodes to minimize the number of substitutions over all branches.
Assume that u is connected to v by a branch in a rooted tree. The node u is called a child of v if v is closer to the root than u.
Obviously, each internal node has exactly two children.
33
Fitch Algorithm
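  • Step 1: Compute a subset $S_x$ of letters for each
    node x in the tree.
  • For each leaf x having label $l_x$, $S_x = \{l_x\}$.
  • For each internal node u having children v and w,
    we compute $S_u = S_v \cap S_w$ if $S_v \cap S_w \neq \emptyset$,
    and $S_u = S_v \cup S_w$ otherwise,
    after $S_v$ and $S_w$ are computed.

[Figure: example tree whose internal nodes are annotated with {G} ∪ {A}, {G} ∪ {A}, {G, A} ∩ {G} and {G} ∩ {G, A}]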
34
Fitch Algorithm
  • Step 2: Select a letter $l_x$ from $S_x$ to label each
    node x. This is done from the root to the leaves.
  • (1) Select an arbitrary letter from $S_r$ to
    be $l_r$ for the root r.
  • (2) Assume u is a child of v and $l_v$ is
    determined. If $l_v$ belongs to $S_u$, then $l_u = l_v$.
    Otherwise, select an arbitrary letter from $S_u$ as
    $l_u$.

[Figure: the example tree with the selected labels G, G, G, G]
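The two steps of the Fitch algorithm can be sketched as follows. This is an illustrative implementation under stated assumptions, not the lecturer's code: the tree is a nested 2-tuple of distinct leaf names, the "arbitrary" choice in Step 2 is made deterministic by taking the alphabetically smallest letter, and the example tree and leaf letters are invented.

```python
def fitch(tree, leaf_letter):
    """Fitch's algorithm on a rooted binary tree given as nested 2-tuples
    of distinct leaf names. Returns (number of substitutions, labeling),
    where labeling maps every node (leaf name or tuple) to its letter."""
    sets = {}
    cost = 0

    def bottom_up(node):                       # Step 1: compute S_x
        nonlocal cost
        if isinstance(node, str):
            sets[node] = {leaf_letter[node]}
            return
        left, right = node
        bottom_up(left)
        bottom_up(right)
        common = sets[left] & sets[right]
        if common:
            sets[node] = common
        else:
            sets[node] = sets[left] | sets[right]
            cost += 1                          # an empty intersection forces one change

    label = {}

    def top_down(node, parent_letter):         # Step 2: choose l_x
        if parent_letter in sets[node]:
            label[node] = parent_letter
        else:
            label[node] = min(sets[node])      # deterministic stand-in for "arbitrary"
        if not isinstance(node, str):
            for child in node:
                top_down(child, label[node])

    bottom_up(tree)
    top_down(tree, None)                       # the root has no parent letter
    return cost, label

# Hypothetical example: five leaves carrying the letters G, A, G, G, A.
tree = ((("a", "b"), "c"), ("d", "e"))
letters = {"a": "G", "b": "A", "c": "G", "d": "G", "e": "A"}
cost, label = fitch(tree, letters)
print(cost, label[tree])
```

Counting one substitution each time the intersection in Step 1 is empty gives the minimum number of changes, and the labeling produced in Step 2 attains it.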
35
Another Example
[Figure: a second worked example, processed column by column]
k is the number of letters appearing in the leaf
labels.
36
Maximum Parsimony Problem
Input: A multiple sequence alignment
Solution: A most parsimonious tree that explains the sequence data with the minimum number of substitutions.
S1 GCTTTATTCTT
S2 GCTTCATTGAG
S3 GATTCAGTGTG
S4 GCTGTAATGTG
[Figure: a tree with leaves in the order S1, S3, S4, S2]
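For four sequences there are only three unrooted topologies, so the maximum parsimony problem can be solved by scoring each one. The sketch below does this for the alignment S1 to S4 above, scoring every column with the bottom-up pass of the Fitch algorithm. It assumes each unrooted tree is rooted on its internal branch (the parsimony score does not depend on where the root is placed); the helper names are illustrative.

```python
def fitch_score(tree, column):
    """Minimum number of substitutions for one alignment column on a rooted
    binary tree given as nested 2-tuples of leaf names."""
    def walk(node):
        if isinstance(node, str):
            return {column[node]}, 0
        (left_set, left_cost), (right_set, right_cost) = walk(node[0]), walk(node[1])
        common = left_set & right_set
        if common:
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1
    return walk(tree)[1]

def parsimony_score(tree, alignment):
    """Sum of the per-column Fitch scores over the whole alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(length)
    )

alignment = {
    "S1": "GCTTTATTCTT",
    "S2": "GCTTCATTGAG",
    "S3": "GATTCAGTGTG",
    "S4": "GCTGTAATGTG",
}

# The three unrooted topologies on four leaves, each rooted on its internal branch.
quartets = [
    (("S1", "S2"), ("S3", "S4")),
    (("S1", "S3"), ("S2", "S4")),
    (("S1", "S4"), ("S2", "S3")),
]

scores = {t: parsimony_score(t, alignment) for t in quartets}
best = min(scores, key=scores.get)
for t in quartets:
    print(t, scores[t])
print("most parsimonious:", best)
```

Under this sketch the grouping of S1 with S4 and S2 with S3 needs 8 substitutions, against 9 for each of the other two groupings; the difference comes entirely from column 5 (T, C, C, T). Exhaustive scoring like this is only feasible for very small numbers of sequences, since the number of topologies grows as a double factorial.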
38
4.2 Maximum Likelihood Method
  • Given a set of k sequences of n characters
    each (Data)
  • Assume an evolution model M (e.g.
    Jukes-Cantor)
  • -- A prior distribution of nucleotides at
    each site at the root
  • -- The sites evolve independently and
    identically (i.i.d.)
  • -- The probability $p(x \to y \mid t)$ that a
    letter x is replaced by another letter y on a
    branch of length t, for any x and y.
  • Find a tree H of k leaves labeled with the
    sequences that maximizes the conditional probability
  • $L = \Pr[\mathrm{Data} \mid H, M]$,
  • called the likelihood, under the model M.
  • Since sites evolve i.i.d., it can be computed using
  • $L = \prod_{1 \le i \le n} \Pr[\mathrm{Data}^{(i)} \mid H, M]$,
    where $\mathrm{Data}^{(i)}$ is the i-th site (column) of the sequences.

39
Likelihood calculation
Assume that at the i-th site, the sequences have letters T, A, A, respectively, and the following tree H.
-- The root X may take one of the 4 possible states A, G, C, T.
-- The internal node Y also has 4 possible states A, G, C, T.
-- For each pair of specific states, say X = A, Y = G, the evolution has the following probability:
   $p(X = A)\, p(A \to A \mid t)\, p(A \to G \mid s)\, p(G \to T \mid v)\, p(G \to A \mid u)$.
-- Considering all the possibilities, the probability is
   $\sum_{x \in \Sigma} \sum_{y \in \Sigma} p(X = x)\, p(x \to A \mid t)\, p(x \to y \mid s)\, p(y \to T \mid v)\, p(y \to A \mid u)$,
   where $\Sigma$ is the alphabet containing A, G, C, T.
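The sum over the states of X and Y can be written out directly. The sketch below assumes the Jukes-Cantor model with branch lengths measured in expected substitutions per site and a uniform prior at the root; these modelling choices, the function names and the example branch lengths are assumptions made for illustration. The tree shape follows the product above: one leaf hangs from the root X by the branch t, the internal node Y hangs from X by the branch s, and two leaves hang from Y by the branches v and u.

```python
import math
from itertools import product

ALPHABET = "AGCT"

def jc_prob(x, y, t):
    """Jukes-Cantor probability that letter x is replaced by y over a branch
    of length t (t in expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

def site_likelihood(leaf_x, leaf_v, leaf_u, t, s, v, u, prior=None):
    """Likelihood of one site: sum over root state x and internal state y of
    p(X=x) p(x->leaf_x | t) p(x->y | s) p(y->leaf_v | v) p(y->leaf_u | u)."""
    prior = prior or {a: 0.25 for a in ALPHABET}   # uniform prior (an assumption)
    total = 0.0
    for x, y in product(ALPHABET, repeat=2):       # 4 x 4 = 16 terms
        total += (prior[x]
                  * jc_prob(x, leaf_x, t)
                  * jc_prob(x, y, s)
                  * jc_prob(y, leaf_v, v)
                  * jc_prob(y, leaf_u, u))
    return total

# The site from the slide: leaf A under X, leaves T and A under Y.
# The branch lengths are arbitrary illustrative values.
print(site_likelihood("A", "T", "A", t=0.1, s=0.2, v=0.3, u=0.4))
```

A quick sanity check on such a function is that the likelihoods of all 64 possible leaf patterns add up to 1 for any fixed branch lengths.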
40
For example, for three sequences
S1 TGGT
S2 AGGT
S3 AGCG
we need to consider the following three different tree topologies.
For each tree, say the left one, the likelihood is
$\Pr[\mathrm{Data} \mid H, M] = P_1 \cdot P_2 \cdot P_3 \cdot P_4$, where $P_i$ is the probability of the i-th site.
41
For each tree H, we need to choose 4 branch
lengths to maximize the likelihood $\Pr[\mathrm{Data} \mid H, M]$,
which, when expanded over the 4 sites, contains
16 x 16 x 16 x 16 terms (each site contributes a sum of
16 terms, each of which is a product of 5
probabilities). So, the maximum likelihood method is
extremely time-consuming.
  • Preferred by some systematists, but even harder
    than MP in practice. Most systematic
    biologists use ML only on small datasets.
  • Theoretically, it is NP-hard.
  • The main challenge here is to make it possible
    to obtain good solutions to ML in reasonable
    time periods on large datasets.

42
5. Consistency among trees and distance between
trees
  • It is common to have two or more trees for the
    same group, often from different types of data or
    from different methods.
  • -- How to put all the trees together to
    get one overall estimate of the tree?
  • -- How to measure the extent of
    difference between trees?

43
Robinson-Foulds distance
Partition metric or Robinson-Foulds distance for unrooted trees
-- Each branch in an unrooted tree gives a partition of the set of leaf labels. So, an unrooted tree is uniquely defined by the partitions induced by its branches.
-- The Robinson-Foulds distance between two trees is equal to the number of partitions that are found in one tree but not in the other.
[Figure: example unrooted trees with leaves labeled A to G]
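The Robinson-Foulds distance amounts to a symmetric difference between two sets of partitions. The following is a minimal sketch, assuming unrooted trees are written as nested tuples whose top level is only a convenient place to hang the tree from; the function names and the two five-leaf example trees are hypothetical.

```python
def splits(tree):
    """Non-trivial leaf-label partitions induced by the internal branches of
    an unrooted tree written as nested tuples. The top-level tuple is an
    arbitrary attachment point, not a biological root."""
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)
    found = set()

    def walk(node):
        if isinstance(node, str):
            return
        for child in node:
            side = leaves(child)
            other = all_leaves - side
            if len(side) > 1 and len(other) > 1:       # skip branches leading to a leaf
                found.add(frozenset([side, other]))    # unordered pair = one partition
            walk(child)

    walk(tree)
    return found

def rf_distance(t1, t2):
    """Number of partitions found in one tree but not in the other."""
    return len(splits(t1) ^ splits(t2))

# Hypothetical five-leaf examples that differ in one internal branch.
t1 = (("A", "B"), "C", ("D", "E"))
t2 = (("A", "C"), "B", ("D", "E"))
print(rf_distance(t1, t2))
```

Branches leading to a single leaf are skipped because every tree on the same leaf set shares them, so they never contribute to the distance.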
44
Nearest Neighbor Interchange Distance
There are 3 unrooted trees formed by grouping
4 leaves differently.
In general, each internal branch that connects
two internal nodes is adjacent to 4 other branches
(red in the example). These 4 branches are
called the nearest neighbors of each other.
45
Nearest Neighbor Interchange (NNI)
Consider the branch leading to C. It is originally grouped with the nearest neighboring branch leading to (A, B). By interchanging it with one of the other two nearest neighbors, we obtain different trees:
-- Interchange with the one leading to (E, F)
-- Interchange with the one leading to (G, D)
[Figure: the original tree and the two trees obtained by NNI, with leaves A, B, C, D, E, F, G]
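The interchange can be sketched on the example just described. The code assumes the tree is written around the chosen internal branch as ((a, b), (c, d)), where a, b, c, d are the four subtrees hanging off its two ends; this nested-tuple encoding and the function name are illustrative choices, not part of the slides.

```python
def nni_neighbors(a, b, c, d):
    """For an internal branch with subtrees a, b at one end and c, d at the
    other, i.e. the tree ((a, b), (c, d)), return the two trees obtained by
    interchanging b with c and b with d."""
    return [((a, c), (b, d)),    # b exchanged with c
            ((a, d), (c, b))]    # b exchanged with d

# The example from the slide: C is grouped with (A, B); the other two
# nearest neighbors lead to (E, F) and (G, D).
original = ((("A", "B"), "C"), (("E", "F"), ("G", "D")))
with_ef, with_gd = nni_neighbors(("A", "B"), "C", ("E", "F"), ("G", "D"))

print(original)
print(with_ef)
print(with_gd)
```

Every internal branch yields exactly two such neighbors, so an unrooted binary tree with n leaves, which has n - 3 internal branches, has 2(n - 3) NNI neighbors.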
46
A tree can be transformed into any other by a series
of NNI operations.
The space of phylogenetic trees with 5 leaves connected by
NNI transformations
The NNI distance d(S, T) between two unrooted
trees over the same set of leaves is the minimum
number of NNI operations needed to
transform one into the other.
Theorem For any two unrooted trees S and T of n
leaves,
their NNI distance is bounded as
47
Computing consensus trees
Strict Consensus
-- Each node represents a subset of leaf labels. Given a set of trees, the strict consensus is the tree that contains exactly the subsets that are in all the given rooted trees.
48
[Figure: three trees with leaf orders A B C D E F G, B C D E F G A, and A B C D E F G]
These trees differ only in the placement of A and are hence quite similar. Since the only common subset is the whole set of labels, their strict consensus is completely unresolved.
This shows the limitation of this consensus method.
Remark
(1) A consensus tree is not a phylogenetic tree.
(2) A consensus tree can also be defined for a set of unrooted trees.
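The effect described here is easy to reproduce. The sketch below assumes two "caterpillar" trees that differ only in where A attaches; these particular shapes are a guess at the kind of example on the slide, not a transcription of its figure, and the function names are illustrative.

```python
def clusters(tree):
    """Leaf-label subsets induced by the internal nodes of a rooted tree
    written as nested tuples."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        found.add(below)
        return below

    walk(tree)
    return found

def strict_consensus_clusters(trees):
    """Subsets of leaf labels that appear in every one of the given trees."""
    return set.intersection(*(clusters(t) for t in trees))

# Hypothetical trees: A sits at the bottom of the first and at the top of the second.
t1 = (((((("A", "B"), "C"), "D"), "E"), "F"), "G")
t2 = (((((("B", "C"), "D"), "E"), "F"), "G"), "A")

common = strict_consensus_clusters([t1, t2])
print([sorted(c) for c in common])
```

Only the full label set survives the intersection, so the strict consensus is the unresolved star tree even though removing A would make the two trees identical.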
49
Majority-Rule Consensus
-- Each node represents a subset of leaf labels. Given a set of trees, under the majority rule, the consensus tree contains all the subsets that occur in more than half of the given trees.
Theorem: The majority-rule consensus tree always exists.
[Figure: example of a majority-rule consensus tree]
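A majority-rule consensus can be sketched by counting how often each subset occurs. The code below uses the strict-majority convention (a subset is kept when it occurs in more than half of the trees), which is the convention under which any two kept subsets must share a tree and are therefore nested or disjoint. The three example trees and the function names are hypothetical.

```python
from collections import Counter

def clusters(tree):
    """Leaf-label subsets induced by the internal nodes of a rooted tree
    written as nested tuples."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        found.add(below)
        return below

    walk(tree)
    return found

def majority_rule_clusters(trees):
    """Subsets that occur in more than half of the given trees."""
    counts = Counter(c for t in trees for c in clusters(t))
    return {c for c, k in counts.items() if 2 * k > len(trees)}

def compatible(cluster_set):
    """True if every two subsets are nested or disjoint, i.e. they fit on one tree."""
    return all(a <= b or b <= a or not (a & b)
               for a in cluster_set for b in cluster_set)

# Hypothetical input: three rooted trees on the leaves A, B, C, D.
trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "D"), "C"),
    ((("A", "C"), "B"), "D"),
]
majority = majority_rule_clusters(trees)
print(sorted(sorted(c) for c in majority))
print(compatible(majority))
```

Here {A, B} and {A, B, C} each occur in two of the three trees, so the consensus is the tree (((A, B), C), D), while {A, B, D} and {A, C} occur only once and are dropped.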
50
6. Species Trees and Gene Trees
  • A species tree describes the evolutionary
    relationship of various species that are believed
    to have a common ancestor.
  • Internal nodes represent speciation events
    that occurred in the course of evolution.

51
  • A gene tree depicts how a single gene has evolved
    in a group of related species.
  • It provides evidence for the speciation events
    that are responsible for species evolution. Hence it
    is often used to estimate the species tree.
  • But more and more analyses indicate that
    gene trees are often inconsistent.
  • -- Gene trees may have different
    branch lengths.
  • -- They may even have different topologies.
  • There are two reasons for this
    inconsistency.

52
Reason 1: Genetic drift varies
  • Species evolve through speciation events.
  • At the same time, genes get passed from
    generation to generation.
  • Each generation has a set of individuals, and
    each individual may or may not have children in
    the next generation.

(http://scintilla.nature.com/node/625714)
53
  • When a speciation occurred, two species formed
    and the population was divided into two parts.
  • Breeding across the parts (species) is no longer
    possible.
  • Therefore, after speciation, no two individuals
    in different species can have the same parent in
    the previous generation.
  • But this does not mean that the common ancestor
    of individuals in two separate species lived at
    the time when the speciation occurred.

56
Current research problem: How to reconstruct
the species tree from a set of gene trees?
-- Different genes tell different stories.
-- Also, different reconstruction tools often
produce different trees.
Wong et al., 2008, on yeast species