Title: Bioinformatics: Applications
1Bioinformatics Applications
- ZOO 4903
- Fall 2006, MW 1030-1145
- Sutton Hall, Room 312
- Jonathan Wren
- Introduction to phylogenetics
2Lecture overview
- What we've talked about so far
- Abundance and growth of DNA sequences
- How to quantify and locate similar sequences
- How to align multiple sequences and identify conserved regions
- Overview
- Both species and genes change over time, and it is useful to understand when divergences took place
- Defining evolutionary distances
- Visualizing evolutionary relationships
4Phylogeny: Reconstructing Evolutionary History
- Goal: Infer the past history that produced a set of modern characters.
- Needed:
- A set of characters (e.g. DNA, protein)
- Evolutionary model
- Distance metric
- Probabilistic model of evolution
- Output: an evolutionary tree
5Finding a Phylogeny
- No guarantees of correctness
- Based on evidence, but there is more than one way to arrive at the same answer
- All we can observe is distance/differences. From this we infer relatedness.
- Find the most likely history
- Parsimony: find the evolutionary tree that explains the observations with the fewest possible changes
ACGT ACGT AGGT ATGT
AGGT
6A brief history of phylogeny
- The ancient Greeks divided life into classes or forms based upon their morphology
- Darwin's Origin of Species (1859) propelled efforts to find the links between species
- Hershey and Chase show that DNA is the molecule of inheritance (1952)
- Emile Zuckerkandl and Linus Pauling used amino acid sequences to produce a primate phylogeny (1962)
- NSF begins funding efforts to construct the Tree of Life (2005)
7The Tree of Life: http://tolweb.org
9Phylogenetic Analysis Overview
[Figure: sequence data (Chimp, Human, Gorilla, Orangutan, Macaque, Squirrel) and an evolutionary model feed into phylogenetic inference, which produces an evolutionary tree and summary statistics.]
10Language Phylogeny
Latin arbor domus casa
arbre maison
arbore casa
arbol casa
albero casa
tre hus
strom domovni
treow hus
baum haus
tree house
11Language Phylogeny
French
Romanian
Spanish
Italian
Norwegian
Czech
Anglo-Saxon
German
English
12Molecular phylogeny: What makes it possible
- Point mutations (substitutions): one base is replaced with another
UV Ray
With only point mutations, it is easy to tell how close two genomes are: just count the differing bases
CAT
CGT
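Counting the differing bases between two equal-length sequences is a Hamming distance. A minimal Python sketch follows; the function name `count_differences` is illustrative and not from the lecture, and it assumes the sequences differ only by substitutions (no insertions or deletions).

```python
def count_differences(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length (substitutions only)")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# The slide's example: CAT vs. CGT differ at one base.
print(count_differences("CAT", "CGT"))  # 1
```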
13And as in sequence alignment, we can quantify differences
- Insertions: one or more bases are inserted
- Deletions: one or more bases are removed
GCATG
GCACATG
GCATCATG
GCATG
Caused by copying errors (enzymes slipping, etc.)
14Phylogeny: Standard Assumptions
- Sequences diverge by speciation, usually represented as bifurcation events
- Sequences are essentially independent once they diverge from their common ancestor.
- The probability of observing a given nucleotide at a site in the future depends only on the current nucleotide at that site (Markov chain assumption).
- Different sites (characters) within a sequence evolve independently.
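The Markov and site-independence assumptions can be illustrated with a small simulation. This is only a sketch: the equal-rates substitution scheme (every other base equally likely, in the spirit of the Jukes-Cantor model), the per-generation change probability `p_change`, and all function names are assumptions made for illustration, not something the slides specify.

```python
import random

BASES = "ACGT"


def step(base: str, p_change: float, rng: random.Random) -> str:
    """One generation at one site: the next base depends only on the current base."""
    if rng.random() < p_change:
        # Assumed equal-rates model: any of the three other bases is equally likely.
        return rng.choice([b for b in BASES if b != base])
    return base


def evolve(seq: str, generations: int, p_change: float, seed: int = 0) -> str:
    """Evolve every site independently of the other sites."""
    rng = random.Random(seed)
    sites = list(seq)
    for _ in range(generations):
        sites = [step(b, p_change, rng) for b in sites]
    return "".join(sites)


ancestor = "ACGTACGTACGT"
print(evolve(ancestor, generations=100, p_change=0.01, seed=1))
print(evolve(ancestor, generations=100, p_change=0.01, seed=2))
```

Two runs with different seeds mimic two lineages diverging independently from the same ancestor.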
16Question
- Q: If we can accurately identify the evolutionary history of a species, can we extrapolate and predict future directions in its evolution?
- A: No. As far as we know, evolution does not have a direction.
17Phylogenies represented with trees
- A tree is a directed acyclic graph consisting of nodes (sequences) and edges (relationships).
- There exists a single unique path between any pair of nodes.
Ancestral sequences
Not a tree, due to cycle
Modern sequences
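The "single unique path" property can be checked mechanically. The sketch below ignores edge direction and tests whether a graph is connected with exactly one fewer edge than it has nodes, which is equivalent to having no cycles. The function name and the example node labels are hypothetical.

```python
from collections import defaultdict


def is_tree(nodes, edges) -> bool:
    """True if the undirected graph (nodes, edges) is connected and has no cycle."""
    nodes = list(nodes)
    if not nodes or len(edges) != len(nodes) - 1:
        return False
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Depth-first search from an arbitrary node; a tree reaches every node.
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(nodes)


# An ancestral root with one internal node and three modern sequences.
print(is_tree(["root", "x", "a", "b", "c"],
              [("root", "x"), ("x", "a"), ("x", "b"), ("root", "c")]))  # True
# A cycle among three nodes is not a tree.
print(is_tree(["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")]))  # False
```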
18A tree consists of nodes connected by branches
One unique internal node is the root of the tree: the ancestor of all the sequences.
Internal nodes represent hypothetical ancestors.
Terminal nodes represent sequences or organisms for which we have data. Each is typically called an Operational Taxonomic Unit (OTU).
19Human mtDNA Phylogeny
- Vertical layout is relatively meaningless (e.g. swapping the two sides of any 2-way branch has no effect on the tree's meaning)
- Evolutionary distance (the horizontal scale in this diagram) is the most important information in the phylogeny, and is reflected in the tree structure (grouping).
- Phylogeny is not classification but distance, i.e. there is no answer to "how many clusters?"
20Phylogenetics applied: Florida dentist case
1990 case: Did a patient's HIV infection result from an invasive dental procedure performed by an HIV+ dentist? HIV evolves so fast that transmission patterns can be reconstructed from viral sequence (molecular forensics).
Compared viral sequences from the dentist, three of his HIV+ patients, and two HIV+ local controls.
21Phylogenetic trees can be between or within
species
Relationships within species: HIV subtypes
22Florida dentist case results
- 2 of 3 HIV+ patients closer to dentist than to local controls
- Patient F likely acquired HIV from another source
23Phylogenetic tree types
- Forked vs. hierarchical
- Bifurcating vs. multifurcating
- Rooted vs. unrooted
- Ultrametric vs. additive
24Forked vs. Hierarchical
[Figure: two example trees over taxa 1-4, one forked and one hierarchical; internal nodes labeled 5 and 6.]
25Bifurcating vs. multifurcating
Multifurcating
Bifurcating
Polytomy
- Polytomies: Soft vs. Hard
- Soft: designates a lack of information about the order of divergence.
- Hard: the hypothesis that multiple divergences occurred simultaneously
26Rooted vs. Unrooted
27Rooted vs. Unrooted Trees
Chimp
Human
Gorilla
Macaque
Chimp
Macaque
Human
Rooted Trees
Gorilla
past
These two rooted trees are different, but both have the same unrooted tree.
Unrooted Tree
The direction of evolution is specified in a
rooted tree but not in an unrooted tree
28Rooted Trees Are Directed Graphs
In a rooted tree each edge is directed, pointing
from root to leaves.
Chimp
Human
Gorilla
Macaque
Rooted Tree
In an unrooted tree, the edges are undirected.
Unrooted Tree
29Choosing a Root Location?
Chimp
Human
Gorilla
Macaque
Chimp
Macaque
Human
Rooted Trees
Gorilla
past
A rooted tree is just an unrooted tree with a
root location chosen on one branch.
Unrooted Tree
In the absence of an outgroup, the root location
is often unclear and therefore arbitrary.
30Number of OTUs vs. number of trees
[Figure: log of the number of possible trees plotted against the number of OTUs.]
OTU: Operational Taxonomic Unit
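The growth shown on this slide can be reproduced with the standard counting formulas for strictly bifurcating trees: (2n - 5)!! unrooted trees and (2n - 3)!! rooted trees for n OTUs. The formulas are well established but are not stated on the slide itself, and the function names below are illustrative.

```python
import math


def double_factorial_odd(k: int) -> int:
    """Product k * (k - 2) * ... * 3 * 1 for odd k; returns 1 when k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_unrooted_trees(n_otus: int) -> int:
    """Number of unrooted bifurcating trees, (2n - 5)!!, defined for n >= 3."""
    if n_otus < 3:
        raise ValueError("need at least 3 OTUs")
    return double_factorial_odd(2 * n_otus - 5)


def n_rooted_trees(n_otus: int) -> int:
    """Number of rooted bifurcating trees, (2n - 3)!!, defined for n >= 2."""
    if n_otus < 2:
        raise ValueError("need at least 2 OTUs")
    return double_factorial_odd(2 * n_otus - 3)


for n in (3, 4, 5, 10, 20):
    unrooted = n_unrooted_trees(n)
    print(n, unrooted, n_rooted_trees(n), round(math.log10(unrooted), 1))
```

Already at 10 OTUs there are about two million unrooted trees, which is why exhaustive search quickly becomes impractical.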
31Tree Properties
In simple scenarios, evolutionary trees are
ultrametric and phylograms are additive.
32Tree Types
33Character Differences → Branch Lengths and Order
Characters
Chimp      ..AGCTAAAGGGTCAGGGGAAGGGCA..
Gorilla    ..AGCATAGGGGTCAGGGGAAAGGCT..
Human      ..AGCAAAAGGGTCAGGGGAAGGGGA..
Macaque    ..AGCTCATCGGTAAGGAGAAAGGAT..
Orangutan  ..AGCCCATCGGTCAGGAGAAAGGAT..
Squirrel   ..AGCGGACCGGTAAGGAGAAAGGAC..
- Phylogenetic Tree
- Length of a branch is proportional to number of differences
- Tree shows order of divergences
Branch length represents the extent of change: the expected number of substitutions per nucleotide site. The longer the branch, the more change.
34Character Differences → Sequence Distances
Characters
Chimp      ..AGCTAAAGGGTCAGGGGAAGGGCA..
Gorilla    ..AGCATAGGGGTCAGGGGAAAGGCT..
Human      ..AGCAAAAGGGTCAGGGGAAGGGGA..
Macaque    ..AGCTCATCGGTAAGGAGAAAGGAT..
Orangutan  ..AGCCCATCGGTCAGGAGAAAGGAT..
Squirrel   ..AGCGGACCGGTAAGGAGAAAGGAC..
Distances
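A pairwise distance table for the six sequence fragments on this slide can be built by counting mismatches. This is a sketch under a simplifying assumption: raw mismatch counts are used as the distance, with no correction for multiple substitutions at the same site. The helper names are illustrative.

```python
from itertools import combinations

# The 24-base fragments shown on the slide (leading/trailing ".." omitted).
SEQS = {
    "Chimp":     "AGCTAAAGGGTCAGGGGAAGGGCA",
    "Gorilla":   "AGCATAGGGGTCAGGGGAAAGGCT",
    "Human":     "AGCAAAAGGGTCAGGGGAAGGGGA",
    "Macaque":   "AGCTCATCGGTAAGGAGAAAGGAT",
    "Orangutan": "AGCCCATCGGTCAGGAGAAAGGAT",
    "Squirrel":  "AGCGGACCGGTAAGGAGAAAGGAC",
}


def mismatches(a: str, b: str) -> int:
    """Number of aligned positions at which the two sequences differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def distance_matrix(seqs):
    """Symmetric table: dist[name1][name2] = number of mismatches."""
    dist = {name: {name: 0} for name in seqs}
    for n1, n2 in combinations(seqs, 2):
        dist[n1][n2] = dist[n2][n1] = mismatches(seqs[n1], seqs[n2])
    return dist


dist = distance_matrix(SEQS)
for name in SEQS:
    print(f"{name:10s}", [dist[name][other] for other in SEQS])
```

Chimp and Human, for example, differ at 2 positions, while each differs from Gorilla at 5.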
35Sequence Distances → Branch Lengths and Order
Chimp      ..AGCTAAAGGGTCAGGGGAAGGGCA..
Human      ..AGCAAAAGGGTCAGGGGAAGGGGA..
Gorilla    ..AGCATAGGGGTCAGGGGAAAGGCT..
Squirrel   ..AGCGGACCGGTAAGGAGAAAGGAC..
Macaque    ..AGCTCATCGGTAAGGAGAAAGGAT..
Orangutan  ..AGCCCATCGGTCAGGAGAAAGGAT..
[Figure: tree over these sequences with branch-length labels 1, 2, 1, 3, 5, 3, 5, 2.]
36Inferring Ancestral Sequences in the Tree
Chimp      ..AGCTAAAGGGTCAGGGGAAGGGCA..
Human      ..AGCAAAAGGGTCAGGGGAAGGGGA..
Ch/Hu      ..AGCAAAAGGGTCAGGGGAAGGGCA..
Gorilla    ..AGCATAGGGGTCAGGGGAAAGGCT..
Squirrel   ..AGCGGACCGGTAAGGAGAAAGGAC..
Macaque    ..AGCTCATCGGTAAGGAGAAAGGAT..
Orangutan  ..AGCCCATCGGTCAGGAGAAAGGAT..
[Figure: tree with branch-length labels 2, 5, 3, 5, 2.]
- When daughter sequences agree, the character is likely conserved in the ancestor
- Resolve disagreements by checking an outgroup: if an ingroup sequence matches the outgroup, that character is likely conserved in the ancestor.
37Inferring Ancestral Sequences in the Tree
A
(daughter sequences)
A
A
- When daughter sequences agree, the character is likely conserved in the ancestor
- Resolve disagreements by checking an outgroup: if an ingroup sequence matches the outgroup, that character is likely conserved in the ancestor.
38Inferring Ancestral Sequences in the Tree
T
(daughter sequences)
?
A
- When daughter sequences agree, the character is likely conserved in the ancestor
- Resolve disagreements by checking an outgroup: if an ingroup sequence matches the outgroup, that character is likely conserved in the ancestor.
39Inferring Ancestral Sequences in the Tree
T
(daughter sequences)
T
A
T
T
(outgroup sequence)
- When daughter sequences agree, the character is likely conserved in the ancestor
- Resolve disagreements by checking an outgroup: if an ingroup sequence matches the outgroup, that character is likely conserved in the ancestor.
40Inferring Ancestral Sequences in the Tree
T
(daughter sequences)
?
A
?
C
(outgroup sequence)
Of course, the outgroup may fail to resolve the
ambiguity
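The two rules from the preceding slides (agreement between daughters, otherwise an outgroup match, otherwise unresolved) can be written as a short function. This is a sketch of those rules only; the function names are illustrative, and using Gorilla as the outgroup for the Chimp/Human pair is an assumption based on the tree shown earlier.

```python
def infer_ancestor_base(daughter1: str, daughter2: str, outgroup=None) -> str:
    """Apply the slide's rules at one site; '?' marks an unresolved character."""
    if daughter1 == daughter2:
        return daughter1              # daughters agree: likely conserved
    if outgroup in (daughter1, daughter2):
        return outgroup               # outgroup matches one daughter
    return "?"                        # outgroup missing or matches neither


def infer_ancestor(seq1: str, seq2: str, outgroup_seq: str) -> str:
    """Apply the rule independently at every aligned site."""
    return "".join(
        infer_ancestor_base(a, b, o) for a, b, o in zip(seq1, seq2, outgroup_seq)
    )


chimp = "AGCTAAAGGGTCAGGGGAAGGGCA"
human = "AGCAAAAGGGTCAGGGGAAGGGGA"
gorilla = "AGCATAGGGGTCAGGGGAAAGGCT"   # assumed outgroup for the Chimp/Human pair

print(infer_ancestor(chimp, human, gorilla))  # AGCAAAAGGGTCAGGGGAAGGGCA
print(infer_ancestor_base("T", "A", "C"))     # ?
```

The first result matches the Ch/Hu ancestral sequence shown on the slides.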
41Infer Ancestral Sequences at all junctions
Chimp      ..AGCTAAAGGGTCAGGGGAAGGGCA..
Human      ..AGCAAAAGGGTCAGGGGAAGGGGA..
Ch/Hu      ..AGCAAAAGGGTCAGGGGAAGGGCA..
Gorilla    ..AGCATAGGGGTCAGGGGAAAGGCT..
Gor/Ch/Hu  ..AGCA?A?GGGTCAGGGGAAAGGCT..
ROOT       ..AGC??A??GGT?AGG?GAAAGG?T..
Sq/Mac/Or  ..AGC??A?CGGTAAGGAGAAAGGAT..
Squirrel   ..AGCGGACCGGTAAGGAGAAAGGAC..
Mac/Or     ..AGC?CATCGGTAAGGAGAAAGGAT..
Macaque    ..AGCTCATCGGTAAGGAGAAAGGAT..
Orangutan  ..AGCCCATCGGTCAGGAGAAAGGAT..
[Figure: tree with branch-length labels 2, 5, 3, 5, 2.]
42Given a multiple alignment, how do we construct the tree?
A  GCTTGTCCGTTACGAT
B  ACTTGTCTGTTACGAT
C  ACTTGTCCGAAACGAT
D  ACTTGACCGTTTCCTT
E  AGATGACCGTTTCGAT
F  ACTACACCCTTATGAG
?
43Phylogenetic Methods
Many different procedures exist. Three of the most popular:
Neighbor-joining: Minimizes distance between nearest neighbors
Maximum parsimony: Minimizes total evolutionary change
Maximum likelihood: Maximizes likelihood of observed data
44Distance Methods
Logic: Evolutionary distance is a tree metric and hence defines a tree
- General Method
- Evolutionary distances are computed for all pairs of taxa.
- A phylogenetic tree is constructed by considering the relationships among these distance data (fitting a tree to the matrix).
- Methods we'll talk about
- UPGMA
- Neighbor Joining
45Unweighted Pair Group Method with Arithmetic Mean (UPGMA) Method of Clustering
First, construct a distance matrix
A  GCTTGTCCGTTACGAT
B  ACTTGTCTGTTACGAT
C  ACTTGTCCGAAACGAT
D  ACTTGACCGTTTCCTT
E  AGATGACCGTTTCGAT
F  ACTACACCCTTATGAG
This can also be thought of as Uniting Pairs of a Greedy Multiple Alignment
46UPGMA
First round
$d_{(AB),C} = (d_{AC} + d_{BC}) / 2 = 4$
$d_{(AB),D} = (d_{AD} + d_{BD}) / 2 = 6$
$d_{(AB),E} = (d_{AE} + d_{BE}) / 2 = 6$
$d_{(AB),F} = (d_{AF} + d_{BF}) / 2 = 8$
Choose the most similar pair, cluster them together and calculate the new distance matrix.
47UPGMA
Second round
Third round
48UPGMA
Fourth round
Fifth round
Note that this method identifies the root of the tree.
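The rounds above can be sketched as a short UPGMA implementation. The full distance matrix is not in this transcript, so the values in `RAW` are assumed for illustration; they were chosen so that the first-round averages come out to 4, 6, 6 and 8 as on the slide. Cluster-to-cluster distances are averaged with weights equal to cluster size, which is the usual UPGMA update, and the root height is half the distance at the final merge.

```python
from itertools import combinations

# Assumed pairwise distances (illustrative; consistent with the first-round values).
RAW = {
    ("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
    ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6,
    ("A", "E"): 6, ("B", "E"): 6, ("C", "E"): 6, ("D", "E"): 4,
    ("A", "F"): 8, ("B", "F"): 8, ("C", "F"): 8, ("D", "F"): 8, ("E", "F"): 8,
}


def upgma(names, raw):
    """Return (tree, root_height); the tree is a nested tuple of taxon names."""
    d = {frozenset(pair): float(value) for pair, value in raw.items()}
    size = {name: 1 for name in names}       # cluster -> number of taxa in it
    height = 0.0
    while len(size) > 1:
        # Choose the most similar pair of clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        na, nb = size.pop(a), size.pop(b)
        merged = (a, b)
        height = d[frozenset((a, b))] / 2    # node height under a molecular clock
        # Size-weighted average distance from the new cluster to every other one.
        for c in size:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        size[merged] = na + nb
    (tree,) = size
    return tree, height


tree, root_height = upgma("ABCDEF", RAW)
print(tree)
print(root_height)
```

With these assumed values the clustering takes five rounds, and the last merge joins F to everything else, which places the root.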
49UPGMA in practice example
- The mitochondrial genome has 16,500 base-pairs.
- These sequences are estimated to diverge at a rate of 1.7 x 10^-8 substitutions per site per year (molecular evolution clock)
- Thus, roughly 1 mutation every 3,565 years
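The 3,565-year figure follows directly from the two numbers above: multiply the genome length by the per-site rate to get substitutions per genome per year, then invert. A small check in Python (variable names are illustrative):

```python
GENOME_LENGTH_BP = 16_500            # mitochondrial genome size from the slide
RATE_PER_SITE_PER_YEAR = 1.7e-8      # substitutions per site per year

substitutions_per_year = GENOME_LENGTH_BP * RATE_PER_SITE_PER_YEAR
years_per_substitution = 1 / substitutions_per_year

print(substitutions_per_year)          # about 2.8e-4 substitutions per genome per year
print(round(years_per_substitution))   # 3565
```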
50 86 mitochondrial DNAs from various human populations compared via UPGMA
sub-Sahara mtDNA
51Phylogeny based upon the molecular clock
Evidence for a human mitochondrial origin in Africa: African sequence diversity is twice as large as that of non-African
Ingman, M., Kaessmann, H., Pääbo, S. & Gyllensten, U. (2000) Nature 408: 708-713.
52UPGMA assumes a molecular clock
- The UPGMA clustering method is very sensitive to unequal evolutionary rates (it assumes that the evolutionary rate is the same for all branches).
- Clustering works only if the data are ultrametric
- Ultrametric distances are defined by the satisfaction of the 'three-point condition'.
The three-point condition
C
A
B
For any three taxa, the two greatest distances
are equal.
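The three-point condition is easy to test on a distance matrix before trusting a UPGMA tree. A minimal sketch follows; the function names, the tolerance parameter and the example distances are assumptions made for illustration.

```python
from itertools import combinations


def satisfies_three_point(d_ab: float, d_ac: float, d_bc: float, tol: float = 1e-9) -> bool:
    """True if the two greatest of the three distances are equal (within tol)."""
    _, middle, largest = sorted((d_ab, d_ac, d_bc))
    return abs(largest - middle) <= tol


def is_ultrametric(names, dist, tol: float = 1e-9) -> bool:
    """Check the condition for every triple; dist maps frozenset({x, y}) -> distance."""
    return all(
        satisfies_three_point(
            dist[frozenset((a, b))], dist[frozenset((a, c))], dist[frozenset((b, c))], tol
        )
        for a, b, c in combinations(names, 3)
    )


clock_like = {frozenset("AB"): 2, frozenset("AC"): 4, frozenset("BC"): 4}
unequal_rates = {frozenset("AB"): 2, frozenset("AC"): 4, frozenset("BC"): 5}

print(is_ultrametric("ABC", clock_like))     # True
print(is_ultrametric("ABC", unequal_rates))  # False
```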
53UPGMA fails when rates of evolution are not
constant
A tree in which the evolutionary rates are not
equal
(Neighbor joining will get the right tree in this
case.)
54Molecular Clocks for various genes
55Molecular Clocks for various genes
Why such different profiles? Variation in mutation rate?
Variation in selection. Genes coding for some molecules are under very strong stabilizing selection.
56Neighbor Joining: An algorithm for finding the shortest tree
Start with a star (no hierarchical structure)
[Figure: star tree joining OTUs a, b, c and d, with a formula giving the length of the tree in terms of the pair-wise distances and the number of OTUs.]
57Neighbor Joining
The following can be used to calculate the length
of this tree
(Saitou and Nei, 1987)
58Neighbor Joining
At each step, every possible pair of neighbors is considered, and the pair producing the shortest tree is chosen (minimum evolution criterion).
59Neighbor Joining
As in UPGMA, a new internal branch is added at
each step.
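The procedure can be sketched in a few lines of Python. This sketch uses the Q-matrix form of the selection criterion that is common in implementations, which picks the same pair as minimizing the total tree length; it is not necessarily the exact formula shown on the slides. The five-taxon distance values, the internal node names (`node0`, `node1`, ...) and the function name are hypothetical.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return a list of (node, neighbor, branch_length) edges of the unrooted tree.

    `dist` maps frozenset({x, y}) -> distance between x and y.
    """
    d = dict(dist)
    active = list(names)
    edges = []
    next_id = 0
    while len(active) > 2:
        n = len(active)
        total = {x: sum(d[frozenset((x, y))] for y in active if y != x) for x in active}
        # Join the pair with the smallest Q value.
        i, j = min(
            combinations(active, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - total[p[0]] - total[p[1]],
        )
        d_ij = d[frozenset((i, j))]
        # Branch lengths from i and j to the new internal node.
        len_i = d_ij / 2 + (total[i] - total[j]) / (2 * (n - 2))
        len_j = d_ij - len_i
        node = f"node{next_id}"
        next_id += 1
        edges += [(i, node, len_i), (j, node, len_j)]
        # Distances from the new internal node to every remaining node.
        for k in active:
            if k not in (i, j):
                d[frozenset((node, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij
                ) / 2
        active = [k for k in active if k not in (i, j)] + [node]
    x, y = active
    edges.append((x, y, d[frozenset((x, y))]))
    return edges


# Hypothetical distances among five OTUs.
D = {frozenset(k): v for k, v in {
    "ab": 5, "ac": 9, "ad": 9, "ae": 8, "bc": 10,
    "bd": 10, "be": 9, "cd": 8, "ce": 7, "de": 3,
}.items()}

edges = neighbor_joining("abcde", D)
for edge in edges:
    print(edge)
```

Unlike UPGMA, the two branches leading to a joined pair may have different lengths, so unequal evolutionary rates can be represented.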
60The Principle of Parsimony
Simpler hypotheses are preferable to more complicated ones, and ad hoc hypotheses should be avoided whenever possible (Occam's Razor). Thus, find the tree that requires the smallest number of evolutionary changes.
   0123456789012345
W  ACTTGACCCTTACGAT
X  AGCTGGCCCTGATTAC
Y  AGTTGACCATTACGAT
Z  AGCTGGTCCTGATGAC
W
X
Y
Z
61Maximum Parsimony
Start by classifying the sites
       123456789012345678901
Mouse  CTTCGTTGGATCAGTTTGATA
Rat    CCTCGTTGGATCATTTTGATA
Dog    CTGCTTTGGATCAGTTTGAAC
Human  CCGCCTTGGATCAGTTTGAAC
Sites are either invariant or variant; variant sites are either informative or non-informative.
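62What characters are informative for an Unrooted Tree?
[Figure: two four-taxon unrooted trees, one with leaf characters A, G, G, G and one with A, G, G, A.]
A 2:2 split is informative
A 3:1 split is not informative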
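The site classification can be automated: a site is invariant if all sequences share one character, and informative if at least two different characters each occur in at least two sequences. A minimal sketch applied to the Mouse/Rat/Dog/Human alignment from the Maximum Parsimony slide (function names are illustrative):

```python
from collections import Counter

ALIGNMENT = {
    "Mouse": "CTTCGTTGGATCAGTTTGATA",
    "Rat":   "CCTCGTTGGATCATTTTGATA",
    "Dog":   "CTGCTTTGGATCAGTTTGAAC",
    "Human": "CCGCCTTGGATCAGTTTGAAC",
}


def classify_site(column) -> str:
    """Classify one alignment column as invariant, informative or non-informative."""
    counts = Counter(column)
    if len(counts) == 1:
        return "invariant"
    # Informative: at least two states that each appear in two or more sequences.
    shared_states = sum(1 for c in counts.values() if c >= 2)
    return "informative" if shared_states >= 2 else "non-informative"


def classify_alignment(seqs):
    """Classify every column; seqs is an iterable of equal-length strings."""
    return [classify_site(column) for column in zip(*seqs)]


labels = classify_alignment(ALIGNMENT.values())
informative = [i + 1 for i, label in enumerate(labels) if label == "informative"]
print(informative)  # [2, 3, 20, 21] (1-based site numbers)
```

Sites 5 and 14 vary but are non-informative, because only one character there is shared by two or more sequences.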
63
       123456789012345678901
Mouse  CTTCGTTGGATCAGTTTGATA
Rat    CCTCGTTGGATCATTTTGATA
Dog    CTGCTTTGGATCAGTTTGAAC
Human  CCGCCTTGGATCAGTTTGAAC
[Figure: the three possible unrooted trees for the four taxa, drawn with the bases observed at Site 5, Site 2 and Site 3.]
64Maximum Parsimony
       123456789012345678901
Mouse  CTTCGTTGGATCAGTTTGATA
Rat    CCTCGTTGGATCATTTTGATA
Dog    CTGCTTTGGATCAGTTTGAAC
Human  CCGCCTTGGATCAGTTTGAAC
Informative sites supporting each of the three possible trees: 3, 1, 0
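For four taxa there are only three unrooted trees, so each can be scored directly. The sketch below counts the minimum number of changes per site with Fitch's set-intersection rule, a standard way to score parsimony that the slides do not name explicitly. The function names and the tuple encoding of a tree as `((w, x), (y, z))` are illustrative.

```python
SEQS = {
    "Mouse": "CTTCGTTGGATCAGTTTGATA",
    "Rat":   "CCTCGTTGGATCATTTTGATA",
    "Dog":   "CTGCTTTGGATCAGTTTGAAC",
    "Human": "CCGCCTTGGATCAGTTTGAAC",
}


def fitch_pair(set_a, set_b):
    """Fitch step: intersect if possible (no change), otherwise union (one change)."""
    common = set_a & set_b
    return (common, 0) if common else (set_a | set_b, 1)


def quartet_score(seqs, tree) -> int:
    """Minimum number of changes over all sites on the unrooted tree ((w, x), (y, z))."""
    (w, x), (y, z) = tree
    total = 0
    for site in range(len(seqs[w])):
        left, c1 = fitch_pair({seqs[w][site]}, {seqs[x][site]})
        right, c2 = fitch_pair({seqs[y][site]}, {seqs[z][site]})
        _, c3 = fitch_pair(left, right)
        total += c1 + c2 + c3
    return total


TREES = [
    (("Mouse", "Rat"), ("Dog", "Human")),
    (("Mouse", "Dog"), ("Rat", "Human")),
    (("Mouse", "Human"), ("Rat", "Dog")),
]

scores = [quartet_score(SEQS, tree) for tree in TREES]
for tree, score in zip(TREES, scores):
    print(tree, score)
```

The tree grouping Mouse with Rat and Dog with Human needs the fewest changes (8, versus 10 and 11), in line with three of the four informative sites supporting it.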
65Maximum likelihood methods.
- Similar to maximum parsimony
- For each column of the alignment, all possible trees are calculated
- Trees with the least number of substitutions are more likely
C C A G
(1) A G G C U C C A A
(2) A G G U U C G A A
(3) A G C C C A G A A
(4) A U U U C G G A A
G
P(change)?
C
A
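To make "P(change)" concrete, the sketch below computes the likelihood of one alignment column on a four-taxon tree by summing over all possible ancestral bases (Felsenstein's pruning approach). Several things here are assumptions not taken from the slides: the Jukes-Cantor substitution model, the branch lengths, the DNA alphabet, the example column, and all names.

```python
import math
from itertools import product

BASES = "ACGT"


def jc69_prob(start: str, end: str, t: float) -> float:
    """Jukes-Cantor probability of `end` after a branch of length t (subs/site)."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if start == end else 0.25 - 0.25 * decay


def partial(node, site):
    """Likelihood of the data below `node`, for each possible base at `node`.

    A leaf is a taxon name; an internal node is a tuple of (child, branch_length) pairs.
    """
    if isinstance(node, str):
        return {b: 1.0 if site[node] == b else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        below = partial(child, site)
        for b in BASES:
            result[b] *= sum(jc69_prob(b, c, t) * below[c] for c in BASES)
    return result


def site_likelihood(tree, site) -> float:
    """Average over the root base, assuming equal base frequencies."""
    at_root = partial(tree, site)
    return sum(0.25 * at_root[b] for b in BASES)


def quartet(a, b, c, d, t=0.1):
    """Tree ((a, b), (c, d)) with assumed branch lengths."""
    return ((((a, t), (b, t)), t / 2), (((c, t), (d, t)), t / 2))


column = {"1": "A", "2": "A", "3": "G", "4": "G"}   # hypothetical alignment column
print(site_likelihood(quartet("1", "2", "3", "4"), column))
print(site_likelihood(quartet("1", "3", "2", "4"), column))
```

The tree that groups the matching bases together gets the higher likelihood for this column; a full analysis would multiply the likelihoods across all columns and also optimize the branch lengths.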
66Maximum likelihood advantages
- Advantages of maximum likelihood over maximum parsimony
- Takes into account different rates of substitution between different amino acids and/or different sites
- Statistically well-founded
- Applicable to more diverse sequences
67Comparison of Methods
68Methods for phylogenetic tree construction
[Flowchart] Set of related sequences -> Multiple sequence alignment -> Strong sequence similarity?
Yes: Maximum parsimony methods
No: Recognizable sequence similarity?
Yes: Distance methods
No: Maximum likelihood methods
Finally: Analyze reliability of prediction
69Summary
- Multiple sequence alignment gives us the opportunity to calculate evolutionary distances between sequences
- Phylogenetic trees have several different formats, some of which are stylistic and others of which convey information
- The optimal phylogenetic tree is hard to find, but there are several good ways of approximating it
70For next time
- Read Mount Chapter 11, pages 512-543
71Phylogeny can also help isolate genes responsible
for adaptation
Non-thermophiles
A
B
C
Thermophile