Bioinformatics: Applications - PowerPoint PPT Presentation

Title:

Bioinformatics: Applications

Slides: 70
Provided by: jonath76

Transcript and Presenter's Notes

Title: Bioinformatics: Applications


1
Bioinformatics: Applications
  • ZOO 4903
  • Fall 2006, MW 10:30-11:45
  • Sutton Hall, Room 312
  • Jonathan Wren
  • Introduction to phylogenetics

2
Lecture overview
  • What we've talked about so far
  • Abundance and growth of DNA sequences
  • How to quantify and locate similar sequences
  • How to align multiple sequences and identify
    conserved regions
  • Overview
  • Both species and genes change over time, and it
    is useful to understand when divergences took
    place
  • Defining evolutionary distances
  • Visualizing evolutionary relationships

3
(No Transcript)
4
Phylogeny: Reconstructing Evolutionary History
  • Goal: Infer past history that produced a set of
    modern characters.
  • Needed:
  • A set of characters (e.g. DNA, protein)
  • Evolutionary model
  • Distance metric
  • Probabilistic model of evolution
  • Output: an evolutionary tree

5
Finding a Phylogeny
  • No guarantees of correctness
  • Based on evidence, but there is more than one way
    to arrive at the same answer
  • All we can observe is distance/ differences. From
    this we infer relatedness.
  • Find the most likely history
  • Parsimony: find the evolutionary tree that
    explains the observations with the fewest
    possible changes

(Figure: example tree relating the sequences ACGT, ACGT, AGGT, ATGT, and AGGT.)
6
A brief history of phylogeny
  • The ancient Greeks divided life into classes or
    forms based upon their morphology
  • Darwin's Origin of Species (1859) propelled
    efforts to find the links between species
  • Hershey and Chase showed that DNA is the molecule
    of inheritance (1952)
  • Emile Zuckerkandl and Linus Pauling used amino
    acid sequences to produce a primate phylogeny
    (1962)
  • NSF begins funding efforts to construct the Tree
    of Life (2005)

7
The Tree of Life: http://tolweb.org
8
(No Transcript)
9
Phylogenetic Analysis Overview

(Figure: flow diagram labeled sequence data, evolutionary model,
phylogenetic inference, evolutionary tree, and summary statistics.
Taxa shown: Chimp, Human, Gorilla, Orangutan, Macaque, Squirrel.)
10
Language Phylogeny
Latin: arbor (tree); domus, casa (house)
arbre    maison
arbore   casa
arbol    casa
albero   casa
tre      hus
strom    domovni
treow    hus
baum     haus
tree     house
11
Language Phylogeny
French
Romanian
Spanish
Italian
Norwegian
Czech
Anglo-Saxon
German
English
12
Molecular phylogeny: What makes it possible
  • Point mutations (substitutions): one base is
    replaced with another

UV Ray
With only point mutations, easy to tell how close
two genomes are - just count the different bases
CAT
CGT
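Counting the differing bases between two equal-length sequences can be sketched in a few lines. The function name `hamming_distance` is an illustrative choice, not something from the lecture, and the sketch assumes the sequences differ only by point mutations (no insertions or deletions).

```python
def hamming_distance(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# The slide's example: CAT and CGT differ at one base.
print(hamming_distance("CAT", "CGT"))  # 1
```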
13
And as in sequence alignment, we can quantify
differences
  • Insertions: one or more bases are inserted
  • Deletions: one or more bases are removed

GCATG → GCACATG (insertion)
GCATCATG → GCATG (deletion)
Caused by copying errors (enzymes slipping, etc.)
14
Phylogeny: Standard Assumptions
  • Sequences diverge by speciation, usually
    represented as bifurcation events.
  • Sequences are essentially independent once they
    diverge from their common ancestor.
  • The probability of observing a given nucleotide
    at a site in the future depends only on the
    current nucleotide at that site (Markov chain
    assumption).
  • Different sites (characters) within a sequence
    evolve independently.
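The Markov chain and site-independence assumptions can be illustrated with a small simulation. This is only a sketch: the per-step substitution probability `mu` and the choice to substitute uniformly among the other three bases (a Jukes-Cantor-style simplification) are assumptions made here, not a model given in the lecture.

```python
import random

BASES = "ACGT"


def evolve_site(base: str, mu: float, rng: random.Random) -> str:
    """One time step for one site; the outcome depends only on the current base."""
    if rng.random() < mu:
        # Assumed model: every other base is equally likely.
        return rng.choice([b for b in BASES if b != base])
    return base


def evolve_sequence(seq: str, mu: float, steps: int, seed: int = 0) -> str:
    """Evolve every site independently for a number of time steps."""
    rng = random.Random(seed)
    current = list(seq)
    for _ in range(steps):
        current = [evolve_site(b, mu, rng) for b in current]
    return "".join(current)


print(evolve_sequence("ACGTACGTACGT", mu=0.05, steps=20))
```

Because each site is updated from its own current base only, the history of a site and the states of its neighbors never enter the calculation.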

15
Question
  • Q: If we can accurately identify the evolutionary
    history of a species, can we extrapolate and
    predict future directions in its evolution?

16
Question
  • Q: If we can accurately identify the evolutionary
    history of a species, can we extrapolate and
    predict future directions in its evolution?
  • A: No. As far as we know, evolution does not have
    a direction.

17
Phylogenies represented with trees
  • A tree is a directed acyclic graph consisting of
    nodes (sequences) and edges (relationships).
  • There exists a single unique path between any
    pair of nodes.

Ancestral sequences
Not a tree, due to cycle
Modern sequences
18
A tree consists of nodes connected by branches.
One unique internal node is the root of the tree:
the ancestor of all the sequences.
Internal nodes represent hypothetical ancestors.
Terminal nodes represent sequences or organisms
for which we have data. Each is typically called
an Operational Taxonomic Unit, or OTU.
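A rooted tree of this kind maps naturally onto a small recursive data structure. The sketch below is illustrative only: the `Node` class, its field names, and the example topology are assumptions made for the example, not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)

    def is_terminal(self) -> bool:
        """Terminal nodes (OTUs) have no children."""
        return not self.children


def terminal_names(node: Node) -> list:
    """Collect the OTU names below a node, left to right."""
    if node.is_terminal():
        return [node.name]
    names = []
    for child in node.children:
        names.extend(terminal_names(child))
    return names


# Hypothetical example: the root and "ancestor_1" are internal nodes.
root = Node("root", [
    Node("ancestor_1", [Node("Chimp"), Node("Human")]),
    Node("Gorilla"),
])
print(terminal_names(root))  # ['Chimp', 'Human', 'Gorilla']
```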
19
Human mtDNA Phylogeny
  • Vertical layout is relatively meaningless (e.g.
    swapping any 2-way branch has no effect on tree
    meaning)
  • Evolutionary distance (horizontal scale, in this
    diagram) is the most important information in the
    phylogeny, and is reflected in the tree structure
    (grouping).
  • Phylogeny is not classification but distance,
    i.e. there is no answer to "how many clusters?"

20
Phylogenetics applied: Florida dentist case
1990 case: Did a patient's HIV infection result
from an invasive dental procedure performed by an
HIV+ dentist? HIV evolves so fast that
transmission patterns can be reconstructed from
viral sequence (molecular forensics).
Compared viral sequence from the dentist, three
of his HIV+ patients, and two HIV+ local controls.
21
Phylogenetic trees can be between or within
species
Relationships within species: HIV subtypes
22
Florida dentist case results
  • 2 of 3 HIV+ patients were closer to the dentist
    than to local controls
  • Patient F likely acquired HIV from another source

23
Phylogenetic tree types
  • Forked vs. hierarchical
  • Bifurcating vs. multifurcating
  • Rooted vs. unrooted
  • Ultrametric vs. additive

24
Forked vs. Hierarchical
(Figure: two example trees with numbered nodes, one forked and one hierarchical.)
25
Bifurcating vs. multifurcating
Multifurcating
Bifurcating
Polytomy
  • Polytomies: Soft vs. Hard
  • Soft: designates a lack of information about the
    order of divergence.
  • Hard: the hypothesis that multiple divergences
    occurred simultaneously

26
Rooted vs. Unrooted
27
Rooted vs. Unrooted Trees
(Figure: two rooted trees and one unrooted tree over Chimp, Human,
Gorilla, and Macaque; the root end of each rooted tree is labeled "past".)
These two rooted trees are different, but both
have the same unrooted tree.
The direction of evolution is specified in a
rooted tree but not in an unrooted tree.
28
Rooted Trees Are Directed Graphs
In a rooted tree each edge is directed, pointing
from root to leaves.
(Figure: rooted tree over Chimp, Human, Gorilla, and Macaque.)
In an unrooted tree, the edges are undirected.
(Figure: the unrooted tree.)
29
Choosing a Root Location?
(Figure: two rooted trees and one unrooted tree over Chimp, Human,
Gorilla, and Macaque; the root end of each rooted tree is labeled "past".)
A rooted tree is just an unrooted tree with a
root location chosen on one branch.
In the absence of an outgroup, the root location
is often unclear and therefore arbitrary.
30
Number of possible trees vs. number of OTUs
(Figure: log of the number of trees plotted against the number of OTUs.)
OTU: Operational Taxonomic Unit
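The growth in the number of trees can be made concrete with the standard counting formulas for strictly bifurcating trees: (2n - 5)!! unrooted and (2n - 3)!! rooted topologies for n OTUs. These formulas are not stated in the transcript, so treat the sketch below as a supplementary illustration; the function names are arbitrary.

```python
import math


def odd_double_factorial(k: int) -> int:
    """k!! for odd k >= 1, i.e. k * (k - 2) * ... * 3 * 1."""
    result = 1
    for factor in range(k, 1, -2):
        result *= factor
    return result


def count_unrooted_trees(n_otus: int) -> int:
    """Number of unrooted bifurcating topologies for n_otus >= 3."""
    return odd_double_factorial(2 * n_otus - 5)


def count_rooted_trees(n_otus: int) -> int:
    """Number of rooted bifurcating topologies for n_otus >= 2."""
    return odd_double_factorial(2 * n_otus - 3)


for n in (4, 10, 20):
    unrooted = count_unrooted_trees(n)
    print(n, unrooted, round(math.log10(unrooted), 1))
```

For 4 OTUs there are only 3 unrooted topologies; for 10 OTUs there are already 2,027,025, which is why exhaustive search quickly becomes impractical.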
 
31
Tree Properties
In simple scenarios, evolutionary trees are
ultrametric and phylograms are additive.
32
Tree Types
33
Character Differences → Branch Lengths and Order
Characters
Chimp      ..AGCTAAAGGGTCAGGGGAAGGGCA..
Gorilla    ..AGCATAGGGGTCAGGGGAAAGGCT..
Human      ..AGCAAAAGGGTCAGGGGAAGGGGA..
Macaque    ..AGCTCATCGGTAAGGAGAAAGGAT..
Orangutan  ..AGCCCATCGGTCAGGAGAAAGGAT..
Squirrel   ..AGCGGACCGGTAAGGAGAAAGGAC..
  • Phylogenetic Tree
  • Length of a branch is proportional to number of
    differences
  • Tree shows order of divergences

Branch length represents the extent of change:
the expected number of substitutions per
nucleotide site. The longer the branch, the more
change.
34
Character Differences → Sequence Distances
Characters
Chimp      ..AGCTAAAGGGTCAGGGGAAGGGCA..
Gorilla    ..AGCATAGGGGTCAGGGGAAAGGCT..
Human      ..AGCAAAAGGGTCAGGGGAAGGGGA..
Macaque    ..AGCTCATCGGTAAGGAGAAAGGAT..
Orangutan  ..AGCCCATCGGTCAGGAGAAAGGAT..
Squirrel   ..AGCGGACCGGTAAGGAGAAAGGAC..
Distances
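The distance table itself did not survive in the transcript, but pairwise difference counts can be recomputed from the six sequence fragments shown on the slide. This is a minimal sketch that uses the raw number of differing positions as the distance; the lecture may have used a corrected distance measure, and the names `SEQUENCES` and `distance_matrix` are illustrative.

```python
from itertools import combinations

# Fragments as shown on the slide, with the flanking ".." removed.
SEQUENCES = {
    "Chimp":     "AGCTAAAGGGTCAGGGGAAGGGCA",
    "Gorilla":   "AGCATAGGGGTCAGGGGAAAGGCT",
    "Human":     "AGCAAAAGGGTCAGGGGAAGGGGA",
    "Macaque":   "AGCTCATCGGTAAGGAGAAAGGAT",
    "Orangutan": "AGCCCATCGGTCAGGAGAAAGGAT",
    "Squirrel":  "AGCGGACCGGTAAGGAGAAAGGAC",
}


def count_differences(a: str, b: str) -> int:
    """Number of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs: dict) -> dict:
    """Symmetric matrix of pairwise difference counts, as nested dicts."""
    matrix = {name: {name: 0} for name in seqs}
    for a, b in combinations(seqs, 2):
        d = count_differences(seqs[a], seqs[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix


DISTANCES = distance_matrix(SEQUENCES)
print(DISTANCES["Chimp"]["Human"])  # 2
```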
35
Sequence Distances → Branch Lengths and Order
Chimp      ..AGCTAAAGGGTCAGGGGAAGGGCA..
Human      ..AGCAAAAGGGTCAGGGGAAGGGGA..
Gorilla    ..AGCATAGGGGTCAGGGGAAAGGCT..
Squirrel   ..AGCGGACCGGTAAGGAGAAAGGAC..
Macaque    ..AGCTCATCGGTAAGGAGAAAGGAT..
Orangutan  ..AGCCCATCGGTCAGGAGAAAGGAT..
(Figure: tree over these sequences with branch lengths 1, 2, 1, 3, 5, 3, 5, 2.)
36
Inferring Ancestral Sequences in the Tree
Chimp      ..AGCTAAAGGGTCAGGGGAAGGGCA..
Human      ..AGCAAAAGGGTCAGGGGAAGGGGA..
Ch/Hu      ..AGCAAAAGGGTCAGGGGAAGGGCA..
Gorilla    ..AGCATAGGGGTCAGGGGAAAGGCT..
Squirrel   ..AGCGGACCGGTAAGGAGAAAGGAC..
Macaque    ..AGCTCATCGGTAAGGAGAAAGGAT..
Orangutan  ..AGCCCATCGGTCAGGAGAAAGGAT..
(Figure: tree with branch lengths 2, 5, 3, 5, 2.)
  • When daughter sequences agree, the base is likely
    conserved in the ancestor.
  • Resolve disagreements by checking the outgroup: if
    an ingroup sequence matches the outgroup, the base
    is likely conserved in the ancestor.

37
Inferring Ancestral Sequences in the Tree
(Figure: daughter sequences A and A; inferred ancestor A.)
  • When daughter sequences agree, the base is likely
    conserved in the ancestor.
  • Resolve disagreements by checking the outgroup: if
    an ingroup sequence matches the outgroup, the base
    is likely conserved in the ancestor.

38
Inferring Ancestral Sequences in the Tree
(Figure: daughter sequences T and A; inferred ancestor "?".)
  • When daughter sequences agree, the base is likely
    conserved in the ancestor.
  • Resolve disagreements by checking the outgroup: if
    an ingroup sequence matches the outgroup, the base
    is likely conserved in the ancestor.

39
Inferring Ancestral Sequences in the Tree
(Figure: daughter sequences T and A with outgroup sequence T; inferred ancestor T.)
  • When daughter sequences agree, the base is likely
    conserved in the ancestor.
  • Resolve disagreements by checking the outgroup: if
    an ingroup sequence matches the outgroup, the base
    is likely conserved in the ancestor.

40
Inferring Ancestral Sequences in the Tree
(Figure: daughter sequences T and A with outgroup sequence C; inferred ancestor "?".)
Of course, the outgroup may fail to resolve the
ambiguity.
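The two rules from these slides (agreement means conserved; otherwise defer to a matching outgroup) can be written as a short function. This is a sketch of that rule only, not a full ancestral reconstruction method; the function names are illustrative, and "?" marks an unresolved base as on the slides.

```python
from typing import Optional


def infer_ancestral_base(left: str, right: str, outgroup: Optional[str] = None) -> str:
    """Apply the slide's rule to one site of two daughter sequences."""
    if left == right:
        return left            # daughters agree: likely conserved
    if outgroup in (left, right):
        return outgroup        # outgroup matches one daughter
    return "?"                 # no outgroup, or it fails to resolve


def infer_ancestral_sequence(left: str, right: str, outgroup: str) -> str:
    """Apply the rule site by site to aligned sequences."""
    return "".join(
        infer_ancestral_base(a, b, o) for a, b, o in zip(left, right, outgroup)
    )


CHIMP = "AGCTAAAGGGTCAGGGGAAGGGCA"
HUMAN = "AGCAAAAGGGTCAGGGGAAGGGGA"
GORILLA = "AGCATAGGGGTCAGGGGAAAGGCT"

# Using Gorilla as the outgroup reproduces the Ch/Hu row shown earlier.
print(infer_ancestral_sequence(CHIMP, HUMAN, GORILLA))
# AGCAAAAGGGTCAGGGGAAGGGCA
```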
41
Infer Ancestral Sequences at all junctions
Chimp      ..AGCTAAAGGGTCAGGGGAAGGGCA..
Human      ..AGCAAAAGGGTCAGGGGAAGGGGA..
Ch/Hu      ..AGCAAAAGGGTCAGGGGAAGGGCA..
Gorilla    ..AGCATAGGGGTCAGGGGAAAGGCT..
Gor/Ch/Hu  ..AGCA?A?GGGTCAGGGGAAAGGCT..
ROOT       ..AGC??A??GGT?AGG?GAAAGG?T..
Sq/Mac/Or  ..AGC??A?CGGTAAGGAGAAAGGAT..
Squirrel   ..AGCGGACCGGTAAGGAGAAAGGAC..
Mac/Or     ..AGC?CATCGGTAAGGAGAAAGGAT..
Macaque    ..AGCTCATCGGTAAGGAGAAAGGAT..
Orangutan  ..AGCCCATCGGTCAGGAGAAAGGAT..
(Figure: tree with branch lengths 2, 5, 3, 5, 2.)
42
Given a multiple alignment, how do we construct
the tree?
A - GCTTGTCCGTTACGAT
B ACTTGTCTGTTACGAT
C ACTTGTCCGAAACGAT
D - ACTTGACCGTTTCCTT
E AGATGACCGTTTCGAT
F - ACTACACCCTTATGAG
?
43
Phylogenetic Methods
Many different procedures exist. Three of the
most popular:
  • Neighbor-joining: minimizes distance between
    nearest neighbors
  • Maximum parsimony: minimizes total evolutionary
    change
  • Maximum likelihood: maximizes likelihood of
    observed data
44
Distance Methods
Logic: Evolutionary distance is a tree metric and
hence defines a tree
  • General Method:
  • Evolutionary distances are computed for all pairs
    of taxa.
  • A phylogenetic tree is constructed by considering
    the relationships among these distance data
    (fitting a tree to the matrix).
  • Methods we'll talk about:
  • UPGMA
  • Neighbor Joining

45
Unweighted Pair Group Method with Arithmetic Mean
(UPGMA) Method of Clustering
First, construct a distance matrix
A - GCTTGTCCGTTACGAT
B ACTTGTCTGTTACGAT
C ACTTGTCCGAAACGAT
D - ACTTGACCGTTTCCTT
E AGATGACCGTTTCGAT
F - ACTACACCCTTATGAG
This can also be thought of as Uniting Pairs of a
Greedy Multiple Alignment
46
UPGMA
First round
dist((A,B),C) = (dist(A,C) + dist(B,C)) / 2 = 4
dist((A,B),D) = (dist(A,D) + dist(B,D)) / 2 = 6
dist((A,B),E) = (dist(A,E) + dist(B,E)) / 2 = 6
dist((A,B),F) = (dist(A,F) + dist(B,F)) / 2 = 8
Choose the most similar pair, cluster them
together, and calculate the new distance matrix.
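The clustering loop can be sketched as follows. The distance matrix is not preserved in the transcript, so the values in `PAIRS` are an assumption, chosen so that the first-round averages match the 4, 6, 6, 8 quoted above. Clusters are represented as nested tuples, and ties are broken by iteration order, which is an arbitrary choice.

```python
from itertools import combinations

LABELS = ["A", "B", "C", "D", "E", "F"]
# Assumed illustrative distances (not taken from the slide).
PAIRS = {
    ("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
    ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6,
    ("A", "E"): 6, ("B", "E"): 6, ("C", "E"): 6, ("D", "E"): 4,
    ("A", "F"): 8, ("B", "F"): 8, ("C", "F"): 8, ("D", "F"): 8, ("E", "F"): 8,
}


def upgma(labels, pairs):
    """Return the list of (merged_cluster, node_height) in merge order."""
    d = {frozenset(p): v for p, v in pairs.items()}
    sizes = {label: 1 for label in labels}       # cluster -> number of OTUs
    merges = []
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        # Size-weighted mean keeps the distance an average over all OTU pairs.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
        merges.append((merged, height))
    return merges


for cluster, height in upgma(LABELS, PAIRS):
    print(cluster, height)
```

With six OTUs the loop runs five rounds, and the last merge is the root of the tree.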
47
UPGMA
Second round
Third round
48
UPGMA
Fourth round
Fifth round
Note that this method identifies the root of the
tree.
49
UPGMA in practice: an example
  • The mitochondrial genome has 16,500 base pairs.
  • These sequences are estimated to diverge at a
    rate of 1.7 × 10^-8 substitutions per site per year
    (molecular evolution clock)
  • Thus, roughly 1 mutation every 3,565 years
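The arithmetic behind the last bullet is a single multiplication and a reciprocal. The constant names below are illustrative; the numbers are the ones quoted on the slide.

```python
GENOME_LENGTH_BP = 16_500          # sites in the mitochondrial genome
RATE_PER_SITE_PER_YEAR = 1.7e-8    # substitutions per site per year

# Expected substitutions per year across the whole genome.
substitutions_per_year = GENOME_LENGTH_BP * RATE_PER_SITE_PER_YEAR

# Expected waiting time between substitutions.
years_per_substitution = 1 / substitutions_per_year

print(round(years_per_substitution))  # 3565
```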

50
86 mitochondrial DNAs from various human
populations compared via UPGMA
sub-Sahara mtDNA
51
Phylogeny based upon the molecular clock
Evidence for a human mitochondrial origin in
Africa: African sequence diversity is twice as
large as that of non-African
Ingman, M., Kaessmann, H., Pääbo, S. &
Gyllensten, U. (2000) Nature 408: 708-713.
52
UPGMA assumes a molecular clock
  • The UPGMA clustering method is very sensitive to
    unequal evolutionary rates (assumes that the
    evolutionary rate is the same for all branches).
  • Clustering works only if the data are ultrametric
  • Ultrametric distances are defined by the
    satisfaction of the 'three-point condition'.

The three-point condition
C
A
B
For any three taxa, the two greatest distances
are equal.
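The three-point condition is easy to test directly on a distance matrix. This sketch assumes distances are stored in a dict keyed by `frozenset` pairs; the function name and the two small example matrices are made up for illustration.

```python
from itertools import combinations


def is_ultrametric(labels, dist, tol=1e-9):
    """True if, for every triple of taxa, the two greatest distances are equal."""
    for a, b, c in combinations(labels, 3):
        d = sorted([
            dist[frozenset((a, b))],
            dist[frozenset((a, c))],
            dist[frozenset((b, c))],
        ])
        if abs(d[2] - d[1]) > tol:
            return False
    return True


clock_like = {
    frozenset("AB"): 2, frozenset("AC"): 4, frozenset("BC"): 4,
}
unequal_rates = {
    frozenset("AB"): 2, frozenset("AC"): 4, frozenset("BC"): 5,
}
print(is_ultrametric("ABC", clock_like))     # True
print(is_ultrametric("ABC", unequal_rates))  # False
```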
53
UPGMA fails when rates of evolution are not
constant
A tree in which the evolutionary rates are not
equal
(Neighbor joining will get the right tree in this
case.)
54
Molecular clocks for various genes
55
Molecular clocks for various genes
Why such different profiles? Variation in
mutation rate?
Variation in selection. Genes coding for some
molecules are under very strong stabilizing selection.
56
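Neighbor Joining: An algorithm for finding the
shortest tree
Start with a star (no hierarchical structure)
(Figure: star tree joining a, b, c, and d. The accompanying formula,
not transcribed, relates the length of the tree to the pair-wise
distances and the number of OTUs.)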
57
Neighbor Joining
The following can be used to calculate the length
of this tree
(Saitou and Nei, 1987)
58
Neighbor Joining
At each step, each pair of possible neighbors is
considered, and the one producing the shortest
tree is chosen (minimum evolution criterion).
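The pair-selection step can be sketched with the commonly used Q-criterion, which picks the same pair as minimizing the resulting tree length. The formulas on the slides were not transcribed, so both the Q form and the star-tree length (sum of pairwise distances divided by N - 1) are given here as standard textbook expressions rather than as the lecture's own notation. The five-taxon distances are an assumed example, and only one joining step is shown.

```python
from itertools import combinations

LABELS = ["a", "b", "c", "d", "e"]
# Assumed example distances (not from the slides).
DIST = {frozenset(p): v for p, v in {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}.items()}


def star_tree_length(labels, dist):
    """Total branch length of the initial star tree."""
    total = sum(dist[frozenset(p)] for p in combinations(labels, 2))
    return total / (len(labels) - 1)


def pick_neighbors(labels, dist):
    """Return the pair with the smallest Q value."""
    n = len(labels)
    row_sum = {
        i: sum(dist[frozenset((i, k))] for k in labels if k != i)
        for i in labels
    }

    def q(pair):
        i, j = pair
        return (n - 2) * dist[frozenset(pair)] - row_sum[i] - row_sum[j]

    return min(combinations(labels, 2), key=q)


print(star_tree_length(LABELS, DIST))  # 19.5
print(pick_neighbors(LABELS, DIST))    # ('a', 'b')
```

Note that d and e have the smallest raw distance (3), yet a and b are joined first: the criterion corrects for how far each taxon is from everything else, which is what lets neighbor joining cope with unequal rates.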
59
Neighbor Joining
As in UPGMA, a new internal branch is added at
each step.
60
The Principle of Parsimony
Simpler hypotheses are preferable to more
complicated ones, and ad hoc hypotheses
should be avoided whenever possible (Occam's
Razor). Thus, find the tree that requires the
smallest number of evolutionary changes.
0123456789012345
W - ACTTGACCCTTACGAT
X AGCTGGCCCTGATTAC
Y AGTTGACCATTACGAT
Z - AGCTGGTCCTGATGAC
(Figure: tree with leaves W, X, Y, and Z.)
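Counting the changes a given tree requires can be sketched with Fitch's counting procedure, a standard technique that the transcript does not name. The sketch scores the three possible groupings of W, X, Y, and Z; it assumes the stray "-" tokens next to W and Z in the transcript are separators rather than gap characters, and the tuple-based tree encoding is an illustrative choice.

```python
SEQS = {
    "W": "ACTTGACCCTTACGAT",
    "X": "AGCTGGCCCTGATTAC",
    "Y": "AGTTGACCATTACGAT",
    "Z": "AGCTGGTCCTGATGAC",
}


def fitch_changes(tree, column):
    """Return (candidate_states, changes) for one site.

    tree is either a leaf name or a (left, right) tuple of subtrees.
    """
    if isinstance(tree, str):
        return {column[tree]}, 0
    left_states, left_changes = fitch_changes(tree[0], column)
    right_states, right_changes = fitch_changes(tree[1], column)
    shared = left_states & right_states
    if shared:
        return shared, left_changes + right_changes
    # No shared state: one substitution is needed at this node.
    return left_states | right_states, left_changes + right_changes + 1


def tree_length(tree, seqs):
    """Total number of changes over all sites."""
    n_sites = len(next(iter(seqs.values())))
    total = 0
    for site in range(n_sites):
        column = {name: seq[site] for name, seq in seqs.items()}
        total += fitch_changes(tree, column)[1]
    return total


TOPOLOGIES = [
    (("W", "X"), ("Y", "Z")),
    (("W", "Y"), ("X", "Z")),
    (("W", "Z"), ("X", "Y")),
]
for topology in TOPOLOGIES:
    print(topology, tree_length(topology, SEQS))
```

Under these assumptions the grouping (W, Y) with (X, Z) needs 9 changes and the other two need 14 each, so parsimony prefers the first.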
61
Maximum Parsimony
Start by classifying the sites:
      123456789012345678901
Mouse CTTCGTTGGATCAGTTTGATA
Rat   CCTCGTTGGATCATTTTGATA
Dog   CTGCTTTGGATCAGTTTGAAC
Human CCGCCTTGGATCAGTTTGAAC
Sites are either invariant or variant; variant
sites are either informative or non-informative.
62
What characters are informative for an Unrooted
Tree?
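(Figure: two unrooted four-taxon trees, one with leaf bases A, A, G, G
and one with leaf bases A, G, G, G.)
A 2:2 split is informative.
A 3:1 split is not informative.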
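The site classification can be automated for the Mouse, Rat, Dog, and Human alignment shown earlier. The sketch uses the usual rule that a site is informative when at least two different bases each occur in at least two sequences; the function names are illustrative.

```python
from collections import Counter

ALIGNMENT = {
    "Mouse": "CTTCGTTGGATCAGTTTGATA",
    "Rat":   "CCTCGTTGGATCATTTTGATA",
    "Dog":   "CTGCTTTGGATCAGTTTGAAC",
    "Human": "CCGCCTTGGATCAGTTTGAAC",
}


def classify_site(column: str) -> str:
    """Classify one alignment column."""
    counts = Counter(column)
    if len(counts) == 1:
        return "invariant"
    repeated = [base for base, n in counts.items() if n >= 2]
    return "informative" if len(repeated) >= 2 else "non-informative"


def classify_alignment(alignment: dict) -> dict:
    """Map 1-based site number to its class."""
    columns = zip(*alignment.values())
    return {i: classify_site("".join(col)) for i, col in enumerate(columns, start=1)}


classes = classify_alignment(ALIGNMENT)
print([site for site, kind in classes.items() if kind == "informative"])
# [2, 3, 20, 21]
```

Sites 5 and 14 vary but are not informative: site 14 is a 3:1 split, and site 5 has three different bases with only one of them repeated.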
63
      123456789012345678901
Mouse CTTCGTTGGATCAGTTTGATA
Rat   CCTCGTTGGATCATTTTGATA
Dog   CTGCTTTGGATCAGTTTGAAC
Human CCGCCTTGGATCAGTTTGAAC
(Figure: tree diagrams with base assignments for Site 5, Site 2, and Site 3.)
64
Maximum Parsimony
      123456789012345678901
Mouse CTTCGTTGGATCAGTTTGATA
Rat   CCTCGTTGGATCATTTTGATA
Dog   CTGCTTTGGATCAGTTTGAAC
Human CCGCCTTGGATCAGTTTGAAC
Informative

3
1
0
65
Maximum likelihood methods.
  • Similar to maximum parsimony
  • For each column of the alignment all possible
    trees are calculated
  • Trees with the least number of substitutions are
    more likely

(1) A G G C U C C A A
(2) A G G U U C G A A
(3) A G C C C A G A A
(4) A U U U C G G A A
(Figure: tree with leaf bases C, C, A, G and internal nodes labeled
G, C, and A; P(change)?)
66
Maximum likelihood advantages
  • Advantages of maximum likelihood over maximum
    parsimony
  • Takes into account different rates of
    substitution between different amino acids and/or
    different sites
  • Statistically well-founded
  • Applicable to more diverse sequences

67
Comparison of Methods
68
Methods for phylogenetic tree construction
  • Set of related sequences
  • Multiple sequence alignments
  • Strong sequence similarity? Yes: maximum parsimony methods
  • No: Recognizable sequence similarity? Yes: distance methods
  • No: maximum likelihood methods
  • Analyze reliability of prediction
69
Summary
  • Multiple sequence alignment gives us the
    opportunity to calculate evolutionary distances
    between sequences
  • Phylogenetic trees have several different
    formats, some of which are stylistic and others
    of which convey information
  • The optimal phylogenetic tree is hard to find,
    but there are several good ways of approximating
    it

70
For next time
  • Read Mount Chapter 11, pages 512-543

71
Phylogeny can also help isolate genes responsible
for adaptation
Non-thermophiles
A
B
C
Thermophile