Title: Bioinformatics 40400
1Bioinformatics 40400
- Gianluca Pollastri
- office CS A1.07
- email gianluca.pollastri_at_ucd.ie
2Credits
- Richard Lathrop's and Pierre Baldi's Bioinformatics courses at the University of California, Irvine.
3Phylogenetics credits
- Aoife McLysaght, Trinity College Dublin: I borrowed some of her slides for classes in Molecular Evolution.
4Course overview
- Context: DNA, RNA, proteins.
- Resources: GenBank, PDB, etc.
- Algorithms for sequence comparison.
- Phylogenetic trees.
- Protein structure prediction.
5Lecture notes
- http://gruyere.ucd.ie/2007_courses/40400/
- confidential.
6Recommended/useful readings
- No book is actually required
- Introduction to Computational Molecular Biology (Setubal, Meidanis)
- Introduction to Bioinformatics (Lesk)
- Bioinformatics: the Machine Learning Approach (Baldi, Brunak)
- Biological Sequence Analysis (Eddy, Durbin, Krogh, Mitchison), but this is a tough one
7Course marking
- 4 small things to do at home, each worth 10%
- 60% in the final exam
8Reconstruct phylogeny from molecular data
[Figure: a set of DNA sequences (e.g. ACTGTTACCGA) at the leaves of a tree whose shape is unknown (?)]
9Note
- There are two pieces of information in a phylogenetic tree:
- the topology: the order of divergence events
- the branch lengths: the extent of sequence divergence
10Note 2
- If we build a phylogeny based on one kind of sequence (for instance, a group of sequences from different organisms, one for each organism, that show similarity to each other), what we are building is a phylogeny of the sequences and not necessarily of the organisms.
- For example, horizontal transfer: a gene might have been transferred from one organism to another at some stage.
11Note 2 bis
- Orthologues: sequence divergence occurred after a speciation event.
- Paralogues: sequence divergence occurred after a gene (or genome) duplication; paralogues may coexist in the same genome.
- Alignments can't tell them apart, but phylogeny might.
12About complexity of phylogeny
13Methods of Tree reconstruction
- Distance-based: UPGMA, Neighbour Joining
- Maximum Parsimony
- Maximum Likelihood (and full Bayesian)
14Genetic distance
- Distance from one sequence to another
- Hamming distance: count the number of differences
- Attention: there might be multiple hits, so the number of events can be greater than the number of differences (some events cancel each other).
- We would like to estimate the number of events.
- Remember PAM matrices: PAM250 is equivalent to about 20% similarity, etc.
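The Hamming distance above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names `hamming` and `p_distance` are made up here, and the two fragments are the first 20 residues of the bovine and horse myoglobins used in the UPGMA example. It only counts observed differences, so it does not correct for multiple hits.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def p_distance(a: str, b: str) -> float:
    """Fraction of differing positions (observed differences only)."""
    return hamming(a, b) / len(a)


bovine = "MGLSDGEWQLVLNAWGKVEA"  # first 20 residues of MYG_BOVIN
horse = "MGLSDGEWQQVLNVWGKVEA"   # first 20 residues of MYG_HORSE

print(hamming(bovine, horse))     # 2
print(p_distance(bovine, horse))  # 0.1
```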
15Distance methods UPGMA
- Unweighted Pair-Group Method with Arithmetic means
- Assumes a constant molecular clock
- Simplest method, often dangerous
16Distance Matrix
17UPGMA
- $d_{AB}$ is the smallest distance
- Group A and B
- Branch length: $d_{AB}/2$ (here we say the evolution rate is constant)
- Recalculate distances from AB to other taxa as the average: $d_{(AB)C} = (d_{AC} + d_{BC})/2$
[Figure: A and B joined, each on a branch of length 0.15/2]
18UPGMA
- new distance matrix
- Find smallest distance and continue as before
- Repeat until all taxa are on tree
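19[Figure: UPGMA tree. A and B sit on branches of length $d_{AB}/2$, C on a branch of length $d_{(AB)C}/2$, D on a branch of length $d_{(ABC)D}/2$]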
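The UPGMA loop described above (find the smallest distance, group, recompute, repeat) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used for the course: the function name `upgma` and the dict-of-dicts input are choices made here. When merging clusters it uses the size-weighted average, which reduces to the slide's $(d_{AC} + d_{BC})/2$ when A and B are single taxa. Branch lengths are not stored directly; the height of each internal node (half the distance between the two clusters it joins) is returned instead.

```python
from itertools import combinations


def upgma(dist):
    """Build a UPGMA tree from a symmetric dict-of-dicts distance matrix.

    Returns (tree, height): tree is a nested tuple of leaf labels and
    height maps every node to its distance from the leaves below it.
    """
    D = {a: dict(row) for a, row in dist.items()}  # work on a copy
    size = {a: 1 for a in D}       # number of leaves under each node
    height = {a: 0.0 for a in D}   # leaves sit at height 0
    nodes = list(D)
    while len(nodes) > 1:
        # 1. find the pair of clusters with the smallest distance
        i, j = min(combinations(nodes, 2), key=lambda p: D[p[0]][p[1]])
        new = (i, j)
        height[new] = D[i][j] / 2  # constant-rate (molecular clock) assumption
        size[new] = size[i] + size[j]
        # 2. recompute distances from the new cluster to all the others
        others = [k for k in nodes if k not in (i, j)]
        D[new] = {}
        for k in others:
            d = (size[i] * D[i][k] + size[j] * D[j][k]) / size[new]
            D[new][k] = D[k][new] = d
        nodes = others + [new]
    return nodes[0], height


# Hypothetical ultrametric distances for four taxa
example = {
    "A": {"B": 2, "C": 6, "D": 10},
    "B": {"A": 2, "C": 6, "D": 10},
    "C": {"A": 6, "B": 6, "D": 10},
    "D": {"A": 10, "B": 10, "C": 10},
}
tree, height = upgma(example)
print(tree)          # ('D', ('C', ('A', 'B')))
print(height[tree])  # 5.0
```

The branch above an internal node is the difference between its parent's height and its own, which is how the short internal branches in the example figures arise.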
21UPGMA example
- Started from horse myoglobin.
- Looked for homologues with BLAST.
- Collected a number of myoglobins
22
>uniprot|P02192|MYG_BOVIN Myoglobin.
MGLSDGEWQLVLNAWGKVEADVAGHGQEVLIRLFTGHPETLEKFDKFKHLKTEAEMKASE
DLKKHGNTVLTALGGILKKKGHHEAEVKHLAESHANKHKIPVKYLEFISDAIIHVLHAKH
PSDFGADAQAAMSKALELFRNDMAAQYKVLGFHG
>uniprot|P02197|MYG_CHICK Myoglobin.
MGLSDQEWQQVLTIWGKVEADIAGHGHEVLMRLFHDHPETLDRFDKFKGLKTPDQMKGSE
DLKKHGATVLTQLGKILKQKGNHESELKPLAQTHATKHKIPVKYLEFISEVIIKVIAEKH
AADFGADSQAAMKKALELFRNDMASKYKEFGFQG
>uniprot|P68082|MYG_HORSE Myoglobin.
MGLSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKFKHLKTEAEMKASE
DLKKHGTVVLTALGGILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISDAIIHVLHSKH
PGDFGADAQGAMTKALELFRNDIAAKYKELGFQG
>uniprot|P02144|MYG_HUMAN Myoglobin.
MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASE
DLKKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKH
PGDFGADAQGAMNKALELFRKDMASNYKELGFQG
>uniprot|P04247|MYG_MOUSE Myoglobin.
MGLSDGEWQLVLNVWGKVEADLAGHGQEVLIGLFKTHPETLDKFDKFKNLKSEEDMKGSE
DLKKHGCTVLTALGTILKKKGQHAAEIQPLAQSHATKHKIPVKYLEFISEIIIEVLKKRH
23
SeqA  Name                      Len  SeqB  Name                      Len  Score
1     uniprot|P02192|MYG_BOVIN  154  2     uniprot|P02197|MYG_CHICK  154  72
1     uniprot|P02192|MYG_BOVIN  154  3     uniprot|P68082|MYG_HORSE  154  88
1     uniprot|P02192|MYG_BOVIN  154  4     uniprot|P02144|MYG_HUMAN  154  84
1     uniprot|P02192|MYG_BOVIN  154  5     uniprot|P04247|MYG_MOUSE  154  78
1     uniprot|P02192|MYG_BOVIN  154  6     uniprot|P02189|MYG_PIG    154  88
1     uniprot|P02192|MYG_BOVIN  154  7     uniprot|P02170|MYG_RABIT  154  88
1     uniprot|P02192|MYG_BOVIN  154  8     uniprot|P02190|MYG_SHEEP  154  98
1     uniprot|P02192|MYG_BOVIN  154  9     uniprot|P68279|MYG_TURTR  154  85
2     uniprot|P02197|MYG_CHICK  154  3     uniprot|P68082|MYG_HORSE  154  75
2     uniprot|P02197|MYG_CHICK  154  4     uniprot|P02144|MYG_HUMAN  154  76
2     uniprot|P02197|MYG_CHICK  154  5     uniprot|P04247|MYG_MOUSE  154  74
2     uniprot|P02197|MYG_CHICK  154  6     uniprot|P02189|MYG_PIG    154  76
2     uniprot|P02197|MYG_CHICK  154  7     uniprot|P02170|MYG_RABIT  154  76
2     uniprot|P02197|MYG_CHICK  154  8     uniprot|P02190|MYG_SHEEP  154  72
2     uniprot|P02197|MYG_CHICK  154  9     uniprot|P68279|MYG_TURTR  154  72
3     uniprot|P68082|MYG_HORSE  154  4     uniprot|P02144|MYG_HUMAN  154  88
3     uniprot|P68082|MYG_HORSE  154  5     uniprot|P04247|MYG_MOUSE  154  82
3     uniprot|P68082|MYG_HORSE  154  6     uniprot|P02189|MYG_PIG    154  90
3     uniprot|P68082|MYG_HORSE  154  7     uniprot|P02170|MYG_RABIT  154  89
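To go from the pairwise scores above to a distance matrix, one option is to read each score as a percent identity and take distance = 1 - score/100. That reading is an assumption made here, not something the slides state, although it agrees with the 0.01 branches drawn for Sheep and Cow in the tree figures (score 98, distance 0.02, halved). The names `ROWS` and `scores_to_distances` are illustrative, and only the rows listed above are included.

```python
# (name A, name B, score) for the rows of the pairwise table
ROWS = [
    ("MYG_BOVIN", "MYG_CHICK", 72), ("MYG_BOVIN", "MYG_HORSE", 88),
    ("MYG_BOVIN", "MYG_HUMAN", 84), ("MYG_BOVIN", "MYG_MOUSE", 78),
    ("MYG_BOVIN", "MYG_PIG", 88), ("MYG_BOVIN", "MYG_RABIT", 88),
    ("MYG_BOVIN", "MYG_SHEEP", 98), ("MYG_BOVIN", "MYG_TURTR", 85),
    ("MYG_CHICK", "MYG_HORSE", 75), ("MYG_CHICK", "MYG_HUMAN", 76),
    ("MYG_CHICK", "MYG_MOUSE", 74), ("MYG_CHICK", "MYG_PIG", 76),
    ("MYG_CHICK", "MYG_RABIT", 76), ("MYG_CHICK", "MYG_SHEEP", 72),
    ("MYG_CHICK", "MYG_TURTR", 72), ("MYG_HORSE", "MYG_HUMAN", 88),
    ("MYG_HORSE", "MYG_MOUSE", 82), ("MYG_HORSE", "MYG_PIG", 90),
    ("MYG_HORSE", "MYG_RABIT", 89),
]


def scores_to_distances(rows):
    """Turn (a, b, score) rows into a symmetric dict-of-dicts of distances.

    Assumes the score is a percent identity, so distance = 1 - score / 100.
    """
    D = {}
    for a, b, score in rows:
        d = 1.0 - score / 100.0
        D.setdefault(a, {})[b] = d
        D.setdefault(b, {})[a] = d
    return D


D = scores_to_distances(ROWS)
closest = max(ROWS, key=lambda r: r[2])  # highest score = smallest distance
print(closest[:2])                                  # ('MYG_BOVIN', 'MYG_SHEEP')
print(round(D["MYG_BOVIN"]["MYG_SHEEP"] / 2, 4))    # 0.01
```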
26[Figure: UPGMA tree, first join. Leaf order: Sheep, Cow, Chick, Horse, Human, Mouse, Pig, Rabbit, Dolphin. Branch lengths shown: 0.01, 0.01 (Sheep and Cow).]
36[Figure: UPGMA tree, next join. Leaf order: Sheep, Cow, Human, Pig, Chick, Horse, Mouse, Rabbit, Dolphin. Branch lengths shown: 0.01, 0.01, 0.035, 0.035.]
42[Figure: UPGMA tree, next join. Leaf order: Sheep, Cow, Human, Pig, Rabbit, Chick, Horse, Mouse, Dolphin. Branch lengths shown: 0.01, 0.01, 0.035, 0.0125, 0.035, 0.0475.]
46[Figure: UPGMA tree, next join. Leaf order: Sheep, Cow, Human, Pig, Rabbit, Horse, Chick, Mouse, Dolphin. Branch lengths shown: 0.01, 0.01, 0.035, 0.0125, 0.035, 0.0075, 0.0475, 0.055.]
49[Figure: UPGMA tree, next join. Leaf order: Sheep, Cow, Human, Pig, Rabbit, Horse, Dolphin, Chick, Mouse. Branch lengths shown: 0.01, 0.01, 0.035, 0.0125, 0.035, 0.0075, 0.0475, 0.0033, 0.055, 0.0588.]
52[Figure: UPGMA tree, next join. Leaf order: Sheep, Cow, Human, Pig, Rabbit, Horse, Dolphin, Chick, Mouse. Branch lengths shown: 0.01, 0.0579, 0.01, 0.035, 0.0125, 0.035, 0.0075, 0.0475, 0.0033, 0.055, 0.0091, 0.0588.]
55[Figure: UPGMA tree, next join. Leaf order: Sheep, Cow, Human, Pig, Rabbit, Horse, Dolphin, Mouse, Chick. Branch lengths shown: 0.01, 0.0579, 0.01, 0.035, 0.0125, 0.0276, 0.035, 0.0075, 0.0475, 0.0033, 0.055, 0.0091, 0.0588, 0.0955.]
57[Figure: final UPGMA tree. Leaf order: Sheep, Cow, Human, Pig, Rabbit, Horse, Dolphin, Mouse, Chick. Branch lengths shown: 0.01, 0.0579, 0.01, 0.035, 0.0125, 0.0276, 0.035, 0.0075, 0.0475, 0.0033, 0.0374, 0.055, 0.0091, 0.0588, 0.0955, 0.1329.]
58Neighbour Joining (NJ), Saitou and Nei, 1987
- Another distance-based method.
- As for UPGMA, we first compute the distance matrix by aligning all pairs of sequences.
- Based on the minimum evolution criterion: minimise the sum of the branch lengths.
59Neighbours
- Neighbours are OTUs (leaves of the tree) connected by a single node.
- If we join leaves, we create new neighbours.
[Figure: tree with leaves 1, 2, 3 and 4]
60Start from a star tree
- All nodes are neighbours at the beginning.
- There is just one internal node, x, at the beginning.
[Figure: star tree with leaves 1, 2, 3, 4, ..., N all attached to x]
61Join nodes
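- We want to join the two nodes that give the minimal sum of branches.
- By joining nodes we create a new internal node, y.
[Figure: leaves 1 and 2 attached to the new node y; leaves 3, 4, ..., N still attached to x; x and y connected by a branch]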
62Join nodes
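- If I have N nodes, there are N(N-1)/2 ways of choosing two of them.
- Let's call $L_{ab}$ the branch length between nodes a and b in the tree, and $D_{ij}$ the distance between OTUs i and j in the distance matrix.
- Then, the total distance for the star tree is

$$S_0 = \sum_{i=1}^{N} L_{ix} = \frac{1}{N-1}\sum_{i<j} D_{ij}$$

since $D_{ij} = L_{ix} + L_{jx}$ in the star tree and each branch appears in N-1 of the pairs.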
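The star-tree total can be checked numerically: in a star tree every pairwise distance is the sum of two branches, $D_{ij} = L_{ix} + L_{jx}$, so summing all N(N-1)/2 pairs counts each branch N-1 times, and the total branch length is the sum of all $D_{ij}$ divided by N-1. The sketch below uses made-up branch lengths; `star_tree_length` is a hypothetical helper name.

```python
from itertools import combinations


def star_tree_length(D):
    """Total branch length of a star tree, from pairwise distances only."""
    n = len(D)
    return sum(D[i][j] for i, j in combinations(range(n), 2)) / (n - 1)


# Hypothetical star tree with four leaves and these branch lengths
branches = [1.0, 2.0, 3.0, 4.0]
n = len(branches)
D = [[0.0 if i == j else branches[i] + branches[j] for j in range(n)]
     for i in range(n)]

print(len(list(combinations(range(n), 2))))  # 6, i.e. N(N-1)/2 pairs
print(star_tree_length(D))                   # 10.0, the sum of the branches
```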
63Lxy
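- Once we've joined nodes 1 and 2 (now attached to y, with all other nodes still attached to x), the distance $L_{xy}$ will be

$$L_{xy} = \frac{1}{2(N-2)}\left[\sum_{k=3}^{N}(D_{1k} + D_{2k}) - (N-2)(L_{1y} + L_{2y}) - 2\sum_{i=3}^{N} L_{ix}\right]$$

- First term: all distances from 1 and 2 to the other nodes, which pass through xy.
- Second term: minus all the times we've counted $D_{12} = L_{1y} + L_{2y}$.
- Third term: minus all the distances we've counted on the x side.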
64substitute
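- The two subtractive terms are really the sums of the branch lengths in two star trees (one with leaves 1 and 2, one with leaves 3, ..., N), so we can compute them from the star-tree formula:

$$L_{1y} + L_{2y} = D_{12}, \qquad \sum_{i=3}^{N} L_{ix} = \frac{1}{N-3}\sum_{3 \le i<j \le N} D_{ij}$$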
65finally
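- The sum of the branches we obtain by joining 1 and 2 is then

$$S_{12} = L_{xy} + (L_{1y} + L_{2y}) + \sum_{i=3}^{N} L_{ix} = \frac{1}{2(N-2)}\sum_{k=3}^{N}(D_{1k} + D_{2k}) + \frac{1}{2}D_{12} + \frac{1}{N-2}\sum_{3 \le i<j \le N} D_{ij}$$

- (all expressed in $D_{ij}$, which we can obtain from the matrix)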
66The algorithm
- do
  - Scan all pairs of nodes to find the one with the lowest $S_{ij}$.
  - Join i and j in a single node, re-estimate all branch lengths.
- until just one node
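The loop above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the course's implementation: it recovers the topology only, `branch_sum` evaluates $S_{ij}$ directly as the sum of the three terms $\frac{1}{2(N-2)}\sum_k (D_{ik} + D_{jk}) + \frac{1}{2}D_{ij} + \frac{1}{N-2}\sum D_{kl}$ over the other nodes, and the distance from a joined pair to every other node is taken as the average of the two old distances, which is the update used in the original Saitou and Nei method but is not stated on the slides. The loop stops when two nodes are left, since the final join is forced.

```python
from itertools import combinations


def branch_sum(D, nodes, i, j):
    """S_ij: sum of branch lengths of the tree obtained by joining i and j."""
    n = len(nodes)
    others = [k for k in nodes if k not in (i, j)]
    to_pair = sum(D[i][k] + D[j][k] for k in others)
    among_others = sum(D[a][b] for a, b in combinations(others, 2))
    return to_pair / (2 * (n - 2)) + D[i][j] / 2 + among_others / (n - 2)


def neighbour_joining(dist):
    """Return the NJ topology as a nested tuple (branch lengths omitted)."""
    D = {a: dict(row) for a, row in dist.items()}  # work on a copy
    nodes = list(D)
    while len(nodes) > 2:
        # scan all pairs and keep the one with the lowest S_ij
        i, j = min(combinations(nodes, 2),
                   key=lambda p: branch_sum(D, nodes, *p))
        new = (i, j)
        others = [k for k in nodes if k not in (i, j)]
        D[new] = {}
        for k in others:
            # assumed update: average distance to the two joined nodes
            D[new][k] = D[k][new] = (D[i][k] + D[j][k]) / 2
        nodes = others + [new]
    return tuple(nodes)


# Hypothetical additive distances for the tree ((A:1, B:2):1, (C:3, D:4))
example = {
    "A": {"B": 3, "C": 5, "D": 6},
    "B": {"A": 3, "C": 6, "D": 7},
    "C": {"A": 5, "B": 6, "D": 7},
    "D": {"A": 6, "B": 7, "C": 7},
}
print(branch_sum(example, list(example), "A", "B"))  # 11.0
print(neighbour_joining(example))                    # (('A', 'B'), ('C', 'D'))
```

In this example $S_{AB}$ equals 11, the true total branch length (1 + 2 + 3 + 4 + 1), while the wrong pairings such as A with C give 11.5. Evaluating $S_{ij}$ from scratch for every pair is slow for large N, but it keeps the code close to the formula.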