Title: Week 36: Trees, Alignment
1Week 36 Trees, Alignment Database Search
Trees Alignment Database Search
2Trees
Tree Concepts Reconstruction Methods
Parsimony Likelihood Distance The
Molecular Clock
3Topologies 4 sequences with root ignored s1
s3 s1 s2 s1
s2 \ / \ / \
/ \ / \ /
\ / --------
-------- -------- / \
/ \ / \ /
\ / \ / \
s2 s4 s3 s4 s4
s3 For unrooted trees with k leaves and
internal nodes with 3 incoming edges, the
following holds Edges 2k-3 Internal
nodes k-2 T(k) (2k-5)T(k-1) T(3)1 k
3 4 5 6 7
20 -----------------------------------------------
T(k) 1 3 15 105 945
1023 ---------------------------------------------
-- Edges 3 5 7 9 11
37 -----------------------------------------------
Internal Nodes 1 2 3 4 5 18
4Central Principles of Phylogeny Reconstruction
TTCAGT TCCAGT GCCAAT GCCAAT
1
0
Parsimony Distance Likelihood
2
Total Weight 4
0
1
0.6
1 1 2 3 2 1
0.7
1.5
0.4
0.3
L3.110-7 Parameter estimates
5The Small Parsimony Problem (Fitch-Hartigan-Sankof
f) ? / \ / \ ? ? / \ / \ C
G T C LN
/ /\ \ /
/ \ \ /
/ \ \ / LA LC LG LT
/ \ RA RG .. \ Recursion LN
minLNL d(N,NL) minLNR d(N,NR) Initialis
ation Lleaf(N) 0, if N at leaf - else infinity
6Fitch-Hartigan-Sankoff Algorithm
(A,C,G,T) (9,7,7,7)
/ \ / \ Costs
Transition 2, / \ (A ,C,G,
T) \ Transversion 5, indel 10.
(10,2,10,2) \ / \ \
/ \ \ / \ \
/ \ \ / \
\ (A,C,G,T) (A,C,G,T) (A,C,G,T) 0
0 0
7Distance Concepts on Trees I A Metric, d( , )
i d(a,b)0 ltgt ab
ii d(a,b)d(b,a)
iii d(a,b) lt d(a,c) d(c,b)
a
b
c
8Distance Concepts on Trees II
Tree Metric (distance function originates from
tree) d(x,y) d(z,æ) d(x,z) d(y,æ) gt d(x,æ)
d(y,z), where z,y,z,æ is a permutation of
a,b,c,d. (gt implies that no branch has length 0)
Reconstruction Principle d(s1,i) (d(s1,s2)
d(s1,s3) - d(s2,s3))/2
9Distance Concepts on Trees III
Ultra Metric (distance function originates from
tree) d(x,y) d(x,z) gt d(x,y), where z,y,z
is a permutation of a,b,c. (gt implies that no
branch has length 0)
Reconstruction Principle d(s1,i)
d(s1,s2)/2
10Evolutionary Substitution Process
t1
A
t2
C
C
Pi,j(t) probability of going from i to j in
time t.
11Probability of a pattern - summing over internal
states
A
C
G
?
?
?
?
A
A
T
A C G T
A C G T
A C G T
12Felsenstein's recursion Conditional probability,
CL(v,N) is the probability for observing the
nucleotides at the leaves at a subtree hanging
from v, if nucleotide N is found at
v. P(v1,v2,N1,N2) probability for N1 at v1, N2
at v2. I) Initial contition CL(leaf,N) 1N
is at leaf II) Recursion
P(v,vl,N,Nl)CL(vl,Nl) CL(v,N)
P(v,vr,N,Nr)CL(vr,Nr) where r refers
to right son, l to left son. III) L(p)
pNBL(Rod,N,p). IV) Total likelihoodProduct
over all positionsL Li(p).
13Output from Likelihood Method
With Clock
Without Clock s5 s4 23 5.2
\ /
/\ 40.9
20.4 / \
\ / / \
! / \
1.6 5.6 23 sd4.6 124.4 / \
s1---6-------22---------------11---3
/\ \ !
! 44.9 /\ \ /\ 7 3.4
4 sd.1.4 / \ \ / \
! s1 s2 s3 s4 s5
s2 Likelihood 7.910-14 ?? ?
0.31.1,0.18.1 6.210-12 ?? ? 0.34.1
0.16.1 ln(7.910-14) ln(6.210-12) is ?2
distributed with (n-2) degrees of freedom.
14The Felsenstein Zone Felsenstein-Cavendar (1979)
True Tree
Reconstructed Tree
s1
s2
s3
s4
Patterns(16 only 8 shown) 0 1 0 0 0
0 0 0 0 0 1 0 0 1 0 1 0 0 0 1
0 1 1 0 0 0 0 0 1 0 1 1
15The Molecular Clock First noted by Zuckerkandl
Pauling (1964) as an empirical fact. How can
one detect it? Known Ancestor Time
Unknown AncestorTime
/\ a at time T.
/ \ / \
? \ / \ /\
\ / \ / \ \ /
\ / \ \ s1
s2 s1 s2 s3
16History of Phylogenetic Methods 1958 Sokal and
Michener publishes UGPMA method for making
distrance trees with a clock. 1964 Parsimony
principle defined, but not advocated by Edwards
and Cavalli-Sforza. 1962-65 Zuckerkandl and
Pauling introduces the notion of a Molecular
Clock. 1967 First large molecular phylogenies
by Fitch and Margoliash. 1969 Heuristic method
used by Dayhoff to make trees and reconstruct
ancetral sequences. 1970 Neyman analyzes three
sequence stochastic model with Jukes-Cantor
substitution. 1971-73 Fitch, Hartigan Sankoff
independently comes up with same algorithm
reconstructing parsimony ancetral sequences.
1973 Sankoff treats alignment and phylogenies
as on general problem phylogenetic alignment.
171979 Cavender and Felsenstein independently comes
up with same evolutionary model where parsimony
is inconsistent. Later called the Felsenstein
Zone. 1981 Felsenstein Maximum Likelihood
Model Program DNAML (i programpakken
PHYLIP). 1981 Parsimony tree problem is shown to
be NP-Complete. 1985 Felsenstein introduces
bootstrapping as confidence interval on
phylogenies. 1986 Bandelt and Dress introduces
split decompostion as a generalization of trees.
1985- Many authors (Sawyer, Hein, Stephens,
M.Smith) tries to address the problem of
recombinations in phylogenies. 1997-9 Thorne et
al., Sanderson Huelsenbeck introduces the
Almost Clock. 2000 Rambaut (and others) makes
methods that can find trees with
non-contemporaneous leaves.
18Alignment
Pairwise Alignment Again Triple Quadruple -
Many Similarity-Distance Conversion Local
Alignment Statistical alignment Conclusion
19Parsimony Alignment of two strings. Sequences
s1CTAGG s2TTGT. Basic operations
transitions 2 (C-T A-G), transversions 5,
indels (g) 10. CTA,TT
GG CTAG,TTG CTA,TTG G-
CTAG,TT -G Initial
condition D0,00. (Di,j D(s11i,
s21j)) Di,jmin Di-1,j-1 d(s1i,s2j),
Di,j-1 g, Di-1,j g DCTA,TT
w(GG) 12 0 12 D4,3DCTAG,TTGminDCTA,TTG
w(G-) 4 10 14 DCTAG,TT
w(-G) 22 10 32
20 40 32 22 14 9 17 T
/ 30 22 12 4 12 22 G
/ 20 12 2 - 12 22 32 T
/ 10 2 10 20 30 40 T / 0 10
20 30 40 50 C T A G G
CTAGG Alignment
i v Cost 17 TT-GT
21Alignment of three sequences. s1ATCG s2ATGCC
s3CTCC Alignment AT-CG ATGCC
CT-CC Consensus sequence
ATCC Configurations in an alignment column -
- n n n - n - - n - n -
n n - n - - - n n n
- Recursion Di,j,k minDi-i',j-j',k-k'
d(i,i',j,j',k,k') Initial condition D0,0,0
0. Running time l1l2l3(23-1) Memory
requirement l1l2l3 New phenomena ancestral
sequence.
22Parsimony Alignment of four sequences s1ATCG
s2ATGCC s3CTCC s4ACGCG Alignment AT-CG
ATGCC CT-CC
ACGCG Configurations in alignment columns -
- - n - - - n n n - n n n n - -
- n - n n - n - - n - n n n - -
n - - n - n - n - n n - n n - n
- - - - n n - - n n n n - n
- Recursion Di minDi-? d(i,?) ?
0,14\04 Initial condition D0
0. Computation time l1l2l3l415 Memory
l1l2l3l4
23Alignment of many sequences. s1ATCG, s2ATGCC,
......., snACGCG Alignment AT-CG
s1 s3 s4 ATGCC
\ ! / .....
---------- ..... /
\ ACGCG s2
s5 Configurations in an alignment column
2n-1 Recursion DiminDi-? d(i,?) ?
0,1n\0n Initial condition D0,0,..0
0. Computation time ln(2n-1)n (lsequence
length, nnumber of sequences) Memory
requirement ln
24Close-to-Optimal Alignments (Waterman Byers,
1983) A. Alignments within ? of optimal. Ex.
? 2 40 32 22 14 9 17 T
/ 30 22 12 4 12 22 G
/ 20 12 2 - 12 22 32 T
/ 10 2 10 20 30 40 T /
0 10 20 30 40 50 C T A
G G CTAGG
Alignment i iv Cost 19
TTGT- Caveat
There are enormous numbers of suboptimal
alignments. B. Sets of postions that are on
some suboptimal alignment.
25Longer Indels TCATGGTACCGTTAGCGT GCA-----------GC
AT gk cost of indel of length k. Initial
condition D0,00 Di,j min Di-1,j-1
d(s1i,s2j), Di,j-1 g1,Di,j-2
g2,Di,j-3 g3,, Di-1,j g1,Di-2,j
g2,Di-3,j g3,, Cubic running
time. Quadratic memory.
26If gk a bk, then quadratic running
time. Gotoh (1982) Di,j is split into 3 types
1. D0i,j as Di,j, except s1i must mactch
s2j. 2. D1i,j as Di,j, except s1i is
matched with "-". 3. D2i,j as Di,j, except
s2i is matched with "-". Then D0i,j
min(D0i-1,j-1, D1i-1,j-1, D2i-1,j-1)
d(s1i,s2j) D1i,j min(D1i,j-1
b, D0i,j-1 a b) D2i,j
min(D2i-1,j b, D0i-1,j a b) Comment 1.
Evolutionary Consistency Condition gi gj gt
gij
27Gotoh Alignment,1981 Let all substitutions
cost 2 og let gk 3 k, then align ACGT with AT.
The alignment must be ACGT
with a cost 5.
A--T
-
n 5 6 2 6 5 5 4
10 11 12 T
T 4 0 4 5 6 4 8
9 10 11 A
A 0 4 5 6 7 - -
- - - A C G T
A C G T n
n n
- - 6 2 6
5 - 8 9 6 7 T
T - 0 6 7 8
- 8 4 5 6 A
A 0 - - - - -
4 5 6 7 A C G T
A C G T
28Distance-Similarity. (Smith-Waterman-Fitch,1982)
Di,jminDi-1,j-1 d(s1i,s2j), Di,j-1 g,
Di-1,j g Si,jmaxDi-1,j-1 s(s1i,s2j),
Si,j-1 -w, Si-1,j-w Distance Transitions2
Transversions 5 Indels10 M largest distance
between two nucleotides (5). Similarity
s(n1,n2) M - d(n1,n2)
wk k/(2M) gk w
1/(2M) g Similarity
Parameters Transversions0 Transitions3
Identity5 Indels 10 1/10
29 40/-40.4 32/-27.3 22/-12.2 14/0.9
9/11.0 17/2.9 T 30/-30.3 22/-17.2
12/-2.1 4/11.0 12/2.9 22/-7.2 G
20/-20.2 12/-7.1 2/8.0 12/-2.1
22/-12.2 32/-22.3 T 10/-10.1 2/3.0
10/-7.1 20/-17.2 30/-27.3 40/-37.4 T
0/0 10/-10.1 20/-20.2 30/-30.3
40/-40.4 50/-50.5 C T
A G G
Comments 1. The Switch from Dist to Sim is
highly analogous to Maximizing -f(x) instead of
Minimizing f(x). 2. Dist will based on a
metric i. d(x,x) 0, ii. d(x,y) gt0, iii.
d(x,y) d(y,x) iv. d(x,z) d(z,y) gt
d(x,y). There are no analogous restrictions
on Sim, giving it a larger parameter space.
30Local alignment Global Alignment
Si,jmaxDi-1,j-1 s(s1i,s2j), Si,j-1 -w,
Si-1,j-w Local
Si,jmaxDi-1,j-1 s(s1i,s2j), Si,j-1 -w,
Si-1,j-w,0 0 1 0 .6 1 2 .6
1.6 1.6 3 2.6 Score Parameters C
0 0 1 0 1 .3 .6 0.6 2
3 1.6 Match 1 A 0 0 0
1.3 0 1 1 2 3.3 2 1.6
Mismatch -1/3 G
/ 0 0 .3 .3 1.3 1
2.3 2.3 2 .6 1.6 Gap 1 k/3 C
/ 0 0
.6 1.6 .3 1.3 2.6 2.3 1 .6 1.6
GCC-UCG U /
GCCAUUG 0 0 2
.6 .3 1.6 2.6 1.3 1 .6 1 A
! 0 1 .6 0 1
3 1.6 1.3 1 1.3 1.6 C
/ 0 1 0 0 2 1.3 .3
1 .3 2 .6 C /
0 0 0 1 .3 0 0 .6 1
0 0 G / 0 0 0 .6 1
0 0 0 1 1 2 U 0 0
1 .6 0 0 0 0 0 0 0
A 0 0 1 0 0 0 0 0
0 0 0 A 0 0 0 0 0 0
0 0 0 0 0 C A G C
C U C G C U U
31Progressive Alignment (Feng-Doolittle 1987
J.Mol.Evol.) Can align alignments and given a
tree make a multiple alignment.
alkmny-trwq acdeqrt akkmdyftrwq
acdehrt kkkmemftrwq P(n,q) P(n,h) P(d,q)
P(d,h) P(e,q) P(e,h)/6
Sodh
atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvh
qfg---ndtagct sagphfnp lsrk Sodb
atkavcvlkgdgpqvqgtinfeak-gdtvkvwgsikglteglhgfhvhq
fg----ndtagct sagphfnp lsrk Sodl
atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvh
qfg---ndtagct sagphfnp lsrk Sddm
atkavcvlkgdgpqvq-infeak-gdtvkvwgsikglteglhgfhvhq
fg----ndtagct sagphfnp lsrk Sdmz
atkavcvlkgdgpqvqinfeqkesdgpvkvwgsikglteglhgfhvhq
fg----ndtagct sagphfnp Lsrk Sods
vatkavcvlkgdgpqvqinfeak-gdtvkvwgsikgltepnglhgfhvh
qfg----ndtagct sagphfnp lsrk S dpb
datkavcvlkgdgpqvqinfeqkesdgpv---wgsikgltglhgfhvhq
fgscasndtagctvlggssagphfnpehtnk
sddm
Sodb
Sodl
Sodh
Sdmz
sods
Sdpb
32(No Transcript)
33l m into Alignment Blocks A. Amino Acids
Ignored - - - - - - -
- - - - -
k
k k e-mt1-lb(t)(lb(t))k-1
1-e-mt-mb(t)1-lb(t)(lb(t))k-1
1-lb(t)(lb(t))k pk(t)
pk(t)
pk(t)
p0(t) mb(t) where
b(t)1-e(l-m)t/m-l B. AA Considered T - -
- R Q S W
Pt(T--gtR)pQ..pWp4(t)
4 T - - - -
- R Q S W pR pQ..pWp4(t)
4
34Basic Pairwise Recursion (O(length3))
i
j
Survives
Dies
i-1
i
i-1
i
j-1
j
j
i
i-1
i
i-1
j-2
j
j
j-1
35a-globin (141) and b-globin (146) 430.108
-log(a-globin) 327.320 -log(a-globin
b-globin) 730.428 -log(l(sumalign)) lt
0.0371805 /- 0.0135899 mt 0.0374396
/- 0.0136846 st 0.91701 /-
0.119556 E(Length) E(Insertions,Deletions)
E(Substitutions) 143.499 5.37255
131.59 Maximum contributing
alignment V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFP
TTKTYFPHF-DLS--H---GSAQVKGHGKKVADAL
VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGD
LSTPDAVMGNPKVKAHGKKVLGAF TNAVAHVDDMPNALSALSDLHAHK
LRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKY
R SDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
KEFTPPVQAAYQKVVAGVANALAHKYH Ratio
l(maxalign)/l(sumalign) 0.00565064
361966 Levenstein formulates distance measure
between sequences and instroduces dynamica
programming algorithm finding the
distance. 1970 Needleman and Wunch compares
proteins maximising a similarity score. 1972
Sankoff Sellers reinvents the basic algorithm.
Sankoff can also align subject to the constraint
that there must be exactly k indels. 1973
Sankoff makes multiple alignment and phylogeny -
both exact heuristic. 1975 Hirschberg gives
linear memory algorithm. 1976 Waterman gives
cubic algorithm allowing for indels of arbitrary
length. 1976 Waterman introduces alignment
without reference to phylogeny. 1981 Waterman,
Smith and Fitch shows duality of simiarity and
distance. 1982 Gotoh gives quadratic algorithm
if gap penalty functionen is gk a bk (for
indel of length k). Uses 3 matrices in stead of
1. 1983 Waterman and Byers introduces
close-to-optimal alignments. 1984-5 Ukkonen,
Myers, Fickett accelerates algorithmen
considerably. 1984 Hogeveg and Hespers
introduces heuristic multiple phylogenetic
alignment. 1984 Fredman introduces triple
alignment generalisation of Needleman-Wunch. 1985
Lipman Wilbur uses hashing.
37 1989 Myers introduces alignment with concave
gap penalty function. 1991 Thorne, Kishino
Felsenstein makes good model for statistical
alignment, partially introduced in 1986 by
Thomson Bishop. 1991 States Botstein
compares a DNA string with a protein in search of
frameshift mutations. 1993-4 Gusfield,
Lander, Waterman and others introduces parametric
alignment. 1987 Feng-Doolittle
introducesphylogenetisk alignment "Once a gap
always a gap". 1989 Kececioglou makes strong
acceleration of Sankoff's exact algorithm.
1994 Krogh et al Baldi et al. introduces
Hidden Markov Models for multiple alignment.
1995 Mitcheson Durbin introduces Tree-HMMs
allowing sequences generated by an HMM to have a
"tree correlation" structure, but not based on an
explicit evolutionary process. 1999 -
resurgence of interest in statistical alignment
38Database Search
What is the probability model for database
search? What is PAM matrix Statistical
Alignment Homology Testing.
39Illustration of database search. Query sequence
qYQPVNPAL Database s1CVDAEGKYL and
s2TTEQRPKNPATYCG i. No mismatches - no gaps
longest common segment q YQPVNPAL s2
TTEQRPKNPATYCG length 3
ii. Mismatches allowed - no gaps q
YQPVNPAL s2 TTEQRPKNPATYCG
Similarity function s( , ). Total
score, S s(P,P) s(V,K) .. s(A,A)
40iii.Both mismatches gapsLocal alignment
(LA). q YQ-PVNPAL s2 TTEQRPKNPATYCG
Here I1 QPVNPA and J1
QRPKNPA. S s(Q,Q) - g s(P,P) s(V,K)
.. s(A,A) i. If g mismatch cost infinity,
LA reduces to longest common segment. ii. If g
infinity, LA reduces to best segment.
41Distributions of Scores. Model q and database
is a series of iid (independent, identically
distributed) random variables, Xi's, where
P(Xij) pj (j any of the 20 amino
acids). Scoring scheme i. E(s(Xi,Xj))
pipjs(i,j) lt 0 ii. max s(i,j) gt 0 and g lt
0. iii. s(i,j) s(j,i)
42- i. Longest Common Segments. The mean grows
proportionally to log(nm), where n and m are the
length of the 2 sequences. - ii. Best Segment with score S will follow an
extreme value distribution. - P(Sgtx) exp(exp(-lx/u)),
- u is a positioning parameter, l a parameter that
determines how fast the distribution tails off. - P(Sgtx) ) exp(Kmnexp(- l x))
- is the x that solves pipjexp(s(i,j)x)
1 - K is also a known function of the pi's and the
sij's. - iii. Local Alignment
- Distribution unknowm.
- Looks like Extreme Value Distribution.
43From http//www.vuse.vanderbilt.edu/mahas1/ce207/
type1extremevalue.html
44The PAM matrix - Point Accepted Mutations
Wi,j -ln(piP2.5i,j/(pipj))
s1 ATWYFCAK-AC Random s1 ATWYFC-AKAC
s2 ETWYKCALLAD s2
LTAYKADCWLE
s2 is a random permutation of s2
Z score(s1,s2)-Escore(s1,s2)/
s.d.score(s1,s2)
45From W.Pearson
46From W.Pearson
47From W.Pearson
48The tactics of BLAST I The most widely used
program for database searches in biological
sequence databases is BLAST (Altschul et al.,
1990) and variants of BLAST. i. It defines a
neighbourhood of segments to the segments
composing the query sequence.
I
qYQPVNPAL
YQPVN, QPVNP, PVNPA, VNPAL
II
YQPVN
AQPVN, BQPVN,.., YAPVN,..
ii. It finds segments in the database that
matches these neighborhood segments very quickly.
49The tactics of BLAST II iii. It heuristically
finds large segments giving a good score.
iv. If the score of this good segment is
statistically significant, then this is extended
5'-ward and 3'-ward by a local alignment
algorithm, giving a proposed local alignment.
50Homology Test Wi,j -ln(piP2.5i,j/(pipj)) D(s
1,s2) is evaluated in D(s1,s2) Real s1
ATWYFCAK-AC Random s1 ATWYFC-AKAC
s2 ETWYKCALLAD s2
LTAYKADCWLE
This test 1. Test the competing
hypothesis that 2 sequences are 2.5 events apart
versus infinitely far apart. 2. It only handles
substitutions correctly. The rationale for
indel costs are more arbitrary. 3. It samples in
(pipj) by permuting the order of amino acids in
the second. I.e. uses drawing without
replacement a hypergeometric distribution.
51(No Transcript)
52Questions Summary Summary i. Database search
is a local alignment problem ii. Scores are
evaluated in an extreme value distribution. iii.
Databases can have problems with internal
similarities. Comments Questions 1. Fuzzy
Problem - in principle are all sequences
homologous. 2. Combined model of shift in
functionality with sequence evolution would be
optimal. 3. If the query sequence is a set of
homologous sequences, it is possible to weight
important positions.