Title: Pr
1 Approaches to Data Analysis
Data: GTCAT, GTTGGT, GTCA, CTCA
Parsimony, similarity, optimisation:
GT-CAT
GTTGGT
GT-CA-
CT-CA-
Statistics
Ideal Practice: 1-phase analysis.
Actual Practice: 2-phase analysis.
2 Origins of Statistical Alignment
Bishop & Thompson (1986); Thorne, Kishino & Felsenstein (1991)
Challenges to Statistical Alignment:
- Understanding the Basic Model
- Speed of the Basic Algorithm
- Analyzing Many Sequences - Multiple Statistical Alignment
- Realistic Models
The Biological Problems:
- Phylogeny
- Molecular Evolution
- Alignment
- Homology Testing
- More
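3 Thorne-Kishino-Felsenstein (1991) Process
[Figure: a sequence (A, C, G, ...) evolving from time T = 0 to time T = t.]
$\lambda < \mu$
$P(s) = (1-\lambda/\mu)\,(\lambda/\mu)^{l}\,\pi_A^{\#A} \cdots \pi_T^{\#T}$, where $l = \mathrm{length}(s)$
Time reversible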
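The equilibrium probability of a sequence can be evaluated directly from its length and composition. The following is a minimal sketch; the function name, the parameter values and the uniform base frequencies are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter

def tkf_equilibrium_logprob(seq, lam, mu, pi):
    """log P(s) = log(1 - lam/mu) + len(s)*log(lam/mu) + sum_a count(a)*log(pi[a])."""
    if not 0 < lam < mu:
        raise ValueError("the process needs 0 < lam < mu")
    ratio = lam / mu
    logp = math.log(1 - ratio) + len(seq) * math.log(ratio)
    for letter, count in Counter(seq).items():
        logp += count * math.log(pi[letter])
    return logp

# Hypothetical parameters: uniform base frequencies, lam/mu = 0.9.
uniform = {a: 0.25 for a in "ACGT"}
print(math.exp(tkf_equilibrium_logprob("ACG", lam=0.9, mu=1.0, pi=uniform)))
```

The sequence length is geometrically distributed with parameter $\lambda/\mu$, which is why the process needs $\lambda < \mu$.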
4 The invasion of the immortal link (From Hein, Wiuf, Knudsen, Møller & Wibling 2000)
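5 Time reversibility
$P_{i,j}(t)$: probability that $i$ has evolved into $j$ after time $t$.
$\pi_i$: probability of $i$ after infinitely long time (equilibrium distribution).
$\pi_i\,P_{i,j}(t) = \pi_j\,P_{j,i}(t)$
[Figure: ancestor a joined to s1 and s2 by branches of length t1 and t2, next to s1 and s2 joined by a single branch of length t1 + t2.]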
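Reversibility is what allows the unknown ancestor to be removed: summing over the ancestor at distances t1 and t2 gives the same joint probability as evolving one sequence directly into the other over t1 + t2. The sketch below checks both statements numerically for a single site. The Felsenstein-81-style transition function and the frequencies in `PI` are assumptions chosen for illustration.

```python
import math

PI = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}  # assumed equilibrium frequencies

def transition(i, j, t, s=1.0):
    """P_ij(t) for an assumed F81-style model with substitution rate s."""
    stay = math.exp(-s * t)
    return stay * (i == j) + (1.0 - stay) * PI[j]

def joint_via_ancestor(s1, s2, t1, t2):
    """Sum over the ancestral state a: pi_a * P_a,s1(t1) * P_a,s2(t2)."""
    return sum(PI[a] * transition(a, s1, t1) * transition(a, s2, t2) for a in PI)

def joint_direct(s1, s2, t1, t2):
    """pi_s1 * P_s1,s2(t1 + t2): s1 treated as the ancestor of s2."""
    return PI[s1] * transition(s1, s2, t1 + t2)

print(joint_via_ancestor("A", "G", 0.3, 0.5), joint_direct("A", "G", 0.3, 0.5))
```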
6 Two kinds of alignment
Optimisation (here Parsimony): Shortest Path
C T G A G G
G T - - G C
[Figure: alignment grid for CTGAGG against GTGC.]
Statistical: Probability and Sum over all Paths
C T G A G G
G T - - G C
[Figure: alignment grid for CTGAGG against GTGC.]
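The contrast between the two kinds of alignment can be seen on the same dynamic-programming grid: optimisation keeps the minimum over the three incoming edges, while a statistical method adds them up. As a minimal sketch, the code below uses unit costs for the shortest path (an assumption; the slide does not give its parsimony costs) and simply counts paths as the simplest stand-in for a sum over paths. A statistical method would weight each path by its probability instead.

```python
def edit_distance(a, b):
    """Shortest path: minimum number of substitutions and single-letter indels."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (x != y), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]

def count_alignments(n, m):
    """Sum over all paths: number of alignments of sequences of length n and m."""
    prev = [1] * (m + 1)
    for _ in range(n):
        cur = [1]
        for j in range(1, m + 1):
            cur.append(prev[j - 1] + prev[j] + cur[j - 1])
        prev = cur
    return prev[-1]

print(edit_distance("CTGAGG", "GTGC"))  # 4, the cost of the alignment shown on the slide
print(count_alignments(6, 4))           # 1289 paths through the 6 x 4 grid
```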
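7 λ & μ into Alignment Blocks
A. Amino Acids Ignored
[Figure: the kinds of block: a link that survives, a link that dies, and the immortal link, each followed by newly inserted links, k links in total in the descendant.]
Surviving link ($k \ge 1$): $p_k(t) = e^{-\mu t}\,[1-\lambda\beta(t)]\,(\lambda\beta(t))^{k-1}$
Dying link ($k \ge 1$): $p'_k(t) = [1-e^{-\mu t}-\mu\beta(t)]\,[1-\lambda\beta(t)]\,(\lambda\beta(t))^{k-1}$
Immortal link ($k \ge 0$): $p''_k(t) = [1-\lambda\beta(t)]\,(\lambda\beta(t))^{k}$
Dying link, no descendants: $p'_0(t) = \mu\beta(t)$
$\beta(t) = \dfrac{1-e^{(\lambda-\mu)t}}{\mu-\lambda e^{(\lambda-\mu)t}}$
B. Amino Acids Considered
T - - -
R Q S W      $P_t(T \to R)\,\pi_Q \cdots \pi_W\,p_4(t)$
T - - - -
- R Q S W    $\pi_R\,\pi_Q \cdots \pi_W\,p'_4(t)$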
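The block probabilities are short enough to evaluate directly. The sketch below implements the TKF91 closed forms for a surviving link, a dying link and the immortal link, and checks that each family is a proper distribution. The function names and the parameter values in the usage line are illustrative assumptions.

```python
import math

def beta(t, lam, mu):
    e = math.exp((lam - mu) * t)
    return (1.0 - e) / (mu - lam * e)

def p_survive(k, t, lam, mu):
    """p_k(t), k >= 1: the link survives and leaves k links (itself included)."""
    b = beta(t, lam, mu)
    return math.exp(-mu * t) * (1 - lam * b) * (lam * b) ** (k - 1)

def p_die(k, t, lam, mu):
    """p'_k(t), k >= 0: the link dies and leaves k inserted links."""
    b = beta(t, lam, mu)
    if k == 0:
        return mu * b
    return (1 - math.exp(-mu * t) - mu * b) * (1 - lam * b) * (lam * b) ** (k - 1)

def p_immortal(k, t, lam, mu):
    """p''_k(t), k >= 0: the immortal link leaves k inserted links."""
    b = beta(t, lam, mu)
    return (1 - lam * b) * (lam * b) ** k

# A mortal link either survives or dies, so the two families together sum to 1.
t, lam, mu = 0.5, 1.0, 2.0
total = sum(p_survive(k, t, lam, mu) for k in range(1, 200))
total += sum(p_die(k, t, lam, mu) for k in range(0, 200))
print(total)
```

All three families have geometric tails with ratio $\lambda\beta(t)$, which is the property the later quadratic recursion relies on.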
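8 Illustration of single equation.
[Figure: the state "dead link with k descendants" over a short interval Δt. It is entered from k-1 descendants by a birth (rate λ(k-1)), from k+1 descendants by a death (rate μ(k+1)), and from a surviving link with k+1 links when the original link dies (rate μ). It is left by a birth (rate λk) or a death (rate μk).]
$\Delta p'_k = \Delta t\,[\lambda(k-1)\,p'_{k-1} + \mu(k+1)\,p'_{k+1} - (\lambda+\mu)k\,p'_k + \mu\,p_{k+1}]$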
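9 Diff. Equations for p-functions
Surviving link: $\Delta p_k = \Delta t\,[\lambda(k-1)\,p_{k-1} + \mu k\,p_{k+1} - (\lambda+\mu)k\,p_k]$
Dying link: $\Delta p'_k = \Delta t\,[\lambda(k-1)\,p'_{k-1} + \mu(k+1)\,p'_{k+1} - (\lambda+\mu)k\,p'_k + \mu\,p_{k+1}]$
Immortal link: $\Delta p''_k = \Delta t\,[\lambda k\,p''_{k-1} + \mu(k+1)\,p''_{k+1} - ((k+1)\lambda+\mu k)\,p''_k]$
Initial Conditions: $p_k(0) = p'_k(0) = p''_k(0) = 0$ for $k > 1$; $p_1(0) = p''_0(0) = 1$; $p'_0(0) = 0$.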
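One way to gain confidence in the differential equations is to check that the closed-form p-functions satisfy them. The sketch below compares a central finite difference of each closed form with the right-hand side of the corresponding TKF91 equation (surviving, dying and immortal link). The step size, the truncation `kmax` and the parameter values are arbitrary choices for illustration.

```python
import math

def p_functions(t, lam, mu, kmax):
    """Closed-form p_k, p'_k, p''_k for k = 0..kmax (p_0 does not exist and is set to 0)."""
    e = math.exp((lam - mu) * t)
    b = (1 - e) / (mu - lam * e)
    lb, stay = lam * b, math.exp(-mu * t)
    p = [0.0] + [stay * (1 - lb) * lb ** (k - 1) for k in range(1, kmax + 1)]
    pd = [mu * b] + [(1 - stay - mu * b) * (1 - lb) * lb ** (k - 1) for k in range(1, kmax + 1)]
    pim = [(1 - lb) * lb ** k for k in range(kmax + 1)]
    return p, pd, pim

def max_residual(t, lam, mu, kmax=6, h=1e-5):
    """Largest |d/dt(closed form) - right-hand side| over the three families and k < kmax."""
    before = p_functions(t - h, lam, mu, kmax)
    after = p_functions(t + h, lam, mu, kmax)
    p, pd, pim = p_functions(t, lam, mu, kmax)

    def at(seq, k):  # terms with a negative index vanish
        return seq[k] if k >= 0 else 0.0

    worst = 0.0
    for k in range(kmax):
        rhs = (
            lam * (k - 1) * at(p, k - 1) + mu * k * p[k + 1] - (lam + mu) * k * p[k],
            lam * (k - 1) * at(pd, k - 1) + mu * (k + 1) * pd[k + 1]
            - (lam + mu) * k * pd[k] + mu * p[k + 1],
            lam * k * at(pim, k - 1) + mu * (k + 1) * pim[k + 1]
            - ((k + 1) * lam + mu * k) * pim[k],
        )
        for family in range(3):
            slope = (after[family][k] - before[family][k]) / (2 * h)
            worst = max(worst, abs(slope - rhs[family]))
    return worst

print(max_residual(0.7, 1.0, 2.0))  # close to zero, up to finite-difference error
```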
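10 Basic Pairwise Recursion (O(length^3))
[Figure: the last link of s1 (position i) either survives or dies, and prefix pair (i, j) is reached from (i-1, j), (i-1, j-1), (i-1, j-2), ... If the link survives it leaves 1, ..., j links (j cases); if it dies it leaves 0, ..., j links (j+1 cases).]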
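11 Basic Pairwise Recursion (O(length^3))
[Figure: dynamic-programming grid with rows i-1, i and columns j-1, j. Cell (i,j) is computed from (i-1,j), (i-1,j-1), ..., (i-1,j-k), ...]
Initial condition: $p''_j\,\pi_{s2[1:j]}$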
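12 Fundamental Pairwise Recursion.
Notation: $s1_i$ is the prefix of s1 of length i, $f(a,b) = P_t(a \to b)$, and $\pi_{s2[a:b]}$ is the product of the equilibrium frequencies of s2[a], ..., s2[b].
$P(s1_i \to s2_j) = p'_0\,P(s1_{i-1} \to s2_j) + \sum_{k=1}^{j} \big[p_k\,f(s1[i], s2[j-k+1]) + p'_k\,\pi_{s2[j-k+1]}\big]\,\pi_{s2[j-k+2:j]}\,P(s1_{i-1} \to s2_{j-k})$
Initial Condition: $P(s1_0 \to s2_j) = p''_j\,\pi_{s2[1:j]}$
Probability of observation: $P(s1, s2) = P(s1)\,P(s1 \to s2)$
Simplification:
$R_{i,j} = \big(p_1\,f(s1[i], s2[j]) + p'_1\,\pi_{s2[j]}\big)\,P(s1_{i-1} \to s2_{j-1}) + \lambda\beta\,\pi_{s2[j]}\,R_{i,j-1}$
$P(s1_i \to s2_j) = R_{i,j} + p'_0\,P(s1_{i-1} \to s2_j)$
Eliminating $R$:
$P(s1_i \to s2_j) = p'_0\,P(s1_{i-1} \to s2_j) + \lambda\beta\,\pi_{s2[j]}\,P(s1_i \to s2_{j-1}) + \big(p_1\,f(s1[i], s2[j]) + p'_1\,\pi_{s2[j]} - \lambda\beta\,p'_0\,\pi_{s2[j]}\big)\,P(s1_{i-1} \to s2_{j-1})$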
13 Geometric Like Offspring Number
[Figure: a link and its k descendant links.]
$p_k(t) = e^{-\mu t}\,[1-\lambda\beta(t)]\,(\lambda\beta(t))^{k-1}$
$p'_k(t) = [1-e^{-\mu t}-\mu\beta(t)]\,[1-\lambda\beta(t)]\,(\lambda\beta(t))^{k-1}$
$p'_0(t) = \mu\beta(t)$
Alternative traversal:
Die forward in time, give birth backwards. Trace the leftmost unfinished branch. After one survivor, branch lengths with birth possibility are always t.
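14 Quadratic Recursion
[Figure: cell (i,j) is computed from (i-1,j), (i-1,j-1) and (i,j-1).]
Two state recursion:
$R_{i,j} = \big(p_1\,f(s1[i], s2[j]) + p'_1\,\pi_{s2[j]}\big)\,P(s1_{i-1} \to s2_{j-1}) + \lambda\beta\,\pi_{s2[j]}\,R_{i,j-1}$
$P(s1_i \to s2_j) = R_{i,j} + p'_0\,P(s1_{i-1} \to s2_j)$
One state recursion:
$P(s1_i \to s2_j) = p'_0\,P(s1_{i-1} \to s2_j) + \lambda\beta\,\pi_{s2[j]}\,P(s1_i \to s2_{j-1}) + \big(p_1\,f(s1[i], s2[j]) + p'_1\,\pi_{s2[j]} - \lambda\beta\,p'_0\,\pi_{s2[j]}\big)\,P(s1_{i-1} \to s2_{j-1})$
1. Summation, Maximization and Sampling of Alignments.
2. For more sequences: Ancestral Sequences & Alignments.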
15 Likelihood Surface (From Hein, Wiuf, Knudsen, Møller & Wibling 2000)
16 α-globin (141) and β-globin (146) (From Hein, Wiuf, Knudsen, Møller & Wibling 2000)
-log(α-globin) = 430.108
-log(α-globin -> β-globin) = 327.320
-log(α-globin, β-globin) = -log(l(sumalign)) = 730.428
λt = 0.0371805 +/- 0.0135899
μt = 0.0374396 +/- 0.0136846
st = 0.91701 +/- 0.119556
E(Length) = 143.499
E(Insertions, Deletions) = 5.37255
E(Substitutions) = 131.59
Maximum contributing alignment:
α  V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADALT
β  VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFS
α  NAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
β  DGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
Ratio l(maxalign)/l(sumalign) = 0.00565064
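The maximum contributing alignment quoted on this slide can be checked against the stated sequence lengths (141 and 146). The sketch below de-interleaves the two rows as they appear in the slide text and counts columns, residues, gapped columns and identities. The helper name `summarise` is hypothetical.

```python
ALPHA = (
    "V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADALT"
    "NAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR"
)
BETA = (
    "VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFS"
    "DGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH"
)

def summarise(row1, row2):
    """Basic counts for a pairwise alignment given as two gapped rows."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    columns = list(zip(row1, row2))
    return {
        "columns": len(columns),
        "length1": sum(a != "-" for a in row1),
        "length2": sum(b != "-" for b in row2),
        "gap_columns": sum("-" in col for col in columns),
        "identical": sum(a == b != "-" for a, b in columns),
    }

print(summarise(ALPHA, BETA))
```

Since this single alignment carries only about 0.57% of the summed likelihood (the ratio above), counts taken from it describe one path among many rather than the whole posterior.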
17 Likelihood Surface (From Hein, Wiuf, Knudsen, Møller & Wibling 2000)
18 Homology Test
$W_{i,j} = -\ln\big(\pi_i\,P^{2.5}_{i,j} / (\pi_i\,\pi_j)\big)$
D(s1,s2) for the real sequences is evaluated against the distribution of D(s1,s2) for randomised sequences.
Real:    s1 ATWYFCAK-AC
         s2 ETWYKCALLAD
Random:  s1 ATWYFC-AKAC
         s2 LTAYKADCWLE
This test:
1. Tests the competing hypotheses that 2 sequences are 2.5 events apart versus infinitely far apart.
2. Only handles substitutions correctly. The rationale for indel costs is more arbitrary.
3. Samples in (π_i π_j) by permuting the order of amino acids in the second sequence, i.e. it draws without replacement (a hypergeometric distribution).
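The permutation step of this test can be sketched in a few lines. Note the substitution made here: the slide scores pairs with the log-odds weights W, which need a substitution matrix that is not given, so the sketch uses a unit-cost alignment distance as a stand-in for D(s1,s2). The function names, the number of permutations and the seed are illustrative assumptions.

```python
import random

def distance(a, b, sub=1, indel=1):
    """Assumed stand-in for D(s1, s2): global alignment distance with unit costs."""
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        cur = [i * indel]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + sub * (x != y), prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev[-1]

def permutation_p_value(s1, s2, n_perm=999, seed=0):
    """Share of shuffled versions of s2 that align to s1 at least as well as s2 does."""
    rng = random.Random(seed)
    observed = distance(s1, s2)
    letters = list(s2)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(letters)  # drawing without replacement keeps the composition of s2
        if distance(s1, "".join(letters)) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

print(permutation_p_value("ATWYFCAKAC", "ETWYKCALLAD"))
```

Adding 1 to both the numerator and the denominator counts the observed pair as one of the permutations, so the estimate is never exactly zero.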
19 α-, myoglobin homology test (From Hein, Wiuf, Knudsen, Møller & Wibling 2000)
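20 Algorithm for alignment on star tree (O(length^6)) (Steel & Hein, 2001)
[Figure: star tree with ancestor a, whose length is governed by λ/μ, and leaves s1, s2, s3. Example sequences shown: ACGC, TT GT, ACG GT.]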
21 Binary Tree Problem
[Figure: binary tree with internal nodes a1, a2 and leaves s1, s2, s3, s4. Example sequences shown: TGA, ACCT, GTT, ACG.]
22 Binary Tree Problem
The problem would be simpler if:
- i. The ancestral sequences & their alignment were known.
- ii. The alignment of ancestral alignment columns to leaf sequences was known.
A Markov chain generating ancestral alignments can solve the problem!!
23 Markov Chains Generating the p-functions
[Figure: three small Markov chains with end state E: an ancestral sequence generator, a p function generator and a p/p function generator. Transition labels shown: λβ, 1-λβ, 1-μβ, μβ.]
24 Generating Ancestral Alignments.
[Figure: Markov chain over alignment columns of a1 and a2 with end state E. The transitions out of a state carry the probabilities
$\lambda\beta$
$(\lambda/\mu)\,(1-\lambda\beta)\,e^{-\mu}$
$(\lambda/\mu)\,(1-\lambda\beta)\,(1-e^{-\mu})$
$(1-\lambda/\mu)\,(1-\lambda\beta)$
which sum to 1.]
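The four transition probabilities on this slide form a complete distribution, which is easy to confirm numerically. In the sketch below the dictionary keys are my reading of what each transition does, not labels from the slide, and the time `t` is an assumption: the slide writes $e^{-\mu}$, which corresponds to t = 1.

```python
import math

def ancestral_step_probabilities(lam, mu, t=1.0):
    """The four transition probabilities; key names are an interpretation, not from the slide."""
    e = math.exp((lam - mu) * t)
    beta = (1 - e) / (mu - lam * e)
    lb, survive = lam * beta, math.exp(-mu * t)
    return {
        "insertion in descendant": lb,
        "new ancestral link, survives": (lam / mu) * (1 - lb) * survive,
        "new ancestral link, dies": (lam / mu) * (1 - lb) * (1 - survive),
        "end": (1 - lam / mu) * (1 - lb),
    }

probs = ancestral_step_probabilities(lam=0.03, mu=0.04)
print(probs, sum(probs.values()))
```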
25 The Basic Recursion
Remove 1st step - recursion
[Figure: path from start state S to end state E.]
Remove last step - recursion
26 4-Sequence Recursion II: First Step Removal
$P_a(S_k)$: probability of the epifixes ($S[k+1:l]$), given that the Markov chain starts in a.
Pa(Sk)
Where P(kS i,H??)
F(kSi,H)
27 Example 4 globins
log-likelihood: -1593.223
28 Example 4 globins
29 O(l^k) algorithm for k sequences
Two Approaches:
- Use geometric tails of p-functions & suitable rearrangements.
- Make an ancestral Markov chain for the leaves as well.
30 Contrasting Probability & Distance Recursions
Probability: O(l^{2k}); O(l^k) possible
Distance (Sankoff, 1973): O(l^k)
[Figure: one alignment column of 4 sequences (A, C, -, A); 15 cases.]
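The "15 cases" can be read as the number of ways the final alignment column can be occupied when there are four sequences: each sequence either contributes a letter or a gap, and the all-gap column is excluded, giving 2^4 - 1. That reading is an assumption on my part. The sketch below enumerates the patterns for any k.

```python
from itertools import product

def column_patterns(k):
    """Occupancy patterns of one alignment column: 1 = letter, 0 = gap, all-gap excluded."""
    return [pattern for pattern in product((0, 1), repeat=k) if any(pattern)]

for pattern in column_patterns(4):
    print(pattern)
print(len(column_patterns(4)))  # 15
```

A distance recursion of the Sankoff type looks back over these 2^k - 1 predecessor cells for each of the l^k cells in the table.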
31 k ancestral sequence Markov Chain
State Space: the end state E and all connected tuples of ancestral nodes.
[Figure: tree of ancestral nodes a1, a2, a3, a4, a5, drawn twice with different nodes filled or gapped (-).]
32 k ancestral sequences: 2 Problems
1. Ambiguous Indel/Alignment relationship.
[Figure: ancestor a with descendants s1 and s2, and the corresponding alignment columns.]
2. Grandchildren before younger siblings.
[Figure: alignment columns for a, a1 and a2.]
33 Transition Probabilities between two k-ancestral states
[Figure: transitions among states numbered 0 to 7.]
34 Gibbs Samplers for Statistical Alignment
Holmes & Bruno (2001): Sampling ancestors to pairs.
Jensen & Hein (subm.): Sampling nodes adjacent to triples. Slower basic operation, faster mixing.
35 Work in Progress & Plans
State Reduction (Lunter, Song, Hein & Miklos)
Longer Insertion-Deletions (Miklos, Lunter, Holmes)
[Figure: the sequence A T C C G shown twice.]
Heterogeneity along Sequence (Skou, Hein, ..): HMM/SCFG like?
Acceleration & Implementation (Lunter & Song)
MCMC Methods (Ledet Jensen, Holmes, ...)
36 Statistical Alignment Summary
Motivation for statistical alignment:
- i. Data is sequences, not alignment!
- ii. The focus on alignments is exaggerated!!
Progress:
- Major accelerations for pairwise/multiple statistical alignment
- Longer insertion-deletion models
Challenges ahead:
- Position heterogeneity: HMM & SCFG analogues.
- Algorithms for large data sets (> 5 sequences): MCMC.
- Local alignment version
- Software ???
37 Acknowledgements (www.stats.ox.ac.uk/hein)
Pairwise (with Knudsen, Wiuf, Møller, Wibling): Simpler recursion. Computational acceleration.
Multiple:
- Star Tree (with M. Steel)
- Binary Tree (with C. Storm, Jens Ledet, Lunter, Miklos, Song, Holmes, ..)
- Gibbs Multiple Alignment (with Jens Ledet)
Articles & Manuscripts
1. Hein, J.J., C. Wiuf, B. Knudsen, M. Møller and G. Wibling (2000) Statistical Alignment: Computational Properties, Homology Testing and Goodness-of-Fit. J. Molecular Biology 302:265-279.
2. J.J. Hein (2001) A generalisation of the Thorne-Kishino-Felsenstein model of Statistical Alignment to k sequences related by a binary tree. Pac. Symp. Biocompu. 2001, p179-190 (eds R.B. Altman et al.).
3. Steel, M. & J.J. Hein (2001) A generalisation of the Thorne-Kishino-Felsenstein model of Statistical Alignment to k sequences related by a star tree. (Letters in Applied Mathematics)
4. J.J. Hein, J.L. Jensen, C. Pedersen (2002) Algorithms for Multiple Statistical Alignment. (submitted to PNAS)
5. J.L. Jensen & J.J. Hein (2002) A Gibbs Sampler for Multiple Statistical Alignment. (submitted, Statistical Journal)
6. Lunter, Song, Miklos & Hein (2002) (in press, J. Com. Biol.)
7. Lunter, Song, Hein (2003) (in prep.)
8. Miklos, Lunter & Holmes (2002) (in press, MBE)
9. Miklos, I. & Z. Toroczkai (2001) An improved model for statistical alignment, in WABI 2001, Lecture Notes in Computer Science (O. Gascuel & B.M.E. Moret, eds) 2149:1-10. Springer, Berlin.
10. Miklos, I. (2002) An improved algorithm for statistical alignment of sequences related by a star tree. Bul. Math. Biol. 64:771-779.
11. Miklos, I. (2002) Algorithm for statistical alignment of sequences derived from a Poisson sequence length distribution. Disc. Appl. Math., accepted.
12. Holmes, I. & W. Bruno (2001) Evolutionary HMMs: A Bayesian Approach to Multiple Alignment. Bioinformatics 17(9):803-820.