Title: Chapter 4: Local motif finding and Global Multiple Alignment
1 Chapter 4 Local (motif finding) and Global
Multiple Alignment
2- When a problem is too hard, make it easy by
finding the right problem.
31st Topic (Global) Multiple Alignment
- To do phylogenetic analysis
- Same protein from different species
- Optimal multiple alignment probably implies
history - Discover irregularities, such as Systic Fibrosis
(CFTR) gene just one gap (for one codon
deletion) - To find conserved regions
- Local multiple alignment reveals conserved
regions - Conserved regions usually are key functional
regions - These regions are prime targets for drug
developments
4Definitions
- Given sequences s1, sn, multiple alignment M
puts them in n rows, one sequence per row, with
spaces inserted to get supersequences S1, , Sn,
SiL. - caaccca
- ca cccc
- ca cccg
- ca ccct
- SP-Alignment minimize sum of dH(Si,Sj) over all
pairs i, j. - minM Si?j dH(Si,Sj)
- Star-Alignment (also called Consensus Alignment)
find center sequence S, minimize sum of dH(S,Si)
over all i. - minS Si1,..,n dH(S,Si)
- Three types of algorithms precise, heuristic,
approximate - Approximation ratio for minimization problems r
cappr / copt. If r1e for any e, then it is an
approximation scheme. If this is in polynomial
time for each fixed e, it is PTAS (poly. time
appr. sch.) .
5Dynamic Programming for Multiple Alignment
- Given Aa1an, Bb1bn, Cc1cn
- S(i,j,k)
- max S(i-1,j-1,k-1)d(ai,bj,ck)
- S(i-1,j-1,k) d(ai,bj,-)
- S(i-1,j,k-1) d(ai,-,ck)
- S(i,j-1,k-1) d(-,bj,ck)
- S(i-1,j,k) d(ai,-,-)
- S(i,j-1,k) d(-,bj,-)
- S(i,j,k-1) d(-,-,ck)
6CLUSTAL-W
- Standard popular software
- It does multiple alignment as follows
- Align 2
- Repeat keep on adding a new sequence to the
alignment until no more, using tree-like
heuristics. - Problem It is simply a heuristics.
- Alternative dynamic programming nk for k
sequences. This is too slow. - We need to understand the problem and solve it
right.
7Making the Problem Simpler!
- Multiple alignment is hard
- For k sequences, nk time, by dynamic programming
- NP hard in general, not clear how to approximate
- Popular practice -- alignment within a band the
p-th letter in one sequence is not more than c
places away from the p-th letter in another
sequence in the final alignment the alignment
is along a diagonal bandwidth 2c. - Used in final stage of FASTA program.
8Literature
- NP hardness under various models Wang-Jiang
(JCB), Li-Ma-Wang (STOC99), W. Just - Approximation results Gusfield (2- 1/L), Bafna,
Lawler, Pevzner (CPM94, 2-k/L), star alignment. - Sankoff, Kruskal discussed within a band
- Pearson showed alignment within a band gives very
good results for a lot protein superfamilies. - Altschul and Lipman, Chao-Pearson-Miller,
Fickett, Ukkonen, Spouge (survey) all have
studied alignment within a band. - Li, Ma, Wang (STOC, 2000), approximation scheme
within a band (and show it is still NP-hard).
9Alignment within a c-band
- SP-Alignment
- NP hard (even 0-gap)
- PTAS
- PTAS for constant number of insertion/deletion
gaps per sequence on average (for coding regions,
this assumption makes a lot of sense)
- Star-Alignment
- NP-hard (even 0-gap)
- PTAS
- PTAS for constant number of insertion/deletion
gaps per sequence on average
10Star Alignment, constant gaps
- We first consider the simpler Consensus Pattern
Problem Given a set S s1, s2, sn of
sequences each of length m, and an integer L,
find a median string s of length L and a
substring ti (consensus patterns) of length L
from each si, minimizing Si1..n dH(s, ti), where
dH(x,y) is Hamming distance between x,y. - The use of Hamming distance is for convenience
only - Using the consensusPattern algorithm, we will
design PTAS for star alignment with constant
number of gaps. - Then using the same idea we extend the alg. to SP
alignment.
11Idea Sampling r substrings
Consensus t1, tn
If we get r of those tis
j
j
We also expect this column has k percent as
If this column has k percent as
12- Chernoffs Bound Consider n independent random
trails with success probability p. Let m be
number of successes, then - (1) P(m-np en) ee2n/3p
- Lemma. Let h(a) be the number of occurrences of
letter a in a1, , an. Let a be a majority.
Randomly pick r of them, let ar be the majority
among these r letters. Then Eh(a)-h(ar)O(log
r /vr)(n-h(a)) - Proof. Consider picking a from a1 ... an in r
independent trails with success probability
pah(a)/n. Let ma be the number of successes of
getting a. Define S to be the set of as whose
frequencies are close to the majority a, i.e. - Sa rpa -rpa lt 2vr log r 1 .
- (2) h(a)gth(a)- (2nlogr)/vr ,
for all a in S. - Thus E(h(ar)) Sall aPr(ara)h(a)
- Sa in S Pr(ara)h(a) Sa not
in SPr(ara)h(a) - first term is large second
term is tiny/unlikely - Sa in S Pr(ara)h(a) //
deleted second term - Pr(marpa -vr log
r)Pr(marpavr log r for all a not in S) - (h(a) (2n log r) / vr)
- (1 e-O(log r))(1 e-O(log
r))(h(a) 2nlog r / vr ) by Chernoff bound 1 - (1 O(1/r))(h(a) (2nlog r)
/vr) - Thus E(h(a) h(ar)) h(a) E(h(ar))
O(log r / vr)(n h(a), assuming h(a)lt3n/4.
13Algorithm consensusPattern
- Algorithm consensusPattern
- Input n sequences Ss1, ,sn, integers L and
r. - Output the median string and consensus patterns
- for every r length-L substrings u1,u2, ur
from S do - (a) Set u to be the column-wise majority
string of uis. - (b) for i1,2, , n do
- Set ti to be the substring of length L
from si that is - closest to u
- (c) Let c(u)Si1..n dH(u,ti)
- Output the u and the corresponding t1,t2, tn
minimizing c(u)
14Theorem. consensusPattern is achieves 1O(1/vr)
approximation ratio in nO(r) time.
- Proof. Suppose s and t1, , tn are an optimal
solution. Obviously s is the column-wise
majority of t1 tn. Let coptSi1..ndH(s,ti).
Let R contain any r indices in 1,,n and sR be
the column-wise majority of tjs for jeR and cR
be the corresponding distance. Let hj(a) be
number of tis such that tija. Then we have - Si1..ndH(ti,s)Sj1..L(n-hj(sj))
() - Thus
- cR copt Si1..ndH(ti,sR)
Si1...ndH(ti,s) Sj1Lhj(sj) hj(sRj) - The average value of cR-copt over all choices R
is - n-r Sall R cR copt n-r Sall R
Sj1Lhj(sj)-hj(sRj) - Sj1Ln-r Sall
Rhj(sj)-hj(sRj) - O(1/vr)
Sj1L(n-hj(sj)) by Lemma - O(1/vr) copt
by () - Hence, there must be R of r indices such that
- cR copt O(1/vr) copt
15Star-alignment
- Notation in an alignment, a block of inserted
--- is called a gap. If a multiple alignment
has c gaps on average for each sequence, we call
it average c-gap alignment. - We first design a PTAS for the average c-gap SP
alignment. - Then using the PTAS for the average c-gap SP
alignment, we design a PTAS for SP-alignment
within a band.
16Star Alignment with c gaps
- StarAlign
- Input Ss1, ,sn, sim
- Output a median string s.
- for every set of r si in S do
- for every alignment of these r sequences, with
no more than cr gaps in each sequence, each gap
size lt crm, do - (a) Find column-wise majority to form
median s. - (b) Compute dSi1..ndE(s,si).
- Output s which minimizes d.
17Average c-gap SP Alignment
- Key Idea choose r representative sequences, we
find their correct alignment in the optimal
alignment, by exhaustive search. Then we use this
alignment as reference. - Then we align every other sequence against this
alignment. - Then choose the best.
- All we have to show is that there are r sequences
whose letter frequencies in each column of their
alignment approximates the complete alignment ---
like the Star Alignment
18Sampling r sequences
Complete Alignment
Alignment with r sequences
j
j
We also expect this column has k percent as
If this column has k percent as
19AverageSPAlign
- Input Ss1,,sn, sim, integers c,l,r
- Output a multiple alignment M
- for Lm to nm
- for any r sequences in S
- for any possible alignment M (of r seq.)
of length L - and with no more than cl gaps in each
sequence - align all other sequences to M //one
alignment -
- Output the best alignment.
20SP Alignment within c-Band
- Basic Idea
- Dynamically cut seq-uences into segments
- Each segment satisfies the average c-gap
condition. Hence use previous algorithm - Then assemble the segments together
- Divide-Conquer.
Cutting these sequences into 6 segments, each
segment has c-gaps per sequence on average in
optimal alignment.
21The final algorithm diagonalAlign
- while (not finished)
- find a maximum prefix for each sequence (same
length) such that AverageSPAlign returns low
cost. Keep the multiple alignment for this
segment -
- Concatenate the multiple alignments for all
segments together to as final alignment.
222nd Topic Motif Finding (Local Multiple
Alignment)
- Finding motifs/conserved regions in proteins is
important in drug design and proteomics. - We have already solved one version consensus
pattern problem. - We will present 3 different methods.
- (1) Heuristics Gibbs Sampling method.
- (2) Approximation algorithm
- (3) LP relaxation.
23Many versions
- Input Ss1, ,sn
- Consensus Pattern. Find s, substring ti of si,
minimizing Si1..ndH(s,ti). - General Consensus Pattern in above, max
Sj1..Lcj, where cjSj(a)logj(a), where j(a)
is frequency of a in j-th column. - Closest string Find s, minimizing d s.t.
dH(s,si)d, for all i. - Closest substring find s, ti of si, minimizing d
s.t. dH(s,ti)d. - All have been solved with PTAS.
24Given k protein sequences, find a conserved
region
L
K sequences
Red regions are conserved regions, or,
motifs. The dont have to be exactly same, they
match with higher scores than other regions.
251. Gibbs Sampling Algorithm
- Many programs exist. Some use HMM. Most popular
perhaps is the Gibbs sampling method by Lawrence
et al - Input, S1, , Sm
- Randomly choose a word wi from each Si
- Randomly choose r
- Create a frequency matrix (qij) from the m-1
words not from Sr. - For each position in Sr, compute the probability
a word fits the frequency matrix (qij), and use
that word with the highest probability, repeat. - Works well, no guarantee.
262. PTAS
- Input S1, , Sm, integer L.
- Output t1, ,tm, tiL (motifs)
- For every r length L substring, compute the
consensus (under different metric) M of them. - In all strings, find substring of length L
closest to M. - Choose the best.
- Theorem This algorithm outputs a solution no
worse than 11/sqrt(r) optimal. Time complexity
is (nm)O(r). - Has guarantee, slow.
273. Linear ProgrammingRelaxation(Reference book
Randomized algorithms, Motwani, Raghavan)
- We now introduce a new technique.
- For simplicity, lets consider a simplified
problem. Given a set Ss1, sn sequences, each
of length n, find string t that is far away from
all si in S find t and largest d such that
dH(t,si) gt d. - Formulating an LP problem. If binary strings xx1
xn and sis1, sn, then dH(x,si)x1 x2
xn, where xj xj if sj0, xj 1-xj if sj1,
now we have an integer program -
- max d
- Sj1..n aijxjgtCid for each si, aij0,1,-1
Ci is from Sxi - xi 0, 1
28LP Relaxation
- Integer Program (as we have just described) is
NP-hard. So it is hard to solve it directly. - However, Linear Program is polynomial time
solvable. So we relax our requirement that
solutions x1, xn have to be 0,1. Instead we may
use 0?xi?1. - Then randomized rounding --- convert our fraction
solution to integer solution by set xi1 with
probability xi, xi0 with probability 1-xi. - What does this give us? Well, for each Sj1..n
aijxj Ci, after randomized rounding, we lose
about logn, if d is O(n), we are within d-logd
with high probability. When d is small, we use
other simpler techniques. Thus we have a very
good approximation. - Similar idea, but much more involved, applies to
the closest (sub)string problems.
29Project Topics
- Combining the PTAS approach with the background
model studied by SinhaTompa - Improve running time for approximation
algorithms. Implement a real system. - Simplify SP alignment proof using Chernoff bound
- Compare with Pevzner-Sze Combinatorial
Approaches to Finding Subtle Signals in DNA
Sequences. ISMB 2000 269-278, and the projection
approach of Buhler-Tompa.
http//www.cs.washington.edu/homes/tompa/papers/tf
bs.doc
http//www.cs.washington.edu/homes/tompa/papers/pr
oj.ps
30Open Questions
- The PTAS guarantee to converge, but slow. The
heuristics are fast, but with no convergence
guarantee. - Open Problem Design more practical efficient
algorithms for these local/global alignment
problems, possibly under more reasonable
restrictions. It is important such a program
works with huge input sizes.