Chapter 4: Local motif finding and Global Multiple Alignment - PowerPoint PPT Presentation

1 / 30

About This Presentation

Title:

Chapter 4: Local motif finding and Global Multiple Alignment

Description:

... (motif finding) and Global Multiple Alignment ... multiple alignment probably implies ... Pearson showed alignment within a band gives very good results for ... – PowerPoint PPT presentation

Number of Views:58

Avg rating:3.0/5.0

Slides: 31

Provided by: min87

Category:

more less

Transcript and Presenter's Notes

Title: Chapter 4: Local motif finding and Global Multiple Alignment

1
Chapter 4 Local (motif finding) and Global
Multiple Alignment
2

When a problem is too hard, make it easy by
finding the right problem.

3
1st Topic (Global) Multiple Alignment

To do phylogenetic analysis
Same protein from different species
Optimal multiple alignment probably implies
history
Discover irregularities, such as Systic Fibrosis
(CFTR) gene just one gap (for one codon
deletion)
To find conserved regions
Local multiple alignment reveals conserved
regions
Conserved regions usually are key functional
regions
These regions are prime targets for drug
developments

4
Definitions

Given sequences s1, sn, multiple alignment M
puts them in n rows, one sequence per row, with
spaces inserted to get supersequences S1, , Sn,
SiL.
caaccca
ca cccc
ca cccg
ca ccct
SP-Alignment minimize sum of dH(Si,Sj) over all
pairs i, j.
minM Si?j dH(Si,Sj)
Star-Alignment (also called Consensus Alignment)
find center sequence S, minimize sum of dH(S,Si)
over all i.
minS Si1,..,n dH(S,Si)
Three types of algorithms precise, heuristic,
approximate
Approximation ratio for minimization problems r
cappr / copt. If r1e for any e, then it is an
approximation scheme. If this is in polynomial
time for each fixed e, it is PTAS (poly. time
appr. sch.) .

5
Dynamic Programming for Multiple Alignment

Given Aa1an, Bb1bn, Cc1cn
S(i,j,k)
max S(i-1,j-1,k-1)d(ai,bj,ck)
S(i-1,j-1,k) d(ai,bj,-)
S(i-1,j,k-1) d(ai,-,ck)
S(i,j-1,k-1) d(-,bj,ck)
S(i-1,j,k) d(ai,-,-)
S(i,j-1,k) d(-,bj,-)
S(i,j,k-1) d(-,-,ck)

6
CLUSTAL-W

Standard popular software
It does multiple alignment as follows
Align 2
Repeat keep on adding a new sequence to the
alignment until no more, using tree-like
heuristics.
Problem It is simply a heuristics.
Alternative dynamic programming nk for k
sequences. This is too slow.
We need to understand the problem and solve it
right.

7
Making the Problem Simpler!

Multiple alignment is hard
For k sequences, nk time, by dynamic programming
NP hard in general, not clear how to approximate
Popular practice -- alignment within a band the
p-th letter in one sequence is not more than c
places away from the p-th letter in another
sequence in the final alignment the alignment
is along a diagonal bandwidth 2c.
Used in final stage of FASTA program.

8
Literature

NP hardness under various models Wang-Jiang
(JCB), Li-Ma-Wang (STOC99), W. Just
Approximation results Gusfield (2- 1/L), Bafna,
Lawler, Pevzner (CPM94, 2-k/L), star alignment.
Sankoff, Kruskal discussed within a band
Pearson showed alignment within a band gives very
good results for a lot protein superfamilies.
Altschul and Lipman, Chao-Pearson-Miller,
Fickett, Ukkonen, Spouge (survey) all have
studied alignment within a band.
Li, Ma, Wang (STOC, 2000), approximation scheme
within a band (and show it is still NP-hard).

9
Alignment within a c-band

SP-Alignment
NP hard (even 0-gap)
PTAS
PTAS for constant number of insertion/deletion
gaps per sequence on average (for coding regions,
this assumption makes a lot of sense)

Star-Alignment
NP-hard (even 0-gap)
PTAS
PTAS for constant number of insertion/deletion
gaps per sequence on average

10
Star Alignment, constant gaps

We first consider the simpler Consensus Pattern
Problem Given a set S s1, s2, sn of
sequences each of length m, and an integer L,
find a median string s of length L and a
substring ti (consensus patterns) of length L
from each si, minimizing Si1..n dH(s, ti), where
dH(x,y) is Hamming distance between x,y.
The use of Hamming distance is for convenience
only
Using the consensusPattern algorithm, we will
design PTAS for star alignment with constant
number of gaps.
Then using the same idea we extend the alg. to SP
alignment.

11
Idea Sampling r substrings
Consensus t1, tn
If we get r of those tis
j
j
We also expect this column has k percent as
If this column has k percent as
12

Chernoffs Bound Consider n independent random
trails with success probability p. Let m be
number of successes, then
(1) P(m-np en) ee2n/3p
Lemma. Let h(a) be the number of occurrences of
letter a in a1, , an. Let a be a majority.
Randomly pick r of them, let ar be the majority
among these r letters. Then Eh(a)-h(ar)O(log
r /vr)(n-h(a))
Proof. Consider picking a from a1 ... an in r
independent trails with success probability
pah(a)/n. Let ma be the number of successes of
getting a. Define S to be the set of as whose
frequencies are close to the majority a, i.e.
Sa rpa -rpa lt 2vr log r 1 .
(2) h(a)gth(a)- (2nlogr)/vr ,
for all a in S.
Thus E(h(ar)) Sall aPr(ara)h(a)
Sa in S Pr(ara)h(a) Sa not
in SPr(ara)h(a)
first term is large second
term is tiny/unlikely
Sa in S Pr(ara)h(a) //
deleted second term
Pr(marpa -vr log
r)Pr(marpavr log r for all a not in S)
(h(a) (2n log r) / vr)
(1 e-O(log r))(1 e-O(log
r))(h(a) 2nlog r / vr ) by Chernoff bound 1
(1 O(1/r))(h(a) (2nlog r)
/vr)
Thus E(h(a) h(ar)) h(a) E(h(ar))
O(log r / vr)(n h(a), assuming h(a)lt3n/4.

13
Algorithm consensusPattern

Algorithm consensusPattern
Input n sequences Ss1, ,sn, integers L and
r.
Output the median string and consensus patterns
for every r length-L substrings u1,u2, ur
from S do
(a) Set u to be the column-wise majority
string of uis.
(b) for i1,2, , n do
Set ti to be the substring of length L
from si that is
closest to u
(c) Let c(u)Si1..n dH(u,ti)
Output the u and the corresponding t1,t2, tn
minimizing c(u)

14
Theorem. consensusPattern is achieves 1O(1/vr)
approximation ratio in nO(r) time.

Proof. Suppose s and t1, , tn are an optimal
solution. Obviously s is the column-wise
majority of t1 tn. Let coptSi1..ndH(s,ti).
Let R contain any r indices in 1,,n and sR be
the column-wise majority of tjs for jeR and cR
be the corresponding distance. Let hj(a) be
number of tis such that tija. Then we have
Si1..ndH(ti,s)Sj1..L(n-hj(sj))
()
Thus
cR copt Si1..ndH(ti,sR)
Si1...ndH(ti,s) Sj1Lhj(sj) hj(sRj)
The average value of cR-copt over all choices R
is
n-r Sall R cR copt n-r Sall R
Sj1Lhj(sj)-hj(sRj)
Sj1Ln-r Sall
Rhj(sj)-hj(sRj)
O(1/vr)
Sj1L(n-hj(sj)) by Lemma
O(1/vr) copt
by ()
Hence, there must be R of r indices such that
cR copt O(1/vr) copt

15
Star-alignment

Notation in an alignment, a block of inserted
--- is called a gap. If a multiple alignment
has c gaps on average for each sequence, we call
it average c-gap alignment.
We first design a PTAS for the average c-gap SP
alignment.
Then using the PTAS for the average c-gap SP
alignment, we design a PTAS for SP-alignment
within a band.

16
Star Alignment with c gaps

StarAlign
Input Ss1, ,sn, sim
Output a median string s.
for every set of r si in S do
for every alignment of these r sequences, with
no more than cr gaps in each sequence, each gap
size lt crm, do
(a) Find column-wise majority to form
median s.
(b) Compute dSi1..ndE(s,si).
Output s which minimizes d.

17
Average c-gap SP Alignment

Key Idea choose r representative sequences, we
find their correct alignment in the optimal
alignment, by exhaustive search. Then we use this
alignment as reference.
Then we align every other sequence against this
alignment.
Then choose the best.
All we have to show is that there are r sequences
whose letter frequencies in each column of their
alignment approximates the complete alignment ---
like the Star Alignment

18
Sampling r sequences
Complete Alignment
Alignment with r sequences
j
j
We also expect this column has k percent as
If this column has k percent as
19
AverageSPAlign

Input Ss1,,sn, sim, integers c,l,r
Output a multiple alignment M
for Lm to nm
for any r sequences in S
for any possible alignment M (of r seq.)
of length L
and with no more than cl gaps in each
sequence
align all other sequences to M //one
alignment
Output the best alignment.

20
SP Alignment within c-Band

Basic Idea
Dynamically cut seq-uences into segments
Each segment satisfies the average c-gap
condition. Hence use previous algorithm
Then assemble the segments together
Divide-Conquer.

Cutting these sequences into 6 segments, each
segment has c-gaps per sequence on average in
optimal alignment.
21
The final algorithm diagonalAlign

while (not finished)
find a maximum prefix for each sequence (same
length) such that AverageSPAlign returns low
cost. Keep the multiple alignment for this
segment
Concatenate the multiple alignments for all
segments together to as final alignment.

22
2nd Topic Motif Finding (Local Multiple
Alignment)

Finding motifs/conserved regions in proteins is
important in drug design and proteomics.
We have already solved one version consensus
pattern problem.
We will present 3 different methods.
(1) Heuristics Gibbs Sampling method.
(2) Approximation algorithm
(3) LP relaxation.

23
Many versions

Input Ss1, ,sn
Consensus Pattern. Find s, substring ti of si,
minimizing Si1..ndH(s,ti).
General Consensus Pattern in above, max
Sj1..Lcj, where cjSj(a)logj(a), where j(a)
is frequency of a in j-th column.
Closest string Find s, minimizing d s.t.
dH(s,si)d, for all i.
Closest substring find s, ti of si, minimizing d
s.t. dH(s,ti)d.
All have been solved with PTAS.

24
Given k protein sequences, find a conserved
region
L
K sequences
Red regions are conserved regions, or,
motifs. The dont have to be exactly same, they
match with higher scores than other regions.
25
1. Gibbs Sampling Algorithm

Many programs exist. Some use HMM. Most popular
perhaps is the Gibbs sampling method by Lawrence
et al
Input, S1, , Sm
Randomly choose a word wi from each Si
Randomly choose r
Create a frequency matrix (qij) from the m-1
words not from Sr.
For each position in Sr, compute the probability
a word fits the frequency matrix (qij), and use
that word with the highest probability, repeat.
Works well, no guarantee.

26
2. PTAS

Input S1, , Sm, integer L.
Output t1, ,tm, tiL (motifs)
For every r length L substring, compute the
consensus (under different metric) M of them.
In all strings, find substring of length L
closest to M.
Choose the best.
Theorem This algorithm outputs a solution no
worse than 11/sqrt(r) optimal. Time complexity
is (nm)O(r).
Has guarantee, slow.

27
3. Linear ProgrammingRelaxation(Reference book
Randomized algorithms, Motwani, Raghavan)

We now introduce a new technique.
For simplicity, lets consider a simplified
problem. Given a set Ss1, sn sequences, each
of length n, find string t that is far away from
all si in S find t and largest d such that
dH(t,si) gt d.
Formulating an LP problem. If binary strings xx1
xn and sis1, sn, then dH(x,si)x1 x2
xn, where xj xj if sj0, xj 1-xj if sj1,
now we have an integer program
max d
Sj1..n aijxjgtCid for each si, aij0,1,-1
Ci is from Sxi
xi 0, 1

28
LP Relaxation

Integer Program (as we have just described) is
NP-hard. So it is hard to solve it directly.
However, Linear Program is polynomial time
solvable. So we relax our requirement that
solutions x1, xn have to be 0,1. Instead we may
use 0?xi?1.
Then randomized rounding --- convert our fraction
solution to integer solution by set xi1 with
probability xi, xi0 with probability 1-xi.
What does this give us? Well, for each Sj1..n
aijxj Ci, after randomized rounding, we lose
about logn, if d is O(n), we are within d-logd
with high probability. When d is small, we use
other simpler techniques. Thus we have a very
good approximation.
Similar idea, but much more involved, applies to
the closest (sub)string problems.

29
Project Topics

Combining the PTAS approach with the background
model studied by SinhaTompa
Improve running time for approximation
algorithms. Implement a real system.
Simplify SP alignment proof using Chernoff bound
Compare with Pevzner-Sze Combinatorial
Approaches to Finding Subtle Signals in DNA
Sequences. ISMB 2000 269-278, and the projection
approach of Buhler-Tompa.

http//www.cs.washington.edu/homes/tompa/papers/tf
bs.doc
http//www.cs.washington.edu/homes/tompa/papers/pr
oj.ps
30
Open Questions

The PTAS guarantee to converge, but slow. The
heuristics are fast, but with no convergence
guarantee.
Open Problem Design more practical efficient
algorithms for these local/global alignment
problems, possibly under more reasonable
restrictions. It is important such a program
works with huge input sizes.

Write a Comment

User Comments (0)