Title: Introduction to Real-Time Systems
1Randomized Algorithm for Motif Detection
Lusheng Wang Dept. of Computer
Science City University of Hong Kong Web
http//www.cs.cityu.edu.hk/lwang e-mail
lwang_at_cs.cityu.edu.hk
2 Outline
- Review our theoretical results
- A software for motif detection
- Other problems we have been working
- --Parametric tools RNA secondary structure
comparison - --Computing translocation distance between
genomes
3 Distinguishing Substring Selection Problem
Instance Given a set B of n1 (bad) strings of
length at least L, a set G of
n2 (good) strings, two thresholds db
and dg ( db lt dg). Objective Find a
string s such that for each string b ? B there
exists a substring t of b with
length L such that d(s, t) ?
db and for any string g ? G,
d(s, g) ? dg.
Bad sequences
s motif
Good sequences
4 Applications
1. Finding Targets for Potential Drugs (T.
Jiang, C. Trendall, S, Wang, T. Wareham, X.
Zhang, 98) (K. Lanctot, M. Li, B. Ma, S.
Wang, and L. Zhang 1999) -- bad strings in B
are from Bacteria. --good strings are from
Humans -- find a substring s of length L that
is conserved in all bad strings, but not
conserved in good strings. -- use s to
screen chemicals -- those selected chemicals
can then be tested as potential
broad-range antibiotics.
5 Applications
2. Creating Diagnostic Probes for Bacterial
Infection (T. Brown, G.A. Leonard, E.D.
Booth, G. Kneale, 1990) -- a group of
closely related pathogenic bacteria -- find
a substring that occurs in each of the bacterial
sequences (with as few substitutions as
possible) and does not occurs in the
host (Human)
6 Applications
3. Creating Universal PCR Primers 4. Creating
Unbiased Consensus Sequences 5. Anti-sense Drug
Design 6. In coding theory (computer science)
7 Distinguishing String Selection Problem
Instance Given a set B of n1 (bad) strings of
length L, a set G of n2
(good) strings of length L, two thresholds db and
dg ( db lt dg). Objective
Find a string s such that for each string b ? B,
d(s, b) ? db
and for any string g ? G, d(s,
g) ? dg.
Bad strings
Center string
Good Strings
8 Closest String Problem
Instance Given a set B of n strings of
length L, and an integer
d Objective Find a string s of length L such
that for each string si ?
B, d(s, si) ? d Theorem
Closest string problem is NP-hard. (Lanctot, Li,
Ma, Wang, Zhang, 1999)
9 Closest Substring Problem
Instance Given a set B of n strings of
length at least L, and an
integer d Objective Find a string s of length L
such that for each string
si ? B, there is a substring ti of si (with
length L) such that d(s, si)
? d Theorem Closest substring
problem is NP-hard. (K.
Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang
1999) A PTAS was given in M. Li, B. Ma, and
L. Wang 2000
10 Consensus Pattern
Instance Given a set B of n strings of
different lengths, and an
integer L, Objective Find a string s of
length L and for each string si ? B, find a
substring ti of si such that
n ? d(s, ti)
i1 is minimized.
Theorem The consensus pattern problem
is NP-hard. There is a PTAs for consensus
pattern. (M. Li, B. Ma, and L. Wang, 1999)
11 Farthest String Problem (the other half)
Instance Given a set G of n strings of
length L, and an integer
d Objective Find a string s of length L such
that for each string si ? G,
d(s, si) ? d. Theorem Farthest String Problem is
NP-hard. (K. Lanctot, M. Li, B.
Ma, S. Wang, and L. Zhang, 1999) A PTAS -- A
linear programming random rounding
d is very large, i.e, O(L).
(K. Lanctot, M. Li, B. Ma,
S. Wang, and L. Zhang 1999)
12 Computer Science Background Knowledge
NP-complete Problems unlikely to have a
polynomial time algorithm to optimally solve the
problem A- an approximation algorithm for a
minimization problem. performance ratio a
number r such that for any instance I
A(I)/OPT(I) ? r A(I)
- the cost of the solution for instance I
produced by A OPT(I) - the cost of an optimal
solution for instance I. approximation scheme
an algorithm A with input I
and an error bound ?
that has
the performance ratio 1 ?. a family of
algorithms A, one for each ?.
13 Computer Science Background Knowledge
Polynomial time approximation scheme (PTAS)
an approximation scheme A? that runs in time
polynomial in the size of the instance I, for
any fixed ?. MAX SNP-hard unlikely to have
a PTAS.
14 Summary of our theoretical results
- Closest string problem A PTAS (M. Li, B. Ma, and
L. Wang, STOC99) - Closest substring Problem A PTAS (Li, Ma, Wang,
2002, JACM) - Consensus pattern problem NP-hard and A PTAS (
Li, . Ma, and Wang, JCSS, 2002) - Distinguishing string Selection Problem PTAS
- Distinguishing Substring Selection Problem A
PTAS (Deng, Li, Li, MA and Wang, SIAM J. on
Computing, 2003)
15 Our Results
An approximation scheme that takes an ?gt0 as a
parameter and finds a string s such that for
each string b ? B there exists a substring t of
b with length L such that
d(s, t) ? (1?) db, and for any string g ? G,
d(s, g) ?(1- ?) dg.
16 Computation of the Closest String Problem
Theorem Closest string problem is NP-hard.
(K. Lanctot, M. Li, B. Ma, S.
Wang, and L. Zhang 1999) A 4/3(1?)
approximation
(K. Lanctot, M. Li, B. Ma, S. Wang, and L.
Zhang 1999) (L.
Gasieniec, J. Jansson, and A. Lingas 1999) A
PTAS was presented for the problem (M. Li, B.
Ma, and L. Wang, 1999)
17 Linear Programming Approach
The optimization problem
min d d(si, x)
? d, i 1, 2, , n.
18Linear Programming Approach
- The optimization problem
- min d
- ?a ? ?xj,a 1, j 1, 2 , , L
- ?1 ? j? L ?a ? ? ?(sij, a) xj,a ? d, i
1, 2, , n. - ? xj,a 0 -1 variable denoting whether xj
a - ? ?(sij, a) denote whether sij a
- ? x'j,a a fractional solution to the above LP
19 Linear Programming Approach (continued)
Randomized Rounding with probability x'j,a ,
set xj,a 1. Theorem For any ? gt 0, if L ? 4(ln
n)/?2, and d ?(L), then
d(si, x ) ? d(1?). ? When Llt 4(ln n)/?2
enumerate all possible strings of length L.
20 when d is small (d lt O(L))
Theorem Let 1? i1, i2, , ir ? n and
Qi1,i2,,ir be the set of positions
where si1, si2, , sir agree. There exist
indices 1? i1, i2, , ir ? n
such that for any sl j ?
Qi1,i2,,ir si1j ? slj and si1j ?sj ?
1/(2r-1) d, when max d(si,
sj) gt 11/(2r-1). Note that si1 and s can be
different in many bits in Qi1,i2,,ir . However,
the effect on each sl ? B is not bad.
21 The Algorithm
Step 1 choose the good indices 1? i1, i2, ,
ir ? n and compute Qi1,i2,,ir.
(Try all possible r-tuples) Step 2. Set P 1,
2, , L - Qi1,i2,,ir . Step 3. Use linear
programming approach to solve the problem on P.
(modify the inequalities a little
bit.) Step 4. Use si1 as the solution at
positions in Qi1,i2,,ir . Step 5. Combine the
two parts to get a string of length L.
22Old approximation algorithm for Consensus
Pattern
- Randomly select t sequences from the n given
sequences. - Try (m-L1)t different ways to get the
occurrences of the motif in the t selected
sequences. - (3) Use the t substrings to form an approximate
motif. - (4) Use the approximate motif to search all n
sequences and reconstruct the new motif based on
n substrings.
select t sequences
23New randomized algorithm for Consensus
Pattern (Wang and Dong, 2004)
(1) Randomly select k positions from the L
positions (2) Try 4 k possible consensus strings
of length k. (3) Use the partial consensus
strings to search all the given strings and find
a substring that is closest to the partial string
for each of the n given strings (4) Reconstruct
the center string (by choose the majority letter
for each column) (5) Repeat (1)-(4) several times
and choose the best output.
24Theorem Assume d(ti, s)O(L). With high
probability, the algorithm output a string s
with cost n
n ? d(ti,
s) lt (1?) ? d(ti,s), i1
i1 where s is
the optimal string, ti and ti are substrings in
si that may or may not be identical. The
probability is related to nm (total length of
strings) and k is related to ?, n and m.
25- It is expected that the algorithm might be
practical since it is a random algorithm. - Theoretical results shows that k is required to
be large in order to get good results. - Comparing with the best known software (Random
projection algorithm EM), the random algorithm
is much slower. - We try to combine our random algorithm with EM
- Exciting new findings
26- A challenging instance (L, d), e.g., (15, 4)
- 20 sequences over 4 letters (randomly
generated) - 600 letters in each sequence
- A Pattern was constructed as follows
- Given a center string s of length 15, we
randomly change 4 - positions and get 20 strings of length 4
(with 4 different letters each). - Put each of the 20 strings back to the 20
sequences of length 600 (the location is randomly
chosen) - Challenge find the inserted 20 strings
27The EM Algorithm
- Input
- a 4?L weight matrix W (the initial guess of the
motif) - n sequences S1,S2,...,Sn.
- Output
- new W that is a local maximal solution
- An array P indicating the probability that the
motif starts at every position of every
sequence.
A 0.25 0.0 C 0.25 1.0 G 0.25 0.0 T
0.25 0.0
28Step 1
- For each position x in each sequence Si,
calculate the probability that we observe this
sequence data if the motif starts from position
x - P(i,x)?j1 to L W(bj,j)
- bjSi(xj) the j-th base of the L-mer starting
at x in Si. - To avoid zero weights, a fixed small number is
added to W(i,j).
29Step 2
- For each L-mer in each sequence, normalize the
probability that this L-mer is an occurrence of
the motif as - m-L1
- P'(i,x)P(i,x) / ??x1 P(i,x)
- and replace P with P'.
- Its purpose is to guarantee that
- ?x1 to n-l1P'(i,x)1.
30Step 3
- Re-estimate the motif weight matrix W.
- W ?i1 n ?x1 m-L1 Wi,x
- Where
- Wi,x(b,j)P(i,x) if bSi(xj), else 0.
31Step 4
- Because the sum of each column of W should be
1.0, a normalization is applied to W. - W'(i,j) W(i,j)/??ba,c,g,tW(b,j).
- Replace W with W'.
32Step 5
- Steps 1 to 4 is called a cycle. If W changes very
little from last cycle, then EM converges and the
algorithm ends. Else, goto step 1 and start next
cycle. - How to determine the amount of change?
- Consider the two consecutive matrices Wq-1 and
Wq. - maxWq(i,j)-Wq-1(i,j)lt??
33An Example
- A 0.25 0.0 0.0
- C 0.25 1.0 0.0 s1ACT P10.25 ?1.0
?1.0 - G 0.25 0.0 0.0 s2CGG P20.25 ?? ??,
? is small, but gt0. - T 0.25 0.0 1.0
-
- A p1 0 0 A
0 0 0 - W1 C 0 p1 0 W2 C p2
0 0 - G 0 0 0 G
0 p2 p2 - T 0 0 p1 T
0 0 0
34EM with threshold
- In order to accelerate the EM algorithm, a
threshold is added to step 3. An L-mer takes part
in the re-estimation of the motif matrix only
when its likelihood P(i,x) is not less than the
average likelihood of all the L-mers in this
sequence. So step 3 is modified to - W??i1 to t?Wi,xP(i,x)??1/(n-l1)
35Benefits of the threshold
- First, each EM cycle takes less time than before,
because not every Wi,x needs to be summed. - Second, EM converges faster and more efficient,
because the positions we neglect are more likely
to be noise than to be meaningful motif
occurrences. The average number of cycles changes
from 6.0 to less than 4. And the probability that
EM finds the correct motif becomes larger.
36The time for each cycle
l (the motif length) 11 13
15 17 19 EM
22.4 25.5
28.6 31.5 35.2 EM with
threshold 16.1 18.2 20.0
22.0 23.8 Table 1
average cycle time (milliseconds)
20 sequences and 600 letters each
37The Whole Algorithm
- for trial1 to m do
- 2 choose k positions from L positions randomly
- 3 generate 4k starting points
- 4 for i1 to 4k do
- 5 refine the i-th starting point
- 6 if an (L,d) motif is found (error ltd) then
goto 7 - 7 report the best motif ever found.
38 m
1 3 5
Accuracy EM
45 76 84 ()
EM with threshold 58
91 97 Average time
EM 26
53 67 (seconds) EM
with threshold 9.5 16
17 Table 2
(11,2)-motif 20
sequences and 600 letters each
Accuracy and time comparison
39 m
2 6
10 Accuracy EM
30 62 74 ()
EM with threshold 52
83 90 Average
time EM 66
150 197 (seconds) EM
with threshold 21 39
46 Table 2
(13, 3)-motif 20
sequences and 600 letters each
Accuracy and time comparison
40 m
5 10
20 Accuracy EM
31 46 68 ()
EM with threshold 49
75 88 Average
time EM 179
309 479 (seconds) EM
with threshold 59 87
117 Table 2
(15, 4)-motif 20
sequences and 600 letters each
Accuracy and time comparison
41 m
1 2
3 Accuracy without shift
58 79 91 ()
with shift 93
99 100 Average time
without shift 9.4 14
16 (seconds) with shift
12 13 13
Table 3 (11,
2)-motif 20 sequences
and 600 letters each
Accuracy and time comparison Among 4k, keep the
top 10 patterns, and try to shift 1 or 2
positions to the left and right.
42 m
1 2
4 Accuracy without shift
32 52 72 ()
with shift 70
91 100 Average time
without shift 12 21
32 (seconds) with shift
14 19 21
Table 3 (13,
3)-motif 20 sequences
and 600 letters each
Accuracy and time comparison Among 4k, keep the
top 10 patterns, and try to shift 1 or 2
positions to the left and right.
43 m
2 4
8 Accuracy without shift
27 43 68 ()
with shift 70
94 100 Average time
without shift 29 50
78 (seconds) with shift
27 34 36
Table 3 (15,
4)-motif 20 sequences
and 600 letters each
Accuracy and time comparison Among 4k, keep the
top 10 patterns, and try to shift 1 or 2
positions to the left and right.
44Our Program V.S. Random Projection
PROJECTION Our program motif m accuracy time k
m accuracy time (9,1) 1 100 4.3 4 1 100 6.9 (
8,1) 3 100 4.3 4 5 100 9.6 (11,2) 6 100 5.6
4 3 100 13 (13,3) 12 100 12 4 4 100 21 (15,4
) 30 100 26 4 8 100 36 (10,2) 60 100 25 4 30
100 47 (12,3) 300 99 203 4 70 99 207
On these 7 cases, PROJECTION (Buhler and Tompa,
2002) runs faster than our program.They are
short and easy motifs.
45Our Program V.S. Random Projection
PROJECTION Our program motif m accuracy time k
m accuracy time (17,5) 160 99 83 4 16 100 52
(19,6) 200 99 194 4 30 100 92 (21,7) 400 96
332 4 60 100 102 (14,4) 800 89 936 4 160 90
813 (16,5) 1200 76 2334 4 300 82 2063 (18,6) 2
400 81 4929 5 150 89 4661
On these 6 cases, our program is both faster and
more accurate. They are long and subtle motifs.
46Parametric Alignment of RNA secondary
structures(L. Wang, J. Zhao, H. Zhao)
- RNA secondary structures can be decomposed into
five types of components - Stem, hairpin, bulge, interior loop and
multi-branch loop. - An ordered tree can be used to describe RNA
secondary structures - The order among siblings in the tree is important.
47(No Transcript)
48Comparison of ordered trees
- Alignment of trees
- Operations insertion, deletion, match and
mismatch. - Parameters cost for (1) insertion/deletion, (2)
gap penalty, and (3) mismatch - (cost of match is fixed as 1.)
- Setting of parameters is important
- http//www.cs.cityu.edu.hk/lwang/software/ParaTre
e/index.html
49Direct comparison of RNA secondary structures
- Measures have been proposed to directly compare
two RNA secondary structures. (Zhang et al, 2000) - O(P1xP2xR1xR2).
- We are working on a software (parametric
version).
50The algorithm for aligning two RNA structures
- 2. Secondary Structure a sequence with
non-crossing base pairs - A C G U G G A C C G C
- 3. Tertiary Structure a sequence with crossing
base pairs. -
- A G U G G U C U C U A G C U
- one base in at most one pair
51The algorithm for aligning two RNA Structures
- Edit Operations
- AAAGAAUAAU_UUACGGGACCCUAUAAA
- CGAGA_CAACAUUACGGG______AUAAA
pair mismatch(PMis)
pair match(PM)
pair deletion(PIndel)
gap
base match(SM)
base mismatch(Mis)
base insertion(Indel)
52The algorithm for aligning two RNA Structures
- Definition Alignment between R1 and R2
- R1 insert _,R2 insert _.
- AAAG A_AAG
- AUAG AU_AG
- single base aligned to single base or _.
- AAAGUC AAAGUC
- AAUUC AA_UUC
53The algorithm for aligning two RNA Structures
- base pair aligned to base pair or _.
- AAUGUUUC AAUGUUUC
- AGUUU AGU_UUU_
-
-
54Computation of signed translocation distance
- Chromosome a set of genes in a linear order.
- Gemone a set of chormosomes.
- Example
- A (1,2,3,4,5,6,7,-8,9,10,11,-12),
- (13,14,15,16,17)
55Translocation operation
Prefix-prefix translocation
X
X1
Y2
X1 X2
Y1
X2
Y
X1
-Y1
Y1 Y2
-X2
Y2
Prefix-suffix translocation
56Translocation distance
- Translocation distance the minimum number of
translocations required to transform one genome
into another. Computing the distance given two
signed genomes, compute the translocation
distance. - Previous results (for signed case)
- O(n3) algorithm. (Hannenhalli, 1996)
- O(n2log n) algorithm. (Zhu, 2001)
57Our results (with Zhu and Liu)
- O(n2) algorithm for signed case
- NP-hard for unsigned case. (settled an main open
problem) - (in preparation with Zhu, et
al.) - A software is produced. (with Feng, 2004)
- http//www.cs.cityu.edu.hk/lwang/software/Translo
cation/index.html
58Open Problem(8 years old)
- Transposition
- Chromosome XX1X2...XiXj.XkXk1Xn.
- After operation
- Chromosome X1X2Xi-1Xj1XkXiXjXk1Xn.
- Is there any polynomial time algorithm?