Introduction to Real-Time Systems - PowerPoint PPT Presentation

About This Presentation

Title:

Introduction to Real-Time Systems

Description:

Lusheng Wang Dept. of Computer Science City University of Hong Kong Web: http://www.cs.cityu.edu.hk/~lwang e-mail: lwang_at_cs.cityu.edu.hk – PowerPoint PPT presentation

Number of Views:35

Avg rating:3.0/5.0

Slides: 59

Provided by: ComputerSc272

Category:

more less

Transcript and Presenter's Notes

Title: Introduction to Real-Time Systems

1
Randomized Algorithm for Motif Detection
Lusheng Wang Dept. of Computer
Science City University of Hong Kong Web
http//www.cs.cityu.edu.hk/lwang e-mail
lwang_at_cs.cityu.edu.hk
2

Outline

Review our theoretical results
A software for motif detection
Other problems we have been working
--Parametric tools RNA secondary structure
comparison
--Computing translocation distance between
genomes

3

Distinguishing Substring Selection Problem
Instance Given a set B of n1 (bad) strings of
length at least L, a set G of
n2 (good) strings, two thresholds db
and dg ( db lt dg). Objective Find a
string s such that for each string b ? B there
exists a substring t of b with
length L such that d(s, t) ?
db and for any string g ? G,
d(s, g) ? dg.
Bad sequences
s motif
Good sequences
4

Applications
1. Finding Targets for Potential Drugs (T.
Jiang, C. Trendall, S, Wang, T. Wareham, X.
Zhang, 98) (K. Lanctot, M. Li, B. Ma, S.
Wang, and L. Zhang 1999) -- bad strings in B
are from Bacteria. --good strings are from
Humans -- find a substring s of length L that
is conserved in all bad strings, but not
conserved in good strings. -- use s to
screen chemicals -- those selected chemicals
can then be tested as potential
broad-range antibiotics.
5

Applications
2. Creating Diagnostic Probes for Bacterial
Infection (T. Brown, G.A. Leonard, E.D.
Booth, G. Kneale, 1990) -- a group of
closely related pathogenic bacteria -- find
a substring that occurs in each of the bacterial
sequences (with as few substitutions as
possible) and does not occurs in the
host (Human)
6

Applications
3. Creating Universal PCR Primers 4. Creating
Unbiased Consensus Sequences 5. Anti-sense Drug
Design 6. In coding theory (computer science)
7
Distinguishing String Selection Problem
Instance Given a set B of n1 (bad) strings of
length L, a set G of n2
(good) strings of length L, two thresholds db and
dg ( db lt dg). Objective
Find a string s such that for each string b ? B,
d(s, b) ? db
and for any string g ? G, d(s,
g) ? dg.
Bad strings
Center string
Good Strings
8
Closest String Problem
Instance Given a set B of n strings of
length L, and an integer
d Objective Find a string s of length L such
that for each string si ?
B, d(s, si) ? d Theorem
Closest string problem is NP-hard. (Lanctot, Li,
Ma, Wang, Zhang, 1999)
9
Closest Substring Problem
Instance Given a set B of n strings of
length at least L, and an
integer d Objective Find a string s of length L
such that for each string
si ? B, there is a substring ti of si (with
length L) such that d(s, si)
? d Theorem Closest substring
problem is NP-hard. (K.
Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang
1999) A PTAS was given in M. Li, B. Ma, and
L. Wang 2000
10
Consensus Pattern
Instance Given a set B of n strings of
different lengths, and an
integer L, Objective Find a string s of
length L and for each string si ? B, find a
substring ti of si such that
n ? d(s, ti)
i1 is minimized.
Theorem The consensus pattern problem
is NP-hard. There is a PTAs for consensus
pattern. (M. Li, B. Ma, and L. Wang, 1999)
11
Farthest String Problem (the other half)
Instance Given a set G of n strings of
length L, and an integer
d Objective Find a string s of length L such
that for each string si ? G,
d(s, si) ? d. Theorem Farthest String Problem is
NP-hard. (K. Lanctot, M. Li, B.
Ma, S. Wang, and L. Zhang, 1999) A PTAS -- A
linear programming random rounding
d is very large, i.e, O(L).
(K. Lanctot, M. Li, B. Ma,
S. Wang, and L. Zhang 1999)
12
Computer Science Background Knowledge
NP-complete Problems unlikely to have a
polynomial time algorithm to optimally solve the
problem A- an approximation algorithm for a
minimization problem. performance ratio a
number r such that for any instance I
A(I)/OPT(I) ? r A(I)
- the cost of the solution for instance I
produced by A OPT(I) - the cost of an optimal
solution for instance I. approximation scheme
an algorithm A with input I
and an error bound ?
that has
the performance ratio 1 ?. a family of
algorithms A, one for each ?.

13
Computer Science Background Knowledge
Polynomial time approximation scheme (PTAS)
an approximation scheme A? that runs in time
polynomial in the size of the instance I, for
any fixed ?. MAX SNP-hard unlikely to have
a PTAS.

14
Summary of our theoretical results

Closest string problem A PTAS (M. Li, B. Ma, and
L. Wang, STOC99)
Closest substring Problem A PTAS (Li, Ma, Wang,
2002, JACM)
Consensus pattern problem NP-hard and A PTAS (
Li, . Ma, and Wang, JCSS, 2002)
Distinguishing string Selection Problem PTAS
Distinguishing Substring Selection Problem A
PTAS (Deng, Li, Li, MA and Wang, SIAM J. on
Computing, 2003)

15

Our Results
An approximation scheme that takes an ?gt0 as a
parameter and finds a string s such that for
each string b ? B there exists a substring t of
b with length L such that
d(s, t) ? (1?) db, and for any string g ? G,
d(s, g) ?(1- ?) dg.
16
Computation of the Closest String Problem
Theorem Closest string problem is NP-hard.
(K. Lanctot, M. Li, B. Ma, S.
Wang, and L. Zhang 1999) A 4/3(1?)
approximation
(K. Lanctot, M. Li, B. Ma, S. Wang, and L.
Zhang 1999) (L.
Gasieniec, J. Jansson, and A. Lingas 1999) A
PTAS was presented for the problem (M. Li, B.
Ma, and L. Wang, 1999)
17

Linear Programming Approach
The optimization problem
min d d(si, x)
? d, i 1, 2, , n.
18
Linear Programming Approach

The optimization problem
min d
?a ? ?xj,a 1, j 1, 2 , , L
?1 ? j? L ?a ? ? ?(sij, a) xj,a ? d, i
1, 2, , n.
? xj,a 0 -1 variable denoting whether xj
a
? ?(sij, a) denote whether sij a
? x'j,a a fractional solution to the above LP

19
Linear Programming Approach (continued)
Randomized Rounding with probability x'j,a ,
set xj,a 1. Theorem For any ? gt 0, if L ? 4(ln
n)/?2, and d ?(L), then
d(si, x ) ? d(1?). ? When Llt 4(ln n)/?2
enumerate all possible strings of length L.
20
when d is small (d lt O(L))
Theorem Let 1? i1, i2, , ir ? n and
Qi1,i2,,ir be the set of positions
where si1, si2, , sir agree. There exist
indices 1? i1, i2, , ir ? n
such that for any sl j ?
Qi1,i2,,ir si1j ? slj and si1j ?sj ?
1/(2r-1) d, when max d(si,
sj) gt 11/(2r-1). Note that si1 and s can be
different in many bits in Qi1,i2,,ir . However,
the effect on each sl ? B is not bad.
21
The Algorithm
Step 1 choose the good indices 1? i1, i2, ,
ir ? n and compute Qi1,i2,,ir.
(Try all possible r-tuples) Step 2. Set P 1,
2, , L - Qi1,i2,,ir . Step 3. Use linear
programming approach to solve the problem on P.
(modify the inequalities a little
bit.) Step 4. Use si1 as the solution at
positions in Qi1,i2,,ir . Step 5. Combine the
two parts to get a string of length L.
22
Old approximation algorithm for Consensus
Pattern

Randomly select t sequences from the n given
sequences.
Try (m-L1)t different ways to get the
occurrences of the motif in the t selected
sequences.
(3) Use the t substrings to form an approximate
motif.
(4) Use the approximate motif to search all n
sequences and reconstruct the new motif based on
n substrings.

select t sequences
23
New randomized algorithm for Consensus
Pattern (Wang and Dong, 2004)
(1) Randomly select k positions from the L
positions (2) Try 4 k possible consensus strings
of length k. (3) Use the partial consensus
strings to search all the given strings and find
a substring that is closest to the partial string
for each of the n given strings (4) Reconstruct
the center string (by choose the majority letter
for each column) (5) Repeat (1)-(4) several times
and choose the best output.
24
Theorem Assume d(ti, s)O(L). With high
probability, the algorithm output a string s
with cost n
n ? d(ti,
s) lt (1?) ? d(ti,s), i1
i1 where s is
the optimal string, ti and ti are substrings in
si that may or may not be identical. The
probability is related to nm (total length of
strings) and k is related to ?, n and m.
25

It is expected that the algorithm might be
practical since it is a random algorithm.
Theoretical results shows that k is required to
be large in order to get good results.
Comparing with the best known software (Random
projection algorithm EM), the random algorithm
is much slower.
We try to combine our random algorithm with EM
Exciting new findings

A challenging instance (L, d), e.g., (15, 4)
20 sequences over 4 letters (randomly
generated)
600 letters in each sequence
A Pattern was constructed as follows
Given a center string s of length 15, we
randomly change 4
positions and get 20 strings of length 4
(with 4 different letters each).
Put each of the 20 strings back to the 20
sequences of length 600 (the location is randomly
chosen)
Challenge find the inserted 20 strings

27
The EM Algorithm

Input
a 4?L weight matrix W (the initial guess of the
motif)
n sequences S1,S2,...,Sn.
Output
new W that is a local maximal solution
An array P indicating the probability that the
motif starts at every position of every
sequence.

A 0.25 0.0 C 0.25 1.0 G 0.25 0.0 T
0.25 0.0
28
Step 1

For each position x in each sequence Si,
calculate the probability that we observe this
sequence data if the motif starts from position
x
P(i,x)?j1 to L W(bj,j)
bjSi(xj) the j-th base of the L-mer starting
at x in Si.
To avoid zero weights, a fixed small number is
added to W(i,j).

29
Step 2

For each L-mer in each sequence, normalize the
probability that this L-mer is an occurrence of
the motif as
m-L1
P'(i,x)P(i,x) / ??x1 P(i,x)
and replace P with P'.
Its purpose is to guarantee that
?x1 to n-l1P'(i,x)1.

30
Step 3

Re-estimate the motif weight matrix W.
W ?i1 n ?x1 m-L1 Wi,x
Where
Wi,x(b,j)P(i,x) if bSi(xj), else 0.

31
Step 4

Because the sum of each column of W should be
1.0, a normalization is applied to W.
W'(i,j) W(i,j)/??ba,c,g,tW(b,j).
Replace W with W'.

32
Step 5

Steps 1 to 4 is called a cycle. If W changes very
little from last cycle, then EM converges and the
algorithm ends. Else, goto step 1 and start next
cycle.
How to determine the amount of change?
Consider the two consecutive matrices Wq-1 and
Wq.
maxWq(i,j)-Wq-1(i,j)lt??

33
An Example

A 0.25 0.0 0.0
C 0.25 1.0 0.0 s1ACT P10.25 ?1.0
?1.0
G 0.25 0.0 0.0 s2CGG P20.25 ?? ??,
? is small, but gt0.
T 0.25 0.0 1.0
A p1 0 0 A
0 0 0
W1 C 0 p1 0 W2 C p2
0 0
G 0 0 0 G
0 p2 p2
T 0 0 p1 T
0 0 0

34
EM with threshold

In order to accelerate the EM algorithm, a
threshold is added to step 3. An L-mer takes part
in the re-estimation of the motif matrix only
when its likelihood P(i,x) is not less than the
average likelihood of all the L-mers in this
sequence. So step 3 is modified to
W??i1 to t?Wi,xP(i,x)??1/(n-l1)

35
Benefits of the threshold

First, each EM cycle takes less time than before,
because not every Wi,x needs to be summed.
Second, EM converges faster and more efficient,
because the positions we neglect are more likely
to be noise than to be meaningful motif
occurrences. The average number of cycles changes
from 6.0 to less than 4. And the probability that
EM finds the correct motif becomes larger.

36
The time for each cycle
l (the motif length) 11 13
15 17 19 EM
22.4 25.5
28.6 31.5 35.2 EM with
threshold 16.1 18.2 20.0
22.0 23.8 Table 1
average cycle time (milliseconds)
20 sequences and 600 letters each
37
The Whole Algorithm

for trial1 to m do
2 choose k positions from L positions randomly
3 generate 4k starting points
4 for i1 to 4k do
5 refine the i-th starting point
6 if an (L,d) motif is found (error ltd) then
goto 7
7 report the best motif ever found.

38
m
1 3 5
Accuracy EM
45 76 84 ()
EM with threshold 58
91 97 Average time
EM 26
53 67 (seconds) EM
with threshold 9.5 16
17 Table 2
(11,2)-motif 20
sequences and 600 letters each
Accuracy and time comparison
39
m
2 6
10 Accuracy EM
30 62 74 ()
EM with threshold 52
83 90 Average
time EM 66
150 197 (seconds) EM
with threshold 21 39
46 Table 2
(13, 3)-motif 20
sequences and 600 letters each
Accuracy and time comparison
40
m
5 10
20 Accuracy EM
31 46 68 ()
EM with threshold 49
75 88 Average
time EM 179
309 479 (seconds) EM
with threshold 59 87
117 Table 2
(15, 4)-motif 20
sequences and 600 letters each
Accuracy and time comparison
41
m
1 2
3 Accuracy without shift
58 79 91 ()
with shift 93
99 100 Average time
without shift 9.4 14
16 (seconds) with shift
12 13 13
Table 3 (11,
2)-motif 20 sequences
and 600 letters each
Accuracy and time comparison Among 4k, keep the
top 10 patterns, and try to shift 1 or 2
positions to the left and right.
42
m
1 2
4 Accuracy without shift
32 52 72 ()
with shift 70
91 100 Average time
without shift 12 21
32 (seconds) with shift
14 19 21
Table 3 (13,
3)-motif 20 sequences
and 600 letters each
Accuracy and time comparison Among 4k, keep the
top 10 patterns, and try to shift 1 or 2
positions to the left and right.
43
m
2 4
8 Accuracy without shift
27 43 68 ()
with shift 70
94 100 Average time
without shift 29 50
78 (seconds) with shift
27 34 36
Table 3 (15,
4)-motif 20 sequences
and 600 letters each
Accuracy and time comparison Among 4k, keep the
top 10 patterns, and try to shift 1 or 2
positions to the left and right.
44
Our Program V.S. Random Projection
PROJECTION Our program motif m accuracy time k
m accuracy time (9,1) 1 100 4.3 4 1 100 6.9 (
8,1) 3 100 4.3 4 5 100 9.6 (11,2) 6 100 5.6
4 3 100 13 (13,3) 12 100 12 4 4 100 21 (15,4
) 30 100 26 4 8 100 36 (10,2) 60 100 25 4 30
100 47 (12,3) 300 99 203 4 70 99 207
On these 7 cases, PROJECTION (Buhler and Tompa,
2002) runs faster than our program.They are
short and easy motifs.
45
Our Program V.S. Random Projection
PROJECTION Our program motif m accuracy time k
m accuracy time (17,5) 160 99 83 4 16 100 52
(19,6) 200 99 194 4 30 100 92 (21,7) 400 96
332 4 60 100 102 (14,4) 800 89 936 4 160 90
813 (16,5) 1200 76 2334 4 300 82 2063 (18,6) 2
400 81 4929 5 150 89 4661
On these 6 cases, our program is both faster and
more accurate. They are long and subtle motifs.
46
Parametric Alignment of RNA secondary
structures(L. Wang, J. Zhao, H. Zhao)

RNA secondary structures can be decomposed into
five types of components
Stem, hairpin, bulge, interior loop and
multi-branch loop.
An ordered tree can be used to describe RNA
secondary structures
The order among siblings in the tree is important.

47
(No Transcript)
48
Comparison of ordered trees

Alignment of trees
Operations insertion, deletion, match and
mismatch.
Parameters cost for (1) insertion/deletion, (2)
gap penalty, and (3) mismatch
(cost of match is fixed as 1.)
Setting of parameters is important
http//www.cs.cityu.edu.hk/lwang/software/ParaTre
e/index.html

49
Direct comparison of RNA secondary structures

Measures have been proposed to directly compare
two RNA secondary structures. (Zhang et al, 2000)
O(P1xP2xR1xR2).
We are working on a software (parametric
version).

50
The algorithm for aligning two RNA structures

2. Secondary Structure a sequence with
non-crossing base pairs
A C G U G G A C C G C
3. Tertiary Structure a sequence with crossing
base pairs.
A G U G G U C U C U A G C U
one base in at most one pair

51
The algorithm for aligning two RNA Structures

Edit Operations
AAAGAAUAAU_UUACGGGACCCUAUAAA
CGAGA_CAACAUUACGGG______AUAAA

pair mismatch(PMis)
pair match(PM)
pair deletion(PIndel)
gap
base match(SM)
base mismatch(Mis)
base insertion(Indel)
52
The algorithm for aligning two RNA Structures

Definition Alignment between R1 and R2
R1 insert _,R2 insert _.
AAAG A_AAG
AUAG AU_AG
single base aligned to single base or _.
AAAGUC AAAGUC
AAUUC AA_UUC

53
The algorithm for aligning two RNA Structures

base pair aligned to base pair or _.
AAUGUUUC AAUGUUUC
AGUUU AGU_UUU_

54
Computation of signed translocation distance

Chromosome a set of genes in a linear order.
Gemone a set of chormosomes.
Example
A (1,2,3,4,5,6,7,-8,9,10,11,-12),
(13,14,15,16,17)

55
Translocation operation
Prefix-prefix translocation
X
X1
Y2
X1 X2
Y1
X2
Y
X1
-Y1
Y1 Y2
-X2
Y2
Prefix-suffix translocation
56
Translocation distance

Translocation distance the minimum number of
translocations required to transform one genome
into another. Computing the distance given two
signed genomes, compute the translocation
distance.
Previous results (for signed case)
O(n3) algorithm. (Hannenhalli, 1996)
O(n2log n) algorithm. (Zhu, 2001)

57
Our results (with Zhu and Liu)