Title: Asynchronous Pattern Matching - Metrics
1Asynchronous Pattern Matching - Metrics
2Motivation
3Motivation
In the old days: Pattern and text are given in correct sequential order. It is possible that the content is erroneous.
New paradigm: Content is exact, but the order of the pattern symbols may be scrambled.
Why? Transmitted asynchronously? The nature of the application?
4Example Swaps
"Tehse knids" of typing mistakes are very common. So when searching for the pattern "These" we are seeking the symbols of the pattern, but in an order changed by swaps.
Surprisingly, pattern matching with swaps is easier than pattern matching with mismatches (ACHLP01).
5Example Reversals
AAAGGCCCTTTGAGCCC
AAAGAGTTTCCCGGCCC
Given a DNA substring, a piece of it can detach and reverse. This process is still computationally tough.
Question: What is the minimum number of reversals necessary to sort a permutation of 1,...,n?
6Global Rearrangements?
Berman & Hannenhalli (1996) called this Global Rearrangement, as opposed to Local Rearrangement (edit distance). Showed it is NP-hard.
Our Thesis: This is a special case of errors in the address rather than content.
7Example Transpositions
AAAGGCCCTTTGAGCCC
AATTTGAGGCCCAGCCC
Given a DNA substring, a piece of it can be transposed to another area.
Question: What is the minimum number of transpositions necessary to sort a permutation of 1,...,n?
8Complexity?
Bafna & Pevzner (1998), Christie (1998), Hartman (2001): 1.5-approximation in polynomial time.
Not known whether efficiently computable.
This is another special case of errors in the address rather than content.
9Example Block Interchanges
AAAGGCCCTTTGAGCCC
AAGTTTAGGCCCAGCCC
Given a DNA substring, two non-empty subsequences can be interchanged.
Question: What is the minimum number of block interchanges necessary to sort a permutation of 1,...,n?
Christie (1996): O(n^2)
10Summary
Biology (sorting permutations):
Reversals (Berman & Hannenhalli, 1996): NP-hard
Transpositions (Bafna & Pevzner, 1998): ?
Block interchanges (Christie, 1996): O(n^2)
Pattern Matching:
Swaps (Amir, Lewenstein & Porat, 2002): O(n log m)
Note: A swap is a block interchange simplification:
1. Block size
2. Only once
3. Adjacent
11 Edit operations map
Reversal, Transposition, Block interchange: 1. arbitrary block size 2. not once 3. non adjacent 4. permutation 5. optimization
Interchange: 1. block of size 1 2. not once 3. non adjacent 4. permutation 5. optimization
Generalized-swap: 1. block of size 1 2. once 3. non adjacent 4. repetitions 5. optimization/decision
Swap: 1. block of size 1 2. once 3. adjacent 4. repetitions 5. optimization/decision
12Definitions
interchange
interchange matches
S1 = bbaca, S2 = bbaac
S1 = bbaca, S2 = bcaba
generalized-swap matches
13Generalized Swap Matching
INPUT: text T[0..n], pattern P[0..m]
OUTPUT: all i s.t. P generalized-swap matches T[i..i+m]
Reminder: Convolution. The convolution of the strings t[1..n] and p[1..m] is the string $t \otimes p$ such that
$(t \otimes p)[i] = \sum_{j=1}^{m} t[i+j-1] \cdot p[j]$ for $i = 1, \dots, n-m+1$
Fact: The convolution of an n-length text and an m-length pattern can be done in O(n log m) time using FFT.
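A minimal sketch of the convolution reminder above, in Python with NumPy (an assumption; the slides name no language). The function name `sliding_dot_products` and the 0-based indexing are illustrative. It runs one FFT over the whole text, which costs O(n log n); the O(n log m) bound is usually reached by cutting the text into overlapping pieces of length 2m, which is not shown here.

```python
import numpy as np

def sliding_dot_products(text, pattern):
    """Return c[i] = sum_j text[i+j] * pattern[j] for every full alignment i.

    Assumes integer inputs, so the FFT result can be rounded back exactly.
    """
    t = np.asarray(text, dtype=float)
    p = np.asarray(pattern, dtype=float)
    n, m = len(t), len(p)
    size = 1
    while size < n + m - 1:          # pad to a power of two, avoids wrap-around
        size *= 2
    # Reversing the pattern turns the sliding dot product into a convolution.
    prod = np.fft.rfft(t, size) * np.fft.rfft(p[::-1], size)
    full = np.fft.irfft(prod, size)
    # Alignment i of the pattern ends at index i + m - 1 of the convolution.
    return np.rint(full[m - 1:n]).astype(np.int64)
```

For example, `sliding_dot_products([1, 0, 1, 1, 0, 1], [1, 0, 1])` yields one value per alignment of the pattern against the text.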
14In Pattern Matching
Convolutions
[Figure: the pattern b0 b1 b2 is slid along the text, one alignment per text position.]
O(n log m) using FFT
15Problem: O(n log m) only in algebraically closed fields, e.g. C.
Solution: Reduce problem to (Boolean/integer/real) multiplication.
This reduction costs!
Example: Hamming distance.
A B A B C A B B B A
Counting mismatches is equivalent to counting matches.
16Example
1 0 1
1 0 1
1 0 1
Count all hits of 1 in pattern and 1 in text.
17For $a \in \Sigma$
Define $\chi_a(b) = 1$ if $a = b$, 0 o/w
Example
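18For $\Sigma = \{a, b, c\}$
Do: $\chi_a(T) \otimes \chi_a(P) + \chi_b(T) \otimes \chi_b(P) + \chi_c(T) \otimes \chi_c(P)$
Result: The number of times a in pattern matches a in text + the number of times b in pattern matches b in text + the number of times c in pattern matches c in text.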
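The per-symbol counting above can be sketched as follows, in Python with NumPy. The names `match_counts` and `hamming_distances` are illustrative, and `np.correlate` (a direct sliding product) stands in for the FFT-based convolution, so this shows the reduction rather than the O(n log m) running time.

```python
import numpy as np

def match_counts(text, pattern):
    """For each alignment i, count positions j with text[i+j] == pattern[j]."""
    n, m = len(text), len(pattern)
    total = np.zeros(n - m + 1, dtype=np.int64)
    for a in set(pattern):
        # Indicator vectors: 1 where the symbol equals a, 0 otherwise.
        chi_t = np.array([1 if c == a else 0 for c in text], dtype=np.int64)
        chi_p = np.array([1 if c == a else 0 for c in pattern], dtype=np.int64)
        # One sliding product per symbol counts the hits of a against a.
        total += np.correlate(chi_t, chi_p, mode="valid")
    return total

def hamming_distances(text, pattern):
    """Mismatches per alignment = pattern length minus matches."""
    return len(pattern) - match_counts(text, pattern)
```

One sliding product is needed per alphabet symbol, which is the cost of the reduction mentioned on the previous slide.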
19Generalized Swap Matching a Randomized Algorithm
Idea: assign natural numbers to alphabet symbols, and construct:
T': replacing the number a by the pair $a^2, -a$
P': replacing the number b by the pair $b, b^2$
Convolution of T' and P' gives at every location 2i: $\sum_{j=0}^{m} h(T_{i+j}, P_j)$, where $h(a,b) = ab(a-b)$.
⇒ 3-degree multivariate polynomial.
20Generalized Swap Matching a Randomized Algorithm
Since $h(a,a) = 0$ and $h(a,b) + h(b,a) = ab(a-b) + ba(b-a) = 0$, a generalized-swap match ⇒ 0 polynomial.
Example: Text: ABCBAABBC, Pattern: CCAABABBB (with A = 1, B = 2, C = 3)
T':       1 -1, 4 -2, 9 -3, 4 -2, 1 -1, 1 -1, 4 -2, 4 -2, 9 -3
P':       3 9, 3 9, 1 1, 1 1, 2 4, 1 1, 2 4, 2 4, 2 4
Products: 3 -9, 12 -18, 9 -3, 4 -2, 2 -4, 1 -1, 8 -8, 8 -8, 18 -12
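The example above can be replayed with a short Python sketch. The function name `swap_polynomial_values` and the `code` dictionary (symbol to number) are illustrative assumptions, and `np.correlate` again stands in for an FFT-based convolution.

```python
import numpy as np

def swap_polynomial_values(text, pattern, code):
    """Return sum_j h(T[i+j], P[j]) for every alignment i, h(a,b) = ab(a-b)."""
    t = [code[c] for c in text]
    p = [code[c] for c in pattern]
    # T': each number a becomes the pair (a^2, -a).
    t2 = np.array([x for a in t for x in (a * a, -a)], dtype=np.int64)
    # P': each number b becomes the pair (b, b^2).
    p2 = np.array([x for b in p for x in (b, b * b)], dtype=np.int64)
    # Pair products sum to a^2*b - a*b^2 = ab(a-b) at even locations.
    values = np.correlate(t2, p2, mode="valid")
    return values[::2]
```

With `code = {"A": 1, "B": 2, "C": 3}`, the text ABCBAABBC and pattern CCAABABBB give the single value 0, in line with the products listed above.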
21Generalized Swap Matching a Randomized Algorithm
Problem: It is possible that coincidentally the result will be 0 even if there is no swap match.
Example: for text ace and pattern bdf we get a multivariate degree 3 polynomial:
$ab(a-b) + cd(c-d) + ef(e-f) = a^2 b - a b^2 + c^2 d - c d^2 + e^2 f - e f^2$
We have to make sure that the probability of such a possibility is quite small.
22Generalized Swap Matching a Randomized Algorithm
What can we say about the 0s of the polynomial?
By the Schwartz-Zippel Lemma: prob. of 0 ≤ degree/|domain|.
Conclude:
Theorem: There exists an O(n log m) algorithm that reports all generalized-swap matches and reports false matches with prob. ≤ 1/n.
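A hedged sketch of the randomized test, in Python. The name `candidate_swap_matches` and the domain size 3n are assumptions, the latter chosen so that degree/|domain| = 3/(3n) = 1/n. The sums are evaluated directly in O(nm) time for readability; the slides obtain them by convolution.

```python
import random

def h(a, b):
    return a * b * (a - b)

def candidate_swap_matches(text, pattern, seed=None):
    """Report every i where sum_j h(T[i+j], P[j]) == 0 under a random coding.

    True generalized-swap matches are always reported; a non-match can be
    reported only if the random numbers happen to be a root of the polynomial.
    """
    rng = random.Random(seed)
    n, m = len(text), len(pattern)
    domain = 3 * n  # assumption: degree 3 over a domain of size 3n
    code = {c: rng.randint(1, domain) for c in set(text) | set(pattern)}
    hits = []
    for i in range(n - m + 1):
        total = sum(h(code[text[i + j]], code[pattern[j]]) for j in range(m))
        if total == 0:
            hits.append(i)
    return hits
```

The error is one-sided: a reported location may be false, but a missing location is never a match.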
23Generalized Swap Matching De-randomization?
Can we detect 0s and thus de-randomize the algorithm?
Suggestion: Take h1,...,hk having no common root.
It won't work, k would have to be too large!
24Generalized Swap Matching De-randomization?
Theorem: Ω(m/log m) polynomial functions are required to guarantee that a 0 convolution value is a 0 polynomial.
Proof: By a linear reduction from word equality. Given m-bit words w1, w2 at processors P1, P2.
Construct T = w1,1,2,...,m and P = 1,2,...,m,w2. Now, T generalized-swap matches P iff w1 = w2.
P1 computes: w1, (1,2,...,m) → log m bit result
P2 computes: (1,2,...,m), w2
Communication Complexity: word equality requires exchanging Ω(m) bits. We get k·log m = Ω(m), so k must be Ω(m/log m).
25Interchange Distance Problem
INPUT: text T[0..n], pattern P[0..m]
OUTPUT: The minimum number of interchanges s.t. T[i..i+m] interchange matches P.
Reminder: permutation cycle. The cycles (143) (a 3-cycle) and (2) (a 1-cycle) represent 3241.
Fact: The representation of a permutation as a product of disjoint permutation cycles is unique.
26Interchange Distance Problem
Lemma: Sorting a k-length permutation cycle requires exactly k-1 interchanges.
Proof: By induction on k.
Cases: (1), (2 1), (3 1 2)
Theorem: The interchange distance of an m-length permutation π is m - c(π), where c(π) is the number of permutation cycles in π.
Result: An O(nm) algorithm to solve the interchange distance problem.
A connection between sorting by interchanges and generalized-swap matching?
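The theorem above translates directly into a cycle count. A minimal Python sketch, assuming the permutation is given as a 0-based list `perm` in which `perm[i]` is the element stored at location i (the names are illustrative):

```python
def count_cycles(perm):
    """Number of disjoint cycles of a permutation of 0..m-1, fixed points included."""
    seen = [False] * len(perm)
    cycles = 0
    for start in range(len(perm)):
        if not seen[start]:
            cycles += 1
            i = start
            while not seen[i]:      # walk the cycle through start
                seen[i] = True
                i = perm[i]
    return cycles

def interchange_distance(perm):
    """m - c(perm): the minimum number of interchanges that sort perm."""
    return len(perm) - count_cycles(perm)
```

Each call is linear in m, so running it for every text location gives the O(nm) bound stated in the result.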
27Interchange Generation Distance Problem
INPUT: text T[0..n], pattern P[0..m]
OUTPUT: The minimum number of interchange-generations s.t. T[i..i+m] interchange matches P.
Definition: Let S = S1, S2, ..., Sk = F, where $S_{l+1}$ is derived from $S_l$ via interchange $I_l$. An interchange-generation is a subsequence of $I_1, \dots, I_{k-1}$ s.t. the interchanges have no index in common.
Note: Interchanges in a generation may occur in parallel.
28Interchange Generation Distance Problem
Lemma: Let σ be a cycle of length k > 2. It is possible to sort σ in 2 generations and k-1 interchanges.
Example: (1,2,3,4,5,6,7,8,0)
generation 1: (1,8),(2,7),(3,6),(4,5)
(8,7,6,5,4,3,2,1,0)
generation 2: (0,8),(1,7),(2,6),(3,5)
(0,1,2,3,4,5,6,7,8)
29Sorting a General Cycle in Two Rounds
Algorithm: Exchange contents of locations 2 and n, 3 and n-1, 4 and n-2, ...
Then bring every one to place simultaneously.
30EXAMPLE
Index:        6 3 5 1 4 2
Sorted:       0 1 2 3 4 5
Permutation:  2 5 4 0 1 3
              3 4 5 0 1 2
              0 4 5 3 1 2
              0 1 5 3 4 2
              0 1 2 3 4 5
31Why Does it Work?
We want to send 0 to its place. So in the last location should be the number where 0 is now. This number is in location 2. The same reasoning is true for 1, 2, ...
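The two-round procedure can be sketched in Python as follows. This is one reading of the slides, under stated assumptions: `perm[loc]` is the element at 0-based location `loc`, each cycle is indexed starting from the location that holds its smallest element, and the name `sort_in_two_generations` is illustrative.

```python
def sort_in_two_generations(perm):
    """Sort perm with two generations of pairwise-disjoint interchanges.

    Returns (gen1, after_gen1, gen2, final); gen1/gen2 are lists of location pairs.
    """
    p = list(perm)
    n = len(p)
    pos = [0] * n
    for loc, v in enumerate(p):
        pos[v] = loc                      # pos[v] = location holding v
    seen = [False] * n
    gen1, gen2 = [], []
    for v in range(n):
        loc = pos[v]
        cyc = []
        while not seen[loc]:              # cycle index t+1 holds the number of index t
            seen[loc] = True
            cyc.append(loc)
            loc = pos[loc]
        k = len(cyc)
        a, b = 1, k - 1                   # round 1: indices 2 and n, 3 and n-1, ...
        while a < b:
            gen1.append((cyc[a], cyc[b]))
            a, b = a + 1, b - 1
        a, b = 0, k - 1                   # round 2: bring every one to place
        while a < b:
            gen2.append((cyc[a], cyc[b]))
            a, b = a + 1, b - 1
    for i, j in gen1:
        p[i], p[j] = p[j], p[i]
    after_gen1 = list(p)
    for i, j in gen2:
        p[i], p[j] = p[j], p[i]
    return gen1, after_gen1, gen2, p
```

On the permutation 2 5 4 0 1 3 the first generation produces 3 4 5 0 1 2 and the second one sorts it, using 5 interchanges in total for the single 6-cycle.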
32Interchange Generation Distance Problem
- Theorem: Let maxl(π) be the length of the longest permutation cycle in an m-length permutation π. The interchange generation distance of π is exactly:
- 0, if maxl(π) = 1.
- 1, if maxl(π) = 2.
- 2, if maxl(π) > 2.
Note: There is a generalized-swap match iff sorting by interchanges is done in 1 generation.
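The case analysis of the theorem can be sketched in Python, assuming the permutation is a 0-based list and using the illustrative names `longest_cycle` and `interchange_generation_distance`:

```python
def longest_cycle(perm):
    """Length of the longest cycle of a permutation of 0..m-1 (1 if empty)."""
    seen = [False] * len(perm)
    best = 1
    for start in range(len(perm)):
        length, i = 0, start
        while not seen[i]:
            seen[i] = True
            i = perm[i]
            length += 1
        best = max(best, length)
    return best

def interchange_generation_distance(perm):
    """0, 1 or 2 depending on the longest cycle, as in the theorem."""
    longest = longest_cycle(perm)
    if longest == 1:
        return 0
    return 1 if longest == 2 else 2
```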
33L1-Distance Problem
INPUT: text T[0..n], pattern P[0..m]
OUTPUT: The L1-distance between T[i..i+m] and P for all $i \in [0, n-m]$, where
$L_1\text{-distance}(T[i..i+m], P) = \sum_{j} |j - M_{T(i)}(j)|$
Example: Text: ABCBAABBC, Pattern: CCAABABBB
How do we pair the symbols?
Do we need to try all pairings of same letters?
34L1-Distance Problem
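Definition: Let T1[0..n] be a string over $\Sigma$ and T2[0..n] a permutation of T1. A pairing between T1 and T2 is a bijection $M: \{0, \dots, n\} \to \{0, \dots, n\}$ where $T_1[i] = T_2[M(i)]$ for all $i = 0, \dots, n$.
The L1-distance between T1 and T2 under M is $d^{M}_{L_1}(T_1, T_2) = \sum_{i=0}^{n} |i - M(i)|$.
The L1-distance between T1 and T2 is $d_{L_1}(T_1, T_2) = \min_{M} d^{M}_{L_1}(T_1, T_2)$.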
35L1-Distance Problem
EXAMPLE (pairing M1)
T1 = A B B A B C A C B
T2 = C A C B B B B A A
Adding the displacement of each position of T1 in turn:
$d^{M_1}_{L_1}(T_1, T_2) = 8 + 5 + 2 + 4 + 1 + 3 + 5 + 7 + 5 = 40$
45L1-Distance Problem
EXAMPLE (pairing M2)
T1 = A B B A B C A C B
T2 = C A C B B B B A A
Adding the displacement of each position of T1 in turn:
$d^{M_2}_{L_1}(T_1, T_2) = 1 + 2 + 2 + 4 + 1 + 5 + 2 + 5 + 2 = 24$
Note that $d^{M_1}_{L_1}(T_1, T_2) = 40$. How do we choose the pairing function?
55L1-Distance Problem
Fortunately, we know how an optimal pairing looks...
Lemma: Let $T_1, T_2 \in \Sigma^m$ be two strings s.t. T2 is a permutation of T1. Let M be the pairing function that, for all a and k, moves the k-th a in T1 to the location of the k-th a in T2. Then $d_{L_1}(T_1, T_2) = d^{M}_{L_1}(T_1, T_2)$.
Example: An optimal pairing. Text: ABCBAABBC, Pattern: CCAABABBB
Result: An O(nm) algorithm to solve the L1-distance problem.
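A small Python sketch of the pairing lemma: the k-th occurrence of each symbol in one string is paired with its k-th occurrence in the other. The names `occurrence_lists`, `l1_distance` and `l1_distance_bruteforce` are illustrative; the brute-force version tries every pairing and is only meant as a check on very short strings.

```python
from collections import defaultdict
from itertools import permutations

def occurrence_lists(s):
    """Map each symbol to the increasing list of locations where it occurs."""
    lists = defaultdict(list)
    for i, c in enumerate(s):
        lists[c].append(i)
    return lists

def l1_distance(t1, t2):
    """L1-distance under the k-th to k-th pairing of the lemma."""
    if sorted(t1) != sorted(t2):
        raise ValueError("t2 must be a permutation of t1")
    a, b = occurrence_lists(t1), occurrence_lists(t2)
    return sum(abs(i - j) for c in a for i, j in zip(a[c], b[c]))

def l1_distance_bruteforce(t1, t2):
    """Minimum cost over all pairings; exponential, for short strings only."""
    best = None
    for m in permutations(range(len(t2))):
        if all(t1[i] == t2[m[i]] for i in range(len(t1))):
            cost = sum(abs(i - m[i]) for i in range(len(t1)))
            best = cost if best is None else min(best, cost)
    return best
```

For T1 = ABBABCACB and T2 = CACBBBBAA, `l1_distance` returns 24, the value obtained with pairing M2 in the example slides.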
56L1-Distance Problem
Proof of pairing lemma: Note that in M, for a fixed alphabet symbol a, there are no crossovers. We will show that crossovers do not help.
Cases (M without the crossover, M' with it; x, y, z are the segment labels in the figure):
1) Cost in M: x+y+y+z. Cost in M': x+y+z+y.
57L1-Distance Problem
Proof of pairing lemma (continued)
2) Cost in M: x+z. Cost in M': x+y+y+z.
58L1-Distance Problem
Proof of pairing lemma (continued)
3) Cost in M: x+z. Cost in M': x+y+z+y.
59L1-Distance Problem
For a pattern with distinct entries we can do better...
Idea: Reuse the computation at position i for position i+1.
Example: Text: 2312331, Pattern: 123
Dist(1) = |1-3| + |2-1| + |3-2| = 4
Direction relative to pattern: Left: 2,3. Match: none. Right: 1.
60L1-Distance Problem
Now we move to the next location. Text: 2312331, Pattern: 123
Dist(2) = ?
Direction relative to pattern: Left (+1 to Dist): 2,3. Match: none. Right (-1 to Dist): 1,2.
61L1-Distance Problem
Next location distance computation. Text: 2312331, Pattern: 123
Dist(2) = Dist(1) + Left - Right - leftmost + rightmost = 4 + 1 - 1 - 1 + |3-2| = 4 (= |1-2| + |2-3| + |3-1|)
Result: An O(n) algorithm to solve the L1-distance problem for a pattern with distinct entries.
62L2-Distance Problem
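Definition: Let T1[0..n] be a string over $\Sigma$ and T2[0..n] a permutation of T1. A pairing between T1 and T2 is a bijection $M: \{0, \dots, n\} \to \{0, \dots, n\}$ where $T_1[i] = T_2[M(i)]$ for all $i = 0, \dots, n$.
The L2-distance between T1 and T2 under M is $d^{M}_{L_2}(T_1, T_2) = \sum_{i=0}^{n} |i - M(i)|^2$.
The L2-distance between T1 and T2 is $d_{L_2}(T_1, T_2) = \min_{M} d^{M}_{L_2}(T_1, T_2)$.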
63L2-Distance Problem
INPUT: text T[0..n], pattern P[0..m]
OUTPUT: The L2-distance between T[i..i+m] and P for all $i \in [0, n-m]$, where
$L_2\text{-distance}(T(i), P) = \sum_{j} |j - M_{T(i)}(j)|^2$
Do we need to try all pairings of same letters? No, the pairing lemma works here too.
Result: An O(nm) algorithm to solve the L2-distance problem.
We can do better.
64L2-Distance Problem
Idea: Consider T and P of the same length. Let $list_a(P)$ and $list_a(T)$, for each letter a, be the list of locations in which a occurs in P and T.
The L2-distance between T and P is
$d_{L_2}(T, P) = \sum_{a \in P} \sum_{j \in [0, |list(a)|]} (list_a(P)[j] - list_a(T)[j])^2$
Schematically: Fix letter a.
list(T): i1, i2, ..., ik
list(P): j1, j2, ..., jk
65L2-Distance Problem
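$\sum_{l=1}^{k} (j_l - i_l)^2 = \sum_{l=1}^{k} j_l^2 + \sum_{l=1}^{k} i_l^2 - 2 \sum_{l=1}^{k} i_l j_l$
First term: easy to compute
Second term: easy to compute
Third term: convolution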
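For strings of equal length, the split into two sums of squares and one cross term can be sketched in Python as below. The names `l2_terms` and `l2_distance` are illustrative, and the L2-distance is taken as the sum of squared displacements, as in the problem definition. The cross term is computed here as a plain sum of products; for a long text it is the part that would be obtained by convolution.

```python
from collections import defaultdict

def occurrence_lists(s):
    """Map each symbol to the increasing list of locations where it occurs."""
    lists = defaultdict(list)
    for i, c in enumerate(s):
        lists[c].append(i)
    return lists

def l2_terms(t, p):
    """Return (sum of j^2, sum of i^2, sum of i*j) over k-th to k-th pairs."""
    if sorted(t) != sorted(p):
        raise ValueError("p must be a permutation of t")
    lt, lp = occurrence_lists(t), occurrence_lists(p)
    sq_p = sum(j * j for c in lp for j in lp[c])
    sq_t = sum(i * i for c in lt for i in lt[c])
    cross = sum(i * j for c in lt for i, j in zip(lt[c], lp[c]))
    return sq_p, sq_t, cross

def l2_distance(t, p):
    """Sum of squared displacements, assembled from the three terms."""
    sq_p, sq_t, cross = l2_terms(t, p)
    return sq_p + sq_t - 2 * cross
```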
66L2-Distance Problem: large text
Idea: Consider $list_a(P)$ and $list_a(T(i))$, for each letter a, the list of locations in which a occurs in P and T(i).
The L2-distance between T(i) and P is
$d_{L_2}(T(i), P) = \sum_{a \in P} \sum_{j \in [0, |list(a)|]} (list_a(P)[j] - list_a(T(i))[j])^2$
Now, we want to use $list_a(T)$ instead of $list_a(T(i))$. Now the text location indices are not fixed.
67L2-Distance Problem: large text
Schematically: Fix letter a. (Figure: P aligned at location x of T.)
list(T): i1, i2, ..., ik, ..., ir
list(P): j1, j2, ..., jk
For location x, the text indices need to be i2-x, i3-x, ...
Convolution formula
68L2-Distance Problem: large text
[Figure: the expansion for text location x, with its three terms labeled: convolution, easy to compute, easy to compute.]
69L2-Distance Problem: large alphabet
Problem: What happens when there is an unbounded number of alphabet symbols? How many convolutions?
Let $\Sigma = \{a_1, \dots, a_k\}$ and let $n_i$ be the number of times $a_i$ occurs in T, $i = 1, \dots, k$. Clearly, $n_1 + n_2 + \dots + n_k = n$.
Time: For each symbol $a_i$: $O(n_i \log m)$. Total Time: $\sum_{i=1}^{k} O(n_i \log m) = O(n \log m)$.
Result: An O(n log m) algorithm to solve the L2-distance problem.
70Open Problems
1. Interchange distance faster than O(nm)?
2. Asynchronous communication: different errors in address bits.
3. Different error measures than interchange/block interchange/transposition/reversals for errors arising from address bit errors.
Note: The techniques employed in asynchronous pattern matching have so far proven new and different from traditional pattern matching.
71The End