Title: 2 Dimensional Parameterized Matching
1 2 Dimensional Parameterized Matching
- Carmit Hazay
- Moshe Lewenstein
- Dekel Tsur
2 Outline
- Definitions
- History and Motivation
- Text Preprocessing
- Algorithm Outline
- Pattern Preprocessing
3 Parameterized Matching
- Input: two strings $s$ and $t$, $|s| = |t|$, over alphabets $\Sigma_s$ and $\Sigma_t$.
- $s$ parameterize-matches $t$ if there is a bijection $\pi : \Sigma_s \to \Sigma_t$ such that $\pi(s) = t$.
Example:
s = a a b b b
t = x x y y y
with $\pi(a) = x$ and $\pi(b) = y$.
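The definition above can be checked directly by building the mapping character by character. The following is a minimal sketch, not the authors' method; the function name `p_match` is made up for illustration.

```python
def p_match(s, t):
    """Return the bijection as a dict if s parameterize-matches t, else None."""
    if len(s) != len(t):
        return None
    fwd, bwd = {}, {}  # fwd: character of s -> character of t, bwd: the inverse
    for a, b in zip(s, t):
        # setdefault records the first pairing; every later pairing must agree,
        # in both directions, so that the mapping is one-to-one.
        if fwd.setdefault(a, b) != b or bwd.setdefault(b, a) != a:
            return None
    return fwd


print(p_match("aabbb", "xxyyy"))  # {'a': 'x', 'b': 'y'}
print(p_match("aabbb", "xxyxy"))  # None
```

Keeping both directions of the mapping is what rules out two different characters of s being sent to the same character of t.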
4 1D Parameterized Matching
- Input: two strings $T$ and $P$, $|T| = n$, $|P| = m$.
- Output: all text locations $i$ such that $\pi(P) = T_i \cdots T_{i+m-1}$.
5 2D Parameterized Matching
- Input: text $T$ and pattern $P$, $|T| = n \times n$, $|P| = m \times m$.
- Output: all text locations $(i,j)$ such that $\pi(P) = T_{i,j} \cdots T_{i+m-1,\,j+m-1}$.
- Example:
T =
a b c
a a b
b b b
P =
x y z
x x y
y y y
with $\pi(x) = a$, $\pi(y) = b$, $\pi(z) = c$.
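As a baseline for the 2D problem statement, here is a brute-force sketch that tries every text location and builds the bijection window by window. It is only an illustration under assumed conventions (0-based locations, matrices given as lists of strings, the hypothetical name `p_match_2d_naive`) and runs in O(n^2 m^2) time, far from the bound this talk aims for.

```python
def p_match_2d_naive(T, P):
    """Return all 0-based (i, j) where the m x m window of T p-matches P."""
    n, m = len(T), len(P)
    hits = []
    for i in range(n - m + 1):
        for j in range(n - m + 1):
            fwd, bwd = {}, {}  # pattern char -> text char, and the inverse
            ok = True
            for r in range(m):
                for c in range(m):
                    p, t = P[r][c], T[i + r][j + c]
                    if fwd.setdefault(p, t) != t or bwd.setdefault(t, p) != p:
                        ok = False
                        break
                if not ok:
                    break
            if ok:
                hits.append((i, j))
    return hits


T = ["abc",
     "aab",
     "bbb"]
P = ["xy",
     "xx"]
print(p_match_2d_naive(T, P))  # [(0, 0)]
```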
6 Parameterized Matching History
- Introduced by Brenda Baker [Baker93].
- Two dimensions: [AACLP03], this work.
- Used in scaled matching [ABL99].
- Periodicity of parameterized matching [ApostolicoGiancarlo].
- Approximate parameterized matching [AEL], [HLS04].
- Others: [AFM94], [Bak95], [Bak97].
7 Mismatch Pairs
- A pair of locations at which the characters disagree in the parameterized sense: equal characters in one string face different characters in the other.
- Example:
a a b a a a
x x y x z y
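To make the notion concrete, the sketch below lists every mismatch pair of two equal-length strings by brute force. The name `mismatch_pairs` and the 0-based indexing are assumptions made for illustration.

```python
from itertools import combinations


def mismatch_pairs(s, t):
    """All 0-based pairs (i, j), i < j, that are equal in one string only."""
    return [
        (i, j)
        for i, j in combinations(range(len(s)), 2)
        if (s[i] == s[j]) != (t[i] == t[j])
    ]


print(mismatch_pairs("aabaaa", "xxyxzy"))
# [(0, 4), (0, 5), (1, 4), (1, 5), (2, 5), (3, 4), (3, 5), (4, 5)]
print(mismatch_pairs("aabbb", "xxyyy"))  # []
```

Two strings of equal length p-match exactly when this list is empty, which is why a single mismatch pair is enough to reject a location.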
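8 1D Encoding
- Encode every text location by its predecessor location: the first occurrence of the same character to its left (0 if there is none).
- Example ("first a to its left"):
T = a b a d d a b d b c b d a a b d a a a a b b b
Locations: 1 3 6 13 14 15 16 17 18
Encoded T: 0 1 3 6 13 14 15 16 17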
9 1D Encoding
- Two p-matching strings have the same encoded texts.
S         = a b b c b a a c b b c b a
Encoded S = 0 0 2 0 3 1 6 4 5 9 8 10 7
T         = x y y z y x x z y y z y x
Encoded T = 0 0 2 0 3 1 6 4 5 9 8 10 7
10 1D Encoding
- Two strings p-match iff their encoded strings match.
- Reduction to the exact matching problem.
S         = a b b c b b a c b b c b a
Encoded S = 0 0 2 0 3 5 1 4 6 9 8 10 7
T         = x y y z y x x z y y z y x
Encoded T = 0 0 2 0 3 1 6 4 5 9 8 10 7
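The predecessor encoding is short to write down. The sketch below uses 1-based locations with 0 for "no predecessor", as on the slides; the names `pred_encode` and `p_match_locations` are hypothetical. The second function assumes one standard way of handling substrings: a text pointer that falls before the start of the window is treated as 0. It compares windows naively in O(nm) time rather than calling a linear-time exact matcher.

```python
def pred_encode(s):
    """Replace each character by the 1-based location of its previous occurrence."""
    last, out = {}, []
    for pos, ch in enumerate(s, start=1):
        out.append(last.get(ch, 0))  # 0 means no earlier occurrence
        last[ch] = pos
    return out


def p_match_locations(T, P):
    """1-based locations i where P p-matches T[i..i+m-1], via the encodings."""
    et, ep, m = pred_encode(T), pred_encode(P), len(P)
    hits = []
    for i in range(1, len(T) - m + 2):
        # A pointer q >= i lies inside the window and becomes q - i + 1;
        # a pointer before the window start is clipped to 0.
        if all((et[i - 1 + k] - i + 1 if et[i - 1 + k] >= i else 0) == ep[k]
               for k in range(m)):
            hits.append(i)
    return hits


S, T = "abbcbaacbbcba", "xyyzyxxzyyzyx"
print(pred_encode(S))                                   # [0, 0, 2, 0, 3, 1, 6, 4, 5, 9, 8, 10, 7]
print(pred_encode(S) == pred_encode(T))                 # True
print(pred_encode("abbcbbacbbcba") == pred_encode(T))   # False
print(p_match_locations("aabab", "xyx"))                # [2, 3]
```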
11 2D Mismatch Pairs
- Same as 1D mismatch pairs, but with 2D strings.
- Example:
a b a
b a b
b a b

x y x
y y y
y y y
12 2D Encoding
- First idea: encode the linearization of the text and of the pattern.
- Overflow problem!
[Figure: predecessor pointers of a and b in the linearized text; labels "Different character than b" and "Different character than a".]
13 2D Encoding
- Second idea: use strips.
- Strip: a substring of $T$ of size $n \times m$.
- The $i$-th strip of $T$ is the $n \times m$ substring $T_{1..n,\ i..i+m-1}$.
- Encode the linearization of each strip.
- Why is the overflow problem solved? Every predecessor is within an $m \times m$ window.
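A strip's predecessor pointers can be computed the same way as in 1D once the strip is linearized. The sketch below assumes a row-by-row linearization, 0-based coordinates, and `None` for "no predecessor in the strip"; these conventions and the name `strip_predecessors` are illustrative choices, not taken from the slides.

```python
def strip_predecessors(T, i, m):
    """Predecessor pointers for the strip made of columns i..i+m-1 of T."""
    last, pred = {}, {}
    for r in range(len(T)):          # walk the strip row by row
        for c in range(i, i + m):    # only the m columns of this strip
            ch = T[r][c]
            pred[(r, c)] = last.get(ch)  # previous cell holding the same character
            last[ch] = (r, c)
    return pred


T = ["abc",
     "aab",
     "bbb"]
pred = strip_predecessors(T, 0, 2)
print(pred[(1, 0)])  # (0, 0)
print(pred[(2, 0)])  # (0, 1)
```

Running this independently for every strip costs O(nm) per strip, so O(n^2 m) overall.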
14 Text Preprocessing
- For the first strip, compute predecessors on its linearization.
- How to compute predecessors for the rest of the strips?
- First solution: do the same as above.
- Time: $O(n^2 m)$.
- Can we do better?
15 Update Strips
- Yes: exploit information from previous strips.
- When moving from strip $i$ to strip $i+1$, update only the $O(n)$ pointers of the first and last columns.
- Time: $O(n^2)$.
16 Check Predecessors
- Are we done?
- No!
- Need to check every predecessor against every text location that contains it.
- Worst case: $O(n^2 m^2)$.
- How to improve?
17 Algorithm Outline
- Use the duel-and-sweep paradigm.
- Find candidates (dueling).
- Divide candidates by strips.
- Update predecessors of every new strip.
- Check new predecessors (sweep).
- Assume the pattern witness table is given.
18 Witness
- Witness: a mismatch pair between $P$ and its alignment to location $(a,b)$.
[Figure: $P$ aligned to location $(a,b)$.]
19 Set Candidates
- Using duels: of two text locations with a witness, one can be eliminated.
- Apply the algorithm of [ABF94] and return the list of candidates.
- Time: $O(n^2)$.
20 Sweep Technique
- Observation: all candidates agree with each other.
- Hence: a mismatch pair eliminates all candidates containing it.
- Therefore: for every predecessor, it is enough to find one candidate that contains it.
21 Sweep Technique
- How to find?
- Create a new $n \times n$ array $A$ such that $A_{i,j}$ is the largest row among candidates that start at column $j$ and overlap row $i$.
22 Sweep Technique
- For every predecessor pair $(i,j)$, $(x,y)$, use a range minimum query to find the highest candidate containing it.
- In case of a mismatch pair, eliminate all candidates containing it. How?
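One possible reading of the array and the query is sketched below. It takes "highest" to mean the smallest row index, which is an assumption about the slides' convention, and it replaces the range minimum data structure with a plain `min` over the column range, so it does not achieve the stated running time. The names `build_A` and `covering_candidate` are hypothetical, and coordinates are 0-based.

```python
INF = float("inf")


def build_A(cand, m):
    """A[i][j] = topmost row r with cand[r][j] set and r <= i <= r + m - 1."""
    n = len(cand)
    A = [[INF] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            for r in range(max(0, i - m + 1), i + 1):
                if cand[r][j]:
                    A[i][j] = r  # rows are scanned top-down, first hit is topmost
                    break
    return A


def covering_candidate(A, m, p, q):
    """Return a candidate (r, c) whose m x m window contains p and q, or None."""
    (i, j), (x, y) = sorted([p, q])    # after sorting, i <= x
    lo = max(0, max(j, y) - m + 1)     # leftmost start column covering both points
    hi = min(j, y)                     # rightmost start column covering both points
    best = min(((A[x][c], c) for c in range(lo, hi + 1)), default=(INF, -1))
    # A[x][c] already overlaps row x; it also overlaps row i only if it is <= i.
    return best if best[0] <= i else None


n, m = 4, 2
cand = [[False] * n for _ in range(n)]
for r, c in [(0, 0), (1, 0), (2, 1)]:
    cand[r][c] = True
A = build_A(cand, m)
print(covering_candidate(A, m, (0, 1), (1, 0)))  # (0, 0)
print(covering_candidate(A, m, (0, 0), (2, 0)))  # None
```

Because all surviving candidates agree with each other, finding this single covering candidate is enough to decide the fate of every candidate that contains the pair.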
23 Sweep Technique
- Use a mismatch vector.
- Every mismatch pair translates into a range.
- For new predecessors, add mistake predecessors and delete old ones.
- All candidates within this range are eliminated.
- Time: $O(n^2)$.
24 Sweep Technique
- Observation: $T$ p-matches $P$ if and only if (1) no text location and its predecessor form a mismatch pair, and (2) the number of distinct characters in $P$ and in $T$ is equal.
- Left to do? Count distinct characters for every candidate.
- Use the algorithm of Amir Cole Dar Church; time $O(n^2)$.
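The observation can be illustrated on a single pair of equal-length strings. The sketch below is a 1D simplification with a hypothetical function name; it counts distinct characters with Python sets instead of the cited counting algorithm.

```python
def p_match_by_observation(t, p):
    """Check the two conditions of the observation for equal-length t and p."""
    last = {}
    for k, ch in enumerate(t):
        q = last.get(ch)  # predecessor of location k in t, if any
        if q is not None and p[q] != p[k]:
            return False  # (q, k) is a mismatch pair
        last[ch] = k
    # Equal characters of t now face equal characters of p; equal alphabet
    # sizes make the induced mapping one-to-one.
    return len(set(t)) == len(set(p))


print(p_match_by_observation("xxyyy", "aabbb"))  # True
print(p_match_by_observation("xyx", "aaa"))      # False
print(p_match_by_observation("xxy", "aba"))      # False
```

The second call shows why the count is needed: no predecessor pair of the text is violated, yet two text characters collapse onto one pattern character.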
25 Overview
Checking all predecessors takes linear time.
Total time: $O(n^2)$.
26 Pattern Preprocessing
- Find the witness table for $P$ in time $O(m^{2.5}\,\mathrm{polylog}\,m)$.
- For every pattern location $(i,j)$, create a list of size O( ) pointers.
- Pointer $i$ is the predecessor of lines above $(i,j)$.
- Reduce to exact matching with don't cares.
27 Pattern Preprocessing
- End cases, multiple cases.
[Figure: end cases; labels "Less than", A1 to A4, B1 to B4.]
28 Open Questions
- Can the time complexity of the algorithm be reduced to $O(n^2 + m^2)$?