Title: Approximate Matching of Run-Length Compressed Strings
1 Approximate Matching of Run-Length Compressed Strings
- Veli Mäkinen
- Gonzalo Navarro
- Esko Ukkonen
Department of Computer Science, University of Helsinki, Finland
Department of Computer Science, University of Chile, Santiago, Chile
2 The Problem
- Run-length coding of a string: RL(aaaabbbbbbccaaabb) = a4b6c2a3b2.
- Problem: given two run-length encoded strings, calculate their edit distance.
- The trivial solution is to decompress and then calculate, but can it be done faster?
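The encoding in the example above can be reproduced with a short Python sketch. The function names `rl_encode`, `rl_decode` and `rl_format` are illustrative and do not come from the slides.

```python
from itertools import groupby


def rl_encode(s):
    """Return the run-length encoding of s as (letter, run_length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def rl_decode(runs):
    """Expand (letter, run_length) pairs back into the plain string."""
    return "".join(ch * length for ch, length in runs)


def rl_format(runs):
    """Write the runs in the compact notation used on the slide."""
    return "".join(f"{ch}{length}" for ch, length in runs)


runs = rl_encode("aaaabbbbbbccaaabb")
print(rl_format(runs))  # a4b6c2a3b2
```

The trivial solution mentioned above corresponds to calling `rl_decode` on both inputs and running an ordinary dynamic programming algorithm on the results.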
3 Variations of the scheme
- Different variations of edit distance.
- Levenshtein distance: minimum number of character insertions, deletions, and substitutions needed to convert one string into another.
- Longest common subsequence (LCS): the distance D_ID, the minimum number of insertions and deletions, is a dual problem for the LCS: 2·LCS(A,B) = m + n - D_ID(A,B).
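The two distances and the duality with the LCS can be checked on uncompressed strings with the textbook dynamic programming recurrences. This is a minimal sketch; the function names and the `allow_substitution` flag are assumptions made for the illustration.

```python
def edit_distance(a, b, allow_substitution=True):
    """Levenshtein distance, or the indel distance D_ID if substitutions are disabled."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            if x == y:
                best = prev[j - 1]                      # match: no cost
            else:
                best = 1 + min(prev[j], cur[j - 1])     # deletion or insertion
                if allow_substitution:
                    best = min(best, 1 + prev[j - 1])   # substitution
            cur.append(best)
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of a longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


A, B = "aaaaabbbbccccaa", "aaabbbbaaaa"
d_id = edit_distance(A, B, allow_substitution=False)
# Duality: 2 * LCS(A, B) = |A| + |B| - D_ID(A, B)
assert 2 * lcs_length(A, B) == len(A) + len(B) - d_id
```

For the two example strings used in the later slides this gives D_ID = 8, LCS = 9 and Levenshtein distance 6.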
4 Variations of the scheme (2)
- Search problem: find all approximate occurrences of a short pattern P inside a large text T, where both the pattern and the text are run-length encoded.
- By an approximate occurrence we mean that the (edit) distance between the pattern and a substring of the text is at most some threshold value k.
5 Previous Results
- Bunke & Csirik, 95: O(m'n + mn') for calculating the LCS between two strings of lengths m and n, run-length encoded to lengths m' and n'.
- Apostolico, Landau & Skiena, 97: O(m'n' log(m'n')) for the LCS.
- Mitchell, 97: O((m' + n' + p) log(m' + n' + p)) for the LCS, where p is the number of matches between compressed characters.
6 Our Results
- The first algorithm for the Levenshtein distance in this context: O(m'n + mn'), by generalizing the result of Bunke & Csirik for the LCS.
- Search algorithm for Levenshtein and LCS distances: O(mm'n').
Independently, Arbell, Landau & Mitchell found a similar algorithm.
7 Our Results (2)
- O(min(d^2 min(m',n'), m'n' max(m',n'))) for the LCS, where d = m + n - 2·LCS(A,B).
- Conjecture: O(m'n') average case for the LCS.
- Experimental results to support the conjecture.
8 Bunke & Csirik algorithm for the LCS
Rows: aaaaabbbbccccaa. Columns: aaabbbbaaaa. Only the values on the box borders are shown.
        a   a   a   b   b   b   b   a   a   a   a
    0   1   2   3   4   5   6   7   8   9  10  11
a   1           2               6              10
a   2           1               5               9
a   3           0               4               8
a   4           1               5               7
a   5   4   3   2   3   4   5   6   5   4   5   6
b   6           3               5               7
b   7           4               4               8
b   8           5               3               7
b   9   8   7   6   5   4   3   2   3   4   5   6
c  10           7               3               7
c  11           8               4               8
c  12           9               5               9
c  13  12  11  10   9   8   7   6   7   8   9  10
a  14          11               7               9
a  15  14  13  12  11  10   9   8   7   6   7   8
Equal letter box: values from the diagonal.
Different letter box: minimum of two values (value from the left border + distance, value from the upper border + distance).
2·LCS(A,B) = m + n - D_ID(A,B) => LCS(aaaaabbbbccccaa, aaabbbbaaaa) = (15 + 11 - 8)/2 = 9
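The two box rules can be turned into a short Python sketch that fills in only the border cells of each box. It is an illustration under stated assumptions, not the authors' implementation: the names `run_bounds` and `indel_border_values` are hypothetical, the inputs are given as plain strings for readability, and the values are kept in a dictionary keyed by (row, column).

```python
from itertools import groupby


def run_bounds(s):
    """Return (letter, start, end) per run; the run covers positions start+1..end."""
    bounds, pos = [], 0
    for ch, group in groupby(s):
        length = len(list(group))
        bounds.append((ch, pos, pos + length))
        pos += length
    return bounds


def indel_border_values(a, b):
    """Compute D_ID values on the bottom and right borders of every box."""
    D = {(0, j): j for j in range(len(b) + 1)}
    D.update({(i, 0): i for i in range(len(a) + 1)})
    for ca, i0, i1 in run_bounds(a):            # boxes in row-major order
        for cb, j0, j1 in run_bounds(b):
            border = [(i1, j) for j in range(j0 + 1, j1 + 1)]   # bottom border
            border += [(i, j1) for i in range(i0 + 1, i1)]      # right border
            for i, j in border:
                if ca == cb:
                    # Equal letter box: follow the diagonal back to the top or left border.
                    k = min(i - i0, j - j0)
                    D[i, j] = D[i - k, j - k]
                else:
                    # Different letter box: best of "from the upper border" and "from the left border".
                    D[i, j] = min(D[i0, j] + (i - i0), D[i, j0] + (j - j0))
    return D


D = indel_border_values("aaaaabbbbccccaa", "aaabbbbaaaa")
print(D[15, 11])  # 8
```

Each border cell costs constant time, and the borders contain m' rows of n cells plus n' columns of m cells, which matches the O(m'n + mn') bound quoted for this algorithm.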
9 O(m'n + mn') for the Levenshtein distance
Rows: aaaaabbbbccccaa. Columns: aaabbbbaaaa. Only the values on the box borders are shown.
        a   a   a   b   b   b   b   a   a   a   a
    0   1   2   3   4   5   6   7   8   9  10  11
a   1           2               6              10
a   2           1               5               9
a   3           0               4               8
a   4           1               4               7
a   5   4   3   2   2   2   3   4   4   4   5   6
b   6           3               3               6
b   7           4               2               6
b   8           5               2               6
b   9   8   7   6   5   4   3   2   3   4   5   6
c  10           7               3               6
c  11           8               4               6
c  12           9               5               6
c  13  12  11  10   9   8   7   6   6   6   6   6
a  14          11               7               6
a  15  14  13  12  11  10   9   8   7   6   6   6
Equal letter boxes are calculated as in the LCS.
Different letter boxes are more difficult...
10-18 O(m'n + mn') for the Levenshtein distance (2)-(10)
[Animation over nine slides: the bottom and right borders of a different letter box are filled in one value at a time.]
X = min(mintop + t, minleft + l), where t and l are the distances of X from the top and left borders.
Example box: top border 3 2 1 0 1 2, left border 3 4 5 5 4.
The min array on each slide holds the current minimum followed by the number of occurrences of each value 0..3 in the part of the top border being considered.
Slide   t   l   mintop   minleft   X   counts of 0 1 2 3
11      1   5   1        3         2   0 1 1 0
12      2   5   0        3         2   1 1 1 0
13      3   5   0        3         3   1 2 1 0
14      4   5   0        3         4   1 2 2 0
15      4   4   0        3         4   1 2 1 1
16      4   3   0        4         4   1 1 1 1
17      4   2   1        4         5   0 1 1 1
18      4   1   2        4         5   0 0 1 1
19 O(m'n + mn') for the Levenshtein distance (11)
Different letter boxes can be calculated as fast as equal letter boxes.
3 2 1 0 1 2
4         2
5         2
5         3
4 5 5 4 4 4
m' rows with n cells + n' columns with m cells => O(m'n + mn').
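The rule X = min(mintop + t, minleft + l) for a different letter box can be sketched as follows. The window bounds are an assumption of this sketch: mintop is taken over the t+1 top-border values ending directly above X, and minleft over the l+1 left-border values ending directly to the left of X, both clipped at the box corner. For clarity each minimum is recomputed with `min()` here, whereas the slides maintain it incrementally while X moves along the border. The function name is hypothetical.

```python
def different_box_borders(top, left):
    """Levenshtein values on the bottom and right borders of a different letter box.

    top:  values on the top border, left to right (top[0] is the top-left corner)
    left: values on the left border, top to bottom (left[0] == top[0])
    """
    width, height = len(top) - 1, len(left) - 1

    def cell(t, l):
        # t = distance from the top border, l = distance from the left border
        mintop = min(top[max(0, l - t): l + 1])
        minleft = min(left[max(0, t - l): t + 1])
        return min(mintop + t, minleft + l)

    bottom = [cell(height, l) for l in range(width + 1)]
    right = [cell(t, width) for t in range(height + 1)]
    return bottom, right


# Border values of the example box from the slides.
bottom, right = different_box_borders([3, 2, 1, 0, 1, 2], [3, 4, 5, 5, 4])
print(bottom)  # [4, 5, 5, 4, 4, 4]
print(right)   # [2, 2, 2, 3, 4]
```

The windows come from the fact that, inside a box where every letter pair mismatches, moving t rows down and up to t columns right costs exactly t, so only the last t+1 top-border values can be the cheapest starting point from above.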
20 Approximate searching
- Can be done by setting the first row to zero, but the time complexity is then O(m'n + mn').
- If all runs in the text are shorter than 2m, then n < 2mn', and the time complexity is O(mm'n').
- If a run is longer than 2m-1, only the first 2m columns need to be calculated; the rest are equal to the last column. => O(mm'n') search algorithm.
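The "first row set to zero" idea can be sketched on uncompressed strings as below. This is the plain cell-by-cell search, not the run-length algorithm itself, and the function names are hypothetical. It is included to show the zero first row and to let the claim about long runs be observed on a small input.

```python
def search_columns(pattern, text):
    """Return all dynamic programming columns for approximate search.

    Column j holds, for every pattern prefix, the smallest Levenshtein distance
    to a substring of text ending at position j (the first row is always zero).
    """
    col = list(range(len(pattern) + 1))
    columns = [col]
    for y in text:
        new = [0]                                  # an occurrence may start anywhere
        for i, x in enumerate(pattern, 1):
            cost = 0 if x == y else 1
            new.append(min(col[i - 1] + cost, col[i] + 1, new[i - 1] + 1))
        columns.append(new)
        col = new
    return columns


def approximate_search(pattern, text, k):
    """End positions (1-based) of substrings within distance k of pattern."""
    m = len(pattern)
    columns = search_columns(pattern, text)
    return [j for j in range(1, len(columns)) if columns[j][m] <= k]


print(approximate_search("aab", "aaaabbb", 0))  # [5]

# A run of 20 a's in the text: with m = 3, the columns stop changing
# within the first 2m = 6 columns of the run.
cols = search_columns("aab", "b" + "a" * 20)
assert all(c == cols[7] for c in cols[7:])
```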
21 Greedy algorithm for the LCS
Idea: calculate only corners.
             aaa bbbb aaaa
          0    3    7   11
aaaaa     5    2    6    6
bbbb      9    6    2    6
cccc     13   10    6   10
aa       15   12    8    8
Different letter boxes are easy: a corner value can be calculated in constant time from the corners above and on the left.
Equal letter boxes: corner values can be traced.
Time complexity is O(min(m'n'(m' + n'), m'n + mn')).
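The corner-only idea can be sketched with a memoized recursive function. This is a simplified illustration and not the authors' implementation: a value is looked up by jumping along the diagonal in an equal letter box, or by combining the points straight above and straight to the left on the box borders in a different letter box. The name `corner_values` is hypothetical, and the recursion depth grows with the tracing path, so very long inputs would need an iterative version.

```python
from bisect import bisect_left
from functools import lru_cache
from itertools import accumulate, groupby


def corner_values(a, b):
    """Return the D_ID values at all box corners, tracing only the cells needed."""
    runs_a = [(ch, len(list(g))) for ch, g in groupby(a)]
    runs_b = [(ch, len(list(g))) for ch, g in groupby(b)]
    ends_a = list(accumulate(length for _, length in runs_a))  # last row of each run
    ends_b = list(accumulate(length for _, length in runs_b))  # last column of each run

    @lru_cache(maxsize=None)
    def value(i, j):
        if i == 0 or j == 0:
            return i + j
        ra = bisect_left(ends_a, i)              # run of a that contains row i
        rb = bisect_left(ends_b, j)              # run of b that contains column j
        i0 = ends_a[ra] - runs_a[ra][1]          # top border of the box
        j0 = ends_b[rb] - runs_b[rb][1]          # left border of the box
        if runs_a[ra][0] == runs_b[rb][0]:
            # Equal letter box: trace back along the diagonal to the border.
            k = min(i - i0, j - j0)
            return value(i - k, j - k)
        # Different letter box: come straight from the top or from the left border.
        return min(value(i0, j) + (i - i0), value(i, j0) + (j - j0))

    return [[value(i, j) for j in [0] + ends_b] for i in [0] + ends_a]


for row in corner_values("aaaaabbbbccccaa", "aaabbbbaaaa"):
    print(row)
```

For the example strings this reproduces the corner table of the slide, with 8 in the bottom-right corner.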
22 Diagonal algorithm for the LCS
Calculate first inside a diagonal band 0,...,(n-m).
             aaa bbbb aaaa
          0    3    7   11
aaaaa     5    2    2    6
bbbb      9    6    2    2
cccc     13   10    6    6
aa       15   12    8    6
Shortest path that goes outside this band has cost > d = (n-m)+1.
If d_mn ≤ d, then the band is wide enough.
If d_mn > d, then double d and increase the band so that the shortest path that goes outside this band has cost > d.
23 Diagonal algorithm for the LCS (2)
At the beginning, d = 4+1 = 5.
As d_mn = 6 > 5, we have to double d.
             aaa bbbb aaaa
          0    3    7   11
aaaaa     5    2    4    6
bbbb      9    6    2    4
cccc     13   10    6    8
aa       15   12    8    8
After the first doubling, d = 10 and the diagonal band is -3,...,7.
As d_mn = 8 < d, the band is wide enough.
=> We can stop the doubling.
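The doubling strategy can be illustrated on uncompressed strings. Note that this sketch swaps in a simpler, cell-level technique: it works on individual cells rather than on box corners, and it uses the symmetric band |i - j| ≤ d instead of the band around the diagonals 0,...,(n-m) described in the slides. The function names are hypothetical.

```python
def banded_indel_distance(a, b, d):
    """D_ID restricted to paths that stay inside the band |i - j| <= d."""
    INF = float("inf")
    n = len(b)
    prev = [j if j <= d else INF for j in range(n + 1)]
    for i in range(1, len(a) + 1):
        cur = [INF] * (n + 1)
        for j in range(max(0, i - d), min(n, i + d) + 1):
            if j == 0:
                cur[j] = i
                continue
            best = 1 + min(prev[j], cur[j - 1])     # deletion or insertion
            if a[i - 1] == b[j - 1]:
                best = min(best, prev[j - 1])       # match along the diagonal
            cur[j] = best
        prev = cur
    return prev[n]


def indel_distance_by_doubling(a, b):
    """Double the threshold d until the banded result is at most d."""
    d = abs(len(a) - len(b)) + 1
    while True:
        result = banded_indel_distance(a, b, d)
        if result <= d:
            # A path of cost <= d never leaves the band, so the result is exact.
            return result, d
        d *= 2


print(indel_distance_by_doubling("aaaaabbbbccccaa", "aaabbbbaaaa"))  # (8, 10)
```

As in the example on the slides, the first threshold d = 5 is too small for the example strings, and after one doubling the result 8 is accepted with d = 10.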
24 Diagonal algorithm for the LCS (3)
Time complexity is O(min(d^2 min(m',n'), m'n' max(m',n')))
             aaa bbbb aaaa
          0    3    7   11
aaaaa     5    2    4    6
bbbb      9    6    2    4
cccc     13   10    6    8
aa       15   12    8    8
- The number of diagonals after the last doubling is ≤ d+1.
- The sum of diagonals before the last doubling is < d+1.
- Each diagonal has at most O(min(m',n')) corners.
- The length of each tracing path can be limited by 2d.
25 Improving the greedy algorithm (1): Skipping different letter boxes
Observation: runs of different letter boxes can be skipped in tracing paths. => Average case O(m'n' max(m',n')/σ^2).
             aaa bbbb aaaa
          0    3    7   11
aaaaa     5    2    6    6
bbbb      9    6    2    6
cccc     13   10    6   10
aa       15   12    8    8
26 Improving the greedy algorithm (2): Bridges
Observation: in different letter boxes, all values are known on the bottom or on the right border. => Tracing a corner value can be stopped at a bridge.
[Figure: the corner table of the example, extended with the known border values (bridges) of the different letter boxes.]
             aaa bbbb aaaa
          0    3    7   11
aaaaa     5    2    6    6
bbbb      9    6    2    6
cccc     13   10    6   10
aa       15   12    8    8
27 O(m'n') average case?
- Conjecture: assuming that run-lengths are equally distributed in both strings with the same mean, the expected running time of the greedy algorithm using the bridge property is O(m'n').
- Experimental results support the conjecture: for randomly generated data the average tracing path length was < 2.
28 Experimental results (1): Lengths of the runs
m' = n' = 2000, σ = 2
29Experimental results (2) m? n
m' = 2000, σ = 2, runs in [1,1000]
30 Experimental results (3): σ
m' = n' = 2000, runs in [1,1000]
31 Experimental results (4): B = k random insertions/deletions on A
m' = n' = 2000, runs in [1,1000]
32 Experimental results (5): Real data. All pairs of lines in three black/white images
33 References
A. Apostolico, G. Landau, and S. Skiena. Matching for run-length encoded strings. J. of Complexity, 15:4--16, 1999. (Also at Sequences '97, Positano, Italy, June 11--13, 1997).
O. Arbell, G. Landau, and J. Mitchell. Edit distance of run-length encoded strings. Submitted for publication, August 2000.
H. Bunke and J. Csirik. An algorithm for matching run-length coded strings. Computing, 50:297--314, 1993.
H. Bunke and J. Csirik. An improved algorithm for computing the edit distance of run-length coded strings. Information Processing Letters, 54(2):93--96, 1995.
V. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 6:707--710, 1966.
J. Mitchell. A geometric shortest path problem, with application to computing a longest common subsequence in run-length encoded strings. Technical Report, Dept. of Applied Mathematics, SUNY Stony Brook, 1997.
34 References...
P. Sellers. The theory and computation of evolutionary distances: Pattern recognition. J. of Algorithms, 1(4):359--373, 1980.
E. Ukkonen. Algorithms for approximate string matching. Information and Control, 64(1--3):100--118, 1985.
E. Ukkonen. Finding approximate patterns in strings. J. of Algorithms, 6(1--3):132--137, 1985.
R. Wagner and M. Fischer. The string-to-string correction problem. J. of the ACM, 21(1):168--173, 1974.