Approximate Matching of Run-Length Compressed Strings

1
Approximate Matching of Run-Length Compressed
Strings
  • Veli Mäkinen
  • Gonzalo Navarro
  • Esko Ukkonen

Department of Computer Science, University of Helsinki, Finland
Department of Computer Science, University of Chile, Santiago, Chile
2
The Problem
  • Run-length coding of a string:
    RL(aaaabbbbbbccaaabb) = a4b6c2a3b2.
  • Problem: Given two run-length encoded strings,
    calculate their edit distance.
  • The trivial solution is to decompress and then
    calculate, but can it be done faster?
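
The trivial solution can be sketched as follows. This is only a baseline for comparison, not one of the algorithms of the talk; the function names are illustrative, and runs are assumed to be stored as (character, length) pairs.

```python
from itertools import groupby

def rl_encode(s):
    """Run-length encode a string into (character, run length) pairs."""
    return [(ch, sum(1 for _ in group)) for ch, group in groupby(s)]

def rl_decode(runs):
    """Expand (character, run length) pairs back into a plain string."""
    return "".join(ch * length for ch, length in runs)

def levenshtein(a, b):
    """Classical O(mn) dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match / substitution
        prev = cur
    return prev[-1]

def edit_distance_trivial(runs_a, runs_b):
    """Decompress both inputs, then run the uncompressed algorithm."""
    return levenshtein(rl_decode(runs_a), rl_decode(runs_b))
```

The cost is O(mn) in the uncompressed lengths, regardless of how few runs the inputs have.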

3
Variations of the scheme
  • Different variations of edit distance.
  • Levenshtein distance: the minimum number of character
    insertions, deletions, and substitutions needed to
    convert one string into another.
  • Longest common subsequence (LCS): the distance D_ID,
    the minimum number of insertions and deletions, is the
    dual problem of the LCS: 2 LCS(A,B) = m + n - D_ID(A,B).
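
The duality between the LCS and the insertion/deletion distance can be checked with two plain dynamic programs on uncompressed strings. This is a sketch with illustrative function names, not code from the talk.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def indel_distance(a, b):
    """D_ID: minimum number of insertions and deletions (no substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] if ca == cb else min(prev[j], cur[j - 1]) + 1)
        prev = cur
    return prev[-1]

A, B = "aaaaabbbbccccaa", "aaabbbbaaaa"
# Both sides of 2 LCS(A,B) = m + n - D_ID(A,B)
lhs = 2 * lcs_length(A, B)
rhs = len(A) + len(B) - indel_distance(A, B)
```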

4
Variations of the scheme (2)
  • Search problem: Find all approximate
    occurrences of a short pattern P inside a large
    text T, where both the pattern and the text are
    run-length encoded.
  • By an approximate occurrence we mean that the
    (edit) distance between the pattern and a
    substring of the text is at most some threshold
    value k.
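
On uncompressed strings, the search problem is classically solved by the same dynamic program with the first row set to zero, so that an occurrence may start anywhere in the text. The sketch below assumes Levenshtein distance and reports 1-based end positions; the function name is illustrative.

```python
def approx_search(pattern, text, k):
    """End positions j (1-based) where some substring of text ending at j
    is within Levenshtein distance k of pattern."""
    m = len(pattern)
    col = list(range(m + 1))      # column for the empty text prefix
    ends = []
    for j, ch in enumerate(text, 1):
        new = [0]                 # first row is zero: a match may start anywhere
        for i in range(1, m + 1):
            new.append(min(col[i - 1] + (pattern[i - 1] != ch),
                           col[i] + 1,
                           new[i - 1] + 1))
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends
```

For example, `approx_search("aab", "bbaabbb", 0)` reports only the exact occurrence, while k = 1 also reports the neighbouring end positions.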

5
Previous Results
  • Bunke & Csirik, 95: O(m'n + mn') for calculating
    the LCS between two strings of lengths m and n,
    run-length encoded to lengths m' and n'.
  • Apostolico, Landau & Skiena, 97: O(m'n'
    log(m'n')) for the LCS.
  • Mitchell, 97: O((m'+n'+p) log(m'+n'+p)) for the
    LCS, where p is the number of matches between
    compressed characters.

6
Our Results
  • The first algorithm for the Levenshtein distance
    in this context: O(m'n + mn'), by generalizing the
    result of Bunke & Csirik for the LCS.
  • Search algorithm for Levenshtein and LCS
    distances: O(mm'n').

Independently, Arbell, Landau & Mitchell found a
similar algorithm.
7
Our Results (2)
  • O(min(d^2 min(m',n'), m'n' max(m',n'))) for the LCS,
    where d = m + n - 2 LCS(A,B).
  • Conjecture: O(m'n') average case for the LCS.
  • Experimental results to support the conjecture.

8
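Bunke & Csirik algorithm for the LCS
Rows follow A = aaaaabbbbccccaa, columns follow B = aaabbbbaaaa.
Only the borders of the boxes are calculated (other cells are shown as '.').

       a  a  a  b  b  b  b  a  a  a  a
    0  1  2  3  4  5  6  7  8  9 10 11
a   1  .  .  2  .  .  .  6  .  .  . 10
a   2  .  .  1  .  .  .  5  .  .  .  9
a   3  .  .  0  .  .  .  4  .  .  .  8
a   4  .  .  1  .  .  .  5  .  .  .  7
a   5  4  3  2  3  4  5  6  5  4  5  6
b   6  .  .  3  .  .  .  5  .  .  .  7
b   7  .  .  4  .  .  .  4  .  .  .  8
b   8  .  .  5  .  .  .  3  .  .  .  7
b   9  8  7  6  5  4  3  2  3  4  5  6
c  10  .  .  7  .  .  .  3  .  .  .  7
c  11  .  .  8  .  .  .  4  .  .  .  8
c  12  .  .  9  .  .  .  5  .  .  .  9
c  13 12 11 10  9  8  7  6  7  8  9 10
a  14  .  . 11  .  .  .  7  .  .  .  9
a  15 14 13 12 11 10  9  8  7  6  7  8

Equal-letter box: values are copied from the diagonal.
Different-letter box: minimum of two values
(value from the left border + distance, value from
the upper border + distance).
2 LCS(A,B) = m + n - D_ID(A,B) =>
LCS(aaaaabbbbccccaa, aaabbbbaaaa) = (15 + 11 - 8)/2 = 9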
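
The two box rules above can be sketched as follows for the insertion/deletion distance D_ID. This is a reading of the slide, not the authors' code: names are illustrative, runs are assumed to be (character, length) pairs, and for simplicity the sketch keeps every border row and column in memory.

```python
def indel_distance_rle(runs_a, runs_b):
    """D_ID of two run-length encoded strings, filling box borders only."""
    m = sum(length for _, length in runs_a)
    n = sum(length for _, length in runs_b)
    row = {0: list(range(n + 1))}   # row[r][j] = D(r, j) for border rows r
    col = {0: list(range(m + 1))}   # col[c][i] = D(i, c) for border columns c
    r0 = 0
    for ch_a, len_a in runs_a:
        r1 = r0 + len_a
        row[r1] = [r1] + [None] * n
        c0 = 0
        for ch_b, len_b in runs_b:
            c1 = c0 + len_b
            if r0 == 0:
                col[c1] = [c1] + [None] * m
            top, left = row[r0], col[c0]

            def cell(i, j):
                if ch_a == ch_b:
                    # equal letters: copy along the diagonal from the
                    # top or left border of this box
                    k = min(i - r0, j - c0)
                    return top[j - k] if i - k == r0 else left[i - k]
                # different letters: enter from the border cell straight
                # above or straight to the left, paying the distance
                return min(top[j] + (i - r0), left[i] + (j - c0))

            for j in range(c0 + 1, c1 + 1):   # bottom border of the box
                row[r1][j] = cell(r1, j)
            for i in range(r0 + 1, r1 + 1):   # right border of the box
                col[c1][i] = cell(i, c1)
            c0 = c1
        r0 = r1
    return row[m][n]
```

Each box costs time proportional to its height plus its width, which sums to O(m'n + mn') over all boxes.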
9
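O(m'n + mn') for the Levenshtein distance
Rows follow A = aaaaabbbbccccaa, columns follow B = aaabbbbaaaa.
Only the borders of the boxes are calculated (other cells are shown as '.').

       a  a  a  b  b  b  b  a  a  a  a
    0  1  2  3  4  5  6  7  8  9 10 11
a   1  .  .  2  .  .  .  6  .  .  . 10
a   2  .  .  1  .  .  .  5  .  .  .  9
a   3  .  .  0  .  .  .  4  .  .  .  8
a   4  .  .  1  .  .  .  4  .  .  .  7
a   5  4  3  2  2  2  3  4  4  4  5  6
b   6  .  .  3  .  .  .  3  .  .  .  6
b   7  .  .  4  .  .  .  2  .  .  .  6
b   8  .  .  5  .  .  .  2  .  .  .  6
b   9  8  7  6  5  4  3  2  3  4  5  6
c  10  .  .  7  .  .  .  3  .  .  .  6
c  11  .  .  8  .  .  .  4  .  .  .  6
c  12  .  .  9  .  .  .  5  .  .  .  6
c  13 12 11 10  9  8  7  6  6  6  6  6
a  14  .  . 11  .  .  .  7  .  .  .  6
a  15 14 13 12 11 10  9  8  7  6  6  6

Equal-letter boxes are calculated as in the LCS.
Different-letter boxes are more difficult...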
10
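O(m'n + mn') for the Levenshtein distance (2)
[Figure: a different-letter box with its border values; labels mintop, minleft, t, l, and the border cell X being calculated.]
X = min(mintop + t, minleft + l)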
11
O(m'n + mn') for the Levenshtein distance (3)
[Figure: animation frame.] mintop = 1, t = 1, minleft = 3, l = 5
X = min(mintop + t, minleft + l) = 2
12
O(m'n + mn') for the Levenshtein distance (4)
[Figure: animation frame.] mintop = 0, t = 2, minleft = 3, l = 5
X = min(mintop + t, minleft + l) = 2
13
O(m'n + mn') for the Levenshtein distance (5)
[Figure: animation frame.] mintop = 0, t = 3, minleft = 3, l = 5
X = min(mintop + t, minleft + l) = 3
14
O(m'n + mn') for the Levenshtein distance (6)
[Figure: animation frame.] mintop = 0, t = 4, minleft = 3, l = 5
X = min(mintop + t, minleft + l) = 4
15
O(m'n + mn') for the Levenshtein distance (7)
[Figure: animation frame.] mintop = 0, t = 4, minleft = 3, l = 4
X = min(mintop + t, minleft + l) = 4
16
O(m'n + mn') for the Levenshtein distance (8)
[Figure: animation frame.] mintop = 0, t = 4, minleft = 4, l = 3
X = min(mintop + t, minleft + l) = 4
17
O(m'n + mn') for the Levenshtein distance (9)
[Figure: animation frame.] mintop = 1, t = 4, minleft = 4, l = 2
X = min(mintop + t, minleft + l) = 5
18
O(m'n + mn') for the Levenshtein distance (10)
[Figure: animation frame.] mintop = 2, t = 4, minleft = 4, l = 1
X = min(mintop + t, minleft + l) = 5
19
O(m'n + mn') for the Levenshtein distance (11)
Different-letter boxes can be calculated as fast
as equal-letter boxes.
[Figure: the completed different-letter box with all border values filled in.]
m' rows with n cells + n' columns with m
cells => O(m'n + mn').
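
The rule X = min(mintop + t, minleft + l) can be sketched for a single different-letter box as below. Two assumptions are made here: the windows over which mintop and minleft are taken (the last t+1 top-border cells up to the column of X, and the last l+1 left-border cells up to the row of X) are inferred from the frame values, and each minimum is recomputed with `min()` rather than maintained incrementally, so this sketch does not reach the constant time per cell that the slides claim. The function name is illustrative.

```python
def different_box_borders(top, left):
    """Right and bottom border of a different-letter box (Levenshtein).

    top[c]  : values on the top border, c = 0..width  (top[0] is the corner)
    left[r] : values on the left border, r = 0..height (left[0] == top[0])
    """
    width, height = len(top) - 1, len(left) - 1

    def cell(t, l):
        # t rows below the top border, l columns right of the left border
        mintop = min(top[max(0, l - t): l + 1])
        minleft = min(left[max(0, t - l): t + 1])
        return min(mintop + t, minleft + l)

    right = [cell(t, width) for t in range(1, height + 1)]
    bottom = [cell(height, l) for l in range(1, width + 1)]
    return right, bottom

# Top border as in the animation frames; the left border values are an
# assumption chosen to agree with the minleft values shown in the frames.
right, bottom = different_box_borders([3, 2, 1, 0, 1, 2], [3, 4, 5, 5, 4])
```

With these inputs the computed X values agree with the ones listed in frames (3) to (10).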
20
Approximate searching
  • Can be done by setting the first row to zero, but
    the time complexity is O(m'n + mn').
  • If all runs in the text are shorter than 2m, then
    n < 2mn', and the time complexity is O(mm'n').
  • If a run is longer than 2m-1, only the first 2m
    columns need to be calculated; the rest are equal to
    the last column. => O(mm'n') search algorithm.
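
The long-run observation can be illustrated in isolation. The sketch below uses plain column-by-column updates on an uncompressed pattern, so it shows only the truncation of long text runs, not the box-border technique needed for the O(mm'n') bound; the function names are illustrative.

```python
def next_column(col, pattern, ch):
    """One step of the search recurrence (first row fixed to zero)."""
    new = [0]
    for i in range(1, len(pattern) + 1):
        new.append(min(col[i - 1] + (pattern[i - 1] != ch),
                       col[i] + 1,
                       new[i - 1] + 1))
    return new

def last_row_rle(pattern, text_runs):
    """Last-row values for every text position, computing at most 2m
    columns per text run and repeating the last one for the rest."""
    m = len(pattern)
    col = list(range(m + 1))
    out = []
    for ch, length in text_runs:
        computed = min(length, 2 * m)
        for _ in range(computed):
            col = next_column(col, pattern, ch)
            out.append(col[m])
        # remaining columns of a long run equal the last computed column
        out.extend([col[m]] * (length - computed))
    return out
```

The reason this is safe: a best match for the pattern never spans more than 2m text characters, so beyond the first 2m columns of a run every candidate substring lies inside the run and looks the same at every position.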

21
Greedy algorithm for the LCS
Idea: Calculate only the corners.
Different-letter boxes are easy: a corner value
can be calculated in constant time from the corners
above and on the left.
Corner values for A = aaaaabbbbccccaa (rows) and B = aaabbbbaaaa (columns):

            a3  b4  a4
        0    3   7  11
a5      5    2   6   6
b4      9    6   2   6
c4     13   10   6  10
a2     15   12   8   8

Equal-letter boxes: corner values can be traced.
Time complexity is O(min(m'n'(m'+n'), m'n + mn')).
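
The corner idea can be sketched recursively for D_ID. This is a simplification, not the greedy algorithm of the slide: it evaluates only the cells that the final corner depends on, using memoised recursion. In a different-letter box a cell needs just the cell straight above on the top border and the cell straight to the left on the left border; in an equal-letter box the value is traced back along the diagonal. Names are illustrative and both inputs are assumed non-empty.

```python
from bisect import bisect_left
from functools import lru_cache
from itertools import accumulate

def indel_distance_corners(runs_a, runs_b):
    """D_ID of two run-length encoded strings via corner tracing."""
    ends_a = list(accumulate(length for _, length in runs_a))
    ends_b = list(accumulate(length for _, length in runs_b))

    @lru_cache(maxsize=None)
    def value(i, j):
        if i == 0 or j == 0:
            return i + j
        ra, rb = bisect_left(ends_a, i), bisect_left(ends_b, j)
        r0 = ends_a[ra - 1] if ra else 0     # top border row of the box
        c0 = ends_b[rb - 1] if rb else 0     # left border column of the box
        if runs_a[ra][0] == runs_b[rb][0]:
            # equal letters: trace the diagonal to the box border
            k = min(i - r0, j - c0)
            return value(i - k, j - k)
        # different letters: constant-time combination
        return min(value(r0, j) + (i - r0), value(i, c0) + (j - c0))

    return value(ends_a[-1], ends_b[-1])
```

On the example of the slide the recursion returns the bottom-right corner value 8.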
22
Diagonal algorithm for the LCS
First calculate only inside the diagonal band
0,...,(n-m).
[Figure: corner table for A = aaaaabbbbccccaa and B = aaabbbbaaaa]

    0   3   7  11
    5   2   2   6
    9   6   2   2
   13  10   6   6
   15  12   8   6

The shortest path that goes outside this band has
cost > d = (n-m)+1.
If d_mn ≤ d, then the band is wide enough.
If d_mn > d, then double d and widen the band
so that the shortest path that goes outside this
band has cost > d.
23
Diagonal algorithm for the LCS (2)
At the beginning, d = 4+1 = 5.
As d_mn = 6 > 5, we have to double d.
[Figure: corner table for A = aaaaabbbbccccaa and B = aaabbbbaaaa]

    0   3   7  11
    5   2   4   6
    9   6   2   4
   13  10   6   8
   15  12   8   8

After the first doubling, d = 10 and the diagonal
band is -3,...,7.
As d_mn = 8 < d, the band is wide enough.
=> We can stop the doubling.
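
The doubling scheme can be sketched on uncompressed strings with a plain banded dynamic program for D_ID. This is not the corner-based diagonal algorithm of the talk, only the band logic: the band is widened by p diagonals on each side so that any path leaving it costs more than d. Names are illustrative, and for brevity each row is allocated in full although only band cells are filled.

```python
INF = float("inf")

def banded_indel(a, b, lo, hi):
    """D_ID restricted to cells (i, j) whose diagonal i - j is in [lo, hi]."""
    n = len(b)
    prev = None
    for i in range(len(a) + 1):
        cur = [INF] * (n + 1)
        for j in range(max(0, i - hi), min(n, i - lo) + 1):
            if i == 0 and j == 0:
                cur[j] = 0
                continue
            best = INF
            if i > 0:
                best = min(best, prev[j] + 1)        # deletion
            if j > 0:
                best = min(best, cur[j - 1] + 1)     # insertion
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                best = min(best, prev[j - 1])        # match
            cur[j] = best
        prev = cur
    return prev[n]

def indel_by_doubling(a, b):
    diff = len(a) - len(b)
    d = abs(diff) + 1
    while True:
        p = (d - abs(diff)) // 2          # extra diagonals on each side
        result = banded_indel(a, b, min(0, diff) - p, max(0, diff) + p)
        if result <= d:                   # a path leaving the band costs > d
            return result
        d *= 2
```

For the strings of the example, the first band is 0,...,4 with d = 5 and the second is -3,...,7 with d = 10, matching the slide.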
24
Diagonal algorithm for the LCS (3)
Time complexity is O(min(d^2 min(m',n'),
m'n' max(m',n')))
[Figure: corner table for A = aaaaabbbbccccaa and B = aaabbbbaaaa]

    0   3   7  11
    5   2   4   6
    9   6   2   4
   13  10   6   8
   15  12   8   8

  • The number of diagonals after the last doubling
    is ≤ d+1.
  • The sum of diagonals before the last doubling
    is < d+1.
  • Each diagonal has at most O(min(m',n'))
    corners.
  • The length of each tracing path can be limited
    by 2d.

25
Improving the greedy algorithm (1): Skipping
different-letter boxes
Observation. Runs of different-letter boxes can
be skipped in tracing paths. => Average case
O(m'n' max(m',n')/σ^2).
[Figure: corner table for A = aaaaabbbbccccaa and B = aaabbbbaaaa]

    0   3   7  11
    5   2   6   6
    9   6   2   6
   13  10   6  10
   15  12   8   8
26
Improving the greedy algorithm (2): Bridges
Observation. In different-letter boxes, all
values are known on the bottom or on the right
border. => Tracing a corner value can be stopped
at a bridge.
[Figure: corner table for A = aaaaabbbbccccaa and B = aaabbbbaaaa with the bridge values filled in.]
27
O(m'n') average case?
  • Conjecture: Assuming that run-lengths are equally
    distributed in both strings with the same mean,
    the expected running time of the greedy
    algorithm using the bridge property is O(m'n').
  • Experimental results support the conjecture: for
    randomly generated data the average tracing path
    length was < 2.

28
Experimental results (1): Lengths of the runs
[Figure: plot.] m' = n' = 2000, σ = 2
29
Experimental results (2): m' ≠ n'
[Figure: plot.] m' = 2000, σ = 2, runs in [1,1000]
30
Experimental results (3): σ
[Figure: plot.] m' = n' = 2000, runs in [1,1000]
31
Experimental results (4): B = k random
insertions/deletions on A
[Figure: plot.] m' = n' = 2000, runs in [1,1000]
32
Experimental results (5): Real data. All pairs of
lines in three black/white images
33
References
A. Apostolico, G. Landau, and S. Skiena. Matching for run-length encoded strings. J. of Complexity, 15:4--16, 1999. (Also at Sequences '97, Positano, Italy, June 11--13, 1997.)
O. Arbell, G. Landau, and J. Mitchell. Edit distance of run-length encoded strings. Submitted for publication, August 2000.
H. Bunke and J. Csirik. An algorithm for matching run-length coded strings. Computing, 50:297--314, 1993.
H. Bunke and J. Csirik. An improved algorithm for computing the edit distance of run-length coded strings. Information Processing Letters, 54(2):93--96, 1995.
V. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 6:707--710, 1966.
J. Mitchell. A geometric shortest path problem, with application to computing a longest common subsequence in run-length encoded strings. Technical Report, Dept. of Applied Mathematics, SUNY Stony Brook, 1997.
34
References...
P. Sellers. The theory and computation of evolutionary distances: Pattern recognition. J. of Algorithms, 1(4):359--373, 1980.
E. Ukkonen. Algorithms for approximate string matching. Information and Control, 64(1--3):100--118, 1985.
E. Ukkonen. Finding approximate patterns in strings. J. of Algorithms, 6(1--3):132--137, 1985.
R. Wagner and M. Fischer. The string-to-string correction problem. J. of the ACM, 21(1):168--173, 1974.