Title: Tsvi Kopelowitz
1Compressing Substrings in Compressed Time
- Tsvi Kopelowitz
- joint work with Orgad Keller, Shir Landau and
Moshe Lewenstein
2Overview
- Problem Definition
- Motivation
- Previous work
- Our Contribution
- How does it work?
- Open problems
3Substring Compression Query
- For a given compression algorithm:
- Input: string S (to be preprocessed).
- Query: i, j s.t. 1 ≤ i ≤ j ≤ n = |S|.
- Output: compressed S[i..j].
S = ababbababcabaababcbcbacbbacbabbabcbababacbabc
4Substring Compression Query
- For a given compression algorithm:
- Input: string S (to be preprocessed).
- Query: i, j s.t. 1 ≤ i ≤ j ≤ n = |S|.
- Output: compressed S[i..j].
S = ababbababcabaababcbcbacbbacbabbabcbababacbabc (with positions i and j marked)
- Desired time: proportional to the size of the compressed substring.
5Need to choose Compressor
- LZ77 encoding: greedy, from left to right.
- Encode S[k..n] using its longest prefix that also starts somewhere in the part already encoded.
- Each phrase is encoded by the distance to the chosen substring and the length of the common prefix.
- Example: S = ababbaaba is parsed into the phrases a, b, ab, ba, aba.
- More history implies better compression.
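The greedy rule above can be sketched in a few lines. This is a naive quadratic illustration, not the data structure from the talk; the function name `lz77_parse` and the convention of giving a fresh character distance 0 are assumptions made for this example.

```python
def lz77_parse(s):
    """Greedy left-to-right LZ77 parse of s (naive, for illustration only).

    Returns (phrase, distance, length) triples. A character with no earlier
    occurrence is emitted as a literal with distance 0 (an assumed convention).
    """
    phrases = []
    n = len(s)
    k = 0  # 0-based start of the part that is still to be encoded
    while k < n:
        best_len, best_start = 0, None
        # Try every earlier start t; the copied match may overlap position k.
        for t in range(k):
            length = 0
            while k + length < n and s[t + length] == s[k + length]:
                length += 1
            if length > best_len:
                best_len, best_start = length, t
        if best_len == 0:
            phrases.append((s[k], 0, 1))  # fresh literal
            k += 1
        else:
            phrases.append((s[k:k + best_len], k - best_start, best_len))
            k += best_len
    return phrases


if __name__ == "__main__":
    for phrase in lz77_parse("ababbaaba"):
        print(phrase)
```

On the slide's string ababbaaba this yields the phrases a, b, ab, ba, aba.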
6Context Substring Compression Query
- Input: string S (to be preprocessed).
- Query: i, j, α, β s.t. 1 ≤ i ≤ j ≤ n and 1 ≤ α ≤ β ≤ n.
- Output: compressed S[i..j] given the context of S[α..β].
S = ababbababcabaababcbcbacbbacbabbabcbababacbabc
7Context Substring Compression Query
- Input: string S (to be preprocessed).
- Query: i, j, α, β s.t. 1 ≤ i ≤ j ≤ n and 1 ≤ α ≤ β ≤ n.
- Output: compressed S[i..j] given the context of S[α..β].
S = ababbababcabaababcbcbacbbacbabbabcbababacbabc
- Desired time: proportional to the size of the compressed substring.
8Substring Compression Motivation
- Data transfer in a network setting
- Sending portions of a large amount of data
- Sending a portion after a different one has been sent
- Comparison of sequences (Biology)
9Previous Results Cormode, Muthukrishnan 05
- For SCQ
- O(C(i, j) log n loglog n)
- For Context-SCQ (CSCQ)
- O(C_{α,β}(i, j) log n loglog n)
- Where C is the number of LZ phrases in the encoding of the substring
10Our Results Query Times
- Several trade-offs; the fastest:
- For SCQ
- O(C(i, j))
- For CSCQ
- O(C_{α,β}(i, j) log R), where R = (j-i) / C_{α,β}(i, j)
- Where C is the number of LZ phrases in the encoding of the substring
11Substring Compression Query
- Goals
- Compute LZ(S[i..j])
- Time proportional to the size of LZ(S[i..j]) (and not the size of S[i..j])
- Note: we are allowed to preprocess the text!
12Computing LZ(Si..j)
- Greedy
- Assume S[i..k-1] is compressed, and S[k..j] is left to be compressed.
[Figure: S with positions i, k-1, k and j marked.]
- For suffix S_k, find the longest common prefix with any suffix beginning in S[i..k-1],
- i.e. locate t for which i ≤ t ≤ k-1 and LCP(S_k, S_t) is maximal.
13Use Suffix Arrays
- Definition: SA(S) is a permutation representing the lexicographic ordering of all suffixes of S.
14Suffix Array for string S = abaabaabaaba
- S_1 = abaabaabaaba
- S_2 = baabaabaaba
- S_3 = aabaabaaba
- S_4 = abaabaaba
- S_5 = baabaaba
- S_6 = aabaaba
- S_7 = abaaba
- S_8 = baaba
- S_9 = aaba
- S_10 = aba
- S_11 = ba
- S_12 = a
- S_13 = (empty)
Sorted: S_3 = aabaabaaba, S_6 = aabaaba, S_9 = aaba, S_1 = abaabaabaaba, S_4 = abaabaaba, S_7 = abaaba, S_10 = aba, S_12 = a, S_2 = baabaabaaba, S_5 = baabaaba, S_8 = baaba, S_11 = ba, S_13 = (empty)
SA(S): 3, 6, 9, 1, 4, 7, 10, 12, 2, 5, 8, 11, 13
index: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
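The table above can be reproduced with a short, deliberately naive construction. It is only a sketch: the names `suffix_array` and `inverse_suffix_array` are made up here, and the choice of an end marker that sorts after every letter is an assumption inferred from the slide's ordering (S_12 = a comes after S_10 = aba, and the empty suffix S_13 comes last).

```python
def suffix_array(s, sentinel="~"):
    """Return the 1-based start positions of all suffixes of s in sorted order.

    Naive construction by sorting the suffixes themselves. The sentinel is
    assumed to sort after every character of s, which matches the ordering
    shown on the slide; the empty suffix at position len(s) + 1 is included.
    """
    n = len(s)
    return sorted(range(1, n + 2), key=lambda t: s[t - 1:] + sentinel)


def inverse_suffix_array(sa):
    """rank[t] = 1-based position of suffix t inside sa (rank[0] is unused)."""
    rank = [0] * (len(sa) + 1)
    for r, t in enumerate(sa, start=1):
        rank[t] = r
    return rank


if __name__ == "__main__":
    sa = suffix_array("abaabaabaaba")
    print(sa)
    print(inverse_suffix_array(sa))
```

The rank array is the X-coordinate used later in the geometric view, and the text position is the Y-coordinate.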
15Geometric Representation
[Figure: the suffixes of S plotted as points on a grid; both axes run from 0 to 12.]
SA(S): 3, 6, 9, 1, 4, 7, 10, 12, 2, 5, 8, 11, 13
index: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
16Observation
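- Given an index i in the Suffix Array:
- For n > j > i: LCP(S_{SA[i]}, S_{SA[j]}) ≥ LCP(S_{SA[i]}, S_{SA[j+1]})
- For 1 < j < i: LCP(S_{SA[i]}, S_{SA[j]}) ≥ LCP(S_{SA[i]}, S_{SA[j-1]})
- LCP decreases as we move away from index i.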
17Example
SA(S): 3, 6, 9, 1, 4, 7, 10, 12, 2, 5, 8, 11, 13
index: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
- At i = 5:
- S_4 = abaabaaba
- At j = 7 (so j+1 = 8):
- S_10 = aba
- S_12 = a
- So LCP(S_4, S_10) = 3 ≥ LCP(S_4, S_12) = 1
- LCP decreases as we move away from index i.
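A quick way to see the observation on the running example is to list the LCP of one suffix with every suffix, in suffix array order. The sketch below is naive and the helper names (`lcp`, `lcp_profile`, `decreases_away_from`) are invented for this illustration; the end marker that sorts after every letter is an assumption matching the slide's ordering.

```python
def lcp(s, a, b):
    """Length of the longest common prefix of the suffixes at 1-based a and b."""
    n, length = len(s), 0
    while max(a, b) + length <= n and s[a - 1 + length] == s[b - 1 + length]:
        length += 1
    return length


def lcp_profile(s, r, sentinel="~"):
    """LCP of the suffix at suffix-array index r (1-based) with every suffix,
    listed in suffix-array order. Assumes the sentinel sorts after all letters."""
    n = len(s)
    sa = sorted(range(1, n + 2), key=lambda t: s[t - 1:] + sentinel)
    anchor = sa[r - 1]
    return [lcp(s, anchor, t) for t in sa]


def decreases_away_from(profile, r):
    """True if the profile never increases when moving away from index r."""
    left, right = profile[:r], profile[r - 1:]
    return left == sorted(left) and right == sorted(right, reverse=True)


if __name__ == "__main__":
    profile = lcp_profile("abaabaabaaba", 5)  # index 5 holds S_4
    print(profile)
    print(decreases_away_from(profile, 5))
```

For index 5 (suffix S_4) the profile is 1, 1, 1, 9, 9, 6, 3, 1, 0, 0, 0, 0, 0: it peaks at index 5 and falls off on both sides, which is why only the nearest admissible neighbours need to be examined.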
18Computing LZ(Si..j)
- Greedy
- Assume S[i..k-1] is compressed, and S[k..j] is left to be compressed.
[Figure: S with positions i, k-1, k and j marked.]
- For suffix S_k, find the longest common prefix with any suffix beginning in S[i..k-1],
- i.e. locate t for which i ≤ t ≤ k-1 and LCP(S_k, S_t) is maximal: the closest one (in the suffix array)!
19Searching within the range
- We can find the suffix S_t which maximizes LCP(S_k, S_t) (LCP with neighbors in the suffix array).
- Problem: how can we consider only suffixes S_t for which i ≤ t ≤ k-1? (We cannot look for prefixes outside of the range.)
20Searching within the range
- We can find the suffix S_t which maximizes LCP(S_k, S_t) (LCP with neighbors in the suffix array).
- Problem: how can we consider only suffixes S_t for which i ≤ t ≤ k-1? (We cannot look for prefixes outside of the range.)
- Three-sided Range Searching for Min/Max
21Example
- We wish to compress S[4..9]. We have compressed S[4..6] thus far.
- S[7..9] is left to be compressed.
- We therefore wish to find the location t for which LCP(S_7, S_t) is maximal and 4 ≤ t ≤ 6.
22Geometric representation
S = abaabaabaaba
23Consider the Geometric representation of the text
S = abaabaabaaba
[Figure: geometric representation of the text as points on a grid; both axes run from 0 to 12.]
24LCP Query in the Range
S = abaabaabaaba
[Figure: geometric representation of the text as points on a grid; both axes run from 0 to 12.]
- Search for the minimum X coordinate from the right: Range Searching For Minimum
25LCP Query in the Range
S = abaabaabaaba
[Figure: geometric representation of the text as points on a grid; both axes run from 0 to 12.]
- Search for the maximum X coordinate from the left: Range Searching For Maximum
26Range Searching For Minimum
- Lenhof, Smid 94
- O(n log n loglog n) expected preprocessing time.
- O(n log n) space.
- O(log n) query time.
- O(loglog n) for a grid! WADS 07
27Another Data Structure Option
- M. Crochemore, C. Iliopoulos, M. Kubica, M. S. Rahman and T. Walen, Improved Algorithms for the Range Next Value Problem and Applications (STACS 2008)
- Improve range searching for minima in a permutation to:
- O(n^{1+ε}) preprocessing time and space
- O(1) query
28Finding LZ(Si..j)
- Compute LCP(S_k, S_t) for both candidates in O(1) (Harel and Tarjan).
- Choose the best of both.
- Proceed to encoding S[k+LCP(S_k, S_t)..j].
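The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two range-searching queries are replaced by a linear scan over the suffix array, the O(1) LCP is replaced by direct character comparison, and the names `compress_substring` and `nearest` as well as the `(None, 1)` encoding of a fresh character are invented for this example.

```python
def lcp(s, a, b):
    """Length of the longest common prefix of the suffixes at 1-based a and b."""
    n, length = len(s), 0
    while max(a, b) + length <= n and s[a - 1 + length] == s[b - 1 + length]:
        length += 1
    return length


def compress_substring(s, i, j):
    """Greedy LZ77 parse of s[i..j] (1-based, inclusive).

    Returns (source_start, length) pairs; (None, 1) marks a character with
    no earlier occurrence inside s[i..k-1].
    """
    n = len(s)
    sa = sorted(range(1, n + 1), key=lambda t: s[t - 1:])  # naive suffix array
    rank = {t: r for r, t in enumerate(sa)}                # 0-based ranks

    def nearest(k, step):
        """Closest suffix to S_k in rank order (direction step) that starts in
        [i, k-1]. Stands in for one three-sided range-searching query."""
        r = rank[k] + step
        while 0 <= r < n:
            if i <= sa[r] <= k - 1:
                return sa[r]
            r += step
        return None

    phrases, k = [], i
    while k <= j:
        best_t, best_len = None, 0
        for t in (nearest(k, +1), nearest(k, -1)):
            if t is not None:
                length = min(lcp(s, k, t), j - k + 1)  # never run past j
                if length > best_len:
                    best_t, best_len = t, length
        if best_len == 0:
            phrases.append((None, 1))
            k += 1
        else:
            phrases.append((best_t, best_len))
            k += best_len
    return phrases


if __name__ == "__main__":
    print(compress_substring("abaabaabaaba", 4, 9))
```

Each phrase costs two neighbour lookups and two LCP computations, so replacing the scan and the character comparison with the cited data structures is what brings the per-phrase cost down to the bounds quoted in the talk.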
29Our Bounds
- For each phrase encoded:
- Perform 2 range searching for min/max (RSM) queries
- Compute LCP with both candidates in O(1)
Fastest query: O(C(i, j))
30Context Substring Compression Query
- Goals
- Find LZ_{α,β}(S[i..j]), the encoding of S[i..j] within the context of S[α..β]
- Do so in time proportional to the size of LZ_{α,β}(S[i..j]) (and not the size of S[i..j])
31Computing LZ a,ß(Si..j)
- Greedy
- Assume S[i..k-1] is compressed, and S[k..j] is left to be compressed.
[Figure: S with positions α, β, i, k-1, k and j marked.]
- For suffix S_k, find the longest common prefix with any suffix beginning in S[i..k-1]: already known.
- For suffix S_k, find the longest common prefix with any suffix beginning in S[α..β]? Does not work!
32Example
- S = abaabaabaaba
- We are compressing the substring S[7..9] = aba
- within the context S[2..5] = baab.
- We wish to encode S_7.
- Looking for the LCP in the range: LCP(S_4, S_7) = 6 is the maximum, but
- this takes us out of the context!
33Bounded longest common prefix
- For string S and interval [α..β] in S, consider the suffixes of S[α..β].
- S[2..5] = baab
- baab
- aab
- ab
- b
- BLCP (bounded longest common prefix): for suffix S_k, find the longest common prefix with any suffix of S[α..β] (rather than any suffix of S beginning in S[α..β]).
34Suffix Tree for string S = abaabaabaaba
[Figure: suffix tree of S, with leaves labeled by the suffix start positions 1 to 13.]
35Suffix Tree for string S = abaabaabaaba
- Number the suffixes by lexicographic order
- Nodes have coordinates:
- X-coord: lexicographic rank
- Y-coord: text location (SA)
36Suffix Tree for string S = abaabaabaaba
- Definition: for node u
- depth(u): number of nodes on the path from the root to parent(u)
- length(u): length of the string on the path from the root to u
- depth(u) = 3
- length(u) = 6
37Definition - Eligibility
- S_t is eligible at u if:
- u is on the path from the root to l_t (the leaf of S_t)
- t + length(u) - 1 ≤ β
- meaning: the substring starting at location t with length up to node u is within the bounds of the context.
38Definition - Eligibility
- S_t is eligible at u if:
- u is on the path from the root to l_t
- t + length(u) - 1 ≤ β
Reminder: we are compressing the substring S[7..9] of S = abaabaabaaba within the context S[2..5].
Note: if u provides eligibility, then so do its ancestors.
39Checking eligibility
- Given u, we wish to test whether there is a suffix S_t that is eligible at u.
- t + length(u) - 1 ≤ β ⇔ t ≤ β - length(u) + 1
- α ≤ t (by definition)
- l_t is in the subtree of u ⇔ its index in the suffix array is in [X(l_u), X(r_u)]
- So u is eligible if and only if [X(l_u), X(r_u)] × [α, β - length(u) + 1] is not empty.
40Step 1: Search the path from the root to l_k to find the deepest eligible node u
- Perform a binary search on the path from the root to l_k to find the eligible node u closest to l_k.
- For each node v, perform a range-emptiness query on
- [X(l_v), X(r_v)] × [α, β - length(v) + 1]
- Select u to be the lowest node with a non-empty range.
Notice that if v has a non-empty range, then every ancestor of v also has a non-empty range.
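The eligibility test and the binary search can be illustrated with a simplified sketch. It is not the suffix-tree procedure itself: it searches over candidate match lengths rather than over nodes on the root-to-l_k path, and the range-emptiness query is replaced by a linear scan. The names `eligible` and `bounded_lcp`, and the `limit` parameter (the number of characters still to be encoded, j-k+1), are assumptions made for this example; position k is assumed to lie outside the context.

```python
def lcp(s, a, b):
    """Length of the longest common prefix of the suffixes at 1-based a and b."""
    n, length = len(s), 0
    while max(a, b) + length <= n and s[a - 1 + length] == s[b - 1 + length]:
        length += 1
    return length


def eligible(s, k, length, alpha, beta):
    """Stand-in for the range-emptiness query: return some start t with
    alpha <= t <= beta - length + 1 whose suffix shares at least `length`
    characters with S_k, or None if the range is empty."""
    for t in range(alpha, beta - length + 2):
        if lcp(s, k, t) >= length:
            return t
    return None


def bounded_lcp(s, k, alpha, beta, limit):
    """Longest prefix of S_k (at most `limit` characters) that occurs entirely
    inside s[alpha..beta]. Returns (length, start), or (0, None)."""
    lo, hi, best_t = 0, min(limit, beta - alpha + 1), None
    # Eligibility is monotone: if a length works, every shorter length works
    # with the same start, so binary search over the length is valid.
    while lo < hi:
        mid = (lo + hi + 1) // 2
        t = eligible(s, k, mid, alpha, beta)
        if t is not None:
            lo, best_t = mid, t
        else:
            hi = mid - 1
    return lo, best_t


if __name__ == "__main__":
    # Slide example: encode S_7 of abaabaabaaba within the context S[2..5] = baab.
    print(bounded_lcp("abaabaabaaba", 7, 2, 5, 3))
```

On the slide's example the result is length 2 starting at position 4: the unbounded match LCP(S_4, S_7) = 6 is cut down to the part that fits inside the context.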
41Finding the location
- u might be eligible due to more than one leaf.
- Picking the leftmost occurrence in the range is enough (why?)
- Perform Three-sided Range-searching for min in [X(l_u), X(r_u)] × [α, ∞)
42Step 2: Range-searching for minimum on Y value in eligible leaves of T_u
[Figure: points on a grid; both axes run from 0 to 12.]
- We want the first occurring one within our range:
- [X(l_u), X(r_u)] × [α, ∞) (one must exist)
- Suppose the above returned point (x, y): y will be the start location and the length will be min{LCP(l_k, l_y), β-y+1}.
43Our Bounds
Fastest: O(C_{α,β}(i, j) log((j-i) / C_{α,β}(i, j)) + O(1))
For each phrase encoded
44Open problem(s)
- What about other types of compressors?
45Thank You!