Tsvi Kopelowitz - PowerPoint PPT Presentation

1
Compressing Substrings in Compressed Time
  • Tsvi Kopelowitz
  • joint work with Orgad Keller, Shir Landau and
    Moshe Lewenstein

2
Overview
  • Problem Definition
  • Motivation
  • Previous work
  • Our Contribution
  • How does it work?
  • Open problems

4
Substring Compression Query
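  • For a given compression algorithm:
  • Input: string $S$ (to be preprocessed).
  • Query: $i, j$ such that $1 \le i \le j \le n = |S|$.
  • Output: the compressed form of $S[i..j]$.

[Figure: $S$ = ababbababcabaababcbcbacbbacbabbabcbababacbabc, with positions $i$ and $j$ marked.]
  • Desired query time: proportional to the size of the compressed substring.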

5
Need to choose a compressor
  • LZ77 encoding: greedy, from left to right. Encode $S[k..n]$ with the
    longest common prefix that starts in the part already encoded.
  • Each phrase is encoded by the distance to the chosen substring and the
    length of the common prefix.
  • Example: $S$ = ababbaaba is parsed into the phrases a, b, ab, ba, aba.
  • More history implies better compression.
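
To make the greedy rule concrete, here is a minimal Python sketch of the LZ77 parse described on this slide. It is a naive scan, not the preprocessing-based method of the talk; the name `lz77_parse` and the (phrase, distance, length) tuple layout are illustrative assumptions.

```python
def lz77_parse(s):
    """Greedy left-to-right LZ77 parse (naive scan over all earlier starts).

    Returns a list of (phrase, distance, length) tuples; distance and length
    are 0 for a character that has no earlier occurrence (a literal).
    """
    phrases = []
    n = len(s)
    k = 0  # 0-based start of the part still to be encoded
    while k < n:
        best_len, best_t = 0, -1
        for t in range(k):  # candidate sources start strictly before k
            length = 0
            while k + length < n and s[t + length] == s[k + length]:
                length += 1
            if length > best_len:
                best_len, best_t = length, t
        if best_len == 0:
            phrases.append((s[k], 0, 0))  # new character, emitted as is
            k += 1
        else:
            phrases.append((s[k:k + best_len], k - best_t, best_len))
            k += best_len
    return phrases


print([p for p, _, _ in lz77_parse("ababbaaba")])
# ['a', 'b', 'ab', 'ba', 'aba']
```

The output matches the phrase boundaries shown on the slide.
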
7
Context Substring Compression Query
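  • Input: string $S$ (to be preprocessed).
  • Query: $i, j, \alpha, \beta$ such that $1 \le i \le j \le n$ and
    $1 \le \alpha \le \beta \le n$.
  • Output: the compressed form of $S[i..j]$ given the context
    $S[\alpha..\beta]$.

[Figure: $S$ = ababbababcabaababcbcbacbbacbabbabcbababacbabc]
  • Desired query time: proportional to the size of the compressed substring.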

8
Substring Compression Motivation
  • Data transfer in a network setting
  • Sending portions of a large amount of data
  • Sending a portion after a different one has been
    sent
  • Comparison of sequences (Biology)

9
Previous Results [Cormode, Muthukrishnan 05]
  • For SCQ: $O(C(i,j) \log n \log\log n)$
  • For Context-SCQ (CSCQ): $O(C_{\alpha,\beta}(i,j) \log n \log\log n)$
  • Here $C$ is the number of LZ phrases in the encoding of the substring.

10
Our Results: Query Times
  • Several trade-offs; the fastest:
  • For SCQ: $O(C(i,j))$
  • For CSCQ: $O(C_{\alpha,\beta}(i,j) \log R)$, where
    $R = (j-i) / C_{\alpha,\beta}(i,j)$
  • Here $C$ is the number of LZ phrases in the encoding of the substring.

11
Substring Compression Query
  • Goals:
  • Compute $\mathrm{LZ}(S[i..j])$.
  • Time: proportional to the size of $\mathrm{LZ}(S[i..j])$ (and not the
    size of $S[i..j]$).
  • Note: we are allowed to preprocess the text!

12
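Computing $\mathrm{LZ}(S[i..j])$
  • Greedy.
  • Assume $S[i..k-1]$ is compressed, and $S[k..j]$ is left to be
    compressed.

[Figure: positions $i$, $k-1$, $k$, $j$ marked on the string.]
  • For suffix $S_k$, find the longest common prefix with any suffix
    beginning in $S[i..k-1]$,
  • i.e. locate $t$ for which $i \le t \le k-1$ and
    $\mathrm{LCP}(S_k, S_t)$ is maximal.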
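
The single greedy step above can be written as a brute-force reference, useful for checking faster implementations. This is only a sketch: `lcp` and `best_source` are hypothetical helper names, positions are 1-based as on the slides, and the match length is capped at $j-k+1$ so that the phrase stays inside $S[i..j]$.

```python
def lcp(s, a, b):
    """Length of the longest common prefix of the suffixes s[a:] and s[b:]
    (a and b are 0-based here)."""
    length = 0
    while (a + length < len(s) and b + length < len(s)
           and s[a + length] == s[b + length]):
        length += 1
    return length


def best_source(s, i, k, j):
    """Return (t, length) with i <= t <= k-1 maximising LCP(S_k, S_t),
    using 1-based positions. Returns (None, 0) when nothing matches."""
    best_t, best_len = None, 0
    for t in range(i, k):
        length = min(lcp(s, k - 1, t - 1), j - k + 1)
        if length > best_len:
            best_t, best_len = t, length
    return best_t, best_len


# Compressing S[4..9] of abaabaabaaba with S[4..6] already encoded:
print(best_source("abaabaabaaba", 4, 7, 9))  # (4, 3)
```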

13
Use Suffix Arrays
  • Definition: $\mathrm{SA}(S)$ is a permutation representing the
    lexicographic ordering of all suffixes of $S$.

14
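Suffix Array for string $S$ = abaabaabaaba
  • $S_1$ = abaabaabaaba
  • $S_2$ = baabaabaaba
  • $S_3$ = aabaabaaba
  • $S_4$ = abaabaaba
  • $S_5$ = baabaaba
  • $S_6$ = aabaaba
  • $S_7$ = abaaba
  • $S_8$ = baaba
  • $S_9$ = aaba
  • $S_{10}$ = aba
  • $S_{11}$ = ba
  • $S_{12}$ = a
  • $S_{13}$ = (empty suffix)

Suffixes in lexicographic order:
$S_3$ = aabaabaaba, $S_6$ = aabaaba, $S_9$ = aaba, $S_1$ = abaabaabaaba,
$S_4$ = abaabaaba, $S_7$ = abaaba, $S_{10}$ = aba, $S_{12}$ = a,
$S_2$ = baabaabaaba, $S_5$ = baabaaba, $S_8$ = baaba, $S_{11}$ = ba, $S_{13}$

SA(S) = 3, 6, 9, 1, 4, 7, 10, 12, 2, 5, 8, 11, 13
Index = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13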
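
A short Python sketch that rebuilds the table above by sorting the suffixes directly. It assumes the end of the string is marked by a sentinel that sorts after every letter (here `~`), which is what places $S_{13}$ last on the slide. The name `suffix_array` is illustrative, and sorting whole suffixes is far slower than a dedicated suffix array construction.

```python
def suffix_array(s, sentinel="~"):
    """Return the 1-based suffix array of s.

    Assumption: `sentinel` does not occur in s and compares greater than
    every character of s, so the empty suffix is ranked last.
    """
    text = s + sentinel
    return sorted(range(1, len(text) + 1), key=lambda t: text[t - 1:])


print(suffix_array("abaabaabaaba"))
# [3, 6, 9, 1, 4, 7, 10, 12, 2, 5, 8, 11, 13]
```
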
15
Geometric Representation
[Figure: grid with both axes running from 0 to 12.]
SA(S) = 3, 6, 9, 1, 4, 7, 10, 12, 2, 5, 8, 11, 13
Index = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13

16
Observation
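  • Given an index $i$ in the suffix array:
  • For $n > j > i$:
    $\mathrm{LCP}(S_{SA[i]}, S_{SA[j]}) \ge \mathrm{LCP}(S_{SA[i]}, S_{SA[j+1]})$
  • For $1 < j < i$:
    $\mathrm{LCP}(S_{SA[i]}, S_{SA[j]}) \ge \mathrm{LCP}(S_{SA[i]}, S_{SA[j-1]})$
  • The LCP decreases (never increases) as we move away from index $i$.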

17
Example
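SA(S) = 3, 6, 9, 1, 4, 7, 10, 12, 2, 5, 8, 11, 13
Index = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
  • At $i = 5$: $S_{SA[5]} = S_4$ = abaabaaba
  • At $j = 7$ (so $j+1 = 8$): $S_{SA[7]} = S_{10}$ = aba and
    $S_{SA[8]} = S_{12}$ = a
  • So $\mathrm{LCP}(S_4, S_{10}) = 3 \ge \mathrm{LCP}(S_4, S_{12}) = 1$.
  • The LCP decreases as we move away from index $i$.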
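
The observation can be checked on the running example with a few lines of Python. This is a sketch under the same assumption as the slide's table (an end marker that sorts after every letter); the helper names are illustrative.

```python
def suffix_array(s, sentinel="~"):
    """1-based suffix array; the sentinel is assumed to sort last."""
    text = s + sentinel
    return sorted(range(1, len(text) + 1), key=lambda t: text[t - 1:])


def lcp(u, v):
    """Length of the longest common prefix of two strings."""
    length = 0
    for a, b in zip(u, v):
        if a != b:
            break
        length += 1
    return length


def lcps_from(s, i):
    """LCP of the suffix at suffix-array index i (1-based) with every
    suffix, listed in suffix-array order."""
    sa = suffix_array(s)
    base = s[sa[i - 1] - 1:]
    return [lcp(base, s[t - 1:]) for t in sa]


values = lcps_from("abaabaabaaba", 5)
print(values)  # [1, 1, 1, 9, 9, 6, 3, 1, 0, 0, 0, 0, 0]
```

Reading outward from index 5 (the second 9, which is the suffix compared with itself), the values never increase in either direction.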

18
Computing $\mathrm{LZ}(S[i..j])$
  • Greedy.
  • Assume $S[i..k-1]$ is compressed, and $S[k..j]$ is left to be
    compressed.

[Figure: positions $i$, $k-1$, $k$, $j$ marked on the string.]
  • For suffix $S_k$, find the longest common prefix with any suffix
    beginning in $S[i..k-1]$,
  • i.e. locate $t$ for which $i \le t \le k-1$ and
    $\mathrm{LCP}(S_k, S_t)$ is maximal: the closest one!

20
Searching within the range
  • We can find the suffix $S_t$ which maximizes $\mathrm{LCP}(S_k, S_t)$
    (LCP with neighbors in the suffix array).
  • Problem: how can we consider only suffixes $S_t$ for which
    $i \le t \le k-1$? (We cannot look for prefixes outside of the range.)
  • Three-sided range searching for min/max.
21
Example
  • We wish to compress $S[4..9]$. We have compressed $S[4..6]$ thus far.
  • $S[7..9]$ is left to be compressed.
  • We therefore wish to find a location $t$ for which
    $\mathrm{LCP}(S_7, S_t)$ is maximal, and $4 \le t \le 6$.
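
In the geometric view, each suffix is a point (rank in the suffix array, text position), and the two candidates are the nearest points below and above the rank of $S_k$ whose text position lies in $[i, k-1]$. A minimal sketch for this example, where a linear scan stands in for the range-searching structure and `rank_neighbours` is a hypothetical name:

```python
# Suffix array of abaabaabaaba as listed on the slides (1-based positions).
SA = [3, 6, 9, 1, 4, 7, 10, 12, 2, 5, 8, 11, 13]


def rank_neighbours(sa, k, lo, hi):
    """Return (pred, succ): the text positions t with lo <= t <= hi whose
    rank is closest to the rank of k from below and from above.
    Either entry is None when no such position exists."""
    rank_of_k = sa.index(k)
    pred = succ = None
    for r in range(rank_of_k - 1, -1, -1):  # walk towards smaller ranks
        if lo <= sa[r] <= hi:
            pred = sa[r]
            break
    for r in range(rank_of_k + 1, len(sa)):  # walk towards larger ranks
        if lo <= sa[r] <= hi:
            succ = sa[r]
            break
    return pred, succ


print(rank_neighbours(SA, 7, 4, 6))  # (4, 5)
```

For the example, the candidates are $t = 4$ and $t = 5$; comparing $\mathrm{LCP}(S_7, S_4) = 6$ with $\mathrm{LCP}(S_7, S_5) = 0$ selects $t = 4$.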

22
Geometric representation
[Figure: geometric representation of $S$ = abaabaabaaba.]

23
Consider the Geometric representation of the text
[Figure: $S$ = abaabaabaaba drawn on a grid with both axes running from 0 to 12.]

24
LCP Query in the Range
[Figure: $S$ = abaabaabaaba drawn on a grid with both axes running from 0 to 12.]
  • Search for the minimum X coordinate from the right: range searching
    for minimum.

25
LCP Query in the Range
[Figure: $S$ = abaabaabaaba drawn on a grid with both axes running from 0 to 12.]
  • Search for the maximum X coordinate from the left: range searching
    for maximum.

26
Range Searching For Minimum
  • [Lenhof, Smid 94]
  • $O(n \log n \log\log n)$ expected preprocessing time.
  • $O(n \log n)$ space.
  • $O(\log n)$ query time.
  • $O(\log\log n)$ for a grid! [WADS 07]

27
Another Data Structure Option
  • M. Crochemore, C. Iliopoulos, M. Kubica, M. S. Rahman and T. Walen,
    "Improved Algorithms for the Range Next Value Problem and
    Applications" (STACS 2008)
  • Improves range searching for minima in a permutation to:
  • $O(n^{1+\epsilon})$ preprocessing time and space
  • $O(1)$ query time

28
Finding $\mathrm{LZ}(S[i..j])$
  • Compute $\mathrm{LCP}(S_k, S_t)$ for both candidates in $O(1)$ (Harel
    and Tarjan).
  • Choose the best of both.
  • Proceed to encoding $S[k + \mathrm{LCP}(S_k, S_t)..j]$.
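
Putting the pieces together, the whole query loop might look like the sketch below. It is not the authors' implementation: the outward scan replaces the range-searching structure, the character-by-character `lcp` replaces the constant-time LCP query, and all function names are assumptions. Only the shape of the loop (two candidates per phrase, keep the better one, jump ahead) follows the slides.

```python
def suffix_array(s, sentinel="~"):
    """1-based suffix array; the sentinel is assumed to sort last."""
    text = s + sentinel
    return sorted(range(1, len(text) + 1), key=lambda t: text[t - 1:])


def lcp(s, a, b):
    """LCP of the suffixes starting at 1-based positions a and b."""
    length = 0
    while (a + length <= len(s) and b + length <= len(s)
           and s[a - 1 + length] == s[b - 1 + length]):
        length += 1
    return length


def scq(s, i, j):
    """Phrases of the greedy LZ parse of S[i..j] (1-based, inclusive)."""
    sa = suffix_array(s)
    rank = {t: r for r, t in enumerate(sa)}
    phrases, k = [], i
    while k <= j:
        best = 0
        for step in (-1, 1):  # nearest eligible rank below, then above
            r = rank[k] + step
            while 0 <= r < len(sa) and not (i <= sa[r] <= k - 1):
                r += step
            if 0 <= r < len(sa):
                best = max(best, lcp(s, k, sa[r]))
        # Cap at the end of the query; a new character is a 1-symbol phrase.
        length = max(1, min(best, j - k + 1))
        phrases.append(s[k - 1:k - 1 + length])
        k += length
    return phrases


print(scq("abaabaabaaba", 4, 9))  # ['a', 'b', 'a', 'aba']
```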

29
Our Bounds
  • For each phrase encoded:
  • Perform 2 RSM (range searching for min/max) queries.
  • Compute the LCP with both candidates in $O(1)$.

Fastest query: $O(C(i,j))$
30
Context Substring Compression Query
  • Goals:
  • Find $\mathrm{LZ}_{\alpha,\beta}(S[i..j])$ within the context of
    $S[\alpha..\beta]$.
  • Do so in time proportional to the size of
    $\mathrm{LZ}_{\alpha,\beta}(S[i..j])$ (and not the size of $S[i..j]$).

31
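Computing $\mathrm{LZ}_{\alpha,\beta}(S[i..j])$
  • Greedy.
  • Assume $S[i..k-1]$ is compressed, and $S[k..j]$ is left to be
    compressed.

[Figure: positions $i$, $k-1$, $k$, $j$, $\alpha$ and $\beta$ marked on the string.]
  • For suffix $S_k$, find the longest common prefix with any suffix
    beginning in $S[i..k-1]$: we already know how.
  • For suffix $S_k$, find the longest common prefix with any suffix
    beginning in $S[\alpha..\beta]$: the same approach does not work!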

32
Example - $S$ = abaabaabaaba
  • We are compressing the substring $S[7..9]$ = aba
  • within the context $S[2..5]$ = baab.
  • We wish to encode $S_7$.
  • Looking for the LCP in the range, $\mathrm{LCP}(S_4, S_7) = 6$ is the
    maximum, but
  • this takes us out of the context!

33
Bounded longest common prefix
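  • For string $S$ and interval $[\alpha..\beta]$ in $S$, consider the
    suffixes of $S[\alpha..\beta]$.
  • $S[2..5]$ = baab has the suffixes:
  • baab
  • aab
  • ab
  • b
  • BLCP: for suffix $S_k$, find the longest common prefix with any suffix
    from $S[\alpha..\beta]$ (so a match may not run past $\beta$).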
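
As a reference point, the bounded LCP can be computed by brute force. The sketch below is an illustration only (`blcp` is a hypothetical name, positions are 1-based): a match starting at $t$ is cut off at $\beta$, so its length is at most $\beta - t + 1$.

```python
def blcp(s, k, alpha, beta):
    """Return (t, length): the start alpha <= t <= beta of the longest match
    with S_k that stays inside S[alpha..beta], and its length.
    Returns (None, 0) when not even one character matches."""
    best_t, best_len = None, 0
    for t in range(alpha, beta + 1):
        length = 0
        while (t + length <= beta and k + length <= len(s)
               and s[t - 1 + length] == s[k - 1 + length]):
            length += 1
        if length > best_len:
            best_t, best_len = t, length
    return best_t, best_len


# Encoding S_7 of abaabaabaaba within the context S[2..5] = baab:
print(blcp("abaabaabaaba", 7, 2, 5))  # (4, 2)
```

The unbounded maximum is 6 at $t = 4$, but only the first two characters ("ab") lie inside the context.
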
34
Suffix Tree for string $S$ = abaabaabaaba
[Figure: suffix tree with leaves labeled 13, 12, 11, 10, 9, 8, 6, 7, 5, 3, 2, 4, 1.]

35
Suffix Tree for string $S$ = abaabaabaaba
  • Number the suffixes by lexicographic order.
  • Nodes have coordinates:
  • X-coord: lexicographic rank
  • Y-coord: text location (SA)

36
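Suffix Tree for string $S$ = abaabaabaaba
  • Definition: for a node $u$,
  • $\mathrm{depth}(u)$: number of nodes on the path from the root to
    $\mathrm{parent}(u)$
  • $\mathrm{length}(u)$: length of the string on the path from the root
    to $u$
  • Example: $\mathrm{depth}(u) = 3$, $\mathrm{length}(u) = 6$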

37
Definition - Eligibility
  • $S_t$ is eligible at $u$ if:
  • $u$ is on the path from the root to $l_t$ (the leaf of $S_t$), and
  • $t + \mathrm{length}(u) - 1 \le \beta$,
  • meaning the substring starting at location $t$ with length up to node
    $u$ is within the bounds of the context.

38
Definition - Eligibility
  • $S_t$ is eligible at $u$ if:
  • $u$ is on the path from the root to $l_t$, and
  • $t + \mathrm{length}(u) - 1 \le \beta$.

Reminder: we are compressing the substring $S[7..9]$ within the context
$S[2..5]$, where $S$ = abaabaabaaba.
Note: if $u$ provides eligibility, then so do its ancestors.

39
Checking eligibility
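  • Given $u$, we wish to test whether some suffix $S_t$ is eligible at $u$.
  • $t + \mathrm{length}(u) - 1 \le \beta \iff t \le \beta - \mathrm{length}(u) + 1$
  • $\alpha \le t$ (by definition)
  • $t$ is in the subtree of $u$ $\iff$ its index in the suffix array is
    in $[X(l_u), X(r_u)]$, where $l_u$ and $r_u$ are the leftmost and
    rightmost leaves below $u$
  • So $u$ is eligible if and only if
    $[X(l_u), X(r_u)] \times [\alpha, \beta - \mathrm{length}(u) + 1]$
    is not empty.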

40
Step 1: Search the path from the root to $l_k$ to find the deepest
eligible node $u$
  • Perform binary search on the path from the root to $l_k$ to find the
    eligible node $u$ closest to $l_k$.
  • For each node $v$ examined, perform a range-emptiness query on
    $[X(l_v), X(r_v)] \times [\alpha, \beta - \mathrm{length}(v) + 1]$.
  • Select $u$ to be the lowest node with a non-empty range.

Notice that if $v$ has a non-empty range, then every ancestor of $v$ also
has a non-empty range.
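
The binary search relies only on the monotonicity noted above. The following sketch makes a simplifying assumption: it searches over prefix lengths of $S_k$ rather than over suffix-tree nodes, and answers each range-emptiness query by a direct scan. The names `eligible` and `deepest_eligible_length` are illustrative.

```python
def eligible(s, k, length, alpha, beta):
    """True if some t with alpha <= t <= beta - length + 1 satisfies
    S[t..t+length-1] == S[k..k+length-1] (1-based positions).
    Stands in for one range-emptiness query."""
    target = s[k - 1:k - 1 + length]
    return any(s[t - 1:t - 1 + length] == target
               for t in range(alpha, beta - length + 2))


def deepest_eligible_length(s, k, alpha, beta):
    """Largest eligible prefix length of S_k. Length 0 is always eligible,
    and eligibility at a length implies eligibility at every shorter one,
    so binary search applies."""
    lo, hi = 0, min(len(s) - k + 1, beta - alpha + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if eligible(s, k, mid, alpha, beta):
            lo = mid
        else:
            hi = mid - 1
    return lo


# S_7 of abaabaabaaba within the context S[2..5] = baab:
print(deepest_eligible_length("abaabaabaaba", 7, 2, 5))  # 2
```
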
41
Finding the location
  • $u$ might be eligible due to more than one leaf.
  • Picking the leftmost occurrence in the range is enough (why?).
  • Perform three-sided range searching for min in
    $[X(l_u), X(r_u)] \times [\alpha, \infty)$.

42
Step 2: Range searching for the minimum Y value among the eligible leaves
of $T_u$
  • We want the first occurring one within our range
    $[X(l_u), X(r_u)] \times [\alpha, \infty)$ (one must exist).
  • Suppose the above returned the point $(x, y)$: $y$ will be the start
    location and the length will be
    $\min\{|\mathrm{LCP}(l_k, l_y)|, \beta - y + 1\}$.

[Figure: grid with both axes running from 0 to 12.]
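
Under a simplifying assumption (prefix lengths of $S_k$ in place of suffix-tree nodes, and scans in place of range searching), the location step can be sketched as follows. `leftmost_phrase` is a hypothetical name, and its `depth_len` argument is assumed to be the largest prefix length of $S_k$ that has an occurrence fitting inside the context. The leftmost occurrence is enough because it leaves the most room before $\beta$.

```python
def leftmost_phrase(s, k, depth_len, alpha, beta):
    """Return (y, length): y is the smallest position >= alpha whose suffix
    starts with the first depth_len characters of S_k (1-based), and
    length = min(LCP(S_k, S_y), beta - y + 1).

    Assumes such a y exists, as guaranteed when depth_len is eligible."""
    prefix = s[k - 1:k - 1 + depth_len]
    # Stand-in for the three-sided query: minimum position y >= alpha
    # among the suffixes that begin with `prefix`.
    y = next(t for t in range(alpha, len(s) + 1)
             if s.startswith(prefix, t - 1))
    lcp = 0
    while (k + lcp <= len(s) and y + lcp <= len(s)
           and s[k - 1 + lcp] == s[y - 1 + lcp]):
        lcp += 1
    return y, min(lcp, beta - y + 1)


# S_7 of abaabaabaaba, context S[2..5], eligible prefix length 2 ("ab"):
print(leftmost_phrase("abaabaabaaba", 7, 2, 2, 5))  # (4, 2)
```
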
43
Our Bounds
  • For each phrase encoded:
  • Perform range-emptiness queries.
  • Use galloping binary search.

Fastest: $O\left(C_{\alpha,\beta}(i,j) \log\frac{j-i}{C_{\alpha,\beta}(i,j)} + O(1)\right)$
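
Galloping (exponential) search finds the largest value satisfying a monotone predicate with a number of probes logarithmic in the answer rather than in the size of the search space. A generic sketch, not tied to the suffix tree; `gallop_max_true` and its arguments are illustrative names.

```python
def gallop_max_true(pred, upper):
    """Largest m in [0, upper] with pred(m) true.

    Assumes pred(0) is true and pred is monotone: once it turns false it
    stays false. Uses O(log m) calls to pred, where m is the answer."""
    step = 1
    # Gallop: double the probe until it fails or leaves the range.
    while step <= upper and pred(step):
        step *= 2
    lo = step // 2               # last value known to satisfy pred
    hi = min(step - 1, upper)    # the answer cannot exceed this
    # Ordinary binary search inside the bracket [lo, hi].
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if pred(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


calls = []
def at_most_five(m):
    calls.append(m)
    return m <= 5

print(gallop_max_true(at_most_five, 1000))  # 5
print(calls)  # [1, 2, 4, 8, 6, 5]
```

If a phrase of length $\ell$ costs $O(\log \ell)$ probes, concavity of the logarithm bounds the total over $C$ phrases by $C \log\frac{j-i+1}{C}$, which is presumably where the $\log R$ factor in the bound comes from.
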
44
Open problem(s)
  • What about other types of compressors?

45
Thank You!