Title: Space-Efficient String Mining under Frequency Constraints
1Space-Efficient String Mining under Frequency
Constraints
- Johannes Fischer, Ludwig-Maximilians-Universität München
- Veli Mäkinen and Niko Välimäki, University of Helsinki
2Frequent string mining optimal time
- "frequent" is most frequent but does not make a difference...
- "I" differentiates DB1 from DB2
- "We are" differentiates DB2 from DB1
- String mining under several kinds of frequency constraints can be done in optimal linear time using suffix array techniques [FHK06].
DB1: "I am frequent", "I am also frequent", "Am I also making a difference"
DB2: "We are frequent", "We are also frequent", "We are all frequent"
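The example can be checked by brute force. The sketch below is not the suffix-array method of [FHK06]; it only counts how many strings of a database contain a pattern and filters patterns by two thresholds. The split of each database into three strings, the function names and the thresholds are assumptions made for illustration.

```python
def substrings(s):
    """Return the set of all non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def frequency(pattern, db):
    """Number of strings in db that contain pattern at least once."""
    return sum(1 for s in db if pattern in s)


def mine(db1, db2, min_freq1, max_freq2):
    """Brute force: patterns with frequency >= min_freq1 in db1 and <= max_freq2 in db2."""
    candidates = set()
    for s in db1:
        candidates |= substrings(s)
    return {p for p in candidates
            if frequency(p, db1) >= min_freq1 and frequency(p, db2) <= max_freq2}


# Assumed reading of the slide: three strings per database.
DB1 = ["I am frequent", "I am also frequent", "Am I also making a difference"]
DB2 = ["We are frequent", "We are also frequent", "We are all frequent"]

for pattern in ("frequent", "I", "We are"):
    print(pattern, frequency(pattern, DB1), frequency(pattern, DB2))

# Hypothetical thresholds: present in every string of one database, in none of the other.
only_db1 = mine(DB1, DB2, min_freq1=3, max_freq2=0)
only_db2 = mine(DB2, DB1, min_freq1=3, max_freq2=0)
```

With these thresholds "I" ends up in `only_db1` and "We are" in `only_db2`, while "frequent" is in neither set. The brute force enumerates a quadratic number of substrings per string, which is the cost the linear-time method avoids.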
3Frequent string mining optimal space?
- "frequent" is most frequent but does not make a difference...
- "I" differentiates DB1 from DB2
- "We are" differentiates DB2 from DB1
- Problem: Can string mining be done using asymptotically the same space as is needed for storing the string collection?
DB1: "I am frequent", "I am also frequent", "Am I also making a difference"
DB2: "We are frequent", "We are also frequent", "We are all frequent"
4Our result Space-efficient string mining
- Given a collection $C$ of $d$ documents with overall length $n = \sum_{T \in C} |T|$, where $T \in \Sigma^*$ for each $T \in C$.
- We give a string mining algorithm that uses
  - $O(n \log|\Sigma| + d \log n)$ bits of working space and
  - $O(n \log n)$ time.
- Since usually $d \ll n$, the solution is significantly more space-efficient than previous ones, which use $O(n \log n)$ bits of working space.
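To get a feeling for the two bounds, the leading terms can be evaluated for concrete numbers. This is only a rough sketch: it ignores the constants hidden by the O-notation, and the values of n, the alphabet size and d are hypothetical (a DNA-like alphabet with 1000 documents).

```python
import math


def compressed_bits(n, sigma, d):
    """Leading term n log(sigma) + d log(n), constants ignored."""
    return n * math.log2(sigma) + d * math.log2(n)


def plain_bits(n):
    """Leading term n log(n) of the earlier solutions, constants ignored."""
    return n * math.log2(n)


# Hypothetical collection: 10^9 characters, 4 letters, 1000 documents.
n, sigma, d = 10**9, 4, 1000
ratio = plain_bits(n) / compressed_bits(n, sigma, d)
print(f"n log n is about {ratio:.1f} times larger")
```

As long as d is much smaller than n, the d log n term is negligible and the ratio is close to log n / log(alphabet size).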
5High-level description
- Tight integration of the algorithm of Kasai et al. [Kasetal01] for visiting all branching substrings of a text with Hui's [Hui92] color set size technique.
- Toolbox: compressed suffix array, compressed LCP values, range minimum queries, searchable partial sums.
6Overview without compressed structures
[Figure: example text T (n = 23) with its suffix array SA and LCP array; # stands for the document separators at positions 5, 12, 18 and 23. Example query: RMQ(LCP, 8, 14) = 1.]
i    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
T    a  a  b  a  #  a  b  a  a  a  b  #  b  b  a  b  b  #  a  b  b  a  #
SA   5 12 18 23  4 22  8  9  1 10  2  6 15 19 11 17  3 21  7 14 16 20 13
LCP  0  0  0  0  0  1  1  2  3  1  2  3  2  3  0  1  1  2  2  2  1  2  3
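The uncompressed overview can be replayed with naive code. The sketch below assumes the example consists of the four documents "aaba", "abaaab", "bbabb" and "abba" (read off the figure), and that every document ends with its own separator that sorts before all letters; the tuple encoding of characters and all function names are illustrative choices. Construction and RMQ are done by plain sorting and scanning, not by the succinct structures of the toolbox.

```python
# Assumption: the four toy documents behind the figure.
DOCS = ["aaba", "abaaab", "bbabb", "abba"]


def build_text(docs):
    """Concatenate the documents; document d ends with its own separator (0, d)."""
    text = []
    for d, doc in enumerate(docs):
        text.extend((1, ch) for ch in doc)   # letters sort after all separators
        text.append((0, d))
    return text


def suffix_array(text):
    """Naive construction by sorting suffixes; returns 1-based start positions."""
    return [i + 1 for i in sorted(range(len(text)), key=lambda i: text[i:])]


def lcp_array(text, sa):
    """lcp[k] = common prefix length of the suffixes ranked k and k+1 (1-based), 0 first."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1] - 1, sa[k] - 1
        h = 0
        while a + h < len(text) and b + h < len(text) and text[a + h] == text[b + h]:
            h += 1
        lcp[k] = h
    return lcp


def rmq(values, i, j):
    """Minimum of values[i..j] for 1-based inclusive indices, by plain scan."""
    return min(values[i - 1:j])


text = build_text(DOCS)
SA = suffix_array(text)
LCP = lcp_array(text, SA)
print(SA)
print(LCP)
print(rmq(LCP, 8, 14))
```

Under these assumptions the computed arrays agree with the SA and LCP rows of the figure, and the query over ranks 8 to 14 returns 1.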
7Right-most path of suffix tree
[Figure: the right-most path of the suffix tree (edge labels a, a, b, a) drawn above the SA and LCP arrays of the example text.]
8Suffixes-insertion algorithm
[Figure: a suffix-insertion step on the suffix tree (labels a, b, a, b, a) drawn above the SA and LCP arrays of the example text.]
9Maintain only the right-most path
- Once a node is popped, its subtree is complete, and all statistics for the substring ending at the node can be reported.
[Figure: the right-most path (labels a, b, a, b, a) drawn above the SA and LCP arrays of the example text.]
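Keeping only the right-most path amounts to a stack that is driven by the LCP array. The following is a minimal sketch of that traversal in the spirit of [Kasetal01], not the authors' implementation: it uses an ordinary Python list as the stack and 0-based ranks, and the LCP values are the ones from the example figure.

```python
def branching_nodes(lcp):
    """Yield (string_depth, left, right) for every internal suffix tree node.

    lcp[k] is the LCP of the suffixes ranked k-1 and k (0-based, lcp[0] = 0);
    left and right are the 0-based inclusive ranks of the leaves below the node.
    """
    stack = [(0, 0)]                      # right-most path: (string depth, left rank)
    n = len(lcp)
    for k in range(1, n):
        left = k - 1
        while stack[-1][0] > lcp[k]:      # suffix k leaves these nodes: they are complete
            depth, left = stack.pop()
            yield depth, left, k - 1
        if stack[-1][0] < lcp[k]:         # a new branching node opens on the path
            stack.append((lcp[k], left))
    while stack:                          # flush what is left of the path, root last
        depth, left = stack.pop()
        yield depth, left, n - 1


# LCP values of the example text, as shown in the figure.
LCP = [0, 0, 0, 0, 0, 1, 1, 2, 3, 1, 2, 3, 2, 3, 0, 1, 1, 2, 2, 2, 1, 2, 3]
nodes = list(branching_nodes(LCP))
for node in nodes:
    print(node)
```

Each node is reported exactly once, at the moment it is popped, so per-node statistics can be emitted there without ever storing the whole tree.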
10Hui's algorithm
- Store at each node v of the suffix tree the values
  - $S_v$: number of leaves in the subtree of v, and
  - $C_v$: number of duplicate occurrences of the substring ending at node v, i.e. leaves in the subtree of v whose document already occurs there.
- Example node from the figure (path a, a): $S_v = 3$, $C_v = 1$.
- $S_v - C_v$ tells how many different documents there are in the subtree of v. In other words, $S_v - C_v$ defines the frequency of the substring ending at node v.
[Figure: document array D drawn above the SA and LCP arrays of the example text.]
D    0  1  2  3  0  3  1  1  0  1  0  1  2  3  1  2  0  3  1  2  2  3  2
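The two counters can be checked directly from their definitions. The sketch below is a definitional check, not Hui's algorithm itself: it takes the rank interval of a node and the document array D from the figure. Reading the annotated node as the suffixes ranked 7 to 9 (those starting with "aa") is an assumption.

```python
# Document array D of the example, in suffix array order, as shown in the figure.
D = [0, 1, 2, 3, 0, 3, 1, 1, 0, 1, 0, 1, 2, 3, 1, 2, 0, 3, 1, 2, 2, 3, 2]


def node_statistics(doc, left, right):
    """Return (S_v, C_v, S_v - C_v) for the node covering ranks left..right (1-based, inclusive)."""
    leaves = doc[left - 1:right]
    s = len(leaves)                  # S_v: leaves below the node
    c = s - len(set(leaves))         # C_v: leaves whose document is already represented
    return s, c, s - c


print(node_statistics(D, 7, 9))
```

For ranks 7 to 9 the documents are 1, 1 and 0, which gives three leaves, one duplicate and a frequency of 2.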
11Making it all space-efficient 1/5
- Right-most path is kept in a special stack
- Relative string depths are coded using Elias codes.
- Takes O(n) bits.
- Allows constant time pop/push.
[Figure: the right-most path (edge labels a, a, b, a) drawn above the SA and LCP arrays of the example text.]
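Coding relative instead of absolute string depths is what keeps the stack small. The sketch below assumes the Elias gamma code (the slide only says "Elias codes") and represents bits as a Python string for readability; it shows encoding and decoding of a whole path, not the constant-time push/pop machinery of the special stack.

```python
def gamma_encode(x):
    """Elias gamma code of an integer x >= 1: binary(x) preceded by len(binary(x)) - 1 zeros."""
    if x < 1:
        raise ValueError("gamma codes are defined for positive integers")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into a list of integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":        # unary part: number of extra binary digits
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def encode_path(depths):
    """Gamma-code the differences between consecutive, strictly increasing string depths."""
    previous, codes = 0, []
    for depth in depths:
        codes.append(gamma_encode(depth - previous))
        previous = depth
    return "".join(codes)


def decode_path(bits):
    """Recover absolute string depths from the coded differences."""
    depths, total = [], 0
    for relative in gamma_decode_all(bits):
        total += relative
        depths.append(total)
    return depths


print(encode_path([1, 2, 3]))
print(encode_path([2, 5, 6]))
```

A gamma code of x takes 2⌊log2 x⌋ + 1 bits, which is at most 2x - 1, and the relative depths on one path sum to at most n, so the coded path stays within O(n) bits.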
12Making it all space-efficient 2/5
- Preliminary counter $S_v$ values along the right-most path are encoded in the same way as the stack.
- Once a node v is popped, its $S_v$ value is final, and this value is added to its parent.
- O(n) bits with constant-time updates.
[Figure: node with $S_v = 3$, $C_v = 1$ (path a, a) drawn above the SA and LCP arrays of the example text.]
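The bookkeeping for the leaf counters can be sketched as follows. This is an illustration with plain Python lists rather than the coded stack: every stack entry carries a preliminary count, and a popped node hands its final count to its parent. The LCP values are those of the example figure; the function name is hypothetical.

```python
def subtree_sizes(lcp):
    """Return [(string_depth, S_v)] in the order in which the nodes are popped.

    lcp[k] is the LCP of the suffixes ranked k-1 and k (0-based, lcp[0] = 0).
    """
    stack = [[0, 0]]                       # right-most path: [string depth, leaves counted so far]
    reported = []
    n = len(lcp)
    for k in range(1, n + 1):
        limit = lcp[k] if k < n else -1    # -1 after the last suffix pops everything
        pending = 1                        # leaf k-1 has not been counted yet
        while stack and stack[-1][0] > limit:
            depth, size = stack.pop()
            size += pending                # final: nothing more is added below this node
            reported.append((depth, size))
            pending = size                 # hand the finished count to the parent
        if k == n:
            break
        if stack[-1][0] == limit:
            stack[-1][1] += pending        # parent is already on the path
        else:
            stack.append([limit, pending]) # parent is a new node on the path
    return reported


LCP = [0, 0, 0, 0, 0, 1, 1, 2, 3, 1, 2, 3, 2, 3, 0, 1, 1, 2, 2, 2, 1, 2, 3]
sizes = subtree_sizes(LCP)
print(sizes)
```

The second node popped has string depth 2 and three leaves, matching the annotated node of the figure, and the root is popped last with all 23 leaves.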
13Making it all space-efficient 3/5
- Preliminary counter $C_v$ values along the right-most path are encoded using a dynamic searchable partial sums structure.
- Once a node v is popped, its $C_v$ value is final, and this value is added to its parent.
- O(n) bits with O(log n) time updates.
[Figure: node with $S_v = 3$, $C_v = 1$ (path a, a) drawn above the SA and LCP arrays of the example text.]
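The interface of a searchable partial sums structure (sum, search, update) can be illustrated with a Fenwick tree. This is a stand-in chosen for brevity, not the structure used in the talk: it gives O(log n) time per operation but stores full integers, so it does not meet the O(n)-bit bound. Class and method names are illustrative.

```python
class PartialSums:
    """Fenwick tree over non-negative integers a[1..size] (1-based indices)."""

    def __init__(self, size):
        self.size = size
        self.tree = [0] * (size + 1)

    def update(self, i, delta):
        """a[i] += delta in O(log size) time."""
        while i <= self.size:
            self.tree[i] += delta
            i += i & -i

    def sum(self, i):
        """a[1] + ... + a[i] in O(log size) time."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def search(self, target):
        """Smallest i with sum(i) >= target, or size + 1 if no prefix reaches target."""
        pos = 0
        step = 1 << self.size.bit_length()   # a power of two larger than size
        while step:
            nxt = pos + step
            if nxt <= self.size and self.tree[nxt] < target:
                pos = nxt
                target -= self.tree[nxt]
            step >>= 1
        return pos + 1


counters = PartialSums(4)
for index, value in enumerate([2, 0, 1, 3], start=1):
    counters.update(index, value)
print(counters.sum(4), counters.search(3))
```

With the values 2, 0, 1, 3 the prefix sums are 2, 2, 3, 6, so `search(3)` lands on index 3; after `update(2, 1)` the same query lands on index 2.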
14Making it all space-efficient 4/5
- Table D encodes, in lexicographic order of the suffixes, the number of the document each suffix belongs to.
- A predecessor query on D gives the previous occurrence inside the same document.
- An RMQ between the two occurrences gives the string depth at which the $C_v$ counter should be incremented.
[Figure: D, SA and LCP arrays of the example text with the node $S_v = 3$, $C_v = 1$ (path a, a); three range minimum queries between consecutive suffixes of the same document return RMQ = 0, RMQ = 2 and RMQ = 1.]
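Putting the pieces together, the frequency of every branching substring can be computed in one left-to-right pass. The sketch below is an uncompressed illustration under simplifying assumptions, not the space-efficient algorithm: the predecessor query is a dictionary lookup, the RMQ is a linear scan, the stack is a Python list that is searched linearly for the target depth, and $S_v$ is taken as the width of the rank interval. D and LCP are the arrays of the example figure.

```python
def document_frequencies(lcp, doc):
    """Return {(string_depth, left, right): S_v - C_v} for every internal node.

    lcp[k]: LCP of the suffixes ranked k-1 and k (0-based, lcp[0] = 0).
    doc[k]: document number of the suffix ranked k (the table D).
    """
    n = len(lcp)
    stack = [[0, 0, 0]]               # right-most path: [string depth, left rank, C_v so far]
    last = {}                         # document number -> rank of its latest suffix
    freq = {}

    def report(depth, left, right, c):
        freq[(depth, left, right)] = (right - left + 1) - c   # S_v - C_v

    for k in range(n):
        if k > 0:
            left, carry = k - 1, 0
            while stack[-1][0] > lcp[k]:
                depth, left, c = stack.pop()
                report(depth, left, k - 1, c)
                if stack[-1][0] >= lcp[k]:
                    stack[-1][2] += c          # parent is already on the path
                else:
                    carry = c                  # parent is the node pushed next
            if stack[-1][0] < lcp[k]:
                stack.append([lcp[k], left, carry])
        j = last.get(doc[k])          # predecessor: previous suffix of the same document
        if j is not None:
            depth = min(lcp[j + 1:k + 1])      # RMQ between the two occurrences
            for entry in reversed(stack):
                if entry[0] == depth:
                    entry[2] += 1              # one duplicate at this string depth
                    break
        last[doc[k]] = k
    while stack:                      # flush the remaining path, root last
        depth, left, c = stack.pop()
        report(depth, left, n - 1, c)
        if stack:
            stack[-1][2] += c
    return freq


D = [0, 1, 2, 3, 0, 3, 1, 1, 0, 1, 0, 1, 2, 3, 1, 2, 0, 3, 1, 2, 2, 3, 2]
LCP = [0, 0, 0, 0, 0, 1, 1, 2, 3, 1, 2, 3, 2, 3, 0, 1, 1, 2, 2, 2, 1, 2, 3]
freq = document_frequencies(LCP, D)
for node, value in freq.items():
    print(node, value)
```

For the suffixes at 1-based ranks 7, 8 and 9 the scans return 0, 2 and 1, the three RMQ values of the figure; the increment at depth 2 is what gives the node with three leaves its single duplicate and hence frequency 2.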
15Making it all space-efficient 5/5
- Table D does not need to be stored, as predecessors can be updated "on-the-fly" using an array pred[1..d].
- The compressed suffix array supports access in $O(\log^{\varepsilon} n)$ time and takes $O(n \log|\Sigma|)$ bits.
- A bit-vector $B[1,n]$ marks the document boundaries in the text, so that $\mathrm{rank}(B, SA[i]) = D[i]$.
- The LCP and RMQ structures each take $2n(1 + o(1))$ bits [HS02, FH07].
[Figure: D, SA and LCP arrays of the example text with the node $S_v = 3$, $C_v = 1$ (path a, a); three range minimum queries between consecutive suffixes of the same document return RMQ = 0, RMQ = 2 and RMQ = 1.]
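How D can be replaced by a bit-vector and a small pred array is sketched below. Several details are assumptions: documents are numbered from 0, B carries a 1 at the last position (the separator) of each document, rank counts the 1s strictly before a position, and the rank query is answered from a plain prefix-sum table where a succinct rank structure would be used. The document lengths 4, 6, 5 and 4 and the SA values are read off the example figure.

```python
from itertools import accumulate


def boundary_bits(doc_lengths):
    """B has a 1 at the last position (separator) of each document, 0 elsewhere."""
    bits = []
    for length in doc_lengths:
        bits.extend([0] * length + [1])
    return bits


def rank_table(bits):
    """prefix[p] = number of 1s among the first p bits (stand-in for a rank structure)."""
    return [0] + list(accumulate(bits))


def document_of(prefix, pos):
    """Document number of the 1-based text position pos: the 1s strictly before pos."""
    return prefix[pos - 1]


def predecessors(sa, prefix, d):
    """For each rank, the rank of the previous suffix of the same document (None if none)."""
    pred = [None] * d                 # pred[j]: latest rank seen for document j
    result = []
    for rank, pos in enumerate(sa):
        j = document_of(prefix, pos)  # D[rank], computed instead of stored
        result.append(pred[j])
        pred[j] = rank
    return result


SA = [5, 12, 18, 23, 4, 22, 8, 9, 1, 10, 2, 6, 15, 19, 11, 17, 3, 21, 7, 14, 16, 20, 13]
prefix = rank_table(boundary_bits([4, 6, 5, 4]))
D = [document_of(prefix, pos) for pos in SA]
pred_ranks = predecessors(SA, prefix, d=4)
print(D)
print(pred_ranks)
```

Only d entries of pred are alive at any time, which is where the d log n term of the space bound comes from.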
16Extensions
- This presentation only sketched how to compute the frequency values inside one document collection. In addition,
  - the computation is easy to adjust to report patterns occurring frequently in one document collection and infrequently in the other,
  - the computation gives a space-efficient construction algorithm for Sadakane's scheme of storing the frequency values [Sad07], and
  - other compressed text indexes can be plugged in to obtain other space/time tradeoffs.
17Epilogue
- Thanks to the discussions with Luis Russo after the workshop, we were able to improve the space from O(n log d) to O(d log n).
- The presentation has been changed accordingly.
18References
- [FHK06] Johannes Fischer, Volker Heun, Stefan Kramer: Optimal String Mining under Frequency Constraints. In Proc. PKDD'06, LNAI 4213, pages 139-150, 2006.
- [FH07] Johannes Fischer, Volker Heun: A New Succinct Representation of RMQ-Information and Improvements in the Enhanced Suffix Array. In Proc. ESCAPE'07, LNCS 4614, pages 459-470, 2007.
- [FMV07] Johannes Fischer, Veli Mäkinen, Niko Välimäki: Space-efficient String Mining under Frequency Constraints. Submitted.
- [HS02] Wing-Kai Hon, Kunihiko Sadakane: Space-Economical Algorithms for Finding Maximal Unique Matches. In Proc. CPM 2002, LNCS 2373, pages 144-152, 2002.
- [Hui92] Lucas Hui: Color Set Size Problem with Application to String Matching. In Proc. CPM 1992, LNCS 644, pages 230-243, 1992.
- [Kasetal01] Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, Kunsoo Park: Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In Proc. CPM 2001, LNCS 2089, pages 181-192, 2001.
- [Sad07] Kunihiko Sadakane: Succinct data structures for flexible text retrieval systems. J. Discrete Algorithms 5(1): 12-22, 2007.