Space-Efficient String Mining under Frequency Constraints

1
Space-Efficient String Mining under Frequency Constraints

Johannes Fischer
Ludwig-Maximilians-Universität München

Veli Mäkinen and Niko Välimäki
University of Helsinki
2
Frequent string mining: optimal time

String mining under several kinds of frequency constraints can be done in optimal linear time using suffix array techniques [FHK06].

DB1: "I am frequent", "I am also frequent", "Am I also making a difference"
DB2: "We are frequent", "We are also frequent", "We are all frequent"

"frequent" is most frequent but does not make a difference...
"I" differentiates DB1 from DB2
"We are" differentiates DB2 from DB1
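The example can be checked with a brute-force sketch that counts in how many strings of each database a pattern occurs. This is not the linear-time method of [FHK06]. It only makes the notion of frequency concrete. Splitting each database into three strings is an assumption read off the capitalisation, and the name `doc_frequency` is hypothetical.

```python
def doc_frequency(pattern, database):
    """Number of strings in `database` that contain `pattern` (case-sensitive)."""
    return sum(1 for doc in database if pattern in doc)


# Assumed split of the slide's example into three strings per database.
DB1 = ["I am frequent", "I am also frequent", "Am I also making a difference"]
DB2 = ["We are frequent", "We are also frequent", "We are all frequent"]

for pattern in ("frequent", "I", "We are"):
    # Prints the pattern followed by its frequency in DB1 and in DB2.
    print(repr(pattern), doc_frequency(pattern, DB1), doc_frequency(pattern, DB2))
```

Under this split, "frequent" occurs in both databases, while "I" occurs only in DB1 and "We are" only in DB2.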
3
Frequent string mining: optimal space?

Problem: Can string mining be done using asymptotically the same space as is needed for storing the string collection?
4
Our result: Space-efficient string mining

Given a collection C of d documents with overall length $n = \sum_{T \in C} |T|$, where $T \in \Sigma^*$ for each $T \in C$, we give a string mining algorithm that uses $O(n \log|\Sigma| + d \log n)$ bits of working space and $O(n \log n)$ time.
Since usually $d \ll n$, the solution is significantly more space-efficient than previous ones, which use $O(n \log n)$ bits of working space.
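To get a feeling for the gap between the two bounds, the following sketch plugs numbers into both expressions. The values of n, the alphabet size and d are purely hypothetical, and the constants hidden by the O-notation are ignored, so the ratio is only indicative.

```python
import math


def bits_compressed(n, sigma, d):
    """n log|Sigma| + d log n, with all hidden constants set to 1."""
    return n * math.log2(sigma) + d * math.log2(n)


def bits_plain(n):
    """n log n, with the hidden constant set to 1."""
    return n * math.log2(n)


# Hypothetical collection: about 10^9 characters over a 4-letter alphabet, 1000 documents.
n, sigma, d = 2**30, 4, 1000
ratio = bits_plain(n) / bits_compressed(n, sigma, d)
print(round(ratio, 2))
```

For these hypothetical values the d log n term is negligible, and the ratio is close to log n / log|Σ|.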

5
High-level description

Tight integration of the algorithm of Kasai et al. [Kasetal01] for visiting all branching substrings of a text with Hui's [Hui92] color set size technique.
Toolbox: compressed suffix array, compressed LCP values, range minimum queries, searchable partial sums.

6
Overview without compressed structures

i    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
T    a  a  b  a  #  a  b  a  a  a  b  #  b  b  a  b  b  #  a  b  b  a  #
SA   5 12 18 23  4 22  8  9  1 10  2  6 15 19 11 17  3 21  7 14 16 20 13
LCP  0  0  0  0  0  1  1  2  3  1  2  3  2  3  0  1  1  2  2  2  1  2  3

# marks the end of a document; the four end markers are distinct and sort, in text order, before a and b.
Example query: RMQ(LCP, 8, 14) = 1.

[Figure: the sorted suffixes written column by column below the SA and LCP rows.]
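The arrays of the running example can be recomputed with a few lines of uncompressed code. This is a minimal sketch: the suffix array is built by naive sorting, the LCP array with the linear-time scan of [Kasetal01], and the range minimum query by a linear scan. The end-of-document markers are assumed to be four distinct symbols smaller than a, written here as \x01 to \x04, and the suffix array is assumed to contain 17 as its 16th entry, since it must be a permutation of 1..23.

```python
def suffix_array(text):
    """Naive suffix sorting (0-based start positions); fine for a 23-character example."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def kasai_lcp(text, sa):
    """lcp[r] = longest common prefix of the suffixes at ranks r-1 and r; lcp[0] = 0."""
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0
    for i in range(n):                      # visit suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]             # lexicographic predecessor of suffix i
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1                      # the next suffix shares at least h-1 characters
        else:
            h = 0
    return lcp


def rmq(values, lo, hi):
    """Minimum of values[lo..hi], 1-based and inclusive, by linear scan."""
    return min(values[lo - 1:hi])


# Assumed end markers: \x01 < \x02 < \x03 < \x04 < 'a'.
TEXT = "aaba\x01abaaab\x02bbabb\x03abba\x04"
SA = suffix_array(TEXT)
LCP = kasai_lcp(TEXT, SA)

print([start + 1 for start in SA])          # 1-based, as on the slide
print(LCP)
print(rmq(LCP, 8, 14))
```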
7
Right-most path of suffix tree
[Figure: the SA and LCP arrays of the running example next to the suffix tree; labels shown along the right-most path: a, a, b, a.]
8
Suffixes-insertion algorithm
[Figure: the SA and LCP arrays of the running example next to the suffix tree; labels shown along the right-most path: a, b, a, b, a.]
9
Maintain only the right-most path

Once a node is popped, its subtree is complete, and all statistics for the substring ending at the node can be reported.

[Figure: the SA and LCP arrays of the running example next to the suffix tree; labels shown along the right-most path: a, b, a, b, a.]
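A minimal, uncompressed sketch of this traversal keeps the right-most path on an ordinary Python list and reports each branching node at the moment it is popped. The function name `branching_nodes` and the 0-based rank convention are assumptions of this sketch. The LCP values are those of the running example.

```python
def branching_nodes(lcp):
    """Return (string_depth, left, right) for each internal node, in pop order.

    lcp[i] is the LCP of the suffixes at ranks i-1 and i (lcp[0] = 0);
    left and right are the 0-based ranks of the node's first and last leaf.
    """
    n = len(lcp)
    stack = [(0, 0)]                      # right-most path as (depth, left); root first
    done = []
    for i in range(1, n + 1):
        cur = lcp[i] if i < n else -1     # -1 flushes the whole path after the last leaf
        left = i - 1
        while stack and stack[-1][0] > cur:
            depth, left = stack.pop()     # subtree complete: its leaves are left..i-1
            done.append((depth, left, i - 1))
        if i < n and stack[-1][0] < cur:
            stack.append((cur, left))     # new node becomes the end of the path
    return done


LCP = [0, 0, 0, 0, 0, 1, 1, 2, 3, 1, 2, 3, 2, 3, 0, 1, 1, 2, 2, 2, 1, 2, 3]
nodes = branching_nodes(LCP)
for node in nodes:
    print(node)
```

The node for the substring "aa" appears as (2, 6, 8): string depth 2, covering the suffixes at 0-based ranks 6 to 8. The root is popped last.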
10
Hui's algorithm

Store at each node v of the suffix tree the values
S[v]: the number of leaves in the subtree of v, and
C[v]: the number of duplicate occurrences of the substring ending at node v, that is, occurrences inside a document that is already represented in the subtree of v.

S[v] - C[v] tells how many different documents there are in the subtree of v. In other words, S[v] - C[v] is the frequency of the substring ending at node v.

Example: for the node reached by a, a we have S[v] = 3 and C[v] = 1.

D (document number of each suffix, in SA order):
0 1 2 3 0 3 1 1 0 1 0 1 2 3 1 2 0 3 1 2 2 3 2

[Figure: the D, SA and LCP arrays of the running example next to the suffix tree.]
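The two counters can be checked directly from the definition. The sketch below is a brute-force illustration rather than Hui's technique itself: it takes the range of ranks covered by a node and counts leaves and distinct documents in D. The node for "aa" is assumed to cover ranks 7 to 9 (1-based), as the LCP values of the running example indicate.

```python
D = [0, 1, 2, 3, 0, 3, 1, 1, 0, 1, 0, 1, 2, 3, 1, 2, 0, 3, 1, 2, 2, 3, 2]


def node_counts(doc_of_rank, left, right):
    """Return (S, C, frequency) for the node whose leaves are ranks left..right (1-based)."""
    docs = doc_of_rank[left - 1:right]
    s = len(docs)                 # S[v]: number of leaves below v
    c = s - len(set(docs))        # C[v]: leaves whose document is already counted
    return s, c, s - c


print(node_counts(D, 7, 9))       # node "aa"
```

For "aa" this gives S[v] = 3 and C[v] = 1, so the substring occurs in two different documents.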
11
Making it all space-efficient 1/5

The right-most path is kept in a special stack.
Relative string depths are coded using Elias codes.
Takes O(n) bits.
Allows constant-time pop/push.

[Figure: the SA and LCP arrays of the running example next to the suffix tree; labels shown along the right-most path: a, a, b, a.]
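One possible way to realise such a stack, not necessarily the one used by the authors, is to append each relative depth as a mirrored Elias gamma code, so that the top element can be decoded by reading the bit sequence backwards. The class name `GammaStack` is hypothetical. In this Python sketch push and pop take time proportional to the code length; constant time would need word-level bit operations.

```python
class GammaStack:
    """Stack of positive integers stored as one bit list of mirrored Elias gamma codes."""

    def __init__(self):
        self.bits = []

    def push(self, x):
        assert x >= 1
        binary = bin(x)[2:]                       # binary digits, leading digit is 1
        # Mirrored gamma code: binary digits reversed, followed by len-1 zeros.
        self.bits.extend(int(b) for b in reversed(binary))
        self.bits.extend([0] * (len(binary) - 1))

    def pop(self):
        zeros = 0
        while self.bits[-1] == 0:                 # unary length part, read backwards
            self.bits.pop()
            zeros += 1
        value = 0
        for _ in range(zeros + 1):                # binary digits, most significant first
            value = (value << 1) | self.bits.pop()
        return value


# Right-most path with string depths 1, 2, 5, stored as relative depths 1, 1, 3.
stack = GammaStack()
for relative_depth in (1, 1, 3):
    stack.push(relative_depth)
print(len(stack.bits))
```

A value x occupies 2⌊log x⌋ + 1 bits, which is at most 2x. Since the relative depths on a path sum to at most n, the whole stack fits in O(n) bits.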
12
Making it all space-efficient 2/5

Preliminary counter values S[v] along the right-most path are encoded in the same way as the stack.
Once a node v is popped, its S[v] value is final, and this value is added to its parent.
O(n) bits with constant-time updates.

[Figure: the SA and LCP arrays of the running example next to the suffix tree; example node on the path a, a with S[v] = 3 and C[v] = 1.]
13
Making it all space-efficient 3/5

Preliminary counter values C[v] along the right-most path are encoded using a dynamic searchable partial sums structure.
Once a node v is popped, its C[v] value is final, and this value is added to its parent.
O(n) bits with O(log n) time updates.

[Figure: the SA and LCP arrays of the running example next to the suffix tree; example node on the path a, a with S[v] = 3 and C[v] = 1.]
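The interface of a searchable partial sums structure can be illustrated with a Fenwick tree, used here as a plain stand-in. It supports update, prefix sum and search in O(log k) time for k counters, but it stores full machine words and therefore does not reach the O(n)-bit bound of the succinct structure mentioned above. The class and method names are assumptions of this sketch.

```python
class PartialSums:
    """Fenwick tree over k non-negative counters, indexed 1..k."""

    def __init__(self, k):
        self.k = k
        self.tree = [0] * (k + 1)

    def add(self, i, delta):
        """counter[i] += delta."""
        while i <= self.k:
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, i):
        """counter[1] + ... + counter[i]."""
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def search(self, target):
        """Smallest i with prefix_sum(i) >= target, or k + 1 if there is none."""
        pos = 0
        step = 1 << self.k.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.k and self.tree[nxt] < target:
                pos = nxt                      # everything up to nxt is still too small
                target -= self.tree[nxt]
            step >>= 1
        return pos + 1


sums = PartialSums(5)
sums.add(2, 3)
sums.add(4, 1)
sums.add(5, 2)
print(sums.prefix_sum(4), sums.search(4))
```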
14
Making it all space-efficient 4/5

Table D stores, in lexicographic order of the suffixes, the number of the document each suffix belongs to.
A predecessor query on D gives the previous occurrence inside the same document.
An RMQ on LCP between the two occurrences gives the string depth at which the C[v] counter should be incremented.

[Figure: the D, SA and LCP arrays of the running example next to the suffix tree; example node on the path a, a with S[v] = 3 and C[v] = 1, and three example queries with results RMQ = 0, RMQ = 2 and RMQ = 1.]
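Putting the pieces together, the following uncompressed sketch runs the right-most path traversal and increments C[v] at the depth given by the range minimum between a suffix and its predecessor from the same document. It is an illustration under simplifying assumptions: the stack is a plain list, the RMQ is a linear scan, the node of the right depth is found by walking down the stack, and S[v] is read off the rank interval instead of being kept as a counter. The name `mine_frequencies` is hypothetical.

```python
def mine_frequencies(D, lcp):
    """Return {(depth, left, right): (S, C)} for every internal node (0-based ranks)."""
    n = len(D)
    pred = {}                          # document -> rank of its most recent suffix
    stack = [[0, 0, 0]]                # right-most path: [depth, left, C]; root first
    result = {}
    for i in range(n + 1):
        cur = lcp[i] if i < n else -1  # -1 flushes the path after the last leaf
        left, carry = i - 1, 0
        while stack and stack[-1][0] > cur:
            depth, left, c = stack.pop()          # node complete, C[v] becomes final
            c += carry
            result[(depth, left, i - 1)] = (i - left, c)
            carry = c                             # hand C[v] on to the parent
        if i == n:
            break
        if stack[-1][0] < cur:
            stack.append([cur, left, carry])      # new parent of the popped nodes
        else:
            stack[-1][2] += carry
        p = pred.get(D[i])
        if p is not None:
            depth = min(lcp[p + 1:i + 1])         # naive RMQ between the two occurrences
            k = len(stack) - 1
            while stack[k][0] > depth:            # node of that depth on the path
                k -= 1
            stack[k][2] += 1
        pred[D[i]] = i
    return result


D = [0, 1, 2, 3, 0, 3, 1, 1, 0, 1, 0, 1, 2, 3, 1, 2, 0, 3, 1, 2, 2, 3, 2]
LCP = [0, 0, 0, 0, 0, 1, 1, 2, 3, 1, 2, 3, 2, 3, 0, 1, 1, 2, 2, 2, 1, 2, 3]
stats = mine_frequencies(D, LCP)
print(stats[(2, 6, 8)])                # node "aa": (S, C)
```

For every reported node, S - C equals the number of distinct documents among its leaves, which can be verified by comparing against a direct count on D.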
15
Making it all space-efficient 5/5

Table D does not need to be stored, as predecessors can be updated "on-the-fly" using an array pred[1..d].
The compressed suffix array supports access in $O(\log^{\epsilon} n)$ time and takes $O(n \log|\Sigma|)$ bits.
A bit-vector B[1,n] marks the document boundaries in the text, so that rank(B, SA[i]) = D[i].
The LCP and RMQ structures each take $2n(1 + o(1))$ bits [HS02, FH07].
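The identity rank(B, SA[i]) = D[i] can be checked on the running example. The sketch assumes one particular convention, which the slides do not spell out: B has a 1 at the first position of every document except the first, and rank counts the 1s up to and including the queried position. Rank is answered from a plain prefix-sum array rather than a succinct structure, and SA is taken with 17 as its 16th entry, since a suffix array is a permutation of 1..n.

```python
from itertools import accumulate

SA = [5, 12, 18, 23, 4, 22, 8, 9, 1, 10, 2, 6, 15, 19, 11, 17, 3, 21, 7, 14, 16, 20, 13]
n = len(SA)

# B[j] refers to text position j; index 0 is unused padding.
B = [0] * (n + 1)
for first_position in (6, 13, 19):     # assumed starts of documents 1, 2 and 3
    B[first_position] = 1

prefix = list(accumulate(B))           # prefix[j] = B[1] + ... + B[j]


def rank(j):
    """Number of 1s in B[1..j]."""
    return prefix[j]


D = [rank(position) for position in SA]
print(D)
```

Under this convention the computed list coincides with table D of the running example, so D itself never has to be stored.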

This presentation only sketched how to compute the frequency values inside one document collection. In addition,
the computation is easy to adjust to report patterns occurring frequently in one document collection and infrequently in the other,
the computation gives a space-efficient construction algorithm for Sadakane's scheme of storing the frequency values [Sad07], and
other compressed text indexes can be plugged in to obtain other space/time tradeoffs.

17
Epilogue

Thanks to the discussions with Luis Russo after
the workshop, we were able to improve the space
from O(n log d) to O(d log n).
The presentation has been changed accordingly.

18
References

[FHK06] Johannes Fischer, Volker Heun, Stefan Kramer: Optimal String Mining under Frequency Constraints. In Proc. PKDD'06, LNAI 4213, pages 139-150, 2006.
[FH07] Johannes Fischer, Volker Heun: A New Succinct Representation of RMQ-Information and Improvements in the Enhanced Suffix Array. In Proc. ESCAPE'07, LNCS 4614, pages 459-470, 2007.
[FMV07] Johannes Fischer, Veli Mäkinen, Niko Välimäki: Space-efficient String Mining under Frequency Constraints. Submitted.
[HS02] Wing-Kai Hon, Kunihiko Sadakane: Space-Economical Algorithms for Finding Maximal Unique Matches. In Proc. CPM 2002, LNCS 2373, pages 144-152, 2002.
[Hui92] Lucas Hui: Color Set Size Problem with Application to String Matching. In Proc. CPM 1992, LNCS 644, pages 230-243, 1992.
[Kasetal01] Toru Kasai, Gunho Lee, Hiroki Arimura, Setsuo Arikawa, Kunsoo Park: Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In Proc. CPM 2001, LNCS 2089, pages 181-192, 2001.
[Sad07] Kunihiko Sadakane: Succinct data structures for flexible text retrieval systems. J. Discrete Algorithms 5(1), pages 12-22, 2007.