Space-Efficient String Mining under Frequency Constraints

1
Space-Efficient String Mining under Frequency Constraints
  • Johannes Fischer, Ludwig-Maximilians-Universität
    München
  • Veli Mäkinen and Niko Välimäki, University of
    Helsinki
2
Frequent string mining: optimal time
  • "frequent" is most frequent but does not make a
    difference...
  • "I" differentiates DB1 from DB2
  • "We are" differentiates DB2 from DB1
  • String mining under several kinds of frequency
    constraints can be done in optimal linear time
    using suffix array techniques [FHK06].

DB1:
  I am frequent
  I am also frequent
  Am I also making a difference
DB2:
  We are frequent
  We are also frequent
  We are all frequent
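
The example above can be reproduced with a brute-force sketch. It is only an illustration of the problem statement, not the linear-time method of [FHK06]: it enumerates every substring explicitly. The split of each database into three documents and the names `doc_frequency`, `mine`, `min_in_first` and `max_in_second` are assumptions made for this illustration.

```python
def substrings(doc):
    """All distinct non-empty substrings of one document."""
    return {doc[i:j] for i in range(len(doc)) for j in range(i + 1, len(doc) + 1)}


def doc_frequency(db):
    """Map each substring to the number of documents of db that contain it."""
    freq = {}
    for doc in db:
        for s in substrings(doc):
            freq[s] = freq.get(s, 0) + 1
    return freq


def mine(first, second, min_in_first, max_in_second):
    """Substrings found in at least min_in_first documents of `first`
    and in at most max_in_second documents of `second`."""
    f1, f2 = doc_frequency(first), doc_frequency(second)
    return {s for s, c in f1.items()
            if c >= min_in_first and f2.get(s, 0) <= max_in_second}


# Assumed document split of the two databases shown on the slide.
DB1 = ["I am frequent", "I am also frequent", "Am I also making a difference"]
DB2 = ["We are frequent", "We are also frequent", "We are all frequent"]

only_db1 = mine(DB1, DB2, min_in_first=3, max_in_second=0)   # contains "I"
only_db2 = mine(DB2, DB1, min_in_first=3, max_in_second=0)   # contains "We are"
```

With these thresholds "frequent" is reported for neither database, since it occurs in documents of both.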
3
Frequent string mining: optimal space?
  • "frequent" is most frequent but does not make a
    difference...
  • "I" differentiates DB1 from DB2
  • "We are" differentiates DB2 from DB1
  • Problem: Can string mining be done using
    asymptotically the same space as is needed
    for storing the string collection?

DB1:
  I am frequent
  I am also frequent
  Am I also making a difference
DB2:
  We are frequent
  We are also frequent
  We are all frequent
4
Our result: Space-efficient string mining
  • Given a collection C of d documents with overall
    length n (the sum of |T| over all T ∈ C), where
    each T ∈ C is a string over the alphabet Σ.
  • We give a string mining algorithm that uses
  • O(n log |Σ| + d log n) bits of working space and
  • O(n log n) time.
  • Since usually d ≪ n, the solution is
    significantly more space-efficient than previous
    ones that use O(n log n) bits of working space.

5
High-level description
  • Tight integration of the algorithm of Kasai et
    al. [Kasetal01] for visiting all branching
    substrings of a text with Hui's [Hui92] color
    set size technique.
  • Toolbox: compressed suffix array, compressed LCP
    values, range minimum queries, searchable partial
    sums.

6
Overview without compressed structures
[Figure: the example text T (positions 1 to 23) with
its suffix array SA and LCP array; the highlighted
range minimum query is RMQ(LCP, 8, 14) = 1.]
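
The three arrays in this overview can be built with a few lines of plain Python. The sketch below is uncompressed and uses 0-based indices: the suffix array is built by naive sorting (an assumption made for brevity, not a linear-time construction), the LCP array by the algorithm of Kasai et al., and the range minimum query by a linear scan. The function names are chosen for this illustration.

```python
def suffix_array(t):
    """Suffix array by plain sorting of all suffixes (simple, not linear time)."""
    return sorted(range(len(t)), key=lambda i: t[i:])


def lcp_array(t, sa):
    """Kasai et al.: lcp[k] is the length of the longest common prefix of the
    suffixes starting at sa[k-1] and sa[k]; lcp[0] is 0 by convention."""
    n = len(t)
    rank = [0] * n
    for k, start in enumerate(sa):
        rank[start] = k
    lcp = [0] * n
    h = 0
    for i in range(n):                      # visit suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]             # lexicographic predecessor of suffix i
            while i + h < n and j + h < n and t[i + h] == t[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1                      # the next suffix shares at least h-1 characters
        else:
            h = 0
    return lcp


def rmq(lcp, i, j):
    """Minimum value of lcp[i..j], both ends inclusive (linear scan)."""
    return min(lcp[i:j + 1])


sa = suffix_array("banana")      # [5, 3, 1, 0, 4, 2]
lcp = lcp_array("banana", sa)    # [0, 1, 3, 0, 0, 2]
```

Applied to the LCP values shown on the slide, the 1-based query RMQ(LCP, 8, 14) becomes `rmq(slide_lcp, 7, 13)` in this 0-based convention and yields 1.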
7
Right-most path of suffix tree
[Figure: the right-most path of the suffix tree (edge
labels a, a, b, a) drawn above the SA and LCP arrays
of the example text.]
8
Suffixes-insertion algorithm
[Figure: a suffix-tree path (edge labels a, b, a, b, a)
drawn above the SA and LCP arrays of the example
text.]
9
Maintain only the right-most path
  • Once a node is popped, its subtree is complete,
    and all statistics for the substring ending at
    the node can be reported.

[Figure: a suffix-tree path (edge labels a, b, a, b, a)
drawn above the SA and LCP arrays of the example
text.]
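
A minimal sketch of this traversal, assuming plain Python lists instead of the compressed structures: a stack holds the nodes of the right-most path as (string depth, left boundary) pairs, and a node is reported at the moment it is popped. The helper `sa_and_lcp` is a naive stand-in introduced for this illustration, and the only statistic reported here is the number of leaves below the node.

```python
def sa_and_lcp(t):
    """Naive suffix array and LCP array (0-based, lcp[0] = 0)."""
    sa = sorted(range(len(t)), key=lambda i: t[i:])
    lcp = [0] * len(t)
    for k in range(1, len(t)):
        a, b = t[sa[k - 1]:], t[sa[k]:]
        h = 0
        while h < min(len(a), len(b)) and a[h] == b[h]:
            h += 1
        lcp[k] = h
    return sa, lcp


def branching_nodes(t):
    """Report (substring, number of leaves) for every internal node except
    the root, in the order in which the nodes are popped."""
    sa, lcp = sa_and_lcp(t)
    n = len(t)
    reported = []
    stack = [(0, 0)]                       # right-most path; the root stays at the bottom
    for i in range(1, n + 1):
        cur = lcp[i] if i < n else 0       # a final 0 flushes the remaining path
        left = i - 1
        while stack[-1][0] > cur:          # nodes deeper than cur are complete
            depth, left = stack.pop()
            start = sa[left]
            reported.append((t[start:start + depth], i - left))
        if stack[-1][0] < cur:             # a new node extends the right-most path
            stack.append((cur, left))
    return reported


nodes = branching_nodes("banana")   # [("ana", 2), ("a", 3), ("na", 2)]
```

Only the stack is kept in memory; everything to the left of the right-most path has already been reported.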
10
Hui's algorithm
  • Store at each node v of the suffix tree the
    values
  • S[v]: the number of leaves in the subtree of v,
    and
  • C[v]: the number of duplicate occurrences of the
    substring ending at node v.

Example node v (path a, a): S[v] = 3, C[v] = 1.
S[v] - C[v] tells how many different documents
there are in the subtree of v. In other words,
S[v] - C[v] is the frequency of the substring
ending at node v.

[Figure: the document array D and the SA and LCP
arrays of the example text.]
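
The two counters can be illustrated directly on the document array. The sketch below only restates the definitions for one node, given as a 0-based suffix-array interval; it is not Hui's algorithm itself, which obtains the same values without rescanning each interval. The array `SLIDE_D` is copied from the slide as extracted, and the interval [6, 8] for the example node is an assumption inferred from the LCP values.

```python
def s_and_c(D, left, right):
    """For the node whose leaves are D[left..right] (inclusive):
    S = number of leaves, C = number of leaves whose document
    already occurred earlier in the same interval."""
    seen = set()
    c = 0
    for i in range(left, right + 1):
        if D[i] in seen:
            c += 1
        seen.add(D[i])
    s = right - left + 1
    return s, c


def frequency(D, left, right):
    """Number of different documents below the node: S - C."""
    s, c = s_and_c(D, left, right)
    return s - c


# Document array as printed on the slide (23 entries, documents 0 to 3).
SLIDE_D = [0, 1, 2, 3, 0, 3, 1, 1, 0, 1, 0, 1, 2, 3, 1, 2, 0, 3, 1, 2, 2, 3, 2]

example = s_and_c(SLIDE_D, 6, 8)   # (3, 1), matching S[v] = 3 and C[v] = 1
```

For this node the frequency is 3 - 1 = 2, because the three leaves belong to documents 1, 1 and 0.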
11
Making it all space-efficient 1/5
  • The right-most path is kept in a special stack.
  • Relative string depths are coded using Elias
    codes.
  • Takes O(n) bits.
  • Allows constant-time pop/push.

[Figure: the right-most path of the suffix tree (edge
labels a, a, b, a) drawn above the SA and LCP arrays
of the example text.]
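
To make the coding step concrete, here is a small sketch of Elias gamma coding applied to the differences between consecutive string depths on a path. The slide does not say which Elias code is used, so gamma is an assumption; the bit string is a Python `str` for readability, and the sketch does not implement the constant-time pop/push claimed above.

```python
def gamma_encode(x):
    """Elias gamma code of a positive integer: as many zeros as the binary
    representation has bits after the leading one, then the binary digits."""
    if x < 1:
        raise ValueError("Elias gamma codes are defined for positive integers")
    binary = bin(x)[2:]
    return "0" * (len(binary) - 1) + binary


def gamma_decode(bits, pos=0):
    """Decode one gamma code starting at bits[pos]; return (value, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2), end


def encode_path(depths):
    """Encode strictly increasing string depths as gamma-coded differences."""
    codes, prev = [], 0
    for d in depths:
        codes.append(gamma_encode(d - prev))
        prev = d
    return "".join(codes)


def decode_path(bits):
    """Inverse of encode_path."""
    depths, pos, total = [], 0, 0
    while pos < len(bits):
        diff, pos = gamma_decode(bits, pos)
        total += diff
        depths.append(total)
    return depths


code = encode_path([1, 3, 4, 9])   # differences 1, 2, 1, 5
```

A difference x costs 2*floor(log2 x) + 1 bits, and the differences on one path sum to at most n, which is what keeps the total in O(n) bits.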
12
Making it all space-efficient 2/5
  • Preliminary counter values S[v] along the
    right-most path are encoded in the same way as
    the stack.
  • Once a node v is popped, its S[v] value is
    final, and this value is added to its parent.
  • O(n) bits with constant-time updates.

[Figure: node v on the path a, a with S[v] = 3 and
C[v] = 1, drawn above the SA and LCP arrays of the
example text.]
13
Making it all space-efficient 3/5
  • Preliminary counter values C[v] along the
    right-most path are encoded using a dynamic
    searchable partial sums structure.
  • Once a node v is popped, its C[v] value is
    final, and this value is added to its parent.
  • O(n) bits with O(log n) time updates.

[Figure: node v on the path a, a with S[v] = 3 and
C[v] = 1, drawn above the SA and LCP arrays of the
example text.]
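
As a rough stand-in for a searchable partial sums structure, the sketch below uses a Fenwick tree (binary indexed tree). This is a swapped-in, fixed-size technique, not the dynamic succinct structure the slide refers to, and how the counters are keyed to the right-most path is not spelled out here. It does show the three operations with O(log n) cost each: update, prefix sum and search. The class and method names are assumptions.

```python
class PartialSums:
    """Fenwick tree over n non-negative counters, indexed from 0."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)        # internal 1-based array

    def add(self, i, delta):
        """Add delta to counter i."""
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, i):
        """Sum of counters 0..i (inclusive)."""
        i += 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total

    def search(self, target):
        """Smallest index i with prefix_sum(i) >= target, or n if there is none.
        Assumes all counters are non-negative."""
        pos = 0
        step = 1 << self.n.bit_length()
        while step:
            nxt = pos + step
            if nxt <= self.n and self.tree[nxt] < target:
                pos = nxt
                target -= self.tree[nxt]
            step >>= 1
        return pos


ps = PartialSums(5)
ps.add(1, 2)      # counters are now [0, 2, 0, 0, 0]
ps.add(3, 1)      # [0, 2, 0, 1, 0]
ps.add(4, 3)      # [0, 2, 0, 1, 3]
```

A dynamic variant would additionally support inserting and deleting counters, which a stack-like path needs; that part is left out of this sketch.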
14
Making it all space-efficient 4/5
  • Table D stores, in lexicographic order of the
    suffixes, the number of the document each suffix
    belongs to.
  • A predecessor query on D gives the previous
    suffix (in lexicographic order) from the same
    document.
  • An RMQ between the two occurrences gives the
    string depth at which the C[v] counter should be
    incremented.

[Figure: node v on the path a, a with S[v] = 3 and
C[v] = 1, drawn above the D, SA and LCP arrays of the
example text; three range minimum queries are marked
RMQ = 0, RMQ = 2 and RMQ = 1.]
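
The interplay of predecessor query and RMQ can be sketched with naive scans in place of the succinct structures. For every suffix-array index i whose document has occurred before at index p, the minimum of LCP over (p, i] is the string depth of the node where one duplicate is counted. The arrays below are copied from the slide as extracted (0-based here), and the function names are assumptions.

```python
def prev_same_doc(D, i):
    """Naive predecessor query: largest p < i with D[p] == D[i], or None."""
    for p in range(i - 1, -1, -1):
        if D[p] == D[i]:
            return p
    return None


def increment_events(D, lcp):
    """List of (i, p, depth): leaf i repeats the document of leaf p, and the
    duplicate is counted at the node of string depth min(lcp[p+1..i])."""
    events = []
    for i in range(len(D)):
        p = prev_same_doc(D, i)
        if p is not None:
            events.append((i, p, min(lcp[p + 1:i + 1])))   # naive RMQ
    return events


def c_value(events, left, right):
    """C[v] for the node with leaves left..right: the duplicates whose two
    occurrences both lie inside the interval."""
    return sum(1 for i, p, _ in events if left <= p and i <= right)


SLIDE_D = [0, 1, 2, 3, 0, 3, 1, 1, 0, 1, 0, 1, 2, 3, 1, 2, 0, 3, 1, 2, 2, 3, 2]
SLIDE_LCP = [0, 0, 0, 0, 0, 1, 1, 2, 3, 1, 2, 3, 2, 3, 0, 1, 1, 2, 2, 2, 1, 2, 3]

events = increment_events(SLIDE_D, SLIDE_LCP)
```

On this data the events (6, 1, 0), (7, 6, 2) and (8, 4, 1) have depths 0, 2 and 1, which appears to correspond to the three queries marked in the figure, and `c_value(events, 6, 8)` is 1 for the example node.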
15
Making it all space-efficient 5/5
  • Table D does not need to be stored, as
    predecessors can be updated "on the fly" using
    an array pred[1..d].
  • The compressed suffix array supports access in
    O(log^ε n) time and takes O(n log |Σ|) bits.
  • A bit-vector B[1,n] marks the document
    boundaries in the text, so that
    rank(B, SA[i]) = D[i].
  • The LCP and RMQ structures each take
    2n(1 + o(1)) bits [HS02, FH07].

[Figure: node v on the path a, a with S[v] = 3 and
C[v] = 1, drawn above the D, SA and LCP arrays of the
example text; three range minimum queries are marked
RMQ = 0, RMQ = 2 and RMQ = 1.]
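
A minimal sketch of how D can be replaced by a rank query plus a small predecessor array, under several assumptions that are not on the slide: documents are concatenated with a separator `#` that occurs in no document, B marks the separator positions, rank counts the ones strictly before a position (so documents are numbered from 0), and rank is answered from a plain prefix-sum table rather than a succinct structure.

```python
from itertools import accumulate


def concatenate(docs, sep="#"):
    """Concatenate the documents, each followed by sep, and mark every
    separator position with a 1 in the bit vector B."""
    text = "".join(doc + sep for doc in docs)
    B = [1 if ch == sep else 0 for ch in text]
    return text, B


def rank_table(B):
    """table[j] = number of ones in B[0..j-1]."""
    return [0] + list(accumulate(B))


def scan_with_pred(docs):
    """Visit the suffixes in lexicographic order and return, for each
    suffix-array index i, the triple (i, document of suffix i, previous
    index of the same document or None). D itself is never stored."""
    text, B = concatenate(docs)
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive suffix array
    ranks = rank_table(B)
    pred = [None] * len(docs)          # pred[k]: last index seen for document k
    visited = []
    for i, start in enumerate(sa):
        doc = ranks[start]             # rank(B, SA[i]) plays the role of D[i]
        visited.append((i, doc, pred[doc]))
        pred[doc] = i                  # update "on the fly"
    return visited


result = scan_with_pred(["ab", "b"])
```

The array `pred` has one entry per document, which is where a term of d log n bits in the space bound comes from.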
16
Extensions
  • This presentation only sketched how to compute
    the frequency values inside one document
    collection. In addition:
  • the computation is easy to adjust to report
    patterns occurring frequently in one document
    collection and infrequently in the other;
  • the computation gives a space-efficient
    construction algorithm for Sadakane's scheme of
    storing the frequency values [Sad07]; and
  • other compressed text indexes can be plugged in
    to obtain other space/time tradeoffs.

17
Epilogue
  • Thanks to the discussions with Luis Russo after
    the workshop, we were able to improve the space
    from O(n log d) to O(d log n).
  • The presentation has been changed accordingly.

18
References
  • [FHK06] Johannes Fischer, Volker Heun, Stefan
    Kramer: Optimal String Mining under Frequency
    Constraints. In Proc. PKDD'06, LNAI 4213, pages
    139-150, 2006.
  • [FH07] Johannes Fischer, Volker Heun: A New
    Succinct Representation of RMQ-Information and
    Improvements in the Enhanced Suffix Array. In
    Proc. ESCAPE'07, LNCS 4614, pages 459-470, 2007.
  • [FMV07] Johannes Fischer, Veli Mäkinen, Niko
    Välimäki: Space-efficient String Mining under
    Frequency Constraints. Submitted.
  • [HS02] Wing-Kai Hon, Kunihiko Sadakane:
    Space-Economical Algorithms for Finding Maximal
    Unique Matches. In Proc. CPM 2002, LNCS 2373,
    pages 144-152, 2002.
  • [Hui92] Lucas Hui: Color Set Size Problem with
    Application to String Matching. In Proc. CPM
    1992, LNCS 644, pages 230-243, 1992.
  • [Kasetal01] Toru Kasai, Gunho Lee, Hiroki
    Arimura, Setsuo Arikawa, Kunsoo Park: Linear-Time
    Longest-Common-Prefix Computation in Suffix
    Arrays and Its Applications. In Proc. CPM 2001,
    LNCS 2089, pages 181-192, 2001.
  • [Sad07] Kunihiko Sadakane: Succinct data
    structures for flexible text retrieval systems.
    J. Discrete Algorithms 5(1), pages 12-22, 2007.