A Suffix Array Based N-gram Extraction Algorithm

Slides: 28
Provided by: work3
Transcript and Presenter's Notes

1
A Suffix Array Based N-gram Extraction Algorithm
Graduate Project
  • Haixia Tang

2
Outline
  • Project Goal
  • Related Work
  • Algorithm
  • Experimental Results
  • Conclusion and Future Work
  • References

3
Project Goal
  • What are n-grams?

good morning
4
Project Goal (Cont.)
  • Why n-grams?
    • High robustness, language independence, numeric
      control, etc.
  • Project Goal
    • Extract the k most frequent (n-gram, frequency) pairs
      from large strings, e.g. hundreds of kilobytes.
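
As a point of reference for the goal above, here is a naive hash-table counter over byte n-grams. It is a baseline sketch only, not the suffix array method of these slides; the function name `top_k_ngrams` is hypothetical, and it assumes the whole input fits in memory.

```python
from collections import Counter


def top_k_ngrams(data: bytes, n: int, k: int):
    """Return the k most frequent byte n-grams as (n-gram, frequency) pairs."""
    # One hash-table entry per distinct n-gram; memory grows with the
    # number of distinct n-grams, which is what a suffix array avoids.
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return counts.most_common(k)


print(top_k_ngrams(b"good morning", 1, 1))  # [(b'o', 3)]
```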

5
Related Work
  • Text::Ngrams (Keselj 2003)
  • Efficient Mining of Textual Associations (Gil and
    Dias 2003)
  • Text::Ngram: basis for n-gram analysis (Cozens
    2003)
  • The design, implementation and use of the Ngram
    Statistics Package (Banerjee and Pedersen 2004)

6
Algorithm Outline
7
Suffix Array Construction
  • What is a suffix array?
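
A suffix array lists the starting positions of all suffixes of a string in lexicographic order. The sketch below builds one by plain sorting, which is far slower than the construction algorithms cited on the next slide but shows the data structure itself; the function name `suffix_array` is chosen here for illustration.

```python
def suffix_array(s: bytes):
    """Return the start offsets of all suffixes of s, sorted lexicographically."""
    # Naive construction: compare whole suffixes. Fine for small inputs only.
    return sorted(range(len(s)), key=lambda i: s[i:])


# Suffixes of b"banana" in sorted order: a, ana, anana, banana, na, nana
print(suffix_array(b"banana"))  # [5, 3, 1, 0, 4, 2]
```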

8
References related to Suffix Array Construction
  • A suffix array construction algorithm by Manber
    and Myers (1993)
  • The Skew algorithm by Kärkkäinen and Sanders
    (2003)
  • DC3 algorithm by Dementiev, Kärkkäinen, Mehnert,
    and Sanders (2005)

9
Lcp Array Construction
  • What is an lcp array?
  • An lcp array construction algorithm by Manber
    and Myers (1993).

Lcp[x]: the length of the longest common prefix of the suffixes
S[SA[x-1]] and S[SA[x]]
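
A minimal sketch of that definition, assuming Lcp[0] is set to 0 because the first suffix has no predecessor. It compares adjacent suffixes character by character and is not the Manber and Myers construction; the helper names are hypothetical.

```python
def lcp_array(s: bytes, sa):
    """lcp[x] = length of the longest common prefix of s[sa[x-1]:] and s[sa[x]:]."""
    def common_prefix_len(i, j):
        k = 0
        while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
            k += 1
        return k

    # By convention here, the first entry is 0 (no preceding suffix).
    return [0 if x == 0 else common_prefix_len(sa[x - 1], sa[x])
            for x in range(len(sa))]


s = b"banana"
sa = sorted(range(len(s)), key=lambda i: s[i:])  # [5, 3, 1, 0, 4, 2]
print(lcp_array(s, sa))  # [0, 1, 3, 0, 0, 2]
```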
10
Extraction of Implicit N-gram Descriptors
  • Implicit n-gram descriptors:
    • (endpos, frequency) pairs
  • A suffix array + an lcp array -> a list of
    implicit n-gram descriptors.
  • A streaming method is used when the data set is
    large.
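
One way to read this step: suffixes that share their first n bytes sit next to each other in the suffix array, and the lcp values between them are at least n, so each such run yields one descriptor. The sketch below follows that reading. It assumes endpos is the exclusive end offset of one occurrence, which the slides do not specify, and `extract_implicit_sketch` is a hypothetical name, not the author's routine.

```python
def extract_implicit_sketch(s: bytes, n: int):
    """Return (endpos, frequency) pairs, one per distinct n-gram of s."""
    sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix array

    def lcp(i, j):
        k = 0
        while i + k < len(s) and j + k < len(s) and s[i + k] == s[j + k]:
            k += 1
        return k

    descriptors = []
    x = 0
    while x < len(sa):
        if sa[x] + n > len(s):  # suffix shorter than n holds no n-gram
            x += 1
            continue
        y = x + 1
        # Extend the run while neighbouring suffixes agree on n bytes.
        while y < len(sa) and lcp(sa[y - 1], sa[y]) >= n:
            y += 1
        descriptors.append((sa[x] + n, y - x))  # (exclusive end, run length)
        x = y
    return descriptors


print(extract_implicit_sketch(b"banana", 2))  # [(5, 2), (2, 1), (6, 2)]
```

The descriptors store no n-gram text, only where one occurrence ends and how often it occurs, which is why they are called implicit.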

11
Data Flow of the Descriptor Module
[Figure: data flow of the descriptor module under the streaming method. Labels: Input; Read M/2 Data; Main Memory; Suffix Array + Lcp Array; Internal extract_implicit algorithm; Descriptors extracted; Boundary Descriptor; Append data; Output: Implicit n-gram descriptor list. The loop runs ceiling(N/(M/2)) times.]
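
The loop count can be illustrated with a small chunked reader. This sketch assumes N is the input size and M the memory budget in bytes, which is a reading of the figure labels rather than something the slides state. It only shows the chunking; handling of n-grams that straddle a chunk boundary (the "Boundary Descriptor" in the figure) is not modelled, and `read_chunks` is a hypothetical name.

```python
import io


def read_chunks(stream, memory_budget: int):
    """Yield successive chunks of at most memory_budget // 2 bytes."""
    chunk_size = max(1, memory_budget // 2)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


# N = 10 bytes, M = 6 bytes -> ceiling(10 / 3) = 4 iterations
chunks = list(read_chunks(io.BytesIO(b"x" * 10), 6))
print(len(chunks))  # 4
```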
12
Selection of k most frequent Descriptors
  • The standard selection algorithm (Cormen,
    Leiserson, Rivest, et al. 2001), which selects
    the kth order statistic.
  • A streaming method is used when the input data set is
    large.
  • The selected list is then sorted by end position.
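
A minimal sketch of this step using randomized selection on the frequency field, followed by a sort on end position. It works on an in-memory list and does not model the streaming variant; `select_top_k` is a hypothetical name, and ties at the cut-off frequency are broken arbitrarily.

```python
import random


def select_top_k(descriptors, k: int):
    """Pick the k (endpos, frequency) pairs with the highest frequency."""
    if k <= 0:
        return []
    if k >= len(descriptors):
        return list(descriptors)
    pivot = random.choice(descriptors)[1]
    higher = [d for d in descriptors if d[1] > pivot]
    equal = [d for d in descriptors if d[1] == pivot]
    lower = [d for d in descriptors if d[1] < pivot]
    if k <= len(higher):
        return select_top_k(higher, k)
    if k <= len(higher) + len(equal):
        return higher + equal[:k - len(higher)]
    return higher + equal + select_top_k(lower, k - len(higher) - len(equal))


descriptors = [(5, 2), (2, 1), (6, 2), (9, 7), (4, 3)]
# Sorting the tuples orders them by end position (the first field).
print(sorted(select_top_k(descriptors, 2)))  # [(4, 3), (9, 7)]
```

Selection runs in expected linear time, so only the k survivors need to be sorted.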

13
Data Flow of the Selection Step
[Figure: data flow of the selection step under the streaming method. Labels: Input: Implicit Descriptor List; Read Data; Main Memory; Randomized Selection Algorithm; Output: k most frequent Descriptors. The loop runs until all the data is processed.]
14
Extraction of the (n-gram, frequency) pairs
  • To extract the (n-gram, frequency) pairs from the
    original string.
  • Original string + k most frequent descriptors -> a
    list of (n-gram, frequency) pairs.
  • Sorting the list by frequency.
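
This step can be sketched as a single forward pass over the text that keeps only the last n bytes read. The sketch assumes each endpos is an exclusive end offset and is at least n; that convention and the name `extract_ngrams` are assumptions, not taken from the slides.

```python
import io
from collections import deque


def extract_ngrams(stream, descriptors, n: int):
    """Turn (endpos, frequency) descriptors into (n-gram, frequency) pairs."""
    window = deque(maxlen=n)  # sliding window of the last n bytes read
    pos = 0
    pairs = []
    for endpos, freq in sorted(descriptors):  # visit in end-position order
        while pos < endpos:
            byte = stream.read(1)
            if not byte:
                raise ValueError("descriptor end position lies beyond the input")
            window.append(byte)
            pos += 1
        pairs.append((b"".join(window), freq))
    pairs.sort(key=lambda p: p[1], reverse=True)  # most frequent first
    return pairs


text = io.BytesIO(b"banana")
print(extract_ngrams(text, [(5, 2), (6, 2), (2, 1)], 2))
# [(b'an', 2), (b'na', 2), (b'ba', 1)]
```

Because the descriptors are visited in end-position order, the input is never rewound, which suits a file that is too large to hold in memory.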

15
Data Flow of the Extraction Step
[Figure: data flow of the extraction step. Labels: Input: Input Text; Input: k most frequent descriptors sorted by end positions; Main Memory; Forward until current n-gram endpos; Keep a window of last n bytes read; Append current n-gram; Output: List of (n-gram, frequency) pairs. The loop runs k times.]
16
Implementation Details
  • Two implementations
    • Internal
      • assumes that main memory is unlimited;
      • suffix array + lcp array construction (Manber and
        Myers 1993);
      • buffered file stream operations: fputc(),
        fgetc().
    • External
      • streaming method;
      • suffix array + lcp array construction (Dementiev,
        Kärkkäinen, Mehnert, and Sanders 2005);
      • STXXL library by Dementiev and Sanders (2005) for
        the suffix array construction module;
      • low-level file operations: read(), write().
  • C/C++ implementation, gcc/g++ 3.3, 3.4

17
Data Sets
18
Experiments
  • The internal memory implementation vs. a Perl
    hash-table-based n-gram extraction approach
    provided by Keselj (2003)
    • sun4u machine with 16G RAM, a 70G disk, running
      SunOS 5.8.
  • The internal memory implementation vs. the
    external memory implementation
    • Intel(R) Pentium(R) 4, 3.0GHz CPU, 2.0G RAM, 2.0G
      Swap, 50G disk, running Red Hat Linux

19
Parameters
  • n: the length of an n-gram.
  • k: the number of most frequent n-grams
    we are interested in.
  • Ts: the time spent by the internal suffix-array-based
    implementation.
  • Th: the time spent by the Perl hash-table-based
    implementation.
  • Tsa: the time spent by the external suffix array
    construction module.
  • Timp + Tsel + Text: the time spent by the
    descriptor module, selection module,
    and extraction module of the
    external implementation.

20
Fixed k
1..10
21
Fixed n
22
Internal vs. External
23
Tsa vs. Timp + Tsel + Text
24
Conclusion
  • We introduced an efficient suffix array based
    n-gram extraction algorithm.
  • 5 modules
    • Suffix, Lcp, Descriptor, Selection, Extraction.
  • 2 comparisons and conclusions
    • Internal vs. Perl hash table based
      • Faster, and insensitive to n
    • External vs. internal
      • Internal is faster when the data fits in main memory
      • External can deal with larger datasets than
        internal.

25
Future Work
  • To extend our algorithm to character level, word
    level and other desired unit level.
  • To design and implement an external version of
    the lcp array construction module.

26
References
  • Dementiev, R. (2005, August). STXXL Home Page.
  • Dementiev, R., J. Kärkkäinen, J. Mehnert, and P.
    Sanders (2005, January). Better external memory
    suffix array construction. In ALENEX05: Algorithm
    Engineering and Experiments.
  • Dementiev, R. and P. Sanders (2005). Stxxl:
    Standard template library for XXL data sets.
    Technical Report 2005/18.
  • Gil, A. and G. Dias (2003, October). Efficient
    mining of textual associations. In Proceedings of
    the International Conference on Natural Language
    Processing and Knowledge Engineering, Chengqing
    Zong (eds), Beijing, China, pp. 549-555. IEEE
    Press.
  • Kärkkäinen, J. and P. Sanders (2003). Simple
    linear work suffix array construction. In
    Proc. 13th International Conference on Automata,
    Languages and Programming, Volume 2719 of LNCS,
    pp. 943-955. Springer.
  • Keselj, V. (2003). A New Perl Package for N-gram
    analysis.
  • Keselj, V., F. Peng, N. Cercone, and C. Thomas
    (2003). N-gram-based author profiles for
    authorship attribution. Pacific Association for
    Computational Linguistics.
  • Manber, U. and G. Myers (1993, October). Suffix
    arrays: A new method for on-line string searches.
    SIAM J. Comput. 22, 935-948.
  • Mehnert, J. (2004). Pipelined external memory
    2-Tupling, 4-Tupling, 2-Discarding, 4-Discarding
    and DC3 (Skew) suffix array construction
    algorithm implementations.
  • Menon-Sen, A. (2002, October).

27
Thanks
Questions?