Title: A Suffix Array Based Ngram Extraction Algorithm
1A Suffix Array Based N-gram Extraction Algorithm
Graduate Project
2Outline
- Conclusion and Future Work
3Project Goal
good morning
4Project Goal Cont
- Why n-grams?
- High robustness, language independency, numeric
control, etc.
- Project Goal
- Extract k most frequent (n-gram, frequency) pairs
from large strings, e.g. hundreds of kilobytes.
5Related Work
- Text::Ngrams (Keselj 2003)
- Efficient Mining of Textual Associations (Gil and
Dias 2003)
- Text::Ngram: basis for n-gram analysis (Cozens
2003)
- The design, implementation and use of the ngram
statistics package (Banerjee and Pedersen 2004)
6Algorithm Outline
7Suffix Array Construction
8References related to Suffix Array Construction
- A suffix array construction algorithm by Manber
and Myers (1993)
- The Skew algorithm by Kärkkäinen and Sanders
(2003)
- DC3 algorithm by Dementiev, Kärkkäinen, Mehnert,
and Sanders (2005)
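To make the data structure concrete, here is a minimal Python sketch of a suffix array. It is a naive sort of all suffixes, not the Manber and Myers, Skew, or DC3 algorithms cited above, and the function name `build_suffix_array` is an assumption made for illustration.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text`, in lexicographic order."""
    # Naive construction: sort the positions by the suffix starting there.
    # Fine for illustration; the cited algorithms are far more efficient.
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = build_suffix_array("banana")
print(sa)  # [5, 3, 1, 0, 4, 2]  ->  a, ana, anana, banana, na, nana
```

Adjacent entries of the array are the suffixes that share the longest prefixes, which is what the later steps rely on.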
9Lcp Array Construction
- What is an lcp array
- An lcp array construction algorithm by Manber
and Myers (1993).
Lcp[x] = length of the longest common prefix of the
suffixes starting at SA[x-1] and SA[x]
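The definition can be checked with a short Python sketch. It compares neighbouring suffixes character by character, which is a simple stand-in and not the Manber and Myers construction; the helper name `build_lcp_array` and the convention Lcp[0] = 0 are assumptions.

```python
def build_lcp_array(text, sa):
    """lcp[x] = length of the common prefix of the suffixes at sa[x-1] and sa[x]."""
    lcp = [0] * len(sa)  # lcp[0] has no predecessor; set to 0 by convention
    for x in range(1, len(sa)):
        a, b = sa[x - 1], sa[x]
        k = 0
        # Extend the match while both suffixes still have characters left.
        while a + k < len(text) and b + k < len(text) and text[a + k] == text[b + k]:
            k += 1
        lcp[x] = k
    return lcp


text = "banana"
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
print(build_lcp_array(text, sa))  # [0, 1, 3, 0, 0, 2]
```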
10Extraction of Implicit N-gram Descriptors
- Implicit n-gram descriptors.
- (endpos, frequency) pairs
- A suffix array + an lcp array -> a list of
implicit n-gram descriptors.
- A streaming method is used when the data set is
large.
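One way to picture this step is the Python sketch below. It assumes that `endpos` is the exclusive end position of one occurrence of the n-gram, and it groups neighbouring suffixes whose lcp value is at least n. The function name `extract_implicit` is borrowed from the data-flow slide, but the body is an illustrative assumption, not the project's code.

```python
def extract_implicit(text_len, sa, lcp, n):
    """Return one (endpos, frequency) pair per distinct n-gram."""
    descriptors = []
    for x, start in enumerate(sa):
        if start + n > text_len:
            continue  # suffix shorter than n holds no n-gram
        if x > 0 and lcp[x] >= n:
            # Same first n characters as the previous suffix: same n-gram.
            endpos, freq = descriptors[-1]
            descriptors[-1] = (endpos, freq + 1)
        else:
            descriptors.append((start + n, 1))
    return descriptors


# Suffix array and lcp array of "banana" (computed by hand for this example).
sa = [5, 3, 1, 0, 4, 2]
lcp = [0, 1, 3, 0, 0, 2]
print(extract_implicit(6, sa, lcp, 2))  # [(5, 2), (2, 1), (6, 2)]
```

The descriptors stay small because they point into the text instead of copying the n-gram itself: here (5, 2) stands for "an" ending at position 5 with frequency 2.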
11Data Flow of the Descriptor Module
[Diagram: streaming method]
- Input: Suffix Array, Lcp Array
- Loop for ceiling(N/(M/2)) times: read M/2 data into
main memory, run the internal extract_implicit
algorithm, keep the boundary descriptor, append the
extracted descriptors
- Output: Implicit n-gram descriptor list
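The loop can be sketched in Python as follows. The sketch assumes that N is the number of (suffix array, lcp) entries, that `chunk` plays the role of M/2, and that the boundary descriptor is the last descriptor of a pass, which may still grow in the next pass. All names are hypothetical.

```python
import math


def stream_descriptors(pairs, n, text_len, chunk):
    """Extract (endpos, frequency) pairs from (sa[x], lcp[x]) entries, chunk by chunk."""
    out = []
    boundary = None  # descriptor carried across chunk boundaries
    passes = math.ceil(len(pairs) / chunk)  # ceiling(N / (M/2)) passes
    for p in range(passes):
        block = pairs[p * chunk:(p + 1) * chunk]  # stands in for "read M/2 data"
        for start, common in block:
            if start + n > text_len:
                continue  # suffix too short to hold an n-gram
            if boundary is not None and common >= n:
                boundary = (boundary[0], boundary[1] + 1)  # same n-gram continues
            else:
                if boundary is not None:
                    out.append(boundary)  # previous n-gram is complete
                boundary = (start + n, 1)
    if boundary is not None:
        out.append(boundary)
    return out


# (sa[x], lcp[x]) entries for "banana", computed by hand for this example.
pairs = [(5, 0), (3, 1), (1, 3), (0, 0), (4, 0), (2, 2)]
print(stream_descriptors(pairs, n=2, text_len=6, chunk=2))  # [(5, 2), (2, 1), (6, 2)]
```

With chunk = 2 the group for "an" spans the first and second pass, and carrying the boundary descriptor keeps its count intact.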
12Selection of k most frequent Descriptors
- The standard selection algorithm (Cormen,
Leiserson, Rivest, and Stein 2001) selects
the kth order statistic.
- Streaming method when the input data set is
large.
- Sorting the list by end positions.
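A minimal in-memory Python sketch of this step follows. It uses randomized selection to find the k-th largest frequency, keeps the descriptors at or above that threshold, and sorts them by end position. The streaming variant is not shown, and the function names are assumptions.

```python
import random


def select_kth_largest(values, k):
    """Return the k-th largest value (1-based) using randomized selection."""
    items = list(values)
    while True:
        pivot = random.choice(items)
        greater = [v for v in items if v > pivot]
        equal = [v for v in items if v == pivot]
        if k <= len(greater):
            items = greater
        elif k <= len(greater) + len(equal):
            return pivot
        else:
            k -= len(greater) + len(equal)
            items = [v for v in items if v < pivot]


def k_most_frequent(descriptors, k):
    """Keep the k descriptors with the highest frequency, sorted by endpos."""
    if k <= 0:
        return []
    if k >= len(descriptors):
        return sorted(descriptors)
    threshold = select_kth_largest([freq for _, freq in descriptors], k)
    above = [d for d in descriptors if d[1] > threshold]
    ties = [d for d in descriptors if d[1] == threshold]
    chosen = above + ties[:k - len(above)]  # break ties by list order
    return sorted(chosen)  # tuples sort by endpos first


print(k_most_frequent([(5, 2), (2, 1), (6, 2), (9, 7)], 2))  # [(5, 2), (9, 7)]
```

Selection avoids sorting the whole descriptor list by frequency; only the k survivors are sorted, and by end position so the text can later be read in a single forward pass.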
13Data Flow of the Selection Step
[Diagram: streaming method, selection step]
- Input: Implicit Descriptor List
- Loop until all the data is processed: read data into
main memory, apply the randomized selection algorithm
- Output: k most frequent descriptors
14Extraction of the (n-gram, frequency) pairs
- To extract the (n-gram, frequency) pairs from the
original string.
- Original string + k most frequent descriptors -> a
list of (n-gram, frequency) pairs.
- Sorting the list by frequency.
15Data Flow of the Extraction Step
[Diagram: extraction step]
- Input: Input Text
- Main memory: k most frequent descriptors sorted by end
positions
- Loop for k times: forward until the current n-gram
endpos, keeping a window of the last n bytes read;
append the current n-gram
- Output: List of (n-gram, frequency) pairs
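The single forward pass can be sketched in Python as below. The sketch assumes that `endpos` is an exclusive end position, that descriptors arrive sorted by end position, and that reading one byte at a time from a stream stands in for the file operations. The function name is hypothetical.

```python
import io


def extract_ngrams(stream, descriptors, n):
    """Turn (endpos, frequency) descriptors into (n-gram, frequency) pairs."""
    window = bytearray()  # the last n bytes read
    pos = 0               # number of bytes read so far
    result = []
    for endpos, freq in descriptors:
        while pos < endpos:  # forward until the current n-gram endpos
            byte = stream.read(1)
            if not byte:
                raise ValueError("endpos lies beyond the end of the input")
            window += byte
            if len(window) > n:
                del window[0]
            pos += 1
        result.append((bytes(window), freq))  # append the current n-gram
    result.sort(key=lambda pair: pair[1], reverse=True)  # most frequent first
    return result


text = io.BytesIO(b"banana")
print(extract_ngrams(text, [(2, 1), (5, 2), (6, 2)], 2))
# [(b'an', 2), (b'na', 2), (b'ba', 1)]
```

Because the descriptors are sorted by end position, the text is never rewound, and memory use stays at n bytes plus the k results.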
16Implementation Details
- Two implementations
- Internal
- assuming the main memory is unlimited.
- suffix array and lcp array construction (Manber and
Myers 1993),
- buffered file stream operations: fputc(),
fgetc(),
- External
- streaming method
- suffix array and lcp array construction (Dementiev,
Kärkkäinen, Mehnert, and Sanders 2005),
- STXXL library by Dementiev and Sanders (2005) for
the suffix array construction module.
- Low-level file operations: read(), write(),
- C/C++ implementation, gcc/g++ 3.3, 3.4
17Data Sets
18Experiments
- The internal memory implementation vs. a Perl
hash table based n-gram extraction approach
provided by Keselj (2003)
- sun4u machine with 16G RAM, a 70G disk, running
SunOS 5.8.
- The internal memory implementation vs. the
external memory implementation
- Intel(R) Pentium(R) 4, 3.0GHz CPU, 2.0G RAM, 2.0G
Swap, 50G disk, running Red Hat Linux
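For orientation, the hash-table style of n-gram counting used as the baseline can be sketched in a few lines of Python. This is an illustration of the general idea only; it is not Keselj's Perl package, and the function name `top_k_ngrams` is an assumption.

```python
from collections import Counter


def top_k_ngrams(text, n, k):
    """Count every n-gram in a hash table and return the k most frequent."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return counts.most_common(k)


print(dict(top_k_ngrams("banana", 2, 2)))  # {'an': 2, 'na': 2}
```

The table stores a separate key for every distinct n-gram, so its cost is tied to n, whereas the suffix array approach stores only positions and counts.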
19Parameters
- n: the length of an n-gram.
- k: the number of the most frequent n-grams
we are interested in.
- Ts: the time spent by the internal suffix
array based implementation.
- Th: the time spent by the Perl hash-table-based
implementation.
- Tsa: the time spent by the external suffix array
construction module.
- Timp + Tsel + Text: the time spent by the
descriptor module, selection module,
and extraction module of the
external implementation.
20Fixed k
n = 1..10
21Fixed n
22Internal vs. External
23Tsa vs. Timp + Tsel + Text
24Conclusion
- We introduced an efficient suffix array based
n-gram extraction algorithm.
- 5 modules
- Suffix, Lcp, Descriptor, Selection, Extraction.
- 2 Comparisons and conclusions
- Internal vs. Perl hash table based
- Faster, immune to n
- External vs. internal
- Internal is faster when data fits in main memory
- External can deal with larger datasets than
internal.
25Future Work
- To extend our algorithm to character level, word
level and other desired unit level.
- To design and implement an external version of
the lcp array construction module.
26References
- Dementiev, R. (2005, August). STXXL Home Page.
- Dementiev, R., J. Kärkkäinen, J. Mehnert, and P.
Sanders (2005, January). Better external memory
suffix array construction. ALENEX05: Algorithm
Engineering and Experiments.
- Dementiev, R. and P. Sanders (2005). Stxxl:
Standard template library for xxl data sets.
Technical Report 2005/18.
- Gil, A. and G. Dias (2003, October). Efficient
mining of textual associations. In Proceedings of
the International Conference on Natural Language
Processing and Knowledge Engineering, Chengqing
Zong (eds), Beijing, China, pp. 549-555. IEEE
Press.
- Kärkkäinen, J. and P. Sanders (2003). Simple
linear work suffix array construction. In
Proc. 13th International Conference on Automata,
Languages and Programming, Volume 2719 of LNCS,
pp. 943-955. Springer.
- Keselj, V. (2003). A New Perl Package for N-gram
analysis.
- Keselj, V., F. Peng, N. Cercone, and C. Thomas
(2003). N-gram-based author profiles for
authorship attribution. Pacific Association for
Computational Linguistics.
- Manber, U. and G. Myers (1993, October). Suffix
arrays: A new method for on-line string searches.
SIAM J. Comput. 22, 935-948.
- Mehnert, J. (2004). Pipelined external memory
2-Tupling, 4-Tupling, 2-Discarding, 4-Discarding
and DC3 (Skew) suffix array construction
algorithm implementations.
- Menon-Sen, A. (2002, October).
27Thanks
Questions?