1
A Suffix Array Based N-gram Extraction Algorithm
Graduate Project

Haixia Tang

2
Outline

Project Goal

Related Work

Algorithm

Experimental Results

Conclusion and Future Work

References

3
Project Goal

What are n-grams?

good morning
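The figure for this slide did not survive, so here is a minimal sketch of what the n-grams of "good morning" could look like, assuming character-level n-grams (the slide does not say which unit it illustrates). The function name `char_ngrams` is made up for this example.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return every contiguous substring of length n, left to right."""
    # range() is empty when n > len(text), so the result is then [].
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    for gram in char_ngrams("good morning", 3):
        print(repr(gram))
```

A string of length L has L - n + 1 overlapping n-grams, so "good morning" (12 characters) yields 10 trigrams, starting with 'goo', 'ood', 'od '.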
4
Project Goal (Cont.)

Why n-grams?
High robustness, language independence, numeric
control, etc.
Project Goal
Extract k most frequent (n-gram, frequency) pairs
from large strings, e.g. hundreds of kilobytes.

5
Related Work

Text::Ngrams (Keselj 2003)
Efficient Mining of Textual Associations (Gil and
Dias 2003)
Text::Ngram: Basis for n-gram analysis (Cozens
2003)
The design, implementation and use of the Ngram
Statistics Package (Banerjee and Pedersen 2004)

6
Algorithm Outline
7
Suffix Array Construction

What is a suffix array?
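As a quick illustration of the data structure: a suffix array lists the starting positions of all suffixes of a string in lexicographic order. The sketch below builds one by plain sorting, which is simple but far slower than the construction algorithms cited in these slides; the name `build_suffix_array` is an assumption for this example only.

```python
def build_suffix_array(s: bytes) -> list[int]:
    """Naive suffix array: sort suffix start positions by the suffix text.

    Each comparison may scan a whole suffix, so this is only suitable
    for small inputs.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


if __name__ == "__main__":
    text = b"banana"
    for start in build_suffix_array(text):
        print(start, text[start:].decode())
```

For "banana" the suffixes sort as a, ana, anana, banana, na, nana, so the array is [5, 3, 1, 0, 4, 2].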

8
References related to Suffix Array Construction

A suffix array construction algorithm by Manber
and Myers (1993)
The Skew algorithm by Kärkkäinen and Sanders
(2003)
DC3 algorithm by Dementiev, Kärkkäinen, Mehnert,
and Sanders (2005)

9
Lcp Array Construction

What is an lcp array?
An lcp array construction algorithm by Manber
and Myers (1993).

Lcp[x]: the length of the longest common prefix of
suffixes S_SA[x-1] and S_SA[x]
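To make the definition concrete, here is a minimal sketch that computes the lcp array by comparing each pair of neighbouring suffixes directly. It is a naive stand-in, not the Manber and Myers construction cited above, and the function names are assumptions for this example.

```python
def build_suffix_array(s: bytes) -> list[int]:
    """Naive suffix array by sorting suffix start positions."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def build_lcp_array(s: bytes, sa: list[int]) -> list[int]:
    """lcp[x] = length of the common prefix of suffixes sa[x-1] and sa[x].

    lcp[0] is set to 0 because the first suffix has no predecessor.
    """
    lcp = [0] * len(sa)
    for x in range(1, len(sa)):
        a, b = sa[x - 1], sa[x]
        h = 0
        # Extend the match one byte at a time until a mismatch or string end.
        while a + h < len(s) and b + h < len(s) and s[a + h] == s[b + h]:
            h += 1
        lcp[x] = h
    return lcp


if __name__ == "__main__":
    text = b"banana"
    sa = build_suffix_array(text)
    print(sa, build_lcp_array(text, sa))
```

For "banana" this gives the suffix array [5, 3, 1, 0, 4, 2] and the lcp array [0, 1, 3, 0, 0, 2]; for instance, "ana" and "anana" share the three-byte prefix "ana".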
10
Extraction of Implicit N-gram Descriptors

Implicit n-gram descriptors.
(endpos, frequency) pairs
A suffix array + an lcp array -> a list of
implicit n-gram descriptors.
A streaming method is used when the data set is
large.
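One way the (endpos, frequency) pairs might be derived for a fixed n is sketched below. It relies on the fact that suffixes sharing their first n bytes sit next to each other in the suffix array, with lcp values of at least n between them. Treating endpos as the exclusive end offset of one occurrence is an assumption, as is the signature of `extract_implicit`; the slides only name the algorithm.

```python
def extract_implicit(sa: list[int], lcp: list[int], n: int,
                     text_len: int) -> list[tuple[int, int]]:
    """Return one (endpos, frequency) descriptor per distinct n-gram.

    sa  -- suffix array of the text
    lcp -- lcp[x] is the common prefix length of suffixes sa[x-1], sa[x]
    """
    descriptors = []
    i = 0
    while i < len(sa):
        if sa[i] + n > text_len:
            # This suffix is shorter than n, so it contains no n-gram.
            i += 1
            continue
        j = i + 1
        # Every following suffix with lcp >= n starts with the same n-gram.
        while j < len(sa) and lcp[j] >= n:
            j += 1
        descriptors.append((sa[i] + n, j - i))
        i = j
    return descriptors


if __name__ == "__main__":
    # Suffix array and lcp array of b"banana".
    print(extract_implicit([5, 3, 1, 0, 4, 2], [0, 1, 3, 0, 0, 2], 2, 6))
```

On "banana" with n = 2 this yields [(5, 2), (2, 1), (6, 2)], standing for "an" (twice), "ba" (once) and "na" (twice). A descriptor stores no n-gram text, which keeps the list small until the final extraction step.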

11
Data Flow of the Descriptor Module
Streaming method (diagram):
Input: suffix array, lcp array
Loop ceiling(N/(M/2)) times: read M/2 of the data into main memory, run the internal extract_implicit algorithm, append the extracted descriptors to the output
Boundary descriptor
Output: implicit n-gram descriptor list
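The loop count ceiling(N/(M/2)) can be illustrated with a small sketch, assuming N is the number of input entries and M is the number of entries that fit in main memory (the slide does not define either symbol, nor say why only half of memory is read per pass). The helper `chunk_ranges` is hypothetical.

```python
import math


def chunk_ranges(total_items: int, memory_items: int) -> list[tuple[int, int]]:
    """Split [0, total_items) into passes of at most memory_items // 2 items."""
    half = memory_items // 2
    if half < 1:
        raise ValueError("memory_items must be at least 2")
    passes = math.ceil(total_items / half)
    # Each pass covers a half-open range; the last one may be shorter.
    return [(p * half, min((p + 1) * half, total_items)) for p in range(passes)]


if __name__ == "__main__":
    print(chunk_ranges(10, 8))
```

With N = 10 and M = 8, each pass reads 4 entries and the loop runs ceiling(10/4) = 3 times, covering (0, 4), (4, 8) and (8, 10).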
12
Selection of k most frequent Descriptors

The standard selection algorithm (Cormen,
Leiserson, Rivest, and Stein 2001) selects
the kth order statistic.
Streaming method when the input data set is
large.
Sorting the list by end positions.
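The selection step could be sketched as a randomized quickselect over descriptor frequencies, followed by a sort on end position. This is an in-memory illustration in the spirit of the textbook algorithm, not the streaming version described here, and the name `k_most_frequent` is an assumption.

```python
import random


def k_most_frequent(descriptors, k):
    """Pick the k (endpos, frequency) pairs with the highest frequency,
    returned sorted by end position. Ties at the cut-off are broken
    arbitrarily."""
    items = list(descriptors)
    if k <= 0:
        return []
    if k < len(items):
        lo, hi = 0, len(items) - 1
        while lo < hi:
            pivot = items[random.randint(lo, hi)][1]
            # Three-way partition of items[lo..hi], descending by frequency.
            i, lt, gt = lo, lo, hi
            while i <= gt:
                freq = items[i][1]
                if freq > pivot:
                    items[lt], items[i] = items[i], items[lt]
                    lt += 1
                    i += 1
                elif freq < pivot:
                    items[gt], items[i] = items[i], items[gt]
                    gt -= 1
                else:
                    i += 1
            # Now items[lo:lt] > pivot, items[lt:gt+1] == pivot, rest < pivot.
            if k < lt:
                hi = lt - 1
            elif k > gt + 1:
                lo = gt + 1
            else:
                break  # the cut at index k falls inside the pivot block
    return sorted(items[:k], key=lambda d: d[0])


if __name__ == "__main__":
    print(k_most_frequent([(5, 2), (2, 1), (6, 7), (9, 4), (12, 3)], 2))
```

Quickselect only partitions the side of the list that contains the cut-off, so it runs in expected linear time rather than the O(m log m) of a full sort over m descriptors. Sorting the survivors by end position prepares them for a single forward pass over the text.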

13
Data Flow of the Selection Step
Streaming method, selection step (diagram):
Input: implicit descriptor list
Loop until all the data is processed: read data into main memory, run the randomized selection algorithm
Output: k most frequent descriptors
14
Extraction of the (n-gram, frequency) pairs

To extract the (n-gram, frequency) pairs from the
original string.
Original string + k most frequent descriptors -> a
list of (n-gram, frequency) pairs.
Sorting the list by frequency.
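A minimal sketch of this step, assuming descriptors are (endpos, frequency) pairs with an exclusive end offset and arrive sorted by end position: the text is read once, front to back, while a window holds the last n bytes. The function name and signature are assumptions for illustration.

```python
import io
from collections import deque


def extract_ngrams(stream, descriptors_by_endpos, n):
    """Turn (endpos, frequency) descriptors into (n-gram, frequency) pairs.

    stream -- binary file-like object positioned at the start of the text
    descriptors_by_endpos -- descriptors sorted by ascending endpos >= n
    """
    window = deque(maxlen=n)  # keeps only the last n bytes read
    pos = 0
    pairs = []
    for endpos, freq in descriptors_by_endpos:
        # Read forward until the window ends exactly at endpos.
        while pos < endpos:
            byte = stream.read(1)
            if not byte:
                raise ValueError("descriptor endpos is past the end of input")
            window.append(byte)
            pos += 1
        pairs.append((b"".join(window), freq))
    # Most frequent first; equal frequencies keep their input order.
    pairs.sort(key=lambda p: p[1], reverse=True)
    return pairs


if __name__ == "__main__":
    text = io.BytesIO(b"banana")
    print(extract_ngrams(text, [(2, 1), (5, 2), (6, 2)], 2))
```

For "banana" with the descriptors (2, 1), (5, 2), (6, 2) and n = 2, the result is [(b'an', 2), (b'na', 2), (b'ba', 1)]. Because the descriptors are visited in end-position order, the input never has to be rewound.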

15
Data Flow of the Extraction Step
Extraction step (diagram):
Input: input text; k most frequent descriptors, sorted by end position
Loop k times: read forward to the current n-gram's endpos, keeping a window of the last n bytes read in main memory; append the current n-gram to the output
Output: list of (n-gram, frequency) pairs
16
Implementation Details

Two implementations:
Internal
Assumes unlimited main memory.
Suffix array and lcp array construction (Manber and
Myers 1993).
Buffered file stream operations: fputc(),
fgetc().
External
Streaming method.
Suffix array and lcp array construction (Dementiev,
Kärkkäinen, Mehnert, and Sanders
2005).
STXXL library by Dementiev and Sanders (2005) for
the suffix array construction module.
Low-level file operations: read(), write().
C/C++ implementation, gcc/g++ 3.3, 3.4

17
Data Sets
18
Experiments

The internal memory implementation vs. a Perl
hash table based n-gram extraction approach
provided by Keselj (2003)
sun4u machine with 16G RAM, a 70G disk, running
SunOS 5.8.
The internal memory implementation vs. the
external memory implementation
Intel(R) Pentium(R) 4, 3.0GHz CPU, 2.0G RAM, 2.0G
Swap, 50G disk, running Red Hat Linux
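For comparison, the hash-table style of n-gram counting used as the baseline can be sketched as follows. This is a Python analogue written for illustration, not Keselj's Perl package, and `top_k_ngrams_hash` is a made-up name.

```python
from collections import Counter


def top_k_ngrams_hash(data: bytes, n: int, k: int) -> list[tuple[bytes, int]]:
    """Count every byte n-gram in a hash table and return the k most frequent."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return counts.most_common(k)


if __name__ == "__main__":
    print(top_k_ngrams_hash(b"banana", 2, 3))
```

This approach stores every distinct n-gram as a key, so its memory use grows with both n and the number of distinct n-grams, whereas the suffix array approach stores only integer positions.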

19
Parameters

n: the length of an n-gram.
k: the number of most frequent n-grams
we are interested in.
Ts: the time spent by the internal suffix
array based implementation.
Th: the time spent by the Perl hash-table-based
implementation.
Tsa: the time spent by the external suffix array
construction module.
Timp + Tsel + Text: the time spent by the
descriptor module, selection module,
and extraction module of the
external implementation.

20
Fixed k
1..10
21
Fixed n
22
Internal vs. External
23
Tsa vs. Timp + Tsel + Text
24
Conclusion

We introduced an efficient suffix array based
n-gram extraction algorithm.
5 modules
Suffix, Lcp, Descriptor, Selection, Extraction.
2 Comparisons and conclusions
Internal vs. Perl hash table based
Faster, immune to n
External vs. internal
Internal is faster when data fits in main memory
External can deal with larger datasets than
internal.

25
Future Work

To extend our algorithm to the character level, the
word level, and other desired unit levels.
To design and implement an external version of
the lcp array construction module.

26
References

Dementiev, R. (2005, August). STXXL Home Page.
Dementiev, R., J. Kärkkäinen, J. Mehnert, and P.
Sanders (2005, January). Better external memory
suffix array construction. ALENEX05: Algorithm
Engineering and Experiments.
Dementiev, R. and P. Sanders (2005). Stxxl:
Standard template library for xxl data sets.
Technical Report 2005/18.
Gil, A. and G. Dias (2003, October). Efficient
mining of textual associations. In Proceedings of
the International Conference on Natural Language
Processing and Knowledge Engineering, Chengqing
Zong (eds), Beijing, China, pp. 549-555. IEEE
Press.
Kärkkäinen, J. and P. Sanders (2003). Simple
linear work suffix array construction. In
Proc. 13th International Conference on Automata,
Languages and Programming, Volume 2719 of LNCS,
pp. 943-955. Springer.
Keselj, V. (2003). A New Perl Package for N-gram
analysis.
Keselj, V., F. Peng, N. Cercone, and C. Thomas
(2003). N-gram-based author profiles for
authorship attribution. Pacific Association for
Computational Linguistics.
Manber, U. and G. Myers (1993, October). Suffix
arrays: A new method for on-line string searches.
SIAM J. Comput. 22, 935-948.
Mehnert, J. (2004). Pipelined external memory
2-Tupling, 4-Tupling, 2-Discarding, 4-Discarding
and DC3 (Skew) suffix array construction
algorithm implementations.
Menon-Sen, A. (2002,
October).