Pattern Matching Using n-grams With Algebraic Signatures

1
Pattern Matching Using n-grams With Algebraic Signatures

Witold Litwin (1), Riad Mokadem (1), Philippe Rigaux (1), Thomas Schwarz (2)
(1) Université Paris Dauphine
(2) Santa Clara University

2
n-gram Search

New pattern matching idea
Matches algebraic signatures
Preprocesses both the pattern and the string (record)
String preprocessing is a new idea
To the best of our knowledge
Provides incidental protection of stored data
Important for P2P grid systems
Fast processing
Especially useful for DBs and longer patterns
ASCII, Unicode, DNA
Should then often be faster than Boyer-Moore
Possibly the fastest known in this context

3
Algebraic Signature

Symbols of the alphabet are elements of a Galois Field
GF(256) usually
We choose there one primitive element $\alpha$
Usually $\alpha = 2$
The algebraic signature of the string of $i$ symbols $p_1 \dots p_i$ is the sum
$p'_i = p_1 \alpha + \dots + p_i \alpha^i$
Here the addition and the multiplication are the operations in the GF.

4
Algebraic Signature

In our GF($2^f$), where $f = 8, 16$
$p + q = p - q = p$ XOR $q$
One method for multiplying is
$pq = \mathrm{antilog}((\log_\alpha p + \log_\alpha q) \bmod 255)$, where $255 = 2^f - 1$ for $f = 8$
The division is then
$p / q = \mathrm{antilog}((\log_\alpha p - \log_\alpha q) \bmod 255)$
The log and antilog are encoded in log and antilog tables with $2^f$ elements each.
Entry 0 is for element 0 of the GF and is by convention set to $2^f - 1$.
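
The table-based arithmetic on this slide can be sketched in Python as follows. This is a minimal sketch, not the authors' code: the slides do not name the generator polynomial, so the sketch assumes $x^8 + x^4 + x^3 + x^2 + 1$ (0x11D), for which $\alpha = 2$ is primitive. The names `ANTILOG`, `LOG`, `gf_add`, `gf_mul`, and `gf_div` are illustrative.

```python
# GF(2^8) arithmetic with log/antilog tables.
# Assumption: generator polynomial 0x11D (not stated on the slides); alpha = 2.
F = 8
ORDER = (1 << F) - 1          # 255 nonzero field elements
POLY = 0x11D

ANTILOG = [0] * ORDER         # ANTILOG[i] = alpha^i
LOG = [ORDER] * (1 << F)      # LOG[0] stays 2^f - 1 = 255 by convention

x = 1
for i in range(ORDER):
    ANTILOG[i] = x
    LOG[x] = i
    x <<= 1                   # multiply by alpha = 2
    if x & (1 << F):          # reduce modulo the generator polynomial
        x ^= POLY


def gf_add(p, q):
    """Addition and subtraction are both XOR in GF(2^f)."""
    return p ^ q


def gf_mul(p, q):
    """p * q = antilog((log p + log q) mod 255); zero is handled separately."""
    if p == 0 or q == 0:
        return 0
    return ANTILOG[(LOG[p] + LOG[q]) % ORDER]


def gf_div(p, q):
    """p / q = antilog((log p - log q) mod 255)."""
    if q == 0:
        raise ZeroDivisionError("division by zero in GF(2^f)")
    if p == 0:
        return 0
    return ANTILOG[(LOG[p] - LOG[q]) % ORDER]
```

The zero element has no logarithm, so both functions test for it before touching the tables; the conventional entry `LOG[0] = 255` only serves as a marker.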

5
Cumulative Algebraic Signature

We encode every symbol $p_i$ in a string into the
signature $p'_i$ of the prefix $p_1 \dots p_i$
The value of a CAS symbol now encodes also the
knowledge of values of all the previous ones
Matching a single symbol means prefix matching

6
Application of CASs

Protection against involuntary data disclosure
On P2P Grid Servers especially
Numerous CAS encoded string matching algorithms
Prefix match with $O(1)$ complexity
Pattern match by signature only
Karp-Rabin-like, linear $O(L)$ complexity
Longest common string search
Longest common prefix search

7
CAS Properties

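$O(K)$ encoding and decoding speed
For encoding, for instance
$p'_i = p'_{i-1} + p_i \alpha^i = \mathrm{CAS}(p_{i-1}) + p_i \alpha^i$
Fast n-gram signature calculus
For $S_{k,l} = p_k \dots p_l$ with $k > 1$ (an n-gram when $l = k + n - 1$)
$AS(S_{k,l}) = (p'_l \text{ XOR } p'_{k-1}) / \alpha^{k-1}$, the signature of a string of $l - k + 1$ symbols
Logarithmic Algebraic Signature (LAS)
$LAS(S_{k,l}) = \log_\alpha AS(S_{k,l}) = (\log_\alpha (p'_l \text{ XOR } p'_{k-1}) - (k-1)) \bmod (2^f - 1)$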
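
The CAS recurrence and the n-gram signature calculus above can be sketched as follows. This is an illustrative sketch under the same assumptions as any GF(256) example here: generator polynomial 0x11D with $\alpha = 2$ (the slides do not state the polynomial), 1-based positions $k$ and $l$ as on the slide, and hypothetical function names.

```python
# CAS encoding/decoding and n-gram signatures computed from CAS symbols only.
# Assumption: GF(2^8) with polynomial 0x11D and alpha = 2 (not stated on the slides).
ORDER = 255
ANTILOG = [0] * ORDER
LOG = [ORDER] * 256           # LOG[0] = 255 by convention
x = 1
for i in range(ORDER):
    ANTILOG[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D


def times_alpha_pow(p, e):
    """Return p * alpha^e in GF(256); e may be negative (division)."""
    return 0 if p == 0 else ANTILOG[(LOG[p] + e) % ORDER]


def cas_encode(symbols):
    """p'_i = p'_(i-1) XOR p_i * alpha^i, with p'_0 = 0."""
    cas, acc = [], 0
    for i, p in enumerate(symbols, start=1):
        acc ^= times_alpha_pow(p, i)
        cas.append(acc)
    return cas


def cas_decode(cas):
    """p_i = (p'_i XOR p'_(i-1)) / alpha^i."""
    out, prev = [], 0
    for i, c in enumerate(cas, start=1):
        out.append(times_alpha_pow(c ^ prev, -i))
        prev = c
    return out


def ngram_as(cas, k, l):
    """AS of p_k..p_l (1-based, inclusive) = (p'_l XOR p'_(k-1)) / alpha^(k-1)."""
    before = cas[k - 2] if k > 1 else 0
    return times_alpha_pow(cas[l - 1] ^ before, -(k - 1))


def ngram_las(cas, k, l):
    """LAS of p_k..p_l: one table lookup and one subtraction, no antilog."""
    diff = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    if diff == 0:
        return ORDER          # LAS(0) = 255 by convention
    return (LOG[diff] - (k - 1)) % ORDER


cas = cas_encode(b"Dauphine")
# The signature of the 2-gram "up" (positions 3..4) needs only two CAS symbols.
assert ngram_as(cas, 3, 4) == cas_encode(b"up")[-1]
```

The LAS form is the cheaper one to evaluate during a search, since it avoids the antilog lookup that `ngram_as` needs.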

8
The n-gram Search: Key ideas

Design a sublinear pattern match search
With speed about L / K
Apply to CAS encoded DB
New idea for string search algorithm with
preprocessing
Justified for a DB
Store once, search many times

9
The n-gram Search: Key ideas

Preprocess the pattern to create a jump table
As in Boyer-Moore
Use n-grams with n > 1 to increase the
discriminative power of an attempt
Each attempt compares a sample from the pattern
a single symbol for BM
an LAS of an n-gram for a CAS-encoded string

10
The n-gram Search: Key ideas

If the alphabet uses $m$ symbols, the probability
that a symbol matches is $1/m$
Assuming all symbols equally likely
For usual ASCII pattern matching, $m$ = 20-25
For DNA, $m = 4$
A single symbol may often match without the whole
pattern matching
e.g., 1/4 of the time for DNA on average
Leading to small jumps,
by $m$ symbols on average

11
The n-gram Search: Key ideas

The probability of an n-gram matching may be as low as
$1 / \min(2^f, m^n)$
In our examples it can reach 1/256
More discriminative sampling
Longer jumps
By almost K or 256 symbols in general
Useful for longer strings
DNA, text, images
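
A small worked calculation may help with the estimate on this slide. The sketch below reads it as follows: an n-gram comparison cannot be more selective than either the number of distinct n-grams, $m^n$, or the number of signature values, $2^f$, allows, so the chance of a spurious match is about $1/\min(2^f, m^n)$ under the equal-likelihood assumption. That reading, and the function name, are assumptions of this sketch.

```python
from fractions import Fraction


def ngram_match_probability(m, n, f=8):
    """Estimated chance that a random n-gram over m equally likely symbols
    has the same signature as a given n-gram, for signatures in GF(2^f).
    Assumption: the estimate is 1 / min(2^f, m^n)."""
    return Fraction(1, min(2 ** f, m ** n))


# DNA (m = 4): single symbols, 2-grams, 4-grams.
for n in (1, 2, 4):
    print("DNA  ", n, ngram_match_probability(4, n))

# ASCII text with about 25 frequent symbols: single symbols, then 2-grams.
for n in (1, 2):
    print("ASCII", n, ngram_match_probability(25, n))
```

For DNA the estimate drops from 1/4 for single symbols to 1/256 once $n = 4$; for the 25-symbol case it is already capped by the 256 signature values at $n = 2$.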

12
ASCII Example: Usual Alphabet
[Figure: 2-grams: 5 jumps; 1-gram: 6 jumps]
13
DNA Example: 4-letter Alphabet
[Figure: jump counts shown: 3, 4, 4, and 11 jumps]
14
The n-gram Search: Preprocessing

Encode every record (string) into its CAS
Done for incidental protection anyhow for
SDDS-2006
Encode the terminal n-gram of the searched
pattern SK into its LAS in variable V
Fill up the jump table T for every other n-gram
in SK
calculate every LAS
for each LAS, store in T its rightmost offset
with respect to the end of SK

15
The n-gram Search: Jump Table

For GF(256), for every n-gram $S_{j,j+n-1}$ in the
pattern, with $i = LAS(S_{j,j+n-1})$
$T(i)$ = the offset
$T(i) = K - n + 1$ otherwise
Reminder: $LAS(0) = 255$
T can also be a hash table
See the paper
Slower to use but possibly more memory efficient
Probably more useful for a larger GF
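
The preprocessing of the pattern can be sketched as follows. This is a minimal sketch under stated assumptions: GF(256) with generator polynomial 0x11D and $\alpha = 2$ (the slides do not state the polynomial), T stored as a plain 256-entry list indexed by LAS, and hypothetical names `las` and `build_jump_table`.

```python
# Jump table T and terminal LAS V for a pattern, with T indexed by LAS value.
# Assumption: GF(2^8) with polynomial 0x11D and alpha = 2 (not stated on the slides).
ORDER = 255
ANTILOG = [0] * ORDER
LOG = [ORDER] * 256           # LOG[0] = 255, so LAS(0) = 255
x = 1
for i in range(ORDER):
    ANTILOG[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D


def las(symbols):
    """LAS of a plain (not CAS-encoded) string: log of sum p_j * alpha^j."""
    acc = 0
    for j, p in enumerate(symbols, start=1):
        if p:
            acc ^= ANTILOG[(LOG[p] + j) % ORDER]
    return LOG[acc]


def build_jump_table(pattern, n):
    """Return (V, T) for a bytes pattern of length K and n-grams of length n."""
    K = len(pattern)
    if not 1 <= n <= K:
        raise ValueError("need 1 <= n <= len(pattern)")
    V = las(pattern[K - n:])              # LAS of the terminal n-gram
    T = [K - n + 1] * (ORDER + 1)         # default jump for unseen LAS values
    # Scan left to right so that the rightmost occurrence of each LAS wins.
    for start in range(K - n):            # every n-gram except the terminal one
        T[las(pattern[start:start + n])] = K - n - start
    return V, T


V, T = build_jump_table(b"Dauphine", 2)
print(T[las(b"in")], T[las(b"ph")], T[las(b"au")], T[V])
```

A dictionary keyed by LAS with a default of `K - n + 1` would be the hash-table variant mentioned on the slide. Two n-grams with the same LAS share one entry; keeping the smaller offset makes the jump shorter but never skips a possible match.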

16
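ASCII Example
[Figure: jump table T for the pattern "Dauphine" with 2-grams. V = LAS(ne); T holds offset 1 at LAS(in), 3 at LAS(ph), 5 at LAS(au), and 7 at the other entries shown, over indices 0 to 255. Notation: xy stands for LAS(xy).]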
17
The n-gram Search: Processing

Calculate the LAS of the current n-gram in the string
Start with the n-gram $S_{K-n+1,K}$
Continue depending on jump calculus
Attempt to match V
If it matches, calculate the LAS of the entire current,
possibly matching substring
of length K and ending with the current n-gram
If that matches too, resolve the possible collision
Either attempt to match all the K symbols
Or match enough terminal n-grams or symbols to
decrease the probability of collision to a very
small value

18
The n-gram Search: Processing

Otherwise
Go to T using LAS of the n-gram
Jump by the number of symbols found in T
Update the current position for n-gram to
attempt the match
Re-attempt the match as above
Unless the n-gram to attempt is beyond the end
of the string
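
The processing steps on the two slides above can be put together in a short sketch. It is one possible reading of the slides, not the authors' implementation: it assumes GF(256) with generator polynomial 0x11D and $\alpha = 2$, resolves collisions by decoding and comparing all K symbols, and always jumps by the table entry of the current n-gram, including after a successful or failed match attempt. All function names are hypothetical.

```python
# n-gram search over a CAS-encoded record (sketch; assumptions in the lead-in).
ORDER = 255
ANTILOG = [0] * ORDER
LOG = [ORDER] * 256           # LOG[0] = 255, so LAS(0) = 255
x = 1
for i in range(ORDER):
    ANTILOG[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D            # assumed generator polynomial


def times_alpha_pow(p, e):
    """p * alpha^e in GF(256); a negative e divides."""
    return 0 if p == 0 else ANTILOG[(LOG[p] + e) % ORDER]


def cas_encode(symbols):
    """CAS of a record: p'_i = p'_(i-1) XOR p_i * alpha^i."""
    cas, acc = [], 0
    for i, p in enumerate(symbols, start=1):
        acc ^= times_alpha_pow(p, i)
        cas.append(acc)
    return cas


def las_plain(symbols):
    """LAS of a plain string, used for the pattern."""
    return LOG[cas_encode(symbols)[-1]]


def las_from_cas(cas, k, l):
    """LAS of record symbols k..l (1-based) from two CAS symbols."""
    diff = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return ORDER if diff == 0 else (LOG[diff] - (k - 1)) % ORDER


def decode_symbol(cas, i):
    """Original symbol p_i (1-based) recovered from the CAS."""
    diff = cas[i - 1] ^ (cas[i - 2] if i > 1 else 0)
    return times_alpha_pow(diff, -i)


def ngram_search(cas, pattern, n=2):
    """Return 0-based start positions of pattern in the CAS-encoded record."""
    K, L = len(pattern), len(cas)
    if not 1 <= n <= K:
        raise ValueError("need 1 <= n <= len(pattern)")
    V = las_plain(pattern[K - n:])        # terminal n-gram
    whole = las_plain(pattern)            # signature of the entire pattern
    T = [K - n + 1] * (ORDER + 1)         # jump table, rightmost offset wins
    for start in range(K - n):
        T[las_plain(pattern[start:start + n])] = K - n - start

    hits = []
    end = K                               # 1-based end of the current n-gram
    while end <= L:                       # stop once the n-gram passes the end
        current = las_from_cas(cas, end - n + 1, end)
        if current == V and las_from_cas(cas, end - K + 1, end) == whole:
            # Resolve a possible collision by comparing all K symbols.
            first = end - K + 1
            if all(decode_symbol(cas, first + j) == pattern[j] for j in range(K)):
                hits.append(first - 1)
        end += T[current]                 # jump by the offset found in T
    return hits


record = cas_encode(b"Universite Paris Dauphine and Dauphine again")
print(ngram_search(record, b"Dauphine", n=2))
```

Only the pattern is ever handled in plain form; the record is read through its CAS symbols, and a symbol is decoded only when both signature tests have already passed.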

19
ASCII Example Again
[Figure: 2-grams: 5 jumps; 1-gram: 6 jumps]
20
DNA Example Again
[Figure: jump counts shown: 3, 4, 4, and 11 jumps]
21
n-grams / BM

Average shifts with n-grams can typically be
longer
Calculating a jump for an attempt may be more expensive
as well
About twice as long as a first approximation
The precise analysis remains to be done
Rule of thumb: if shifts are more than 2 times
longer, n-grams with n > 1 should be faster
than BM.

22
Experimental Results

Searching large data of
DNA
Typical ASCII
XML Documents
Patterns of 6 to 500 symbols (bytes)
1.8 GHz P3 and 2.4 GHz dual-core AMD Turion 64
processors

23
Results Compared to BM

DNA
Up to 72 times faster
Typical ASCII
Up to about 11 times faster
XML Documents
Up to more than 5 times faster
Search is faster for longer patterns
Average shifts are longer

24
DNA
25
ASCII
26
XML
27
Related Work

Implemented in SDDS-2006
Applies best to
longer patterns
where many jumps occur
alphabets much smaller than the size of GF used
Instead of shifts of size $m$ on average, one
reaches almost $\min(K, 2^f)$ per shift
up to almost 256 for DNA or ASCII with GF (256)
up to almost 64K for DNA or Unicode with GF (64K)
instead of 4 or 25 respectively
For Boyer-Moore especially

28
Related Work