Title: Pattern Matching Using n-grams With Algebraic Signatures
1Pattern Matching Using n-grams With Algebraic
Signatures
- Witold Litwin (1), Riad Mokadem (1), Philippe Rigaux (1), Thomas Schwarz (2)
- (1) Université Paris Dauphine
- (2) Santa Clara University
2n-gram Search
- New pattern matching idea
- Matches algebraic signatures
- Preprocesses both the pattern and the string (record)
- String preprocessing is a new idea
- To the best of our knowledge
- Provides incidental protection of stored data
- Important for P2P grid systems
- Fast processing
- Especially useful for DBs and longer patterns
- ASCII, Unicode, DNA
- Should then often be faster than Boyer-Moore
- Possibly the fastest known in this context
3Algebraic Signature
- Symbols of the alphabet are elements of a Galois Field
- GF (256) usually
- We choose there one primitive element $\alpha$
- Usually $\alpha = 2$
- The algebraic signature of the string of i symbols $p_1 \ldots p_i$ is the sum
- $p'_i = p_1\alpha + p_2\alpha^2 + \ldots + p_i\alpha^i$
- Here the addition and the multiplication are the operations in GF.
4Algebraic Signature
- In our GF ($2^f$), where $f = 8$ or $16$
- $p + q = p - q = p \text{ XOR } q$
- One method for multiplying is
- $pq = \mathrm{antilog}_\alpha((\log_\alpha p + \log_\alpha q) \bmod 255)$
- The division is then
- $p / q = \mathrm{antilog}_\alpha((\log_\alpha p - \log_\alpha q) \bmod 255)$
- The log and antilog are encoded in log and antilog tables with $2^f$ elements each.
- Entry 0 is for element 0 of the GF and is by convention set to $2^f - 1$.
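The table-based arithmetic above can be sketched in Python as follows. This is only an illustration under stated assumptions: the slides say GF (256) with a primitive element usually equal to 2, but do not name the field polynomial, so the common choice 0x11D is assumed here. All function names are hypothetical.

```python
# Assumption: GF(2^8) built on x^8 + x^4 + x^3 + x^2 + 1 (0x11D), for which
# alpha = 2 is primitive. The slides do not state which polynomial is used.
PRIM_POLY = 0x11D
ORDER = 255  # 2^f - 1 for f = 8


def build_tables():
    """Return (log, antilog) tables with 2^f entries each."""
    log = [ORDER] * 256    # log[0] = 2^f - 1 by convention
    antilog = [1] * 256    # antilog[255] = alpha^255 = 1, unused after mod 255
    x = 1
    for e in range(ORDER):
        antilog[e] = x
        log[x] = e
        x <<= 1            # multiply by alpha = 2
        if x & 0x100:
            x ^= PRIM_POLY  # reduce modulo the field polynomial
    return log, antilog


LOG, ANTILOG = build_tables()


def gf_add(p, q):
    """Addition and subtraction are both XOR."""
    return p ^ q


def gf_mul(p, q):
    if p == 0 or q == 0:
        return 0
    return ANTILOG[(LOG[p] + LOG[q]) % ORDER]


def gf_div(p, q):
    if q == 0:
        raise ZeroDivisionError("division by zero in GF(256)")
    if p == 0:
        return 0
    return ANTILOG[(LOG[p] - LOG[q]) % ORDER]


def algebraic_signature(symbols):
    """p_1*alpha + p_2*alpha^2 + ... + p_i*alpha^i, positions counted from 1."""
    sig = 0
    for i, p in enumerate(symbols, start=1):
        sig = gf_add(sig, gf_mul(p, ANTILOG[i % ORDER]))
    return sig
```

Zero has no logarithm, which is why the multiplication and division helpers test for it before touching the tables.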
5Cumulative Algebraic Signature
- We encode every symbol $p_i$ in a string into the signature of the prefix $p_1 \ldots p_i$
- The value of a CAS symbol now also encodes the knowledge of the values of all the previous ones
- Matching a single symbol means prefix matching
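A minimal Python sketch of CAS encoding, to make the prefix-matching remark concrete. It assumes GF (256) with the polynomial 0x11D and a primitive element of 2 (the slides do not name the polynomial), and the helper names are hypothetical.

```python
ORDER = 255  # 2^f - 1 for f = 8


def build_tables(poly=0x11D):
    """log/antilog tables for GF(256); the polynomial 0x11D is an assumption."""
    log, antilog, x = [ORDER] * 256, [1] * 256, 1
    for e in range(ORDER):
        antilog[e], log[x] = x, e
        x <<= 1
        if x & 0x100:
            x ^= poly
    return log, antilog


LOG, ANTILOG = build_tables()


def cas_encode(data):
    """Replace each symbol p_i by the signature of the prefix p_1..p_i."""
    out, sig = bytearray(), 0
    for i, p in enumerate(data, start=1):
        if p:
            sig ^= ANTILOG[(LOG[p] + i) % ORDER]  # add p_i * alpha^i
        out.append(sig)
    return bytes(out)


def same_prefix(cas_a, cas_b, i):
    """O(1) test: do the two strings probably share their first i symbols?"""
    return cas_a[i - 1] == cas_b[i - 1]


A = cas_encode(b"Dauphine")
B = cas_encode(b"Dauphin!")
print(same_prefix(A, B, 7), same_prefix(A, B, 8))
```

Equal CAS symbols only indicate equal prefixes up to a signature collision, so a positive answer is probabilistic while a negative one is certain.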
6Application of CASs
- Protection against involuntary data disclosure
- On P2P Grid Servers especially
- Numerous CAS encoded string matching algorithms
- Prefix match with O (1) complexity
- Pattern match by signature only
- Karp Rabin like, linear O (L) complexity
- Longest common string search
- Longest common prefix search
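The signature-only, Karp-Rabin-like match listed above could look like the sketch below. It assumes GF (256) with the polynomial 0x11D and a primitive element of 2, and it returns candidate offsets only: a hit may be a signature collision. The function names are hypothetical.

```python
ORDER = 255  # 2^f - 1 for f = 8


def build_tables(poly=0x11D):
    """log/antilog tables for GF(256); the polynomial 0x11D is an assumption."""
    log, antilog, x = [ORDER] * 256, [1] * 256, 1
    for e in range(ORDER):
        antilog[e], log[x] = x, e
        x <<= 1
        if x & 0x100:
            x ^= poly
    return log, antilog


LOG, ANTILOG = build_tables()


def cas_encode(data):
    """CAS symbol i is the signature of the prefix p_1..p_i (i from 1)."""
    out, sig = bytearray(), 0
    for i, p in enumerate(data, start=1):
        if p:
            sig ^= ANTILOG[(LOG[p] + i) % ORDER]
        out.append(sig)
    return bytes(out)


def las(cas, k, l):
    """Logarithmic signature of S_{k,l} (1-based, inclusive) from a CAS."""
    x = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return ORDER if x == 0 else (LOG[x] - (k - 1)) % ORDER


def signature_only_search(cas, pattern):
    """Linear scan comparing one K-symbol signature per position.

    Returns 0-based candidate offsets; collisions are not resolved here.
    """
    K = len(pattern)
    target = las(cas_encode(pattern), 1, K)
    return [e - K for e in range(K, len(cas) + 1)
            if las(cas, e - K + 1, e) == target]


TEXT = b"Universite Paris Dauphine and Santa Clara University"
CAS = cas_encode(TEXT)
print(signature_only_search(CAS, b"Dauphine"))
```

Only the pattern length and its signature are needed at scan time, which fits the use case where the searched pattern itself should not be disclosed.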
7CAS Properties
- O (K) encoding and decoding speed
- For encoding, for instance
- $p'_i = p'_{i-1} + p_i\alpha^i = CAS(p_{i-1}) + p_i\alpha^i$
- Fast n-gram signature calculus
- For $S_{k,l} = p_k \ldots p_l$ with $k > 1$ and $l = k + n - 1$
- $AS(S_{k,l}) = AS(S_{l-k+1}) = (p'_l \text{ XOR } p'_{k-1}) / \alpha^{k-1}$
- Logarithmic Algebraic Signature (LAS)
- $LAS(S_{k,l}) = \log_\alpha AS(S_{k,l})$
- $= (\log_\alpha(p'_l \text{ XOR } p'_{k-1}) - (k-1)) \bmod (2^f - 1)$
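The properties on this slide can be illustrated with a short Python sketch: decoding a CAS, and deriving the signature and the LAS of an n-gram from two CAS symbols. It assumes GF (256) with the polynomial 0x11D and a primitive element of 2, which the slides do not confirm, and all names are hypothetical.

```python
ORDER = 255  # 2^f - 1 for f = 8


def build_tables(poly=0x11D):
    """log/antilog tables for GF(256); the polynomial 0x11D is an assumption."""
    log, antilog, x = [ORDER] * 256, [1] * 256, 1
    for e in range(ORDER):
        antilog[e], log[x] = x, e
        x <<= 1
        if x & 0x100:
            x ^= poly
    return log, antilog


LOG, ANTILOG = build_tables()


def cas_encode(data):
    """CAS symbol i is the signature of the prefix p_1..p_i (i from 1)."""
    out, sig = bytearray(), 0
    for i, p in enumerate(data, start=1):
        if p:
            sig ^= ANTILOG[(LOG[p] + i) % ORDER]  # add p_i * alpha^i
        out.append(sig)
    return bytes(out)


def cas_decode(cas):
    """Recover p_i = (p'_i XOR p'_{i-1}) / alpha^i, one table lookup each."""
    out, prev = bytearray(), 0
    for i, s in enumerate(cas, start=1):
        term = s ^ prev
        out.append(ANTILOG[(LOG[term] - i) % ORDER] if term else 0)
        prev = s
    return bytes(out)


def ngram_as(cas, k, l):
    """AS of S_{k,l}: (p'_l XOR p'_{k-1}) / alpha^(k-1), positions from 1."""
    x = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return ANTILOG[(LOG[x] - (k - 1)) % ORDER] if x else 0


def ngram_las(cas, k, l):
    """LAS of S_{k,l}: a log lookup and a subtraction, no division."""
    x = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return ORDER if x == 0 else (LOG[x] - (k - 1)) % ORDER


DATA = b"Paris Dauphine"
CAS = cas_encode(DATA)
print(ngram_las(CAS, 7, 8))  # LAS of the 2-gram "Da" at positions 7..8
```

The LAS form avoids the antilog lookup entirely, which is presumably why the search compares logarithms rather than signatures.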
8The n-gram Search: Key Ideas
- Design a sublinear pattern match search
- With speed about L / K
- Apply to CAS encoded DB
- New idea for a string search algorithm with preprocessing
- Justified for a DB
- Store once, search many times
9The n-gram Search: Key Ideas
- Preprocess the pattern to create a jump table
- As in Boyer-Moore (BM)
- Use n-grams with n > 1 to increase the discriminative power of an attempt
- Comparison of a sample from the pattern
- a single symbol for BM
- an LAS of an n-gram for a CAS-encoded string
10The n-gram Search: Key Ideas
- If the alphabet uses m symbols, the probability that a symbol matches is 1/m
- Assuming all symbols equally likely
- For usual ASCII pattern matching m ≈ 20-25
- For DNA m = 4
- A single symbol may often match without the whole pattern matching
- e.g., ¼ of the time for DNA on average
- Leading to small jumps,
- by m symbols on average
11The n-gram Search: Key Ideas
- The probability of an n-gram matching may be as low as
- $\max(1/2^f, 1/m^n)$, i.e. $1/\min(2^f, m^n)$
- In our examples it can reach 1 / 256
- More discriminative sampling
- Longer jumps
- By almost K or 256 symbols in general
- Useful for longer strings
- DNA, text, images
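A small arithmetic sketch of the matching probabilities discussed on this slide. It reads the bound as one over the number of distinct values a sample can take, which is at most $\min(2^f, m^n)$, and it assumes equally likely symbols as the slides do. The function name is hypothetical.

```python
from fractions import Fraction


def ngram_match_probability(m, n, f=8):
    """Chance that a random n-gram sample matches, for an m-symbol alphabet.

    A sample can take at most min(2^f, m^n) distinct values, so the
    probability cannot drop below 1 / min(2^f, m^n).
    """
    return max(Fraction(1, 2 ** f), Fraction(1, m ** n))


for label, m, n in [("DNA, 1-gram", 4, 1), ("DNA, 2-gram", 4, 2),
                    ("DNA, 4-gram", 4, 4), ("ASCII, 1-gram", 25, 1),
                    ("ASCII, 2-gram", 25, 2)]:
    print(label, ngram_match_probability(m, n))
```

For DNA the floor of 1/256 is reached with 4-grams, and for an effective ASCII alphabet of about 25 symbols it is already reached with 2-grams.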
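12ASCII Example: Usual Alphabet
[Figure: search trace on an ASCII string; 2-grams: 5 jumps, 1-gram: 6 jumps]
13DNA Example: 4-letter Alphabet
[Figure: search traces on a DNA string; the panels are labelled 3 jumps, 4 jumps, 4 jumps and 11 jumps]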
14The n-gram Search Preprocessing
- Encode every record (string) into its CAS
- Done anyhow for incidental protection in SDDS-2006
- Encode the terminal n-gram of the searched pattern $S_K$ into its LAS in variable V
- Fill up the jump table T for every other n-gram in $S_K$
- calculate every LAS
- for each LAS, store in T its rightmost offset with respect to the end of $S_K$
15The n-gram Search Jump Table
- For GF (256), for every n-gram $S_{j,j+n-1}$ in the pattern and $i = LAS(S_{j,j+n-1})$
- $T(i)$ = the offset
- $T(i) = K - n + 1$ otherwise
- Reminder: $LAS(0) = 255$
- T can also be a hash table
- See the paper
- Slower to use but possibly more memory efficient
- Probably more useful for a larger GF
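The jump table construction can be sketched in Python for the array variant with 256 entries. The sketch assumes GF (256) with the polynomial 0x11D and a primitive element of 2 (the polynomial is not given in the slides), and the function names are hypothetical.

```python
ORDER = 255  # 2^f - 1 for f = 8


def build_tables(poly=0x11D):
    """log/antilog tables for GF(256); the polynomial 0x11D is an assumption."""
    log, antilog, x = [ORDER] * 256, [1] * 256, 1
    for e in range(ORDER):
        antilog[e], log[x] = x, e
        x <<= 1
        if x & 0x100:
            x ^= poly
    return log, antilog


LOG, ANTILOG = build_tables()


def las(symbols):
    """LAS of a standalone string; LOG[0] = 255 covers a zero signature."""
    sig = 0
    for i, p in enumerate(symbols, start=1):
        if p:
            sig ^= ANTILOG[(LOG[p] + i) % ORDER]
    return LOG[sig]


def build_jump_table(pattern, n):
    """Return (V, T): LAS of the terminal n-gram and the 256-entry table."""
    K = len(pattern)
    V = las(pattern[K - n:])
    T = [K - n + 1] * 256              # default jump for an unseen LAS
    for j in range(K - n):             # every n-gram except the terminal one
        offset = K - n - j             # distance from its end to the pattern end
        T[las(pattern[j:j + n])] = offset  # later (rightmost) n-grams overwrite
    return V, T


V, T = build_jump_table(b"Dauphine", 2)
print(T[las(b"au")], T[las(b"ph")], T[las(b"in")])
```

Scanning the pattern left to right and overwriting keeps the rightmost offset for each LAS, so two n-grams with colliding signatures can only shorten a jump, never make it unsafe.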
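16ASCII Example
[Figure: jump table T for the pattern "Dauphine" with n = 2 and K = 8. Notation: xy stands for LAS (xy). V = LAS (ne); T (in) = 1, T (ph) = 3, T (au) = 5; the other entries shown, up to index 255, hold 7, the default K - n + 1.]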
17The n-gram Search Processing
- Calculate LAS of the current n-gram in the string
- Start with the n-gram $S_{K-n+1,K}$
- Continue depending on jump calculus
- Attempt to match V
- If true, then calculate LAS of the entire current possibly matching substring
- of length K and ending with the current n-gram
- If it matches the LAS of the pattern, then resolve the possible collision
- Either attempt to match all the K symbols
- Or match enough of the terminal n-grams or symbols to decrease the probability of collision to a very small value
18The n-gram Search Processing
- Otherwise
- Go to T using LAS of the n-gram
- Jump by the number of symbols found in T
- Update the current position for n-gram to attempt the match
- Re-attempt the match as above
- Unless the n-gram to attempt is beyond the end of the string
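The processing steps of the last two slides can be put together in a compact Python sketch. Several points are assumptions rather than the authors' implementation: GF (256) with the polynomial 0x11D and a primitive element of 2, collisions resolved by decoding and comparing all K symbols, and a jump by T after a failed verification as well. All names are hypothetical.

```python
ORDER = 255  # 2^f - 1 for f = 8


def build_tables(poly=0x11D):
    """log/antilog tables for GF(256); the polynomial 0x11D is an assumption."""
    log, antilog, x = [ORDER] * 256, [1] * 256, 1
    for e in range(ORDER):
        antilog[e], log[x] = x, e
        x <<= 1
        if x & 0x100:
            x ^= poly
    return log, antilog


LOG, ANTILOG = build_tables()


def cas_encode(data):
    """CAS symbol i is the signature of the prefix p_1..p_i (i from 1)."""
    out, sig = bytearray(), 0
    for i, p in enumerate(data, start=1):
        if p:
            sig ^= ANTILOG[(LOG[p] + i) % ORDER]
        out.append(sig)
    return bytes(out)


def las(cas, k, l):
    """LAS of S_{k,l} (1-based, inclusive) from a CAS-encoded string."""
    x = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return ORDER if x == 0 else (LOG[x] - (k - 1)) % ORDER


def decode_window(cas, k, l):
    """Recover the plain symbols p_k..p_l from the CAS."""
    out = bytearray()
    for i in range(k, l + 1):
        term = cas[i - 1] ^ (cas[i - 2] if i > 1 else 0)
        out.append(ANTILOG[(LOG[term] - i) % ORDER] if term else 0)
    return bytes(out)


def preprocess(pattern, n):
    """Return V, the jump table T and the LAS of the whole pattern."""
    K, pcas = len(pattern), cas_encode(pattern)
    T = [K - n + 1] * 256
    for j in range(1, K - n + 1):          # all n-grams but the terminal one
        T[las(pcas, j, j + n - 1)] = K - (j + n - 1)
    return las(pcas, K - n + 1, K), T, las(pcas, 1, K)


def ngram_search(cas, pattern, n=2):
    """0-based offset of the first occurrence in the CAS-encoded string, or -1."""
    K, L = len(pattern), len(cas)
    if K < n or K > L:
        return -1
    V, T, whole = preprocess(pattern, n)
    e = K                                   # end of the current n-gram (1-based)
    while e <= L:
        g = las(cas, e - n + 1, e)
        if (g == V and las(cas, e - K + 1, e) == whole
                and decode_window(cas, e - K + 1, e) == pattern):
            return e - K
        e += T[g]                           # jump; T[g] is always at least 1
    return -1


TEXT = b"Universite Paris Dauphine and Santa Clara University"
CAS = cas_encode(TEXT)
print(ngram_search(CAS, b"Dauphine"))
```

The three tests in the `if` are ordered from cheapest to most expensive, so most positions cost a single LAS computation and one table lookup.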
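19ASCII Example Again
[Figure: the ASCII search trace again; 2-grams: 5 jumps, 1-gram: 6 jumps]
20DNA Example Again
[Figure: the DNA search traces again; the panels are labelled 3 jumps, 4 jumps, 4 jumps and 11 jumps]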
21n-grams / BM
- Average shifts with n-grams can typically be longer
- Calculating an attempted jump may be more expensive as well
- About twice as long at first approach
- The precise analysis remains to be done
- Rule of thumb: if shifts are more than 2 times longer, n-grams with n > 1 should be faster than BM.
22Experimental Results
- Searching large data of
- DNA
- Typical ASCII
- XML Documents
- Patterns of 6 to 500 symbols (bytes)
- 1.8 GHz P3 and 2.4 GHz dual-core AMD Turion 64 processors
23Results Compared to BM
- DNA
- Up to 72 times faster
- Typical ASCII
- Up to about 11 times faster
- XML Documents
- Up to more than 5 times faster
- Search is faster for longer patterns
- Average shifts are longer
24DNA
25ASCII
26XML
27Related Work
- Implemented in SDDS-2006
- Applies best to
- longer patterns
- where many jumps occur
- alphabets much smaller than the size of GF used
- Instead of shifts of size m on average, one reaches almost $\min(K, 2^f)$ per shift
- up to almost 256 for DNA or ASCII with GF (256)
- up to almost 64K for DNA or Unicode with GF (64K)
- instead of 4 or 25 respectively
- For Boyer-Moore especially
28Related Work
- In SDDS 2006 P2P or Grid System in general
- Wish to hide what is searched for ?
- Use the signature-only search
- Usually slower since linear only
29Conclusion
- A new pattern matching algorithm
- Uses algebraic signatures
- Preprocesses both the pattern and the string
- Appears particularly efficient
- For databases
- For longer patterns
- Possibly faster in this context than any other algorithm known now
- But all these are only preliminary results
30Future Work
- Performance Analysis
- Theoretical
- Jump Length
- Median, Average
- Experimental
- Actual text
- Non uniform symbol distribution
- DNA
- Actual DNA strings
31Future Work
- Variants
- Jump Table
- Partial Signatures of n-grams
- Symbol $p_i$ encodes the signature of the n-gram $p_{i-n+1} \ldots p_i$
- No more XORing and division to find this signature
- Faster unsuccessful attempt to match
- Approximate Match
- Tolerating match errors
- E.g., of at most 1 symbol
32Thank You for Your Attention
- witold.litwin_at_dauphine.fr