Pattern Matching Using n-grams With Algebraic Signatures

1
Pattern Matching Using n-grams With Algebraic
Signatures
  • Witold Litwin (1), Riad Mokadem (1), Philippe Rigaux (1),
    Thomas Schwarz (2)
  • (1) Université Paris Dauphine
  • (2) Santa Clara University

2
n-gram Search
  • New pattern matching idea
  • Matches algebraic signatures
  • Preprocesses both the pattern and the string (record)
  • String preprocessing is a new idea
  • To the best of our knowledge
  • Provides incidental protection of stored data
  • Important for P2P grid systems
  • Fast processing
  • Especially useful for DBs and longer patterns
  • ASCII, Unicode, DNA
  • Should then often be faster than Boyer-Moore
  • Possibly the fastest known in this context

3
Algebraic Signature
  • Symbols of the alphabet are elements of a Galois
    Field
  • GF(256) usually
  • We choose there one primitive element $\alpha$
  • Usually $\alpha = 2$
  • The algebraic signature of the string of i
    symbols $p_1 \dots p_i$ is the sum
  • $p'_i = p_1\alpha + \dots + p_i\alpha^i$
  • Here the addition and the multiplication are the
    operations in GF.
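
The signature sum above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say which generating polynomial is used for GF(256), so `PRIM_POLY = 0x11D` is assumed here (for it, $\alpha = 2$ is primitive), and the names `gf_mul` and `algebraic_signature` are hypothetical.

```python
PRIM_POLY = 0x11D  # assumption: x^8 + x^4 + x^3 + x^2 + 1, for which alpha = 2 is primitive


def gf_mul(a, b):
    """Multiply two GF(256) elements by shift-and-XOR, reducing modulo PRIM_POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^f) is XOR
        a <<= 1
        if a & 0x100:
            a ^= PRIM_POLY       # bring a back into 8 bits
        b >>= 1
    return result


def algebraic_signature(symbols, alpha=2):
    """Return p_1*alpha + p_2*alpha^2 + ... + p_i*alpha^i over GF(256)."""
    signature = 0
    power = 1
    for p in symbols:
        power = gf_mul(power, alpha)      # alpha^i for the i-th symbol (1-based)
        signature ^= gf_mul(p, power)
    return signature
```

For instance, `algebraic_signature(b"Dauphine")` returns one byte for the whole string. Since the sum is linear over GF, the signature of the XOR of two equal-length strings is the XOR of their signatures.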

4
Algebraic Signature
  • In our GF($2^f$), where f = 8 or 16
  • $p + q = p - q = p \text{ XOR } q$
  • One method for multiplying is
  • $pq = \mathrm{antilog}_\alpha((\log_\alpha p + \log_\alpha q) \bmod 255)$
  • The division is then
  • $p / q = \mathrm{antilog}_\alpha((\log_\alpha p - \log_\alpha q) \bmod 255)$
  • The log and antilog are encoded in log and
    antilog tables with $2^f$ elements each.
  • Entry 0 is for element 0 of the GF and is by
    convention set to $2^f - 1$.
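
A minimal sketch of the table-driven arithmetic for f = 8 follows. It assumes the generating polynomial `0x11D` with $\alpha = 2$ (the slides do not name one), and the identifiers `LOG`, `ANTILOG`, `gf_add`, `gf_mul` and `gf_div` are illustrative. The antilog table here holds only the 255 meaningful exponents, and products or quotients involving 0 are handled as special cases rather than through the table.

```python
PRIM_POLY = 0x11D  # assumption: a generating polynomial for which alpha = 2 is primitive

LOG = [255] * 256      # LOG[0] = 255 = 2^f - 1 by the convention on the slide
ANTILOG = [0] * 255    # ANTILOG[e] = alpha^e for e = 0..254

_value = 1
for _exp in range(255):
    ANTILOG[_exp] = _value
    LOG[_value] = _exp
    _value <<= 1                 # multiply by alpha = 2
    if _value & 0x100:
        _value ^= PRIM_POLY


def gf_add(p, q):
    """Addition and subtraction are both XOR."""
    return p ^ q


def gf_mul(p, q):
    """pq = antilog((log p + log q) mod 255), with 0 handled separately."""
    if p == 0 or q == 0:
        return 0
    return ANTILOG[(LOG[p] + LOG[q]) % 255]


def gf_div(p, q):
    """p / q = antilog((log p - log q) mod 255); q must be nonzero."""
    if q == 0:
        raise ZeroDivisionError("division by zero in GF(256)")
    if p == 0:
        return 0
    return ANTILOG[(LOG[p] - LOG[q]) % 255]
```

Python's `%` always returns a non-negative result for a positive modulus, so the subtraction in `gf_div` needs no extra correction.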

5
Cumulative Algebraic Signature
  • We encode every symbol $p_i$ in a string into the
    signature $p'_i$ of the prefix $p_1 \dots p_i$
  • The value of a CAS symbol now encodes also the
    knowledge of values of all the previous ones
  • Matching a single symbol means prefix matching
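
The encoding described above can be sketched as follows, assuming GF(256) with the generating polynomial `0x11D` and $\alpha = 2$ (neither is fixed by the slides). The function names `cas_encode` and `cas_decode` are hypothetical. Each output symbol is the signature of the prefix ending at that position, and decoding undoes one term at a time.

```python
PRIM_POLY = 0x11D  # assumption: generating polynomial with alpha = 2 primitive

LOG = [255] * 256
ANTILOG = [0] * 255
_value = 1
for _exp in range(255):
    ANTILOG[_exp] = _value
    LOG[_value] = _exp
    _value <<= 1
    if _value & 0x100:
        _value ^= PRIM_POLY


def cas_encode(symbols):
    """Replace each symbol p_i by the signature of the prefix p_1..p_i."""
    encoded = []
    prefix_sig = 0
    for i, p in enumerate(symbols, start=1):
        if p:
            prefix_sig ^= ANTILOG[(LOG[p] + i) % 255]   # add p_i * alpha^i
        encoded.append(prefix_sig)
    return encoded


def cas_decode(encoded):
    """Recover p_i = (p'_i XOR p'_{i-1}) / alpha^i."""
    symbols = []
    previous = 0
    for i, c in enumerate(encoded, start=1):
        term = c ^ previous
        symbols.append(0 if term == 0 else ANTILOG[(LOG[term] - i) % 255])
        previous = c
    return bytes(symbols)
```

Because the i-th encoded symbol depends only on the first i symbols, two strings with a common prefix of length i have identical encoded symbols up to position i. Comparing the single encoded symbol at position i is therefore a prefix match, up to signature collisions.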

6
Application of CASs
  • Protection against involuntary data disclosure
  • On P2P Grid Servers especially
  • Numerous CAS encoded string matching algorithms
  • Prefix match with O (1) complexity
  • Pattern match by signature only
  • Karp-Rabin-like, linear O(L) complexity
  • Longest common string search
  • Longest common prefix search

7
CAS Properties
  • O (K) encoding and decoding speed
  • For encoding, for instance
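  • $p'_i = p'_{i-1} + p_i\alpha^i = \mathrm{CAS}(p_{i-1}) + p_i\alpha^i$
  • Fast n-gram signature calculus
  • For $S_{k,l} = p_k \dots p_l$ with $k > 1$ and $l = k + n$
  • $AS(S_{k,l}) = (p'_l \text{ XOR } p'_{k-1}) / \alpha^{k-1}$,
    the signature of $S_{k,l}$ taken as a string of $l - k + 1$ symbols
  • Logarithmic Algebraic Signature (LAS)
  • $LAS(S_{k,l}) = \log_\alpha AS(S_{k,l})$
  • $= (\log_\alpha(p'_l \text{ XOR } p'_{k-1}) - (k-1)) \bmod (2^f - 1)$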
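
As a hedged illustration of the substring signature calculus, the sketch below derives the AS and the LAS of $p_k \dots p_l$ from two CAS symbols only. It assumes GF(256), the generating polynomial `0x11D` and $\alpha = 2$. The helper names are hypothetical, and positions are 1-based to follow the slide notation. The case k = 1, which the slide excludes, is handled by treating $p'_0$ as 0.

```python
PRIM_POLY = 0x11D  # assumption: generating polynomial with alpha = 2 primitive

LOG = [255] * 256      # LOG[0] = 255 by convention
ANTILOG = [0] * 255
_value = 1
for _exp in range(255):
    ANTILOG[_exp] = _value
    LOG[_value] = _exp
    _value <<= 1
    if _value & 0x100:
        _value ^= PRIM_POLY


def signature(symbols):
    """Signature of a standalone string: p_1*alpha + ... + p_i*alpha^i."""
    sig = 0
    for i, p in enumerate(symbols, start=1):
        if p:
            sig ^= ANTILOG[(LOG[p] + i) % 255]
    return sig


def cas_encode(symbols):
    """CAS symbol i is the signature of the prefix p_1..p_i."""
    encoded, prefix_sig = [], 0
    for i, p in enumerate(symbols, start=1):
        if p:
            prefix_sig ^= ANTILOG[(LOG[p] + i) % 255]
        encoded.append(prefix_sig)
    return encoded


def substring_as(cas, k, l):
    """AS of p_k..p_l = (p'_l XOR p'_{k-1}) / alpha^(k-1)."""
    diff = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return 0 if diff == 0 else ANTILOG[(LOG[diff] - (k - 1)) % 255]


def substring_las(cas, k, l):
    """LAS of p_k..p_l = (log(p'_l XOR p'_{k-1}) - (k-1)) mod 255, or 255 for AS = 0."""
    diff = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return 255 if diff == 0 else (LOG[diff] - (k - 1)) % 255
```

The LAS form costs one XOR, one table lookup and one subtraction per substring, and avoids the antilog lookup needed for the AS itself.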

8
The n-gram Search: Key Ideas
  • Design a sublinear pattern match search
  • With speed about L / K
  • Apply to CAS encoded DB
  • New idea for string search algorithm with
    preprocessing
  • Justified for a DB
  • Store once, search many times

9
The n-gram Search: Key Ideas
  • Preprocess the pattern to create a jump table
  • As in Boyer-Moore
  • Use n-grams with n > 1 to increase the
    discriminative power of an attempt
  • Comparison of a sample from the pattern
  • a single symbol for BM
  • an LAS of an n-gram for a CAS-encoded string

10
The n-gram Search: Key Ideas
  • If the alphabet uses m symbols, the probability
    that a symbol matches is 1/m
  • Assuming all symbols equally likely
  • For usual ASCII pattern matching, m ≈ 20-25
  • For DNA, m = 4
  • A single symbol may often match without the whole
    pattern matching
  • e.g., 1/4 of the time for DNA on average
  • Leading to small jumps,
  • by m symbols on average

11
The n-gram Search: Key Ideas
  • The probability of an n-gram matching may be
  • $\min(1/2^f, 1/m^n)$
  • In our examples it can reach 1 / 256
  • More discriminative sampling
  • Longer jumps
  • By almost K or 256 symbols in general
  • Useful for longer strings
  • DNA, text, images

12
ASCII Example: Usual Alphabet
[Figure: search trace. 2-grams: 5 jumps; 1-gram: 6 jumps]
13
DNA Example: 4-letter Alphabet
[Figure: search traces with 3, 4, 4 and 11 jumps]
14
The n-gram Search Preprocessing
  • Encode every record (string) into its CAS
  • Done for incidental protection anyhow for
    SDDS-2006
  • Encode the terminal n-gram of the searched
    pattern SK into its LAS in variable V
  • Fill up the jump table T for every other
    n-gram in SK
  • calculate every LAS
  • for each LAS, store in T its rightmost offset
    with respect to the end of SK

15
The n-gram Search Jump Table
  • For GF(256), for every n-gram $S_{i,i+n-1}$ in the
    pattern and $j = LAS(S_{i,i+n-1})$
  • $T(j)$ = the offset
  • $T(j) = K - n + 1$ otherwise
  • Reminder: $LAS(0) = 255$
  • T can also be a hash table
  • See the paper
  • Slower to use but possibly more memory efficient
  • Probably more useful for a larger GF
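
The preprocessing of the pattern can be sketched as below. This is not the authors' code. It assumes GF(256) with the generating polynomial `0x11D` and $\alpha = 2$, and the names `las_of` and `build_jump_table` are hypothetical. The table is a plain 256-entry list indexed by LAS, with entry 255 serving n-grams whose signature is 0.

```python
PRIM_POLY = 0x11D  # assumption: generating polynomial with alpha = 2 primitive

LOG = [255] * 256      # LOG[0] = 255, so LAS(0) = 255
ANTILOG = [0] * 255
_value = 1
for _exp in range(255):
    ANTILOG[_exp] = _value
    LOG[_value] = _exp
    _value <<= 1
    if _value & 0x100:
        _value ^= PRIM_POLY


def las_of(ngram):
    """LAS of a standalone n-gram: log of p_1*alpha + ... + p_n*alpha^n."""
    sig = 0
    for i, p in enumerate(ngram, start=1):
        if p:
            sig ^= ANTILOG[(LOG[p] + i) % 255]
    return LOG[sig]


def build_jump_table(pattern, n=2):
    """Return (T, V) for a pattern of K >= n symbols.

    V is the LAS of the terminal n-gram. T[j] is the rightmost offset, counted
    from the end of the pattern, of a non-terminal n-gram whose LAS is j, and
    K - n + 1 when no such n-gram exists.
    """
    K = len(pattern)
    V = las_of(pattern[K - n:])
    T = [K - n + 1] * 256
    # Left-to-right scan: a later (more rightward) n-gram overwrites an earlier one.
    for start in range(K - n):
        offset = K - n - start
        T[las_of(pattern[start:start + n])] = offset
    return T, V
```

For the pattern `b"Dauphine"` with n = 2, the default entry is 7 and the 2-gram "in" is stored with offset 1. Other entries depend on the field representation, since two different n-grams may share an LAS.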

16
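ASCII Example
[Figure: jump table T for the pattern "Dauphine" with 2-grams.
Entries shown: T(0) = 7, T(1) = 7, T(in) = 1, T(au) = 5, T(ph) = 3,
T(255) = 7; V = ne. Notation: xy stands for LAS(xy).]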
17
The n-gram Search Processing
  • Calculate LAS of the current n-gram in the string
  • Start with the n-gram $S_{K-n+1,K}$
  • Continue depending on jump calculus
  • Attempt to match V
  • If true, then calculate LAS of the entire current
    possibly matching substring
  • of length K and ending with the current n-gram
  • If true, then resolve the possible collision
  • Either attempt to match all the K symbols
  • Or match enough of terminal n-grams or symbols to
    decrease the probability of collision to a very
    small value

18
The n-gram Search Processing
  • Otherwise
  • Go to T using LAS of the n-gram
  • Jump by the number of symbols found in T
  • Update the current position for n-gram to
    attempt the match
  • Re-attempt the match as above
  • Unless the n-gram to attempt is beyond the end
    of the string
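
The processing steps can be put together in one sketch. The assumptions are GF(256) with the generating polynomial `0x11D` and $\alpha = 2$, and hypothetical names throughout. Two choices here are not spelled out on the slides: collisions are resolved by comparing all K symbols through their 1-gram LAS, and after an attempt where V matched, the search still jumps by the table entry for V.

```python
PRIM_POLY = 0x11D  # assumption: generating polynomial with alpha = 2 primitive

LOG = [255] * 256
ANTILOG = [0] * 255
_value = 1
for _exp in range(255):
    ANTILOG[_exp], LOG[_value] = _value, _exp
    _value <<= 1
    if _value & 0x100:
        _value ^= PRIM_POLY


def cas_encode(symbols):
    """CAS symbol i is the signature of the prefix p_1..p_i."""
    encoded, prefix_sig = [], 0
    for i, p in enumerate(symbols, start=1):
        if p:
            prefix_sig ^= ANTILOG[(LOG[p] + i) % 255]
        encoded.append(prefix_sig)
    return encoded


def las(cas, k, l):
    """LAS of p_k..p_l (1-based positions) computed from CAS symbols only."""
    diff = cas[l - 1] ^ (cas[k - 2] if k > 1 else 0)
    return 255 if diff == 0 else (LOG[diff] - (k - 1)) % 255


def ngram_search(record_cas, pattern, n=2):
    """Return 0-based start positions of pattern in a CAS-encoded record."""
    K, L = len(pattern), len(record_cas)
    if K < n or L < K:
        return []
    pcas = cas_encode(pattern)
    T = [K - n + 1] * 256                      # jump table, default K - n + 1
    for i in range(1, K - n + 1):              # every n-gram except the terminal one
        T[las(pcas, i, i + n - 1)] = K - (i + n - 1)
    V = las(pcas, K - n + 1, K)                # LAS of the terminal n-gram
    whole = las(pcas, 1, K)                    # LAS of the entire pattern
    symbol_las = [las(pcas, t, t) for t in range(1, K + 1)]
    hits, end = [], K                          # end = position of the current n-gram's last symbol
    while end <= L:
        current = las(record_cas, end - n + 1, end)
        if current == V and las(record_cas, end - K + 1, end) == whole:
            start = end - K
            # Resolve a possible collision by matching all K symbols.
            if all(las(record_cas, start + t, start + t) == symbol_las[t - 1]
                   for t in range(1, K + 1)):
                hits.append(start)
        end += T[current]                      # always at least 1, so the loop advances
    return hits
```

For example, `ngram_search(cas_encode(b"Universite Paris Dauphine, Paris"), b"Paris")` reports the two occurrences of "Paris". The record is never decoded: every comparison works on LAS values derived from its CAS symbols.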

19
ASCII Example Again
[Figure: search trace. 2-grams: 5 jumps; 1-gram: 6 jumps]
20
DNA Example Again
[Figure: search traces with 3, 4, 4 and 11 jumps]
21
n-grams / BM
  • Average shifts with n-grams can be typically
    longer
  • Calculating the jump for an attempt may be more
    expensive as well
  • About twice as long, as a first approximation
  • The precise analysis remains to be done
  • Rule of thumb: if shifts are more than 2 times
    longer, n-grams with n > 1 should be faster
    than BM.

22
Experimental Results
  • Searching large data of
  • DNA
  • Typical ASCII
  • XML Documents
  • Patterns of 6 to 500 symbols (bytes)
  • 1.8 GHz P3 and 2.4 GHz dual-core AMD Turion 64
    processors

23
Results Compared to BM
  • DNA
  • Up to 72 times faster
  • Typical ASCII
  • Up to about 11 times faster
  • XML Documents
  • Up to more than 5 times faster
  • Search is faster for longer patterns
  • Average shifts are longer

24
DNA
25
ASCII
26
XML
27
Related Work
  • Implemented in SDDS-2006
  • Applies best to
  • longer patterns
  • where many jumps occur
  • alphabets much smaller than the size of GF used
  • Instead of shifts of size m on average, one
    reaches almost $\min(K, 2^f)$ per shift
  • up to almost 256 for DNA or ASCII with GF (256)
  • up to almost 64K for DNA or Unicode with GF (64K)
  • instead of 4 or 25 respectively
  • For Boyer-Moore especially

28
Related Work
  • In SDDS 2006 P2P or Grid System in general
  • Wish to hide what is searched for ?
  • Use the signature-only search
  • Usually slower since linear only

29
Conclusion
  • A new pattern matching algorithm
  • Uses algebraic signatures
  • Preprocesses both the pattern and the string
  • Appears particularly efficient
  • For databases
  • For longer patterns
  • Possibly faster in this context than any other
    algorithm known now
  • But all these are only preliminary results

30
Future Work
  • Performance Analysis
  • Theoretical
  • Jump Length
  • Median, Average
  • Experimental
  • Actual text
  • Non uniform symbol distribution
  • DNA
  • Actual DNA strings

31
Future Work
  • Variants
  • Jump Table
  • Partial Signatures of n-grams
  • Symbol $p'_i$ encodes the signature of the n-gram
    $p_{i-n+1} \dots p_i$
  • No more XORing or division to find this signature
  • Faster unsuccessful attempt to match
  • Approximate Match
  • Tolerating match errors
  • E.g., of at most 1 symbol

32
Thank You for Your Attention
  • witold.litwin_at_dauphine.fr