Fast Fingerprint Calculations - PowerPoint PPT Presentation

Slides: 31
Provided by: thomass61


1
Fast Fingerprint Calculations
  • Thomas Schwarz, S.J.

2
Fingerprints
  • Definition: A fingerprint (a.k.a. signature) of
    an object Ob is a small string f(Ob) with the
    following properties:
  • 1. f is a function of Ob. In particular, if two
    objects are equal, then so are their
    fingerprints.
  • 2. $\mathrm{Prob}(f(Ob_1) = f(Ob_2)) \ll 1$ for random
    objects $Ob_1 \neq Ob_2$.

3
Usage
  • Fingerprints are used to
  • Identify Objects
  • Compare Objects Remotely
  • Test an Object for Changes
  • Since fingerprints are smaller, they are very
    useful as stand-ins for remote objects.

4
Usage
  • Object identification example
  • Software Cloning: During maintenance, the need
    arises for a module very similar in character to
    one that already exists. Because of time
    pressure, the existing module is simply copied,
    all names are systematically changed, and the
    copy is then modified to serve the new needs.
  • Maintenance Problem: If a bug is detected in a
    clone or in the original, it probably also resides
    in the original and/or the other clones. Besides,
    clones arise because of time pressure, but the
    short-cut ends up costing more in the long run.
    Thus, it is better to identify clones during
    maintenance.
  • Clone Identification: Systematically suppress
    names, then test whether the function code is
    identical.
  • Johnson, J.H. Substring Matching for Software
    Clone Detection and Change Tracking.
    International Conference on Software Maintenance,
    Victoria, BC, 1994, p. 120-126.
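
The clone-identification idea can be illustrated with a deliberately crude sketch. It is not Johnson's substring-matching method: it assumes Python source text, replaces every non-keyword identifier with the placeholder `ID`, and compares SHA-1 fingerprints of the result. The function name `clone_fingerprint` is hypothetical.

```python
import hashlib
import keyword
import re

# Crude identifier pattern (assumption): a word starting with a letter or "_".
IDENT = re.compile(r"\b[A-Za-z_]\w*")

def clone_fingerprint(source: str) -> str:
    """Fingerprint of source code with all names suppressed."""
    def suppress(match):
        word = match.group(0)
        # Keep language keywords, replace every other name by a placeholder.
        return word if keyword.iskeyword(word) else "ID"

    normalized = IDENT.sub(suppress, source)
    normalized = " ".join(normalized.split())  # ignore layout differences
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

original = "def area(w, h):\n    return w * h"
clone = "def size(a, b):\n    return a * b"
other = "def size(a, b):\n    return a + b"
print(clone_fingerprint(original) == clone_fingerprint(clone))  # same structure
print(clone_fingerprint(original) == clone_fingerprint(other))  # different operator
```

Mapping every name to the same placeholder also matches code that uses its names differently (for example `w * w` against `w * h`), so equal fingerprints only mark candidates for closer inspection.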

5
Usage
  • Similarity Testing for Files
  • n-gram: Contiguous substring of n characters in a
    file.
  • File Similarity: Count the number of occurrences
    of a particular n-gram.
  • Use the fingerprint of an n-gram as a hash value.
    Count the fingerprints instead of the n-grams.
  • Cohen, J.D. Recursive Hashing Functions for
    n-grams. ACM Trans. Information Systems, p.
    291-320.
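
Counting fingerprints instead of n-grams might look as follows. This is a minimal sketch, not Cohen's recursive hashing functions: it assumes a Karp-Rabin style rolling fingerprint over the integers modulo a prime, and the base `alpha`, the modulus `p`, and the function name are illustrative choices.

```python
from collections import Counter

def ngram_fingerprint_counts(data: bytes, n: int,
                             alpha: int = 257,
                             p: int = (1 << 31) - 1) -> Counter:
    """Count the rolling fingerprints of all n-grams in data."""
    counts = Counter()
    if n <= 0 or len(data) < n:
        return counts
    top = pow(alpha, n - 1, p)      # weight of the symbol leaving the window
    fp = 0
    for b in data[:n]:              # fingerprint of the first n-gram
        fp = (fp * alpha + b) % p
    counts[fp] += 1
    for i in range(n, len(data)):
        # Drop the oldest symbol, shift by alpha, append the new symbol.
        fp = ((fp - data[i - n] * top) * alpha + data[i]) % p
        counts[fp] += 1
    return counts

counts = ngram_fingerprint_counts(b"abcabc", 3)
print(sum(counts.values()), max(counts.values()))
```

Two files can then be compared through their fingerprint counters, at the price of occasionally merging two different n-grams that collide.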

6
Usage
  • Remote String Searches: Find all occurrences of a
    given string in files on remote servers.
  • Instead of sending the string to all servers,
    only a fingerprint and the length l of the string
    are sent. The servers generate running
    fingerprints of l-grams and compare them with the
    string's fingerprint.
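
The exchange can be sketched as below, assuming a Karp-Rabin style fingerprint modulo a prime. `ALPHA`, `P`, `fingerprint`, and `server_search` are hypothetical names and parameters, not part of the presentation.

```python
ALPHA = 257            # assumed base
P = (1 << 61) - 1      # assumed prime modulus

def fingerprint(s: bytes) -> int:
    """Fingerprint of a byte string, evaluated with Horner's rule."""
    fp = 0
    for b in s:
        fp = (fp * ALPHA + b) % P
    return fp

def server_search(text: bytes, target_fp: int, length: int) -> list:
    """Offsets whose l-gram fingerprint equals the fingerprint received."""
    if length <= 0 or len(text) < length:
        return []
    top = pow(ALPHA, length - 1, P)
    fp = fingerprint(text[:length])
    hits = [0] if fp == target_fp else []
    for i in range(length, len(text)):
        # Running fingerprint: remove text[i - length], append text[i].
        fp = ((fp - text[i - length] * top) * ALPHA + text[i]) % P
        if fp == target_fp:
            hits.append(i - length + 1)
    return hits

# The client sends only (fingerprint, length), never the string itself.
query = b"abra"
print(server_search(b"abracadabra", fingerprint(query), len(query)))
```

The offsets returned are candidates: a fingerprint match can be a collision, so a deployment would have to decide whether to verify them.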

7
Usage
  • Remote File Comparison
  • Original Problem: How to compare pages of remote
    replicas of a database.
  • Solution: Calculate fingerprints (signatures)
    of each page. Calculate a super-signature from
    the page signatures. If the super-signatures
    coincide, conclude that the replicas are in sync.
    If not, run a smart protocol to find the
    non-matching signatures.
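
A minimal sketch of this scheme, under stated assumptions: page signatures are Karp-Rabin style values modulo a prime, the super-signature is the same function applied to the list of page signatures, and a plain page-by-page comparison stands in for the "smart protocol" of the cited papers. All names are hypothetical.

```python
P = (1 << 61) - 1      # assumed prime modulus
ALPHA = 1_000_003      # assumed base

def signature(symbols) -> int:
    """Signature of a sequence of non-negative integers (or of bytes)."""
    s = 0
    for a in symbols:
        s = (s * ALPHA + a) % P
    return s

def out_of_sync_pages(local_pages, remote_super_sig, fetch_remote_page_sigs):
    """Indices of pages whose signatures differ between the replicas."""
    local_sigs = [signature(page) for page in local_pages]
    if signature(local_sigs) == remote_super_sig:
        return []                               # replicas assumed in sync
    remote_sigs = fetch_remote_page_sigs()      # only fetched on a mismatch
    return [i for i, (a, b) in enumerate(zip(local_sigs, remote_sigs))
            if a != b]

local = [b"page0", b"page1", b"page2"]
remote = [b"page0", b"pageX", b"page2"]
remote_sigs = [signature(page) for page in remote]
print(out_of_sync_pages(local, signature(remote_sigs), lambda: remote_sigs))
```

In the common case only one super-signature crosses the network; the per-page signatures are requested only after a mismatch.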

  • Abdel-Ghaffar, K. A. S., El-Abbadi, A.
    Efficient Detection of Corrupted Pages in a
    Replicated File. ACM Symp. Distributed Computing,
    1993, p. 219-227.
  • Barbara, D., Garcia-Molina, H., Feijoo, B.
    Exploiting Symmetries for Low-Cost Comparison of
    File Copies. Proc. Int. Conf. Distributed
    Computing Systems, 1988, p. 471-479.
  • Barbara, D., Lipton, R. J. A Class of Randomized
    Strategies for Low-Cost Comparison of File
    Copies. IEEE Trans. Parallel and Distributed
    Systems, vol. 2(2), 1991, p. 160-170.
  • Fuchs, W., Wu, K. L., Abraham, J. A. Low-Cost
    Comparison and Diagnosis of Large, Remotely
    Located Files. Proc. Symp. Reliability
    Distributed Software and Database Systems, 1986,
    p. 67-73.
  • Schwarz, Th., Bowdidge, B., Burkhard, W. Low Cost
    Comparison of Files. Int. Conf. on Distr. Comp.
    Syst. (ICDCS 90), p. 196-201.

8
Usage
  • Secure Signatures
  • To identify an object, maintain its signature.
    If the object is altered by an adversary, the
    adversary cannot do so in a computationally
    feasible way without changing the signature.
  • Cryptographically secure signatures: SHA-1, MD5
  • Used for authentication, e.g. in computer
    forensics, digital signatures, etc.

9
SHA-1
  • 20 bytes long.
  • Designed for fast calculation.
  • Considered unbreakable at the time of this
    presentation (practical collisions were published
    in 2017).
  • Used increasingly in applications where
    cryptographic security is not needed.
  • Radia Perlman's Law of Cryptography:
  • If a lot of smart people spent lots of time
    trying to break a scheme, and did not succeed,
    then it cannot be done.

10
Useful Properties of Fingerprints
  • Fast Calculation.
  • Low collision rate.
  • If the fingerprints have length l, then the
    probability of a collision should be $2^{-l}$.
  • If there are small changes, then fingerprints
    should change.
  • Cryptographically unbreakable.
  • Given a signature, one cannot construct an object
    with this signature.

11
Useful Properties of Fingerprints
  • Updatable
  • If the object changes, then we can update the
    signature from the old signature and the changes.
  • Concatenation of Objects
  • If an object is made up of several objects, then
    we can calculate the signature of the
    super-object from its constituents. Possibly in
    a way that allows us to quickly pinpoint
    different component objects.

12
Karp Rabin Style Fingerprints
  • Here, the calculation takes place in a ring R
    with multiplication and addition.
  • Karp, R. M., Rabin, M. O. Efficient randomized
    pattern-matching algorithms. IBM Journal of
    Research and Development, Vol. 31, No. 2, March
    1987.

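
One definition consistent with the concatenation rule for these fingerprints is sketched below. It assumes symbols $a_i$ taken from R and a fixed base element of R, written $\alpha$ here; both the symbol name and the indexing are assumptions.

```latex
\mathrm{sig}_\alpha(a_0, a_1, \ldots, a_{N-1})
  = \sum_{i=0}^{N-1} a_i \, \alpha^{N-1-i}
  = \bigl(\cdots\bigl((a_0 \alpha + a_1)\,\alpha + a_2\bigr)\,\alpha + \cdots\bigr)\,\alpha + a_{N-1}
```

The nested form on the right is Horner's rule: each further symbol costs one multiplication by $\alpha$ and one addition in R.
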
13
Karp Rabin Style Fingerprints
  • Calculation time is linear in N
  • (1 multiplication and 1 addition per symbol)

14
Karp Rabin Style Fingerprints
  • Easy to calculate consecutive n-grams
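  • Easy to calculate signatures of concatenations:
  • $\mathrm{sig}_\alpha(a_0, a_1, \ldots, a_{l-1}, a_l, a_{l+1}, \ldots, a_{l+n-1})$
  • $= \alpha^n \, \mathrm{sig}_\alpha(a_0, a_1, \ldots, a_{l-1}) + \mathrm{sig}_\alpha(a_l, a_{l+1}, \ldots, a_{l+n-1})$
  • Possibly use a table of values of $\alpha^n$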
15
Karp Rabin Style Fingerprints
  • Cryptographically not secure
  • A cryptographically secure signature is a one-way
    function. In order to be efficiently calculable,
    it needs to process large portions of a string at
    a time. Thus, cryptographic security is not a
    desirable property in general.

16
Choice of Ring R
  • 1. Integers modulo prime p
  • The ring is then a well-understood field.
  • But, reducing modulo p is a costly operation.
  • 2. Integers modulo $2^f$
  • Powers of two are zero divisors,
  • e.g. $2 \cdot 2^{f-1} = 0$.
  • This excludes powers of 2 as $\alpha$.
17
Choice of Ring
  • 3. Reduction by a polynomial.
  • The space of unsigned integers $0, \ldots, 2^{32}-1$
    can be naturally identified with the space of all
    polynomials in $k[t]$ over $k = \{0, 1\}$ with degree
    up to 31.
  • Select a polynomial $\varphi(t)$ with degree f up to 31
    and consider the ring $R = k[t]/(\varphi(t))$. Elements
    in this ring are naturally identified with all
    unsigned integers $0, \ldots, 2^f-1$.
  • Addition of these polynomials corresponds to the
    fast XOR. Multiplication is more difficult, but
    multiplication by t is a left shift followed by
    conditionally XORing with $\varphi$.
  • This is the most promising construction.

18
$R = k[t]/(\varphi(t))$ Example
  • Set $\varphi = t^4 + t + 1$, that is, $\varphi$ = 10011.
  • Elements of R are all bit strings of length 4.
  • To add 0101 and 1100, just XOR: 0101 $\oplus$ 1100 = 1001.
  • To multiply by t = 0010, left shift and XOR
    conditionally with $\varphi$.
  • To multiply 0010 by 0010, left-shift the first
    and obtain 0,0100. The leading coefficient is
    zero and is dropped. The result is 0100.
  • To multiply 1100 by 0010, left-shift to obtain
    1,1000. The leading coefficient is one, so XOR
    with $\varphi$ = 10011 to obtain the result 1011.
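
The arithmetic of this example can be replayed directly. The sketch assumes 4-bit elements and the reduction polynomial 10011; the names `PHI`, `F`, `add`, and `mul_by_t` are illustrative.

```python
PHI = 0b10011   # reduction polynomial t^4 + t + 1 from the example
F = 4           # elements are bit strings of length 4

def add(a: int, b: int) -> int:
    """Addition of polynomials over {0, 1} is a bitwise XOR."""
    return a ^ b

def mul_by_t(a: int) -> int:
    """Multiply by t: left shift, then XOR with PHI if bit F became one."""
    a <<= 1
    if (a >> F) & 1:
        a ^= PHI
    return a

print(format(add(0b0101, 0b1100), "04b"))   # 1001
print(format(mul_by_t(0b0010), "04b"))      # 0100
print(format(mul_by_t(0b1100), "04b"))      # 1011
```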

19
Galois Fields
  • If $\varphi$ is irreducible, then R is a Galois field.
  • If we use a Galois field, we can concatenate
    fingerprints to obtain a signature.

20
Galois Fields
  • If we use
  • $(\alpha, \alpha^2, \alpha^3, \ldots, \alpha^n)$
  • then the signatures are the parity symbols of a
    generalized, non-systematic Reed-Solomon code.
  • Since these codes are MDS, the signature will
    change for up to n changes in the object.
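
The claim can be checked by brute force on a toy field. The sketch assumes the 16-element field with reduction polynomial 10011, the base element t = 0010 (primitive for this polynomial), Horner-style component signatures, and objects shorter than 15 symbols. Because the signature is linear, it is enough to confirm that no difference vector with at most n non-zero symbols has an all-zero n-fold signature. All names are hypothetical.

```python
from itertools import combinations, product

PHI, F = 0b10011, 4      # field with 16 elements (assumed toy size)

def mul(a: int, b: int) -> int:
    """Multiply two field elements by shift-and-add with reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> F:
            a ^= PHI
    return r

def power(a: int, e: int) -> int:
    r = 1
    for _ in range(e):
        r = mul(r, a)
    return r

def sig(symbols, beta: int) -> int:
    """Component signature with base beta (Horner's rule)."""
    s = 0
    for a in symbols:
        s = mul(s, beta) ^ a
    return s

def n_fold_sig(symbols, n: int, alpha: int = 0b0010):
    """Concatenation of the signatures with bases alpha, alpha^2, ..., alpha^n."""
    return tuple(sig(symbols, power(alpha, j)) for j in range(1, n + 1))

def changes_always_detected(length: int, n: int) -> bool:
    """True if every change of 1..n symbols alters the n-fold signature."""
    zero = (0,) * n
    for k in range(1, n + 1):
        for positions in combinations(range(length), k):
            for values in product(range(1, 1 << F), repeat=k):
                delta = [0] * length
                for pos, val in zip(positions, values):
                    delta[pos] = val
                if n_fold_sig(delta, n) == zero:
                    return False
    return True

print(changes_always_detected(6, 2))
```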

21
Galois Field Signatures
  • To calculate a Galois field footprint, we only
    need per symbol
  • One XOR
  • One left-shift
  • One test whether the leading coefficient is now
    one.
  • Conditionally one XOR.
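
The four per-symbol operations can be sketched as follows, assuming 4-bit symbols and the reduction polynomial 10011 used in the worked example of these slides. The function name is hypothetical.

```python
PHI = 0b10011   # reduction polynomial, assumed from the slides' example
F = 4           # symbol and fingerprint width in bits

def gf_fingerprint(symbols) -> int:
    """Fingerprint with a reduction step after every symbol."""
    fp = 0
    for a in symbols:
        fp = (fp << 1) ^ a        # one left shift and one XOR
        if (fp >> F) & 1:         # test the leading coefficient
            fp ^= PHI             # conditional XOR
    return fp

string = [0b1000, 0b1100, 0b1010, 0b0111, 0b0011]
print(format(gf_fingerprint(string), "04b"))   # 0010
```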

22
Speeding up Galois field footprints
  • However, we do not have to execute the reduction
    step each time.
  • Instead, left shift and XOR b times (Broder's
    idea). Then use a table look-up to reduce the
    overhang.
  • Broder, A. Some applications of Rabin's
    fingerprinting method. In Capocelli, De Santis,
    and Vaccaro (eds.), Sequences II: Methods in
    Communications, Security, and Computer Science,
    pages 143-152. Springer-Verlag, 1993.

23
Speeding up Galois field footprints
  • String is (1000, 1100, 1010, 0111, 0011, ...)
  • Choose $\varphi$ = 10011. This is an irreducible
    polynomial.
  • Step 0: 1000
  • Step 1: 1,0000 $\oplus$ 1100 = 1,1100
  • Step 2: 11,1000 $\oplus$ 1010 = 11,0010
  • Step 3: 110,0100 $\oplus$ 0111 = 110,0011
  • Step 4: 1100,0110 $\oplus$ 0011 = 1100,0101
  • Now use table look-up to calculate
  • 0101 $\oplus$ table[1100] = 0101 $\oplus$ 0111 = 0010.
  • 8 elementary ops + 1 shift right + 1 table
    look-up + 1 XOR.
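
Deferred reduction with a table might be sketched like this. It assumes 4-bit symbols, the polynomial 10011, and a reduction after every B = 4 shifts; `reduce_poly`, `TABLE`, and `broder_fingerprint` are hypothetical names, and the table is filled by plain polynomial division rather than by any method from Broder's paper.

```python
PHI, F = 0b10011, 4   # reduction polynomial and field width (from the example)
B = 4                 # number of shift/XOR steps between reductions (assumed)
MASK = (1 << F) - 1

def reduce_poly(x: int) -> int:
    """Reduce polynomial x modulo PHI by long division over {0, 1}."""
    while x.bit_length() > F:
        x ^= PHI << (x.bit_length() - 1 - F)
    return x

# TABLE[h] is the reduced value of the overhang h followed by F zero bits.
TABLE = [reduce_poly(h << F) for h in range(1 << B)]

def broder_fingerprint(symbols) -> int:
    fp, pending = 0, 0
    for a in symbols:
        fp = (fp << 1) ^ a            # shift and XOR, no reduction yet
        pending += 1
        if pending == B:              # overhang now has at most B bits
            fp = (fp & MASK) ^ TABLE[fp >> F]
            pending = 0
    return (fp & MASK) ^ TABLE[fp >> F]

string = [0b1000, 0b1100, 0b1010, 0b0111, 0b0011]
print(format(TABLE[0b1100], "04b"))                             # 0111
print(format((0b11000101 & MASK) ^ TABLE[0b1100], "04b"))       # 0010
print(format(broder_fingerprint(string), "04b"))                # 0010
```

The sketch reduces after the fourth symbol instead of the fifth, so its intermediate values differ from the steps listed on the slide, but the final fingerprint is the same.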

24
Speeding up Galois field footprints
  • String is (1000, 1100, 1010, 0111, 0011, ...)
  • Step 0: 1000
  • Step 1: 1,0000 $\oplus$ 1100 = 1,1100, then
    1,1100 $\oplus$ $\varphi$ = 1,1100 $\oplus$ 1,0011 = 1111.
  • Step 2: 1,1110 $\oplus$ 1010 = 1,0100, reduced to 0111.
  • Step 3: 0,1110 $\oplus$ 0111 = 1001.
  • Step 4: 1,0010 $\oplus$ 0011 = 1,0001, reduced to 0010.
  • 8 elementary ops + 4 condition evaluations + 2
    elementary ops on average.

25
Speeding up Galois field footprints
  • How do we calculate the table entries?
  • Systematically reduce by t-multiples of
    $\varphi$ = 10011.
  • To calculate the table entry for 12 = 1100, reduce
    1100,0000 in four steps.
  • 1100,0000 $\oplus$ $t^3\varphi$
  • = 1100,0000 $\oplus$ 1001,1000
  • = 0101,1000
  • Now reduce with $t^2\varphi$:
  • 0101,1000 $\oplus$ $t^2\varphi$
  • = 0101,1000 $\oplus$ 0100,1100
  • = 0001,0100
  • No step with $t\varphi$ since the corresponding
    coefficient is zero.
  • Reduce with $\varphi$:
  • 0001,0100 $\oplus$ 0001,0011 = 0000,0111 = 0111.
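
The systematic reduction can be written as a short loop. The sketch assumes the polynomial 10011 and a 4-bit overhang; `table_entry` and its optional `trace` argument are hypothetical.

```python
PHI, F = 0b10011, 4   # reduction polynomial and field width (from the example)

def table_entry(h: int, b: int = 4, trace=None) -> int:
    """Reduce the overhang h, followed by F zero bits, modulo PHI."""
    x = h << F
    for k in range(b - 1, -1, -1):        # t^3*PHI, t^2*PHI, t*PHI, PHI
        if (x >> (F + k)) & 1:            # coefficient of t^(F+k) is one
            x ^= PHI << k
            if trace is not None:
                trace.append(x)           # record the intermediate result
    return x

steps = []
print(format(table_entry(0b1100, trace=steps), "04b"))   # 0111
print([format(s, "08b") for s in steps])
```

For the entry 1100 the recorded intermediates are 01011000, 00010100 and 00000111; the multiple t·PHI is skipped because its coefficient is zero.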

26
Speeding up Galois field footprints
  • Optimal table size needs to be determined
    experimentally.
  • The table needs to fit in cache, so it cannot be
    much bigger than $2^{16}$.
  • If the table is too small, then the look-up cost
    does not amortize well.

27
Galois field signatures
  • Galois field signatures are concatenations of
    Galois field fingerprints.
  • Broder tables work for multiplication with $t^2$,
    $t^3$, $t^4$ as well, but less efficiently, since now
    we shift two, three, or four times so that we
    need to use table look-up more often.

28
Performance Results
  • 1.772 msec per MB for 16 bit parity
  • 2.012 msec per MB for 16 bit $\alpha^1$ power.
  • 3.114 msec per MB for 16 bit $\alpha^2$ power.
  • Results for a 1.99GHz Pentium 4 w. 512MB memory.

29
How Long should Signatures be?
  • Key fact: There are 31,557,600 seconds in a year.
  • At x calculations per second, there will be
    31,557,600x incidents per year, which will lead to
    fewer than $2^{-31} \cdot 31{,}557{,}600\,x \approx 0.015x$
    collisions per year for 32 bit signatures.
  • So, the minimum length should be 64 bits. We can
    achieve this easily.
  • Larger signatures protect at a rate better than
    that of hard drive failures (writing on the wrong
    track), software failures, etc.
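
The estimate can be reproduced with a few lines of arithmetic. The per-incident collision probabilities below (2^-31 from the slide, and 2^-64 as an assumed value for a 64 bit signature) and the function name are illustrative.

```python
SECONDS_PER_YEAR = 31_557_600   # 365.25 days of 86,400 seconds

def expected_collisions_per_year(rate_per_second: float,
                                 collision_probability: float) -> float:
    """Expected number of collisions per year at a given comparison rate."""
    return SECONDS_PER_YEAR * rate_per_second * collision_probability

# One calculation per second, bound of 2^-31 per incident as on the slide.
print(expected_collisions_per_year(1, 2.0 ** -31))
# Same rate with an assumed collision probability of 2^-64.
print(expected_collisions_per_year(1, 2.0 ** -64))
```

At one calculation per second the first figure is roughly 0.015 per year, the second roughly 1.7e-12.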

30
Research Questions
  • Property of a signature: Changing up to n symbols
    changes the n-fold signature for sure.
  • It is known that if we change to a different
    vector $\alpha$, e.g. one where the components are all
    primitive elements, we lose the property. Are
    there other $\alpha$ with this property?
  • We can use Broder tabling with different
    irreducible $\varphi$ and then concatenate the Galois
    field footprints. Can we find a condition under
    which the property holds?
  • What properties hold when $\varphi$ is not irreducible?
    It seems statistically fine as long as $\varphi$ has a
    constant coefficient.