Title: Fast Fingerprint Calculations
1Fast Fingerprint Calculations
2Fingerprints
- Definition: A fingerprint (a.k.a. signature) of
an object Ob is a small string f(Ob) with the
following properties:
- 1. f is a function of Ob. In particular, if two
objects are equal, then so are their
fingerprints.
- 2. Prob(f(Ob1) = f(Ob2)) << 1 for random
objects Ob1 ≠ Ob2.
3Usage
- Fingerprints are used to
- Identify Objects
- Compare Objects Remotely
- Test an Object for Changes
- Since fingerprints are smaller, they are very
useful as stand-ins for remote objects.
4Usage
- Object identification example
- Software Cloning: During maintenance, the need
arises for a module very similar in character to
one that already exists. Because of time
pressure, the existing module is simply copied,
all names are systematically changed, and the
copy is then modified to serve the new needs.
- Maintenance Problem: If a bug is detected in a
clone or in the original, it probably also resides
in the original and / or in other clones. Besides,
clones arise because of time pressure, but the
short-cut ends up costing more in the long run.
Thus, it is better to identify clones during
maintenance.
- Clone Identification: Systematically suppress
names, then test whether the function code is
identical.
- Johnson, J. H. Substring Matching for Software
Clone Detection and Change Tracking.
International Conference on Software Maintenance,
Victoria, BC, 1994, p. 120-126.
5Usage
- Similarity Testing for Files
- n-gram: Contiguous substring of n characters in a
file.
- File Similarity: Count the number of occurrences
of a particular n-gram.
- Use the fingerprint of an n-gram as a hash value.
Count the fingerprints instead of the n-grams.
- Cohen, J. D. Recursive Hashing Functions for
n-grams. ACM Trans. Information Systems, p. 291-320.
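The fingerprint counting described above can be sketched as follows. This is a minimal illustration, not Cohen's construction: the multiplier `BASE`, the modulus `MOD`, the function names, and the overlap-based similarity measure are all assumptions chosen for the example.

```python
from collections import Counter

BASE = 257            # illustrative multiplier (assumption)
MOD = (1 << 61) - 1   # illustrative prime modulus (assumption)

def ngram_fingerprints(data: bytes, n: int) -> Counter:
    """Count the fingerprints of all n-grams of data with a rolling hash."""
    counts = Counter()
    if n <= 0 or len(data) < n:
        return counts
    top = pow(BASE, n - 1, MOD)   # weight of the symbol leaving the window
    fp = 0
    for byte in data[:n]:
        fp = (fp * BASE + byte) % MOD
    counts[fp] += 1
    for i in range(n, len(data)):
        fp = (fp - data[i - n] * top) % MOD   # drop the oldest symbol
        fp = (fp * BASE + data[i]) % MOD      # append the newest symbol
        counts[fp] += 1
    return counts

def similarity(a: bytes, b: bytes, n: int = 4) -> float:
    """Fraction of n-gram fingerprints shared by a and b (hypothetical measure)."""
    ca, cb = ngram_fingerprints(a, n), ngram_fingerprints(b, n)
    shared = sum((ca & cb).values())          # multiset intersection
    total = max(sum(ca.values()), sum(cb.values()))
    return shared / total if total else 1.0
```

Each new n-gram costs a constant number of operations, because the fingerprint of the next window is derived from the previous one instead of being recomputed.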
6Usage
- Remote String Searches: Find all occurrences of a
given string in files on remote servers.
- Instead of sending the string to all servers,
only a fingerprint and the length l of the string
are sent. The servers generate running
fingerprints of l-grams and compare them with the
string's fingerprint.
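A minimal sketch of the server side of such a search, assuming a polynomial fingerprint modulo a prime. `BASE`, `MOD`, and both function names are hypothetical. A reported offset is only a probable match, since different l-grams can share a fingerprint.

```python
BASE = 257            # illustrative multiplier (assumption)
MOD = (1 << 61) - 1   # illustrative prime modulus (assumption)

def fingerprint(s: bytes) -> int:
    """Fingerprint computed by the client for the search string."""
    fp = 0
    for byte in s:
        fp = (fp * BASE + byte) % MOD
    return fp

def server_search(text: bytes, target_fp: int, length: int) -> list:
    """Return start offsets of all l-grams whose fingerprint equals target_fp."""
    if length <= 0 or len(text) < length:
        return []
    top = pow(BASE, length - 1, MOD)
    fp = fingerprint(text[:length])
    hits = [0] if fp == target_fp else []
    for i in range(length, len(text)):
        # Running fingerprint: remove the oldest symbol, append the newest.
        fp = ((fp - text[i - length] * top) * BASE + text[i]) % MOD
        if fp == target_fp:
            hits.append(i - length + 1)
    return hits
```

For example, `server_search(b"abracadabra", fingerprint(b"abra"), 4)` scans the text once and never sees the search string itself.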
7Usage
- Remote File Comparison
- Original Problem: How to compare pages of remote
replicas of a database.
- Solution: Calculate fingerprints (signatures)
of each page. Calculate a super-signature from
the page signatures. If the super-signatures
coincide, conclude that the replicas are in sync.
If not, run a smart protocol to find the
non-matching signatures.
- Abdel-Ghaffar, K. A. S., El-Abbadi, A. Efficient
Detection of Corrupted Pages in a Replicated File.
ACM Symp. Distributed Computing, 1993, p. 219-227.
- Barbara, D., Garcia-Molina, H., Feijoo, B.
Exploiting Symmetries for Low-Cost Comparison of
File Copies. Proc. Int. Conf. Distributed
Computing Systems, 1988, p. 471-479.
- Barbara, D., Lipton, R. J. A Class of Randomized
Strategies for Low-Cost Comparison of File Copies.
IEEE Trans. Parallel and Distributed Systems,
vol. 2(2), 1991, p. 160-170.
- Fuchs, W., Wu, K. L., Abraham, J. A. Low-Cost
Comparison and Diagnosis of Large, Remotely
Located Files. Proc. Symp. Reliability Distributed
Software and Database Systems, 1986, p. 67-73.
- Schwarz, Th., Bowdidge, B., Burkhard, W. Low Cost
Comparison of Files. Int. Conf. on Distr. Comp.
Syst. (ICDCS 90), p. 196-201.
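The page and super-signature idea can be sketched as below. This is only a naive illustration: the fingerprint function, `PAGE_SIZE`, and all names are assumptions, and `differing_pages` simply compares every page signature rather than implementing one of the smart protocols from the papers cited above.

```python
BASE = 257            # illustrative multiplier (assumption)
MOD = (1 << 61) - 1   # illustrative prime modulus (assumption)
PAGE_SIZE = 4096      # illustrative page size (assumption)

def fingerprint(data: bytes) -> int:
    fp = 0
    for byte in data:
        fp = (fp * BASE + byte) % MOD
    return fp

def page_signatures(replica: bytes, page_size: int = PAGE_SIZE) -> list:
    """One signature per page of the replica."""
    return [fingerprint(replica[i:i + page_size])
            for i in range(0, len(replica), page_size)]

def super_signature(sigs: list) -> int:
    """Signature of the sequence of page signatures."""
    fp = 0
    for s in sigs:
        fp = (fp * BASE + s) % MOD
    return fp

def differing_pages(sigs_a: list, sigs_b: list) -> list:
    """Naive fallback: list the page numbers whose signatures differ."""
    return [i for i, (x, y) in enumerate(zip(sigs_a, sigs_b)) if x != y]
```

In the common case where the replicas agree, only the two super-signatures need to cross the network.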
8Usage
- Secure Signatures
- To identify an object, maintain its signature.
An adversary cannot alter the object in a
computationally feasible way without changing
the signature.
- Cryptographically secure signatures: SHA-1, MD5
- Used for authentication, e.g. in computer
forensics, digital signatures, etc.
9SHA-1
- 20B long.
- Designed for Fast Calculation
- Considered unbreakable
- Used increasingly in applications where
cryptographic security is not needed.
- Radia Perlman's Law of Cryptography:
- If a lot of smart people have spent lots of time
trying to break a scheme and did not succeed,
then it cannot be done.
10Useful Properties of Fingerprints
- Fast Calculation.
- Low collision rate.
- If the fingerprints have length l, then the
probability of a collision should be 2^-l.
- If there are small changes, then fingerprints
should change.
- Cryptographically unbreakable.
- Given a signature, one cannot construct an object
with this signature.
11Useful Properties of Fingerprints
- Updatable
- If the object changes, then we can update the
signature from the old signature and the changes.
- Concatenation of Objects
- If an object is made up of several objects, then
we can calculate the signature of the
super-object from those of its constituents,
possibly in a way that allows us to quickly
pinpoint differing component objects.
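As an illustration of the updatable property, here is a minimal sketch that assumes a polynomial signature of the Karp-Rabin kind introduced on the following slides. The modulus `P`, the element `ALPHA`, and the function names are hypothetical.

```python
P = (1 << 31) - 1     # illustrative prime modulus (assumption)
ALPHA = 1_000_003     # illustrative ring element (assumption)

def sig(symbols) -> int:
    """Polynomial signature evaluated with Horner's rule."""
    s = 0
    for a in symbols:
        s = (s * ALPHA + a) % P
    return s

def update(old_sig: int, length: int, index: int,
           old_symbol: int, new_symbol: int) -> int:
    """New signature after replacing one symbol, without rereading the object."""
    weight = pow(ALPHA, length - 1 - index, P)   # weight of position index
    return (old_sig + (new_symbol - old_symbol) * weight) % P
```

The update needs only the old signature, the position, and the old and new symbol values, so its cost does not depend on the size of the object (apart from the exponentiation).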
12Karp Rabin Style Fingerprints
Here, the calculation takes place in a ring R
with multiplication and addition.
Karp, R. M., Rabin, M. O. Efficient randomized
pattern-matching algorithms. IBM Journal of
Research and Development, Vol. 31, No. 2, March
1987.
13Karp Rabin Style Fingerprints
- Calculation time linear in N
- (1 multiplication and 1 addition per symbol)
14Karp Rabin Style Fingerprints
- Easy to calculate consecutive n-grams
- Easy to calculate signatures of concatenations
- sig_α(a_0, a_1, ..., a_{l-1}, a_l, a_{l+1}, ..., a_{l+n-1})
- = α^n · sig_α(a_0, a_1, ..., a_{l-1}) + sig_α(a_l, a_{l+1}, ..., a_{l+n-1})
- Possibly use a table of values of α^n
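The concatenation rule can be checked with a short sketch. It assumes the signature is evaluated by Horner's rule (one multiplication and one addition per symbol), which is consistent with the rule above. The ring here is the integers modulo a prime `P`, and `P`, `ALPHA`, and the function names are illustrative assumptions.

```python
P = (1 << 31) - 1     # illustrative prime modulus (assumption)
ALPHA = 1_000_003     # illustrative choice of the ring element (assumption)

def sig(symbols, alpha: int = ALPHA, p: int = P) -> int:
    """Horner evaluation: one multiplication and one addition per symbol."""
    s = 0
    for a in symbols:
        s = (s * alpha + a) % p
    return s

def sig_concat(sig_left: int, sig_right: int, n_right: int,
               alpha: int = ALPHA, p: int = P) -> int:
    """Signature of left + right from the two part signatures.

    n_right is the number of symbols in the right part.
    """
    return (pow(alpha, n_right, p) * sig_left + sig_right) % p
```

With `left = [3, 1, 4, 1, 5]` and `right = [9, 2, 6]`, `sig_concat(sig(left), sig(right), len(right))` equals `sig(left + right)`. A precomputed table of powers of `alpha` would replace the call to `pow`.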
15Karp Rabin Style Fingerprints
- Cryptographically not secure
- A cryptographically secure signature is a one-way
function. In order to be efficiently calculable,
it needs to process large portions of a string at
a time. Thus, cryptographic security is not a
desirable property in general.
16Choice of Ring R
- 1. Integers modulo a prime p
- The ring is then a well-understood field.
- But reducing modulo p is a costly operation.
- 2. Integers modulo 2^f
- Powers of two are zero divisors,
- e.g. 2 · 2^(f-1) = 0.
- This excludes powers of 2 as α.
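A small sketch of why a power of 2 is a poor multiplier in the integers modulo 2^f: after f further symbols, the contribution of an old symbol has been multiplied by 2^f and has vanished. The width `F = 8` and the Horner-style `sig` are assumptions for illustration.

```python
F = 8                 # illustrative signature width in bits (assumption)
MOD = 1 << F          # ring of integers modulo 2^F

def sig(symbols, alpha: int) -> int:
    """Horner-style signature in the integers modulo 2^F."""
    s = 0
    for a in symbols:
        s = (s * alpha + a) % MOD
    return s

# With alpha = 2, a symbol followed by F further symbols no longer
# influences the signature. An odd alpha does not have this problem.
lost = sig([1] + [0] * F, alpha=2) == sig([0] * (F + 1), alpha=2)
kept = sig([1] + [0] * F, alpha=3) != sig([0] * (F + 1), alpha=3)
```

Both `lost` and `kept` evaluate to `True` here.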
17Choice of Ring
- 3. Reduction by a polynomial.
- The space of unsigned integers 0, ..., 2^32-1 can
be naturally identified with the space of all
polynomials in k[t] over k = {0,1} with degree up
to 31.
- Select a polynomial φ(t) with degree f up to 31
and consider the ring R = k[t]/(φ(t)). Elements
in this ring are naturally identified with all
unsigned integers 0, ..., 2^f-1.
- Addition of these polynomials corresponds to the
fast XOR. Multiplication is more difficult, but
multiplication by t is a left shift followed by
conditionally XORing with φ.
- This is the most promising construction.
18R = k[t]/(φ(t)) Example
- Set φ = t^4 + t + 1, that is, φ = 10011.
- Elements of R are all bit strings of length 4.
- To add 0101 and 1100, just XOR: 0101 ⊕ 1100 = 1001.
- To multiply with t = 0010, left shift and XOR
conditionally with φ.
- To multiply 0010 with 0010, left-shift the first
and obtain 0,0100. The leading coefficient is
zero and is dropped. The result is 0100.
- To multiply 1100 with 0010, left shift to obtain
1,1000. The leading coefficient is one, so XOR
with φ = 10011 to obtain the result 1011.
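The arithmetic of this example can be sketched in a few lines. The function names are hypothetical; the polynomial 10011 and the 4-bit elements are those of the example.

```python
PHI = 0b10011   # the reduction polynomial from the example, 10011
F = 4           # elements of R are bit strings of length 4

def add(a: int, b: int) -> int:
    """Addition in R is a bitwise XOR."""
    return a ^ b

def mul_by_t(a: int, phi: int = PHI, f: int = F) -> int:
    """Multiply by t: left shift, then XOR with phi if the leading bit is one."""
    a <<= 1
    if a >> f:          # leading coefficient (bit f) is one
        a ^= phi
    return a
```

Here `add(0b0101, 0b1100)` gives `0b1001`, `mul_by_t(0b0010)` gives `0b0100`, and `mul_by_t(0b1100)` gives `0b1011`, matching the three cases above.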
19Galois Fields
- If φ is irreducible, then R is a Galois field.
- If we use a Galois field, we can concatenate
fingerprints to obtain a signature.
20Galois Fields
- If we use
- (α, α^2, α^3, ..., α^n)
- then the signature consists of the parity symbols
of a generalized, non-systematic Reed-Solomon code.
- Since these codes are MDS, the signature will
change for up to n changes in the object.
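The change-detection property can be checked exhaustively on a small case. The sketch below assumes the 16-element field with reduction polynomial 10011, takes α = t = 0010 (a primitive element of that field), and evaluates each component by Horner's rule; all function names are hypothetical. With these choices the guarantee applies to objects of at most 15 symbols.

```python
from itertools import combinations, product

PHI, F = 0b10011, 4     # 16-element field with reduction polynomial 10011

def gf_mul(a: int, b: int) -> int:
    """Multiply two field elements by shift-and-add with reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a >> F:
            a ^= PHI
        b >>= 1
    return result

def fingerprint(symbols, beta: int) -> int:
    """Horner evaluation of the symbols at the field element beta."""
    fp = 0
    for s in symbols:
        fp = gf_mul(fp, beta) ^ s
    return fp

def signature(symbols, n: int, alpha: int = 0b0010) -> tuple:
    """n-fold signature: fingerprints at alpha, alpha^2, ..., alpha^n."""
    sig, beta = [], 1
    for _ in range(n):
        beta = gf_mul(beta, alpha)
        sig.append(fingerprint(symbols, beta))
    return tuple(sig)

def detects_all_changes(symbols, n: int) -> bool:
    """Try every change of up to n symbols and check the signature differs."""
    reference = signature(symbols, n)
    for positions in combinations(range(len(symbols)), n):
        for deltas in product(range(1 << F), repeat=n):
            if not any(deltas):
                continue                      # nothing changed
            changed = list(symbols)
            for pos, delta in zip(positions, deltas):
                changed[pos] ^= delta
            if signature(changed, n) == reference:
                return False
    return True
```

For instance, `detects_all_changes([8, 12, 10, 7, 3, 5], 2)` tries every change of one or two symbols and finds that each one alters the 2-fold signature.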
21Galois Field Signatures
- To calculate a Galois field footprint, we only
need per symbol
- One XOR
- One left-shift
- One test whether the leading coefficient is now
one.
- Conditionally one XOR.
22Speeding up Galois field footprints
- However, we do not have to execute the reduction
step each time.
- Instead, left shift and XOR b times (Broder's
idea). Then use a table look-up to reduce the
overhang.
- Broder, A. Some applications of Rabin's
fingerprinting method. In Capocelli, De Santis,
and Vaccaro (eds.), Sequences II: Methods in
Communications, Security, and Computer Science,
pages 143-152. Springer-Verlag, 1993.
23Speeding up Galois field footprints
- String is (1000, 1100, 1010, 0111, 0011, ...)
- Choose φ = 10011. This is an irreducible
polynomial.
- Step 0: 1000
- Step 1: 1,0000 ⊕ 1100 = 1,1100
- Step 2: 11,1000 ⊕ 1010 = 11,0010
- Step 3: 110,0100 ⊕ 0111 = 110,0011
- Step 4: 1100,0110 ⊕ 0011 = 1100,0101
- Now use table look-up to calculate
- 0101 ⊕ table[1100] = 0101 ⊕ 0111 = 0010.
- 8 elementary ops + 1 shift right + 1 table
look-up + 1 XOR.
24Speeding up Galois field footprints
- String is (1000, 1100, 1010, 0111, 0011, ...)
- Step 0: 1000
- Step 1: 1,0000 ⊕ 1100 = 1,1100; 1,1100 ⊕ φ =
1,1100 ⊕ 1,0011 = 1111.
- Step 2: 1,1110 ⊕ 1010 = 1,0100; reduced: 0111.
- Step 3: 0,1110 ⊕ 0111 = 1001.
- Step 4: 1,0010 ⊕ 0011 = 1,0001; reduced: 0010.
- 8 elementary ops + 4 condition evaluations + 2
elementary ops on average.
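The step-by-step calculation with a reduction after every symbol can be sketched as follows. The function name is hypothetical, and the symbols are assumed to be 4-bit values, as in the example string.

```python
PHI = 0b10011   # reduction polynomial 10011
F = 4           # symbol and footprint width in bits

def gf_footprint_steps(symbols) -> list:
    """Return the reduced footprint after each symbol."""
    fp = 0
    steps = []
    for s in symbols:
        fp = (fp << 1) ^ s      # one left shift, one XOR
        if fp >> F:             # test the leading coefficient
            fp ^= PHI           # conditional XOR
        steps.append(fp)
    return steps
```

For the string `[0b1000, 0b1100, 0b1010, 0b0111, 0b0011]` the returned values are 1000, 1111, 0111, 1001, 0010, the same as Steps 0 to 4 above.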
25Speeding up Galois field footprints
- How do we calculate the table entries?
- Systematically reduce by t-multiples of φ =
10011.
- To calculate the table entry for 12 = 1100, reduce
1100,0000 in four steps.
- 1100,0000 ⊕ t^3·φ
- = 1100,0000 ⊕ 1001,1000
- = 0101,1000
- Now reduce with t^2·φ
- 0101,1000 ⊕ t^2·φ
- = 0101,1000 ⊕ 0100,1100
- = 0001,0100
- No step with t·φ since the corresponding
coefficient is zero.
- Reduce with φ
- 0001,0100 ⊕ 0001,0011 = 0000,0111 = 0111.
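The table construction and the delayed reduction can be sketched together. This is a minimal illustration with the 4-bit example values, not Broder's original formulation: `B`, the function names, and the choice to fold after every `B` shifts are assumptions, and symbols are assumed to be 4-bit values.

```python
PHI = 0b10011   # reduction polynomial 10011
F = 4           # degree of the reduction polynomial
B = 4           # unreduced shifts between look-ups; the table has 2**B entries
MASK = (1 << F) - 1

def reduce_overhang(high: int) -> int:
    """Reduce high * t^F by cancelling with t-multiples of the polynomial."""
    value = high << F
    for shift in range(B - 1, -1, -1):
        if (value >> (F + shift)) & 1:   # coefficient still set?
            value ^= PHI << shift
    return value

TABLE = [reduce_overhang(h) for h in range(1 << B)]

def fold(acc: int) -> int:
    """One shift right, one table look-up, one XOR."""
    return (acc & MASK) ^ TABLE[acc >> F]

def footprint_delayed(symbols) -> int:
    """Shift and XOR without reducing; fold the overhang every B shifts."""
    it = iter(symbols)
    acc = next(it, 0)
    shifts = 0
    for s in it:
        if shifts == B:
            acc, shifts = fold(acc), 0
        acc = (acc << 1) ^ s
        shifts += 1
    return fold(acc)
```

`TABLE[0b1100]` comes out as `0b0111`, as in the calculation above, and `footprint_delayed([0b1000, 0b1100, 0b1010, 0b0111, 0b0011])` returns `0b0010`, the same value obtained by reducing after every symbol.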
26Speeding up Galois field footprints
- Optimal table size needs to be determined
experimentally.
- Table needs to fit in cache, so it cannot be much
bigger than 2^16.
- If the table is too small, then the look-up cost
does not amortize well.
27Galois field signatures
- Galois field signatures are concatenations of
Galois field fingerprints.
- Broder tables work for multiplication with t^2,
t^3, t^4 as well, but less efficiently, since now
we shift two, three, or four times so that we
need to use table look-up more often.
28Performance Results
- 1.772 msec per MB for 16 bit parity
- 2.012 msec per MB for 16 bit α^1 power.
- 3.114 msec per MB for 16 bit α^2 power.
- Results for a 1.99 GHz Pentium 4 with 512 MB memory.
29How Long should Signatures be?
- Key fact: There are 31,557,600 seconds in a year.
- At x calculations per second, there will be
31,557,600x incidents per year, which will lead to
a collision < 2^-31 · 31,557,600x ≈ 0.015x times
per year for 32 bit signatures.
- So, the minimum length should be 64 bits. We can
achieve this easily.
- With larger signatures, collisions become rarer
than hard drive failures (writing on the wrong
track), software failures, etc.
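The arithmetic behind this estimate can be reproduced with a short sketch. The function name is hypothetical, and the exponent 63 used for 64 bit signatures is an assumption made by analogy with the 2^-31 bound quoted for 32 bit signatures.

```python
SECONDS_PER_YEAR = 31_557_600   # 365.25 days of 86,400 seconds

def collisions_per_year(x: float, bound_exponent: int) -> float:
    """Upper bound on collisions per year at x calculations per second,
    when each calculation collides with probability below 2**-bound_exponent."""
    return SECONDS_PER_YEAR * x * 2.0 ** -bound_exponent

rate_32 = collisions_per_year(1, 31)   # about 0.0147, i.e. roughly 0.015x
rate_64 = collisions_per_year(1, 63)   # about 3.4e-12 under the assumed bound
```

At one calculation per second, a 32 bit signature gives roughly one expected collision every 68 years, which shrinks to hours at realistic rates of thousands or millions of calculations per second.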
30Research Questions
- Property of a signature: Changing up to n symbols
changes the n-fold signature for sure.
- It is known that if we change to a different
vector α, e.g. one where the components are all
primitive elements, we lose the property. Are
there other α with this property?
- We can use Broder tabling with different
irreducible φ and then concatenate the Galois
field footprints. Can we find a condition under
which the property holds?
- What properties hold when φ is not irreducible?
It seems statistically fine as long as φ has a
constant coefficient.