Title: Hash Functions for Network Applications II
1Hash Functions for Network Applications (II)
- Yaxuan Qi
- NSLab, RIIT
- Tsinghua University
2Outline
- Concept and Theory (12)
- Hash functions
- Bloom Filters
- Applications (34)
3Basic Idea
4Technique
5False Positive
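n: number of messages; m: number of Bloom filter bits; k: number of hash functions
$p(y \text{ is a false positive}) = p(y \notin X) \cdot p(\text{all } k \text{ bits of } y \text{ are } 1)$
$p(\text{all } k \text{ bits of } y \text{ are } 1)$: for each of y's k bits, $p(\text{bit set by some element of } X) = 1 - p(\text{bit not set by any element of } X)$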
6Math (I)
Two potential assumptions: m is big enough, and kn/m is constant.
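A sketch of the standard derivation under these assumptions, additionally assuming that the k hash functions are independent and uniform over the m bits. Here p denotes the probability that a given bit is still 0 after all n messages are inserted, and f is the false positive probability:

```latex
\begin{align*}
p &= \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} \\
f &= (1 - p)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}
\end{align*}
```

The approximation uses $(1 - 1/m)^{m} \approx e^{-1}$ for large m, which is where the "m big enough" assumption enters.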
7In practice
If the number of 0 bits in the array is
substantially less than expected, then the
probability of a false positive will be higher
than the quantity f that we computed.
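One way to check this is to compare the observed fraction of 0 bits against the expected value e^(-kn/m). The sketch below is illustrative only: it assumes salted SHA-256 as a stand-in for k independent hash functions, and the class name, helper names, and sizes are hypothetical rather than taken from the slides.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter with m bits and k hash functions (illustrative)."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for clarity over compactness

    def _positions(self, item):
        # Assumption: salting SHA-256 with the function index i gives k hashes.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

    def zero_fraction(self):
        return self.bits.count(0) / self.m


def expected_zero_fraction(m, n, k):
    """Expected fraction of 0 bits, p = e^(-kn/m)."""
    return math.exp(-k * n / m)


def false_positive_from_zero_fraction(zero_fraction, k):
    """False positive estimate (1 - zero_fraction)^k."""
    return (1.0 - zero_fraction) ** k


m, n, k = 8000, 1000, 5
bf = BloomFilter(m, k)
for i in range(n):
    bf.add(f"msg-{i}")

observed = bf.zero_fraction()
expected = expected_zero_fraction(m, n, k)
print(f"zero bits: observed {observed:.3f}, expected {expected:.3f}")
print(f"f from observed zeros: {false_positive_from_zero_fraction(observed, k):.4f}")
print(f"f from expected zeros: {false_positive_from_zero_fraction(expected, k):.4f}")
```

If the observed zero fraction comes out well below the expected one, the hash functions are probably not behaving randomly on the input, and the estimate from observed zeros is the more realistic of the two.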
8Optimal Number of Hash Functions
- Given m and n, find the k that minimizes f
- Two competing forces
- Larger k: (from the view of search) more chances to find a 0 bit for an element that is not a match
- Smaller k: (from the view of construction) increases the fraction of 0 bits in the array
9Math (II)
In practice, k must be an integer, and a smaller,
suboptimal k might be preferred since this
reduces the number of hash functions that have to
be computed.
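A minimal sketch of how an integer k might be chosen, assuming the usual approximation f = (1 - e^(-kn/m))^k, whose real-valued minimizer is k = (m/n) ln 2. The function names and the example sizes are hypothetical.

```python
import math


def false_positive_rate(m, n, k):
    """Approximate false positive rate f = (1 - e^(-kn/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_k(m, n):
    """Real-valued minimizer of f for fixed m and n: k = (m/n) ln 2."""
    return (m / n) * math.log(2)


def best_integer_k(m, n):
    """Compare the two integers around the real optimum and keep the better one."""
    k_real = optimal_k(m, n)
    candidates = {max(1, math.floor(k_real)), max(1, math.ceil(k_real))}
    return min(candidates, key=lambda k: false_positive_rate(m, n, k))


m, n = 10_000, 1_000      # 10 bits per element (hypothetical sizes)
k_real = optimal_k(m, n)  # about 6.93
k_int = best_integer_k(m, n)
for k in (4, k_int):
    print(k, round(false_positive_rate(m, n, k), 5))
```

Comparing the rate at a smaller k with the rate at the best integer k shows how much accuracy is given up in exchange for computing fewer hash functions per lookup.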
10Optimization Summary
- Assumption
- We have good hash functions that look random.
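- Given m bits for the filter and n elements, choose the number k of hash functions to minimize false positives
- Let $p = \Pr[\text{bit is still } 0] = (1 - 1/m)^{kn} \approx e^{-kn/m}$
- Let $f = \Pr[\text{false positive}] = (1 - p)^{k} \approx (1 - e^{-kn/m})^{k}$
- As k increases
- more chances to find a 0
- but more 1s in the array.
- Conclusion: f is minimized at $k = (\ln 2)\, m/n$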
11Partial Bloom Filters
- The total number of bits is still m, but the bits are divided equally among the k hash functions.
- Each hash function has its own range of m/k consecutive bits, which makes it easy to parallelize array accesses.
- Though the probability of a false positive is
actually always at least as large with this
division, the difference is small...
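The partitioned layout can be sketched as follows. This is an illustration under stated assumptions: salted SHA-256 stands in for k independent hash functions, k is assumed to divide m, and the class name and sizes are hypothetical.

```python
import hashlib


class PartitionedBloomFilter:
    """Bloom filter whose m bits are split into k slices of m // k bits each."""

    def __init__(self, m, k):
        self.k = k
        self.slice_size = m // k  # assumes k divides m; any remainder bits are unused
        self.slices = [bytearray(self.slice_size) for _ in range(k)]

    def _offsets(self, item):
        # Hash function i only ever touches slice i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield i, int.from_bytes(digest[:8], "big") % self.slice_size

    def add(self, item):
        for i, off in self._offsets(item):
            self.slices[i][off] = 1

    def __contains__(self, item):
        return all(self.slices[i][off] for i, off in self._offsets(item))


pbf = PartitionedBloomFilter(m=8000, k=5)
for i in range(1000):
    pbf.add(f"msg-{i}")
print("msg-7" in pbf)
```

Because each slice is a separate array, the k probes of a lookup never contend for the same memory region. A bit in a slice stays 0 with probability (1 - k/m)^n, which is at most (1 - 1/m)^(kn), so slightly more bits are set than in the unpartitioned filter.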
12Counting Bloom Filters Idea
13Counting Bloom filters Implementation
4 bits per counter is enough...
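A minimal sketch of the counting idea, in which each bit becomes a small counter so that elements can be deleted. It assumes 4-bit saturating counters (stored one per byte here for simplicity) and salted SHA-256 as the hash family; the class and method names are hypothetical.

```python
import hashlib


class CountingBloomFilter:
    """Bloom filter with one small saturating counter per position."""

    MAX_COUNT = 15  # largest value a 4-bit counter can hold

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.counters = bytearray(m)  # one byte per counter; values are capped at 15

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            if self.counters[pos] < self.MAX_COUNT:
                self.counters[pos] += 1

    def remove(self, item):
        # Only valid for items that were added earlier. A saturated counter is
        # left untouched because its true count is no longer known.
        for pos in self._positions(item):
            if 0 < self.counters[pos] < self.MAX_COUNT:
                self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=4096, k=4)
cbf.add("flow-a")
cbf.add("flow-b")
cbf.remove("flow-a")
print("flow-a" in cbf, "flow-b" in cbf)
```

Leaving saturated counters stuck at the maximum trades a few permanently set positions for the guarantee that a deletion never creates a false negative.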
14Compressed Bloom Filters Problem
15Compressed Bloom Filters Motivation
- Insight: a Bloom filter is not just a data structure, it is also a message.
- If the Bloom filter is a message, it is worthwhile to compress it
- This further reduces the traffic spent exchanging URLs
- Compressing bit vectors is easy.
- Arithmetic coding gets close to entropy.
- Can Bloom filters be compressed?
- Bloom filter looks like a random string
16Compression Technique
17Compression Results
- At k = m (ln 2)/n, false positives are maximized with a compressed Bloom filter.
- The best case without compression is the worst case with compression: compression always helps.
- Side benefit
- Use fewer hash functions with compression
- Possible speedup (depending on whether the bottleneck is memory or the link).
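The trade-off can be illustrated numerically. The sketch below assumes an ideal compressor that reaches the entropy bound m·H(p), where p = e^(-kn/m) is the fraction of 0 bits; the function names and the two parameter sets are hypothetical.

```python
import math


def zero_probability(m, n, k):
    """Probability that a bit is 0, p = e^(-kn/m)."""
    return math.exp(-k * n / m)


def binary_entropy(p):
    """Entropy in bits of a biased coin with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def compressed_bits_per_element(m, n, k):
    """Entropy bound on the transmitted size, in bits per element."""
    return (m / n) * binary_entropy(zero_probability(m, n, k))


def false_positive_rate(m, n, k):
    return (1 - zero_probability(m, n, k)) ** k


n = 1000
for m, k in ((8 * n, 6), (14 * n, 2)):
    size = compressed_bits_per_element(m, n, k)
    f = false_positive_rate(m, n, k)
    print(f"m/n={m // n} k={k} compressed bits/element={size:.3f} f={f:.4f}")
```

Under these assumptions, the larger and sparser array with only two hash functions compresses to slightly under 8 bits per element and still has a lower false positive rate than the 8-bits-per-element filter with six hash functions.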
18Bloom Filter vs. Perfect Hash
- If the set X of n elements is fixed, one can find a perfect hash function for X
- plus a fully uniform random hash function
- Then build a table with n entries of j bits each
- Each element of X maps to its own entry, which stores a j-bit hash value, so the false positive probability is exactly $(1/2)^{j}$.
- This matches the lower bound for Bloom filters
- HOWEVER
- any change in the set X would require an expensive recomputation of the perfect hash function.
19Bloom Filter Tricks
- Union (combining two BFs)
- Both must use the same m and the same hash functions
- Just OR the two bit vectors of the original Bloom filters
- Shrinking (halving a big BF)
- Just OR the first and second halves together
- When hashing for a lookup, the highest-order bit can be masked
- Intersection (estimation)
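The union and shrinking tricks can be sketched as below. This assumes bit vectors stored as plain Python lists and salted SHA-256 as the shared hash family; all function names are hypothetical. Reducing the index modulo the folded length is equivalent to masking the highest-order bit when m is a power of two.

```python
import hashlib


def positions(item, m, k):
    """k bit indices in [0, m) for item, using an assumed stand-in hash family."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % m
        for i in range(k)
    ]


def build(items, m, k):
    bits = [0] * m
    for item in items:
        for pos in positions(item, m, k):
            bits[pos] = 1
    return bits


def contains(bits, item, k, m_hash):
    """Query a filter hashed over m_hash bits that may have been folded to len(bits)."""
    return all(bits[pos % len(bits)] for pos in positions(item, m_hash, k))


def union(a, b):
    """OR two filters that share the same m and the same hash functions."""
    return [x | y for x, y in zip(a, b)]


def halve(bits):
    """Fold the second half onto the first half with OR."""
    half = len(bits) // 2
    return [x | y for x, y in zip(bits[:half], bits[half:])]


m, k = 1024, 4
a = build(["a.example", "b.example"], m, k)
b = build(["c.example"], m, k)
u = union(a, b)
small = halve(u)
print(len(small), contains(small, "c.example", k, m))
```

Halving keeps every member queryable but roughly doubles the density of 1 bits, so the false positive rate rises accordingly.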
20Applications
21Questions?