Title: The Bloomier Filter
1 The Bloomier Filter
- Bernard Chazelle, Princeton University, NEC Lab.
- Joe Kilian, NEC Lab.
- Ronitt Rubinfeld, NEC Lab.
- Ayellet Tal, Technion, Princeton University
Presented by Lilach Bien
2 Overview
- The Problem
- Definitions
- The algorithm
- Analysis
- Lower Bounds
- Deterministic algorithm
- Mutable version of the problem
3 The Problem
4 The Problem: Bloom Filters
- A large set of data D, with a small subset S.
- We want to query whether an item d belongs to S.
- No false negatives: if d belongs to S, we will always recognize it.
- A small false positive rate: we may say d belongs to S although it does not.
- Allowing a small false positive rate makes it possible to build a compact data structure.
5 The Problem: Bloomier Filters
- Bloom filters: membership queries on a small subset of D.
- Bloomier filters: computing arbitrary functions defined only on a small subset S of D.
- The function is computed correctly for all members of S (no false negatives).
- For items not in S, we almost always return a special value ⊥.
- Dynamic updates to the function are allowed, as long as S does not change.
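6 Example
- D = {1,...,100}, S = {1,2,3}, R = {1,2}
- f(1) = 1, f(2) = 1, f(3) = 2
- Queries: 1 2 87 55 40
- Answers: 1 1 ⊥ ⊥ 1 (the answer for 40 is a false positive)
- Update: f(2) = 2
- Queries: 66 2 3
- Answers: ⊥ 2 2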
7 Bloomier Filters: Uses
- Building a meta-database for a union of databases.
- It keeps track of which database contains information about each entry.
- Maintaining directories when the data or code is kept in multiple locations.
8 Definitions
9 Formal Definitions
- f is a function with domain D = {0,...,N-1}.
- The range is R = {⊥,1,...,2^r-1}.
- S = {t_1,...,t_n} is a subset of D of size n.
- f(t_i) = v_i, v_i ∈ R
- f(x) = ⊥ for x outside of S
- f can be specified by the assignment
- A = {(t_1,v_1),...,(t_n,v_n)}
10 Formal Definitions (Cont.)
- Bloomier filters allow querying f at any point of S, always correctly.
- For a random x ∈ D\S, the query returns f(x) = ⊥ with probability 1-ε.
- The input to the algorithm is A and ε.
11 Supported Operations
- CREATE(A)
- Given an assignment A = {(t_1,v_1),...,(t_n,v_n)}, we initialize the data structure Tables.
- SET_VALUE(t, v, Tables)
- For t ∈ D and v ∈ R, we associate the value v with the domain element t in Tables.
- It is required that t belongs to S.
12 Supported Operations (Cont.)
- LOOKUP(t, Tables)
- For t ∈ S we return the last value v associated with t.
- For all but a fraction ε of D\S we return ⊥.
- For the remaining elements of D\S we return an arbitrary element of R.
13 The Idea
- We encode the values in R as elements of the additive group Q = {0,1}^q.
- Addition in Q is bitwise XOR.
- Any x ∈ R is mapped to Q by its q-bit binary expansion ENCODE(x).
- For y ∈ Q we define DECODE(y) as
- the corresponding number in R, if y < |R|
- ⊥ otherwise
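The encoding on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: elements of R are identified with the integers 0,...,|R|-1, a q-bit vector is held as a Python int, and `BOTTOM`, `encode`, `decode` and `add` are hypothetical names that do not come from the slides.

```python
BOTTOM = None  # assumption: None stands for the special value "undefined"


def encode(x: int, q: int) -> int:
    """q-bit binary expansion of x, held as an int whose bits are the vector."""
    if not 0 <= x < (1 << q):
        raise ValueError("x does not fit in q bits")
    return x


def decode(y: int, range_size: int):
    """Return the element of R named by y if y < |R|, otherwise BOTTOM."""
    return y if y < range_size else BOTTOM


def add(x: int, y: int) -> int:
    """Group addition in Q = {0,1}^q is bitwise XOR."""
    return x ^ y
```

Masking a value with a q-bit word M and then adding M again recovers the value, because every element of Q is its own inverse under XOR.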
14 The Idea (Cont.)
- We will save the function values for elements of S in a table.
- We will use a hash function to compute a random q-bit masking value M for every x in D.
- To look up the value of x, we will access a set of places in the table and calculate a q-bit number a.
- We will return M XOR a.
15 The Idea (Cont.)
- If t is in S, we will build the table so that
- a XOR M = f(t).
- Otherwise, since M is random, we will get a random q-bit number y = a XOR M.
- Proof: for the i-th bit of y,
- suppose a_i = 0 (without loss of generality).
- We get Pr[y_i = 0] = Pr[M_i = 0] = 1/2.
16 The Idea (Cont.)
- Since y is random, for large enough q, DECODE(y) will return ⊥ with high probability.
- If the table stores elements of R directly (so y is decoded as an element of R), DECODE(y) will fail to return ⊥ with probability |R|/2^q.
- We can do better.
17 Using 2 Tables
- We have a table of size m, and a hash function HASH: D → {1,...,m}^k.
- If HASH(t) = (h_1,...,h_k), we say that {h_1,...,h_k} is the neighborhood of t, N(t).
- For large enough m and k, we can choose for each t ∈ S an element τ(t) from N(t) such that
- for each t' ∈ S, t' ≠ t, it holds that τ(t') ≠ τ(t).
- If τ(t) = h_i we use ι(t) to denote i.
18 Using 2 Tables (Cont.)
- We will use 2 tables.
- The first table will store values in {⊥,1,...,k}, encoded as values in Q.
- It will return ι(t) for t in S, and return ⊥ for most of the other items.
- The second table will store values in R.
- For each t in S, the value f(t) will be stored in place τ(t).
19 Using 2 Tables (Cont.)
- If x is in D\S, then with probability k/2^q the first table will not return ⊥.
- So with probability k/2^q we will access the second table and return garbage.
- Now we can also change function values if we want.
- We use the first table to check which place in the second table stores the value we want to change.
- We change the value in the second table.
20 The Algorithm
21 The First Table
- Reminder:
- We want to use the table to compute a value a for each item t in D.
- For items in S, a XOR M will give us the encoded ι(t).
- When we access the first table with an element t, we know N(t) = {h_1,...,h_k} and M.
- We will compute a = Table[h_1] XOR ... XOR Table[h_k].
- We want to set the values at the indices of N(t) so that
- a XOR M will give us the encoded ι(t).
22 Order Respecting Matching
- Let S be a set with a neighborhood N(t) defined for each t ∈ S.
- Let Π be a complete ordering on the elements of S.
- A matching τ respects (S, Π, N) if
- for all t ∈ S, τ(t) ∈ N(t);
- if t_i >_Π t_j then τ(t_i) ∉ N(t_j).
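The two conditions can be checked mechanically. The sketch below is an illustration only; the function name `respects` and the representation (a list for the ordering, dictionaries for the matching and the neighborhoods) are assumptions, not part of the slides.

```python
def respects(order, tau, nbhd):
    """Check that the matching tau respects (S, order, nbhd).

    order: elements of S listed from smallest to largest under the ordering.
    tau:   dict mapping each element to its matched location.
    nbhd:  dict mapping each element to the tuple of locations N(t).
    """
    for i, t in enumerate(order):
        # Condition 1: tau(t) lies in the neighborhood of t.
        if tau[t] not in nbhd[t]:
            return False
        # Condition 2: tau(t) avoids the neighborhood of every smaller element.
        for smaller in order[:i]:
            if tau[t] in nbhd[smaller]:
                return False
    return True
```

For example, with N(a) = (1, 2) and N(b) = (2, 3), matching a to 1 and b to 2 respects the ordering b, a but not the ordering a, b, because location 2 lies in N(a).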
23 Order Respecting Matching (Cont.)
- If, for N defined by HASH, a matching τ respects (S, Π, N), it has all the properties we wanted:
- For all t ∈ S, τ(t) ∈ N(t).
- For all t, t' ∈ S with t ≠ t', τ(t) ≠ τ(t').
- We may build the first table incrementally so that, for a = Table[h_1] XOR ... XOR Table[h_k],
- a XOR M will give us the encoded ι(t).
24 Building The First Table
- Input:
- Order Π
- Neighborhood N(t) defined by HASH
- Order respecting matching τ
- For t = Π[1],...,Π[n] we set Table[τ(t)] so that
- M XOR Table[h_1] XOR ... XOR Table[h_k] encodes ι(t).
- Since τ is order respecting, we cannot affect any value already set for t' <_Π t.
25 Finding A Good Ordering And Matching
- We get S and HASH, and compute Π and τ so that τ is order respecting.
- A location h ∈ {1,...,m} is a singleton for S if h ∈ N(t) for exactly one t ∈ S.
- TWEAK(t,S,HASH) is the smallest value j such that h_j is a singleton for S, where N(t) = (h_1,...,h_k).
- TWEAK(t,S,HASH) = ⊥ if no such j exists.
- If TWEAK(t,S,HASH) is defined we may set
- ι(t) = TWEAK(t,S,HASH), that is, τ(t) = h_ι(t). This is an easy match.
26 Finding A Good Ordering And Matching (Cont.)
- If t is an easy match, τ(t) does not collide with the neighborhood of any other t' ∈ S.
- E is the subset of S with easy matches.
- H = S\E.
- We recursively find (Π',τ') for H.
- We extend (Π',τ') to (Π,τ):
- We first put the ordered elements of H, and then the elements of E.
- τ is the union of the matchings for H and E.
27 FIND_MATCH
- FIND_MATCH(HASH, S)[m, k]: find (Π, τ) for S, HASH
- 1. E ← ∅; Π ← ∅
  - For t_i ∈ S
    - If TWEAK(t_i, S, HASH) is defined
      - τ_i ← TWEAK(t_i, S, HASH) (the index of the matched location within N(t_i))
      - E ← E + t_i
  - If E = ∅ Return (failure)
- 2. H ← S \ E
  - If H = ∅, take (Π', τ') to be empty; otherwise recursively compute (Π', τ') ← FIND_MATCH(HASH, H)[m, k]
  - If FIND_MATCH(HASH, H)[m, k] = failure Return (failure)
- 3. Π ← Π'
  - For t_i ∈ E
    - Add t_i to the end of Π (i.e., make t_i the largest element in Π thus far)
  - Return (Π, τ = {τ_1,...,τ_n})
  - (where τ_i is determined for t_i ∈ E in Step 1, and for t_i ∈ H (via τ') in Step 2.)
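The recursion above can be sketched in Python as follows. This is a minimal illustration, not the presenters' code: neighborhoods are passed in as a dictionary instead of being computed by HASH, indices are 0-based, failure is reported as `None`, and an empty set is treated as trivially matched (an assumption made here so that the recursion terminates).

```python
from collections import Counter


def find_match(S, nbhd):
    """Return (order, iota) for the elements of S, or None on failure.

    S:     list of elements.
    nbhd:  dict mapping each element to its tuple of k locations.
    order: elements from smallest to largest; iota[t] is the 0-based index
           of the matched location inside nbhd[t].
    """
    if not S:
        return [], {}
    # A location is a singleton if it lies in exactly one neighborhood.
    counts = Counter(h for t in S for h in set(nbhd[t]))
    easy = {}
    for t in S:
        for j, h in enumerate(nbhd[t]):
            if counts[h] == 1:  # TWEAK: smallest index j of a singleton
                easy[t] = j
                break
    if not easy:
        return None  # no easy match: FIND_MATCH is stuck
    sub = find_match([t for t in S if t not in easy], nbhd)
    if sub is None:
        return None
    order, iota = sub
    # Elements of H come first, easy matches are appended at the end.
    return order + list(easy), {**iota, **easy}
```

With N(a) = (1, 2), N(b) = (2, 3) and N(c) = (3, 4), the first round matches a and c through the singletons 1 and 4, and the recursive call then matches b.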
28 CREATE
- CREATE(A = {(t_1,v_1),...,(t_n,v_n)})[m, k, q] (create a mutable table)
- 1. Uniformly choose HASH: D → {1,...,m}^k × {0,1}^q
  - S ← {t_1,...,t_n}
  - Create Table1 to be an array of m elements of {0,1}^q
  - Create Table2 to be an array of m elements of R
  - (the initial values for both tables are arbitrary)
  - Put (HASH, m, k, q) into the "header" of Table1
  - (we assume that these values may be recovered from Table1)
- 2. (Π, τ) ← FIND_MATCH(HASH, S)[m, k]
  - If FIND_MATCH(HASH, S)[m, k] = failure Goto Step 1
- 3. For t = Π[1],...,Π[n]
  - v ← A(t) (i.e., the value assigned by A to t)
  - (h_1,...,h_k, M) ← HASH(t)
  - L ← τ(t); l ← ι(t) (i.e., L = h_l)
  - Table1[L] ← ENCODE(l) ⊕ M ⊕ (⊕_{i=1..k, i≠l} Table1[h_i])
  - Table2[L] ← v
- 4. Return (Table = (Table1, Table2))
29 LOOKUP, SET_VALUE
- LOOKUP(t, Table = (Table1, Table2))
- 1. Get (HASH, m, k, q) from Table1
  - (h_1,...,h_k, M) ← HASH(t)
  - l ← DECODE(M ⊕ (⊕_{i=1..k} Table1[h_i]))
- 2. If l is defined
  - L ← h_l
  - Return (Table2[L])
  - Else Return (⊥)
- SET_VALUE(t, v, Table = (Table1, Table2))
- 1. Get (HASH, m, k, q) from Table1
  - (h_1,...,h_k, M) ← HASH(t)
  - l ← DECODE(M ⊕ (⊕_{i=1..k} Table1[h_i]))
- 2. If l is defined
  - L ← h_l
  - Table2[L] ← v
  - Return (success)
  - Else Return (failure)
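Putting CREATE, LOOKUP and SET_VALUE together, a compact end-to-end sketch might look like the following. Everything here is an illustrative assumption rather than the presenters' implementation: the class name `BloomierFilter`, the parameters `k`, `c`, `eps` and `seed`, the use of Python's `random.Random` seeded with a string as a stand-in for the random function HASH, `None` as the undefined value, and 0-based indices. The sketch also requires a matched location to occur only once in its own neighborhood, which keeps the XOR in LOOKUP from cancelling it.

```python
import math
import random
from collections import Counter


class BloomierFilter:
    """Two-table sketch; None plays the role of the undefined value."""

    def __init__(self, assignment, k=3, c=2, eps=1e-3, seed=0):
        self.k = k
        self.m = c * k * max(1, len(assignment))  # table size m = ckn
        self.q = math.ceil(math.log2(k / eps))    # so that k / 2^q <= eps
        rng = random.Random(seed)
        while True:                               # CREATE steps 1-2: retry on failure
            self.salt = rng.getrandbits(64)
            match = self._find_match(list(assignment))
            if match is not None:
                break
        order, iota = match
        self.table1 = [0] * self.m                # q-bit words
        self.table2 = [None] * self.m             # values of R
        for t in order:                           # CREATE step 3
            hs, mask = self._hash(t)
            l = iota[t]
            word = l ^ mask                       # ENCODE(l) xor M
            for i, h in enumerate(hs):
                if i != l:
                    word ^= self.table1[h]
            self.table1[hs[l]] = word
            self.table2[hs[l]] = assignment[t]

    def _hash(self, t):
        # Stand-in for HASH: k locations and a q-bit mask derived from (salt, t).
        rng = random.Random(f"{self.salt}:{t!r}")
        hs = [rng.randrange(self.m) for _ in range(self.k)]
        return hs, rng.getrandbits(self.q)

    def _find_match(self, S):
        if not S:
            return [], {}
        nbhd = {t: self._hash(t)[0] for t in S}
        counts = Counter(h for t in S for h in set(nbhd[t]))
        easy = {}
        for t in S:
            for j, h in enumerate(nbhd[t]):
                if counts[h] == 1 and nbhd[t].count(h) == 1:
                    easy[t] = j
                    break
        if not easy:
            return None
        sub = self._find_match([t for t in S if t not in easy])
        if sub is None:
            return None
        order, iota = sub
        return order + list(easy), {**iota, **easy}

    def _location(self, t):
        hs, mask = self._hash(t)
        y = mask
        for h in hs:
            y ^= self.table1[h]
        return hs[y] if y < self.k else None      # DECODE to an index, or undefined

    def lookup(self, t):
        L = self._location(t)
        return None if L is None else self.table2[L]

    def set_value(self, t, v):
        L = self._location(t)
        if L is None:
            return False
        self.table2[L] = v
        return True
```

A possible usage: build the filter from a dictionary such as `{"key0": 0, "key1": 1}`, call `lookup("key1")` to read a value back, and call `set_value("key1", 7)` to change it without rebuilding the first table. Lookups for keys outside the dictionary return `None` except for a small fraction of false positives.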
30 Analysis
31 Analyzing FIND_MATCH
- We show that FIND_MATCH succeeds with constant probability for every S.
- We will define a bipartite graph G.
- On the left side there are n vertices L = {L_1,...,L_n} corresponding to S.
- On the right side there are m vertices R = {R_1,...,R_m} corresponding to {1,...,m}.
- There is an edge between L_i and R_j if, for t_i ∈ S with N(t_i) = (h_1,...,h_k), there is an l such that j = h_l.
32 The Singleton Property
- We say that G has the singleton property if for all nonempty A ⊆ L there exists a vertex R_i ∈ R such that R_i is adjacent to exactly one vertex in A.
- If G has the singleton property, FIND_MATCH will never get stuck (there will always be easy matches).
- N(v) is the set of neighbors of v ∈ L.
- N(A) is the set of neighbors of the elements in A.
33 Lossless Expansion Property
- We say that G has the lossless expansion property if for all nonempty A ⊆ L, |N(A)| > k|A|/2.
- If G has the lossless expansion property, it has the singleton property:
- Assume to the contrary that there is a nonempty A such that each node in N(A) has at least 2 neighbors in A.
- The sub-graph for A has at least 2|N(A)| edges.
- Since |N(A)| > k|A|/2, the sub-graph has more than k|A| edges, a contradiction, because each vertex of A has at most k edges.
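Written as a single chain of inequalities, the counting argument reads as follows. The notation e(A, N(A)) for the number of edges between A and N(A) is introduced here for convenience and does not appear on the slides.

```latex
\[
  k\,|A| \;\ge\; e\bigl(A, N(A)\bigr) \;\ge\; 2\,|N(A)| \;>\; 2 \cdot \frac{k\,|A|}{2} \;=\; k\,|A| .
\]
```

The first inequality holds because every vertex of A has at most k edges, the second because every vertex of N(A) is assumed to have at least two neighbors in A, and the third is lossless expansion. The chain gives k|A| > k|A|, which is impossible.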
34 Lossless Expansion Property (Cont.)
- For a random graph G with
- fixed k, k > 2
- m = ckn for a fixed c
- G is a lossless expander with constant probability.
- FIND_MATCH will succeed with constant probability.
35 Data Structure Complexity
- The error probability is k/2^q.
- We have to set q = ⌈log(k/ε)⌉ so that k/2^q ≤ ε.
- Space: O(n(r + log 1/ε)) bits
- Lookup time: O(1)
- Update time: O(1)
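As a worked example of these bounds, the short sketch below computes the word size q and the total table size for given n, r and ε. The function name `parameters` and the default constants k = 3 and c = 2 are assumptions chosen for illustration; the slides only state that k and c are fixed.

```python
import math


def parameters(n, r, eps, k=3, c=2):
    """Return (q, m, bits) for a two-table layout.

    q:    bits per word of the first table, chosen so that k / 2^q <= eps.
    m:    number of slots in each table, m = c * k * n.
    bits: m * (q + r), i.e. q bits per slot in table 1 and r bits in table 2.
    """
    q = math.ceil(math.log2(k / eps))
    m = c * k * n
    return q, m, m * (q + r)


q, m, bits = parameters(n=1000, r=8, eps=0.001)
print(q, m, bits)
```

For constant k and c the total is ckn(r + ⌈log(k/ε)⌉) = O(n(r + log 1/ε)) bits, matching the space bound above.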
36 Data Structure Complexity (Cont.)
- FIND_MATCH: we will use the graph again.
- We may show that with high probability, for all nonempty A ⊆ L, |N(A)| > c|A| for some constant c > k/2.
- For a set A ⊆ L, assume there are a items in N(A) with one neighbor and at least c|A|-a items with more than one neighbor.
- The sub-graph for A has at least
- a + 2(c|A|-a) = 2c|A|-a edges.
- On the other hand it has at most k|A| edges.
- Therefore a ≥ (2c-k)|A|.
37 Data Structure Complexity (Cont.)
- Each item in A has at most k neighbors.
- The number of items in A that have a neighbor belonging only to them is at least
- a/k ≥ (2c-k)|A|/k = (2c/k-1)|A| = p|A|
- These items are easy matches.
- If there is such a c, the run-time of FIND_MATCH is
- O(n) + O((1-p)n) + O((1-p)^2 n) + ... = O(n)
- That is also the expected run-time of CREATE.
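The run-time bound is a geometric series: each round removes at least a fraction p of the remaining items, so round i works on at most (1-p)^i n items. Assuming p = 2c/k - 1 is a constant with 0 < p ≤ 1, the sum is

```latex
\[
  \sum_{i \ge 0} O\bigl((1-p)^i\, n\bigr)
  \;=\; O\Bigl(n \sum_{i \ge 0} (1-p)^i\Bigr)
  \;=\; O\Bigl(\frac{n}{p}\Bigr)
  \;=\; O(n).
\]
```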
38 Lower Bounds
39 Deterministic Algorithm
- If R = {1,2,⊥}, S splits into subsets A and B that map to 1 and 2, respectively.
- Even in that case, deterministic Bloomier filtering requires Ω(n + log log N) bits of storage.
- Define G, a graph where each node is a vector in {-1,0,1}^N with exactly n coordinates equal to 1, and n others equal to -1.
- The 1s represent A and the -1s represent B.
- Two nodes v and v' are adjacent if the set A of v intersects the set B of v'
- (if v = (x_1,...,x_N) and v' = (y_1,...,y_N), they are adjacent if there is an i such that x_i y_i = -1).
40 Deterministic Algorithm (Cont.)
- Since the memory is the only source of information about A and B, no 2 adjacent nodes should correspond to the same memory configuration.
- The memory size m is at least log χ(G) (χ(G) is the minimum number of colors required to color G).
- We will show that χ(G) is between Ω(2^n log N) and O(n 2^(2n) log N).
41 Lower Bound On χ(G)
- For every color c required to color G we have a vector z^c in {-1,1}^N.
- For a node v = (x_1,...,x_N) colored c, we allow x_i to be 1 (or -1) only if z^c_i is 1 (or -1).
- A set of binary vectors of length l is (l,k)-universal if, for every choice of k coordinate positions, the vectors restricted to those positions produce all 2^k possible patterns.
- We will show that the set {z^c} is (N,n)-universal if we turn the minus ones into zeroes.
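The definition of a universal set can be tested by brute force on small examples. The sketch below is an illustration only; the function name `is_universal` is hypothetical, and the check is exponential in k, so it is meant for toy inputs.

```python
from itertools import combinations


def is_universal(vectors, k):
    """Check whether a list of equal-length 0/1 tuples is (length, k)-universal.

    For every choice of k coordinate positions, the vectors restricted to
    those positions must show all 2^k patterns.
    """
    length = len(vectors[0])
    for positions in combinations(range(length), k):
        patterns = {tuple(v[i] for i in positions) for v in vectors}
        if len(patterns) < 2 ** k:
            return False
    return True


# The four even-parity vectors of length 3 cover every pair of positions.
even_parity = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
print(is_universal(even_parity, 2))
```

The even-parity example is (3,2)-universal with only 4 vectors, but it is not (3,3)-universal, since that would require all 8 patterns.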
42 Lower Bound On χ(G) (Cont.)
- Let i_1,...,i_n be n coordinate positions.
- For each w in {-1,1}^n we have a node v whose i_1,...,i_n coordinates match w.
- If v is colored in color c then the i_1,...,i_n coordinates of z^c match w.
- Therefore, for each choice of n coordinate positions we get all the possible patterns.
- The size of an (N,n)-universal set is Ω(2^n log N), so this is a lower bound on χ(G).
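Combining this bound on χ(G) with the fact that the memory size m is at least log χ(G) gives the storage lower bound for the deterministic case. Here c_0 > 0 names the constant hidden in the Ω notation; it is introduced only for this derivation.

```latex
\[
  m \;\ge\; \log \chi(G) \;\ge\; \log\bigl(c_0\, 2^{n} \log N\bigr)
  \;=\; n + \log\log N + \log c_0
  \;=\; \Omega(n + \log\log N).
\]
```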
43 Upper Bound On χ(G)
- There exists an (N,2n)-universal set of vectors of size O(n 2^(2n) log N).
- We will turn all the zeroes into minus ones.
- We will use that set as {z^c}.
- Because the set {z^c} is universal, we may select for each node a vector z^c that matches the 1s and -1s of the node.
- c will be the color of the node.
44 Mutable Filtering
- If
- and the number m of storage bits satisfies
- for some large enough constant c, Bloomier filtering cannot support dynamic updates on an S of size 2n.
- The proof is for the case R = {1,2,⊥}, where S splits into subsets A and B of size n that map to 1 and 2, respectively.
- We assume the algorithm is randomized.
45 Mutable Filtering (Cont.)
- Fix the sequence of random choices made by the algorithm when its input was A and B.
- We assume B was a specific set B_org and vary A.
- For each possible A we have a corresponding memory configuration.
- In other words, for each memory configuration we have a family of possible sets A that lead to this configuration.
- Let F be the largest family.
46 Mutable Filtering (Cont.)
- Now we change B: each possible B_new leads to a memory configuration.
- For each configuration 1 ≤ i ≤ 2^m there is a family of options for B_new that leads to it. We denote it by G_i.
(Figure: the memory configurations at stage I, before B is changed, and at stage II, after B is changed to B_new.)
47 Mutable Filtering (Cont.)
- Given a memory configuration C in II, for any path that leads to it:
- B can be the B_new on the path.
- For each item in such a set we must answer "in B".
- A can be the set on the path before the configuration in I.
- For each item in such a set that could not be changed to B_new on the path we must answer "in A".
- Suppose in I we were in the configuration F leads to, and then we randomly chose B_new.
- i(B_new) denotes the j such that B_new ∈ G_j.
- In II we have to
- answer "in A" for each item of a set in F that could not be changed to B_new;
- answer "in B" for each item of a set in G_i(B_new).
48 The Proof
- is the subset of F whose sets intersect B_new.
-
- We show that with high probability (over the selection of B_new) the sets
-
-
- are intersecting.
- There is an item for which the algorithm must answer both "in A" and "in B".
- There is a set B_new that causes the algorithm to make errors.
49 L_k And Its Size
-
- L_k is the set of items that belong to at least k sets in F.
- We will look at subsets of
- that belong to L_k and show they intersect.
- We first bound the size of L_k.
- F_k is the sub-family of F that contains only subsets of L_k.
-
-
50 L_k And Its Size (Cont.)
51 And Its Size
-
- It is a subset of both and L_k.
- The algorithm should answer "in A" for each item of
- We will show that with probability ≥ 1/2 cannot be very small.
- The expected number of sets in F a random item of D intersects is
52 And Its Size (Cont.)
-
- If an item in L_k does not appear in it intersects only sets in
- Such an item appears in at least k sets.
-
-
-
- According to the Markov bound
53 And Its Size
-
- M_i is a subset of L_k and
- The algorithm should answer "in B" for each item of
-
-
- According to the Chernoff bound
54 And Its Size (Cont.)
-
-
-
- This probability is the number of sets B_new that satisfy
- in all the G_i, divided by the number of sets B_new that satisfy
-
-
55 And Its Size (Cont.)
56 The Error
- With probability at least ½-o(1)
- and
- that is, the 2 sets intersect.
- The algorithm must answer "in A" for each item of
- and "in B" for each item of
- There is a set B_new for which the algorithm will make an error.
57 Questions?