1
The Bloomier Filter

Bernard Chazelle (Princeton Univ., NEC Lab.)
Joe Kilian (NEC Lab.)
Ronitt Rubinfeld (NEC Lab.)
Ayellet Tal (Technion, Princeton Univ.)

Presented by Lilach Bien
2
Overview

The Problem
Definitions
The algorithm
Analysis
Lower Bounds
Deterministic algorithm
Mutable version of the problem

3
The Problem

Bloom and Bloomier Filters

4
The Problem: Bloom Filters

A large set of data D, with a small subset S.
We want to query whether an item d belongs to S.
No false negatives (if d belongs to S, we'll recognize it).
A small false positive rate (we may say d belongs to S although it doesn't).
Allowing a small false positive rate lets us build a compact data structure.
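
A minimal sketch of the membership structure described on this slide. The class name, the use of SHA-256 to derive positions, and the sizes m = 64 and k = 3 are assumptions made for illustration; they do not come from the slides.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k positions derived from SHA-256."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = bytearray(m)  # one byte per bit, for clarity rather than compactness

    def _positions(self, item: str):
        # Derive k positions by hashing the item together with an index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # True for every added item; may also be True for a few items never added.
        return all(self.bits[pos] for pos in self._positions(item))


S = ["alice", "bob", "carol"]
bf = BloomFilter(m=64, k=3)
for s in S:
    bf.add(s)

members_found = all(s in bf for s in S)  # no false negatives
```

Every member of S is always reported as present. An item outside S is reported as present only if all k of its positions happen to be set, which is the small false positive rate the slide refers to.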

5
The Problem: Bloomier Filters

Bloom filters: membership queries on a small subset of D.
Bloomier filters: computing arbitrary functions defined only on a small subset S of D.
The function is computed correctly for all members of S (no false negatives).
For items not in S, we almost always return a special value ⊥.
Dynamic updates to the function are allowed, as long as S doesn't change.

6
Example

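D = {1, ..., 100}, S = {1, ..., 3}, R = {1, 2}
f(1) = 1, f(2) = 1, f(3) = 2

Query x:  1   2   87  55  40
Result:   1   1   ⊥   ⊥   1    (40 is not in S: a false positive)

Update: f(2) = 2

Query x:  66  2   3
Result:   ⊥   2   2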

7
Bloomier Filters - Uses

Building a meta database for a union of
databases.
Keeps track of which database contains
information about each entry.
Maintaining directories if the data or code is
maintained in multiple locations.

8
Definitions
9
Formal Definitions

f is a function with domain D = {0, ..., N-1}.
The range is R = {⊥, 1, ..., 2^r - 1}.
S = {t_1, ..., t_n} is a subset of D of size n.
f(t_i) = v_i, where v_i ∈ R.
f(x) = ⊥ for x outside of S.
f can be specified by the assignment A = {(t_1, v_1), ..., (t_n, v_n)}.

10
Formal Definitions (Cont.)

Bloomier filters let us query f at any point of S, always correctly.
For a random x ∈ D\S, the query returns f(x) = ⊥ with probability 1 - ε.
The input to the algorithm is A and ε.

11
Supported Operations

CREATE(A)
Given an assignment A = {(t_1, v_1), ..., (t_n, v_n)}, we initialize the data structure Tables.
SET_VALUE(t, v, Tables)
For t ∈ D and v ∈ R, we associate the value v with the domain element t in Tables.
It is required that t belongs to S.

12
Supported Operations (Cont.)

LOOKUP(t, Tables)
For t ∈ S, we return the last value v associated with t.
For all but a fraction ε of D\S, we return ⊥.
For the remaining elements of D\S, we return an arbitrary element of R.

13
The Idea

We encode the values in R as elements of the additive group Q = {0,1}^q.
Addition in Q is bitwise XOR.
Any x ∈ R is mapped into Q by its q-bit binary expansion, ENCODE(x).
For y ∈ Q we define DECODE(y) as:
  the corresponding number in R, if y < |R|
  ⊥ otherwise
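
The encoding on this slide can be sketched as follows. Here elements of Q are held as Python integers below 2^q, `None` stands in for ⊥, and the values q = 8 and |R| = 4 are assumptions chosen only to make the example small.

```python
from typing import Optional


def encode(x: int, q: int) -> int:
    """q-bit binary expansion of x, held as a Python int in [0, 2**q)."""
    if not 0 <= x < 2 ** q:
        raise ValueError("x does not fit in q bits")
    return x


def decode(y: int, range_size: int) -> Optional[int]:
    """Return y if it names an element of R (y < |R|), else None (standing in for ⊥)."""
    return y if y < range_size else None


q = 8            # assumed mask width
range_size = 4   # assumed |R|
mask = 0b10110101                   # an arbitrary q-bit mask M
stored = encode(3, q) ^ mask        # addition in Q is bitwise XOR
recovered = decode(stored ^ mask, range_size)
garbage = decode(0b11110000, range_size)
```

XOR-ing with the same mask twice cancels it, so `recovered` is 3, while a word that is not below |R| decodes to ⊥.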

14
The Idea (Cont.)

We'll save the function values for elements of S in a table.
We'll use a hash function to compute a random q-bit masking value M for every x in D.
To look up the value of x, we'll access a set of places in the table and calculate a q-bit number a.
We'll return M XOR a.

15
The Idea (Cont.)

If t is in S, we'll build the table so that a XOR M = f(t).
Otherwise, since M is random, we'll get a random q-bit number y = a XOR M.
Proof: consider the i-th bit of y.
Suppose a_i = 0 (without loss of generality).
We get Pr[y_i = 1] = Pr[M_i = 1] = 1/2.

16
The Idea (Cont.)

Since y is random, for large enough q, DECODE(y) will return ⊥ with high probability.
If we save elements of R in the table (so that y may decode to an element of R), DECODE(y) will not return ⊥ with probability |R|/2^q.
We can do better.

17
Using 2 Tables

We have a table of size m, and a hash function HASH: D → {1, ..., m}^k.
If HASH(t) = (h_1, ..., h_k), we say that {h_1, ..., h_k} is the neighborhood of t, N(t).
For large enough m and k, we can choose for each t ∈ S an element τ(t) from N(t) such that:
For each t' ∈ S, t' ≠ t, it holds that τ(t') ≠ τ(t).
If τ(t) = h_i, we use ι(t) to denote i.

18
Using 2 Tables (Cont.)

We'll use 2 tables.
The first table will store values in {⊥, 1, ..., k}, encoded as values in Q.
It will return ι(t) for t in S, and return ⊥ for most of the other items.
The second table will store values in R.
For each t in S, the value f(t) will be stored at location τ(t).

19
Using 2 Tables (Cont.)

If x is in D\S, then with probability k/2^q the first table will not return ⊥.
So with probability k/2^q we will access the second table and return garbage.
Now we can also change function values if we want:
We use the first table to find which location in the second table stores the value we want to change.
We change the value in the second table.
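
To see why two tables help, the two error probabilities can be compared with exact arithmetic. The parameter values q = 20, r = 16 and k = 3 below are illustrative assumptions, not values from the slides.

```python
from fractions import Fraction


def single_table_rate(range_size: int, q: int) -> Fraction:
    """Chance that a random q-bit word decodes to some element of R."""
    return Fraction(range_size, 2 ** q)


def two_table_rate(k: int, q: int) -> Fraction:
    """Chance that a random q-bit word decodes to one of the k indices 1..k."""
    return Fraction(k, 2 ** q)


q, r, k = 20, 16, 3               # illustrative values
one = single_table_rate(2 ** r, q)
two = two_table_rate(k, q)
print(one, two)
```

With the same mask width q, the first table only has to hide k possible indices instead of |R| possible values, so the chance of returning garbage drops from |R|/2^q to k/2^q.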

20
The Algorithm
21
The First Table

Reminder:
We want to use the table to compute a value a for each item t in D.
For items in S, a XOR M will give us the encoded ι(t).
When we access the first table with an element t, we know N(t) = {h_1, ..., h_k} and M.
We'll compute a = Table[h_1] XOR ... XOR Table[h_k].
We want to set the values at the indices of N(t) so that a XOR M gives us the encoded ι(t).

22
Order Respecting Matching

Let S be a set with a neighborhood N(t) defined for each t ∈ S.
Let Π be a complete ordering on the elements of S.
A matching τ respects (S, Π, N) if:
For all t ∈ S, τ(t) ∈ N(t).
If t_i >_Π t_j, then τ(t_i) ∉ N(t_j).
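
The two conditions can be checked directly on a small instance. The function name `respects`, the dictionary representation and the three-element example are assumptions made for illustration.

```python
def respects(order, tau, N):
    """Check whether matching tau respects (S, order, N).

    order: list of the elements of S, smallest first.
    tau:   dict mapping each element to its matched location.
    N:     dict mapping each element to its tuple of neighbor locations.
    """
    for i, t in enumerate(order):
        # Condition 1: tau(t) lies in the neighborhood of t.
        if tau[t] not in N[t]:
            return False
        # Condition 2: no larger element is matched to a location in N(t).
        for later in order[i + 1:]:
            if tau[later] in N[t]:
                return False
    return True


N = {"a": (1, 2), "b": (2, 3), "c": (3, 4)}
order = ["a", "b", "c"]
good = respects(order, {"a": 1, "b": 3, "c": 4}, N)
bad = respects(order, {"a": 1, "b": 2, "c": 4}, N)
```

In the second matching, b is larger than a but is matched to location 2, which lies in N(a), so the matching does not respect the order.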

23
Order Respecting Matching (Cont.)

If, for N defined by HASH, a matching τ respects (S, Π, N), it has all the properties we wanted:
For all t ∈ S, τ(t) ∈ N(t).
For all distinct t, t' ∈ S, τ(t) ≠ τ(t').
We may build the first table incrementally so that for every t ∈ S,
a XOR M will give us the encoded ι(t).

24
Building The First Table

Input:
Order Π
Neighborhood N(t) defined by HASH
Order-respecting matching τ
For t = Π[1], ..., Π[n] we set Table[τ(t)] so that M XOR Table[h_1] XOR ... XOR Table[h_k] encodes ι(t).
Since τ is order respecting, this can't affect any value already set for t' <_Π t.

25
Finding A Good Ordering And Matching

We are given S and HASH, and compute Π and τ so that τ is order respecting.
A location h ∈ {1, ..., m} is a singleton for S if h ∈ N(t) for exactly one t ∈ S.
TWEAK(t, S, HASH) is the smallest value j such that h_j is a singleton for S, where N(t) = (h_1, ..., h_k).
TWEAK(t, S, HASH) = ⊥ if no such j exists.
If TWEAK(t, S, HASH) is defined, we may set ι(t) = TWEAK(t, S, HASH). This is an easy match.

26
Finding A Good Ordering And Matching (Cont.)

If t is an easy match, its matched location doesn't collide with the neighborhood of any other t' ∈ S.
E: the subset of S with easy matches.
H = S \ E.
We recursively find (Π', τ') for H.
We extend (Π', τ') to (Π, τ):
Π lists first the ordered elements of H, and then the elements of E.
τ is the union of the matchings for H and E.

27
FIND_MATCH

FIND_MATCH(HASH, S)[m, k]    (find (Π, τ) for S, HASH)
1. E = ∅; Π = ∅
   For t_i ∈ S
     If TWEAK(t_i, S, HASH) is defined
       ι_i = TWEAK(t_i, S, HASH)
       E = E + t_i
   If E = ∅ Return (failure)
2. H = S \ E
   Recursively compute (Π', τ') = FIND_MATCH(HASH, H)[m, k]
   If FIND_MATCH(HASH, H)[m, k] = failure Return (failure)
3. Π = Π'
   For t_i ∈ E
     Add t_i to the end of Π (i.e., make t_i the largest element in Π thus far)
   Return (Π, τ = {ι_1, ..., ι_n})
   (where ι_i is determined for t_i ∈ E in Step 1, and for t_i ∈ H (via τ') in Step 2.)
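
The peeling procedure can be sketched in Python as below. This is an illustrative variant rather than the slide's exact procedure: it works on explicit neighborhoods instead of a HASH function, it is iterative instead of recursive, it uses 0-based indices, and it treats an empty remaining set as success. The function name and the example neighborhoods are assumptions.

```python
from collections import Counter


def find_match(neighborhoods):
    """Peel easy matches layer by layer.

    neighborhoods: dict mapping each element t of S to a tuple (h_1, ..., h_k).
    Returns (order, iota), where order lists S smallest-first and iota[t] is the
    0-based index j with tau(t) = neighborhoods[t][j]; returns None on failure.
    """
    remaining = dict(neighborhoods)
    layers = []            # easy-match sets, outermost layer first
    iota = {}
    while remaining:
        # Count how many remaining elements touch each location.
        touch = Counter(h for hs in remaining.values() for h in set(hs))
        easy = []
        for t, hs in remaining.items():
            for j, h in enumerate(hs):
                if touch[h] == 1:          # h is a singleton, so TWEAK(t) = j
                    iota[t] = j
                    easy.append(t)
                    break
        if not easy:
            return None                    # stuck: no singleton location
        layers.append(easy)
        for t in easy:
            del remaining[t]
    # Elements peeled first are the largest, so they come last in the order.
    order = [t for layer in reversed(layers) for t in layer]
    return order, iota


result = find_match({"a": (1, 2), "b": (2, 3), "c": (3, 4)})
stuck = find_match({"a": (1, 2), "b": (1, 2)})
```

In the first call, a and c are easy matches in the first round (locations 1 and 4 are singletons); once they are removed, b becomes an easy match. In the second call no location is a singleton, so the search fails.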

28
CREATE

CREATE(A = {(t_1, v_1), ..., (t_n, v_n)})[m, k, q]
(create a mutable table)
1. Uniformly choose HASH: D → {1, ..., m}^k × {0, 1}^q
   S = {t_1, ..., t_n}
   Create Table1 to be an array of m elements of {0, 1}^q
   Create Table2 to be an array of m elements of R.
   (the initial values for both tables are arbitrary)
   Put (HASH, m, k, q) into the "header" of Table1
   (we assume that these values may be recovered from Table1)
2. (Π, τ) = FIND_MATCH(HASH, S)[m, k]
   If FIND_MATCH(HASH, S)[m, k] = failure Goto Step 1
3. For t = Π[1], ..., Π[n]
     v = A(t) (i.e., the value assigned by A to t)
     (h_1, ..., h_k, M) = HASH(t)
     L = τ(t); l = ι(t) (i.e., L = h_l)
     Table1[L] = ENCODE(l) ⊕ M ⊕ (XOR of Table1[h_i] over i = 1, ..., k, i ≠ l)
     Table2[L] = v
4. Return (Table = (Table1, Table2))

29
LOOKUP and SET_VALUE

LOOKUP(t, Table = (Table1, Table2))
1. Get (HASH, m, k, q) from Table1
   (h_1, ..., h_k, M) = HASH(t)
   l = DECODE(M ⊕ Table1[h_1] ⊕ ... ⊕ Table1[h_k])
2. If l is defined
     L = h_l
     Return (Table2[L])
   Else Return (⊥)

SET_VALUE(t, v, Table = (Table1, Table2))
1. Get (HASH, m, k, q) from Table1
   (h_1, ..., h_k, M) = HASH(t)
   l = DECODE(M ⊕ Table1[h_1] ⊕ ... ⊕ Table1[h_k])
2. If l is defined
     L = h_l
     Table2[L] = v
     Return (success)
   Else Return (failure)
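
CREATE, LOOKUP and SET_VALUE can be put together in a compact Python sketch. Several choices here are assumptions made for illustration and are not from the slides: HASH is simulated by a pseudo-random generator seeded from SHA-256, the k locations of each element are drawn without repetition, `None` stands in for ⊥, tables start zeroed, and the class and method names are invented.

```python
import hashlib
import random
from collections import Counter


class BloomierFilter:
    """Two-table Bloomier filter sketch; None stands in for the value ⊥."""

    def __init__(self, assignment, m, k, q, max_tries=100):
        self.m, self.k, self.q = m, k, q
        for seed in range(max_tries):            # "Goto Step 1" on failure
            self.seed = seed
            matched = self._find_match(list(assignment))
            if matched is not None:
                break
        else:
            raise RuntimeError("no matching found; try a larger m")
        order, iota = matched
        self.table1 = [0] * m                    # q-bit words
        self.table2 = [None] * m                 # function values
        for t in order:                          # smallest element first
            hs, mask = self._hash(t)
            l = iota[t]
            word = (l + 1) ^ mask                # ENCODE(l), with l stored as 1..k
            for i, h in enumerate(hs):
                if i != l:
                    word ^= self.table1[h]
            self.table1[hs[l]] = word
            self.table2[hs[l]] = assignment[t]

    def _hash(self, t):
        # Stand-in for a random HASH: a PRNG seeded from SHA-256 of (seed, t).
        digest = hashlib.sha256(f"{self.seed}:{t!r}".encode()).digest()
        rng = random.Random(int.from_bytes(digest, "big"))
        hs = rng.sample(range(self.m), self.k)   # k distinct locations
        return hs, rng.getrandbits(self.q)

    def _find_match(self, keys):
        remaining = {t: self._hash(t)[0] for t in keys}
        layers, iota = [], {}
        while remaining:
            touch = Counter(h for hs in remaining.values() for h in hs)
            easy = []
            for t, hs in remaining.items():
                j = next((j for j, h in enumerate(hs) if touch[h] == 1), None)
                if j is not None:
                    iota[t] = j
                    easy.append(t)
            if not easy:
                return None
            layers.append(easy)
            for t in easy:
                del remaining[t]
        return [t for layer in reversed(layers) for t in layer], iota

    def _location(self, t):
        hs, mask = self._hash(t)
        word = mask
        for h in hs:
            word ^= self.table1[h]
        # DECODE: only 1..k name an index; every other word means ⊥.
        return hs[word - 1] if 1 <= word <= self.k else None

    def lookup(self, t):
        loc = self._location(t)
        return None if loc is None else self.table2[loc]

    def set_value(self, t, v):
        loc = self._location(t)
        if loc is None:
            return False
        self.table2[loc] = v
        return True


A = {"x1": 1, "x2": 1, "x3": 2}
bf = BloomierFilter(A, m=12, k=3, q=16)
before = [bf.lookup(t) for t in A]
updated = bf.set_value("x2", 2)
after = [bf.lookup(t) for t in A]
```

Lookups on members of S are always exact, and `set_value` only touches the second table, which is why updates do not require rebuilding the matching. A key outside S usually yields `None`, but with probability about k/2^q it decodes to a valid index and returns whatever the second table holds there.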

30
Analysis
31
Analyzing FIND_MATCH

We show that FIND_MATCH succeeds with constant probability for every S.
We'll define a bipartite graph G.
On the left side there are n vertices L = {L_1, ..., L_n} corresponding to S.
On the right side there are m vertices R = {R_1, ..., R_m} corresponding to {1, ..., m}.
There is an edge between L_i and R_j if, for t_i ∈ S with HASH(t_i) = (h_1, ..., h_k), there is an l such that j = h_l.

32
The Singleton Property

We say that G has the singleton property if for all nonempty A ⊆ L there exists a vertex R_i ∈ R such that R_i is adjacent to exactly one vertex in A.
If G has the singleton property, FIND_MATCH will never get stuck (there will always be easy matches).
N(v): the set of neighbors of v ∈ L.
N(A): the set of neighbors of the elements in A.

33
Lossless Expansion Property

We say that G has the lossless expansion property if for all nonempty A ⊆ L, |N(A)| > k|A|/2.
If G has the lossless expansion property, it has the singleton property:
Assume to the contrary that there is an A such that each node in N(A) has at least 2 neighbors in A.
The subgraph for A then has at least 2|N(A)| edges.
Since |N(A)| > k|A|/2, the subgraph has more than k|A| edges, a contradiction, because each vertex of A has at most k edges.
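
Both properties can be checked by brute force on a tiny graph. The sketch below enumerates every nonempty subset A, so it is only usable for very small left sides; the function names and the two example graphs are assumptions made for illustration.

```python
from itertools import combinations


def neighbors(A, N):
    """N(A): union of the neighbor sets of the vertices in A."""
    return set().union(*(N[v] for v in A))


def has_lossless_expansion(N, k):
    """|N(A)| > k|A|/2 for every nonempty subset A of the left side."""
    left = list(N)
    return all(
        2 * len(neighbors(A, N)) > k * len(A)
        for size in range(1, len(left) + 1)
        for A in combinations(left, size)
    )


def has_singleton_property(N):
    """Every nonempty A has a right vertex adjacent to exactly one vertex of A."""
    left = list(N)
    for size in range(1, len(left) + 1):
        for A in combinations(left, size):
            if not any(sum(h in N[v] for v in A) == 1 for h in neighbors(A, N)):
                return False
    return True


expander = {"a": {1, 2, 3}, "b": {3, 4, 5}, "c": {5, 6, 7}}   # k = 3
clash = {"a": {1, 2, 3}, "b": {1, 2, 3}}                      # k = 3
```

The first graph satisfies both properties. In the second, A = {a, b} has only 3 neighbors, which is not more than k|A|/2 = 3, and every neighbor is shared, so both properties fail.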

34
Lossless Expansion Property (Cont.)

For a random graph G with
  fixed k, k > 2
  m = ckn for a fixed c
G is a lossless expander with constant probability.
Hence FIND_MATCH will succeed with constant probability.

35
Data Structure Complexity

The error probability is k/2^q.
We have to set q ≥ log2(k/ε) so that the error probability is at most ε.
Space: O(n(r + log 1/ε)) bits
Lookup time: O(1)
Update time: O(1)
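
A small numeric sketch of these parameter choices. It assumes m = ckn table entries with q bits per Table1 entry and r bits per Table2 entry; the constant c = 2 and the sample values of n, r, k and ε are illustrative assumptions.

```python
def mask_bits(k: int, eps: float) -> int:
    """Smallest q with k / 2**q <= eps."""
    q = 0
    while k > eps * 2 ** q:
        q += 1
    return q


def table_bits(n: int, r: int, k: int, eps: float, c: int = 2) -> int:
    """Total bits for both tables, assuming m = c*k*n entries (c is an assumed constant)."""
    m = c * k * n
    return m * (mask_bits(k, eps) + r)


q = mask_bits(3, 0.001)
bits = table_bits(n=1000, r=8, k=3, eps=0.001)
```

For fixed c and k, the total is ckn(r + q) with q about log2(k/ε), which matches the O(n(r + log 1/ε)) bound on the slide.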

36
Data Structure Complexity (Cont.)

FIND_MATCH: we'll use the graph again.
We may show that with high probability, for all nonempty A ⊆ L, |N(A)| > c|A| for some constant c > k/2.
For a set A ⊆ L, assume there are a items in N(A) with exactly one neighbor in A, and at least c|A| - a items with more than one neighbor.
The subgraph for A has at least a + 2(c|A| - a) = 2c|A| - a edges.
On the other hand, it has at most k|A| edges.

37
Data Structure Complexity (Cont.)

Each item in A has at most k neighbors.
Since 2c|A| - a ≤ k|A|, we have a ≥ (2c - k)|A|, so the number of items in A that have a neighbor belonging only to them is at least
a/k ≥ (2c - k)|A|/k = (2c/k - 1)|A| = p|A|.
These items are easy matches.
If such a c exists, the run-time of FIND_MATCH is
O(n) + O((1-p)n) + O((1-p)^2 n) + ... = O(n).
That is also the expected run-time of CREATE.
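
The geometric sum behind the O(n) bound can be illustrated numerically. The sketch assumes each round scans all remaining elements and removes at least a p fraction of them; n = 1000 and p = 1/4 are illustrative values.

```python
from fractions import Fraction


def peeling_work(n: int, p: Fraction) -> int:
    """Total elements scanned if each round removes at least a p fraction."""
    work, remaining = 0, n
    while remaining > 0:
        work += remaining
        remaining = int(remaining * (1 - p))   # at most (1 - p) of them survive
    return work


n, p = 1000, Fraction(1, 4)
work = peeling_work(n, p)
```

The total never exceeds n(1 + (1-p) + (1-p)^2 + ...) = n/p, which is O(n) for constant p.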

38
Lower Bounds
39
Deterministic Algorithm

If R = {1, 2, ⊥}, S splits into subsets A and B that map to 1 and 2, respectively.
Even in that case, deterministic Bloomier filtering requires Ω(n + log log N) bits of storage.
Define G, a graph where each node is a vector in {-1, 0, 1}^N with exactly n coordinates equal to 1 and n others equal to -1.
The 1s represent A and the -1s represent B.
Two nodes v and v' are adjacent if the set A of v intersects the set B of v', or vice versa
(if v = (x_1, ..., x_N) and v' = (y_1, ..., y_N), they are adjacent if there is an i such that x_i y_i = -1).
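
The adjacency rule is easy to state in code. The vectors below, with N = 5 and n = 1, are hypothetical examples.

```python
def adjacent(v, w):
    """Two nodes are adjacent if some coordinate has x_i * y_i == -1."""
    return any(x * y == -1 for x, y in zip(v, w))


v = (1, -1, 0, 0, 0)    # A = {1}, B = {2}
w = (-1, 0, 1, 0, 0)    # A = {3}, B = {1}
u = (1, 0, -1, 0, 0)    # A = {1}, B = {3}

v_w = adjacent(v, w)    # item 1 is in A for v but in B for w
v_u = adjacent(v, u)    # no item is in A for one and in B for the other
```

Adjacent nodes disagree on some item, so a single memory configuration cannot answer correctly for both of them.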

40
Deterministic Algorithm (Cont.)

Since the memory is the only source of information about A and B, no 2 adjacent nodes may correspond to the same memory configuration.
The memory size m is at least log χ(G) (χ(G) is the minimum number of colors required to color G).
We'll show that χ(G) is between Ω(2^n log N) and O(n 2^(2n) log N).

41
Lower Bound On χ(G)

For every color c required to color G we have a vector z^c in {-1, 1}^N.
For a node v = (x_1, ..., x_N) with color c, x_i may be 1 (or -1) only if z^c_i is 1 (or -1).
(Such a z^c exists because nodes with the same color are pairwise non-adjacent, so no coordinate is 1 in one of them and -1 in another.)
A set of binary vectors of length l is (l, k) universal if for every choice of k coordinate positions we get all the possible 2^k patterns.
We'll show that the set {z^c} is (N, n) universal if we turn the minus ones into zeroes.
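
The universality condition can be tested by brute force on short vectors. The function name and the example sets are assumptions made for illustration.

```python
from itertools import combinations, product


def is_universal(vectors, k):
    """True if every choice of k coordinates shows all 2**k bit patterns."""
    length = len(vectors[0])
    for positions in combinations(range(length), k):
        seen = {tuple(v[i] for i in positions) for v in vectors}
        if len(seen) < 2 ** k:
            return False
    return True


all_vectors = list(product((0, 1), repeat=3))   # all 8 binary vectors of length 3
two_vectors = [(0, 0, 0), (1, 1, 1)]

full = is_universal(all_vectors, 2)
pair_k2 = is_universal(two_vectors, 2)
pair_k1 = is_universal(two_vectors, 1)
```

The full cube is (3, 2) universal. The two constant vectors show both values on every single coordinate, but on any pair of coordinates they never show the patterns 01 or 10.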

42
Lower Bound On χ(G) (Cont.)

Let i_1, ..., i_n be n coordinate positions.
For each w in {-1, 1}^n we have a node v whose i_1, ..., i_n coordinates match w.
If v is colored in color c, then the i_1, ..., i_n coordinates of z^c match w.
Therefore, for each choice of n coordinate positions we get all the possible patterns.
The size of an (N, n) universal set is Ω(2^n log N), so this is a lower bound on χ(G).

43
Upper Bound On χ(G)

There exists an (N, 2n) universal set of vectors of size O(n 2^(2n) log N).
We'll turn all the zeroes into minus ones.
We'll use that set as {z^c}.
Because the set {z^c} is universal, we may select for each node a vector z^c that matches the 1s and -1s of the node.
c will be the color of the node.

44
Mutable Filtering

If
and the number m of storage bits satisfies
for some large enough constant c, then Bloomier filtering cannot support dynamic updates on a set S of size 2n.
The proof is for the case R = {1, 2, ⊥}, where S splits into subsets A and B of size n that map to 1 and 2, respectively.
We assume the algorithm is randomized.

45
Mutable Filtering (Cont.)

Fix a sequence of random choices made by the algorithm when its input was A and B.
We assume B was a specific set B_org, and we vary A.
For each possible A we have a corresponding memory configuration.
In other words, for each memory configuration we have a family of possible sets A that lead to this configuration.
Let F be the largest family.

46
Mutable Filtering (Cont.)

Now we change B: each possible B_new takes us to a new memory configuration.
For each configuration 1 ≤ i ≤ 2^m there is a family of choices of B_new that lead to it. We denote it by G_i.

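[Diagram: stage I holds the memory configurations reached from (A, B_org); stage II holds those reached after B is changed to B_new]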
47
Mutable Filtering (Cont.)

Given a memory configuration C in II, for any path that leads to it:
B can be the B_new on the path.
For each item in such a set we must answer "in B".
A can be the set on the path before the configuration in I.
For each item in such a set that couldn't be changed to B_new on the path, we must answer "in A".
Suppose in I we were in the configuration that F leads to, and then we randomly chose B_new.
i(B_new) denotes the j such that B_new ∈ G_j.
In II we have to:
Answer "in A" for each item of a set in F that couldn't be changed to B_new.
Answer "in B" for each item of a set in G_i(B_new).