1
Generalized Hashing with Variable-Length Bit Strings
  • Michael Klipper
  • With Dan Blandford and Guy Blelloch

Original source: D. Blandford and G. E.
Blelloch. Storing Variable-Length Keys in
Arrays, Sets, and Dictionaries, with
Applications. In Symposium on Discrete
Algorithms (SODA), 2005 (hopefully)
2
Hashing techniques currently available
  • Many hashing algorithms out there
  • Separate chaining
  • Cuckoo hashing
  • FKS perfect hashing
  • Also many hash functions designed, including
    several universal families
  • O(1) expected amortized time for updates, and
    many have O(1) worst case time for searches
  • They use Ω(n lg n) bits for n entries, since at
    least lg n bits are used per entry to distinguish
    between keys.

3
What kind of bounds do we achieve?
  • Let's say we store n entries in our hashtable of
    the form (s_i, t_i) for i = 0, 1, 2, …, (n-1). Each
    s_i and t_i is a bit string of variable length.
    For our purposes, many of the t_i might be only
    a few bits long.
  • Time for all operations (later slide)
  • O(1) expected amortized
  • Total space used
  • O(Σ_i max(|s_i| - lg n, 1) + |t_i|) bits

4
The Improvement We Attain
  • Let's say we store n entries taking up m total
    bits. In terms of the s_i and t_i values on the
    previous slide,
  • m = Σ_i (|s_i| + |t_i|)
  • Note that m = Ω(n lg n), since at least lg n bits
    per entry are needed to keep the n keys distinct.
  • Thus, our space usage is O(m - n lg n) bits, as
    opposed to the Ω(m) bits that standard hashtable
    structures use.
  • In particular, our structure is much more
    efficient than standard structures when m is
    close to n lg n (for example, when most entries
    are only a few bits long).

5
Goal: Generalized Dynamic Hashtables
  • We want to support the following operations:
  • query(key, keyLength)
  • Looks up the key in the hashtable and returns the
    associated data and its length
  • insert(key, keyLength, data, dataLength)
  • Adds (key, data) as an entry in the hashtable
  • remove(key, keyLength)
  • Removes the key and the associated data
  • NOTE: Each key will have only one entry
    associated with it. Another name for this kind
    of structure is a variable-length dictionary
    structure.

6
Other Structures
  • Variable-Length Sets
  • Also supports query, insert, and remove, though
    there is no extra data associated with keys
  • Can be easily implemented as a generalized
    hashtable that stores no extra data
  • O(1) expected amortized time for all operations
  • If the n keys are s_0, s_1, …, s_(n-1), then the
    total space used in bits is
  • O(Σ_i max(|s_i| - lg n, 1))

7
Other Structures (cont.)
  • Variable-Length Arrays
  • For n entries, the keys are 0, 1, …, n-1.
  • These arrays will not be able to resize their
    number of entries.
  • Operations
  • get(i) returns the data stored at index i and its
    length
  • set(i, val, len) updates the data at index i to
    val of length len
  • Once again, O(1) expected amortized time for
    operations. Total space usage is O(Σ_i |t_i|).

8
Implementation Note
  • Assume for now that we have the variable-length
    array structure described on the previous slide.
    We will use it to build generalized dynamic
    hashtables, which are more interesting than the
    arrays.
  • At the end of this presentation, I can talk about
    implementation of variable-length arrays if time
    permits.

9
The Main Idea Behind How Hashtables Work
  • Our generalized hashtable structure contains a
    variable-length array with 2^q entries (which will
    serve as the buckets for the hashtable). We keep
    2^q approximately equal to n by occasional
    rehashing of the bucket contents.
  • The item (s_i, t_i), where s_i is the key and t_i is
    the data, is placed in a bucket as follows: we
    first hash s_i to some index (more on this later),
    and we write (s_i, t_i) into the bucket specified
    by that index. Note that when we hash s_i, we
    implicitly treat it as an integer.

10
Hashtables (cont.)
  • If several entries of the set collide in a
    bucket, we throw them all into the bucket
    together as one giant concatenated bit string.
    Thus, we essentially use a separate-chaining
    algorithm.
  • To tell where one entry ends and another
    begins, we encode the entries with a prefix-free
    code (such as Huffman codes or gamma codes).

Sample bucket, stored as one concatenated bit
string (where enc(x) denotes x under the
prefix-free code):
enc(s_1) enc(t_1) enc(s_2) enc(t_2) enc(s_3) enc(t_3)
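To make the encoding concrete, here is a minimal sketch in Python
(my own illustration; the function names and the choice of Elias
gamma codes for lengths are assumptions, since the slides name gamma
codes only as one option). Each entry x is stored as gamma(|x|)
followed by the raw bits of x, so the concatenation is
self-delimiting. The slides' O(1) decoding relies on table lookup
over whole words; this sketch decodes bit by bit for clarity.

  def gamma_encode(x):
      # Elias gamma code for an integer x >= 1:
      # (len-1) zeros, then x in binary (which starts with a 1).
      b = bin(x)[2:]
      return "0" * (len(b) - 1) + b

  def gamma_decode(bits, pos):
      # Returns (value, next position); prefix-free, so no separators.
      z = 0
      while bits[pos + z] == "0":
          z += 1
      return int(bits[pos + z:pos + 2 * z + 1], 2), pos + 2 * z + 1

  def encode_entry(x):
      # A nonempty bit string x is stored as gamma(|x|) then x itself.
      return gamma_encode(len(x)) + x

  def decode_entry(bits, pos):
      n, pos = gamma_decode(bits, pos)
      return bits[pos:pos + n], pos + n

  # A bucket is just the concatenation of encoded entries:
  bucket = encode_entry("10110") + encode_entry("0110")
  s, pos = decode_entry(bucket, 0)      # s == "10110"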
11
Time and Space Bounds
  • Note that we use prefix-free codes that use only
    a constant factor more space (i.e., they encode m
    bits in O(m) space) and can be encoded/decoded in
    O(1) time.
  • Time: If we use a universal hash function to
    determine the bucket index, then each bucket
    receives only a constant expected number of
    elements, so it takes O(1) expected amortized
    time to find an element in a bucket. The
    prefix-free codes we use allow O(1) decoding of
    any element.
  • Space: The prefix-free codes increase the number
    of bits stored by at most a constant factor. If
    we have m bits total we want to store, our space
    bound for variable-length arrays says that the
    buckets take up O(m) bits.

12
There's a bit more to it than that
  • Recall the space bound for the hash table is
  • O(Σ_i max(|s_i| - lg n, 1) + |t_i|).
  • Where does the lg n savings per entry come from?
  • We use a technique called quotienting.
  • We actually use two hash functions, h and h'.
    h(s_i) is the bucket index, and h'(s_i) has
    length max(|s_i| - q, 1). (Recall that 2^q is
    approximately n.)
  • Instead of writing (s_i, t_i) in the bucket, we
    actually write (h'(s_i), t_i). This way, each
    entry needs |h'(s_i)| + |t_i| bits to write, which
    fulfills our space bound above.

13
A Quotienting Scheme
  • Let h_0 be a hash function from a universal family
    whose range is q bits. We describe a way to make
    a family of hash functions from the family from
    which h_0 is drawn.
  • Let s_i^t be the q most significant bits of s_i,
    and let s_i^b be the other bits.
  • We define our hash functions as follows:
  • h'(s_i) = s_i^b
  • h(s_i) = h_0(s_i^b) xor s_i^t

Example: s_i = 101101 001010100100101, so
s_i^t = 101101 and s_i^b = 001010100100101.
With h_0(s_i^b) = 010011, the bucket index is
h(s_i) = 010011 xor 101101 = 111110, and
h'(s_i) = 001010100100101.
14
Undoing the Quotienting
  • In the previous example, we saw that h(s_i)
    evaluated to 111110, or 62. This means we store
    h'(s_i) in bucket number 62!
  • Note that given h(s_i) and h'(s_i) we can
    retrieve s_i, because
  • s_i^b = h'(s_i)
  • and
  • s_i^t = h_0(h'(s_i)) xor h(s_i).
  • The family of h functions we make is another
    universal family, so our time bound explained
    earlier still holds.
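A minimal sketch of this scheme in Python, using the slides' 6-bit
example (the toy h0 below is my own stand-in for a function drawn
from a universal family, and the sketch assumes |s| > q):

  Q = 6                      # bucket-index length; table has 2^Q buckets

  def h0(bits):
      # Stand-in for a hash with a Q-bit range (assumption: the real
      # structure draws h0 from a universal family).
      return (int(bits, 2) * 2654435761) % (1 << Q)

  def split(s):
      return s[:Q], s[Q:]    # s^t = top Q bits, s^b = the rest

  def h(s):
      # Bucket index: h(s) = h0(s^b) xor s^t.
      s_t, s_b = split(s)
      return h0(s_b) ^ int(s_t, 2)

  def h_prime(s):
      # Quotient actually written into the bucket: h'(s) = s^b.
      return split(s)[1]

  def recover(bucket_index, quotient):
      # Undo the quotienting: s^t = h0(h'(s)) xor h(s), then s = s^t s^b.
      s_t = h0(quotient) ^ bucket_index
      return format(s_t, "0{}b".format(Q)) + quotient

  s = "101101001010100100101"
  assert recover(h(s), h_prime(s)) == s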

15
An Application of Hashtables: Graph Structures
  • One area where we will be able to use the
    hashtable structure is in storing graphs. Here,
    we describe a semidynamic directed-graph
    implementation. This means that the number of
    vertices is fixed, but edges can be added or
    deleted at runtime.
  • Let u and v be vertices of a graph. We want the
    following operations, compactly and in O(1)
    expected amortized time:
  • deg(v) - get the degree of vertex v
  • adjacent(u, v) - returns true iff u and v are
    adjacent
  • firstEdge(v) - returns the first neighbor of v in
    G
  • nextEdge(u, v) - returns the next neighbor of u
    after v (assumes u and v are adjacent)
  • addEdge(u, v) - adds an edge from u to v in G
  • deleteEdge(u, v) - deletes the edge (u, v) from G

16
Hashing Integers
  • Up to now, we have used bit strings as the main
    objects in the hashtable. It will also be useful
    to hash on integer values. Hence, we have
    created some utilities to convert between bit
    strings and integers using as few bits as
    possible, so an integer x takes basically lg x
    bits to write as a bit string.
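The slides don't show the conversion itself; one standard bijection
between non-negative integers and bit strings that achieves roughly
lg x bits is to write x + 1 in binary and drop the leading 1 (a
sketch of one possible utility, not necessarily the paper's):

  def int_to_bits(x):
      # Bijection from integers x >= 0 to all bit strings (incl. "").
      # Uses floor(lg(x + 1)) bits, i.e. basically lg x bits.
      return bin(x + 1)[3:]        # strip "0b" and the leading 1

  def bits_to_int(s):
      return int("1" + s, 2) - 1

  assert all(bits_to_int(int_to_bits(x)) == x for x in range(1000))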

17
A Graph Layout Where We Store Edges in a Hashtable
  • Let's say u is a vertex of degree d and v_1, …, v_d
    are its neighbors. Let's say that v_0 = v_(d+1) = u
    by convention.
  • Then the entry representing the edge (u, v_i) has
    key (u, v_i) and data (v_(i-1), v_(i+1)).

Hash Table (for a vertex u of degree 4 with
neighbors v_1, v_2, v_3, v_4):
key (u, u)   -> (v_4, v_1, 4)   (this extra entry
starts the list and stores the degree of the vertex)
key (u, v_1) -> (u, v_2)
key (u, v_2) -> (v_1, v_3)
key (u, v_3) -> (v_2, v_4)
key (u, v_4) -> (v_3, u)
18
Implementations of a Couple Operations
  • For simplicity, I'm leaving off the length
    arguments in query() and insert(). (A runnable
    sketch of these operations follows below.)
  • adjacent(u, v)
  • return (query((u, v)) != -1)
  • firstEdge(u)
  • let (vp, vn, d) = query((u, u))
  • return vn
  • addEdge(u, v)
  • let (vp, vn, d) = query((u, u))
  • remove((u, u))
  • insert((u, u), (vp, v, d + 1))
  • insert((u, v), (u, vn))
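Here is a runnable Python sketch of the same operations, with an
ordinary dict standing in for the compact hashtable (the dict, the
sentinel initialization, and the prev-pointer repair in add_edge are
my simplifications; the slide's pseudocode omits that bookkeeping):

  def make_vertex(H, u):
      # Sentinel entry with key (u, u): (prev, next, degree).
      H[(u, u)] = (u, u, 0)

  def adjacent(H, u, v):
      return (u, v) in H

  def first_edge(H, u):
      vp, vn, d = H[(u, u)]
      return vn

  def next_edge(H, u, v):
      # Next neighbor of u after v (assumes (u, v) is an edge).
      prev, nxt = H[(u, v)]
      return nxt

  def add_edge(H, u, v):
      vp, vn, d = H[(u, u)]
      H[(u, u)] = (vp, v, d + 1)      # v becomes the first neighbor
      H[(u, v)] = (u, vn)             # prev = u (head), next = old first
      if vn != u:                     # repair the old first entry's
          p, n = H[(u, vn)]           # prev pointer
          H[(u, vn)] = (v, n)

  H = {}
  make_vertex(H, "u")
  for w in ("v1", "v2", "v3"):
      add_edge(H, "u", w)
  assert adjacent(H, "u", "v2") and first_edge(H, "u") == "v3"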

19
Compression and Space Usage
  • Instead of ((u, v_i), (v_(i-1), v_(i+1))) in the
    table, we will store
  • ((u, v_i - u), (v_(i-1) - u, v_(i+1) - u))
  • With this representation, we need
    O(Σ_((u,v)∈E) lg |u - v|) space.
  • A good labeling of the vertices will make many of
    these differences small. For instance, for many
    classes of graphs, such as planar graphs, the
    total space used is O(n) bits! The following
    paper has details (a sketch of the difference
    encoding follows the citation):

D. Blandford, G. E. Blelloch, and I. Kash.
Compact Representations of Separable Graphs. In
SODA, 2003, pages 342-351.
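A small sketch of the difference idea, reusing gamma_encode from the
earlier slide (the zigzag mapping of signed differences to positive
integers is my choice; the paper may use a different signed encoding):

  def zigzag(d):
      # Map a signed difference to a positive integer:
      # 0, -1, 1, -2, 2, ... -> 1, 2, 3, 4, 5, ...
      return 2 * d + 1 if d >= 0 else -2 * d

  def encode_neighbor(u, v):
      # Store v relative to u in O(lg |u - v|) bits.
      return gamma_encode(zigzag(v - u))

  print(encode_neighbor(1000, 1002))   # a few bits: labels are close
  print(encode_neighbor(1000, 5))      # many more bits: labels are far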
20
More Details about Implementing Arrays
We'll use the following data for our example in
these slides:
t_0 = 10110, t_1 = 0110, t_2 = 11111, t_3 = 0101,
t_4 = 1100, t_5 = 010, t_6 = 11011, t_7 = 00001111
We'll assume that the word size is 2 bytes.
21
Key Idea: BLOCKS
  • Multiple data items can be crammed into a word,
    so let's take advantage of that.
  • There are many possible ways to store data in
    blocks. The way that I'll discuss here is to use
    two words per block: one stores data and one
    marks the separation of entries.

Example: the block b_0, containing strings t_0
through t_2 from our example.
1st word (data): 10110 0110 11111 (t_0 t_1 t_2
concatenated; the last 2 bits are unused)
2nd word (separators): 1000010001000010 (a 1 marks
the start of each entry, plus one marking the end
of the used bits)
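A sketch that builds the two words of a block from the example
entries (bit strings as Python strings; the real structure packs
bits into machine words):

  W = 16                               # word size: 2 bytes

  def make_block(entries):
      data = "".join(entries)
      assert len(data) <= W            # entries must fit in the data word
      sep = ["0"] * W
      pos = 0
      for t in entries:                # mark where each entry starts...
          sep[pos] = "1"
          pos += len(t)
      if pos < W:
          sep[pos] = "1"               # ...and where the used bits end
      return data.ljust(W, "0"), "".join(sep)

  data, sep = make_block(["10110", "0110", "11111"])
  assert sep == "1000010001000010"     # matches the figure above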
22
Blocks continued
  • We'll name a block b_i if i is the first entry
    number to be stored in that block. The size of a
    block is the sum of the sizes of the entries
    inside it.
  • We'll maintain a size invariant:
  • for any adjacent blocks b_i and b_j, |b_i| + |b_j|
    is at least a full word.
  • Note: splitting and merging blocks is easy.
  • We assume these things for now:
  • Entries fit into a word (we can handle longer
    entries by storing a pointer to separate memory
    in their place)
  • Entries are nonempty

23
Organization of Blocks
  • We have a bit array A of length n (this is a
    regular old C array). A[i] = 1 if and only if
    string i starts a block. This is our indexing
    structure.
  • We also have a standard hashtable H. If string
    i starts a block, then H(i) = the address of b_i.
    We assume H is computed in O(1) expected
    amortized time.
  • Blocks are large enough that storing them in H
    only increases the space usage by a constant
    factor.

Example: A = 10010001, with 1s at indices 0, 3,
and 7. H(0) points to b_0, which holds t_0 t_1 t_2;
H(3) points to b_3, which holds t_3 t_4 t_5 t_6;
and H(7) points to b_7, which holds t_7. In this
example, b_0 and b_3 are adjacent blocks, as are
b_3 and b_7.
24
A Note about Space Usage
  • Any two 1s in the indexing structure A are
    separated by at most one word's worth of
    positions (at most w indices apart). This is
    because entries are nonempty and a block holds
    only one word of entry data, so a block contains
    at most w entries.

25
The get() operation
  • Since bits that are turned on in A are close
    together, we can find the block to which an entry
    belongs in O(1) time. One way to do this is table
    lookup.
  • If the ith entry is in block b_k, then the ith
    entry of the array is the (i - k + 1)st entry in
    that block.
  • By using table lookup, we can find where the
    correct 1s in the second word are, which tell us
    where the entry starts and ends.

26
A picture of the get() operation, illustrated
with get(2):
To find entry 2, we scan A backward from index 2
(A[2] = 0, A[1] = 0, A[0] = 1), so entry 2 is in
block b_0, which we reach via H(0). It is the
(2 - 0 + 1) = 3rd entry in that block. In the
separator word 1000010001000010, the 3rd and 4th
1s sit at positions 9 and 14, marking where the
entry starts and ends.
Conclusion: Entry 2 is 5 bits long. It is 11111.
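Putting the pieces together, a sketch of get() over the toy
representation from the earlier sketches (A as a list of 0/1 flags
and H as a dict from entry number to block handle are my
simplifications; the real structure does the backward scan and the
separator search with O(1) table lookups on whole words):

  def get(A, H, blocks, i):
      k = i                       # scan back to the nearest 1 in A;
      while A[k] == 0:            # 1s are at most a word apart, so this
          k -= 1                  # is O(1) with table lookup
      data, sep = blocks[H[k]]
      # Entry i is the (i - k + 1)st entry of block b_k; its start and
      # end are the (i - k + 1)st and following 1 in the separator word
      # (an appended 1 covers a block whose entries fill the word).
      ones = [p for p, bit in enumerate(sep + "1") if bit == "1"]
      start, end = ones[i - k], ones[i - k + 1]
      return data[start:end]

  A = [1, 0, 0, 1, 0, 0, 0, 1]
  H = {0: 0, 3: 1, 7: 2}
  blocks = [make_block(["10110", "0110", "11111"]),
            make_block(["0101", "1100", "010", "11011"]),
            make_block(["00001111"])]
  assert get(A, H, blocks, 2) == "11111"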
27
How set() works in a nutshell
  1. Find the block with the entry.
  2. Rewrite it.
  3. If the block is too large, split it into two.
  4. Merge adjacent blocks together to preserve the
    size invariant. (See the sketch below.)
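A sketch of these four steps over a simplified representation (each
block as a Python list of entry strings, so splitting and merging
are list operations; updating A and H, which the real structure must
also do, is omitted):

  W = 16                                    # word size in bits

  def bits(blk):                            # total bits in a block
      return sum(len(t) for t in blk)

  def set_entry(blocks, i, val):
      k = 0                                 # 1. find the block with entry i
      while i >= len(blocks[k]):
          i -= len(blocks[k])
          k += 1
      blocks[k][i] = val                    # 2. rewrite it
      if bits(blocks[k]) > W:               # 3. split an overfull block
          blk = blocks.pop(k)
          half = len(blk) // 2
          blocks[k:k] = [blk[:half], blk[half:]]
      for j in (k, k - 1):                  # 4. merge neighbors that now
          if 0 <= j < len(blocks) - 1 and \
             bits(blocks[j]) + bits(blocks[j + 1]) < W:
              blocks[j:j + 2] = [blocks[j] + blocks[j + 1]]

  blocks = [["10110", "0110", "11111"],
            ["0101", "1100", "010", "11011"], ["00001111"]]
  set_entry(blocks, 5, "0")                 # rewrite t_5 in place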

28
Now, to prove the theorem about space usage for
arrays
  • Let m = Σ_i |t_i| and w = the machine word size. I
    claim the total number of bits used is O(m).
  • Our size invariant for blocks guarantees that, on
    average, blocks are at least half full. Thus,
    there are O(m / w) blocks used, since there are m
    bits of data in total and each block stores Ω(w)
    bits on average.
  • Our indexing structure A and hashtable H use O(w)
    bits per block (O(1) words). Total bits:
  • O(m / w) blocks × O(w) per block = O(m) bits.

29
A note about entries longer than w bits
  • What is really done in our code with entries
    longer than w bits is not just allocating
    separate memory and putting a pointer in the
    array, though it's close.
  • We do essentially what standard structures do:
    we chain the words making up our entry into a
    linked list. We have a clever way to do this
    which doesn't need w-bit pointers; instead we
    only need 7 or 8 bits for a pointer.