Title: Generalized Hashing with Variable-Length Bit Strings
1. Generalized Hashing with Variable-Length Bit Strings
- Michael Klipper
- with Dan Blandford and Guy Blelloch
Original source: D. Blandford and G. E. Blelloch. Storing Variable-Length Keys in Arrays, Sets, and Dictionaries, with Applications. In Symposium on Discrete Algorithms (SODA), 2005 (hopefully).
2. Hashing techniques currently available
- Many hashing algorithms are out there:
- Separate chaining
- Cuckoo hashing
- FKS perfect hashing
- Many hash functions have also been designed, including several universal families.
- These give O(1) expected amortized time for updates, and many have O(1) worst-case time for searches.
- However, they use Ω(n lg n) bits for n entries, since at least lg n bits are used per entry to distinguish between keys.
3. What kind of bounds do we achieve?
- Let's say we store n entries in our hashtable of the form (s_i, t_i) for i = 0, 1, ..., n-1, where each s_i and t_i is a bit string of variable length. For our purposes, many of the t_i might be only a few bits long.
- Time for all operations (see a later slide): O(1) expected amortized.
- Total space used: O(Σ_i (max(|s_i| - lg n, 1) + |t_i|)) bits.
4. The Improvement We Attain
- Let's say we store n entries taking up m total bits. In terms of the s_i and t_i values on the previous slide, m = Σ_i (|s_i| + |t_i|).
- Note that m = Ω(n lg n), since distinguishing n keys takes at least lg n bits per key on average.
- Thus, our space usage is O(m - n lg n) bits, as opposed to the Ω(m) bits that standard hashtable structures use.
- In particular, our structure is much more efficient than standard structures when m is close to n lg n (for example, when most entries are only a few bits long).
5. Goal: Generalized Dynamic Hashtables
- We want to support the following operations (a toy model follows the list):
- query(key, keyLength): looks up the key in the hashtable and returns the associated data and its length
- insert(key, keyLength, data, dataLength): adds (key, data) as an entry in the hashtable
- remove(key, keyLength): removes the key and its associated data
- NOTE: Each key has only one entry associated with it. Another name for this kind of structure is a variable-length dictionary.
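To make the interface concrete, here is a toy model in C++ (my sketch, not the paper's code): keys and data are passed as strings of '0'/'1' characters, so the explicit length arguments become string lengths, and a plain std::unordered_map stands in for the compact structure. It shows the semantics only, none of the space bounds.

    #include <iostream>
    #include <string>
    #include <unordered_map>

    // Toy variable-length dictionary: correct semantics, none of the
    // space savings of the real structure.
    class VarLenDictionary {
        std::unordered_map<std::string, std::string> entries;
    public:
        // Returns false if the key is absent; otherwise fills in the
        // associated data and its length in bits.
        bool query(const std::string& key, std::string& data, size_t& len) const {
            auto it = entries.find(key);
            if (it == entries.end()) return false;
            data = it->second;
            len = data.size();
            return true;
        }
        // Each key has exactly one entry, so insert overwrites.
        void insert(const std::string& key, const std::string& data) {
            entries[key] = data;
        }
        void remove(const std::string& key) { entries.erase(key); }
    };

    int main() {
        VarLenDictionary d;
        d.insert("10110", "01");
        std::string data; size_t len;
        if (d.query("10110", data, len))
            std::cout << data << " (" << len << " bits)\n";   // prints: 01 (2 bits)
    }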
6. Other Structures
- Variable-Length Sets
- These also support query, insert, and remove, though there is no extra data associated with the keys.
- They can easily be implemented as a generalized hashtable that stores no extra data.
- O(1) expected amortized time for all operations.
- If the n keys are s_0, s_1, ..., s_{n-1}, then the total space used in bits is O(Σ_i max(|s_i| - lg n, 1)).
7. Other Structures (cont.)
- Variable-Length Arrays
- For n entries, the keys are 0, 1, ..., n-1.
- These arrays will not be able to resize their number of entries.
- Operations:
- get(i): returns the data stored at index i and its length
- set(i, val, len): updates the data at index i to val, of length len
- Once again, O(1) expected amortized time for operations. Total space usage is O(Σ_i |t_i|).
8. Implementation Note
- Assume for now that we have the variable-length array structure described on the previous slide. We will use it to build generalized dynamic hashtables, which are more interesting than the arrays.
- At the end of this presentation, I can talk about the implementation of variable-length arrays if time permits.
9. The Main Idea Behind How Hashtables Work
- Our generalized hashtable structure contains a variable-length array with 2^q entries (which serve as the buckets for the hashtable). We keep 2^q approximately equal to n by occasionally rehashing the bucket contents.
- The item (s_i, t_i), where s_i is the key and t_i is the data, is placed in a bucket as follows: we first hash s_i to some index (more on this later), and we write (s_i, t_i) into the bucket specified by that index. Note that when we hash s_i, we implicitly treat it as an integer.
10. Hashtables (cont.)
- If several entries collide in a bucket, we throw them all into the bucket together as one giant concatenated bit string. Thus, we essentially use a separate-chaining algorithm.
- To tell where one entry ends and the next begins, we encode the entries with a prefix-free code (such as Huffman codes or gamma codes); a small sketch of one such code follows.
- Sample bucket (where s_1' is s_1 encoded, etc.): s_1' t_1' s_2' t_2' s_3' t_3'
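As a concrete illustration, here is a minimal C++ sketch of the Elias gamma code, one of the prefix-free codes mentioned above: an integer x >= 1 is written as floor(lg x) zeros followed by the binary form of x. Bits are held in a std::string for clarity; the real structure packs them into machine words and, as the next slide notes, decodes in O(1) time by table lookup rather than bit by bit.

    #include <cstdint>
    #include <iostream>
    #include <string>

    // Elias gamma code: a prefix-free code for integers x >= 1.
    // x is written as floor(lg x) zeros followed by the binary form of x,
    // so a code word costs 2*floor(lg x) + 1 = O(lg x) bits.
    std::string gammaEncode(uint64_t x) {              // assumes x >= 1
        int k = 63 - __builtin_clzll(x);               // k = floor(lg x)
        std::string bits(k, '0');                      // k leading zeros
        for (int i = k; i >= 0; --i)                   // then x in binary, MSB first
            bits.push_back(((x >> i) & 1) ? '1' : '0');
        return bits;
    }

    // Decodes one value starting at position pos and advances pos past it,
    // which is what lets concatenated entries in a bucket be pulled apart.
    uint64_t gammaDecode(const std::string& bits, size_t& pos) {
        int k = 0;
        while (bits[pos] == '0') { ++k; ++pos; }       // count the leading zeros
        uint64_t x = 0;
        for (int i = 0; i <= k; ++i)                   // read the k+1 binary digits
            x = (x << 1) | (uint64_t)(bits[pos++] - '0');
        return x;
    }

    int main() {
        std::string bucket = gammaEncode(9) + gammaEncode(1) + gammaEncode(5);
        size_t pos = 0;
        while (pos < bucket.size())
            std::cout << gammaDecode(bucket, pos) << ' ';  // prints: 9 1 5
        std::cout << '\n';
    }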
11. Time and Space Bounds
- Note that we use prefix-free codes that cost only a constant factor more space (i.e., they encode m bits in O(m) space) and can be encoded/decoded in O(1) time.
- Time: If we use a universal hash function to determine the bucket index, then each bucket receives only a constant expected number of elements, so it takes O(1) expected amortized time to find an element in a bucket. The prefix-free codes we use allow O(1) decoding of any element.
- Space: The prefix-free codes increase the number of bits stored by at most a constant factor. If we have m bits in total to store, our space bound for variable-length arrays says that the buckets take up O(m) bits.
12. There's a bit more to it than that
- Recall that the space bound for the hash table is O(Σ_i (max(|s_i| - lg n, 1) + |t_i|)).
- Where does the lg n savings per entry come from? We use a technique called quotienting.
- We actually use two hash functions, h and h'. h(s_i) is the bucket index, and h'(s_i) has length max(|s_i| - q, 1). (Recall that 2^q is approximately n.)
- Instead of writing (s_i, t_i) in the bucket, we actually write (h'(s_i), t_i). This way, each entry needs |h'(s_i)| + |t_i| bits to write, which fulfills the space bound above.
13. A Quotienting Scheme
- Let h0 be a hash function from a universal family whose range is q bits. We describe a way to make a family of hash functions from the family from which h0 is drawn.
- Let s_i^t be the q most significant bits of s_i, and let s_i^b be the other bits.
- We define our hash functions as follows:
- h'(s_i) = s_i^b
- h(s_i) = h0(s_i^b) xor s_i^t
Worked example: s_i = 101101 001010100100101, so s_i^t = 101101 and s_i^b = 001010100100101. If h0(s_i^b) = 010011, then h(s_i) = 010011 xor 101101 = 111110, and h'(s_i) = 001010100100101.
14. Undoing the Quotienting
- In the previous example, h(s_i) evaluated to 111110, or 62. This means we store h'(s_i) in bucket number 62!
- Note that given h(s_i) and h'(s_i), we can retrieve s_i, because s_i^b = h'(s_i) and s_i^t = h0(h'(s_i)) xor h(s_i).
- The family of h functions we make is another universal family, so the time bound explained earlier still holds.
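Here is a minimal C++ sketch of the whole scheme, under two assumptions: keys fit in 64 bits with |s_i| > q, and a toy multiplicative hash stands in for h0 (any universal hash with a q-bit range plays the same role).

    #include <cstdint>
    #include <iostream>

    // q-bit bucket indices, so the table has 2^q buckets (2^q ~ n).
    const int q = 6;

    // Toy stand-in for the universal hash h0 with a q-bit range; this
    // particular multiplicative hash is an assumption for illustration only.
    uint64_t h0(uint64_t x) {
        return (x * 0x9E3779B97F4A7C15ULL) >> (64 - q);
    }

    struct Quotiented { uint64_t bucket, quotient; };

    // Split an n-bit key s (n > q) into its q most significant bits s_t and
    // the remaining bits s_b; then
    //   h(s)  = h0(s_b) xor s_t   is the bucket index (q bits), and
    //   h'(s) = s_b               is the quotient stored there (n - q bits).
    Quotiented quotient(uint64_t s, int n) {
        uint64_t s_t = s >> (n - q);
        uint64_t s_b = s & ((1ULL << (n - q)) - 1);
        return { h0(s_b) ^ s_t, s_b };
    }

    // Undo the quotienting: s_b = h'(s) and s_t = h0(h'(s)) xor h(s).
    uint64_t unquotient(Quotiented p, int n) {
        uint64_t s_t = h0(p.quotient) ^ p.bucket;
        return (s_t << (n - q)) | p.quotient;
    }

    int main() {
        uint64_t s = 0b101101001010100100101ULL;      // the 21-bit key from slide 13
        Quotiented p = quotient(s, 21);
        std::cout << "bucket " << p.bucket << ", key recovered: "
                  << (unquotient(p, 21) == s) << '\n';  // key recovered: 1
    }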
15. An Application of Hashtables: Graph Structures
- One area where we can use the hashtable structure is in storing graphs. Here, we describe a semidynamic directed-graph implementation: the number of vertices is fixed, but edges can be added or deleted at runtime.
- Let u and v be vertices of a graph. We want the following operations to run compactly and in O(1) expected amortized time:
- deg(v): get the degree of vertex v
- adjacent(u, v): returns true iff u and v are adjacent
- firstEdge(v): returns the first neighbor of v in G
- nextEdge(u, v): returns the next neighbor of u after v (assumes u and v are adjacent)
- addEdge(u, v): adds an edge from u to v in G
- deleteEdge(u, v): deletes the edge (u, v) from G
16. Hashing Integers
- Up to now, we have used bit strings as the main objects in the hashtable. It will also be useful to hash on integer values. Hence, we have created some utilities to convert between bit strings and integers using as few bits as possible, so an integer x takes essentially lg x bits to write as a bit string.
17. A Graph Layout Where We Store Edges in a Hashtable
- Let's say u is a vertex of degree d and v_1, ..., v_d are its neighbors, with v_0 = v_{d+1} = u by convention.
- Then the entry representing the edge (u, v_i) has key (u, v_i) and data (v_{i-1}, v_{i+1}).
Hash table contents for a vertex u with neighbors v_1, v_2, v_3, v_4 (degree of vertex: 4):
(u, u) -> (v_4, v_1) and the degree 4; this extra entry starts the list
(u, v_1) -> (u, v_2)
(u, v_2) -> (v_1, v_3)
(u, v_3) -> (v_2, v_4)
(u, v_4) -> (v_3, u)
18. Implementations of a Couple of Operations
- For simplicity, I'm leaving off the length arguments in query() and insert(). A fuller, runnable sketch follows the pseudocode.
- adjacent(u, v):
- return (query((u, v)) != -1)
- firstEdge(u):
- let (vp, vn, d) = query((u, u))
- return vn
- addEdge(u, v):
- let (vp, vn, d) = query((u, u))
- remove((u, u))
- insert((u, u), (vp, v, d + 1))
- insert((u, v), (u, vn))
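Below is a runnable C++ rendering of these operations. A plain std::unordered_map stands in for the compact dictionary, so it shows the linked structure rather than the space bound; it also patches the old first neighbor's predecessor link, a bookkeeping step the simplified pseudocode above leaves out.

    #include <cstdint>
    #include <iostream>
    #include <unordered_map>

    // Each entry holds a neighbor's predecessor and successor in u's list;
    // deg is used only by the header entry with key (u, u).
    struct Entry { int prev, next, deg; };

    std::unordered_map<uint64_t, Entry> table;

    // Packs the key pair (u, v) into one 64-bit table key.
    uint64_t key(int u, int v) { return ((uint64_t)(uint32_t)u << 32) | (uint32_t)v; }

    bool adjacent(int u, int v) { return table.count(key(u, v)) != 0; }

    int firstEdge(int u) { return table.at(key(u, u)).next; }

    void addEdge(int u, int v) {                  // v becomes u's new first neighbor
        Entry head = table.at(key(u, u));
        table[key(u, u)] = { head.prev, v, head.deg + 1 };
        table[key(u, v)] = { u, head.next, 0 };   // prev = u by the v_0 = u convention
        if (head.next != u)                       // patch the old first neighbor's
            table.at(key(u, head.next)).prev = v; // prev link (omitted on the slide)
    }

    int main() {
        table[key(0, 0)] = { 0, 0, 0 };           // header: vertex 0, empty list
        addEdge(0, 5);
        addEdge(0, 7);
        std::cout << adjacent(0, 5) << ' ' << firstEdge(0) << '\n';  // prints: 1 7
    }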
19. Compression and Space Usage
- Instead of ((u, v_i), (v_{i-1}, v_{i+1})) in the table, we will store ((u, v_i - u), (v_{i-1} - u, v_{i+1} - u)).
- With this representation, we need O(Σ_{(u,v) ∈ E} lg |u - v|) space.
- A good labeling of the vertices will make many of these differences small. For instance, for many classes of graphs, such as planar graphs, the total space used is O(n) bits! The following paper has details:
D. Blandford, G. E. Blelloch, and I. Kash. Compact Representations of Separable Graphs. In SODA, 2003, pages 342-351.
20. More Details about Implementing Arrays
We'll use the following data for our example in these slides:
t0 = 10110, t1 = 0110, t2 = 11111, t3 = 0101, t4 = 1100, t5 = 010, t6 = 11011, t7 = 00001111
We'll assume that the word size is 2 bytes.
21. Key Idea: BLOCKS
- Multiple data items can be crammed into a word, so let's take advantage of that.
- There are many possible ways to store data in blocks. The way that I'll discuss here uses two words per block: one stores data, and one marks the separation of entries.
Example (block b0, holding strings t0 through t2 from our data):
1st word: 10110 0110 11111 00 (the entries packed together; the last 2 bits are unused)
2nd word: 1000 0100 0100 0010 (a 1 at each entry's start, plus one just past the last entry)
22. Blocks continued
- We name a block b_i if i is the first entry number stored in that block. The size of a block is the sum of the sizes of the entries inside it.
- We maintain a size invariant: for any adjacent blocks b_i and b_j, |b_i| + |b_j| is at least a full word.
- Note: splitting and merging blocks is easy.
- We assume these things for now (a packing sketch follows the list):
- Entries fit into a word (we can handle longer entries by storing a pointer to separate memory in their place)
- Entries are nonempty
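Here is a small C++ sketch of the two-word layout, built around the 16-bit words of the running example; the explicit end marker and the fit check are my simplifications of the real structure's splitting logic.

    #include <cstdint>
    #include <cstdio>

    // One block = two 16-bit words: `data` packs the entries back to back
    // (MSB first); `seps` has a 1 at each entry's start plus one just past
    // the last entry, so entry boundaries can be read off.
    struct Block { uint16_t data = 0, seps = 0; int used = 0; };

    // Appends a len-bit entry whose value sits in the low bits of val.
    // Requiring room for the end marker is a simplification; the real
    // structure splits blocks instead of refusing entries.
    bool append(Block& b, uint16_t val, int len) {
        if (b.used + len >= 16) return false;
        b.seps |= (uint16_t)(1u << (15 - b.used));          // mark entry start
        b.data |= (uint16_t)(val << (16 - b.used - len));   // pack the bits
        b.used += len;
        b.seps |= (uint16_t)(1u << (15 - b.used));          // end-of-entries mark
        return true;
    }

    int main() {
        Block b;                               // rebuild block b0 from the example
        append(b, 0b10110, 5);                 // t0
        append(b, 0b0110, 4);                  // t1
        append(b, 0b11111, 5);                 // t2
        printf("%04x %04x\n", b.data, b.seps); // prints: b37c 8442 (the words above)
    }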
23. Organization of Blocks
- We have a bit array A of length n (this is a regular old C array). A[i] = 1 if and only if string i starts a block. This is our indexing structure.
- We also have a standard hashtable H. If string i starts a block, then H(i) = the address of b_i. We assume H is computed in O(1) expected amortized time.
- Blocks are large enough that storing them in H only increases the space usage by a constant factor.
Example: here b0 and b3 are adjacent blocks, as are b3 and b7.
A = 1 0 0 1 0 0 0 1
H(0) -> b0 = (t0 t1 t2), H(3) -> b3 = (t3 t4 t5 t6), H(7) -> b7 = (t7)
24. A Note about Space Usage
- Any two 1s in the indexing structure A are separated by at most one word's worth of positions. This is because entries are nonempty and a block holds only one word of entries, so no block contains more than w entries.
25. The get() operation
- Since the bits that are turned on in A are close together, we can find the block to which an entry belongs in O(1) time. One way to do this is table lookup.
- If the ith entry is in block b_k, then the ith entry of the array is the (i - k + 1)st entry in that block.
- By using table lookup, we can find the correct 1s in the second word, which tell us where the entry starts and ends.
26. A picture of the get() operation, illustrated with get(2)
- To find entry 2, we look in block b0: A[2] = 0, and the nearest 1 at or before index 2 is A[0], so entry 2 lives in b0, whose address is H(0).
- Entry 2 is the 3rd entry of b0, so in the second word, 1000 0100 0100 0010, the 3rd and 4th 1s (positions 9 and 14) mark where it starts and ends in the first word, 10110 0110 11111 00.
- Conclusion: Entry 2 is 5 bits long. It is 11111.
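A matching C++ sketch of this extraction, using the same two-word layout as the packing sketch after slide 22; a real implementation would do the select1 step by table lookup to get O(1) time instead of scanning bits.

    #include <cstdint>
    #include <iostream>

    // Position of the (k+1)-st set bit of w, counting from the MSB.
    int select1(uint16_t w, int k) {
        for (int pos = 0; pos < 16; ++pos)
            if (((w >> (15 - pos)) & 1) && k-- == 0) return pos;
        return -1;                                 // fewer than k+1 set bits
    }

    // Fetches entry j of a block (assumes the entry exists).
    void getEntry(uint16_t data, uint16_t seps, int j, uint16_t& val, int& len) {
        int start = select1(seps, j);
        int end   = select1(seps, j + 1);          // the next 1 marks where it ends
        len = end - start;
        val = (uint16_t)((data >> (16 - end)) & ((1u << len) - 1));
    }

    int main() {
        uint16_t data = 0b1011001101111100;        // t0 t1 t2 packed (block b0)
        uint16_t seps = 0b1000010001000010;        // starts at 0, 5, 9; end mark at 14
        uint16_t val; int len;
        getEntry(data, seps, 2, val, len);
        std::cout << len << " bits, value " << val << '\n';  // 5 bits, value 31 (11111)
    }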
27. How set() works, in a nutshell
- Find the block with the entry.
- Rewrite it.
- If the block is too large, split it into two.
- Merge adjacent blocks together to preserve the size invariant.
28. Now, to prove the theorem about space usage for arrays
- Let m = Σ_i |t_i| and w = the machine word size. I claim the total number of bits used is O(m).
- Our size invariant for blocks guarantees that, on average, blocks are at least half full. Thus, there are O(m / w) blocks in use, since there are m bits of data in total and each block stores Ω(w) bits on average.
- Our indexing structure A and hashtable H use O(w) bits per block (O(1) words). Total bits: O(m / w) blocks × O(w) bits per block = O(m) bits.
29. A note about entries longer than w bits
- What is really done in our code with entries longer than w bits is not just allocating separate memory and putting a pointer in the array, though it's close.
- We do essentially what standard structures do: we chain the words making up an entry into a linked list. We have a clever way to do this that doesn't need w-bit pointers; instead, we only need 7 or 8 bits per pointer.