Title: An%20Approach%20to%20Generalized%20Hashing
1An Approach to Generalized Hashing
- Michael Klipper
- With
- Dan Blandford
- Guy Blelloch
2Hashing techniques currently available
- Many hashing algorithms out there
- Separate chaining
- Cuckoo hashing
- FKS perfect hashing
- Also many hash functions designed, including
several universal families - Good O(1) expected amortized time for updates,
and many have O(1) worst case time for searches - Bad Require fixed-length keys and fixed-length
data
3Whats so bad about fixed length?
- Easy to waste a lot of space
- Every hash bucket must be as large as the largest
item to be stored in the table. - This is a large problem for sparsely-filled
tables, or tables where large items occur
infrequently. - Hash tables are often building blocks to more
complicated structures, so optimizing them pays
off in a lot of places.
4Example A Graph Layout Where We Store Edges in a
Hash Table
- Lets say u is a vertex of degree d and v1, vd
are its neighbors. Lets say that v0 vd1 u
by convention. Then the entry representing the
edge (u, vi) has key (u, vi) and data (vi-1,
vi1).
Hash Table
This extra entry starts the list.
u
u
u
u
v2
v1
u
v3
v1
u
v4
v2
v3
v2
v1
v4
Degree of Vertex
4
5An Idea for Compression
- Instead of ((u, vi), (vi-1, vi1)) in the table,
we will store
((u, vi u), (vi-1 u, vi1 u)). - With this representation, we need O(kn) space
where k S(u,v)ÃŽE log u v. - A good labeling of the vertices will make many of
these differences small! But not all of them.
The following paper has details
D. Blandford, G. E. Blelloch, and I. Kash.
Compact Representations of Separable Graphs. In
SODA, 2003, pages 342-351.
6First, a simpler problem
- Variable-length data stored in arrays
- Its like a hash table except that the indices
now are in the fixed range 0n-1 for n items in
the array.
Well use the following data for our example in
these slides (0, 10110) (1, 0110) (2, 11111) (3,
0101) (4, 1100) (5, 010) (6, 11011) (7,
00001111) Well assume that the word size of the
machine is 2 bytes.
7Key Idea BLOCKS
- Multiple data items can be crammed into a word,
so lets take advantage of that. - Two words in a block one with data, one marking
off separations of strings - If the first index in a block is i, well label
the block as bi
0 1 1 0
b0
1 0 1 1 0
1 1 1 1 1
1st word
2nd word
1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
This is the block containing strings s0 through
s2 from our example.
8Organization of Blocks
- Index structure (regular array) Ai 1 if and
only if string i starts a block - Hash table (one of the regular kind) if string
i starts a block, H(i) address of bi - Note that it is easy to split and merge blocks.
Key size invariant two adjacent blocks (like b0
and b3 in the example) must have their sizes sum
to greater than the word size of the machine
H(0)
b0
1
0
0
H(3)
b3
1
A
0
0
0
H(7)
b7
1
9A Rough Look at Space and Time Bounds for this
Array Structure
- Lets say we have n items, and w is the word size
in bits of the machine. WLOG all data strings
are nonempty. - Let m Si si.
At most w strings
1
0
lt w apart
0
On avg block is ½ full
1
0
Lookup tables cut the time down to constant time
for finding a block and the string inside it,
since they can operate on entire words at
once. Indexing structures hash table use O(w)
bits per block. Each block is on average half
full due to the invariant. O(m w) bits used
and operations are O(1) time!
0
0
O(m/w 1) blocks!
1
10Briefly, how we proceed from there
- We can finally implement our generalized hash
table using an array of the type we just
described as the hash table. - There are more details the following paper
explains this.
D. Blandford and G. E. Blelloch. Storing
Variable-Length Keys in Arrays, Sets, and
Dictionaries, with Applications. In Symposium on
Discrete Algorithms (SODA), 2005 (hopefully)
11Great, but if theres a paper written on the
subject already, then what do I do?
- A lot of this code isnt yet written. We havent
yet checked to see that the code we have fulfills
the theoretical bounds, since we have to make
sure that any cutting corners done for the
programming is theoretically safe. - My job is to get a lot of this running and look
for optimizations. - Also, once this is running, well want to run
experiments to see how well it runs, especially
in modeling graphs.