1
Generalized Hashing with Variable-Length Bit Strings
  • Michael Klipper
  • With Dan Blandford and Guy Blelloch

Original source: D. Blandford and G. E.
Blelloch. Storing Variable-Length Keys in
Arrays, Sets, and Dictionaries, with
Applications. In Symposium on Discrete
Algorithms (SODA), 2005 (hopefully)
2
Hashing techniques currently available
  • Many hashing algorithms out there
  • Separate chaining
  • Cuckoo hashing
  • FKS perfect hashing
  • Also many hash functions designed, including
    several universal families
  • O(1) expected amortized time for updates, and
    many have O(1) worst case time for searches
  • They use Ω(n lg n) bits for n entries, since at
    least lg n bits are used per entry to distinguish
    between keys.

3
What kind of bounds do we achieve?
  • Let's say we store n entries in our hashtable of
    the form (s_i, t_i) for i = 0, 1, 2, …, (n-1). Each
    s_i and t_i is a bit string of variable length.
    For our purposes, many of the t_i might be only
    a few bits long.
  • Time for all operations (later slide)
  • O(1) expected amortized
  • Total space used
  • O(Σ_i max(|s_i| - lg n, 1) + |t_i|) bits

4
The Improvement We Attain
  • Let's say we store n entries taking up m total
    bits. In terms of the s_i and t_i values on the
    previous slide,
  • m = Σ_i (|s_i| + |t_i|)
  • Note that m = Ω(n lg n), since at least lg n bits
    per entry are needed to keep the n keys distinct.
  • Thus, our space usage is O(m - n lg n) bits, as
    opposed to the Ω(m) bits that standard hashtable
    structures use.
  • In particular, our structure is much more
    efficient than standard structures when m is
    close to n lg n (for example, when most entries
    are only a few bits long).

5
Goal: Generalized Dynamic Hashtables
  • We want to support the following operations:
  • query(key, keyLength)
  • Looks up the key in the hashtable and returns the
    associated data and its length
  • insert(key, keyLength, data, dataLength)
  • Adds (key, data) as an entry in the hashtable
  • remove(key, keyLength)
  • Removes the key and the associated data
  • NOTE: Each key will have only one entry
    associated with it. Another name for this kind
    of structure is a variable-length dictionary
    structure.

6
Other Structures
  • Variable-Length Sets
  • Also supports query, insert, and remove, though
    there is no extra data associated with keys
  • Can be easily implemented as a generalized
    hashtable that stores no extra data
  • O(1) expected amortized time for all operations
  • If the n keys are s_0, s_1, …, s_(n-1), then the
    total space used in bits is
  • O(Σ_i max(|s_i| - lg n, 1))

7
Other Structures (cont.)
  • Variable-Length Arrays
  • For n entries, the keys are 0, 1, …, n-1.
  • These arrays will not be able to resize their
    number of entries.
  • Operations
  • get(i) returns the data stored at index i and its
    length
  • set(i, val, len) updates the data at index i to
    val of length len
  • Once again, O(1) expected amortized time for
    operations. Total space usage is O(Σ_i |t_i|).

8
Implementation Note
  • Assume for now that we have the variable-length
    array structure described on the previous slide.
    We will use it to build generalized dynamic
    hashtables, which are more interesting than the
    arrays.
  • At the end of this presentation, I can talk about
    implementation of variable-length arrays if time
    permits.

9
The Main Idea Behind How Hashtables Work
  • Our generalized hashtable structure contains a
    variable-length array with 2^q entries (which will
    serve as the buckets for the hashtable). We keep
    2^q approximately equal to n by occasional
    rehashing of the bucket contents.
  • The item (s_i, t_i), where s_i is the key and t_i is
    the data, is placed in a bucket as follows: we
    first hash s_i to some index (more on this later),
    and we write (s_i, t_i) into the bucket specified
    by that index. Note that when we hash s_i, we
    implicitly treat it as an integer.

10
Hashtables (cont.)
  • If several entries of the set collide in a
    bucket, we throw them all into the bucket
    together as one giant concatenated bit string.
    Thus, we essentially use a separate-chaining
    algorithm.
  • To tell where one entry ends and another
    begins, we encode the entries with a prefix-free
    code (such as Huffman codes or gamma codes).

Sample bucket, stored as one concatenated bit
string (where enc(x) denotes x under the
prefix-free code):
enc(s_1) enc(t_1) enc(s_2) enc(t_2) enc(s_3) enc(t_3)
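To make the encoding concrete, here is a minimal sketch in Python
(my own illustration; the function names and the choice of Elias
gamma codes for lengths are assumptions, since the slides name gamma
codes only as one option). Each entry x is stored as gamma(|x|)
followed by the raw bits of x, so the concatenation is
self-delimiting. The slides' O(1) decoding relies on table lookup
over whole words; this sketch decodes bit by bit for clarity.

  def gamma_encode(x):
      # Elias gamma code for an integer x >= 1:
      # (len-1) zeros, then x in binary (which starts with a 1).
      b = bin(x)[2:]
      return "0" * (len(b) - 1) + b

  def gamma_decode(bits, pos):
      # Returns (value, next position); prefix-free, so no separators.
      z = 0
      while bits[pos + z] == "0":
          z += 1
      return int(bits[pos + z:pos + 2 * z + 1], 2), pos + 2 * z + 1

  def encode_entry(x):
      # A nonempty bit string x is stored as gamma(|x|) then x itself.
      return gamma_encode(len(x)) + x

  def decode_entry(bits, pos):
      n, pos = gamma_decode(bits, pos)
      return bits[pos:pos + n], pos + n

  # A bucket is just the concatenation of encoded entries:
  bucket = encode_entry("10110") + encode_entry("0110")
  s, pos = decode_entry(bucket, 0)      # s == "10110"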
11
Time and Space Bounds
  • Note that we use prefix-free codes that use only
    a constant factor more space (i.e., they encode m
    bits in O(m) space) and can be encoded/decoded in
    O(1) time.
  • Time: If we use a universal hash function to
    determine the bucket index, then each bucket
    receives only a constant expected number of
    elements, so it takes O(1) expected amortized
    time to find an element in a bucket. The
    prefix-free codes we use allow O(1) decoding of
    any element.
  • Space: The prefix-free codes increase the number
    of bits stored by at most a constant factor. If
    we have m bits total we want to store, our space
    bound for variable-length arrays says that the
    buckets take up O(m) bits.

12
There's a bit more to it than that
  • Recall the space bound for the hash table is
  • O(Σ_i max(|s_i| - lg n, 1) + |t_i|).
  • Where does the lg n savings per entry come from?
  • We use a technique called quotienting.
  • We actually use two hash functions, h and h'.
    h(s_i) is the bucket index, and h'(s_i) has
    length max(|s_i| - q, 1). (Recall that 2^q is
    approximately n.)
  • Instead of writing (s_i, t_i) in the bucket, we
    actually write (h'(s_i), t_i). This way, each
    entry needs |h'(s_i)| + |t_i| bits to write, which
    fulfills our space bound above.

13
A Quotienting Scheme
  • Let h_0 be a hash function from a universal family
    whose range is q bits. We describe a way to make
    a family of hash functions from the family from
    which h_0 is drawn.
  • Let s_i^t be the q most significant bits of s_i,
    and let s_i^b be the other bits.
  • We define our hash functions as follows:
  • h'(s_i) = s_i^b
  • h(s_i) = h_0(s_i^b) xor s_i^t

Example: s_i = 101101 001010100100101, so
s_i^t = 101101 and s_i^b = 001010100100101.
With h_0(s_i^b) = 010011, the bucket index is
h(s_i) = 010011 xor 101101 = 111110, and
h'(s_i) = 001010100100101.
14
Undoing the Quotienting
  • In the previous example, we saw that h(s_i)
    evaluated to 111110, or 62. This means we store
    h'(s_i) in bucket number 62!
  • Note that given h(s_i) and h'(s_i) we can
    retrieve s_i, because
  • s_i^b = h'(s_i)
  • and
  • s_i^t = h_0(h'(s_i)) xor h(s_i).
  • The family of h functions we make is another
    universal family, so our time bound explained
    earlier still holds.
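A minimal sketch of this scheme in Python, using the slides' 6-bit
example (the toy h0 below is my own stand-in for a function drawn
from a universal family, and the sketch assumes |s| > q):

  Q = 6                      # bucket-index length; table has 2^Q buckets

  def h0(bits):
      # Stand-in for a hash with a Q-bit range (assumption: the real
      # structure draws h0 from a universal family).
      return (int(bits, 2) * 2654435761) % (1 << Q)

  def split(s):
      return s[:Q], s[Q:]    # s^t = top Q bits, s^b = the rest

  def h(s):
      # Bucket index: h(s) = h0(s^b) xor s^t.
      s_t, s_b = split(s)
      return h0(s_b) ^ int(s_t, 2)

  def h_prime(s):
      # Quotient actually written into the bucket: h'(s) = s^b.
      return split(s)[1]

  def recover(bucket_index, quotient):
      # Undo the quotienting: s^t = h0(h'(s)) xor h(s), then s = s^t s^b.
      s_t = h0(quotient) ^ bucket_index
      return format(s_t, "0{}b".format(Q)) + quotient

  s = "101101001010100100101"
  assert recover(h(s), h_prime(s)) == s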

15
An Application of Hashtables: Graph Structures
  • One area where we will be able to use the
    hashtable structure is in storing graphs. Here,
    we describe a semidynamic directed-graph
    implementation. This means that the number of
    vertices is fixed, but edges can be added or
    deleted at runtime.
  • Let u and v be vertices of a graph. We want the
    following operations, compactly and in O(1)
    expected amortized time:
  • deg(v) - get the degree of vertex v
  • adjacent(u, v) - returns true iff u and v are
    adjacent
  • firstEdge(v) - returns the first neighbor of v in
    G
  • nextEdge(u, v) - returns the next neighbor of u
    after v (assumes u and v are adjacent)
  • addEdge(u, v) - adds an edge from u to v in G
  • deleteEdge(u, v) - deletes the edge (u, v) from G

16
Hashing Integers
  • Up to now, we have used bit strings as the main
    objects in the hashtable. It will also be useful
    to hash on integer values. Hence, we have
    created some utilities to convert between bit
    strings and integers using as few bits as
    possible, so an integer x takes basically lg x
    bits to write as a bit string.
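The slides don't show the conversion itself; one standard bijection
between non-negative integers and bit strings that achieves roughly
lg x bits is to write x + 1 in binary and drop the leading 1 (a
sketch of one possible utility, not necessarily the paper's):

  def int_to_bits(x):
      # Bijection from integers x >= 0 to all bit strings (incl. "").
      # Uses floor(lg(x + 1)) bits, i.e. basically lg x bits.
      return bin(x + 1)[3:]        # strip "0b" and the leading 1

  def bits_to_int(s):
      return int("1" + s, 2) - 1

  assert all(bits_to_int(int_to_bits(x)) == x for x in range(1000))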

17
A Graph Layout Where We Store Edges in a Hashtable
  • Let's say u is a vertex of degree d and v_1, …, v_d
    are its neighbors. Let's say that v_0 = v_(d+1) = u
    by convention.
  • Then the entry representing the edge (u, v_i) has
    key (u, v_i) and data (v_(i-1), v_(i+1)).

Hash Table (for a vertex u of degree 4 with
neighbors v_1, v_2, v_3, v_4):
key (u, u)   -> (v_4, v_1, 4)   (this extra entry
starts the list and stores the degree of the vertex)
key (u, v_1) -> (u, v_2)
key (u, v_2) -> (v_1, v_3)
key (u, v_3) -> (v_2, v_4)
key (u, v_4) -> (v_3, u)
18
Implementations of a Couple Operations
  • For simplicity, I'm leaving off the length
    arguments in query() and insert(). (A runnable
    sketch of these operations follows below.)
  • adjacent(u, v)
  • return (query((u, v)) != -1)
  • firstEdge(u)
  • let (vp, vn, d) = query((u, u))
  • return vn
  • addEdge(u, v)
  • let (vp, vn, d) = query((u, u))
  • remove((u, u))
  • insert((u, u), (vp, v, d + 1))
  • insert((u, v), (u, vn))
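Here is a runnable Python sketch of the same operations, with an
ordinary dict standing in for the compact hashtable (the dict, the
sentinel initialization, and the prev-pointer repair in add_edge are
my simplifications; the slide's pseudocode omits that bookkeeping):

  def make_vertex(H, u):
      # Sentinel entry with key (u, u): (prev, next, degree).
      H[(u, u)] = (u, u, 0)

  def adjacent(H, u, v):
      return (u, v) in H

  def first_edge(H, u):
      vp, vn, d = H[(u, u)]
      return vn

  def next_edge(H, u, v):
      # Next neighbor of u after v (assumes (u, v) is an edge).
      prev, nxt = H[(u, v)]
      return nxt

  def add_edge(H, u, v):
      vp, vn, d = H[(u, u)]
      H[(u, u)] = (vp, v, d + 1)      # v becomes the first neighbor
      H[(u, v)] = (u, vn)             # prev = u (head), next = old first
      if vn != u:                     # repair the old first entry's
          p, n = H[(u, vn)]           # prev pointer
          H[(u, vn)] = (v, n)

  H = {}
  make_vertex(H, "u")
  for w in ("v1", "v2", "v3"):
      add_edge(H, "u", w)
  assert adjacent(H, "u", "v2") and first_edge(H, "u") == "v3"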

19
Compression and Space Usage
  • Instead of ((u, v_i), (v_(i-1), v_(i+1))) in the
    table, we will store
  • ((u, v_i - u), (v_(i-1) - u, v_(i+1) - u))
  • With this representation, we need
    O(Σ_((u,v)∈E) lg |u - v|) space.
  • A good labeling of the vertices will make many of
    these differences small. For instance, for many
    classes of graphs, such as planar graphs, the
    total space used is O(n) bits! The following
    paper has details (a sketch of the difference
    encoding follows the citation):

D. Blandford, G. E. Blelloch, and I. Kash.
Compact Representations of Separable Graphs. In
SODA, 2003, pages 342-351.
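A small sketch of the difference idea, reusing gamma_encode from the
earlier slide (the zigzag mapping of signed differences to positive
integers is my choice; the paper may use a different signed encoding):

  def zigzag(d):
      # Map a signed difference to a positive integer:
      # 0, -1, 1, -2, 2, ... -> 1, 2, 3, 4, 5, ...
      return 2 * d + 1 if d >= 0 else -2 * d

  def encode_neighbor(u, v):
      # Store v relative to u in O(lg |u - v|) bits.
      return gamma_encode(zigzag(v - u))

  print(encode_neighbor(1000, 1002))   # a few bits: labels are close
  print(encode_neighbor(1000, 5))      # many more bits: labels are far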
20
More Details about Implementing Arrays
We'll use the following data for our example in
these slides:
t_0 = 10110, t_1 = 0110, t_2 = 11111, t_3 = 0101,
t_4 = 1100, t_5 = 010, t_6 = 11011, t_7 = 00001111
We'll assume that the word size is 2 bytes.
21
Key Idea: BLOCKS
  • Multiple data items can be crammed into a word,
    so let's take advantage of that.
  • There are many possible ways to store data in
    blocks. The way that I'll discuss here is to use
    two words per block: one stores data and one
    marks the separation of entries.

Example: the block b_0, containing strings t_0
through t_2 from our example.
1st word (data): 10110 0110 11111 (t_0 t_1 t_2
concatenated; the last 2 bits are unused)
2nd word (separators): 1000010001000010 (a 1 marks
the start of each entry, plus one marking the end
of the used bits)
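A sketch that builds the two words of a block from the example
entries (bit strings as Python strings; the real structure packs
bits into machine words):

  W = 16                               # word size: 2 bytes

  def make_block(entries):
      data = "".join(entries)
      assert len(data) <= W            # entries must fit in the data word
      sep = ["0"] * W
      pos = 0
      for t in entries:                # mark where each entry starts...
          sep[pos] = "1"
          pos += len(t)
      if pos < W:
          sep[pos] = "1"               # ...and where the used bits end
      return data.ljust(W, "0"), "".join(sep)

  data, sep = make_block(["10110", "0110", "11111"])
  assert sep == "1000010001000010"     # matches the figure above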
22
Blocks continued
  • We'll name a block b_i if i is the first entry
    number to be stored in that block. The size of a
    block is the sum of the sizes of the entries
    inside it.
  • We'll maintain a size invariant:
  • for any adjacent blocks b_i and b_j, |b_i| + |b_j|
    is at least a full word.
  • Note: splitting and merging blocks is easy.
  • We assume these things for now:
  • Entries fit into a word (we can handle longer
    entries by storing a pointer to separate memory
    in their place)
  • Entries are nonempty

23
Organization of Blocks
  • We have a bit array A of length n (this is a
    regular old C array). A[i] = 1 if and only if
    string i starts a block. This is our indexing
    structure.
  • We also have a standard hashtable H. If string
    i starts a block, then H(i) = the address of b_i.
    We assume H is computed in O(1) expected
    amortized time.
  • Blocks are large enough that storing them in H
    only increases the space usage by a constant
    factor.

Example: A = 10010001, with 1s at indices 0, 3,
and 7. H(0) points to b_0, which holds t_0 t_1 t_2;
H(3) points to b_3, which holds t_3 t_4 t_5 t_6;
and H(7) points to b_7, which holds t_7. In this
example, b_0 and b_3 are adjacent blocks, as are
b_3 and b_7.
24
A Note about Space Usage
  • Any two 1s in the indexing structure A are
    separated by at most one word's worth of
    positions (at most w indices apart). This is
    because entries are nonempty and a block holds
    only one word of entry data, so a block contains
    at most w entries.

25
The get() operation
  • Since bits that are turned on in A are close
    together, we can find the block to which an entry
    belongs in O(1) time. One way to do this is table
    lookup.
  • If the ith entry is in block b_k, then the ith
    entry of the array is the (i - k + 1)st entry in
    that block.
  • By using table lookup, we can find where the
    correct 1s in the second word are, which tell us
    where the entry starts and ends.

26
A picture of the get() operation, illustrated
with get(2):
To find entry 2, we scan A backward from index 2
(A[2] = 0, A[1] = 0, A[0] = 1), so entry 2 is in
block b_0, which we reach via H(0). It is the
(2 - 0 + 1) = 3rd entry in that block. In the
separator word 1000010001000010, the 3rd and 4th
1s sit at positions 9 and 14, marking where the
entry starts and ends.
Conclusion: Entry 2 is 5 bits long. It is 11111.
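Putting the pieces together, a sketch of get() over the toy
representation from the earlier sketches (A as a list of 0/1 flags
and H as a dict from entry number to block handle are my
simplifications; the real structure does the backward scan and the
separator search with O(1) table lookups on whole words):

  def get(A, H, blocks, i):
      k = i                       # scan back to the nearest 1 in A;
      while A[k] == 0:            # 1s are at most a word apart, so this
          k -= 1                  # is O(1) with table lookup
      data, sep = blocks[H[k]]
      # Entry i is the (i - k + 1)st entry of block b_k; its start and
      # end are the (i - k + 1)st and following 1 in the separator word
      # (an appended 1 covers a block whose entries fill the word).
      ones = [p for p, bit in enumerate(sep + "1") if bit == "1"]
      start, end = ones[i - k], ones[i - k + 1]
      return data[start:end]

  A = [1, 0, 0, 1, 0, 0, 0, 1]
  H = {0: 0, 3: 1, 7: 2}
  blocks = [make_block(["10110", "0110", "11111"]),
            make_block(["0101", "1100", "010", "11011"]),
            make_block(["00001111"])]
  assert get(A, H, blocks, 2) == "11111"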
27
How set() works in a nutshell
  1. Find the block with the entry.
  2. Rewrite it.
  3. If the block is too large, split it into two.
  4. Merge adjacent blocks together to preserve the
    size invariant. (See the sketch below.)
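A sketch of these four steps over a simplified representation (each
block as a Python list of entry strings, so splitting and merging
are list operations; updating A and H, which the real structure must
also do, is omitted):

  W = 16                                    # word size in bits

  def bits(blk):                            # total bits in a block
      return sum(len(t) for t in blk)

  def set_entry(blocks, i, val):
      k = 0                                 # 1. find the block with entry i
      while i >= len(blocks[k]):
          i -= len(blocks[k])
          k += 1
      blocks[k][i] = val                    # 2. rewrite it
      if bits(blocks[k]) > W:               # 3. split an overfull block
          blk = blocks.pop(k)
          half = len(blk) // 2
          blocks[k:k] = [blk[:half], blk[half:]]
      for j in (k, k - 1):                  # 4. merge neighbors that now
          if 0 <= j < len(blocks) - 1 and \
             bits(blocks[j]) + bits(blocks[j + 1]) < W:
              blocks[j:j + 2] = [blocks[j] + blocks[j + 1]]

  blocks = [["10110", "0110", "11111"],
            ["0101", "1100", "010", "11011"], ["00001111"]]
  set_entry(blocks, 5, "0")                 # rewrite t_5 in place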

28
Now, to prove the theorem about space usage for
arrays
  • Let m = Σ_i |t_i| and w = the machine word size. I
    claim the total number of bits used is O(m).
  • Our size invariant for blocks guarantees that, on
    average, blocks are at least half full. Thus,
    there are O(m / w) blocks used, since there are m
    bits of data in total and each block stores Ω(w)
    bits on average.
  • Our indexing structure A and hashtable H use O(w)
    bits per block (O(1) words). Total bits:
  • O(m / w) blocks × O(w) per block = O(m) bits.

29
A note about entries longer than w bits
  • What is really done in our code with entries
    longer than w bits is not just allocating
    separate memory and putting a pointer in the
    array, though it's close.
  • We do essentially what standard structures do:
    we chain the words making up our entry into a
    linked list. We have a clever way to do this
    which doesn't need w-bit pointers; instead we
    only need 7 or 8 bits for a pointer.