Title: CS223 Advanced Data Structures
- Dr. Wenzhan Song
- Assistant Professor, Computer Science
Chapter 5: Hashing
An ideal hash table
    int hash( const string & key, int tableSize )
    {
        int hashVal = 0;
        for( int i = 0; i < key.length( ); i++ )
            hashVal += key[ i ];
        return hashVal % tableSize;
    }

Fig. 5.2 A simple hash function

    int hash( const string & key, int tableSize )
    {
        return ( key[ 0 ] + 27 * key[ 1 ] + 729 * key[ 2 ] ) % tableSize;
    }

Fig. 5.3 Another possible hash function (not too good)

    /**
     * A hash routine for string objects.
     */

Fig. 5.4 A good hash function
- Not necessarily the best with respect to table distribution
- But extremely simple and reasonably fast
- Implementations may use only some of the characters (e.g., those in the odd positions) to compute the hash
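The slide keeps only the header comment of Fig. 5.4. As a rough sketch of what a simple and reasonably fast string hash could look like, the following assumes a polynomial hash evaluated with Horner's rule. The name `goodHash` and the multiplier 37 are assumptions made for illustration; they are not taken from the slide.

```cpp
#include <cassert>
#include <string>

// Sketch of a polynomial string hash (assumptions: multiplier 37, Horner's rule).
// Unsigned arithmetic is used so that overflow wraps around instead of being
// undefined, and the result always lies in [0, tableSize).
unsigned int goodHash(const std::string& key, unsigned int tableSize)
{
    assert(tableSize > 0);
    unsigned int hashVal = 0;
    for (char ch : key)
        hashVal = 37 * hashVal + static_cast<unsigned char>(ch);
    return hashVal % tableSize;
}
```

Unlike the character sum of Fig. 5.2, this sketch depends on character order, so "ab" and "ba" land in different slots of a 101-entry table.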
Separate Chaining Hashing

    template <typename HashedObj>
    class HashTable
    {
      public:
        explicit HashTable( int size = 101 );
        bool contains( const HashedObj & x ) const;
        void makeEmpty( );
        bool insert( const HashedObj & x );   // bool, matching Fig. 5.10
        bool remove( const HashedObj & x );   // bool, matching Fig. 5.9
      private:
        vector<list<HashedObj> > theLists;    // The array of Lists
        int currentSize;
        void rehash( );
        int myhash( const HashedObj & x ) const;
    };

Fig. 5.6 Type declaration for separate chaining hash table
    int myhash( const HashedObj & x ) const
    {
        int hashVal = hash( x );

        hashVal %= theLists.size( );
        if( hashVal < 0 )
            hashVal += theLists.size( );

        return hashVal;
    }

Fig. 5.7 myhash member function for hash tables
    // Example of an Employee class
    class Employee
    {
      public:
        const string & getName( ) const
          { return name; }
        bool operator==( const Employee & rhs ) const
          { return getName( ) == rhs.getName( ); }
        bool operator!=( const Employee & rhs ) const
          { return !( *this == rhs ); }
        // Additional public members not shown
      private:
        string name;
        double salary;
        int seniority;
    };

Fig. 5.8 Example of a class that can be used as a HashedObj
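To be usable as a HashedObj, a class needs equality comparison and a hash function that agrees with it: objects that compare equal must hash to the same value. A minimal sketch of one way to arrange this is shown below. The reduced `Employee` class, the helper name `hashEmployee`, and the use of `std::hash` are assumptions for illustration, not the slide's code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Reduced stand-in for the Employee class of Fig. 5.8 (only the name is kept).
class Employee
{
  public:
    explicit Employee(const std::string& n) : name(n) { }
    const std::string& getName() const { return name; }
    bool operator==(const Employee& rhs) const { return getName() == rhs.getName(); }
    bool operator!=(const Employee& rhs) const { return !(*this == rhs); }
  private:
    std::string name;
};

// Hypothetical helper: hash an Employee by name only, the same field that
// operator== compares, so equal employees always receive equal hash values.
std::size_t hashEmployee(const Employee& e)
{
    return std::hash<std::string>()(e.getName());
}
```

Hashing on a field that equality ignores (salary, for example) would break lookups, because two equal objects could then be sent to different lists.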
    void makeEmpty( )
    {
        for( int i = 0; i < theLists.size( ); i++ )
            theLists[ i ].clear( );
    }

    bool contains( const HashedObj & x ) const
    {
        const list<HashedObj> & whichList = theLists[ myhash( x ) ];
        return find( whichList.begin( ), whichList.end( ), x ) != whichList.end( );
    }

    bool remove( const HashedObj & x )
    {
        list<HashedObj> & whichList = theLists[ myhash( x ) ];
        typename list<HashedObj>::iterator itr =
            find( whichList.begin( ), whichList.end( ), x );

        if( itr == whichList.end( ) )
            return false;

        // Found: unlink the element and update the count
        whichList.erase( itr );
        --currentSize;
        return true;
    }

Fig. 5.9 makeEmpty, contains and remove routines for separate chaining hash table
    bool insert( const HashedObj & x )
    {
        list<HashedObj> & whichList = theLists[ myhash( x ) ];
        if( find( whichList.begin( ), whichList.end( ), x ) != whichList.end( ) )
            return false;
        whichList.push_back( x );

        // Rehash; see Section 5.5
        if( ++currentSize > theLists.size( ) )
            rehash( );

        return true;
    }

Fig. 5.10 insert routine for separate chaining hash table
Analysis of Separate Chaining
- Load factor r: the ratio of the number of elements in the hash table to the table size
- The average length of a list is r
- Search cost
  - Unsuccessful search: visits r nodes on average
  - Successful search: traverses about 1 + r/2 links on average
- Conclusion: the table size is not really important, but the load factor r is. The general rule for separate chaining hashing is to make the table size about as large as the number of elements expected. In other words, let r ≈ 1.
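The quantities in this analysis can be computed directly from a chained table. The sketch below is only an illustration: the helper names (`loadFactor`, `buildTable`, and the two estimate functions) are assumptions, and integer keys are spread with a plain `key % tableSize`.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Load factor r = (number of stored elements) / (number of lists).
double loadFactor(const std::vector<std::list<int>>& lists)
{
    assert(!lists.empty());
    std::size_t count = 0;
    for (const std::list<int>& chain : lists)
        count += chain.size();
    return static_cast<double>(count) / lists.size();
}

// Average-case estimates from the analysis of separate chaining.
double expectedUnsuccessfulNodes(double r) { return r; }
double expectedSuccessfulLinks(double r)   { return 1.0 + r / 2.0; }

// Hypothetical helper: build a chained table holding the keys 0 .. n-1.
std::vector<std::list<int>> buildTable(int n, int tableSize)
{
    assert(tableSize > 0);
    std::vector<std::list<int>> lists(tableSize);
    for (int key = 0; key < n; ++key)
        lists[key % tableSize].push_back(key);
    return lists;
}
```

For 202 keys in 101 lists the load factor is 2, so an unsuccessful search visits about 2 nodes and a successful one follows about 1 + 2/2 = 2 links.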
Hash Tables without Linked Lists
- Linear Probing
- Quadratic Probing
- Double Hashing
Linear probing
f(i) = i
Figure: hash table with linear probing, after each insertion
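Since the figure itself is not reproduced here, a small sketch may help to show how linear probing places keys. The function name `insertLinear`, the sentinel `EMPTY_SLOT`, and the restriction to non-negative integer keys with h(x) = x mod TableSize are assumptions for illustration.

```cpp
#include <cassert>
#include <vector>

const int EMPTY_SLOT = -1;   // assumption: keys are non-negative integers

// Insert key with linear probing, f(i) = i, and return the slot used.
// Returns -1 if every slot is occupied.
int insertLinear(std::vector<int>& table, int key)
{
    assert(key >= 0 && !table.empty());
    int size = static_cast<int>(table.size());
    int home = key % size;
    for (int i = 0; i < size; ++i)
    {
        int pos = (home + i) % size;      // ith probe: h(x) + f(i), with wraparound
        if (table[pos] == EMPTY_SLOT)
        {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}
```

With a table of size 10 and the illustrative keys 89, 18, 49, 58, 69 inserted in that order, the keys end up in slots 9, 8, 0, 1, 2: each later key has to walk past the cluster built by the earlier ones.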
Linear probing
Primary clustering: any key that hashes into the cluster will require several attempts to resolve the collision
Figure: number of probes plotted against load factor for linear probing (dashed) and random strategy (solid). S: successful search, U: unsuccessful search, I: insertion
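The plot is not reproduced here. The curves in such a plot are usually described by the standard approximations below, which are quoted from the general hashing literature rather than from the slide, with r denoting the load factor.

```latex
% r: load factor, 0 <= r < 1
\begin{align*}
\text{Linear probing, insertion or unsuccessful search:} \quad
  & \tfrac{1}{2}\left(1 + \frac{1}{(1-r)^{2}}\right) \\
\text{Linear probing, successful search:} \quad
  & \tfrac{1}{2}\left(1 + \frac{1}{1-r}\right) \\
\text{Random strategy, insertion or unsuccessful search:} \quad
  & \frac{1}{1-r} \\
\text{Random strategy, successful search:} \quad
  & \frac{1}{r}\,\ln\frac{1}{1-r}
\end{align*}
```

At r = 0.75, for instance, these give about 8.5 probes for an insertion with linear probing against 4 for the random strategy, which is the cost of primary clustering.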
Quadratic probing
f(i) = i^2: a collision resolution method that eliminates the primary clustering problem of linear probing. Notice also that f(i) = f(i-1) + 2i - 1.
Figure: hash table with quadratic probing, after each insertion
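The recurrence f(i) = f(i-1) + 2i - 1 means the probe sequence can be generated with additions only. A minimal sketch follows; the function name `quadraticProbes` and its parameters are assumptions for illustration.

```cpp
#include <cassert>
#include <vector>

// Return the first `count` probe positions for quadratic probing, f(i) = i^2,
// computed incrementally via f(i) = f(i-1) + 2i - 1 (no multiplication needed).
std::vector<int> quadraticProbes(int home, int tableSize, int count)
{
    assert(tableSize > 0 && home >= 0 && home < tableSize);
    std::vector<int> positions;
    int pos = home;
    int offset = 1;                  // 2i - 1 for i = 1
    for (int i = 0; i < count; ++i)
    {
        positions.push_back(pos);    // slot h(x) + i^2 (mod tableSize)
        pos += offset;               // step from i^2 to (i+1)^2
        offset += 2;
        while (pos >= tableSize)     // wrap around the end of the table
            pos -= tableSize;
    }
    return positions;
}
```

Starting from home slot 9 in a table of size 10, the first four probes are 9, 0, 3, 8, that is 9 + 0, 9 + 1, 9 + 4 and 9 + 9 taken mod 10.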
Quadratic probing
- Theorem 5.1. If quadratic probing is used, and the table size is prime, then a new element can always be inserted if the table is at least half empty.
- Proof
  - Consider two probe locations h(x) + i^2 and h(x) + j^2 (mod TableSize), where 0 < i, j < TableSize/2. Suppose, for the sake of contradiction, that the two locations are the same, but i != j. Then
      h(x) + i^2 = h(x) + j^2 (mod TableSize)
      i^2 = j^2 (mod TableSize)
      (i - j)(i + j) = 0 (mod TableSize)
  - Since TableSize is prime, it must divide (i - j) or (i + j). This is impossible, because 0 < |i - j| < TableSize and 0 < i + j < TableSize. Contradiction.
- It is crucial that the table size be prime. If it is not prime, the number of alternative locations can be severely reduced.
  - For example, if TableSize = 16, the only alternative locations are at distances 1, 4, or 9 away
- Secondary clustering: elements that hash to the same position will probe the same alternative cells.
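The TableSize = 16 example can be checked by enumerating the offsets i^2 mod TableSize directly. The sketch below does that; the function name `quadraticOffsets` is an assumption for illustration.

```cpp
#include <cassert>
#include <set>

// Collect the distinct non-zero offsets i^2 mod tableSize for i = 1 .. tableSize-1,
// i.e. the distances from the home slot that quadratic probing can ever reach.
std::set<int> quadraticOffsets(int tableSize)
{
    assert(tableSize > 1);
    std::set<int> offsets;
    for (int i = 1; i < tableSize; ++i)
    {
        int offset = (i * i) % tableSize;
        if (offset != 0)
            offsets.insert(offset);
    }
    return offsets;
}
```

For a table of size 16 this yields only {1, 4, 9}, while the prime size 17 yields 8 distinct offsets, enough to reach half of the table as the theorem requires.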
Double hashing
f(i) = i * hash2(x)
Figure: hash table with double hashing, after each insertion
Double hashing
- One popular choice is f(i) = i * hash2(x)
  - Probe at distance hash2(x), 2 * hash2(x), ...
- Choose hash2(x) such that it is never 0
  - For example, hash2(x) = x mod 9 is not good, because hash2(99) = 0
- hash2(x) = R - (x mod R), with R a prime smaller than TableSize, will work well
- TableSize must be a prime number
  - In the previous example, imagine inserting 23 into the table, where it collides with 58
  - hash2(23) = 7 - 2 = 5
  - 1st try: probe the slot 5 away
  - 2nd try: probe the slot 10 away (i.e., 0 away), which is the same as the current location
  - Hence, only one alternative location is possible
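The failure with a non-prime table size can be reproduced in a few lines. The sketch below assumes non-negative integer keys, h(x) = x mod TableSize, a table of size 10 and R = 7. The keys 89, 18, 49, 58, 69 are illustrative assumptions chosen so that 58 occupies slot 3, and the names `insertDouble` and `FREE_SLOT` are hypothetical.

```cpp
#include <cassert>
#include <vector>

const int FREE_SLOT = -1;   // assumption: keys are non-negative integers

// Second hash function from the slide: R - (x mod R), which is never 0.
int hash2(int x, int R) { return R - (x % R); }

// Insert key using double hashing, f(i) = i * hash2(x).
// Returns the slot used, or -1 if no free slot is found within table.size() probes.
int insertDouble(std::vector<int>& table, int key, int R)
{
    assert(key >= 0 && R > 0 && !table.empty());
    int size = static_cast<int>(table.size());
    int step = hash2(key, R);
    for (int i = 0; i < size; ++i)
    {
        int pos = (key % size + i * step) % size;
        if (table[pos] == FREE_SLOT)
        {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}
```

After inserting the five illustrative keys, an attempt to insert 23 probes only slots 3 and 8 over and over, because the step 5 shares a factor with the table size 10; both slots are taken, so the insertion fails even though half the table is free.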
Hash Tables without Linked Lists
- Standard deletion cannot be performed in a probing hash table, because the cell might have caused a collision to go past it.
- Solution: mark each cell with a flag: ACTIVE, EMPTY, or DELETED
- See Fig. 5.14, 5.15, 5.16, 5.17
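    template <typename HashedObj>
    class HashTable
    {
      public:
        explicit HashTable( int size = 101 );
        bool contains( const HashedObj & x ) const;
        void makeEmpty( );
        bool insert( const HashedObj & x );
        bool remove( const HashedObj & x );
        enum EntryType { ACTIVE, EMPTY, DELETED };
      private:
        struct HashEntry
        {
            HashedObj element;
            EntryType info;

            HashEntry( const HashedObj & e = HashedObj( ), EntryType i = EMPTY )
              : element( e ), info( i ) { }
        };

        // Members used by the routines in Figs. 5.15-5.17 and 5.22
        vector<HashEntry> array;
        int currentSize;

        bool isActive( int currentPos ) const;
        int findPos( const HashedObj & x ) const;
        void rehash( );
        int myhash( const HashedObj & x ) const;
    };

Fig. 5.14 Class interface for hash tables using probing strategies, including the nested HashEntry class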
    explicit HashTable( int size = 101 ) : array( nextPrime( size ) )
      { makeEmpty( ); }

    void makeEmpty( )
    {
        currentSize = 0;
        for( int i = 0; i < array.size( ); i++ )
            array[ i ].info = EMPTY;
    }

Fig. 5.15 Routines to initialize quadratic probing hash table
    bool contains( const HashedObj & x ) const
      { return isActive( findPos( x ) ); }

    int findPos( const HashedObj & x ) const
    {
        int offset = 1;
        int currentPos = myhash( x );

        while( array[ currentPos ].info != EMPTY &&
               array[ currentPos ].element != x )
        {
            currentPos += offset;  // Compute ith probe
            offset += 2;
            if( currentPos >= array.size( ) )
                currentPos -= array.size( );
        }

        return currentPos;
    }

Fig. 5.16 contains routine for hashing with quadratic probing
    bool insert( const HashedObj & x )
    {
        // Insert x as active
        int currentPos = findPos( x );
        if( isActive( currentPos ) )
            return false;

        array[ currentPos ] = HashEntry( x, ACTIVE );

        // Rehash; see Section 5.5
        if( ++currentSize > array.size( ) / 2 )
            rehash( );

        return true;
    }

    bool remove( const HashedObj & x )
    {
        int currentPos = findPos( x );
        if( !isActive( currentPos ) )
            return false;

        // Lazy deletion: mark the cell instead of emptying it
        array[ currentPos ].info = DELETED;
        return true;
    }

Fig. 5.17 insert and remove routines for hash tables with quadratic probing
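To see why a removed cell is marked DELETED rather than EMPTY, a cut-down probing table is enough. The sketch below is an assumption-laden illustration: the class name `ProbingSet` is hypothetical, keys are non-negative integers, linear probing is used only to keep it short, and there is no rehashing, so at least one cell must stay EMPTY.

```cpp
#include <cassert>
#include <vector>

// Minimal sketch of lazy deletion in a probing table of non-negative ints.
class ProbingSet
{
  public:
    explicit ProbingSet(int size) : cells(size) { }

    bool contains(int x) const { return cells[findPos(x)].info == ACTIVE; }

    bool insert(int x)
    {
        int pos = findPos(x);
        if (cells[pos].info == ACTIVE)
            return false;
        cells[pos].element = x;
        cells[pos].info = ACTIVE;
        return true;
    }

    bool remove(int x)
    {
        int pos = findPos(x);
        if (cells[pos].info != ACTIVE)
            return false;
        cells[pos].info = DELETED;   // mark only: probes must still walk past this cell
        return true;
    }

  private:
    enum EntryType { ACTIVE, EMPTY, DELETED };
    struct Cell { int element = 0; EntryType info = EMPTY; };
    std::vector<Cell> cells;

    // Probe until an EMPTY cell or the cell holding x (active or deleted) is found.
    int findPos(int x) const
    {
        assert(x >= 0 && !cells.empty());
        int size = static_cast<int>(cells.size());
        int pos = x % size;
        while (cells[pos].info != EMPTY && cells[pos].element != x)
            pos = (pos + 1) % size;
        return pos;
    }
};
```

In a table of size 11, the keys 1, 12 and 23 all hash to slot 1 and form a chain of probes. After `remove(1)`, the lookups for 12 and 23 still succeed, because the DELETED cell does not stop the probe sequence the way an EMPTY cell would.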
Hash Tables without Linked Lists
- If double hashing is correctly implemented, simulations imply that the expected number of probes is almost the same as for a random collision resolution strategy.
- Quadratic probing, however, does not require the use of a second hash function and is thus likely to be simpler and faster in practice.
Rehashing
- Motivation
  - With these probing methods, insertion might fail once the load factor r rises above a threshold; the hash table should then be enlarged to at least twice its size.
- With quadratic probing, three strategies
  - Rehash as soon as the table is half full
  - Rehash only when an insertion fails
  - Middle-of-the-road strategy: rehash when the table reaches a certain load factor
    - With a good cutoff, it could be the best
Rehashing
Figure: hash table with linear probing with input 13, 15, 6, 24; h(x) = x mod 7
Figure: hash table with linear probing after 23 is inserted
Figure: new hash table after rehashing
- Scan the previous table and add the numbers sequentially: 6, 15, 23, 24, 13
- In the figure, the hash table is enlarged from 7 to 17 (17 is the first prime at least twice as large as 7), and the new hash function is h(x) = x mod 17
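The rehashing example can be traced with a short sketch. The function names `insertKey` and `rehashInto` and the sentinel `NO_KEY` are assumptions for illustration, and the new size 17 is passed in explicitly instead of being computed by a next-prime routine.

```cpp
#include <cassert>
#include <vector>

const int NO_KEY = -1;   // assumption: keys are non-negative integers

// Insert with linear probing and h(x) = x mod table.size().
// Assumes the table still has at least one free slot.
void insertKey(std::vector<int>& table, int key)
{
    assert(key >= 0 && !table.empty());
    int size = static_cast<int>(table.size());
    int pos = key % size;
    while (table[pos] != NO_KEY)
        pos = (pos + 1) % size;
    table[pos] = key;
}

// Rehash: scan the old table from slot 0 upward and reinsert every key
// into a fresh table with newSize slots.
std::vector<int> rehashInto(const std::vector<int>& oldTable, int newSize)
{
    assert(newSize > 0);
    std::vector<int> newTable(newSize, NO_KEY);
    for (int key : oldTable)
        if (key != NO_KEY)
            insertKey(newTable, key);
    return newTable;
}
```

Inserting 13, 15, 6, 24, 23 into a table of size 7 leaves 6, 15, 23, 24 in slots 0 to 3 and 13 in slot 6. Rehashing into 17 slots then places 6, 23, 24 in slots 6, 7, 8, with 13 in slot 13 and 15 in slot 15.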
Rehashing Implementation

    /**
     * Rehashing for quadratic probing hash table.
     */
    void rehash( )
    {
        vector<HashEntry> oldArray = array;

        // Create new double-sized, empty table
        array.resize( nextPrime( 2 * oldArray.size( ) ) );
        for( int j = 0; j < array.size( ); j++ )
            array[ j ].info = EMPTY;

        // Copy table over
        currentSize = 0;
        for( int i = 0; i < oldArray.size( ); i++ )
            if( oldArray[ i ].info == ACTIVE )
                insert( oldArray[ i ].element );
    }

Fig. 5.22 Rehashing for quadratic probing hash table
Rehashing Implementation

    /**
     * Rehashing for separate chaining hash table.
     */
    void rehash( )
    {
        vector<list<HashedObj> > oldLists = theLists;

        // Create new double-sized, empty table
        theLists.resize( nextPrime( 2 * theLists.size( ) ) );
        for( int j = 0; j < theLists.size( ); j++ )
            theLists[ j ].clear( );

        // Copy table over
        currentSize = 0;
        for( int i = 0; i < oldLists.size( ); i++ )
        {
            typename list<HashedObj>::iterator itr = oldLists[ i ].begin( );
            while( itr != oldLists[ i ].end( ) )
                insert( *itr++ );
        }
    }

Fig. 5.22 Rehashing for separate chaining hash table
Hash tables in the Standard Library
- hash_set: http://msdn.microsoft.com/en-us/library/bksash1t(VS.80).aspx
- hash_map: http://msdn.microsoft.com/en-us/library/6x7w9f6z(VS.80).aspx
- Compare it with your AVL set implementation ...
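The two links describe Microsoft-specific containers. Since C++11 the standard library itself offers hash-based containers, `std::unordered_set` and `std::unordered_map`, which may be the more portable choice for such a comparison. A minimal sketch, in which the function name `countDistinct` is an assumption for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Count distinct words using the standard hash-based set (C++11 and later).
std::size_t countDistinct(const std::vector<std::string>& words)
{
    std::unordered_set<std::string> seen;
    for (const std::string& w : words)
        seen.insert(w);            // inserting a duplicate leaves the set unchanged
    return seen.size();
}
```

The same function written against an ordered set, such as an AVL-based one, would differ only in the container type, which makes the two easy to time side by side.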
Extensible Hashing
Summary
- Hash tables implement insert and contains operations in constant average time
  - The load factor is important for efficiency
- Compared to binary search trees
  - Insert and contains (e.g., isElementOf): BST is better
  - But if the input is sorted, a BST could be expensive; AVL and splay trees need expensive operations to balance, so hashing is then a better choice
- Applications
  - Compilers use hash tables to keep track of declared variables in source code: the symbol table
  - Any graph theory problem where the nodes have real names instead of numbers
  - Programs that play games: the transposition table
  - Online spelling checkers
Summary (continued)
- Separate chaining hashing requires the use of links, which costs some memory, and the standard method of implementing it calls on memory allocation routines, which typically are expensive.
- Linear probing is easily implemented, but performance degrades severely as the load factor increases because of primary clustering.
- Quadratic probing is only slightly more difficult to implement and gives good performance in practice. An insertion can fail if the table is half empty, but this is not likely. Even if it were, such an insertion would be so expensive that it wouldn't matter and would almost certainly point up a weakness in the hash function.
- Double hashing eliminates primary and secondary clustering, but the computation of a second hash function can be costly.
- Gonnet and Baeza-Yates compare several hashing strategies; their results suggest that quadratic probing is the fastest method.