1
CS223 Advanced Data Structures

Dr. Wenzhan Song
Assistant Professor, Computer Science

2
Chapter 5: Hashing
An ideal hash table
3

int hash( const string & key, int tableSize )
{
    int hashVal = 0;

    for( int i = 0; i < key.length( ); i++ )
        hashVal += key[ i ];

    return hashVal % tableSize;
}

Fig. 5.2 A simple hash function

int hash( const string & key, int tableSize )
{
    return ( key[ 0 ] + 27 * key[ 1 ] + 729 * key[ 2 ] ) % tableSize;
}

Fig. 5.3 Another possible hash function, not too good

/**
 * A hash routine for string objects.
 */

Fig. 5.4 A good hash function
Not necessarily the best with respect to table distribution
But extremely simple and reasonably fast
Typical implementations may use only some of the characters (e.g., those in the odd spaces) to calculate the hash
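
The body of the Fig. 5.4 routine did not survive in this transcript. As a rough sketch of what a simple and reasonably fast string hash can look like, the following accumulates the characters as a polynomial using Horner's rule. The name `polyHash`, the multiplier 37, and the use of unsigned arithmetic are assumptions made for illustration, not taken from the slide.

```cpp
#include <cassert>
#include <string>

// Hypothetical name; a polynomial hash evaluated with Horner's rule.
// The multiplier 37 is an assumed choice, not confirmed by the slide.
int polyHash( const std::string & key, int tableSize )
{
    assert( tableSize > 0 );

    unsigned int hashVal = 0;   // unsigned, so overflow wraps around (well defined)
    for( char ch : key )
        hashVal = 37 * hashVal + static_cast<unsigned char>( ch );

    return static_cast<int>( hashVal % tableSize );
}
```

Unlike the additive routine of Fig. 5.2, this kind of function depends on character order, so `polyHash( "ab", 101 )` and `polyHash( "ba", 101 )` land in different cells.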

4
Separate Chaining Hashing
5

template <typename HashedObj>
class HashTable
{
  public:
    explicit HashTable( int size = 101 );

    bool contains( const HashedObj & x ) const;

    void makeEmpty( );
    bool insert( const HashedObj & x );   // bool, to match Fig. 5.10
    bool remove( const HashedObj & x );   // bool, to match Fig. 5.9

  private:
    vector<list<HashedObj> > theLists;   // The array of Lists
    int currentSize;

    void rehash( );
    int myhash( const HashedObj & x ) const;
};

Fig. 5.6 Type declaration for separate chaining hash table
6

int myhash( const HashedObj & x ) const
{
    int hashVal = hash( x );

    hashVal %= theLists.size( );
    if( hashVal < 0 )
        hashVal += theLists.size( );

    return hashVal;
}

Fig. 5.7 myhash member function for hash tables
7

// Example of an Employee class
class Employee
{
  public:
    const string & getName( ) const
      { return name; }

    bool operator==( const Employee & rhs ) const
      { return getName( ) == rhs.getName( ); }
    bool operator!=( const Employee & rhs ) const
      { return !( *this == rhs ); }

    // Additional public members not shown

  private:
    string name;
    double salary;
    int seniority;
};

Fig. 5.8 Example of a class that can be used as a HashedObj
8

void makeEmpty( )
{
    for( int i = 0; i < theLists.size( ); i++ )
        theLists[ i ].clear( );
    currentSize = 0;   // keep the element count consistent with the emptied lists
}

bool contains( const HashedObj & x ) const
{
    const list<HashedObj> & whichList = theLists[ myhash( x ) ];
    return find( whichList.begin( ), whichList.end( ), x ) != whichList.end( );
}

bool remove( const HashedObj & x )
{
    list<HashedObj> & whichList = theLists[ myhash( x ) ];
    typename list<HashedObj>::iterator itr = find( whichList.begin( ), whichList.end( ), x );

    if( itr == whichList.end( ) )
        return false;

    // The transcript is cut off here; the remaining steps follow from the interface
    whichList.erase( itr );
    --currentSize;
    return true;
}

Fig. 5.9 makeEmpty, contains and remove routines for separate chaining hash table
9

bool insert( const HashedObj & x )
{
    list<HashedObj> & whichList = theLists[ myhash( x ) ];
    if( find( whichList.begin( ), whichList.end( ), x ) != whichList.end( ) )
        return false;
    whichList.push_back( x );

    // Rehash; see Section 5.5
    if( ++currentSize > theLists.size( ) )
        rehash( );

    return true;
}

Fig. 5.10 insert routine for separate chaining hash table
10
Analysis of Separate Chaining

Load factor r: the ratio of the number of elements in the hash table to the table size
The average length of a list is r
Search cost
Unsuccessful search: visits r nodes on average
Successful search: traverses about $1 + r/2$ links on average
Conclusion: the table size is not really important, but the load factor r is. The general rule for separate chaining hashing is to make the table size about as large as the number of elements expected. In other words, let $r \approx 1$.
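
The successful-search cost follows from a short counting argument, sketched here under the assumption that the hash function spreads keys uniformly. The symbols N (number of elements) and M (table size) are introduced only for this derivation. The list that holds the target contains the target itself plus, on average, (N - 1)/M other elements, and about half of those are examined before the match is found:

$$
r = \frac{N}{M}, \qquad \text{cost} \approx 1 + \frac{N-1}{2M} = 1 + \frac{r}{2} - \frac{1}{2M} \approx 1 + \frac{r}{2}
$$

With $r \approx 1$, a successful search therefore examines about 1.5 nodes on average.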

11
Hash Tables without Linked Lists

Linear Probing
Quadratic Probing
Double Hashing

12
Linear probing
$f(i) = i$
Hash table with linear probing, after each
insertion
13
Linear probing
Primary clustering: any key that hashes into the cluster will require several attempts to resolve the collision
Number of probes plotted against load factor for linear probing (dashed) and random strategy (solid). S: successful search, U: unsuccessful search, I: insertion
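
A minimal sketch of linear probing can make the clustering effect concrete. The helper name `linearProbeCells`, the hash function h(x) = x mod tableSize, and the sample keys used below are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: inserts non-negative int keys with linear probing, f(i) = i,
// using h(x) = x % tableSize. Returns the cell chosen for each key, in insertion order.
// Assumes fewer keys than cells, so an empty cell always exists.
std::vector<int> linearProbeCells( const std::vector<int> & keys, int tableSize )
{
    assert( tableSize > 0 && keys.size( ) < static_cast<std::size_t>( tableSize ) );

    std::vector<bool> used( tableSize, false );
    std::vector<int> cells;
    for( int x : keys )
    {
        int pos = x % tableSize;
        while( used[ pos ] )                  // probe h(x), h(x)+1, h(x)+2, ...
            pos = ( pos + 1 ) % tableSize;
        used[ pos ] = true;
        cells.push_back( pos );
    }
    return cells;
}
```

With the keys 89, 18, 49, 58, 69 and a table of size 10, the chosen cells are 9, 8, 0, 1, 2. The key 58 hashes to 8 but needs three extra probes, because cells 8, 9, and 0 already form a cluster, and its insertion makes that cluster longer.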
14
Quadratic probing
$f(i) = i^2$: a collision resolution method that eliminates the primary clustering problem of linear probing. Note also that $f(i) = f(i-1) + 2i - 1$.
Hash table with quadratic probing, after each
insertion
15
Quadratic probing

Theorem 5.1.
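If quadratic probing is used, and the table size is prime, then a new element can always be inserted if the table is at least half empty.
Proof
Consider two probe locations $h(x) + i^2 \pmod{\text{TableSize}}$ and $h(x) + j^2 \pmod{\text{TableSize}}$, where $0 \le i, j \le \lfloor \text{TableSize}/2 \rfloor$. Suppose, for the sake of contradiction, that these locations are the same, but $i \ne j$. Then
$h(x) + i^2 \equiv h(x) + j^2 \pmod{\text{TableSize}}$
$i^2 \equiv j^2 \pmod{\text{TableSize}}$
$(i - j)(i + j) \equiv 0 \pmod{\text{TableSize}}$
Since TableSize is prime, it must divide $i - j$ or $i + j$. This is impossible: $i \ne j$ and both are at most $\lfloor \text{TableSize}/2 \rfloor$, so $0 < |i - j| < \text{TableSize}$ and $0 < i + j < \text{TableSize}$. Contradiction. Hence the first $\lceil \text{TableSize}/2 \rceil$ probe locations are all distinct, and at least one of them is free when the table is at least half empty.
It is crucial that the table size be prime. If it is not prime, the number of alternative locations can be severely reduced.
For example, if TableSize = 16, the only alternative locations are at distances 1, 4, or 9 away.
Secondary clustering: elements that hash to the same position will probe the same alternative cells.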
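
The effect of a prime table size can also be checked numerically. The sketch below counts how many distinct cells the quadratic probe sequence can reach from a given home cell; the helper name `distinctQuadraticCells` is an assumption for illustration.

```cpp
#include <cassert>
#include <set>

// Hypothetical helper: number of distinct cells reached by the probes
// h + i*i (mod tableSize) for i = 0 .. tableSize - 1, starting from home cell h.
int distinctQuadraticCells( int h, int tableSize )
{
    assert( tableSize > 0 && h >= 0 );

    std::set<int> cells;
    for( long long i = 0; i < tableSize; ++i )
        cells.insert( static_cast<int>( ( h + i * i ) % tableSize ) );
    return static_cast<int>( cells.size( ) );
}
```

For the prime size 17 the sequence reaches 9 cells, which is half the table rounded up, as the theorem requires. For size 16 it reaches only 4 cells: the home cell and the cells 1, 4, and 9 positions away.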

16
Double hashing
$f(i) = i \cdot hash_2(x)$
Hash table with double hashing, after each
insertion
17
Double hashing

One popular choice is $f(i) = i \cdot hash_2(x)$
Probe at distances $hash_2(x)$, $2 \cdot hash_2(x)$, ...
Choose $hash_2(x)$ such that it is never 0
For example, $hash_2(x) = x \bmod 9$ is not good, because $hash_2(99) = 0$
$hash_2(x) = R - (x \bmod R)$, with R a prime smaller than TableSize, will work well
TableSize must be a prime number
In the previous example, where TableSize is 10 and thus not prime, imagine inserting 23 into the table, with R = 7
$hash_2(23) = 7 - 2 = 5$
1st try: probe the cell 5 slots away -> collides with 58
2nd try: probe the cell 10 slots away (i.e., 0 slots away), which is the current location
Hence, only one alternative location is possible
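
The example with 23 can be reproduced with a small sketch. Here `hash2` mirrors the formula R - (x mod R) from the slide, while the helper `doubleHashCells` and the primary hash h(x) = x mod tableSize are assumptions for illustration.

```cpp
#include <cassert>
#include <vector>

// Second hash function from the slide: R - (x mod R); never 0 for x >= 0.
int hash2( int x, int R )
{
    return R - ( x % R );
}

// Hypothetical helper: the distinct cells visited by h(x) + i * hash2(x) (mod tableSize),
// in probe order, with h(x) = x % tableSize. Stops when the sequence starts to repeat.
std::vector<int> doubleHashCells( int x, int tableSize, int R )
{
    assert( x >= 0 && tableSize > 0 && R > 0 && R < tableSize );

    std::vector<bool> seen( tableSize, false );
    std::vector<int> cells;
    int pos = x % tableSize;
    int step = hash2( x, R );
    while( !seen[ pos ] )
    {
        seen[ pos ] = true;
        cells.push_back( pos );
        pos = ( pos + step ) % tableSize;
    }
    return cells;
}
```

With a table of size 10 and R = 7, the key 23 can only ever reach cells 3 and 8. With the prime size 11, the same key and the same R reach all 11 cells.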

18
Hash Tables without Linked Lists

Standard deletion cannot be performed in a probing hash table, because the cell might have caused a collision to go past it.
Solution: lazy deletion. Flag each cell as ACTIVE, EMPTY, or DELETED.
See Figs. 5.14, 5.15, 5.16, and 5.17.

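template <typename HashedObj>
class HashTable
{
  public:
    explicit HashTable( int size = 101 );

    bool contains( const HashedObj & x ) const;

    void makeEmpty( );
    bool insert( const HashedObj & x );
    bool remove( const HashedObj & x );

    enum EntryType { ACTIVE, EMPTY, DELETED };

  private:
    struct HashEntry
    {
        HashedObj element;
        EntryType info;

        HashEntry( const HashedObj & e = HashedObj( ), EntryType i = EMPTY )
          : element( e ), info( i ) { }
    };

    // The transcript is cut off here; these are the members used in Figs. 5.15 to 5.17
    vector<HashEntry> array;
    int currentSize;

    bool isActive( int currentPos ) const;
    int findPos( const HashedObj & x ) const;
    void rehash( );
    int myhash( const HashedObj & x ) const;
};

Fig. 5.14 Class interface for hash tables using probing strategies, including the nested HashEntry class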
20

explicit HashTable( int size = 101 ) : array( nextPrime( size ) )
  { makeEmpty( ); }

void makeEmpty( )
{
    currentSize = 0;
    for( int i = 0; i < array.size( ); i++ )
        array[ i ].info = EMPTY;
}

Fig. 5.15 Routines to initialize quadratic probing hash table
21

bool contains( const HashedObj & x ) const
  { return isActive( findPos( x ) ); }

int findPos( const HashedObj & x ) const
{
    int offset = 1;
    int currentPos = myhash( x );

    while( array[ currentPos ].info != EMPTY &&
           array[ currentPos ].element != x )
    {
        currentPos += offset;   // Compute ith probe
        offset += 2;
        if( currentPos >= array.size( ) )
            currentPos -= array.size( );
    }

    return currentPos;
}

Fig. 5.16 contains routine for hashing with quadratic probing
22

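bool insert( const HashedObj & x )
{
    // Insert x as active
    int currentPos = findPos( x );
    if( isActive( currentPos ) )
        return false;

    array[ currentPos ] = HashEntry( x, ACTIVE );

    // Rehash; see Section 5.5
    if( ++currentSize > array.size( ) / 2 )
        rehash( );

    return true;
}

bool remove( const HashedObj & x )
{
    int currentPos = findPos( x );
    // The transcript is cut off here; the rest follows the DELETED flag scheme
    if( !isActive( currentPos ) )
        return false;

    array[ currentPos ].info = DELETED;
    return true;
}

Fig. 5.17 insert and remove routines for hash tables with quadratic probing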
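
To see why a removed cell is flagged rather than emptied, the pieces above can be condensed into a small self-contained sketch. It is an int-only simplification, not the class from the figures: the name `ProbingSet`, the hash function x mod size, the fixed prime size, and the absence of rehashing are all assumptions, and the table must stay less than half full for probing to terminate.

```cpp
#include <cassert>
#include <vector>

// Hypothetical int-only sketch of quadratic probing with lazy deletion.
// Assumptions: non-negative keys, h(x) = x % size, prime size, no rehash.
class ProbingSet
{
  public:
    explicit ProbingSet( int primeSize = 11 ) : array( primeSize ) { }

    bool contains( int x ) const
      { return array[ findPos( x ) ].info == ACTIVE; }

    bool insert( int x )
    {
        int pos = findPos( x );
        if( array[ pos ].info == ACTIVE )
            return false;                    // already present
        array[ pos ].element = x;
        array[ pos ].info = ACTIVE;
        return true;
    }

    bool remove( int x )
    {
        int pos = findPos( x );
        if( array[ pos ].info != ACTIVE )
            return false;
        array[ pos ].info = DELETED;         // mark only, so probe chains stay intact
        return true;
    }

  private:
    enum EntryType { ACTIVE, EMPTY, DELETED };

    struct HashEntry
    {
        int element = 0;
        EntryType info = EMPTY;
    };

    std::vector<HashEntry> array;

    // Probe h, h+1, h+4, h+9, ... until an EMPTY cell or a cell holding x is found.
    int findPos( int x ) const
    {
        assert( x >= 0 );
        int size = static_cast<int>( array.size( ) );
        int offset = 1;
        int pos = x % size;
        while( array[ pos ].info != EMPTY && array[ pos ].element != x )
        {
            pos += offset;
            offset += 2;
            if( pos >= size )
                pos -= size;
        }
        return pos;
    }
};
```

In a table of size 11, the keys 1, 12, and 23 all hash to cell 1 and end up in cells 1, 2, and 5. After removing 12, a search for 23 still succeeds because cell 2 is marked DELETED rather than EMPTY, so the probe sequence continues past it.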
23
Hash Tables without Linked Lists

If double hashing is correctly implemented,
simulations imply that the expected number of
probes is almost the same as for a random
collision resolution strategy.
Quadratic Probing, however, does not require the
use of a second hash function and is thus likely
simpler and faster in practice.

24
Rehashing

Motivation
Insertion might fail with the probing methods once the load factor r rises above a threshold; the hash table should then be enlarged to at least twice its size.
With quadratic probing, there are three strategies:
Rehash as soon as the table is half full
Rehash only when an insertion fails
Middle-of-the-road strategy: rehash when the table reaches a certain load factor
With a good cutoff, the third strategy could be the best

25
Rehashing
Hash table with linear probing with input 13, 15, 6, 24; $h(x) = x \bmod 7$
26
Rehashing
Hash table with linear probing after 23 is
inserted
27
Rehashing

New hash table after rehashing
Scan the previous table and insert its elements in order: 6, 15, 23, 24, 13
In the figure, the hash table is enlarged from 7 to 17 cells (17 is the first prime at least twice as large as 7), and the new hash function is $h(x) = x \bmod 17$
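
The rehashing example can be traced with a short sketch. The figures call `nextPrime` without showing it, so the trial-division version here is a hypothetical stand-in; `EMPTY_CELL`, `probeInsert`, and `rehashLinear` are likewise names assumed for illustration, and keys are taken to be non-negative.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for nextPrime: smallest prime >= n, by trial division.
int nextPrime( int n )
{
    if( n < 2 )
        return 2;
    for( ; ; ++n )
    {
        bool prime = true;
        for( int d = 2; d * d <= n; ++d )
            if( n % d == 0 )
            {
                prime = false;
                break;
            }
        if( prime )
            return n;
    }
}

const int EMPTY_CELL = -1;   // assumed marker; keys are non-negative

// Insert with linear probing, h(x) = x % table.size(). Assumes a free cell exists.
void probeInsert( std::vector<int> & table, int x )
{
    int size = static_cast<int>( table.size( ) );
    int pos = x % size;
    while( table[ pos ] != EMPTY_CELL )
        pos = ( pos + 1 ) % size;
    table[ pos ] = x;
}

// Build a table about twice as large and reinsert the old keys in cell order.
std::vector<int> rehashLinear( const std::vector<int> & oldTable )
{
    assert( !oldTable.empty( ) );
    int newSize = nextPrime( 2 * static_cast<int>( oldTable.size( ) ) );
    std::vector<int> bigger( newSize, EMPTY_CELL );
    for( int x : oldTable )
        if( x != EMPTY_CELL )
            probeInsert( bigger, x );
    return bigger;
}
```

Starting from the size-7 table holding 6, 15, 23, 24 in cells 0 to 3 and 13 in cell 6, the new table has 17 cells, with 6 in cell 6, 23 in cell 7, 24 in cell 8, 13 in cell 13, and 15 in cell 15. The keys 23 and 24 each collide once in the new table and move one cell forward.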

28
Rehashing Implementation

/**
 * Rehashing for quadratic probing hash table.
 */
void rehash( )
{
    vector<HashEntry> oldArray = array;

    // Create new double-sized, empty table
    array.resize( nextPrime( 2 * oldArray.size( ) ) );
    for( int j = 0; j < array.size( ); j++ )
        array[ j ].info = EMPTY;

    // Copy table over
    currentSize = 0;
    for( int i = 0; i < oldArray.size( ); i++ )
        if( oldArray[ i ].info == ACTIVE )
            insert( oldArray[ i ].element );
}

Fig. 5.22
29
Rehashing Implementation

/**
 * Rehashing for separate chaining hash table.
 */
void rehash( )
{
    vector<list<HashedObj> > oldLists = theLists;

    // Create new double-sized, empty table
    theLists.resize( nextPrime( 2 * theLists.size( ) ) );
    for( int j = 0; j < theLists.size( ); j++ )
        theLists[ j ].clear( );

    // Copy table over
    currentSize = 0;
    for( int i = 0; i < oldLists.size( ); i++ )
    {
        typename list<HashedObj>::iterator itr = oldLists[ i ].begin( );
        while( itr != oldLists[ i ].end( ) )
            insert( *itr++ );
    }
}

Fig. 5.22
30
Hash tables in the Standard Library

hash_set: http://msdn.microsoft.com/en-us/library/bksash1t(VS.80).aspx
hash_map: http://msdn.microsoft.com/en-us/library/6x7w9f6z(VS.80).aspx
Compare it with your AVL set implementation ...
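
The hash_set and hash_map classes linked above are compiler-specific extensions. Since C++11 the standard library itself provides hash-based containers, std::unordered_set and std::unordered_map. A minimal sketch of their use follows; the helper names `countDistinct` and `countWords` are assumptions for illustration.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical helper: number of distinct words, using a hash-based set.
std::size_t countDistinct( const std::vector<std::string> & words )
{
    std::unordered_set<std::string> seen( words.begin( ), words.end( ) );
    return seen.size( );
}

// Hypothetical helper: occurrences of each word, using a hash-based map.
std::unordered_map<std::string, int> countWords( const std::vector<std::string> & words )
{
    std::unordered_map<std::string, int> counts;
    for( const std::string & w : words )
        ++counts[ w ];          // operator[] inserts a zero count on first sight
    return counts;
}
```

Both containers offer insertion and lookup in constant average time, but unlike a tree-based set they do not keep their elements in sorted order.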

31
Extensible Hashing
32
Extensible Hashing
33
Extensible Hashing
34
Summary

Implement insert and contains operations in constant average time
Load factor is important for efficiency
Compared to binary search trees
Insert and contains (e.g., isElementOf): BST is better
But if the input is sorted, a BST could be expensive
AVL and splay trees need expensive operations to balance, so hashing is then a better choice
Applications
Compilers use a hash table to keep track of declared variables in source code: the symbol table
Any graph theory problem where the nodes have real names instead of numbers
Programs that play games: the transposition table
Online spelling checkers

35
Summary (continued)

Separate chaining hashing requires the use of
links, which costs some memory, and the standard
method of implementing calls on memory allocation
routines, which typically are expensive.
Linear probing is easily implemented, but
performance degrades severely as the load factor
increases because of primary clustering.
Quadratic probing is only slightly more difficult to implement and gives good performance in practice. An insertion can fail if the table is half empty, but this is not likely. Even if it were, such an insertion would be so expensive that it wouldn't matter and would almost certainly point up a weakness in the hash function.
Double hashing eliminates primary and secondary clustering, but the computation of a second hash function can be costly.
Gonnet and Baeza-Yates compare several hashing strategies; their results suggest that quadratic probing is the fastest method.