Title: Hash Tables 1
1. Hash Tables 1
2. Dictionary
- Dictionary
  - Dynamic-set data structure for storing items indexed by keys.
  - Supports the operations Insert, Search, and Delete.
- Applications
  - Symbol table of a compiler.
  - Memory-management tables in operating systems.
  - Large-scale distributed systems.
- Hash Tables
  - An effective way of implementing dictionaries.
  - A generalization of ordinary arrays.
3. Direct-address Tables
- Direct-address tables are ordinary arrays.
- They facilitate direct addressing: the element whose key is k is obtained by indexing into the kth position of the array.
- Applicable when we can afford to allocate an array with one position for every possible key, i.e., when the universe of keys U is small.
- Dictionary operations can be implemented to take O(1) time.
- Details in Sec. 11.1.
4. Hash Tables
- Notation
  - U: universe of all possible keys.
  - K: set of keys actually stored in the dictionary.
  - |K| = n.
- When U is very large,
  - arrays are not practical, and
  - |K| << |U|.
- Use a table of size proportional to |K|: a hash table.
  - However, we lose the direct-addressing ability.
  - Define functions that map keys to slots of the hash table.
5. Hash Tables
- Let U be the universe of keys and let the array have size m. A hash function h is a function from U to {0, 1, ..., m-1}, that is, h: U → {0, 1, ..., m-1}.
[Figure: keys k1, k2, k3, k4, k6 from U (the universe of keys) map into slots 0-7 of the table: h(k2) = 2, h(k1) = h(k3) = 3, h(k6) = 5, h(k4) = 7.]
6. Hash Tables: Example
- For example, if we hash keys in the range 0-1000 into a hash table with 5 entries and use h(key) = key mod 5, we get the following sequence of events (see the sketch below):
  - Insert 21: slot h(21) = 1 is empty, so 21 goes there.
  - Insert 54: slot h(54) = 4 is empty, so 54 goes there.
  - A later insertion of a key that also hashes to 4 causes a collision at array entry 4.
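A minimal Python sketch of this example. The slide does not say which key causes the collision, so a hypothetical third key, 14 (14 mod 5 = 4), is assumed here:

```python
def h(key, m=5):
    return key % m               # division-method hash into 5 slots

table = [None] * 5
for key in (21, 54, 14):         # 14 is a hypothetical colliding key
    slot = h(key)
    if table[slot] is None:
        table[slot] = key
        print(f"Insert {key} at entry {slot}")
    else:
        print(f"Collision at entry {slot}: {table[slot]} is already there")
```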
7. Hashing
- Hash function h: a mapping from U to the slots of a hash table T[0..m-1].
  - h: U → {0, 1, ..., m-1}
- With arrays, key k maps to slot A[k].
- With hash tables, key k maps (or "hashes") to slot T[h(k)].
  - h(k) is the hash value of key k.
8. Hashing
[Figure: keys k1-k5 from the set K of actual keys (a subset of the universe U) hash into slots 0 to m-1 of table T; h(k2) = h(k5), a collision.]
9. Issues with Hashing
- Multiple keys can hash to the same slot: collisions are possible.
  - Design hash functions so that collisions are minimized.
  - But avoiding collisions altogether is impossible.
  - So we also need collision-resolution techniques.
- Search will cost Θ(n) time in the worst case.
  - However, all operations can be made to have an expected complexity of Θ(1).
10. Methods of Resolution
- Chaining
  - Store all elements that hash to the same slot in a linked list.
  - Store a pointer to the head of the linked list in the hash table slot.
- Open Addressing
  - All elements are stored in the hash table itself.
  - When collisions occur, use a systematic (consistent) procedure to store elements in free slots of the table.
[Figure: a hash table with slots 0 to m-1 holding keys k1 through k8.]
11. Collision Resolution by Chaining
[Figure: keys k1-k8 from K hash into T[0..m-1]; h(k1) = h(k4), h(k2) = h(k5) = h(k6), and h(k3) = h(k7) mark collisions (X), while k8 hashes alone to h(k8).]
12. Collision Resolution by Chaining
[Figure: the same example with the collisions resolved by chaining: one linked list holds k1 and k4, another holds k2, k5, and k6, another holds k3 and k7, and k8 sits alone in its slot.]
13. Hashing with Chaining
- What is the running time to insert/search/delete?
  - Insert: it takes O(1) time to compute the hash function and insert at the head of the linked list.
  - Search: proportional to the maximum linked-list length.
  - Delete: same as search.
- Therefore, in the unfortunate event that we have a bad hash function, all n keys may hash to the same table entry, giving an O(n) running time!
- So how can we create a good hash function?
14. Hashing with Chaining
- Dictionary Operations (a code sketch follows below)
  - Chained-Hash-Insert(T, x)
    - Insert x at the head of list T[h(key[x])].
    - Worst-case complexity: O(1).
  - Chained-Hash-Delete(T, x)
    - Delete x from the list T[h(key[x])].
    - Worst-case complexity: proportional to the length of the list with singly linked lists; O(1) with doubly linked lists.
  - Chained-Hash-Search(T, k)
    - Search for an element with key k in list T[h(k)].
    - Worst-case complexity: proportional to the length of the list.
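A minimal sketch of these operations in Python, assuming integer keys and h(k) = k mod m. Python lists stand in for the linked lists, with head insertion as in Chained-Hash-Insert; with doubly linked lists (and a pointer to the node) delete would be O(1) rather than proportional to chain length:

```python
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]    # one chain per slot

    def _h(self, k):
        return k % self.m                      # division-method hash

    def insert(self, k):
        self.table[self._h(k)].insert(0, k)    # O(1): prepend to the chain

    def search(self, k):
        return k in self.table[self._h(k)]     # proportional to chain length

    def delete(self, k):
        self.table[self._h(k)].remove(k)       # proportional to chain length

t = ChainedHashTable(9)
for k in (5, 28, 19):                          # 28 and 19 both hash to 1
    t.insert(k)
print(t.search(19), t.search(20))              # True False
```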
15. Analysis of Chained-Hash-Search
- Load factor α = n/m = average number of keys per slot.
  - m: number of slots.
  - n: number of elements stored in the hash table.
- Worst-case complexity: Θ(n), plus the time to compute h(k).
- The average case depends on how well h distributes the keys among the m slots.
- Assume
  - Simple uniform hashing: any key is equally likely to hash into any of the m slots, independent of where any other key hashes to.
  - O(1) time to compute h(k).
- The time to search for an element with key k is Θ(n_h(k)), the length of the list T[h(k)].
- Expected length of a linked list = load factor α = n/m.
16. Expected Cost of an Unsuccessful Search
Theorem: An unsuccessful search takes expected time Θ(1 + α).
- Proof
  - Any key not already in the table is equally likely to hash to any of the m slots.
  - To search unsuccessfully for any key k, we need to search to the end of the list T[h(k)], whose expected length is α.
  - Adding the time to compute the hash function, the total time required is Θ(1 + α).
17. Expected Cost of a Successful Search
Theorem: A successful search takes expected time Θ(1 + α).
- Proof
  - The probability that a list is searched is proportional to the number of elements it contains.
  - Assume that the element being searched for is equally likely to be any of the n elements in the table.
  - The number of elements examined during a successful search for an element x is 1 more than the number of elements that appear before x in x's list.
  - These are the elements inserted after x was inserted (since new elements go at the head of the list).
- Goal
  - Find the average, over the n elements x in the table, of how many elements were inserted into x's list after x was inserted.
18. Expected Cost of a Successful Search
Theorem: A successful search takes expected time Θ(1 + α).
- Proof (cont'd)
  - Let x_i be the ith element inserted into the table, and let k_i = key[x_i].
  - Define indicator random variables X_ij = I{h(k_i) = h(k_j)}, for all i, j.
  - Simple uniform hashing ⇒ Pr{h(k_i) = h(k_j)} = 1/m ⇒ E[X_ij] = 1/m.
  - The expected number of elements examined in a successful search is
    E[(1/n) Σ_{i=1}^{n} (1 + Σ_{j=i+1}^{n} X_ij)],
    where the inner sum counts the elements inserted after x_i into the same slot as x_i.
19. Proof (cont'd)
- By linearity of expectation,
  E[(1/n) Σ_{i=1}^{n} (1 + Σ_{j=i+1}^{n} X_ij)]
    = (1/n) Σ_{i=1}^{n} (1 + Σ_{j=i+1}^{n} E[X_ij])
    = (1/n) Σ_{i=1}^{n} (1 + (n - i)/m)
    = 1 + (1/nm) · n(n - 1)/2
    = 1 + α/2 - α/(2n).
- Expected total time for a successful search
  = time to compute the hash function + time to search
  = O(2 + α/2 - α/(2n)) = O(1 + α).
20. Expected Cost: Interpretation
- If n = O(m), then α = n/m = O(m)/m = O(1).
  - ⇒ Searching takes constant time on average.
- Insertion is O(1) in the worst case.
- Deletion takes O(1) worst-case time when the lists are doubly linked.
- Hence, all dictionary operations take O(1) time on average with hash tables with chaining.
21. Good Hash Functions
- Should satisfy the assumption of simple uniform hashing.
  - Not possible to satisfy this assumption in practice.
- Often use heuristics, based on the domain of the keys, to create a hash function that performs well.
- Regularity in the key distribution should not affect uniformity: the hash value should be independent of any patterns that might exist in the data.
- E.g., if each key is drawn independently from U according to a probability distribution P:
  - Σ_{k : h(k) = j} P(k) = 1/m for j = 0, 1, ..., m-1.
- An example is the division method.
22. Keys as Natural Numbers
- Hash functions assume that the keys are natural numbers.
- When they are not, we have to interpret them as natural numbers.
- Example: interpret a character string as an integer expressed in some radix notation. Suppose the string is CLRS:
  - ASCII values: C = 67, L = 76, R = 82, S = 83.
  - There are 128 basic ASCII values.
  - So CLRS = 67·128³ + 76·128² + 82·128¹ + 83·128⁰ = 141,764,947 (checked in the sketch below).
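The radix-128 interpretation can be verified with a few lines of Python (ord() returns a character's ASCII code):

```python
def string_to_key(s, radix=128):
    key = 0
    for ch in s:                 # Horner's rule: ((c0*r + c1)*r + c2)*r + ...
        key = key * radix + ord(ch)
    return key

print(string_to_key("CLRS"))     # 141764947
```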
23. Division Method
- Map a key k into one of the m slots by taking the remainder of k divided by m. That is,
  - h(k) = k mod m
- Example: m = 31 and k = 78 ⇒ h(k) = 16.
- Advantage: fast, since it requires just one division operation.
- Disadvantage: we have to avoid certain values of m.
  - Don't pick values such as m = 2^p, or the hash won't depend on all the bits of k (it would be just the p lowest-order bits).
- Good choices for m
  - Primes not too close to a power of 2 (or 10) are good.
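In code the method is a one-liner; a sketch reproducing the example above:

```python
def division_hash(k, m):
    return k % m                 # h(k) = k mod m: a single division operation

print(division_hash(78, 31))     # 16, as in the example
```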
24. Multiplication Method
- If 0 < A < 1, then h(k) = ⌊m (kA mod 1)⌋ = ⌊m (kA − ⌊kA⌋)⌋,
  where "kA mod 1" means the fractional part of kA, i.e., kA − ⌊kA⌋.
- Disadvantage: slower than the division method.
- Advantage: the value of m is not critical.
  - m is typically chosen as a power of 2, i.e., m = 2^p, which makes implementation easy.
- Example: m = 1000, k = 123, A ≈ 0.6180339887
  - h(k) = ⌊1000 · (123 · 0.6180339887 mod 1)⌋ = ⌊1000 · 0.01818...⌋ = 18.
25. Multiplication Method: Implementation
- Choose m = 2^p, for some integer p.
- Let the word size of the machine be w bits.
- Assume that k fits into a single word (k takes w bits).
- Let 0 < s < 2^w (s takes w bits).
- Restrict A to be of the form s/2^w.
- Let k · s = r1 · 2^w + r0.
  - r1 holds the integer part of kA (⌊kA⌋) and r0 holds the fractional part of kA (kA mod 1 = kA − ⌊kA⌋), scaled by 2^w.
- We don't care about the integer part of kA, so we just use r0 and forget about r1.
26. Multiplication Method: Implementation
[Figure: k (w bits) is multiplied by s = A·2^w, giving a 2w-bit product r1·2^w + r0; the binary point sits between r1 and r0, and h(k) is the p bits extracted from the top of r0.]
- We want ⌊m (kA mod 1)⌋. We could get that by shifting r0 to the left by p = lg m bits and then taking the p bits that were shifted to the left of the binary point.
- But we don't need to shift: just take the p most significant bits of r0. (A sketch follows below.)
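A sketch of this fixed-point implementation in Python. The word size w = 32, the choice p = 14, and s = 2654435769 (so that A = s/2^32 ≈ 0.6180339887) are illustrative assumptions, not values fixed by these slides:

```python
w = 32                       # machine word size (assumed)
p = 14                       # m = 2**p = 16384 slots (assumed)
s = 2654435769               # A = s / 2**w, close to (sqrt(5) - 1) / 2

def mult_hash(k):
    r0 = (k * s) % (2 ** w)  # low word of k*s: fractional part of kA, scaled by 2**w
    return r0 >> (w - p)     # the p most significant bits of r0

print(mult_hash(123456))     # 67
```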
27. How to Choose A?
- The multiplication method works with any legal value of A.
- But it works better with some values than with others, depending on the keys being hashed.
- Knuth suggests using A ≈ (√5 − 1)/2 ≈ 0.6180339887 (the reciprocal of the golden ratio).
28. Multiplication Method
- We choose m to be a power of 2 (m = 2^p).
- For example, k = 123456, m = 512. [The worked computation was lost in extraction.]
29. Multiplication Method: Implementation
[Slide content (implementation figure) lost in extraction.]
30. Drawback of Chaining
- Drawbacks of separate chaining:
  - The new operator takes a long time to allocate memory in some languages.
  - We are basically using two data structures: an array and a list.
- Therefore, separate-chaining hash tables, although useful, are not widely used.
31. Open Addressing
- Open addressing means that when a collision occurs at a certain location, we try alternate locations until an empty location is found.
- As opposed to separate chaining, we now maintain only one table (array). There are no associated lists at each array index.
- Alternate locations are found by using a collision resolution strategy, denoted by a function f().
32. Hash Functions Using a Collision Resolution Strategy
- Using a collision resolution strategy, the hash function gets modified to h_i(x) (a sketch follows below):
  - h_i(x) = (hash(x) + f(i)) mod tableSize
- Here
  - h_i(x): the new hash function
  - hash(x): the old hash function, typically something like hash(x) = x mod tableSize
  - f(i): the collision resolution strategy
33. Collision Resolution Strategy (cont'd)
- i denotes the number of attempts made by the collision resolution strategy. When a collision occurs and we try to find an empty location (using the collision resolution strategy) for the first time, i = 1. If this first attempt fails, we try to find an empty location a second time, at which point i = 2, and so on.
- Notice that the collision resolution strategy must be a function of i (the number of the attempt). That is why the collision resolution function is denoted f(i).
34. Hash Tables and Collision Resolution
- Some characteristics of hash tables with collision resolution:
  - All data goes inside the table, so a larger table is required.
  - Load factor λ ≤ 0.5 for open addressing.
- We will now investigate different collision resolution strategies. In other words, we will take various functions for f(i) and see how the hash table performs.
35. Collision Resolution Strategy 1: Linear Probing
- In linear probing, f is linear: f(i) = i.
- This means that when there is a collision, we try successive locations starting from the location of the collision until we find an empty location.
36. Linear Probing: Example
- Example: insert the following data into a hash table using linear probing as the collision resolution strategy. Assume tableSize = 10:
  - 17 26 38 9 7 66 11
- Unless otherwise stated, we will assume that the original hash function is
  - hash(x) = x mod tableSize = x mod 10
- Since we are using linear probing, we have f(i) = i.
- Let us now compute h_i(x) for each input value and place it in the array.
37. Linear Probing: Example (cont'd)
- h_0(17) = hash(17) + f(0) = (17 mod 10) + 0 = 7. (Remember that f(0) = 0.)
- Location 7 is currently empty, so there is no collision and 17 is entered into the table.
- Similarly, 26, 38, and 9 do not create any collisions and are entered into the table.
- The diagram of the table after these four insertions is shown on the next slide.
38. Linear Probing: Example (cont'd)
[Figure: the table after inserting 17, 26, 38, and 9: index 6 holds 26, index 7 holds 17, index 8 holds 38, and index 9 holds 9; all other entries are empty.]
39. Linear Probing: Example (cont'd)
- The next value to be inserted is 7.
- h_0(7) = hash(7) + f(0) = 7 mod 10 = 7. Index 7 of the array is already occupied by 17.
- So we have a collision, and we have to use the collision resolution strategy to find an empty location for 7.
- Since this is our first attempt to find an empty location, i = 1.
- Since we are using linear probing, f(i) = i, so f(1) = 1 and h_1(7) = (hash(7) + 1) mod 10 = (7 + 1) mod 10 = 8 mod 10 = 8.
40. Linear Probing: Example (cont'd)
- However, location 8 is already occupied by 38.
- So we have to use collision resolution once again, now with i = 2.
- Since we are using linear probing, f(2) = 2 and h_2(7) = (hash(7) + 2) mod 10 = (7 + 2) mod 10 = 9 mod 10 = 9.
41. Linear Probing: Example (cont'd)
- However, location 9 is also occupied (by 9).
- So we have to use collision resolution once again, now with i = 3.
- Since we are using linear probing, f(3) = 3 and h_3(7) = (hash(7) + 3) mod 10 = (7 + 3) mod 10 = 10 mod 10 = 0.
- Location 0 is empty, so we insert 7 at index 0.
42. Linear Probing: Example (cont'd)
- The next value to be inserted is 66.
- h_0(66) = hash(66) + f(0) = 66 mod 10 = 6.
- Location 6 is already occupied by 26, so we get a collision.
- We have to use the collision resolution strategy with linear probing, as we did while inserting 7.
- Work this out as we did on the last few slides for the insertion of 7.
43. Linear Probing: Example (cont'd)
- 66 collides 5 times (at locations 6, 7, 8, 9, and 0) and gets inserted at location 1.
- Next, we have to insert 11.
- Once again we get a collision, but 11 can be inserted after the first collision (at location 2).
- Verify the insertion of 11 into the hash table as we did in the example before.
- The diagram of the hash table after all the inserts is given on the next slide.
44. Linear Probing: Example (cont'd)
[Figure: the final table: index 0 holds 7, index 1 holds 66, index 2 holds 11, index 6 holds 26, index 7 holds 17, index 8 holds 38, and index 9 holds 9. A code sketch follows.]
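The whole insertion sequence can be replayed with a short Python sketch; running it reproduces the table above:

```python
def insert_linear(table, x):
    m = len(table)
    for i in range(m):                 # h_i(x) = (hash(x) + i) mod m
        slot = (x % m + i) % m
        if table[slot] is None:
            table[slot] = x
            return slot
    raise RuntimeError("table is full")

table = [None] * 10
for x in (17, 26, 38, 9, 7, 66, 11):
    insert_linear(table, x)
print(table)   # [7, 66, 11, None, None, None, 26, 17, 38, 9]
```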
45. Drawbacks of Linear Probing
- The time to find an empty cell can be quite large. For example, we had to inspect 4 locations before we found an empty location to insert 7. The same problem was encountered while inserting 66.
- The hash table can be relatively empty while pockets of occupied cells form. For example, when we inserted 66, the lower part of the hash table was full but the upper part was entirely empty.
- Primary clustering: several attempts are required to resolve a collision, because occupied cells form contiguous runs that every colliding key must walk through. For example, 66 collided as many as 5 times while being inserted.
46. Collision Resolution Strategy 2: Quadratic Probing
- In quadratic probing, f(i) = i². All other techniques remain the same as in linear probing.
- Example: insert the following data into a hash table using quadratic probing as the collision resolution strategy. Assume tableSize = 10:
  - 17 26 38 9 7 66 11
- As in the previous example, 17, 26, 38, and 9 get inserted without any collisions.
47. Quadratic Probing: Example
- When we try to insert 7, hash(7) = 7 mod 10 = 7. Location 7 is already occupied, so we get a collision.
- We should now try to find an empty location using the collision resolution strategy of quadratic probing. Since we are trying to find an empty location for the first time, i = 1.
48. Quadratic Probing: Example (cont'd)
- Since we are now using quadratic probing, f(i) = i², so f(1) = 1² = 1 and h_1(7) = (hash(7) + 1) mod 10 = (7 + 1) mod 10 = 8 mod 10 = 8.
- Location 8 is already occupied, so we try another round of collision resolution, now with i = 2.
- f(2) = 2² = 4 and h_2(7) = (hash(7) + 4) mod 10 = (7 + 4) mod 10 = 11 mod 10 = 1.
49. Quadratic Probing: Example (cont'd)
- Location 1 is empty, so we insert 7 there.
- Notice that we had far fewer collisions while inserting 7 with quadratic probing than we had with linear probing.
50. Quadratic Probing: Example (cont'd)
- Let us now try to insert the next value, 66.
- We get a collision at location 6 and use quadratic probing to find an empty cell.
- The cells probed by quadratic probing are:
  - with i = 1, location 7, which gives a collision again;
  - with i = 2, location (6 + 4) mod 10 = 0, which is empty: 66 is inserted here.
- Once again, notice that we had fewer collisions than we had with linear probing.
51. Quadratic Probing: Example (cont'd)
- Please work out the insertion of 11 by yourself.
- The diagram of the hash table after all the data has been inserted is given on the next slide.
52. Quadratic Probing: Example (cont'd)
[Figure: the final table: index 0 holds 66, index 1 holds 7, index 2 holds 11, index 6 holds 26, index 7 holds 17, index 8 holds 38, and index 9 holds 9. A code sketch follows.]
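The same replay with f(i) = i² reproduces the quadratic-probing table:

```python
def insert_quadratic(table, x):
    m = len(table)
    for i in range(m):                 # h_i(x) = (hash(x) + i*i) mod m
        slot = (x % m + i * i) % m
        if table[slot] is None:
            table[slot] = x
            return slot
    raise RuntimeError("no empty slot found")

table = [None] * 10
for x in (17, 26, 38, 9, 7, 66, 11):
    insert_quadratic(table, x)
print(table)   # [66, 7, 11, None, None, None, 26, 17, 38, 9]
```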
53. Quadratic Probing: Problem 1
- There is no guarantee of finding an empty cell once the table is more than half full (see the proof on page 92; this proof is not required for the exam).
54. Quadratic Probing: Problem 2
- Standard deletion cannot be used.
- To understand this, let us see how we find 66 in the hash table built in the earlier example.
- hash(66) = 66 mod 10 = 6.
- Location 6 contains 26, which is not the value we are looking for.
- This means that either
  - 66 is not in the hash table at all, or
  - 66 got stored somewhere else when we used quadratic probing to find an empty location while inserting it.
55. Quadratic Probing: Problem 2 (cont'd)
- Since we just solved this example, we know that the second option is what actually happened.
- However, the find routine does not know this.
- So the find method has to visit each location that might have been visited by quadratic probing while inserting 66.
56. Quadratic Probing: Problem 2 (cont'd)
- These locations are at distances 1, 4, 9, 16, 25, ... (everything taken mod tableSize) from location 6 (the value returned by hash(66)).
- Notice that these distances are i² from location 6, because quadratic probing uses f(i) = i² for i = 1, 2, 3, and so on.
- So the find method looks at location (6 + 1) mod 10 = 7 and does not find 66.
- Next, the find method looks at location (6 + 4) mod 10 = 10 mod 10 = 0 and finds 66.
57. Quadratic Probing: Problem 2 (cont'd)
- However, what would have happened if we had deleted 26 first and then tried to find 66?
- Since hash(66) = 66 mod 10 = 6, and location 6 is now empty, the find method would (wrongly) conclude that 66 is not in the table: it reasons that if 66 had ever been inserted, the first location probed for it (location 6) would not be free.
58. Quadratic Probing: Problem 2 (cont'd)
- The solution is to use a technique called lazy deletion.
- In lazy deletion, along with each location we maintain a tag that is initially cleared.
- When there is a collision at a location during insertion, that location's tag is set.
- Then quadratic (or some other) probing is used to locate an empty cell and insert the data.
59. Quadratic Probing: Problem 2 (cont'd)
- With lazy deletion, when we insert 66 we get a collision at location 6 (occupied by 26) and we set the tag for location 6.
- Later on, if we delete 26, the tag remains set.
- Now, when find encounters an empty location at index 6, it checks whether the tag is set.
- Since the tag is set, find knows that there is another element that should have been at location 6 but got bumped to another location by the collision resolution strategy, so it keeps probing. (A sketch of a common variant follows below.)
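A sketch of lazy deletion in Python. A widely used variant (shown here) marks deleted slots with a tombstone that searches probe past, rather than the per-slot collision tag the slides describe; the effect on find is the same:

```python
EMPTY, DELETED = object(), object()

def find_quadratic(table, x):
    """Return the slot holding x, or None; probes past tombstones."""
    m = len(table)
    for i in range(m):
        slot = (x % m + i * i) % m
        if table[slot] is EMPTY:       # truly empty: x cannot be further on
            return None
        if table[slot] == x:           # skips DELETED and non-matching slots
            return slot
    return None

def delete_lazy(table, x):
    slot = find_quadratic(table, x)
    if slot is not None:
        table[slot] = DELETED          # tombstone keeps probe chains intact

table = [EMPTY] * 10
table[6], table[7], table[0] = 26, 17, 66   # 66 was bumped from 6 to 0
delete_lazy(table, 26)                      # slot 6 becomes a tombstone
print(find_quadratic(table, 66))            # 0 -- the probe passes slot 6
```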
60. Quadratic Probing: Problem 3
- Suppose that a collision occurs while inserting at location x.
- Then the locations probed using quadratic probing are (x + 1) mod 10, (x + 4) mod 10, (x + 9) mod 10, (x + 16) mod 10, (x + 25) mod 10, and so on.
61. Quadratic Probing: Problem 3 (cont'd)
- Let us substitute a value for x, say x = 5.
- The successive locations probed by quadratic probing until an empty location is found are:
  - (5 + 1) mod 10 = 6
  - (5 + 4) mod 10 = 9
  - (5 + 9) mod 10 = 14 mod 10 = 4
  - (5 + 16) mod 10 = 21 mod 10 = 1
  - (5 + 25) mod 10 = 30 mod 10 = 0
  - (5 + 36) mod 10 = 41 mod 10 = 1
  - (5 + 49) mod 10 = 54 mod 10 = 4
  - (5 + 64) mod 10 = 69 mod 10 = 9
  - (5 + 81) mod 10 = 86 mod 10 = 6, and so on.
- Notice that some locations (1, 4, 6, 9) get probed repeatedly.
62. Secondary Clustering
- This problem is called secondary clustering.
- Secondary clustering: elements that hash to the same location always probe the same set of cells.
- This is solved by the last collision resolution strategy we are going to study: double hashing.
63. Collision Resolution Strategy 3: Double Hashing
- Here the probing function is f(i) = i · hash2(x).
- hash2(x) is called the secondary hash function.
- However, a bad choice of hash2(x) can really make matters worse.
- Let us assume that tableSize = 10, as in the previous examples.
64. A Bad Choice for hash2(x)
- For example, suppose hash2(x) = x mod 7 and we try to insert 7.
- hash(7) = 7 mod 10 = 7. Suppose that location 7 is already occupied, so there is a collision.
- Now we use our collision resolution strategy with i = 1, f(i) = i · hash2(x). So f(1) = 1 · hash2(7) = 1 · (7 mod 7) = 0.
- Therefore, h_1(7) = 7.
- In fact, f(2) also equals 0, so h_2(7) = 7.
- So we are not going anywhere: we repeatedly probe location 7.
65. A Good Choice for hash2(x)
- An example of a good secondary hash function is hash2(x) = R − (x mod R), where R is a prime number less than tableSize.
- If tableSize = 10 (as in our example), R = 7 is a good choice.
66. Double Hashing: Example
- Example: insert the following data into a hash table using double hashing as the collision resolution strategy:
  - 89 18 49 58 69
- 89 and 18 do not create any collisions and get inserted at locations 9 and 8, respectively.
- h_0(49) = (hash(49) + f(0)) mod 10 = (49 mod 10 + 0) mod 10 = 9. Location 9 is already occupied, so we get a collision.
67. Double Hashing: Example (cont'd)
- hash2(49) = 7 − (49 mod 7) = 7 − 0 = 7
- So,
  h_1(49) = (hash(49) + f(1)) mod 10
          = (49 mod 10 + 1 · hash2(49)) mod 10
          = (9 + 7) mod 10 = 16 mod 10 = 6
- Location 6 is empty, and 49 is inserted there.
68. Double Hashing: Example (cont'd)
- 58 and 69 also collide when we try to insert them, and each collision is resolved on the first attempt (with i = 1) using double hashing.
- Verify that hash2(58) = 7 − (58 mod 7) = 7 − 2 = 5 and that 58 gets inserted at location 3.
- Verify that hash2(69) = 7 − (69 mod 7) = 7 − 6 = 1 and that 69 gets inserted at location 0.
- The hash table after all insertions is shown on the next slide.
69. Double Hashing: Example Figure
[Figure: the final table: index 0 holds 69, index 3 holds 58, index 6 holds 49, index 8 holds 18, and index 9 holds 89. A code sketch follows.]
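A sketch of the double-hashing insertions in Python; running it reproduces the table above:

```python
def insert_double(table, x, R=7):
    m = len(table)
    h2 = R - (x % R)                   # secondary hash, never zero
    for i in range(m):                 # h_i(x) = (hash(x) + i*hash2(x)) mod m
        slot = (x % m + i * h2) % m
        if table[slot] is None:
            table[slot] = x
            return slot
    raise RuntimeError("no empty slot found")

table = [None] * 10
for x in (89, 18, 49, 58, 69):
    insert_double(table, x)
print(table)   # [69, None, None, 58, None, None, 49, None, 18, 89]
```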
70. Double Hashing: Problem
- To understand the problem, suppose that we are inserting 23 into the hash table on the last slide.
- We get a collision at position 3, which is already occupied by 58.
- Since we are using double hashing, hash2(23) = 7 − (23 mod 7) = 7 − 2 = 5.
- So h_1(23) = (hash(23) + 1 · hash2(23)) mod 10 = (3 + 1 · 5) mod 10 = 8 mod 10 = 8.
- Position 8 is also occupied (by 18).
71. Double Hashing: Problem (cont'd)
- So we try to find an empty space again using double hashing:
  h_2(23) = (hash(23) + 2 · hash2(23)) mod 10 = (3 + 2 · 5) mod 10 = 13 mod 10 = 3.
- Location 3 is already occupied.
- We try yet again:
  h_3(23) = (hash(23) + 3 · hash2(23)) mod 10 = (3 + 3 · 5) mod 10 = 18 mod 10 = 8.
- Location 8 is already occupied and had already been probed when we computed h_1(23).
72. Double Hashing: Problem (cont'd)
- In fact, if you try further attempts with i = 4, 5, 6, and so on, you will see that locations 3 and 8 get probed over and over.
- The reason for this is that tableSize = 10 is not a prime.
- The solution is to make tableSize prime (e.g., 11 is a good choice for tableSize).
73. Double Hashing: Ideal Secondary Hash Function
- A properly selected secondary hash function hash2(x) ensures that the expected number of probes is close to that of a random collision resolution strategy.
74. Double Hashing vs. Linear and Quadratic Probing
- Compared to double hashing, linear and quadratic probing are faster, because f(i) = i · hash2(x) takes longer to compute than f(i) = i or f(i) = i².
75. Rehashing
- Rehashing tells us what to do when the hash table gets full.
- Instead of waiting for the hash table to become completely full, it is more efficient to rehash when the table is about 70% or 80% full.
- The most common rehashing technique is to construct a new table of approximately double the size of the original hash table.
- Since the new table has a different size, tableSize gets a new value, and so a new hash function, hash(x) = x mod new_tableSize, has to be defined.
76. Rehashing: Example
- Example: insert 13, 15, 6, 24, 23 into an initially empty hash table. Assume tableSize = 7 and use linear probing for collision resolution. (The table is drawn in the book on pages 198-199; please see it.)
- Since tableSize = 7, hash(x) = x mod 7.
- After 23 is inserted, the hash table is about 70% full (5 of 7 slots occupied).
- Rehash: new table size = 7 × 2 = 14, but 14 is not a prime number.
- So we select the prime number closest to and greater than 14, i.e., 17, as the new tableSize.
- The new hash function is now hash(x) = x mod 17.
77. Rehashing (cont'd)
- All the data from the original table has to be inserted into the new table at the new locations given by the new hash function. (See page 199 of the book for the diagram.)
- Rehashing is a costly operation, and it happens frequently when the hash table is small and there are a lot of insertions.
- The time required for rehashing is O(N), since N elements need to be rehashed from the original hash table into the new one.
- Spread over the insertions that triggered it, however, it adds only a constant amortized cost to each insertion. (A sketch follows below.)
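A sketch of rehashing in Python, using the slides' example (tableSize = 7, linear probing, growth to the next prime above 2 × 7 = 14, i.e., 17):

```python
def is_prime(k):
    return k > 1 and all(k % d for d in range(2, int(k ** 0.5) + 1))

def insert_linear(table, x):            # as on the linear-probing slides
    m = len(table)
    for i in range(m):
        slot = (x % m + i) % m
        if table[slot] is None:
            table[slot] = x
            return
    raise RuntimeError("table is full")

def rehash(table):
    new_size = 2 * len(table)
    while not is_prime(new_size):       # 14 -> 15 -> 16 -> 17
        new_size += 1
    new_table = [None] * new_size
    for x in table:                     # O(N): reinsert every element under
        if x is not None:               # the new hash(x) = x mod new_size
            insert_linear(new_table, x)
    return new_table

table = [None] * 7
for x in (13, 15, 6, 24, 23):
    insert_linear(table, x)
table = rehash(table)                   # table is now about 70% full
print(len(table), table)                # 17 slots, all five keys rehashed
```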
78. Other Rehashing Techniques
- Rehash when the table is half full.
- Rehash as soon as an insertion fails.
- Rehash beyond a certain load factor λ.
- Technique 2 above gives the best results, since performance degrades as λ increases.
79. Advantages of Rehashing
- Frees the programmer from worrying about tableSize while inserting data.
- Hash tables cannot always be made large enough to start with in complex programs.
- Rehashing can be used for other data structures as well (e.g., queues).