Hash Tables

Slides: 112

Provided by: chel1

1
Hash Tables

The crucial disadvantage of arrays is that we
need to allocate the size of the structure in
advance.
We tend to overestimate its size and end up with
a very sparse structure.

2
Storing BIG DATA

We tend to think that the actual number of keys
to be stored is equal to the universe of
possible existing keys

3
Hash Tables

Often the number of keys to be stored is smaller
than the number in the universe of keys.
In this case, a hash table may save us a lot of
space.

4
Hash Tables

How can you store all possible SSNs in an array?
Use an array with range 0 - 999,999,999: a
billion possible locations!
This will give you O(1) access time, but
considering there are approximately
308,000,000 people in the USA, you waste about
1,000,000,000 - 308,000,000 = 692,000,000 array
entries!

5
Problem - Wasted Space

Problem
The range of key values we are mapping is too
large
(0 - 999,999,999) when compared to
the number of actual keys (US citizens).

6
Hash Tables

All search structures so far
relied on a comparison operation.
Performance is O(n) or O(log n) for input of
size n.
WE CAN DO BETTER WITH HASHING

Simplest case
Assume we have keys with values in the range 1 ..
M.
Use a hash method to compute an int from the
key, and use that int to select a slot in a direct
access table in which to store the item.

8
Hash(key)

To search for an item with key k,
look in slot hash(k), which
produces an int that maps to
an index in the array.
If there's an item there, you've found it.
If the tag is 0, it's missing.

9
CONSTANT TIME SEARCH

This produces a Constant time search
O(1)

10
Example (ideal) hash function

Suppose we now have Strings and must hash them to
an integer.
Our hash function maps the following values:
hashCode("apple") = 5
hashCode("watermelon") = 3
hashCode("grapes") = 8
hashCode("cantaloupe") = 7
hashCode("kiwi") = 0
hashCode("strawberry") = 9
hashCode("mango") = 6
hashCode("banana") = 2

11
Why hash tables?

We use key/value pairs to store an Entry in the
table.
We use a hash function to map a key such as
"hawk" to an integer.
The value column holds the data we are actually
interested in.

12
Hash Functions

Hash tables normally provide O(1) time (constant
time) to access an element.
In a direct access table, an element with key k
(an integer value) is stored in slot k.
In hash tables, this element is stored in
slot hash(k).

13
HASH FUNCTIONS

hash(k) is a hash function.
It maps the universe U of keys into the slots of
a hash table (smaller than the universe),
thus reducing the size of the space we need to
use.

14
Pictorial view of Hash Tables
UNIVERSE OF VALUES ARE MAPPED TO A SMALLER NUMBER
OF SLOTS
[Figure: keys k1, k2, k3, k4 mapped to table slots]
15
Hashing

Assume I have a hash function where the key is a
String,
e.g. a label which represents a city in our
HPAir project.
hash(key) -> integer
i.e. the function maps the key to an integer,
that is, a String city name to an int
which is an index into the HashMap.
What performance (Big-O) do I get?

16
Hash Tables - Constraints

Initial constraints: hash a key to an integer.
The hashcode of a key must be unique.
Keys must lie in a small range for storage
efficiency.
Keys must be dense in the range.
If they're sparse (lots of gaps between
values), a lot of space is used to obtain speed.

17
Hash Tables -

Hashing keys produces integers, therefore
we need a hash function hash(key) ->
integer,
i.e. one that maps (hashes) a key to an integer.
Applying this function to the key produces a
unique address.

18
Problems with a unique address for each key

If hash(key) maps each key to a unique integer in
the range 0 .. m-1,
then search is O(1).
BUT THIS IS HARD TO DO!!!!!

Example - using an n-character key, e.g. a
String
n = number of characters in the String.
Use a String class method to change the String
to a character array.
Call a method with the array name and the number
of chars in the String:
hash(char array, number of characters)

20
Hashing a string of characters

// n = number of chars in the String
int hash(char[] sarray, int n) {
    int sum = 0, i = 0;
    // sum the character codes (ASCII values);
    // a char is added as its numeric code
    while (n-- > 0)
        sum = sum + sarray[i++];
    // the number of ASCII characters is 256, so
    // this returns a value in 0 .. 255
    return sum % 256;
}

21
Evaluation

int hash(char[] sarray, int n) {
    int sum = 0, i = 0;
    // get the ASCII value of each character
    // and sum them
    while (n-- > 0)
        sum = sum + sarray[i++];
    return sum % 256; // returns a value in 0 .. 255
}
The hash function takes time proportional to the
number of characters in the String. That number
does not depend on how many items are in the
table, so the hash function is treated as O(1).

22
Hash Tables PROBLEM -Collisions

With this hash function
int hash(char[] s, int n) { int sum = 0, i = 0;
while (n-- > 0) sum = sum + s[i++];
return sum % 256; }
FOR
hash("AB", 2) and hash("BA", 2) the
ASCII (Unicode) values give the same result!
The Unicode value for A is 65, for B it is 66.
Add them together in any order and they
equal 131.
This is called a collision.
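
The collision can be checked with a few lines of Java. This is a minimal sketch, not the course's code: the class name AdditiveHashDemo is an assumption, and the method takes a String directly instead of a char array plus length.

```java
// Sketch of the additive string hash described above.
// The class name is illustrative, not from the slides.
class AdditiveHashDemo {
    // Sum the character codes of s and reduce the total to 0 .. 255.
    static int hash(String s) {
        int sum = 0;
        for (char c : s.toCharArray()) {
            sum += c; // a char widens to its numeric code, e.g. 'A' is 65
        }
        return sum % 256;
    }

    public static void main(String[] args) {
        System.out.println(hash("AB")); // 65 + 66 = 131
        System.out.println(hash("BA")); // 66 + 65 = 131, the same slot
    }
}
```

Because addition ignores order, any two strings made of the same characters land in the same slot.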

23
Collisions

Because we're mapping a larger universe into a
smaller set of slots, collisions occur.
A variety of techniques are used for resolving
collisions.
Therefore giving every key a unique address is
HARD TO DO.

24
Pictorial view OF COLLISION
Sometimes keys map to the same memory location
COLLISION
[Figure: keys k1 to k5 mapped to table slots, with two keys landing in the same slot]
25
Hash Tables Collision solutions I

We need to store the actual key with the item in
the hash table.
We compute the address:
index = hash(key)
Next, look at that index in the table.
If the location is occupied, we try the
next entry until we find an open one.

26
Collision Resolution Open Hashing

The most common resolution mechanism for
collisions is called chaining.
This is also called Open Hashing.
Being "open", the Hashtable will store a linked
list of entries whose keys hash to the same value
Chaining incorporates the concepts of linked
lists and direct access structures like arrays
Each slot of a hash table will be a pointer to a
linked list

27
Chaining or open hashing

When hashing a key, if a collision happens
the new key is stored in the linked list in that
location
E.g., suppose that we're mapping the universe of
integers to a hash table of size 10

28
Open Hash Table
[Figure: KEYS, BUCKETS, ENTRIES]
John Smith and Sandra map to the same location,
so a linked list is started from John to Sandra.
29
Hash Tables - Linked lists

Collisions - Resolution
A linked list is attached to each primary table
slot
// Three entries map to the same location
h(k) = h(k1) = h(k2)
Searching for k1
Calculate hash(k1)
Item doesn't match
Follow the linked list to k1
If NULL is found, the key isn't in the table

30
Hash Tables - Linked Lists

If a search can be satisfied by any item with
key k, performance is still O(1),
but
if the key values are different
we get O(1 + max),
where max is the largest number of duplicates,
or the length of the
longest chain (Linked List).

TECHNIQUE TWO - USE AN OVERFLOW AREA
Linked list constructed in a special area of the
table called the OVERFLOW AREA
If two keys map to the same location
hash(k) = hash(j)
k stored first
Adding j
When hash(j) maps to hash(k)
Find k THEN
Go to the first free slot in the overflow area
Put j in it
Searching - same as linked list

32
Hashing(103)

Our hash function is based on the division method
for creating hash functions:
hash(k) = k mod size

hash(103) = 103 mod 10, so hash(103) = 3
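33
Hashing(103)
hash(103) = 103 mod 10 = 3
[Figure: slot 3: 103]
34
Hashing(69)
hash(69) = 69 mod 10 = 9
[Figure: slot 3: 103; slot 9: 69]
35
Hashing(20)
hash(20) = 20 mod 10 = 0
[Figure: slot 0: 20; slot 3: 103; slot 9: 69]
36
Hashing(13)
hash(13) = 13 mod 10 = 3
[Figure: slot 0: 20; slot 3: 103 -> 13; slot 9: 69]
37
Hashing(110)
hash(110) = 110 mod 10 = 0
[Figure: slot 0: 20 -> 110; slot 3: 103 -> 13; slot 9: 69]
38
Hashing(53)
hash(53) = 53 mod 10 = 3
[Figure: slot 0: 20 -> 110; slot 3: 103 -> 13 -> 53; slot 9: 69]
39
Final Hash Table
[Figure: slot 0: 20 -> 110; slot 3: 103 -> 13 -> 53; slot 9: 69]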
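40
Searching for 53 Using Chaining
41
Searching for 53
[Figure: chained table with 20 -> 110, 103 -> 13 -> 53, and 69]
42
Searching for 53
[Figure: the same table with a temp pointer on the chain being searched]
43
Searching for 53
[Figure: the same table with the temp pointer advanced along the chain]
44
Searching for 53
[Figure: the same table with the temp pointer at 53]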
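
The chaining walk-through above can be reproduced with a small Java sketch. The class ChainedTable and its method names are assumptions made for this example. It appends new keys at the tail of a chain to match the order shown in the figures, although inserting at the head is also common.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Sketch of a chained (open) hash table for int keys.
// Class and method names are illustrative only.
class ChainedTable {
    private final List<LinkedList<Integer>> slots = new ArrayList<>();

    ChainedTable(int size) {
        for (int i = 0; i < size; i++) {
            slots.add(new LinkedList<>());
        }
    }

    // Division method: hash(k) = k mod size.
    // floorMod keeps negative keys inside 0 .. size-1.
    int hash(int key) {
        return Math.floorMod(key, slots.size());
    }

    // Append the key to the chain in its slot unless it is already there.
    void insert(int key) {
        LinkedList<Integer> chain = slots.get(hash(key));
        if (!chain.contains(key)) {
            chain.addLast(key);
        }
    }

    // Walk only the chain that the key hashes to.
    boolean contains(int key) {
        for (int stored : slots.get(hash(key))) {
            if (stored == key) {
                return true;
            }
        }
        return false;
    }

    // Copy of the chain stored in one slot, for inspection.
    List<Integer> chain(int slot) {
        return new ArrayList<>(slots.get(slot));
    }

    public static void main(String[] args) {
        ChainedTable t = new ChainedTable(10);
        for (int k : new int[] {103, 69, 20, 13, 110, 53}) {
            t.insert(k);
        }
        System.out.println(t.chain(3));     // [103, 13, 53]
        System.out.println(t.contains(53)); // true
    }
}
```

Searching for 53 only touches slot 3, so the cost depends on the length of that one chain, not on the size of the table.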
45
Closed Hashing - Re-hash functions

Closed hashing is a method of collision
resolution in hash tables.
With this method, a hash collision is resolved
by
probing, or
searching through other locations in the array.

46
1 Solution - Linear probing

In one variation, the probing sequence
is called
(1) Linear Probing
Continue probing adjacent locations
until an unused array slot is found.
Then put the Entry in that location.

47
Closed hashing - e.g. linear probing

Closed Hashing keeps keys in the main table and
uses a re-hash function, which has many
variations.
Linear probing (the previous example) is the most
commonly used variation.
Closed Hashing uses the Main Table, or flat area,
to find another location.

48
Rehash function - linear probing

The rehash function for Linear Probing
adds 1 to the current position:
try hash(x), then hash(x) + 1, and so on.
Keep going to the next slot until you find an
empty one.

49
Insertion, I

Suppose you want to add seagull to this hash
table
Also suppose
hashCode(seagull) = 143
table[143] is not empty
table[143] != seagull
table[144] is not empty
table[144] != seagull
table[145] is empty
Therefore, put seagull at location 145

50
Searching, I

Suppose you want to look up seagull in this hash
table
Also suppose
hashCode(seagull) = 143
table[143] is not empty
table[143] != seagull
table[144] is not empty
table[144] != seagull
table[145] is not empty
table[145] == seagull !
We found seagull at location 145

51
Searching, II

Suppose you want to look up cow in this hash
table
Also suppose
hashCode(cow) = 144
table[144] is not empty
table[144] != cow
table[145] is not empty
table[145] != cow
table[146] is empty
If cow were in the table, we should have found it
by now
Therefore, it isn't here

52
Insertion, II

Suppose you want to add hawk to this hash table
Also suppose
hashCode(hawk) = 143
table[143] is not empty
table[143] != hawk
table[144] is not empty
table[144] == hawk
hawk is already in the table, so do nothing

53
Insertion, III

Suppose
you want to add cardinal to this hash table
hashCode(cardinal) = 147
The last location is 148
147 and 148 are occupied
Solution
Treat the table as circular: after 148 comes 0
Hence, cardinal goes in location 0 (or 1, or 2,
or ...)

54
Linear PROBING Review

Closed Hashing uses Linear Probing (among others)
Linear Probing: if position h(key) is occupied,
do a linear search in the table until you find an
empty slot.
The slots are searched in this order (wrapping
around at the end of the table):
h(key), h(key)+1, h(key)+2, ..., h(key)+c
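
The insertion and search walk-throughs above can be sketched in Java as follows. The class LinearProbingTable and its methods are hypothetical names for this example. It uses String.hashCode() reduced with Math.floorMod as the home position, which is an assumption and not the hashCode values from the slides.

```java
// Sketch of closed hashing with linear probing.
// Class and method names are illustrative only.
class LinearProbingTable {
    private final String[] table;

    LinearProbingTable(int size) {
        table = new String[size];
    }

    // Home position; floorMod keeps a negative hashCode() inside 0 .. size-1.
    int home(String key) {
        return Math.floorMod(key.hashCode(), table.length);
    }

    // Returns the slot where key was stored or already lived,
    // or -1 if every slot was probed without success (table full).
    int insert(String key) {
        int i = home(key);
        for (int probes = 0; probes < table.length; probes++) {
            if (table[i] == null) {
                table[i] = key;
                return i;
            }
            if (table[i].equals(key)) {
                return i; // already in the table, so do nothing
            }
            i = (i + 1) % table.length; // circular: after the last slot comes 0
        }
        return -1;
    }

    // Returns the slot holding key, or -1 if an empty slot is reached first.
    int find(String key) {
        int i = home(key);
        for (int probes = 0; probes < table.length; probes++) {
            if (table[i] == null) {
                return -1; // if key were present we would have found it by now
            }
            if (table[i].equals(key)) {
                return i;
            }
            i = (i + 1) % table.length;
        }
        return -1;
    }

    public static void main(String[] args) {
        // One-character strings hash to their character code, so in a table
        // of size 5 the keys "a", "f", "k", "p" all share home position 2.
        LinearProbingTable t = new LinearProbingTable(5);
        System.out.println(t.insert("a")); // 2
        System.out.println(t.insert("f")); // 3
        System.out.println(t.insert("k")); // 4
        System.out.println(t.insert("p")); // 0, wrapped around
    }
}
```

The probe counter bounds both loops, so a full table ends the search instead of looping forever.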

55
Expanding the table

If the table becomes full, an exception can be
thrown or
we can expand the capacity.
This process is involved because if we double
the size,
we risk a sparse structure that can impact the
efficiency we seek.
One solution is to rehash the table using the new
table size.
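
Rehashing into a larger table can be sketched like this. It is a minimal example under stated assumptions: keys are non-negative ints, -1 marks an empty slot, collisions are resolved by linear probing, and the names RehashDemo, newTable, insert and expand are made up for illustration.

```java
import java.util.Arrays;

// Sketch of expanding a closed hash table by rehashing every key.
// All names here are illustrative only.
class RehashDemo {
    static final int EMPTY = -1; // sentinel; assumes keys are non-negative

    static int[] newTable(int size) {
        int[] table = new int[size];
        Arrays.fill(table, EMPTY);
        return table;
    }

    // Insert with linear probing; assumes the table has a free slot.
    static void insert(int[] table, int key) {
        int i = key % table.length;
        while (table[i] != EMPTY) {
            i = (i + 1) % table.length;
        }
        table[i] = key;
    }

    // Build a larger table and re-insert every key using the new size.
    // Old slot numbers cannot be copied because hash(k) depends on the size.
    static int[] expand(int[] old, int newSize) {
        int[] bigger = newTable(newSize);
        for (int key : old) {
            if (key != EMPTY) {
                insert(bigger, key);
            }
        }
        return bigger;
    }

    public static void main(String[] args) {
        int[] small = newTable(5);
        insert(small, 3);
        insert(small, 8);
        insert(small, 13);
        System.out.println(Arrays.toString(small));             // [13, -1, -1, 3, 8]
        System.out.println(Arrays.toString(expand(small, 11))); // 13 at 2, 3 at 3, 8 at 8
    }
}
```

In the small table all three keys collide on slot 3. After rehashing with size 11 each key sits in its own home position.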

56
Closed Hashing - Buckets

One implementation for closed hashing groups hash
table slots into buckets.
The M slots of the hash table are divided into B
buckets, with each bucket consisting of M/B
slots.
The hash function assigns each record to the
first slot within one of the buckets.

57
Bucket Hashing - uses Main Table

If this slot is already occupied,
then the bucket slots are searched sequentially
until an open slot is found.

58
Buckets on the table

If a bucket is entirely full,
then the record is stored in an overflow bucket
of infinite capacity at the end of the table.
All buckets share the same overflow bucket.
See this link for a fuller
explanation:
http://research.cs.vt.edu/AVresearch/hashing/buckethash.php

59
Slots or Buckets 4 buckets
60
Bucket Hashing

To search, hash the key to determine which bucket
should contain the record.
The records in this bucket are then searched.
How is this better than linear probing? -- 1

61
Bucket Hashing

If the desired key value is not found and the
bucket still has free slots, then the search is
complete.
If the bucket is full, then the search goes to
the overflow bucket.
If many records are in the overflow bucket, this
will be an expensive process.
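
The bucket rules above can be sketched in Java. This is an illustration under assumptions, not a reference implementation: keys are non-negative ints, -1 marks an empty slot, no deletions happen, the bucket is chosen with key mod B, and the overflow bucket is an ArrayList. The class name BucketTable is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of bucket hashing: M slots split into B buckets
// plus one shared overflow bucket. Names are illustrative only.
class BucketTable {
    static final int EMPTY = -1;   // sentinel; assumes non-negative keys
    private final int[] slots;     // the main table of M slots
    private final int numBuckets;  // B
    private final int bucketSize;  // M / B slots per bucket
    private final List<Integer> overflow = new ArrayList<>();

    BucketTable(int numBuckets, int bucketSize) {
        this.numBuckets = numBuckets;
        this.bucketSize = bucketSize;
        slots = new int[numBuckets * bucketSize];
        Arrays.fill(slots, EMPTY);
    }

    // The hash function assigns each key to the first slot of one bucket.
    private int bucketStart(int key) {
        return (key % numBuckets) * bucketSize;
    }

    void insert(int key) {
        int start = bucketStart(key);
        for (int i = start; i < start + bucketSize; i++) {
            if (slots[i] == key) {
                return; // already stored
            }
            if (slots[i] == EMPTY) {
                slots[i] = key;
                return;
            }
        }
        if (!overflow.contains(key)) {
            overflow.add(key); // the bucket is entirely full
        }
    }

    boolean contains(int key) {
        int start = bucketStart(key);
        for (int i = start; i < start + bucketSize; i++) {
            if (slots[i] == key) {
                return true;
            }
            if (slots[i] == EMPTY) {
                return false; // free slot in the bucket: the search is complete
            }
        }
        return overflow.contains(key); // full bucket: go to the overflow bucket
    }

    int overflowSize() {
        return overflow.size();
    }
}
```

For example, with 4 buckets of 2 slots each, the keys 0, 4 and 8 all hash to bucket 0. The first two fill the bucket and 8 goes to the overflow bucket.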

62
Bucket Hashing advantage

Bucket methods are good for implementing hash
tables stored on disk, because the bucket size
can be set to the size of a disk block.
Whenever search or insertion occurs, the entire
bucket is read into memory.

63
USING BUCKETS

Because the entire bucket is then in memory,
processing an insert or search operation requires
only one disk access, unless the bucket is full.
If the bucket is full, then the overflow bucket
must be retrieved from disk as well.

64
Clustering

Even with a good hash function, linear probing
has its problems
The position of the initial mapping of key k is
called the home position of k.
When several insertions map to the same home
position, they end up placed contiguously in the
table.
This collection of keys with the same home
position is called a cluster.

65
Clusters

A cluster is a group of items not containing any
open slots
Clusters cause efficiency to degrade

66
Clustering

As clusters grow, the probability increases that
a key will map to the middle of a cluster,
increasing the rate of the cluster's growth.

67
Clusters

This tendency of linear probing to place items
together is known as primary clustering.
As these clusters grow, they merge with other
clusters forming even bigger clusters which grow
even faster.

68
Other collision techniques

We have looked at
chaining (Linked Lists) (Open Hashing),
Linear Probing (Closed Hashing), and
Bucket Hashing.
Let us look at some other collision techniques.

Other Closed hash function techniques are
Quadratic probing: a variant of the above where
the term being added to the hash result is
squared.
h(key) + c^2
Random probing: the term being added to the hash
function is a random number.
h(key) + random()

70
Rehash functions

Rehashing is a technique where a sequence of
hashing functions is defined (h1, h2, ... hk).
If a collision occurs, the functions are used in
this order.

Use a second hash function - Re-Hashing
hash(k) = hash(j)
k stored first
Adding j
Calculate hash(j)
Find k first
Calculate hash2(j), where
hash2 is some
other hash function
Repeat until we find an empty slot
Put j in it

hash2(j) - second hash function
72
Hash Tables - Re-hash functions

The re-hash function has many variations
Quadratic probing
The offset added to h(x) is squared
Avoids primary clustering
Secondary clustering occurs:
all keys which collide on h(x) follow the same
sequence
First
a = h(j)
Then a + c, a + 4c, a + 9c, a + 16c, ....

73
Quadratic Probing

Some versions use
p(K, i) = c1 i^2 + c2 i + c3 for some
choice of constants c1, c2, and c3.
Secondary clustering is generally less of a problem
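
A quadratic probe sequence can be computed with a short Java sketch. The class QuadraticProbe and the parameter names are assumptions for this example. The offset c1*i^2 + c2*i + c3 is added to the home position and reduced modulo the table size m.

```java
// Sketch of a quadratic probe function. Names are illustrative only.
class QuadraticProbe {
    // Slot visited on probe number i for a key whose home position is home:
    // (home + c1*i^2 + c2*i + c3) mod m.
    static int probe(int home, int i, int c1, int c2, int c3, int m) {
        return Math.floorMod(home + c1 * i * i + c2 * i + c3, m);
    }

    public static void main(String[] args) {
        // With c1 = 1, c2 = 0, c3 = 0 the offsets are 0, 1, 4, 9, 16.
        for (int i = 0; i < 5; i++) {
            System.out.print(probe(3, i, 1, 0, 0, 11) + " "); // 3 4 7 1 8
        }
        System.out.println();
    }
}
```

Two keys with the same home position still follow the same sequence, which is the secondary clustering mentioned above.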

74
Searching in a Hash Table

We have already seen how searching works with
chaining.
With Closed Hashing, we use the following steps:
Given a target, hash the target.
Take the value of the hash of the target and go to
that slot.
If the target exists, it is in this slot or in a
slot reached by probing from it.
Search onward from the current slot using a
linear search.

75
Look up a key

public lookup(key)
    int i
    i = find_slot(key)   // method to find key in table
    if slot[i] is occupied   // key is in table
        return slot[i].value   // return value in slot
    else
        // key is not in table
        return not found

76
linear probing and single-slot step

public find_slot(key)
    int i
    i = hash(key)   // use a hash method to hash the key
    // search until we either find the key, or find
    // an empty slot
    while ( (slot[i] is occupied) and
            (slot[i].key != key) )
        i = (i + 1) mod table size   // wrap around at the end
    return i

77
Deleting in a table Closed Hashing

Suppose you want to look up cow in this hash
table
Also suppose
hashCode(cow) = 144
table[144] is not empty
table[144] != cow
table[145] is not empty
table[145] != cow
table[146] is empty
If cow were in the table, we should have found it
by now
Therefore it is not there.

78
Deleting from a table

Problem
When an empty slot is reached, we assume the
item we are searching for is not there.
Deletion leaves an empty slot.
When we next search for an item using linear
probing,
we assume the item is not there when we reach
the empty slot.

79
Tombstones

We assume the item is not there when we reached
the empty slot.
When, in fact, the item could be AFTER the empty
slot.

80
TOMBSTONES
Therefore, straight deletion of an item would not
work. Instead, the cell is marked (usually by
use of a boolean variable) when an item is
deleted. The marked slot is often termed a
tombstone.
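
A tombstone can be sketched in Java as follows. This is one possible approach under stated assumptions: it uses a special marker object instead of a boolean flag, the home position comes from String.hashCode(), and the class TombstoneTable and its methods are hypothetical names.

```java
// Sketch of deletion with tombstones in a linear probing table.
// Class and method names are illustrative only.
class TombstoneTable {
    // Unique marker object, compared by identity so it never matches a real key.
    private static final String TOMBSTONE = new String("<deleted>");
    private final String[] table;

    TombstoneTable(int size) {
        table = new String[size];
    }

    private int home(String key) {
        return Math.floorMod(key.hashCode(), table.length);
    }

    // Returns the slot holding key, or -1. Only a truly empty slot stops
    // the search; a tombstone is skipped because the key may lie after it.
    int find(String key) {
        int i = home(key);
        for (int probes = 0; probes < table.length; probes++) {
            if (table[i] == null) {
                return -1;
            }
            if (table[i] != TOMBSTONE && table[i].equals(key)) {
                return i;
            }
            i = (i + 1) % table.length;
        }
        return -1;
    }

    // Mark the slot instead of emptying it.
    void delete(String key) {
        int i = find(key);
        if (i >= 0) {
            table[i] = TOMBSTONE;
        }
    }

    // Returns the slot used, or -1 if there is no room.
    // A tombstone slot may be reused for a new key.
    int insert(String key) {
        int existing = find(key);
        if (existing >= 0) {
            return existing;
        }
        int i = home(key);
        for (int probes = 0; probes < table.length; probes++) {
            if (table[i] == null || table[i] == TOMBSTONE) {
                table[i] = key;
                return i;
            }
            i = (i + 1) % table.length;
        }
        return -1;
    }
}
```

For example, in a table of size 5 the keys "a", "f" and "k" occupy slots 2, 3 and 4. After deleting "f", a search for "k" still succeeds because the tombstone in slot 3 does not end the probe.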
81
Hash Tables - Summary so far ...

Potential O(1) search time
If a suitable function hash(key) -> integer can be
found
Space for speed trade-off
Full hash tables don't work (more later!)
Collisions
Inevitable

82
Various resolution strategies looked at so far
Linked lists
Overflow areas
Re-hash functions
Linear probing: the step is +1
Quadratic probing: the step is +i^2
Any other hash function, or even a sequence of
functions!
83
Comparison of collision techniques
Linear Probing
Random Probing
Chaining
84
Hashing with Chaining

What is the running time to insert/search/delete?
Insert: it takes O(1) time to compute the hash
function and insert at the head of the linked list
Search: it is proportional to the max linked list
length
Delete: same as search

85
Efficiency of chaining

Therefore, if we have a bad hash function,
all n keys may hash to the same
table index giving an O(n) run-time!
So how can we create a good hash function?

86
Hash Tables - Choosing the Hash Function

Some functions are definitely better than others!
Key criterion
Minimum number of collisions
Keeps chains short
Maintains O(1) on average

87
Writing your own hashCode method

A hashCode method must
Return a value that is a legal array index
Always return the same value for the same input
It can't use random numbers, or the time of day
Return the same value for equal inputs
Must be consistent with your equals method

88
Hashcode Function

It does not need to return different values for
different inputs: some collisions are
inevitable.
A good hashCode method should
Be efficient to compute
Give a uniform distribution of array indices,
so NO SPARSE ARRAYS!
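
The rules above can be illustrated with a small key class. This is only a sketch: the class City and its fields are hypothetical. In standard Java, hashCode() may return any int, so the sketch reduces it to a legal array index in a separate slot method.

```java
import java.util.Objects;

// Hypothetical key class, used only to illustrate the rules above.
final class City {
    private final String name;
    private final String state;

    City(String name, String state) {
        this.name = name;
        this.state = state;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof City)) {
            return false;
        }
        City other = (City) o;
        return name.equals(other.name) && state.equals(other.state);
    }

    // Built from the same fields equals() compares, so equal cities
    // always get equal hash codes. No random numbers, no time of day.
    @Override
    public int hashCode() {
        return Objects.hash(name, state);
    }

    // Reduce the hash code to a legal index for a table of the given size.
    int slot(int tableSize) {
        return Math.floorMod(hashCode(), tableSize);
    }
}
```

Two City objects built from the same name and state are equal, return the same hashCode(), and therefore map to the same slot.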

89
Other considerations

The hash table might fill up, so we need to be
prepared for that.
Generally speaking, hash tables work best when
the table size is a prime number.

90
Hash tables in Java

Java provides two classes, Hashtable and HashMap,
which implement the Map interface.
Both are maps: they associate keys with values.
Hashtable is synchronized: it can be accessed
safely from multiple threads.
Hashtable uses an open hash, and has a rehash
method to increase the size of the table.

91
HashMap

HashMap is newer, faster, and usually better,
but it is not synchronized
HashMap (default) uses a bucket hash -
(linked list)
and has a remove method

92
Hash table operations

Both Hashtable and HashMap are in java.util
Both have no-argument constructors, as well as
constructors that take an integer table size
Both have methods as listed in next slide

93
Methods

// put the entry in the table; returns the
// previous value for this key, or null
public V put(K key, V value)
// returns the value for this key, or null
public V get(Object key)
public void clear() // clears the table
public Set<K> keySet() // returns the keys in the table in a Set
94
Hash Tables - Reducing the range to [0, m)

We've mapped the keys to a range of integers
0 <= key < r,
decided by the total number of possible keys.
For social security numbers: 0 - 999,999,999
Now we must reduce this range to [0, m)
// from 0 to m-1
where m is a reasonable size for the hash table

95
Hash Tables Hash functions

Some typical functions
Division: use a mod function
hash(k) = abs(k mod m)
where m is the table size,
which yields a range between 0 and m-1

Some typical functions
Choice of m?
Powers of 2 are generally not good!
h(k) = k mod 2^n uses only the low-order n bits of k
Prime numbers close to 2^n are good choices

97
Choosing a viable value for M

Prime numbers close to 2^n are good choices
E.g. want a 4000-entry table:
choose m = 4093
Other methods in your text.
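
Choosing m and applying the division method can be sketched in Java. The names TableSizeDemo, isPrime, largestPrimeAtMost and hash are assumptions for this example, and trial division is just one simple way to test for primes.

```java
// Sketch: pick a prime table size near a power of two,
// then hash with the division method. Names are illustrative only.
class TableSizeDemo {
    // Trial division, adequate for table sizes of this magnitude.
    static boolean isPrime(int n) {
        if (n < 2) {
            return false;
        }
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) {
                return false;
            }
        }
        return true;
    }

    // Largest prime that is <= limit, or -1 if there is none.
    static int largestPrimeAtMost(int limit) {
        for (int candidate = limit; candidate >= 2; candidate--) {
            if (isPrime(candidate)) {
                return candidate;
            }
        }
        return -1;
    }

    // Division method: hash(k) = abs(k mod m), a value in 0 .. m-1.
    static int hash(int k, int m) {
        return Math.abs(k % m);
    }

    public static void main(String[] args) {
        int m = largestPrimeAtMost(4096); // 2^12 = 4096
        System.out.println(m);            // 4093
        System.out.println(hash(4100, m)); // 7
    }
}
```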

98
Performance Analysis

If n slots in a table of size m are occupied, the
load factor α is defined as
α = n / m
α = 1 means the table is full, and α = 0 means
the table is empty.
It is generally good to keep α < 1, near
.8.

n = number of items
m = number of slots
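
The load factor check can be written in a few lines of Java. This is a minimal sketch: the names LoadFactorDemo, loadFactor and shouldExpand are made up, and the threshold is passed in by the caller and not fixed by the slides.

```java
// Sketch of a load factor check. Names are illustrative only.
class LoadFactorDemo {
    // Load factor = n / m, computed in floating point.
    static double loadFactor(int n, int m) {
        return (double) n / m;
    }

    // True when the table is fuller than the chosen threshold.
    static boolean shouldExpand(int n, int m, double threshold) {
        return loadFactor(n, m) > threshold;
    }

    public static void main(String[] args) {
        System.out.println(loadFactor(8, 10));        // 0.8
        System.out.println(shouldExpand(9, 10, 0.8)); // true
        System.out.println(shouldExpand(4, 10, 0.8)); // false
    }
}
```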
100
Linear probing
Double hashing
Separate chaining
101
Hash Tables - Collision Resolution Summary

Chaining
Unlimited number of elements
Unlimited number of collisions
Overhead of multiple linked lists
Re-hashing
Fast re-hashing
Fast access through use of main table space
Maximum number of elements must be known
Multiple collisions become probable -
CLUSTERING!
Overflow area
Fast access
Collisions don't use primary table space

102
Terms to Know

Open Addressing: looks for another open position
in the table other than the one to which the
element is originally hashed. Requires that the
load factor be < 1.
Open Addressing using Linear Probing: seeking the
next available position creates clusters.
Alternative methods: quadratic probing etc.
Separate Chaining: if two keys map to the same
address, separate chaining creates a linked list
of keys that map to that address.

103
HashCode function in Java

Hash function - has two parts
Map key k to an integer
There is a default hashCode() in Java: the
method maps each object to an integer.
It returns a 32 bit integer which may be based on
where the object is in memory.
It works poorly with Strings, as two strings could
be in different locations in memory and contain
the same data. This is why the String class
overrides hashCode() to use the characters.

104
Hash Tables - Review

If you can meet the constraints of a hash
function that gives O(1),
Hash Tables will generally give good performance:
O(1) search

105

BUT
not advisable for unknown data
If collection size is relatively static (few
insertions and deletions), memory management is
actually simpler

106
Universal or Perfect Hashing

"Dynamic perfect hashing" involves using a
second hash table as the data structure to store
multiple values within a particular bucket.
How do we find the next location with this
approach?

107
Universal Hashing

What advantages does it have over linear probing?
What are possible problems with the approach?
Perfect hashing means that read access takes
constant time even in the worst case.

108
Universal or Perfect Hashing

For inserting, the time bounds are only true on
average.
To make insertion fast enough,
the second level hash table is very large for
the number of keys k it holds (about k^2 slots),
large enough so that collisions become
unlikely.

109
second level hash tables

This is not a problem with table size because the
first level hash distributes keys evenly
so that on average second level hash tables
are still relatively small.
The hash functions for the second level tables are
chosen at random from a set of parameterized hash
functions.

110
Universal Hashing

It is possible when you know exactly what set of
keys you are going to be hashing when you design
your hash function.
It's popular for hashing keywords for
compilers
Minimal perfect hashing guarantees that n
keys will map to 0..n-1 with no collisions at
all.

111
Chained Bucket

Note when using chaining,
each linked list attached to a slot is called a
bucket
- this is called chained bucket hashing
However, there is also bucket hashing done on
the main table - just to make things real clear.
