Title: Hash Tables
1Hash Tables
- The crucial disadvantage of arrays is that we need to allocate the size of the structure in advance
- We tend to overestimate its size and end up with a very sparse structure
2Storing BIG DATA
- We tend to think that the actual number of keys
to be stored is equal to the universe of
possible existing keys
3Hash Tables
- Often the number of keys to be stored is smaller than the number in the universe of keys.
- In this case, a hash table may save us a lot of space.
4Hash Tables
- How can you store all possible SSNs in an array?
- Use an array with range 0 - 999,999,999: a billion possible locations!
- This will give you O(1) access time, but
- considering there are approximately 308,000,000 people in the USA, you waste
- 1,000,000,000 - 308,000,000 = 692,000,000 array entries!
5Problem - Wasted Space
- Problem
- The range of key values we are mapping is too large (0 - 999,999,999) compared with the number of actual keys (US citizens)
6Hash Tables
- All search structures so far
- Relied on a comparison operation
- Performance O(n) or O(log n) for input of size n
- WE CAN DO BETTER WITH HASHING
7- Simplest case
- Assume we have keys with values in the range 1 .. M
- Use a hash method to compute the value of the key (an int) to select a slot in a direct access table in which to store the item
8Hash(key)
- To search for an item with key k, look in slot hash(k), which produces an int that maps to an index in the array.
- If there's an item there, you've found it
- If the tag is 0, it's missing.
9CONSTANT TIME SEARCH
- This produces a Constant time search
- O(1)
10Example (ideal) hash function
- Suppose we now have Strings and must hash them to an integer.
- Our hash function maps the following values
- hashCode("apple") = 5
- hashCode("watermelon") = 3
- hashCode("grapes") = 8
- hashCode("cantaloupe") = 7
- hashCode("kiwi") = 0
- hashCode("strawberry") = 9
- hashCode("mango") = 6
- hashCode("banana") = 2
11Why hash tables?
- We use key/value pairs to store an Entry in the table
- We use a hash function to map a key, such as hawk, to an integer
- The value column holds the data we are actually interested in
12Hash Functions
- Hash tables normally provide O(1) time (constant time) to access an element
- In a direct access table, an element with key k (an integer value) is stored in slot k
- In hash tables, this element is stored in slot hash(key).
13HASH FUNCTIONS
- hash(k) is a hash function.
- It maps the universe U of keys into the slots of a hash table (smaller than the universe)
- Thus reducing the size of the space we need to use.
14Pictorial view of Hash Tables
UNIVERSE OF VALUES ARE MAPPED TO A SMALLER NUMBER OF SLOTS
[Figure: keys k1, k2, k3, k4 from the universe mapped to table slots]
15Hashing
- Assume I have a hash function where the key is a String
- e.g. a label which represents a city in our HPAir project
- hash(key) -> integer
- i.e. the function maps the key to an integer
- That is, a String city name to an int
- which is an index into the HashMap
- What performance (Big-O) do I get?
16Hash Tables - Constraints
- Initial constraints to hash a key to an integer
- The hashcode of a key must be unique
- Keys must lie in a small range for storage efficiency
- Keys must be dense in the range
- If they're sparse (lots of gaps between values), a lot of space is used to obtain speed
17Hash Tables -
- Hashing keys produces integers, therefore
- We need a hash function hash(key) -> integer
- i.e. one that maps (hashes) a key to an integer
- Applying this function to the key produces a unique address
18Problems with a unique address for each key
- If hash(key) maps each key to a unique integer in the range 0 .. m-1
- then search is O(1)
- BUT THIS IS HARD TO DO!
19- Example - using an n-character key, e.g. a String
- n = number of characters in the String.
- Use a String class method to change the String to a character array
- Call a method with an array name and the number of chars in the String
- hash(char array, number of characters)
20Hashing a string of characters
// n = number of chars in the String
int hash(char[] sarray, int n) {
    int sum = 0, i = 0;
    // sum the ASCII values of the characters
    while (n-- > 0)
        sum = sum + sarray[i++];   // a char is used directly as its numeric code
    return sum % 256;              // 256 possible 8-bit character codes
}
- returns a value in 0 .. 255
21Evaluation
int hash(char[] sarray, int n) {
    int sum = 0, i = 0;
    while (n-- > 0)
        // get the ASCII value of each character and sum them
        sum = sum + sarray[i++];
    return sum % 256;   // returns a value in 0 .. 255
}
- The hash function does work proportional to the number of characters in the String, but that number is fixed for each String and does not depend on how many items are in the table, so we treat the hash function itself as O(1)
22Hash Tables PROBLEM -Collisions
- With this hash function
  int hash(char[] s, int n) { int sum = 0, i = 0; while (n-- > 0) sum = sum + s[i++]; return sum % 256; }
- hash("AB".toCharArray(), 2) and hash("BA".toCharArray(), 2) return the same value, because their ASCII (Unicode) values are simply summed
- Unicode value for A is 65, for B is 66
- Add them together in any order and they equal 131
- This is called a collision
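The collision above can be reproduced with a few lines of Java. This is only a sketch: the class name AdditiveHash and the String-based signature are assumptions made for illustration, not part of the slides.

```java
// Illustrative sketch; AdditiveHash is a hypothetical name, not from the slides.
class AdditiveHash {
    // Sums the character codes of s and reduces the sum to the range 0 .. 255.
    static int hash(String s) {
        int sum = 0;
        for (char c : s.toCharArray()) {
            sum += c;   // a char widens to its numeric code ('A' is 65, 'B' is 66)
        }
        return sum % 256;
    }

    public static void main(String[] args) {
        System.out.println(hash("AB")); // 65 + 66 = 131
        System.out.println(hash("BA")); // 66 + 65 = 131, the same slot: a collision
    }
}
```

Because addition ignores order, every rearrangement of the same characters lands in the same slot with this function.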
23Collisions
- Because we're mapping a larger universe into a smaller set of slots, collisions occur.
- A variety of techniques are used for resolving collisions
- Therefore having a unique key is HARD TO DO.
24Pictorial view OF COLLISION
Sometimes keys map to the same memory location: COLLISION
[Figure: keys k1 to k5 mapped to table slots, with more than one key landing in the same slot]
25Hash Tables Collision solutions I
- We need to store the actual key with the item in the hash table
- We compute the address
- index = hash(key)
- Next, look at that index in the table
- if (the location is occupied) then we try the next entry until we find an open one
26Collision Resolution Open Hashing
- The most common resolution mechanism for collisions is called chaining.
- This is also called Open Hashing.
- Being "open", the Hashtable will store a linked list of entries whose keys hash to the same value
- Chaining incorporates the concepts of linked lists and direct access structures like arrays
- Each slot of a hash table will be a pointer to a linked list
27Chaining or open hashing
- When hashing a key, if a collision happens
- the new key is stored in the linked list in that location
- E.g., suppose that we're mapping the universe of integers to a hash table of size 10
28Open Hash Table
[Figure: columns KEYS, BUCKETS, ENTRIES]
- John Smith and Sandra map to the same location
- a linked list is started from John to Sandra
29Hash Tables - Linked lists
- Collisions - Resolution
- A linked list is attached to each primary table slot
- // Three entries map to the same location
- h(k) = h(k1) = h(k2)
- Searching for k1
- Calculate hash(k1)
- Item doesn't match
- Follow linked list to k1
- If NULL found, key isn't in table
30Hash Tables - Linked Lists
- If a search can be satisfied by any item with key k, performance is still O(1), but
- If the key values are different
- we get O(1 + max)
- Where max is the largest number of duplicates, or the length of the longest chain (Linked List)
31- TECHNIQUE TWO - USE AN OVERFLOW AREA
- Linked list constructed in a special area of the table called the OVERFLOW AREA
- If two keys map to the same location
- hash(k) = hash(j)
- k stored first
- Adding j
- When hash(j) maps to hash(k)
- Find k THEN
- Go to first slot in overflow area
- Put j in it
- Searching - same as linked list
32Hashing(103)
- Our hash function is based on the division method for creating hash functions
- hash(k) = k mod size
- hash(103) = 103 mod 10, so hash(103) = 3
33Hashing(103)
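hash(103) = 103 mod 10 = 3
[Figure: slot 3: 103]
34Hashing(69)
hash(69) = 69 mod 10 = 9
[Figure: slot 3: 103; slot 9: 69]
35Hashing(20)
hash(20) = 20 mod 10 = 0
[Figure: slot 0: 20; slot 3: 103; slot 9: 69]
36Hashing(13)
hash(13) = 13 mod 10 = 3
[Figure: slot 0: 20; slot 3: 103 -> 13; slot 9: 69]
37Hashing(110)
hash(110) = 110 mod 10 = 0
[Figure: slot 0: 20 -> 110; slot 3: 103 -> 13; slot 9: 69]
38Hashing(53)
hash(53) = 53 mod 10 = 3
[Figure: slot 0: 20 -> 110; slot 3: 103 -> 13 -> 53; slot 9: 69]
39Final Hash Table
[Figure: slot 0: 20 -> 110; slot 3: 103 -> 13 -> 53; slot 9: 69]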
40Searching for 53 Using Chaining
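41Searching for 53
[Figure: slot 0: 20 -> 110; slot 3: 103 -> 13 -> 53; slot 9: 69]
42Searching for 53
[Figure: same table, with a pointer temp stepping along the chain in slot 3]
43Searching for 53
[Figure: same table, with a pointer temp stepping along the chain in slot 3]
44Searching for 53
[Figure: same table, with a pointer temp stepping along the chain in slot 3]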
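The chained table built in the preceding slides (keys 103, 69, 20, 13, 110, 53 with hash(k) = k mod 10) can be sketched in Java roughly as follows. The class ChainedTable and its method names are assumptions made for illustration; new keys are appended to the end of a chain, which matches the order shown in the figures.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Illustrative sketch of chaining (open hashing); names are hypothetical.
class ChainedTable {
    private final List<LinkedList<Integer>> slots = new ArrayList<>();

    ChainedTable(int size) {
        for (int i = 0; i < size; i++) {
            slots.add(new LinkedList<>());   // every slot starts with an empty chain
        }
    }

    // Division method: hash(k) = k mod size (floorMod keeps the result non-negative).
    int hash(int key) {
        return Math.floorMod(key, slots.size());
    }

    // Appends the key to the chain of its slot unless it is already there.
    void insert(int key) {
        LinkedList<Integer> chain = slots.get(hash(key));
        if (!chain.contains(key)) {
            chain.addLast(key);
        }
    }

    // Walks only the chain in slot hash(key), as the temp pointer does in the figures.
    boolean contains(int key) {
        return slots.get(hash(key)).contains(key);
    }

    List<Integer> chain(int slot) {
        return slots.get(slot);
    }
}
```

After inserting the six keys into a table of size 10, searching for 53 only has to walk the chain in slot 3 (103, then 13, then 53) and never touches slots 0 or 9.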
45Closed Hashing - Re-hash functions
- Closed hashing is a method of collision resolution in hash tables.
- With this method, a hash collision is resolved by probing, or searching through other locations in the array
461 Solution - Linear probing
- In one variation, the probing sequence is called
- (1) Linear Probing
- Continue probing adjacent locations
- until an unused array slot is found.
- Then put the Entry in that location.
47Closed hashing - e.g. linear probing
- Closed Hashing keeps keys in the main table and uses a re-hash function, which has many variations.
- Linear probing (previous example) is the most common form of Closed Hashing
- It uses the Main Table or flat area to find another location
48Rehash function - linear probing
- The rehash function for Linear Probing is
- hash(x) + 1, i.e. the step is 1
- Keep going to the next slot until you find an empty one
49Insertion, I
- Suppose you want to add seagull to this hash table
- Also suppose
- hashCode(seagull) = 143
- table[143] is not empty
- table[143] != seagull
- table[144] is not empty
- table[144] != seagull
- table[145] is empty
- Therefore, put seagull at location 145
50Searching, I
- Suppose you want to look up seagull in this hash table
- Also suppose
- hashCode(seagull) = 143
- table[143] is not empty
- table[143] != seagull
- table[144] is not empty
- table[144] != seagull
- table[145] is not empty
- table[145] == seagull !
- We found seagull at location 145
51Searching, II
- Suppose you want to look up cow in this hash table
- Also suppose
- hashCode(cow) = 144
- table[144] is not empty
- table[144] != cow
- table[145] is not empty
- table[145] != cow
- table[146] is empty
- If cow were in the table, we should have found it by now
- Therefore, it isn't here
52Insertion, II
- Suppose you want to add hawk to this hash table
- Also suppose
- hashCode(hawk) = 143
- table[143] is not empty
- table[143] != hawk
- table[144] is not empty
- table[144] == hawk
- hawk is already in the table, so do nothing
53Insertion, III
- Suppose
- You want to add cardinal to this hash table
- hashCode(cardinal) = 147
- The last location is 148
- 147 and 148 are occupied
- Solution
- Treat the table as circular: after 148 comes 0
- Hence, cardinal goes in location 0 (or 1, or 2, or ...)
54Linear PROBING Review
- Closed Hashing uses Linear Probing (among others)
- Linear Probing: if position h(key) is occupied, do a linear search in the table until you find an empty slot.
- The slots are searched in this order
- h(key), h(key)+1, h(key)+2, ..., h(key)+c
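A minimal Java sketch of linear probing with a circular table is given below. It assumes integer keys hashed with key mod capacity and uses null to mark an empty slot; the class LinearProbingTable and its method names are hypothetical and chosen only for illustration.

```java
// Illustrative sketch of closed hashing with linear probing; names are hypothetical.
class LinearProbingTable {
    private final Integer[] table;   // null marks an empty slot

    LinearProbingTable(int capacity) {
        table = new Integer[capacity];
    }

    private int home(int key) {
        return Math.floorMod(key, table.length);
    }

    // Probes h, h+1, h+2, ... wrapping around the end of the table.
    // Returns the slot holding key, else the first empty slot, else -1 (table full).
    private int findSlot(int key) {
        int h = home(key);
        for (int c = 0; c < table.length; c++) {
            int i = (h + c) % table.length;   // after the last slot comes slot 0
            if (table[i] == null || table[i] == key) {
                return i;
            }
        }
        return -1;
    }

    // Returns the slot used for key, or -1 if the table is full.
    int insert(int key) {
        int i = findSlot(key);
        if (i >= 0) {
            table[i] = key;
        }
        return i;
    }

    // Returns the slot holding key, or -1 if key is not in the table.
    int indexOf(int key) {
        int i = findSlot(key);
        return (i >= 0 && table[i] != null) ? i : -1;
    }
}
```

With a capacity of 149 (slots 0 to 148, as in the cardinal example), the keys 147, 296 and 445 all have home position 147, so they end up in slots 147, 148 and, after wrapping around, 0.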
55Expanding the table
- If the table becomes full, an exception can be thrown, or
- we can expand the capacity.
- This process is involved because if we double the size,
- we risk a sparse structure that can impact the efficiency we seek.
- One solution is to rehash the table using the new table size.
56Closed Hashing - Buckets
- One implementation for closed hashing groups hash table slots into buckets.
- The M slots of the hash table are divided into B buckets, with each bucket consisting of M/B slots.
- The hash function assigns each record to the first slot within one of the buckets.
57Bucket Hashing - uses Main Table
- If this slot is already occupied,
- then the bucket slots are searched sequentially
until an open slot is found.
58Buckets on the table
- If a bucket is entirely full,
- then the record is stored in an overflow bucket of infinite capacity at the end of the table.
- All buckets share the same overflow bucket.
- See this link for a fuller explanation
- http://research.cs.vt.edu/AVresearch/hashing/buckethash.php
59Slots or Buckets 4 buckets
60Bucket Hashing
- To search, hash the key to determine which bucket should contain the record.
- The records in this bucket are then searched.
- How is this better than linear probing?
61Bucket Hashing
- If the desired key value is not found and the bucket still has free slots, then the search is complete.
- If the bucket is full, then the search goes to the overflow bucket.
- If many records are in the overflow bucket, this will be an expensive process.
62Bucket Hashing advantage
- Bucket methods are good for implementing hash tables stored on disk, because the bucket size can be set to the size of a disk block.
- Whenever search or insertion occurs, the entire bucket is read into memory.
63USING BUCKETS
- Because the entire bucket is then in memory,
- processing an insert or search operation requires only one disk access, unless the bucket is full.
- If the bucket is full, then the overflow bucket must be retrieved from disk as well.
64Clustering
- Even with a good hash function, linear probing has its problems
- The position of the initial mapping of key k is called the home position of k.
- When several insertions map to the same home position, they end up placed contiguously in the table.
- This collection of keys with the same home position is called a cluster.
65Clusters
- A cluster is a group of items not containing any open slots
- Clusters cause efficiency to degrade
66Clustering
- As clusters grow, the probability increases that a key will map to the middle of a cluster,
- increasing the rate of the cluster's growth.
67Clusters
- This tendency of linear probing to place items together is known as primary clustering.
- As these clusters grow, they merge with other clusters, forming even bigger clusters which grow even faster.
68Other collision techniques
- We have looked at
- Chaining (Linked Lists) (Open Hashing)
- Linear Probing (Closed Hashing)
- Bucket Hashing
- Let us look at some other collision techniques
69- Other Closed hash function techniques are
- Quadratic probing: a variant of the above where the term being added to the hash result is squared.
- h(key) + c^2
- Random probing: the term being added to the hash function is a random number.
- h(key) + random()
70Rehash functions
- Rehashing is a technique where a sequence of hashing functions is defined (h1, h2, ... hk).
- If a collision occurs, the functions are used in this order
71- Use a second hash function - Re-Hashing
- hash(k) = hash(j)
- k stored first
- Adding j
- Calculate hash(j)
- Find k first
- Calculate hash2(j), where hash2 is some other hash function
- Repeat until we find an empty slot
- Put j in it
[Figure: hash2(j), the second hash function]
72Hash Tables - Re-hash functions
- The re-hash function has many variations
- Quadratic probing
- h(x) is + c i^2
- Avoids primary clustering
- Secondary clustering occurs
- All keys which collide on h(x) follow the same sequence
- First a = h(j)
- Then a + c, a + 4c, a + 9c, a + 16c, ...
73Quadratic Probing
- Some versions use
- p(K, i) = c1 i^2 + c2 i + c3 for some choice of constants c1, c2, and c3.
- Secondary clustering is generally less of a problem
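The quadratic probe sequence can be sketched as a small Java function. This assumes the simple form with offsets c * i^2 from the home slot and non-negative arguments; the class name QuadraticProbe and the parameter names are hypothetical.

```java
// Illustrative sketch of a quadratic probe sequence; names are hypothetical.
class QuadraticProbe {
    // Position of the i-th probe (i = 0, 1, 2, ...) for a key whose home slot is
    // `home`, in a table of m slots, using offsets c * i^2.
    static int probe(int home, int i, int c, int m) {
        long offset = (long) c * i * i;        // long avoids int overflow for large i
        return (int) ((home + offset) % m);    // wrap around the end of the table
    }
}
```

For home slot 3 in a table of 10 slots with c = 1, the first probes visit slots 3, 4, 7, 2 and 9. Every key with the same home slot follows this same sequence, which is the secondary clustering the slides mention.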
74Searching in a Hash Table
- We have already seen how searching works with chaining.
- With Closed Hashing, we use the following steps
- Given a target, hash the target
- Take the value of the hash of the target and go to that slot.
- If the target exists, it must be in this slot
- Search in the list in the current slot using a linear search.
75Look up a key
public lookup(key)
    int i
    i = find_slot(key)            // method to find key in table
    if slot[i] is occupied        // key is in table
        return slot[i].value      // return value in slot
    else                          // key is not in table
        return not found
76 linear probing and single-slot step
public find_slot(key)
    int i
    i = hash(key)                 // use a hash method to hash the key
    // search until we either find the key, or find an empty slot
    while ( (slot[i] is occupied) and ( slot[i].key != key ) )
        i = (i + 1) mod num_slots // single-slot step, treating the table as circular
    return i
77Deleting in a table Closed Hashing
- Suppose you want to look up cow in this hash table
- Also suppose
- hashCode(cow) = 144
- table[144] is not empty
- table[144] != cow
- table[145] is not empty
- table[145] != cow
- table[146] is empty
- If cow were in the table, we should have found it by now
- Therefore it is not there.
78Deleting from a table
- Problem
- When an empty slot is reached, we assume the item we are searching for is not there.
- Deletion leaves an empty slot,
- When we next search for an item using linear probing,
- We assume the item is not there when we reach the empty slot.
79Tombstones
- We assume the item is not there when we reach the empty slot.
- When, in fact, the item could be AFTER the empty slot.
80TOMBSTONES
Therefore, straight deletion of an item would not work. Instead, the cell is marked (usually by use of a boolean variable) when an item is deleted. The slot is often termed a tombstone.
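One way to sketch tombstones in Java is shown below. It assumes integer keys, linear probing with key mod capacity, and a parallel boolean array as the deletion mark; the class TombstoneTable and its method names are hypothetical.

```java
// Illustrative sketch of deletion with tombstones; names are hypothetical.
class TombstoneTable {
    private final Integer[] keys;      // null = slot has never been used
    private final boolean[] deleted;   // true = slot holds a tombstone

    TombstoneTable(int capacity) {
        keys = new Integer[capacity];
        deleted = new boolean[capacity];
    }

    private int home(int key) {
        return Math.floorMod(key, keys.length);
    }

    // Returns the slot holding key, or -1. The search walks past tombstones and
    // stops only at a slot that has never been used.
    int indexOf(int key) {
        int h = home(key);
        for (int c = 0; c < keys.length; c++) {
            int i = (h + c) % keys.length;
            if (keys[i] == null) return -1;                 // truly empty: key is absent
            if (!deleted[i] && keys[i] == key) return i;    // live match
        }
        return -1;
    }

    // Inserts key into the first never-used or tombstoned slot on its probe path.
    boolean insert(int key) {
        if (indexOf(key) >= 0) return true;                 // already present
        int h = home(key);
        for (int c = 0; c < keys.length; c++) {
            int i = (h + c) % keys.length;
            if (keys[i] == null || deleted[i]) {
                keys[i] = key;
                deleted[i] = false;
                return true;
            }
        }
        return false;                                       // table is full
    }

    // Marks the slot as deleted instead of emptying it.
    boolean remove(int key) {
        int i = indexOf(key);
        if (i < 0) return false;
        deleted[i] = true;
        return true;
    }
}
```

In a table of size 10, inserting 3, 13 and 23 fills slots 3, 4 and 5. After removing 13, a search for 23 still succeeds because the tombstone in slot 4 does not stop the probe, and a later insert may reuse that slot.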
81Hash Tables - Summary so far ...
- Potential O(1) search time
- If a suitable function hash(key) -> integer can be found
- Space for speed trade-off
- Full hash tables don't work (more later!)
- Collisions
- Inevitable
82Various resolution strategies looked at so far
- Linked lists
- Overflow areas
- Re-hash functions
- Linear probing: h is +1
- Quadratic probing: h is +i^2
- Any other hash function, or even a sequence of functions!
83Comparison of collision techniques
Linear Probing
Random Probing
Chaining
84Hashing with Chaining
- What is the running time to insert/search/delete?
- Insert: It takes O(1) time to compute the hash function and insert at the head of the linked list
- Search: It is proportional to the max linked list length
- Delete: Same as search
85Efficiency of chaining
- Therefore, if we have a bad hash function, all n keys may hash to the same table index, giving an O(n) run-time!
- So how can we create a good hash function?
86Hash Tables - Choosing the Hash Function
- Some functions are definitely better than others!
- Key criterion
- Minimum number of collisions
- Keeps chains short
- Maintains O(1) on average
87Writing your own hashCode method
- A hashCode method must
- Return a value that is a legal array index
- Always return the same value for the same input
- It can't use random numbers, or the time of day
- Return the same value for equal inputs
- Must be consistent with your equals method
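A small Java sketch of a hashCode method that is consistent with equals is given below. The City class, its fields and the indexFor helper are hypothetical and exist only for illustration. Note that in Java hashCode() may return any int, so this sketch reduces the value to a legal array index in a separate step.

```java
import java.util.Objects;

// Illustrative sketch; City and indexFor are hypothetical names.
final class City {
    private final String name;
    private final String state;

    City(String name, String state) {
        this.name = name;
        this.state = state;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof City)) return false;
        City other = (City) o;
        return name.equals(other.name) && state.equals(other.state);
    }

    // Uses exactly the fields that equals compares, so equal cities hash alike.
    // No random numbers and no clock: the same input always gives the same value.
    @Override
    public int hashCode() {
        return Objects.hash(name, state);
    }

    // Reduces any hashCode() to a legal index for a table with tableSize slots.
    static int indexFor(Object key, int tableSize) {
        return Math.floorMod(key.hashCode(), tableSize);
    }
}
```

Two City objects built from the same name and state are equal and therefore return the same hashCode and the same table index.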
88Hashcode Function
- It does not need to return different values for different inputs; some collisions are inevitable.
- A good hashCode method should
- Be efficient to compute
- Give a uniform distribution of array indices
- so NO SPARSE ARRAYS!
89Other considerations
- The hash table might fill up; we need to be prepared for that
- Generally speaking, hash tables work best when the table size is a prime number
90Hash tables in Java
- Java provides two classes, Hashtable and HashMap, which implement the Map interface
- Both are maps: they associate keys with values
- Hashtable is synchronized; it can be accessed safely from multiple threads
- Hashtable uses an open hash, and has a rehash method to increase the size of the table
91HashMap
- HashMap is newer, faster, and usually better,
- but it is not synchronized
- HashMap (default) uses a bucket hash (linked list)
- and has a remove method
92Hash table operations
- Both Hashtable and HashMap are in java.util
- Both have no-argument constructors, as well as constructors that take an integer table size
- Both have methods as listed in the next slide
93Methods
- // put the entry in the table; returns the previous value for this key, or null
- public V put(K key, V value)
- // returns the value for this key, or null
- public V get(Object key)
- public void clear() // clears the table
- public Set<K> keySet() // returns the keys in the table in a Set
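A short usage sketch of these java.util.HashMap methods follows. The class HashMapDemo, the method sample and the fruit counts are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative usage of java.util.HashMap; the data is made up.
class HashMapDemo {
    static Map<String, Integer> sample() {
        Map<String, Integer> stock = new HashMap<>();
        stock.put("apple", 3);
        stock.put("kiwi", 7);
        stock.put("apple", 4);      // same key: the old value 3 is replaced by 4
        return stock;
    }

    public static void main(String[] args) {
        Map<String, Integer> stock = sample();
        System.out.println(stock.get("apple"));   // value stored for the key
        System.out.println(stock.get("mango"));   // null: key is not in the table
        Set<String> keys = stock.keySet();        // the keys, not the values
        System.out.println(keys.contains("kiwi"));
        stock.clear();                            // removes every entry
        System.out.println(stock.isEmpty());
    }
}
```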
94Hash Tables - Reducing the range to [0, m)
- We've mapped the keys to a range of integers 0 <= key < r
- decided on the total number of possible keys
- For social security numbers: 999,999,999
- Now we must reduce this range to [0, m) // from 0 to m-1
- where m is a reasonable size for the hash table
95Hash Tables Hash functions
- Some typical functions
- Division: use a mod function
- hash(k) = abs(k mod m)
- where m is the table size
- which yields a range between 0 and m-1
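In Java the division method can be sketched as below, assuming a table size m greater than 0. The class DivisionHash and the helper indexFor are hypothetical names used only for illustration.

```java
// Illustrative sketch of the division method; names are hypothetical.
class DivisionHash {
    // hash(k) = abs(k mod m), for a table size m > 0.
    static int hash(int k, int m) {
        // Java's % keeps the sign of k, so the remainder lies strictly between
        // -m and m; taking abs of the remainder always lands in 0 .. m-1.
        return Math.abs(k % m);
    }

    // Reduces any object's hashCode() to a table index in the same way.
    static int indexFor(Object key, int m) {
        return hash(key.hashCode(), m);
    }
}
```

Taking abs after the mod, rather than before it, matters in Java: Math.abs(Integer.MIN_VALUE) is still negative, whereas the remainder of a division by m is always small enough for abs to be safe.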
96- Some typical functions
- Choice of m?
- Powers of 2 are generally not good!
- h(k) = k mod 2^n only looks at the last n bits of k
- Prime numbers close to 2^n are good choices
97Choosing a viable value for M
- Prime numbers close to 2^n are good choices
- E.g. if you want a 4000 entry table,
- choose m = 4093
- Other methods in your text.
98Performance Analysis
- If n slots in a table of size m are occupied, the load factor α is defined as
- α = n / m, where n = number of items and m = number of slots
- α = 1 means the table is full, and α = 0 means the table is empty.
- It is generally good to keep α < 1, near 0.8.
100Linear probing
Double hashing
Separate chaining
101Hash Tables - Collision Resolution Summary
- Chaining
- Unlimited number of elements
- Unlimited number of collisions
- Overhead of multiple linked lists
- Re-hashing
- Fast re-hashing
- Fast access through use of main table space
- Maximum number of elements must be known
- Multiple collisions become probable - CLUSTERING!
- Overflow area
- Fast access
- Collisions don't use primary table space
102Terms to Know
- Open Addressing: looks for another open position in the table other than the one to which the element is originally hashed. Requires that the load factor be < 1.
- Open Addressing using Linear Probing: seeking the next available position creates clusters; alternative methods include quadratic probing etc.
- Separate Chaining: if two keys map to the same address, separate chaining creates a linked list of keys that map to that address.
103HashCode function in Java
- Hash function - has two parts
- Map key k to an integer
- There is a default hashCode() in Java - the method maps each object to an integer.
- It returns a 32 bit integer which may be where the object is in memory.
- It works poorly with Strings, as two strings could be in different locations in memory and contain the same data.
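Java's String class overrides hashCode() so that the result depends on the characters rather than on the object's location, which sidesteps the problem described above. The sketch below illustrates this; the class and method names are hypothetical.

```java
// Illustrative sketch; StringHashDemo is a hypothetical name.
class StringHashDemo {
    // True when two distinct String objects with equal contents also hash alike.
    static boolean equalStringsHashAlike(String text) {
        String a = new String(text);   // two separate objects in memory
        String b = new String(text);
        return a != b                          // different objects
            && a.equals(b)                     // same characters
            && a.hashCode() == b.hashCode();   // same hash code
    }
}
```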
104Hash Tables - Review
- If you can meet the constraints of a hash function that gives a Big-O of 1
- Hash Tables will generally give good performance
- O(1) search
105- BUT
- not advisable for unknown data
- If collection size is relatively static (few insertions and deletions), memory management is actually simpler
106Universal or Perfect Hashing
- "Dynamic perfect hashing" involves using a second hash table as the data structure to store multiple values within a particular bucket.
- How do we find the next location with this approach?
107Universal Hashing
- What advantages does it have over linear probing?
- What are possible problems with the approach?
- Perfect hashing means that read access takes
constant time even in the worst case.
108Universal or Perfect Hashing
- For inserting, the time bounds are only true on average.
- To make insertion fast enough,
- the second level hash table is very large for the number of keys (k^2),
- large enough so that collisions become unlikely.
109second level hash tables
- This is not a problem with table size because the first level hash distributes keys evenly
- so that on average second level hash tables are still relatively small.
- The hash functions for the second level tables are chosen at random from a set of parameterized hash functions.
110Universal Hashing
- It is possible when you know exactly what set of keys you are going to be hashing when you design your hash function.
- It's popular for hashing keywords for compilers
- Minimal perfect hashing guarantees that n keys will map to 0..n-1 with no collisions at all.
111Chained Bucket
- Note: when using chaining,
- each linked list attached to a slot is called a bucket
- this is called chained bucket hashing
- However, there is also bucket hashing done on the main table - just to make things real clear.