Hash Tables - PowerPoint PPT Presentation

Provided by: chel1
Title: Hash Tables


1
Hash Tables
  • The crucial disadvantage of arrays is that we
    need to allocate the size of the structure in
    advance
  • We tend to overestimate its size and end up with
    a very sparse structure

2
Storing BIG DATA
  • We tend to think that the actual number of keys
    to be stored is equal to the universe of
    possible existing keys

3
Hash Tables
  • Often the number of keys to be stored is smaller
    than the number in the universe of keys.
  • In this case, a hash table may save us a lot of
    space.

4
Hash Tables
  • How can you store all possible SSNs in an array?
  • Use an array with range 0 - 999,999,999: a
    billion possible locations!
  • This will give you O(1) access time, but
    considering there are approximately
    308,000,000 people in the USA, you waste
    1,000,000,000 - 308,000,000 = 692,000,000
    array entries!

5
Problem - Wasted Space
  • Problem
  • The range of key values we are mapping is too
    large (0 - 999,999,999) when compared to
    the number of actual keys (US citizens)

6
Hash Tables
  • All search structures so far
  • Relied on a comparison operation
  • Performance O(n) or O(log n) for input of
    size n
  • WE CAN DO BETTER WITH HASHING

7
  • Simplest case
  • Assume we have keys with values in the range 1 ..
    M
  • Use a hash method to compute the value of the
    key (an int) to select a slot in a direct
    access table in which to store the item

8
Hash(key)
  • To search for an item with key k,
    look in slot hash(k), which
    produces an int that maps to
    an index in the array.
  • If there's an item there, you've found it
  • If the tag is 0, it's missing.
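
A minimal sketch of the simplest case above, assuming integer keys in the range 1 .. M and using null (rather than a tag of 0) to mark an empty slot. The class name DirectAccessTable and its method names are illustrative and do not come from the slides.

```java
class DirectAccessTable {
    private final String[] slots;   // slots[k] holds the item whose key is k

    DirectAccessTable(int maxKey) {
        slots = new String[maxKey + 1];   // index 0 stays unused so keys can run 1 .. M
    }

    // In the simplest case the hash of an integer key is the key itself.
    private int hash(int key) {
        return key;
    }

    void put(int key, String item) {
        slots[hash(key)] = item;
    }

    // Returns the stored item, or null if the slot is empty (the key is missing).
    String get(int key) {
        return slots[hash(key)];
    }
}
```

Both put and get touch exactly one array cell, which is where the constant-time behaviour comes from.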

9
CONSTANT TIME SEARCH
  • This produces a Constant time search
  • O(1)

10
Example (ideal) hash function
  • Suppose we now have Strings and must hash them to
    an integer.
  • Our hash function maps the following values
  • hashCode("apple") = 5
  • hashCode("watermelon") = 3
  • hashCode("grapes") = 8
  • hashCode("cantaloupe") = 7
  • hashCode("kiwi") = 0
  • hashCode("strawberry") = 9
  • hashCode("mango") = 6
  • hashCode("banana") = 2

11
Why hash tables?
  • We use key/value pairs to store an Entry into the
    table
  • We use a hash function to map a key
    (e.g. hawk) to an integer
  • The value column holds the data we are actually
    interested in

12
Hash Functions
  • Hash tables normally provide O(1) time (constant
    time) to access an element
  • In a direct access table, an element with key k
    (an integer value) is stored in slot k
  • In hash tables, this element is stored in
    slot hash(key).

13
HASH FUNCTIONS
  • hash(k) is a hash function.
  • It maps the universe U of keys into the slots of
    a hash table (smaller than the universe),
  • thus reducing the size of the space we need to
    use.

14
Pictorial view of Hash Tables
UNIVERSE OF VALUES IS MAPPED TO A SMALLER NUMBER
OF SLOTS
[Figure: keys k1, k2, k3, k4 from the universe mapped into the slots of a table]
15
Hashing
  • Assume I have a hash function where the key is a
    String
  • e.g. A label which represents a city in our
    HPAir project
  • hash( key ) -> integer
  • i.e. the function maps the key to an integer,
  • that is, a String city name to an int
  • which is an index into the HashMap
  • What performance (Big-O) do I get?

16
Hash Tables - Constraints
  • Initial constraints: hash a key to an integer
  • The hashcode of a key must be unique
  • Keys must lie in a small range for storage
    efficiency
  • Keys must be dense in the range
  • If they're sparse (lots of gaps between
    values), a lot of space is used to obtain speed

17
Hash Tables -
  • Hashing keys produces integers, therefore
  • we need a hash function hash( key ) ->
    integer,
  • i.e. one that maps (hashes) a key to an integer
  • Applying this function to the key produces a
    unique address

18
Problems with a unique address for each key
  • If hash(key) maps each key to a unique integer in
    the range 0 .. m-1
  • then search is O(1)
  • BUT THIS IS HARD TO DO!

19
  • Example - using an n-character key, e.g. a
    String
  • n = number of characters in the String.
  • Use a String class method to change the String
    to a character array
  • Call a method with the array name and the number
    of chars in the String
  • hash(char array, number of characters)

20
Hashing a string of characters
  • // n = number of chars in the String
  • int hash( char[] sarray, int n ) {
  •   int sum = 0, i = 0;
  •   // sum ASCII values of the characters
  •   while( n-- > 0 )
  •     sum = sum + sarray[i++];
  •   return sum % 256;
  • }
  • // there are 256 extended-ASCII character codes, so
    this returns a value in 0 .. 255

21
Evaluation
  • int hash( char[] sarray, int n ) {
  •   int sum = 0, i = 0;
  •   // get the ASCII value of each character
  •   // and sum them
  •   while( n-- > 0 )
  •     sum = sum + sarray[i++];
  •   return sum % 256;  // returns a value in 0 .. 255
  • }
  • The hash function itself takes time proportional
    to the number of characters; it is treated as
    O(1) because that number is a small constant for
    each String and does not grow with the number of
    items in the table

22
Hash Tables PROBLEM -Collisions
  • With this hash function
  • int hash( char[] s, int n ) { int sum = 0, i = 0;
    while( n-- > 0 ) sum = sum + s[i++];
    return sum % 256; }
  • FOR
  • hash( "AB", 2 ) and hash( "BA", 2 ), the
    ASCII (Unicode) values sum to the same result!
  • Unicode value for A is 65, for B is 66
  • Add them together in any order and they
    equal 131
  • This is called a collision
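
The collision above can be reproduced with a short, runnable sketch. It assumes the character codes are added directly (a char widens to its code value in Java); the class name SumHash is illustrative.

```java
class SumHash {
    // Sums the character codes of the first n chars of s and reduces the sum to 0 .. 255.
    static int hash(char[] s, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += s[i];   // 'A' contributes 65, 'B' contributes 66
        }
        return sum % 256;
    }

    public static void main(String[] args) {
        System.out.println(hash("AB".toCharArray(), 2));   // prints 131
        System.out.println(hash("BA".toCharArray(), 2));   // prints 131
    }
}
```

Because addition ignores order, any two keys made of the same characters collide under this function.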

23
Collisions
  • Because we're mapping a larger universe into a
    smaller set of slots, collisions occur.
  • A variety of techniques are used for resolving
    collisions
  • Therefore giving every key a unique address is HARD TO DO.

24
Pictorial view OF COLLISION
Sometimes keys map to the same memory location:
a COLLISION
[Figure: keys k1 .. k5 mapped into the table, with some keys sharing a slot]
25
Hash Tables Collision solutions I
  • We need to store the actual key with the item in
    the hash table
  • We compute the address
  • index = hash( key )
  • Next, look at that index in the table
  • If the location is occupied, then we try the
    next entry until we find an open one

26
Collision Resolution Open Hashing
  • The most common resolution mechanism for
    collisions is called chaining.
  • This is also called Open Hashing.
  • Being "open", the Hashtable will store a linked
    list of entries whose keys hash to the same value
  • Chaining incorporates the concepts of linked
    lists and direct access structures like arrays
  • Each slot of a hash table will be a pointer to a
    linked list

27
Chaining or open hashing
  • When hashing a key, if a collision happens
  • the new key is stored in the linked list in that
    location
  • E.g., suppose that we're mapping the universe of
    integers to a hash table of size 10

28
Open Hash Table
[Figure: three columns labelled KEYS, BUCKETS and ENTRIES]
John Smith and Sandra map to the same location, so
a linked list is started from John to Sandra
29
Hash Tables - Linked lists
  • Collisions - Resolution
  • Linked list is attached to each primary table
    slot
  • // Three entries map to same location
  • h(k) = h(k1) = h(k2)
  • Searching for k1
  • Calculate hash(k1)
  • Item in the primary slot doesn't match
  • Follow linked list to k1
  • If NULL found, key isn't in table
30
Hash Tables - Linked Lists
  • If a search can be satisfied by any item with
    key k, performance is still O(1)
  • but
  • if the key values are different
  • we get O( 1 + max )
  • where max is the largest number of duplicates,
    or the length of the
  • longest chain (Linked List)

31
  • TECHNIQUE TWO - USE AN OVERFLOW AREA
  • Linked list constructed in a special area of the
    table called the OVERFLOW AREA
  • If two keys map to the same location
  • hash(k) = hash(j)
  • k stored first
  • Adding j
  • When hash(j) maps to hash(k)
  • Find k, THEN
  • Go to the first free slot in the overflow area
  • Put j in it
  • Searching - same as linked list

32
Hashing(103)
  • Our hash function is based on the division method
    for creating hash functions
  • hash(k) = k mod size

hash(103) = 103 mod 10 = 3
33
Hashing(103)
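hash(103) = 103 mod 10 = 3
[Table: slot 3: 103]
34
Hashing(69)
hash(69) = 69 mod 10 = 9
[Table: slot 3: 103; slot 9: 69]
35
Hashing(20)
hash(20) = 20 mod 10 = 0
[Table: slot 0: 20; slot 3: 103; slot 9: 69]
36
Hashing(13)
hash(13) = 13 mod 10 = 3
[Table: slot 0: 20; slot 3: 103 -> 13; slot 9: 69]
37
Hashing(110)
hash(110) = 110 mod 10 = 0
[Table: slot 0: 20 -> 110; slot 3: 103 -> 13; slot 9: 69]
38
Hashing(53)
hash(53) = 53 mod 10 = 3
[Table: slot 0: 20 -> 110; slot 3: 103 -> 13 -> 53; slot 9: 69]
39
Final Hash Table
[Table: slot 0: 20 -> 110; slot 3: 103 -> 13 -> 53; slot 9: 69]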
40
Searching for 53 Using Chaining
41
Searching for 53
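[Table: slot 0: 20 -> 110; slot 3: 103 -> 13 -> 53; slot 9: 69]
42
Searching for 53
[Figure: the same table, with a temp pointer walking the chain in slot 3 (103 -> 13 -> 53)]
43
Searching for 53
[Figure: the same table, with a temp pointer walking the chain in slot 3 (103 -> 13 -> 53)]
44
Searching for 53
[Figure: the same table, with a temp pointer walking the chain in slot 3 (103 -> 13 -> 53)]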
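
The worked example (keys 103, 69, 20, 13, 110, 53 in a table of size 10) can be sketched as follows. This assumes integer keys, hash(k) = k mod size, and new keys appended at the tail of a chain so the chains match the order shown in the example; the class name ChainedTable is illustrative.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedTable {
    private final List<LinkedList<Integer>> slots = new ArrayList<>();
    private final int size;

    ChainedTable(int size) {
        this.size = size;
        for (int i = 0; i < size; i++) {
            slots.add(new LinkedList<>());   // every slot starts with an empty chain
        }
    }

    // Division method: hash(k) = k mod size, kept non-negative.
    int hash(int k) {
        return Math.floorMod(k, size);
    }

    void insert(int k) {
        LinkedList<Integer> chain = slots.get(hash(k));
        if (!chain.contains(k)) {
            chain.addLast(k);   // a collision just extends the chain
        }
    }

    // Hash to the slot, then walk that slot's chain.
    boolean contains(int k) {
        return slots.get(hash(k)).contains(k);
    }

    List<Integer> chain(int slot) {
        return slots.get(slot);
    }
}
```

After inserting the six keys in order, chain(3) holds 103, 13, 53 and chain(0) holds 20, 110, so a search for 53 only walks the three entries in slot 3.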
45
Closed Hashing - Re-hash functions
  • Closed hashing is a method of collision
    resolution in hash tables.
  • With this method, a hash collision is resolved
    by
  • probing, or
  • searching through other locations in the array

46
1 Solution - Linear probing
  • In one variation, the probing sequence
    is called
  • (1) Linear Probing
  • Continue probing adjacent locations
  • until an unused array slot is found.
  • Then put the Entry in that location.

47
Closed hashing - e.g. linear probing
  • Closed Hashing keeps keys in the main table and
    uses a re-hash function which has many
    variations .
  • Linear probing - previous example - is the most
    commonly used form of Closed Hashing
  • It uses the Main Table or flat area to find
    another location

48
Rehash function - linear probing
  • The rehash function for Linear Probing is
  • hash(x) + 1
  • Keep going to the next slot until you find an
    empty one

49
Insertion, I
  • Suppose you want to add seagull to this hash
    table
  • Also suppose
  • hashCode(seagull) = 143
  • table[143] is not empty
  • table[143] != seagull
  • table[144] is not empty
  • table[144] != seagull
  • table[145] is empty
  • Therefore, put seagull at location 145

50
Searching, I
  • Suppose you want to look up seagull in this hash
    table
  • Also suppose
  • hashCode(seagull) = 143
  • table[143] is not empty
  • table[143] != seagull
  • table[144] is not empty
  • table[144] != seagull
  • table[145] is not empty
  • table[145] == seagull!
  • We found seagull at location 145

51
Searching, II
  • Suppose you want to look up cow in this hash
    table
  • Also suppose
  • hashCode(cow) = 144
  • table[144] is not empty
  • table[144] != cow
  • table[145] is not empty
  • table[145] != cow
  • table[146] is empty
  • If cow were in the table, we should have found it
    by now
  • Therefore, it isn't here

52
Insertion, II
  • Suppose you want to add hawk to this hash table
  • Also suppose
  • hashCode(hawk) = 143
  • table[143] is not empty
  • table[143] != hawk
  • table[144] is not empty
  • table[144] == hawk
  • hawk is already in the table, so do nothing

53
Insertion, III
  • Suppose
  • You want to add cardinal to this hash table
  • hashCode(cardinal) = 147
  • The last location is 148
  • 147 and 148 are occupied
  • Solution
  • Treat the table as circular: after 148 comes 0
  • Hence, cardinal goes in location 0 (or 1, or 2,
    or ...)

54
Linear PROBING Review
  • Closed Hashing uses Linear Probing (among others)
  • Linear Probing: if position h(key) is occupied,
    do a linear search in the table until you find an
    empty slot.
  • The slots are searched in this order
  • h(key), h(key)+1, h(key)+2, ..., h(key)+c
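
The insertion and search walks above, including the circular wrap-around, can be sketched as follows. This assumes integer keys and a home position of key mod capacity instead of the string hashCode values used in the slides; the class name LinearProbingTable is illustrative.

```java
class LinearProbingTable {
    private final Integer[] table;   // null marks an empty slot

    LinearProbingTable(int capacity) {
        table = new Integer[capacity];
    }

    // Home position of a key (assumed hash: key mod capacity).
    private int home(int key) {
        return Math.floorMod(key, table.length);
    }

    // Returns the slot holding key after the call, or -1 if the table is full.
    int insert(int key) {
        int i = home(key);
        for (int probes = 0; probes < table.length; probes++) {
            if (table[i] == null) {
                table[i] = key;
                return i;
            }
            if (table[i] == key) {
                return i;                   // already in the table, so do nothing
            }
            i = (i + 1) % table.length;     // circular: after the last slot comes 0
        }
        return -1;
    }

    // Returns the slot holding key, or -1 if the key is not in the table.
    int find(int key) {
        int i = home(key);
        for (int probes = 0; probes < table.length; probes++) {
            if (table[i] == null) {
                return -1;                  // empty slot: the key would have been found by now
            }
            if (table[i] == key) {
                return i;
            }
            i = (i + 1) % table.length;
        }
        return -1;
    }
}
```

With a capacity of 5, inserting 3, 8 and 13 (all with home position 3) places them in slots 3, 4 and 0, mirroring the cardinal example where the probe wraps past the last location.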

55
Expanding the table
  • If the table becomes full, an exception can be
    thrown or
  • we can expand the capacity.
  • This process is involved because if we double
    the size,
  • we risk a sparse structure that can impact the
    efficiency we seek.
  • One solution is to rehash the table using the new
    table size.
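
A minimal sketch of rehashing into a larger table, assuming integer keys, null for an empty slot, linear probing, and a new size at least as large as the number of stored keys. The class and method names are illustrative.

```java
class Rehash {
    // Builds a table with newSize slots and re-inserts every key from old,
    // using key mod newSize as the home position and linear probing on collisions.
    static Integer[] rehash(Integer[] old, int newSize) {
        Integer[] bigger = new Integer[newSize];
        for (Integer key : old) {
            if (key == null) {
                continue;                       // skip empty slots
            }
            int i = Math.floorMod(key, newSize);
            while (bigger[i] != null) {
                i = (i + 1) % newSize;          // probe for the next free slot
            }
            bigger[i] = key;
        }
        return bigger;
    }
}
```

Every key has to be re-inserted because its home position depends on the table size: 13 sits at 13 mod 5 = 3 (or wherever probing pushed it) in a 5-slot table but at 13 mod 11 = 2 in an 11-slot one.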

56
Closed Hashing - Buckets
  • One implementation for closed hashing groups hash
    table slots into buckets.
  • The M slots of the hash table are divided into B
    buckets, with each bucket consisting of M/B
    slots.
  • The hash function assigns each record to the
    first slot within one of the buckets.

57
Bucket Hashing - uses Main Table
  • If this slot is already occupied,
  • then the bucket slots are searched sequentially
    until an open slot is found.

58
Buckets on the table
  • If a bucket is entirely full,
  • then the record is stored in an overflow bucket
    of infinite capacity at the end of the table.
  • All buckets share the same overflow bucket.
    See this link for a fuller explanation
  • http://research.cs.vt.edu/AVresearch/hashing/buckethash.php

59
Slots or Buckets 4 buckets
60
Bucket Hashing
  • To search, hash the key to determine which bucket
    should contain the record.
  • The records in this bucket are then searched.
  • How is this better than linear probing? -- 1

61
Bucket Hashing
  • If the desired key value is not found and the
    bucket still has free slots, then the search is
    complete.
  • If the bucket is full, then the search goes to
    the overflow bucket.
  • If many records are in the overflow bucket, this
    will be an expensive process.
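
The bucket scheme described above can be sketched as follows, assuming integer keys, a bucket chosen by key mod B, and a growable list standing in for the overflow bucket of "infinite capacity". The class name BucketTable and its members are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

class BucketTable {
    private final Integer[] slots;          // M slots, null marks an empty slot
    private final int bucketCount;          // B
    private final int bucketSize;           // M / B
    private final List<Integer> overflow = new ArrayList<>();

    BucketTable(int m, int b) {
        slots = new Integer[m];
        bucketCount = b;
        bucketSize = m / b;
    }

    // Index of the first slot of the bucket this key hashes to.
    private int bucketStart(int key) {
        return Math.floorMod(key, bucketCount) * bucketSize;
    }

    void insert(int key) {
        int start = bucketStart(key);
        for (int i = start; i < start + bucketSize; i++) {
            if (slots[i] == null) { slots[i] = key; return; }
            if (slots[i] == key) { return; }            // already stored
        }
        if (!overflow.contains(key)) {
            overflow.add(key);                          // bucket is full
        }
    }

    boolean contains(int key) {
        int start = bucketStart(key);
        for (int i = start; i < start + bucketSize; i++) {
            if (slots[i] == null) { return false; }     // free slot: search is complete
            if (slots[i] == key) { return true; }
        }
        return overflow.contains(key);                  // full bucket: check the overflow
    }

    int overflowSize() {
        return overflow.size();
    }
}
```

A search never leaves its own bucket unless that bucket is full, which is what keeps the number of slots examined (or disk blocks read) small.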

62
Bucket Hashing advantage
  • Bucket methods are good for implementing hash
    tables stored on disk, because the bucket size
    can be set to the size of a disk block.
  • Whenever search or insertion occurs, the entire
    bucket is read into memory.

63
USING BUCKETS
  • Because the entire bucket is then in memory,
  • processing an insert or search operation requires
    only one disk access, unless the bucket is full.
  • If the bucket is full, then the overflow bucket
    must be retrieved from disk as well.

64
Clustering
  • Even with a good hash function, linear probing
    has its problems
  • The position of the initial mapping of key k is
    called the home position of k.
  • When several insertions map to the same home
    position, they end up placed contiguously in the
    table.
  • This collection of keys with the same home
    position is called a cluster.

65
Clusters
  • A cluster is a group of items not containing any
    open slots
  • Clusters cause efficiency to degrade

66
Clustering
  • As clusters grow, the probability increases that
    a key will map to the middle of a cluster,
  • increasing the rate of the cluster's growth.

67
Clusters
  • This tendency of linear probing to place items
    together is known as primary clustering.
  • As these clusters grow, they merge with other
    clusters forming even bigger clusters which grow
    even faster.

68
Other collision techniques
  • We have looked at
  • chaining(Linked Lists) (Open Hashing) and
  • Linear Probing( Closed Hashing)
  • Bucket Hashing
  • Let us look at some other collision techniques

69
  • Other Closed hash function techniques are
  • Quadratic probing: a variant of the above where
    the term being added to the hash result is
    squared.
  • h(key) + c^2
  • Random probing: the term being added to the hash
    function is a random number.
  • h(key) + random()

70
Rehash functions
  • Rehashing is a technique where a sequence of
    hashing functions are defined (h1, h2, ... hk).
  • If a collision occurs, the functions are used in
    this order

71
  • Use a second hash function - Re-Hashing
  • hash(k) = hash(j)
  • k stored first
  • Adding j
  • Calculate hash(j)
  • Find k first
  • Calculate hash2(j), where
    hash2 is some
    other hash function
  • Repeat until we find an empty slot
  • Put j in it

hash2(j) - second hash function
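
The slide does not say how hash2 is combined with hash. One common way is double hashing, where hash2 supplies the step size between probes; the sketch below assumes that scheme, integer keys, and a prime table capacity of at least 2 so that every slot can be reached. The class name and both hash formulas are illustrative choices.

```java
class DoubleHashingTable {
    private final Integer[] table;   // null marks an empty slot

    DoubleHashingTable(int primeCapacity) {
        table = new Integer[primeCapacity];
    }

    // First hash: the home position.
    private int hash1(int key) {
        return Math.floorMod(key, table.length);
    }

    // Second hash: a step in 1 .. capacity-1, never 0, so each probe moves on.
    private int hash2(int key) {
        return 1 + Math.floorMod(key, table.length - 1);
    }

    // Returns the slot holding key after the call, or -1 if no slot was free.
    int insert(int key) {
        int start = hash1(key);
        int step = hash2(key);
        for (int i = 0; i < table.length; i++) {
            int slot = (start + i * step) % table.length;
            if (table[slot] == null || table[slot] == key) {
                table[slot] = key;
                return slot;
            }
        }
        return -1;
    }
}
```

In an 11-slot table, keys 5, 16 and 27 all have home position 5 but different step sizes, so they land in slots 5, 1 and 2 instead of forming one contiguous run.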
72
Hash Tables - Re-hash functions
  • The re-hash function has many variations
  • Quadratic probing
  • the offset added to h(x) is squared (i^2)
  • Avoids primary clustering
  • Secondary clustering occurs
  • All keys which collide on h(x) follow the same
    sequence
  • First
  • a = h(j)
  • Then a + c, a + 4c, a + 9c, a + 16c, ....

73
Quadratic Probing
  • Some versions use
  • p(K, i) = c1 i^2 + c2 i + c3 for some
    choice of constants c1, c2, and c3.
  • Secondary clustering is generally less of a
    problem than primary clustering
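
A small sketch of a quadratic probe sequence, assuming the simplest choice of offset, i^2 (that is, c1 = 1 with the other constants 0). The class and method names are illustrative.

```java
class QuadraticProbe {
    // Returns the first count slots probed for a key whose home position is home,
    // in a table of m slots, using an offset of i * i on the i-th probe.
    static int[] sequence(int home, int m, int count) {
        int[] slots = new int[count];
        for (int i = 0; i < count; i++) {
            slots[i] = (home + i * i) % m;
        }
        return slots;
    }
}
```

For home position 3 in a table of 10 slots, the first five probes are 3, 4, 7, 2, 9. Every key with the same home position follows this same sequence, which is the secondary clustering mentioned above.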

74
Searching in a Hash Table
  • We have already seen how searching works with
    chaining.
  • With Closed Hashing, we use the following steps
  • Given a target, hash the target
  • Take the value of the hash of target and go to
    the slot.
  • If the target exists, it must be in this slot or
    further along the probe sequence
  • Search from the current slot using a
    linear search.

75
Look up a key
  • public lookup(key)
  • int i
  • i = find_slot(key) // method to find key in
    table
  • if slot[i] is occupied // key is in table
  • return slot[i].value // return value in
    slot
  • else
  • // key is not in table
  • return not found

76
linear probing and single-slot step
  • public find_slot(key)
  • int i
  • i = hash(key) // use a hash method to
    hash the key
  • // search until we either find the key, or find
    an empty slot
  • while ( (slot[i] is occupied) and
    ( slot[i].key != key ) )
  • i = (i + 1) mod table_size // wrap around at the end
  • return i

77
Deleting in a table Closed Hashing
  • Suppose you want to look up cow in this hash
    table
  • Also suppose
  • hashCode(cow) = 144
  • table[144] is not empty
  • table[144] != cow
  • table[145] is not empty
  • table[145] != cow
  • table[146] is empty
  • If cow were in the table, we should have found it
    by now
  • Therefore it is not there.

78
Deleting from a table
  • Problem
  • When an empty slot is reached, we assume the
    item we are searching for is not there.
  • Deletion leaves an empty slot,
  • When we next search for an item using linear
    probing,
  • We assume the item is not there when we reached
    the empty slot.

79
Tombstones
  • We assume the item is not there when we reached
    the empty slot.
  • When, in fact, the item could be AFTER the empty
    slot.

80
TOMBSTONES
Therefore, straight deletion of an item would not
work. Instead, the cell is marked (usually by
use of a boolean variable) when an item is
deleted. The marked slot is often termed a
tombstone.
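
A sketch of deletion with tombstones under linear probing, assuming integer keys, key mod capacity as the home position, and a parallel boolean array as the deletion marker. The class name TombstoneTable is illustrative, and reusing a tombstone slot on insert is one possible design choice, not something the slides specify.

```java
class TombstoneTable {
    private final Integer[] keys;      // null marks a slot that was never used
    private final boolean[] deleted;   // true marks a tombstone

    TombstoneTable(int capacity) {
        keys = new Integer[capacity];
        deleted = new boolean[capacity];
    }

    private int home(int key) {
        return Math.floorMod(key, keys.length);
    }

    // Returns the slot of a live copy of key, or -1 if there is none.
    private int indexOf(int key) {
        int i = home(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null) return -1;                 // truly empty: stop searching
            if (!deleted[i] && keys[i] == key) return i;    // live match
            i = (i + 1) % keys.length;                      // occupied or tombstone: keep probing
        }
        return -1;
    }

    boolean contains(int key) {
        return indexOf(key) >= 0;
    }

    boolean insert(int key) {
        if (contains(key)) return true;
        int i = home(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null || deleted[i]) {            // empty slot or reusable tombstone
                keys[i] = key;
                deleted[i] = false;
                return true;
            }
            i = (i + 1) % keys.length;
        }
        return false;                                       // table is full
    }

    boolean delete(int key) {
        int i = indexOf(key);
        if (i < 0) return false;
        deleted[i] = true;                                  // leave a tombstone, do not empty the slot
        return true;
    }
}
```

With a capacity of 5 and keys 3, 8, 13 inserted in that order (slots 3, 4, 0), deleting 8 leaves a tombstone in slot 4, so a later search for 13 probes past it and still finds 13 in slot 0.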
81
Hash Tables - Summary so far ...
  • Potential O(1) search time
  • If a suitable function hash(key) -> integer can be
    found
  • Space for speed trade-off
  • Full hash tables don't work (more later!)
  • Collisions
  • Inevitable
82
  • Various resolution strategies looked at so far
  • Linked lists
  • Overflow areas
  • Re-hash functions
  • Linear probing: h is +1
  • Quadratic probing: h is +i^2
  • Any other hash function, or even a sequence of
    functions!
83
Comparison of collision techniques
Linear Probing
Random Probing
Chaining
84
Hashing with Chaining
  • What is the running time to insert/search/delete?
  • Insert: it takes O(1) time to compute the hash
    function and insert at the head of the linked list
  • Search: it is proportional to the max linked list
    length
  • Delete: same as search

85
Efficiency of chaining
  • Therefore, if we have a bad hash function,
    all n keys may hash to the same
    table index giving an O(n) run-time!
  • So how can we create a good hash function?

86
Hash Tables - Choosing the Hash Function
  • Some functions are definitely better than others!
  • Key criterion
  • Minimum number of collisions
  • Keeps chains short
  • Maintains O(1) on average

87
Writing your own hashCode method
  • A hashCode method must
  • Return a value that is a legal array index
  • Always return the same value for the same input
  • It can't use random numbers, or the time of day
  • Return the same value for equal inputs
  • Must be consistent with your equals method

88
Hashcode Function
  • It does not need to return different values for
    different inputs; some collisions are
    inevitable.
  • A good hashCode method should
  • Be efficient to compute
  • Give a uniform distribution of array indices
  • so NO SPARSE ARRAYS!
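
A sketch of a hashCode method that stays consistent with equals, using a hypothetical City key class with two fields (the class and its fields are illustrative). Note that Java's hashCode may return any int, so this sketch reduces it to a legal array index in a separate, hypothetical slot method.

```java
import java.util.Objects;

final class City {
    private final String name;
    private final String state;

    City(String name, String state) {
        this.name = name;
        this.state = state;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof City)) return false;
        City c = (City) other;
        return name.equals(c.name) && state.equals(c.state);
    }

    // Built from exactly the fields equals compares, so equal cities get equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(name, state);
    }

    // Reduces the hash code to an index in 0 .. tableSize-1, even when hashCode() is negative.
    int slot(int tableSize) {
        return Math.floorMod(hashCode(), tableSize);
    }
}
```

Two City objects built from the same name and state are equal and therefore always land in the same slot; no random numbers or clock values are involved, so the result is the same every time.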

89
Other considerations
  • The hash table might fill up we need to be
    prepared for that
  • Generally speaking, hash tables work best when
    the table size is a prime number

90
Hash tables in Java
  • Java provides two classes, Hashtable and HashMap,
    which implement the Map interface
  • Both are maps: they associate keys with values
  • Hashtable is synchronized: it can be accessed
    safely from multiple threads
  • Hashtable uses an open hash, and has a rehash
    method to increase the size of the table

91
HashMap
  • HashMap is newer, faster, and usually better,
  • but it is not synchronized
  • HashMap (default) uses a bucket hash -
  • (linked list)
  • and has a remove method

92
Hash table operations
  • Both Hashtable and HashMap are in java.util
  • Both have no-argument constructors, as well as
    constructors that take an integer table size
  • Both have methods as listed in next slide

93
Methods
  • // put the entry in the table
  • public V put(K key, V value)
  • // returns the value for this key, or null
  • public V get(Object key)
  • public void clear() // clears the table
  • public Set<K> keySet() // returns the keys in the
    table as a Set
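
A short usage sketch of these methods on java.util.HashMap. The class name BirdCounts and the sample data are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class BirdCounts {
    // Builds a small map from bird name (key) to a count (value).
    static Map<String, Integer> sample() {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("hawk", 2);
        counts.put("seagull", 7);
        counts.put("hawk", 3);   // same key again: the old value 2 is replaced by 3
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = sample();
        System.out.println(counts.get("hawk"));        // prints 3
        System.out.println(counts.get("cow"));         // prints null (no such key)
        System.out.println(counts.keySet().size());    // prints 2
        counts.clear();
        System.out.println(counts.isEmpty());          // prints true
    }
}
```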

94
Hash Tables - Reducing the range to [0, m)
  • We've mapped the keys to a range of integers
    0 <= key < r
  • and decided on the total number of possible keys
  • For social security numbers: 999,999,999
  • Now we must reduce this range to [0, m)
    // from 0 to m-1
  • where m is a reasonable size for the hash table

95
Hash Tables Hash functions
  • Some typical functions
  • Division: use a mod function
  • hash(k) = abs( k mod m )
  • where m is the table size
  • which yields a range between 0 and m-1
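
A sketch of the division method in Java, assuming int keys. It shows two ways of keeping the result in 0 .. m-1: the abs form from the slide, and Math.floorMod as an alternative. The class and method names are illustrative.

```java
class DivisionHash {
    // The slide's form: take the remainder, then drop the sign.
    static int hashAbs(int k, int m) {
        return Math.abs(k % m);
    }

    // Alternative: floorMod already returns a value in 0 .. m-1 for positive m.
    static int hashFloorMod(int k, int m) {
        return Math.floorMod(k, m);
    }
}
```

For non-negative keys the two agree (103 maps to 3 when m is 10). For negative keys both stay in range but can pick different slots: with k = -7 and m = 10, hashAbs gives 7 and hashFloorMod gives 3.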

96
  • Some typical functions
  • Choice of m?
  • Powers of 2 are generally not good!
  • h(k) = k mod 2^n uses only the last n bits of k
  • Prime numbers close to 2^n - good choices

97
Choosing a viable value for M
  • Prime numbers close to 2^n - good choices
  • E.g. if you want a 4000-entry table,
  • choose m = 4093
  • Other methods in your text.
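
A small sketch for picking such a table size, assuming "close to 2^n" is read as the largest prime not exceeding the power of two. The class and method names are illustrative, and trial division is used only because table sizes are small.

```java
class TableSize {
    // Trial division: enough for table-sized numbers.
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Largest prime p with p <= limit.
    static int largestPrimeAtMost(int limit) {
        for (int p = limit; p >= 2; p--) {
            if (isPrime(p)) return p;
        }
        throw new IllegalArgumentException("no prime <= " + limit);
    }
}
```

For a table of roughly 4000 entries the nearest power of two is 2^12 = 4096, and largestPrimeAtMost(4096) returns 4093, the value chosen above.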

98
Performance Analysis
  • If n slots in a table of size m are occupied, the
    load factor α is defined as α = n / m
  • α = 1 means the table is full, and α = 0 means
    the table is empty.
  • It is generally good to keep α < 1, near
    0.8.

n = number of items
m = number of slots
100
Linear probing
Double hashing
Separate chaining
101
Hash Tables - Collision Resolution Summary
  • Chaining
  • Unlimited number of elements
  • Unlimited number of collisions
  • Overhead of multiple linked lists
  • Re-hashing
  • Fast re-hashing
  • Fast access through use of main table space
  • Maximum number of elements must be known
  • Multiple collisions become probable -
    CLUSTERING!
  • Overflow area
  • Fast access
  • Collisions don't use primary table space

102
Terms to Know
  • Open Addressing: looks for another open position
    in the table other than the one to which the
    element is originally hashed. Requires that the
    load factor be < 1.
  • Open Addressing using Linear Probing: seeking
    the next available position creates clusters;
    alternative methods include quadratic probing etc.
  • Separate Chaining: if two keys map to the same
    address, separate chaining creates a linked list
    of keys that map to that address.

103
HashCode function in Java
  • Hash function - has two parts
  • Map key k to an integer
  • There is a default hashCode() in Java - the
    method maps each object to an integer.
  • It returns a 32-bit integer which may be derived
    from where the object is in memory.
  • That would work poorly for Strings, as two strings
    could be in different locations in memory and
    contain the same data (the String class overrides
    hashCode() for this reason).

104
Hash Tables - Review
  • If you can meet the constraints of a hash
    function that gives a Big(O) of 1
  • Hash Tables will generally give good performance
  • O(1) search

105
  • BUT
  • not advisable for unknown data
  • If collection size is relatively static (few
    insertions and deletions), memory management is
    actually simpler

106
Universal or Perfect Hashing
  • "Dynamic perfect hashing" involves using a
    second hash table as the data structure to store
    multiple values within a particular bucket.
  •  
  • How do we find the next location with this
    approach?

107
Universal Hashing
  • What advantages does it have over linear probing?
  • What are possible problems with the approach?
  • Perfect hashing means that read access takes
    constant time even in the worst case.

108
Universal or Perfect Hashing
  • For inserting, the time bounds are only true on
    average.
  • To make insertion fast enough,
  • the second level hash table is very large for
    the number of keys k it holds (k^2 slots),
  • large enough so that collisions become
    unlikely.

109
second level hash tables
  • This is not a problem with table size because the
    first level hash distributes keys evenly
  • so that on average second level hash tables
    are still relatively small.
  • The hash functions for the second level tables are
    chosen at random from a set of parameterized hash
    functions.

110
Universal Hashing
  • Perfect hashing is possible when you know exactly
    what set of keys you are going to be hashing when
    you design your hash function.
  • It's popular for hashing keywords for
    compilers
  • Minimal perfect hashing guarantees that n
    keys will map to 0..n-1 with no collisions at
    all.

111
Chained Bucket
  • Note when using chaining,
  • each linked list attached to a slot is called a
    bucket
  • - this is called chained bucket hashing
  • However, there is also bucket hashing done on
    the main table - just to make things real clear.