Title: Introduction to Programming with Data Structures
1Introduction to Programming with Data Structures
Computer Science 187
Lecture 20 Sets, Maps, and Hashtables
Announcements
- Recursion OWL is still open.
- Programming project 5 up.
2Sets and the Set Interface
- This part of the Collection hierarchy includes 3
interfaces, 2 abstract classes, and 2 actual
classes
3The Set Abstraction
- A set is a collection of elements containing no duplicate elements
- Operations on sets include
- Testing for membership
- Adding elements
- Removing elements
- Union
- Intersection
- Difference
- Subset
4The Set Interface and Methods
- // element oriented methods
- boolean contains (Object o) // member test
- boolean add (E e) // enforces no-dups
- boolean remove (Object o)
- boolean isEmpty ()
- int size ()
- Iterator<E> iterator ()
5The Set Interface and Methods (2)
- // Set/Collection oriented methods
- // subset test
- boolean containsAll(Collection<E> c)
- // set union
- boolean addAll(Collection<E> c)
- // set difference
- boolean removeAll(Collection<E> c)
- // set intersection
- boolean retainAll(Collection<E> c)
6The Set Interface and Methods (3)
- Constructors enforce the no-duplicates criterion
- Add methods do not allow duplicates either
- Certain methods are optional
- add, addAll, remove, removeAll, retainAll
7Set Example
- String[] aA = {"Ann", "Sal", "Jill", "Sal"};
- String[] aB = {"Bob", "Bill", "Ann", "Jill"};
- Set<String> sA = new HashSet<String>();
- // HashSet implements Set
- Set<String> sA2 = new HashSet<String>();
- Set<String> sB = new HashSet<String>();
- for (String s : aA) {
-   sA.add(s); sA2.add(s);
- }
- for (String s : aB) {
-   sB.add(s);
- }
8Set Example (2)
- ...
- System.out.println("The two sets are " +
-   "\n" + sA + "\n" + sB);
- sA.addAll(sB); // union
- sA2.retainAll(sB); // intersection
- System.out.println("Union: " + sA);
- System.out.println("Intersection: " + sA2);
9Output
java -cp ./Users/allenhanson/Desktop/SetTest/JavaClasses.jar SetTest
The two sets are
Set A: [Sal, Jill, Ann]
Set B: [Bob, Jill, Ann, Bill]
Union: [Bob, Sal, Jill, Ann, Bill]
Intersection: [Jill, Ann]
logout
Process completed
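The union and intersection steps above can be tried in a small self-contained program. This is a minimal sketch: the class name SetOpsSketch and its helper methods are made up for illustration and are not part of the lecture code. Unlike the slide code, the helpers copy the set first so the inputs are left unchanged.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, not from the lecture.
class SetOpsSketch {
    // Returns a new set holding every element of a or b (set union).
    static <E> Set<E> union(Set<E> a, Set<E> b) {
        Set<E> result = new HashSet<>(a); // copy so a is not modified
        result.addAll(b);
        return result;
    }

    // Returns a new set holding only the elements found in both a and b.
    static <E> Set<E> intersection(Set<E> a, Set<E> b) {
        Set<E> result = new HashSet<>(a);
        result.retainAll(b);
        return result;
    }

    public static void main(String[] args) {
        // The duplicate "Sal" is silently dropped by the set.
        Set<String> sA = new HashSet<>(Arrays.asList("Ann", "Sal", "Jill", "Sal"));
        Set<String> sB = new HashSet<>(Arrays.asList("Bob", "Bill", "Ann", "Jill"));
        System.out.println("Union: " + union(sA, sB));
        System.out.println("Intersection: " + intersection(sA, sB));
        System.out.println("Size of A: " + sA.size());
    }
}
```

The order in which a HashSet prints its elements is not specified, so the printed order may differ from the output shown on the slide.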
10Lists vs. Sets
- Sets allow no duplicates
- Sets do not have positions, so no get or set method
- Set iterator can produce elements in any order
No order or sense of position
11Searching
- Searching for information is ubiquitous.
- Search is often key-based
- medical record based on person's name (the key)
- look for property records based on an address-based key
- Java Collections classes support efficient search over various kinds of data structures.
12Dictionaries
- Many, many examples of dictionaries in our everyday life.
- The primary purpose is to look things up using some key. The motivation is that there is some information in addition to the key that we would find useful
- account number in our bank
- a set of windows open in a graphical interface
- Webster's
- A concordance for a document
- Variable-value tables
Dictionary class replaced by the Map interface.
13Looking Stuff Up
[Figure: keys are looked up in a dictionary to retrieve the associated elements]
Key      Associated Element
Allen    339
Joan     0
Barb     32
Phil     458
14Maps and the Map Interface
- Map is related to Set: it is a set of ordered pairs
- Ordered pair: (key, value)
- There are no duplicate keys
- Values may appear more than once
- Can think of key as mapping to a particular value
- Maps support efficient organization of information in tables
- Mathematically, these maps are
- Many-to-one (not necessarily one-to-one)
- many keys might map to the same element
- Onto (every data element in the map has a key)
15The Map Interface
- // some methods of java.util.Map<K, V>
- // K is the key type
- // V is the value type
- // may return null
- V get (Object key)
- // returns previous value or null
- V put (K key, V value)
- // returns previous value or null
- V remove (Object key)
- boolean isEmpty ()
- int size ()
16Map Example
17Map Example
- // this builds the Map in previous picture
- Map<String, String> m =
-   new HashMap<String, String>();
- // HashMap is an implementation of Map
- m.put("J" , "Jane");
- m.put("B" , "Bill");
- m.put("S" , "Sam" );
- m.put("B1", "Bob" );
- m.put("B2", "Bill");
- //
- System.out.println("B1->" + m.get("B1"));
- System.out.println("Bill->" + m.get("Bill"));
No order or sense of position
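The map example can be run as a complete program to see what get and put return. This is a minimal sketch: the class name MapLookupSketch and the method buildMap are hypothetical, and the keys and values are the ones from the slide.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper class, not from the lecture.
class MapLookupSketch {
    // Builds the same five (key, value) pairs as the slide.
    static Map<String, String> buildMap() {
        Map<String, String> m = new HashMap<>();
        m.put("J", "Jane");
        m.put("B", "Bill");
        m.put("S", "Sam");
        m.put("B1", "Bob");
        m.put("B2", "Bill"); // the value "Bill" appears twice, which is allowed
        return m;
    }

    public static void main(String[] args) {
        Map<String, String> m = buildMap();
        System.out.println("B1->" + m.get("B1"));     // "B1" is a key
        System.out.println("Bill->" + m.get("Bill")); // "Bill" is only a value, so get returns null
        // put on an existing key replaces the value and returns the old one.
        System.out.println("old value for B: " + m.put("B", "Barb"));
    }
}
```

Lookup works only in the key-to-value direction: "Bill" is stored twice as a value, but it is not a key, so get("Bill") finds nothing.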
18Efficient Implementation of Maps and Dictionaries: Hash Tables
- Problem: phone company wants to implement caller ID.
- given a phone number (the key), look up person's name (the data)
- lots of phone numbers (P = 10^7 - 1) in a given area code
- only a small fraction of them are in use
- Possibilities
- an array indexed by key: retrieval is O(1), but huge amount of space potentially wasted.
19A List Representation
- Nodes store keys and data
- Only phone numbers in use are stored
- Retrieval is O(N) and the space requirement is cNt (N = phone numbers in use)
20A Tree Representation
- Nodes contain key (phone number) and data (name).
- If the tree is balanced, search is O(log N) and space is on the order of the number of nodes.
21As a Hash Table
- Use a function of the key, h(K), to determine where the key (and associated data) is stored
- M = h(K) -> an address
- Go directly to location rather than searching
- O(n) or O(log n) -> O(1).
- For any key K
- store K and a reference to the data associated with K at location h(K) in the hash table (an array).
- When looking for K, compute h(K) and go to that location in hash table to find key and the other information.
22There's always a catch...
- The problem is collisions: two keys k1 and k2 for which
- h(k1) = h(k2)
- Use a collision-handling scheme to solve the problem of pairs that need to be stored at the same location
- Two important issues
- hash functions and
- collisions
23Hashing Observations
- A hash table is a container which is used to hold some number of items of a given set K (the keys)
- Generally, the size of the set of keys, |K|, is relatively large or even unbounded.
- For example, if the keys are 32-bit integers, then |K| = 2^32
- If the keys are arbitrary character strings of arbitrary length, then |K| is unbounded.
- We also expect the actual number of items stored in the container to be significantly less than |K|.
- That is, if n is the number of items actually stored in the container, then n << |K|.
24As a Hash Table
- New phone number: 348-8905 for Mike
- Add Mike at location 0 in the hash table
[Figure: hash table with slots 0-4; slot 0 holds Mike, 348-8905]
- Hash function evaluates to the range 0..N-1, or 0-4 in this example.
25A Collision
- New phone number: 352-2188 for Karen
[Figure: Mike (348-8905) is in slot 0; Karen (352-2188) hashes to slot 3, which already holds Al (259-0623)]
- Uh-oh! Al's already there!
- Called a collision.
- Collisions are unavoidable in hash tables. WHY??
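The collision can be reproduced with a few lines of arithmetic. The sketch below assumes the hash function is K mod 5 on the phone number with the dash removed, which matches the 0-4 range of the example table. The class and method names are hypothetical.

```java
// Hypothetical demo class, not from the lecture.
class CollisionSketch {
    // Assumed hash function: key mod table size, for non-negative keys.
    static int hash(int key, int tableSize) {
        return key % tableSize;
    }

    public static void main(String[] args) {
        int n = 5; // table slots 0..4
        System.out.println("Mike  -> " + hash(3488905, n));
        System.out.println("Al    -> " + hash(2590623, n));
        System.out.println("Karen -> " + hash(3522188, n)); // same slot as Al: a collision
    }
}
```

One way to answer the WHY: there are far more possible keys than table slots, so at least two distinct keys must share a slot.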
26What Can Be Hashed?
- Anything!
- numbers, strings, structures, etc.
- We just have to be clever about how we define the hash function.
- Java defines a hashing method for general objects which returns an integer value.
- may not be too good, in general: sometimes returns the address of the object -- what problems does this cause?
- This method is overridden by the String class (for example) to provide a better method: one that guarantees that the values returned for two strings that are equal (same character sequence) are the same.
- Hash on content, not address
27Hash Function General Idea
[Figure: a hash code maps objects to integers (..., -3, -2, -1, 0, 1, 2, 3, ...); a compression map then maps those integers to indices 0, 1, 2, 3, ..., N - 1 of the hash table (e.g. an array of size N)]
28Another Example: Counting Characters, A Simple Hash Function
- Want to count occurrences of each character in a file
- There are 2^16 possible characters, but ...
- Maybe only 100 or so occur in a given file
- Approach: hash character to range 0-199
- That is, use a hash table of size 200
- A possible hash function for this example:
- int hash = unicodeChar % 200;
- Collisions are certainly possible.
29Example Hashing Functions
- Integer translation of memory location
- Ignores equality by state
- Default for most Java implementations
- Integer translation of object state
- Implies state equality hash code equality
- Integer translation of memory bits
- Often ignores large portions of object state
- Good for primitive values
- Lots of possibilities
30Devising Hash Functions
- Simple functions often produce many collisions
- ... but complex functions may not be good either!
- It is often an empirical process
- Adding letter values in a string: same hash for strings with same letters in different order
- Better approach:
- int hash = 0;
- for (int i = 0; i < s.length(); i++)
-   hash = hash * 31 + s.charAt(i);
- This is the hash function used for String in Java
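The two approaches can be compared directly. This is a minimal sketch: the class StringHashSketch and the method names are made up for illustration, and sumHash is the naive letter-sum function described in the bullets.

```java
// Hypothetical demo class, not from the lecture.
class StringHashSketch {
    // Naive approach: add up the character values. Order is ignored.
    static int sumHash(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash += s.charAt(i);
        }
        return hash;
    }

    // Better approach from the slide: multiply by 31 before adding each character.
    static int polyHash(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = hash * 31 + s.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        // Anagrams collide under the naive sum.
        System.out.println(sumHash("stop") == sumHash("pots"));
        // The 31-based hash separates them.
        System.out.println(polyHash("stop") == polyHash("pots"));
        // The 31-based hash agrees with the library method.
        System.out.println(polyHash("stop") == "stop".hashCode());
    }
}
```

The int arithmetic may overflow for long strings. That is harmless here, because overflow wraps around and still yields a deterministic value.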
31Devising Hash Functions (2)
- The String hash is good in that
- Every letter affects the value
- The order of the letters affects the value
- The values tend to be spread well over the integers
- Table size should not be a multiple of 31
- Calculate index: int index = hash % size;
- For short strings, index depends heavily on the last one or two characters of the string
- They chose 31 because it is prime, and this is less likely to happen
32Devising Hash Functions (3)
- Guidelines for good hash functions
- Spread values evenly as if random
- Cheap to compute
- Generally, number of possible values >> table size
33Compression Map
- Translates integers in one range into integers in another range
- Maps hash codes to indices of containers in a hashtable's array
- Can be implemented in various ways
- A separate CompressionMap object
- A separate HashFunction object
- Example Compression Maps
- Divide by range size and take remainder
- Known as the division method
- hash(k) = k mod N
- Poor if range size is not prime
- Compute with constant values
- One possibility: the MAD method
- hash(k) = (ak + b) mod N
- Good if a mod N ≠ 0
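Both compression maps can be written as short functions. This is a minimal sketch: the class CompressionMapSketch and the constants a and b used in main are arbitrary choices for illustration. Using Math.floorMod to keep results non-negative for negative hash codes is also an assumption, since the slide does not say how negative values are handled.

```java
// Hypothetical demo class, not from the lecture.
class CompressionMapSketch {
    // Division method: hash(k) = k mod N, kept in 0..N-1 even for negative k.
    static int division(int k, int n) {
        return Math.floorMod(k, n);
    }

    // MAD (multiply, add, divide) method: hash(k) = (a*k + b) mod N.
    // long arithmetic avoids int overflow in a*k + b.
    static int mad(int k, int a, int b, int n) {
        return (int) Math.floorMod((long) a * k + b, (long) n);
    }

    public static void main(String[] args) {
        int n = 5;
        for (int k : new int[] {7, 8, -7}) {
            System.out.println(k + ": division=" + division(k, n)
                + ", mad(a=3, b=1)=" + mad(k, 3, 1, n)
                + ", mad(a=10, b=1)=" + mad(k, 10, 1, n));
        }
    }
}
```

With a = 10 and N = 5, a mod N is 0, so every key lands on b mod N. That is why the slide requires a mod N ≠ 0.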
34Rehashing The Hashing Function
- What we need is a function h: K -> {0, 1, 2, ..., N-1}.
- h is called a hash function.
- In general, since |K| >> N, the mapping defined by a hash function will be a many-to-one mapping.
- That is, there will exist many pairs of distinct keys x and y for which h(x) = h(y).
- This situation is called a collision.
- What are the characteristics of a good hash function?
- A good hash function avoids collisions.
- A good hash function tends to spread keys evenly in the array.
- A good hash function is easy to compute.
- However, remember that collisions are inevitable because the mapping is many to one.
- SO... how do we handle collisions?
35Strategies for Handling Collisions
- Attempts to solve the problem of two keys hashing to the same location
- Depends on individual container capacity and may result in a fixed or arbitrary capacity hashtable
- The running times for a hashtable's operations depend on its collision-handling strategy
36Dealing with Collisions
- Three standard methods
- Chaining: use a list at each array index to store keys and data
- Linear/Quadratic Probe: if hashed index is full, start walking down array looking for an empty slot and put key and data there.
- Double Hashing: use two hash functions; the second is an offset to add to the value of the first.
- Chaining stores data outside the table and is an example of a technique called open hashing.
- The other two store the data in the hash table and are examples of techniques called closed hashing (also called open addressing, just to be really confusing).
- All three methods introduce some problems.
37Open Hashing
- Colliding elements are placed in a list whose head is located in the array.
[Figure: array of list heads; slots with no elements hold NULL]
- When looking for a key, if the list head is not null, we may have to look down the list to find our key.
38Closed Hashing
- All records are stored in the table.
- Call h(ki) the home position (the position computed by the hashing function).
- If a slot is already occupied, the data will be stored at some other slot in the table.
- How do we find this slot? (aside: mediated by a conflict resolution policy).
- Any closed hashing collision resolution method can be viewed as generating a sequence of hash table slots that can potentially hold the key and data.
- First slot generated is the home slot.
- If occupied, go to next slot generated; if occupied, go to next slot, etc.
- Array must be treated circularly.
39Closed Hashing Example Linear Probing
- What about
- Searching: follows same probe sequence. When can we stop?
- Deletions are a pain! WHY????
40Linear Probing Summary
- Uses less memory than chaining
- don't have to store all the links
- Can be slower than chaining
- may have to walk along the table for a long way
- notice we're walking over the elements at their legitimate home positions as we go.
- Difficult to delete a key and associated record.
- has an impact on the search process
- Keys have a tendency to clump, leading to long search sequences.
41Linear Probe PseudoCode
- linear_probe_insert(K)
-   if (table is full) error
-   probe = h(K)
-   while (table[probe] is occupied)
-     probe = (probe + 1) mod N
-   table[probe] = K
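The pseudocode translates almost line for line into Java. This is a minimal sketch under several assumptions: keys are non-negative-friendly ints hashed with K mod N, only keys are stored (no associated data), and duplicates are not detected, just as in the pseudocode. The class name LinearProbeSketch is hypothetical.

```java
// Hypothetical demo class, not from the lecture.
class LinearProbeSketch {
    private final Integer[] table; // null marks an empty slot
    private int count;             // number of occupied slots

    LinearProbeSketch(int n) {
        table = new Integer[n];
    }

    // Inserts key and returns the slot where it ended up.
    int insert(int key) {
        if (count == table.length) {
            throw new IllegalStateException("table is full");
        }
        int probe = Math.floorMod(key, table.length); // home position h(K)
        while (table[probe] != null) {
            probe = (probe + 1) % table.length;       // walk circularly to the next slot
        }
        table[probe] = key;
        count++;
        return probe;
    }

    public static void main(String[] args) {
        LinearProbeSketch t = new LinearProbeSketch(5);
        System.out.println("Mike  -> slot " + t.insert(3488905));
        System.out.println("Al    -> slot " + t.insert(2590623));
        System.out.println("Karen -> slot " + t.insert(3522188)); // home slot 3 is taken by Al
    }
}
```

The full-table check matters: without it the while loop would never terminate once every slot is occupied.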
42Closed Hashing Example Double Hashing
- Use two hash functions
- one as before that generates the home position.
- second one generates a sequence of offsets from the home position that define the probe sequence.
- probe = (probe + offset) mod N
- If the size of the table is prime, this method will eventually examine every position in the table.
- take a number theory course to find out why.
43Double Hashing Example
h1(K) = K mod 5, h2(K) = 3 - K mod 3
[Figure: Mike (348-8905) is already in the table; Karen (352-2188) finds her home slot full (1. Full), uses h2(3522188) = 3 as the offset, and the next probed slot is free (2. OK!)]
Probe sequence: 352-2188 -> 3 1 4 2 0
- What about
- Searching: follows same probe sequence. When can we stop?
- Deletions are still a pain!
44Double Hashing
- double_hash_insert(K)
-   if (table is full) error
-   probe = h1(K)
-   offset = h2(K)
-   while (table[probe] occupied)
-     probe = (probe + offset) mod N
-   table[probe] = K
- Many of same (dis)advantages as linear probing.
- Tends to distribute keys more uniformly.
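The probe sequence that double hashing generates can be listed explicitly. The sketch below takes the home position and offset as given inputs (3 and 3 for Karen in the example) rather than computing them from h1 and h2. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not from the lecture.
class DoubleHashProbeSketch {
    // Lists the first n slots visited: home, home+offset, home+2*offset, ... (mod n).
    static List<Integer> probeSequence(int home, int offset, int n) {
        List<Integer> seq = new ArrayList<>();
        int probe = home;
        for (int i = 0; i < n; i++) {
            seq.add(probe);
            probe = (probe + offset) % n;
        }
        return seq;
    }

    public static void main(String[] args) {
        // Prime table size 5: every slot is visited once.
        System.out.println(probeSequence(3, 3, 5));
        // Non-prime table size 6 with offset 3: only two slots are ever visited.
        System.out.println(probeSequence(3, 3, 6));
    }
}
```

With a table of size 6 and offset 3, the offset shares a factor with the table size, so the sequence cycles early. A prime table size rules this out for any offset between 1 and N-1.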
45More on the Hash Function
- Need to choose a good hash function
- efficient to compute
- distributes keys uniformly throughout table
- For non-integer keys
- find a way of turning the keys into integers
- for phone number, remove the "-" to get an integer!
- for strings, add up Unicode value of characters?
- use the standard hash function on the integers
- Standard function
- h(K) = K mod N
- K is the key, N is the size of the table
- How do we choose N?
46Choosing the Size of the Hash Table
- Make it big enough to last for a while.
- N = 2^p is bad
- h(K) gives the p least significant bits of K
- all keys with the same ending go to the same place.
- N prime is good
- helps ensure a uniform distribution
- again, need a number theory course to see why.
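A small experiment shows the effect of the table size. The keys below are made-up values that share their four lowest bits. The class TableSizeSketch is hypothetical, and the standard function h(K) = K mod N is assumed.

```java
// Hypothetical demo class, not from the lecture.
class TableSizeSketch {
    // Standard hash function h(K) = K mod N.
    static int hash(int key, int n) {
        return Math.floorMod(key, n);
    }

    public static void main(String[] args) {
        int[] keys = {8, 24, 40, 56}; // binary: all end in 1000
        for (int k : keys) {
            System.out.println(k + ": N=16 -> " + hash(k, 16)
                + ", N=17 -> " + hash(k, 17)
                + ", low 4 bits = " + (k & 15));
        }
    }
}
```

With N = 16 = 2^4, the hash is exactly the low four bits of the key, so all four keys share one slot. With the prime N = 17 they land in four different slots.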
47Theoretical Results
- Define the load factor of a hash table as
- a = m/N
- N is the size of the table
- m is the number of entries in the table with something in them.
- a is the average number of keys per array entry
- Theoretical results are obtained using a
probabilistic analysis, rather than worst case.
48Expected Number of Probes Needed to Find a Key
49Performance of Hash Tables (2)
L is our load factor a
50Same Results Graphically
Chaining
51Performance of Hash Tables (3)
- Hash table
- Insert: average O(1)
- Search: average O(1)
- Sorted array
- Insert: average O(n)
- Search: average O(log n)
- Binary Search Tree
- Insert: average O(log n)
- Search: average O(log n)
- But balanced trees can guarantee O(log n)
Average case, not worst case. Worst case for
chaining is O(n)
52Performance of Hash Tables (4) Space
- Hash table
- Open addressing: space n/a, e.g., 1.5 to 2 x n
- Chaining: assuming 4 words per list node (2 header, 1 next, 1 data): n(1 + 4a)
- Sorted array
- Space: n
- Binary Search Tree
- Space: 5n (5 words per tree node: 2 header, 1 left, 1 right, 1 data)
53Terminology Review
- hash table: Table which can be searched for an item in O(1) time using a hash function to form an address from the key.
- hash function: Function which, when applied to the key, produces an integer which can be used as an address in a hash table.
- collision: When a hash function maps two different keys to the same table address, a collision is said to occur.
- linear probing: A simple re-hashing scheme in which the next slot in the table is checked on a collision.
- quadratic probing: A re-hashing scheme in which a higher (usually 2nd) order function of the hash index is used to calculate the address.
- chaining: A conflict resolution strategy where colliding values at a hash table slot are stored in a list associated with the slot.
- clustering: Tendency for clusters of adjacent slots to be filled when linear probing is used.
- secondary clustering: Collision sequences generated by addresses calculated with quadratic probing.
- perfect hash function: Function which, when applied to all the members of the set of items to be stored in a hash table, produces a unique set of integers within some suitable range.
54Implementing Hash Tables
- Interface HashMap: used for both implementations
- Class Entry: simple class for (key, value) pairs
- Class HTOpen: implements open addressing
- Class HTChain: implements chaining
- Further implementation concerns
55Interface HashMap<K,V>
- Note: Java API version has many more operations!
- // may return null if key not in map
- V get (Object key)
- // returns previous value; null if none
- V put (K key, V value)
- // returns previous value; null if none
- V remove (Object key)
- boolean isEmpty ()
- int size ()
56Class Entry
- private static class Entry<K, V> {
-   private K key;
-   private V value;
-   public Entry (K key, V value) {
-     this.key = key; this.value = value;
-   }
-   public K getKey () { return key; }
-   public V getValue () { return value; }
-   public V setValue (V newVal) {
-     V oldVal = value;
-     value = newVal;
-     return oldVal;
-   }
- }
57Class HTOpen<K,V>
- public class HTOpen<K, V>
-     implements HashMap<K, V> {
-   private Entry<K, V>[] table;
-   private static final int INIT_CAP = 101;
-   private double LOAD_THRESHOLD = 0.75;
-   private int numKeys;
-   private int numDeletes;
-   // special marker Entry
-   private final Entry<K, V> DELETED =
-     new Entry<K, V>(null, null);
-   public HTOpen () {
-     table = new Entry[INIT_CAP];
-   }
-   ... // inner class Entry can go here
58Class HTOpen<K,V> find
- private int find (Object key) {
-   int hash = key.hashCode();
-   int idx = hash % table.length;
-   if (idx < 0) idx += table.length;
-   while ((table[idx] != null) &&
-          (!key.equals(table[idx].key))) {
-     idx++;
-     if (idx >= table.length)
-       idx = 0;
-     // could do above 3 lines as
-     // idx = (idx + 1) % table.length;
-   }
-   return idx;
- }
59Class HTOpen<K,V> get
- public V get (Object key) {
-   int idx = find(key);
-   if (table[idx] != null)
-     return table[idx].value;
-   else
-     return null;
- }
60Class HTOpen<K,V> put
- public V put (K key, V val) {
-   int idx = find(key);
-   if (table[idx] == null) {
-     table[idx] = new Entry<K,V>(key, val);
-     numKeys++;
-     double ldFact = // NOT int divide!
-       (double)(numKeys + numDeletes) /
-       table.length;
-     if (ldFact > LOAD_THRESHOLD) rehash();
-     return null;
-   }
-   V oldVal = table[idx].value;
-   table[idx].value = val;
-   return oldVal;
- }
61Class HTOpen<K,V> rehash
- private void rehash () {
-   Entry<K, V>[] oldTab = table;
-   table = new Entry[2 * oldTab.length + 1];
-   // the + 1 keeps length odd
-   numKeys = 0;
-   numDeletes = 0;
-   for (int i = 0; i < oldTab.length; i++) {
-     if ((oldTab[i] != null) &&
-         (oldTab[i] != DELETED)) {
-       put(oldTab[i].key, oldTab[i].value);
-     }
-   }
- }
- // The remove operation is an exercise
62Chaining
63Class HTChain<K,V>
- public class HTChain<K, V>
-     implements HashMap<K, V> {
-   private LinkedList<Entry<K, V>>[] table;
-   private int numKeys;
-   private static final int CAPACITY = 101;
-   private static final double
-     LOAD_THRESHOLD = 3.0;
-   // put inner class Entry here
-   public HTChain () {
-     table = new LinkedList[CAPACITY];
-   }
-   ...
64Class HTChain<K,V> get
- public V get (Object key) {
-   int hash = key.hashCode();
-   int idx = hash % table.length;
-   if (idx < 0) idx += table.length;
-   if (table[idx] == null) return null;
-   for (Entry<K, V> item : table[idx]) {
-     if (item.key.equals(key))
-       return item.value;
-   }
-   return null;
- }
65Class HTChain<K,V> put
- public V put (K key, V val) {
-   int hash = key.hashCode();
-   int idx = hash % table.length;
-   if (idx < 0) idx += table.length;
-   if (table[idx] == null)
-     table[idx] =
-       new LinkedList<Entry<K, V>>();
-   for (Entry<K, V> item : table[idx]) {
-     if (item.key.equals(key)) {
-       V oldVal = item.value;
-       item.value = val;
-       return oldVal;
-     }
-   }
-   // more ....
66Class HTChain<K,V> put, contd.
-   // rest of put: "not found" case
-   table[idx].addFirst(
-     new Entry<K, V>(key, val));
-   numKeys++;
-   if (numKeys >
-       (LOAD_THRESHOLD * table.length))
-     rehash();
-   return null;
- }
- // remove and rehash left as exercises
67Implementation Considerations for Maps and Sets
- Class Object implements hashCode and equals
- Every class has these methods
- One may override them when it makes sense to
- Object.equals compares addresses, not contents
- Object.hashCode based on address, not contents
- Java recommendation
- If you override equals, then
- you should also override hashCode
68Example of equals and hashCode
- Consider class Person with field IDNum
- public boolean equals (Object o) {
-   if (!(o instanceof Person))
-     return false;
-   return IDNum.equals(((Person)o).IDNum);
- }
- Demands a matching hashCode method:
- public int hashCode () {
-   // equal objects will have equal hashes
-   return IDNum.hashCode();
- }
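A complete version of the Person class shows why the two methods must agree. This is a minimal sketch: the slide only names the field IDNum, so its String type, the extra name field, and the constructor are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class Person {
    private final String IDNum; // field name from the slide; the String type is an assumption
    private final String name;  // hypothetical extra field, ignored by equals and hashCode

    Person(String IDNum, String name) {
        this.IDNum = Objects.requireNonNull(IDNum);
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof Person)) {
            return false;
        }
        return IDNum.equals(((Person) o).IDNum);
    }

    @Override
    public int hashCode() {
        // equal objects will have equal hashes
        return IDNum.hashCode();
    }

    public static void main(String[] args) {
        Set<Person> people = new HashSet<>();
        people.add(new Person("123", "Ann"));
        people.add(new Person("123", "Ann Smith")); // same ID, so treated as a duplicate
        System.out.println(people.size());          // prints 1
    }
}
```

If hashCode were left as the default from Object, the two equal Person objects would most likely land in different buckets and the set would keep both.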
69Implementing HashSetOpen
- Can use HashMap<E,E> and pairs (key,key)
- This is an adapter class
- Can use an Entry<E> inner class
- Can implement with an E[] array
- In each case, can code open addressing and chaining
- The coding of each method is analogous to what we saw with HashMap
70Implementing the Java Map and Set Interfaces
- The Java API uses a hash table to implement both the Map and Set interfaces
- Implementing them is aided by abstract classes AbstractMap and AbstractSet in the Collection hierarchy
- Interface Map requires nested type Map.Entry<K,V>
- Interface Map also requires support for viewing it as a Set of Entry objects
71Applying Maps Phone Directory
- public String addOrChangeEntry (
-     String name, String newNum) {
-   String oldNum = dir.put(name, newNum);
-   modified = true;
-   return oldNum;
- }
- public String lookupEntry (String name) {
-   return dir.get(name);
- }
- public String removeEntry (String name) {
-   String ret = dir.remove(name);
-   if (ret != null) modified = true;
-   return ret;
- }
72Applying Maps Phone Directory (2)
- // in loadData
- while ((name = ins.readLine()) != null) {
-   if ((number = ins.readLine()) == null)
-     break;
-   dir.put(name, number);
- }
- // saving
- for (Map.Entry<String,String> curr :
-      dir.entrySet()) {
-   outs.println(curr.getKey());
-   outs.println(curr.getValue());
- }
73Applying Maps Huffman Coding
- // First, want to build frequency table
- // for a given input file
- public static HuffData[] buildFreqTable (
-     BufferedReader ins) {
-   Map<Character, Integer> freqs =
-     new HashMap<Character, Integer>();
-   try {
-     ... process each character ...
-   } catch (IOException ex) {
-     ex.printStackTrace();
-   }
-   ... build array from map ...
74Applying Maps Huffman Coding (2)
- // process each character
- int next;
- while ((next = ins.read()) != -1) {
-   Integer count = freqs.get((char) next);
-   if (count == null)
-     count = 1;
-   else
-     count++;
-   freqs.put((char) next, count);
- }
- ins.close();
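The counting loop can be exercised on its own, without the Huffman classes. This is a minimal sketch: the class FreqTableSketch and both countChars methods are hypothetical, and a StringReader stands in for the input file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical demo class, not from the lecture.
class FreqTableSketch {
    // Reads characters until end of input and counts how often each occurs.
    static Map<Character, Integer> countChars(Reader ins) throws IOException {
        Map<Character, Integer> freqs = new HashMap<>();
        int next;
        while ((next = ins.read()) != -1) {
            char c = (char) next;
            Integer count = freqs.get(c); // null if c has not been seen yet
            freqs.put(c, count == null ? 1 : count + 1);
        }
        return freqs;
    }

    // Convenience wrapper for in-memory text.
    static Map<Character, Integer> countChars(String text) {
        try (Reader ins = new StringReader(text)) {
            return countChars(ins);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(countChars("hello world"));
    }
}
```

The map grows only as new characters are seen, so its size is the number of distinct characters in the input rather than the 2^16 possible ones.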
75Applying Maps Huffman Coding (3)
- // build array from map
- HuffData[] freqTab =
-   new HuffData[freqs.size()];
- int i = 0;
- for (Map.Entry<Character,Integer> entry :
-      freqs.entrySet()) {
-   freqTab[i++] =
-     new HuffData(
-       entry.getValue().doubleValue(),
-       entry.getKey());
- }
- return freqTab;
76Applying Maps Huffman Coding (4)
- // build ENCODING table
- public void buildCodeTab () {
-   codeMap =
-     new HashMap<Character,BitString>();
-   buildCodeTab(huffTree, new BitString());
- }
77Applying Maps Huffman Coding (5)
- public void buildCodeTab (
-     BinaryTree<HuffData> tree,
-     BitString code) {
-   HuffData datum = tree.getData();
-   if (datum.symbol != null) {
-     codeMap.put(datum.symbol, code);
-   } else {
-     BitString l = (BitString)code.clone();
-     l.append(false);
-     buildCodeTab(tree.left() , l);
-     BitString r = (BitString)code.clone();
-     r.append(true);
-     buildCodeTab(tree.right(), r);
-   }
- }
78Applying Maps Huffman Coding (6)
- public void encode (BufferedReader ins,
-     ObjectOutputStream outs) {
-   BitString res = new BitString();
-   try {
-     int next;
-     while ((next = ins.read()) != -1) {
-       Character nxt = (char) next;
-       BitString nextChunk =
-         codeMap.get(nxt);
-       res.append(nextChunk);
-     }
-     res.trimCapacity(); ins.close();
-     outs.writeObject(res); outs.close(); ...