Hashing The Magic Container - PowerPoint PPT Presentation

Provided by: dennis72
Learn more at: https://ics.uci.edu
1
Hashing: The Magic Container
2
Interface
  • Main methods
  • void put(Object)
  • Object get(Object), returns null if not in the table
  • remove(Object)
  • Goal: methods are O(1)! (usually)
  • Implementation details
  • HashTable: the storage bin
  • hashFunction(object): tells where the object should go
  • collision resolution strategy: what to do when two
    objects hash to the same location
  • In Java, all objects have a default int hashCode(),
    but it is better to define your own, except for
    strings: String hashing in Java is good.

3
Hash Functions
  • Goal: map objects into the table so the distribution
    is uniform
  • Tricky to do.
  • Examples for a string s
  • product of ASCII codes, then mod tableSize
  • nearly always even, so bad
  • sum of ASCII codes, then mod tableSize
  • may be too small
  • shift bits in the ASCII codes
  • Java allows this with << and >>
  • Java does a good job with Strings
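The sum and shift approaches above can be sketched as follows; the class name, the choice of tableSize = 101, and the 31x + c shift recurrence (the one Java's String.hashCode() uses) are illustrative, not from the slides.

```java
// Sketch of the string hash functions discussed above (tableSize assumed prime).
public class StringHashes {
    // Sum of character codes, then mod tableSize: legal but distributes
    // poorly when the table is large and strings are short.
    static int sumHash(String s, int tableSize) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) h += s.charAt(i);
        return h % tableSize;
    }

    // Shift-based hash in the style of Java's String.hashCode():
    // h = h * 31 + c, computed as (h << 5) - h + c.
    static int shiftHash(String s, int tableSize) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) h = (h << 5) - h + s.charAt(i);
        return Math.floorMod(h, tableSize); // h may overflow to negative
    }

    public static void main(String[] args) {
        System.out.println(sumHash("cat", 101));   // 9
        System.out.println(shiftHash("cat", 101)); // 90
    }
}
```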

4
Example Problem
  • Suppose we are storing numeric ids of customers,
    maybe 100,000 of them
  • We want to check if a person is delinquent; usually
    there are fewer than 400
  • Use an array of size 1000 for the delinquents.
  • Put each id in at id mod tableSize.
  • Clearly fast for getting and removing
  • But what happens if entries collide?
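A minimal sketch of the scheme above; the class name and the two sample ids are made up to show a collision.

```java
// Toy delinquent table from the example: ids hashed by id % tableSize.
// Two ids that agree in their last three digits land in the same cell,
// which is exactly the collision problem the next slides address.
public class DelinquentTable {
    static final int TABLE_SIZE = 1000;

    static int hash(int id) {
        return id % TABLE_SIZE;
    }

    public static void main(String[] args) {
        System.out.println(hash(31234)); // 234
        System.out.println(hash(62234)); // 234 -- collision!
    }
}
```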

5
Separate Chaining
  • Array of linked lists
  • The hash function determines which list to search
  • May or may not keep individual lists in sorted order
  • Problems
  • needs a very good hash function, which may not exist
  • worst case O(n)
  • extra space for links
  • Another approach: Open Addressing
  • everything goes into the array, somehow
  • several approaches: linear, quadratic, double
    hashing, rehashing
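A minimal separate-chaining table along the lines above; the class layout and integer keys are illustrative assumptions.

```java
import java.util.LinkedList;

// Separate chaining (sketch): an array of linked lists, with the hash
// function choosing which list to search.
public class ChainedHashTable {
    private final LinkedList<Integer>[] table;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int size) {
        table = new LinkedList[size];
        for (int i = 0; i < size; i++) table[i] = new LinkedList<>();
    }

    private int hash(int key) {
        return Math.floorMod(key, table.length);
    }

    void put(int key) {
        if (!table[hash(key)].contains(key)) table[hash(key)].add(key);
    }

    boolean contains(int key) {
        // worst case O(n): a bad hash function can put everything in one list
        return table[hash(key)].contains(key);
    }

    void remove(int key) {
        table[hash(key)].remove((Integer) key);
    }
}
```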

6
Linear Probing
  • Store information (or ptrs to objects) in an array
  • Linear Probing
  • When inserting an object, if the location is filled,
    find the first unfilled position, i.e. look at
    h_i(x) = (hash(x) + f(i)) mod TableSize, where f(i) = i
  • When getting an object, start at the hash address
    and do a linear search until you find the object or
    a hole.
  • primary clustering: blocks of filled cells occur
  • Harder to insert than to find an existing element
  • Load factor lf: the fraction of the array filled
  • Expected probes for
  • insertion: 1/2 (1 + 1/(1 - lf)^2)
  • successful search: 1/2 (1 + 1/(1 - lf))
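The insert-and-scan behavior above can be sketched as follows (class name and integer keys are illustrative; the sketch assumes the table never fills completely):

```java
// Linear probing (sketch): on a collision, scan forward to the first empty
// slot; lookups scan from the hash address until the key or a hole appears.
public class LinearProbeTable {
    private final Integer[] table;

    LinearProbeTable(int size) { table = new Integer[size]; }

    private int hash(int key) { return Math.floorMod(key, table.length); }

    // Assumes the table is not full, otherwise this loops forever.
    void put(int key) {
        int i = hash(key);
        while (table[i] != null && table[i] != key) {
            i = (i + 1) % table.length; // f(i) = i: try the next cell
        }
        table[i] = key;
    }

    boolean contains(int key) {
        int i = hash(key);
        while (table[i] != null) { // a hole ends the search
            if (table[i] == key) return true;
            i = (i + 1) % table.length;
        }
        return false;
    }
}
```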

7
Expected number of probes
8
Quadratic Probing
  • Idea: f(i) = i^2 (or some other quadratic function)
  • Problem: if the table is more than 1/2 full, there
    is no guarantee of finding any space!
  • Theorem: if the table is less than 1/2 full, and the
    table size is prime, then an element can be
    inserted.
  • Good: quadratic probing eliminates primary
    clustering
  • Quadratic probing has secondary clustering (minor)
  • elements that hash to the same address have the same
    probe sequence
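A sketch of the probe sequence h_i(x) = (hash(x) + i^2) mod P; the class name and the small example values are assumptions for illustration.

```java
// Quadratic probing (sketch): the i-th probe lands at (hash + i*i) mod P.
// With P prime, the theorem says the first P/2 probes are all distinct.
public class QuadraticProbe {
    static int[] probeSequence(int hash, int tableSize, int count) {
        int[] seq = new int[count];
        for (int i = 0; i < count; i++) {
            seq[i] = (hash + i * i) % tableSize;
        }
        return seq;
    }

    public static void main(String[] args) {
        // P = 11: five probes from hash address 3 hit five distinct cells.
        System.out.println(
            java.util.Arrays.toString(probeSequence(3, 11, 5))); // [3, 4, 7, 1, 8]
    }
}
```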

9
Proof of theorem
  • Theorem: the first P/2 probes are distinct.
  • Suppose not.
  • Then there are i and j, both < P/2, that probe to
    the same place
  • So h(x) + i^2 = h(y) + j^2 (mod P), and h(x) = h(y).
  • So i^2 = j^2 (mod P)
  • (i + j)(i - j) = 0 (mod P)
  • Since P is prime and i and j are less than P/2,
  • both i + j and i - j are nonzero and less than P,
    yet P would have to divide one of these factors.
  • Contradiction

10
Double Hashing
  • Goal: spread out the probe sequence
  • f(i) = i * hash2(x), where hash2 is another hash
    function
  • Dangerous: can be very bad (e.g. if hash2(x) is ever
    0, the probe never moves)
  • Also may not eliminate any problems
  • In the best case, it's great
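A sketch of the probe function above. The particular second hash, hash2(x) = R - (x mod R) with R a prime smaller than the table size, is a common textbook choice (it is never 0), not something the slide specifies.

```java
// Double hashing (sketch): the i-th probe is (hash(x) + i * hash2(x)) mod P.
public class DoubleHash {
    static int probe(int key, int i, int tableSize) {
        int r = 7; // prime < tableSize; assumed here with tableSize = 11
        int hash2 = r - Math.floorMod(key, r); // in 1..r, never zero
        return Math.floorMod(key + i * hash2, tableSize);
    }

    public static void main(String[] args) {
        // key 23: hash2 = 7 - 2 = 5, so the probes step by 5 around the table.
        for (int i = 0; i < 4; i++) {
            System.out.print(probe(23, i, 11) + " "); // 1 6 0 5
        }
    }
}
```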

11
Rehashing
  • All methods degrade when the table becomes too full
  • Simplest solution
  • create a new table, twice as large
  • rehash everything
  • O(N), so unhappy if done often
  • With quadratic probing, rehash when the table is 1/2
    full
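The grow-and-reinsert step can be sketched as below; the class name, the initial size of 7, and the use of linear probing inside it are illustrative assumptions (the slide's rule applies to quadratic probing as well).

```java
// Rehashing (sketch): when the table would pass half full, allocate a table
// twice as large and reinsert every element -- an O(N) operation.
public class RehashingTable {
    private Integer[] table = new Integer[7];
    private int count = 0;

    private static int hash(int key, int size) { return Math.floorMod(key, size); }

    void put(int key) {
        if (2 * (count + 1) > table.length) rehash(); // keep table under 1/2 full
        insert(table, key);
        count++;
    }

    private static void insert(Integer[] t, int key) {
        int i = hash(key, t.length);
        while (t[i] != null) i = (i + 1) % t.length; // linear probing for brevity
        t[i] = key;
    }

    private void rehash() {
        Integer[] bigger = new Integer[2 * table.length];
        for (Integer k : table) if (k != null) insert(bigger, k);
        table = bigger;
    }

    int capacity() { return table.length; }
}
```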

12
Extendible Hashing: uses secondary storage
  • Suppose the data does not fit in main memory
  • Goal: reduce the number of disk accesses.
  • Suppose there are N records to store and M records
    fit in a disk block
  • Result: 2 disk accesses for a find (4 for an insert)
  • Let D be the max number of bits so that 2^D < M.
  • This is for the root or directory (one disk block)
  • Algo
  • hash on the first D bits; this yields a ptr to a
    disk block
  • Expected number of leaves: (N/M) log_2 e
  • Expected directory size: O(N^(1 + 1/M) / M)
  • Theoretically difficult; more details needed for
    implementation

13
Applications
  • Compilers: keep track of variables and scope
  • Graph theory: associate an id with a name (general)
  • Game playing: e.g. in chess, keep track of positions
    already considered and evaluated (which may be
    expensive)
  • Spelling checker: at least to check that a word is
    right.
  • But how to suggest the correct word?
  • Lexicon/book indices

14
HashSets vs HashMaps
  • HashSets store objects
  • supports adding and removing in constant time
  • HashMaps store a pair (key,object)
  • this is an implementation of a Map
  • HashMaps are more useful and standard
  • HashMap's main methods are
  • put(Object key, Object value)
  • get(Object key)
  • remove(Object key)
  • All done in expected O(1) time.
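The three methods above, as used from java.util.HashMap (the demo keys and values are made up; generics shown rather than the raw Object signatures):

```java
import java.util.HashMap;

// The HashMap operations listed above: put, get, and remove,
// each running in expected O(1) time.
public class MapDemo {
    public static void main(String[] args) {
        HashMap<String, Integer> ages = new HashMap<>();
        ages.put("alice", 30);                 // put(key, value)
        ages.put("bob", 25);
        System.out.println(ages.get("alice")); // 30
        ages.remove("bob");                    // remove(key)
        System.out.println(ages.get("bob"));   // null -- not in the map
    }
}
```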

15
Lexicon Example
  • Inputs: a text file (N words) and a content word
    file (the keys, M words)
  • Output: content words in order, with page numbers
  • Algo
  • Define entry = (content word, linked list of
    integers)
  • Initially, the list is empty for each word.
  • Step 1: Read the content word file and make a
    HashMap of (content word, empty list)
  • Step 2: Read the text file and check if each word is
    in the HashMap
  • if it is, add the page number to its list; else
    continue.
  • Step 3: Use the iterator method to walk through the
    HashMap and put it into a sortable container.
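The three steps above can be sketched as follows. File reading is omitted; the `contentWords` list and the (word, page) pairs in `text` are stand-ins for the two input files, and a TreeMap plays the role of the sortable container.

```java
import java.util.*;

// Sketch of the three-step lexicon algorithm described above.
public class Lexicon {
    static TreeMap<String, List<Integer>> buildLexicon(List<String> contentWords,
                                                       String[][] text) {
        // Step 1: map each content word to an empty page list -- O(M)
        HashMap<String, List<Integer>> map = new HashMap<>();
        for (String w : contentWords) map.put(w, new ArrayList<>());

        // Step 2: scan the text; record the page of each content word -- O(N)
        for (String[] occ : text) {
            List<Integer> pages = map.get(occ[0]);
            if (pages != null) pages.add(Integer.parseInt(occ[1]));
        }

        // Step 3: move the entries into a sorted container -- O(M log M)
        return new TreeMap<>(map);
    }

    public static void main(String[] args) {
        String[][] text = {{"hash", "1"}, {"table", "1"}, {"probe", "2"}, {"hash", "3"}};
        buildLexicon(List.of("hash", "probe"), text)
            .forEach((w, pages) -> System.out.println(w + " " + pages));
    }
}
```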

16
Lexicon Example
  • Complexity
  • step 1: O(M), M = number of content words
  • step 2: O(N), N = word file size
  • step 3: O(M log M) max.
  • So O(max(N, M log M))
  • Dumb Algorithm
  • Sort content words: O(M log M) (balanced tree)
  • Look up each word in the content word tree and
    update: O(N log M)
  • Total complexity: O(N log M)
  • N = 500 x 2000 = 1,000,000 and M = 1000
  • Smart algo: 1,000,000; dumb algo: 1,000,000 x 10.

17
Memoization
  • Recursive Fibonacci:
  • fib(n): if (n < 2) return 1
  •   else return fib(n-1) + fib(n-2)
  • Use hashing to store intermediate results:
  • Hashtable ht
  • fib(n): Entry e = (Entry) ht.get(n)
  •   if (e != null) return e.answer
  •   else if (n < 2) return 1
  •   else ans = fib(n-1) + fib(n-2)
  •     ht.put(n, ans)
  •     return ans
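A runnable version of the sketch above; a HashMap stands in for the slide's Hashtable-and-Entry bookkeeping, and the class name is made up.

```java
import java.util.HashMap;

// Memoized Fibonacci: cache each intermediate result so every fib(k)
// is computed only once, turning exponential time into linear.
public class Memo {
    private static final HashMap<Integer, Long> ht = new HashMap<>();

    static long fib(int n) {
        Long cached = ht.get(n);      // intermediate result already stored?
        if (cached != null) return cached;
        long ans = (n < 2) ? 1 : fib(n - 1) + fib(n - 2);
        ht.put(n, ans);
        return ans;
    }

    public static void main(String[] args) {
        System.out.println(fib(40)); // 165580141, instant with memoization
    }
}
```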