Chap11. Hashing - PowerPoint PPT Presentation

Slides: 41
Provided by: dbKon
1
Chap11. Hashing
File Structures by Folk, Zoellick and Riccardi

2
Chapter Objectives
  • Introduce the concept of hashing
  • Examine the problem of choosing a good hashing
    algorithm, present a reasonable one in detail,
    and describe some others
  • Explore several approaches for reducing
    collisions and storage of several records per
    address
  • Develop and use mathematical tools for analyzing
    performance differences resulting from the use of
    different hashing techniques
  • Examine problems associated with file
    deterioration (record deletions) and discuss some
    solutions
  • Discuss collision resolution techniques
  • Examine effects of patterns of record access on
    performance

3
Contents
  • 11.1 Introduction
  • 11.2 A Simple Hashing Algorithm
  • 11.3 Hashing Functions and Record Distribution
  • 11.4 How Much Extra Memory Should Be Used?
  • 11.5 Collision Resolution by Progressive Overflow
  • 11.6 Storing More Than One Record per Address:
    Buckets
  • 11.7 Making Deletions
  • 11.8 Other Collision Resolution Techniques
  • 11.9 Patterns of Record Access

4
Overview
  • O(1) access to files
  • Record number is obtained by a hashing function H
    applied to the primary key, H(key)
  • Record numbers generated should be uniformly and
    randomly distributed such that 0 ≤ H(key) < N
  • A hash function is like a black box that produces
    an address every time you drop in a key
  • All parts of the key should be used by the
    hashing function H so that a lot of records with
    similar keys do not all hash to the same location
  • Given two random keys X, Y and N slots, the
    probability that H(X) = H(Y) is 1/N; in this case, X and
    Y are called synonyms and a collision occurs

5
Introduction
11.1 Introduction
  • Hash function h(K)
  • Transforms a key K into an address
  • Hash vs. other indexes
  • Sequential search: O(N)
  • Binary search: O(log_2 N)
  • B(B+) tree index: O(log_k N)
  • where k is the number of records in an index node
  • Hash: O(1)

6
A Simple Hashing Scheme (1/2)
11.1 Introduction
[Figure: the key LOWELL is fed to the hash function, which produces address 4, LOWELL's home address, where the record is stored]
7
A Simple Hashing Scheme (2/2)
11.1 Introduction
Name     ASCII code for first two letters   Product            Home address
BALL     66 65                              66 × 65 = 4,290    290
LOWELL   76 79                              76 × 79 = 6,004    004
TREE     84 82                              84 × 82 = 6,888    888
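The scheme in the table above can be sketched in a few lines of Python. This is only an illustration; the function name `simple_hash` is an assumption, not something from the slides. It multiplies the ASCII codes of the first two letters and keeps the last three digits of the product as the home address.

```python
def simple_hash(key: str) -> int:
    """Multiply the ASCII codes of the first two letters; keep the last 3 digits."""
    product = ord(key[0]) * ord(key[1])
    return product % 1000  # home address in the range 0..999


for name in ("BALL", "LOWELL", "TREE"):
    # Print the address zero-padded to three digits, as in the table.
    print(name, f"{simple_hash(name):03d}")
```

Because only the first two letters are used, any two names that share them become synonyms, which is one reason later slides use every part of the key.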
8
Hashing differs from indexing
  • With hashing, the addresses generated appear to
    be random
  • No obvious connection between the key and the
    location of the corresponding record
  • So, hashing is sometimes referred to as
    randomizing
  • With hashing, two different keys may be
    translated to the same address
  • Two records may be sent to the same place in the
    file

9
Idea behind Hash-based Files
11.1 Introduction
  • Record with hash key i is stored in node i
  • All records with hash key h are stored in node h
  • Primary blocks of data level nodes are stored
    sequentially
  • Contents of the root node can be expressed by a
    simple function: the address of the data level node
    for the record with primary key k
  • = address of node 0 + H(k)
  • In literature on hash-based files, primary blocks
    of data level nodes are called buckets

10
e.g. Hash-based File
11.1 Introduction
11
Collision (1/2)
11.1 Introduction
  • Collision
  • Situation in which a record is hashed to an
    address that does not have sufficient room to
    store the record
  • A perfect hashing algorithm (one that produces no
    collisions) is nearly impossible to find in practice
  • Different key, same hash value
  • (Different record, same address)

12
Collision (2/2)
11.1 Introduction
  • Solutions
  • Spread out the records
  • Find a hashing algorithm that distributes records
    more randomly
  • Use extra memory
  • It is easier to find a hash algorithm that avoids
    collisions if we have only a few records to distribute
    among many addresses
  • Put more than one record at a single address

13
A Simple Hashing Algorithm (1/3)
11.2 A Simple Hashing Algorithm
  • Step 1. Represent the key in numerical form
  • If the key is a string, take the ASCII code of each
    character
  • e.g. LOWELL (padded with 6 blanks to 12 characters)
  • L  O  W  E  L  L  (6 blanks)
  • 76 79 87 69 76 76 32 32 32 32 32 32
  • If the key is a number, nothing needs to be done

14
A Simple Hashing Algorithm (2/3)
11.2 A Simple Hashing Algorithm
  • Step 2. Fold and Add
  • Fold
  • 76 79 | 87 69 | 76 76 | 32 32 | 32 32 | 32 32
  • Add the parts into one integer
  • Suppose we use a 15-bit integer representation; 32767
    is the limit
  • 7679 + 8769 + 7676 + 3232 + 3232 + 3232 = 33820 > 32767
    (overflow!)
  • Largest addend: 9090 ("ZZ")
  • Largest allowable result: 32767 - 9090 = 23677 ->
    19937 (a prime) is chosen as the limit
  • Ensure no intermediate sum exceeds the limit by using mod
  • 7679 + 8769 = 16448; 16448 mod 19937 = 16448
  • 16448 + 7676 = 24124; 24124 mod 19937 = 4187
  • 4187 + 3232 = 7419; 7419 mod 19937 = 7419
  • 7419 + 3232 = 10651; 10651 mod 19937 = 10651
  • 10651 + 3232 = 13883; 13883 mod 19937 = 13883

15
A Simple Hashing Algorithm (3/3)
11.2 A Simple Hashing Algorithm
  • Step 3. Divide by size of the address space
  • a = s mod n
  • a: the home address
  • s: the sum produced in step 2
  • n: the number of addresses in the file
  • e.g. a = 13883 mod 100 = 83
  • A prime number is usually used for the divisor
    because primes tend to distribute remainders much
    more uniformly than nonprimes do
  • So, we choose a prime number as close as possible
    to the desired size of the address space
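The three steps can be put together as a minimal Python sketch. The function names `fold_sum` and `home_address` and the 12-character key width are assumptions made for illustration. The sketch also assumes keys contain only blanks and uppercase letters, so every ASCII code has two digits.

```python
MAX_INTERMEDIATE = 19937  # prime limit for intermediate sums (from step 2)


def fold_sum(key: str, width: int = 12) -> int:
    """Steps 1 and 2: convert to ASCII codes, fold into pairs, add with mod."""
    padded = key.ljust(width)[:width]           # pad with blanks (ASCII 32)
    codes = [ord(ch) for ch in padded]
    total = 0
    for i in range(0, width, 2):                # width is assumed to be even
        pair = codes[i] * 100 + codes[i + 1]    # e.g. 76 and 79 become 7679
        total = (total + pair) % MAX_INTERMEDIATE
    return total


def home_address(key: str, n_addresses: int) -> int:
    """Step 3: divide by the size of the address space and keep the remainder."""
    return fold_sum(key) % n_addresses


print(fold_sum("LOWELL"))           # sum produced in step 2
print(home_address("LOWELL", 100))  # home address for n = 100
print(home_address("LOWELL", 101))  # same key with a prime divisor
```

For the key LOWELL the folded sum and the address for n = 100 agree with the worked values on the slides (13883 and 83).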

16
Hashing Functions and Record Distributions
11.3 Hashing Functions and Record Distributions
  • Distributing records among addresses
[Figure: example distributions of records among addresses; the one surviving label reads "Uniform distribution"]
17
Some other hashing methods
11.3 Hashing Functions and Record Distributions
  • Better-than-random
  • Examine keys for a pattern
  • Fold parts of the key
  • Divide the key by a number
  • When the better-than-random methods do not work -
    randomize!
  • Square the key and take the middle digits
  • Radix transformation
  • 453 (base 10) -> 382 (base 11)
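The two randomizing methods named above can be sketched as follows. The function names are assumptions, and so are two details the slide does not specify: how the "middle" is chosen when it is not exactly centered, and how a base-11 digit of 10 is handled (here it simply carries into the next decimal position).

```python
def mid_square(key: int, n_digits: int) -> int:
    """Square the key and return n_digits taken from the middle of the result."""
    squared = str(key * key)
    start = (len(squared) - n_digits) // 2
    return int(squared[start:start + n_digits])


def radix_transform(key: int, base: int) -> int:
    """Rewrite key in the given base and read the digits back as a decimal number."""
    result, place = 0, 1
    while key > 0:
        key, digit = divmod(key, base)
        result += digit * place   # a digit of 10 or more carries upward (assumption)
        place *= 10
    return result


print(mid_square(453, 2))        # 453 squared is 205209; the middle two digits
print(radix_transform(453, 11))  # 453 in base 11
```

In either case the result would still be reduced modulo the number of addresses to obtain a home address.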

18
How Much Extra Memory Should Be Used?
11.4 How Much Extra Memory Should Be Used?
  • The more records are packed, the more likely a
    collision will occur

19
11.4 How Much Extra Memory Should Be Used?
Poisson Distribution
p(x) = ((r/N)^x × e^(-r/N)) / x!
p(x): the probability that a given address will have x records assigned to it after the hashing function has been applied to all r records
N: the number of available addresses
r: the number of records to be stored
x: the number of records assigned to a given address
20
Predicting Collisions for Different Packing
Densities
11.4 How Much Extra Memory Should Be Used?
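  • Number of addresses with no record assigned: N × p(0)
  • Number of addresses with one record assigned: N × p(1)
  • Number of addresses with two or more records assigned:
  • N × [p(2) + p(3) + p(4) + ...]
  • Number of overflow records: 1 × N p(2) + 2 × N p(3) + ...
  • Percentage of overflow records: (number of overflow
    records / r) × 100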
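The predictions listed above can be evaluated numerically. The sketch below assumes the Poisson model with mean r/N; the function names and the cutoff `max_x` (where the infinite sums are truncated) are illustrative choices, not part of the slides.

```python
import math


def poisson(x: int, r: int, n_addresses: int) -> float:
    """p(x): probability that a given address is assigned exactly x records."""
    mean = r / n_addresses
    return mean ** x * math.exp(-mean) / math.factorial(x)


def predict_collisions(r: int, n_addresses: int, max_x: int = 50) -> dict:
    """Expected address counts and overflow for r records in n_addresses slots."""
    p = [poisson(x, r, n_addresses) for x in range(max_x + 1)]
    # Each address holds one record, so x records at an address overflow x - 1.
    overflow = n_addresses * sum((x - 1) * p[x] for x in range(2, max_x + 1))
    return {
        "empty": n_addresses * p[0],
        "one_record": n_addresses * p[1],
        "two_or_more": n_addresses * sum(p[2:]),
        "overflow_records": overflow,
        "overflow_pct": 100 * overflow / r,
    }


for r in (5000, 10000):
    result = predict_collisions(r, 10000)
    print(f"packing density {r / 10000:.0%}: "
          f"{result['overflow_pct']:.1f}% of records overflow")
```

Halving the packing density from 100% to 50% lowers the predicted share of overflow records from roughly 37% to roughly 21%, which is the trade-off between extra memory and collisions.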

21
The larger the address space, the fewer the overflows
11.4 How Much Extra Memory Should Be Used?
Packing density = r/N (N addresses, r records)
22
Collision Resolution by Progressive Overflow
11.5 Collision Resolution by Progressive Overflow
  • Progressive overflow (= linear probing)
  • Inserting a new record
  • 1. Use the home address if it is empty
  • 2. Otherwise, search the following addresses
    in sequence until an empty one is found
  • 3. If the end of the address space is reached, wrap
    around to the first address

23
11.5 Collision Resolution by Progressive Overflow
Progressive Overflow (1/5)
24
11.5 Collision Resolution by Progressive Overflow
Progressive Overflow (2/5)
25
Progressive Overflow (3/5)
11.5 Collision Resolution by Progressive Overflow
  • Searching for a record whose hash function value is k
  • Starting from home address k, look at successive
    records until
  • the record is found, or
  • an open address is encountered
  • Worst case
  • The record does not exist and the file is
    full
  • The reason to avoid overflow
  • Extra searches are needed when a record is not
    found at its home address

26
Progressive Overflow (4/5)
11.5 Collision Resolution by Progressive Overflow
  • Search length: the number of accesses required to
    retrieve a record (from secondary memory)

27
Progressive Overflow (5/5)
11.5 Collision Resolution by Progressive Overflow
  • With a perfect hashing function, the average search
    length is 1
  • An average search length no greater than 2.0 is
    generally considered acceptable
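Insertion, search, and average search length under progressive overflow can be sketched as below. The keys, their home addresses, and the file size of 23 are hypothetical values chosen so that the probe sequence wraps around; they are not taken from the slides.

```python
def insert(table, key, home):
    """Store key at its home address, or at the next free slot, wrapping around."""
    n = len(table)
    for step in range(n):
        slot = (home + step) % n
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("file is full")


def search(table, key, home):
    """Return (slot, accesses); slot is None when the key is absent."""
    n = len(table)
    for step in range(n):
        slot = (home + step) % n
        if table[slot] is None:      # open address: the key cannot be further on
            return None, step + 1
        if table[slot] == key:
            return slot, step + 1
    return None, n                   # worst case: full file and key absent


# Hypothetical keys and home addresses in a file of 23 addresses.
homes = {"Adams": 20, "Bates": 21, "Cole": 21, "Dean": 22, "Evans": 20}
table = [None] * 23
for key, home in homes.items():
    insert(table, key, home)

lengths = [search(table, key, home)[1] for key, home in homes.items()]
average = sum(lengths) / len(lengths)
print(lengths, average)
```

In this run two records sit at their home addresses, while the last one needs five accesses, so the average search length is above the 2.0 guideline even though the file is mostly empty. Clustering around a few home addresses is what drives the cost.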

28
Storing More Than One Record per Address: Buckets
11.6 Storing More Than One Record per Address: Buckets
  • Bucket: a block of records sharing the same
    address (on a block-addressing disk)

29
Effects of Buckets on Performance
11.6 Storing More Than One Record per Address: Buckets
  • Number of overflow records:
  • N × [1 × p(b+1) + 2 × p(b+2) + 3 × p(b+3) + ...]
  • N: the number of addresses
  • b: the number of records that fit in a bucket
  • bN: the number of available locations for records
  • Packing density = r/(bN)
  • As the bucket size gets larger, performance
    continues to improve
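The bucket overflow formula can be checked numerically with a short sketch. The function name, the truncation point `max_x`, and the sample figures (750 records stored at a packing density of 75%) are assumptions chosen for illustration.

```python
import math


def bucket_overflow_pct(r: int, n_addresses: int, b: int, max_x: int = 100) -> float:
    """Percentage of r records that overflow buckets of size b (Poisson model)."""
    mean = r / n_addresses

    def p(x: int) -> float:
        return mean ** x * math.exp(-mean) / math.factorial(x)

    # An address that receives x > b records overflows x - b of them.
    overflow = n_addresses * sum((x - b) * p(x) for x in range(b + 1, max_x + 1))
    return 100 * overflow / r


# Same 750 records and the same packing density r/(bN) = 0.75 in both cases.
print(f"b = 1, N = 1000: {bucket_overflow_pct(750, 1000, 1):.1f}% overflow")
print(f"b = 2, N = 500:  {bucket_overflow_pct(750, 500, 2):.1f}% overflow")
```

With the amount of storage held constant, grouping the space into buckets of two lowers the predicted overflow from about 30% to about 19% under this model.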

30
Bucket Implementation
11.6 Storing More Than One Record per Address: Buckets
31
Bucket Implementation (Cont'd)
11.6 Storing More Than One Record per Address: Buckets
  • Initializing and Loading
  • Creating empty space
  • Use hash values and find the bucket to store
  • If the home bucket is full, continue to look at
    successive buckets
  • Problems when
  • No empty space exists
  • Duplicate keys occur

32
Making Deletions
11.7 Making Deletions
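  • The slot freed by a deletion can disrupt later
    searches: a search stops at the empty slot and
    misses records stored beyond it
  • Mark deleted slots with tombstones; tombstoned
    slots can be reused by later insertions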
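A minimal sketch of tombstones with progressive overflow follows. The sentinel names and helper functions are assumptions, and the insert routine assumes the key is not already in the file (it does not check for duplicates further along the probe sequence).

```python
EMPTY = None
TOMBSTONE = object()   # marks a slot whose record was deleted


def probe(home, n):
    """Yield the slots visited from the home address, wrapping around once."""
    return ((home + i) % n for i in range(n))


def insert(table, key, home):
    for slot in probe(home, len(table)):
        if table[slot] is EMPTY or table[slot] is TOMBSTONE:  # reuse freed slots
            table[slot] = key
            return slot
    raise RuntimeError("file is full")


def search(table, key, home):
    for slot in probe(home, len(table)):
        if table[slot] is EMPTY:          # a truly empty slot ends the search
            return None
        if table[slot] is not TOMBSTONE and table[slot] == key:
            return slot                   # tombstones are skipped, not matched
    return None


def delete(table, key, home):
    slot = search(table, key, home)
    if slot is None:
        return False
    table[slot] = TOMBSTONE               # do not mark the slot as empty
    return True


table = [EMPTY] * 5
for key in ("A", "B", "C"):               # three synonyms with home address 1
    insert(table, key, 1)
delete(table, "B", 1)
print(search(table, "C", 1))              # still reachable past the tombstone
print(insert(table, "D", 2))              # the freed slot is reused
```

If the deleted slot were simply marked empty, the search for "C" would stop at slot 2 and report the record as missing.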

33
Other Collision Resolution Techniques
11.8 Other Collision Resolution Techniques
  • Double hashing: avoid clustering by using a second
    hash function for overflow records
  • Chained progressive overflow: each home address
    contains a pointer to the next record with the same
    home address
  • Chaining with a separate overflow area: move all
    overflow records to a separate overflow area
  • Scatter tables: the hash file contains only pointers
    to records (like indexing)

34
Linear Probing (1/2)
11.8 Other Collision Resolution Techniques
  • When a synonym is identified, search forward from
    the address given by the hash function (the
    natural address) until an empty slot is located,
    and store this record there
  • This is an example of open addressing (examining
    a predictable sequence of slots for an empty one)

35
Linear Probing (2/2)
11.8 Other Collision Resolution Techniques
Insertion sequence (memory space: slots 0 to 18)
key:   A  S  E  A  R  C  H  I  N  G  E  X  A  M  P  L  E
hash:  1  0  5  1 18  3  8  9 14  7  5  5  1 13 16 12  5
[Figure: contents of memory slots 0 to 18 after each insertion under linear probing]
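The insertion sequence on this slide can be replayed with a short sketch. The hash function below (position of the letter in the alphabet, modulo 19) is an assumption inferred from the hash row of the slide, since the slide does not state it; it does reproduce every listed hash value.

```python
M = 19  # number of memory slots (0..18)


def h(key: str) -> int:
    """Assumed hash: position of the letter in the alphabet, modulo M."""
    return (ord(key) - ord("A") + 1) % M


def build(keys: str) -> list:
    """Insert each key with linear probing and return the final table."""
    table = [None] * M
    for key in keys:
        slot = h(key)
        while table[slot] is not None:   # occupied: move to the next slot
            slot = (slot + 1) % M        # wrap around at the end of the table
        table[slot] = key
    return table


layout = build("ASEARCHINGEXAMPLE")
print("".join(key or "." for key in layout))   # "." marks an empty slot
```

The printed layout shows the clustering that linear probing produces: the keys hashing to 5 (E, E, X, E) spill into slots 6, 10 and 11, lengthening the probe sequences of their neighbours.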
36
Rehashing (1/2)
11.8 Other Collision Resolution Techniques
  • In linear probing, when a synonym occurs, the
    address r is incremented by 1 and the next location
    is searched
  • In rehashing, a second hash function supplies the
    displacement
  • This method has the advantage of avoiding
    congestion, because each synonym under the first
    hash function likely uses a different
    displacement D, and thus examines a different
    sequence of slots
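Rehashing can be sketched as follows. Both hash functions here are hypothetical: `h1` is the letter-position hash modulo 19, and `h2` is one possible displacement function, not the one used on the slides. Because the table size is prime, any displacement from 1 to 18 eventually visits every slot.

```python
M = 19  # table size; a prime, so every displacement cycles through all slots


def code(key: str) -> int:
    return ord(key) - ord("A") + 1       # A = 1, B = 2, ...


def h1(key: str) -> int:
    return code(key) % M                 # home address (assumed)


def h2(key: str) -> int:
    return 1 + code(key) % (M - 2)       # displacement D in 1..17 (hypothetical)


def insert(table, key):
    slot, step = h1(key), h2(key)
    for _ in range(M):
        if table[slot] is None:
            table[slot] = key
            return slot
        slot = (slot + step) % M         # jump by D instead of by 1
    raise RuntimeError("table is full")


def search(table, key):
    slot, step = h1(key), h2(key)
    for _ in range(M):
        if table[slot] is None:
            return None
        if table[slot] == key:
            return slot
        slot = (slot + step) % M
    return None


table = [None] * M
for key in "ASEARCHINGEXAMPLE":
    insert(table, key)
print("".join(key or "." for key in table))
print(h1("E"), h1("X"), h2("E"), h2("X"))  # same home address, different D
```

E and X are synonyms under `h1`, but their displacements differ, so after the first collision they follow different probe sequences instead of piling up in adjacent slots.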

37
Rehashing (where P = 3) (2/2)
11.8 Other Collision Resolution Techniques
Insertion sequence (memory space: slots 0 to 18)
key:   A  S  E  A  R  C  H  I  N  G  E  X  A  M  P  L  E
hash:  1  0  5  1 18  3  8  9 14  7  5  5  1 13 16 12  5
[Figure: contents of memory slots 0 to 18 after each insertion under rehashing]
38
Chained progressive overflow
39
Overflow File
11.8 Other Collision Resolution Techniques
  • When building the file, if a collision occurs,
    place the new synonym into a separate area of the
    file called the overflow section

40
scatter table
  • an index that is searched by hashing
  • the search of the index requires only one access
  • a set of linked lists of synonyms
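A scatter table can be sketched as a hashed array of pointers into a separate record store, with synonyms linked into a list. The class name, the tuple layout, and the placeholder hash (sum of the key's bytes) are assumptions for illustration only.

```python
class ScatterTable:
    """Hashed index of pointers; records live in a separate list (the 'file')."""

    def __init__(self, n_addresses: int):
        self.heads = [None] * n_addresses   # index: one pointer per address
        self.records = []                   # entries are (key, value, next_index)

    def _home(self, key: str) -> int:
        return sum(key.encode()) % len(self.heads)   # placeholder hash function

    def insert(self, key: str, value) -> None:
        home = self._home(key)
        # The new record links to the previous head of its synonym list.
        self.records.append((key, value, self.heads[home]))
        self.heads[home] = len(self.records) - 1

    def search(self, key: str):
        index = self.heads[self._home(key)]          # one access to the index
        while index is not None:                     # then follow the chain
            stored_key, value, next_index = self.records[index]
            if stored_key == key:
                return value
            index = next_index
        return None


table = ScatterTable(7)
table.insert("ab", "first record")
table.insert("ba", "second record")   # synonym of "ab" under the placeholder hash
print(table.search("ab"), "|", table.search("ba"), "|", table.search("zz"))
```

Since the index holds only pointers, the records themselves can be stored in any order, for example in the order they arrive.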