Hash-Based Indexes - PowerPoint PPT Presentation

Title: Hash-Based Indexes. Subject: Database Management Systems. Authors: Raghu Ramakrishnan and Johannes Gehrke. Keywords: Chapter 11.

Slides: 23
Provided by: RaghuRa66
Learn more at: http://web.cs.wpi.edu
Transcript and Presenter's Notes

1
Hash-Based Indexes
  • Chapter 11

2
Introduction: Hash-Based Indexes
  • Best for equality selections.
  • Cannot support range searches.
  • Static and dynamic hashing techniques exist.
  • Trade-offs similar to ISAM vs. B+ trees.

3
Static Hashing
  • h(k) mod N = bucket to which data entry with key
    k belongs (N = # of buckets).

[Figure: key → h → h(key) mod N selects one of the primary bucket pages 0 ... N-1; full primary pages chain to overflow pages.]
4
Static Hashing: h(k) mod N
  • # of primary pages is fixed (N = # of buckets)
  • allocated sequentially
  • never de-allocated
  • overflow pages if needed.

[Figure: key → h → h(key) mod N selects one of the primary bucket pages 0 ... N-1; full primary pages chain to overflow pages.]
5
Static Hashing
  • h(k) mod N = bucket to which data entry with key
    k belongs, with N = # of buckets.
  • Hash function works on the search key of record r.
  • h() mod N must distribute values over the range 0 ...
    N-1.
  • For example, h(key) = (a * key + b)
  • a and b are constants
  • A lot is known about how to tune h.
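
The bucket computation above can be sketched in a few lines of Python. This is only an illustration: the helper name make_hash and the constants a = 31, b = 7 are arbitrary choices made here, not values from the slides.

```python
def make_hash(a: int, b: int, n_buckets: int):
    """Return a function mapping an integer key to a bucket number in 0..n_buckets-1."""
    def bucket_of(key: int) -> int:
        # h(key) = a * key + b, then reduce modulo the number of buckets N
        return (a * key + b) % n_buckets
    return bucket_of


# Hypothetical constants, picked only for the demonstration.
bucket_of = make_hash(a=31, b=7, n_buckets=4)
buckets = [bucket_of(k) for k in (4, 5, 10, 15)]
print(buckets)
```

Whatever a and b are, the final "mod N" step is what guarantees a bucket number between 0 and N-1.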

6
Static Hashing Cons
  • Primary pages are fixed space → static structure.
  • Fixed # of buckets is the problem.
  • Rehashing can be done → not good for search.
  • In practice, instead use overflow chains.
  • Long overflow chains degrade performance.
  • Solution: Employ dynamic techniques
  • Extendible hashing, or
  • Linear Hashing
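
To make the cost of long overflow chains concrete, here is a toy in-memory model of a static hash file. It is a sketch under simplifying assumptions: the class name StaticHashFile is invented for this example, pages are plain Python lists, h(k) = k, and "pages read" merely counts list visits rather than real disk I/O.

```python
class StaticHashFile:
    """Toy static hash file: N primary pages fixed at creation, overflow pages chained on demand."""

    def __init__(self, n_buckets: int, page_capacity: int):
        self.n = n_buckets
        self.capacity = page_capacity
        # Each bucket is a chain of pages; chain[0] is the primary page.
        self.buckets = [[[]] for _ in range(n_buckets)]

    def insert(self, key: int) -> None:
        chain = self.buckets[key % self.n]  # h(k) = k here, for simplicity
        if len(chain[-1]) == self.capacity:
            chain.append([])                # last page is full: add an overflow page
        chain[-1].append(key)

    def search(self, key: int):
        """Return (found, pages_read); in the worst case the whole chain is read."""
        chain = self.buckets[key % self.n]
        for pages_read, page in enumerate(chain, start=1):
            if key in page:
                return True, pages_read
        return False, len(chain)


f = StaticHashFile(n_buckets=4, page_capacity=4)
for k in range(0, 80, 4):   # 20 keys that all land in bucket 0
    f.insert(k)
print(f.search(0), f.search(76))
```

With every key landing in bucket 0, the chain grows to five pages, and finding the most recently inserted key requires reading all of them.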

7
Extendible Hashing
  • Problem: Bucket (primary page) becomes full.
  • Solution: Re-organize file by doubling the # of
    buckets?
  • But: Reading and writing all pages is expensive!
  • Idea: Use a directory of pointers to buckets
    instead of buckets
  • double the # of buckets by doubling the directory
  • split just the bucket that overflowed!
  • Discussion:
  • Directory is much smaller than the file, so doubling
    it is cheaper. Only one page of data entries is
    split.
  • No overflow pages ever.
  • Trick lies in how the hash function is adjusted!

9
Extendible Hashing
  • Ideas:
  • Use a directory of pointers to buckets instead of
    buckets
  • Details:
  • Double the # of buckets by doubling the directory
  • Split just the bucket that overflowed!
  • Trick:
  • How the hash function is adjusted!

10
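Example
  • Directory is an array of size 4.
  • To find the bucket for r, take the last "global depth"
    # of bits of h(r).

[Figure: DIRECTORY with GLOBAL DEPTH 2; entries 00, 01, 10, 11 point to the DATA PAGES.
  00 → Bucket A (local depth 2): 4, 12, 32, 16
  01 → Bucket B (local depth 2): 1, 5, 21, 13
  10 → Bucket C (local depth 2): 10
  11 → Bucket D (local depth 2): 15, 7, 19]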
11
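Example
  • If h(r) = 5 = binary 101,
  • it is in the bucket pointed to by 01.
  • If h(r) = 4 = binary 100,
  • it is in the bucket pointed to by 00.

[Figure: DIRECTORY with GLOBAL DEPTH 2; entries 00, 01, 10, 11 point to the DATA PAGES.
  00 → Bucket A (local depth 2): 4, 12, 32, 16
  01 → Bucket B (local depth 2): 1, 5, 21, 13
  10 → Bucket C (local depth 2): 10
  11 → Bucket D (local depth 2): 15, 7, 19]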
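
The lookup rule from this example can be written as a bit mask on the hash value. The sketch below hard-codes the four buckets of the example; the function name find_bucket and the list-based directory are assumptions of this illustration.

```python
GLOBAL_DEPTH = 2

# Buckets from the example, holding hash values h(r).
bucket_a = [4, 12, 32, 16]
bucket_b = [1, 5, 21, 13]
bucket_c = [10]
bucket_d = [15, 7, 19]

# Directory indexed by the last GLOBAL_DEPTH bits: 00, 01, 10, 11.
directory = [bucket_a, bucket_b, bucket_c, bucket_d]


def find_bucket(hash_value: int, directory: list, global_depth: int) -> list:
    """Follow the directory entry selected by the last `global_depth` bits."""
    index = hash_value & ((1 << global_depth) - 1)  # keep only the low-order bits
    return directory[index]


print(find_bucket(5, directory, GLOBAL_DEPTH) is bucket_b)  # 5 = binary 101 -> entry 01
print(find_bucket(4, directory, GLOBAL_DEPTH) is bucket_a)  # 4 = binary 100 -> entry 00
```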
12
Insertion
  • Insert: If the bucket is full, split it
    (allocate a new page, re-distribute content).
  • Splitting may
  • double the directory, or
  • simply link in a new page.
  • To tell what to do:
  • Compare global depth with local depth for the split
    bucket.

[Figure: DIRECTORY with GLOBAL DEPTH 2; entries 00, 01, 10, 11 point to the DATA PAGES.
  00 → Bucket A (local depth 2): 4, 12, 32, 16
  01 → Bucket B (local depth 2): 1, 5, 21, 13
  10 → Bucket C (local depth 2): 10
  11 → Bucket D (local depth 2): 15, 7, 19]
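
The insertion rule can be sketched as a small in-memory structure. This is a minimal illustration, not the textbook's implementation: the class names Bucket and ExtendibleHash are invented here, buckets store hash values directly, the capacity of 4 entries per page is taken from the figure, and the code assumes distinct hash values (many identical values would make it split forever).

```python
class Bucket:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth
        self.keys = []


class ExtendibleHash:
    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.global_depth = 0
        self.directory = [Bucket(0)]

    def insert(self, h: int) -> None:
        bucket = self.directory[h & ((1 << self.global_depth) - 1)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(h)
            return
        # Bucket is full. Compare local depth with global depth to decide.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory  # double by copying
            self.global_depth += 1
        # Split only the overflowing bucket, using one more bit.
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)        # the "split image"
        new_bit = 1 << (bucket.local_depth - 1)
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            (image if k & new_bit else bucket).keys.append(k)
        # Redirect the directory entries whose new bit is 1 to the split image.
        for i, b in enumerate(self.directory):
            if b is bucket and i & new_bit:
                self.directory[i] = image
        self.insert(h)  # retry; may split again if all keys stayed together


eh = ExtendibleHash(capacity=4)
for h in [4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19]:
    eh.insert(h)
depth_before = eh.global_depth
eh.insert(20)   # bucket for 00 is full and local depth == global depth
print(depth_before, eh.global_depth, eh.directory[0].keys, eh.directory[4].keys)
```

Loading the twelve values of the example reproduces the four buckets with global depth 2. Inserting 20 then doubles the directory and leaves 32 and 16 in one bucket and 4, 12 and 20 in its split image.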

13
Insert h(r) = 6 (binary 110)
14
Insert h(r) = 6 (binary 110)
15
Insert h(r) = 20 (binary 10100)
16
Insert h(r) = 20
Split Bucket A into two buckets A1 and A2.
[Figure: Bucket A1 (local depth 3): 32, 16. Bucket A2 (local depth 3): 4, 12, 20.]
17
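Insert h(r) = 20
[Figure: state before the insert. DIRECTORY with GLOBAL DEPTH 2; entries 00, 01, 10, 11 point to the DATA PAGES.
  00 → Bucket A (local depth 2): 4, 12, 32, 16
  01 → Bucket B (local depth 2): 1, 5, 21, 13
  10 → Bucket C (local depth 2): 10
  11 → Bucket D (local depth 2): 15, 7, 19]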
18
Insert h(r) = 20
[Figure: directory and data pages before and after the insert.
  Before: GLOBAL DEPTH 2, directory entries 00 to 11.
    Bucket A (local depth 2): 32, 16, 4, 12
    Bucket B (local depth 2): 1, 5, 21, 13
    Bucket C (local depth 2): 10
    Bucket D (local depth 2): 15, 7, 19
  After: GLOBAL DEPTH 3, directory entries 000 to 111.
    Bucket A (local depth 3): 32, 16
    Bucket B (local depth 2): 1, 5, 21, 13
    Bucket C (local depth 2): 10
    Bucket D (local depth 2): 15, 7, 19
    Bucket A2 (local depth 3, "split image" of Bucket A): 4, 12, 20]
19
Points to Note
  • 20 = binary 10100.
  • The last 2 bits (00) tell us that r belongs in A.
  • The last 3 bits are needed to tell whether r belongs
    in A1 or A2.
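
A quick way to check these points is to print the low-order bits of the five values involved. The helper name last_bits is made up for this sketch.

```python
def last_bits(h: int, depth: int) -> str:
    """Return the last `depth` bits of h as a zero-padded binary string."""
    return format(h & ((1 << depth) - 1), f"0{depth}b")


values = [32, 16, 4, 12, 20]           # contents of Bucket A plus the new entry
by_two = {h: last_bits(h, 2) for h in values}
by_three = {h: last_bits(h, 3) for h in values}
print(by_two)
print(by_three)
```

All five values end in 00, so two bits cannot separate them. The third bit splits them into 000 (32, 16) and 100 (4, 12, 20).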

20
More Points to Note
  • Bits:
  • Global depth of directory: Max # of bits needed
    to tell which bucket an entry belongs to.
  • Local depth of a bucket: Actual # of bits used to
    determine if an entry belongs to this bucket.
  • When does a bucket split cause directory doubling?
  • Before the insert, local depth of bucket = global
    depth.
  • The insert causes local depth to become > global
    depth.
  • Directory is doubled by copying it over and
    fixing the pointer to the split image of the
    over-full page.

21
Directory Doubling
  • Why use least significant bits in directory?
  • Allows for efficient doubling via copying
    directory!

[Figure: doubling a directory from global depth 2 to global depth 3, tracing the entry for 6 = binary 110, with Least Significant bit indexing vs. Most Significant bit indexing.]
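
The difference between the two indexing schemes shows up in how the doubled directory is built. The sketch below uses bucket names as stand-ins for page pointers; the function names double_lsb and double_msb are hypothetical.

```python
def double_lsb(directory: list) -> list:
    # Least significant bits: the new bit is added on the left of the index,
    # so new entry i points where old entry (i mod old_size) pointed.
    # The result is the old directory followed by a copy of itself.
    return directory + directory


def double_msb(directory: list) -> list:
    # Most significant bits: the new bit is added on the right of the index,
    # so new entry i points where old entry (i // 2) pointed.
    # Every old entry has to be duplicated in place.
    return [entry for entry in directory for _ in range(2)]


old = ["A", "B", "C", "D"]
print(double_lsb(old))
print(double_msb(old))
```

With least significant bits, the existing half of the directory stays where it is and the new half is a plain copy. With most significant bits, the entries are interleaved, so existing entries move.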
22
Extendible Hashing Delete
  • Delete:
  • If removal of a data entry makes a bucket empty, the
    bucket can be merged with its split image.
  • If each directory element points to the same bucket
    as its split image, the directory can be halved.
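
The halving condition can be checked directly on the directory. This sketch assumes a directory indexed by least significant bits, where the split image of entry i sits at i plus half the directory size; the function names can_halve and halve are invented for the illustration.

```python
def can_halve(directory: list) -> bool:
    """True if every entry points to the same bucket as its split image."""
    half = len(directory) // 2
    return half >= 1 and all(
        directory[i] is directory[i + half] for i in range(half)
    )


def halve(directory: list) -> list:
    """Drop the upper half; only valid when can_halve(directory) is True."""
    return directory[: len(directory) // 2]


a, a2, b, c, d = ([] for _ in range(5))      # stand-ins for bucket pages
after_split = [a, b, c, d, a2, b, c, d]      # A and A2 are distinct buckets
after_merge = [a, b, c, d, a, b, c, d]       # A2 merged back into A
print(can_halve(after_split), can_halve(after_merge))
```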

23
Comments on Extendible Hashing
  • If the directory fits in memory,
    then an equality search is answered with
    one disk access; else with two.
  • A 100MB file with 100 bytes/rec and 4K pages contains
    1,000,000 records (as data entries) and 25,000
    directory elements, so chances are high that the
    directory will fit in memory.
  • Directory grows in spurts.
  • If the distribution of hash values is skewed,
    directory can grow large.
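
The figures in the second bullet follow from simple arithmetic. The calculation below is a sketch that assumes 100MB means 10^8 bytes, that a 4K page holds 4096 bytes, that pages are completely full, and that there is one directory element per page.

```python
file_bytes = 100 * 10**6        # 100MB, taken as 10^8 bytes (assumption)
record_bytes = 100
page_bytes = 4096               # "4K" page (assumption)

records = file_bytes // record_bytes             # data entries in the file
entries_per_page = page_bytes // record_bytes    # whole entries that fit on a page
pages = records // entries_per_page              # full pages, one directory element each

print(records, entries_per_page, pages)
```

This gives 1,000,000 records, 40 entries per page, and 25,000 pages, matching the 25,000 directory elements on the slide.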

24
Summary
  • Hash-based indexes are best for equality searches;
    they cannot support range searches.
  • Static Hashing can lead to long overflow chains.
  • Extendible Hashing avoids overflow pages by
    splitting a full bucket when a new data entry is
    to be added to it.
  • A directory keeps track of the buckets; it doubles
    periodically.