CS 245: Database System Principles - PowerPoint PPT Presentation

About This Presentation
Title:

CS 245: Database System Principles

Description:

CS 245: Database System Principles Notes 5: Hashing and More Hector Garcia-Molina key h(key) Two alternatives Example hash function Key = x1 x2 xn n byte ... – PowerPoint PPT presentation

Number of Views:90
Avg rating:3.0/5.0
Slides: 60
Provided by: Siroker
Category:

less

Transcript and Presenter's Notes

Title: CS 245: Database System Principles


1
CS 245 Database System Principles
Notes 5 Hashing and More
  • Hector Garcia-Molina

2
Hashing
  • key ? h(key)

ltkeygt
Buckets (typically 1 disk block)
. . .
3
  • Two alternatives

. . .
records
(1) key ? h(key)
. . .
4
Two alternatives
record
(2) key ? h(key)
key 1
Index
  • Alt (2) for secondary search key

5
Example hash function
  • Key x1 x2 xn n byte character string
  • Have b buckets
  • h add x1 x2 .. xn
  • compute sum modulo b

6
  • ? This may not be best function
  • ? Read Knuth Vol. 3 if you really need to
    select a good function.

Good hash ? Expected number of
function keys/bucket is the same for all
buckets
7
Within a bucket
  • Do we keep keys sorted?
  • Yes, if CPU time critical
  • Inserts/Deletes not too frequent

8
Next example to illustrate inserts,
overflows, deletes
  • h(K)

9
EXAMPLE 2 records/bucket
0 1 2 3
  • INSERT
  • h(a) 1
  • h(b) 2
  • h(c) 1
  • h(d) 0

h(e) 1
10
EXAMPLE deletion
Deleteef
0 1 2 3
a
b
d
c
c
e
f
g
11
Rule of thumb
  • Try to keep space utilization
  • between 50 and 80
  • Utilization keys used
  • total keys that fit

12
How do we cope with growth?
  • Overflows and reorganizations
  • Dynamic hashing

13
Extensible hashing two ideas
  • (a) Use i of b bits output by hash function
  • b
  • h(K) ?
  • use i ? grows over time.

00110101
14
  • (b) Use directory
  • h(K)i to bucket

. . .
. . .
15
Example h(k) is 4 bits 2 keys/bucket
1
  • i

0001
1
1001
1100
Insert 1010
16
Example continued
i
2
00 01 10 11
1
0001
0111
1001
1010
Insert 0111 0000
1100
17
Example continued
2
0000
0001
i
2
00 01 10 11
2
0111
Insert 1001
18
Extensible hashing deletion
  • No merging of blocks
  • Merge blocks and cut directory if possible
  • (Reverse insert procedure)

19
Deletion example
  • Run thru insert example in reverse!

20
Extensible hashing
Summary
  • Can handle growing files
  • - with less wasted space
  • - with no full reorganizations


21
Linear hashing
  • Another dynamic hashing scheme

22
Example b4 bits, i 2, 2 keys/bucket
  • insert 0101

Future growth buckets
0101
0000
1111
1010
  • 00 01 10 11

m 01 (max used block)
23
Example b4 bits, i 2, 2 keys/bucket
Future growth buckets
0101
0000
1111
1010
  • 00 01 10 11

m 01 (max used block)
24
Example Continued How to grow beyond this?
i 2
1111
1010
0101
0000
0101
  • 00 01 10 11

. . .
m 11 (max used block)
25
? When do we expand file?
  • Keep track of used slots
  • total of slots

U
  • If U gt threshold then increase m
  • (and maybe i )

26
Linear Hashing
Summary
  • Can handle growing files
  • - with less wasted space
  • - with no full reorganizations
  • No indirection like extensible hashing



27
Example BAD CASE
  • Very full
  • Very empty Need to move
  • m here
  • Would waste
  • space...

28
Summary
  • Hashing
  • - How it works
  • - Dynamic hashing
  • - Extensible
  • - Linear

29
Next
  • Indexing vs Hashing
  • Index definition in SQL
  • Multiple key access

30
Indexing vs Hashing
  • Hashing good for probes given key
  • e.g., SELECT
  • FROM R
  • WHERE R.A 5

31
Indexing vs Hashing
  • INDEXING (Including B Trees) good for
  • Range Searches
  • e.g., SELECT
  • FROM R
  • WHERE R.A gt 5

32
Index definition in SQL
  • Create index name on rel (attr)
  • Create unique index name on rel (attr)

defines candidate key
  • Drop INDEX name

33
  • CANNOT SPECIFY TYPE OF INDEX
  • (e.g. B-tree, Hashing, )
  • OR PARAMETERS
  • (e.g. Load Factor, Size of Hash,...)
  • ... at least in SQL...

Note
34
  • ATTRIBUTE LIST ? MULTIKEY INDEX
  • (next)
  • e.g., CREATE INDEX foo ON R(A,B,C)

Note
35

Multi-key Index
  • Motivation Find records where
  • DEPT Toy AND SAL gt 50k

36
Strategy I
  • Use one index, say Dept.
  • Get all Dept Toy records and
    check their salary

I1
37
Strategy II
  • Use 2 Indexes Manipulate Pointers
  • Toy Sal
  • gt 50k

38
Strategy III
  • Multiple Key Index
  • One idea

I2
I3
I1
39
Example
10k
15k
  • Example
  • Record
  • Dept
  • Index
  • Salary
  • Index

17k
21k
NameJoe DEPTSales SAL15k
12k
15k
15k
19k
40
For which queries is this index good?
  • Find RECs Dept Sales SAL20k
  • Find RECs Dept Sales SAL gt 20k
  • Find RECs Dept Sales
  • Find RECs SAL 20k

41
Interesting application
  • Geographic Data
  • DATA
  • ltX1,Y1, Attributesgt
  • ltX2,Y2, Attributesgt

y
x
. . .
42
Queries
  • What city is at ltXi,Yigt?
  • What is within 5 miles from ltXi,Yigt?
  • Which is closest point to ltXi,Yigt?

43
Example

44
Queries
  • Find points with Yi gt 20
  • Find points with Xi lt 5
  • Find points close to i lt12,38gt
  • Find points close to b lt7,24gt

45
  • Many types of geographic index structures have
    been suggested
  • Quad Trees
  • R Trees

46
Two more types of multi key indexes
  • Grid
  • Partitioned hash

47
Grid Index
  • Key 2
  • X1 X2 Xn
  • V1
  • V2
  • Key 1
  • Vn

To records with key1V3, key2X2
48
CLAIM
  • Can quickly find records with
  • key 1 Vi ? Key 2 Xj
  • key 1 Vi
  • key 2 Xj
  • And also ranges.
  • E.g., key 1 ? Vi ? key 2 lt Xj

49
  • ? But there is a catch with Grid Indexes!
  • How is Grid Index stored on disk?
  • Problem
  • Need regularity so we can compute position of
    ltVi,Xjgt entry

50
Solution Use Indirection
  • Buckets
  • V1
  • V2
  • V3 Grid only
  • V4 contains
  • pointers to
  • buckets
  • Buckets

X1 X2 X3
-- -- --
-- -- --
-- -- --
-- -- --
-- -- --
51
With indirection
  • Grid can be regular without wasting space
  • We do have price of indirection

52
Can also index grid on value ranges
  • Salary Grid

0-20K
1
20K-50K
2
50K-
3
8
Linear Scale
1
2
3
Toy
Sales
Personnel
53
Grid files
  • Good for multiple-key search
  • Space, management overhead (nothing is
    free)
  • Need partitioning ranges that evenly split keys


-
-
54
Partitioned hash function
010110 1110010
  • Idea
  • Key1 Key2

h1
h2
55
EX
  • h1(toy) 0 000
  • h1(sales) 1 001
  • h1(art) 1 010
  • . 011
  • .
  • h2(10k) 01 100
  • h2(20k) 11 101
  • h2(30k) 01 110
  • h2(40k) 00 111
  • .
  • .
  • ltFred,toy,10kgt,ltJoe,sales,10kgt
  • ltSally,art,30kgt

Insert
56
  • h1(toy) 0 000
  • h1(sales) 1 001
  • h1(art) 1 010
  • . 011
  • .
  • h2(10k) 01 100
  • h2(20k) 11 101
  • h2(30k) 01 110
  • h2(40k) 00 111
  • .
  • .
  • Find Emp. with Dept. Sales ? Sal40k

ltFredgt
ltJoegtltJangt
ltMarygt
ltSallygt
ltTomgtltBillgt
ltAndygt
57
  • h1(toy) 0 000
  • h1(sales) 1 001
  • h1(art) 1 010
  • . 011
  • .
  • h2(10k) 01 100
  • h2(20k) 11 101
  • h2(30k) 01 110
  • h2(40k) 00 111
  • .
  • .
  • Find Emp. with Sal30k

ltFredgt
ltJoegtltJangt
ltMarygt
ltSallygt
ltTomgtltBillgt
ltAndygt
58
  • h1(toy) 0 000
  • h1(sales) 1 001
  • h1(art) 1 010
  • . 011
  • .
  • h2(10k) 01 100
  • h2(20k) 11 101
  • h2(30k) 01 110
  • h2(40k) 00 111
  • .
  • .
  • Find Emp. with Dept. Sales

ltFredgt
ltJoegtltJangt
ltMarygt
ltSallygt
ltTomgtltBillgt
ltAndygt
59
Summary
  • Post hashing discussion
  • - Indexing vs. Hashing
  • - SQL Index Definition
  • - Multiple Key Access
  • - Multi Key Index
  • Variations Grid, Geo Data
  • - Partitioned Hash
Write a Comment
User Comments (0)
About PowerShow.com