Title: CS 245: Database System Principles
1CS 245 Database System Principles
Notes 5 Hashing and More
2Hashing
ltkeygt
Buckets (typically 1 disk block)
. . .
3. . .
records
(1) key ? h(key)
. . .
4Two alternatives
record
(2) key ? h(key)
key 1
Index
- Alt (2) for secondary search key
5Example hash function
- Key x1 x2 xn n byte character string
- Have b buckets
- h add x1 x2 .. xn
- compute sum modulo b
6- ? This may not be best function
- ? Read Knuth Vol. 3 if you really need to
select a good function.
Good hash ? Expected number of
function keys/bucket is the same for all
buckets
7Within a bucket
- Yes, if CPU time critical
- Inserts/Deletes not too frequent
8Next example to illustrate inserts,
overflows, deletes
9EXAMPLE 2 records/bucket
0 1 2 3
- INSERT
- h(a) 1
- h(b) 2
- h(c) 1
- h(d) 0
h(e) 1
10EXAMPLE deletion
Deleteef
0 1 2 3
a
b
d
c
c
e
f
g
11Rule of thumb
- Try to keep space utilization
- between 50 and 80
- Utilization keys used
- total keys that fit
12How do we cope with growth?
- Overflows and reorganizations
- Dynamic hashing
13Extensible hashing two ideas
- (a) Use i of b bits output by hash function
- b
- h(K) ?
- use i ? grows over time.
00110101
14- (b) Use directory
- h(K)i to bucket
. . .
. . .
15Example h(k) is 4 bits 2 keys/bucket
1
0001
1
1001
1100
Insert 1010
16Example continued
i
2
00 01 10 11
1
0001
0111
1001
1010
Insert 0111 0000
1100
17Example continued
2
0000
0001
i
2
00 01 10 11
2
0111
Insert 1001
18Extensible hashing deletion
- No merging of blocks
- Merge blocks and cut directory if possible
- (Reverse insert procedure)
19Deletion example
- Run thru insert example in reverse!
20 Extensible hashing
Summary
- Can handle growing files
- - with less wasted space
- - with no full reorganizations
21Linear hashing
- Another dynamic hashing scheme
22Example b4 bits, i 2, 2 keys/bucket
Future growth buckets
0101
0000
1111
1010
m 01 (max used block)
23Example b4 bits, i 2, 2 keys/bucket
Future growth buckets
0101
0000
1111
1010
m 01 (max used block)
24Example Continued How to grow beyond this?
i 2
1111
1010
0101
0000
0101
. . .
m 11 (max used block)
25? When do we expand file?
- Keep track of used slots
- total of slots
U
- If U gt threshold then increase m
- (and maybe i )
26 Linear Hashing
Summary
- Can handle growing files
- - with less wasted space
- - with no full reorganizations
-
- No indirection like extensible hashing
27Example BAD CASE
- Very full
- Very empty Need to move
- m here
- Would waste
- space...
-
28Summary
- Hashing
- - How it works
- - Dynamic hashing
- - Extensible
- - Linear
29Next
- Indexing vs Hashing
- Index definition in SQL
- Multiple key access
30Indexing vs Hashing
- Hashing good for probes given key
- e.g., SELECT
- FROM R
- WHERE R.A 5
31Indexing vs Hashing
- INDEXING (Including B Trees) good for
- Range Searches
- e.g., SELECT
- FROM R
- WHERE R.A gt 5
32Index definition in SQL
- Create index name on rel (attr)
- Create unique index name on rel (attr)
defines candidate key
33- CANNOT SPECIFY TYPE OF INDEX
- (e.g. B-tree, Hashing, )
- OR PARAMETERS
- (e.g. Load Factor, Size of Hash,...)
- ... at least in SQL...
Note
34- ATTRIBUTE LIST ? MULTIKEY INDEX
- (next)
- e.g., CREATE INDEX foo ON R(A,B,C)
Note
35 Multi-key Index
- Motivation Find records where
- DEPT Toy AND SAL gt 50k
36Strategy I
- Use one index, say Dept.
- Get all Dept Toy records and
check their salary
I1
37Strategy II
- Use 2 Indexes Manipulate Pointers
- Toy Sal
- gt 50k
38Strategy III
- Multiple Key Index
- One idea
I2
I3
I1
39Example
10k
15k
- Example
- Record
- Dept
- Index
- Salary
- Index
17k
21k
NameJoe DEPTSales SAL15k
12k
15k
15k
19k
40For which queries is this index good?
- Find RECs Dept Sales SAL20k
- Find RECs Dept Sales SAL gt 20k
- Find RECs Dept Sales
- Find RECs SAL 20k
41Interesting application
- Geographic Data
- DATA
- ltX1,Y1, Attributesgt
- ltX2,Y2, Attributesgt
y
x
. . .
42Queries
- What city is at ltXi,Yigt?
- What is within 5 miles from ltXi,Yigt?
- Which is closest point to ltXi,Yigt?
43Example
44Queries
- Find points with Yi gt 20
- Find points with Xi lt 5
- Find points close to i lt12,38gt
- Find points close to b lt7,24gt
45- Many types of geographic index structures have
been suggested - Quad Trees
- R Trees
46Two more types of multi key indexes
47Grid Index
- Key 2
- X1 X2 Xn
- V1
- V2
- Key 1
- Vn
To records with key1V3, key2X2
48CLAIM
- Can quickly find records with
- key 1 Vi ? Key 2 Xj
- key 1 Vi
- key 2 Xj
- And also ranges.
- E.g., key 1 ? Vi ? key 2 lt Xj
49- ? But there is a catch with Grid Indexes!
- How is Grid Index stored on disk?
- Problem
- Need regularity so we can compute position of
ltVi,Xjgt entry
50Solution Use Indirection
- Buckets
- V1
- V2
- V3 Grid only
- V4 contains
- pointers to
- buckets
- Buckets
X1 X2 X3
-- -- --
-- -- --
-- -- --
-- -- --
-- -- --
51With indirection
- Grid can be regular without wasting space
- We do have price of indirection
52Can also index grid on value ranges
0-20K
1
20K-50K
2
50K-
3
8
Linear Scale
1
2
3
Toy
Sales
Personnel
53Grid files
- Good for multiple-key search
- Space, management overhead (nothing is
free) - Need partitioning ranges that evenly split keys
-
-
54Partitioned hash function
010110 1110010
h1
h2
55EX
- h1(toy) 0 000
- h1(sales) 1 001
- h1(art) 1 010
- . 011
- .
- h2(10k) 01 100
- h2(20k) 11 101
- h2(30k) 01 110
- h2(40k) 00 111
- .
- .
- ltFred,toy,10kgt,ltJoe,sales,10kgt
- ltSally,art,30kgt
Insert
56- h1(toy) 0 000
- h1(sales) 1 001
- h1(art) 1 010
- . 011
- .
- h2(10k) 01 100
- h2(20k) 11 101
- h2(30k) 01 110
- h2(40k) 00 111
- .
- .
- Find Emp. with Dept. Sales ? Sal40k
-
ltFredgt
ltJoegtltJangt
ltMarygt
ltSallygt
ltTomgtltBillgt
ltAndygt
57- h1(toy) 0 000
- h1(sales) 1 001
- h1(art) 1 010
- . 011
- .
- h2(10k) 01 100
- h2(20k) 11 101
- h2(30k) 01 110
- h2(40k) 00 111
- .
- .
- Find Emp. with Sal30k
-
ltFredgt
ltJoegtltJangt
ltMarygt
ltSallygt
ltTomgtltBillgt
ltAndygt
58- h1(toy) 0 000
- h1(sales) 1 001
- h1(art) 1 010
- . 011
- .
- h2(10k) 01 100
- h2(20k) 11 101
- h2(30k) 01 110
- h2(40k) 00 111
- .
- .
- Find Emp. with Dept. Sales
-
ltFredgt
ltJoegtltJangt
ltMarygt
ltSallygt
ltTomgtltBillgt
ltAndygt
59Summary
- Post hashing discussion
- - Indexing vs. Hashing
- - SQL Index Definition
- - Multiple Key Access
- - Multi Key Index
- Variations Grid, Geo Data
- - Partitioned Hash
-