Title: Quick Review of Apr 10 material
1. Quick Review of Apr 10 material
- B-Tree File Organization
- similar to B-tree index
- leaf nodes store the actual records, not pointers to records stored in a separate file
- leaf and interior nodes are different
- B-trees
- search key values appear only once
- the pointer to the record/bucket for that search-key value is always stored with the search key itself, even in interior nodes
- Hashing Overview
- Hash functions
- ideally uniform, random, easy to compute
2. Today
- Overflow
- Hash file performance
- Hash indices
- Dynamic Hashing (Extendable Hashing)
- Note: HW 3 is due next class (April 17)
- HW 4 due Thursday April 24 (9 days from now)
- Questions 12.11, 12.12, 12.13, 12.16
3. Overflow
- Overflow occurs when an insertion into a bucket cannot proceed because the bucket is full.
- Overflow can occur for the following reasons:
- too many records (not enough buckets)
- poor hash function
- skewed data
- multiple records might have the same search key
- multiple search keys might be assigned to the same bucket
4. Overflow (2)
- Overflow is handled by one of two methods:
- chaining of multiple blocks in a bucket, by attaching a number of overflow buckets together in a linked list (a sketch follows this list)
- double hashing: use a second hash function to find another (hopefully non-full) bucket
- in theory we could use the next bucket that has space; this is often called open hashing or linear probing, and is often used to construct symbol tables for compilers
- useful where deletion does not occur
- deletion is very awkward with linear probing, so it isn't useful in most database applications
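Below is a minimal Python sketch of the chaining approach, assuming fixed-capacity buckets; the names (Bucket, ChainedHashFile) are illustrative, not from the lecture.

    class Bucket:
        def __init__(self, capacity):
            self.capacity = capacity
            self.records = []            # (key, record) pairs in this block
            self.overflow = None         # next overflow bucket in the chain

    class ChainedHashFile:
        def __init__(self, num_buckets, bucket_capacity):
            self.buckets = [Bucket(bucket_capacity) for _ in range(num_buckets)]

        def insert(self, key, record):
            bucket = self.buckets[hash(key) % len(self.buckets)]
            while len(bucket.records) == bucket.capacity:
                if bucket.overflow is None:      # chain a fresh overflow bucket
                    bucket.overflow = Bucket(bucket.capacity)
                bucket = bucket.overflow
            bucket.records.append((key, record))

        def lookup(self, key):
            bucket = self.buckets[hash(key) % len(self.buckets)]
            while bucket is not None:            # may touch every chained bucket
                for k, r in bucket.records:
                    if k == key:
                        return r
                bucket = bucket.overflow
            return None

Linear probing would instead scan forward from the home bucket for the next bucket with space, which is why deletion becomes awkward: removing a record can break the probe sequence for other keys.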
5. Hashed File Performance Metrics
- An important performance measure is the loading factor: (number of records) / (B * f)
- B is the number of buckets
- f is the number of records that fit in a single bucket
- when the loading factor gets too high (the file becomes too full), double the number of buckets and rehash (a sketch follows this list)
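A small sketch of this policy, continuing the ChainedHashFile sketch above; the 0.9 trigger threshold is an illustrative choice, not from the lecture.

    def load_factor(num_records, B, f):
        # loading factor = (number of records) / (B * f)
        return num_records / (B * f)

    def maybe_rehash(hash_file, num_records, f, threshold=0.9):
        B = len(hash_file.buckets)
        if load_factor(num_records, B, f) > threshold:
            old_buckets = hash_file.buckets
            hash_file.buckets = [Bucket(f) for _ in range(2 * B)]  # double B
            for bucket in old_buckets:           # rehash every existing record
                while bucket is not None:
                    for key, record in bucket.records:
                        hash_file.insert(key, record)
                    bucket = bucket.overflow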
6. Hashed File Performance
- (Assume that the hash table is in main memory)
- Successful search: best case 1 block; worst case every chained bucket; average case about half the worst case
- Unsuccessful search: always reads every chained bucket (best, worst, and average case are the same)
- With a loading factor around 90% and a good hash function, the average is about 1.2 blocks per search (a small simulation sketch follows this list)
- Advantage of hashing: very fast for exact-match queries
- Disadvantage: records are not sorted in any order, so it is effectively impossible to do range queries
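The roughly-1.2-blocks figure can be checked empirically. A rough simulation using the ChainedHashFile sketch above; the parameters are illustrative, and Python's built-in hash stands in for a good hash function.

    import random

    def average_search_cost(B=100, f=10, load=0.9, trials=10_000):
        hf = ChainedHashFile(B, f)
        keys = random.sample(range(1_000_000), int(load * B * f))
        for k in keys:
            hf.insert(k, None)
        total = 0
        for k in random.choices(keys, k=trials):     # successful searches only
            bucket = hf.buckets[hash(k) % B]
            blocks = 1
            while all(kk != k for kk, _ in bucket.records):
                bucket = bucket.overflow             # one extra block per hop
                blocks += 1
            total += blocks
        return total / trials                        # typically a little above 1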
7. Hash Indices
- Hashing can be used for index-structure creation as well as for file organization
- A hash index organizes the search keys (and their record pointers) into a hash file structure
- strictly speaking, a hash index is always a secondary index
- if the primary file were stored using the same hash function, an additional, separate primary hash index would be unnecessary
- We use the term hash index to refer both to secondary hash indices and to files organized using hash file structures
8. Example of a Hash Index
- Hash index into the file account, on search key account-number
- Hash function: sum of the digits in account-number, modulo 7
- Bucket size is 2 (a sketch follows this list)
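A minimal sketch of this hash function; the account numbers below are illustrative.

    def h(account_number: str) -> int:
        # sum of the digits in account-number, modulo 7
        return sum(int(d) for d in account_number if d.isdigit()) % 7

    # 7 buckets, each holding up to 2 (search-key, pointer) entries
    buckets = [[] for _ in range(7)]
    for acct in ["A-217", "A-101", "A-110"]:
        buckets[h(acct)].append(acct)

    # h("A-217") = (2 + 1 + 7) % 7 = 10 % 7 = 3, so A-217 lands in bucket 3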
9. Static Hashing
- We've been discussing static hashing: the hash function maps search-key values to a fixed set of buckets. This has some disadvantages:
- databases grow with time; once buckets start to overflow, performance will degrade
- if we attempt to anticipate some future file size and allocate sufficient buckets for that expected size when we build the database initially, we will waste lots of space
- if the database ever shrinks, space will be wasted
- periodic reorganization avoids these problems, but is very expensive
- By using techniques that allow us to modify the number of buckets dynamically (dynamic hashing), we can avoid these problems
- Good for databases that grow and shrink in size
- Allows the hash function to be modified dynamically
10. Dynamic Hashing
- One form of dynamic hashing is extendable hashing
- the hash function generates values over a large range -- typically b-bit integers, with b being something like 32
- At any given moment, only a prefix of the hash value is used to index into a table of bucket addresses
- With the prefix length at a given moment being j, where 0 <= j <= 32, the bucket address table size is 2^j
- The value of j grows and shrinks as the size of the database grows and shrinks
- Multiple entries in the bucket address table may point to the same bucket
- Thus the actual number of buckets is <= 2^j
- the number of buckets also changes dynamically due to coalescing and splitting of buckets
11. General Extendable Hash Structure
12. Use of Extendable Hash Structure
- Each bucket j stores a value i_j; all the entries that point to the same bucket have the same values in their first i_j bits
- To locate the bucket containing search key K_j:
- compute X = H(K_j)
- use the first i high-order bits of X as a displacement into the bucket address table and follow the pointer to the appropriate bucket (a lookup sketch follows this list)
- To insert a record with search-key value K_j:
- follow the lookup procedure to locate the bucket, say j
- if there is room in bucket j, insert the record
- otherwise the bucket must be split and the insertion reattempted
- in some cases we use overflow buckets instead (as explained shortly)
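A minimal Python sketch of lookup under these definitions; the class and field names (XBucket, i_j) are illustrative.

    B_BITS = 32                          # b: width of the hash values

    class XBucket:
        def __init__(self, local_depth):
            self.i_j = local_depth       # bits shared by all entries pointing here
            self.records = []            # (key, record) pairs

    class ExtendableHash:
        def __init__(self, bucket_capacity):
            self.i = 0                           # current prefix length
            self.table = [XBucket(0)]            # bucket address table, 2^i entries
            self.capacity = bucket_capacity

        def index_of(self, key):
            x = hash(key) & 0xFFFFFFFF           # X = H(K), a 32-bit value
            return x >> (B_BITS - self.i)        # first i high-order bits of X

        def lookup(self, key):
            for k, r in self.table[self.index_of(key)].records:
                if k == key:
                    return r
            return None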
13. Splitting in Extendable Hash Structure
- To split a bucket j when inserting a record with search-key value K_j:
- if i > i_j (more than one pointer into bucket j):
- allocate a new bucket z
- set i_j and i_z to the old value of i_j incremented by one
- update the bucket address table (change the second half of the set of entries pointing to j so that they now point to z)
- remove all the entries in j and rehash them so that each falls in either j or z
- reattempt the insert of K_j; if the bucket is still full, repeat the above (a sketch follows this list)
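A sketch of this first split case, continuing the ExtendableHash sketch above (written as a free function for brevity; error handling is omitted).

    def split_bucket(h, idx):
        bucket = h.table[idx]
        assert h.i > bucket.i_j          # several table entries point to this bucket
        bucket.i_j += 1                  # both halves get the incremented depth
        z = XBucket(bucket.i_j)          # allocate the new bucket z
        s = h.i - bucket.i_j + 1         # table bits NOT fixed by the old prefix
        start = (idx >> s) << s          # first entry of the run pointing to j
        run = 1 << s                     # number of entries in that run
        for t in range(start + run // 2, start + run):
            h.table[t] = z               # second half of the run now points to z
        old, bucket.records = bucket.records, []
        for k, r in old:                 # rehash: each record falls in j or z
            h.table[h.index_of(k)].records.append((k, r))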
14. Splitting in Extendable Hash Structure (2)
- To split a bucket j when inserting a record with search-key value K_j:
- if i = i_j (only one pointer into bucket j):
- increment i and double the size of the bucket address table
- replace each entry in the bucket address table with two entries that point to the same bucket
- recompute the bucket address table entry for K_j
- now i > i_j, so use the first case described earlier
- When inserting a value, if the bucket is still full after several splits (that is, i reaches some preset value b), give up and create an overflow bucket rather than splitting the bucket address table further (a sketch follows this list)
- how might this occur? (e.g., many records with the same search-key value)
15. Deletion in Extendable Hash Structure
- To delete a key value K_j:
- locate it in its bucket and remove it
- the bucket itself can be removed if it becomes empty (with appropriate updates to the bucket address table)
- coalescing of buckets is possible
- a bucket can only coalesce with a buddy bucket having the same value of i_j and the same (i_j - 1)-bit prefix, if such a bucket exists (a sketch follows this list)
- decreasing the bucket address table size is also possible
- but this is very expensive
- it should only be done if the number of buckets becomes much smaller than the size of the table
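A sketch of the buddy test and a simplified coalescing step, continuing the ExtendableHash sketch above; recursive coalescing and shrinking the table are omitted.

    def buddy_index(h, idx):
        d = h.table[idx].i_j
        if d == 0:
            return None                  # the lone depth-0 bucket has no buddy
        return idx ^ (1 << (h.i - d))    # flip the last bit of the i_j-bit prefix

    def try_coalesce(h, idx):
        b = buddy_index(h, idx)
        if b is None:
            return
        bucket, buddy = h.table[idx], h.table[b]
        if buddy is bucket or buddy.i_j != bucket.i_j:
            return                       # buddy must exist with the same i_j
        if len(bucket.records) + len(buddy.records) <= h.capacity:
            buddy.records.extend(bucket.records)
            buddy.i_j -= 1               # merged bucket covers the shorter prefix
            for t in range(len(h.table)):
                if h.table[t] is bucket: # redirect entries to the merged bucket
                    h.table[t] = buddy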
16. Extendable Hash Structure Example
- Hash function on branch-name
- Initial hash table: (empty)
17. Extendable Hash Structure Example (2)
- Hash structure after insertion of one Brighton and two Downtown records
18. Extendable Hash Structure Example (3)
- Hash structure after insertion of the Mianus record
19. Extendable Hash Structure Example (4)
- Hash structure after insertion of three Perryridge records
20. Extendable Hash Structure Example (5)
- Hash structure after insertion of the Redwood and Round Hill records
21. Extendable Hashing vs. Other Hashing
- Benefits of extendable hashing:
- hash performance doesn't degrade as the file grows
- minimal space overhead
- Disadvantages of extendable hashing:
- extra level of indirection (the bucket address table) to find the desired record
- the bucket address table may itself become very big (larger than memory)
- then a tree structure is needed to locate the desired record in the structure!
- changing the size of the bucket address table is an expensive operation
- Linear hashing is an alternative mechanism that avoids these disadvantages, at the possible cost of more bucket overflows
22. Comparison: Ordered Indexing vs. Hashing
- Each scheme has advantages for some operations and situations. To choose wisely between different schemes we need to consider:
- cost of periodic reorganization
- relative frequency of insertions and deletions
- is it desirable to optimize average access time at the expense of worst-case access time?
- what types of queries do we expect?
- hashing is generally better at retrieving records for a specific key value
- ordered indices are better for range queries