Title: Quick Review of Apr 10 material
1. Quick Review of Apr 10 material
- B-Tree File Organization
- similar to B-tree index
- leaf nodes store the actual records, not pointers to records stored in a separate file
- leaf and interior nodes are different
- B-trees
- search key values appear only once
- the pointer to the record/bucket for that search-key value is always stored with the search key itself, even in interior nodes
- Hashing Overview
- Hash functions
- ideally uniform, random, easy to compute
2. Today
- Overflow
- Hash file performance
- Hash indices
- Dynamic Hashing (Extendable Hashing)
- Note: HW 3 is due next class (April 17)
- HW 4 due Thursday April 24 (9 days from now)
- Questions 12.11, 12.12, 12.13, 12.16
3. Overflow
- Overflow occurs when an insertion into a bucket cannot proceed because the bucket is full.
- Overflow can occur for the following reasons:
- too many records (not enough buckets)
- poor hash function
- skewed data
- multiple records might have the same search key
- multiple search keys might be assigned to the same bucket
4. Overflow (2)
- Overflow is handled by one of two methods:
- chaining of multiple blocks in a bucket, by attaching a number of overflow buckets together in a linked list (a sketch follows this list)
- double hashing: use a second hash function to find another (hopefully non-full) bucket
- in theory we could use the next bucket that has space; this is often called open hashing or linear probing, and is often used to construct symbol tables for compilers
- useful where deletion does not occur
- deletion is very awkward with linear probing, so it isn't useful in most database applications
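Below is a minimal Python sketch of the chaining approach, assuming fixed-capacity buckets; the names (Bucket, ChainedHashFile) are illustrative, not from the lecture.

    class Bucket:
        def __init__(self, capacity):
            self.capacity = capacity
            self.records = []            # (key, record) pairs in this block
            self.overflow = None         # next overflow bucket in the chain

    class ChainedHashFile:
        def __init__(self, num_buckets, bucket_capacity):
            self.buckets = [Bucket(bucket_capacity) for _ in range(num_buckets)]

        def insert(self, key, record):
            bucket = self.buckets[hash(key) % len(self.buckets)]
            while len(bucket.records) == bucket.capacity:
                if bucket.overflow is None:      # chain a fresh overflow bucket
                    bucket.overflow = Bucket(bucket.capacity)
                bucket = bucket.overflow
            bucket.records.append((key, record))

        def lookup(self, key):
            bucket = self.buckets[hash(key) % len(self.buckets)]
            while bucket is not None:            # may touch every chained bucket
                for k, r in bucket.records:
                    if k == key:
                        return r
                bucket = bucket.overflow
            return None

Linear probing would instead scan forward from the home bucket for the next bucket with space, which is why deletion becomes awkward: removing a record can break the probe sequence for other keys.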
5. Hashed File Performance Metrics
- An important performance measure is the loading factor: (number of records) / (B * f)
- B is the number of buckets
- f is the number of records that fit in a single bucket
- when the loading factor gets too high (the file becomes too full), double the number of buckets and rehash (a sketch follows this list)
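A small sketch of this policy, continuing the ChainedHashFile sketch above; the 0.9 trigger threshold is an illustrative choice, not from the lecture.

    def load_factor(num_records, B, f):
        # loading factor = (number of records) / (B * f)
        return num_records / (B * f)

    def maybe_rehash(hash_file, num_records, f, threshold=0.9):
        B = len(hash_file.buckets)
        if load_factor(num_records, B, f) > threshold:
            old_buckets = hash_file.buckets
            hash_file.buckets = [Bucket(f) for _ in range(2 * B)]  # double B
            for bucket in old_buckets:           # rehash every existing record
                while bucket is not None:
                    for key, record in bucket.records:
                        hash_file.insert(key, record)
                    bucket = bucket.overflow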
6. Hashed File Performance
- (Assume that the hash table is in main memory)
- Successful search: best case 1 block; worst case every chained bucket; average case about half the worst case
- Unsuccessful search: always reads every chained bucket (best, worst, and average case are the same)
- With a loading factor around 90% and a good hash function, the average is about 1.2 blocks per search (a small simulation sketch follows this list)
- Advantage of hashing: very fast for exact-match queries
- Disadvantage: records are not sorted in any order, so it is effectively impossible to do range queries
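The roughly-1.2-blocks figure can be checked empirically. A rough simulation using the ChainedHashFile sketch above; the parameters are illustrative, and Python's built-in hash stands in for a good hash function.

    import random

    def average_search_cost(B=100, f=10, load=0.9, trials=10_000):
        hf = ChainedHashFile(B, f)
        keys = random.sample(range(1_000_000), int(load * B * f))
        for k in keys:
            hf.insert(k, None)
        total = 0
        for k in random.choices(keys, k=trials):     # successful searches only
            bucket = hf.buckets[hash(k) % B]
            blocks = 1
            while all(kk != k for kk, _ in bucket.records):
                bucket = bucket.overflow             # one extra block per hop
                blocks += 1
            total += blocks
        return total / trials                        # typically a little above 1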
7. Hash Indices
- Hashing can be used for index-structure creation as well as for file organization
- A hash index organizes the search keys (and their record pointers) into a hash file structure
- strictly speaking, a hash index is always a secondary index
- if the primary file were stored using the same hash function, an additional, separate primary hash index would be unnecessary
- We use the term hash index to refer both to secondary hash indices and to files organized using hash file structures
8. Example of a Hash Index
- Hash index into the file account, on search key account-number
- Hash function: sum of the digits in account-number, modulo 7
- Bucket size is 2 (a sketch follows this list)
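A minimal sketch of this hash function; the account numbers below are illustrative.

    def h(account_number: str) -> int:
        # sum of the digits in account-number, modulo 7
        return sum(int(d) for d in account_number if d.isdigit()) % 7

    # 7 buckets, each holding up to 2 (search-key, pointer) entries
    buckets = [[] for _ in range(7)]
    for acct in ["A-217", "A-101", "A-110"]:
        buckets[h(acct)].append(acct)

    # h("A-217") = (2 + 1 + 7) % 7 = 10 % 7 = 3, so A-217 lands in bucket 3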
9. Static Hashing
- We've been discussing static hashing: the hash function maps search-key values to a fixed set of buckets. This has some disadvantages:
- databases grow with time; once buckets start to overflow, performance will degrade
- if we attempt to anticipate some future file size and allocate sufficient buckets for that expected size when we build the database initially, we will waste lots of space
- if the database ever shrinks, space will be wasted
- periodic reorganization avoids these problems, but is very expensive
- By using techniques that allow us to modify the number of buckets dynamically (dynamic hashing), we can avoid these problems
- Good for databases that grow and shrink in size
- Allows the hash function to be modified dynamically
10. Dynamic Hashing
- One form of dynamic hashing is extendable hashing
- the hash function generates values over a large range -- typically b-bit integers, with b being something like 32
- At any given moment, only a prefix of the hash value is used to index into a table of bucket addresses
- With the prefix length at a given moment being j, where 0 <= j <= 32, the bucket address table size is 2^j
- The value of j grows and shrinks as the size of the database grows and shrinks
- Multiple entries in the bucket address table may point to the same bucket
- Thus the actual number of buckets is <= 2^j
- the number of buckets also changes dynamically due to coalescing and splitting of buckets
11. General Extendable Hash Structure
12. Use of Extendable Hash Structure
- Each bucket j stores a value i_j; all the entries that point to the same bucket have the same values in their first i_j bits
- To locate the bucket containing search key K_j:
- compute X = H(K_j)
- use the first i high-order bits of X as a displacement into the bucket address table and follow the pointer to the appropriate bucket (a lookup sketch follows this list)
- To insert a record with search-key value K_j:
- follow the lookup procedure to locate the bucket, say j
- if there is room in bucket j, insert the record
- otherwise the bucket must be split and the insertion reattempted
- in some cases we use overflow buckets instead (as explained shortly)
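A minimal Python sketch of lookup under these definitions; the class and field names (XBucket, i_j) are illustrative.

    B_BITS = 32                          # b: width of the hash values

    class XBucket:
        def __init__(self, local_depth):
            self.i_j = local_depth       # bits shared by all entries pointing here
            self.records = []            # (key, record) pairs

    class ExtendableHash:
        def __init__(self, bucket_capacity):
            self.i = 0                           # current prefix length
            self.table = [XBucket(0)]            # bucket address table, 2^i entries
            self.capacity = bucket_capacity

        def index_of(self, key):
            x = hash(key) & 0xFFFFFFFF           # X = H(K), a 32-bit value
            return x >> (B_BITS - self.i)        # first i high-order bits of X

        def lookup(self, key):
            for k, r in self.table[self.index_of(key)].records:
                if k == key:
                    return r
            return None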
13. Splitting in Extendable Hash Structure
- To split a bucket j when inserting a record with search-key value K_j:
- if i > i_j (more than one pointer into bucket j):
- allocate a new bucket z
- set i_j and i_z to the old value of i_j incremented by one
- update the bucket address table (change the second half of the set of entries pointing to j so that they now point to z)
- remove all the entries in j and rehash them so that each falls in either j or z
- reattempt the insert of K_j; if the bucket is still full, repeat the above (a sketch follows this list)
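A sketch of this first split case, continuing the ExtendableHash sketch above (written as a free function for brevity; error handling is omitted).

    def split_bucket(h, idx):
        bucket = h.table[idx]
        assert h.i > bucket.i_j          # several table entries point to this bucket
        bucket.i_j += 1                  # both halves get the incremented depth
        z = XBucket(bucket.i_j)          # allocate the new bucket z
        s = h.i - bucket.i_j + 1         # table bits NOT fixed by the old prefix
        start = (idx >> s) << s          # first entry of the run pointing to j
        run = 1 << s                     # number of entries in that run
        for t in range(start + run // 2, start + run):
            h.table[t] = z               # second half of the run now points to z
        old, bucket.records = bucket.records, []
        for k, r in old:                 # rehash: each record falls in j or z
            h.table[h.index_of(k)].records.append((k, r))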
14. Splitting in Extendable Hash Structure (2)
- To split a bucket j when inserting a record with search-key value K_j:
- if i = i_j (only one pointer into bucket j):
- increment i and double the size of the bucket address table
- replace each entry in the bucket address table with two entries that point to the same bucket
- recompute the bucket address table entry for K_j
- now i > i_j, so use the first case described earlier
- When inserting a value, if the bucket is still full after several splits (that is, i reaches some preset value b), give up and create an overflow bucket rather than splitting the bucket address table further (a sketch follows this list)
- how might this occur? (e.g., many records with the same search-key value)
15. Deletion in Extendable Hash Structure
- To delete a key value K_j:
- locate it in its bucket and remove it
- the bucket itself can be removed if it becomes empty (with appropriate updates to the bucket address table)
- coalescing of buckets is possible
- a bucket can only coalesce with a buddy bucket having the same value of i_j and the same (i_j - 1)-bit prefix, if such a bucket exists (a sketch follows this list)
- decreasing the bucket address table size is also possible
- but this is very expensive
- it should only be done if the number of buckets becomes much smaller than the size of the table
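A sketch of the buddy test and a simplified coalescing step, continuing the ExtendableHash sketch above; recursive coalescing and shrinking the table are omitted.

    def buddy_index(h, idx):
        d = h.table[idx].i_j
        if d == 0:
            return None                  # the lone depth-0 bucket has no buddy
        return idx ^ (1 << (h.i - d))    # flip the last bit of the i_j-bit prefix

    def try_coalesce(h, idx):
        b = buddy_index(h, idx)
        if b is None:
            return
        bucket, buddy = h.table[idx], h.table[b]
        if buddy is bucket or buddy.i_j != bucket.i_j:
            return                       # buddy must exist with the same i_j
        if len(bucket.records) + len(buddy.records) <= h.capacity:
            buddy.records.extend(bucket.records)
            buddy.i_j -= 1               # merged bucket covers the shorter prefix
            for t in range(len(h.table)):
                if h.table[t] is bucket: # redirect entries to the merged bucket
                    h.table[t] = buddy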
16. Extendable Hash Structure Example
- Hash function on branch-name
- Initial hash table: (empty)
17. Extendable Hash Structure Example (2)
- Hash structure after insertion of one Brighton and two Downtown records
18. Extendable Hash Structure Example (3)
- Hash structure after insertion of the Mianus record
19. Extendable Hash Structure Example (4)
- Hash structure after insertion of three Perryridge records
20. Extendable Hash Structure Example (5)
- Hash structure after insertion of the Redwood and Round Hill records
21. Extendable Hashing vs. Other Hashing
- Benefits of extendable hashing:
- hash performance doesn't degrade as the file grows
- minimal space overhead
- Disadvantages of extendable hashing:
- extra level of indirection (the bucket address table) to find the desired record
- the bucket address table may itself become very big (larger than memory)
- then a tree structure is needed to locate the desired record in the structure!
- changing the size of the bucket address table is an expensive operation
- Linear hashing is an alternative mechanism that avoids these disadvantages, at the possible cost of more bucket overflows
22. Comparison: Ordered Indexing vs. Hashing
- Each scheme has advantages for some operations and situations. To choose wisely between different schemes we need to consider:
- cost of periodic reorganization
- relative frequency of insertions and deletions
- is it desirable to optimize average access time at the expense of worst-case access time?
- what types of queries do we expect?
- hashing is generally better at retrieving records for a specific key value
- ordered indices are better for range queries