Quick Review of Apr 10 material - PowerPoint PPT Presentation

Provided by: DavidK242
Learn more at: http://www.cs.umd.edu


Quick Review of Apr 10 material
  • B+-Tree File Organization
  • similar to B+-tree index
  • leaf nodes store records, not pointers to records
    stored in an original file
  • leaf and interior nodes are different
  • B-trees
  • search key values appear only once
  • pointer to record/bucket for that search key
    value always stored with the search key itself,
    even in interior nodes
  • Hashing Overview
  • Hash functions
  • ideally uniform, random, easy to compute

  • Overflow
  • Hash file performance
  • Hash indices
  • Dynamic Hashing (Extendable Hashing)
  • Note: HW3 due next class (April 17)
  • HW 4 due Thursday April 24 (9 days from now)
  • Questions 12.11, 12.12, 12.13, 12.16

Overflow
  • Overflow occurs when an insertion into a bucket
    can't proceed because the bucket is full.
  • Overflow can occur for the following reasons:
  • too many records (not enough buckets)
  • poor hash function
  • skewed data
  • multiple records might have the same search key
  • multiple search keys might be assigned the same
    bucket
Overflow (2)
  • Overflow is handled by one of two methods:
  • chaining of multiple blocks in a bucket, by
    attaching a number of overflow buckets together
    in a linked list
  • double hashing: use a second hash function to
    find another (hopefully non-full) bucket
  • in theory we could use the next bucket that has
    space; this is often called open hashing or
    linear probing. It is often used to construct
    symbol tables for compilers
  • useful where deletion does not occur
  • deletion is very awkward with linear probing, so
    it isn't useful in most database applications
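
The chaining method above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the in-memory lists standing in for disk blocks, and the use of Python's built-in hash() are all hypothetical choices, not part of the slides.

```python
class Bucket:
    """A fixed-capacity bucket with a link to an overflow bucket."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []
        self.overflow = None  # next bucket in the chain, if any


class ChainedHashFile:
    """Static hash file that handles overflow by chaining (hypothetical sketch)."""
    def __init__(self, num_buckets, capacity, hash_fn=hash):
        # Note: built-in hash() is stable for ints but randomized per
        # process for strings; pass your own hash_fn if that matters.
        self.capacity = capacity
        self.hash_fn = hash_fn
        self.buckets = [Bucket(capacity) for _ in range(num_buckets)]

    def _home(self, key):
        return self.buckets[self.hash_fn(key) % len(self.buckets)]

    def insert(self, key, record):
        bucket = self._home(key)
        # Walk the chain until a bucket with free space is found,
        # attaching a new overflow bucket at the end if necessary.
        while len(bucket.records) >= bucket.capacity:
            if bucket.overflow is None:
                bucket.overflow = Bucket(self.capacity)
            bucket = bucket.overflow
        bucket.records.append((key, record))

    def lookup(self, key):
        """Return (matching records, number of buckets visited)."""
        bucket, visited, matches = self._home(key), 0, []
        while bucket is not None:
            visited += 1
            matches += [rec for k, rec in bucket.records if k == key]
            bucket = bucket.overflow
        return matches, visited
```

The lookup always walks the whole chain, so the number of buckets visited grows with the chain length.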

Hashed File Performance Metrics
  • An important performance measure is the loading
    factor:
  • (number of records) / (B × f)
  • B is the number of buckets
  • f is the number of records that will fit in a
    single bucket
  • when the loading factor is too high (the file
    becomes too full), double the number of buckets
    and rehash
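
As a small worked example of the formula above, the following Python sketch computes the loading factor and doubles the bucket count until it is acceptable. The function names and the 0.9 threshold are assumptions made for illustration; the slides do not fix a particular threshold.

```python
def loading_factor(num_records, num_buckets, records_per_bucket):
    """(number of records) / (B * f)."""
    return num_records / (num_buckets * records_per_bucket)


def buckets_after_growth(num_records, num_buckets, records_per_bucket,
                         threshold=0.9):
    """Double B until the loading factor is at or below the threshold.

    The threshold value is an assumption; a real system would also
    rehash every record after each doubling.
    """
    while loading_factor(num_records, num_buckets, records_per_bucket) > threshold:
        num_buckets *= 2
    return num_buckets


# 1900 records, B = 100, f = 10: the file is nearly twice over-full,
# so the bucket count is doubled twice.
new_b = buckets_after_growth(1900, 100, 10)
```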

Hashed File Performance
  • (Assume that the hash table is in main memory)
  • Successful search: best case 1 block; worst case
    every chained bucket; average case half of worst
  • Unsuccessful search: always hits every chained
    bucket (best case, worst case, average case)
  • With loading factor around 90% and a good hashing
    function, average is about 1.2 blocks
  • Advantage of hashing: very fast for exact queries
  • Disadvantage: records are not sorted in any
    order. As a result, it is effectively impossible
    to do range queries

Hash Indices
  • Hashing can be used for index-structure creation
    as well as for file organization
  • A hash index organizes the search keys (and their
    record pointers) into a hash file structure
  • strictly speaking, a hash index is always a
    secondary index
  • if the primary file was stored using the same
    hash function, an additional, separate primary
    hash index would be unnecessary
  • We use the term hash index to refer both to
    secondary hash indices and to files organized
    using hashing file structures

Example of a Hash Index
  • Hash index into file account, on search key
    account-number
  • Hash function computes the sum of the digits in
    the account number, modulo 7
  • Bucket size is 2
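
The hash function in this example is simple enough to write out. The sketch below is a hypothetical Python rendering; the account numbers are sample values chosen for illustration, and the index stores only keys where a real hash index would also store record pointers.

```python
BUCKET_SIZE = 2
NUM_BUCKETS = 7


def h(account_number: str) -> int:
    """Sum of the decimal digits in the account number, modulo 7."""
    return sum(int(ch) for ch in account_number if ch.isdecimal()) % NUM_BUCKETS


def build_index(account_numbers):
    """Group search keys by bucket number."""
    buckets = {b: [] for b in range(NUM_BUCKETS)}
    for acct in account_numbers:
        buckets[h(acct)].append(acct)
    return buckets


# Sample account numbers (assumed, not taken from the slide's figure).
accounts = ["A-217", "A-101", "A-110", "A-215", "A-305", "A-222"]
index = build_index(accounts)

# Buckets holding more keys than BUCKET_SIZE would need overflow buckets.
overflowing = [b for b, keys in index.items() if len(keys) > BUCKET_SIZE]
```

For instance, "A-217" has digit sum 2 + 1 + 7 = 10, and 10 modulo 7 is 3, so it lands in bucket 3.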

Static Hashing
  • We've been discussing static hashing: the hash
    function maps search-key values to a fixed set of
    buckets. This has some disadvantages:
  • databases grow with time. Once buckets start to
    overflow, performance will degrade
  • if we attempt to anticipate some future file size
    and allocate sufficient buckets for that expected
    size when we build the database initially, we
    will waste lots of space
  • if the database ever shrinks, space will be
    wasted
  • periodic reorganization avoids these problems,
    but is very expensive
  • By using techniques that allow us to modify the
    number of buckets dynamically (dynamic hashing)
    we can avoid these problems
  • Good for databases that grow and shrink in size
  • Allows the hash function to be modified

Dynamic Hashing
  • One form of dynamic hashing is extendable hashing
  • hash function generates values over a large range
    -- typically b-bit integers, with b being
    something like 32
  • At any given moment, only a prefix of the hash
    value is used to index into a table of bucket
    addresses
  • With the prefix length at a given moment being j,
    with 0 ≤ j ≤ 32, the bucket address table size
    is 2^j
  • Value of j grows and shrinks as the size of the
    database grows and shrinks
  • Multiple entries in the bucket address table may
    point to a bucket
  • Thus the actual number of buckets is ≤ 2^j
  • the number of buckets also changes dynamically
    due to coalescing and splitting of buckets
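
Taking a j-bit prefix of a 32-bit hash value is a single shift. The following Python sketch shows the idea; the function name and the constant HASH_BITS are hypothetical names introduced here.

```python
HASH_BITS = 32  # b in the slides, assumed to be 32


def table_index(hash_value: int, j: int) -> int:
    """Index into a 2**j-entry bucket address table.

    Uses the j high-order bits of a HASH_BITS-bit hash value.
    """
    if not 0 <= j <= HASH_BITS:
        raise ValueError("prefix length out of range")
    return (hash_value & 0xFFFFFFFF) >> (HASH_BITS - j)


# A hash value whose four high-order bits are 1011:
x = 0b1011 << 28
indices = [table_index(x, j) for j in range(5)]
```

With j = 0 every key maps to entry 0, which matches a table with a single entry.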

General Extendable Hash Structure

Use of Extendable Hash Structure
  • Each bucket j stores a value i_j; all the entries
    that point to the same bucket have the same
    values on the first i_j bits
  • To locate the bucket containing search key K_j:
  • compute H(K_j) = X
  • use the first i high-order bits of X as a
    displacement into the bucket address table and
    follow the pointer to the appropriate bucket
    (i is the current prefix length, called j on
    the Dynamic Hashing slide)
  • To insert a record with search-key value K_j:
  • follow the lookup procedure to locate the
    bucket, say j
  • if there is room in bucket j, insert the record
  • otherwise the bucket must be split and the
    insertion reattempted
  • in some cases we use overflow buckets instead (as
    explained shortly)

Splitting in Extendable Hash Structure
  • To split a bucket j when inserting a record with
    search-key value K_j:
  • if i > i_j (more than one pointer into bucket j)
  • allocate a new bucket z
  • set i_j and i_z to the old value of i_j
    incremented by 1
  • update the bucket address table (change the
    second half of the set of entries pointing to j
    so that they now point to z)
  • remove all the entries in j and rehash them so
    that they fall in either z or j
  • reattempt the insert of K_j. If the bucket is
    still full, repeat the above.

Splitting in Extendable Hash Structure (2)
  • To split a bucket j when inserting a record with
    search-key value K_j:
  • if i = i_j (only one pointer into bucket j)
  • increment i and double the size of the bucket
    address table
  • replace each entry in the bucket address table
    with two entries that point to the same bucket
  • recompute the new bucket address table entry
    for K_j
  • now i > i_j, so use the first case described
    earlier
  • When inserting a value, if the bucket is still
    full after several splits (that is, i reaches
    some preset value b), give up and create an
    overflow bucket rather than growing the bucket
    address table further
  • how might this occur?
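
The lookup, insert, and two split cases can be put together in a compact Python sketch. It is a minimal illustration under several assumptions that are not in the slides: keys are unique, CRC32 stands in for the hash function, buckets are in-memory dicts, and the sketch raises an error where the slides would create an overflow bucket. All class and method names are hypothetical.

```python
import zlib

HASH_BITS = 32  # b in the slides, assumed to be 32


def crc_hash(key):
    """32-bit hash of a string key; CRC32 is a stand-in chosen for this sketch."""
    return zlib.crc32(key.encode())


class Bucket:
    def __init__(self, depth, capacity):
        self.depth = depth        # i_j: prefix bits shared by all keys here
        self.capacity = capacity
        self.items = {}           # key -> record (unique keys assumed)


class ExtendableHash:
    def __init__(self, capacity=2, hash_fn=crc_hash):
        self.i = 0                            # global prefix length
        self.capacity = capacity
        self.hash_fn = hash_fn
        self.table = [Bucket(0, capacity)]    # bucket address table, 2**i entries

    def _index(self, key):
        # The first i high-order bits of the hash value.
        return self.hash_fn(key) >> (HASH_BITS - self.i)

    def lookup(self, key):
        return self.table[self._index(key)].items.get(key)

    def insert(self, key, record):
        while True:
            bucket = self.table[self._index(key)]
            if key in bucket.items or len(bucket.items) < bucket.capacity:
                bucket.items[key] = record
                return
            if bucket.depth == HASH_BITS:
                # The slides would fall back to an overflow bucket here.
                raise OverflowError("cannot split further")
            if self.i == bucket.depth:
                # Case i = i_j: double the table, duplicating every entry.
                self.table = [b for b in self.table for _ in range(2)]
                self.i += 1
            self._split(bucket)   # now i > i_j; then retry the insert

    def _split(self, bucket):
        # Case i > i_j: allocate z and raise both local depths by one.
        bucket.depth += 1
        new_bucket = Bucket(bucket.depth, self.capacity)
        # Entries pointing to the old bucket are contiguous; repoint
        # the second half of them to the new bucket.
        entries = [n for n, b in enumerate(self.table) if b is bucket]
        for n in entries[len(entries) // 2:]:
            self.table[n] = new_bucket
        # Rehash the old bucket's records into one of the two buckets.
        old_items, bucket.items = bucket.items, {}
        for k, rec in old_items.items():
            self.table[self._index(k)].items[k] = rec


branches = ExtendableHash(capacity=2)
for name in ["Brighton", "Downtown", "Mianus", "Perryridge", "Redwood", "Round Hill"]:
    branches.insert(name, {"branch": name})
```

Because the table is indexed by high-order bits, doubling it by duplicating each entry in place keeps every pointer valid, which is why the doubling step needs no rehashing of its own.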

Deletion in Extendable Hash Structure
  • To delete a key value K_j:
  • locate it in its bucket and remove it
  • the bucket itself can be removed if it becomes
    empty (with appropriate updates to the bucket
    address table)
  • coalescing of buckets is possible
  • can only coalesce with a buddy bucket having
    the same value of i_j and the same (i_j - 1)-bit
    prefix, if one such bucket exists
  • decreasing bucket address table size is also
    possible
  • very expensive
  • should only be done if the number of buckets
    becomes much smaller than the size of the table
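
The buddy condition can be checked with a couple of bit operations. The sketch below is a hypothetical Python illustration; the BucketInfo record and function names are made up here, and the requirement that the combined records fit in one bucket is an added assumption that the slide leaves implicit.

```python
from dataclasses import dataclass


@dataclass
class BucketInfo:
    prefix: int   # the i_j high-order hash bits shared by every key in the bucket
    depth: int    # i_j
    count: int    # number of records currently stored


def are_buddies(a: BucketInfo, b: BucketInfo) -> bool:
    """Same i_j, same first i_j - 1 bits, different last prefix bit."""
    return (a.depth == b.depth
            and a.depth > 0
            and a.prefix != b.prefix
            and (a.prefix >> 1) == (b.prefix >> 1))


def can_coalesce(a: BucketInfo, b: BucketInfo, capacity: int) -> bool:
    # Assumption: merging only makes sense if all records fit in one bucket.
    return are_buddies(a, b) and a.count + b.count <= capacity


left = BucketInfo(prefix=0b010, depth=3, count=1)
right = BucketInfo(prefix=0b011, depth=3, count=1)
merge_ok = can_coalesce(left, right, capacity=2)
```

After a merge the surviving bucket would take depth i_j - 1 and prefix equal to the shared i_j - 1 bits.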

Extendable Hash Structure Example
  • Hash function on branch name
  • Initial hash table (empty)

Extendable Hash Structure Example (2)
  • Hash structure after insertion of one Brighton
    and two Downtown records

Extendable Hash Structure Example (3)
  • Hash structure after insertion of Mianus record

Extendable Hash Structure Example (4)
  • Hash structure after insertion of three
    Perryridge records

Extendable Hash Structure Example (5)
  • Hash structure after insertion of Redwood and
    Round Hill records

Extendable Hashing vs. Other Hashing
  • Benefits of extendable hashing
  • hash performance doesn't degrade with growth of
    the file
  • minimal space overhead
  • Disadvantages of extendable hashing
  • extra level of indirection (bucket address table)
    to find desired record
  • bucket address table may itself become very big
    (larger than memory)
  • need a tree structure to locate desired record in
    the structure!
  • Changing size of bucket address table is an
    expensive operation
  • Linear hashing is an alternative mechanism which
    avoids these disadvantages at the possible cost
    of more bucket overflows

Comparison: Ordered Indexing vs. Hashing
  • Each scheme has advantages for some operations
    and situations. To choose wisely between
    different schemes we need to consider:
  • cost of periodic reorganization
  • relative frequency of insertions and deletions
  • is it desirable to optimize average access time
    at the expense of worst-case access time?
  • What types of queries do we expect?
  • Hashing is generally better at retrieving records
    for a specific key value
  • Ordered indices are better for range queries
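
The last two points can be illustrated with a short Python sketch, using a dict as a stand-in for a hash index and a sorted list with bisect as a stand-in for an ordered index. The branch names come from the earlier example; the balances and function names are made-up sample values.

```python
import bisect

# Hypothetical sample data: branch name -> balance.
records = {"Brighton": 750, "Downtown": 500, "Mianus": 700,
           "Perryridge": 400, "Redwood": 700, "Round Hill": 350}

hash_index = dict(records)          # exact-match access by key
ordered_keys = sorted(records)      # ordered access by key


def exact_lookup(key):
    """Hashing: go straight to the one place the key can be."""
    return hash_index.get(key)


def range_lookup_ordered(lo, hi):
    """Ordered index: binary search for the ends, then read in order."""
    left = bisect.bisect_left(ordered_keys, lo)
    right = bisect.bisect_right(ordered_keys, hi)
    return ordered_keys[left:right]


def range_lookup_hashed(lo, hi):
    """Hashing: no order to exploit, so every key must be examined."""
    return sorted(k for k in hash_index if lo <= k <= hi)


in_range = range_lookup_ordered("D", "P")
```

Both range functions return the same keys, but the hashed version has to visit every entry, while the ordered version touches only the matching slice plus two binary searches.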