ECE 569 Database System Engineering Spring 2004 - PowerPoint PPT Presentation

About This Presentation
Title:

ECE 569 Database System Engineering Spring 2004

Description:

Title: CSE 331. Computer Organization Subject: Lecture 02 Author: Janie Irwin Last modified by: Yanyong Zhang Created Date: 8/19/1997 4:58:46 PM Document presentation ... – PowerPoint PPT presentation

Number of Views:117
Avg rating:3.0/5.0
Slides: 28
Provided by: Janie170
Category:

less

Transcript and Presenter's Notes

Title: ECE 569 Database System Engineering Spring 2004


1
ECE 569 Database System EngineeringSpring 2004
  • Yanyong Zhang www.ece.rutgers.edu/yyzhang
  • Course URL www.ece.rutgers.edu/yyzhang/spring04

2
Associative access
  • The system is not asked to retrieve tuples based
    on information about their storage location
    rather, it has to find all tuples the attribute
    values of which fulfill certain conditions
    associative access.
  • Associative access can be realized by sequential
    scanning, which happens for complicated queries.
  • select R.x, S.y,
  • from R,S
  • where R.k S.f and R.b lt 12
  • But for simple selection predicates, this is very
    slow (even for an in-memory database)

3
Access Path
  • The class of algorithms and data structures
    designed for translating attribute values into
    TID, or into other types of internal addresses of
    tuples having those attribute values, is called
    access paths.
  • Depending on what kind of selection predicate is
    to be supported, the techniques for associative
    access vary greatly.

4
Content addressability techniques
  • Primary key access.
  • A tuple of a relation must be retrieved
    efficiently via the value of its primary (unique)
    key(s). e.g., key-sequenced files and hased
    files.
  • Point query vs. range query
  • Secondary key access
  • A set of tuples are produced
  • Multi-table access
  • Tuple access is often based on relationships
    between different tuples. E.g., all orders placed
    by a given customer have to be found.

5
Associative access path techniques
  • Hashing (key transformation)
  • Using the primary key value as a parameter to a
    function, which returns the storage location of
    the tuple.
  • Key comparison
  • Maintaining a dynamic search structure on the set
    of values in the key attribute. These values can
    be organized into tables, lists, trees, and so on
  • e.g, B tree

6
Operations on files (heap files)
  • Assumptions
  • n number of records in file
  • R number of records that can fit in block
  • Lookup Given a key find corresponding record
  • On average, n / (2R) block accesses.
  • Insertion add record to file (allows
    duplicates)
  • Read last block it may need to allocate a new
    block. Approximately, requires 2 accesses
  • Deletion delete record
  • look up record n / (2R)
  • Write back to disk (1 access)
  • Reorganize (unpinned) move tuple from last page
    to utilize space (2 disk accesses)

7
Hashed Files
  • File is divided into B buckets
  • Hash function h maps elements of the key space to
    range 0, B)
  • Key space is large and unevenly distributed
  • SSNs as character strings
  • Each character takes on at most 10 of the
    possible 256 values
  • Hash function h must map key values evenly among
    a relatively small number of values.

8
Hash-based associative access
Range of Potential Key Values (the shaded areas
denote used key values)
HASHING
FOLDING
tuple address space
Range of positive integers
9
Folding
  • Convert arbitrary data types to a positive
    integer h can be applied to.
  • Reduce number of bits so that arithmetic is
    efficient.
  • Example Key is Keefe and 16803
  • Key value is the concatenation of byte
    representation of individual fields
  • Folded value of key is
  • 0x4b 0x65 0x65 0x66 0x65 0x0 0x0 0x0 0x41 0xa3
  • Partition result into words and combine using XOR
  • 0x4b 0x65 0x65 0x66
  • 0x65 0x0 0x0 0x0
  • 0x41 0xa3 0x0 0x0
  • ????????????
  • 0x6f 0xc6 0x65 0x66 1875273062

10
Hashing
  • goal of hashing
  • How to choose hash function if all the key values
    are uniformly distributed?
  • The critical issue is to produce 11 mapping
  • Collision different inputs are mapped to the
    same output.
  • The criteria of a good hash function is to keep
    the collision as small as possible.

11
Static Hashing
  • Input folded key values
  • Output bytes (relative to the beginning of the
    file), blocks ??
  • Bytes are not good because of the varying tuple
    size.
  • A block/page is called a bucket.
  • H 0 232-1 -gt 0, B-1
  • Continuous allocation
  • Fixed size B pages are allocated at file
    creation time.
  • Insert
  • Determine the bucket
  • Check the bucket ( collision may happen)

12
How to find a good hash function
  • Division / remainder (Congruential hashing)
  • H(Kb) kb mod B where kb is folded key value
    and B is the number of buckets.
  • Nth power
  • Compute kbN, and from the resulting bit string (n
    x 31 bits) take log2B bits from the middle.
  • Base transformation
  • Polynomial division
  • Numerical analysis
  • encryption

13
Performance
  • Assumption
  • Perfect hash function (tuples are uniformly
    distributed over B buckets)
  • Lookup
  • ?½ ? n/R ? 1/B? To finish first match
  • ?n/R ? 1/B? If tuple does not exist
  • Insertion
  • ?n/R ? 1/B? 1 Test for duplicates
  • 1 Otherwise
  • Deletion
  • ? ½ ? n/R ? 1/B? delete first match

14
Collision
  • Two keys collide if they hash to same value
  • A bucket with room for R tuples can accommodate R
    1 collisions before it overflows
  • Internal resolution Place overflow blocks in
    another bucket
  • (h(K) 1) mod B linear probing
  • (h2(h1(K)) multiple hashing

15
Collision - continued
  • External resolution Allocation overflow block,
    link to overflow chain

buckets
Overflow pages
16
Discussion
  • What are the disadvantages of static hashing?
  • How do you limit the number of pages accessed
    when retrieving a tuple, for both external and
    internal resolution?

17
Trie
  • The buckets will dynamically grow/shrink/balance
  • Fundamental trie

(a)
A
000
001
D
010
B
011
100
C
101
110
111
(c)
18
Dynamic hashing function
  • We need a hash function whose range of values can
    change dynamically
  • One such hash function can be constructed using a
    series of functions hi(k), i 0, 1, ., such
    that for any k, either hi(k) hi-1(k) or hi(k)
    hi-1(k)2i-1.
  • Choose H(k), which maps the key space into random
    bit patterns of length m, for m sufficiently
    large. Then hi(k) may be defined as the integers
    formed by the last i bits of H(k).

19
Extendible Hashing
  • The number of buckets can grow/shrink.
  • An intermediate data structure translates the
    hash results into page addresses. This data
    structure needs to be as compact as possible.
  • Hashes into an array of pointer to buckets
    (directory).
  • The array is small enough to be kept in memory.

20
Directory Growth
  • To adapt to dynamically varying size of hash
    file- modify directory size
  • Assume a hash function h(Kb) that produces a
    bit string s.
  • The directory is of size 2d. d is called the
    global depth and is initially 0.
  • Use least significant d bits of s to determine
    bucket to access
  • Each bucket has a corresponding local depth in
    the range 0, d which indicates the difference
    between all the records in this bucket

21
Example
  • Insert 0x13, 0x10, 0x07, 0x00, 0x1f
  • Each page can contain no more than 2 tuples

22
Example insert 0x1f
local degree global degree log2( of arrows
pointing to this bucket)
23
Performance
  • 2 steps for retrieving a tuple
  • If we can keep the directory in memory, each
    retrieval is one page access
  • Assuming 4 bytes per entry, 4KB pages, 1GB hash
    files, and we want to keep the entire directory
    in memory, what is the minimum buffer size?

24
Discussion
  • How easy is it to keep the directory in the
    memory?
  • How do we reduce the structure when the file
    shrinks?
  • How do you make the directory small, and increase
    space utilization? (deferred splitting)

25
Linear hashing
(a)
d 2
00
01
10
11
w
(b)
d 2
01
10
11
000
100
x
(c)
w
d 2
10
11
000
001
100
101
y
x
y
(c)
w
(d)
w
x
11
000
001
010
100
101
110
000
001
010
011
100
101
110
111
d 2
d 3
26
Addressing in linear hashing
  • Which hash function should be used?
  • d is the degree, p is the address of the next
    page to split
  • The algorithm is as follows
  • begin
  • if (hd(k) gt p) then page hd(k)
  • else page hd1(k)
  • if necessary, chase the overflow chain
  • end

27
Discuss
  • Where to store the overflow?
Write a Comment
User Comments (0)
About PowerShow.com