ECE 569 Database System Engineering Spring 2004 - PowerPoint PPT Presentation

About This Presentation

Title:

ECE 569 Database System Engineering Spring 2004

Description:

Title: CSE 331. Computer Organization Subject: Lecture 02 Author: Janie Irwin Last modified by: Yanyong Zhang Created Date: 8/19/1997 4:58:46 PM Document presentation ... – PowerPoint PPT presentation

Number of Views:118

Avg rating:3.0/5.0

Slides: 28

Provided by: Janie170

Learn more at: http://eceweb1.rutgers.edu

Category:

more less

Transcript and Presenter's Notes

Title: ECE 569 Database System Engineering Spring 2004

1
ECE 569 Database System EngineeringSpring 2004

Yanyong Zhang www.ece.rutgers.edu/yyzhang
Course URL www.ece.rutgers.edu/yyzhang/spring04

2
Associative access

The system is not asked to retrieve tuples based
on information about their storage location
rather, it has to find all tuples the attribute
values of which fulfill certain conditions
associative access.
Associative access can be realized by sequential
scanning, which happens for complicated queries.
select R.x, S.y,
from R,S
where R.k S.f and R.b lt 12
But for simple selection predicates, this is very
slow (even for an in-memory database)

3
Access Path

The class of algorithms and data structures
designed for translating attribute values into
TID, or into other types of internal addresses of
tuples having those attribute values, is called
access paths.
Depending on what kind of selection predicate is
to be supported, the techniques for associative
access vary greatly.

4
Content addressability techniques

Primary key access.
A tuple of a relation must be retrieved
efficiently via the value of its primary (unique)
key(s). e.g., key-sequenced files and hased
files.
Point query vs. range query
Secondary key access
A set of tuples are produced
Multi-table access
Tuple access is often based on relationships
between different tuples. E.g., all orders placed
by a given customer have to be found.

5
Associative access path techniques

Hashing (key transformation)
Using the primary key value as a parameter to a
function, which returns the storage location of
the tuple.
Key comparison
Maintaining a dynamic search structure on the set
of values in the key attribute. These values can
be organized into tables, lists, trees, and so on
e.g, B tree

6
Operations on files (heap files)

Assumptions
n number of records in file
R number of records that can fit in block
Lookup Given a key find corresponding record
On average, n / (2R) block accesses.
Insertion add record to file (allows
duplicates)
Read last block it may need to allocate a new
block. Approximately, requires 2 accesses
Deletion delete record
look up record n / (2R)
Write back to disk (1 access)
Reorganize (unpinned) move tuple from last page
to utilize space (2 disk accesses)

7
Hashed Files

File is divided into B buckets
Hash function h maps elements of the key space to
range 0, B)
Key space is large and unevenly distributed
SSNs as character strings
Each character takes on at most 10 of the
possible 256 values
Hash function h must map key values evenly among
a relatively small number of values.

8
Hash-based associative access
Range of Potential Key Values (the shaded areas
denote used key values)
HASHING
FOLDING
tuple address space
Range of positive integers
9
Folding

Convert arbitrary data types to a positive
integer h can be applied to.
Reduce number of bits so that arithmetic is
efficient.
Example Key is Keefe and 16803
Key value is the concatenation of byte
representation of individual fields
Folded value of key is
0x4b 0x65 0x65 0x66 0x65 0x0 0x0 0x0 0x41 0xa3
Partition result into words and combine using XOR
0x4b 0x65 0x65 0x66
0x65 0x0 0x0 0x0
0x41 0xa3 0x0 0x0
????????????
0x6f 0xc6 0x65 0x66 1875273062

10
Hashing

goal of hashing
How to choose hash function if all the key values
are uniformly distributed?
The critical issue is to produce 11 mapping
Collision different inputs are mapped to the
same output.
The criteria of a good hash function is to keep
the collision as small as possible.

11
Static Hashing

Input folded key values
Output bytes (relative to the beginning of the
file), blocks ??
Bytes are not good because of the varying tuple
size.
A block/page is called a bucket.
H 0 232-1 -gt 0, B-1
Continuous allocation
Fixed size B pages are allocated at file
creation time.
Insert
Determine the bucket
Check the bucket ( collision may happen)

12
How to find a good hash function

Division / remainder (Congruential hashing)
H(Kb) kb mod B where kb is folded key value
and B is the number of buckets.
Nth power
Compute kbN, and from the resulting bit string (n
x 31 bits) take log2B bits from the middle.
Base transformation
Polynomial division
Numerical analysis
encryption

13
Performance

Assumption
Perfect hash function (tuples are uniformly
distributed over B buckets)
Lookup
?½ ? n/R ? 1/B? To finish first match
?n/R ? 1/B? If tuple does not exist
Insertion
?n/R ? 1/B? 1 Test for duplicates
1 Otherwise
Deletion
? ½ ? n/R ? 1/B? delete first match

14
Collision

Two keys collide if they hash to same value
A bucket with room for R tuples can accommodate R
1 collisions before it overflows
Internal resolution Place overflow blocks in
another bucket
(h(K) 1) mod B linear probing
(h2(h1(K)) multiple hashing

15
Collision - continued

External resolution Allocation overflow block,
link to overflow chain

buckets
Overflow pages
16
Discussion

What are the disadvantages of static hashing?
How do you limit the number of pages accessed
when retrieving a tuple, for both external and
internal resolution?

17
Trie

The buckets will dynamically grow/shrink/balance
Fundamental trie

(a)
A
000
001
D
010
B
011
100
C
101
110
111
(c)
18
Dynamic hashing function

We need a hash function whose range of values can
change dynamically
One such hash function can be constructed using a
series of functions hi(k), i 0, 1, ., such
that for any k, either hi(k) hi-1(k) or hi(k)
hi-1(k)2i-1.
Choose H(k), which maps the key space into random
bit patterns of length m, for m sufficiently
large. Then hi(k) may be defined as the integers
formed by the last i bits of H(k).

19
Extendible Hashing

The number of buckets can grow/shrink.
An intermediate data structure translates the
hash results into page addresses. This data
structure needs to be as compact as possible.
Hashes into an array of pointer to buckets
(directory).
The array is small enough to be kept in memory.

20
Directory Growth

To adapt to dynamically varying size of hash
file- modify directory size
Assume a hash function h(Kb) that produces a
bit string s.
The directory is of size 2d. d is called the
global depth and is initially 0.
Use least significant d bits of s to determine
bucket to access
Each bucket has a corresponding local depth in
the range 0, d which indicates the difference
between all the records in this bucket

21
Example

Insert 0x13, 0x10, 0x07, 0x00, 0x1f
Each page can contain no more than 2 tuples

22
Example insert 0x1f
local degree global degree log2( of arrows
pointing to this bucket)
23
Performance

2 steps for retrieving a tuple
If we can keep the directory in memory, each
retrieval is one page access
Assuming 4 bytes per entry, 4KB pages, 1GB hash
files, and we want to keep the entire directory
in memory, what is the minimum buffer size?

24
Discussion

How easy is it to keep the directory in the
memory?
How do we reduce the structure when the file
shrinks?
How do you make the directory small, and increase
space utilization? (deferred splitting)

25
Linear hashing
(a)
d 2
00
01
10
11
w
(b)
d 2
01
10
11
000
100
x
(c)
w
d 2
10
11
000
001
100
101
y
x
y
(c)
w
(d)
w
x
11
000
001
010
100
101
110
000
001
010
011
100
101
110
111
d 2
d 3
26
Addressing in linear hashing