Overview of Relational Database Systems - PowerPoint PPT Presentation

1 / 28
About This Presentation
Title:

Overview of Relational Database Systems

Description:

Assume rocking scan is used to read pages of S. The I/O cost is. N M ( N / K1 - 1) (M - K2) ... (keep the duplicate rows): Scanning each tuple once. CPU ... – PowerPoint PPT presentation

Number of Views:28
Avg rating:3.0/5.0
Slides: 29
Provided by: Informatio313
Category:

less

Transcript and Presenter's Notes

Title: Overview of Relational Database Systems


1
Overview of Relational Database Systems
  • From Physical Storage to
  • Query Optimization

2
The Relational Data Model
  • Table attribute
  • attribute column of a table
  • tuple row of a table
  • superkey, key, primary key, candidate key

3
Two Example Relations
Employee
Department
4
The Core Relational Algebra
  • Selection (?)
  • Projection (?)
  • Cartesian Product (?)
  • Union (?)
  • Set Difference (-)
  • Set Intersection (?)
  • Join (?c )
  • Division (?)

5
SQL
select Employee.Name, Department.Phone from
Employee, Department where Employee.Dept
Sales and Employee.Dept Department.Name
? Employee.Name,Department.Phone (?
Employee.DeptSales and Employee.DeptDepartment
.Name (Employee ? Department))
SQL is more powerful than relational algebra
because it can do many things, such as computing
aggregates for different groups of tuples, that
relational algebra cannot do.
6
Query Processing An Overview
select from R, S, T where R.A gt a and R.B S.B
and S.C T.C
A possible execution plan for this query 1.
Perform selection ?Agta(R) based on a sequential
scan of the tuples of R. Let R1 denote the result
of ?Agta(R). 2. Perform join R1 ?R1.BS.B S using
the sort merge join algorithm. Let R2 denote the
result of the join. 3. Perform join R2 ?R2.CT.C
T using the nested loop join algorithm.
7
Query Optimization An Overview
  • The goal of query optimization is to find an
    execution plan which can be evaluated with
    minimum cost, I.e., an optimal plan.
  • In centralized database, the query evaluation
    cost I/O cost (dominant) CPU cost
  • I/O cost can be cut down by employing B trees
    hashing, etc.
  • The of equivalent execution plans for a given
    query (m!)km where there are m operations in
    the query each operation can be evaluated in k
    different ways.
  • The search space for query optimization the set
    of all equivalent execution plans.
  • We dont want to spend too much time to find a
    plan as query processing time time to find a
    plan time to execute the plan.

8
Two-level database architecture
  • Main memory
  • fast access (0.01?s per instruction for 100mips
    CPU)
  • small capacity
  • volatile
  • expensive
  • Secondary storage (disk)
  • slow access (about 10ms per disk block)
  • large capacity
  • nonvolatile
  • cheap

9
Secondary Storage
  • Typical disk drives
  • spindle
  • actuator
  • read/write heads
  • tracks
  • cylinders
  • pages (sectors / blocks)
  • page address (cyl, trk, pg)
  • The cost of transferring one page betn memory
    disk sum of
  • seek time (for track)
  • rotational delay / latency (for page)
  • block transfer time rotation time x (page size
    / track size), why?

10
Storage Hierarchy
  • A page I/O is sequential if transferring data
    incurs neither seek time nor rotational delay
    otherwise, the page I/O is random. So try to
    reduce of random page I/Os!
  • Other architectures include
  • One-level architecture all data in the main
    memory database system are kept in main memory.
  • Three-level architecture Tertiary storage like
    optical disks or tape drives is needed for
    applications such as EOS/DIS (earth observation
    system/data information system) which stores as
    many as 10 perabytes (1 million GB) of data.

11
B Tree
  • Each node is stored as a page
  • Its a balanced tree
  • Every nonroot internal node is at least half full
  • The leaf node pointers provide a way to access
    the tuples in an ordered manner and are useful
    for evaluating range conditions
  • Acceptable overhead on insertion, deletion, and
    space
  • B tree indices
  • B tree file organization
  • B tree indices

Exercise Read the algorithms for B/B-tree
search, insertion, deletion by yourself from a
data org/database book.
12
Hashing
  • The idea is to build a hash table that contains
    an index entry for each tuple in the relation and
    to use a hash function h( ) to quickly identify
    each entry in the hash table.
  • The hash table consists of many buckets
    (identified by their bucket numbers) which
    corresponds to one or more disk pages.
  • Each bucket consists of a of index entries of
    the form (a, P), where a is the attribute value
    of some tuple and P is a tuple pointer, pointing
    to the tuple on disk.

Exercise Read the implementation techniques for
static hashing dynamic hashing from a data
org/database book.
13
Reminder !!!
B?-trees Hashing are the most fundamental
techniques in organizing retrieving information
efficiently. Make sure you understand their
details.
14
Evaluating Join
Among the three most frequently used relational
operations (I.e., selection, projection, join),
join is the most expensive operation.
Consider R ?R.AS.B S. Let n m are the numbers
of tuples in R S, respectively, and N M are
the sizes of R S in pages, respectively. We
also assume M ? N.
15
Nested Loop Join Algorithm
for each tuple x in R for each tuple y in S if
xA yB then return (x, y)
outer relation inner relation
CPU cost is always O(nm)
To optimize the I/O cost gt to minimize the
of I/O pages
16
Nested Loop I/O Cost
for each K1 pages P of R for each K2 pages Q of
S for each tuple x in P for each tuple y in Q if
xA yB then return (x, y)
Assume K1 K2 K buffer pages available and K1
? N and K2 ? M. Assume rocking scan is used to
read pages of S.
The I/O cost is N M (? N / K1? - 1) (M - K2)
(A)
Min when K1 minN, K-1
When S is the outer relation, the I/O cost is M
N (? M / K2? - 1) (N - K1)
Min when K2 minM, K-1
(B)
17
Nested Loop I/O Cost (Cont)
Min(A) is more likely to be greater than min(B),
I.e., to minimize the of I/O pages, use the
smaller relation as the outer relation and let it
use as many buffer pages as needed (minM, K-1).
To minimize the of I/O operations initiated and
say when R is used as the outer relation, the
of I/O operations needed is ? N/K1 ? ? M/K2 ?
( ? N/K1 ? - 1)( ? M/K2 ? - 1) ? N/K1 ??
M/K2 ? 1 Ignoring the ceilings, it reaches the
min when K1 K2 K/2. So to minimize the of
I/O operations, it does not matter which of R S
is used as the outer relation.
18
Sort Merge Join Algorithm
1. Sort the two relations in ascending order of
their respective joining attributes, I.e., sort R
on A and sort S on B if they are not already
sorted. 2. Perform a merge join. Two cases to
consider a) the values under at least one
joining attribute are distinct, b) both
attributes have repeating values.
Worst case (none is sorted nearly all values of
the joining attributes are the same) CPU cost
O(n log n m log m nm) I/O cost O(N log N
M log M C(R,S)) where C(R,S) is the cost of
performing (R x S) which is the same cost as
nested loop algorithm.
Best case (sorted no repeated value) CPU cost
O(n m) I/O cost O(N M)
19
Hash Join Algorithm
  • 1. Build a hash table for the smaller relation S
    based on the joining attribute. Rather than
    putting the pointers to tuples in the buckets,
    the tuples themselves are placed in the buckets.
  • 2. Use the larger relation R to probe the hash
    table to perform the join. The probe process is
    described below
  • for each tuple x in R
  • hash on the joining attribute using the same
    hash function used above to find a bucket in the
    hash table
  • if the bucket is nonempty
  • for every tuple y in the found bucket
  • if xA yB then return (x, y)

CPU cost O(m n b) where b avg of tuples
per bucket I/O cost O(N M)
20
Comparison of Join Algorithms
  • Hash join is very efficient but is only
    applicable to equijoin.
  • Sort merge join performs better than nested loop
    when both relations are large, esp. true if one
    or both relations are already sorted on the
    joining attributes.
  • Nested loop join performs well when one relation
    is large one is small. A special case is when
    the smaller relation can be entirely held in main
    memory. In this case, both relations need to be
    read in only once. When combined with the index
    on the joining attribute of the (larger) inner
    relation, excellent performance can yield.

21
Evaluating Selection
?A op a (R)
A is an attribute a is a constant op ? , ?, lt,
?, gt, ?
If op is ?, then most tuples of R are likely to
satisfy the condition gt sequential scanning. So
for the next few OHPs, we assume op is not ?.
22
Selectivity
Defn The selectivity of A op a on R, denoted as
S A op a (R), is the percentage of the tuples of
R that satisfy A op a.
Most commercial systems maintain detailed
statistics about the values of each attribute.
These statistics are typically in the form of a
histogram. For poor systems without histograms
but just some statistics, SAa(R) est 1 /
dist(A) SAgta(R) est (max(A) - a) / (max(A) -
min(A)) where dist(A) is the number of distinct
values of A in R.
23
Selection Cost (without fast access path)
Let k be nS A op a (R), I.e., the of tuples in
R that satisfy A op a. Suppose fast access path
is unavailable or not used.
Case 1 (A is sorted) use binary search. CPU
cost O(log n k). I/O cost O(log N
?(k/n)N?) where N is of pages to hold R.
Case 2 (A is not sorted) need sequential scan on
all tuples. CPU cost O(n). I/O cost O(N).
24
Selection Cost(with fast access path)
Case 1 (Sorted A-values) I.e., the fast access
path is a clustered index. Btree takes constant
steps to get the first match. So the CPU cost is
O(k), the I/O cost is O(?(k/n)N?).
Case 2 (Unsorted A-values) CPU cost is still
O(k). The I/O cost is bounded by O(mink, N).
25
Evaluating Projection
? A1,,At (R) where A1,,At are attributes of R.
Case 1 (keep the duplicate rows) Scanning each
tuple once. CPU cost is O(n). I/O cost is O(N).
Case 2 (select distinct) Step 1 as above. Step
2 sort the result. Step 3 remove adjacent
duplicates. CPU cost is dominated by sorting O(n
log n). The I/O cost is dominated by step 1 2.
The I/O cost for step 1 is O(N). Let W be the
size of the step 1 result (W n(? i1..t
length(Ai))/PageSize). The I/O cost for step 2 is
O(W log W).
26
Is the time to find an optimal plan too expensive?
  • If the queries are submitted many times, spending
    more effort to find an (nearly) optimal execution
    plan is worthwhile.
  • For some special types of queries (chain query,
    star query, tree query), an optimal execution
    plan can be found in a reasonable amount of time.
  • For general queries, either heuristics are used
    to find a reasonably optimal plan or a reduced
    search space is used. E.g., in System R, only
    execution plans corresponding to left deep join
    trees are considered.

27
Algebra-Based Optimization(based on heuristic
rules)
  • Perform selection as early as possible
  • Replace Cartesian products by joins whenever
    possible
  • If there are several joins, perform the most
    restrictive joins first
  • Project out useless attributes early

28
Cost Estimation-Based Optimization (e.g. IBM DB2)
Idea For each query, enumerate all possible
execution plans. For each plan, estimate the
cost. Finally choose the one with lowest
estimated cost.
  • Difficulties
  • Too many possible execution plans to enumerate.
  • It may be too difficult to estimate the cost of
    each execution plan accurately, mostly because of
    the chain operations (as need to estimate the
    size of the result from previous operation(s)).
Write a Comment
User Comments (0)
About PowerShow.com