ICS 214A: Database Management Systems - PowerPoint PPT Presentation

1 / 39

About This Presentation

Title:

ICS 214A: Database Management Systems

Description:

Read write records, scan relations. Get page containing tuples. Read/write file pages ... simple to implement using a scan. relation scan. index scan ... – PowerPoint PPT presentation

Number of Views:20

Avg rating:3.0/5.0

Slides: 40

Provided by: che7

Category:

more less

Transcript and Presenter's Notes

Title: ICS 214A: Database Management Systems

1
ICS 214A Database Management Systems

Part III Query Processing and Optimization
Professor Sharad Mehrotra

2
Big picture
We are here!
SQL statements
Access plan
Query Processor
optimizer
Read write records, scan relations
Get page containing tuples
Indexes
Record-oriented file system
Buffer manager
Basic file system
Read/write file pages
Hardware
3
Query Processor Optimizer

Overview
Query processor
Query optimizer

4
Overview query processing steps

Parsing
? Validation
? Optimization
? Execution

5
Parsing

SQL --gt Parse Tree
Good old lex and yacc
Detect and reject syntax errors

6
Example SQL query

Find the movies with stars born in 1960
SELECT title
FROM StarsIn
WHERE starName IN (
SELECT name
FROM MovieStar
WHERE birthdate LIKE 1960
)

7
Example Parse Tree
ltQuerygt
ltSFWgt
SELECT ltSelListgt FROM ltFromListgt
WHERE ltConditiongt
ltAttributegt ltRelNamegt
ltTuplegt IN ltQuerygt
title StarsIn
ltAttributegt ( ltQuerygt )
starName ltSFWgt
SELECT ltSelListgt FROM ltFromListgt
WHERE ltConditiongt
ltAttributegt ltRelNamegt
ltAttributegt LIKE ltPatterngt
name MovieStar
birthDate 1960
8
Validation

Validator parse tree --gt initial logical plan
Detect reject semantic errors
Nonexistent tables/views/columns
Insufficient access priveleges
Type mismatch
E.g., AVG(name), name GPA, Student UNION Enroll
Also
Expand
Expand view definitions
Validator uses information in system catalogs for
all the above

9
Example Initial Logical Query Plan
?title

NOTE
Represented as a query tree though sometimes a
DAG if it has common subexpressions
Usually common subexpressions evaluating
separately and stored in temporary file
Effectively query plan corresponds to a tree
Not all systems use relational
Algebra for representing logical
Query plans. DB2 uses the query Graph model

?starNamename
?
StarsIn ?name
?birthdate LIKE 1960
MovieStar
10
Optimizer

Given an initial Logical Plan generate a physical
plan that is Optimal with respect to one or
more performance metrics.
Performance Metrics
I/O, CPU, Communication, Power Consumption, Cache
misses, of internet queries, . (depends upon
domain)

11
Query Optimizer

Many logical plans are equivalent
Databases support typically many ways to
implement physical operators
Query optimization is a difficult search problem
of identifying an optimal plan.
Issues
Search space of plans
Which plans does the optimizer consider
Cost estimation
how to assign cost to a plan
Enumeration algorithm
Search space

12
Optimizer use rewrite rules to generate better
logical plans
?title
starNamename
StarsIn ?name
?birthdate LIKE 1960
MovieStar
13
Optimizers may consider only subset of plans

Different optimizers may consider different types
of trees.

B
Left deep tree
Bushy tree

These trees are different since inputs to binary
operator are not symmetric.

Right deep tree
14
Multiple physical plans may exist for a given
logical plan

Hash join
SEQ scan
index scan
StarsIn MovieStar
15
Logical vs Physical Plan

Logical Plan

Physical Plan

Nested Loop
Merge Join
File scan R3
Sort
Sort
File scan R1
File scan R2
16
Query Processor

Implements a bunch of algorithms that form the
building blocks of the physical plan.
Unary operators
Tuple level select, project (can be evaluated on
tuple directly)
Table level sort, hash, aggregation, group by.
binary operators
join, union, intersection, difference.
Important Issue
Query processor architecture
how is the execution of different physical
operators synchronized and how is data passed
form one operator to another.

17
Example
R(A,B,C), S(C,D,E)

Select B,D
From R,S
Where R.A c ? S.E 2 ? R.CS.C

18
R A B C S C D E a 1 10 10 x 2 b 1 20 2
0 y 2 c 2 10 30 z 2 d 2 35 40 x 1 e 3 45 50
y 3
19
Plan I

?B,D
sR.Ac? S.E2 ? R.CS.C
x
R S

OR ?B,D sR.Ac? S.E2 ? R.C S.C (R x S)
20
R x S R.A R.B R.C S.C S.D S.E a 1 10 10
x 2 a 1 10 20 y 2 . .
C 2 10 10 x 2 . .
21
Plan II

?B,D
sR.A c sS.E 2
R S

natural join
22
R S A B C s (R) s(S) C D
E a 1 10 A B C C D E 10
x 2 b 1 20 c 2 10 10 x 2 20 y
2 c 2 10 20 y 2 30 z 2 d 2
35 30 z 2 40 x 1 e 3 45
50 y 3
23
Plan III

Use R.A and S.C Indexes
(1) Use R.A index to select R tuples with R.A
c
(2) For each R.C value found, use S.C index to
find matching tuples
(3) Eliminate S tuples S.E ? 2
(4) Join matching R,S tuples, project B,D
attributes and place in result

24
R S A B C C D E a 1
10 10 x 2 b 1 20 20
y 2 c 2 10 30 z 2 d 2 35
40 x 1 e 3 45
50 y 3
A
C
I1
I2
25
Query Processor physical level

Implements a bunch of algorithms that form the
building blocks of the physical plan.
Unary operators
sort, hash, select, project, aggregation, group
by.
binary operators
join, union, intersection, difference.
Important Issue
Query processor architecture
how is the execution of different physical
operators synchronized and how is data passed
form one operator to another?

26
Query Processor Architecture

Consider a simple join operator and query
consisting of 2 joins. How should data be passed
form one join to another?
Create temporary files.
Create a process for each operator and use IPC
(e.g, shared memory, pipes).
OS does scheduling of these processes.

27
Query Processor Architecture (cont)

Take a query plan and convert as much of it as
possible into a block that can be executed and
one single iterative program with loops and other
control structures
Implement operators as iterators with open,
next, close as a single process easy to
parallelize

28
Iterators

Demand-driven evaluation of query tree.
Operators exchange data in units such as records
Each operator supports the following interfaces
open
next
close
open() at top of tree results in cascade of opens
down the tree.
An operator getting a next() call may recursively
make next() calls from within to produce its next
answer.
close() at top of tree results in cascade of
close down the tree

29
Benefits of Iterators

Query executed as one process
Items produced one at a time
No temporary files
amenable to parallelization

30
Sample Iterators

Print
Open()
open input
Next()
call next on input format the item to screen
Close()
close input

31
Sample Iterators (cont)

Scan
open
open file
next
read next item
close
close file

32
Sample Iterators (cont)

Sort (merge sort)
open
open input
build all initial run files calling next on
input
close input
merge run files
next
determine next output item read new item from
correct run file
close
destroy remaining run files

33
Operators in Relational Model

Tuple _at_ a time unary (?,?)
can be evaluated immediately
tuple at a time without looking at entire
relation
Full-relation unary operators
S - duplicate removal
VL- aggregations, grouping
E - sorting
Binary operators

34
Operator Implementation

Tuple _at_ a time operator
simple to implement using a scan
relation scan
index scan
Implementation of other operators can be
classified according to the following
sort based
hash based
nested loop based
Furthermore, each of the above may or may not
exploit indices.

35
Sorting

If data lt M, no problem
read into memory and sort using your favorite
method. (usually quick sort)
what if data does not fit memory?
Uses external sort
Abstract description of external sort
divide input into runs R1,,Rn which are
themselves sorted (run generation)
merge runs into sorted output (merging)
Notice 2 different sort algorithms used.
1. For in memory sorting
2. For managing subsets of data on disk.

36
Run Generation

Simple idea
read enough data as fits memory
use MM sorting
output run
Replacement Selection
organize memory as priority queue
initially fill memory with items and organize as
priority queue
pop smallest entry and put in run
immediately replace entry in PQ by a new item
if new value gt largest outputted so far, the new
value would be outputted in same run.

37
Replacement Selection

Advantages
larger size runs, hence merge can be done in
fewer phases
Disadvantages
makes memory management more complex
data read from disk in blocks. Should we copy it
to a PQ and store separately (copy overhead)
else 1/2 buffer would be wasted since records
already outputted to run

38
Merging Runs

Read size C cluster from each run so as to fill M
Merge them and store in output
Continue recursively until only 1 run remaining

39
Complexity of Sort

Let W of initial runs, (W R/M if
quicksort, else R/2M 1 if replacement
selection used.
C cluster size
F fan in of merge
of levels of Merge
Total I/O cost
cost of run creation L 2 R/C times I/O
cost of reading C
What should C be?
Smaller value of C allows for max. of runs to
be merged in 1 iteration.
Minimizes the number of phases and hence the
amount of data read and written to disk
larger value of Cgt more sequential I/O per
iteration.

40
Choosing Cluster Size (usually large cluster size
with small fan-in is good)

Let R 51200KB, M 160KB
Num. Of runs 51200/160 320
disk latency seek time 25ms, transfer rate
2ms/4KB
If C 16KB
F 160/16 - 1 9
number of merge levels, L log_9(320) 3
reading 16KB takes 25 2 4 33ms
total time 2 L R/C 33 ms 2 3 3200
33/60000 approx. 10.56 minutes.
If C 4KB
F 160 /4 - 1 39
numbe of merge levels, L log_39(320) 2
reading 4KB takes 25 4 29 ms
total time 2 2 29 3200 4 /60000
24.74 minutes

41
Optimizing Merge

What if initial number of runs W is not power of
F?
Given 12 runs, and fan-in of 10
cost of strategy 1 12 10 2 24
cost of strategy 2 12 3 15
In the first merge, do not merge runs
aggressively as many as possible. Instead, just
merge enough to ensure subsequent merges will use
full fan-in F
At any stage, merge as many runs as possible
using the smallest run files available

42
Implementing Sort Based Algorithms

Sort-based Unary operators (duplicate
elimination, aggregation)
naive sort first then compile
aggregate/eliminate duplicates
smart merge aggregation/duplicate elimination
with run generation

43
Duplicate Elimination using Sort

Remove duplicates during run generation
duplicates never permitted in runs that are
created.
This results in shorter runs ( a run size never
exceeds the output size).
Hence cost can significantly reduce.
E.g., in extreme case if the file completely
consists of duplicates, we only pay cost of
reading it once and do not pay overhead of run
generation, run storage, etc.

44
Aggregation using Sort

sort based on group-by criteria.
For different aggregate queries, different
information maintained per group.
E.g., for min query, in the run created, we do
not maintain duplicates, rather maintain a min
for a given run. During merging, the min of the
new run is the minimum of the values associated
with the min for groups with individual runs.

45
Implementing Sort-Based Operators

Binary operators (join, union (set-based
semantics), intersection, difference, semi-join)
naïve first sort the individual relations and
then compute the binary operation.
Smart join the last merge steps of the sort on
both relations and compute the binary operation
in that step.

46
Sort based Union Operator (R union S)

divide R and S into runs using quicksort and sort
runs individually
Let F be the fan-in of sorting
Merge runs of R and S until the remaining runs of
R and S taken together is less than or equal to
F.
Read each remaining run of R and S into memory
and compute union
same as duplicate removal from these runs taken
together.
Similar algorithms can easily be devised for
sort-based intersection, and set difference

47
Merge Join

Similar to sort-based implementation of other
operators
Divide R and S into runs and merge runs until
there are just enough runs of R and S left over
such that these runs can be merged together in
memory.
Merging runs of R and S for join slightly
different compared to other binary operators such
as set difference, union since both inputs do not
advance simultaneously in presence of duplicates.
E.g., say R and S both have duplicate entries for
a join value v 5.
Runs of R are advanced a tuple at a time, and the
corresponding entries in the runs of S is looked
up.
It is possible we may have to read pages
containing duplicates (I.e., v 5 entries) in S
repeatedly.