Query Processor

About This Presentation

Title:

Query Processor

Description:

The relational algebra is used to express the output of the query. ... And resp = manager'' EMP(eno, ename, title) ASG(eno, jno, resp, dur) ... – PowerPoint PPT presentation

Number of Views:197

Avg rating:3.0/5.0

Slides: 74

Provided by: gangaI

Category:

more less

Transcript and Presenter's Notes

Title: Query Processor

1
Query Processor

A query processor is a module in the DBMS that
performs the tasks to process, to optimize, and
to generate execution strategy for a high-level
query
For a DDBMS, the QP also does data localization
for the query based on the fragmentation scheme
and generates the execution strategy that
incorporates the communication operations
involved in processing the query

2
Query Optimizer

Queries expressed in SQL can have multiple
equivalent relational algebra query expressions
The distributed query optimizer must select the
ordering of relational algebra operations, sites
to process data, and possibly the way data should
be transferred. This makes distributed query
processing significantly more difficult

3
Complexity of Relational Algebra Operations

The relational algebra is used to express the
output of the query. The complexity of relational
algebra operations play a role in defining some
of the principles of query optimization. All
complexity measures are based on the cardinality
of the relation
Operations Complexity
Select, Project (w/o duplicate elimination) O(n)
Project (with duplicate elimination), Group O(n
logn)
Join, Semi-join, Division, Set Operators O(n
logn)
Cartesian Product O(n2 )

4
Characteristics of Query Processors

Languages
Input language can be relational algebra or
calculus output language is relational algebra
(annotated with communication primitives). The
query processor must efficiently map input
language to output language
Types of Optimization
The output language specification represents the
execution strategy. There can be many such
strategies, the best one can be selected through
exhaustive search, or by applying heuristic
(minimize size of intermediate relations). For
distributed databases semijoins can be applied
to reduce data transfer.

5
When to Optimize

Static done before executing the query (at
compilation time), cost of optimization amortized
over multiple executions, mostly based on
exhaustive search. Since sizes of intermediate
relations need to be estimated, it can result in
sub-optimal strategies.
Dynamic done at run time every time the query
is executed, can make use of exact sizes of
intermediate relations, expensive, based on
heuristics
Hybrid mixes static and dynamic approaches the
approach is mainly static, but dynamic query
optimization may take place when high difference
between predicted and actual sizes are detected

6
Characteristics of Query Processors

Statistics
fragment cardinality and size
size and number of distinct values for each
attribute. detailed histograms of attribute
values for better selectivity estimation.
Decision Sites
one site or several sites participate in
selection of strategy
Exploitation of network topology
wide area network communication cost
local area network parallel execution

7
Characteristics of Query Processors

Exploitation of replicated fragments
larger number of possible strategies
Use of Semijoins
reduce size of data transfer
increase of messages and local processing
good for fast or slow networks?

8
Layers of Query Processing
Calculus Query on Distributed Relations
QUERY DECOMPOSITION
CONTROL SITE
Algebra Query on Distributed Relations
DATA LOCALIZATION
Fragment Query
GLOBAL OPTIMIZATION
Optimized Fragment Query With Communication
Operations
LOCAL SITE
LOCAL OPTIMIZATION
Optimized Local Queries
9
Query Decomposition

Normalization
The calculus query is written in a normalized
form (CNF or DNF) for subsequent manipulation
Analysis
The query is analyzed for semantic correctness
Simplification
Redundant predicates are eliminated to obtain
simplified queries
Restructuring
The calculus query is translated to optimal
algebraic query representation

10
Query Decomposition Normalization

Lexical and syntactic analysis
check validity
check for attributes and relations
type checking on the qualification
There are two possible forms of representing the
predicates in query qualification Conjunctive
Normal Form (CNF) or Disjunctive Normal Form
(DNF)
CNF (p11 ? p12 ?... ? p1n) ? ... ? (pm1 ? pm2
?... ? pmn)
DNF (p11 ? p12 ?... ? p1n) ? ... ? (pm1 ? pm2
?... ? pmn)
OR's mapped into union
AND's mapped into join or selection

11
Query Decomposition Analysis

Queries are rejected because
the attributes or relations are not defined in
the global schema or
operations used in qualifiers are semantically
incorrect
For only those queries that do not use
disjunction or negation semantic correctness can
be determined by using query graph
One node of the query graph represents result
sites, others operand relations, edge between
nodes operand nodes represent joins, and edge
between operand node and result node represents
project

12
Query Graph and Join Graph
SELECT Ename, Resp FROM E, G, J WHERE E. ENo
G. ENO AND G.JNO J.JNO AND JNAME CAD''
AND DUR gt 36 AND Title Prog''
E. ENo G. ENO
G.JNO J.JNO
Ename
Resp
JNAME CAD''
Title Prog''
E. ENo G. ENO
G.JNO J.JNO
DUR gt 36
13
Disconnected Query Graph

Semantically incorrect conjunctive multivariable
query without negation have query graphs which
are not connected

SELECT Ename, Resp FROM E, G, J WHERE E. ENo
G. ENO AND JNAME CAD'' AND DUR gt 36 AND
Title Prog''
Ename
Resp
Title Prog''
E. ENo G. ENO
DUR gt 36
JNAME CAD''
14
Simplification Eliminating Redundancy

Elimination of redundant predicates using well
known idempotency rules
p ? p p p ? p p p ? true true
p ? false p p ? true p p ? false
false
p1 ? (p1 ? p 2 ) p1
p1 ? (p1 ? p 2 ) p1
Such redundant predicates arise when user query
is enriched with several predicates to
incorporate view relation correspondence, and
ensure semantic integrity and security

15
Eliminating Redundancy-- An Example
SELECT TITLE FROM E WHERE (NOT (TITLE
Programmer'') AND (TITLE Programmer'' OR
TITLE Elec.Engr'') AND NOT (TITLE
Elec.Engr'')) OR ENAME J.Doe''
SELECT TITLE FROM E WHERE ENAME J.Doe''
16
Eliminating Redundancy-- An Example
p1 ltTITLE Programmer''gt p2 ltTITLE
Elec. Engr''gt p3 ltENAME J.Doe''gt
Let the query qualification is ( p1 ? (p1 ?
p2) ? p2) ? p3
The disjunctive normal form of the query is (
p1 ? p1 ? p2) ? ( p1 ? p2 ? p2) ? p3
(false ? p2) ? ( p1 ? false) Ú p3 false ?
false ? p3 p3
17
Query Decomposition Rewriting

Rewriting calculus query in relational algebra
straightforward transformation from relational
calculus to relational algebra, and
restructuring relational algebra expression to
improve performance

18
Rewriting -- Transformation Rules (I)

Commutativity of binary operations
R ? S ? S ? R
R ? S ? S ? R
Associativity of binary operations
(R ? S) ? T ? R ? ( S ? T )
Idempotence of unary operations grouping of
projections and selections
?A ( ? A (R )) ? ? A (R ) for A?A? A
?p1(A1) ( ? p2(A2) (R )) ? ? p1(A1) ?p2(A2) (R )

19
Rewriting -- Transformation Rules (II)

Commuting selection with projection
?A1, , An ( ? p (Ap) (R )) ? ?A1, , An ( ? p
(Ap) ( ?A1, , An, Ap(R )))
Commuting selection with binary operations
? p (Ai)(R ? S) ? (? p (Ai)(R)) ? S
? p (Ai)(R S) ? (? p (Ai)(R)) S
? p (Ai)(R ? S) ? ? p (Ai)(R) ? ? p (Ai)(S)
Commuting projection with binary operations
?C(R ? S) ? ?A(R) ? ?B (S) C A ? B
?C(R S) ? ?C(R) ?C (S)
?C (R ? S) ? ?C (R) ? ?C (S)

20
An SQL Query and Its Query Tree
?ENAME
?(ENAMEltgtJ.DOE )?(JNAMECAD/CAM )? (Dur12 ?
Dur24)
SELECT Ename FROM J, G, E WHERE G.EnoE.ENo
AND G.JNo J.JNo AND ENAME ltgt J.Doe'
AND JName CAD' AND (Dur12 or
Dur24)
PROJ
ASG
EMP
21
Query Decomposition Rewriting
?ENAME
? JNO, ENAME
? JNO
?JNAMECAD/CAM
? JNO, ENO
? ENO, ENAME
?Dur12 ? Dur24
?ENAMEltgtJ.DOE
PROJ
ASG
EMP
22
Data Localization

Input Algebraic query on distributed relations
Determine which fragments are involved
Localization program
substitute for each global query its
materialization program
optimize

23
Data Localization-- An Example
EMP is fragmented into EMP1 ?ENO? E3
(EMP) EMP2 ? E3 lt ENO? E6 (EMP) EMP3 ?ENO
gtE6 (EMP)
?ENAME
?Dur12 ? Dur24
?ENAMEltgtJ.DOE
ASG is fragmented into ASG1 ?ENO? E3
(ASG) ASG2 ?ENO gtE3 (ASG)
?JNAMECAD/CAM
PROJ
?
?
ASG1
ASG1
ASG2
EMP1
EMP1
EMP1
24
Reduction with Selection
EMP is fragmented into EMP1 ?ENO? E3
(EMP) EMP2 ? E3 lt ENO? E6 (EMP) EMP3 ?ENO
gtE6 (EMP)
SELECT FROM EMP WHERE ENOE5
Given Relation R, FRR1, R2, , Rn where Rj
?pj(R) ?pj(Rj) ? if ?x ? R ?(pi(x)?pj(x))
25
Reduction with join
SELECT FROM EMP, ASG WHERE EMP.ENOASG.ENO
EMP is fragmented into EMP1 ?ENO? E3
(EMP) EMP2 ? E3 lt ENO? E6 (EMP) EMP3 ?ENO
gtE6 (EMP)
ASG is fragmented into ASG1 ?ENO? E3
(ASG) ASG2 ?ENO gtE3 (ASG)
26
Reduction with Join (I)
(R1 ? R2) S ? (R1 S) ? (R2
S)
27
Reduction with Join (II)
?
Given Ri ?pi(R) and Rj ?pj(R) Ri Rj ? if
?x ? Ri , ?y? Rj ?(pi(x)?pj(y))
Reduction with join 1. Distribute join over
union 2. Eliminate unnecessary work
28
Reduction for VF

Find useless intermediate relations
Relation R defined over attributes A A1, A2,
, An vertically fragmented as Ri ?A (R) where
A? A
?K,D (Ri) is useless if the set of projection
attributes D is not in A

EMP1 ?ENO,ENAME (EMP) EMP2 ?ENO,TITLE (EMP)
?ENAME
SELECT ENAME FROM EMP
EMP1
29
Reduction for DHF
Distribute joins over union Apply the join
reduction for horizontal fragmentation
SELECT FROM EMP, ASG WHERE ASG.ENO
EMP.ENO AND EMP.TITLE Mech. Eng.
30
Reduction for DHF (II)
Joins over union
?
31
Reduction for HF

Remove empty relations generated by contradicting
selection on horizontal fragments
Remove useless relations generated by projections
on vertical fragments
Distribute joins over unions in order to isolate
and remove useless joins

32
Reduction for HF --An Example
EMP1 ?ENO?E4 (?ENO,ENAME (EMP)) EMP2
?ENOgtE4 (?ENO,ENAME (EMP)) EMP3 ?ENO,TITLE
(EMP) QUERY SELECT ENAME FROM EMP WHERE ENO
E5
33
Why Optimization An Example Query
Database
EMP(eno, ename, title) ASG(eno, jno, resp, dur)
RA tree
Query
Find the name of the employees who are managing a
project?
SQL
Select ename From EMP e, ASG g Where e.Eno g.
Eno And resp manager
34
Example - Strategies
Plan B
Site 5
Fragment Schema
?respmanager
EMP1 ?ENO lt 100(EMP) at site 1 EMP2 ?ENO gt
100(EMP) at site 2 ASG1 ?ENO lt 100(ASG) at
site 3 ASG2 ?ENO gt 100(ASG) at site 4
?
?
ASG1
EMP1
ASG2
EMP2
Query site Site 5
Plan A
ASG1
ASG2
35
Example DB Statistics Costs

Database Statistics
EMP has 400 tuples,
ASG has 1000 tuples,
there are 20 managers in G
the data is uniformly distributed among sites.
ASG and EMP are locally clustered on attributes
RESP and ENO, respectively
Costs
tuple access tacc 1 unit,
tuple transfer ttrans 10 units,

36
Costs for Example Plan

The cost of Plan A
Produce ASG 20 ? tacc 20 (processing
locally)
Transfer ASG 20 ttrans 200 (transfer to
EMP site)
Produce EMP (1010) tacc 2 40 (join at
the EMP site)
Transfer EMP 20 ttrans 200 (send to Site
5)
Total cost 460
The cost of Plan B
Transfer EMP 400 ttrans 4,000 (send EMP
to Site 5)
Transfer ASG 1000 ttrans 10,000 (send ASG
to Site 5)
Produce ASG 1000 tacc 1,000 (selection
at Site 5)
Join EMP and ASG 400 20 tacc 8,000
(join at Site 5)
Total cost 23,000

37
Query Optimization

Problems in query optimization
Determining the physical copies of the fragments
upon which to execute the fragment query
expressions (also known as materialization)
Selecting the order of execution of operations
Selecting the method for executing each operation
The above problems are not independent, for
instance, the choice of the best materialization
for a query depends on the order in which
operations are executed. But they are treated as
independent. Further,
We bypass (1) by taking materialization for
granted
We bypass (3) by clustering all operations at the
same site as a local database system dependent
problem

38
Query Optimization - Objectives

The selection of alternative query execution
strategies is made based on predetermined
objectives
Two main objectives
minimize the total processing time (total cost)
network and computers at nodes do not get loaded.
Response time cannot be guaranteed
minimize the response time
allocation must facilitate parallel execution of
the query
but throughput may decrease and cost can be
higher than total cost
Total processing time (cost) is the sum of all
the time (cost) incurred in executing the query
(CPU, I/O, data transfer)
Response time is the elapsed time from the
initiation till the completion of the query

39
Optimization Algorithms The Issues

Cost model
cost components
weights for each components
costs for primitive operations
Search space
The set of equivalent algebra expressions (query
trees)
Search strategies
How do we move inside the search space
Exhaustive search, heuristics,

40
Cost Models

The cost measures are I/O and CPU for
centralized DBMSs and I/O, CPU and data transfer
costs for DDBMS
Total cost CPU cost I/O cost communication
cost
CPU cost Ccpu insts
I/O cost C i/o i/os
Communication Cost Cmsgmsgs Ctrbytes
Ccpu, C i/o, Ctr and Cmsg are all assumed to be
constants.
Response time sum (sequential operations)
Ccpus_insts
Ci/os_i/os
Cmsgs_msg ctrs_bytes
S_x stands for maximum number of sequential xs
that need to be executed to process the query

41
Intermediate Result Size

The size of the intermediate relations produced
during the execution facilitates the selection of
the execution strategy
This is useful in selecting an execution strategy
that reduces data transfer
The sizes of intermediate relations need to be
estimated based on cardinalities of relations and
lengths of attributes
RA1, A2,..., An fragmented as R1,R2,, Rn the
statistical data collected typically are
len(Ai), length of attribute Ai in bytes
min(Ai) and max(Ai) for ordered domains
card(dom(Ai)) unique values in domAi
Number of tuples in each fragment card(Rj)

42
Intermediate Size Estimation

Join selectivity factor
SFj(r,s) card(r s) / card(r) card(s)
Selecton selectivity factor
SFS(F) card(?f(r)) / card(r)
size(r) card(r) len(r)
Cardinality of intermediate relations
SFS(A value) 1/card(dom(A))
SFS(A gt value) max(A) - value/max(A)-min(A)
SFS(A lt value) value - min(A)/max(A)-min(A)
Sfs(p(Ai)?p(Aj)) sfs(p(Ai)) sfs(p(Aj))
Sfs(p(Ai)? p(Aj)) sfs(p(Ai)) sfs(p(Aj)) -
sfs(p(Ai)) sfs(p(Aj))
SFS(A ? values) SFS(A value) card(values)

43
Intermediate Size Estimation (II)

Projection
card(?a(r)) card(r)
Cartesian product
card(r X S) card(r) card(s)
Join
card(R AB S) card(s)
if A is key in R, B is foreign key in S
card(R AB S) SFJ(R,S) card(r)
card(s)
Union
Upper bound card(r) card(s)
Lower bound maxcard(r), card(s)

44
Cost of Processing Primitive Operations

Selection
Projection
Union
Join
nested-loops
sort-merge
hash-based
For distributed join, semi-join is proposed to
perform joins

45
Semi-join
Amount of data transferred R S

join is replaced with a project followed by
semi-join and then join
the project and join operations are done at one
site, and semi-join at another site
amount of data transferred R S

46
Semi-join versus Join

using sem-ijoin increases local processing costs
because a relation must be scanned twice (join,
project)
For joining intermediate relations produced
during sem-ijoin one cannot exploit indices on
the base relations
Sem-ijoin may not be good when communication
costs are low

47
Search Space
SELECT ENAME, RESP FROM EMP, ASG, PROJ WHERE
EMP.ENOASG.ENO AND ASG.PNOPROJ.PNO

Search space is characterized by alternative
execution plans
Most optimizers focus on join trees
For N relations, there are O(N!) equivalent join
trees

?
ASG
EMP
PROJ
48
Restricting Search Space
deep tree

O(N!) is large
Considering join methods, the search space is
even bigger
Restrict by means of heuristics
Ignore cartisian product
Restrict the shape of the join tree
Only consider deep trees
.

R4
R1
R3
R2
bushy tree
R4
R3
R1
R2
Left-deep tree
R1
R2
R3
R4
49
Search Strategy

How to move in the search space to find the
optimal plan
Deterministic
Start from base relations and build plans by
adding relations at each step
Dynamic programming breadth-first
Greedy depth-first
Randomized
Search for the optimal one around a particular
starting point
simulated annealing
iterative improvement

50
Search Strategies -- Example
Deterministic
Randomized
51
Distributed Query Optimization Algorithms

System R and R
Hill Climbing and SDD-1

52
System R (Centralized) Algorithm

Simple (one relation) queries are executed
according to the best access path.
Execute joins
Determine the possible ordering of joins
Determine the cost of each ordering
Choose the join ordering with the minimal cost
For joins, two join methods are considered
Nested loops
Merge join

53
System R Algorithm -- Example

Names of employees working on the CAD/CAM
project
Assume
EMP has an index on ENO,
ASG has an index on PNO,
PROJ has an index on PNO and an index on PNAME

54
System R Algorithm -- Example

Choose the best access paths to each relation
EMP sequential scan (no selection on EMP)
ASG sequential scan (no selection on ASG)
PROJ index on PNAME (there is a selection on
PROJ based on PNAME)
Determine the best join ordering
EMP ASG PROJ
ASG PROJ EMP
PROJ ASG EMP
ASG EMP PROJ
EMP ? PROJ ASG
PROJ ? EMP ASG
Select the best ordering based on the join costs
evaluated according to the two methods

55
System R Example (cont'd)
PROJ
EMP
ASG
ASG EMP
PROJ ASG
EMP PROJ

Best total join order is one of

56
System R Algorithm

(PROJ ASG) EMP has a useful index on
the select attribute and direct access to the
join attributes of ASG and EMP.
Final plan
select PROJ using index on PNAME
then join with ASG using index on PNO
then join with EMP using index on ENO

57
System R Distributed Query Optimization

Total-cost minimization. Cost function includes
local processing as well as transmission.
Algorithm
For each relation in query tree find the best
access path
For the join of n relations find the optimal join
order strategy
each local site optimizes the local query
processing

58
Data Transfer Strategies

Ship-whole. entire relation is shipped and stored
as temporary relation, merge join algorithm is
used, done in pipeline mode
Fetch-as-needed. this method is equivalent to
semijoin of the inner relation with the outer
relation tuple

59
Join Strategy 1

External relation R with internal relation S, let
LC be local processing cost, CC be data transfer
cost, let average number of tuples of S that
match one tuple of R be s
Strategy 1. Ship the entire outer relation to the
site of internal relation
TC LC(get R)
CC(size(R))
LC(get s tuples from S)card(R)

60
Join Strategy 2

Ship the entire inner relation to the site of the
outer relation
TC LC(get S)
CC(size(S))
LC(store S)
LC(get R)
LC(get s tuples from S)card(R)

61
Join Strategy 3

Fetch tuples of the inner relation for each tuple
of the outer relation
TC LC(get R)
CC(len(A)) card(R)
LC(get s tuples from S) card(R)
CC(slen(S))card(R)

62
Join Strategy 4

Move both relations to 3rd site and join there
TC LC(get R)
LC(get S)
CC(size(S))
LC(store S)
CC(size(R))
LC(get s tuples from S)card(R)
Conceptually, the algorithm does an exhaustive
search among all alternatives and selects one
that minimizes total cost

63
Hill Climbing Algorithm - Algorithm

Inputs
query graph, locations of relations, and relation
statistics
Initial solution
the least costly among all when the relations
are sent to a candidate result site denoted by
ES0, and the site as chosen site
Splits ES0 into
ES1 ship one relation of join to the site of
other relation
ES2 these two relations are joined locally and
the result is transmitted to the chosen site
If cost(ES1) cost(ES2) LC gt cost (ES0) select
ES0,
else select ES1 and ES2.
The process can be recursively applied to ES1 and
ES2 till no more benefit occurs

64
Hill Climbing Algorithm - Example
?SAL
TITLE
PAY
EMP
?PNAMECAD/CAM
ASG
Ignore the local processing cost Length of tuples
is 1 for all relation
PROJ
65
HCA - Example
Site1 EMP(8)
ES1
?
Site2 PAY(4)
ES2
Solution 1 Cost
TITLE
?
Site4 ASG(10)
Site3 PROJ(1)
?
ES3
Site1 EMP(8)
ES1
Site2 PAY(4)
ESo is the BEST
ES2
Solution 2 Cost
Site4 ASG(10)
ES3
Site3 PROJ(1)
66
Hill Climbing Algorithm - Comments

Greedy algorithm determines an initial feasible
solution and iteratively tries to improve it.
If there are local minimas, it may not find the
global minima
If the optimal solution has a high initial cost,
it wont be found since it wont be chosen as the
initial feasible solution.

Site1 EMP(8)
Site2 PAY(4)
Site4 ASG(10)
Site3 PROJ(1)
COST
67
SDD-1 Algorithm

SDD-1 algorithm generalized the hill-climbing
algorithm to determine ordering of beneficial
semijoins and uses statistics on the database,
called database profiles.
Cost of semijoin
Cost (R SJA S) CMSG CTRsize(?A(S))
Benefit is the cost of transferring irrelevant
tuple
Benefit(R SJA S) (1-SFSJ(S.A)) size(R) CTR
A semijoin is beneficial if cost lt benefit.

68
SDD-1 The Algorithm

initialization phase generates all beneficial
semijoins, and an execution strategy that
includes only local processing
most beneficial semijoin is selected statistics
are modified and new beneficial semijoins are
selected
the above step is done until no more beneficial
joins are left
assembly site selection to perform local
operations
postoptimization removes unnecessary semijoins

69
SDD1 - Example
SELECT FROM EMP, ASG, PROJ WHERE EMP.ENO
ASG.ENO AND ASG.PNO PROJ.PNO
Site 2 ASG
ENO
PNO
Site 1 EMP
Site 3 PROJ
70
SDD1 - First Iteration