Title: L05: Distributed Query Processing
1L05 Distributed Query Processing Optimization
- Query Processing
- Query Decomposition
- Data Localization
- Query Optimization
2Query Processing
- Any high-level query (SQL) on a database must be
processed, optimized and executed by the DBMS - The high-level query is scanned, and parsed to
check for syntactic correctness - An internal representation of a query is created,
which is either a query tree or a query graph - The DBMS then devises an execution strategy for
retrieving the result of the query. (An execution
strategy is a plan for executing the query by
accessing the data, and storing the intermediate
results) - The process of choosing one out of the many
execution strategies is known as query
optimization
3Query Processor
- A query processor is a module in the DBMS that
performs the tasks to process, to optimize, and
to generate execution strategy for a high-level
query - For a DDBMS, the QP also does data localization
for the query based on the fragmentation scheme
and generates the execution strategy that
incorporates the communication operations
involved in processing the query
4Query Optimizer
- Queries expressed in SQL can have multiple
equivalent relational algebra query expressions - The distributed query optimizer must select the
ordering of relational algebra operations, sites
to process data, and possibly the way data should
be transferred. This makes distributed query
processing significantly more difficult
5Complexity of Relational Algebra Operations
- The relational algebra is used to express the
output of the query. The complexity of relational
algebra operations play a role in defining some
of the principles of query optimization. All
complexity measures are based on the cardinality
of the relation - Operations Complexity
- Select, Project (w/o duplicate elimination) O(n)
- Project (with duplicate elimination), Group O(n
logn) - Join, Semi-join, Division, Set Operators O(n
logn) - Cartesian Product O(n2 )
- This was given in the book (p194). It is over
simplified.
6Characteristics of Query Processors
- Languages
- Input language can be relational algebra or
calculus output language is relational algebra
(annotated with communication primitives). The
query processor must efficiently map input
language to output language - Types of Optimization
- The output language specification represents the
execution strategy. There can be many such
strategies, the best one can be selected through
exhaustive search, or by applying heuristic
(minimize size of intermediate relations). For
distributed databases semijoins can be applied
to reduce data transfer.
7When to Optimize
- Static done before executing the query (at
compilation time), cost of optimization amortized
over multiple executions, mostly based on
exhaustive search. Since sizes of intermediate
relations need to be estimated, it can result in
sub-optimal strategies. - Dynamic done at run time every time the query
is executed, can make use of exact sizes of
intermediate relations, expensive, based on
heuristics - Hybrid mixes static and dynamic approaches the
approach is mainly static, but dynamic query
optimization may take place when high difference
between predicted and actual sizes are detected
8Characteristics of Query Processors
- Statistics
- fragment cardinality and size
- size and number of distinct values for each
attribute. detailed histograms of attribute
values for better selectivity estimation. - Decision Sites
- one site or several sites participate in
selection of strategy - Exploitation of network topology
- wide area network communication cost
- local area network parallel execution
9Characteristics of Query Processors
- Exploitation of replicated fragments
- larger number of possible strategies
- Use of Semijoins
- reduce size of data transfer
- increase of messages and local processing
- good for fast or slow networks?
10Layers of Query Processing
Calculus Query on Distributed Relations
QUERY DECOMPOSITION
CONTROL SITE
Algebra Query on Distributed Relations
DATA LOCALIZATION
Fragment Query
GLOBAL OPTIMIZATION
Optimized Fragment Query With Communication
Operations
LOCAL SITE
LOCAL OPTIMIZATION
Optimized Local Queries
11L05 Distributed Query Processing Optimization
- Query Processing
- Query Decomposition
- Data Localization
- Query Optimization
12Query Decomposition
- Normalization
- The calculus query is written in a normalized
form (CNF or DNF) for subsequent manipulation - Analysis
- The query is analyzed for semantic correctness
- Simplification
- Redundant predicates are eliminated to obtain
simplified queries - Restructuring
- The calculus query is translated to optimal
algebraic query representation
13Query Decomposition Normalization
- Lexical and syntactic analysis
- check validity
- check for attributes and relations
- type checking on the qualification
- There are two possible forms of representing the
predicates in query qualification Conjunctive
Normal Form (CNF) or Disjunctive Normal Form
(DNF) - CNF (p11 ? p12 ?... ? p1n) ? ... ? (pm1 ? pm2
?... ? pmn) - DNF (p11 ? p12 ?... ? p1n) ? ... ? (pm1 ? pm2
?... ? pmn) - OR's mapped into union
- AND's mapped into join or selection
14Query Decomposition Analysis
- Queries are rejected because
- the attributes or relations are not defined in
the global schema or - operations used in qualifiers are semantically
incorrect - For only those queries that do not use
disjunction or negation semantic correctness can
be determined by using query graph - One node of the query graph represents result
sites, others operand relations, edge between
nodes operand nodes represent joins, and edge
between operand node and result node represents
project
15Query Graph and Join Graph
SELECT Ename, Resp FROM E, G, J WHERE E. ENo
G. ENO AND G.JNO J.JNO AND JNAME CAD''
AND DUR gt 36 AND Title Prog''
E. ENo G. ENO
G.JNO J.JNO
Ename
Resp
JNAME CAD''
Title Prog''
E. ENo G. ENO
G.JNO J.JNO
DUR gt 36
16Disconnected Query Graph
- Semantically incorrect conjunctive multivariable
query without negation have query graphs which
are not connected
SELECT Ename, Resp FROM E, G, J WHERE E. ENo
G. ENO AND JNAME CAD'' AND DUR gt 36 AND
Title Prog''
Ename
Resp
Title Prog''
E. ENo G. ENO
DUR gt 36
JNAME CAD''
17Simplification Eliminating Redundancy
- Elimination of redundant predicates using well
known idempotency rules - p ? p p p ? p p p ? true true
- p ? false p p ? true p p ? false
false - p1 ? (p1 ? p 2 ) p1
- p1 ? (p1 ? p 2 ) p1
- Such redundant predicates arise when user query
is enriched with several predicates to
incorporate view relation correspondence, and
ensure semantic integrity and security
18Eliminating Redundancy-- An Example
SELECT TITLE FROM E WHERE (NOT (TITLE
Programmer'') AND (TITLE Programmer'' OR
TITLE Elec.Engr'') AND NOT (TITLE
Elec.Engr'')) OR ENAME J.Doe''
SELECT TITLE FROM E WHERE ENAME J.Doe''
19Eliminating Redundancy-- An Example
p1 ltTITLE Programmer''gt p2 ltTITLE
Elec. Engr''gt p3 ltENAME J.Doe''gt
Let the query qualification is ( p1 ? (p1 ?
p2) ? p2) ? p3
The disjunctive normal form of the query is (
p1 ? p1 ? p2) ? ( p1 ? p2 ? p2) ? p3
(false ? p2) ? ( p1 ? false) Ú p3 false ?
false ? p3 p3
20Query Decomposition Rewriting
- Rewriting calculus query in relational algebra
- straightforward transformation from relational
calculus to relational algebra, and - restructuring relational algebra expression to
improve performance
21Rewriting -- Transformation Rules (I)
- Commutativity of binary operations
- R ? S ? S ? R
- R ? S ? S ? R
- Associativity of binary operations
- (R ? S) ? T ? R ? ( S ? T )
- Idempotence of unary operations grouping of
projections and selections - ?A ( ? A (R )) ? ? A (R ) for A?A? A
- ?p1(A1) ( ? p2(A2) (R )) ? ? p1(A1) ?p2(A2) (R )
22Rewriting -- Transformation Rules (II)
- Commuting selection with projection
- ?A1, , An ( ? p (Ap) (R )) ? ?A1, , An ( ? p
(Ap) ( ?A1, , An, Ap(R ))) - Commuting selection with binary operations
- ? p (Ai)(R ? S) ? (? p (Ai)(R)) ? S
- ? p (Ai)(R S) ? (? p (Ai)(R)) S
- ? p (Ai)(R ? S) ? ? p (Ai)(R) ? ? p (Ai)(S)
- Commuting projection with binary operations
- ?C(R ? S) ? ?A(R) ? ?B (S) C A ? B
- ?C(R S) ? ?C(R) ?C (S)
- ?C (R ? S) ? ?C (R) ? ?C (S)
23An SQL Query and Its Query Tree
?ENAME
?(ENAMEltgtJ.DOE )?(JNAMECAD/CAM )? (Dur12 ?
Dur24)
SELECT Ename FROM J, G, E WHERE G.EnoE.ENo
AND G.JNo J.JNo AND ENAME ltgt J.Doe'
AND JName CAD' AND (Dur12 or
Dur24)
PROJ
ASG
EMP
24Query Decomposition Rewriting
?ENAME
? JNO, ENAME
? JNO
?JNAMECAD/CAM
? JNO, ENO
? ENO, ENAME
?Dur12 ? Dur24
?ENAMEltgtJ.DOE
PROJ
ASG
EMP
25L05 Distributed Query Processing Optimization
- Query Processing
- Query Decomposition
- Data Localization
- Query Optimization
26Data Localization
- Localization program
- Given an algebraic query on global schema
- Determine which fragments are involved
- A naïve way to localize a distributed query
- Substitute each global relation with its
localization program ? generic query - Use reduction techniques for efficiency
27Reduction for HF
- Remove empty relations generated by contradicting
selection on horizontal fragments - Remove useless relations generated by projections
on vertical fragments - Distribute joins over unions in order to isolate
and remove useless joins
28Data Localization-- An Example
EMP is fragmented into EMP1 ?ENO? E3
(EMP) EMP2 ? E3 lt ENO? E6 (EMP) EMP3 ?ENO
gtE6 (EMP)
?ENAME
?Dur12 ? Dur24
?ENAMEltgtJ.DOE
ASG is fragmented into ASG1 ?ENO? E3
(ASG) ASG2 ?ENO gtE3 (ASG)
?JNAMECAD/CAM
PROJ
?
?
ASG1
ASG1
ASG2
EMP1
EMP1
EMP1
29Reduction with Selection
EMP is fragmented into EMP1 ?ENO? E3
(EMP) EMP2 ? E3 lt ENO? E6 (EMP) EMP3 ?ENO
gtE6 (EMP)
SELECT FROM EMP WHERE ENOE5
Given Relation R, FRR1, R2, , Rn where Rj
?pj(R) ?pj(Rj) ? if ?x ? R ?(pi(x)?pj(x))
30Reduction with join
SELECT FROM EMP, ASG WHERE EMP.ENOASG.ENO
EMP is fragmented into EMP1 ?ENO? E3
(EMP) EMP2 ? E3 lt ENO? E6 (EMP) EMP3 ?ENO
gtE6 (EMP)
ASG is fragmented into ASG1 ?ENO? E3
(ASG) ASG2 ?ENO gtE3 (ASG)
31Reduction with Join (I)
(R1 ? R2) S ? (R1 S) ? (R2
S)
32Reduction with Join (II)
?
Given Ri ?pi(R) and Rj ?pj(R) Ri Rj ? if
?x ? Ri , ?y? Rj ?(pi(x)?pj(y))
Reduction with join 1. Distribute join over
union 2. Eliminate unnecessary work
33Reduction for VF
- Find useless intermediate relations
- Relation R defined over attributes A A1, A2,
, An vertically fragmented as Ri ?A (R) where
A? A - ?K,D (Ri) is useless if the set of projection
attributes D is not in A
EMP1 ?ENO,ENAME (EMP) EMP2 ?ENO,TITLE (EMP)
?ENAME
SELECT ENAME FROM EMP
EMP1
34Reduction for DHF
Distribute joins over union Apply the join
reduction for horizontal fragmentation
EMP1 ?TITLEProgrammer (EMP) EMP2
?TITLE?Programmer (EMP) ASG1 ASG ENO
EMP1 ASG2 ASG ENO EMP2
SELECT FROM EMP, ASG WHERE ASG.ENO
EMP.ENO AND EMP.TITLE Mech. Eng.
35Reduction for DHF (II)
Joins over union
?
36Reduction for HF --An Example
EMP1 ?ENO?E4 (?ENO,ENAME (EMP)) EMP2
?ENOgtE4 (?ENO,ENAME (EMP)) EMP3 ?ENO,TITLE
(EMP) QUERY SELECT ENAME FROM EMP WHERE ENO
E5
37Summary
- Query decomposition is done as in a centralized
DBMS - A lot of logic rules about transformations
- Data localization happens only in a distributed
DBMS - Use fragmentation characteristics to simplify the
queries on fragments