Query Optimization

About This Presentation

Title:

Query Optimization

Description:

... of null values in new relation. Introducing repeating groups 'relation becomes attributes' ... Most cost estimates are based upon the cardinality of the relation ... – PowerPoint PPT presentation

Number of Views:38

Avg rating:3.0/5.0

Slides: 95

Provided by: thebutl

Category:

more less

Transcript and Presenter's Notes

Title: Query Optimization

1
Query Optimization

Jonathon McDonald
Eric Nelson
Jordan Neshev
Ben Oberkirsch
Hung Tang

2
Agenda

Tuning the database for performance
Query decomposition
Heuristics
Cost estimation
Pipelining

3
Query Optimization

Tuning the Database for Performance

4
Controlled Redundancy

Objective - to introduce redundancy into
relations by denormalization to improve query
performance.
Denormalization - refinement to a relation such
that the new relation(s) may violate the degree
of normalization of the original relation.
Apply denormalization when
performance on current relations is not
satisfactory
update rate is low and query rate is high

5
Idea and Application

With one table, speed up SQL queries by using
indexes.
Create relations by duplicating certain
attributes or by combining relations so as to
accommodate SQL queries by reducing table joins.
7 situations to use denormalization.
Combining relations (3 situations)
Duplicating attributes in relations (3
situations)
Creating an extract table

6
Combining tables

Combining 11 relationships
should be considered if the relations are
frequently referenced together
consider amount of null values in new relation
Introducing repeating groups
relation becomes attributes
in general, used if the number in groups is
known, static and not too large.
Merging reference table into base relation

7
Duplicating Attributes in Relations to Reduce
Joins

Duplicating non-key attributes in 1
relationships
Duplicating foreign key attributes in 1
relationships
note this sets a new relationship in place
Duplicating attributes in (many-to-many)
relationships
a relationship is created between two
relations by introducing an intermediate
relation.
duplicate attribute(s) from one of the relations
into the intermediate relation.

8
Creating an Extract Table

Create a single table for report query.
The data stays static for a period of time.
The table is updated at certain off-peak time
(e.g., overnight) as a batch job.
This avoids running the complex multi-join SQL
during peak time.

9
Query Optimization

Query Decomposition

10
Intro

Query Processing - transforming a SQL query into
an equivalent relational algebra form,
implementing it, executing it, and returning the
data result
Four phases of Query Processing
Query Decomposition
Optimization
Code Generation
Execution

11
Objective

Transform a query in high-level language into an
equivalent relational algebra form.
checks for correct syntax
checks for correct semantic
Five steps
Analysis
Normalization
Semantic Analysis
Simplification
Query Restructuring

12
Analysis

SQL query is checked for syntax validity
Relations and attributes are verified with the
system catalogue
Proper application of operations to attributes
and object types
Creates an internal representation in the form of
a query tree

13
Normalization

Convert the predicate (WHERE clause) into a
normalized form that is more easily manipulated
Two normalization transformation
conjunctive normal form - conjuncts connected
with the AND operator
disjunctive normal form - conjuncts connected
with the OR operator

14
Semantic Analysis

Objective - check normalized queries for
incorrect formulation or contradiction
Incorrect formulation occurs when components do
not contribute to the generation of the result
Contradictory component occurs when no tuples
satisfy the component

15
Simplification

Objective - transform query into a semantically
equivalent and efficient form by eliminating
redundant qualifications and common
subexpressions
Applies idempotency rules of boolean algebra
Also takes into account access restrictions, view
definitions, and integrity constraints

16
Query Restructuring

More transformation of the query into a form with
a more efficient implementation
Gets into the query optimization aspect

17
Query Optimization

Heuristic Approaches

18
Overview

Represent the query as a relational algebra
statement
Use transformation rules to obtain a more
efficient representation of the query

19
Rules

Cascade of Selection
spqr sp(sq(sr(R)))
Commutativity of Selection
sp(sq(R)) sq(sp(R))
Intermediate Projections Extraneous
?att1 ?att1, att2(R) ?att1(R)
Commutativity of Selection with Projection
sp(?att3, att4(R)) ?att3, att4(sp(R)) where p
is a subset of att3, att4

20
Rules

Commutativity of theta join and Cartesian join
R lttheta joingt? S S lttheta joingt? R
R X S S X R
Commutativity of s with theta join (or Cartesian
join)
When selection criteria are from 1 relation
sp(R lttheta joingt? S) sp(R) lttheta joingt? S
sp(R X S) sp(R) X S
Criteria from both relations
spq(RXS) sp(R) X sq(S)

21
Rules

Commutativity of ? and theta join (or Cartesian
join)
?L1 ?L2(RXS) ?L1 (R) X ?L2(S) where L1
contains only attributes of R and L2 contains
only attributes of S
Commutativity of ? and ?
R ? S S ? R S ? R R ? S
Commutativity of s with set ops (?,?,-)
sp(R op S) sp(R) op sp(S)
Commutativity of ? with ?
?L(R?S) ?L(R) ? ?L(S)

22
Rules

Associativity of theta join (or Cartesian join)
(R X S) X T R X (S X T)
(R lttheta joingtp S) lttheta joingtqr T R lttheta
joingtpr (S lttheta joingtq T) where q is a subset
of attributes from S and T only
Associativity of ? and ?
(R ? S) ? T R ? (S ? T)
(R ? S) ? T R ? (S ? T)

23
Strategies for Rule usage

Select as early as possible
Combine X with subsequent s when s represents a
join condition into a join operation.
Using associativity, rearrange leaf nodes to
execute the most restrictive selection first.
Project as early as possible.
Compute repeated expressions only once.

24
Query Optimization

Relational Model Cost Estimation

25
Database Statistics

Why statistics
There are many ways to implement relational
algebra operations
A DBMS must chose the most efficient one
Efficiency is determined by calculating formulae
on current statistical information

26
Database Statistics

Which Statistics
Most cost estimates are based upon the
cardinality of the relation
The dominant cost is disk access
Accuracy of estimate depends upon the currency of
statistical information in the system catalog

27
Database Statistics

Types of Statistics
For each base relation R
nTuples(R) the number of tuples in R
bFactor(R) the number of tuples in R that fit
into one block
nBlocks(R) the number of blocks needed to store
R (assuming that the tuples of R are stored
together, then nBlocks(R) nTuples(R) /
bFactor(R)

28
Database Statistics

Types of Statistics (contd)
For each attribute A of base relation R
nDistinctA(R) the number of distinct values of
A in R
minA(R), maxA(R) the minimum and maximum
possible values of A in R
SCA(R) the selection cardinality of A in R.
This is the average number of tuples that satisfy
an equality condition on A
Equality SCA(R) 1 if A is key attribute of R
SCA(R) nTuples(R)/nDistinctA(R)
otherwise
Others

29
Database Statistics

Types of Statistics (contd)
For each multi-level index I on attribute set A
nLevelsA(R) the number of levels of I
nLfBlocksA(R) the number of leaf blocks in I

30
Relational Algebra Operations

Selection
Join
Project
Set Operations

31
Selection Operation

Works on a single relation R, and defines a
single relations S
Predicate may be simple or complicated
Implementation strategy depends upon
The structure of the file in which it is stored
Whether the attributes in the predicate have been
indexed/hashed

Implementation strategy is an algorithm
32
Selection Operation

Estimation of Cardinality of result set
Predicate of forms A ? x
ntuples(S) SCA(R)
For any attribute B ? A
nDistinctB(S) nTuples(S) if nTuples(S) lt
nDistinctB(R)/2
nDistinctB(S) nDistinctB(R) if nTuples(S)
gt 2nDistinctB(R)
Others

33
Selection Operation

Implementation Strategies
Linear search
Binary search
Equality on hash key
Equality on primary key
Inequality on primary key
Equality on clustering
Equality on non-clustering
Inequality on secondary B-tree index

34
Selection Operation

Linear Search
Unordered file (Heap file)
No index
Generally need to scan each tuple in each block
Equality condition of Primary/Unique Key
On average nBlocks(R)/2
Generally
nBlocks(R)

35
Selection Operation

Binary Search
Ordered file (Sequential file)
No index
Predicate of form Ax, file ordered on A
On average log2(nBlocks(R))
Generally
log2(nBlocks(R)) SCA(R)/bFactor(R)) -1

36
Selection Operation

Equality on hash key
Attribute A is the hash key
No overflow
Constant (1)
Overflow
Additional access may be necessary
Cost depends on amount of overflow and method to
handle overflow

37
Selection Operation

Equality condition on Primary Key
By definition indexed
Equality condition
nLevels(I) 1

38
Selection Operation

Inequality condition on Primary Key
Strategy
Find tuple satisfying equality condition
If index sorted, access all tuples before or
after this one
nLevels(I) nBlocks(R)/2

39
Selection Operation

Equality condition on clustering index
Equality condition on attribute A
Secondary index
Clustered index gt tuples are on same blocks
Number of tuples may require overflow into more
than one block
Cost determined as first index accessed, then the
number of blocks required to store the tuples
satisfying the equality condition
nLevels(I) SCA(R)/bFactor(R)

40
Selection Operation

Equality condition on a non-clustering index
Equality condition on attribute A
Non-clustered index gt must assume tuples on
different blocks
Cost determined as first index accessed, then the
selection cardinality
nLevels(I) SCA(R)

41
Selection Operation

Inequality condition on a secondary B-tree index
On average, half the leaf nodes accessed
Via index, on average half the tuples accessed
nLevels(I) nLfBlocksA(I)/2 nTuples(R)/2

42
Selection Operation

Composite predicates
Conjunctive selection
Contains tuples satisfying all conjuncts
(Aa ? B b) C c
Disjunctive selection
Contains union of tuples satisfying disjuncts
(Aa Bb) ? C c

43
Selection Operation

Composite predicates
Cost analysis
Conjunctive selection without disjunction
Use a selection implementation strategy (2-8)
discussed earlier if
Attribute in conjunct has index, or is ordered,
estimate as in 2-8 discussed earlier
Selection condition involves equality on two or
more attributes, and a composite index or hash
exists on the combined attributes
Secondary indexes are defined on one or more
attributes, and these attributes are involved
only in equality conditions in the predicate

44
Selection Operation

Composite predicates
Cost analysis
Selections with disjunction
If an attribute in the selection condition
contains an or, and that attribute has no index
or sort order, then entire selection operation
requires linear search
If an index exists on every attribute of the
selection, then optimize by retrieving tuples
that satisfy condition, and apply Union (see Set
Operations)

45
Selection Operation

Estimation example (Staff relation)
There exists a hash index on staffNo
There exists a clustering index on branchNo
There exists a B-tree index on salary
Staff has the following statistics
nTuples 3000 bFactor 30 gt nBlocks 100
nDistinctsalary 500 gt SCsalary 6
minsalary 10 000 maxsalary 50 000
nLevelssalary 2 nLfBlockssalary 50

46
Selection Operation

Estimation example (contd)
Linear search cost
Search on key attribute is 50 blocks
Search on non-key attributes is 100 blocks
Can we improve
? staffNo SG5
Equality condition in PK attribute
PK attribute is hashed
Use strategy 3 equality on hash key gt cost
of one block

47
Selection Operation

Estimation example (contd)
Can we improve (contd)
? salary gt 20 000
Range search on salary attribute
Salary attribute has Btree index
Use strategy 8
2 50/2 3000/2 1527
Strategy 7 worse than linear search
Use linear search
Estimated cardinality
3000(50000-20000)/(50000-10000) 2250

48
Join Operation

Commonly a performance concern
Most time consuming after the Cartesian Product
Critical to perform as efficiently as possible

49
Join Operation

Theta join
Tuples satisfying a predicate from the Cartesian
product of R and S
Predicate of form R.a ? S.b
? a logical operator
Equi-join
A theta-join where the predicate is equality
Natural join
Join involves all common attributes of R and S

50
Join Operation

Implementation Strategies
Block-nested loop join
Indexed nested loop join
Sort-merge join
Hash join

51
Join Operation

Estimation of cardinality
Cartesian product
nTuples(R) nTuples (S)
Join
Worst case
nTuples(T) ? nTuples(R) nTuples (S)
Assuming uniform distribution, better estimates
exist

52
Join Operation

Estimation of cardinality (contd)
Equi-join (R.A S.B)
If A is a key attribute of R, then a tuple of S
can only join with one tuple of R
nTuples(T) ? nTuples(S)
If B is a key attribute of S
nTuples(T) ? nTuples(S)
If neither A nor B are keys
nTuples(T) SCA(R) nTuples(S) or
nTuples(T) SCB(S) nTuples(R)

For any tuple s in S, expect SCA(R) for a given
value of A, and this number to appear in join.
Multiply this by number of tuples in S.
53
Join Operation

Nested loop join
Simplest algorithm
Outer loop iterates over each tuple in R
Inner loop iterates over each tuple in S
Can improve
Block nested loop join
Basic unit of read/write is a disk block
Add two additional loops that process blocks

54
Join Operation

Block nested loop join (contd)
Cost
Each block of R must be read
Each block of S must be read for each block of R
nBlocks(R) (nBlocks(R) nBlocks(S))
Improvement
Since cost is based upon disk access, read as
many blocks of the outer loop relation in as
possible into the database buffer
Cost

So choose smallest relation for outer loop
Save one buffer for inner relation and one buffer
for outer relation
55
Join Operation

Block nested loop join (contd)
Cost
nBlocks(R) nBlocks(S)(nBlocks(R)/(nBuffer-2))
nBlocks(R) nBlocks(S)

If all blocks of R can be read into the buffer
56
Join Operation

Index nested loop join
Index or hash function exists on join attributes
Replace inefficient file scan with index/hash
lookup
Avoids enumeration of Cartesian product
Again, read in as many blocks as possible into
database buffers
Cost
Join attribute A in S is PK

57
Join Operation

Index nested loop join (contd)
Join attribute A in S is PK
nBlocks(R) nTuples(R) (nLevelsA(I) 1)
Join attribute A in S is clustering index
nBlocks(R) nTuples(R)(nLevelsA(I)SCA(R)/bFact
or(R))

58
Join Operation

Sort-merge join
For Equi-join, most efficient join is when both
relations are sorted on join attributes
Then look for qualifying tuples by merging two
relations
Tuples with same join attribute value guarenteed
to be in consecutive order
Each block of each relation needs to be read only
once

Assume each set of tuples with same join value
can be held in DB buffer at same time
59
Join Operation

Sort-merge join (contd)
Cost
nBlocks(R) nBlocks(S)
Need to presort?
Add in nBlocks(R)log2(nBlocks(R))

60
Join Operation

Hash join
May be used for Natural join or Equijoin
Partition phase
Select hash function h to partition R and S
h(R.B) ! h(S.C) gt R.B ! S.C
Probing phase
Note h(R.B) h(S.C) ?gt R.B S.C
Read each R partition
Attempt to join tuples with tuples in equivalent
S partition
Can use nested loop join for this phase

61
Join Operation

Hash join (contd)
Cost
Read R and S to partition
nBlocks(R) nBlocks(S)
Write each partition to disk
nBlocks(R) nBlocks(S)
Read in each partition to find matching tuples
nBlocks(R) nBlocks(S)
Total
3(nBlocks(R) nBlocks(S)

Assumes no overflow and in memory hash index
62
Join Operation

Hash join (contd)
Cost (contd)
If above assumption doesnt hold, then
partitioning of relations cannot be done in one
pass
Use recursive partitioning algorithm
2(nBlocks(R) nBlocks(S))lognBuffer-1(nBlocks(S
))-1 nBlocks(R) nBLocks(S)

63
Join Operation

Estimation Example (Branch Staff)
There exists a hash index on PK staffNo
There exists a hash index on PK branchNo
There are 100 database buffer blocks
The system catalog holds
nTuples(Staff) 3000
bFactor(Staff) 30 gt nBlocks(Staff) 200
nTuples(Branch) 500
bFactor(Branch) 50 gt nBlocks(Branch) 10

64
Join Operation

Estimation Example (Branch Staff) (contd)
nTuples(PropertyForRent) 100 000
bFactor(PFR) 50 gt nBlocks(PFR) 2000
Estimate
J1 Staff ltJgtstaffNo PropertyForRent
J2 Branch ltJgtstaffNo PropertyForRent

65
Join Operation

Estimation Example (Branch Staff) (contd)

Best strategy depends on situation
66
Projection Operation

Extracts vertical subset of relation R, and
produces single relation S
Implementation strategy considers
Removal of attributes that arent required
Not problematic
Elimination of duplicate tuples produced from
previous step
More problematic

67
Projection Operation

Estimation of Cardinality of result set
Projection contains key
ntuples(S) nTuples(R)
Projection contains single non-key attribute A
ntuples(S) SCA(R)

68
Projection Operation

Cost Estimation of Duplicate Elimination
Duplicate elimination using sorting
Sort attributes left in reduced relation of R
Makes duplicates adjacent
Must create intermediate relation at a cost of
nBlocks(R)
Sorting costs nBlocks(R)log2(nBlocks(R))
Total cost -
nBlocks(R) nBlocks(R)log2(nBlocks(R))

69
Projection Operation

Cost Estimation of Duplicate Elimination
Duplicate elimination using hashing
Partition
Allocate one buffer block for reading R, allocate
one nBuffer 1 buffer blocks to store partitioned
output
Remove unwanted attributes
Apply hash function to remaining attributes
(chose hash function for uniform distribution)
Two tuples in different partitions guaranteed not
to be duplicates
Search area for duplicates now reduced

Useful if large number of buffer blocks compared
to number of blocks for R
70
Projection Operation

Cost Estimation of Duplicate Elimination
Duplicate elimination using hashing (contd)
Eliminate duplicates
For each nBuffer 1 buffer blocks of stored
partitioned output
Apply second hash funtion to tuple
Insert hash value into in-memory hash table
If tuples hash to same value as another tuple,
check if duplicate
Eliminate new tuple if duplicate
Write result tuples

71
Projection Operation

Cost Estimation of Duplicate Elimination
Duplicate elimination using hashing (contd)
Cost
Assume hashing has no overflow
Exclude cost of writing result relation
Cost is
nBlocks(R) nb

nb is number of blocks required by intermediate
relation before duplicate elimination.
72
Set Operations

Binary operations
Union
Intersection
Set difference
Must be union compatible
Implementation
Sort both relations, and scan result set once
Union
Place in result set any tuple that appears in
either original relation, eliminating duplicates

73
Set Operations

Implementation (contd)
Intersection
Place in result set any tuple that appears both
original relation, eliminating duplicates
Set difference (R-S)
Place in result set any tuple that appears in R
that is not in S
Cost
Use sort-merge join algorithm
nBlocks(R) nBlocks(S) nBlocks(R)log2(nBlocks
(R)) nBlocks(S)log2(nBlocks(S))

74
Set Operations

Estimation of Cardinality of result set
Estimate of set operations more problematic,
usually upper and lower bound given
Union (RUS)
max(nTuples(R), nTuples(S)) nTuples(T)
nTuples(R) nTuples(S)
Intersection (RnS)
0 nTuples(T) min(nTuples(R), nTuples(S))
Set Difference (R-S)
0 nTuples(T) nTuples(R)

75
Query Optimization

Pipelining and Query Optimization in Oracle

76
Pipelining

Used to improve query processing performance
(aka On-the-Fly Processing)
What weve assumed until now
Materialization The output of one operation is
stored in a temp relation (and written
temporarily to disk) for use by the next
operation
Alternative Pipeline results of one operation
to another operation so that no temporary
relation needs to be created for the intermediate
result

77
Pipeline Implementation

Typically a separate thread within DBMS
Uses a buffer to hold the tuples passing from one
operation to the other
Drawback Operation inputs may not all be
available at once for processing which can limit
algorithm options
(ie. Cant use Sort-Merge Join Algorithm if the
pipelines input tuples arent also sorted on the
join attribute)
Benefit Some execution strategies can be greatly
improved

78
Pipelining example

sposition Manager(ssalary gt 2000(Staff))
Normal approach
Process the selection on salary
Store result in temp relation
Do second selection on temp relation
Pipeline approach
Apply the second selection to each tuple in the
result of the first relation AS IT IS PRODUCED
Does away with need for a temp relation

79
Trees

Used to show the execution order of operations
2 general types of trees
Linear trees (Left-deep, Right-deep, Other)
Non-linear trees (aka Bushy trees)

80
Linear Trees

Relation on one side of the operator always a
base relation
When the operations are a join, the left child is
the outer relation and the right is the inner
relation
Inner relations must always be materialized
Makes left-deep trees extremely useful since all
inner relations are base relations
Allows left-deep trees to be implemented using a
fully-pipelined execution

81
Query Optimization in Oracle

Supports both Rule-Based and Cost-Based
Optimization
Gives user ability to create and maintain
Histograms
Allows the Execution Plan to be viewed

82
Oracle Rule-Based Optimizer

Contains a table of rankings for different access
paths available for queries.
Optimizer determines rankings according to this
table for each different Execution Strategy and
then selects the strategy with the best score

83
Rule-Based Optimization Rankings
84
Rule-Based Optimization Ex.

SELECT propertyNoFROM PropertyForRentWHERE
rooms gt 7 AND city St Louis
Available access paths
Full Table Scan (available for every query)Rank
15
Unbounded range search on rooms column indexRank
11
Single-Column access on the city indexRank 9

85
Oracle Cost-Based Optimizer

Use the Execution Strategy requiring the least
amount of resource use
User can specify whether this is based on
throughput or on response time
Can also suggest hints for the optimizer to
consider
Relies on gathering of statistics generated by
user
Manual updates necessary when database changes
Done on entire data structure or using sampling
(row-based and block-based)

86
Optimizer Hints in Oracle

Used to force the optimizer into a decision
different than it would otherwise choose
Can force the use of
Rule-Based Optimizer
Particular Access Path
Particular Join Order
Particular Join Operation (ie. Sort-Merge Join)

87
Optimizer Hints in Oracle

SELECT / INDEX(sexidx) / name, positionFROM
StaffWHERE sex M
Forces the use of the sex index
Can improve performance over a full-table scan if
we know that a relatively high of the staff are
female

88
Histograms

Provide a more realistic estimation of data value
distribution within columns
Used to better estimate of tuples returned by
queries/operations
Width-Balanced Data divided into fixed number of
equal width buckets
Height-Balanced Data divided such that each
bucket contains the same of values
Overhead Storing and Maintaining

89
Histograms
90
Width-Balanced Histogram
20
40
28
8
4
91
Height-Balanced Histogram
92
Oracle Execution Plan

Seen by using the EXPLAIN PLAN command
Output written to a database table
Main Table Columns
STATEMENT_ID - Defined as EXPLAIN PLAN parameter
OPERATION - Name of internal operation performed
(ie SELECT)
OPTIONS - Name of another internal operation
performed
OBJECT_NAME - Name of the table or index
ID - Number assigned to each execution plan step
PARENT_ID - ID of next step that operates on ID
step output
POSITION - Processing order for steps with same
PARENT_ID
COST - Estimated cost of the operation (null for
rule_based)
CARDINALITY - Estimated number of rows accessed
by this operation

93
Oracle Execution Plan
94
References

Alfred V. Aho, Yehoshua Sagiv, Jeffrey D. Ullman
Equivalences Among Relational Expressions. SIAM
J. Comput. 8(2) 218-246(1979)
Stefano Ceri, Georg Gottlob Translating SQL Into
Relational Algebra Optimization, Semantics, and
Equivalence of SQL Queries. TSE 11(4)
324-345(1985)
Sang K. Cha and Gio Wiederhold, "Kaleidoscope a
cooperative menu-guidedquery interface,"
Proceedings of the 1990 ACM SIGMOD
internationalconference on Management of data,
page 387, 1990, Atlantic City, NewJersey, United
States
Raghu Ramakrishnan, Johannes Gehrke, "Database
management systems Boston McGraw-Hill, 2000
Chapter 12, 13
Abraham Silberschatz, Henry F. Korth, S.
Sudarshan, Database system concepts Boston
McGraw-Hill, 2002 Chapter 14