Principles of Query Processing

About This Presentation

Title:

Principles of Query Processing

Description:

... result = R * selectivity. Scan. Clustered index: Good. Non-clustered index: Good for low selectivity. Worse than scan for high selectivity. SELECT * FROM Reserves ... – PowerPoint PPT presentation

Number of Views:44

Avg rating:3.0/5.0

Slides: 56

Provided by: hhp7

Category:

more less

Transcript and Presenter's Notes

Title: Principles of Query Processing

1
Principles of Query Processing
CS5226 Week 5

Pang Hwee Hwa
School of Computing, NUS

2
ApplicationProgrammer(e.g., business
analyst, Data architect)
Application
SophisticatedApplicationProgrammer(e.g., SAP
admin)
Query Processor
Indexes
Storage Subsystem
Concurrency Control
Recovery
DBA,Tuner
Operating System
HardwareProcessor(s), Disk(s), Memory
3
Overview of Query Processing
Database
Statistics
Cost Model
Query Optimizer
Query Evaluator
QEP
Parsed Query
Parser
High Level Query
Query Result
4
Outline

Processing relational operators
Query optimization
Performance tuning

5
Projection Operator

?R.attrib, .. (R)
Implementation is straightforward

SELECT bid FROM Reserves R WHERE R.rname lt
C
6
Selection Operator

?R.attr op value (R)
Size of result R selectivity
Scan
Clustered index Good
Non-clustered index
Good for low selectivity
Worse than scan for high selectivity

SELECT FROM Reserves R WHERE R.rname lt
C
7
Example of Join
SELECT FROM Sailors R, Reserve S WHERE
R.sidS.sid
8
Notations

R number of pages in outer table R
R number of tuples in outer table R
S number of pages in inner table S
S number of tuples in inner table S
M number of main memory pages allocated

9
Simple Nested Loop Join
R
S
Tuple
R tuples
10
Simple Nested Loop Join

Scan inner table S per R tuple R S
Each scan costs S pages
For R tuples
R pages for outer table R
Total cost R R S pages
Not optimal!

11
Block Nested Loop Join
R
S
M 2 pages
?R / (M 2)? blocks
12
Block Nested Loop Join

Scan inner table S per block of (M 2) pages of
R tuples
Each scan costs S pages
R / (M 2) blocks of R tuples
R pages for outer table R
Total cost R ?R / (M 2)? S pages
R should be the smaller table

13
Index Nested Loop Join
R
S
Index
Tuple
R tuples
14
Index Nested Loop Join

Probe S index for matching S tuples per R tuple
Probe hash index 1.2 I/Os
Probe B tree 2-4 I/Os, plus retrieve matching S
tuples 1 I/O
For R tuples
R pages for outer table R
Total cost R R index retrieval
Better than Block NL join only for small number
of R tuples

15
Sort Merge Join

External sort R
External sort S
Merge sorted R and sorted S

16
External Sort R
(m-1)-way merge
Size of R0,i M, R0,is ?R/M?
merge passes ?logM-1 R/M? Cost per pass
R input R output 2 R Total cost 2
R (?logM-1 R/M? 1) including split pass
17
Sort Merge Join

External-sort R 2 R (?logM-1 R/M? 1)
Split R into R/M sorted runs each of size M 2
R
Merge up to (M 1) runs repeatedly
?logM-1 R/M? passes, each costing 2 R
External-sort S 2 S (?logM-1 S/M? 1)
Merge matching tuples from sorted R and S R
S
Total cost 2 R (?logM-1 R/M? 1) 2 S
(?logM-1 S/M? 1) R S
If R lt M(M-1), cost 5 (R S)

18
GRACE Hash Join
S
0 1 2 3
X X X X X X X X X X X X
X X X X X X
X X X X X X
X X X
X X X
X X X
X X X
0
bucketID X mod 4 Join on R.X S.X
1
R S R0 S0 R1
S1 R2 S2
R3 S3
R
2
3
19
GRACE Hash Join Partition Phase
R ? (M 1) partitions, each of size R / (M 1)
20
GRACE Hash Join Join Phase
Partition must fit in memory R / (M 1) lt M -1
21
GRACE Hash Join Algorithm

Partition phase 2 (R S)
Partition table R using hash function h1 2 R
Partition table S using hash function h1 2 S
R tuples in partition i will match only S tuples
in partition I
R ? (M 1) partitions, each of size R / (M
1)
Join phase R S
Read in a partition of R (R / (M 1) lt M -1)
Hash it using function h2 (ltgt h1!)
Scan corresponding S partition, search for
matches
Total cost 3 (R S) pages
Condition M gt vfR, f 1.2 to account for hash
table

22
Summary of Join Operator

Simple nested loop R R S
Block nested loop R ?R / (M 2)? S
Index nested loop R R index retrieval
Sort-merge 2 R (?logM-1 R/M? 1) 2 S
(?logM-1 S/M? 1) R S
GRACE hash 3 (R S)
Condition M gt vfR

23
Overview of Query Processing
Database
Statistics
Cost Model
Query Optimizer
Query Evaluator
QEP
Parsed Query
Parser
High Level Query
Query Result
24
Query Optimization

Given An SQL query joining n tables
Dream Map to most efficient plan
Reality Avoid rotten plans
State of the art
Most optimizers follow System Rs technique
Works fine up to about 10 joins

SELECT S.sname FROM Reserves R, Sailors S WHERE
R.sidS.sid AND R.bid100 AND S.ratinggt5
25
Complexity of Query Optimization

Many degrees of freedom
Selection scan versus (clustered, non-clustered)
index
Join block nested loop, sort-merge, hash
Relative order of the operators
Exponential search space!

Heuristics
Push the selections down
Push the projections down
Delay Cartesian products
System R Only left-deep trees

26
Equivalences in Relational Algebra

Selection
- cascade
- commutative
Projection
- cascade
Join
- associative
- commutative

R (S T) (R S) T
(R S) (S R)
27
Equivalences in Relational Algebra

A projection commutes with a selection that only
uses attributes retained by the projection
Selection between attributes of the two arguments
of a cross-product converts cross-product to a
join
A selection on just attributes of R commutes with
join R S (i.e., (R S)
(R) S )
Similarly, if a projection follows a join R
S, we can push it by retaining only attributes
of R (and S) that are needed for the join or are
kept by the projection

28
System R Optimizer

Find all plans for accessing each base table
For each table
Save cheapest unordered plan
Save cheapest plan for each interesting order
Discard all others
Try all ways of joining pairs of 1-table plans
save cheapest unordered interesting ordered
plans
Try all ways of joining 2-table with 1-table
Combine k-table with 1-table till you have full
plan tree
At the top, to satisfy GROUP BY and ORDER BY
Use interesting ordered plan
Add a sort node to unordered plan

29
Source Selinger et al, Access Path Selection in
a Relational Database Management System
30
Note Only branches for NL join are shown here.
Additional branches for other join
methods (e.g. sort-merge) are not shown.
Source Selinger et al, Access Path Selection in
a Relational Database Management System
31
What is Cheapest?

Need information about the relations and indexes
involved
Catalogs typically contain at least
tuples (NTuples) and pages (NPages) for each
relation.
distinct key values (NKeys) and NPages for each
index.
Index height, low/high key values (Low/High) for
each tree index.
Catalogs updated periodically.
Updating whenever data changes is too expensive
lots of approximation anyway, so slight
inconsistency ok.
More detailed information (e.g., histograms of
the values in some field) are sometimes stored.

32
Estimating Result Size
SELECT attribute list FROM relation list WHERE
term1 AND ... AND termk

Consider a query block
Maximum tuples in result is the product of the
cardinalities of relations in the FROM clause.
Reduction factor (RF) associated with each termi
reflects the impact of the term in reducing
result size
Term colvalue has RF 1/NKeys(I)
Term col1col2 has RF 1/MAX(NKeys(I1), NKeys(I2))
Term colgtvalue has RF (High(I)-value)/(High(I)-Low
(I))
Result cardinality Max tuples product of
all RFs.
Implicit assumption that terms are independent!

33
Cost Estimates for Single-Table Plans

Index I on primary key matches selection
Cost is Height(I)1 for a B tree, about 1.2 for
hash index.
Clustered index I matching one or more selects
(NPages(I)NPages(R)) product of RFs of
matching selects.
Non-clustered index I matching one or more
selects
(NPages(I)NTuples(R)) product of RFs of
matching selects.
Sequential scan of file
NPages(R).
Note Typically, no duplicate elimination on
projections! (Exception Done on answers if user
says DISTINCT.)

34
Counting the Costs

With 5 buffers, cost of plan
Scan Reserves (1000) write temp T1 (10 pages,
if we have 100 boats, uniform distribution)
Scan Sailors (500) write temp T2 (250 pages, if
we have 10 ratings).
Sort T1 (2102), sort T2 (22504), merge
(10250), total2300
Total 4060 page I/Os
If we used BNL join, join cost 104250, total
cost 2770
If we push projections, T1 has only sid, T2
only sid and sname
T1 fits in 3 pages, cost of BNL drops to under
250 pages, total lt 2000

SELECT S.sname FROM Reserves R, Sailors S WHERE
R.sidS.sid AND R.bid100 AND S.ratinggt5
35
Exercise

Reserves 100,000 tuples, 100 tuples per page
With clustered index on bid of Reserves, we get
100,000/100 1000 tuples on 1000/100 10 pages
Join column sid is a key for Sailors - at most
one matching tuple
Decision not to push ratinggt5 before the join is
based on availability of sid index on Sailors
Cost Selection of Reserves tuples (10 I/Os)
for each tuple, must get matching Sailors tuple
(10001.2) total 1210 I/Os

(On-the-fly)
sname
(On-the-fly)
rating gt 5
(Index Nested Loops,
with pipelining )
sidsid
(Use hash Index on sid)
Sailors
bid100
(Use clustered index on sid)
Reserves
36
Query Tuning
37
Avoid Redundant DISTINCT

DISTINCT usually entails a sort operation
Slow down query optimization because one more
interesting order to consider
Remove if you know the result has no duplicates

SELECT DISTINCT ssnum FROM Employee WHERE dept
information systems
38
Change Nested Queries to Join

Might not use index on Employee.dept
Need DISTINCT if an employee might belong to
multiple departments

SELECT ssnum FROM Employee WHERE dept IN (SELECT
dept FROM Techdept)
SELECT ssnum FROM Employee, Techdept WHERE
Employee.dept Techdept.dept
39
Avoid Unnecessary Temp Tables

Creating temp table causes update to catalog
Cannot use any index on original table

SELECT INTO Temp FROM Employee WHERE salary gt
40000 SELECT ssnum FROM Temp WHERE Temp.dept
information systems
SELECT ssnum FROM Employee WHERE Employee.dept
information systems AND salary gt 40000
40
Avoid Complicated Correlation Subqueries

Search all of e2 for each e1 record!

SELECT ssnum FROM Employee e1 WHERE salary
(SELECT MAX(salary) FROM Employee e2
WHERE e2.dept e1.dept
SELECT MAX(salary) as bigsalary, dept INTO
Temp FROM Employee GROUP BY dept SELECT
ssnum FROM Employee, Temp WHERE salary
bigsalary AND Employee.dept Temp.dept
41
Avoid Complicated Correlation Subqueries

SQL Server 2000 does a good job at handling the
correlated subqueries (a hash join is used as
opposed to a nested loop between query blocks)
The techniques implemented in SQL Server 2000 are
described in Orthogonal Optimization of
Subqueries and Aggregates by C.Galindo-Legaria
and M.Joshi, SIGMOD 2001.

gt 10000
gt 1000
42
Join on Clustering and Integer Attributes

Employee is clustered on ssnum
ssnum is an integer

SELECT Employee.ssnum FROM Employee,
Student WHERE Employee.name Student.name
SELECT Employee.ssnum FROM Employee,
Student WHERE Employee.ssnum Student.ssnum
43
Avoid HAVING when WHERE is enough

May first perform grouping for all departments!

SELECT AVG(salary) as avgsalary, dept FROM
Employee GROUP BY dept HAVING dept information
systems
SELECT AVG(salary) as avgsalary FROM
Employee WHERE dept information systems GROUP
BY dept
44
Avoid Views with unnecessary Joins

Join with Techdept unnecessarily

CREATE VIEW Techlocation AS SELECT ssnum,
Techdept.dept, location FROM Employee,
Techdept WHERE Employee.dept Techdept.dept SELE
CT dept FROM Techlocation WHERE ssnum 4444
SELECT dept FROM Employee WHERE ssnum 4444
45
Aggregate Maintenance

Materialize an aggregate if needed frequently
Use trigger to update

create trigger updateVendorOutstanding on orders
for insert as update vendorOutstanding set amount
(select vendorOutstanding.amountsum(inserted.
quantityitem.price) from inserted,item where
inserted.itemnum item.itemnum ) where vendor
(select vendor from inserted)
46
Avoid External Loops

No loop
sqlStmt select from lineitem where l_partkey
lt 200
odbc-gtprepareStmt(sqlStmt)
odbc-gtexecPrepared(sqlStmt)
Loop
sqlStmt select from lineitem where l_partkey
?
odbc-gtprepareStmt(sqlStmt)
for (int i1 ilt200 i)
odbc-gtbindParameter(1, SQL_INTEGER, i)
odbc-gtexecPrepared(sqlStmt)

47
Avoid External Loops
Let the DBMS optimize set operations

SQL Server 2000 on Windows 2000
Crossing the application interface has a
significant impact on performance

48
Avoid Cursors

No cursor
select from employees
Cursor
DECLARE d_cursor CURSOR FOR select from
employees
OPEN d_cursorwhile (_at__at_FETCH_STATUS 0)
BEGIN
FETCH NEXT from d_cursorEND
CLOSE d_cursor
go

49
Avoid Cursors

SQL Server 2000 on Windows 2000
Response time is a few seconds with a SQL query
and more than an hour iterating over a cursor

50
Retrieve Needed Columns Only

All
Select from lineitem
Covered subset
Select l_orderkey, l_partkey, l_suppkey,
l_shipdate, l_commitdate from lineitem

Avoid transferring unnecessary data
May enable use of a covering index.

51
Use Direct Path for Bulk Loading

sqlldr directpathtrue controlload_lineitem.ctl
dataE\Data\lineitem.tbl
load data
infile "lineitem.tbl"
into table LINEITEM append
fields terminated by ''
(
L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER,
L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX,
L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE DATE
"YYYY-MM-DD", L_COMMITDATE DATE "YYYY-MM-DD",
L_RECEIPTDATE DATE "YYYY-MM-DD", L_SHIPINSTRUCT,
L_SHIPMODE, L_COMMENT
)

52
Use Direct Path for Bulk Loading

Direct path loading bypasses the query engine and
the storage manager. It is orders of magnitude
faster than for conventional bulk load (commit
every 100 records) and inserts (commit for each
record).

53
Some Idiosyncrasies