Title: Principles of Query Processing
1Principles of Query Processing
CS5226 Week 5
- Pang Hwee Hwa
- School of Computing, NUS
2ApplicationProgrammer(e.g., business
analyst, Data architect)
Application
SophisticatedApplicationProgrammer(e.g., SAP
admin)
Query Processor
Indexes
Storage Subsystem
Concurrency Control
Recovery
DBA,Tuner
Operating System
HardwareProcessor(s), Disk(s), Memory
3Overview of Query Processing
Database
Statistics
Cost Model
Query Optimizer
Query Evaluator
QEP
Parsed Query
Parser
High Level Query
Query Result
4Outline
- Processing relational operators
- Query optimization
- Performance tuning
5Projection Operator
- ?R.attrib, .. (R)
- Implementation is straightforward
SELECT bid FROM Reserves R WHERE R.rname lt
C
6Selection Operator
- ?R.attr op value (R)
- Size of result R selectivity
- Scan
- Clustered index Good
- Non-clustered index
- Good for low selectivity
- Worse than scan for high selectivity
SELECT FROM Reserves R WHERE R.rname lt
C
7Example of Join
SELECT FROM Sailors R, Reserve S WHERE
R.sidS.sid
8Notations
- R number of pages in outer table R
- R number of tuples in outer table R
- S number of pages in inner table S
- S number of tuples in inner table S
- M number of main memory pages allocated
9Simple Nested Loop Join
R
S
Tuple
R tuples
10Simple Nested Loop Join
- Scan inner table S per R tuple R S
- Each scan costs S pages
- For R tuples
- R pages for outer table R
- Total cost R R S pages
- Not optimal!
11Block Nested Loop Join
R
S
M 2 pages
?R / (M 2)? blocks
12Block Nested Loop Join
- Scan inner table S per block of (M 2) pages of
R tuples - Each scan costs S pages
- R / (M 2) blocks of R tuples
- R pages for outer table R
- Total cost R ?R / (M 2)? S pages
- R should be the smaller table
13Index Nested Loop Join
R
S
Index
Tuple
R tuples
14Index Nested Loop Join
- Probe S index for matching S tuples per R tuple
- Probe hash index 1.2 I/Os
- Probe B tree 2-4 I/Os, plus retrieve matching S
tuples 1 I/O - For R tuples
- R pages for outer table R
- Total cost R R index retrieval
- Better than Block NL join only for small number
of R tuples
15Sort Merge Join
- External sort R
- External sort S
- Merge sorted R and sorted S
16External Sort R
(m-1)-way merge
Size of R0,i M, R0,is ?R/M?
merge passes ?logM-1 R/M? Cost per pass
R input R output 2 R Total cost 2
R (?logM-1 R/M? 1) including split pass
17Sort Merge Join
- External-sort R 2 R (?logM-1 R/M? 1)
- Split R into R/M sorted runs each of size M 2
R - Merge up to (M 1) runs repeatedly
- ?logM-1 R/M? passes, each costing 2 R
- External-sort S 2 S (?logM-1 S/M? 1)
- Merge matching tuples from sorted R and S R
S - Total cost 2 R (?logM-1 R/M? 1) 2 S
(?logM-1 S/M? 1) R S - If R lt M(M-1), cost 5 (R S)
18GRACE Hash Join
S
0 1 2 3
X X X X X X X X X X X X
X X X X X X
X X X X X X
X X X
X X X
X X X
X X X
0
bucketID X mod 4 Join on R.X S.X
1
R S R0 S0 R1
S1 R2 S2
R3 S3
R
2
3
19GRACE Hash Join Partition Phase
R ? (M 1) partitions, each of size R / (M 1)
20GRACE Hash Join Join Phase
Partition must fit in memory R / (M 1) lt M -1
21GRACE Hash Join Algorithm
- Partition phase 2 (R S)
- Partition table R using hash function h1 2 R
- Partition table S using hash function h1 2 S
- R tuples in partition i will match only S tuples
in partition I - R ? (M 1) partitions, each of size R / (M
1) - Join phase R S
- Read in a partition of R (R / (M 1) lt M -1)
- Hash it using function h2 (ltgt h1!)
- Scan corresponding S partition, search for
matches - Total cost 3 (R S) pages
- Condition M gt vfR, f 1.2 to account for hash
table
22Summary of Join Operator
- Simple nested loop R R S
- Block nested loop R ?R / (M 2)? S
- Index nested loop R R index retrieval
- Sort-merge 2 R (?logM-1 R/M? 1) 2 S
(?logM-1 S/M? 1) R S - GRACE hash 3 (R S)
- Condition M gt vfR
23Overview of Query Processing
Database
Statistics
Cost Model
Query Optimizer
Query Evaluator
QEP
Parsed Query
Parser
High Level Query
Query Result
24Query Optimization
- Given An SQL query joining n tables
- Dream Map to most efficient plan
- Reality Avoid rotten plans
- State of the art
- Most optimizers follow System Rs technique
- Works fine up to about 10 joins
SELECT S.sname FROM Reserves R, Sailors S WHERE
R.sidS.sid AND R.bid100 AND S.ratinggt5
25Complexity of Query Optimization
- Many degrees of freedom
- Selection scan versus (clustered, non-clustered)
index - Join block nested loop, sort-merge, hash
- Relative order of the operators
- Exponential search space!
- Heuristics
- Push the selections down
- Push the projections down
- Delay Cartesian products
- System R Only left-deep trees
26Equivalences in Relational Algebra
- Selection
- cascade -
- commutative - Projection
- cascade - Join
- associative -
- commutative
R (S T) (R S) T
(R S) (S R)
27Equivalences in Relational Algebra
- A projection commutes with a selection that only
uses attributes retained by the projection - Selection between attributes of the two arguments
of a cross-product converts cross-product to a
join - A selection on just attributes of R commutes with
join R S (i.e., (R S)
(R) S ) - Similarly, if a projection follows a join R
S, we can push it by retaining only attributes
of R (and S) that are needed for the join or are
kept by the projection
28System R Optimizer
- Find all plans for accessing each base table
- For each table
- Save cheapest unordered plan
- Save cheapest plan for each interesting order
- Discard all others
- Try all ways of joining pairs of 1-table plans
save cheapest unordered interesting ordered
plans - Try all ways of joining 2-table with 1-table
- Combine k-table with 1-table till you have full
plan tree - At the top, to satisfy GROUP BY and ORDER BY
- Use interesting ordered plan
- Add a sort node to unordered plan
29Source Selinger et al, Access Path Selection in
a Relational Database Management System
30Note Only branches for NL join are shown here.
Additional branches for other join
methods (e.g. sort-merge) are not shown.
Source Selinger et al, Access Path Selection in
a Relational Database Management System
31What is Cheapest?
- Need information about the relations and indexes
involved - Catalogs typically contain at least
- tuples (NTuples) and pages (NPages) for each
relation. - distinct key values (NKeys) and NPages for each
index. - Index height, low/high key values (Low/High) for
each tree index. - Catalogs updated periodically.
- Updating whenever data changes is too expensive
lots of approximation anyway, so slight
inconsistency ok. - More detailed information (e.g., histograms of
the values in some field) are sometimes stored.
32Estimating Result Size
SELECT attribute list FROM relation list WHERE
term1 AND ... AND termk
- Consider a query block
- Maximum tuples in result is the product of the
cardinalities of relations in the FROM clause. - Reduction factor (RF) associated with each termi
reflects the impact of the term in reducing
result size - Term colvalue has RF 1/NKeys(I)
- Term col1col2 has RF 1/MAX(NKeys(I1), NKeys(I2))
- Term colgtvalue has RF (High(I)-value)/(High(I)-Low
(I)) - Result cardinality Max tuples product of
all RFs. - Implicit assumption that terms are independent!
33Cost Estimates for Single-Table Plans
- Index I on primary key matches selection
- Cost is Height(I)1 for a B tree, about 1.2 for
hash index. - Clustered index I matching one or more selects
- (NPages(I)NPages(R)) product of RFs of
matching selects. - Non-clustered index I matching one or more
selects - (NPages(I)NTuples(R)) product of RFs of
matching selects. - Sequential scan of file
- NPages(R).
- Note Typically, no duplicate elimination on
projections! (Exception Done on answers if user
says DISTINCT.)
34Counting the Costs
- With 5 buffers, cost of plan
- Scan Reserves (1000) write temp T1 (10 pages,
if we have 100 boats, uniform distribution) - Scan Sailors (500) write temp T2 (250 pages, if
we have 10 ratings). - Sort T1 (2102), sort T2 (22504), merge
(10250), total2300 - Total 4060 page I/Os
- If we used BNL join, join cost 104250, total
cost 2770 - If we push projections, T1 has only sid, T2
only sid and sname - T1 fits in 3 pages, cost of BNL drops to under
250 pages, total lt 2000
SELECT S.sname FROM Reserves R, Sailors S WHERE
R.sidS.sid AND R.bid100 AND S.ratinggt5
35Exercise
- Reserves 100,000 tuples, 100 tuples per page
- With clustered index on bid of Reserves, we get
100,000/100 1000 tuples on 1000/100 10 pages - Join column sid is a key for Sailors - at most
one matching tuple - Decision not to push ratinggt5 before the join is
based on availability of sid index on Sailors - Cost Selection of Reserves tuples (10 I/Os)
for each tuple, must get matching Sailors tuple
(10001.2) total 1210 I/Os
(On-the-fly)
sname
(On-the-fly)
rating gt 5
(Index Nested Loops,
with pipelining )
sidsid
(Use hash Index on sid)
Sailors
bid100
(Use clustered index on sid)
Reserves
36Query Tuning
37Avoid Redundant DISTINCT
- DISTINCT usually entails a sort operation
- Slow down query optimization because one more
interesting order to consider - Remove if you know the result has no duplicates
SELECT DISTINCT ssnum FROM Employee WHERE dept
information systems
38Change Nested Queries to Join
- Might not use index on Employee.dept
- Need DISTINCT if an employee might belong to
multiple departments
SELECT ssnum FROM Employee WHERE dept IN (SELECT
dept FROM Techdept)
SELECT ssnum FROM Employee, Techdept WHERE
Employee.dept Techdept.dept
39Avoid Unnecessary Temp Tables
- Creating temp table causes update to catalog
- Cannot use any index on original table
SELECT INTO Temp FROM Employee WHERE salary gt
40000 SELECT ssnum FROM Temp WHERE Temp.dept
information systems
SELECT ssnum FROM Employee WHERE Employee.dept
information systems AND salary gt 40000
40Avoid Complicated Correlation Subqueries
- Search all of e2 for each e1 record!
SELECT ssnum FROM Employee e1 WHERE salary
(SELECT MAX(salary) FROM Employee e2
WHERE e2.dept e1.dept
SELECT MAX(salary) as bigsalary, dept INTO
Temp FROM Employee GROUP BY dept SELECT
ssnum FROM Employee, Temp WHERE salary
bigsalary AND Employee.dept Temp.dept
41Avoid Complicated Correlation Subqueries
- SQL Server 2000 does a good job at handling the
correlated subqueries (a hash join is used as
opposed to a nested loop between query blocks) - The techniques implemented in SQL Server 2000 are
described in Orthogonal Optimization of
Subqueries and Aggregates by C.Galindo-Legaria
and M.Joshi, SIGMOD 2001.
gt 10000
gt 1000
42Join on Clustering and Integer Attributes
- Employee is clustered on ssnum
- ssnum is an integer
SELECT Employee.ssnum FROM Employee,
Student WHERE Employee.name Student.name
SELECT Employee.ssnum FROM Employee,
Student WHERE Employee.ssnum Student.ssnum
43Avoid HAVING when WHERE is enough
- May first perform grouping for all departments!
SELECT AVG(salary) as avgsalary, dept FROM
Employee GROUP BY dept HAVING dept information
systems
SELECT AVG(salary) as avgsalary FROM
Employee WHERE dept information systems GROUP
BY dept
44Avoid Views with unnecessary Joins
- Join with Techdept unnecessarily
CREATE VIEW Techlocation AS SELECT ssnum,
Techdept.dept, location FROM Employee,
Techdept WHERE Employee.dept Techdept.dept SELE
CT dept FROM Techlocation WHERE ssnum 4444
SELECT dept FROM Employee WHERE ssnum 4444
45Aggregate Maintenance
- Materialize an aggregate if needed frequently
- Use trigger to update
create trigger updateVendorOutstanding on orders
for insert as update vendorOutstanding set amount
(select vendorOutstanding.amountsum(inserted.
quantityitem.price) from inserted,item where
inserted.itemnum item.itemnum ) where vendor
(select vendor from inserted)
46Avoid External Loops
- No loop
- sqlStmt select from lineitem where l_partkey
lt 200 - odbc-gtprepareStmt(sqlStmt)
- odbc-gtexecPrepared(sqlStmt)
-
- Loop
- sqlStmt select from lineitem where l_partkey
? - odbc-gtprepareStmt(sqlStmt)
- for (int i1 ilt200 i)
-
- odbc-gtbindParameter(1, SQL_INTEGER, i)
- odbc-gtexecPrepared(sqlStmt)
-
47Avoid External Loops
Let the DBMS optimize set operations
- SQL Server 2000 on Windows 2000
- Crossing the application interface has a
significant impact on performance
48Avoid Cursors
- No cursor
- select from employees
- Cursor
- DECLARE d_cursor CURSOR FOR select from
employees - OPEN d_cursorwhile (_at__at_FETCH_STATUS 0)
- BEGIN
- FETCH NEXT from d_cursorEND
- CLOSE d_cursor
- go
49Avoid Cursors
- SQL Server 2000 on Windows 2000
- Response time is a few seconds with a SQL query
and more than an hour iterating over a cursor
50Retrieve Needed Columns Only
- All
- Select from lineitem
- Covered subset
- Select l_orderkey, l_partkey, l_suppkey,
l_shipdate, l_commitdate from lineitem
- Avoid transferring unnecessary data
- May enable use of a covering index.
51Use Direct Path for Bulk Loading
- sqlldr directpathtrue controlload_lineitem.ctl
dataE\Data\lineitem.tbl - load data
- infile "lineitem.tbl"
- into table LINEITEM append
- fields terminated by ''
- (
- L_ORDERKEY, L_PARTKEY, L_SUPPKEY, L_LINENUMBER,
L_QUANTITY, L_EXTENDEDPRICE, L_DISCOUNT, L_TAX,
L_RETURNFLAG, L_LINESTATUS, L_SHIPDATE DATE
"YYYY-MM-DD", L_COMMITDATE DATE "YYYY-MM-DD",
L_RECEIPTDATE DATE "YYYY-MM-DD", L_SHIPINSTRUCT,
L_SHIPMODE, L_COMMENT - )
52Use Direct Path for Bulk Loading
- Direct path loading bypasses the query engine and
the storage manager. It is orders of magnitude
faster than for conventional bulk load (commit
every 100 records) and inserts (commit for each
record).
53Some Idiosyncrasies
- OR may stop the index being used
- break the query and use UNION
- Order of tables may affect join implementation
54Query Tuning Thou Shalt
- Avoid redundant DISTINCT
- Change nested queries to join
- Avoid unnecessary temp tables
- Avoid complicated correlation subqueries
- Join on clustering and integer attributes
- Avoid HAVING when WHERE is enough
- Avoid views with unnecessary joins
- Maintain frequently used aggregates
- Avoid external loops
55Query Tuning Thou Shalt
- Avoid cursors
- Retrieve needed columns only
- Use direct path for bulk loading