Title: Access Path Selection in a RDBMS
1Access Path Selectionin a RDBMS
- P. Griffiths Selinger et al.
- Presented by gwyou
2System R
- System R 1974-1978
- DBMS based on relational model
- IBM San Jose Labs, lots of PhD researchers
- Gray coming from OS
- lots of influence in RSS
- ACM SIGMOD Innovation Award 1992
- Turing Award 1998
- User need not know
- How the tuples are physically stored
- What access paths are available
- Optimizer choose optimal strategy to minimize
total access cost - RSS(Storage System) storage/access methods
(physical) - ? storage, access methods, buffer manager,
lock, log/recovery
3Query Processing in System R
SELECT list of items to be retrieved FROM the
table(s) referenced WHERE the boolean combination
of predicates ? Query Block
Optimal Access Path Selection
Code Generation
Parsed Query
Parser
High Level Query (SQL Statement)
Execution Query Results
4Query Optimizer Access Path Selection
- Access path selection based on expected costs
- SELECT name
- FROM employee
- WHERE job Programmer and city Chicago
- path 1 job index ? check city
- path 2 city index ? check job
5Outline
- Research Storage System
- Costs for single relation access paths
- Access path selection for joins
- Nested Queries
6Research Storage System
- Storage subsystem of System R responsible for
maintaining - Physical storage of relations
- Access paths of these relations
- Locking (in a multi-user environment)
- Logging
- Recovery
- Relations are stored in the RSS as a collection
of tuples whose columns are physically contiguous - Tuples are stored on 4 Kbyte - No tuple spans a
page - Pages are organized into logical units called
segments - Segments may contain one or more
relations, but no relation may span a segment
7RSS Scans Access tuples
- Two RSS Scans
- Segment Scan
- examines all pages of the segment
- returns all tuples of the given relation
- Index Scan
- searches on the column of the relation using
index (e.g., B-tree) - Clustered proximity of index corresponds to
physical proximity - ? each data page will be touched only once
- SARGS (Search arguments)
- Both scans may optionally take a set of
predicates called SARGS - Eliminate the overhead of making RSI calls
- RSI User Interface of System R,
Tuple-oriented interface - A SARGable predicate
- One of the form (or which can be put into the
form) - column comparison-operator value.
- ex. Name Smith, Salary gt 10000
8Outline
- Research Storage System
- Costs for single relation access paths
- Access path selection for joins
- Nested Queries
9Cost Model
- Cost page fetches W (RSI Calls)
- W is a weighting factor
- Pages fetched (I/O) vs. CPU cost
- (RSI Calls) of tuples returned by RSS
- Goal to minimize the cost
10Statistics for Optimization
- Each relation T
- NCARD(T) Cardinality of relation T
- TCARD(T) of pages in the segment containing
tuples from T - P(T) TCARD(T)/( of non-empty pages in the
segment) - ? Fraction of data pages in the segment that
holds tuples of T - Each index I on relation T
- ICARD(I) of distinct keys in index I
- NINDEX(I) of pages in index I
- ? These statistics are maintained in the System R
catalogs
11Query Optimization
- Estimate a selectivity factor F for each
Boolean factor in the predicate list using the
statistics - Selectivity factor
- ? Expected fraction of tuples satisfying the
predicate
12Predicate Selectivity Estimation
13Cost Formulas
COST index pages fetched data pages fetched
w(RSI calls) 1 1 W F(preds)
(NINDX(I)TCARD) WRSICARD F(preds)
(NINDX(I)NCARD) WRSICARD (NINDX(I)
TCARD) WRSICARD TCARD/P WRSICARD
SITUATION Unique index matching an
equal predicate Clustered index I matching
one or more boolean factors Non clustered index
I matching one or more boolean factors Clustered
index I not matching any boolean
factors Segment scan
14Outline
- Research Storage System
- Costs for single relation access paths
- Access path selection for joins
- 2-way Join n-way Join
- Nested Queries
15Two-way join method
R
S
Tuple
R tuples
16N-way joins
- Visualized as a sequence of 2-way joins
- Given n relations, there are n! ways of joining
- Search space can be reduced
- 1. Join of (k1) relation with previous k
relations is independent of first k join order ?
using dynamic programming O(2n) subsets of n
tables -
- 2. Inconsistent Join condition
- Example (2)
- Given relations T1, T2, and T3
- 1. Join predicates T1-T2 ? T1.a T2.b
- 2. Join predicates T2-T3 ? T2.b T2.c
- T1-T2-T3 T2-T1-T3 T3-T1-T2
- T1-T3-T2 T2-T3-T1 T3-T2-T1
X
X
17Cost formulae for Joins
- C-outer(path1) cost of scanning the outer
relation via path 1, - C-inner(path2) cost of scanning the inner
relation - ? The costs of the scans computed using cost
formulas for single relation - N (Product of cardinalities of all relations T
of the join so far) (Product of selectivity
factors of all applicable predicates) - ? cardinality of the outer relation tuples which
satisfy the applicable predicates - C-nested loop join (path1,path2) C-outer(path1)
N C-inner (path2)
18Outline
- Research Storage System
- Costs for single relation access paths
- Access path selection for joins
- Nested Queries
19Nested queries
SELECT NAME FROM EMPLOYEE WHERE SALARY (
SELECT AVG(SALARY) FROM EMPLOYEE )
? Subquery is evaluated before the main query and
is evaluated only once
SELECT NAME FROM EMPLOYEE X WHERE SALARY gt
(SELECT SALARY FROM EMPLOYEE
WHERE EMPLOYEE_NUMBER X.MANAGER)
- Correlation Subquery must be re-evaluated for
each candidate tuple - from the referenced query block
20Q AAny Questions?