Temple University - PowerPoint PPT Presentation

About This Presentation
Title:

Temple University

Description:

Temple University CIS Dept. CIS661 Principles of Data ... ceil(log(br/M) / log(M-1)) * 2 * br br. M. Q-opt steps. bring query in internal form (eg. ... – PowerPoint PPT presentation

Number of Views:40
Avg rating:3.0/5.0
Slides: 94
Provided by: Vas111
Category:

less

Transcript and Presenter's Notes

Title: Temple University


1
Temple University CIS Dept. CIS661 Principles
of Data Management
  • V. Megalooikonomou
  • Query Optimization
  • (based on slides by C. Faloutsos at CMU)

2
General Overview - rel. model
  • Relational model - SQL
  • Functional Dependencies Normalization
  • Physical Design
  • Indexing
  • Query optimization
  • Transaction processing

3
Overview of a DBMS
casual user
Naïve user
DBA
DML parser
DDL parser
DML precomp.
trans. mgr
buffer mgr
4
Overview - detailed
  • Why q-opt?
  • Equivalence of expressions
  • Cost estimation
  • Cost of indices
  • Join strategies

5
Why Q-opt?
  • SQL declarative
  • good q-opt -gt big difference
  • eg., seq. Scan vs B-tree index, on P1,000 pages

6
Q-opt steps
  • bring query in internal form (eg., parse tree)
  • into canonical form (syntactic q-opt)
  • generate alternative plans
  • estimate cost pick best

7
Q-opt - example
Canonical form
p
select name from STUDENT, TAKES where
c-idCIS661 and STUDENT.ssnTAKES.ssn
s
STUDENT
TAKES
8
Q-opt - example
Hash join merge join nested loops
Index seq. scan
9
Overview - detailed
  • Why q-opt?
  • Equivalence of expressions
  • Cost estimation
  • Cost of indices
  • Join strategies

10
Equivalence of expressions
  • A.k.a. syntactic q-opt
  • In short perform selections and projections
    early
  • More details
  • see transformation rules in text

11
Equivalence of expressions
  • Q How to prove a transformation rule?
  • A use TRC, to show that LHS RHS, e.g.

12
Equivalence of expressions
13
Equivalence of expressions
14
Equivalence of expressions
  • Q how to disprove a rule??

15
Equivalence of expressions
  • Selections
  • perform them early
  • break a complex predicate, and push
  • simplify a complex predicate
  • (XY and Y3) -gt X3 and Y3

16
Equivalence of expressions
  • Projections
  • perform them early (but carefully)
  • Smaller tuples
  • Fewer tuples (if duplicates are eliminated)
  • project out all attributes except the ones
    requested or required (e.g., joining attr.)

17
Equivalence of expressions
  • Joins
  • Commutative , associative
  • Q n-way join - how many diff. orderings?
    Exhaustive enumeration too slow

18
Q-opt steps
  • bring query in internal form (eg., parse tree)
  • into canonical form (syntactic q-opt)
  • generate alt. plans
  • estimate cost pick best

19
Cost estimation
  • Eg., find ssns of students with an A in CIS661
    (using seq. scanning)
  • How long will a query take?
  • CPU (but small cost decreasing tough to
    estimate)
  • Disk (mainly, block transfers)
  • How many tuples will qualify?
  • (what statistics do we need to keep?)

20
Cost estimation
  • Statistics for each relation r we keep
  • nr tuples
  • Sr size of tuple in bytes

21
Cost estimation
Sr
  • Statistics for each relation r we keep
  • V(A,r) number of distinct values of attr. A
  • (recently, histograms, too)

1
2
3

nr
22
Derivable statistics
  • fr blocking factor max records/block (??
    )
  • br blocks (?? )
  • SC(A,r) selection cardinality avg of records
    with Agiven (?? )

23
Derivable statistics
  • fr blocking factor max records/block ( B/Sr
    B block size in bytes)
  • br blocks ( nr / fr )

24
Derivable statistics
  • SC(A,r) selection cardinality avg of records
    with Agiven ( nr / V(A,r) ) (assumes
    uniformity...) eg 30,000 students, 10 colleges
    how many students in CST?

25
Additional quantities we need
  • For index i
  • fi average fanout - degree (50-100)
  • HTi levels of index i (2-3)
  • log(entries)/log(fi)
  • LBi blocks at leaf level

HTi
26
Statistics
  • Where do we store them?
  • How often do we update them?

27
Q-opt steps
  • bring query in internal form (eg., parse tree)
  • into canonical form (syntactic q-opt)
  • generate alt. plans
  • selections sorting projections
  • joins
  • estimate cost pick best

28
Cost estimation plan generation
  • Selections eg.,
  • select
  • from TAKES
  • where grade A
  • Plans?

29
Cost estimation plan generation
  • Plans?
  • seq. scan
  • binary search
  • (if sorted consecutive)
  • index search
  • if an index exists

30
Cost estimation plan generation
  • seq. scan cost?
  • br (worst case)
  • br/2 (average, if we search for primary key)

31
Cost estimation plan generation
  • binary search cost?
  • if sorted and consecutive
  • log(br)
  • SC(A,r)/fr (blocks spanned by qualified tuples)
  • -1

32
Cost estimation plan generation
  • estimation of selection cardinalities SC(A,r)
  • non-trivial details later

33
Cost estimation plan generation
  • method3 index cost?
  • levels of index
  • blocks w/ qual. tuples

...
case1 primary key case2 sec. key clustering
index case3 sec. key non-clust. index
34
Cost estimation plan generation
  • method3 index cost?
  • levels of index
  • blocks w/ qual. tuples

..
case1 primary key cost HTi 1
35
Cost estimation plan generation
Sr
  • method3 index - cost?
  • levels of index
  • blocks w/ qual. tuples

1
fr
2
case2 sec. key clustering index OR prim.
index on non-key retrieve multiple records HTi
SC(A,r)/fr

br
36
Cost estimation plan generation
  • method3 index cost?
  • levels of index
  • blocks w/ qual. tuples

...
case3 sec. key non-clust. index HTi
SC(A,r) (actually, pessimistic...)
37
Cost estimation arithmetic examples
  • find accounts with branch-name Perryridge
  • account(branch-name, balance, ...)

38
Arithm. examples contd
  • n-account 10,000 tuples
  • f-account 20 tuples/block
  • V(balance, account) 500 distinct values
  • V(branch-name, account) 50 distinct values
  • for branch-index fanout fi 20

39
Arithm. examples
  • Q1 cost of seq. scan?
  • A1 500 disk accesses
  • Q2 assume a clustering index on branch-name
    cost?

40
Cost estimation plan generation
Sr
  • method3 index cost?
  • levels of index
  • blocks w/ qual. tuples

1
fr
2
case2 sec. key clustering index HTi
SC(A,r)/fr

br
41
Arithm. examples
  • A2
  • HTi
  • SC(branch-name, account)/f-account
  • HTi 50 values, with index fanout 20 -gt HT2
    levels (log(50)/log(20) 1)
  • SC(..) qualified records
  • nr/V(A,r) 10,000/50 200 tuples
  • SC/f spanning 200/20 blocks 10 blocks

42
Arithm. examples
  • A2 final answer 210 12 block accesses
  • (vs. 500 block accesses of seq. scan)
  • footnote in all fairness
  • seq. disk accesses 2msec or less
  • random disk accesses 10msec

43
Q-opt steps
  • bring query in internal form (eg., parse tree)
  • into canonical form (syntactic q-opt)
  • generate alternative plans
  • selections sorting projections
  • joins
  • estimate cost pick best

44
Examples
  • (selection) find student record with ssn123

45
Reminder our Mini-U db
46
DML - nested subqueries
  • Drill find the ssn of the student with the
    highest GPA

47
File organization
  • Eg., Student records how would you store them
    on disk?

48
General Overview - rel. model
  • Relational model - SQL
  • Functional Dependencies Normalization
  • Physical Design
  • Indexing
  • Query optimization
  • Transaction processing

49
Q-opt steps
  • bring query in internal form (eg., parse tree)
  • into canonical form (syntactic q-opt)
  • generate alt. plans
  • selections (simple complex predicates)
  • sorting projections
  • joins
  • estimate cost pick best

50
Reminder statistics
  • for each relation r we keep
  • nr tuples
  • Sr size of tuple in bytes
  • V(A,r) number of distinct values of attr. A
  • fr blocking factor
  • br number of blocks
  • SC(A,r) selection cardinality (avg. of records
    with Agiven)

51
Selections
  • we saw simple predicates (Aconstant eg.,
    nameSmith)
  • how about more complex predicates, like
  • salary gt 10K
  • age 30 and job-codeanalyst
  • what is their selectivity?

52
Selections complex predicates
  • selectivity sel(P) of predicate P
  • fraction of tuples that qualify
  • sel(P) SC(P) nr

53
Selections complex predicates
  • eg., assume that V(grade, TAKES)5 distinct
    values
  • simple predicate P Aconstant
  • sel(Aconstant) 1/V(A,r)
  • eg., sel(gradeB) 1/5
  • (what if V(A,r) is unknown??)

54
Selections complex predicates
  • range query sel( grade gt C)
  • sel(Agta) (Amax a) / (Amax Amin)

55
Selections - complex predicates
  • negation sel( grade ! C)
  • sel( not P) 1 sel(P)
  • (Observation selectivity probability)

P
56
Selections complex predicates
  • conjunction
  • sel( grade C and course CIS661)
  • sel(P1 and P2) sel(P1) sel(P2)
  • INDEPENDENCE ASSUMPTION

57
Selections complex predicates
  • disjunction
  • sel( grade C or course CIS661)
  • sel(P1 or P2) sel(P1) sel(P2) sel(P1 and
    P2)
  • sel(P1) sel(P2) sel(P1)sel(P2)
  • INDEPENDENCE ASSUMPTION, again

58
Selections complex predicates
  • disjunction in general
  • sel(P1 or P2 or Pn)
  • 1 - (1- sel(P1) ) (1 - sel(P2) ) (1 -
    sel(Pn))

59
Selections summary
  • sel(Aconstant) 1/V(A,r)
  • sel( Agta) (Amax a) / (Amax Amin)
  • sel(not P) 1 sel(P)
  • sel(P1 and P2) sel(P1) sel(P2)
  • sel(P1 or P2) sel(P1) sel(P2)
    sel(P1)sel(P2)
  • UNIFORMITY and INDEPENDENCE ASSUMPTIONS

60
Q-opt steps
  • bring query in internal form (eg., parse tree)
  • into canonical form (syntactic q-opt)
  • generate alt. plans
  • selections (simple complex predicates)
  • sorting projections
  • joins
  • estimate cost pick best

61
Sorting
  • Assume br blocks of rel. r, and
  • only M (ltbr) buffers in main memory
  • Q1 how to sort (external sorting)?
  • Q2 cost?

62
Sorting
  • Q1 how to sort (external sorting)?
  • A1
  • create sorted runs of size M
  • merge

63
Sorting
  • create sorted runs of size M (how many?)
  • merge them (how?)

M
...
...
64
Sorting
  • create sorted runs of size M
  • merge first M-1 runs into a sorted run of
  • (M-1) M, ...

M
..
...
...
65
Sorting
  • How many steps we need to do?
  • i, where M(M-1)i gt br
  • How many reads/writes per step? brbr
  • (each step reads every block once and writes it
    once)

M
..
...
...
66
Sorting
  • In short, excluding the final write, we need
  • ceil(log(br/M) / log(M-1)) 2 br br

M
..
...
...
67
Q-opt steps
  • bring query in internal form (eg., parse tree)
  • into canonical form (syntactic q-opt)
  • generate alt. plans
  • selections (simple complex predicates)
  • sorting projections, aggregations
  • joins
  • estimate cost pick best

68
Projection - dupl. elimination
  • eg.,
  • select distinct c-id
  • from TAKES
  • How?
  • Pros and cons?

69
Set operations
  • eg.,
  • select from REGULAR-STUDENT
  • union
  • select from SPECIAL-STUDENT
  • How?
  • Pros and cons?

70
Aggregations
  • eg.,
  • select ssn, avg(grade)
  • from TAKES
  • How?

71
Q-opt steps
  • bring query in internal form (eg., parse tree)
  • into canonical form (syntactic q-opt)
  • generate alt. plans
  • selections sorting projections, aggregations
  • joins
  • 2-way joins
  • n-way joins
  • estimate cost pick best

72
2-way joins
  • output size estimation r JOIN s
  • nr, ns tuples each
  • case1 cartesian product (R, S have no common
    attribute)
  • of output tuples??

73
2-way joins
  • output size estimation r JOIN s
  • case2 r(A,B), s(A,C,D), A is cand. key for r
  • of output tuples??

ltns
r(A, ...)
s(A, ......)
nr
ns
74
2-way joins
  • output size estimation r JOIN s
  • case3 r(A,B), s(A,C,D), A is cand. key for
    neither (is it possible??)
  • of output tuples??

r(A, ...)
s(A, ......)
nr
ns
75
2-way joins
  • of output tuples
  • nr ns/V(A,s) or ns nr/V(A,r)
  • (whichever is less)

76
Q-opt steps
  • bring query in internal form (eg., parse tree)
  • into canonical form (syntactic q-opt)
  • generate alt. plans
  • selections sorting projections, aggregations
  • joins
  • 2-way joins - output size estimation algorithms
  • n-way joins
  • estimate cost pick best

77
2-way joins
  • algorithm(s) for r JOIN s?
  • nr, ns tuples each

78
2-way joins
  • Algorithm 0 (naive) nested loop (SLOW!)
  • for each tuple tr of r
  • for each tuple ts of s
  • print, if they match

79
2-way joins
  • Algorithm 0 why is it bad?
  • how many disk accesses (br and bs are the
    number of blocks for r and s)?

nrbs br
80
2-way joins
  • Algorithm 1 Blocked nested-loop join
  • read in a block of r
  • read in a block of s
  • print matching tuples

cost br br bs
r(A, ...)
s(A, ......)
nr, br
ns records, bs blocks
81
2-way joins
  • Arithmetic example
  • nr 10,000 tuples, br 1,000 blocks
  • ns 1,000 tuples, bs 200 blocks

alg0 2,001,000 d.a. alg1 201,000 d.a.
r(A, ...)
s(A, ......)
10,000 1,000
1,000 records, 200 blocks
82
2-way joins
  • Observation1 Algo1 asymmetric
  • cost br br bs - reverse roles
  • cost bs bsbr
  • Best choice?

smallest relation in outer loop
r(A, ...)
s(A, ......)
nr, br
ns records, bs blocks
83
2-way joins
  • Observation2 NOT IN BOOK
  • what if we have k buffers available?

read in k-1 blocks of r read in a block
of s print matching tuples
r(A, ...)
s(A, ......)
nr, br
ns records, bs blocks
84
2-way joins
  • Cost?

br br/(k-1) bs
read in k-1 blocks of r read in a block
of s print matching tuples
what if brk-1? what if we assign k-1 blocks to
inner?)
r(A, ...)
s(A, ......)
nr, br
ns records, bs blocks
85
2-way joins
  • Observation3 can we get rid of the br term?
  • cost br br bs

A read the inner relation backwards half of the
times! Q cons?
r(A, ...)
s(A, ......)
nr, br
ns records, bs blocks
86
2-way joins
  • Other algorithm(s) for r JOIN s?
  • nr, ns tuples each

87
2-way joins - other algos
  • sort-merge
  • sort r sort s merge sorted versions
  • (good, if one or both are already sorted)

88
2-way joins - other algos
  • sort-merge - cost
  • 2 br log(br) 2 bs log(bs) br bs
  • needs temporary space (for sorted versions)
  • gives output in sorted order

89
2-way joins - other algos
  • use an existing index, or even build one on the
    fly
  • cost br nr c (c look-up cost)

r(A, ...)
nr
90
2-way joins - other algos
  • hash join
  • hash r into (0, 1, ..., max) buckets
  • hash s into buckets (same hash function)
  • join each pair of matching buckets

r(A, ...)
0
1
max
91
2-way joins - hash join details
  • how to join each pair of partitions Hr-i, Hs-i ?
  • A build another hash table for Hs-i, and probe
    (look up) it with each tuple of Hr-i

Hr-0
Hs-0
r(A, ...)
0
1
max
92
2-way joins - hash join details
  • what if Hs-i is too large to fit in main-memory?
  • A recursive partitioning
  • more details (overflows, hybrid hash joins) in
    book
  • cost of hash join? (under certain assumptions)
  • 2(br bs) (br bs) 4 max

93
Q-opt steps
  • bring query in internal form (eg., parse tree)
  • into canonical form (syntactic q-opt)
  • generate alt. plans
  • selections sorting projections, aggregations
  • joins
  • 2-way joins - output size estimation algorithms
  • n-way joins
  • estimate cost pick best

94
n-way joins
  • r1 JOIN r2 JOIN ... JOIN rn
  • typically, break problem into 2-way joins

95
Structure of query optimizers
  • System R
  • break query in query blocks
  • simple queries (ie., no joins) look at stats
  • n-way joins left-deep join trees ie., only one
    intermediate result at a time
  • pros smaller search space pipelining
  • cons may miss optimal
  • 2-way joins nested-loop and sort-merge

96
Structure of query optimizers
  • More heuristics by Oracle, Sybase and Starburst
    (-gt DB2) in book
  • In general q-opt is very important for large
    databases.
  • (explain select ltsql-statementgt gives plan)

97
Q-opt steps
  • bring query in internal form (eg., parse tree)
  • into canonical form (syntactic q-opt)
  • generate alt. plans
  • selections (simple complex predicates)
  • sorting projections
  • joins
  • estimate cost pick best
Write a Comment
User Comments (0)
About PowerShow.com