Query Processing - PowerPoint PPT Presentation

About This Presentation
Title:

Query Processing

Description:

... scan: can be used for ... First approach: Find the most selective access path, retrieve ... The cost of joining two relations makes the choice of a join ... – PowerPoint PPT presentation

Number of Views:36
Avg rating:3.0/5.0
Slides: 42
Provided by: nihankes
Category:
Tags: processing | query | scan

less

Transcript and Presenter's Notes

Title: Query Processing


1
Query Processing
  • Chapters 12, 14

2
Basic Steps in Query Processing
  • 1. Parsing and translation
  • 2. Optimization
  • 3. Evaluation

3
Basic Steps in Query Processing (Cont.)
  • Parsing and translation
  • translate the query into its internal form. This
    is then translated into relational algebra.
  • Parser checks syntax, verifies relations
  • Evaluation
  • The query-execution engine takes a
    query-evaluation plan, executes that plan, and
    returns the answers to the query.

4
Basic Steps Optimization
  • A relational algebra expression may have many
    equivalent expressions
  • E.g., ?balance?2500(?balance(account)) is
    equivalent to ?balance(?balance?2500(acc
    ount))
  • Each relational algebra operation can be
    evaluated using one of several different
    algorithms
  • Correspondingly, a relational-algebra expression
    can be evaluated in many ways.
  • Annotated expression specifying detailed
    evaluation strategy is called an evaluation-plan.
  • e.g., can use an index on balance to find
    accounts with balance lt 2500,
  • or, can perform complete relation scan and
    discard accounts with balance ? 2500

5
Basic Steps Optimization (Cont.)
  • Query Optimization Amongst all equivalent
    evaluation plans choose the one with lowest cost.
  • Cost is estimated using statistical information
    from the database catalog
  • e.g. number of tuples in each relation, size of
    tuples, etc.
  • We first study
  • How to measure query costs
  • Algorithms for evaluating relational algebra
    operations
  • How to combine algorithms for individual
    operations in order to evaluate a complete
    expression
  • Then
  • We study how to optimize queries, that is, how to
    find an evaluation plan with lowest estimated cost

6
Measures of Query Cost
  • Cost is generally measured as total elapsed time
    for answering query
  • Many factors contribute to time cost
  • disk accesses, CPU, or even network communication
  • Typically disk access is the predominant cost,
    and is also relatively easy to estimate.
    Measured by taking into account
  • Number of seeks average-seek-cost
  • Number of blocks read average-block-transfer-co
    st
  • Number of blocks written average-block-transfer-
    cost

7
Statistics and Catalogs
  • Need information about the relations and indexes
    involved. Catalogs typically contain at least
  • tuples (NTuples) and pages (NPages) for each
    relation.
  • distinct key values (NKeys) and NPages for each
    index.
  • Index height, low/high key values (Low/High) for
    each tree index.
  • Catalogs updated periodically.
  • Updating whenever data changes is too expensive
    lots of approximation anyway, so slight
    inconsistency ok.
  • More detailed information (e.g., histograms of
    the values in some field) are sometimes stored.

8
Relational Operations
  • We will consider how to implement
  • Selection ( )
  • Projection ( )
  • Join ( )
  • Since each op returns a relation, ops can be
    composed. After we cover the operations, we will
    discuss how to optimize queries formed by
    composing them.

9
Computing Selection ?(attr op value)
  • No index on attr
  • If rows are not sorted on attr
  • Scan all data pages to find rows satisfying
    selection condition
  • Cost F (number of pages in file to be sorted)
  • If rows are sorted on attr and op is , gt, lt
    then
  • Use binary search (at log2 F ) to locate first
    data page containing row in which (attr value)
  • Scan further to get all rows satisfying (attr op
    value)
  • Cost log2 F (cost of scan)

10
Computing Selection ?(attr op value)
  • Clustered B tree index on attr (for or
    range search)
  • Locate first index entry corresponding to a row
    in which (attr value). Cost depth of tree
  • Rows satisfying condition packed in sequence in
    successive data pages scan those pages.
  • Cost number of pages occupied by qualifying
    rows

B tree
index entries (containing rows) that
satisfy condition
11
Computing Selection ?(attr op value)
  • Unclustered B tree index on attr (for or
    range search)
  • Locate first index entry corresponding to a row
    in which (attr value).
  • Cost depth of tree
  • Index entries with pointers to rows satisfying
    condition are packed in sequence in successive
    index pages
  • Scan entries and sort record Ids to identify
    table data pages with qualifying rows
  • Any page that has at least one such row must be
    fetched once.
  • Cost number of rows that satisfy selection
    condition

12
Unclustered B Tree Index
index entries (containing row Ids) that
satisfy condition
data page
Data file
B Tree
13
Computing Selection ?(attr value)
  • Hash index on attr (for search only)
  • Hash on value. Cost ? 1.2
  • 1.2 typical average cost of hashing (gt 1 due
    to possible overflow chains)
  • Finds the (unique) bucket containing all index
    entries satisfying selection condition
  • Clustered index all qualifying rows packed in
    the bucket (a few pages)
  • Cost number of pages occupies by the bucket
  • Unclustered index sort row Ids in the index
    entries to identify data pages with qualifying
    rows
  • Each page containing at least one such row must
    be fetched once
  • Cost number of rows in bucket

14
Computing Selection ?(attr value)
  • Unclustered hash index on attr (for equality
    search)

buckets
data pages
15
Access Path
  • Access path is the notion that denotes algorithm
    data structure used to locate rows satisfying
    some condition
  • Examples
  • File scan can be used for any condition
  • Hash equality search all search key
    attributes of hash index are specified in
    condition
  • B tree equality or range search a prefix of
    the search key attributes are specified in
    condition
  • B tree supports a variety of access paths
  • Binary search Relation sorted on a sequence of
    attributes and some prefix of that sequence is
    specified in condition

16
Access Paths Supported by B tree
  • Example Given a B tree whose search key is the
    sequence of attributes a2, a1, a3, a4
  • Access path for search ?a1gt5 ? a23 ? a3x (R)
    find first entry having a23 ? a1gt5 ? a3x and
    scan leaves from there until entry having a2gt3 or
    a3 ? x. Select satisfying entries
  • Access path for search ? a23 ? a3 gtx (R)
    locate first entry having a23 and scan leaves
    until entry having a2gt3. Select satisfying
    entries
  • Access path for search ? a1gt5 ? a3 x (R)
    Scan of R

17
Choosing an Access Path
  • Selectivity of an access path number of pages
    retrieved using that path
  • If several access paths support a query, DBMS
    chooses the one with lowest selectivity
  • Size of domain of attribute is an indicator of
    the selectivity of search conditions that involve
    that attribute
  • Example ? CrsCodeCS305 ? GradeB
  • a B tree with search key CrsCode has lower
    selectivity than a B tree with search key Grade

18
Schema for Examples
Sailors (sid integer, sname string, rating
integer, age real) Reserves (sid integer, bid
integer, day dates, rname string)
  • Similar to old schema rname added for
    variations.
  • Reserves
  • Each tuple is 40 bytes long, 100 tuples per
    page, 1000 pages.
  • Sailors
  • Each tuple is 50 bytes long, 80 tuples per page,
    500 pages.

19
Simple Selections
SELECT FROM Reserves R WHERE R.rname
Joe
  • With no index, unsorted
  • Must essentially scan the whole relation cost
    is 1000 I/Os (pages in R).
  • With no index, sorted data
  • Utilize the sort order on rname by doing a
    binary search to locate the first Joe. Cost is
    log2 1000 ? 10 I/Os.
  • With a B tree index on selection attribute
  • Use index to find qualifying data entries, then
    retrieve corresponding data records. Cost of
    finding the starting page is 2 or 3 I/Os for a
    clustered index add one more I/O for an
    unclustered index add one page per qualifying
    tuple.
  • Hash index
  • 1 or 2 I/Os to retrieve the index pages. If 100
    reservations by Joe then an additional 1-100 disk
    accesses depending how these records are
    distributed.

20
Using an Index for Selections
SELECT FROM Reserves R WHERE R.rname lt
C
  • Cost depends on qualifying tuples, and
    clustering.
  • Assume we estimate roughly 10 of Reserves tuples
    will be in result ( 10,000 tuples, or 100
    pages).
  • With a clustered index cost is 100 I/Os 1 or 2
    disk accesses for index.
  • With an unclustered index cost could be as high
    as 10,000 I/Os in the worst case. (might be
    cheaper to simply scan the entire relation)

21
A Note on Complex Selections
(daylt8/9/94 AND rnamePaul) OR bid5 OR sid3
  • Selection conditions are first converted to
    conjunctive normal form (CNF)
  • (daylt8/9/94 OR bid5 OR sid3 ) AND
    (rnamePaul OR bid5 OR sid3)

22
Two Approaches to General Selections
  • Consider the selection condition
  • daylt8/9/94 AND bid5 AND sid3
  • First approach Find the most selective access
    path, retrieve tuples using it, and apply any
    remaining terms that dont match the index
  • A B tree index on day can be used then, bid5
    and sid3 must be checked for each retrieved
    tuple.
  • Similarly, a hash index on ltbid, sidgt could be
    used daylt8/9/94 must then be checked.
  • Terms that match the index reduce the number of
    tuples retrieved other terms are used to discard
    some retrieved tuples, but do not affect number
    of tuples/pages fetched.

23
Intersection of Rids
  • Second approach (if we have 2 or more matching
    indexes)
  • Get sets of rids of data records using each
    matching index.
  • Then intersect these sets of rids.
  • Retrieve the records and apply any remaining
    terms.
  • For the given example condition
  • If we have a B tree index on day and an index
    on sid, we can retrieve rids of records
    satisfying daylt8/9/94 using the first, rids of
    records satisfying sid3 using the second,
    intersect, retrieve records and check bid5.

24
The Projection Operation
  • To implement projection we have to do the
    following
  • Remove unwanted attributes.
  • Eliminate any duplicate tuples produced.
  • The expensive part is removing duplicates.
  • SQL systems dont remove duplicates unless the
    keyword DISTINCT is specified in a query.
  • There are two basic algorithms
  • Sorting Approach.
  • Hashing Approach.

25
Approach based on sorting
  • Modify Pass 1 of external sort to eliminate
    unwanted fields. If B buffer pages are
    available, runs of about 2B pages can be
    produced, but tuples in runs are smaller than
    input tuples. (Size ratio depends on and size
    of fields that are dropped.)
  • Modify merging passes to eliminate duplicates.
    Thus, number of result tuples smaller than input.
    (Difference depends on of duplicates.)

26
Example
SELECT DISTINCT R.sid, R.bid FROM Reserves
R
  • Cost
  • In Pass 1, read original relation (1000 pages),
    write out same number of smaller tuples.
  • Assume we have 20 buffer pages
  • Assume that each smaller tuple is 10 bytes long.
  • Thus cost is 250 pages (7 runs about 40 pages
    each).
  • In merging passes, fewer tuples written out in
    each pass.
  • The temporary relation can be merged in 1 pass,
    since we have 20 buffer pages.
  • We read the runs at a cost of 250 I/Os and merge
    them.
  • The total cost is 1500 I/Os.

27
Projection Based on Hashing
  • Partitioning phase
  • Read R using one input buffer. For each tuple,
    discard unwanted fields, apply hash function h1
    to choose one of B-1 output buffers.
  • Result is B-1 partitions (of tuples with no
    unwanted fields). 2 tuples from different
    partitions guaranteed to be distinct.
  • Duplicate elimination phase
  • For each partition, read it and build an
    in-memory hash table, using hash fn h2 (? h1) on
    all fields, while discarding duplicates.
  • If partition does not fit in memory, can apply
    hash-based projection algorithm recursively to
    this partition.
  • Cost For partitioning, read R, write out each
    tuple, but with fewer fields. This is read in
    next phase.
  • In our projection example this cost is 1000 2
    250 1500 I/Os.

28
Discussion of Projection
  • Sort-based approach is the standard better
    handling of duplicate elimination and result is
    sorted.
  • If an index on the relation contains all wanted
    attributes in its search key, can do index-only
    scan.
  • Apply projection techniques to data entries (much
    smaller!)
  • If an ordered (i.e., tree) index contains all
    wanted attributes as prefix of search key, can do
    even better
  • Retrieve data entries in order (index-only scan),
    discard unwanted fields, compare adjacent tuples
    to check for duplicates.

29
Computing Joins R AB S
  • The cost of joining two relations makes the
    choice of a join algorithm crucial
  • Assume M pages in R, pR tuples per page, N pages
    in S, pS tuples per page.
  • In our examples, R is Reserves and S is Sailors.
  • Cost metric of I/Os. We will ignore output
    costs.

30
Simple Nested Loops Join
foreach tuple r in R do foreach tuple s in S
do if ri sj then add ltr, sgt to result
  • For each tuple in the outer relation R, we scan
    the entire inner relation S.
  • Cost M pR M N 1000 1001000500
    I/Os.
  • Page-oriented Nested Loops join For each page
    of R, get each page of S, and write out matching
    pairs of tuples ltr, sgt, where r is in R-page and
    S is in S-page.
  • Cost M MN 1000 1000500
  • If smaller relation (S) is outer, cost 500
    5001000
  • Choose smaller relation for the outer loop

31
Block Nested Loops Join
  • Use one page as an input buffer for scanning the
    inner S, one page as the output buffer, and use
    all remaining pages to hold block of outer R.
  • For each matching tuple r in R-block, s in
    S-page, add ltr, sgt to result. Then read next
    R-block, scan S, etc.
  • Cost can be reduced to
  • M (M/(B-2)) ? N (by using B
    buffer pages instead of 1.)

R S
Join Result
Hash table for block of R (B-2 pages)
. . .
. . .
Input buffer for S
Output buffer
32
Examples of Block Nested Loops
  • Cost Scan of outer outer blocks scan of
    inner
  • outer blocks ? of pages of outer / blocksize?
  • i.e. Cost M N ?M/(B-2) ?
  • With Reserves (R) as outer, and 100 pages of R
  • Cost of scanning R is 1000 I/Os a total of 10
    blocks.
  • Per block of R, we scan Sailors (S) 10500
    I/Os.
  • If space for just 90 pages of R, we would scan S
    12 times.
  • With 100-page block of Sailors as outer
  • Cost of scanning S is 500 I/Os a total of 5
    blocks.
  • Per block of S, we scan Reserves 51000 I/Os.

33
Index Nested Loops Join
foreach tuple r in R do foreach tuple s in S
where ri sj do add ltr, sgt to result
  • If there is an index on the join column of one
    relation (say S), can make it the inner and
    exploit the index.
  • Cost M ( (MpR) cost of finding matching S
    tuples)
  • For each R tuple, cost of probing S index is
    about 1.2 for hash index, 2-4 for B tree. Cost
    of then finding S tuples depends on clustering.
  • Clustered index 1 I/O (typical), unclustered
    upto 1 I/O per matching S tuple.
  • Effective if number of rows of S that match
    tuples in R is small and index is clustered

34
Examples of Index Nested Loops
  • Hash-index (unclustered) on sid of Sailors (as
    inner)
  • Scan Reserves 1000 page I/Os, 1001000 tuples.
  • For each Reserves tuple 1.2 I/Os to get data
    entry in index, plus 1 I/O to get (the exactly
    one) matching Sailors tuple. Total 220,000
    I/Os.
  • Hash-index (unclustered) on sid of Reserves (as
    inner)
  • Scan Sailors 500 page I/Os, 80500 tuples.
  • For each Sailors tuple 1.2 I/Os to find index
    page with data entries (1.2 40000 48,500
    I/Os), plus cost of retrieving matching Reserves
    tuples. Assuming uniform distribution, 2.5
    reservations per sailor (100,000 / 40,000), thus
    cost of retrieving them is 2.540,000
    I/Os.Total cost is 48,500 100,000 148,500
    I/Os (still better than simple nested loops)

35
Sort-Merge Join R AB S
sort R on A sort S on B while !eof(R) and
!eof(S) do scan r and s concurrently until
tr.Ats.B if (tr.Ats.Bc) output
?Ac(r)??Bc (s)
?Ac(r)
r
?
s
?Bc (s)
36
Sort-Merge Join
  • Cost of sorting assuming B buffers
    2 M log B-1 M 2 N log B-1 N
  • Cost of merging depends on whether ?Ac(R) and
    ?Bc (S) can be fit in buffers
  • If yes, merge step takes M N
  • In no, then each ?Ac(r)??Bc (s) has to be
    computed using block-nested join.

37
Example of Sort-Merge Join
  • Both Reserves and Sailors can be sorted in 2
    passes.
  • With 100 buffer pages Sorting R 22 1000
    sorting S 22 500 Thus total join cost 7500.

38
Hash-Join
  • Partition both relations using hash fn h R
    tuples in partition i will only match S tuples in
    partition i.
  • Read in a partition of R, hash it using h2 (ltgt
    h!). Scan matching partition of S, search for
    matches.

39
Observations on Hash-Join
  • partitions k ? B-1 (why?), and B-2 gt size of
    largest partition to be held in memory. Assuming
    uniformly sized partitions, and maximizing k, we
    get
  • k B-1, and M/(B-1) lt B-2, i.e., B must be gt
  • If we build an in-memory hash table to speed up
    the matching of tuples, a little more memory is
    needed.
  • If the hash function does not partition
    uniformly, one or more R partitions may not fit
    in memory. Can apply hash-join technique
    recursively to do the join of this R-partition
    with corresponding S-partition.

40
Cost of Hash-Join
  • In partitioning phase, readwrite both relns
    2(MN). In matching phase, read both relns MN
    I/Os.
  • Sort-Merge Join vs. Hash Join
  • Given a minimum amount of memory, both have a
    cost of 3(MN) I/Os. Hash Join superior on this
    count if relation sizes differ greatly. Also,
    Hash Join shown to be highly parallelizable.
  • Sort-Merge less sensitive to data skew result is
    sorted.

41
General Join Conditions
  • Nested loop and block nested loops can be used
    regardless of the join condition.
  • Equalities over several attributes (e.g.,
    R.sidS.sid AND R.rnameS.sname)
  • For Index NL, build index on ltsid, snamegt (if S
    is inner) or use existing indexes on sid or
    sname.
  • For Sort-Merge and Hash Join, sort/partition on
    combination of the two join columns.
  • Inequality conditions (e.g., R.rname lt S.sname)
  • For Index NL, need (clustered!) B tree index.
  • Range probes on inner matches likely to be
    much higher than for equality joins.
  • Hash Join, Sort Merge Join not applicable.
  • Block NL quite likely to be the best join method
    here.
Write a Comment
User Comments (0)
About PowerShow.com