File Processing : Query Processing - PowerPoint PPT Presentation

1 / 28
About This Presentation
Title:

File Processing : Query Processing

Description:

Retrieve records satisfying predicates. Example. Find Student ... Contiguously stored on blocks. Nblock = Nselected / Bf. STEM. PNU. Selectivity Estimation ... – PowerPoint PPT presentation

Number of Views:33
Avg rating:3.0/5.0
Slides: 29
Provided by: lik
Category:

less

Transcript and Presenter's Notes

Title: File Processing : Query Processing


1
File Processing Query Processing
  • 2008, Spring
  • Pusan National University
  • Ki-Joune Li

2
Basic Concepts of Query
  • Query
  • Retrieve records satisfying predicates
  • Types of Query
  • Operators
  • Aggregate Query
  • Sorting

3
Relational Operators Select
  • Selection (?condition)
  • Retrieve records satisfying predicates
  • Example
  • Find Student where Student.Score gt 3.5
  • ?scoregt3.5(Student)
  • Index or Hash

4
Relational Operators Project
  • Project (?attributes)
  • Extract interesting attributes
  • Example
  • Find Student.name where score gt 3.5
  • ?name(?acoregt3.5(Student))
  • Full Scan

5
Cartisan Product
  • Cartisan Product (?)
  • Two Tables R1 ? R2
  • Produce all cross products
  • Join ( )

?

6
Join
  • Join ( )
  • Select combined records of cartisan product with
    same value of a common attribute (Natural Join)
  • Example
  • Student (StudentName, AdvisorProfessorID,
    Department, Score)
  • Professor(ProfessorName, ProfessorID,
    Department)
  • Student AdivsorProfessorIDProfessorID
    Professor
  • ? AdivsorProfessorIDProfessorID(Student
    ?Professor)
  • Double Scan Expensive Operation

7
Relational Algebra
  • Relational Algebra
  • Operand Table (Relation)
  • Operator Relational Operator (?, ?, , etc)
  • Example
  • Find Student Name where Student Score gt 3.5 and
    Advisor Professor belongs to CSE Department
  • ?student.name(?acoregt3.5(Student)
    ?DepartmentCSE (Professor) )
  • Relational Algebra Specifies the sequence of
    operations

8
Query Processing Mechanism
  • Query Processing Steps
  • 1. Parsing and translation
  • 2. Optimization
  • 3. Evaluation

9
Parsing and Translation
  • Parsing Query Statement (e.g. in SQL)
  • Translation into relational algebra
  • Equivalent Expression
  • For a same query statement
  • several relation algebraic expressions are
    possible
  • Example
  • ?balance ? 2500(?name(account )) ? ?name(?balance
    ? 2500(account ))
  • Different execution schedules
  • Query Execution Plan (QEP)
  • Determined by relational algebra
  • Several QEPs may be produced by Parsing and
    Translation

10
Query Optimization
  • Choose ONE QEP among QEPs based on
  • Execution Cost of each QEP, where cost means
    execution time
  • How to find cost of each QEP ?
  • Real Execution
  • Exact but Not Feasible
  • Cost Estimation
  • Types of Operations
  • Number of Records
  • Selectivity
  • Distribution of data

11
Cost Model Basic Concepts
  • Cost Model Number of Block Accesses
  • Cost
  • C Cindex Cdata
  • where Cindex Cost for Index Access
  • Cdata Cost for Data Block Retrieval
  • Cindex vs. Cdata ?
  • Cindex depends on index
  • Cdata
  • depends on selectivity
  • Random Access or Sequential Access
  • Selectivity
  • Number (or Ratio) of Objects Selected by Query

12
Cost Model Type of Operations
  • Cost model for each type of operations
  • Select
  • Project
  • Join
  • Aggregate Query
  • Query Processing Method for each type of
    operations
  • Index/Hash or Not

13
Cost Model Number of Records
  • Number of Records
  • Nrecord ? Nblocks
  • Number of Scans
  • Single Scan
  • O(N) Linear Scan
  • O(logN ) Index
  • Multiple Scans
  • O(NM ) Multiple Linear Scans
  • O(N logM ) Multiple Scans with Index

14
Selectivity
  • Selectivity
  • Affects on Cdata
  • Random Access
  • Scattered on several blocks
  • Nblock ? Nselected
  • Sequential Access
  • Contiguously stored on blocks
  • Nblock Nselected / Bf

15
Selectivity Estimation
  • Selectivity Estimation
  • Depends on Data Distribution
  • Example
  • Q1 Find students where 60 lt weight lt 70
  • Q2 Find students where 80 lt weight lt 90
  • How to find the distribution
  • Parametric Method
  • e.g. Gaussian Distribution
  • No a priori knowledge
  • Non-Parametric Method
  • e.g. Histogram
  • Smoothing is necessary
  • Wavelet, Discrete Cosine

16
Select Linear Search
  • Algorithm linear search
  • Scan each file block and test all records to see
    whether they satisfy the selection condition.
  • Cost estimate (number of disk blocks scanned)
    br
  • br denotes number of blocks containing records
    from relation r
  • If selection is on a key attribute (sorted), cost
    (br /2)
  • stop on finding record
  • Linear search can be applied regardless of
  • selection condition or
  • ordering of records in the file, or
  • availability of indices

17
Select Range Search
  • Algorithm primary index, comparison
  • Relation is sorted on A
  • For ?A ? V (r)
  • Step 1 use index to find first tuple ? v and
  • Step 2 scan relation sequentially
  • For ?A?V (r)
  • just scan relation sequentially till first tuple
    gt v do not use index
  • Algorithm secondary index, comparison
  • For ?A ? V (r)
  • Step 1 use index to find first index entry ? v
    and
  • Step 2 scan index sequentially to find pointers
    to records.
  • For ?A?V (r)
  • scan leaf nodes of index finding pointers to
    records, till first entry gt v

18
Select Range Search
  • Comparison between
  • Searching with Index and
  • Linear Search
  • Secondary Index
  • retrieval of records that are pointed to
  • requires an I/O for each record
  • Linear file scan may be cheaper
  • if records are scattered on many blocks
  • clustering is important for this reason

19
Select Complex Query
  • Conjunction ??1 ? ?2 ?. . . ?n(r)
  • Algorithm selection using one index
  • Step 1 Select a combination of ?i (??i (r) )
  • Step 2 Test other conditions on tuple after
    fetching it into memory buffer.
  • Algorithm selection using multiple-key index
  • Use appropriate multiple-attribute index if
    available.
  • Algorithm selection by intersection of
    identifiers
  • Step 1 Requires indices with record pointers.
  • Step 2 Intersection of all the obtained sets of
    record pointers.
  • Step 3 Then fetch records from file
  • Disjunction ??1? ?2 ?. . . ?n (r)
  • Algorithm Disjunctive selection by union of
    identifiers

20
Join Operation
  • Several different algorithms to implement joins
  • Nested-loop join
  • Block nested-loop join
  • Indexed nested-loop join
  • Merge-join
  • Hash-join
  • Choice based on cost estimate
  • Examples use the following information
  • Number of records of customer 10,000
    depositor 5000
  • Number of blocks of customer 400
    depositor 100

21
Nested-Loop Join
  • Algorithm NLJ the theta join r ?
    sFor each tuple tr in r do begin For each tuple
    ts in s do begin test pair (tr,ts) to see if
    they satisfy the join condition ? if they do,
    add tr ts to the result. EndEnd
  • r outer relation, s inner relation.
  • No indices, any kind of join condition.
  • Expensive

22
Nested-Loop Join Performance
  • Worst case
  • the estimated cost is nr ? bs br disk accesses,
    if not enough memory only to hold one block of
    each relation,
  • Example
  • 5000 ? 400 100 2,000,100 disk accesses with
    depositor as outer relation, and
  • 1000 ? 100 400 1,000,400 disk accesses with
    customer as the outer relation.
  • If the smaller relation fits entirely in memory,
  • use that as the inner relation.
  • Reduces cost to br bs disk accesses.
  • If smaller relation (depositor) fits entirely in
    memory, cost estimate will be 500 disk accesses.

23
Block Nested-Loop Join
  • Algoritm BNLJ
  • For each block Br of r do
  • Get Block Br For each block Bs of s do
  • Get Block Bs For each tuple tr in Br
    do For each tuple ts in Bs do Check if
    (tr, ts) satisfy the join condition if they
    do, add tr ts to the result. End End En
    d End

No disk access required
24
Block Nested-Loop Join Performance
  • Worst case
  • Estimate br ? bs br block accesses.
  • Each block in the inner relation s is read once
    for each block in the outer relation (instead of
    once for each tuple in the outer relation)
  • Improvements If M blocks can be buffered
  • use (M-2) disk blocks as blocking unit for outer
    relations,
  • use remaining two blocks to buffer inner relation
    and output
  • Then the cost becomes ?br / (M-2)? ? bs br

25
Indexed Nested-Loop Join
  • Index lookups can replace file scans if
  • join is an equi-join or natural join and
  • an index is available on the inner relations
    join attribute
  • Can construct an index just to compute a join.
  • Algorithm INLJ
  • For each block Br of r do
  • Get Block Br For each tuple tr in Br do
    Search Index (IDXr , tr.key)
  • if found, add tr ts to the result. End
  • End

26
Indexed Nested-Loop Join Performance
  • Worst case
  • buffer has space for only one page of r,
  • Cost of the join br nr ? c
  • Where c is the cost of traversing index and
    fetching matching tuple
  • Number of matching tuples may be greater than
    one.
  • If indices are available on join attributes of
    both r and s,
  • use the relation with fewer tuples as the outer
    relation

27
Example of Nested-Loop Join Costs
  • Assume depositor customer,
  • with depositor as the outer relation.
  • customer have a primary B-tree index on the join
    attribute customer-name, which contains 20
    entries in each index node.
  • customer has 10,000 tuples,
  • the height of the tree is 4, and
  • one more access is needed to find the actual data
  • Depositor has 5000 tuples
  • Cost of block nested loops join
  • 400100 100 40,100 disk accesses assuming
    worst case memory
  • Cost of indexed nested loops join
  • 100 5000 5 25,100 disk accesses.

28
Hash-Join
  • Applicable for equi-joins and natural joins.
  • A hash function h is used to partition tuples of
    both relations
  • h A? 0, 1, ..., n
  • r0, r1, . . ., rn partitions of r tuples
  • s0, s1. . ., sn partitions of s tuples
  • r tuples in ri need only to be compared with s
    tuples in si .
Write a Comment
User Comments (0)
About PowerShow.com