File Processing : Query Processing

About This Presentation

Title:

File Processing : Query Processing

Description:

Retrieve records satisfying predicates. Example. Find Student ... Contiguously stored on blocks. Nblock = Nselected / Bf. STEM. PNU. Selectivity Estimation ... – PowerPoint PPT presentation

Number of Views:33

Avg rating:3.0/5.0

Slides: 29

Provided by: lik

Category:

more less

Transcript and Presenter's Notes

Title: File Processing : Query Processing

1
File Processing Query Processing

2008, Spring
Pusan National University
Ki-Joune Li

2
Basic Concepts of Query

Query
Retrieve records satisfying predicates
Types of Query
Operators
Aggregate Query
Sorting

3
Relational Operators Select

Selection (?condition)
Retrieve records satisfying predicates
Example
Find Student where Student.Score gt 3.5
?scoregt3.5(Student)
Index or Hash

4
Relational Operators Project

Project (?attributes)
Extract interesting attributes
Example
Find Student.name where score gt 3.5
?name(?acoregt3.5(Student))
Full Scan

5
Cartisan Product

Cartisan Product (?)
Two Tables R1 ? R2
Produce all cross products
Join ( )

?

6
Join

Join ( )
Select combined records of cartisan product with
same value of a common attribute (Natural Join)
Example
Student (StudentName, AdvisorProfessorID,
Department, Score)
Professor(ProfessorName, ProfessorID,
Department)
Student AdivsorProfessorIDProfessorID
Professor
? AdivsorProfessorIDProfessorID(Student
?Professor)
Double Scan Expensive Operation

7
Relational Algebra

Relational Algebra
Operand Table (Relation)
Operator Relational Operator (?, ?, , etc)
Example
Find Student Name where Student Score gt 3.5 and
Advisor Professor belongs to CSE Department
?student.name(?acoregt3.5(Student)
?DepartmentCSE (Professor) )
Relational Algebra Specifies the sequence of
operations

8
Query Processing Mechanism

Query Processing Steps
1. Parsing and translation
2. Optimization
3. Evaluation

9
Parsing and Translation

Parsing Query Statement (e.g. in SQL)
Translation into relational algebra
Equivalent Expression
For a same query statement
several relation algebraic expressions are
possible
Example
?balance ? 2500(?name(account )) ? ?name(?balance
? 2500(account ))
Different execution schedules
Query Execution Plan (QEP)
Determined by relational algebra
Several QEPs may be produced by Parsing and
Translation

10
Query Optimization

Choose ONE QEP among QEPs based on
Execution Cost of each QEP, where cost means
execution time
How to find cost of each QEP ?
Real Execution
Exact but Not Feasible
Cost Estimation
Types of Operations
Number of Records
Selectivity
Distribution of data

11
Cost Model Basic Concepts

Cost Model Number of Block Accesses
Cost
C Cindex Cdata
where Cindex Cost for Index Access
Cdata Cost for Data Block Retrieval
Cindex vs. Cdata ?
Cindex depends on index
Cdata
depends on selectivity
Random Access or Sequential Access
Selectivity
Number (or Ratio) of Objects Selected by Query

12
Cost Model Type of Operations

Cost model for each type of operations
Select
Project
Join
Aggregate Query
Query Processing Method for each type of
operations
Index/Hash or Not

13
Cost Model Number of Records

Number of Records
Nrecord ? Nblocks
Number of Scans
Single Scan
O(N) Linear Scan
O(logN ) Index
Multiple Scans
O(NM ) Multiple Linear Scans
O(N logM ) Multiple Scans with Index

14
Selectivity

Selectivity
Affects on Cdata
Random Access
Scattered on several blocks
Nblock ? Nselected
Sequential Access
Contiguously stored on blocks
Nblock Nselected / Bf

15
Selectivity Estimation

Selectivity Estimation
Depends on Data Distribution
Example
Q1 Find students where 60 lt weight lt 70
Q2 Find students where 80 lt weight lt 90
How to find the distribution
Parametric Method
e.g. Gaussian Distribution
No a priori knowledge
Non-Parametric Method
e.g. Histogram
Smoothing is necessary
Wavelet, Discrete Cosine

16
Select Linear Search

Algorithm linear search
Scan each file block and test all records to see
whether they satisfy the selection condition.
Cost estimate (number of disk blocks scanned)
br
br denotes number of blocks containing records
from relation r
If selection is on a key attribute (sorted), cost
(br /2)
stop on finding record
Linear search can be applied regardless of
selection condition or
ordering of records in the file, or
availability of indices

17
Select Range Search

Algorithm primary index, comparison
Relation is sorted on A
For ?A ? V (r)
Step 1 use index to find first tuple ? v and
Step 2 scan relation sequentially
For ?A?V (r)
just scan relation sequentially till first tuple
gt v do not use index
Algorithm secondary index, comparison
For ?A ? V (r)
Step 1 use index to find first index entry ? v
and
Step 2 scan index sequentially to find pointers
to records.
For ?A?V (r)
scan leaf nodes of index finding pointers to
records, till first entry gt v

18
Select Range Search

Comparison between
Searching with Index and
Linear Search
Secondary Index
retrieval of records that are pointed to
requires an I/O for each record
Linear file scan may be cheaper
if records are scattered on many blocks
clustering is important for this reason

19
Select Complex Query

Conjunction ??1 ? ?2 ?. . . ?n(r)
Algorithm selection using one index
Step 1 Select a combination of ?i (??i (r) )
Step 2 Test other conditions on tuple after
fetching it into memory buffer.
Algorithm selection using multiple-key index
Use appropriate multiple-attribute index if
available.
Algorithm selection by intersection of
identifiers
Step 1 Requires indices with record pointers.
Step 2 Intersection of all the obtained sets of
record pointers.
Step 3 Then fetch records from file
Disjunction ??1? ?2 ?. . . ?n (r)
Algorithm Disjunctive selection by union of
identifiers

20
Join Operation

Several different algorithms to implement joins
Nested-loop join
Block nested-loop join
Indexed nested-loop join
Merge-join
Hash-join
Choice based on cost estimate
Examples use the following information
Number of records of customer 10,000
depositor 5000
Number of blocks of customer 400
depositor 100

21
Nested-Loop Join

Algorithm NLJ the theta join r ?
sFor each tuple tr in r do begin For each tuple
ts in s do begin test pair (tr,ts) to see if
they satisfy the join condition ? if they do,
add tr ts to the result. EndEnd
r outer relation, s inner relation.
No indices, any kind of join condition.
Expensive

22
Nested-Loop Join Performance

Worst case
the estimated cost is nr ? bs br disk accesses,
if not enough memory only to hold one block of
each relation,
Example
5000 ? 400 100 2,000,100 disk accesses with
depositor as outer relation, and
1000 ? 100 400 1,000,400 disk accesses with
customer as the outer relation.
If the smaller relation fits entirely in memory,
use that as the inner relation.
Reduces cost to br bs disk accesses.
If smaller relation (depositor) fits entirely in
memory, cost estimate will be 500 disk accesses.

23
Block Nested-Loop Join

Algoritm BNLJ
For each block Br of r do
Get Block Br For each block Bs of s do
Get Block Bs For each tuple tr in Br
do For each tuple ts in Bs do Check if
(tr, ts) satisfy the join condition if they
do, add tr ts to the result. End End En
d End

No disk access required
24
Block Nested-Loop Join Performance

Worst case
Estimate br ? bs br block accesses.
Each block in the inner relation s is read once
for each block in the outer relation (instead of
once for each tuple in the outer relation)
Improvements If M blocks can be buffered
use (M-2) disk blocks as blocking unit for outer
relations,
use remaining two blocks to buffer inner relation
and output
Then the cost becomes ?br / (M-2)? ? bs br

25
Indexed Nested-Loop Join

Index lookups can replace file scans if
join is an equi-join or natural join and
an index is available on the inner relations
join attribute
Can construct an index just to compute a join.
Algorithm INLJ
For each block Br of r do
Get Block Br For each tuple tr in Br do
Search Index (IDXr , tr.key)
if found, add tr ts to the result. End
End

26
Indexed Nested-Loop Join Performance

Worst case
buffer has space for only one page of r,
Cost of the join br nr ? c
Where c is the cost of traversing index and
fetching matching tuple
Number of matching tuples may be greater than
one.
If indices are available on join attributes of
both r and s,
use the relation with fewer tuples as the outer
relation

27
Example of Nested-Loop Join Costs

Assume depositor customer,
with depositor as the outer relation.
customer have a primary B-tree index on the join
attribute customer-name, which contains 20
entries in each index node.
customer has 10,000 tuples,
the height of the tree is 4, and
one more access is needed to find the actual data
Depositor has 5000 tuples
Cost of block nested loops join
400100 100 40,100 disk accesses assuming
worst case memory
Cost of indexed nested loops join
100 5000 5 25,100 disk accesses.

28
Hash-Join

Applicable for equi-joins and natural joins.
A hash function h is used to partition tuples of
both relations
h A? 0, 1, ..., n
r0, r1, . . ., rn partitions of r tuples
s0, s1. . ., sn partitions of s tuples
r tuples in ri need only to be compared with s
tuples in si .

Write a Comment

User Comments (0)

About PowerShow.com

File Processing : Query Processing - PowerPoint PPT Presentation

File Processing : Query Processing

Retrieve records satisfying predicates. Example. Find Student ... Contiguously stored on blocks. Nblock = Nselected / Bf. STEM. PNU. Selectivity Estimation ... – PowerPoint PPT presentation