Title: Query Processing
1Query Processing
2Steps in Query Processing
- Validate and translate the query
- Good syntax.
- All referenced relations exist.
- Translate the SQL to relational algebra.
- Optimize
- Make it run faster.
- Evaluate
3Translation Example
- Possible SQL Query
- SELECT balance
- FROM account
- WHERE balancelt2500
- Possible Relational Algebra Query
- ?balance?balancelt2500(account))
4Tree Representation of Relational Algebra
- ?balance?balancelt2500(account))
?balance
?balancelt2500
account
5Making An Evaluation Plan
- Annotate Query Tree with evaluation instructions
- The query can now be executed by the query
execution engine.
?balance
?balancelt2500
use index 1
account
6Before Optimizing the Query
- Must predict the cost of execution plans.
- Measured by
- CPU time,
- Number of disk block reads,
- Network communication (in distributed DBs),
- where C(CPU) lt C(Disk) lt C(Network).
- Major factor is buffer space.
- Use statistics found in the catalog to help
predict the work required to evaluate a query.
7Disk Cost
- Seek time rotational latency arm movement.
- Scan time time to read the data.
- Typically, seek time is orders of magnitude
greater. - Disk cost is assumed to be highest, so it can be
used to approximate total cost.
8Reading Data, No Indices
- Linear scan
- Cost is a function of file size.
- Binary search on ordering attribute
- Cost is lg of the file size.
- Requires table to be sorted.
9Reading Data with Indices
- Primary index index on sort key.
- Can be dense or sparse.
- Secondary index index on non-sort key.
- Queries can be point queries or range queries.
- Point queries return a single record.
- Range queries return a sequence of consecutive
records.
10Point Queries
- Point queries
- Cost index cost block read cost.
- Range queries (c1 lt key lt c2)
- Primary index
- Cost index cost scan of blocks
- Secondary index
- Cost blocks(index cost scan of block)
11More on Range Queries
- Range query on sort key (c1 lt key)
- c1 lt key Linear scan until you find key.
- c1 gt key Use index to find key, then linear
scan. - Range query using secondary index
- Scan through index blocks. Requires accessing
index for every record.
12More Complex Selections
- Conditions on multiple attributes
- Negations
- Disjunctions
- Grouping pointers when selection is on multiple
attributes - Find a set of solutions for each condition.
- Either compute its union or intersection,
depending on the condition (disjunction or
conjunction.)
13Sorting
- Sorted relations are easier to scan.
- The cost of sorting a relation before querying it
can be less than querying an unsorted relation. - Two types of sorts
- In memory
- Out of memory (a.k.a., external sorting)
14External Merge Sort
- Use this when you cannot fit the relation in
memory. - Assume there are M memory buffers.
- Two phases
- Create sorted runs.
- Merge sorted runs.
15External Merge Sort, Phase 1
- Fill the M memory buffers with the next M blocks
of the relation. - Sort the M blocks.
- Write the sorted blocks to disk.
16External Merge Sort, Phase 2
- Assume there are at most M-1 runs.
- Read the first block of each run into memory.
- At each iteration, find the lowest record from
the M-1 runs. - Place it into the memory buffer.
- If any run is empty, read its next block.
17External Merge Sort Notes
- Can be extended to an arbitrarily large relation
using multiple passes. - Cost is
- Br(2 lg_(M-1) (Br/M) 1)
- Br is the number of blocks for the relation.
- B is the size of a memory buffer.
18Nested Loop Join
- No indices (for now).
- Nested Loop
- R join S
- R is the outer relation.
- S is the inner relation.
- Read a block of R, then read each block of S and
compare their contents using the join condition. - Write any matching tuples to another block.
19Nested Loop Join Cost
- If you read tuple by tuple, its
- tuples in R blocks in S blocks in R.
- Question Which should be in inner relation, and
which should be the outer?
20Block Nested Loop
- Nested Loop Join, but block by block instead.
- Cost for R join S, where R is outer, S is inner
- blocks in R blocks in S blocks in S
21Block Nested Loop Improvements
- Sorted relations?
- More memory?
22Indexed Nested Loop Join
- Assume we have an index on a join attribute of
one of the relations, R or S. - Questions
- Which should the index be on?
- Or, if both have indices on them, which should be
the outer one?
23Indexed Nested Loop Join Cost
- blocks in R rows in R Ls
- Ls is the cost of looking up a record in S using
the index.
24More Joins
- Merge join
- Sort R and S, and then merge them.
- Hash join
- Hash R and S into buckets, and compare the bucket
contents.
25Evaluation
- Materialization Build intermediate tables as
the expression goes up the tree. - Here, one intermediate table is created for the
select, and is the input of the project.
?balance
?balancelt2500
account
26Materialization Cost
- Cost of writing out intermediate results to disk.
27Pipelining
- Compute several operations simultaneously.
- As soon as a tuple is created from one operation,
send it to the next. Here, send selected tuples
straight to the projection.
?balance
?balancelt2500
account
28Implementation of Pipelining
- Requires buffers for each operation.
- Can be
- Demand driven an operator must be asked to
generate a tuple. - Producer driven an operator generates a tuple
whether its asked for or not.
29Query Optimization
30Some Actions of Query Optimization
- Reordering joins.
- Changing the positions of projects and selects.
- Changing the access structures used to read data.
31Catalog Info
- Number of tuples in r.
- Number of blocks for r.
- Size of tuple of r.
- Blocking factor a r the number of r tuples that
fit in a block. - The number of distinct values of each attribute
of r.