Title: Optimizing Query Execution
1Optimizing Query Execution
- Zachary G. Ives
- University of Pennsylvania
- CIS 650 Implementing Data Management Systems
- January 26, 2005
Content on hashing and sorting courtesy
Ramakrishnan Gehrke
2Administrivia
- My office hour moved up ½ hour to 200-300 on
Tuesdays - Next week
- Some initial suggestions for the project proposal
- Scheduling of the deadline for your midterm
report
3Todays Trivia Question
4Query Execution
- What are the goals?
- Logical vs. physical plans what are the
differences? - Some considerations in building execution
engines - Efficiency minimize copying, comparisons
- Scheduling make standard code-paths fast
- Data layout how to optimize cache behavior,
buffer management, distributed execution, etc.
5Speeding Operations over Data
- Three general data organization techniques
- Indexing
- Associative lookup synopses
- Both for selection and projection
- Inner loop of nested loops join
- And anywhere sorted data is useful
- Sorting
- Hashing
6Speeding Operations over Data
- Three general data organization techniques
- Indexing
- Sorting
- Hashing
7General External Merge Sort
- To sort a file with N pages using B buffer pages
- Pass 0 use B buffer pages. Produce dN / Be
sorted runs of B pages each - Pass 2, , etc. merge B-1 runs
- Number of passes 1dlogB-1 dN / Bee
- Cost 2N ( of passes)
INPUT 1
. . .
. . .
INPUT 2
. . .
OUTPUT
INPUT B-1
Disk
Disk
B Main memory buffers
8Applicability of Sort Techniques
- Aggregation
- Duplicate removal as an instance of aggregation
- XML nesting as an instance of aggregation
- Join, semi-join, and intersection
9Merge Join
- Requires data sorted by join attributes
- Merge and join sorted files, reading sequentially
a block at a time - Maintain two file pointers
- While tuple at R lt tuple at S, advance R (and
vice versa) - While tuples match, output all possible pairings
- Maintain a last in sequence pointer
- Preserves sorted order of outer relation
- Cost b(R) b(S) plus sort costs, if
necessaryIn practice, approximately linear, 3
(b(R) b(S))
10Hashing
- Several types of hashing
- Static hashing
- Extendible hashing
- Consistent hashing (used in P2P well see later)
11Static Hashing
- Fixed number of buckets (and pages) overflow
when necessary - h(k) mod N bucket to which data entry with key
k belongs - Downside long overflow chains
0
h(key) mod N
2
key
h
N-1
Primary bucket pages
Overflow pages
12Extendible Hashing
- If a bucket becomes full split in half
- Use directory of pointers to buckets, double the
directory, splitting just the bucket that
overflowed - Directory much smaller than file, so doubling it
is much cheaper - Only one page of data entries is split
- Trick lies in how hash function is adjusted!
13Insert h(r)20 (Causes Doubling)
2
LOCAL DEPTH
3
LOCAL DEPTH
Bucket A
16
32
GLOBAL DEPTH
32
16
Bucket A
GLOBAL DEPTH
2
2
2
3
Bucket B
1
5
21
13
00
1
5
21
13
000
Bucket B
01
001
2
10
2
010
Bucket C
10
11
10
Bucket C
011
100
2
2
DIRECTORY
101
Bucket D
15
7
19
15
19
7
Bucket D
110
111
2
3
Bucket A2
20
4
12
DIRECTORY
20
12
Bucket A2
4
(split image'
of Bucket A)
(split image'
of Bucket A)
14Relevance of Hashing Techniques
- Hash indices use extensible hashing
- Uses of static hashing
- Aggregation
- Intersection
- Joins
- Why isnt extendible hashing used in hash joins
only as a disk indexing technique?
15Hash Join
- Read entire inner relation into hash table (join
attributes as key) - For each tuple from outer, look up in hash table
join - Not fully pipelined
16Running out of Memory
- Prevention First partition the data by value
into memory-sized groups - Partition both relations in the same way, write
to files - Recursively join the partitions
- Resolution Similar, but do when hash tables
full - Split hash table into files along bucket
boundaries - Partition remaining data in same way
- Recursively join partitions with diff. hash fn!
- Hybrid hash join flush lazily a few buckets at
a time - Cost lt 3 (b(R) b(S))
17The Duality of Hash and Sort
- Different means of partitioning and merging data
when comparisons are necessary - Break on physical rule (mem size) in sorting
- Merge on logical step, the merge
- Break on logical rule (hash val) in hashing
- Combine using physical step (concat)
- When larger-than-memory sorting is necessary,
multiple operators use the same key, we can make
all operators work on the same in-memory portion
of data at the same time - Can we do this with hashing? Hash teams (Graefe)
18What If I Want to Distribute Query Processing?
- Where do I put the data in the first place (or do
I have a choice)? - How do we get data from point A -gt point B?
- What about delays?
- What about binding patterns?
- Looks kind of like an index join with a sargable
predicate
19Pipelined Hash Join Useful for Joining Web Sources
- Two hash tables
- As a tuple comes in, add to the appropriate side
join with opposite table - Fully pipelined, adaptive to source data rates
- Can handle overflow as with hash join
- Needs more memory
20The Dependent Join
- Take attributes from left and feed to the right
source as input/filter - Important in data integration
- Simple method
- for each tuple from left send to right
source get data back, join - More complex
- Hash cache of attributes mappings
- Dont send attribute already seen
- Bloom joins (use bit-vectors to reduce traffic)
JoinA.x B.y
A
B
x
21Wrap-Up of Execution
- Query execution is all about engineering for
efficiency - O(1) and O(lg n) algorithms wherever possible
- Avoid looking at or copying data wherever
possible - Note that larger-than-memory is of paramount
importance - Should that be so in todays world?
- As weve seen it so far, its all about
pipelining things through as fast as possible - But may also need to consider other axes
- Adaptivity/flexibility may sometimes need this
- Information flow to the optimizer, the runtime
system
22Query Optimization
- Challenge pick the query execution plan that
has minimum cost - Sources of cost
- Interactions with other work
- Size of intermediate results
- Choices of algorithms, access methods
- Mismatch between I/O, CPU rates
- Data properties skew, order, placement
- Strategy Estimate the cost of every query plan,
find cheapest - Given
- Some notion of CPU, disk speeds
- Cost model for every operator
- Some information about tables and data
23The General Model of Optimization
- Given an AST of a query
- Build a logical query plan
- (Tree of query algebraic operations)
- Transform into better logical plan
- Convert into a physical query plan
- (Includes strategies for executing operations)
24Which Operators Need Significant Optimization
Decisions?
- We typically make the following assumptions
- All predicates are evaluated as early as possible
- All data is projected away as early as possible
- As a general rule, those that produce
intermediate state or are blocking - Joins
- Aggregation
- Sorting
- By choosing a join ordering, were automatically
choosing where selections and projections are
pushed why is this so?
25The Basic Model System-R
- Breaks a query into its blocks, separately
optimizes them - Focuses strictly on joins (and only a few kinds)
in dynamic programming enumeration - Principle of optimality best k-way join includes
best (k-1)-way join - Use simple table statistics when available, based
on indices magic numbers where unavailable - Heuristics
- Push sargable selects, projects as low as
possible - Cartesian products after joins
- Left-linear trees only n2n-1 cost-est.
operations - Grouping last
- Extra interesting orders dimension
- Grouping, ordering, join attributes
26Next Time Beyond System-R
- Cross-query-block optimizations
- e.g., push a selection predicate from one block
to another - Better statistics
- More general kinds of optimizations
- Optimization of aggregation operations
- Different cost and data models, e.g., OO, XML
- Additional joins, e.g., containment joins
- Can we build an extensible architecture for this?
- Logical, physical, and logical-to-physical
transformations - Enforcers
- Alternative search strategies
- Left-deep plans arent always optimal
- Perhaps we can prune more efficiently
27Upcoming Readings
- For Monday
- Read EXODUS and Starburst papers
- Write one review contrasting the two on the major
issues