Title: Cost-based Query Scrambling for Initial Delays
1Cost-based Query Scrambling for Initial Delays
- Tolga Urhan
- Michael J. Franklin
- Laurent Amsaleg
- Advanced DB
- AUEB MScIS
Presenter: Giatrakos Nikos, M3060007
2Introduction
- Problem: response time unpredictability in wide-area distributed information systems
- - Large number of remote data sources
- - Intermediate sites
- - Communication links
- All of these are vulnerable to congestion and failures, which cause random delays
- The static, a priori approach of traditional execution plans breaks down under such delays
3Query Scrambling Solution
- Key idea: hide unexpected delays by rescheduling the operations of a query on the fly, so as to perform other useful work
- Focus on initial delays
- - delays in receiving the first tuple from a particular remote source
- Decision-making approaches
- - reduce total work
- - reduce response time
4Query Scrambling
[Figure: query plan with Select and Join operators over relations A, B, C, D, E; labels: Query Result, Site1, Site2, Site3, Site4, Communication Link]
- Rescheduling: the execution plan of a query is dynamically rescheduled when a delay is detected
- Operator Synthesis: new operators can be created when no other operators are able to execute
5Query Scrambling - Scenario
- Query stalls while retrieving tuples of A
- Rescheduling Phase
- - Retrieve tuples of B
- - Check A: still not available
- - Then compute D ⋈ E and C ⋈ (D ⋈ E)
- - Check A: still unavailable
- Operator Synthesis Phase
- - C ⋈ (D ⋈ E) ⋈ B
[Figure: query plan over relations A, B, C, D, E with one operator marked "waiting"; labels: Query Result, Site1, Site2, Site3, Site4]
- Remark: should a delay occur during a scrambling operation, scrambling is invoked again
6Cost-based Rescheduling
- Identify runnable subtrees: subtrees made up entirely of nonblocked operators
- Select which runnable subtrees to execute
- Traditional way: choose the maximal one
- MR: the cost of reading the materialized temporary result
- MW: the cost of writing the materialized temporary result
- P: the cost of executing the subtree
- Choose the subtree with maximal efficiency = (P - MR) / (P + MW)
- - Numerator: how much work will be saved in the future by scheduling that subtree
- - Denominator: the duration of the scrambled operation
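The efficiency rule on this slide can be illustrated with a minimal sketch. It assumes the denominator is P + MW (execution plus writing the temporary result); the class name, function names, and cost values are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunnableSubtree:
    name: str
    p: float   # P: cost of executing the subtree
    mr: float  # MR: cost of reading the materialized temporary result
    mw: float  # MW: cost of writing the materialized temporary result


def efficiency(subtree: RunnableSubtree) -> float:
    # Work saved later (P - MR) per unit of scrambling time (P + MW).
    return (subtree.p - subtree.mr) / (subtree.p + subtree.mw)


def pick_most_efficient(subtrees):
    # Select the runnable subtree with maximal efficiency.
    return max(subtrees, key=efficiency)


# Made-up costs: the second subtree saves little, because reading its
# materialized result costs almost as much as recomputing it.
candidates = [
    RunnableSubtree("D join E", p=100.0, mr=20.0, mw=30.0),
    RunnableSubtree("scan B", p=40.0, mr=35.0, mw=40.0),
]
best = pick_most_efficient(candidates)
```

With these numbers the first subtree is selected, since it has both the larger saving and the better ratio.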
7Cost-based Operator Synthesis
- The second phase starts when no more progress can be made in phase 1.
- Three optimization strategies
- - Pair
- - (IN) Include Delayed
- - (ED) Estimated Delay
8Cost-based Operator Synthesis - Pair
- Constructs a query plan containing only a single join of two unblocked relations.
- Analyzes each pair of unblocked relations sharing a join predicate.
- Chooses the join with the least total cost to execute.
- Materializes the result of the join to disk.
- Avoids Cartesian products, as well as joins whose results would take longer to read from disk than to compute from scratch.
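The selection step of the Pair policy could look roughly like the sketch below. The function name, the dictionaries of join and read-back costs, and the numbers are hypothetical stand-ins for whatever estimates the optimizer provides.

```python
from itertools import combinations


def choose_pair(unblocked, join_predicates, join_cost, read_cost):
    """Return the cheapest qualifying pair of unblocked relations, or None."""
    best = None
    for a, b in combinations(sorted(unblocked), 2):
        pair = frozenset((a, b))
        if pair not in join_predicates:
            continue  # no shared join predicate: would be a Cartesian product
        if read_cost[pair] > join_cost[pair]:
            continue  # result costs more to read back than to recompute
        if best is None or join_cost[pair] < join_cost[best]:
            best = pair
    return best


# Made-up example: A is delayed, so only B, C, D, E are unblocked.
de, cd, bc = frozenset("DE"), frozenset("CD"), frozenset("BC")
join_cost = {de: 50.0, cd: 80.0, bc: 30.0}
read_cost = {de: 10.0, cd: 20.0, bc: 45.0}
chosen = choose_pair({"B", "C", "D", "E"}, {de, cd, bc}, join_cost, read_cost)
```

Here B-C is the cheapest join but is rejected because its result is more expensive to read back than to recompute, so D-E is chosen.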
9Cost-based Operator Synthesis - Pair
- At the end of each join, checks for the arrival of the delayed data. If it has not arrived, performs another iteration.
- If no qualified joins exist, waits for the delayed data to arrive.
- Reconstruction phase
- - When all blocked relations become available, a single query tree must be constructed.
- - This is necessary because the Pair policy works only on pairs of relations and does not maintain a complete query plan.
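The iteration logic of the Pair policy can be sketched as a small loop. The three callbacks and the returned status strings are hypothetical; how joins are chosen and executed is left to the caller.

```python
def run_pair_iterations(choose_join, execute_join, delayed_data_arrived):
    """Run Pair iterations; return (executed joins, next step)."""
    executed = []
    while True:
        join = choose_join()
        if join is None:
            # No qualified join is left: wait for the delayed data.
            return executed, "wait"
        execute_join(join)  # the result is materialized to disk
        executed.append(join)
        if delayed_data_arrived():
            # Delayed data is here: a single query tree must be rebuilt.
            return executed, "reconstruct"


# Made-up run: two joins are available, data arrives after the second.
pending = iter(["D join E", "C join (D join E)"])
arrivals = iter([False, True])
executed, status = run_pair_iterations(
    choose_join=lambda: next(pending, None),
    execute_join=lambda join: None,
    delayed_data_arrived=lambda: next(arrivals),
)
```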
10Cost-based Operator Synthesis - IN
- Each iteration generates a complete alternative plan.
- Chooses a very long delay duration (relative to response time) to postpone any access to the delayed data.
- Chooses the plan with the greatest benefit (potential improvement in response time) whose risk (duration of the optimization step) can be overlapped with the expected delay duration.
11Cost-based Operator Synthesis - IN
- Uses a risk/benefit knob (Rbknob) to prevent the optimizer from choosing high-risk plans for relatively small potential gains over low-risk plans.
- Rbknob: the ratio of the amount of benefit the optimizer is willing to give up to a given savings in risk.
- Increasing Rbknob leads to more conservative plans.
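One way to read the knob is as a linear trade-off: a lower-risk plan is preferred whenever the benefit it gives up is at most Rbknob times the risk it saves, which amounts to maximizing benefit minus Rbknob times risk. The sketch below uses that reading; the scoring rule, names, and numbers are assumptions for illustration, not the optimizer's actual formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidatePlan:
    name: str
    benefit: float  # potential improvement in response time
    risk: float     # duration of the optimization step


def choose_plan(plans, expected_delay, rb_knob):
    # Only plans whose risk can be overlapped with the expected delay qualify.
    feasible = [p for p in plans if p.risk <= expected_delay]
    if not feasible:
        return None
    # Assumed scoring: trade benefit against risk at the rate rb_knob.
    return max(feasible, key=lambda p: p.benefit - rb_knob * p.risk)


plans = [
    CandidatePlan("low_risk", benefit=40.0, risk=10.0),
    CandidatePlan("high_risk", benefit=50.0, risk=60.0),
    CandidatePlan("too_long", benefit=200.0, risk=500.0),
]
aggressive = choose_plan(plans, expected_delay=100.0, rb_knob=0.0)
conservative = choose_plan(plans, expected_delay=100.0, rb_knob=0.5)
```

With the knob at 0 the plan with the highest benefit wins; raising it to 0.5 makes the low-risk plan win, matching the statement that a larger Rbknob gives more conservative plans.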
12Cost-based Operator Synthesis - ED
- Delay estimates are successively increased when necessary to make more progress.
- Motivation: use low-risk plans when delays are short, and high-risk/high-payoff plans for larger delays.
- Execution steps
- - Start by picking an estimated delay value equal to 25% of the original query response time
- - Repeat iterations until progress is too small
- - Increase the delay value to 50% of the response time
- - Increase to 100% of the response time if progress is still too small
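The escalation of the delay estimate can be sketched as follows. The `run_iteration` callback, the `min_progress` threshold, and the notion of a numeric "progress" value are hypothetical; the slide does not say how progress is measured, and a real implementation would also stop once the delayed data arrives.

```python
def run_estimated_delay(original_response_time, run_iteration, min_progress):
    """Return a log of (estimated delay, progress) for each iteration."""
    log = []
    for fraction in (0.25, 0.50, 1.00):
        estimated_delay = fraction * original_response_time
        while True:
            progress = run_iteration(estimated_delay)
            log.append((estimated_delay, progress))
            if progress < min_progress:
                break  # progress too small: move to the next estimate
    return log


# Made-up progress values returned by successive iterations.
fake_progress = iter([10.0, 1.0, 8.0, 0.5, 0.2])
log = run_estimated_delay(
    100.0, lambda delay: next(fake_progress), min_progress=2.0
)
delays_used = [delay for delay, _ in log]
```

In this run the estimate stays at 25% for two iterations, moves to 50% for two more, and ends with a single iteration at 100%.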
13Experimental Setup
- Two-phase randomized query optimizer
- Workload based on queries from TPC-D benchmarks
- Single query site, six remote data source sites.
- Experimental methodology: plot the duration of the initial delay of a remote source against the response time achieved using scrambling
14Experiment1
[Plots; panel labels: memory > 1000, memory 300]
15Experiment2
[Plots; panel labels: memory > 1000, memory 300, memory 10000, memory 1000]
16Experiment3
[Plot; panel label: memory 10000]
17Conclusions
- With sufficient memory, all cost-based approaches can effectively hide initial delays
- Cost-based scrambling involves a tradeoff between conservative approaches and aggressive ones
- As the memory available for scrambling is reduced, scrambling plans become more expensive
- The aggressiveness of the IN and ED policies can be adjusted using Rbknob
- Pair (a total-work-based optimizer) may perform unnecessary work; hence, a response-time-based optimizer should be preferred
19Thank you