Title: Eddies: Continuously Adaptive Query Processing
1Eddies Continuously Adaptive Query Processing
- Based on a SIGMOD2002 paper and talk
- by Avnur and Hellerstein.
2State-of-Art in Query Optimization
- Given
- Database state and statistics known a-priori
- One (short) user query to process
- Query may be run only once
- Query Processing
- A-priori decide on a (static) query plan
- Run query using this one plan
- Also
- Possibly update statistics sometimes (in steady
state)
3Adaptive Systems General Flavor
- Repeat
- Observe (model) environment
- Use observation to choose behavior
- Take action
4Adaptivity in Current DBs
- Limited coarse grain
- Repeat
- Observe (model) environment
- runstats (once per week!!) model changes in data
- Use observation to choose behavior
- query optimization fixes a single static query
plan - Take action
- query execution blindly follow plan
5Query Optimization
- Adaptivity at a per-week frequency!
- Not suited for volatile environments
6A Networking Problem!?
- Networks do dataflow!
- Significant history of adaptive techniques
- E.g. TCP congestion control
- E.g. routing
- But traditionally much lower function
- Ship bitstreams
- Minimal, fixed code
- Lately, moving up the foodchain?
- app-level routing
- active networks
7Query Plans are Dataflow
- Programming model iterators
- old idea, widely used in DB query processing
- object with three methods
- Init(), GetNext(), Close()
- input/output types
- query plan graph of iterators
- pipelining iterators that return results before
children Close()
8Querying in Volatile Environments
- Federated query processors
- No control over stats, performance, admin
(DataJoiner) - Shared-Nothing Systems
- No control over system balance
- User control of running queries
- No control over user interaction (online
aggregation) - Sensor Nets the next killer app
- No control over anything!
9Varying
- Computing resources
- Data flows unpredictably from sources
- Code performs unpredictably along flows
- Continuous volatility due to many decentralized
systems - Data Characteristics
- Distributions
- Burstiness
- User preferences
- What get fast
- How much data
10Toward Continuous Adaptivity
- Need much more frequent adaptivity
- Goal adapt per tuple of each relation??
- The traditional runstats-optimize-execute loop is
far too coarse-grained - So, continuously perform all 3 functions, at
runtime - Aim for adaptivity over best-case performance (as
the later never exists for long)
11Road Map
- Adaptive Query Processing
- Intra-join adaptivity
- Synchronization Barriers
- Moments of Symmetry
- Eddies
- Encapsulated, adaptive dataflow
12Adaptable Operators and Plans
- Moments of symmetry query
processing stage during which pipelined query
operators or inputs can be easily reordered (with
no or minimal state management) - Synchronization barriers require
inputs from different sources to be coordinated
and possibly restricted to the rate of the slower
input - We need good operators.
13Adaptable Joins, Issue 1
- Synchronization Barrier merge join
- Right input frozen,waiting for left
- Cant adapt while waitingfor barrier!
- So, favor joins that have
- no barriers or seldom barriers
- at worst, adaptable barriers
?
2000 2001 2002 2003 2004
2 3 4 5 6
14Adaptable Joins, Issue 2
- Would like to reorder in-flight (pipelined) joins
- Base case swap inputs to a join ??
- Moment of symmetry
- inputs can be swapped with no/little state
management - Aim for frequent moments of symmetry ? more
frequent adaptivity
15Adaptable Joins, Issue 2
- Moments of Symmetry
- Suppose you can adapt an in-flight query plan
- How would you do it?
- Base case reorder inputs of a single join
- Nested loops join
16Adaptable Joins, Issue 2
- Moments of Symmetry
- Suppose you can adapt an in-flight query plan
- How would you do it?
- Base case reorder inputs of a single join
- Nested loops join
- Cleaner if you waittil end of inner loop
17Adaptable Joins, Issue 2
- Moments of Symmetry
- Suppose you can adapt an in-flight query plan
- How would you do it?
- Base case reorder inputs of a single join
- Nested loops join
- Cleaner if you waittil end of inner loop
- Hybrid Hash
- Reorder while building?
18Moments of Symmetry, cont.
- Moment of Symmetry
- Can swap join inputs w/o state modification
- Nested Loops join end of each inner loop
- Hybrid Hash join never
- Sort-Merge join essentially always
- More frequent moments of symmetry ? more
frequent adaptivity
19Joins for Adaptivity
- Pipelined hash join (hash ripple or Xjoin)
- No synchronization barriers
- Continuous symmetry
- Good for equi-join
- Simple (or block) ripple join
- Synchronization barriers at corners
- Moments of symmetry at corners
- Good for non-equi-join
- When symmetry At corners, i.e., for each new
tuple, once it has been processed using the given
operator s state
R
S
?
20Beyond Binary Joins
- Think of swapping inners
- Can be done at a global moment of symmetry
- Intuition like an n-ary join
- Except that each pair can bejoined by a
different algorithm! - So
- Need to introduce n-ary joins to a query engine
21Need well-behaved join algorithms
- Pipelining
- Avoid synch barriers
- Frequent moments of symmetry
22Continuous Adaptivity Goal Eddies
Eddy
- Avoid need for traditional cost estimation
- Avoid generation of a good query plan
23Continuous Adaptivity Eddies
Eddy
- A pipelining n-ary tuple-routing iterator
(just like join or sort) - works well with ops that havefrequent moments of
symmetry
24Continuous Adaptivity Eddies
Eddy
- Adjusts flow adaptively
- Tuples flow in different orders
- Visit each op once before output
25Routing Eddies
Eddy
- Naïve routing policy
- All ops fetch from eddy as fast as possible
- Previously-seen tuples precede new tuples
26Schedule Grab when Ready?
- Two expensive selections s1 and s2
- Selectivity(s1)Selectivity(s2)50
- Cost(s2) 5.
- Vary Cost(s1).
- What expect? ?
- Does it make a difference at all?
27Cost Factor?
- Two expensive selections, 50 selectivity
- Cost(s2) 5. Vary cost of s1.
- Favors faster operation
28But is it Enough?
- Given two expensive selections
- Cost same, say cost(s1)cost(s2)5
- Selectivity(s2) 50.
- Vary selectivity of s1.
- Does that make a difference?
29Selectivity-based?
- Two expensive selections, cost 5
- Selectivity(s2) 50. Vary selectivity of s1.
30Schedule Selectivity-based?
- Conclude Heavy tuple shedder early on is good.
31How to choose?
- If we knew all selectivities and all costs (and
they were static), maybe we could pick the best
overall schedule here. - Otherwise, we need a cheap means to observe their
changes - And, we need a means to react in a simply manner
based on those perceived changes
32An Aside How to choose?
- A machine learning problem?
- Each agent pays off differently
- Explore Or Exploit?
- Heuristics ?
- Sometimes want to randomly choose one
- Usually want to go with the best
- If probabilities are stationary, dampen
exploration over time
33Eddies with Lottery Scheduling
- Operator gets 1 ticket when it takes a tuple
- Favor operators that run fast (low cost)
- Operator loses a ticket when it returns a tuple
- Favor operators that drop tuples (low
selectivity) - Winner?
- Large number of tickets measure of goodness
- Lottery Scheduling
- When two operators vie for the same tuple,
hold a lottery - Never let any operator go to zero tickets
- Support occasional random exploration
34Lottery-Based Eddy
- Two expensive selections, cost 5
- Selectivity(s2) 50. Vary selectivity of s1.
35In a Volatile Environment
- Two index joins
- Slow 5 second delay Fast no delay
- Toggle after 30 seconds
36Related Work
Competition Sampling
Query Scrambling
Ingres DECOMP
Inter-Operator
Late Binding
Future Work
Per Query
System R
Eddies
Frequency of Adaptivity
- Late Binding Dynamic, Parametric
HP88,GW89,IN92,GC94,AC96,LP97 - Per Query Mariposa SA96, ASE CR94
- Competition RDB AZ96
- Inter-Op KD98, Tukwila IF99
- Query Scrambling AF96,UFA98
- Survey Hellerstein, Franklin, et al., DE
Bulletin 2000
37Summary
- Eddies Continuously Adaptive Dataflow
- Suited for volatile performance environments
- Changes in operator/machine peformance
- Changes in selectivities (e.g. with sorted
inputs) - Changes in data delivery
- Currently adapts join order
- Competitive methods to adapt access join
methods? - Requires well-behaved join algorithms
- Pipelining
- Avoid synch barriers
- Frequent moments of symmetry
- The end of the runstats/optimizer/executor
boundary! - At best, System R is good for hints on initial
ticket distribution