Title: Chris Olston
1Offering a Precision-Performance Tradeoff for
Aggregation Queries over Replicated Data
- Chris Olston
- Jennifer Widom
Stanford University
2Replication Alternatives
performance
precision
3Replication Alternative 1
Exact Cache
5
3
5
3
Source (fresh)
Source (fresh)
4Replication Alternative 1
Exact Cache
5 → 8
3 → 4
Propagate all updates
5 → 8
3 → 4
Source (fresh)
Source (fresh)
5Replication Alternative 1
Exact Cache
performance
AVG = 6
5 → 8
3 → 4
exact cache
precision
Propagate all updates
5 → 8
3 → 4
Source (fresh)
Source (fresh)
6Replication Alternative 2
Stale Cache
5
3
Periodic refresh
5 → 8
3 → 4
Source (fresh)
Source (fresh)
7Replication Alternative 2
stale cache
Stale Cache
performance
5
3
AVG = 4
precision
Periodic refresh
5 → 8
3 → 4
Source (fresh)
Source (fresh)
8TRAPP Replication
Bounded Cache
[4, 7]
[2, 4]
5
3
Source (fresh)
Source (fresh)
9TRAPP Replication
Bounded Cache
[4, 7] → [6, 10]
[2, 4]
Refresh when value exceeds bounds
5 → 8
3 → 4
Source (fresh)
Source (fresh)
10TRAPP Replication
you decide
Bounded Cache
performance
AVG ∈ [4, 7]
[6, 10]
[2, 4]
precision
8
4
Source (fresh)
Source (fresh)
11Outline
- TRAPP Architecture
- Query Execution for Bounded Answers
- Adjusting Bound Width
- Related Work
- Status and Future Work
12Overview of TRAPP
- Caches store bounds that include exact source values
- Sources refresh when value exceeds bound
- Queries over cached data include a precision constraint
- Our algorithms answer queries by refreshing as few values as possible to meet the precision constraint
13Example TRAPP Query
Bounded Cache
AVG ∈ [4, 7], want within 1
[6, 10]
[2, 4]
8
4
Source (fresh)
Source (fresh)
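The bounded AVG on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming each cached value is an interval (low, high), no selection predicate applies, and "within 1" means the answer interval may be at most 1 wide. The name `bounded_avg` is hypothetical, not part of TRAPP.

```python
def bounded_avg(bounds):
    """Return (low, high) guaranteed to contain the exact average.

    bounds: list of (low, high) intervals, one per cached value.
    """
    n = len(bounds)
    low = sum(lo for lo, _ in bounds) / n
    high = sum(hi for _, hi in bounds) / n
    return low, high


# Cached bounds from the slide: [6, 10] and [2, 4].
cached = [(6, 10), (2, 4)]
answer = bounded_avg(cached)            # (4.0, 7.0): width 3, too wide for "within 1"

# Refreshing only the first value (exact value 8 at the source)
# collapses its bound to a single point.
after_refresh = bounded_avg([(8, 8), (2, 4)])   # (5.0, 6.0): width 1
```

Under these assumptions a single refresh is enough here: the second value keeps its cached bound and the answer still meets the constraint.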
14TRAPP Architecture
query precision constraint
bounded answer
Source
value-initiated refresh
Cache
Refresh Monitor
query-initiated refresh
Query Processor
query-initiated refresh request
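The two refresh paths in this architecture can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the class name `BoundedSource`, the fixed `half_width` used to build a new bound around the current value, and the choice to return only the exact value on a query-initiated refresh.

```python
class BoundedSource:
    """Hypothetical source that tracks the bound currently held by the cache."""

    def __init__(self, value, bound, half_width):
        self.value = value            # exact (fresh) value at the source
        self.bound = bound            # (low, high) interval stored in the cache
        self.half_width = half_width  # assumed rule for building a new bound

    def update(self, new_value):
        """Apply a source update; return a new bound only if a refresh is needed."""
        self.value = new_value
        low, high = self.bound
        if low <= new_value <= high:
            return None               # still inside the cached bound: no message
        # Value-initiated refresh: the value left the bound, so send a new one.
        self.bound = (new_value - self.half_width, new_value + self.half_width)
        return self.bound

    def query_refresh(self):
        """Query-initiated refresh: the cache asks for the exact value."""
        return self.value


a = BoundedSource(5, (4, 7), half_width=2)
b = BoundedSource(3, (2, 4), half_width=1)
new_a = a.update(8)   # 8 is outside [4, 7], so the bound becomes (6, 10)
new_b = b.update(4)   # 4 is inside [2, 4], so nothing is sent
```

The sketch mirrors the earlier bounded-cache slides: only the first source has to contact the cache.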
15Precision-Performance Tradeoff
stale cache
TRAPP
performance
exact cache
precision
- Higher precision requires more refreshing
- Higher performance forces low precision
TRAPP offers a continuous tradeoff
16Application: Network Monitoring
latency bandwidth traffic
latency bandwidth traffic
latency bandwidth traffic
17Query Execution for Bounded Answers
- Input
- Query: aggregation with selection predicate
- Precision constraint
- Set of bounded values with cost to refresh each
- Step 1: Compute initial bounded answer
- Step 2: Determine minimum-cost set of values to refresh that guarantees satisfaction of the precision constraint
- Step 3: Use exact values from refreshes combined with bounds to compute final bounded answer
18Example Query: SUM
- SELECT SUM(A) WITHIN 2
- Steps 1 & 3 (computing bounded answer)
$\sum_i [L_i, H_i] = [\sum_i L_i, \sum_i H_i]$
Column A: $[L_1, H_1]$, $[L_2, H_2]$
Example: [2, 3] + [4, 8] = [6, 11]
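Steps 1 and 3 for SUM can be sketched in Python as interval addition before and after refreshing. This is a minimal sketch: the helper names `bounded_sum` and `apply_refreshes` are hypothetical, and the exact value 5 used for the refreshed tuple is made up for illustration (any value inside [4, 8] would do).

```python
def bounded_sum(bounds):
    """Sum the lower endpoints and the upper endpoints separately."""
    return sum(lo for lo, _ in bounds), sum(hi for _, hi in bounds)


def apply_refreshes(bounds, exact):
    """Replace each refreshed bound by the zero-width interval at its exact value.

    exact: dict mapping tuple index -> exact value fetched from the source.
    """
    return [(exact[i], exact[i]) if i in exact else b
            for i, b in enumerate(bounds)]


cached = [(2, 3), (4, 8)]

# Step 1: initial bounded answer, [6, 11], width 5 (exceeds WITHIN 2).
initial = bounded_sum(cached)

# Step 3: after refreshing tuple 1 (assumed exact value 5), recompute.
final = bounded_sum(apply_refreshes(cached, {1: 5}))   # (7, 8), width 1
```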
19SUM, Choosing Tuples to Refresh
- Isomorphic to 0/1 Knapsack Problem
- Objective: fill knapsack with bounds that would be expensive to refresh while not exceeding capacity
- Knapsack contents: set of bounds not to refresh
- Benefit: cost saved by not refreshing
- Knapsack capacity: precision constraint
- Weight: bound width
- Knapsack is NP-hard, so we use the ε-approximation algorithm by Ibarra and Kim
- Observation: width of answer bound = sum of non-refreshed bound widths
- We need this quantity to be less than the precision constraint
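The knapsack formulation above can be made concrete with a small Python sketch. Note that this is an exhaustive search over subsets, which is only practical for a handful of tuples; it is not the Ibarra and Kim approximation scheme named on the slide. The function name `choose_refresh_set`, the example costs, and the reading of the constraint as "total width at most the precision constraint" are assumptions.

```python
from itertools import combinations


def choose_refresh_set(bounds, costs, precision):
    """Return indices of tuples to refresh at minimum total refresh cost.

    The knapsack holds the tuples that are NOT refreshed: their total
    bound width must fit within `precision`, and the refresh cost saved
    by keeping them is maximized.
    """
    n = len(bounds)
    widths = [hi - lo for lo, hi in bounds]
    best_keep, best_saved = (), 0
    for k in range(n + 1):
        for keep in combinations(range(n), k):
            total_width = sum(widths[i] for i in keep)
            saved = sum(costs[i] for i in keep)
            if total_width <= precision and saved > best_saved:
                best_keep, best_saved = keep, saved
    return sorted(set(range(n)) - set(best_keep))


# Three bounded values with hypothetical refresh costs 5, 2 and 4.
bounds = [(2, 3), (4, 8), (1, 2)]
to_refresh = choose_refresh_set(bounds, costs=[5, 2, 4], precision=2)
# Widths are 1, 4, 1: tuples 0 and 2 fit in the knapsack, so only tuple 1 is refreshed.
```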
20SUM with a Selection Predicate
- SELECT SUM(A) WITHIN 2 WHERE B > 10
- Three possibilities for each B value
- L_Bi > 10 (e.g., [15, 20]): yes
- H_Bi ≤ 10 (e.g., [5, 8]): no
- else (e.g., [9, 12]): maybe
- Ignore the "no" tuples and process the "yes" tuples as before
- For the "maybe" tuples, pretend that the bound on A includes 0
- e.g., A ∈ [3, 5] becomes A ∈ [0, 5]
Column A: [L_A1, H_A1], [L_A2, H_A2]
Column B: [L_B1, H_B1], [L_B2, H_B2]
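The yes/no/maybe classification can be sketched in Python as follows. This is a minimal sketch for a predicate of the form B > threshold only; the function names `classify` and `effective_a_bounds` are hypothetical, and the three example rows reuse the bounds quoted in the bullets.

```python
def classify(b_bound, threshold):
    """Classify a tuple against the predicate B > threshold."""
    low, high = b_bound
    if low > threshold:
        return "yes"      # every possible B value passes
    if high <= threshold:
        return "no"       # no possible B value passes
    return "maybe"        # the bound straddles the threshold


def effective_a_bounds(rows, threshold):
    """Return the bounds on A that contribute to SUM(A).

    rows: list of (a_bound, b_bound) pairs.
    "no" tuples are dropped; "maybe" tuples get a bound widened to
    include 0, since the tuple may contribute nothing to the sum.
    """
    result = []
    for a_bound, b_bound in rows:
        kind = classify(b_bound, threshold)
        if kind == "no":
            continue
        if kind == "maybe":
            a_bound = (min(a_bound[0], 0), max(a_bound[1], 0))
        result.append(a_bound)
    return result


rows = [((3, 5), (15, 20)),   # yes
        ((3, 5), (5, 8)),     # no
        ((3, 5), (9, 12))]    # maybe
contributing = effective_a_bounds(rows, threshold=10)   # [(3, 5), (0, 5)]
```

The resulting bounds can then be summed and refreshed exactly as in the case without a predicate.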
21Other Aggregation Functions
- COUNT
- MIN/MAX
- AVG
- MEDIAN
- see STOC00
22Realizing the Precision-Performance Tradeoff
[Figure: plot of the precision-performance tradeoff; axis labels not recoverable]
23Adjusting Bound Width
- Dynamically adjust bound width to minimize the probability of a refresh
- Preliminary results indicate that this adaptive algorithm is promising
- Value-initiated refresh: value exceeds bounds (bound was too narrow), so grow
- Query-initiated refresh: more precision required (bound was too wide), so shrink
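The grow/shrink rule can be illustrated with a short Python sketch. The slide does not give the actual adjustment policy, so the multiplicative factor of 2 and the function name `adjust_width` are assumptions chosen only to show the direction of each adjustment.

```python
def adjust_width(width, refresh_kind, factor=2.0):
    """Return the new bound width after a refresh.

    refresh_kind: "value" for a value-initiated refresh (bound was too
    narrow, so grow) or "query" for a query-initiated refresh (bound
    was too wide, so shrink). `factor` is a hypothetical tuning knob.
    """
    if refresh_kind == "value":
        return width * factor
    if refresh_kind == "query":
        return width / factor
    raise ValueError(f"unknown refresh kind: {refresh_kind!r}")


width = 4.0
width = adjust_width(width, "value")   # 8.0: the value escaped the bound
width = adjust_width(width, "query")   # 4.0: a query needed more precision
width = adjust_width(width, "query")   # 2.0
```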
24Not Covered in Talk: See Paper
- Details on other aggregates
- Many more examples
- Joins
- Time-varying bounds L(t), H(t)
25Related Work
- Approximate answers
- Mostly precomputation or sampling to provide statistical results
- Reduce representation size
- e.g., Multi-resolution data model (Read et al.)
- Still fetch all objects
26Related Work (cont.)
- Bounds on numerical values
- e.g., Quasi-copies (Alonso et al.), Moving Objects Databases (Wolfson et al.), Demarcation Protocol (Barbara/Garcia-Molina)
- No user control of precision-performance tradeoff
- e.g., APPROXIMATE (Jukic/Vrbsky), Constraint Databases, Incomplete Information Databases (Abiteboul et al.)
- Bounded values are not approximations of exact values available at cost
- Bounds on the number of updates
- e.g., Divergence Caching (Huang et al.)
- No bounds on the values themselves
27Status and Future Work
- Underway
- Performance study on bound functions and width adjustment algorithms
- Non-numeric data (e.g., WWW)
- Multi-level replication systems
- Other types of queries
- Iterative refresh algorithms
- Delaying the propagation of insertions and deletions
- Planned
- Investigation of real-time and consistency issues
- Applying TRAPP to data visualization
28That's all folks!
- To contact me
- olston_at_db.stanford.edu
- http://www.db.stanford.edu/olston