Title: Online Aggregation
1Online Aggregation
- Joseph M. Hellerstein
- Peter J.Haas
- Helen J.Wang
Presented By Bhushan Pachpande
2Contents
- Motivation
- Online aggregation interface
- System for online aggregation
- Approaches to building such system
- Future work
- Conclusion
3Online Aggregation - Motivation
- Aggregation in traditional RDBMS
- for large data after query submission, user is
forced to wait without any feedback - significant waiting time for the final answer
- Not useful when approximate answer is needed
- Author proposes interface to the aggregation
processing so that - progress of the queries can be observed
- execution of the queries can be controlled on the
fly
4Motivating Example
SELECT AVG (final_grade) FROM grades WHERE course
name CS186
AVG -------------- 2.631046 --------------
- If there is no index on course_name attribute,
this query scans entire grades table before
returning the result - Alternative
Running aggregate estimate of the final result
based on the records retrieved so far
Running confidence interval 2.6336 /- 0.0652539
with 95 probability
5Online Aggregation Interface with Groups
- If the records are retrieved in the random order
good approximate result can be obtained - Can stop sampling as soon as the length of the
confidence interval becomes sufficiently small.
- Consider the group-by query with 6 groups in
output - User can be displayed with 6 outputs and 6 stops
signs - Stopping condition can be set on the fly
- Easy to understand for non-statical user
6System for online aggregation
- Usability goals in designing system for online
aggregation - Continuous Observation
- good graphical, statistical interfaces should be
provided to get sense of current precision - Control Time/precision
- user should able to terminate processing at fine
granularity at any time tradeoff between time
and precision - Control of Fairness/partiality
- User should be able to control the rate at which
running aggregates are calculated - Need to ensure running aggregates are all updated
at the same rate - Widths of the running confidence intervals should
decrease at same rate
7System for online aggregation
- Performance goals in designing system for online
aggregation - Minimum time to accuracy
- Minimum time to completion
- Pacing
Control of fairness/partiality sampling rate
for the groups can be increases or decreased
can terminate processing with fine granularity
stop button per group
Multi-group online aggregation interface
8Building Online Aggregation System
- Author tried to add online aggregation in
POSTGRES - Approach I - Naïve Approach
SELECT running_avg(final_grade)
running_confidence(final_grade) running_interval
(final_grade) FROM grades
- can write functions like running_avg which
returns float by calculating average after each
tuple - Difficult to implement with group by clause
- Performance and functionality problem arise
- Approach II - Modifying DBMS
- difficult to implement online aggregation as user
level addition - Modification in database engine required
9Random access to data
- To have more accurate estimates of the running
aggregates, records should be retrieved randomly - Heap Scans
- Simple heap scans can be effective in traditional
heap file access methods where records are stored
in unspecified order - Need to choose different method for the aggregate
attributes, which are correlated to the logical
order of formation of heap - Index Scans
- Can be used if aggregate attributes are not used
for indexing - Sampling from Indices
- Techniques for pseudo random sampling from
various index structures can be used Olkens
work
10Non-blocking/Fair access
- Groups should receive updated in fair manner
- For grouping traditional way is to sort relation
on aggregation fields and collect groups - Sorting !!
- blocking algorithm and will produce sequential
results - Must use hash based techniques
- Hashing input relation on grouping column
- For DISTINCT columns similar hashing technique
can be used - Does not scale with number of grouping values
- Hybrid hashing can be used instead
11Index Striding
- Updates for the groups with few members will be
very difficult - For fair group by
- Read tuples in round robin fashion (a tuple from
group 1, a tuple from group 2, ) - Supported by technique index striding
- Given B-tree index on the grouping columns,
traverse tree - Additional advantages
- Group updating rate can be controlled
- Particular group processing can be stopped
- To support equal width confidence intervals
weighted round-robin scheme may be used
12Join Algorithms
- Algorithms should not block the query processing
- Sort-merge join
- Unacceptable as sorting is blocking operation
- Merge Join
- OK but produces sorted output
- Hybrid hash join
- Not good if inner relation is large
- Nested loops join always good,
- In case of large un-indexed inner relation its
too slow - An optimizer must be used to choose between these
strategies
13Optimization
- Avoid sorting
- Blocking sub-operations (processing inner
relation in case of hybrid hash joins) should
have costs and appropriate costs should be
considered - Preferences to the plans that maximize user
control (index striding)
14Extended aggregate functions
- Standard set of aggregate functions must be
extended - Aggregate functions must be written that provides
running estimates - Running computation
- SUM, COUNT, AVG straight forward
- VAR, STD DEV can be implemented using
algorithms - Aggregate function returning running confidence
must be written
15Running confidence intervals
- Conservative confidence interval
- For n (no of tuples retrieved) gt 1
- Answer guaranteed to be gt probability p based
on Hoeffdings inequality - Large-sample confidence intervals
- Comparative for large n
- Based Central Limit Theorem
- Deterministic confidence intervals
- When n is very large
- Answer lies in a - e1,a e2 with probability
1. - Running confidence interval can be dynamically
adjusted depending on the value of n.
16Future work
- Better UI
- Nested Queries
- Control without Indices
- Checkpointing / Continuation
17Conclusion
- Paper proposed important feature of a user
interface for online aggregation which produces
confidence intervals for running aggregates - Proposed modifications required in DBMS for such
online aggregation system
18Thank You