Online Aggregation

About This Presentation

Title:

Online Aggregation

Description:

VAR, STD DEV can be implemented using algorithms. Aggregate function returning running confidence must be written. Running confidence intervals ... – PowerPoint PPT presentation

Number of Views:29

Avg rating:3.0/5.0

Slides: 19

Provided by: cmk264

Learn more at: https://crystal.uta.edu

Category:

more less

Transcript and Presenter's Notes

Title: Online Aggregation

1
Online Aggregation

Joseph M. Hellerstein
Peter J.Haas
Helen J.Wang

Presented By Bhushan Pachpande
2
Contents

Motivation
Online aggregation interface
System for online aggregation
Approaches to building such system
Future work
Conclusion

3
Online Aggregation - Motivation

Aggregation in traditional RDBMS
for large data after query submission, user is
forced to wait without any feedback
significant waiting time for the final answer
Not useful when approximate answer is needed
Author proposes interface to the aggregation
processing so that
progress of the queries can be observed
execution of the queries can be controlled on the
fly

4
Motivating Example
SELECT AVG (final_grade) FROM grades WHERE course
name CS186
AVG -------------- 2.631046 --------------

If there is no index on course_name attribute,
this query scans entire grades table before
returning the result
Alternative

Running aggregate estimate of the final result
based on the records retrieved so far
Running confidence interval 2.6336 /- 0.0652539
with 95 probability
5
Online Aggregation Interface with Groups

If the records are retrieved in the random order
good approximate result can be obtained
Can stop sampling as soon as the length of the
confidence interval becomes sufficiently small.

Consider the group-by query with 6 groups in
output
User can be displayed with 6 outputs and 6 stops
signs
Stopping condition can be set on the fly
Easy to understand for non-statical user

6
System for online aggregation

Usability goals in designing system for online
aggregation
Continuous Observation
good graphical, statistical interfaces should be
provided to get sense of current precision
Control Time/precision
user should able to terminate processing at fine
granularity at any time tradeoff between time
and precision
Control of Fairness/partiality
User should be able to control the rate at which
running aggregates are calculated
Need to ensure running aggregates are all updated
at the same rate
Widths of the running confidence intervals should
decrease at same rate

7
System for online aggregation

Performance goals in designing system for online
aggregation
Minimum time to accuracy
Minimum time to completion
Pacing

Control of fairness/partiality sampling rate
for the groups can be increases or decreased
can terminate processing with fine granularity
stop button per group
Multi-group online aggregation interface
8
Building Online Aggregation System

Author tried to add online aggregation in
POSTGRES
Approach I - Naïve Approach

SELECT running_avg(final_grade)
running_confidence(final_grade) running_interval
(final_grade) FROM grades

can write functions like running_avg which
returns float by calculating average after each
tuple
Difficult to implement with group by clause
Performance and functionality problem arise
Approach II - Modifying DBMS
difficult to implement online aggregation as user
level addition
Modification in database engine required

9
Random access to data

To have more accurate estimates of the running
aggregates, records should be retrieved randomly
Heap Scans
Simple heap scans can be effective in traditional
heap file access methods where records are stored
in unspecified order
Need to choose different method for the aggregate
attributes, which are correlated to the logical
order of formation of heap
Index Scans
Can be used if aggregate attributes are not used
for indexing
Sampling from Indices
Techniques for pseudo random sampling from
various index structures can be used Olkens
work

10
Non-blocking/Fair access

Groups should receive updated in fair manner
For grouping traditional way is to sort relation
on aggregation fields and collect groups
Sorting !!
blocking algorithm and will produce sequential
results
Must use hash based techniques
Hashing input relation on grouping column
For DISTINCT columns similar hashing technique
can be used
Does not scale with number of grouping values
Hybrid hashing can be used instead

11
Index Striding

Updates for the groups with few members will be
very difficult
For fair group by
Read tuples in round robin fashion (a tuple from
group 1, a tuple from group 2, )
Supported by technique index striding
Given B-tree index on the grouping columns,
traverse tree
Additional advantages
Group updating rate can be controlled
Particular group processing can be stopped
To support equal width confidence intervals
weighted round-robin scheme may be used

12
Join Algorithms

Algorithms should not block the query processing
Sort-merge join
Unacceptable as sorting is blocking operation
Merge Join
OK but produces sorted output
Hybrid hash join
Not good if inner relation is large
Nested loops join always good,
In case of large un-indexed inner relation its
too slow
An optimizer must be used to choose between these
strategies

13
Optimization

Avoid sorting
Blocking sub-operations (processing inner
relation in case of hybrid hash joins) should
have costs and appropriate costs should be
considered
Preferences to the plans that maximize user
control (index striding)

14
Extended aggregate functions

Standard set of aggregate functions must be
extended
Aggregate functions must be written that provides
running estimates
Running computation
SUM, COUNT, AVG straight forward
VAR, STD DEV can be implemented using
algorithms
Aggregate function returning running confidence
must be written

15
Running confidence intervals

Conservative confidence interval
For n (no of tuples retrieved) gt 1
Answer guaranteed to be gt probability p based
on Hoeffdings inequality
Large-sample confidence intervals
Comparative for large n
Based Central Limit Theorem
Deterministic confidence intervals
When n is very large
Answer lies in a - e1,a e2 with probability
1.
Running confidence interval can be dynamically
adjusted depending on the value of n.

16
Future work

Better UI
Nested Queries
Control without Indices
Checkpointing / Continuation

17
Conclusion

Paper proposed important feature of a user
interface for online aggregation which produces
confidence intervals for running aggregates
Proposed modifications required in DBMS for such
online aggregation system

18
Thank You

Write a Comment

User Comments (0)