Online Aggregation - PowerPoint PPT Presentation

About This Presentation
Title:

Online Aggregation

Description:

VAR, STD DEV can be implemented using algorithms. Aggregate function returning running confidence must be written. Running confidence intervals ... – PowerPoint PPT presentation

Number of Views:29
Avg rating:3.0/5.0
Slides: 19
Provided by: cmk264
Learn more at: https://crystal.uta.edu
Category:

less

Transcript and Presenter's Notes

Title: Online Aggregation


1
Online Aggregation
  • Joseph M. Hellerstein
  • Peter J.Haas
  • Helen J.Wang

Presented By Bhushan Pachpande
2
Contents
  • Motivation
  • Online aggregation interface
  • System for online aggregation
  • Approaches to building such system
  • Future work
  • Conclusion

3
Online Aggregation - Motivation
  • Aggregation in traditional RDBMS
  • for large data after query submission, user is
    forced to wait without any feedback
  • significant waiting time for the final answer
  • Not useful when approximate answer is needed
  • Author proposes interface to the aggregation
    processing so that
  • progress of the queries can be observed
  • execution of the queries can be controlled on the
    fly

4
Motivating Example
SELECT AVG (final_grade) FROM grades WHERE course
name CS186
AVG -------------- 2.631046 --------------
  • If there is no index on course_name attribute,
    this query scans entire grades table before
    returning the result
  • Alternative

Running aggregate estimate of the final result
based on the records retrieved so far
Running confidence interval 2.6336 /- 0.0652539
with 95 probability
5
Online Aggregation Interface with Groups
  • If the records are retrieved in the random order
    good approximate result can be obtained
  • Can stop sampling as soon as the length of the
    confidence interval becomes sufficiently small.
  • Consider the group-by query with 6 groups in
    output
  • User can be displayed with 6 outputs and 6 stops
    signs
  • Stopping condition can be set on the fly
  • Easy to understand for non-statical user

6
System for online aggregation
  • Usability goals in designing system for online
    aggregation
  • Continuous Observation
  • good graphical, statistical interfaces should be
    provided to get sense of current precision
  • Control Time/precision
  • user should able to terminate processing at fine
    granularity at any time tradeoff between time
    and precision
  • Control of Fairness/partiality
  • User should be able to control the rate at which
    running aggregates are calculated
  • Need to ensure running aggregates are all updated
    at the same rate
  • Widths of the running confidence intervals should
    decrease at same rate

7
System for online aggregation
  • Performance goals in designing system for online
    aggregation
  • Minimum time to accuracy
  • Minimum time to completion
  • Pacing

Control of fairness/partiality sampling rate
for the groups can be increases or decreased
can terminate processing with fine granularity
stop button per group
Multi-group online aggregation interface
8
Building Online Aggregation System
  • Author tried to add online aggregation in
    POSTGRES
  • Approach I - Naïve Approach

SELECT running_avg(final_grade)
running_confidence(final_grade) running_interval
(final_grade) FROM grades
  • can write functions like running_avg which
    returns float by calculating average after each
    tuple
  • Difficult to implement with group by clause
  • Performance and functionality problem arise
  • Approach II - Modifying DBMS
  • difficult to implement online aggregation as user
    level addition
  • Modification in database engine required

9
Random access to data
  • To have more accurate estimates of the running
    aggregates, records should be retrieved randomly
  • Heap Scans
  • Simple heap scans can be effective in traditional
    heap file access methods where records are stored
    in unspecified order
  • Need to choose different method for the aggregate
    attributes, which are correlated to the logical
    order of formation of heap
  • Index Scans
  • Can be used if aggregate attributes are not used
    for indexing
  • Sampling from Indices
  • Techniques for pseudo random sampling from
    various index structures can be used Olkens
    work

10
Non-blocking/Fair access
  • Groups should receive updated in fair manner
  • For grouping traditional way is to sort relation
    on aggregation fields and collect groups
  • Sorting !!
  • blocking algorithm and will produce sequential
    results
  • Must use hash based techniques
  • Hashing input relation on grouping column
  • For DISTINCT columns similar hashing technique
    can be used
  • Does not scale with number of grouping values
  • Hybrid hashing can be used instead

11
Index Striding
  • Updates for the groups with few members will be
    very difficult
  • For fair group by
  • Read tuples in round robin fashion (a tuple from
    group 1, a tuple from group 2, )
  • Supported by technique index striding
  • Given B-tree index on the grouping columns,
    traverse tree
  • Additional advantages
  • Group updating rate can be controlled
  • Particular group processing can be stopped
  • To support equal width confidence intervals
    weighted round-robin scheme may be used

12
Join Algorithms
  • Algorithms should not block the query processing
  • Sort-merge join
  • Unacceptable as sorting is blocking operation
  • Merge Join
  • OK but produces sorted output
  • Hybrid hash join
  • Not good if inner relation is large
  • Nested loops join always good,
  • In case of large un-indexed inner relation its
    too slow
  • An optimizer must be used to choose between these
    strategies

13
Optimization
  • Avoid sorting
  • Blocking sub-operations (processing inner
    relation in case of hybrid hash joins) should
    have costs and appropriate costs should be
    considered
  • Preferences to the plans that maximize user
    control (index striding)

14
Extended aggregate functions
  • Standard set of aggregate functions must be
    extended
  • Aggregate functions must be written that provides
    running estimates
  • Running computation
  • SUM, COUNT, AVG straight forward
  • VAR, STD DEV can be implemented using
    algorithms
  • Aggregate function returning running confidence
    must be written

15
Running confidence intervals
  • Conservative confidence interval
  • For n (no of tuples retrieved) gt 1
  • Answer guaranteed to be gt probability p based
    on Hoeffdings inequality
  • Large-sample confidence intervals
  • Comparative for large n
  • Based Central Limit Theorem
  • Deterministic confidence intervals
  • When n is very large
  • Answer lies in a - e1,a e2 with probability
    1.
  • Running confidence interval can be dynamically
    adjusted depending on the value of n.

16
Future work
  • Better UI
  • Nested Queries
  • Control without Indices
  • Checkpointing / Continuation

17
Conclusion
  • Paper proposed important feature of a user
    interface for online aggregation which produces
    confidence intervals for running aggregates
  • Proposed modifications required in DBMS for such
    online aggregation system

18
Thank You
Write a Comment
User Comments (0)
About PowerShow.com