Ripple Joins for Online Aggregation - PowerPoint PPT Presentation

About This Presentation

Title:

Ripple Joins for Online Aggregation

Description:

Ripple Joins for Online Aggregation by Peter J. Haas and Joseph M. Hellerstein published in June 1999 presented by Nag Prajval B.C – PowerPoint PPT presentation

Number of Views:115

Avg rating:3.0/5.0

Slides: 25

Provided by: Rayt191

Learn more at: https://crystal.uta.edu

Category:

more less

Transcript and Presenter's Notes

Title: Ripple Joins for Online Aggregation

1
Ripple Joins for Online Aggregation

by
Peter J. Haas and Joseph M. Hellerstein
published in June 1999
presented by
Nag Prajval B.C

2
In Simple words

This paper tells how to join a bunch of
tables and get the SUM, COUNT, or AVG in GROUP BY
clauses showing approximate results immediately
and the confidence interval of the results from
the first few tuples retrieved updating a GUI
display with closer approximation information as
the join adds more tuples.

3
Motivation

Traditional Join Algorithms
Nested- Loop Join
Merge Join
Hash Join
Main Goal These join algorithms are designed
minimize the time to completion of query.

4
Online System

Huge database with large tables.
Select student.honors_code,avg(enroll grade)
From enroll, student
Where enroll.sidstudent.sid
Group by student.honors_code

5
Traditional Algorithms

Traditional algorithms take a lot of time since
they have to process the entire tables or
relations
The users have to wait for a long time before the
results are returned.
Many times aggregation queries are used to get a
big picture of a dataset.
More Appropriate method would be Online
Aggregation.

6
Online Aggregation

A running estimate of the final aggregates are
continuously displayed to the user.
The proximity of the running estimate to the
final result is also displayed to the
user.(confidence interval)
Quick results rather than minimize time for
completion.
This required some changes to be made to the
traditional algorithms for query processing.

7
GUI, 1999
8
Join Algorithms for Online Processing

Traditional Nested loop algorithm could be used
in an online fashion.
Drawbacks
1. If R was of non trivial size then time
between successive updates could be excessive.
2. consider a query like this
Select AVG(S.a R.b/1000000)
From S and R
Where S.xR.x
As scan progresses no new information is
retrieved ie the Confidence Intervals do not
shrink well.

9
Ripple Join Operation

Assume ripple join of relations R and S
Select a random tuple r from R.
Join with previously selected S tuples.
Select a random tuple s from S.
Join with previously selected R tuples.
Join r and s.

10
Example
11
Aspect Ratios

Aspect ratio How many tuples are retrieved from
each base relation per sampling step.
e.g. ß1 1, ß2 3,
Square Ripple JoinSamples are drawn from R and S
at the same rate.
Rectangular Ripple JoinOne relation is sampled
at a higher rate than another to provide shortest
possible Confidence Intervals.

12
Ripple Join Algorithm

For(max1 to infinity)
for(i1 to max-1)
if(predicate(Ri,smax))
output(Ri,Smax)
for(i1 to max)
if(predicate(Rmax,si))
output(Rmax,Si)

13
Ripple Join Algorithm

Generalization of Nested loop join
Roles of Inner and Outer Relations continuosly
interchange during processing
Cursor at S fixed at maxn whil cursor into R
loops from 1 to n-1
When cursor into R reaches value n,the cursor
into S loops from 1 to n.

14
Why call this "Ripple Join"?

The algorithm seems to ripple out from a corner
of the join.
Acronym "Rectangles of Increasing Perimeter
Length"

15
Iterators

Ripple joins can be developed using the iterator
model.
2 variants
1.Square Binary Ripple Join Iterator
2.Enhanced Ripple Join Iterator

16
Square Binary Ripple Join Iterator
17
Enhanced Ripple Join Iterator

Drawbacks of Square Ripple join iterator are
handled
1)Non Unitary aspect ratios
2)Pipelining multiple ripple joins

18
Ripple Join Variants

Block Ripple Join
Index Ripple Join
Hash Ripple Join

19
Performance

Goal Provide efficient and accurate estimation.
Join results should be returned in a such a way
that aggregates are updated regularly and
confidence intervals shrink rapidly.
Depends on 2 important things
1.Estimators for SUM,COUNT and AVG
2.Confidence intervals

20
Estimators for SUM,COUNT and AVG

Consider a query of the form
SELECT op(expression) from R,S
WHERE predicate
At end of nth sampling step estimator of
SUM(expression) is

21
Estimators for SUM,COUNT and AVG

Rn and Sn Set of Tuples that have been
read at nth sampling step
R and S Set of Relations
expressionP(r,s) equals expression(r,s) if (r,s)
satisfies where clause and 0 otherwise
Count() expressionP(r,s) replaced by
oneP(r,s) where oneP(r,s) 1 if (r,s)
satisfies where clause and 0 otherwise.
AVG(expression)SUM estimator/COUNT estimator.

22
Confidence interval

A running confidence interval displays how close
this answer is to the final result.
This could be calculated in many ways.
The authors present an example calculation built
on extending the Central Limit Theorem.

23
Ripple Joins

Ripple joins are designed to minimize the time
until an acceptably precise estimate of the query
result is available, as measured by the length of
a confidence interval.
Ripple joins are adaptive, adjusting their
behavior during processing in accordance with the
statistical properties of the data.

24
Ripple Joins

Ripple joins also permit the user to dynamically
trade off the two key performance factors of
online aggregation the time between successive
updates of the running aggregate, and the amount
by which the confidence-interval length decreases
at each update.
Ripple joins appear to be among the first
database algorithms to use statistical
information about the data not just to estimate
selectivities and processing costs, but to
estimate the quality of the result currently
being displayed to the user and to dynamically
adjust algorithm behavior accordingly.