Online Query Processing

About This Presentation

Title:

Online Query Processing

Description:

Scalable Spreadsheet for Online Data Cleaning. Interactive anomaly detection & transformation ... getNext chooses best item currently on buffer ... – PowerPoint PPT presentation

Number of Views:121

Avg rating:3.0/5.0

Slides: 80

Provided by: joehell

Learn more at: https://dsf.berkeley.edu

more less

Transcript and Presenter's Notes

Title: Online Query Processing

1
Online Query Processing

Joseph M. Hellerstein
UC Berkeley

2
Road Map

Requisite tech trends and prognostications
CONTROL of the situation online query processing
HCI motivations
Example applications
Example Algorithms
Reordering and Ripple Joins
Implications for hot topics
Big data, small pipes P2P, Sensor nets
Continuous Queries over data streams

3
Computers Keep Getting Faster

Moores Law
of transistors per chip doubles every 18
months (1965)
? 2x price/performance every 18 months
About right, 24 periods running

4
But Data Grows Faster Yet!
Source J. Porter, Disk/Trend, Inc. http//www.dis
ktrend.com/pdf/portrpkg.pdf
5
Disk Appetite, cont.

Greg Papadopoulos, CTO Sun
Disk sales doubling every 9 months
Translate
Time to process all your data doubles every 18
months
MOORES LAW INVERTED!
Big challenge (opportunity?) for SW systems
research
Feel free to discount the analysis above
significantly
Even so, traditional scalability research wont
help
Ideal linear scaleup is NOT NEARLY ENOUGH!
HW will not take us there.

6
Data Volume Prognostications

Today
SwipeStream
E.g. Wal-Mart 100 TB Data Warehouse
ClickStream
One service generates 100GB of log per day
Web
Internet Archive WayBack Machine 100 TB
Replicated OS/Apps, MP3s, etc.
Tomorrow
Sensor feeds galore
Big data, small pipes Sensor nets, P2P
Continuous queries over data streams

7
Road Map

Requisite tech trends and prognostications
CONTROL of the situation online query processing
HCI motivations
Example applications
Example Algorithms
Reordering and Ripple Joins
Implications for hot topics
P2P query systems
Sensor nets
Stream Query Processing

8
The Latest Commercial Technology
9
Drawbacks of Current Systems

Only exact answers are available
A losing proposition as data volume grows
HCI solution interactive tools dont do big jobs
E.g., spreadsheet programs (64Krow limit)
Systems solution big jobs arent interactive
No user feedback or control in big DBMS queries
back to the 60s
Long processing times
Fundamental mismatch with preferred modes of HCI
Best solutions to date precompute
E.g. OLAP
Dont handle ad hoc queries or data sets well

10
CONTROLContinuous Output, Navigation and
Transformation with Refinement On-Line

Of all men's miseries, the bitterest is thisto
know so much and have control over nothing
-- Herodotus
Requirements for CONTROL systems
Early answers
Refinement over time
Interaction and ad-hoc control
Crystal Ball vs. Black Box
vs. Lucite Watch

11
CONTROL Project, 96-01

Main SW Artifacts
Online Aggregation in SQL
Postgres, Informix, IBM DB2
Potters Wheel
Scalable Spreadsheet for Online Data Cleaning
Interactive anomaly detection transformation
CARMA online association rule mining
2 Online Data Viz prototypes
CLOUDS Phoebus
Seeded current Telegraph project
Adaptive dataflow for querying networked sources

12
Goals for Online Processing

New greedy performance regime
Maximize 1st derivative of the mirth index
Mirth defined on-the-fly
Therefore need FEEDBACK and CONTROL

100
Online
?
Traditional
Time
13
Road Map

Requisite tech trends and prognostications
CONTROL of the situation online query processing
HCI motivations
Example applications
Example Algorithms
Reordering and Ripple Joins
Implications for hot topics
P2P query systems
Sensor nets
Stream Query Processing

14
Online Aggregation

SELECT AVG(temp) FROM t GROUP BY site
330K rows in table (synthetic data)
the exact answer

15
Online Aggregation, contd

A simple online aggregation interface (after 74
rows)

Courtesy Peter Haas, IBM
16
Online Aggregation, contd

After 834 rows

Courtesy Peter Haas, IBM
17
Example Online Aggregation
Additional Features Speed up Slow
down Terminate
18
Online Data Visualization

In Tioga DataSplash

19
Online Browsing Editing

Potters Wheel VLDB 2001
Scalable spreadsheet
A fraction of data is materialized in GUI widget
at any time
Scrolling preference for delivering quantiles
to widget
Interactive data cleaning
Direct-manipulation interface via visual algebra
Transformation by example
Online structure and discrepancy detection
MDL meets ADTs
Minimum Description Length Rissinen for
modeling data
Based on a grammar of user-defined Abstract Data
Types
Simple API for programmers to register new ADTs

20
Scalable Spreadsheets
21
Visual Transformation Shot
22
(No Transcript)
23
Road Map

Requisite tech trends and prognostications
CONTROL of the situation online query processing
HCI motivations
Example applications
Example Algorithms
Reordering and Ripple Joins
Implications for hot topics
P2P query systems
Sensor nets
Stream Query Processing

24
Sampling w/o Replacement

We want i.i.d. samples w/o replacement
At any time, the input to the query is a sample
Input grows over time
Can pre-sort tables randomly
And start scans from random positions
Best I/O behavior
Can re-randomize incrementally in the background
Note can do this as a secondary index
Techniques for random sampling in DBs well
studied
Both from files and from indexes
Some tricks here (e.g. acceptance-rejection
sampling)

25
An AsideSampling in Commercial DBMSs

Simulated Bernoulli sampling
SQL SELECT WHERE RAND()
Similar capability in SAS
Bernoulli Sampling with pre-specified rate
Informix, Oracle 8i, (DB2)
Ex SELECT FROM T1 SAMPLE ROW(10), T2
Ex SELECT FROM T1 SAMPLE BLOCK(10), T2
Not for novices
Need to pre-specify precision
no feedback/control
recall the multiresolution patterns from
example
No estimators provided in current systems

26
Preferential Data Delivery

Why needed? Examples.
Speedup/slowdown arrows
Spreadsheet scrollbars
Pipeline quasi-sort
Mechanisms
Juggle Raman2/Hellerstein, VLDB 99
General purpose works with streaming data, etc.
Index stride Hellerstein/Haas/Wang, SIGMOD 97
Needs index, high I/O cost, but good for outliers
Mix adaptively! Raman/Hellerstein, ICDE 03

27
Online Reordering
join
transmit
process
consume

Deliver interesting items first
Interesting determined on the fly
Exploit rate gap between produce and
process/consume

28
Online Reordering
join
transmit
process
consume
reorder

Deliver interesting items first
Interesting determined on the fly based on user
gestures at the interface
Exploit rate gap between produce and
process/consume

29
Mechanism
process/consume
getNext
buffer
spool
prefetch
enrich
side disk
produce

Two threads
prefetch from input
spool/enrich from auxiliary side disk, deliver
output
Juggle data between buffer and side disk
keep buffer full of interesting items
getNext chooses best item currently on buffer
getNext, enrich/spool decisions -- based on
reordering policy
Side disk management
hash index, populated in a way that postpones
random I/O
play both sides of sort/hash duality

30
Policies
?
QoF
time

Want a good permutation of items t1tn to
t?1t?n
Quality of Feedback for a prefix t?1t?2t?k
QoF(UP(t?1), UP(t?2), UP(t?k )), UP user
preference
determined by application
Goodness of reordering dQoF/dt
Delivery Priority (DP)
At any given time, try to deliver tuple that
improves QoF most

31
Example Online Aggregation

Metric avg weighted confidence interval
Preference acts as weight on confidence interval
QOF ?Si UPi / (ni)1/2
ni number of tuples processed from group i
DPi UPi / (ni)3/2
Metric explicit proportional rates (NW QoS)
QOF ?Si (ni nUPi)2
where n Si ni
DPi nUPj nj
Explicit ranking (scrollbar)
QOF ?Si UPi
DPi UPi

32
Ripple Joins
Haas/Hellerstein, SIGMOD 99

Need online join processing
Pipelined produce output while consuming input
Statistically robust estimates/CIs as you go
SELECT AVG(R.a S.b)
FROM R, S
WHERE R.c S.c

33
Ripple Joins
Haas/Hellerstein, SIGMOD 99

Need online join processing
Pipelined produce output while consuming input
Statistically robust estimates/CIs as you go
SELECT AVG(R.a S.b)
FROM R, S
WHERE R.c S.c
Solution space
Simple idea sample a join

34
Ripple Joins
Haas/Hellerstein, SIGMOD 99

Need online join processing
Pipelined produce output while consuming input
Statistically robust estimates/CIs as you go
SELECT AVG(R.a S.b)
FROM R, S
WHERE R.c S.c
Solution space
Simple idea sample a join
Better join samples

35
Ripple Joins
Haas/Hellerstein, SIGMOD 99

Need online join processing
Pipelined produce output while consuming input
Statistically robust estimates/CIs as you go
SELECT AVG(R.a S.b)
FROM R, S
WHERE R.c S.c
Solution space
Simple idea sample a join
Better join samples
Ripple Joins
Family of algorithms for joining increasing
samples
Optimized to shrink confidence intervals ASAP
Q Which input to sample faster? Aspect ratio
of ripples?
A Adapt to estimators variance contribution
from different inputs

36
Traditional Nested Loops
SELECT AVG(R.a S.b) FROM R, S WHERE R.c S.c
R
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
S
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
37
Basic Ripple Join

Simplest version
read new tuples s from S and r from R
join r and s
join r with old S tuples
join s with old R tuples

38
Basic Ripple Join
R
x
S
39
Better I/O Block Ripple Join
Batch IOs a 2
R

40
Adapt Rectangular Ripple Join
bR 2, bS 1
R

41
Ripple Joins, contd

Really a family of join algorithms
Block minimizes I/O in alternating nested loops
Hash symmetric hash tables to avoid scans
Index basic index join
Adaptive aspect ratio
User sets frame rate of animation at GUI
System goal minibatch optimizations
minimize CI width subject to frame-rate time
constraint
CIs computable at corners, so a new frame each
corner
Challenge choose the next minibatch corner
Sample from higher variance-contributing relation
faster
System solves linear programming problem to do
this (approximately)

42
Nice and Tidy! But

Significant details missing
Systems issues
Make it work in a re-entrant iterator model
RippleJoin.get_next() ?
Multi-join queries
Corner cases -- e.g. EOT on one input of k
Statistical matters
Estimators and CIs for joins of samples
Not textbook statistics!
The adaptivity issue choosing the next corner
Based on cost (I/O model for algorithm) and
benefit (CI formula)
Approximate a non-linear integer programming
problem

43
Implementation Issues

Basically a symmetric nested loops
inner and outer roles switch back and forth
Note asymmetry in inner loops mitres the box
for (ever)
for (i 1 to max -1) // S fixed, loop on R
if (predicate(Ri, Smax))
output(Ri, Smax)
for (i 1 to max) // R fixed, loop on S
if (predicate(Rmax, Si))
output(Rmax, Si)

44
Now as an iterator

Most DBMSs implement query operators as iterators
get_next() must re-enter and generate next result
tuple
state for square, 2-table ripple join
cursors on both inputs (as in nested loops join)
target -- wheres the next corner
currel -- which relation is currently looping

45
Fussy complications in general

Non-square aspect ratios
Not just wrapping 1 layer on each side
must pad side i with bi layers correctly
Multiple joins pipeline of binary joins
But cant be done naively
Wrapping hypercubes not rectangles
Given subtree of size i, factor in bs and a
Subtrees hyperplane is b1b2biai tuples
Boundary conditions
Overload control messages
EOT may mean end of sampling step, not end of
table
Need to signal aggregator of arrival at corner
(EOT on all tables)
Also, EOTable on a relation may change
mitre-ing rules!

46
Empirical Results for Square Ripples (208 sec.
Query)
Select Online AVG(E.grade) From enroll E,
student S Where E.sid S.sid And
S.honors_code IS NULL
47
Square Results (cont.)
48
Statistical Matters

Need robust formulas for
running estimates
running CIs
Adaptive update of aspect ratio
Efficient maintenance of running statistics

49
Running Estimators

ExampleSELECT SUM(expression) FROM R, SWHERE
predicate
Natural estimator after n sampling steps
Unbiased, consistent
A scaled average -- hence amenable to CLT

50
Large-Sample Confidence Intervals

Review single table R
sampling steps 1
Based on Central Limit Theorem for averages of
i.i.d. observations
Complications for multiple tables
Estimator for AVG is a ratio of averages
expressionp(r, s) and expressionp(r, s) are
correlated!
Solution
CLT for (ratios of) cross-product averages
Based on theory of convergence in distribution
Courtesy Peter Haas, IBM
Also conservative CIs that can be used for very
small n
Based on Hoeffding inequality
And deterministic CIs for the endgame

51
Large-Sample Confidence Intervals

General form For K tables and a 100p
CIWhere
zp is the area under the standard normal curve 0
p
n is the number of tuples so far
d(k) is agg-fn dependent, computed on tuples of
input
dn(k) a consistent estimator by applying d(k) to
samples so far
a the block size, a constant
bi the number of blocks to fetch from relation i
Tradeoff between update rate/CI shrinkage
Bigger bs smaller CIs, slower animation

CI width
52
Aspect Ratio Selection

Cost of block ripple join after n steps
c b1b2bkaK-1nK o(nK)
Cost of hash ripple after n steps
c (b1 b2 bk)n
Non-linear integer programming problemminimize
r2(b1, b2, , bk, a) such that
c (b1, b2, , bk, a) ? constant animation
speed
1 ? bk ? Rk/ a for 1 ? k ? K
b1, b2, , bk integer

53
Aspect Ratio Selection

Adaptive procedure
Start with b1 b2 bK 1
Execute m sampling steps (m 1)
Update each d(k) estimate
Solve optimization problem to get new bs
Sweep out new rectangle
Go to 2
Approximate optimization algorithm for bs
First ignore all constraints, minimize r2
Then scale up all so smallest bi 1 (constraint
2)
Scale down by ever-larger bis until animation
goal met (constraint 1)
Round down to integers (constraint 3)

c (b1, b2, , bk, a) ? c
1 ? bk ? Rk/ a
b1, b2, , bk integer

54
Empirical Results aspect ratios
Select Online AVG(D.grade/A.grade) From enroll
D, enroll A Where D.college Education And
A.college Agriculture And A.year d.year
-- 1x6
55
Ripple Joins, contd

Prototypes in Postgres, Informix, IBM DB2
Follow-on work
Subqueries in Singapore Tan, et al, VLDB 99
Scaling at Wisconsin/IBM Luo, et al, SIGMOD 02
Parallel and out-of-core variants
Query optimization issues
Motivated eddies SIGMOD 99 and the rest of the
Telegraph project CIDR 03
A number of API and other systems issues
Informix implementation DMKD 2000

56
Road Map

Requisite tech trends and prognostications
CONTROL of the situation online query processing
HCI motivations
Example applications
Example Algorithms
Reordering and Ripple Joins
Implications for hot topics
P2P query systems
Sensor nets
Stream Query Processing

57
Two Hot DB Topics

Queries over networks
Querying sensor nets (e.g. TinyDB)
Querying in P2P networks (e.g. PIER)
Continuous queries over data streams
E.g. TelegraphCQ, Aurora, STREAM, NiagaraCQ
Natural settings for these ideas!

58
Queries over Networks Big Data, Small Pipes

Given Large Data Sets
E.g. disks, traces in P2P setting
E.g. the physical world in sensornets
Constraint Small pipes
P2P ! cluster IPTPS03
Sensor nets have extremely limited comm per
battery
Will have long-running dataflow programs! So
Need to provide streaming, approximate answers
Need to prioritize delivery a la juggle
Need to decide adaptively on how to sample
physical world
Acquisitional Query Processing in TinyDB
related to both Ripple Join and Juggle
Need to consider human factors while the query
runs

59
Continuous Queries over Streams

Hot topic of the moment
Berkeleys TelegraphCQ, OGI/Wisconsins
NiagaraCQ, MIT/Brown/Brandeis Aurora, Stanfords
STREAM
Again, natural setting for these ideas
Need pipelining operators for STREAMS
Interest in approximate answers during bursts
And prioritized delivery
Need to consider human factors!
DB community focused on the next query language
Also need to focus on interactivity!
Initiation of a Continuous Query is only half the
battle

60
CONTROL Lessons Learned

Dream about UIs, work on systems
User needs drive systems design
For long-running, data-intensive apps especially
Systems and statistics intertwine
Adaptive systems
All 3 go together naturally
User desires and behavior 2 more things to
model, predict
Performance metrics need to reflect key user
needs
What unlike things must meet and mate
-- Art, Herman Melville

61
Related Work on Online QP

Morgensteins PhD, Berkeley 80
Online Association Rules
Ng, et als CAP, SIGMOD 98
Hidbers CARMA, SIGMOD 99
Implications for deductive DB semantics
Monotone aggregation in LDL, Zaniolo and Wang
Online agg with subqueries
Tan, et al. VLDB 99
Dynamic Pipeline Scheduling
Urhan/Franklin VLDB 01
Pipelining Hash Joins
Su/Raschid, Wilschut/Apers, Tukwila, Xjoin
Relation to semi-naive evaluation
Anytime Algorithms
Zilberstein, Russell, et al.

62
More information?

http//control.cs.berkeley.edu
Papers
Potters Wheel open source
jmh_at_cs.berkeley.edu

63
Backup Slides
64
Sampling Design Issues

Granularity of sample
Instance-level (row-level) high I/O cost
Block-level (page-level) high variability from
clustering
Type of sample
Often simple random sample (SRS)
Especially for on-the-fly
With/without replacement usually not critical
Data structure from which to sample
Files or relational tables
Indexes (B trees, etc)

65
Row-level Sampling Techniques

Maintain file in random order
Sampling scan
Is file initially in random order?
Statistical tests needed e.g., Runs test,
Smirnov test
In DB systems cluster via RAND function
Must freshen ordering (online reorg)
On-the-fly sampling
Via index on random column
Else get random page, then row within page
Ex extent-map sampling
Problem variable number of records on page

66
Acceptance/Rejection Sampling

Accept row on page i with probability ni/nMAX
Commonly used in other settings
E.g. sampling from joins
E.g. sampling from indexes

Original pages
Modified pages
67
Cost of Row-Level Sampling

100,000 pages
200 rows/page

68
Estimation for Aggregates

Point estimates
Easy SUM, COUNT, AVERAGE
Hard MAX, MIN, quantiles, distinct values
Confidence intervals a measure of precision
Two cases single-table and joins

69
Confidence Intervals
70
The Good and Bad News

Good news 1/n1/2 magic (n chosen on-the-fly)
Bad news needle-in-a-haystack problem

71
Running confidence intervals (1)

Confidence parameter p?(0,1) is prespecified
Display a precision parameter ?n such that
running aggregate Yn is within ? ?n of the final
answer µ with probability approximately equal to
p. Yn- ?n,Yn ?n contains µ with probability
approximately equal to p

72
Running confidence intervals (2)

Three types to construct from n retrieved
records
Conservative confidence intervals based on
Hoeffdings inequality or recent extention of
this inequality, for all n1
Large-sample confidence intervals based on
central limit theorems (CLTs), for n both small
and large enough
Deterministic confidence intervals contain µ with
probability 1, only for very large n

73
Running confidence intervals (3)

SELECT AVG(exp) FROM R
v(i) (1 ? i ? m) the value of exp when applied
to tuple i
Li the random index of the ith tuple retrieved
from R
a and b are a priori bounds a ? v(i) ? b for 1 ?
i ? m
Conservative confidence interval equations

74
Running confidence intervals (4)

Large-sample confidence interval equations
By central limit theorems (CLTs) , Ynapproaches
a normal distribution with a mean (m) and a
variance s2/n as n, the sample size, increases.
s2 can be replaced by the estimator Tn,2(v)

75
Estimators for SUM, COUNT and AVG
76
Confidence intervals for ripple joins

Use central limit theorems (CLTs) to compute
large-sample confidence intervals
Fix the problems in classic CLTs with newly
defined ?2 for different aggregate queries

77
Ripple optimizationchoosing aspect ratios (1)

Blocking factor ? is prespecified, we want to
optimize ?ks the aspect-ratio parameters
minimize

such that
?1?2 ?3 ...?K ?K-1?c (decided by animation speed)
1 ? ?k ? mk/ ? for 1 ? k ? K
?1,?2 ,?3 ,...?K interger

78
Precomputation Explicit

OLAP Data Cubes (drill-down hierarchies)
MOLAP, ROLAP, HOLAP
Semantic hierarchies
APPROXIMATE (Vrbsky, et al.)
Query Relaxation, e.g. CoBase
Multiresolution Data Models (Silberschatz/Reed/Fus
sell)
More general materialized views
See Gupta/Mumicks text

79
Precomputation Stat. Summaries

Histograms
Originally for aggregation queries, many flavors
Extended to enumeration queries recently
Multi-dimensional histograms
Parametric estimation
Wavelets and Fractals
Discrete cosine transform
Regression
Curve fitting and splines
Singular-Value Decomposition (aka LSI, PCA)
Indexes hierarchical histograms
Ranking and pseudo-ranking
Aokis use of GiSTs as estimators for ADTs
Data Mining
Clustering, classification, other
multidimensional models

80
Precomputed Samples

Materialized sample views
Olkens original work
Chaudhuri et al. join samples
Statistical inferences complicated over
recycled samples?
Barbarás quasi-cubes
AQUA join synopses on universal relation
Maintenance issues
AQUAs backing samples
Can use fancier/more efficient sampling
techniques
Stratified sampling or AQUAs congressional
samples
Haas and Swami AFV statistics
Combine precomputed outliers with on-the-fly
samples

Write a Comment

User Comments (0)