Title: Interactive Query Processing
1. Interactive Query Processing
- Vijayshankar Raman
- Computer Science Division
- University of California at Berkeley
2. Motivation: Nature of querying
- querying = extracting information from data sets
- different techniques in different settings
- intrinsically slow
- significant user-system interaction
- info. seekers work iteratively, gradually refining requests based on feedback [O'Day and Jeffries 93, Koenemann and Belkin 96]
3. Problems with traditional solutions
- mismatch between system functionality and iterative mode of HCI
- black-box functionality
- batch processing
- frustrating delays in the iterative process
Query: find the average grade of students in each college
4. Interactive processing
[diagram: query → process → exact answer]
- HCI requirements
- users must get continual feedback on results of processing
- allow users to control processing based on prior feedback
5. Interactive processing
[diagram: query → process → result estimates]
- HCI requirements
- users must get continual feedback on results of processing
- allow users to control processing based on prior feedback
- performance goals
- not to minimize time to give complete results
- give continually improving partial results
- adapt to dynamically specified performance goals
6. My Research
- interactive query processing
- interactive data cleaning (Potter's Wheel)
- assume dataflow through pipelined operators
- support for dynamic user control in traditional query proc. architectures
- adaptively generating partial results in response to user control
- still more aggressive adaptation -- state modules
[diagram: tuples flowing through pipelined operators R, S, T]
7. Talk Outline
- motivation and context
- support for dynamic user control in traditional query proc. architectures
- architecture for adaptively generating partial results
- policy for generating partial results
- user interface for displaying partial results
- impact on routing
- wrapup
8. Design goals in supporting user control
- make minimal changes to the system architecture
- must be independent of particular query processing algorithms
- no delay in processing
9. Online Reordering (Raman et al. 99, 00)
- users perceive data being processed over time
- prioritize processing for interesting tuples
- interest based on user-specified preferences
- reorder dataflow so that interesting tuples go first
- encapsulate reordering as a pipelined dataflow operator
[diagram: tuples from relations R, S, T reordered in the dataflow]
10. Context: an application of reordering
- online aggregation [Hellerstein/Haas/Wang 97, Haas/Hellerstein 99]
- for SQL aggregate queries, give gradually improving estimates, with confidence intervals
- allow users to speed up estimate refinement for groups of interest
- prioritize for processing at a per-group granularity
SELECT AVG(gpa) FROM students GROUP BY college
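The gradually improving per-group estimates can be sketched as follows; a minimal illustration assuming a normal-approximation confidence interval that shrinks as 1/√n (function names, groups, and the z-value are illustrative, not Telegraph's API):

```python
import math

def online_avg(stream, z=1.96):
    """Yield a refreshed (group, mean, half-width) estimate after each
    tuple is consumed; the half-width shrinks as 1/sqrt(n)."""
    stats = {}  # group -> (count, sum, sum of squares)
    for group, value in stream:
        n, s, ss = stats.get(group, (0, 0.0, 0.0))
        n, s, ss = n + 1, s + value, ss + value * value
        stats[group] = (n, s, ss)
        mean = s / n
        var = max(ss / n - mean * mean, 0.0)  # population variance estimate
        yield group, mean, z * math.sqrt(var / n)

updates = list(online_avg([("EECS", 3.5), ("EECS", 3.9), ("Law", 3.2)]))
```

Each consumed tuple refines only its own group's estimate, which is what makes per-group prioritization meaningful.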
11. Online Aggregation Screenshot
SELECT AVG(gpa) FROM students GROUP BY college
12. Framework for Online Reordering
[diagram: produce (network xfer.) emits "acddbadb..."; reorder, guided by user interest f(t), feeds "abcdabc.." to process]
- want no delay in processing
- in general, reordering can only be best-effort
- typically process/consume is slower than produce
- exploit this throughput difference to reorder
- two aspects
- mechanism for best-effort reordering
- reordering policy
13. Juggle: mechanism for reordering
[diagram: produce → prefetch → buffer → getNext → process/consume, with spool/enrich between the buffer and a side disk]
- two threads -- prefetch from the input, spool/enrich from an auxiliary side disk
- juggle data between the buffer and the side disk
- keep the buffer full of interesting items
- getNext chooses the best item currently in the buffer
- getNext and enrich/spool decisions are based on the reordering policy
- side-disk management
- hash index, populated in a way that postpones random I/O
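A minimal sketch of the juggle's data structures, assuming an in-memory list as the buffer and a dict of queues standing in for the side disk (all names illustrative; the real operator is threaded and does actual disk I/O):

```python
from collections import defaultdict, deque

class Juggle:
    """Best-effort reordering: keep a small buffer full of high-preference
    tuples, spool overflow to a 'side disk', enrich the buffer from it."""
    def __init__(self, preference, buffer_size=4):
        self.pref = preference               # group -> preference weight
        self.cap = buffer_size
        self.buffer = []                     # small in-memory buffer
        self.side_disk = defaultdict(deque)  # group -> spooled tuples

    def produce(self, tup):
        if len(self.buffer) < self.cap:
            self.buffer.append(tup)              # prefetch into buffer
        else:
            self.side_disk[tup[0]].append(tup)   # spool overflow by group

    def get_next(self):
        """Give the consumer the buffered tuple from the best group,
        then enrich the buffer from the side disk."""
        if not self.buffer:
            for g in sorted(self.side_disk,
                            key=lambda g: self.pref.get(g, 0), reverse=True):
                if self.side_disk[g]:
                    return self.side_disk[g].popleft()
            return None
        best = max(range(len(self.buffer)),
                   key=lambda i: self.pref.get(self.buffer[i][0], 0))
        tup = self.buffer.pop(best)
        # enrich: refill from the highest-preference group still on disk
        for g in sorted(self.side_disk,
                        key=lambda g: self.pref.get(g, 0), reverse=True):
            if self.side_disk[g]:
                self.buffer.append(self.side_disk[g].popleft())
                break
        return tup
```

Reordering is best-effort: a tuple already handed to the consumer cannot be recalled, so quality depends on how far the producer runs ahead of the consumer.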
14. Reordering policies
good permutation of items t_1...t_n to t'_1...t'_n
[graph: QOF rising over time toward GOAL]
- quality of feedback for a prefix t'_1, t'_2, ..., t'_k: QOF(UP(t'_1), UP(t'_2), ..., UP(t'_k)), where UP = user preference
- determined by the application
- goodness of reordering: dQOF/dt
- implication for the juggle mechanism
- process gets the item from the buffer that increases QOF the most
- juggle tries to maintain the buffer with such items
15. QOF in Online Aggregation
- avg weighted confidence interval
- preference acts as a weight on the confidence interval
- QOF ∝ -Σ_i UP_i / √n_i, where n_i = number of tuples processed from group i
- process pulls items from the group with max UP_i / (n_i √n_i)
- desired ratio of group i tuples in the buffer: UP_i^(2/3) / Σ_j UP_j^(2/3)
- juggle tries to maintain this by enrich/spool
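Under a QOF of the form -Σ_i UP_i/√n_i, the pull and buffer-allocation rules reduce to one-line formulas; a sketch with illustrative names (the marginal gain of one more group-i tuple is ~UP_i/n_i^(3/2)):

```python
def next_group(up, n):
    """Group whose next tuple most improves QOF = -sum_i UP_i/sqrt(n_i):
    pick the max marginal gain UP_i / n_i^(3/2)."""
    return max(up, key=lambda g: up[g] / (n[g] ** 1.5))

def buffer_ratios(up):
    """Desired buffer fraction per group: UP_i^(2/3) / sum_j UP_j^(2/3)."""
    w = {g: p ** (2 / 3) for g, p in up.items()}
    total = sum(w.values())
    return {g: w[g] / total for g in w}
```

The 2/3 exponent falls out of equalizing the marginal gains UP_i/n_i^(3/2) across groups for a fixed total number of processed tuples.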
16. Other QOF functions
- rate of processing (for a group) ∝ preference
- QOF ∝ -Σ_i (n_i - n·UP_i)² (variance from ideal proportions)
- process pulls items from the group with max (n·UP_i - n_i)
- desired ratio of group i tuples in the buffer: UP_i
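The rate-proportional policy can be sketched the same way (illustrative names; UP_i here are preferences normalized to sum to 1):

```python
def next_group_rate(up, n):
    """Pull from the group farthest behind its ideal share, max(n*UP_i - n_i),
    which greedily shrinks the variance sum_i (n_i - n*UP_i)^2."""
    total = sum(n.values())
    return max(up, key=lambda g: total * up[g] - n[g])
```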
17. Results: Reordering in Online Aggregation
- implemented in the Informix UDO server
- experiments with modified TPC-D queries
- questions
- how much throughput difference is needed for reordering?
- can reordering handle skewed data?
- one stress test: skew, very small processing cost
- index-only join
- 5 order priorities, Zipf distribution
SELECT AVG(o_totalprice), o_orderpriority
FROM order
WHERE EXISTS (SELECT * FROM lineitem WHERE l_orderkey = o_orderkey)
GROUP BY o_orderpriority
[diagram: index scan → juggle → process → consume]
18. Performance results
[graphs: confidence interval vs. time and tuples processed vs. time, per group A, C, E]
- 3 times faster for interesting groups
- 2% completion-time overhead
19. Overall Findings
- higher processing costs (index/hash join, subquery, ...)
- reordering easy
- very low processing costs
- juggle constrained by density of interesting tuples
- outlier groups hard to speed up
- better to use index stride [Hellerstein/Haas/Wang 97]
- needs a pre-computed index
- reordering becomes easier over time
- question to answer
- where to place the juggle?
20. Outline
- motivation and context
- support for dynamic user control in traditional query proc. architectures
- architecture for adaptively generating partial results
- policy for generating partial results
- user interface for displaying partial results
- impact on routing
- wrapup
21. Incremental results
- traditional architectures also generate continual result tuples
- arises from continual dataflow through pipelining operators
- much work on pipelining joins [Wilschut/Apers 91, Haas/Hellerstein 99, Ives et al. 99, Urhan/Franklin 00]
- this is too rigid
- especially in distributed environments
22. Context: Query processing in Telegraph
- Telegraph: adaptive dataflow system to query diverse, distributed sources
- much data available as services over the Internet
- currently only accessible by browse/search/forms
- semi-automatically combine this data
- examples
- campaign finance information (Election 2000): Federal Election Commission, Yahoo Home Prices, Census, APBnews Crime Ratings, Maps, IMDB
- restaurant information: SwitchBoard, Fodors, S.F. Chronicle, MapQuest, health inspection reports
24. Partial results
- complete result tuples are too rigid
- source latencies are high and diverse
- dynamic source variations, delays
- the query does not capture user desires
- non-expert and even expert users make queries too broad
- want to process queries flexibly
- give partial result tuples asap
- adapt dynamically to user preferences and source variations
25. Correctness of partial results
- some columns are essential for the UI
- e.g. group-by columns / sort-by columns
- for maximum flexibility -- outer-join semantics
- a good idea for Web sources
- or -- strict join semantics
- no partial results without ensuring a match exists
- key constraints helpful
- aggregates
- update early, and compensate later
- statistical guarantees for aggregates are difficult
- fanouts unknown (can be 0!)
- key constraints helpful
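The outer-join vs. strict-join choice can be illustrated with a toy helper (hypothetical restaurant data, not from the talk):

```python
def partial_results(left_tuple, matches, strict=False):
    """Outer-join vs. strict-join semantics for partial results (sketch):
    under outer-join semantics a tuple is shown before any match arrives,
    with the missing columns as None; under strict semantics it is held
    until a match is known to exist."""
    if matches:
        return [left_tuple + m for m in matches]
    return [] if strict else [left_tuple + (None,)]
```

Outer-join semantics lets slow Web sources fill in columns later; strict semantics never shows a row that might turn out to have no match.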
26. Dynamic query plans
[diagram: eddy routing tuples among operators R, S, T]
- Eddy [Avnur and Hellerstein 2000]
- router for directing data through modules
- minimize completion time by adaptively choosing join order
- all partial tuples generable
- limitations
- prioritized on modules, not data
- small memory buffer
27. Partial result generation
[diagram: eddy routing tuples from inputs R, S through modules, emitting partial results P]
- our focus: continual partial results
- Eddy must decide
- what tuple to route next
- where to route it
- based on user preferences and module properties
- need a routing policy and a reordering mechanism
- eddy memory buffer and module queues are bounded
28. Prioritizing tuples
[diagram: eddy routing inputs through modules, as before]
- enhance the Eddy with the Juggle
- reorder by exploiting the slowness of modules
- synergy -- the juggle-location problem is solved!
- juggle everywhere, to the extent of the throughput difference
29. Routing and reordering policy
[graph: QOF rising over time toward GOAL]
- GOAL: at any time, route to maximize dQOF/dt
- = benefit of sending a tuple to a module / cost
- cost: estimate data rates to/from the module
- benefit: dependent on application and UI
- how partial results impact the UI
- user preferences
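The routing rule reduces to a benefit/cost argmax over modules; a minimal sketch (module names and rates are illustrative):

```python
def route(benefit, cost):
    """Route the tuple to the module maximizing dQOF/dt, approximated as
    expected benefit of the resulting partial results divided by the
    module's per-tuple cost (estimated from observed data rates)."""
    return max(cost, key=lambda m: benefit[m] / cost[m])
```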
30. Outline
- motivation and context
- online reordering for user prioritization of partial results
- architecture for adaptively generating partial results
- policy for generating partial results
- Telegraph UI: displaying results and inferring preferences
- experimental results
- wrapup
31. Telegraph UI
32. Telegraph UI
- partial results displayed on screen as they arrive
- multi-resolution spreadsheet
- values clustered into groups
- can roll up or drill down to see different granularities
- different columns visible at different drilldown levels
- the client has a hash table mapping groups to values
- navigation
- vertical/horizontal scrolling
- rollup/drilldown
33. Getting user preferences
- infer from navigation
- row/column scrolling, group drilldown and rollup
- prioritize visible rows/columns
- query evolution
- subsets of one-size-fits-all queries
- future work: query expansion
- explicit
- up/down buttons on columns
- at the cell level -- needed for some expensive sources
- map to a QOF metric on partial results
34. QOF: Benefit of a partial result
- depends on user preferences
- benefit of updating a cell in the output
- = row priority × column weight × Δ cell resolution
- incremental cell resolution: how much one extra update adds to the cell's value
- scalars -- 1
- aggregations -- change in confidence interval
- informing the user about execution progress
- convey cell resolution in the UI
- future work
35. QOF: Benefit of Routing a Tuple
- route a tuple according to expected benefit and cost
- benefit of sending a tuple t to a module M and forming result set T = Σ_{c ∈ T} benefit of updating cell c
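The two formulas (per-cell benefit, then the sum over the cells a tuple would update) combine as follows; a sketch with hypothetical cell tuples:

```python
def cell_benefit(row_priority, col_weight, delta_resolution):
    """Benefit of updating one output cell:
    row priority x column weight x incremental cell resolution."""
    return row_priority * col_weight * delta_resolution

def routing_benefit(cells):
    """Benefit of sending tuple t to module M, forming result set T:
    the sum over the output cells c in T of the benefit of updating c."""
    return sum(cell_benefit(r, w, d) for r, w, d in cells)
```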
36. Throughput difference
- Bush Contributors ⋈ Income (index) ⋈ AFB Crime Ratings (index)
[graph: number of tuples read vs. time (s); the Bush Contributors scan reaches ~40000 tuples while the indexed Income and Crime Ratings sources trail far behind]
37. Benefit of giving partial results
- Bush Contributors ⋈ Income (index) ⋈ AFB Crime Ratings (index)
[graph: number of results vs. time (s); partial results far outnumber and precede complete results]
38. Effect of Delays
Bush Contributors ⋈ Income (index) ⋈ AFB Crime Ratings (index), with a delay in AFB Crime Ratings
[graph, log scale: during the delay, partial results keep arriving while complete results stall]
39. Prioritizing particular values
- SELECT AVG(Income) FROM Bush Donors, Census GROUP BY BushDonors.State
- scrolling: AZ, AR, CA, CO, CT, ... KY, LA, MA, MD, MI, ... TX, UT, VA, VI, VT
[graph: aggregate updates vs. time (s); updates concentrate on the currently visible states]
42. Distribution of contributions
[bar chart: number of Bush Contributors per state, ranging up to ~5000]
43. Prioritizing particular columns (1)
- how do we prioritize external sources?
- increase the number of threads used to probe a source
- resource scheduling rather than data scheduling
- if the eddy sends n_a tuples to A and n_b to B per second
- n_a ≤ threads_a / latency(A)
- n_b ≤ threads_b / latency(B)
- threads_a + threads_b ≤ number of threads per query
- GOAL: maximize n_a·ΔQOF(A) + n_b·ΔQOF(B)
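With this linear objective, the spare threads all go to the source with the best ΔQOF-to-latency ratio; a sketch (the minimum-per-source floor is an assumption added so no source starves entirely, not from the slide):

```python
def allocate_threads(sources, total_threads, min_per_source=1):
    """Thread allocation across external sources: maximize
    sum_s n_s * dQOF_s subject to n_s <= threads_s / latency_s and
    sum_s threads_s <= total_threads. Since the objective is linear,
    every spare thread goes to the source with max dQOF/latency.
    sources: {name: (latency_seconds, dQOF_per_tuple)}."""
    alloc = {s: min_per_source for s in sources}
    spare = total_threads - min_per_source * len(sources)
    best = max(sources, key=lambda s: sources[s][1] / sources[s][0])
    alloc[best] += spare
    return alloc
```

This is resource scheduling rather than data scheduling: throughput to a source is raised by giving it more probe threads, not by reordering its tuples.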
44. Prioritizing particular columns (2)
- with 3 threads for each source
[graph: number of tuples read vs. time (s); Income (~3000 tuples) far outpaces Crime Ratings (~300)]
45. Prioritizing particular columns (2)
- with 5 threads for Crime Ratings, 1 for Income
[graph: number of tuples read vs. time (s); Crime Ratings throughput improves relative to Income]
46. Prioritizing particular columns (3)
- more generally, optimize with multiple resource constraints
- netizenship
- future work
47. Outline
- motivation and context
- online reordering for user prioritization of partial results
- architecture for adaptively generating partial results
- policy for generating partial results
- more aggressive adaptation: state modules
- wrapup
48. Granularity of Query Operators
- relational operators are logical abstractions
- encapsulate multiple physical effects
- inflexible in handling unexpected changes
- cannot gracefully adapt
- access method / data source selection
- join algorithm selection
- resource allocation, e.g. memory
- delays (Query Scrambling, XJoin)
- want to encapsulate at the level of physical operators
49. State Modules
[diagram: Eddy routing R tuples through a hash join and an index join, emitting partial results P]
- isolate (and share) state in State Modules (SteMs)
- work sharing
- routing flexibility
- can devise constraints so that routing yields correct query execution
- can adapt access methods and join algorithms gracefully
- directly measure and adapt resource consumption
50. Outline
- motivation and context
- online reordering for user prioritization of partial results
- architecture for generating more aggressive partial results
- policy for generating partial results
- more aggressive adaptation: state modules
- wrapup
51. Related Work
- information retrieval
- ranked retrieval, relevance feedback
- search strategies, Berry Picking
- incremental query processing
- pipelining hash joins (Haas/Hellerstein 99, Ives et al. 99, Urhan/Franklin 00)
- top N / fast-first queries (Carey/Kossman 97, Antoshenkov/Ziauddin 96)
- adaptivity
- parametric query plans (Graefe/Cole 94)
- mid-query reoptimization (Urhan/Franklin/Amsaleg 98, Kabra/DeWitt 98, Ives et al. 99)
- competition (Antoshenkov/Ziauddin 96)
- miscellaneous
- precomputed summaries (OLAP, materialized views, AQUA)
- parachute queries (Bonnet/Tomasic 98)
- APPROXIMATE (Vrbsky/Liu 93)
52. Summary
- applications and query processors need tighter coupling
- online reordering
- an effective way of supporting dynamic user control
- partial results as desired
- embed reordering within a dynamically controlled dataflow
- hard to map user-interaction needs into concrete algorithm performance goals
- wanted: benchmarks based on user/application traces
53. Future Work
- session-granularity query processing
- query evolution
- lazy evaluation: user/client navigates through partial results
- helps thin clients
- closer DBMS-application interaction
- evaluation tools
- benchmarks driven by traces/user studies
- for more information
- http://telegraph.cs.berkeley.edu/
- http://control.cs.berkeley.edu/
54. HCI Motivation: Why Interactivity?
- berry picking (Bates 90, 93)
- user studies
- decision support (O'Day and Jeffries 93)
- relevance feedback (Koenemann and Belkin 96)
55. Disk data layout during juggling
- performance goal
- favor early results: optimize Phase 1, at the expense of Phase 2
- spool does sequential I/O, enrich does random I/O
- reordering in Phase 2 is much easier than in Phase 1
- enrich needs an approximate index on the side disk
- have a hash index on tuples, according to user interest
- done at group granularity
56. Other applications of reordering
- can add a reorder to any dataflow
- will later discuss an application for query systems that give aggressive partial results
- also useful in batch query processing
- sorting is often used in query plans for performance gains
- it can be replaced by best-effort reordering
- little performance hit, but the plan is now pipelining
- will not discuss further in this talk
- spreadsheets
57. Telegraph-Client API
[diagram: client sends a SQL query and preferences to Telegraph, which streams back partial results]
- preferences
- column priorities
- <predicate, row priority, row-col priorities>
- e.g. <1, 1.0, none>, <2, 1.7, none>, <3, 1.3, map:high>, <8/33014, 1.0, none>, <8/60610, 1.0, none>
58. Granularity of Reordering/Routing
- reorder and route tuples at the granularity of a group
- group = <base relations, predicate>
- groups created and deleted dynamically, as the user navigates in the UI
- a group predicate may be a subset of the application-specified row predicate
- the final row depends on the values this tuple joins with
59. Interactive query processing with non-pipelining operators
- basic techniques are independent of pipelined operators
- reordering can be used in general query plans, although its effectiveness may be hindered by blocking operators
- State Modules are applicable in general -- they help with adaptivity
- much work on making plan operators pipelining -- ripple join, XJoin, Tukwila
- reordering itself can be used to avoid blocking sorts in some cases
60. Query processing using SteMs (1)
- SteM operations -- works like an index on a table
- build(tuple): add the tuple to the SteM
- probe(tuple): find all matches among built values
- bounce the tuple back if all matches are not found and the probe tuple is not cached elsewhere
- performing joins by routing through SteMs
- index joins
- regular synchronous index joins are easy -- they have no state
- over distributed sources, lookups are cached by building into a SteM
- separating the cache allows the Eddy to distinguish access costs
- asynchronous index joins [GW00]
- helps for high-throughput Internet sources
- the SteM acts as a rendezvous buffer for pending probe results
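The build/probe surface of a SteM can be sketched as a hash index (illustrative names; the bounce-back path for pending asynchronous matches is elided):

```python
from collections import defaultdict

class SteM:
    """State Module sketch: a shared hash index over the tuples built
    so far. A full SteM also bounces a probe tuple back to the Eddy
    when its matches are not all available yet."""
    def __init__(self, key):
        self.key = key                  # function extracting the join key
        self.table = defaultdict(list)  # key -> built tuples

    def build(self, tup):
        """Add a tuple to the SteM."""
        self.table[self.key(tup)].append(tup)

    def probe(self, tup, probe_key):
        """Find all matches among built tuples, concatenated with tup."""
        return [tup + match for match in self.table[probe_key(tup)]]
```

Because the state lives in the SteM rather than inside a join operator, it can be shared across joins and across evolving queries.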
61. Query processing using SteMs (2)
- performing joins by routing through SteMs (contd.)
- doubly pipelined hash join
- use two SteMs, one on each source
- can also do extensions (e.g. Tukwila [IFF99]), and even non-pipelined joins, using appropriate SteMs
- thus can simulate any static query plan by appropriate routing
[diagram: Eddy routing R and S tuples for builds and probes into SteM_R and SteM_S]
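Routing each arriving tuple to build into its own relation's SteM and probe the other's yields the doubly pipelined (symmetric) hash join; a self-contained sketch with the two SteMs inlined as hash tables:

```python
from collections import defaultdict

def symmetric_hash_join(stream):
    """Doubly pipelined hash join by routing: each arriving tuple is
    built into its own relation's SteM and probed against the other's,
    so joined results stream out as soon as both halves have arrived.
    stream yields (relation, key, payload) with relation in {'R', 'S'}."""
    stems = {"R": defaultdict(list), "S": defaultdict(list)}
    out = []
    for rel, key, payload in stream:
        other = "S" if rel == "R" else "R"
        stems[rel][key].append(payload)      # build into own SteM
        for match in stems[other][key]:      # probe the other SteM
            pair = (payload, match) if rel == "R" else (match, payload)
            out.append((key,) + pair)
    return out
```

No blocking build phase exists: results appear incrementally regardless of the interleaving of the two inputs, which is what makes the routing order a free choice for the Eddy.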
62. Query processing using SteMs (3)
- how to do this in general?
- don't want the Eddy to be a super-join
- want it to route as per user desires and source properties, independent of any join algorithm
- routing constraint
- each tuple must have a non-zero probability of being sent to every appropriate module
- can avoid repeated looping by marking tuples
- Theorem: any routing that satisfies the above constraint will produce all query results and terminate
- could get duplicates
- prune at the application, or
- enforce atomicity in tuple routing
63. Benefits of SteMs
- adaptivity
- join algorithm and access path selection are determined by eddy routing
- hence the Eddy can choose these dynamically based on source properties
- in fact, the Eddy can do hybrid join algorithms
- adapt to dynamic delays, etc.
- work sharing -- for query evolution, competition
- better information to the Eddy
- speeds, memory consumption
64. Lazy Evaluation
- the user navigates through partial results
- can see only a few at a time
- exploit this to delay processing
- do it on demand, on the region the user is currently navigating over
- things to delay
- expensive source lookups (e.g. map)
- joins (e.g. )
- any other formatting etc.
65. Giving probabilistic results
- view a partial result as something that throws light on the outer-join space
- positive result: probabilistic
- negative result: deterministic
- benefit of a positive result
- weighted sum of the cells it is likely to show
- penalty for a false result
- repudiation of earlier false results
- benefit of a negative result
- direct benefit unlikely
- repudiation of earlier false results
- complications
- many tuples may contribute to a single output tuple -- aggregation
66. Probabilistic partial results
- partial result ⇒ may not have applied all filters
- still want to show the result, probabilistically
- filters are ill-specified
- data on the Internet is often inconsistent
- show all partial results: full disjunction (Galindo-Legaria 94)
- positive and negative results
[diagram: output space within the outer-join space]
67. What does the user see?
- the user navigates through partial results
- with complete-row partial results
- can either explore these results in detail (sort/scroll along columns as in a spreadsheet)
- or compute approximate aggregates on them in an online fashion
- errors shrinking as more results come in
- with probabilistic, incomplete results?
- easy if the primary key is in the partial result
- what to show otherwise?
68. Quantifying the benefit of partial results
- a partial result sheds light on the outer-join space
- positive result: probabilistic
- negative result: deterministic
- benefit of a positive result
- weighted sum of the cells it is likely to show
- penalty for a false result?
- repudiation of earlier false results?
- benefit of a negative result
- direct benefit unlikely
- repudiation of earlier false results?
69. Why probabilistic results are good
- data on the Internet is often inconsistent
- hence even exact processing cannot give perfect answers
- people are used to sloppy answers
- can allow negations in expert interfaces only
70. Data growth vs. Computer Speedup
- Moore's Law -- # of transistors/chip doubles every 18 months (1965)
- data growth
Source: J. Porter, Disk/Trend, Inc. (http://www.disktrend.com/pdf/portrpkg.pdf)
71. Disk Appetite, contd.
- Greg Papadopoulos, CTO, Sun
- disk sales doubling every 9 months
- similar results from the Winter VLDB survey
- time to process all your data doubles every 18 months!
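The arithmetic behind the punchline can be checked directly: with data doubling every 9 months and compute speed every 18 months, their ratio, the time to process everything, doubles every 18 months.

```python
def growth_factor(months, doubling_period_months):
    """Capacity or speed multiplier after `months`, given its doubling period."""
    return 2 ** (months / doubling_period_months)

# Over 18 months: data quadruples (9-month doubling) while CPU speed only
# doubles (Moore's law, 18-month doubling) -- so processing time doubles.
data = growth_factor(18, 9)    # 4.0
cpu = growth_factor(18, 18)    # 2.0
```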