Title: Eddies: Continuously Adaptive Query processing
1Eddies Continuously Adaptive Query processing
- R. Avnur, J.M. Hellerstein
- UCB
- ACM Sigmod 2000
2Problem Statement
- Context large federated and shared-nothing
databases - Problem assumptions made at query optimization
rarely hold during execution - Hypothesis do away with traditional optimizers,
solve it thru adaptation - Focus scheduling in a tuple-based pipeline query
execution model
3Problem Statement Refinement
- Large scale systems are unpredictable, because
- Hardware and workload complexity,
- bursty servers networks, heterogenity, hardware
characteristics - Data complexity,
- Federated database often come without proper
statistical summaries - User Interface Complexity
- Online aggregation may involve user control
4Research Laboratory setting
- Telegraph, a system designed to query all data
available online - River, a low level distributed record management
system for shared-nothing databases - Eddies, a scheduler for dispatching work over
operators in a query graph
5The Idea
- Relational algebra operators consume a stream
from multiple sources to produce a new stream - A priori you dont now how selective- how fast-
tuples are consumed/produced - You have to adapt continuously and learn this
information on the fly - Adapt the order of processing based on these
lessons
6The Idea
next
next
next
next
next
next
7The Idea
- Standard method derive a spanning tree over the
query graph - Pre-optimize a query plan to determine operator
pairs and their algorithm, e.g. to exploit access
paths - Re-optimization a query pipeline on the fly
requires careful state management, coupled with - Synchronization barriers
- Operators have widely differing arrival rates for
their operands - This limits concurrency, e.g. merge-join
algorithm - Moments of symmetry
- Algorithm provides option to exchange the role of
the operands without too much complications - E.g switching the role of R and S in a
nested-loop join
8Nested-loop
R
s
9Join and sorting
- Index-joins are asymmetric, you can not easily
change their role - Combine index-join operands as a unit in the
process - Sorting requires look-ahead
- Merge-joins are combined into unit
- Ripple joins
- Break the space into smaller pieces and solve the
join operation for each piece individually - The piece crossings are moments of symmetry
10The Idea
next
next
next
next
11Rivers and Eddies
- Eddies are tuple routers that distribute arriving
tuples to interested operators - What are efficient scheduling policies?
- Fixed strategy? Random ? Learning?
- Static Eddies
- Delivery of tuples to operators can be hardwired
in the Eddie to reflect a traditional query
execution plan - Naïve Eddie
- Operators are delivered tuples based on a
priority queue - Intermediate results get highest priority to
avoid buffer congestion
12Observations for selections
- Extended priority queue for the operators
- Receiving a tuple leads to a credit increment
- Returning a tuple leads to a credit decrement
- Priority is determined by weighted lottery
- Naïve Eddies exhibit back pressure in the tuple
flow production is limited by the rate of
consumption at the output - Lottery Eddies approach the cost of optimal
ordering, without a need to a priory determine
the order - Lottery Eddies outperform heuristics
- Hash-use first, or Index-use first, Naive
13Observations
- The dynamics during a run can be controlled by a
learning scheme - Split the processing in steps (windows) to
re-adjust the weight during tuple delivery - Initial delays can not be handled efficiently
- Research challenges
- Better learning algorithms to adjust flow
- Aggressive adjustments
- Remove pre-optimization
- Balance hostile parallel environment
- Deploy eddies to control degree of partitioning
(and replication)
14Database streams You only get one chance to look
- Prof. Dr. Martin Kersten
- CWI
- Amsterdam
- March 2003
15Database research topic list
- Indexing, Access methods, data structures
- Query/transaction processing and optimization
- Distributed, heterogeneous, mobile databases
- View maintenance/materialisation
- Mining data, text, and web
- Semi-structured data, metadata and XML
- Temporal, Spatial, Scientific, Statistical,
Biological DB - Data warehousing and OLAP
- Middleware, Workflow and Security
HOT XML, Semantic Web, P2P, Streams, Biological
16Outline
- Introduction to Data Streaming Management System
(DSMS) - A reference architecture for a DSMS
- Grouping thousands of user queries
- Merging and abstraction of streams
- Conclusions
17The tranquil database scene
- Traditional DBMS data stored in finite,
persistent data sets, SQL-based applications to
manage and access it
OLTP-web application
Ad-hoc reporting
Data entry application
RDBMS
18The tranquil database scene
- The user community grows and MANY wants
up-to-the-second (aggregate) information from the
database
OLTP-web application
Ad-hoc reporting
Data entry application
RDBMS
19The tranquil database scene
- Database entry is taken over by a remote device
which issues a high-volume of update transactions
OLTP-web application
Ad-hoc reporting
Data entry application
Dataentry application
RDBMS
20The tranquil database scene
- Database entry is taken over by MANY remote
devices which issues a high-volume of update
transactions
OLTP-web application
Adhoc reporting
Dataentry application
Dataentry application
RDBMS
21The tranquil database scene
- Database solutions can not carry the weight
OLTP-web application
Adhoc reporting
Dataentry application
Dataentry application
RDBMS
22Application domains
- Personalized financial tickers
- Personalized information delivery
- Personalized environment control
- Business to business middelware
- Web-services application based on XML exchange
- Monitoring the real-world environment (pollution,
traffic) - Monitoring the data flow in an ISP
- Monitoring web-traffic behaviour
- Monitoring the load on a telecom switch
- Monitoring external news-feeds
23Application vision
- Re-define the role of a DBMS in the complete
application support line - It manages a persistent store
- It handles and coordinates updates
- It supports ad-hoc querying
- Application servers carry the load
- J2EE, JBOS, Websphere,BEA,.
- Or partly redesign the DBMS
24Application domains
- Personalized financial tickers
- Personalized information delivery
- Personalized environment control
- Business to business middelware
- Web-services application based on XML exchange
- Monitoring the real-world environment (pollution,
traffic) - Monitoring the data flow in an ISP
- Monitoring web-traffic behaviour
- Monitoring the load on a telecom switch
- Monitoring external news-feeds
25Application domains
- Personalized
- Personalized
- Personalized
-
middelware - on XML
exchange - Monitoring
- Monitoring
- Monitoring
- Monitoring
- Monitoring
QUERYING
WEB SERVICES
STREAM UPDATE
26Continuous queries
- Continous query the user observes the changes
made to the database through a query - Query registration once
- Continously up-to-date answers.
Continuous queries
RDBMS
27Data Streams
- Data streams
- The database is in constant bulk load mode
- The update rate is often non-uniform
- The entries are time-stamped
- The source could be web-service, sensor, wrapped
source
Dataentry application
DSMS
28DSMS
- Data Stream Management Systems (DSMS) support
- high volume update streams and real-time response
- to ad-hoc complex queries.
- What can be salvaged from the DBMS core
technology ? - What should be re-designed from scratch ?
Dataentry application
DSMS
29DBMS versus DSMS
- Persistent relations
- Transaction oriented
- One-time queries
- Precise query answering
- Access plan determines physical database design
- Transient streams
- Query orientation
- Continuous queries
- Best-effort query answering
- Unpredictable data characteristics
30Old technology to rescue?
- Many stream based applications are low-volume
with simple queries - Thus we can live with automatic query refresh
- Triggers are available for notification of
changes - They are hooked up to simple changes to the
datastore - There is no technology to merge/optimize trigger
groups
31Outline of remainder
- Query processing over multiple streams
- Organizing hundreds of ad-hoc queries
- Sensor-network based querying
32A stream application
- Widom Consider a network traffic system for an
ISP - with customer link and backbone link and two
streams - keeping track of the IP traffic
33A stream application
- Widom Consider a network traffic system for an
ISP - with customer link and backbone link and two
streams - keeping track of the IP traffic
- TPc(saddr, daddr, id, length, timestamp)
- TPb(saddr, daddr, id, length, timestamp)
PTc
PTb
34A stream application
- Q1 Compute the load on the backbone link averaged
over one minute period and notify the operator
when the load exceeds a threshold T - Select notifyoperator(sum(length))
- From PTb
- Group By getminute(timestamp)
- Having sum(length) gtT
- With low stream flow it could be handled with a
DBMS trigger, - Otherwise sample the stream to get an approximate
answer
35A stream application
- Q2 Find the fraction of traffic on the backbone
link coming from the customer network to check
cause of congestion. - ( Select count()
- From PTc as C, PTb as B
- Where C.saddr B.saddr and C.daddrB.daddr
- and C.idB.id ) /
- ( Select count() From PTb)
- Both streams might require an unbounded resource
to perform the join, which could be avoided with
an approximate answer and synopsis
36A stream application
- Q3 Monitor the 5 source-to-destination pairs in
terms of traffic on the backbone. - With Load As (Select saddr, daddr,sum(length) as
traffic - From PTb Group By saddr,daddr)
- Select saddr, daddr, traffic
- From Load as l1
- Where (Select count() From Load as l2
- Where l2.traffic ltl1.traffic) gt
- (Select 0.95count() From Load)
- Order By Traffic
-
- This query contains blocking operators
37STREAM architecture
Answer
38- Q1 Compute the load on the backbone link averaged
over one minute period and notify the operator
when the load exceeds a threshold T - Select notifyoperator(sum(length))
- From PTb
- Group By getminute(timestamp)
- Having sum(length) gtT
The answer store area simply needs an integer
39- Q2 Find the fraction of traffic on the backbone
link coming from the customer network to check
cause of congestion. - ( Select count()
- From PTc as C, PTb as B
- Where C.saddr B.saddr and C.daddrB.daddr
- and C.idB.id ) /
- ( Select count() From PTb)
The scratch area should maintain part of the two
streams to implement the join. Or a complete
list of saddr and daddr.
40Joining two tables
RelA
Nested loop join
RelB
41Joining two tables
RelA
Nested loop join
RelB
42Joining two tables
RelA
Nested loop join
RelB
43Joining two stream
..
PTa
Nested loop join
PTb
..
An unbounded store would be required
44Joining two stream
..
PTa
merge join
PTb
..
If the streams are ordered a simple merge join is
possible With limited resource requirements
45Joining two stream
window
..
PTa
histogram
Join synopsis
histogram
PTb
..
A statistical summary could provide an
approximate answer
46- Q3 Monitor the 5 source-to-destination pairs in
terms of traffic on the backbone. - With Load As (Select saddr, daddr,sum(length) as
traffic - From PTb Group By saddr,daddr)
- Select saddr, daddr, traffic
- From Load as l1
- Where (Select count() From Load as l2
- Where l2.traffic ltl1.traffic) gt
- (Select 0.95count() From Load)
- Order By Traffic
-
- The scratch area should maintain part of the two
streams to - implement the join.
47Finance
- DeWitt Consider a financial feed where
thousands of clients can register arbitrary
complex continues queries. - XML stream querying
XML
48Finance
- Q5 Notify me whenever the price of KPN stock
drops below 6 euro - Select notifyUser(name, price)
- From ticker t1
- Where t1.name KPN and t1.price lt 6
49Finance
- Q5 Notify me whenever the price of KPN stock
drops by 5 over the last hour - Select notifyUser(name, price)
- From ticker t1,t2
- Where t1.name KPN and t2.name t1.name
- and getminutes(t1.timestamp-t2.timestamp) lt60
- and t1.price lt 0.95 t2.price
50Finance
- Q6 Notify me whenever the price of KPN stock
drops by 5 over the last hour and T-mobile
remains constant - Select notifyUser(name, price)
- From ticker t1,t2, t3,t4
- Where t1.name KPN and t2.name t1.name
- and getminutes(t1.timestamp-t2.timestamp) lt60
- and t1.price lt 0.95 t2.price
- and t1.timestampt3.timestamp and
t2.timestampt4.timestamp - and t3.name T-Mobile and t4.name t3.name
- and getminutes(t3.timestamp-t4.timestamp) lt60
- and t3.price t4.price
51Query signatures
- Traditional SQL applications already use the
notion of parameterised queries, I.e. some
constants are replaced by a program variable. - Subsequent calls use the same query evaluation
plan - In a DSMS we should recognize such queries as
quick as possible - Organize similar queries into a group
- Decompose complex queries into smaller queries
- Manage the amount of intermediate store
52Finance
- Queries can be organized in groups using a
signature and evaluation can be replaced by
single multi-user request.
Select notifyUser(name, price) From ticker
t1 Where t1.name KPN and t1.price lt 6
Client Name Threshold Price
192.871.12.1 KPN 6
192.777.021 ING 12
53Finance
- Queries can be organized in groups using a
signature and evalution can be replaced by single
multi-user request.
Select notifyUser(c.client, t1.name,
t1.price) From ticker t1, clients c Where t1.name
c.name and t1.price lt c.price
Client Name Threshold Price
192.871.12.1 KPN 6
192.777.021 ING 12
54Finance
- Timer-based queries call for a stream window with
incremental evaluation - Multiple requests can be organized by time-table
and event detection methods provided by database
triggers. - Select notifyUser(name, price)
- From ticker t1,t2
- Where t1.name KPN and t2.name t1.name
- and getminutes(t1.timestamp-t2.timestamp) lt60
- and t1.price lt 0.95 t2.price
55Finance
- Complex queries can be broken down into
independent components - Select notifyUser(name, price)
- From ticker t1,t2, t3,t4
- Where t1.name KPN and t2.name t1.name
- and getminutes(t1.timestamp-t2.timestamp) lt60
- and t1.price lt 0.95 t2.price
- and t1.timestampt3.timestamp and
t2.timestampt4.timestamp - and t3.name T-Mobile and t4.name t3.name
- and getminutes(t3.timestamp-t4.timestamp) lt60
- and t3.price t4.price
56Finance
- Intermediate results should be materialized. Can
be integrated in tradition query evaluation
schemes
t1.timestampt3.timestamp and t2.timestampt4.time
stamp
57Sensor networks
- Madden Sensor networks are composed of
thousands of small devices, interconnected
through radio links. This network can be queried. - Sensors have limited energy
- Sensors have limited reachability
- Sensors can be crushed
58Aggregate Queries Over Ad-Hoc Wireless Sensor
Networks
59Sensor networks
- Q7 Give me the traffic density on the A1 for the
last hour - Select avg(t.car)
- From traffic t
- Where t.segment in (Select segment From roads
- Where name A1)
- Group By gethour(t.timestamp)
60Sensor networks
- The sensors should organize themselves into a P2P
infrastructure - An aggregate query is broadcasted through the
network - Each Mote calculates a partial answer and sent it
to its peers - Peers aggregate the information to produce the
final answer. - Problems
- The energy to broadcast some information is high
- Tuples and partial results may be dropped
61Conclusions and outlook
- Data stream management technology require changes
in our expectation of a DBMS functionality - Queries not necessarily provide a precise answer
- Queries continue as long as we are interested in
their approximate result - The persistent store not necessarily contains a
consistent and timeless view on the state of the
database.
62Conclusions and outlook
- Datastream management technology capitalizes upon
proven DBMS technology - DSMS provide a basis for ambient home settings,
sensor networks, and globe spanning information
systems - It is realistic to expect that some of the
properties to support efficient datastream
management will become part of the major products - Multi query optimization techniques should be
added.
63Literature
- NiagaraCQ A Scalable Contious Query System for
Internet Databases, J. Chen, D.J. deWitt, F.
Tian, Y. Wang, Wisconsin Univ. - Streaming Queries over Streaming Data , Sirish
Chandrasekaran, Michael J. Franklin, Univ
Berkeley - Continous Queries over Data Streams, S.Babu, J.
Widom, Stanford University