Title: Continuous%20Query%20Languages%20for%20DSMS
1Continuous Query Languages for DSMS
- CS240B Notes
- by
- Carlo Zaniolo
2DSMS Languages for Continuous Queries
- Relational Algebra Operators
- SQL and User-Defined aggregates
- The Blocking problem
- The expressive Power problem
- XML streams and their query languages.
3CQLs for DSMS
- Most of DSMS projects use SQL for continuous
queriesfor good reasons, since - Many applications span data streams and DB tables
- A CQL based on SQL will be easier to learn use
- Moreover the fewer the differences the better!
- But DSMS were designed for persistent data and
transient queries---not for persistent queries on
transient data - Adaptation of SQL and its enabling technology
presents many research challenges - Lack of expressive powereven worse now since
only nonblocking operators are allowed.
4Continuous Query Graph many componentsarbitrary
DAGs
?
Source1
U
Sink
Source2
s
5Relational Algebra Operators
- Stored data
- Selection, Projection
- Union
- Join (including X) on tables
- Set Difference
- Aggregates
- Traditional Blocking aggregates
- OLAP functions on windows or unlimited preceding
-
- Data Streams
- ... same
- Union by Sort-Merging on timestamps
- Join of Stream with table
- Window joins on streams (timestamps merged into 1
column) - No stream difference (blockingdiff of stream
with table OK). - Aggregates
- No blocking aggregate
- OLAP functions on windows or unlimited preceding
- Slides, and tumbles.
-
6Bolts and Nuts
- create stream bids(bid, item, offer, Time)
- create stream mybids as (select bid, offer, Time
- from bids
- where itembolt
- union
- select bid, offer,
Time - from bids
- where itemnut)
- Result same as select bid, offer, Time where
- item bolt or itemnut
7Joins
- We could create a stream called interesting bids
by say joining bids with the interesting_items
table. - We next find the bolt bids for which there was a
nut bid offered in the last 5 minutes for the
same price. - create stream selfjoinbids as
- (select S1.bid, S1.offer, S2.bid,
S2.Time - from bids as S1, bids as S2 window of
5 minutes - where S1.itembolt and S2.itemnut and
- S1.offerS2.offer)
- The window condition implies that S1.Time gt
S2.Time and S2.Time gt S1.Time-5 minutes. - Windows on both streams are used very often.
-
-
\
8Processing Unions
- Union When tuples are present at all inputs,
select one with minimal timestamp and - Production add this tuple to the output, and
- Consumption remove it from the input.
9Window Joins
A
B
- Window Join of Stream A and Stream BWhen tuples
are present at both inputs, and the timestamp of
A is less or equal than that of B, then perform
the following operations (symmetric operations
are performed if timestamp of B is less or equal
than that of A) - Production compute the join of the tuple in A
with the tuples in W(B) and add the resulting
tuples to output buffer (these tuple have the
same timestamp a the tuple in A) - Consumption the current tuple in A is removed
from the input and added to the window buffer
W(A) (from which the expired tuples are also
removed)
10Relational Algebra Operators
- Stored data
- Selection, Projection
- Union
- Join (including X) on tables
- Set Difference
- Aggregates
- Traditional Blocking aggregates
- OLAP functions on windows or unlimited preceding
-
- Data Streams
- ... same
- Union by Sort-Merging on timestamps
- Join of Stream with table
- Window joins on streams (timestamps merged into 1
column) - No stream difference (blockingdiff of stream
with table OK). - Aggregates
- No blocking aggregate
- OLAP functions on windows or unlimited preceding
- Slides, and tumbles.
- Including UDAs
11User-Defined AggregatesMax Power via Min SQL
Extensions
- Windows (logical, physical, slides, tumbles,)
flexible synopses that solve the blocking problem
for aggregates - DSMS only support these constructs on built-in
aggregates - ESL is the first to support the complete
integration of these two - User Defined Aggregates (UDAs) the key to power
and extensibility, and - And thus can support data mining,
- XML,
- sequences not supported by other DSMS
- One framework for aggregates and windows, whether
they are built-ins or user-defined, and
independent on the language used to define
them.
12Defining Traditional Aggregates
- Specification consists of 3 blocks of code---
Written in an external PL (as DBMS and other DSMS
do), or - In SQL itself (SQL becomesTuring Complete!)
- INITIALIZE
- Executed upon the arrival of the first tuple
- ITERATE
- Executed upon the arrival of each subsequent
tuples (an incremental computation suitable for
streams) - TERMINATE
- Executed after the end of the relation/stream has
been reached - Invocation SELECT myavg(start_price) FROM
OpenAuction
13The UDA AVG in SQL
- AGGREGATE avg(Next Int) Real
- TABLE state(tsum Int, cnt Int)
- INITIALIZE
- INSERT INTO state VALUES (Next, 1)
-
- ITERATE
- UPDATE state
- SET tsumtsumNext, cntcnt1
-
- TERMINATE
- INSERT INTO RETURN
- SELECT tsum/cnt FROM state
-
-
- INSERT INTO RETURN in TERMINATE ? a blocking
UDA
14NonBlocking UDA AVG of last 200 Values
- AGGREGATE myavg(Next Int) Real
- TABLE state(tsum Int, cnt Int)
- INITIALIZE
- INSERT INTO state VALUES (Next, 1)
-
- ITERATE
- UPDATE state SET tsumtsumNext, cntcnt1
- INSERT INTO RETURN
- SELECT tsum/cnt FROM state
- WHERE cnt 200 0
- UPDATE state SET tsumNext, cnt1
- WHERE cnt 200 1
-
- TERMINATE
-
- Empty TERMINATE Denotes a non-blocking UDA
15UDAs in ESL
- In ESL user-defined Aggregates (UDAs) can be
defined directly in SQL, rather than in a PL - Native extensibility in SQL via UDAs (which can
also be defined in a PL for better performance) - No impedance mismatch
- Access to DB tables from UDAs
- Data Independence and optimization
- Good ease of use and performance
- Turing completeness nb-completeness.
16Data Intensive Applications UDAs
- Complex Applications can expressed concisely,
with good performance - ATLAS a single-user DBMS developed at UCLA.
- Support for SQL with UDAs
- On top of Berkeley-DB record manager.
- Data Mining Algorithms in ATLAS
- Decision Tree Classifiers 18 lines of codes
- APriori 40 lines of codes
- Modest overhead lt50 w.r.t procedural UDA
- Data Stream Applications in ESL
- Data Stream Mining, approximate aggregates,
sketches, histograms,
17SQL2003 OLAP FunctionsAggregates on Windows
CREATE STREAM ClosedAuction (/auction closings
/ itemID, /id of
the item in this auction./ buyerID
/buyer of this item./) Final price real
/final price of the item /,
Current_time) order by source
Auctions
- For each seller, show the average selling price
over the last 10 items sold (physical window)
CREATE STREAM LastTenAvg SELECT
sellerID, AVG(price) OVER(PARTITION BY sellerID
ROWS 9 PRECEDING), Current_time FROM
ClosedPrice
18Optimizing Window AVG in ESL
- For each expired tuple decrease the count by one
and the sum by the expired valueworks for
logical physical windows
- WINDOW AGGREGATE avg(Next Real) Real TABLE
state(tsum Int, cnt Real) TABLE
inwindow(wnext Real) - INITIALIZE INSERT INTO state
VALUES (Next, 1) - ITERATE UPDATE state SET tsumtsumNext,
cntcnt1 INSERT INTO RETURN SELECT
tsum/cnt FROM state - EXPIRE /if there are expired tuples, take the
oldest / UPDATE state SET cnt
cnt-1, tsum tsum (select wnext
FROM inwindow
WHERE oldest(inwindow))
19MAX
- System maintains inwindow
- Remove dominated (less older) values
- The oldest is always the max.
WINDOW AGGREGATE max (Next Real) Real TABLE
inwindow(wnext real) INITIALIZE etc.
/system adds new tuples to inwindow/
ITERATE DELETE FROM inwindow WHERE
wnext ltNext INSERT INTO RETURN
SELECT wnext FROM inwindow
WHERE oldest(inwindow)
EXPIRE /expired tuples removed
automatically/
20For Each Aggregate two versions
- The traditional Base aggregate with terminate
- The Window aggregate with inwindow and expire.
- These definitions will take care of both logical
and physical windows. - But there are more complications slides and
- tumbles.
21Slides and Tumbles
- Every two minutes, show the average selling
price over the last 10 minutes (logical window)
CREATE STREAM LastTenAvg SELECT sellerID,
max(price) OVER(RANGE 10 MINUTE PRECEDING
SLIDE 2
MINUTE), Current_time FROM ClosedPrice
Here the window is W10 and the slide is S2.
Tumble When S W
22SLIDEs
- The slide constructs divides a window into panes,
results only returned at the end of each pane - Slide is conducive to optimization.
- Combine summaries into the desired aggregation
- E.g. MAX(1, 2, 3, 4) MAX(MAX(1,2), MAX(3,4))
4 - I.e., for MAX, we can perform MAX on subsets of
numbers as local summaries, then combine them
together to get the true MAX - Proposed before but what constructs should be
used to integrate these concepts into the
language?
23Slides Tumbles--Examples
- Tumble where the SLIDE size is equal or larger
than the window size - E.g. Once every 50 tuples, compute and return
average over the last 10 tuples - Easy to optimize
- Skip the first 40 tuples of every 50 tuples, and
compute the blocking base version of the
aggregate on the last 10 - Slide where slide size is smaller than the
window size - E.g. Once every 10 tuples, compute and return
average over the last 50 tuples - Naïve implementation--not optimized
- Perform incremental maintenance on every incoming
tuple - Ignore RETURN statements for most incoming tuples
- Only invoke RETURN once every 10 tuples
24Pane-based SLIDE Optimization
- Two-level cascading aggregates using two existing
aggregates - Perform sub-aggregation inside each pane using
the base aggregate No need for incremental
maintenance here - Computed with a blocking aggregate once for each
pane - Combine the summary tuples using the window
aggregate that returns on every incoming tuple
(non-blocking) - With incremental maintenance here
- At any time, only the last un-finished pane needs
to store data tuples - all finished panes are reduced to one reusable
summary tuple
25Pane-based SLIDE optimization
- Example SUM with window size 50 tuples, and
slide size 10 tuples - First create a stream of summary tuples using
base aggregate - CREATE STREAM temp AS (
- SELECT itemID, base_max(sale_price) as s
- OVER(PARTITION BY itemID ROWS 9 PRECEDING
SLIDE 10) - FROM Auction)
- Then apply the window version of the aggregate
- SELECT itemID, window_max(s)
- OVER(PARTITION BY itemID ROWS 4 PRECEDING)
- FROM temp
- This simple approach can be used to implement
very complex aggregations (e.g. ensemble
classifiers) - Applies uniformly to logical/physical windows
defined in SQL or in an external language
26Summary
- Logical, Physical x tumble, slide,
unlimited_preceding - Six different types of calls, supported by two
definitions - Both SQL or procedural languages can be used in
the definition.
27Window UDAs vs. Base UDAs
- Base UDAs
- called as traditional SQL-2 aggregates, with
- optional GROUP BY
- Window UDAs
- called with SQL2003 OVER clause
- logical or physical windows
- optional PARTITION BY and SLIDE clauses in ESL
- Clear semantics and optimization rules unify
- UDAsSQL or PL-defined, algebraic or not
- window (logical physical), slice, tumbles,
etc. - System and user roles in optimization.
28Window UDAs Physical Optimization
- The Stream Mill System provides efficient support
for - Management of new expiring tuples in buffer
- Main memory intelligent paging into disk
- Events caused by tuple expiration
- Users can access the buffer as the table called
inwindow
29Conclusion
- Language Technology
- ESL a very powerful language for data stream and
DB applications - Simple semantics and unified syntax conforming
to SQL2003 standards - Strong case for the DB-oriented approach to data
streams - System Technology
- Some performance-oriented techniques
well-developede.g., buffer management for
windows - For others work is still in progressstay tuned
for latest news - Stream Mill is up and running http//wis.cs.ucla.
edu/stream-mill
30 31References
- 1ATLaS user manual. http//wis.cs.ucla.edu/atlas
. - 2SQL/LPP A Time Series Extension of SQL Based
on Limited Patience Patterns, volume 1677 of
Lecture Notes in Computer Science. Springer,
1999. - 4A. Arasu, S. Babu, and J. Widom. An abstract
semantics and concrete language for continuous
queries over streams and relations. Technical
report, Stanford University, 2002. - 5B. Babcock, S. Babu, M. Datar, R. Motwani, and
J. Widom. Models and issues in data stream
systems. In PODS, 2002. - 9D. Carney, U. Cetintemel, M. Cherniack, C.
Convey, S. Lee, G. Seidman, M. Stonebraker, N.
Tatbul, and S. Zdonik. Monitoring streams - a new
class of data management applications. In VLDB,
Hong Kong, China, 2002. - 10J. Celko. SQL for Smarties, chapter Advanced
SQL Programming. Morgan Kaufmann, 1995. - 11S. Chandrasekaran and M. Franklin. Streaming
queries over streaming data. In VLDB, 2002. - 12J. Chen, D. J. DeWitt, F. Tian, and Y. Wang.
NiagaraCQ A scalable continuous query system for
internet databases. In SIGMOD, pages 379-390, May
2000. - 13C. Cranor, Y. Gao, T. Johnson, V. Shkapenyuk,
and O. Spatscheck. Gigascope A stream database
for network applications. In SIGMOD Conference,
pages 647-651. ACM Press, 2003. - 14Lukasz Golab and M. Tamer Özsu. Issues in
data stream management. ACM SIGMOD Record,
32(2)5-14, 2003. - 15J. M. Hellerstein, P. J. Haas, and H. J.
Wang. Online aggregation. In SIGMOD, 1997. - 16 Yijian Bai, Hetal Thakkar, Chang Luo, Haixun
Wang, Carlo Zaniolo A Data Stream Language and
System Designed for Power and Extensibility.
Proc. of the ACM 15th Conference on Information
and Knowledge Management (CIKM'06), 2006 - 17 Yijian Bai, Hetal Thakkar, Haixun Wang and
Carlo Zaniolo Optimizing Timestamp Management in
Data Stream Management Systems. ICDE 2007.
32References (Cont.)
- 18 Yan-Nei Law, Haixun Wang, Carlo Zaniolo
Query Languages and Data Models for Database
Sequences and Data Streams. VLDB 2004 492-503 - 19 Sam Madden, Mehul A. Shah, Joseph M.
Hellerstein, and Vijayshankar Raman. Continuously
adaptive continuous queries over streams. In
SIGMOD, pages 49-61, 2002. - 20R. Motwani, J. Widom, A. Arasu, B. Babcock,
M. Datar S. Babu, G. Manku, C. Olston, J.
Rosenstein, and R. Varma. Query processing,
approximation, and resource management in a data
stream management system. In First CIDR 2003
Conference, Asilomar, CA, 2003. - 21R. Ramakrishnan, D. Donjerkovic, A.
Ranganathan, K. Beyer, and M. Krishnaprasad.
SRQL Sorted relational query language, 1998. - 23Reza Sadri, Carlo Zaniolo, and Amir M.
Zarkesh andJafar Adibi. A sequential pattern
query language for supporting instant data
minining for e-services. In VLDB, pages 653-656,
2001. - 24Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and
Jafar Adibi. Optimization of sequence queries in
database systems. In PODS, Santa Barbara, CA, May
2001. - 25P. Seshadri. Predator A resource for
database research. SIGMOD Record, 27(1)16-20,
1998. - 26P. Seshadri, M. Livny, and R. Ramakrishnan.
SEQ A model for sequence databases. In ICDE,
pages 232-239, Taipei, Taiwan, March 1995. - 27Praveen Seshadri, Miron Livny, and Raghu
Ramakrishnan. Sequence query processing. In ACM
SIGMOD 1994, pages 430-441. ACM Press, 1994. - 28M. Sullivan. Tribeca A stream database
manager for network traffic analysis. In VLDB,
1996. - 29D. Terry, D. Goldberg, D. Nichols, and B.
Oki. Continuous queries over append-only
databases. In SIGMOD, pages 321-330, 6 1992. - 30Peter A. Tucker, David Maier, Tim Sheard, and
Leonidas Fegaras. Exploiting punctuation
semantics in continuous data streams. IEEE Trans.
Knowl. Data Eng, 15(3)555-568, 2003. - 31Haixun Wang and Carlo Zaniolo. ATLaS a
native extension of SQL for data minining. In
Proceedings of Third SIAM Int. Conference on Data
MIning, pages 130-141, 2003.