Title: Continuous Query Languages for DSMS
1Continuous Query Languages for DSMS
- CS240B Notes
- by
- Carlo Zaniolo
2Outline
- Design Objectives for Data Stream Management
System (DSMS) - Languages for expressing continuous queries
- The Blocking problem
- The expressive Power problem
- The Expressive Stream Language ESL
3Blocking Operators
- A blocking query operator is one that is unable
to produce the first tuple of the output until it
has seen the entire input Babcock et al.
PODS02 - But continuous queries cannot wait for the end of
the stream must return results while the data is
streaming in. Blocking operators cannot be used. - Only non-blocking (nb) queries and operators can
be used on data streams (i.e. those that return
their results before they have detected the end
of the input). - Current DBMSs make heavy usage of blocking
computations - For operators that are intrinsically blocking
e.g., SQL aggregates, - And for those that are not e.g., sort-based
implementation of joins and group by - We only need to be concerned with 1 find a
characterization for blocking nonblocking
independent of implementation.
4Partial Ordering
- Let S t1, ¼, tn be a sequence and 0 k
n. - Then t1, ¼, tk is said to be the presequence
of S, of length k, denoted by Sk. - We write L ? S to denote that L is a presequence
of S, - ? Defines a Partial Order reflexive,
antisymmetric and transitive. - ? generalizes to the subset notion when order and
duplicates are immaterial - The empty sequence, , is a subsequence of
every other sequence.
5employees(E, Dept)
- select dept, count(E)
- over (partition by dept range unbounded
preceding) - from employees
- SQL2003 OLAP functions Non-Blocking
- select dept, count(E)
- from employees
- group by dept
- Traditional SQL-2 aggregates Blocking
Continuous count returns, for each new tuple,
the count so far. Consider a sequence of length
n. At each step jltn, j is returned ? cumulative
return up to j sumj (S) 1,2, , j
independent on whether jn or jltn. Traditional
count For each jltn --nothing sumj (S)
Final sumn (S)n
6 Operators on Sequences S G G(S)
Operators viewed as incremental transducers on
a sequence S of legth n.
-
- G(S) result of applying G to the whole S
- Sj the first j elements of S (presequence of
length j n) - Gj (S) denotes the cumulative output produced up
to the j-th input tuple included. - Then G is said to be
- Blocking when Gj(S) Â for j lt n, and Gn(S)
G(S) - Nonblocking when Gj(S) G(Sj), for every j n.
- For example say that G produces one output tuple
for each input tuple.
7Examples
- Traditional SQL-2 aggregates are
blockingSQL2003 OLAP functions are not. - Selection is nonblocking.
- Continuous count (i.e., the unlimited preceding
count of OLAP functions) is non-blocking - Also window aggregates are non-blocking
- In between cases e.g., traditional aggregates on
input that is already sorted on group-by values.
8Characterization of NonBlocking (nb)
- Many functions expressible by nb-computations can
also be expressed by blocking ones. E.g., joins
can be implemented using sorting. Ditto for
projections with duplicate elimination. - But many functions implemented using blocking
computation cannot be given an nb-implementation.
- We must distinguish between the two kinds of
functions, since one can be supported in our DSMS
(via suitable nb-implementation) and the other
cannot. - Theorem Queries can be expressed via nb
computations iff they are monotonic w.r.t. the
presequence ordering.
9NB-completeness
- A query language L can express a given set of
functions F on its input (DB, sequences, data
streams)---the larger F, the greater the
expressive power of L. - Non-monotonic functions are intrinsically
blocking and they cannot be used on data streams.
Thus, if we use L in a DSMS, we give up the
non-monotonic subset of F with no regret.
However, let us make sure that we do not give up
anything more! - More? Yes, because for continuous queries of
streams, we will normally disallow Ls blocking
(i.e. nonmonotonic) operators constructs, and
only allow nb (i.e., monotonic ) operators. - But are ALL the monotonic functions expressible
by L using the nb-operators of L ? Or by
disallowing blocking operators did we also lose
the ability of expressing some monotonic queries? - Definition L is said to be nb-complete when it
can express all the monotonic queries expressible
by L using only its nb-operators.
10Expressive Power and NB-Completeness
- NB-completeness is a test that a language is as
suitable for continuous queries on data streams
as it is on stored database. - In a language L lacking nb-completeness, there
are monotonic functions that L cannot express as
continuous queries, that L can express if the
stream had been stored in a database. - For instance, Relational Algebra and SQL are not
nb-complete (in addition to the shortcomings they
might have on DBs).
11Sets versus Sequences
- Sets are sequences where duplicates are allowed
and order is immaterial. - Thus S1 is a subset of S2 iff S1 can be
reordered in a presequence of S2. - Theorem Lifted from sequences to sets. A
function is is nb iff it is monotonic. - NBmonotonic selection, projection, and OLAP
functions - BlockingNon-Monotonic e.g. Traditional
aggregates. - Operators of more than one argument
- Join are monotonic (i.e., NB) in both arguments.
- R-S is monotonic on R and antimonotonic on S
i.e., will block on S but not on R (after it has
seen the whole S, though)
12Relational Algebra (RA)
- Set difference can produce monotonic queries
Intersection R1 Ç R2 R1 - (R1 - R2) - Are these still expressible without set diff?
- Intersection can be expressed as a joins
productselect - But interval coalescing and Until queries are
monotonic queries that can be expressed in RA but
not in nb-RA. - Example Temporal domain isomorfic to nonnegative
integers.Intervals closed to the left but open to
the right - p(0, 3). 0,1, and 2 are in p but 3 is not
- p(2, 4). 3 is not a hole because is
covered by this - p(4, 5). 5 is a hole because not covered
by any other interval - p(6, 8).
13Coalesce p (cp) p Until q
- p(0, 3). p(2, 4). p(4, 5). p(6, 8).
- cp(0, 3). cp(2, 4). cp(4, 5). cp(6, 8).
cp(0, 4). cp(2, 5). cp(0,5). - cp contains intervals from the start point of
any p interval to the endpoint of any p interval
unless the endpoint of an interval in between is
a hole. - cp(I1, J2) p(I1, J1), p(I2, J2), J1 lt J2,
Øhole(I1, J2). - hole(I1, J2) p(I1, J1), p(I2, J2), p(_,K), J1
K, K lt I2, Øcep(K). - cep(K) p(_, K), p(I, J), I K, K lt J.
- q(5,_) holds if cp has an interval that starts at
0 contains 5pUntil q(yes) q(0, J). - pUntil q(yes) cp(0, I), q(J, _), I ³ J .
14Relational Algebra
- NonMonotonic (i.e., blocking) RA operators set
difference and division - We are left with select, project, join, and
union. Can these express all FO monotonic
queries? - Some interesting temporal queries coalesce and
until - They are expressible in RA (by double negation)
- They are monotonic
- They cannot be expressed in nb-RA.
- Theorem RA and SQL are not nb-complete.
- SQL faces two problems (i) the exclusion of
EXCEPT/NOT EXISTS, and
(ii) the exclusion of aggregates.
15E-Bay Example
- Auctions a stream of bids on an item.
- bidStream(Item, BidValue, Time)
- Items for which sum of bids is gt 100K
-      SELECT Item  FROM bidStream   GROUP BY
Item HAVING SUM(BidValue) gt 100000 - This is a monotonic query. Thus it can be
expressed in a language containing suitable
query operators, but not in SQL-2. SQL-2 is not
nb-complete thus it is ill-suited for continuous
queries on data streams. - So SQL-2 is not nb-complete because of its
blocking aggregates. What about relational
algebra?
16Incompleteness of Relational QL
- The coalesce and until queries
- can be expressed in safe nonrecursive Datalog,
thus - They are expressible in RA,
- They are monotonic
- They cannot be expressed in nb-RA
- Theorem RA and SQL are not nb-complete.
- A new limitation for DB query languages (which
were already severely challenged in terms of
expressive power)
17Embedding SQL Queries in a PL
- In DB applications, SQL can be embedded in a PL
(Java, C) where the PL accesses the tuples
returned by SQL using a Get Next of Cursor
statement. - Operations that could not be expressed in SQL can
then be expressed in the PL - an effective remedy for the lack of expressive
power of SQL - But cursors is a pull-based mechanism and
cannot be used on data streams the DSMS cannot
hold tuples until the PL request them. - The DSMS can only deliver its output to the PL as
a streamThis is OK to drive a GUI. But if most
of the work has not been done yet, who is the
DSMS? - Contrast this to DBMS who are useful even with a
weak QL.
18Reviewing the Situation
- SQLs lack of expressive power is a major problem
for database-centric applications. - These problems are significantly more serious for
data streams since - Only monotonic queries can be used,
- Actually, not even all the monotonic ones since
SQL is not nb-complete, - These problems cannot be really by using PLs with
embedded SQL statements on streams - DSMS will be impaired--unless significant
improvements can be made.
19UDAs to the Rescue
- Full support for UDAs with all window
combinationseffective on UDAs written in SQL,
PLs, and even built-ins - Support for continuous queries and ad hoc
queries, under a simple and unified semantics - Turing completeness --all possible queries
- nb-completeness all monotonic queries using only
non-blocking operators (e.g., window UDAs those
without TERMINATE) - Effective on a broad range of data-intensive
applications data/stream mining, approximate
queries, sequential patters (XML not there) - Making a strong case for the DB-oriented approach
to data streams.
20Conclusion
- Language Technology
- ESL a very powerful language for data stream and
DB applications - Simple semantics and unified syntax conforming
to SQL2003 standards - Strong case for the DB-oriented approach to data
streams - System Technology
- Some performance-oriented techniques
well-developede.g., buffer management for
windows - For others work is still in progressstay tuned
for latest news - Stream Mill is up and running http//wis.cs.ucla.
edu/stream-mill
21References
- 1ATLaS user manual. http//wis.cs.ucla.edu/atlas
. - 2SQL/LPP A Time Series Extension of SQL Based
on Limited Patience Patterns, volume 1677 of
Lecture Notes in Computer Science. Springer,
1999. - 4A. Arasu, S. Babu, and J. Widom. An abstract
semantics and concrete language for continuous
queries over streams and relations. Technical
report, Stanford University, 2002. - 5B. Babcock, S. Babu, M. Datar, R. Motwani, and
J. Widom. Models and issues in data stream
systems. In PODS, 2002. - 9D. Carney, U. Cetintemel, M. Cherniack, C.
Convey, S. Lee, G. Seidman, M. Stonebraker, N.
Tatbul, and S. Zdonik. Monitoring streams - a new
class of data management applications. In VLDB,
Hong Kong, China, 2002. - 10J. Celko. SQL for Smarties, chapter Advanced
SQL Programming. Morgan Kaufmann, 1995. - 11S. Chandrasekaran and M. Franklin. Streaming
queries over streaming data. In VLDB, 2002. - 12J. Chen, D. J. DeWitt, F. Tian, and Y. Wang.
NiagaraCQ A scalable continuous query system for
internet databases. In SIGMOD, pages 379-390, May
2000. - 13C. Cranor, Y. Gao, T. Johnson, V. Shkapenyuk,
and O. Spatscheck. Gigascope A stream database
for network applications. In SIGMOD Conference,
pages 647-651. ACM Press, 2003. - 14Lukasz Golab and M. Tamer Özsu. Issues in
data stream management. ACM SIGMOD Record,
32(2)5-14, 2003. - 15J. M. Hellerstein, P. J. Haas, and H. J.
Wang. Online aggregation. In SIGMOD, 1997. - 16 Yijian Bai, Hetal Thakkar, Chang Luo, Haixun
Wang, Carlo Zaniolo A Data Stream Language and
System Designed for Power and Extensibility.
Proc. of the ACM 15th Conference on Information
and Knowledge Management (CIKM'06), 2006 - 17 Yijian Bai, Hetal Thakkar, Haixun Wang and
Carlo Zaniolo Optimizing Timestamp Management in
Data Stream Management Systems. ICDE 2007.
22References (Cont.)
- 18 Yan-Nei Law, Haixun Wang, Carlo Zaniolo
Query Languages and Data Models for Database
Sequences and Data Streams. VLDB 2004 492-503 - 19 Sam Madden, Mehul A. Shah, Joseph M.
Hellerstein, and Vijayshankar Raman. Continuously
adaptive continuous queries over streams. In
SIGMOD, pages 49-61, 2002. - 20R. Motwani, J. Widom, A. Arasu, B. Babcock,
M. Datar S. Babu, G. Manku, C. Olston, J.
Rosenstein, and R. Varma. Query processing,
approximation, and resource management in a data
stream management system. In First CIDR 2003
Conference, Asilomar, CA, 2003. - 21R. Ramakrishnan, D. Donjerkovic, A.
Ranganathan, K. Beyer, and M. Krishnaprasad.
SRQL Sorted relational query language, 1998. - 23Reza Sadri, Carlo Zaniolo, and Amir M.
Zarkesh andJafar Adibi. A sequential pattern
query language for supporting instant data
minining for e-services. In VLDB, pages 653-656,
2001. - 24Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and
Jafar Adibi. Optimization of sequence queries in
database systems. In PODS, Santa Barbara, CA, May
2001. - 25P. Seshadri. Predator A resource for
database research. SIGMOD Record, 27(1)16-20,
1998. - 26P. Seshadri, M. Livny, and R. Ramakrishnan.
SEQ A model for sequence databases. In ICDE,
pages 232-239, Taipei, Taiwan, March 1995. - 27Praveen Seshadri, Miron Livny, and Raghu
Ramakrishnan. Sequence query processing. In ACM
SIGMOD 1994, pages 430-441. ACM Press, 1994. - 28M. Sullivan. Tribeca A stream database
manager for network traffic analysis. In VLDB,
1996. - 29D. Terry, D. Goldberg, D. Nichols, and B.
Oki. Continuous queries over append-only
databases. In SIGMOD, pages 321-330, 6 1992. - 30Peter A. Tucker, David Maier, Tim Sheard, and
Leonidas Fegaras. Exploiting punctuation
semantics in continuous data streams. IEEE Trans.
Knowl. Data Eng, 15(3)555-568, 2003. - 31Haixun Wang and Carlo Zaniolo. ATLaS a
native extension of SQL for data minining. In
Proceedings of Third SIAM Int. Conference on Data
MIning, pages 130-141, 2003.
23DSMS Research Projects
- Aurora (Brandeis/Brown/MIT) http//www.cs.brown.ed
u/research/aurora/ - Cougar (Cornell) http//www.cs.cornell.edu/databas
e/cougar/ - Telegraph (Berkeley)- http//telegraph.cs.berkeley
.edu - STREAM (Stanford) http//www-db.stanford.edu/stre
am - Niagara (OGI/Wisconsin)-http//www.cs.wisc.edu/nia
gara/ - OpenCQ (Georgia Tech) http//disl.cc.gatech.edu/
CQ - Tapestry (Xerox) electronic documents stream
filtering - Hancock (ATT) http//www.research.att.com/kfishe
r/hancock/ - Cape (WPI) http//davis.wpi.edu/dsrg/CAPE/home.htm
l - Tribeca (Bellcore) network monitoring
- Stream Mill (UCLA) http//wis.cs.ucla.edu/stream
-mill - Gigascope
24CQLs for DSMS
- Most of DSMS projects use SQL for continuous
queriesfor good reasons, since - Many applications span data streams and DB tables
- A CQL based on SQL will be easier to learn use
- Moreover the fewer the differences the better!
- But DSMS were designed for persistent data and
transient queries---not for persistent queries on
transient data - Adaptation of SQL and its enabling technology
presents difficult research challenges - These combine with traditional SQL problem, such
as inability to deal with sequences, DM tasks,
and other complex query tasks---i.e., lack of
expressive power
25Language Problems
- Most DSMS projects use SQL queries spanning
both data streams and DBs will be easier. But - Even for persistent data, SQL is far from
perfect.Important application areas poorly
supported include - Data Mining, and we need to mine data streams,
- Sequence queries, and data streams are infinite
time series! - Major new problems for SQL on data stream
applications. (After all, it was designed for
persistent data on secondary store, not for
streaming data) - Only NonBlocking operators in DSMS blocking
forbidden - Distinction not clear in DBMS which often use
blocking implementations for nonblocking
operators - The distinction needs to formally characterized
- and so is the loss of query power of the QL.