Continuous Query Languages for DSMS - PowerPoint PPT Presentation

Provided by: mir135
Learn more at: http://web.cs.ucla.edu
1
Continuous Query Languages for DSMS
  • CS240B Notes
  • by
  • Carlo Zaniolo

2
Outline
  • Design Objectives for Data Stream Management
    Systems (DSMS)
  • Languages for expressing continuous queries
  • The Blocking problem
  • The Expressive Power problem
  • The Expressive Stream Language ESL

3
Blocking Operators
  • A blocking query operator is one that is unable
    to produce the first tuple of the output until it
    has seen the entire input [Babcock et al.,
    PODS02].
  • But continuous queries cannot wait for the end of
    the stream: they must return results while the
    data is streaming in. Blocking operators cannot
    be used.
  • Only non-blocking (nb) queries and operators can
    be used on data streams (i.e., those that return
    their results before they have detected the end
    of the input).
  • Current DBMSs make heavy use of blocking
    computations:
  • For operators that are intrinsically blocking,
    e.g., SQL aggregates,
  • And for those that are not, e.g., sort-based
    implementations of joins and group-by.
  • We only need to be concerned with the first kind:
    we must find a characterization of blocking and
    nonblocking that is independent of
    implementation.

4
Partial Ordering
  • Let S = [t1, …, tn] be a sequence and
    0 ≤ k ≤ n.
  • Then [t1, …, tk] is said to be the presequence
    of S, of length k, denoted by S_k.
  • We write L ⊑ S to denote that L is a presequence
    of S.
  • ⊑ defines a partial order: reflexive,
    antisymmetric and transitive.
  • ⊑ generalizes to the subset notion when order and
    duplicates are immaterial.
  • The empty sequence, [ ], is a presequence of
    every other sequence.
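
The presequence ordering above can be sketched in a few lines of Python. This is only an illustration: the function names `presequence` and `is_presequence` are hypothetical and do not come from the slides.

```python
def presequence(s, k):
    """Return S_k, the presequence of s of length k (0 <= k <= len(s))."""
    if not 0 <= k <= len(s):
        raise ValueError("k must satisfy 0 <= k <= len(s)")
    return list(s[:k])


def is_presequence(l, s):
    """True when l is a presequence (prefix) of s, i.e. l ⊑ s."""
    l, s = list(l), list(s)
    return len(l) <= len(s) and s[:len(l)] == l


S = ["t1", "t2", "t3"]
print(presequence(S, 2))                 # the presequence of length 2
print(is_presequence([], S))             # the empty sequence precedes everything
print(is_presequence(["t2"], S))         # not a prefix, so not a presequence
```

Reflexivity, antisymmetry and transitivity follow directly from the prefix test: a list is a prefix of itself, two lists that are prefixes of each other have equal length and equal elements, and a prefix of a prefix is a prefix.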

5
employees(E, Dept)
  • SQL:2003 OLAP functions (non-blocking):
        select dept, count(E)
          over (partition by dept range unbounded preceding)
        from employees
  • Traditional SQL-2 aggregates (blocking):
        select dept, count(E)
        from employees
        group by dept

Continuous count returns, for each new tuple,
the count so far. Consider a sequence of length
n. At each step j, the value j is returned, so the
cumulative return up to j is sum_j(S) = [1, 2, …, j],
independent of whether j = n or j < n.
Traditional count: for each j < n, nothing is
returned, i.e., sum_j(S) = [ ]. Final: sum_n(S) = [n].
6
Operators on Sequences: S → G → G(S)
Operators viewed as incremental transducers on
a sequence S of length n.
  • G(S): the result of applying G to the whole of S.
  • S_j: the first j elements of S (the presequence
    of length j ≤ n).
  • G_j(S) denotes the cumulative output produced up
    to and including the j-th input tuple.
  • Then G is said to be:
  • Blocking when G_j(S) = [ ] for j < n, and
    G_n(S) = G(S).
  • Nonblocking when G_j(S) = G(S_j), for every
    j ≤ n.
  • For example, say that G produces one output tuple
    for each input tuple.
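
The two definitions can be tried out on the count example from the previous slide. The sketch below models a transducer as an object with a hypothetical `push`/`end` interface (an assumption made here for illustration, not an API from the slides): `push` returns the tuples emitted for one input tuple, and `end` returns what is emitted once the end of the input is detected.

```python
class ContinuousCount:
    """Emits the count so far for every input tuple."""
    def __init__(self):
        self.n = 0

    def push(self, t):
        self.n += 1
        return [self.n]

    def end(self):
        return []


class TraditionalCount:
    """Emits a single count, and only at the end of the input."""
    def __init__(self):
        self.n = 0

    def push(self, t):
        self.n += 1
        return []

    def end(self):
        return [self.n]


def g_j(make, s, j):
    """G_j(S): cumulative output up to and including the j-th tuple of s."""
    g, out = make(), []
    for t in s[:j]:
        out.extend(g.push(t))
    if j == len(s):          # the end of input is only visible at j = n
        out.extend(g.end())
    return out


def g(make, s):
    """G(S): output of the transducer on the whole sequence."""
    return g_j(make, s, len(s))


def is_blocking_on(make, s):
    """G_j(S) = [] for every j < n (checked on one non-empty sample s)."""
    return all(g_j(make, s, j) == [] for j in range(len(s)))


def is_nonblocking_on(make, s):
    """G_j(S) = G(S_j) for every j <= n (checked on one sample s)."""
    return all(g_j(make, s, j) == g(make, s[:j]) for j in range(len(s) + 1))


S = ["a", "b", "c"]
print(g_j(ContinuousCount, S, 2), g_j(TraditionalCount, S, 2))
print(is_nonblocking_on(ContinuousCount, S), is_blocking_on(TraditionalCount, S))
```

The checks run on a single sample sequence, so they can refute a property but not prove it in general.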

7
Examples
  • Traditional SQL-2 aggregates are blocking;
    SQL:2003 OLAP functions are not.
  • Selection is nonblocking.
  • Continuous count (i.e., the unbounded preceding
    count of OLAP functions) is non-blocking.
  • Window aggregates are also non-blocking.
  • In-between cases: e.g., traditional aggregates on
    input that is already sorted on group-by values.
8
Characterization of NonBlocking (nb)
  • Many functions expressible by nb-computations can
    also be expressed by blocking ones. E.g., joins
    can be implemented using sorting. Ditto for
    projections with duplicate elimination.
  • But many functions implemented using blocking
    computation cannot be given an nb-implementation.
  • We must distinguish between the two kinds of
    functions, since one can be supported in our DSMS
    (via suitable nb-implementation) and the other
    cannot.
  • Theorem: Queries can be expressed via nb
    computations iff they are monotonic w.r.t. the
    presequence ordering.
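
As a small illustration of the theorem, monotonicity w.r.t. the presequence ordering can be tested on a sample input: a function F on sequences is monotonic when L ⊑ S implies F(L) ⊑ F(S). The helper names below are hypothetical, and passing the test on one sample is only a necessary condition, not a proof.

```python
def is_prefix(l, s):
    """l ⊑ s: l is a presequence of s."""
    return len(l) <= len(s) and list(s[:len(l)]) == list(l)


def monotonic_on(f, s):
    """Check f(S_k) ⊑ f(S_k+1) for every presequence of the sample s.

    Consecutive presequences suffice because ⊑ is transitive.
    """
    outs = [f(list(s[:k])) for k in range(len(s) + 1)]
    return all(is_prefix(outs[k], outs[k + 1]) for k in range(len(s)))


def running_count(s):
    """Continuous count: the list of counts so far."""
    return list(range(1, len(s) + 1))


def final_count(s):
    """Traditional count: one value for the whole input."""
    return [len(s)]


def select_even(s):
    """Selection: keep the even values, in arrival order."""
    return [x for x in s if x % 2 == 0]


sample = [4, 7, 10, 3]
print(monotonic_on(running_count, sample))
print(monotonic_on(select_even, sample))
print(monotonic_on(final_count, sample))
```

On this sample the first two functions pass and `final_count` fails, because its output for a longer input replaces, rather than extends, its output for a shorter one.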

9
NB-completeness
  • A query language L can express a given set of
    functions F on its input (DB, sequences, data
    streams)---the larger F, the greater the
    expressive power of L.
  • Non-monotonic functions are intrinsically
    blocking and they cannot be used on data streams.
    Thus, if we use L in a DSMS, we give up the
    non-monotonic subset of F with no regret.
    However, let us make sure that we do not give up
    anything more!
  • More? Yes, because for continuous queries on
    streams, we will normally disallow L's blocking
    (i.e., nonmonotonic) operators and constructs,
    and only allow nb (i.e., monotonic) operators.
  • But are ALL the monotonic functions expressible
    by L using the nb-operators of L? Or, by
    disallowing blocking operators, did we also lose
    the ability to express some monotonic queries?
  • Definition: L is said to be nb-complete when it
    can express all the monotonic queries expressible
    by L using only its nb-operators.

10
Expressive Power and NB-Completeness
  • NB-completeness is a test that a language is as
    suitable for continuous queries on data streams
    as it is on stored databases.
  • In a language L lacking nb-completeness, there
    are monotonic functions that L cannot express as
    continuous queries, but that L could express if
    the stream had been stored in a database.
  • For instance, Relational Algebra and SQL are not
    nb-complete (in addition to the shortcomings they
    might have on DBs).

11
Sets versus Sequences
  • Sets are sequences where duplicates are allowed
    and order is immaterial.
  • Thus S1 is a subset of S2 iff S1 can be
    reordered into a presequence of S2.
  • Theorem (lifted from sequences to sets): A
    function is nb iff it is monotonic.
  • NB = monotonic: selection, projection, and OLAP
    functions.
  • Blocking = non-monotonic: e.g., traditional
    aggregates.
  • Operators of more than one argument:
  • Joins are monotonic (i.e., NB) in both arguments.
  • R − S is monotonic on R and antimonotonic on S,
    i.e., it will block on S but not on R (after it
    has seen the whole S, though).
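
The behaviour of the two binary operators can be sketched with Python generators. Both functions are hypothetical illustrations: `stream_difference` assumes S is already completely available and then streams R, while `symmetric_join` assumes the two inputs arrive interleaved as `(side, (key, value))` events and joins on `key`.

```python
from collections import defaultdict
from itertools import count, islice


def stream_difference(r_stream, s_complete):
    """R - S with set semantics: blocks on S (needs all of it up front),
    but emits each qualifying R tuple as soon as it arrives."""
    s = set(s_complete)
    emitted = set()
    for t in r_stream:
        if t not in s and t not in emitted:
            emitted.add(t)
            yield t


def symmetric_join(events):
    """Equi-join that is nonblocking in both arguments: every arriving
    tuple is stored and immediately matched against the other side."""
    seen = {"R": defaultdict(list), "S": defaultdict(list)}
    for side, (key, value) in events:
        other = "S" if side == "R" else "R"
        seen[side][key].append(value)
        for match in seen[other][key]:
            # Output is always (key, R value, S value).
            yield (key, value, match) if side == "R" else (key, match, value)


# R may be unbounded: results appear without waiting for its end.
print(list(islice(stream_difference(count(), [1]), 3)))

events = [("R", (1, "a")), ("S", (1, "x")), ("S", (2, "y")),
          ("R", (2, "b")), ("R", (1, "c"))]
print(list(symmetric_join(events)))
```

Adding tuples to R can only add results to either operator, whereas adding a tuple to S can retract a result of R − S, which is why the difference has to wait for all of S.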

12
Relational Algebra (RA)
  • Set difference can produce monotonic queries,
    e.g., intersection: R1 ∩ R2 = R1 − (R1 − R2)
  • Are these still expressible without set
    difference?
  • Intersection can be expressed as a join
    (product + select).
  • But interval coalescing and Until queries are
    monotonic queries that can be expressed in RA but
    not in nb-RA.
  • Example: temporal domain isomorphic to the
    nonnegative integers. Intervals are closed on
    the left but open on the right.
  • p(0, 3). 0, 1, and 2 are in p but 3 is not.
  • p(2, 4). 3 is not a hole because it is
    covered by this interval.
  • p(4, 5). 5 is a hole because it is not covered
    by any other interval.
  • p(6, 8).

13
Coalesce p (cp) and p Until q
  • p(0, 3). p(2, 4). p(4, 5). p(6, 8).
  • cp(0, 3). cp(2, 4). cp(4, 5). cp(6, 8).
    cp(0, 4). cp(2, 5). cp(0, 5).
  • cp contains intervals from the start point of
    any p interval to the endpoint of any p interval
    unless the endpoint of an interval in between is
    a hole.
  • cp(I1, J2) ← p(I1, J1), p(I2, J2), J1 < J2,
    ¬hole(I1, J2).
  • hole(I1, J2) ← p(I1, J1), p(I2, J2), p(_, K),
    J1 ≤ K, K < I2, ¬cep(K).
  • cep(K) ← p(_, K), p(I, J), I ≤ K, K < J.
  • Given q(5, _), p Until q holds if cp has an
    interval that starts at 0 and contains 5:
  • pUntilq(yes) ← q(0, J).
  • pUntilq(yes) ← cp(0, I), q(J, _), I ≥ J.
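
The rules can be evaluated directly on the four example intervals. The sketch below is a hypothetical Python transcription, not the author's implementation. It makes two assumptions that should be read as such: the comparisons whose symbols were lost in the transcript are taken to be ≤, and the cp rule is evaluated with J1 ≤ J2 rather than the strict comparison shown on the slide, because only the non-strict form also yields the original p intervals listed among the cp facts.

```python
def coalesce(p):
    """Evaluate the cep, hole and cp rules over a set of [start, end) intervals."""
    p = set(p)
    ends = {j for _, j in p}
    # cep(K): endpoint K is covered by some interval [I, J) with I <= K < J.
    cep = {k for k in ends if any(i <= k < j for i, j in p)}
    # hole(I1, J2): some uncovered endpoint K lies between J1 and I2.
    hole = {(i1, j2)
            for i1, j1 in p for i2, j2 in p for k in ends
            if j1 <= k < i2 and k not in cep}
    # cp(I1, J2): assumption, J1 <= J2 instead of the slide's J1 < J2.
    return {(i1, j2)
            for i1, j1 in p for i2, j2 in p
            if j1 <= j2 and (i1, j2) not in hole}


def p_until_q(cp, q):
    """p Until q: q starts at 0, or a cp interval starting at 0 reaches q's start."""
    return (any(start == 0 for start, _ in q)
            or any(i0 == 0 and i >= j for i0, i in cp for j, _ in q))


p = {(0, 3), (2, 4), (4, 5), (6, 8)}
cp = coalesce(p)
print(sorted(cp))
print(p_until_q(cp, {(5, 9)}), p_until_q(cp, {(6, 9)}))
```

On the example this produces the seven cp facts listed above. Both cep and hole appear negated, which is the double negation that keeps the query out of reach of the nonblocking operators even though the query itself is monotonic.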

14
Relational Algebra
  • Nonmonotonic (i.e., blocking) RA operators: set
    difference and division.
  • We are left with select, project, join, and
    union. Can these express all FO monotonic
    queries?
  • Some interesting temporal queries: coalesce and
    until.
  • They are expressible in RA (by double negation).
  • They are monotonic.
  • They cannot be expressed in nb-RA.
  • Theorem: RA and SQL are not nb-complete.
  • SQL faces two problems: (i) the exclusion of
    EXCEPT/NOT EXISTS, and (ii) the exclusion of
    aggregates.

15
E-Bay Example
  • Auctions: a stream of bids on an item.
  • bidStream(Item, BidValue, Time)
  • Items for which the sum of bids is > 100K:
  •   SELECT Item FROM bidStream GROUP BY Item
      HAVING SUM(BidValue) > 100000
  • This is a monotonic query. Thus it can be
    expressed in a language containing suitable
    query operators, but not in SQL-2. SQL-2 is not
    nb-complete; thus it is ill-suited for continuous
    queries on data streams.
  • So SQL-2 is not nb-complete because of its
    blocking aggregates. What about relational
    algebra?
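
A nonblocking evaluation of this query can be sketched as a running sum per item that reports an item the moment its total crosses the threshold. The function name and the sample bids are hypothetical, and the sketch assumes bid values are never negative, which is what makes the query monotonic: once an item qualifies, later bids cannot disqualify it.

```python
from collections import defaultdict


def items_over_threshold(bid_stream, threshold=100_000):
    """Yield each item once, as soon as the sum of its bids exceeds threshold.

    bid_stream yields (item, bid_value, time) tuples; bid_value >= 0 is assumed.
    """
    totals = defaultdict(int)
    reported = set()
    for item, bid_value, _time in bid_stream:
        totals[item] += bid_value
        if totals[item] > threshold and item not in reported:
            reported.add(item)
            yield item


bids = [("vase", 60_000, 1), ("car", 100_000, 2), ("vase", 50_000, 3),
        ("car", 1, 4), ("vase", 10, 5)]
print(list(items_over_threshold(bids)))
```

The output is produced while the stream is still arriving, whereas the GROUP BY ... HAVING formulation in SQL-2 only answers after the last bid.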

16
Incompleteness of Relational QL
  • The coalesce and until queries
  • can be expressed in safe nonrecursive Datalog;
    thus:
  • They are expressible in RA.
  • They are monotonic.
  • They cannot be expressed in nb-RA.
  • Theorem: RA and SQL are not nb-complete.
  • A new limitation for DB query languages (which
    were already severely challenged in terms of
    expressive power).

17
Embedding SQL Queries in a PL
  • In DB applications, SQL can be embedded in a PL
    (Java, C) where the PL accesses the tuples
    returned by SQL using a Get Next of Cursor
    statement.
  • Operations that could not be expressed in SQL can
    then be expressed in the PL:
  • an effective remedy for the lack of expressive
    power of SQL.
  • But cursors are a pull-based mechanism and
    cannot be used on data streams: the DSMS cannot
    hold tuples until the PL requests them.
  • The DSMS can only deliver its output to the PL as
    a stream. This is OK to drive a GUI. But if most
    of the work has not been done yet, who needs the
    DSMS?
  • Contrast this with DBMSs, which are useful even
    with a weak QL.

18
Reviewing the Situation
  • SQL's lack of expressive power is a major problem
    for database-centric applications.
  • These problems are significantly more serious for
    data streams since:
  • Only monotonic queries can be used,
  • Actually, not even all the monotonic ones, since
    SQL is not nb-complete,
  • These problems cannot really be solved by using
    PLs with embedded SQL statements on streams.
  • DSMS will be impaired unless significant
    improvements can be made.

19
UDAs to the Rescue
  • Full support for UDAs with all window
    combinations: effective on UDAs written in SQL,
    PLs, and even built-ins.
  • Support for continuous queries and ad hoc
    queries, under a simple and unified semantics.
  • Turing completeness: all possible queries.
  • nb-completeness: all monotonic queries using only
    non-blocking operators (e.g., window UDAs and
    those without TERMINATE).
  • Effective on a broad range of data-intensive
    applications: data/stream mining, approximate
    queries, sequential patterns (XML not there).
  • Making a strong case for the DB-oriented approach
    to data streams.

20
Conclusion
  • Language Technology
  • ESL: a very powerful language for data stream and
    DB applications
  • Simple semantics and unified syntax conforming
    to SQL:2003 standards
  • Strong case for the DB-oriented approach to data
    streams
  • System Technology
  • Some performance-oriented techniques are
    well developed, e.g., buffer management for
    windows
  • For others, work is still in progress: stay tuned
    for the latest news
  • Stream Mill is up and running:
    http://wis.cs.ucla.edu/stream-mill

21
References
  • [1] ATLaS user manual. http://wis.cs.ucla.edu/atlas.
  • [2] SQL/LPP: A Time Series Extension of SQL Based
    on Limited Patience Patterns, volume 1677 of
    Lecture Notes in Computer Science. Springer,
    1999.
  • [4] A. Arasu, S. Babu, and J. Widom. An abstract
    semantics and concrete language for continuous
    queries over streams and relations. Technical
    report, Stanford University, 2002.
  • [5] B. Babcock, S. Babu, M. Datar, R. Motwani, and
    J. Widom. Models and issues in data stream
    systems. In PODS, 2002.
  • [9] D. Carney, U. Cetintemel, M. Cherniack, C.
    Convey, S. Lee, G. Seidman, M. Stonebraker, N.
    Tatbul, and S. Zdonik. Monitoring streams: a new
    class of data management applications. In VLDB,
    Hong Kong, China, 2002.
  • [10] J. Celko. SQL for Smarties, chapter Advanced
    SQL Programming. Morgan Kaufmann, 1995.
  • [11] S. Chandrasekaran and M. Franklin. Streaming
    queries over streaming data. In VLDB, 2002.
  • [12] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang.
    NiagaraCQ: A scalable continuous query system for
    internet databases. In SIGMOD, pages 379-390, May
    2000.
  • [13] C. Cranor, Y. Gao, T. Johnson, V. Shkapenyuk,
    and O. Spatscheck. Gigascope: A stream database
    for network applications. In SIGMOD Conference,
    pages 647-651. ACM Press, 2003.
  • [14] Lukasz Golab and M. Tamer Özsu. Issues in
    data stream management. ACM SIGMOD Record,
    32(2):5-14, 2003.
  • [15] J. M. Hellerstein, P. J. Haas, and H. J.
    Wang. Online aggregation. In SIGMOD, 1997.
  • [16] Yijian Bai, Hetal Thakkar, Chang Luo, Haixun
    Wang, and Carlo Zaniolo. A Data Stream Language
    and System Designed for Power and Extensibility.
    Proc. of the ACM 15th Conference on Information
    and Knowledge Management (CIKM'06), 2006.
  • [17] Yijian Bai, Hetal Thakkar, Haixun Wang, and
    Carlo Zaniolo. Optimizing Timestamp Management in
    Data Stream Management Systems. ICDE 2007.

22
References (Cont.)
  • [18] Yan-Nei Law, Haixun Wang, and Carlo Zaniolo.
    Query Languages and Data Models for Database
    Sequences and Data Streams. VLDB 2004: 492-503.
  • [19] Sam Madden, Mehul A. Shah, Joseph M.
    Hellerstein, and Vijayshankar Raman. Continuously
    adaptive continuous queries over streams. In
    SIGMOD, pages 49-61, 2002.
  • [20] R. Motwani, J. Widom, A. Arasu, B. Babcock,
    M. Datar, S. Babu, G. Manku, C. Olston, J.
    Rosenstein, and R. Varma. Query processing,
    approximation, and resource management in a data
    stream management system. In First CIDR 2003
    Conference, Asilomar, CA, 2003.
  • [21] R. Ramakrishnan, D. Donjerkovic, A.
    Ranganathan, K. Beyer, and M. Krishnaprasad.
    SRQL: Sorted relational query language, 1998.
  • [23] Reza Sadri, Carlo Zaniolo, Amir M.
    Zarkesh, and Jafar Adibi. A sequential pattern
    query language for supporting instant data
    mining for e-services. In VLDB, pages 653-656,
    2001.
  • [24] Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and
    Jafar Adibi. Optimization of sequence queries in
    database systems. In PODS, Santa Barbara, CA, May
    2001.
  • [25] P. Seshadri. Predator: A resource for
    database research. SIGMOD Record, 27(1):16-20,
    1998.
  • [26] P. Seshadri, M. Livny, and R. Ramakrishnan.
    SEQ: A model for sequence databases. In ICDE,
    pages 232-239, Taipei, Taiwan, March 1995.
  • [27] Praveen Seshadri, Miron Livny, and Raghu
    Ramakrishnan. Sequence query processing. In ACM
    SIGMOD 1994, pages 430-441. ACM Press, 1994.
  • [28] M. Sullivan. Tribeca: A stream database
    manager for network traffic analysis. In VLDB,
    1996.
  • [29] D. Terry, D. Goldberg, D. Nichols, and B.
    Oki. Continuous queries over append-only
    databases. In SIGMOD, pages 321-330, June 1992.
  • [30] Peter A. Tucker, David Maier, Tim Sheard, and
    Leonidas Fegaras. Exploiting punctuation
    semantics in continuous data streams. IEEE Trans.
    Knowl. Data Eng., 15(3):555-568, 2003.
  • [31] Haixun Wang and Carlo Zaniolo. ATLaS: a
    native extension of SQL for data mining. In
    Proceedings of the Third SIAM Int. Conference on
    Data Mining, pages 130-141, 2003.

23
DSMS Research Projects
  • Aurora (Brandeis/Brown/MIT):
    http://www.cs.brown.edu/research/aurora/
  • Cougar (Cornell):
    http://www.cs.cornell.edu/database/cougar/
  • Telegraph (Berkeley):
    http://telegraph.cs.berkeley.edu
  • STREAM (Stanford):
    http://www-db.stanford.edu/stream
  • Niagara (OGI/Wisconsin):
    http://www.cs.wisc.edu/niagara/
  • OpenCQ (Georgia Tech):
    http://disl.cc.gatech.edu/CQ
  • Tapestry (Xerox): electronic documents stream
    filtering
  • Hancock (AT&T):
    http://www.research.att.com/kfisher/hancock/
  • Cape (WPI):
    http://davis.wpi.edu/dsrg/CAPE/home.html
  • Tribeca (Bellcore): network monitoring
  • Stream Mill (UCLA):
    http://wis.cs.ucla.edu/stream-mill
  • Gigascope

24
CQLs for DSMS
  • Most DSMS projects use SQL for continuous
    queries, for good reasons, since:
  • Many applications span data streams and DB tables
  • A CQL based on SQL will be easier to learn and use
  • Moreover, the fewer the differences the better!
  • But DBMSs were designed for persistent data and
    transient queries, not for persistent queries on
    transient data.
  • Adaptation of SQL and its enabling technology
    presents difficult research challenges.
  • These combine with traditional SQL problems, such
    as the inability to deal with sequences, DM
    tasks, and other complex query tasks, i.e., lack
    of expressive power.

25
Language Problems
  • Most DSMS projects use SQL: queries spanning
    both data streams and DBs will be easier. But:
  • Even for persistent data, SQL is far from
    perfect. Important application areas that are
    poorly supported include:
  • Data Mining, and we need to mine data streams,
  • Sequence queries, and data streams are infinite
    time series!
  • Major new problems for SQL on data stream
    applications. (After all, it was designed for
    persistent data on secondary store, not for
    streaming data.)
  • Only nonblocking operators in DSMS: blocking
    operators are forbidden.
  • The distinction is not clear in DBMSs, which
    often use blocking implementations for
    nonblocking operators.
  • The distinction needs to be formally
    characterized,
  • and so does the loss of query power of the QL.