Title: Models and Issues in Data Stream Systems
Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Jennifer Widom
ACM SIGMOD/PODS, 2002
- Adesola Omotayo
- September 17, 2004
Goals
- Motivate the need for, and the research issues arising from, a new model of data processing
- Review past work relevant to data stream systems and current projects in that area
- Explore topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues
Presentation Outline
- The Data Stream Model
- Review of Data Stream Projects
- Queries over Data Streams
- Proposal for a DSMS
- Algorithmic Issues
The Data Stream Model
- Data Stream (DS) vs. Stored Relational Model
  - data elements arrive online
  - system has no control over arrival order
  - data streams are unbounded
  - processed data stream elements are discarded or archived
- Data stream systems may still use data held in conventional stored relations
Queries
- One-time and Continuous queries
  - One-time queries
    - evaluated once over a point-in-time snapshot of the data set
  - Continuous queries
    - evaluated continuously as data streams continue to arrive
    - answers may be stored and updated or may be produced as data streams
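As a rough illustration of the distinction above, the following Python sketch evaluates the same count both ways. The function names and the choice of a thresholded count are assumptions made for this example only; the paper does not prescribe them.

```python
from typing import Iterable, Iterator, List


def one_time_count(snapshot: List[int], threshold: int) -> int:
    """One-time query: evaluated once over a fixed snapshot of the data."""
    return sum(1 for x in snapshot if x > threshold)


def continuous_count(stream: Iterable[int], threshold: int) -> Iterator[int]:
    """Continuous query: re-evaluated as each element arrives.

    The answers are themselves produced as a stream, one per input element.
    """
    count = 0
    for x in stream:
        if x > threshold:
            count += 1
        yield count
```

The one-time version returns a single number, while the continuous version yields an updated answer after every arrival and never needs to revisit earlier elements.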
Queries (contd)
- Predefined and Ad hoc queries
  - Predefined
    - supplied before any relevant data arrives
    - generally continuous queries
    - scheduled one-time queries possible
  - Ad hoc
    - either one-time or continuous queries
    - complicates design of data stream management systems
Motivating Examples
- Web-based financial search engine (e.g. Traderbot)
- Modern security applications (e.g. iPolicy Networks)
- Web logs monitoring (e.g. Yahoo)
- Sensor monitoring (e.g. HP Data Center)
- Network traffic management (e.g. ISPs)
Concrete Example
- Fraction of the backbone link's traffic that can be attributed to the customer network (C is the customer-link packet stream, B the backbone-link packet stream)
    (SELECT count(*)
     FROM C, B
     WHERE C.src = B.src and C.dest = B.dest
       and C.id = B.id) /
    (SELECT count(*) FROM B)
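The query above can be sketched in Python over finite packet traces. The `Packet` tuple layout and the function name are hypothetical, and the sketch buffers all of C in memory, which is exactly what an unbounded stream would not allow.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical packet-header fields used as the join key: (src, dest, id).
Packet = Tuple[str, str, int]


def customer_fraction(c_packets: Iterable[Packet],
                      b_packets: Iterable[Packet]) -> float:
    """Count of (C join B on src, dest, id) divided by count of B."""
    c_counts = Counter(c_packets)      # multiset of customer packets
    joined = 0
    total_b = 0
    for pkt in b_packets:
        total_b += 1
        joined += c_counts[pkt]        # number of C tuples matching this B tuple
    return joined / total_b if total_b else 0.0
```

Because the join state grows with the length of C, a stream system would need windows or approximation to evaluate this query continuously.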
Review of Data Stream Projects
- Tapestry System
  - Continuous queries
  - Restricted subset of SQL
- Alert System
  - Event-condition-action style triggers
  - Continuous queries
Review of Data Stream Projects (contd)
- XFilter System
  - Efficient content-based filtering of XML documents
  - Continuous queries in XPath language
- Xyleme System
  - Content-based filtering system
  - High throughput with a restricted query language
Review of Data Stream Projects (contd)
- Tribeca stream database manager
  - Restricted querying capability over network packet streams
- Tangram System
  - Uses stream processing techniques to analyze large quantities of stored data
Review of Data Stream Projects (contd)
- OpenCQ
  - Continuous queries
  - Query processing algorithm based on incremental view maintenance
- NiagaraCQ
  - Continuous queries
  - Groups continuous queries for efficient evaluation
  - Support of blocking operators in query plans over data streams
Review of Data Stream Projects (contd)
- Viglas and Naughton proposed rate-based optimization for queries over data streams
- Chronicle Data Model
  - Append-only ordered sequences of tuples (chronicles)
  - Restricted view definition language and algebra (chronicle algebra)
  - Views defined in chronicle algebra could be maintained incrementally without storing any of the chronicles
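The idea of maintaining a view incrementally without storing the chronicle can be sketched as follows. This is only an illustration in Python: the `RunningTotalView` class and its per-key SUM are assumptions for the example and are not taken from the chronicle algebra itself.

```python
from collections import defaultdict
from typing import DefaultDict


class RunningTotalView:
    """Hypothetical per-key SUM view over an append-only sequence."""

    def __init__(self) -> None:
        self.totals: DefaultDict[str, float] = defaultdict(float)

    def append(self, key: str, amount: float) -> None:
        # Fold the new tuple into the view; the tuple itself is not retained.
        self.totals[key] += amount
```

The memory used depends on the number of distinct keys, not on how many tuples have been appended.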
Review of Data Stream Projects (contd)
- Seshadri, Livny, and Ramakrishnan proposed an algebra and a declarative query language for querying ordered relations (sequences)
- Related work includes work on temporal and time-series databases
Review of Data Stream Projects (contd)
- Materialized Views
  - Queries that need to be reevaluated or incrementally updated whenever the base data changes
  - Important work in this area
    - self-maintenance
    - data expiration
- Continuous queries differ from materialized views in that they may
  - stream rather than store results
  - deal with append-only input data
  - provide approximate rather than exact answers
  - adapt their processing strategy as characteristics of the data streams change
Review of Data Stream Projects (contd)
- Telegraph Project
  - Adaptive query engine for volatile and unpredictable environments
  - Query execution strategies over data streams generated by sensors
  - Adaptive processing techniques for multiple continuous queries
- Tukwila system
  - Supports adaptive query processing, in order to perform dynamic data integration over autonomous data sources
Review of Data Stream Projects (contd)
- Aurora Project
  - Targeted towards stream monitoring applications
  - Consists of a large network of triggers (data-flow graph)
  - Application administrators create and add triggers
  - Compile-time and run-time optimization of the trigger network
  - Detects resource overload and performs load shedding based on application-specific measures of quality of service (QoS)
Queries over Data Streams
- Unbounded Memory Requirements
- Approximate Query Answering
  - Data reduction techniques
    - Sketches
    - Random sampling
    - Histograms
    - Wavelets
  - Approaches to approximation
    - Sliding Windows
    - Batch Processing, Sampling, and Synopses
- Blocking Operators
- Queries Referencing Past Data
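Of the data reduction techniques listed above, random sampling is the simplest to sketch. The Python example below uses reservoir sampling, a standard way to keep a uniform sample of fixed size from a stream of unknown length. It is offered as an illustration, the slides do not name a specific sampling algorithm, and the function name and parameters are assumptions.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Keep a uniform random sample of up to k items using O(k) memory."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)        # uniform over 0..i, both ends inclusive
            if j < k:
                reservoir[j] = item      # replace a slot with probability k/(i+1)
    return reservoir
```

Approximate answers can then be computed from the sample instead of the full stream, with memory independent of the stream length.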
Sliding Windows
- Evaluate query over sliding window of recent data from streams
- Attractive Properties
  - Well-defined and understood
  - Deterministic
  - Emphasizes recent data
- Research Issues
  - How to define timestamps over streams
  - How to implement sliding window queries
  - What is their impact on query optimization?
  - How to give approximate answers if the window is too big to fit in main memory
[Figure: sliding-window diagram with the labels Window, Past Data, Future Data, and Recent Data]
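A minimal Python sketch of a sliding window query follows, assuming a count-based window (the last `window_size` elements) and a running average as the aggregate. Both choices are assumptions for illustration; time-based windows would need the timestamp definitions raised under Research Issues.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def sliding_window_avg(stream: Iterable[float],
                       window_size: int) -> Iterator[float]:
    """Yield the average of the most recent window_size elements."""
    window: Deque[float] = deque()
    total = 0.0
    for x in stream:
        window.append(x)
        total += x
        if len(window) > window_size:
            total -= window.popleft()    # expire the oldest element
        yield total / len(window)
```

The sketch stores the whole window, so it does not address the case where the window itself is too large for main memory.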
Sliding Windows (contd)
- Sequence and Temporal DB
  - Temporal DB
    - Concerned with full history of each data value over time
  - Sequence DB
    - Attempts to produce query plans that allow for stream access
    - Assumes DB system has control over which sequence to process tuples from next
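Batch Processing, Sampling, and Synopses
- Give up on processing every data element as it arrives
- Two possible bottlenecks: updating the query's data structure for each arriving element, and computing the answer from it
- Batch processing: when computing the answer is the slow step, buffer arrivals and compute the answer periodically
- Sampling: when updating is the slow step, process only a sample of the arriving elements
- Synopsis data structure: make both steps fast by maintaining a small approximate summary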
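Batch processing can be sketched in Python as below, assuming a query whose per-element update is cheap (an append) but whose answer is expensive to compute (a median). The function name, the choice of median, and the `batch_size` parameter are illustrative assumptions.

```python
import statistics
from typing import Iterable, Iterator, List


def batched_median(stream: Iterable[float],
                   batch_size: int) -> Iterator[float]:
    """Recompute the median once per batch instead of once per element."""
    seen: List[float] = []
    for i, x in enumerate(stream, start=1):
        seen.append(x)                       # cheap per-element update
        if i % batch_size == 0:
            yield statistics.median(seen)    # expensive step, once per batch
```

Between batches the last reported answer is out of date, which is the sense in which the result is approximate. The sketch also keeps every element, so it trades timeliness for cost but does not bound memory.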
Blocking Operators
- A blocking operator cannot produce the first tuple of its output until it has seen its entire input (e.g., sorting, and aggregation operators such as SUM)
- Blocking operators at the root of a query operator tree are more tractable than those at interior nodes
- juggle operator (a non-blocking version of sort)
Blocking Operators (contd)
- Tucker et al. suggested augmenting data streams with assertions about what can and cannot appear in the remainder of the data stream
[Figure: a stream split into tuples with daynumber < 10 and tuples with daynumber ≥ 10, separated by the assertion "daynumber ≥ 10"]
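The effect of such an assertion on a blocking aggregate can be sketched in Python. The `Sale` and `Assertion` record types and the per-day SUM are hypothetical, chosen to match the daynumber example. The point is only that an assertion lets a grouped aggregate emit results before the stream ends.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, Tuple, Union


@dataclass(frozen=True)
class Sale:
    daynumber: int
    amount: float


@dataclass(frozen=True)
class Assertion:
    """Hypothetical marker: every later tuple has daynumber >= min_day."""
    min_day: int


def daily_totals(
    stream: Iterable[Union[Sale, Assertion]],
) -> Iterator[Tuple[int, float]]:
    """Per-day SUM that emits a day's total once an assertion closes it."""
    open_groups: Dict[int, float] = {}
    for item in stream:
        if isinstance(item, Sale):
            open_groups[item.daynumber] = (
                open_groups.get(item.daynumber, 0.0) + item.amount
            )
        else:
            # No future tuple can belong to a day below min_day, so those
            # groups are final and can be output and forgotten.
            for day in sorted(d for d in open_groups if d < item.min_day):
                yield day, open_groups.pop(day)
```

Groups for days at or above `min_day` stay open, so without further assertions their totals are never emitted by this sketch.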
Queries Referencing Past Data
- Ad hoc queries that are issued after some data has already been discarded may be impossible to answer accurately
- One option: allow ad hoc queries to reference future data only
- Another option: maintain summaries of data streams (synopses or aggregates) that can approximate answers to future ad hoc queries
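One way to picture a summary that serves later ad hoc queries is an equi-width histogram, sketched below in Python. The class name, the fixed value range, and the uniform-within-bucket estimate are assumptions for this example. The slides only say that synopses or aggregates may be kept.

```python
from typing import List


class HistogramSynopsis:
    """Fixed-size equi-width histogram over values in [lo, hi)."""

    def __init__(self, lo: float, hi: float, buckets: int) -> None:
        self.lo = lo
        self.width = (hi - lo) / buckets
        self.counts: List[int] = [0] * buckets

    def update(self, value: float) -> None:
        # Out-of-range values are clamped into the first or last bucket.
        idx = int((value - self.lo) / self.width)
        idx = min(max(idx, 0), len(self.counts) - 1)
        self.counts[idx] += 1

    def estimate_count(self, a: float, b: float) -> float:
        """Approximate count of values in [a, b), assuming values are
        spread uniformly inside each bucket."""
        est = 0.0
        for i, c in enumerate(self.counts):
            b_lo = self.lo + i * self.width
            b_hi = b_lo + self.width
            overlap = max(0.0, min(b, b_hi) - max(a, b_lo))
            est += c * overlap / self.width
        return est
```

A range-count query posed after the raw elements have been discarded can still be answered from the bucket counts, with accuracy limited by the bucket width.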
Thank You!
To be continued