Title: Continuous Queries over Data Streams
1Continuous Queries over Data Streams
Vitaly Kroivets, Lyan Marina Presentation for
The Seminar on Database and InternetThe Hebrew
University of Jerusalem, Fall 2002
2Contents of the lecture
- Introduction
- Proposed Architecture of Data Stream Management
System - Research problems
- Query Optimization
- Bibliography
3Data Streams vs. Data Sets
- Data changed constantly (sometimes additions only)
- Old data required many times
- Mostly only freshest data used
- Example employees personal data table
- Examples financial tickers, data feeds from
sensors, network monitoring, etc
4Using Traditional Database
User/Application
Loader
5Data Streams Paradigm
User/Application
Stream Query Processor
6Data Streams Paradigm
User/Application
Result
Stream Query Processor
7What Is A Continuous Query ?
- Query which is issued once and logically run
continuously.
8What is Continuous Query ?
- Query which is issued once and run continuously.
Example detect abnormalities in network traffic
behavior in real-time and their cause -- like
link congestion due to hardware failure.
9What is Continuous Query ?
- Query which is issued once and run continuously.
More examples Continues queries used to support
load balancing, online automatic trading at Stock
Exchange
10Special Challenges
- Timely online answers even for rapid data
streams - Ability of fast access to large portions of
data - Processing of multiple streams simultaneously
11Making Things Concrete
BOB
ALICE
Outgoing (call_ID, caller, time, event)
Incoming (call_ID, callee, time, event)
event start or end
12Making Things Concrete
- Database two streams of mobile call records
- Outgoing(connectionID, caller, start, end)
- Incoming(connectionID, callee, start, end)
- Query language SQL
- FROM clauses can refer to streams and/or relations
13Query 1 (self-join)
- Find all outgoing calls longer than 2 minutes
- SELECT O1.call_ID, O1.caller
- FROM Outgoing O1, Outgoing O2
- WHERE (O2.time O1.time 2
- AND O1.call_ID O2.call_ID
- AND O1.event start
- AND O2.event end)
- Result requires unbounded storage
- Can provide result as data stream
- Can output after 2 min, without seeing end
14Query 2 (join)
- Pair up callers and callees
- SELECT O.caller, I.callee
- FROM Outgoing O, Incoming I
- WHERE O.call_ID I.call_ID
- Can still provide result as data stream
- Requires unbounded temporary storage
- unless streams are near-synchronized
15Query 3 (group-by aggregation)
- Total connection time for each caller
- SELECT O1.caller, sum(O2.time O1.time)
- FROM Outgoing O1, Outgoing O2
- WHERE (O1.call_ID O2.call_ID
- AND O1.event start
- AND O2.event end)
- GROUP BY O1.caller
- Cannot provide result in (append-only) stream.
- Alternatives
- Output stream with updates
- Provide current value on demand
- Keep answer in memory
16Conclusions
- Conventional DBMS technology is inadequate
- We need reconsider all aspects of data management
and processing in presence of data streams
17DBMS versus DSMS
- Transient streams (and persistent relations)
18DBMS versus DSMS
- Transient streams (and persistent relations)
19DBMS versus DSMS
- Transient streams (and persistent relations)
20DBMS versus DSMS
- Transient streams (and persistent relations)
- Access plan determined by query processor and
physical DB design
- Unpredictable data arrival and characteristics
21DBMS versus DSMS
- Transient streams (and persistent relations)
- Access plan determined by query processor and
physical DB design
- Unpredictable data arrival and characteristics
22Related work
- Tapestry system
- Content-based filtering of email messages.
Restricted subset of SQL append-only query
results - Cronicle data model
- Append-only ordered sequences of tuples
restricted view-definition language doesnt store
any cronicles - Alert system
- Event-condition Action triggers in
conventional SQL DB Continuous Queries over
append-only "active tables".
23Related workMaterialized Views
- Materialized Views are queries which need to be
reevaluated whenever database changes. - Materialized Views vs. Continuous Queries
- Continuous Queries
- May stream rather then store result
- May deal with append only relations
- May provide approximate answers
- Processing strategy may adapt characteristics of
data stream
24Architecture for continuous queries
Q
A? Answer
Data Stream
Continuous Query
- Single stream of tuples D, single continuous
Query Q - and Answer to the query A
- Q is issued once and operates continuously
25Architecture for continuous queries
Q
A? Answer
Data Stream
Continuous Query
- We consider data streams that adhere to the
relation model (i. e. streams of tuples),
although many of the ideas and techniques are
independent of the data model being considered
26Architecture for continuous queries
- Scenario 1 (simplest)
- Data stream D is append only - no updates or
deletions. How to handle Q? - 1) Always store current answer A to Q .
- D is of unbounded size A may be too.
- 2) Not to store A, but make new tuples in A
available as another continuous stream. - No need for unbounded storage for A, but
may need unbounded storage to determine new
tuples in A.
27Architecture for continuous queries
- Scenario 2
- Input stream is append-only, but may cause
updates and deletions in answer A. - May need to update/delete tuples in output
data stream -
- Scenario3 (most general)
- Input stream D includes updates and deletions.
- Much data of stream should be stored to
determine answer.
28Architecture for continuous queries
- How to solve?
- 1) Restrict expressiveness of Q.
- 2) Impose constrains on data stream to
- guarantee that answer to Q is bounded
- and amount of data needed to compute Q .
- 3) Provide approximate answer.
-
29Arcitecture for processing continuous queries
Stream
Stream 1
Stream 2
Store
Stream Query Processor
. . .
Scratch
Stream N
Throw
30Architecture for continuous queries
- STREAM is data stream containing tuples appended
to A. It is append-only stream (shouldnt include
updates/deletions) - STREAM and STORE define current answer A.
31Architecture for continuous queries
Stream
- When query Q is notified of new
- tuple t in a relevant data stream,
- it can perform number of actions,
- which are not mutually exclusive
- 1) t causes new tuples in A
- if tuple a will remain in A forever
- send a to STREAM
- 2) if a should be in A, but may be removed at
some moment add a to STORE
Stream Query Processor
Throw
Scratch
Store
Stream
32Architecture for continuous queries
Stream
- When query Q is notified of new tuple t in a
relevant - data stream, it can perform number of actions,
- which are not mutually exclusive
- 3) t may cause update or deletion
- of answer tuples in Store. Answer
- tuples may be moved from
- STORE to STREAM
- 4) May need to save t or derived
- data to ensure in future can compute
- query result send t to SCRATCH
Stream Query Processor
Throw
Scratch
Store
Stream
33Architecture for continuous queries
Stream
- When query Q is notified of new tuple t in a
relevant - data stream, it can perform number of actions,
- which are not mutually exclusive
- 5) t not needed and will not be
- needed. Send it to THROW
- (unless we like to archive it)
- 6) As a result of t we may move
- data from STORE or SCRATCH
- to THROW
Stream Query Processor
Throw
Scratch
Store
Stream
34Architecture for continuous queries
- Scenario1
- Data stream D is append only - no updates or
- deletions. Always store current answer A to Q .
STREAM empty STORE always contain A SCRATCH
contains whatever needed to to keep answer in
STORE up to date
35Architecture for continuous queries
- Scenario2
- Answer A exclusively as data stream D.
-
- STREAM stream answer A
- STORE empty
- SCRATCH contains whatever needed to to keep
answer in STORE up to date
36Architecture for continuous queries
- Scenario 3
- Input stream append only, answer A may have
- updates and deletions
- Example Q is group-by with Min aggregation
function.
- Answer A maintained in STORE
- SCRATCH is empty
37Architecture for continuous queries
- Scenario 4
- Input streams may include updates and
- deletions
- Unbounded storage required for SCRATCH
- to ensure that Min always will be computed
-
- Both in 3 and 4 data moved to STREAM only
whenever known that no further updates/deletions
etc of tuples of this group will occur.
38The Architecture and Related Work
- Implementing Triggers in terms of proposed
architecture (for launching triggered actions
assume actions performed by SQL
stored-procedures.) - STREAM and STORE empty.
- SCRATCH used for data required to moniotor
complex events - Benefits complex multitable events conditions
to be monitored - Trigger processing benefit from efficient data
management / processing - Techniques ( see below)
39The Architecture and Related Work
- Implementing Materialized views in terms of
- proposed architecture
- View itsef is maintained in STORE
- Base data in SCRATCH
- Data expiration to expedite cleanup of
- SCRATCH
- No way to ensure bounding of size of STORE and
SCRATCH
40End of Part I
41Research Problems
- Designing Query Language
- Online processing of rapid streams
- Approximation techniques
- Storage constrains vs. performance requirements
- Summarization
- Query Planning / Optimization
- Building good Query Plan
- Scheduling
- Sub-Plans Sharing
- Resource Management
- Adaptation
42Research Problems Languages for Continuous
Queries
- Bounding the size of scratch/store
- Open problem to determine for arbitrary SQL
query whether properties satisfied
43Query Language
- Query language allows both streams and
relations - Assumptions
- Streams
- Ordered
- Append-only
- Unbounded
- Multiple streams allowed
- Relations
- Unordered
- Support updates and deletions
44SQL ExtensionsFor Continuous Queries
- FROM allowed both to Streams and Relations
- Sliding Window for FROM clause (for streams)
- Optional "Partitioning" clause
- Mandatory "Window size"
- Optional "Filtering predicate"
45Windows specification
- Using ROWS
- ROWS 50 PRECEEDING
- Using RANGE
- RANGE 15 minutes PRECEEDING
46Example 1
Clients
DSMS
.NF
CL1
CL7
- S ( Client_id, URL, domain, time )
.il
CL2
CL5
Internet
Web Server
.com
CL3
CS web
Math web
CL4
47Example 1 (CQL)From with Range
- Stream "Requests" of requests to web server with
attributes - (client_id, URL, domain, time)
- Query counting number of request of pages from
domain cs.huji.ac.il in the last day - SELECT COUNT()
- FROM Request SRANGE 1 DAY PRECEEDING
- WHERE S.domain "cs.huji.ac.il"
48Partitioning Clause
- Partitions data in several groups
- Computes separate window for each group
- Merges windows into single result
- Is syntactically same as GROUP BY clause
- Example
49Example 2 Partition By
- How many pages served (only each clients 10 most
recent requests) by request from domain - CS.HUJI.AC.IL from CS website ?
- SELECT COUNT () FROM requests S
- PARTITION BY s.Client_id
- Rows 10 PRECEEDING
- Where s.Domain CS.HUJI.AC.IL
- Where s.URL LIKE 'http//cs.huji.Ac.Il/'
50Example 3 Join with relation
- Classify domain by primary type of web content
they serve -
- .ac.il EDUCATION
- .gov.il Government
- .co.il COMMERCE
- .com COMMERCE
- Count number of requests from "commerce" domains
out of last 10000 records - 10 sample of requests stream is used
51Example 3 (Cont.)
- SELECT COUNT () FROM
- (SELECT R.class
- FROM Requests S 10 SAMPLE , Domains R
- WHERE S.DomainR.Domain) T
- ROWS 10000 PRECEEDING
- WHERE T.class "commerce"
- Note stream of Requests is joined with Domains
relation resulting in stream T , before applying
sliding window
52Performance Challenge
- Multiple rapid incoming data streams
- Multiple complex queries with timeliness
requirements - Finite resources
53Solution Approximation
- Approximate answers
- Graceful degradation
- Maximize precision based on available resources
54 Approximation Static vs. Dynamic
- Queries modified at submission time to use fewer
resources - User guaranteed certain query behavior
- User can configure approximation mechanism
- Adaptation mechanisms not needed
- Queries modified at run time
- Not suitable for some applications
55Approximation Techniques
- Window Reduction
- Sampling rate reduction
- Summarization (Synopses)
56Window reduction
- Decreasing size of window
- Introduce Window where none was specified
originally
- May increase output rate (duplicate elimination
for example) - Must detect bad cases statically
- Affects resources used by operator
57Sampling rate reduction
- Introduce SAMPLE if not specified
- Reduce sampling rate
- will reduce output rate
- will not to influence resource requirements of
operation
58Summarization
- Summaries(data synopses) - concise representation
at expense of accuracy Sampling, Histograms
Wavelets
- How to make guaranties about query results based
on summaries ? - How to maintain efficiently in rapid data
streams ? - What summarization techniques are better ?
59Dynamic approximation Challenges
- Some apps will not tolerate unpredicted and
variable accuracy - Extend Language to specify tolerable imprecision
60Dynamic approximation techniques
- Synopses compression
- Sampling
- Load shedding
61Synopses compression
- Synopses concise representation at expense of
accuracy - Reducing memory overhead
- Methods
- histograms, Wavelets, etc
62Load shedding
- Drop tuples from queries, when they grow too
large - Drops chunks of tuples at time -- differs from
sampling, which eliminates probabilistically - load shedding -- biased, but easier to implement
63Query Plans How DSMS process Query?
Issues to consider
- Separate Query Plan for each Continuous Query
vs. one Mega-Query plan for all computations for
all users - Plan components may be shared
- Query registers before streams start to produce
data - How about adding queries over existing streams
- Queries over archived / discarded Data
64STREAM System Query Plans
- Query Operators
- Reads stream of tuples from set of input queues,
processes them, writes output tuples into single
output queue
Operator
Input Queue
Output Queue
Input Queue
65Query Plans (Cont.)
- Inter-Operator Queues
- Queues connect different operators and define
- tuples flow
- Synopses
- Summarizes tuples seen so far at intermediate
- operator as needed for future
66When Synopses Needed ?
- Join operator
- Must remember tuples seen so far on each of
input streams maintain synopses for each
- Filter operator (selection)
- Do not maintain state no need for synopses
67Example
Join of R,S, T
Selection Over Join of R and S
Query 1
Query 2
Q3 is Shared
Operator O3 (Join)
Operator O2 (select)
Synop4
Synop3
Queue3
Operator O1 (Join)
Synop1
Synop2
Queue 4
Queue1
Queue2
Scheduler
Stream T
Stream R
Stream S
68Explanations to Example
- Two plans (for Q1 and Q2) share a sub-plan
joining streams R and S by sharing it output
queue q3 - Execution of operators controlled by Global
Scheduler - When operator O scheduled, control passes to O
for period determined by number of tuples - Possible time-slice based scheduling
69Resource Sharing for Query Plans
- When Continuous Queries share common
sub-expressions - Similar to traditional DBMS
- Resource sharing and Approximation considered
separately - Do not share , if sharing introduces
approximation like merging sub-expressions with
different window sizes
70Implementation of Shared Queue
Op1
Op3
Op2
Op4
Shared Queue
t1
t2
t3
t4
t5
t6
t7
t8
- Queue maintains pointer to first unread tuple for
each operator - Discard tuple once they had been read by all
operators
71Resource Sharing (cont.)
- Base Data Stream accessed by multiple queries
shared as common sub-expression - Number of tuples in shared queue depends on
- Rate of addition to the queue
- Rate at which slowest operator consumes tuples
- Common sub-expression of 2 queries with very
different consumption rates
72Shared Queue Issues
P1 Heavy consumer
Stream
Operator J (Join)
Queue q
Stream
P2 Light consumer
- P1, P2 parents of operator J
- J will be scheduled frequently, for sake of P1
- J should be scheduled less frequently for P2 (to
avoid proliferation of tuples in q)
73Sub-Plan Sharing
- Formally proven
- sub-plan sharing may be sub-optimal for common
sub-expressions with joins - for common sub-expressions without joins sharing
is always preferable
74Synopses Sharing
- Issues to consider
- Which operator responsible to manage shared
synopses ? - Synopses required by different operators , how to
choose size of common synopses? - If synopses are identical, how to cope with
different consumption rates?
75Scheduling
- Objective for Scheduler
- Stream-based variation of response time
- Throughput
- Weighted fairness among queues
- Minimize intermediate queues sizes
- Granularity for Scheduler
- Max number of tuples consumed by operator
- Time-unit
- Parallelism in scheduling algorithm ?
76Scheduling Example
Op. O1
Op. O2
q1
q2
- O1 takes 1 time unit to operate on n tuples from
q1, with 20 selectivity, produces n/5 tuples in
q2
- O2 takes 1 time unit to operate on n/5 tuples
from q2, and it doesnt produces tuples.
77Scheduling Example (Cont.)
- Assume, average arrival rate on q1 is no more
than n per 2 time units queues are bounded - Arrivals may be bursty
- Possible scheduling strategies
- Algoritm1 (time-slicing)
- tuples processed 1 time unit by each operator.
- O1 consumes n units, O2 consumes n/5
- O1, O2
- Algoritm2 O1 operates until its queue empty,
afterwards O2
78Algorithm 1
Queue Size
2n tuples arrived
n tuples arrived
n tuples arrived
26
1
Time
1
2
3
4
5
6
7
8
- Orange Tuples in Q1
- Yellow Tuples in Q2
79Algorithm2
Queue Size
2n tuples arrived
n tuples arrived
n tuples arrived
Time
1
2
3
4
5
6
7
8
- Orange tuples in Q1
- Yellow Tuples in Q2
80Comparison. Which is better?
Total size of both queues
2n tuples arrived
n tuples arrived
n tuples arrived
26
1
Time
2
3
4
5
6
7
8
1
- Orange Algorithm1
- Yellow Algorithm2
81Greedy Scheduler Rule
- Schedule the operator that consumes largest
number of of tuples per time and is the most
selective (produces fewest tuples) - Operators with full batches in input queues are
favored over high priority operators with
under-full inputs (better utilization of
time-slice) - High-priority operator may be underutilized if
feeders are low priority consider chains of
operators
82Scheduling Algorithm Discussion
- Queue size minimization
- Increased time to initial results
- Strategy 1 would produce initial results faster
- Incorporate response time and weighted fairness
into algorithm - Flexible time-slices
- Taking context-switching into account
83Resource Management
- Relevant Resources
- Memory
- CPU
- I/O (if disk used)
- Network (in Distributed DSMS)
- Our Goal
- Maximize query precision by making best use
- of available resources and have a capability to
- do that dynamically and adaptively
84Resource Management Cont.
- Focus on memory used by synopses and
- queues
- Algorithms developed in STREAM
- Allocating memory to query plan
- Incorporating known constraints on input streams
to reduce synopses without compromising precision - Operator scheduling to minimize queue size
85Resource Management Approaches (Cont.)
- Exploiting constraints over data streams
- When additional information about streams is
available (gathered stats, constraint specs) --
reduce resource utilization with same result
precision
86Adaptation why?
- Adaptation
- Queries are long running
- Parameters
- Stream flow rate
- Stream data characteristics
- Environment (available RAM)
- may vary -- how to adapt?
87Exploiting Constraints over Data Streams
- Answering Requires synopses of unbounded size !
Query Q join , to monitor fulfillments delays
Synop-F
Synop-O
Order_IDItem_ID
O
F
Stream Orders
Stream Fulfillments
88Constraints (cont.)
- Tuples for given (orderID, itemID) arrive at
stream O before corresponding tuples arrive to F - No need to maintain a join synopses for F !!
- Another constrain tuples arrive at O clustered
by orderID - We need only to save tuples for given orderID,
until next orderID seen
More RAM needed for synapse
Ord1, item 1
Ord1, item 1
Ord1, item 2
Ord1, item 2
Ord1, item 3
Ord3, item 1
Ord1, item 4
Ord1, item 3
Ord3, item 1
Ord3, item 2
Ord3, item 4
Ord3, item 4
89Constraints (Cont.)
- Referential integrity
- Unique-value
- Clustered-Arrival
- Ordered-Arrival
90Summary
- Architecture for DSMS
- Query Language
- Common Design Problems
- Tradeoff efficiency, accuracy, storage
91References
- Continuous Queries over Data Streams by
S.Babu, J.Widom (Stanford University) - Query Processing, Approximation, and Resource
Management In a Data Stream Management System by
R.Motiwani, J.Widom and others (Stanford
University) - http//www.db.stanford.edu/stream
92 93Thank you