Continuous Queries over Data Streams

About This Presentation

Title:

Continuous Queries over Data Streams

Description:

... in network traffic behavior in real-time and their cause -- like link congestion ... to support load balancing, online automatic trading at Stock Exchange ... – PowerPoint PPT presentation

Number of Views:224

Avg rating:3.0/5.0

Slides: 94

Provided by: bor62

Category:

more less

Transcript and Presenter's Notes

Title: Continuous Queries over Data Streams

1
Continuous Queries over Data Streams
Vitaly Kroivets, Lyan Marina Presentation for
The Seminar on Database and InternetThe Hebrew
University of Jerusalem, Fall 2002
2
Contents of the lecture

Introduction
Proposed Architecture of Data Stream Management
System
Research problems
Query Optimization
Bibliography

3
Data Streams vs. Data Sets

Data Sets Data Streams

Updates infrequent

Data changed constantly (sometimes additions only)

Old data required many times

Mostly only freshest data used

Example employees personal data table

Examples financial tickers, data feeds from
sensors, network monitoring, etc

4
Using Traditional Database
User/Application
Loader
5
Data Streams Paradigm
User/Application
Stream Query Processor
6
Data Streams Paradigm
User/Application
Result
Stream Query Processor
7
What Is A Continuous Query ?

Query which is issued once and logically run
continuously.

8
What is Continuous Query ?

Query which is issued once and run continuously.

Example detect abnormalities in network traffic
behavior in real-time and their cause -- like
link congestion due to hardware failure.
9
What is Continuous Query ?

Query which is issued once and run continuously.

More examples Continues queries used to support
load balancing, online automatic trading at Stock
Exchange
10
Special Challenges

Timely online answers even for rapid data
streams
Ability of fast access to large portions of
data
Processing of multiple streams simultaneously

11
Making Things Concrete
BOB
ALICE
Outgoing (call_ID, caller, time, event)
Incoming (call_ID, callee, time, event)
event start or end
12
Making Things Concrete

Database two streams of mobile call records
Outgoing(connectionID, caller, start, end)
Incoming(connectionID, callee, start, end)
Query language SQL
FROM clauses can refer to streams and/or relations

13
Query 1 (self-join)

Find all outgoing calls longer than 2 minutes
SELECT O1.call_ID, O1.caller
FROM Outgoing O1, Outgoing O2
WHERE (O2.time O1.time 2
AND O1.call_ID O2.call_ID
AND O1.event start
AND O2.event end)
Result requires unbounded storage
Can provide result as data stream
Can output after 2 min, without seeing end

14
Query 2 (join)

Pair up callers and callees
SELECT O.caller, I.callee
FROM Outgoing O, Incoming I
WHERE O.call_ID I.call_ID
Can still provide result as data stream
Requires unbounded temporary storage
unless streams are near-synchronized

15
Query 3 (group-by aggregation)

Total connection time for each caller
SELECT O1.caller, sum(O2.time O1.time)
FROM Outgoing O1, Outgoing O2
WHERE (O1.call_ID O2.call_ID
AND O1.event start
AND O2.event end)
GROUP BY O1.caller
Cannot provide result in (append-only) stream.
Alternatives
Output stream with updates
Provide current value on demand
Keep answer in memory

16
Conclusions

Conventional DBMS technology is inadequate
We need reconsider all aspects of data management
and processing in presence of data streams

17
DBMS versus DSMS

Persistent relations

Transient streams (and persistent relations)

18
DBMS versus DSMS

Persistent relations

Transient streams (and persistent relations)

One-time queries

Continuous queries

19
DBMS versus DSMS

Persistent relations

Transient streams (and persistent relations)

One-time queries

Continuous queries

Random access

Sequential access

20
DBMS versus DSMS

Persistent relations

Transient streams (and persistent relations)

One-time queries

Continuous queries

Random access

Sequential access

Access plan determined by query processor and
physical DB design

Unpredictable data arrival and characteristics

21
DBMS versus DSMS

Persistent relations

Transient streams (and persistent relations)

One-time queries

Continuous queries

Random access

Sequential access

Access plan determined by query processor and
physical DB design

Unpredictable data arrival and characteristics

Unbounded disk store

Bounded main memory

22
Related work

Tapestry system
Content-based filtering of email messages.
Restricted subset of SQL append-only query
results
Cronicle data model
Append-only ordered sequences of tuples
restricted view-definition language doesnt store
any cronicles
Alert system
Event-condition Action triggers in
conventional SQL DB Continuous Queries over
append-only "active tables".

23
Related workMaterialized Views

Materialized Views are queries which need to be
reevaluated whenever database changes.
Materialized Views vs. Continuous Queries
Continuous Queries
May stream rather then store result
May deal with append only relations
May provide approximate answers
Processing strategy may adapt characteristics of
data stream

24
Architecture for continuous queries
Q
A? Answer

Data Stream
Continuous Query

Single stream of tuples D, single continuous
Query Q
and Answer to the query A
Q is issued once and operates continuously

25
Architecture for continuous queries
Q
A? Answer

Data Stream
Continuous Query

We consider data streams that adhere to the
relation model (i. e. streams of tuples),
although many of the ideas and techniques are
independent of the data model being considered

26
Architecture for continuous queries

Scenario 1 (simplest)
Data stream D is append only - no updates or
deletions. How to handle Q?
1) Always store current answer A to Q .
D is of unbounded size A may be too.
2) Not to store A, but make new tuples in A
available as another continuous stream.
No need for unbounded storage for A, but
may need unbounded storage to determine new
tuples in A.

27
Architecture for continuous queries

Scenario 2
Input stream is append-only, but may cause
updates and deletions in answer A.
May need to update/delete tuples in output
data stream
Scenario3 (most general)
Input stream D includes updates and deletions.
Much data of stream should be stored to
determine answer.

28
Architecture for continuous queries

How to solve?
1) Restrict expressiveness of Q.
2) Impose constrains on data stream to
guarantee that answer to Q is bounded
and amount of data needed to compute Q .
3) Provide approximate answer.

29
Arcitecture for processing continuous queries
Stream
Stream 1
Stream 2
Store
Stream Query Processor
. . .
Scratch
Stream N
Throw
30
Architecture for continuous queries

STREAM is data stream containing tuples appended
to A. It is append-only stream (shouldnt include
updates/deletions)
STREAM and STORE define current answer A.

31
Architecture for continuous queries
Stream

When query Q is notified of new
tuple t in a relevant data stream,
it can perform number of actions,
which are not mutually exclusive
1) t causes new tuples in A
if tuple a will remain in A forever
send a to STREAM
2) if a should be in A, but may be removed at
some moment add a to STORE

Stream Query Processor
Throw
Scratch
Store
Stream
32
Architecture for continuous queries
Stream

When query Q is notified of new tuple t in a
relevant
data stream, it can perform number of actions,
which are not mutually exclusive
3) t may cause update or deletion
of answer tuples in Store. Answer
tuples may be moved from
STORE to STREAM
4) May need to save t or derived
data to ensure in future can compute
query result send t to SCRATCH

Stream Query Processor
Throw
Scratch
Store
Stream
33
Architecture for continuous queries
Stream

When query Q is notified of new tuple t in a
relevant
data stream, it can perform number of actions,
which are not mutually exclusive
5) t not needed and will not be
needed. Send it to THROW
(unless we like to archive it)
6) As a result of t we may move
data from STORE or SCRATCH
to THROW

Stream Query Processor
Throw
Scratch
Store
Stream
34
Architecture for continuous queries

Scenario1
Data stream D is append only - no updates or
deletions. Always store current answer A to Q .

STREAM empty STORE always contain A SCRATCH
contains whatever needed to to keep answer in
STORE up to date
35
Architecture for continuous queries

Scenario2
Answer A exclusively as data stream D.

STREAM stream answer A
STORE empty
SCRATCH contains whatever needed to to keep
answer in STORE up to date

36
Architecture for continuous queries

Scenario 3
Input stream append only, answer A may have
updates and deletions
Example Q is group-by with Min aggregation
function.

Answer A maintained in STORE
SCRATCH is empty

37
Architecture for continuous queries

Scenario 4
Input streams may include updates and
deletions

Unbounded storage required for SCRATCH
to ensure that Min always will be computed
Both in 3 and 4 data moved to STREAM only
whenever known that no further updates/deletions
etc of tuples of this group will occur.

38
The Architecture and Related Work

Implementing Triggers in terms of proposed
architecture (for launching triggered actions
assume actions performed by SQL
stored-procedures.)
STREAM and STORE empty.
SCRATCH used for data required to moniotor
complex events
Benefits complex multitable events conditions
to be monitored
Trigger processing benefit from efficient data
management / processing
Techniques ( see below)

39
The Architecture and Related Work

Implementing Materialized views in terms of
proposed architecture
View itsef is maintained in STORE
Base data in SCRATCH
Data expiration to expedite cleanup of
SCRATCH
No way to ensure bounding of size of STORE and
SCRATCH

40
End of Part I
41
Research Problems

Designing Query Language
Online processing of rapid streams
Approximation techniques
Storage constrains vs. performance requirements
Summarization
Query Planning / Optimization
Building good Query Plan
Scheduling
Sub-Plans Sharing
Resource Management
Adaptation

42
Research Problems Languages for Continuous
Queries

Bounding the size of scratch/store
Open problem to determine for arbitrary SQL
query whether properties satisfied

43
Query Language

Query language allows both streams and
relations
Assumptions

Streams
Ordered
Append-only
Unbounded
Multiple streams allowed

Relations
Unordered
Support updates and deletions

44
SQL ExtensionsFor Continuous Queries

FROM allowed both to Streams and Relations
Sliding Window for FROM clause (for streams)
Optional "Partitioning" clause
Mandatory "Window size"
Optional "Filtering predicate"

45
Windows specification

Using ROWS
ROWS 50 PRECEEDING
Using RANGE
RANGE 15 minutes PRECEEDING

46
Example 1
Clients
DSMS
.NF
CL1
CL7

S ( Client_id, URL, domain, time )

.il
CL2
CL5
Internet
Web Server
.com
CL3
CS web
Math web
CL4
47
Example 1 (CQL)From with Range

Stream "Requests" of requests to web server with
attributes
(client_id, URL, domain, time)
Query counting number of request of pages from
domain cs.huji.ac.il in the last day
SELECT COUNT()
FROM Request SRANGE 1 DAY PRECEEDING
WHERE S.domain "cs.huji.ac.il"

48
Partitioning Clause

Partitions data in several groups
Computes separate window for each group
Merges windows into single result
Is syntactically same as GROUP BY clause
Example

49
Example 2 Partition By

How many pages served (only each clients 10 most
recent requests) by request from domain
CS.HUJI.AC.IL from CS website ?
SELECT COUNT () FROM requests S
PARTITION BY s.Client_id
Rows 10 PRECEEDING
Where s.Domain CS.HUJI.AC.IL
Where s.URL LIKE 'http//cs.huji.Ac.Il/'

50
Example 3 Join with relation

Classify domain by primary type of web content
they serve
.ac.il EDUCATION
.gov.il Government
.co.il COMMERCE
.com COMMERCE
Count number of requests from "commerce" domains
out of last 10000 records
10 sample of requests stream is used

51
Example 3 (Cont.)

SELECT COUNT () FROM
(SELECT R.class
FROM Requests S 10 SAMPLE , Domains R
WHERE S.DomainR.Domain) T
ROWS 10000 PRECEEDING
WHERE T.class "commerce"
Note stream of Requests is joined with Domains
relation resulting in stream T , before applying
sliding window

52
Performance Challenge

Multiple rapid incoming data streams
Multiple complex queries with timeliness
requirements
Finite resources

53
Solution Approximation

Approximate answers
Graceful degradation
Maximize precision based on available resources

54
Approximation Static vs. Dynamic

Queries modified at submission time to use fewer
resources
User guaranteed certain query behavior
User can configure approximation mechanism
Adaptation mechanisms not needed

Queries modified at run time
Not suitable for some applications

55
Approximation Techniques

Window Reduction
Sampling rate reduction
Summarization (Synopses)

56
Window reduction

Decreasing size of window
Introduce Window where none was specified
originally

May increase output rate (duplicate elimination
for example)
Must detect bad cases statically
Affects resources used by operator

57
Sampling rate reduction

Introduce SAMPLE if not specified
Reduce sampling rate

will reduce output rate
will not to influence resource requirements of
operation

58
Summarization

Summaries(data synopses) - concise representation
at expense of accuracy Sampling, Histograms
Wavelets

How to make guaranties about query results based
on summaries ?
How to maintain efficiently in rapid data
streams ?
What summarization techniques are better ?

59
Dynamic approximation Challenges

Some apps will not tolerate unpredicted and
variable accuracy
Extend Language to specify tolerable imprecision

60
Dynamic approximation techniques

Synopses compression
Sampling
Load shedding

61
Synopses compression

Synopses concise representation at expense of
accuracy
Reducing memory overhead
Methods
histograms, Wavelets, etc

62
Load shedding

Drop tuples from queries, when they grow too
large
Drops chunks of tuples at time -- differs from
sampling, which eliminates probabilistically
load shedding -- biased, but easier to implement

63
Query Plans How DSMS process Query?
Issues to consider

Separate Query Plan for each Continuous Query
vs. one Mega-Query plan for all computations for
all users
Plan components may be shared
Query registers before streams start to produce
data
How about adding queries over existing streams
Queries over archived / discarded Data

64
STREAM System Query Plans

Query Operators
Reads stream of tuples from set of input queues,
processes them, writes output tuples into single
output queue

Operator
Input Queue
Output Queue
Input Queue
65
Query Plans (Cont.)

Inter-Operator Queues
Queues connect different operators and define
tuples flow
Synopses
Summarizes tuples seen so far at intermediate
operator as needed for future

66
When Synopses Needed ?

Join operator
Must remember tuples seen so far on each of
input streams maintain synopses for each

Filter operator (selection)
Do not maintain state no need for synopses

67
Example
Join of R,S, T
Selection Over Join of R and S
Query 1
Query 2
Q3 is Shared
Operator O3 (Join)
Operator O2 (select)
Synop4
Synop3
Queue3
Operator O1 (Join)
Synop1
Synop2
Queue 4
Queue1
Queue2
Scheduler
Stream T
Stream R
Stream S
68
Explanations to Example

Two plans (for Q1 and Q2) share a sub-plan
joining streams R and S by sharing it output
queue q3
Execution of operators controlled by Global
Scheduler
When operator O scheduled, control passes to O
for period determined by number of tuples
Possible time-slice based scheduling

69
Resource Sharing for Query Plans

When Continuous Queries share common
sub-expressions
Similar to traditional DBMS
Resource sharing and Approximation considered
separately
Do not share , if sharing introduces
approximation like merging sub-expressions with
different window sizes

70
Implementation of Shared Queue
Op1
Op3
Op2
Op4
Shared Queue
t1
t2
t3
t4
t5
t6
t7
t8

Queue maintains pointer to first unread tuple for
each operator
Discard tuple once they had been read by all
operators

71
Resource Sharing (cont.)

Base Data Stream accessed by multiple queries
shared as common sub-expression
Number of tuples in shared queue depends on
Rate of addition to the queue
Rate at which slowest operator consumes tuples
Common sub-expression of 2 queries with very
different consumption rates

72
Shared Queue Issues
P1 Heavy consumer
Stream
Operator J (Join)
Queue q
Stream
P2 Light consumer

P1, P2 parents of operator J
J will be scheduled frequently, for sake of P1
J should be scheduled less frequently for P2 (to
avoid proliferation of tuples in q)

73
Sub-Plan Sharing

Formally proven
sub-plan sharing may be sub-optimal for common
sub-expressions with joins
for common sub-expressions without joins sharing
is always preferable

74
Synopses Sharing

Issues to consider
Which operator responsible to manage shared
synopses ?
Synopses required by different operators , how to
choose size of common synopses?
If synopses are identical, how to cope with
different consumption rates?

75
Scheduling

Objective for Scheduler
Stream-based variation of response time
Throughput
Weighted fairness among queues
Minimize intermediate queues sizes

Granularity for Scheduler
Max number of tuples consumed by operator
Time-unit
Parallelism in scheduling algorithm ?

76
Scheduling Example
Op. O1
Op. O2
q1
q2

O1 takes 1 time unit to operate on n tuples from
q1, with 20 selectivity, produces n/5 tuples in
q2

O2 takes 1 time unit to operate on n/5 tuples
from q2, and it doesnt produces tuples.

77
Scheduling Example (Cont.)

Assume, average arrival rate on q1 is no more
than n per 2 time units queues are bounded
Arrivals may be bursty

Possible scheduling strategies
Algoritm1 (time-slicing)
tuples processed 1 time unit by each operator.
O1 consumes n units, O2 consumes n/5
O1, O2
Algoritm2 O1 operates until its queue empty,
afterwards O2

78
Algorithm 1
Queue Size
2n tuples arrived
n tuples arrived
n tuples arrived
26
1
Time
1
2
3
4
5
6
7
8

Orange Tuples in Q1
Yellow Tuples in Q2

79
Algorithm2
Queue Size
2n tuples arrived
n tuples arrived
n tuples arrived
Time
1
2
3
4
5
6
7
8

Orange tuples in Q1
Yellow Tuples in Q2

80
Comparison. Which is better?
Total size of both queues
2n tuples arrived
n tuples arrived
n tuples arrived
26
1
Time
2
3
4
5
6
7
8
1

Orange Algorithm1
Yellow Algorithm2

81
Greedy Scheduler Rule

Schedule the operator that consumes largest
number of of tuples per time and is the most
selective (produces fewest tuples)
Operators with full batches in input queues are
favored over high priority operators with
under-full inputs (better utilization of
time-slice)
High-priority operator may be underutilized if
feeders are low priority consider chains of
operators

82
Scheduling Algorithm Discussion

Queue size minimization
Increased time to initial results
Strategy 1 would produce initial results faster
Incorporate response time and weighted fairness
into algorithm
Flexible time-slices
Taking context-switching into account

83
Resource Management

Relevant Resources
Memory
CPU
I/O (if disk used)
Network (in Distributed DSMS)

Our Goal
Maximize query precision by making best use
of available resources and have a capability to
do that dynamically and adaptively

84
Resource Management Cont.

Focus on memory used by synopses and
queues
Algorithms developed in STREAM

Allocating memory to query plan
Incorporating known constraints on input streams
to reduce synopses without compromising precision
Operator scheduling to minimize queue size

85
Resource Management Approaches (Cont.)

Exploiting constraints over data streams
When additional information about streams is
available (gathered stats, constraint specs) --
reduce resource utilization with same result
precision

86
Adaptation why?

Adaptation
Queries are long running
Parameters
Stream flow rate
Stream data characteristics
Environment (available RAM)
may vary -- how to adapt?

87
Exploiting Constraints over Data Streams

Answering Requires synopses of unbounded size !

Query Q join , to monitor fulfillments delays
Synop-F
Synop-O
Order_IDItem_ID
O
F
Stream Orders
Stream Fulfillments
88
Constraints (cont.)

Tuples for given (orderID, itemID) arrive at
stream O before corresponding tuples arrive to F
No need to maintain a join synopses for F !!
Another constrain tuples arrive at O clustered
by orderID
We need only to save tuples for given orderID,
until next orderID seen

More RAM needed for synapse
Ord1, item 1
Ord1, item 1
Ord1, item 2
Ord1, item 2
Ord1, item 3
Ord3, item 1
Ord1, item 4
Ord1, item 3
Ord3, item 1
Ord3, item 2
Ord3, item 4
Ord3, item 4
89
Constraints (Cont.)

Referential integrity
Unique-value
Clustered-Arrival
Ordered-Arrival

90
Summary

Architecture for DSMS
Query Language
Common Design Problems
Tradeoff efficiency, accuracy, storage

91
References

Continuous Queries over Data Streams by
S.Babu, J.Widom (Stanford University)
Query Processing, Approximation, and Resource
Management In a Data Stream Management System by
R.Motiwani, J.Widom and others (Stanford
University)
http//www.db.stanford.edu/stream

Continuous Queries over Data Streams - PowerPoint PPT Presentation

Continuous Queries over Data Streams

... in network traffic behavior in real-time and their cause -- like link congestion ... to support load balancing, online automatic trading at Stock Exchange ... – PowerPoint PPT presentation