1
Continuous Queries over Data Streams
Vitaly Kroivets, Lyan Marina
Presentation for the Seminar on Database and Internet
The Hebrew University of Jerusalem, Fall 2002
2
Contents of the lecture
  • Introduction
  • Proposed Architecture of Data Stream Management
    System
  • Research problems
  • Query Optimization
  • Bibliography

3
Data Streams vs. Data Sets
  • Data Sets: updates are infrequent
  • Data Streams: data changes constantly (sometimes
    additions only)
  • Data Sets: old data is required many times
  • Data Streams: mostly only the freshest data is used
  • Data Sets example: employees' personal data table
  • Data Streams examples: financial tickers, data feeds
    from sensors, network monitoring, etc.

4
Using Traditional Database
[Diagram: User/Application, Loader]
5
Data Streams Paradigm
[Diagram: User/Application, Stream Query Processor]
6
Data Streams Paradigm
[Diagram: User/Application, Result, Stream Query Processor]
7
What Is a Continuous Query?
  • A query that is issued once and then logically runs
    continuously.

8
What Is a Continuous Query?
  • A query that is issued once and then runs continuously.

Example: detect abnormalities in network traffic
behavior in real time, together with their cause, such
as link congestion due to hardware failure.
9
What Is a Continuous Query?
  • A query that is issued once and then runs continuously.

More examples: continuous queries are used to support
load balancing and online automatic trading at a stock
exchange.
10
Special Challenges
  • Timely online answers even for rapid data
    streams
  • Fast access to large portions of data
  • Processing of multiple streams simultaneously

11
Making Things Concrete
[Diagram: BOB, ALICE]
Outgoing (call_ID, caller, time, event)
Incoming (call_ID, callee, time, event)
event = start or end
12
Making Things Concrete
  • Database: two streams of mobile call records
  • Outgoing(call_ID, caller, time, event)
  • Incoming(call_ID, callee, time, event)
  • Query language: SQL
  • FROM clauses can refer to streams and/or relations

13
Query 1 (self-join)
  • Find all outgoing calls longer than 2 minutes
      SELECT O1.call_ID, O1.caller
      FROM Outgoing O1, Outgoing O2
      WHERE (O2.time - O1.time > 2
        AND O1.call_ID = O2.call_ID
        AND O1.event = start
        AND O2.event = end)
  • Result requires unbounded storage
  • Can provide result as data stream
  • Can output after 2 min, without seeing end
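
The last point, emitting a call as soon as it has lasted more than 2 minutes, can be sketched in Python. This is only an illustration under stated assumptions: events arrive in timestamp order, time is measured in minutes, and the function name `long_calls` and the tuple layout are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumed tuple layout, following the Outgoing schema:
# (call_ID, caller, time_in_minutes, event)
Event = Tuple[int, str, float, str]


def long_calls(outgoing: Iterable[Event],
               limit: float = 2.0) -> Iterator[Tuple[int, str]]:
    """Yield (call_ID, caller) once a call is known to exceed `limit` minutes."""
    open_calls: Dict[int, Tuple[str, float]] = {}  # call_ID -> (caller, start)
    for call_id, caller, time, event in outgoing:
        # Any open call that started more than `limit` minutes ago already
        # qualifies, so it is emitted without waiting for its end event.
        overdue = [c for c, (_, start) in open_calls.items()
                   if time - start > limit]
        for cid in overdue:
            yield cid, open_calls.pop(cid)[0]
        if event == "start":
            open_calls[call_id] = (caller, time)
        elif event == "end":
            # Short call, or a long call that was already emitted above.
            open_calls.pop(call_id, None)
```

The only state kept is the set of currently open calls; the answer itself is streamed out rather than stored.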

14
Query 2 (join)
  • Pair up callers and callees
      SELECT O.caller, I.callee
      FROM Outgoing O, Incoming I
      WHERE O.call_ID = I.call_ID
  • Can still provide result as data stream
  • Requires unbounded temporary storage, unless streams
    are near-synchronized
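
A minimal Python sketch of a symmetric hash join shows where the temporary storage comes from. It is a simplification, not the query exactly as written: it assumes one record per call_ID on each stream, and that both streams are interleaved into a single iterable tagged with "O" or "I". The name `pair_up` is hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def pair_up(merged: Iterable[Tuple[str, int, str]]) -> Iterator[Tuple[str, str]]:
    """Join Outgoing and Incoming on call_ID.

    `merged` interleaves both streams as (source, call_ID, party),
    where source is "O" (party is the caller) or "I" (party is the callee).
    """
    waiting_callers: Dict[int, str] = {}  # unmatched Outgoing tuples
    waiting_callees: Dict[int, str] = {}  # unmatched Incoming tuples
    for source, call_id, party in merged:
        if source == "O":
            if call_id in waiting_callees:
                yield party, waiting_callees.pop(call_id)
            else:
                waiting_callers[call_id] = party
        else:
            if call_id in waiting_callers:
                yield waiting_callers.pop(call_id), party
            else:
                waiting_callees[call_id] = party
```

If the two streams are near-synchronized, the two dictionaries stay small; if one stream lags arbitrarily far behind, they can grow without bound.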

15
Query 3 (group-by aggregation)
  • Total connection time for each caller
      SELECT O1.caller, sum(O2.time - O1.time)
      FROM Outgoing O1, Outgoing O2
      WHERE (O1.call_ID = O2.call_ID
        AND O1.event = start
        AND O2.event = end)
      GROUP BY O1.caller
  • Cannot provide result in (append-only) stream
  • Alternatives:
  • Output stream with updates
  • Provide current value on demand
  • Keep answer in memory
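
The "output stream with updates" alternative can be sketched as follows. This is a hedged illustration: each yielded pair is meant to replace the previous value for that caller, and the function name `connection_time_updates` and the tuple layout are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Assumed tuple layout: (call_ID, caller, time, event)
Event = Tuple[int, str, float, str]


def connection_time_updates(outgoing: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Yield (caller, new_total) every time a caller's total changes."""
    totals: Dict[str, float] = defaultdict(float)  # answer kept in memory
    started: Dict[int, float] = {}                 # start time of open calls
    for call_id, caller, time, event in outgoing:
        if event == "start":
            started[call_id] = time
        elif event == "end" and call_id in started:
            totals[caller] += time - started.pop(call_id)
            # An update, not an append: it supersedes earlier output
            # for the same caller.
            yield caller, totals[caller]
```

The same `totals` dictionary could instead be read on demand, which corresponds to the "provide current value on demand" alternative.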

16
Conclusions
  • Conventional DBMS technology is inadequate
  • We need to reconsider all aspects of data management
    and processing in the presence of data streams

21
DBMS versus DSMS
  • DBMS: persistent relations.
    DSMS: transient streams (and persistent relations).
  • DBMS: one-time queries.
    DSMS: continuous queries.
  • DBMS: random access.
    DSMS: sequential access.
  • DBMS: access plan determined by query processor and
    physical DB design.
    DSMS: unpredictable data arrival and characteristics.
  • DBMS: unbounded disk store.
    DSMS: bounded main memory.

22
Related work
  • Tapestry system
  • Content-based filtering of email messages;
    restricted subset of SQL; append-only query
    results
  • Chronicle data model
  • Append-only ordered sequences of tuples;
    restricted view-definition language; does not
    store any chronicles
  • Alert system
  • Event-condition-action triggers in a
    conventional SQL DB; continuous queries over
    append-only "active tables"

23
Related work: Materialized Views
  • Materialized views are queries that need to be
    re-evaluated whenever the database changes.
  • Materialized Views vs. Continuous Queries
  • Continuous Queries:
  • May stream rather than store the result
  • May deal with append-only relations
  • May provide approximate answers
  • Processing strategy may adapt to characteristics of
    the data stream

24
Architecture for continuous queries
[Diagram: Data Stream, Continuous Query Q, Answer A]
  • Single stream of tuples D, single continuous
    query Q, and answer A to the query
  • Q is issued once and operates continuously

25
Architecture for continuous queries
[Diagram: Data Stream, Continuous Query Q, Answer A]
  • We consider data streams that adhere to the
    relational model (i.e., streams of tuples),
    although many of the ideas and techniques are
    independent of the data model being considered

26
Architecture for continuous queries
  • Scenario 1 (simplest)
  • Data stream D is append only - no updates or
    deletions. How to handle Q?
  • 1) Always store the current answer A to Q.
  • D is of unbounded size, so A may be too.
  • 2) Do not store A, but make new tuples in A
    available as another continuous stream.
  • No need for unbounded storage for A, but
    may need unbounded storage to determine new
    tuples in A.

27
Architecture for continuous queries
  • Scenario 2
  • Input stream is append-only, but may cause
    updates and deletions in answer A.
  • May need to update/delete tuples in output
    data stream
  • Scenario 3 (most general)
  • Input stream D includes updates and deletions.
  • Much of the stream's data must be stored to
    determine the answer.

28
Architecture for continuous queries
  • How to solve?
  • 1) Restrict the expressiveness of Q.
  • 2) Impose constraints on the data stream to
    guarantee that the answer to Q, and the amount of
    data needed to compute Q, are bounded.
  • 3) Provide an approximate answer.

29
Architecture for processing continuous queries
[Diagram: input streams Stream 1, Stream 2, ..., Stream N
enter the Stream Query Processor, which is connected to
Stream, Store, Scratch, and Throw]
30
Architecture for continuous queries
  • STREAM is a data stream containing tuples appended
    to A. It is an append-only stream (it should not
    include updates or deletions)
  • STREAM and STORE together define the current answer A.

31
Architecture for continuous queries
  • When query Q is notified of a new tuple t in a
    relevant data stream, it can perform a number of
    actions, which are not mutually exclusive:
  • 1) t causes new tuples in A: if a tuple a will
    remain in A forever, send a to STREAM
  • 2) If a should be in A, but may be removed at
    some moment, add a to STORE

[Diagram: Stream Query Processor with Stream, Store,
Scratch, and Throw]
32
Architecture for continuous queries
  • When query Q is notified of a new tuple t in a
    relevant data stream, it can perform a number of
    actions, which are not mutually exclusive:
  • 3) t may cause an update or deletion of answer
    tuples in STORE. Answer tuples may be moved from
    STORE to STREAM
  • 4) t, or data derived from it, may need to be saved
    so that the query result can be computed in the
    future: send t to SCRATCH

[Diagram: Stream Query Processor with Stream, Store,
Scratch, and Throw]
33
Architecture for continuous queries
  • When query Q is notified of a new tuple t in a
    relevant data stream, it can perform a number of
    actions, which are not mutually exclusive:
  • 5) t is not needed and will not be needed: send it
    to THROW (unless we would like to archive it)
  • 6) As a result of t, we may move data from STORE or
    SCRATCH to THROW

[Diagram: Stream Query Processor with Stream, Store,
Scratch, and Throw]
34
Architecture for continuous queries
  • Scenario 1
  • Data stream D is append-only: no updates or
    deletions. Always store the current answer A to Q.
  • STREAM is empty
  • STORE always contains A
  • SCRATCH contains whatever is needed to keep the
    answer in STORE up to date
35
Architecture for continuous queries
  • Scenario 2
  • Answer A is provided exclusively as a data stream.
  • STREAM streams the answer A
  • STORE is empty
  • SCRATCH contains whatever is needed to determine
    new tuples in A

36
Architecture for continuous queries
  • Scenario 3
  • Input stream is append-only; answer A may have
    updates and deletions
  • Example: Q is a group-by with a Min aggregation
    function.
  • Answer A maintained in STORE
  • SCRATCH is empty
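
A small Python sketch of this scenario, assuming an append-only input of (group, value) pairs. The class name `MinPerGroup` is hypothetical, and the THROW list is kept only so the example can be inspected; a real system would simply discard those tuples.

```python
from typing import Dict, Hashable, List, Tuple


class MinPerGroup:
    """Group-by with Min over an append-only stream of (group, value)."""

    def __init__(self) -> None:
        # STORE: the current answer A, one tuple per group.
        self.store: Dict[Hashable, float] = {}
        # THROW: tuples that are never needed again (kept for illustration).
        self.throw: List[Tuple[Hashable, float]] = []
        # SCRATCH stays empty: the current minimum is all the state needed.

    def on_tuple(self, group: Hashable, value: float) -> None:
        current = self.store.get(group)
        if current is None or value < current:
            if current is not None:
                # The old answer tuple is replaced, i.e. an update in A.
                self.throw.append((group, current))
            self.store[group] = value
        else:
            # A value that is not a new minimum can never affect the answer.
            self.throw.append((group, value))
```

Because the input never deletes anything, a discarded value cannot become the minimum later, which is why no SCRATCH space is required here.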

37
Architecture for continuous queries
  • Scenario 4
  • Input streams may include updates and deletions
  • Unbounded storage is required for SCRATCH to ensure
    that Min can always be computed
  • In both Scenarios 3 and 4, data is moved to STREAM
    only once it is known that no further updates or
    deletions of tuples in this group will occur.

38
The Architecture and Related Work
  • Implementing triggers in terms of the proposed
    architecture (for launching triggered actions,
    assume the actions are performed by SQL stored
    procedures)
  • STREAM and STORE are empty.
  • SCRATCH is used for data required to monitor
    complex events
  • Benefits: complex multi-table events and conditions
    can be monitored
  • Trigger processing benefits from efficient data
    management and processing techniques (see below)

39
The Architecture and Related Work
  • Implementing materialized views in terms of the
    proposed architecture
  • The view itself is maintained in STORE
  • Base data is kept in SCRATCH
  • Data expiration can expedite cleanup of SCRATCH
  • No way to guarantee a bound on the size of STORE and
    SCRATCH

40
End of Part I
41
Research Problems
  • Designing Query Language
  • Online processing of rapid streams
  • Approximation techniques
  • Storage constraints vs. performance requirements
  • Summarization
  • Query Planning / Optimization
  • Building good Query Plan
  • Scheduling
  • Sub-Plans Sharing
  • Resource Management
  • Adaptation

42
Research Problems: Languages for Continuous
Queries
  • Bounding the size of SCRATCH and STORE
  • Open problem: determining, for an arbitrary SQL
    query, whether these properties are satisfied

43
Query Language
  • Query language allows both streams and
    relations
  • Assumptions:
  • Streams: ordered, append-only, unbounded; multiple
    streams allowed
  • Relations: unordered; support updates and deletions

44
SQL Extensions for Continuous Queries
  • FROM may refer to both streams and relations
  • Sliding Window for FROM clause (for streams)
  • Optional "Partitioning" clause
  • Mandatory "Window size"
  • Optional "Filtering predicate"

45
Windows specification
  • Using ROWS
  • ROWS 50 PRECEDING
  • Using RANGE
  • RANGE 15 minutes PRECEDING
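
The two window kinds can be sketched in Python with a deque. The class names are hypothetical, and the boundary handling of the time-based window (a tuple exactly `span` old is kept) is an assumption, not something the slides specify.

```python
from collections import deque
from typing import Any, Deque, Tuple


class RowsWindow:
    """Keeps the most recent `size` tuples (ROWS size PRECEDING)."""

    def __init__(self, size: int) -> None:
        # A bounded deque drops the oldest tuple automatically.
        self.items: Deque[Any] = deque(maxlen=size)

    def insert(self, item: Any) -> None:
        self.items.append(item)


class RangeWindow:
    """Keeps tuples no older than `span` time units (RANGE span PRECEDING)."""

    def __init__(self, span: float) -> None:
        self.span = span
        self.items: Deque[Tuple[float, Any]] = deque()

    def insert(self, timestamp: float, item: Any) -> None:
        # Assumes timestamps arrive in non-decreasing order.
        self.items.append((timestamp, item))
        while self.items and self.items[0][0] < timestamp - self.span:
            self.items.popleft()
```

A row-based window has a fixed memory footprint, whereas the size of a time-based window depends on the arrival rate of the stream.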

46
Example 1
[Diagram: clients (CL1, CL2, CL3, CL4, CL5, CL7); domains
(.NF, .il, .com); Internet; Web Server (CS web, Math
web); DSMS]
  • S (Client_id, URL, domain, time)
47
Example 1 (CQL): FROM with RANGE
  • Stream "Requests" of requests to a web server, with
    attributes
  • (client_id, URL, domain, time)
  • Query: count the number of requests for pages from
    domain cs.huji.ac.il in the last day
      SELECT COUNT(*)
      FROM Requests S [RANGE 1 DAY PRECEDING]
      WHERE S.domain = 'cs.huji.ac.il'

48
Partitioning Clause
  • Partitions data in several groups
  • Computes separate window for each group
  • Merges windows into single result
  • Is syntactically the same as the GROUP BY clause
  • Example

49
Example 2: PARTITION BY
  • Considering only each client's 10 most recent
    requests from domain cs.huji.ac.il, how many were
    for pages served by the CS web site?
      SELECT COUNT(*)
      FROM Requests S
        [PARTITION BY S.client_id
         ROWS 10 PRECEDING
         WHERE S.domain = 'cs.huji.ac.il']
      WHERE S.URL LIKE 'http://cs.huji.ac.il/%'
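
One way to read the partitioned window is sketched below in Python: a separate row window per client, filtered by domain, then merged and counted. It evaluates the count once over a finite input, whereas a continuous query would re-evaluate as tuples arrive. The function name `count_cs_pages` and its parameters are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Tuple

# Assumed tuple layout, following the stream schema:
# (client_id, URL, domain, time)
Request = Tuple[str, str, str, float]


def count_cs_pages(requests: Iterable[Request], rows: int = 10,
                   domain: str = "cs.huji.ac.il",
                   url_prefix: str = "http://cs.huji.ac.il/") -> int:
    """Count windowed requests whose URL starts with `url_prefix`."""
    # One window per client_id, each holding at most `rows` URLs.
    windows: Dict[str, Deque[str]] = defaultdict(lambda: deque(maxlen=rows))
    for client_id, url, req_domain, _time in requests:
        if req_domain == domain:            # filtering predicate of the window
            windows[client_id].append(url)
    # Merge the per-client windows, then apply the outer WHERE and COUNT.
    return sum(url.startswith(url_prefix)
               for window in windows.values() for url in window)
```

Memory use grows with the number of distinct clients, since each partition keeps its own window.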

50
Example 3 Join with relation
  • Classify domain by primary type of web content
    they serve
  • .ac.il: education
  • .gov.il: government
  • .co.il: commerce
  • .com: commerce
  • Count the number of requests from "commerce" domains
    among the last 10,000 records
  • A 10% sample of the requests stream is used

51
Example 3 (Cont.)
      SELECT COUNT(*)
      FROM (SELECT R.class
            FROM Requests S 10% SAMPLE, Domains R
            WHERE S.domain = R.domain) T
           [ROWS 10000 PRECEDING]
      WHERE T.class = 'commerce'
  • Note: the Requests stream is joined with the Domains
    relation, resulting in stream T, before the sliding
    window is applied
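
The order of operations in this query (sample, join with the relation, then window) can be sketched in Python. This is an assumption-laden illustration: the Domains relation is modeled as a dictionary from domain to class, only the domain attribute of each request is passed in, and the function name `commerce_counts` is hypothetical.

```python
import random
from collections import deque
from typing import Deque, Dict, Iterable, Iterator, Optional


def commerce_counts(domains: Iterable[str], domain_class: Dict[str, str],
                    sample_rate: float = 0.1, rows: int = 10000,
                    rng: Optional[random.Random] = None) -> Iterator[int]:
    """Yield the running count of 'commerce' tuples in the window over T."""
    rng = rng or random.Random()
    window: Deque[str] = deque(maxlen=rows)  # last `rows` tuples of stream T
    commerce = 0
    for domain in domains:
        if rng.random() >= sample_rate:
            continue                  # request is not part of the sample
        cls = domain_class.get(domain)
        if cls is None:
            continue                  # no join partner in the Domains relation
        if len(window) == rows and window[0] == "commerce":
            commerce -= 1             # oldest tuple is about to slide out
        window.append(cls)
        if cls == "commerce":
            commerce += 1
        yield commerce
```

Because the window is applied to the joined stream T, requests without a matching Domains row never occupy a window slot.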

52
Performance Challenge
  • Multiple rapid incoming data streams
  • Multiple complex queries with timeliness
    requirements
  • Finite resources

53
Solution: Approximation
  • Approximate answers
  • Graceful degradation
  • Maximize precision based on available resources

54
Approximation: Static vs. Dynamic
  • Static: queries are modified at submission time to
    use fewer resources
  • User is guaranteed certain query behavior
  • User can configure the approximation mechanism
  • Adaptation mechanisms are not needed
  • Dynamic: queries are modified at run time
  • Not suitable for some applications

55
Approximation Techniques
  • Window Reduction
  • Sampling rate reduction
  • Summarization (Synopses)

56
Window reduction
  • Decreasing size of window
  • Introduce Window where none was specified
    originally
  • May increase output rate (duplicate elimination
    for example)
  • Must detect bad cases statically
  • Affects resources used by operator

57
Sampling rate reduction
  • Introduce SAMPLE if not specified
  • Reduce sampling rate
  • Will reduce the output rate
  • Will not influence the resource requirements of the
    operation

58
Summarization
  • Summaries (data synopses): concise representations
    at the expense of accuracy, e.g. sampling,
    histograms, wavelets
  • How to make guarantees about query results based
    on summaries?
  • How to maintain them efficiently over rapid data
    streams?
  • Which summarization techniques are better?

59
Dynamic approximation: Challenges
  • Some apps will not tolerate unpredicted and
    variable accuracy
  • Extend Language to specify tolerable imprecision

60
Dynamic approximation techniques
  • Synopses compression
  • Sampling
  • Load shedding

61
Synopses compression
  • Synopses: concise representations at the expense of
    accuracy
  • Reducing memory overhead
  • Methods: histograms, wavelets, etc.

62
Load shedding
  • Drop tuples from queues when they grow too
    large
  • Drops chunks of tuples at a time; this differs from
    sampling, which eliminates tuples probabilistically
  • Load shedding is biased, but easier to implement
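
The contrast between the two techniques can be sketched in Python. Both functions are hypothetical illustrations; in particular, dropping the oldest tuples first is an assumption, since the slides do not say which chunk is shed.

```python
import random
from typing import Any, List, Sequence


def sample(tuples: Sequence[Any], rate: float, rng: random.Random) -> List[Any]:
    """Sampling: keep each tuple independently with probability `rate`."""
    return [t for t in tuples if rng.random() < rate]


def shed_load(queue: List[Any], capacity: int, chunk: int) -> List[Any]:
    """Load shedding: while the queue is over capacity, drop a whole chunk."""
    if chunk < 1:
        raise ValueError("chunk must be at least 1")
    while len(queue) > capacity:
        # Assumption: the oldest `chunk` tuples are the ones dropped.
        del queue[:chunk]
    return queue
```

Shedding needs no random numbers and touches the queue only when it overflows, but the tuples that survive are not a uniform sample of the input, which is the bias mentioned above.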

63
Query Plans: How Does a DSMS Process a Query?
Issues to consider
  • Separate query plan for each continuous query
    vs. one mega-query plan for all computations for
    all users
  • Plan components may be shared
  • Queries register before streams start to produce
    data
  • What about adding queries over existing streams?
  • Queries over archived or discarded data

64
STREAM System: Query Plans
  • Query operators
  • Each reads streams of tuples from a set of input
    queues, processes them, and writes output tuples
    into a single output queue

[Diagram: Operator with two input queues and one output
queue]
65
Query Plans (Cont.)
  • Inter-operator queues
  • Queues connect different operators and define the
    flow of tuples
  • Synopses
  • Summarize the tuples seen so far at an intermediate
    operator, as needed for future evaluation

66
When Are Synopses Needed?
  • Join operator
  • Must remember tuples seen so far on each of its
    input streams; maintains a synopsis for each
  • Filter operator (selection)
  • Does not maintain state; no need for synopses

67
Example
[Diagram: two query plans, a selection over the join of
R and S and a join of R, S, and T. Operators: O1 (join),
O2 (select), O3 (join). Queues 1 to 4, with queue 3
shared; synopses 1 to 4; a scheduler; input streams R, S,
and T.]
68
Explanations to Example
  • Two plans (for Q1 and Q2) share a sub-plan
    joining streams R and S by sharing its output
    queue q3
  • Execution of operators is controlled by a global
    scheduler
  • When operator O is scheduled, control passes to O
    for a period determined by a number of tuples
  • Time-slice-based scheduling is also possible

69
Resource Sharing for Query Plans
  • When Continuous Queries share common
    sub-expressions
  • Similar to traditional DBMS
  • Resource sharing and Approximation considered
    separately
  • Do not share if sharing introduces approximation,
    e.g. when merging sub-expressions with different
    window sizes

70
Implementation of Shared Queue
[Diagram: operators Op1, Op2, Op3, and Op4 reading a
shared queue that holds tuples t1 to t8]
  • Queue maintains a pointer to the first unread tuple
    for each operator
  • Discard tuples once they have been read by all
    operators
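
A minimal Python sketch of such a queue, assuming reader names are known up front. The class name `SharedQueue` and the use of `None` to signal "nothing unread" are illustrative choices, not taken from the STREAM system.

```python
from typing import Any, Dict, List


class SharedQueue:
    """One queue read by several operators, each at its own pace."""

    def __init__(self, readers: List[str]) -> None:
        self.tuples: List[Any] = []
        self.base = 0  # absolute position of self.tuples[0]
        # Pointer to the first unread tuple, per reading operator.
        self.next_unread: Dict[str, int] = {r: 0 for r in readers}

    def append(self, item: Any) -> None:
        self.tuples.append(item)

    def read(self, reader: str) -> Any:
        """Return the reader's next unread tuple, or None if caught up."""
        pos = self.next_unread[reader]
        if pos - self.base >= len(self.tuples):
            return None
        item = self.tuples[pos - self.base]
        self.next_unread[reader] = pos + 1
        self._discard_consumed()
        return item

    def _discard_consumed(self) -> None:
        # Tuples before the slowest reader's pointer were read by everyone.
        slowest = min(self.next_unread.values())
        del self.tuples[: slowest - self.base]
        self.base = slowest
```

The queue length is governed by the slowest reader: a fast operator can run ahead, but nothing is freed until every operator has passed it.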

71
Resource Sharing (cont.)
  • Base Data Stream accessed by multiple queries
    shared as common sub-expression
  • Number of tuples in shared queue depends on
  • Rate of addition to the queue
  • Rate at which slowest operator consumes tuples
  • Common sub-expression of 2 queries with very
    different consumption rates

72
Shared Queue Issues
[Diagram: two streams feed Operator J (Join), whose
output queue q is read by P1 (heavy consumer) and P2
(light consumer)]
  • P1 and P2 are parents of operator J
  • J will be scheduled frequently for the sake of P1
  • J should be scheduled less frequently for P2 (to
    avoid proliferation of tuples in q)

73
Sub-Plan Sharing
  • Formally proven
  • sub-plan sharing may be sub-optimal for common
    sub-expressions with joins
  • for common sub-expressions without joins sharing
    is always preferable

74
Synopses Sharing
  • Issues to consider
  • Which operator is responsible for managing shared
    synopses?
  • Synopses are required by different operators: how to
    choose the size of a common synopsis?
  • If synopses are identical, how to cope with
    different consumption rates?

75
Scheduling
  • Objective for Scheduler
  • Stream-based variation of response time
  • Throughput
  • Weighted fairness among queues
  • Minimize intermediate queues sizes
  • Granularity for Scheduler
  • Max number of tuples consumed by operator
  • Time-unit
  • Parallelism in scheduling algorithm ?

76
Scheduling Example
[Diagram: q1, Op. O1, q2, Op. O2]
  • O1 takes 1 time unit to operate on n tuples from
    q1; with 20% selectivity, it produces n/5 tuples in
    q2
  • O2 takes 1 time unit to operate on n/5 tuples
    from q2, and it does not produce any tuples.

77
Scheduling Example (Cont.)
  • Assume the average arrival rate on q1 is no more
    than n per 2 time units, so queues are bounded
  • Arrivals may be bursty
  • Possible scheduling strategies
  • Algorithm 1 (time-slicing): O1 and O2 alternate, each
    processing tuples for 1 time unit; O1 consumes n
    tuples, O2 consumes n/5
  • Algorithm 2: O1 operates until its queue is empty,
    afterwards O2
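
The two strategies can be compared with a small Python simulation. Everything concrete here is an assumption made for illustration: n is fixed at 5 tuples, arrivals come in multiples of n, exactly one operator runs per time unit, and the arrival pattern used in the usage note is invented rather than taken from the slides.

```python
from typing import Callable, List

N = 5  # batch size n; O2 then handles n/5 = 1 tuple per time unit


def simulate(arrivals: List[int],
             pick: Callable[[int, int, int], str]) -> List[int]:
    """Return the total queue size (q1 + q2) after each time unit."""
    q1 = q2 = 0
    sizes = []
    for t, arrived in enumerate(arrivals):
        q1 += arrived
        if pick(t, q1, q2) == "O1":
            consumed = min(q1, N)
            q1 -= consumed
            q2 += consumed // 5   # 20% selectivity
        else:
            q2 -= min(q2, N // 5)
        sizes.append(q1 + q2)
    return sizes


def time_slicing(t: int, q1: int, q2: int) -> str:
    """Algorithm 1: strict alternation O1, O2, O1, ..."""
    return "O1" if t % 2 == 0 else "O2"


def drain_first(t: int, q1: int, q2: int) -> str:
    """Algorithm 2: run O1 until q1 is empty, then O2."""
    return "O1" if q1 > 0 else "O2"
```

For the assumed arrivals `[2 * N, 0, N, 0, N, 0, 0, 0]`, `simulate(..., time_slicing)` gives `[6, 5, 6, 5, 6, 5, 1, 0]` and `simulate(..., drain_first)` gives `[6, 2, 3, 2, 3, 2, 1, 0]`: draining q1 first keeps the total queue size lower, while time-slicing lets O2 run, and so produce results, earlier.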

78
Algorithm 1
[Chart: queue size over time units 1 to 8, with arrivals
of 2n, n, and n tuples marked; orange: tuples in Q1,
yellow: tuples in Q2]

79
Algorithm 2
[Chart: queue size over time units 1 to 8, with arrivals
of 2n, n, and n tuples marked; orange: tuples in Q1,
yellow: tuples in Q2]

80
Comparison. Which is better?
[Chart: total size of both queues over time units 1 to 8,
with arrivals of 2n, n, and n tuples marked; orange:
Algorithm 1, yellow: Algorithm 2]

81
Greedy Scheduler Rule
  • Schedule the operator that consumes the largest
    number of tuples per time unit and is the most
    selective (produces the fewest tuples)
  • Operators with full batches in their input queues
    are favored over high-priority operators with
    under-full inputs (better utilization of the
    time slice)
  • A high-priority operator may be underutilized if
    its feeders are low priority: consider chains of
    operators

82
Scheduling Algorithm Discussion
  • Queue size minimization
  • Increased time to initial results
  • Strategy 1 would produce initial results faster
  • Incorporate response time and weighted fairness
    into algorithm
  • Flexible time-slices
  • Taking context-switching into account

83
Resource Management
  • Relevant Resources
  • Memory
  • CPU
  • I/O (if disk used)
  • Network (in Distributed DSMS)
  • Our goal
  • Maximize query precision by making the best use
    of available resources, and be able to do so
    dynamically and adaptively

84
Resource Management Cont.
  • Focus on memory used by synopses and queues
  • Algorithms developed in STREAM
  • Allocating memory to query plan
  • Incorporating known constraints on input streams
    to reduce synopses without compromising precision
  • Operator scheduling to minimize queue size

85
Resource Management Approaches (Cont.)
  • Exploiting constraints over data streams
  • When additional information about streams is
    available (gathered stats, constraint specs) --
    reduce resource utilization with same result
    precision

86
Adaptation: why?
  • Queries are long running
  • Parameters may vary:
  • Stream flow rate
  • Stream data characteristics
  • Environment (available RAM)
  • How to adapt?

87
Exploiting Constraints over Data Streams
  • Query Q: a join to monitor fulfillment delays
  • Answering it requires synopses of unbounded size!

[Diagram: Stream Orders (O) and Stream Fulfillments (F)
joined on (Order_ID, Item_ID), with synopses Synop-O and
Synop-F]
88
Constraints (cont.)
  • Tuples for a given (orderID, itemID) arrive on
    stream O before the corresponding tuples arrive on F
  • No need to maintain a join synopsis for F!
  • Another constraint: tuples arrive on O clustered
    by orderID
  • We need only save tuples for a given orderID
    until the next orderID is seen
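
The first constraint (O tuples arrive before the matching F tuples) can be sketched in Python; the clustered-arrival constraint is not modeled here. The function name `fulfillment_delays` and the merged, tagged input format are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Key = Tuple[str, str]  # (orderID, itemID)


def fulfillment_delays(
    merged: Iterable[Tuple[str, str, str, float]]
) -> Iterator[Tuple[Key, float]]:
    """Join Orders and Fulfillments, yielding (key, delay) per match.

    `merged` yields (stream, orderID, itemID, time), stream being "O" or "F".
    """
    synop_o: Dict[Key, List[float]] = defaultdict(list)  # synopsis for O only
    for stream, order_id, item_id, time in merged:
        key = (order_id, item_id)
        if stream == "O":
            synop_o[key].append(time)
        else:
            # F tuples only probe Synop-O and are never stored: under the
            # ordered-arrival constraint no later O tuple can match them.
            for ordered_at in synop_o.get(key, []):
                yield key, time - ordered_at
```

Synop-O itself can still grow without bound in this sketch; shrinking it is what a further constraint, such as clustered arrival, would be needed for.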

[Diagram: example arrival sequences of (order, item)
tuples for orders Ord1 and Ord3, captioned "More RAM
needed for synopsis"]
89
Constraints (Cont.)
  • Referential integrity
  • Unique-value
  • Clustered-Arrival
  • Ordered-Arrival

90
Summary
  • Architecture for DSMS
  • Query Language
  • Common Design Problems
  • Tradeoffs: efficiency, accuracy, storage

91
References
  • "Continuous Queries over Data Streams" by
    S. Babu, J. Widom (Stanford University)
  • "Query Processing, Approximation, and Resource
    Management in a Data Stream Management System" by
    R. Motwani, J. Widom, and others (Stanford
    University)
  • http://www.db.stanford.edu/stream

92
  • Questions?

93
Thank you