Transcript and Presenter's Notes

Title: Data Stream Management Systems


1
  • Data Stream Management Systems
  • Presented by
  • Chung-Yan Kwan
  • Amy Lau
  • June 5, 2003
  • CS 240B
  • Professor Carlo Zaniolo

2
Outline
  • Characteristics of Data Stream Management System
    (DSMS)
  • AURORA (Brandeis University, Brown University, MIT)
  • Introduction
  • System Architecture
  • System Model
  • Operators
  • Query Model
  • Optimization
  • QoS Data Structure
  • Future Work

3
Outline (cont)
  • STREAM: The Stanford Stream Data Manager
  • Introduction
  • System Architecture
  • Query Language
  • Query Plans
  • Approximation Techniques
  • Resource Management
  • Implementation and Interfaces

4
Characteristics of Data Stream Management System
(DSMS)
  • Manages traditional stored data (relations)
  • Handles multiple continuous, unbounded, possibly
    rapid and time-varying data streams
  • Supports long-running continuous queries and
    produces answers in a continuous and timely fashion

5
Introduction of Aurora
  • General-purpose DSMS
  • Efficiently supports a variety of real-time
    monitoring applications
  • 3 Key Components
  • Scheduler
  • Storage Manager
  • Load Shedder

6
Scheduler
  • decides which operators to execute and in which
    order to execute them
  • pays special attention to reducing operator
    scheduling and invocation overheads
  • batches (i.e., groups) multiple tuples and
    operators and executes each batch at once

7
Storage Manager
  • designed for storing ordered queues of tuples
    instead of sets of tuples (relations)
  • combines the storage of push-based queues with
    pull-based access to historical data stored at
    connection points.

8
Load Shedder
  • responsible for detecting and handling overload
    situations
  • Handling overload situations
  • accomplished by shedding tuples, i.e., by
    temporarily adding drop operators to the Aurora
    processing network
  • Goal: filter messages in order to rectify the
    overload situation and provide better overall QoS
    at the expense of reduced answer quality
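  • As an illustration, a minimal sketch of such a drop box (the tuple
    representation and the drop fraction p are assumptions, not Aurora's
    actual implementation):

    import random

    def drop_operator(tuples, p):
        # Randomly discard a fraction p of the incoming tuples.
        # Temporarily inserting such a box upstream of an overloaded part
        # of the network trades answer quality for reduced load.
        return [t for t in tuples if random.random() >= p]

    # e.g. shed roughly 30% of the load during an overload situation
    survivors = drop_operator([{"id": i} for i in range(1000)], p=0.3)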

9
Aurora System Model
  • The basic job of Aurora is to process incoming
    streams in the way defined by an application
    administrator
  • Data Stream Flow
  • Input arrives from external streams
  • Data flows through a loop-free, directed graph of
    processing operations (i.e., boxes)
  • Output streams are presented to applications
  • Historical storage is maintained (to support
    ad-hoc queries)

10
Operators
  • Eight Primitive Operators
  • Windowed Operators
  • Slide
  • Tumble
  • Latch
  • Resample
  • Non-Windowed Operators
  • Filter Drop
  • Map
  • GroupBy
  • Join

11
Query Model
12
Query Model (cont)
  • 3 types
  • Continual queries (real-time processing)
  • Views
  • Ad-hoc queries

13
Continual Query
  • No need to store the data once they are processed
  • The QoS specification at the end of the path
    controls how resources are allocated to the
    processing elements along the path
  • Applications must be programmed to deal with
    asynchronous tuples.

14
Views
  • A path defined with no connected application
  • It is allowed to have a QoS specification as an
    indication of the importance of the view
  • Applications can connect to the end of this path
    whenever there is a need
  • Moreover, the system can store partial results at
    any point along a view path

15
Ad-hoc Query
  • Connection Point
  • A connection point is an arc that will support
    dynamic modification to the network.
  • An ad-hoc query can be attached to a connection
    point at any time.
  • Data stored at the connection point is delivered
    to the ad-hoc query
  • Thus, the semantics of an Aurora ad-hoc query are
    the same as those of a continuous query that starts
    executing at t_now - T and continues until explicit
    termination

16
Optimization
  • Inserting Projections
  • Project out all unneeded attributes
  • Combining Boxes
  • Where possible; this at least saves box-execution
    overhead and reduces the total number of boxes
  • Reordering Boxes
  • (next slide)

17
Reordering Boxes
  • Cost of b, c(b): expected execution time for b
    per input tuple
  • Selectivity of b, s(b): expected number of output
    tuples per input tuple
  • If b_i is placed before b_j, the expected cost per
    input tuple is c(b_i) + s(b_i) x c(b_j)
  • Swap b_i and b_j whenever
    (1 - s(b_j)) / c(b_j) > (1 - s(b_i)) / c(b_i)
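  • The swap rule follows directly from comparing the two orderings (a
    short derivation in LaTeX notation, using the definitions above):

    \[
    \mathrm{cost}(b_i \rightarrow b_j) = c(b_i) + s(b_i)\,c(b_j), \qquad
    \mathrm{cost}(b_j \rightarrow b_i) = c(b_j) + s(b_j)\,c(b_i)
    \]
    % Putting b_j first is cheaper exactly when
    \[
    c(b_j) + s(b_j)\,c(b_i) \;<\; c(b_i) + s(b_i)\,c(b_j)
    \;\Longleftrightarrow\;
    \frac{1 - s(b_j)}{c(b_j)} \;>\; \frac{1 - s(b_i)}{c(b_i)}
    \]
    % i.e. order boxes by decreasing (1 - s(b)) / c(b)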

18
QoS Data Structure
  • Multidimensional function of several attributes
    of an Aurora system
  • Response times
  • Tuple drops
  • Values produced

19
QoS Data Structure (cont)
  • Response times
  • Output tuples should be produced in a timely
    fashion
  • Tuple drops
  • Tuples dropped to shed load will deteriorate QoS
  • Values produced
  • Depends on whether important values are being
    produced or not.
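  • Such QoS specifications are commonly pictured as piecewise-linear
    graphs mapping an output attribute (e.g. latency) to a utility value.
    A minimal sketch of that data structure (the class name and the
    breakpoint values below are illustrative assumptions, not Aurora's
    implementation):

    from bisect import bisect_right

    class QoSGraph:
        # Piecewise-linear utility function, e.g. utility vs. output latency.
        def __init__(self, points):
            self.xs = [x for x, _ in points]   # sorted breakpoint positions
            self.us = [u for _, u in points]   # utility at each breakpoint

        def utility(self, x):
            # Linear interpolation between surrounding breakpoints,
            # clamped at the ends of the specification.
            if x <= self.xs[0]:
                return self.us[0]
            if x >= self.xs[-1]:
                return self.us[-1]
            i = bisect_right(self.xs, x)
            x0, x1 = self.xs[i - 1], self.xs[i]
            u0, u1 = self.us[i - 1], self.us[i]
            return u0 + (u1 - u0) * (x - x0) / (x1 - x0)

    # hypothetical latency graph: full utility up to 1s, useless after 5s
    latency_qos = QoSGraph([(0.0, 1.0), (1.0, 1.0), (5.0, 0.0)])
    print(latency_qos.utility(2.0))   # 0.75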

20
Future Work
  • Implementing an Aurora prototype system
  • Working on a distributed architecture, Aurora*

21
Introduction of STREAM
  • A general-purpose DSMS
  • Supports a declarative query language (CQL)
  • for registering continuous queries
  • Flexible query plans
  • Designed to cope with high data rates and large
    numbers of continuous queries
  • provides approximate answers when resources are
    limited
  • careful resource allocation and usage

22
System Architecture
23
Query Language (CQL)
  • An extended version of SQL
  • Includes
  • Sliding window specification
  • partitioning clause (grouping)
  • window size (ROWS or RANGE)
  • e.g. ROWS 50 PRECEDING
  • e.g. RANGE 15 MINUTES PRECEDING
  • filtering predicate (WHERE)
  • Sampling clause
  • specifies that a random sample of the data
    elements should be used for query processing
  • (e.g. 1% SAMPLE means each data element in
    the stream should be retained with probability
    0.01 and discarded with probability 0.99)
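  • The sampling clause amounts to an independent Bernoulli trial per
    data element; a minimal sketch (the function name and stream
    representation are assumptions, not STREAM's implementation):

    import random

    def sample_operator(stream, percent):
        # Retain each element with probability percent/100 (e.g. 1% SAMPLE),
        # discard it otherwise.
        for element in stream:
            if random.random() < percent / 100.0:
                yield element

    # roughly 1 in 100 elements survive a 1% SAMPLE clause
    kept = list(sample_operator(range(10000), percent=1))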

24
Query Example
  • the example queries reference a stream Requests
    of requests to a web proxy server, each with four
    attributes
  • client_id, domain, URL, and reqTime
  • counts the number of requests for pages from the
    domain stanford.edu in the last day
  • SELECT COUNT(*)
  • FROM Requests S [RANGE 1 DAY PRECEDING]
  • WHERE S.domain = 'stanford.edu'
  • counts how many page requests were for pages
    served by Stanford's CS department web server,
    considering only each client's 10 most recent
    page requests from the domain stanford.edu
  • SELECT COUNT(*)
  • FROM Requests S [PARTITION BY S.client_id
  •                  ROWS 10 PRECEDING
  •                  WHERE S.domain = 'stanford.edu']
  • WHERE S.URL LIKE 'http://cs.stanford.edu/%'

25
Query Example (cont.)
  • this example references a stored relation Domains
    that classifies domains by the primary type of
    web content they serve
  • counts the number of requests for pages from
    commerce domains out of the last 10,000
    requests for pages from domains that have been
    classified
  • SELECT COUNT(*)
  • FROM (SELECT R.class
  •       FROM Requests S [10% SAMPLE], Domains R
  •       WHERE S.domain = R.domain) T
  •       [ROWS 10000 PRECEDING]
  • WHERE T.class = 'commerce'
  • Note: the stream of requests must be joined
    with the Domains relation (resulting in a stream
    labeled T) before applying the sliding window
26
Query Language (cont.)
  • Stream Ordering and Timestamps
  • Assume global, discrete, ordered time domain
  • Each stream tuple has a timestamp
  • Explicit
  • Use attribute TIMESTAMP (type DATETIME) in CREATE
    STREAM statement
  • Arrival-based
  • the timestamp is the value of the system clock at
    arrival time
  • Inactive and Weighted Queries
  • Queries may be assigned weights indicating their
    relative importance
  • Provide more precision with higher weight
  • Inactive queries
  • queries with negligible weight
  • Influence query plans and resource allocation

27
Query Plans
  • Accounting for plan sharing and approximation
    techniques
  • Compiles declarative queries into individual
    plans; the system may merge plans
  • (Aurora, in contrast, directly manipulates one
    large execution plan)
  • Allows direct input of query plans
  • Similar to Aurora
  • Plans composed of three types of components
  • Query operators (similar to traditional DBMS)
  • Inter-operator queues (similar to some
    traditional DBMS)
  • Synopses
  • used to maintain state associated with operators
  • summarization technique (sliding windows) used to
    limit their size (produce approximate results)
  • Global scheduler for plan execution

28
Query Plans (cont.)
  • Generic methods of the Operator class
  • Create, changeMem, run
  • Generic methods of the Synopsis class
  • Create, changeMem, insert and delete, query
  • Separate implementation allows us to couple any
    operator type with any synopsis type, and paves
    the way for operator and synopsis sharing
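  • A minimal Python sketch of these generic interfaces; the method
    names follow the slide, while the signatures and base classes are
    assumptions:

    from abc import ABC, abstractmethod

    class Operator(ABC):
        @abstractmethod
        def create(self, **params): ...      # instantiate with plan-specific parameters
        @abstractmethod
        def changeMem(self, new_amount): ... # adjust memory allotted to this operator
        @abstractmethod
        def run(self, budget): ...           # execute for a bounded amount of work

    class Synopsis(ABC):
        @abstractmethod
        def create(self, **params): ...
        @abstractmethod
        def changeMem(self, new_amount): ...
        @abstractmethod
        def insert(self, tup): ...           # add a tuple to the summarized state
        @abstractmethod
        def delete(self, tup): ...           # remove a tuple (e.g. expired from a window)
        @abstractmethod
        def query(self, predicate): ...      # probe the synopsis, e.g. for join matches

    # Because operators touch synopses only through this interface, any
    # operator type can be coupled with any synopsis type, and synopses
    # can be shared between operators.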

29
Example of Query Plans
30
Resource Sharing in Query Plans
  • Can combine plans that have exactly matching
    subexpressions
  • multiple queries accessing the same incoming base
    data stream S share S as a common subexpression
  • The implementation of a shared queue (sketched at
    the end of this slide)
  • maintains a pointer to the first unread tuple for
    each operator that reads from the queue, and
  • it discards tuples once they have been read by
    all parent operators
  • A shared subplan may not be used if two queries
    with a common subexpression have parent operators
    with very different consumption rates
  • May need to introduce synopsis sharing
  • Automatic resource sharing is less crucial in
    Aurora
  • Resource sharing is primarily programmed by users
    when they augment the current mega-plan
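  • A minimal sketch of such a shared queue, with one read pointer per
    consuming operator (class and method names are assumptions, not
    STREAM's implementation):

    class SharedQueue:
        # Single writer, multiple readers; a tuple is kept until every
        # parent operator has read it.
        def __init__(self, reader_ids):
            self.buffer = []                               # tuples not yet read by everyone
            self.base = 0                                  # global index of buffer[0]
            self.next_index = {r: 0 for r in reader_ids}   # first unread tuple per reader

        def push(self, tup):
            self.buffer.append(tup)

        def pop(self, reader):
            i = self.next_index[reader]
            if i - self.base >= len(self.buffer):
                return None                                # nothing new for this reader
            tup = self.buffer[i - self.base]
            self.next_index[reader] = i + 1
            self._discard_fully_read()
            return tup

        def _discard_fully_read(self):
            # Drop tuples that every parent operator has already read.
            min_unread = min(self.next_index.values())
            drop = min_unread - self.base
            if drop > 0:
                del self.buffer[:drop]
                self.base = min_unread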

31
Approximation Techniques
  • Goal is to maximize the precision of query
    answers based on the available resources
  • Static and Dynamic Approximations
  • Static Approximation
  • Queries are modified when they are submitted to
    the system (to use fewer resources)
  • Two techniques
  • Window Reduction (reduce memory and computation)
  • Decrease the window size or introduce a window
    where none was specified originally (band joins)
  • This can have a ripple effect that propagates up
    the operator tree
  • Sampling Rate Reduction (reduce output rate)
  • Reduce the sampling rate of the SAMPLE clause or
    introduce one where none was specified originally
  • Can take an existing sample operator and push it
    down the query plan

32
Advantages of Static Approximation
  • Advantages of Static Approximation
  • The user is guaranteed certain query behavior,
    since the (modified) query is executed precisely
    by the system
  • The user can participate in the process by guiding
    or approving the system's query modifications
  • Adaptive approximation techniques and continuous
    monitoring of system activity are not required

33
Dynamic Approximation
  • Dynamic Approximation
  • Queries are unchanged
  • System may not always provide precise query
    answer
  • Three techniques
  • Synopsis Compression (analogous to window
    reduction)
  • Reduce synopsis sizes at one or more operators
  • Incorporating a sliding window into a synopsis or
    shrinking the existing window
  • Maintaining a sample of the intended synopsis
    content
  • Sampling (reduce queue size)
  • Introduce one or more sample operators into the
    query plan, or to reduce the sampling rate at
    existing operators
  • Load Shedding (reduce queue size)
  • Simply drop tuples from queues when they grow too
    large

34
Advantages of Dynamic Approximation
  • Advantages of Dynamic Approximation
  • The level of approximation can vary with
    fluctuations in data rates and distributions,
    query workload, and resource availability
  • Approximation can occur at the plan operator
    level, and decisions can be made based on the
    global set of (possibly shared) query plans
    running in the system

35
Resource Management
  • Focus primarily on memory consumed by query plan
    synopses and queues
  • Static Resource Allocation
  • Allocating resources to queries (in a
    resource-limited environment) so as to maximize
    query result precision
  • Assume that all plan operators map allocated
    resources to precision specifications (FP, FN),
    where FP, FN ∈ [0, 1]
  • FP captures the false positive rate: the
    probability that an output stream tuple is
    incorrect
  • FN captures the false negative rate: the
    probability, for each correct output stream
    tuple, that there is another correct tuple that
    was missed
  • (FP, FN) can also denote the precision of an
    operator
  • For each operator type, compute output stream
    precision (FP, FN) values from the precision of
    the input streams and the precision of the
    operator itself
  • Apply the formulas bottom-up to the query plan,
    feeding the result to the numerical solver which
    produces the optimal resource allocation
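  • A generic sketch of the bottom-up propagation step; the per-operator
    combine function is a hypothetical placeholder, since the actual
    formulas are operator-type specific and not reproduced here:

    def plan_precision(node):
        # Recursively compute the (FP, FN) precision of a plan's output stream.
        #   node.children  : input sub-plans (empty for a base input stream)
        #   node.precision : (FP, FN) of the operator itself, given its resources
        #   node.combine   : operator-specific function from input-stream
        #                    precisions and operator precision to output precision
        child_precisions = [plan_precision(c) for c in node.children]
        if not child_precisions:
            return (0.0, 0.0)    # base input streams are assumed exact
        return node.combine(child_precisions, node.precision)

    # The (FP, FN) obtained at the plan root is what the numerical solver
    # optimizes when it searches over resource allocations.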

36
Exploiting Constraints Over Data Streams
  • To reduce memory overhead in query plan operators
  • Specify an adherence parameter k that captures
    how closely a given stream or set of streams
    adheres to a constraint of that type
  • e.g. clustered-arrival constraint on a stream
    attribute S.A
  • If two tuples in stream S have the same value v
    for A, then at most k tuples with non-v values
    for A occur on S between them
  • The closer the streams adhere to the specified
    constraints at run-time, the smaller the required
    synopses (state)
  • Constraints considered
  • Between two streams
  • many-one join, and referential integrity
    constraints
  • Individual stream
  • unique-value, clustered-arrival, and
    ordered-arrival
  • The algorithm accepts select-project-join queries
    over streams with arbitrary constraints, and it
    produces a query plan that exploits constraints
    to reduce synopsis sizes without compromising
    precision
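  • As an illustration, a small checker for the clustered-arrival
    constraint defined above (the function name and the list-based stream
    representation are assumptions):

    def satisfies_clustered_arrival(values, k):
        # True if, whenever two tuples carry the same value v for A, at most
        # k tuples with non-v values occur between them. It suffices to check
        # the widest pair: the first and the most recent occurrence of each v.
        first_pos, count = {}, {}
        for pos, v in enumerate(values):
            if v in first_pos:
                non_v_between = pos - first_pos[v] - count[v]
                if non_v_between > k:
                    return False
            first_pos.setdefault(v, pos)
            count[v] = count.get(v, 0) + 1
        return True

    # at most one foreign value ever appears between repeats of a value
    print(satisfies_clustered_arrival(["a", "a", "b", "a", "b", "b"], k=1))  # True
    print(satisfies_clustered_arrival(["a", "b", "a", "b", "a"], k=1))       # False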

37
Scheduling
  • Global scheduler for plan execution (calls run
    methods)
  • uses a round-robin scheme
  • Focus on minimizing intermediate (inter-operator)
    queue sizes
  • Parallelism not considered
  • Greedily schedule the operator that consumes
    the largest number of tuples per time unit and is
    the most selective (i.e. produces the fewest
    tuples)
  • Example
  • a query plan with two unary operators
  • O1 operates on input queue q1, writing results to
    queue q2 which is input to operator O2
  • O1 takes one time unit to operate on a batch of n
    tuples from q1, and has 20% selectivity (produces
    n/5 tuples in q2)
  • operator O2 takes one time unit to operate on n/5
    tuples, produces no tuples on its output queue
  • assume the average arrival rate of tuples on q1
    is no more than n tuples per two time units, so
    all tuples can be processed and queues will not
    grow without bound

38
Scheduling (cont.)
  • Two possible scheduling strategies for the
    example
  • Tuples are processed to completion in the order
    they arrive on q1. Each batch of n tuples in q1
    is processed by O1 and then O2 based on arrival
    time, consuming two time units overall
  • If there is a batch of n tuples in q1, then O1
    operates on them using one time unit, producing
    n/5 new tuples in q2. Otherwise, if there are
    any tuples in q2 then up to n/5 of these tuples
    are operated on by O2, consuming one time unit
  • e.g. 2n tuples arrive on q1 at time t = 0, no
    tuples at time t = 1, and n tuples each at times
    t = 2 and t = 3
  • Comparing the total size of queues q1 and q2 over
    time (measured in multiples of n), both strategies
    finish at the 8th step, but Strategy 2 is clearly
    preferable in terms of memory overhead
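  • The comparison can be reproduced with a small simulation of the
    example (a sketch under the stated assumptions: O1 consumes a batch
    of n with 20% selectivity, O2 consumes n/5 per time unit, and 2n, 0,
    n, n tuples arrive in the first four time units; sizes are reported
    in multiples of n):

    from fractions import Fraction

    def simulate(strategy, arrivals, steps=8):
        # Total queue size q1 + q2 at each time step, in multiples of n.
        one_fifth = Fraction(1, 5)
        q1, q2, sizes = Fraction(0), Fraction(0), []
        for t in range(steps):
            if t < len(arrivals):
                q1 += arrivals[t]
            sizes.append(float(q1 + q2))
            if strategy == "fifo":
                # Strategy 1: finish each batch (O1 then O2) before the next.
                if q2 > 0:
                    q2 -= one_fifth              # O2 drains the batch's n/5 tuples
                elif q1 >= 1:
                    q1 -= 1; q2 += one_fifth     # O1 processes a batch of n tuples
            else:
                # Strategy 2: prefer O1, the operator that shrinks total queue
                # size fastest; run O2 only when no full batch is waiting.
                if q1 >= 1:
                    q1 -= 1; q2 += one_fifth
                elif q2 > 0:
                    q2 -= one_fifth
        return sizes

    arrivals = [2, 0, 1, 1]                 # multiples of n arriving at t = 0..3
    print(simulate("fifo",   arrivals))     # [2.0, 1.2, 2.0, 2.2, 2.0, 1.2, 1.0, 0.2]
    print(simulate("greedy", arrivals))     # [2.0, 1.2, 1.4, 1.6, 0.8, 0.6, 0.4, 0.2]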

39
Scheduling (cont.)
  • Can achieve queue size minimization, but pay in
    increased time to initial results
  • Two additional considerations
  • Favor operators with full batches of tuples in
    their input queues over higher-priority operators
    with underfull input queues
  • Chains of operators within a plan
  • do not schedule chains as a unit, unlike Aurora's
    train scheduling algorithm
  • Aurora's objective is to improve throughput by
    reducing context-switching between operators,
    batching the processing of tuples through
    operators, and reducing I/O overhead
    (inter-operator queues may be written to disk)
  • Aurora
  • QoS graphs capture tradeoffs among precision,
    response time, resource usage, and usefulness to
    the application. However, approximation appears
    solely through drop-boxes that perform load
    shedding.

40
Implementation and Interfaces
  • Three features of the design
  • Generic entities
  • Coding of query plans
  • System interface
  • Entities and Control Tables
  • Operators, queues and synopses are subclasses of
    a generic Entity class
  • Each entity has a table of attribute-value
    pairs, its Control Table (CT), and each entity
    exports an interface to query and update its CT
  • Dynamically control the behavior of an entity
  • The amount of memory used by a synopsis S can be
    controlled by updating the value of attribute
    Memory in S's control table
  • Collect statistics about entity behavior for
    resource management and for user-level system
    monitoring
  • The number of tuples that have passed through a
    queue q is stored in attribute Count of q's
    control table
  • Offer extensibility (add new attributes to a CT)
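  • A minimal sketch of the Entity/Control Table idea (the attribute
    names Memory and Count come from the slides; everything else is an
    assumption):

    class Entity:
        # Base class for operators, queues and synopses; behavior and
        # statistics are exposed through an attribute-value Control Table.
        def __init__(self, **initial_attributes):
            self._ct = dict(initial_attributes)

        def get_ct(self, attribute):
            return self._ct.get(attribute)

        def set_ct(self, attribute, value):
            # Updating the CT dynamically changes the entity's behavior
            # (e.g. "Memory" on a synopsis) or records statistics for
            # monitoring (e.g. "Count" on a queue).
            self._ct[attribute] = value

    class Queue(Entity):
        def __init__(self):
            super().__init__(Count=0)

        def enqueue(self, tup):
            self.set_ct("Count", self.get_ct("Count") + 1)
            # ... actual tuple buffering elided ...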

41
Implementation and Interfaces (cont.)
  • Query Plans
  • Implemented as networks of entities, stored in
    main memory
  • A graphical interface is provided for creating
    and viewing plans, and for adjusting attributes
    of operators, queues, and synopses
  • Query plans may be viewed and edited even as
    queries are running
  • Main-memory plan structures are also stored in XML
    files (to make continuous queries persistent)
  • Plans are loaded at system startup, any
    modifications to plans during system execution
    are reflected in the corresponding XML
  • Users are free to create and edit XML plans
    offline

42
Implementation and Interfaces (cont.)
  • Programmatic and Human Interfaces
  • a web interface through direct HTTP
  • planning to expose the system as a web service through SOAP
  • remote applications
  • can be written in any language and on any
    platform
  • can register queries
  • can request and update CT attribute values
  • can receive the results of a query as a streaming
    HTTP response in XML
  • human users
  • web-based GUI exposing the same functionality

43
Conclusion
  • Both prototypes are still under development
  • STREAM needs to design its query processor with a
    view toward migration to distributed processing
  • STREAM may extend the system to handle XML data
    streams
  • The two systems are quite alike
  • We think they could join their efforts to come up
    with an even better DSMS

44
References
  • Aurora website
  • http://www.cs.brown.edu/research/aurora/
  • Carney, D., et al., "Monitoring Streams - A New
    Class of Data Management Applications," Proc. of
    Very Large Data Bases (VLDB), Hong Kong, China,
    August 2002. http://www.cs.uml.edu/kajal/courses/
    91.580-S03/papers/cccc-monitoring-streams.pdf
  • Motwani, R., et al., "Query Processing,
    Approximation, and Resource Management in a Data
    Stream Management System," Proc. of the 2003
    CIDR Conference
  • Babcock, B., et al., "Models and Issues in Data
    Stream Systems," Proc. 21st ACM
    SIGACT-SIGMOD-SIGART Symp. on Principles of
    Database Systems, pp. 1-16, Madison, Wisconsin,
    May 2002
  • Stanford University STREAM website
  • http://www.db.stanford.edu/stream