Transcript and Presenter's Notes

Title: Data Stream Management Systems


1
  • Data Stream Management Systems
  • Presented by
  • Chung-Yan Kwan
  • Amy Lau
  • June 5, 2003
  • CS 240B
  • Professor Carlo Zaniolo

2
Outline
  • Characteristics of Data Stream Management System
    (DSMS)
  • AURORA (Brandeis University, Brown University, MIT)
  • Introduction
  • System Architecture
  • System Model
  • Operators
  • Query Model
  • Optimization
  • QoS Data Structure
  • Future Work

3
Outline (cont)
  • STREAM: The Stanford Stream Data Manager
  • Introduction
  • System Architecture
  • Query Language
  • Query Plans
  • Approximation Techniques
  • Resource Management
  • Implementation and Interfaces

4
Characteristics of Data Stream Management System
(DSMS)
  • Manages traditional stored data (relations)
  • Handles multiple continuous, unbounded, possibly
    rapid and time-varying data streams
  • Supports long-running continuous queries and
    produces answers in a continuous and timely fashion

5
Introduction of Aurora
  • General-purpose DSMS
  • Efficiently supports a variety of real-time
    monitoring applications
  • 3 Key Components
  • Scheduler
  • Storage Manager
  • Load Shedder

6
Scheduler
  • decides which operators to execute and in which
    order to execute them
  • pays special attention to reducing operator
    scheduling and invocation overheads
  • batches (i.e., groups) multiple tuples and
    operators and executes each batch at once

7
Storage Manager
  • designed for storing ordered queues of tuples
    instead of sets of tuples (relations)
  • combines the storage of push-based queues with
    pull-based access to historical data stored at
    connection points.

8
Load Shedder
  • responsible for detecting and handling overload
    situations
  • Handling overload situations
  • accomplished by shedding tuples, i.e., by
    temporarily adding drop operators to the Aurora
    processing network
  • Goal: filter messages in order to rectify the
    overload situation and provide better overall QoS
    at the expense of reduced answer quality
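  • As an illustration, a minimal sketch of such a drop box (the tuple
    representation and the drop fraction p are assumptions, not Aurora's
    actual implementation):

    import random

    def drop_operator(tuples, p):
        # Randomly discard a fraction p of the incoming tuples.
        # Temporarily inserting such a box upstream of an overloaded part
        # of the network trades answer quality for reduced load.
        return [t for t in tuples if random.random() >= p]

    # e.g. shed roughly 30% of the load during an overload situation
    survivors = drop_operator([{"id": i} for i in range(1000)], p=0.3)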

9
Aurora System Model
  • The basic job of Aurora is to process incoming
    streams in the way defined by an application
    administrator
  • Data Stream Flow
  • Input arrives from external streams
  • Data flows through a loop-free, directed graph of
    processing operations (i.e., boxes)
  • Output streams are presented to applications
  • Historical storage is maintained (to support
    ad-hoc queries)

10
Operators
  • Eight Primitive Operators
  • Windowed Operators
  • Slide
  • Tumble
  • Latch
  • Resample
  • Non-Windowed Operators
  • Filter Drop
  • Map
  • GroupBy
  • Join

11
Query Model
12
Query Model (cont)
  • 3 types
  • Continual queries (real-time processing)
  • Views
  • Ad-hoc queries

13
Continual Query
  • No need to store the data once they are processed
  • The QoS specification at the end of the path
    controls how resources are allocated to the
    processing elements along the path
  • Applications must be programmed to deal with
    asynchronous tuples.

14
Views
  • A path defined with no connected application
  • It is allowed to have a QoS specification as an
    indication of the importance of the view
  • Applications can connect to the end of this path
    whenever there is a need
  • Moreover, the system can store partial results at
    any point along a view path

15
Ad-hoc Query
  • Connection Point
  • A connection point is an arc that will support
    dynamic modification to the network.
  • An ad-hoc query can be attached to a connection
    point at any time.
  • Data stored at the connection point is delivered
    to the ad-hoc query
  • Thus, the semantics of an Aurora ad-hoc query are
    the same as those of a continuous query that starts
    executing at t_now - T and continues until explicit
    termination

16
Optimization
  • Inserting Projections
  • Project out all unneeded attributes
  • Combining Boxes
  • Where possible; this at least saves box-execution
    overhead and reduces the total number of boxes
  • Reordering Boxes
  • (next slide)

17
Reordering Boxes
  • Cost of b, c(b): expected execution time for b
    per input tuple
  • Selectivity of b, s(b): expected number of output
    tuples per input tuple
  • If b_i is placed before b_j, the expected cost per
    input tuple is c(b_i) + s(b_i) x c(b_j)
  • Swap b_i and b_j whenever
    (1 - s(b_j)) / c(b_j) > (1 - s(b_i)) / c(b_i)
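  • The swap rule follows directly from comparing the two orderings (a
    short derivation in LaTeX notation, using the definitions above):

    \[
    \mathrm{cost}(b_i \rightarrow b_j) = c(b_i) + s(b_i)\,c(b_j), \qquad
    \mathrm{cost}(b_j \rightarrow b_i) = c(b_j) + s(b_j)\,c(b_i)
    \]
    % Putting b_j first is cheaper exactly when
    \[
    c(b_j) + s(b_j)\,c(b_i) \;<\; c(b_i) + s(b_i)\,c(b_j)
    \;\Longleftrightarrow\;
    \frac{1 - s(b_j)}{c(b_j)} \;>\; \frac{1 - s(b_i)}{c(b_i)}
    \]
    % i.e. order boxes by decreasing (1 - s(b)) / c(b)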

18
QoS Data Structure
  • Multidimensional function of several attributes
    of an Aurora system
  • Response times
  • Tuple drops
  • Values produced

19
QoS Data Structure (cont)
  • Response times
  • Output tuples should be produced in a timely
    fashion
  • Tuple drops
  • Tuples dropped to shed load will deteriorate QoS
  • Values produced
  • Depends on whether important values are being
    produced or not.
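  • Such QoS specifications are commonly pictured as piecewise-linear
    graphs mapping an output attribute (e.g. latency) to a utility value.
    A minimal sketch of that data structure (the class name and the
    breakpoint values below are illustrative assumptions, not Aurora's
    implementation):

    from bisect import bisect_right

    class QoSGraph:
        # Piecewise-linear utility function, e.g. utility vs. output latency.
        def __init__(self, points):
            self.xs = [x for x, _ in points]   # sorted breakpoint positions
            self.us = [u for _, u in points]   # utility at each breakpoint

        def utility(self, x):
            # Linear interpolation between surrounding breakpoints,
            # clamped at the ends of the specification.
            if x <= self.xs[0]:
                return self.us[0]
            if x >= self.xs[-1]:
                return self.us[-1]
            i = bisect_right(self.xs, x)
            x0, x1 = self.xs[i - 1], self.xs[i]
            u0, u1 = self.us[i - 1], self.us[i]
            return u0 + (u1 - u0) * (x - x0) / (x1 - x0)

    # hypothetical latency graph: full utility up to 1s, useless after 5s
    latency_qos = QoSGraph([(0.0, 1.0), (1.0, 1.0), (5.0, 0.0)])
    print(latency_qos.utility(2.0))   # 0.75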

20
Future Work
  • Implementing an Aurora prototype system
  • Working on a distributed architecture, Aurora*

21
Introduction of STREAM
  • A general-purpose DSMS
  • Supports a declarative query language (CQL)
  • for registering continuous queries
  • Flexible query plans
  • Designed to cope with high data rates and large
    numbers of continuous queries
  • provides approximate answers when resources are
    limited
  • careful resource allocation and usage

22
System Architecture
23
Query Language (CQL)
  • An extended version of SQL
  • Includes
  • Sliding window specification
  • partitioning clause (grouping)
  • window size (ROWS or RANGE)
  • e.g. ROWS 50 PRECEDING
  • e.g. RANGE 15 MINUTES PRECEDING
  • filtering predicate (WHERE)
  • Sampling clause
  • specifies that a random sample of the data
    elements should be used for query processing
  • (e.g. 1% SAMPLE means each data element in
    the stream should be retained with probability
    0.01 and discarded with probability 0.99)
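  • The sampling clause amounts to an independent Bernoulli trial per
    data element; a minimal sketch (the function name and stream
    representation are assumptions, not STREAM's implementation):

    import random

    def sample_operator(stream, percent):
        # Retain each element with probability percent/100 (e.g. 1% SAMPLE),
        # discard it otherwise.
        for element in stream:
            if random.random() < percent / 100.0:
                yield element

    # roughly 1 in 100 elements survive a 1% SAMPLE clause
    kept = list(sample_operator(range(10000), percent=1))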

24
Query Example
  • the example queries reference a stream Requests
    of requests to a web proxy server, each with four
    attributes
  • client_id, domain, URL, and reqTime
  • counts the number of requests for pages from the
    domain stanford.edu in the last day
  • SELECT COUNT(*)
  • FROM Requests S [RANGE 1 DAY PRECEDING]
  • WHERE S.domain = 'stanford.edu'
  • counts how many page requests were for pages
    served by Stanford's CS department web server,
    considering only each client's 10 most recent
    page requests from the domain stanford.edu
  • SELECT COUNT(*)
  • FROM Requests S [PARTITION BY S.client_id
  •                  ROWS 10 PRECEDING
  •                  WHERE S.domain = 'stanford.edu']
  • WHERE S.URL LIKE 'http://cs.stanford.edu/%'

25
Query Example (cont.)
  • this example references a stored relation Domains
    that classifies domains by the primary type of
    web content they serve
  • counts the number of requests for pages from
    commerce domains out of the last 10,000
    requests for pages from domains that have been
    classified
  • SELECT COUNT(*)
  • FROM (SELECT R.class
  •       FROM Requests S [10% SAMPLE], Domains R
  •       WHERE S.domain = R.domain) T
  •       [ROWS 10000 PRECEDING]
  • WHERE T.class = 'commerce'
  • Note: the stream of requests must be joined
    with the Domains relation (resulting in a stream
    labeled T) before applying the sliding window
26
Query Language (cont.)
  • Stream Ordering and Timestamps
  • Assume global, discrete, ordered time domain
  • Each stream tuple has a timestamp
  • Explicit
  • Use attribute TIMESTAMP (type DATETIME) in CREATE
    STREAM statement
  • Arrival-based
  • the timestamp is the value of the system clock at
    arrival time
  • Inactive and Weighted Queries
  • Queries may be assigned weights indicating their
    relative importance
  • Provide more precision with higher weight
  • Inactive queries
  • queries with negligible weight
  • Influence query plans and resource allocation

27
Query Plans
  • Accounting for plan sharing and approximation
    techniques
  • Compiles declarative queries into individual
    plans; the system may merge plans
  • (Aurora, in contrast, directly manipulates one
    large execution plan)
  • Allows direct input of query plans
  • Similar to Aurora
  • Plans composed of three types of components
  • Query operators (similar to traditional DBMS)
  • Inter-operator queues (similar to some
    traditional DBMS)
  • Synopses
  • used to maintain state associated with operators
  • summarization technique (sliding windows) used to
    limit their size (produce approximate results)
  • Global scheduler for plan execution

28
Query Plans (cont.)
  • Generic methods of the Operator class
  • Create, changeMem, run
  • Generic methods of the Synopsis class
  • Create, changeMem, insert and delete, query
  • Separate implementation allows us to couple any
    operator type with any synopsis type, and paves
    the way for operator and synopsis sharing
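  • A minimal Python sketch of these generic interfaces; the method
    names follow the slide, while the signatures and base classes are
    assumptions:

    from abc import ABC, abstractmethod

    class Operator(ABC):
        @abstractmethod
        def create(self, **params): ...      # instantiate with plan-specific parameters
        @abstractmethod
        def changeMem(self, new_amount): ... # adjust memory allotted to this operator
        @abstractmethod
        def run(self, budget): ...           # execute for a bounded amount of work

    class Synopsis(ABC):
        @abstractmethod
        def create(self, **params): ...
        @abstractmethod
        def changeMem(self, new_amount): ...
        @abstractmethod
        def insert(self, tup): ...           # add a tuple to the summarized state
        @abstractmethod
        def delete(self, tup): ...           # remove a tuple (e.g. expired from a window)
        @abstractmethod
        def query(self, predicate): ...      # probe the synopsis, e.g. for join matches

    # Because operators touch synopses only through this interface, any
    # operator type can be coupled with any synopsis type, and synopses
    # can be shared between operators.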

29
Example of Query Plans
30
Resource Sharing in Query Plans
  • Can combine plans that have exactly matching
    subexpressions
  • multiple queries accessing the same incoming base
    data stream S share S as a common subexpression
  • The implementation of a shared queue (sketched at
    the end of this slide)
  • maintains a pointer to the first unread tuple for
    each operator that reads from the queue, and
  • it discards tuples once they have been read by
    all parent operators
  • A shared subplan may not be used if two queries
    with a common subexpression have parent operators
    with very different consumption rates
  • May need to introduce synopsis sharing
  • Automatic resource sharing is less crucial in
    Aurora
  • Resource sharing is primarily programmed by users
    when they augment the current mega-plan
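  • A minimal sketch of such a shared queue, with one read pointer per
    consuming operator (class and method names are assumptions, not
    STREAM's implementation):

    class SharedQueue:
        # Single writer, multiple readers; a tuple is kept until every
        # parent operator has read it.
        def __init__(self, reader_ids):
            self.buffer = []                               # tuples not yet read by everyone
            self.base = 0                                  # global index of buffer[0]
            self.next_index = {r: 0 for r in reader_ids}   # first unread tuple per reader

        def push(self, tup):
            self.buffer.append(tup)

        def pop(self, reader):
            i = self.next_index[reader]
            if i - self.base >= len(self.buffer):
                return None                                # nothing new for this reader
            tup = self.buffer[i - self.base]
            self.next_index[reader] = i + 1
            self._discard_fully_read()
            return tup

        def _discard_fully_read(self):
            # Drop tuples that every parent operator has already read.
            min_unread = min(self.next_index.values())
            drop = min_unread - self.base
            if drop > 0:
                del self.buffer[:drop]
                self.base = min_unread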

31
Approximation Techniques
  • Goal is to maximize the precision of query
    answers based on the available resources
  • Static and Dynamic Approximations
  • Static Approximation
  • Queries are modified when they are submitted to
    the system (to use fewer resources)
  • Two techniques
  • Window Reduction (reduce memory and computation)
  • Decrease the window size or introduce a window
    where none was specified originally (band joins)
  • This can have a ripple effect that propagates up
    the operator tree
  • Sampling Rate Reduction (reduce output rate)
  • Reduce the sampling rate of the SAMPLE clause or
    introduce one where none was specified originally
  • Can take an existing sample operator and push it
    down the query plan

32
Advantages of Static Approximation
  • Advantages of Static Approximation
  • The user is guaranteed certain query behavior,
    since the (modified) query is executed precisely
    by the system
  • The user can participate in the process by guiding
    or approving the system's query modifications
  • Adaptive approximation techniques and continuous
    monitoring of system activity are not required

33
Dynamic Approximation
  • Dynamic Approximation
  • Queries are unchanged
  • System may not always provide precise query
    answer
  • Three techniques
  • Synopsis Compression (analogous to window
    reduction)
  • Reduce synopsis sizes at one or more operators
  • Incorporating a sliding window into a synopsis or
    shrinking the existing window
  • Maintaining a sample of the intended synopsis
    content
  • Sampling (reduce queue size)
  • Introduce one or more sample operators into the
    query plan, or to reduce the sampling rate at
    existing operators
  • Load Shedding (reduce queue size)
  • Simply drop tuples from queues when they grow too
    large

34
Advantages of Dynamic Approximation
  • Advantages of Dynamic Approximation
  • The level of approximation can vary with
    fluctuations in data rates and distributions,
    query workload, and resource availability
  • Approximation can occur at the plan operator
    level, and decisions can be made based on the
    global set of (possibly shared) query plans
    running in the system

35
Resource Management
  • Focus primarily on memory consumed by query plan
    synopses and queues
  • Static Resource Allocation
  • Allocating resources to queries (in a
    resource-limited environment) so as to maximize
    query result precision
  • Assume that all plan operators map allocated
    resources to precision specifications (FP, FN),
    where FP, FN ∈ [0, 1]
  • FP captures the false positive rate: the
    probability that an output stream tuple is
    incorrect
  • FN captures the false negative rate: the
    probability, for each correct output stream
    tuple, that there is another correct tuple that
    was missed
  • (FP, FN) can also denote the precision of an
    operator
  • For each operator type, compute output stream
    precision (FP, FN) values from the precision of
    the input streams and the precision of the
    operator itself
  • Apply the formulas bottom-up to the query plan,
    feeding the result to the numerical solver which
    produces the optimal resource allocation
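  • A generic sketch of the bottom-up propagation step; the per-operator
    combine function is a hypothetical placeholder, since the actual
    formulas are operator-type specific and not reproduced here:

    def plan_precision(node):
        # Recursively compute the (FP, FN) precision of a plan's output stream.
        #   node.children  : input sub-plans (empty for a base input stream)
        #   node.precision : (FP, FN) of the operator itself, given its resources
        #   node.combine   : operator-specific function from input-stream
        #                    precisions and operator precision to output precision
        child_precisions = [plan_precision(c) for c in node.children]
        if not child_precisions:
            return (0.0, 0.0)    # base input streams are assumed exact
        return node.combine(child_precisions, node.precision)

    # The (FP, FN) obtained at the plan root is what the numerical solver
    # optimizes when it searches over resource allocations.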

36
Exploiting Constraints Over Data Streams
  • To reduce memory overhead in query plan operators
  • Specify an adherence parameter k that captures
    how closely a given stream or set of streams
    adheres to a constraint of that type
  • e.g. clustered-arrival constraint on a stream
    attribute S.A
  • If two tuples in stream S have the same value v
    for A, then at most k tuples with non-v values
    for A occur on S between them
  • The closer the streams adhere to the specified
    constraints at run-time, the smaller the required
    synopses (state)
  • Constraints considered
  • Between two streams
  • many-one join, and referential integrity
    constraints
  • Individual stream
  • unique-value, clustered-arrival, and
    ordered-arrival
  • The algorithm accepts select-project-join queries
    over streams with arbitrary constraints, and it
    produces a query plan that exploits constraints
    to reduce synopsis sizes without compromising
    precision
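  • As an illustration, a small checker for the clustered-arrival
    constraint defined above (the function name and the list-based stream
    representation are assumptions):

    def satisfies_clustered_arrival(values, k):
        # True if, whenever two tuples carry the same value v for A, at most
        # k tuples with non-v values occur between them. It suffices to check
        # the widest pair: the first and the most recent occurrence of each v.
        first_pos, count = {}, {}
        for pos, v in enumerate(values):
            if v in first_pos:
                non_v_between = pos - first_pos[v] - count[v]
                if non_v_between > k:
                    return False
            first_pos.setdefault(v, pos)
            count[v] = count.get(v, 0) + 1
        return True

    # at most one foreign value ever appears between repeats of a value
    print(satisfies_clustered_arrival(["a", "a", "b", "a", "b", "b"], k=1))  # True
    print(satisfies_clustered_arrival(["a", "b", "a", "b", "a"], k=1))       # False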

37
Scheduling
  • Global scheduler for plan execution (calls run
    methods)
  • uses a round-robin scheme
  • Focus on minimizing intermediate (inter-operator)
    queue sizes
  • Parallelism not considered
  • Greedily schedule the operator that consumes
    the largest number of tuples per time unit and is
    the most selective (i.e. produces the fewest
    tuples)
  • Example
  • a query plan with two unary operators
  • O1 operates on input queue q1, writing results to
    queue q2 which is input to operator O2
  • O1 takes one time unit to operate on a batch of n
    tuples from q1, and has 20% selectivity (produces
    n/5 tuples in q2)
  • operator O2 takes one time unit to operate on n/5
    tuples, produces no tuples on its output queue
  • assume the average arrival rate of tuples on q1
    is no more than n tuples per two time units, so
    all tuples can be processed and queues will not
    grow without bound

38
Scheduling (cont.)
  • Two possible scheduling strategies for the
    example
  • Tuples are processed to completion in the order
    they arrive on q1. Each batch of n tuples in q1
    is processed by O1 and then O2 based on arrival
    time, consuming two time units overall
  • If there is a batch of n tuples in q1, then O1
    operates on them using one time unit, producing
    n/5 new tuples in q2. Otherwise, if there are
    any tuples in q2 then up to n/5 of these tuples
    are operated on by O2, consuming one time unit
  • e.g. 2n tuples arrive on q1 at time t = 0, no
    tuples at time t = 1, and n tuples each at times
    t = 2 and t = 3
  • Comparing the total size of queues q1 and q2 over
    time (measured in multiples of n), both strategies
    finish at the 8th step, but Strategy 2 is clearly
    preferable in terms of memory overhead
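  • The comparison can be reproduced with a small simulation of the
    example (a sketch under the stated assumptions: O1 consumes a batch
    of n with 20% selectivity, O2 consumes n/5 per time unit, and 2n, 0,
    n, n tuples arrive in the first four time units; sizes are reported
    in multiples of n):

    from fractions import Fraction

    def simulate(strategy, arrivals, steps=8):
        # Total queue size q1 + q2 at each time step, in multiples of n.
        one_fifth = Fraction(1, 5)
        q1, q2, sizes = Fraction(0), Fraction(0), []
        for t in range(steps):
            if t < len(arrivals):
                q1 += arrivals[t]
            sizes.append(float(q1 + q2))
            if strategy == "fifo":
                # Strategy 1: finish each batch (O1 then O2) before the next.
                if q2 > 0:
                    q2 -= one_fifth              # O2 drains the batch's n/5 tuples
                elif q1 >= 1:
                    q1 -= 1; q2 += one_fifth     # O1 processes a batch of n tuples
            else:
                # Strategy 2: prefer O1, the operator that shrinks total queue
                # size fastest; run O2 only when no full batch is waiting.
                if q1 >= 1:
                    q1 -= 1; q2 += one_fifth
                elif q2 > 0:
                    q2 -= one_fifth
        return sizes

    arrivals = [2, 0, 1, 1]                 # multiples of n arriving at t = 0..3
    print(simulate("fifo",   arrivals))     # [2.0, 1.2, 2.0, 2.2, 2.0, 1.2, 1.0, 0.2]
    print(simulate("greedy", arrivals))     # [2.0, 1.2, 1.4, 1.6, 0.8, 0.6, 0.4, 0.2]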

39
Scheduling (cont.)
  • Can achieve queue size minimization, but pay in
    increased time to initial results
  • Two additional considerations
  • Favor operators with full batches of tuples in
    their input queues over higher-priority operators
    with underfull input queues
  • Chains of operators within a plan
  • do not schedule chains as a unit, unlike Aurora's
    train scheduling algorithm
  • Aurora's objective is to improve throughput by
    reducing context-switching between operators,
    batching the processing of tuples through
    operators, and reducing I/O overhead
    (inter-operator queues may be written to disk)
  • Aurora
  • QoS graphs capture tradeoffs among precision,
    response time, resource usage, and usefulness to
    the application. However, approximation appears
    solely through drop-boxes that perform load
    shedding.

40
Implementation and Interfaces
  • Three features of the design
  • Generic entities
  • Coding of query plans
  • System interface
  • Entities and Control Tables
  • Operators, queues and synopses are subclasses of
    a generic Entity class
  • Each entity has a table of attribute-value
    pairs, its Control Table (CT), and each entity
    exports an interface to query and update its CT
  • Dynamically control the behavior of an entity
  • The amount of memory used by a synopsis S can be
    controlled by updating the value of attribute
    Memory in S's control table
  • Collect statistics about entity behavior for
    resource management and for user-level system
    monitoring
  • The number of tuples that have passed through a
    queue q is stored in attribute Count of q's
    control table
  • Offer extensibility (add new attributes to a CT)
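  • A minimal sketch of the Entity/Control Table idea (the attribute
    names Memory and Count come from the slides; everything else is an
    assumption):

    class Entity:
        # Base class for operators, queues and synopses; behavior and
        # statistics are exposed through an attribute-value Control Table.
        def __init__(self, **initial_attributes):
            self._ct = dict(initial_attributes)

        def get_ct(self, attribute):
            return self._ct.get(attribute)

        def set_ct(self, attribute, value):
            # Updating the CT dynamically changes the entity's behavior
            # (e.g. "Memory" on a synopsis) or records statistics for
            # monitoring (e.g. "Count" on a queue).
            self._ct[attribute] = value

    class Queue(Entity):
        def __init__(self):
            super().__init__(Count=0)

        def enqueue(self, tup):
            self.set_ct("Count", self.get_ct("Count") + 1)
            # ... actual tuple buffering elided ...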

41
Implementation and Interfaces (cont.)
  • Query Plans
  • Implemented as networks of entities, stored in
    main memory
  • A graphical interface is provided for creating
    and viewing plans, and for adjusting attributes
    of operators, queues, and synopses
  • Query plans may be viewed and edited even as
    queries are running
  • Main-memory plan structures are also stored in XML
    files (to make continuous queries persistent)
  • Plans are loaded at system startup, any
    modifications to plans during system execution
    are reflected in the corresponding XML
  • Users are free to create and edit XML plans
    offline

42
Implementation and Interfaces (cont.)
  • Programmatic and Human Interfaces
  • a web interface through direct HTTP
  • planning to expose the system as a web service through SOAP
  • remote applications
  • can be written in any language and on any
    platform
  • can register queries
  • can request and update CT attribute values
  • can receive the results of a query as a streaming
    HTTP response in XML
  • human users
  • web-based GUI exposing the same functionality

43
Conclusion
  • Both prototypes are still under development
  • STREAM needs to design its query processor with a
    view toward migration to distributed processing
  • STREAM may extend the system to handle XML data
    streams
  • The two systems are quite alike
  • We think they could join their efforts to come up
    with an even better DSMS

44
References
  • Aurora website
  • http://www.cs.brown.edu/research/aurora/
  • Carney, D., et al., "Monitoring Streams - A New
    Class of Data Management Applications," Proc. of
    Very Large Data Bases (VLDB), Hong Kong, China,
    August 2002. http://www.cs.uml.edu/kajal/courses/
    91.580-S03/papers/cccc-monitoring-streams.pdf
  • Motwani, R., et al., "Query Processing,
    Approximation, and Resource Management in a Data
    Stream Management System," Proc. of the 2003
    CIDR Conference
  • Babcock, B., et al., "Models and Issues in Data
    Stream Systems," Proc. 21st ACM
    SIGACT-SIGMOD-SIGART Symp. on Principles of
    Database Systems, pp. 1-16, Madison, Wisconsin,
    May 2002
  • Stanford University STREAM website
  • http://www.db.stanford.edu/stream