Models and Issues in Data Stream Systems
  • 2001700537

  • The need for and research issues arising from a
    new model of data processing.
  • Review past work relevant to data stream systems
    and current projects in that area.
  • Explore topics in stream query languages, new
    requirements and challenges in query processing,
    and algorithmic issues.

  • The Data Stream Model
  • Review of Data Stream Projects
  • Queries of Data Streams
  • Stanford's Proposal for DSMS
  • Algorithmic Issues

The Data Stream Model
  • Data stream model differs from conventional stored
    relation model
  • Data elements in the stream arrive online
  • System has no control over order in which data
    elements arrive to be processed
  • Data streams are potentially unbounded in size
  • Once an element from a data stream has been
    processed, it is discarded or archived. It cannot
    be retrieved easily unless it is stored in
    memory, which is small relative to the size of
    data streams
  • Operating in data stream model does not preclude
    use of data in conventional stored relations.

  • One-time queries and Continuous queries
  • One-time queries
  • Evaluated once over a point-in-time snapshot of
    data set
  • Continuous queries
  • Evaluated continuously as data streams continue
    to arrive
  • Answers may be stored and updated as new data
    arrives, or may be produced as data streams
    themselves

  • Predefined and Ad hoc queries
  • Predefined
  • Supplied to data stream management system before
    any relevant data has arrived
  • Usually continuous queries
  • Scheduled one-time queries possible
  • Ad hoc
  • Can be either one-time or continuous queries
  • Complicate design of data stream management
    system (DSMS): they are not known in advance for
    purposes of query optimization, and answering
    them correctly may require referencing data that
    has already arrived on data streams and may
    already have been discarded

Motivating Examples
  • Web-based financial search engine that evaluates
    queries over real-time streaming financial data
    such as stock tickers and news feeds.
  • Modern security applications.
  • Integrated security platform providing
    services such as firewall support and intrusion
    detection over multi-gigabit network packet
    streams
  • Needs to perform complex stream processing,
    including URL-filtering based on table lookups
    and correlation across multiple network traffic
    flows
  • Large web sites monitor web logs online to enable
    applications such as personalization, performance
    monitoring, and load-balancing (e.g., Yahoo)
  • Sensor monitoring
  • Network traffic management

Review of Data Stream Projects
  • Tapestry System
  • Continuous queries used for content-based
    filtering over an append-only database of email
    and bulletin board messages
  • Restricted subset of SQL used as query language
    in order to provide guarantees about efficient
    evaluation and append-only results

Review of Data Stream Projects
  • Alert system
  • Mechanism for implementing event-condition-action
    style triggers in conventional SQL database
  • Used continuous queries defined over special
    append-only active tables
  • XFilter content-based filtering system
  • Efficient filtering of XML documents based on
    user profiles as continuous queries in XPath

Review of Data Stream Projects
  • Xyleme
  • Similar to XFilter (content-based filtering)
  • Enables high throughput with restricted query
    language
  • Tribeca stream database manager
  • Restricted querying capability over network
    packet streams
  • Tangram stream query processing system
  • Used stream processing techniques to analyze
    large quantities of stored data

Review of Data Stream Projects
  • OpenCQ
  • Supports continuous queries for monitoring
    persistent data sets spread over a wide-area
    network
  • Uses query processing algorithm based on
    incremental view maintenance.
  • NiagaraCQ
  • Supports continuous queries for monitoring
    persistent data sets spread over a wide-area
    network
  • Addresses scalability in number of queries by
    proposing techniques for grouping continuous
    queries for efficient evaluation.
  • Problem of supporting blocking operators in query
    plans over data streams discussed
  • Viglas and Naughton proposed rate-based
    optimization for queries over data streams (based
    on stream-arrival and data-processing rates)

Review of Data Stream Projects
  • Chronicle data model
  • Append-only ordered sequences of tuples
    (chronicles), a form of data streams
  • Defined restricted view definition language and
    algebra (chronicle algebra) that operates over
    chronicles together with traditional relations.
  • Focus was to ensure that views defined in
    chronicle algebra could be maintained
    incrementally without storing any of the
    chronicles
  • Seshadri, Livny, and Ramakrishnan proposed an
    algebra and a declarative query language for
    querying ordered relations
  • Related work in this area includes work on
    temporal and time-series databases, where the
    ordering of tuples implied by time can be used
    in querying, indexing, and query optimization

Review of Data Stream Projects
  • Materialized views relate to continuous queries
  • Materialized views are really queries that need
    to be reevaluated or incrementally updated
    whenever the base data changes
  • Important work in this area
  • Self-maintenance: Ensuring that enough data has
    been saved to maintain a view even when the base
    data is unavailable
  • Data expiration: Determining when certain base
    data can be discarded without compromising the
    ability to maintain a view
  • Differences: continuous queries may
  • Stream rather than store results
  • Deal with append-only input data
  • Provide approximate rather than exact answers
  • Processing strategy may adapt as characteristics
    of data streams change

Review of Data Stream Projects
  • Telegraph project
  • Uses adaptive query engine to process queries
    efficiently in volatile and unpredictable
    environments
  • Query execution strategies over data streams
    generated by sensors
  • Processing techniques for multiple continuous
    queries
  • Tukwila system
  • Supports adaptive query processing, in order to
    perform dynamic data integration over autonomous
    data sources
Review of Data Stream Projects
  • Aurora Project
  • New data processing system targeted towards
    stream monitoring applications
  • Consists of large network of triggers
  • Each trigger is data-flow graph with each node
    being one among seven built-in operators
  • For each stream monitoring application using
    system, an application administrator creates and
    adds one or more triggers into trigger network
  • Performs compile-time optimization and run-time
    optimization of trigger network
  • Detects resource overload and performs load
    shedding based on application-specific measures
    of QoS

Queries of Data Streams
  • Unbounded Memory Requirements
  • Approximate Query Answering
  • Sliding Windows
  • Batch Processing, Sampling, and Synopses
  • Blocking Operators
  • Queries Referencing Past Data

Unbounded Memory Requirements
  • Since data streams are potentially unbounded in
    size, amount of storage required to compute exact
    answer to the query may grow without bound
  • External memory algorithms for handling data sets
    larger than main memory cannot be used.
  • Do not support continuous queries
  • Too slow for real-time response
  • With new data constantly arriving even as old
    data is being processed, amount of computation
    time per data element must be low
  • Interested in algorithms that are able to confine
    themselves to main memory without accessing disk

Approximate Query Answering
  • Since we are limited to a bounded amount of memory,
    it may not be possible to produce exact answers
  • High-quality approximate answers can be an
    acceptable solution
  • Techniques for data reduction and synopsis
  • Sketches
  • Random sampling
  • Histograms
  • Wavelets

Sliding Windows
  • Evaluate query over sliding window of recent data
    from streams
  • Attractive Properties
  • Well-defined and understood
  • Deterministic so there is no danger that bad
    random choices will produce bad approximation
  • Emphasizes recent data, which in many real-world
    applications is more important than old data

[Figure: stream timeline labeled Past Data, Recent Data, and Future Data]

Sliding Windows
  • Research Issues
  • How do we define timestamps over streams to
    facilitate use of windows?
  • How do we implement sliding window queries?
  • What is their impact on query optimization?
  • If window is too big to fit in main memory, how
    can we give approximate answers using only
    available memory?

Sliding Windows
  • Differences between sequence and temporal DBs and
    the stream computation model
  • Temporal DB
  • Concerned with full history of each data value
    over time
  • Stream system concerned with processing new data
    elements on-the-fly
  • Sequence DB
  • Attempt to produce query plans that allow for
    stream access.
  • A single scan of input data is sufficient to
    evaluate plan and amount of memory required for
    plan evaluation is constant, independent of data.
  • Assumes that DB system has control over which
    sequence to process tuples from next (e.g., when
    merging multiple sequences), which cannot be
    assumed in stream system

Batch Processing, Sampling, and Synopses
  • Don't process each data element as it arrives
  • Resort to sampling or batch processing technique
    to speed up query execution
  • Framework
  • Query answered using data structures that can be
    maintained incrementally
  • Data structure supports two operations
  • update(tuple) updates data structure as each new
    data element arrives
  • computeAnswer() produces new or updated results
    to query
  • Best case scenario is that both operations are
    fast relative to arrival rate of elements in data
    streams, so no special techniques are needed
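
A minimal sketch of the two-operation framework above, assuming the query is a running average over a stream of numbers. The class name RunningAverage and the method name compute_answer are illustrative choices, not names from the presentation.

```python
class RunningAverage:
    """Incrementally maintained structure with the two operations named above."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def update(self, value):
        # Called once per arriving data element; constant time and space.
        self.count += 1
        self.total += value

    def compute_answer(self):
        # Produces the current answer without rescanning the stream.
        return self.total / self.count if self.count else None


avg = RunningAverage()
for x in [4, 8, 6]:
    avg.update(x)
print(avg.compute_answer())  # 6.0
```

Here both operations are cheap, which is the best case described above; batch processing and sampling address the cases where one of them is not.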

Batch Processing
  • update(tuple) is fast but computeAnswer() is slow
  • Data elements buffered as they arrive
  • Answer to query is computed periodically as time
    permits
  • Does not cause any uncertainty about accuracy of
    answer, sacrificing timeliness instead.
  • Good when data streams are bursty

Sampling
  • computeAnswer() fast, but update(tuple) slow
  • Some tuples skipped altogether so query is
    evaluated over sample of data stream rather than
    over entire data stream.
  • Give confidence bounds on degree of error
    introduced by sampling process
  • For many situations, including most queries
    involving joins, sampling cannot give reliable
    approximation guarantees

Synopsis Data Structures
  • computeAnswer() fast, and update(tuple) fast
  • Used for queries where no exact data structure
    with desired properties exists
  • Approximate data structure that maintains small
    synopsis or sketch of data rather than exact
    representation, so computation per data element
    is low.

Blocking Operators
  • Query operator that is unable to produce the
    first tuple of its output until it has seen its
    entire input. (e.g., sorting, aggregation
    operators like SUM)
  • Since streams may be infinite, a blocking
    operator using a stream as one of its inputs will
    never see entire input and will never produce
    any output
  • Operators that are root of tree of query
    operators are more tractable than operators that
    are interior nodes in tree, producing results
    that feed to other operators.
  • Aggregation operator at root produces a single
    value or small number of values and updates to
    answer can be streamed out as they are produced
  • When answer is larger like in a sort, it is more
    practical to maintain a data structure with
    up-to-date answer rather than retransmitting an
    entire answer
  • Results produced by blocking operators may
    continue to change over time, so operators
    consuming those results cannot make reliable
    decisions based on results at intermediate stage
    of query execution

Blocking Operators
  • We can handle blocking operators as interior nodes
    in query tree by replacing them with non-blocking
    analogs
  • The juggle operator is a non-blocking version of
    sort. It aims to locally reorder a data stream so
    that tuples that come earlier in desired sort
    order are produced before tuples that come later
    in sort order, although some tuples may be
    delivered out of order

Blocking Operators
  • Tucker et al. suggested augmenting data streams
    with assertions about what can and cannot appear
    in remainder of data stream
  • Assertions (punctuations) interleaved with data
    elements in stream
  • Example: with the assertion "for all future
    tuples, daynumber ≥ 10"
  • Aggregation operator that was grouping by
    daynumber could stream out its answers for all
    daynumbers < 10
  • Join operator could discard all its saved state
    relating to previously-seen tuples in joining
    stream with daynumber < 10
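
A hedged sketch of how an aggregation operator might use such punctuations, assuming a per-day SUM and a stream encoded as Python tuples. The item layout ("tuple", daynumber, value) / ("punct", bound) and the function name sum_by_day are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def sum_by_day(stream):
    """Yield (daynumber, total) as soon as a punctuation closes that group.

    Stream items are either ("tuple", daynumber, value) or
    ("punct", bound), where the punctuation asserts that every
    future tuple has daynumber >= bound.
    """
    open_groups = defaultdict(int)
    for item in stream:
        if item[0] == "tuple":
            _, day, value = item
            open_groups[day] += value
        else:
            _, bound = item
            # Groups below the bound can never change again: emit and drop them.
            for day in sorted(d for d in open_groups if d < bound):
                yield day, open_groups.pop(day)


stream = [("tuple", 8, 5), ("tuple", 9, 2), ("tuple", 8, 1),
          ("punct", 9), ("tuple", 10, 7), ("punct", 10)]
print(list(sum_by_day(stream)))  # [(8, 6), (9, 2)]
```

The group for day 10 stays open at the end because no punctuation has yet ruled out further tuples for it.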

[Figure: stream with an interleaved assertion "daynumber ≥ 10"; labels: daynumber < 10, daynumber ≥ 10]

Queries Referencing Past Data
  • Ad hoc queries that are issued after some data
    has already been discarded may be impossible to
    answer accurately
  • One solution is to only allow ad hoc queries that
    reference future data. It may be acceptable in
    some applications
  • Another solution is to maintain summaries of data
    streams (synopses or aggregates) that can
    approximate answers to future ad hoc queries
  • Problem similar to problems in physical DB design
    such as selection of indexes and materialized
    views, but in traditional DB design, we can still
    get the right answer at higher cost if no index
    present. But in stream model, if no summary
    structure is present, we can't get the answer

Stanford's Proposal for DSMS
  • STREAM (Stanford Stream Data Manager)
  • Query Language for a DSMS
  • Timestamps in Streams
  • Query Processing Architecture of a DSMS

Query Language for a DSMS
  • Modified version of SQL
  • Allowed the FROM clause to refer to streams as
    well as relations
  • Allows optional window specification to be
    provided after a stream that is supplied in a
    query's FROM clause
  • Sliding window requires an ordering of data
    stream elements, using implicit timestamp
    attached to each data element
  • Example: Compute average call length, considering
    only ten most recent long-distance calls placed
    by each customer
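
The example above can be sketched outside the query language as follows. This is an illustration in Python, not the STREAM query syntax; the field names customer_id, call_type, and minutes are assumptions.

```python
from collections import defaultdict, deque

WINDOW = 10  # ten most recent long-distance calls per customer


class AvgCallLength:
    def __init__(self):
        # One bounded window per customer; old calls fall out automatically.
        self.windows = defaultdict(lambda: deque(maxlen=WINDOW))

    def update(self, customer_id, call_type, minutes):
        if call_type == "Long Distance":
            self.windows[customer_id].append(minutes)

    def compute_answer(self):
        # Average over the union of all per-customer windows.
        calls = [m for w in self.windows.values() for m in w]
        return sum(calls) / len(calls) if calls else None


q = AvgCallLength()
for minutes in range(1, 13):          # 12 calls; only the last 10 are kept
    q.update("a", "Long Distance", minutes)
q.update("b", "Local", 100)           # ignored by the filter
print(q.compute_answer())  # 7.5
```

Memory is bounded by the window size times the number of customers, regardless of how long the stream runs.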

[Example query; surviving fragment: S.type = 'Long Distance']

Timestamps in Streams
  • Timestamps are ambiguous for streams derived from
    multiple streams (e.g., join)
  • Previous example uses implicit timestamps, in
    which system adds a special field to each
    incoming tuple
  • Explicit timestamp is a data attribute designated
    as the timestamp
  • Used when each tuple corresponds to real-world
    event at particular time that is of importance to
    meaning of tuple
  • Drawback is that tuples may not arrive in same
    order as their timestamps: tuples with later
    timestamps may come before tuples with earlier
    timestamps. This makes it difficult to perform
    sliding window computation
  • But if input stream is almost-sorted, we can
    fix it with a little buffering.
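
One way the "little buffering" could look, as a sketch: hold back a small number of tuples in a min-heap and always release the earliest. It assumes no tuple arrives more than `slack` positions late; the function name reorder and the parameter slack are illustrative.

```python
import heapq


def reorder(stream, slack):
    """Emit (timestamp, payload) pairs in timestamp order.

    At most `slack` tuples are held back at any time, so memory is bounded.
    The output is fully sorted only if no tuple is more than `slack`
    positions out of place in the input.
    """
    heap = []
    for ts, payload in stream:
        heapq.heappush(heap, (ts, payload))
        if len(heap) > slack:
            yield heapq.heappop(heap)
    while heap:  # flush the buffer once the input ends
        yield heapq.heappop(heap)


arrivals = [(1, "a"), (3, "c"), (2, "b"), (4, "d"), (6, "f"), (5, "e")]
print([ts for ts, _ in reorder(arrivals, slack=1)])  # [1, 2, 3, 4, 5, 6]
```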

Timestamps in Streams
  • Methods of assigning timestamps to output of binary
    operators such as join
  • Provide no guarantee about output order of tuples
    from a join operator.
  • Assume that tuples that arrive earlier are likely
    to pass through join earlier.
  • Each tuple that is produced by join op is assigned
    an implicit timestamp that is set to time that it
    was produced by join op
  • Flexible in implementation
  • But impossible to impose defined deterministic
    sliding window semantics on results of subqueries

Timestamps in Streams
  • User specifies as part of query what timestamp is
    to be assigned to tuples resulting from join of
    multiple streams
  • Order in which streams are listed in FROM clause
    of query represents a prioritization of streams
  • Implementation can be difficult (e.g., if output
    is to be sorted by timestamp, join op needs to
    buffer output until it can be determined that
    future input tuples will not disrupt ordering of
    output tuples)

[Figure: output tuple will have same timestamp as S1]

Timestamps in Streams
  • Best-effort

Query Processing Architecture
  • Query execution plans consist of operators
    connected by queues
  • Operators scheduled for execution by central
    scheduler
  • During execution operator reads data from its
    input queues, updates synopsis structure and
    writes results to output queues
  • Period of execution of operator determined
    dynamically by scheduler and operator returns
    control back to scheduler once period expires

Query Processing Architecture
  • To handle stream data characteristic
    fluctuations, operators are adaptive (primarily
    to memory)
  • Trading accuracy for memory
  • Operator maximizes accuracy of output based on
    size of available memory
  • Handles dynamic changes in size of its available
    memory
  • Example: For a sliding window join, the larger
    the window, the better the approximation

Query Processing Architecture
  • Issues in Memory Management
  • How do different query ops produce approximate
    answers under limited memory?
  • How do approximate results behave when operators
    are composed in query plans?
  • How can the DSMS allocate memory to operators to
    maximize accuracy of answer?
  • How can DSMS reallocate memory among operators
    under changing conditions?
  • Given a query, how does the query optimizer come
    up with a plan whose memory allocation is best
    and whose approximation error is smallest? Should
    plans be modified when conditions change?
  • Since synopses can be shared among query plans,
    how do we optimally consider a set of queries,
    which may be weighted by importance?

Query Processing Architecture
  • Issues in Scheduling
  • Scheduler needs to provide rate synchronization
    within operators and in pipelined operators in
    query plans
  • Time-varying arrival rates of data streams and
    time-varying output rates of operators complicate
    scheduling
  • Need to take into account
  • Memory allocation across operators
  • Management of buffers for incoming streams
  • Availability of synopses on disk (instead of
    in memory)
  • Performance requirements of individual queries

Algorithmic Issues
  • Random Samples
  • Sketching Techniques
  • Histograms
  • Sliding Windows
  • Negative Results
  • Miscellaneous algorithms

Random Samples
  • Used as summary structure in many scenarios where
    small sample is expected to capture essential
    characteristics of data set
  • Easiest form of summarization
  • Other synopses can be built from sample itself
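
The slides do not name a sampling algorithm. One well-known way to keep a uniform random sample of fixed size over a stream of unknown length is reservoir sampling, sketched below as an illustration; the function name reservoir_sample is our own.

```python
import random


def reservoir_sample(stream, k, rng=random):
    """Return k items chosen uniformly at random from a stream of any length.

    Only k items are held in memory at a time. If the stream has fewer
    than k items, all of them are returned.
    """
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(n)         # uniform in [0, n)
            if j < k:                    # happens with probability k / n
                sample[j] = item
    return sample


print(len(reservoir_sample(range(100_000), k=5)))  # 5
```

Because each element is inspected once and then either kept or dropped, the procedure fits the one-pass, bounded-memory setting described earlier.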

Sketching Techniques
  • Building summary of data stream using small
    amount of memory
  • Makes it possible to estimate answer to certain
    queries (like distance queries) over data set
  • Frequency moments of a stream S
  • F0 is the number of distinct values in S
  • F1 is the length of S
  • F2 is the self-join size
  • F∞ is the multiplicity of the most frequent item
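
For reference, the k-th frequency moment is F_k = sum over distinct values i of (m_i)^k, where m_i is the number of times value i occurs in S. The sketch below computes these quantities exactly; it is not a sketching algorithm, since it uses memory proportional to the number of distinct values, which is what sketches are designed to avoid. The function name frequency_moment is illustrative.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact k-th frequency moment of a stream; k may be float('inf')."""
    counts = Counter(stream)
    if k == float("inf"):
        return max(counts.values(), default=0)
    return sum(m ** k for m in counts.values())


S = ["a", "b", "a", "c", "a", "b"]           # a: 3, b: 2, c: 1
print(frequency_moment(S, 0))                # 3  distinct values
print(frequency_moment(S, 1))                # 6  length of S
print(frequency_moment(S, 2))                # 14 self-join size (9 + 4 + 1)
print(frequency_moment(S, float("inf")))     # 3  most frequent item's count
```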

Histograms
  • V-Optimal Histograms: approximate the distribution
    of a set of values by a piecewise-constant
    function so as to minimize the sum of squared
    error.
  • Equi-Width Histograms: partition the domain into
    buckets such that the number of values falling
    into each bucket is uniform across all buckets.
    They maintain quantiles for the underlying data
    distribution as the bucket boundaries. (This
    equal-count construction is more commonly called
    an equi-depth histogram.)
  • End-Biased Histograms: maintain exact counts of
    items that occur with frequency above a
    threshold, and approximate other counts by a
    uniform distribution.
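
A small offline illustration of the end-biased idea, assuming the full value counts are available; a streaming version would have to approximate the counts themselves. The function names end_biased_histogram and estimate are hypothetical.

```python
from collections import Counter


def end_biased_histogram(values, threshold):
    """Keep exact counts above `threshold`; summarize the rest by one average."""
    counts = Counter(values)
    frequent = {v: c for v, c in counts.items() if c > threshold}
    rest = [c for c in counts.values() if c <= threshold]
    # All remaining values share one average count (the uniform assumption).
    rest_avg = sum(rest) / len(rest) if rest else 0.0
    return frequent, rest_avg


def estimate(hist, value):
    # Simplification: any value without an exact count gets the shared average.
    frequent, rest_avg = hist
    return frequent.get(value, rest_avg)


data = ["x"] * 5 + ["y"] * 2 + ["z"]
hist = end_biased_histogram(data, threshold=3)
print(estimate(hist, "x"))  # 5   (exact)
print(estimate(hist, "y"))  # 1.5 (average of the counts 2 and 1)
```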

Wavelets
  • Used to provide a summary representation of data
  • Wavelet coefficients are projections of the given
    signal onto an orthogonal set of basis vectors
  • Choice of basis vectors determines type of
    wavelets
  • Haar wavelets are used in DB for ease of
    computation
  • The signal reconstructed from top few wavelet
    coefficients best approximates the original
    signal (in the L2 norm)
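
A hedged sketch of a Haar wavelet synopsis, assuming an orthonormal Haar basis and an input whose length is a power of two. The function names haar, inverse_haar, and keep_top are illustrative; this is an offline transform, not a streaming algorithm.

```python
import math

R2 = math.sqrt(2)


def haar(signal):
    """Orthonormal Haar transform; len(signal) must be a power of two."""
    coeffs = []
    current = list(signal)
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        details = [(a - b) / R2 for a, b in pairs]
        current = [(a + b) / R2 for a, b in pairs]
        coeffs = details + coeffs      # coarser details end up in front
    return current + coeffs


def inverse_haar(coeffs):
    """Rebuild the signal from coefficients laid out as haar() returns them."""
    current = coeffs[:1]
    pos = 1
    while pos < len(coeffs):
        details = coeffs[pos:pos + len(current)]
        pos += len(current)
        nxt = []
        for s, d in zip(current, details):
            nxt += [(s + d) / R2, (s - d) / R2]
        current = nxt
    return current


def keep_top(coeffs, k):
    """Zero out all but the k largest-magnitude coefficients (the synopsis)."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


signal = [2, 2, 2, 2, 9, 9, 9, 9]
approx = inverse_haar(keep_top(haar(signal), 2))
print([round(x, 6) for x in approx])
```

This signal is a single step, so two coefficients (the overall average and the coarsest difference) are enough to rebuild it up to floating-point rounding; less regular signals need more coefficients for the same accuracy.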

Sliding Windows
  • Prevent stale data from influencing analysis and
    statistics
  • Serve as tool for approximation in face of
    bounded memory
  • Open problems
  • Clustering
  • Maintaining top wavelet coefficients
  • Maintaining statistics like variance
  • Computing correlated aggregates

Negative Results
  • Emerging set of negative results on space-time
    requirements of algorithms that operate in stream
    model
  • Henzinger, Raghavan, and Rajagopalan provided
    space lower bounds for concrete problems in
    stream model, derived from results in
    communication complexity
  • Alon, Matias, and Szegedy provided almost tight
    lower bounds for computing the frequency moments.
  • General lower bound technique for sampling-based
    algorithms presented by Bar-Yossef et al.

Other Algorithms
  • Data Mining: Decision trees are another form of
    synopsis used for prediction
  • Multiple Streams: Computing simple functions in
    distributed environment
  • Reduction of Streams: List-efficient algorithms,
    instead of being presented one data item at a
    time, are implicitly presented with a list of
    data items in a succinct form
  • Property Testing: Programs that make one pass
    over data and, using small space, verify whether
    the data satisfies a certain property
  • Measuring Sortedness: Useful in determining the
    choice of a sort algorithm for underlying data

  • Some existing techniques can be adapted to the
    proposed model
  • Exact answers to a data stream query are
    probably not possible
  • There are a lot of ongoing projects that deal
    with streams

  • Babcock, B., Babu, S., Datar, M., Motwani, R.,
    and Widom, J. Models and Issues in Data Stream
    Systems. In Proc. ACM SIGMOD/PODS 2002, June
    3-5, 2002, Madison, Wisconsin.
