Streaming Queries over Streaming Data

Slides: 30
Provided by: andyw157

1
Streaming Queries over Streaming Data
  • Sirish Chandrasekaran (UC Berkeley)
  • Michael J. Franklin (UC Berkeley)
  • Presented by Andy Williamson

2
About Me
  • 3rd Year ISYE major
  • Minor in Computer Science
  • From Austin, TX
  • Have visited every state but Alaska
  • Intern at Deloitte Consulting focusing on SAP
    implementation

3
Agenda
  • Background/Motivation
  • PSoup
  • Introduction
  • System Overview
  • Query Processing Techniques
  • Implementation
  • Performance
  • Aggregation Queries
  • Conclusions
  • Critique

4
Background/Motivation
  • Continuous Query (CQ) Systems
  • Treat queries as fixed entities and stream data
    over them
  • Previous systems only allowed streaming of either
    data or queries
  • Continuously deliver results as they are computed
    (infeasible/inefficient)
  • Data Recharging
  • Monitoring

5
PSoup Introduction
  • Query processor based on Telegraph query
    processing framework
  • Allows both data and queries to be streamed
  • Partially stores results to support disconnected
    operation and improve data throughput and
    response time

6
PSoup System Overview
  • User initially registers query specification with
    system
  • System returns handle which can be used to invoke
    results of query later
  • Example Query
  • SELECT *
  • FROM Data_Stream D_s
  • WHERE (D_s.a < x AND D_s.b > y)
  • BEGIN(NOW - 10)
  • END(NOW)
  • Begin-End Clause allows
  • Snapshot (constant beginning and ending time)
  • Landmark (constant beginning and variable ending
    time)
  • Sliding window (variable beginning and ending
    time)
  • Limited by size of memory
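
The three window types differ only in which bounds are fixed and which move with NOW. The sketch below illustrates that idea; the `Window` class and its field names are hypothetical and are not taken from PSoup itself.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Window:
    """A BEGIN-END clause. Hypothetical structure, not PSoup's own."""
    begin: int                 # absolute time, or an offset back from NOW
    begin_relative: bool       # True when the start moves with NOW
    end: Optional[int] = None  # None stands for END(NOW)

    def resolve(self, now: int) -> Tuple[int, int]:
        """Turn the clause into concrete [begin, end] bounds at invocation time."""
        begin = now - self.begin if self.begin_relative else self.begin
        end = now if self.end is None else self.end
        return begin, end


snapshot = Window(begin=100, begin_relative=False, end=200)  # both ends fixed
landmark = Window(begin=100, begin_relative=False)           # fixed start, end follows NOW
sliding = Window(begin=10, begin_relative=True)              # both ends follow NOW
```

Resolving the bounds only at invocation time means the stored clause never has to be updated as time advances.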

7
PSoup System Overview
  • PSoup treats execution of query streams as a join
    of query and data streams
  • Maintains State Modules (SteMs) for queries and data
  • One Query SteM for all queries in the system, and
    one Data SteM for each data stream

8
PSoup Query Processing Techniques
  • Overview
  • PSoup assigns unique queryID that it returns to
    the user
  • Client can disconnect, reconnect and execute
    query to obtain updated results
  • PSoup continuously matches data to query
    predicates in background and stores the results
    in its Results Structure
  • When a query is invoked, PSoup applies the
    appropriate input window to the Results Structure
    to return the current results

9
PSoup Query Processing Techniques
  • Entry of new Query specs
  • New queries split into two parts
  • Standing Query Clause (SQC) consists of the
    SELECT-FROM-WHERE clauses
  • BEGIN-END clause, stored in separate WindowsTable
    structure
  • SQC inserted into Query SteM
  • Used to probe Data SteMs corresponding to tables
    in FROM clause
  • Resulting tuples stored in Results Structure

10
PSoup Query Processing Techniques
  • Entry of new data
  • New tuples assigned globally unique tupleID and
    physical timestamp (physicalID) based on system
    clock
  • Inserted into appropriate Data SteM
  • Then used to probe Query SteM to determine which
    SQCs it satisfies
  • TupleIDs and physicalIDs stored in Results
    Structure
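
The symmetry between the last two slides (a new query probes stored data, a new tuple probes stored queries) can be sketched as follows. This is a simplified single-stream illustration: the class `PSoupSketch`, its method names, and the use of plain dictionaries and a logical counter in place of the system clock are all assumptions made for the example.

```python
from itertools import count


class PSoupSketch:
    """Toy single-stream model of the query/data join (hypothetical names)."""

    def __init__(self):
        self.data_stem = {}    # tupleID -> (physicalID, row)
        self.query_stem = {}   # queryID -> predicate, a callable row -> bool
        self.results = {}      # queryID -> list of (physicalID, tupleID)
        self._tuple_ids = count(1)
        self._query_ids = count(1)
        self._clock = count(1)  # logical clock standing in for the system clock

    def register_query(self, predicate):
        """A new SQC is stored, then probes the data already seen."""
        qid = next(self._query_ids)
        self.query_stem[qid] = predicate
        self.results[qid] = [
            (ts, tid) for tid, (ts, row) in self.data_stem.items() if predicate(row)
        ]
        return qid

    def insert_tuple(self, row):
        """A new tuple is stored, then probes the queries already registered."""
        tid, ts = next(self._tuple_ids), next(self._clock)
        self.data_stem[tid] = (ts, row)
        for qid, predicate in self.query_stem.items():
            if predicate(row):
                self.results[qid].append((ts, tid))
        return tid
```

Either arrival order leaves the same matches recorded, which is what lets a query see data that arrived both before and after it.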

11
PSoup Query Processing Techniques
  • Selection Queries over a single stream

12
PSoup Query Processing Techniques
  • Join Queries Over Multiple Streams

13
PSoup Query Processing Techniques
  • Query Invocation and Result Construction
  • Results Structure maintains info about which
    tuples in Data SteM(s) satisfy which SQCs in
    Query SteM
  • For each result tuple of each query, it stores
    tupleID and physicalID of all constituent base
    tuples of result tuple
  • Results of a query can be accessed by its queryID
  • Ordered by timestamp (physicalID)
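
Because results are ordered by timestamp, applying a window at invocation time reduces to slicing a sorted list. A minimal sketch, assuming the results for one query are held as a sorted list of `(physicalID, tupleID)` pairs; the function name `invoke` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def invoke(results, begin, end):
    """Return the (physicalID, tupleID) pairs whose timestamp lies in [begin, end].

    `results` must already be sorted by physicalID.
    """
    lo = bisect_left(results, (begin,))                # first entry with ts >= begin
    hi = bisect_right(results, (end, float("inf")))    # past the last entry with ts <= end
    return results[lo:hi]
```

Two binary searches replace a scan of every stored match, so the cost of invocation depends on the size of the answer rather than on everything stored for the query.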

14
PSoup Implementation
  • Eddy
  • Each tuple has a predicate attribute and an
    Interest List dictating where it is to be routed
  • Provides Stream Prefix Consistency by storing new
    and temporary tuples separately in New Tuple Pool
    and Temporary Tuple Pool
  • Begins by selecting a tuple from the NTP and then
    processing everything in the TTP before picking
    another tuple from the NTP
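
The scheduling rule above can be sketched as a pair of queues. This is only an illustration of the ordering: `run_eddy` and the `process` callback (which returns any temporary tuples a tuple produces) are hypothetical names, not the Eddy's real interface.

```python
from collections import deque


def run_eddy(new_tuples, process):
    """Process tuples so each new tuple's temporaries finish before the next new tuple.

    `process(t)` returns an iterable of temporary tuples generated by `t`.
    Returns the order in which tuples were handled.
    """
    ntp = deque(new_tuples)  # New Tuple Pool
    ttp = deque()            # Temporary Tuple Pool
    order = []
    while ntp:
        new = ntp.popleft()
        order.append(new)
        ttp.extend(process(new))
        while ttp:  # drain every temporary tuple before touching the NTP again
            temp = ttp.popleft()
            order.append(temp)
            ttp.extend(process(temp))
    return order
```

Draining the TTP first ensures that all effects of one new tuple are visible before the next one enters the system.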

15
PSoup Implementation
  • Data SteM
  • Use tree-based index for data to provide
    efficient access to probing queries
  • One red-black tree for every attribute
  • Maintains hash-based index over tupleIDs for fast
    access
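
A rough sketch of the Data SteM's two access paths follows. It substitutes a sorted Python list searched with `bisect` for each red-black tree, since the standard library has no balanced tree; the class and method names are hypothetical.

```python
from bisect import bisect_left, insort


class DataSteM:
    """Toy Data SteM: one ordered index per attribute plus a hash index on tupleID."""

    def __init__(self, attributes):
        self.by_id = {}                            # tupleID -> row (hash-based index)
        self.index = {a: [] for a in attributes}   # attribute -> sorted (value, tupleID)

    def insert(self, tuple_id, row):
        self.by_id[tuple_id] = row
        for attr, entries in self.index.items():
            insort(entries, (row[attr], tuple_id))  # keep each index sorted

    def probe_less_than(self, attr, bound):
        """Return tupleIDs whose `attr` value is strictly below `bound`."""
        entries = self.index[attr]
        cut = bisect_left(entries, (bound,))
        return [tid for _, tid in entries[:cut]]
```

A probing query with a predicate such as `a < 6` then touches only the matching prefix of the index instead of every stored tuple.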

16
PSoup Implementation
  • Query SteM
  • Allows sharing of work between queries that have
    overlapping FROM clauses
  • Use red-black trees to index single-attribute
    single-relation boolean factors of a query
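
Indexing the queries works the same way in reverse: an arriving value looks up every boolean factor it satisfies. The sketch below handles only factors of the form `attr < constant` on one attribute and again uses a sorted list in place of a red-black tree; `PredicateIndex` and its methods are hypothetical names.

```python
from bisect import bisect_right, insort


class PredicateIndex:
    """Toy index over boolean factors `attr < constant` for a single attribute."""

    def __init__(self):
        self.less_than = []  # sorted list of (constant, queryID)

    def add(self, constant, query_id):
        insort(self.less_than, (constant, query_id))

    def matching_queries(self, value):
        """Return queryIDs whose factor `attr < constant` holds for `value`."""
        # value < constant holds exactly for entries whose constant exceeds value.
        start = bisect_right(self.less_than, (value, float("inf")))
        return [qid for _, qid in self.less_than[start:]]
```

One lookup serves every query that shares the attribute, which is where the sharing of work between queries comes from.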

17
PSoup Implementation
  • Query SteM
  • For queries involving joins of multiple
    attributes, tree structure doesn't work
  • Instead, a linked list called the predicateList
    is used
  • Query SteM contains an array in which each cell
    represents a query
  • At beginning of probe by a data tuple, each cell
    is set to the number of boolean factors in
    corresponding query
  • Every time tuple satisfies a boolean factor, cell
    value is decremented
  • At end of probe, if cell = 0, that means the data
    tuple satisfies the given query
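
The countdown array can be sketched in a few lines. For simplicity this version evaluates every boolean factor directly, whereas the system described above would find satisfied factors through its indexes and the predicateList; the function name `probe` and the representation of factors as callables are assumptions.

```python
def probe(row, queries):
    """Return the indexes of the queries that `row` satisfies.

    `queries[i]` is the list of boolean factors (callables row -> bool)
    of query i; a query is a conjunction of its factors.
    """
    # Each cell starts at the number of boolean factors in its query.
    cells = [len(factors) for factors in queries]
    for q, factors in enumerate(queries):
        for factor in factors:
            if factor(row):
                cells[q] -= 1  # one fewer factor left to satisfy
    # A cell that reached zero means every factor of that query held.
    return [q for q, remaining in enumerate(cells) if remaining == 0]
```

For example, a tuple with `a = 3, b = 4` satisfies a query with factors `a < 5` and `b > 2`, but not a query whose only factor is `a < 1`.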

18
PSoup Implementation
  • Results Structure
  • Stores metadata indicating which tuples satisfy
    which SQCs
  • Can either be accomplished by previously-mentioned
    bitmap or by associating a linked list of
    satisfactory data tuples for each query
  • Ordering by timestamp is simple for single-table
    queries
  • For Join queries, typically use oldest timestamp

19
PSoup Performance
  • Implemented in Java with customized versions of
    Eddy and SteMs
  • Examined performance of two versions
  • PSoup-Partial (PSoup-P): maintains results
    corresponding to SQCs in Results Structure, and
    applies BEGIN-END clauses to retrieve current
    results on query invocation
  • PSoup-Complete (PSoup-C): continuously maintains
    results corresponding to current input window for
    each query in linked lists
  • NoMat: measurements of a system that doesn't
    materialize results

20
PSoup Performance
  • Storage Requirements
  • NoMat: storage cost = space taken to store base
    data streams within maximum window over which
    queries are supported, plus size of structures
  • PSoup-P: storage cost = storage cost of NoMat +
    size of Results Structure (either bit array or
    linked list)
  • PSoup-C: storage cost >> storage cost of PSoup-P,
    since PSoup-C always stores current results of
    standing queries at a given time

21
PSoup Performance
  • Experimental Setup
  • Varied window sizes (2^7 to 2^16) and number (1-8)
    and type of boolean factors
  • Measured response time and maximum supportable
    data arrival rate
  • Examined both P and C with and without predicate
    indexes
  • Tested scheme to remove redundancies arising from
    joins
  • Used synthetically generated query (2^7 to 2^12)
    and data streams

22
PSoup Performance
  • Response Time vs. Window Size

23
PSoup Performance
  • Response Time vs. Interval Predicates

24
PSoup Performance
  • Data Arrival Rate vs. SQCs

25
PSoup Performance
  • Summary of Results
  • Materializing results of queries supports higher
    query invocation rates
  • Indexing queries and lazily applying windows
    improves maximum data throughput
  • PSoup-C requires more memory
  • PSoup-C optimizes query invocation rate
  • PSoup-P optimizes data arrival rate

26
PSoup Performance
  • Removing Redundancy in Join processing
  • Entry of a query specification or new data
  • Composite tuples in joins

27
PSoup Aggregation Queries
  • PSoup can support aggregate functions
  • Only possible to share data structures across
    queries with identical SELECT-PROJECT-JOIN clause

28
PSoup Conclusions
  • Treats data and query streams analogously
  • Can support queries that require access to data
    that arrived before and after the query
  • Materializes results to cut down on response time
    and to support disconnected operation
  • Enables data recharging and monitoring
  • Future work
  • Write data streams to disk and execute queries
    over them
  • Transfer queries between disk and memory,
    allowing query execution to be scheduled
  • Confront resource constraints when dealing with
    infinite streams
  • Query browser for temporal data

29
Critique
  • Strengths
  • Very well written, easy to follow
  • Clear examples, excellent explanation of
    performance results
  • Strong method that reduces processing time with
    increase in interval predicates
  • Weaknesses
  • Lacking sufficient data on storage costs
  • Experimentation only tested one multiple-relation
    boolean factor for joins, which is unrealistic
  • Didn't address whether same (or similar) query
    could be entered twice and accidentally given two
    IDs