Streaming Queries over Streaming Data

Slides: 30

Provided by: andyw157

Learn more at: https://faculty.cc.gatech.edu

Transcript and Presenter's Notes

1
Streaming Queries over Streaming Data

Sirish Chandrasekaran (UC Berkeley)
Michael J. Franklin (UC Berkeley)
Presented by Andy Williamson

2
About Me

3rd Year ISYE major
Minor in Computer Science
From Austin, TX
Have visited every state but Alaska
Intern at Deloitte Consulting focusing on SAP
implementation

3
Agenda

Background/Motivation
PSoup
Introduction
System Overview
Query Processing Techniques
Implementation
Performance
Aggregation Queries
Conclusions
Critique

4
Background/Motivation

Continuous Query (CQ) Systems
Treat queries as fixed entities and stream data
over them
Previous systems only allowed streaming of either
data or queries
Continuously deliver results as they are computed
(infeasible/inefficient)
Data Recharging
Monitoring

5
PSoup Introduction

Query processor based on Telegraph query
processing framework
Allows both data and queries to be streamed
Partially stores results to support disconnected
operation and improve data throughput and
response time

6
PSoup System Overview

User initially registers query specification with
system
System returns handle which can be used to invoke
results of query later
Example Query
SELECT *
FROM Data_Stream D_s
WHERE (D_s.a < x AND D_s.b > y)
BEGIN(NOW - 10)
END(NOW)
Begin-End Clause allows
Snapshot (constant beginning and ending time)
Landmark (constant beginning and variable ending
time)
Sliding window (variable beginning and ending
time)
Limited by size of memory
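
The three window types allowed by the Begin-End clause can be sketched in Python. This is an illustrative sketch only, not PSoup's code: the names `Bound` and `Window` are hypothetical, and representing each bound as either an absolute time or an offset from NOW is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Bound:
    """One end of a BEGIN-END clause (hypothetical representation)."""
    value: int              # absolute time, or offset from NOW
    relative: bool = False  # True means the bound is NOW + value

    def resolve(self, now: int) -> int:
        return now + self.value if self.relative else self.value


@dataclass(frozen=True)
class Window:
    begin: Bound
    end: Bound

    def kind(self) -> str:
        if not self.begin.relative and not self.end.relative:
            return "snapshot"   # constant beginning and ending time
        if not self.begin.relative and self.end.relative:
            return "landmark"   # constant beginning, variable ending time
        if self.begin.relative and self.end.relative:
            return "sliding"    # variable beginning and ending time
        return "other"

    def resolve(self, now: int) -> Tuple[int, int]:
        """Turn the clause into a concrete [begin, end] interval at time `now`."""
        return self.begin.resolve(now), self.end.resolve(now)


# BEGIN(NOW - 10) END(NOW) from the example query
sliding = Window(Bound(-10, relative=True), Bound(0, relative=True))
landmark = Window(Bound(5), Bound(0, relative=True))
snapshot = Window(Bound(5), Bound(20))
```

Resolving the window only when the query is invoked is what lets the same stored clause yield different results at different times.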

7
PSoup System Overview

PSoup treats execution of query streams as a join
of query and data streams
Maintains State Modules (SteMs) for queries and
data
One Query SteM for all queries in the system, and
one Data SteM for each data stream

8
PSoup Query Processing Techniques

Overview
PSoup assigns unique queryID that it returns to
the user
Client can disconnect, reconnect and execute
query to obtain updated results
PSoup continuously matches data to query
predicates in background and stores the results
in its Results Structure
When a query is invoked, PSoup applies the
appropriate input window to the Results Structure
to return the current results

9
PSoup Query Processing Techniques

Entry of new Query specs
New queries split into two parts
Standing Query Clause (SQC) consists of the
SELECT-FROM-WHERE clauses
BEGIN-END clause, stored in separate WindowsTable
structure
SQC inserted into Query SteM
Used to probe Data SteMs corresponding to tables
in FROM clause
Resulting tuples stored in Results Structure

10
PSoup Query Processing Techniques

Entry of new data
New tuples assigned globally unique tupleID and
physical timestamp (physicalID) based on system
clock
Inserted into appropriate Data SteM
Then used to probe Query SteM to determine which
SQCs it satisfies
TupleIDs and physicalIDs stored in Results
Structure
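
The symmetric handling of new queries and new data can be sketched as a toy single-stream model in Python. Everything here is an assumption for illustration: the class name `MiniPSoup`, the use of plain dictionaries for the SteMs, and passing the physicalID in explicitly instead of reading the system clock. It is not the actual PSoup implementation.

```python
import itertools
from typing import Callable, Dict, List, Tuple

Predicate = Callable[[dict], bool]


class MiniPSoup:
    """Toy model: one Query SteM, one Data SteM, one Results Structure."""

    def __init__(self) -> None:
        self.query_stem: Dict[int, Predicate] = {}           # queryID -> SQC
        self.data_stem: Dict[int, Tuple[int, dict]] = {}     # tupleID -> (physicalID, tuple)
        self.results: Dict[int, List[Tuple[int, int]]] = {}  # queryID -> [(tupleID, physicalID)]
        self._query_ids = itertools.count(1)
        self._tuple_ids = itertools.count(1)

    def register_query(self, sqc: Predicate) -> int:
        query_id = next(self._query_ids)
        self.query_stem[query_id] = sqc
        # A new query probes the Data SteM, so it sees data that arrived earlier.
        self.results[query_id] = [
            (tid, ts) for tid, (ts, row) in self.data_stem.items() if sqc(row)
        ]
        return query_id

    def insert_tuple(self, row: dict, physical_id: int) -> int:
        tuple_id = next(self._tuple_ids)
        self.data_stem[tuple_id] = (physical_id, row)
        # A new tuple probes the Query SteM to find every SQC it satisfies.
        for query_id, sqc in self.query_stem.items():
            if sqc(row):
                self.results[query_id].append((tuple_id, physical_id))
        return tuple_id


ps = MiniPSoup()
ps.insert_tuple({"a": 1, "b": 9}, physical_id=1)  # arrives before the query
q = ps.register_query(lambda r: r["a"] < 5 and r["b"] > 3)
ps.insert_tuple({"a": 7, "b": 9}, physical_id=2)  # fails a < 5
ps.insert_tuple({"a": 2, "b": 4}, physical_id=3)  # arrives after the query
```

After these calls the Results Structure entry for `q` holds the first and third tuples, one that arrived before the query and one that arrived after it.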

11
PSoup Query Processing Techniques

Selection Queries over a single stream

12
PSoup Query Processing Techniques

Join Queries Over Multiple Streams

13
PSoup Query Processing Techniques

Query Invocation and Result Construction
Results Structure maintains info about which
tuples in Data SteM(s) satisfy which SQCs in
Query SteM
For each result tuple of each query, it stores
tupleID and physicalID of all constituent base
tuples of result tuple
Results of a query can be accessed by its queryID
Ordered by timestamp (physicalID)
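
Because each query's results are ordered by physicalID, applying the window at invocation time can be a range lookup. The Python sketch below is one possible way to do this, assuming a hypothetical function `invoke` and a per-query list of `(physicalID, tupleID)` pairs; the slides do not specify the lookup method.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def invoke(results: List[Tuple[int, int]], begin: int, end: int) -> List[int]:
    """Return tupleIDs whose physicalID lies in [begin, end].

    `results` is one query's entry in the Results Structure, stored as
    (physicalID, tupleID) pairs sorted by physicalID.
    """
    timestamps = [ts for ts, _ in results]
    lo = bisect_left(timestamps, begin)   # first entry with physicalID >= begin
    hi = bisect_right(timestamps, end)    # one past the last entry with physicalID <= end
    return [tid for _, tid in results[lo:hi]]


stored = [(91, 11), (95, 12), (99, 13), (104, 14)]
now = 100
current = invoke(stored, begin=now - 10, end=now)  # BEGIN(NOW - 10) END(NOW)
```

Here `current` contains tupleIDs 11, 12 and 13; tuple 14 is stored but falls outside the window at `now = 100`.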

14
PSoup Implementation

Eddy
Each tuple has a predicate attribute and an
Interest List dictating where it is to be routed
Provides Stream Prefix Consistency by storing new
and temporary tuples separately in New Tuple Pool
and Temporary Tuple Pool
Begins by selecting a tuple from the NTP and then
processing everything in the TTP before picking
another tuple from the NTP
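
The scheduling order described above (one new tuple, then all temporary tuples it gives rise to) can be sketched in Python. The function `run_eddy` and the callback `process` are hypothetical names, and modelling each pool as a FIFO queue is an assumption.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List


def run_eddy(new_tuples: Iterable[str],
             process: Callable[[str], List[str]]) -> List[str]:
    """Return the order in which tuples are processed.

    `process(t)` handles one tuple and returns any temporary tuples it creates.
    """
    ntp: Deque[str] = deque(new_tuples)  # New Tuple Pool
    ttp: Deque[str] = deque()            # Temporary Tuple Pool
    order: List[str] = []
    while ntp:
        current = ntp.popleft()          # admit exactly one new tuple
        order.append(current)
        ttp.extend(process(current))
        while ttp:                       # drain the TTP before touching the NTP again
            temp = ttp.popleft()
            order.append(temp)
            ttp.extend(process(temp))
    return order


def expand(item: str) -> List[str]:
    # Toy processing step: each new tuple yields one temporary tuple.
    return [item + "'"] if "'" not in item else []


order = run_eddy(["t1", "t2"], expand)
```

In this toy run, `t2` is not admitted until the temporary tuple derived from `t1` has been fully processed.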

15
PSoup Implementation

Data SteM
Use tree-based index for data to provide
efficient access to probing queries
One red-black tree for every attribute
Maintains hash-based index over tupleIDs for fast
access
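
A Data SteM with one ordered index per attribute plus a hash index over tupleIDs might look like the Python sketch below. Note the substitution: a sorted list maintained with `bisect` stands in for the red-black tree, so range probes behave the same but insertion is O(n) instead of O(log n). The class and method names are hypothetical.

```python
from bisect import bisect_left, bisect_right, insort
from typing import Dict, List, Tuple


class DataSteM:
    def __init__(self, attributes: List[str]) -> None:
        self.by_id: Dict[int, dict] = {}  # hash index: tupleID -> tuple
        # One ordered index per attribute, holding (value, tupleID) pairs.
        self.index: Dict[str, List[Tuple[float, int]]] = {a: [] for a in attributes}

    def insert(self, tuple_id: int, row: dict) -> None:
        self.by_id[tuple_id] = row
        for attr, entries in self.index.items():
            insort(entries, (row[attr], tuple_id))

    def probe_range(self, attr: str, low: float, high: float) -> List[int]:
        """Return tupleIDs with low <= row[attr] <= high, in attribute order."""
        entries = self.index[attr]
        lo = bisect_left(entries, (low, float("-inf")))
        hi = bisect_right(entries, (high, float("inf")))
        return [tid for _, tid in entries[lo:hi]]


stem = DataSteM(["a", "b"])
stem.insert(1, {"a": 3, "b": 8})
stem.insert(2, {"a": 7, "b": 2})
stem.insert(3, {"a": 5, "b": 6})
matches = stem.probe_range("a", 4, 7)  # a probing query with predicate 4 <= a <= 7
```

The probe returns tupleIDs only; the hash index `by_id` is then used to fetch the full tuples.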

16
PSoup Implementation

Query SteM
Allows sharing of work between queries that have
overlapping FROM clauses
Use red-black trees to index single-attribute
single-relation boolean factors of a query

17
PSoup Implementation

Query SteM
For queries involving joins of multiple
attributes, tree structure doesn't work
Instead, a linked list called the predicateList
is used
Query SteM contains an array in which each cell
represents a query
At beginning of probe by a data tuple, each cell
is set to the number of boolean factors in
corresponding query
Every time tuple satisfies a boolean factor, cell
value is decremented
At end of probe, if cell = 0, that means the data
tuple satisfies the given query
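
The counting scheme above can be sketched in Python. This is a simplified illustration: it scans every boolean factor linearly rather than using the red-black tree indexes, and the names `probe_query_stem`, `predicate_list` and `factor_counts` are hypothetical.

```python
from typing import Callable, List, Tuple

BooleanFactor = Callable[[dict], bool]


def probe_query_stem(row: dict,
                     predicate_list: List[Tuple[int, BooleanFactor]],
                     factor_counts: List[int]) -> List[int]:
    """Return the queryIDs (array positions) that `row` satisfies.

    predicate_list: (queryID, boolean factor) pairs for all queries.
    factor_counts:  factor_counts[q] is the number of boolean factors in query q.
    """
    cells = list(factor_counts)   # each cell starts at its query's factor count
    for query_id, factor in predicate_list:
        if factor(row):
            cells[query_id] -= 1  # one more factor satisfied
    # A cell that reached 0 means every factor of that query was satisfied.
    return [q for q, remaining in enumerate(cells) if remaining == 0]


predicate_list = [
    (0, lambda r: r["a"] < 5),
    (0, lambda r: r["b"] > 3),
    (1, lambda r: r["a"] < 5),
    (1, lambda r: r["b"] > 10),
]
factor_counts = [2, 2]
satisfied = probe_query_stem({"a": 2, "b": 4}, predicate_list, factor_counts)
```

For the tuple `{"a": 2, "b": 4}`, query 0 ends with its cell at 0 and is reported, while query 1 ends at 1 because `b > 10` fails.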

18
PSoup Implementation

Results Structure
Stores metadata indicating which tuples satisfy
which SQCs
Can be accomplished either by the previously
mentioned bitmap or by associating with each query
a linked list of the data tuples that satisfy it
Ordering by timestamp is simple for single-table
queries
For Join queries, typically use oldest timestamp

19
PSoup Performance

Implemented in Java with customized versions of
Eddy and SteMs
Examined performance of two versions
PSoup-Partial (PSoup-P): Maintains results
corresponding to SQCs in Results Structure, and
applies BEGIN-END clauses to retrieve current
results on query invocation
PSoup-Complete (PSoup-C): Continuously maintains
results corresponding to current input window for
each query in linked lists
NoMat: Measurements of a system that doesn't
materialize results

20
PSoup Performance

Storage Requirements
NoMat: storage cost = space taken to store base
data streams within maximum window over which
queries are supported, plus size of structures
PSoup-P: storage cost = storage cost of NoMat +
size of Results Structure (either bitarray or
linked-list)
PSoup-C: storage cost >> storage cost of PSoup-P,
since C always stores current results of standing
queries at a given time

21
PSoup Performance

Experimental Setup
Varied window sizes (2^7 to 2^16) and the number
(1-8) and type of boolean factors
Measured response time and maximum supportable
data arrival rate
Examined both P and C with and without predicate
indexes
Tested scheme to remove redundancies arising from
joins
Used synthetically generated query (2^7 to 2^12)
and data streams

22
PSoup Performance

Response Time vs. Window Size

23
PSoup Performance

Response Time vs. Interval Predicates

24
PSoup Performance

Data Arrival Rate vs. SQCs

25
PSoup Performance

Summary of Results
Materializing results of queries supports higher
query invocation rates
Indexing queries and lazily applying windows
improves maximum data throughput
PSoup-C requires more memory
PSoup-C optimizes query invocation rate
PSoup-P optimizes data arrival rate

26
PSoup Performance

Removing Redundancy in Join processing
Entry of a query specification or new data
Composite tuples in joins

27
PSoup Aggregation Queries

PSoup can support aggregate functions
Only possible to share data structures across
queries with identical SELECT-PROJECT-JOIN clause

28
PSoup Conclusions

Treats data and query streams analogously
Can support queries that require access to data
that arrived before and after the query
Materializes results to cut down on response time
and to support disconnected operation
Enables data recharging and monitoring
Future work
Write data streams to disk and execute queries
over them
Transfer queries between disk and memory,
allowing query execution to be scheduled
Confront resource constraints when dealing with
infinite streams
Query browser for temporal data

29
Critique

Strengths
Very well written, easy to follow
Clear examples, excellent explanation of
performance results
Strong method that reduces processing time with
increase in interval predicates
Weaknesses
Lacking sufficient data on storage costs
Experimentation only tested one multiple-relation
boolean factor for joins, which is unrealistic
Didn't address whether same (or similar) query
could be entered twice and accidentally given two
IDs
