Joining Punctuated Streams - PowerPoint PPT Presentation


Slides: 34

Provided by: ding72

Learn more at: https://davis.wpi.edu


1
Joining Punctuated Streams

Luping Ding, Nishant Mehta, Elke A. Rundensteiner
and George T. Heineman
Department of Computer Science
Worcester Polytechnic Institute
{lisading, nishantm, rundenst, heineman}@cs.wpi.edu

2
Outline

Motivation
Punctuation Preliminaries
Our Join Approach: PJoin
Experimental Study
Related Work
Conclusion

3
Challenges in Joining Continuous Data Streams

Potentially unbounded growth of the join state, e.g.,
Symmetric Hash Join [WA93]
-> To bound runtime join state
Uneven workload caused by time-varying data
arrival characteristics
-> To adjust execution behavior according to
runtime circumstances

[Figure: symmetric hash join of streams A and B, with insert and probe operations]
4
Tackling Challenges

To bound runtime join state
Exploiting semantic constraints to remove stale
data from the join state in a timely manner,
e.g., sliding window [KNV03, GO03, HFA03],
k-constraint [BW02], punctuations [TMS03].
To adjust execution at runtime
Developing adaptive join execution logic,
e.g., XJoin [UF00], Ripple Join [HH99].

5
Tackling Challenges

Goals
To bound runtime join state
To adjust join execution according to runtime
circumstances
Solutions
Exploiting semantic constraints to remove stale
data from the join state in a timely manner, e.g.,
sliding window [KNV03, GO03, HFA03], k-constraint
[BW02], punctuations [TMS03].
Developing adaptive join execution logic, e.g.,
XJoin [UF00], Ripple Join [HH99].

6
Punctuation

A punctuation is a predicate on stream elements that
evaluates to false for every element following
the punctuation.

ID       Name    Age
9961234  Edward  17
9961235  Justin  19
9961238  Janet   18
<*, *, (0, 18]>  (punctuation: no more tuples for students whose age is less than or equal to 18!)
9961256  Anna    20
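
The student example above can be expressed in a few lines of Python. This is only a sketch under an assumed representation: a punctuation is modeled as one pattern per attribute (a wildcard or a range), and the names `WILDCARD`, `in_range`, and `matches` are illustrative, not taken from the slides.

```python
from typing import Any, Callable, Sequence

# Assumed representation: a pattern is a per-attribute test,
# and a punctuation is one pattern per attribute.
Pattern = Callable[[Any], bool]

WILDCARD: Pattern = lambda value: True  # "*": matches any value


def in_range(low: float, high: float) -> Pattern:
    """Half-open range (low, high], as in the (0, 18] age example."""
    return lambda value: low < value <= high


def matches(tup: Sequence[Any], punctuation: Sequence[Pattern]) -> bool:
    """A tuple matches a punctuation if every attribute satisfies its pattern."""
    return all(p(v) for p, v in zip(punctuation, tup))


# Punctuation <*, *, (0, 18]>: no more students aged 18 or younger.
no_more_minors = [WILDCARD, WILDCARD, in_range(0, 18)]

earlier = [(9961234, "Edward", 17), (9961235, "Justin", 19), (9961238, "Janet", 18)]
later = (9961256, "Anna", 20)

# Tuples that arrived before the punctuation may match it ...
covered = [t for t in earlier if matches(t, no_more_minors)]
# ... but a tuple arriving after it must not.
later_is_legal = not matches(later, no_more_minors)
```

Here `covered` holds the Edward and Janet tuples, and the Anna tuple is legal because the predicate evaluates to false for it.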

7
Query optimization enabled by punctuation

Guide stateful operators to purge stale data from
state,
e.g., join, duplicate elimination
Unblock blocking operators to produce partial
results intermittently,
e.g., group-by, set difference

8
An Example
Open Stream
item_id  seller_id  open_price  timestamp
1080     jsmith     130.00      Nov-10-03 9:03:00
<1080, *, *, *>
1082     melissa    20.00       Nov-10-03 9:10:00
<1082, *, *, *>

Query: For each item that has at least one bid,
return its bid-increase value.
Select O.item_id, Sum(B.bid_price - O.open_price)
From Open O, Bid B
Where O.item_id = B.item_id
Group by O.item_id

Bid Stream
item_id  bidder_id  bid_price  timestamp
1080     pclover    175.00     Nov-14-03 8:27:00
1082     smartguy   30.00      Nov-14-03 8:30:00
1080     richman    177.00     Nov-14-03 8:52:00
<1080, *, *, *>  (No more bids for item 1080!)

[Figure: query plan. Open Stream and Bid Stream feed Join on item_id, with output Out1 (item_id); Group-by on item_id with sum() follows, with output Out2 (item_id, sum).]
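
As a rough illustration of how the bid punctuation unblocks the group-by, the following Python sketch replays the example data. It makes simplifying assumptions that are not in the slides: stream elements are reduced to `(kind, item_id, price)` records, and the whole Open stream is consumed before the Bid stream.

```python
from collections import defaultdict

# Each stream element is ("tuple", item_id, price) or ("punct", item_id).
open_stream = [("tuple", 1080, 130.00), ("punct", 1080),
               ("tuple", 1082, 20.00), ("punct", 1082)]
bid_stream = [("tuple", 1080, 175.00), ("tuple", 1082, 30.00),
              ("tuple", 1080, 177.00), ("punct", 1080)]

open_price = {}             # Open-side join state, keyed by item_id
sums = defaultdict(float)   # running group-by state
emitted = []                # (item_id, sum) results released so far

for kind, item_id, *rest in open_stream:
    if kind == "tuple":
        open_price[item_id] = rest[0]

for kind, item_id, *rest in bid_stream:
    if kind == "tuple":
        # Join on item_id and feed the group-by aggregate.
        sums[item_id] += rest[0] - open_price[item_id]
    else:
        # Punctuation: no more bids for item_id, so its group is final.
        if item_id in sums:
            emitted.append((item_id, sums.pop(item_id)))
        # The matching Open tuple can no longer join, so it is purged.
        open_price.pop(item_id, None)
```

After the run, the group for item 1080 has been emitted, while item 1082 stays in both the join state and the group-by state because no bid punctuation has arrived for it.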
9
Punctuation-Related Rules [TMS03]

Purge rule for join operator
$\forall t_A \in TS_A(T)$: $\mathrm{purge}(t_A)$ if $\mathrm{setMatch}(t_A, PS_B(T))$
$\forall t_B \in TS_B(T)$: $\mathrm{purge}(t_B)$ if $\mathrm{setMatch}(t_B, PS_A(T))$
Propagate rule for join operator
$\forall p_A \in PS_A(T)$: $\mathrm{propagate}(p_A)$ if $\forall t_A \in TS_A(T)$, $\neg\,\mathrm{match}(t_A, p_A)$
$\forall p_B \in PS_B(T)$: $\mathrm{propagate}(p_B)$ if $\forall t_B \in TS_B(T)$, $\neg\,\mathrm{match}(t_B, p_B)$
$TS_A(T)$: all tuples that arrived before time $T$
from stream A
$PS_A(T)$: all punctuations that arrived before time
$T$ from stream A
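
The two rules can be sketched in Python as follows. This is a minimal illustration, assuming that a punctuation is a predicate over the join attribute and that a tuple is a `(join_key, payload)` pair; the function names `set_match`, `purge`, and `propagable` and the sample predicates are hypothetical.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, str]            # (join_key, payload); assumed schema
Punct = Callable[[int], bool]    # assumed: predicate over the join attribute


def set_match(t: Row, puncts: List[Punct]) -> bool:
    """setMatch: the tuple's join value is covered by some punctuation."""
    return any(p(t[0]) for p in puncts)


def purge(state: List[Row], other_puncts: List[Punct]) -> List[Row]:
    """Purge rule: drop tuples matched by the opposite stream's punctuations."""
    return [t for t in state if not set_match(t, other_puncts)]


def propagable(own_puncts: List[Punct], own_state: List[Row]) -> List[Punct]:
    """Propagate rule: p is safe once no tuple of the same stream matches it."""
    return [p for p in own_puncts if not any(p(t[0]) for t in own_state)]


def lt5(key: int) -> bool:   # punctuation from stream B: no more keys < 5
    return key < 5


def eq3(key: int) -> bool:   # punctuation from stream A: no more key 3
    return key == 3


state_a = [(3, "a1"), (5, "a2"), (9, "a3")]
before = propagable([eq3], state_a)        # tuple 3 is still in the state
state_a_after = purge(state_a, [lt5])      # B's punctuation removes tuple 3
after = propagable([eq3], state_a_after)   # now eq3 can be propagated
```

The example shows the interplay between the rules: a punctuation from stream A becomes propagable only after the matching A tuples have been purged by punctuations from stream B.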

10
Obtaining Punctuations

Punctuations are supplied by stream providers.
Derive punctuations from application semantics:
Key-to-foreign-key join:
derive a punctuation
following each tuple on the key side
Clustered data arrival:
derive a punctuation
whenever a different value is encountered
Other application-specific semantics,
e.g., bidding time constraint
for each item in an online auction application:
derive a punctuation whenever the bidding time period
for a particular item expires
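
For the clustered-arrival case, a punctuation generator might look like the Python sketch below. It assumes rows arrive as `(key, payload)` pairs already clustered by key; the function name `punctuate_clustered` and the `("tuple", ...)` / `("punct", ...)` element encoding are illustrative choices.

```python
from typing import Iterable, Iterator, Tuple


def punctuate_clustered(rows: Iterable[Tuple[int, str]]) -> Iterator[tuple]:
    """Emit ("punct", key) once a run of equal keys is seen to have ended."""
    previous = None
    for key, payload in rows:
        if previous is not None and key != previous:
            # A different value was encountered, so the previous run is over.
            yield ("punct", previous)
        yield ("tuple", key, payload)
        previous = key
    # No punctuation is emitted for the last key: the stream may continue.


rows = [(1, "a"), (1, "b"), (2, "c"), (3, "d")]
elements = list(punctuate_clustered(rows))
```

The result interleaves tuples and derived punctuations: a punctuation for key 1 follows the second row, and one for key 2 follows the third.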

11
Our Join Approach: PJoin

First punctuation-exploiting join implementation
Binary hash-based equi-join
Optimized for reducing memory overhead
Optimized for increasing data output rate
Fine-tunable execution logic
Targeting various optimization goals
minimum memory overhead
maximum tuple output rate
Reacting to dynamic stream environment

12
PJoin Execution Logic
[Figure: PJoin processing a tuple ta from Stream A, with Hash(ta) = 1, in numbered steps 1 to 4. The memory-resident portion of the join state holds State of Stream A (Sa) and State of Stream B (Sb), each with a Hash Table, a Purge Cand. Pool, and a Punct. Set (PSa, PSb). The disk-resident portion holds a Hash Table per stream.]
13
PJoin Execution Logic
[Figure: PJoin processing a punctuation pa from Stream A, with Hash(pa) = 1. The memory-resident portion of the join state holds State of Stream A (Sa) and State of Stream B (Sb), each with a Hash Table, a Purge Cand. Pool, and a Punct. Set (PSa, PSb). The disk-resident portion holds a Hash Table per stream.]
14
PJoin Design

Observations
Join operations typically involve multiple
subtasks
Subtasks are executed at different frequencies
Each subtask can be fine-tuned to target
different optimization goals
Design decision
Break join execution logic into components
Equip each component with various execution
strategies
Employ event-driven inter-component scheduling to
allow flexible join execution logic configuration

15
Join-Related Components

Components
Memory Join: join new tuple with in-memory state
State Relocation: move part of in-memory state to
disk
Disk Join: join on-disk states
Scheduling strategy
Memory Join runs as main thread
State Relocation is executed when memory is full
Disk Join is scheduled when input queues are
empty (depending on activation threshold)
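
The Memory Join component follows the insert-and-probe pattern of a symmetric hash join. The Python sketch below illustrates only that in-memory step; State Relocation and Disk Join are deliberately left out, and the class name `MemoryJoin` and its methods are hypothetical.

```python
from collections import defaultdict


class MemoryJoin:
    """Symmetric hash join over two streams, "A" and "B", on a join key."""

    def __init__(self):
        self.state = {"A": defaultdict(list), "B": defaultdict(list)}

    def process(self, stream, key, payload):
        other = "B" if stream == "A" else "A"
        # Insert the new tuple into its own stream's hash table.
        self.state[stream][key].append(payload)
        # Probe the opposite hash table; results are (A payload, B payload).
        matches = self.state[other].get(key, [])
        if stream == "A":
            return [(payload, m) for m in matches]
        return [(m, payload) for m in matches]

    def size(self):
        """Number of tuples currently held in the join state."""
        return sum(len(v) for table in self.state.values() for v in table.values())


join = MemoryJoin()
r1 = join.process("A", 3, "a1")   # nothing to join with yet
r2 = join.process("B", 3, "b1")   # joins with a1
r3 = join.process("A", 3, "a2")   # joins with b1
```

Without purging or relocation, `size()` grows with every tuple processed, which is the unbounded-state problem the other components address.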

16
State Purge

Eager purge
Purge condition: whenever a punctuation is received.
Pros: guarantees minimum join state
Cons: CPU overhead under frequent punctuations
Lazy purge
Purge condition: when a certain number of new
punctuations have been received or when state is full
Pros: reduces CPU overhead in searching for stale
tuples
Cons: stale tuples may stay for a longer time,
thus affecting probe efficiency
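
The trade-off between the two strategies can be seen in a small Python sketch. It assumes punctuations carry a single join-key value and models only the punctuation-count trigger (not the state-full trigger); the class `JoinState` and its `purge_threshold` parameter are illustrative names, with a threshold of 1 behaving as eager purge.

```python
class JoinState:
    """One stream's state, purged by punctuations from the opposite stream."""

    def __init__(self, purge_threshold: int = 1):
        self.tuples = []                        # stored join-key values
        self.pending = []                       # punctuations not yet used for purging
        self.purge_threshold = purge_threshold  # 1 behaves as eager purge
        self.scans = 0                          # number of full state scans performed

    def insert(self, key):
        self.tuples.append(key)

    def on_punctuation(self, key):
        """The opposite stream promises no more tuples with this key."""
        self.pending.append(key)
        if len(self.pending) >= self.purge_threshold:
            self.purge()

    def purge(self):
        stale = set(self.pending)
        self.tuples = [k for k in self.tuples if k not in stale]
        self.pending.clear()
        self.scans += 1


eager, lazy = JoinState(purge_threshold=1), JoinState(purge_threshold=3)
for state in (eager, lazy):
    for key in range(1, 7):
        state.insert(key)
    for key in (1, 2, 3, 4):
        state.on_punctuation(key)
```

In this run the eager state ends with two tuples after four scans, while the lazy state ends with three tuples (one of them stale) after a single scan.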

17
Punctuation Propagation Concerns

Correctness
Before propagating a punctuation, guarantee that
no more result tuples matching this punctuation
will be generated in the future.
Efficiency
Detect propagable punctuations with as few
state scans as possible

18
Punctuation Index
[Figure: punctuation index over Hash Table HTA and Punctuation Set PSA. PSA has columns pid, count, predicate, indexed, with example entries (101, 3, 50 < Y < 100, true) and (102, 4, 100 < Y < 200, true). Each tuple in Hash Bucket 1 through Hash Bucket m stores attributes, timestamp, and a pid that is either null or a punctuation id such as 101 or 102.]
19
Two Steps

Punctuation Index building
Eager build: build the index once a punctuation is
received
Lazy build: build the index when propagation is
invoked
Propagation
Push mode: propagate punctuations when the propagate
threshold is reached
Pull mode: propagate punctuations upon request
from down-stream operators
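
One possible reading of the punctuation index is sketched below in Python. It assumes that `count` records how many stored tuples of the same stream match a punctuation, that each tuple's `pid` points to its matching punctuation, and that punctuations on one stream do not overlap; none of these details are spelled out in the slides, and the class and method names are hypothetical. The sketch uses lazy build, since `propagate` triggers `build`.

```python
class PunctuationIndex:
    def __init__(self):
        self.tuples = []   # each entry: {"value": ..., "pid": None or int}
        self.puncts = {}   # pid -> {"pred": callable, "count": int, "indexed": bool}

    def add_tuple(self, value):
        self.tuples.append({"value": value, "pid": None})

    def add_punctuation(self, pid, pred):
        # An eager build would call self.build() here instead.
        self.puncts[pid] = {"pred": pred, "count": 0, "indexed": False}

    def build(self):
        """Index all not-yet-indexed punctuations with one scan of the state."""
        new = {pid: p for pid, p in self.puncts.items() if not p["indexed"]}
        for t in self.tuples:
            if t["pid"] is not None:
                continue
            for pid, p in new.items():
                if p["pred"](t["value"]):
                    t["pid"] = pid
                    p["count"] += 1
                    break
        for p in new.values():
            p["indexed"] = True

    def remove_tuple(self, value):
        """Called when a tuple is purged; updates its punctuation's count."""
        for i, t in enumerate(self.tuples):
            if t["value"] == value:
                if t["pid"] is not None:
                    self.puncts[t["pid"]]["count"] -= 1
                del self.tuples[i]
                return

    def propagate(self):
        """Lazy build, then release punctuations whose count reached zero."""
        self.build()
        ready = [pid for pid, p in self.puncts.items() if p["count"] == 0]
        for pid in ready:
            del self.puncts[pid]
        return ready


index = PunctuationIndex()
for y in (60, 70, 150):
    index.add_tuple(y)
index.add_punctuation(101, lambda y: 50 < y < 100)
index.add_punctuation(102, lambda y: 100 < y < 200)
first = index.propagate()    # both punctuations still have matching tuples
index.remove_tuple(60)
index.remove_tuple(70)
second = index.propagate()   # punctuation 101 has no matching tuples left
```

Once the index is built, deciding whether a punctuation is propagable needs only its counter, not another scan of the state.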

20
Event-driven Framework

Runtime parameter monitoring and feedback
mechanism
Runtime changeable component coupling mode

[Figure: a Monitor attached to Memory Join raises Events that invoke State Relocation, Disk Join, State Purge, Punctuation Index Build, and Punctuation Propagation.]
21
Configuration Example
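[Figure: example configuration. The Monitor attached to Memory Join raises the events StreamEmpty (with Activation Threshold), PurgeThreshold-Reach, PropagateCount-Reach, and StateFull, which are connected to State Relocation, Disk Join, State Purge, Punctuation Index Build, and Punctuation Propagation.]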
22
Event-Listener Registry
Events                      Conditions                        Listeners
StreamEmptyEvent            Activation Threshold is reached   Disk Join
PurgeThreshold-ReachEvent   -                                 State Purge
StateFullEvent              C1                                State Purge
StateFullEvent              C2                                State Relocation
PropagateCount-ReachEvent   -                                 Index Build, Propagation
C1: Punctuations exist that haven't been used to purge state yet.
C2: No punctuations exist that haven't been used to purge state.
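
A registry of this kind could be modeled as in the Python sketch below. The event names and listener names come from the table; the `Registry` class, the `ctx` dictionary, and its keys are assumptions made for illustration, and the listeners simply return their component name instead of doing real work.

```python
from collections import defaultdict


class Registry:
    def __init__(self):
        self._listeners = defaultdict(list)  # event name -> [(condition, listener)]

    def register(self, event, listener, condition=lambda ctx: True):
        self._listeners[event].append((condition, listener))

    def fire(self, event, ctx):
        """Run every listener of the event whose condition holds for ctx."""
        return [listener(ctx)
                for condition, listener in self._listeners[event]
                if condition(ctx)]


def has_unused(ctx):   # C1: some punctuations have not been used for purging
    return ctx["unused_punctuations"] > 0


def no_unused(ctx):    # C2: every punctuation has already been used
    return ctx["unused_punctuations"] == 0


registry = Registry()
registry.register("StreamEmptyEvent", lambda ctx: "Disk Join",
                  lambda ctx: ctx["activation_threshold_reached"])
registry.register("PurgeThreshold-ReachEvent", lambda ctx: "State Purge")
registry.register("StateFullEvent", lambda ctx: "State Purge", has_unused)
registry.register("StateFullEvent", lambda ctx: "State Relocation", no_unused)
registry.register("PropagateCount-ReachEvent", lambda ctx: "Index Build")
registry.register("PropagateCount-ReachEvent", lambda ctx: "Propagation")

ctx = {"unused_punctuations": 2, "activation_threshold_reached": False}
on_full = registry.fire("StateFullEvent", ctx)
```

With unused punctuations available, a full state triggers State Purge; only when none remain does it fall back to State Relocation. Changing the registrations at runtime changes how the components are coupled.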
23
Experimental Study

Experimental System
CAPE: Continuous XQuery Processing System
Stream benchmark: generates synthetic data streams
by controlling arrival characteristics of data
and punctuations
2.4GHz Intel(R) Pentium-IV CPU, 512MB RAM,
Windows XP
Experiments
Compare PJoin with XJoin, a constraint-unaware
operator
Compare trade-offs between different state purge
strategies
Study PJoin under asymmetric punctuation
inter-arrival rates
Measurements: memory overhead and tuple output
rate

24
PJoin vs. XJoin: Memory Overhead
Tuple inter-arrival: 2 milliseconds; punctuation
inter-arrival: 40 tuples/punctuation
25
PJoin vs. XJoin: Tuple Output Rate
Tuple inter-arrival: 2 milliseconds; punctuation
inter-arrival: 30 tuples/punctuation
26
State Purge Strategies: Memory Overhead
Tuple inter-arrival: 2 milliseconds; punctuation
inter-arrival: 10 tuples/punctuation
27
State Purge Strategies: Tuple Output Rate
Tuple inter-arrival: 2 milliseconds; punctuation
inter-arrival: 10 tuples/punctuation
28
Asymmetric Punctuation Inter-arrival Rates:
Memory Overhead
Tuple inter-arrival: 2 milliseconds; stream A punctuation
inter-arrival: 10 tuples/punctuation
29
Asymmetric Punctuation Inter-arrival Rates: Tuple
Output Rate
Tuple inter-arrival: 2 milliseconds; stream A punctuation
inter-arrival: 10 tuples/punctuation
30
Observations

The memory requirement for PJoin state is almost
insignificant compared to XJoin's.
The increase in the join state of XJoin leads to
increasing probe cost, thus affecting tuple
output rate.
Eager purge is the best strategy for minimizing join
state.
Lazy purge with an appropriate purge threshold
provides a significant advantage in increasing
tuple output rate.

31
Related Work

Continuous Query Systems
Aurora (Brandeis, Brown, MIT), TelegraphCQ
(Berkeley), STREAM (Stanford), NiagaraCQ
(Wisconsin)
Constraint-exploiting join solutions
Window joins (Wisconsin, Waterloo, Purdue)
k-Constraint exploiting algorithm (Stanford)
Punctuation fundamentals, purge and propagate
rules (OGI)
Adaptive join solutions
XJoin (Maryland)
Ripple Join (Berkeley)

32
Conclusion

Contributions
Implement first punctuation-exploiting join
solution
Propose eager and lazy strategies for purging
join state using punctuations.
Propose eager and lazy strategies for propagating
punctuations.
Design event-driven framework for flexible join
configuration
Future work
Support sliding window semantics
Handle n-ary joins

[ACC03] D. Abadi et al. Aurora: A New Model and Architecture for Data Stream Management. VLDB Journal, 2003.
[CCD03] S. Chandrasekaran et al. TelegraphCQ: Continuous Dataflow Processing for an Uncertain World. CIDR, 2003.
[MWA03] R. Motwani et al. Query Processing, Resource Management, and Approximation in a Data Stream Management System. CIDR, 2003.
[WA93] A. N. Wilschut et al. Dataflow Query Execution in a Parallel Main-memory Environment. Distributed and Parallel Databases, 1993.
[KNV03] J. Kang et al. Evaluating Window Joins over Unbounded Streams. ICDE, 2003.
[GO03] L. Golab et al. Processing Sliding Window Multi-joins in Continuous Queries over Data Streams. VLDB, 2003.
[HFA03] M. Hammad et al. Scheduling for Shared Window Joins over Data Streams. VLDB, 2003.
[BW02] S. Babu et al. Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams. Technical report, 2002.
[TMS03] P. Tucker et al. Exploiting Punctuation Semantics in Continuous Data Streams. IEEE TKDE, 2003.
[UF00] T. Urhan et al. A Reactively Scheduled Pipelined Join Operator. IEEE Data Engineering Bulletin, 2000.
[HH99] P. Haas et al. Ripple Joins for Online Aggregation. ACM SIGMOD, 1999.
[MSH02] S. Madden et al. Continuously Adaptive Continuous Queries over Streams. ACM SIGMOD, 2002.
[IFF99] Z.G. Ives et al. An Adaptive Query Execution System for Data Integration. ACM SIGMOD, 1999.