Title: Evaluating Window Joins over Punctuated Streams
1. Evaluating Window Joins over Punctuated Streams
- Luping Ding and Elke A. Rundensteiner
- Database Systems Research Group
- Worcester Polytechnic Institute
- {lisading, rundenst}@cs.wpi.edu
2. Stream Data Processing
- Online Transaction Management
- Sensor Network Monitoring
[Figure: continuous queries are registered with a stream query engine, which consumes streaming data and produces streaming results.]
3. New Challenges in Stream Context
- Potentially infinite data streams vs. stateful operators, e.g., join, distinct
- Problem: potentially unbounded state
- Reason: no hint on which data is no longer useful
4. Example: Symmetric Hash Join [WA93]
- Memory overflow resolution: state relocation
  - Examples: XJoin [UF00], Hash-Merge Join [MLA04]
- Problems
  - Join state still grows without bound
  - Delivery of some join results may be highly deferred
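[Figure: symmetric hash join. Each tuple arriving on stream A or B probes the opposite state and is inserted into its own state (SA or SB) in memory; memory overflow occurs as the states grow.]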
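The probe-then-insert cycle of a symmetric hash join can be sketched as follows. This is a minimal illustration, not the implementation from [WA93]; the class name, the `process` method and the result layout are assumptions made for the example. It also shows why the state grows without bound: nothing is ever removed.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Pipelined equi-join sketch: each arriving tuple probes the opposite
    state, then is inserted into the state of its own stream."""

    def __init__(self):
        # One hash table per input stream (SA and SB on the slide).
        self.state = {"A": defaultdict(list), "B": defaultdict(list)}

    def process(self, side, key, payload):
        other = "B" if side == "A" else "A"
        # Probe: look up matching tuples in the opposite state.
        matches = self.state[other].get(key, [])
        if side == "A":
            results = [(key, payload, m) for m in matches]
        else:
            results = [(key, m, payload) for m in matches]
        # Insert: remember the new tuple for future arrivals on the other side.
        self.state[side][key].append(payload)
        return results

    def state_size(self):
        # Total number of stored tuples; it never shrinks in this sketch.
        return sum(len(v) for s in self.state.values() for v in s.values())


join = SymmetricHashJoin()
first = join.process("A", 175, "a1")
second = join.process("B", 175, "b1")
third = join.process("A", 175, "a2")
```

Results are emitted as `(key, A payload, B payload)` regardless of which side triggered the match.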
5. Avoiding Unbounded State
- Solution: exploit constraints to detect no-longer-useful data
  - Sliding window [MWA03]
    - Identifies a bounded set of input data based on time
  - k-Constraint [BW04]
    - Models clustered or ordered data arrival patterns
  - Punctuation [TMS03]
    - Dynamically announces the termination of certain values
6. Sliding Window [KNV03]
[Figure: timeline of Stream A and Stream B with sliding windows Wa and Wb.]
7. Punctuation
- Meta-knowledge embedded inside data streams
- An ordered set of patterns corresponding to attributes of tuples
  - Wildcard (*), constant (9), list (1,2,3), range (1, 20), empty (∅)
- Semantics: tuples arriving after a punctuation p will NOT match p
Bid stream, in arrival order:
  180  Marlie     820.00   Nov-13-03 11:02:00
  182  Ultrasale  1000.00  Nov-13-03 11:05:00
  180  Jocelyn    850.00   Nov-13-03 11:14:00
  180  (punctuation: no more tuples will contain Item_id 180)
  181  pcfan      50.00    Nov-13-03 11:36:00
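The pattern kinds listed above can be illustrated with a small matcher. This is a sketch under stated assumptions: the tagged-tuple encoding, the helper names (`constant`, `value_list`, `value_range`, `tuple_matches`) and the choice of inclusive range bounds are all made up for the example and are not taken from [TMS03].

```python
# Hypothetical encoding of punctuation patterns as tagged tuples.
WILDCARD = ("wildcard",)
EMPTY = ("empty",)


def constant(value):
    return ("constant", value)


def value_list(*values):
    return ("list", frozenset(values))


def value_range(low, high):
    # Assumption: both bounds are inclusive.
    return ("range", low, high)


def pattern_matches(pattern, value):
    """Return True if a single attribute value matches a single pattern."""
    kind = pattern[0]
    if kind == "wildcard":
        return True
    if kind == "empty":
        return False
    if kind == "constant":
        return value == pattern[1]
    if kind == "list":
        return value in pattern[1]
    if kind == "range":
        return pattern[1] <= value <= pattern[2]
    raise ValueError(f"unknown pattern kind: {kind}")


def tuple_matches(punctuation, tup):
    """A tuple matches a punctuation if every attribute matches its pattern."""
    return len(punctuation) == len(tup) and all(
        pattern_matches(p, v) for p, v in zip(punctuation, tup)
    )


# Punctuation announcing that no more bids will arrive for item 180.
no_more_180 = (constant(180), WILDCARD, WILDCARD, WILDCARD)
earlier_bid = (180, "Jocelyn", 850.00, "Nov-13-03 11:14:00")
later_bid = (181, "pcfan", 50.00, "Nov-13-03 11:36:00")
```

Here `earlier_bid` matches the punctuation, so it had to arrive before it; `later_bid` does not match, so it may legally arrive afterwards.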
8. Punctuation-Aware Join [DMR04]
[Figure: join on item_id over Stream A and Stream B with join states SA and SB. The streams carry tuples such as (175, 80.00), (175, 100.00), (181, 50.00), (180, 135.00), (175, 20.00) and (158, 310.00), and a punctuation announces that no more tuples will have A = 175.]
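How a join might exploit such a punctuation can be sketched as below: once one stream announces that a join value will never occur again, the tuples with that value stored for the opposite stream can be purged. This is an illustrative sketch, not the PJoin algorithm of [DMR04]; the function name is hypothetical, the values are borrowed from the figure, and placing them in state SB is an assumption.

```python
from collections import defaultdict


def purge_on_punctuation(opposite_state, punctuated_value):
    """Remove and return the stored tuples whose join value was punctuated."""
    return opposite_state.pop(punctuated_value, [])


# Hypothetical content of state SB: join value -> stored prices.
state_b = defaultdict(list)
for item_id, price in [(175, 80.00), (175, 100.00), (181, 50.00), (175, 20.00)]:
    state_b[item_id].append(price)

# Stream A announces: no more tuples will have join value 175.
purged = purge_on_punctuation(state_b, 175)
```

After the call, only the tuple with join value 181 remains in the state.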
9. Window and Punctuation Occur Simultaneously
  SELECT A.item_id, Count(*)
  FROM Auction [Range 24 Hours] A, Bid B
  WHERE A.item_id = B.item_id
  GROUP BY A.item_id
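[Figure: query plan. The Auction stream and the Bid stream feed Join on item_id, whose output Out1 (item_id) feeds Group-by on item_id with count(*), producing Out2 (item_id, count). Annotations: "contains punctuations on item_id" and "applies a 24-hour window on Auction stream".]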
10. Optimization Opportunities
- Maintain smaller state than either a pure window join or a pure punctuation-exploiting join
  - Bid tuples that have been joined don't need to be maintained in state
- Drop tuples without affecting precision of the result
  - Bid tuples outside the 24-hour window of the corresponding Auction tuple don't need to be processed
- Produce some aggregate results earlier
  - Aggregate results for some Auction tuples can be produced in less than 24 hours
11. Our Approach: PWJoin
- Punctuation-exploiting Window Join
- Features of PWJoin
  - Includes optimizations enabled by punctuations and by sliding windows individually
  - Accomplishes optimizations enabled by interactions of the two constraint types
  - Employs a state design that effectively facilitates constraint-exploiting optimizations
12. PWJoin Basics and Issue
- Receive a new tuple ta from stream A
  - Invalidate tuples from B state
  - Probe B state
  - Insert ta into A state
- Receive a new punctuation pa from stream A
  - Purge tuples from B state
  - Insert pa into A state
- Issue: how to design the PWJoin state to facilitate all search-based operations?
  - Invalidate conducts a time-based search
  - Probe and Purge need a value-based search
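The two processing paths above can be sketched with a deliberately naive state, which makes the design issue visible: every operation scans the whole list. This is a sketch under stated assumptions, not the PWJoin implementation. The class and function names are hypothetical, a punctuation is assumed to carry a single constant join value, and a stored tuple is assumed to expire once the new tuple's timestamp exceeds its own by more than the window.

```python
class NaiveSideState:
    """Unindexed per-stream state: plain lists in arrival order."""

    def __init__(self, window):
        self.window = window
        self.tuples = []   # (timestamp, key, payload)
        self.puncts = []   # (timestamp, key)


def on_tuple(own, other, ts, key, payload):
    """Invalidate, probe, insert; returns (key, new payload, stored payload)."""
    # Invalidate: time-based search over the opposite state.
    other.tuples = [t for t in other.tuples if ts - t[0] <= other.window]
    # Probe: value-based search over the opposite state.
    results = [(key, payload, t[2]) for t in other.tuples if t[1] == key]
    # Insert the new tuple into the state of its own stream.
    own.tuples.append((ts, key, payload))
    return results


def on_punctuation(own, other, ts, key):
    """Purge matching tuples from the opposite state, then store the punctuation."""
    # Purge: value-based search over the opposite state.
    other.tuples = [t for t in other.tuples if t[1] != key]
    own.puncts.append((ts, key))


a_state = NaiveSideState(window=10)
b_state = NaiveSideState(window=10)
on_tuple(b_state, a_state, 1, 180, "b1")
on_tuple(b_state, a_state, 2, 181, "b2")
joined = on_tuple(a_state, b_state, 5, 180, "a1")
on_punctuation(a_state, b_state, 6, 180)
after_purge = list(b_state.tuples)
late = on_tuple(a_state, b_state, 20, 181, "a2")
```

Each list comprehension touches every stored tuple, which is the cost an index over both time and value is meant to avoid.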
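13. PWJoin State with Two-dimensional Index
[Figure: PWJoin state. An I-Node index (hash table) holds I-Nodes with fields Key, Head, Tail and PunctFlag (e.g., none, punctuated). Each tuple is stored in a T-Node with NextValueListTNode and NextTimeListTNode links. T-Nodes are chained in a time list running from window begin to window end, alongside a punctuation time list.]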
14. Facilitating Search-based Operations
- Search-based operations
  - Invalidate: probe the time list and stop when encountering a time-valid tuple
  - Probe: probe the I-Node index and join with tuples in the value list of the matching I-Node
  - Purge: probe the I-Node index and delete tuples in the value list of the matching I-Node
- Avoid access to irrelevant tuples
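The three operations can be sketched over a state indexed along both dimensions. This is a simplified sketch, not the authors' storage structure: Python deques stand in for the linked T-Node lists, purged nodes are removed from the time list lazily through an `alive` flag (an assumption made to keep the example short), and timestamps are assumed to be non-decreasing.

```python
from collections import deque


class TNode:
    def __init__(self, ts, key, payload):
        self.ts, self.key, self.payload = ts, key, payload
        self.alive = True  # set to False once purged or expired


class INode:
    def __init__(self, key):
        self.key = key
        self.values = deque()      # value list, oldest first
        self.punct_flag = "none"


class IndexedState:
    def __init__(self, window):
        self.window = window
        self.time_list = deque()   # all T-Nodes in arrival order
        self.index = {}            # I-Node index: key -> INode

    def insert(self, ts, key, payload):
        node = TNode(ts, key, payload)
        self.index.setdefault(key, INode(key)).values.append(node)
        self.time_list.append(node)

    def invalidate(self, now):
        """Walk the time list from the oldest end; stop at the first valid tuple."""
        expired = 0
        while self.time_list and (
            not self.time_list[0].alive
            or now - self.time_list[0].ts > self.window
        ):
            node = self.time_list.popleft()
            if node.alive:
                # The oldest live node is also the head of its value list.
                self.index[node.key].values.popleft()
                node.alive = False
                expired += 1
        return expired

    def probe(self, key):
        """Return payloads in the value list of the matching I-Node only."""
        inode = self.index.get(key)
        return [n.payload for n in inode.values] if inode else []

    def purge(self, key):
        """Delete the value list of the matching I-Node; return the count."""
        inode = self.index.get(key)
        if inode is None:
            return 0
        purged = len(inode.values)
        for node in inode.values:
            node.alive = False
        inode.values.clear()
        inode.punct_flag = "punctuated"
        return purged


state = IndexedState(window=10)
state.insert(1, 8, "t1")
state.insert(2, 10, "t2")
state.insert(4, 8, "t3")
```

Invalidate touches only expired (or already purged) nodes at the old end of the time list, while probe and purge touch only the value list of one I-Node.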
15. Punctuation Propagation
- An operator may propagate punctuations to benefit
downstream operators
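[Figure: query plan. The Auction stream and the Bid stream (Item_id, Bidder_id, Bid_price) feed Join on item_id, followed by Group-by on item_id with count(*). The join propagates punctuations on item_id (e.g., 180), and the group-by can be unblocked by punctuations propagated by the join operator.]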
16. Optimizations Enabled by Combined Constraints
[Figure: two panels, Early Punctuation Propagation and Tuple Dropping, each showing Stream S1 and Stream S2 carrying tuples with join values a1 through a10; propagation point 1 and propagation point 2 are marked.]
17. Achieving Optimizations by Combined Constraints
- Early propagation
  - Invalidate punctuations in the punctuation time list in the same way as tuples
  - Expired punctuations can be propagated
- Tuple dropping
  - When early propagation happens, set the PunctFlag of the matching I-Node to propagated
  - Drop new tuples that match an I-Node whose PunctFlag is propagated
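The two steps above can be sketched together as follows. This is an illustration under stated assumptions rather than the authors' code: the class name and methods are hypothetical, the PunctFlag is kept in a plain dictionary instead of on I-Nodes, and the tracker is assumed to be consulted for tuples arriving on the opposite stream. One reading of why dropping is safe: once a punctuation on a value has slid out of the window, every tuple with that value from the punctuating stream has expired as well, so a new opposite-stream tuple with that value can no longer produce results.

```python
from collections import deque


class PunctuationTracker:
    """Tracks punctuations from one stream and the per-key PunctFlag."""

    def __init__(self, window):
        self.window = window
        self.punct_time_list = deque()  # (timestamp, key), oldest first
        self.punct_flag = {}            # key -> "punctuated" | "propagated"

    def on_punctuation(self, ts, key):
        self.punct_time_list.append((ts, key))
        self.punct_flag[key] = "punctuated"

    def invalidate(self, now):
        """Expire punctuations like tuples; expired ones are propagated early."""
        propagated = []
        while (
            self.punct_time_list
            and now - self.punct_time_list[0][0] > self.window
        ):
            _, key = self.punct_time_list.popleft()
            self.punct_flag[key] = "propagated"
            propagated.append(key)
        return propagated

    def should_drop(self, key):
        """New tuples matching a propagated punctuation can be dropped."""
        return self.punct_flag.get(key, "none") == "propagated"


tracker = PunctuationTracker(window=10)
tracker.on_punctuation(1, "a3")
tracker.on_punctuation(5, "a2")
too_early = tracker.invalidate(8)
propagated_now = tracker.invalidate(12)
```

At time 12 only the punctuation on `a3` has left the 10-unit window, so only tuples with value `a3` are dropped from then on.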
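18. Memory Cost Analysis
- $S_b^T = S_{b,\mathrm{insert}}^T - S_{b,\mathrm{purge}}^T = S_{b,\mathrm{arrive}}^T - S_{b,\mathrm{purge}}^T = \lambda_b T_b - \lambda_b T_b \, (\lambda_{p_a} T / N_{K_b,T})$
  - $\lambda_b T_b$: state of a window join
  - $\lambda_b T_b \, (\lambda_{p_a} T / N_{K_b,T})$: saving by punctuation
- $\lambda_b$: tuple input rate of stream B
- $\lambda_{p_a}$: punctuation input rate of stream A
- $N_{K_b,T}$: number of distinct join values that have occurred in stream B up to the T-th time unit
- $T_b$: time window on stream B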
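A small numeric check of the cost model, reading it as: state of B = (tuple rate of B × window on B) − (tuple rate of B × window on B) × (punctuation rate of A × T / number of distinct join values in B). The function name and all input values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def pwjoin_state_size(lambda_b, t_b, lambda_pa, t, n_distinct):
    """Estimated number of B tuples held in state at time unit t (sketch)."""
    window_join_size = lambda_b * t_b                 # state of a pure window join
    punctuated_fraction = lambda_pa * t / n_distinct  # share of join values punctuated
    saving = window_join_size * punctuated_fraction   # saving by punctuation
    return window_join_size - saving


# Hypothetical inputs: 100 tuples per time unit on B, a window of 10 time
# units, 2 punctuations per time unit on A, evaluated at T = 20 with 80
# distinct join values seen in B so far.
estimate = pwjoin_state_size(lambda_b=100, t_b=10, lambda_pa=2, t=20, n_distinct=80)
```

With these inputs the window join alone would hold 1000 tuples, half of the join values have been punctuated, and the estimate is 500 tuples.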
19. Experimental Setup
- Experimental system
  - CAPE [RDS04] continuous query processing system
  - Stream benchmark: generates synthetic data streams
  - 733 MHz Intel(R) Celeron CPU, 512 MB RAM, Windows 2000
- Experiments
  - Compare memory overhead and tuple output rate of PWJoin with a pure window join
  - Compare punctuation output rate of PWJoin with PJoin
20. PWJoin vs. WJoin: Memory and Tuple Output Rate
Stream A, B: punct-asc-100-40
21. PWJoin vs. PJoin: Punctuation Output Rate
Stream A: punct-asc-100-40, Stream B: punct-random-30-40; Window: 1 second
22. Related Work
- Pipelined join solutions
  - Symmetric Hash Join [WA93], XJoin [UF00], Hash-Merge Join [MLA04], Ripple Joins [HH99]
- Constraint-exploiting stream query optimization
  - Window joins [KNV03, GO03, GGO04, HFA03, ZRH04]
  - Punctuation [TMS03], PJoin [DMR04]
  - k-Constraint-exploiting algorithm [BW04]
23. Conclusion
- Proposed PWJoin algorithm
- Designed storage structure for PWJoin state
- Derived cost model for PWJoin
- Conducted an experimental study to explore the effectiveness of PWJoin
24. Thanks
- Nishant Mehta (developing stream generator)
- Prof. Leonidas Fegaras (feedback on paper)
- CAPE Group Members
- WPI Database Research Group
CAPE Project: http://davis.wpi.edu/dsrg/CAPE/
25. References
- [KNV03] J. Kang, J. F. Naughton and S. D. Viglas. Evaluating Window Joins over Unbounded Streams. ICDE'03.
- [UF00] T. Urhan and M. Franklin. XJoin: A Reactively Scheduled Pipelined Join Operator. IEEE Data Engineering Bulletin, 23(2), 2000.
- [HH99] P. Haas and J. Hellerstein. Ripple Joins for Online Aggregation. SIGMOD'99.
- [GO03] L. Golab and M. T. Ozsu. Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams. VLDB'03.
- [GGO04] L. Golab, S. Garg and M. T. Ozsu. On Indexing Sliding Windows over On-line Data Streams. EDBT'04.
- [RDS04] E. A. Rundensteiner, L. Ding, T. Sutherland, Y. Zhu, B. Pielech and N. Mehta. CAPE: Continuous Query Engine with Heterogeneous-Grained Adaptivity. VLDB Demo, 2004.
- [BW04] S. Babu and J. Widom. Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams.
- [TMS03] P. A. Tucker, D. Maier, T. Sheard and L. Fegaras. Exploiting Punctuation Semantics in Continuous Data Streams. TKDE, 15(3), 2003.
- [DMR04] L. Ding, N. Mehta, E. A. Rundensteiner and G. T. Heineman. Joining Punctuated Streams. EDBT'04.
- [MWA03] R. Motwani, J. Widom, A. Arasu et al. Query Processing, Resource Management, and Approximation in a Data Stream Management System. CIDR'03.
26. PWJoin vs. WJoin: Irrelevant Punctuations
Stream A: punct-asc-100-40, Stream B: punct-random-30-40; Window: 2 seconds