Title: Mirek Riedewald
1Efficient Processing of Massive Data Streams for
Mining and Monitoring
- Mirek Riedewald
- Department of Computer Science
- Cornell University
2Acknowledgements
- Al Demers
- Abhinandan Das
- Alin Dobra
- Sasha Evfimievski
- Johannes Gehrke
- KD-D initiative (Art Becker et al.)
3Introduction
- Data streams versus databases
- Infinite stream, continuous queries
- Limited resources
- Network monitoring
- High arrival rates, approximation [CGJSS02]
- Stock trading
- Complex computation [ZS02]
- Retail, E-business, Intelligence, Medical Surveillance
- Identify relevant information on-the-fly, archive for data mining
- Exact results, error guarantees
4Information Spheres
- Local Information Sphere
- Within each organization
- Continuous processing of distributed data streams
- Online evaluation of thousands of triggers
- Storage/archival of important data
- Global Information Sphere
- Between organizations
- Share data in privacy preserving way
5Local Information Sphere
- Distributed data stream event processing and online data mining
- Technical challenges
- Blocking operators, unbounded state
- Graceful degradation under increasing load
- Integration with archive
- Processing of physically distributed streams
8Event Matching, Correlation
- Join of data streams
- Equi-join, text similarity, geographical proximity, ...
- Problem: unbounded state, computation
9Window Joins
- Restrict join to window of most recent records (tuples)
- Landmark window
- Sliding window based on time or number of records
- Problem definition
- Time-based window of size w
- Synchronous record arrival
- Equi-join
10Abstract Model
- Data streams R(A, ...), S(A, ...)
- Compute equi-join on A
- Match all r and s of streams R, S such that r.A = s.A
- Sliding window of size w
[Figure: streams R and S with their sliding windows; output so far: (r0,s2), (r1,s2), (r2,s2)]
11Abstract Model (cont.)
[Figure: same model as on the previous slide, one step later; output so far: (r0,s2), (r1,s2), (r2,s2), (r3,s1), (r1,s3), (r2,s3)]
12Abstract Model (cont.)
[Figure: same model, one further step; output unchanged at (r0,s2), (r1,s2), (r2,s2), (r3,s1), (r1,s3), (r2,s3); no new output]
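The window-join model above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind the slides: the function name `window_join` is hypothetical, a pair is assumed to be reported when its later tuple arrives, and a tuple is assumed to stay in the window for w time steps. The input streams R = 1, 1, 1, 3 and S = 2, 3, 1, 1 with w = 3 are borrowed from the flow-model example later in the deck.

```python
from collections import deque


def window_join(r_stream, s_stream, w):
    """Synchronous sliding-window equi-join (illustrative sketch).

    Returns, for every time step, the list of (i, j) pairs meaning that
    r_i and s_j joined. Assumptions: one tuple per stream per time step,
    a pair is reported when its later tuple arrives, and a tuple stays
    in its window for w time steps.
    """
    r_win, s_win = deque(), deque()  # entries: (arrival_time, value)
    output = []
    for t, (r, s) in enumerate(zip(r_stream, s_stream)):
        # Expire tuples that have fallen out of the window.
        for win in (r_win, s_win):
            while win and win[0][0] <= t - w:
                win.popleft()
        # New r tuple probes the S window, new s tuple probes the R window.
        new_pairs = [(t, j) for j, v in s_win if v == r]
        new_pairs += [(i, t) for i, v in r_win if v == s]
        # Tuples arriving at the same time step may also match each other.
        if r == s:
            new_pairs.append((t, t))
        r_win.append((t, r))
        s_win.append((t, s))
        output.append(new_pairs)
    return output


steps = window_join([1, 1, 1, 3], [2, 3, 1, 1], w=3)
for t, pairs in enumerate(steps):
    print(t, [f"(r{i},s{j})" for i, j in pairs])
```

Under these assumptions the third and fourth time steps produce exactly the pairs listed on the slides.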
13Limited Resources
- Focus on limited memory: M < 2w
- State of the art: random load shedding [KNV03]
- Random sample of streams
- Desired approach: semantic load shedding
- Goal: graceful degradation
- Approximation
- Set-valued result: error measure?
14Set-Approximation Error
- What is a good error measure?
- Information Retrieval, Statistics, Data Mining
- Matching coefficient
- Dice coefficient
- Jaccard coefficient
- Cosine coefficient
- Overlap coefficient
- Earth Mover's Distance (EMD) [RTG98]
- Match And Compare (MAC) [IP99]
- Join: approximate result is a subset of the exact output result
- EMD, Overlap coefficient trivially 0 or 1
- Others (except MAC) reduce to MAX-subset error measure
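To make the subset argument concrete, the following sketch computes five of the listed coefficients for an exact and an approximate result set. The formulas are the common set-based definitions from information retrieval, which is an assumption since the slides do not spell them out; the function name `set_coefficients` and the example sets are hypothetical.

```python
import math


def set_coefficients(exact, approx):
    """Common set-similarity coefficients (assumed textbook definitions)."""
    x, y = set(exact), set(approx)
    if not x or not y:
        raise ValueError("both result sets must be non-empty")
    common = len(x & y)
    return {
        "matching": common,
        "dice": 2 * common / (len(x) + len(y)),
        "jaccard": common / len(x | y),
        "cosine": common / math.sqrt(len(x) * len(y)),
        "overlap": common / min(len(x), len(y)),
    }


# Exact join result and an approximate result that is a subset of it.
exact = {(0, 2), (1, 2), (2, 2), (3, 1), (1, 3), (2, 3)}
approx = {(0, 2), (1, 2), (2, 2)}
print(set_coefficients(exact, approx))
```

When the approximate result is a subset of the exact one, the intersection is the approximate result itself, so the overlap coefficient is 1 for any non-empty subset, while matching, Dice, Jaccard, and cosine all grow with the size of the subset. That is the sense in which they reduce to maximizing the subset size.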
15Optimization Problem
- Select records to be kept in memory such that the result size is maximized subject to memory constraints
- Lightweight online technique
- Adaptivity in presence of memory fluctuations
16Optimal Offline Algorithm
- What is the best that can possibly be achieved?
- Optimal sampling strategy for MAX-subset
- Bottom line for evaluation of any online algorithm
- Same optimization problem, but knows future
- Finite subsets of input streams
- Formulate as linear flow problem
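For very small inputs, the offline optimum can also be found by exhaustive search, which is handy as a sanity check. The sketch below is a brute-force search, not the flow formulation named on the slide, and it rests on assumptions of its own: the name `optimal_offline` is hypothetical, M is taken to be the total number of tuples kept across both streams, tuples arriving at the same time step can always join each other, and a tuple stays eligible for w time steps.

```python
from functools import lru_cache
from itertools import combinations


def optimal_offline(r_stream, s_stream, w, M):
    """Maximum join output size reachable with memory for M tuples.

    Brute force over all keep/drop decisions (exponential, toy inputs
    only). Memory entries are (side, arrival_time, value) triples.
    """
    n = min(len(r_stream), len(s_stream))

    @lru_cache(maxsize=None)
    def best(t, mem):
        if t == n:
            return 0
        # Drop tuples that have left the window.
        live = tuple(e for e in mem if e[1] > t - w)
        r, s = r_stream[t], s_stream[t]
        # Output produced by the two arriving tuples at this step.
        gained = sum(
            1
            for side, _, v in live
            if (side == "S" and v == r) or (side == "R" and v == s)
        )
        gained += 1 if r == s else 0
        # Keeping as many tuples as memory allows never hurts.
        candidates = live + (("R", t, r), ("S", t, s))
        k = min(M, len(candidates))
        return gained + max(
            best(t + 1, keep) for keep in combinations(candidates, k)
        )

    return best(0, ())


print(optimal_offline([1, 1, 1, 3], [2, 3, 1, 1], w=3, M=2))
```

Because the search knows the whole input in advance, its result is an upper bound for any online strategy under the same memory model.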
17Generation of Flow Model
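[Figure: flow network for M = 2, w = 3 with input streams R = 1, 1, 1, 3 and S = 2, 3, 1, 1, fixed memory allocation. Arcs are labeled with costs (-1, -3, 3). Legend: "keep in memory" arcs with capacity 0..1 and linear cost; "replace" arcs.]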
18Correspondence to Windows
[Figure sequence (slides 18-21): correspondence between the flow model and the sliding windows for R = 1, 1, 1, 3 and S = 2, 3, 1, 1; arcs with cost -1 are added step by step.]
22Complexity
- Integer solution exists
- Optimal solution found in O(n^2 m log n)
- N: input size of single stream
- nodes: n < 2wN + N + 2
- arcs: m < 2n + M + 1
- Reasonable costs for benchmarking
- Approx. 1GB memory (w = 800, M = 800)
- Approx. 1h computation time
23Optimal Flow
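[Figure: optimal flow in the network for M = 2, w = 3 with input streams R = 1, 1, 1, 3 and S = 2, 3, 1, 1, fixed memory allocation. Arcs are labeled with costs (-1, -3, 3). Legend: "keep in memory" arcs with capacity 0..1 and linear cost; "replace" arcs.]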
24Easy to Extend
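[Figure: the same flow network for M = 2, w = 3 with input streams R = 1, 1, 1, 3 and S = 2, 3, 1, 1, extended to variable memory allocation. Arcs are labeled with costs (-1, -3, 3). Legend: "keep in memory" arcs with capacity 0..1 and linear cost; "replace" arcs.]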
25Online Heuristics
- Maximize expected output
- PROB: sort tuples by join partner arrival probability
- LIFE: sort tuples by product of partner arrival probability and remaining lifetime
- Maintain stream statistics
- Histograms (DGIM02, TGIK02), wavelets (GKMS01), quantiles (GKMS02, GK01)
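One possible reading of the two heuristics is sketched below: after each arrival, the tuples in memory are ranked by a score and only the top M are kept. Everything beyond the two scoring rules is an assumption made for illustration: the name `shed_join`, a memory budget M shared by both streams, the definition of remaining lifetime as the number of future time steps the tuple can still join, and the probability tables `p_r` and `p_s`, which stand in for whatever stream statistics are maintained.

```python
def shed_join(r_stream, s_stream, w, M, p_r, p_s, use_lifetime=False):
    """Window join that keeps only the M highest-scoring tuples.

    p_r / p_s map a join value to its (estimated) arrival probability
    on stream R / S. Score = partner arrival probability (PROB), or
    that probability times remaining lifetime (LIFE) when
    use_lifetime=True. Memory entries are (side, arrival, value).
    """
    mem, out = [], []
    for t, (r, s) in enumerate(zip(r_stream, s_stream)):
        mem = [e for e in mem if e[1] > t - w]  # expire old tuples
        out += [(t, a) for side, a, v in mem if side == "S" and v == r]
        out += [(a, t) for side, a, v in mem if side == "R" and v == s]
        if r == s:
            out.append((t, t))
        mem += [("R", t, r), ("S", t, s)]

        def score(entry):
            side, arrival, value = entry
            partner = p_s if side == "R" else p_r
            prob = partner.get(value, 0.0)
            if use_lifetime:
                # Future time steps at which the tuple is still in the window.
                return prob * (arrival + w - 1 - t)
            return prob

        mem.sort(key=score, reverse=True)  # stable: ties keep arrival order
        del mem[M:]                        # shed everything beyond the budget
    return out


R, S = [1, 1, 1, 3], [2, 3, 1, 1]
p_r = {1: 0.75, 3: 0.25}           # hypothetical statistics for stream R
p_s = {1: 0.5, 2: 0.25, 3: 0.25}   # hypothetical statistics for stream S
prob_out = shed_join(R, S, w=3, M=2, p_r=p_r, p_s=p_s)
life_out = shed_join(R, S, w=3, M=2, p_r=p_r, p_s=p_s, use_lifetime=True)
print(len(prob_out), len(life_out))
```

In this toy run LIFE retains more output than PROB because it stops holding on to a tuple that is about to expire, even though that tuple has a likely join partner.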
26Approximation Quality
27Effect of Skew
28Summary
- Information sphere architecture
- Optimal algorithm and fast, efficient heuristic for sliding window joins
- Open problems
- Other set error measures, resource models
- Other joins, compress records
- Complex queries
- Distributed processing
- Integration with other techniques into local
information sphere
29Related Work
- Aurora (Brown, MIT), STREAM (Stanford), Telegraph (Berkeley), NiagaraCQ (Wisconsin, OGI)
- Memory requirements [ABBMW02, TM02]
- Aggregation
- Alon, Bar-Yossef, Datar, Dobra, Garofalakis,
Gehrke, Gibbons, Gilbert, Indyk, Korn, Kotidis,
Koudas, Matias, Motwani, Muthukrishnan, Rastogi,
Srivastava, Strauss, Szegedy
30Other Results
- [DGR03]
- Integration with archive
- Load smoothing, not shedding
- Novel error measure: archive access cost
- Static join for sensor networks
- Maximize result size subject to constraints on energy consumption
- Polynomial dynamic programming solution
- Fast 2-approximation algorithms
- NP-hardness proof for join of 3 or more streams
31Other Results (cont.)
- [DGGR02]
- Computation of aggregates over streams for multiple joins
- Small pseudo-random sketch synopses (randomized linear projections)
- Explicit, tunable error guarantees
- Sketch partitioning to boost accuracy (intelligently partition join attribute space)
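The idea of a sketch as a randomized linear projection can be illustrated for the simplest case, the size of a two-stream equi-join. The code below is a generic sign-sketch estimator written for illustration, not the method of [DGGR02]: the names `JoinSketch`, `sign`, and `estimate_join_size` are hypothetical, the plus/minus-one signs come from a hash function rather than a provably 4-wise independent family, and the estimate is a plain average over independent copies.

```python
import hashlib


def sign(seed, copy, value):
    """Pseudo-random +1/-1 for a join value (hash-based, an assumption)."""
    data = f"{seed}:{copy}:{value}".encode()
    digest = hashlib.blake2b(data, digest_size=1).digest()
    return 1 if digest[0] & 1 else -1


class JoinSketch:
    """Keeps, per copy, the sum of sign(value) over all stream tuples."""

    def __init__(self, copies=64, seed=0):
        self.copies = copies
        self.seed = seed
        self.counters = [0] * copies

    def add(self, value, count=1):
        # Linear update: constant space, one pass, order-independent.
        for c in range(self.copies):
            self.counters[c] += count * sign(self.seed, c, value)


def estimate_join_size(a, b):
    """Average of per-copy counter products as a join-size estimate."""
    if (a.copies, a.seed) != (b.copies, b.seed):
        raise ValueError("sketches must share copies and seed")
    products = [x * y for x, y in zip(a.counters, b.counters)]
    return sum(products) / len(products)


f, g = JoinSketch(), JoinSketch()
for value in [7, 7, 7]:
    f.add(value)
for value in [7, 7, 7, 7, 7]:
    g.add(value)
print(estimate_join_size(f, g))
```

With a single join value the sign cancels in every product, so the estimate is exact; with many distinct values the cross terms only cancel in expectation, which is why the number of copies controls the accuracy.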
32Thanks!
Questions?