Mirek Riedewald - PowerPoint PPT Presentation


Title: Mirek Riedewald


1
Efficient Processing of Massive Data Streams for
Mining and Monitoring
  • Mirek Riedewald
  • Department of Computer Science
  • Cornell University

2
Acknowledgements
  • Al Demers
  • Abhinandan Das
  • Alin Dobra
  • Sasha Evfimievski
  • Johannes Gehrke
  • KD-D initiative (Art Becker et al.)

3
Introduction
  • Data streams versus databases
  • Infinite stream, continuous queries
  • Limited resources
  • Network monitoring
  • High arrival rates, approximation [CGJSS02]
  • Stock trading
  • Complex computation [ZS02]
  • Retail, E-business, Intelligence, Medical
    Surveillance
  • Identify relevant information on-the-fly, archive
    for data mining
  • Exact results, error guarantees

4
Information Spheres
  • Local Information Sphere
  • Within each organization
  • Continuous processing of distributed data streams
  • Online evaluation of thousands of triggers
  • Storage/archival of important data
  • Global Information Sphere
  • Between organizations
  • Share data in privacy preserving way

5
Local Information Sphere
  • Distributed data stream event processing and
    online data mining
  • Technical challenges
  • Blocking operators, unbounded state
  • Graceful degradation under increasing load
  • Integration with archive
  • Processing of physically distributed streams

6
Event Matching, Correlation
  • Join of data streams

7
Event Matching, Correlation
  • Join of data streams

8
Event Matching, Correlation
  • Join of data streams
  • Equi-join, text similarity, geographical
    proximity, ...
  • Problem: unbounded state, computation

9
Window Joins
  • Restrict join to window of most recent records
    (tuples)
  • Landmark window
  • Sliding window based on time or number of records
  • Problem definition
  • Window based on time, size w
  • Synchronous record arrival
  • Equi-join
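
The problem definition above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: it assumes one record per stream per time step (synchronous arrival), represents each record by its join-attribute value only, and assumes that a record arriving at time t stays in the window through time t + w - 1. The function name `window_join` is hypothetical. The sample streams 1,1,1,3 and 2,3,1,1 appear to be the example values used on the later flow-model slides.

```python
from collections import deque

def window_join(r_stream, s_stream, w):
    """Equi-join two synchronous streams of join-attribute values.

    Returns pairs (i, j): the i-th record of R matched the j-th record
    of S while both were inside a sliding window of w time steps.
    """
    r_win, s_win = deque(), deque()   # (arrival_time, value) per stream
    out = []
    for t, (r_val, s_val) in enumerate(zip(r_stream, s_stream)):
        # Expire records that have fallen out of the window.
        for win in (r_win, s_win):
            while win and win[0][0] <= t - w:
                win.popleft()
        # The new R record joins with older S records still in the window.
        out.extend((t, j) for j, v in s_win if v == r_val)
        # The new S record joins with older R records still in the window.
        out.extend((i, t) for i, v in r_win if v == s_val)
        # The two records arriving together may also match each other.
        if r_val == s_val:
            out.append((t, t))
        r_win.append((t, r_val))
        s_win.append((t, s_val))
    return out

pairs = window_join([1, 1, 1, 3], [2, 3, 1, 1], w=3)
print(pairs)
```

Under these assumptions the call returns [(0, 2), (1, 2), (2, 2), (3, 1), (1, 3), (2, 3)], which agrees with the output pairs listed on the Abstract Model slides. State is bounded by 2w records, which is exactly the memory the later slides assume is not available.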

10
Abstract Model
  • Data streams R(A, ...), S(A, ...)
  • Compute equi-join on A
  • Match all r and s of streams R, S such that
    r.A = s.A
  • Sliding window of size w

[Figure: streams R and S under the sliding window]
Output: (r0,s2), (r1,s2), (r2,s2)
11
Abstract Model (cont.)
  • Data streams R(A, ...), S(A, ...)
  • Compute equi-join on A
  • Match all r and s of streams R, S such that
    r.A = s.A
  • Sliding window of size w

[Figure: streams R and S, window advanced by one step]
Output: (r0,s2), (r1,s2), (r2,s2), (r3,s1), (r1,s3), (r2,s3)
12
Abstract Model (cont.)
  • Data streams R(A, ...), S(A, ...)
  • Compute equi-join on A
  • Match all r and s of streams R, S such that
    r.A = s.A
  • Sliding window of size w

[Figure: streams R and S, window advanced by one more step]
Output: (r0,s2), (r1,s2), (r2,s2), (r3,s1), (r1,s3), (r2,s3); no new output
13
Limited Resources
  • Focus on limited memory: M < 2w
  • State of the art: random load shedding [KNV03]
  • Random sample of streams
  • Desired approach: semantic load shedding
  • Goal: graceful degradation
  • Approximation
  • Set-valued result: which error measure?

14
Set-Approximation Error
  • What is a good error measure?
  • Information Retrieval, Statistics, Data Mining
  • Matching coefficient
  • Dice coefficient
  • Jaccard coefficient
  • Cosine coefficient
  • Overlap coefficient
  • Earth Mover's Distance (EMD) [RTG98]
  • Match And Compare (MAC) [IP99]
  • Join: approximate result is a subset of the exact output
  • EMD, Overlap coefficient: trivially 0 or 1
  • Others (except MAC) reduce to MAX-subset error
    measure
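
A short Python sketch can make the last two bullets concrete. It uses the standard textbook definitions of the coefficients for sets, which the slides do not spell out, so treat the formulas as an assumption; EMD and MAC are omitted. The helper name `set_coefficients` and the sample result sets are illustrative only.

```python
import math

def set_coefficients(x, y):
    """Common set-similarity coefficients for two non-empty sets."""
    x, y = set(x), set(y)
    inter = len(x & y)
    return {
        "matching": inter,
        "dice": 2 * inter / (len(x) + len(y)),
        "jaccard": inter / len(x | y),
        "cosine": inter / math.sqrt(len(x) * len(y)),
        "overlap": inter / min(len(x), len(y)),
    }

# Exact join result and two approximate results that are subsets of it.
exact = {(0, 2), (1, 2), (2, 2), (3, 1), (1, 3), (2, 3)}
small = {(0, 2), (1, 2)}
large = {(0, 2), (1, 2), (2, 2), (3, 1)}

for approx in (small, large):
    print(len(approx), set_coefficients(approx, exact))
```

Because a load-shedding join can only lose output pairs, the approximate result is always a subset of the exact one. The overlap coefficient is then 1 regardless of how much was lost, while the matching, Dice, Jaccard, and cosine coefficients all grow with the size of the approximate result, so maximizing any of them amounts to maximizing the number of output pairs.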

15
Optimization Problem
  • Select records to be kept in memory such that the
    result size is maximized subject to memory
    constraints
  • Lightweight online technique
  • Adaptivity in presence of memory fluctuations

16
Optimal Offline Algorithm
  • What is the best possible that can be achieved?
  • Optimal sampling strategy for MAX-subset
  • Baseline for evaluation of any online
    algorithm
  • Same optimization problem, but knows future
  • Finite subsets of input streams
  • Formulate as linear flow problem

17
Generation of Flow Model
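[Figure: flow model for M = 2, w = 3 on streams R = 1,1,1,3 and S = 2,3,1,1, with fixed memory allocation. Arc types: "keep in memory" and "replace". Each arc has capacity 0..1 and linear cost; arc costs shown include -1, -3, and 3.]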
18
Correspondence to Windows
[Figure: sliding windows over R = 1,1,1,3 and S = 2,3,1,1]
19
Correspondence to Windows
[Figure: sliding windows over R = 1,1,1,3 and S = 2,3,1,1]
20
Correspondence to Windows
[Figure: sliding windows over R = 1,1,1,3 and S = 2,3,1,1, with the first arcs of cost -1 drawn in]
21
Correspondence to Windows
[Figure: sliding windows over R = 1,1,1,3 and S = 2,3,1,1, with all arcs of cost -1 drawn in]
22
Complexity
  • Integer solution exists
  • Optimal solution found in O(n^2 m log n)
  • N: input size of a single stream
  • Nodes: n < 2wN + N + 2
  • Arcs: m < 2n + M + 1
  • Reasonable costs for benchmarking
  • Approx. 1 GB memory (w = 800, M = 800)
  • Approx. 1h computation time

23
Optimal Flow
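[Figure: optimal flow in the model for M = 2, w = 3 on streams R = 1,1,1,3 and S = 2,3,1,1, with fixed memory allocation. Arc types: "keep in memory" and "replace". Each arc has capacity 0..1 and linear cost; arc costs shown include -1, -3, and 3.]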
24
Easy to Extend
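[Figure: the same flow model for M = 2, w = 3 on streams R = 1,1,1,3 and S = 2,3,1,1, extended to variable memory allocation. Arc types: "keep in memory" and "replace". Each arc has capacity 0..1 and linear cost; arc costs shown include -1, -3, and 3.]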
25
Online Heuristics
  • Maximize expected output
  • PROB: sort tuples by join partner arrival
    probability
  • LIFE: sort tuples by product of partner arrival
    probability and remaining lifetime
  • Maintain stream statistics
  • Histograms (DGIM02, TGIK02), wavelets (GKMS01),
    quantiles (GKMS02, GK01)
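
The two heuristics can be illustrated with a small Python simulation. This is a sketch under several assumptions that go beyond the slides: memory is split into a fixed number of slots per stream, partner arrival probability is estimated with a plain running frequency count (standing in for the histogram, wavelet, or quantile summaries cited above), arrival is synchronous, and ties are broken in favor of newer records. The name `shed_join` and its parameters are hypothetical.

```python
from collections import Counter

def shed_join(r_stream, s_stream, w, mem_per_stream, policy="PROB"):
    """Window equi-join keeping at most mem_per_stream records per stream.

    Returns the number of output pairs produced, which can only be
    smaller than or equal to the size of the exact window join.
    """
    bufs = ([], [])                  # kept (arrival_time, value) records for R, S
    seen = (Counter(), Counter())    # running value counts per stream
    produced = 0
    for t, values in enumerate(zip(r_stream, s_stream)):
        # Drop records that have left the window.
        for side in (0, 1):
            bufs[side][:] = [rec for rec in bufs[side] if rec[0] > t - w]
        # Join each new record with the kept records of the other stream.
        for side, value in enumerate(values):
            seen[side][value] += 1
            produced += sum(1 for _, v in bufs[1 - side] if v == value)
        if values[0] == values[1]:
            produced += 1            # the two new records match each other
        # Insert the new records, then shed the lowest-priority ones.
        for side, value in enumerate(values):
            other = seen[1 - side]

            def priority(rec):
                arrival, val = rec
                # Estimated probability that a partner arrives on the other stream.
                score = other[val] / (t + 1)
                if policy == "LIFE":
                    score *= arrival + w - 1 - t   # remaining time steps in window
                return (score, arrival)

            kept = sorted(bufs[side] + [(t, value)], key=priority, reverse=True)
            bufs[side][:] = kept[:mem_per_stream]
    return produced

R, S = [1, 1, 1, 3], [2, 3, 1, 1]
print(shed_join(R, S, w=3, mem_per_stream=3))          # enough memory: nothing is shed
print(shed_join(R, S, w=3, mem_per_stream=1))          # PROB under memory pressure
print(shed_join(R, S, w=3, mem_per_stream=1, policy="LIFE"))
```

With `mem_per_stream` at least w, no record is ever shed and the count equals the exact join size. Under memory pressure, PROB keeps the records whose values are most frequent on the other stream, while LIFE additionally discounts records that are about to expire anyway.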

26
Approximation Quality
27
Effect of Skew
28
Summary
  • Information sphere architecture
  • Optimal algorithm and fast, efficient heuristic
    for sliding window joins
  • Open problems
  • Other set error measures, resource models
  • Other joins, compressing records
  • Complex queries
  • Distributed processing
  • Integration with other techniques into local
    information sphere

29
Related Work
  • Aurora (Brown, MIT), STREAM (Stanford), Telegraph
    (Berkeley), NiagaraCQ (Wisconsin, OGI)
  • Memory requirements [ABBMW02, TM02]
  • Aggregation
  • Alon, Bar-Yossef, Datar, Dobra, Garofalakis,
    Gehrke, Gibbons, Gilbert, Indyk, Korn, Kotidis,
    Koudas, Matias, Motwani, Muthukrishnan, Rastogi,
    Srivastava, Strauss, Szegedy

30
Other Results
  • [DGR03]
  • Integration with archive
  • Load smoothing, not shedding
  • Novel error measure: archive access cost
  • Static join for sensor networks
  • Maximize result size subject to constraints on
    energy consumption
  • Polynomial dynamic programming solution
  • Fast 2-approximation algorithms
  • NP-hardness proof for join of 3 or more streams

31
Other Results (cont.)
  • [DGGR02]
  • Computation of aggregates over streams for
    multiple joins
  • Small pseudo-random sketch synopses (randomized
    linear projections)
  • Explicit, tunable error guarantees
  • Sketch partitioning to boost accuracy
    (intelligently partition join attribute space)
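
To show what a sketch synopsis built from randomized linear projections looks like, here is a minimal Python example in the spirit of the well-known AMS-style join-size estimator. It is not taken from [DGGR02]: it handles a single two-way join, averages the per-counter estimates instead of using a median of means, omits sketch partitioning, and replaces the four-wise independent random variables the analysis requires with a hash-based stand-in. The names `sign` and `JoinSketch` are hypothetical.

```python
import hashlib

def sign(seed, value):
    """Pseudo-random +1/-1 for (seed, value).

    Stand-in for a four-wise independent family; an assumption made
    to keep the sketch short and deterministic.
    """
    digest = hashlib.blake2b(f"{seed}:{value}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1

class JoinSketch:
    """Small synopsis of one stream's join-attribute frequencies."""

    def __init__(self, num_counters=64):
        self.counters = [0] * num_counters

    def add(self, value, count=1):
        # Each counter is a random +/-1 linear projection of the frequency vector.
        for seed in range(len(self.counters)):
            self.counters[seed] += count * sign(seed, value)

    def estimate_join_size(self, other):
        # The product of matching counters estimates the join size;
        # averaging over counters reduces the variance.
        products = [a * b for a, b in zip(self.counters, other.counters)]
        return sum(products) / len(products)

r_sketch, s_sketch = JoinSketch(), JoinSketch()
for value in [1, 1, 1, 3]:
    r_sketch.add(value)
for value in [2, 3, 1, 1]:
    s_sketch.add(value)
print(r_sketch.estimate_join_size(s_sketch))   # exact full join size is 3*2 + 1*1 = 7
```

The synopsis size is fixed by `num_counters` and does not grow with the stream, and because each counter is linear in the input, sketches of separate stream fragments can simply be added. More counters tighten the estimate, which is the tunable trade-off between space and error that the bullets refer to.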

32
Thanks!
Questions?