
1
Data Mining on Streams
  • Christos Faloutsos
  • CMU

2
THANK YOU!
  • Prof. Panos Ipeirotis
  • Julia Mills

3
Joint work with
  • Spiros Papadimitriou (CMU -> IBM)
  • Jimeng Sun (CMU/CS)
  • Anthony Brockwell (CMU/Stat)
  • Jeanne Vanbriesen (CMU/CivEng)
  • Greg Ganger (CMU/ECE)

4
Outline
  • Problem and motivation
  • Single-sequence mining: AWSOM
  • Co-evolving sequences: SPIRIT
  • Lag correlations: BRAID
  • Conclusions

5
Problem definition - example
Each sensor collects data (x1, x2, ..., xt, ...)
6
Problem definition
  • Given one or more sequences
  • x1, x2, ..., xt, ...
  • (y1, y2, ..., yt, ...)
  • Find
  • patterns, correlations, outliers
  • incrementally!

7
Limitations / Challenges
  • Find patterns using a method that is
  • nimble: limited resources
  • Memory
  • Bandwidth, power, CPU
  • incremental: on-line, 'any-time' response
  • single pass (you get to see it only once)
  • automatic: no human intervention
  • e.g., in remote environments

8
Application domains
  • Sensor devices
  • Temperature, weather measurements
  • Road traffic data
  • Geological observations
  • Patient physiological data
  • Embedded devices
  • Network routers
  • Intelligent (active) disks

9
Motivation - Applications (contd)
  • Smart house
  • sensors monitor temperature, humidity, air
    quality
  • video surveillance

10
Motivation - Applications (contd)
  • civil/automobile infrastructure
  • bridge vibrations [Oppenheim02]
  • road conditions / traffic monitoring

11
Motivation - Applications (contd)
  • Weather, environment/anti-pollution
  • volcano monitoring
  • air/water pollutant monitoring

12
Motivation - Applications (contd)
  • Computer systems
  • Active Disks (buffering, prefetching)
  • web servers (ditto)
  • network traffic monitoring
  • ...

13
InteMon (w/ Evan Hoke, Jimeng Sun)
self- PetaByte data center at CMU
14
Outline
  • Problem and motivation
  • Single-sequence mining: AWSOM
  • Co-evolving sequences: SPIRIT
  • Lag correlations: BRAID
  • conclusions

15
Single sequence mining - AWSOM
  • with Spiros Papadimitriou (CMU -> IBM)
  • Anthony Brockwell (CMU/Stat)

16
Problem definition
  • Semi-infinite streams of values (time series) x1,
    x2, ..., xt, ...
  • Find patterns, forecasts, outliers

Periodicity? (twice daily)
Periodicity? (daily)
17
Requirements / Goals
  • Adapt and handle arbitrary periodic components
  • and
  • nimble (limited resources, single pass)
  • on-line, any-time
  • automatic (no human intervention/tuning)

18
Overview
  • Introduction / Related work
  • Background
  • Main idea
  • Experimental results

19
Wavelets: Example - Haar transform
[Figure: Haar transform of a signal; axes: time, frequency; the top coefficient is the constant term]
20
Wavelets: Why we like them
  • Wavelets compress many real signals well
  • Image compression and processing
  • Vision
  • Astronomy, seismology, ...
  • Wavelet coefficients can be updated as new points
    arrive
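
A minimal sketch of the Haar transform behind the example, assuming a power-of-two input length. The function name `haar_transform` and the unnormalized convention (pairwise averages and half-differences, dividing by 2 rather than by the square root of 2) are illustrative choices, not taken from the slides. It runs in batch for clarity; in a streaming setting each new point can complete at most one coefficient per level, which is why the coefficients can be maintained as points arrive.

```python
def haar_transform(values):
    """Return (overall_average, detail_levels) for a power-of-two-length list.

    detail_levels[0] holds the finest-scale half-differences,
    detail_levels[-1] the single coarsest-scale half-difference.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(values)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail = half the difference within each pair (local change).
        details.append([(a - b) / 2 for a, b in pairs])
        # Approximation = pair average, passed on to the next coarser level.
        approx = [(a + b) / 2 for a, b in pairs]
    return approx[0], details


avg, details = haar_transform([2, 4, 6, 8, 10, 12, 14, 16])
```

A steadily increasing signal like this one yields the same detail value within each level, so a handful of distinct numbers describes it.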

21
Overview
  • Introduction / Related work
  • Background
  • Main idea
  • Experimental results

22
AWSOM
xt
23
AWSOM
xt
24
AWSOM - idea
$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \cdots$
25
More details
  • Update of wavelet coefficients (incremental)
  • Update of linear models (incremental RLS)
  • Feature selection (single-pass)
  • Not all correlations are significant
  • Throw away the insignificant ones (noise)
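
The "incremental RLS" step can be sketched as a textbook recursive least squares update. This is a hedged illustration, not the AWSOM implementation: the class name `RLS` and the parameters `delta` (initial scale of the gain matrix) and `forget` (forgetting factor) are assumptions, and the demo regresses a raw sinusoid on its own two previous values instead of regressing wavelet coefficients.

```python
import math


class RLS:
    """Recursive least squares for y ~ coef . x, one sample at a time."""

    def __init__(self, k, delta=1e3, forget=1.0):
        self.k = k
        self.forget = forget
        self.coef = [0.0] * k
        # Gain matrix P starts as delta * I (weak prior on the coefficients).
        self.P = [[delta if i == j else 0.0 for j in range(k)] for i in range(k)]

    def update(self, x, y):
        k = self.k
        Px = [sum(self.P[i][j] * x[j] for j in range(k)) for i in range(k)]
        denom = self.forget + sum(x[i] * Px[i] for i in range(k))
        gain = [v / denom for v in Px]
        err = y - sum(c * xi for c, xi in zip(self.coef, x))  # prediction error
        self.coef = [c + g * err for c, g in zip(self.coef, gain)]
        # Rank-one update of P; costs O(k^2) time and space per sample.
        self.P = [[(self.P[i][j] - gain[i] * Px[j]) / self.forget
                   for j in range(k)] for i in range(k)]
        return err


# A sinusoid satisfies s[t] = 2*cos(w)*s[t-1] - s[t-2], so two coefficients suffice.
series = [math.sin(2 * math.pi * t / 12) for t in range(200)]
model = RLS(k=2)
for t in range(2, len(series)):
    model.update([series[t - 1], series[t - 2]], series[t])
```

Each update touches only the k-by-k matrix and never revisits old samples, which is what makes the model update single-pass.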
26
Complexity
  • Model update
  • Space: $O(\lg N + mk^2) \approx O(\lg N)$
  • Time: $O(k^2) \approx O(1)$
  • Where
  • N: number of points (so far)
  • k: number of regression coefficients (fixed)
  • m: number of linear models, $O(\lg N)$

27
Overview
  • Introduction / Related work
  • Background
  • Main idea
  • Experimental results

28
Results - Synthetic data
[Figure panels: AWSOM, AR, Seasonal AR]
  • Triangle pulse
  • Mix (sine + square)
  • AR captures wrong trend (or none)
  • Seasonal AR estimation fails

29
Results - Real data
  • Automobile traffic
  • Daily periodicity
  • Bursty noise at smaller scales
  • AR fails to capture any trend
  • Seasonal AR estimation fails

30
Results - real data
  • Sunspot intensity
  • Slightly time-varying period
  • AR captures wrong trend
  • Seasonal ARIMA
  • wrong downward trend, despite help by human!

31
Conclusions
  • Adapt and handle arbitrary periodic components
  • and
  • nimble
  • Limited memory (logarithmic)
  • Constant-time update
  • on-line, any-time
  • Single pass over the data
  • automatic: no human intervention/tuning

32
Outline
  • Problem and motivation
  • Single-sequence mining: AWSOM
  • Co-evolving sequences: SPIRIT
  • Lag correlations: BRAID
  • conclusions

33
Part 2
  • SPIRIT: Mining co-evolving streams
  • [Papadimitriou, Sun, Faloutsos, VLDB05]

34
Motivation
  • E.g., chlorine concentration in water distribution
    network

35
Motivation
[Figure: water distribution network, normal operation]
There may be hundreds of measurements, but it is
unlikely that they are completely unrelated!
36
Motivation
[Figure: chlorine concentrations in the water distribution network, for sensors near the leak and sensors away from the leak, during normal operation and during a major leak]
38
Motivation
actual measurements (n streams)
k hidden variable(s)
We would like to discover a few hidden (latent)
variables that summarize the key trends
39
Motivation
[Figure: chlorine concentrations over Phase 1 and Phase 2; k = 2]
actual measurements (n streams)
k hidden variable(s)
We would like to discover a few hidden (latent)
variables that summarize the key trends
40
Motivation
[Figure: chlorine concentrations over Phase 1, Phase 2, and Phase 3; k = 1]
actual measurements (n streams)
k hidden variable(s)
We would like to discover a few hidden (latent)
variables that summarize the key trends
41
Goals
  • Discover hidden (latent) variables for
  • Summarization of main trends for users
  • Efficient forecasting, spotting
    outliers/anomalies
  • and the usual
  • nimble: limited memory requirements
  • on-line, any-time (single pass etc.)
  • automatic: no special parameters to tune

42
Related work: Stream mining
  • Stream SVD [Guha, Gunopulos, Koudas / KDD03]
  • StatStream [Zhu, Shasha / VLDB02]
  • Clustering
  • [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al / TKDE],
  • [Lin, Vlachos, Keogh, Gunopulos / EDBT04]
  • Classification
  • [Wang, Fan, et al / KDD03], [Hulten, Spencer, Domingos / KDD01]

43
Related work: Stream mining
  • Piecewise approximations
  • [Palpanas, Vlachos, Keogh, et al / ICDE 2004]
  • Queries on streams
  • [Dobra, Garofalakis, Gehrke, et al / SIGMOD02],
  • [Madden, Franklin, Hellerstein, et al / OSDI02],
  • [Considine, Li, Kollios, et al / ICDE04],
  • [Hammad, Aref, Elmagarmid / SSDBM03]

44
Overview: Part 2
  • Method
  • Experiments
  • Conclusions & other work

45
Stream correlations
  • Step 1: How to capture correlations?
  • Step 2: How to do it incrementally, when we have
    a very large number of points?
  • Step 3: How to dynamically adjust the number of
    hidden variables?

46
1. How to capture correlations?
  • First sensor

[Figure: Temperature t1, axis from 20°C to 30°C]
47
1. How to capture correlations?
  • First sensor
  • Second sensor

[Figure: Temperature t2, axis from 20°C to 30°C]
48
1. How to capture correlations
Correlations: Let's take a closer look at the
first three value-pairs
[Figure: Temperature t2 vs. Temperature t1, both axes from 20°C to 30°C]
49
1. How to capture correlations
  • First three lie (almost) on a line in the space
    of value-pairs

[Figure: Temperature t2 vs. Temperature t1, both axes from 20°C to 30°C; the offset along the line is the "hidden variable"]
  • O(n) numbers for the slope, and
  • one number for each value-pair (offset on line)
50
1. How to capture correlations
  • Other pairs also follow the same pattern: they
    lie (approximately) on this line

[Figure: Temperature t2 vs. Temperature t1, both axes from 20°C to 30°C]
51
Stream correlations
  • Step 1: How to capture correlations?
  • Step 2: How to do it incrementally, when we have
    a very large number of points?
  • Step 3: How to dynamically adjust the number of
    hidden variables?

52
Incremental updates
53
Incremental updates
  • Algorithm runs in O(n), where n = # of streams
  • no need to access old data

[Figure: value-pair plot with the error of a new point marked; horizontal axis Temperature T1, both axes from 20°C to 30°C]
54
Stream correlations: Principal Component Analysis
(PCA)
  • The line is the first principal component (PC)
  • This line is optimal: it minimizes the sum of
    squared projection errors
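
A small batch sketch of this idea for two streams, assuming mean-centered data (the slides do not say whether the line is centered). The function names `first_principal_component` and `offsets` are hypothetical. For 2-D data the direction of the first principal component has a closed form in terms of the scatter sums, so no linear-algebra library is needed.

```python
import math


def first_principal_component(points):
    """Return (mean, unit_direction) of the best-fit line through 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    # Angle of the major axis of the 2x2 scatter matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))


def offsets(points, mean, direction):
    """One number per value-pair: its position along the line (hidden variable)."""
    return [(p[0] - mean[0]) * direction[0] + (p[1] - mean[1]) * direction[1]
            for p in points]


# Made-up temperature pairs (t1, t2) that lie almost on a line.
pairs = [(20.0, 20.5), (25.0, 24.8), (30.0, 30.1)]
mean, direction = first_principal_component(pairs)
hidden = offsets(pairs, mean, direction)
```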

55
2. Incremental update: Given number of hidden
variables k
  • Assuming k is known
  • We know how to update the slope
  • For each new point x and for i = 1, ..., k:
  • $y_i := w_i^T x$ (projection onto $w_i$)
  • $d_i \leftarrow \lambda d_i + y_i^2$ (energy $\propto$ i-th eigenvalue)
  • $e_i := x - y_i w_i$ (error)
  • $w_i \leftarrow w_i + (1/d_i)\, y_i e_i$ (update estimate)
  • $x \leftarrow x - y_i w_i$ (repeat with remainder)
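
The update rule above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `spirit_update`, the forgetting-factor default of 0.96, the starting direction, and the small initial energy are all choices made here. The demo feeds two perfectly correlated streams, where the second is always twice the first.

```python
import math


def spirit_update(x, W, d, forget=0.96):
    """Apply one new point x to k direction estimates W and energies d (in place).

    x: list of n measurements; W: list of k weight vectors of length n;
    d: list of k energies. Returns the k hidden-variable values.
    """
    residual = list(x)
    hidden = []
    for i in range(len(W)):
        w = W[i]
        y = sum(wj * rj for wj, rj in zip(w, residual))       # projection onto w_i
        d[i] = forget * d[i] + y * y                           # energy estimate
        err = [rj - y * wj for rj, wj in zip(residual, w)]     # projection error
        w = [wj + (y / d[i]) * ej for wj, ej in zip(w, err)]   # update estimate
        W[i] = w
        # Remove what w_i explains and repeat with the remainder.
        residual = [rj - y * wj for rj, wj in zip(residual, w)]
        hidden.append(y)
    return hidden


W = [[1.0, 0.0]]   # one hidden variable, arbitrary starting direction
d = [1e-3]         # small positive initial energy (assumed)
for t in range(500):
    a = 20.0 + 5.0 * math.sin(0.1 * t)
    hidden = spirit_update([a, 2.0 * a], W, d)
```

On this input the single weight vector drifts toward the unit vector along (1, 2), and the per-point cost is proportional to n times k with no access to old data.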

56
Stream correlations
  • Step 1: How to capture correlations?
  • Step 2: How to do it incrementally, when we have
    a very large number of points?
  • Step 3: How to dynamically adjust k, the number
    of hidden variables?

57
Answer
  • When the reconstruction accuracy is too low (say,
    < 95%)
  • then introduce another hidden variable (k ← k + 1)
  • How to initialize its values: tricky
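
One way to read "reconstruction accuracy" is as the fraction of the streams' energy (sum of squares) that the hidden variables retain; that reading is an assumption here, as are the class name `EnergyTracker` and the exponential forgetting. The sketch only decides when to grow k and leaves out the tricky part, how to initialize the new variable.

```python
class EnergyTracker:
    """Track total vs. retained energy and flag when k should grow."""

    def __init__(self, forget=0.96, low=0.95):
        self.forget = forget
        self.low = low        # minimum acceptable retained fraction
        self.total = 0.0      # running energy of the raw measurements
        self.kept = 0.0       # running energy of the hidden variables

    def update(self, x, hidden):
        self.total = self.forget * self.total + sum(v * v for v in x)
        self.kept = self.forget * self.kept + sum(y * y for y in hidden)

    def should_grow(self):
        return self.kept < self.low * self.total


tracker = EnergyTracker()
tracker.update([3.0, 4.0], [3.0])   # hidden variable keeps 9 of 25 energy units
grow = tracker.should_grow()
```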

58
Missing values
[Figure: Temperature T2 vs. Temperature T1, both axes from 20°C to 30°C. Labels: true values (pair); all possible value pairs (given only t1); best guess (given correlations): the intersection]
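
The "intersection" idea can be written down directly for two streams: knowing only t1 restricts the pair to a vertical line, and the best guess is where that line meets the correlation line. The function name `fill_missing_t2` and the point-plus-direction representation of the line are assumptions made for this sketch.

```python
def fill_missing_t2(t1, point_on_line, direction):
    """Estimate a missing t2 from a known t1 using the correlation line."""
    px, py = point_on_line
    dx, dy = direction
    if dx == 0:
        raise ValueError("line is vertical: t1 carries no information about t2")
    # Move along the line until its first coordinate equals t1.
    return py + (t1 - px) * dy / dx


guess = fill_missing_t2(22.0, (25.0, 25.0), (1.0, 1.0))
```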
59
Forecasting
  • Assume we want to forecast the next value for a
    particular stream (e.g. auto-regression)

[Figure: n streams; "?" marks the next value to forecast]
60
Forecasting
  • Option 1: One complex model per stream
  • Next value = function of previous values on all
    streams
  • Captures correlations
  • Too costly! $O(n^3)$

[Figure: n streams; "?" marks the next value to forecast]
61
Forecasting
  • Option 1: One complex model per stream
  • Option 2: One simple model per stream
  • Next value = function of previous value on same
    stream
  • Worse accuracy, but maybe acceptable
  • But, still need n models

[Figure: n streams; "?" marks the next value to forecast]
62
Forecasting
[Figure: n streams summarized by hidden variables; "?" marks the next value to forecast]
Only k simple models
Efficiency & robustness
k << n, and the hidden variables already capture correlations
63
Time/space requirements: Incremental PCA
  • O(nk) space (total) and time (per tuple), i.e.,
  • Independent of # points
  • Linear w.r.t. # streams (n)
  • Linear w.r.t. # hidden variables (k)
  • In fact,
  • Can be done in real time

64
Overview: Part 2
  • Method
  • Experiments
  • Conclusions & other work

65
Experiments: Chlorine concentration
[Figure panels: Measurements, Reconstruction]
166 streams, 2 hidden variables (4% error)
CMU Civil Engineering
66
Experiments: Chlorine concentration
[Figure: hidden variables]
  • Both capture global, periodic pattern
  • Second: similar to first, but phase-shifted
  • Can express any phase-shift

CMU Civil Engineering
67
Experiments: Light measurements
[Figure: measurement and reconstruction]
54 sensors, 2-4 hidden variables (6% error)
68
Experiments: Light measurements
[Figure: hidden variables; two of them are intermittent]
  • 1 & 2: main trend (as before)
  • 3 & 4: potential anomalies and outliers

69
Conclusions
  • SPIRIT
  • Discovers hidden variables for
  • Summarization of main trends for users
  • Efficient forecasting, spotting
    outliers/anomalies
  • Incremental, real time computation
  • nimble: with limited memory
  • automatic: no special parameters to tune

70
Outline
  • Problem and motivation
  • Single-sequence mining: AWSOM
  • Co-evolving sequences: SPIRIT
  • Lag correlations: BRAID
  • Conclusions

71
Part 3: BRAID - Discovering Lag Correlations in
Multiple Streams
  • Yasushi Sakurai,
  • Spiros Papadimitriou,
  • Christos Faloutsos
  • [SIGMOD05]

72
Lag Correlations
  • Examples
  • A decrease in interest rates typically precedes
    an increase in house sales by a few months
  • Higher amounts of fluoride in the drinking water
    lead to fewer dental cavities, some years later

73
Lag Correlations
  • Example of lag-correlated sequences

These sequences are correlated with lag l = 1300
time-ticks
CCF (Cross-Correlation Function)
74
Lag Correlations
  • Example of lag-correlated sequences
  • how to compute it
  • quickly
  • cheaply
  • incrementally

CCF (Cross-Correlation Function)
75
Challenging Problems
  • Problem definitions
  • Given two co-evolving sequences X and Y,
    determine
  • Whether there is a lag correlation
  • If yes, what is the lag length l
  • Given k numerical sequences, X1, ..., Xk, report
  • Which pairs have a lag correlation
  • The corresponding lag for each pair

76
Our solution
  • Ideal characteristics
  • Any-time processing, and fast
  • Computation time per time tick is constant
  • Nimble
  • Memory space requirement is sub-linear in the
    sequence length
  • Accurate
  • Approximation introduces small error

77
Related Work
  • Sequence indexing
  • Agrawal et al. (FODO 1993)
  • Faloutsos et al. (SIGMOD 1994)
  • Keogh et al. (SIGMOD 2001)
  • Compression (wavelet and random projections)
  • Gilbert et al. (VLDB 2001), Guha et al. (VLDB
    2004)
  • Dobra et al.(SIGMOD 2002), Ganguly et al.(SIGMOD
    2003)
  • Data Stream Management
  • Abadi et al. (VLDB Journal 2003)
  • Motwani et al. (CIDR 2003)
  • Chandrasekaran et al. (CIDR 2003)
  • Cranor et al. (SIGMOD 2003)

78
Related Work
  • Pattern discovery
  • Clustering for data streams
  • Guha et al. (TKDE 2003)
  • Monitoring multiple streams
  • Zhu et al. (VLDB 2002)
  • Forecasting
  • Yi et al. (ICDE 2000)
  • Papadimitriou et al. (VLDB 2003)
  • None of the previously published methods focuses on
    this problem

79
Overview
  • Introduction / Related work
  • Background
  • Main ideas
  • Theoretical analysis
  • Experimental results

80
Main Idea (1)
  • Incremental computation
  • Sufficient statistics
  • Sum of X
  • Square sum of X
  • Inner-product for X and the shifted Y
  • Compute R(l) incrementally
  • Covariance of X and Y
  • Variance of X
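
A hedged sketch of how such sufficient statistics give the correlation at one fixed lag without revisiting old data. The class name `LagCorrelation` and the pairing of x at time t with y at time t minus the lag are assumptions. Unlike BRAID, this sketch buffers the last lag + 1 values of Y, so its memory grows with the lag; it only illustrates the running sums.

```python
import math
from collections import deque


class LagCorrelation:
    """Running correlation between x[t] and y[t - lag] from five sums."""

    def __init__(self, lag):
        self.lag = lag
        self.recent_y = deque(maxlen=lag + 1)   # holds y[t-lag] .. y[t]
        self.m = 0                              # number of aligned pairs
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y):
        self.recent_y.append(y)
        if len(self.recent_y) <= self.lag:
            return                              # no aligned partner yet
        y_lagged = self.recent_y[0]
        self.m += 1
        self.sx += x
        self.sy += y_lagged
        self.sxx += x * x
        self.syy += y_lagged * y_lagged
        self.sxy += x * y_lagged

    def correlation(self):
        if self.m < 2:
            return None
        cov = self.sxy - self.sx * self.sy / self.m
        var_x = self.sxx - self.sx * self.sx / self.m
        var_y = self.syy - self.sy * self.sy / self.m
        if var_x <= 0 or var_y <= 0:
            return None
        return cov / math.sqrt(var_x * var_y)


# Demo: x is a copy of y delayed by 5 ticks.
ys = [math.sin(0.3 * t) for t in range(400)]
xs = [0.0] * 5 + ys[:-5]
at_lag_5 = LagCorrelation(5)
at_lag_0 = LagCorrelation(0)
for x, y in zip(xs, ys):
    at_lag_5.update(x, y)
    at_lag_0.update(x, y)
```

Each update costs constant time per tracked lag; the sums, not the raw history, carry everything the correlation needs.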

81
Main Idea (2)
  • Sequence smoothing

82
Main Idea (2)
  • Sequence smoothing
  • Means of windows for each level
  • Sufficient statistics computed from the means
  • CCF computed from the sufficient statistics
  • But, it allows a partial redundancy
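
As a rough illustration of "means of windows for each level", the sketch below assumes that level h uses non-overlapping windows of length 2^h; the slides do not spell out the window layout, and the function name `window_means` is hypothetical.

```python
def window_means(values, level):
    """Means of consecutive non-overlapping windows of length 2**level."""
    width = 2 ** level
    return [sum(values[i:i + width]) / width
            for i in range(0, len(values) - width + 1, width)]


signal = [1, 2, 3, 4, 5, 6, 7, 8]
level_1 = window_means(signal, 1)
level_2 = window_means(signal, 2)
```

Each level halves the number of values kept, so coarse levels summarize long stretches of the sequence in very little space.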

83
Main Idea (3)
  • Geometric lag probing

84
Main Idea (3)
  • Geometric lag probing
  • Use colored windows
  • Keep track of only a geometric progression of the
    lag values: l = 0, 1, 2, 4, 8, ..., 2^h, ...
  • Use a cubic spline to interpolate
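
The probe lags themselves are easy to generate, which shows why so few of them are needed. The function name `geometric_lags` is an assumption, and the sketch stops at listing the lags: the cubic-spline interpolation between them is not implemented here.

```python
def geometric_lags(max_lag):
    """Probe lags 0, 1, 2, 4, ... up to max_lag (inclusive)."""
    lags = [0]
    lag = 1
    while lag <= max_lag:
        lags.append(lag)
        lag *= 2
    return lags


probes = geometric_lags(16)
count_for_a_million = len(geometric_lags(10 ** 6))
```

Covering lags up to one million takes only 21 probes, which is the source of the logarithmic space cost.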

85
Overview
  • Introduction / Related work
  • Background
  • Main ideas
  • Theoretical analysis
  • Experimental results

86
Experimental results
  • Setup
  • Intel Xeon 2.8GHz, 1GB memory, Linux
  • Datasets
  • Sines, SpikeTrains, Humidity, Light,
    Temperature,
  • Kursk, Sunspots
  • Enhanced BRAID, b = 16
  • Evaluation
  • Estimation error of lag correlations
  • Computation time

87
Detecting Lag Correlations (2)
  • SpikeTrains

BRAID closely estimates the correlation
coefficients
CCF (Cross-Correlation Function)
88
Detecting Lag Correlations (3)
  • Humidity

BRAID closely estimates the correlation
coefficients
CCF (Cross-Correlation Function)
89
Detecting Lag Correlations (4)
  • Light

BRAID closely estimates the correlation
coefficients
CCF (Cross-Correlation Function)
90
Detecting Lag Correlations (5)
  • Kursk

BRAID closely estimates the correlation
coefficients
CCF (Cross-Correlation Function)
91
Estimation Error
  • Largest relative error is about 1%

92
Performance
  • Almost linear w.r.t. sequence length
  • Up to 40,000 times faster

93
Group Lag Correlations
  • Two correlated pairs from 55 Temperature
    sequences
  • Each sensor is located in a different place

[Figure: sensors 16, 19, 47, and 48; panels: Estimation of CCF of 16 and 19, Estimation of CCF of 47 and 48]
94
Conclusions
  • Automatic lag correlation detection on stream
    data
  • incremental: online, any-time
  • nimble
  • O(log n) space, O(1) time to update the
    statistics
  • Up to 40,000 times faster than the naive
    implementation
  • Accurate
  • Detecting the correct lag within 1% relative
    error or less

95
Overall Conclusions
  • Mining streaming numerical data: challenging!
  • Extensions: streaming matrix data (e.g., network
    traffic matrix)

[Figure: traffic matrix with axes time, IP-source, IP-destination]
96
Thank you
  • christos <at> cs.cmu.edu
  • www.cs.cmu.edu/christos
  • InteMon demo