Title: Data Mining on Streams
1. Data Mining on Streams
2. THANK YOU!
- Prof. Panos Ipeirotis
- Julia Mills
3. Joint work with
- Spiros Papadimitriou (CMU -> IBM)
- Jimeng Sun (CMU/CS)
- Anthony Brockwell (CMU/Stat)
- Jeanne Vanbriesen (CMU/CivEng)
- Greg Ganger (CMU/ECE)
4. Outline
- Problem and motivation
- Single-sequence mining: AWSOM
- Co-evolving sequences: SPIRIT
- Lag correlations: BRAID
- Conclusions
5. Problem definition - example
Each sensor collects data (x1, x2, ..., xt, ...)
6. Problem definition
- Given one or more sequences
- x1, x2, ..., xt, ...
- y1, y2, ..., yt, ...
- Find
- patterns, correlations, outliers
- incrementally!
7. Limitations / Challenges
- Find patterns using a method that is
- nimble: limited resources
- Memory
- Bandwidth, power, CPU
- incremental: on-line, any-time response
- single pass (you get to see it only once)
- automatic: no human intervention
- e.g., in remote environments
8. Application domains
- Sensor devices
- Temperature, weather measurements
- Road traffic data
- Geological observations
- Patient physiological data
- Embedded devices
- Network routers
- Intelligent (active) disks
9. Motivation - Applications (cont'd)
- Smart house
- sensors monitor temperature, humidity, air quality
- video surveillance
10. Motivation - Applications (cont'd)
- civil/automobile infrastructure
- bridge vibrations [Oppenheim '02]
- road conditions / traffic monitoring
11. Motivation - Applications (cont'd)
- Weather, environment/anti-pollution
- volcano monitoring
- air/water pollutant monitoring
12. Motivation - Applications (cont'd)
- Computer systems
- Active Disks (buffering, prefetching)
- web servers (ditto)
- network traffic monitoring
- ...
13. InteMon - w/ Evan Hoke, Jimeng Sun
self-* PetaByte data center at CMU
14. Outline
- Problem and motivation
- Single-sequence mining: AWSOM
- Co-evolving sequences: SPIRIT
- Lag correlations: BRAID
- Conclusions
15. Single-sequence mining - AWSOM
- with Spiros Papadimitriou (CMU -> IBM)
- Anthony Brockwell (CMU/Stat)
16. Problem definition
- Semi-infinite streams of values (time series): x1, x2, ..., xt, ...
- Find patterns, forecasts, outliers
[Figure: example stream; annotations: Periodicity? (twice daily), Periodicity? (daily)]
17. Requirements / Goals
- Adapt to and handle arbitrary periodic components, and be:
- nimble (limited resources, single pass)
- on-line, any-time
- automatic (no human intervention/tuning)
18. Overview
- Introduction / Related work
- Background
- Main idea
- Experimental results
19. Wavelets - Example: Haar transform
[Figure: Haar wavelet coefficients tiling the time-frequency plane; the coarsest level captures the constant (average)]
20. Wavelets - Why we like them
- Wavelets compress many real signals well
- Image compression and processing
- Vision
- Astronomy, seismology, ...
- Wavelet coefficients can be updated as new points arrive (see the sketch below)
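For concreteness, here is a minimal Python sketch of such an incremental update (illustrative only, not the AWSOM implementation; the class and variable names are assumptions). As each point arrives, only the one value per level still waiting for its sibling is kept, so the state is O(log N):

```python
# Minimal sketch: maintaining Haar wavelet coefficients incrementally,
# one new point at a time. Each level keeps only the value waiting for
# its pair, so total state is O(log N).

import math

class IncrementalHaar:
    def __init__(self):
        self.pending = {}   # level -> (value, time) waiting for its sibling
        self.coeffs = []    # (level, time index, detail coefficient)

    def update(self, x, level=0, t=0):
        """Push a new smooth value x at `level`; cascade upward when a pair completes."""
        if level not in self.pending:
            self.pending[level] = (x, t)
            return
        prev, t_prev = self.pending.pop(level)
        avg = (prev + x) / math.sqrt(2)      # smooth (scaling) coefficient
        diff = (prev - x) / math.sqrt(2)     # detail (wavelet) coefficient
        self.coeffs.append((level, t_prev, diff))
        self.update(avg, level + 1, t_prev)  # recurse to the next coarser level

stream = [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
haar = IncrementalHaar()
for i, v in enumerate(stream):
    haar.update(v, level=0, t=i)
print(haar.coeffs)
```

Each completed pair emits one detail coefficient and pushes a smooth value to the next coarser level, which is what makes single-pass operation possible.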
21. Overview
- Introduction / Related work
- Background
- Main idea
- Experimental results
22. AWSOM
[Figure: the input stream xt]
23. AWSOM
[Figure: the input stream xt]
24. AWSOM - idea
W_{l,t} ≈ β_{l,1}·W_{l,t-1} + β_{l,2}·W_{l,t-2} + ...
(fit one auto-regressive model per wavelet level l, on that level's coefficients)
25. More details
- Update of wavelet coefficients (incremental)
- Update of linear models (incremental RLS)
- Feature selection (single-pass)
- Not all correlations are significant
- Throw away the insignificant ones ("noise")
(a sketch of the RLS update follows below)
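A minimal sketch of what an incremental RLS update per wavelet level could look like (illustrative; the forgetting factor `lam`, the initialization constant `delta`, and the class name are assumptions, not the authors' code). Each new coefficient W_{l,t} is regressed on the k previous coefficients of the same level, in O(k^2) time and space per update:

```python
# Recursive least squares with forgetting, applied per wavelet level:
# regress the newest coefficient on its k predecessors.

import numpy as np

class RLSModel:
    def __init__(self, k, lam=0.99, delta=100.0):
        self.k = k                      # number of regression coefficients
        self.lam = lam                  # forgetting factor (1.0 = no forgetting)
        self.beta = np.zeros(k)         # AR coefficients for this level
        self.P = delta * np.eye(k)      # inverse correlation matrix estimate
        self.history = []               # last k coefficients at this level

    def update(self, w_new):
        if len(self.history) == self.k:
            x = np.array(self.history[::-1])               # most recent first
            err = w_new - self.beta @ x                    # prediction error
            g = self.P @ x / (self.lam + x @ self.P @ x)   # gain vector
            self.beta += g * err                           # update coefficients
            self.P = (self.P - np.outer(g, x @ self.P)) / self.lam
        self.history.append(w_new)
        if len(self.history) > self.k:
            self.history.pop(0)

    def predict(self):
        if len(self.history) < self.k:
            return 0.0                                     # not enough history yet
        x = np.array(self.history[::-1])
        return float(self.beta @ x)
```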
26. Complexity
- Model update:
- Space: O(lg N + m·k^2) = O(lg N)
- Time: O(k^2) = O(1)
- where
- N = number of points (so far)
- k = number of regression coefficients (fixed)
- m = number of linear models = O(lg N)
27. Overview
- Introduction / Related work
- Background
- Main idea
- Experimental results
28. Results - Synthetic data
[Plots: forecasts by AWSOM, AR, and Seasonal AR]
- Triangle pulse
- Mix (sine + square)
- AR captures the wrong trend (or none)
- Seasonal AR estimation fails
29. Results - Real data
- Automobile traffic
- Daily periodicity
- Bursty noise at smaller scales
- AR fails to capture any trend
- Seasonal AR estimation fails
30. Results - Real data
- Sunspot intensity
- Slightly time-varying period
- AR captures the wrong trend
- Seasonal ARIMA:
- wrong downward trend, despite human help!
31. Conclusions
- Adapt to and handle arbitrary periodic components, and be:
- nimble
- Limited memory (logarithmic)
- Constant-time update
- on-line, any-time
- Single pass over the data
- automatic: no human intervention/tuning
32. Outline
- Problem and motivation
- Single-sequence mining: AWSOM
- Co-evolving sequences: SPIRIT
- Lag correlations: BRAID
- Conclusions
33. Part 2
- SPIRIT: Mining co-evolving streams
- Papadimitriou, Sun, Faloutsos, VLDB'05
34. Motivation
- E.g., chlorine concentration in a water distribution network
35. Motivation
[Figure: water distribution network, normal operation]
We may have hundreds of measurements, but it is unlikely they are completely unrelated!
36. Motivation
[Figure: chlorine concentrations in the water distribution network - sensors near the leak vs. sensors away from the leak; normal operation, then a major leak]
37. Motivation
[Figure: same as the previous slide]
38. Motivation
[Figure: actual measurements (n streams) and k hidden variable(s)]
We would like to discover a few hidden (latent) variables that summarize the key trends.
39. Motivation
[Figure: chlorine concentrations across Phase 1 and Phase 2; actual measurements (n streams) and k = 2 hidden variables]
We would like to discover a few hidden (latent) variables that summarize the key trends.
40. Motivation
[Figure: chlorine concentrations across Phases 1-3; actual measurements (n streams) and k = 1 hidden variable]
We would like to discover a few hidden (latent) variables that summarize the key trends.
41. Goals
- Discover hidden (latent) variables, for:
- Summarization of main trends for users
- Efficient forecasting, spotting outliers/anomalies
- and the usual:
- nimble: limited memory requirements
- on-line, any-time (single pass, etc.)
- automatic: no special parameters to tune
42. Related work - Stream mining
- Stream SVD: Guha, Gunopulos, Koudas / KDD'03
- StatStream: Zhu, Shasha / VLDB'02
- Clustering:
- Aggarwal, Han, Yu / VLDB'03; Guha, Meyerson, et al. / TKDE
- Lin, Vlachos, Keogh, Gunopulos / EDBT'04
- Classification:
- Wang, Fan, et al. / KDD'03
- Hulten, Spencer, Domingos / KDD'01
43. Related work - Stream mining
- Piecewise approximations:
- Palpanas, Vlachos, Keogh, et al. / ICDE 2004
- Queries on streams:
- Dobra, Garofalakis, Gehrke, et al. / SIGMOD'02
- Madden, Franklin, Hellerstein, et al. / OSDI'02
- Considine, Li, Kollios, et al. / ICDE'04
- Hammad, Aref, Elmagarmid / SSDBM'03
44. Overview - Part 2
- Method
- Experiments
- Conclusions and other work
45. Stream correlations
- Step 1: How to capture correlations?
- Step 2: How to do it incrementally, when we have a very large number of points?
- Step 3: How to dynamically adjust the number of hidden variables?
46. Step 1: How to capture correlations?
[Figure: first sensor - temperature t1, ranging from 20°C to 30°C]
47. Step 1: How to capture correlations?
- First sensor
- Second sensor
[Figure: second sensor - temperature t2, ranging from 20°C to 30°C]
48. Step 1: How to capture correlations?
Let's take a closer look at the first three value-pairs.
[Figure: temperature t2 vs. temperature t1, both ranging from 20°C to 30°C]
49. Step 1: How to capture correlations?
- The first three points lie (almost) on a line in the space of value-pairs
- The offset along the line is the hidden variable
- We need O(n) numbers for the slope, and one number for each value-pair (its offset on the line)
[Figure: temperature t2 vs. temperature t1, both ranging from 20°C to 30°C]
50. Step 1: How to capture correlations?
- The other pairs also follow the same pattern: they lie (approximately) on this line
[Figure: temperature t2 vs. temperature t1, both ranging from 20°C to 30°C]
51. Stream correlations
- Step 1: How to capture correlations?
- Step 2: How to do it incrementally, when we have a very large number of points?
- Step 3: How to dynamically adjust the number of hidden variables?
52. Incremental updates
53. Incremental updates
- The algorithm runs in O(n), where n = number of streams
- No need to access old data
[Figure: a new point and its projection error onto the current line; temperature T1 ranging from 20°C to 30°C]
54. Stream correlations - Principal Component Analysis (PCA)
- The line is the first principal component (PC)
- This line is optimal: it minimizes the sum of squared projection errors (see the sketch below)
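As an illustration (not from the slides), here is the batch version of this computation in Python: the optimal line is the top right singular vector of the centered data matrix, and the hidden variable is each value-pair's offset along that line. The synthetic sensor data is made up for the example:

```python
# Illustrative sketch: the optimal line through the value-pairs is the first
# principal component, i.e., the top right singular vector of the centered
# data matrix. This is the batch computation that SPIRIT approximates online.

import numpy as np

rng = np.random.default_rng(0)
t1 = 20 + 10 * rng.random(200)                 # first sensor, 20-30 degrees C
t2 = 0.9 * t1 + 2 + rng.normal(0, 0.3, 200)    # second sensor, correlated

X = np.column_stack([t1, t2])
Xc = X - X.mean(axis=0)                        # center the data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
w1 = Vt[0]                                     # first PC: direction of the line

y = Xc @ w1                                    # hidden variable: offset on the line
reconstruction = X.mean(axis=0) + np.outer(y, w1)
error = np.mean((X - reconstruction) ** 2)
print("first PC:", w1, " mean squared projection error:", error)
```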
55. Step 2: Incremental update, given the number of hidden variables k
- Assuming k is known
- We know how to update the slope
- For each new point x, and for i = 1, ..., k:
- y_i := w_i^T x (projection onto w_i)
- d_i ← λ·d_i + y_i^2 (energy, proportional to the i-th eigenvalue)
- e_i := x - y_i·w_i (error)
- w_i ← w_i + (1/d_i)·y_i·e_i (update estimate)
- x ← x - y_i·w_i (repeat with the remainder)
(a code sketch of this update follows below)
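A minimal Python sketch of this update rule. The class name, the forgetting factor `lam`, the initial energies, and the re-normalization of w_i are assumptions added for a self-contained example; they are not spelled out on the slide:

```python
# SPIRIT-style incremental PCA update, following the steps listed above.

import numpy as np

class SpiritTracker:
    def __init__(self, n, k, lam=0.96):
        self.k = k
        self.lam = lam
        self.w = [np.eye(n)[i].astype(float) for i in range(k)]  # initial PC estimates
        self.d = [1e-3] * k                                      # energy per component

    def update(self, x):
        x = x.astype(float).copy()
        y = np.zeros(self.k)
        for i in range(self.k):
            y[i] = self.w[i] @ x                          # y_i = w_i^T x (projection)
            self.d[i] = self.lam * self.d[i] + y[i] ** 2  # energy ~ i-th eigenvalue
            e = x - y[i] * self.w[i]                      # error
            self.w[i] += (y[i] / self.d[i]) * e           # update estimate of w_i
            self.w[i] /= np.linalg.norm(self.w[i])        # keep unit length (assumption)
            x = x - y[i] * self.w[i]                      # deflate, repeat with remainder
        return y                                          # k hidden variables for this tick

# usage: feed one n-dimensional measurement vector per time tick
tracker = SpiritTracker(n=3, k=1)
for t in range(100):
    base = np.sin(2 * np.pi * t / 24)
    xt = np.array([base, 0.8 * base, 1.1 * base]) + 0.01 * np.random.randn(3)
    hidden = tracker.update(xt)
```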
56. Stream correlations
- Step 1: How to capture correlations?
- Step 2: How to do it incrementally, when we have a very large number of points?
- Step 3: How to dynamically adjust k, the number of hidden variables?
57. Answer
- When the reconstruction accuracy is too low (say, < 95%),
- then introduce another hidden variable (increase k)
- How to initialize its values is tricky
58. Missing values
[Figure: temperature T2 vs. temperature T1 (20°C-30°C). Given only t1, all possible value-pairs form a vertical line; the best guess (given the correlations) is its intersection with the correlation line, which is close to the true value-pair.]
(a sketch of this estimate follows below)
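A small illustrative sketch of that "intersection" estimate. The numbers for `mean` and `w` are made up for the example; in practice they would come from the incremental PCA above:

```python
# Estimate a missing value from the learned principal direction: given only t1,
# pick the point on the correlation line whose first coordinate matches t1.

import numpy as np

mean = np.array([25.0, 24.0])     # running mean of (t1, t2) -- assumed known
w = np.array([0.74, 0.67])        # first principal direction -- assumed learned
w = w / np.linalg.norm(w)

def estimate_t2(t1_observed):
    # move along the line mean + y*w until the first coordinate equals t1_observed
    y = (t1_observed - mean[0]) / w[0]
    return mean[1] + y * w[1]

print(estimate_t2(27.0))          # best guess for the missing t2
```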
59. Forecasting
- Assume we want to forecast the next value for a particular stream (e.g., by auto-regression)
[Figure: n streams]
60. Forecasting
- Option 1: One complex model per stream
- Next value = function of previous values on all streams
- Captures correlations
- Too costly! O(n^3)
[Figure: n streams]
61. Forecasting
- Option 1: One complex model per stream
- Option 2: One simple model per stream
- Next value = function of previous value on the same stream
- Worse accuracy, but maybe acceptable
- But we still need n models
[Figure: n streams]
62. Forecasting
- Only k simple models, on the hidden variables
- Efficiency + robustness
- k << n, and the hidden variables already capture the correlations
(a sketch follows below)
[Figure: n streams and k hidden variables]
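A sketch of how the k hidden-variable forecasts could be mapped back to the n streams. This is illustrative: the helper names, the toy AR(1) forecaster, and the example numbers are assumptions, not SPIRIT's prescribed forecaster:

```python
# Forecast each hidden variable with its own simple model, then map back to
# the n streams through the principal directions w_i (only k models needed).

import numpy as np

def ar1_forecast(y_history, phi=0.9):
    """Toy AR(1) forecast of one hidden variable; phi would be fit incrementally."""
    return phi * y_history[-1]

def forecast_streams(w_list, y_forecasts):
    """Combine k hidden-variable forecasts into a forecast for all n streams."""
    n = len(w_list[0])
    x_hat = np.zeros(n)
    for w_i, y_i in zip(w_list, y_forecasts):
        x_hat += y_i * np.asarray(w_i)
    return x_hat

# e.g., with k = 2 hidden variables over n = 3 streams:
w_list = [np.array([0.6, 0.6, 0.5]), np.array([0.7, -0.7, 0.0])]
y_hist = [np.array([1.0, 0.9, 1.1]), np.array([0.2, 0.1, 0.15])]
y_next = [ar1_forecast(h) for h in y_hist]
print(forecast_streams(w_list, y_next))      # forecast for the 3 streams
```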
63. Time/space requirements - Incremental PCA
- O(nk) space (total) and time (per tuple), i.e.:
- Independent of the number of points
- Linear w.r.t. the number of streams (n)
- Linear w.r.t. the number of hidden variables (k)
- In fact, it can be done in real time
64. Overview - Part 2
- Method
- Experiments
- Conclusions and other work
65. Experiments - Chlorine concentration
[Figure: measurements vs. reconstruction]
166 streams, 2 hidden variables (4% error)
(data: CMU Civil Engineering)
66. Experiments - Chlorine concentration
[Figure: the hidden variables]
- Both capture the global, periodic pattern
- The second is like the first, but phase-shifted
- Together they can express any phase shift
(data: CMU Civil Engineering)
67. Experiments - Light measurements
[Figure: measurement vs. reconstruction]
54 sensors, 2-4 hidden variables (6% error)
68. Experiments - Light measurements
[Figure: the hidden variables; variables 3 and 4 are intermittent]
- 1 & 2: main trend (as before)
- 3 & 4: potential anomalies and outliers
69. Conclusions
- SPIRIT
- Discovers hidden variables, for:
- Summarization of main trends for users
- Efficient forecasting, spotting outliers/anomalies
- Incremental, real-time computation
- nimble: with limited memory
- automatic: no special parameters to tune
70. Outline
- Problem and motivation
- Single-sequence mining: AWSOM
- Co-evolving sequences: SPIRIT
- Lag correlations: BRAID
- Conclusions
71. Part 3 - BRAID: Discovering Lag Correlations in Multiple Streams
- Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos
- SIGMOD'05
72. Lag Correlations
- Examples:
- A decrease in interest rates typically precedes an increase in house sales by a few months
- Higher amounts of fluoride in the drinking water lead to fewer dental cavities, some years later
73. Lag Correlations
- Example of lag-correlated sequences
- These sequences are correlated with lag l = 1300 time-ticks
[Figure: the two sequences and their CCF (Cross-Correlation Function)]
74. Lag Correlations
- Example of lag-correlated sequences
- How to compute the CCF:
- quickly?
- cheaply?
- incrementally?
[Figure: CCF (Cross-Correlation Function)]
75. Challenging Problems
- Problem definitions
- For two given co-evolving sequences X and Y, determine:
- Whether there is a lag correlation
- If yes, what is the lag length l
- For k given numerical sequences X1, ..., Xk, report:
- Which pairs have a lag correlation
- The corresponding lag for each pair
76. Our solution
- Ideal characteristics:
- Any-time processing, and fast
- Computation time per time tick is constant
- Nimble
- Memory space requirement is sub-linear in the sequence length
- Accurate
- Approximation introduces only a small error
77. Related Work
- Sequence indexing:
- Agrawal et al. (FODO 1993)
- Faloutsos et al. (SIGMOD 1994)
- Keogh et al. (SIGMOD 2001)
- Compression (wavelets and random projections):
- Gilbert et al. (VLDB 2001), Guha et al. (VLDB 2004)
- Dobra et al. (SIGMOD 2002), Ganguly et al. (SIGMOD 2003)
- Data stream management:
- Abadi et al. (VLDB Journal 2003)
- Motwani et al. (CIDR 2003)
- Chandrasekaran et al. (CIDR 2003)
- Cranor et al. (SIGMOD 2003)
78. Related Work
- Pattern discovery
- Clustering for data streams: Guha et al. (TKDE 2003)
- Monitoring multiple streams: Zhu et al. (VLDB 2002)
- Forecasting: Yi et al. (ICDE 2000), Papadimitriou et al. (VLDB 2003)
- None of the previously published methods focuses on this problem
79. Overview
- Introduction / Related work
- Background
- Main ideas
- Theoretical analysis
- Experimental results
80. Main Idea (1)
- Incremental computation
- Sufficient statistics:
- Sum of X
- Square sum of X
- Inner product of X and the shifted Y
- Compute R(l) incrementally, from:
- the covariance of X and Y
- the variance of X
(a bookkeeping sketch follows below)
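A minimal sketch of this bookkeeping in Python (illustrative, not BRAID itself): keep running sums and the inner products of X with shifted copies of Y, then read off R(l) on demand. Note that this naive version still spends O(max_lag) time and space per tick; the smoothing and geometric probing on the next slides are what bring that down.

```python
# Incremental sufficient statistics for lagged cross-correlation.

import math
from collections import deque

class LagCorrelation:
    def __init__(self, max_lag):
        self.n = 0
        self.sx = self.sxx = 0.0           # sum and square-sum of X
        self.sy = self.syy = 0.0           # sum and square-sum of Y
        self.sxy = [0.0] * (max_lag + 1)   # inner products of X with shifted Y
        self.cnt = [0] * (max_lag + 1)     # number of terms in each inner product
        self.y_recent = deque(maxlen=max_lag + 1)

    def update(self, x, y):
        self.n += 1
        self.sx += x; self.sxx += x * x
        self.sy += y; self.syy += y * y
        self.y_recent.appendleft(y)        # y_recent[l] == y_{t-l}
        for l, y_lagged in enumerate(self.y_recent):
            self.sxy[l] += x * y_lagged
            self.cnt[l] += 1

    def R(self, l):
        # sketch: uses the global means; a more careful version keeps per-lag sums
        m = self.cnt[l]
        mean_x, mean_y = self.sx / self.n, self.sy / self.n
        cov = self.sxy[l] / m - mean_x * mean_y
        var_x = self.sxx / self.n - mean_x ** 2
        var_y = self.syy / self.n - mean_y ** 2
        return cov / math.sqrt(var_x * var_y)
```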
81. Main Idea (2)
[Figure: multi-level window smoothing]
82. Main Idea (2)
- Sequence smoothing:
- means of windows at each level
- sufficient statistics computed from the means
- CCF computed from the sufficient statistics
- But it allows a partial redundancy
83. Main Idea (3)
[Figure: geometric lag probing with colored windows]
84. Main Idea (3)
- Geometric lag probing
- Use colored windows
- Keep track of only a geometric progression of the lag values: l = 0, 1, 2, 4, 8, ..., 2^h, ...
- Use a cubic spline to interpolate (see the sketch below)
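A sketch of the probing-and-interpolation step (illustrative; it assumes an object `lag_corr` with an `R(l)` method, such as the sketch after slide 80, and uses SciPy's cubic spline):

```python
# Estimate R(l) only at geometrically spaced lags and interpolate the rest
# of the CCF with a cubic spline.

import numpy as np
from scipy.interpolate import CubicSpline

def probed_lags(max_lag):
    """Lags 0, 1, 2, 4, 8, ..., up to max_lag."""
    lags = [0]
    l = 1
    while l <= max_lag:
        lags.append(l)
        l *= 2
    return lags

def interpolated_ccf(lag_corr, max_lag):
    lags = probed_lags(max_lag)
    values = [lag_corr.R(l) for l in lags]       # only O(log max_lag) evaluations
    spline = CubicSpline(lags, values)
    dense = np.arange(max_lag + 1)
    return dense, spline(dense)                  # approximate CCF at every lag

# the estimated best lag is then the argmax of the interpolated CCF
```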
85. Overview
- Introduction / Related work
- Background
- Main ideas
- Theoretical analysis
- Experimental results
86. Experimental results
- Setup:
- Intel Xeon 2.8GHz, 1GB memory, Linux
- Datasets:
- Sines, SpikeTrains, Humidity, Light, Temperature, Kursk, Sunspots
- Enhanced BRAID, b = 16
- Evaluation:
- Estimation error of lag correlations
- Computation time
87. Detecting Lag Correlations (2)
BRAID closely estimates the correlation coefficients.
[Figure: CCF (Cross-Correlation Function)]
88. Detecting Lag Correlations (3)
BRAID closely estimates the correlation coefficients.
[Figure: CCF (Cross-Correlation Function)]
89. Detecting Lag Correlations (4)
BRAID closely estimates the correlation coefficients.
[Figure: CCF (Cross-Correlation Function)]
90. Detecting Lag Correlations (5)
BRAID closely estimates the correlation coefficients.
[Figure: CCF (Cross-Correlation Function)]
91. Estimation Error
- The largest relative error is about 1%
92. Performance
- Almost linear w.r.t. the sequence length
- Up to 40,000 times faster than the naive method
93. Group Lag Correlations
- Two correlated pairs found among 55 Temperature sequences
- Each sensor is located in a different place
[Figure: sensors 16, 19, 47, 48; estimated CCF of sensors 16 and 19, and of sensors 47 and 48]
94. Conclusions
- Automatic lag correlation detection on stream data
- incremental: online, any-time
- nimble:
- O(log n) space, O(1) time to update the statistics
- Up to 40,000 times faster than the naive implementation
- Accurate:
- detects the correct lag within 1% relative error or less
95. Overall Conclusions
- Mining streaming numerical data is challenging!
- Extensions: streaming matrix data (e.g., the network traffic matrix)
[Figure: IP-source x IP-destination traffic matrices over time]
96. Thank you
- christos <at> cs.cmu.edu
- www.cs.cmu.edu/~christos
- InteMon demo