Title: Sensor data mining and forecasting
1Sensor data mining and forecasting
- Christos Faloutsos
- CMU
- christos_at_cs.cmu.edu
2Outline
- Problem definition - motivation
- Linear forecasting - AR and AWSOM
- Coevolving series - MUSCLES
- Fractal forecasting - F4
- Other projects
- graph modeling, outliers etc
3Problem definition
- Given one or more sequences
- $x_1, x_2, \ldots, x_t, \ldots$
- ($y_1, y_2, \ldots, y_t, \ldots$)
- Find
- forecasts, patterns
- clusters, outliers
4Motivation - Applications
- Financial, sales, economic series
- Medical
- ECGs, blood pressure, etc. monitoring
- reactions to new drugs
- elderly care
5Motivation - Applications (contd)
- Smart house
- sensors monitor temperature, humidity, air quality
- video surveillance
6Motivation - Applications (contd)
- civil/automobile infrastructure
- bridge vibrations [Oppenheim02]
- road conditions / traffic monitoring
7Stream Data: automobile traffic
[Figure: number of cars vs. time]
8Motivation - Applications (contd)
- Weather, environment/anti-pollution
- volcano monitoring
- air/water pollutant monitoring
9Stream Data: Sunspots
[Figure: sunspots per month vs. time]
10Motivation - Applications (contd)
- Computer systems
- Active Disks (buffering, prefetching)
- web servers (ditto)
- network traffic monitoring
- ...
11Stream Data: Disk accesses
[Figure: bytes vs. time]
12Settings Applications
- One or more sensors, collecting time-series data
13Settings Applications
Each sensor collects data ($x_1, x_2, \ldots, x_t, \ldots$)
14Settings Applications
Sensors report to a central site
15Settings Applications
Problem 1: Finding patterns in a single time sequence
16Settings Applications
Problem 2: Finding patterns in many time sequences
17Problem 1
- Goal: given a signal (e.g., packets over time)
- Find: patterns, periodicities, and/or compress
[Figure: count vs. year; lynx caught per year (or packets per day, temperature per day)]
18Problem 1: Forecast
- Given $x_t, x_{t-1}, \ldots$, forecast $x_{t+1}$
[Figure: number of packets sent vs. time tick; the value at the next tick is marked "??"]
19Problem 2
- Given: a set of correlated time sequences
- Forecast: 'Sent(t)'
20Differences from DSP/Stat
- Semi-infinite streams
- we need on-line, any-time algorithms
- Cannot afford human intervention
- need automatic methods
- sensors have limited memory / processing / transmitting power
- need for (lossy) compression
21Important observations
- Patterns, rules, compression and forecasting are closely related
- To do forecasting, we need
- to find patterns/rules
- good rules help us compress
- to find outliers, we need to have forecasts
- (outlier = too far away from our forecast)
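The last observation (an outlier is a value that lies too far from its forecast) can be illustrated with a minimal sketch. The function name `flag_outliers`, the use of the root-mean-square forecast error as the yardstick, and the `threshold` multiplier are illustrative assumptions; the slides do not fix a particular rule.

```python
import math


def flag_outliers(actual, forecast, threshold=2.0):
    """Return indices where |actual - forecast| exceeds threshold * RMS error.

    Assumption: the RMS forecast error over the whole batch is the scale
    for "too far away"; any other robust scale could be swapped in.
    """
    errors = [a - f for a, f in zip(actual, forecast)]
    rms = math.sqrt(sum(e * e for e in errors) / len(errors))
    if rms == 0:
        return []  # perfect forecasts: nothing is far away
    return [i for i, e in enumerate(errors) if abs(e) > threshold * rms]


# Toy data: a flat forecast of 10 and one value that jumps to 30.
actual = [10, 11, 9, 10, 30, 10, 10, 10]
forecast = [10] * 8
outliers = flag_outliers(actual, forecast)
```

Any forecasting method discussed in the talk could supply the `forecast` list; the better the forecast, the sharper the outlier test becomes.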
22Pictorial outline of the talk
23Outline
- Problem definition - motivation
- Linear forecasting
- AR
- AWSOM
- Coevolving series - MUSCLES
- Fractal forecasting - F4
- Other projects
- graph modeling, outliers etc
24Mini intro to A.R.
25Forecasting
- "Prediction is very difficult, especially about the future." - Niels Bohr
- http://www.hfac.uh.edu/MediaFutures/thoughts.html
26Problem 1: Forecast
- Example: given $x_{t-1}, x_{t-2}, \ldots$, forecast $x_t$
[Figure: number of packets sent vs. time tick; the value at the next tick is marked "??"]
27Forecasting: Preprocessing
- MANUALLY
- remove trends
- spot periodicities
[Figure: time plots illustrating a trend and a 7-day periodicity]
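Once a human has spotted the period (7 days in the example), the mechanical part of this preprocessing is commonly done by differencing. The sketch below is an assumption about how one might do it, not a step taken from the slides; `difference` is a hypothetical helper name.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Toy series: a linear trend plus a pattern that repeats every 7 ticks.
weekly = [0, 1, 2, 3, 2, 1, 0]
series = [t + weekly[t % 7] for t in range(28)]

deseasoned = difference(series, lag=7)     # weekly pattern cancels out
detrended = difference(deseasoned, lag=1)  # remaining constant slope cancels out
```

Differencing at lag 7 removes the weekly pattern, and a further lag-1 difference removes what is left of the trend. The period still has to be supplied by hand, which is exactly the limitation the talk goes on to address.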
28Linear Regression: idea
[Figure: scatter plot of body height vs. body weight]
- express what we don't know (= 'dependent variable')
- as a linear function of what we know (= 'indep. variable(s)')
29Linear Auto Regression
30Problem 1: Forecast
- Solution: try to express
- $x_t$
- as a linear function of the past: $x_{t-1}, x_{t-2}, \ldots$
- (up to a window of $w$)
- Formally: $x_t \approx a_1 x_{t-1} + \ldots + a_w x_{t-w} + \text{noise}$
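A minimal batch sketch of this idea: express each value as a linear combination of the previous w values and pick the coefficients by least squares. The function names (`fit_ar`, `solve`, `forecast_next`) and the use of the normal equations are illustrative assumptions, not the talk's implementation.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ar(series, w):
    """Least-squares coefficients a[0..w-1], where a[i] multiplies x[t-1-i]."""
    rows = [[series[t - i] for i in range(1, w + 1)]
            for t in range(w, len(series))]
    targets = series[w:]
    # Normal equations: (X^T X) a = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(w)] for i in range(w)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(w)]
    return solve(xtx, xty)


def forecast_next(series, coeffs):
    """One-step-ahead forecast from the last len(coeffs) values."""
    return sum(c * series[-1 - i] for i, c in enumerate(coeffs))


# Toy series that really is an AR process with window 2 and no noise.
series = [1.0, 2.0]
for _ in range(18):
    series.append(0.5 * series[-1] + 0.3 * series[-2])

coeffs = fit_ar(series, w=2)
next_value = forecast_next(series, coeffs)
```

On this noiseless toy series the fit recovers the generating coefficients (about 0.5 and 0.3). On real data the window w has to be chosen, which is the question the talk takes up next.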
31Linear Auto Regression
[Figure: lag-plot of number of packets sent at time t vs. number of packets sent at time t-1]
- lag w = 1
- Dependent variable: # of packets sent (S[t])
- Independent variable: # of packets sent (S[t-1])
32More details
- Q1: Can it work with window w > 1?
- A1: YES!
[Figure: 3-d plot with axes x_t, x_{t-1}, x_{t-2}]
33More details
- Q1: Can it work with window w > 1?
- A1: YES! (we'll fit a hyper-plane, then!)
[Figure: 3-d plot with axes x_t, x_{t-1}, x_{t-2}]
35Even more details
- Q2: Can we estimate the coefficients a incrementally?
- A2: Yes, with the brilliant, classic method of 'Recursive Least Squares' (RLS) (see, e.g., [Chen94] or [Yi00] for details)
- Q3: Can we 'down-weight' older samples?
- A3: Yes (RLS does that easily!)
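The textbook RLS update with an exponential forgetting factor can be sketched as follows. The class name, the `forget` and `delta` parameters, and the toy data are illustrative assumptions; see [Chen94] or [Yi00] for the treatment the talk refers to.

```python
class RecursiveLeastSquares:
    """Incremental least squares for y ~ a . x with exponential forgetting."""

    def __init__(self, w, forget=1.0, delta=1e6):
        self.w = w
        self.forget = forget  # 1.0 = no forgetting; < 1.0 down-weights old samples
        self.a = [0.0] * w    # current coefficient estimates
        # P approximates the inverse of the (weighted) X^T X matrix.
        self.P = [[delta if i == j else 0.0 for j in range(w)] for i in range(w)]

    def update(self, x, y):
        """Fold in one sample (x, y); return the error made before the update."""
        w = self.w
        Px = [sum(self.P[i][j] * x[j] for j in range(w)) for i in range(w)]
        denom = self.forget + sum(x[i] * Px[i] for i in range(w))
        gain = [v / denom for v in Px]
        err = y - sum(self.a[i] * x[i] for i in range(w))
        self.a = [self.a[i] + gain[i] * err for i in range(w)]
        # P is symmetric, so Px doubles as the row vector x^T P.
        self.P = [[(self.P[i][j] - gain[i] * Px[j]) / self.forget
                   for j in range(w)] for i in range(w)]
        return err


# Toy stream generated by y = 2*x0 - 1*x1, one sample at a time.
model = RecursiveLeastSquares(w=2, forget=0.98)
for t in range(60):
    x = [(t % 7) - 3.0, (t % 5) - 2.0]
    y = 2.0 * x[0] - 1.0 * x[1]
    model.update(x, y)
```

Each update costs O(w^2) time and memory regardless of how many samples have been seen, which is what makes the method suitable for semi-infinite streams.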
36Mini intro to A.R.
37How to choose w?
- goal: capture arbitrary periodicities
- with NO human intervention
- on a semi-infinite stream
38Outline
- Problem definition - motivation
- Linear forecasting
- AR
- AWSOM
- Coevolving series - MUSCLES
- Fractal forecasting - F4
- Other projects
- graph modeling, outliers etc
39Problem
- in a train of spikes (128 ticks apart)
- any AR with window w < 128 will fail
- What to do, then?
40Answer (intuition)
- Do a Wavelet transform (roughly, a short-window DFT)
- look for patterns in every frequency
41Intuition
- Why NOT use the short window Fourier transform (SWFT)?
- A: how short should the window be?
[Figure: time-frequency plane (freq vs. time) with a fixed window width w]
42Wavelets
- main idea: variable-length window!
[Figure: time-frequency plane (f vs. t)]
43Advantages of Wavelets
- Better compression (better RMSE with same number of coefficients - used in JPEG-2000)
- fast to compute (usually O(n)!)
- very good for spikes
- mammalian eye and ear: Gabor wavelets
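To make the O(n) cost and the behavior on spikes concrete, here is a minimal sketch of the Haar transform, the simplest wavelet. The choice of Haar is an assumption made for illustration (the slides do not name a wavelet family), and `haar_dwt` is a hypothetical helper.

```python
def haar_dwt(signal):
    """Orthonormal Haar wavelet transform of a power-of-two-length signal.

    Returns (approx, details): the single overall smooth coefficient and
    a list of detail-coefficient lists, finest level first.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    s = 2 ** 0.5
    approx = [float(v) for v in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / s for a, b in pairs])  # local differences
        approx = [(a + b) / s for a, b in pairs]         # local averages
    return approx[0], details


# A single spike touches only one coefficient per level.
spike = [0, 0, 0, 1, 0, 0, 0, 0]
approx, details = haar_dwt(spike)
nonzero_per_level = [sum(1 for d in level if d != 0) for level in details]
```

The loop halves the working array at every level, so the total work is n/2 + n/4 + ... < n operations. A spike is captured by one coefficient per level, while all other coefficients stay exactly zero.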
44Wavelets - intuition
- Q: baritone/silence/soprano - DWT?
45Wavelets - intuition
- Q: baritone/soprano - DWT?
46AWSOM
[Figure: the signal x_t]
47AWSOM
[Figure: the signal x_t]
48AWSOM - idea
$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \ldots$
(one such linear model for the wavelet coefficients $W_{l,t}$ of each level $l$)
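The idea on this slide, an auto-regression on the wavelet coefficients of each level, can be sketched in a few lines. This is a simplified illustration under stated assumptions (Haar wavelet, a single lag per level, batch fitting), not the AWSOM algorithm itself; all function names are hypothetical.

```python
def haar_details(signal):
    """Haar detail coefficients per level, finest level first."""
    s = 2 ** 0.5
    approx = [float(v) for v in signal]
    levels = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        levels.append([(a - b) / s for a, b in pairs])
        approx = [(a + b) / s for a, b in pairs]
    return levels


def fit_lag1(coeffs):
    """Least-squares beta for coeffs[t] ~ beta * coeffs[t-1]."""
    num = sum(prev * cur for prev, cur in zip(coeffs, coeffs[1:]))
    den = sum(prev * prev for prev in coeffs[:-1])
    return num / den if den else 0.0


def per_level_models(signal):
    """One lag-1 coefficient for every level with at least two coefficients."""
    return [fit_lag1(level) for level in haar_details(signal) if len(level) >= 2]


# Toy signal with period 4 (length must be a power of two).
signal = [3, 1, 0, 0] * 8
betas = per_level_models(signal)
```

On this toy signal the level-2 coefficients form a constant sequence, so the fitted coefficient at that level is 1: the period-4 pattern becomes trivially predictable at the right scale without anyone having to name the period.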
49Wavelets - example
- Q: weekly + daily periodicity - DWT?
[Figure: time-frequency plane (f vs. t)]
52More details
- Update of wavelet coefficients (incremental)
- Update of linear models (incremental RLS)
- Feature selection (single-pass)
- Not all correlations are significant
- Throw away the insignificant ones ("noise")
53Results - Synthetic data
[Figure: forecasts by AWSOM, AR and Seasonal AR]
- Triangle pulse
- Mix (sine + square)
- AR captures wrong trend (or none)
- Seasonal AR estimation fails
54Results - Real data
- Automobile traffic
- Daily periodicity
- Bursty noise at smaller scales
- AR fails to capture any trend
- Seasonal AR estimation fails
55Results - real data
- Sunspot intensity
- Slightly time-varying period
- AR captures wrong trend
- Seasonal ARIMA
- wrong downward trend, despite help by human!
56Complexity
- Model update
- Space: $O(\lg N + m k^2) \approx O(\lg N)$
- Time: $O(k^2) \approx O(1)$
- Where
- $N$: number of points (so far)
- $k$: number of regression coefficients (fixed)
- $m$: number of linear models, $O(\lg N)$
57Conclusions
- AWSOM: Automatic, 'hands-off' traffic modeling (first of its kind!)
58Outline
- Problem definition - motivation
- Linear forecasting
- AR
- AWSOM
- Coevolving series - MUSCLES
- Fractal forecasting - F4
- Other projects
- graph modeling, outliers etc
59Co-Evolving Time Sequences
- Given: a set of correlated time sequences
- Forecast: 'Repeated(t)'
[Figure: the correlated sequences over time; the value to forecast is marked "??"]
60Solution
61Solution
- Least Squares, with
- Dep. Variable: Repeated(t)
- Indep. Variables: Sent(t-1), ..., Sent(t-w); Lost(t-1), ..., Lost(t-w); Repeated(t-1), ...
- (named 'MUSCLES' [Yi00])
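A minimal batch sketch of this set-up: the regressors for Repeated(t) are the lagged values of every sequence, and the coefficients come from ordinary least squares. The helper names and the synthetic data are illustrative assumptions; per [Yi00], MUSCLES itself uses Recursive Least Squares rather than the batch solve shown here.

```python
def lagged_features(sequences, t, w):
    """Values at t-1 .. t-w of every sequence, concatenated into one row."""
    return [seq[t - lag] for seq in sequences for lag in range(1, w + 1)]


def least_squares(rows, targets):
    """Solve the normal equations by Gaussian elimination with pivoting."""
    k = len(rows[0])
    m = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(rows, targets))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, k):
            f = m[r][col] / m[col][col]
            for c in range(col, k + 1):
                m[r][c] -= f * m[col][c]
    coeffs = [0.0] * k
    for i in range(k - 1, -1, -1):
        tail = sum(m[i][j] * coeffs[j] for j in range(i + 1, k))
        coeffs[i] = (m[i][k] - tail) / m[i][i]
    return coeffs


# Synthetic streams: 'repeated' depends on the previous 'sent' and 'lost'.
n, w = 40, 1
sent = [float((3 * t) % 7) for t in range(n)]
lost = [float((t * t) % 5) for t in range(n)]
repeated = [0.0] + [0.5 * sent[t - 1] + 2.0 * lost[t - 1] for t in range(1, n)]

rows = [lagged_features([sent, lost, repeated], t, w) for t in range(w, n)]
targets = [repeated[t] for t in range(w, n)]
coeffs = least_squares(rows, targets)  # order: sent(t-1), lost(t-1), repeated(t-1)
```

On this synthetic data the fit assigns weights of about 0.5 and 2.0 to the lagged Sent and Lost values and about 0 to the lagged Repeated value, showing how correlated sequences can carry more information than the target's own past.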
62Examples - Experiments
- Datasets
- Modem pool traffic (14 modems, 1500 time-ticks; # packets per time unit)
- AT&T WorldNet internet usage (several data streams; 980 time-ticks)
- Measures of success
- Accuracy: Root Mean Square Error (RMSE)
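For reference, RMSE is the square root of the mean squared forecast error. The sketch below also includes a naive "yesterday" baseline that predicts each value as the previous one; that reading of the "yesterday" label on the accuracy slides is an assumption, and both function names are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def yesterday_forecast(series):
    """Naive baseline (assumed meaning): forecast for tick t is the value at t-1."""
    return series[:-1]


series = [3.0, 4.0, 6.0, 5.0]
# Forecasts exist only from the second tick on, so compare against series[1:].
baseline_error = rmse(series[1:], yesterday_forecast(series))
```

A forecasting method is only worth its cost if its RMSE is clearly below that of such a trivial baseline.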
63Accuracy - Modem
- MUSCLES outperforms AR and "yesterday"
64Accuracy - Internet
- MUSCLES consistently outperforms AR and "yesterday"
65Outline
- Problem definition - motivation
- Linear forecasting
- AR
- AWSOM
- Coevolving series - MUSCLES
- Fractal forecasting - F4
- Other projects
- graph modeling, outliers etc
66Detailed Outline
- Non-linear forecasting
- Problem
- Idea
- How-to
- Experiments
- Conclusions
67Recall: Problem 1
[Figure: value vs. time]
Given a time series $x_t$, predict its future course, that is, $x_{t+1}, x_{t+2}, \ldots$
68How to forecast?
- ARIMA - but: linearity assumption
- ANSWER: 'Delayed Coordinate Embedding' = Lag Plots [Sauer92]
69General Intuition (Lag Plot)
[Figure: lag plot of x_t vs. x_{t-1}; lag = 1, k = 4 nearest neighbors (NN)]
70Questions
- Q1: How to choose lag L?
- Q2: How to choose k (the # of NN)?
- Q3: How to interpolate?
- Q4: Why should this work at all?
71Q1: Choosing lag L
- Manually (16, in award winning system by [Sauer94])
- Our proposal: choose L such that the 'intrinsic dimension' in the lag plot stabilizes [Chakrabarti02]
72Fractal Dimensions
- FD = intrinsic dimensionality
[Figure: example with embedding dimensionality = 3 and intrinsic dimensionality = 1]
73Fractal Dimensions
- FD = intrinsic dimensionality
[Figure: log(# pairs) vs. log(r)]
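The log(# pairs) vs. log(r) plot suggests a simple estimator: count the pairs of points within distance r for several radii and take the slope in log-log scale. The brute-force O(n^2) counting and the function names below are illustrative assumptions, not the method used in [Chakrabarti02].

```python
import math


def pair_count(points, r):
    """Number of point pairs whose Euclidean distance is at most r."""
    count = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= r:
                count += 1
    return count


def intrinsic_dimension(points, radii):
    """Least-squares slope of log(pair count) against log(r).

    Every radius must capture at least one pair, otherwise log(0) fails.
    """
    xs = [math.log(r) for r in radii]
    ys = [math.log(pair_count(points, r)) for r in radii]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Points on a straight line embedded in 3-d: embedding dim 3, intrinsic dim 1.
line = [(i / 200, i / 200, i / 200) for i in range(200)]
fd = intrinsic_dimension(line, radii=[0.1, 0.2, 0.4])
```

For the line the slope comes out close to 1 even though the points live in three dimensions, which is the sense in which the fractal dimension measures intrinsic dimensionality.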
74Intuition
[Figure: lag plot of X(t) vs. X(t-1)]
75Intuition
[Figure: lag plot of x(t) vs. x(t-1)]
76Intuition
[Figure: fractal dimension vs. lag]
- The FD vs L plot does flatten out
- L(opt) = 1
77Proposed Method
- Use Fractal Dimensions to find the optimal lag length L(opt)
78Q2: Choosing number of neighbors k
- Manually (typically 1-10)
79Q3: How to interpolate?
How do we interpolate between the k nearest neighbors?
- A3.1: Average
- A3.2: Weighted average (weights drop with distance - how?)
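Both options can be sketched on top of a nearest-neighbor search in the lag plot. The inverse-distance weights below are just one possible answer to the "how?" and are an assumption, as are the function name and the toy series.

```python
def knn_forecast(series, lag, k, weighted=False):
    """Forecast the next value from the k nearest delay vectors.

    The query is the last `lag` values; every earlier window of `lag`
    values is a candidate, and its successor is that candidate's 'vote'.
    """
    query = series[-lag:]
    candidates = []
    for t in range(lag, len(series)):
        past = series[t - lag:t]
        dist = sum((a - b) ** 2 for a, b in zip(past, query)) ** 0.5
        candidates.append((dist, series[t]))
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:k]
    if not weighted:
        return sum(value for _, value in nearest) / len(nearest)  # A3.1
    # A3.2 with inverse-distance weights (assumed scheme); the small
    # constant avoids division by zero for exact matches.
    weights = [1.0 / (dist + 1e-9) for dist, _ in nearest]
    total = sum(wt * value for wt, (_, value) in zip(weights, nearest))
    return total / sum(weights)


# Toy periodic series 0,1,2,3,0,1,2,3,... ending in (2, 3); the next value is 0.
series = [0, 1, 2, 3] * 5
plain = knn_forecast(series, lag=2, k=3)
by_distance = knn_forecast(series, lag=2, k=3, weighted=True)
```

On the toy series every nearest neighbor of the query (2, 3) is followed by 0, so both variants forecast 0. On noisy data the two differ, and the weighting scheme becomes a tuning decision.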
80Q3: How to interpolate?
[Figure: lag plot of x_t vs. x_{t-1}]
81Q4: Any theory behind it?
82Theoretical foundation
- Based on the 'Takens Theorem' [Takens81]
- which says that long enough delay vectors can do prediction, even if there are unobserved variables in the dynamical system (= diff. equations)
83Theoretical foundation
Example: Lotka-Volterra equations
$dH/dt = rH - aHP$
$dP/dt = bHP - mP$
H is count of prey (e.g., hare); P is count of predators (e.g., lynx)
Suppose only P(t) is observed (t = 1, 2, ...).
84Theoretical foundation
- But the delay vector space is a faithful reconstruction of the internal system state
- So prediction in delay vector space is as good as prediction in state space
[Figure: delay vector space, P(t) vs. P(t-1)]
85Detailed Outline
- Non-linear forecasting
- Problem
- Idea
- How-to
- Experiments
- Conclusions
86Datasets
Logistic Parabola: $x_t = a x_{t-1} (1 - x_{t-1}) + \text{noise}$
Models population of flies [R. May/1976]
[Figure: lag-plot]
87Datasets
Logistic Parabola: $x_t = a x_{t-1} (1 - x_{t-1}) + \text{noise}$
Models population of flies [R. May/1976]
[Figure: lag-plot]
ARIMA: fails
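A small generator for this dataset makes the lag-plot structure easy to reproduce. The parameter value a = 3.8, the starting point, and the Gaussian noise model are assumptions chosen for illustration; the slides give no values.

```python
import random


def logistic_series(n, a=3.8, x0=0.5, noise_sd=0.0, seed=0):
    """Generate x_t = a * x_{t-1} * (1 - x_{t-1}) + noise for n ticks."""
    rng = random.Random(seed)
    xs = [x0]
    for _ in range(n - 1):
        prev = xs[-1]
        xs.append(a * prev * (1.0 - prev) + rng.gauss(0.0, noise_sd))
    return xs


def lag_plot_points(series, lag=1):
    """Pairs (x_{t-lag}, x_t), i.e. the points of a lag plot."""
    return list(zip(series[:-lag], series[lag:]))


series = logistic_series(100)
points = lag_plot_points(series)
```

Without noise every lag-plot point lies exactly on the parabola y = a x (1 - x). The relationship is deterministic but not linear, which is consistent with a linear model such as ARIMA failing on this data while a nearest-neighbor search in the lag plot can still work.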
88Logistic Parabola
[Figure: value vs. timesteps, annotated "Our Prediction from here"]
89Logistic Parabola
[Figure: value vs. timesteps; comparison of prediction to correct values]
90Datasets
LORENZ: Models convection currents in the air
$dx/dt = a (y - x)$
$dy/dt = x (b - z) - y$
$dz/dt = xy - cz$
[Figure: value vs. time]
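A minimal sketch for generating such a series numerically. The parameter values (a = 10, b = 28, c = 8/3), the step size, the starting point, and the use of a fourth-order Runge-Kutta integrator are all assumptions made for illustration; the slides specify none of them.

```python
def lorenz_deriv(state, a=10.0, b=28.0, c=8.0 / 3.0):
    """Right-hand side of the Lorenz equations for state (x, y, z)."""
    x, y, z = state
    return (a * (y - x), x * (b - z) - y, x * y - c * z)


def rk4_step(state, dt):
    """Advance the state by one classical Runge-Kutta step of size dt."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = lorenz_deriv(state)
    k2 = lorenz_deriv(shifted(state, k1, dt / 2))
    k3 = lorenz_deriv(shifted(state, k2, dt / 2))
    k4 = lorenz_deriv(shifted(state, k3, dt))
    return tuple(s + dt / 6 * (p + 2 * q + 2 * r + w)
                 for s, p, q, r, w in zip(state, k1, k2, k3, k4))


def lorenz_series(n, dt=0.01, start=(1.0, 1.0, 1.0)):
    """Return n samples of x(t) only, as if y and z were unobserved."""
    state, xs = start, []
    for _ in range(n):
        xs.append(state[0])
        state = rk4_step(state, dt)
    return xs


xs = lorenz_series(2000)
```

Only the x coordinate is recorded, mirroring the setting of the talk in which some variables of the dynamical system are never observed and the forecast has to work from delay vectors of the one observed series.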
91LORENZ
[Figure: value vs. timesteps; comparison of prediction to correct values]
92Datasets
- LASER: fluctuations in a Laser over time (used in Santa Fe competition)
[Figure: value vs. time]
93Laser
[Figure: value vs. timesteps; comparison of prediction to correct values]
94Conclusions
- Lag plots for non-linear forecasting (Takens' theorem)
- suitable for 'chaotic' signals
95Additional projects at CMU
- Graph/Network mining
- spatio-temporal mining - outliers
96Graph/network mining
- Internet, web, gnutella P2P networks
- Q: Any pattern?
- Q: How to generate realistic topologies?
- Q: How to define/verify realism?
97Patterns?
- avg degree is, say 3.3
- pick a node at random - what is the degree you expect it to have?
[Figure: count vs. degree, with the average (3.3) marked and the answer shown as "?"]
98Patterns?
- avg degree is, say 3.3
- pick a node at random - what is the degree you expect it to have?
- A: 1!!
[Figure: count vs. degree, with the average (3.3) marked]
100Patterns?
[Figure: log(count) vs. log(out-degree)]
101Other 'laws'?
[Figure panels: Count vs Indegree; Count vs Outdegree; Hop-plot; Stress; Network value; Eigenvalue vs Rank]
102RMAT, to generate realistic graphs
[Figure panels: Count vs Indegree; Count vs Outdegree; Hop-plot; Stress; Network value; Eigenvalue vs Rank]
103Epidemic threshold?
- on a real graph, will a (computer / biological) virus die out? (given
- beta: probability that an infected node will infect its neighbor and
- delta: probability that an infected node will recover)
[Figure: candidate answers NO / MAYBE / YES]
104Epidemic threshold?
- on a real graph, will a (computer / biological) virus die out? (given
- beta: probability that an infected node will infect its neighbor and
- delta: probability that an infected node will recover)
- A: depends on largest eigenvalue of adjacency matrix! [Wang03]
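The condition can be checked numerically. The sketch below assumes the threshold takes the form reported in [Wang03], namely that the infection dies out when beta/delta is below 1/lambda_1, where lambda_1 is the largest eigenvalue of the adjacency matrix. The power-iteration routine and the function names are illustrative choices.

```python
def largest_eigenvalue(adj, iterations=200):
    """Largest eigenvalue of a symmetric 0/1 adjacency matrix.

    Power iteration is run on (A + I): the shift keeps the iteration from
    oscillating on bipartite graphs, where -lambda_1 is also an eigenvalue.
    Assumes a connected, undirected graph given as a list of lists.
    """
    n = len(adj)
    v = [1.0] * n
    for _ in range(iterations):
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    av = [sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(av, v))  # Rayleigh quotient, |v| = 1


def dies_out(adj, beta, delta):
    """True if beta/delta is below the assumed threshold 1/lambda_1."""
    return beta / delta < 1.0 / largest_eigenvalue(adj)


# Star graph: node 0 is the hub, nodes 1..4 are leaves (lambda_1 = 2).
star = [[0, 1, 1, 1, 1],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0]]
lam = largest_eigenvalue(star)
```

For the star, `dies_out(star, beta=0.1, delta=0.5)` holds because 0.2 < 1/2, while `beta=0.4` with the same `delta` gives 0.8, which is above the threshold.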
105Additional projects
- Graph mining
- spatio-temporal mining - outliers
106Outliers - LOCI
107Outliers - LOCI
- finds outliers quickly,
- with no human intervention
108Conclusions
- AWSOM for automatic, linear forecasting
- MUSCLES for co-evolving sequences
- F4 for non-linear forecasting
- Graph/Network topology: power laws and generators; epidemic threshold
- LOCI for outlier detection
109Conclusions
- Overarching theme: automatic discovery of patterns (outliers/rules) in
- time sequences (sensors/streams)
- graphs (computer/social networks)
- multimedia (video, motion capture data etc)
- www.cs.cmu.edu/~christos
- christos_at_cs.cmu.edu
110Books
- William H. Press, Saul A. Teukolsky, William T. Vetterling and Brian P. Flannery: Numerical Recipes in C, Cambridge University Press, 1992, 2nd Edition. (Great description, intuition and code for DFT, DWT)
- C. Faloutsos: Searching Multimedia Databases by Content, Kluwer Academic Press, 1996 (introduction to DFT, DWT)
111Books
- George E.P. Box, Gwilym M. Jenkins and Gregory C. Reinsel: Time Series Analysis: Forecasting and Control, Prentice Hall, 1994 (the classic book on ARIMA, 3rd ed.)
- Brockwell, P. J. and R. A. Davis (1987). Time Series: Theory and Methods. New York, Springer Verlag.
112Resources: software and urls
- MUSCLES: Prof. Byoung-Kee Yi
- http://www.postech.ac.kr/bkyi/
- or christos_at_cs.cmu.edu
- AWSOM, LOCI: spapadim_at_cs.cmu.edu
- F4, RMAT: deepay_at_cs.cmu.edu
113Additional Reading
- [Chakrabarti02] Deepay Chakrabarti and Christos Faloutsos: F4: Large-Scale Automated Forecasting using Fractals. CIKM 2002, Washington DC, Nov. 2002.
- [Chen94] Chung-Min Chen, Nick Roussopoulos: Adaptive Selectivity Estimation Using Query Feedback. SIGMOD Conference 1994: 161-172
- [Gilbert01] Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan and Martin Strauss: Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. VLDB 2001
114Additional Reading
- Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos: Adaptive, Hands-Off Stream Mining. VLDB 2003, Berlin, Germany, Sept. 2003
- Spiros Papadimitriou, Hiroyuki Kitagawa, Phil Gibbons and Christos Faloutsos: LOCI: Fast Outlier Detection Using the Local Correlation Integral. ICDE 2003, Bangalore, India, March 5 - March 8, 2003.
- Sauer, T. (1994). Time series prediction using delay coordinate embedding. (in book by Weigend and Gershenfeld, below) Addison-Wesley.
115Additional Reading
- Takens, F. (1981). Detecting strange attractors in fluid turbulence. Dynamical Systems and Turbulence. Berlin: Springer-Verlag.
- Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos: Epidemic Spreading in Real Networks: An Eigenvalue Viewpoint. 22nd Symposium on Reliable Distributed Computing (SRDS2003), Florence, Italy, Oct. 6-8, 2003
116Additional Reading
- Weigend, A. S. and N. A. Gershenfeld (1994). Time Series Prediction: Forecasting the Future and Understanding the Past, Addison Wesley. (Excellent collection of papers on chaotic/non-linear forecasting, describing the algorithms behind the winners of the Santa Fe competition.)
- [Yi00] Byoung-Kee Yi et al.: Online Data Mining for Co-Evolving Time Sequences, ICDE 2000. (Describes MUSCLES and Recursive Least Squares)