Sensor data mining and forecasting - PowerPoint PPT Presentation

Slides: 113
Provided by: christosf
Learn more at: http://www.cs.cmu.edu
Transcript and Presenter's Notes


1
Sensor data mining and forecasting
  • Christos Faloutsos
  • CMU
  • christos_at_cs.cmu.edu

2
Outline
  • Problem definition - motivation
  • Linear forecasting - AR and AWSOM
  • Coevolving series - MUSCLES
  • Fractal forecasting - F4
  • Other projects
  • graph modeling, outliers etc

3
Problem definition
  • Given one or more sequences
  • $x_1, x_2, \ldots, x_t, \ldots$
  • ($y_1, y_2, \ldots, y_t, \ldots$)
  • Find
  • forecasts; patterns
  • clusters; outliers

4
Motivation - Applications
  • Financial, sales, economic series
  • Medical
  • ECGs, blood pressure, etc.: monitoring
  • reactions to new drugs
  • elderly care

5
Motivation - Applications (cont'd)
  • Smart house
  • sensors monitor temperature, humidity, air
    quality
  • video surveillance

6
Motivation - Applications (cont'd)
  • civil/automobile infrastructure
  • bridge vibrations [Oppenheim02]
  • road conditions / traffic monitoring

7
Stream Data: automobile traffic
[Figure: number of cars vs. time]
8
Motivation - Applications (cont'd)
  • Weather, environment/anti-pollution
  • volcano monitoring
  • air/water pollutant monitoring

9
Stream Data: Sunspots
[Figure: sunspots per month vs. time]
10
Motivation - Applications (cont'd)
  • Computer systems
  • Active Disks (buffering, prefetching)
  • web servers (ditto)
  • network traffic monitoring
  • ...

11
Stream Data: Disk accesses
[Figure: bytes vs. time]
12
Settings & Applications
  • One or more sensors, collecting time-series data

13
Settings & Applications
Each sensor collects data ($x_1, x_2, \ldots, x_t, \ldots$)
14
Settings & Applications
Sensors report to a central site
15
Settings & Applications
Problem 1: Finding patterns in a single time sequence
16
Settings & Applications
Problem 2: Finding patterns in many time sequences
17
Problem 1
  • Goal: given a signal (e.g., packets over time)
  • Find: patterns, periodicities, and/or compress

[Figure: count vs. year; lynx caught per year (or packets per day, temperature per day)]
18
Problem 1: Forecast
  • Given $x_t, x_{t-1}, \ldots$, forecast $x_{t+1}$

[Figure: number of packets sent vs. time tick; the next value is marked "??"]
19
Problem 2
  • Given: a set of correlated time sequences
  • Forecast: Sent(t)

20
Differences from DSP/Stat
  • Semi-infinite streams
  • we need on-line, any-time algorithms
  • Cannot afford human intervention
  • need automatic methods
  • sensors have limited memory / processing /
    transmitting power
  • need for (lossy) compression

21
Important observations
  • Patterns, rules, compression and forecasting are
    closely related
  • To do forecasting, we need
  • to find patterns/rules
  • good rules help us compress
  • to find outliers, we need to have forecasts
  • (outlier = too far away from our forecast)

22
Pictorial outline of the talk
23
Outline
  • Problem definition - motivation
  • Linear forecasting
  • AR
  • AWSOM
  • Coevolving series - MUSCLES
  • Fractal forecasting - F4
  • Other projects
  • graph modeling, outliers etc

24
Mini intro to A.R.
25
Forecasting
  • "Prediction is very difficult, especially about
    the future." - Niels Bohr
  • http://www.hfac.uh.edu/MediaFutures/thoughts.html

26
Problem 1: Forecast
  • Example: given $x_{t-1}, x_{t-2}, \ldots$, forecast $x_t$

[Figure: number of packets sent vs. time tick; the next value is marked "??"]
27
Forecasting: Preprocessing
  • MANUALLY:
  • remove trends; spot periodicities

[Figures: a series vs. time (trend); a series vs. time with a 7-day period marked]
28
Linear Regression: idea
[Figure: scatter plot of body height vs. body weight]
  • express what we don't know (= "dependent variable")
  • as a linear function of what we know (= "indep. variable(s)")

29
Linear Auto Regression
30
Problem 1: Forecast
  • Solution: try to express
  • $x_t$
  • as a linear function of the past: $x_{t-1}, x_{t-2}, \ldots$
  • (up to a window of w)
  • Formally: $x_t \approx a_1 x_{t-1} + \ldots + a_w x_{t-w} + \text{noise}$
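A minimal sketch of the linear auto-regression idea on this slide, assuming an ordinary least-squares fit over a window of w past values. The function names `fit_ar` and `forecast_next` and the toy series are illustrative assumptions, not part of the talk.

```python
import numpy as np

def fit_ar(x, w):
    """Least-squares fit of x[t] ~ a[0]*x[t-1] + ... + a[w-1]*x[t-w]."""
    x = np.asarray(x, dtype=float)
    # Row for time t holds the w most recent past values, newest first.
    X = np.column_stack([x[w - j:len(x) - j] for j in range(1, w + 1)])
    y = x[w:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

def forecast_next(x, a):
    """One-step-ahead forecast from the last len(a) values of x."""
    w = len(a)
    recent = np.asarray(x[-w:], dtype=float)[::-1]  # newest first
    return float(recent @ a)

# Toy series (made up) that follows x[t] = 0.5*x[t-1] + 0.3*x[t-2] exactly.
x = [1.0, 2.0]
for _ in range(30):
    x.append(0.5 * x[-1] + 0.3 * x[-2])

a = fit_ar(x, w=2)
print(a, forecast_next(x, a))
```

On this noise-free toy series the fitted coefficients should match the generating ones; on real data the fit only minimizes squared error over the training window.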

31
Linear Auto Regression
[Figure: lag-plot of number of packets sent at time t vs. number of packets sent at time t-1]
  • lag w = 1
  • Dependent variable = # of packets sent (S[t])
  • Independent variable = # of packets sent (S[t-1])

32
More details
  • Q1: Can it work with window w > 1?
  • A1: YES!

[Figure: lag plot with axes $x_t$, $x_{t-1}$, $x_{t-2}$]
33
More details
  • Q1: Can it work with window w > 1?
  • A1: YES! (we'll fit a hyper-plane, then!)

[Figure: lag plot with axes $x_t$, $x_{t-1}$, $x_{t-2}$]
35
Even more details
  • Q2: Can we estimate the coefficients a incrementally?
  • A2: Yes, with the brilliant, classic method of
    "Recursive Least Squares" (RLS) (see, e.g.,
    [Chen94] or [Yi00] for details)
  • Q3: Can we "down-weight" older samples?
  • A3: Yes (RLS does that easily!)
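A hedged sketch of Recursive Least Squares with an exponential forgetting factor, in its textbook form; it is not taken from [Chen94] or [Yi00]. The class name `RLS`, the parameters `lam` and `delta`, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

class RLS:
    """Recursive least squares with exponential forgetting (textbook form)."""

    def __init__(self, n, lam=1.0, delta=1e6):
        self.a = np.zeros(n)          # current coefficient estimate
        self.P = delta * np.eye(n)    # running inverse of the weighted X^T X
        self.lam = lam                # forgetting factor, 0 < lam <= 1

    def update(self, x, y):
        """Fold in one sample (x, y) without revisiting older samples."""
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)              # gain vector
        self.a = self.a + k * (y - x @ self.a)    # correct by the prediction error
        self.P = (self.P - np.outer(k, Px)) / self.lam
        return self.a

# Synthetic stream whose true coefficients change halfway through.
rng = np.random.default_rng(0)
model = RLS(n=2, lam=0.98)
for t in range(600):
    true_a = np.array([2.0, -1.0]) if t < 300 else np.array([-1.0, 0.5])
    x = rng.normal(size=2)
    model.update(x, float(x @ true_a))
estimate = model.a
print(estimate)
```

With `lam` below 1, older samples are down-weighted geometrically, so the estimate drifts toward the coefficients of the most recent regime; `lam = 1` weights all samples equally.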

36
Mini intro to A.R.
37
How to choose w?
  • goal: capture arbitrary periodicities
  • with NO human intervention
  • on a semi-infinite stream

38
Outline
  • Problem definition - motivation
  • Linear forecasting
  • AR
  • AWSOM
  • Coevolving series - MUSCLES
  • Fractal forecasting - F4
  • Other projects
  • graph modeling, outliers etc

39
Problem
  • in a train of spikes (128 ticks apart)
  • any AR with window w < 128 will fail
  • What to do, then?
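The failure described on this slide can be reproduced with a small experiment, sketched here under the assumption of a plain least-squares AR fit; the helper `ar_fit_predict` and the series length are illustrative choices.

```python
import numpy as np

def ar_fit_predict(x, w):
    """Fit AR(w) by least squares; return in-sample predictions and targets."""
    X = np.column_stack([x[w - j:len(x) - j] for j in range(1, w + 1)])
    y = x[w:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ a, y

period = 128
x = np.zeros(period * 8 + 1)
x[::period] = 1.0                     # one spike every 128 ticks

pred_short, y_short = ar_fit_predict(x, w=64)    # window shorter than the period
pred_full, y_full = ar_fit_predict(x, w=128)     # window reaches the previous spike

at_spikes_short = pred_short[y_short == 1.0]
at_spikes_full = pred_full[y_full == 1.0]
print(at_spikes_short.max())   # stays near 0: every spike is missed
print(at_spikes_full.min())    # near 1: every spike is predicted
```

With w = 64 the window never contains the previous spike at the moment a new one arrives, so the model has nothing to predict it from; with w = 128 the lag-128 value carries all the information.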

40
Answer (intuition)
  • Do a Wavelet transform (roughly, a short-window DFT)
  • look for patterns in every frequency

41
Intuition
  • Why NOT use the short window Fourier transform
    (SWFT)?
  • A: how short should the window be?

[Figure: frequency vs. time plane, with a fixed window of width w]
42
Wavelets
  • main idea: variable-length window!

[Figure: frequency (f) vs. time (t) plane]
43
Advantages of Wavelets
  • Better compression (better RMSE with same number
    of coefficients - used in JPEG-2000)
  • fast to compute (usually O(n)!)
  • very good for spikes
  • mammalian eye and ear: Gabor wavelets
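To make the "fast to compute" and "very good for spikes" points concrete, here is a minimal sketch that assumes the Haar wavelet (the slides do not say which wavelet family is used) and an input whose length is a power of two. The function name `haar_dwt` is an illustrative choice.

```python
import numpy as np

def haar_dwt(x):
    """Full Haar decomposition of a signal whose length is a power of two.

    Returns (approx, details); details[0] is the finest level.
    Total work is n/2 + n/4 + ... + 1, i.e. O(n).
    """
    approx = np.asarray(x, dtype=float)
    details = []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))   # local differences
        approx = (even + odd) / np.sqrt(2)          # local averages
    return approx, details

x = np.zeros(16)
x[5] = 1.0                                          # a single spike
approx, details = haar_dwt(x)

nonzero = int(np.count_nonzero(approx)) + sum(int(np.count_nonzero(d)) for d in details)
energy = float(approx @ approx) + sum(float(d @ d) for d in details)
print(nonzero, energy)
```

A single spike touches only one coefficient per level, so a 16-point spike needs 5 nonzero coefficients here, while its DFT would spread energy over every frequency.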

44
Wavelets - intuition
  • Q: baritone/silence/soprano - DWT?

45
Wavelets - intuition
  • Q: baritone/soprano - DWT?

46
AWSOM
[Figure: the series $x_t$]
48
AWSOM - idea
$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \ldots$
($W_{l,t}$: wavelet coefficient at level $l$ and time $t$)
49
Wavelets - example
  • Q: weekly + daily periodicity - DWT?

[Figure: frequency (f) vs. time (t) plane]
52
More details
  • Update of wavelet coefficients (incremental)
  • Update of linear models (incremental RLS)
  • Feature selection (single-pass)
  • Not all correlations are significant
  • Throw away the insignificant ones ("noise")
53
Results - Synthetic data
[Figure panels: AWSOM, AR, Seasonal AR]
  • Triangle pulse
  • Mix (sine + square)
  • AR captures wrong trend (or none)
  • Seasonal AR estimation fails

54
Results - Real data
  • Automobile traffic
  • Daily periodicity
  • Bursty noise at smaller scales
  • AR fails to capture any trend
  • Seasonal AR estimation fails

55
Results - real data
  • Sunspot intensity
  • Slightly time-varying period
  • AR captures wrong trend
  • Seasonal ARIMA
  • wrong downward trend, despite help by human!

56
Complexity
  • Model update
  • Space: $O(\lg N + m k^2) \approx O(\lg N)$
  • Time: $O(k^2) \approx O(1)$
  • Where
  • N: number of points (so far)
  • k: number of regression coefficients; fixed
  • m: number of linear models; $O(\lg N)$

57
Conclusions
  • AWSOM: automatic, hands-off traffic modeling
    (first of its kind!)

58
Outline
  • Problem definition - motivation
  • Linear forecasting
  • AR
  • AWSOM
  • Coevolving series - MUSCLES
  • Fractal forecasting - F4
  • Other projects
  • graph modeling, outliers etc

59
Co-Evolving Time Sequences
  • Given: a set of correlated time sequences
  • Forecast: Repeated(t)

[Figure: the co-evolving sequences, with the next value of Repeated marked "??"]
60
Solution
  • Q: what should we do?

61
Solution
  • Least Squares, with
  • Dep. Variable: Repeated(t)
  • Indep. Variables: Sent(t-1), ..., Sent(t-w);
    Lost(t-1), ..., Lost(t-w); Repeated(t-1), ...
  • (named: MUSCLES [Yi00])
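The regression set-up on this slide can be sketched as follows. This is a batch least-squares illustration only, not the MUSCLES implementation from [Yi00]; the helper names `lagged` and `fit_multi` and the synthetic Sent/Lost/Repeated data are made up for the example.

```python
import numpy as np

def lagged(series, w):
    """Columns series(t-1), ..., series(t-w) for t = w, ..., len(series)-1."""
    s = np.asarray(series, dtype=float)
    return np.column_stack([s[w - j:len(s) - j] for j in range(1, w + 1)])

def fit_multi(target, others, w):
    """Regress target(t) on the last w values of the target and of each other sequence."""
    X = np.hstack([lagged(s, w) for s in [target, *others]])
    y = np.asarray(target, dtype=float)[w:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, X, y

# Synthetic stand-ins for Sent, Lost and Repeated (made-up dependency plus noise).
rng = np.random.default_rng(0)
n = 300
sent = rng.uniform(50, 100, size=n)
lost = rng.uniform(0, 10, size=n)
repeated = np.zeros(n)
repeated[1:] = 0.1 * sent[:-1] + 0.8 * lost[:-1] + rng.normal(0, 0.01, size=n - 1)

coef, X, y = fit_multi(repeated, [sent, lost], w=2)
rmse = float(np.sqrt(np.mean((X @ coef - y) ** 2)))
print(X.shape, rmse)
```

In a streaming setting the batch `lstsq` call would be replaced by an incremental update such as Recursive Least Squares, so that each new time-tick costs a fixed amount of work.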

62
Examples - Experiments
  • Datasets
  • Modem pool traffic (14 modems, 1500 time-ticks;
    # packets per time unit)
  • AT&T WorldNet internet usage (several data
    streams; 980 time-ticks)
  • Measures of success
  • Accuracy: Root Mean Square Error (RMSE)

63
Accuracy - Modem
  • MUSCLES outperforms AR and "yesterday"

64
Accuracy - Internet
MUSCLES consistently outperforms AR and "yesterday"
65
Outline
  • Problem definition - motivation
  • Linear forecasting
  • AR
  • AWSOM
  • Coevolving series - MUSCLES
  • Fractal forecasting - F4
  • Other projects
  • graph modeling, outliers etc

66
Detailed Outline
  • Non-linear forecasting
  • Problem
  • Idea
  • How-to
  • Experiments
  • Conclusions

67
Recall: Problem 1
[Figure: value vs. time]
Given a time series $x_t$, predict its future
course, that is, $x_{t+1}, x_{t+2}, \ldots$
68
How to forecast?
  • ARIMA - but linearity assumption
  • ANSWER: "Delayed Coordinate Embedding" = Lag
    Plots [Sauer92]

69
General Intuition (Lag Plot)
Lag = 1, k = 4 NN
[Figure: lag plot, $x_t$ vs. $x_{t-1}$]
70
Questions
  • Q1: How to choose lag L?
  • Q2: How to choose k (the # of NN)?
  • Q3: How to interpolate?
  • Q4: Why should this work at all?

71
Q1: Choosing lag L
  • Manually (16, in award winning system by
    [Sauer94])
  • Our proposal: choose L such that the "intrinsic
    dimension" in the lag plot stabilizes
    [Chakrabarti02]

72
Fractal Dimensions
  • FD = intrinsic dimensionality

Embedding dimensionality = 3
Intrinsic dimensionality = 1
73
Fractal Dimensions
  • FD = intrinsic dimensionality

[Figure: log(# pairs) vs. log(r)]
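One common way to estimate the intrinsic dimensionality from such a plot is the correlation dimension: count the pairs of points within distance r and take the slope of log(# pairs) against log(r). The sketch below assumes that estimator (the slides do not say which one F4 uses); `correlation_dimension`, the radii, and the test set are illustrative.

```python
import numpy as np

def correlation_dimension(points, radii):
    """Slope of log(# pairs within distance r) vs. log(r), fitted over radii."""
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    iu = np.triu_indices(len(pts), k=1)          # count each unordered pair once
    d = dist[iu]
    counts = np.array([(d <= r).sum() for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return float(slope)

# Points on a straight line embedded in 3-d space:
# embedding dimensionality 3, intrinsic dimensionality 1.
t = np.linspace(0.0, 1.0, 500)
line = np.column_stack([t, 2 * t, -t])
radii = np.logspace(np.log10(0.05), np.log10(0.5), 10)
fd = correlation_dimension(line, radii)
print(fd)
```

The estimate comes out close to 1 rather than exactly 1, because points near the ends of the segment have fewer neighbors; the all-pairs distance matrix also makes this version quadratic in the number of points.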
74
Intuition
  • Its lag plot for lag = 1

[Figure: lag plot, X(t) vs. X(t-1)]
75
Intuition
[Figure: lag plot, x(t) vs. x(t-1)]
76
Intuition
  • The FD vs. L plot does flatten out
  • L(opt) = 1

[Figure: fractal dimension vs. lag]
77
Proposed Method
  • Use Fractal Dimensions to find the optimal lag
    length L(opt)

78
Q2 Choosing number of neighbors k
  • Manually (typically 1-10)

79
Q3: How to interpolate?
How do we interpolate between the k nearest
neighbors?
A3.1: Average
A3.2: Weighted average (weights drop with distance - how?)
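A hedged sketch of forecasting from the lag plot with k nearest neighbors, covering both interpolation options. The inverse-distance weights are only one possible answer to the "how?" on the slide, and `knn_forecast` and the sine-wave test series are assumptions for illustration.

```python
import numpy as np

def knn_forecast(x, L, k, weighted=False):
    """Forecast the value after x from the k nearest delay vectors of length L."""
    x = np.asarray(x, dtype=float)
    # Each delay vector (x[t-L], ..., x[t-1]) is paired with its successor x[t].
    vecs = np.array([x[t - L:t] for t in range(L, len(x))])
    succ = x[L:]
    query = x[-L:]                                  # the most recent delay vector
    dist = np.linalg.norm(vecs - query, axis=1)
    idx = np.argsort(dist)[:k]
    if not weighted:
        return float(succ[idx].mean())              # A3.1: plain average
    wts = 1.0 / (dist[idx] + 1e-12)                 # A3.2: assumed inverse-distance weights
    return float((wts * succ[idx]).sum() / wts.sum())

# Toy series: a sine wave sampled every 0.3 radians.
n = 500
x = np.sin(0.3 * np.arange(n))
truth = np.sin(0.3 * n)
plain = knn_forecast(x, L=2, k=3)
weighted = knn_forecast(x, L=2, k=3, weighted=True)
print(truth, plain, weighted)
```

Lag L = 2 is used here because a single past value of a sine wave does not tell whether the series is rising or falling; two consecutive values do.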
80
Q3: How to interpolate?
[Figure: lag plot, $x_t$ vs. $x_{t-1}$]
81
Q4: Any theory behind it?
  • A4: YES!

82
Theoretical foundation
  • Based on Takens' Theorem [Takens81]
  • which says that long enough delay vectors can do
    prediction, even if there are unobserved
    variables in the dynamical system (= diff.
    equations)

83
Theoretical foundation
Skip
Example: Lotka-Volterra equations
$dH/dt = rH - aHP$
$dP/dt = bHP - mP$
H is count of prey (e.g., hare); P is count of predators (e.g., lynx).
Suppose only P(t) is observed (t = 1, 2, ...).
84
Theoretical foundation
Skip
  • But the delay vector space is a faithful
    reconstruction of the internal system state
  • So prediction in delay vector space is as good as
    prediction in state space

[Figure: delay plot, P(t) vs. P(t-1)]
85
Detailed Outline
  • Non-linear forecasting
  • Problem
  • Idea
  • How-to
  • Experiments
  • Conclusions

86
Datasets
Logistic Parabola: $x_t = a x_{t-1}(1 - x_{t-1}) + \text{noise}$
Models population of flies [R. May/1976]
[Figure: lag-plot]
87
Datasets
Logistic Parabola: $x_t = a x_{t-1}(1 - x_{t-1}) + \text{noise}$
Models population of flies [R. May/1976]
[Figure: lag-plot]
ARIMA fails
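A small sketch of why a linear model struggles on this dataset while the lag plot does not. It assumes a = 3.8 and omits the noise term, and it uses a plain lag-1 linear fit as a stand-in for the linear family; it does not run ARIMA itself.

```python
import numpy as np

a = 3.8                       # assumed parameter value, in the chaotic regime
x = np.empty(1000)
x[0] = 0.3
for t in range(1, len(x)):
    x[t] = a * x[t - 1] * (1.0 - x[t - 1])    # noise term omitted for clarity

prev, curr = x[:-1], x[1:]    # the lag plot: curr against prev

# Best straight line through the lag plot (all that a lag-1 linear model can express).
lin = np.polyfit(prev, curr, 1)
lin_rmse = float(np.sqrt(np.mean((np.polyval(lin, prev) - curr) ** 2)))

# A quadratic fit recovers the parabola that the lag plot traces out.
quad = np.polyfit(prev, curr, 2)
quad_rmse = float(np.sqrt(np.mean((np.polyval(quad, prev) - curr) ** 2)))
print(lin_rmse, quad_rmse, quad)
```

The quadratic fit is shown only to confirm the shape of the lag plot; a nearest-neighbor forecaster needs no such assumption about the functional form.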
88
Logistic Parabola
[Figure: value vs. timesteps, with the point marked "Our prediction from here"]
89
Logistic Parabola
[Figure: value vs. timesteps; comparison of prediction to correct values]
90
Datasets
[Figure: Lorenz time series (y-axis: value)]
LORENZ: models convection currents in the air
$dx/dt = a(y - x)$
$dy/dt = x(b - z) - y$
$dz/dt = xy - cz$
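A series like this one can be generated by integrating the three equations numerically. The sketch below assumes the commonly used parameter values a = 10, b = 28, c = 8/3 and a fixed-step fourth-order Runge-Kutta integrator; the slides state neither, and the names `lorenz_rhs` and `simulate` are illustrative.

```python
import numpy as np

def lorenz_rhs(s, a=10.0, b=28.0, c=8.0 / 3.0):
    """Right-hand side of the Lorenz equations as written on the slide."""
    x, y, z = s
    return np.array([a * (y - x), x * (b - z) - y, x * y - c * z])

def simulate(s0, dt=0.01, steps=5000):
    """Integrate with the classical fourth-order Runge-Kutta scheme."""
    traj = np.empty((steps + 1, 3))
    s = np.asarray(s0, dtype=float)
    traj[0] = s
    for i in range(steps):
        k1 = lorenz_rhs(s)
        k2 = lorenz_rhs(s + 0.5 * dt * k1)
        k3 = lorenz_rhs(s + 0.5 * dt * k2)
        k4 = lorenz_rhs(s + dt * k3)
        s = s + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        traj[i + 1] = s
    return traj

traj = simulate([1.0, 1.0, 1.0])
series = traj[:, 0]           # keep only x(t), as a scalar time series
print(series[:5])
```

Keeping only one coordinate mirrors the forecasting setting: the other two state variables are unobserved, and the delay vectors of the observed series stand in for them.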
91
LORENZ
[Figure: value vs. timesteps; comparison of prediction to correct values]
92
Datasets
  • LASER: fluctuations in a Laser over time (used in
    Santa Fe competition)

[Figure: value vs. time]
93
Laser
[Figure: value vs. timesteps; comparison of prediction to correct values]
94
Conclusions
  • Lag plots for non-linear forecasting (Takens
    theorem)
  • suitable for chaotic signals

95
Additional projects at CMU
  • Graph/Network mining
  • spatio-temporal mining - outliers

96
Graph/network mining
  • Internet; web; gnutella P2P networks
  • Q: Any pattern?
  • Q: how to generate realistic topologies?
  • Q: how to define/verify realism?

97
Patterns?
  • avg degree is, say 3.3
  • pick a node at random - what is the degree you
    expect it to have?

[Figure: count vs. degree; avg: 3.3; shape of the distribution: "?"]
98
Patterns?
  • avg degree is, say 3.3
  • pick a node at random - what is the degree you
    expect it to have?
  • A: 1!!

[Figure: count vs. degree; avg: 3.3]
100
Patterns?
  • A: Power laws!

[Figure: log(count) vs. log(out-degree)]
101
Other laws?
[Figure panels: count vs. in-degree; count vs. out-degree; hop-plot; stress; "network value"; eigenvalue vs. rank]
102
RMAT, to generate realistic graphs
[Figure panels: count vs. in-degree; count vs. out-degree; hop-plot; stress; "network value"; eigenvalue vs. rank]
103
Epidemic threshold?
  • on a real graph, will a (computer / biological)
    virus die out? (given
  • beta: probability that an infected node will
    infect its neighbor, and
  • delta: probability that an infected node will
    recover)

[Answer options shown on the slide: NO / MAYBE / YES]
104
Epidemic threshold?
  • on a real graph, will a (computer / biological)
    virus die out? (given
  • beta: probability that an infected node will
    infect its neighbor, and
  • delta: probability that an infected node will
    recover)
  • A: depends on largest eigenvalue of adjacency
    matrix! [Wang03]
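As a sketch of how this criterion can be checked numerically: the slide only says that the answer depends on the largest eigenvalue; the specific condition used below, beta/delta < 1/lambda_1, is the threshold reported in [Wang03] as this sketch understands it. The function names and the star-graph example are illustrative assumptions.

```python
import numpy as np

def largest_eigenvalue(adj):
    """Largest eigenvalue of a symmetric (undirected) adjacency matrix."""
    # eigvalsh returns eigenvalues in ascending order.
    return float(np.linalg.eigvalsh(np.asarray(adj, dtype=float))[-1])

def below_threshold(adj, beta, delta):
    """True if beta/delta < 1/lambda_1, i.e. the infection is expected to die out
    under the condition assumed above."""
    return beta / delta < 1.0 / largest_eigenvalue(adj)

# Star graph: node 0 connected to 9 leaves, so lambda_1 = sqrt(9) = 3.
n = 10
star = np.zeros((n, n))
star[0, 1:] = 1
star[1:, 0] = 1

lam = largest_eigenvalue(star)
print(lam, below_threshold(star, beta=0.1, delta=0.5), below_threshold(star, beta=0.2, delta=0.5))
```

The average degree of this star is only 1.8, yet the hub pushes the largest eigenvalue to 3, which is why the eigenvalue rather than the average degree sets the threshold.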

105
Additional projects
  • Graph mining
  • spatio-temporal mining - outliers

106
Outliers - LOCI
107
Outliers - LOCI
  • finds outliers quickly,
  • with no human intervention

108
Conclusions
  • AWSOM for automatic, linear forecasting
  • MUSCLES for co-evolving sequences
  • F4 for non-linear forecasting
  • Graph/Network topology: power laws and
    generators; epidemic threshold
  • LOCI for outlier detection

109
Conclusions
  • Overarching theme: automatic discovery of
    patterns (outliers/rules) in
  • time sequences (sensors/streams)
  • graphs (computer/social networks)
  • multimedia (video, motion capture data etc)
  • www.cs.cmu.edu/christos
  • christos_at_cs.cmu.edu

110
Books
  • William H. Press, Saul A. Teukolsky, William T.
    Vetterling and Brian P. Flannery: Numerical
    Recipes in C, Cambridge University Press, 1992,
    2nd Edition. (Great description, intuition and
    code for DFT, DWT)
  • C. Faloutsos: Searching Multimedia Databases by
    Content, Kluwer Academic Press, 1996
    (introduction to DFT, DWT)

111
Books
  • George E.P. Box, Gwilym M. Jenkins and Gregory
    C. Reinsel: Time Series Analysis: Forecasting and
    Control, Prentice Hall, 1994 (the classic book on
    ARIMA, 3rd ed.)
  • Brockwell, P. J. and R. A. Davis (1987). Time
    Series: Theory and Methods. New York, Springer
    Verlag.

112
Resources: software and urls
  • MUSCLES: Prof. Byoung-Kee Yi
  • http://www.postech.ac.kr/bkyi/
  • or christos_at_cs.cmu.edu
  • AWSOM & LOCI: spapadim_at_cs.cmu.edu
  • F4, RMAT: deepay_at_cs.cmu.edu

113
Additional Reading
  • [Chakrabarti02] Deepay Chakrabarti and Christos
    Faloutsos: F4: Large-Scale Automated Forecasting
    using Fractals. CIKM 2002, Washington DC, Nov.
    2002.
  • [Chen94] Chung-Min Chen, Nick Roussopoulos:
    Adaptive Selectivity Estimation Using Query
    Feedback. SIGMOD Conference 1994: 161-172
  • [Gilbert01] Anna C. Gilbert, Yannis Kotidis,
    S. Muthukrishnan and Martin Strauss: Surfing
    Wavelets on Streams: One-Pass Summaries for
    Approximate Aggregate Queries. VLDB 2001

114
Additional Reading
  • Spiros Papadimitriou, Anthony Brockwell and
    Christos Faloutsos: Adaptive, Hands-Off Stream
    Mining. VLDB 2003, Berlin, Germany, Sept. 2003
  • Spiros Papadimitriou, Hiroyuki Kitagawa, Phil
    Gibbons and Christos Faloutsos: LOCI: Fast Outlier
    Detection Using the Local Correlation Integral.
    ICDE 2003, Bangalore, India, March 5 - March 8,
    2003.
  • [Sauer94] Sauer, T. (1994). Time series prediction
    using delay coordinate embedding. (in book by Weigend
    and Gershenfeld, below) Addison-Wesley.

115
Additional Reading
  • [Takens81] Takens, F. (1981). Detecting strange
    attractors in fluid turbulence. Dynamical Systems and
    Turbulence. Berlin: Springer-Verlag.
  • [Wang03] Yang Wang, Deepayan Chakrabarti, Chenxi Wang
    and Christos Faloutsos: Epidemic Spreading in Real
    Networks: An Eigenvalue Viewpoint. 22nd Symposium
    on Reliable Distributed Computing (SRDS2003),
    Florence, Italy, Oct. 6-8, 2003

116
Additional Reading
  • Weigend, A. S. and N. A. Gershenfeld (1994).
    Time Series Prediction: Forecasting the Future
    and Understanding the Past, Addison Wesley.
    (Excellent collection of papers on
    chaotic/non-linear forecasting, describing the
    algorithms behind the winners of the Santa Fe
    competition.)
  • [Yi00] Byoung-Kee Yi et al.: Online Data Mining
    for Co-Evolving Time Sequences, ICDE 2000.
    (Describes MUSCLES and Recursive Least Squares)