Title: Data Mining on Streams
1. Data Mining on Streams
2. THANK YOU!
- Prof. Panos Ipeirotis
- Julia Mills
3. Joint work with
- Spiros Papadimitriou (CMU -> IBM)
- Jimeng Sun (CMU/CS)
- Anthony Brockwell (CMU/Stat)
- Jeanne Vanbriesen (CMU/CivEng)
- Greg Ganger (CMU/ECE)
4. Outline
- Problem and motivation
- Single-sequence mining: AWSOM
- Co-evolving sequences: SPIRIT
- Lag correlations: BRAID
- Conclusions
5. Problem definition - example
Each sensor collects data (x1, x2, ..., xt, ...)
6. Problem definition
- Given one or more sequences
- x1, x2, ..., xt, ...
- y1, y2, ..., yt, ...
- Find
- patterns, correlations, outliers
- incrementally!
7. Limitations / Challenges
- Find patterns using a method that is
- nimble: limited resources
- Memory
- Bandwidth, power, CPU
- incremental: on-line, any-time response
- single pass (you get to see it only once)
- automatic: no human intervention
- e.g., in remote environments
8. Application domains
- Sensor devices
- Temperature, weather measurements
- Road traffic data
- Geological observations
- Patient physiological data
- Embedded devices
- Network routers
- Intelligent (active) disks
9. Motivation - Applications (cont'd)
- Smart house
- sensors monitor temperature, humidity, air quality
- video surveillance
10. Motivation - Applications (cont'd)
- civil/automobile infrastructure
- bridge vibrations [Oppenheim '02]
- road conditions / traffic monitoring
11. Motivation - Applications (cont'd)
- Weather, environment/anti-pollution
- volcano monitoring
- air/water pollutant monitoring
12. Motivation - Applications (cont'd)
- Computer systems
- Active Disks (buffering, prefetching)
- web servers (ditto)
- network traffic monitoring
- ...
13. InteMon - w/ Evan Hoke, Jimeng Sun
self-* PetaByte data center at CMU
14. Outline
- Problem and motivation
- Single-sequence mining: AWSOM
- Co-evolving sequences: SPIRIT
- Lag correlations: BRAID
- Conclusions
15. Single-sequence mining - AWSOM
- with Spiros Papadimitriou (CMU -> IBM)
- Anthony Brockwell (CMU/Stat)
16. Problem definition
- Semi-infinite streams of values (time series): x1, x2, ..., xt, ...
- Find patterns, forecasts, outliers
[Figure: example stream; annotations: Periodicity? (twice daily), Periodicity? (daily)]
17. Requirements / Goals
- Adapt to and handle arbitrary periodic components, and be:
- nimble (limited resources, single pass)
- on-line, any-time
- automatic (no human intervention/tuning)
18. Overview
- Introduction / Related work
- Background
- Main idea
- Experimental results
19. Wavelets - Example: Haar transform
[Figure: Haar wavelet coefficients tiling the time-frequency plane; the coarsest level captures the constant (average)]
20. Wavelets - Why we like them
- Wavelets compress many real signals well
- Image compression and processing
- Vision
- Astronomy, seismology, ...
- Wavelet coefficients can be updated as new points arrive (see the sketch below)
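For concreteness, here is a minimal Python sketch of such an incremental update (illustrative only, not the AWSOM implementation; the class and variable names are assumptions). As each point arrives, only the one value per level still waiting for its sibling is kept, so the state is O(log N):

```python
# Minimal sketch: maintaining Haar wavelet coefficients incrementally,
# one new point at a time. Each level keeps only the value waiting for
# its pair, so total state is O(log N).

import math

class IncrementalHaar:
    def __init__(self):
        self.pending = {}   # level -> (value, time) waiting for its sibling
        self.coeffs = []    # (level, time index, detail coefficient)

    def update(self, x, level=0, t=0):
        """Push a new smooth value x at `level`; cascade upward when a pair completes."""
        if level not in self.pending:
            self.pending[level] = (x, t)
            return
        prev, t_prev = self.pending.pop(level)
        avg = (prev + x) / math.sqrt(2)      # smooth (scaling) coefficient
        diff = (prev - x) / math.sqrt(2)     # detail (wavelet) coefficient
        self.coeffs.append((level, t_prev, diff))
        self.update(avg, level + 1, t_prev)  # recurse to the next coarser level

stream = [2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0]
haar = IncrementalHaar()
for i, v in enumerate(stream):
    haar.update(v, level=0, t=i)
print(haar.coeffs)
```

Each completed pair emits one detail coefficient and pushes a smooth value to the next coarser level, which is what makes single-pass operation possible.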
21. Overview
- Introduction / Related work
- Background
- Main idea
- Experimental results
22. AWSOM
[Figure: the input stream xt]
23. AWSOM
[Figure: the input stream xt]
24. AWSOM - idea
W_{l,t} ≈ β_{l,1}·W_{l,t-1} + β_{l,2}·W_{l,t-2} + ...
(fit one auto-regressive model per wavelet level l, on that level's coefficients)
25. More details
- Update of wavelet coefficients (incremental)
- Update of linear models (incremental RLS)
- Feature selection (single-pass)
- Not all correlations are significant
- Throw away the insignificant ones ("noise")
(a sketch of the RLS update follows below)
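A minimal sketch of what an incremental RLS update per wavelet level could look like (illustrative; the forgetting factor `lam`, the initialization constant `delta`, and the class name are assumptions, not the authors' code). Each new coefficient W_{l,t} is regressed on the k previous coefficients of the same level, in O(k^2) time and space per update:

```python
# Recursive least squares with forgetting, applied per wavelet level:
# regress the newest coefficient on its k predecessors.

import numpy as np

class RLSModel:
    def __init__(self, k, lam=0.99, delta=100.0):
        self.k = k                      # number of regression coefficients
        self.lam = lam                  # forgetting factor (1.0 = no forgetting)
        self.beta = np.zeros(k)         # AR coefficients for this level
        self.P = delta * np.eye(k)      # inverse correlation matrix estimate
        self.history = []               # last k coefficients at this level

    def update(self, w_new):
        if len(self.history) == self.k:
            x = np.array(self.history[::-1])               # most recent first
            err = w_new - self.beta @ x                    # prediction error
            g = self.P @ x / (self.lam + x @ self.P @ x)   # gain vector
            self.beta += g * err                           # update coefficients
            self.P = (self.P - np.outer(g, x @ self.P)) / self.lam
        self.history.append(w_new)
        if len(self.history) > self.k:
            self.history.pop(0)

    def predict(self):
        if len(self.history) < self.k:
            return 0.0                                     # not enough history yet
        x = np.array(self.history[::-1])
        return float(self.beta @ x)
```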
26. Complexity
- Model update:
- Space: O(lg N + m·k^2) = O(lg N)
- Time: O(k^2) = O(1)
- where
- N = number of points (so far)
- k = number of regression coefficients (fixed)
- m = number of linear models = O(lg N)
27. Overview
- Introduction / Related work
- Background
- Main idea
- Experimental results
28. Results - Synthetic data
[Plots: forecasts by AWSOM, AR, and Seasonal AR]
- Triangle pulse
- Mix (sine + square)
- AR captures the wrong trend (or none)
- Seasonal AR estimation fails
29. Results - Real data
- Automobile traffic
- Daily periodicity
- Bursty noise at smaller scales
- AR fails to capture any trend
- Seasonal AR estimation fails
30. Results - Real data
- Sunspot intensity
- Slightly time-varying period
- AR captures the wrong trend
- Seasonal ARIMA:
- wrong downward trend, despite human help!
31. Conclusions
- Adapt to and handle arbitrary periodic components, and be:
- nimble
- Limited memory (logarithmic)
- Constant-time update
- on-line, any-time
- Single pass over the data
- automatic: no human intervention/tuning
32. Outline
- Problem and motivation
- Single-sequence mining: AWSOM
- Co-evolving sequences: SPIRIT
- Lag correlations: BRAID
- Conclusions
33. Part 2
- SPIRIT: Mining co-evolving streams
- Papadimitriou, Sun, Faloutsos, VLDB'05
34. Motivation
- E.g., chlorine concentration in a water distribution network
35. Motivation
[Figure: water distribution network, normal operation]
We may have hundreds of measurements, but it is unlikely they are completely unrelated!
36. Motivation
[Figure: chlorine concentrations in the water distribution network - sensors near the leak vs. sensors away from the leak; normal operation, then a major leak]
37. Motivation
[Figure: same as the previous slide]
38. Motivation
[Figure: actual measurements (n streams) and k hidden variable(s)]
We would like to discover a few hidden (latent) variables that summarize the key trends.
39. Motivation
[Figure: chlorine concentrations across Phase 1 and Phase 2; actual measurements (n streams) and k = 2 hidden variables]
We would like to discover a few hidden (latent) variables that summarize the key trends.
40. Motivation
[Figure: chlorine concentrations across Phases 1-3; actual measurements (n streams) and k = 1 hidden variable]
We would like to discover a few hidden (latent) variables that summarize the key trends.
41. Goals
- Discover hidden (latent) variables, for:
- Summarization of main trends for users
- Efficient forecasting, spotting outliers/anomalies
- and the usual:
- nimble: limited memory requirements
- on-line, any-time (single pass, etc.)
- automatic: no special parameters to tune
42. Related work - Stream mining
- Stream SVD: Guha, Gunopulos, Koudas / KDD'03
- StatStream: Zhu, Shasha / VLDB'02
- Clustering:
- Aggarwal, Han, Yu / VLDB'03; Guha, Meyerson, et al. / TKDE
- Lin, Vlachos, Keogh, Gunopulos / EDBT'04
- Classification:
- Wang, Fan, et al. / KDD'03
- Hulten, Spencer, Domingos / KDD'01
43. Related work - Stream mining
- Piecewise approximations:
- Palpanas, Vlachos, Keogh, et al. / ICDE 2004
- Queries on streams:
- Dobra, Garofalakis, Gehrke, et al. / SIGMOD'02
- Madden, Franklin, Hellerstein, et al. / OSDI'02
- Considine, Li, Kollios, et al. / ICDE'04
- Hammad, Aref, Elmagarmid / SSDBM'03
44. Overview - Part 2
- Method
- Experiments
- Conclusions and other work
45. Stream correlations
- Step 1: How to capture correlations?
- Step 2: How to do it incrementally, when we have a very large number of points?
- Step 3: How to dynamically adjust the number of hidden variables?
46. Step 1: How to capture correlations?
[Figure: first sensor - temperature t1, ranging from 20°C to 30°C]
47. Step 1: How to capture correlations?
- First sensor
- Second sensor
[Figure: second sensor - temperature t2, ranging from 20°C to 30°C]
48. Step 1: How to capture correlations?
Let's take a closer look at the first three value-pairs.
[Figure: temperature t2 vs. temperature t1, both ranging from 20°C to 30°C]
49. Step 1: How to capture correlations?
- The first three points lie (almost) on a line in the space of value-pairs
- The offset along the line is the hidden variable
- We need O(n) numbers for the slope, and one number for each value-pair (its offset on the line)
[Figure: temperature t2 vs. temperature t1, both ranging from 20°C to 30°C]
50. Step 1: How to capture correlations?
- The other pairs also follow the same pattern: they lie (approximately) on this line
[Figure: temperature t2 vs. temperature t1, both ranging from 20°C to 30°C]
51. Stream correlations
- Step 1: How to capture correlations?
- Step 2: How to do it incrementally, when we have a very large number of points?
- Step 3: How to dynamically adjust the number of hidden variables?
52. Incremental updates
53. Incremental updates
- The algorithm runs in O(n), where n = number of streams
- No need to access old data
[Figure: a new point and its projection error onto the current line; temperature T1 ranging from 20°C to 30°C]
54. Stream correlations - Principal Component Analysis (PCA)
- The line is the first principal component (PC)
- This line is optimal: it minimizes the sum of squared projection errors (see the sketch below)
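As an illustration (not from the slides), here is the batch version of this computation in Python: the optimal line is the top right singular vector of the centered data matrix, and the hidden variable is each value-pair's offset along that line. The synthetic sensor data is made up for the example:

```python
# Illustrative sketch: the optimal line through the value-pairs is the first
# principal component, i.e., the top right singular vector of the centered
# data matrix. This is the batch computation that SPIRIT approximates online.

import numpy as np

rng = np.random.default_rng(0)
t1 = 20 + 10 * rng.random(200)                 # first sensor, 20-30 degrees C
t2 = 0.9 * t1 + 2 + rng.normal(0, 0.3, 200)    # second sensor, correlated

X = np.column_stack([t1, t2])
Xc = X - X.mean(axis=0)                        # center the data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
w1 = Vt[0]                                     # first PC: direction of the line

y = Xc @ w1                                    # hidden variable: offset on the line
reconstruction = X.mean(axis=0) + np.outer(y, w1)
error = np.mean((X - reconstruction) ** 2)
print("first PC:", w1, " mean squared projection error:", error)
```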
55. Step 2: Incremental update, given the number of hidden variables k
- Assuming k is known
- We know how to update the slope
- For each new point x, and for i = 1, ..., k:
- y_i := w_i^T x (projection onto w_i)
- d_i ← λ·d_i + y_i^2 (energy, proportional to the i-th eigenvalue)
- e_i := x - y_i·w_i (error)
- w_i ← w_i + (1/d_i)·y_i·e_i (update estimate)
- x ← x - y_i·w_i (repeat with the remainder)
(a code sketch of this update follows below)
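A minimal Python sketch of this update rule. The class name, the forgetting factor `lam`, the initial energies, and the re-normalization of w_i are assumptions added for a self-contained example; they are not spelled out on the slide:

```python
# SPIRIT-style incremental PCA update, following the steps listed above.

import numpy as np

class SpiritTracker:
    def __init__(self, n, k, lam=0.96):
        self.k = k
        self.lam = lam
        self.w = [np.eye(n)[i].astype(float) for i in range(k)]  # initial PC estimates
        self.d = [1e-3] * k                                      # energy per component

    def update(self, x):
        x = x.astype(float).copy()
        y = np.zeros(self.k)
        for i in range(self.k):
            y[i] = self.w[i] @ x                          # y_i = w_i^T x (projection)
            self.d[i] = self.lam * self.d[i] + y[i] ** 2  # energy ~ i-th eigenvalue
            e = x - y[i] * self.w[i]                      # error
            self.w[i] += (y[i] / self.d[i]) * e           # update estimate of w_i
            self.w[i] /= np.linalg.norm(self.w[i])        # keep unit length (assumption)
            x = x - y[i] * self.w[i]                      # deflate, repeat with remainder
        return y                                          # k hidden variables for this tick

# usage: feed one n-dimensional measurement vector per time tick
tracker = SpiritTracker(n=3, k=1)
for t in range(100):
    base = np.sin(2 * np.pi * t / 24)
    xt = np.array([base, 0.8 * base, 1.1 * base]) + 0.01 * np.random.randn(3)
    hidden = tracker.update(xt)
```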
56. Stream correlations
- Step 1: How to capture correlations?
- Step 2: How to do it incrementally, when we have a very large number of points?
- Step 3: How to dynamically adjust k, the number of hidden variables?
57. Answer
- When the reconstruction accuracy is too low (say, < 95%),
- then introduce another hidden variable (increase k)
- How to initialize its values is tricky
58. Missing values
[Figure: temperature T2 vs. temperature T1 (20°C-30°C). Given only t1, all possible value-pairs form a vertical line; the best guess (given the correlations) is its intersection with the correlation line, which is close to the true value-pair.]
(a sketch of this estimate follows below)
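A small illustrative sketch of that "intersection" estimate. The numbers for `mean` and `w` are made up for the example; in practice they would come from the incremental PCA above:

```python
# Estimate a missing value from the learned principal direction: given only t1,
# pick the point on the correlation line whose first coordinate matches t1.

import numpy as np

mean = np.array([25.0, 24.0])     # running mean of (t1, t2) -- assumed known
w = np.array([0.74, 0.67])        # first principal direction -- assumed learned
w = w / np.linalg.norm(w)

def estimate_t2(t1_observed):
    # move along the line mean + y*w until the first coordinate equals t1_observed
    y = (t1_observed - mean[0]) / w[0]
    return mean[1] + y * w[1]

print(estimate_t2(27.0))          # best guess for the missing t2
```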
59. Forecasting
- Assume we want to forecast the next value for a particular stream (e.g., by auto-regression)
[Figure: n streams]
60. Forecasting
- Option 1: One complex model per stream
- Next value = function of previous values on all streams
- Captures correlations
- Too costly! O(n^3)
[Figure: n streams]
61. Forecasting
- Option 1: One complex model per stream
- Option 2: One simple model per stream
- Next value = function of previous value on the same stream
- Worse accuracy, but maybe acceptable
- But we still need n models
[Figure: n streams]
62. Forecasting
- Only k simple models, on the hidden variables
- Efficiency + robustness
- k << n, and the hidden variables already capture the correlations
(a sketch follows below)
[Figure: n streams and k hidden variables]
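A sketch of how the k hidden-variable forecasts could be mapped back to the n streams. This is illustrative: the helper names, the toy AR(1) forecaster, and the example numbers are assumptions, not SPIRIT's prescribed forecaster:

```python
# Forecast each hidden variable with its own simple model, then map back to
# the n streams through the principal directions w_i (only k models needed).

import numpy as np

def ar1_forecast(y_history, phi=0.9):
    """Toy AR(1) forecast of one hidden variable; phi would be fit incrementally."""
    return phi * y_history[-1]

def forecast_streams(w_list, y_forecasts):
    """Combine k hidden-variable forecasts into a forecast for all n streams."""
    n = len(w_list[0])
    x_hat = np.zeros(n)
    for w_i, y_i in zip(w_list, y_forecasts):
        x_hat += y_i * np.asarray(w_i)
    return x_hat

# e.g., with k = 2 hidden variables over n = 3 streams:
w_list = [np.array([0.6, 0.6, 0.5]), np.array([0.7, -0.7, 0.0])]
y_hist = [np.array([1.0, 0.9, 1.1]), np.array([0.2, 0.1, 0.15])]
y_next = [ar1_forecast(h) for h in y_hist]
print(forecast_streams(w_list, y_next))      # forecast for the 3 streams
```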
63. Time/space requirements - Incremental PCA
- O(nk) space (total) and time (per tuple), i.e.:
- Independent of the number of points
- Linear w.r.t. the number of streams (n)
- Linear w.r.t. the number of hidden variables (k)
- In fact, it can be done in real time
64. Overview - Part 2
- Method
- Experiments
- Conclusions and other work
65. Experiments - Chlorine concentration
[Figure: measurements vs. reconstruction]
166 streams, 2 hidden variables (4% error)
(data: CMU Civil Engineering)
66. Experiments - Chlorine concentration
[Figure: the hidden variables]
- Both capture the global, periodic pattern
- The second is like the first, but phase-shifted
- Together they can express any phase shift
(data: CMU Civil Engineering)
67. Experiments - Light measurements
[Figure: measurement vs. reconstruction]
54 sensors, 2-4 hidden variables (6% error)
68. Experiments - Light measurements
[Figure: the hidden variables; variables 3 and 4 are intermittent]
- 1 & 2: main trend (as before)
- 3 & 4: potential anomalies and outliers
69. Conclusions
- SPIRIT
- Discovers hidden variables, for:
- Summarization of main trends for users
- Efficient forecasting, spotting outliers/anomalies
- Incremental, real-time computation
- nimble: with limited memory
- automatic: no special parameters to tune
70. Outline
- Problem and motivation
- Single-sequence mining: AWSOM
- Co-evolving sequences: SPIRIT
- Lag correlations: BRAID
- Conclusions
71. Part 3 - BRAID: Discovering Lag Correlations in Multiple Streams
- Yasushi Sakurai, Spiros Papadimitriou, Christos Faloutsos
- SIGMOD'05
72. Lag Correlations
- Examples:
- A decrease in interest rates typically precedes an increase in house sales by a few months
- Higher amounts of fluoride in the drinking water lead to fewer dental cavities, some years later
73. Lag Correlations
- Example of lag-correlated sequences
- These sequences are correlated with lag l = 1300 time-ticks
[Figure: the two sequences and their CCF (Cross-Correlation Function)]
74. Lag Correlations
- Example of lag-correlated sequences
- How to compute the CCF:
- quickly?
- cheaply?
- incrementally?
[Figure: CCF (Cross-Correlation Function)]
75. Challenging Problems
- Problem definitions
- For two given co-evolving sequences X and Y, determine:
- Whether there is a lag correlation
- If yes, what is the lag length l
- For k given numerical sequences X1, ..., Xk, report:
- Which pairs have a lag correlation
- The corresponding lag for each pair
76. Our solution
- Ideal characteristics:
- Any-time processing, and fast
- Computation time per time tick is constant
- Nimble
- Memory space requirement is sub-linear in the sequence length
- Accurate
- Approximation introduces only a small error
77. Related Work
- Sequence indexing:
- Agrawal et al. (FODO 1993)
- Faloutsos et al. (SIGMOD 1994)
- Keogh et al. (SIGMOD 2001)
- Compression (wavelets and random projections):
- Gilbert et al. (VLDB 2001), Guha et al. (VLDB 2004)
- Dobra et al. (SIGMOD 2002), Ganguly et al. (SIGMOD 2003)
- Data stream management:
- Abadi et al. (VLDB Journal 2003)
- Motwani et al. (CIDR 2003)
- Chandrasekaran et al. (CIDR 2003)
- Cranor et al. (SIGMOD 2003)
78. Related Work
- Pattern discovery
- Clustering for data streams: Guha et al. (TKDE 2003)
- Monitoring multiple streams: Zhu et al. (VLDB 2002)
- Forecasting: Yi et al. (ICDE 2000), Papadimitriou et al. (VLDB 2003)
- None of the previously published methods focuses on this problem
79. Overview
- Introduction / Related work
- Background
- Main ideas
- Theoretical analysis
- Experimental results
80. Main Idea (1)
- Incremental computation
- Sufficient statistics:
- Sum of X
- Square sum of X
- Inner product of X and the shifted Y
- Compute R(l) incrementally, from:
- the covariance of X and Y
- the variance of X
(a bookkeeping sketch follows below)
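A minimal sketch of this bookkeeping in Python (illustrative, not BRAID itself): keep running sums and the inner products of X with shifted copies of Y, then read off R(l) on demand. Note that this naive version still spends O(max_lag) time and space per tick; the smoothing and geometric probing on the next slides are what bring that down.

```python
# Incremental sufficient statistics for lagged cross-correlation.

import math
from collections import deque

class LagCorrelation:
    def __init__(self, max_lag):
        self.n = 0
        self.sx = self.sxx = 0.0           # sum and square-sum of X
        self.sy = self.syy = 0.0           # sum and square-sum of Y
        self.sxy = [0.0] * (max_lag + 1)   # inner products of X with shifted Y
        self.cnt = [0] * (max_lag + 1)     # number of terms in each inner product
        self.y_recent = deque(maxlen=max_lag + 1)

    def update(self, x, y):
        self.n += 1
        self.sx += x; self.sxx += x * x
        self.sy += y; self.syy += y * y
        self.y_recent.appendleft(y)        # y_recent[l] == y_{t-l}
        for l, y_lagged in enumerate(self.y_recent):
            self.sxy[l] += x * y_lagged
            self.cnt[l] += 1

    def R(self, l):
        # sketch: uses the global means; a more careful version keeps per-lag sums
        m = self.cnt[l]
        mean_x, mean_y = self.sx / self.n, self.sy / self.n
        cov = self.sxy[l] / m - mean_x * mean_y
        var_x = self.sxx / self.n - mean_x ** 2
        var_y = self.syy / self.n - mean_y ** 2
        return cov / math.sqrt(var_x * var_y)
```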
81. Main Idea (2)
[Figure: multi-level window smoothing]
82. Main Idea (2)
- Sequence smoothing:
- means of windows at each level
- sufficient statistics computed from the means
- CCF computed from the sufficient statistics
- But it allows a partial redundancy
83. Main Idea (3)
[Figure: geometric lag probing with colored windows]
84. Main Idea (3)
- Geometric lag probing
- Use colored windows
- Keep track of only a geometric progression of the lag values: l = 0, 1, 2, 4, 8, ..., 2^h, ...
- Use a cubic spline to interpolate (see the sketch below)
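A sketch of the probing-and-interpolation step (illustrative; it assumes an object `lag_corr` with an `R(l)` method, such as the sketch after slide 80, and uses SciPy's cubic spline):

```python
# Estimate R(l) only at geometrically spaced lags and interpolate the rest
# of the CCF with a cubic spline.

import numpy as np
from scipy.interpolate import CubicSpline

def probed_lags(max_lag):
    """Lags 0, 1, 2, 4, 8, ..., up to max_lag."""
    lags = [0]
    l = 1
    while l <= max_lag:
        lags.append(l)
        l *= 2
    return lags

def interpolated_ccf(lag_corr, max_lag):
    lags = probed_lags(max_lag)
    values = [lag_corr.R(l) for l in lags]       # only O(log max_lag) evaluations
    spline = CubicSpline(lags, values)
    dense = np.arange(max_lag + 1)
    return dense, spline(dense)                  # approximate CCF at every lag

# the estimated best lag is then the argmax of the interpolated CCF
```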
85. Overview
- Introduction / Related work
- Background
- Main ideas
- Theoretical analysis
- Experimental results
86. Experimental results
- Setup:
- Intel Xeon 2.8GHz, 1GB memory, Linux
- Datasets:
- Sines, SpikeTrains, Humidity, Light, Temperature, Kursk, Sunspots
- Enhanced BRAID, b = 16
- Evaluation:
- Estimation error of lag correlations
- Computation time
87. Detecting Lag Correlations (2)
BRAID closely estimates the correlation coefficients.
[Figure: CCF (Cross-Correlation Function)]
88. Detecting Lag Correlations (3)
BRAID closely estimates the correlation coefficients.
[Figure: CCF (Cross-Correlation Function)]
89. Detecting Lag Correlations (4)
BRAID closely estimates the correlation coefficients.
[Figure: CCF (Cross-Correlation Function)]
90. Detecting Lag Correlations (5)
BRAID closely estimates the correlation coefficients.
[Figure: CCF (Cross-Correlation Function)]
91. Estimation Error
- The largest relative error is about 1%
92. Performance
- Almost linear w.r.t. the sequence length
- Up to 40,000 times faster than the naive method
93. Group Lag Correlations
- Two correlated pairs found among 55 Temperature sequences
- Each sensor is located in a different place
[Figure: sensors 16, 19, 47, 48; estimated CCF of sensors 16 and 19, and of sensors 47 and 48]
94. Conclusions
- Automatic lag correlation detection on stream data
- incremental: online, any-time
- nimble:
- O(log n) space, O(1) time to update the statistics
- Up to 40,000 times faster than the naive implementation
- Accurate:
- detects the correct lag within 1% relative error or less
95. Overall Conclusions
- Mining streaming numerical data is challenging!
- Extensions: streaming matrix data (e.g., the network traffic matrix)
[Figure: IP-source x IP-destination traffic matrices over time]
96. Thank you
- christos <at> cs.cmu.edu
- www.cs.cmu.edu/~christos
- InteMon demo