Title: Sensor and Graph Mining
1Sensor and Graph Mining
- Christos Faloutsos
- Carnegie Mellon University & IBM
- www.cs.cmu.edu/christos
2Joint work with
- Anthony Brockwell (CMU/Stat)
- Deepayan Chakrabarti (CMU)
- Spiros Papadimitriou (CMU)
- Chenxi Wang (CMU)
- Yang Wang (CMU)
3Outline
- Introduction - motivation
- Problem 1: Stream Mining
- Motivation
- Main idea
- Experimental results
- Problem 2: Graphs, virus propagation
- Conclusions
4Introduction
- Sensor devices
- Temperature, weather measurements
- Road traffic data
- Geological observations
- Patient physiological data
- Embedded devices
- Network routers
- Intelligent (active) disks
5Introduction
- Limited resources
- Memory
- Bandwidth
- Power
- CPU
- Remote environments
- No human intervention
6Introduction: problem definition
- Given a semi-infinite stream of values (time series) x1, x2, ..., xt, ...
- Find patterns, forecasts, outliers
7Introduction
Periodicity? (twice daily)
Periodicity? (daily)
8Introduction
- Can we capture these patterns
- automatically
- with limited resources?
9Related work: Statistics, time series forecasting
- Main problem
- "The first step in the analysis of any time series is to plot the data and inspect the graph" [Brockwell 91]
- Typically
- Resource intensive
- Cannot update online
- AR(I)MA and seasonal variants
- ARFIMA, GARCH, ...
10Related work: Databases, continuous queries
- Typically, different focus
- Compression
- Not generative models
- Largely orthogonal problem
- Gilbert, Guha, Indyk et al. (STOC 2002)
- Garofalakis, Gibbons (SIGMOD 2002)
- Chen, Dong, Han et al. (VLDB 2002); Bulut, Singh (ICDE 2003)
- Gehrke, Korn, et al. (SIGMOD 2001); Dobra, Garofalakis, Gehrke et al. (SIGMOD 2002)
- Guha, Koudas (ICDE 2003); Datar, Gionis, Indyk et al. (SODA 2002)
- Madden (SIGMOD 2002, SIGMOD 2003)
11Goals
- Adapt and handle arbitrary periodic components
- No human intervention/tuning
- Also
- Single pass over the data
- Limited memory (logarithmic)
- Constant-time update
12Outline
- Introduction - motivation
- Problem 1: Stream Mining
- Motivation
- Main idea
- Experimental results
- Problem 2: Graphs, virus propagation
- Conclusions
13Wavelets: Straight signal
[Figure: the signal plotted against time]
14Wavelets: Introduction (Haar)
[Figure: Haar wavelet decomposition; axes: time, frequency]
15Wavelets
- So?
- Wavelets compress many real signals well
- Image compression and processing
- Vision, astronomy, seismology, ...
- Wavelet coefficients can be updated as new points arrive [Kotidis]
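The compression claim above can be illustrated with a minimal sketch of the orthonormal Haar transform. The function names `haar_step` and `haar_transform` are illustrative assumptions, and this batch version does not show the incremental update mentioned in the last bullet.

```python
import numpy as np

def haar_step(signal):
    """One level of the orthonormal Haar transform: returns (smooth, detail)."""
    x = np.asarray(signal, dtype=float)
    if len(x) % 2:
        raise ValueError("length must be even")
    pairs = x.reshape(-1, 2)
    smooth = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)  # scaled pairwise sums
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)  # scaled pairwise differences
    return smooth, detail

def haar_transform(signal):
    """Full decomposition: list of detail arrays (finest level first) and the final smooth value."""
    x = np.asarray(signal, dtype=float)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    details = []
    while len(x) > 1:
        x, d = haar_step(x)
        details.append(d)
    return details, x

if __name__ == "__main__":
    # A piecewise-constant signal: only one detail coefficient is nonzero.
    details, smooth = haar_transform([4, 4, 4, 4, 8, 8, 8, 8])
    for level, d in enumerate(details, start=1):
        print("level", level, np.round(d, 3))
    print("smooth", np.round(smooth, 3))
```

For a signal with few changes, most detail coefficients are zero, which is why a handful of coefficients can describe it.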
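16Wavelets: Correlations
[Figure: wavelet coefficients of the stream xt; axes: time, frequency]
17Wavelets: Correlations
[Figure: wavelet coefficients of the stream xt; axes: time, frequency]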
18Main idea: Correlations
- Wavelets are good
- but we can do even better
- Keep one number
- and the fact that the coefficients are equal/correlated
19Proposed method
$W_{l,t} \approx \beta_{l,1} W_{l,t-1} + \beta_{l,2} W_{l,t-2} + \dots$
Small windows suffice (k = 4)
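A minimal sketch of the per-level linear model above, assuming orthonormal Haar detail coefficients and an ordinary batch least-squares fit rather than an incremental one. The helper names `haar_details` and `fit_level_model` are hypothetical and not taken from the paper.

```python
import numpy as np

def haar_details(x, level):
    """Haar detail coefficients W_{l,t} at the given level (1 = finest)."""
    if level < 1:
        raise ValueError("level must be >= 1")
    s = np.asarray(x, dtype=float)
    for _ in range(level):
        s = s[: len(s) // 2 * 2].reshape(-1, 2)  # drop a trailing odd sample
        d = (s[:, 0] - s[:, 1]) / np.sqrt(2.0)
        s = (s[:, 0] + s[:, 1]) / np.sqrt(2.0)
    return d

def fit_level_model(w, k):
    """Least-squares fit of w[t] ~ beta[0]*w[t-1] + ... + beta[k-1]*w[t-k]."""
    w = np.asarray(w, dtype=float)
    # Column i holds the lag-(i+1) values aligned with the targets w[k:].
    X = np.column_stack([w[k - 1 - i : len(w) - 1 - i] for i in range(k)])
    y = w[k:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

if __name__ == "__main__":
    t = np.arange(1024)
    x = np.sin(2 * np.pi * t / 16)       # periodic stream, period 16
    w = haar_details(x, level=1)         # level-1 coefficients have period 8
    print(fit_level_model(w, k=2))
```

A pure sinusoid obeys a two-term recurrence, so two regression coefficients already reproduce the level-1 coefficients of this synthetic stream; real data would need the feature selection step the talk describes.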
20More details
- Update of wavelet coefficients (incremental)
- Update of linear models (incremental RLS)
- Feature selection (single-pass)
- Not all correlations are significant
- Throw away the insignificant ones
- very important!!
- see paper
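The incremental RLS update mentioned above can be sketched as follows. This is the textbook recursive least squares recursion without a forgetting factor; the class name, the `delta` initialization constant, and the demo data are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

class RecursiveLeastSquares:
    """Textbook RLS: refines regression coefficients one sample at a time."""

    def __init__(self, k, delta=1e6):
        self.beta = np.zeros(k)       # current coefficient estimate
        self.P = np.eye(k) * delta    # running estimate of inv(X^T X); large = weak prior

    def update(self, x, y):
        """Fold in one observation (x: k regressors, y: target). O(k^2) time."""
        x = np.asarray(x, dtype=float)
        Px = self.P @ x
        gain = Px / (1.0 + x @ Px)
        self.beta = self.beta + gain * (y - x @ self.beta)
        self.P = self.P - np.outer(gain, Px)
        return self.beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_beta = np.array([2.0, -3.0])
    rls = RecursiveLeastSquares(k=2)
    for _ in range(50):
        x = rng.normal(size=2)
        rls.update(x, x @ true_beta)  # noiseless targets
    print(rls.beta)
```

Each update touches only a k-by-k matrix, which matches the constant-time, fixed-memory goals stated earlier in the talk.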
21Complexity
- Model update
- Space: $O(\lg N + mk^2) \approx O(\lg N)$
- Time: $O(k^2) \approx O(1)$
- Where
- N: number of points (so far)
- k: number of regression coefficients; fixed
- m: number of linear models; $O(\lg N)$
- see paper
22Outline
- Introduction - motivation
- Problem 1: Stream Mining
- Motivation
- Main idea
- Experimental results
- Problem 2: Graphs, virus propagation
- Conclusions
23Setup
- First half used for model estimation
- Models applied forward to forecast entire second half
- AR, Seasonal AR (SAR): R
- Simplest possible estimation: no maximum likelihood estimation (MLE), etc.
- vs. Python scripts
24Results: Synthetic data, triangle pulse
- Triangle pulse
- AR captures wrong trend (or none)
- Seasonal AR (SAR) estimation fails
25Results: Synthetic data, mix
- Mix (sine + square pulse)
- AR captures wrong trend (or none)
- Seasonal AR estimation fails
26Results: Real data, automobile
(filtered)
- Automobile traffic
- Daily periodicity with rush-hour peaks
- Bursty noise at smaller time scales
27Results: Real data, automobile
- Automobile traffic
- Daily periodicity with rush-hour peaks
- Bursty noise at smaller time scales
- AR fails to capture any trend (average)
- Seasonal AR estimation fails
28Results: Real data, automobile
- Automobile traffic
- Daily periodicity with rush-hour peaks
- Bursty noise at smaller time scales
- AWSOM spots periodicities, automatically
29Results: Real data, automobile
- Automobile traffic
- Daily periodicity with rush-hour peaks
- Bursty noise at smaller time scales
- Generation with identified noise
30Results: Real data, sunspot
- Sunspot intensity: slightly time-varying period
- AR captures wrong trend (average)
- Seasonal ARIMA
- Captures a wrong, immediate downward trend
- Requires human to determine seasonal component period (fixed)
31Results: Real data, sunspot
- Sunspot intensity: slightly time-varying period
Estimation: 40 minutes (R) vs. 9 seconds (Python)
32Variance
Hurst exponent
- Variance (log-power) vs. scale
- Noise diagnostic (if decreasing linear)
- Can use to estimate noise parameters
33Running time
[Figure: running time (t) vs. stream size (N)]
34Space requirements
Equal total number of model parameters
35Conclusion
- Adapt and handle arbitrary periodic components
- No human intervention/tuning
- Single pass over the data
- Limited memory (logarithmic)
- Constant-time update
36Conclusion
- Adapt and handle arbitrary periodic components
- No human intervention/tuning
- Single pass over the data
- Limited memory (logarithmic)
- Constant-time update
(first two goals: no human; last three goals: limited resources)
37Outline
- Introduction - motivation
- Problem 1: Streams
- Problem 2: Graphs, virus propagation
- Motivation, problem definition
- Related work
- Main idea
- Experiments
- Conclusions
38Introduction
Protein Interactions [genomebiology.com]
Internet Map [lumeta.com]
Food Web [Martinez 91]
Friendship Network [Moody 01]
Graphs are ubiquitous
39Introduction
bridges
- What can we do with graph analysis?
- Immunization
- Information Dissemination
- network value of a customer [Domingos]
Needle exchange networks of drug users [Weeks et al. 2002]
40Problem definition
- Q1: How does a virus spread across an arbitrary network?
- Q2: Will it create an epidemic?
- (in a sensor setting, with a gossip protocol, will a rumor/query spread?)
41Framework
- Susceptible-Infected-Susceptible (SIS) model
- Cured nodes immediately become susceptible
42Framework
- β: prob. an infected neighbor attacks
- δ: prob. an infected node heals
43The model
- (virus) Birth rate β: probability that an infected neighbor attacks
- (virus) Death rate δ: probability that an infected node heals
[Figure: node N and its neighbors N1, N2, N3; node states: Healthy, Infected]
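The SIS model above can be sketched as a small discrete-time simulation. This is a simplified illustration under stated assumptions: synchronous updates, independent attacks from each infected neighbor, and hypothetical names (`sis_step`, `simulate`, the adjacency-dict format). It is not the simulator used for the experiments in the talk.

```python
import random

def sis_step(adj, infected, beta, delta, rng):
    """One synchronous SIS time step; returns the new set of infected nodes."""
    nxt = set()
    for node, nbrs in adj.items():
        if node in infected:
            # An infected node heals with prob. delta and is then susceptible again.
            if rng.random() >= delta:
                nxt.add(node)
        else:
            # Each infected neighbor attacks independently with prob. beta.
            if any(n in infected and rng.random() < beta for n in nbrs):
                nxt.add(node)
    return nxt

def simulate(adj, seeds, beta, delta, steps, seed=0):
    """Run the model and return the number of infected nodes after each step."""
    rng = random.Random(seed)
    infected = set(seeds)
    history = [len(infected)]
    for _ in range(steps):
        infected = sis_step(adj, infected, beta, delta, rng)
        history.append(len(infected))
    return history

if __name__ == "__main__":
    # Star graph: hub 0 plus 99 spokes, all nodes initially infected.
    star = {0: list(range(1, 100))}
    star.update({i: [0] for i in range(1, 100)})
    print(simulate(star, seeds=range(100), beta=0.05, delta=0.8, steps=20))
```

Plotting the returned history for different β/δ ratios gives the kind of infected-count curves discussed in the experiments.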
44Epidemic threshold τ
- Defined as the value of τ such that
- if $\beta/\delta < \tau$
- an epidemic cannot happen
- Thus,
- given a graph
- compute its epidemic threshold
45Epidemic threshold τ
- What should τ depend on?
- avg. degree? and/or highest degree?
- and/or variance of degree?
- and/or determinant of the adjacency matrix?
46Basic Homogeneous Model
- Homogeneous graphs [Kephart-White 91, 93]
- Epidemic threshold: $1/\langle k \rangle$
- Homogeneous connectivity $\langle k \rangle$, i.e., all nodes have the same degree (unrealistic)
47Power-law Networks
- Model for Barabási-Albert networks
- [Pastor-Satorras & Vespignani, 01, 02]
- Epidemic threshold: $\langle k \rangle / \langle k^2 \rangle$
- for BA type networks, with only $\gamma = 3$ ($\gamma$: exponent, i.e. slope, of the power law)
48Epidemic threshold
- Homogeneous graphs: $1/\langle k \rangle$
- BA ($\gamma = 3$): $\langle k \rangle / \langle k^2 \rangle$
- more complicated graphs: ?
- arbitrary, REAL graphs: ?
- how many parameters??
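For comparison, the two earlier thresholds listed above depend only on moments of the degree sequence. A minimal sketch, with hypothetical function names and a star-graph degree sequence chosen purely for illustration:

```python
def homogeneous_threshold(degrees):
    """Homogeneous-model threshold 1/<k>, where <k> is the average degree."""
    mean_k = sum(degrees) / len(degrees)
    return 1.0 / mean_k

def power_law_threshold(degrees):
    """Power-law-model threshold <k>/<k^2>, from the first two degree moments."""
    mean_k = sum(degrees) / len(degrees)
    mean_k2 = sum(d * d for d in degrees) / len(degrees)
    return mean_k / mean_k2

if __name__ == "__main__":
    # Degree sequence of a star: one hub of degree 99 and 99 spokes of degree 1.
    degrees = [99] + [1] * 99
    print(homogeneous_threshold(degrees))
    print(power_law_threshold(degrees))
```

The two formulas can disagree widely on the same graph, which motivates the question of what a threshold for arbitrary graphs should depend on.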
49Epidemic threshold
- Theorem: We have no epidemic if
$\beta/\delta < \tau = 1/\lambda_{1,A}$
50Epidemic threshold
- Theorem: We have no epidemic if
$\beta/\delta < \tau = 1/\lambda_{1,A}$
- β: attack prob.
- δ: recovery prob.
- τ: epidemic threshold
- $\lambda_{1,A}$: largest eigenvalue of adj. matrix A
Proof: [Wang03]
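The condition in the theorem can be checked numerically for a small graph. A minimal sketch, assuming an undirected graph given as a dense symmetric 0/1 adjacency matrix with at least one edge; the function names are illustrative.

```python
import numpy as np

def epidemic_threshold(A):
    """Return tau = 1 / (largest eigenvalue of the symmetric adjacency matrix A)."""
    A = np.asarray(A, dtype=float)
    if A.ndim != 2 or A.shape[0] != A.shape[1] or not np.allclose(A, A.T):
        raise ValueError("A must be a square symmetric matrix")
    lam_max = np.linalg.eigvalsh(A)[-1]  # eigvalsh returns eigenvalues in ascending order
    if lam_max <= 0:
        raise ValueError("graph needs at least one edge")
    return 1.0 / lam_max

def is_below_threshold(A, beta, delta):
    """True if beta/delta < tau, i.e. the theorem rules out an epidemic."""
    return beta / delta < epidemic_threshold(A)

if __name__ == "__main__":
    triangle = [[0, 1, 1],
                [1, 0, 1],
                [1, 1, 0]]
    print(epidemic_threshold(triangle))
    print(is_below_threshold(triangle, beta=0.1, delta=0.5))
```

For large sparse graphs a sparse eigensolver would be the more practical choice; the dense call here keeps the sketch short.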
51Epidemic threshold for various networks
- sanity checks / older results
- Homogeneous networks
- $\lambda_{1,A} = \langle k \rangle$; $\tau = 1/\langle k \rangle$
- where $\langle k \rangle$ = average degree
- This is the same result as that of Kephart & White!
52Epidemic threshold for various networks
- sanity checks / older results
- Star networks
- $\lambda_{1,A} = \sqrt{d}$; $\tau = 1/\sqrt{d}$
- where d = the degree of the central node
53Epidemic threshold for various networks
- sanity checks / older results
- Infinite, power-law networks
- $\lambda_{1,A} = \infty$; $\tau = 0$: any virus has a chance! [Barabasi et al]
- Finite power-law networks
- $\tau = 1/\lambda_{1,A}$
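The homogeneous and star sanity checks above can be reproduced numerically. A minimal sketch, using a ring as one example of a graph where every node has the same degree; the helper names are assumptions for illustration.

```python
import numpy as np

def largest_eigenvalue(A):
    """Largest eigenvalue of a symmetric adjacency matrix."""
    return float(np.linalg.eigvalsh(np.asarray(A, dtype=float))[-1])

def ring(n):
    """Ring of n >= 3 nodes: every node has degree 2 (a homogeneous graph)."""
    A = np.zeros((n, n))
    for i in range(n):
        j = (i + 1) % n
        A[i, j] = A[j, i] = 1.0
    return A

def star(d):
    """Star with one hub (node 0) and d spokes."""
    A = np.zeros((d + 1, d + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    return A

if __name__ == "__main__":
    print(largest_eigenvalue(ring(10)))                 # compare with <k> = 2
    print(largest_eigenvalue(star(99)), np.sqrt(99))    # compare with sqrt(d)
```

Taking the reciprocal of each printed eigenvalue gives the threshold τ for that graph.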
54Outline
- Introduction - motivation
- Problem 1: Streams
- Problem 2: Graphs, virus propagation
- Motivation, problem definition
- Related work
- Main idea
- Experiments
- Conclusions
55Experiments
- 2 graphs
- Star network: one hub + 99 spokes
- Oregon Internet AS graph
- 10,900 nodes, 31,180 edges
- topology.eecs.umich.edu/data.html
- More in our paper [SRDS 03]
56Experiments (Star)
$\beta/\delta > \tau$ (above threshold)
$\beta/\delta = \tau$ (at the threshold)
57Experiments (Oregon)
$\beta/\delta > \tau$ (above threshold)
$\beta/\delta = \tau$ (at the threshold)
$\beta/\delta < \tau$ (below threshold)
58Our prediction vs. previous prediction
[Figure: number of infected nodes vs. β/δ; panels: Oregon, Star; curve labels: "Our", "PL3"]
- our predictions are more accurate
59Conclusions
- We found an epidemic threshold
- that applies to any network topology
- and it depends only on one parameter of the graph
60Overall conclusions
- Automatic stream mining: AWSOM
- Graphs and virus propagation: eigenvalue
61Ongoing / related work
- Streams
- how to find hidden variables on multiple streams, w/ Spiros and Jimeng Sun
- network tomography, w/ Airoldi
- Graphs
- graph partitioning, w/ Deepay
- important subgraphs, w/ Tomkins, McCurley
- graph generators: RMAT, w/ Deepay
62Thank you!
- Contact info
- christos _at_ cs.cmu.edu
- spapadim _at_ cs.cmu.edu
- deepay _at_ cs.cmu.edu
63Main References
- Spiros Papadimitriou, Anthony Brockwell and Christos Faloutsos, Adaptive, Hands-Off Stream Mining, VLDB 2003, Berlin, Germany, Sept. 2003.
- [Wang03] Yang Wang, Deepayan Chakrabarti, Chenxi Wang and Christos Faloutsos, Epidemic Spreading in Real Networks: an Eigenvalue Viewpoint, SRDS 2003, Florence, Italy.
64Additional References
- Connection Subgraphs, C. Faloutsos, K. McCurley, A. Tomkins, SIAM-DM 2004 workshop on link analysis
- RMAT: A recursive graph generator, D. Chakrabarti, Y. Zhan, C. Faloutsos, SIAM-DM 2004
- iFilter: Network tomography using particle filters, Edoardo Airoldi, Christos Faloutsos (submitted)