Title: Data Mining Meets Systems: Tools and Case Studies
1Data Mining Meets Systems: Tools and Case Studies
- Christos Faloutsos
- SCS CMU
2Thanks
- Spiros Papadimitriou (CMU -> IBM)
- Jimeng Sun (CMU -> IBM)
- Mengzhi Wang (CMU -> Google)
3Outline
- Problem 1: workload characterization
- Problem 2: self-monitoring
- Problem 3: BGP mining
- (Problem 4: sensor mining)
- (Problem 5: large graphs; Hadoop)
Tools: fractals, SVD, wavelets, tensors, PageRank
4Problem 1
- Goal: given a signal (e.g., bytes over time)
- Find: patterns, periodicities, and/or compress
[Figure: bytes vs. time; caption: Bytes per 30 (packets per day; earthquakes per year)]
5Problem 1
- model bursty traffic
- generate realistic traces
- (Poisson does not work)
[Figure: bytes vs. time, Poisson]
6Motivation
- predict queue length distributions (e.g., to give probabilistic guarantees)
- learn traffic, for buffering, prefetching, active disks, web servers
7Q: any pattern?
- Not Poisson
- spike; silence; more spikes; more silence
- any rules?
[Figure: bytes vs. time]
8Solution: self-similarity
[Figure: two plots of bytes vs. time]
9But
- Q1: How to generate realistic traces; extrapolate?
- Q2: How to estimate the model parameters?
10Approach
- Q1: How to generate a sequence that is
- bursty
- self-similar
- and has similar queue length distributions
11Approach
- A: binomial multifractal [Wang02]
- 80-20 law
- 80% of bytes/queries etc. on first half
- repeat recursively
- b: bias factor (e.g., 80%)
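The recursive 80-20 split above can be sketched in a few lines. This is a minimal illustration, not the code from [Wang02]: the function name `b_model` and its parameters are hypothetical, and the optional random choice of which half receives the b share is an assumption added here (the slide only says "first half").

```python
import numpy as np

def b_model(total, levels, b=0.8, rng=None):
    """Binomial multifractal sketch: split `total` recursively in two halves.

    With rng=None the first half of every interval always receives the
    fraction b (as on the slide). Passing a numpy Generator picks the
    favored half at random per interval (an assumption for illustration).
    Returns an array of 2**levels values that sum to `total`.
    """
    trace = np.array([float(total)])
    for _ in range(levels):
        if rng is None:
            heavy_first = np.ones(trace.size, dtype=bool)
        else:
            heavy_first = rng.random(trace.size) < 0.5
        first = np.where(heavy_first, b, 1.0 - b) * trace
        second = trace - first
        # Replace each parent interval by its (first, second) children.
        trace = np.column_stack([first, second]).ravel()
    return trace

# Two levels with b = 0.8: 100 -> (80, 20) -> (64, 16, 16, 4)
print(b_model(100, levels=2, b=0.8))
```

Because every split conserves its parent's volume, the total number of bytes is preserved at every level, and only the single parameter b controls burstiness.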
12binary multifractals
[Figure: 20% / 80% split]
13binary multifractals
[Figure: 20% / 80% split]
14Parameter estimation
- Q2: How to estimate the bias factor b?
15Parameter estimation
- Q2: How to estimate the bias factor b?
- A: MANY ways [Crovella96]
- Hurst exponent
- variance plot
- even DFT amplitude spectrum! (periodogram)
- More robust: entropy plot [Wang02]
[Wang02] Mengzhi Wang, Tara Madhyastha, Ngai Hang Chang, Spiros Papadimitriou and Christos Faloutsos, Data Mining Meets Performance Evaluation: Fast Algorithms for Modeling Bursty Traffic, ICDE 2002
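As one example of the "variance plot" route to the Hurst exponent, here is a minimal sketch of the standard aggregated-variance method. The slides do not say which estimator was used, so treat this as an assumed illustration; the function name and `block_sizes` parameter are hypothetical.

```python
import numpy as np

def hurst_variance_plot(x, block_sizes):
    """Estimate the Hurst exponent H with the aggregated-variance method.

    For each block size m, average the series over non-overlapping blocks
    and record the variance of those block means. For a self-similar
    series the variance scales like m**(2H - 2), so H = 1 + slope / 2,
    where slope is fitted on the log-log variance plot.
    """
    x = np.asarray(x, dtype=float)
    log_m, log_var = [], []
    for m in block_sizes:
        n_blocks = x.size // m
        if n_blocks < 2:
            continue  # not enough blocks to estimate a variance
        means = x[: n_blocks * m].reshape(n_blocks, m).mean(axis=1)
        log_m.append(np.log(m))
        log_var.append(np.log(means.var()))
    slope, _ = np.polyfit(log_m, log_var, 1)
    return 1.0 + slope / 2.0

# Independent noise has no long-range dependence, so H should be near 0.5.
noise = np.random.default_rng(0).normal(size=2**16)
print(hurst_variance_plot(noise, [2**k for k in range(9)]))
```

Bursty, self-similar traffic is expected to give H noticeably above 0.5.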
16Entropy plot
- Rationale:
- burstiness: inverse of uniformity
- entropy: measures uniformity of a distribution
- find entropy at several granularities, to see whether/how our distribution is close to uniform.
17Entropy plot
[Figure: one split into two halves; $p_1$, $p_2$: % of bytes in each half]
- Entropy E(n) after n levels of splits
- n=1: $E(1) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$
18Entropy plot
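[Figure: two levels of splits; quarters hold fractions $p_{2,1}$, $p_{2,2}$, $p_{2,3}$, $p_{2,4}$]
- Entropy E(n) after n levels of splits
- n=1: $E(1) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$
- n=2: $E(2) = -\sum_i p_{2,i} \log_2 p_{2,i}$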
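The definition of E(n) above translates directly into code. The following is a minimal sketch, assuming the trace length is a power of two; `entropy_plot` is a hypothetical name, not the authors' implementation.

```python
import numpy as np

def entropy_plot(trace):
    """Return [E(1), E(2), ..., E(L)] for a trace of length 2**L.

    At level n the time axis is cut into 2**n equal intervals, p_{n,i} is
    the fraction of the total volume in interval i, and
    E(n) = -sum_i p_{n,i} * log2(p_{n,i}), with 0 * log2(0) taken as 0.
    """
    x = np.asarray(trace, dtype=float)
    levels = x.size.bit_length() - 1
    if x.size == 0 or 2**levels != x.size:
        raise ValueError("trace length must be a power of two")
    total = x.sum()
    entropies = []
    for n in range(1, levels + 1):
        p = x.reshape(2**n, -1).sum(axis=1) / total
        p = p[p > 0]  # empty intervals contribute nothing
        entropies.append(float(-(p * np.log2(p)).sum()))
    return entropies

# A uniform trace gains one full bit per level (slope 1).
print(entropy_plot([1, 1, 1, 1, 1, 1, 1, 1]))
```

The slope of E(n) against n can then be fitted with, for example, `np.polyfit(range(1, len(E) + 1), E, 1)[0]`.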
19Real traffic
- Has linear entropy plot (-> self-similar)
[Figure: entropy E(n) vs. # of levels (n); annotated 0.73]
20Observation - intuition
- intuition: slope =
- intrinsic dimensionality =
- degrees of freedom, or
- info-bits per coordinate-bit
- uniform dataset: slope = 1
- multi-point: slope = 0
[Figure: entropy E(n) vs. # of levels (n); annotated 0.73]
21Entropy plot - Intuition
- Slope = intrinsic dimensionality (in fact, information fractal dimension)
- = info-bits per coordinate-bit
- e.g., Dim = 1:
Pick a point; reveal its coordinate bit-by-bit
- how much info is each bit worth to me?
22Entropy plot
- Slope = intrinsic dimensionality (in fact, information fractal dimension)
- = info-bits per coordinate-bit
- e.g., Dim = 1:
Is MSB 0?
Info value = E(1) = 1 bit
23Entropy plot
- Slope = intrinsic dimensionality (in fact, information fractal dimension)
- = info-bits per coordinate-bit
- e.g., Dim = 1:
Is MSB 0?
Is next MSB 0?
24Entropy plot
- Slope = intrinsic dimensionality (in fact, information fractal dimension)
- = info-bits per coordinate-bit
- e.g., Dim = 1:
Is MSB 0?
Is next MSB 0?
Info value = 1 bit = E(2) - E(1) = slope!
25Entropy plot
- Repeat, for all points at same position
[Figure: Dim = 0]
26Entropy plot
- Repeat, for all points at same position
- we need 0 bits of info to determine position
- -> slope = 0 = intrinsic dimensionality
[Figure: Dim = 0]
27Entropy plot
- Real (and 80-20) datasets can be in-between: bursts, gaps, smaller bursts, smaller gaps, at every scale
[Figure: Dim = 1; Dim = 0; 0 < Dim < 1]
28(Fractals, again)
- What set of points could have behavior between
point and line?
29Cantor dust
- Eliminate the middle third
- Recursively!
30Cantor dust
31Cantor dust
32Cantor dust
33Cantor dust
34Cantor dust
Dimensionality? (no length; infinite # of points!)
Answer: log 2 / log 3 ≈ 0.63
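The log 2 / log 3 answer can be checked numerically by box counting. The sketch below is an illustration under stated assumptions: it builds the Cantor dust to a finite depth using integer coordinates, and the function names are hypothetical.

```python
import numpy as np

def cantor_points(depth):
    """Left endpoints of the intervals kept after `depth` rounds of
    removing the middle third, as integers in [0, 3**depth)."""
    pts = np.array([0], dtype=np.int64)
    for _ in range(depth):
        # Each kept interval is replaced by its left and right thirds.
        pts = np.concatenate([3 * pts, 3 * pts + 2])
    return np.sort(pts)

def box_counting_dimension(points, depth):
    """Fit log(# occupied boxes) against log(1 / box size)."""
    log_inv_size, log_count = [], []
    for j in range(1, depth + 1):
        box_width = 3 ** (depth - j)  # box size 3**-j of the unit interval
        occupied = np.unique(points // box_width).size
        log_inv_size.append(j * np.log(3))
        log_count.append(np.log(occupied))
    slope, _ = np.polyfit(log_inv_size, log_count, 1)
    return slope

depth = 10
print(box_counting_dimension(cantor_points(depth), depth))  # close to log 2 / log 3
```

Each time the box size shrinks by a factor of 3, the number of occupied boxes only doubles, which is where the ratio log 2 / log 3 comes from.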
35Some more entropy plots
[Figure: entropy plots, annotated 0.73 and 1]
Poisson: slope = 1 -> uniformly distributed
36B-model
- b-model traffic gives perfectly linear plot
- Lemma: its slope is
- $\text{slope} = -b \log_2 b - (1-b) \log_2 (1-b)$
- Fitting: do entropy plot; get slope; solve for b
[Figure: E(n) vs. n]
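The "solve for b" step has no closed form, but the slope formula is monotonically decreasing in b on [0.5, 1], so bisection works. A minimal sketch follows; the function names and the convention of returning the root with b >= 0.5 are assumptions made here.

```python
import math

def binary_entropy(b):
    """slope(b) = -b*log2(b) - (1-b)*log2(1-b), with 0*log2(0) = 0."""
    if b in (0.0, 1.0):
        return 0.0
    return -b * math.log2(b) - (1.0 - b) * math.log2(1.0 - b)

def bias_from_slope(slope, tol=1e-12):
    """Invert the lemma: find b in [0.5, 1] whose entropy-plot slope
    equals `slope`. The entropy decreases from 1 to 0 on this range."""
    if not 0.0 <= slope <= 1.0:
        raise ValueError("slope must lie in [0, 1]")
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binary_entropy(mid) > slope:
            lo = mid  # too uniform: b must be larger
        else:
            hi = mid
    return (lo + hi) / 2.0

# A slope of 0.73 corresponds to a bias factor a little below 0.8.
print(bias_from_slope(0.73))
```

The symmetric value 1 - b gives the same slope, so the choice of the root above 0.5 only fixes which half is called the heavy one.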
37Experimental setup
- Disk traces (from HP [Wilkes 93])
- web traces from LBL
- http://repository.cs.vt.edu/
- lbl-conn-7.tar.Z
38Model validation
Bias factors b: 0.6-0.8; smallest b / smoothest: nntp traffic
39Web traffic - results
- LBL, NCDF of queue lengths (log-log scales)
[Figure: Prob(> l) vs. queue length l]
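To make the queue-length comparison concrete, here is one way such a curve could be produced from a trace. The slides do not describe the queue simulator, so the single-server model with a fixed service capacity per time tick is an assumption, as are the function names.

```python
import numpy as np

def queue_lengths(arrivals, capacity):
    """Backlog after each time tick for a single queue that can serve
    `capacity` units per tick (assumed model, not from the slides)."""
    backlog = 0.0
    out = np.empty(len(arrivals))
    for t, a in enumerate(arrivals):
        backlog = max(0.0, backlog + a - capacity)
        out[t] = backlog
    return out

def ccdf(values, thresholds):
    """Empirical Prob(value > l) for each threshold l."""
    values = np.asarray(values, dtype=float)
    return np.array([(values > l).mean() for l in thresholds])

q = queue_lengths([3, 0, 0, 5], capacity=2)
print(q)                   # backlog per tick
print(ccdf(q, [0, 1, 2]))  # Prob(queue length > l)
```

Feeding the same simulator with a real trace and with a synthetic trace, then plotting both curves on log-log axes, gives the kind of comparison shown on this slide.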
40Conclusions
- Multifractals (80/20, b-model, Multiplicative
Wavelet Model (MWM)) for analysis and synthesis
of bursty traffic
41Books
- Fractals: Manfred Schroeder, Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise, W.H. Freeman and Company, 1991 (Probably the BEST book on fractals!)
42Outline
- Problem 1: workload characterization
- Problem 2: self-monitoring
- Problem 3: BGP mining
- (Problem 4: sensor mining)
- (Problem 5: large graphs; Hadoop)
43Clusters/data center monitoring
- Monitor correlations of multiple measurements
- Automatically flag anomalous behavior
- InteMon: intelligent monitoring system
- warsteiner.db.cs.cmu.edu/demo/intemon.jsp
44Publication
- Evan Hoke, Jimeng Sun, John D. Strunk, Gregory R. Ganger, Christos Faloutsos. InteMon: Continuous Mining of Sensor Data in Large-scale Self-* Infrastructures. ACM SIGOPS Operating Systems Review, 40(3):38-44. ACM Press, July 2006
45Under the hood: SVD
- Singular Value Decomposition
- Done incrementally
Spiros Papadimitriou, Jimeng Sun and Christos Faloutsos, Streaming Pattern Discovery in Multiple Time-Series, VLDB 2005, Trondheim, Norway.
46Singular Value Decomposition (SVD)
- SVD (LSI, KL, PCA, spectral analysis...)
LSI: S. Dumais; M. Berry. KL: e.g., Duda & Hart. PCA: e.g., Jolliffe. Details: [Press]
[Figure: scatter plot of time ticks t1, t2; axes: u of CPU1 vs. u of CPU2]
47Singular Value Decomposition (SVD)
- SVD (LSI, KL, PCA, spectral analysis...)
[Figure: scatter plot of time ticks t1, t2; axes: u of CPU1 vs. u of CPU2]
48Singular Value Decomposition (SVD)
- SVD (LSI, KL, PCA, spectral analysis...)
[Figure: scatter plot of time ticks t1, t2; axes: u of CPU1 vs. u of CPU2]
49Singular Value Decomposition (SVD)
- SVD (LSI, KL, PCA, spectral analysis...)
[Figure: scatter plot of time ticks t1, t2; axes: u of CPU1 vs. u of CPU2]
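The two-CPU scatter plot can be reproduced with a small batch SVD. This is only a sketch: the data are synthetic, the helper names and the 5x-median threshold are hypothetical, and InteMon itself updates the decomposition incrementally rather than recomputing it in batch as done here.

```python
import numpy as np

def principal_direction(X):
    """Mean and first right singular vector of the centered data:
    the line that best captures how the measurements move together."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[0]

def residuals(X, mean, direction):
    """Distance of each time tick from the principal direction."""
    centered = np.asarray(X, dtype=float) - mean
    along = centered @ direction
    return np.linalg.norm(centered - np.outer(along, direction), axis=1)

# Synthetic example: two CPUs that normally track the same workload.
rng = np.random.default_rng(0)
load = rng.uniform(0.2, 0.9, size=200)
X = np.column_stack([load, load]) + rng.normal(0.0, 0.01, size=(200, 2))
X[150] = [0.9, 0.1]  # injected anomaly: the CPUs disagree at tick 150

mean, direction = principal_direction(X)
res = residuals(X, mean, direction)
flagged = np.flatnonzero(res > 5 * np.median(res))
print(direction, flagged)
```

Time ticks that stay close to the principal direction follow the usual correlation; ticks far from it are candidates for flagging.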
50Outline
- Problem 1: workload characterization
- Problem 2: self-monitoring
- Problem 3: BGP mining
- (Problem 4: sensor mining)
- (Problem 5: large graphs; Hadoop)
51BGP updates
- With
- Aditya Prakash (CMU)
- Michalis Faloutsos (UC Riverside)
- Nicholas Valler (UC Riverside)
- Dave Andersen (CMU)
52Tool 0: Time plot
[Figure: time series of updates per 600s, Washington router, 09/2004-09/2006]
53Tool 0: Time plot
- Observation 1: Missing values
- Observation 2: Bursty
54Tool 1: Wavelets
55Wavelets - DWT
- Short window Fourier transform (SWFT)
- But: how short should the window be?
[Figure: value vs. time; freq vs. time]
56Wavelets - DWT
- Answer: multiple window sizes! -> DWT
[Figure: freq vs. time tilings for time domain, DFT, SWFT, DWT]
57Haar Wavelets
- subtract sum of left half from right half
- repeat recursively for quarters, eighths, ...
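The recursive left-half / right-half differences can be computed bottom-up with the usual pyramid scheme. A minimal sketch follows, assuming a series whose length is a power of two; the orthonormal 1/sqrt(2) scaling and the function name are choices made here, not taken from the slides.

```python
import numpy as np

def haar_dwt(x):
    """Haar wavelet transform of a series of length 2**k.

    Returns (smooth, details): `smooth` is the single coarsest average
    coefficient and details[j] holds the detail coefficients at level j,
    finest first. Each detail is (right - left) / sqrt(2), so the last
    one is proportional to sum(right half) - sum(left half).
    """
    a = np.asarray(x, dtype=float)
    if a.size == 0 or a.size & (a.size - 1):
        raise ValueError("length must be a power of two")
    details = []
    while a.size > 1:
        left, right = a[0::2], a[1::2]
        details.append((right - left) / np.sqrt(2))
        a = (left + right) / np.sqrt(2)
    return a[0], details

smooth, details = haar_dwt([1, 1, 1, 1, 5, 5, 5, 5])
print(smooth, [d.round(3) for d in details])
```

In this example only the coarsest detail is non-zero, because the series is flat within each half and jumps once in the middle. With this scaling the squared coefficients sum to the energy of the original series.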
58Tornado Plot for Washington Router
Dark areas correspond to high energy
[Figure: tornado plot; axes: frequency (low freq. to high freq.) vs. time]
59Tornado Plot: Wavelet Transform for Washington Router, 09/2004-09/2006, all coefficients and detail levels 1-12
- Observations:
- 1. Obvious spikes (E1)
- tornados that touch down
- 2. Prolonged spikes (E2 and E3)
- when coarser scales have high values but finer scales do not
- 3. Intermittent waves (E4 and E5)
- high-energy entries at nearby scales correspond to local periodic motion
60Magnification of updates on 28th Aug. 2005
[Figure: updates vs. time]
E2: Prolonged spike. Sustained period of relatively high activity
61Tool 2: logarithms
62Tool 2: logarithms
Prominent clothesline at 50 updates per 600 secs.
Culprit IP addresses: 192.211.42.0/24, 216.109.38.0/24, 207.157.115.0/24
All from Alabama (Supercomputing Center)!
63Outline
- Problem 1: workload characterization
- Problem 2: self-monitoring
- Problem 3: BGP mining
- (Problem 4: sensor mining)
- (Problem 5: large graphs; Hadoop)
Tools: fractals, SVD, wavelets, tensors, PageRank
64Main point
- Two-way street:
- <- DM can use such infrastructures to find patterns
- -> DM can help such systems/networks etc. to become self-healing, self-adjusting, self-*
- Hot topic in Data Mining: finding patterns in Tera- and Peta-bytes
65Additional resources
- Machine learning classes at SCS/MLD
- Tom Mitchell's book on Machine Learning
- Classification
- Clustering/Anomaly detection
- Support vector machines
- Graphical models
- Bayesian networks
- <etc. etc.>
66Thank you!
www.cs.cmu.edu/christos: for code, papers, etc.
WeH 7107; christos <at> cs