Data Mining Meets Systems: Tools and Case Studies

Slides: 53
Provided by: christosf
Learn more at: http://www.cs.cmu.edu

1
Data Mining Meets Systems: Tools and Case Studies
  • Christos Faloutsos
  • SCS CMU

2
Thanks
  • Spiros Papadimitriou (CMU -> IBM)
  • Jimeng Sun (CMU -> IBM)
  • Mengzhi Wang (CMU -> Google)

3
Outline
  • Problem 1: workload characterization
  • Problem 2: self-* monitoring
  • Problem 3: BGP mining
  • (Problem 4: sensor mining)
  • (Problem 5: Large graphs, Hadoop)

Tools: fractals, SVD, wavelets, tensors, PageRank
4
Problem 1
  • Goal: given a signal (e.g., bytes over time)
  • Find patterns, periodicities, and/or compress

[Figure: bytes vs. time; "Bytes per 30" (or packets per day, earthquakes per year)]
5
Problem 1
  • model bursty traffic
  • generate realistic traces
  • (Poisson does not work)

[Figure: bytes vs. time, compared with Poisson traffic]
6
Motivation
  • predict queue length distributions (e.g., to give
    probabilistic guarantees)
  • learn traffic, for buffering, prefetching,
    active disks, web servers

7
Q: any pattern?
  • Not Poisson
  • spike; silence; more spikes; more silence; ...
  • any rules?

[Figure: bytes vs. time]
8
Solution: self-similarity
[Figure: two plots of bytes vs. time]
9
But:
  • Q1: How to generate realistic traces; extrapolate?
  • Q2: How to estimate the model parameters?

10
Approach
  • Q1: How to generate a sequence that is
  • bursty
  • self-similar
  • and has similar queue length distributions

11
Approach
  • A: binomial multifractal [Wang02]
  • 80-20 law:
  • 80% of bytes/queries etc. on first half
  • repeat recursively
  • b: bias factor (e.g., 80%)
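
The recursive 80-20 split above can be sketched as a small generator. This is a minimal illustration rather than the code from [Wang02]: the function name `b_model`, its parameters, and the choice to pick the heavy half at random at each split are all assumptions made for the example (the slide itself always puts the heavy share on the first half).

```python
import random


def b_model(total, levels, b=0.8, rng=None):
    """Spread `total` over 2**levels time slots with a binomial cascade.

    At every level each interval is cut in two; one half receives a
    fraction b of the interval's volume and the other half 1 - b.
    """
    rng = rng or random.Random(0)  # fixed seed: reproducible illustration
    series = [float(total)]
    for _ in range(levels):
        next_series = []
        for volume in series:
            heavy, light = b * volume, (1.0 - b) * volume
            # Assumption: the heavy side is chosen at random per split.
            if rng.random() < 0.5:
                next_series.extend((heavy, light))
            else:
                next_series.extend((light, heavy))
        series = next_series
    return series


# Hypothetical usage: 1000 bytes spread over 2**10 = 1024 slots.
trace = b_model(total=1000.0, levels=10, b=0.8)
```

The total volume is preserved at every level, so the slots always add up to `total`; only the way it is concentrated changes with b.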

12
binary multifractals
[Figure: interval split 20% / 80%]
13
binary multifractals
[Figure: interval split 20% / 80%]
14
Parameter estimation
  • Q2: How to estimate the bias factor b?

15
Parameter estimation
  • Q2: How to estimate the bias factor b?
  • A: MANY ways [Crovella96]
  • Hurst exponent
  • variance plot
  • even DFT amplitude spectrum! (periodogram)
  • More robust: entropy plot [Wang02]

Mengzhi Wang, Tara Madhyastha, Ngai Hang Chang,
Spiros Papadimitriou and Christos Faloutsos,
Data Mining Meets Performance Evaluation: Fast
Algorithms for Modeling Bursty Traffic, ICDE 2002
16
Entropy plot
  • Rationale:
  • burstiness: inverse of uniformity
  • entropy measures uniformity of a distribution
  • find entropy at several granularities, to see
    whether/how our distribution is close to uniform.

17
Entropy plot
[Figure: interval split into two halves holding fractions p1 and p2 of the bytes]
  • Entropy E(n) after n levels of splits
  • n = 1: $E(1) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$

18
Entropy plot
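[Figure: interval split into four quarters holding fractions p2,1, p2,2, p2,3, p2,4 of the bytes]
  • Entropy E(n) after n levels of splits
  • n = 1: $E(1) = -p_1 \log_2 p_1 - p_2 \log_2 p_2$
  • n = 2: $E(2) = -\sum_i p_{2,i} \log_2 p_{2,i}$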
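
The definition of E(n) can be turned into a short routine. This is a sketch under stated assumptions: the name `entropy_plot` is hypothetical, and the input is assumed to be a series of non-negative volumes whose length is a power of two. It computes the entropy at the finest granularity first, then merges neighboring intervals to get the coarser levels.

```python
import math


def entropy_plot(series):
    """Return [E(0), E(1), ..., E(n)] for a series of length 2**n."""
    n = len(series).bit_length() - 1
    if len(series) != 1 << n:
        raise ValueError("series length must be a power of two")
    total = float(sum(series))
    entropies = []
    level = list(series)  # finest granularity: 2**n intervals
    for _ in range(n + 1):
        probs = [v / total for v in level if v > 0]
        entropies.append(-sum(p * math.log2(p) for p in probs))
        # Merge adjacent intervals to move one level coarser.
        level = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
    entropies.reverse()  # index k now holds E(k)
    return entropies


# Hypothetical inputs: a uniform series and a two-level 80-20 series.
uniform = entropy_plot([1.0] * 8)
skewed = entropy_plot([0.64, 0.16, 0.16, 0.04])
```

For the uniform series E(n) grows by one bit per level, while for the 80-20 series each level adds the same smaller amount; plotting E(n) against n gives the entropy plot.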

19
Real traffic
  • Has linear entropy plot (-> self-similar)

[Figure: entropy E(n) vs. # of levels (n); slope 0.73]
20
Observation - intuition
  • intuition: slope =
  • intrinsic dimensionality =
  • degrees of freedom or
  • info-bits per coordinate-bit
  • uniform dataset: slope = 1
  • multi-point: slope = 0

[Figure: entropy E(n) vs. # of levels (n); slope 0.73]
21
Entropy plot - Intuition
  • Slope = intrinsic dimensionality (in fact,
    "Information fractal dimension")
  • = info bit per coordinate bit, e.g.:

Dim = 1
Pick a point; reveal its coordinate bit-by-bit:
how much info is each bit worth to me?
22
Entropy plot
  • Slope = intrinsic dimensionality (in fact,
    "Information fractal dimension")
  • = info bit per coordinate bit, e.g.:

Dim = 1
Is MSB 0?
Info value = E(1) = 1 bit
23
Entropy plot
  • Slope = intrinsic dimensionality (in fact,
    "Information fractal dimension")
  • = info bit per coordinate bit, e.g.:

Dim = 1
Is MSB 0?
Is next MSB 0?
24
Entropy plot
  • Slope = intrinsic dimensionality (in fact,
    "Information fractal dimension")
  • = info bit per coordinate bit, e.g.:

Dim = 1
Is MSB 0?
Is next MSB 0?
Info value = 1 bit = E(2) - E(1) = slope!
25
Entropy plot
  • Repeat, for all points at same position

Dim = 0
26
Entropy plot
  • Repeat, for all points at same position
  • we need 0 bits of info, to determine position
  • -> slope = 0 = intrinsic dimensionality

Dim = 0
27
Entropy plot
  • Real (and 80-20) datasets can be in-between:
    bursts, gaps, smaller bursts, smaller gaps, at
    every scale

Dim = 1; Dim = 0; 0 < Dim < 1
28
(Fractals, again)
  • What set of points could have behavior between
    point and line?

29
Cantor dust
  • Eliminate the middle third
  • Recursively!

30
Cantor dust
31
Cantor dust
32
Cantor dust
33
Cantor dust
34
Cantor dust
Dimensionality? (no length; infinite # of points!)
Answer: $\log 2 / \log 3 \approx 0.63$
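
The log 2 / log 3 answer can be checked numerically by box counting. The sketch below is an illustration, not part of the original deck: the helper names and the choice of depth 8 are assumptions. It builds the Cantor construction with integer arithmetic, counts the boxes of width 3^-k that contain a point, and reads the dimension off the slope of log(count) against log(1/width).

```python
import math
from itertools import product


def cantor_points(depth):
    """Left endpoints of the depth-level Cantor intervals, scaled to integers in [0, 3**depth)."""
    # A point survives if every base-3 digit is 0 or 2 (never the middle third).
    return [sum(d * 3 ** (depth - 1 - i) for i, d in enumerate(digits))
            for digits in product((0, 2), repeat=depth)]


def box_count(points, depth, level):
    """Number of boxes of width 3**-level (of the unit interval) holding a point."""
    width = 3 ** (depth - level)
    return len({p // width for p in points})


depth = 8
points = cantor_points(depth)
counts = [box_count(points, depth, k) for k in range(1, depth + 1)]
# Slope of log(count) vs. log(1 / width) estimates the fractal dimension.
dimension = (math.log(counts[-1]) - math.log(counts[0])) / ((depth - 1) * math.log(3))
```

Each refinement triples the resolution but only doubles the number of occupied boxes, which is where the ratio log 2 / log 3 comes from.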
35
Some more entropy plots
  • Poisson vs real

[Figure: entropy plots; real traffic slope 0.73, Poisson slope 1]
Poisson: slope = 1 -> uniformly distributed
36
B-model
  • b-model traffic gives perfectly linear plot
  • Lemma: its slope is
  • $\text{slope} = -b \log_2 b - (1-b) \log_2 (1-b)$
  • Fitting: do entropy plot; get slope; solve for b

[Figure: E(n) vs. n]
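
The "solve for b" step has no closed form, but the lemma's slope decreases monotonically from 1 to 0 as b goes from 0.5 to 1, so bisection is enough. The following is a minimal sketch; the function names and the tolerance are assumptions, and by symmetry it returns the solution with b >= 0.5.

```python
import math


def b_model_slope(b):
    """Entropy-plot slope of b-model traffic, as given by the lemma."""
    return -b * math.log2(b) - (1.0 - b) * math.log2(1.0 - b)


def bias_from_slope(slope, tol=1e-12):
    """Invert the lemma by bisection on [0.5, 1)."""
    if not 0.0 < slope <= 1.0:
        raise ValueError("slope must lie in (0, 1]")
    lo, hi = 0.5, 1.0 - 1e-15
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if b_model_slope(mid) > slope:
            lo = mid  # still too uniform: move toward a larger bias
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical usage with the slope read off the real-traffic entropy plot.
b_estimate = bias_from_slope(0.73)
```

A slope of 1 maps back to b = 0.5 (uniform traffic), and slopes close to 0 map to b close to 1 (everything in one burst).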
37
Experimental setup
  • Disk traces (from HP [Wilkes 93])
  • web traces from LBL
  • http://repository.cs.vt.edu/
  • lbl-conn-7.tar.Z

38
Model validation
  • Linear entropy plots

Bias factors b: 0.6-0.8; smallest b / smoothest:
nntp traffic
39
Web traffic - results
  • LBL, NCDF of queue lengths (log-log scales)

[Figure: Prob(> l) vs. queue length l]
40
Conclusions
  • Multifractals (80/20, b-model, Multiplicative
    Wavelet Model (MWM)) for analysis and synthesis
    of bursty traffic

41
Books
  • Fractals: Manfred Schroeder, Fractals, Chaos,
    Power Laws: Minutes from an Infinite Paradise.
    W.H. Freeman and Company, 1991 (Probably the BEST
    book on fractals!)

42
Outline
  • Problem 1: workload characterization
  • Problem 2: self-* monitoring
  • Problem 3: BGP mining
  • (Problem 4: sensor mining)
  • (Problem 5: Large graphs, Hadoop)

43
Clusters/data center monitoring
  • Monitor correlations of multiple measurements
  • Automatically flag anomalous behavior
  • InteMon: intelligent monitoring system
  • warsteiner.db.cs.cmu.edu/demo/intemon.jsp

44
Publication
  • Evan Hoke, Jimeng Sun, John D. Strunk,
    Gregory R. Ganger, Christos Faloutsos. InteMon:
    Continuous Mining of Sensor Data in Large-scale
    Self-* Infrastructures. ACM SIGOPS Operating
    Systems Review, 40(3):38-44. ACM Press, July 2006

45
Under the hood: SVD
  • Singular Value Decomposition
  • Done incrementally

Spiros Papadimitriou, Jimeng Sun and Christos
Faloutsos: Streaming Pattern Discovery in
Multiple Time-Series. VLDB 2005, Trondheim,
Norway.
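
To make the idea concrete, here is a small batch sketch of what the first singular direction captures. It is not the incremental algorithm of the VLDB 2005 paper: it uses plain power iteration as a stand-in, the data are made up, and using the distance from the main direction as an anomaly score is an assumption made for illustration.

```python
import math


def first_singular_direction(rows, iterations=200):
    """Power iteration on A^T A: returns (sigma_1, v_1) for the matrix `rows`."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim  # caveat: must not be orthogonal to v_1
    sigma = 0.0
    for _ in range(iterations):
        projections = [sum(a * x for a, x in zip(row, v)) for row in rows]  # A v
        w = [sum(p * row[j] for p, row in zip(projections, rows))
             for j in range(dim)]                                           # A^T (A v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
        sigma = math.sqrt(norm)  # |A^T A v| tends to sigma_1 squared
    return sigma, v


def residual(row, v):
    """Distance of `row` from the line spanned by v (large = unusual point)."""
    proj = sum(a * x for a, x in zip(row, v))
    return math.sqrt(max(0.0, sum(a * a for a in row) - proj * proj))


# Made-up measurements: each row is (load of CPU1, load of CPU2) at one time tick.
ticks = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
sigma, direction = first_singular_direction(ticks)
normal_score = residual((2.0, 4.0), direction)   # follows the usual correlation
unusual_score = residual((3.0, 1.0), direction)  # breaks it
```

When the measurements are strongly correlated, one direction summarizes them, and a tick that falls far from that direction is a candidate for flagging.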
46
Singular Value Decomposition (SVD)
  • SVD (LSI, KL, PCA, spectral analysis...)

LSI: [S. Dumais, M. Berry]; KL: e.g., [Duda & Hart]; PCA:
e.g., [Jolliffe]; Details: [Press]
[Figure: axes "u of CPU1" and "u of CPU2"; points at times t1 and t2]
47
Singular Value Decomposition (SVD)
  • SVD (LSI, KL, PCA, spectral analysis...)

[Figure: axes "u of CPU1" and "u of CPU2"; points at times t1 and t2]
48
Singular Value Decomposition (SVD)
  • SVD (LSI, KL, PCA, spectral analysis...)

[Figure: axes "u of CPU1" and "u of CPU2"; points at times t1 and t2]
49
Singular Value Decomposition (SVD)
  • SVD (LSI, KL, PCA, spectral analysis...)

[Figure: axes "u of CPU1" and "u of CPU2"; points at times t1 and t2]
50
Outline
  • Problem 1: workload characterization
  • Problem 2: self-* monitoring
  • Problem 3: BGP mining
  • (Problem 4: sensor mining)
  • (Problem 5: Large graphs, Hadoop)

51
BGP updates
  • With:
  • Aditya Prakash (CMU)
  • Michalis Faloutsos (UC Riverside)
  • Nicholas Valler (UC Riverside)
  • Dave Andersen (CMU)

52
Tool 0: Time plot
[Figure: time series of updates per 600 s, Washington
router, 09/2004-09/2006]
53
Tool 0: Time plot
  • Observation 1: Missing values
  • Observation 2: Bursty

54
Tool 1: Wavelets
55
Wavelets - DWT
  • Short window Fourier transform (SWFT)
  • But: how short should the window be?

[Figure: value vs. time; frequency vs. time]
56
Wavelets - DWT
  • Answer: multiple window sizes! -> DWT

[Figure: time-frequency tilings for time domain, DFT, SWFT, and DWT; axes: frequency vs. time]
57
Haar Wavelets
  • subtract sum of left half from right half
  • repeat recursively for quarters, eighths, ...
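
The two bullets above translate almost directly into code. This sketch uses the unnormalized form (plain sums and differences, without the usual 1/sqrt(2) scaling), and the function name and return layout are assumptions chosen for the example.

```python
def haar_transform(signal):
    """Unnormalized Haar DWT of a signal whose length is a power of two.

    Returns (overall_sum, details), where details[0] is the coarsest level
    (right half minus left half) and details[-1] the finest.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    sums = list(signal)
    details = []
    while len(sums) > 1:
        pairs = list(zip(sums[0::2], sums[1::2]))
        # Detail coefficient: sum of right block minus sum of left block.
        details.append([right - left for left, right in pairs])
        sums = [left + right for left, right in pairs]
    details.reverse()
    return sums[0], details


# Hypothetical usage on a short ramp.
overall, details = haar_transform([1, 2, 3, 4, 5, 6, 7, 8])
```

Large coefficients at a fine level point to short spikes, while large coefficients only at coarse levels point to longer-lasting changes.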

58
Tornado plot for Washington router. Dark
areas correspond to high energy.
[Figure: scalogram over time; scales from low freq. to high freq.]
59
Tornado plot: wavelet transform for Washington
router, 09/2004-09/2006; all coefficients
and detail levels 1-12
  • Observations:
  • 1. Obvious spikes (E1):
    tornados that "touch down"
  • 2. Prolonged spikes (E2 and E3):
    when coarser scales have high values but
    finer scales do not
  • 3. Intermittent waves (E4 and E5): high-energy
    entries at nearby scales correspond to local
    periodic motion

60
Magnification of updates on 28th Aug. 2005
E2: prolonged spike;
sustained period of relatively high activity
[Figure: updates vs. time]
61
Tool 2: logarithms
62
Tool 2: logarithms
Prominent "clothesline" at 50 updates per 600
secs. Culprit IP addresses: 192.211.42.0/24,
216.109.38.0/24, 207.157.115.0/24. All from Alabama
(Supercomputing Center)!
63
Outline
  • Problem 1: workload characterization
  • Problem 2: self-* monitoring
  • Problem 3: BGP mining
  • (Problem 4: sensor mining)
  • (Problem 5: Large graphs, Hadoop)

Tools: fractals, SVD, wavelets, tensors, PageRank
64
Main point
  • Two-way street
  • <- DM can use such infrastructures to find
    patterns
  • -> DM can help such systems/networks etc. to
    become self-healing, self-adjusting, self-*
  • Hot topic in Data Mining: finding patterns in
    Tera- and Peta-bytes

65
Additional resources
  • Machine learning classes at SCS/MLD
  • Tom Mitchell's book on Machine Learning
  • Classification
  • Clustering/Anomaly detection
  • Support vector machines
  • Graphical models
  • Bayesian networks
  • <etc. etc.>

66
Thank you!
www.cs.cmu.edu/christos (for code, papers, etc.)
WeH 7107; christos <at> cs