Title: Statistical Approaches to Mining Multivariate Data Streams
1. Statistical Approaches to Mining Multivariate Data Streams
- Eric Vance, Duke University
- Department of Statistical Science
- David Banks, Duke University
- Tamraparni Dasu, AT&T Labs - Research
JSM, July 31, 2007, Salt Lake City, Utah
2. Data Streams
- Huge amounts of complex data
- Rapid rate of accumulation
- One-time access to raw data
3. Change Detection
- Problem: change detection in complicated data streams
- Three criteria
- Nonparametric
- Fast
- Statistical guarantees
4. E-Commerce Server Data
- Data description
- One server in a network of servers
- Data polled every 5 minutes
- 27 variables
- 5-week time period
- Quality issues
- Many variables unimportant or unchanging
- Missing data
5. E-Commerce Server Data
- Variable Elimination
- Several variables remain constant (Total Swap)
- Predictable and non-informative (Used SysDisk Space)
- Correlated (CPU Used, CPU User)
- 6 Variables Selected
- CPU Used
- Number of Procs
- Number of Threads
- Ping Latency
- Used Swap
- Log Packets log(In Packets Out Packets)
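The elimination rules above could be partly automated. The following is a minimal Python sketch, not the authors' procedure: the function name `screen_variables` and the 0.95 correlation cutoff are illustrative assumptions, and the "predictable and non-informative" rule is left to manual judgment.

```python
import numpy as np

def screen_variables(X, names, corr_threshold=0.95):
    """Return the names kept after dropping constant and redundant columns.

    corr_threshold is an illustrative cutoff (an assumption, not from the talk).
    Missing values are expected to be encoded as NaN.
    """
    X = np.asarray(X, dtype=float)
    selected = []
    for j in range(X.shape[1]):
        col = X[:, j]
        # Rule 1: skip variables that never change.
        if not np.nanmax(col) > np.nanmin(col):
            continue
        # Rule 2: skip variables highly correlated with one already kept.
        redundant = False
        for k in selected:
            ok = ~np.isnan(col) & ~np.isnan(X[:, k])
            r = np.corrcoef(col[ok], X[ok, k])[0, 1]
            if abs(r) >= corr_threshold:
                redundant = True
                break
        if not redundant:
            selected.append(j)
    return [names[j] for j in selected]
```

Because the first variable of a correlated pair is the one kept, the column order decides which of CPU Used and CPU User survives.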
6. Daily Boxplots
[Figure: daily boxplots for CPU, Procs, and Threads]
7. Daily Boxplots
[Figure: daily boxplots for Ping, Swap, and Packets]
8. Our Approach
- Partitioning scheme (Multivariate Histogram)
- Data depth: rank each point by its Mahalanobis distance from the center of the data
- Data pyramid: determine which direction in the data is most extreme
- Profile-based comparison
- Identify changes in profiles over time (week to week)
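For reference, the Mahalanobis distance behind the data depth step has a standard form, written here for a point x, with \mu and \Sigma denoting the center and covariance matrix of the comparison data (this notation is not from the slides). The second expression is only one plausible reading of "most extreme direction", stated as an assumption: the variable with the largest standardized deviation from the center, together with the sign of that deviation.

```latex
D(x) = \sqrt{(x - \mu)^{\top} \Sigma^{-1} (x - \mu)}
\qquad
j^{*}(x) = \arg\max_{j} \frac{|x_j - \mu_j|}{\sqrt{\Sigma_{jj}}},
\quad
s(x) = \operatorname{sign}\left(x_{j^{*}} - \mu_{j^{*}}\right)
```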
9. Partitioning Example in 2D
- 5 center-outward depth layers
- 4 pyramids
- Point A: depth 4, pyramid +y
- Point B: depth 3, pyramid -x
- Point C: depth 1, pyramid +x
10. Identify Depth and Direction
- Compute center of comparison Data Sphere
- Calculate Mahalanobis distance for each point
- Determine which of the 5 depth quantiles each point falls in
- Determine the direction of greatest variation for each point
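The steps on this slide can be sketched in Python with NumPy. This is a rough illustration under stated assumptions, not the authors' code: the function name `depth_and_direction` is hypothetical, layer 1 is taken to be the innermost layer, and the "direction" is taken to be the variable with the largest standardized deviation from the center, plus its sign.

```python
import numpy as np

def depth_and_direction(reference, points, n_layers=5):
    """Assign each point a depth layer (1..n_layers) and a pyramid (axis, sign)."""
    reference = np.asarray(reference, dtype=float)
    points = np.asarray(points, dtype=float)

    # Step 1: center and covariance of the comparison data.
    center = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)
    cov_inv = np.linalg.inv(cov)

    # Step 2: Mahalanobis distance of every row of Z from the center.
    def mahalanobis(Z):
        diff = Z - center
        return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

    # Step 3: layer boundaries are quantiles of the reference distances.
    # Assumption: layer 1 is the innermost layer.
    probs = np.linspace(0, 1, n_layers + 1)[1:-1]
    edges = np.quantile(mahalanobis(reference), probs)
    layer = np.searchsorted(edges, mahalanobis(points), side="right") + 1

    # Step 4 (assumed definition): the most extreme variable and its sign.
    z = (points - center) / np.sqrt(np.diag(cov))
    axis = np.argmax(np.abs(z), axis=1)
    sign = np.sign(z[np.arange(len(points)), axis]).astype(int)
    return layer, axis, sign
```

Passing one week as `reference` and another week as `points` places the new week in the bins defined by the comparison week. A point lying exactly at the center gets sign 0 in this sketch.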
11. Data Partition in 6 Dimensions
12-26. Partitioning Into Bins
[Figure sequence built up across slides 12 to 26; panel labels: CPU, Procs, Threads, Ping, Swap, Packets]
27-33. Partitioning Into Bins: Weeks 1 and 2
[Figure sequence built up across slides 27 to 33; panel labels: Swap, Threads]
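As a rough illustration of the binning, 5 depth layers and two pyramids per variable would give 5 x 12 = 60 bins for 6 variables, by analogy with the 4 pyramids of the 2D example. The sketch below assumes each point already carries a layer (1 to 5), a variable index (0 to 5), and a sign; the function names and the column layout of the count table are hypothetical choices, not taken from the talk.

```python
import numpy as np

def bin_profile(layer, axis, sign, n_layers=5, n_vars=6):
    """Count points per (depth layer, pyramid) bin.

    Columns 0..n_vars-1 hold the negative pyramids and
    columns n_vars..2*n_vars-1 hold the positive ones (an assumed layout).
    """
    layer = np.asarray(layer)
    axis = np.asarray(axis)
    sign = np.asarray(sign)
    pyramid = axis + n_vars * (sign > 0)
    counts = np.zeros((n_layers, 2 * n_vars), dtype=int)
    # np.add.at accumulates correctly when the same bin occurs many times.
    np.add.at(counts, (layer - 1, pyramid), 1)
    return counts

def profile_shift(counts_a, counts_b):
    """Difference in bin proportions between two periods (b minus a)."""
    p_a = counts_a / counts_a.sum()
    p_b = counts_b / counts_b.sum()
    return p_b - p_a
```

Large positive entries of `profile_shift(week1, week2)` mark bins that gained mass in the second week, and the column index indicates which variable and sign the change is associated with.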
34. 6 Variables Over 5 Weeks
35. Week 1 Results
36. Week 2 Results
37-38. 6 Variables Over 5 Weeks
[Figure label: Threads]
39-40. Week 3 Results
[Figure label: Threads]
41-43. Week 4 Results
[Figure labels: Procs, Threads]
44-45. 6 Variables Over 5 Weeks
[Figure label: Packets]
46-47. Week 5 Results
[Figure label: Packets]
48. Trimmed Mean and Covariance
- Apply the partitioning method using a trimmed mean and trimmed covariance matrix
- Use the 90% most central data points
- Recompute center
- Recompute Mahalanobis distances using new center and new covariance matrix
- Bins are more uniformly filled
- More points appear closer to trimmed center
- More variables become most extreme
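A single trimming pass of this kind might look as follows. This is a sketch under assumptions rather than the authors' implementation: the function name `trimmed_center_and_cov` is hypothetical, the `keep` fraction is a parameter, and "most central" is interpreted as smallest Mahalanobis distance from the untrimmed center.

```python
import numpy as np

def trimmed_center_and_cov(X, keep=0.90):
    """Recompute mean and covariance from the most central fraction of rows."""
    X = np.asarray(X, dtype=float)

    # Initial (untrimmed) center and covariance.
    center = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

    # Squared Mahalanobis distance of every point from the initial center.
    diff = X - center
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

    # Keep the `keep` fraction of points with the smallest distances.
    n_keep = int(np.ceil(keep * len(X)))
    central = X[np.argsort(d2)[:n_keep]]

    # Trimmed center and trimmed covariance matrix.
    return central.mean(axis=0), np.cov(central, rowvar=False)
```

The returned center and covariance would then replace the untrimmed ones when Mahalanobis distances are recomputed for all points, including the trimmed ones.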
49. Trimmed Comparison Week 1
50. Trimmed Results Week 1
51. Trimmed Results Week 2
52. Trimmed Results Week 3
53. Trimmed Results Week 4
54. Trimmed Results Week 5
55. Multivariate Tests of Statistical Significance
- Multinomial test
- Chi-squared test, G-test
- Bootstrap and Resampling
- Bayesian methods
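To make two of these options concrete, the sketch below computes the Pearson chi-squared and G statistics for two vectors of bin counts and obtains a resampling p-value by permuting bin labels between the two periods. It uses only the Python standard library; the function names are hypothetical, and the permutation scheme assumes observations are exchangeable, which autocorrelated stream data may violate.

```python
import math
import random

def chi2_and_g(counts_a, counts_b):
    """Pearson chi-squared and G statistics for a 2 x K table of bin counts."""
    n_a, n_b = sum(counts_a), sum(counts_b)
    n = n_a + n_b
    chi2 = g = 0.0
    for a, b in zip(counts_a, counts_b):
        col = a + b
        if col == 0:
            continue  # bins empty in both periods carry no information
        for obs, row_total in ((a, n_a), (b, n_b)):
            exp = row_total * col / n  # expected count under homogeneity
            chi2 += (obs - exp) ** 2 / exp
            if obs > 0:
                g += 2.0 * obs * math.log(obs / exp)
    return chi2, g

def permutation_pvalue(bins_a, bins_b, n_bins, n_perm=999, seed=0):
    """Permutation p-value for the chi-squared statistic.

    bins_a and bins_b list the bin index (0..n_bins-1) of every point.
    """
    def counts(labels):
        c = [0] * n_bins
        for k in labels:
            c[k] += 1
        return c

    observed, _ = chi2_and_g(counts(bins_a), counts(bins_b))
    pooled = list(bins_a) + list(bins_b)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat, _ = chi2_and_g(counts(pooled[:len(bins_a)]),
                             counts(pooled[len(bins_a):]))
        hits += stat >= observed
    # Add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)
```

With many sparsely filled bins the asymptotic chi-squared reference distribution can be unreliable, which is one reason a resampling p-value may be preferable.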
56. Conclusion
- Quickly categorize data points non-parametrically
- Compare bins
- Identify which variables change most
- Questions or comments to ervance _at_ stat.duke.edu