Statistical Approaches to Mining Multivariate Data Streams

1
Statistical Approaches to Mining Multivariate
Data Streams
  • Eric Vance, Duke University
  • Department of Statistical Science
  • David Banks, Duke University
  • Tamraparni Dasu, AT&T Labs - Research

JSM, July 31, 2007, Salt Lake City, Utah
2
Data Streams
  • Huge amounts of complex data
  • Rapid rate of accumulation
  • One-time access to raw data

3
Change Detection
  • Problem: Change detection in complicated data
    streams
  • Three criteria
    • Nonparametric
    • Fast
    • Statistical guarantees

4
E-Commerce Server Data
  • Data description
    • One server in a network of servers
    • Data polled every 5 minutes
    • 27 variables
    • 5-week time period
  • Quality issues
    • Many variables unimportant or unchanging
    • Missing data

5
E-Commerce Server Data
  • Variable Elimination
    • Several variables remain constant (Total Swap)
    • Predictable and non-informative (Used SysDisk
      Space)
    • Correlated (CPU Used, CPU User)
  • 6 Variables Selected
    • CPU Used
    • Number of Procs
    • Number of Threads
    • Ping Latency
    • Used Swap
    • Log Packets = log(In Packets Out Packets)
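
The first and third elimination rules above (constant variables, correlated variables) can be sketched in a few lines. This is only an illustration, not the authors' procedure: the function name `screen_variables`, the 0.95 correlation cutoff, and the synthetic data are all assumptions, and the "predictable and non-informative" rule is not covered because the slides do not say how it was judged.

```python
import numpy as np

def screen_variables(X, names, corr_threshold=0.95):
    """Keep columns that vary and are not near-duplicates of a kept column.

    corr_threshold is a hypothetical cutoff; the slides do not give one.
    """
    X = np.asarray(X, dtype=float)
    kept = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if col.max() == col.min():
            continue  # constant variable, e.g. Total Swap
        redundant = any(
            abs(np.corrcoef(col, X[:, k])[0, 1]) > corr_threshold for k in kept
        )
        if not redundant:
            kept.append(j)
    return [names[j] for j in kept]

# Synthetic stand-in data (not the server data from the talk).
rng = np.random.default_rng(0)
n = 200
cpu_used = rng.normal(50, 10, n)
cpu_user = 0.9 * cpu_used + rng.normal(0, 0.1, n)  # nearly a copy of cpu_used
total_swap = np.full(n, 4096.0)                    # never changes
ping = rng.normal(20, 5, n)

X = np.column_stack([cpu_used, cpu_user, total_swap, ping])
names = ["CPU Used", "CPU User", "Total Swap", "Ping Latency"]
selected = screen_variables(X, names)
print(selected)
```

Because the loop keeps the first member of each correlated group, column order decides which of CPU Used and CPU User survives.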

6
Daily Boxplots
Panels: CPU, Procs, Threads
7
Daily Boxplots
Panels: Ping, Swap, Packets
8
Our Approach
  • Partitioning scheme (Multivariate Histogram)
    • Data depth: Rank each point by its Mahalanobis
      distance from the center of the data
    • Data pyramid: Determine which direction in the
      data is most extreme
  • Profile-based comparison
    • Identify changes in profiles over time (week to
      week)

9
Partitioning Example in 2D
  • 5 center-outward depth layers
  • 4 pyramids

  • Point A: depth 4, pyramid +y
  • Point B: depth 3, pyramid -x
  • Point C: depth 1, pyramid +x
10
Identify Depth and Direction
  • Compute the center of the comparison Data Sphere
  • Calculate Mahalanobis distance for each point
  • Determine which of the 5 depth quantiles each
    point falls in
  • Determine direction of greatest variation for
    each point
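
A minimal sketch of these four steps, under stated assumptions. The slides do not define the details, so the following are guesses: layer 1 is taken to be the innermost layer, layer cutoffs are the 20/40/60/80 percent quantiles of the reference distances, and the "direction of greatest variation" is taken to be the coordinate with the largest absolute deviation after dividing by its standard deviation. The function names and the synthetic reference sample are hypothetical.

```python
import numpy as np

def mahalanobis(X, center, cov):
    """Mahalanobis distance of each row of X from center."""
    diff = np.asarray(X, dtype=float) - center
    cov_inv = np.linalg.inv(cov)
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

def fit_partition(reference, n_layers=5):
    """Center, covariance, and depth-layer cutoffs from a comparison sample."""
    reference = np.asarray(reference, dtype=float)
    center = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)
    dist = mahalanobis(reference, center, cov)
    probs = np.arange(1, n_layers) / n_layers   # 0.2, 0.4, 0.6, 0.8
    edges = np.quantile(dist, probs)
    return center, cov, edges

def assign_bins(X, center, cov, edges):
    """Return (layer, axis, sign) per point; layer 1 = innermost (assumed)."""
    X = np.asarray(X, dtype=float)
    layers = np.searchsorted(edges, mahalanobis(X, center, cov), side="right") + 1
    z = (X - center) / np.sqrt(np.diag(cov))    # per-variable standardized deviation
    axes = np.argmax(np.abs(z), axis=1)         # most extreme coordinate
    signs = np.sign(z[np.arange(len(X)), axes]).astype(int)
    return layers, axes, signs

# Synthetic 2D reference sample and three made-up query points.
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 2))
center, cov, edges = fit_partition(reference)

points = np.array([[0.1, 3.0], [-2.0, 0.5], [1.0, 0.1]])
layers, axes, signs = assign_bins(points, center, cov, edges)
ref_layers, _, _ = assign_bins(reference, center, cov, edges)
print(layers, axes, signs)
```

With d variables this gives 2d pyramids (one per signed axis), so 4 in 2D as on the earlier slide and 12 for the 6 selected variables. A bin is then a (layer, axis, sign) triple.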

11
Data Partition in 6 Dimensions
12-26
Partitioning Into Bins
Figure build over 15 slides; variable labels appear in the order CPU, Procs, Threads, Ping, Swap, Packets.
27-33
Partitioning Into Bins: Weeks 1 and 2
Figure build over 7 slides; labels shown are Swap (slides 28-31) and Threads (slides 32-33).
34
6 Variables Over 5 Weeks
35
Week 1 Results
36
Week 2 Results
37
6 Variables Over 5 Weeks
38
6 Variables Over 5 Weeks
Threads
39
Week 3 Results
40
Week 3 Results
Threads
41
Week 4 Results
42
Week 4 Results
Procs
43
Week 4 Results
Threads
44
6 Variables Over 5 Weeks
45
6 Variables Over 5 Weeks
Packets
46
Week 5 Results
47
Week 5 Results
Packets
48
Trimmed Mean and Covariance
  • Applying partitioning method using a trimmed mean
    and trimmed covariance matrix
  • Use the 90% most central data points
  • Recompute center
  • Recompute Mahalanobis distances using new center
    and new covariance matrix
  • Bins are more uniformly filled
  • More points appear closer to trimmed center
  • More variables become most extreme
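
One way the trimming step might look, as a sketch only. It assumes the slide's "90" means the most central 90 percent of points, ranked by Mahalanobis distance from the untrimmed center; `trimmed_center_and_cov` and `keep_fraction` are hypothetical names, and the data are synthetic.

```python
import numpy as np

def trimmed_center_and_cov(X, keep_fraction=0.9):
    """Mean and covariance recomputed from the most central points.

    keep_fraction=0.9 is an assumption about the slide's "90".
    """
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    diff = X - center
    # Squared Mahalanobis distance from the untrimmed center.
    d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    n_keep = int(np.ceil(keep_fraction * len(X)))
    central = X[np.argsort(d2)[:n_keep]]
    return central.mean(axis=0), np.cov(central, rowvar=False)

# Synthetic example: 95 well-behaved points plus 5 gross outliers.
rng = np.random.default_rng(1)
bulk = rng.normal(size=(95, 2))
outliers = np.full((5, 2), 10.0)
X = np.vstack([bulk, outliers])

raw_center = X.mean(axis=0)
trim_center, trim_cov = trimmed_center_and_cov(X)
print(raw_center, trim_center)
```

The trimmed center and covariance would then replace the untrimmed ones when Mahalanobis distances and depth layers are recomputed.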

49
Trimmed Comparison Week 1
50
Trimmed Results Week 1
51
Trimmed Results Week 2
52
Trimmed Results Week 3
53
Trimmed Results Week 4
54
Trimmed Results Week 5
55
Multivariate Tests of Statistical Significance
  • Multinomial test
  • Chi-squared test, G-test
  • Bootstrap and Resampling
  • Bayesian methods
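
As a rough illustration of the chi-squared test, G-test, and resampling items, the sketch below compares two vectors of bin counts, for example one week's profile against another's. The counts are invented, the function names are hypothetical, and the slides do not say how the tests were actually set up. The Bayesian item is not covered.

```python
import math
import random

def chi2_and_g(counts_a, counts_b):
    """Pearson chi-squared and G statistics for a 2 x k table of bin counts."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    total = total_a + total_b
    chi2 = g = 0.0
    for a, b in zip(counts_a, counts_b):
        col = a + b
        if col == 0:
            continue  # bin empty in both samples contributes nothing
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * col / total
            chi2 += (observed - expected) ** 2 / expected
            if observed > 0:
                g += 2.0 * observed * math.log(observed / expected)
    return chi2, g

def permutation_p_value(counts_a, counts_b, n_perm=500, seed=0):
    """Resampling p-value: shuffle bin labels between the two samples."""
    rng = random.Random(seed)
    k = len(counts_a)
    labels = [i for i in range(k) for _ in range(counts_a[i] + counts_b[i])]
    n_a = sum(counts_a)
    observed, _ = chi2_and_g(counts_a, counts_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        perm_a, perm_b = [0] * k, [0] * k
        for lab in labels[:n_a]:
            perm_a[lab] += 1
        for lab in labels[n_a:]:
            perm_b[lab] += 1
        stat, _ = chi2_and_g(perm_a, perm_b)
        if stat >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Invented bin counts for two weeks over four bins.
week_1 = [40, 30, 20, 10]
week_2 = [10, 20, 30, 40]

chi2_same, g_same = chi2_and_g(week_1, week_1)
chi2_shift, g_shift = chi2_and_g(week_1, week_2)
p_shift = permutation_p_value(week_1, week_2)
print(chi2_shift, g_shift, p_shift)
```

Under the usual large-sample approximation both statistics are referred to a chi-squared distribution with (number of non-empty bins - 1) degrees of freedom. The permutation version avoids that approximation when many bins are sparsely filled.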

56
Conclusion
  • Quickly categorize data points non-parametrically
  • Compare bins
  • Identify which variables change most
  • Questions or comments to ervance _at_ stat.duke.edu