Title: Multisite Internet Data Analysis
- Alfred O. Hero, Clyde Shih, David Barsic
- University of Michigan - Ann Arbor
hero_at_eecs.umich.edu - http://www.eecs.umich.edu/hero
- Network Data Collection
- Distributed Data Analysis
- Dimension Reduction
- Model-Based Data Analysis
- Conclusions
Research supported in part by NSF CCR-0325571
1. Network Data Collection
- Objectives
  - Global monitoring centers aggregate statistics from sites distributed around the network to detect, classify, or estimate global network state while satisfying information privacy constraints.
  - Local collection sites gather data relevant to local network state and share information as necessary to enhance local analysis.
- Types of data measured
  - Active: queries and requests, packet probes
  - Passive: netflow, router fields, honeypots, backscatter
[Figure: network diagram spanning ISP 1, ISP 2, and ISP 3, with labels Monitoring Center, Data collection site, Local data collection and probing site, and Data collector.]
Abilene Netflow Data
[Table: flow statistics by protocol for Dataset 1 and Dataset 2. Columns: Protocol, No. Flows, Avg. Duration, Std. Duration, Avg. Packets, Std. Packets, Avg. Bytes, Std. Bytes.]
Abilene Netflow Data
[Table: flow statistics by router for Dataset 1 and Dataset 2. Columns: Router, No. Flows, Avg. Duration, Std. Duration, Avg. Packets, Std. Packets, Avg. Bytes, Std. Bytes.]
Challenges and Approaches
- Challenges
  - High-dimensional measurement space
  - Non-linear dependencies and non-stationarity
  - Privacy and proprietary concerns
  - Insufficient bandwidth for continuously sampled data
- Approaches
  - Dimension reduction
  - Model-based distributed inference
  - Controlled information sharing
  - Hierarchical and modular collection/analysis
Hierarchical Architecture
2. Distributed Data Analysis
[Figure: three data collection sites, labeled Site A, Site B, and Site C.]
- Hypothesis: data collected at sites A, B, C follow a statistical distribution defined over a lower-dimensional manifold.
- Overall objective: find distributed strategies to perform reliable statistical inference with a minimum amount of data sharing.
2.1 Distributed Dimension Reduction
[Figure: diagram with labels Unknown Manifold, Unknown Embedding, Unknown Distribution, Sampling, and Observed Sample.]
Geodesic Entropic Graphs: A Planar Sample and its Euclidean MST
GMST Dimension Estimation
GMST estimates: d = 13, H = 120 (bits)
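The GMST estimator rests on how fast the length of a minimal spanning tree grows with sample size. The block below is a minimal sketch of that idea in pure Python, under stated assumptions rather than a reproduction of the authors' implementation: it uses Euclidean instead of geodesic edge lengths (reasonable only for flat or nearly flat manifolds), fits the log-log growth rate by ordinary least squares, and returns only the dimension, not the entropy. The function names `mst_length` and `estimate_dimension` are hypothetical.

```python
import math
import random


def mst_length(points, gamma=1.0):
    """Sum of edge lengths raised to gamma over the Euclidean MST (Prim, O(n^2))."""
    n = len(points)
    if n < 2:
        return 0.0
    # best[k] = distance from point k to the nearest point already in the tree
    best = [math.dist(points[0], p) for p in points]
    in_tree = [False] * n
    in_tree[0] = True
    total = 0.0
    for _ in range(n - 1):
        # attach the closest point that is not yet in the tree
        j = min((k for k in range(n) if not in_tree[k]), key=best.__getitem__)
        total += best[j] ** gamma
        in_tree[j] = True
        for k in range(n):
            if not in_tree[k]:
                d = math.dist(points[j], points[k])
                if d < best[k]:
                    best[k] = d
    return total


def estimate_dimension(points, sizes, gamma=1.0, trials=3, rng=None):
    """Estimate intrinsic dimension from the growth of MST length with n.

    Assumes L(n) ~ c * n**((d - gamma) / d), so the least-squares slope a of
    log L against log n gives d = gamma / (1 - a).
    """
    rng = rng or random.Random(0)
    xs, ys = [], []
    for n in sizes:
        for _ in range(trials):
            sub = rng.sample(points, n)  # random subsample of size n
            xs.append(math.log(n))
            ys.append(math.log(mst_length(sub, gamma)))
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return gamma / (1.0 - slope)


# Synthetic check: a flat 2-D sheet embedded in 3-D should give a value near 2.
rng = random.Random(1)
plane = [(rng.random(), rng.random(), 0.0) for _ in range(400)]
d_hat = estimate_dimension(plane, sizes=(50, 100, 200, 400), rng=rng)
print(round(d_hat, 2))
```

Boundary effects and the small sample sizes bias the estimate somewhat, so the printed value should be read as approximate. A geodesic version would replace `math.dist` with shortest-path distances on a neighborhood graph.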
Distributed GMST Estimator
- Principal MST convergence result (BHH theorem)
- Distributed BHH (aggregation rule)
- Tight upper and lower bounds on the limit if rooted dual graphs [Yukich97] are exchanged among sites
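For reference, the centralized BHH (Beardwood-Halton-Hammersley) limit can be written as below. This is the standard statement for n i.i.d. points with compactly supported density f on a d-dimensional domain and edge exponent 0 < γ < d. The symbols L_γ, β_{d,γ}, and α are notation assumed here, and only the centralized limit is covered, not the distributed aggregation rule.

```latex
L_\gamma(\mathcal{X}_n) = \min_{T \text{ spanning } \mathcal{X}_n} \sum_{e \in T} |e|^{\gamma},
\qquad
\lim_{n\to\infty} \frac{L_\gamma(\mathcal{X}_n)}{n^{(d-\gamma)/d}}
  = \beta_{d,\gamma} \int f^{(d-\gamma)/d}(x)\,dx \quad \text{(a.s.)}

% Equivalent form in terms of the Renyi entropy of order \alpha = (d-\gamma)/d:
\log \frac{L_\gamma(\mathcal{X}_n)}{n^{\alpha}}
  \;\longrightarrow\; \log \beta_{d,\gamma} + (1-\alpha)\, H_\alpha(f),
\qquad
H_\alpha(f) = \frac{1}{1-\alpha} \log \int f^{\alpha}(x)\,dx
```

The second form shows why a log-log fit of MST length against n yields both quantities: the slope determines d, and the intercept determines the entropy up to the constant β_{d,γ}.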
2.2 Distributed Model-based Inference
- Global likelihood model
- Global M-estimator recursion
- Global Fisher score function
- Local Fisher score functions
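One way these four quantities could relate is sketched below, assuming K sites whose data x_k are independent given the parameter θ. The notation (K, x_k, f_k, μ, M) is assumed here, and the recursion shown, a scaled score step, is an illustrative choice that may differ from the recursion on the original slide.

```latex
\begin{align*}
\ell(\theta) &= \log f(x_1,\dots,x_K;\theta) = \sum_{k=1}^{K} \log f_k(x_k;\theta)
  && \text{(global log-likelihood)} \\
s(\theta) &= \nabla_\theta\, \ell(\theta) = \sum_{k=1}^{K} s_k(\theta),
  \qquad s_k(\theta) = \nabla_\theta \log f_k(x_k;\theta)
  && \text{(global and local Fisher scores)} \\
\theta^{(i+1)} &= \theta^{(i)} + \mu\, M^{-1} s\big(\theta^{(i)}\big)
  && \text{(recursion, with } M \text{ positive definite)}
\end{align*}
```

Under independence the global score is a sum of local scores, so each site can contribute to the update by sending a single vector of the same dimension as θ.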
Distributed M-estimator
[Figure: diagram with sites labeled A and B.]
Properties
- Communication requirement is 2p bytes/update/site.
- If data are independent, the iterates attain stationary points of the global likelihood.
- All local MLEs are available to each site.
- For a multimodal likelihood, improvement on local MLEs can be achieved by aggregation under a mixture model.
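The communication pattern behind these properties can be illustrated with a small sketch. It assumes a deliberately simple model that is not from the slides: each site holds i.i.d. unit-variance Gaussian samples with an unknown p-dimensional mean. Each site then returns only its p-dimensional local score per update and never its raw data. The function names and the step-size rule are hypothetical.

```python
import random


def local_score(theta, data):
    """Local score for a unit-variance Gaussian mean model: sum_i (x_i - theta)."""
    p = len(theta)
    return [sum(x[j] - theta[j] for x in data) for j in range(p)]


def distributed_score_ascent(sites, theta0, step, iterations=50):
    """Broadcast theta, collect one p-vector per site, and take a score step."""
    theta = list(theta0)
    for _ in range(iterations):
        scores = [local_score(theta, data) for data in sites]  # computed locally
        global_score = [sum(s[j] for s in scores) for j in range(len(theta))]
        theta = [t + step * g for t, g in zip(theta, global_score)]
    return theta


rng = random.Random(0)
true_mean = (1.0, -2.0)  # synthetic ground truth for the demonstration
sites = [
    [tuple(rng.gauss(m, 1.0) for m in true_mean) for _ in range(100)]
    for _ in range(5)
]
n_total = sum(len(data) for data in sites)

# Step size 0.5 / n_total halves the distance to the pooled mean at every update.
theta_hat = distributed_score_ascent(sites, theta0=(0.0, 0.0), step=0.5 / n_total)
pooled = [sum(x[j] for data in sites for x in data) / n_total for j in range(2)]
print(theta_hat, pooled)
```

For this unimodal model the iterates converge to the pooled-data MLE, which is the only stationary point. With a multimodal likelihood the same loop can stop at a local maximum, which is what motivates the aggregation step.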
Global Likelihood Function
[Figure: global likelihood function with the local MLEs marked by x.]
Key Theoretical Result
- The asymptotic distribution of local estimates is a Gaussian mixture dependent on the global likelihood.
- Parameters
Proof: asymptotic normal theory of local maxima [Huber67]; see [BlattHero2003].
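One way to write such a mixture, in notation assumed here: let θ_1, …, θ_M be the local maxima of the expected log-likelihood, n the number of samples per site, and π_m the probability that a site's estimator converges to θ_m. The covariance is the standard sandwich form from the asymptotic theory of M-estimators; the exact parameterization in BlattHero2003 may differ.

```latex
\hat\theta \;\overset{\text{approx.}}{\sim}\; \sum_{m=1}^{M} \pi_m\,
  \mathcal{N}\!\Big(\theta_m,\; \tfrac{1}{n}\, A_m^{-1} B_m A_m^{-1}\Big),
\qquad
A_m = E\big[\nabla_\theta^2 \log f(x;\theta_m)\big],
\quad
B_m = E\big[\nabla_\theta \log f(x;\theta_m)\, \nabla_\theta \log f(x;\theta_m)^{T}\big]
```

At the true parameter of a correctly specified model, A_m = -B_m = -F, so that component's covariance reduces to the inverse Fisher information matrix divided by n. At other local maxima the sandwich form generally does not simplify this way.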
Local Estimator Aggregation Algorithm
- Sample covariance analysis
- Estimation of Gaussian mixture parameters (FS, EM)
- Aggregation to final estimate
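The mixture-fitting and aggregation steps can be sketched for scalar local estimates as follows. Several choices here are assumptions rather than the authors' algorithm: plain EM with two components stands in for the mixture estimator, the sample covariance analysis step is omitted, and the final estimate is the mean of the component whose variance best matches the inverse Fisher information divided by the per-site sample size. The names `em_two_gaussians`, `aggregate`, and `inverse_fim` are hypothetical, and the local estimates are synthetic.

```python
import math
import random


def em_two_gaussians(values, iterations=100):
    """EM for a 1-D two-component Gaussian mixture; returns [(weight, mean, var)]."""
    n = len(values)
    mean_all = sum(values) / n
    var_all = sum((v - mean_all) ** 2 for v in values) / n or 1.0
    # crude initialisation: components at the extremes, pooled variance
    comps = [(0.5, min(values), var_all), (0.5, max(values), var_all)]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value
        resp = []
        for v in values:
            dens = [
                w * math.exp(-0.5 * (v - m) ** 2 / s2) / math.sqrt(2 * math.pi * s2)
                for w, m, s2 in comps
            ]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: responsibility-weighted weight, mean, and variance
        new_comps = []
        for c in range(2):
            r = [row[c] for row in resp]
            rc = sum(r) or 1e-300
            mean = sum(ri * v for ri, v in zip(r, values)) / rc
            var = sum(ri * (v - mean) ** 2 for ri, v in zip(r, values)) / rc
            new_comps.append((rc / n, mean, max(var, 1e-12)))
        comps = new_comps
    return comps


def aggregate(local_estimates, n_per_site, inverse_fim):
    """Return the mean of the component whose variance best matches inverse FIM / n."""
    comps = em_two_gaussians(local_estimates)

    def mismatch(comp):
        _, mean, var = comp
        return abs(math.log(n_per_site * var / inverse_fim(mean)))

    _, mean, _ = min(comps, key=mismatch)
    return mean


rng = random.Random(0)
n_per_site = 100
# Synthetic local estimates: 140 sites near a global maximum at 1.0 with
# variance (inverse FIM) / n = 0.01, and 60 sites near a local maximum at 4.0.
local_estimates = [rng.gauss(1.0, 0.1) for _ in range(140)] + [
    rng.gauss(4.0, 0.3) for _ in range(60)
]
final = aggregate(local_estimates, n_per_site, inverse_fim=lambda theta: 1.0)
print(round(final, 3))
```

Averaging within the selected cluster uses every site that reached the global maximum, so the final estimate is less variable than any single local MLE in that cluster.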
Simple Example
IID observation model:
- Each site observes a 2-component Gaussian mixture
- Identical component variances
- Unknown mixing parameters
- Unknown component means
- 200 data collection sites
- 100 samples/site
- CEM2 algorithm implemented for estimation and aggregation
[Figure: ambiguity function, with the local maximum and the global maximum marked.]
Clustering and Discrimination
[Figure: plot with axes m1 and m2 showing the local maximum, the global maximum, the inverse FIM, and the covariances empirically estimated via CEM2.]
Validation of Key Result
[Figure: QQ plots for Cluster 1 and Cluster 2.]
Conclusions
- Lossless distributed dimension reduction and model-based inference require:
  - Reliable local inference methods
  - Aggregation rules for combining local statistics
- Information sharing constraints?
- Effects of bandwidth constraints: data compression?
- Tracking in dynamical models?
References
- A. O. Hero, B. Ma, O. Michel, and J. D. Gorman, "Applications of entropic graphs," IEEE Signal Processing Magazine, Sept. 2002.
- J. Costa and A. O. Hero, "Manifold learning with geodesic minimal spanning trees," accepted in IEEE T-SP (Special Issue on Machine Learning), 2004.
- D. Blatt and A. Hero, "Asymptotic distribution of log-likelihood maximization based algorithms and applications," in Energy Minimization Methods in Computer Vision and Pattern Recognition (EMM-CVPR), Eds. M. Figueiredo, A. Rangarajan, J. Zerubia, Springer-Verlag, 2003.
- M. F. Shih and A. O. Hero, "Unicast-based inference of network link delay distributions using mixed finite mixture models," IEEE T-SP, vol. 51, no. 9, pp. 2219-2228, Aug. 2003.
- N. Patwari, A. O. Hero, and B. Sadler, "Hierarchical censoring sensors for change detection," Proc. of SSP, St. Louis, Sept. 2003.
Information Sharing Game
Addition of Other Discriminants
- Value added due to transmission of likelihood values