Title: Things about Trace Analysis
1. Things about Trace Analysis
- Wei-jen Hsu
- In-class presentation for CIS6930
- wjhsu_at_ufl.edu
- (Advisor: Ahmed Helmy)
2. Objective
- More background knowledge related to trace-based study
- Details about the trace format: an intro for one of the assignments
- Share the experience in trace analysis
3. Why trace analysis?
- Traces provide realism about how the system works
- Verification of established system
- Diagnosis of system operation (identify faults)
- Identifying design flaws
- Large-scale properties (e.g. self-similar traffic)
- Understand how a new system works
- Provide domain knowledge for analysis work
- Verifying an idea
4. Typical Work Flow for Trace Analysis
- Build the system
- Identify point(s) of trace collection and the methodology used
- Obtain the data
- Clean-up and sanity check
- Analyze the data and post processing
- Explain the results
- Apply the results to further study or modify the existing system
5. WLAN Traces Study
- It started around 2000
- WLAN was new; people wanted to understand how people used it (usage study)
- Surveys vs. traces
- Work by Tang and Baker ('00) and Kotz and Essien ('02) are pioneering examples
- Statistics of usage (number of users, amount of traffic, etc.)
6. WLAN Traces Study
- Mobility-related
- MIT work (home location, prevalence, and persistence)
- UCSD (PDA users)
- WLAN mobility model (INFOCOM05, T-model, T-model)
- Other user properties
- Handoff
- Pause time distribution
7. Trace Format
- For association
- Usually with format
- (Node_id, start_time, location, end_time)
- But there are various ways to get there.
- Syslog: event-based
- SNMP: polling
- USC raw trace
- Wireless association (time, start/stop, switch-port, MAC)
- DHCP log (time, MAC, IP)
- Traffic log
8. Trace Format Example
- USC wireless association trace
- (Time Start/Stop Switch_IP Switch_port MAC_of_node)
- Mon Oct 10 011652 Start 172.16.8.245 31005 03065f9c0ae
- Mon Oct 10 011700 Stop 172.16.8.245 21044 0e359964d1
- Mon Oct 10 011702 Start 172.16.8.245 31015 01124dfc03a
- USC DHCP trace
- (Time IP_of_node MAC_of_node)
- Jan 27 002119 207.151.229.50 018f310ea4c
- Jan 27 002120 207.151.232.184 018de33792
- Jan 27 002120 207.151.229.50 018f310ea4c
- USC traffic trace
- (Start_time End_time Destination_IP_port Source_IP_port protocol (TCP=6, UDP=17) ? Packet_number Data_size)
- 0127.235942.925 0127.235944.905 128.125.253.143 53 207.151.239.208 1795 17 0 3 1368
- 0127.235942.925 0127.235952.677 63.236.56.237 80 207.151.239.208 3257 6 2 4 192
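As a rough illustration of how the raw wireless association records could be turned into the (Node_id, start_time, location, end_time) form, here is a minimal Python sketch. It assumes whitespace-separated fields with the time written as HHMMSS, as the records appear above. The names parse_assoc_line, build_sessions, and assumed_year are hypothetical, the year is a guess because the records carry none, and the sample lines are made up rather than taken from the USC trace.

```python
from datetime import datetime


def parse_assoc_line(line, assumed_year=2005):
    """Parse one association record laid out as
    weekday month day HHMMSS Start/Stop switch_ip switch_port mac.

    assumed_year is an assumption: the records themselves carry no year.
    """
    _weekday, month, day, hhmmss, event, switch_ip, switch_port, mac = line.split()
    timestamp = datetime.strptime(
        f"{assumed_year} {month} {day} {hhmmss}", "%Y %b %d %H%M%S"
    )
    return timestamp, event, (switch_ip, switch_port), mac


def build_sessions(lines):
    """Pair each Start with the next Stop seen for the same (MAC, switch port)."""
    open_sessions = {}  # (mac, location) -> start time
    sessions = []       # (node_id, start_time, location, end_time)
    for line in lines:
        timestamp, event, location, mac = parse_assoc_line(line)
        key = (mac, location)
        if event == "Start":
            open_sessions[key] = timestamp
        elif event == "Stop" and key in open_sessions:
            sessions.append((mac, open_sessions.pop(key), location, timestamp))
    return sessions


# Made-up records in the same layout (not real trace data).
example = [
    "Mon Oct 10 011652 Start 10.0.0.1 31005 aabbccddeeff",
    "Mon Oct 10 011700 Stop 10.0.0.1 31005 aabbccddeeff",
]
sessions = build_sessions(example)
```

A Stop without a matching Start is silently dropped here; in a real clean-up pass such records are worth counting, since they are exactly the kind of exception in the data that a sanity check should surface.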
9. Work with the Trace
- An exercise
- Does the Encounter-Relationship graph change with respect to time?
- From WLAN traces,
- We find encounters to measure inter-node relationship
- Note: Is this a good assumption?
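One way to extract encounters from association sessions is sketched below. It assumes that two nodes "encounter" each other when they are associated with the same location during overlapping time intervals. That definition is an assumption of this sketch (the slide itself asks whether it is a good one), and find_encounters and the sample sessions are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def find_encounters(sessions):
    """sessions: iterable of (node_id, start_time, location, end_time).

    Returns (node_a, node_b, location, overlap_start, overlap_end) tuples for
    every pair of sessions at the same location that overlap in time.
    """
    by_location = defaultdict(list)
    for node, start, location, end in sessions:
        by_location[location].append((node, start, end))

    encounters = []
    for location, visits in by_location.items():
        for (a, s1, e1), (b, s2, e2) in combinations(visits, 2):
            if a == b:
                continue  # a node does not encounter itself
            overlap_start, overlap_end = max(s1, s2), min(e1, e2)
            if overlap_start < overlap_end:
                encounters.append((a, b, location, overlap_start, overlap_end))
    return encounters


# Made-up sessions with times in seconds.
sessions = [
    ("n1", 0, "AP1", 100),
    ("n2", 50, "AP1", 150),
    ("n3", 120, "AP1", 200),
    ("n4", 0, "AP2", 300),
]
encounters = find_encounters(sessions)
```

The pairwise comparison is quadratic in the number of sessions per location, which is fine for a sketch but may need a sweep over sorted start times on a large trace.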
10. Encounter distribution
- How many other nodes does a node encounter?
[Figure: Prob. (unique encounter fraction > x)]
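The quantity on the plot can be computed along these lines. This sketch assumes "unique encounter fraction" means the fraction of the rest of the population that a node encounters at least once; that reading, and the helper names, are assumptions.

```python
def unique_encounter_fraction(pairs, population):
    """pairs: iterable of (node_a, node_b) encounters; population: all node ids."""
    partners = {node: set() for node in population}
    for a, b in pairs:
        partners[a].add(b)
        partners[b].add(a)
    others = len(population) - 1
    return {node: len(seen) / others for node, seen in partners.items()}


def ccdf(values, x):
    """Empirical Prob.(value > x)."""
    return sum(v > x for v in values) / len(values)


# Made-up example: four nodes, two encounter pairs.
population = ["n1", "n2", "n3", "n4"]
pairs = [("n1", "n2"), ("n2", "n3")]
fractions = unique_encounter_fraction(pairs, population)
tail = ccdf(list(fractions.values()), 0.5)
```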
11. Encounter-Relationship graph
- Imagine that there is a link connecting each pair of nodes that have ever encountered each other
- What does the graph look like?
- But is the ER graph a connected graph? What are its properties?
12. Encounter-Relationship graph
- To our surprise, ER graphs are connected!
[Figure: Disconnected Ratio (%)]
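A small sketch of building the ER graph and checking connectivity follows. The slides do not define the disconnected ratio; here it is assumed to be the fraction of node pairs with no path between them. All function names are hypothetical.

```python
from collections import deque


def build_er_graph(nodes, pairs):
    """Undirected graph: a link joins two nodes if they ever encountered."""
    graph = {node: set() for node in nodes}
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connected_components(graph):
    seen, components = set(), []
    for source in graph:
        if source in seen:
            continue
        component, queue = {source}, deque([source])
        seen.add(source)
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for v in graph[u]:
                if v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return components


def disconnected_ratio(graph):
    """Assumed definition: fraction of node pairs with no connecting path."""
    n = len(graph)
    total_pairs = n * (n - 1) // 2
    if total_pairs == 0:
        return 0.0
    connected_pairs = sum(
        len(c) * (len(c) - 1) // 2 for c in connected_components(graph)
    )
    return 1 - connected_pairs / total_pairs


# Made-up example: n4 never encounters anyone.
graph = build_er_graph(["n1", "n2", "n3", "n4"], [("n1", "n2"), ("n2", "n3")])
components = connected_components(graph)
ratio = disconnected_ratio(graph)
```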
13. Encounter-Relationship graph
- What are the graph properties of the relationship graphs?
- High clustering, as in a regular graph
- Low path length, as in a random graph
14. Encounter-Relationship graph
- Relationship graphs are small-world graphs
- High clustering coefficient, low avg. path length
[Figure: Normalized CC and PL]
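The two small-world metrics can be computed with a short sketch like the one below, using the usual definitions: the average of per-node clustering coefficients, and the mean shortest-path length over reachable pairs. The normalization against regular and random graphs used in the figure is not reproduced, and the function names are hypothetical.

```python
from collections import deque


def average_clustering(graph):
    """graph: dict node -> set of neighbours (undirected)."""
    total = 0.0
    for node, neighbors in graph.items():
        k = len(neighbors)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        links = sum(
            1 for u in neighbors for v in neighbors if u < v and v in graph[u]
        )
        total += 2 * links / (k * (k - 1))
    return total / len(graph)


def average_path_length(graph):
    """Mean BFS distance over all ordered pairs that are reachable."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Made-up graph: a triangle a-b-c with d attached to c.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
cc = average_clustering(graph)
pl = average_path_length(graph)
```

Unreachable pairs are ignored rather than counted as infinite, so on a disconnected graph the path length describes only the connected parts.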
15. Work with the Trace
- An exercise
- Does the Encounter-Relationship graph change with respect to time?
- Chop the trace into multiple segments
- Analyze the average clustering coefficient and average path length of the resultant graph
- How to deal with changing population?
- Does the encounter duration matter?
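For the exercise, one possible starting point is to bucket encounters into fixed-length segments and build one edge set per segment. The sketch below makes two simplifying assumptions: an encounter belongs to the segment in which it starts, and short encounters can be dropped with an optional min_duration filter to probe whether duration matters. The name segment_edges and both parameters are hypothetical.

```python
from collections import defaultdict


def segment_edges(encounters, segment_length, min_duration=0):
    """encounters: iterable of (node_a, node_b, start, end), times in seconds.

    Returns {segment_index: set of frozenset({node_a, node_b})}.
    """
    edges = defaultdict(set)
    for a, b, start, end in encounters:
        if end - start < min_duration:
            continue  # ignore encounters shorter than the chosen minimum
        edges[int(start // segment_length)].add(frozenset((a, b)))
    return dict(edges)


# Made-up encounters.
encounters = [
    ("n1", "n2", 10, 40),
    ("n2", "n3", 120, 125),
    ("n1", "n3", 130, 190),
]
by_segment = segment_edges(encounters, segment_length=100)
filtered = segment_edges(encounters, segment_length=100, min_duration=30)
```

Each segment's node set is simply whoever appears in its edges, so the population differs from segment to segment; how to compare graphs of different sizes is left open, as on the slide.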
16. Work with the Trace
- Ask questions! What should we look for in the trace?
- Its importance
- Its implication
- Its potential usage
- Its alternative solutions
- Apply new techniques to look into the data
- Find/Create interesting data sets
17. Lessons Learned
- You need a lot of patience and care
- Exceptions in the data
- Flaws in your assumptions
- You need a lot of hard-drive space too!
- You need good questions
- For each question there are multiple ways to come up with an answer
- New questions require new data sets and tools
- You need to read a lot of papers
18. More Potential Directions
- Mobility modeling/prediction
- Data mining and clustering
- Behavior-aware service/advertisements
- Behavior-aware routing
- Caveat: Over-generalization from WLAN to futuristic networks (such as DTN)?
- Re-examine assumptions in earlier work
19. Related Skills
- General programming (C/C++)
- Perl/shell script/awk
- Matrix manipulation (MATLAB)
- Statistics software (R)
- http://www.r-project.org/
- Clustering/Machine learning
- Principal component analysis / Singular value decomposition
- http://www.cs.cmu.edu/elaw/papers/pca.pdf
- Data mining? Database analysis?
20. Good Online Resources
- MobiLib
- http://nile.cise.ufl.edu/MobiLib
- Links to various traces, USC trace and some processing tools download
- CRAWDAD
- http://crawdad.cs.dartmouth.edu/
- Various traces download, related papers
21. References
- Stanford: D. Tang and M. Baker, Analysis of a Local-area Wireless Network
- Stanford2: D. Tang and M. Baker, Analysis of a Metropolitan-area Wireless Network
- Dartmouth: D. Kotz and K. Essien, Analysis of a Campus-wide Wireless Network
- Dartmouth2: T. Henderson, D. Kotz, and I. Abyzov, The Changing Usage of a Mature Campus-wide Wireless Network
- MIT/IBM: M. Balazinska and P. Castro, Characterizing Mobility and Network Usage in a Corporate Wireless Local-area Network
22. References
- UCSD: M. McNett and G. Voelker, Access and Mobility of Wireless PDA Users
- UCLA: X. Meng, S. Wong, Y. Yuan, and S. Lu, Characterizing Flows in Large Wireless Data Networks
- USC: D. Bhattacharjee, A. Rao, C. Shah, M. Shah, and A. Helmy, Empirical Modeling of Campus-wide Pedestrian Mobility: Observations on the USC Campus
- USC2: K. Merchant, W. Hsu, H. Shu, C. Hsu, and A. Helmy, Weighted Waypoint Mobility Model and Its Impacts on Ad Hoc Networks
23. References
- Dartmouth: M. Kim and D. Kotz, Methodology for Classifying Mobile Users and Access Points
- Dartmouth: L. Song, D. Kotz, R. Jain, and X. He, Evaluating location predictors with extensive Wi-Fi mobility data
- SIGCOMM01: A. Balachandran, G. Voelker, P. Bahl, and V. Rangan, Characterizing User Behavior and Network Performance in a Public Wireless LAN
- INFOCOM05: C. Tuduce and T. Gross, A Mobility Model Based on WLAN Traces and its Validation
- T-model: D. Lelescu, U. C. Kozat, R. Jain, and M. Balakrishnan, Model T++: an empirical joint space-time registration model
- T-model: R. Jain, D. Lelescu, and M. Balakrishnan, Model T: an empirical model for user registration patterns in a campus wireless LAN
24. More on Mobility Modeling
25. Mobility Observations from WLANs
- Skewed location visiting preferences
- Nodes spend 95% of time at top 5 preferred locations.
- Heavily visited preferred spots
- Periodical re-appearance
- Nodes show up repeatedly at the same location after integer multiples of days.
- Periodical daily/weekly schedules
26. Mobility Observations from WLANs
- Problems of simple random models (random walk, random waypoint, random direction)
- No preferred locations in spatial domain (uniform nodal distribution across space)
- No structure in time domain (homogeneous behavior across time)
- Nodes behave statistically identically to one another
- Benefit: mathematical tractability
- Can we improve realism without sacrificing mathematical tractability?
27. Time-variant Community Model
- Skewed location visiting preferences
- Create communities to be the preferred destinations
- Each node can have its own community
- Periodical re-appearance
- Create structure in time: periods
- Nodes move with different parameters in different periods
- Repetitive structure
28. Time-variant Community Model
- Major trends of mobility characteristics preserved (extensions later)
- In addition, mathematical tractability is retained
29. More on Matrix-based Analysis
30. Introduction
- Widespread WLAN deployments create large-scale infrastructures.
- Large numbers of users lead to large-scale management and design issues.
- We need methods to quantify, summarize, and compare long-run trends (on the order of months) of individual user associations
- Usage model / association model
- Personalized services
- Behavior-aware ads / monetization
- Behavior-aware routing protocols
31. Questions
- Q1. How to quantify user association consistency?
- (Challenge) What is a proper representation of user association, and how do we measure consistency?
- Q2. How do we summarize long-run user association patterns?
- (Challenge) How to utilize existing data reduction techniques?
- Q3. How to group users with similar association patterns?
- (Challenge) How to quantify the similarity of user association patterns?
- How to reduce computational complexity?
- Contribution: Generic methods to address these questions, empirically validated using USC and Dartmouth WLAN traces.
32. Representation of User Association Patterns
- We choose to represent the summary of user association in each day by a single vector.
- For a given day d, the association vector of user i is an n-element vector $a = (a_1, \dots, a_n)$, where $a_j$ is the fraction of online time user i spends at AP$_j$ on day d.
- The elements of a vector sum to 1.
- Use the zero vector for off-line users.
- The elements in the vectors quantify the relative importance (or attraction) of the AP to the user.
- Example: association vector (library, office, class) = (0.2, 0.4, 0.4)
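A daily association vector can be derived from one user's sessions roughly as follows. This is a minimal sketch: association_vector is a hypothetical name, the sessions are assumed to be already restricted to a single day, and the session times are made up so that the result matches the (0.2, 0.4, 0.4) example.

```python
def association_vector(sessions, ap_ids):
    """sessions: (ap_id, start, end) tuples for one user on one day.

    Returns the fraction of online time spent at each AP, in ap_ids order,
    or the zero vector if the user was off-line all day.
    """
    online = {ap: 0.0 for ap in ap_ids}
    for ap, start, end in sessions:
        online[ap] += end - start
    total = sum(online.values())
    if total == 0:
        return [0.0] * len(ap_ids)
    return [online[ap] / total for ap in ap_ids]


aps = ["library", "office", "class"]
# Made-up day: 60 min in the library, 120 min each in the office and in class.
day = [("library", 0, 60), ("office", 60, 180), ("class", 180, 300)]
vector = association_vector(day, aps)
offline = association_vector([], aps)
```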
33. Q1. User Association Consistency
- User i is consistent if its daily association vectors can be grouped into a few clusters (e.g., fewer than 10% of the number of days).
- Evaluation: use hierarchical clustering with the Manhattan distance measure (L1)
- $d(a, b) = \sum_{j=1}^{n} |a_j - b_j|$
- Distance between two vectors is at most 2.
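The bound of 2 follows because each association vector has non-negative entries summing to at most 1, so the L1 distance is at most the sum of the two L1 norms. A tiny sketch with made-up vectors (manhattan is a hypothetical helper name):

```python
def manhattan(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


same = manhattan([0.2, 0.4, 0.4], [0.2, 0.4, 0.4])      # identical days
disjoint = manhattan([1.0, 0.0, 0.0], [0.0, 0.5, 0.5])  # no AP in common
offline = manhattan([1.0, 0.0, 0.0], [0.0, 0.0, 0.0])   # one day off-line
```

The maximum of 2 is reached only when the two days share no AP at all, and a comparison against an off-line (zero) day always gives exactly 1.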
34. Q1. User Association Consistency
- Hierarchical Clustering
- Start: Each vector is a single-member cluster.
- Recursion: The two closest clusters are merged.
- End: Stop when all remaining clusters are farther apart than a threshold
35. Q1. User Association Consistency
[Figure: Distribution of number of clusters under cut-off threshold 0.9]
- 80% of users show at most 9 clusters of behavior modes during the 94-day trace
- Complete link: distance between clusters = distance between the furthest components in the considered clusters
- Observation: many users are multimodal, but with far fewer association modes than the total number of days in the trace period.
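The clustering procedure (single-member start, merge the two closest clusters, stop at a distance threshold, complete-link distance between clusters) can be sketched in plain Python as below. This is an unoptimized illustration rather than the code used for the study; the function names and sample vectors are made up.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def complete_link(c1, c2):
    """Cluster distance = distance between the furthest pair of members."""
    return max(manhattan(a, b) for a in c1 for b in c2)


def cluster(vectors, threshold):
    clusters = [[v] for v in vectors]  # start: one cluster per vector
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = complete_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # all remaining clusters are farther apart than the threshold
        clusters[i] = clusters[i] + clusters[j]  # merge the two closest clusters
        del clusters[j]
    return clusters


# Made-up daily vectors: two "modes" plus one off-line day.
days = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 0.9],
    [0.0, 0.0, 0.0],
]
modes = cluster(days, threshold=0.9)
```

With the 0.9 threshold the off-line day stays in its own cluster, because its distance to any non-zero association vector is exactly 1.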
36. Q2. Summarizing user associations
- Association matrix: concatenate user association vectors for all days into a matrix.
- To summarize, perform SVD and store the top-k eigenvalues/eigenvectors.
- What value of k do we have to use for a good representation of the matrix?
- Captured matrix power
- How much is the reconstruction error?
- Matrix norms: $\|X - X_k\|_p / \|X\|_p$, where $X_k$ is the reconstruction of $X$ from the top-k components
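A sketch of the summarization step with NumPy follows. Two choices here are assumptions rather than statements from the slides: "captured power" is taken as the share of squared singular values held by the top k, and the reconstruction error uses the Frobenius norm as one choice of p. Rows are days and columns are APs; summarize is a hypothetical name and the matrix is made up.

```python
import numpy as np


def summarize(X, k):
    """Return (captured power, relative reconstruction error) for rank k."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    captured = float(np.sum(s[:k] ** 2) / np.sum(s ** 2))
    X_k = (U[:, :k] * s[:k]) @ Vt[:k, :]  # rank-k reconstruction
    error = float(np.linalg.norm(X - X_k) / np.linalg.norm(X))  # Frobenius norm
    return captured, error


# Made-up association matrix with two behavior modes (rank 2).
X = np.array([
    [0.2, 0.4, 0.4],
    [0.2, 0.4, 0.4],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
])
captured1, error1 = summarize(X, k=1)
captured2, error2 = summarize(X, k=2)
```

Under these choices the squared relative error equals one minus the captured power, so the two questions on the slide are two views of the same quantity. An all-zero matrix (a user who is never online) would need a guard against division by zero.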
37. Q2. Summarizing user associations
- Only the top 6 singular vectors are needed to capture at least 90% of power for more than 95% of association matrices
- Reconstruction error of low-rank approximation is low (5 singular vectors give error < 0.05)
- Observation: although users are multi-modal, a few major modes dominate their behavior
40. Q3. Similarity Metrics between Users
- Naive method to compare similarity between users i and j
- Intuition: for every daily association vector of i, if there is a similar association vector for j, then (i, j) have similar behavior.
- From user i, pick the association vector $a_i^d$ of user i on day d.
- Find the association vector of user j, denoted by $a_j^d$, which is the nearest to $a_i^d$
- Find the average of $|a_j^d - a_i^d|$ over all days d.
- Drawback: expensive
- $O(nd^2)$ for each pair
- Lots of file reads for large datasets (reads raw data)
- Need a faster method which reads summaries
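The naive comparison can be sketched as below, assuming the Manhattan distance from Q1 is the measure of "nearest". The name naive_distance and the sample data are made up.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def naive_distance(days_i, days_j):
    """For each daily vector of user i, find the nearest daily vector of
    user j, then average those nearest distances over user i's days."""
    total = 0.0
    for a_i in days_i:
        total += min(manhattan(a_i, a_j) for a_j in days_j)
    return total / len(days_i)


# Made-up users observed over two days at two APs.
user_i = [[1.0, 0.0], [0.0, 1.0]]
user_j = [[1.0, 0.0], [0.5, 0.5]]
d_ij = naive_distance(user_i, user_j)
d_ii = naive_distance(user_i, user_i)
```

The nested loop over days, with an n-element distance inside, is where the O(nd^2) cost per user pair comes from. As written, the measure is not guaranteed to be symmetric in i and j.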
41. Q3. Similarity Metrics between Users
- Compare the similarity of the eigenvectors obtained from SVD.
- Similarity between users is determined by weighted inner products of eigenvectors.
- $w_i$: proportion of power of singular vector i
- $D(U,V) = 1 - Sim(U,V)$
- Are the 2 metrics similar?
- 0.911 correlation coefficient for studied users.
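The exact expression for Sim(U,V) is not shown in the text, so the sketch below uses one plausible reading and should be treated as an assumption: the sum over all pairs of singular vectors of the product of their weights times the absolute inner product, with each weight being that vector's share of squared singular values. Rows are days and columns are APs; all names are hypothetical.

```python
import numpy as np


def weighted_singular_vectors(X, k):
    """Top-k right singular vectors (AP space) and their share of power."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    power = s ** 2
    return power[:k] / power.sum(), Vt[:k, :]


def similarity(X_u, X_v, k):
    """Assumed form: sum_i sum_j w_u[i] * w_v[j] * |u_i . v_j|."""
    w_u, V_u = weighted_singular_vectors(X_u, k)
    w_v, V_v = weighted_singular_vectors(X_v, k)
    inner = np.abs(V_u @ V_v.T)  # |u_i . v_j| for every pair (i, j)
    return float(w_u @ inner @ w_v)


def distance(X_u, X_v, k):
    return 1.0 - similarity(X_u, X_v, k)


# Made-up users: a and b always use one (different) AP; c behaves like a.
user_a = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
user_b = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
user_c = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
d_ac = distance(user_a, user_c, k=1)
d_ab = distance(user_a, user_b, k=1)
```

Only the k weights and vectors per user are needed for a comparison, which is what makes this cheaper than re-reading the raw daily vectors for every pair.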
42. Q3. Similarity Metrics between Users
- Are we able to get clusters with similar users?
- Compare the PDF/CDF for inter- and intra-cluster users (Example: 200 clusters).
43. Q3. Similarity Metrics between Users
- Take users in the same clusters, concatenate the association matrices, perform SVD, and find the power captured by the top k eigenvectors.
- Also take random users, concatenate the eigenvectors, and do the same.
- There is a clear distinction between the 2 clustering methods.
[Figure legend: straightforward = similarity decided based on pair-wise comparison of association vectors; feature-based = similarity decided based on singular vectors]
44. Q3. Similarity Metrics between Users
- For all clusters, use a scatter plot to show the power captured by the top-4 eigenvectors (distance-based cluster vs. random cluster).