Title: Network Simulation and Testing
1. Network Simulation and Testing
- Polly Huang
- EE NTU
- http://cc.ee.ntu.edu.tw/phuang
- phuang_at_cc.ee.ntu.edu.tw
2. Dynamics Papers
- Hongsuda Tangmunarunkit, Ramesh Govindan, and Scott Shenker. Internet Path Inflation Due to Policy Routing. In Proceedings of SPIE ITCom, pages 188-195, Denver, CO, USA, August 2001.
- Lixin Gao. On Inferring Autonomous System Relationships in the Internet. IEEE/ACM Transactions on Networking, 9(6):733-745, December 2001.
- Vern Paxson. End-to-End Internet Packet Dynamics. IEEE/ACM Transactions on Networking, 7(3):277-292, June 1999.
- Craig Labovitz, G. Robert Malan, and Farnam Jahanian. Internet Routing Instability. IEEE/ACM Transactions on Networking, 6(5):515-528, October 1998.
3. Doing Your Own Analysis
- Having a problem
- Need to simulate or to test
- Define experiments
- Base scenarios
- Scaling factors
- Metrics of investigation
4. Base Scenarios
- The source models
- To generate traffic
- The topology models
- To generate the network
- Then?
5. Internet Dynamics
- How traffic flows across the network
- Routing
- Shortest path?
- How failures occur
- Packets dropped
- Routes failed
- i.i.d.?
6. Identifying Internet Dynamics
- Routing Policy
- Packet Dynamics
- Routing Dynamics
7. To the best of our knowledge, we can now generate
- AS-level topology
- Hierarchical router-level topology
8. The Problem
- Does it matter which routing computation we use?
- Equivalently
- Can I just do shortest-path computation?
9. Topology with Policy
- Internet Path Inflation Due to Policy Routing
- Hongsuda Tangmunarunkit, Ramesh Govindan, Scott Shenker
- In Proceedings of SPIE ITCom, pages 188-195, Denver, CO, USA, August 2001.
10. Paper of Choice
- Methodological value
- A simple re-examination type of study
- Strengthens the technical value of prior work
- Technical value
- Actual paths are not the shortest, due to routing policy.
- The routing policy is business-driven and can be quite hard to obtain.
- As shown in this paper, for simulation studies concerning large-scale route path characteristics, a simple shortest-AS routing policy may be sufficient.
11. Inter-AS Routing
[Diagram: a source and a destination connected through AS 1 to AS 5]
12. Hierarchical Routing
[Diagram: a source-to-destination route that follows the AS hierarchy]
13. Flat Routing
[Diagram: a source-to-destination route that ignores AS boundaries]
14. 5 vs. 3
- Hierarchical routing is not optimal
- Or
- Routes are inflated
15. How sub-optimal?
16. Prior Work
- Based on
- An actual router-level graph
- An actual AS-level graph from the same period
- Overlay the AS-level graph on the router-level graph
- Compute
- For each source-destination pair
- Shortest path using hierarchical routing
- Shortest path using flat routing
- Compare route lengths
- In number of router hops
17. Prior Conclusions
- 80% of the paths are inflated
- 20% of the paths are inflated by > 50%
- There exists a better detour for 50% of the source-destination pairs
- There exists an intermediate node i such that Length(s-i-d) < Length(s-d), as sketched below
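As an aside, the core of this re-examination is easy to sketch. Below is a minimal, hypothetical Python version using networkx (assumed installed): a toy router-level graph whose nodes are tagged with ASs, comparing flat vs. hierarchical route lengths and testing the detour condition. The graph and all names are made up for illustration.

    # Minimal sketch: routers tagged with their AS; compare flat vs.
    # hierarchical shortest paths, counted in router hops.
    import networkx as nx

    G = nx.Graph([(1, 2), (2, 3), (3, 4), (1, 5), (5, 4)])  # toy router graph
    as_of = {1: 'A', 2: 'B', 3: 'B', 4: 'C', 5: 'D'}        # router -> AS

    def flat_len(s, d):
        return nx.shortest_path_length(G, s, d)             # ignores AS boundaries

    def hier_len(s, d):
        # Hierarchical routing: pick a shortest AS-level path first,
        # then route over routers restricted to those ASs.
        A = nx.Graph((as_of[u], as_of[v]) for u, v in G.edges()
                     if as_of[u] != as_of[v])
        allowed = set(nx.shortest_path(A, as_of[s], as_of[d]))
        return nx.shortest_path_length(
            G.subgraph(n for n in G if as_of[n] in allowed), s, d)

    s, d = 1, 4
    print(hier_len(s, d) - flat_len(s, d))       # inflation in router hops
    print(any(hier_len(s, i) + hier_len(i, d) < hier_len(s, d)
              for i in G if i not in (s, d)))    # better detour under hierarchy?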
18. This Work
- To address 2 shortcomings
- There's now a newer router-level graph
- There's now a more sophisticated policy model
- Paper 4
- Inter-AS routing is not quite shortest-AS routing
19. Newer vs. Older Graph
- Inflation difference is not the same
- The difference is larger in the newer graph
- Due to the newer graph being larger
- Inflation ratio remains the same
20. Shortest-AS vs. Policy-AS Routing
- Shortest-AS
- Simplified model
- Every AS is equal
- Policy-AS
- Realistic model
- Not all ASs are the same
- Some are provider ASs
- Some are customer ASs
- Customer ASs do not transit traffic
21. Consider TANET → CHT
[Diagram: ASs NTU, TANET, CHT, and UUNET. For traffic from TANET to CHT: through UUNET? Through NTU?]
22. Routing with Constraints
- Routes could be
- Going up
- Going down
- Going up and then down
- Routes can never be
- Going down and then up (see the checker sketch after slide 28)
23. Inferring the Constraints
- On Inferring Autonomous System Relationships in the Internet
- Lixin Gao
- IEEE/ACM Transactions on Networking, 9(6):733-745, December 2001
24. Not All ASs the Same
- 2 types of ASs
- Customer
- Provider
- 3 types of relationships
- Customer-provider
- Peer-peer
- Sibling-sibling
25. Customer-Provider
- Formal definition
- A provider transits for its customer
- A customer does not transit for its provider
- Informal
- Provider: I'll take any traffic
- Customer: I'll take only the traffic to me (or my customers)
26. Peer-Peer
- Formal definition
- A provider does not transit for another provider
- Informal
- I'll take only the traffic to me (or my customers)
- You'll take only the traffic to you (or your customers)
27. Sibling-Sibling
- Formal definition
- A provider transits for another provider
- Informal
- I'll take any traffic
- You'll take any traffic
28. Never Going Down and then Up
- A provider-customer link can be followed only by
- A provider-customer link
- (Or a sibling-sibling link)
- A peer-peer link can be followed only by
- A provider-customer link
- (Or a sibling-sibling link)
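A minimal sketch of checking this constraint on an AS path (often called the valley-free property). The relationship encoding and the rel(a, b) lookup below are assumptions for illustration only.

    # Valley-free check: once a path goes down (provider->customer) or
    # crosses a peer-peer link, it may only keep going down (or via siblings).
    def valley_free(path, rel):
        went_down_or_peered = False
        for a, b in zip(path, path[1:]):
            r = rel((a, b))               # 'up', 'down', 'peer', or 'sibling'
            if r == 'sibling':
                continue                  # sibling links transit anything
            if went_down_or_peered and r != 'down':
                return False              # down (or peer) and then up: forbidden
            if r in ('down', 'peer'):
                went_down_or_peered = True
        return True

    # Hypothetical relationships: C1, C2 are customers of providers P1, P2
    rels = {('C1', 'P1'): 'up', ('P1', 'P2'): 'peer',
            ('P2', 'C2'): 'down', ('C2', 'P2'): 'up'}
    print(valley_free(['C1', 'P1', 'P2', 'C2'], rels.get))  # True: up, peer, down
    print(valley_free(['P2', 'C2', 'P2'], rels.get))        # False: down, then up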
29. Heuristics
- Compute out-degrees
- For each AS path in the routing tables
- The 1st AS with the max degree is the root of the hierarchy
- From the root, draw provider→customer relationships down toward the 2 ends of the AS path
30. Determining Siblings
- After going through all AS paths
- Any AS pair that are provider and customer to each other are siblings
31. Determining Peers
- Do another pass over the AS paths in the routing tables
- For each AS path
- The top AS that has no sibling relationship with its neighboring ASs
- Could have a peering relationship with the higher out-degree neighbor
- Given that the top AS and the higher out-degree neighbor are comparable in out-degree
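A rough sketch of the degree-based passes follows. It is heavily simplified relative to Gao's actual algorithm (the peer pass is omitted, and ties are handled naively); the input format and names are assumptions.

    # Simplified Gao-style inference: the highest-degree AS on each path is
    # the top of the hierarchy; links slope down from it toward both ends.
    from collections import defaultdict

    def infer(as_paths):
        degree = defaultdict(int)
        for p in as_paths:
            for a, b in zip(p, p[1:]):
                degree[a] += 1
                degree[b] += 1
        prov = set()                              # (provider, customer) votes
        for p in as_paths:
            top = max(range(len(p)), key=lambda i: degree[p[i]])
            for i in range(top):                  # uphill side: i+1 provides for i
                prov.add((p[i + 1], p[i]))
            for i in range(top, len(p) - 1):      # downhill side: i provides for i+1
                prov.add((p[i], p[i + 1]))
        siblings = {pair for pair in prov if pair[::-1] in prov}
        return prov - siblings, siblings          # provider->customer, siblings

    providers, sibs = infer([['C1', 'P1', 'P2', 'C2'], ['C2', 'P2', 'P1']])
    print(sorted(providers), sorted(sibs))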
32. Back to Path Inflation
- Draw the customer-provider, peer-peer, and sibling-sibling relationships on the overlay AS graph
- Compute the best routes under the never-going-down-and-then-up constraint
- Compare the inflation difference and ratio again, with these running at the inter-AS level
- Shortest
- Policy
33. Shortest vs. Policy Routing
- Pretty much the same both in terms of
- Inflation difference
- Inflation ratio
34. Therefore
- The observations from the prior work hold
- With a newer graph
- With the more realistic inter-AS policy routing
35. Now forget path inflation
- How far is the shortest from the policy inter-AS routing?
36. Shortest vs. Policy
- In AS hops
- 95% of paths have the same length
- Policy routes are always longer
- In router hops
- 84% of paths have the same length
- Some policy routes longer, some shorter
37. 95% and 84% are pretty good numbers
- Therefore, shortest path at the inter-AS level might be OK
38. To Answer the Question
- Can we simply do shortest path computation?
- A likely yes for the AS-level graph
- A firm no for the hierarchical graph
- Must separate inter-AS shortest and intra-AS shortest
39. Questions?
40. Identifying Internet Dynamics
- Routing Policy
- Packet Dynamics
- Routing Dynamics
41. It's never a perfect world
42. The Problem
- But how perfect is the Internet?
- The Internet
- A network of computers with stored information
- Some valuable, some relevant
- You participate by putting information up or getting information down
- From time to time, you can't quite do some of these things you want to do
43. Why is that?
44. At the philosophical level
- Humans are so bound to failures. And the Internet is human-made.
45. But, Seriously
- Consider loading a Web page
46. Web Surfing Failures
- The window waiting forever?
- An error message saying the network is not reachable
- An error message saying the server is too busy
- An error message saying the server is down
- Anything else?
47. Network-Specific Failures
- The window waiting forever?
- An error message saying the network is not reachable
- An error message saying the server is too busy
- An error message saying the server is down
- Anything else?
48. The Causes
- The window waiting forever
- Congestion in the network
- Buffer overflow
- Packet drops
- An error message saying the network is not reachable
- Network outage
- Broken cables, frozen routers
- Route re-computation
- Route instability
49. Back to the Problem
- But how perfect is the Internet?
- Equivalently
- Packets can be dropped
- How frequently
- How much
- Routes may be unstable
- How frequently
- For how long
50. Significance
- Knowing the characteristics of packet drops and route instability helps
- Design for fault-tolerance
- Test for fault-tolerance
51. There are tons of formal and informal studies on the dynamics
- Let's take a look at a couple that are classics
52. Packet Dynamics
- End-to-End Internet Packet Dynamics
- Vern Paxson
- IEEE/ACM Transactions on Networking, 7(3):277-292, June 1999
53. Emphasis in Reverse Order
- Real subject of study
- Packet loss
- Packet delay
- Necessary assessment
- The unexpected
- Bandwidth estimation
54. Measurement
- Instrumentation
- 35 sites, 9 countries
- Education, research, provider, company
- 2 runs
- N1: Dec 1994
- N2: Nov-Dec 1995
- 21 sites in common
55. Measurement Methodology
- Each site runs NPD
- A daemon program
- The sender side sends a 100 KB TCP transfer
- Sender and receiver sides both
- tcpdump the packets
- Noteworthy
- Measurements occurred as Poisson arrivals
- Unbiased with respect to the time of measurement
- N2 used a big max window size
- Prevents the window size from limiting the TCP connection throughput
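The Poisson pacing is worth a note: exponential inter-measurement gaps make the sampling times unbiased (the PASTA property). A minimal sketch, where measure() is a hypothetical hook for one NPD-style transfer:

    # Schedule measurements as a Poisson process: exponentially distributed
    # gaps, so samples are unbiased with respect to when they observe the path.
    import random, time

    def poisson_schedule(mean_gap_seconds, measure, n):
        for _ in range(n):
            time.sleep(random.expovariate(1.0 / mean_gap_seconds))
            measure()                       # e.g. run one 100 KB TCP transfer

    # poisson_schedule(600, lambda: print("run transfer"), 5)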
56. Packet Loss
- Overall loss rate
- N1: 2.7%, N2: 5.2%
- N2 higher because of the big max window?
- I.e., pumping more data into the network, therefore more loss?
- The big max window in N2 is not a factor
- Shown by separating data and ack loss
- Assumption: ack traffic flows at a much lower rate
- Won't stress the network
- Ack loss: N1 2.88%, N2 5.14%
- Data loss: N1 2.65%, N2 5.28%
57. Quiescent vs. Busy
- Definition
- Quiescent: connections without ack drops
- Busy: otherwise
- About 50% of the connections are quiescent
- For busy connections
- Loss rate: N1 5.7%, N2 9.2%
58. More Numbers
- Geographical effect
- Time of the day effect
59. Towards a Markov Chain Model
- On the scale of hours
- A no-loss connection now indicates further no-loss connections in the future
- A lossy connection now indicates further lossy connections in the future
- On the scale of minutes
- The rate remains similar
60. Another Classification
- Data
- Loaded: data packets experiencing queueing delay due to their own connection
- Unloaded: data packets not experiencing queueing delay due to their own connection
- Bottleneck bandwidth measurement is needed here to determine whether a packet is loaded or not
- Ack
- Simply acks
61. 3 Major Observations
- Although loss rates can be very high (47%, 65%, 68%), all connections complete within 10 minutes
- Losses of data and acks are not correlated
- Cumulative distribution of per-connection loss rate
- Exponential for data
- Not so exponential for acks
- Adaptive sampling contributing to the exponential observation?
62. More on the Markov Chain Model
- The loss rate P_u
- The unconditional rate of loss
- The conditional loss rate P_c
- The rate of loss when the previous packet is lost
- Contrary to the earlier work
- Losses are bursty
- Loss burst duration shows a Pareto upper tail
- (Polly: maybe more log-normal)
63. You might ask: p_l, p_n?
64. Values for the p_l's
- Conditional loss rates (%)

                  N1    N2
  Loaded data     49    50
  Unloaded data   20    25
  Ack             25    31
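To make the model concrete, here is a small sketch of a two-state (Gilbert-style) loss simulator. It takes an unconditional loss rate p_u and a conditional rate p_l like those in the table, and derives p_n (loss probability after a delivered packet) so the long-run loss rate matches p_u; the pairing of values below is illustrative.

    # 2-state Markov loss model: p_l = P(loss | previous packet lost),
    # p_n = P(loss | previous packet delivered), chosen for stationarity.
    import random

    def simulate_losses(n, p_u, p_l):
        p_n = p_u * (1 - p_l) / (1 - p_u)   # stationary loss rate equals p_u
        prev_lost, losses = False, 0
        for _ in range(n):
            prev_lost = random.random() < (p_l if prev_lost else p_n)
            losses += prev_lost
        return losses / n                   # approaches p_u for large n

    # e.g. loaded data in N2: overall rate ~5%, conditional rate ~50%
    print(simulate_losses(100_000, 0.05, 0.50))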
65. Possible Invariant
- Conditional loss rate
- The values remain relatively stable over the 1-year period
- More up-to-date data to verify this?
- The loss burst size: log-normal?
- Both interesting research questions
66. Packet Delay
- Looking at one-way transit times (OTTs)
- There's a model for the OTT distribution
- Shifted gamma
- Parameters change with time and path
- Internet paths are asymmetric
- OTT one way is often not equal to OTT the other way
67. Timing Compression
- Ack compressions are small events
- So they do not really pose threats to
- Ack clocking
- Rate-estimation-based control
- Data compression is very rare
- Handled by outlier filtering
68. Queueing Delay
- Variance of OTT over different time scales
- For each time scale τ
- Divide the packet arrivals into intervals of length τ
- For all pairs of neighboring intervals l, r
- m_l: the median OTT in interval l
- m_r: the median OTT in interval r
- Calculate (m_l - m_r)
- The variance of OTT over τ is the median of all (m_l - m_r)
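A simplified reading of that procedure in code (timestamps and OTTs in seconds; taking the absolute difference of neighboring medians is an assumption of this sketch):

    # OTT variation at time scale tau: bucket packets into tau-wide intervals,
    # take the median OTT per interval, then the median of neighboring diffs.
    from statistics import median

    def ott_variation(arrival_times, otts, tau):
        buckets = {}
        for t, d in zip(arrival_times, otts):
            buckets.setdefault(int(t // tau), []).append(d)
        keys = sorted(buckets)
        diffs = [abs(median(buckets[l]) - median(buckets[r]))
                 for l, r in zip(keys, keys[1:]) if r == l + 1]
        return median(diffs) if diffs else 0.0

    # Scan tau across scales to find where the variation dominates:
    # for tau in (0.01, 0.1, 1.0, 10.0): print(tau, ott_variation(ts, ds, tau))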
69. Finding the Dominant Scale
- Looking for the τ's whose queueing variance is large
- Where control is most needed
- For example, if those τ's are smaller than the RTT
- Then TCP doesn't need to bother adapting to queueing fluctuations
70. Oh Well
- Queueing delay variations occur
- Dominantly on 0.1-1 sec scales
- But non-negligibly on larger scales
71. Share of Bandwidth
- Pretty much uniformly distributed
72. Conclusions on Analysis
- Common assumptions violated
- In-order packet delivery
- FIFO queueing
- Independent loss
- Single congestion time scale
- Path symmetry
- Behavior
- A very wide range, not one typical behavior
73. Conclusions on Design
- Measurement methodology
- TCP-based measurement shown viable
- Sender-side-only measurement is inferior
- TCP implementations
- Sufficiently conservative
74. The Pathologies
75. Packet Re-Ordering
- Varies widely, and too few samples
- Therefore, deriving only a rule of thumb
- Internet paths sometimes experience bad reordering
- Mainly due to route flapping
- Occasionally, this funny case of router implementation
- Buffering packets while processing a route update
- Sending these packets interleaved with the post-update arrivals
76. Orthogonal to TCP SACK
- Receiver-end modification
- 20 msec wait before sending a duplicate acknowledgement
- Waiting for re-ordered packets, therefore fewer false duplicate acknowledgements
- Dup acks should be an indication of losses
- Sender-end modification
- Fast retransmission after 2 duplicate acknowledgements
- More reactive fast retransmission, higher throughput
77. Packet Replication
- Very strange, can't quite explain
- A pair of acks duplicated 9 times, arriving 32 msec apart
- A data packet duplicated 23 times, arriving in a burst
- A misconfigured bridge?
- Observation
- Most of these are site-specific
- But a small number of dups spread between other sites
- Senders duplicate packets too
78. Packet Corruption
- Checksum good?
- Problem
- The traces contain only the header data
- Pure acks OK: the header is the packet
- Data not OK: the header ≠ the packet
- Use a corruption-inferring algorithm in tcpanaly
79. Corruption Rate
- 1 corruption out of 5,000 data packets
- 1 corruption out of 300,000 pure acks
- Possible reasons for the difference
- Header compression
- Packet size
- Inferring-tool discrepancy
- Other router/link-level implementation artifacts
80. Implication
- The 16-bit checksum is no longer sufficient
- A corrupted packet has a 1-in-2^16 chance of having the same checksum as the non-corrupted packet
- I.e., 1 out of 2^16 corrupted packets can't be detected by the checksum
- Since 1 out of 5,000 data packets is corrupted
- 1 out of 5,000 × 2^16 (~300 M) packets can't be identified as corrupted by the TCP 16-bit checksum
- Consider a 1 Gbps link and a packet size of 1 Kb → 1 M packets per second
- Roughly 300 seconds per falsely accepted corrupted packet
81. Estimating Bottleneck Bandwidth
- The packet-pair technique
- Send 2 packets back to back (or close enough)
- Inter-packet time, T2-T1, very small
- When they go across the bottleneck
- Packet 1 is served while packet 2 is queued
- Packet 2 immediately follows packet 1
- The packets will be stretched
- The inter-packet time, T2'-T1', is now the transmission time of packet 1
- Estimated bandwidth: (size of packet 1) / (T2'-T1')
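A minimal sketch of the estimate itself (toy numbers; sizes in bits, receiver timestamps in seconds):

    # Packet-pair estimate: after the bottleneck, the pair's spacing equals
    # one packet's transmission time there, so bandwidth = size / spacing.
    def packet_pair_estimate(size_bits, t1_rx, t2_rx):
        return size_bits / (t2_rx - t1_rx)

    # 1500-byte packets arriving 12 ms apart -> ~1 Mbps bottleneck
    print(packet_pair_estimate(1500 * 8, 0.000, 0.012))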
82. This Won't Work
- Bottleneck bandwidth higher than the sending rate
- Out-of-order delivery
- Clock resolution
- Changes in the bottleneck bandwidth
- Multiple bottlenecks
83. PBM
- Instead of sending a pair
- Send a bunch
- More robust against the multi-bottleneck problem
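The idea can be sketched as below: form a per-gap estimate for each adjacent pair in the bunch and take a robust summary. (Paxson's actual PBM searches over bunch sizes and allows multiple modes; the median here is a stand-in.)

    # Bunch-based estimate: median of per-gap packet-pair estimates, which
    # tolerates a noisy or reordered gap better than a single pair does.
    from statistics import median

    def bunch_estimate(size_bits, rx_times):
        gaps = [b - a for a, b in zip(rx_times, rx_times[1:]) if b > a]
        return median(size_bits / g for g in gaps)

    print(bunch_estimate(1500 * 8, [0.000, 0.012, 0.025, 0.036, 0.049]))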
84. Questions?
85. Identifying Internet Dynamics
- Routing Policy
- Packet Dynamics
- Routing Dynamics
86. Route Instability
- Internet Routing Instability
- Craig Labovitz, G. Robert Malan, Farnam Jahanian
- IEEE/ACM Transactions on Networking, 6(5):515-528, October 1998
87. BGP Specific
- BGP is an important part of the Internet
- Connecting the domains
- Widespread
- Known from prior work that route failures could result in
- Packet loss
- Longer network delay
- Network outage (time to globally converge to a local change)
- A closer look at the BGP dynamics
- How many route updates are sent
- How frequently they are sent
- How useful these updates are
88. BGP (In a Slide)
- The routing protocol running among the border routers
- Path vector
- Think DV
- Exchanges not just the next hop but the entire path
- Dynamics
- In case of link/router recovery
- Route announcements propagate from the recovering point
- In case of link/router failure
- Route withdrawals propagate from the failed point
- Route updates
- Include route announcements/withdrawals
89. Data Collection
- Monitoring the exchange of route updates
- Over a 9-month period
- 5 public exchange points in the core
- Exchange point
- Connecting point of ASs
- Public exchanges of the US government
- Private exchanges of the commercial providers
90. Terminology
- AS
- You all know
- AS-PATH
- The path in the path vector exchanged by BGP
- Prefix
- Basically a network address
- The source/destination of the route entries in BGP
- 140.119.154/24
- 140.119/16
91. Classification of Problems
- Forwarding instability
- Legitimate topological changes affecting forwarding paths
- Routing policy fluctuation
- Changes in routing policy not affecting forwarding paths
- Pathological updates
- Redundant information affecting neither routing nor forwarding
92. Forwarding Instability
- WADiff
- A route is explicitly withdrawn
- As it becomes unreachable
- Replaced with an alternative route
- The alternative route differs in AS-PATH or next-hop
- AADiff
- A route is implicitly withdrawn
- Replaced with an alternative route
- As it becomes unreachable or a preferred alternative route becomes available
93. In the Middle
- WADup
- A route is explicitly withdrawn
- Then re-announced as reachable
- Could be
- Pathological
- Forwarding instability: a transient topological change
- AADup
- A route is implicitly withdrawn
- Replaced with a duplicate of the original route
- Same AS-PATH and next-hop
- Could be
- Pathological
- Policy fluctuation: differing in other policy attributes
94. Pathological
- WWDup
- Repeated withdrawals for a prefix that is no longer reachable
- Pathological
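A toy classifier for this taxonomy, given a prefix's previous state (the update encoding is assumed for illustration; distinguishing AADup from policy fluctuation would need the other policy attributes, which are omitted here):

    # Classify one BGP update for a prefix against its previous state.
    # prev_route: last announced (as_path, next_hop) or None; kinds: 'A'/'W'.
    def classify(prev_route, prev_kind, kind, route=None):
        if kind == 'W':
            return 'WWDup' if prev_kind == 'W' else 'W'     # plain withdrawal
        if prev_kind == 'W':
            return 'WADup' if route == prev_route else 'WADiff'
        return 'AADup' if route == prev_route else 'AADiff'

    r1 = (('AS1', 'AS2'), '10.0.0.1')
    print(classify(r1, 'W', 'A', r1))                            # WADup
    print(classify(r1, 'A', 'A', (('AS1', 'AS3'), '10.0.0.2')))  # AADiff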
95. Observations: The Majority
- Pathological updates (redundant)
- Minimal effect on
- Route quality
- Router processing load
- Some disagree
- Adds a significant amount of traffic
- 300 updates/second could crash a high-end router
96. Observation: Instability
- Forwarding instability
- 3-10% WADiff
- 5-20% AADiff
- 10-50% WADup
- Policy fluctuation
- AADup quite high
- But most probably pathological
- We need this
- Internet routing works because of these necessary and frequent updates
97. Observation: Distribution
- No spatial correlation
- Correlates to router implementation instead
- Temporal
- Time-of-day effect, day-of-week effect
- Therefore correlates to network congestion
- Periodicity
- 30- and 60-second periods
- Due to self-synchronization, misconfiguration, BGP being soft-state based, etc.
98. Basically, not saying much
- But for the background
- And ease of reading
99. Questions?
100. What Should You Do?
- Routing policy
- Intra-AS: shortest path
- Inter-AS: shortest path (95%, 84% OK)
- Better models in progress
- Packet losses
- 2-state Markov chain model
- p_l: some info
- p_n: no info
- Routing instability: outage time
- See paper 2 of the original paper set (OSPF vs. DV)