Title: Limiting the Impact of Failures on Network Performance
1Limiting the Impact of Failures on Network Performance
Joint work with Supratik Bhattacharyya and Christophe Diot
High Performance Networking Group, 25 Feb. 2004
Yashar Ganjali
Computer Systems Lab, Stanford University
yganjali_at_stanford.edu
http://www.stanford.edu/yganjali
2Motivation
- The core of the Internet consists of several large networks (IP backbones).
- IP backbones are carefully provisioned to guarantee low latency and jitter for packet delivery.
- Failures occur on a daily basis as a result of:
  - Physical layer malfunction
  - Router hardware/software failures
  - Maintenance
  - Human errors
- Failures affect the quality of service delivered to backbone customers.
3Outline
- Background
- Sprint's IP backbone
- Data
- Impact Metrics
- Time-based metrics
- Link-based metrics
- Measurements
- Reducing the impact
- Identifying critical failures
- Causes analysis
- Reducing critical failures
4Background: Sprint's IP backbone
- The IP layer operates above DWDM with SONET framing.
- The IS-IS protocol is used to route traffic inside the network.
- IP-level restoration:
  - When an IP link fails, all routers in the network independently compute a new path around the failure.
  - There is no protection in the underlying optical infrastructure.
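As a rough illustration of IP-level restoration, the sketch below recomputes a shortest path after a link is marked as failed. It is not Sprint's IS-IS implementation: the topology, the weights, and the `shortest_path` helper are all hypothetical, and a real router runs a full SPF computation over its link-state database.

```python
import heapq

def shortest_path(graph, src, dst, failed=frozenset()):
    """Dijkstra over an undirected weighted graph, skipping failed links.

    graph: dict mapping node -> dict of neighbor -> link weight.
    failed: set of frozenset({u, v}) links treated as down.
    Returns (cost, path), or (float("inf"), []) when dst is unreachable.
    """
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            if frozenset((u, v)) in failed:
                continue  # link is down, route around it
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return float("inf"), []
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return dist[dst], path[::-1]

# Hypothetical four-router topology with IS-IS-style link weights.
TOPOLOGY = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "D": 1},
    "C": {"A": 4, "D": 1},
    "D": {"B": 1, "C": 1},
}

before = shortest_path(TOPOLOGY, "A", "D")
after = shortest_path(TOPOLOGY, "A", "D", failed={frozenset(("B", "D"))})
```

In this toy topology the path from A to D moves from A-B-D to the more expensive A-C-D once link B-D is down, which is the kind of rerouting the impact metrics in the rest of the talk try to quantify.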
5Data
- IS-IS Link State PDU logs
  - Collected by passive listeners from Sprint's North America backbone.
  - Feb. 1st, 2003 to Jun. 30th, 2003.
- SNMP logs
  - Link loads recorded once every 5 minutes.
- SONET layer alarms
  - Corresponding to minor and major problems in the optical layer.
  - We are only interested in two alarms: SLOS and SLOS cleared.
6Link Failures in Sprint's IP Backbone
9408 failures
7Inter-POP vs. Intra-POP
8Outline
- Background
- Sprint's IP backbone
- Data
- Impact Metrics
- Time-based metrics
- Link-based metrics
- Measurements
- Reducing the impact
- Identifying critical failures
- Causes analysis
- Reducing critical failures
9Inter-POP Link Failures in Sprint's IP Backbone
10Two Perspectives
- For a given impact metric:
  - Time-based analysis: measure the impact of failures on the given metric as a function of time.
  - Link-based analysis: measure the impact of failures on the given metric as a function of failing links.
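The two perspectives can be illustrated with the simplest metric, the failure count. The sketch below aggregates the same records once by time bin and once by link. The record layout, the link identifiers, and the 300-second bin width are assumptions made for the example, not the format of the IS-IS logs.

```python
from collections import Counter

# Hypothetical failure records: (link id, start time in seconds, duration in seconds).
FAILURES = [
    ("L1", 10, 5),
    ("L2", 12, 30),
    ("L1", 400, 8),
    ("L3", 410, 2),
    ("L1", 900, 1),
]

def failures_per_time_bin(failures, bin_seconds):
    """Time-based view: number of failures starting in each time bin."""
    return Counter(start // bin_seconds for _, start, _ in failures)

def failures_per_link(failures):
    """Link-based view: number of failures observed on each link."""
    return Counter(link for link, _, _ in failures)

by_time = failures_per_time_bin(FAILURES, bin_seconds=300)
by_link = failures_per_link(FAILURES)
```

The same split applies to the other metrics: only the grouping key changes, while the per-failure impact being summed depends on the metric.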
11Time-based Impact Metrics
- Number of Simultaneous Link Failures
- Number of affected O-D pairs
- Number of affected BGP prefixes
- Path unavailability
- Total rerouted traffic
- Maximum load
12Number of Simultaneous Failures
13Number of Simultaneous Failures
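One way to compute the number of simultaneous failures from a list of failure intervals is a sweep over start and end events. This is a minimal sketch under the assumption that each failure is available as a (start, end) pair; the function name and the half-open interval convention are choices made for the example.

```python
def max_simultaneous_failures(intervals):
    """Return the peak number of failures that overlap in time.

    intervals: iterable of (start, end) with start < end; a failure is
    treated as active on the half-open interval [start, end).
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    # Sorting puts -1 before +1 at equal timestamps, so a failure that
    # ends exactly when another starts is not counted as overlapping.
    events.sort()
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak
```

Recording `active` after every event instead of only the peak would give the full time series rather than a single number.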
14Time-based Impact Metrics
- Number of Simultaneous Link Failures
- Number of affected O-D pairs
- Number of affected BGP prefixes
- Path unavailability
- Total rerouted traffic
- Maximum load
15Number of Affected O-D Pairs
[Figure: diagram with nodes A, B, C, D, E, and F]
16Number of Affected O-D Pairs
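A possible way to count affected O-D pairs is sketched below, assuming that a pair counts as affected when its current path crosses a failed link. That definition, the `affected_od_pairs` helper, and the example paths are assumptions for illustration, since the slides do not spell the metric out.

```python
def affected_od_pairs(paths, failed_links):
    """Return the O-D pairs whose current path crosses a failed link.

    paths: dict mapping (origin, destination) -> list of nodes on the path.
    failed_links: set of frozenset({u, v}) links that are down.
    """
    affected = set()
    for od, nodes in paths.items():
        hops = zip(nodes, nodes[1:])  # consecutive node pairs are the links used
        if any(frozenset(hop) in failed_links for hop in hops):
            affected.add(od)
    return affected

# Hypothetical routed paths between a few origin-destination pairs.
PATHS = {
    ("A", "D"): ["A", "B", "D"],
    ("A", "C"): ["A", "C"],
    ("B", "E"): ["B", "D", "E"],
    ("C", "F"): ["C", "F"],
}

hit = affected_od_pairs(PATHS, {frozenset(("B", "D"))})
```

Here the failure of link B-D affects two of the four pairs, because both A-to-D and B-to-E are routed over it.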
17Time-based Impact Metrics
- Number of Simultaneous Link Failures
- Number of affected O-D pairs
- Number of affected BGP prefixes
- Path unavailability
- Total rerouted traffic
- Maximum load
18Number of Affected BGP Prefixes
19Time-based Impact Metrics
- Number of Simultaneous Link Failures
- Number of affected O-D pairs
- Number of affected BGP prefixes
- Path unavailability
- Total rerouted traffic
- Maximum load
20Path Unavailability
[Figure: diagram with nodes A, B, C, D, E, and F]
21Path Unavailability
22Time-based Impact Metrics
- Number of Simultaneous Link Failures
- Number of affected O-D pairs
- Number of affected BGP prefixes
- Path unavailability
- Total rerouted traffic
- Maximum load
23Total Rerouted Traffic
24Time-based Impact Metrics
- Number of Simultaneous Link Failures
- Number of affected O-D pairs
- Number of affected BGP prefixes
- Path unavailability
- Total rerouted traffic
- Maximum load
25Maximum Load Throughout the Network
26Maximum Load Throughout the Network
96% of link failures were not followed by an immediate change in maximum load.
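A statistic of this kind could be derived from periodic load samples roughly as follows. The sketch assumes that "maximum load" is the highest load over all links in a sample and that "immediate" means the first sample after the failure compared with the last one before it. Both readings, the function name, and the sample data are assumptions, not the method used for the 96% figure.

```python
import bisect

def fraction_without_max_load_change(samples, failure_times, tolerance=0.0):
    """Fraction of failures not followed by a change in network-wide maximum load.

    samples: list of (timestamp, {link: load}) sorted by timestamp.
    failure_times: timestamps at which link failures were observed.
    tolerance: absolute difference in maximum load treated as "no change".
    """
    times = [t for t, _ in samples]
    max_load = [max(loads.values()) for _, loads in samples]
    unchanged = counted = 0
    for ft in failure_times:
        after = bisect.bisect_right(times, ft)  # first sample after the failure
        before = after - 1                      # last sample at or before it
        if before < 0 or after >= len(times):
            continue  # no sample on one side of the failure
        counted += 1
        if abs(max_load[after] - max_load[before]) <= tolerance:
            unchanged += 1
    return unchanged / counted if counted else 0.0

# Hypothetical samples taken every 300 seconds (5 minutes).
SAMPLES = [
    (0, {"L1": 0.4, "L2": 0.6}),
    (300, {"L1": 0.4, "L2": 0.6}),
    (600, {"L1": 0.7, "L2": 0.6}),
    (900, {"L1": 0.7, "L2": 0.5}),
]

share = fraction_without_max_load_change(SAMPLES, [100, 400, 700])
```

With 5-minute samples, short-lived load spikes between two samples are invisible, so a result like this is best read as a lower bound on how often the maximum load moved.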
27Time-based Impact Metrics
- Number of Simultaneous Link Failures
- Number of affected O-D pairs
- Number of affected BGP prefixes
- Path unavailability
- Total rerouted traffic
- Maximum load
28Number of Failures per Link
29Number of Affected O-D Pairs per Link
30Number of Affected BGP Prefixes per Link
31Path Coverage
[Figure: diagram with nodes A, B, C, D, E, and F]
32Path Coverage of Links
33Total Rerouted Traffic on a Link
34Peak Factor of a Link
35Link-based Impact Metrics
- Number of Link Failures
- Number of affected O-D pairs
- Number of affected BGP prefixes
- Path coverage
- Total rerouted traffic
- Peak factor
36Outline
- Background
- Sprint's IP backbone
- Data
- Impact Metrics
- Time-based metrics
- Link-based metrics
- Measurements
- Reducing the impact
- Identifying critical failures
- Causes analysis
- Reducing critical failures
37Critical Failures
- For each time-based metric:
  - Removing failures occurring during 1-5% of the time improves the metric by a factor of at least 5.
- For each link-based metric:
  - Removing failures on 1-7% of links improves the metric by a factor of at least 3.
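The idea of a small critical set can be sketched as follows: rank time bins (or links) by their contribution to a metric, drop the worst few percent, and compare the totals. The sketch assumes the metric is additive over bins and that "improvement" is the ratio of the totals before and after removal. The function name and the numbers are hypothetical.

```python
def improvement_factor(values, percent):
    """Improvement from dropping the worst `percent` of bins.

    values: per-bin metric values (e.g. affected O-D pairs per time bin).
    percent: integer share of bins to treat as critical, 0 to 100.
    Returns (sorted critical bin indices, total before / total after).
    """
    k = (percent * len(values) + 99) // 100  # integer ceiling of percent%
    ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    critical = sorted(ranked[:k])
    before = sum(values)
    after = before - sum(values[i] for i in critical)
    factor = before / after if after else float("inf")
    return critical, factor

# Hypothetical metric: 38 quiet bins and 2 bins that dominate the total.
VALUES = [1] * 38 + [80, 72]

critical_bins, factor = improvement_factor(VALUES, percent=5)
```

In this made-up series, 5% of the bins carry most of the total, so removing them improves the metric by a factor of 5.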
38Critical Time Periods
39Critical Links
- Any link that has a critical failure is called a Critical Link.
- We are interested in fixing such links.
40Correlation of Critical Sets
41Correlation of the Critical Sets
Metric                      Size  1     2     3     4     5     6     7     8     9     10
1) Simultaneous failures    11    -     0.38  0.33  0.27  0.23  0.11  0.13  0.08  0.15  0.05
2) # of O-D pairs           9     -     -     0.37  0.21  0.25  0.12  0.14  0.06  0.09  0.06
3) # of BGP prefixes        6     -     -     -     0.18  0.32  0.09  0.05  0.1   0.07  0.03
4) Path unavailability      5     -     -     -     -     0.41  0.14  0.11  0.08  0.12  0.04
5) Total rerouted traffic   6     -     -     -     -     -     0.09  0.11  0.09  0.08  0.08
6) # of failures            2     -     -     -     -     -     -     0.29  0.31  0.25  0.17
7) # of O-D pairs           3     -     -     -     -     -     -     -     0.29  0.3   0.18
8) # of BGP prefixes        2     -     -     -     -     -     -     -     -     0.13  0.19
9) Path coverage            6     -     -     -     -     -     -     -     -     -     0.08
10) Total rerouted traffic  1     -     -     -     -     -     -     -     -     -     -
Overall, 23% of all links are critical.
42Cause Analysis
- Markopoulou et al. have used IS-IS update messages to classify link failures into the following categories [MIB04]:
  - Maintenance
  - Unplanned
    - Shared failures
      - Router-related
      - Optical-related
      - Unspecified
    - Individual failures (about 70% of all unplanned failures)
43Matching SLOS Alarms with IP Link Failures
58% of all link failures are due to optical layer problems.
84% of critical failures are due to optical layer problems.
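Matching SLOS alarms to IP link failures can be approximated by pairing events on the same link that fall within a short time window. The sketch below assumes a known one-to-one mapping between IP links and the optical links that raise alarms, and the window size is an arbitrary choice. The function name and all data are hypothetical.

```python
def match_failures_to_alarms(failures, alarms, window):
    """Label each IP link failure as optical if an SLOS alarm is nearby.

    failures: list of (link, failure start time).
    alarms: list of (link, SLOS alarm time).
    window: maximum time gap, in the same units, for a match.
    Returns a list of booleans aligned with `failures`.
    """
    by_link = {}
    for link, t in alarms:
        by_link.setdefault(link, []).append(t)
    return [
        any(abs(start - t) <= window for t in by_link.get(link, []))
        for link, start in failures
    ]

# Hypothetical events, times in seconds.
IP_FAILURES = [("L1", 100), ("L1", 500), ("L2", 200), ("L3", 50)]
SLOS_ALARMS = [("L1", 98), ("L2", 260), ("L3", 51)]

optical = match_failures_to_alarms(IP_FAILURES, SLOS_ALARMS, window=5)
optical_share = sum(optical) / len(optical)
```

The window absorbs clock skew and detection delay between the two layers. Too wide a window would attribute unrelated IP failures to the optical layer.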
44Reducing Critical Failures
- Replace old optical fibers/parts.
- Optical protection.
- Push the traffic away.
  - Also works for maximum load and peak factor.
45Performance Improvement
Time-based metrics                      Link-based metrics
Metric                   % improvement  Metric                   % improvement
# of failures            41             # of failures            45
# of affected O-D pairs  36             # of affected O-D pairs  37
# of BGP prefixes        32             # of BGP prefixes        29
Path unavailability      39             Path coverage            42
Total rerouted traffic   29             Total rerouted traffic   38
46Reducing Link Down-time
- Low-failure links
  - Failures are very rare.
  - Damping doesn't help.
- High-failure links
  - Failure rate changes very slowly.
  - Fixed damping is wasteful.
47Adaptive Damping
- Input
  - d: time difference between the last two failures
  - T: threshold
  - c: constant
- Output
  - ADT: adaptive damping timer

function Adaptive_Damping
begin
  if (d < T) ADT = c x d
  else ADT = 0
end
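The damping rule can be written as a small function. This sketch assumes one reading of the slide: the timer is the constant times the time difference between the last two failures when that difference is below the threshold, and zero otherwise. The parameter names are chosen for the example and the values are hypothetical.

```python
def adaptive_damping_timer(gap, threshold, constant):
    """Return the adaptive damping timer (ADT) for a link.

    gap: time between the last two failures of the link.
    threshold: gaps shorter than this trigger damping.
    constant: multiplier applied to the gap (assumed reading of the slide).
    """
    if gap < threshold:
        return constant * gap
    return 0

# A link that failed twice within 10 seconds is held down longer than
# one whose last two failures were two minutes apart.
flapping = adaptive_damping_timer(gap=10, threshold=60, constant=2)
stable = adaptive_damping_timer(gap=120, threshold=60, constant=2)
```

Because the timer scales with the observed gap, rarely failing links are not damped at all, and the hold-down for frequently failing links depends on how closely spaced their last two failures were.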
48Number Duration Pareto Curve
49Thank you!