1
Limiting the Impact of Failures on Network Performance
Joint work with Supratik Bhattacharyya and Christophe Diot
High Performance Networking Group, 25 Feb. 2004
Yashar Ganjali
Computer Systems Lab., Stanford University
yganjali_at_stanford.edu
http://www.stanford.edu/yganjali
2
Motivation
  • The core of the Internet consists of several
    large networks (IP backbones).
  • IP backbones are carefully provisioned to
    guarantee low latency and jitter for packet
    delivery.
  • Failures occur on a daily basis as a result of:
    • Physical layer malfunction
    • Router hardware/software failures
    • Maintenance
    • Human errors
  • Failures affect the quality of service delivered
    to backbone customers.

3
Outline
  • Background
  • Sprint's IP backbone
  • Data
  • Impact Metrics
  • Time-based metrics
  • Link-based metrics
  • Measurements
  • Reducing the impact
  • Identifying critical failures
  • Cause analysis
  • Reducing critical failures

4
Background: Sprint's IP backbone
  • IP layer operates above DWDM with SONET framing.
  • IS-IS protocol used to route traffic inside the
    network.
  • IP-level restoration:
    • When an IP link fails, all routers in the network
      independently compute a new path around the
      failure.
  • No protection in the underlying optical
    infrastructure.
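The restoration step described above can be sketched as a shortest-path recomputation that ignores the failed link. This is only an illustration, not the routers' actual IS-IS implementation: the `shortest_path` function, the dictionary layout of `topology`, and the link metrics are all hypothetical.

```python
import heapq

def shortest_path(graph, src, dst, failed=frozenset()):
    """Dijkstra over an undirected weighted graph, skipping failed links.

    graph:  dict node -> dict neighbor -> link metric (hypothetical layout).
    failed: set of frozenset({u, v}) links that are currently down.
    Returns (cost, path), or (inf, []) if dst is unreachable.
    """
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            if frozenset((u, v)) in failed:
                continue  # route around the failed link
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return float("inf"), []
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return dist[dst], path[::-1]

# Toy three-node topology (made-up metrics).
topology = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 1},
    "C": {"A": 4, "B": 1},
}
before = shortest_path(topology, "A", "C")
after = shortest_path(topology, "A", "C", failed={frozenset(("B", "C"))})
```

In this toy case the path from A to C moves from A-B-C to the direct A-C link once the B-C link is marked as failed.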

5
Data
  • IS-IS Link State PDU logs
    • Collected by passive listeners from Sprint's
      North America backbone.
    • Feb. 1st, 2003 to Jun. 30th, 2003.
  • SNMP logs
    • Link loads recorded once every 5 minutes.
  • SONET layer alarms
    • Corresponding to minor and major problems in the
      optical layer.
    • We are only interested in two alarms: SLOS and
      SLOS cleared.

6
Link Failures in Sprint's IP Backbone (9408 failures)
7
Inter-POP vs. Intra-POP
8
Outline
  • Background
  • Sprint's IP backbone
  • Data
  • Impact Metrics
  • Time-based metrics
  • Link-based metrics
  • Measurements
  • Reducing the impact
  • Identifying critical failures
  • Cause analysis
  • Reducing critical failures

9
Inter-POP Link Failures in Sprint's IP Backbone
10
Two Perspectives
  • For a given impact metric:
    • Time-based analysis: measure the impact of
      failures on the given metric as a function of
      time.
    • Link-based analysis: measure the impact of
      failures on the given metric as a function of
      the failing links.

11
Time-based Impact Metrics
  • Number of Simultaneous Link Failures
  • Number of affected O-D pairs
  • Number of affected BGP prefixes
  • Path unavailability
  • Total rerouted traffic
  • Maximum load
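
As a rough illustration of the first metric in this list, the number of simultaneous link failures over time can be derived from failure intervals with a sweep over start and end events. The interval format and the `simultaneous_failures` helper are assumptions made for this sketch; the slides do not say how the measurement was implemented.

```python
def simultaneous_failures(intervals):
    """Return [(time, count)] after each change in the number of failed links.

    intervals: list of (start, end) failure times, one per failure
               (hypothetical input format).
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))   # link goes down
        events.append((end, -1))    # link comes back up
    # At equal timestamps (t, -1) sorts before (t, +1), so a repair is
    # applied before a new failure that starts at the same instant.
    events.sort()
    timeline = []
    count = 0
    for t, delta in events:
        count += delta
        if timeline and timeline[-1][0] == t:
            timeline[-1] = (t, count)  # merge events with the same timestamp
        else:
            timeline.append((t, count))
    return timeline

# Three made-up failures, times in arbitrary units.
failures = [(0, 10), (5, 15), (12, 20)]
timeline = simultaneous_failures(failures)
peak = max(count for _, count in timeline)
```

Here `peak` is the largest number of links that were down at the same time in the toy data.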

12
Number of Simultaneous Failures
13
Number of Simultaneous Failures
14
Time-based Impact Metrics
  • Number of Simultaneous Link Failures
  • Number of affected O-D pairs
  • Number of affected BGP prefixes
  • Path unavailability
  • Total rerouted traffic
  • Maximum load

15
Number of Affected O-D Pairs
[Figure: example topology with nodes A, B, C, D, E, F]
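One way to count affected O-D pairs is sketched below. It assumes, since the transcript does not define the term, that an O-D pair is affected when its pre-failure path traverses the failed link. The `paths` table and the `affected_od_pairs` helper are hypothetical.

```python
def affected_od_pairs(paths, failed_link):
    """Return the O-D pairs whose path uses the failed link.

    paths:       dict (origin, destination) -> list of nodes on the path
                 (hypothetical routing snapshot taken before the failure).
    failed_link: pair of node names; direction is ignored.
    """
    failed = frozenset(failed_link)
    affected = []
    for od, path in paths.items():
        # Consecutive node pairs along the path are the links it uses.
        links = {frozenset(hop) for hop in zip(path, path[1:])}
        if failed in links:
            affected.append(od)
    return affected

# Made-up paths over nodes named as in the figure.
paths = {
    ("A", "C"): ["A", "B", "C"],
    ("A", "D"): ["A", "D"],
    ("B", "E"): ["B", "C", "E"],
}
hit = affected_od_pairs(paths, ("B", "C"))
```

In the toy table two of the three pairs route over the B-C link, so both are reported as affected.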
16
Number of Affected O-D Pairs
17
Time-based Impact Metrics
  • Number of Simultaneous Link Failures
  • Number of affected O-D pairs
  • Number of affected BGP prefixes
  • Path unavailability
  • Total rerouted traffic
  • Maximum load

18
Number of Affected BGP Prefixes
19
Time-based Impact Metrics
  • Number of Simultaneous Link Failures
  • Number of affected O-D pairs
  • Number of affected BGP prefixes
  • Path unavailability
  • Total rerouted traffic
  • Maximum load

20
Path Unavailability
[Figure: example topology with nodes A, B, C, D, E, F]
21
Path Unavailability
22
Time-based Impact Metrics
  • Number of Simultaneous Link Failures
  • Number of affected O-D pairs
  • Number of affected BGP prefixes
  • Path unavailability
  • Total rerouted traffic
  • Maximum load

23
Total Rerouted Traffic
24
Time-based Impact Metrics
  • Number of Simultaneous Link Failures
  • Number of affected O-D pairs
  • Number of affected BGP prefixes
  • Path unavailability
  • Total rerouted traffic
  • Maximum load

25
Maximum Load Throughout the Network
26
Maximum Load Throughout the Network
96% of link failures were not followed by an
immediate change in maximum load.
27
Time-based Impact Metrics
  • Number of Simultaneous Link Failures
  • Number of affected O-D pairs
  • Number of affected BGP prefixes
  • Path unavailability
  • Total rerouted traffic
  • Maximum load

28
Number of Failures per Link
29
Number of Affected O-D Pairs per Link
30
Number of Affected BGP Prefixes per Link
31
Path Coverage
[Figure: example topology with nodes A, B, C, D, E, F]
32
Path Coverage of Links
33
Total Rerouted Traffic on a Link
34
Peak Factor of a Link
35
Link-based Impact Metrics
  • Number of Link Failures
  • Number of affected O-D pairs
  • Number of affected BGP prefixes
  • Path coverage
  • Total rerouted traffic
  • Peak factor

36
Outline
  • Background
  • Sprint's IP backbone
  • Data
  • Impact Metrics
  • Time-based metrics
  • Link-based metrics
  • Measurements
  • Reducing the impact
  • Identifying critical failures
  • Cause analysis
  • Reducing critical failures

37
Critical Failures
  • For each time-based metric:
    • Removing failures occurring during 1-5% of the time
      improves the metric by a factor of at least 5.
  • For each link-based metric:
    • Removing failures on 1-7% of links improves the
      metric by a factor of at least 3.
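
The idea of a small critical set can be illustrated with a short sketch. It assumes that "improvement" means the ratio of the metric's peak before and after the worst samples are removed; the slides do not state how improvement was measured, and `improvement_after_removal` and its sample data are hypothetical.

```python
def improvement_after_removal(values, percent):
    """Drop the worst `percent` of samples and compare peaks.

    values:  metric samples, one per time slot or per link (made-up data).
    percent: integer percentage of samples to treat as critical.
    Returns peak_before / peak_after (assumed definition of improvement).
    """
    ordered = sorted(values, reverse=True)
    k = max(1, len(ordered) * percent // 100)  # remove at least one sample
    remaining = ordered[k:]
    if not remaining or remaining[0] == 0:
        return float("inf")  # nothing left to compare against
    return ordered[0] / remaining[0]

# 100 samples: two severe ones and 98 mild ones.
samples = [50, 40] + [5] * 98
factor = improvement_after_removal(samples, percent=2)
```

With this toy data, removing the worst 2% of samples lowers the peak from 50 to 5.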

38
Critical Time Periods
39
Critical Links
  • Any link that has a critical failure is called
    a critical link.
  • We are interested in fixing such links.

40
Correlation of Critical Sets
41
Correlation of the Critical Sets
Metric                         Size  1     2     3     4     5     6     7     8     9     10
Time-based metrics
1) Simultaneous failures       11    -     0.38  0.33  0.27  0.23  0.11  0.13  0.08  0.15  0.05
2) # of O-D pairs              9     -     -     0.37  0.21  0.25  0.12  0.14  0.06  0.09  0.06
3) # of BGP prefixes           6     -     -     -     0.18  0.32  0.09  0.05  0.1   0.07  0.03
4) Path unavailability         5     -     -     -     -     0.41  0.14  0.11  0.08  0.12  0.04
5) Total rerouted traffic      6     -     -     -     -     -     0.09  0.11  0.09  0.08  0.08
Link-based metrics
6) # of failures               2     -     -     -     -     -     -     0.29  0.31  0.25  0.17
7) # of O-D pairs              3     -     -     -     -     -     -     -     0.29  0.3   0.18
8) # of BGP prefixes           2     -     -     -     -     -     -     -     -     0.13  0.19
9) Path coverage               6     -     -     -     -     -     -     -     -     -     0.08
10) Total rerouted traffic     1     -     -     -     -     -     -     -     -     -     -
Overall, 23% of all links are critical.
42
Cause Analysis
  • Markopoulou et al. [MIB04] used IS-IS update
    messages to classify link failures into the
    following categories:
    • Maintenance
    • Unplanned
      • Shared failures
        • Router-related
        • Optical-related
        • Unspecified
      • Individual failures

About 70% of all unplanned failures
43
Matching SLOS Alarms with IP Link Failures
58% of all link failures are due to optical layer
problems.
84% of critical failures are due to optical layer
problems.
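A minimal sketch of how such a matching could work: an IP link failure is attributed to the optical layer if an SLOS alarm on the same link falls within a short time window around it. The window length, the record format, and the `match_failures_to_alarms` helper are assumptions; the slides do not describe the matching procedure.

```python
from bisect import bisect_left

def match_failures_to_alarms(failures, alarms, window=5.0):
    """Return the failures that have an alarm on the same link nearby in time.

    failures, alarms: lists of (link_id, timestamp_seconds) (hypothetical format).
    window:           matching tolerance in seconds (assumed value).
    """
    by_link = {}
    for link, t in alarms:
        by_link.setdefault(link, []).append(t)
    for times in by_link.values():
        times.sort()
    matched = []
    for link, t in failures:
        times = by_link.get(link, [])
        # First alarm at or after the start of the window.
        i = bisect_left(times, t - window)
        if i < len(times) and times[i] <= t + window:
            matched.append((link, t))
    return matched

# Made-up records: only the first failure has a nearby alarm.
failures = [("L1", 100.0), ("L2", 200.0), ("L1", 500.0)]
alarms = [("L1", 98.5), ("L2", 260.0)]
optical = match_failures_to_alarms(failures, alarms)
share = len(optical) / len(failures)
```

The resulting `share` is the fraction of failures attributed to optical problems in the toy data, which is the kind of ratio the percentages above report.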
44
Reducing Critical Failures
  • Replace old optical fibers/parts.
  • Optical protection.
  • Push the traffic away.
    • Also works for maximum load and peak factor.

45
Performance Improvement
Time-based metrics                      Link-based metrics
Metric                    Improvement   Metric                    Improvement
# of failures             41            # of failures             45
# of affected O-D pairs   36            # of affected O-D pairs   37
# of BGP prefixes         32            # of BGP prefixes         29
Path unavailability       39            Path coverage             42
Total rerouted traffic    29            Total rerouted traffic    38
46
Reducing Link Down-time
  • Low-failure links
    • Failures are very rare.
    • Damping doesn't help.
  • High-failure links
    • Failure rate changes very slowly.
    • Fixed damping is wasteful.

47
Adaptive Damping
  • Input
    • ? time difference between the last two failures
    • ? threshold
    • ? constant
  • Output
    • ADT: adaptive damping timer

function Adaptive_Damping
begin
  if (? < ?)
    ADT = ? x ?
  else
    ADT = 0
end
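
The symbols in the pseudocode above did not survive extraction, so the following is only one possible reading. It assumes the comparison is "time difference below threshold" and that the timer is the constant multiplied by the time difference; multiplying by the threshold instead would fit the surviving text equally well. The parameter names are hypothetical.

```python
def adaptive_damping(time_gap, threshold, constant):
    """Return the adaptive damping timer (ADT).

    time_gap:  time between the last two failures of the link.
    threshold: gap below which the link is treated as unstable.
    constant:  scaling factor for the timer.
    ASSUMPTION: ADT = constant * time_gap. The operands of the product
    are not legible in the source, so this choice is a guess.
    """
    if time_gap < threshold:
        return constant * time_gap
    return 0  # failures far apart: no damping

# Made-up values, in seconds.
recent = adaptive_damping(time_gap=10, threshold=60, constant=2)
stable = adaptive_damping(time_gap=120, threshold=60, constant=2)
```

Under this reading a link that failed twice within the threshold is held down for a time proportional to the gap, while a link whose failures are far apart is not damped at all.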
48
Number Duration Pareto Curve
49
Thank you!