An introduction to - PowerPoint PPT Presentation

Slides: 65
Provided by: uninaStid5

1
  • An introduction to
  • NETWORK RESILIENCY
  • Giorgio Ventre, Stefano Avallone
  • COMICS Group, Dipartimento di Informatica e
    Sistemistica, Università di Napoli Federico II

2
References
  • Jean-Philippe Vasseur, Mario Pickavet, Piet
    Demeester. Network Recovery: Protection and
    Restoration of Optical, SONET-SDH, IP, and MPLS.
    Morgan Kaufmann.
  • AA. VV. Building Survivable Networks, Feature
    Issue of IEEE Network Magazine, March/April 2004.

3
Communication Networks Relevance
  • Communication networks are becoming fundamental
    infrastructures:
  • the amount of data carried by communication
    networks has grown considerably in recent years
  • many social and economic activities depend on
    communication networks
  • many safety-critical activities depend on
    communication networks.
  • Reliability is an essential feature of today's
    communication networks!

4
Network Reliability: definition [1]
  • (a) The ability of a network to maintain or
    restore an acceptable level of performance during
    network failures by applying various restoration
    techniques, and (b) the mitigation or prevention of
    service outages from network failures by applying
    preventive techniques.
  • Synonym: Network Survivability.
  • [1] Alliance for Telecommunications Industry
    Solutions (ATIS),
    http://www.atis.org/tg2k/_network_reliability.html

5
Network Reliability: related concepts
  • Many concepts are related to network
    reliability, for example:
  • network element reliability: the probability that
    a network element is fully operational during a
    certain period of time
  • network element availability: the probability
    that a network element is in an up state at a
    given instant of time t
  • network element fault: the inability of a network
    element to perform a required action
  • ....

6
Which failures may occur?
  • The ability of a network to provide the required
    services may be compromised by different kinds of
    failures:
  • planned or unplanned failures
  • internal or external failures
  • software or hardware failures
  • malicious or accidental failures
  • ....

7
Accounted Failures
  • Providing countermeasures for every failure that
    may occur in a communication network is
    infeasible.
  • Network providers and ISPs normally prepare
    action plans that address the most frequent
    failures.
  • These failures are called accounted failures.
  • The most common types of accounted failure are:
  • single link failure
  • single node failure.

8
Failures' Impact
  • In today's communication networks a single failure
    may produce a major disruption in network
    availability.
  • A single cut in an optical cable may drop
    thousands of logical network connections.
  • On July 5, 2002 a submarine cable break affected
    the Asia Pacific Cable Network 2 (APCN 2), causing
    a considerable slowdown in all the network
    connections among Japan, China, South Korea, etc.

9
Failures' Impact: ATC systems
  • Press release
    (http://www.natca.org/mediacenter/press-release-detail.aspx?id=394)
  • MASSIVE POWER, COMMUNICATIONS FAILURE AT MAJOR
    AIR TRAFFIC CONTROL CENTER PUTS CONTROLLERS IN
    DARK, FLIGHTS IN JEOPARDY
  • 07/19/2006, Bob Marks. PALMDALE, Calif. - A massive
    power and communications failure late Tuesday at
    the Los Angeles Air Route Traffic Control Center
    left scrambling air traffic controllers to deal
    with a nightmare scenario: how to keep dozens of
    flights away from each other above a large swath
    of the Southwestern United States despite the
    inability to see them, talk to them or relay
    crucial instructions for 15 excruciatingly long
    minutes.
  • Every ounce of skill, heart and determination
    that controllers bring into the control room
    every day was put to the test during one of the
    worst outages to ever hit the facility. It was so
    bad, controllers say, that the only thing they
    had of use to aid the situation that actually
    worked was their cell phones, devices which the
    Federal Aviation Administration, inexplicably,
    has barred from control rooms, further impeding
    the safety of the system.
  • More details in
    http://themainbang.typepad.com/blog/2006/07/complete_failur.html

10
Network Reliability Parameters
  • Some parameters that may be used to characterize
    the reliability of a network are defined in ITU-T
    Recommendation G.911:
  • Parameters and Calculation Methodologies for
    Reliability and Availability of Fibre Optic
    Systems
  • The following slides introduce some of the
    parameters defined in G.911

11
Failure in Time (FITs) and Maintenance Time
  • Failure in Time (FIT)
  • the number of failures of a device observed in a
    specific time interval
  • normally expressed as failures per billion
    device-hours.
  • Maintenance Time
  • the time interval during which a maintenance
    action is performed on an item, either manually or
    automatically, ...

12
Mean Time Between Failure (MTBF)
  • The Mean Time Between Failures (MTBF) is the
    steady-state expectation of time between failures
  • Mathematically, the MTBF (in years per failure) is
    related to the failure rate F (in FITs, i.e.
    failures per 10^9 hours) as follows, taking 8,760
    hours per year:
    $\mathrm{MTBF} = \frac{10^9}{8760 \cdot F}$
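As a quick numerical check of the FIT-to-MTBF conversion, here is a minimal sketch. It assumes that 1 FIT means one failure per 10^9 device-hours and that a year has 8,760 hours; the function name and the 1,000 FIT example value are illustrative and do not come from the slides.

```python
HOURS_PER_YEAR = 8760  # assumption: 365 days * 24 hours


def mtbf_years(fit: float) -> float:
    """Convert a failure rate in FITs (failures per 1e9 device-hours) to MTBF in years."""
    if fit <= 0:
        raise ValueError("failure rate must be positive")
    mtbf_hours = 1e9 / fit          # mean hours between failures
    return mtbf_hours / HOURS_PER_YEAR


# Hypothetical device rated at 1,000 FITs
example = mtbf_years(1000.0)
print(f"MTBF = {example:.1f} years")
```

A lower FIT rating gives a proportionally longer MTBF: halving F doubles the result.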

13
Mean Time To Repair (MTTR)
  • The Mean Time To Repair (MTTR) is defined as
    total corrective maintenance time divided by the
    total number of corrective maintenance actions
    during a period of time.
  • Given the definitions of MTBF and MTTR, the
    availability A of an item may be derived as
    $A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}$
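The sketch below evaluates the usual steady-state relation between MTBF, MTTR and availability, A = MTBF / (MTBF + MTTR). The function name and the sample figures (10,000 hours between failures, 4 hours to repair) are assumptions chosen only for illustration.

```python
def availability(mtbf: float, mttr: float) -> float:
    """Steady-state availability; mtbf and mttr must use the same time unit."""
    if mtbf <= 0 or mttr < 0:
        raise ValueError("mtbf must be positive and mttr non-negative")
    return mtbf / (mtbf + mttr)


# Hypothetical item: fails every 10,000 h on average, takes 4 h to repair
a = availability(mtbf=10_000.0, mttr=4.0)
print(f"A = {a:.6f}")
```

The relation shows two levers for raising availability: making failures rarer (larger MTBF) or making recovery faster (smaller MTTR).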

14
Users, services and reliability requirements
  • Network reliability is a relative concept.
  • The reliability requirements of a communication
    network depend on
  • the user type
  • the service type.
  • Different user-service combinations lead to
    different requirements in terms of MTBF and MTTR.

15
User classification
  • According to their reliability requirements,
    network users may be classified in the following
    categories
  • Safety-critical users. Users for whom service
    interruptions are unacceptable.
  • Business-critical users. Users for whom any
    service interruption leads to a high financial
    loss.
  • Low-cost users. Users for whom service
    interruptions cause only discomfort.
  • Basic-level users. Users for whom service
    reliability is only a side effect.

16
Availability: Impact of Outages
Ref: Service Applications for SONET DCS
Distributed Restoration, IEEE J. Selected Areas
in Comm., Jan. 1994
[Figure: service outage impact versus restoration
time after failure detection. Time axis marks: 0,
50 ms, 200 ms, 2 s, 10 s, 5 min, 15 min, 30 min.
Ranges marked: Protection Switching Range, 1st to
4th Restoration Target Ranges, Undesirable,
Unacceptable.]
Impacts annotated in the figure:
  • Service hit (reframes)
  • May drop voiceband calls depending on channel
    bank vintage
  • Potential voiceband disconnects (< 5%)
  • Trigger changeover of CCS7 STP signaling links
  • Effect cell rerouting process
  • Drop all circuit-switched connections
  • PL disconnects
  • Potential packet (X.25) disconnects
  • Potential data session time-outs
  • Packet (X.25) disconnects
  • Data session time-outs
  • Network congestion
  • Minor social/business impacts
  • Major social/business impacts
  • Potentially FCC reportable
17
Market Drivers for Survivability
  • Customer Relations
  • Competitive Advantage
  • Revenue
  • Negative - Tariff Rebates
  • Positive - Premium Services
  • Business Customers
  • Medical Institutions
  • Government Agencies
  • Impact on Operations
  • Minimize Liability

18
Network Survivability
  • Availability: 99.999% (5 nines) => less than 5
    min downtime per year
  • Since a network is made up of several components,
    the ONLY way to reach 5 nines is to add
    survivability in the face of failures
  • Survivability: continued service in the
    presence of failures
  • Protection switching or restoration mechanisms are
    used to ensure survivability
  • Add redundant capacity, detect faults and
    automatically re-route traffic around the failure
  • Restoration: related term, but slower time-scale
  • Protection: fast time-scale, 10s-100s of ms
  • implemented in a distributed manner to ensure
    fast restoration
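The claim that a chain of components cannot reach five nines without redundancy can be illustrated with a small calculation. This sketch assumes independent failures, ten elements in series at 99.99% each, and a fully independent duplicate path; all of these figures and the function names are illustrative, not taken from the slides.

```python
from math import prod


def series(availabilities):
    """All elements must be up: availabilities multiply."""
    return prod(availabilities)


def parallel(availabilities):
    """At least one branch must be up: unavailabilities multiply."""
    return 1 - prod(1 - a for a in availabilities)


# Ten elements in series, each 99.99% available (assumed values)
chain = series([0.9999] * 10)
# The same chain duplicated on an independent protection path
protected = parallel([chain, chain])

print(f"single chain:    {chain:.6f}")
print(f"protected chain: {protected:.8f}")
```

Under these assumptions the unprotected chain falls to roughly three nines, while the duplicated chain exceeds five nines. Real protection paths often share risks with the working path, so the independence assumption is optimistic.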

19
Failure Types and Other Motivations
  • Types of failure
  • Components: links, nodes, channels in WDM, active
    components, software
  • Human error: backhoe fiber cut
  • Fiber inside oil/gas pipelines is less likely to
    be cut
  • Systems: entire COs can fail due to catastrophic
    events
  • Protection allows easy maintenance and upgrades
  • E.g. switch traffic over when servicing a link
  • Single failure vs multiple concurrent failures
  • Goal: mean repair time << mean time between
    failures
  • Protection also depends upon the kind of
    application.
  • Survivability may hence be provided at several
    layers

20
Network Survivability Architectures
Linear Protection Architectures
Ring Protection Architectures
Mesh Restoration Architectures
21
Network Availability and Survivability
Availability is the probability that an item will
be able to perform its designed functions at the
stated performance level, within the stated
conditions and in the stated environment when
called upon to do so.
$\text{Availability} = \frac{\text{Reliability}}{\text{Reliability} + \text{Recovery}}$
22
Quantification of Availability
Percent Availability   N-Nines   Downtime (Minutes/Year)
99%                    2-Nines   5,000
99.9%                  3-Nines   500
99.99%                 4-Nines   50
99.999%                5-Nines   5
99.9999%               6-Nines   0.5
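The downtime figures in the table are rounded to powers of ten. A minimal sketch of the exact calculation follows, assuming a 365-day year (525,600 minutes); the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected yearly downtime for a given availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for percent in (99.0, 99.9, 99.99, 99.999, 99.9999):
    minutes = downtime_minutes_per_year(percent)
    print(f"{percent}% -> {minutes:.2f} min/year")
```

With this assumption, five nines corresponds to about 5.26 minutes per year, which the table rounds to 5.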
23
PSTN
  • Individual elements have an availability of
    99.99%
  • One cut-off call in 8,000 calls (3 min for average
    call). Five ineffective calls in every 10,000
    calls.

[Figure: end-to-end PSTN reference connection
through NI, AN, LE and LD elements and facility
entrances, annotated with the per-element values
0.005 (four times), 0.01 (AN, twice) and 0.02.]
NI: Network Interface, LE: Local Exchange, LD:
Long Distance, AN: Access Network
Source:
http://www.packetcable.com/downloads/specs/pkt-tr-voipar-v01-001128.pdf
24
IP Network Expectations
Service                                        Delay  Jitter  Loss  Availability
Real Time Interactive (VoIP, Cell Relay ...)   L      L       L     H
Layer 2 and Layer 3 VPNs (FR/Ethernet/AAL5)    M, H, L, L (column assignment lost)
Internet Service                               H      H       M     L
Video Services                                 L      M       M     H
L: Low, M: Medium, H: High
25
Measuring Availability: The Port Method
  • Based on the port count in the network
  • Does not take into account the bandwidth of ports
  • e.g. OC-192 and 64k are both ports
  • Good for dedicated access service because ports
    are tied to customers.

$$\text{Availability (\%)} = \frac{(\text{total number of ports} \times \text{sample period}) - (\text{number of impacted ports} \times \text{outage duration})}{\text{total number of ports} \times \text{sample period}} \times 100$$
26
The Port Method: Example
  • Network with 10,000 active access ports
  • An access router with 100 access ports fails for
    30 minutes.
  • Total available port-hours: 10,000 × 24 = 240,000
  • Total down port-hours: 100 × 0.5 = 50
  • Availability for a single day:
    (240,000 - 50) / 240,000 × 100 = 99.979166%
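The worked example above can be reproduced with a few lines of code. This is a minimal sketch of the port method; the function name and the representation of outages as (impacted ports, hours) pairs are assumptions made for illustration.

```python
def port_availability(total_ports, sample_hours, outages):
    """Port-method availability in percent.

    outages is an iterable of (impacted_ports, outage_hours) pairs.
    """
    total_port_hours = total_ports * sample_hours
    down_port_hours = sum(ports * hours for ports, hours in outages)
    return (total_port_hours - down_port_hours) / total_port_hours * 100


# 10,000 ports observed for one day; 100 ports down for 30 minutes
result = port_availability(10_000, 24, [(100, 0.5)])
print(f"{result:.6f} %")
```

Several outages in the same sample period can be passed in one list, since down port-hours simply add up.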

27
The Bandwidth Method
  • Based on the amount of bandwidth available in
    the network
  • Takes into account the bandwidth of ports
  • Good for core routers

$$\text{Availability (\%)} = \frac{(\text{total amount of BW} \times \text{sample period}) - (\text{amount of BW impacted} \times \text{outage duration})}{\text{total amount of BW} \times \text{sample period}} \times 100$$
28
The Bandwidth Method: Example
  • Total capacity of network: 100 Gigabits/sec
  • An access router with 1 Gigabit/sec of BW fails
    for 30 minutes.
  • Total BW available in network for a day:
    100 × 24 = 2,400
  • Total BW lost in outage: 1 × 0.5 = 0.5
  • Availability for a single day:
    ((2,400 - 0.5) / 2,400) × 100 = 99.979166%
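To show where the port and bandwidth methods diverge, the sketch below applies both to the same outage. It first reproduces the example above (modelled, as an assumption, as 100 ports of 1 Gbit/s with one port down), then uses a hypothetical ten-port network mixing 64 kbit/s ports with one OC-192 port. All function and variable names are illustrative.

```python
OC192_MBPS = 9953.28  # OC-192 line rate in Mbit/s
DS0_MBPS = 0.064      # 64 kbit/s port


def port_method(ports_mbps, down_ports, outage_hours, sample_hours):
    """Availability in percent, counting every port equally."""
    total = len(ports_mbps) * sample_hours
    down = len(down_ports) * outage_hours
    return (total - down) / total * 100


def bandwidth_method(ports_mbps, down_ports, outage_hours, sample_hours):
    """Availability in percent, weighting each port by its bandwidth."""
    total = sum(ports_mbps) * sample_hours
    down = sum(ports_mbps[i] for i in down_ports) * outage_hours
    return (total - down) / total * 100


# Slide example: 100 Gbit/s in total, 1 Gbit/s lost for 30 minutes in a day
slide_example = bandwidth_method([1000.0] * 100, [0], 0.5, 24.0)

# Hypothetical mixed network: the single OC-192 port is down for 1 hour
ports = [DS0_MBPS] * 9 + [OC192_MBPS]
by_port = port_method(ports, [9], 1.0, 24.0)
by_bw = bandwidth_method(ports, [9], 1.0, 24.0)

print(f"slide example:    {slide_example:.6f} %")
print(f"port method:      {by_port:.4f} %")
print(f"bandwidth method: {by_bw:.4f} %")
```

In the mixed network the port method reports a much higher availability than the bandwidth method, because losing the one high-speed port removes almost all of the capacity while counting as only one port in ten.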

29
Basic Ideas: Working and Protect Fibers
30
Service classification (1/2)
  • Communication networks are used to carry many
    different services.
  • Different services may have different reliability
    requirements.
  • The reliability requirements of such services are
    related to QoS parameters:
  • Bit Rate
  • Delay
  • Jitter
  • ...

31
Service classification (2/2)
[2] A. Lason, et al., Network Scenarios and
Requirements, European IST project Layers
Internetworking in Optical Network (LION),
deliverable D6, September 1999.
32
How to increase network reliability?
  • Prevent network failures
  • put network cables deeper in the ground
  • more testing for hardware and software
  • .....
  • Duplicate vulnerable network elements
  • dual homing.
  • Regardless of these measures, network
    failures still occur.
  • There is a need for network recovery or resilience
    schemes!

33
Network recovery: basic idea
  • Build networks to have alternate paths
  • Design systems to have alternate entities
  • Monitor for possible failures
  • Manage networks proactively

34
Network recovery requirements
  • Network recovery imposes several requirements.
    For example
  • there should be backup capacity to create a
    recovery path
  • the backup capacity must be enough to ensure QoS
    constraints
  • single points of failure must be avoided
  • .....

35
Recovery and reversion cycles
Recovery Cycle
Reversion Cycle
36
Recovery mechanisms
  • A wide variety of recovery mechanisms exists.
  • Every mechanism has advantages and drawbacks.
  • The following slides report some criteria that
    may be used to evaluate and classify recovery
    mechanisms [3, 4].
  • [3] V. Sharma et al., Framework for MPLS-based
    Recovery, RFC 3469, IETF, Feb. 2003
  • [4] K. Owens, V. Sharma, M. Oommen, and F.
    Hellstrand, Network Survivability Considerations
    for Traffic Engineered IP Networks, Internet
    draft draft-owens-te-network-survivability-03,
    May 2002. Available at www.ietf.org. Accessed
    July 2005

37
Backup Capacity
  • Dedicated
  • one-to-one relationship between the backup
    resources and the working path
  • the simplest solution
  • an inefficient solution.
  • Shared
  • the backup resources are shared among different
    working paths
  • a more complex solution
  • a more efficient solution.

38
Recovery Path
  • Preplanned
  • recovery paths for all accounted failure
    scenarios are calculated in advance
  • allows fast recovery from failures
  • lacks flexibility for unaccounted failure
    scenarios.
  • Dynamic
  • the recovery path is calculated on the fly when
    the failure is detected
  • may also be used to search for recovery paths in
    unaccounted failure scenarios.

39
Recovery Approaches
  • Protection
  • the recovery paths are preplanned and fully
    signaled before a failure occurs
  • when a failure occurs no additional signaling is
    needed to establish the recovery path
  • it is the faster solution.
  • Restoration
  • the recovery paths may be preplanned or
    dynamically allocated but are not signaled in
    advance
  • when a failure occurs additional signaling is
    needed to establish the recovery path
  • it is a more flexible solution.

40
Protection Variants (1/2)
  • 1+1 Protection (Dedicated Protection)
  • there is exactly one dedicated recovery path for
    each working segment
  • the traffic is permanently duplicated on both the
    working path and the recovery path
  • it is a quite expensive solution.
  • 1:1 Protection (Dedicated Protection with extra
    traffic)
  • there is exactly one dedicated recovery path for
    each working segment
  • the traffic is transmitted over only one path at
    a time
  • it is possible to transport extra traffic along
    the recovery path in failure-free conditions.

41
Protection Variants (2/2)
  • 1:N (Shared Recovery With Extra Traffic)
  • each recovery entity is used to protect N working
    entities
  • it is possible to use the recovery entities to
    transport extra traffic in failure-free
    conditions.
  • M:N (M ≤ N)
  • a set of M recovery entities is used to protect
    a set of N working entities
  • it is possible to use the recovery entities to
    transport extra traffic in failure-free
    conditions.
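The protection variants differ mainly in how many recovery entities back how many working entities, and in whether traffic is permanently bridged onto the recovery path. The sketch below models just those two properties. The class and attribute names are hypothetical, and the 1:4 and 2:6 configurations are arbitrary examples of 1:N and M:N.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtectionScheme:
    name: str
    recovery_entities: int  # M
    working_entities: int   # N
    permanent_bridge: bool  # True only for 1+1: traffic is always duplicated

    @property
    def redundancy(self) -> float:
        """Recovery entities per working entity (1.0 means 100% spare)."""
        return self.recovery_entities / self.working_entities

    @property
    def allows_extra_traffic(self) -> bool:
        """An idle recovery entity can carry extra traffic unless it is permanently bridged."""
        return not self.permanent_bridge


SCHEMES = [
    ProtectionScheme("1+1", 1, 1, permanent_bridge=True),
    ProtectionScheme("1:1", 1, 1, permanent_bridge=False),
    ProtectionScheme("1:4", 1, 4, permanent_bridge=False),
    ProtectionScheme("2:6", 2, 6, permanent_bridge=False),
]

for scheme in SCHEMES:
    print(f"{scheme.name}: redundancy {scheme.redundancy:.0%}, "
          f"extra traffic possible: {scheme.allows_extra_traffic}")
```

The redundancy ratio falls as more working entities share the same recovery entities. The price is that at most M simultaneous failures within the protected group can be recovered.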

42
Recovery Extent (1/2)
  • Local Recovery
  • in a failure condition only the affected network
    elements are bypassed using the recovery path
  • the recovery head-end (RHE) and recovery tail-end
    (RTE) are close to the failure, so they may detect
    it quickly, leading to a shorter recovery time
  • after a failure the route followed by the
    traffic may not be optimal (e.g. the same traffic
    may cross a link twice!)
  • recovery fails if two successive nodes fail

43
Recovery Extent (2/2)
  • Global Recovery
  • in a failure condition the complete working path
    between source and destination is bypassed
  • the recovery time is greater than that of
    local recovery
  • an optimal recovery path is used in case of
    failure
  • a failure of two successive nodes can still be
    recovered
  • may generate more state overhead than the local
    approach.
  • An intermediate solution between the local and
    global approaches may be adopted!
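The difference between local and global recovery can be seen on a small graph. This sketch uses a hypothetical six-node topology and breadth-first search for hop-minimal paths. Local recovery splices a detour around the failed link into the working path, while global recovery recomputes the whole route from source to destination. Node names, topology and function names are all assumptions made for illustration.

```python
from collections import deque


def shortest_path(graph, src, dst, banned_links=frozenset()):
    """Breadth-first search; returns a hop-minimal list of nodes, or None."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in parents and frozenset((node, nxt)) not in banned_links:
                parents[nxt] = node
                queue.append(nxt)
    return None


def local_recovery(graph, working, failed_link):
    """Bypass only the failed link; the rest of the working path is kept."""
    i = next(k for k in range(len(working) - 1)
             if frozenset(working[k:k + 2]) == failed_link)
    detour = shortest_path(graph, working[i], working[i + 1], {failed_link})
    if detour is None:
        return None
    return working[:i] + detour + working[i + 2:]


def global_recovery(graph, working, failed_link):
    """Recompute the whole path between source and destination."""
    return shortest_path(graph, working[0], working[-1], {failed_link})


# Hypothetical topology (undirected, adjacency lists)
GRAPH = {
    "A": ["B", "F"], "B": ["A", "C", "E"], "C": ["B", "D", "E"],
    "D": ["C", "F"], "E": ["B", "C"], "F": ["A", "D"],
}
WORKING = ["A", "B", "C", "D"]
FAILED = frozenset(("B", "C"))

local_path = local_recovery(GRAPH, WORKING, FAILED)
global_path = global_recovery(GRAPH, WORKING, FAILED)
print("local: ", "-".join(local_path))
print("global:", "-".join(global_path))
```

In this topology the local detour yields a four-hop route, whereas the global computation finds a two-hop route. That matches the trade-off listed above: local recovery reacts close to the failure but may leave traffic on a suboptimal path.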

44
Control of Recovery Mechanisms (1/2)
  • Centralized
  • a central controller determines the action to
    take in case of failure
  • the central controller also determines when and
    where a fault has occurred
  • the central controller is a single point of
    failure
  • it is generally an efficient approach
  • in principle it is a simpler approach, but
  • the central controller may become a very complex
    system

45
Control of Recovery Mechanisms (2/2)
  • Distributed
  • there is no centralized controller; all the
    network elements are capable of reacting
    autonomously to failures
  • with this approach there is no global view of
    the network condition
  • the network elements may have to exchange
    information to keep a consistent view of the
    network
  • it is a more scalable approach.

46
Protection Topologies - Ring
  • Two or more nodes connected to each other with a
    ring of links

[Figure: ring of nodes with ports labelled E and W,
showing the Working and Protect paths]
47
Protection Topologies - Mesh
  • Three or more nodes connected to each other
  • Can be sparse or complete meshes
  • Spans may be individually protected with linear
    protection
  • Overall edge-to-edge connectivity is protected
    through multiple paths

[Figure: mesh of nodes showing Working and Protect
paths]
48
Protection Switching Terminology
  • 1+1 architectures - permanent bridge at the
    source - select at sink
  • m:n architectures - m entities provide protection
    for n working entities, where m is less than or
    equal to n
  • allows unprotected extra traffic
  • most common - SONET linear 1+1 and 1:n

49
1+1 vs 1:n
[Figure: Working and Protect links for the 1+1 case
and for the 1:n case]
50
SONET Linear 1+1 APS
[Figure: two nodes connected by Working and
Protection links in both directions, each node with
a bridge and a switch]
TX: Transmitter, RX: Receiver, BR: Bridge, SW: Switch
51
SONET 1:1 Linear APS
[Figure: two nodes connected by Working and
Protection links in both directions, each node with
a bridge and a switch, coordinated over an APS
channel]
TX: Transmitter, RX: Receiver, BR: Bridge, SW: Switch
52
Protection Switching Terminology
  • Dedicated vs shared: the working connection is
    assigned dedicated or shared protection bandwidth
  • 1+1 is dedicated, 1:n is shared
  • Revertive vs non-revertive: after the failure is
    fixed, traffic is switched back automatically or
    manually
  • Shared protection schemes are usually revertive
  • Uni-directional or bi-directional protection
  • Uni-directional: each direction of traffic is
    handled independently of the other.
  • Fiber cut => only one direction is switched over
    to protection. Usually done with dedicated
    protection; no signaling required.
  • Bi-directional: transmission on fiber (full
    duplex) => requires bi-directional switching;
    signaling required

53
Mesh Restoration
[Figure: mesh of DCS nodes showing a working path,
line (link) restoration and path restoration]
  • Control: centralized or distributed
  • Route calculation: preplanned or dynamic
  • Type of alternate routing: line or path

54
Link vs. Path restoration
  • Link restoration
  • Requires the ability to identify the failed link
    at both ends.
  • Cannot protect against node failure.
  • Link based
  • Mesh (generalized loopback): insensitive to
    additions to the network; scalable; backup path
    can be pre-computed; fast recovery; dynamic
    rerouting
  • Path restoration
  • More resilient than link restoration.
  • Reroutes the traffic from the primary path to a
    Shared Risk Group (SRG)-disjoint backup path.
  • Protects both end-to-end paths and single links.
  • Preferred: path based
55
Link vs. Path restoration
[Figure: six-node network (A to F) with a cut link,
comparing link (generalized loopback) restoration
with path restoration]
56
Pre-compute vs. Real-time
  • Pre-computed
  • calculates restoration paths before a failure
    happens.
  • Allows prior availability of reroute information
    to the nodes where actions need to be taken after
    failure is detected.
  • Enables fast restoration.
  • Real-time
  • calculates restoration paths after a failure
    happens.
  • Restoration is slower.
  • Enables more efficient capacity utilization.
  • Preferred: pre-computed

57
Centralized vs. Distributed
  • Centralized restoration
  • Computes restoration and primary paths for all
    demands with up-to-date information
  • Routes may then be downloaded into nodal
    databases.
  • Effectiveness?
  • More capacity efficiency
  • Possibly slow (but may be executed in the
    background)
  • Scalability in question.
  • Distributed restoration
  • Source and destination nodes dynamically search
    for the protection wavelengths required to
    reestablish the disrupted lightpath
  • Because a node lacks knowledge of the sharing
    databases of other OXCs, it may not be able to
    determine backup sharability for a given primary
    path
  • Preferred:
  • Central path determination
  • Distributed restoration

58
Protection Topologies - Linear
  • Two nodes connected to each other with two or
    more sets of links

[Figure: Working and Protect links for the 1+1 case
and for the 1:n case]
59
Mesh Restoration vs Ring/Linear Protection
Extracted from T-H. Wu, Emerging Technologies
for Fiber Network Survivability, See References
60
IP layer restoration
  • IP Layer Restoration (real-time)
  • Achieved by exchanging control messages between
    adjacent routers
  • Re-determine the affected route
  • Update routing tables
  • Propagate changes (OSPF, BGP-4)
  • Capable of recovery from multiple faults
  • Slow (10s of seconds to minutes [Fumagalli]);
    requires online processing upon failure
  • Fault discovery
  • Explicitly: ICMP messaging
  • Implicitly: expiry of timers
  • Guarantees network-wide survivability
  • Independent of the underlying physical network

[Figure: layer stack: Application, Presentation,
Session, Transport, Network (IP), Data Link,
Physical]
61
MPLS layer restoration
  • MPLS Layer Protection
  • Real-time or pre-computed
  • Line or path level protection
  • Protection path is node and link disjoint from
    the primary path.
  • Protection path may be allocated to low-priority
    traffic in the absence of network failure.
  • Faster than dynamic IP rerouting
  • Working LSPs have pre-established node/link
    disjoint protection paths

[Figure: layer stack: Application, Presentation,
Session, Transport, Network, MPLS, Data Link,
Physical]
62
Optical layer restoration
  • Optical layer restoration
  • Real-time or pre-computed
  • Ring protection or mesh restoration
  • No visibility into higher layer operations.
  • May be wasteful use of resources.
  • For ring protection, there is over 100% capacity
    redundancy
  • For mesh restoration, a 60-80% physical
    redundancy level is typical.
  • Not recommended for node (or software) failures
  • Faster than higher layer restorations (??)

[Figure: layer stack: Application, Presentation,
Session, Transport, Network (IP), DWDM (Optical),
Physical]
63
Multilayer Recovery (1/2)
  • In a multilayer network it is possible to imagine
    a situation in which each layer has its own
    recovery mechanisms.
  • Not every failure in a particular layer may be
    resolved in the same layer.
  • If a failure can be resolved in several layers,
    uncoordinated actions may produce inefficient
    results.
  • Coordination among the layers is needed!

64
Multilayer Recovery (2/2)
  • Sequential Approach [1]
  • a hold-off time imposes a chronological order
    among the recovery mechanisms adopted in the
    different layers
  • alternatively, a token may be used to impose a
    sequential order among the different layers.
  • Integrated Approach [1]
  • there is a single recovery scheme that has a full
    overview of all the layers
  • the recovery scheme may decide when and in which
    layer (or layers) the recovery actions must be
    taken.
  • [1] D. Colle, et al., Data-centric optical
    networks and their survivability, IEEE Journal on
    Selected Areas in Communications, Volume 20,
    Issue 1, Jan. 2002, pp. 6-20
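The hold-off idea in the sequential approach can be sketched as a bottom-up escalation. Each layer gets one hold-off interval to recover before the next layer up takes over, and the top layer has nothing above it to wait for. The layer names, recovery times and the 50 ms hold-off are hypothetical values, and the function is an illustration rather than a description of any specific protocol.

```python
from typing import Optional, Sequence, Tuple


def sequential_recovery(
    layers: Sequence[Tuple[str, Optional[float]]],
    hold_off_ms: float,
) -> Tuple[Optional[str], float]:
    """Return (recovering layer, elapsed ms) for a bottom-up escalation.

    layers lists (name, recovery_time_ms) from the lowest layer upwards;
    a recovery time of None means the layer cannot resolve this failure.
    """
    elapsed = 0.0
    for index, (name, recovery_ms) in enumerate(layers):
        is_top = index == len(layers) - 1
        if recovery_ms is not None and (is_top or recovery_ms <= hold_off_ms):
            return name, elapsed + recovery_ms
        elapsed += hold_off_ms  # hold-off timer expires, next layer takes over
    return None, elapsed


# Failure the optical layer cannot see (assumed), so MPLS recovers it
case_a = sequential_recovery(
    [("optical", None), ("MPLS", 40.0), ("IP", 5000.0)], hold_off_ms=50.0)
# Fiber cut handled directly by optical protection (assumed 30 ms)
case_b = sequential_recovery(
    [("optical", 30.0), ("MPLS", 40.0), ("IP", 5000.0)], hold_off_ms=50.0)

print(case_a)
print(case_b)
```

The first case shows the cost of the sequential approach: when the lowest layer cannot resolve the failure, its hold-off time is added to the total recovery time.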