Provided by: Fabian55

Title: Introduction High-Availability Systems: An Example

1
Introduction - High-Availability Systems: An Example
AT&T

AT&T pioneered fault tolerance (FT) in telephone switching applications.
Aggressive availability goal: 2 hours of downtime in 40 years (i.e., 3 min/year), with less than 0.01% of the calls handled incorrectly.
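
The arithmetic behind that goal can be checked with a short sketch. The function names and the 365.25-day year are assumptions made for illustration; only the 2 hours and 40 years come from the slide.

```python
# Assumption: an average year of 365.25 days.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(downtime_hours, years):
    """Spread a total downtime budget evenly over the given number of years."""
    return downtime_hours * 60 / years


def availability(downtime_hours, years):
    """Fraction of time the system is up under that downtime budget."""
    return 1 - downtime_minutes_per_year(downtime_hours, years) / MINUTES_PER_YEAR


print(downtime_minutes_per_year(2, 40))   # 3.0 minutes per year
print(f"{availability(2, 40):.7f}")       # 0.9999943
```

Under these assumptions the goal corresponds to an availability of roughly 99.9994%.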

2

In 1978, Bell Labs collected data on historic trends in the causes of system downtime:
20% attributed to HW (good diagnostics and trouble-location programs can help minimize HW-induced downtime).
15% attributed to SW (SW deficiencies included improper translation of algorithms into code or improper specifications).
35% attributed to recovery deficiencies (these deficiencies can be caused by undetected faults or incorrect fault isolation).
30% attributed to human procedural error.

3

Other studies in the same direction ...

5
Note, however, that the thresholds differ: moderately high for failure to establish a call, and very low for disconnection of an established call.
[Figure: Levels of recovery in a telephone switching system]
6

In a typical telephone switching system, the tasks of the Central Control Unit relate to:
Overall system control/administration
Call processing
System maintenance
Automatic isolation of faulty units
Defensive SW strategies
Support for rapid repair

7
[Figure: Typical switching system diagram. Labeled blocks: Bus Interface, Program Store (PS), Central Control (CC), Call Store (CS), AU, Auxiliary Unit (AU) Bus.]
8
CC instructions reside in the program store (PS), while transient info (e.g., telephone calls, routing, equipment configuration) is held in the call store (CS).
The Auxiliary Unit (AU) Bus interfaces to disk and magnetic tape mass storage.
9
[Figure: Duplex configuration for switching computer. Duplicated blocks: Central Control (CC) 1 and 2, Program Store (PS) 1 and 2, Call Store (CS) 1 and 2, Bus Interface 1 and 2, AU 1 and 2, PSB1 and PSB2, PUB1 and PUB2, plus the Auxiliary Unit (AU) Bus. PSB: Program Store Bus. PUB: Peripheral Unit Bus.]
Duplex configuration for switching computer.
(Assuming that only one of each component is
required for a functional system, there are 64
possible system configurations.)
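
The count of 64 equals 2^6, which is what results when six component types each have two interchangeable copies and one copy of each is selected. The slide does not say which six types are counted, so the list in this sketch is a hypothetical placeholder, as is the function name.

```python
from itertools import product

# Hypothetical choice of six duplicated component types; the slide does not
# state which six are counted.
COMPONENT_TYPES = ["CC", "PS", "CS", "PSB", "PUB", "AU"]


def configurations(types):
    """Yield every way of picking copy 1 or copy 2 of each component type."""
    for choice in product((1, 2), repeat=len(types)):
        yield dict(zip(types, choice))


configs = list(configurations(COMPONENT_TYPES))
print(len(configs))   # 64
print(configs[0])     # copy 1 of every component type
```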
10
1- Both CCs operate in synchronism. Two matcher circuits compare 24 bits of internal state during each 5.5 µs machine cycle.
2- There are 6 different sets of internal nodes that can be monitored, depending on the instruction being executed.
3- A mismatch generates an interrupt, which calls fault recognition programs to determine which half of the system is faulty.
4- Information can be sampled by the matchers and retained for later examination by diagnostic programs.
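
The matching step can be pictured with a small software model. This is only a sketch: the real matchers are hardware, and the class name, function name, and the dictionary representation of CC state are assumptions made for illustration.

```python
MASK_24 = (1 << 24) - 1  # the matchers compare 24 bits per machine cycle


class MismatchInterrupt(Exception):
    """Models the interrupt raised when the two CCs disagree."""

    def __init__(self, node_set, value_a, value_b):
        super().__init__(f"mismatch on node set {node_set}")
        # The sampled values are retained for later diagnosis.
        self.node_set = node_set
        self.value_a = value_a
        self.value_b = value_b


def match_cycle(state_a, state_b, node_set):
    """Compare the 24-bit sample of one node set from both CCs.

    state_a and state_b map a node-set index (0 to 5) to an integer;
    which node set is monitored would depend on the instruction executed.
    """
    a = state_a[node_set] & MASK_24
    b = state_b[node_set] & MASK_24
    if a != b:
        raise MismatchInterrupt(node_set, a, b)
    return a


cc1 = {0: 0x00ABCD, 1: 0x123456}
cc2 = {0: 0x00ABCD, 1: 0x123457}   # disagrees with cc1 on node set 1
print(hex(match_cycle(cc1, cc2, 0)))
try:
    match_cycle(cc1, cc2, 1)
except MismatchInterrupt as irq:
    print("fault recognition would run for node set", irq.node_set)
```

Note that a mismatch only says the two halves disagree; deciding which half is faulty is left to the fault recognition programs, as item 3 states.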
11
5- The PS employs a Hamming code on the 37 data bits.
6- There are parity check bits over the address plus data bus; the CS has one parity bit over address and data, and another parity bit over the address alone.
7- Both PS and CS automatically retry operations upon error detection.
8- After a fault has been detected, the system configuration logic attempts to establish various combinations of subunits.
9- A sanity program is then executed.
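
The call store's two parity bits and the automatic retry can be sketched as follows. Even parity, the tuple layout of a stored word, the single retry, and all function names are assumptions for illustration; the slides do not specify them.

```python
def parity(value):
    """Return 1 if value has an odd number of 1 bits, else 0."""
    return bin(value).count("1") & 1


def encode(address, data):
    """Store data with one parity bit over address+data and one over address."""
    return (data, parity(address) ^ parity(data), parity(address))


def check(address, word):
    """Recompute both parity bits and compare them with the stored ones."""
    data, p_addr_data, p_addr = word
    return (p_addr_data == (parity(address) ^ parity(data))
            and p_addr == parity(address))


def read_with_retry(read, address, retries=1):
    """Call read(address); on a parity error, retry before giving up."""
    for _ in range(retries + 1):
        word = read(address)
        if check(address, word):
            return word[0]
    raise RuntimeError(f"persistent parity error at address {address}")


store = {5: encode(5, 0b1011)}
print(read_with_retry(store.__getitem__, 5))   # 11
```

A retry helps only with transient errors; a persistent error falls through to the reconfiguration described in item 8.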
12

Summarizing some features of the FT system:
Duplication of ALU.
30% of control logic devoted to self-checking.
EDAC on disks.
SW audits.
Sanity timer (a sanity program is similar to a maze that the HW must traverse before the sanity timer times out. If a time-out occurs, the reconfiguration logic generates a new configuration to be tried).
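
The sanity-timer loop can be sketched as a simple simulation. Everything here is hypothetical: the unit names, the tick-based timer, and the rule that a faulty unit never completes its part of the maze are assumptions chosen to keep the example deterministic.

```python
from itertools import product


def run_sanity_program(config, faulty_units, timeout_ticks):
    """Return True if the configuration finishes the maze before the time-out.

    Each healthy unit costs one tick; a faulty unit never completes its step,
    so the sanity timer would expire.
    """
    ticks = 0
    for unit in config:
        if unit in faulty_units:
            return False
        ticks += 1
        if ticks > timeout_ticks:
            return False
    return True


def reconfigure(candidates, faulty_units, timeout_ticks):
    """Try configurations in order until one passes the sanity program."""
    for config in candidates:
        if run_sanity_program(config, faulty_units, timeout_ticks):
            return config
    return None  # no working configuration was found


candidates = list(product(("CC1", "CC2"), ("PS1", "PS2"), ("CS1", "CS2")))
chosen = reconfigure(candidates, faulty_units={"CC1"}, timeout_ticks=10)
print(chosen)   # ('CC2', 'PS1', 'CS1')
```

The point of the design is that the check needs no knowledge of which unit failed: a configuration is accepted only if it demonstrably works within the time limit.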
