Title: Introduction High-Availability Systems: An Example
AT&T
- Pioneered FT in telephone switching applications.
- Aggressive availability goal: 2 hours of downtime
in 40 years (i.e., 3 min/year), with fewer than
0.01% of calls handled incorrectly.
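The "3 min/year" figure follows directly from the stated goal. A minimal sketch of the arithmetic, assuming a 365-day year (leap days ignored); the function names are illustrative only:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap days ignored (assumption)

def downtime_minutes_per_year(total_downtime_hours: float, years: float) -> float:
    """Spread a total downtime budget evenly over the service life."""
    return total_downtime_hours * 60 / years

def availability(downtime_min_per_year: float) -> float:
    """Fraction of the year the system is up."""
    return 1.0 - downtime_min_per_year / MINUTES_PER_YEAR

goal = downtime_minutes_per_year(total_downtime_hours=2, years=40)
print(goal)                          # 3.0 minutes per year
print(f"{availability(goal):.7f}")   # availability implied by the goal
```

Under these assumptions the goal corresponds to an availability a little better than "five nines".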
- In 1978, Bell Labs collected data on historic
trends of causes of system downtime:
- 20% attributed to HW (good diagnostics and
trouble-location programs can help minimize
HW-induced downtime).
- 15% attributed to SW (SW deficiencies included
improper translation of algorithms into code or
improper specifications).
- 35% attributed to recovery deficiencies (these
deficiencies can be caused by undetected faults
or incorrect fault isolation).
- 30% attributed to human procedural error.
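The four shares above sum to 100, so they read as percentages of total downtime. A small sketch that checks this and pulls out two observations; the dictionary keys are just labels chosen for this example:

```python
# Shares of system downtime by cause, as listed above (percent).
downtime_share = {
    "hardware": 20,
    "software": 15,
    "recovery deficiencies": 35,
    "human procedural error": 30,
}

total = sum(downtime_share.values())
non_hardware = total - downtime_share["hardware"]
largest = max(downtime_share, key=downtime_share.get)

print(total)         # 100
print(non_hardware)  # 80
print(largest)       # recovery deficiencies
```

In other words, hardware faults account for only a fifth of the downtime, and the largest single contributor is the recovery mechanism itself.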
- Other studies point in the same direction ...
Note, however, that the thresholds are different
for failure to establish a call (moderately high)
and disconnection of an established call (very
low).
Levels of recovery in a Telephone Switching System
- In a typical telephone switching system, the tasks
of the Central Control Unit relate to:
- Overall system control/administration
- Call processing
- System maintenance
- Automatic isolation of faulty units
- Defensive SW strategies
- Support for rapid repair
[Figure: Typical switching system diagram. Blocks
shown: Central Control (CC), Program Store (PS),
Call Store (CS), Bus Interface, and AU on the
Auxiliary Unit (AU) Bus.]
CC instructions reside in the program store (PS),
while transient info (e.g., telephone calls,
routing, equipment configuration) is held in the
call store (CS). The Auxiliary Unit (AU) Bus
interfaces to disk and magnetic-tape mass storage.
[Figure: Duplex configuration for switching
computer. Duplicated blocks shown: Central Control
1 and 2 (CC), Program Store 1 and 2 (PS), Call
Store 1 and 2 (CS), Bus Interface 1 and 2, AU 1 and
2 on the Auxiliary Unit (AU) Bus, and buses
PSB1/PSB2 and PUB1/PUB2. PSB = Program Store Bus;
PUB = Peripheral Unit Bus.]
(Assuming that only one of each component is
required for a functional system, there are 64
possible system configurations.)
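The count of 64 is what you get from six duplicated unit types with an independent choice of copy 1 or copy 2 for each (2^6 = 64). The sketch below enumerates them. The slide gives only the total, so the six unit names used here are an assumption made for illustration:

```python
from itertools import product

# Hypothetical names for six duplicated unit types; the slide states only the
# total (64), not which units are counted, so this list is an assumption.
UNIT_TYPES = ("CC", "PS", "CS", "PSB", "PUB", "AU")

def all_configurations(unit_types=UNIT_TYPES):
    """Every way to pick copy 1 or copy 2 of each unit type."""
    return [dict(zip(unit_types, copies))
            for copies in product((1, 2), repeat=len(unit_types))]

def usable_configurations(failed, unit_types=UNIT_TYPES):
    """Configurations that avoid every (unit_type, copy) pair in `failed`."""
    return [cfg for cfg in all_configurations(unit_types)
            if not any(cfg[unit] == copy for unit, copy in failed)]

print(len(all_configurations()))                 # 64
print(len(usable_configurations({("CC", 1)})))   # 32
```

A single failed copy halves the number of usable configurations; the system goes down only when both copies of some unit type are unavailable.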
1- Both CCs operate in synchronism. Two matched
circuits compare 24 bits of internal state during
each 5.5 µs machine cycle.
2- There are 6 different sets of internal nodes
that can be monitored, depending on the
instruction being executed.
3- A mismatch generates an interrupt which calls
fault recognition programs to determine which half
of the system is faulty.
4- Information can be sampled by the matchers and
retained for later examination by diagnostic
programs.
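The matching scheme described above can be sketched in software as a lockstep comparison of two state traces. This is a toy model only: the real matchers are hardware circuits, and the traces and function names below are invented for the example.

```python
MASK_24 = (1 << 24) - 1  # the matchers compare 24 bits per machine cycle

def matcher(state_a: int, state_b: int) -> bool:
    """True when the 24 monitored bits of both CCs agree."""
    return (state_a & MASK_24) == (state_b & MASK_24)

def first_mismatch(trace_a, trace_b):
    """Return (cycle, bits_a, bits_b) for the first disagreement, else None.

    The sampled bits are returned so that a later diagnostic step can
    examine them, mirroring the 'retained for later examination' idea.
    """
    for cycle, (a, b) in enumerate(zip(trace_a, trace_b)):
        if not matcher(a, b):
            return cycle, a & MASK_24, b & MASK_24
    return None

# Made-up state traces: CC 2 suffers a single bit flip in the third cycle.
cc1_trace = [0x000001, 0x00A5A5, 0x123456]
cc2_trace = list(cc1_trace)
cc2_trace[2] ^= 1 << 7

print(first_mismatch(cc1_trace, cc1_trace))  # None
print(first_mismatch(cc1_trace, cc2_trace))
```

Note that a mismatch only says the two halves disagree; it does not say which one is wrong, which is why fault recognition programs have to run afterwards.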
5- The PS employs a Hamming code on the 37 data
bits.
6- There are parity check bits over address plus
data: the CS has one parity bit over address and
data, and another parity bit over the address
alone.
7- Both PS and CS automatically retry operations
upon error detection.
8- After a fault has been detected, the system
configuration logic attempts to establish various
combinations of subunits.
9- A sanity program is then executed.
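To make the Hamming code on 37 data bits concrete, here is a textbook single-error-correcting Hamming encoder and corrector. It is a generic sketch, not the bit layout actually used in the switch, which the slides do not give. For 37 data bits, this construction needs 6 check bits, giving a 43-bit codeword.

```python
def hamming_encode(data_bits):
    """Textbook Hamming code: check bits sit at power-of-two positions (1-based)."""
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:   # smallest r with 2**r >= m + r + 1
        r += 1
    n = m + r
    code = [0] * (n + 1)          # index 0 unused, positions are 1..n
    data = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):       # not a power of two -> data position
            code[pos] = next(data)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity          # even parity over the positions covering bit i
    return code[1:]

def hamming_syndrome(code):
    """XOR of the 1-based positions of all set bits; 0 means no error seen."""
    syndrome = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            syndrome ^= pos
    return syndrome

def hamming_correct(code):
    """Flip the bit named by the syndrome (valid for single-bit errors only)."""
    syndrome = hamming_syndrome(code)
    fixed = list(code)
    if 0 < syndrome <= len(fixed):
        fixed[syndrome - 1] ^= 1
    return fixed, syndrome

word = [1, 0] * 18 + [1]          # 37 arbitrary data bits
codeword = hamming_encode(word)
corrupted = list(codeword)
corrupted[10] ^= 1                # single-bit error at position 11
repaired, syndrome = hamming_correct(corrupted)
print(len(codeword), syndrome, repaired == codeword)
```

The syndrome directly names the position of a single flipped bit. Detecting double errors as well would require one more overall parity bit, which this sketch omits.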
- Summarizing some features of the FT system:
- Duplication of ALU.
- 30% of control logic devoted to self-checking.
- EDAC on disks.
- SW audits.
- Sanity timer (a sanity program is similar to a
maze that the HW must traverse before the sanity
timer times out. If a time-out occurs, the
reconfiguration logic generates a new
configuration to be tried).
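The sanity-timer idea can be sketched as a loop over candidate configurations: run the sanity program on each one and accept the first that finishes before the timer expires. The tick counts, configuration layout, and function names below are hypothetical, chosen only to illustrate the control flow.

```python
def first_sane_configuration(candidates, run_sanity_program, timeout_ticks):
    """Try candidates in order; return the first whose sanity program
    finishes before the timer expires, or None if none does."""
    for config in candidates:
        ticks = run_sanity_program(config)   # None models "never finishes"
        if ticks is not None and ticks <= timeout_ticks:
            return config
    return None

# Toy model (assumption): copy 1 of the CC is faulty, so the sanity program
# never completes on any configuration that uses it.
def toy_sanity_program(config):
    return None if config["CC"] == 1 else 40

candidates = [
    {"CC": 1, "PS": 1},
    {"CC": 1, "PS": 2},
    {"CC": 2, "PS": 1},
]
chosen = first_sane_configuration(candidates, toy_sanity_program, timeout_ticks=100)
print(chosen)  # {'CC': 2, 'PS': 1}
```

The point of the design is that the check does not rely on the faulty hardware reporting its own failure: failing to finish in time is itself the signal to reconfigure.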
- Integrity monitor (Supervisor).
- Byte parity on datapaths.
- Parity checking where parity is preserved,
duplication otherwise.
- Two parity bits on registers.
- Modified Hamming code on main memory.
- Maintenance channel for observability and
controllability.
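As an illustration of byte parity on a datapath, the sketch below attaches one parity bit to each byte and checks it on receipt. Even parity is assumed here; the slides do not say whether the system used even or odd parity.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit: 1 if the byte has an odd number of 1s, else 0."""
    return bin(byte & 0xFF).count("1") & 1

def parity_ok(byte: int, stored_parity: int) -> bool:
    """True when the byte still agrees with the parity bit sent alongside it."""
    return parity_bit(byte) == stored_parity

sent = 0b10110010
p = parity_bit(sent)
print(p)                                  # 0 (four 1 bits)
print(parity_ok(sent, p))                 # True
print(parity_ok(sent ^ 0b00000100, p))    # False: single-bit error detected
print(parity_ok(sent ^ 0b00000110, p))    # True: double-bit error goes unnoticed
```

A single parity bit detects any odd number of flipped bits but misses even numbers, which is one reason the list above pairs parity with stronger mechanisms such as duplication and Hamming codes.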