Title: FAULT-TOLERANT COMPUTING
1FAULT-TOLERANT COMPUTING
- Fall 2007
- Daniel Ortiz-Arroyo
- Computer Science and Engineering Department
- Aalborg University, Esbjerg
2About me
- Associate professor in DE and CS educations
- Multidisciplinary research mainly focused on
- Computational intelligence
- High performance computing
3About the Course
- This is a short introductory course
- 5 lectures
- Each class 2 sessions/45 min each
- Exercises
4Course Contents
5Reading Material
- Textbook: No textbook required
- Optional Reference Books
- Fault-Tolerant Systems, by Israel Koren. Morgan Kaufmann, 2007
- Software Fault Tolerance Techniques and Implementation, by Laura L. Pullum. ISBN 1580531377. Artech House Computer Security Series, 2001
- Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design, by M.L. Shooman. Wiley, 2002
- Papers listed on the course's web page
6Today
- Review of Fault Tolerant Concepts
- Overview of the course/topics
7Course Goals
- Provide an overview of Fault Tolerant Computing (FTC)
- Hardware and software
- Models
- Implementation mechanisms
- Discuss some research and real cases
- Huge area with more than 50 years of research/development
8Motivation
- What is Fault-Tolerance?
- A fault-tolerant system is one that continues to perform at the desired level of service in spite of failures in some of the components that constitute the system
- What is fault-tolerant computing?
- It is the art and science of building computing systems that continue to operate satisfactorily in the presence of faults
- Computing correctly despite the existence of errors in the system
9Motivation
- Why FTC is important
- FTC techniques are the foundation for other areas, e.g. FT Control
- Techniques invented in FTC have been transferred to other fields
10Motivation
- Dependable Systems are essential to humans
Dependability is the ability of a system to
deliver a service that can justifiably be trusted
Measured by
Means to achieve dependability
11Motivation (contd.)
- Approaches to design fault tolerant computer systems
- Bottom-up: designing fault tolerant components to integrate them into a fault tolerant system
- Top-down: designing a fault tolerant system using components with little or no fault tolerance
- Top-down is the most used approach
12Motivation (contd.)
- Challenge of Fault Tolerant Computing using the top-down approach
- Given that both hardware and software components are unreliable, how do we build reliable systems from these unreliable components?
- Problem addressed by John von Neumann in the 1950s
13Motivation (contd.)
- An FTC system may be able to tolerate one or more fault types, including
- HW: transient, intermittent or permanent hardware faults,
- HW and SW design errors,
- operator errors, or
- externally induced upsets or physical damage.
14Motivation (contd.)
- Examples of FTC mechanisms/systems at different levels
- PCs: RAMs with parity checks and Error Correcting Codes (ECC) (HW)
- Workstations: error detection (HW), occasional corrective action (SW), keeping logs (SW)
- RAID (Redundant Array of Inexpensive Disks)
- Distributed Systems
15Introduction
- Historical Perspective
- Theory established by J. von Neumann, 1956
- "Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components", Annals of Mathematics Studies, Princeton University Press
- The SAPO computer, built in Prague, Czechoslovakia in 1950-1954 under the supervision of A. Svoboda, was probably the first fault-tolerant computer.
- Used relays and a magnetic drum memory.
- The processor used triplication and voting (TMR), and the memory implemented error detection with automatic retries when an error was detected.
16Introduction
- Historical Perspective
- Over the past 30 years, a number of fault-tolerant computers have been developed that fall into three general types
- long-life, unmaintainable computers,
- ultra-dependable, real-time computers, and
- high-availability computers.
17Introduction
- Long-life, unmaintainable computers.
- Spacecraft require computers to operate for long periods of time without external repair. Typical requirements are a probability of 95% that the computer will operate correctly for 5-10 years.
- The JPL Self-Testing-and-Repairing (STAR) computer was the next fault-tolerant computer, developed by NASA in the late 1960s for a 10-year mission to the outer planets.
18Introduction
- Ultra-dependable, real-time computers: computers for which an error or delay can prove to be catastrophic.
- They are designed for applications such as control of aircraft, mass transportation systems, and nuclear power plants.
- One of the first operational machines of this type was the Saturn V guidance computer, developed in the 1960s. The Space Shuttle is another example.
- Fly-by-wire aircraft, for example the Airbus airliners, exhibit a very high degree of fault tolerance in their real-time flight control computers.
19Introduction
- High-availability computers can tolerate an occasional error or very short delays (on the order of a few seconds) while error recovery is taking place.
- Example applications are telephone switching and transaction processing for banks, airline reservations, etc.
- Tandem Computers, Inc. used a design of a distributed system with a sophisticated form of duplication.
- Sun's ft-SPARC and the HP/Stratus Continuum 400 are systems that contain redundant processors, disks and power supplies, and automatically switch to backups if a failure is detected.
20Introduction (contd.)
- Pushes for FTC
- Moore's Law, complexity in processors
- Fault tolerant mechanisms
- in HW (more studied)
- in SW (more recent, more debated)
21Introduction (contd.)
- Intuitive concepts
- Reliability: continues to work
- Availability: works when I need it
- Safety: does not put me in jeopardy
- Performability: maintains the same performance in spite of failures
- Maintainability: does not take much time to repair
22Introduction (contd.)
- The two most common ways industry expresses a system's ability to tolerate failure are
- Reliability
- Availability
- In modern distributed systems other measures can be used, such as outage time (or down time)
23Terminology and definitions
- Reliability (time interval)
- R(t): conditional probability that a system is up in the interval [0,t] given that it was up at time 0. Measured by MTBF, MTTF, MTTR
- Availability (time point)
- A(t): probability that a system is operating correctly and is available to perform its functions at the instant of time t. Measured by MTBF/(MTBF+MTTR)
- Availability can be high, even if the system has frequent periods of inoperability, if the time to repair is low.
- "Up" means the system provides the required functionality
24Beyond Fault Tolerance
- Distributed Systems: while the cost of HW and SW drops, the cost of down time increases every year
- Availability is a good metric, but outage (down time) minutes may be more useful in some cases (they can be easily measured)
- Industry has focused mainly on Hw faults, but those are not the only ones
[Figure: customer view of 7x24 service across the server, the network and the client, each with Hw, Sys-sw and App-sw layers]
25Availability
- We can calculate the steady state system availability as
- A = Uptime / (Uptime + Downtime)
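As a rough illustration of the steady-state formula, the sketch below computes availability from uptime and downtime and converts it into outage minutes per year. The function names and the 999:1 example figures are assumptions made for illustration, not values from the course.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def availability(uptime: float, downtime: float) -> float:
    """Steady-state availability A = Uptime / (Uptime + Downtime)."""
    if uptime < 0 or downtime < 0 or uptime + downtime == 0:
        raise ValueError("uptime and downtime must be non-negative and not both zero")
    return uptime / (uptime + downtime)


def downtime_minutes_per_year(a: float) -> float:
    """Expected outage minutes per year for an availability a between 0 and 1."""
    return (1.0 - a) * MINUTES_PER_YEAR


# Hypothetical system: up 999 hours for every hour it is down.
a = availability(uptime=999.0, downtime=1.0)
print(f"A = {a:.3f}, outage = {downtime_minutes_per_year(a):.1f} min/year")
```

Expressed this way, "three nines" of availability still corresponds to several hours of outage per year, which is why outage minutes can be the more telling figure.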
26Break
27Fundamental Principles
- Redundancy
- Addition of extra parts in a system's design to allow it to continue functioning as intended in spite of failures
- Providing some sort of redundancy is the key concept in fault tolerant computer systems
- Hardware redundancy
- Software Redundancy
- Time Redundancy
- Information Redundancy
28Summary of FTC Techniques
We'll cover basic and advanced FTC techniques, some research work and case studies. Little discussion of HW redundancy.
29Fundamental Principles (contd.)
- Hardware Redundancy
- Low level
- Logic level: self-checking circuits, parity bit code
- High level
- Triplicate a computer, or use 5 copies of it (as in the space shuttle)
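High-level hardware redundancy with triplication is usually paired with a voter. The following is a minimal software sketch of majority voting over replicated modules. The replicas here are plain Python functions and the names `good`, `faulty` and `run_redundant` are hypothetical, chosen only to show the idea.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority: too many replicas disagree")
    return value


def run_redundant(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


def good(x: int) -> int:
    return x * x


def faulty(x: int) -> int:
    return x * x + 1  # hypothetical faulty module: always off by one


# With three replicas, a single faulty module is outvoted.
print(run_redundant([good, faulty, good], 7))
```

The same voter works unchanged for five copies, where up to two faulty replicas can be outvoted.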
30Fundamental Principles (contd.)
- Software Redundancy
- Use two different programs/algorithms
- Time Redundancy
- Re-compute or redo the task and compare the results
- May or may not use the same hardware/software
- Information Redundancy
- Backup information
- Use of Error Correcting Codes (ECC)
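Information redundancy can be illustrated with the simplest code of all, a single even-parity bit. This is a sketch for illustration only: it detects any odd number of flipped bits but cannot correct them, which is what a full ECC adds.

```python
def add_even_parity(data_bits: list[int]) -> list[int]:
    """Append one parity bit so the total number of 1s is even."""
    return data_bits + [sum(data_bits) % 2]


def parity_ok(word: list[int]) -> bool:
    """True if the word (data bits plus parity bit) has an even number of 1s."""
    return sum(word) % 2 == 0


word = add_even_parity([1, 0, 1, 1])  # -> [1, 0, 1, 1, 1]

single = word.copy()
single[2] ^= 1                        # one flipped bit: detected

double = word.copy()
double[0] ^= 1
double[1] ^= 1                        # two flipped bits: parity cannot see this

print(parity_ok(word), parity_ok(single), parity_ok(double))
```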
31Fault-Error-Failure concept
- Intuitive definitions
- Fault
- An anomalous physical condition caused by a manufacturing problem, fatigue, external disturbance (intentional or unintentional), or design flaw
- Error: effect of the activation of a fault
- Failure: overall system effect of an error
- Fault -> Error -> Failure
- Example: stuck-at bit (fault) -> incorrect data at ALU (error) -> incorrect balance, system crash (failure)
- Not all errors lead to failures!
32Faults/Failure Classification
Fault and failure taxonomies
33Containment Zones
- Containment zones are used to limit error propagation in a system
- Barriers reduce the chance that a fault or error in one zone will propagate to another
- A fault-containment zone can be created by providing an independent power supply to each zone
- The designer tries to electrically isolate one zone from another
- An error-containment zone can be created by using redundant units and voting on their output (details later)
34Barriers in HW/SW
Barriers are constructed by design techniques for fault avoidance, masking and tolerance
Fault -> Error -> Failure
A better name for fault tolerance is error tolerance
35Fault Modeling
- Fault models at different levels (HW)
- Process level
- Transistor level
- Gate level
- Function level
-
- System level
VLSI manufacturing (important but not covered here)
We will discuss fault/failure models mainly at high levels (from function to system level) in the course
36System Level Fault Modeling
- Distributed System
- Must consider HW/SW faults and communication errors
[Figure: client, server and database, each with HW/SW redundancy, connected by a communication channel (HW/SW); time redundancy and information redundancy are annotated on the figure]
37System Level Fault Modeling
- Computer Networks
- Must consider link and node failures
- Connectivity is a basic measure of network reliability: the minimum number of nodes and links that have to fail before the network becomes disconnected
[Figure: example network with connectivity = 1]
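For small networks, connectivity can be found by brute force: remove every possible set of k links (or nodes) and check whether the rest is still connected. The sketch below treats link failures and node failures separately, which is an assumption made here for simplicity. The graph representation and function names are also illustrative.

```python
from itertools import combinations


def is_connected(nodes: set, links: set) -> bool:
    """Depth-first search over an undirected graph given as node and link sets."""
    if not nodes:
        return True
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for a, b in links:
            v = b if a == u else a if b == u else None
            if v is not None and v in nodes and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == nodes


def link_connectivity(nodes: set, links: set) -> int:
    """Smallest number of links whose removal disconnects the network."""
    for k in range(len(links) + 1):
        for removed in combinations(links, k):
            if not is_connected(nodes, links - set(removed)):
                return k
    return len(links)


def node_connectivity(nodes: set, links: set) -> int:
    """Smallest number of nodes whose removal disconnects the remaining nodes."""
    for k in range(len(nodes) - 1):
        for removed in combinations(nodes, k):
            if not is_connected(nodes - set(removed), links):
                return k
    return len(nodes) - 1  # fully connected network: n - 1 by convention


ring_nodes = {"A", "B", "C", "D"}
ring_links = {("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")}
chain_nodes = {"A", "B", "C"}
chain_links = {("A", "B"), ("B", "C")}

print(link_connectivity(ring_nodes, ring_links), node_connectivity(ring_nodes, ring_links))
print(link_connectivity(chain_nodes, chain_links), node_connectivity(chain_nodes, chain_links))
```

The ring survives any single failure, while the chain is disconnected by one failed link or by the failure of its middle node. The exhaustive search grows exponentially, so this is only practical for small examples.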
38Fault Modeling (contd.)
- High-level failure models (process or system level failure). General classification
- Crash failure: a faulty processor or system stops permanently
- Omission failure: a faulty process sometimes omits inputs/outputs, but when it works, it works correctly
- Timing failure: inputs/outputs are delayed or arrive too early
- Byzantine failure (or arbitrary failure): a faulty processor can exhibit arbitrary behavior, including malicious behavior
39Failure Rate
- Bathtub curve (for hardware)
- The rate at which a component suffers faults depends on its age, the ambient temperature, any voltage or physical shocks that it suffers, and the technology
[Figure: bathtub curve of failure rate over time. Early phase of about 20 weeks (burn-in is used to avoid this zone), then a normal lifetime of 5-25 years with a constant failure rate]
40Failure Rate
- Empirical formula for failure rate (in the normal lifetime stage) for HW systems
- λ = L Q (C1 T V + C2 E)
- L = Learning factor (maturity of technology)
- Q = Manufacturing process quality factor
- T = Temperature factor
- V = Voltage stress factor
- E = Environmental shock factor
- C1, C2 = Complexity factors (number of gates, pins in package)
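To make the structure of the empirical formula concrete, here is a small sketch that evaluates λ = L Q (C1 T V + C2 E). All factor values below are hypothetical placeholders; real values come from reliability handbooks and depend on the component.

```python
def failure_rate(L: float, Q: float, C1: float, T: float,
                 V: float, C2: float, E: float) -> float:
    """Empirical failure rate: lambda = L * Q * (C1 * T * V + C2 * E)."""
    return L * Q * (C1 * T * V + C2 * E)


# Hypothetical factors for a mature, good-quality part in a benign environment.
baseline = failure_rate(L=1.0, Q=1.0, C1=0.02, T=2.0, V=1.0, C2=0.01, E=3.0)

# Same part running hotter: only the temperature factor changes.
hot = failure_rate(L=1.0, Q=1.0, C1=0.02, T=4.0, V=1.0, C2=0.01, E=3.0)

print(baseline, hot)
```

Temperature and voltage stress scale the gate-complexity term, while environmental shock scales the packaging term, so the two kinds of stress add rather than multiply.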
41Failure Rate
- In most calculations of reliability, a constant failure rate λ is assumed, or equivalently the exponential distribution for the component lifetime T.
- There are cases in which this simplifying assumption is inappropriate
- Example: during the infant mortality and wear-out phases of the bathtub curve
- In such cases, the Weibull distribution for the lifetime T is often used in reliability calculations
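The difference between the two lifetime models can be sketched numerically. The code below assumes the common scale/shape parameterization of the Weibull distribution, R(t) = exp(-(t/scale)^shape). Other texts use a different but equivalent form, and the parameter values are illustrative.

```python
import math


def reliability_exponential(t: float, lam: float) -> float:
    """R(t) = exp(-lam * t) for a constant failure rate lam."""
    return math.exp(-lam * t)


def reliability_weibull(t: float, scale: float, shape: float) -> float:
    """R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


def hazard_weibull(t: float, scale: float, shape: float) -> float:
    """Failure rate at time t > 0: (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# shape < 1: failure rate falls with age (infant mortality)
# shape = 1: constant failure rate (reduces to the exponential model)
# shape > 1: failure rate rises with age (wear-out)
for shape in (0.5, 1.0, 2.0):
    rates = [hazard_weibull(t, scale=1000.0, shape=shape) for t in (10.0, 100.0, 1000.0)]
    print(shape, [f"{r:.5f}" for r in rates])
```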
42Fault Tolerance and Reliability
- The effect of a fault tolerant design on reliability can be expressed as
- Rsys = P(no-fault) + P(correct-operation | fault) P(fault)
- P(no-fault) is maximized by fault-intolerant design (proofs of correct design, high quality components): the bottom-up approach
- P(correct-operation | fault) is the coverage of a fault tolerant design over all possible faults (the top-down approach)
- For cost effectiveness, fault tolerant design should target the most likely faults
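A quick numeric sketch shows how coverage enters Rsys = P(no-fault) + P(correct-operation | fault) P(fault). The probabilities used are made-up illustrative values.

```python
def system_reliability(p_fault: float, coverage: float) -> float:
    """Rsys = P(no-fault) + P(correct-operation | fault) * P(fault)."""
    if not (0.0 <= p_fault <= 1.0 and 0.0 <= coverage <= 1.0):
        raise ValueError("probabilities must lie between 0 and 1")
    return (1.0 - p_fault) + coverage * p_fault


P_FAULT = 0.10  # hypothetical probability that a fault occurs

no_tolerance = system_reliability(P_FAULT, coverage=0.0)     # every fault is fatal
good_coverage = system_reliability(P_FAULT, coverage=0.95)   # 95% of faults tolerated

print(no_tolerance, good_coverage)
```

With no fault tolerance the system is only as reliable as the chance of having no fault at all. Raising coverage recovers most of the remaining probability mass without touching the fault rate itself.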
43Dependability Evaluation
- Once a fault-tolerant system is designed, it must be evaluated to determine if its architecture meets reliability and dependability objectives using
- Analytical models
- Injecting faults
44Modeling
- Mathematical formulation for quantitative analysis
- Consider a large experiment with N systems under observation at time t
- Nc(t) = number of correctly operating systems
- Nf(t) = number of failed systems
- N = Nc(t) + Nf(t)
- Hence
- Reliability R(t) = Nc(t)/N = 1 - Nf(t)/N
- Unreliability Q(t) = 1 - R(t)
- Derivative of reliability: dR(t)/dt = -(1/N)(dNf(t)/dt)
- dNf(t)/dt is called the instantaneous failure rate of the component
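The counting definition R(t) = Nc(t)/N can be tried out on simulated data. The sketch below assumes, purely for illustration, that lifetimes are exponentially distributed with a hypothetical failure rate, and compares the empirical estimate with exp(-λt).

```python
import math
import random


def empirical_reliability(lifetimes: list[float], t: float) -> float:
    """R(t) = Nc(t) / N, where Nc(t) counts systems still working at time t."""
    n = len(lifetimes)
    n_c = sum(1 for life in lifetimes if life > t)
    return n_c / n


rng = random.Random(42)   # fixed seed so the run is repeatable
LAM = 0.001               # hypothetical failure rate, per hour
lifetimes = [rng.expovariate(LAM) for _ in range(100_000)]

for t in (100.0, 1000.0, 3000.0):
    r = empirical_reliability(lifetimes, t)
    print(f"t={t:6.0f}  R(t)={r:.3f}  Q(t)={1 - r:.3f}  exp(-lam*t)={math.exp(-LAM * t):.3f}")
```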
45Modeling (contd.)
- Reliability Modeling
- System model, concentrating on the reliability aspect
- The structure of a system is used to model its reliability
- Models
- Combinatorial Models
- Markov Models
46Modeling (contd.)
- Combinatorial Modeling - Probabilistic Techniques
- Idea: express the reliability of a system as a function of the reliability of its components
- Construction models
- series
- All components must work correctly
- No redundancy
- parallel
- Only one of the components must work correctly
- High redundancy
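Assuming the components fail independently, the series and parallel models reduce to two short formulas: a series system works only if every component works, and a parallel system fails only if every component fails. The component reliabilities below are example values.

```python
from math import prod
from typing import Iterable


def series_reliability(rs: Iterable[float]) -> float:
    """All components must work: R = R1 * R2 * ... * Rn."""
    return prod(rs)


def parallel_reliability(rs: Iterable[float]) -> float:
    """At least one component must work: R = 1 - (1 - R1) * ... * (1 - Rn)."""
    return 1.0 - prod(1.0 - r for r in rs)


components = [0.9, 0.9, 0.9]  # hypothetical component reliabilities

print(f"series:   {series_reliability(components):.3f}")
print(f"parallel: {parallel_reliability(components):.3f}")
```

Three 0.9 components in series are less reliable than any one of them, while the same three in parallel are far more reliable, which is the quantitative case for redundancy.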
47Modeling (contd.)
- Markov models: a type of stochastic process (random variables indexed by time)
- Many complex problems cannot be modeled easily in a combinatorial fashion
- Use Markov models (aka Markov chains)
- Repair is very difficult to model combinatorially
- Markov models can be applied to modeling reliability, availability etc.
48Modeling (contd.)
Markov Models
- STATE: represents all that must be known to describe the system at a given instant in time
- E.g. for reliability: each state represents a distinct combination of faulty and fault-free modules (e.g. 101, where 1 = OK and 0 = fault)
- TRANSITION: changes of state that happen in the system
- Over time, as failures occur, the system goes from one state to another
- State changes are given probabilities (e.g. prob. of failure, etc.)
- Transitions are probabilities
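A minimal discrete-time sketch of these ideas is a single repairable module with two states, up and down. The per-step failure and repair probabilities are hypothetical. Iterating the state distribution shows it settling at the steady-state availability, which for this chain is P_REPAIR / (P_FAIL + P_REPAIR).

```python
def step(dist: list[float], P: list[list[float]]) -> list[float]:
    """One time step: new_dist[j] = sum over i of dist[i] * P[i][j]."""
    n = len(dist)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


P_FAIL = 0.01    # hypothetical probability of failing during one time step
P_REPAIR = 0.20  # hypothetical probability that repair completes in one time step

# State 0 = up, state 1 = down. Each row holds the transition
# probabilities out of one state and sums to 1.
P = [
    [1.0 - P_FAIL, P_FAIL],
    [P_REPAIR, 1.0 - P_REPAIR],
]

dist = [1.0, 0.0]  # the system starts in the up state
for _ in range(1000):
    dist = step(dist, P)

print(f"long-run availability ~ {dist[0]:.4f}")
print(f"closed form            {P_REPAIR / (P_FAIL + P_REPAIR):.4f}")
```

Repair appears simply as a transition from the down state back to the up state, which is exactly what the combinatorial models cannot express easily.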
49Fault Injection
- Fault injection: artificial creation of failure sources
- Study system behavior under faults/errors
- What is the effect of faults? How fast do they propagate?
- Is the recovery code correct?
- Is our failure model accurate?
- Is failover/reconfiguration effective?
- Limitations
- Not all failures can be injected
- Year-long MTTFs are impossible to study
50Fault Injection
- Fault injection is used to test fault tolerant systems
- Simulation
- Prototyping
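Simulation-based fault injection can be sketched in a few lines: flip bits in the outputs of replicated modules and measure how often a majority voter still delivers the correct word. Everything here (8-bit words, three replicas, the `campaign` helper) is an assumption chosen to keep the example small.

```python
import random
from collections import Counter


def vote(values: list[int]):
    """Majority voter: returns the majority value, or None if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else None


def flip_random_bit(word: int, rng: random.Random, width: int = 8) -> int:
    """Inject a fault: flip one randomly chosen bit of an integer word."""
    return word ^ (1 << rng.randrange(width))


def campaign(n_faulty_replicas: int, trials: int, seed: int = 0) -> float:
    """Fraction of trials in which a 3-replica voter still returns the correct word."""
    rng = random.Random(seed)
    masked = 0
    for _ in range(trials):
        correct = rng.randrange(256)
        outputs = [correct] * 3
        for i in rng.sample(range(3), n_faulty_replicas):
            outputs[i] = flip_random_bit(correct, rng)
        if vote(outputs) == correct:
            masked += 1
    return masked / trials


print("one faulty replica: ", campaign(1, trials=1000))
print("two faulty replicas:", campaign(2, trials=1000))
```

A campaign like this answers the "is our failure model accurate?" question empirically: the voter masks every single-replica fault but none of the double faults, which marks the boundary of what triplication covers.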
51Summary
- What were the main points of the lecture?