FP9 FAULT-TOLERANT COMPUTING

Provided by: kew67

Transcript and Presenter's Notes


1
FP9 FAULT-TOLERANT COMPUTING

Daniel Ortiz-Arroyo
Computer Science and Engineering Department
Aalborg University, Esbjerg

2
About the Course

This is a short introductory course
5 classes
Each class: 2 sessions of 45 min each
Exercises
Prerequisites
Concepts of Probability
Computer Architecture
Software engineering
Reliability

3
Course Contents
4
Reading Material

Textbook: no textbook required
Optional reference books
M.L. Shooman, Reliability of Computer Systems and Networks:
Fault Tolerance, Analysis, and Design, Wiley, 2002
Laura L. Pullum, Software Fault Tolerance Techniques and
Implementation, Artech House Computer Security Series, 2001,
ISBN 1580531377
Papers listed on course web page
www.cs.aaue.dk/do/teaching/f05/FTC.htm

5
Course Goals

Provide an overview of fault tolerant computing
Hardware and software
Models
Implementation mechanisms
Huge area with more than 40 years of
research/development

6
Motivation

What is Fault-Tolerance?
A fault-tolerant system is one that continues
to perform at desired level of service in spite
of failures in some components that constitute
the system.
What is fault-tolerant computing?
It is the art and science of building computing
systems that continue to operate satisfactorily
in the presence of faults
Computing correctly despite the existence of
errors in the system

7
Motivation (contd.)

Approaches to designing fault-tolerant computer
systems
Bottom-up: designing fault-tolerant components
and integrating them into a fault-tolerant system
Top-down: designing a fault-tolerant system
using components with little or no fault
tolerance
Top-down is the most widely used approach

8
Motivation (contd.)

Challenge of Fault Tolerant Computing using the
top-down approach
Given that both hardware and software components
are unreliable, how do we build reliable systems
from these unreliable components?

9
Motivation (contd.)

A fault-tolerant computing system may be able to
tolerate one or more fault-types including
transient, intermittent or permanent hardware
faults,
software and hardware design errors,
operator errors, or
externally induced upsets or physical damage.

10
Motivation (contd.)

Permanent faults
Once a component fails, it never works again
Easiest to diagnose
Transient faults
Occur once; about 10 times as likely as permanent
faults
Intermittent faults
Recurring; may appear as transient if the period is
long
Hard and expensive to detect

11
Motivation (contd.)

Examples of fault tolerant mechanisms/systems
General Purpose Systems
PCs: RAM with parity checks and possibly ECC
Workstations: error detection (HW), occasional
corrective action (SW), ECC (HW), keeping log
(SW)
Reliable Systems
Telephone systems
Banking systems e.g. ATM
Stock market

12
Motivation (contd.)

Examples
Critical and Life Critical Systems
Manned and unmanned space borne systems
Aircraft control systems
Nuclear reactor control systems
Life support systems
Reliable -> Critical Systems
Traffic light control system
Automobile control system (ABS, Fuel injection
system)

13
Introduction

Historical Perspective
Not a new concept. First used by J. von Neumann,
1956
Probabilistic Logics and the Synthesis of Reliable
Organisms from Unreliable Components, Annals of
Mathematics Studies, Princeton University Press
Major push
Space program
HW Fault tolerance - then
SW Fault tolerance later
Merge the two

14
Introduction (contd.)

New pushes
Density of devices (Moore's law)
Deep submicron technology and time-to-market pressure
Implementation of numerous functionalities on
chip/board/system
Speculative execution in modern processors

15
Introduction (contd.)

Intuitive concepts
Reliability: continues to work
Availability: works when I need it
Safety: does not put me in jeopardy
Performability: maintains the same performance in
spite of failures
Maintainability: does not take much time to repair

16
Introduction (contd.)

The two most common ways industry expresses a
system's ability to tolerate failure are
Reliability
Availability

17
Terminology and definitions

MTTF: mean time to failure
the expected time the system will operate before
the first failure occurs (a system is replaced
after a failure).
MTTR: mean time to repair
average time required to repair a system
MTBF: mean time between failures
average time between failures of a system
(renewal situation: there is repair or
replacement)
MTBF = MTTF + MTTR

18
Terminology and definitions

Reliability (time interval)
R(t): conditional probability that a system is up
throughout the interval [0, t], given that it was up at
time 0. Measured by MTBF
Availability (time point)
A(t): probability that a system is operating
correctly and is available to perform its
functions at the instant of time t. In steady state, measured by
MTTF/(MTTF + MTTR), which equals MTTF/MTBF when MTBF = MTTF + MTTR
Availability can be high, even if the system has
frequent periods of inoperability if time to
repair is low.

"Up" means the system provides the required
functionality
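
The relation between MTTF, MTTR, MTBF and availability can be checked numerically. The sketch below is only an illustration: the function names and the hour values are hypothetical, and it assumes the steady-state case with MTBF = MTTF + MTTR.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures in the renewal (repair) situation."""
    return mttf_hours + mttr_hours


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Hypothetical system A: fails rarely, but repair is slow.
rare_but_slow = steady_state_availability(mttf_hours=1000.0, mttr_hours=10.0)

# Hypothetical system B: fails 100 times more often, but repair is very fast.
frequent_but_fast = steady_state_availability(mttf_hours=10.0, mttr_hours=0.01)

print(f"A: {rare_but_slow:.4f}  B: {frequent_but_fast:.4f}")
```

System B fails far more often than system A, yet its availability is higher because its time to repair is so short. Its reliability over a long interval would still be much lower.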
19
Beyond Fault Tolerance
While cost of HW and SW drops, down time cost
increases every year
Availability is a good metric but outage minutes
may be more useful (it can be measured) in some
cases
Industry has focused mainly on Hw faults
[Figure: customer view of 7x24. Server, network and client are each layered as Hw, Sys-sw and App-sw]
20
Fundamental Principles

Redundancy
Addition of extra parts in a system's design to
allow it to continue functioning as intended in
spite of failures
Providing redundancy is key in fault tolerant
computing
Hardware redundancy
Software Redundancy
Time Redundancy
Information Redundancy

21
Fundamental Principles (contd.)

Hardware Redundancy
Low level
Logic level - Self checking circuits, parity bit
code
High level
Triplicate or use 5-copies of a computer (as in
space shuttle)

22
Fundamental Principles (contd.)

Software Redundancy
Use two different programs/algorithms
Time Redundancy
Re-compute or redo the task and compare the
results
May or may not use the same hardware/software
Information Redundancy
Backup information
Use of Error Correcting Codes (ECC)
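
Time redundancy can be sketched as "run the task twice and compare". The helper below is a hypothetical illustration, not a method from the course. It assumes the task is deterministic and free of side effects, and that faults are transient, so a later attempt may succeed.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_time_redundancy(task: Callable[[], T], max_attempts: int = 3) -> T:
    """Execute the task twice per attempt and accept a result only if both runs agree."""
    for _ in range(max_attempts):
        first = task()
        second = task()
        if first == second:
            return first  # the two runs agree, so accept the result
        # A mismatch means at least one run was hit by a fault: redo the work.
    raise RuntimeError("results did not agree within the allowed attempts")
```

Repeating the task on the same hardware detects transient faults only. A permanent fault would corrupt both runs in the same way, which is why the repeat may be done on different hardware or software.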

23
Fault-Error-Failure concept

Intuitive definitions
Fault
An anomalous physical condition caused by a
manufacturing problem, fatigue, external
disturbance (intentional or un-intentional),
design flaw,
Error - effect of the activation of a fault
Failure - overall system effect of an error
Fault -> Error -> Failure
Example: stuck-at bit (fault) -> incorrect data at ALU (error) -> incorrect balance, system crash (failure)
Not all errors lead to failures!!
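
The fault, error, failure chain can be made concrete with a toy model. Everything in it is hypothetical: a stored account balance whose bit 2 is stuck at 0, and a withdrawal check that reads it. It is a sketch of the idea, not a model of real hardware.

```python
STUCK_BIT = 2  # hypothetical permanent fault: bit 2 of the stored word always reads 0


def stored_balance(true_balance: int, faulty: bool) -> int:
    """Value read back from memory; the stuck-at-0 bit is cleared when the fault is present."""
    return true_balance & ~(1 << STUCK_BIT) if faulty else true_balance


def approve_withdrawal(true_balance: int, amount: int, faulty: bool) -> bool:
    """System-level service: approve the withdrawal if the stored balance covers it."""
    return stored_balance(true_balance, faulty) >= amount


# Fault present but dormant: 128 has bit 2 clear, so the stored value is correct.
print(stored_balance(128, faulty=True))           # 128, no error
# Fault activated: 150 has bit 2 set, so the stored value is wrong (an error).
print(stored_balance(150, faulty=True))           # 146
# The error is masked: the decision matches the fault-free system.
print(approve_withdrawal(150, 100, faulty=True))  # True, no failure
# The error reaches the service boundary: a valid withdrawal is refused.
print(approve_withdrawal(150, 148, faulty=True))  # False, a failure
```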
24
Fault-Error-Failure concept (contd.)

Origins of faults
Physical device level (HW)
Logic level (HW)
Chip level (HW)
System level (HW/SW)
interfacing, specifications,

25
Propagation of Faults and Errors

Both faults and errors can spread through the
system
If a chip shorts out power to ground, it may
cause nearby chips to fail as well
Errors can spread because the output of one
computing element is frequently used as input by
others
Adder example: the erroneous result of the faulty
adder can be fed into further calculations, thus
propagating the error

26
Containment Zones

Containment Zones
To limit error propagation, designers incorporate
these zones into systems
Barriers reduce the chance that a fault or error
in one zone will propagate to another
A fault-containment zone can be created by
providing an independent power supply to each
zone
The designer tries to electrically isolate one
zone from another
An error-containment zone can be created by using
redundant units and voting on their output
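
An error-containment zone built from redundant units and a voter can be sketched as follows. The function name is hypothetical, and the sketch assumes a non-empty list of hashable unit outputs and a simple majority rule.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by more than half of the redundant units."""
    value, count = Counter(outputs).most_common(1)[0]
    if count > len(outputs) // 2:
        return value
    # No strict majority: the error cannot be masked, only detected.
    raise ValueError("no majority among redundant units")


# Three replicas, one of which produced an erroneous result.
print(majority_vote([7, 7, 9]))  # 7
```

With three units the voter masks any single erroneous output, so the error does not leave the zone. The voter itself remains a single point of failure in this sketch.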

27
Hardware Fault Classification

Transient Faults
Disappear after a relatively short time
Example - a memory cell that changes spuriously
due to some electromagnetic interference.
Overwriting the memory cell with the right
content will make the fault go away
Permanent Faults
Never go away, component has to be repaired or
replaced
Intermittent Faults
Example - a loose connection

28
Fault Modeling

Fault models at different levels (HW)
Process level
Transistor level
Gate level
Function level
...
System level
[Figure label: VLSI Manufacturing]
We will discuss fault/failure models mainly at
high levels (from gate to system) in the course
29
Fault Modeling (contd.)

High-level failure models (process or system
failure)
General classification
crash failure - a faulty processor or system
stops permanently
omission failure - a faulty process sometimes omits
inputs/outputs, but when it works, it
works correctly
timing failure - inputs/outputs are delayed or
arrive too early
Byzantine failure (or arbitrary failure) - a
faulty processor can exhibit arbitrary behavior,
including malicious behavior

30
Failure Rate

Bathtub curve
The rate at which a component suffers faults
depends on its age, the ambient temperature, any
voltage or physical shocks that it suffers, and
the technology
[Figure: bathtub curve of failure rate versus age. An initial high-failure zone of about 20 weeks (burn-in is used to avoid this zone) is followed by the normal lifetime of 5-25 years]
31
Failure Rate

Empirical formula for failure rate (in normal
lifetime)
$\lambda = L \cdot Q \cdot (C_1 \cdot T \cdot V + C_2 \cdot E)$
L = Learning factor (maturity of technology)
Q = Manufacturing process quality factor
T = Temperature factor
V = Voltage stress factor
E = Environmental shock factor
C_1, C_2 = Complexity factors (number of gates, number of pins in
package)
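
Reading the empirical formula as lambda = L * Q * (C1 * T * V + C2 * E), it can be evaluated directly. The factor values below are invented for illustration only; real values come from reliability handbooks, and the unit of the result depends on how those factors are tabulated.

```python
def failure_rate(learning: float, quality: float,
                 c1: float, temperature: float, voltage: float,
                 c2: float, environment: float) -> float:
    """Empirical failure rate: L * Q * (C1 * T * V + C2 * E)."""
    return learning * quality * (c1 * temperature * voltage + c2 * environment)


# Hypothetical factors: the same part in a benign and in a harsh environment.
benign = failure_rate(1.0, 2.0, c1=0.1, temperature=1.5, voltage=1.0,
                      c2=0.05, environment=1.0)
harsh = failure_rate(1.0, 2.0, c1=0.1, temperature=1.5, voltage=1.0,
                     c2=0.05, environment=4.0)
print(benign, harsh)
```

Only the environmental factor differs between the two calls, so the increase comes entirely from the C2 * E term.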

32
Failure Rate

In most calculations of reliability, a constant
failure rate λ is assumed, or equivalently the
exponential distribution for the component
lifetime T
There are cases in which this simplifying
assumption is inappropriate
Example - during the infant mortality and
wear-out phases of the bathtub curve
In such cases, the Weibull distribution for the
lifetime T is often used in reliability
calculation
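
The difference between the constant-rate (exponential) model and the Weibull model can be sketched as below. The Weibull form used here, R(t) = exp(-(t/scale)^shape), is one common parameterization and is an assumption, since the slides do not give one. Function and parameter names are hypothetical.

```python
import math


def reliability_exponential(t: float, failure_rate: float) -> float:
    """R(t) = exp(-lambda * t) for a constant failure rate lambda."""
    return math.exp(-failure_rate * t)


def reliability_weibull(t: float, scale: float, shape: float) -> float:
    """R(t) = exp(-(t / scale) ** shape); shape = 1 gives the exponential case."""
    return math.exp(-((t / scale) ** shape))


def hazard_weibull(t: float, scale: float, shape: float) -> float:
    """Instantaneous failure rate of the Weibull model, defined for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# shape < 1: failure rate falls with age (infant mortality phase).
print(hazard_weibull(1.0, 100.0, 0.5) > hazard_weibull(2.0, 100.0, 0.5))
# shape > 1: failure rate rises with age (wear-out phase).
print(hazard_weibull(1.0, 100.0, 2.0) < hazard_weibull(2.0, 100.0, 2.0))
```

With shape equal to 1 the hazard is constant at 1/scale, which is the flat middle part of the bathtub curve.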

33
Fault Tolerance and Reliability

The effect of a fault-tolerant design on
reliability can be expressed as
$R_{sys} = P(\text{no fault}) + P(\text{correct operation} \mid \text{fault}) \cdot P(\text{fault})$
P(no fault): maximized by fault-intolerant design (proofs of
correct design, high-quality components)
P(correct operation | fault): coverage of a fault-tolerant design over all
possible faults
For cost effectiveness, fault-tolerant design
should target the most likely faults
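
A small worked example, reading the expression as R_sys = P(no fault) + P(correct operation | fault) * P(fault). The probabilities are invented, and the function name is hypothetical.

```python
def system_reliability(p_fault: float, coverage: float) -> float:
    """R_sys = P(no fault) + P(correct operation | fault) * P(fault)."""
    p_no_fault = 1.0 - p_fault
    return p_no_fault + coverage * p_fault


# Hypothetical numbers: a fault occurs with probability 0.1 during the mission.
print(system_reliability(p_fault=0.1, coverage=0.0))  # no fault tolerance at all
print(system_reliability(p_fault=0.1, coverage=0.9))  # 90% of faults are tolerated
```

Fault tolerance can only recover part of the P(fault) term, which is why coverage of the most likely faults matters more than coverage of rare ones.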
34
Importance of Design

Planning to avoid failure is the most important
aspect of fault tolerance
Analysis of the environment to determine the
failures that must be tolerated to achieve a
desired level of reliability
Redundancy costs money and time
Design must trade off the amount of redundancy
used and the desired level of fault tolerance

35
Fault Tolerant Techniques

Modular Redundancy
Multiple identical replicas of hardware and a
voter
N version programming - multiple versions of a
software module
Error-control coding
ECC: Hamming and Reed-Solomon
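
As an illustration of error-control coding, here is a minimal sketch of the classic Hamming(7,4) code, which corrects any single-bit error in a 7-bit codeword. The function names are hypothetical. The bit layout (parity bits at positions 1, 2 and 4) is the textbook one, assumed here because the slides give no detail.

```python
from typing import List


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits as the 7-bit codeword [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: List[int]) -> List[int]:
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_position = s1 + 2 * s2 + 4 * s4  # 0 means no error was detected
    if error_position:
        c[error_position - 1] ^= 1  # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]]


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # inject a single-bit error
print(hamming74_decode(word))  # [1, 0, 1, 1]
```

Two simultaneous bit errors would be miscorrected by this code. Stronger codes such as Reed-Solomon are used when burst errors are expected.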

36
Fault Tolerant Techniques

Checkpoints and rollbacks
Application state is saved at a checkpoint. Rollback
restarts execution from a previous
checkpoint
Recovery Blocks
Alternates - secondary modules that perform same
function of a primary module - are executed when
primary fails to pass an acceptance test
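
The recovery block scheme, combined with a checkpoint, can be sketched as below. All names are hypothetical. The "checkpoint" is simply a deep copy of the input state, so a failed module cannot corrupt the state seen by the next alternate.

```python
import copy
from typing import Any, Callable, Sequence


def recovery_block(state: Any,
                   primary: Callable[[Any], Any],
                   alternates: Sequence[Callable[[Any], Any]],
                   acceptance_test: Callable[[Any], bool]) -> Any:
    """Try the primary, then each alternate, until a result passes the acceptance test."""
    for module in (primary, *alternates):
        working_copy = copy.deepcopy(state)  # checkpoint: the original state stays intact
        try:
            result = module(working_copy)
        except Exception:
            continue  # a crash counts as a failed attempt: roll back and try the next module
        if acceptance_test(result):
            return result
    raise RuntimeError("primary and all alternates failed the acceptance test")


def buggy_sort(items):   # hypothetical faulty primary: forgets to sort
    return items


def simple_sort(items):  # hypothetical alternate
    return sorted(items)


def is_ordered(items):   # acceptance test: checks ordering only
    return all(a <= b for a, b in zip(items, items[1:]))


print(recovery_block([3, 1, 2], buggy_sort, [simple_sort], is_ordered))  # [1, 2, 3]
```

The acceptance test here checks ordering only, not that the output is a permutation of the input. The scheme is only as strong as its acceptance test.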

37
Dependability Evaluation

Once a fault-tolerant system is designed, it must
be evaluated to determine if its architecture
meets reliability and dependability objectives
using
Analytical models
Injecting faults

38
Modeling

Importance of analysis and analytical model
to evaluate a design
a metric to compare different designs
to provide feedback to the designer during early
design stages
use a model for performance analysis
used for quantitative and qualitative analysis

39
Modeling (contd.)

Mathematical formulation for quantitative
analysis
consider a large experiment with N systems under
observation at time t
$N_c(t)$ - number of correctly operating systems
$N_f(t)$ - number of failed systems
$N = N_c(t) + N_f(t)$
Hence
Reliability: $R(t) = N_c(t)/N = 1 - N_f(t)/N$
Unreliability: $Q(t) = 1 - R(t) = N_f(t)/N$
Derivative of reliability: $dR(t)/dt = -(1/N)\,(dN_f(t)/dt)$
$dN_f(t)/dt$ is the instantaneous rate at which systems fail; dividing it by $N_c(t)$ gives the failure rate function $z(t) = (1/N_c(t))\,(dN_f(t)/dt)$
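
The counting definition of reliability can be tried out on simulated data. The sketch below assumes exponentially distributed lifetimes with a constant failure rate; the rate, the population size and the function names are all hypothetical.

```python
import math
import random
from typing import List


def simulate_lifetimes(n_systems: int, failure_rate: float, seed: int = 0) -> List[float]:
    """Draw one random lifetime per system from an exponential distribution."""
    rng = random.Random(seed)
    return [rng.expovariate(failure_rate) for _ in range(n_systems)]


def empirical_reliability(lifetimes: List[float], t: float) -> float:
    """R(t) = Nc(t) / N, with Nc(t) = N - Nf(t)."""
    n = len(lifetimes)
    n_failed = sum(1 for life in lifetimes if life <= t)  # Nf(t)
    n_correct = n - n_failed                              # Nc(t)
    return n_correct / n


lifetimes = simulate_lifetimes(n_systems=20000, failure_rate=0.001)
estimate = empirical_reliability(lifetimes, t=500.0)
print(f"empirical R(500) = {estimate:.3f}, exp(-0.5) = {math.exp(-0.5):.3f}")
```

For a large N the empirical ratio Nc(t)/N approaches the analytical value exp(-lambda * t) of the constant-rate model.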

40
Modeling (contd.)

Reliability Modeling
System model, concentrating on reliability aspect
Models
Combinatorial Models
Markov Models

41
Modeling (contd.)

Combinatorial Modeling
Probabilistic techniques
Express reliability of a system as a function
of reliability of its components
Construction models
series
parallel

42
Modeling (contd.)

Combinatorial Modeling

Parallel: only one of the components must work
correctly. High redundancy
Series: all components must work correctly. No
redundancy
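
The two construction models translate into two short formulas, sketched here under the usual assumption that components fail independently. The function names and the component reliabilities are hypothetical.

```python
import math
from typing import Sequence


def series_reliability(component_reliabilities: Sequence[float]) -> float:
    """All components must work: R = product of R_i."""
    return math.prod(component_reliabilities)


def parallel_reliability(component_reliabilities: Sequence[float]) -> float:
    """At least one component must work: R = 1 - product of (1 - R_i)."""
    return 1.0 - math.prod(1.0 - r for r in component_reliabilities)


# Two hypothetical components with reliability 0.9 each.
print(series_reliability([0.9, 0.9]))    # lower than either component
print(parallel_reliability([0.9, 0.9]))  # higher than either component
```

A series system is never more reliable than its weakest component, while a parallel system is never less reliable than its strongest one.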
43
Modeling (contd.)

Markov Models
Many complex problems cannot be modeled easily
in combinatorial fashion
Use Markov models (aka Markov chains)
Repair is very difficult to model combinatorially
Markov models can be applied to modeling
reliability, availability, repair etc.

44
Modeling (contd.)
Markov Models
STATE
Represents all that must be known to describe the system at a given instant in time
E.g. for reliability: each state represents a distinct combination of faulty and fault-free modules (e.g. 101, where 1 = OK and 0 = fault)
TRANSITION
Changes of state that happen in the system
Over time, as failures occur, the system goes from one state to another
State changes are given probabilities (e.g. probability of failure, etc.)
Transitions are probabilities
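
A minimal discrete-time Markov chain for the reliability case can be sketched as follows. The states are bit strings such as 101 (1 = OK, 0 = fault). The sketch assumes that each working module fails independently with the same probability in every time step, that there is no repair, and that the system works while at least one module works. All names are hypothetical.

```python
from itertools import product
from typing import Dict, List


def transition_probability(src: str, dst: str, p_fail: float) -> float:
    """Probability of moving from state src to state dst in one time step."""
    prob = 1.0
    for before, after in zip(src, dst):
        if before == "1":
            prob *= p_fail if after == "0" else 1.0 - p_fail
        elif after == "1":
            return 0.0  # no repair in this model: a failed module stays failed
    return prob


def step(dist: Dict[str, float], states: List[str], p_fail: float) -> Dict[str, float]:
    """Advance the state probability distribution by one time step."""
    return {dst: sum(dist[src] * transition_probability(src, dst, p_fail)
                     for src in states)
            for dst in states}


def reliability_parallel(n_modules: int, p_fail: float, n_steps: int) -> float:
    """Probability that at least one module is still OK after n_steps."""
    states = ["".join(bits) for bits in product("10", repeat=n_modules)]
    dist = {state: 0.0 for state in states}
    dist["1" * n_modules] = 1.0  # start with every module fault-free
    for _ in range(n_steps):
        dist = step(dist, states, p_fail)
    return 1.0 - dist["0" * n_modules]


print(reliability_parallel(n_modules=3, p_fail=0.01, n_steps=100))
```

Adding repair would only require non-zero probabilities for 0 -> 1 transitions, which is the kind of extension that is hard to express combinatorially.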
