1
FP9 FAULT-TOLERANT COMPUTING
  • Daniel Ortiz-Arroyo
  • Computer Science and Engineering Department
  • Aalborg University, Esbjerg

2
About the Course
  • This is a short introductory course
  • 5 classes
  • Each class: 2 sessions of 45 min each
  • Exercises
  • Prerequisites
  • Concepts of Probability
  • Computer Architecture
  • Software engineering
  • Reliability

3
Course Contents
4
Reading Material
  • Textbook: no textbook required
  • Optional Reference Books
  • Reliability of Computer Systems and Networks:
    Fault Tolerance, Analysis, and Design, M.L.
    Shooman, Wiley, 2002
  • Software Fault Tolerance Techniques and
    Implementation, Laura L. Pullum, ISBN
    1580531377, Artech House Computer
    Security Series, 2001
  • Papers listed on course web page
    www.cs.aaue.dk/do/teaching/f05/FTC.htm

5
Course Goals
  • Provide an overview of fault tolerant computing
  • Hardware and software
  • Models
  • Implementation mechanisms
  • Huge area with more than 40 years of
    research/development

6
Motivation
  • What is Fault-Tolerance?
  • A fault-tolerant system is one that continues
    to perform at the desired level of service in
    spite of failures in some of the components that
    constitute the system
  • What is fault-tolerant computing?
  • It is the art and science of building computing
    systems that continue to operate satisfactorily
    in the presence of faults
  • Computing correctly despite the existence of
    errors in the system

7
Motivation (contd.)
  • Approaches to designing fault-tolerant computer
    systems
  • Bottom-up: design fault-tolerant components and
    integrate them into a fault-tolerant system
  • Top-down: design a fault-tolerant system
    using components with little or no fault
    tolerance
  • Top-down is the most widely used approach

8
Motivation (contd.)
  • Challenge of Fault Tolerant Computing using the
    top-down approach
  • Given that both hardware and software components
    are unreliable, how do we build reliable systems
    from these unreliable components?

9
Motivation (contd.)
  • A fault-tolerant computing system may be able to
    tolerate one or more fault-types including
  • transient, intermittent or permanent hardware
    faults,
  • software and hardware design errors,
  • operator errors, or
  • externally induced upsets or physical damage.

10
Motivation (contd.)
  • Permanent faults
  • Once a component fails, it never works again.
    Easiest to diagnose
  • Transient faults
  • Occur once; roughly 10 times as likely as
    permanent faults
  • Intermittent faults
  • Recurring; may appear transient if the period is
    long
  • Hard and expensive to detect

11
Motivation (contd.)
  • Examples of fault tolerant mechanisms/systems
  • General Purpose Systems
  • PCs: RAMs with parity checks and possibly ECC
  • Workstations: error detection (HW), occasional
    corrective action (SW), ECC (HW), keeping logs
    (SW)
  • Reliable Systems
  • Telephone systems
  • Banking systems, e.g. ATMs
  • Stock market

12
Motivation (contd.)
  • Examples
  • Critical and Life Critical Systems
  • Manned and unmanned space borne systems
  • Aircraft control systems
  • Nuclear reactor control systems
  • Life support systems
  • Reliable → Critical Systems
  • Traffic light control system
  • Automobile control system (ABS, Fuel injection
    system)

13
Introduction
  • Historical Perspective
  • Not a new concept. First used by J. von Neumann,
    1956
  • Probabilistic Logics and the Synthesis of
    Reliable Organisms from Unreliable Components,
    Annals of Mathematics Studies, Princeton
    University Press
  • Major push
  • Space program
  • HW fault tolerance first
  • SW fault tolerance later
  • Then a merge of the two

14
Introduction (contd.)
  • New pushes
  • Density of devices
  • (Moore's law)
  • Deep submicron tech and time to market pressure
  • Implementation of numerous functionalities on
    chip/board/system
  • Speculative execution
  • in modern processors

15
Introduction (contd.)
  • Intuitive concepts
  • Reliability: continues to work
  • Availability: works when I need it
  • Safety: does not put me in jeopardy
  • Performability: maintains the same performance in
    spite of failures
  • Maintainability: does not take much time to repair

16
Introduction (contd.)
  • The two most common ways industry expresses a
    system's ability to tolerate failure are
  • Reliability
  • Availability

17
Terminology and definitions
  • MTTF: mean time to failure
  • the expected time the system will operate before
    the first failure occurs (the system is replaced
    after a failure)
  • MTTR: mean time to repair
  • average time required to repair a system
  • MTBF: mean time between failures
  • average time between failures of a system
    (renewal situation: there's repair or
    replacement)
  • MTBF = MTTF + MTTR

18
Terminology and definitions
  • Reliability (time interval)
  • R(t): conditional probability that a system is up
    in the interval [0, t], given that it was up at
    time 0. Measured by MTBF
  • Availability (time point)
  • A(t): probability that a system is operating
    correctly and is available to perform its
    functions at the instant of time t. Measured by
    MTBF/(MTBF + MTTR)
  • Availability can be high even if the system has
    frequent periods of inoperability, provided the
    time to repair is low

("Up" means the system provides the required
functionality)
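As a worked sketch of these definitions, the availability
formula above can be evaluated directly; the MTTF/MTTR
values in this Python fragment are illustrative
assumptions, not figures from the course:

    # Steady-state availability per the formulas above:
    # MTBF = MTTF + MTTR and A = MTBF / (MTBF + MTTR).
    mttf = 1000.0  # mean time to failure in hours (assumed value)
    mttr = 2.0     # mean time to repair in hours (assumed value)
    mtbf = mttf + mttr
    availability = mtbf / (mtbf + mttr)
    print(f"A = {availability:.5f}")  # A = 0.99801: high because MTTR is low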
19
Beyond Fault Tolerance
  • While the cost of HW and SW drops, the cost of
    downtime increases every year
  • Availability is a good metric, but outage minutes
    may be more useful (they can be measured) in some
    cases
  • Industry has focused mainly on HW faults

[Figure: customer view of 7x24 service, showing HW,
system SW, and application SW layers at the client,
the network, and the server]
20
Fundamental Principles
  • Redundancy
  • Addition of extra parts to a system's design to
    allow it to continue functioning as intended in
    spite of failures
  • Providing redundancy is key in fault tolerant
    computing
  • Hardware redundancy
  • Software Redundancy
  • Time Redundancy
  • Information Redundancy

21
Fundamental Principles (contd.)
  • Hardware Redundancy
  • Low level
  • Logic level - self-checking circuits, parity-bit
    codes
  • High level
  • Triplicate or use 5 copies of a computer (as in
    the space shuttle)
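As an illustration of the high-level approach, here is a
minimal triple-modular-redundancy (TMR) voter sketched in
Python; the function names are ours and the example is
deliberately simplified:

    from collections import Counter

    def tmr_vote(outputs):
        """Return the majority value among the three replica outputs."""
        value, count = Counter(outputs).most_common(1)[0]
        if count < 2:
            raise RuntimeError("no majority: replicas disagree pairwise")
        return value

    # One replica produces a wrong result; the voter masks the fault.
    assert tmr_vote([42, 42, 7]) == 42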

22
Fundamental Principles (contd.)
  • Software Redundancy
  • Use two different programs/algorithms
  • Time Redundancy
  • Re-compute or redo the task and compare the
    results
  • May or may not use the same hardware/software
  • Information Redundancy
  • Backup information
  • Use of Error Correcting Codes (ECC)
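A minimal sketch of time redundancy in Python; the retry
policy is an assumption for illustration, not something
prescribed by the slides:

    def run_with_time_redundancy(task, *args, retries=3):
        """Re-compute the task and compare the results; retry on mismatch."""
        for _ in range(retries):
            r1 = task(*args)
            r2 = task(*args)  # redo the task (possibly on other HW/SW)
            if r1 == r2:      # results agree: accept them
                return r1
        raise RuntimeError("repeated disagreement: suspect a permanent fault")

    assert run_with_time_redundancy(lambda x: x * x, 9) == 81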

23
Fault-Error-Failure concept
  • Intuitive definitions
  • Fault
  • An anomalous physical condition caused by a
    manufacturing problem, fatigue, an external
    disturbance (intentional or unintentional), a
    design flaw, etc.
  • Error - the effect of the activation of a fault
  • Failure - the overall system effect of an error
  • Fault → Error → Failure

Example: a stuck-at bit (fault) produces incorrect
data at the ALU (error), which leads to an incorrect
balance or a system crash (failure)
Not all errors lead to failures!
24
Fault-Error-Failure concept (contd.)
  • Origins of faults
  • Physical device level (HW)
  • Logic level (HW)
  • Chip level (HW)
  • System level (HW/SW)
    interfacing, specifications, etc.

25
Propagation of Faults and Errors
  • Both faults and errors can spread through the
    system
  • If a chip shorts out power to ground, it may
    cause nearby chips to fail as well
  • Errors can spread because the output of one
    computing element is frequently used as input by
    others
  • Adder example: the erroneous result of a faulty
    adder can be fed into further calculations, thus
    propagating the error

26
Containment Zones
  • To limit error propagation, designers incorporate
    these zones into systems
  • Barriers reduce the chance that a fault or error
    in one zone will propagate to another
  • A fault-containment zone can be created by
    providing an independent power supply to each
    zone
  • The designer tries to electrically isolate one
    zone from another
  • An error-containment zone can be created by using
    redundant units and voting on their output

27
Hardware Fault Classification
  • Transient Faults
  • Disappear after a relatively short time
  • Example - a memory cell that changes spuriously
    due to some electromagnetic interference.
    Overwriting the memory cell with the right
    content will make the fault go away
  • Permanent Faults
  • Never go away, component has to be repaired or
    replaced
  • Intermittent Faults
  • Example - a loose connection

28
Fault Modeling
  • Fault models at different levels (HW)
  • Process level
  • Transistor level
  • Gate level
  • Function level
  • ...
  • System level

(Process and transistor levels concern VLSI
manufacturing.) We will discuss fault/failure models
mainly at high levels (from gate to system) in this
course
29
Fault Modeling (contd.)
  • High-level failure models (process or system
    failure)
  • General classification
  • crash failure - a faulty processor or system
    stops permanently
  • omission failure - a faulty process sometimes
    omits inputs/outputs, but when it works, it
    works correctly
  • timing failure - inputs/outputs are delayed or
    arrive too early
  • Byzantine failure (or arbitrary failure) - a
    faulty processor can exhibit arbitrary behavior,
    including malicious behavior

30
Failure Rate
  • Bathtub curve
  • The rate at which a component suffers faults
    depends on its age, the ambient temperature, any
    voltage or physical shocks that it suffers, and
    the technology

[Figure: bathtub curve. Burn-in is used to avoid the
infant-mortality zone (roughly the first 20 weeks);
the normal lifetime spans about 5-25 years before
wear-out]
31
Failure Rate
  • Empirical formula for the failure rate (in the
    normal lifetime)
  • λ = L · Q · (C1 · T · V + C2 · E)
  • L = learning factor (maturity of technology)
  • Q = manufacturing process quality factor
  • T = temperature factor
  • V = voltage stress factor
  • E = environmental shock factor
  • C1, C2 = complexity factors (# gates, # pins in
    package)
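To make the formula concrete, here is a sketch that
evaluates it with purely illustrative factor values
(every number below is an assumption, not data from the
course):

    # Illustrative evaluation of λ = L * Q * (C1 * T * V + C2 * E).
    L, Q = 1.5, 2.0          # learning and quality factors (assumed)
    C1, C2 = 0.02, 0.01      # gate- and package-complexity factors (assumed)
    T, V, E = 4.0, 1.0, 4.0  # temperature, voltage, environment (assumed)
    lam = L * Q * (C1 * T * V + C2 * E)
    print(lam)  # 0.36, e.g. failures per million hours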

32
Failure Rate
  • In most calculations of reliability, a constant
    failure rate λ is assumed, or equivalently an
    exponential distribution for the component
    lifetime T
  • There are cases in which this simplifying
    assumption is inappropriate
  • Example - during the infant mortality and
    wear-out phases of the bathtub curve
  • In such cases, the Weibull distribution for the
    lifetime T is often used in reliability
    calculation
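A small sketch contrasting the two lifetime models; the
rate λ and the Weibull parameters are illustrative
assumptions:

    import math

    def r_exponential(t, lam):
        """Constant failure rate: R(t) = exp(-λt)."""
        return math.exp(-lam * t)

    def r_weibull(t, shape, scale):
        """Weibull lifetime: shape < 1 models infant mortality,
        shape > 1 models wear-out, shape = 1 is the exponential case."""
        return math.exp(-((t / scale) ** shape))

    for t in (100, 1000, 5000):
        print(t, r_exponential(t, 1e-4), r_weibull(t, 0.7, 10_000))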

33
Fault Tolerance and Reliability
  • The effect of a fault tolerant design on
    reliability can be expressed as
  • Rsys = P(no fault) + P(correct operation | fault) · P(fault)

P(no fault) is maximized by fault-intolerant design
(proofs of correct design, high-quality components).
P(correct operation | fault) is the coverage of a
fault-tolerance design over all possible faults.
For cost effectiveness, a fault-tolerant design
should target the most likely faults
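A worked instance of the expression, with assumed
probabilities chosen only for illustration:

    # Rsys = P(no fault) + P(correct operation | fault) * P(fault)
    p_no_fault = 0.95  # assumed: raised by design proofs, quality components
    coverage = 0.90    # assumed: fraction of faults the design tolerates
    r_sys = p_no_fault + coverage * (1 - p_no_fault)
    print(r_sys)  # 0.995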
34
Importance of Design
  • Planning to avoid failure is the most important
    aspect of fault tolerance
  • Analysis of the environment to determine the
    failures that must be tolerated to achieve a
    desired level of reliability
  • Redundancy costs money and time
  • The design must trade off the amount of redundancy
    used against the desired level of fault tolerance

35
Fault Tolerant Techniques
  • Modular Redundancy
  • Multiple identical replicas of hardware and a
    voter
  • N-version programming - multiple versions of a
    software module
  • Error-control coding
  • ECC: Hamming and Reed-Solomon codes
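As an illustration of error-control coding, here is a
minimal Hamming(7,4) encoder/decoder sketch; it corrects
any single-bit error in a 7-bit codeword (the function
names are ours):

    def hamming74_encode(d1, d2, d3, d4):
        """Encode 4 data bits as the 7-bit codeword p1 p2 d1 p4 d2 d3 d4."""
        p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
        p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
        p4 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
        return [p1, p2, d1, p4, d2, d3, d4]

    def hamming74_decode(c):
        """Correct up to one flipped bit, then return the 4 data bits."""
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2 * s2 + 4 * s4  # 1-based error position, 0 if none
        if syndrome:
            c[syndrome - 1] ^= 1         # flip the erroneous bit back
        return [c[2], c[4], c[5], c[6]]

    code = hamming74_encode(1, 0, 1, 1)
    code[4] ^= 1                          # inject a single-bit fault
    assert hamming74_decode(code) == [1, 0, 1, 1]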

36
Fault Tolerant Techniques
  • Checkpoints and rollbacks
  • The application's state is saved at a checkpoint;
    a rollback restarts execution from a previous
    checkpoint
  • Recovery Blocks
  • Alternates - secondary modules that perform the
    same function as a primary module - are executed
    when the primary fails to pass an acceptance test
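A minimal recovery-block sketch in Python; the modules
and the acceptance test here are hypothetical stand-ins:

    def recovery_block(primary, alternates, acceptance_test, *args):
        """Try the primary, then the alternates, until one passes the test."""
        for module in [primary] + alternates:
            result = module(*args)
            if acceptance_test(result):
                return result
        raise RuntimeError("all modules failed the acceptance test")

    # Hypothetical example: square root of 16 with a faulty primary.
    primary = lambda x: x / 2                  # wrong "fast" approximation
    alternate = lambda x: x ** 0.5             # correct alternate module
    accept = lambda r: abs(r * r - 16) < 1e-6  # acceptance test for x = 16
    assert recovery_block(primary, [alternate], accept, 16) == 4.0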

37
Dependability Evaluation
  • Once a fault-tolerant system is designed, it must
    be evaluated to determine whether its architecture
    meets reliability and dependability objectives,
    using
  • Analytical models
  • Fault injection

38
Modeling
  • Importance of analysis and analytical models
  • to evaluate a design
  • a metric to compare different designs
  • to provide feedback to the designer during early
    design stages
  • use a model for performance analysis
  • used for quantitative and qualitative analysis

39
Modeling (contd.)
  • Mathematical formulation for quantitative
    analysis
  • consider a large experiment with N systems under
    observation at time t
  • Nc(t) - number of correctly operating systems
  • Nf(t) - number of failed systems
  • N = Nc(t) + Nf(t)
  • Hence
  • Reliability R(t) = Nc(t)/N = 1 - Nf(t)/N
  • Unreliability Q(t) = 1 - R(t)
  • Derivative of reliability: dR(t)/dt =
    -(1/N)(dNf(t)/dt)
  • dNf(t)/dt is called the instantaneous failure
    rate of the component
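The experiment can also be simulated; a Monte Carlo
sketch that assumes exponentially distributed lifetimes
with an illustrative rate:

    import random

    def estimate_reliability(t, lam, n=100_000):
        """Estimate R(t) = Nc(t)/N over n simulated systems."""
        n_correct = sum(1 for _ in range(n) if random.expovariate(lam) > t)
        return n_correct / n  # fraction still operating correctly at time t

    print(estimate_reliability(t=1000, lam=1e-3))  # near e**-1, about 0.368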

40
Modeling (contd.)
  • Reliability Modeling
  • System model, concentrating on reliability aspect
  • Models
  • Combinatorial Models
  • Markov Models

41
Modeling (contd.)
  • Combinatorial Modeling
  • Probabilistic techniques
  • Express the reliability of a system as a function
    of the reliability of its components
  • Construction models
  • series
  • parallel

42
Modeling (contd.)
  • Combinatorial Modeling

Parallel: only one of the components must work
correctly (high redundancy)
Series: all components must work correctly (no
redundancy)
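The two constructions have simple closed forms,
R_series = R1 · R2 · ... · Rn and
R_parallel = 1 - (1 - R1)(1 - R2)...(1 - Rn); a sketch
with illustrative component reliabilities:

    from math import prod

    def r_series(rs):
        """All components must work: R = product of the Ri."""
        return prod(rs)

    def r_parallel(rs):
        """At least one component must work: R = 1 - prod(1 - Ri)."""
        return 1 - prod(1 - r for r in rs)

    rs = [0.9, 0.9, 0.9]
    print(r_series(rs))    # 0.729: a series chain lowers reliability
    print(r_parallel(rs))  # 0.999: parallel redundancy raises it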
43
Modeling (contd.)
  • Markov Models
  • Many complex problems cannot be modeled easily
    in a combinatorial fashion
  • Use Markov models (aka Markov chains)
  • Repair is very difficult to model combinatorially
  • Markov models can be applied to modeling
    reliability, availability, repair etc.

44
Modeling (contd.)
Markov Models
STATE: represents all that must be known to describe
the system at a given instant in time. E.g., for
reliability, each state represents a distinct
combination of faulty and fault-free modules (e.g.
101, where 1 = OK and 0 = faulty)
TRANSITION: a change of state that happens in the
system. Over time, as failures occur, the system
goes from one state to another. State changes are
assigned probabilities (e.g. probability of failure),
i.e. transitions are probabilities
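A minimal Markov-model sketch for a single repairable
unit with an up state and a down state; the failure rate
λ and repair rate μ are illustrative assumptions.
Stepping the chain converges to the steady-state
availability μ/(λ + μ):

    # Two-state Markov model of a repairable unit: up <-> down.
    lam, mu, dt = 1e-3, 1e-1, 1.0  # failure/repair rates per hour, step size
    p_up, p_down = 1.0, 0.0        # start in the fault-free state
    for _ in range(10_000):
        p_up, p_down = (p_up * (1 - lam * dt) + p_down * mu * dt,
                        p_down * (1 - mu * dt) + p_up * lam * dt)
    print(p_up)  # converges to mu / (lam + mu), about 0.9901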