
1
FAULT-TOLERANT COMPUTING
  • Fall 2006
  • Daniel Ortiz-Arroyo
  • Computer Science and Engineering Department
  • Aalborg University, Esbjerg

2
About the Course
  • This is a short introductory course
  • 5 classes
  • Each class 2 sessions/45 min each
  • Exercises
  • Prerequisites (DE education)
  • Concepts of Probability
  • Computer Architecture
  • Reliability
  • Software Systems

3
Course Contents
4
Reading Material
  • Textbook: no textbook required
  • Optional reference books
  • Software Fault Tolerance Techniques and
    Implementation, Laura L. Pullum, Artech House
    Computer Security Series, 2001, ISBN 1580531377
  • Reliability of Computer Systems and Networks:
    Fault Tolerance, Analysis, and Design, M.L.
    Shooman, Wiley, 2002
  • Papers listed on the course's web page

5
Today
  • Review of Fault Tolerant Concepts
  • Overview of the course/topics
  • Topic list and course format discussion

6
Course Goals
  • Provide an overview of Fault Tolerant Computing
    (FTC)
  • Hardware and software
  • Models
  • Implementation mechanisms
  • Discuss research and real implementation cases
  • Huge area with more than 50 years of
    research/development

7
Motivation
  • What is Fault-Tolerance?
  • A fault-tolerant system is one that continues
    to perform at the desired level of service in
    spite of failures in some of the components that
    constitute the system.
  • What is fault-tolerant computing?
  • The art and science of building computing
    systems that continue to operate satisfactorily
    in the presence of faults
  • Computing correctly despite the existence of
    errors in the system

8
Motivation
  • Why FTC is important
  • FTC techniques are the foundation for other areas
    e.g. FT Control
  • Techniques invented in FTC have been transferred
    to other fields

9
Motivation
  • Dependable systems are essential to humans
  • Dependability is the ability of a system to
    deliver a service that can justifiably be trusted

[Figure: dependability diagram with branches "Measured by" and "Means to achieve dependability"]
10
Motivation (contd.)
  • Approaches to design fault tolerant computer
    systems
  • Bottom-up: designing fault-tolerant components
    and integrating them into a fault-tolerant system
  • Top-down: designing a fault-tolerant system
    using components with little or no fault
    tolerance
  • Top-down is the most used approach

11
Motivation (contd.)
  • Challenge of Fault Tolerant Computing using the
    top-down approach
  • Given that both hardware and software components
    are unreliable, how do we build reliable systems
    from these unreliable components?

12
Motivation (contd.)
  • A FTC system may be able to tolerate one or more
    fault-types including
  • transient, intermittent, or permanent hardware
    faults,
  • HW and SW design errors,
  • operator errors, or
  • externally induced upsets or physical damage.

13
Motivation (contd.)
  • Examples of FTC mechanisms/systems at different
    levels
  • PCs: RAM with parity checks and Error-Correcting
    Codes (ECC) (HW)
  • Workstations: error detection (HW), occasional
    corrective action (SW), keeping logs (SW)
  • RAID (Redundant Array of Inexpensive Disks)
  • Distributed Systems

14
Introduction
  • Historical Perspective
  • Theory established by J. von Neumann, 1956
  • "Probabilistic Logics and the Synthesis of
    Reliable Organisms from Unreliable Components",
    Annals of Mathematics Studies, Princeton
    University Press
  • The SAPO computer built in Prague, Czechoslovakia
    in 1950-1954 under the supervision of A. Svoboda
    was probably the first fault-tolerant computer.
  • Used relays and a magnetic drum memory.
  • The processor used triplication and voting (TMR),
    and the memory implemented error detection with
    automatic retries when an error was detected.

15
Introduction
  • Historical Perspective
  • Over the past 30 years, a number of
    fault-tolerant computers have been developed that
    fall into three general types
  • long-life, unmaintainable computers,
  • ultradependable, real-time computers, and
  • high-availability computers.

16
Introduction
  • Long-life, un-maintainable computers.
  • Spacecraft require computers that operate for
    long periods of time without external repair.
    Typical requirements are a probability of 95%
    that the computer will operate correctly for
    5-10 years.
  • The JPL Self-Testing-and-Repairing (STAR)
    computer was the next fault-tolerant computer,
    developed by NASA in the late 1960s for a 10-year
    mission to the outer planets.

17
Introduction
  • Ultra-dependable, real-time computers: computers
    for which an error or delay can prove to be
    catastrophic.
  • They are designed for applications such as
    control of aircraft, mass transportation systems,
    and nuclear power plants.
  • One of the first operational machines of this
    type was the Saturn V guidance computer,
    developed in the 1960s. The Space Shuttle is
    another example.
  • Fly-by-wire aircraft exhibit a very high degree
    of fault tolerance in their real-time flight
    control computers, for example the Airbus
    airliners.

18
Introduction
  • High-availability computers can tolerate an
    occasional error or very short delays (on the
    order of a few seconds), while error recovery is
    taking place.
  • Example applications are telephone switching and
    transaction processing for banks, airline
    reservations, etc.
  • Tandem Computers, Inc. used a distributed-system
    design with a sophisticated form of duplication.
  • Sun's ft-SPARC and the HP/Stratus Continuum 400
    are systems that contain redundant processors,
    disks and power supplies, and automatically
    switch to backups if a failure is detected.

19
Introduction (contd.)
  • More recent pushes for FTC
  • Moore's Law, complexity in processors
  • Fault tolerant mechanisms
  • in HW (more studied)
  • in SW (more recent, more debated)

20
Introduction (contd.)
  • Intuitive concepts
  • Reliability: continues to work
  • Availability: works when I need it
  • Safety: does not put me in jeopardy
  • Performability: maintains the same performance
    in spite of failures
  • Maintainability: does not take much time to
    repair

21
Introduction (contd.)
  • The two most common ways industry expresses a
    system's ability to tolerate failure are
  • Reliability
  • Availability
  • In modern distributed systems other measures can
    be used such as outage time (or down time)

22
Terminology and definitions
  • Reliability (time interval)
  • R(t): conditional probability that a system is up
    throughout the interval [0, t], given that it was
    up at time 0. Measured by MTBF, MTTF, MTTR
  • Availability (time point)
  • A(t): probability that a system is operating
    correctly and is available to perform its
    functions at the instant of time t. Measured by
    MTBF/(MTBF + MTTR)
  • Availability can be high even if the system has
    frequent periods of inoperability, provided the
    time to repair is low.

"Up" means the system provides the required
functionality
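
The availability measure above can be tried out with a few lines of Python. This is a minimal sketch, not part of the original slides: it uses the slide's expression MTBF/(MTBF + MTTR), treating MTBF as the mean up time between failures, and the function names and sample numbers are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability, MTBF / (MTBF + MTTR), as on the slide."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def outage_minutes_per_year(avail: float) -> float:
    """Expected down time per 365-day year implied by an availability."""
    return (1.0 - avail) * 365 * 24 * 60


# Hypothetical system: fails often (every 100 h) but is repaired in 6 minutes.
a = availability(mtbf_hours=100.0, mttr_hours=0.1)
print(f"availability = {a:.5f}")
print(f"outage = {outage_minutes_per_year(a):.0f} min/year")
```

The example illustrates the last bullet: frequent failures still give a high availability when the repair time is short.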
23
Beyond Fault Tolerance
  • Distributed systems: while the cost of HW and SW
    drops, the cost of down time increases every year
  • Availability is a good metric, but outage minutes
    may be more useful in some cases (they can be
    measured)
  • Industry has focused mainly on HW faults, but
    those are not the only ones

[Figure: customer view of 7x24 service. Client, network, and server each consist of Hw, Sys-sw, and App-sw layers]
24
Break
25
Fundamental Principles
  • Redundancy
  • Addition of extra parts to a system's design to
    allow it to continue functioning as intended in
    spite of failures
  • Providing redundancy is key in fault tolerant
    computing
  • Hardware redundancy
  • Software Redundancy
  • Time Redundancy
  • Information Redundancy

26
Summary of FTC Techniques
We'll cover basic and advanced FTC techniques,
some research work, and case studies
27
Fundamental Principles (contd.)
  • Hardware Redundancy
  • Low level
  • Logic level - Self checking circuits, parity bit
    code
  • High level
  • Triplicate a computer, or use five copies (as in
    the Space Shuttle)

28
Fundamental Principles (contd.)
  • Software Redundancy
  • Use two different programs/algorithms
  • Time Redundancy
  • Re-compute or redo the task and compare the
    results
  • May or may not use the same hardware/software
  • Information Redundancy
  • Backup information
  • Use of Error Correcting Codes (ECC)
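
As an illustration of information redundancy, here is a minimal sketch of a Hamming(7,4) error-correcting code, which adds three parity bits to four data bits and can correct any single flipped bit. The slides do not name a specific code; this choice and the function names are assumptions made for the example.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit; return (data_bits, error_position)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s4  # syndrome; 0 means no error detected
    if pos:
        c[pos - 1] ^= 1  # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]], pos


word = [1, 0, 1, 1]
stored = hamming74_encode(word)
stored[4] ^= 1  # simulate a single-bit memory fault at position 5
recovered, position = hamming74_decode(stored)
print(recovered == word, position)
```

The code corrects one bit error per codeword; two simultaneous errors would be miscorrected, which is why practical memories often add an extra overall parity bit.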

29
Fault-Error-Failure concept
  • Intuitive definitions
  • Fault
  • An anomalous physical condition caused by a
    manufacturing problem, fatigue, external
    disturbance (intentional or unintentional), or
    design flaw
  • Error - Effect of activation of a fault
  • Failure - over-all system effect of an error
  • Fault -> Error -> Failure
  • Example: stuck-at bit (fault) -> incorrect data
    at ALU (error) -> incorrect balance, system crash
    (failure)
  • Not all errors lead to failures!
30
Faults/Failure Classification
Fault -> Error -> Failure
31
Containment Zones
  • Containment Zones are used to limit error
    propagation in a system
  • Barriers reduce the chance that a fault or error
    in one zone will propagate to another
  • A fault-containment zone can be created by
    providing an independent power supply to each
    zone
  • The designer tries to electrically isolate one
    zone from another
  • An error-containment zone can be created by using
    redundant units and voting on their output
    (details later)
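
The idea of an error-containment zone built from redundant units and a voter can be sketched as follows. This is only an illustrative software voter under the assumption that replica outputs can be compared for exact equality; the function name is hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the replicas.

    Raises ValueError when no value has a strict majority, i.e. when the
    error cannot be masked and must be reported instead.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority: error cannot be masked")
    return value


# Three replicas, one of which delivers an erroneous result.
print(majority_vote([42, 42, 7]))
```

With three replicas the voter masks one erroneous output; the error stays inside the zone because only the voted value leaves it.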

32
Barriers in HW/SW
Barriers are constructed by design techniques for
fault avoidance, masking, and tolerance.
A better name for fault tolerance is error tolerance.
33
Fault Modeling
  • Fault models at different levels (HW)
  • Process level
  • Transistor level
  • Gate level
  • Function level
  • System level

VLSI Manufacturing (Important but not covered
here)
We will discuss fault/failure models mainly at
high levels (from function to system level) in
the course
34
System Level Fault Modeling
  • Distributed System
  • Must consider HW/SW faults and communication
    errors

[Figure: distributed system. Client, server, and database (each with HW/SW redundancy) connected by a communication channel (HW/SW); annotations: time redundancy, information redundancy]
35
Fault Modeling (contd.)
  • High-level failure models (process or system
    level failure). General classification
  • crash failure - a faulty processor or system
    stops permanently
  • omission failure - a faulty process sometimes
    omits inputs/outputs, but when it works, it
    works correctly
  • timing failure - inputs/outputs are delayed or
    arrive too early
  • Byzantine failure (or arbitrary failure) - a
    faulty processor can exhibit arbitrary behavior,
    including malicious behavior

36
Failure Rate
  • Bathtub curve (for hardware)
  • The rate at which a component suffers faults
    depends on its age, the ambient temperature, any
    voltage or physical shocks that it suffers, and
    the technology

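[Figure: bathtub curve of failure rate versus time. Labels: burn-in is used to avoid the early (infant mortality) zone; constant failure rate during normal lifetime; time-axis marks at 20 weeks and 5-25 years]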
37
Failure Rate
  • Empirical formula for failure rate (in normal
    lifetime stage) for HW systems
  • λ = L Q (C1 T V + C2 E)
  • L = Learning factor (maturity of technology)
  • Q = Manufacturing process quality factor
  • T = Temperature factor
  • V = Voltage stress factor
  • E = Environmental shock factor
  • C1, C2 = Complexity factors (number of gates,
    pins in package)
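
Reading the empirical formula as λ = L·Q·(C1·T·V + C2·E), it can be evaluated directly. The sketch below is only an illustration: the factor values are made up for the example and do not come from any handbook or from the slides.

```python
def failure_rate(L, Q, C1, C2, T, V, E):
    """Empirical failure rate: lambda = L * Q * (C1 * T * V + C2 * E)."""
    return L * Q * (C1 * T * V + C2 * E)


# Hypothetical factor values, chosen only to show how the terms combine.
base = failure_rate(L=1.0, Q=1.0, C1=0.02, C2=0.01, T=1.0, V=1.0, E=1.0)
hot = failure_rate(L=1.0, Q=1.0, C1=0.02, C2=0.01, T=4.0, V=1.0, E=1.0)
print(base, hot)
```

Note that temperature and voltage stress scale only the C1 term, while the environmental factor scales only the C2 term.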

38
Failure Rate
  • In most calculations of reliability, a constant
    failure rate λ is assumed, or equivalently the
    exponential distribution for the component
    lifetime T.
  • There are cases in which this simplifying
    assumption is inappropriate
  • Example - during the infant mortality and
    wear-out phases of the bathtub curve
  • In such cases, the Weibull distribution for the
    lifetime T is often used in reliability
    calculation
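
The difference between the two lifetime models can be sketched numerically. The parameterization below, R(t) = exp(-λ t^α) with hazard α λ t^(α-1), is one common form of the Weibull distribution and is an assumption here, since the slides do not give a formula.

```python
import math


def reliability_exponential(t, lam):
    """R(t) for a constant failure rate lam."""
    return math.exp(-lam * t)


def reliability_weibull(t, lam, alpha):
    """R(t) = exp(-lam * t**alpha); alpha = 1 gives the exponential case."""
    return math.exp(-lam * t ** alpha)


def hazard_weibull(t, lam, alpha):
    """Failure rate alpha * lam * t**(alpha - 1), defined for t > 0."""
    return alpha * lam * t ** (alpha - 1)


# alpha < 1: decreasing failure rate (infant mortality)
# alpha > 1: increasing failure rate (wear-out)
for alpha in (0.5, 1.0, 2.0):
    print(alpha, hazard_weibull(1.0, 0.1, alpha), hazard_weibull(10.0, 0.1, alpha))
```

The shape parameter alpha is what lets a single formula cover the falling, flat, and rising parts of the bathtub curve.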

39
Fault Tolerance and Reliability
  • The effect of a fault tolerant design on
    reliability can be expressed as
  • Rsys = P(no fault) + P(correct operation | fault)
    P(fault)
  • P(no fault) is maximized by fault-intolerant
    design (proofs of correct design, high-quality
    components)
  • P(correct operation | fault) is the coverage of
    a fault-tolerant design over all possible faults
  • For cost effectiveness, a fault-tolerant design
    should target the most likely faults
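
Reading the expression as Rsys = P(no fault) + P(correct operation | fault)·P(fault), the effect of coverage can be checked with a few numbers. The probabilities below are hypothetical and serve only to show the trade-off.

```python
def system_reliability(p_fault, coverage):
    """Rsys = P(no fault) + P(correct operation | fault) * P(fault)."""
    return (1.0 - p_fault) + coverage * p_fault


# Hypothetical numbers: a fault occurs with probability 0.1.
for coverage in (0.0, 0.9, 0.99):
    print(coverage, system_reliability(0.1, coverage))
```

With zero coverage the system is only as reliable as its fault-free probability; each increase in coverage recovers part of the remaining P(fault).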
40
Dependability Evaluation
  • Once a fault-tolerant system is designed, it must
    be evaluated to determine if its architecture
    meets reliability and dependability objectives
    using
  • Analytical models
  • Injecting faults

41
Modeling
  • Mathematical formulation for quantitative
    analysis
  • Consider a large experiment with N systems under
    observation at time t
  • Nc(t) - number of correctly operating systems
  • Nf(t) - number of failed systems
  • N = Nc(t) + Nf(t)
  • Hence
  • Reliability R(t) = Nc(t)/N = 1 - Nf(t)/N
  • Unreliability Q(t) = 1 - R(t)
  • Derivative of reliability dR(t)/dt =
    -(1/N)(dNf(t)/dt)
  • dNf(t)/dt is the instantaneous rate at which
    systems fail; dividing it by Nc(t) gives the
    failure rate (hazard) function
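
The counting experiment can be simulated. The sketch below assumes exponentially distributed lifetimes with a constant failure rate, which is one possible model rather than something the experiment itself requires; the function name, seed, and parameters are hypothetical.

```python
import math
import random


def estimate_reliability(n_systems, lam, t, seed=0):
    """Simulate N systems and return (Nc(t)/N, Nf(t)/N) at time t."""
    rng = random.Random(seed)
    lifetimes = [rng.expovariate(lam) for _ in range(n_systems)]
    n_correct = sum(1 for life in lifetimes if life > t)  # Nc(t)
    n_failed = n_systems - n_correct                      # Nf(t)
    return n_correct / n_systems, n_failed / n_systems


r, q = estimate_reliability(n_systems=20000, lam=0.001, t=1000.0)
print(f"R(t) estimate = {r:.3f}, Q(t) estimate = {q:.3f}")
print(f"exp(-lam * t) = {math.exp(-0.001 * 1000.0):.3f}")
```

For a large N the estimate Nc(t)/N approaches the analytical value exp(-λt) of the assumed model.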

42
Modeling (contd.)
  • Reliability Modeling
  • System model, concentrating on reliability aspect
  • The structure of a system is used to model its
    reliability
  • Models
  • Combinatorial Models
  • Markov Models

43
Modeling (contd.)
  • Combinatorial Modeling - Probabilistic Techniques
  • Idea: express the reliability of a system as a
    function of the reliability of its components
  • Construction models
  • series
  • All components must work correctly
  • No redundancy
  • parallel
  • Only one of the components must work correctly
  • High redundancy
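
The series and parallel models translate directly into code. This minimal sketch assumes that component failures are independent, which is the usual assumption behind these formulas; the function names are illustrative.

```python
from math import prod


def series_reliability(component_reliabilities):
    """All components must work: R = product of the R_i."""
    return prod(component_reliabilities)


def parallel_reliability(component_reliabilities):
    """At least one component must work: R = 1 - product of (1 - R_i)."""
    return 1.0 - prod(1.0 - r for r in component_reliabilities)


parts = [0.9, 0.9, 0.9]
print("series:  ", series_reliability(parts))
print("parallel:", parallel_reliability(parts))
```

A series system is less reliable than its weakest component, while a parallel system is more reliable than its best one.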

44
Modeling (contd.)
  • Markov Models
  • Many complex problems cannot be modeled easily
    in a combinatorial fashion
  • Use Markov models (aka Markov chains)
  • Repair is very difficult to model
    combinatorially
  • Markov models can be applied to modeling
    reliability, availability etc.

45
Modeling (contd.)
  • Markov models
  • STATE: represents all that must be known to
    describe the system at a given instant in time
  • E.g. for reliability, each state represents a
    distinct combination of faulty and fault-free
    modules (e.g. 101, where 1 = OK and 0 = fault)
  • TRANSITION: a change of state that happens in
    the system
  • Over time, as failures occur, the system goes
    from one state to another
  • State changes are given probabilities (e.g.
    probability of failure)
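
A small discrete-time Markov chain shows how states and transition probabilities yield a reliability figure. The sketch models a triplicated (TMR) system without repair and, as a simplifying assumption, merges module combinations into three states: all three modules OK, two OK, and system failed. The step size and failure rate are hypothetical.

```python
import math


def step(dist, matrix):
    """One step of the chain: new_dist[j] = sum_i dist[i] * matrix[i][j]."""
    n = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def tmr_reliability(lam, t, dt=0.01):
    """Probability that at least 2 of 3 modules still work at time t."""
    q = lam * dt  # probability that one module fails during one step
    matrix = [
        [1 - 3 * q, 3 * q, 0.0],  # 3 OK -> 3 OK, or one module fails
        [0.0, 1 - 2 * q, 2 * q],  # 2 OK -> 2 OK, or system fails
        [0.0, 0.0, 1.0],          # failed state is absorbing (no repair)
    ]
    dist = [1.0, 0.0, 0.0]        # start with all modules fault-free
    for _ in range(round(t / dt)):
        dist = step(dist, matrix)
    return dist[0] + dist[1]


lam, t = 0.1, 5.0
closed_form = 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)
print(tmr_reliability(lam, t), closed_form)
```

Adding repair would only require extra transitions back toward the fault-free state, which is the case that is hard to express combinatorially.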
46
Fault Injection
  • Fault injection: artificial creation of failure
    sources
  • study system behavior under faults/errors
  • what is the effect of faults? how fast do they
    propagate?
  • is recovery code correct?
  • is our failure model accurate?
  • is failover / reconfiguration effective?
  • Limitations
  • Not all failures can be injected
  • Year-long MTTFs impossible to study
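
A software fault-injection experiment can be very small. The sketch below, which is only an illustration with hypothetical names, flips bits in an 8-bit word protected by a single parity bit and measures how often the error is detected.

```python
import random


def parity(word):
    """Even-parity bit of an integer: 1 if the number of 1 bits is odd."""
    return bin(word).count("1") % 2


def inject_bit_flips(word, n_flips, rng, width=8):
    """Flip n_flips distinct, randomly chosen bits of the word."""
    for bit in rng.sample(range(width), n_flips):
        word ^= 1 << bit
    return word


def detection_rate(n_flips, trials=1000, seed=0):
    """Fraction of injected faults that the parity check detects."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(trials):
        word = rng.randrange(256)
        stored_parity = parity(word)
        faulty = inject_bit_flips(word, n_flips, rng)
        if parity(faulty) != stored_parity:
            detected += 1
    return detected / trials


print("single-bit faults detected:", detection_rate(1))
print("double-bit faults detected:", detection_rate(2))
```

The experiment answers the "is our failure model accurate?" question for this mechanism: parity catches every single-bit flip but misses every double-bit flip.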

47
Fault Injection
  • Fault Injection is used to test Fault Tolerant
    Systems
  • Simulation
  • Prototyping

48
Topic List Discussion
Which topics that you haven't seen before
should we cover?
49
Course Format
  • Discussion on course format
  • Regular lectures
  • Student Participation
  • Workshop