Faults and fault-tolerance - PowerPoint PPT Presentation

Title: Concurrent Reading and Writing using Mobile Agents Author: Sukumar Ghosh Last modified by: Sukumar Ghosh Created Date: 11/1/2002 2:53:35 AM – PowerPoint PPT presentation

Slides: 13
Provided by: Suku48

Transcript and Presenter's Notes


1
Faults and fault-tolerance
  • One of the selling points of a distributed
    system is that the system will continue to
    perform even if some components / processes fail.

2
Cause and effect
  • Study what causes what.
  • We view the effect of failures at our level of
    abstraction, and then try to mask it, or recover
    from it.
  • Be familiar with the terms MTBF (Mean Time
    Between Failures) and MTTR (Mean Time To Repair)
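
The two terms are commonly combined into a steady-state availability figure, MTBF / (MTBF + MTTR). A minimal Python sketch follows; the function name and the 1000-hour and 2-hour figures are illustrative assumptions, not values from the slides.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state fraction of time a component is up."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical component: fails every 1000 h on average, takes 2 h to repair.
print(f"{availability(1000.0, 2.0):.4f}")
```

Shortening the repair time raises availability just as lengthening the time between failures does, which is why both terms matter.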

3
Classification of failures
  • Omission failure
  • Crash failure
  • Software failure
  • Transient failure
  • Temporal failure
  • Security failure
  • Byzantine failure
4
Crash failures
  • Crash failure is irreversible. How can we
    distinguish between a process that has crashed
    and a process that is running very slowly?
  • In a synchronous system, it is easy to detect
    a crash failure (using heartbeat signals and a
    timeout), but in an asynchronous system,
    detection is never accurate, because a slow
    process looks the same as a crashed one.
  • Some failures may be complex and nasty.
    Arbitrary deviation from program execution is a
    form of failure that may not be as nice as a
    crash. Fail-stop failure is a simple abstraction
    that mimics crash failure when program execution
    becomes arbitrary. Such implementations help
    detect which processor has failed. If a system
    cannot tolerate fail-stop failure, then it cannot
    tolerate crash.
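
The heartbeat-and-timeout idea can be sketched in a few lines of Python. This is a minimal illustration, not a production failure detector: the class name, the process names, and the 3-second timeout are assumptions made for the example, and time is passed in explicitly so the run is deterministic.

```python
from __future__ import annotations


class HeartbeatDetector:
    """Suspects a process once no heartbeat has arrived for `timeout` seconds."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: dict[str, float] = {}

    def heartbeat(self, process: str, now: float) -> None:
        """Record that `process` sent a heartbeat at time `now`."""
        self.last_seen[process] = now

    def suspected(self, now: float) -> set[str]:
        """Processes whose latest heartbeat is older than the timeout."""
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}


detector = HeartbeatDetector(timeout=3.0)  # hypothetical timeout value
detector.heartbeat("p1", now=0.0)
detector.heartbeat("p2", now=0.0)
detector.heartbeat("p1", now=2.5)   # p1 keeps reporting; p2 goes silent
print(detector.suspected(now=4.0))  # p2 is suspected, p1 is not
```

In a synchronous system a bound on message delay makes the timeout a reliable verdict. Without such a bound, p2 may simply be slow, and a late heartbeat would take it off the suspected set again.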

5
Omission failures
  • Message lost in transit. May happen due to
    various causes, like
  • Transmitter malfunction
  • Buffer overflow
  • Collisions at the MAC layer
  • Receiver out of range

6
Transient failure
  • (Hardware) Arbitrary perturbation of the global
    state. May be induced by power surge, weak
    batteries, lightning, radio-frequency
    interference, etc.
  • (Software) Heisenbugs are a class of temporary
    internal faults and are intermittent. They are
    essentially permanent faults whose conditions of
    activation occur rarely or are not easily
    reproducible, so they are harder to detect during
    the testing phase.
  • Over 99% of bugs in IBM DB2 production code are
    non-deterministic and transient

7
Byzantine failure
  • Anything goes! Includes every conceivable form
    of erroneous behavior.
  • Numerous possible causes. Includes malicious
    behaviors (like a process executing a different
    program instead of the specified one) too.
  • Most difficult kind of failure to deal with.

8
Software failures
  • Coding error or human error
  • Design flaws
  • Memory leak
  • Incomplete specification (example: Y2K)
  • Many failures (like crash, omission, etc.) can be
    caused by software bugs too.

9
Specification of faulty behavior
    program example1
    define x : boolean (initially x = true)
    (a, b are messages)
    do {S} x → send a        {specified action}
    [] {F} true → send b     {faulty action}
    od

a a a a b a a a b b a a a a a a a
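
The guarded-command program above can be simulated to produce message sequences like the one shown. The Python sketch below is a rough model under stated assumptions: the function name is made up, and the fault probability is an invented parameter, since the slide only says that the faulty action can fire whenever its guard (always true) holds.

```python
import random


def run_example1(steps: int, fault_probability: float, seed: int) -> list:
    """Simulate the loop: the specified action S sends a, the fault F sends b."""
    rng = random.Random(seed)
    x = True  # initially x = true, and neither action changes it
    trace = []
    for _ in range(steps):
        # F's guard is always true, so it may fire at any step.
        # The probability is an assumption, not part of the specification.
        if rng.random() < fault_probability:
            trace.append("b")   # {F} faulty action
        elif x:
            trace.append("a")   # {S} specified action
    return trace


print(" ".join(run_example1(steps=17, fault_probability=0.2, seed=1)))
```

With `fault_probability=0.0` the trace contains only a, which is the fault-free behavior; any b in the trace marks a step where the fault action was chosen.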
10
Fault-tolerance
A system that tolerates failure of type F
  • F-intolerant vs F-tolerant systems
  • Four types of tolerance
      - Masking
      - Non-masking
      - Fail-safe
      - Graceful degradation

(Figure: diagram relating faults to tolerances)
11
Fault-tolerance
  • P is the invariant of the original fault-free
    system.
  • Q represents the worst possible behavior of the
    system when failures occur. It is called the
    fault span.
  • Q is closed under S or F: neither a specified
    action nor a faulty action leads outside Q.

(Figure: region P drawn inside the larger region Q)
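
One way to read the fault span is as the set of states reachable from P when both program actions and fault actions may fire. The Python sketch below computes that set for a toy system. The counter, the two step functions, and all names are hypothetical and chosen only to make the closure visible.

```python
from collections import deque


def fault_span(invariant: set, actions: list) -> set:
    """Smallest superset of `invariant` that is closed under every action."""
    span = set(invariant)
    frontier = deque(invariant)
    while frontier:
        state = frontier.popleft()
        for action in actions:
            for successor in action(state):
                if successor not in span:
                    span.add(successor)
                    frontier.append(successor)
    return span


# Hypothetical toy system: a counter that should stay in {0, 1, 2}.
def program_step(n):
    """S: cycle 0 -> 1 -> 2 -> 0 inside P, step down by one outside P."""
    return [(n + 1) % 3] if n < 3 else [n - 1]


def fault_step(n):
    """F: a fault may bump the counter by one, but never past 4."""
    return [n + 1] if n < 4 else []


P = {0, 1, 2}
Q = fault_span(P, [program_step, fault_step])
print(sorted(Q))  # [0, 1, 2, 3, 4]
```

Running the same function with only `program_step` returns P itself, which reflects that the invariant is closed under the fault-free program.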
12
Fault-tolerance
  • Masking tolerance: P = Q
    (neither safety nor liveness is violated)
  • Non-masking tolerance: P ⊂ Q
    (the safety property may be temporarily
    violated, but not liveness). Eventually the
    safety property is restored.

(Figure: region P drawn inside the larger region Q)
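
The distinction between the two tolerances can be checked mechanically on a small finite system. The Python sketch below is only an illustration under assumptions: it takes a deterministic fault-free step function, and the toy counter, the function names, and the "neither" label for everything else are invented for the example.

```python
def classify(P: set, Q: set, program_step, max_steps: int = 1000) -> str:
    """Label a system from its invariant P and fault span Q."""
    if not P <= Q:
        raise ValueError("the invariant P must lie inside the fault span Q")
    if P == Q:
        return "masking"
    # Non-masking: from every state in Q, fault-free execution must
    # eventually bring the system back into P.
    for start in Q:
        state, steps = start, 0
        while state not in P:
            if steps == max_steps:
                return "neither"
            state = program_step(state)
            steps += 1
    return "non-masking"


# Hypothetical toy system: a counter whose invariant is {0, 1, 2}.
def recover(n):
    """Cycle inside P, step down by one when outside P."""
    return (n + 1) % 3 if n < 3 else n - 1


print(classify({0, 1, 2}, {0, 1, 2}, recover))        # masking
print(classify({0, 1, 2}, {0, 1, 2, 3, 4}, recover))  # non-masking
```

The second call returns "non-masking" because states 3 and 4 violate the invariant for a while, yet every fault-free run from them reaches P again.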