ECE 753: FAULT-TOLERANT COMPUTING

1
ECE 753 FAULT-TOLERANT COMPUTING
  • Kewal K. Saluja
  • Department of Electrical and Computer Engineering
  • Software Fault Tolerance

2
Overview
  • Introduction and Motivation
  • Basic Approaches
  • Fault Tolerance
  • Using backup
  • N-version programming
  • Recovery block scheme

3
Introduction: SW fault tolerance
  • References
  • Prad96, Chapter 7, Sections 7.1-7.3
  • Shooman, Chapter 5, Section 5.9
  • Lyu, Chapter 14, all sections
  • Motivation
  • Large and complex software systems
  • Faults more likely
  • Fault avoidance less successful
  • Difficulty in testing and verification
  • Consequences of failure
  • Millions of dollars
  • Inconvenience: non-availability of phones,
    networks, computers
  • Loss of life

4
Basic Approaches
  • Robustness: the extent to which operation
    continues despite invalid inputs
  • Methods to overcome
  • Fault avoidance
  • Check validity of inputs
  • Re-read inputs
  • Use default values
  • Exits to recoverable states instead of core dump
  • Use of robust data structures
  • Object-oriented programming (generates robust
    structures)
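
The input-handling methods above (check validity, re-read, fall back to a default) can be sketched as follows. This is only an illustration: the function name `read_with_retry`, its parameters, and the toy temperature sensor are assumptions, not part of the course material.

```python
def read_with_retry(read_fn, is_valid, default, max_attempts=3):
    """Return the first valid reading, or `default` after max_attempts."""
    for _ in range(max_attempts):
        try:
            value = read_fn()          # re-read the input on each attempt
        except (ValueError, OSError):
            continue                   # a failed read counts as an invalid input
        if is_valid(value):            # validity check
            return value
    return default                     # fall back to a safe default value


# Hypothetical sensor that returns garbage twice, then a good value.
readings = iter(["abc", "-40", "21"])
temperature = read_with_retry(
    read_fn=lambda: int(next(readings)),
    is_valid=lambda t: 0 <= t <= 100,
    default=20,
)
print(temperature)  # prints 21
```

The caller always gets a usable value back, so a bad input leads to a recoverable state rather than a crash.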

5
Basic Approaches (contd.)
  • Fault containment: a fault in one module should
    not affect other modules
  • Methods to use
  • Validity checks, such as verifying that addresses
    are in range
  • Reasonableness checks, such as divide by zero,
    overflow, and watchdog timers (is the response
    timely, or is it stale?)
  • Assertion checks
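
A minimal sketch of the containment checks listed above. All function names here (`safe_divide`, `read_buffer`, `is_fresh`, `contained_call`) are hypothetical, and a real system would place the containment boundary at a process or module interface rather than a function call.

```python
def safe_divide(numerator, denominator):
    # Reasonableness check: refuse to divide by zero rather than crash.
    if denominator == 0:
        raise ValueError("divide by zero")
    return numerator / denominator


def read_buffer(buffer, index):
    # Validity check: the address (index) must be in range.
    if not 0 <= index < len(buffer):
        raise IndexError(f"index {index} out of range")
    return buffer[index]


def is_fresh(timestamp, now, timeout_s):
    # Watchdog-style check: is the response timely, or is it stale?
    return (now - timestamp) <= timeout_s


def contained_call(module_fn, *args, fallback=None):
    # Containment boundary: an error inside module_fn does not propagate.
    try:
        result = module_fn(*args)
        assert result is not None, "module returned no result"  # assertion check
        return result
    except (ValueError, IndexError, AssertionError):
        return fallback


print(contained_call(safe_divide, 1, 0, fallback=0.0))       # prints 0.0
print(contained_call(read_buffer, [1, 2, 3], 5, fallback=-1)) # prints -1
```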

6
Basic Approaches (contd.)
  • Fault tolerance: continued correct operation in
    the presence of program faults
  • Different methods apply to different systems and
    software architectures; the following were
    already discussed in the context of checkpointing
    and error recovery
  • Uniprocessor systems
  • Multiprocessor systems, loosely coupled and
    tightly coupled
  • Message passing systems

7
Fault Tolerance
  • Using backups, or process-pair architecture
  • Application is replicated on two processors:
    primary and backup processes
  • Normally the primary provides service
  • Primary provides checkpoints (state, results,
    etc.) to the backup
  • Backup can take over when the primary fails
  • Some checks, such as "I am alive" messages, are
    used to assure that the primary is making progress
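
The process-pair idea can be simulated in a single program. The sketch below is a toy under stated assumptions: the classes `Primary` and `Backup`, the running-sum "service", the explicit `now` timestamps, and the rule that a checkpoint doubles as an "I am alive" message are all illustrative choices, not taken from the slides.

```python
class Backup:
    def __init__(self, timeout):
        self.timeout = timeout
        self.checkpoint = 0        # last state received from the primary
        self.last_heartbeat = 0.0  # time of the last sign of life
        self.active = False

    def receive_checkpoint(self, state, now):
        self.checkpoint = state
        self.last_heartbeat = now  # a checkpoint also counts as "I am alive"

    def monitor(self, now):
        # Take over if the primary has been silent for longer than the timeout.
        if not self.active and now - self.last_heartbeat > self.timeout:
            self.active = True
        return self.active

    def serve(self, request):
        # Resume service from the last checkpointed state.
        self.checkpoint += request
        return self.checkpoint


class Primary:
    def __init__(self, backup):
        self.state = 0
        self.backup = backup

    def serve(self, request, now):
        self.state += request                            # toy service: running sum
        self.backup.receive_checkpoint(self.state, now)  # checkpoint to the backup
        return self.state


backup = Backup(timeout=2.0)
primary = Primary(backup)
primary.serve(5, now=0.0)
primary.serve(3, now=1.0)          # the primary is assumed to fail after this
print(backup.monitor(now=2.0))     # prints False: silence is still within the timeout
print(backup.monitor(now=4.0))     # prints True: the backup takes over
print(backup.serve(2))             # prints 10: continues from checkpointed state 8
```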

8
Fault Tolerance (contd.)
  • N-version programming
  • Multiple versions of the software are initiated
  • Results of all versions on their completion are
    collected
  • Decision algorithm executes
  • Result is accepted or rejected before being
    provided to the succeeding processes
  • Key requirement
  • Independent generation of versions
  • Issue of specifications from which versions are
    generated
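
One possible decision algorithm is a majority vote. The sketch below assumes that choice; the slides do not say which decision algorithm is used, and the three "versions" are hypothetical stand-ins for independently developed programs.

```python
from collections import Counter


def n_version_execute(versions, x):
    """Run every version on x, then apply a majority-vote decision algorithm."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass                      # a crashed version simply casts no vote
    if not results:
        raise RuntimeError("no version produced a result")
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(versions):    # require a strict majority of all versions
        raise RuntimeError("no majority: result rejected")
    return value


# Three hypothetical versions of "square a number"; version_c is faulty.
def version_a(x):
    return x * x


def version_b(x):
    return x ** 2


def version_c(x):
    return x + x


print(n_version_execute([version_a, version_b, version_c], 3))  # prints 9
```

Note that the faulty version agrees with the others for the input 2, which is one reason voting only helps when the versions fail independently.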

9
Fault Tolerance (contd.)
  • Recovery block scheme: this is similar in
    concept to standby sparing
  • Primary block is normally used
  • Alternates, with lower performance and less
    desirable attributes, are used on failure
    detection in primary
  • Each block (primary alternate or other
    alternates) has its acceptance test
  • Programming practices, such as hierarchical and
    structured approaches, make this method somewhat
    easier to use with limited impact on
    performance
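
A minimal sketch of a recovery block, assuming each block is supplied together with its own acceptance test and that rollback is done by working on a copy of the state. The names `recovery_block`, `buggy_primary`, and `simple_alternate` are illustrative only.

```python
import copy


def recovery_block(state, blocks):
    """Try each (block, acceptance_test) pair in order on a fresh copy of state."""
    for block, acceptance_test in blocks:
        checkpoint = copy.deepcopy(state)   # recovery point
        try:
            result = block(checkpoint)
            if acceptance_test(result):
                return result
        except Exception:
            pass                            # an exception counts as a failed test
        # The modified copy is discarded here, which acts as the rollback.
    raise RuntimeError("all blocks failed their acceptance tests")


def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))


def buggy_primary(data):
    data.reverse()          # hypothetical faulty "fast sort": it only reverses
    return data


def simple_alternate(data):
    return sorted(data)     # slower but dependable alternate


data = [3, 1, 2]
print(recovery_block(data, [(buggy_primary, is_sorted),
                            (simple_alternate, is_sorted)]))  # prints [1, 2, 3]
print(data)                                                   # prints [3, 1, 2]
```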

10
Summary
  • Reiterated SW fault-tolerance methods, most of
    which were discussed earlier in the course

11
ECE 753 FAULT-TOLERANT COMPUTING
  • Kewal K. Saluja
  • Department of Electrical and Computer Engineering
  • Reconfiguration

12
Overview
  • Introduction and basic concept
  • Fault model and fault coverage
  • Two example architectures
  • n-cubes
  • de Bruijn networks
  • Summary

13
Introduction and basic concept
  • References
  • Prad96 Chapter 3 Sections on 3.8 3.10
  • Basic concept
  • Must avoid using the faulty unit(s), whether it
    be a process, processor, program, data, link
    between a pair of units, etc.
  • Two types of reconfiguration
  • Fault tolerance via degraded performance
  • Fault tolerance provided by sufficient
    redundancy at design stage

14
Fault model and fault coverage
  • Candidate architectures
  • Bus-based systems
  • Crossbar-based systems
  • Mesh-connected systems
  • Hypercube networks
  • de Bruijn Networks
  • Tree networks
  • Hexagonal networks
  • Other regular architectures

15
Fault model and Fault coverage (contd.)
  • System Model
  • Units are represented as nodes
  • Interconnects are represented as links between
    nodes
  • Failure models
  • Nodes may fail or go down: the corresponding
    unit is unable to interact with other units
  • Interconnects may fail or go down: no units can
    communicate using the failed or down link

16
Fault model and Fault coverage (contd.)
  • Objective of fault tolerance
  • Any pair of units must be able to interact in the
    presence of
  • Node failures
  • Link failures
  • Performance metrics
  • How many faults (node or link failures) can be
    tolerated (fault coverage)
  • Impact on the route length: number of hops
    between pairs of nodes (same as the length of the
    shortest path between a pair of nodes)
  • Can pay attention to the worst case scenario or
    impact on the average length of the paths
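
Both metrics can be explored with a breadth-first search over the node/link model. The sketch below is an illustration under assumed names (`hops`, `ring`); the four-node ring is a made-up example, not one of the architectures from the slides.

```python
from collections import deque


def hops(adj, src, dst, down_nodes=(), down_links=()):
    """Shortest-path length in hops from src to dst avoiding failures, or None."""
    down_nodes = set(down_nodes)
    down_links = {frozenset(link) for link in down_links}
    if src in down_nodes or dst in down_nodes:
        return None
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v in down_nodes or frozenset((u, v)) in down_links or v in dist:
                continue
            dist[v] = dist[u] + 1
            queue.append(v)
    return None  # the failures have disconnected src from dst


# Hypothetical 4-node ring: 0 - 1 - 2 - 3 - 0
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(hops(ring, 0, 1))                       # prints 1
print(hops(ring, 0, 1, down_links=[(0, 1)]))  # prints 3: route length degrades
print(hops(ring, 0, 2, down_nodes=[1, 3]))    # prints None: not tolerated
```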

17
Two example architectures
  • Hypercube architecture
  • An n-cube
  • Contains 2^n nodes
  • Encode the 2^n nodes as binary n-tuples
  • Two nodes are connected using a bi-directional
    link if and only if the Hamming distance between
    them is exactly 1
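
With nodes stored as integers, the Hamming-distance-1 rule amounts to flipping one bit at a time. A small sketch (the function names are assumptions):

```python
def hypercube_neighbors(node, n):
    """Neighbors of `node` in an n-cube: flip each of the n bits in turn."""
    return [node ^ (1 << i) for i in range(n)]


def hamming_distance(a, b):
    return bin(a ^ b).count("1")


n = 3
for node in range(2 ** n):
    labels = [format(v, f"0{n}b") for v in hypercube_neighbors(node, n)]
    print(format(node, f"0{n}b"), "->", labels)
```

Every node has exactly n neighbors, so the n-cube has n * 2^(n-1) bi-directional links.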

18
Two example architectures (contd.)
  • Hypercube architecture (contd.)
  • A method of sending message between a pair of
    nodes
  • Find a route between two nodes
  • An algorithm for finding a route between nodes n1
    and n2
  • Use the binary encodings of n1 and n2; let them be
  • a1 a2 ... ak and b1 b2 ... bk
  • Determine the positions where these two strings
    differ and complement one bit at a time to find a
    route between the two nodes
  • Length of such a path can be no larger than k
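
The routing algorithm above can be written directly on bit strings. In this sketch the function name `hypercube_route` and the left-to-right order of complementing bits are assumptions; any order of the differing positions gives a valid route of the same length.

```python
def hypercube_route(src, dst):
    """Route between two nodes given as equal-length bit strings."""
    assert len(src) == len(dst)
    path = [src]
    current = list(src)
    for i, (a, b) in enumerate(zip(src, dst)):
        if a != b:                 # a position where the two strings differ
            current[i] = b         # complement that one bit
            path.append("".join(current))
    return path


print(hypercube_route("0011", "0101"))  # prints ['0011', '0111', '0101']
print(hypercube_route("000", "111"))    # prints ['000', '100', '110', '111']
```

The number of hops equals the Hamming distance between the two labels, which is why it can never exceed k.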

19
Two example architectures (contd.)
  • Hypercube architecture (contd.)
  • Finding a route in the presence of a faulty node
  • Consider an example: find a path between nodes
    0011 and 0101 in the presence of 0111 being
    faulty
  • A possible path is 0011 → 0001 → 0101
  • Result: between every pair of nodes there are k
    node-disjoint paths
  • The paths are
  • Complement one bit at a time starting from the
    leftmost bit and keeping it that way. Thus we
    will have k starts and these will lead to k
    disjoint paths with some careful construction of
    paths
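
A partial sketch of this construction: rotating the order in which the differing bits are complemented gives one path per differing bit, and these paths share no intermediate node. This covers only as many paths as the Hamming distance between the two nodes; the remaining paths (up to k) need detours through bits that do not differ, which this sketch omits. Function names are assumptions.

```python
def rotated_routes(src, dst):
    """One route per rotation of the differing bit positions (left to right)."""
    diff = [i for i, (a, b) in enumerate(zip(src, dst)) if a != b]
    routes = []
    for r in range(len(diff)):
        order = diff[r:] + diff[:r]        # start from a different bit each time
        current, path = list(src), [src]
        for i in order:
            current[i] = dst[i]
            path.append("".join(current))
        routes.append(path)
    return routes


def route_avoiding(src, dst, faulty):
    """First rotated route whose intermediate nodes are all fault-free, or None."""
    for path in rotated_routes(src, dst):
        if not set(path[1:-1]) & set(faulty):
            return path
    return None


print(rotated_routes("0011", "0101"))
# prints [['0011', '0111', '0101'], ['0011', '0001', '0101']]
print(route_avoiding("0011", "0101", faulty={"0111"}))
# prints ['0011', '0001', '0101']
```

The second call reproduces the path from the example above.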

20
Two example architectures (contd.)
  • Hypercube architecture (contd.)
  • In a hypercube of dimension k, up to k-1 node
    faults can be tolerated
  • Some faults cause a degradation as the path
    length starts to increase after certain faults
  • Number of link faults that can be tolerated is at
    least the number of tolerable node faults
  • Problems that have been addressed in the literature
  • Centralized observer (as discussed above)
  • Distributed algorithm in which every node knows
    the location of the faulty node
  • Distributed algorithms in which only the
    neighbors of faulty node know its status
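
The k-1 bound can be checked exhaustively for small cubes. The sketch below plays the role of a centralized observer that knows all faulty nodes; it is a brute-force check under assumed names, not one of the distributed algorithms mentioned above.

```python
from itertools import combinations


def connected_after_faults(k, faulty):
    """True if the healthy nodes of a k-cube still form one connected component."""
    healthy = set(range(2 ** k)) - set(faulty)
    if not healthy:
        return True
    start = next(iter(healthy))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for i in range(k):
            v = u ^ (1 << i)               # neighbor across dimension i
            if v in healthy and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == healthy


def tolerates(k, num_faults):
    """True if every set of num_faults faulty nodes leaves the k-cube connected."""
    return all(connected_after_faults(k, f)
               for f in combinations(range(2 ** k), num_faults))


print(tolerates(3, 2))  # prints True: any 2 node faults are tolerated
print(tolerates(3, 3))  # prints False: e.g. all 3 neighbors of a node fail
```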

21
Two example architectures (contd.)
  • de Bruijn networks
  • Contains 2^n nodes
  • Encode the 2^n nodes as binary n-tuples
  • Two nodes are connected using a bi-directional
    link if and only if the second node can be
    derived by a one-bit left or right shift of the
    first node, with either a 0 or a 1 shifted in
  • An example de Bruijn network for k = 3 is given next
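
A sketch of the shift rule, assuming the usual binary de Bruijn convention in which the bit shifted in may be either 0 or 1, and dropping self-loops (000 and 111 shift onto themselves). The function name is an assumption.

```python
def de_bruijn_neighbors(node, n):
    """Neighbors of `node` in the binary de Bruijn network on n-bit labels."""
    mask = (1 << n) - 1
    left = {((node << 1) & mask) | bit for bit in (0, 1)}       # shift left, bring in 0 or 1
    right = {(node >> 1) | (bit << (n - 1)) for bit in (0, 1)}  # shift right, bring in 0 or 1
    return sorted((left | right) - {node})                      # drop self-loops


n = 3
for node in range(2 ** n):
    labels = [format(v, f"0{n}b") for v in de_bruijn_neighbors(node, n)]
    print(format(node, f"0{n}b"), "->", labels)
```

For n = 3 this gives, for example, 010 -> 001, 100, 101 and 000 -> 001, 100. Unlike the hypercube, node degree stays at most 4 as the network grows.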

22
Two example architectures (contd.)
  • de Bruijn networks (contd.)

[Figure: example de Bruijn network; only the node labels 010 and 101 survive in the transcript]

23
Two example architectures (contd.)
  • de Bruijn networks (contd.)
  • There are at least two node-disjoint paths
    between any pair of nodes
  • Hence, in the presence of a single node failure
    nodes can continue to interact
  • Many such results are known for de Bruijn networks
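
The single-fault claim can be checked by brute force on the 8-node network. As in the standard binary de Bruijn graph, this sketch assumes either a 0 or a 1 may be shifted in; the function names are illustrative.

```python
def neighbors(node, n):
    """De Bruijn neighbors: one-bit left or right shift, bringing in 0 or 1."""
    mask = (1 << n) - 1
    out = set()
    for bit in (0, 1):
        out.add(((node << 1) & mask) | bit)
        out.add((node >> 1) | (bit << (n - 1)))
    out.discard(node)                      # ignore self-loops
    return out


def connected_without(n, faulty):
    """True if the healthy nodes still form one connected component."""
    healthy = set(range(2 ** n)) - set(faulty)
    start = min(healthy)
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in neighbors(u, n) & healthy:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == healthy


def survives_any_single_fault(n):
    return all(connected_without(n, {f}) for f in range(2 ** n))


print(survives_any_single_fault(3))             # prints True
print(connected_without(3, {0b001, 0b100}))     # prints False: 000 is cut off
```

The second call shows why the guarantee stops at one fault: node 000 has only two neighbors, so losing both isolates it.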

24
Summary
  • Described two network architectures in which
    message routes can be reconfigured to maintain
    network connectivity in the presence of faulty
    nodes and/or links
  • Some other re-configuration techniques for
    reconfiguring logic in the presence of faults
    will be discussed in projects