ECE 753: FAULT-TOLERANT COMPUTING

1
ECE 753 FAULT-TOLERANT COMPUTING

Kewal K. Saluja
Department of Electrical and Computer Engineering
Software Fault Tolerance

2
Overview

Introduction and Motivation
Basic Approaches
Fault Tolerance
Using backup
N-version programming
Recovery block scheme

3
Introduction: SW fault tolerance

References:
Prad96, Chapter 7, Sections 7.1-7.3
Shooman, Chapter 5, Section 5.9
Lyu, Chapter 14, all sections
Motivation:
Large and complex software systems
Faults are more likely
Fault avoidance is less successful
Testing and verification are difficult
Consequences of failure:
Millions of dollars
Inconvenience: non-availability of phones, networks, computers
Loss of life

4
Basic Approaches

Robustness: the extent to which operation continues despite invalid inputs
Methods to overcome invalid inputs:
Fault avoidance
Check validity of inputs
Re-read inputs
Use default values
Exit to a recoverable state instead of a core dump
Use robust data structures
Object-oriented programming (generates robust structures)
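
The input-handling methods above (validity check, re-read, default value) can be sketched in a few lines. This is only an illustration: the function name read_sensor, the range limits, the retry count and the default value are all assumptions made for the example, not part of the course material.

```python
def read_sensor(read_raw, lo=0.0, hi=100.0, retries=3, default=25.0):
    """Return a validated reading; re-read on bad input, then fall back to a default."""
    for _ in range(retries):
        try:
            value = float(read_raw())      # may fail on malformed input
        except (TypeError, ValueError):
            continue                       # invalid input: re-read
        if lo <= value <= hi:              # validity check on the range
            return value
    return default                         # recoverable state instead of a crash


# Hypothetical input source: two bad readings followed by a good one.
inputs = iter(["abc", "999", "42.5"])
print(read_sensor(lambda: next(inputs)))   # 42.5
```

The caller always gets a usable value, so a malformed input never propagates as an exception or a core dump.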

5
Basic Approaches (contd.)

Fault containment: a fault in one module should not affect other modules
Methods to use:
Validity checks, such as whether addresses are in range
Reasonableness checks, such as divide by zero, overflow, and watchdog timers (is the response timely or is it stale?)
Assertion checks
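
A minimal sketch of the three kinds of check listed above. All function names and the one-second staleness limit are hypothetical choices for illustration only.

```python
def checked_read(buffer, index):
    # Validity check: the address (index) must be in range.
    if not 0 <= index < len(buffer):
        raise IndexError(f"index {index} outside 0..{len(buffer) - 1}")
    return buffer[index]


def checked_divide(a, b):
    # Reasonableness check: refuse a divide by zero before it happens.
    if b == 0:
        raise ZeroDivisionError("divisor must be non-zero")
    return a / b


def is_fresh(sent_at, now, max_age=1.0):
    # Watchdog-style check: is the response timely, or is it stale?
    return 0 <= now - sent_at <= max_age


def utilization(busy_ticks, total_ticks):
    # Assertion checks on the inputs of a computation.
    assert total_ticks > 0, "no time has elapsed"
    assert 0 <= busy_ticks <= total_ticks, "busy time exceeds total time"
    return busy_ticks / total_ticks
```

Each check raises (or returns False) inside the module where the problem is detected, which is what keeps the fault from spreading to other modules.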

6
Basic Approaches (contd.)

Fault tolerance: continued correct operation in the presence of program faults
Different methods apply to different systems and software architectures; the following were already discussed in the context of checkpointing and error recovery:
Uniprocessor systems
Multiprocessor systems, loosely coupled and tightly coupled
Message-passing systems

7
Fault Tolerance

Using backups, or process-pair architecture
The application is replicated on two processors as primary and backup processes
Normally the primary provides service
The primary sends checkpoints (state, results, etc.) to the backup
The backup can take over when the primary fails
Checks such as "I am alive" messages are used to ensure that the primary is making progress
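
The process-pair idea can be simulated in a single program. This sketch assumes a toy application (a running total), a checkpoint after every request, and takeover after two missed "I am alive" checks; the class names and these parameters are invented for illustration, and a real system would run the two replicas on separate processors.

```python
class Replica:
    """Toy application: keeps a running total."""

    def __init__(self):
        self.total = 0

    def handle(self, amount):
        self.total += amount
        return self.total


class ProcessPair:
    def __init__(self, max_missed=2):
        self.primary = Replica()
        self.backup = Replica()
        self.active = self.primary
        self.primary_alive = True      # flipped by the test harness to inject a failure
        self.checkpoint = 0            # last state sent to the backup
        self.missed = 0                # consecutive missed "I am alive" checks
        self.max_missed = max_missed

    def request(self, amount):
        if self.active is self.primary and not self.primary_alive:
            raise RuntimeError("primary is down and the backup has not taken over yet")
        result = self.active.handle(amount)
        self.checkpoint = result       # checkpoint the state after each request
        return result

    def tick(self):
        """Periodic 'I am alive' check performed on behalf of the backup."""
        if self.active is not self.primary:
            return
        if self.primary_alive:
            self.missed = 0
            return
        self.missed += 1
        if self.missed >= self.max_missed:
            self.backup.total = self.checkpoint   # restore from the last checkpoint
            self.active = self.backup             # backup takes over


pair = ProcessPair()
pair.request(5)
pair.request(7)
pair.primary_alive = False   # inject a primary failure
pair.tick()
pair.tick()                  # second missed check triggers the takeover
print(pair.request(1))       # 13
```

Because the backup resumes from the last checkpoint, the total continues from 12 rather than restarting at 0.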

8
Fault Tolerance (contd.)

N-version programming
Multiple versions of the software are started
The results of all versions are collected on their completion
A decision algorithm executes
The result is accepted or rejected before it is passed to the succeeding processes
Key requirement:
Independent generation of versions
Issue: the specifications from which the versions are generated
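
A minimal sketch of the scheme above, assuming majority voting as the decision algorithm (the slides do not say which decision algorithm is used). The three "versions" are stand-ins written for this example; the third contains a deliberate bug for negative inputs.

```python
from collections import Counter


def n_version(versions, x):
    """Run every version on x and accept the result only if a strict majority agrees."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass                                  # a crashed version casts no vote
    if not results:
        raise RuntimeError("every version failed")
    value, votes = Counter(results).most_common(1)[0]
    if 2 * votes <= len(versions):                # decision algorithm: strict majority
        raise RuntimeError("no majority, result rejected")
    return value


# Three hypothetical versions of "square a number".
versions = [
    lambda x: x * x,
    lambda x: x ** 2,
    lambda x: x * abs(x),    # faulty for negative inputs
]
print(n_version(versions, -3))   # 9
```

The faulty version is outvoted two to one. The vote only helps if the versions fail independently, which is why independent generation is the key requirement.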

9
Fault Tolerance (contd.)

Recovery block scheme: similar in concept to standby sparing
The primary block is normally used
Alternates, with lower performance and less desirable attributes, are used when a failure is detected in the primary
Each block (primary or any alternate) has its own acceptance test
Programming practices, such as hierarchical and structured approaches, make this method somewhat easier to use with limited impact on performance
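
The recovery block structure can be sketched as a loop over (routine, acceptance test) pairs. The sorting example, the deliberately buggy primary, and all names below are assumptions made for illustration.

```python
import copy


def recovery_block(state, blocks):
    """blocks: list of (routine, acceptance_test) pairs, primary first."""
    for routine, acceptable in blocks:
        trial = copy.deepcopy(state)      # work on a copy, so the input acts as a checkpoint
        try:
            result = routine(trial)
        except Exception:
            continue                      # a crash counts as a detected failure
        if acceptable(state, result):
            return result
    raise RuntimeError("primary and all alternates failed")


def fast_sort(xs):
    return sorted(set(xs))                # deliberately faulty primary: drops duplicates


def insertion_sort(xs):
    out = []                              # slower alternate
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out


def sorted_and_complete(original, out):
    ordered = all(out[i] <= out[i + 1] for i in range(len(out) - 1))
    return ordered and len(out) == len(original)


blocks = [(fast_sort, sorted_and_complete), (insertion_sort, sorted_and_complete)]
print(recovery_block([3, 1, 3, 2], blocks))   # [1, 2, 3, 3]
```

The primary's result fails the acceptance test because one element is missing, so the alternate runs on the untouched original input.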

10
Summary

Reiterated the SW fault-tolerance methods, most of which were discussed earlier in the course

11
ECE 753 FAULT-TOLERANT COMPUTING

Kewal K. Saluja
Department of Electrical and Computer Engineering
Reconfiguration

12
Overview

Introduction and basic concept
Fault model and fault coverage
Two example architectures
n-cubes
de Bruijn networks
Summary

13
Introduction and basic concept

References:
Prad96, Chapter 3, Sections 3.8-3.10
Basic concept:
Must avoid using the faulty unit(s), whether it be a process, processor, program, data, link between a pair of units, etc.
Two types of reconfiguration:
Fault tolerance via degraded performance
Fault tolerance provided by sufficient redundancy at the design stage

14
Fault model and fault coverage

Candidate architectures
Bus-based systems
Crossbar-based systems
Mesh-connected systems
Hypercube networks
de Bruijn Networks
Tree networks
Hexagonal networks
Other regular architectures

15
Fault model and Fault coverage (contd.)

System model:
Units are represented as nodes
Interconnects are represented as links between nodes
Failure models:
Nodes may fail or go down: the corresponding unit is unable to interact with other units
Interconnects may fail or go down: no units can communicate using the failed or down link

16
Fault model and Fault coverage (contd.)

Objective of fault tolerance:
Any pair of units must be able to interact in the presence of:
Node failures
Link failures
Performance metrics:
How many faults (node or link failures) can be tolerated (fault coverage)
Impact on the route length, i.e. the number of hops between pairs of nodes (the length of the shortest path between a pair of nodes)
Can consider the worst-case scenario or the impact on the average path length
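
The node/link model and the hop-count metric can be illustrated with a breadth-first search that skips failed nodes and links. The four-node ring used here is an arbitrary example network, and the function name hops is hypothetical.

```python
from collections import deque


def hops(links, src, dst, failed_nodes=(), failed_links=()):
    """Shortest hop count from src to dst avoiding failures, or None if unreachable."""
    failed_nodes = set(failed_nodes)
    failed_links = {frozenset(link) for link in failed_links}
    if src in failed_nodes or dst in failed_nodes:
        return None
    adjacent = {}
    for a, b in links:                         # links are bidirectional
        if a in failed_nodes or b in failed_nodes:
            continue
        if frozenset((a, b)) in failed_links:
            continue
        adjacent.setdefault(a, set()).add(b)
        adjacent.setdefault(b, set()).add(a)
    dist = {src: 0}
    queue = deque([src])
    while queue:                               # breadth-first search
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adjacent.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None


ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(hops(ring, 0, 1))                          # 1
print(hops(ring, 0, 1, failed_links=[(0, 1)]))   # 3
print(hops(ring, 0, 2, failed_nodes=[1, 3]))     # None
```

In the ring, one link fault is tolerated at the cost of a longer route (1 hop becomes 3), while two node faults can disconnect a pair of nodes.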

17
Two example architectures

Hypercube architecture
An n-cube:
Contains 2^n nodes
Encode the 2^n nodes as binary n-tuples
Two nodes are connected by a bidirectional link if and only if the Hamming distance between them is exactly 1
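
As a small sketch of this definition, the n-cube can be built by storing each node as an integer whose binary form is its n-tuple; flipping one bit gives a neighbor at Hamming distance 1. The function names are illustrative.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def hypercube(n):
    """Adjacency lists of the n-cube: node v is linked to v with one bit flipped."""
    return {v: [v ^ (1 << i) for i in range(n)] for v in range(2 ** n)}


cube = hypercube(3)
print(len(cube))          # 8
print(sorted(cube[0]))    # [1, 2, 4]
```

Every node has exactly n neighbors, one per bit position.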

18
Two example architectures (contd.)

Hypercube architecture (contd.)
A method of sending a message between a pair of nodes:
Find a route between the two nodes
An algorithm for finding a route between nodes n1 and n2:
Use the binary encodings of n1 and n2; let them be a1 a2 ... ak and b1 b2 ... bk
Determine the positions in which these two strings differ, and complement one bit at a time to find a route between the two nodes
The length of such a path can be no larger than k
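
The routing algorithm above can be sketched as follows. Scanning the bits from left to right is one possible order, chosen here as an assumption; any order of the differing bits gives a valid route of the same length.

```python
def route(n1, n2, k):
    """Route from n1 to n2 in a k-cube by complementing one differing bit at a time."""
    path = [n1]
    current = n1
    for bit in reversed(range(k)):        # scan from the leftmost bit
        mask = 1 << bit
        if (current ^ n2) & mask:         # this position still differs
            current ^= mask               # complement it: one hop
            path.append(current)
    return path


print([format(v, "04b") for v in route(0b0011, 0b0101, 4)])
# ['0011', '0111', '0101']
```

The number of hops equals the Hamming distance between the two nodes, which is at most k.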

19
Two example architectures (contd.)

Hypercube architecture (contd.)
Finding a route in the presence of a faulty node
Consider an example: find a path between nodes 0011 and 0101 when node 0111 is faulty
A possible path is 0011 -> 0001 -> 0101
Result: between every pair of nodes there are k node-disjoint paths
The paths are constructed as follows:
Complement one bit at a time, starting from the leftmost bit and keeping it that way. Thus we will have k starts, and with some careful construction of paths these will lead to k disjoint paths
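
One standard way to carry out the "careful construction" is sketched below; it is an assumption that this matches the construction intended in the lecture. If the two nodes differ in d bits, d paths correct the differing bits in rotated orders, and the other k - d paths first flip a bit that does not differ, correct all differing bits, then flip that bit back. The route then simply uses the first path that contains no faulty node.

```python
def disjoint_paths(src, dst, k):
    """k node-disjoint paths between two distinct nodes of a k-cube."""
    assert src != dst
    diff = [b for b in reversed(range(k)) if ((src ^ dst) >> b) & 1]
    same = [b for b in reversed(range(k)) if not ((src ^ dst) >> b) & 1]

    def walk(bits):
        path = [src]
        for b in bits:
            path.append(path[-1] ^ (1 << b))   # complement one bit per hop
        return path

    paths = [walk(diff[i:] + diff[:i]) for i in range(len(diff))]   # rotated orders
    paths += [walk([b] + diff + [b]) for b in same]                 # detour via bit b
    return paths


def route_avoiding(src, dst, k, faulty):
    """First of the k disjoint paths that avoids every node in the set faulty."""
    for path in disjoint_paths(src, dst, k):
        if not faulty.intersection(path):
            return path
    return None


path = route_avoiding(0b0011, 0b0101, 4, {0b0111})
print([format(v, "04b") for v in path])   # ['0011', '0001', '0101']
```

Since the k paths share no intermediate node, each faulty node can block at most one of them, so a route survives as long as there are fewer than k faulty nodes and the endpoints themselves are healthy.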

20
Two example architectures (contd.)

Hypercube architecture (contd.)
In a hypercube of dimension k, up to k-1 node faults can be tolerated
Some faults cause a degradation, as the path length starts to increase after certain faults
The number of link faults that can be tolerated is at least the number of tolerable node faults
Problems that have been addressed in the literature:
Centralized observer (as discussed above)
Distributed algorithms in which every node knows the location of the faulty node
Distributed algorithms in which only the neighbors of the faulty node know its status
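
The k-1 bound can be checked by brute force on small cubes. The sketch below removes every possible set of f nodes and tests whether the remaining nodes are still connected; the function names are illustrative, and exhaustive enumeration is only practical for small k.

```python
from itertools import combinations


def connected_without(k, faulty):
    """True if the healthy nodes of the k-cube still form one connected component."""
    healthy = set(range(2 ** k)) - set(faulty)
    if not healthy:
        return True
    start = next(iter(healthy))
    seen = {start}
    stack = [start]
    while stack:                              # depth-first search over healthy nodes
        u = stack.pop()
        for i in range(k):
            v = u ^ (1 << i)
            if v in healthy and v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == healthy


def tolerates(k, f):
    """True if every set of f node faults leaves the k-cube connected."""
    return all(connected_without(k, c) for c in combinations(range(2 ** k), f))


print(tolerates(3, 2))   # True
print(tolerates(3, 3))   # False
```

Three faults can isolate a node of the 3-cube, for example by failing all three neighbors of node 000, which is why the bound stops at k-1.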

21
Two example architectures (contd.)

de Bruijn networks
Contains 2^n nodes
Encode the 2^n nodes as binary n-tuples
Two nodes are connected by a bidirectional link if and only if the second node can be derived by a left or right shift of the first node (with either a 0 or a 1 shifted in)
An example de Bruijn network for n = 3 is given next
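
The link rule can be sketched in code. This assumes the usual binary de Bruijn convention in which the bit shifted in may be either 0 or 1, and it ignores the self-loops at the all-zeros and all-ones nodes; the function name is illustrative.

```python
def de_bruijn_neighbors(x, n):
    """Neighbors of node x in the undirected binary de Bruijn network on n-bit labels."""
    mask = (1 << n) - 1
    out = set()
    for b in (0, 1):
        out.add(((x << 1) & mask) | b)        # left shift, b enters on the right
        out.add((x >> 1) | (b << (n - 1)))    # right shift, b enters on the left
    out.discard(x)                            # drop self-loops (000 and 111)
    return out


for node in range(8):
    nbrs = sorted(de_bruijn_neighbors(node, 3))
    print(format(node, "03b"), [format(v, "03b") for v in nbrs])
```

For n = 3 every node has between two and four neighbors; nodes 000 and 111 have the fewest because one of their shifts maps back to themselves.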

22
Two example architectures (contd.)

de Bruijn networks (contd.)

[Figure: example de Bruijn network; only the node labels 010 and 101 survive from the diagram]

23
Two example architectures (contd.)

de Bruijn networks (contd.)
There are at least two node-disjoint paths between any pair of nodes
Hence, in the presence of a single node failure the remaining nodes can continue to interact
Many such results are known for de Bruijn networks
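
The single-failure claim can be checked exhaustively for a small network. The sketch assumes the standard binary de Bruijn links (shift left or right with a 0 or 1 shifted in, self-loops ignored); the helper names are illustrative.

```python
from itertools import combinations


def neighbors(x, n):
    mask = (1 << n) - 1
    out = set()
    for b in (0, 1):
        out.add(((x << 1) & mask) | b)        # left shift with b shifted in
        out.add((x >> 1) | (b << (n - 1)))    # right shift with b shifted in
    out.discard(x)
    return out


def survives(n, f):
    """True if every set of f node failures leaves the remaining nodes connected."""
    nodes = set(range(2 ** n))
    for dead in combinations(nodes, f):
        healthy = nodes - set(dead)
        start = min(healthy)
        seen = {start}
        stack = [start]
        while stack:                          # depth-first search over healthy nodes
            u = stack.pop()
            for v in neighbors(u, n):
                if v in healthy and v not in seen:
                    seen.add(v)
                    stack.append(v)
        if seen != healthy:
            return False
    return True


print(survives(3, 1))   # True
print(survives(3, 2))   # False
```

For the 8-node network any single node failure is tolerated, but two failures are not in general: failing 001 and 100 cuts off node 000.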

24
Summary

Described two network architectures in which message routes can be reconfigured to maintain network connectivity in the presence of faulty nodes and/or links
Some other reconfiguration techniques, for reconfiguring logic in the presence of faults, will be discussed in projects
