Title: ECE 753: FAULT-TOLERANT COMPUTING
1ECE 753 FAULT-TOLERANT COMPUTING
- Kewal K. Saluja
- Department of Electrical and Computer Engineering
- Software Fault Tolerance
2Overview
- Introduction and Motivation
- Basic Approaches
- Fault Tolerance
- Using backup
- N-version programming
- Recovery block scheme
3Introduction SW fault tolerance
- References
- Prad96 Chapter 7 Sections 7.1-7.3
- Shooman Chapter 5 Section 5.9
- Lyu Chapter 14 all sections
- Motivation
- Large and complex software systems
- Faults more likely
- Fault avoidance less successful
- Difficulty in testing and verification
- Consequences of failure
- Millions of dollars
- Inconvenience: non-availability of phones, network, computer
- Loss of life
4Basic Approaches
- Robustness: the extent to which operation continues despite invalid inputs
- Methods to overcome invalid inputs
- Fault avoidance
- Check validity of inputs
- Re-read inputs
- Use default values
- Exits to recoverable states instead of core dump
- Use of robust data structures
- Object-oriented programming (to generate robust structures)
5Basic Approaches (contd.)
- Fault containment: a fault in one module should not affect other modules
- Methods to use
- Validity checks, such as verifying that addresses are in range
- Reasonableness checks, such as divide by zero, overflow, and a watchdog timer (is the response timely, or is it stale?)
- Assertion checks
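The checks listed above can be sketched in a few lines. This is a minimal illustration only, assuming a module that reads from a small memory array and divides sensor values. All names (`checked_read`, `checked_divide`, `is_fresh`, `average_percent`) and the limits are hypothetical and not part of the course material.

```python
def checked_read(memory, address):
    """Validity check: refuse addresses outside the memory range."""
    if not isinstance(address, int) or not 0 <= address < len(memory):
        raise IndexError(f"address {address!r} out of range")
    return memory[address]


def checked_divide(numerator, denominator, limit=1e6):
    """Reasonableness checks: divide by zero and overflow past an assumed limit."""
    if denominator == 0:
        raise ValueError("divide by zero")
    result = numerator / denominator
    if abs(result) > limit:
        raise OverflowError(f"result {result} exceeds limit {limit}")
    return result


def is_fresh(response_time, now, max_age):
    """Watchdog-style check: is the response timely, or is it stale?"""
    return now - response_time <= max_age


def average_percent(scores):
    """Assertion check: the result must stay within the range of its inputs."""
    average = sum(scores) / len(scores)
    assert 0 <= average <= 100, "average outside 0..100"
    return average


memory = [10, 20, 30, 40]
print(checked_read(memory, 2))
print(checked_divide(9, 3))
print(is_fresh(response_time=5, now=7, max_age=3))
```

Each check raises or returns a flag inside the module that detects the problem, so the error does not propagate to callers as a silently wrong value.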
6Basic Approaches (contd.)
- Fault tolerance: continued correct operation in the presence of program faults
- Different methods are used in different systems and software architectures; the following were already discussed in the context of checkpointing and error recovery
- Uniprocessor systems
- Multiprocessor systems, loosely coupled and tightly coupled
- Message-passing systems
7Fault Tolerance
- Using backups, or the process-pair architecture
- The application is replicated on two processors as primary and backup processes
- Normally the primary provides service
- The primary provides checkpoints (state, results, etc.) to the backup
- The backup can take over when the primary fails
- Some checks, such as "I am alive" messages, are used to ensure that the primary is making progress
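The process-pair idea can be sketched as a small single-process simulation. This is an illustration under stated assumptions, not a real two-processor implementation: time is a logical tick counter, the "service" just increments a counter, and the class and function names (`Primary`, `Backup`, `run`) are hypothetical.

```python
class Backup:
    def __init__(self, timeout):
        self.timeout = timeout           # ticks without "I am alive" before takeover
        self.checkpoint = {"counter": 0}
        self.last_heartbeat = 0
        self.active = False

    def receive_heartbeat(self, now):
        self.last_heartbeat = now

    def receive_checkpoint(self, state):
        self.checkpoint = dict(state)    # keep a private copy of the primary's state

    def monitor(self, now):
        # Take over once the primary has been silent for longer than the timeout.
        if not self.active and now - self.last_heartbeat > self.timeout:
            self.active = True
        return self.active

    def serve(self):
        # Resume service from the last checkpoint received.
        self.checkpoint["counter"] += 1
        return self.checkpoint["counter"]


class Primary:
    def __init__(self, backup):
        self.backup = backup
        self.state = {"counter": 0}
        self.alive = True

    def serve(self, now):
        if not self.alive:
            return None                  # a failed primary sends nothing
        self.state["counter"] += 1
        self.backup.receive_heartbeat(now)
        self.backup.receive_checkpoint(self.state)
        return self.state["counter"]


def run(ticks, crash_at, timeout=1):
    backup = Backup(timeout)
    primary = Primary(backup)
    results = []
    for now in range(1, ticks + 1):
        if now == crash_at:
            primary.alive = False
        result = primary.serve(now)
        if result is None and backup.monitor(now):
            result = backup.serve()
        results.append(result)
    return results


print(run(ticks=6, crash_at=3))
```

In this run the service is unavailable for one tick while the backup waits out the timeout, then continues from the checkpointed counter value.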
8Fault Tolerance (contd.)
- N-version programming
- Multiple versions of the software are started
- The results of all versions are collected on their completion
- A decision algorithm executes
- The result is accepted or rejected before it is provided to the succeeding processes
- Key requirement
- Independent generation of versions
- Issue: the specifications from which the versions are generated
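The steps above can be sketched with majority voting as the decision algorithm. Majority voting is an assumption here, since the slides do not name a specific decision algorithm. The three "versions" are hypothetical stand-ins for independently developed implementations of a squaring function, one of which is deliberately faulty.

```python
from collections import Counter


def n_version_execute(versions, x):
    """Run every version, collect the results, and accept a strict-majority value."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            continue                      # a crashed version contributes no vote
    if not results:
        return None
    value, votes = Counter(results).most_common(1)[0]
    # Accept only if more than half of ALL versions agree; otherwise reject.
    return value if votes > len(versions) // 2 else None


def square_v1(x):
    return x * x


def square_v2(x):
    return x ** 2


def square_v3(x):
    return x + x                          # hypothetical faulty version


print(n_version_execute([square_v1, square_v2, square_v3], 3))
```

Returning `None` on disagreement models the "rejected" outcome; a real system would then need some fallback action, which is outside this sketch.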
9Fault Tolerance (contd.)
- Recovery block scheme: similar in concept to standby sparing
- The primary block is normally used
- Alternates, with lower performance and less desirable attributes, are used when a failure is detected in the primary
- Each block (the primary or any alternate) has its acceptance test
- Programming practices, such as hierarchical and structured approaches, make this method somewhat easier to use with limited impact on performance
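A recovery block can be sketched as a loop over (block, acceptance test) pairs that restarts each attempt from a saved copy of the input. This is a minimal sketch under assumptions: the sorting example, the deliberately faulty primary, and all function names are hypothetical.

```python
import copy


def recovery_block(state, blocks):
    """Try the primary, then each alternate, until one passes its acceptance test."""
    for block, acceptance_test in blocks:
        candidate = copy.deepcopy(state)   # every attempt starts from the saved state
        try:
            result = block(candidate)
        except Exception:
            continue                       # a crash counts as a failed attempt
        if acceptance_test(state, result):
            return result
    raise RuntimeError("all blocks failed their acceptance tests")


def primary_sort(items):
    # Hypothetical faulty primary: fast, but silently drops duplicates.
    return sorted(set(items))


def alternate_sort(items):
    # Slower alternate: plain insertion sort.
    out = []
    for item in items:
        pos = 0
        while pos < len(out) and out[pos] <= item:
            pos += 1
        out.insert(pos, item)
    return out


def ordered_and_same_length(original, result):
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and len(result) == len(original)


blocks = [
    (primary_sort, ordered_and_same_length),
    (alternate_sort, ordered_and_same_length),
]
print(recovery_block([3, 1, 3, 2], blocks))
```

For the input with a duplicate, the primary's output fails the length check, so the alternate's result is returned instead.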
10Summary
- Reiterated the SW fault-tolerance methods discussed more generally earlier in the course
11ECE 753 FAULT-TOLERANT COMPUTING
- Kewal K. Saluja
- Department of Electrical and Computer Engineering
- Reconfiguration
12Overview
- Introduction and basic concept
- Fault model and fault coverage
- Two example architectures
- n-cubes
- de Bruijn networks
- Summary
13Introduction and basic concept
- References
- Prad96 Chapter 3 Sections on 3.8 3.10
- Basic concept
- Must avoid using the faulty unit(s), whether it be a process, processor, program, data, link between a pair of units, etc.
- Two types of re-configuration
- Fault tolerance via degraded performance
- Fault tolerance provided by sufficient redundancy at the design stage
14Fault model and fault coverage
- Candidate architectures
- Bus-based systems
- Crossbar-based systems
- Mesh-connected systems
- Hypercube networks
- de Bruijn Networks
- Tree networks
- Hexagonal networks
- Other regular architectures
15Fault model and Fault coverage (contd.)
- System Model
- Units are represented as nodes
- Interconnects are represented as links between nodes
- Failure models
- Nodes may fail or go down: the corresponding unit is unable to interact with other units
- Interconnects may fail or go down: no units can communicate using the failed or down link
16Fault model and Fault coverage (contd.)
- Objective of fault tolerance
- Any pair of units must be able to interact in the presence of
- Node failures
- Link failures
- Performance metrics
- How many faults (node or link failures) can be tolerated (fault coverage)
- Impact on the route length: the number of hops between pairs of nodes (the same as the length of the shortest path between a pair of nodes)
- Can pay attention to the worst-case scenario or to the impact on the average length of the paths
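The route-length metric can be illustrated with a breadth-first search that ignores failed nodes and links. This is a sketch only: the 4-node ring used as the example network and the function name `hops` are assumptions chosen for brevity, not one of the architectures discussed in the lecture.

```python
from collections import deque


def hops(links, src, dst, failed_nodes=(), failed_links=()):
    """Shortest hop count from src to dst, or None if they cannot interact."""
    failed_nodes = set(failed_nodes)
    failed_links = {frozenset(link) for link in failed_links}
    if src in failed_nodes or dst in failed_nodes:
        return None
    # Build the adjacency of the surviving network only.
    adjacency = {}
    for a, b in links:
        if a in failed_nodes or b in failed_nodes:
            continue
        if frozenset((a, b)) in failed_links:
            continue
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    distance = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return distance[node]
        for nxt in adjacency.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return None


ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(hops(ring, 0, 1))                                        # fault-free
print(hops(ring, 0, 1, failed_links=[(0, 1)]))                 # degraded route
print(hops(ring, 0, 1, failed_nodes=[2], failed_links=[(0, 1)]))  # disconnected
```

The three calls show the three cases of interest: the fault-free route, a longer route after one link fault, and loss of connectivity once the faults exceed what the ring can tolerate.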
17Two example architectures
- Hypercube architecture
- An n-cube
- Contains 2^n nodes
- Encode the 2^n nodes as binary n-tuples
- Two nodes are connected using a bi-directional link if and only if the Hamming distance between them is exactly 1
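The connection rule can be written directly on integer node labels: flipping any one of the n bits gives a neighbor at Hamming distance 1. A minimal sketch, with hypothetical function names:

```python
def hamming_distance(a, b):
    """Number of bit positions in which the labels a and b differ."""
    return bin(a ^ b).count("1")


def hypercube_neighbors(node, n):
    """All nodes of the n-cube linked to `node`: flip exactly one of the n bits."""
    return [node ^ (1 << i) for i in range(n)]


n = 4
node = 0b0011
for neighbor in hypercube_neighbors(node, n):
    print(format(node, "04b"), "<->", format(neighbor, "04b"),
          "distance", hamming_distance(node, neighbor))
```

Every node therefore has exactly n neighbors, which is what later allows several alternative routes between any pair of nodes.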
18Two example architectures (contd.)
- Hypercube architecture (contd.)
- A method of sending a message between a pair of nodes
- Find a route between the two nodes
- An algorithm for finding a route between nodes n1 and n2
- Use the binary encodings of n1 and n2; let them be
- a1 a2 ... ak and b1 b2 ... bk
- Determine the locations where these two strings differ and complement one bit at a time to find a route between the two nodes
- The length of such a path can be no larger than k
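The routing algorithm above can be sketched as follows. The choice to correct bits from the leftmost position to the rightmost is an assumption; any order of the differing positions gives a valid route of the same length. The function name `route` is hypothetical.

```python
def route(src, dst, k):
    """Route in a k-cube: complement the differing bits one at a time, leftmost first."""
    path = [src]
    current = src
    for i in reversed(range(k)):
        mask = 1 << i
        if (current ^ dst) & mask:        # bit i still differs from the destination
            current ^= mask
            path.append(current)
    return path


path = route(0b0011, 0b0101, 4)
print(" -> ".join(format(node, "04b") for node in path))
```

The number of hops equals the Hamming distance between the two labels, so it can never exceed k.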
19Two example architectures (contd.)
- Hypercube architecture (contd.)
- Finding a route in the presence of a faulty node
- Consider an example: find a path between nodes 0011 and 0101 when 0111 is faulty
- A possible path is 0011 -> 0001 -> 0101
- Result: between every pair of nodes there are k node-disjoint paths
- The paths are
- Complement one bit at a time, starting from the leftmost bit and keeping it that way. Thus we will have k starts, and these will lead to k disjoint paths with some careful construction of the paths
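One way to make the "careful construction" concrete is the following well-known scheme, offered here as an assumption rather than as the construction intended in the slides. If the two labels differ in m positions, m paths correct those positions in each cyclic rotation of their order; the other k - m paths first flip one agreeing bit, correct all differing bits, and then flip that bit back. The function names are hypothetical.

```python
def walk(node, dims):
    """Follow a sequence of dimensions, flipping one bit per hop."""
    path = [node]
    for d in dims:
        node ^= 1 << d
        path.append(node)
    return path


def disjoint_paths(src, dst, k):
    """k node-disjoint paths between distinct nodes src and dst of a k-cube."""
    assert src != dst
    differing = [i for i in reversed(range(k)) if ((src ^ dst) >> i) & 1]
    agreeing = [i for i in reversed(range(k)) if not ((src ^ dst) >> i) & 1]
    paths = []
    for start in range(len(differing)):
        # Each rotation starts with a different first hop.
        paths.append(walk(src, differing[start:] + differing[:start]))
    for j in agreeing:
        # Detour through dimension j, which the rotation paths never touch.
        paths.append(walk(src, [j] + differing + [j]))
    return paths


def route_avoiding(src, dst, k, faulty):
    """First of the disjoint paths that contains no faulty node, else None."""
    for path in disjoint_paths(src, dst, k):
        if not any(node in faulty for node in path):
            return path
    return None


for path in disjoint_paths(0b0011, 0b0101, 4):
    print(" -> ".join(format(node, "04b") for node in path))
good = route_avoiding(0b0011, 0b0101, 4, faulty={0b0111})
print("avoiding 0111:", " -> ".join(format(node, "04b") for node in good))
```

Because the k paths share no intermediate node, any set of at most k-1 faulty intermediate nodes leaves at least one path usable. The detour paths are two hops longer than the shortest route.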
20Two example architectures (contd.)
- Hypercube architecture (contd.)
- In a hypercube of dimension k, up to k-1 node faults can be tolerated
- Some faults cause a degradation, as the path length starts to increase after certain faults
- The number of link faults that can be tolerated is at least the number of tolerable node faults
- Problems that have been addressed in the literature
- Centralized observer (as discussed above)
- Distributed algorithms in which every node knows the location of the faulty node
- Distributed algorithms in which only the neighbors of the faulty node know its status
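The k-1 bound on node faults can be checked by brute force for small cubes. This sketch corresponds to the centralized-observer view, where the full set of faulty nodes is known. The function names are hypothetical, and the exhaustive search is only practical for small k.

```python
from collections import deque
from itertools import combinations


def connected_after_faults(k, faulty):
    """True if all fault-free nodes of the k-cube can still reach each other."""
    alive = [v for v in range(2 ** k) if v not in faulty]
    if not alive:
        return True
    seen = {alive[0]}
    queue = deque([alive[0]])
    while queue:
        node = queue.popleft()
        for i in range(k):
            nxt = node ^ (1 << i)          # neighbor across dimension i
            if nxt not in faulty and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(alive)


def tolerates(k, fault_count):
    """True if EVERY placement of fault_count node faults leaves the cube connected."""
    return all(
        connected_after_faults(k, set(faults))
        for faults in combinations(range(2 ** k), fault_count)
    )


print(tolerates(3, 2))   # k-1 faults in a 3-cube
print(tolerates(3, 3))   # k faults: all neighbors of one node can fail
```

With k faults the worst case is that all k neighbors of some node fail, which isolates that node; this is why the guarantee stops at k-1.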
21Two example architectures (contd.)
- de Bruijn networks
- Contains 2^n nodes
- Encode the 2^n nodes as binary n-tuples
- Two nodes are connected using a bi-directional link if and only if the second node can be derived by a logical left or right shift of the first node
- An example de Bruijn network for k = 3 (3-bit node labels) is given next
22Two example architectures (contd.)
- de Bruijn networks (contd.)
[Figure: de Bruijn network for k = 3; surviving node labels: 010, 101]
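The connection rule for the de Bruijn network can be sketched on integer labels. One assumption is made explicit here: following the usual binary de Bruijn graph definition, either a 0 or a 1 may be shifted into the vacated bit position, and self-loops (000 and 111 shift to themselves) are ignored. The function name is hypothetical.

```python
def de_bruijn_neighbors(node, n):
    """Neighbors of `node` in the binary de Bruijn network with 2**n nodes."""
    mask = (1 << n) - 1
    top = 1 << (n - 1)
    shifted = {
        (node << 1) & mask,          # left shift, 0 shifted in
        ((node << 1) & mask) | 1,    # left shift, 1 shifted in
        node >> 1,                   # right shift, 0 shifted in
        (node >> 1) | top,           # right shift, 1 shifted in
    }
    shifted.discard(node)            # drop self-loops
    return sorted(shifted)


n = 3
for node in range(2 ** n):
    neighbors = ", ".join(format(v, "03b") for v in de_bruijn_neighbors(node, n))
    print(format(node, "03b"), "->", neighbors)
```

Unlike the hypercube, node degree does not grow with n: each node has at most four neighbors, and the all-zeros and all-ones nodes have only two.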
23Two example architectures (contd.)
- de Bruijn networks (contd.)
- There are at least two node-disjoint paths between any pair of nodes
- Hence, in the presence of a single node failure, nodes can continue to interact
- Many such results are known for de Bruijn networks
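The single-failure claim can be checked exhaustively for a small network. This sketch assumes the usual binary de Bruijn definition, in which either a 0 or a 1 is shifted in and self-loops are ignored. The function names are hypothetical.

```python
from collections import deque


def de_bruijn_graph(n):
    """Adjacency sets of the undirected binary de Bruijn network on 2**n nodes."""
    mask = (1 << n) - 1
    graph = {node: set() for node in range(2 ** n)}
    for node in graph:
        for nxt in ((node << 1) & mask, ((node << 1) & mask) | 1):
            if nxt != node:              # skip self-loops at 00..0 and 11..1
                graph[node].add(nxt)
                graph[nxt].add(node)     # links are bi-directional
    return graph


def connected_without(graph, failed):
    """True if all surviving nodes can still reach each other."""
    alive = [v for v in graph if v not in failed]
    if not alive:
        return True
    seen = {alive[0]}
    queue = deque([alive[0]])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in failed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(alive)


def survives_any_single_failure(n):
    graph = de_bruijn_graph(n)
    return all(connected_without(graph, {v}) for v in graph)


print(survives_any_single_failure(3))
print(connected_without(de_bruijn_graph(3), {0b001, 0b100}))
```

The second call shows the limit of the guarantee: node 000 has only the two neighbors 001 and 100, so losing both isolates it.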
24Summary
- Described two network architectures in which message routes can be re-configured to maintain network connectivity in the presence of faulty nodes and/or links
- Some other re-configuration techniques, for reconfiguring logic in the presence of faults, will be discussed in projects