Title: CprE 545: Fault Tolerant Systems
1. CprE 545: Fault Tolerant Systems
- System Level Fault Diagnosis
2. Introduction
- The basic goal of system-level diagnosis is to identify all the faulty units in a system.
- The PMC model is used both to determine how diagnosable a system is and to perform the diagnosis.
- The PMC model was introduced by Preparata, Metze, and Chien in 1967.
3. PMC Model
- In the PMC model, a system S is decomposed into n units, not necessarily identical, denoted by U = {u1, u2, ..., un}.
- Each unit is considered to be either completely working or completely faulty. There is no intermediate state.
- The status of the units does not change during the diagnosis.
4. PMC Model (Contd.)
- In the PMC model, each unit belonging to U is assigned a particular subset of U to test (no unit tests itself). The complete set of tests is called the connection assignment, and is represented as a directed graph G = (U, E).
- In this graph, each node represents a unit, and each edge represents a testing link.
- An edge (Ui, Uj) exists in G if and only if node Ui tests node Uj.
- aij = outcome of the test (Ui, Uj).
- The value of aij is arbitrary if the node Ui is faulty.
- The set of test outcomes of a system S is called the syndrome of S.
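The PMC rules above can be sketched in a few lines of Python. This is only an illustration, not part of the original model description: the function name `pmc_syndrome`, the dictionary representation of the syndrome, and the use of a random bit to stand in for an "arbitrary" outcome are all assumptions, as is the usual convention that an outcome of 1 means "tested unit found faulty".

```python
import random

def pmc_syndrome(tests, faulty, rng=None):
    """Generate one syndrome that is consistent with the PMC model.

    tests  : iterable of (i, j) pairs meaning "unit i tests unit j"
    faulty : set of faulty units
    rng    : random.Random used for the arbitrary outcomes of faulty testers
    """
    rng = rng or random.Random()
    syndrome = {}
    for i, j in tests:
        if i == j:
            raise ValueError("no unit tests itself")
        if i in faulty:
            # A faulty tester gives an unreliable outcome (modeled as a random bit).
            syndrome[(i, j)] = rng.randint(0, 1)
        else:
            # A fault-free tester reports the true status of the tested unit.
            syndrome[(i, j)] = 1 if j in faulty else 0
    return syndrome

# Hypothetical five-unit ring: 1 tests 2, 2 tests 3, ..., 5 tests 1.
ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
s = pmc_syndrome(ring, faulty={1}, rng=random.Random(0))
print(s)  # a12 is arbitrary; the other four outcomes are determined
```

Only the outcome produced by the faulty unit 1 varies between runs; every other entry is fixed by the fault set.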
5. Connection Assignment Graph
- aij = 0 if Uj is non-faulty; aij = 1 if Uj is faulty.
[Figure: five units U1 to U5 connected in a ring of tests, U1 -> U2 -> U3 -> U4 -> U5 -> U1, with outcomes a12 = x, a23 = 0, a34 = 0, a45 = 0, a51 = 1.]
- The syndrome of this system is a 5-bit vector: (a12, a23, a34, a45, a51) = (x, 0, 0, 0, 1).
- Is it 1-fault diagnosable?
- Is it 2-fault diagnosable?
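One way to explore the two questions for this particular syndrome is to enumerate every fault set of a given maximum size and keep those that could have produced (x, 0, 0, 0, 1). The sketch below is a brute-force illustration under assumptions: the name `consistent_fault_sets` is hypothetical, units are written as integers 1 to 5, and the unknown outcome x is encoded as `None` and treated as "matches anything".

```python
from itertools import combinations

def consistent_fault_sets(units, syndrome, t):
    """Return every fault set of size <= t that could have produced `syndrome`.

    syndrome maps (tester, tested) -> 0, 1, or None (None = unknown outcome x).
    """
    result = []
    for size in range(t + 1):
        for faulty in combinations(units, size):
            fset = set(faulty)
            ok = True
            for (i, j), outcome in syndrome.items():
                if i in fset or outcome is None:
                    continue  # faulty tester or unknown outcome: no constraint
                expected = 1 if j in fset else 0
                if outcome != expected:
                    ok = False
                    break
            if ok:
                result.append(fset)
    return result

units = [1, 2, 3, 4, 5]
syndrome = {(1, 2): None, (2, 3): 0, (3, 4): 0, (4, 5): 0, (5, 1): 1}

print(consistent_fault_sets(units, syndrome, 1))  # [{1}]
print(consistent_fault_sets(units, syndrome, 2))  # [{1}, {1, 2}]
```

With at most one fault, only U1 explains the syndrome. With up to two faults, both {U1} and {U1, U2} fit, so this syndrome alone cannot settle whether U2 is faulty.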
6. Centralized Diagnosis
- In the PMC model, the syndrome is assumed to be analyzed by a centralized supervisor, which is an ultra-reliable processor.
- t-fault diagnosable: A system S is t-diagnosable if, given a syndrome, all faulty units in S can be identified, provided that the number of faulty units does not exceed t.
- The following two conditions are necessary for a system with n units to be t-diagnosable, and they are also sufficient when no two units test each other:
  - n >= 2t + 1
  - Each unit is tested by at least t others
- Several centralized algorithms exist to analyze the syndrome.
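The two conditions on n and on the number of testers per unit can be checked mechanically. The following is a minimal sketch, assuming units are numbered 1 to n and the connection assignment is given as a list of (tester, tested) pairs; the function name `meets_pmc_conditions` and the two example graphs are hypothetical.

```python
def meets_pmc_conditions(n, tests, t):
    """Check n >= 2t + 1 and that every unit is tested by at least t others."""
    testers = {u: set() for u in range(1, n + 1)}
    for i, j in tests:
        testers[j].add(i)
    return n >= 2 * t + 1 and all(len(s) >= t for s in testers.values())

# Single ring: unit i tests only its successor.
ring = [(i, i % 5 + 1) for i in range(1, 6)]

# Double ring: unit i tests its next two successors.
double_ring = ring + [(i, (i + 1) % 5 + 1) for i in range(1, 6)]

print(meets_pmc_conditions(5, ring, 1))         # True
print(meets_pmc_conditions(5, ring, 2))         # False: each unit has one tester
print(meets_pmc_conditions(5, double_ring, 2))  # True
print(meets_pmc_conditions(5, double_ring, 3))  # False: 5 < 2*3 + 1
```

The function only evaluates the two stated conditions; it does not by itself run a diagnosis.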
7. Diagnosability vs. Diagnosis Problems
- Diagnosability: the problem of determining t for a given system, i.e., determining the maximum number of units that can be faulty such that the set of faulty units can be uniquely identified on the basis of any syndrome.
- Diagnosis: the problem of determining the faulty units from any syndrome, given that there are at most t faulty units.
- The diagnosability problem is concerned only with what is theoretically possible.
- The diagnosis problem is concerned with actually finding an algorithm for diagnosis (provided, of course, the system is diagnosable) from a given syndrome.
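For very small systems, the diagnosability problem can be explored by brute force. The sketch below is an illustration under assumptions, not an efficient or standard algorithm: it relies on the PMC rule that only fault-free testers give reliable outcomes, so two fault sets can share a syndrome exactly when every tester that is fault-free under both hypotheses sees the same status for the unit it tests. The names `share_a_syndrome` and `max_diagnosability` are hypothetical, and the search is exponential in n.

```python
from itertools import combinations

def share_a_syndrome(tests, f1, f2):
    """True if some syndrome is consistent with both fault sets f1 and f2."""
    for i, j in tests:
        if i in f1 or i in f2:
            continue  # tester is faulty under one hypothesis: outcome can be anything
        if (j in f1) != (j in f2):
            return False  # a tester that is fault-free in both cases tells them apart
    return True

def max_diagnosability(units, tests):
    """Largest t such that all fault sets of size <= t are pairwise distinguishable."""
    best = 0
    for t in range(1, len(units)):
        sets = [frozenset(c) for k in range(t + 1) for c in combinations(units, k)]
        if any(share_a_syndrome(tests, a, b) for a, b in combinations(sets, 2)):
            break
        best = t
    return best

units = [1, 2, 3, 4, 5]
ring = [(i, i % 5 + 1) for i in units]                       # i tests i+1
double_ring = ring + [(i, (i + 1) % 5 + 1) for i in units]   # i also tests i+2

print(max_diagnosability(units, ring))         # 1
print(max_diagnosability(units, double_ring))  # 2
```

For the single ring, the fault sets {1} and {1, 2} share a syndrome, which is why the search stops at t = 1.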
8. Distributed Diagnosis
- The centralized approach is not suitable for distributed systems.
- The goal of system diagnosis in distributed systems is to ensure that if some nodes fail (or recover), then the other nodes in the system find out about the failure (recovery) in a finite time.
9. Adaptive Distributed System Level Diagnosis
- The Adaptive DSD algorithm is executed by each node in the system.
- Each node i maintains an array TESTED_UPi. It contains n elements, indexed by the node number.
- Each element of TESTED_UPi contains a node number.
- The entry TESTED_UPi[k] = j means that node i has received diagnostic information from a fault-free node specifying that node k has tested node j to be fault-free.
- An entry TESTED_UPi[m] may be arbitrary if node m is faulty.
10. Adaptive DSD Overview
- The nodes are sequentially ordered in a circular list, say 0, 1, ..., n-1, 0.
- Node i sequentially tests nodes (i+1) mod n, (i+2) mod n, ... until it finds a fault-free node.
- Diagnostic information from this fault-free node is copied to the local TESTED_UP array.
11. Adaptive DSD Algorithm for node i (one round)
- t = i
- Repeat
  - t = (t + 1) mod n
  - Request t to forward TESTED_UPt to i
- Until (i tests t as fault-free)
- TESTED_UPi[i] = t
- For j = 0 to (n-1) do
  - If (j != i) /* copies the array contents */
    - TESTED_UPi[j] = TESTED_UPt[j]
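The round above can be written as a small Python simulation. This is a sketch under stated assumptions rather than a reference implementation: nodes are numbered 0 to n-1, each node's array lives in a shared dictionary `tested_up`, the test itself is a hypothetical callback `is_fault_free`, the own entry of node i is kept when the array is copied, and a guard is added for the case where every other node tests faulty. The fault set {1, 4, 5} is invented for the example.

```python
def adaptive_dsd_round(i, n, tested_up, is_fault_free):
    """Run one Adaptive DSD testing round at node i (nodes numbered 0..n-1).

    tested_up     : dict mapping node -> its TESTED_UP list of length n
    is_fault_free : hypothetical callback; is_fault_free(t) is the result of i testing t
    """
    t = i
    while True:
        t = (t + 1) % n
        if t == i:
            return None  # added guard: every other node tested faulty
        received = list(tested_up[t])  # ask t to forward TESTED_UP_t
        if is_fault_free(t):
            break
    tested_up[i][i] = t
    for j in range(n):
        if j != i:  # keep i's own entry, copy the rest from t
            tested_up[i][j] = received[j]
    return t

# Hypothetical 8-node system in which nodes 1, 4, and 5 are faulty.
n = 8
faulty = {1, 4, 5}
tested_up = {x: [None] * n for x in range(n)}

for _ in range(n):  # n testing rounds
    for x in range(n):
        if x not in faulty:  # only fault-free nodes run the algorithm reliably
            adaptive_dsd_round(x, n, tested_up, lambda t: t not in faulty)

print(tested_up[0])  # [2, None, 3, 6, None, None, 7, 0]
```

After the rounds complete, every fault-free node holds the same chain 0 -> 2 -> 3 -> 6 -> 7 -> 0, and the entries for the faulty nodes stay at their initial value.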
12. Adaptive DSD Example
[Figure: a ring of eight nodes numbered 0 to 7, annotated with the arrays TESTED_UP1, TESTED_UP2, TESTED_UP3, TESTED_UP6, and TESTED_UP7.]
- Over several rounds, the information in the TESTED_UP array is spread to all the nodes.
13. The Diagnose Algorithm
- Uses STATEi[k] = FAULTY / FAULT-FREE, i.e., the state of node k as found by node i.
- Algorithm:
  - Initialize STATEi[j] = FAULTY for all j
  - t = i
  - Repeat
    - STATEi[t] = FAULT-FREE
    - t = TESTED_UPi[t]
  - Until (t = i)
- Intuitively, it is like going backwards through the test edges on the circular list.
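The Diagnose steps translate almost directly into Python. The sketch below makes a few assumptions that are not in the slides: the function name `diagnose`, the string constants for the two states, a bound of n iterations, and a stop on a missing (`None`) entry so that an incomplete array cannot cause an endless loop. The sample array corresponds to a hypothetical 8-node system in which nodes 1, 4, and 5 are faulty.

```python
FAULTY, FAULT_FREE = "FAULTY", "FAULT-FREE"

def diagnose(i, tested_up_i):
    """Derive node i's STATE array from its TESTED_UP array."""
    n = len(tested_up_i)
    state = [FAULTY] * n          # every node starts as FAULTY
    t = i
    for _ in range(n):            # at most n steps
        state[t] = FAULT_FREE
        t = tested_up_i[t]        # next node on the chain of fault-free tests
        if t == i or t is None:   # back at the start, or entry not yet filled in
            break
    return state

# Hypothetical TESTED_UP array held by node 0.
tested_up_0 = [2, None, 3, 6, None, None, 7, 0]
state_0 = diagnose(0, tested_up_0)
print([k for k, s in enumerate(state_0) if s == FAULTY])  # [1, 4, 5]
```

Any node that is never reached while following the chain keeps its initial FAULTY state.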
14. Properties of the Adaptive DSD Algorithm
- It takes n rounds to fill up the TESTED_UP array.
- The STATE array can be filled in at most n steps.
- An arbitrary number of faulty units can be detected (up to n-1).
- Assumption: There are no failures or recoveries during the execution of the algorithm (i.e., during n rounds).
15. What is the test?
- Node i tests node j: a process is created at node j.
- Process creation itself verifies that the process scheduler is operational.
- The process checks several hardware and software facilities and the disk subsystem, and performs some known arithmetic operations.
- If the result of the test is not provided within a timeout period, then the tested node is assumed to have failed.
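The timeout rule can be illustrated with a local stand-in for the remote test. This is only a sketch under assumptions: a real test would create a process on node j over the network, whereas here a hypothetical `probe` function runs in a worker thread, and `run_test` and the two sample probes are invented names. A probe that raises an error is also counted as a failed test, which is an added assumption.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def run_test(probe, timeout_s):
    """Return True only if `probe` reports success within `timeout_s` seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(probe)
    try:
        return bool(future.result(timeout=timeout_s))
    except TimeoutError:
        return False  # no result within the timeout: assume the node has failed
    except Exception:
        return False  # assumption: a crashing probe also counts as a failed test
    finally:
        pool.shutdown(wait=False)  # do not block on a probe that is still running

def healthy_probe():
    # Stand-in for "performs some known arithmetic operations".
    return 2 + 2 == 4 and 7 * 6 == 42

def slow_probe():
    time.sleep(0.3)  # simulates a node that answers too late
    return True

print(run_test(healthy_probe, timeout_s=1.0))  # True
print(run_test(slow_probe, timeout_s=0.05))    # False
```

The tester never learns why the result is missing; a slow node and a dead node look the same once the timeout expires.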