CprE 545: Fault Tolerant Systems - PowerPoint PPT Presentation

Slides: 16

Provided by: ecpe7

1
CprE 545: Fault Tolerant Systems

System Level Fault Diagnosis

2
Introduction

The basic goal of system-level diagnosis is to identify all the faulty units in a system.
The PMC model is used both to determine how diagnosable a system is and to perform the diagnosis.
The PMC model was introduced by Preparata, Metze, and Chien in 1967.

3
PMC Model

In the PMC model, a system S is decomposed into n units, not necessarily identical, denoted by U = {u1, u2, ..., un}.
Each unit is considered to be either completely working or completely faulty; there is no intermediate state.
The status of the components does not change during the diagnosis.

4
PMC Model (Contd.)

In the PMC model, each unit belonging to U is assigned a particular subset of U to test (no unit tests itself). The complete set of tests is called the connection assignment and is represented as a directed graph G = (U, E).
In this graph, each node represents a unit, and each edge represents a testing link.
An edge (Ui, Uj) exists in G if and only if node Ui tests node Uj.
aij = outcome of the test (Ui, Uj).
The value of aij is arbitrary if the node Ui is faulty.
The set of test outcomes of a system S is called the syndrome of S.

5
Connection Assignment Graph
aij = 0 if Uj is non-faulty; aij = 1 if Uj is faulty.
[Figure: five units joined by test edges U1 -> U2 -> U3 -> U4 -> U5 -> U1, labeled with the outcomes a12 = x, a23 = 0, a34 = 0, a45 = 0, a51 = 1.]
Is it 1-fault diagnosable?
Is it 2-fault diagnosable?
The syndrome of this system is a 5-bit vector (a12, a23, a34, a45, a51) = (x, 0, 0, 0, 1).

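As a rough illustration of how such a syndrome arises, the sketch below computes test outcomes for a given set of faulty units. The function name `syndrome`, the edge-list representation, and the choice to model an arbitrary outcome as a random bit are assumptions made for this example; they are not part of the slides.

```python
import random

def syndrome(tests, faulty, rng=None):
    """Return {(i, j): outcome} for every test edge (i, j).

    A fault-free tester reports 0 for a fault-free unit and 1 for a faulty one.
    A faulty tester's outcome is arbitrary; this sketch draws it at random.
    """
    rng = rng or random.Random(0)
    result = {}
    for i, j in tests:
        if i in faulty:
            result[(i, j)] = rng.choice([0, 1])  # arbitrary outcome (the "x")
        else:
            result[(i, j)] = 1 if j in faulty else 0
    return result

# Five units in a ring of test edges, with U1 assumed faulty.
ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
s = syndrome(ring, faulty={1})
print([s[e] for e in ring])  # last four entries are 0, 0, 0, 1; the first is arbitrary
```

With U1 as the only faulty unit, the outcomes match the pattern (x, 0, 0, 0, 1): only a12 depends on the faulty tester.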
6
Centralized Diagnosis

In the PMC model, the syndrome is assumed to be analyzed by a centralized supervisor, which is an ultra-reliable processor.
t-fault diagnosable: a system S is t-diagnosable if, given a syndrome, all faulty units in S can be identified, provided that the number of faulty units does not exceed t.
Two conditions are necessary for a system with n units to be t-diagnosable, and they are also sufficient when no two units test each other:
n ≥ 2t + 1
Each unit is tested by at least t others
Several centralized algorithms exist to analyze the syndrome.
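The two conditions can be checked mechanically, and for small systems a syndrome can be analyzed by brute force. The sketch below is one possible illustration, not one of the centralized algorithms the slide refers to. The names `meets_pmc_conditions` and `consistent_fault_sets`, the five-unit ring, and fixing the arbitrary outcome x to 0 are all assumptions made for the example.

```python
from itertools import combinations

def meets_pmc_conditions(n, tests, t):
    """Check n >= 2t + 1 and that every unit is tested by at least t others."""
    testers = {u: set() for u in range(1, n + 1)}
    for i, j in tests:
        testers[j].add(i)
    return n >= 2 * t + 1 and all(len(s) >= t for s in testers.values())

def consistent_fault_sets(n, syndrome, t):
    """Enumerate fault sets of size <= t that could have produced the syndrome."""
    found = []
    for size in range(t + 1):
        for cand in combinations(range(1, n + 1), size):
            faulty = set(cand)
            # A faulty tester explains any outcome; a fault-free tester
            # must have reported the true state of the tested unit.
            ok = all(
                i in faulty or outcome == (1 if j in faulty else 0)
                for (i, j), outcome in syndrome.items()
            )
            if ok:
                found.append(faulty)
    return found

ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
observed = {(1, 2): 0, (2, 3): 0, (3, 4): 0, (4, 5): 0, (5, 1): 1}  # x fixed to 0

print(meets_pmc_conditions(5, ring, 1))       # True
print(meets_pmc_conditions(5, ring, 2))       # False
print(consistent_fault_sets(5, observed, 1))  # [{1}]
print(consistent_fault_sets(5, observed, 2))  # [{1}, {1, 2}]
```

For t = 1 the syndrome points to a single fault set. For t = 2 two fault sets explain the same syndrome, which agrees with the ring failing the second condition, since each unit is tested by only one other unit.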

7
Diagnosability vs. Diagnosis problems

Diagnosability: the problem of determining t for a given system, i.e., the maximum number of units that can be faulty such that the set of faulty units can be uniquely identified on the basis of any syndrome.
Diagnosis: the problem of determining the faulty units from any syndrome, given that there are at most t faulty units.
The diagnosability problem is concerned only with
what is theoretically possible.
The diagnosis problem is concerned with actually
finding an algorithm for diagnosis (provided, of
course, the system is diagnosable) from a given
syndrome.

8
Distributed Diagnosis

The centralized approach is not suitable for distributed systems.
The goal of system diagnosis in distributed
systems is to ensure that if some nodes fail (or
recover), then the other nodes in the system find
out about the failure (recovery) in a finite time.

9
Adaptive Distributed System Level Diagnosis

The Adaptive DSD algorithm is executed by each node in the system.
Each node i maintains an array TESTED_UP_i. It contains n elements, indexed by node number.
Each element of TESTED_UP_i contains a node number.
The entry TESTED_UP_i[k] = j means that node i has received diagnostic information from a fault-free node specifying that node k has tested j to be fault-free.
An entry TESTED_UP_i[m] may be arbitrary if node m is faulty.

10
Adaptive DSD Overview

The nodes are sequentially ordered in a circular list, say as 1, 2, ..., n, 1.
A node i sequentially tests nodes (i+1) mod n, (i+2) mod n, ... until it finds a fault-free node.
Diagnostic information from this fault-free node is copied to the local TESTED_UP array.

11
Adaptive DSD Algorithm for node i (one round)

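t = i
Repeat
    t = (t + 1) mod n
    Request t to forward TESTED_UP_t to i
Until (i tests t as fault-free)
TESTED_UP_i[i] = t
For j = 0 to (n-1) do    /* nodes numbered 0..n-1, matching the mod n arithmetic */
    If (j != i)    /* copies the array contents, keeping i's own entry */
        TESTED_UP_i[j] = TESTED_UP_t[j]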
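A small simulation can make the round structure concrete. The following is a sketch under stated assumptions: nodes are numbered 0..n-1, a real test is replaced by a lookup in a known `faulty` set, all fault-free nodes run the round in lockstep on a snapshot of the arrays, and unknown entries are stored as `None`. The function name `dsd_round` and the fault set {0, 4, 5} are hypothetical choices for the example.

```python
def dsd_round(n, faulty, tested_up):
    """One synchronous round: every fault-free node runs the procedure once.

    tested_up[k] is node k's TESTED_UP array (None marks an unknown entry).
    """
    snapshot = [row[:] for row in tested_up]  # arrays as forwarded this round
    for i in range(n):
        if i in faulty:
            continue  # faulty nodes do not take part in this sketch
        t = i
        while True:  # Repeat ... Until (i tests t as fault-free)
            t = (t + 1) % n
            if t not in faulty:  # stand-in for a real test of node t
                break
        tested_up[i][i] = t
        for j in range(n):
            if j != i:  # keep i's own entry, copy the rest from t
                tested_up[i][j] = snapshot[t][j]
    return tested_up

n = 8
faulty = {0, 4, 5}  # hypothetical fault set for illustration
tested_up = [[None] * n for _ in range(n)]
for _ in range(n):  # run n rounds
    dsd_round(n, faulty, tested_up)

print(tested_up[1])  # [None, 2, 3, 6, None, None, 7, 1]
```

In this run each fault-free node ends up with the same array, and the entries for the faulty nodes stay unknown because no fault-free node ever reports on their behalf.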

12
Adaptive DSD Example
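[Figure: eight nodes, numbered 0 through 7, arranged in a ring; arrows labeled TESTED_UP1, TESTED_UP2, TESTED_UP3, TESTED_UP6, and TESTED_UP7 show the arrays being forwarded between nodes.]
Over several rounds the information in the TESTED_UP array is spread to all the nodes.
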
13
The Diagnose Algorithm

Uses STATE_i[k] = FAULTY / FAULT-FREE, i.e., the state of node k as found by node i
Algorithm
Initialize STATE_i[j] = FAULTY for all j
t = i
Repeat
    STATE_i[t] = FAULT-FREE
    t = TESTED_UP_i[t]
Until (t = i)
Intuitively, it is like walking along the test edges on the circular list, from node i through each node found fault-free, until the walk returns to i.
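The Diagnose steps translate almost line for line into code. This is a minimal sketch that assumes the TESTED_UP array of node i is already fully populated for the fault-free nodes and that no failures occur while it runs; the function name `diagnose` and the sample array are hypothetical.

```python
FAULTY, FAULT_FREE = "FAULTY", "FAULT-FREE"

def diagnose(i, tested_up_i):
    """Return STATE_i: the state of every node as found by node i."""
    n = len(tested_up_i)
    state = [FAULTY] * n          # initialize STATE_i[j] = FAULTY for all j
    t = i
    while True:                   # Repeat ... Until (t = i)
        state[t] = FAULT_FREE
        t = tested_up_i[t]        # next node on the chain of fault-free tests
        if t == i:
            break
    return state

# Hypothetical array for node 1 in an 8-node system; None marks unknown entries.
tested_up_1 = [None, 2, 3, 6, None, None, 7, 1]
state = diagnose(1, tested_up_1)
print([k for k, s in enumerate(state) if s == FAULTY])  # [0, 4, 5]
```

Nodes never reached by the walk keep their initial FAULTY state, which is why the initialization step matters.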

14
Properties of the Adaptive DSD Algorithm

It takes at most n rounds to fill the TESTED_UP array.
The STATE array can be filled in at most n steps.
An arbitrary number of faulty units can be detected (up to n-1).
Assumption: there are no failures or recoveries during the execution of the algorithm (i.e., during the n rounds).

15
What is the test?

Node i tests node j: a process is created at node j.
Process creation itself verifies that the process scheduler is operational.
The process checks several hardware and software facilities and the disk subsystem, and performs some known arithmetic operations.
If the results of the test are not provided within a timeout period, then the tested node is assumed to have failed.