CprE 545: Fault Tolerant Systems - PowerPoint PPT Presentation

Provided by: ecpe7
1
CprE 545 Fault Tolerant Systems
  • System Level Fault Diagnosis

2
Introduction
  • The basic goal of system level diagnosis is to
    identify all the faulty units in a system.
  • The PMC model is used both to determine how
    diagnosable a system is and to perform the
    diagnosis.
  • The PMC model was introduced by Preparata, Metze,
    and Chien in 1967.

3
PMC Model
  • In the PMC model, a system S is decomposed into
    n units, not necessarily identical, denoted by
    U = {u1, u2, ..., un}.
  • Each unit is considered to be completely working
    or completely faulty. There is no intermediate
    state.
  • The status of the components does not change
    during the diagnosis.

4
PMC Model (Contd..)
  • In the PMC model, each unit belonging to U is
    assigned a particular subset of U to test (no
    unit tests itself). The complete set of tests is
    called the connection assignment and is
    represented as a directed graph G = (U, E).
  • In this graph, each node represents a unit, and
    each edge represents a testing link.
  • An edge (Ui, Uj) exists in G if and only if node
    Ui tests node Uj.
  • aij = outcome of the test (Ui, Uj), i.e., of Ui
    testing Uj.
  • The value of aij is arbitrary if the testing
    node Ui is faulty.
  • The set of test outcomes of a system S is called
    the syndrome of S.

5
Connection Assignment Graph
  • aij = 0 if Uj is non-faulty; aij = 1 if Uj is
    faulty.
  • [Figure: five units in a ring with test edges
    U1 → U2 → U3 → U4 → U5 → U1 and outcomes
    a12 = x, a23 = 0, a34 = 0, a45 = 0, a51 = 1.]
  • The syndrome of this system is a 5-bit vector:
    (a12, a23, a34, a45, a51) = (x, 0, 0, 0, 1).
  • Is it 1-fault diagnosable?
  • Is it 2-fault diagnosable?
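
As a rough illustration of how such a syndrome can be analyzed, the following Python sketch enumerates every fault set of size at most t that is consistent with the syndrome above. The names `SYNDROME`, `consistent`, and `candidate_fault_sets` are hypothetical, units are numbered 1 to 5, and the unknown outcome x is modeled as `None`. It is a brute-force check for small systems, not an algorithm taken from the slides.

```python
from itertools import combinations

# Test edges of the five-unit ring: (tester, tested) -> observed outcome.
# None stands for the unknown outcome "x" of a12 (an assumption of this sketch).
SYNDROME = {(1, 2): None, (2, 3): 0, (3, 4): 0, (4, 5): 0, (5, 1): 1}
UNITS = [1, 2, 3, 4, 5]


def consistent(faulty, syndrome):
    """Return True if the fault set could have produced the syndrome."""
    for (tester, tested), outcome in syndrome.items():
        # A faulty tester may report anything; an unknown outcome constrains nothing.
        if tester in faulty or outcome is None:
            continue
        expected = 1 if tested in faulty else 0
        if outcome != expected:
            return False
    return True


def candidate_fault_sets(syndrome, units, t):
    """List all fault sets with at most t units that are consistent with the syndrome."""
    return [
        set(combo)
        for size in range(t + 1)
        for combo in combinations(units, size)
        if consistent(set(combo), syndrome)
    ]


print(candidate_fault_sets(SYNDROME, UNITS, 1))
print(candidate_fault_sets(SYNDROME, UNITS, 2))
```

Under these assumptions, limiting the search to one fault leaves only {U1}, while allowing two faults also admits {U1, U2}, so this syndrome cannot be resolved uniquely once two faults are possible.
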
6
Centralized Diagnosis
  • In the PMC model, the syndrome is assumed to be
    analyzed by a centralized supervisor, which is an
    ultra-reliable processor.
  • t-fault diagnosable: A system S is t-diagnosable
    if, given a syndrome, all faulty units in S can
    be identified, provided that the number of faulty
    units does not exceed t.
  • Two conditions together are sufficient for a
    system with n units, in which no two units test
    each other, to be t-diagnosable:
  • n ≥ 2t + 1
  • Each unit is tested by at least t others
  • Several centralized algorithms exist to analyze
    the syndrome.
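
The two conditions listed above are easy to check mechanically. The sketch below is only an illustration: the function name `meets_t_conditions` and the encoding of tests as `(tester, tested)` pairs over units numbered 1 to n are assumptions, not part of the slides.

```python
def meets_t_conditions(n, edges, t):
    """Check n >= 2t + 1 and that every unit is tested by at least t other units.

    edges is an iterable of (tester, tested) pairs over units 1..n.
    """
    testers = {unit: set() for unit in range(1, n + 1)}
    for tester, tested in edges:
        if tester != tested:  # no unit tests itself in the PMC model
            testers[tested].add(tester)
    enough_units = n >= 2 * t + 1
    enough_testers = all(len(who) >= t for who in testers.values())
    return enough_units and enough_testers


# The five-unit ring from the connection assignment graph slide.
ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
print(meets_t_conditions(5, ring, 1))
print(meets_t_conditions(5, ring, 2))
```

For the five-unit ring, both conditions hold for t = 1, but the second one fails for t = 2 because every unit has only one tester.
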

7
Diagnosability vs. Diagnosis problems
  • Diagnosability problem: the problem of
    determining t for a given system, i.e.,
    determining the maximum number of units that can
    be faulty such that the set of faulty units can
    be uniquely identified on the basis of any
    syndrome.
  • Diagnosis problem: the problem of determining the
    faulty units from any syndrome, given that there
    are at most t faulty units.
  • The diagnosability problem is concerned only with
    what is theoretically possible.
  • The diagnosis problem is concerned with actually
    finding an algorithm for diagnosis (provided, of
    course, the system is diagnosable) from a given
    syndrome.
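
For small systems, the diagnosability problem can be explored by brute force. The sketch below relies on a standard characterization for the PMC model that is not stated on the slides: two fault sets F1 and F2 can be told apart exactly when some unit outside both sets tests a unit that lies in one set but not the other. The function names and the `(tester, tested)` edge encoding are hypothetical, and the search is exponential in the number of units.

```python
from itertools import combinations


def distinguishable(f1, f2, edges):
    """True if some unit outside f1 and f2 tests a unit in their symmetric difference."""
    union, difference = f1 | f2, f1 ^ f2
    return any(u not in union and v in difference for u, v in edges)


def diagnosability(units, edges):
    """Largest t such that all fault sets of size <= t are pairwise distinguishable."""
    t = 0
    for candidate in range(1, len(units) + 1):
        fault_sets = [
            frozenset(combo)
            for size in range(candidate + 1)
            for combo in combinations(units, size)
        ]
        if not all(distinguishable(a, b, edges)
                   for a, b in combinations(fault_sets, 2)):
            break
        t = candidate
    return t


units = [1, 2, 3, 4, 5]
ring = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
print(diagnosability(units, ring))
```

Under this characterization the five-unit ring comes out as 1-diagnosable but not 2-diagnosable, since the fault sets {U1} and {U1, U2} cannot be separated by any fault-free tester.
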

8
Distributed Diagnosis
  • The centralized approach is not suitable for
    distributed systems.
  • The goal of system diagnosis in distributed
    systems is to ensure that if some nodes fail (or
    recover), then the other nodes in the system find
    out about the failure (recovery) in a finite time.

9
Adaptive Distributed System Level Diagnosis
  • The Adaptive DSD algorithm is executed by each
    node in the system.
  • Each node i maintains an array TESTED_UPi. It
    contains n elements, indexed by the node
    number.
  • Each element of TESTED_UPi contains a node
    number.
  • The entry TESTED_UPi[k] = j means that node
    i has received diagnostic information from a
    fault-free node specifying that node k has
    tested node j to be fault-free.
  • An entry TESTED_UPi[m] may be arbitrary if
    node m is faulty.

10
Adaptive DSD Overview
  • The nodes are sequentially ordered in a circular
    list, say 0, 1, ..., n-1, 0.
  • A node i sequentially tests nodes (i+1) mod n,
    (i+2) mod n, ... until it finds a fault-free node.
  • Diagnostic information from this fault-free node
    is copied to the local TESTED_UP array.

11
Adaptive DSD Algorithm for node i (one round)
  • t = i
  • Repeat
  •   t = (t + 1) mod n
  •   Request t to forward TESTED_UPt to i
  • Until (i tests t as fault-free)
  • TESTED_UPi[i] = t
  • For j = 0 to (n-1) do
  •   If (j != i)   /* copies the array contents */
  •     TESTED_UPi[j] = TESTED_UPt[j]
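
The round above can be sketched in Python as follows. This is a simulation under stated assumptions, not the authors' implementation: nodes are numbered 0 to n-1, the real test is replaced by a lookup in a hypothetical `faulty` set, node i itself is assumed fault-free, and the copy step is assumed to skip index i so that the entry just written to TESTED_UPi[i] is not overwritten.

```python
def adaptive_dsd_round(i, tested_up, faulty):
    """Run one round at fault-free node i, updating tested_up[i] in place.

    tested_up is a list of n arrays (one per node); faulty is the set of
    faulty node numbers, used here as a stand-in for the real test.
    Returns the node that i found to be fault-free.
    """
    n = len(tested_up)
    t = i
    while True:
        t = (t + 1) % n
        if t not in faulty:  # "i tests t as fault-free"
            break
    tested_up[i][i] = t
    for j in range(n):
        if j != i:  # keep the entry that was just written
            tested_up[i][j] = tested_up[t][j]
    return t


# Hypothetical example: 5 nodes, nodes 1 and 2 faulty, all arrays initially empty.
n = 5
tested_up = [[None] * n for _ in range(n)]
found = adaptive_dsd_round(0, tested_up, faulty={1, 2})
print(found, tested_up[0])
```

In this example node 0 skips the faulty nodes 1 and 2 and stops at node 3; the other entries stay empty because node 3 has not yet collected any information.
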

12
Adaptive DSD Example
  • [Figure: nodes 0 through 7 arranged in a ring;
    the figure labels the arrays TESTED_UP1,
    TESTED_UP2, TESTED_UP3, TESTED_UP6, and
    TESTED_UP7.]
  • Over several rounds the information in the
    TESTED_UP array is spread to all the nodes.
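
To see how the information spreads, the following Python sketch simulates repeated rounds and counts how many are needed until every fault-free node holds the correct entry for every fault-free node. The fault set {1, 4, 5}, the function names, and the choice to run the nodes in increasing order within a round are all assumptions made for illustration; they are not taken from the figure.

```python
def next_fault_free(i, faulty, n):
    """First fault-free node after i on the circular list (i must be fault-free)."""
    t = (i + 1) % n
    while t in faulty:
        t = (t + 1) % n
    return t


def rounds_to_spread(n, faulty):
    """Rounds until all fault-free nodes agree on all fault-free entries.

    Returns None if that does not happen within n rounds.
    """
    tested_up = [[None] * n for _ in range(n)]
    good = [i for i in range(n) if i not in faulty]
    target = {k: next_fault_free(k, faulty, n) for k in good}
    for round_no in range(1, n + 1):
        for i in good:
            t = next_fault_free(i, faulty, n)
            row = list(tested_up[t])  # copy the array received from t
            row[i] = t                # record i's own test result
            tested_up[i] = row
        if all(tested_up[i][k] == target[k] for i in good for k in good):
            return round_no
    return None


print(rounds_to_spread(8, {1, 4, 5}))
```

Entries that belong to faulty nodes are never written in this simulation, which mirrors the remark that such entries may be arbitrary.
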
13
The Diagnose Algorithm
  • Uses STATEi[k] = FAULTY / FAULT-FREE, i.e., the
    state of node k as found by node i.
  • Algorithm:
  • Initialize STATEi[j] = FAULTY for all j
  • t = i
  • Repeat
  •   STATEi[t] = FAULT-FREE
  •   t = TESTED_UPi[t]
  • Until (t = i)
  • Intuitively, it walks the cycle of fault-free
    test edges around the circular list, in the
    direction opposite to the flow of diagnostic
    information.
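
A minimal Python sketch of the Diagnose step is shown below. It assumes that node i's TESTED_UP array is already complete for all fault-free nodes; the array used in the example is hypothetical (8 nodes, with nodes 1, 4, and 5 faulty and their entries left as `None`).

```python
FAULTY, FAULT_FREE = "FAULTY", "FAULT-FREE"


def diagnose(i, tested_up_i):
    """Return the STATE array of node i, derived from its TESTED_UP array."""
    n = len(tested_up_i)
    state = [FAULTY] * n
    t = i
    while True:
        state[t] = FAULT_FREE
        t = tested_up_i[t]  # the node that t tested as fault-free
        if t == i:
            break
    return state


# Hypothetical complete array at node 0: 0 -> 2 -> 3 -> 6 -> 7 -> 0.
tested_up_0 = [2, None, 3, 6, None, None, 7, 0]
state_0 = diagnose(0, tested_up_0)
print([k for k, s in enumerate(state_0) if s == FAULTY])
```

Every node that is never reached on the walk keeps its initial FAULTY state, which is how the faulty nodes are identified.
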

14
Properties of the Adaptive DSD Algorithm
  • It takes n rounds to fill up the TESTED_UP array.
  • The STATE array can be filled in at most n steps.
  • An arbitrary number of faulty units can be
    detected (up to n-1).
  • Assumption: there are no failures or recoveries
    during the execution of the algorithm (i.e.,
    during the n rounds).

15
What is the test?
  • Node i tests node j: a process is created at
    node j.
  • Process creation itself verifies that the process
    scheduler is operational.
  • The process checks several hardware and software
    facilities and the disk subsystem, and performs
    some known arithmetic operations.
  • If the result of the test is not provided within
    a timeout period, then the tested node is
    assumed to have failed.
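
The timeout rule can be illustrated with a small Python sketch. It runs a check in a worker thread rather than creating a process on a remote node, so it only mimics the decision logic; `probe_node`, `arithmetic_check`, and `hung_check` are hypothetical names, and the arithmetic used is an arbitrary stand-in for the "known arithmetic operations".

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def probe_node(check, timeout_s):
    """Run check() and report FAULT-FREE only if it returns True in time."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check)
    try:
        passed = bool(future.result(timeout=timeout_s))
    except TimeoutError:
        passed = False  # no answer within the timeout: assume the node failed
    except Exception:
        passed = False  # the check itself crashed
    finally:
        pool.shutdown(wait=False)
    return "FAULT-FREE" if passed else "FAULTY"


def arithmetic_check():
    """Known arithmetic operations with known answers."""
    return sum(range(101)) == 5050 and 7 * 6 == 42


def hung_check():
    """Simulates a node that answers too late."""
    time.sleep(0.5)
    return True


print(probe_node(arithmetic_check, timeout_s=1.0))
print(probe_node(hung_check, timeout_s=0.05))
```

A late or missing answer is treated exactly like a wrong one, which matches the rule that a node that does not respond within the timeout is assumed to have failed.
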