Title: System-Level Diagnosis: A Review
1 System-Level Diagnosis: A Review
Computer Science Department, University of Pisa
- Seminars for the PhD in Computer Science
- Stefano Chessa
2 Faults, Errors and Failures
- Fault: abnormal physical condition
- caused by temperature, cosmic rays, design errors, age, ...
- Classified by
- Duration
- Transient, Intermittent, Permanent
- Nature
- Logical, ...
- Extent
- Error: caused by a fault, affecting information
- Failure: system component unable to work
- caused by an error
3 Redundancy Management
- Fault-Detection / Masking
- Information redundancy
- Parity bit, codes, ...
- Hardware redundancy
- Duplication, n-modular redundancy, ...
- Fault-Diagnosis
- Computation redundancy
- Tests
- Diagnosis algorithms
- Repair and Reconfiguration
- Replacement / Repair
- Graceful degradation
- Recovery
- Set the system in a consistent state
- Backward and Forward recovery
4 System-Level Diagnosis: The PMC Model
- Introduced in 1967 by Preparata, Metze and Chien
- Consider a set V of units
- The units are connected by an interconnection structure
- This defines the system graph G(V, L)
- Units may be Faulty or Fault-Free
- Permanent faults
- Units perform mutual tests exploiting the system interconnections
- Tests have binary outcomes
- The Syndrome is the collection of all the test outcomes
- The set of tests defines the test assignment and the diagnostic graph DG(V, E)
Figure: A System Graph and a Diagnostic Graph
5 System-Level Diagnosis: The Tests
- The test of unit v performed by unit u consists of three steps
- u sends a test input sequence to v
- v performs a computation on the test sequence and returns the output to u
- u compares the output of v with the expected results
- The test outcome is binary (0 = pass, 1 = fail)
- The test requires a bidirectional connection
- The outcome g of the test performed by unit u on unit v (denoted $u \to v$) is defined according to the PMC model
- $u \leftrightarrow v$: tests performed in both directions, with outcomes d and g respectively
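The outcome rule of the PMC model can be sketched in Python as follows. This is a minimal illustration, not code from the seminar: the names `pmc_outcome` and `generate_syndrome` are hypothetical, tests are given as (tester, tested) pairs, and the example ring is invented. It assumes the usual PMC rule: a fault-free tester reports 0 for a fault-free unit and 1 for a faulty one, while the outcome reported by a faulty tester is unreliable (drawn at random here).

```python
import random

def pmc_outcome(tester_faulty, tested_faulty, rng=random):
    """Outcome of one test under the PMC model (0 = pass, 1 = fail)."""
    if tester_faulty:
        # A faulty tester is unreliable: its outcome is arbitrary.
        return rng.choice([0, 1])
    return 1 if tested_faulty else 0

def generate_syndrome(tests, faulty, rng=random):
    """Map each test (u, v) of the diagnostic graph to its outcome."""
    return {(u, v): pmc_outcome(u in faulty, v in faulty, rng)
            for (u, v) in tests}

# Hypothetical example: a ring of 5 units, each testing its successor.
tests = [(i, (i + 1) % 5) for i in range(5)]
syndrome = generate_syndrome(tests, faulty={2}, rng=random.Random(0))
```

In this example only the outcome of the test performed by the faulty unit 2 is unpredictable; all other outcomes are determined by the fault set.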
6 System-Level Diagnosis: Some Definitions
- Centralized diagnosis
- An external, reliable diagnoser
- Collects and decodes the syndrome by a diagnosis algorithm
- Distributed diagnosis
- The syndrome is decoded by a distributed algorithm
- Centralized Diagnosis
- Given a syndrome s, a consistent fault set (CFS) $V_f$ is such that
- For each $u \in V_f$ and $v \in V \setminus V_f$: every test $v \to u$ has outcome 1
- For each $u, v \in V \setminus V_f$: every test $v \to u$ has outcome 0
- The goal of the diagnosis is to identify a CFS of minimum cardinality
- However, in general there are many CFSs with that property
Figure: example with two fault sets, $V_1 = \{1, 2, 3\}$ and $V_2 = \{3, 4, 5\}$
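The definition of a consistent fault set can be checked mechanically. The sketch below is illustrative only: `is_consistent` and `minimum_cfs` are hypothetical names, the syndrome is assumed to be a dictionary mapping (tester, tested) pairs to outcomes, and the search is a brute-force enumeration that is practical only for very small systems.

```python
from itertools import combinations

def is_consistent(syndrome, faulty):
    """True if `faulty` is a consistent fault set for `syndrome` (PMC model)."""
    for (u, v), outcome in syndrome.items():
        if u in faulty:
            continue  # outcomes of faulty testers impose no constraint
        if outcome != (1 if v in faulty else 0):
            return False
    return True

def minimum_cfs(units, syndrome):
    """Return all consistent fault sets of minimum cardinality (brute force)."""
    for size in range(len(units) + 1):
        found = [set(c) for c in combinations(units, size)
                 if is_consistent(syndrome, set(c))]
        if found:
            return found
    return []

# Hypothetical example: two units test each other and both report a failure.
ambiguous = minimum_cfs([1, 2], {(1, 2): 1, (2, 1): 1})
```

Here `ambiguous` contains both `{1}` and `{2}`: the syndrome admits two consistent fault sets of minimum cardinality, so it does not identify the faulty unit.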
7 System-Level Diagnosis: Some Definitions
- The diagnosis algorithm outputs sets K, F and S
- K: units declared fault-free
- F: units declared faulty
- S: units declared suspect
- The diagnosis is correct if $K \subseteq V \setminus V_f$ and $F \subseteq V_f$
- The diagnosis is complete if $S = \emptyset$ (that is, $K \cup F = V$)
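The two properties above translate directly into set operations. A minimal sketch, assuming the sets K, F, S and the actual fault set are Python sets over the same units (the function names and the example values are hypothetical):

```python
def is_correct(K, F, faulty):
    """Correct: no faulty unit is declared fault-free (K lies in V minus Vf)
    and no fault-free unit is declared faulty (F lies in Vf)."""
    return K.isdisjoint(faulty) and F <= faulty

def is_complete(K, F, S, units):
    """Complete: no unit is left suspect, so K and F together cover V."""
    return not S and (K | F) == units

# Hypothetical example: unit 3 is faulty, unit 4 is left undecided.
units = {1, 2, 3, 4}
correct = is_correct({1, 2}, {3}, faulty={3})
complete = is_complete({1, 2}, {3}, {4}, units)
```

The example diagnosis is correct but not complete, because unit 4 remains suspect.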
8 System-Level Diagnosis: Some Definitions
- One-step t-diagnosable systems
- Correct and complete diagnosis for any $V_f$ with $|V_f| \le t$
- t is the one-step diagnosability of the system
- For any syndrome, either
- There exists a unique consistent fault set of cardinality at most t, OR
- The minimum cardinality of the consistent fault sets exceeds t
- Sequentially s-diagnosable systems
- Correct diagnosis for any $V_f$ with $|V_f| \le s$
- s is the sequential diagnosability of the system
- For any syndrome, either
- The consistent fault sets of cardinality $\le s$ have a non-empty intersection, OR
- The minimum cardinality of the consistent fault sets exceeds s
9 System-Level Diagnosis: Three Problems
- Characterization problem
- Finding necessary and sufficient conditions in order to achieve the desired diagnosability in a system
- Diagnosability problem
- Given a test assignment for a system, determine its one-step and sequential diagnosability
- Diagnosis problem
- Given a system, a test assignment and a syndrome, determine a consistent fault set of minimum cardinality
10 The Characterization Problem
- Let $n = |V|$, and let d be the (minimum) in-degree of the diagnostic graph
- Necessary conditions for one-step t-diagnosability [PMC67]
- $n \ge 2t + 1$
- $d \ge t$
- These conditions are sufficient if no two units test each other [HA74]
- A general characterization of one-step t-diagnosable systems is also given in [HA74]
- $n \ge 2t + 1$
- $d \ge t$
- For each $X \subset V$ with $|X| = n - 2t + p$, $0 \le p < t$, the units of X test at least $p + 1$ units in $V \setminus X$
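For small systems, one-step t-diagnosability can be checked by brute force instead of through the characterization. The sketch below is an illustration under stated assumptions, not the method of [HA74]: it relies on the fact that, under the PMC rule, two fault sets can always be told apart exactly when some unit that is fault-free in both tests a unit that is faulty in only one of them. The function names and the example test assignment are hypothetical, and the running time is exponential.

```python
from itertools import combinations

def distinguishable(tests, units, f1, f2):
    """True if some unit fault-free in both f1 and f2 tests a unit that is
    faulty in exactly one of them (so the two syndromes must differ)."""
    both_free = units - (f1 | f2)
    differ = f1 ^ f2
    return any(u in both_free and v in differ for (u, v) in tests)

def is_one_step_t_diagnosable(units, tests, t):
    """Brute-force check over all pairs of fault sets of size <= t."""
    units = set(units)
    indegree = {v: 0 for v in units}
    for (_, v) in tests:
        indegree[v] += 1
    # Necessary conditions: n >= 2t + 1 and every unit tested by >= t units.
    if len(units) < 2 * t + 1 or min(indegree.values()) < t:
        return False
    fault_sets = [set(c) for k in range(t + 1)
                  for c in combinations(sorted(units), k)]
    return all(distinguishable(tests, units, a, b)
               for a, b in combinations(fault_sets, 2))

# Hypothetical example: 5 units, unit i tests units i+1 and i+2 (mod 5).
ring2 = [(i, (i + k) % 5) for i in range(5) for k in (1, 2)]
ring1 = [(i, (i + 1) % 5) for i in range(5)]
```

With these assignments, `is_one_step_t_diagnosable(range(5), ring2, 2)` holds, while the single ring `ring1` fails for t = 2 because each unit is tested by only one other unit.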
11 The Characterization Problem
- Sequentially Diagnosable Systems
- $n \ge 2t + 1$ is a necessary condition [PMC67]
- There exists a general characterization [HX95]
Figure: A sequentially 3-diagnosable system
Figure: A one-step 2-diagnosable system
12 The Diagnosability Problem
- One-step Diagnosability
- First solved by Sullivan [Sul84]
- The best algorithm determines the one-step diagnosability in $O(n t^{2.5})$ [RT91a]
- Sequential Diagnosability
- The problem is co-NP-complete [RT91b]
- The sequential diagnosability can be determined for several classes of graphs
13 The Diagnosis Problem
- One-Step Diagnosis
- (with no restrictions) the problem is NP-complete [MH76]
- The problem is $O(n^{2.5})$ for t-diagnosable systems [DM84]
- Step 1: Constructs the L-Graph
- $V_f$ is a minimum vertex cover of the L-Graph (finding a minimum vertex cover is NP-hard for general graphs)
14 The Diagnosis Problem
- Step 2: Finds a maximum matching in the L-Graph
- Each matched pair consists of a faulty and a fault-free unit
- Step 3: Visits the L-Graph starting from a unit not included in the matching
- The unit is fault-free
- Sequential Diagnosis
- (with no restrictions) sequential diagnosis is co-NP-complete [FK78]
- A general $O(|E|)$ heuristic has been proposed in [Man80]
- Many algorithms for several classes of graphs
15 The BGM Model
- Permanent faults
- Faulty units never produce the same (faulty) outcomes
- Sequential Diagnosis
- Diagnosis is trivial
- Sequential diagnosability is co-NP-complete [RT91]
- One-Step Diagnosis
- Necessary condition: $t \le n - 2$ [BGM76]
- Sufficient conditions are also given [BGM76]
- One-step diagnosability: $O(n t^2 / \log t)$ [RT91]
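The difference from the PMC rule can be sketched as follows. This is a hedged illustration with hypothetical function names; it assumes the usual reading of the BGM model, in which a faulty unit fails every test it undergoes, even when the tester is itself faulty, and only the test of a fault-free unit by a faulty tester has an arbitrary outcome.

```python
import random

def bgm_outcome(tester_faulty, tested_faulty, rng=random):
    """Outcome of one test under the BGM model (0 = pass, 1 = fail)."""
    if tested_faulty:
        return 1  # a faulty unit never passes a test
    if tester_faulty:
        return rng.choice([0, 1])  # faulty tester, fault-free unit: arbitrary
    return 0

def surely_fault_free(syndrome):
    """Units that passed at least one test; under the rule above they
    cannot be faulty, whoever the tester was."""
    return {v for (u, v), outcome in syndrome.items() if outcome == 0}

# Hypothetical syndrome: unit 0 passes unit 1, unit 1 fails unit 2.
cleared = surely_fault_free({(0, 1): 0, (1, 2): 1})
```

Under this assumption every 0 outcome immediately clears the tested unit, which gives some intuition for why diagnosis itself is simpler in this model than in the PMC model.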
16 The Comparison Models
- Tests performed by comparison of the output of adjacent units
- The comparator can be either internal to units or external
- Models with reliable comparators
- Two faulty units never produce the same outcomes [Mal80]
- Two faulty units may produce the same outcomes [CH81]
- [MM81]: the comparator is an external unit subject to faults
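The two models with reliable comparators differ only in the case where both compared units are faulty. A minimal sketch, with a hypothetical function name and the citation keys used as model labels; it assumes that two fault-free units always agree and that a faulty unit always disagrees with a fault-free one:

```python
import random

def comparison_outcome(a_faulty, b_faulty, model="Mal80", rng=random):
    """Outcome of comparing the outputs of units a and b with a reliable
    comparator (0 = outputs agree, 1 = outputs differ)."""
    if not a_faulty and not b_faulty:
        return 0
    if a_faulty and b_faulty and model == "CH81":
        # Two faulty units may happen to produce the same output.
        return rng.choice([0, 1])
    # At least one faulty unit; in the Mal80 reading two faulty
    # units never produce the same output.
    return 1
```

The [MM81] setting, where the comparator itself may be faulty, is not covered by this sketch.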
17 Distributed Diagnosis
- Drops the hypothesis of a centralized and reliable diagnoser
- First introduced in [KR80]
- The diagnosis is performed by a distributed algorithm
- Units perform tests on adjacent units
- Units exchange diagnostic information with their neighbors
- Fault-free units accept diagnostic information only from fault-free neighbors
- Same invalidation rule as the PMC model
- Faults do not occur during diagnosis
- This hypothesis was dropped in [KR81]: when a unit u receives new diagnostic information from v, it tests v and accepts the information only if v is fault-free
- Optimal diagnosis algorithm in terms of number of tests, messages and diagnosis latency [RDZ95]
- Other models will be presented in the forthcoming seminars
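The acceptance rule described above can be illustrated with a small simulation. This is a simplified sketch and not the algorithm of [KR80], [KR81] or [RDZ95]: the function name and the data layout are hypothetical, faults are permanent and do not occur during the run, and only fault-free units are simulated, since information coming from a unit that failed its test is never accepted.

```python
def distributed_diagnosis(neighbors, faulty):
    """neighbors: dict mapping each unit to the adjacent units it tests.
    Returns, for each fault-free unit, the set of units it knows to be faulty."""
    fault_free = [u for u in neighbors if u not in faulty]
    # Local tests: a fault-free tester is reliable (PMC invalidation rule).
    trusted = {u: [v for v in neighbors[u] if v not in faulty]
               for u in fault_free}
    known = {u: {v for v in neighbors[u] if v in faulty} for u in fault_free}
    changed = True
    while changed:
        changed = False
        for u in fault_free:
            for v in trusted[u]:  # accept only from neighbors tested fault-free
                new = known[v] - known[u]
                if new:
                    known[u] |= new
                    changed = True
    return known

# Hypothetical example: a bidirectional ring of 5 units, unit 2 faulty.
ring = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
views = distributed_diagnosis(ring, faulty={2})
```

In the example every fault-free unit eventually learns that unit 2 is faulty, including units 0 and 4, which never test it directly.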
18 Probabilistic Diagnosis
- Introduced in [MH76]
- Will be presented in the next seminar by Paolo Santi
19 Applications
- Diagnosis of massively parallel systems
- A very large number of processing elements
- Regular interconnection structures
- Generally used for huge computations
- MTTF can be very small even if the single components are reliable
- Wafer-Scale Self-Test
- A large number of ICs
- ICs arranged in a regular pattern on the wafer
- A large number of faulty ICs
- Up to 50%
- With the current test technology, testing costs are increasing rapidly
- In the near future testing costs will exceed manufacturing costs!
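The remark on the MTTF of massively parallel systems can be made concrete with a short calculation. The sketch assumes a model that the slides do not state: components fail independently with exponentially distributed lifetimes, and the system fails as soon as any single component fails, so the failure rates add up. The function name and the figures are hypothetical.

```python
def system_mttf(component_mttf_hours, n_components):
    """MTTF of a system of n identical components that fails at the first
    component failure (independent, exponentially distributed lifetimes)."""
    return component_mttf_hours / n_components

# Hypothetical figures: each processing element has an MTTF of
# 100,000 hours (over 11 years); the machine has 10,000 of them.
hours = system_mttf(100_000, 10_000)
```

With these figures the machine as a whole would see a failure every 10 hours on average, although each processing element taken alone is very reliable.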
20 Wafer Test: State of the Art
- ICs tested by a test tool
- A probe station and a controlling computer
- The pads of each IC are probed by the probe station
- The probe station supplies the IC with power, ground and a test sequence
- The IC output sequence is delivered to the controlling computer
- The controlling computer compares the output sequence with the expected output sequence
- Alternatively, it compares the IC outcomes with the outcomes of a golden unit
- Drawbacks
- ICs generally cannot be tested at full speed
- The test computer generally does not match the actual speed of the ICs
- The test is generally not accurate
- Limited fault coverage; tests cover mainly electrical properties
- Time required to test an entire wafer
- ICs are tested sequentially
- After each test the probe station must position the probes on the next IC
- The time to test an IC increases with its complexity
21 Wafer Test: State of the Art
22 Wafer-Scale Self-Test
- The ICs perform mutual tests
- The tests may proceed in parallel
- The tests produce binary outcomes
- The test tool collects the syndrome and executes the diagnosis algorithm
- Advantages
- The ICs undergo an intensive test before they are cut and packaged
- saves the cost of packaging faulty ICs
- IC tests are executed at the operating speed of the ICs (or at a comparable speed)
- improves test accuracy
- ICs are tested in parallel
- reduces the time needed to complete the test
23 Wafer-Scale Self-Test
24 Wafer-Scale Self-Test: Implementation Issues
- Requirements
- IC interconnection
- Comparators to perform comparisons
- number, placement, ...
- Test vectors, clock, power and ground supply to all ICs
- Syndrome collection and diagnosis algorithm
- The diagnosis algorithm should be able to diagnose a large fraction of ICs under realistic fault situations
- A naive implementation
- Large number of bus interconnections across the entire wafer
- The links would cross the IC boundaries
- Wafer-level synchronization
- Very complex and expensive design
- Therefore, for a feasible implementation we need to
- reduce the design complexity by minimizing the interconnections
- and
- drop the hypothesis of wafer synchronization
25 Conclusions
- The classical results of system-level diagnosis are unsuitable for diagnosing large, regularly interconnected systems
- The most promising applications require regular interconnection structures
- For this reason
- Research has recently focused on diagnosis of regular systems
- The forthcoming seminars will address these topics