Algorithm-Based Fault Tolerance for Matrix Operations

Provided by: jameshil8
Transcript and Presenter's Notes

1
Algorithm-Based Fault Tolerance for Matrix
Operations
  • Proposed by
  • Kuang-Hua Huang
  • Jacob A. Abraham

2
Problem Description
  • Achieving a fault-tolerant model that is
    algorithm-based rather than hardware-based
  • Existing techniques incur a high overhead cost
  • Error Masking (hardware redundancy)
  • Error Detection and Recovery (hardware/time
    redundancy)

3
Existing Techniques
  • Error Masking
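  • Triple Modular Redundancy (TMR): 200% redundancy
  • Quadded Logic: 300% redundancy
  • Error Detection and Recovery
  • Totally self-checking (TSC) circuits: 73-83% hardware redundancy
  • Alternating Logic: 100% time redundancy, 85% hardware redundancy
  • Recomputing with shifted operands (RESO): 100% time redundancy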
  • Watchdog processors

4
Algorithm-Based Fault Tolerance
  • Pros
  • Detects and corrects errors
  • Extremely low overhead
  • Cons
  • Not generally applicable (mostly useful for MPP
    systems)
  • Some error patterns are undetectable when more than one error occurs

5
Approach
  • Encoding of data
  • Redesign of Algorithm
  • Information must be easy to recover
  • Time overhead must be low
  • Distribution of computation steps
  • Errors caused by a single faulty processor can be detected and corrected

6
Checksum Matrices
  • Definitions
  • Column checksum matrix
  • Row checksum matrix
  • Full checksum matrix
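
The slide lists the three definitions without spelling them out. The sketch below assumes the usual construction: a column checksum matrix appends a row holding each column's sum, a row checksum matrix appends a column holding each row's sum, and a full checksum matrix does both. The helper names are illustrative and do not come from the slides.

```python
def column_checksum(a):
    """Append a row holding the sum of each column (A_c)."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(a):
    """Append to each row the sum of its elements (A_r)."""
    return [row[:] + [sum(row)] for row in a]


def full_checksum(a):
    """Column checksum of the row checksum matrix (A_f)."""
    return column_checksum(row_checksum(a))


A = [[1, 2],
     [3, 4]]
A_c = column_checksum(A)   # 3 x 2: extra row of column sums
A_r = row_checksum(A)      # 2 x 3: extra column of row sums
A_f = full_checksum(A)     # 3 x 3: both, plus the grand total in the corner
```

For an n x m information matrix, the full checksum matrix therefore carries n + m + 1 redundant elements.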

7
Theorems
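  • Matrix Multiplication
  • $A \times B = C \Rightarrow A_c \times B_r = C_f$
  • LU Decomposition
  • $C = L \times U \Rightarrow C_f = L_c \times U_r$
  • Addition
  • $A + B = C \Rightarrow A_f + B_f = C_f$
  • Scalar Multiplication
  • $c \times A_f = (c \times A)_f$
  • Transpose
  • $A_f^T = (A^T)_f$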
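
The identities above can be checked numerically on a small example. This is a minimal sketch in plain Python, assuming the subscripts c, r, and f denote the column, row, and full checksum matrices; all helper names are illustrative. LU decomposition is omitted to keep the example short.

```python
def row_checksum(a):
    return [row[:] + [sum(row)] for row in a]


def column_checksum(a):
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def full_checksum(a):
    return column_checksum(row_checksum(a))


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(c, a):
    return [[c * x for x in row] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Each flag compares "operate on encoded inputs" with "encode the plain result".
mult_ok = matmul(column_checksum(A), row_checksum(B)) == full_checksum(matmul(A, B))
add_ok = add(full_checksum(A), full_checksum(B)) == full_checksum(add(A, B))
scale_ok = scale(3, full_checksum(A)) == full_checksum(scale(3, A))
transpose_ok = transpose(full_checksum(A)) == full_checksum(transpose(A))
```

Because the checksum property survives each operation, the result arrives already encoded and can be checked without a separate encoding pass.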

8
Error Detection and Correction
  • Detection
  • Compute the sum ($S_1$) of the information elements in each
    row and column and compare it to the corresponding
    checksum ($S_2$)
  • Location
  • Intersection of the inconsistent row and column
    (those with $S_1 \neq S_2$)
  • Correction
  • Correct an erroneous information element $E$: $E \leftarrow E + (S_2 - S_1)$
  • Correct an erroneous checksum: $S_2 \leftarrow S_1$
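
The three steps above can be sketched for a full checksum matrix that contains at most one wrong element. The function names are illustrative, and integer arithmetic is assumed; with floating-point data the equality tests would need a tolerance.

```python
def inconsistent_rows(f):
    """Indices of rows whose recomputed sum (S1) differs from the checksum (S2)."""
    return [i for i, row in enumerate(f) if sum(row[:-1]) != row[-1]]


def inconsistent_cols(f):
    """Same check applied to the columns, via the transpose."""
    return inconsistent_rows([list(col) for col in zip(*f)])


def correct_single_error(f):
    """Detect, locate, and correct one erroneous element in place.

    Returns the corrected (row, column), or None if f is consistent.
    Raises ValueError if the pattern is not a single error.
    """
    rows, cols = inconsistent_rows(f), inconsistent_cols(f)
    if not rows and not cols:
        return None
    if len(rows) != 1 or len(cols) != 1:
        raise ValueError("not a single-error pattern")
    i, j = rows[0], cols[0]
    s1 = sum(f[i][:-1])      # S1: recomputed row sum
    s2 = f[i][-1]            # S2: stored row checksum
    if j == len(f[i]) - 1:
        f[i][j] = s1         # the checksum itself is wrong: S2 <- S1
    else:
        f[i][j] += s2 - s1   # E <- E + (S2 - S1)
    return (i, j)


C_f = [[19, 22, 41],
       [43, 50, 93],
       [62, 72, 134]]
damaged = [row[:] for row in C_f]
damaged[0][1] = 99                        # inject a single error
location = correct_single_error(damaged)  # locates (0, 1) and repairs it
```

The row formula also covers an error in the bottom checksum row, because that row carries its own checksum in the corner element.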

9
Mesh Connected Processor Arrays
In a mesh-connected system, each processor computes one element of the
result matrix. In a systolic array, an array of processors handles a
row of values.
[Figure labels: Array A, Array B]

10
Overhead for MPP systems
  • Mesh Connected Arrays
  • Systolic Arrays

11
Undetectable Loop Patterns
  • Certain patterns of errors cancel in the checksums and mask one another
  • Caused by faulty processors
  • A minimum number of processors is required to guarantee that
    errors are detected
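
One way such a pattern can arise is sketched below: four errors at the corners of a rectangle, with alternating signs, leave every row and column sum unchanged. The error values are hand-picked for illustration and are an assumption, not data from the slides.

```python
def inconsistent_rows(f):
    """Indices of rows whose recomputed sum differs from the stored checksum."""
    return [i for i, row in enumerate(f) if sum(row[:-1]) != row[-1]]


def inconsistent_cols(f):
    """Same check applied to the columns, via the transpose."""
    return inconsistent_rows([list(col) for col in zip(*f)])


C_f = [[19, 22, 41],
       [43, 50, 93],
       [62, 72, 134]]

corrupted = [row[:] for row in C_f]
d = 5                  # hypothetical error magnitude
corrupted[0][0] += d   # the four errors form a closed loop:
corrupted[0][1] -= d   # each affected row and column contains
corrupted[1][0] -= d   # one +d and one -d, so its sum is unchanged
corrupted[1][1] += d

undetected = (corrupted != C_f
              and not inconsistent_rows(corrupted)
              and not inconsistent_cols(corrupted))
```

Here `undetected` is true: the matrix is wrong, yet every checksum test passes.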

[Figure: matrix elements in error marked with X]

12
Uniprocessor Systems
  • In a uniprocessor system, a faulty processor can
    cause all elements of the result to be incorrect

13
Conclusion
  • Algorithm-based fault tolerance applied to matrix
    operations
  • Low ratio of redundancy
  • Ability to detect and correct errors
  • Ongoing research