Reliability - PowerPoint PPT Presentation

Slides: 18
Provided by: www2EngrA

Transcript and Presenter's Notes

Title: Reliability


1
Reliability
2
Threads for Fault Tolerance
  • Multiprocessors
  • Transient fault detection

3
Transient Faults
  • Faults that persist for a short duration
  • Cause: cosmic rays, energetic particles
    originating from outer space
  • Effect: knock off electrons, discharge capacitors
  • Solution
      • no practical absorbent for cosmic rays
      • 1 fault per 1000 computers per year (estimated
        fault rate)
  • Future is worse
      • smaller feature size, higher transistor count,
        reduced noise margin
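
As a rough illustration of what the estimated rate means at scale, the sketch below computes the expected number of transient faults in a fleet of machines. The Poisson model (independent fault arrivals) and the function names are assumptions made for this example; they do not come from the slides.

```python
import math

# Estimated rate quoted on the slide: 1 fault per 1000 computers per year.
FAULTS_PER_COMPUTER_YEAR = 1 / 1000


def expected_faults(computers: int, years: float = 1.0) -> float:
    """Expected number of transient faults across a fleet."""
    return computers * years * FAULTS_PER_COMPUTER_YEAR


def prob_at_least_one_fault(computers: int, years: float = 1.0) -> float:
    """P(at least one fault), assuming independent (Poisson) fault arrivals."""
    return 1.0 - math.exp(-expected_faults(computers, years))


if __name__ == "__main__":
    # Expected faults per year in a 20,000-machine fleet.
    print(expected_faults(20_000))
    # Chance that a 1000-machine fleet sees at least one fault in a year.
    print(prob_at_least_one_fault(1_000))
```

Under these assumptions, a single machine rarely sees a fault, but a large installation should expect several per year.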

4
Background
  • Fault-tolerant systems use redundancy to improve
    reliability
  • Time redundancy: separate executions
  • Space redundancy: separate physical copies of
    resources, e.g. dual/triple modular redundancy
    (DMR/TMR)
  • Data redundancy
      • ECC: automatic repeat request (ARQ), forward
        error correction (FEC)
      • Parity: odd/even
  • Examples
      • IBM: duplicated pipelines, spare processors, ECC
        in memories...
      • HP: DMR/TMR processors, parity/ECC in buses,
        memories...
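
The space- and data-redundancy ideas above can be sketched in a few lines. This is a minimal software model, not how IBM or HP hardware implements them; the function names are hypothetical.

```python
from collections import Counter


def dmr_check(a, b) -> bool:
    """DMR: two copies can detect a fault (they disagree) but not correct it."""
    return a == b


def tmr_vote(a, b, c):
    """TMR: majority vote over three copies masks a single faulty copy."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one copy is faulty")
    return value


def even_parity_bit(word: int) -> int:
    """Parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def parity_ok(word: int, parity_bit: int) -> bool:
    """Detects any odd number of flipped bits in `word`."""
    return even_parity_bit(word) == parity_bit


if __name__ == "__main__":
    print(dmr_check(42, 42))          # copies agree
    print(tmr_vote(7, 7, 9))          # one faulty copy is outvoted
    stored = 0b1011
    p = even_parity_bit(stored)
    corrupted = stored ^ 0b0100       # single-bit transient fault
    print(parity_ok(corrupted, p))    # parity check fails
```

Note the trade-off this illustrates: DMR and parity only detect a fault, while TMR can also mask it at the cost of a third copy.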

5
Multiprocessors Fault Detection
  • Chip-level Redundantly Threaded processor
  • Replicates register values but not memory values
  • The leading thread commits stores only after
    checking
      • Memory is guaranteed to be correct
      • Other instructions commit without checking
  • The leading thread sends committed values for:
      • branch outcomes
      • load/store values
      • store addresses

6
Sphere of Replication (SoR)
  • Logical boundary of redundant execution within a
    system
  • Components inside are protected via redundant
    execution
  • Components outside must be protected via other
    means
  • Its size matters; it affects:
      • Error detection latency
      • Stored-state size

7
Example Spheres of Replication
  • ORH-Dual: On-Chip Replicated Hardware (similar to
    IBM G5)
  • Compaq Himalaya
8
Fault Detection in Compaq Himalaya System
Replicated Microprocessors + Cycle-by-Cycle
Lockstepping
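
Cycle-by-cycle lockstepping can be pictured as comparing the outputs of two replicas at every cycle. The sketch below models each replica's outputs as a list with one value per cycle; that representation and the function name are assumptions for illustration, not a description of the Himalaya hardware.

```python
def lockstep_compare(trace_a, trace_b):
    """Return the first cycle at which two replicas disagree, or None.

    Each trace holds one output value per cycle, so index i is cycle i.
    """
    if len(trace_a) != len(trace_b):
        raise ValueError("lockstepped replicas must run the same number of cycles")
    for cycle, (out_a, out_b) in enumerate(zip(trace_a, trace_b)):
        if out_a != out_b:
            return cycle  # fault detected at this cycle
    return None


if __name__ == "__main__":
    good = [0x10, 0x20, 0x30]
    bad = [0x10, 0x24, 0x30]  # replica B suffers a bit flip in cycle 1
    print(lockstep_compare(good, good))
    print(lockstep_compare(good, bad))
```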
9
Fault Detection via Simultaneous Multithreading
(SMT)
Replicated Microprocessors + Cycle-by-Cycle
Lockstepping
10
Concept
  • SMT improves the performance of a processor by
      • allowing independent threads to execute
        simultaneously
      • doing so in different functional units
  • Redundant Multithreading (RMT)
      • leverages SMT's properties to allow fault
        detection for microprocessors
      • runs two copies of the same program as
        independent threads
      • compares their outputs and initiates recovery in
        case of mismatch
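
The RMT idea of running two copies and comparing their outputs can be sketched in software. Real RMT runs the copies as simultaneous hardware threads; this sequential model, the re-execution recovery policy, and the names `run_redundantly` and `make_faulty_adder` are all assumptions made for illustration.

```python
def run_redundantly(program, inputs, max_retries=2):
    """Run `program` twice per attempt; accept the result only if both copies agree."""
    for _ in range(max_retries + 1):
        leading = program(inputs)
        trailing = program(inputs)
        if leading == trailing:
            return leading
        # Mismatch: assume a transient fault and recover by re-executing.
    raise RuntimeError("outputs still disagree after retries")


def make_faulty_adder(faulty_call):
    """Hypothetical workload: sums its inputs, flipping one bit on the given call."""
    calls = 0

    def add(inputs):
        nonlocal calls
        calls += 1
        result = sum(inputs)
        if calls == faulty_call:
            result ^= 1 << 3  # injected single-bit transient fault
        return result

    return add


if __name__ == "__main__":
    # The first execution is corrupted, the mismatch is caught, and the
    # retry produces two matching results.
    print(run_redundantly(make_faulty_adder(faulty_call=1), [1, 2, 3]))
```

Comparison only detects faults that make the two copies differ; a fault that corrupts both copies identically would pass unnoticed.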

11
Input Replication
  • Load Value Queue (LVQ)
      • Keeps both threads on the same path despite I/O
        or multiprocessor (MP) writes
      • Out-of-order load issue possible
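
A minimal sketch of input replication through a load value queue: the leading thread reads memory and records each load, and the trailing thread takes its load values from the queue rather than from memory. The class and method names are hypothetical, and the strict FIFO order is a simplification of this sketch.

```python
from collections import deque


class LoadValueQueue:
    """Forwards leading-thread load values to the trailing thread."""

    def __init__(self):
        self._entries = deque()

    def leading_load(self, memory, address):
        """Leading thread: read memory and record (address, value)."""
        value = memory[address]
        self._entries.append((address, value))
        return value

    def trailing_load(self, address):
        """Trailing thread: reuse the recorded value instead of reading memory."""
        lead_address, value = self._entries.popleft()
        if lead_address != address:
            raise RuntimeError("load address mismatch: fault detected")
        return value


if __name__ == "__main__":
    memory = {0x10: 5}
    lvq = LoadValueQueue()
    lead = lvq.leading_load(memory, 0x10)
    memory[0x10] = 99  # another processor or an I/O device overwrites the location
    trail = lvq.trailing_load(0x10)
    print(lead == trail)  # both threads saw the same input
```

Had the trailing thread read memory directly, it would have seen 99 and diverged from the leading thread without any fault having occurred.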

12
Output Comparison
Compare and validate output before sending it
outside the SoR
13
Store Queue Comparator (STQ)
  • Store Queue Comparator
      • Compares store outputs bound for the data cache
      • Catches faults before they propagate to the rest
        of the system
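
The comparator's behavior can be sketched as follows: a leading-thread store waits in the queue until the matching trailing-thread store arrives, and only a store whose address and data both match is released to the data cache. The class name, the dict standing in for the cache, and the in-order matching are assumptions of this sketch.

```python
from collections import deque


class StoreComparator:
    """Holds leading-thread stores until the trailing thread confirms them."""

    def __init__(self):
        self._pending = deque()  # leading stores awaiting verification

    def leading_store(self, address, value):
        """Leading thread: buffer the store; nothing reaches the cache yet."""
        self._pending.append((address, value))

    def trailing_store(self, address, value, cache):
        """Trailing thread: compare with the oldest leading store, then release it."""
        if not self._pending:
            raise RuntimeError("trailing store has no matching leading store")
        lead = self._pending.popleft()
        if lead != (address, value):
            raise RuntimeError(
                f"store mismatch: leading {lead} vs trailing {(address, value)}"
            )
        cache[address] = value  # verified, so it may leave the sphere of replication


if __name__ == "__main__":
    cache = {}
    stq = StoreComparator()
    stq.leading_store(0x20, 7)
    print(cache)  # still empty: the store has not been verified
    stq.trailing_store(0x20, 7, cache)
    print(cache)  # the verified store is now visible
```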

14
Store Queue Comparator (contd)
  • Extends residence time of leading-thread stores
      • Size constrained by cycle time goal
      • Base CPU statically partitions single queue among
        threads
      • Potential solution: per-thread store queues
  • Deadlock if matching trailing store cannot commit
      • Several small but crucial changes to avoid this

15
Branch Outcome Queue (BOQ)
  • Branch Outcome Queue
      • Forwards leading-thread branch targets to trailing
        fetch
      • 100% prediction accuracy in absence of faults
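
A small model of the branch outcome queue: the leading thread pushes the committed target of each branch, and the trailing thread's fetch uses those targets as its predictions. The names are hypothetical, and treating a disagreement as a detected fault is an assumption of this sketch; the slide does not say how a real design reacts to such a mismatch.

```python
from collections import deque


class BranchOutcomeQueue:
    """FIFO of committed leading-thread branch outcomes."""

    def __init__(self):
        self._outcomes = deque()

    def push(self, branch_pc, target_pc):
        """Leading thread: record the committed target of a branch."""
        self._outcomes.append((branch_pc, target_pc))

    def predict(self, branch_pc):
        """Trailing fetch: use the leading thread's outcome as the prediction."""
        lead_pc, target_pc = self._outcomes.popleft()
        if lead_pc != branch_pc:
            raise RuntimeError("branch PC mismatch: possible fault")
        return target_pc


def trailing_branch(boq, branch_pc, computed_target):
    """Trailing thread resolves a branch and checks it against the prediction."""
    predicted = boq.predict(branch_pc)
    if predicted != computed_target:
        # Without faults the prediction is always right, so a
        # misprediction here points to a fault in one of the threads.
        raise RuntimeError("BOQ prediction differs from trailing result")
    return predicted


if __name__ == "__main__":
    boq = BranchOutcomeQueue()
    boq.push(0x100, 0x200)
    print(hex(trailing_branch(boq, 0x100, 0x200)))
```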

16
Simultaneous Redundantly Threaded Processor
(SRT)
  • SRT = SMT + Fault Detection
  • Less hardware compared to replicated
    microprocessors
      • SMT needs 5% more hardware over uniprocessor
      • SRT adds very little hardware overhead to
        existing SMT
  • Better performance than complete replication
      • better use of resources
  • Lower cost

17
Issues
  • Cycle-by-cycle output comparison and input
    replication
      • Equivalent instructions from different threads
        may execute in different cycles
      • Equivalent instructions from different threads
        might execute in a different order
  • Precise scheduling of the threads is crucial for
    optimal performance
      • Branch misprediction
      • Cache miss