Title: Reliability
1 Reliability
2 Threads for Fault Tolerance
- Multiprocessors
- Transient fault detection
3 Transient Faults
- Faults that persist for a short duration
- Cause: cosmic rays, energetic particles originating from outer space
- Effect: knock off electrons, discharge capacitors
- Solution
  - no practical absorbent for cosmic rays
- Estimated fault rate: 1 fault per 1000 computers per year
- Future is worse
  - smaller feature size, higher transistor count, reduced noise margin
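To make the estimated fault rate concrete, the arithmetic can be sketched as below. The function names are hypothetical, and treating faults as an independent Poisson process is an assumption made here for illustration, not something the slide states.

```python
import math

# Estimated rate quoted on the slide: 1 fault per 1000 computers per year.
FAULTS_PER_COMPUTER_YEAR = 1 / 1000


def expected_faults(num_computers: int, years: float) -> float:
    """Expected number of transient faults across a fleet over a period."""
    return num_computers * years * FAULTS_PER_COMPUTER_YEAR


def prob_at_least_one_fault(num_computers: int, years: float) -> float:
    """P(at least one fault), ASSUMING faults arrive as a Poisson process."""
    return 1.0 - math.exp(-expected_faults(num_computers, years))
```

Under these assumptions, a fleet of 1000 computers expects one fault per year, and the chance of seeing at least one in that year is roughly 63%.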
4 Background
- Fault-tolerant systems use redundancy to improve reliability
- Time redundancy: separate executions
- Space redundancy: separate physical copies of resources
  - DMR/TMR
- Data redundancy
  - ECC: automatic repeat request (ARQ), forward error correction (FEC)
  - Parity: odd/even
- Examples
  - IBM: duplicated pipelines, spare processors, ECC in memories...
  - HP: DMR/TMR processors, parity/ECC in buses, memories...
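The redundancy schemes listed above can be illustrated with a minimal software sketch. DMR and TMR stand for dual and triple modular redundancy. The function names are hypothetical, and real systems implement these checks in hardware rather than in software like this.

```python
from collections import Counter


def dmr_check(a, b) -> bool:
    """DMR: two copies can detect a mismatch but cannot tell which is wrong."""
    return a == b


def tmr_vote(a, b, c):
    """TMR: return the majority value of three redundant results.

    Raises ValueError when all three disagree, since no copy can be trusted.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: all three copies disagree")
    return value


def even_parity_bit(word: int) -> int:
    """Parity bit that makes the total number of 1 bits even (word >= 0)."""
    return bin(word).count("1") % 2


def parity_ok(word: int, parity: int) -> bool:
    """Check a stored word against its stored even-parity bit."""
    return even_parity_bit(word) == parity
```

For instance, `tmr_vote(7, 7, 9)` masks the single faulty copy and returns 7, while a single flipped bit in a word makes `parity_ok` return False. Parity only detects an odd number of bit flips and cannot correct them, which is why memories typically use ECC.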
5 Multiprocessor Fault Detection
- Chip-level Redundantly Threaded processor
  - Replicates register values but not memory values
- The leading thread commits stores only after checking
  - Memory is guaranteed to be correct
  - Other instructions commit without checking
- The leading thread sends committed values for:
  - branch outcomes
  - load/store values
  - store addresses
6 Sphere of Replication (SoR)
- Logical boundary of redundant execution within a system
- Components within are protected via redundant execution
- Components outside must be protected via other means
- Its size matters
  - Error detection latency
  - Stored-state size
7 Example Spheres of Replication
- ORH-Dual: On-Chip Replicated Hardware (similar to IBM G5)
- Compaq Himalaya
8 Fault Detection in Compaq Himalaya System
Replicated Microprocessors + Cycle-by-Cycle Lockstepping
9 Fault Detection via Simultaneous Multithreading (SMT)
Replicated Microprocessors + Cycle-by-Cycle Lockstepping
10 Concept
- SMT improves the performance of a processor by
  - allowing independent threads to execute simultaneously
  - doing so in different functional units
- Redundant Multithreading (RMT)
  - leverages SMT's properties to allow fault detection for microprocessors
  - runs two copies of the same program as independent threads
  - compares their outputs and initiates recovery in case of mismatch
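The RMT idea of running two copies and comparing their outputs can be sketched in software as follows. This is only an illustration: `run_redundantly`, the `corrupt` hook that simulates a transient fault, and recovery by re-execution are all assumptions made here, since the slides do not specify the recovery mechanism.

```python
def run_redundantly(program, data, corrupt=None, max_attempts=3):
    """Run `program` twice on the same input and compare the outputs.

    `corrupt`, if given, is applied to the leading copy's output on the
    first attempt only, to simulate a transient fault (hypothetical hook).
    Returns (output, retries). Recovery here is plain re-execution.
    """
    for attempt in range(max_attempts):
        leading = program(data)   # leading copy
        trailing = program(data)  # trailing copy
        if corrupt is not None and attempt == 0:
            leading = corrupt(leading)
        if leading == trailing:
            return leading, attempt
        # Mismatch: a fault was detected, so discard both results and retry.
    raise RuntimeError("outputs still mismatch after all attempts")


def double_all(xs):
    """Example program: a pure function of its input."""
    return [x * 2 for x in xs]


def flip_low_bit(out):
    """Simulated fault: flip the low bit of the first output value."""
    return [out[0] ^ 1] + out[1:]
```

With no fault, `run_redundantly(double_all, [1, 2, 3])` returns the result with zero retries. With `corrupt=flip_low_bit`, the first attempt mismatches and the second succeeds, so the transient fault is detected and never leaves the redundant region.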
11 Input Replication
- Load Value Queue (LVQ)
  - Keep threads on same path despite I/O or MP writes
  - Out-of-order load issue possible
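A minimal sketch of a load value queue, assuming a simple FIFO in which the trailing thread consumes loads in the same order the leading thread committed them. The class and method names are hypothetical, and the in-order FIFO is a simplification that does not model out-of-order load issue.

```python
from collections import deque


class LoadValueQueue:
    """FIFO of (address, value) pairs recorded by the leading thread."""

    def __init__(self):
        self._entries = deque()

    def leading_load(self, memory, addr):
        """Leading thread: read memory and record what it saw."""
        value = memory[addr]
        self._entries.append((addr, value))
        return value

    def trailing_load(self, addr):
        """Trailing thread: take the value from the queue, not from memory."""
        if not self._entries:
            raise RuntimeError("trailing thread ran ahead of leading thread")
        expected_addr, value = self._entries.popleft()
        if expected_addr != addr:
            raise RuntimeError("fault detected: load address mismatch")
        return value
```

Because the trailing thread never reads memory directly, a write by another processor or an I/O device between the two loads cannot make the threads diverge: both see the value the leading thread observed.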
12 Output Comparison
Compare and validate output before sending it outside the SoR
13 Store Queue Comparator (STQ)
- Store Queue Comparator
- Compares outputs to data cache
- Catch faults before propagating to rest of system
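The comparison step can be sketched as below, assuming stores from both threads arrive in program order and the data cache is modelled as a plain dictionary. `StoreComparator` and its methods are hypothetical names for illustration, not the hardware design.

```python
from collections import deque


class StoreComparator:
    """Holds leading-thread stores until the trailing thread confirms them."""

    def __init__(self, cache):
        self.cache = cache        # stands in for the data cache
        self._pending = deque()   # unverified leading-thread stores

    def leading_store(self, addr, value):
        """Leading thread: buffer the store; nothing reaches the cache yet."""
        self._pending.append((addr, value))

    def trailing_store(self, addr, value):
        """Trailing thread: compare address and data, then release to cache."""
        if not self._pending:
            raise RuntimeError("no leading-thread store to compare against")
        if self._pending.popleft() != (addr, value):
            raise RuntimeError("fault detected: store mismatch")
        self.cache[addr] = value
```

A store becomes visible in the cache only after both copies agree on its address and data, so a corrupted value is caught before it can propagate to the rest of the system.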
14 Store Queue Comparator (contd)
- Extends residence time of leading-thread stores
  - Size constrained by cycle time goal
  - Base CPU statically partitions single queue among threads
  - Potential solution: per-thread store queues
- Deadlock if matching trailing store cannot commit
  - Several small but crucial changes to avoid this
15 Branch Outcome Queue (BOQ)
- Branch Outcome Queue
  - Forward leading-thread branch targets to trailing fetch
  - 100% prediction accuracy in absence of faults
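A minimal sketch of the branch outcome queue, assuming the leading thread pushes each committed branch in program order and the trailing fetch stage consumes them in the same order. The class, method names and the `(branch_pc, target_pc)` entry layout are assumptions for illustration.

```python
from collections import deque


class BranchOutcomeQueue:
    """FIFO of committed leading-thread branch outcomes."""

    def __init__(self):
        self._outcomes = deque()

    def push(self, branch_pc, target_pc):
        """Leading thread: record a committed branch and its target."""
        self._outcomes.append((branch_pc, target_pc))

    def predict(self, branch_pc):
        """Trailing fetch: use the leading outcome as the prediction.

        Returns None when the queue is empty, meaning the trailing thread
        must wait for the leading thread.
        """
        if not self._outcomes:
            return None
        expected_pc, target_pc = self._outcomes.popleft()
        if expected_pc != branch_pc:
            raise RuntimeError("fault detected: threads on different paths")
        return target_pc
```

Since the prediction is the leading thread's actual committed target, the trailing thread only mispredicts when a fault has made the two threads disagree.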
16 Simultaneous Redundantly Threaded Processor (SRT)
- SRT = SMT + Fault Detection
- Less hardware compared to replicated microprocessors
  - SMT needs 5% more hardware over uniprocessor
  - SRT adds very little hardware overhead to existing SMT
- Better performance than complete replication
  - better use of resources
- Lower cost
17 Issues
- Cycle-by-cycle output comparison and input replication
  - Equivalent instructions from different threads may execute in different cycles
  - Equivalent instructions from different threads might execute in different order
- Precise scheduling of the threads crucial for optimal performance
  - Branch misprediction
  - Cache miss