Title: Transient Fault Detection and Recovery via Simultaneous Multithreading
1. Transient Fault Detection and Recovery via Simultaneous Multithreading
2. AGENDA
- Introduction & Motivation
- SMT, SRT & SRTR
- Fault Detection via SMT (SRT)
- Fault Recovery via SMT (SRTR)
- Conclusion
3. INTRODUCTION
- Transient faults
  - Faults that persist for a short duration
  - Caused by cosmic rays (e.g., neutrons) charging and/or discharging internal nodes of logic or SRAM cells, and by high-frequency crosstalk
- Solution?
  - No practical way to absorb cosmic rays
  - Estimated fault rate: 1 fault per 1000 computers per year
- The future is worse
  - Smaller feature sizes, reduced voltages, higher transistor counts, reduced noise margins
4. INTRODUCTION
- Fault-tolerant systems use redundancy to improve reliability
  - Time redundancy: separate executions
  - Space redundancy: separate physical copies of resources (DMR/TMR)
  - Data redundancy: ECC, parity
5. MOTIVATION
- Simultaneous multithreading (SMT) improves processor performance by allowing multiple independent threads to execute simultaneously (in the same cycle) in different functional units
- Use the replication the different threads provide to run two copies of the same program, so that errors can be detected
6. MOTIVATION
(Figure: replicated microprocessors with cycle-by-cycle lockstepping)
7. MOTIVATION
(Figure: replicated threads with cycle-by-cycle lockstepping?)
8. MOTIVATION
- Less hardware compared to replicated microprocessors
  - SMT needs only about 5% more hardware than a uniprocessor
  - SRT adds very little hardware overhead to an existing SMT
- Better performance than complete replication
  - Better use of resources
- Lower cost
  - Avoids complete replication
  - SRT can leverage the large market volume of SMT
9. MOTIVATION - CHALLENGES
- Cycle-by-cycle output comparison and input replication (cycle-by-cycle lockstepping) is hard for threads
  - Equivalent instructions from different threads might execute in different cycles
  - Equivalent instructions from different threads might execute in a different order with respect to other instructions in the same thread
  - Precise scheduling of the threads is crucial
    - Branch mispredictions
    - Cache misses
10. SMT, SRT & SRTR
(Figure: simultaneous multithreading (SMT))
11. SMT, SRT & SRTR
- SRT: Simultaneous and Redundantly Threaded processor
  - SRT = SMT + fault detection
- SRTR: Simultaneous and Redundantly Threaded processor with Recovery
  - SRTR = SRT + fault recovery
12. Fault Detection via SMT - SRT
- Sphere of Replication (SoR)
- Output comparison
- Input replication
- Performance optimizations for SRT
- Simulation results
13. SRT - Sphere of Replication (SoR)
- Logical boundary of redundant execution within a system
- Components inside the sphere are protected against faults using replication
- External components must use other means of fault tolerance (parity, ECC, etc.)
- Its size matters
  - Error detection latency
  - Stored-state size
14. SRT - Sphere of Replication (SoR) for SRT
- Excludes the instruction and data caches
- Alternate SoRs are possible (e.g., excluding the register file)
15. OUTPUT COMPARISON
- Compare and validate outputs before sending them outside the SoR
  - Catch faults before they propagate to the rest of the system
- No need to compare every instruction: an incorrect value caused by a fault propagates through computations and is eventually consumed by a store, so checking only stores suffices
- Check:
  - Address and data of stores from the redundant threads (both comparison and validation, at commit time)
  - Address of uncached loads from the redundant threads
  - Address of cached loads from the redundant threads is not required
- Other output comparisons depend on the boundary of the SoR
16. OUTPUT COMPARISON - Store Queue
- The store queue becomes a bottleneck if it is shared between threads
- Separate per-thread store queues boost performance
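The store-checking idea above can be sketched in a few lines. This is an illustrative software model (class and method names are my own, not from the papers): the leading thread's stores wait in a queue until the trailing thread produces the matching store, and a value escapes the SoR only if address and data agree.

```python
from collections import deque


class StoreChecker:
    """Sketch of SRT output comparison: a store leaves the sphere of
    replication only after address and data from both redundant
    threads have been compared."""

    def __init__(self):
        # Stores from the leading thread awaiting their trailing match.
        self.pending = deque()

    def leading_store(self, addr, data):
        self.pending.append((addr, data))

    def trailing_store(self, addr, data):
        lead_addr, lead_data = self.pending.popleft()
        if (addr, data) != (lead_addr, lead_data):
            # A transient fault corrupted one copy; signal detection.
            raise RuntimeError("transient fault detected at store commit")
        return (addr, data)  # safe to send outside the SoR
```

A mismatch on either field signals detection before the faulty value reaches memory.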
17. INPUT REPLICATION
- Replicate: deliver the same input (coming from outside the SoR) to the redundant copies. To do this:
  - Instructions: assume no self-modifying code, so no check is needed
  - Cached load data
    - Active Load Address Buffer (ALAB)
    - Load Value Queue (LVQ)
  - Uncached load data
    - Synchronize the threads when comparing addresses that leave the SoR
    - When the data returns, replicate the value for the two threads
  - External interrupts
    - Stall the leading thread and deliver the interrupt synchronously, or
    - Record the interrupt delivery point and deliver it later
18. INPUT REPLICATION - Active Load Address Buffer (ALAB)
- Delays a cache block's replacement or invalidation until after the retirement of the trailing load
- A counter tracks the trailing thread's outstanding loads
- When a cache block is about to be replaced:
  - The ALAB is searched for an entry matching the block's address
  - If the counter != 0:
    - Do not replace or invalidate until the trailing thread is done
    - Set the pending-invalidate bit
  - Else replace/invalidate
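The counter-and-pending-invalidate protocol above can be sketched as follows. This is a simplified software model under assumptions of my own (one entry per block address, no capacity limit on the buffer): a block pinned by outstanding trailing loads defers its replacement until the counter drains.

```python
class ALABEntry:
    def __init__(self):
        self.count = 0                 # leading loads not yet matched by trailing loads
        self.pending_invalidate = False


class ALAB:
    """Sketch of the Active Load Address Buffer: a cache block cannot be
    replaced while the trailing thread still has outstanding loads to it."""

    def __init__(self):
        self.entries = {}              # block address -> ALABEntry

    def leading_load(self, block):
        self.entries.setdefault(block, ALABEntry()).count += 1

    def trailing_load(self, block):
        e = self.entries[block]
        e.count -= 1
        if e.count == 0 and e.pending_invalidate:
            del self.entries[block]    # deferred replacement may now proceed

    def try_replace(self, block):
        e = self.entries.get(block)
        if e is not None and e.count != 0:
            e.pending_invalidate = True
            return False               # replacement delayed
        self.entries.pop(block, None)
        return True                    # block may be replaced/invalidated
```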
19. INPUT REPLICATION - Load Value Queue (LVQ)
- A simpler alternative to the ALAB
- Pre-designated leading and trailing threads
- Protected by ECC
(Figure: the leading thread fills the LVQ; the trailing thread reads from it)
20. INPUT REPLICATION - Load Value Queue (LVQ)
- Advantages over the ALAB
  - Reduces pressure on the data cache ports
  - Accelerates detection of faulty load addresses
  - Simpler design
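The LVQ idea can be sketched as a simple in-order queue (an illustrative model, not the hardware design): the leading thread performs the real cache access and enqueues (address, value); the trailing thread consumes entries in program order, which both replicates the input and checks the load address early.

```python
from collections import deque


class LoadValueQueue:
    """Sketch of the LVQ: only the leading thread touches the data cache;
    the trailing thread gets its load values from the queue."""

    def __init__(self):
        self.q = deque()

    def leading_load(self, addr, value):
        # 'value' is what the leading thread read from the data cache.
        self.q.append((addr, value))

    def trailing_load(self, addr):
        lead_addr, value = self.q.popleft()
        if addr != lead_addr:
            # Address mismatch is detected here, before the value is used.
            raise RuntimeError("fault detected in load address")
        return value
```

Because the trailing thread never accesses the cache, port pressure drops, and an address fault is caught at the queue rather than after a wrong value propagates.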
21. Performance Optimizations for SRT
- Idea: use one thread to improve the cache and branch-prediction behavior of the other thread. Two techniques:
- Slack Fetch
  - Maintains a constant slack of instructions between the threads
  - Prevents the trailing thread from seeing mispredictions and cache misses
- Branch Outcome Queue (BOQ)
22. Performance Optimizations for SRT - Branch Outcome Queue (BOQ)
- Sends committed branch outcomes (branch PCs and outcomes) to the trailing thread
- In the fetch stage, the trailing thread uses the head of the queue like a branch target buffer
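The BOQ mechanism can be sketched as a queue of committed branch results (an illustrative model; a fixed 4-byte instruction size and these method names are my assumptions): the trailing thread's fetch stage pops the head and follows it like a perfect predictor, so it never mispredicts.

```python
from collections import deque

INSN_SIZE = 4  # assumed fixed instruction size for the next-PC computation


class BranchOutcomeQueue:
    """Sketch of the BOQ: the leading thread pushes committed branch
    (PC, taken, target) triples; the trailing thread's fetch stage reads
    the head to steer fetch with no mispredictions."""

    def __init__(self):
        self.q = deque()

    def commit_branch(self, pc, taken, target):
        self.q.append((pc, taken, target))

    def trailing_fetch(self, pc):
        head_pc, taken, target = self.q.popleft()
        # The redundant threads fetch the same instruction stream,
        # so the head entry must correspond to this branch PC.
        assert head_pc == pc
        return target if taken else pc + INSN_SIZE
```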
23. Simulation Results
- Simulation environment
  - Modified SimpleScalar sim-outorder
  - Long front-end pipeline because of the out-of-order nature and SMT
  - Simple approximation of a trace cache
  - 11 SPEC95 benchmarks
24. Simulation Results
- ORH: On-Chip Replicated Hardware
  - ORH-Dual -> two pipelines, each with half the resources
  - SMT-Dual -> replicated threads with no detection hardware
25. Simulation Results - Slack Fetch & Branch Outcome Queue
- Up to 27% performance improvement for SF, BOQ, and SF+BOQ
- Performance is better with a slack of 256 instructions than with 32 or 128
  - Prevents the trailing thread from wasting resources on speculation
26. Simulation Results - Input Replication
- Very low performance degradation with a 64-entry ALAB or LVQ
- On average, a 16-entry ALAB and a 16-entry LVQ degrade performance by 8% and 5%, respectively
27. Simulation Results - Overall
- Comparison with ORH-Dual
  - SRT processor: slack fetch of 256, BOQ with 128 entries, 64-entry store buffer, and 64-entry LVQ
  - On average 16% (maximum 29%) faster than a lockstepping processor with the same hardware
28. Fault Recovery via SMT (SRTR)
- What is wrong with SRT? A leading non-store instruction may commit before the check for a fault occurs
  - SRT relies on the trailing thread to trigger detection
  - This is acceptable in a fail-fast architecture, but a faulty instruction cannot be undone once it commits
29. Fault Recovery via SMT (SRTR) - Motivation
- In SRT, a leading instruction may commit before the check for faults occurs, relying on the trailing thread to trigger detection
- In contrast, SRTR must not allow any leading instruction to commit before it is checked
- SRTR uses the time between completion and commit of a leading instruction, checking the results as soon as the trailing instruction completes
  - In SPEC95, complete-to-commit takes about 29 cycles
- This short slack has some implications
  - The leading thread provides branch predictions
  - The StB, LVQ, and BOQ need to handle mispredictions
30. Fault Recovery via SMT (SRTR) - Motivation
- The leading thread provides the trailing thread with branch predictions instead of outcomes (as in SRT)
- A register value queue (RVQ) stores register values and other information necessary for checking instructions, avoiding bandwidth pressure on the register file
- Dependence-based checking elision (DBCE) is developed to reduce the number of checks
- Recovery uses the traditional rollback ability of modern pipelines
31. SRTR Additions to SMT
(Figure: SRTR additions to SMT: predQ = prediction queue, LVQ = load value queue, CVs = commit vectors, AL = active list, RVQ = register value queue)
32. SRTR - AL & LVQ
- Leading and trailing instructions occupy the same positions in their active lists (ALs, private to each thread)
  - They may enter their ALs and become ready to commit at different times
- The LVQ has to be modified to allow speculative loads
  - The shadow active list (SAL) holds pointers to LVQ entries
  - A trailing load might issue before the leading load
  - Branches place the LVQ tail pointer in the SAL, so the LVQ can be rolled back to that tail on a misprediction
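The tail-pointer rollback above can be sketched as a checkpointed queue (an illustrative model with invented names): a branch records the current tail, and a misprediction discards every entry enqueued after that checkpoint.

```python
class SpeculativeLVQ:
    """Sketch of the SRTR-modified LVQ: the tail pointer saved at a branch
    (held in the shadow active list) lets a misprediction squash the
    speculative entries behind it."""

    def __init__(self):
        self.buf = []                  # in-order (addr, value) entries

    def tail(self):
        # A branch stores this checkpoint in the SAL.
        return len(self.buf)

    def push(self, addr, value):
        self.buf.append((addr, value))

    def rollback(self, saved_tail):
        # Discard entries from mispredicted-path loads.
        del self.buf[saved_tail:]
```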
33. SRTR - predQ
- The leading thread places the predicted PC in the predQ
- Similar to the BOQ, but holds predictions instead of outcomes
- Using the predQ, the two threads fetch essentially the same instructions
- On detecting a misprediction, the leading thread clears the predQ
- ECC protected
34. SRTR - RVQ & CV
- SRTR checks when the trailing instruction completes
- The register value queue (RVQ) stores register values for checking, avoiding pressure on the register file
- RVQ entries are allocated when instructions enter the AL
- Pointers to the RVQ entries are placed in the SAL to facilitate their lookup
- If the check succeeds, the entries in the commit vector (CV) are set to checked-ok and the instructions commit
- If the check fails, the CV entries are set to failed
  - Rollback occurs when the failed entries reach the head of the AL
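The RVQ-and-CV check can be sketched as follows (an illustrative model; the state names and indexing by AL position are my simplifications): the leading instruction deposits its value at writeback, the trailing completion compares against it, and the commit stage consults the CV entry.

```python
class RVQChecker:
    """Sketch of SRTR checking: leading results wait in the RVQ;
    trailing completion sets the commit vector to checked-ok or failed."""

    CHECKED_OK, FAILED, NOT_CHECKED = "ok", "failed", "pending"

    def __init__(self, size):
        self.rvq = [None] * size                   # leading results
        self.cv = [self.NOT_CHECKED] * size        # per-instruction check state

    def leading_writeback(self, idx, value):
        # 'idx' stands in for the shared AL position found via the SAL.
        self.rvq[idx] = value

    def trailing_complete(self, idx, value):
        ok = self.rvq[idx] == value
        self.cv[idx] = self.CHECKED_OK if ok else self.FAILED

    def can_commit(self, idx):
        # Commit only instructions whose CV entry is checked-ok;
        # a FAILED entry at the AL head would trigger rollback instead.
        return self.cv[idx] == self.CHECKED_OK
```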
35. SRTR - Pipeline
- After the leading instruction writes back its result, it enters the fault-check stage
- The leading instruction puts its value in the RVQ, using the pointer from the SAL
- The trailing instructions also use the SAL to obtain their RVQ pointers and find their leading counterparts
36. SRTR - DBCE
- The RVQ stores register values and other information necessary for checking instructions, avoiding bandwidth pressure on the register file
- Checking every instruction still puts bandwidth pressure on the RVQ
- The dependence-based checking elision (DBCE) scheme reduces the number of checks, and thereby the RVQ bandwidth demand
37. SRTR - DBCE
- Idea
  - Faults propagate through dependent instructions
  - Exploit register dependence chains so that only the last instruction in a chain uses the RVQ and has its leading and trailing values checked
38. SRTR - DBCE
- If the last instruction's check succeeds, the previous instructions in the chain commit
- If the check fails, all the instructions in the chain are marked as failed and the earliest instruction in the chain triggers a rollback
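The elision rule above reduces to a single comparison per chain. A minimal sketch (function name and return labels are my own): given the result values of the same dependence chain in both threads, comparing only the final values decides the fate of the whole chain, because a fault anywhere in the chain corrupts its last result.

```python
def dbce_check(leading_chain, trailing_chain):
    """Sketch of dependence-based checking elision: only the last value
    of a register dependence chain is compared against the RVQ."""
    if leading_chain[-1] == trailing_chain[-1]:
        return "commit-all"              # every instruction in the chain commits
    # All chain instructions are marked failed; the earliest one
    # triggers the rollback when it reaches the AL head.
    return "rollback-from-earliest"
```

For a 4-instruction chain this replaces four RVQ accesses with one, which is where the bandwidth saving comes from.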
39. SRTR - Performance
- Performance comparison between SRT & SRTR
  - Differences come from the interaction between branch mispredictions and slack
  - SRTR performs within 1-7% of SRT
40. SRTR - Performance
- SRTR's average performance peaks at a slack of 32
41. CONCLUSION
- A more efficient way to detect transient faults is presented
- The trailing thread repeats the computation performed by the leading thread, and the values produced by the two threads are compared
- Several mechanisms were defined: LVQ, ALAB, slack fetch, and BOQ
- An SRT processor can provide higher performance than an equivalently sized on-chip hardware-replicated solution
- SRT can be extended for fault recovery (SRTR)
42. REFERENCES
- T. N. Vijaykumar, Irith Pomeranz, and Karl Cheng, "Transient-Fault Recovery Using Simultaneous Multithreading," Proc. 29th Annual International Symposium on Computer Architecture (ISCA), May 2002.
- S. K. Reinhardt and S. S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading," Proc. 27th Annual International Symposium on Computer Architecture (ISCA), pages 25-36, June 2000.
- Eric Rotenberg, "AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors," Proc. 29th Fault-Tolerant Computing Symposium (FTCS), 1999.
- S. S. Mukherjee, M. Kontz, and S. K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives," Proc. 29th Annual International Symposium on Computer Architecture (ISCA), 2002.