Title: Transient Fault Detection and Recovery via Simultaneous Multithreading
1. Transient Fault Detection and Recovery via Simultaneous Multithreading
2. AGENDA
- Introduction & Motivation
- SMT, SRT & SRTR
- Fault Detection via SMT (SRT)
- Fault Recovery via SMT (SRTR)
- Conclusion
3. INTRODUCTION
- Transient faults
  - Faults that persist for a short duration
  - Caused by cosmic rays (e.g., neutrons) charging and/or discharging internal nodes of logic or SRAM cells, and by high-frequency crosstalk
- Solution?
  - No practical way to absorb cosmic rays
  - Estimated fault rate: 1 fault per 1000 computers per year
- The future is worse
  - Smaller feature sizes, reduced voltages, higher transistor counts, reduced noise margins
4. INTRODUCTION
- Fault-tolerant systems use redundancy to improve reliability
  - Time redundancy: separate executions
  - Space redundancy: separate physical copies of resources (DMR/TMR)
  - Data redundancy: ECC, parity
5. MOTIVATION
- Simultaneous multithreading (SMT) improves processor performance by allowing multiple independent threads to execute simultaneously (in the same cycle) in different functional units
- Use the replication the different threads provide to run two copies of the same program, so that errors can be detected
6. MOTIVATION
(Figure: replicated microprocessors with cycle-by-cycle lockstepping)
7. MOTIVATION
(Figure: replicated threads with cycle-by-cycle lockstepping?)
8. MOTIVATION
- Less hardware compared to replicated microprocessors
  - SMT needs only about 5% more hardware than a uniprocessor
  - SRT adds very little hardware overhead to an existing SMT
- Better performance than complete replication
  - Better use of resources
- Lower cost
  - Avoids complete replication
  - SRT can leverage the large market volume of SMT
9. MOTIVATION - CHALLENGES
- Cycle-by-cycle output comparison and input replication (cycle-by-cycle lockstepping) is hard for threads
  - Equivalent instructions from different threads might execute in different cycles
  - Equivalent instructions from different threads might execute in a different order with respect to other instructions in the same thread
  - Precise scheduling of the threads is crucial
    - Branch mispredictions
    - Cache misses
10. SMT, SRT & SRTR
(Figure: simultaneous multithreading (SMT))
11. SMT, SRT & SRTR
- SRT: Simultaneous and Redundantly Threaded processor
  - SRT = SMT + fault detection
- SRTR: Simultaneous and Redundantly Threaded processor with Recovery
  - SRTR = SRT + fault recovery
12. Fault Detection via SMT - SRT
- Sphere of Replication (SoR)
- Output comparison
- Input replication
- Performance optimizations for SRT
- Simulation results
13. SRT - Sphere of Replication (SoR)
- Logical boundary of redundant execution within a system
- Components inside the sphere are protected against faults using replication
- External components must use other means of fault tolerance (parity, ECC, etc.)
- Its size matters
  - Error detection latency
  - Stored-state size
14. SRT - Sphere of Replication (SoR) for SRT
- Excludes the instruction and data caches
- Alternate SoRs are possible (e.g., excluding the register file)
15. OUTPUT COMPARISON
- Compare and validate outputs before sending them outside the SoR
  - Catch faults before they propagate to the rest of the system
- No need to compare every instruction: an incorrect value caused by a fault propagates through computations and is eventually consumed by a store, so checking only stores suffices
- Check:
  - Address and data of stores from the redundant threads (both comparison and validation, at commit time)
  - Address of uncached loads from the redundant threads
  - Address of cached loads from the redundant threads is not required
- Other output comparisons depend on the boundary of the SoR
16. OUTPUT COMPARISON - Store Queue
- The store queue becomes a bottleneck if it is shared between threads
- Separate per-thread store queues boost performance
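The store-checking idea above can be sketched in a few lines. This is an illustrative software model (class and method names are my own, not from the papers): the leading thread's stores wait in a queue until the trailing thread produces the matching store, and a value escapes the SoR only if address and data agree.

```python
from collections import deque


class StoreChecker:
    """Sketch of SRT output comparison: a store leaves the sphere of
    replication only after address and data from both redundant
    threads have been compared."""

    def __init__(self):
        # Stores from the leading thread awaiting their trailing match.
        self.pending = deque()

    def leading_store(self, addr, data):
        self.pending.append((addr, data))

    def trailing_store(self, addr, data):
        lead_addr, lead_data = self.pending.popleft()
        if (addr, data) != (lead_addr, lead_data):
            # A transient fault corrupted one copy; signal detection.
            raise RuntimeError("transient fault detected at store commit")
        return (addr, data)  # safe to send outside the SoR
```

A mismatch on either field signals detection before the faulty value reaches memory.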
17. INPUT REPLICATION
- Replicate: deliver the same input (coming from outside the SoR) to the redundant copies. To do this:
  - Instructions: assume no self-modifying code, so no check is needed
  - Cached load data
    - Active Load Address Buffer (ALAB)
    - Load Value Queue (LVQ)
  - Uncached load data
    - Synchronize the threads when comparing addresses that leave the SoR
    - When the data returns, replicate the value for the two threads
  - External interrupts
    - Stall the leading thread and deliver the interrupt synchronously, or
    - Record the interrupt delivery point and deliver it later
18. INPUT REPLICATION - Active Load Address Buffer (ALAB)
- Delays a cache block's replacement or invalidation until after the retirement of the trailing load
- A counter tracks the trailing thread's outstanding loads
- When a cache block is about to be replaced:
  - The ALAB is searched for an entry matching the block's address
  - If the counter != 0:
    - Do not replace or invalidate until the trailing thread is done
    - Set the pending-invalidate bit
  - Else replace/invalidate
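The counter-and-pending-invalidate protocol above can be sketched as follows. This is a simplified software model under assumptions of my own (one entry per block address, no capacity limit on the buffer): a block pinned by outstanding trailing loads defers its replacement until the counter drains.

```python
class ALABEntry:
    def __init__(self):
        self.count = 0                 # leading loads not yet matched by trailing loads
        self.pending_invalidate = False


class ALAB:
    """Sketch of the Active Load Address Buffer: a cache block cannot be
    replaced while the trailing thread still has outstanding loads to it."""

    def __init__(self):
        self.entries = {}              # block address -> ALABEntry

    def leading_load(self, block):
        self.entries.setdefault(block, ALABEntry()).count += 1

    def trailing_load(self, block):
        e = self.entries[block]
        e.count -= 1
        if e.count == 0 and e.pending_invalidate:
            del self.entries[block]    # deferred replacement may now proceed

    def try_replace(self, block):
        e = self.entries.get(block)
        if e is not None and e.count != 0:
            e.pending_invalidate = True
            return False               # replacement delayed
        self.entries.pop(block, None)
        return True                    # block may be replaced/invalidated
```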
19. INPUT REPLICATION - Load Value Queue (LVQ)
- A simpler alternative to the ALAB
- Pre-designated leading and trailing threads
- Protected by ECC
(Figure: the leading thread fills the LVQ; the trailing thread reads from it)
20. INPUT REPLICATION - Load Value Queue (LVQ)
- Advantages over the ALAB
  - Reduces pressure on the data cache ports
  - Accelerates detection of faulty load addresses
  - Simpler design
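The LVQ idea can be sketched as a simple in-order queue (an illustrative model, not the hardware design): the leading thread performs the real cache access and enqueues (address, value); the trailing thread consumes entries in program order, which both replicates the input and checks the load address early.

```python
from collections import deque


class LoadValueQueue:
    """Sketch of the LVQ: only the leading thread touches the data cache;
    the trailing thread gets its load values from the queue."""

    def __init__(self):
        self.q = deque()

    def leading_load(self, addr, value):
        # 'value' is what the leading thread read from the data cache.
        self.q.append((addr, value))

    def trailing_load(self, addr):
        lead_addr, value = self.q.popleft()
        if addr != lead_addr:
            # Address mismatch is detected here, before the value is used.
            raise RuntimeError("fault detected in load address")
        return value
```

Because the trailing thread never accesses the cache, port pressure drops, and an address fault is caught at the queue rather than after a wrong value propagates.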
21. Performance Optimizations for SRT
- Idea: use one thread to improve the cache and branch-prediction behavior of the other thread. Two techniques:
- Slack Fetch
  - Maintains a constant slack of instructions between the threads
  - Prevents the trailing thread from seeing mispredictions and cache misses
- Branch Outcome Queue (BOQ)
22. Performance Optimizations for SRT - Branch Outcome Queue (BOQ)
- Sends committed branch outcomes (branch PCs and outcomes) to the trailing thread
- In the fetch stage, the trailing thread uses the head of the queue like a branch target buffer
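The BOQ mechanism can be sketched as a queue of committed branch results (an illustrative model; a fixed 4-byte instruction size and these method names are my assumptions): the trailing thread's fetch stage pops the head and follows it like a perfect predictor, so it never mispredicts.

```python
from collections import deque

INSN_SIZE = 4  # assumed fixed instruction size for the next-PC computation


class BranchOutcomeQueue:
    """Sketch of the BOQ: the leading thread pushes committed branch
    (PC, taken, target) triples; the trailing thread's fetch stage reads
    the head to steer fetch with no mispredictions."""

    def __init__(self):
        self.q = deque()

    def commit_branch(self, pc, taken, target):
        self.q.append((pc, taken, target))

    def trailing_fetch(self, pc):
        head_pc, taken, target = self.q.popleft()
        # The redundant threads fetch the same instruction stream,
        # so the head entry must correspond to this branch PC.
        assert head_pc == pc
        return target if taken else pc + INSN_SIZE
```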
23. Simulation Results
- Simulation environment
  - Modified SimpleScalar sim-outorder
  - Long front-end pipeline because of the out-of-order nature and SMT
  - Simple approximation of a trace cache
  - 11 SPEC95 benchmarks
24. Simulation Results
- ORH: On-Chip Replicated Hardware
  - ORH-Dual -> two pipelines, each with half the resources
  - SMT-Dual -> replicated threads with no detection hardware
25. Simulation Results - Slack Fetch & Branch Outcome Queue
- Up to 27% performance improvement for SF, BOQ, and SF+BOQ
- Performance is better with a slack of 256 instructions than with 32 or 128
  - Prevents the trailing thread from wasting resources on speculation
26. Simulation Results - Input Replication
- Very low performance degradation with a 64-entry ALAB or LVQ
- On average, a 16-entry ALAB and a 16-entry LVQ degrade performance by 8% and 5%, respectively
27. Simulation Results - Overall
- Comparison with ORH-Dual
  - SRT processor: slack fetch of 256, BOQ with 128 entries, 64-entry store buffer, and 64-entry LVQ
  - On average 16% (maximum 29%) faster than a lockstepping processor with the same hardware
28. Fault Recovery via SMT (SRTR)
- What is wrong with SRT? A leading non-store instruction may commit before the check for a fault occurs
  - SRT relies on the trailing thread to trigger detection
  - This is acceptable in a fail-fast architecture, but a faulty instruction cannot be undone once it commits
29. Fault Recovery via SMT (SRTR) - Motivation
- In SRT, a leading instruction may commit before the check for faults occurs, relying on the trailing thread to trigger detection
- In contrast, SRTR must not allow any leading instruction to commit before it is checked
- SRTR uses the time between completion and commit of a leading instruction, checking the results as soon as the trailing instruction completes
  - In SPEC95, complete-to-commit takes about 29 cycles
- This short slack has some implications
  - The leading thread provides branch predictions
  - The StB, LVQ, and BOQ need to handle mispredictions
30. Fault Recovery via SMT (SRTR) - Motivation
- The leading thread provides the trailing thread with branch predictions instead of outcomes (as in SRT)
- A register value queue (RVQ) stores register values and other information necessary for checking instructions, avoiding bandwidth pressure on the register file
- Dependence-based checking elision (DBCE) is developed to reduce the number of checks
- Recovery uses the traditional rollback ability of modern pipelines
31. SRTR Additions to SMT
(Figure: SRTR additions to SMT: predQ = prediction queue, LVQ = load value queue, CVs = commit vectors, AL = active list, RVQ = register value queue)
32. SRTR - AL & LVQ
- Leading and trailing instructions occupy the same positions in their active lists (ALs, private to each thread)
  - They may enter their ALs and become ready to commit at different times
- The LVQ has to be modified to allow speculative loads
  - The shadow active list (SAL) holds pointers to LVQ entries
  - A trailing load might issue before the leading load
  - Branches place the LVQ tail pointer in the SAL, so the LVQ can be rolled back to that tail on a misprediction
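The tail-pointer rollback above can be sketched as a checkpointed queue (an illustrative model with invented names): a branch records the current tail, and a misprediction discards every entry enqueued after that checkpoint.

```python
class SpeculativeLVQ:
    """Sketch of the SRTR-modified LVQ: the tail pointer saved at a branch
    (held in the shadow active list) lets a misprediction squash the
    speculative entries behind it."""

    def __init__(self):
        self.buf = []                  # in-order (addr, value) entries

    def tail(self):
        # A branch stores this checkpoint in the SAL.
        return len(self.buf)

    def push(self, addr, value):
        self.buf.append((addr, value))

    def rollback(self, saved_tail):
        # Discard entries from mispredicted-path loads.
        del self.buf[saved_tail:]
```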
33. SRTR - predQ
- The leading thread places the predicted PC in the predQ
- Similar to the BOQ, but holds predictions instead of outcomes
- Using the predQ, the two threads fetch essentially the same instructions
- On detecting a misprediction, the leading thread clears the predQ
- ECC protected
34. SRTR - RVQ & CV
- SRTR checks when the trailing instruction completes
- The register value queue (RVQ) stores register values for checking, avoiding pressure on the register file
- RVQ entries are allocated when instructions enter the AL
- Pointers to the RVQ entries are placed in the SAL to facilitate their lookup
- If the check succeeds, the entries in the commit vector (CV) are set to checked-ok and the instructions commit
- If the check fails, the CV entries are set to failed
  - Rollback occurs when the failed entries reach the head of the AL
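The RVQ-and-CV check can be sketched as follows (an illustrative model; the state names and indexing by AL position are my simplifications): the leading instruction deposits its value at writeback, the trailing completion compares against it, and the commit stage consults the CV entry.

```python
class RVQChecker:
    """Sketch of SRTR checking: leading results wait in the RVQ;
    trailing completion sets the commit vector to checked-ok or failed."""

    CHECKED_OK, FAILED, NOT_CHECKED = "ok", "failed", "pending"

    def __init__(self, size):
        self.rvq = [None] * size                   # leading results
        self.cv = [self.NOT_CHECKED] * size        # per-instruction check state

    def leading_writeback(self, idx, value):
        # 'idx' stands in for the shared AL position found via the SAL.
        self.rvq[idx] = value

    def trailing_complete(self, idx, value):
        ok = self.rvq[idx] == value
        self.cv[idx] = self.CHECKED_OK if ok else self.FAILED

    def can_commit(self, idx):
        # Commit only instructions whose CV entry is checked-ok;
        # a FAILED entry at the AL head would trigger rollback instead.
        return self.cv[idx] == self.CHECKED_OK
```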
35. SRTR - Pipeline
- After the leading instruction writes back its result, it enters the fault-check stage
- The leading instruction puts its value in the RVQ, using the pointer from the SAL
- The trailing instructions also use the SAL to obtain their RVQ pointers and find their leading counterparts
36. SRTR - DBCE
- The RVQ stores register values and other information necessary for checking instructions, avoiding bandwidth pressure on the register file
- Checking every instruction still puts bandwidth pressure on the RVQ
- The dependence-based checking elision (DBCE) scheme reduces the number of checks, and thereby the RVQ bandwidth demand
37. SRTR - DBCE
- Idea
  - Faults propagate through dependent instructions
  - Exploit register dependence chains so that only the last instruction in a chain uses the RVQ and has its leading and trailing values checked
38. SRTR - DBCE
- If the last instruction's check succeeds, the previous instructions in the chain commit
- If the check fails, all the instructions in the chain are marked as failed and the earliest instruction in the chain triggers a rollback
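The elision rule above reduces to a single comparison per chain. A minimal sketch (function name and return labels are my own): given the result values of the same dependence chain in both threads, comparing only the final values decides the fate of the whole chain, because a fault anywhere in the chain corrupts its last result.

```python
def dbce_check(leading_chain, trailing_chain):
    """Sketch of dependence-based checking elision: only the last value
    of a register dependence chain is compared against the RVQ."""
    if leading_chain[-1] == trailing_chain[-1]:
        return "commit-all"              # every instruction in the chain commits
    # All chain instructions are marked failed; the earliest one
    # triggers the rollback when it reaches the AL head.
    return "rollback-from-earliest"
```

For a 4-instruction chain this replaces four RVQ accesses with one, which is where the bandwidth saving comes from.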
39. SRTR - Performance
- Performance comparison between SRT & SRTR
  - Differences come from the interaction between branch mispredictions and slack
  - SRTR performs within 1-7% of SRT
40. SRTR - Performance
- SRTR's average performance peaks at a slack of 32
41. CONCLUSION
- A more efficient way to detect transient faults is presented
- The trailing thread repeats the computation performed by the leading thread, and the values produced by the two threads are compared
- Several mechanisms were defined: LVQ, ALAB, slack fetch, and BOQ
- An SRT processor can provide higher performance than an equivalently sized on-chip hardware-replicated solution
- SRT can be extended for fault recovery (SRTR)
42. REFERENCES
- T. N. Vijaykumar, Irith Pomeranz, and Karl Cheng, "Transient-Fault Recovery Using Simultaneous Multithreading," Proc. 29th Annual International Symposium on Computer Architecture (ISCA), May 2002.
- S. K. Reinhardt and S. S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading," Proc. 27th Annual International Symposium on Computer Architecture (ISCA), pages 25-36, June 2000.
- Eric Rotenberg, "AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors," Proc. 29th Fault-Tolerant Computing Symposium (FTCS), 1999.
- S. S. Mukherjee, M. Kontz, and S. K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives," Proc. 29th Annual International Symposium on Computer Architecture (ISCA), 2002.