Transient Fault Detection and Recovery via Simultaneous Multithreading

1
Transient Fault Detection and Recovery via
Simultaneous Multithreading
  • Nevroz SEN
  • 26/04/2007

2
AGENDA
  • Introduction & Motivation
  • SMT, SRT & SRTR
  • Fault Detection via SMT (SRT)
  • Fault Recovery via SMT (SRTR)
  • Conclusion

3
INTRODUCTION
  • Transient Faults
  • Faults that persist for a short duration
  • Caused by cosmic rays (e.g., neutrons)
  • Charge or discharge internal nodes of logic or
    SRAM cells; high-frequency crosstalk is another
    source
  • Solution
  • No practical way to absorb cosmic rays
  • Estimated fault rate of about 1 fault per 1000
    computers per year
  • The future is worse
  • Smaller feature sizes, reduced voltages, higher
    transistor counts, and reduced noise margins

4
INTRODUCTION
  • Fault tolerant systems use redundancy to improve
    reliability
  • Time redundancy: separate executions
  • Space redundancy: separate physical copies of
    resources
  • DMR/TMR
  • Data redundancy
  • ECC
  • Parity

5
MOTIVATION
  • Simultaneous Multithreading improves the
    performance of a processor by allowing multiple
    independent threads to execute simultaneously
    (same cycle) in different functional units
  • Use the replication provided by the different
    threads to run two copies of the same program,
    so that faults can be detected by comparing the
    two copies

6
MOTIVATION
Replicated Microprocessors: Cycle-by-Cycle
Lockstepping (figure)
7
MOTIVATION
Replicated Threads: Cycle-by-Cycle Lockstepping?
(figure)
8
MOTIVATION
  • Less hardware compared to replicated
    microprocessors
  • SMT needs about 5% more hardware than a
    uniprocessor
  • SRT adds very little hardware overhead to
    existing SMT
  • Better performance than complete replication
  • Better use of resources
  • Lower cost
  • Avoids complete replication
  • Market volume of SMT > SRT

9
MOTIVATION - CHALLENGES
  • Cycle-by-cycle output comparison and input
    replication (Cycle-by-Cycle Lockstepping)
  • Equivalent instructions from different threads
    might execute in different cycles
  • Equivalent instructions from different threads
    might execute in different order with respect to
    other instructions in the same thread
  • Precise scheduling of the threads is crucial,
    but is disturbed by
  • Branch mispredictions
  • Cache misses

10
SMT, SRT & SRTR
Simultaneous Multithreading (SMT)
11
SMT, SRT & SRTR
  • SRT: Simultaneous Redundantly Threaded
    Processor
  • SRT = SMT + Fault Detection
  • SRTR: Simultaneous Redundantly Threaded
    Processor with Recovery
  • SRTR = SRT + Fault Recovery

12
Fault Detection via SMT - SRT
  • Sphere of Replication (SoR)
  • Output comparison
  • Input replication
  • Performance Optimizations for SRT
  • Simulation Results

13
SRT - Sphere of Replication (SoR)
  • Logical boundary of redundant execution within a
    system
  • Components inside the sphere are protected
    against faults by replication
  • Components outside must use other means of fault
    tolerance (parity, ECC, etc.)
  • Its size matters
  • Error detection latency
  • Stored-state size

14
SRT - Sphere of Replication (SoR) for SRT
Excludes the instruction and data caches.
Alternate SoRs are possible (e.g., excluding the
register file).
15
OUTPUT COMPARISON
  • Compare and validate outputs before they leave
    the SoR, catching faults before they propagate
    to the rest of the system
  • No need to compare every instruction: an
    incorrect value caused by a fault propagates
    through computations and is eventually consumed
    by a store, so checking only stores suffices
  • Check
  • Address and data for stores from the redundant
    threads, with comparison and validation at
    commit time (a sketch follows this list)
  • Addresses for uncached loads from the redundant
    threads
  • Addresses for cached loads from the redundant
    threads need not be compared
  • Other output comparisons depend on the boundary
    of the chosen SoR
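
A minimal Python sketch of the store-checking rule above (class and
method names are illustrative, not from the papers): committed stores
from the two threads are buffered per thread, and a store is released
outside the SoR only when both copies agree on address and data.

  from collections import deque

  class StoreChecker:
      # Buffers committed stores from both threads and releases a store
      # to memory only when the leading and trailing copies match.
      def __init__(self):
          self.leading = deque()   # (address, data) from the leading thread
          self.trailing = deque()  # (address, data) from the trailing thread

      def commit_store(self, thread, address, data):
          queue = self.leading if thread == "leading" else self.trailing
          queue.append((address, data))
          return self._try_release()

      def _try_release(self):
          # A store leaves the sphere of replication only when both
          # redundant copies are present and identical.
          if self.leading and self.trailing:
              lead, trail = self.leading.popleft(), self.trailing.popleft()
              if lead != trail:
                  raise RuntimeError("transient fault detected on store")
              return lead  # (address, data) forwarded outside the SoR
          return None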

16
OUTPUT COMPARISON - Store Queue
  • Bottleneck if store queue is shared
  • Separate per-thread store queues boost
    performance

17
INPUT REPLICATION
  • Replicate: deliver the same input (coming from
    outside the SoR) to both redundant copies. To do
    this
  • Instructions: assume no self-modifying code, so
    no check is needed
  • Cached load data
  • Active Load Address Buffer
  • Load Value Queue
  • Uncached load data
  • Synchronize the threads when comparing the
    addresses that leave the SoR
  • When the data returns, replicate the value for
    the two threads
  • External interrupts
  • Stall the leading thread and deliver the
    interrupt synchronously, or
  • Record the interrupt delivery point and deliver
    it later

18
INPUT REPLICATION - Active Load Address Buffer
(ALAB)
  • Delays a cache block's replacement or
    invalidation until the retirement of the
    corresponding trailing load
  • A counter tracks the trailing thread's
    outstanding loads to each block
  • When a cache block is about to be replaced
    (sketched below)
  • The ALAB is searched for an entry matching the
    block's address
  • If the counter != 0
  • Do not replace or invalidate until the trailing
    thread is done
  • Set the pending-invalidate bit
  • Else, replace or invalidate the block
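
A minimal Python sketch of the ALAB bookkeeping described above (all
names are illustrative): a leading load increments the counter for its
block, the matching trailing load decrements it, and a replacement
request is deferred with a pending-invalidate bit while the counter is
non-zero.

  class ALABEntry:
      def __init__(self):
          self.count = 0                   # trailing loads still outstanding
          self.pending_invalidate = False  # replacement deferred until count == 0

  class ALAB:
      def __init__(self):
          self.entries = {}  # cache-block address -> ALABEntry

      def leading_load(self, block_addr):
          self.entries.setdefault(block_addr, ALABEntry()).count += 1

      def trailing_load(self, block_addr):
          entry = self.entries[block_addr]
          entry.count -= 1
          if entry.count == 0 and entry.pending_invalidate:
              del self.entries[block_addr]  # deferred invalidation completes

      def request_replace(self, block_addr):
          entry = self.entries.get(block_addr)
          if entry is None or entry.count == 0:
              self.entries.pop(block_addr, None)
              return True                   # safe to replace the block now
          entry.pending_invalidate = True   # defer until trailing loads retire
          return False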

19
INPUT REPLICATION - Load Value Queue (LVQ)
  • A simpler alternative to the ALAB
  • Pre-designated leading and trailing threads
  • Protected by ECC

(Figure: the leading thread fills the LVQ; the
trailing thread reads load values from it)
20
INPUT REPLICATION - Load Value Queue (LVQ)
  • Advantages over the ALAB (see the sketch after
    this list)
  • Reduces the pressure on data cache ports
  • Accelerates detection of faulty load addresses
  • Simpler design
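
A minimal Python sketch of the LVQ protocol (illustrative names): the
pre-designated leading thread enqueues each load's address and value,
and the trailing thread consumes entries in order, comparing addresses
and reusing the value instead of accessing the data cache.

  from collections import deque

  class LoadValueQueue:
      def __init__(self):
          self.queue = deque()  # ECC-protected in hardware

      def leading_load(self, address, value):
          # The leading thread performs the real cache access, then
          # enqueues the address and value for its redundant partner.
          self.queue.append((address, value))

      def trailing_load(self, address):
          # The trailing thread consumes loads strictly in program order.
          lead_addr, value = self.queue.popleft()
          if lead_addr != address:
              raise RuntimeError("transient fault: load address mismatch")
          return value  # no data-cache access for the trailing thread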

21
Performance Optimizations for SRT
  • Idea: use one thread to improve the cache and
    branch prediction behavior of the other thread.
    Two techniques
  • Slack Fetch
  • Maintains a constant slack of instructions
    between the threads (see the sketch after this
    list)
  • Prevents the trailing thread from seeing
    mispredictions and cache misses
  • Branch Outcome Queue (BOQ)
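
A minimal Python sketch of slack fetch (illustrative names and
counters): the trailing thread is allowed to fetch only while the
leading thread stays at least "slack" instructions ahead, so by the
time the trailer runs, branches are resolved and cache lines are warm.

  class SlackFetch:
      def __init__(self, slack=256):
          self.slack = slack
          self.leading_count = 0   # instructions fetched by the leading thread
          self.trailing_count = 0  # instructions fetched by the trailing thread

      def leading_fetched(self, n=1):
          self.leading_count += n

      def trailing_fetched(self, n=1):
          self.trailing_count += n

      def may_fetch_trailing(self):
          # Gate the trailing thread's fetch until the leading thread is
          # far enough ahead to have resolved its branches and misses.
          return self.leading_count - self.trailing_count >= self.slack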

22
Performance Optimizations for SRT - Branch
Outcome Queue (BOQ)
  • The leading thread sends its committed branch
    outcomes (branch PCs and resolved targets) to
    the trailing thread (sketched below)
  • In the fetch stage, the trailing thread uses the
    head of the queue like an ideal branch target
    buffer
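
A minimal Python sketch of the BOQ (illustrative names; the fixed
4-byte fall-through is an assumption): the leading thread enqueues each
committed branch, and the trailing thread's fetch stage reads the head
entry so it never mispredicts.

  from collections import deque

  class BranchOutcomeQueue:
      def __init__(self):
          self.queue = deque()  # (branch PC, taken?, target) in commit order

      def leading_commit_branch(self, pc, taken, target):
          self.queue.append((pc, taken, target))

      def trailing_fetch(self, pc):
          head_pc, taken, target = self.queue.popleft()
          assert head_pc == pc, "redundant threads have diverged"
          # Assumed 4-byte instructions for the fall-through path.
          return target if taken else pc + 4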

23
Simulation Results
  • Simulation Environment
  • Modified SimpleScalar sim-outorder
  • Long front-end pipeline because of out-of-order
    nature and SMT
  • Simple approximation of trace cache
  • Used 11 SPEC95 benchmarks

24
Simulation Results
  • ORH: On-Chip Replicated Hardware
  • ORH-Dual -> two pipelines, each with half the
    resources
  • SMT-Dual -> replicated threads with no detection
    hardware

25
Simulation Results - Slack Fetch & Branch Outcome
Queue
  • Up to 27% performance improvement for SF, BOQ,
    and SF + BOQ
  • Performance better with slack of 256 instructions
    over 32 or 128
  • Prevents trailing thread from wasting resources
    by speculating

26
Simulation Results - Input Replication
  • Very low performance degradation for a 64-entry
    ALAB or LVQ
  • On average, a 16-entry ALAB and a 16-entry LVQ
    degrade performance by 8% and 5%, respectively

27
Simulation Results - Overall
  • Comparison with ORH-Dual
  • SRT processor: slack of 256, BOQ with 128
    entries, 64-entry store buffer, and 64-entry LVQ
  • On average 16%, and at most 29%, faster than a
    lockstepping processor with the same hardware

28
Fault Recovery via SMT (SRTR)
  • What is wrong with SRT? A leading non-store
    instruction may commit before the check for a
    fault occurs
  • SRT relies on the trailing thread to trigger
    detection
  • However, an SRT processor still works well in a
    fail-fast architecture
  • For recovery this is not enough: a faulty
    instruction cannot be undone once it commits

29
Fault Recovery via SMT (SRTR) - Motivation
  • In SRT, a leading instruction may commit before
    the check for faults occurs, relying on the
    trailing thread to trigger detection
  • In contrast, SRTR must not allow any leading
    instruction to commit before it is checked
  • SRTR uses the time between completion and commit
    of a leading instruction, checking the result as
    soon as the trailing counterpart completes
  • In SPEC95, completion to commit takes about 29
    cycles
  • This short slack has some implications
  • The leading thread provides branch predictions
  • The StB, LVQ and BOQ need to handle
    mispredictions

30
Fault Recovery via SMT (SRTR) - Motivation
  • The leading thread provides the trailing thread
    with branch predictions instead of outcomes (as
    in SRT)
  • A register value queue (RVQ) stores register
    values and other information needed for checking
    instructions, avoiding bandwidth pressure on the
    register file
  • Dependence-based checking elision (DBCE) is
    introduced to reduce the number of checks
  • Recovery uses the traditional rollback ability
    of modern pipelines

31
SRTR Additions to SMT
(Figure) Legend: predQ = prediction queue, LVQ =
load value queue, CVs = commit vectors, AL =
active list, RVQ = register value queue
32
SRTR - AL & LVQ
  • Leading and trailing instructions occupy the
    same positions in their active lists (ALs),
    which are private to each thread
  • They may enter their ALs and become ready to
    commit at different times
  • The LVQ has to be modified to allow speculative
    loads
  • The shadow active list (SAL) holds pointers to
    LVQ entries
  • A trailing load might issue before the leading
    load
  • Branches place the LVQ tail pointer in the SAL
    (sketched below)
  • On a misprediction, the LVQ tail pointer is
    rolled back using the value saved in the SAL
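
A minimal Python sketch of the branch rollback bookkeeping just
described (illustrative names): loads allocate LVQ entries, each branch
records the current LVQ tail in the SAL, and a misprediction rolls the
tail back so speculative entries after the branch are squashed.

  class SpeculativeLVQ:
      def __init__(self):
          self.entries = []           # speculative (address, value) load entries
          self.sal_branch_tails = {}  # branch tag -> saved LVQ tail pointer

      def allocate_load(self, address, value):
          self.entries.append((address, value))
          return len(self.entries) - 1   # pointer kept in the SAL for this load

      def record_branch(self, branch_tag):
          # The branch saves the current tail so it can be restored later.
          self.sal_branch_tails[branch_tag] = len(self.entries)

      def squash_after(self, branch_tag):
          # Misprediction: discard every LVQ entry allocated after the branch.
          tail = self.sal_branch_tails.pop(branch_tag)
          del self.entries[tail:]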

33
SRTR - PredQ
  • The leading thread places its predicted PCs in
    the predQ
  • Similar to the BOQ, but holds predictions
    instead of outcomes
  • Using the predQ, the two threads fetch
    essentially the same instructions
  • When the leading thread detects a misprediction,
    it clears the predQ
  • ECC protected

34
SRTR - RVQ & CV
  • SRTR checks when the trailing instruction
    completes (a sketch follows this list)
  • The register value queue (RVQ) stores register
    values for checking, avoiding pressure on the
    register file
  • RVQ entries are allocated when instructions
    enter the AL
  • Pointers to the RVQ entries are placed in the
    SAL to make them easy to find
  • If the check succeeds, the entries in the commit
    vector (CV) are set to checked-ok and the
    instructions commit
  • If the check fails, the CV entries are set to
    failed
  • Rollback is triggered when a failed entry
    reaches the head of the AL
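
A minimal Python sketch of the RVQ/CV check (illustrative names): the
leading instruction deposits its result in its RVQ slot after
writeback, the trailing copy compares on completion, and the
commit-vector state decides whether the active-list head may commit or
must trigger a rollback.

  from enum import Enum

  class CheckState(Enum):
      UNCHECKED = 0
      CHECKED_OK = 1
      FAILED = 2

  class RVQ:
      def __init__(self, size):
          self.values = [None] * size              # leading results by RVQ pointer
          self.cv = [CheckState.UNCHECKED] * size  # commit-vector entries

      def leading_writeback(self, ptr, value):
          self.values[ptr] = value                 # fault-check stage after writeback

      def trailing_complete(self, ptr, value):
          ok = (self.values[ptr] == value)
          self.cv[ptr] = CheckState.CHECKED_OK if ok else CheckState.FAILED
          return ok

      def may_commit(self, ptr):
          # The AL head commits only once checked-ok; a FAILED entry at
          # the head triggers a pipeline rollback instead.
          return self.cv[ptr] == CheckState.CHECKED_OK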

35
SRTR - Pipeline
  • After the leading instruction writes its result
    back, it enters the fault-check stage
  • The leading instruction puts its value in the
    RVQ using the pointer from the SAL.
  • The trailing instructions also use the SAL to
    obtain their RVQ pointers and find their leading
    counterparts

36
SRTR - DBCE
  • The RVQ stores the register values and other
    information needed to check instructions,
    avoiding bandwidth pressure on the register file
  • Checking every instruction still puts bandwidth
    pressure on the RVQ
  • The dependence-based checking elision (DBCE)
    scheme reduces the number of checks and thereby
    the RVQ bandwidth demand

37
SRTR - DBCE
  • Idea
  • Faults propagate through dependent instructions
  • Exploits register dependence chains so that only
    the last instruction in a chain uses the RVQ, and
    has the leading and trailing values checked.

38
SRTR - DBCE
  • If the check on the chain's last instruction
    succeeds, the earlier instructions in the chain
    commit (a sketch follows)
  • If the check fails, all the instructions in the
    chain are marked as failed and the earliest
    instruction in the chain triggers a rollback
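
A minimal Python sketch of DBCE chain checking (illustrative names):
instructions linked by register dependences are grouped into a chain,
only the chain's tail value is compared in the RVQ, and the result is
applied to every instruction in the chain.

  class DBCEChain:
      def __init__(self):
          self.instructions = []   # instruction tags in program order

      def append(self, tag):
          self.instructions.append(tag)

      def check_tail(self, leading_value, trailing_value):
          # A fault anywhere in the chain corrupts the final value, so
          # one comparison at the tail covers the whole chain.
          if leading_value == trailing_value:
              return {tag: "checked-ok" for tag in self.instructions}
          # Otherwise mark the whole chain failed; the earliest
          # instruction triggers the rollback at the head of the AL.
          return {tag: "failed" for tag in self.instructions}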

39
SRTR - Performance
  • Detection performance: SRT vs. SRTR
  • The interaction between branch mispredictions
    and slack gives SRTR better results
  • Between 1% and 7% better than SRT

40
SRTR - Performance
  • SRTR's average performance peaks at a slack of
    32

41
CONCLUSION
  • A more efficient way to detect transient faults
    was presented
  • The trailing thread repeats the computation
    performed by the leading thread, and the values
    produced by the two threads are compared
  • Key structures were defined: LVQ, ALAB, slack
    fetch, and BOQ
  • An SRT processor can provide higher performance
    than an equivalently sized on-chip hardware
    replicated solution
  • SRT can be extended for fault recovery (SRTR)

42
REFERENCES
  • T. N. Vijaykumar, Irith Pomeranz, and Karl Cheng,
    "Transient-Fault Recovery Using Simultaneous
    Multithreading," Proc. 29th Annual Intl. Symp. on
    Computer Architecture (ISCA), May 2002.
  • S. K. Reinhardt and S. S. Mukherjee, "Transient
    Fault Detection via Simultaneous Multithreading,"
    Proc. 27th Annual Intl. Symp. on Computer
    Architecture (ISCA), pages 25-36, June 2000.
  • Eric Rotenberg, "AR-SMT: A Microarchitectural
    Approach to Fault Tolerance in Microprocessors,"
    Proc. Fault-Tolerant Computing Symposium (FTCS),
    1999.
  • S. S. Mukherjee, M. Kontz, and S. K. Reinhardt,
    "Detailed Design and Evaluation of Redundant
    Multithreading Alternatives," Intl. Symp. on
    Computer Architecture (ISCA), 2002.