Title: Assessing SEU Vulnerability via Circuit-Level Timing Analysis
1Assessing SEU Vulnerability via Circuit-Level
Timing Analysis
- Kypros Constantinides, Stephen Plaza, Jason Blome, Bin Zhang,
Valeria Bertacco, Scott Mahlke, Todd Austin, Michael Orshansky
- Advanced Computer Architecture Lab,
University of Michigan
- Department of Electrical and Computer Engineering,
University of Texas at Austin
2Introduction
- There is growing concern about
transient faults in combinational logic
- Numerous techniques already exist that deal with
the effects of transient faults
- Error Correction Codes (ECC)
- DIVA
- Simultaneous and Redundant Threading (SRT)
- and many others
- However, these techniques come at a cost in
performance, power, die size, and design time.
3Introduction
- Designers must trade off the reliability
provided against the implementation cost
- Inadequate soft-error protection: may be
useless due to poor reliability
- Excessive soft-error protection:
uncompetitive in cost and/or performance
- To balance this trade-off, system
designers need accurate soft-error rates (SERs)
for their designs
- The device community provides raw SERs for
devices of current technologies and projections
for devices of future technologies
- However, architecture-level and circuit-level
phenomena derate the raw SER
- Accurately assessing a design's SER requires
detailed circuit-level analysis infrastructure
4In This Work
- We introduce a high-fidelity, high-performance
simulation infrastructure for estimating
soft-error rates
- asynchronously injects voltage pulses of various
durations at the gate level
- accurately gauges detailed circuit phenomena to
model:
- fault introduction
- fault propagation
- and possible fault masking
- simulates fast enough to permit the
examination of entire workloads on complex
designs (thousands of gates)
5Soft Error Masking
- Fortunately, not all transient faults cause an
error
- Circuit and architectural phenomena can prevent a
fault from propagating to the design's output and
causing an error
- Logic masking
- Timing masking
- Electrical masking
- Microarchitectural masking
- Software masking
6Soft Error Masking
- Logic Masking: the fault is blocked by a
downstream gate whose output is completely
determined by its other inputs
- Timing Masking: the fault reaches the input of a
latch only during the period when the latch
is not sensitive to its input
- Electrical Masking: the fault's pulse is
attenuated by subsequent logic gates due to
their electrical properties, and does not affect
any latch's input
- Microarchitectural Masking: the fault alters the
value of at least one flip-flop, but the
incorrect values are overwritten without being
used in any computation affecting the design's
output
- Software Masking: the fault propagates to the
design's output but is subsequently masked by
software without affecting the application's
correct execution
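The first two masking mechanisms lend themselves to a small illustration. The Python sketch below is only a toy model: the function names, the single two-input gate, and the latching-window parameters are assumptions made for this example and are not taken from the presented infrastructure.

```python
from typing import Callable


def and_gate(a: int, b: int) -> int:
    """Two-input AND gate on 0/1 values."""
    return a & b


def logically_masked(gate: Callable[[int, int], int],
                     struck_input: int, other_input: int) -> bool:
    """Return True if flipping the struck input leaves the gate output unchanged.

    This is the logic-masking case: the output is completely determined
    by the other input.
    """
    fault_free = gate(struck_input, other_input)
    faulty = gate(struck_input ^ 1, other_input)
    return fault_free == faulty


def timing_masked(pulse_start: float, pulse_width: float,
                  window_start: float, window_end: float) -> bool:
    """Return True if the pulse never overlaps the latch's sensitive window.

    All times are fractions of one clock period (hypothetical units).
    """
    pulse_end = pulse_start + pulse_width
    return pulse_end < window_start or pulse_start > window_end
```

For example, a fault on one AND input is logically masked whenever the other input is 0, and a pulse spanning 0.1 to 0.3 of the clock period is timing-masked if the latch is only sensitive between 0.9 and 1.0.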
7Simulation Infrastructure
- Design Under Test: gate-level description of the
design (netlist)
- Fault-Exposed Model: subjected to fault injection
- Golden Model: no fault injected
- Fault Generator: injects voltage pulses of
various durations at any gate in the design and
flips the value of any flip-flop in the design
- faults are uniformly distributed in time,
location, and duration
- Fault Analyzer: monitors manifested errors and
tracks all the possible ways a fault can be
masked
- Model Stimuli: workload traces that exercise the
design under test
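As a rough illustration of the fault generator described above, the following Python sketch draws faults uniformly in time, location, and duration. The `Fault` record, the `sample_fault` helper, and the treatment of gates and flip-flops as one pool of sites are assumptions for this example; the pulse widths reuse the duration classes listed on the statistical-model slide.

```python
import random
from dataclasses import dataclass
from typing import Sequence

# Pulse widths as fractions of the clock period (duration classes from the deck).
PULSE_FRACTIONS = (0.2, 0.4, 0.6, 0.8, 1.0)


@dataclass(frozen=True)
class Fault:
    time: float   # injection time in clock cycles; fractional, i.e. asynchronous
    site: str     # name of the struck gate or flip-flop
    kind: str     # "pulse" for combinational gates, "flip" for flip-flops
    width: float  # pulse width as a fraction of the clock period (0.0 for flips)


def sample_fault(gates: Sequence[str], flops: Sequence[str],
                 n_cycles: int, rng: random.Random) -> Fault:
    """Draw one fault uniformly over time, location, and pulse duration."""
    time = rng.uniform(0.0, n_cycles)
    index = rng.randrange(len(gates) + len(flops))
    if index < len(gates):
        return Fault(time, gates[index], "pulse", rng.choice(PULSE_FRACTIONS))
    return Fault(time, flops[index - len(gates)], "flip", 0.0)
```

In a setup like the one on this slide, each sampled fault would be applied to the fault-exposed model only, and its outputs compared against the golden model running the same workload trace.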
8Statistical Model for Transient Faults
- Pulse-based model for transient faults caused by
energetic particle strikes
- Faults injected into combinational logic are
classified based on their duration
- 20%, 40%, 60%, 80%, and 100% of the design's clock
period
- Faults injected into sequential elements flip
their value
- The arrival rate of each type of fault is modeled
by a separate random variable
- The mean inter-arrival times for each fault type
are derived from previously published data and
detailed SPICE simulations
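One way to picture "a separate random variable per fault type" is sketched below in Python. The slides do not name the distribution, so the exponential inter-arrival times (a Poisson process) are an assumption of this example, as are the function names and any mean values passed in.

```python
import random
from typing import Dict, List, Tuple


def arrival_times(mean_interarrival: float, horizon: float,
                  rng: random.Random) -> List[float]:
    """Arrival times in [0, horizon) with exponential gaps (assumed distribution)."""
    times = []
    t = rng.expovariate(1.0 / mean_interarrival)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(1.0 / mean_interarrival)
    return times


def fault_schedule(mean_by_type: Dict[str, float], horizon: float,
                   rng: random.Random) -> List[Tuple[float, str]]:
    """Merge the independent per-type arrival streams into one time-ordered list."""
    events = [(t, fault_type)
              for fault_type, mean in mean_by_type.items()
              for t in arrival_times(mean, horizon, rng)]
    return sorted(events)
```

A call such as `fault_schedule({"pulse_20": 50.0, "flip": 200.0}, 1000.0, random.Random(1))` uses placeholder means; the real values would come from the published data and SPICE simulations mentioned above.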
9Design Under Test CMP Switch
- As the design under test, we chose a single-chip
multiprocessor (CMP) interconnection switch
(baseline provided by Li-Shiuan Peh)
- Much less complex than a microprocessor yet not
too simplistic (it includes finite state
machines, buffers, control logic, and buses)
- Wormhole switch
- pipelined at the flit level
- Specified in Verilog and
synthesized to a gate-level netlist
- 9K logic gates and
1700 sequential elements
- Realistic workload
- Communication traces derived from the TRIPS
architecture
10Characterization per Fault Type
- High microarchitectural masking
- 95% of the faults that flip a flip-flop's value
are masked
- Timing masking is significant only for faults
with small pulse durations
- Logic masking increases as the fault's pulse
duration decreases
11Derating Factor
- Derating factor = 1 / (error rate)
- i.e., a derating factor of 30 means that one of
every 30 injected faults will cause an error
(corresponds to an error rate of 3.3%)
- Average derating factor for realistic workloads
is 31 (error rate 3.2%)
- Synthetic high-utilization workload leads to a
derating factor of 12 (error rate 8.3%)
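The relationship between derating factor and error rate on this slide can be checked with a few lines of Python. The helper names are illustrative; the only inputs are the derating factors quoted on the slide.

```python
def error_rate(derating_factor: float) -> float:
    """Fraction of injected faults that cause an error (inverse of the derating factor)."""
    return 1.0 / derating_factor


def derating_factor(errors: int, injected: int) -> float:
    """Derating factor observed in a fault-injection campaign."""
    if errors == 0:
        raise ValueError("no errors observed; derating factor is unbounded")
    return injected / errors


# Derating factors quoted on the slide: the example (30), realistic (31), synthetic (12).
for df in (30, 31, 12):
    print(f"derating factor {df}: error rate {100 * error_rate(df):.1f}%")
```

This prints error rates of 3.3%, 3.2%, and 8.3%, matching the figures on the slide.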
12Failure Rate Projections
- Taking into account projections from ITRS and raw
SER estimates for future process technologies, we
make failure rate projections considering the
transient-fault derating effects
- The design architecture is kept intact for future
process technologies
- Two different designs
- one clocked with the projected clock frequencies
for microprocessors
- and one clocked with the projected clock
frequencies for interconnection networks
13Transient-fault Vulnerability per Component
- We observed that each switch component exhibits
a different vulnerability to transient faults
- Derating effects depend greatly on the
component's characteristics
- Most vulnerable component:
Switch Arbiter (12.8% error rate)
- 6% of the switch's area
- Input Controllers
dominate the switch design
- 86% of the switch's area
- The switch's vulnerability
matches that of the input
controllers
14Effects of Multi-fault Strikes
- A single strike can cause multiple faults on
neighbouring gates or flip-flops
- There is a lack of data about the frequency of
such events and of models for multi-fault strikes
on logic gates and flip-flops
- We assume that each strike causes multiple
faults
- extremely pessimistic
- even under this severe environment, the failure
rates are relatively low
15Conclusions and Directions for Future Work
- Conclusions
- For complex designs there is significant fault
masking, with derating factors as high as 30
- Soft-error derating effects depend heavily on the
design's characteristics and utilization
- Our observations suggest that the soft-error
reliability threat might have been overstated by
the computer architecture community
- Designers need to evaluate their designs'
soft-error tolerance with detailed analysis tools
that consider circuit-level derating effects, and
better trade off the protection provided against
the implementation cost
- Future Work
- Study the soft-error derating effects for several
designs with different levels of complexity and
different characteristics
- Enhance our simulation infrastructure to be able
to simulate large high-complexity systems
(millions of gates) with short simulation runs
16Questions?