Title: Rose Hill
1Dependability Benchmarking of VLSI Circuits
Cristian Constantinescu, cristian.constantinescu_at_intel.com, Intel Corporation
2Outline
- Neutron SER characterization of microprocessors
- SER scaling trends
- Experimental set-up
- Experimental Results
- Other sources of errors
- Memory intermittent faults
- Front side bus intermittent faults
- Using environmental tests as dependability benchmarking tools
- Temperature and Voltage Operating Test
- ESD Operating Test
- Summary
- Backup
- Linpack benchmark
- References
- Acknowledgement
- Neutron SER characterization: Bruce Takala, Steve Wander (LANSCE), Nelson Tam, Pat Armstrong (Intel Corp.)
- Environmental testing: John Blair, Scott Scheuneman (Intel Corp.)
3Neutron SER Characterization of Microprocessors
4Single Event Upsets
- Single event upsets (SEU) are induced by
- Alpha particles generated during radioactive decay of the package and interconnect materials
- Neutrons, protons, and pions generated by cosmic rays penetrating the atmosphere
- SEU may induce errors both in storage elements and combinational logic
- Frequency of occurrence of the particle-induced errors: soft error rate (SER)
5SER Scaling Trends
- Figure panels: SRAM SER per bit and per chip; latch SER per bit and per chip
- Assumption: SRAM/latch count increases 2x per generation
6Hadron Cascades
Figure: Main constituents of atmospheric hadron cascades
- Neutrons represent 94% of the hadrons reaching sea level
- For terrestrial applications it makes sense to benchmark the impact of neutron SER
7LANSCE Neutron Beam
- Los Alamos Neutron Science Center (LANSCE)
- Generates high-energy neutrons by spallation: a linear accelerator generates a pulsed proton beam that strikes a tungsten target
Figure: Energy dependence of the natural cosmic-ray neutron flux and the LANSCE neutron flux
8Experimental Set Up
- Itanium processor based server
- Windows NT 4.0 operating system
- Linpack benchmark
- Performs matrix computations
- Derives residues; can detect silent data corruption (SDC)
- Fission ion chamber to determine neutron fluence
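The residue check that makes Linpack useful for spotting SDC can be sketched as follows. This is a simplified illustration, not the Linpack code itself: the solver, the function names (`solve`, `residual_norm`, `looks_corrupted`), the 20x20 matrix, and the tolerance `1e-8` are all assumptions chosen for the example. The idea is only that a correct solution of a x = b leaves a residue near machine precision, while a silently corrupted result leaves a much larger one.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Pick the row with the largest pivot candidate for stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def residual_norm(a, x, b):
    """Largest absolute entry of a x - b."""
    return max(abs(sum(aij * xj for aij, xj in zip(row, x)) - bi)
               for row, bi in zip(a, b))

def looks_corrupted(a, x, b, tol=1e-8):
    """Flag a result whose residue exceeds the (hypothetical) tolerance."""
    return residual_norm(a, x, b) > tol

random.seed(0)
n = 20
a = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
b = [sum(row) for row in a]          # exact solution is a vector of ones
x = solve(a, b)
clean = looks_corrupted(a, x, b)     # residue of an undisturbed run

x_bad = x[:]
x_bad[3] += 1e-3                     # emulate a silent corruption of one value
corrupted = looks_corrupted(a, x_bad, b)
print(clean, corrupted)
```

The check needs no knowledge of where the fault occurred, which is why it fits a black box test.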
9Deriving MTTF
- MTTF = Tua / U
- Tua: duration of an equivalent experiment, taking place in unaccelerated conditions (h)
- U: total number of upsets (failures) over the duration of the experiment
- Tua = (Fcp x Nc) / Nf
- Fcp: total number of fission chamber pulses, over the duration of the experiment
- Nc: average neutron conversion factor (neutrons/fission pulse/cm²)
- Nf: cosmic-ray induced neutron flux at the desired geographical location and altitude (neutrons/cm²/h)
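The two relations above translate directly into a few lines of code. The sketch below is only an illustration: the function names are hypothetical, and the input values are round placeholders chosen to make the arithmetic easy to follow, not measurements from the experiment.

```python
def unaccelerated_hours(fcp, nc, nf):
    """Tua = (Fcp x Nc) / Nf, in hours.

    fcp: total fission chamber pulses during the experiment
    nc:  neutron conversion factor (neutrons / fission pulse / cm^2)
    nf:  natural neutron flux at the target location (neutrons / cm^2 / h)
    """
    if nf <= 0:
        raise ValueError("natural neutron flux must be positive")
    return fcp * nc / nf

def mttf_hours(fcp, nc, nf, upsets):
    """MTTF = Tua / U, in hours."""
    if upsets <= 0:
        raise ValueError("MTTF is undefined without at least one upset")
    return unaccelerated_hours(fcp, nc, nf) / upsets

# Placeholder inputs (assumed, not from the slides).
t_ua = unaccelerated_hours(fcp=2.0e6, nc=1.0e5, nf=20.0)
mttf = mttf_hours(fcp=2.0e6, nc=1.0e5, nf=20.0, upsets=50)
print(t_ua, mttf)
```

Because Nf depends on location and altitude, the same beam data yields a different MTTF for each deployment site.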
10Experimental Results
- Run Linpack benchmark for square matrices of size 800 and 1000
- Completed 40 runs
- Duration of one run: 10 s to 5 min
- Failure types
- Blue screen
- Hang
- Silent data corruption (SDC)
11Experimental Results
- Itanium processor MTTF due to neutrons, as a
function of number of runs
12Experimental Results
- MTTF confidence intervals
- SDC: one event
- Insufficient for statistical analysis
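One way to see why a single event is insufficient is to compute a confidence interval for the number of events. The slides do not say how the MTTF intervals were derived; the sketch below assumes upsets follow a Poisson process and uses the exact (Garwood) interval, solved by bisection with only the standard library. All function names are hypothetical.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def solve_rate(j, target, hi):
    """Find lam in [0, hi] with poisson_cdf(j, lam) == target.

    The CDF decreases as lam grows, so plain bisection works.
    """
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if poisson_cdf(j, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def event_count_interval(k, conf=0.95):
    """Exact two-sided interval for the expected count, given k observed events."""
    alpha = 1.0 - conf
    hi = k + 10.0 * math.sqrt(k + 1.0) + 20.0   # generous search bound
    lower = 0.0 if k == 0 else solve_rate(k - 1, 1.0 - alpha / 2.0, hi)
    upper = solve_rate(k, alpha / 2.0, hi)
    return lower, upper

def mttf_interval(t_ua_hours, k, conf=0.95):
    """Interval for MTTF = Tua / (expected count)."""
    lower, upper = event_count_interval(k, conf)
    mttf_high = math.inf if lower == 0.0 else t_ua_hours / lower
    return t_ua_hours / upper, mttf_high

low_count, high_count = event_count_interval(1)
print(low_count, high_count)
```

With one observed event the bounds on the expected count differ by a factor of more than 200, so the corresponding MTTF interval is too wide to be useful.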
13Practical Considerations
- Error handling techniques differ greatly from one manufacturer to another
- HW error detection and correction, e.g. ECC, is faster
- FW/SW implemented recovery may be overwhelmed by an accelerated test (near coincident faults scenario)
- Acceleration factor is an important variable
- Failure prediction and automatic deconfiguration may lead to misleading results
- Multiple experiments
- Beam divergence
- Beam attenuation
14Other Sources of Errors
15Memory Intermittent Faults
- Intermittent faults are induced by unstable or marginal hardware
- Intermittent shorts/opens
- Manufacturing residuals
- Timing faults
Figure: Number of memory single-bit errors reported by 193 systems over 16 months
Figure: Daily number of memory single-bit errors reported by one system over 16 months
16Front Side Bus Intermittent Faults
- Front side bus (FSB) errors
- Bursts of single-bit errors (SBE) on data path
- SBE detected and corrected (data path protected by ECC)
- Failure analysis results
- Intermittent contacts at solder joints
- Fault injection showed that similar faults
experienced by control signals induce SDC
17Using Environmental Tests as Dependability
Benchmarking Tools
18Temperature and Voltage Operating Test
- Ten systems were tested
- Workload: Linpack benchmark
Figure: temperature profile with levels at 70 °C, 25 °C, and -10 °C
- 9 systems experienced SDC
- SDC events: 134 (90.5%)
- Detected errors: 14 (9.5%)
- SDC preceded detected errors
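The figures in parentheses are shares of all recorded error events, silent and detected together. A quick check of that arithmetic, using only the two counts from the slide (the variable names are illustrative):

```python
sdc_events = 134
detected_errors = 14
total_events = sdc_events + detected_errors

# Share of each error class among all recorded events, in percent.
sdc_share = 100.0 * sdc_events / total_events
detected_share = 100.0 * detected_errors / total_events

print(total_events, round(sdc_share, 1), round(detected_share, 1))
```

Roughly nine out of ten errors in this test went unreported by the systems' own detection mechanisms.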
19Temperature and Voltage Operating Test
- Distribution of the SDC events
- Failure analysis results
- Memory controller setup and hold-time violations
20ESD Operating Test
- 4 servers from 2 manufacturers
- Workload: Linpack benchmark
- 30 test points per server
- 20 positive and 20 negative discharges per test point
- Air discharge: 4 kV to 15 kV
- Contact discharge: 8 kV
- One server experienced SDC
- 8 of the discharges targeted to the disk bay area (15 kV, air)
- First ESD operating test to reveal SDC in a commercially available server
21Summary
- The need for dependability benchmarking is increasing
- Wider use of COTS components in critical applications
- Technology is a double-edged sword
- Higher performance
- Higher rates of occurrence of transient and intermittent faults
- SDC is a real threat
- We take for granted the correctness of the computer data
- Dependability benchmarks should determine whether the circuits/systems under evaluation experience SDC
- Fault injection techniques require in-depth knowledge of the evaluated system
- Appropriate for designers and manufacturers
- Accelerated neutron tests and environmental tests are a black box approach
- Capable of unveiling SDC
- In-depth knowledge of the system under test is not required
- Linpack benchmark is available for free
- Can be used both by manufacturers and independent evaluators
22Backup
23Linpack Benchmark
- Example of Linpack output: large residues indicate SDC
24References
- Neutron SER characterization of microprocessors, Proc. of the International Conference on Dependable Systems and Networks, Yokohama, Japan, June 2005, pp. 754-759.
- Dependability benchmarking using environmental test tools, Proc. of the Reliability and Maintainability Symposium, Alexandria, VA, USA, January 2005, pp. 567-571.
- Impact of deep submicron technology on dependability of VLSI circuits, Proc. of the International Conference on Dependable Systems and Networks, Washington, DC, USA, June 2002, pp. 205-209.