Title: Radiation Tolerant Intelligent Memory Stack (RTIMS)
1 Radiation Tolerant Intelligent Memory Stack (RTIMS)
Tak-kwong Ng, Jeffrey Herath
Electronics Systems Branch, Systems Engineering Directorate
NASA Langley Research Center
t.ng_at_nasa.gov, jeffrey.a.herath_at_nasa.gov
757-864-1097 (Tak), 757-864-1098 (Jeff)
2 Agenda
- What is it?
- Goals
- Components selection
- FPGA SEU mitigation
- XTMR tools
- Status
- Future work
- Points to ponder
3 What is it?
- Radiation tolerant
- Use commercial-off-the-shelf (COTS) components
- Reprogrammable FPGA
- High performance
- Lower cost
- Pick parts with applicable mitigation techniques
- Shielding, over-current protection, triple modular redundancy, FPGA configuration scrubbing
- Intelligent
- Reprogrammable FPGA
- SDRAM controller
- Capacity to add custom logic
- Memory
- Large capacity
- SDRAM
- Stack
- 3D vs 2D, board space saving
4 Goals
- Large memory capacity
- 256 MB EDAC
- Single 3.3V power supply
- Simple interface, LVTTL compatible
- Throughput
- 32 MWord write
- 16 MWord read
- Reprogram via the JTAG interface
- Spare FPGA gate capacity for user application
- Radiation characteristics
- Total ionizing dose of 100 krad (Si) at 25 °C
- SEU best practice
- SEL threshold requirement of 60 MeV-cm²/mg
- Operating temperature -40 °C to 85 °C
5 Components Selection (1/3)
- FPGA
- Reprogrammable
- Xilinx Virtex, Virtex-II
- XQR2V1000
- Total ionizing dose of 200 krad (Si) (data sheet)
- SEL of 160 MeV-cm²/mg (data sheet)
- Current limiters
- Limited SEFI
- POR, SelectMAP, JTAG
- 1.5E-6 upsets/device/day (data sheet)
- SOFT
- Mitigation techniques: TMR, configuration scrubbing
- XQ2V1000-4BG575
- Military version for lower cost
- SEL may not be as good as XQR2V1000
- SEL of 124 MeV-cm²/mg
- Capacity of 1 M gates
- 328 Signal I/Os
6 Components Selection (2/3)
- EEPROM
- Xilinx XQR18V04
- Total ionizing dose of 10 krad (Si) (data sheet)
- 30 krad (Si) for read only (data sheet)
- SEL of 120 MeV-cm²/mg (data sheet)
- SEU of 120 MeV-cm²/mg (data sheet)
- SDRAM
- Elpida EDS5108ABTA (512Mb)
- Total ionizing dose of 50 krad (Si)
- SEL of 80 MeV-cm²/mg at 85 °C, 100 °C, 125 °C
- SEU
- Bit error rate of 6.96E-12 errors/bit-day
- SEFI error rate of 1.3E-4 errors/device-day
- Linear Regulator
- Texas Instruments TPS75715 (1.5 V LDO regulator)
- Total ionizing dose of 10 krad (Si)
- SEL of 60 MeV-cm²/mg
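
The per-bit SEU figure quoted for the SDRAM can be scaled to a per-device rate for comparison with the SEFI rate. This is a minimal sketch of that arithmetic, assuming 512 Mb means 512 x 2^20 bits and that bit upsets are independent; the variable names are illustrative and not from the slides.

```python
# Scale the quoted SDRAM bit error rate to one 512 Mb device.
# Assumptions: 512 Mb = 512 * 2**20 bits; upsets are independent per bit.
BIT_ERROR_RATE = 6.96e-12       # errors/bit-day, from the slide
BITS_PER_DEVICE = 512 * 2**20   # EDS5108ABTA density (binary megabits assumed)
SEFI_RATE = 1.3e-4              # errors/device-day, from the slide

seu_per_device_day = BIT_ERROR_RATE * BITS_PER_DEVICE
days_between_upsets = 1.0 / seu_per_device_day
upsets_per_sefi = seu_per_device_day / SEFI_RATE

print(f"bit upsets per device-day: {seu_per_device_day:.2e}")
print(f"mean days between bit upsets: {days_between_upsets:.0f}")
print(f"bit upsets per SEFI: {upsets_per_sefi:.1f}")
```

Under these assumptions the bit-upset rate of a device is well above its SEFI rate, which fits the slides' focus on EDAC, TMR and scrubbing for memory contents.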
7 Components Selection (3/3)
- Current limiters
- Maxim-IC MAX893L (1.2 A), MAX891L (0.5 A)
- Total ionizing dose SEL of 30 krad (Si)
- Power-On-Reset circuit
- Maxim-IC MAX803
- Total ionizing dose of 20 krad (Si)
- Stacking technology
- Provided by 3D Plus
8 Radiation Mitigation
- Total ionizing dose
- Local shielding
- Package shielding; thickness depends on requirement
- SEL
- Current limiting device
- SEU
- Memory contents
- TMR, EDAC
- FPGA SEU
- Configuration scrubbing, TMR
- SEFI
- Best effort to minimize the SEFI rate
- Mitigate at higher level
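
The TMR mitigation listed above rests on a 2-of-3 majority vote. A minimal software sketch of a bitwise voter follows; it is a generic illustration, not the RTIMS voter logic, and the sample values are arbitrary.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote across three copies of a word."""
    return (a & b) | (a & c) | (b & c)

good = 0xA5A5
upset = good ^ 0x0010                 # one copy with a single flipped bit
voted = majority(good, upset, good)   # the two intact copies outvote the upset
print(hex(voted))
```

Writing the voted value back to all three copies is what memory scrubbing does: it removes the upset before a second one can hit the same bit in another copy.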
9 Block Diagram
10 FPGA SEU Mitigation (1/5)
- Input
- Xilinx recommendation
- Use 3 pins per signal, connected on the board
- Bus signals use one pin per signal, add EDAC, save pins
- The sending side must generate EDAC check bits
- Pins can be used up quickly
- Implementation
- Module Interface
- Use 3 pins per signal for address/controls
- Use 1 pin per signal for Din
- EDAC is optional
- Single point failure rate increases without EDAC
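
The slides do not say which EDAC code protects the single-pin bus signals. As an illustration of what "generate EDAC check bits" involves, here is a textbook single-error-correcting, double-error-detecting Hamming code on a 4-bit value; the code choice, word width and function names are assumptions made for this sketch.

```python
def encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit SEC-DED codeword.

    Positions 1..7 form a Hamming(7,4) code (check bits at 1, 2, 4);
    position 0 holds an overall parity bit.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1
    return sum(bit << i for i, bit in enumerate(code))


def decode(word: int):
    """Return (nibble, status); status is 'ok', 'corrected' or 'double'."""
    code = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    parity_ok = (sum(code) & 1) == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        code[syndrome] ^= 1          # single error; syndrome 0 is the parity bit
        status = "corrected"
    else:
        status = "double"            # parity even but syndrome non-zero
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
print(decode(word), decode(word ^ 0b00100000), decode(word ^ 0b00100100))
```

The sender computes the check bits with `encode`; the receiver runs `decode` and can correct any single flipped bit, including one caused by an upset on an untriplicated pin.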
11 FPGA SEU Mitigation (2/5)
- Output
- Xilinx recommendation
- Use 3 pins per signal, connected on the board
- Not glitch-free
- Signal integrity
- Bus signals use one pin per signal, add EDAC, save pins
- The receiving side must also implement EDAC
- Pins can be used up quickly
- Implementation
- Module interface
- Use 3 pins per signal for controls
- Use 1 pin per signal for Dout
- EDAC is optional
- Single point failure rate increases without EDAC
12 FPGA SEU Mitigation (3/5)
- Output
- Implementation
- SDRAM interface
- Clock, Address
- 3 sets; equivalent signals are not connected together on the board
- Each set drives two SDRAMs
- Controls
- 4 sets; equivalent signals are not connected together on the board
- Two of the sets each drive two SDRAMs
- The other two sets each drive one SDRAM
- Switch EDAC/TMR configured SDRAM
13 FPGA SEU Mitigation (4/5)
- Bi-directional
- Xilinx recommendation
- Use 1 pin per signal
- Path from voter to the pin becomes a possible single point failure
- Implementation
- SDRAM Interface
- TMR configured SDRAMs
- 3 sets of data bus
- EDAC configured SDRAMs
- Use 1 pin per signal
14 FPGA SEU Mitigation (5/5)
- Implication on data integrity of the SDRAM contents
- EDAC configured SDRAMs
- 256 MB
- Output drivers and input receivers are possible single point failures
- TMR configured SDRAMs
- 128 MB
- No single point failure
- Background SDRAM content scrubbing
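
The two capacity figures can be cross-checked against the part list. The sketch below assumes six 512 Mb SDRAMs in the stack, inferred from the three clock/address sets that each drive two SDRAMs; the device count is not stated explicitly on the slides, and 1 MB is taken as 2^20 bytes.

```python
# Usable capacity under the two SDRAM configurations.
# Assumption (inferred, not stated): six 512 Mb devices in the stack.
DEVICES = 6
MB_PER_DEVICE = 512 // 8          # 512 Mb -> 64 MB

raw_mb = DEVICES * MB_PER_DEVICE
tmr_mb = raw_mb // 3              # TMR keeps three full copies of every word
edac_mb = 256                     # EDAC capacity quoted on the slide
edac_check_mb = raw_mb - edac_mb  # what remains for EDAC check bits

print(raw_mb, tmr_mb, edac_check_mb)
```

With that device count the TMR figure works out to the 128 MB on the slide, and the EDAC configuration would spend one third of the raw capacity on check bits.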
15 XTMR Tool (1/4)
- Fairly fast
- Gates utilized
- Average utilization cost of TMR is 3.2x
- RTIMS actual
- 4.3x
- Gates multiplier = 3 + 3 x (fraction of flops + fraction of I/Os)
- It is closer to 3x for a design that is mostly gates
- It is closer to 6x for a design that is mostly flops
- RTIMS actual: 36% flops
- Additional multiplier for designs with SRL16
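
A quick numeric check of the gate-multiplier rule of thumb, reading it as 3 + 3 x (fraction of flops + fraction of I/Os) and the RTIMS flop figure as 36%. Both readings are interpretations of the slide, chosen because they reproduce the 3x and 6x limits it quotes; the function name is hypothetical.

```python
def tmr_gate_multiplier(flop_fraction: float, io_fraction: float) -> float:
    """Estimated gate-count multiplier after TMR: 3 + 3 * (flops + I/Os)."""
    return 3.0 + 3.0 * (flop_fraction + io_fraction)

mostly_gates = tmr_gate_multiplier(0.0, 0.0)   # lower limit quoted on the slide
mostly_flops = tmr_gate_multiplier(1.0, 0.0)   # upper limit quoted on the slide

# RTIMS: 36% flops and a measured 4.3x; solve for the implied I/O fraction.
implied_io = (4.3 - 3.0) / 3.0 - 0.36
print(mostly_gates, mostly_flops, round(implied_io, 3))
```

Under this reading, the 4.3x RTIMS result would correspond to an I/O fraction of roughly 7%, before any extra cost from SRL16 conversion.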
16 XTMR Tool (2/4)
- Internal performance degradation
- Average performance impact of TMR is 10%
- RTIMS actual
- 20%
- 6 logic levels original
- Add a voter, 7 levels
- 15% performance impact
- Longer routing
- 3.8x gates
- 5% performance impact
17 XTMR Tool (3/4)
- I/O performance degradation
- Input Pin
- TMR
- Voters after the FF
- Lock the FF in the IOB
- No TMR on input pin
- 3 FFs after the input receiver
- Can't lock the FF in the IOB
- Performance penalty
- RTIMS actual increased from 1.8 ns to 3.6 ns
18 XTMR Tool (4/4)
- Output Pin
- Triplicate pin, tied together on board
- Add Voter before the output driver
- Glitch
- Can't lock the FF in the IOB
- Performance penalty
- Signal integrity
- Not triplicating pin
- Add voter before the output driver
- Glitch
- Can't lock the FF in the IOB
- Performance penalty
- RTIMS actual increased from 4.5 ns to 6.4 ns
19 Storage state
- Correct an SEU on storage state before the next SEU makes it uncorrectable
- Memory content
- Scrubbing
- Flop state
- Basic Xilinx flop FDCPE(PRE, D, CE, C, CLR, Q)
- Inputs of the flop are corrected
- Unless CE is active, the flop state is not corrected
- 3 minority voters and 3 OR gates can be added to force a CE when an error is detected
- Expensive to apply this universally
- For an almost static flop, the following flop is used
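
The forced-CE idea can be modelled in a few lines. This is a behavioural sketch only, assuming 1-bit flops and assuming the voted D input already carries the value the flop is meant to hold (plausible for an almost static flop); the function names are hypothetical and this is not the circuit shown on the original slide.

```python
def majority(a: int, b: int, c: int) -> int:
    """2-of-3 majority of three 1-bit values."""
    return (a & b) | (a & c) | (b & c)

def clock_edge(q, d, ce):
    """One clock edge for three redundant 1-bit flops.

    q:  current states of the three copies
    d:  voted D input (assumed to equal the intended state)
    ce: normal clock enable
    """
    voted = majority(*q)
    nxt = []
    for own in q:
        error = own ^ voted      # minority voter: this copy disagrees
        if ce | error:           # OR gate forces CE on the bad copy
            nxt.append(d)
        else:
            nxt.append(own)      # CE inactive and no error: hold state
    return nxt

print(clock_edge([1, 0, 1], d=1, ce=0))   # upset copy is reloaded
print(clock_edge([1, 1, 1], d=0, ce=0))   # no error, CE inactive: state held
```

Without the minority voter and OR gate, the upset copy in the first call would keep its wrong state until CE was next asserted.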
20 A few other things (1/4)
- Digital Clock Manager
- Use 3 DCMs for each DCM that is in the original design
- DCM is a unit
- SEU on a flop in the DCM
- Corrected by configuration scrubbing
- Reset only
- 3 counters, each counter is clocked by a DCM
- When one of the counter values is different from the other two, we know which DCM is operating differently than the others
- Each counter is TMR so that an SEU on the counter other than the clock path will not produce an error
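
The counter comparison used to single out a misbehaving DCM can be sketched as follows. Exact equality between counters is a simplification made for this example; the function name and return convention are assumptions.

```python
from typing import Optional, Tuple

def odd_dcm_out(counts: Tuple[int, int, int]) -> Optional[int]:
    """Return the index of the counter that disagrees with the other two.

    Returns None when all three agree. Raises ValueError when all three
    differ, because no single DCM can then be isolated.
    """
    a, b, c = counts
    if a == b == c:
        return None
    if a == b:
        return 2
    if a == c:
        return 1
    if b == c:
        return 0
    raise ValueError("all three counters differ")

print(odd_dcm_out((100, 100, 100)), odd_dcm_out((100, 97, 100)))
```

The returned index identifies which DCM to reset while the other two keep the design clocked.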
21 A few other things (2/4)
- Configuration scrubbing
- Similar to Virtex
- Virtex-II
- Whole configuration is loaded with 1 type 2 command
- The order of configuration loading is
- GCLK, CLB and IOB, Memory Content, and Memory Control
- Script to split the loading into three type 2 commands
- GCLK, CLB, IOB
- Memory control
- Memory content
- On power up the whole configuration is loaded
- On scrubbing, only GCLK, CLB, IOB, and memory control are loaded
22 A few other things (3/4)
- Configuration scrubbing
- Scrubber logic is TMR and it is part of the FPGA code
- Master SelectMAP for configuration, with the configuration clock continuing to run after the initial load
- Scrubber logic is clocked by the configuration clock
- The generation of the configuration clock becomes a possible single point failure
- Can switch to Slave SelectMAP and add an external oscillator
23 A few other things (4/4)
- SelectMAP interface SEFI detection
- Implement a 16x1 distributed memory as SRL16 with an initial value of all zeros
- Instruct XTMR not to convert it to registers
- Write a signature into this memory prior to configuration scrubbing
- This memory should be cleared by the reloading of the CLB during configuration scrubbing
- Read the memory content after configuration scrubbing
- A non-zero content indicates scrubbing failure
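
The signature check can be illustrated with a toy software model. Everything here is hypothetical scaffolding (the class, the signature value and the two scrub functions); it only mirrors the write, scrub, read-back sequence listed above and does not touch a real SelectMAP interface.

```python
SIGNATURE = 0xBEEF  # hypothetical non-zero 16-bit pattern

class Srl16Model:
    """Toy model of the 16x1 distributed memory (initial value all zeros)."""
    def __init__(self) -> None:
        self.bits = 0
    def write(self, value: int) -> None:
        self.bits = value & 0xFFFF
    def read(self) -> int:
        return self.bits

def scrub_succeeded(mem: Srl16Model, scrub) -> bool:
    """Write the signature, run one scrub pass, then expect all zeros."""
    mem.write(SIGNATURE)
    scrub(mem)
    return mem.read() == 0

def good_scrub(mem: Srl16Model) -> None:
    mem.bits = 0    # reloading the CLB frames restores the initial value

def stuck_scrub(mem: Srl16Model) -> None:
    pass            # models a SelectMAP SEFI: nothing is reloaded

print(scrub_succeeded(Srl16Model(), good_scrub),
      scrub_succeeded(Srl16Model(), stuck_scrub))
```

The check works because the memory is kept out of the design's normal data path: only a real configuration reload can clear the signature.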
24 Stack
[Figure: stack diagram with labels SDRAM and MISC]
25 Status
- 20 Modules
- Related paper: "Radiation Tolerant and Intelligent Memory for Space" (P1025)
- 144-Lead QFP package
- Dimensions: 42.5 mm x 42.5 mm x 13.0 mm
- Mass: 70 g with radiation shielding
- Power: 4.0 W peak
- To Be Verified / Analyzed
- Total Ionizing Dose > 100 krad (Si)
- SEU in GEO less than 1.5E-6 per day
- Latch-Up Immune to 60 MeV-cm²/mg
26 Future Work
- VHDL and Place & Route
- Works in progress
- Minimize SEFI
- Error detection and recording
- Error recovery
- What is the SEFI rate of RTIMS?
- Environment testing
- Life test (accelerated component life testing)
- 100 krad (Si) TID radiation tests
- SEL and SEU radiation tests
- Vacuum and temperature tests
- Mechanical stress tests
- Electrostatic discharge tests
27 Points to ponder
- XTMR
- Not a turnkey process
- Scrub memory content
- Almost static flop
- DCM failure detection and reset
- Glitch-free output is no longer glitch-free
- Signal integrity with dotted output
- IO
- 3 pins for one signal, EDAC
- Tie the triplicate IO together vs carry three signals on the board with the voter implemented on the receiving side
- One size does not fit all