Title: Towards a Hardware-Software Co-Designed Resilient System
1Towards a Hardware-Software Co-Designed
Resilient System
- Man-Lap (Alex) Li, Pradeep Ramachandran,
- Sarita Adve, Vikram Adve, Yuanyuan Zhou
- University of Illinois at Urbana-Champaign
- In collaboration with
- Pradip Bose (IBM) and Subhasish Mitra (Stanford)
2Motivation
- Failures will happen in the field
- Design defects
- Aging
- Soft errors
- Inadequate burn-in
- Aggressive design for power/performance/reliabilit
y -
- Low-cost method to detect/recover from all
sources of failure? - Reliability problem pervasive across many markets
- Traditional solutions (e.g. nMR) too expensive
- Must incur low performance, power overhead
3A Low-Cost, Unified Reliability Solution
- Need handle only faults that propagate to
software - Hardware faults appear as software bugs
- Leverage software reliability solutions for
hardware? - One-size-fits-all near-100 coverage often
unnecessary - Solution must be customizable to application
needs
4Outline
- Motivation of Framework
- Unified Framework for H/W S/W Reliability
- Understanding the Impact of H/W Failures on S/W
- Future Work
5Unified Framework for H/W S/W Reliability
- Unified hardware/software co-designed framework
- Tackles hardware and software faults
- Software-centric solutions with near-zero H/W
overhead - Customizable to app needs, flexible for new error
sources
6Framework Components
- Detection Software symptoms, online testing
- Recovery Software/hardware checkpoint and
rollback - Diagnosis Firmware layer for rollback/replay,
online testing - Repair/reconfiguration Redundant, reconfigurable
hardware - Need to understand how hardware faults propagate
to S/W - How do hardware faults become visible to
software? - What is the latency?
- Do H/W faults affect application and/or system
state?
7Methodology
- Microarchitecture-level fault injection
- Trade-off between accuracy and simulation time
- GEMS timing models for out-of-order processor,
memory - Simics full-system simulation of Solaris
UltraSPARC III - SPEC workloads for ten million instructions
- Fault model
- Stuck-at, bridging faults in many micro-arch
structures - Fault detection
- Crashes detected through hardware generated fatal
traps - Misaligned memory access, RED state, watchdog
reset, etc. - Hangs detected using simple hardware hang detector
8How do Hardware Faults Propagate to Software?
- 97 faults (w/o FPU) detectable with simple H/W
S/W - Need H/W support or S/W monitoring for FPU
9How do Hardware Faults Propagate to Software?
- 97 faults (w/o FPU) detectable with simple H/W
S/W - Need H/W support or S/W monitoring for FPU
- gt 50 crashes/hangs in OS
10S/W Components Corrupted
- 62 of faults corrupt system state
- Need to recover system state
11Latency to Detection from Application Corruption
Total instructions executed between app state
corruption and detection
- 80 have latency lt 100K instr, amenable to H/W
recovery - Buffering for 50µs on 2 GHz processor
- May need to use software checkpoint/recovery for
others
12Latency to Detection from OS Corruption
OS-only instructions executed between OS state
corruption and detection
- 92 of injections result in latency of lt 100K OS
instructions - Amenable to hardware recovery
13Summary so far
- Hardware faults highly visible
- Over 97 of faults in 6 structures result in
crashes/hangs - Simple H/W and S/W sufficient
- Recovery through checkpointing
- S/W and/or H/W checkpoints for application
recovery - H/W checkpoints and buffering for OS recovery
14Next Steps (1 of 3)
- Improving understanding of fault propagation
- Accurate fault models, effect of transients,
intermittents - Lower-level simulations
- Better workloads
- Detection
- More software level monitoring
- Software signals, invariants, perturbations,
- H/W support to aid detection in some structures
(e.g., FPU) - Selective backup testing
- Recovery
- Enhanced detection may reduce latency
- Explore software vs. hardware, application
customizability
15Next Steps (2 of 3)
- Diagnosis
- Assume rollback/restart mechanism, multicore
system
Bug detected
Rollback to previous checkpoint, restart on
original core
Original symptom doesnt recur
Original symptom recurs
Transient h/w bug, or non-deterministic s/w
bug Continue execution
Deterministic s/w bug, or Permanent h/w bug
Rollback, restart on different core
No symptom
Symptom
Permanent defect in original core
Deterministic s/w bug
16Next Steps (3 of 3)
- Repair/reconfigure
- What should be the right field configurable unit?
- Core, FU, array entries?
- Avoidance
- Dynamic reliability management
- Implementation architecture
- Hardware firmware OS
- Itanium machine check architecture has hooks
17Thank You
18Backup Slides
19Types of fatal traps
- Faults cause different fatal traps thrown before
crashes - Junk data access leads to memory misalignment
- Repeatedly trapping leads to RED state