Towards a Hardware-Software Co-Designed Resilient System - PowerPoint PPT Presentation

About This Presentation

Title:

Towards a Hardware-Software Co-Designed Resilient System

Description:

Towards a Hardware-Software Co-Designed Resilient System Man-Lap (Alex) Li, Pradeep Ramachandran, Sarita Adve, Vikram Adve, Yuanyuan Zhou University of Illinois at ... – PowerPoint PPT presentation

Number of Views:70

Avg rating:3.0/5.0

Slides: 20

Provided by: Jayan59

Learn more at: http://rsim.cs.illinois.edu

Category:

more less

Transcript and Presenter's Notes

Title: Towards a Hardware-Software Co-Designed Resilient System

1
Towards a Hardware-Software Co-Designed
Resilient System

Man-Lap (Alex) Li, Pradeep Ramachandran,
Sarita Adve, Vikram Adve, Yuanyuan Zhou
University of Illinois at Urbana-Champaign
In collaboration with
Pradip Bose (IBM) and Subhasish Mitra (Stanford)

2
Motivation

Failures will happen in the field
Design defects
Aging
Soft errors
Inadequate burn-in
Aggressive design for power/performance/reliabilit
y
Low-cost method to detect/recover from all
sources of failure?
Reliability problem pervasive across many markets
Traditional solutions (e.g. nMR) too expensive
Must incur low performance, power overhead

3
A Low-Cost, Unified Reliability Solution

Need handle only faults that propagate to
software
Hardware faults appear as software bugs
Leverage software reliability solutions for
hardware?
One-size-fits-all near-100 coverage often
unnecessary
Solution must be customizable to application
needs

4
Outline

Motivation of Framework
Unified Framework for H/W S/W Reliability
Understanding the Impact of H/W Failures on S/W
Future Work

5
Unified Framework for H/W S/W Reliability

Unified hardware/software co-designed framework
Tackles hardware and software faults
Software-centric solutions with near-zero H/W
overhead
Customizable to app needs, flexible for new error
sources

6
Framework Components

Detection Software symptoms, online testing
Recovery Software/hardware checkpoint and
rollback
Diagnosis Firmware layer for rollback/replay,
online testing
Repair/reconfiguration Redundant, reconfigurable
hardware
Need to understand how hardware faults propagate
to S/W
How do hardware faults become visible to
software?
What is the latency?
Do H/W faults affect application and/or system
state?

7
Methodology

Microarchitecture-level fault injection
Trade-off between accuracy and simulation time
GEMS timing models for out-of-order processor,
memory
Simics full-system simulation of Solaris
UltraSPARC III
SPEC workloads for ten million instructions
Fault model
Stuck-at, bridging faults in many micro-arch
structures
Fault detection
Crashes detected through hardware generated fatal
traps
Misaligned memory access, RED state, watchdog
reset, etc.
Hangs detected using simple hardware hang detector

8
How do Hardware Faults Propagate to Software?

97 faults (w/o FPU) detectable with simple H/W
S/W
Need H/W support or S/W monitoring for FPU

9
How do Hardware Faults Propagate to Software?

97 faults (w/o FPU) detectable with simple H/W
S/W
Need H/W support or S/W monitoring for FPU
gt 50 crashes/hangs in OS

10
S/W Components Corrupted

62 of faults corrupt system state
Need to recover system state

11
Latency to Detection from Application Corruption
Total instructions executed between app state
corruption and detection

80 have latency lt 100K instr, amenable to H/W
recovery
Buffering for 50µs on 2 GHz processor
May need to use software checkpoint/recovery for
others

12
Latency to Detection from OS Corruption
OS-only instructions executed between OS state
corruption and detection

92 of injections result in latency of lt 100K OS
instructions
Amenable to hardware recovery

13
Summary so far

Hardware faults highly visible
Over 97 of faults in 6 structures result in
crashes/hangs
Simple H/W and S/W sufficient
Recovery through checkpointing
S/W and/or H/W checkpoints for application
recovery
H/W checkpoints and buffering for OS recovery

14
Next Steps (1 of 3)

Improving understanding of fault propagation
Accurate fault models, effect of transients,
intermittents
Lower-level simulations
Better workloads
Detection
More software level monitoring
Software signals, invariants, perturbations,
H/W support to aid detection in some structures
(e.g., FPU)
Selective backup testing
Recovery
Enhanced detection may reduce latency
Explore software vs. hardware, application
customizability

15
Next Steps (2 of 3)

Diagnosis
Assume rollback/restart mechanism, multicore
system

Bug detected
Rollback to previous checkpoint, restart on
original core
Original symptom doesnt recur
Original symptom recurs
Transient h/w bug, or non-deterministic s/w
bug Continue execution
Deterministic s/w bug, or Permanent h/w bug
Rollback, restart on different core
No symptom
Symptom
Permanent defect in original core
Deterministic s/w bug
16
Next Steps (3 of 3)