Towards a Hardware-Software Co-Designed Resilient System

1
Towards a Hardware-Software Co-Designed
Resilient System
  • Man-Lap (Alex) Li, Pradeep Ramachandran,
  • Sarita Adve, Vikram Adve, Yuanyuan Zhou
  • University of Illinois at Urbana-Champaign
  • In collaboration with
  • Pradip Bose (IBM) and Subhasish Mitra (Stanford)

2
Motivation
  • Failures will happen in the field
  • Design defects
  • Aging
  • Soft errors
  • Inadequate burn-in
  • Aggressive design for power/performance/reliability
  • Low-cost method to detect/recover from all
    sources of failure?
  • Reliability problem pervasive across many markets
  • Traditional solutions (e.g. nMR) too expensive
  • Must incur low performance, power overhead

3
A Low-Cost, Unified Reliability Solution
  • Need to handle only faults that propagate to
    software
  • Hardware faults appear as software bugs
  • Leverage software reliability solutions for
    hardware?
  • One-size-fits-all near-100% coverage often
    unnecessary
  • Solution must be customizable to application
    needs

4
Outline
  • Motivation of Framework
  • Unified Framework for H/W and S/W Reliability
  • Understanding the Impact of H/W Failures on S/W
  • Future Work

5
Unified Framework for H/W and S/W Reliability
  • Unified hardware/software co-designed framework
  • Tackles hardware and software faults
  • Software-centric solutions with near-zero H/W
    overhead
  • Customizable to app needs, flexible for new error
    sources

6
Framework Components
  • Detection: Software symptoms, online testing
  • Recovery: Software/hardware checkpoint and
    rollback
  • Diagnosis: Firmware layer for rollback/replay,
    online testing
  • Repair/reconfiguration: Redundant, reconfigurable
    hardware
  • Need to understand how hardware faults propagate
    to S/W
  • How do hardware faults become visible to
    software?
  • What is the latency?
  • Do H/W faults affect application and/or system
    state?

7
Methodology
  • Microarchitecture-level fault injection
  • Trade-off between accuracy and simulation time
  • GEMS timing models for out-of-order processor,
    memory
  • Simics full-system simulation of Solaris on
    UltraSPARC III
  • SPEC workloads for ten million instructions
  • Fault model
  • Stuck-at, bridging faults in many micro-arch
    structures
  • Fault detection
  • Crashes detected through hardware generated fatal
    traps
  • Misaligned memory access, RED state, watchdog
    reset, etc.
  • Hangs detected using simple hardware hang detector
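The stuck-at and bridging fault models named above can be illustrated on a single data word. This is a minimal sketch, not the injector used in the study (which operates inside the GEMS/Simics timing models); the function names, the 64-bit word width, and the wired-AND bridging behavior are assumptions made for illustration.

```python
WORD_BITS = 64                 # assumed word width, for illustration only
MASK = (1 << WORD_BITS) - 1


def stuck_at(value: int, bit: int, stuck_value: int) -> int:
    """Return `value` with one bit permanently forced to 0 or 1."""
    if stuck_value:
        return (value | (1 << bit)) & MASK
    return value & ~(1 << bit) & MASK


def bridge_and(value: int, bit_a: int, bit_b: int) -> int:
    """Model a bridging fault as a wired-AND of two bits (assumed behavior)."""
    a = (value >> bit_a) & 1
    b = (value >> bit_b) & 1
    shorted = a & b            # both lines end up carrying the AND of the two
    value = stuck_at(value, bit_a, shorted)
    return stuck_at(value, bit_b, shorted)
```

For instance, a stuck-at-1 fault on bit 0 turns the aligned address 0x1000 into 0x1001, the kind of corrupted value that can surface as a misaligned memory access trap.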

8
How do Hardware Faults Propagate to Software?
  • 97% of faults (w/o FPU) detectable with simple H/W
    and S/W
  • Need H/W support or S/W monitoring for FPU

9
How do Hardware Faults Propagate to Software?
  • 97% of faults (w/o FPU) detectable with simple H/W
    and S/W
  • Need H/W support or S/W monitoring for FPU
  • > 50% of crashes/hangs in OS

10
S/W Components Corrupted
  • 62% of faults corrupt system state
  • Need to recover system state

11
Latency to Detection from Application Corruption
Total instructions executed between app state
corruption and detection
  • 80% have latency < 100K instr, amenable to H/W
    recovery
  • Buffering for 50µs on 2 GHz processor
  • May need to use software checkpoint/recovery for
    others
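The 50µs figure follows from the 100K-instruction latency bound and the 2 GHz clock. The sketch below reproduces that arithmetic; the helper name and the default of one instruction per cycle are assumptions, since the slide does not state the IPC behind the estimate.

```python
def buffering_time_us(instructions: int, clock_hz: float, ipc: float = 1.0) -> float:
    """Time, in microseconds, needed to execute `instructions` at a given
    clock rate, assuming a sustained IPC (1.0 is an assumed default)."""
    seconds = instructions / (clock_hz * ipc)
    return seconds * 1e6


# 100K instructions on a 2 GHz processor at an assumed IPC of 1
window_us = buffering_time_us(100_000, 2e9)
```

Under these assumptions `window_us` comes out at roughly 50, matching the slide; a higher sustained IPC would shorten the window proportionally.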

12
Latency to Detection from OS Corruption
OS-only instructions executed between OS state
corruption and detection
  • 92% of injections result in latency of < 100K OS
    instructions
  • Amenable to hardware recovery

13
Summary so far
  • Hardware faults highly visible
  • Over 97% of faults in 6 structures result in
    crashes/hangs
  • Simple H/W and S/W sufficient
  • Recovery through checkpointing
  • S/W and/or H/W checkpoints for application
    recovery
  • H/W checkpoints and buffering for OS recovery
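Checkpoint-based recovery, as summarized above, can be sketched in software terms: periodically save known-good state, and restore it when a symptom is detected. The class below is a toy illustration only; the class name, the dictionary standing in for architectural state, and the single-checkpoint policy are all assumptions, not the mechanism proposed in the talk.

```python
import copy


class CheckpointedState:
    """Toy model of checkpoint and rollback over a dictionary of state."""

    def __init__(self, state: dict):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self) -> None:
        # Save a private copy of the current (assumed good) state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        # Discard everything done since the last checkpoint.
        self.state = copy.deepcopy(self._checkpoint)


machine = CheckpointedState({"pc": 0, "r1": 0})
machine.state["pc"] = 100
machine.checkpoint()
machine.state["r1"] = 0xDEAD   # corruption introduced after the checkpoint
machine.rollback()             # symptom detected: restore the saved state
```

The shorter the detection latency, the less work is lost on rollback and the less state has to be buffered between checkpoints.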

14
Next Steps (1 of 3)
  • Improving understanding of fault propagation
  • Accurate fault models, effect of transients,
    intermittents
  • Lower-level simulations
  • Better workloads
  • Detection
  • More software level monitoring
  • Software signals, invariants, perturbations,
  • H/W support to aid detection in some structures
    (e.g., FPU)
  • Selective backup testing
  • Recovery
  • Enhanced detection may reduce latency
  • Explore software vs. hardware, application
    customizability

15
Next Steps (2 of 3)
  • Diagnosis
  • Assume rollback/restart mechanism, multicore
    system

  • Bug detected: rollback to previous checkpoint, restart on
    original core
  • Original symptom doesn't recur: transient h/w bug, or
    non-deterministic s/w bug; continue execution
  • Original symptom recurs: deterministic s/w bug, or
    permanent h/w bug; rollback, restart on different core
  • No symptom on different core: permanent defect in
    original core
  • Symptom on different core: deterministic s/w bug

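The diagnosis flow on this slide is a two-level decision: replay on the original core, then, if the symptom recurs, replay on a different core. A minimal sketch follows; the function name and the two callables, which stand in for "roll back, re-execute, and report whether the symptom appears", are hypothetical and not part of the presented system.

```python
from typing import Callable


def diagnose(symptom_on_original: Callable[[], bool],
             symptom_on_other: Callable[[], bool]) -> str:
    """Classify a detected bug following the rollback/replay flow.

    Each callable (hypothetical) performs a rollback plus re-execution on
    the named core and returns True if the original symptom recurs.
    """
    if not symptom_on_original():
        return "transient h/w bug or non-deterministic s/w bug"
    # Symptom recurred on the original core: retry on a different core.
    if symptom_on_other():
        return "deterministic s/w bug"
    return "permanent defect in original core"
```

The second replay is attempted only when the first one reproduces the symptom, so the common transient case costs a single rollback.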
16
Next Steps (3 of 3)
  • Repair/reconfigure
  • What should be the right field configurable unit?
  • Core, FU, array entries?
  • Avoidance
  • Dynamic reliability management
  • Implementation architecture
  • Hardware, firmware, OS
  • Itanium machine check architecture has hooks

17
Thank You
  • Questions?

18
Backup Slides

19
Types of fatal traps
  • Faults cause different fatal traps thrown before
    crashes
  • Junk data access leads to memory misalignment
  • Repeatedly trapping leads to RED state