Co-designed Virtual Machines for Reliable Computer Systems

1
Co-designed Virtual Machines for Reliable
Computer Systems
  • University of Wisconsin-Madison
  • September 2002

2
Outline
  • Overview of VMFTC Dual-Mode Execution Virtual
    Machine for Fault Tolerance Computing
  • Hardware Processor Micro-Architecture Hooks
  • Hardware Memory System Design
  • Hardware Interconnection, I/O Channels and
    Disks and Networking
  • Software Virtual Machine Monitor
  • VMFTC Simulation Design
  • Problems and Feedback needed

3
Principles
  • Hardware resource replication is a must for both
    fault tolerance and performance/throughput.
    Build simple HW hooks on replicated hardware
    resources, then export either a performance VM
    or an FTC VM.
  • Maintain the conventional architectural interface
    between HW and SW (e.g., between HW and OS), and
    exploit COTS components where possible. Is this
    strategy problematic?

4
Architecture - End Users Perspective
  • Two modes of virtual machines run on the same
    hardware platform with hidden mini-VMM software.
  • Performance Mode VM: fully architected COTS
    processor resources, wider interconnection
    bandwidth, larger memory, wider I/O channel
    bandwidth, larger or more disks, and finally
    lower latency.
  • Reliable Mode VM: ultra-reliable architected
    processor resources; ultra-reliable and available
    interconnections, memory systems, I/O channels
    and storage.
  • Positive synergistic effects: self-monitoring and
    self-healing via error detection and recovery
    mechanisms in the reliable mode VM; higher system
    throughput via the performance mode VM, which
    relieves workload pressure on the whole server
    system.

5
Architecture - End Users Perspective
6
Proposed Contributions
  • Flexible and cost-effective use of replicated
    COTS hardware resources via virtual machine
    technology, while maintaining the conventional
    architecture.
  • Ultra-reliable architectural interface to the
    software community: separate hardware RAS from
    software RAS for easier hierarchical solutions.
  • Provide simple architectural support for software
    RAS mechanisms when needed, for more effective
    whole-system solutions.

7
Processor Micro-Architecture Hooks
  • Performance Mode VM: more architected processing
    capacity.
  • Reliable Mode VM: RAS provided by lock-stepped
    processor pairs. Can monitor system runtime
    hardware status.
  • Dynamic switching between the two modes.
  • Bootstrap: reliable mode for better self-testing.
  • Power-off: the VMM gets final control of the
    system and enters reliable mode to make sure
    everything is OK before power-off.
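
The dual-mode idea on this slide can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware hook the slides propose: `CorePair`, `run_step`, and `LockstepMismatch` are hypothetical names, and a "core" is modeled as an ordinary function call. In performance mode the two cores retire two different work items per step; in reliable mode both cores execute the same item and their results are compared.

```python
from typing import Callable, List, Optional, Tuple

# A work item is (function, argument). Hypothetical model, not from the slides.
Task = Tuple[Callable[[int], int], int]


class LockstepMismatch(Exception):
    """Raised when the two cores of a pair disagree in reliable mode."""


class CorePair:
    """Two replicated cores exported as either a performance or a reliable VM."""

    def __init__(self) -> None:
        self.mode = "performance"

    def switch(self, mode: str) -> None:
        # Dynamic switching between the two modes.
        if mode not in ("performance", "reliable"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode

    def run_step(
        self,
        work: List[Task],
        core_b_fault: Optional[Callable[[int], int]] = None,
    ) -> List[int]:
        """Execute one step and return the results retired in that step."""
        if not work:
            return []
        if self.mode == "performance":
            # Each core takes its own work item: up to two results per step.
            return [f(x) for f, x in work[:2]]
        # Reliable mode: both cores run the same item and are cross-checked.
        f, x = work[0]
        result_a = f(x)
        result_b = f(x)
        if core_b_fault is not None:
            result_b = core_b_fault(result_b)  # simulated hardware fault
        if result_a != result_b:
            raise LockstepMismatch(f"core A={result_a}, core B={result_b}")
        return [result_a]
```

The model makes the trade-off visible: reliable mode retires half as much work per step in exchange for detecting any divergence between the two cores.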

8
Processors: Lock-stepped UP or SMP/CMP
9
Memory System Design
  • Performance Mode VM: more architected memory and
    interconnection bandwidth, slightly lower
    latency.
  • Reliable Mode VM: less, but more reliable and
    available, memory. Exploit log bits/parity bits
    for each memory block to perform memory
    transaction logging.
  • Storage and communications are protected by ECC.
  • Optional mirrored memory images.
  • All logic modules, such as cache coherence
    processors, could also exploit dual-mode
    processing.
  • Dynamic switching between the two modes.
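
One plausible reading of per-block log bits is sketched below. The slides do not define the exact semantics, so this is an assumption: a block's log bit is set on the first write after a checkpoint, at which point the old value is saved to an undo log; later writes to the same block skip logging. `LoggedMemory` and its methods are hypothetical names.

```python
from typing import List, Tuple


class LoggedMemory:
    """Block memory with one log bit per block (assumed semantics)."""

    def __init__(self, n_blocks: int) -> None:
        self.blocks: List[int] = [0] * n_blocks
        self.log_bit: List[bool] = [False] * n_blocks
        self.undo_log: List[Tuple[int, int]] = []  # (block index, old value)

    def write(self, index: int, value: int) -> None:
        # Log the old value only on the first write since the last checkpoint.
        if not self.log_bit[index]:
            self.undo_log.append((index, self.blocks[index]))
            self.log_bit[index] = True
        self.blocks[index] = value

    def checkpoint(self) -> None:
        # Commit the current contents: discard the log and clear all log bits.
        self.undo_log.clear()
        self.log_bit = [False] * len(self.blocks)

    def rollback(self) -> None:
        # Restore the memory image of the last checkpoint.
        for index, old_value in reversed(self.undo_log):
            self.blocks[index] = old_value
        self.checkpoint()
```

Under this reading, the log grows with the number of distinct blocks touched per checkpoint interval rather than with the number of writes.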

10
Memory System Design
11
Interconnection and I/O Channels
  • Dual-mode I/O controllers and channels. Multiple
    interconnections for both availability and
    performance.
  • Performance Mode VM: more architected resource
    capacity due to physical resource replication.
  • Reliable Mode VM: less capacity for both
    communication bandwidth and storage, due to
    controller cross-checking or checkpointing/
    logging overhead for hidden VMM activity.
    However, it monitors component fault rates for
    fault forecasting.
  • Dynamic switching between the two modes.
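
Controller cross-checking combined with fault-rate monitoring could look roughly like the following. This is a sketch only: `CrossCheckedChannel`, the sliding-window size, and the service threshold are illustrative assumptions, and a controller is modeled as a function from bytes to bytes.

```python
from collections import deque
from typing import Callable, Deque, Optional

Controller = Callable[[bytes], bytes]


class CrossCheckedChannel:
    """Sends each transfer through two controllers and compares the outputs."""

    def __init__(self, ctrl_a: Controller, ctrl_b: Controller,
                 window: int = 100, threshold: float = 0.05) -> None:
        self.ctrl_a = ctrl_a
        self.ctrl_b = ctrl_b
        self.threshold = threshold
        # Sliding window of recent outcomes: True means a mismatch occurred.
        self.recent: Deque[bool] = deque(maxlen=window)

    def transfer(self, data: bytes) -> Optional[bytes]:
        out_a = self.ctrl_a(data)
        out_b = self.ctrl_b(data)
        mismatch = out_a != out_b
        self.recent.append(mismatch)
        # On a mismatch the caller is expected to retry or trigger recovery.
        return None if mismatch else out_a

    def fault_rate(self) -> float:
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def needs_service(self) -> bool:
        # Crude fault forecast: flag the channel once the rate passes the limit.
        return self.fault_rate() > self.threshold


def make_flaky(period: int) -> Controller:
    """Test helper: a controller that corrupts the first byte every `period` calls."""
    calls = {"n": 0}

    def ctrl(data: bytes) -> bytes:
        calls["n"] += 1
        if data and calls["n"] % period == 0:
            return bytes([data[0] ^ 0x01]) + data[1:]
        return data

    return ctrl
```

Every transfer is performed twice, which is the bandwidth cost the slide attributes to reliable mode.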

12
I/O Interconnection Network
13
VMM issues
  • Dynamic configuration/switching of the VMs.
  • VMM intercepts certain interrupts in reliable
    mode: timer, machine check interrupts, I/O
    interrupts.
  • Timer-triggered checkpointing. Checkpoint state:
    processor state, cache state and memory state;
    communication states via QS.
  • Memory checkpoints: memory transaction logging
    in main memory storage; log bits to reduce work.
  • I/O event logging for replay during recovery.
  • Rollback recovery: roll back the memory image,
    reload system state, replay I/O events.
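
The checkpoint, log, and replay cycle listed above can be sketched in a few lines. This is a toy model under loud assumptions: guest state is a plain dictionary, a full copy stands in for the log-based memory checkpoint, and `MiniVMM`, `on_timer`, `on_io_input`, and `recover` are hypothetical names rather than the presenters' design.

```python
import copy
from typing import Dict, List


class MiniVMM:
    """Toy monitor: timer-triggered checkpoints plus I/O logging and replay."""

    def __init__(self, state: Dict[str, int]) -> None:
        self.state = state
        self.saved = copy.deepcopy(state)  # last checkpoint
        self.io_log: List[int] = []        # inputs since the last checkpoint

    def on_timer(self) -> None:
        # Timer interrupt intercepted by the VMM: take a new checkpoint.
        self.saved = copy.deepcopy(self.state)
        self.io_log.clear()

    def on_io_input(self, value: int) -> None:
        # I/O interrupt intercepted by the VMM: log first, then deliver.
        self.io_log.append(value)
        self._deliver(value)

    def _deliver(self, value: int) -> None:
        # Stand-in for the guest consuming an input event.
        self.state["sum"] += value
        self.state["count"] += 1

    def recover(self) -> None:
        # Rollback recovery: reload the checkpoint, then replay logged I/O.
        self.state = copy.deepcopy(self.saved)
        for value in self.io_log:
            self._deliver(value)
```

Replay only reproduces the pre-fault state if the guest is deterministic between checkpoints, which is why the inputs are logged before they are delivered.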

14
VMFTC Simulation Design
  • Simulator infrastructure -- PHARMsim (SimOS-PPC +
    SimpleMP). Precise but slow.
  • Fault Injection
  • Fault Detection
  • Execution mode switch in the simulator.
  • Checkpointing/logging and recovery with full
    consideration of precise I/O event handling in
    the PHARMsim simulator.
  • Co-designed VMM vs. classical OS VMM
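
A fault injection and detection experiment of the kind listed above might be structured as follows. This is a standalone sketch, not PHARMsim code: `inject_bit_flip` and `run_campaign` are hypothetical helpers, the fault model is a single bit flip in a result word, and detection is a comparison against a redundant (shadow) computation.

```python
import random
from typing import Callable, Iterable, Tuple


def inject_bit_flip(value: int, bit: int, width: int = 32) -> int:
    """Flip one bit of a `width`-bit word."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def run_campaign(compute: Callable[[int], int], inputs: Iterable[int],
                 fault_prob: float = 0.5, seed: int = 0,
                 width: int = 32) -> Tuple[int, int]:
    """Return (faults injected, faults detected) over all inputs."""
    rng = random.Random(seed)  # fixed seed keeps the campaign reproducible
    mask = (1 << width) - 1
    injected = detected = 0
    for x in inputs:
        primary = compute(x) & mask
        shadow = compute(x) & mask  # redundant execution used as the checker
        if rng.random() < fault_prob:
            primary = inject_bit_flip(primary, rng.randrange(width), width)
            injected += 1
        if primary != shadow:
            detected += 1
    return injected, detected
```

With this fault model every injected fault is detected and no fault-free run is flagged; a more realistic study would also inject faults that hit both copies or that are masked before they reach the comparison point.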

15
Problems
  • Potential applications -- Servers?
    PCs/workstations? Mobile computing? Embedded
    systems?
  • Whole-system-level fault models: what are the
    common faults, and what are their frequency,
    cost, etc.?
  • Cost models for building those hooks inside the
    whole system. Cost of redundant resources.
  • How to evaluate fault-tolerant computing, and how
    to perform that evaluation in a research project?
  • Can HW do anything to help recover from SW
    Heisenbugs? Or anything to help SW fault
    tolerance in a co-designed style?