Co-designed Virtual Machines for Reliable Computer Systems

Slides: 16

Provided by: Syste98

Learn more at: https://pages.cs.wisc.edu

Transcript and Presenter's Notes

1
Co-designed Virtual Machines for Reliable
Computer Systems

2
Outline

Overview of VMFTC: Dual-Mode Execution Virtual Machine for Fault-Tolerant Computing
Hardware: Processor Micro-Architecture Hooks
Hardware: Memory System Design
Hardware: Interconnection, I/O Channels, Disks and Networking
Software: Virtual Machine Monitor
VMFTC Simulation Design
Problems and Feedback Needed

3
Principles

Hardware resource replication is a must for both fault tolerance and performance/throughput.
Build simple HW hooks on top of replicated hardware resources, then export either a performance VM or an FTC VM.
Maintain the conventional architectural interface between HW and SW, e.g. between HW and OS. Exploit COTS if possible. Is this strategy problematic?

4
Architecture - End User's Perspective

Two modes of virtual machine run on the same hardware platform with hidden mini VMM software.
Performance Mode VM: fully architected COTS processor resources, wider interconnection bandwidth, larger memory, wider I/O channel bandwidth, larger or more disks, and lower latency.
Reliable Mode VM: ultra-reliable architected processor resources; ultra-reliable and available interconnections, memory systems, I/O channels and storage.
Positive synergistic effects: self-monitoring and self-healing via error detection and recovery mechanisms in the reliable mode VM; higher system throughput via the performance mode VM, relieving workload pressure on the whole server system.

5
Architecture - End User's Perspective
6
Proposed Contributions

Flexible and cost-effective use of replicated COTS hardware resources via virtual machine technology, maintaining the conventional architecture.
Ultra-reliable architectural interface to the software community: separates hardware RAS from software RAS for easier hierarchical solutions.
Simple architectural support for software RAS mechanisms when needed, for more effective whole-system solutions.

7
Processor Micro-Architecture Hooks

Performance Mode VM: more architected processing capacity.
Reliable Mode VM: RAS provided by lock-stepped processor pairs. Can monitor system runtime hardware status.
Dynamic switching between the two modes.
Bootstrap: reliable mode, for better self-testing.
Power-off: the VMM gets final control of the system and enters reliable mode to make sure everything is OK before power-off.
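
The two processor modes above can be illustrated with a minimal sketch. It is not the presented design: the `Core` class, the accumulator-style "instructions" and the `fault_at` injection knob are all hypothetical names chosen for illustration. In reliable mode both cores run the same stream and are compared after every step; in performance mode they run independent streams with no checking.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical instruction model: a function from accumulator to accumulator.
Instr = Callable[[int], int]


class LockstepMismatch(Exception):
    """Raised when the two replicas disagree on a result."""


@dataclass
class Core:
    acc: int = 0
    fault_at: int = -1  # step at which a bit flip is injected (-1 = never)
    steps: int = 0

    def execute(self, instr: Instr) -> int:
        self.acc = instr(self.acc)
        if self.steps == self.fault_at:
            self.acc ^= 1  # injected single-bit fault (illustration only)
        self.steps += 1
        return self.acc


def run_reliable(a: Core, b: Core, program: List[Instr]) -> int:
    """Reliable mode: same stream on both cores, compared after every step."""
    for i, instr in enumerate(program):
        ra, rb = a.execute(instr), b.execute(instr)
        if ra != rb:
            raise LockstepMismatch(f"divergence at step {i}: {ra} != {rb}")
    return a.acc


def run_performance(a: Core, b: Core,
                    prog_a: List[Instr], prog_b: List[Instr]) -> Tuple[int, int]:
    """Performance mode: each core runs its own stream, no cross-checking."""
    for instr in prog_a:
        a.execute(instr)
    for instr in prog_b:
        b.execute(instr)
    return a.acc, b.acc


program = [lambda x: x + 1, lambda x: x * 3, lambda x: x - 2]
```

Calling `run_reliable(Core(), Core(fault_at=1), program)` raises `LockstepMismatch`, whereas the same fault in performance mode would pass silently. That trade-off is what dynamic mode switching exposes: half the architected capacity in exchange for detection.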

8
Processors: Lock-stepped UP or SMP/CMP
9
Memory System Design

Performance Mode VM: more architected memory and interconnection bandwidth, slightly lower latency.
Reliable Mode VM: less memory, but more reliable and available. Exploit log bits/parity bits for each memory block to perform memory transaction logging.
Storage and communications are protected by ECC.
Optional mirrored memory images.
All logic modules, such as cache coherence processors, could also exploit dual-mode processing.
Dynamic switching between the two modes.
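
One plausible reading of the per-block log bit is sketched below. The slides do not specify the mechanism, so this is an assumption: the first write to a block after a checkpoint saves the old value to an undo log and sets the block's log bit, and later writes to that block skip the logging. The class and method names are hypothetical.

```python
class LoggedMemory:
    """Block memory with one log bit per block and an undo log (illustrative)."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks = [0] * num_blocks
        self.log_bit = [False] * num_blocks
        self.undo_log = []  # (block index, value before first write)

    def write(self, idx: int, value: int) -> None:
        # Log the old value only on the first write since the last checkpoint.
        if not self.log_bit[idx]:
            self.undo_log.append((idx, self.blocks[idx]))
            self.log_bit[idx] = True
        self.blocks[idx] = value

    def checkpoint(self) -> None:
        # Current contents become the new recovery point; discard old log.
        self.undo_log.clear()
        self.log_bit = [False] * len(self.blocks)

    def rollback(self) -> None:
        # Undo writes in reverse order to restore the checkpointed image.
        while self.undo_log:
            idx, old = self.undo_log.pop()
            self.blocks[idx] = old
        self.log_bit = [False] * len(self.blocks)


mem = LoggedMemory(4)
mem.write(0, 5)
mem.checkpoint()
mem.write(0, 7)
mem.write(0, 9)   # log bit already set: no second log entry
mem.write(2, 1)
entries_before_rollback = len(mem.undo_log)
mem.rollback()
```

In this sketch the log grows with the number of distinct blocks touched per checkpoint interval, not with the number of writes.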

10
Memory System Design
11
Interconnection, I/O Channels

Dual-mode I/O controllers and channels. Multiple interconnections for both availability and performance.
Performance Mode VM: more architected resource capacity due to physical resource replication.
Reliable Mode VM: less capacity for both communication bandwidth and storage, due to controller cross-checking or checkpointing/logging overhead for hidden VMM activity. However, it monitors component fault rates for fault forecasting.
Dynamic switching between the two modes.

12
I/O Interconnection Network
13
VMM issues

Dynamic configuration/switching of the VMs.
The VMM intercepts certain interrupts in reliable mode: timer, machine check and I/O interrupts.
Timer-triggered checkpointing. Checkpoint state: processor state, cache state and memory state. Communication states via QS.
Memory checkpoints: memory transaction logging in main memory storage; log bits to reduce work.
I/O event logging for replay during recovery.
Rollback recovery: roll back the memory image, reload system state, replay I/O events.
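
The checkpoint and rollback flow listed above can be sketched as follows. This is a toy model under stated assumptions, not the presented VMM: the guest is deterministic, its whole state fits in a dictionary, and `Guest`, `MiniVMM` and the `interval` parameter are hypothetical names. The VMM intercepts timer and I/O interrupts, checkpoints every `interval` timer ticks, logs every event since the last checkpoint, and recovers by restoring the saved state and replaying the log.

```python
import copy


class Guest:
    """Deterministic toy guest: counts ticks and sums I/O payloads."""

    def __init__(self) -> None:
        self.state = {"ticks": 0, "total": 0}

    def step(self) -> None:
        self.state["ticks"] += 1

    def handle_io(self, value: int) -> None:
        self.state["total"] += value


class MiniVMM:
    def __init__(self, guest: Guest, interval: int) -> None:
        self.guest = guest
        self.interval = interval          # timer ticks between checkpoints
        self.saved = copy.deepcopy(guest.state)
        self.event_log = []               # events since the last checkpoint
        self.since_checkpoint = 0

    def timer_interrupt(self) -> None:
        self.guest.step()
        self.event_log.append(("timer", None))
        self.since_checkpoint += 1
        if self.since_checkpoint >= self.interval:
            self.checkpoint()

    def io_interrupt(self, value: int) -> None:
        self.event_log.append(("io", value))  # log before delivery
        self.guest.handle_io(value)

    def checkpoint(self) -> None:
        self.saved = copy.deepcopy(self.guest.state)
        self.event_log.clear()
        self.since_checkpoint = 0

    def recover(self) -> None:
        # Reload the checkpointed state, then replay logged events in order.
        self.guest.state = copy.deepcopy(self.saved)
        for kind, payload in self.event_log:
            if kind == "timer":
                self.guest.step()
            else:
                self.guest.handle_io(payload)


guest = Guest()
vmm = MiniVMM(guest, interval=2)
vmm.timer_interrupt()
vmm.io_interrupt(10)
vmm.timer_interrupt()            # second tick triggers a checkpoint
vmm.io_interrupt(5)
vmm.timer_interrupt()
good = dict(guest.state)
guest.state["total"] = 999       # simulate state corruption by a fault
vmm.recover()
```

Replay only reproduces the pre-fault state because the toy guest is deterministic. A real system would also have to deal with nondeterministic inputs and with output already sent to the outside world.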

14
VMFTC Simulation Design

Simulator infrastructure: PHARMsim, SimOS-PPC, SimpleMP. Precise but slow.
Fault injection.
Fault detection.
Execution mode switch in the simulator.
Checkpointing/logging and recovery, with full consideration of precise I/O event handling in the PHARMsim simulator.
Co-designed VMM ? Classical OS VMM
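
As a rough illustration of what a fault injection and detection experiment measures, the sketch below flips random bits in parity-protected words and reports the fraction of injections that the parity check catches. It does not use or model any PHARMsim interface; all function names, the 32-bit word width and the fixed random seed are assumptions made for the example.

```python
import random
from typing import Iterable


def parity(word: int) -> int:
    """Even-parity bit: 1 if the word has an odd number of set bits."""
    return bin(word).count("1") & 1


def inject(word: int, bits: Iterable[int]) -> int:
    """Flip the given bit positions in the word."""
    for b in bits:
        word ^= 1 << b
    return word


def detected(stored_word: int, stored_parity: int) -> bool:
    """A fault is detected when the recomputed parity disagrees."""
    return parity(stored_word) != stored_parity


def campaign(trials: int, flips: int, width: int = 32, seed: int = 0) -> float:
    """Return detection coverage for `flips` distinct bit flips per word."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        word = rng.getrandbits(width)
        p = parity(word)
        faulty = inject(word, rng.sample(range(width), flips))
        if detected(faulty, p):
            hits += 1
    return hits / trials


single = campaign(100, flips=1)
double = campaign(100, flips=2)
```

Single-bit flips are always caught and double-bit flips never are, since parity only sees an odd number of flipped bits. A campaign like this makes such coverage gaps visible before choosing between parity, ECC or lock-step comparison for a given component.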

15
Problems

Potential applications: servers? PCs/workstations? Mobile computing? Embedded systems?
Whole-system-level fault models: what are the common faults, and what are their frequency, cost, etc.?
Cost models for building those hooks into the whole system. Cost of redundant resources.
How to evaluate fault-tolerant computing? How to perform the evaluation for a research project?
Can HW do anything to help recover from SW Heisenbugs? Or can HW do anything to help SW fault tolerance in a co-designed style?