Fault Injection for High Availability Assessment


Learn more at: http://users.crhc.illinois.edu


1
Fault Injection for High Availability Assessment

T. Liu, D. Stott, Z. Kalbarczyk, R. K. Iyer,
Center for Reliable and High-Performance
Computing
University of Illinois at Urbana-Champaign
http://www.crhc.uiuc.edu/DEPEND

2
System Lifetime Validation Approaches

Design Phase
Fault functional simulation
Electrical, logic, or functional level
DEPEND
Prototype Phase
Stress testing
Fault injection in prototype systems
N-FTAPE
Operational Phase
Study of naturally occurring faults in real
environments
Essential for believable analysis of today's
complex systems
What can we say about the future systems based on
measurements from current systems?
Analyze-NOW

3
Goal and Objectives

Goal - methods and tools for validation of
networked systems
Objectives
Identify realistic fault models for systems as
well as networks
Develop effective fault injection mechanisms
Develop accurate evaluation methods for
quantifying and qualifying fault tolerance
Define quantifiable metrics and benchmarks for
comparing the fault tolerance of target systems
Apply technology to different computation
platforms (HW, OS and networks)

4
Approach

Develop fault injection techniques for network
systems
network media (e.g., transceiver/receiver)
port devices and drivers
communication protocols
application processes
network nodes
Define fault tolerance benchmarking, including
fault types
benchmarks
process and procedure
metric and measurement

5
Availability Benchmark

Performance benchmarks are widespread, but no
availability benchmark currently exists
Advantages of an availability benchmark
High-level comparative system characterization
for customers
Analysis for designers and researchers
Implementation
Benchmark consists of
A synthetic workload generator that creates CPU,
memory, and disk activity
A fault injector that injects faults into CPU
registers, memory locations, and disk controllers
A fixed set of faults and workload generator
inputs
Metrics
system/component/node failure
performance degradation

6
Benchmark Definition

Catastrophic incident
An event that causes the computer system to
become unusable, e.g.,
operating system panic or hangs
failed error recovery attempts that lead to an
unusable system configuration
Performance degradation
The amount of additional time required by an
application due to the presence of faults, taking
into account the number of faults injected. The
time overhead results from
the overhead of error recovery routines
the loss of resources such as CPUs or
disks, which decreases the available compute or
I/O bandwidth
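
The degradation metric above can be sketched as a small calculation. The slide says only that the extra time takes "into account the number of faults injected"; dividing by the fault count is one assumed reading, and the function and parameter names are hypothetical.

```python
def degradation_per_fault(baseline_s: float, faulty_s: float, faults_injected: int) -> float:
    """Extra application run time per injected fault, in seconds.

    Assumption: the metric is normalized by dividing the added time by the
    number of injected faults. The slides do not give the exact formula.
    """
    if faults_injected <= 0:
        raise ValueError("faults_injected must be positive")
    return (faulty_s - baseline_s) / faults_injected
```

A run that takes 100 s without faults and 130 s with 10 injected faults would score 3 s of overhead per fault under this reading.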

7
Fault-Injection

Fault Injection Specs
Injection Strategy: stress-based, path-based, random
Injection Method: by hardware, by software
Fault Location: CPU, memory, disk I/O, network I/O, other
I/Os
Injection Time: load threshold, program execution path,
fault arrival rate
Workload Specs
Rates and Mixes
Interaction Intensity
Diagram labels: Fault Injector, System Under Test (CPU,
I/O), Load Level

8
Stress-based Injection Results
9
Distributed Environment - Fault Injection
Requirements

Distributed test and evaluation environment
Support for an architecture-independent approach
Evaluate hardware and software implemented fault
tolerance at a node level, distributed system
level, and embedded application level
Support fault injection into a variety of targets
including CPU registers, memory, I/O, network,
applications, and OS functions
Examples of fault injection strategies include
random components and locations,
selected hardware and software components (at
predefined or random locations within a
component)
application data and control flow
triggered by high stress conditions
Allow collection and analysis of results to
derive histograms, distributions, and coverage that
help the user understand the impact of faults
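
As an illustration of the "triggered by high stress conditions" strategy combined with random target selection, here is a minimal sketch. The load samples, the threshold, and the function name are assumptions made for the example; the slides do not describe how the actual tool measures load.

```python
import random


def stress_triggered_injections(load_samples, threshold, targets, seed=0):
    """Plan one injection for every sample whose load meets the threshold.

    load_samples: measured load per time step (hypothetical units, 0.0 to 1.0)
    targets:      candidate injection targets, e.g. ["cpu", "memory"]
    Returns a list of (time_step, target) pairs.
    """
    rng = random.Random(seed)  # seeded so a campaign can be repeated
    plan = []
    for step, load in enumerate(load_samples):
        if load >= threshold:
            # Random choice among the allowed targets at the stressed moment.
            plan.append((step, rng.choice(targets)))
    return plan
```

With samples `[0.2, 0.9, 0.5, 0.95]` and a threshold of `0.8`, injections are planned at steps 1 and 3 only.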

10
Software Fault Injection Techniques

Software faults and errors
modify the text/data segment of the program
Memory faults
flip single/multiple memory bits, modify data,
control segments
CPU faults
modify CPU registers, accessible functional
units, buffers
Bus faults
use traps before and after an instruction to
change the code or data used by the instruction
and then restore them after the instruction is
executed
Network faults
modify or delete transmitted messages
introduce faults in network controller
(processor, memory), drivers, buffers
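
The memory-fault technique (flipping single or multiple bits) can be sketched on an in-process byte buffer. This is only a model of the bit manipulation; a real injector would write into the target's memory through a driver or debugger, and the function name is hypothetical.

```python
def flip_bits(image: bytearray, offset: int, bit_positions) -> None:
    """Flip the given bit positions (0 = least significant) in one byte.

    Repeated positions are flipped only once because they share a mask bit.
    """
    mask = 0
    for bit in bit_positions:
        if not 0 <= bit < 8:
            raise ValueError("bit position must be in 0..7")
        mask |= 1 << bit
    image[offset] ^= mask  # XOR inverts exactly the masked bits
```

Applying the same flip twice restores the original byte, which is convenient when the fault is meant to be removed after an experiment.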

11
Principles of the Fault Injection System
Architecture

Distributed client-server architecture
Separation of the target(s) and the evaluation
tool
Modular and portable
Well defined, narrow interface between user
applications and the test environment
Ability to customize the environment by
specifying attributes of a fault injection
strategy
injection approach (e.g., random, custom,
stress-based)
injection target (e.g., application, system,
network)
injection location (CPU, memory, I/O, network
controller) and time
a fault model for maximum impact on system
operation and application execution
the type of analysis
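
The customizable attributes listed above could be captured in a single record. The sketch below is an assumption about how such a record might look; the field names and the allowed value sets are taken from the slide's examples, not from the tool itself.

```python
from dataclasses import dataclass

# Allowed values mirror the examples on the slide; the real tool may accept more.
APPROACHES = {"random", "custom", "stress-based"}
TARGETS = {"application", "system", "network"}
LOCATIONS = {"cpu", "memory", "io", "network-controller"}


@dataclass(frozen=True)
class InjectionStrategy:
    approach: str      # how faults are chosen
    target: str        # what is being tested
    location: str      # where the fault lands
    time_s: float      # when to inject, seconds after workload start (assumed unit)
    fault_model: str   # e.g. "single-bit-flip"
    analysis: str      # e.g. "coverage"

    def __post_init__(self):
        if self.approach not in APPROACHES:
            raise ValueError(f"unknown approach: {self.approach}")
        if self.target not in TARGETS:
            raise ValueError(f"unknown target: {self.target}")
        if self.location not in LOCATIONS:
            raise ValueError(f"unknown location: {self.location}")
```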

12
Principles of the Fault Injection System
Architecture (cont.)

Shared libraries to support
communication handlers for different nodes
fault injection to different nodes and
components (CPU, memory, I/O, and network)
a variety of fault models, including single- and
multiple-bit flips, transient and permanent faults
Graphical user interface to
register a remote machine - a fault injection
target
select and customize a fault injection strategy
invoke a fault injector and workload/application
synchronize fault injector with
workload/application activity
collect and present fault injection results
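
To make the difference between transient and permanent fault models concrete, here is a sketch of a one-byte storage cell. Modeling a permanent fault as a stuck-at bit is an assumption; the slides name the two classes without defining them, and the class is hypothetical.

```python
class FaultyCell:
    """One byte of storage with transient and permanent (stuck-at) faults."""

    def __init__(self, value: int = 0):
        self._value = value & 0xFF
        self._stuck_mask = 0   # which bits are stuck
        self._stuck_bits = 0   # the values those bits are stuck at

    def write(self, value: int) -> None:
        self._value = value & 0xFF

    def read(self) -> int:
        # Stuck bits override whatever was written.
        return (self._value & ~self._stuck_mask) | (self._stuck_bits & self._stuck_mask)

    def inject_transient(self, bit: int) -> None:
        # A transient fault corrupts the stored value once; a later write clears it.
        self._value ^= 1 << bit

    def inject_permanent(self, bit: int, stuck_at: int) -> None:
        # A permanent fault persists across writes.
        self._stuck_mask |= 1 << bit
        if stuck_at:
            self._stuck_bits |= 1 << bit
        else:
            self._stuck_bits &= ~(1 << bit)
```

A transient flip disappears as soon as the cell is rewritten, while a stuck-at bit keeps showing up in every later read.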

13
Fault Injection System Architecture

Architecture diagram. Labels recovered from the figure:
CONTROL NODE
Campaign strategy
Common control mechanism
Error logging
Software Libraries
Fault injectors
Workloads
Fault models
Monitor/measure system/error activity
FAULT INJECTION TARGETS
Node A (client A)
Node B (client B)
Node C (client C)
Communication control handler
Customized Fault Injector
Application(s)
Example Interface

14
The Control Node

User can implement a variety of fault injection
strategies
Front End GUI as the user interface to the fault
injection system
Configuration through defining the Campaign
Strategy, i.e., a file that specifies all the
parameters for fault injection experiments
Experiments controlled by the Common Control
Mechanism

15
The Target Node

Control handler provides all of the process
management services including process execution
and termination
processes comprise injectors, workloads,
applications, monitors
all processes are treated the same as an abstract
process object rather than a process of some
specific type
Communication handler supports communication
between the Control Node and the Target Node
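
The idea of treating injectors, workloads, applications, and monitors as one abstract process object can be sketched with the Python standard library. The class and method names are hypothetical, and the role is kept only as a label so the handler never branches on it.

```python
import subprocess
import sys


class ManagedProcess:
    """Abstract process object: the handler treats every role the same way."""

    def __init__(self, role, argv):
        self.role = role          # "injector", "workload", "application", "monitor"
        self.argv = list(argv)
        self._proc = None

    def start(self):
        self._proc = subprocess.Popen(self.argv)

    def is_running(self):
        return self._proc is not None and self._proc.poll() is None

    def terminate(self):
        if self.is_running():
            self._proc.terminate()
            self._proc.wait(timeout=5)


class ControlHandler:
    """Process management for a target node: execution and termination."""

    def __init__(self):
        self._processes = {}

    def execute(self, name, role, argv):
        proc = ManagedProcess(role, argv)
        proc.start()
        self._processes[name] = proc

    def running(self):
        return sorted(n for n, p in self._processes.items() if p.is_running())

    def terminate_all(self):
        for proc in self._processes.values():
            proc.terminate()
```

For example, `handler.execute("wl", "workload", [sys.executable, "-c", "import time; time.sleep(30)"])` starts a stand-in workload, and `handler.terminate_all()` stops it together with anything else the handler started.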

16
Fault Injection Scenario

The user specifies the Fault Injection Campaign
(done from the common GUI and transformed into a
script file)
The GUI submits the Fault Injection Campaign
(script file) to the Control Node (the Control
Node resides on a dedicated workstation)
The Control Node issues an RPC to invoke the
Communication and Control Handlers on the Target
Node
The Control Node processes the script file (Fault
Injection Campaign) to determine which processes
(e.g., fault injector, application) need to run
on the Target Node
The Control Node sends a message to the
Communication and Control Handlers for starting
the fault injection campaign
The environment runs the fault injection campaign
and collects results from fault injections
The Communication Handler sends the results to
the Control Node for storing and analysis
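
The scenario can be traced with an in-process sketch. Method calls stand in for the RPC and messages, the script is assumed to be a dictionary with a "processes" list, and the result records are placeholders; none of these details come from the actual tool.

```python
class TargetNode:
    def __init__(self):
        self.handlers_up = False
        self.started = []

    def invoke_handlers(self):
        # Stands in for the RPC that brings up the Communication and Control Handlers.
        self.handlers_up = True

    def start_campaign(self, processes):
        if not self.handlers_up:
            raise RuntimeError("handlers must be invoked before the campaign starts")
        self.started = list(processes)
        # Placeholder results: one record per process that was run.
        return [{"process": name, "status": "completed"} for name in processes]


class ControlNode:
    def __init__(self, target):
        self.target = target
        self.results = []

    def submit(self, script):
        self.target.invoke_handlers()                    # invoke handlers on the target
        processes = script["processes"]                  # decide what must run
        returned = self.target.start_campaign(processes) # start campaign, collect results
        self.results.extend(returned)                    # store results for analysis
        return returned
```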

17
Fault Injectors - examples

Driver fault injector - uses a device driver to
inject faults into memory, registers, and OS functions
ptrace fault injector (debugging based) -
controls execution of the target process and
injects faults into memory and registers
I/O functions fault injector - uses dedicated
device driver and possibly modifications of the
original I/O drivers
Direct injections into the memory image of a
process, e.g., use of the /proc file system in Solaris
to control execution of the target application
Network fault injector - employs dedicated (or
modifies existing) software that controls network
hardware to inject faults into network cards or
controllers; can also intercept and corrupt messages
Use of performance monitors (built into the CPU)
to trigger fault injection
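
The message-level behavior of the network fault injector (intercepting, corrupting, or deleting messages) can be sketched on a list of byte strings. The probabilities, the single-bit corruption, and the function name are assumptions for illustration; a real injector would sit in the driver or protocol stack.

```python
import random


def corrupt_or_drop(messages, drop_prob, corrupt_prob, seed=0):
    """Return the messages a receiver would see after fault injection.

    Each message is dropped with probability drop_prob; otherwise it has one
    random bit flipped with probability corrupt_prob.
    """
    rng = random.Random(seed)  # seeded so an experiment can be repeated
    delivered = []
    for msg in messages:
        roll = rng.random()
        if roll < drop_prob:
            continue  # message deleted in transit
        if roll < drop_prob + corrupt_prob and msg:
            data = bytearray(msg)
            index = rng.randrange(len(data))
            data[index] ^= 1 << rng.randrange(8)  # flip one bit
            msg = bytes(data)
        delivered.append(msg)
    return delivered
```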