Title: Fault Injection for High Availability Assessment
1 Fault Injection for High Availability Assessment
- T. Liu, D. Stott, Z. Kalbarczyk, R. K. Iyer
- Center for Reliable and High-Performance Computing
- University of Illinois at Urbana-Champaign
- http://www.crhc.uiuc.edu/DEPEND
2 System Lifetime Validation Approaches
- Design Phase
  - Fault functional simulation
  - Electrical, logic, or functional level
  - DEPEND
- Prototype Phase
  - Stress testing
  - Fault injection in prototype systems
  - N-FTAPE
- Operational Phase
  - Study of naturally occurring faults in real environments
  - Essential for believable analysis of today's complex systems
  - What can we say about future systems based on measurements from current systems?
  - Analyze-NOW
3 Goal and Objectives
- Goal: methods and tools for validation of networked systems
- Objectives
  - Identify realistic fault models for systems as well as networks
  - Develop effective fault injection mechanisms
  - Develop accurate evaluation methods for quantifying and qualifying fault tolerance
  - Define quantifiable metrics and benchmarks for comparing the fault tolerance of target systems
  - Apply the technology to different computation platforms (HW, OS, and networks)
4 Approach
- Develop fault injection techniques for network systems
  - network media (e.g., transceiver/receiver)
  - port devices and drivers
  - communication protocols
  - application processes
  - network nodes
- Define fault tolerance benchmarking, including
  - fault types
  - benchmarks
  - process and procedure
  - metric and measurement
5 Availability Benchmark
- Performance benchmarks are widespread, but no availability benchmark currently exists
- Advantages of an availability benchmark
  - High-level comparative system characterization for customers
  - Analysis for designers and researchers
- Implementation
  - Benchmark consists of
    - A synthetic workload generator that creates CPU, memory, and disk activity
    - A fault injector that injects faults into CPU registers, memory locations, and disk controllers
    - A fixed set of faults and workload generator inputs
  - Metrics
    - system/component/node failure
    - performance degradation
6 Benchmark Definition
- Catastrophic incident
  - An event that causes the computer system to become unusable, e.g.,
    - operating system panics or hangs
    - failed error recovery attempts that lead to an unusable system configuration
- Performance degradation
  - The amount of additional time required by an application due to the presence of faults, taking into account the number of faults injected. The time overhead results from
    - the overhead of error recovery routines
    - the loss of resources such as CPUs or disks, which decreases the available compute or I/O bandwidth
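One way to turn the performance degradation definition above into a number is to normalize the extra run time by the number of injected faults. The sketch below is only an illustration: the function names, the division by fault count, and the sample timings are assumptions, not the benchmark's actual formula.

```python
def degradation_per_fault(baseline_s, faulty_s, faults_injected):
    """Extra run time per injected fault, in seconds.

    Assumption: "taking into account the number of faults injected" is
    modeled here as a simple division by the fault count.
    """
    if faults_injected <= 0:
        raise ValueError("faults_injected must be positive")
    return (faulty_s - baseline_s) / faults_injected


def relative_overhead(baseline_s, faulty_s):
    """Fractional slowdown of the faulty run relative to the fault-free run."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return (faulty_s - baseline_s) / baseline_s
```

For example, a workload that takes 100 s fault-free and 112 s with 4 injected faults would score 3 s of added time per fault under this assumed normalization.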
7 Fault-Injection
- Fault Injection Specs
  - Injection Strategy: stress-based, path-based, random
  - Injection Method: by hardware, by software
  - Fault Location: CPU, memory, disk I/O, network I/O, other I/Os
  - Injection Time: load threshold, program execution path, fault arrival rate
- Workload Specs
  - Rates and Mixes
  - Interaction Intensity
[Diagram labels: Fault Injector, System Under Test, CPU, I/O, Load Level]
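To make the "load threshold" and "random" injection-time options above concrete, here is a minimal sketch that picks injection times from a recorded load trace. The function names, the use of sample indices as time, and the rule of one injection per upward threshold crossing are all assumptions made for illustration.

```python
import random


def stress_based_injection_times(load_trace, threshold):
    """Return sample indices where the load rises to or above the threshold.

    One injection is scheduled per upward crossing, not per sample, so a
    sustained high-load period triggers a single injection (an assumption).
    """
    times = []
    above = False
    for t, load in enumerate(load_trace):
        if load >= threshold and not above:
            times.append(t)
        above = load >= threshold
    return times


def random_injection_times(count, horizon, rng=None):
    """Return `count` distinct random sample indices in [0, horizon), sorted."""
    rng = rng or random.Random()
    return sorted(rng.sample(range(horizon), count))
```

With the trace `[0.1, 0.5, 0.9, 0.95, 0.3, 0.85, 0.2]` and a threshold of 0.8, the stress-based picker schedules injections at samples 2 and 5.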
8 Stress-based Injection Results
9 Distributed Environment - Fault Injection Requirements
- Distributed test and evaluation environment
- Support for an architecture-independent approach
- Evaluate hardware- and software-implemented fault tolerance at the node level, distributed system level, and embedded application level
- Support fault injection into a variety of targets, including CPU registers, memory, I/O, network, applications, and OS functions
- Examples of fault injection strategies include
  - random components and locations
  - selected hardware and software components (at predefined or random locations within a component)
  - application data and control flow
  - triggered by high stress conditions
- Allow collection and analysis of results to derive histograms, distributions, and coverage, to help the user understand the impact of faults
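The result-analysis requirement above can be sketched as a small summary over per-injection outcomes. The outcome labels, the `HANDLED` set, and the definition of coverage as the fraction of injections that were detected or recovered are assumptions for illustration, not definitions taken from the slides.

```python
from collections import Counter

# Hypothetical outcome labels counted as "handled" by the target system.
HANDLED = {"detected", "recovered"}


def summarize_outcomes(outcomes):
    """Build a histogram, a relative distribution, and a coverage estimate."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no outcomes recorded")
    handled = sum(n for label, n in counts.items() if label in HANDLED)
    return {
        "histogram": dict(counts),
        "distribution": {label: n / total for label, n in counts.items()},
        "coverage": handled / total,  # assumed definition of coverage
    }
```

A real campaign would likely condition coverage on activated faults only; this sketch treats every injection equally to stay short.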
10 Software Fault Injection Techniques
- Software faults and errors
  - modify the text/data segment of the program
- Memory faults
  - flip single/multiple memory bits, modify data, control segments
- CPU faults
  - modify CPU registers, accessible functional units, buffers
- Bus faults
  - use traps before and after an instruction to change the code or data used by the instruction, then restore them after the instruction is executed
- Network faults
  - modify or delete transmitted messages
  - introduce faults in the network controller (processor, memory), drivers, buffers
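Two of the techniques above, memory bit flips and modified or deleted messages, can be illustrated in user space. The sketch below uses a `bytearray` as a stand-in for target memory and a `bytes` object as a message; all function names and the `"modify"`/`"delete"` mode strings are hypothetical, and a real injector would operate on the actual process image or network path.

```python
import random


def flip_bit(memory, byte_index, bit_index):
    """Flip one bit in place in a bytearray standing in for target memory."""
    if not 0 <= bit_index < 8:
        raise ValueError("bit_index must be in 0..7")
    memory[byte_index] ^= 1 << bit_index


def flip_random_bits(memory, count, rng=None):
    """Flip `count` distinct random bits; return their (byte, bit) positions."""
    rng = rng or random.Random()
    positions = rng.sample(range(len(memory) * 8), count)
    for pos in positions:
        flip_bit(memory, pos // 8, pos % 8)
    return [(pos // 8, pos % 8) for pos in positions]


def inject_network_fault(message, mode, rng=None):
    """Return the message as delivered: None if deleted, a corrupted copy if modified."""
    if mode == "delete":
        return None
    if mode == "modify":
        corrupted = bytearray(message)
        flip_random_bits(corrupted, 1, rng)
        return bytes(corrupted)
    raise ValueError(f"unknown mode: {mode!r}")
```

Flipping the same bit twice restores the original value, which is a simple way to model a transient fault that is later removed.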
11 Principles of the Fault Injection System Architecture
- Distributed client-server architecture
- Separation of the target(s) and the evaluation tool
- Modular and portable
- Well-defined, narrow interface between user applications and the test environment
- Ability to customize the environment by specifying attributes of a fault injection strategy
  - injection approach (e.g., random, custom, stress-based)
  - injection target (e.g., application, system, network)
  - injection location (CPU, memory, I/O, network controller) and time
  - a fault model for maximum impact on system operation and application execution
  - the type of analysis
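The strategy attributes listed above map naturally onto a small validated record. The sketch below is a guess at such a structure: the class name, field names, and the free-form `time`, `fault_model`, and `analysis` strings are assumptions, while the allowed values are the examples given on the slide.

```python
from dataclasses import dataclass

# Allowed values taken from the examples on the slide; a real tool may accept more.
APPROACHES = {"random", "custom", "stress-based"}
TARGETS = {"application", "system", "network"}
LOCATIONS = {"CPU", "memory", "I/O", "network controller"}


@dataclass(frozen=True)
class InjectionStrategy:
    """Hypothetical record of one fault injection strategy."""
    approach: str
    target: str
    location: str
    time: str         # free-form trigger description (format is an assumption)
    fault_model: str  # e.g., "single-bit flip, transient"
    analysis: str     # e.g., "coverage"

    def __post_init__(self):
        if self.approach not in APPROACHES:
            raise ValueError(f"unknown approach: {self.approach!r}")
        if self.target not in TARGETS:
            raise ValueError(f"unknown target: {self.target!r}")
        if self.location not in LOCATIONS:
            raise ValueError(f"unknown location: {self.location!r}")
```

Validating at construction time keeps a malformed strategy from reaching the target node, where a bad parameter would be harder to diagnose.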
12 Principles of the Fault Injection System Architecture (cont.)
- Shared libraries to support
  - communication handlers for different nodes
  - fault injection into different nodes and components (CPU, memory, I/O, and network)
  - a variety of fault models, including single- and multiple-bit flips, transient and permanent faults
- Graphical user interface to
  - register a remote machine as a fault injection target
  - select and customize a fault injection strategy
  - invoke a fault injector and workload/application
  - synchronize the fault injector with workload/application activity
  - collect and present fault injection results
13 Fault Injection System Architecture
[Architecture diagram; recoverable labels]
- CONTROL NODE
- FAULT INJECTION TARGETS: Node A (client A), Node B (client B), Node C (client C)
- Example Interface, Campaign strategy, Common control mechanism, Error logging
- Communication control handler, Customized Fault Injector, Application(s)
- Software Libraries, Fault injectors, Workloads, Fault models
- Monitor/measure system/error activity
14 The Control Node
- User can implement a variety of fault injection strategies
- Front End GUI as the user interface to the fault injection system
- Configuration through defining the Campaign Strategy, i.e., a file that specifies all the parameters for fault injection experiments
- Experiments controlled by the Common Control Mechanism
15 The Target Node
- Control handler provides all of the process management services, including process execution and termination
  - processes comprise injectors, workloads, applications, monitors
  - all processes are treated the same, as an abstract process object rather than a process of some specific type
- Communication handler supports communication between the Control Node and the Target Node
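The idea of treating every process as the same abstract object can be sketched with the standard `subprocess` module. Everything here is hypothetical: the `ManagedProcess` and `ControlHandler` names, the method names, and the `python_cmd` helper are illustrations, not the interface of the actual control handler.

```python
import subprocess
import sys


class ManagedProcess:
    """Abstract process object: the handler never inspects what kind it is."""

    def __init__(self, name, argv):
        self.name = name
        self.argv = list(argv)
        self.popen = None

    def start(self):
        self.popen = subprocess.Popen(self.argv)

    def running(self):
        return self.popen is not None and self.popen.poll() is None

    def terminate(self):
        if self.popen is None:
            return
        if self.popen.poll() is None:
            self.popen.terminate()
        self.popen.wait()  # reap the child so no zombie is left behind


class ControlHandler:
    """Starts and stops injectors, workloads, applications, and monitors alike."""

    def __init__(self):
        self.processes = {}

    def execute(self, name, argv):
        proc = ManagedProcess(name, argv)
        proc.start()
        self.processes[name] = proc
        return proc

    def terminate_all(self):
        for proc in self.processes.values():
            proc.terminate()


def python_cmd(code):
    """Hypothetical helper: command line that runs a Python snippet as a child."""
    return [sys.executable, "-c", code]
```

A handler could, for instance, call `execute("workload", python_cmd("import time; time.sleep(30)"))` and `execute("injector", ...)` in exactly the same way, then stop both with `terminate_all()`.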
16 Fault Injection Scenario
- The user specifies the Fault Injection Campaign (done from the common GUI and transformed into a script file)
- The GUI submits the Fault Injection Campaign (script file) to the Control Node (the Control Node resides on a dedicated workstation)
- The Control Node generates an RPC to invoke the Communication and Control Handlers on the Target Node
- The Control Node processes the script file (Fault Injection Campaign) to determine which processes (e.g., fault injector, application) need to run on the Target Node
- The Control Node sends a message to the Communication and Control Handlers to start the fault injection campaign
- The environment runs the fault injection campaign and collects results from fault injections
- The Communication Handler sends the results to the Control Node for storage and analysis
17 Fault Injectors - examples
- Driver fault injector - uses a device driver to inject faults into memory, registers, and OS functions
- ptrace fault injector (debugging based) - controls execution of the target process and injects faults into memory and registers
- I/O functions fault injector - uses a dedicated device driver and possibly modifications of the original I/O drivers
- Direct injections into the memory image of a process, e.g., use of the proc file system in Solaris to control execution of the target application
- Network fault injector - employs dedicated (or modifies existing) software that controls network hardware to inject faults into network cards or controllers; can intercept and corrupt messages
- Use of performance monitors (built into the CPU) to trigger fault injection
18 Hardware Implemented Fault Injector
- non-intrusive operation
- rapid response
- high speed of operation
- triggers and injection patterns stored in FPGA
- driven by high-level system analysis tools