Fault Injection for High Availability Assessment

1
Fault Injection for High Availability Assessment
  • T. Liu, D. Stott, Z. Kalbarczyk, R. K. Iyer,
  • Center for Reliable and High-Performance
    Computing
  • University of Illinois at Urbana-Champaign
  • http://www.crhc.uiuc.edu/DEPEND

2
System Lifetime Validation Approaches
  • Design Phase
  • Fault functional simulation
  • Electrical, logic, or functional level
  • DEPEND
  • Prototype Phase
  • Stress testing
  • Fault injection in prototype systems
  • N-FTAPE
  • Operational Phase
  • Study of naturally occurring faults in real
    environments
  • Essential for believable analysis of today's
    complex systems
  • What can we say about future systems based on
    measurements from current systems?
  • Analyze-NOW

3
Goal and Objectives
  • Goal - methods and tools for validation of
    networked systems
  • Objectives
  • Identify realistic fault models for systems as
    well as networks
  • Develop effective fault injection mechanisms
  • Develop accurate evaluation methods for
    quantifying and qualifying fault tolerance
  • Define quantifiable metrics and benchmarks for
    comparing the fault tolerance of target systems
  • Apply technology to different computation
    platforms (HW, OS and networks)

4
Approach
  • Develop fault injection techniques for network
    systems
  • network media (e.g., transceiver/receiver)
  • port devices and drivers
  • communication protocols
  • application processes
  • network nodes
  • Define fault tolerance benchmarking, including
  • fault types
  • benchmarks
  • process and procedure
  • metric and measurement

5
Availability Benchmark
  • Performance benchmarks are widespread, but no
    availability benchmark currently exists
  • Advantages of an availability benchmark
  • High-level comparative system characterization
    for customers
  • Analysis for designers and researchers
  • Implementation
  • Benchmark consists of
  • A synthetic workload generator that creates CPU,
    memory, and disk activity
  • A fault injector that injects faults into CPU
    registers, memory locations, and disk controllers
  • A fixed set of faults and workload generator
    inputs
  • Metrics
  • system/component/node failure
  • performance degradation

6
Benchmark Definition
  • Catastrophic incident
  • An event that causes the computer system to
    become unusable, e.g.,
  • operating system panics or hangs
  • failed error recovery attempts that lead to an
    unusable system configuration
  • Performance degradation
  • The amount of additional time required by an
    application due to the presence of faults, taking
    into account the number of faults injected. The
    time overhead results from
  • the overhead of error recovery routines
  • the loss of resources such as CPUs or disks,
    which decreases the available compute or
    I/O bandwidth
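
One way to read the performance-degradation definition above is as a time overhead normalized by the number of injected faults. The slides do not give an exact formula, so the normalization below is an assumption, and the function and parameter names are illustrative only.

```python
def performance_degradation(baseline_s: float, faulty_s: float,
                            faults_injected: int) -> float:
    """Extra run time per injected fault, in seconds.

    Assumed normalization: (time with faults - fault-free time) / faults.
    """
    if faults_injected <= 0:
        raise ValueError("faults_injected must be positive")
    return (faulty_s - baseline_s) / faults_injected


def relative_overhead(baseline_s: float, faulty_s: float) -> float:
    """Extra run time as a fraction of the fault-free run time."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return (faulty_s - baseline_s) / baseline_s
```

For instance, a run that takes 130 s with 10 injected faults against a 100 s fault-free baseline would score 3 s of added time per fault under this reading.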

7
Fault Injection
  • Fault injection specs
  • Injection strategy: stress-based, path-based,
    random
  • Injection method: by hardware, by software
  • Fault location: CPU, memory, disk I/O, network
    I/O, other I/Os
  • Injection time: load threshold, program
    execution path, fault arrival rate
  • Workload specs: rates and mixes, interaction
    intensity
  • Diagram labels: Fault Injector, CPU, System Under
    Test, Load Level, I/O

8
Stress-based Injection Results

9
Distributed Environment - Fault Injection
Requirements
  • Distributed test and evaluation environment
  • Support for an architecture-independent approach
  • Evaluate hardware- and software-implemented fault
    tolerance at the node level, distributed system
    level, and embedded application level
  • Support fault injection into a variety of targets
    including CPU registers, memory, I/O, network,
    applications, and OS functions
  • Examples of fault injection strategies include
  • random components and locations,
  • selected hardware and software components (at
    predefined or random locations within a
    component)
  • application data and control flow
  • triggered by high stress conditions
  • Allow collection and analysis of results to
    derive histograms, distributions, and coverage
    that help the user understand the impact of faults

10
Software Fault Injection Techniques
  • Software faults and errors
  • modify the text/data segment of the program
  • Memory faults
  • flip single/multiple memory bits, modify data,
    control segments
  • CPU faults
  • modify CPU registers, accessible functional
    units, buffers
  • Bus faults
  • use traps before and after an instruction to
    change the code or data used by the instruction
    and then restore them after the instruction is
    executed
  • Network faults
  • modify or delete transmitted messages
  • introduce faults in network controller
    (processor, memory), drivers, buffers

11
Principles of the Fault Injection System
Architecture
  • Distributed client-server architecture
  • Separation of the target(s) and the evaluation
    tool
  • Modular and portable
  • Well defined, narrow interface between user
    applications and the test environment
  • Ability to customize the environment by
    specifying attributes of a fault injection
    strategy
  • injection approach (e.g., random, custom,
    stress-based)
  • injection target (e.g., application, system,
    network)
  • injection location (CPU, memory, I/O, network
    controller) and time
  • a fault model for maximum impact on system
    operation and application execution
  • the type of analysis

12
Principles of the Fault Injection System
Architecture (cont.)
  • Shared libraries to support
  • communication handlers for different nodes
  • fault injection to different nodes and
    components (CPU, memory, I/O, and network)
  • a variety of fault models including single and
    multiple-bit flip, transient and permanent faults
  • Graphical user interface to
  • register a remote machine - a fault injection
    target
  • select and customize a fault injection strategy
  • invoke a fault injector and workload/application
  • synchronize fault injector with
    workload/application activity
  • collect and present fault injection results

13
Fault Injection System Architecture
  • Diagram labels: CONTROL NODE, Campaign strategy,
    Common control mechanism, Error logging
  • Diagram labels: FAULT INJECTION TARGETS, Node A
    (client A), Node B (client B), Node C (client C)
  • Diagram labels: Example Interface, Communication
    control handler, Customized Fault Injector,
    Application(s)
  • Diagram labels: Software Libraries, Fault
    injectors, Monitor/measure system/error activity,
    Workloads, Fault models

14
The Control Node
  • User can implement a variety of fault injection
    strategies
  • Front End GUI as the user interface to the fault
    injection system
  • Configuration through defining the Campaign
    Strategy, i.e., a file that specifies all the
    parameters for fault injection experiments
  • Experiments controlled by the Common Control
    Mechanism
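
The Campaign Strategy file described above could be represented roughly as follows. This is a sketch under stated assumptions: the slides do not specify the file format, so JSON, the field names, and the allowed values (taken from the examples on the architecture slides) are all hypothetical.

```python
import json
from dataclasses import dataclass

# Allowed values are taken from the examples in the slides; the set is assumed.
APPROACHES = {"random", "custom", "stress-based"}
TARGETS = {"application", "system", "network"}
LOCATIONS = {"cpu", "memory", "io", "network-controller"}


@dataclass
class CampaignStrategy:
    approach: str      # injection approach
    target: str        # injection target
    location: str      # injection location
    fault_model: str   # e.g. "single-bit-flip"
    persistence: str   # "transient" or "permanent"
    runs: int          # number of fault injection experiments
    analysis: str      # type of analysis, e.g. "coverage"

    def validate(self) -> None:
        if self.approach not in APPROACHES:
            raise ValueError(f"unknown approach: {self.approach}")
        if self.target not in TARGETS:
            raise ValueError(f"unknown target: {self.target}")
        if self.location not in LOCATIONS:
            raise ValueError(f"unknown location: {self.location}")
        if self.persistence not in {"transient", "permanent"}:
            raise ValueError(f"unknown persistence: {self.persistence}")
        if self.runs <= 0:
            raise ValueError("runs must be positive")


def load_campaign(text: str) -> CampaignStrategy:
    """Parse a JSON campaign description and check its fields."""
    campaign = CampaignStrategy(**json.loads(text))
    campaign.validate()
    return campaign
```

Validating the file before any process is started keeps a malformed campaign from reaching the target nodes.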

15
The Target Node
  • Control handler provides all of the process
    management services including process execution
    and termination
  • processes comprise injectors, workloads,
    applications, monitors
  • all processes are treated the same as an abstract
    process object rather than a process of some
    specific type
  • Communication handler supports communication
    between the Control Node and the Target Node
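
The idea that injectors, workloads, applications, and monitors are all handled as one abstract process object can be sketched with Python's `subprocess` module. The class and method names here are hypothetical and are not taken from the tool described in the slides.

```python
import subprocess
import sys
from typing import Dict, List


class ControlHandler:
    """Starts and stops named processes without caring what kind they are."""

    def __init__(self) -> None:
        self._procs: Dict[str, subprocess.Popen] = {}

    def start(self, name: str, argv: List[str]) -> None:
        if name in self._procs:
            raise ValueError(f"process already registered: {name}")
        self._procs[name] = subprocess.Popen(argv)

    def running(self) -> List[str]:
        """Names of processes that have not exited yet."""
        return sorted(n for n, p in self._procs.items() if p.poll() is None)

    def terminate(self, name: str, timeout: float = 5.0) -> int:
        """Stop a process and return its exit code."""
        proc = self._procs.pop(name)
        if proc.poll() is None:
            proc.terminate()
        try:
            return proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()  # escalate if the process ignores the first request
            return proc.wait()


if __name__ == "__main__":
    handler = ControlHandler()
    # A stand-in workload; an injector or monitor would be started the same way.
    handler.start("workload", [sys.executable, "-c", "import time; time.sleep(30)"])
    print(handler.running())
    handler.terminate("workload")
```

Because the handler only sees a name and a command line, adding a new kind of injector or monitor requires no change to the process-management code.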

16
Fault Injection Scenario
  • The user specifies the Fault Injection Campaign
    (done from the common GUI and transformed into a
    script file)
  • The GUI submits the Fault Injection Campaign
    (script file) to the Control Node (the Control
    Node resides on a dedicated workstation)
  • The Control Node issues an RPC to invoke the
    Communication and Control Handlers on the Target
    Node
  • The Control Node processes the script file (Fault
    Injection Campaign) to determine which processes
    (e.g., fault injector, application) need to run
    on the Target Node
  • The Control Node sends a message to the
    Communication and Control Handlers for starting
    the fault injection campaign
  • The environment runs the fault injection campaign
    and collects results from fault injections
  • The Communication Handler sends the results to
    the Control Node for storing and analysis

17
Fault Injectors - examples
  • Driver fault injector - uses a device driver to
    inject faults into memory, registers, and OS
    functions
  • ptrace fault injector (debugging based) -
    controls execution of the target process and
    injects faults into memory and registers
  • I/O functions fault injector - uses dedicated
    device driver and possibly modifications of the
    original I/O drivers
  • Direct injections into the memory image of a
    process, e.g., use of the proc file system in
    Solaris to control execution of the target
    application
  • Network fault injector - employs dedicated (or
    modifies existing) software that controls network
    hardware to inject faults into network cards or
    controllers; can intercept and corrupt messages
  • Use of performance monitors (built into the CPU)
    to trigger fault injection
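
The message interception and corruption mentioned for the network fault injector can be illustrated at the software level. The sketch below is an assumption about how such a filter might look: it sits between sender and receiver and either drops a message, corrupts one bit of it, or passes it through. The class name and the probability parameters are hypothetical.

```python
import random
from typing import Optional


class MessageFaultInjector:
    """Drops or corrupts messages passing through it, with given probabilities."""

    def __init__(self, drop_prob: float = 0.0, corrupt_prob: float = 0.0,
                 rng: Optional[random.Random] = None) -> None:
        if drop_prob < 0 or corrupt_prob < 0 or drop_prob + corrupt_prob > 1:
            raise ValueError("probabilities must be non-negative and sum to at most 1")
        self.drop_prob = drop_prob
        self.corrupt_prob = corrupt_prob
        self.rng = rng or random.Random()

    def process(self, message: bytes) -> Optional[bytes]:
        """Return None for a dropped message, else the (possibly corrupted) bytes."""
        r = self.rng.random()
        if r < self.drop_prob:
            return None
        if r < self.drop_prob + self.corrupt_prob and message:
            buf = bytearray(message)
            pos = self.rng.randrange(len(buf) * 8)
            buf[pos // 8] ^= 1 << (pos % 8)  # flip a single bit
            return bytes(buf)
        return message
```

A caller would wrap its send path, for example forwarding `injector.process(msg)` to the socket only when the result is not `None`.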

18
Hardware Implemented Fault Injector
  • non-intrusive operation
  • rapid response
  • high speed of operation
  • triggers and injection patterns stored in FPGA
  • driven by high-level system analysis tools