Title: Design of High Availability Systems and Networks: Validation

Slide 1: Design of High Availability Systems and Networks: Validation

Ravi K. Iyer
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
iyer@crhc.uiuc.edu
http://www.crhc.uiuc.edu/DEPEND
Slide 2: Outline
- Introduction
- Validation methods
- Design phase
- Fault simulation
- Prototype phase
- HW or SW implemented fault injection
- Operational phase
- Measurement and analysis of field systems
Slide 3: Why Validation/Benchmarking?
- Characterize different detection and recovery mechanisms
  - Coverage
  - Performance overhead
- Determine system/application sensitivity to errors
  - Single points of failure (dependability bottlenecks)
  - Error propagation patterns
  - Placement of detection and recovery mechanisms
  - Error susceptibility of the runtime fault management infrastructure
- Analyze cost/reliability/performance tradeoffs
- Need benchmarks to facilitate meaningful comparison of designs and systems
Slide 4: Experimental Validation
- Early Design Phase
  - Approach and goals
    - CAD environments used to evaluate the design via simulation
    - Simulated fault injection experiments
    - Evaluate effectiveness of fault-tolerant mechanisms
    - Provide timely feedback to system designers
  - Information produced
    - error latency, error detection coverage, recovery time distribution
  - Limitations/issues
    - Simulations need accurate inputs, fault models, and validation of results
    - Simulation time explosion
- Prototype Phase
  - Approach and goals
    - System runs under controlled workload conditions
    - Controlled fault injections used to evaluate the system in the presence of faults
  - Information produced
    - error latency, propagation, detection distributions, availability
  - Limitations/issues
    - Injected faults should create/induce failure scenarios representative of actual system operation
- Operational Phase
  - Approach and goals
    - Study naturally occurring errors
    - Study systems in the field, under real workloads
    - Analyze collected error and performance data
  - Information produced
    - actual failure characteristics, failure rates, time to failure distribution
  - Limitations/issues
    - HW/SW instrumentation, analysis tools
Slide 5: Design Phase Fault Injection

[Figure: hybrid simulation]
Slide 6: Simulation at Different Levels
- Electrical level
  - transistor, circuit, chip
- Logic level
  - circuit, VLSI systems
- Function level
  - VLSI system, computer and network systems

Levels of simulated fault injection:
- Electrical level: change current, change voltage
- Logic level: stuck-at 0 or 1, inverted fault
- Function level: change CPU registers, network faults, flip memory bit

[Figure: simulation hierarchy relating physical processes and electrical circuits, logic gates and logic operations, and functional units]
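As a rough illustration of function-level injection (not taken from any specific simulator; all names here are hypothetical), the sketch below flips or forces single bits in a word-sized register image, mirroring the bit-flip and stuck-at models listed above.

```python
# Minimal sketch of function-level fault injections on a simulated machine
# state. RegisterFile contents, names, and widths are illustrative only.

import random

def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Invert a single bit of an integer-valued register or memory word."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)

def stuck_at(value: int, bit: int, stuck_to: int, width: int = 32) -> int:
    """Force a bit to 0 or 1 (the stuck-at model applied to a word)."""
    if stuck_to:
        return value | (1 << bit)
    return value & ~(1 << bit) & ((1 << width) - 1)

# Example: inject a transient single-bit error into a CPU register image.
registers = {"r0": 0x0000_0010, "r1": 0xDEAD_BEEF}
target = random.choice(list(registers))
bit = random.randrange(32)
registers[target] = flip_bit(registers[target], bit)
print(f"Injected bit-flip into {target}, bit {bit}: {registers[target]:#010x}")
```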
Slide 7: Issues in Simulated Fault Injection
- Fault models
- Fault conditions, fault types
- Number of faults
- Fault times
- Fault locations
- Workload
- Real applications
- Benchmarks
- Synthetic programs
- Simulation time explosion
- Mixed-mode simulation
- Importance sampling
- Concurrent simulation
- Accelerated fault simulation
- Hierarchical simulation
Slide 8: Fault Injection at the Electrical Level
- Why is it needed?
  - Study the impact of physical causes
  - Simple stuck-at models do not represent many real types of faults

[Figure: transistor-level simulation and device-physics-level simulation]
Slide 9: Simulated Fault Injection at the Logic Level
- Fault models
  - Basic models
    - stuck-at (permanent): forcing a logic value for the entire simulation duration
    - inverted fault (transient): altering a logic value momentarily
  - Fault dictionary approach
    - Use electrical-level simulation to derive logic-level fault models
    - dictionary entry: input vector, injection time, fault location
[Figure: transistor-level description of a 4-bit adder (inputs A(3:0), B(3:0), Cin; outputs S(3:0), Cout), the current-burst fault model, and the logic-level fault dictionary obtained by simulating all nodes under all input combinations]
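A minimal sketch of the logic-level models above, assuming a hypothetical gate-level 1-bit full adder (net names and the dictionary format are illustrative, not the slide's actual adder or dictionary):

```python
# Gate-level 1-bit full adder with injectable stuck-at and transient faults.
# fault = (net_name, mode) where mode is 'sa0', 'sa1' (stuck-at) or 'inv'
# (momentary inversion).

def full_adder(a, b, cin, fault=None):
    def drive(name, value):
        if fault and fault[0] == name:
            mode = fault[1]
            value = 0 if mode == "sa0" else 1 if mode == "sa1" else value ^ 1
        return value

    axb  = drive("axb",  a ^ b)
    s    = drive("s",    axb ^ cin)
    c1   = drive("c1",   a & b)
    c2   = drive("c2",   axb & cin)
    cout = drive("cout", c1 | c2)
    return s, cout

# Build a tiny fault dictionary: for every net, fault type, and input vector,
# record whether the fault is observable at the outputs (cf. the dictionary
# entries of input vector, injection time, and fault location).
fault_dict = []
for net in ("axb", "s", "c1", "c2", "cout"):
    for mode in ("sa0", "sa1", "inv"):
        for a in (0, 1):
            for b in (0, 1):
                for cin in (0, 1):
                    good = full_adder(a, b, cin)
                    bad = full_adder(a, b, cin, fault=(net, mode))
                    if good != bad:
                        fault_dict.append(((a, b, cin), net, mode))
print(f"{len(fault_dict)} observable (input vector, net, fault) entries")
```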
Slide 10: Fault Injection at the Function Level
- Diversity of components
  - Object-oriented approach
- Fault models
  - Various types, depending on the type of component
  - Examples (see the sketch below)
    - Single bit-flip for a memory or register fault
    - Message corruption for a communication channel fault
    - Service interrupt for a node fault
  - More detailed fault models derived from lower-level simulation
- Impact of software
  - Impact of faults is application dependent
  - Software effects can be studied at this level
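One way to read the object-oriented approach: each simulated component type carries its own fault model. The classes and method names below are hypothetical, a sketch rather than any particular tool's API.

```python
# Function-level fault models attached to component objects: bit-flip for
# memory, message corruption for a channel, service interrupt for a node.

import random

class Memory:
    def __init__(self, size):
        self.words = [0] * size
    def inject(self):                       # single bit-flip fault model
        addr = random.randrange(len(self.words))
        self.words[addr] ^= 1 << random.randrange(32)

class Channel:
    def __init__(self):
        self.in_flight = []
    def send(self, msg: bytes):
        self.in_flight.append(msg)
    def inject(self):                       # message-corruption fault model
        if self.in_flight:
            i = random.randrange(len(self.in_flight))
            msg = bytearray(self.in_flight[i])
            msg[random.randrange(len(msg))] ^= 1 << random.randrange(8)
            self.in_flight[i] = bytes(msg)

class Node:
    def __init__(self):
        self.up = True
    def inject(self):                       # service-interrupt fault model
        self.up = False
```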
Slide 11: Hierarchical Fault Simulation
- Lower-level fault effects propagate to the higher levels

[Figure: hierarchical simulation spanning the device-physics, transistor, electrical, logic, chip, and system levels; at the system level, hosts, a network interface, and a switch on a local network, with software (program, memory, modules i, j, k) modeled on top of the hardware; ionizing particles at the device-physics level produce an electrical transient that propagates upward]
Slide 12: Prototype Phase: Hardware-Implemented Fault Injection

[Figure: MESSALINE architecture - input files drive the generation of system activity on the system under study; a fault injector and a monitor/controller close the loop]

- MESSALINE
  - Developed at LAAS-CNRS, France
  - Both probes and socket insertion are used
  - Can inject at up to 32 injection points
  - Applications
    - A subsystem of a railway interlocking control system
    - A distributed communication system
Slide 13: Prototype Phase: Software-Implemented Fault Injection
- Advantages: flexibility, low cost
- Disadvantages: perturbation of the workload, low time resolution
- Targets for software fault injection (a memory-injection sketch follows this list)
  - Software faults and errors
    - modify the text/data segment of the program
  - Memory faults
    - flip memory bits
  - CPU faults
    - modify CPU registers, cache, buffers
  - Bus faults
    - use traps before and after an instruction to change the code or data used by the instruction; restore the data after the instruction is executed
  - Network faults
    - modify or delete transmitted messages
    - introduce faults in network controllers, drivers, buffers
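A rough sketch of one common software-implemented technique, flipping a bit at a chosen virtual address of a target process through Linux's /proc/&lt;pid&gt;/mem. It assumes the injector has ptrace/root privileges and a valid target address; real injectors add triggers, monitoring, and safety checks, and the pid/address in the usage line are hypothetical.

```python
# Software-implemented memory fault injection on Linux (sketch only).

import random

def flip_bit_in_process(pid, address, bit=None):
    """Flip one bit of the byte at `address` in the target process's memory.
    Requires sufficient privileges to write /proc/<pid>/mem."""
    path = f"/proc/{pid}/mem"
    with open(path, "r+b", buffering=0) as mem:
        mem.seek(address)
        byte = mem.read(1)[0]
        bit = random.randrange(8) if bit is None else bit
        mem.seek(address)
        mem.write(bytes([byte ^ (1 << bit)]))

# Usage (hypothetical pid and address):
# flip_bit_in_process(pid=1234, address=0x7F0000001000)
```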
Slide 14: Prototype Phase: Fault Injection Requirements
- Distributed test and evaluation environment
- Support for an architecture-independent approach
  - Evaluate hardware- and software-implemented fault tolerance of single-node architectures, distributed systems, and embedded applications
- Support fault injection into a variety of targets, including CPU registers, cache, memory, I/O, network, applications, and OS functions
- Examples of fault injection strategies include
  - random components and locations
  - selected hardware and software components (either predefined or random locations within a component)
  - application data and control flow
  - triggered by high-stress conditions
  - impact on system timing (e.g., to mimic omission failures)
- Allow collection and analysis of results to derive measures for characterizing the system (e.g., coverage, fault severity, propagation, latency, availability, etc.)
Slide 15: Fault/Error Injection

Fault injection specs:
- Injection strategy: stress-based, path-based, random
- Injection method: by hardware, by software
- Fault location: CPU, memory, disk I/O, network I/O, other I/Os
- Injection time: load threshold, program execution path, fault arrival rate

Workload specs:
- Rates and mixes
- Interaction intensity

[Figure: fault injector driving the system under test (CPU, I/O) at a given load level]

A sketch of these specs as a machine-readable injection plan follows.
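The field names and example values below are illustrative only; they simply encode the strategy/method/location/time and workload dimensions listed above, not any real tool's plan format.

```python
# Illustrative machine-readable form of the fault injection and workload specs.

from dataclasses import dataclass, field

@dataclass
class FaultInjectionSpec:
    strategy: str = "random"         # "random" | "stress-based" | "path-based"
    method: str = "software"         # "hardware" | "software"
    location: str = "memory"         # "CPU" | "memory" | "disk I/O" | "network I/O"
    trigger: str = "fault-arrival"   # "load-threshold" | "execution-path" | "fault-arrival"
    fault_rate: float = 0.1          # faults per second for random arrivals

@dataclass
class WorkloadSpec:
    rates_and_mixes: dict = field(default_factory=lambda: {"cpu": 0.6, "io": 0.4})
    interaction_intensity: str = "medium"

plan = (FaultInjectionSpec(strategy="stress-based", location="CPU",
                           trigger="load-threshold"),
        WorkloadSpec())
```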
Slide 16: NFTAPE
- NFTAPE is a tool for conducting automated fault/error injection-based dependability characterization
  - A tool that enables a user (1) to specify a fault/error injection plan, (2) to carry out injection experiments, and (3) to collect the experimental results for analysis
- Facilitates automated execution of fault/error injection experiments
- Targets assessment of a broad set of dependability metrics, e.g., availability, reliability, coverage, mean time to failure
- Operates in a distributed environment
- Imposes minimal disturbance on target systems
Slide 17: NFTAPE Architecture

[Figure: a Control Host running a Campaign Script and keeping a log, connected over a LAN to the error injection targets; each target node runs a Process Manager, an Injector Process, and an Application Process]
Slide 18: Control Host and Process Manager
- Control Host
  - Processes a Campaign Script, a file that specifies a state machine or control flow followed by the control host during the fault injection campaign
  - Simple yet general way to customize a fault injection experiment
  - Experiments controlled by the Common Control Mechanism
  - Implemented in Java to ensure portability
- Process Manager
  - Daemon on each target node to manage processes on the target node(s), including process execution and termination
    - processes include injectors, workloads, applications, monitors
    - all processes are treated as abstract process objects rather than processes of a specific type
  - Facilitates communication between the Control Host and target nodes
Slide 19: State Machine of an Example Campaign Script

[Figure: example campaign script state machine with states ST_init (initialize variables and events, start a logfile), ST_start_fi_trig (start the fault injector and trigger processes), ST_run_app (start the application, activate the trigger), ST_Trigger_ON (inject errors, deactivate the trigger after a specified time), error states ST_error1 and ST_error2 (reached via ER_condition_1 and ER_condition_2), and ST_finish (terminate processes, exit the Control Host); transitions are guarded by TRUE and Condition_1 through Condition_3]
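A compressed sketch of how such a campaign script could be driven: a table mapping each state to an action and an ordered list of (condition, next state) pairs. The state names follow the slide, but the wiring and the lambda actions are placeholders rather than the actual script.

```python
# Table-driven campaign state machine (illustrative wiring only).

def run_campaign(states, start="ST_init"):
    state = start
    while state is not None:
        action, transitions = states[state]
        action()                                   # e.g. start app, inject
        state = next((nxt for cond, nxt in transitions if cond()), None)

states = {
    "ST_init":          (lambda: print("init vars, open logfile"),
                         [(lambda: True, "ST_start_fi_trig")]),
    "ST_start_fi_trig": (lambda: print("start injector/trigger processes"),
                         [(lambda: True, "ST_run_app")]),
    "ST_run_app":       (lambda: print("start application, activate trigger"),
                         [(lambda: True, "ST_Trigger_ON")]),
    "ST_Trigger_ON":    (lambda: print("inject errors, deactivate trigger"),
                         [(lambda: True, "ST_finish")]),
    "ST_finish":        (lambda: print("terminate processes, exit"),
                         []),
}
run_campaign(states)
```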
Slide 20: Fault Injectors and Fault Models
- Debugger-based fault injector: injection into the target process memory and registers
- Driver-based fault injector: injection into memory, registers, OS functions, I/O devices
- Network injector: injection into network cards/controllers, corrupting messages
- Use of performance monitors (built into the CPU) to trigger fault injection
- Fault injection targets
  - CPU registers, memory, network, application, specific OS functions
- Fault injection triggers
  - random (based on time), application-supplied breakpoint, externally supplied breakpoint
Slide 21: Applications of NFTAPE
- Motorola iDEN MicroLite: critical base station controller (call-processing application and database) in a digital mobile telephone network
- DHCP (Dynamic Host Configuration Protocol) server: evaluation of application control flow checking
- Software-implemented fault tolerance (SIFT) environment on the REE testbed: evaluation of recovery coverage and performance overhead of the SIFT environment
- Internet server applications ftp and ssh (secure shell): evaluation of error-induced security vulnerabilities in ftp and ssh applications
- Voltan and Chameleon ARMORs software middleware: evaluation of fail-silence provided by process duplication (Voltan) versus internal error detection (Chameleon)
- Linux kernel: characterization of the Linux kernel under errors
- Myrinet-based network: failure analysis of a high-speed network
Slide 22: Group Communication Protocols under Errors
Slide 23: Observations
- Group Communication Systems (GCSs) provide basic services for building dependable distributed applications
- Only a few studies have experimentally assessed the dependability of GCS implementations
  - Often under simple failure models (e.g., killing the target process)
- We use error injection to study the impact of memory and network errors on Ensemble
  - Focus on fail silence violations and error propagation
- Understanding error-propagation patterns is vital to maintaining system integrity
Slide 24: Experimental Setup
- Testbed consisting of three machines (Pentium III, 500 MHz) interconnected by a 100 Mbps Ethernet LAN
- Operating system: Linux 2.4
- Group communication system: Ensemble 1.40
- Error injection experiments
  - memory injections, to assess the impact of errors in a process's text and heap memory segments
  - network injections, to analyze the impact of corruption of messages exchanged in support of the communication protocols
Slide 25: Benchmarks
- Use synthetic benchmarks to exercise the different group communication protocols
  - Group: exercises the group membership service
  - Fifo: exercises the FIFO-ordered reliable multicast
  - Atomic: exercises the totally ordered reliable multicast (sequencer-based in Ensemble's implementation)
- Three processes join a multicast group and (possibly) exchange messages in rounds
Slide 26: Profiling Ensemble Function Invocations
- Ensemble is a 2.5 MB static library containing 6000 functions (only 1000 are actually used)
- About 5% of function invocations are for the Ensemble micro-protocols
  - the part of a GCS that is usually formally specified and verified
- 20% of function invocations are for utility functions belonging to the Ensemble source code
- About 50% of run-time function invocations are for the run-time support of OCaml
Slide 27: Memory Injections
- Error models
  - TEXT: bit errors in the text segment of the target process
  - HEAP: bit errors in the allocated heap memory of the target process
- Outcome categories (a classification sketch follows this list)
  - Manifested errors are divided into
    - Crash failure: the application stops executing, e.g., termination by the OS (e.g., SIGSEGV), HANG, ASSERT (target process shuts itself down)
    - Fail silence violation: the application performs an invalid computation, e.g., sends a corrupted message to other processes, causing them to fail
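As one way to make the outcome categories concrete, the sketch below classifies a single injection run from signals a monitor might record. The input fields (exit signal, timeout flag, assertion flag, output digests) are assumed, not the actual instrumentation used in the study.

```python
# Illustrative classification of one injection run into outcome categories.

def classify_outcome(exit_signal, timed_out, self_asserted, outputs, golden):
    if exit_signal in ("SIGSEGV", "SIGBUS", "SIGILL"):
        return "crash"
    if timed_out:
        return "hang"
    if self_asserted:
        return "assert"                        # target shut itself down
    if outputs != golden:
        return "fail-silence violation"        # invalid computation sent out
    return "not manifested"
```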
Slide 28: TEXT/HEAP Injection Results
- Over 90,000 bit errors injected into Ensemble's text/heap memory
- For the manifested errors
  - 95% result in clean crash failures (5-10% of these detected by Ensemble assertions)
  - 5% result in fail silence violations

Fail silence violations:
- Text
  - Fail silence violations are rare for group (0.5%) but not absent
    - The reason is the underlying heartbeat between group members
  - The addition of communication among application processes significantly increases the chances of fail silence violations
    - 4% (fifo) and 36% (atomic)
- Heap
  - No occurrence of fail silence violations for the group benchmark
  - In the presence of application-level communication (fifo and atomic benchmarks), fail silence violations account for 5% of the manifested errors
Slide 29: Fail Silence Violations
- A majority of the fail silence violations due to heap errors are due to a corrupted application message being sent/received
- About 40-80% of the fail silence violations due to text errors (atomic and fifo benchmarks) are caused by application-level omission failures
  - not the same as the omission failures of the underlying GCS, which are detected and recovered transparently to the application by means of sequence numbers and retransmissions
- Crash of non-injected processes (15 cases due to text errors)
Slide 30: Application-Level Omission Failure (example from the layer implementing flow control for multicast messages, MFLOW)
- Given two processes p and q
  - p can send to q only if send_credit_pq > 0; at that time, send_credit_pq is decremented
  - For every 50 KB of data q receives from p, q sends an ack-credit to p
  - On p receiving an ack-credit from q, send_credit_pq is incremented and p's buffered messages are sent based on the new credit
- Due to an injected error, q skips sending an ack-credit to p
  - No process can ever send to q
  - Ensemble does not detect it because processes can still heartbeat each other
    - Heartbeats are not subject to MFLOW
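A toy model of the credit scheme just described, showing how one dropped ack-credit starves the sender even though nothing appears failed. The window size, class names, and the way the error is injected are illustrative, not Ensemble's actual MFLOW code.

```python
# Credit-based flow control with an injected "skip the ack-credit" error.

class Receiver:
    def __init__(self, ack_every=5, drop_ack=False):
        self.received = 0
        self.ack_every = ack_every
        self.drop_ack = drop_ack          # injected error: skip the ack-credit
    def deliver(self, sender):
        self.received += 1
        if self.received % self.ack_every == 0 and not self.drop_ack:
            sender.credit += self.ack_every    # ack-credit back to the sender

class Sender:
    def __init__(self, credit=5):
        self.credit = credit
    def send(self, receiver):
        if self.credit <= 0:
            return False                  # blocked: no credit, yet still "alive"
        self.credit -= 1
        receiver.deliver(self)
        return True

p, q = Sender(credit=5), Receiver(ack_every=5, drop_ack=True)
sent = sum(p.send(q) for _ in range(20))
print(f"messages delivered before starvation: {sent}")   # stops at 5
```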
Slide 31: Fail Silence Violations (cont.)
- Fail silence violations are due to error propagation

[Figure: error propagation from process P1 to process P2]
Slide 32: Network Injections
- Error model
  - Single bit errors injected into Ensemble messages
  - Errors occur before/after any encoding (e.g., checksum) is applied/removed
- Purpose
  - Test Ensemble's robustness to invalid inputs
  - Investigate error propagation
Slide 33: Network Injections: Major Results
- Ensemble does not check the validity of certain message fields
  - e.g., the sender id used to index arrays in micro-protocols
  - Solution: add a range check for the message sender field (see the sketch below)
- OCaml's marshal/unmarshal mechanism is highly error sensitive
  - Errors can lead to invalid objects being reconstructed and thus to heap corruption
  - Solution: use a more robust encoding for marshaled messages, e.g., by means of object delimiters
- The majority of crashes occur in a small subset of Ensemble functions
  - Solution: harden the implementation of the most error-prone functions
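A minimal illustration of the suggested range check on the sender field; the message layout and handler are hypothetical, not Ensemble's actual record format.

```python
# Validate the sender index before using it to access per-member state.

def handle_message(msg, suspects):
    sender = msg["sender"]
    if not 0 <= sender < len(suspects):   # range check added, as suggested above
        return None                       # drop the message instead of indexing out of range
    suspects[sender] = False              # mark the sender as not suspected
    return sender
```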
Slide 34: Network-Level Error Propagation (example from the group benchmark)
- An error is injected into the sender field of a message sent by the targeted group member
- The corrupted message received by another group member is used to derive an index into the array indicating whether group members are faulty
- A segmentation violation occurs due to an invalid access to the array
- All group members except the injected process crash
Slide 35: Summary
- Presented an experimental study of the Ensemble GCS under memory/network errors, with a focus on fail silence violations (FSVs)
- 5-6% of manifested errors result in FSVs
  - In contrast with the crash/omission assumption
- FSVs are an impediment to high dependability
  - Recovery from such failures can be costly
  - Using protocols capable of handling application value errors (e.g., Byzantine agreement) will not suffice
    - FSVs can affect the mechanism for communication
- A fault tolerance middleware must tolerate its own errors
Slide 36: Steps in Measurement-Based Analysis
- Step 1: data processing
- Step 2: model identification and measure acquisition
- Step 3: model solution, if necessary
- Step 4: analysis of models and measures
Slide 37: Measurement Issues
- Deciding what and how to measure is difficult
- A combination of installed and custom instrumentation is used in most studies
- Sound evaluations require a considerable amount of data
- Failures are infrequent, so measurements must be taken over a long period of time
- Systems must be exposed to a wide range of usage conditions
- Only detected errors can be measured
Slide 38: Goals
- Understand the nature of failures in computer systems
  - essential for improving system availability and reliability
- Characterize failure behavior
- Provide insight into
  - error propagation (in particular between nodes in a network)
  - impact of correlated errors
  - system availability
- Identify deficiencies and suggest improvements in the error logging mechanism
Slide 39: Correlated Failures
- Significantly degrade availability, reliability, and performance
- Single-failure tolerance is not enough
- Models assuming failure independence are not appropriate
- Partial coverage models need to be modified
- Example: analyses of DEC and Tandem systems indicate that 10-30% of reported problems involve correlated failures
Slide 40: Correlated Failures (cont.)
Slide 41: Correlated Failures (cont.)
Slide 42: Failure Data Analysis of a LAN of Windows NT Computers
Slide 43: Data Used
- Failures found in a network of about 70 Windows NT-based mail servers (running Microsoft Exchange software)
- Event logs collected over a six-month period from the mail routing network of a commercial organization
- Analysis of machine reboots
  - a major portion of all logged failure data, and
  - the most severe type of failure
Slide 44: Classification of Data Collected from a LAN of Windows NT-based Servers
- The breakup of system reboot data is based on the events that preceded the current reboot by no more than one hour
- The reboot is categorized based on the source and id of the most frequently occurring events (see the sketch below)
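The classification rule can be expressed compactly: for each reboot, collect the events within the preceding hour and label the reboot with the most frequent (source, event id) pair. The log record fields below are assumptions about the event-log format, not the study's actual schema.

```python
# Label each reboot with the dominant (source, event id) in the prior hour.

from collections import Counter
from datetime import timedelta

def classify_reboots(events, reboots, window=timedelta(hours=1)):
    """events: [(timestamp, source, event_id)], reboots: [timestamp]."""
    labels = {}
    for t_reboot in reboots:
        preceding = [(src, eid) for t, src, eid in events
                     if t_reboot - window <= t < t_reboot]
        labels[t_reboot] = (Counter(preceding).most_common(1)[0][0]
                            if preceding else "unknown")
    return labels
```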
Slide 45: Classification of Data Collected from a LAN of Windows NT-based Servers (cont.)
- 29% of the reboots cannot be categorized
- A significant percentage (22%) of the reboots have reported connectivity problems
  - an indication of possible error propagation
- Only a small percentage (10%) of the reboots can be traced to a system hardware component; most of the identifiable problems are software related
- Nearly 50% of the reboots are abnormal reboots (i.e., the reboots were due to a problem with the machine rather than a normal shutdown)
- In nearly 15% of the cases, problems with a crucial mail server application force a reboot of the machine
Slide 46: Machine Uptime and Downtime Statistics
- 50% of the downtimes last about 12 minutes
  - Too short a period to replace hardware components and reconfigure the machine
- The majority of the problems are software related (memory leaks, misloaded drivers, application errors, etc.)
Slide 47: Availability
- Availability from the system perspective
- Availability from the application/user perspective
- A typical machine provides acceptable service only about 92% of the time (on average)
Slide 48: Modeling Machine Behavior: Machine States
Slide 49: Modeling Machine Behavior: State Transitions of a Typical Machine
- 92% of all transitions are into the Functional state
  - this figure is a measure of the average availability of a typical machine, i.e., the ability of the machine to provide service, not just to stay alive
- Only about 40% of the transitions out of the Reboot states are to the Functional state
- More than half of the transitions out of the Startup problems state are to the Connectivity problems state
- More than 50% of the transitions out of the Disk problems state are to the Functional state
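Transition percentages like those above can be estimated directly from the observed sequence of machine states. The state names and the short example sequence below are illustrative, not data from the study.

```python
# Estimate per-state transition probabilities from an observed state sequence.

from collections import Counter, defaultdict

def transition_matrix(state_sequence):
    counts = defaultdict(Counter)
    for cur, nxt in zip(state_sequence, state_sequence[1:]):
        counts[cur][nxt] += 1
    return {cur: {nxt: n / sum(c.values()) for nxt, n in c.items()}
            for cur, c in counts.items()}

seq = ["Functional", "Reboot", "Functional", "Connectivity", "Reboot",
       "Functional", "Functional", "Disk", "Functional"]
print(transition_matrix(seq)["Reboot"])   # e.g. {'Functional': 1.0}
```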
Slide 50: Modeling Domain Behavior
- Nearly 77% (excluding self-loops) of the transitions from the F state are to the BDC state
- Transitions from the F state to the MBDC state indicate correlated failures and recovery among BDCs
- The majority of transitions from state PDC are to state F
  - most of the problems with the PDC are not propagated to the BDCs; the PDC recovers before any such propagation takes effect on the BDCs
  - problems on the PDC do not bring the machine down
Slide 51: Error Propagation Test Results: Example
- Using event-specific tests to implement automated detection of error propagation
- Most of the identifiable problems are related to the local machine
- Good news
  - Error propagation of failures is not observed on a regular basis
- Bad news
  - in a number of cases the tests were not able to classify the event
  - it is quite possible that some of these unknowns represent propagated failures
Slide 52: Lessons Learned
- Most of the problems that lead to reboots are software related; only 10% are attributable to specific hardware components
- Connectivity problems contribute the most to reboots; a significant percentage of these problems are persistent
  - Rebooting the machine does not appear to solve the problem in many cases
- Average availability evaluates to over 99%; however, a typical machine in the domain, on average, provides acceptable service only about 92% of the time
- There are indications of propagated or correlated failures
Slide 53: Lessons Learned: Insight Into the Logging Mechanism
- The presence of a Windows NT shutdown event would improve the accuracy in identifying the causes of reboots; it would also lead to better estimates of machine availability
- Improved event logging by the lower-level system components (protocol drivers, memory managers) could significantly enhance the value of event logs in diagnosis
- The Primary Domain Controller logs error events in bursts
  - periodic logging of a "healthy" event by the PDC would help to increase our understanding of PDC behavior
Slide 54: Concluding Remarks: System Evaluation/Validation

[Figure: evaluation/validation across the system life cycle - models (analytical, simulation) and formal methods in the design phase; fault injection (HW-implemented and SW-implemented) in the prototype phase; analysis of field failure data in the operational phase; the phases feed one another with coverage and error latency figures, failure rates and fault models, and corrections of assumptions]
Slide 55: Concluding Remarks (cont.)
- Design/Simulation Phase
  - Fault tolerance issues
    - need well-established system-level fault models
    - impact of software faults
    - effect of failures on robustness and system integrity
  - Simulation issues
    - simulation time explosion
    - validation of the simulation methodology
- Prototype (Fault Injection) Phase
  - Fault models and their validity
    - hardware
      - permanent
      - transient
    - software
      - errors
      - faults/defects
  - Comparison (validation) of various fault injection tools
    - claims, portability, coverage
- Operational Measurement Phase
  - What to measure
  - When to measure
  - From case studies to fundamental results
  - Isolation of machine-specific vs. general system software dependability characteristics
  - On-line diagnosis
  - Prediction of the impact of configuration, technology, and workload changes based on field measurements