Title: Design of High Availability Systems and Networks: Validation

Slide 1: Design of High Availability Systems and Networks: Validation

Ravi K. Iyer
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
iyer@crhc.uiuc.edu
http://www.crhc.uiuc.edu/DEPEND
Slide 2: Outline
- Introduction
- Validation methods
- Design phase
- Fault simulation
- Prototype phase
- HW or SW implemented fault injection
- Operational phase
- Measurement and analysis of field systems
Slide 3: Why Validation/Benchmarking?
- Characterize different detection and recovery mechanisms
  - Coverage
  - Performance overhead
- Determine system/application sensitivity to errors
  - Single points of failure (dependability bottlenecks)
  - Error propagation patterns
  - Placement of detection and recovery mechanisms
  - Error susceptibility of the runtime fault management infrastructure
- Analyze cost/reliability/performance tradeoffs
- Need benchmarks to facilitate meaningful comparison of designs and systems
Slide 4: Experimental Validation
- Early Design Phase
  - Approach and goals
    - CAD environments used to evaluate the design via simulation
    - Simulated fault injection experiments
    - Evaluate effectiveness of fault-tolerant mechanisms
    - Provide timely feedback to system designers
  - Information produced
    - error latency, error detection coverage, recovery time distribution
  - Limitations/issues
    - Simulations need accurate inputs, fault models, and validation of results
    - Simulation time explosion
- Prototype Phase
  - Approach and goals
    - System runs under controlled workload conditions
    - Controlled fault injections used to evaluate the system in the presence of faults
  - Information produced
    - error latency, propagation, detection distributions, availability
  - Limitations/issues
    - Injected faults should create/induce failure scenarios representative of actual system operation
- Operational Phase
  - Approach and goals
    - Study naturally occurring errors
    - Study systems in the field, under real workloads
    - Analyze collected error and performance data
  - Information produced
    - actual failure characteristics, failure rates, time to failure distribution
  - Limitations/issues
    - HW/SW instrumentation, analysis tools
Slide 5: Design Phase Fault Injection

[Figure: hybrid simulation]
Slide 6: Simulation at Different Levels
- Electrical level
  - transistor, circuit, chip
- Logic level
  - circuit, VLSI systems
- Function level
  - VLSI system, computer and network systems

Levels of simulated fault injection:
- Electrical level: change current, change voltage
- Logic level: stuck-at 0 or 1, inverted fault
- Function level: change CPU registers, network faults, flip memory bit

[Figure: simulation hierarchy relating physical processes and electrical circuits, logic gates and logic operations, and functional units]
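As a rough illustration of function-level injection (not taken from any specific simulator; all names here are hypothetical), the sketch below flips or forces single bits in a word-sized register image, mirroring the bit-flip and stuck-at models listed above.

```python
# Minimal sketch of function-level fault injections on a simulated machine
# state. RegisterFile contents, names, and widths are illustrative only.

import random

def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Invert a single bit of an integer-valued register or memory word."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)

def stuck_at(value: int, bit: int, stuck_to: int, width: int = 32) -> int:
    """Force a bit to 0 or 1 (the stuck-at model applied to a word)."""
    if stuck_to:
        return value | (1 << bit)
    return value & ~(1 << bit) & ((1 << width) - 1)

# Example: inject a transient single-bit error into a CPU register image.
registers = {"r0": 0x0000_0010, "r1": 0xDEAD_BEEF}
target = random.choice(list(registers))
bit = random.randrange(32)
registers[target] = flip_bit(registers[target], bit)
print(f"Injected bit-flip into {target}, bit {bit}: {registers[target]:#010x}")
```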
Slide 7: Issues in Simulated Fault Injection
- Fault models
- Fault conditions, fault types
- Number of faults
- Fault times
- Fault locations
- Workload
- Real applications
- Benchmarks
- Synthetic programs
- Simulation time explosion
- Mixed-mode simulation
- Importance sampling
- Concurrent simulation
- Accelerated fault simulation
- Hierarchical simulation
Slide 8: Fault Injection at the Electrical Level
- Why is it needed?
  - Study the impact of physical causes
  - Simple stuck-at models do not represent many real types of faults

[Figure: transistor-level simulation and device-physics-level simulation]
Slide 9: Simulated Fault Injection at the Logic Level
- Fault models
  - Basic models
    - stuck-at (permanent): forcing a logic value for the entire simulation duration
    - inverted fault (transient): altering a logic value momentarily
  - Fault dictionary approach
    - Use electrical-level simulation to derive logic-level fault models
    - dictionary entry: input vector, injection time, fault location
[Figure: transistor-level description of a 4-bit adder (inputs A(3:0), B(3:0), Cin; outputs S(3:0), Cout), the current-burst fault model, and the logic-level fault dictionary obtained by simulating all nodes under all input combinations]
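A minimal sketch of the logic-level models above, assuming a hypothetical gate-level 1-bit full adder (net names and the dictionary format are illustrative, not the slide's actual adder or dictionary):

```python
# Gate-level 1-bit full adder with injectable stuck-at and transient faults.
# fault = (net_name, mode) where mode is 'sa0', 'sa1' (stuck-at) or 'inv'
# (momentary inversion).

def full_adder(a, b, cin, fault=None):
    def drive(name, value):
        if fault and fault[0] == name:
            mode = fault[1]
            value = 0 if mode == "sa0" else 1 if mode == "sa1" else value ^ 1
        return value

    axb  = drive("axb",  a ^ b)
    s    = drive("s",    axb ^ cin)
    c1   = drive("c1",   a & b)
    c2   = drive("c2",   axb & cin)
    cout = drive("cout", c1 | c2)
    return s, cout

# Build a tiny fault dictionary: for every net, fault type, and input vector,
# record whether the fault is observable at the outputs (cf. the dictionary
# entries of input vector, injection time, and fault location).
fault_dict = []
for net in ("axb", "s", "c1", "c2", "cout"):
    for mode in ("sa0", "sa1", "inv"):
        for a in (0, 1):
            for b in (0, 1):
                for cin in (0, 1):
                    good = full_adder(a, b, cin)
                    bad = full_adder(a, b, cin, fault=(net, mode))
                    if good != bad:
                        fault_dict.append(((a, b, cin), net, mode))
print(f"{len(fault_dict)} observable (input vector, net, fault) entries")
```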
Slide 10: Fault Injection at the Function Level
- Diversity of components
  - Object-oriented approach
- Fault models
  - Various types, depending on the type of component
  - Examples (see the sketch below)
    - Single bit-flip for a memory or register fault
    - Message corruption for a communication channel fault
    - Service interrupt for a node fault
  - More detailed fault models derived from lower-level simulation
- Impact of software
  - Impact of faults is application dependent
  - Software effects can be studied at this level
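One way to read the object-oriented approach: each simulated component type carries its own fault model. The classes and method names below are hypothetical, a sketch rather than any particular tool's API.

```python
# Function-level fault models attached to component objects: bit-flip for
# memory, message corruption for a channel, service interrupt for a node.

import random

class Memory:
    def __init__(self, size):
        self.words = [0] * size
    def inject(self):                       # single bit-flip fault model
        addr = random.randrange(len(self.words))
        self.words[addr] ^= 1 << random.randrange(32)

class Channel:
    def __init__(self):
        self.in_flight = []
    def send(self, msg: bytes):
        self.in_flight.append(msg)
    def inject(self):                       # message-corruption fault model
        if self.in_flight:
            i = random.randrange(len(self.in_flight))
            msg = bytearray(self.in_flight[i])
            msg[random.randrange(len(msg))] ^= 1 << random.randrange(8)
            self.in_flight[i] = bytes(msg)

class Node:
    def __init__(self):
        self.up = True
    def inject(self):                       # service-interrupt fault model
        self.up = False
```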
Slide 11: Hierarchical Fault Simulation
- Lower-level fault effects propagate to the higher levels

[Figure: hierarchical simulation spanning the device-physics, transistor, electrical, logic, chip, and system levels; at the system level, hosts, a network interface, and a switch on a local network, with software (program, memory, modules i, j, k) modeled on top of the hardware; ionizing particles at the device-physics level produce an electrical transient that propagates upward]
Slide 12: Prototype Phase: Hardware-Implemented Fault Injection

[Figure: MESSALINE architecture - input files drive the generation of system activity on the system under study; a fault injector and a monitor/controller close the loop]

- MESSALINE
  - Developed at LAAS-CNRS, France
  - Both probes and socket insertion are used
  - Can inject at up to 32 injection points
  - Applications
    - A subsystem of a railway interlocking control system
    - A distributed communication system
Slide 13: Prototype Phase: Software-Implemented Fault Injection
- Advantages: flexibility, low cost
- Disadvantages: perturbation of the workload, low time resolution
- Targets for software fault injection (a memory-injection sketch follows this list)
  - Software faults and errors
    - modify the text/data segment of the program
  - Memory faults
    - flip memory bits
  - CPU faults
    - modify CPU registers, cache, buffers
  - Bus faults
    - use traps before and after an instruction to change the code or data used by the instruction; restore the data after the instruction is executed
  - Network faults
    - modify or delete transmitted messages
    - introduce faults in network controllers, drivers, buffers
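A rough sketch of one common software-implemented technique, flipping a bit at a chosen virtual address of a target process through Linux's /proc/&lt;pid&gt;/mem. It assumes the injector has ptrace/root privileges and a valid target address; real injectors add triggers, monitoring, and safety checks, and the pid/address in the usage line are hypothetical.

```python
# Software-implemented memory fault injection on Linux (sketch only).

import random

def flip_bit_in_process(pid, address, bit=None):
    """Flip one bit of the byte at `address` in the target process's memory.
    Requires sufficient privileges to write /proc/<pid>/mem."""
    path = f"/proc/{pid}/mem"
    with open(path, "r+b", buffering=0) as mem:
        mem.seek(address)
        byte = mem.read(1)[0]
        bit = random.randrange(8) if bit is None else bit
        mem.seek(address)
        mem.write(bytes([byte ^ (1 << bit)]))

# Usage (hypothetical pid and address):
# flip_bit_in_process(pid=1234, address=0x7F0000001000)
```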
Slide 14: Prototype Phase: Fault Injection Requirements
- Distributed test and evaluation environment
- Support for an architecture-independent approach
  - Evaluate hardware- and software-implemented fault tolerance of single-node architectures, distributed systems, and embedded applications
- Support fault injection into a variety of targets, including CPU registers, cache, memory, I/O, network, applications, and OS functions
- Examples of fault injection strategies include
  - random components and locations
  - selected hardware and software components (either predefined or random locations within a component)
  - application data and control flow
  - triggered by high-stress conditions
  - impact on system timing (e.g., to mimic omission failures)
- Allow collection and analysis of results to derive measures for characterizing the system (e.g., coverage, fault severity, propagation, latency, availability, etc.)
Slide 15: Fault/Error Injection

Fault injection specs:
- Injection strategy: stress-based, path-based, random
- Injection method: by hardware, by software
- Fault location: CPU, memory, disk I/O, network I/O, other I/Os
- Injection time: load threshold, program execution path, fault arrival rate

Workload specs:
- Rates and mixes
- Interaction intensity

[Figure: fault injector driving the system under test (CPU, I/O) at a given load level]

A sketch of these specs as a machine-readable injection plan follows.
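The field names and example values below are illustrative only; they simply encode the strategy/method/location/time and workload dimensions listed above, not any real tool's plan format.

```python
# Illustrative machine-readable form of the fault injection and workload specs.

from dataclasses import dataclass, field

@dataclass
class FaultInjectionSpec:
    strategy: str = "random"         # "random" | "stress-based" | "path-based"
    method: str = "software"         # "hardware" | "software"
    location: str = "memory"         # "CPU" | "memory" | "disk I/O" | "network I/O"
    trigger: str = "fault-arrival"   # "load-threshold" | "execution-path" | "fault-arrival"
    fault_rate: float = 0.1          # faults per second for random arrivals

@dataclass
class WorkloadSpec:
    rates_and_mixes: dict = field(default_factory=lambda: {"cpu": 0.6, "io": 0.4})
    interaction_intensity: str = "medium"

plan = (FaultInjectionSpec(strategy="stress-based", location="CPU",
                           trigger="load-threshold"),
        WorkloadSpec())
```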
Slide 16: NFTAPE
- NFTAPE is a tool for conducting automated fault/error injection-based dependability characterization
  - A tool that enables a user (1) to specify a fault/error injection plan, (2) to carry out injection experiments, and (3) to collect the experimental results for analysis
- Facilitates automated execution of fault/error injection experiments
- Targets assessment of a broad set of dependability metrics, e.g., availability, reliability, coverage, mean time to failure
- Operates in a distributed environment
- Imposes minimal disturbance on target systems
Slide 17: NFTAPE Architecture

[Figure: a Control Host running a Campaign Script and keeping a log, connected over a LAN to the error injection targets; each target node runs a Process Manager, an Injector Process, and an Application Process]
Slide 18: Control Host and Process Manager
- Control Host
  - Processes a Campaign Script, a file that specifies a state machine or control flow followed by the control host during the fault injection campaign
  - Simple yet general way to customize a fault injection experiment
  - Experiments controlled by the Common Control Mechanism
  - Implemented in Java to ensure portability
- Process Manager
  - Daemon on each target node to manage processes on the target node(s), including process execution and termination
    - processes include injectors, workloads, applications, monitors
    - all processes are treated as abstract process objects rather than processes of a specific type
  - Facilitates communication between the Control Host and target nodes
Slide 19: State Machine of an Example Campaign Script

[Figure: example campaign script state machine with states ST_init (initialize variables and events, start a logfile), ST_start_fi_trig (start the fault injector and trigger processes), ST_run_app (start the application, activate the trigger), ST_Trigger_ON (inject errors, deactivate the trigger after a specified time), error states ST_error1 and ST_error2 (reached via ER_condition_1 and ER_condition_2), and ST_finish (terminate processes, exit the Control Host); transitions are guarded by TRUE and Condition_1 through Condition_3]
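A compressed sketch of how such a campaign script could be driven: a table mapping each state to an action and an ordered list of (condition, next state) pairs. The state names follow the slide, but the wiring and the lambda actions are placeholders rather than the actual script.

```python
# Table-driven campaign state machine (illustrative wiring only).

def run_campaign(states, start="ST_init"):
    state = start
    while state is not None:
        action, transitions = states[state]
        action()                                   # e.g. start app, inject
        state = next((nxt for cond, nxt in transitions if cond()), None)

states = {
    "ST_init":          (lambda: print("init vars, open logfile"),
                         [(lambda: True, "ST_start_fi_trig")]),
    "ST_start_fi_trig": (lambda: print("start injector/trigger processes"),
                         [(lambda: True, "ST_run_app")]),
    "ST_run_app":       (lambda: print("start application, activate trigger"),
                         [(lambda: True, "ST_Trigger_ON")]),
    "ST_Trigger_ON":    (lambda: print("inject errors, deactivate trigger"),
                         [(lambda: True, "ST_finish")]),
    "ST_finish":        (lambda: print("terminate processes, exit"),
                         []),
}
run_campaign(states)
```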
Slide 20: Fault Injectors and Fault Models
- Debugger-based fault injector: injection into the target process memory and registers
- Driver-based fault injector: injection into memory, registers, OS functions, I/O devices
- Network injector: injection into network cards/controllers, corrupting messages
- Use of performance monitors (built into the CPU) to trigger fault injection
- Fault injection targets
  - CPU registers, memory, network, application, specific OS functions
- Fault injection triggers
  - random (based on time), application-supplied breakpoint, externally supplied breakpoint
Slide 21: Applications of NFTAPE
- Motorola iDEN MicroLite: critical base station controller (call-processing application and database) in a digital mobile telephone network
- DHCP (Dynamic Host Configuration Protocol) server: evaluation of application control flow checking
- Software-implemented fault tolerance (SIFT) environment on the REE testbed: evaluation of recovery coverage and performance overhead of the SIFT environment
- Internet server applications ftp and ssh (secure shell): evaluation of error-induced security vulnerabilities in ftp and ssh applications
- Voltan and Chameleon ARMORs software middleware: evaluation of fail-silence provided by process duplication (Voltan) versus internal error detection (Chameleon)
- Linux kernel: characterization of the Linux kernel under errors
- Myrinet-based network: failure analysis of a high-speed network
Slide 22: Group Communication Protocols under Errors
Slide 23: Observations
- Group Communication Systems (GCSs) provide basic services for building dependable distributed applications
- Only a few studies have experimentally assessed the dependability of GCS implementations
  - Often under simple failure models (e.g., killing the target process)
- We use error injection to study the impact of memory and network errors on Ensemble
  - Focus on fail silence violations and error propagation
- Understanding error-propagation patterns is vital to maintaining system integrity
Slide 24: Experimental Setup
- Testbed consisting of three machines (Pentium III, 500 MHz) interconnected by a 100 Mbps Ethernet LAN
- Operating system: Linux 2.4
- Group communication system: Ensemble 1.40
- Error injection experiments
  - memory injections, to assess the impact of errors in a process's text and heap memory segments
  - network injections, to analyze the impact of corruption of messages exchanged in support of the communication protocols
Slide 25: Benchmarks
- Use synthetic benchmarks to exercise the different group communication protocols
  - Group: exercises the group membership service
  - Fifo: exercises the FIFO-ordered reliable multicast
  - Atomic: exercises the totally ordered reliable multicast (sequencer-based in Ensemble's implementation)
- Three processes join a multicast group and (possibly) exchange messages in rounds
Slide 26: Profiling Ensemble Function Invocations
- Ensemble is a 2.5 MB static library containing 6000 functions (only 1000 are actually used)
- About 5% of function invocations are for the Ensemble micro-protocols
  - the part of a GCS that is usually formally specified and verified
- 20% of function invocations are for utility functions belonging to the Ensemble source code
- About 50% of run-time function invocations are for the run-time support of OCaml
Slide 27: Memory Injections
- Error models
  - TEXT: bit errors in the text segment of the target process
  - HEAP: bit errors in the allocated heap memory of the target process
- Outcome categories (a classification sketch follows this list)
  - Manifested errors are divided into
    - Crash failure: the application stops executing, e.g., termination by the OS (e.g., SIGSEGV), HANG, ASSERT (target process shuts itself down)
    - Fail silence violation: the application performs an invalid computation, e.g., sends a corrupted message to other processes, causing them to fail
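As one way to make the outcome categories concrete, the sketch below classifies a single injection run from signals a monitor might record. The input fields (exit signal, timeout flag, assertion flag, output digests) are assumed, not the actual instrumentation used in the study.

```python
# Illustrative classification of one injection run into outcome categories.

def classify_outcome(exit_signal, timed_out, self_asserted, outputs, golden):
    if exit_signal in ("SIGSEGV", "SIGBUS", "SIGILL"):
        return "crash"
    if timed_out:
        return "hang"
    if self_asserted:
        return "assert"                        # target shut itself down
    if outputs != golden:
        return "fail-silence violation"        # invalid computation sent out
    return "not manifested"
```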
Slide 28: TEXT/HEAP Injection Results
- Over 90,000 bit errors injected into Ensemble's text/heap memory
- For the manifested errors
  - 95% result in clean crash failures (5-10% of these detected by Ensemble assertions)
  - 5% result in fail silence violations

Fail silence violations:
- Text
  - Fail silence violations are rare for group (0.5%) but not absent
    - The reason is the underlying heartbeat between group members
  - The addition of communication among application processes significantly increases the chances of fail silence violations
    - 4% (fifo) and 36% (atomic)
- Heap
  - No occurrence of fail silence violations for the group benchmark
  - In the presence of application-level communication (fifo and atomic benchmarks), fail silence violations account for 5% of the manifested errors
Slide 29: Fail Silence Violations
- A majority of the fail silence violations due to heap errors are due to a corrupted application message being sent/received
- About 40-80% of the fail silence violations due to text errors (atomic and fifo benchmarks) are caused by application-level omission failures
  - not the same as the omission failures of the underlying GCS, which are detected and recovered transparently to the application by means of sequence numbers and retransmissions
- Crash of non-injected processes (15 cases due to text errors)
Slide 30: Application-Level Omission Failure (example from the layer implementing flow control for multicast messages, MFLOW)
- Given two processes p and q
  - p can send to q only if send_credit_pq > 0; at that time, send_credit_pq is decremented
  - For every 50 KB of data q receives from p, q sends an ack-credit to p
  - On p receiving an ack-credit from q, send_credit_pq is incremented and p's buffered messages are sent based on the new credit
- Due to an injected error, q skips sending an ack-credit to p
  - No process can ever send to q
  - Ensemble does not detect it because processes can still heartbeat each other
    - Heartbeats are not subject to MFLOW
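A toy model of the credit scheme just described, showing how one dropped ack-credit starves the sender even though nothing appears failed. The window size, class names, and the way the error is injected are illustrative, not Ensemble's actual MFLOW code.

```python
# Credit-based flow control with an injected "skip the ack-credit" error.

class Receiver:
    def __init__(self, ack_every=5, drop_ack=False):
        self.received = 0
        self.ack_every = ack_every
        self.drop_ack = drop_ack          # injected error: skip the ack-credit
    def deliver(self, sender):
        self.received += 1
        if self.received % self.ack_every == 0 and not self.drop_ack:
            sender.credit += self.ack_every    # ack-credit back to the sender

class Sender:
    def __init__(self, credit=5):
        self.credit = credit
    def send(self, receiver):
        if self.credit <= 0:
            return False                  # blocked: no credit, yet still "alive"
        self.credit -= 1
        receiver.deliver(self)
        return True

p, q = Sender(credit=5), Receiver(ack_every=5, drop_ack=True)
sent = sum(p.send(q) for _ in range(20))
print(f"messages delivered before starvation: {sent}")   # stops at 5
```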
Slide 31: Fail Silence Violations (cont.)
- Fail silence violations are due to error propagation

[Figure: error propagation from process P1 to process P2]
Slide 32: Network Injections
- Error model
  - Single bit errors injected into Ensemble messages
  - Errors occur before/after any encoding (e.g., checksum) is applied/removed
- Purpose
  - Test Ensemble's robustness to invalid inputs
  - Investigate error propagation
Slide 33: Network Injections: Major Results
- Ensemble does not check the validity of certain message fields
  - e.g., the sender id used to index arrays in micro-protocols
  - Solution: add a range check for the message sender field (see the sketch below)
- OCaml's marshal/unmarshal mechanism is highly error sensitive
  - Errors can lead to invalid objects being reconstructed and thus to heap corruption
  - Solution: use a more robust encoding for marshaled messages, e.g., by means of object delimiters
- The majority of crashes occur in a small subset of Ensemble functions
  - Solution: harden the implementation of the most error-prone functions
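A minimal illustration of the suggested range check on the sender field; the message layout and handler are hypothetical, not Ensemble's actual record format.

```python
# Validate the sender index before using it to access per-member state.

def handle_message(msg, suspects):
    sender = msg["sender"]
    if not 0 <= sender < len(suspects):   # range check added, as suggested above
        return None                       # drop the message instead of indexing out of range
    suspects[sender] = False              # mark the sender as not suspected
    return sender
```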
Slide 34: Network-Level Error Propagation (example from the group benchmark)
- An error is injected into the sender field of a message sent by the targeted group member
- The corrupted message received by another group member is used to derive an index into the array indicating whether group members are faulty
- A segmentation violation occurs due to an invalid access to the array
- All group members except the injected process crash
Slide 35: Summary
- Presented an experimental study of the Ensemble GCS under memory/network errors, with a focus on fail silence violations (FSVs)
- 5-6% of manifested errors result in FSVs
  - In contrast with the crash/omission assumption
- FSVs are an impediment to high dependability
  - Recovery from such failures can be costly
  - Using protocols capable of handling application value errors (e.g., Byzantine agreement) will not suffice
    - FSVs can affect the mechanism for communication
- A fault tolerance middleware must tolerate its own errors
Slide 36: Steps in Measurement-Based Analysis
- Step 1: data processing
- Step 2: model identification and measure acquisition
- Step 3: model solution, if necessary
- Step 4: analysis of models and measures
Slide 37: Measurement Issues
- Deciding what and how to measure is difficult
- A combination of installed and custom instrumentation is used in most studies
- Sound evaluations require a considerable amount of data
- Failures are infrequent, so measurements must be taken over a long period of time
- Systems must be exposed to a wide range of usage conditions
- Only detected errors can be measured
Slide 38: Goals
- Understand the nature of failures in computer systems
  - essential for improving system availability and reliability
- Characterize failure behavior
- Provide insight into
  - error propagation (in particular between nodes in a network)
  - impact of correlated errors
  - system availability
- Identify deficiencies and suggest improvements in the error logging mechanism
Slide 39: Correlated Failures
- Significantly degrade availability, reliability, and performance
- Single-failure tolerance is not enough
- Models assuming failure independence are not appropriate
- Partial coverage models need to be modified
- Example: analyses of DEC and Tandem systems indicate that 10-30% of reported problems involve correlated failures
Slide 40: Correlated Failures (cont.)
Slide 41: Correlated Failures (cont.)
Slide 42: Failure Data Analysis of a LAN of Windows NT Computers
Slide 43: Data Used
- Failures found in a network of about 70 Windows NT-based mail servers (running Microsoft Exchange software)
- Event logs collected over a six-month period from the mail routing network of a commercial organization
- Analysis of machine reboots
  - a major portion of all logged failure data, and
  - the most severe type of failure
Slide 44: Classification of Data Collected from a LAN of Windows NT-based Servers
- The breakup of system reboot data is based on the events that preceded the current reboot by no more than one hour
- The reboot is categorized based on the source and id of the most frequently occurring events (see the sketch below)
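The classification rule can be expressed compactly: for each reboot, collect the events within the preceding hour and label the reboot with the most frequent (source, event id) pair. The log record fields below are assumptions about the event-log format, not the study's actual schema.

```python
# Label each reboot with the dominant (source, event id) in the prior hour.

from collections import Counter
from datetime import timedelta

def classify_reboots(events, reboots, window=timedelta(hours=1)):
    """events: [(timestamp, source, event_id)], reboots: [timestamp]."""
    labels = {}
    for t_reboot in reboots:
        preceding = [(src, eid) for t, src, eid in events
                     if t_reboot - window <= t < t_reboot]
        labels[t_reboot] = (Counter(preceding).most_common(1)[0][0]
                            if preceding else "unknown")
    return labels
```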
Slide 45: Classification of Data Collected from a LAN of Windows NT-based Servers (cont.)
- 29% of the reboots cannot be categorized
- A significant percentage (22%) of the reboots have reported connectivity problems
  - an indication of possible error propagation
- Only a small percentage (10%) of the reboots can be traced to a system hardware component; most of the identifiable problems are software related
- Nearly 50% of the reboots are abnormal reboots (i.e., the reboots were due to a problem with the machine rather than a normal shutdown)
- In nearly 15% of the cases, problems with a crucial mail server application force a reboot of the machine
Slide 46: Machine Uptime and Downtime Statistics
- 50% of the downtimes last about 12 minutes
  - Too short a period to replace hardware components and reconfigure the machine
- The majority of the problems are software related (memory leaks, misloaded drivers, application errors, etc.)
Slide 47: Availability
- Availability from the system perspective
- Availability from the application/user perspective
- A typical machine provides acceptable service only about 92% of the time (on average)
Slide 48: Modeling Machine Behavior: Machine States
Slide 49: Modeling Machine Behavior: State Transitions of a Typical Machine
- 92% of all transitions are into the Functional state
  - this figure is a measure of the average availability of a typical machine, i.e., the ability of the machine to provide service, not just to stay alive
- Only about 40% of the transitions out of the Reboot states are to the Functional state
- More than half of the transitions out of the Startup problems state are to the Connectivity problems state
- More than 50% of the transitions out of the Disk problems state are to the Functional state
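Transition percentages like those above can be estimated directly from the observed sequence of machine states. The state names and the short example sequence below are illustrative, not data from the study.

```python
# Estimate per-state transition probabilities from an observed state sequence.

from collections import Counter, defaultdict

def transition_matrix(state_sequence):
    counts = defaultdict(Counter)
    for cur, nxt in zip(state_sequence, state_sequence[1:]):
        counts[cur][nxt] += 1
    return {cur: {nxt: n / sum(c.values()) for nxt, n in c.items()}
            for cur, c in counts.items()}

seq = ["Functional", "Reboot", "Functional", "Connectivity", "Reboot",
       "Functional", "Functional", "Disk", "Functional"]
print(transition_matrix(seq)["Reboot"])   # e.g. {'Functional': 1.0}
```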
Slide 50: Modeling Domain Behavior
- Nearly 77% (excluding self-loops) of the transitions from the F state are to the BDC state
- Transitions from the F state to the MBDC state indicate correlated failures and recovery among BDCs
- The majority of transitions from state PDC are to state F
  - most of the problems with the PDC are not propagated to the BDCs; the PDC recovers before any such propagation takes effect on the BDCs
  - problems on the PDC do not bring the machine down
Slide 51: Error Propagation Test Results: Example
- Using event-specific tests to implement automated detection of error propagation
- Most of the identifiable problems are related to the local machine
- Good news
  - Error propagation of failures is not observed on a regular basis
- Bad news
  - in a number of cases the tests were not able to classify the event
  - it is quite possible that some of these unknowns represent propagated failures
Slide 52: Lessons Learned
- Most of the problems that lead to reboots are software related; only 10% are attributable to specific hardware components
- Connectivity problems contribute the most to reboots; a significant percentage of these problems are persistent
  - Rebooting the machine does not appear to solve the problem in many cases
- Average availability evaluates to over 99%; however, a typical machine in the domain, on average, provides acceptable service only about 92% of the time
- There are indications of propagated or correlated failures
Slide 53: Lessons Learned: Insight Into the Logging Mechanism
- The presence of a Windows NT shutdown event would improve the accuracy in identifying the causes of reboots; it would also lead to better estimates of machine availability
- Improved event logging by the lower-level system components (protocol drivers, memory managers) could significantly enhance the value of event logs in diagnosis
- The Primary Domain Controller logs error events in bursts
  - periodic logging of a "healthy" event by the PDC would help to increase our understanding of PDC behavior
Slide 54: Concluding Remarks: System Evaluation/Validation

[Figure: evaluation/validation across the system life cycle - models (analytical, simulation) and formal methods in the design phase; fault injection (HW-implemented and SW-implemented) in the prototype phase; analysis of field failure data in the operational phase; the phases feed one another with coverage and error latency figures, failure rates and fault models, and corrections of assumptions]
Slide 55: Concluding Remarks (cont.)
- Design/Simulation Phase
  - Fault tolerance issues
    - need well-established system-level fault models
    - impact of software faults
    - effect of failures on robustness and system integrity
  - Simulation issues
    - simulation time explosion
    - validation of the simulation methodology
- Prototype (Fault Injection) Phase
  - Fault models and their validity
    - hardware
      - permanent
      - transient
    - software
      - errors
      - faults/defects
  - Comparison (validation) of various fault injection tools
    - claims, portability, coverage
- Operational Measurement Phase
  - What to measure
  - When to measure
  - From case studies to fundamental results
  - Isolation of machine-specific vs. general system software dependability characteristics
  - On-line diagnosis
  - Prediction of the impact of configuration, technology, and workload changes based on field measurements