Title: A survey of dependability patterns
1A survey of dependability patterns
- Ingrid Buckley and Eduardo B. Fernandez
- Dept. of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL, USA
- January 18, 2007
2Introduction
- Dependability is the property of a system that allows one to rely on its service.
- Dependability of critical systems is of the utmost importance in business and in critical infrastructures such as hospitals, airports, and a country's electricity grid.
- Dependability comprises several aspects:
  - Fault Tolerance
  - Safety
  - Availability
  - Reliability
3Introduction contd
- Fault tolerance, as it relates to systems, software, and hardware, is the ability to remain operable in the presence of faults.
- Safety is the prevention of catastrophic effects on the environment or on the users of the system.
- Availability is the ability of a system to perform its functions when needed.
- Reliability measures the success with which the system conforms to its specification.
- We use the Unified Modeling Language (UML) to represent fault tolerance patterns.
4Objectives
- Classify software and hardware fault tolerance patterns according to their objectives.
- Analyze and evaluate the classified fault tolerance patterns.
- Determine how to improve upon existing patterns.
- Design new fault tolerance patterns for unsupported areas within critical systems.
5Background
- A pattern is an encapsulated solution to a recurrent problem in a given context; it can be tailored to fit different situations.
- A fault is a defective value in the state of a component or in the design of a system. An error is the manifestation of a fault: a defective value in the state of the system.
- A system failure occurs when there is a deviation from the system's specification. A failure is the manifestation of an error.
- The System Development Life Cycle (SDLC) is the entire process of formal, logical steps taken to develop software.
6Fault Tolerance
- A system that can mask the effects of a fault and continue operating correctly is said to be fault tolerant.
- Fault tolerance requires redundancy and diversity, which are directly linked to reliability and support the availability of a system.
- Diversity here means having different versions of a function or system that all provide the same functionality.
- Integrating hardware and software fault tolerance, to cope with the various kinds of faults that can appear in a software system, is a good foundation for achieving a fault-tolerant system.
- Several fault tolerance patterns have already been written, supporting different levels of the system architecture. Our aim is to focus on hardware and software fault tolerance patterns.
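As an illustration of redundancy combined with diversity, the sketch below assumes three independently written versions of the same function and a simple majority voter. The names `max_v1`, `max_v2`, `max_v3`, and `vote` are hypothetical and chosen only for this example; the third version contains a deliberate design fault so that the masking effect is visible.

```python
from collections import Counter

# Three hypothetical, independently written versions of the same function.
# Each is meant to return the largest element of a non-empty list.
def max_v1(xs):
    return max(xs)

def max_v2(xs):
    return sorted(xs)[-1]

def max_v3(xs):
    # Deliberately faulty version (a design fault): returns the first element.
    return xs[0]

def vote(versions, xs):
    """Run every version and return the result shared by a strict majority.

    Raises RuntimeError when no result is produced by more than half of the
    versions, i.e. when the fault cannot be masked.
    """
    results = [version(xs) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among versions")
    return value

# The faulty version is outvoted by the two correct ones.
result = vote([max_v1, max_v2, max_v3], [3, 9, 4])
print(result)
```

With only two versions the voter can detect a disagreement but cannot decide which result is correct, which is why the sketch raises an error in that case instead of guessing.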
7Fault Tolerance Contd
- Fault tolerance patterns are a fairly new area in connection with critical systems; the need for them has grown with the need to secure systems against failures caused accidentally or intentionally by attackers.
- Because attacks on different types of systems are so diverse, it is highly important to have effective fault tolerance techniques that mitigate faults that may lead to a failure in a critical system.
- To prevent failures, the following are required:
  - Detection: detecting the occurrence of errors.
  - Diagnosis: locating the unit or component where the error has occurred.
  - Masking: masking errors so as to prevent malfunctioning of the system if a fault occurs.
  - Containment: confining or delimiting the effects of the error.
  - Recovery: reconfiguring the system to remove the faulty unit and erase the effects of the error.
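The detection, diagnosis, masking, containment, and recovery steps listed above can be sketched with a supervisor that switches from a failed unit to a standby spare. This is only one possible arrangement, not a pattern taken from the survey: the `Supervisor` class, the acceptance test `is_isqrt`, and the deliberately wrong `faulty_isqrt` unit are all assumptions made for illustration.

```python
import math

class Supervisor:
    """Hypothetical supervisor over an ordered pool of redundant units.

    `units` maps a unit name to a callable; earlier entries are tried first.
    `is_valid(x, result)` is an acceptance test supplied by the caller.
    """

    def __init__(self, units, is_valid):
        self.units = dict(units)   # insertion order is the priority order
        self.is_valid = is_valid
        self.faulty = []           # names of units diagnosed as faulty

    def call(self, x):
        for name in list(self.units):
            try:
                result = self.units[name](x)
                ok = self.is_valid(x, result)  # detection: acceptance test
            except Exception:
                ok = False                     # containment: the exception stays here
            if ok:
                return result                  # masking: caller sees a valid result
            self.faulty.append(name)           # diagnosis: record the failed unit
            del self.units[name]               # recovery: remove it from the pool
        raise RuntimeError("all units failed")

def faulty_isqrt(x):
    # Deliberately wrong integer square root, used to simulate a fault.
    return x // 2

def is_isqrt(x, r):
    # Acceptance test: r is the integer square root of x.
    return r * r <= x < (r + 1) * (r + 1)

sup = Supervisor({"primary": faulty_isqrt, "spare": math.isqrt}, is_isqrt)
print(sup.call(10), sup.faulty)
```

The acceptance test here is cheap compared with recomputing the result, which is the usual reason for preferring such a check over full replication.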
8Hardware Fault Tolerant Patterns
- Hardware fault tolerance applies hardware replication to enhance system availability and reliability in the presence of hardware faults.
- Hardware fault tolerance patterns:
  - Watchdog: The Watchdog pattern primarily provides protection against time-based faults by raising an alarm whenever liveness messages are not received within a given time frame.
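A minimal software sketch of the watchdog idea follows. It assumes the monitored component calls a `stroke()` method as its liveness message and that some other party polls `expired()`; the class name, method names, and the injectable `clock` parameter are hypothetical. A hardware watchdog would typically use a timer circuit instead.

```python
import time

class Watchdog:
    """Signal an alarm when no liveness message arrives within `timeout`."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock            # injectable so the example is deterministic
        self.last_stroke = clock()

    def stroke(self):
        """Liveness message sent by the monitored component."""
        self.last_stroke = self.clock()

    def expired(self):
        """Return True (the alarm) if the time frame elapsed without a stroke."""
        return self.clock() - self.last_stroke > self.timeout

# Usage with a fake clock, so no real waiting is needed.
now = [0.0]
dog = Watchdog(timeout=1.0, clock=lambda: now[0])
now[0] = 0.5
dog.stroke()
now[0] = 1.2
print(dog.expired())   # 0.7 time units since the last stroke
now[0] = 2.0
print(dog.expired())   # 1.5 time units since the last stroke
```

Note that the watchdog only observes timing: a component that keeps sending strokes while computing wrong results is not detected.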
9Hardware Fault Tolerant Patterns Contd
- Fail-Stop Processor: The Fail-Stop Processor pattern mainly aims at transforming errors that would lead to Byzantine (complex) failures into fail-stop behavior, in which the system halts instead of producing incorrect output. It is based on redundancy and on comparing the output from all replicas to reach an agreement.
- Acknowledgement: The Acknowledgement pattern detects crash failures. It is based on acknowledging the reception of input within a given time interval.
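The comparison step of the Fail-Stop Processor can be sketched as below. This is an assumption-laden illustration, not the pattern's reference design: the `FailStopProcessor` class and `FailStopError` exception are hypothetical names, and the replicas are called one after another for simplicity, whereas the pattern runs them in parallel.

```python
class FailStopError(Exception):
    """Raised when the replicas disagree and the processor halts."""

class FailStopProcessor:
    """Compare the outputs of all replicas; halt on any disagreement."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.halted = False

    def process(self, x):
        if self.halted:
            raise FailStopError("processor has halted")
        # Sequential calls here; the pattern itself runs replicas in parallel.
        outputs = [replica(x) for replica in self.replicas]
        if any(out != outputs[0] for out in outputs[1:]):
            self.halted = True   # stop instead of emitting a possibly wrong value
            raise FailStopError(f"replicas disagree: {outputs}")
        return outputs[0]

def double(x):
    return 2 * x

def double_with_fault(x):
    # Simulated fault that only shows for inputs of 10 or more.
    return 2 * x if x < 10 else 0

fsp = FailStopProcessor([double, double_with_fault])
print(fsp.process(3))
try:
    fsp.process(10)
except FailStopError as err:
    print("halted:", err)
```

The sketch detects the error but does not tolerate it: once halted, the processor refuses all further input, so recovery has to come from another mechanism.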
10Software Fault Tolerant Patterns
- Software fault tolerance applies software redundancy, by means of design diversity, to tolerate software faults that can occur in the design, programming, or maintenance phases of the software development cycle.
- Software fault tolerance patterns:
  - Roll Forward: The Roll Forward pattern is a failure recovery pattern that detects and recovers from a fault by monitoring two replicas for errors.
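A rough sketch of the two-replica idea behind Roll Forward is shown below, based only on the description in this deck: after each successful input the new state is copied to the other replica, and when an error is detected the failed replica is discarded while the unaffected one carries on. The `Replica` and `RollForward` classes, the running-total state, and the `failed` flag used to simulate a fault are all hypothetical.

```python
import copy

class Replica:
    """Hypothetical stateful component that keeps a running total."""

    def __init__(self):
        self.state = {"total": 0}
        self.failed = False        # set to True to simulate a fault

    def process(self, value):
        if self.failed:
            raise RuntimeError("replica has failed")
        self.state["total"] += value
        return self.state["total"]

class RollForward:
    """Keep a second replica up to date and switch to it on error."""

    def __init__(self):
        self.active = Replica()
        self.standby = Replica()

    def process(self, value):
        try:
            result = self.active.process(value)
        except RuntimeError:
            if self.standby is None:
                raise              # no unaffected replica is left
            # Discard the failed replica; the unaffected one takes over
            # from the last copied state and processes this input.
            self.active, self.standby = self.standby, None
            return self.active.process(value)
        if self.standby is not None:
            # No error: copy the new state before accepting further input.
            self.standby.state = copy.deepcopy(self.active.state)
        return result

rf = RollForward()
print(rf.process(5), rf.process(2))
rf.active.failed = True            # inject a fault into the active replica
print(rf.process(3))
```

The state copy on every error-free input is the cost of this arrangement; in exchange, no work has to be redone when a replica fails.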
11Software Fault Tolerant Patterns Cont
- Input Guard: The Input Guard pattern stops erroneous input from propagating an error into a component. A guard is placed at every access point of the component to check the validity of the input.
- Fault Container: The Fault Container pattern provides the same benefits as the combination of the Input Guard and Output Guard patterns, because it prevents an error from propagating into or out of a given component.
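The two guard patterns above can be sketched with a wrapper that checks input before a component runs and, optionally, checks output before it leaves. The `guarded` decorator, the `GuardError` exception, and the example components `half` and `scaled` are hypothetical; the validity checks stand in for whatever the component's specification requires.

```python
import functools

class GuardError(ValueError):
    """Raised when input or output violates the component specification."""

def guarded(check_input, check_output=None):
    """Wrap a component with an input guard and, optionally, an output guard.

    With only `check_input` this acts as an Input Guard; with both checks it
    behaves like a Fault Container.
    """
    def decorate(component):
        @functools.wraps(component)
        def wrapper(x):
            if not check_input(x):
                raise GuardError(f"invalid input: {x!r}")
            result = component(x)
            if check_output is not None and not check_output(result):
                raise GuardError(f"invalid output: {result!r}")
            return result
        return wrapper
    return decorate

def non_negative_int(x):
    return isinstance(x, int) and x >= 0

@guarded(check_input=non_negative_int)
def half(x):
    return x // 2

@guarded(check_input=non_negative_int, check_output=lambda y: 0 <= y <= 100)
def scaled(x):
    return x * 10

print(half(8), scaled(5))
```

A guard of this kind can only reject values that violate the stated checks; an erroneous value that still satisfies them passes through unnoticed.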
12Hardware/Software Fault Tolerance Pattern
- The Software Redundancy Pattern deals with
hardware, software and environmental faults at
the same time.
13Patterns diagram for the fault tolerance domain
14Analysis of Patterns
Pattern | Advantage | Disadvantage
Watchdog | Can be used to improve deadlock detection: strokes can be keyed, or can contain data, to identify strokes from different computational steps. | Does not actually check that the internal computation is correct.
Acknowledgement | The design complexity introduced by the pattern is very low. Does not introduce any space overhead. | An error on the monitored system is detected only after some input has been issued to it. The timeout must be set based on the time it takes for the input to reach the monitored system plus the time it takes for the acknowledgement to reach the monitoring system.
Fail-Stop Processor | Introduces low time overhead, since the processors function in parallel. The processors are replicas of the original system to which the pattern is applied, without any additional functionality, so in practice they can be replicas of a legacy system that cannot be subject to internal changes. | Does not provide means to tolerate faults in a system; rather, it provides means to detect errors. Introduces relatively elevated space overhead, proportional to the number of simultaneous errors it can deal with.
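The timeout rule stated in the table above (the time for the input to reach the monitored system plus the time for the acknowledgement to travel back) can be written out as a small calculation. The function names, the optional `margin` parameter, and the millisecond figures are assumptions made for this sketch.

```python
def ack_timeout(input_delay, ack_delay, margin=0):
    """Smallest sensible timeout: input delay plus acknowledgement delay.

    `margin` is an assumed safety allowance, not part of the stated rule.
    """
    return input_delay + ack_delay + margin

def error_detected(send_time, ack_time, timeout):
    """Evaluate after the timeout has elapsed.

    `ack_time` is None when no acknowledgement arrived at all.
    """
    if ack_time is None:
        return True
    return ack_time - send_time > timeout

# Hypothetical delays in milliseconds.
timeout = ack_timeout(input_delay=20, ack_delay=15, margin=5)
print(timeout)
print(error_detected(send_time=0, ack_time=30, timeout=timeout))
print(error_detected(send_time=0, ack_time=None, timeout=timeout))
```

A timeout shorter than the sum of the two delays would flag healthy systems as crashed, while a much longer one delays detection.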
15Analysis of Patterns Contd
Pattern | Advantage | Disadvantage
Roll Forward | The time overhead imposed by this pattern is low when errors occur: the failed replica is discarded, and the unaffected replica processes the subsequent inputs. | The time overhead imposed by this pattern in the absence of errors is high: before a replica is able to receive and process new input, it must copy its new state to the other replica.
Input Guard | Stops the contamination of the guarded component by erroneous input that does not conform to the component's specification. The pattern can be implemented in various ways, each providing different benefits with respect to the time or space overhead introduced by the guard. | Cannot prevent the propagation of errors that do conform to the specification of the guarded component. Has significant time and space overhead.
Fault Container | Stops errors, expressed as input or output content or timing that does not conform to a component's specification, from entering or exiting that component. The undefined behavior of the container in the presence of errors allows its combination with error detection and error masking patterns. | Cannot prevent the propagation of errors that do conform to the specification of the contained component. Unless combined with error detection and system recovery mechanisms, this pattern results in send-omission or receive-omission failures (i.e., failure to send the output or receive the input of the contained component).
16Conclusion
- Based on our analysis, there is a need to improve upon current fault tolerance patterns.
- New fault tolerance patterns are necessary to provide dependability in distributed systems, because many of the existing fault tolerance patterns are very similar and do not provide comprehensive support for errors that can lead to failure.
17Future Work
- Researching safety, availability, and reliability patterns.
- Defining areas of need where current fault tolerance patterns are lacking or require improvement.
- Designing new fault tolerance patterns.
18Recommendations and Questions