Title: Seminarie Informatica
1Seminarie Informatica
- Fault-tolerant Systems: The Software Viewpoint
A series of seminars coordinated by Vincenzo De Florio
http://www.pats.ua.ac.be
2The matter
- The exam
- The topics
- This lecture
- Application-level fault tolerance provisions
3Introduction to the exam
- Seminarie informatica
- 10 seminars on hot topics of computer science
- Topic of this cycle: software fault-tolerant systems
- Next 3 seminars: 15 and 22 November, 6 December
- Next year's seminars to be announced on
http://www.win.ua.ac.be/vincenz/si/0607.html
4Introduction to the exam
- Oral discussion of 2 papers
- A 5-6 page paper based on one or more of the topics of the seminars
- A paper with the analysis of a case study
- See later for examples
- Evaluation criteria
- Do the papers contain original ideas? Do they follow the seminar too strictly?
- Does the author understand the subject? Is (s)he able to reason independently about the subject?
- Papers must be submitted by May 15, 2007
- E-mail to vincenzo.deflorio_at_ua.ac.be
5The Topics
Dependability: the property of a system such that reliance can justifiably be placed on the service it delivers
Fault tolerance: one of the means of dependability
6The Dependability Tree
7Fault tolerance (FT)
A fault-tolerant system is a system that continues to function in spite of faults
defect IC
bug in program
operation fault
sensor drift
8Attributes of dependability
- Availability
- Readiness for usage
- A(t): probability that the system conforms to its specification at time t
- Reliability
- Continuity of service
- R(t): probability that the system conforms to its specification during [t0, t], provided that it does so at t0
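The two definitions above can be restated as probabilities. This is only a notational sketch: the event names are introduced here for illustration and do not come from the slides.

```latex
\begin{align*}
A(t) &= \Pr\{\text{system conforms to its specification at time } t\} \\
R(t) &= \Pr\{\text{system conforms throughout } [t_0, t] \mid \text{system conforms at } t_0\}
\end{align*}
```

Availability looks at a single instant, whereas reliability requires uninterrupted conformance over the whole interval.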
9Attributes of dependability (2)
- Safety
- Non-occurrence of catastrophic consequences on the environment
- S(t): probability that a system either conforms to its specification or reaches a safe halt at time t
- Fail-safe systems
10Attributes of dependability (3)
- Maintainability
- Aptitude to undergo repairs and evolution
- M(t): probability that the system is back to its specification at t if it failed at t0
11Attributes of dependability (4)
- Confidentiality
- Non-occurrence of unauthorised disclosure of information
- Integrity
- Non-occurrence of improper alterations of information
12Related attributes
- Testability
- Ability to test features of a system
- Related to maintainability
13Related attributes
- Security
- Integrity, availability, confidentiality
14References
- Jean-Claude Laprie, "Dependable Computing and Fault Tolerance: Concepts and Terminology", in Proc. of the 15th Int. Symposium on Fault-Tolerant Computing (FTCS-15), Ann Arbor, Mich., June 1985, pp. 2-11.
- Jean-Claude Laprie, "Dependability - Its Attributes, Impairments and Means", in Predictably Dependable Computing Systems, ESPRIT Basic Research Series, B. Randell, J.-C. Laprie, H. Kopetz and B. Littlewood (eds.), Springer Verlag, 1995, pp. 3-18.
15The lecture
- We now focus on application-level fault tolerance
- Why do we need ALFT? Why do we need software FT in the first place?
- We explain why
- We survey the existing methods and assess their pros and cons against a set of properties
- Surprising conclusion: still an open problem
16Structure
- Introduction
- Identification of main problems to tackle / key properties to achieve
- Qualitative survey
- Sketch of a possible ideal solution
- Conclusions
17Software Fault Tolerance
- Human society increasingly expects and relies on the good quality of complex services supplied by computers
18Software Fault Tolerance
- Consequences of a failure in the 40s (computers as fast solvers of numerical problems)
- Errors in computations, long downtimes
- Consequences of a failure nowadays (computers controlling nuclear plants, airborne equipment, healthcare)
- Incalculable penalty (catastrophes)
19Software Fault Tolerance
- Traditional answer: hardware fault tolerance
- This is an important ingredient, but not the only one needed today!
- Complexity is also in the SW layers
- Hierarchies of complex abstract machines
20Software Fault Tolerance
- Complexity is also in SW layers (cont'd)
- Software is often networked and distributed
- Relationships among software components are often complex
- Object model ⇒ Easier SW reuse ⇒ Hidden explicit Complexity
21Software Fault Tolerance
- In conclusion: no amount of verification, validation and testing can eliminate all faults in an application and give complete confidence in the availability and data consistency of applications
- Fault tolerance in SW is key
- SW failures can have consequences as severe as those of HW failures
Ariane 5!
22Problems of SW FT
The lighter the color, the more general-purpose the (virtual) machine
The lighter the color, the more complex the problem of expressing fault tolerance
23Problems of Application-level Fault Tolerance
- The only alternative and effective means for increasing software reliability is that of incorporating in the application software provisions for SFT
- The application software has to manage
- Functional aspects
- Fault tolerance (FT) aspects
- at the same time / in the same space
24Problems and properties of Application-level
Fault Tolerance
- Hazard: code intrusion
- FT provisions are specified side by side with the service
- Conflicting design concerns
- Overall design complexity increases
- Larger development and maintenance costs and times
- Larger probability of introducing software bugs
25Problems and properties of Application-level
Fault Tolerance
- Separation of design concerns (SDC)
- In what follows we call an ALFT a means to express fault tolerance in the application software
- A criterion to compare ALFTs is their degree of SDC
26Problems and properties of Application-level
Fault Tolerance
- Hazard: porting code ≠ porting service
- FT code assumes a fault model f(e)
- If e changes, or
- If the code is moved to another environment e'
- the QoS may degrade
27Problems and properties of Application-level
Fault Tolerance
- Hazard: porting code ≠ porting service
370 million euros down the drain
- An interesting case: Ariane 5 flight 501
- Ariane 4 mission software re-used in Ariane 5
- The early part of the trajectory of Ariane 5 differed from that of Ariane 4 and resulted in considerably higher horizontal velocity values
This could be a case study for the exam
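The flight 501 failure is commonly traced to a conversion of a 64-bit floating-point value, related to horizontal velocity, into a 16-bit signed integer that could not hold it. The following C sketch shows a guarded narrowing conversion of that kind. The flight software was written in Ada, not C, and the function name to_int16_checked is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert a double to int16_t only if it fits.
   Returns false (and leaves *out untouched) when the value is out of
   range or NaN, so the caller can apply its own recovery policy. */
bool to_int16_checked(double x, int16_t *out) {
    if (!(x >= INT16_MIN && x <= INT16_MAX))
        return false;               /* the comparison also rejects NaN */
    *out = (int16_t)x;
    return true;
}
```

A range that was safe for the Ariane 4 trajectory (the environment e) need not be safe for Ariane 5 (the environment e'). The check makes that assumption explicit instead of leaving it hardwired.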
28Problems and properties of Application-level
Fault Tolerance
- Problem: service portability
- Porting FT does not come for free
- Hardwired fault model, static environment
- More difficult to adapt / test / maintain
- More prone to "Ariane 5" effects
"What is the most often overlooked risk in software engineering? That the environment will do something the designer never anticipated" (J. Horning)
29Problems and properties of Application-level
Fault Tolerance
- Adaptability (AD)
- Does the ALFT provide means to adapt, dynamically, to new environmental conditions?
- A criterion to compare 2 ALFTs is their degree of AD
30Problems and properties of Application-level
Fault Tolerance
- Problem: adding complexity can decrease dependability
- The ALFT (the means to express FT) must be based on a simple strategy
- It must be syntactically adequate to host several mechanisms
31Problems and properties of Application-level
Fault Tolerance
- Hazard
- "Languages shape the way we think" (Whorf)
- "If all you have is a hammer, everything looks like a nail" (/usr/share/fortune)
- Syntactical Adequacy (SA)
- Does the ALFT provide simple means to host many FT solutions?
- A criterion to compare 2 ALFTs is their degree of SA
32Summary
- Separation of design concerns (SDC)
- Adaptability (AD)
- Syntactical Adequacy (SA)
- A base of attributes we can use to compare ALFTs with one another
33System structures for SFT
- Single-version FT
- Multiple-version FT
- Object model
- Linda Model
- FT Languages
- Recovery metaprogram
Each of these could be a case study for the exam
34Single-version Fault Tolerance
- Single-version SFT: embedding in the user application of a simplex system a set of error detection / recovery features
- Explicit code intrusion (bad SDC)
- Increases size and complexity (bad SA)
- Bad for transparency, maintainability, portability
- Increases development times and costs
- No support for dynamic adaptability (bad AD)
- Libraries
- SwIFT, HATS, EFTOS
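A minimal C sketch of what code intrusion looks like in single-version SFT. The sensor stub, the plausibility bounds and the retry budget are all hypothetical, and the code is not taken from SwIFT, HATS or EFTOS.

```c
#include <stdbool.h>

#define MAX_RETRIES 3               /* assumed retry budget */

/* Hypothetical service stub: the first two readings are erroneous. */
static int calls = 0;
static int read_sensor(void) {
    calls++;
    return (calls < 3) ? -1 : 42;   /* -1 models an erroneous value */
}

/* Functional code and FT code live side by side in one function. */
bool get_reading(int *out) {
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        int v = read_sensor();      /* functional aspect */
        if (v >= 0 && v <= 100) {   /* FT aspect: error detection */
            *out = v;
            return true;
        }
        /* FT aspect: recovery by retrying the operation */
    }
    return false;                   /* FT aspect: failure notification */
}
```

Only one line of get_reading delivers the service. The rest is fault tolerance logic with a hardwired policy, which is why this approach scores badly on SDC and AD.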
35Multiple-version Fault Tolerance
- Multiple-version SFT: NVP (N-version programming) and RB (recovery blocks)
- Idea: redundancy of software (independently designed versions of software)
- Randell (1975): "All fault tolerance must be based on the provision of useful redundancy, both for error detection and error recovery. In software the redundancy required is not simple replication of programs but redundancy of design"
- Assumption: random component failures. Correlated failures ⇒ sudden exhaustion of available redundancy
- Again, Ariane 5 flight 501: two crucial components were operating in parallel with identical hardware and software
36Multiple-version Fault Tolerance
#include <ftmacros.h>
...
ENSURE(acceptance-test)
    Alternate 1
ELSEBY
    Alternate 2
...
ENSURE
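The recovery block scheme expressed by the macros above can be sketched in plain C as follows. This is an illustration under assumptions, not the ftmacros.h implementation: the state type, the function names and the two sample alternates are hypothetical.

```c
#include <stdbool.h>

typedef struct { int value; } state_t;            /* hypothetical state */
typedef void (*alternate_fn)(state_t *);
typedef bool (*acceptance_fn)(const state_t *);

/* Run the alternates in order, each from the same checkpoint, until
   one passes the acceptance test. Returns false if all of them fail. */
bool recovery_block(state_t *s, acceptance_fn accept,
                    alternate_fn alts[], int n_alts) {
    state_t checkpoint = *s;                      /* save state */
    for (int i = 0; i < n_alts; i++) {
        alts[i](s);
        if (accept(s))
            return true;
        *s = checkpoint;                          /* roll back */
    }
    return false;
}

/* Illustrative alternates: the primary is faulty, the secondary is not. */
static void primary(state_t *s)   { s->value = -1; }
static void secondary(state_t *s) { s->value = s->value * 2; }
static bool non_negative(const state_t *s) { return s->value >= 0; }
```

With a starting value of 21, the primary fails the acceptance test, the state is rolled back, and the secondary delivers 42.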
37Multiple-version Fault Tolerance
#include <ftmacros.h>
...
NVP
VERSION
    block 1
    SENDVOTE(v-pointer, v-size)
VERSION
    block 2
    SENDVOTE(v-pointer, v-size)
ENDVERSION(timeout, v-size)
if (!agreeon(v-pointer)) error_handler()
ENDNVP
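The voting step of N-version programming can be sketched in plain C as a strict majority vote over the outputs of the versions. This is not how SENDVOTE or agreeon are implemented: the names, the sequential execution and the limit of 16 versions are assumptions made for illustration.

```c
#include <stdbool.h>

#define MAX_VERSIONS 16                   /* assumed upper bound */

typedef int (*version_fn)(int);

/* Run every version on the same input and accept a result only if a
   strict majority of the versions agrees on it. */
bool nvp_vote(version_fn versions[], int n, int input, int *result) {
    int outputs[MAX_VERSIONS];
    if (n <= 0 || n > MAX_VERSIONS)
        return false;
    for (int i = 0; i < n; i++)
        outputs[i] = versions[i](input);
    for (int i = 0; i < n; i++) {
        int votes = 0;
        for (int j = 0; j < n; j++)
            if (outputs[j] == outputs[i])
                votes++;
        if (2 * votes > n) {
            *result = outputs[i];
            return true;
        }
    }
    return false;                         /* no majority: signal an error */
}

/* Three illustrative versions of "square", one of them faulty. */
static int square_a(int x) { return x * x; }
static int square_b(int x) {              /* valid for x >= 0 */
    int s = 0;
    for (int i = 0; i < x; i++) s += x;
    return s;
}
static int square_faulty(int x) { return x * x + 1; }
```

The vote masks the faulty version only as long as the versions fail independently. If a majority shares the same design fault, the voter agrees on a wrong result.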
38Multiple-version Fault Tolerance
- Multiple-version SFT
- Implies N-fold design costs, N-fold maintenance costs
- The risk of correlated failures is not negligible
- Code intrusion is limited (acceptable SDC)
- System structure is fixed (bad SA)
- No support for dynamic adaptability (bad AD)
- Can be combined with other means
39Object-centred Strategies
- Strategies based on the object model
- Metaobject protocols and reflection
- Open implementation of the run-time executive of an OO language
- Reflection, reification
- Composition filters
- Each object has a set of filters. Messages sent
to any object are trapped by its filters. These
filters possibly manipulate the message before
passing it to the object.
40Object-centred Strategies
- Active objects
- Objects that have control over the synchronisation of incoming requests from other objects. Objects can autonomously decide, e.g., to delay a request until it is acceptable, i.e., until a guard is met
- FRIENDS, SINA, Correlate
- Full separation of design concerns (good SDC)
- No code intrusion
- Syntactically adequate, at least for a subset of FT strategies (acceptable SA)
41Object-centred Strategies
- Assumption: application written in an extended OO language
- Adaptability? (questionable AD)
42FT Linda Systems
- Generative communication: messages are not sent, they are stored in a public, distributed shared memory
- A shared relational database for storing and withdrawing tuples
- Tuples: lists of objects identified by their contents, cardinality and type
- A Linda process inserts, reads, and withdraws tuples via blocking or non-blocking primitives
- Synchronisation: presence / absence of a matching tuple
43Linda
- In master-worker applications
- Dynamic load balancing, also in heterogeneous clusters
- Inherently tolerates crash failures of workers
- Single-op atomicity
- Solutions
- Atomic transactions with multiple TS ops
- Stable tuple space
- Tuple space checkpointing, etc.
Possible case study for the exam
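A toy, single-process C sketch of a tuple space with non-blocking primitives, used to show why single-operation atomicity is not enough. The tuple layout (an integer tag plus an integer value), the fixed capacity and all function names are assumptions. Real Linda systems match on full tuple contents and are distributed.

```c
#include <stdbool.h>

#define TS_CAPACITY 64                    /* assumed fixed capacity */
enum { TASK = 1, RESULT = 2 };            /* hypothetical tuple tags */

typedef struct { int tag; int value; bool used; } tuple_t;
typedef struct { tuple_t slots[TS_CAPACITY]; } tuple_space_t;

/* out: insert a tuple; fails only when the space is full. */
bool ts_out(tuple_space_t *ts, int tag, int value) {
    for (int i = 0; i < TS_CAPACITY; i++)
        if (!ts->slots[i].used) {
            ts->slots[i] = (tuple_t){ tag, value, true };
            return true;
        }
    return false;
}

/* rdp: non-blocking read of a matching tuple, which stays in the space. */
bool ts_rdp(const tuple_space_t *ts, int tag, int *value) {
    for (int i = 0; i < TS_CAPACITY; i++)
        if (ts->slots[i].used && ts->slots[i].tag == tag) {
            *value = ts->slots[i].value;
            return true;
        }
    return false;
}

/* inp: non-blocking withdrawal of a matching tuple. */
bool ts_inp(tuple_space_t *ts, int tag, int *value) {
    for (int i = 0; i < TS_CAPACITY; i++)
        if (ts->slots[i].used && ts->slots[i].tag == tag) {
            *value = ts->slots[i].value;
            ts->slots[i].used = false;
            return true;
        }
    return false;
}

/* One worker step: withdraw a task, compute, publish the result.
   With crash_midway set, the worker "dies" after the withdrawal,
   so the task tuple is gone and no result is ever produced. */
bool worker_step(tuple_space_t *ts, bool crash_midway) {
    int x;
    if (!ts_inp(ts, TASK, &x))
        return false;
    if (crash_midway)
        return false;                     /* simulated crash */
    return ts_out(ts, RESULT, x * x);
}
```

Each primitive is atomic on its own, but the withdraw-then-publish pair is not. This is the gap that multi-operation transactions and stable or checkpointed tuple spaces are meant to close.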
44Linda
- FT-Linda, Persistent Linda...
- Full separation of design concerns (good SDC)
- No code intrusion
- Syntactically adequate, at least for a subset of FT strategies (acceptable SA)
- Assumption: application written in Linda
- Adaptability? (questionable AD)
45FT Languages
- FT Languages
- Enhanced, pre-existing
- Examples
- FT-SR
- Fail-stop modules: abstract unit of encapsulation
- Atomic execution
- Composability
- x-Linda (x = C, Fortran, C++, ...)
46FT Languages
- FT Languages
- Novel languages
- Examples
- Argus: distributed OO programming language and operating system
- Guardians: objects performing user-definable actions in response to remote requests
- Atomic transactions
- FTAG: functional language based on attribute grammars
47FT Languages
- FTAG
- Computation: a collection of pure mathematical functions, the modules.
- Each module has a set of input values, called inherited attributes, and of output variables, called synthesized attributes.
48FTAG (cont'd)
- Primitive modules can be executed directly
- Non-primitive modules require other modules to be performed first
- FTAG program: decomposing a root module into its basic sub-modules and then recursively applying this decomposition process to each of the sub-modules (computation tree)
49FTAG (cont'd)
- Natural support for redoing (replacing a portion of the computation tree with a new computation)
- Natural support for replication (replicated decomposition: a module is decomposed into N identical sub-modules implementing the function to replicate)
50FT Languages
- Conclusions for FT languages
- Adequate separation of design concerns, transparency (good SDC)
- Special-purpose syntax (potentially good SA)
- Application must be written in a non-standard language
- Bad portability
- Adaptability (AD) unknown
51RMP
- Recovery Metaprogram
- Two cooperating processing contexts
- User-placed breakpoints in the user context trigger the execution of a meta-program
- When the meta-program ends, control is returned to the user program
- Meta-program is to be written in CSP
52RMP
- Adequate, e.g., for recovery blocks
- Breakpoint can trigger the execution of
- CHECKPOINT
- ALTERNATES
- ACCEPTANCE TESTS...
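The breakpoint idea can be sketched in C as follows. This is a loose illustration only: in RMP the meta-program is written in CSP and runs in a separate processing context, whereas here it is an ordinary C function, and all names and the acceptance criterion are hypothetical.

```c
typedef enum { BP_CHECKPOINT, BP_ACCEPTANCE_TEST } breakpoint_t;
typedef struct { int value; } user_state_t;   /* hypothetical user state */

/* Meta context: owns the saved state, invisible to the user code. */
static user_state_t saved;

/* Meta-program: invoked at each breakpoint. Returns 1 if the user
   program may proceed, 0 if the state had to be rolled back. */
int meta_program(breakpoint_t bp, user_state_t *s) {
    switch (bp) {
    case BP_CHECKPOINT:
        saved = *s;                           /* CHECKPOINT */
        return 1;
    case BP_ACCEPTANCE_TEST:
        if (s->value >= 0)                    /* assumed acceptance test */
            return 1;
        *s = saved;                           /* restore the checkpoint */
        return 0;
    }
    return 0;
}

/* User context: contains the service plus bare breakpoints only. */
int user_program(int input) {
    user_state_t s = { input };
    meta_program(BP_CHECKPOINT, &s);          /* breakpoint */
    s.value = s.value - 10;                   /* the service itself */
    meta_program(BP_ACCEPTANCE_TEST, &s);     /* breakpoint */
    return s.value;
}
```

The user program states where recovery actions may happen but not what they are. The policy stays in the meta-program.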
53RMP
- RMP summary
- Full separation of design concerns
- No code intrusion (good SDC)
- Syntactically adequate, at least for a subset of FT strategies (average SA)
- The meta-program is written in a fixed, pre-existing language (CSP)
- Inefficient implementation (huge performance overhead for switching execution modes)
- No adaptability (bad AD)
54Summary
- No optimal solution exists yet
- Challenging research problem!
55Conclusions in search of optimum
- A dependable service is one that persists even when, for instance, its corresponding program experiences faults, to some agreed-upon extent
- An F-dependable service (resp. F-dependable program, system) is one that persists despite the occurrence of faults as described in F
- F is the fault model
56Conclusions in search of optimum
- F is the model of an environment (E)
- An F-dependable service may tolerate faults in E and may not tolerate those in E'
- What if F matches an environment E?
- What if E changes into E'?
- What if an F-service is moved?
- A failure may occur!
57Conclusions in search of optimum
- Adapting services
- X-dependable services, where X = f(E)
- X changes when
- The service is moved
- The environment mutates
- Changes should occur automatically (high AD)
- The expression of adaptability and dependability concerns should not increase complexity too much (high SA)
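A minimal C sketch of X = f(E), where the fault tolerance provision (here, a degree of replication) is recomputed from an observed property of the environment. The environment descriptor, the fault-rate metric and the thresholds are invented for illustration and do not come from the seminar.

```c
/* Hypothetical environment descriptor. */
typedef struct {
    double fault_rate;        /* observed faults per hour (assumed metric) */
} environment_t;

/* f: maps the current environment E to a degree of redundancy X.
   The thresholds are arbitrary placeholders. */
int replicas_for(const environment_t *e) {
    if (e->fault_rate < 0.001)
        return 1;             /* benign environment: simplex */
    if (e->fault_rate < 0.1)
        return 3;             /* moderate: triple redundancy */
    return 5;                 /* harsh: five replicas */
}
```

Re-evaluating replicas_for whenever the service is moved or the environment mutates keeps the fault model out of the service code.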
58Conclusions
- Ideally, the code should be made of two components
- (service, FT)
- (Optimal SDC)
- and FT should adapt dynamically w.r.t. e
59Conclusions
- Risk: this may call for complexity!
- But generic architectures can be designed so as to keep complexity limited
- Optimizations are possible
- In a future seminar: a compliant architecture that is being designed within PATS
60All citations are by B. Randell if no author is specified