Title: Compositional Specification-Based Fault-Tolerance
1 Compositional Specification-Based Fault-Tolerance
- Anish Arora and Murat Demirbas
- Ohio State University and SUNY Buffalo
- August 2009
2 Principles of Fault-tolerance Design
- Separability: fault-tolerance components vs. functionality components
- Minimality: maximal reuse of system functionality in design
- Scalability: design avoids full implementation or replica synchrony
- Incrementality: add new tolerances without modifying older components
- In our approach, compositionality deals with the first two; specifications are exploited for the last two
3 Compositional Design Overview
- The separation principle
- fault-tolerant system C' = fault-intolerant system C composed with tolerance components
- The minimality principle
- tolerance components are used to achieve tolerance, not to resatisfy the specification
- Reuse criterion
- in the absence of faults, C' behaves as C does
- in the presence of faults, if C' recovers, it behaves as C does
4 Specification-Based Design Overview
- The scalability principle
- Given spec B, compositionally design fault-tolerant B [] W (B composed with wrapper W)
- Compile B [] W while preserving tolerance
- The incrementality principle
- Given C ref B and fault-tolerant B [] W
- Compile W into W' separately, so C [] W' has the same tolerance
5 Let's now formalize: (I) Systems
[Figure: state, event, computation]
- Computations of a system [Alpern, Schneider 85]: safety computations ∩ liveness computations
- safety is a set of sequences in which no sequence does anything bad
- liveness is a set of sequences that contains, for each prefix, an extension which does something good
6 (II) Specifications
- A set of desirable sequences, like systems
- can be decomposed into safety and liveness parts
- Let C be a system, B a specification
- C ref B iff computations(C) ⊆ computations(B)
- Note: the definition of ref is readily extended to allow internal state in C and B (⊆ on projections of computations on external state)
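The ref relation can be sketched on finite toy data. This is only an illustration under stated assumptions: computations are finite tuples of states, each system is given as an explicit finite list of computations (real computations may be infinite), and the names `project` and `refines` are hypothetical, not taken from the slides.

```python
def project(computation, external):
    """Keep only the external variables of every state in a computation."""
    return tuple(tuple((k, s[k]) for k in external) for s in computation)

def refines(comps_c, comps_b, external):
    """C ref B: projected computations of C are a subset of those of B."""
    proj_c = {project(c, external) for c in comps_c}
    proj_b = {project(b, external) for b in comps_b}
    return proj_c <= proj_b

# Toy data (assumption): states are dicts; 'x' is external, 'tmp' is internal to C.
B = [({"x": 0}, {"x": 1}), ({"x": 0}, {"x": 2})]
C = [({"x": 0, "tmp": 7}, {"x": 1, "tmp": 9})]
C_bad = [({"x": 0, "tmp": 7}, {"x": 3, "tmp": 9})]

result_ok = refines(C, B, ["x"])        # C's only computation is allowed by B
result_bad = refines(C_bad, B, ["x"])   # C_bad reaches x = 3, which B never does
```

Projecting onto the external variables first is what lets C carry internal state that B does not mention.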
7 (III) Faults
- Classes
- message: loss, corruption, replay, preplay, forgery
- process: hangs, crash, fail-stops, Byzantine failure
- sensor: stuck-at, intermittent failure
- memory: transient corruption
- channel: eavesdropping, fail-stops
- Computations of a fault-class F are sequences too!
- Let C [] F be the computations of system C in the presence of F
- C ref B does not imply (C [] F) ref B
- nor does it imply (C [] F) ref (B [] F)
8 (IV) Fault-tolerance
- In the presence of a fault-class, a fault-tolerant system must satisfy a tolerant specification
- Tolerant specifications are potentially weaker than the original specifications
- Types of tolerant specifications
- masking: original specification
- fail-safe: safety part only
- stabilizing: liveness part and eventual safety part
9 Theory of Tolerance Components
- For the class of reuse design [Arora, Kulkarni 97a,b]
- Theorem: For fail-safe implementation, detectors are necessary and sufficient
- Theorem: For stabilizing implementation, correctors are necessary and sufficient
- Theorem: For masking implementation, detectors and correctors are necessary and sufficient
10 Why State-Predicate based Detectors suffice for Fail-safe Tolerance
- Before a method is executed, detect whether the extended prefix would violate safety; this can be detected using only the last state of the prefix
- ∀ system methods ∃ a state predicate s.t. execution of the method in a state where that predicate is true satisfies safety
- detect whether execution of the method in the given state is safe
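A minimal sketch of a state-predicate detector, assuming a hypothetical safety specification ("the balance never becomes negative") and hypothetical names (`with_detector`, `UnsafeStep`, `withdraw`) that do not come from the slides. The detector inspects only the current state and the method's arguments, and blocks the method when the predicate is false.

```python
class UnsafeStep(Exception):
    """Raised when a method is blocked because its detector predicate is false."""

def with_detector(predicate):
    """Wrap a method so it runs only in states where `predicate` holds."""
    def wrap(method):
        def guarded(state, *args):
            if not predicate(state, *args):
                raise UnsafeStep(method.__name__)
            return method(state, *args)
        guarded.__name__ = method.__name__
        return guarded
    return wrap

# Hypothetical safety specification: the balance never becomes negative.
@with_detector(lambda state, amount: state["balance"] >= amount)
def withdraw(state, amount):
    return {"balance": state["balance"] - amount}

state = {"balance": 10}
state = withdraw(state, 4)      # safe: the predicate holds in the current state
try:
    withdraw(state, 100)        # unsafe: blocked before it executes
    blocked = False
except UnsafeStep:
    blocked = True
```

Blocking gives fail-safe behavior only: safety is preserved, but no progress is promised while the predicate stays false.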
11 Why State-Predicate based Correctors suffice for Stabilizing Tolerance
- Ensure that eventually safety and liveness are satisfied
[Figure: the states reached in the presence of faults, and the states from where safety and liveness are satisfied]
- Restore the system to a state from where its safety and liveness are both satisfied
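A minimal sketch of a state-predicate corrector, assuming a hypothetical system (a counter cycling through 0..N-1) and hypothetical names (`legitimate`, `corrector`, `run`). A transient fault may leave the counter at any value; the corrector restores a state from which the system behaves correctly again.

```python
N = 5  # hypothetical system: a counter that cycles through 0..N-1

def legitimate(state):
    """Correction predicate: states from which safety and liveness hold."""
    return 0 <= state["x"] < N

def system_step(state):
    """Fault-intolerant action: count up, wrapping only at N-1."""
    x = state["x"]
    return {"x": 0 if x == N - 1 else x + 1}

def corrector(state):
    """Restore a legitimate state when the correction predicate is false."""
    return state if legitimate(state) else {"x": 0}

def run(state, steps, use_corrector=True):
    trace = [state]
    for _ in range(steps):
        if use_corrector:
            state = corrector(state)
        state = system_step(state)
        trace.append(state)
    return trace

faulty = {"x": 100}                        # state left behind by a transient fault
with_corr = run(faulty, 3)                 # recovers after the first step
without_corr = run(faulty, 3, use_corrector=False)   # keeps counting upwards
```

In the absence of faults the corrector never changes the state, so the composed system behaves as the original one does.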
12 Specification-based Tolerance Theory
- Q: If B [] W is stabilizing, can C [] W' be stabilizing?
- A: Depends on properties of the compiler
[Figure: one compiler maps B to C and another maps W to W', taking B [] W to C [] W']
13 (Option 1) Use Convergence Refinement Compilers
- C is a convergence refinement of B iff
- C ref B
- Every computation of C that starts from a noninitial state is a compression of some computation of B starting from the corresponding state
- Theorem [Demirbas, Arora 02]
- If B [] W is stabilizing, and
- both compilers are convergence refinements,
- Then C [] W' is stabilizing
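The compression condition can be illustrated on finite sequences. The slides do not define "compression", so this sketch assumes it means: the computation of C is obtained from the computation of B by dropping some intermediate states while keeping the first and last state. The function name `is_compression` is hypothetical.

```python
def is_compression(c, b):
    """True if finite sequence c is b with some intermediate states dropped.

    Assumed definition: c is a subsequence of b with the same first
    and last state.
    """
    if not c or not b or c[0] != b[0] or c[-1] != b[-1]:
        return False
    remaining = iter(b)
    # Each state of c must be found in b, in order (subsequence check).
    return all(any(s == t for t in remaining) for s in c)

b = [0, 1, 2, 3]                            # abstract computation of B
skips_states = is_compression([0, 3], b)    # C converges in fewer steps
reorders = is_compression([0, 2, 1, 3], b)  # visits states in another order
overshoots = is_compression([0, 3, 4], b)   # ends in a different state
```

Under this reading, C may recover faster than B but may not take detours that B's recovery never takes.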
14 (Option 2) Using Total-Onto Refinements
- Assume W stabilizes B atomically, i.e. in a single step
- Theorem [Demirbas, Arora 08]
- If W is self-stabilizing and stabilizes B atomically
- B to C compiler yields a total-onto abstraction function
- W to W' compiler is a convergence refinement
- Then C [] W' is stabilizing
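A minimal sketch of what "total-onto" asks of an abstraction function, assuming finite state sets and an abstraction function stored as a dictionary from concrete (C-level) states to abstract (B-level) states. The name `is_total_onto` and the example states are hypothetical.

```python
def is_total_onto(abstraction, concrete_states, abstract_states):
    """Check that a finite abstraction function is total and onto.

    total: every concrete state is mapped to some abstract state
    onto:  every abstract state is the image of some concrete state
    """
    total = all(s in abstraction for s in concrete_states)
    image = {abstraction[s] for s in concrete_states if s in abstraction}
    onto = set(abstract_states) <= image
    return total and onto

# Hypothetical example: concrete states carry an extra hidden timer value.
abstract_states = {"idle", "busy"}
concrete_states = {("idle", 0), ("idle", 1), ("busy", 0)}
abstraction = {s: s[0] for s in concrete_states}   # forget the hidden part

ok = is_total_onto(abstraction, concrete_states, abstract_states)

partial = dict(abstraction)
del partial[("busy", 0)]                           # no longer total or onto
not_ok = is_total_onto(partial, concrete_states, abstract_states)
```

Totality matters for stabilization because faults can place the concrete system in any state, so every such state needs an abstract counterpart.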
15 Dealing with Distributed Systems
- Decompose B and W into several processes
- B = ([] j :: Bj)
- W = ([] j :: Wj)
- where Wj is defined for Bj
- Compile each process separately
- But
- System may not be stabilizing even if each compiled Cj [] W'j is
- corruption from a process in a faulty state may spread and cycle through processes
16 Using Compositional and Specification-Based Methods
- For corruption cycling, use compositional fault-tolerance
- e.g., stabilization by layers: lower-level processes are oblivious to higher-level processes
- corruption and also correction spread from lower to higher
- The total-onto refinement theorem holds for the distributed system
- We now illustrate the theory presented so far.
17 Case Study: STALK, Wireless Sensor Network Service
- STALK is a hierarchical tracking program
- tracking structure is a path rooted at the highest level
- target resides at the leaf of the tracking path
- each node in the tracking path has 1 child, either at its level or one level below
- We start with a simple guarded command (GC) program
- GC uses shared memory, IOA uses message passing
- Compile GC code to an IOA-level STALK intolerant program
- Compile GC wrappers to make the IOA program self-stabilizing
- Theorem applies
18 Tracking tree
[Figure: tracking tree with levels 2, 1, and 0, and the tracked object]
19 An example of find (we don't discuss the find program further)
[Figure: find operations and the object]
20 An example of move
[Figure: move example showing the object]
21 Deriving the IOA program
- In the GC program, node i maintains two variables, i.c and i.p
- Corresponding to child and parent pointers
- Tracking path is a doubly linked list
- In GC, a node deletes itself from the path by setting i.c = i.p = nil
- At the IOA level, c and p map to those at the GC level
- Also, hidden state stime is introduced to propagate shrink upwards. Hidden states have no effect on the mapping
- In GC, a node adds itself to the path by setting i.c and i.p according to the hierarchy level rules
- At the IOA level, c and p are set in a corresponding manner
- For this, hidden states gtime, gnbrquery, and gqack are introduced.
22 Deriving the IOA wrappers
- Start-shrink action for cleaning unrooted trees
- i.c = nil ∧ i.p ≠ nil → i.p := nil
- Start-growth action for rebuilding upper levels of a rooted tree
- i.c ≠ nil ∧ i.p = nil → set i.p
- These two wrappers are refined in a straightforward manner
- Hidden states stime and gtime are corrected
- Detect if a node does not have a child
- (i.c).p ≠ i → i.c := nil
- This is implemented using a heartbeat wrapper
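The three guarded-command wrappers can be sketched in shared-memory style. This is an illustration only: the node table, the `choose_parent` callback (standing in for the hierarchy level rules, which the slides do not spell out), and the extra non-nil check before dereferencing i.c are assumptions, not part of STALK as presented.

```python
NIL = None

def start_shrink(nodes, i):
    """i.c = nil and i.p != nil  ->  i.p := nil (clean an unrooted tree)."""
    n = nodes[i]
    if n["c"] is NIL and n["p"] is not NIL:
        n["p"] = NIL
        return True
    return False

def start_growth(nodes, i, choose_parent):
    """i.c != nil and i.p = nil  ->  set i.p (rebuild upper levels)."""
    n = nodes[i]
    if n["c"] is not NIL and n["p"] is NIL:
        n["p"] = choose_parent(i)   # assumption: hierarchy rules live in this callback
        return True
    return False

def detect_lost_child(nodes, i):
    """(i.c).p != i  ->  i.c := nil; guarded by i.c != nil (assumption)."""
    n = nodes[i]
    if n["c"] is not NIL and nodes[n["c"]]["p"] != i:
        n["c"] = NIL
        return True
    return False

# Hypothetical two-node fragment: "a" has no child but still points to "b".
nodes = {"a": {"c": NIL, "p": "b"}, "b": {"c": "a", "p": NIL}}
fired = [start_shrink(nodes, "a"), detect_lost_child(nodes, "b")]

# Hypothetical rooted fragment whose top node "x" has lost its parent pointer.
grow = {"x": {"c": "y", "p": NIL}, "y": {"c": NIL, "p": "x"}}
grew = start_growth(grow, "x", lambda i: "r")
```

Each action reads and writes only the node's own pointers and its child's parent pointer, which matches the slides' claim that the wrappers are local.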
23 Refining the IOA program
- STALK is hierarchical: information (both corruption and correction) flows from lower to higher level processes
- Correctors for STALK are local and atomic
- Hence our Theorem applies: the IOA program is stabilizing
- Note: the Theorem also applies for refining from IOA to C
- Refine IOA to C by using a total and onto compiler [Tauber 04]
- Refine IOA wrappers to C by everywhere refinements
- note that start-shrink and start-growth in IOA are stateless
- the heartbeat wrapper introduces a soft-state, bounded-space timer
24 Conclusions
- Fault-tolerance principles of separability, minimality, scalability, and incrementality are met by the compositional, specification-based approach
- Detection and correction of state predicates suffice to enforce tolerance specifications
- Certain forms of compilers suffice to preserve tolerance properties
- We believe application to wireless sensor network applications is promising and are developing compilers for this environment
25 Detectors
Specification (detection state predicate, witness state predicate)
[Figure labels: safety, liveness]
- Large detectors in distributed systems are built out of parallel or sequential composition of smaller ones
- Traditional examples: error detection codes, acceptance tests, comparators, snapshot procedures, exception conditions
26 Correctors
Specification (correction state predicate, witness state predicate)
[Figure labels: safety, liveness]
- Large correctors in distributed systems are built out of parallel or sequential composition of smaller ones
- Traditional examples: error correction codes, reset procedures, voters, rollback recovery, constraint satisfaction
27 Self-tolerance of Tolerance Components
- Detectors for fail-safe systems must be (at least) fail-safe
- Correctors for stabilizing systems must be (at least) stabilizing
- Detectors and correctors for masking systems must be masking