Title: Outline
1Outline
- Introduction to Fault Tolerance and Recovery
- Recovery
2Fault Tolerance and Recovery
- An important goal of distributed systems is to
increase the availability and reliability by
automatically recovering from partial failures
without seriously affecting the overall
performance
- Closely related to dependable systems
- Two broad classes of algorithms
- Recovery
- When a failure has occurred, recover the system
from the erroneous state to an error-free state
- Fault tolerance
- Failure masking
3Recovery
- Recovery in computer systems refers to restoring
a system to its normal operational state after a
failure
4Basic Concepts
- A system consists of a set of hardware and
software components
- Each component is designed to provide a set of
specified services
- Failure in a system occurs if the system does not
perform its services in the specified manner
- An erroneous state of the system is a state which
could lead to a system failure by a sequence of
valid state transitions
- Recovery is a process that involves restoring an
erroneous state to an error-free state
5Classification of Failures
6Classification of Failures cont.
- Process failure
- System failure
- Amnesia failure
- A partial amnesia failure
- A pause failure
- A halting failure
- Secondary storage failure
- Communication medium failure
7Backward and Forward Error Recovery
- Forward-error recovery
- If the errors and damage caused by faults can be
completely and accurately assessed, then those
errors can be removed and the resulting state is
error-free
- Backward-error recovery
- If the errors and damage cannot be assessed,
the system can be restored to a previous
error-free state
- Problems include performance penalty, recurrence
of faults, and unrecoverable components
8Basic Approach in Backward Error Recovery
- The basic idea is to save some recovery points on
stable storage
- Operation-based approach
- State-based approach
9Recovery Stable Storage
10The Operation-based Approach
- In this approach, all the modifications that are
made to the state of a process are recorded so
that a previous state can be restored by
reversing all the changes made to the state
- The record of the system activity is known as an
audit trail or log
11The Operation-based Approach cont.
- Updating-in-place
- Every update operation on an object updates the
object and produces a log record, written to
stable storage, that has enough information to
completely undo and redo the operation
- The information needs to include the name of the
object, the old state of the object, and the new
state of the object
- It can be implemented as a collection of
operations
- Do, undo, redo, and display operations
- The major problem is that a do operation cannot be
undone if the system crashes after the update
operation but before the log record is stored
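The do, undo, redo, and display operations can be sketched as follows. This is a minimal illustration, not the slides' own implementation: the `Store` and `LogRecord` names are hypothetical, an in-memory list stands in for stable storage, and `None` is assumed to mean "the object did not exist yet".

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    name: str    # name of the object
    old: object  # state before the update (used by undo)
    new: object  # state after the update (used by redo)


class Store:
    """Hypothetical object store that updates in place and logs each update."""

    def __init__(self):
        self.objects = {}
        self.log = []  # stands in for stable storage in this sketch

    def do(self, name, value):
        record = LogRecord(name, self.objects.get(name), value)
        self.objects[name] = value  # update the object in place ...
        self.log.append(record)     # ... then store the log record

    def undo(self, record):
        if record.old is None:      # assumption: None means "did not exist"
            self.objects.pop(record.name, None)
        else:
            self.objects[record.name] = record.old

    def redo(self, record):
        self.objects[record.name] = record.new

    def display(self):
        return list(self.log)
```

Note that `do` updates the object before the record is logged, which is exactly the window in which a crash leaves an update that cannot be undone.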
12The Operation-based Approach cont.
- The write-ahead-log protocol
- To overcome the problem with the updating-in-place
scheme, a recoverable update is implemented by
the following steps
- Update an object only after the undo log is
recorded
- Before committing the updates, redo and undo logs
are recorded
- Problems with operation-based approach
- Writing a log on every update operation is
expensive in terms of storage requirement and CPU
overhead
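A minimal sketch of the write-ahead ordering, under simplifying assumptions: `WALStore` is a hypothetical name, a Python list stands in for the stable log, and `recover` models rolling back uncommitted updates rather than a full crash restart.

```python
class WALStore:
    """Hypothetical store that logs undo/redo information before updating."""

    def __init__(self):
        self.objects = {}
        self.stable_log = []  # stands in for stable storage in this sketch

    def update(self, name, value):
        # 1. Record the undo (old) and redo (new) information first.
        self.stable_log.append(("update", name, self.objects.get(name), value))
        # 2. Only then modify the object itself.
        self.objects[name] = value

    def commit(self):
        # All undo/redo records precede the commit record in the log.
        self.stable_log.append(("commit",))

    def recover(self):
        """Undo, newest first, every update logged after the last commit."""
        while self.stable_log and self.stable_log[-1][0] == "update":
            _, name, old, _new = self.stable_log.pop()
            if old is None:  # assumption: None means "did not exist"
                self.objects.pop(name, None)
            else:
                self.objects[name] = old
```

Because the log entry is written before the object changes, every update visible in `objects` has a matching undo record.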
13The State-based Approach
- In this approach, the complete state of a process
is saved when a recovery point is established
- Recovering involves reinstating its saved state
and resuming the execution of the process from
that state
- The process of saving state is referred to as
checkpointing or taking a checkpoint
- A recovery point at which checkpointing occurs is
referred to as a checkpoint
- The process of restoring a process to a
prior state is referred to as rolling back the
process
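Checkpointing and rolling back can be illustrated with a small sketch. The `CheckpointedProcess` class is hypothetical, and the checkpoints are kept in memory here, whereas a real system would write them to stable storage.

```python
import copy


class CheckpointedProcess:
    """Hypothetical process whose complete state is saved at recovery points."""

    def __init__(self, state):
        self.state = state
        self.checkpoints = []  # stands in for stable storage in this sketch

    def take_checkpoint(self):
        # Save a deep copy so later changes do not alter the saved state.
        self.checkpoints.append(copy.deepcopy(self.state))

    def roll_back(self):
        # Reinstate the most recent checkpoint; execution resumes from it.
        self.state = copy.deepcopy(self.checkpoints[-1])
```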
14The State-based Approach cont.
- Shadow pages
- Whenever a process wants to modify an object, the
pages containing the object are duplicated and
maintained on stable storage
- Only one of the copies undergoes all the
modifications done by the process
- The other, unmodified copy is known as the shadow
pages
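A rough sketch of the shadow-page idea follows. The `ShadowPagedObject` class and its method names are hypothetical. The commit and abort behavior (discard the shadow on success, restore from it on failure) is the usual reading of the technique and is stated here as an assumption, since the slide does not spell it out.

```python
class ShadowPagedObject:
    """Hypothetical object stored as a list of pages with a shadow copy."""

    def __init__(self, pages):
        self.current = list(pages)  # the copy that receives modifications
        self.shadow = None          # the unmodified copy, once duplicated

    def begin_update(self):
        # Duplicate the pages; the untouched duplicate is the shadow.
        self.shadow = list(self.current)

    def write(self, index, data):
        self.current[index] = data

    def commit(self):
        # Assumption: on success the shadow pages are simply discarded.
        self.shadow = None

    def abort(self):
        # Assumption: on failure the modified pages are replaced by the shadow.
        self.current = self.shadow
        self.shadow = None
```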
15Issues in Recovery in Concurrent Systems
- How and when to do checkpointing
- Independently or collectively
- Asynchronous vs. synchronous checkpointing
- Recovery
- Which checkpoints should be used when processes
roll back?
- Strongly consistent checkpoints
- Consistent checkpoints
16Orphan Messages and Domino Effect
17Lost Messages
18Livelocks
19Consistent Set of Checkpoints
20A Simple Method for Taking a Consistent Set of
Checkpoints
- Assume the action of taking a checkpoint and the
action of sending or receiving a message are
indivisible
- If every process takes a checkpoint after sending
every message, the set of most recent checkpoints
is always consistent
- It is not strongly consistent
- However, taking a checkpoint after each message
is expensive
- If we take a checkpoint after every K (K > 1)
messages, the method suffers from the domino
effect
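The difference between a consistent and a strongly consistent set can be made concrete with a small check. This sketch assumes the usual definitions, which these slides do not state explicitly: a set is inconsistent if some checkpoint records a message as received that no checkpoint records as sent (an orphan message), and strongly consistent if, in addition, no recorded sent message is still unreceived. The `classify` function and the dictionary layout are hypothetical.

```python
def classify(checkpoints):
    """Classify a set of checkpoints, one per process.

    checkpoints maps a process id to {"sent": set, "received": set},
    holding the ids of messages recorded in that process's checkpoint.
    """
    sent = set().union(*(c["sent"] for c in checkpoints.values()))
    received = set().union(*(c["received"] for c in checkpoints.values()))
    if received - sent:
        return "inconsistent"        # an orphan message exists
    if sent - received:
        return "consistent"          # some message is recorded as sent only
    return "strongly consistent"
```

With a checkpoint after every send, the sender's checkpoint always records the message, so no orphan can arise; the receiver's latest checkpoint may not record the receipt yet, which is why the set is consistent but not strongly consistent.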
21Synchronous Checkpointing
- The processes are coordinated so that the set of
all recent checkpoints is guaranteed to be
consistent
- Assumptions
- Processes communicate by exchanging messages
- Communication channels are FIFO
- Communication failures do not partition the
network
- Two kinds of checkpoints
- Permanent checkpoints
- Tentative checkpoints
- Two phases in checkpointing
22Synchronous Checkpointing cont.
- First phase
- An initiating process Pi takes a tentative
checkpoint and requests all other processes to
take tentative checkpoints
- Each process informs Pi whether it succeeded in
taking a tentative checkpoint
- If Pi learns that all the processes have
successfully taken tentative checkpoints, Pi
decides that all checkpoints should be made
permanent; otherwise, Pi decides that all the
tentative checkpoints should be discarded
23Synchronous Checkpointing cont.
- Second phase
- Pi informs all the processes of its decision
- A process, on receiving the message from Pi, will
act accordingly
- Therefore, either all the processes take
permanent checkpoints or none do
- A process must not send messages related to
the computation after taking the tentative
checkpoint but before receiving the
decision from Pi
- Correctness
- A set of permanent checkpoints taken by the
algorithm is consistent
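The two phases can be sketched as a single-threaded simulation. This is an illustrative simplification, not the algorithm as given in the lecture: message passing is replaced by direct method calls, and the `Process` class, its `can_checkpoint` flag, and `coordinate_checkpoint` are hypothetical names.

```python
class Process:
    """Hypothetical process with tentative and permanent checkpoints."""

    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint  # lets the sketch model a failure
        self.state = 0
        self.tentative = None
        self.permanent = None
        self.sending_blocked = False

    def take_tentative(self):
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        # No computation messages may be sent until the decision arrives.
        self.sending_blocked = True
        return True

    def on_decision(self, make_permanent):
        if make_permanent and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None  # made permanent or discarded
        self.sending_blocked = False


def coordinate_checkpoint(initiator, others):
    everyone = [initiator] + others
    # Phase 1: everyone takes a tentative checkpoint and reports success.
    replies = [p.take_tentative() for p in everyone]
    decision = all(replies)
    # Phase 2: the initiator's decision is delivered to every process.
    for p in everyone:
        p.on_decision(decision)
    return decision
```

Either every process ends up with a new permanent checkpoint or none does, which mirrors the all-or-none property stated above.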
24Synchronous Checkpointing cont.
25Synchronous Checkpointing cont.
Condition for optimization
26Synchronous Checkpointing cont.
27Synchronous Checkpointing cont.
28Synchronous Checkpointing cont.
(continued)
29The Rollback Recovery Algorithm
- Assumptions
- A single process invokes the algorithm
- Rollback recovery and checkpointing are not
concurrently invoked
- First phase
- A process Pi checks to see if all the processes
are willing to restart from their previous
checkpoints
- If all processes reply yes, Pi decides to
restart; otherwise, Pi decides not to
- Second phase
- Pi propagates its decision to all the processes
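A minimal sketch of the two rollback phases, again with direct calls standing in for messages. The `RecoverableProcess` class, its `willing` flag, and `rollback_recovery` are hypothetical names introduced only for illustration.

```python
class RecoverableProcess:
    """Hypothetical process that can restart from its previous checkpoint."""

    def __init__(self, checkpoint, willing=True):
        self.checkpoint = checkpoint
        self.state = checkpoint
        self.willing = willing  # lets the sketch model a "no" reply

    def willing_to_restart(self):
        return self.willing

    def on_decision(self, restart):
        if restart:
            self.state = self.checkpoint


def rollback_recovery(initiator, others):
    everyone = [initiator] + others
    # Phase 1: ask every process whether it is willing to restart.
    decision = all(p.willing_to_restart() for p in everyone)
    # Phase 2: propagate the decision to every process.
    for p in everyone:
        p.on_decision(decision)
    return decision
```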
30The Rollback Recovery Algorithm cont.
31The Rollback Recovery Algorithm cont.
32The Rollback Recovery Algorithm cont.
33The Rollback Recovery Algorithm cont.
(continued)
34Asynchronous Checkpointing and Recovery
- In asynchronous checkpointing and recovery,
checkpoints are taken independently
- A recovery algorithm has to search for the most
recent consistent set of checkpoints before it
can initiate recovery
35Asynchronous Checkpointing and Recovery cont.
- Asynchronous checkpointing
- Two types of log storage: volatile log and stable
log
- Each processor, after an event, records a triplet
- Its state before the event
- The message
- The set of messages sent by the processor
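The triplet and the two logs might be represented as below. This is a sketch under stated assumptions: `EventRecord`, `Processor`, `on_event`, `flush`, and the `handler` callback are all hypothetical, and moving records from the volatile log to the stable log in an explicit `flush` step is an assumption about how the two logs relate.

```python
from dataclasses import dataclass, field


@dataclass
class EventRecord:
    state_before: dict  # the processor's state before the event
    message: str        # the message that triggered the event
    sent: list          # the messages sent by the processor in response


@dataclass
class Processor:
    state: dict = field(default_factory=dict)
    volatile_log: list = field(default_factory=list)
    stable_log: list = field(default_factory=list)

    def on_event(self, message, handler):
        before = dict(self.state)  # shallow copy; enough for flat state
        # handler is a hypothetical callback: it updates the state and
        # returns the messages the processor sends in response.
        sent = handler(self.state, message)
        self.volatile_log.append(EventRecord(before, message, list(sent)))

    def flush(self):
        # Assumption: volatile records are moved to the stable log in bulk.
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()
```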
36Asynchronous Checkpointing and Recovery cont.
- Recovery
- The basic idea is to find out whether there are
orphan messages if processors roll back to a set
of checkpoints
- The existence of orphan messages is discovered by
comparing the number of messages sent and received
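The counting idea can be sketched as an iterative search for a set of states with no orphan messages. This is a simplified, centralized illustration and not necessarily the algorithm presented in the lecture. It assumes each processor's log is a list of records, oldest first, where each record carries hypothetical per-peer counters `sent` and `rcvd`, and where the first record has all counters at zero.

```python
def find_recovery_line(logs):
    """Return, per processor, the index of the latest orphan-free record.

    logs[p] is a list of records, oldest first. Each record is a dict with
    "sent" and "rcvd", mapping a peer id to the number of messages sent to
    or received from that peer up to that record.
    """
    line = {p: len(records) - 1 for p, records in logs.items()}
    changed = True
    while changed:
        changed = False
        for p in logs:
            for q in logs:
                if p == q:
                    continue
                # p has an orphan if it received more from q than q sent.
                while (line[p] > 0 and
                       logs[p][line[p]]["rcvd"].get(q, 0) >
                       logs[q][line[q]]["sent"].get(p, 0)):
                    line[p] -= 1  # roll p back one record
                    changed = True
    return line
```

A rollback at one processor can lower its sent counts and so force further rollbacks elsewhere, which is why the comparison repeats until nothing changes.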
37Asynchronous Checkpointing and Recovery cont.
38Asynchronous Checkpointing and Recovery cont.
39Asynchronous Checkpointing and Recovery cont.
40Summary
- Recovery attempts to remove errors in the state
- Forward-error recovery
- Backward-error recovery
- Operation-based
- State-based approach
- Synchronous checkpointing and recovery
- Asynchronous checkpointing and recovery
- Next time we will talk about fault tolerance
- Ways to mask failures