Outline - PowerPoint PPT Presentation

About This Presentation

Title:

Outline

Slides: 41

Provided by: xiuwe

Learn more at: http://www.cs.fsu.edu

Transcript and Presenter's Notes

1
Outline

Introduction to Fault Tolerance and Recovery
Recovery

2
Fault Tolerance and Recovery

An important goal of distributed systems is to
increase the availability and reliability by
automatically recovering from partial failures
without seriously affecting the overall
performance
Closely related to dependable systems
Two broad classes of algorithms
Recovery
When a failure has occurred, recover the system
from the erroneous state to an error-free state
Fault tolerance
Failure masking

3
Recovery

Recovery in computer systems refers to restoring
a system to its normal operational state after a
failure

4
Basic Concepts

A system consists of a set of hardware and
software components
Each component is designed to provide a set of
specified services
Failure in a system occurs if the system does not
perform its services in the specified manner
An erroneous state of the system is a state which
could lead to a system failure by a sequence of
valid state transitions
Recovery is a process that involves restoring an
erroneous state to an error-free state

5
Classification of Failures
6
Classification of Failures cont.

Process failure
System failure
Amnesia failure
Partial amnesia failure
Pause failure
Halting failure
Secondary storage failure
Communication medium failure

7
Backward and Forward Error Recovery

Forward-error recovery
If the errors and damages caused by faults can be
completely and accurately assessed, then those
errors can be removed and the resulting state is
error-free
Backward-error recovery
If the errors and damages cannot be assessed,
the system can be restored to a previous
error-free state
Problems include performance penalty, recurrence
of faults, and unrecoverable components

8
Basic Approach in Backward Error Recovery

The basic idea is to save some recovery points on
a stable storage
Operation-based approach
State-based approach

9
Recovery Stable Storage
10
The Operation-based Approach

In this approach, all the modifications that are
made to the state of a process are recorded so
that a previous state can be restored by
reversing all the changes made to the state
The record of the system activity is known as an
audit trail or log

11
The Operation-based Approach cont.

Updating-in-place
Every update operation on an object updates the
object and writes a log record to stable storage
with enough information to completely undo and
redo the operation
The information needs to include the name of the
object, the old state of the object, and the new
state of the object
It can be implemented as a collection of
operations
Do, undo, redo, and display operations
The major problem is that a do operation cannot
be undone if the system crashes after the update
operation but before the log record is stored

12
The Operation-based Approach cont.

The write-ahead-log protocol
To overcome the problem in the updating-in-place
scheme, a recoverable operation is implemented by
the following rules
Update an object only after the undo log is
recorded
Before committing the updates, redo and undo logs
are recorded
Problems with operation-based approach
Writing a log on every update operation is
expensive in terms of storage requirement and CPU
overhead
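
The logging rules above can be sketched as a toy in-memory store. This is only an illustration under stated assumptions: the class and method names are hypothetical, a Python list stands in for the log on stable storage, and `None` is used to mean "the object did not exist yet".

```python
class RecoverableStore:
    """Toy object store with a write-ahead log (all names are illustrative)."""

    def __init__(self):
        self.objects = {}  # volatile object state
        self.log = []      # assumption: stands in for the log on stable storage

    def do(self, name, new_value):
        # Write-ahead rule: record (object name, old state, new state)
        # BEFORE the object itself is updated.
        old_value = self.objects.get(name)  # None means "did not exist"
        self.log.append((name, old_value, new_value))
        self.objects[name] = new_value

    def undo_all(self):
        # Walk the log backwards, restoring each old state.
        for name, old_value, _new in reversed(self.log):
            if old_value is None:
                self.objects.pop(name, None)
            else:
                self.objects[name] = old_value

    def redo_all(self):
        # Replay the log forwards, reapplying each new state.
        for name, _old, new_value in self.log:
            self.objects[name] = new_value


store = RecoverableStore()
store.do("x", 1)
store.do("x", 2)
store.undo_all()
state_after_undo = dict(store.objects)
store.redo_all()
state_after_redo = dict(store.objects)
```

Because the log record is appended before the assignment, a crash between the two steps leaves a record that can still be undone, which is the case the updating-in-place scheme cannot handle.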

13
The State-based Approach

In this approach, the complete state of a process
is saved when a recovery point is established
Recovering a process involves reinstating its
saved state and resuming the execution of the
process from that state
The process of saving state is referred to as
checkpointing or taking a checkpoint
A recovery point at which checkpointing occurs is
referred to as a checkpoint
The process of restoring a process to a prior
state is referred to as rolling back the process
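
A minimal sketch of checkpointing and rolling back a single process, assuming its whole state fits in one dictionary. The class name and the `_checkpoint` attribute (which stands in for stable storage) are hypothetical.

```python
import copy


class CheckpointedProcess:
    """Toy process whose complete state is saved at a recovery point."""

    def __init__(self):
        self.state = {"counter": 0}
        self._checkpoint = None  # assumption: stands in for stable storage

    def take_checkpoint(self):
        # Save a deep copy so later updates cannot alter the saved state.
        self._checkpoint = copy.deepcopy(self.state)

    def roll_back(self):
        # Reinstate the saved state; execution would resume from here.
        if self._checkpoint is None:
            raise RuntimeError("no checkpoint has been taken")
        self.state = copy.deepcopy(self._checkpoint)


proc = CheckpointedProcess()
proc.state["counter"] = 5
proc.take_checkpoint()
proc.state["counter"] = 99  # work done after the checkpoint
proc.roll_back()            # that work is discarded
```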

14
The State-based Approach cont.

Shadow pages
Whenever a process wants to modify an object, the
pages containing the object are duplicated and
maintained on a stable storage
Only one of the copies undergoes all the
modifications done by the process
The other unmodified copy is known as the shadow
pages
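
The shadow-page idea can be sketched as follows. The slide only describes duplicating the pages; the `commit` and `abort` steps here are an assumption about how the two copies would typically be reconciled, and all names are hypothetical.

```python
class ShadowPagedObject:
    """Toy object stored as a list of pages, with a shadow copy."""

    def __init__(self, pages):
        self.shadow = list(pages)  # unmodified copy (the shadow pages)
        self.current = None        # working copy, created on first write

    def write(self, index, data):
        # Duplicate the pages before the first modification.
        if self.current is None:
            self.current = list(self.shadow)
        self.current[index] = data

    def read(self, index):
        pages = self.current if self.current is not None else self.shadow
        return pages[index]

    def commit(self):
        # Assumption: on success the modified copy replaces the shadow.
        if self.current is not None:
            self.shadow = self.current
            self.current = None

    def abort(self):
        # Assumption: on failure the modified copy is simply discarded.
        self.current = None


obj = ShadowPagedObject(["a", "b"])
obj.write(0, "A")
obj.abort()
after_abort = obj.read(0)
obj.write(1, "B")
obj.commit()
after_commit = [obj.read(0), obj.read(1)]
```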

15
Issues in Recovery in Concurrent Systems

How and when to do checkpointing
Independently or collectively
Asynchronous vs. synchronous checkpointing
Recovery
Which checkpoints should be used when processes
roll back?
Strongly consistent checkpoints
Consistent checkpoints

16
Orphan Messages and Domino Effect
17
Lost Messages
18
Livelocks
19
Consistent Set of Checkpoints
20
A Simple Method for Taking a Consistent Set of
Checkpoints

Assume that the action of taking a checkpoint and
the action of sending or receiving a message are
indivisible
If every process takes a checkpoint after sending
every message, the set of most recent checkpoints
is always consistent
It is not strongly consistent
However, taking a checkpoint after each message
is expensive
If we take a checkpoint after every K (K > 1)
messages, the method suffers from the domino
effect

21
Synchronous Checkpointing

The processes are coordinated so that the set of
all recent checkpoints is guaranteed to be
consistent
Assumptions
Processes communicate by exchanging messages
Communication channels are FIFO
Communication failures do not partition the
network
Two kinds of checkpoints
Permanent checkpoints
Tentative checkpoints
Two phases in checkpointing

22
Synchronous Checkpointing cont.

First phase
An initiating process Pi takes a tentative
checkpoint and requests all other processes to
take tentative checkpoints
Each process informs Pi whether it succeeded in
taking a tentative checkpoint
If Pi learns that all the processes have
successfully taken tentative checkpoints, Pi
decides that all checkpoints should be made
permanent; otherwise, Pi decides that all the
tentative checkpoints should be discarded

23
Synchronous Checkpointing cont.

Second phase
Pi informs all the processes of its decision
A process, on receiving the message from Pi, will
act accordingly
Therefore, either all the processes take the
permanent checkpoints or none do
A process must not send messages related to the
computation after taking its tentative checkpoint
and before receiving the decision from Pi
Correctness
A set of permanent checkpoints taken by the
algorithm is consistent
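
The two phases can be sketched in a single-threaded simulation. This is a simplified illustration, not the full algorithm: message passing is replaced by direct method calls, the rule about not sending computation messages between the phases is not modelled, and the `Process` class and `can_checkpoint` flag are hypothetical.

```python
class Process:
    """Toy process holding a tentative and a permanent checkpoint."""

    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint  # hypothetical failure switch
        self.state = 0
        self.tentative = None
        self.permanent = None

    def take_tentative(self):
        # Phase 1: try to take a tentative checkpoint, report success.
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        return True

    def finish(self, make_permanent):
        # Phase 2: act on the initiator's decision.
        if make_permanent and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None


def coordinated_checkpoint(initiator, others):
    processes = [initiator] + list(others)
    # Phase 1: every process attempts a tentative checkpoint.
    replies = [p.take_tentative() for p in processes]
    decision = all(replies)
    # Phase 2: the initiator propagates one decision to everyone.
    for p in processes:
        p.finish(decision)
    return decision


a, b = Process("a"), Process("b")
c = Process("c", can_checkpoint=False)
a.state, b.state = 1, 2
ok_all = coordinated_checkpoint(a, [b])      # both succeed
a.state = 10
ok_some = coordinated_checkpoint(a, [b, c])  # c fails, so all discard
```

In the second run the earlier permanent checkpoints of `a` and `b` are left untouched, which illustrates the all-or-none property.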

24
Synchronous Checkpointing cont.
25
Synchronous Checkpointing cont.
Condition for optimization
26
Synchronous Checkpointing cont.
27
Synchronous Checkpointing cont.
28
Synchronous Checkpointing cont.
29
The Rollback Recovery Algorithm

Assumptions
A single process invokes the algorithm
Rollback recovery and checkpointing are not
concurrently invoked
First phase
A process Pi checks whether all the processes are
willing to restart from their previous
checkpoints
If all processes reply yes, Pi decides to
restart; otherwise, Pi decides not to restart
Second phase
Pi propagates its decision to all the processes
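
A minimal sketch of the two-phase rollback decision, with message exchange replaced by a loop over a dictionary. The dictionary keys (`state`, `checkpoint`, `willing`) are hypothetical names chosen for the illustration.

```python
def rollback_recovery(processes):
    """processes maps a name to {"state", "checkpoint", "willing"}."""
    # Phase 1: the initiator asks whether every process is willing
    # to restart from its previous checkpoint.
    decision = all(p["willing"] for p in processes.values())
    # Phase 2: the decision is propagated; on "yes" everyone restarts.
    if decision:
        for p in processes.values():
            p["state"] = p["checkpoint"]
    return decision


willing_group = {
    "P1": {"state": 7, "checkpoint": 3, "willing": True},
    "P2": {"state": 9, "checkpoint": 4, "willing": True},
}
mixed_group = {
    "P1": {"state": 7, "checkpoint": 3, "willing": True},
    "P2": {"state": 9, "checkpoint": 4, "willing": False},
}
restarted = rollback_recovery(willing_group)
not_restarted = rollback_recovery(mixed_group)
```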

30
The Rollback Recovery Algorithm cont.
31
The Rollback Recovery Algorithm cont.
32
The Rollback Recovery Algorithm cont.
33
The Rollback Recovery Algorithm cont.
34
Asynchronous Checkpointing and Recovery

In asynchronous checkpointing and recovery,
checkpoints are taken independently
A recovery algorithm has to search for the most
recent consistent set of checkpoints before it
can initiate recovery

35
Asynchronous Checkpointing and Recovery cont.

Asynchronous checkpointing
Two types of log storage: volatile log and stable
log
Each processor, after an event, records a triplet
Its state before the event
The message
The set of messages sent by the processor
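
The triplet logging can be sketched as below. The event handler (adding the message to an integer state and sending the new state) is made up for the illustration, and the `flush` step that moves records from the volatile log to the stable log is an assumption about how the two logs relate.

```python
from dataclasses import dataclass, field


@dataclass
class EventLoggingProcessor:
    state: int = 0
    volatile_log: list = field(default_factory=list)
    stable_log: list = field(default_factory=list)

    def handle(self, message):
        state_before = self.state
        self.state += message  # hypothetical effect of the event
        sent = (self.state,)   # hypothetical messages sent in response
        # Record the triplet: state before the event, the message,
        # and the set of messages sent by the processor.
        self.volatile_log.append((state_before, message, sent))
        return sent

    def flush(self):
        # Assumption: the volatile log is moved to the stable log.
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()


p = EventLoggingProcessor()
p.handle(3)
p.handle(4)
logged = list(p.volatile_log)
p.flush()
```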

36
Asynchronous Checkpointing and Recovery cont.

Recovery
The basic idea is to find out whether there are
orphan messages if processors roll back to a set
of checkpoints
The existence of orphan messages is discovered by
comparing the number of messages sent and received
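
The counting check can be sketched as follows, assuming each checkpoint records per-channel counts of messages sent and received. The function name and the nested-dictionary layout are hypothetical. A message is an orphan when the receiver's checkpoint records it as received but the sender's checkpoint does not record it as sent.

```python
def find_orphans(sent, received):
    """Return (sender, receiver, count) for each channel with orphans.

    sent[j][i]     = messages j had sent to i at j's checkpoint
    received[i][j] = messages i had received from j at i's checkpoint
    """
    orphans = []
    for i, from_counts in received.items():
        for j, n_received in from_counts.items():
            n_sent = sent.get(j, {}).get(i, 0)
            # More received than sent: the extra messages are orphans.
            if n_received > n_sent:
                orphans.append((j, i, n_received - n_sent))
    return orphans


sent = {"X": {"Y": 2}, "Y": {"X": 1}}
received = {"Y": {"X": 3}, "X": {"Y": 1}}
orphans = find_orphans(sent, received)
```

Here Y's checkpoint records three messages from X while X's checkpoint records only two sent, so one message on the X to Y channel is an orphan and Y would have to roll back further.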

37
Asynchronous Checkpointing and Recovery cont.
38
Asynchronous Checkpointing and Recovery cont.
39
Asynchronous Checkpointing and Recovery cont.
40
Summary