Outline - PowerPoint PPT Presentation

About This Presentation
Title:

Outline

Slides: 41
Provided by: xiuwe
Learn more at: http://www.cs.fsu.edu

Transcript and Presenter's Notes


1
Outline
  • Introduction to Fault Tolerance and Recovery
  • Recovery

2
Fault Tolerance and Recovery
  • An important goal of distributed systems is to
    increase the availability and reliability by
    automatically recovering from partial failures
    without seriously affecting the overall
    performance
  • Closely related to dependable systems
  • Two broad classes of algorithms
  • Recovery
  • When a failure has occurred, recover the system
    from the erroneous state to an error-free state
  • Fault tolerance
  • Failure masking

3
Recovery
  • Recovery in computer systems refers to restoring
    a system (due to failure) to its normal
    operational state

4
Basic Concepts
  • A system consists of a set of hardware and
    software components
  • Each component is designed to provide a set of
    specified services
  • Failure in a system occurs if the system does not
    perform its services in the specified manner
  • An erroneous state of the system is a state which
    could lead to a system failure by a sequence of
    valid state transitions
  • Recovery is a process that involves restoring an
    erroneous state to an error-free state

5
Classification of Failures
6
Classification of Failures cont.
  • Process failure
  • System failure
  • Amnesia failure
  • A partial amnesia failure
  • A pause failure
  • A halting failure
  • Secondary storage failure
  • Communication medium failure

7
Backward and Forward Error Recovery
  • Forward-error recovery
  • If the errors and damages caused by faults can be
    completely and accurately assessed, then those
    errors can be removed and the resulting state is
    error-free
  • Backward-error recovery
  • If the errors and damages cannot be assessed,
    the system can be restored to a previous
    error-free state
  • Problems include performance penalty, recurrence
    of faults, and unrecoverable components

8
Basic Approach in Backward Error Recovery
  • The basic idea is to save some recovery points on
    a stable storage
  • Operation-based approach
  • State-based approach

9
Recovery Stable Storage
10
The Operation-based Approach
  • In this approach, all the modifications that are
    made to the state of a process are recorded so
    that a previous state can be restored by
    reversing all the changes made to the state
  • The record of the system activity is known as an
    audit trail or log

11
The Operation-based Approach cont.
  • Updating-in-place
  • Every update operation on an object updates the
    object and writes a log record to stable storage
    with enough information to completely undo and
    redo the operation
  • The information needs to include the name of the
    object, the old state of the object, and the new
    state of the object
  • It can be implemented as a collection of
    operations
  • Do, undo, redo, and display operations
  • The major problem is that a do operation cannot
    be undone if the system crashes after the update
    operation but before the log record is stored
12
The Operation-based Approach cont.
  • The write-ahead-log protocol
  • To overcome the problem in the updating-in-place
    scheme, a recoverable operation is implemented by
    the following operations
  • Update an object only after the undo log is
    recorded
  • Before committing the updates, redo and undo logs
    are recorded
  • Problems with the operation-based approach
  • Writing a log on every update operation is
    expensive in terms of storage requirements and
    CPU overhead
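
A rough sketch of the write-ahead ordering described above, under simplifying assumptions: `WalStore`, the tuple-shaped log records, and the transaction labels are all hypothetical, and a Python list stands in for stable storage.

```python
class WalStore:
    """Hypothetical store that logs before it updates."""

    def __init__(self):
        self.objects = {}
        self.log = []  # assumption: a list stands in for stable storage

    def update(self, txn, name, value):
        old = self.objects.get(name)
        self.log.append(("undo", txn, name, old))    # undo record is written first
        self.log.append(("redo", txn, name, value))  # redo record precedes commit
        self.objects[name] = value                   # only now is the object updated

    def commit(self, txn):
        self.log.append(("commit", txn))

    def recover(self):
        """Undo every update whose transaction has no commit record."""
        committed = {rec[1] for rec in self.log if rec[0] == "commit"}
        for rec in reversed(self.log):
            if rec[0] == "undo" and rec[1] not in committed:
                _, _, name, old = rec
                self.objects[name] = old


wal = WalStore()
wal.update("t1", "x", 1)
wal.commit("t1")
wal.update("t2", "x", 5)  # t2 never commits
wal.update("t2", "y", 7)
wal.recover()             # simulate recovery after a crash
```

Because the undo record always exists before the object changes, recovery can reverse every uncommitted update, which is exactly what updating-in-place could not guarantee.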

13
The State-based Approach
  • In this approach, the complete state of a process
    is saved when a recovery point is established
  • Recovering involves reinstating its saved state
    and resuming the execution of the process from
    that state
  • The process of saving state is referred to as
    checkpointing or taking a checkpoint
  • A recovery point at which checkpointing occurs is
    referred to as a checkpoint
  • The process of restoring a process to a prior
    state is referred to as rolling back the
    process
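
As an illustration of checkpointing and rolling back, here is a small sketch. The `Process` class and its `counter` field are made up for the example, and a list of deep copies stands in for stable storage.

```python
import copy


class Process:
    """Hypothetical process whose complete state is a dictionary."""

    def __init__(self):
        self.state = {"counter": 0}
        self.checkpoints = []  # assumption: a list stands in for stable storage

    def take_checkpoint(self):
        # Save the complete state; deepcopy so that later updates do not
        # leak into the saved copy.
        self.checkpoints.append(copy.deepcopy(self.state))

    def roll_back(self):
        # Reinstate the most recent saved state.
        self.state = copy.deepcopy(self.checkpoints[-1])


p = Process()
p.state["counter"] = 10
p.take_checkpoint()
p.state["counter"] = 99  # pretend this update left the state erroneous
p.roll_back()            # execution would resume from the checkpoint
```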

14
The State-based Approach cont.
  • Shadow pages
  • Whenever a process wants to modify an object, the
    pages containing the object are duplicated and
    maintained on a stable storage
  • Only one of the copies undergoes all the
    modifications done by the process
  • The other unmodified copy is known as the shadow
    pages
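
The shadow-page idea can be sketched as below. The slide does not say what happens when the modifications finish or fail, so the `commit` and `abort` behavior here follows the usual shadow-paging convention and should be read as an assumption; the class name is hypothetical.

```python
class ShadowPagedObject:
    """Hypothetical object stored as a list of pages."""

    def __init__(self, pages):
        self.shadow = list(pages)  # the unmodified copy (shadow pages)
        self.current = None        # the copy that undergoes modifications

    def begin(self):
        self.current = list(self.shadow)  # duplicate the pages

    def write(self, index, data):
        self.current[index] = data  # only the working copy is changed

    def commit(self):
        # Assumption: on success the modified copy replaces the shadow.
        self.shadow = self.current
        self.current = None

    def abort(self):
        # Assumption: on failure the modified copy is simply discarded.
        self.current = None


obj = ShadowPagedObject(["a", "b"])
obj.begin()
obj.write(0, "A")
obj.abort()              # shadow pages are untouched
after_abort = list(obj.shadow)
obj.begin()
obj.write(1, "B")
obj.commit()             # modified pages become the new shadow
```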

15
Issues in Recovery in Concurrent Systems
  • How and when to do checkpointing
  • Independently or collectively
  • Asynchronous vs. synchronous checkpointing
  • Recovery
  • Which checkpoints should be used when processes
    roll back?
  • Strongly consistent checkpoints
  • Consistent checkpoints

16
Orphan Messages and Domino Effect
17
Lost Messages
18
Livelocks
19
Consistent Set of Checkpoints
20
A Simple Method for Taking a Consistent Set of Checkpoints
  • Assume that the action of taking a checkpoint and
    the action of sending or receiving a message are
    indivisible
  • If every process takes a checkpoint after sending
    every message, the set of most recent checkpoints
    is always consistent
  • It is not strongly consistent
  • However, taking a checkpoint after each message
    is expensive
  • If we take a checkpoint after every K (> 1)
    messages, the method suffers from the domino
    effect

21
Synchronous Checkpointing
  • The processes are coordinated so that the set of
    all recent checkpoints is guaranteed to be
    consistent
  • Assumptions
  • Processes communicate by exchanging messages
  • Communication channels are FIFO
  • Communication failures do not partition the
    network
  • Two kinds of checkpoints
  • Permanent checkpoints
  • Tentative checkpoints
  • Two phases in checkpointing

22
Synchronous Checkpointing cont.
  • First phase
  • An initiating process Pi takes a tentative
    checkpoint and requests all other processes to
    take tentative checkpoints
  • Each process informs Pi whether it succeeded in
    taking a tentative checkpoint
  • If Pi learns that all the processes have
    successfully taken tentative checkpoints, Pi
    decides that all checkpoints should be made
    permanent; otherwise, Pi decides that all the
    tentative checkpoints should be discarded

23
Synchronous Checkpointing cont.
  • Second phase
  • Pi informs all the processes of its decision
  • A process, on receiving the message from Pi, acts
    accordingly
  • Therefore, either all the processes take the
    permanent checkpoints or none
  • No process may send messages related to the
    computation after taking its tentative
    checkpoint and before receiving the decision
    from Pi
  • Correctness
  • A set of permanent checkpoints taken by the
    algorithm is consistent
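
The two phases can be sketched in a single address space as follows. This is only an illustration: real processes would exchange messages over a network, and the names `Proc`, `can_checkpoint`, and `coordinated_checkpoint` are hypothetical.

```python
class Proc:
    """Hypothetical process taking part in coordinated checkpointing."""

    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint  # models success or failure in phase 1
        self.state = 0
        self.tentative = None
        self.permanent = None
        self.sending_blocked = False

    def take_tentative(self):
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        # No computation messages may be sent until the decision arrives.
        self.sending_blocked = True
        return True

    def decide(self, make_permanent):
        if make_permanent:
            self.permanent = self.tentative
        self.tentative = None  # tentative checkpoint is promoted or discarded
        self.sending_blocked = False


def coordinated_checkpoint(initiator, others):
    everyone = [initiator] + list(others)
    # Phase 1: every process tries to take a tentative checkpoint and replies.
    replies = [p.take_tentative() for p in everyone]
    decision = all(replies)
    # Phase 2: the initiator's decision is propagated to all processes.
    for p in everyone:
        p.decide(decision)
    return decision


a, b, c = Proc("a"), Proc("b"), Proc("c", can_checkpoint=False)
a.state, b.state, c.state = 5, 6, 7
first = coordinated_checkpoint(a, [b, c])   # c fails, so nothing is made permanent
c.can_checkpoint = True
second = coordinated_checkpoint(a, [b, c])  # all succeed, so all become permanent
```

Either every process ends up with a permanent checkpoint or none does, which mirrors the all-or-none property stated on the slide.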

24
Synchronous Checkpointing cont.
25
Synchronous Checkpointing cont.
Condition for optimization
26
Synchronous Checkpointing cont.
27
Synchronous Checkpointing cont.
28
Synchronous Checkpointing cont.
(continued)
29
The Rollback Recovery Algorithm
  • Assumptions
  • A single process invokes the algorithm
  • Rollback recovery and checkpointing are not
    concurrently invoked
  • First phase
  • A process Pi checks to see if all the processes
    are willing to restart from their previous
    checkpoints
  • If all processes reply yes, Pi decides to
    restart; otherwise, Pi decides not to restart
  • Second phase
  • Pi propagates its decision to all the processes
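
A compact sketch of the two rollback phases, with the same caveats as any single-machine illustration: `RecoveringProc`, its `willing` flag, and `rollback_recovery` are hypothetical names, and message passing is replaced by direct method calls.

```python
class RecoveringProc:
    """Hypothetical process that can restart from its previous checkpoint."""

    def __init__(self, checkpoint, state, willing=True):
        self.checkpoint = checkpoint
        self.state = state
        self.willing = willing  # models the yes/no reply in the first phase

    def ready_to_restart(self):
        return self.willing

    def apply(self, restart):
        if restart:
            self.state = self.checkpoint


def rollback_recovery(processes):
    # Phase 1: the initiator asks whether every process is willing to restart.
    restart = all(p.ready_to_restart() for p in processes)
    # Phase 2: the decision is propagated to all processes.
    for p in processes:
        p.apply(restart)
    return restart


procs = [RecoveringProc(1, 10), RecoveringProc(2, 20, willing=False)]
refused = rollback_recovery(procs)   # one process says no: nobody restarts
procs[1].willing = True
accepted = rollback_recovery(procs)  # all say yes: everybody restarts
```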

30
The Rollback Recovery Algorithm cont.
31
The Rollback Recovery Algorithm cont.
32
The Rollback Recovery Algorithm cont.
33
The Rollback Recovery Algorithm cont.
(continued)
34
Asynchronous Checkpointing and Recovery
  • In asynchronous checkpointing and recovery,
    checkpoints are taken independently
  • A recovery algorithm has to search for the most
    recent consistent set of checkpoints before it
    can initiate recovery

35
Asynchronous Checkpointing and Recovery cont.
  • Asynchronous checkpointing
  • Two types of log storage: volatile log and stable
    log
  • Each processor, after an event, records a triplet
  • Its state before the event
  • The message
  • The set of messages sent by the processor

36
Asynchronous Checkpointing and Recovery cont.
  • Recovery
  • The basic idea is to find out whether there are
    orphan messages if processors roll back to a set
    of checkpoints
  • The existence of orphan messages is discovered by
    comparing the number of messages sent and received
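
The comparison of sent and received counts might look like the sketch below. It assumes FIFO channels, so that counts alone identify which messages are affected, and it assumes each checkpoint records per-peer counters; the function name and the dictionary layout are hypothetical.

```python
def find_orphans(sent, received):
    """Return (sender, receiver, excess) for every channel with orphan messages.

    sent[i][j]     -- messages i had sent to j at i's checkpoint
    received[j][i] -- messages j had received from i at j's checkpoint
    """
    orphans = []
    for receiver, from_counts in received.items():
        for sender, n_received in from_counts.items():
            n_sent = sent.get(sender, {}).get(receiver, 0)
            # More receipts recorded than sends: the extra messages are
            # orphans if both processes roll back to these checkpoints.
            if n_received > n_sent:
                orphans.append((sender, receiver, n_received - n_sent))
    return orphans


# P's checkpoint records 2 messages sent to Q, but Q's checkpoint
# records 3 messages received from P.
sent = {"P": {"Q": 2}, "Q": {"P": 1}}
received = {"Q": {"P": 3}, "P": {"Q": 1}}
orphans = find_orphans(sent, received)
```

An empty result means the candidate set of checkpoints has no orphan messages; otherwise the receiver would need to roll back further.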

37
Asynchronous Checkpointing and Recovery cont.
38
Asynchronous Checkpointing and Recovery cont.
39
Asynchronous Checkpointing and Recovery cont.
40
Summary
  • Recovery attempts to remove errors in the state
  • Forward-error recovery
  • Backward-error recovery
  • Operation-based
  • State-based approach
  • Synchronous checkpointing and recovery
  • Asynchronous checkpointing and recovery
  • Next time we will talk about fault tolerance
  • Ways to mask failures