Checkpointing and Recovery - PowerPoint PPT Presentation

Slides: 20
Provided by: s10123

Transcript and Presenter's Notes

1
Checkpointing and Recovery
2
Purpose
  • Consider a long running application
  • Regularly checkpoint the application
  • Expensive task
  • In case of failure, restore to the previous
    checkpoint
  • What happens in the case of a distributed application?
  • One (or more) processes fail
  • Restoration to previous checkpoint should be done
    consistently

3
Examples
4
What to Save?
  • Depends on application
  • Could be as simple as just program counter
    information
  • Could be the state of the entire process,
    including messages received, etc.

5
Stable Storage
  • Checkpoints must survive failure of processes
    (including failure during a disk write)
  • A simple approach for stable storage
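
The transcript does not preserve which approach the slide illustrated. As a hedged sketch, one common way to approximate stable storage on a single disk is to write the new checkpoint to a temporary file, force it to disk, and only then rename it over the old one, so a crash during the write leaves the previous checkpoint intact. The function names and the JSON format are assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Replace the checkpoint at `path` with `state` in one atomic step."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the data to disk before renaming
        os.replace(tmp_path, path)  # switch to the new checkpoint
    except BaseException:
        # A failed write must not disturb the previous checkpoint.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default=None):
    """Return the last complete checkpoint, or `default` if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

A long-running loop could call `save_checkpoint` every few iterations and start from `load_checkpoint` after a restart. Surviving the loss of the disk itself would need a second copy on independent storage, which this sketch does not attempt.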

6
Approaches
  • Asynchronous
  • The local checkpoints at different processes are
    taken independently
  • Synchronous
  • The local checkpoints at different processes are
    coordinated
  • They may not be at the same time

7
Asynchronous Checkpointing
  • Problem
  • Domino effect

[Figure: domino effect of rollbacks triggered by a failed process]
8
Other Issues with Asynchronous Checkpointing
  • Useless checkpoints
  • Need for garbage collection
  • Recovery requires significant coordination

9
Asynchronous Checkpointing (Continued)
  • Identify dependency between different checkpoint
    intervals
  • This information is stored along with checkpoints
    in a stable storage
  • When a failed process recovers, it requests this
    information from others to determine the need for
    rollback
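
The rollback decision can be sketched as a fixpoint computation. This is an illustration under stated assumptions, not the procedure of any particular paper: checkpoints are numbered from 0, interval k is the interval that ends with checkpoint k, and the recorded dependency information is a list of `(sender, x, receiver, y)` tuples (a hypothetical format). A process is pushed back whenever it keeps the receipt of a message whose send has been undone.

```python
def recovery_line(candidate, messages):
    """Roll processes back until no orphan message remains.

    candidate: dict process -> checkpoint index it would like to restore
    messages:  iterable of (sender, x, receiver, y); the message was sent
               in interval x of sender and received in interval y of
               receiver, where interval k ends with checkpoint k
    """
    line = dict(candidate)
    changed = True
    while changed:
        changed = False
        for sender, x, receiver, y in messages:
            # The send is undone if the sender restores a checkpoint taken
            # before interval x. The receive is still recorded if the
            # receiver restores checkpoint y or a later one.
            if x > line[sender] and y <= line[receiver]:
                line[receiver] = y - 1  # last checkpoint before the receive
                changed = True
    return line


# Two processes whose messages interleave with their checkpoints.
messages = [
    ("P", 3, "Q", 2),
    ("Q", 2, "P", 2),
    ("P", 2, "Q", 1),
    ("Q", 1, "P", 1),
]
line = recovery_line({"P": 2, "Q": 2}, messages)
```

In this example each rollback undoes a send that forces the other process back one more checkpoint, so both processes end at checkpoint 0, which is the domino effect.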

10
Two Examples of Asynchronous Checkpointing
  • Bhargava and Lian
  • Wang et al

11
Algorithm by Bhargava and Lian
  • Draw an edge from c_{i,x} to c_{j,y} if either
  • i = j and y = x + 1, or
  • i ≠ j and a message m is sent from I_{i,x} and
    received in I_{j,y}
  • Where I_{i,x} is the interval between c_{i,x-1} and
    c_{i,x}
  • Rollback recovery line used for recovery as well
    as garbage collection
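
As a small sketch of the graph construction only, the function below builds the edge set from two rules: an edge between consecutive checkpoints of the same process, and an edge from the checkpoint that ends the sending interval to the checkpoint that ends the receiving interval. The input format is hypothetical, and the sketch assumes every interval mentioned in a message already has its closing checkpoint (for example, by counting the current volatile state as a checkpoint). It does not compute the recovery line.

```python
def checkpoint_graph(num_checkpoints, messages):
    """Return the set of directed edges ((i, x), (j, y)) between checkpoints.

    num_checkpoints: dict process -> number of checkpoints, indexed from 0
    messages: iterable of (i, x, j, y); the message was sent in the interval
              ending at checkpoint x of process i and received in the
              interval ending at checkpoint y of process j
    """
    edges = set()
    # Rule 1: same process, consecutive checkpoints (i = j and y = x + 1).
    for proc, count in num_checkpoints.items():
        for x in range(count - 1):
            edges.add(((proc, x), (proc, x + 1)))
    # Rule 2: a message between two different processes (i != j).
    for i, x, j, y in messages:
        if i != j:
            edges.add(((i, x), (j, y)))
    return edges


graph = checkpoint_graph({"P": 3, "Q": 2}, [("P", 1, "Q", 1)])
```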

12
Algorithm by Wang et al
  • Difference
  • If a message sent from I_{i,x} is received in I_{j,y},
    then draw an edge from c_{i,x-1} to c_{j,y}
  • Recovery line obtained is similar to that by
    Bhargava and Lian
  • Advantage
  • Number of useful checkpoints is at most N(N+1)/2
  • This can be shown by counting the checkpoints
    that are ahead of the recovery line

13
Coordinated Checkpointing
  • Using diffusing computation
  • How can we use diffusing computation to obtain a
    consistent snapshot?

14
Algorithm by Tamir and Sequin
  • Blocking checkpoint
  • A coordinator decides when a checkpoint is taken
  • Coordinator sends a request message to all
  • Each process
  • Stops executing
  • Flushes the channels
  • Takes a tentative checkpoint
  • Replies to coordinator
  • When all processes send replies, the coordinator
    asks them to change it to a permanent checkpoint
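
The steps above can be sketched as a single-threaded simulation. It is an illustration rather than the published algorithm: the class and method names are hypothetical, channel flushing is only marked by a comment, and resuming execution after the permanent checkpoint is an assumption the slide does not state.

```python
class Process:
    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.running = True
        self.tentative = None
        self.permanent = None

    def on_request(self):
        """Handle the coordinator's checkpoint request."""
        self.running = False               # stop executing
        # Flushing the channels is not modelled; a real system would wait
        # here until all in-transit messages have been delivered.
        self.tentative = dict(self.state)  # take a tentative checkpoint
        return True                        # reply to the coordinator

    def on_commit(self):
        """Turn the tentative checkpoint into the permanent one."""
        self.permanent = self.tentative
        self.tentative = None
        self.running = True                # assumption: resume afterwards


def coordinated_checkpoint(processes):
    """Run one blocking checkpoint round; return True if it committed."""
    replies = [p.on_request() for p in processes]
    if not all(replies):
        # Assumption: without a reply from every process, nothing is
        # made permanent and the old checkpoints stay in force.
        return False
    for p in processes:
        p.on_commit()
    return True


procs = [Process("p0", {"n": 1}), Process("p1", {"n": 2})]
committed = coordinated_checkpoint(procs)
```

Keeping the tentative checkpoint separate from the permanent one means a round that fails part-way never overwrites the last consistent set of checkpoints.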

15
Algorithm by Tamir and Sequin
  • How many checkpoints need to be stored per
    process?

16
Checkpointing in Timed Systems
  • What if clocks are perfectly synchronized?

17
Checkpointing in Timed Systems
  • What if clocks are loosely synchronized?
  • Max clock drift, δ, is known
  • All processes take a checkpoint at a fixed
    (local) time
  • After the checkpoint, a process does not send any
    messages for 2δ
  • The set of local checkpoints is guaranteed to be
    consistent
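
One way to see why a silence period of twice the bound is enough is the check below. It is a sketch under assumptions not stated on the slide: the bound (called `delta` here) limits how far each local clock is from real time, clocks run at the same rate during the silence period, and message delay is never negative. Under those assumptions the checkpoints fall within a real-time window of width 2 * delta, so a sender that waits that long cannot deliver a post-checkpoint message before the receiver has checkpointed.

```python
def checkpoints_consistent(offsets, checkpoint_time, silence):
    """Check that no post-checkpoint message can precede a checkpoint.

    offsets:         dict process -> (local clock minus real time)
    checkpoint_time: local time at which every process checkpoints
    silence:         time a process stays silent after its checkpoint
    """
    # Real time at which each process takes its checkpoint.
    real_ckpt = {p: checkpoint_time - off for p, off in offsets.items()}
    # Earliest real time at which each process may send again.
    earliest_send = {p: t + silence for p, t in real_ckpt.items()}
    # With non-negative message delay, it is enough that the earliest
    # possible send is not before the latest checkpoint.
    return min(earliest_send.values()) >= max(real_ckpt.values())


delta = 0.5
offsets = {"a": delta, "b": -delta}  # worst case: clocks 2 * delta apart
safe = checkpoints_consistent(offsets, 100.0, 2 * delta)
unsafe = checkpoints_consistent(offsets, 100.0, delta)
```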

18
Minimal Checkpoint Coordination
  • Approach by Koo and Toueg
  • Require processes to take a checkpoint only if
    they have to
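
A simplified sketch of the idea: starting from the initiator, a process is asked to checkpoint only if some process already in the set has received a message from it since that receiver's last checkpoint. The `received_from` map is a hypothetical input, and the sketch leaves out the message labels and the two-phase commit of the full protocol, so it may select more processes than the actual algorithm would.

```python
from collections import deque


def processes_to_checkpoint(initiator, received_from):
    """Return the set of processes that must checkpoint with `initiator`.

    received_from: dict process -> set of processes it has received
                   messages from since its own last checkpoint
    """
    must = {initiator}
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        # If p's new checkpoint records a receive, the sender must also
        # checkpoint so that the matching send is recorded.
        for sender in received_from.get(p, ()):
            if sender not in must:
                must.add(sender)
                queue.append(sender)
    return must


received_from = {"A": {"B"}, "B": {"C"}, "C": set(), "D": {"A"}}
selected = processes_to_checkpoint("A", received_from)
```

Here D is left alone: it received from A, but nothing that A, B, or C recorded depends on a message sent by D.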

19
Logging Protocols
  • Pessimistic
  • Optimistic
  • Causal