Title: 12. Recovery
112. Recovery
- Study Meeting
- M1 Yuuki Horita
- 2004/5/14
2Contents
- Introduction
- Recovery
- Checkpointing
- Difficulty of Checkpointing
- Synchronous checkpointing / recovery
- (Asynchronous checkpointing / recovery)
3Introduction
- Long computation in distributed environments
- High failure rate
- Host failure (a lot of hosts)
- Network failure
- One failure may disturb entire computation
- → Need to start it again from the beginning
- High cost
- Why don't we utilize the previous computation?
Recovery
4Recovery is not easy
- Suppose that a parallel computation is running on distributed resources
for (i = 0; i < MAXITER; i++) {
    local_compute();          // compute at each host
    global_state_exchange();  // communicate with neighbors
}
- Need to save process states periodically
- Usually the other processes also have to be restored to a previous state → overhead
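A minimal sketch of periodic state saving for the loop above, written in Python. The names `CHECKPOINT_EVERY`, `save_checkpoint`, `load_checkpoint` and the `fail_at` switch are assumptions made for illustration, and the exchange with neighbors is left out. It only shows a single process resuming from its last saved state instead of from the beginning.

```python
import os
import pickle
import tempfile

MAXITER = 10
CHECKPOINT_EVERY = 3  # hypothetical checkpoint interval, in iterations


def local_compute(state):
    """Stand-in for the per-host computation: accumulate the iteration index."""
    state["total"] += state["i"]
    state["i"] += 1


def save_checkpoint(path, state):
    """Write the state to a temporary file, then atomically replace the old checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"i": 0, "total": 0}


def run(path, fail_at=None):
    """Run the loop from the last checkpoint; fail_at simulates a host failure."""
    state = load_checkpoint(path)
    while state["i"] < MAXITER:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated host failure")
        local_compute(state)
        if state["i"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(path, state)
    return state["total"]


def demo():
    """Crash before iteration 7, then resume from the last saved checkpoint."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "ckpt.pkl")
        try:
            run(path, fail_at=7)
        except RuntimeError:
            pass
        resumed_from = load_checkpoint(path)["i"]
        return resumed_from, run(path)
```

In `demo()`, the work done between the last checkpoint and the failure is repeated, which is the cost of a longer checkpoint interval.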
5Recovery
6Back/Forward Error Recovery
- Forward-error recovery
- Only when it is possible to remove errors
- Enable processes to move forward
- Ex) Redundancy, vote
- Backward-error recovery
- General
- Restore to a previous error-free state
- Ex) Checkpoint
7Backward-error recovery
- operation-based approach
- Record all modifications of a process state
- state-based approach
- Record complete state at certain point
8State-based approach
- Terminology
- checkpointing: the process of saving state
- checkpoint: the recovery point at which checkpointing occurs
- rolling back: the process of restoring a process to a prior state
9Checkpointing
10Problem of naïve checkpointing
- Orphan Messages and the Domino Effect
- Orphan message: a message that makes the state inconsistent
- Domino effect: a single rollback induces other rollbacks
- Lost Messages
- Livelocks
11Orphan message and Domino Effect
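[Figure: time lines of processes X (checkpoints x1, x2, x3), Y (y1, y2) and Z (z1, z2). Y rolls back. Orphan message: Y has not sent it yet, but X has already received it. The rollback then spreads to other processes (domino effect).]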
12Lost messages
[Figure: time lines of processes X (checkpoints x1, x2, x3), Y (y1, y2) and Z (z1, z2). Y rolls back. Lost message: X has sent it, but Y will never receive it.]
13Livelocks
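[Figure: time lines of processes X (checkpoint x1) and Y (checkpoint y1) exchanging messages m1, m2, n1 and n2; n1 is shown twice.]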
14Consistency of Checkpoint
- Strongly consistent set of checkpoints
- no messages penetrating the set
- Consistent set of checkpoints
- no messages penetrating the set backward
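[Figure: checkpoints x1, x2, y1, y2, z1, z2 on three time lines. One set of checkpoints is marked strongly consistent and another is marked consistent; for the consistent set we need to deal with lost messages.]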
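The two definitions above can be checked mechanically. The sketch below is an illustration only: it assumes each process has a local clock, a checkpoint is given as a local time, and a message is a tuple `(sender, send_time, receiver, recv_time)`. The function name `classify` and this representation are assumptions, not part of the slides.

```python
def classify(checkpoints, messages):
    """Classify a set of checkpoints.

    checkpoints: dict mapping process -> local time of its checkpoint
    messages: list of (sender, send_time, receiver, recv_time),
              with times local to the sender and the receiver respectively
    """
    orphan = lost = False
    for sender, send_time, receiver, recv_time in messages:
        sent_before = send_time <= checkpoints[sender]
        rcvd_before = recv_time <= checkpoints[receiver]
        if rcvd_before and not sent_before:
            # Recorded as received but not as sent: crosses the set backward.
            orphan = True
        elif sent_before and not rcvd_before:
            # Recorded as sent but not as received: crosses the set forward.
            lost = True
    if orphan:
        return "inconsistent"
    return "consistent" if lost else "strongly consistent"
```

A message that crosses the set forward does not break consistency, but it has to be replayed or resent after a rollback, which is why the consistent case still needs to deal with lost messages.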
15Checkpoint/Recovery Algorithm
- Synchronous
- with global synchronization at checkpointing
- Asynchronous
- without global synchronization at checkpointing
16Preliminary (Assumption)
Synchronous Checkpoint
- Goal
- To make a consistent global checkpoint
- Assumptions
- Communication channels are FIFO
- No partition of the network
- End-to-end protocols cope with message loss due to rollback recovery and communication failure
- No failure during the execution of the algorithm
17Preliminary (Two types of checkpoint)
Synchronous Checkpoint
- tentative checkpoint
- a temporary checkpoint
- a candidate for permanent checkpoint
- permanent checkpoint
- a local checkpoint at a process
- a part of a consistent global checkpoint
18Checkpoint Algorithm
Synchronous Checkpoint
- Algorithm
1. An initiating process (a single process that invokes this algorithm) takes a tentative checkpoint.
2. It requests all the processes to take tentative checkpoints.
3. It waits to hear from all the processes whether taking a tentative checkpoint has succeeded.
4. If it learns that all the processes have succeeded, it decides that all tentative checkpoints should be made permanent; otherwise, they should be discarded.
5. It informs all the processes of the decision.
6. The processes that receive the decision act accordingly.
- Supplement
- Once a process has taken a tentative checkpoint, it shouldn't send messages until it is informed of the initiator's decision.
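The steps above follow a two-phase pattern, which can be sketched as a single-threaded Python simulation. This is an illustration under simplifying assumptions: message passing is replaced by direct method calls, and the class `Process`, the flag `can_checkpoint` and the function `synchronous_checkpoint` are hypothetical names.

```python
class Process:
    def __init__(self, name, can_checkpoint=True):
        self.name = name
        self.can_checkpoint = can_checkpoint  # hypothetical switch to simulate a failed attempt
        self.state = 0
        self.tentative = None
        self.permanent = None
        self.may_send = True

    def take_tentative(self):
        """Phase 1: save a tentative checkpoint and stop sending messages."""
        if not self.can_checkpoint:
            return False
        self.tentative = self.state
        self.may_send = False
        return True

    def decide(self, commit):
        """Phase 2: make the tentative checkpoint permanent, or discard it."""
        if commit and self.tentative is not None:
            self.permanent = self.tentative
        self.tentative = None
        self.may_send = True


def synchronous_checkpoint(initiator, others):
    """Run both phases and return the initiator's decision."""
    ok = initiator.take_tentative()
    replies = [p.take_tentative() for p in others]
    commit = ok and all(replies)
    for p in [initiator, *others]:
        p.decide(commit)
    return commit
```

Blocking sends between the two phases (`may_send = False`) is what keeps a message from being sent after one checkpoint and received before another.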
19Diagram of Checkpoint Algorithm
Synchronous Checkpoint
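[Figure: the initiator takes a tentative checkpoint and sends requests to take a tentative checkpoint; the other processes reply OK; the initiator decides to commit and the tentative checkpoints become permanent checkpoints. Other labels in the figure: consistent global checkpoint (twice), unnecessary checkpoint.]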
20Optimized Algorithm
Synchronous Checkpoint
- Each message is labeled by order of sending
- Labeling Scheme
- ⊥: smallest label
- ⊤: largest label
- last_label_rcvd_X[Y]: the label of the last message that X received from Y after X took its last permanent or tentative checkpoint. If no such message exists, it is ⊥.
- first_label_sent_X[Y]: the label of the first message that X sent to Y after X took its last permanent or tentative checkpoint. If no such message exists, it is ⊥.
- ckpt_cohort_X: the set of all processes that may have to take checkpoints when X decides to take a checkpoint.
[Figure: time lines of processes X and Y with checkpoints x2, x3 and y1, y2.]
A checkpoint request needs to be sent only to the processes included in ckpt_cohort
21Optimized Algorithm
Synchronous Checkpoint
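- ckpt_cohort_X = { Y | last_label_rcvd_X[Y] > ⊥ }
- Y takes a tentative checkpoint only if
- last_label_rcvd_X[Y] ≥ first_label_sent_Y[X] > ⊥
[Figure: a message between Y and X, with first_label_sent_Y[X] marked on Y's time line and last_label_rcvd_X[Y] marked on X's time line.]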
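The cohort and the checkpoint condition can be written as two small Python functions. This is a sketch: representing ⊥ by `0` with real labels starting at 1 is an assumption, as are the function names.

```python
BOTTOM = 0  # assumption: stands for the smallest label; real labels start at 1


def ckpt_cohort(last_label_rcvd):
    """Processes X has received a message from since its last checkpoint.

    last_label_rcvd: dict mapping Y -> last_label_rcvd_X[Y]
    """
    return {y for y, label in last_label_rcvd.items() if label > BOTTOM}


def must_take_tentative(last_label_rcvd_x_y, first_label_sent_y_x):
    """Decide on Y's side whether X's request forces a tentative checkpoint.

    True when Y has sent X a message since Y's last checkpoint and X has
    received that message (or a later one).
    """
    return last_label_rcvd_x_y >= first_label_sent_y_x > BOTTOM
```

For example, with `last_label_rcvd = {"B": 2, "C": 0, "D": 1}` the request goes only to B and D, and a process whose `first_label_sent` is still ⊥ answers without taking a checkpoint.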
22Optimized Algorithm
Synchronous Checkpoint
- Algorithm
1. An initiating process takes a tentative checkpoint.
2. It requests every p ∈ ckpt_cohort to take a tentative checkpoint (this message includes the sender's last_label_rcvd for the receiver).
3. If a process that receives the request needs to take a checkpoint, it does the same as steps 1 and 2; otherwise, it returns an OK message.
4. Each such process waits to receive OK from every p ∈ ckpt_cohort.
5. If the initiator learns that all the processes have succeeded, it decides that all tentative checkpoints should be made permanent; otherwise, they should be discarded.
6. It informs every p ∈ ckpt_cohort of the decision.
7. The processes that receive the decision act accordingly.
23Diagram of Optimized Algorithm
Synchronous Checkpoint
[Figure: processes A, B, C and D exchange messages ab1, ba1, ba2, ac1, ac2, ca2, cb1, cb2, bd1, cd1, dc1, dc2. Labels in the figure: tentative checkpoint, permanent checkpoint, OK, decide to commit. Annotated checks: 2 ≥ 0 > 0, 2 ≥ 1 > 0, 2 ≥ 2 > 0.]
- ckpt_cohort_X = { Y | last_label_rcvd_X[Y] > ⊥ }
- last_label_rcvd_X[Y] ≥ first_label_sent_Y[X] > ⊥
24Correctness
Synchronous Checkpoint
- A set of permanent checkpoints taken by this algorithm is consistent
- No process sends messages after taking a tentative checkpoint until the receipt of the decision
- New checkpoints include no message from the processes that don't take a checkpoint
- The set of tentative checkpoints is either entirely made permanent or entirely discarded.
25Recovery Algorithm
Synchronous Recovery
- Labeling Scheme
- ⊥: smallest label
- ⊤: largest label
- last_label_rcvd_X[Y]: the label of the last message that X received from Y after X took its last permanent or tentative checkpoint. If no such message exists, it is ⊥.
- first_label_sent_X[Y]: the label of the first message that X sent to Y after X took its last permanent or tentative checkpoint. If no such message exists, it is ⊥.
- roll_cohort_X: the set of all processes that may have to roll back to the latest checkpoint when process X rolls back.
- last_label_sent_X[Y]: the label of the last message that X sent to Y before X took its latest permanent checkpoint. If no such message exists, it is ⊤.
26Recovery Algorithm
Synchronous Recovery
- roll_cohort_X = { Y | X can send messages to Y }
- Y will restart from the permanent checkpoint only if
- last_label_rcvd_Y[X] > last_label_sent_X[Y]
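The rollback condition can be sketched in Python as a filter over X's roll_cohort. The function name and the dictionary representation are assumptions for illustration; labels are plain integers.

```python
def processes_to_roll_back(last_label_sent_x, last_label_rcvd_from_x):
    """Return the members of roll_cohort_X that must restart from their checkpoint.

    last_label_sent_x: dict mapping Y -> last_label_sent_X[Y]
        (label of the last message X sent to Y before X's latest permanent checkpoint)
    last_label_rcvd_from_x: dict mapping Y -> last_label_rcvd_Y[X]
        (label of the last message Y has received from X)
    """
    return {
        y
        for y, sent in last_label_sent_x.items()
        if last_label_rcvd_from_x[y] > sent
    }
```

Y has to roll back exactly when it has received a message that X sent after X's checkpoint, because X's rollback undoes the sending of that message.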
27Recovery Algorithm
Synchronous Recovery
- Algorithm
1. An initiator requests every p ∈ roll_cohort to prepare to roll back (this message includes the sender's last_label_sent for the receiver).
2. If a process that receives the request needs to roll back, it does the same as step 1; otherwise, it returns an OK message.
3. Each such process waits to receive OK from every p ∈ roll_cohort.
4. If the initiator learns that every p ∈ roll_cohort has succeeded, it decides to roll back; otherwise, it decides not to roll back.
5. It informs every p ∈ roll_cohort of the decision.
6. The processes that receive the decision act accordingly.
28Diagram of Synchronous Recovery
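[Figure: processes A, B, C and D exchange messages ab1, ba1, ba2, ac1, ac2, cb1, cb2, bd1, dc1, dc2. Labels in the figure: request to roll back, OK, decide to roll back. Annotated checks include 2 > 1, 0 > 1 and 2 > 1.]
- roll_cohort_X = { Y | X can send messages to Y }
- last_label_rcvd_Y[X] > last_label_sent_X[Y]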
29Drawbacks of Synchronous Approach
- Additional messages are exchanged
- Synchronization delay
- An unnecessary extra load on the system if
failure rarely occurs
30Asynchronous Checkpoint
- Characteristic
- Each process takes checkpoints independently
- No guarantee that a set of local checkpoints is consistent
- A recovery algorithm has to search for a consistent set of checkpoints
- No additional message
- No synchronization delay
- Lighter load during normal execution
31Preliminary (Assumptions)
Asynchronous Checkpoint / Recovery
- Goal
- To find the latest consistent set of checkpoints
- Assumptions
- Communication channels are FIFO
- Communication channels are reliable
- The underlying computation is event-driven
32Preliminary (Two types of log)
Asynchronous Checkpoint / Recovery
- Save an event in memory at receipt of each message (volatile log)
- The volatile log is periodically flushed to disk (stable log), i.e. a checkpoint
- volatile log
- quick access; lost if the corresponding processor fails
- stable log
- slow access; not lost even if processors fail
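The two kinds of log can be sketched in Python as an in-memory list that is periodically appended to a file. The class `EventLog`, its method names and the JSON-lines file format are assumptions made for this illustration.

```python
import json
import os
import tempfile


class EventLog:
    def __init__(self, path):
        self.path = path      # stable log: a file on disk
        self.volatile = []    # volatile log: events kept in memory

    def record(self, event):
        """Log an event in memory at receipt of a message (fast)."""
        self.volatile.append(event)

    def flush(self):
        """Append the volatile log to the stable log and force it to disk (slow)."""
        with open(self.path, "a", encoding="utf-8") as f:
            for event in self.volatile:
                f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.volatile.clear()

    def crash(self):
        """Simulate a processor failure: everything in memory is lost."""
        self.volatile.clear()

    def stable(self):
        """Read back the events that survived on disk."""
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f]


def demo():
    """Two events are flushed; a third is still volatile when the crash happens."""
    with tempfile.TemporaryDirectory() as d:
        log = EventLog(os.path.join(d, "stable.log"))
        log.record({"from": "Y", "label": 1})
        log.record({"from": "Z", "label": 1})
        log.flush()
        log.record({"from": "Y", "label": 2})
        log.crash()
        return log.stable()
```

After the simulated crash only the flushed events remain, so the process restarts from the state described by the stable log.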
33Preliminary (Definition)
Asynchronous Checkpoint / Recovery
- Definition
- CkPt_i: the checkpoint (stable log) that i rolled back to when the failure occurred
- RCVD_{i←j}(CkPt_i / e): the number of messages received by processor i from processor j, per the information stored in the checkpoint CkPt_i or event e
- SENT_{i→j}(CkPt_i / e): the number of messages sent by processor i to processor j, per the information stored in the checkpoint CkPt_i or event e
34Recovery Algorithm
Asynchronous Checkpoint / Recovery
- Algorithm
1. When a process crashes, it recovers to its latest checkpoint CkPt.
2. It broadcasts a message saying that it has failed. The other processes receive this message and roll back to their latest event.
3. Each process sends SENT(CkPt) to its neighboring processes.
4. Each process waits for SENT(CkPt) messages from every neighbor.
5. On receiving SENT_{j→i}(CkPt_j) from j, if i notices that RCVD_{i←j}(CkPt_i) > SENT_{j→i}(CkPt_j), it rolls back to the latest event e such that RCVD_{i←j}(e) = SENT_{j→i}(CkPt_j).
6. Repeat steps 3, 4, and 5 N times (N is the number of processes).
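The iteration in steps 3 to 5 can be sketched as a round-based Python simulation. The data layout is an assumption: every process has a list of logged events, each event stores the cumulative counts `sent` and `rcvd` per neighbor, and event 0 is the initial state with all counts zero. The function name `recover` is hypothetical, and real message exchange is replaced by reading a snapshot at the start of each round.

```python
def recover(histories, start):
    """Find a consistent set of recovery points.

    histories: dict mapping process -> list of events, where each event is
               {"sent": {j: count}, "rcvd": {j: count}} with cumulative counts
    start: dict mapping process -> index of the event it restarts from
           (the checkpoint for the failed process, the latest event for others)
    Returns a dict mapping process -> index of the event it finally rolls back to.
    """
    current = dict(start)
    for _ in range(len(histories)):  # N rounds, N = number of processes
        # Steps 3 and 4: every process announces SENT at its current recovery point.
        sent = {j: histories[j][current[j]]["sent"] for j in histories}
        for i in histories:
            for j in histories:
                if i == j:
                    continue
                limit = sent[j].get(i, 0)
                # Step 5: i has recorded more receipts from j than j recorded sends,
                # so roll i back until the counts no longer disagree.
                while histories[i][current[i]]["rcvd"].get(j, 0) > limit:
                    current[i] -= 1
    return current


histories = {
    "X": [
        {"sent": {}, "rcvd": {}},
        {"sent": {"Y": 1}, "rcvd": {}},
        {"sent": {"Y": 2}, "rcvd": {"Y": 1}},
    ],
    "Y": [
        {"sent": {}, "rcvd": {}},
        {"sent": {"X": 1}, "rcvd": {"X": 1}},
        {"sent": {"X": 1}, "rcvd": {"X": 2}},
    ],
}
```

If X fails and restarts from event 1 while Y is at event 2, `recover(histories, {"X": 1, "Y": 2})` rolls Y back to event 1, because Y had recorded two messages from X while X's checkpoint records only one as sent.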
35Asynchronous Recovery
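[Figure: time lines of processes X, Y and Z with checkpoints x1, y1, z1 and events Ex0 to Ex3, Ey0 to Ey3, Ez0 to Ez2. Channel labels: XY, XZ, YX, YZ, ZX, ZY. Message labels: (X,2), (Z,0), (Y,2), (X,0), (Z,1), (Y,1). Annotated checks: 3 ≤ 2, 2 ≤ 2, 0 ≤ 0, 1 ≤ 2, 1 ≤ 1, 0 ≤ 0, 2 ≤ 1, 1 ≤ 1.]
RCVD_{i←j}(CkPt_i) ≤ SENT_{j→i}(CkPt_j)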