Title: Rollback-Recovery
1 Rollback-Recovery
2 Fault-Tolerance: the Good Old Days
- Target
- life-critical applications
- Primary concern
- tolerate arbitrary failures
- Secondary concerns
- performance
- resources
- transparency
3 The times they are a-changin'
- Target
- non-life-critical applications
- Primary Concerns
- tolerate common failures with few dedicated resources
- negligible impact during failure-free executions
- fast recovery
- transparency
- Secondary Concerns
- tolerate arbitrary failures
4 Replica Coordination
- Agreement: Every non-faulty replica receives every request
- Order: Every non-faulty replica processes requests in the same relative order
5 Implementing Replica Coordination
- Clients use (Causal) Atomic Broadcast to disseminate their requests
- Clients forward requests to one of the replicas
- That replica initiates the Reliable Broadcast to the other replicas
What are the differences?
6 Primary-Backup: The Idea
- One replica (primary) executes all non-deterministic events
- Primary broadcasts to the other replicas (backups)
- requests from clients
- outcome of executing non-deterministic events at the primary
7 Definitions
- Failover time of a PB service: longest time during which some client does not know the identity of the primary
- Server outage at t: some correct client sends a request at time t to the service, but does not receive a response
- (k,Δ)-bofo server: service in which all server outages can be grouped into at most k intervals of time, each of length at most Δ
8 Primary-Backup: The Spec (Budhiraja, Marzullo, Schneider, Toueg)
Safety
- PB1: There exists a local predicate Prmy_s on the state of each server s. At any time, there is at most one server s whose state satisfies Prmy_s.
- PB2: Each client i maintains a server identity Dest_i such that, to make a request, client i sends a message to Dest_i.
- PB3: If a client request arrives at a server that is not the current primary, then that request is not enqueued (and therefore is not processed).
Liveness
- PB4: There exist fixed values k and Δ such that the service behaves like a single (k,Δ)-bofo server.
9 A simple protocol
- Assume
- point-to-point communication
- non-faulty channels
- upper bound δ on message delivery time
- at most one process crashes
- Primary p1
- Backup p2
- On receipt of a request, process p1
- processes the request and updates its state
- sends info about the update to p2 (state-update message)
- without waiting for an ack from p2, sends a response to the client
- In addition
- p1 sends a heartbeat message to p2 every τ seconds
- Process p2
- updates its state upon receiving a state update from p1
- if it doesn't receive a heartbeat for τ + δ seconds, p2 becomes primary
- informs clients
- begins processing subsequent requests from clients
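The protocol above can be sketched as a minimal single-machine simulation; class names, the key/value request shape, and the explicit `now` timestamps are illustrative, and real channels, heartbeat timers, and client notification are elided:

```python
import time

class Backup:
    """Backup p2: applies state updates, takes over on missed heartbeats."""
    def __init__(self, tau, delta):
        self.state = {}
        self.timeout = tau + delta           # no heartbeat for tau + delta => takeover
        self.last_heartbeat = time.monotonic()
        self.is_primary = False

    def on_state_update(self, key, value):
        self.state[key] = value              # mirror the primary's update

    def on_heartbeat(self, now=None):
        self.last_heartbeat = now if now is not None else time.monotonic()

    def check_takeover(self, now):
        if now - self.last_heartbeat > self.timeout:
            self.is_primary = True           # would also inform the clients
        return self.is_primary

class Primary:
    """Primary p1: processes requests and pushes state updates to the backup."""
    def __init__(self, backup):
        self.state = {}
        self.backup = backup

    def on_request(self, key, value):
        self.state[key] = value                  # process request, update state
        self.backup.on_state_update(key, value)  # state-update message to p2
        return "ok"                              # reply without waiting for an ack
```

Note that the primary replies before the backup acknowledges: this is what keeps failure-free overhead low, at the cost of possibly losing requests when p1 crashes.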
10 ...that meets the PB spec
- PB1: Prmy_p2 ≡ "p2 has not received a message from p1 for τ + δ seconds"
- Failover time ≤ τ + 2δ
11 ...indeed, it does!
- k = 1 (since at most one crash)
- Δ = longest interval during which a request elicits no response
- assume p1 crashes at time tc
- any client request sent to p1 at time tc − δ or later may be lost
- p2 may not learn about p1's crash until tc + τ + 2δ
- the client may not learn that p2 is the new primary for another δ
- PB2, PB3: follow immediately from the protocol
- PB4: find k, Δ to implement a (k,Δ)-bofo server
- Δ ≤ τ + 4δ
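The bound Δ ≤ τ + 4δ follows by adding up the intervals above; a one-function check of the arithmetic (the function name is ours):

```python
def bofo_bounds(tau, delta):
    """Length of the outage window of the simple protocol, relative to crash time tc."""
    outage_start = -delta             # a request sent at tc - delta may already be lost
    detect = tau + 2 * delta          # p2 learns of the crash by tc + tau + 2*delta
    outage_end = detect + delta       # clients learn of the new primary delta later
    return outage_end - outage_start  # = tau + 4*delta
```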
12 Active Replication vs. Primary-Backup
- Active Replication
- tolerates arbitrary failures
- masks failures
- consumes lots of resources
- Primary-Backup
- does not tolerate arbitrary failures
- if the primary fails, requests may be lost
- service can become unavailable while a leader election algorithm is run to determine the new primary
- consumes fewer resources
13 Some like it hot
- Hot backups: process information from the primary as soon as they receive it
- Cold backups: log information received from the primary, and process it only if the primary fails
- Rollback recovery implements cold backups cheaply
- primary logs directly to stable storage the information needed by backups
- if the primary crashes, a newly initialized process is given the content of the logs: backups are generated on demand
14 Uncoordinated Checkpointing
- Easy to understand
- No synchronization overhead
- Flexible
- can choose when to checkpoint
- To recover from a crash
- go back to last checkpoint
- restart
15 How to (not) take a checkpoint
- Block execution, save entire process state to stable storage
- very high overhead during failure-free execution
- lots of unnecessary data saved on stable storage
16 How to take a checkpoint
- Take checkpoints incrementally
- save only pages modified since last checkpoint
- use dirty bit to determine which pages to save
- Save only interesting parts of address space
- use application hints or compiler help to avoid saving useless data (e.g. dead variables)
- Do not block application execution during checkpointing
- copy-on-write
- precopying
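A toy sketch of dirty-bit incremental checkpointing; the dict-of-pages model and class name are illustrative, and a real implementation would rely on MMU dirty bits and copy-on-write:

```python
class IncrementalCheckpointer:
    """Only pages written since the last checkpoint are saved to stable storage."""
    def __init__(self):
        self.pages = {}        # volatile address space: page id -> contents
        self.dirty = set()     # pages modified since the last checkpoint
        self.stable = {}       # stand-in for stable storage

    def write(self, page, data):
        self.pages[page] = data
        self.dirty.add(page)   # the hardware dirty bit, simulated

    def checkpoint(self):
        saved = len(self.dirty)
        for page in self.dirty:             # save only the modified pages
            self.stable[page] = self.pages[page]
        self.dirty.clear()
        return saved           # how many pages actually went to stable storage
```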
17 The Domino Effect
18 How to Avoid the Domino Effect
- Coordinated Checkpointing
- No independence
- Synchronization Overhead
- Easy Garbage Collection
- Communication-Induced Checkpointing: detect dangerous communication patterns and checkpoint appropriately
- Less synchronization
- Less independence
- Complex
19 The Output Commit Problem
- Coordinated checkpoint for every output commit
- High overhead if frequent I/O with the external environment
20 Message Logging
- Can avoid domino effect
- Works with coordinated checkpoint
- Works with uncoordinated checkpoint
- Can reduce cost of output commit
- More difficult to implement
21 How Message Logging Works
[Figure: a recovery unit consisting of the application, its checkpoints, and a log of determinants]
- To tolerate crash failures
- periodically checkpoint the application state
- log on stable storage the determinants of non-deterministic events executed after the checkpointed state
- for message delivery events:
- #m = (m.dest, m.rsn, m.source, m.ssn)
- Recovery: restore the latest checkpointed state; replay the non-deterministic events according to their determinants
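A toy recovery unit illustrating the checkpoint-plus-determinants scheme above; it is single-process, the `resend` map stands in for senders regenerating their messages, and all names are ours:

```python
from collections import namedtuple

# determinant of a message delivery, with the fields from the slide
Determinant = namedtuple("Determinant", "dest rsn source ssn")

class RecoveryUnit:
    def __init__(self, name="p"):
        self.name = name
        self.rsn = 0
        self.delivered = []        # volatile application state: delivered payloads
        self.stable_log = []       # determinants on stable storage
        self.checkpointed = (0, [])

    def deliver(self, source, ssn, payload):
        self.rsn += 1
        self.stable_log.append(Determinant(self.name, self.rsn, source, ssn))
        self.delivered.append(payload)

    def take_checkpoint(self):
        self.checkpointed = (self.rsn, list(self.delivered))
        self.stable_log = []       # determinants before the checkpoint are obsolete

    def recover(self, resend):
        # restore the checkpointed state, then replay deliveries in rsn order
        log = sorted(self.stable_log, key=lambda d: d.rsn)
        self.rsn, self.delivered = self.checkpointed
        self.delivered = list(self.delivered)
        self.stable_log = []
        for d in log:
            self.deliver(d.source, d.ssn, resend[(d.source, d.ssn)])
```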
22 Pessimistic Logging
[Figure: processes p1, p2, p3 exchanging messages m1, m2, m3]
- Never creates orphans
- may incur blocking
- straightforward recovery
23 Case study 1: Sender-Based Logging (Johnson and Zwaenepoel, FTCS '87)
- The message log is maintained in volatile storage at the sender.
- A message m is logged in two steps:
- i) before sending m, the sender logs its content: m is partially logged
- ii) the receiver tells the sender the receive sequence number of m, and the sender adds this information to its log: m is fully logged.
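The two-step logging can be sketched like this; class names are ours, volatile logs are plain dicts, and the ack is modeled as a return value rather than a message:

```python
class SBLSender:
    """Keeps the message log in its own volatile memory."""
    def __init__(self, name):
        self.name = name
        self.ssn = 0
        self.log = {}                           # ssn -> [payload, rsn]; rsn None = partial

    def send(self, payload):
        self.ssn += 1
        self.log[self.ssn] = [payload, None]    # step (i): partially logged
        return (self.name, self.ssn, payload)

    def on_ack(self, ssn, rsn):
        self.log[ssn][1] = rsn                  # step (ii): fully logged

class SBLReceiver:
    def __init__(self):
        self.rsn = 0
        self.delivered = []

    def deliver(self, msg):
        source, ssn, payload = msg
        self.rsn += 1
        self.delivered.append(payload)
        return ssn, self.rsn                    # the ack carries the rsn back
```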
24 More on SBL
- Recovery: the recovering process collects the logs from the senders, and replays the messages in ascending rsn order
- Optimistic SBL may create orphans. Assume transient link failures
25 Optimistic Logging
- p2 sends m3 without first logging determinants.
- If p2 fails before logging the determinants of m1 and m2, p3 becomes an orphan.
[Figure: p2 delivers m1 and m2, then sends m3 to p3]
- Eliminates orphans during recovery
- non-blocking during failure-free executions
- rollback of correct processes
- complex recovery
26 Causal Logging
- No blocking in failure-free executions
- No orphans
- No additional messages
- Tolerates multiple concurrent failures
- Keeps determinants in volatile memory
- Localized output commit
27 Preliminary Definitions
Given a message m sent from m.source to m.dest:
- Depend(m): set of processes whose state depends on the delivery of m
- Log(m): set of processes with a copy of the determinant of m in their volatile memory
- p is an orphan of a set C of crashed processes if p ∉ C and, for some m, p ∈ Depend(m) and Log(m) ⊆ C
28 The No-Orphans Consistency Condition
- No orphans after crash C if: ∀m: Log(m) ⊆ C ⇒ Depend(m) ⊆ C
- No orphans after any C if: ∀m: ¬stable(m) ⇒ Depend(m) ⊆ Log(m)
- The Consistency Condition: ∀m: ¬stable(m) ⇒ Depend(m) ⊆ Log(m), where stable(m) holds once the determinant of m can no longer be lost
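The definitions above can be written as predicates over sets of process names; this is a sketch, with `stable_m` a boolean standing for stable(m) and both function names ours:

```python
def orphans(depend_m, log_m, crashed):
    """Processes outside C that depend on m while #m was held only inside C."""
    if log_m <= crashed:          # every volatile copy of #m was lost in the crash
        return depend_m - crashed
    return set()

def no_orphans_after_any_crash(stable_m, depend_m, log_m):
    """The consistency condition: not stable(m) => Depend(m) subset of Log(m)."""
    return stable_m or depend_m <= log_m
```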
29 Optimistic and Pessimistic
- No orphans after crash C if: ∀m: Log(m) ⊆ C ⇒ Depend(m) ⊆ C
- Optimistic weakens it to hold only eventually: orphans may be created, but are detected and rolled back during recovery
- No orphans after any crash if: ∀m: ¬stable(m) ⇒ Depend(m) ⊆ Log(m)
- Pessimistic strengthens it to: ∀m: ¬stable(m) ⇒ |Depend(m)| ≤ 1
30 Causal Message Logging
- No orphans after any crash of size at most f if: ∀m: ¬stable(m) ⇒ Depend(m) ⊆ Log(m), with stable(m) ≡ |Log(m)| > f
- Causal strengthens it to hold at all times during failure-free execution
31 An Example: Causal Logging
- If f = 1, stable(m) ≡ |Log(m)| ≥ 2
[Figure: p1, p2, p3 exchange m1–m5; m3 piggybacks ⟨#m1, #m2⟩ and m5 piggybacks ⟨#m3⟩]
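With at most f concurrent crashes, a determinant is effectively stable once more than f processes hold it in volatile memory; as predicates (names are ours):

```python
def stable(log_m, f):
    """With at most f crashes, #m cannot be lost once more than f processes hold it."""
    return len(log_m) > f

def safe_under_f_crashes(depend_m, log_m, f):
    # the causal-logging obligation: either #m is stable,
    # or everyone who depends on m also logs #m
    return stable(log_m, f) or depend_m <= log_m
```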
32 Recovery for f = 1
[Figure: the recovering process p, its parents above, and its children below]
- The parents of p resend the messages they previously sent to p, in ssn order
- The children of p return the determinants of the messages delivered by p, which give the rsn order: who is my next parent? what is the next message from each parent?
33 Family-Based Logging
- Each process p
- maintains in a volatile log Dp all the determinants #m such that p ∈ Log(m)
- piggybacks on application messages to q all determinants #m ∈ Dp such that q ∉ ⟨Log(m)⟩p and m is not known to be stable
- upon receipt of an application message m
- adds #m to Dp
- adds to Dp any new determinant piggybacked on m
- scans the information piggybacked on m to update its estimate ⟨Log(m)⟩p for all determinants #m ∈ Dp
- caches in a volatile log Sp all the messages it sends
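A sketch of the volatile determinant log and the piggybacking rule; the piggyback condition shown is one plausible reading of the estimate-based rule, and determinant ids and class names are ours:

```python
class FBLProcess:
    """Volatile determinant log D_p plus piggybacking on application messages."""
    def __init__(self, name, f):
        self.name, self.f = name, f
        self.dets = {}            # determinant id -> estimated Log(#m)

    def deliver(self, det_id, piggyback):
        # log the determinant of this delivery...
        self.dets.setdefault(det_id, set()).add(self.name)
        # ...and merge any piggybacked determinants and Log estimates
        for d, log in piggyback.items():
            self.dets[d] = self.dets.get(d, set()) | log | {self.name}

    def piggyback_for(self, q):
        # attach the determinants q may still need:
        # q not in <Log(#m)>_p, and #m not yet known to be stable
        return {d: set(log) for d, log in self.dets.items()
                if q not in log and len(log) <= self.f}
```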
34 Estimating Log(m) and |Log(m)|
- Each process p maintains estimates ⟨Log(m)⟩p and ⟨|Log(m)|⟩p
- p piggybacks #m on m′ to q if q ∉ ⟨Log(m)⟩p and ⟨|Log(m)|⟩p ≤ f
- How can p estimate ⟨Log(m)⟩p and ⟨|Log(m)|⟩p?
- How accurate should these estimates be?
- inaccurate estimates cause useless piggybacking
- keeping estimates accurate requires extra piggybacking
35 ⟨Det⟩: Keep It Simple
- p piggybacks #m on m′ to q
- Updating Rule
- Cost
- requires no additional space over the piggybacked determinants.
36 ⟨|Log|⟩: Send the Size
- Whenever p piggybacks #m on m′ to q, it also includes |⟨Log(m)⟩p|.
- Updating Rule
- when q receives #m for the first time
- Cost
- requires 1 integer associated with each determinant.
- a similar protocol can be implemented that carries f·n additional integers with each message.
37 ⟨Log⟩: Tell All You Know
- Whenever p piggybacks #m on m′ to q, it also includes ⟨Log(m)⟩p.
- Updating Rule
- Cost
- requires up to f integers associated with each determinant.
- a similar protocol can be implemented that carries n² additional integers with each message.
38 Estimating Log(m)
- Because Depend(m) ⊆ Log(m)
- we can approximate Log(m) from below with Depend(m)
- and then use vector clocks to track Depend(m)!
39 Dependency Vectors
- Dependency Vector (DV): a vector clock that tracks causal dependencies between message delivery events.
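A dependency vector can be sketched as a vector clock that ticks only on deliveries; this is a simplification, and the indices and class name are ours:

```python
class DVProcess:
    """Vector clock advanced only on message delivery events."""
    def __init__(self, idx, n):
        self.idx = idx            # this process's position in the vector
        self.dv = [0] * n

    def deliver(self, sender_dv):
        # inherit everything the sender depended on, then count this delivery
        self.dv = [max(a, b) for a, b in zip(self.dv, sender_dv)]
        self.dv[self.idx] += 1

    def piggyback(self):
        return list(self.dv)      # attached to every outgoing message
```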
40 Weak Dependency Vectors
- Weak Dependency Vector (WDV): tracks causal dependencies on deliver(m) as long as ¬stable(m)
41 Dependency Matrix
- Use WDVs to determine if p ∈ Log(m)
- Each process p maintains a Dependency Matrix DMp, whose rows are weak dependency vectors.
- Given #m = ⟨u, s, 14, 15⟩: ⟨Log(m)⟩p = {q : DMp[q][s] ≥ 15} = {p, q, s}
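Under this scheme ⟨Log(m)⟩p can be read off the dependency matrix; a sketch matching the slide's example, where the dict-of-dicts matrix representation is ours:

```python
def log_estimate(dm, dest, rsn):
    """<Log(#m)>_p: the rows of p's dependency matrix whose entry for m.dest
    shows a dependency on the rsn-th delivery at m.dest (or a later one)."""
    return {q for q, wdv in dm.items() if wdv.get(dest, 0) >= rsn}
```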
42 Rollback-Recovery Protocols: A Success Story?
- Over 300 papers in the area
- Relatively few implementations
- Why?
- Integrating recovery protocols with applications is non-trivial
- Performance issues not understood
- One size doesn't fit all
43 Egida
- A toolkit for supporting rollback recovery
- Transparent
- seamless integration with applications
- Extensible
- can easily handle new sources of non-determinism
- can easily include new protocols
- Flexible
- allows selecting the best protocol for an application
- Smart
- don't want to implement 300 protocols...
- Powerful
- a microscope to understand rollback recovery
44 The Unifying Theme
- All rollback-recovery protocols enforce the no-orphans consistency condition
- The challenge is handling non-determinism
- A process may execute non-deterministic events
- A process may interact with other processes or with the environment and generate dependencies on these events
- Characterize a protocol according to how it handles non-determinism
- Identify relevant events
- Specify which actions to take when an event occurs
45 Handling Non-Determinism
- Five classes of relevant events
- Non-deterministic events
- Ex: message delivery, file read, clock read, lock acquire
- Failure-detection events
- Ex: time-out, message delivery
- Internal dependency-generating events
- Ex: message send, file write, lock release
- External dependency-generating events
- Ex: output to printer or screen, file write
- Checkpointing events
- Ex: timeout, explicit instruction, message delivery
46 The Architecture
- Event handlers invoked on relevant events
- Library of modules
- implement core functionalities (checkpointing, creating determinants, logging, piggybacking, detecting orphans, restarting a faulty process, etc.)
- provide basic services (stable storage, failure detection, etc.)
- single interface, multiple implementations
- Use a specification language to select desired modules and corresponding implementations
- Synthesize the protocol automatically from the specification
47 An Example of Protocol Specification
- Causal Logging
- /* non-deterministic events statement */
- receive
- determinant: source, ssn, dest, rsn
- Log determinant on volatile memory of processes
- /* internal dependency-generating events statement */
- send
- Piggyback determinants
- Log message on volatile memory of self
- /* external dependency-generating events statement */
- send
- Output Commit determinants
- Implementation: independent
- /* checkpoint statement */
- Checkpoint: independent, asynchronous on NFS disk
- Implementation: incremental
- Scheduling policy: periodic
48 Integration with MPICH
- MPICH
- 2-layered architecture
- upper layer exports MPI functions to the application
- lower layer performs data transfer using platform-specific libraries (e.g. P4)
- Modifications to MPICH
- in the upper layer, replace calls to P4 with corresponding calls to the Egida API
- Modifications to P4
- handle socket-level errors
- allow a recovering process to set up connections with correct processes
- Modifications to applications: NONE
49 Bringing the Recovery back to Rollback-Recovery
- Traditionally, high availability = active replication
- Few incentives for studying the recovery performance of rollback-recovery protocols
- Lots of qualitative arguments
- No experimental study
50 Experimental Setup
- Protocol Suite
- Pessimistic receiver-based
- Pessimistic sender-based
- Optimistic
- Causal
- Application Suite
- Benchmarks from NASA's NPB 2.3
- Methodology
- 4 Pentium-based workstations
- Solaris 2.5
- Lightly-loaded 100 Mb/s Ethernet
- Failures induced about 3 minutes after a checkpoint
- 95% confidence interval
- For the optimistic protocol, each process flushes its volatile logs to disk asynchronously once every 10 seconds
51 The stop'n'go Effect
- In sender-based and causal logging, the sender stores messages in volatile memory
- If a sender fails, we can get a stop'n'go effect
- recovery of the receiver is delayed until the sender regenerates its messages
- Impact of stop'n'go depends on how much blocking occurs during failure-free execution
[Chart: recovery time (sec.) for cg under receiver-based vs. sender-based pessimistic logging, for f = 1, 2, 3 failures]
52 Failure-free Overhead
53 Bad News?
- Receiver-based pessimistic
- Fast crash recovery
- Fault containment
- Slow failure-free execution
- Sender-based pessimistic
- Fault containment
- Slow crash recovery when f > 1
- Optimistic
- Fast crash recovery and fast failure-free execution
- No fault containment
- Causal
- Fast failure-free execution
- Fault containment
- Slow crash recovery when f > 1
54 Hybrid Protocols
- Sender logs messages in volatile memory
- Receiver logs messages and determinants asynchronously to disk
- A prefix of the recovery information is available on disk to the recovering process: no stop'n'go!
- Best of both worlds
- Low overhead during failure-free execution
- Fast crash recovery
55 Hybrid Protocols: Recovery Performance
[Chart: recovery time (sec.) for cg under receiver-based pessimistic, optimistic, and hybrid-causal protocols, for f = 1, 2, 3 failures]
56 Hybrid Protocols: Failure-free Overhead
[Chart: failure-free overhead (%) of receiver-based pessimistic, causal, and hybrid-causal protocols on bt, lu, cg, sp, mg]
- Hybrid causal imposes at most 2% higher overhead than causal
57 A Comparison of RR Protocols