Title: CS556: Distributed Systems
1. CS-556 Distributed Systems
Fault Tolerance (I)
- Manolis Marazakis
- maraz_at_csd.uoc.gr
2. The gossip architecture (I)
- Replicate data close to the points where groups of clients need it
- Periodic exchange of messages among replica managers (RMs)
- Front-ends send queries and updates to any RM they choose
- Any RM that is available and can provide acceptable response times
- Consistent service over time
- Relaxed consistency between replicas
3. The gossip architecture (II)
- Causal update ordering
- Forced ordering
- Causal and total
- A forced-order update and a causal-order update that are related by the happened-before relation may be applied in different orders at different RMs!
- Immediate ordering
- Updates are applied in a consistent order relative to any other update at all RMs
4. The gossip architecture (III)
- Bulletin board application example
- Posting items -> causal order
- Adding a subscriber -> forced order
- Removing a subscriber -> immediate order
- Gossip messages: updates among RMs
- Front-ends maintain a prev vector timestamp
- One entry per RM
- RMs respond with a new vector timestamp
5. State components of a gossip RM
6. Query operations in gossip
- RM must return a value that is at least as recent as the request's timestamp
- Q.prev <= valueTS
- List of pending query operations
- Hold back until the above condition is satisfied
- RM can wait for missing updates
- or request updates from the RMs concerned
- RM's response includes valueTS
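The hold-back test for queries can be sketched in a few lines. This is a minimal illustration, assuming vector timestamps are plain Python lists with one entry per RM; the function names (vt_leq, vt_merge, can_answer) are hypothetical and not taken from the slides. The merge step, in which the front-end folds the returned valueTS into its own timestamp, is an assumption about how the front-end keeps its prev timestamp up to date.

```python
from typing import List

def vt_leq(a: List[int], b: List[int]) -> bool:
    """Component-wise comparison: True when a <= b in every entry."""
    return len(a) == len(b) and all(x <= y for x, y in zip(a, b))

def vt_merge(a: List[int], b: List[int]) -> List[int]:
    """Component-wise maximum of two vector timestamps."""
    return [max(x, y) for x, y in zip(a, b)]

def can_answer(q_prev: List[int], value_ts: List[int]) -> bool:
    """A query is held back until Q.prev <= valueTS."""
    return vt_leq(q_prev, value_ts)

# The RM's value reflects [2, 5, 5]; the front-end has already seen [2, 4, 6].
print(can_answer([2, 4, 6], [2, 5, 5]))  # False: the third entry is still behind
print(can_answer([2, 4, 5], [2, 5, 5]))  # True
print(vt_merge([2, 4, 6], [2, 5, 5]))    # [2, 5, 6]
```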
7. Updates in causal order
- RM-i checks whether the operation ID is in its executed table or in its log
- Discard the update if it has already seen it
- Increment the i-th element of the replica timestamp
- Count of updates received from front-ends
- Assign a vector timestamp (ts) to the update
- Replace the i-th element of u.prev by the i-th element of the replica timestamp
- Insert log entry
- <i, ts, u.op, u.prev, u.id>
- Stability condition: u.prev <= valueTS
- All updates on which u depends have been applied
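The update-handling steps at RM-i can be sketched as follows. This is a simplified illustration under stated assumptions: the class and method names (Update, ReplicaManager, receive_update, apply_stable) are hypothetical, the "value" is just a list of applied operations, and gossip exchange is left out.

```python
class Update:
    def __init__(self, uid, op, prev):
        self.uid, self.op, self.prev = uid, op, list(prev)

class ReplicaManager:
    def __init__(self, i, n):
        self.i = i                  # index of this RM
        self.replica_ts = [0] * n   # updates accepted into the log
        self.value_ts = [0] * n     # updates reflected in the value
        self.log = []               # records <i, ts, op, prev, id>
        self.executed = set()       # IDs of applied updates
        self.value = []             # applied operations, in order

    def receive_update(self, u):
        # Discard an update that is already in the executed table or the log.
        if u.uid in self.executed or any(r[4] == u.uid for r in self.log):
            return None
        # Count this update and derive its unique timestamp from u.prev.
        self.replica_ts[self.i] += 1
        ts = list(u.prev)
        ts[self.i] = self.replica_ts[self.i]
        self.log.append((self.i, ts, u.op, u.prev, u.uid))
        self.apply_stable()
        return ts

    def apply_stable(self):
        # Apply every logged update whose dependencies (prev) are satisfied;
        # repeat, because applying one update may make others stable.
        progress = True
        while progress:
            progress = False
            for _, ts, op, prev, uid in self.log:
                stable = all(p <= v for p, v in zip(prev, self.value_ts))
                if uid not in self.executed and stable:
                    self.value.append(op)
                    self.value_ts = [max(a, b) for a, b in zip(self.value_ts, ts)]
                    self.executed.add(uid)
                    progress = True
```

An update whose prev timestamp mentions an update this RM has not yet seen stays in the log and is not applied until the missing update arrives.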
8. Forced and immediate order
- Unique sequence number is appended to update timestamps
- Primary RM acts as sequencer
- Another RM can be elected to take over consistently as sequencer
- Majority of RMs (including primary) must record which update is the next in sequence
- Immediate ordering by having the primary order them in the sequence (along with forced updates, considering causal updates as well)
- Agreement protocol on sequence
9. Gossip timestamps
- Gossip messages between RMs
- Replica timestamp and log
- Receiver's tasks
- Merge arriving log m.log with its own
- Add record r to local log unless r.ts <= replicaTS
- Apply any updates that have become stable
- This may in turn make pending updates become stable
- Eliminate records from the log and entries in the executed table
- Once it is established that they have been applied everywhere
- Sort the set of stable updates in timestamp order
- r is applied only if there is no s s.t. s.prev < r.prev
- tableTS[j] = m.ts
- If tableTS[i][c] >= r.ts[c], for all i, then r is discarded
- c: RM that created record r
- ACKs by front-ends to discard records from executed table
10. Update propagation
- How long before all RMs receive an update?
- Frequency and duration of network partitions
- Beyond the system's control!
- Frequency of gossip messages
- Policy for choosing a gossip partner
- Random
- Weighted probabilities to favor near partners
- Surprisingly robust!
- But exhibits variable update propagation times
- Deterministic
- Simple function of the RM's state
- E.g. examine the timestamp table and choose the RM that appears to be the furthest behind in updates received
- Topological
- Based on a fixed arrangement of RMs into a graph
- Ring, mesh, trees
- Trade-off: amount of communication against higher latencies and the possibility that a single failure will affect other RMs
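The three partner-selection policies can be sketched side by side. This is an illustrative sketch only: the function names are hypothetical, the timestamp table is assumed to be a dict from RM id to that RM's last known vector timestamp, "furthest behind" is approximated by the smallest sum of timestamp entries, and the topological policy is shown for a ring.

```python
import random

def pick_random(peers, weights=None, rng=random):
    """Random policy; weights can favor nearby partners."""
    return rng.choices(peers, weights=weights, k=1)[0]

def pick_furthest_behind(table_ts, self_id):
    """Deterministic policy: the RM that appears to have received the fewest
    updates (smallest timestamp sum; ties broken by RM id)."""
    candidates = [j for j in table_ts if j != self_id]
    return min(candidates, key=lambda j: (sum(table_ts[j]), j))

def pick_ring(self_id, n):
    """Topological policy on a ring of n RMs: always the next neighbour."""
    return (self_id + 1) % n

table = {0: [3, 2, 1], 1: [1, 0, 0], 2: [3, 2, 0]}
print(pick_furthest_behind(table, self_id=0))  # 1
print(pick_ring(2, n=3))                       # 0
```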
11. Scalability concerns
- 2 messages per query (between front-end and RM)
- Causal update
- G updates per gossip message
- 2 + (R-1)/G messages exchanged (R = number of RMs)
- Increasing G leads to
- Fewer messages
- but also worse delivery latencies
- RM has to wait for more updates to arrive before propagating them
- Improvement by having read-only replicas
- Provided that the update/query ratio is low!
- Updated by gossip messages but do not receive updates directly from front-ends
- Can be situated close to client groups
- Vector timestamps need only include updateable RMs
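The message count per causal update can be evaluated for a few batch sizes to see the trade-off. A small sketch, assuming R is the number of RMs and G the number of updates carried in one gossip message (the function name is hypothetical):

```python
def messages_per_update(R: int, G: int) -> float:
    """2 messages between front-end and RM, plus each update's share of the
    R - 1 gossip messages that carry it to the other RMs."""
    return 2 + (R - 1) / G

for G in (1, 2, 4):
    print(G, messages_per_update(R=5, G=G))
# 1 6.0
# 2 4.0
# 4 3.0
```

Larger G lowers the cost per update, at the price of updates waiting longer at an RM before they are gossiped.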
12. Dependability: Basic Concepts
- Availability
- Reliability
- Safety
- Maintainability
Fault -> Error -> Failure
- Faults
- Transient
- Intermittent
- Permanent
13. Failure Models
14. Failure detectors
- Not necessarily reliable!
- "P is here" message, every T sec, assuming a max. message transmission delay D
- Categorization of processes (hints)
- suspected vs unsuspected
- A process may be functioning correctly on the other side of a partitioned network
- or it could be slow to respond to probes
- Reliable detection
- unsuspected vs failed (crashed)
- Feasible only in synchronous systems
- It is possible to give different responses to different processes
- different comm. conditions
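An unreliable failure detector of this kind can be sketched with a timeout of T + D. This is a minimal illustration with hypothetical names; time is passed in explicitly so the example is deterministic, and treating a never-heard-from process as suspected is an assumption.

```python
class TimeoutFailureDetector:
    """Suspects a process if no "P is here" message arrived within T + D."""

    def __init__(self, T: float, D: float):
        self.timeout = T + D
        self.last_seen = {}   # process id -> time of last "P is here"

    def heartbeat(self, p, now: float) -> None:
        self.last_seen[p] = now

    def status(self, p, now: float) -> str:
        last = self.last_seen.get(p)
        if last is None or now - last > self.timeout:
            return "suspected"      # only a hint: p may just be slow
        return "unsuspected"

fd = TimeoutFailureDetector(T=5, D=1)
fd.heartbeat("p", now=0)
print(fd.status("p", now=6))    # unsuspected
print(fd.status("p", now=6.5))  # suspected
fd.heartbeat("p", now=7)        # a late message clears the suspicion
print(fd.status("p", now=8))    # unsuspected
```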
15. Failure Masking by Redundancy (I)
- Hide the occurrence of failures from other processes, by redundancy
- Information
- Extra bits to allow recovery
- Time
- Transactions to allow abort/redo
- Physical
- Extra equipment to tolerate loss/malfunction of some components
- Voter circuitry
- Voters are components too -> they may themselves fail!
16. Failure Masking by Redundancy (II)
- Triple modular redundancy (TMR)
17. Flat vs Hierarchical Groups (I)
Process resilience by replicating processes into
groups
Group membership protocols
18. Flat vs Hierarchical Groups (II)
- Flat groups
- Symmetrical (no special roles)
- No single point of failure
- Complex operation protocols (e.g. voting)
- Hierarchical groups
- Coordinator is a single point of failure
19. Failure Masking and Replication
- Having a group of identical processes allows us to mask one or more faulty processes
- Primary-backup protocols
- Hierarchical organization
- Replicated-write protocols
- Flat process groups
- Active replication
- Quorum protocols
- K-fault tolerant system
- Fail-silent processes -> group size (k + 1)
- Byzantine failures -> group size (2k + 1)
20. Coordination/Agreement
- A set of processes must collaborate
- or agree with one or more processes
- without a fixed master/slave relationship
- failure assumptions and failure detectors
- Problems
- mutual exclusion
- election
- multicast
- reliability and ordering semantics
- consensus
- Byzantine agreement
21. Problems of Agreement
- A set of processes need to agree on a value (decision), after one or more processes have proposed what that value (decision) should be
- Examples
- mutual exclusion, election, transactions
- Processes may be correct, crashed, or they may exhibit arbitrary (Byzantine) failures
- Messages are exchanged on a one-to-one basis, and they are not signed
22. Two Agreement Problems
- Consensus problem: every process i proposes a value vi, while in the undecided state. Process i exchanges messages until it makes decision di and moves to the decided state.
- Termination: all correct processes must make a decision
- Agreement: same decision for all correct processes
- Integrity: if all correct processes proposed the same value, any correct process decides that value
- Byzantine generals problem: a commander process i orders value v.
- The lieutenant processes must agree on what the commander ordered.
- Processes may be faulty
- provide wrong or contradictory messages
- Integrity requirement
- A distinguished process decides a value for others to agree upon
- Solution only exists if N > 3f, where f = number of faulty processes
23. Consensus for 3 processes
24. The Two-Army Problem
- How can two perfect processes reach agreement about 1 bit of information?
- over an unreliable comm. channel
- Red army: 5000 troops
- Blue armies 1 and 2: 3000 troops each
- How can the blue armies reach agreement on when to attack?
- Their only means of communication is by sending messengers
- that may be captured by the enemy!
- No solution!
- Proof by contradiction: assume there is a solution with a minimum number of messages
25. Consensus: No Failures Case
majority(v1, ..., vN) returns the most frequently occurring value
- returns a special "undefined" value if no majority exists
Consensus via reliable multicast
For ordered values, min/max could be used instead of majority
In general, if failures can occur it is not 100% certain that consensus can be reached in finite time!
Terminating Reliable Multicast (TRB): a single process multicasts a msg, and all correct processes must agree on that msg
- Even if the sender crashes, all correct processes must deliver a special msg (Server-Fault)
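The failure-free case can be simulated directly: every process reliably multicasts its proposal, so all processes evaluate the decision function on the same set of values. This is a sketch with hypothetical names; "majority" is assumed to mean a strict majority (more than half), and None stands in for the special "undefined" value.

```python
from collections import Counter

UNDEFINED = None  # stand-in for the special "no majority" value

def majority(values):
    """Most frequent value if it occurs in more than half the inputs."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else UNDEFINED

def consensus_no_failures(proposals, decide=majority):
    """With reliable multicast and no failures, every process receives all
    proposals, so applying the same function yields the same decision."""
    received = [list(proposals) for _ in proposals]   # one copy per process
    return [decide(r) for r in received]

print(consensus_no_failures([1, 2, 1]))              # [1, 1, 1]
print(consensus_no_failures([3, 1, 2], decide=min))  # [1, 1, 1]
```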
26. Relation among problems
A problem B reduces to a problem A if there is an algorithm which transforms any algorithm for A into an algorithm for B.
Synchronous systems: TRB is equivalent to Consensus
Asynchronous systems: Consensus reduces to TRB, but not vice versa!
Asynchronous systems with crash failures: Atomic Multicast is equivalent to Consensus
27. Consensus in synchronous systems
Duration of a round: max. delay of B-multicast
Up to f faulty processes
Dolev and Strong, 1983: Any algorithm to reach consensus despite up to f failures requires (f + 1) rounds.
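A round-based simulation shows why f + 1 rounds matter when processes can crash. This is a sketch under stated assumptions, not the slide's algorithm verbatim: each process multicasts the values it has not sent before, decides min() of everything it knows after f + 1 rounds, and a crash is modelled (hypothetically) as a pair (round, recipients reached before crashing).

```python
def sync_consensus(proposals, f, crashes=None):
    """Simulate f + 1 synchronous rounds with crash failures."""
    crashes = crashes or {}            # pid -> (round, recipients reached)
    n = len(proposals)
    values = [{v} for v in proposals]  # values known to each process
    sent = [set() for _ in range(n)]   # values each process already multicast
    alive = set(range(n))
    for rnd in range(1, f + 2):
        inbox = [set() for _ in range(n)]
        for p in sorted(alive):
            new = values[p] - sent[p]
            sent[p] |= new
            crashing = p in crashes and crashes[p][0] == rnd
            recipients = crashes[p][1] if crashing else range(n)
            for q in recipients:
                inbox[q] |= new
        alive -= {p for p, (r, _) in crashes.items() if r == rnd}
        for p in alive:
            values[p] |= inbox[p]
    return {p: min(values[p]) for p in sorted(alive)}

# Process 1 (proposing 1) crashes in round 1 after reaching only process 0.
crash = {1: (1, {0})}
print(sync_consensus([3, 1, 2], f=1, crashes=crash))  # {0: 1, 2: 1}
print(sync_consensus([3, 1, 2], f=0, crashes=crash))  # {0: 1, 2: 2}
```

With only one round, process 2 never learns the value that process 1 delivered to process 0 alone; the second round lets process 0 relay it.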
28. Byzantine agreement: synchronous
Faulty process
Nothing can be done to improve a correct process's knowledge beyond the first stage
- It cannot tell which process is faulty.
3 says 1 says u
Lamport et al, 1982: No solution for N = 3, f = 1
Pease et al, 1982: No solution for N <= 3f (assuming private comm. channels)
29. Agreement in Faulty Systems (I)
- The Byzantine generals problem for 3 loyal generals and 1 traitor
- (a) The generals announce their troop strengths
- (b) The vectors that each general assembles based on (a)
- (c) The vectors that each general receives in step 3.
Consensus by generals 1, 2, 4 -> (1, 2, UNKNOWN, 4)
30. Agreement in Faulty Systems (II)
- The same as in previous slide, except now with 2
loyal generals and one traitor.
31. Byzantine agreement for N > 3f
Example with N = 4, f = 1
- 1st round: Commander sends a value to each lieutenant
- 2nd round: Each of the lieutenants sends the value it has received to each of its peers.
- A lieutenant receives a total of (N - 2) + 1 values, of which (N - 2) are correct.
- By majority(), the correct lieutenants compute the same value.
In general, O(N^(f+1)) msgs
O(N^2) for signed msgs
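The two-round exchange for N = 4, f = 1 can be sketched as below. The names are hypothetical: the first argument maps each lieutenant to the value the commander sent it (a faulty commander may send different values), a faulty lieutenant relays a forged value to every peer, and None stands for "no majority".

```python
from collections import Counter

def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None

def byzantine_n4(sent_by_commander, faulty_lieutenant=None, forged="x"):
    """Round 1: commander -> lieutenants. Round 2: lieutenants relay to peers.
    Returns the decision of every correct lieutenant."""
    lieutenants = sorted(sent_by_commander)
    decisions = {}
    for i in lieutenants:
        if i == faulty_lieutenant:
            continue
        seen = [sent_by_commander[i]]            # value from the commander
        for j in lieutenants:
            if j != i:                           # values relayed by peers
                relayed = forged if j == faulty_lieutenant else sent_by_commander[j]
                seen.append(relayed)
        decisions[i] = majority(seen)
    return decisions

# Correct commander, lieutenant 3 faulty: the correct lieutenants still agree.
print(byzantine_n4({1: "v", 2: "v", 3: "v"}, faulty_lieutenant=3))  # {1: 'v', 2: 'v'}
# Faulty commander sending three different values: all lieutenants see the
# same three values and all reach the same "no majority" outcome.
print(byzantine_n4({1: "u", 2: "w", 3: "v"}))  # {1: None, 2: None, 3: None}
```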
32. Impossibility of (deterministic) consensus in asynchronous systems
M.J. Fischer, N. Lynch, and M. Paterson, "Impossibility of distributed consensus with one faulty process", J. ACM, 32(2), pp. 374-382, 1985.
A crashed process cannot be distinguished from a slow one.
- Not even with a 100% reliable comm. network!
There is always a chance that some continuation of the processes' execution avoids consensus being reached.
No guarantee for consensus, but Prob(consensus) > 0
Solutions: based on randomization, (unreliable) failure detectors, or fault masking
33. Reliable client-server communication
What about reliable point-to-point transport protocols?
- TCP masks omission failures
- by using ACKs and retransmissions
- but it does not mask crash failures!
- E.g. when a connection is broken, the client is only notified via an exception
34. Five classes of failures in RPC
- Client is unable to locate server
- Binding exception
- at the expense of transparency
- Request message is lost
- Is it safe to retransmit ?
- Allow server to detect it is dealing with a retry
- Server crashes after receiving a request
- Reply message is lost
- Client crashes after sending a request
35. Lost Request Messages and Server Crashes (I)
- A server in client-server communication
- Normal case
- Crash after execution
- Crash before execution
36. Server Crashes (II)
- At-least-once semantics
- Client keeps retransmitting until it gets a response
- At-most-once semantics
- Give up immediately and report failure
- Guarantee nothing
- Ideal would be exactly-once semantics
- no general way to arrange this!
37. Server Crashes (III)
- Print server scenario
- M: server's completion message
- Server may send M either before or after printing
- P: server's print operation
- C: server's crash
- Possible event orderings
- M -> P -> C
- M -> C (-> P)
- P -> M -> C
- P -> C (-> M)
- C (-> P -> M)
- C (-> M -> P)
38. Server Crashes (IV)
- Different combinations of client and server strategies in the presence of server crashes.
No combination of client and server strategy is correct for all cases!
39. Lost Reply Messages
- Is it safe to retransmit the request?
- Idempotent requests
- Example: read a file's first 1024 bytes
- Counterexample: money transfer order
- Assign sequence number to request
- Server keeps track of each client's most recently received sequence number
- additionally, set a RETRANSMISSION bit in the request header
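Sequence numbers let a server recognise a retransmitted, non-idempotent request and avoid executing it twice. This is a sketch with hypothetical names; caching the last reply per client, so that a retry can be answered without re-executing, is an assumption beyond what the slide states.

```python
class TransferServer:
    """Executes each (client, seq) request at most once."""

    def __init__(self):
        self.balance = 0
        self.last = {}   # client -> (most recent seq, cached reply)

    def handle(self, client, seq, amount):
        last = self.last.get(client)
        if last is not None and seq == last[0]:
            return last[1]           # retry: replay the reply, do not re-execute
        if last is not None and seq < last[0]:
            return None              # stale duplicate: ignore
        self.balance += amount       # non-idempotent operation
        reply = self.balance
        self.last[client] = (seq, reply)
        return reply

server = TransferServer()
print(server.handle("c1", 1, 100))  # 100
print(server.handle("c1", 1, 100))  # 100 (reply was lost, client retried)
print(server.handle("c1", 2, 50))   # 150
```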
40. Client Crashes (I)
- Orphan computation
- No process waiting for the result
- Waste of resources (CPU cycles, locks)
- Possible confusion upon the client's recovery
- 4 alternative strategies proposed by Nelson (1981)
- Extermination
- Client keeps log of requests to be issued
- Upon recovery, explicitly kill orphans
- Overhead of logging (for every RPC)
- Problems with grand-orphans
- Problems with network partitions
41. Client Crashes (II)
- Reincarnation
- Divide time up into epochs (sequentially numbered)
- Upon reboot, client broadcasts start-of-epoch
- Upon receipt, all remote computations on behalf of this client are killed
- After a network partition, an orphan's response will contain an obsolete epoch number -> easily detected
- Gentle reincarnation
- Upon receipt of start-of-epoch, each server checks to see if it has any remote computations
- If the owner cannot be found, the computation is killed
- Expiration
- Each RPC is given a time quantum T to complete
- must explicitly ask for another if it cannot finish in time
- After reboot, client only needs to wait a time T
- How to select a reasonable value for T?
42. Basic Reliable-Multicasting Schemes
- A simple solution to reliable multicasting when all receivers are known and are assumed not to fail
- Message transmission
- Reporting feedback
43. Nonhierarchical Feedback Control
- Several receivers have scheduled a request for
retransmission, but the first retransmission
request leads to the suppression of others.
44. Hierarchical Feedback Control
- The essence of hierarchical reliable multicasting
- Each coordinator forwards the message to its children.
- A coordinator handles retransmission requests.
45. Virtual Synchrony (I)
- The logical organization of a distributed system
to distinguish between message receipt and
message delivery
46. Virtual Synchrony (II)
- The principle of virtual synchronous multicast.
47. Message Ordering (I)
- Three communicating processes in the same group.
The ordering of events per process is shown along
the vertical axis.
48. Message Ordering (II)
- Four processes in the same group with two
different senders, and a possible delivery order
of messages under FIFO-ordered multicasting
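FIFO-ordered delivery can be sketched with per-sender sequence numbers and a hold-back queue, which also illustrates the difference between receiving a message and delivering it. The class name and the use of sequence numbers starting at 0 are assumptions made for illustration.

```python
from collections import defaultdict

class FifoReceiver:
    """Delivers each sender's messages in the order they were sent; messages
    from different senders may be interleaved arbitrarily."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # sender -> next expected seq
        self.held = defaultdict(dict)      # sender -> {seq: msg} held back
        self.delivered = []                # messages passed to the application

    def receive(self, sender, seq, msg):
        self.held[sender][seq] = msg       # receipt: buffer the message
        while self.next_seq[sender] in self.held[sender]:
            s = self.next_seq[sender]      # delivery: only in sequence
            self.delivered.append(self.held[sender].pop(s))
            self.next_seq[sender] += 1

r = FifoReceiver()
r.receive("P1", 1, "m2")   # arrives early: held back
r.receive("P4", 0, "m3")   # other sender: delivered at once
r.receive("P1", 0, "m1")   # releases m1, then m2
print(r.delivered)         # ['m3', 'm1', 'm2']
```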
49. Implementing Virtual Synchrony (I)
50. Implementing Virtual Synchrony (II)
- Process 4 notices that process 7 has crashed, sends a view change
- Process 6 sends out all its unstable messages, followed by a flush message
- Process 6 installs the new view when it has received a flush message from everyone else