CS556: Distributed Systems

1
CS-556 Distributed Systems
Fault Tolerance (I)

Manolis Marazakis
maraz_at_csd.uoc.gr

2
The gossip architecture (I)

Replicate data close to points where groups of
clients need it
Periodic exchange of msgs among RMs
Front-ends send queries and updates to any RM they
choose
Any RM that is available can provide acceptable
response times
Consistent service over time
Relaxed consistency between replicas

3
The gossip architecture (II)

Causal update ordering
Forced ordering
Causal and total
A forced-order and a causal-order update that are
not related by the happened-before relation may be
applied in different orders at different RMs!
Immediate ordering
Updates are applied in a consistent order
relative to any other update at all RMs

4
The gossip architecture (III)

Bulletin board application example
Posting items -> causal order
Adding a subscriber -> forced order
Removing a subscriber -> immediate order
Gossip messages: updates among RMs
Front-ends maintain prev vector timestamp
One entry per RM
RMs respond with new vector timestamp
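
The front-end bookkeeping above can be sketched as follows. This is a minimal illustration, not the course's reference code: the names `leq`, `merge` and `FrontEnd` are hypothetical, and timestamps are assumed to be plain lists with one integer entry per RM.

```python
from typing import List

def leq(a: List[int], b: List[int]) -> bool:
    """Vector timestamp order: a <= b iff every entry of a is <= that of b."""
    return all(x <= y for x, y in zip(a, b))

def merge(a: List[int], b: List[int]) -> List[int]:
    """Component-wise maximum of two vector timestamps."""
    return [max(x, y) for x, y in zip(a, b)]

class FrontEnd:
    """Hypothetical front-end that tracks the latest data it has observed."""

    def __init__(self, n_rms: int) -> None:
        self.prev = [0] * n_rms          # one entry per RM

    def on_reply(self, new_ts: List[int]) -> None:
        # Fold the timestamp returned by the RM into prev.
        self.prev = merge(self.prev, new_ts)

fe = FrontEnd(3)
fe.on_reply([2, 0, 1])
fe.on_reply([1, 3, 0])
print(fe.prev)
```

Merging rather than overwriting matters because successive replies may come from different RMs, each ahead in different entries.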

5
State components of a gossip RM
6
Query operations in gossip

RM must return a value that is at least as recent
as the request's timestamp
Q.prev <= valueTS
List of pending query operations
Hold back until above condition is satisfied
RM can wait for missing updates
or request updates from the RMs concerned
RM's response includes valueTS
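
A minimal sketch of the hold-back rule for queries, under the assumption that the RM's value is a dictionary and that a held-back query is simply re-checked later. The class and method names (`QueryRM`, `query`, `release_pending`) are invented for illustration.

```python
def leq(a, b):
    """Vector timestamp order: a <= b component-wise."""
    return all(x <= y for x, y in zip(a, b))

class QueryRM:
    """Hypothetical RM that answers a query only when Q.prev <= valueTS."""

    def __init__(self, n_rms):
        self.value = {}
        self.value_ts = [0] * n_rms
        self.pending = []                 # held-back (q_prev, key) pairs

    def query(self, q_prev, key):
        if leq(q_prev, self.value_ts):
            # The response carries valueTS so the front-end can merge it.
            return self.value.get(key), list(self.value_ts)
        self.pending.append((q_prev, key))
        return None                       # held back for now

    def release_pending(self):
        """Call after applying updates; answers queries that became valid."""
        ready = [q for q in self.pending if leq(q[0], self.value_ts)]
        self.pending = [q for q in self.pending if not leq(q[0], self.value_ts)]
        return [(self.value.get(key), list(self.value_ts)) for _, key in ready]

rm = QueryRM(2)
first = rm.query([0, 1], "x")             # RM has not yet seen RM 1's update
rm.value["x"] = 7                         # simulate that update arriving
rm.value_ts = [0, 1]
released = rm.release_pending()
```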

7
Updates in causal order

RM-i checks to see if operation ID is in its
executed table or in its log
Discard update if it has already seen it
Increment i-th element of replica timestamp
Count of updates received from front-ends
Assign vector timestamp (ts) to the update
Replace i-th element of u.prev by i-th element of
replica timestamp
Insert log entry
<i, ts, u.op, u.prev, u.id>
Stability condition: u.prev <= valueTS
All updates on which u depends have been applied
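
The steps on this slide can be sketched roughly as below. It is an illustrative model only: `Update`, `CausalRM` and their fields are assumed names, the value is a dictionary, and an update's operation is a Python callable.

```python
from dataclasses import dataclass
from typing import Callable, List

def leq(a, b):
    return all(x <= y for x, y in zip(a, b))

def merge(a, b):
    return [max(x, y) for x, y in zip(a, b)]

@dataclass
class Update:
    uid: str                      # unique operation ID
    prev: List[int]               # front-end's timestamp when issuing u
    op: Callable[[dict], None]    # operation applied to the value

class CausalRM:
    def __init__(self, i, n_rms):
        self.i = i
        self.value = {}
        self.value_ts = [0] * n_rms
        self.replica_ts = [0] * n_rms
        self.log = []             # records <i, ts, u>
        self.executed = set()     # IDs of applied updates

    def receive_update(self, u):
        if u.uid in self.executed or any(r[2].uid == u.uid for r in self.log):
            return None           # already seen: discard
        self.replica_ts[self.i] += 1
        ts = list(u.prev)         # u.prev with the i-th entry replaced
        ts[self.i] = self.replica_ts[self.i]
        self.log.append((self.i, ts, u))
        self.apply_stable()
        return ts

    def apply_stable(self):
        progress = True
        while progress:           # applying one update may stabilise others
            progress = False
            for _, ts, u in self.log:
                if u.uid not in self.executed and leq(u.prev, self.value_ts):
                    u.op(self.value)
                    self.value_ts = merge(self.value_ts, ts)
                    self.executed.add(u.uid)
                    progress = True

rm0 = CausalRM(0, 2)
u1 = Update("a", [0, 0], lambda v: v.update(x=1))
u2 = Update("b", [0, 1], lambda v: v.update(y=2))  # depends on RM 1
ts1 = rm0.receive_update(u1)
ts2 = rm0.receive_update(u2)
```

Here `u2` stays in the log unapplied, because it depends on an update that only RM 1 has seen so far.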

8
Forced and immediate order

Unique sequence number is appended to update
timestamps
Primary RM acts as sequencer
Another RM can be elected to take over
consistently as sequencer
Majority of RMs (including primary) must record
which update is the next in sequence
Immediate ordering: the primary places immediate
updates in the same sequence as forced updates,
taking causal updates into account as well
Agreement protocol on sequence

9
Gossip timestamps

Gossip msgs between RMs contain
Replica timestamp and log
Receiver's tasks
Merge arriving log m.log with its own
Add record r to local log unless r.ts <= replicaTS
Apply any updates that have become stable
This may in turn make pending updates become
stable
Eliminate records from the log and entries from
the executed table
Once it is established that they have been
applied everywhere
Sort the set of stable updates in timestamp order
r is applied only if there is no s s.t. s.prev <
r.prev
tableTS[j] = m.ts (m received from RM j)
If tableTS[i][c] >= r.ts[c], for all i, then r is
discarded
c = RM that created record r
ACKs by front-ends to discard records from
executed table
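
A small sketch of two of the receiver's tasks: merging an incoming log and deciding when a record can be dropped. Records are modelled as dictionaries with assumed keys `id`, `creator` and `ts`; the function names are hypothetical.

```python
def leq(a, b):
    return all(x <= y for x, y in zip(a, b))

def merge_log(local_log, replica_ts, m_log):
    """Add records from m_log that the local replica has not yet received."""
    known = {r["id"] for r in local_log}
    for r in m_log:
        # Skip r if r.ts <= replicaTS: it is already reflected locally.
        if not leq(r["ts"], replica_ts) and r["id"] not in known:
            local_log.append(r)
            known.add(r["id"])

def can_discard(r, table_ts):
    """r may be dropped once every RM i has tableTS[i][c] >= r.ts[c]."""
    c = r["creator"]
    return all(row[c] >= r["ts"][c] for row in table_ts)

log = []
incoming = [
    {"id": "a", "creator": 0, "ts": [1, 0]},   # already covered locally
    {"id": "b", "creator": 1, "ts": [1, 1]},   # new to this replica
]
merge_log(log, [1, 0], incoming)

table_ts = [[1, 1], [1, 0]]        # what each RM is known to have received
before = can_discard(log[0], table_ts)
table_ts[1] = [1, 1]               # gossip from RM 1 raises its row
after = can_discard(log[0], table_ts)
```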

10
Update propagation

How long before all RMs receive an update ?
Frequency and duration of network partitions
Beyond the system's control!
Frequency of gossip msgs
Policy for choosing a gossip partner
Random
Weighted probabilities to favor near partners
Surprisingly robust !
But exhibits variable update propagation times
Deterministic
Simple function of the RM's state
E.g. examine the timestamp table and choose the RM
that appears to be the furthest behind in updates
received
Topological
Based on fixed arrangement of RMs into a graph
Ring, mesh, trees
Trade-off: amount of communication against higher
latencies and the possibility that a single failure
will affect other RMs
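
The three partner-selection policies can be sketched as below. This is an assumed formulation: "furthest behind" is approximated by the smallest sum of timestamp entries, and the topological policy is shown for a ring only. All function names are hypothetical.

```python
import random

def pick_random(peers, weights=None, rng=random):
    """Random policy; weights can favour nearby partners."""
    return rng.choices(peers, weights=weights, k=1)[0]

def pick_furthest_behind(table_ts):
    """Deterministic policy: the RM whose known timestamp is smallest."""
    return min(table_ts, key=lambda rm: sum(table_ts[rm]))

def ring_successor(me, ring):
    """Topological policy on a fixed ring of RMs."""
    return ring[(ring.index(me) + 1) % len(ring)]

table = {"B": [3, 2, 1], "C": [1, 0, 1]}
laggard = pick_furthest_behind(table)
neighbour = ring_successor("C", ["A", "B", "C"])
near = pick_random(["B", "C"], weights=[1, 0], rng=random.Random(0))
```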

11
Scalability concerns

2 messages per query (between front-end and RM)
Causal update
G updates per gossip message
2 + (R-1)/G messages exchanged (R = number of RMs)
Increasing G leads to
Fewer messages
but also worse delivery latencies
RM has to wait for more updates to arrive before
propagating them
Improvement by having read-only replicas
Provided that update/query ratio is low !
Updated by gossip msgs but do not receive
updates directly from front-ends
Can be situated close to client groups
Vector timestamps need only include updateable RMs

12
Dependability: Basic Concepts

Availability
Reliability
Safety
Maintainability

Fault -> Error -> Failure

Faults
Transient
Intermittent
Permanent

13
Failure Models
14
Failure detectors

Not necessarily reliable !
"P is here" message, every T sec, assuming a max.
message transmission delay D
Categorization of processes (hints)
suspected vs unsuspected
A process may be functioning correctly on the
other side of a partitioned network
or it could be slow to respond to probes
Reliable detection
unsuspected vs failed (crashed)
Feasible only in synchronous systems
It is possible to give different responses to
different processes
under different comm. conditions
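
A minimal sketch of an unreliable heartbeat detector, assuming each process sends a "P is here" message every T seconds and that a message is late once more than T + D has passed since the previous one. The class name and the explicit `now` argument (used instead of a real clock) are choices made for illustration.

```python
class HeartbeatDetector:
    """Hypothetical detector that only gives hints (suspected/unsuspected)."""

    def __init__(self, T: float, D: float) -> None:
        self.timeout = T + D       # period plus max. transmission delay
        self.last_seen = {}

    def heartbeat(self, process: str, now: float) -> None:
        self.last_seen[process] = now

    def status(self, process: str, now: float) -> str:
        last = self.last_seen.get(process)
        if last is None or now - last > self.timeout:
            return "suspected"     # may be crashed, slow, or partitioned
        return "unsuspected"

detector = HeartbeatDetector(T=5, D=1)
detector.heartbeat("p", now=0)
on_time = detector.status("p", now=6)
late = detector.status("p", now=6.5)
detector.heartbeat("p", now=7)     # a late message clears the suspicion
recovered = detector.status("p", now=8)
```

The last three lines show why the result is only a hint: a suspected process can turn out to be alive.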

15
Failure Masking by Redundancy (I)

Hide the occurrence of failures from other
processes, by redundancy
Information
Extra bits to allow recovery
Time
Transactions to allow abort/redo
Physical
Extra equipment to tolerate loss/malfunction of
some components
Voter circuitry
Voters are components too -> they may themselves
fail!

16
Failure Masking by Redundancy (II)

Triple modular redundancy (TMR)
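
A TMR voter can be modelled in software as a majority function over three inputs. This is only a sketch of the idea; `vote` is a hypothetical name, and raising an error when all three inputs differ is one possible design choice.

```python
from collections import Counter

def vote(a, b, c):
    """Return the value that at least two of the three replicas agree on."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count >= 2:
        return value
    raise ValueError("no majority: more than one replica is faulty")

masked = vote(1, 1, 0)     # one faulty replica is outvoted
```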

17
Flat vs Hierarchical Groups (I)
Process resilience by replicating processes into
groups
Group membership protocols
18
Flat vs Hierarchical Groups (II)

Flat groups
Symmetrical (no special roles)
No single point of failure
Complex operation protocols (e.g. voting)
Hierarchical groups
Coordinator is a single point of failure

19
Failure Masking and Replication

Having a group of identical processes allows us
to mask one or more faulty processes
Primary-backup protocols
Hierarchical organization
Replicated-write protocols
Flat process groups
Active replication
Quorum protocols

K-fault tolerant system
Fail-silent processes -> group size (k + 1)
Byzantine failures -> group size (2k + 1)

20
Coordination/Agreement

A set of processes must collaborate
or agree with one or more processes
without a fixed master/slave relationship
failure assumptions and failure detectors
Problems
mutual exclusion
election
multicast
reliability and ordering semantics
consensus
Byzantine agreement

21
Problems of Agreement

A set of processes need to agree on a value
(decision), after one or more processes have
proposed what that value (decision) should be
Examples
mutual exclusion, election, transactions
Processes may be correct, crashed, or they may
exhibit arbitrary (Byzantine) failures
Messages are exchanged on a one-to-one basis,
and they are not signed

22
Two Agreement Problems

Consensus problem: every process i proposes a
value vi, while in the undecided state. Process i
exchanges messages until it makes decision di and
moves to the decided state.
Termination: all correct processes must make a
decision
Agreement: same decision for all correct
processes
Integrity: if all correct processes proposed the same
value, any correct process decides that value
Byzantine generals problem: a commander
process i orders value v.
The lieutenant processes must agree on what the
commander ordered.
Processes may be faulty
provide wrong or contradictory messages
Integrity requirement
A distinguished process decides a value for
others to agree upon
Solution only exists if N > 3f, where f = number
of faulty processes

23
Consensus for 3 processes
24
The Two-Army Problem

How can two perfect processes reach agreement
about 1 bit of information
over an unreliable comm. channel?
Red army: 5000 troops
Blue armies 1 and 2: 3000 troops each
How can the blue armies reach agreement on when
to attack?
Their only means of communication is by sending
messengers
that may be captured by the enemy!
No solution!
Proof by contradiction: assume there is a
solution with a minimum number of messages.
The last message may be lost, so the protocol must
also work without it, which contradicts minimality.

25
Consensus: No-Failures Case
majority(v1, ..., vN) returns the most frequently
occurring value, or the special value ⊥ (undefined)
if no majority exists
Consensus via reliable multicast
For ordered values, min/max could be used instead
of majority
In general, if failures can occur it is not 100%
certain that consensus can be reached in finite
time!
Terminating Reliable Multicast (TRB): a single
process multicasts a msg, and all
correct processes must agree on that msg.
Even if the sender crashes, all correct processes
must deliver a special msg (Server-Fault)
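
For the failure-free case, consensus via reliable multicast can be sketched as below. The sketch assumes that "majority" means a strict majority of the N proposals and models ⊥ as Python's `None`; `majority` and `decide_all` are illustrative names.

```python
from collections import Counter

BOTTOM = None   # stands in for the special "no majority" value

def majority(values):
    """Most frequent value if it occurs in more than half of the inputs."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else BOTTOM

def decide_all(proposals):
    """Every process reliably multicasts its proposal, so all processes
    collect the same N values and evaluate the same function on them."""
    return [majority(proposals) for _ in proposals]

decisions = decide_all(["attack", "retreat", "attack"])
split = decide_all(["a", "b"])
```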
26
Relation among problems
A problem B reduces to a problem A if there is an
algorithm which transforms any algorithm for A
into an algorithm for B.
Synchronous systems: TRB is equivalent to
Consensus
Asynchronous systems: Consensus reduces to
TRB but not vice versa!
Asynchronous systems with crash failures:
Atomic Multicast is equivalent to Consensus
27
Consensus in synchronous systems
Duration of a round: max. delay of B-multicast
Up to f faulty processes
Dolev & Strong, 1983: any algorithm to reach
consensus despite up to f failures requires at
least (f + 1) rounds.
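
A rough simulation of round-based consensus with crash failures, to show why f + 1 rounds are used. It is a sketch under stated assumptions, not the algorithm from the slide's figure: processes relay newly learned values each round and decide on the minimum, and a crash is described by a hypothetical `crashes` map giving the round in which a process fails and the recipients it still reaches.

```python
def sync_consensus(proposals, f, crashes=None):
    """Run f + 1 rounds; return the decision of every surviving process."""
    crashes = crashes or {}                 # pid -> (round, recipients)
    n = len(proposals)
    values = [{v} for v in proposals]       # values known to each process
    sent = [set() for _ in range(n)]        # values already multicast
    alive = set(range(n))
    for rnd in range(1, f + 2):
        inbox = [set() for _ in range(n)]
        for p in sorted(alive):
            new = values[p] - sent[p]
            crash = crashes.get(p)
            # A process crashing in this round reaches only some peers.
            targets = crash[1] if crash and crash[0] == rnd else range(n)
            for q in targets:
                inbox[q] |= new
            sent[p] |= new
        alive -= {p for p, (r, _) in crashes.items() if r == rnd}
        for q in alive:
            values[q] |= inbox[q]
    return {p: min(values[p]) for p in alive}

# Process 1 crashes in round 1 after reaching only process 0.
with_crash = sync_consensus([3, 1, 2], f=1, crashes={1: (1, [0])})
no_crash = sync_consensus([3, 1, 2], f=1)
```

After round 1 only process 0 knows the value 1; the second round lets it pass that value on, so both survivors decide the same minimum.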
28
Byzantine agreement: synchronous systems
Faulty process
Nothing can be done to improve a correct
process's knowledge beyond the first stage.
It cannot tell which process is faulty.
3 says "1 says u"
Lamport et al, 1982: No solution for N = 3, f = 1
Pease et al, 1982: No solution for N <= 3f
(assuming private comm. channels)
29
Agreement in Faulty Systems (I)

The Byzantine generals problem for 3 loyal
generals and 1 traitor
(a) The generals announce their troop strengths
(b) The vectors that each general assembles based on
(a)
(c) The vectors that each general receives in step 3

Consensus by generals 1, 2, 4 -> (1, 2, UNKNOWN, 4)
30
Agreement in Faulty Systems (II)

The same as in previous slide, except now with 2
loyal generals and one traitor.

31
Byzantine agreement for N > 3f
Example with N = 4, f = 1
- 1st round: Commander sends a value to each lieutenant
- 2nd round: Each of the lieutenants sends the value it has
received to each of its peers
- A lieutenant receives a total of (N - 2) + 1
values, of which (N - 2) are correct
- By majority(), the correct lieutenants compute
the same value
In general, O(N^(f+1)) msgs
O(N^2) for signed msgs
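
The N = 4, f = 1 example can be simulated as below. The code is an illustrative sketch: process 0 is assumed to be the commander, `relay` is a hypothetical hook that lets one lieutenant lie, and `None` stands for the "no majority" result.

```python
from collections import Counter

def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None

def byzantine_round(n, commander_sends, relay):
    """commander_sends[i]: value the commander sends to lieutenant i.
    relay(i, j, v): what lieutenant i tells j it received (v is the truth)."""
    lieutenants = range(1, n)
    decisions = {}
    for j in lieutenants:
        received = [commander_sends[j]] + [
            relay(i, j, commander_sends[i]) for i in lieutenants if i != j
        ]
        decisions[j] = majority(received)   # (N - 2) + 1 values
    return decisions

def honest(i, j, v):
    return v

def lieutenant3_lies(i, j, v):
    return ("x" if j == 1 else "y") if i == 3 else v

# Case 1: lieutenant 3 is faulty, the commander is correct.
case1 = byzantine_round(4, {1: "v", 2: "v", 3: "v"}, lieutenant3_lies)
# Case 2: the commander is faulty and sends three different values.
case2 = byzantine_round(4, {1: "u", 2: "v", 3: "w"}, honest)
```

In case 1 the correct lieutenants 1 and 2 both decide the commander's value. In case 2 every lieutenant sees the same three distinct values and all reach the same "no majority" result, so agreement still holds.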
32
Impossibility of (deterministic) consensus in
asynchronous systems
M.J. Fischer, N. Lynch, and M. Paterson
Impossibility of distributed consensus with one
faulty process, J. ACM, 32(2), pp. 374-382,
1985.
A crashed process cannot be distinguished from a
slow one. Not even with a 100% reliable
comm. network!
There is always a chance that some continuation
of the processes' execution avoids consensus being
reached.
No guarantee for consensus, but Prob(consensus)
> 0
Solutions based on randomization or
(unreliable) failure detectors or by fault
masking
33
Reliable client-server communication
What about reliable point-to-point transport
protocols ?

TCP masks omission failures
by using ACKs and retransmissions
but it does not mask crash failures!
E.g. when a connection is broken, the client is
only notified via an exception

34
5 classes of failures in RPC

Client is unable to locate server
Binding exception
at the expense of transparency
Request message is lost
Is it safe to retransmit ?
Allow server to detect it is dealing with a retry
Server crashes after receiving a request
Reply message is lost
Client crashes after sending a request

35
Lost Request Messages and Server Crashes (I)

A server in client-server communication
Normal case
Crash after execution
Crash before execution

36
Server Crashes (II)

At-least-once semantics
Client keeps retransmitting until it gets a
response
At-most-once semantics
Give up immediately and report failure
Guarantee nothing

Ideal would be exactly-once semantics
no general way to arrange this !

37
Server Crashes (III)

Print server scenario
M: server's completion message
Server may send M either before or after printing
P: server's print operation
C: server's crash
Possible event orderings
M -> P -> C
M -> C (-> P)
P -> M -> C
P -> C (-> M)
C (-> P -> M)
C (-> M -> P)

38
Server Crashes (IV)

Different combinations of client and server
strategies in the presence of server crashes.

No combination of client and server strategies is
correct for all cases!
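
The claim can be checked by enumerating the cases. The sketch below assumes four client strategies (always reissue, never reissue, reissue only when the completion message M was received, reissue only when it was not) and assumes that a reissued request is printed exactly once after the server recovers. The dictionary and function names are invented for illustration.

```python
# Events completed before the crash, per server strategy.
SERVER_ORDERINGS = {
    "M -> P": ["MP", "M", ""],
    "P -> M": ["PM", "P", ""],
}

# Each client strategy maps "was M received?" to "reissue the request?".
CLIENT_STRATEGIES = {
    "always reissue": lambda acked: True,
    "never reissue": lambda acked: False,
    "reissue only when ACKed": lambda acked: acked,
    "reissue only when not ACKed": lambda acked: not acked,
}

def outcome(done, reissue):
    """Count how many times the text ends up printed."""
    prints = done.count("P") + (1 if reissue("M" in done) else 0)
    return ("ZERO", "OK", "DUP")[prints]

table = {
    (client, server): [outcome(done, reissue) for done in orderings]
    for client, reissue in CLIENT_STRATEGIES.items()
    for server, orderings in SERVER_ORDERINGS.items()
}

for (client, server), row in table.items():
    print(f"{client:28} {server}: {row}")
```

Every row contains at least one ZERO or DUP, which is the point of the slide: no pairing prints exactly once in all crash scenarios.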
39
Lost Reply Messages

Is it safe to retransmit the request ?
Idempotent requests
Example: read a file's first 1024 bytes
Counterexample: money transfer order
Assign sequence number to request
Server keeps track of each client's most recently
received sequence number
additionally, set a RETRANSMISSION bit in the
request header
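
A sketch of duplicate filtering with per-client sequence numbers, applied to the non-idempotent money transfer. Caching the previous reply so that it can be resent is an assumption added here, as are the names `TransferServer` and `handle`.

```python
class TransferServer:
    """Hypothetical server that executes each (client, seq) at most once."""

    def __init__(self, balance: int) -> None:
        self.balance = balance
        self.last = {}                 # client -> (seq, cached reply)

    def handle(self, client: str, seq: int, amount: int) -> int:
        seen = self.last.get(client)
        if seen is not None and seen[0] == seq:
            return seen[1]             # retransmission: replay, do not redo
        self.balance -= amount         # non-idempotent operation
        reply = self.balance
        self.last[client] = (seq, reply)
        return reply

server = TransferServer(balance=100)
first = server.handle("c1", seq=1, amount=30)
retry = server.handle("c1", seq=1, amount=30)   # reply was lost, client retries
second = server.handle("c1", seq=2, amount=10)
```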

40
Client Crashes (I)

Orphan computation
No process waiting for the result
Waste of resources (CPU cycles, locks)
Possible confusion upon the client's recovery
4 alternative strategies proposed by Nelson
(1981)
Extermination
Client keeps log of requests to be issued
Upon recovery, explicitly kill orphans
Overhead of logging (for every RPC)
Problems with grand-orphans
Problems with network partitions

41
Client Crashes (II)

Reincarnation
Divide time up into epochs (sequentially
numbered)
Upon reboot, client broadcasts start-of-epoch
Upon receipt, all remote computations on behalf
of this client are killed
After a network partition, an orphan's response
will contain an obsolete epoch number -> easily
detected
Gentle reincarnation
Upon receipt of start-of-epoch, each server
checks to see if it has any remote computations
If the owner cannot be found, the computation is
killed
Expiration
Each RPC is given a time quantum T to complete
must explicitly ask for another if it cannot
finish in time
After reboot, client only needs to wait a time T
How to select a reasonable value for T ?

42
Basic Reliable-Multicasting Schemes

A simple solution to reliable multicasting when
all receivers are known and are assumed not to fail
(a) Message transmission
(b) Reporting feedback

43
Nonhierarchical Feedback Control

Several receivers have scheduled a request for
retransmission, but the first retransmission
request leads to the suppression of others.

44
Hierarchical Feedback Control

The essence of hierarchical reliable
multicasting
Each coordinator forwards the message to its
children.
A coordinator handles retransmission requests.

45
Virtual Synchrony (I)

The logical organization of a distributed system
to distinguish between message receipt and
message delivery

46
Virtual Synchrony (II)

The principle of virtual synchronous multicast.

47
Message Ordering (I)

Three communicating processes in the same group.
The ordering of events per process is shown along
the vertical axis.

48
Message Ordering (II)

Four processes in the same group with two
different senders, and a possible delivery order
of messages under FIFO-ordered multicasting
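
FIFO-ordered delivery can be sketched with a hold-back queue that separates message receipt from message delivery. The sketch assumes each sender numbers its messages from 1; `FifoReceiver` is a hypothetical name.

```python
class FifoReceiver:
    """Delivers each sender's messages in the order they were sent."""

    def __init__(self) -> None:
        self.next_seq = {}     # sender -> next sequence number expected
        self.held = {}         # (sender, seq) -> received, not yet delivered
        self.delivered = []

    def receive(self, sender: str, seq: int, msg: str) -> None:
        expected = self.next_seq.get(sender, 1)
        if seq < expected:
            return             # duplicate of an already delivered message
        self.held[(sender, seq)] = msg
        while (sender, expected) in self.held:
            self.delivered.append(self.held.pop((sender, expected)))
            expected += 1
        self.next_seq[sender] = expected

r = FifoReceiver()
r.receive("p1", 2, "m2")       # arrives early: held back
early = list(r.delivered)
r.receive("p4", 1, "m3")       # other sender: no ordering constraint
r.receive("p1", 1, "m1")       # releases m1, then the held m2
```

Messages from different senders may interleave in any way; only the per-sender order is enforced.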

49
Implementing Virtual Synchrony (I)
50
Implementing Virtual Synchrony (II)

Process 4 notices that process 7 has crashed,
sends a view change
Process 6 sends out all its unstable messages,
followed by a flush message
Process 6 installs the new view when it has
received a flush message from everyone else
