Title: Replicated Distributed Systems
1. Replicated Distributed Systems
2. Overview
- Introduction and Background (Queenie)
- A Model of Replicated Distributed Programs
- Implementing Distributed Modules and Threads
- Implementing Replicated Procedure Calls (Alix)
- Performance Analysis
- Concurrency Control (Joanne)
- Binding Agents
- Troupe Reconfiguration
3. Background
- Presents a new software architecture for fault-tolerant distributed programs
- Designed by Eric C. Cooper, a co-founder of FORE Systems, a leading supplier of networks for enterprises and service providers
4. Introduction
- Goal: address the problem of constructing highly available distributed programs
- Tolerate crashes of the underlying hardware automatically
- Continue to operate despite failure of components
- First approach: replicate each component of the system, proposed by von Neumann (1955)
- Drawback: costly, since reliable hardware is needed everywhere
5. Introduction (cont'd)
- Eric C. Cooper's new approach
- Replication on a per-module basis
- Flexible, without burdening the programmer
- Provides location and replication transparency to the programmer
- Fundamental mechanisms
- Troupe: a replicated module
- Troupe members: the replicas
- Replicated procedure call: many-to-many communication between troupes
6. Introduction (cont'd)
- Important properties give this mechanism flexibility and power
- Individual members of a troupe do not communicate among themselves
- Members are unaware of one another's existence
- Each troupe member behaves as if there were no replicas
7. A Model of Replicated Distributed Programs
[Figure: a model of a replicated distributed program, showing modules (procedures and state information) replicated as troupes]
8. A Model of Replicated Distributed Programs (cont'd)
- Module
- Packages the procedures and state information needed to implement a particular abstraction
- Separates the interface to that abstraction from its implementation
- Expresses the static structure of a program when it is written
9. A Model of Replicated Distributed Programs (cont'd)
- Threads
- Each thread has a unique identifier (thread ID)
- A particular thread runs in exactly one module at a given time
- Multiple threads may run in the same module concurrently
10. Implementing Distributed Modules and Threads
- No machine boundaries
- Provides location transparency: the programmer does not need to know the eventual configuration of a program
- Module
- Implemented by a server whose address space contains the module's procedures and data
- Thread
- Implemented by using remote procedure calls to transfer control from server to server
11. Adding Replication
- Processor and network failures affect the distributed program
- Partial failures
- Solution: replication
- Introduce replication transparency at the module level
12. Adding Replication (cont'd)
- Assumption: troupe members execute on fail-stop processors
- If not → complex agreement protocols are required
- Replication transparency in the troupe model is guaranteed because all troupes are deterministic (same input → same output)
13. Troupe Consistency
- When all its members are in the same state → the troupe is consistent
- → Its clients do not need to know that it is replicated
- → Replication transparency
14. Troupe Consistency (cont'd)
[Figure: execution of a remote procedure call (I), between client and server]
15. Troupe Consistency (cont'd)
[Figure: execution of a remote procedure call (II), between client and server]
16. Execution of a Procedure Call
- Viewed as a tree of procedure invocations
- The invocation trees rooted at each troupe member are identical
- The server troupe members make the same procedure calls and returns, with the same arguments and results
- All troupes are initially consistent
- → All troupes remain consistent
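The consistency argument above can be illustrated with a tiny sketch (plain Python; the name `apply_call` is illustrative, not from the Circus system): a deterministic procedure applied to identical initial states leaves all replicas in identical states.

```python
# Determinism is what keeps a troupe consistent: a pure procedure
# applied to identical initial states yields identical new states.

def apply_call(state, args):
    """A deterministic procedure: same input -> same output."""
    return state + sum(args)

replicas = [10, 10, 10]                    # an initially consistent troupe
replicas = [apply_call(s, (1, 2)) for s in replicas]
assert len(set(replicas)) == 1             # all members remain identical
```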
17. Replicated Procedure Calls
- Goal: allow distributed programs to be written in the same style as conventional programs for centralized computers
- Replicated procedure call = remote procedure call with exactly-once execution at all troupe members
18. Circus Paired Message Protocol
- Characteristics
- Paired messages (e.g. call and return)
- Reliably delivered
- Variable length
- Call sequence numbers
- Based on RPC
- Uses UDP, the DARPA User Datagram Protocol
- Connectionless, but with retransmission
19. Implementing Replicated Procedure Calls
- Implemented on top of the paired message layer
- Two subalgorithms in the many-to-many call
- One-to-many
- Many-to-one
- Implemented as part of the run-time system that is linked with each user's program
21. One-to-Many Calls
- The client half of the RPC performs a one-to-many call
- Purpose: guarantee that the procedure is executed at each server troupe member
- The same call message is sent with the same call number
- The client waits for return messages
- In Circus, the client waits for all the return messages before proceeding
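The one-to-many half can be sketched as a single-process simulation (a minimal sketch, assuming deterministic server members; `ServerMember` and `one_to_many_call` are illustrative names, not the Circus API):

```python
# Simulation of the one-to-many half of a replicated call: the client
# sends the SAME call message (same call number) to every server troupe
# member and waits for ALL return messages before proceeding.

def one_to_many_call(servers, call_number, proc, args):
    """Send one call message per server member; collect every return."""
    call_msg = {"call": call_number, "proc": proc, "args": args}
    returns = []
    for server in servers:              # same message to each troupe member
        returns.append(server.handle(dict(call_msg)))
    return returns                      # Circus waits for the whole set

class ServerMember:
    """A deterministic server replica: same input -> same output."""
    def __init__(self, procs):
        self.procs = procs
    def handle(self, msg):
        result = self.procs[msg["proc"]](*msg["args"])
        return {"call": msg["call"], "result": result}

troupe = [ServerMember({"add": lambda a, b: a + b}) for _ in range(3)]
replies = one_to_many_call(troupe, call_number=1, proc="add", args=(2, 3))
# Determinism makes every reply identical, so the troupe stays consistent.
assert all(r == {"call": 1, "result": 5} for r in replies)
```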
23. Synchronization Point
- After all the server troupe members have returned:
- Each client troupe member knows that all server troupe members have performed the procedure
- Each server troupe member knows that all client troupe members have received the result
24. Many-to-One Calls
- The server will receive call messages from each client troupe member
- The server executes the procedure only once
- Returns the results to all the client troupe members
- Two problems
- Distinguishing between unrelated call messages
- How many other call messages are expected?
- Circus waits for all clients to send a call message before proceeding
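The many-to-one half (Circus default: wait for the whole call set, execute once, reply to everyone) can be sketched as follows; `ServerSide` and the message shape are illustrative assumptions, not the real implementation:

```python
class ServerSide:
    """Sketch of the many-to-one half: a server member collects one call
    message per client troupe member, executes the procedure only once,
    and returns the same result to every client member."""
    def __init__(self, client_troupe_size, procs):
        self.expected = client_troupe_size
        self.procs = procs
        self.pending = {}   # call number -> call messages received so far

    def receive_call(self, msg):
        msgs = self.pending.setdefault(msg["call"], [])
        msgs.append(msg)
        if len(msgs) < self.expected:
            return None                   # wait for the rest (Circus default)
        result = self.procs[msg["proc"]](*msg["args"])  # executed exactly once
        del self.pending[msg["call"]]
        # one identical return message per client troupe member
        return [{"call": msg["call"], "result": result}] * self.expected

server = ServerSide(client_troupe_size=2, procs={"double": lambda x: 2 * x})
assert server.receive_call({"call": 7, "proc": "double", "args": (21,)}) is None
replies = server.receive_call({"call": 7, "proc": "double", "args": (21,)})
assert replies == [{"call": 7, "result": 42}, {"call": 7, "result": 42}]
```

The call number is what distinguishes related call messages from unrelated ones, answering the first problem on the slide.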
26. Many-to-Many Calls
- A replicated procedure call from a client troupe to a server troupe is called a many-to-many call
27. Many-to-Many Steps
- A call message is sent from each client troupe member to each server troupe member.
- A call message is received by each server troupe member from each client troupe member.
- The requested procedure is run on each server troupe member.
- A return message is sent from each server troupe member to each client troupe member.
- A return message is received by each client troupe member from each server troupe member.
28. Multicast Implementation
- Dramatic difference in efficiency
- Suppose m client troupe members and n server troupe members
- Point-to-point
- m × n messages sent
- Multicast
- m + n messages sent
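Assuming one multicast per sender, the counts above work out as in this small arithmetic sketch (function names are illustrative):

```python
def point_to_point_messages(m, n):
    """Each of m client members sends a call to each of n server members,
    and each server member replies to each client member."""
    return m * n + n * m

def multicast_messages(m, n):
    """Each client member multicasts one call message and each server
    member multicasts one return message."""
    return m + n

# For 3x3 troupes: 18 point-to-point messages versus 6 multicasts.
assert point_to_point_messages(3, 3) == 18
assert multicast_messages(3, 3) == 6
```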
29. Waiting for Messages to Arrive
- Troupes are assumed to be deterministic; therefore all messages in a set are assumed to be identical
- When should computation proceed?
- As soon as the first message arrives, or only after the entire set arrives?
30. Waiting for All Messages
- Able to provide error detection and error correction
- Inconsistencies are detected
- Execution time is determined by the slowest member of each troupe
- Default in the Circus system
31. First-Come Approach
- Forfeits error detection
- Computation proceeds as soon as the first message in each set arrives
- Execution time is determined by the fastest member of each troupe
- Requires a simple change to the one-to-many call protocol
- The client can use the call sequence number to discard return messages from slow server troupe members
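The client-side change described above amounts to remembering which call numbers have already been answered (a minimal sketch; `ClientMember` is an illustrative name):

```python
class ClientMember:
    """First-come variant: proceed on the first return message for a call
    number, and use the call sequence number to discard later returns
    arriving from slow server troupe members."""
    def __init__(self):
        self.answered = set()   # call sequence numbers already satisfied

    def receive_return(self, msg):
        if msg["call"] in self.answered:
            return None          # duplicate from a slow server member: discard
        self.answered.add(msg["call"])
        return msg["result"]     # first arrival: computation proceeds

c = ClientMember()
assert c.receive_return({"call": 5, "result": "fast"}) == "fast"
assert c.receive_return({"call": 5, "result": "slow"}) is None
```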
32. First-Come Approach (cont'd)
- More complicated changes are required in the many-to-one call protocol
- When a call message from another member arrives, the server cannot execute the procedure again
- Doing so would violate exactly-once execution
- The server must retain the return messages until all other call messages have been received from the client troupe members
- A return message is sent as soon as each call message is received
- Execution seems instantaneous to the client
33. A Better First-Come Approach
- Buffer messages at the client rather than at the server
- The server broadcasts return messages to the entire client troupe after the first call message
- A client troupe member may receive a return message before sending its call message
- The return message is retained until the client troupe member is ready to send the call message
34. Advantages of Buffering at the Client
- The work of buffering return messages and pairing them with call messages is placed on the client rather than on a shared server
- The server can broadcast rather than use point-to-point communication
- No communication is required by a slow client
35. What About Error Detection?
- To provide error detection and still allow computation to proceed, a watchdog scheme can be used
- Create another thread of control after the first message is received
- This thread watches for the remaining messages and compares them
- If there is an inconsistency, the main computation is aborted
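The watchdog scheme can be sketched with a background thread that compares stragglers against the first message (a simplified sketch using Python threads; the abort is modeled as setting a flag rather than actually killing the main computation):

```python
import queue
import threading

def watchdog(first_msg, remaining, inconsistent):
    """Compare late-arriving messages against the first one received;
    flag any mismatch so the main computation can be aborted."""
    while True:
        msg = remaining.get()
        if msg is None:               # sentinel: the message set is complete
            return
        if msg != first_msg:
            inconsistent.set()        # inconsistency detected -> abort

remaining = queue.Queue()
flag = threading.Event()
t = threading.Thread(target=watchdog, args=({"result": 5}, remaining, flag))
t.start()                             # main computation proceeds meanwhile
remaining.put({"result": 5})          # consistent replica
remaining.put({"result": 6})          # faulty replica -> inconsistency
remaining.put(None)
t.join()
assert flag.is_set()                  # the mismatch was caught
```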
36. Crashes and Partitions
- The underlying message protocol uses probing and timeouts to detect crashes
- It relies on network connectivity and therefore cannot distinguish between crashes and network partitions
- To prevent troupe members from diverging:
- Require that each troupe member receives a majority of the expected set of messages
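The majority requirement is a one-line predicate (a trivial sketch; `may_proceed` is an illustrative name):

```python
def may_proceed(received, expected):
    """A troupe member proceeds only if it has received a strict majority
    of the expected set of messages; under a partition, at most one side
    can hold a majority, so the troupe cannot diverge."""
    return len(received) > expected / 2

assert may_proceed(received=["m1", "m2"], expected=3)      # 2 of 3: majority
assert not may_proceed(received=["m1"], expected=3)        # 1 of 3: minority
assert not may_proceed(received=["m1"], expected=2)        # 1 of 2: no majority
```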
37. Collators
- The determinism requirement can be relaxed by allowing programmers to reduce a set of messages into a single message
- A collator maps a set of messages into a single result
- A collator needs enough messages to make a decision
- Three kinds
- Unanimous
- Majority
- First-come
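The three kinds of collator can be sketched as plain functions over a list of replies (illustrative signatures; the real collators operate on message sets inside the run-time system):

```python
from collections import Counter

def unanimous(msgs):
    """Return the common value only if every troupe member agrees."""
    if len(set(msgs)) != 1:
        raise ValueError("inconsistent replies")
    return msgs[0]

def majority(msgs):
    """Return the value reported by a strict majority of members, if any."""
    value, count = Counter(msgs).most_common(1)[0]
    if count <= len(msgs) / 2:
        raise ValueError("no majority")
    return value

def first_come(msgs):
    """Return the first reply; fastest, but forfeits error detection."""
    return msgs[0]

assert unanimous(["ok", "ok", "ok"]) == "ok"
assert majority(["ok", "ok", "bad"]) == "ok"
assert first_come(["ok", "bad", "bad"]) == "ok"
```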
38. Performance Analysis
- Experiments conducted at Berkeley during an inter-semester break
- Measured the cost of replicated procedure calls as a function of the degree of replication
- UDP and TCP echo tests used as a comparison
39. Performance Analysis (cont'd)
- Performance of UDP, TCP, and Circus
- The TCP echo test is faster than the UDP echo test
- The cost of TCP connection establishment is ignored
- The UDP test makes two alarm calls and therefore two setitimer calls
- The read and write interface to TCP is more streamlined
41. Performance Analysis (cont'd)
- An unreplicated Circus remote procedure call requires almost twice the time of a simple UDP exchange
- Due to the extra system calls required to handle Circus
- Elaborate code to handle multi-homed machines
- Some Berkeley machines had as many as 4 network addresses
- A design oversight by Berkeley, not a fundamental problem
42. Performance Analysis (cont'd)
- The expense of a replicated procedure call increases linearly as the degree of replication increases
- Each additional troupe member adds between 10 and 20 milliseconds
- Smaller than the time for a UDP datagram exchange
44. Performance Analysis (cont'd)
- An execution profiling tool was used to analyze the Circus implementation in finer detail
- Six Berkeley 4.2BSD system calls account for more than half the total CPU time to perform a replicated call
- Most of the time required for a Circus replicated procedure call is spent in the simulation of multicasting
47. Concurrency Control
- The server troupe handles calls from different clients using multiple threads
- Conflicts arise when concurrent calls need to access the same resource
48. Concurrency Control (cont'd)
- Serialization at each troupe member
- Local concurrency control algorithms
- Serialization in the same order among members
- Preserves troupe consistency
- Need coordination between the replicated procedure call mechanism and the synchronization mechanism
- → Replicated transactions
49. Replicated Transactions
- Requirements
- Serializability
- Atomicity
- Ensure that aborting a transaction does not affect other concurrently executing transactions
- Two-phase locking with unanimous update
- Drawback: too strict
- Troupe commit protocol
50. Troupe Commit Protocol
- Before a server troupe member commits (or aborts) a transaction, it invokes the ready_to_commit remote procedure call-back at the client troupe
- The client troupe returns whether it agrees to commit (or abort) the transaction
- If server troupe members serialize transactions in different orders, a deadlock will occur
- Detecting conflicting serialization orders is thereby converted to deadlock detection
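Why different serialization orders manifest as deadlock can be sketched with a wait-for graph (a simplified model, not the Circus implementation; node names follow the example on the next slides):

```python
# If server troupe members serialize transactions in different orders,
# the ready_to_commit call-back waits form a cycle in the wait-for graph.

def has_cycle(wait_for):
    """Detect a cycle in a wait-for graph given as {waiter: waited_on}."""
    for start in wait_for:
        seen, node = set(), start
        while node in wait_for:
            if node in seen:
                return True
            seen.add(node)
            node = wait_for[node]
    return False

# SM1 committing T1 waits on C1; C1 waits on the call-back from SM2;
# SM2 committing T2 waits on C2; C2 waits on SM1 -> deadlock detected.
assert has_cycle({"SM1": "C1", "C1": "SM2", "SM2": "C2", "C2": "SM1"})
# Same serialization order at both members: no cycle, commits proceed.
assert not has_cycle({"SM1": "C1", "SM2": "C1"})
```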
51. An Example of the Troupe Commit Protocol
- Two server troupe members, SM1 and SM2
- Two client troupes, C1 and C2
- C1 performs transaction T1 and C2 performs transaction T2
52. An Example of the Troupe Commit Protocol (cont'd)
- Scenario 1: T1 and T2 are serialized in the same order, say T1 first and T2 second, on both SM1 and SM2
[Figure: message diagram; step 3 commits T1, step 6 commits T2]
53. An Example of the Troupe Commit Protocol (cont'd)
- Scenario 1 (cont'd)
- SM1 and SM2 call ready_to_commit first at C1, passing true as the argument
- C1 returns true to both SM1 and SM2
- SM1 and SM2 commit T1
- SM1 and SM2 commit T2 by repeating steps (1)–(3)
54. An Example of the Troupe Commit Protocol (cont'd)
- Scenario 2: T1 and T2 are serialized in different orders, say SM1 wants to commit T1 first and SM2 wants to commit T2 first. If the transactions are committed, SM1 and SM2 will be inconsistent
55. An Example of the Troupe Commit Protocol (cont'd)
[Figure: message diagram; SM1 and SM2 each issue a ready_to_commit(true) call-back]
56. An Example of the Troupe Commit Protocol (cont'd)
- Scenario 2 (cont'd)
- SM1 calls ready_to_commit at C1 and SM2 calls ready_to_commit at C2
- C1 will not return any value because it is waiting for the call-back from SM2; the same thing happens to C2
- Without return values from C2, SM2 cannot commit T2 or proceed with T1; neither can SM1
- DEADLOCK! → The different serialization orders are detected
57. An Example of the Troupe Commit Protocol (cont'd)
- Scenario 3: T1 and T2 are serialized in different orders; however, committing them will NOT leave SM1 and SM2 in inconsistent states
- SM1 and SM2 call ready_to_commit at C1 and C2 in parallel
- Both server troupe members commit T1 and T2 in parallel after C1 and C2 return true
58. An Example of the Troupe Commit Protocol (cont'd)
[Figure: message diagram; step 3 commits T1 and T2]
59. Binding
61. Binding for Replicated Programs
- Cache invalidation problem
- A client's binding information becomes stale
- Causes
- A server troupe member or an entire troupe is no longer available
- The specified interface is no longer exported
- A new troupe member is added
62. Binding for Replicated Programs (cont'd)
- Cache invalidation detection
- The paired message protocol can detect missing troupe members
- The remote procedure call level can detect a non-exported interface
- Added troupe members CANNOT be detected by clients alone → need help from binding agents
63. Binding for Replicated Programs (cont'd)
- How is a new troupe member added?
- Assume the new member is already in the same state as the other members
- The new member calls the add_troupe_member procedure at the binding agent
- The binding agent invokes the set_troupe_id procedure at each troupe member
64. Binding for Replicated Programs (cont'd)
- Result of adding a new troupe member
- The updated troupe contains the new member
- The troupe ID is changed
- Clients detect this update by finding that the original server troupe ID is no longer valid
65. Binding for Replicated Programs (cont'd)
- Cache invalidation recovery
- Clients call rebind at the binding agent
- Clients update their binding information
- The binding agent may garbage-collect unavailable servers hinted at by this call
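The recovery path can be sketched as follows (a minimal model of a binding agent; the class layout and the `export` helper are assumptions for illustration, only `rebind` is named in the slides):

```python
class BindingAgent:
    """Sketch of cache-invalidation recovery: a client whose cached
    troupe ID is stale calls rebind to fetch fresh binding information."""
    def __init__(self):
        self.bindings = {}          # interface name -> (troupe_id, members)

    def export(self, interface, troupe_id, members):
        self.bindings[interface] = (troupe_id, list(members))

    def rebind(self, interface, stale_troupe_id):
        # The stale ID also hints which servers may be garbage-collected.
        troupe_id, members = self.bindings[interface]
        return troupe_id, members

agent = BindingAgent()
agent.export("calendar", troupe_id=2, members=["hostA", "hostB", "hostC"])
# A client cached troupe ID 1; its calls fail, so it rebinds:
new_id, members = agent.rebind("calendar", stale_troupe_id=1)
assert new_id == 2 and members == ["hostA", "hostB", "hostC"]
```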
66. Troupe Reconfiguration
- Recovery from partial failure
- Replace a broken troupe member with a new one
- Similar to the problem of adding a new troupe member
67. Troupe Reconfiguration Steps
- Performed as an atomic transaction
- Bring the new member into a state consistent with the other members
- get_state procedure call
- Add the new member at the binding agent
- add_troupe_member
- set_troupe_id
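The reconfiguration steps above can be sketched in one place (a simplified single-process model; the `Troupe` class and the ID-bump are illustrative assumptions, with get_state modeled as copying an existing member's state):

```python
class Troupe:
    """Sketch of a troupe: member states plus the current troupe ID."""
    def __init__(self, members, troupe_id):
        self.members = members       # member name -> state dictionary
        self.troupe_id = troupe_id

def add_troupe_member(troupe, new_member):
    """Model of the reconfiguration steps, as one atomic sequence."""
    # 1. get_state: bring the new member into a consistent state by
    #    copying the state of an existing member.
    state = next(iter(troupe.members.values()))
    troupe.members[new_member] = dict(state)
    # 2. add_troupe_member / set_troupe_id: the binding agent installs a
    #    new troupe ID at every member, so clients holding the old ID
    #    find it invalid and rebind.
    troupe.troupe_id += 1

t = Troupe({"hostA": {"x": 1}, "hostB": {"x": 1}}, troupe_id=7)
add_troupe_member(t, "hostC")
assert t.members["hostC"] == {"x": 1}   # consistent with the others
assert t.troupe_id == 8                 # old client bindings are now stale
```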