Title: CS556: Distributed Systems
1. CS-556 Distributed Systems
Consistency & Replication (III)
- Manolis Marazakis
- maraz_at_csd.uoc.gr
2. Fault Tolerance?
- Define correctness criteria
- When 2 replicas are separated by a network partition:
  - Both are deemed incorrect & stop serving.
  - One (the master) continues; the other ceases service.
  - One (the master) continues to accept updates; both continue to supply reads (of possibly stale data).
  - Both continue service & subsequently synchronise.
3. Fault Tolerance
- Design to recover after a failure w/o loss of (committed) data.
- Designs for fault tolerance:
  - Single server, fail and recover
  - Primary server with trailing backups
  - Replicated service
4. Network Partitions
- Separate but viable groups of servers
- Optimistic schemes validate on recovery:
  - Available copies with validation
- Pessimistic schemes limit availability until recovery
5. Replication under Partitions
- Available copies with validation
  - Validation involves:
    - Aborting conflicting Txs
    - Compensations
  - Precedence graphs for detecting inconsistencies
- Quorum consensus
  - Version number per replica
  - Operations should only be applied to replicas with the current version number
- Virtual partition
6. Transactions with Replicated Data
- Better performance:
  - Concurrent service
  - Reduced latency
- Higher availability
- Fault tolerance
- What if a replica fails or becomes isolated?
  - Upon rejoining, it must catch up
- Replicated transaction service:
  - Data replicated at a set of replica managers
  - Replication transparency
  - One-copy serializability
  - Read one, write all
- Failures must be observed to have happened before any active Txs at other servers
7. Active Replication (I)
- The problem of replicated invocations.
8. Active Replication (II)
- (a) Forwarding an invocation request from a replicated object.
- (b) Returning a reply to a replicated object.
9. Active Replication (III)
10. Active Replication (IV)
- RMs are state machines with equivalent roles
- Front ends communicate the client requests to the RM group, using totally ordered reliable multicast
- RMs process requests independently & reply to the front end (correct RMs process each request identically)
- The front end can synthesize the final response to the client (tolerating Byzantine failures)
- Active replication provides sequential consistency if the multicast is reliable & ordered
- Byzantine failures (F out of 2F+1): the front end waits until it gets F+1 identical responses (see the sketch below)
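A minimal sketch of that vote-counting step at the front end; iterating over replies in arrival order and the function name are assumptions for illustration:

    from collections import Counter

    def synthesize_response(responses, f):
        """Mask up to f Byzantine RMs out of 2f+1: return a reply once
        f+1 identical responses have arrived (they cannot all be faulty)."""
        counts = Counter()
        for reply in responses:        # replies from RMs, in arrival order
            counts[reply] += 1
            if counts[reply] >= f + 1:
                return reply
        raise RuntimeError("fewer than f+1 identical responses")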
11. Available Copies Replication
[Figure: two client transactions issue getBalance(A), deposit(B), deposit(A), getBalance(B) against the replica managers]
- Not all copies will always be available.
- Failures:
  - Timeout at a failed replica
  - Rejected by a recovering, unsynchronised replica
12. Gifford's quorum scheme (I)
- Version numbers or timestamps per copy
- A number of votes is assigned to each physical copy:
  - weight related to demand for a particular copy
  - totV(g): total number of votes for group g of RMs
  - totV: total votes
- Obtain a quorum before read/write:
  - R votes before read
  - W votes before write
  - W > totV/2
  - R + W > totV(g)
- Any quorum pair must contain common copies
- In case of partition, it is not possible to perform conflicting operations on the same copy
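The two constraints are easy to check mechanically; this small sketch (valid_quorum is a hypothetical helper, not part of Gifford's scheme) encodes them:

    def valid_quorum(r, w, total_votes):
        """Gifford's constraints: W > totV/2 rules out write-write
        conflicts; R + W > totV forces every read quorum to intersect
        every write quorum, so a read always sees a current copy."""
        return w > total_votes / 2 and r + w > total_votes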
13. Gifford's quorum scheme (II)
- Read:
  - Version-number inquiries to find a set g of RMs with totV(g) >= R
  - Not all copies need to be up-to-date
  - Every read quorum contains at least one current copy
- Write:
  - Version-number inquiries to find a set g of RMs of up-to-date copies with totV(g) >= W
  - If there are insufficient up-to-date copies, replace a non-current copy with a copy of the current copy
- Groups of RMs can be configured to provide different performance/reliability characteristics (see the read sketch below):
  - Decrease W to improve writes
  - Decrease R to improve reads
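A sketch of the read path under these rules, assuming a hypothetical Replica record with votes, version, and value fields:

    from dataclasses import dataclass

    @dataclass
    class Replica:
        votes: int
        version: int
        value: object

    def quorum_read(replicas, r):
        """Collect version-number replies until the votes reach R, then
        read from a current copy inside the quorum."""
        quorum, votes = [], 0
        for rep in replicas:            # version-number inquiries
            quorum.append(rep)
            votes += rep.votes
            if votes >= r:
                break
        if votes < r:
            raise RuntimeError("read quorum unavailable")
        # R + W > totV guarantees the quorum holds at least one current copy.
        return max(quorum, key=lambda rep: rep.version).value

A write would analogously assemble W votes' worth of up-to-date copies (refreshing stale ones first), apply the update, and increment the version number.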
14. Gifford's quorum scheme (III)
- Performance penalty for reads
  - Due to the need for collecting a read quorum
- Support for copies on local disks of clients
  - Assigned zero votes: weak representatives
  - These copies cannot be included in a quorum
  - After obtaining a read quorum, a read may be carried out on the local copy if it is up-to-date
- Blocking probability
  - In some cases, a quorum cannot be obtained
15. Gifford's quorum scheme (IV)
- Ex. 1: file with a high read/write ratio
- Ex. 2: file with a moderate read/write ratio; reads can be satisfied by the local RM, but writes must also access one remote RM
- Ex. 3: file with a very high read/write ratio
- The examples assume 99% availability for RMs
16. Quorum-Based Protocols
- Three examples of the voting algorithm (checked in the sketch below):
  - (a) A correct choice of read & write sets
  - (b) A choice that may lead to write-write conflicts
  - (c) A correct choice, known as ROWA (read one, write all)
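The three cases follow directly from the constraints of slide 12; the concrete N = 12 single-vote configuration below is an assumed illustration, not data from the slides:

    # Hypothetical setting: N = 12 replicas, one vote each.
    N = 12
    examples = {
        "(a) correct choice":       (3, 10),
        "(b) write-write conflict": (7, 6),    # W <= N/2 violates W > N/2
        "(c) ROWA":                 (1, 12),   # read one, write all
    }
    for label, (r, w) in examples.items():
        ok = w > N / 2 and r + w > N
        print(label, "R=%d W=%d:" % (r, w), "valid" if ok else "invalid")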
17. Passive Replication (I)
- At any time, the system has a single primary RM
- One or more secondary backup RMs
- Front ends communicate with the primary; the primary executes requests & propagates the response to all backups
- If the primary fails, one backup is promoted to primary
- The new primary starts from the Coordination phase for each new request
- What happens if the primary crashes before/during/after the agreement phase?
18. Passive Replication (II)
19. Passive Replication (III)
- Satisfies linearizability
- The front end looks up the new primary when the current primary does not respond
- The primary RM is a performance bottleneck
- Can tolerate F failures with F+1 RMs
- Variation: clients can access backup RMs (linearizability is lost, but clients get sequential consistency)
- SUN NIS (yellow pages) uses passive replication: clients can contact primary or backup servers for reads, but only the primary server for updates
20. Bayou
- A data management system to support collaboration in diverse network environments:
  - Variable degrees of connectedness
  - Rely only on occasional pair-wise communication
  - No notion of a disconnected mode of operation
- Tentative & committed data
- Update everywhere
  - No locking
- Incremental reconciliation
  - Eventual consistency
- Application participation
  - Conflict detection & resolution
  - Merge procedures
- No replication transparency
21. Design Overview
- Client stub (API) & servers
- Per-replica state:
  - Database (Tuple Store)
  - Write log & undo log
  - Version vector
- Reconciliation protocol (peer-to-peer)
- Eventual consistency:
  - All replicas receive all updates
  - Any two replicas that have received the same set of updates have identical databases.
22. Accommodating Weak Connectivity
- Weakly consistent replication:
  - read-any/write-any access
  - Session & sharing semantics
- Epidemic propagation:
  - pair-wise contacts (anti-entropy sessions)
  - agree on the set of writes & their order
- Convergence rate depends on:
  - connectivity & frequency of anti-entropy sessions
  - partner selection policy
23. Conventional Approaches
- Version vectors
- Optimistic concurrency control
- Problems:
  - Concurrent writes to the same object may not conflict, while writes to different objects may conflict
  - depending on object granularity
- Bayou: account for application semantics
  - conflict detection with the help of the application
24. Conflict Detection & Resolution
- Shared calendar example:
  - Conflicting meetings overlap in time
  - Resolution: reschedule to an alternate time
- Dependency check:
  - included in every write, together with the expected result
  - the server calls the merge procedure if a conflict is detected
- Merge procedure:
  - can query the database
  - produces a new update
25. A Bayou Write
- Processed at each replica:

    Bayou_Write(update, dep_check, mergeproc)
      IF (DB_EVAL(dep_check.query) <> dep_check.expected_result)
        resolved_update = EXECUTE(mergeproc)
      ELSE
        resolved_update = update
      DB_EXEC(resolved_update)
26. Example of a Write in the Shared Calendar

    Update = insert, Meetings, 12/18/95, 1:30pm, 60min, "Budget Meeting"
    Dependency_check:
      query = SELECT key FROM Meetings WHERE day = 12/18/95
              AND start < 2:30pm AND end > 1:30pm
      expected_result = EMPTY
    MergeProc:
      alternates = {(12/18/95, 3:00pm), (12/19/95, 9:30am)}
      FOREACH a IN alternates
        /* check if a is feasible, produce newupdate */
      IF (newupdate = NULL)  /* no feasible alternate */
        newupdate = insert, ErrorLog, Update
      RETURN(newupdate)
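The same structure in a small runnable Python sketch; the in-memory database shape and helper names are assumptions for illustration:

    def bayou_write(db, update, dep_check, merge_proc):
        """Run the dependency check; on conflict, let the merge
        procedure produce the resolved update, then execute it."""
        if dep_check["query"](db) != dep_check["expected_result"]:
            resolved = merge_proc(db)          # conflict detected
        else:
            resolved = update
        db.append(resolved)                    # DB_EXEC

    # Calendar example: insert a meeting unless its slot is taken.
    db = [{"slot": "12/18/95 1:30pm", "what": "Staff Meeting"}]
    update = {"slot": "12/18/95 1:30pm", "what": "Budget Meeting"}
    dep_check = {
        "query": lambda db: [m for m in db if m["slot"] == update["slot"]],
        "expected_result": [],                 # EMPTY
    }
    merge_proc = lambda db: {"slot": "12/18/95 3:00pm", "what": "Budget Meeting"}

    bayou_write(db, update, dep_check, merge_proc)
    print(db)   # the meeting lands in the 3:00pm alternate slot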
27. Eventual Consistency
- Propagation:
  - All replicas receive all updates
  - chain of pair-wise interactions
- Determinism:
  - All replicas execute writes in the same way
  - including conflict detection & resolution
- Global order:
  - All replicas apply writes to their databases in the same order
  - Since writes include arbitrarily complex merge procedures, it is effectively impossible to determine whether two writes commute, or to transform them so that they can be re-ordered
- Tentative writes are ordered by the timestamp assigned by their accepting servers:
  - Total order using <Timestamp, serverID>
  - Desirable so that a cluster of isolated servers agrees on the tentative resolution of conflicts
28. Undoing & Re-applying Writes
- Servers may accept writes (from clients or other servers) in an order that differs from the acceptable execution order
- Servers immediately apply all known writes
- Therefore, servers must be able to undo the effects of some previous tentative execution of a write & re-apply it in a different order
- The number of retries depends only on the order in which writes arrive (via anti-entropy sessions)
  - Not on the likelihood of conflicts
- Each server maintains a write log & undo log
  - Sorted by committed or tentative timestamp
  - Committed writes take precedence
29. Constraints on Writes
- Must produce the same result on all replicas with equal write logs preceding that write
- Client-provided merge procedures:
  - can only access the database & the parameters provided
  - cannot access time-varying or server-specific state (pid, time, file system)
  - have uniform bounds on memory & processor usage, so that failures due to resource usage are uniform
- Otherwise: non-deterministic behavior!
30. Global Order
- Writes are totally ordered w.r.t. their write-stamp:
  - (commit-stamp, accept-stamp, server-id)
- accept-stamp:
  - assigned by the server that initially receives the write
  - derived from a logical clock: monotonically increasing
  - global clock synchronization is not required
- commit-stamp:
  - initialized to infinity (while the write is tentative)
  - updated when the write is stabilized (see the sketch below)
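A sketch of the write-stamp as a plain ordering key; using math.inf for the tentative commit-stamp is an implementation assumption:

    import math

    def write_stamp(commit_stamp, accept_stamp, server_id):
        """Writes compare lexicographically; a finite commit-stamp
        sorts committed writes before all tentative ones (inf)."""
        return (commit_stamp, accept_stamp, server_id)

    tentative = write_stamp(math.inf, 42, "srvB")
    committed = write_stamp(7, 40, "srvA")
    assert min(tentative, committed) == committed   # committed sorts first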
31. Stabilizing Writes (I)
- A write is stable at a replica when it has been executed for the last time at that replica:
  - All writes with earlier write-stamps are known to the replica, and no future writes will be given earlier write-stamps
- Convenient for applications to have a notion of confirmation/commitment
- Stabilize as soon as possible:
  - allows replicas to prune their write logs
  - informs applications/users that writes have been confirmed -or- fully resolved
32. Stabilizing Writes (II)
- A write may be executed several times at a server & may produce different results
  - depending on the server's execution history
- The Bayou API provides means to inquire about the stability of a write
- By including the current clock value in anti-entropy sessions, a server can determine that a write is stable when it has a lower timestamp than all servers' clocks (see the sketch below)
- A single server that remains disconnected may prevent writes from stabilizing, and cause rollbacks upon its re-connection
  - Hence: an explicit commit procedure
33. Committing Writes
- In a given data collection, one replica is designated the primary:
  - commits writes by assigning a commit-stamp
  - no requirement for a majority quorum
  - a disconnected server can be the primary for a user's personal data objects
- Committing a write makes it stable
- Commit-stamps determine the total order:
  - Committed writes are ordered before tentative writes
  - Replicas are informed of committed writes in commit-stamp order
34. Applying Sets of Writes

    Receive_Writes(wset, from)
      IF (from = CLIENT)
        TS = MAX(sysClock, TS + 1)
        w = First(wset)
        w.WID = <TS, mySrvID>
        w.state = TENTATIVE
        WriteLog.Append(w)
        Bayou_Write(w.update, w.dependency_check, w.mergeproc)
      ELSE   /* received via anti-entropy -> wset is ordered */
        w = First(wset)
        insertionPoint = WriteLog.IdentifyInsertionPoint(w.WID)
        TupleStore.RollbackTo(insertionPoint)
        WriteLog.Insert(wset)
        FOREACH w IN WriteLog after insertionPoint
          Bayou_Write(w.update, w.dependency_check, w.mergeproc)
        w = Last(wset)
        TS = MAX(TS, w.WID.timestamp)
35. Epidemic Protocols (I)
- Scalable propagation of updates in eventually-consistent data stores:
  - eventually all replicas receive all updates
  - in as few msgs as possible
  - aggregation of multiple updates in each msg
- Classification of servers:
  - Infective: holds an update and is willing to spread it
  - Susceptible: has not yet received the update
  - Removed: holds the update but is no longer willing to spread it
36. Epidemic Protocols (II)
- Anti-entropy propagation:
  - Server P randomly selects server Q
- Options:
  - P pushes updates to Q
    - problem of delay if we have relatively many infective servers
  - P pulls updates from Q
    - spreading of updates is triggered by susceptible servers
  - P and Q exchange updates (push-pull)
- Assuming a single infective server initially, both push & pull eventually spread the updates (see the sketch below)
- Optimization: ensure that at least a number of servers immediately become infective
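A small simulation sketch of push-pull anti-entropy under random partner selection; the synchronous round structure is an assumption for illustration:

    import random

    def anti_entropy_rounds(n, seed=0):
        """Rounds until one initial update reaches all n servers when,
        each round, every server exchanges updates with a random partner."""
        random.seed(seed)
        infected = {0}                     # server 0 starts infective
        rounds = 0
        while len(infected) < n:
            rounds += 1
            for p in range(n):
                q = random.randrange(n)
                if p in infected or q in infected:
                    infected |= {p, q}     # push-pull: both end up infective
        return rounds

    print(anti_entropy_rounds(1000))       # typically O(log n) rounds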
37. Epidemic Protocols (III)
- Rumor spreading (gossiping):
  - Server P randomly selects Q to push updates
  - If Q has already seen the updates of P, then P may lose interest: with probability 1/k
- Rapid propagation, but no guarantee that all servers will be updated
- Let s be the fraction of servers that remain ignorant of an update:
  - s = e^(-(k+1)(1-s))
  - k = 3 => s < 0.02 (verified numerically below)
- Enhancement: combine gossiping with anti-entropy
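A quick numerical check of that fixed point; plain fixed-point iteration is an assumed solution method:

    import math

    def ignorant_fraction(k, iterations=100):
        """Solve s = exp(-(k+1)(1-s)) by fixed-point iteration: s is the
        expected fraction of servers that never hear the rumor."""
        s = 0.5
        for _ in range(iterations):
            s = math.exp(-(k + 1) * (1 - s))
        return s

    print(ignorant_fraction(3))   # ~0.0198 < 0.02, as the slide states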
38. Epidemic Protocols (IV)
- Spreading a deletion is hard:
  - After removing a data item, a server may receive old copies!
  - Must record deletions & spread them
- Death certificates:
  - time-stamped upon creation
  - enforce a TTL for certificates, based on an estimate of the max. update propagation time
- Maintain a few dormant death certificates that never expire:
  - as a guarantee that a death certificate can be re-spread in case an obsolete update is received for a deleted data item (see the sketch below)
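A minimal sketch of deletion via timestamped death certificates; the store layout, TTL value, and method names are assumptions:

    TTL = 3600.0   # assumed bound on max. update propagation time (s)

    class Store:
        def __init__(self):
            self.items, self.deaths = {}, {}   # deaths: key -> deletion ts

        def delete(self, key, now):
            self.items.pop(key, None)
            self.deaths[key] = now             # record the death certificate

        def receive_update(self, key, value, update_ts):
            # An obsolete update is suppressed by a newer certificate.
            if update_ts <= self.deaths.get(key, float("-inf")):
                return
            self.items[key] = value

        def expire(self, now):
            # (A few dormant certificates would additionally be kept forever.)
            self.deaths = {k: t for k, t in self.deaths.items()
                           if now - t < TTL}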
39. The Gossip Architecture (I)
- Replicate data close to the points where groups of clients need it
- Periodic exchange of msgs among RMs
- Front ends send queries & updates to any RM they choose:
  - any RM that is available can provide acceptable response times
- Consistent service over time
- Relaxed consistency between replicas
40. The Gossip Architecture (II)
- Causal update ordering
- Forced ordering:
  - causal & total
  - a Forced-order and a Causal-order update that are related by the happened-before relation may be applied in different orders at different RMs!
- Immediate ordering:
  - updates are applied in a consistent order relative to any other update at all RMs
41. The Gossip Architecture (III)
- Bulletin board application example:
  - Posting items -> causal order
  - Adding a subscriber -> forced order
  - Removing a subscriber -> immediate order
- Gossip messages propagate updates among RMs
- Front ends maintain a prev vector timestamp:
  - one entry per RM
- RMs respond with a new vector timestamp
42. State Components of a Gossip RM
43. Query Operations in Gossip
- The RM must return a value that is at least as recent as the request's timestamp:
  - q.prev <= valueTS
- List of pending query operations:
  - held back until the above condition is satisfied
  - the RM can wait for missing updates, or request updates from the RMs concerned
- The RM's response includes valueTS (see the sketch below)
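A sketch of the vector-timestamp operations behind this test; the function names are assumptions:

    def vt_leq(a, b):
        """a <= b for vector timestamps: every entry of a is <= b's."""
        return all(x <= y for x, y in zip(a, b))

    def vt_merge(a, b):
        """Element-wise maximum: the least timestamp dominating both."""
        return [max(x, y) for x, y in zip(a, b)]

    value_ts = [2, 5, 1]
    q_prev = [2, 4, 0]
    assert vt_leq(q_prev, value_ts)     # the query can be answered now
    print(vt_merge(q_prev, value_ts))   # front end folds valueTS into prev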
44. Updates in Causal Order
- RM i checks whether the operation ID is in its executed table or in its log:
  - discards the update if it has already been seen
- Increments the i-th element of its replica timestamp:
  - count of updates received from front ends
- Assigns a vector timestamp ts to the update:
  - replace the i-th element of u.prev by the i-th element of the replica timestamp
- Inserts the log entry <i, ts, u.op, u.prev, u.id>
- Stability condition: u.prev <= valueTS
  - all updates on which u depends have been applied (see the sketch below)
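A compact sketch of those steps at RM i; the state dict and field names are assumptions for illustration:

    def receive_update(i, u, state):
        """state: {'replicaTS', 'valueTS', 'log', 'executed'};
        u: {'id', 'op', 'prev'} with prev a front-end vector timestamp."""
        if u["id"] in state["executed"]:
            return                                   # duplicate: discard
        state["replicaTS"][i] += 1                   # front-end update count
        ts = list(u["prev"])
        ts[i] = state["replicaTS"][i]                # the update's timestamp
        state["log"].append((i, ts, u["op"], u["prev"], u["id"]))
        # Stability: u.prev <= valueTS, i.e. all dependencies applied.
        if all(x <= y for x, y in zip(u["prev"], state["valueTS"])):
            state["valueTS"] = [max(a, b)
                                for a, b in zip(state["valueTS"], ts)]
            state["executed"].add(u["id"])           # apply & record u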
45. Forced & Immediate Order
- A unique sequence number is appended to update timestamps
- The primary RM acts as the sequencer:
  - another RM can be elected to take over consistently as sequencer
  - a majority of RMs (including the primary) must record which update is next in the sequence
- Immediate ordering: the primary also places these updates in the sequence (along with forced updates & considering causal updates as well)
  - agreement protocol on the sequence
46. Gossip Timestamps
- Gossip msgs between RMs: replica timestamp & log
- Receiver's tasks:
  - Merge the arriving log m.log with its own:
    - add record r to the local log if replicaTS < r.ts
  - Apply any updates that have become stable:
    - this may in turn make pending updates become stable
  - Eliminate records from the log & entries in the executed table:
    - once it is established that they have been applied everywhere
  - Sort the set of stable updates in timestamp order:
    - r is applied only if there is no s s.t. s.prev < r.prev
  - Update the timestamp table: tableTS[j] := m.ts
    - if tableTS[i][c] >= r.ts[c] for all i, then r is discarded (see the sketch below)
    - c: the RM that created record r
  - ACKs by front ends to discard records from the executed table
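A sketch of that discard rule; representing tableTS as a list of vector timestamps (one row per RM) and the function name are assumptions:

    def can_discard(record_ts, creator, table_ts):
        """A log record can be dropped once every RM is known to have
        advanced past it in its creator's entry:
        tableTS[i][creator] >= record_ts[creator] for all i."""
        return all(row[creator] >= record_ts[creator] for row in table_ts)

    table_ts = [[3, 7], [2, 9]]    # latest timestamps known for RMs 0 and 1
    print(can_discard([2, 5], creator=0, table_ts=table_ts))   # True
    print(can_discard([4, 5], creator=0, table_ts=table_ts))   # False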
47. Update Propagation
- How long before all RMs receive an update?
  - Frequency & duration of network partitions:
    - beyond the system's control!
  - Frequency of gossip msgs
  - Policy for choosing a gossip partner:
    - Random:
      - weighted probabilities to favor near partners
      - surprisingly robust, but exhibits variable update propagation times
    - Deterministic:
      - simple function of the RM's state
      - e.g., examine the timestamp table & choose the RM that appears to be the furthest behind in updates received
    - Topological:
      - based on a fixed arrangement of RMs into a graph (ring, mesh, trees)
      - trade-off: amount of communication against higher latencies & the possibility that a single failure will affect other RMs
48. Scalability Concerns
- 2 messages per query (between front end & RM)
- Causal update:
  - G updates batched per gossip message
  - 2 + (R-1)/G messages exchanged per update
- Increasing G leads to:
  - fewer messages, but worse delivery latencies
  - the RM has to wait for more updates to arrive before propagating them
- Improvement: read-only replicas
  - provided that the update/query ratio is low!
  - updated by gossip msgs, but do not receive updates directly from front ends
  - can be situated close to client groups
  - vector timestamps need only include updateable RMs
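A quick worked instance of that message count (the numbers are illustrative assumptions): with R = 5 RMs and G = 10 updates per gossip message, each update costs 2 front-end messages plus (R-1)/G = 4/10 amortized gossip messages, i.e. 2.4 messages in total; raising G to 40 lowers this to 2.1 at the price of higher delivery latency.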
49. References
- R. Ladin, B. Liskov, L. Shrira, and S. Ghemawat, "Providing High Availability Using Lazy Replication", ACM Transactions on Computer Systems, vol. 10, no. 4, pp. 360-391, 1992.
- A. Demers, D. Greene, C. Hauser, W. Irish, and J. Larson, "Epidemic Algorithms for Replicated Database Maintenance", Proc. 6th ACM Symposium on Principles of Distributed Computing (PODC), pp. 1-12, 1987.
- D. B. Terry, M. M. Theimer, K. Petersen, A. J. Demers, M. J. Spreitzer, and C. Hauser, "Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System", Proc. 15th ACM Symposium on Operating Systems Principles (SOSP), pp. 172-183, 1995.