1
CS-556 Distributed Systems
Consistency Replication (III)
  • Manolis Marazakis
  • maraz_at_csd.uoc.gr

2
Fault Tolerance?
  • Define correctness criteria
  • When 2 replicas are separated by a network
    partition:
  • Both are deemed incorrect & stop serving.
  • One (the master) continues; the other ceases
    service.
  • One (the master) continues to accept updates;
    both continue to supply reads (of possibly stale
    data).
  • Both continue service & subsequently synchronise.

3
Fault Tolerance
  • Design to recover after a failure w/o loss of
    (committed) data.
  • Designs for fault tolerance
  • Single server, fail and recover
  • Primary server with trailing backups
  • Replicated service

4
Network Partitions
  • Separate but viable groups of servers
  • Optimistic schemes validate on recovery
  • Available copies with validation
  • Pessimistic schemes limit availability until
    recovery

5
Replication under partitions
  • Available copies with validation
  • Validation involves
  • Aborting conflicting Txs
  • Compensations
  • Precedence graphs for detecting inconsistencies
  • Quorum consensus
  • Version number per replica
  • Operations should only be applied to replicas
    with the current version number
  • Virtual partition

6
Transactions with Replicated Data
  • Better performance
  • Concurrent service
  • Reduced latency
  • Higher availability
  • Fault tolerance
  • What if a replica fails or becomes isolated?
  • Upon rejoining, it must catch up
  • Replicated transaction service
  • Data replicated at a set of replica managers
  • Replication transparency
  • One copy serializability
  • Read one, write all

Failures must be observed to have happened
before any active Txs at other servers
7
Active Replication (I)
  • The problem of replicated invocations.

8
Active Replication (II)
  • (a) Forwarding an invocation request from a
    replicated object.
  • (b) Returning a reply to a replicated object.

9
Active Replication (III)
10
Active Replication (IV)
  • RMs are state machines with equivalent roles
  • Front ends communicate the client requests to the
    RM group, using totally ordered reliable multicast
  • RMs process requests independently & reply to the
    front end (correct RMs process each request
    identically)
  • The front end can synthesize the final response to
    the client (tolerating Byzantine failures)
  • Active replication provides sequential
    consistency if the multicast is reliable & ordered
  • Byzantine failures (F faulty out of 2F+1): the
    front end waits until it gets F+1 identical
    responses
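To make the F+1 rule concrete, here is a minimal Python
sketch (function and reply values are illustrative, not
part of the deck) of a front end synthesizing a response
from the replies of 2F+1 replicas:

    from collections import Counter

    def synthesize_response(replies, f):
        # With 2f+1 RMs and at most f Byzantine ones, any value
        # reported by f+1 RMs is vouched for by a correct RM.
        value, votes = Counter(replies).most_common(1)[0]
        return value if votes >= f + 1 else None   # None: keep waiting

    # f = 1 (3 RMs): a single faulty reply is outvoted
    assert synthesize_response(["ok", "ok", "bad"], f=1) == "ok"
    assert synthesize_response(["ok", "bad"], f=1) is None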

11
Available Copies Replication
(Figure: transactions issuing getBalance(A), deposit(A),
getBalance(B), deposit(B) to a group of replica managers)
  • Not all copies will always be available.
  • Failures
  • Timeout at failed replica
  • Rejected by recovering, unsynchronised replica

12
Gifford's quorum scheme (I)
  • Version numbers or timestamps per copy
  • A number of votes is assigned to each physical
    copy
  • weight related to the demand for that copy
  • totV(g): total number of votes for a group g of RMs
  • totV: total number of votes
  • Obtain a quorum before read/write
  • R votes before read
  • W votes before write
  • W > totV / 2
  • (R + W) > totV(g)
  • Any quorum pair must contain common copies
  • In case of partition, it is not possible to
    perform conflicting operations on the same copy
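These two constraints can be checked mechanically; a
minimal Python sketch (the vote assignment is invented for
illustration):

    def valid_quorum(votes, R, W):
        # W > totV/2 blocks conflicting write quorums under a
        # partition; R + W > totV forces every read quorum to
        # overlap every write quorum.
        totV = sum(votes.values())
        return 2 * W > totV and R + W > totV

    votes = {"RM1": 1, "RM2": 1, "RM3": 1}
    print(valid_quorum(votes, R=2, W=2))   # True: quorums overlap
    print(valid_quorum(votes, R=1, W=1))   # False: replicas could diverge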

13
Gifford's quorum scheme (II)
  • Read
  • Version number inquiries to find a set g of RMs
    with totV(g) ≥ R
  • Not all copies need to be up-to-date
  • Every read quorum contains at least one current
    copy
  • Write
  • Version number inquiries to find a set g of RMs,
    with totV(g) ≥ W, consisting of up-to-date copies
  • If there are insufficient up-to-date copies,
    replace a non-current copy with a copy of the
    current copy
  • Groups of RMs can be configured to provide
    different performance/reliability characteristics
  • Decrease W to improve write performance
  • Decrease R to improve read performance
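A possible shape of the read side in Python (a sketch
assuming each RM exposes a vote weight, a version number,
and a value; with R + W > totV the assembled quorum is
guaranteed to contain at least one current copy):

    def quorum_read(replicas, R):
        polled, collected = [], 0
        for name, (weight, version, value) in replicas.items():
            polled.append((version, value))   # version-number inquiry
            collected += weight
            if collected >= R:                # read quorum assembled
                break
        return max(polled)[1]                 # read a highest-version copy

    replicas = {"RM1": (1, 7, "new"), "RM2": (1, 6, "old"),
                "RM3": (1, 7, "new")}
    print(quorum_read(replicas, R=2))         # -> "new"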

14
Gifford's quorum scheme (III)
  • Performance penalty for reads
  • Due to the need for collecting a read quorum
  • Support for copies on local disks of clients
  • Assigned zero votes - weak representatives
  • These copies cannot be included in a quorum
  • After obtaining a read quorum, a read may be
    carried out on the local copy if it is up-to-date
  • Blocking probability
  • In some cases, a quorum cannot be obtained

15
Gifford's quorum scheme (IV)
Ex1: a file with a high read-to-write ratio
Ex2: a file with a moderate read-to-write ratio; reads
can be satisfied by the local RM, but writes must
also access one remote RM
Ex3: a file with a very high read-to-write ratio
The examples assume 99% availability for RMs
16
Quorum-Based Protocols
  • Three examples of the voting algorithm:
  • A correct choice of the read & write sets
  • A choice that may lead to write-write conflicts
  • A correct choice, known as ROWA (read one, write
    all)

17
Passive Replication (I)
  • At any time, the system has a single primary RM
  • One or more secondary (backup) RMs
  • Front ends communicate with the primary; the
    primary executes requests & sends updates to all
    backups
  • If the primary fails, one backup is promoted to
    primary
  • The new primary starts from the coordination phase
    for each new request
  • What happens if the primary crashes
    before/during/after the agreement phase?

18
Passive Replication (II)
19
Passive Replication (III)
  • Satisfies linearizability
  • The front end looks up the new primary when the
    current primary does not respond
  • The primary RM is a performance bottleneck
  • Can tolerate F failures with F+1 RMs
  • Variation: clients can access backup RMs
    (linearizability is lost, but clients get
    sequential consistency)
  • SUN NIS (yellow pages) uses passive replication:
    clients can contact primary or backup servers for
    reads, but only the primary server for updates

20
Bayou
  • A data management system to support collaboration
    in diverse network environments
  • Variable degrees of connectedness
  • Rely only on occasional pair-wise communication
  • No notion of disconnected mode of operation
  • Tentative & committed data
  • Update everywhere
  • No locking
  • Incremental reconciliation
  • Eventual consistency
  • Application participation
  • Conflict detection & resolution
  • Merge procedures
  • No replication transparency

21
Design Overview
  • Client stub (API) & servers
  • Per replica state
  • Database (Tuple Store)
  • Write log & undo log
  • Version vector
  • Reconciliation protocol (peer-to-peer)
  • eventual consistency
  • All replicas receive all updates
  • Any two replicas that have received the same set
    of updates have identical databases.

22
Accommodating weak connectivity
  • Weakly consistent replication
  • read-any/write-any access
  • Session guarantees
  • Epidemic propagation
  • pair-wise contacts (anti-entropy sessions)
  • agree on the set of writes & their order
  • Convergence rate depends on
  • connectivity & frequency of anti-entropy
    sessions
  • partner selection policy

23
Conventional Approaches
  • Version vectors
  • Optimistic concurrency control
  • Problems
  • Concurrent writes to the same object may not
    conflict, while writes to different objects may
    conflict
  • depending on object granularity
  • Bayou: account for application semantics
  • conflict detection with the help of the
    application

24
Conflict detection & resolution
  • Shared calendar example
  • Conflict: meetings overlap in time
  • Resolution: reschedule to an alternate time
  • Dependency check
  • included in every write
  • together with the expected result
  • The merge procedure is called if a conflict is
    detected
  • can query the database
  • produces a new update

25
A Bayou write
  • Processed at each replica:

    Bayou_Write(update, dep_check, mergeproc)
      IF (DB_EVAL(dep_check.query) <>
          dep_check.expected_result)
        resolved_update = EXECUTE(mergeproc)
      ELSE
        resolved_update = update
      DB_EXEC(resolved_update)
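The same logic as a runnable Python sketch, assuming a
plain dict stands in for the Tuple Store and updates/merge
procedures are callables (all names are illustrative):

    def bayou_write(db, update, dep_check, mergeproc):
        # Conflict iff the dependency check's query no longer yields
        # the result the client expected when it submitted the write.
        if dep_check["query"](db) != dep_check["expected"]:
            resolved = mergeproc(db)   # application-supplied resolution
        else:
            resolved = update
        resolved(db)                   # execute the (possibly merged) update

    # Toy run: book the 1:30pm slot unless taken; the merge
    # procedure falls back to 3:00pm.
    db = {"1:30pm": "Budget Meeting"}
    bayou_write(db,
                update=lambda d: d.update({"1:30pm": "Review"}),
                dep_check={"query": lambda d: "1:30pm" in d,
                           "expected": False},
                mergeproc=lambda d: (lambda d2: d2.update({"3:00pm": "Review"})))
    print(db)   # {'1:30pm': 'Budget Meeting', '3:00pm': 'Review'}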

26
Example of a write in the shared calendar
  • Update = {insert, Meetings, 12/18/95, 1:30pm,
    60min, "Budget Meeting"}
  • Dependency_check = {query = "SELECT key FROM
    Meetings WHERE day = 12/18/95 AND start < 2:30pm
    AND end > 1:30pm", expected_result = EMPTY}
  • MergeProc:
      alternates = {{12/18/95, 3:00pm}, {12/19/95,
        9:30am}}
      FOREACH a IN alternates
        /* check if feasible, produce newupdate */
      IF (newupdate = EMPTY) /* no feasible alternate */
        newupdate = {insert, ErrorLog, Update}
      Return(newupdate)

27
Eventual consistency
  • Propagation
  • All replicas receive all updates
  • chain of pair-wise interactions
  • Determinism
  • All replicas execute writes in the same way
  • Including conflict detection & resolution
  • Global order
  • All replicas apply writes to their databases in
    the same order
  • Since writes include arbitrarily complex merge
    procedures, it is effectively impossible to
    determine if two writes commute or to transform
    them so that they can be re-ordered
  • Tentative writes are ordered by the timestamp
    assigned by their accepting servers
  • Total order using <Timestamp, serverID>
  • Desirable so that a cluster of isolated servers
    agrees on the tentative resolution of conflicts

28
Undoing & re-applying writes
  • Servers may accept writes (from clients or other
    servers) in an order that differs from the
    acceptable execution order
  • Servers immediately apply all known writes
  • Therefore:
  • Servers must be able to undo the effects of some
    previous tentative execution of a write &
    re-apply it in a different order
  • The number of retries depends only on the order
    in which writes arrive (via anti-entropy
    sessions)
  • Not on the likelihood of conflicts
  • Each server maintains a write log & an undo log
  • Sorted by committed or tentative timestamp
  • Committed writes take precedence

29
Constraints on writes
  • Must produce the same result on all replicas with
    equal write logs preceding that write
  • Client-provided merge procedures:
  • can only access the database & the parameters
    provided
  • cannot access time-varying or server-specific
    state
  • pid, time, file system
  • have uniform bounds on memory & processor usage
  • so that failures due to resource usage are
    uniform
  • Otherwise: non-deterministic behavior!

30
Global order
  • Writes are totally ordered wrt write-stamp
  • (commit-stamp, accept-stamp, server-id)
  • accept-stamp
  • assigned by the server that initially receives
    the write
  • derived from logical clock
  • monotonically increasing
  • Global clock sync. is not required
  • commit-stamp
  • initialized to ∞ (i.e., tentative)
  • updated when the write is stabilized
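A small Python illustration of this total order (the log
entries are invented): tentative writes carry an infinite
commit-stamp, so every committed write sorts ahead of
every tentative one:

    INF = float("inf")            # commit-stamp of a tentative write

    def write_stamp(w):
        # committed-before-tentative, then accept-stamp, then server id
        return (w.get("commit", INF), w["accept"], w["server"])

    log = [{"accept": 9, "server": "B"},                 # tentative
           {"accept": 5, "server": "A", "commit": 2},
           {"accept": 7, "server": "C", "commit": 1}]
    for w in sorted(log, key=write_stamp):
        print(w)                  # C (commit 1), A (commit 2), then B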

31
Stabilizing writes (I)
  • A write is stable at a replica when it has been
    executed for the last time at that replica
  • All writes with earlier write-stamps are known to
    the replica, and no future writes will be given
    earlier write-stamps
  • Convenient for applications to have a notion of
    confirmation/commitment
  • Stabilize as soon as possible
  • allow replicas to prune their write-logs
  • inform applications/users that writes have been
    confirmed -or- fully resolved

32
Stabilizing writes (II)
  • A write may be executed several times at a server
    & may produce different results
  • depending on the server's execution history
  • The Bayou API provides means to inquire about the
    stability of a write
  • By including the current clock value in
    anti-entropy sessions, a server can determine
    that a write is stable when it has a lower
    timestamp than all servers' clocks
  • A single server that remains disconnected may
    prevent writes from stabilizing, and cause
    rollbacks upon its re-connection
  • Explicit commit procedure
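The clock-based stability test above might be sketched in
Python as follows (names and clock values are illustrative;
known_clocks would be learned through anti-entropy):

    def is_stable(write_ts, known_clocks):
        # Stable once every server's clock has passed the write's
        # timestamp: no server can still introduce an earlier write.
        return all(write_ts < clock for clock in known_clocks.values())

    clocks = {"s1": 40, "s2": 38, "s3": 12}   # s3 long disconnected
    print(is_stable(25, clocks))   # False: s3 might still send ts 13..24
    print(is_stable(10, clocks))   # True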

33
Committing writes
  • In a given data collection, one replica is
    designated the primary
  • commits writes by assigning a commit-stamp
  • No requirement for majority quorum
  • A disconnected server can be the primary for a
    user's personal data objects
  • Committing a write makes it stable
  • The commit-stamp determines the total order
  • Committed writes are ordered before tentative
    writes
  • Replicas are informed of committed writes in
    commit-stamp order

34
Applying sets of writes
    Receive_Writes(wset, from)
      if (from = CLIENT)
        TS = MAX(sysClock, TS+1)
        w = First(wset)
        w.WID = {TS, mySrvID}
        w.state = TENTATIVE
        WriteLog.Append(w)
        Bayou_Write(w.update, w.dependency_check,
                    w.mergeproc)
      else /* received via anti-entropy -> wset is ordered */
        w = First(wset)
        insertionPoint =
          WriteLog.IdentifyInsertionPoint(w.WID)
        TupleStore.RollbackTo(insertionPoint)
        WriteLog.Insert(wset)
        for each w in WriteLog after insertionPoint
          Bayou_Write(w.update, w.dependency_check,
                      w.mergeproc)
        w = Last(wset)
        TS = MAX(TS, w.WID.timestamp)
35
Epidemic protocols (I)
  • Scalable propagation of updates in
    eventually-consistent data stores
  • eventually all replicas receive all updates
  • in as few msgs as possible
  • Aggregation of multiple updates in each msg
  • Classification of servers
  • Infective
  • Susceptible
  • Removed

36
Epidemic protocols (II)
  • Anti-entropy propagation
  • Server P randomly selects server Q
  • Options:
  • P pushes updates to Q
  • Problem: spreading is delayed when relatively
    many servers are already infective
  • P pulls updates from Q
  • Spreading of updates is triggered by susceptible
    servers
  • P and Q exchange updates (push/pull)
  • Assuming only a single infective server initially,
    both push & pull eventually spread all updates
  • Optimization
  • Ensure that at least a number of servers
    immediately become infective
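A toy Python simulation (not the actual protocol, just an
illustration of the options above): every round, each
server contacts one random partner, and we count rounds
until an update placed at one server reaches all n:

    import random

    def rounds_to_spread(n, mode, seed=0):
        random.seed(seed)
        infected = {0}                    # one initially infective server
        rounds = 0
        while len(infected) < n:
            rounds += 1
            for p in range(n):
                q = random.randrange(n)   # random partner
                if mode == "push" and p in infected:
                    infected.add(q)
                elif mode == "pull" and q in infected:
                    infected.add(p)
                elif mode == "push-pull" and (p in infected or q in infected):
                    infected.update((p, q))
        return rounds

    for mode in ("push", "pull", "push-pull"):
        print(mode, rounds_to_spread(100, mode))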

37
Epidemic protocols (III)
  • Rumor spreading (gossiping)
  • Server P randomly selects Q to push updates to
  • If Q has already seen P's updates, then P
    may lose interest
  • with probability 1/k
  • Rapid propagation
  • but no guarantee that all servers will be
    updated

s: fraction of servers that remain ignorant of an update
s = e^(-(k+1)(1-s))
k = 3 → s < 0.02
Enhancement: combine gossiping with anti-entropy
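The residue equation can be solved numerically by
fixed-point iteration; this short Python sketch reproduces
the k = 3 figure:

    import math

    def residue(k, iters=100):
        # fraction s of servers left ignorant, from s = e^(-(k+1)(1-s))
        s = 0.0
        for _ in range(iters):
            s = math.exp(-(k + 1) * (1 - s))
        return s

    for k in (1, 2, 3, 4):
        print(k, round(residue(k), 4))   # k = 3 -> s ≈ 0.0198 < 0.02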
38
Epidemic protocols (IV)
  • Spreading a deletion is hard
  • After removing a data item, a server may receive
    old copies !
  • Must record deletions & spread them
  • Death certificates
  • Time-stamped upon creation
  • Enforce a TTL on certificates
  • Based on an estimate of the max. update
    propagation time
  • Maintain a few dormant death certificates
  • that never expire
  • as a guarantee that a death certificate can be
    re-spread in case an obsolete update is received
    for a deleted data item
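A dict-based Python sketch of death certificates (the
Store API and the 60-second TTL are assumptions for
illustration, not from the slides):

    import time

    class Store:
        TTL = 60.0   # assumed bound on update propagation time

        def __init__(self):
            self.data, self.certs = {}, {}

        def delete(self, key):
            self.data.pop(key, None)
            self.certs[key] = time.time()       # death certificate

        def receive(self, key, value):
            t = self.certs.get(key)
            if t is not None and time.time() - t < self.TTL:
                return                          # obsolete update: drop it
            self.data[key] = value

        def expire(self):
            now = time.time()                   # keep only live certs
            self.certs = {k: t for k, t in self.certs.items()
                          if now - t < self.TTL}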

39
The gossip architecture (I)
  • Replicate data close to points where groups of
    clients need it
  • Periodic exchange of msgs among RMs
  • Front-ends send queries & updates to any RM they
    choose
  • Any RM that is available can provide acceptable
    response times
  • Consistent service over time
  • Relaxed consistency between replicas

40
The gossip architecture (II)
  • Causal update ordering
  • Forced ordering
  • causal & total
  • A forced-order update & a causal-order update that
    are related by the happened-before relation may be
    applied in different orders at different RMs!
  • Immediate ordering
  • Updates are applied in a consistent order
    relative to any other update at all RMs

41
The gossip architecture (III)
  • Bulletin board application example
  • Posting items -> causal order
  • Adding a subscriber -> forced order
  • Removing a subscriber -> immediate order
  • Gossip messages propagate updates among RMs
  • Front-ends maintain a vector timestamp, prev
  • one entry per RM
  • RMs respond with a new vector timestamp

42
State components of a gossip RM
43
Query operations in gossip
  • The RM must return a value that is at least as
    recent as the request's timestamp:
  • q.prev ≤ valueTS
  • List of pending query operations
  • held back until the above condition is satisfied
  • The RM can wait for missing updates
  • or request updates from the RMs concerned
  • The RM's response includes valueTS
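Since vector timestamps are only partially ordered, the
hold-back test is an element-wise comparison, as in this
small Python sketch (timestamp values invented):

    def vt_le(a, b):
        # a <= b iff every component of a is <= the matching one in b
        return all(x <= y for x, y in zip(a, b))

    def handle_query(q_prev, valueTS):
        if vt_le(q_prev, valueTS):
            return "respond; include valueTS in the reply"
        return "hold back until the missing updates arrive"

    print(handle_query((2, 4, 6), (2, 5, 6)))   # respond
    print(handle_query((2, 6, 6), (2, 5, 6)))   # hold back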

44
Updates in causal order
  • RM-i checks to see if operation ID is in its
    executed table or in its log
  • Discard update if it has already seen it
  • Increment i-th element of replica timestamp
  • Count of updates received from front-ends
  • Assign vector timestamp (ts) to the update
  • Replace i-th element of u.prev by i-th element of
    replica timestamp
  • Insert log entry
  • <i, ts, u.op, u.prev, u.id>
  • Stability condition: u.prev ≤ valueTS
  • All updates on which u depends have been applied

45
Forced & immediate order
  • Unique sequence number is appended to update
    timestamps
  • Primary RM acts as sequencer
  • Another RM can be elected to take over
    consistently as sequencer
  • Majority of RMs (including primary) must record
    which update is the next in sequence
  • Immediate ordering: the primary orders these
    updates in the sequence along with forced updates,
    considering causal updates as well
  • Agreement protocol on sequence

46
Gossip timestamps
  • Gossip msgs between RMs carry:
  • replica timestamp & log
  • Receiver's tasks:
  • Merge the arriving log m.log with its own
  • add record r to the local log unless r.ts ≤
    replicaTS (i.e., unless it is already known)
  • Apply any updates that have become stable
  • this may in turn make pending updates become
    stable
  • Eliminate records from the log & entries in the
    executed table
  • once it is established that they have been
    applied everywhere
  • Sort the set of stable updates into timestamp
    order; r is applied only if there is no s s.t.
    s.prev < r.prev
  • Merge the timestamp table: tableTS[j] = m.ts
  • if tableTS[i][c] ≥ r.ts[c] for all i, then r is
    discarded
  • c: the RM that created record r
  • ACKs by front-ends allow discarding records from
    the executed table

47
Update propagation
  • How long before all RMs receive an update?
  • Frequency & duration of network partitions
  • beyond the system's control!
  • Frequency of gossip msgs
  • Policy for choosing a gossip partner
  • Random
  • Weighted probabilities to favor nearby partners
  • Surprisingly robust!
  • But exhibits variable update propagation times
  • Deterministic
  • Simple function of the RM's state
  • E.g., examine the timestamp table & choose the RM
    that appears to be the furthest behind in updates
    received
  • Topological
  • Based on a fixed arrangement of RMs into a graph
  • Ring, mesh, trees
  • Trades off the amount of communication against
    higher latencies & the possibility that a single
    failure will affect other RMs

48
Scalability concerns
  • 2 messages per query (between front-end & RM)
  • Causal update:
  • G updates per gossip message
  • 2 + (R-1)/G messages exchanged per update
  • e.g., R = 10 RMs and G = 5 updates per gossip
    message give 2 + 9/5 ≈ 3.8 messages per update
  • Increasing G leads to
  • fewer messages
  • but also worse delivery latencies
  • an RM has to wait for more updates to arrive
    before propagating them
  • Improvement by having read-only replicas
  • provided that the update/query ratio is low!
  • updated by gossip msgs, but do not receive
    updates directly from front-ends
  • can be situated close to client groups
  • Vector timestamps need only include updateable RMs

49
References
  • R. Ladin, B. Liskov, L. Shrira and S. Ghemawat,
    "Providing High Availability Using Lazy
    Replication", ACM Trans. on Computer Systems,
    vol. 10, no. 4, pp. 360-391, 1992.
  • A. Demers, D. Greene, C. Hauser, W. Irish and J.
    Larson, "Epidemic Algorithms for Replicated
    Database Maintenance", Proc. 6th ACM Symposium
    on Principles of Distributed Computing, pp. 1-12,
    1987.
  • D. B. Terry, M. M. Theimer, K. Petersen, A. J.
    Demers, M. J. Spreitzer, and C. Hauser,
    "Managing Update Conflicts in Bayou, a Weakly
    Connected Replicated Storage System", Proc. 15th
    ACM SOSP, pp. 172-183, 1995.