Title: Formal Models for Distributed Negotiations Commit Protocols
1 Formal Models for Distributed Negotiations: Commit Protocols
XVII Escuela de Ciencias Informaticas (ECI 2003),
Buenos Aires, July 21-26 2003
Roberto Bruni Dipartimento di Informatica
Università di Pisa
2 Distributed Databases
- Data can be inherently distributed
  - e.g. customer accounts in different branches of the same bank
- Data are distributed to achieve failure independence
  - e.g. replicated file systems
- Partial failures can lead to inconsistent results
- Commits have to be coordinated among participants to preserve data consistency
3 Distributed Databases
[Figure: users and DB, Centralized vs. Distributed]
4 Atomic Commitment Problem
- Reach a globally consistent state despite failures
- Each participant has two possible decision values
  - commit
    - All participants will make the transaction's updates permanent
  - abort
    - All will roll back
- Individual decisions are irreversible
- A commit decision requires unanimity of YES votes
5 Atomic Commitment Properties
- Consensus
  - All participants that decide reach the same decision
  - If any participant decides commit, then all participants must have voted YES
  - If all participants have voted YES and no failures occur, then commit is decided
- Irreversibility
  - Each participant decides at most once
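The properties above can be read as checks on the final outcome of one protocol run. The following is a minimal sketch, assuming each process is summarized by its vote and its decision; the `Outcome` record and the property labels ("agreement", "commit-validity", "non-triviality") are illustrative names, not taken from the slides. Irreversibility is not checked here because the record holds a single decision by construction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    vote: Optional[str]      # "YES", "NO", or None if the process never voted
    decision: Optional[str]  # "commit", "abort", or None if undecided


def check_atomic_commitment(outcomes, failures_occurred):
    """Return the labels of the violated properties (empty list if none)."""
    violated = []
    decisions = {o.decision for o in outcomes if o.decision is not None}
    # All participants that decide reach the same decision.
    if len(decisions) > 1:
        violated.append("agreement")
    # If any participant decides commit, all must have voted YES.
    if "commit" in decisions and any(o.vote != "YES" for o in outcomes):
        violated.append("commit-validity")
    # If all voted YES and no failures occur, commit is decided.
    all_yes = all(o.vote == "YES" for o in outcomes)
    if all_yes and not failures_occurred:
        if any(o.decision != "commit" for o in outcomes):
            violated.append("non-triviality")
    return violated
```

A run in which one process commits while another aborts after voting NO is flagged on two counts, while a blocked run (all YES, nobody decided, failures present) violates none of the properties.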
6 Commitment Protocols
- Atomic commitment protocol
  - satisfies all atomic commitment properties
  - ensures that transactions terminate consistently at all participating sites of a distributed database, even in the presence of failures
- Non-blocking
  - if it permits transaction termination to proceed at correct participants despite failures of others
- Recovery
  - is the activity of ensuring that software and hardware failures do not corrupt persistent data
- A non-blocking protocol can limit the time intervals of resource locking
7 Some Assumptions
- One of the participants acts as unique coordinator (centralized version)
  - At most one (if no failures, then there is one coordinator)
  - A participant assumes the role of coordinator within a fixed time interval from the beginning of the transaction
- The transaction begins at a single participant called the invoker
  - The invoker sends start messages to other participants
- Only undeliverable messages are dropped
- All participants can communicate (useful later)
8 Generic ACP Coordinator
- // Phase 1
- send VOTE-REQ_Tid to all participants
- set-timeout
- wait-for vote_Tid from all participants
  - // Phase 2
  - if (all votes are YES) then
    - broadcast(commit_Tid, participants)
  - else // at least one vote is NO
    - broadcast(abort_Tid, participants)
- on-timeout // escape blocking wait-for
  - broadcast(abort_Tid, participants)
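The coordinator's two phases can be sketched in Python as follows. This is only an illustration under stated assumptions: `send` and `collect_votes` are hypothetical callbacks standing in for the network, and a vote that is missing from the collected dictionary models the on-timeout branch.

```python
def run_coordinator(tid, participants, send, collect_votes):
    """Sketch of the generic ACP coordinator.

    send(p, message) and collect_votes(tid) are hypothetical callbacks.
    collect_votes returns a dict {participant: "YES" | "NO"} holding only
    the votes that arrived before the timeout expired.
    """
    # Phase 1: ask everybody to vote and gather the answers.
    for p in participants:
        send(p, ("VOTE-REQ", tid))
    votes = collect_votes(tid)

    # Phase 2: commit needs a YES from every participant; a NO vote or a
    # missing vote (timeout) leads to abort.
    if all(votes.get(p) == "YES" for p in participants):
        decision = "commit"
    else:
        decision = "abort"
    for p in participants:
        send(p, (decision, tid))
    return decision
```

For example, passing `collect_votes=lambda tid: {"p": "YES"}` with participants `["p", "q"]` yields abort, because the vote of `q` never arrived.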
9 Generic ACP Participants
- set-timeout
- wait-for VOTE-REQ_Tid from coordinator // 1
  - send vote_Tid to coordinator
  - if (vote = NO) then // unilateral abort
    - decide abort
  - else
    - set-timeout
    - wait-for decision from coordinator // 2
      - if (decision = abort) then decide abort
      - else decide commit
    - on-timeout termination-protocol // escape 2
- on-timeout decide abort // escape 1
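A participant's two blocking waits and their escapes can be sketched with a queue that has a receive timeout. This is a simplified illustration: the inbox is assumed to carry `(kind, tid)` tuples, and `send_to_coordinator` and `termination_protocol` are hypothetical callbacks.

```python
import queue


def run_participant(tid, inbox, my_vote, send_to_coordinator,
                    termination_protocol, timeout=0.05):
    """Sketch of a generic ACP participant; returns its decision."""
    try:
        inbox.get(timeout=timeout)            # wait (1): VOTE-REQ
    except queue.Empty:
        return "abort"                        # escape (1)
    send_to_coordinator((my_vote, tid))
    if my_vote == "NO":
        return "abort"                        # unilateral abort
    try:
        decision, _ = inbox.get(timeout=timeout)  # wait (2): decision
    except queue.Empty:
        # escape (2): the participant is uncertain and cannot decide alone
        return termination_protocol(tid)
    return "abort" if decision == "abort" else "commit"
```

The asymmetry between the two escapes is the point of the sketch: before voting the participant may abort on its own, whereas after voting YES it has to delegate to the termination protocol.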
10 Simple Broadcast
- broadcast(m, S)
  - // Broadcaster
    - send m to all processes in S
    - deliver m
  - // other processes in S
    - upon-receipt m // non-blocking
      - deliver m
- With this broadcast, the generic ACP corresponds to the 2PC protocol
11 Timeout Actions
- Participants must wait for
  - VOTE_REQ from the coordinator
    - If this takes too long, a participant can just decide abort
  - Votes (the coordinator collects them)
    - No global decision has been made yet
    - The coordinator can decide abort
  - commit / abort from the coordinator
    - The participant has already voted YES
    - It is now uncertain
    - It must consult other participants according to the termination protocol
12 Termination Protocol (TP)
- What if a participant that voted YES times out waiting for the response from the coordinator?
- It invokes a termination protocol to contact
  - the coordinator
  - other participants (cooperative TP)
    - they may have already voted or not yet voted
- There are failure scenarios for which no termination protocol can lead to a decision
- Blocking scenario: correct participants cannot decide
  - e.g. the coordinator crashes during broadcast
    - all faulty participants deliver and crash
    - all correct participants do not deliver the decision
    - if faulty participants do not recover, any decision could contradict the decision of a participant that crashed
13 Non-Blocking ACP I
- set-timeout
- wait-for VOTE-REQ_Tid from coordinator // 1
  - send vote_Tid to coordinator
  - if (vote = NO) then // unilateral abort
    - decide abort
  - else
    - set-timeout
    - wait-for decision from coordinator // 2
      - if (decision = abort) then decide abort
      - else decide commit
    - on-timeout decide abort // escape 2
- on-timeout decide abort // escape 1
14 Non-Blocking ACP II
- broadcast(m, S)
  - // Broadcaster as before
  - // other processes in S
    - upon-first-receipt m
      - send m to all processes in S // S can be sent along with VOTE_REQ
      - deliver m
- any process receiving m relays m to all others (if any correct process receives m, all correct processes receive m, even if the broadcaster crashes)
- m is delivered only after relaying
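The effect of relaying can be illustrated with a small simulation. It is a sketch under simplifying assumptions: messages between correct processes are never lost, relaying and delivering happen as one step, and the function name and parameters are hypothetical.

```python
from collections import deque


def relay_broadcast(processes, broadcaster, reached_before_crash,
                    crashed=frozenset()):
    """Return the set of processes that deliver m.

    reached_before_crash: processes the broadcaster managed to send m to
    (all of S if it did not crash during the broadcast).
    crashed: processes that neither relay nor deliver.
    """
    delivered = set()
    pending = deque(p for p in reached_before_crash if p != broadcaster)
    while pending:
        p = pending.popleft()
        if p in crashed or p in delivered:
            continue  # crashed process, or not the first receipt of m
        # upon-first-receipt: relay m to all other processes, then deliver
        pending.extend(q for q in processes if q != p)
        delivered.add(p)
    return delivered
```

With processes `{"c", "p", "q", "r"}`, a broadcaster `"c"` that crashes after reaching only `"p"` still leads `"p"`, `"q"` and `"r"` to deliver, because `"p"` relays before delivering.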
15 Recovery
- Participant p is recovering from a failure
  - Must reach a consistent decision
- Suppose p remembers its state at the time it failed
  - Before voting
    - it can unilaterally abort
  - After deciding abort
    - it can unilaterally abort
  - After receiving commit / abort from coordinator
    - it had already decided and must behave accordingly
  - During the uncertainty period (voted YES)
    - Independent recovery is not possible!
    - Termination protocol is needed
16 Distributed Transaction Log
- DTL is kept in stable storage at each site
  - Its content must survive failures
  - Coordinators and participants at that site can record information about transactions
- Before/after sending VOTE_REQ, the coordinator C writes start2PC(S,Tid)
- Before voting YES, a participant writes yes(C,S,Tid)
- Before/after voting NO, a participant writes abort(Tid)
- Before C sends commit, it writes commit(Tid)
- Before/after C sends abort, it writes abort(Tid)
- After receiving the decision, a participant writes commit(Tid) / abort(Tid)
17 Recovery From DTL
- If DTL contains start2PC (the site hosted the coordinator)
  - If it also contains commit/abort
    - The coordinator decided before failure
  - Otherwise
    - The coordinator can decide abort (and record it in DTL)
- Otherwise
  - It contains commit/abort
    - The participant has reached a decision before the failure
  - It does not contain yes
    - Either failed before voting or voted NO
    - The participant can unilaterally abort
  - Otherwise (it contains yes but not commit/abort)
    - The participant failed in its uncertainty period
    - Must use the termination protocol
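The case analysis above maps directly onto a small decision function. The sketch below assumes the DTL entries of one transaction are summarized as a set of record kinds (`"start2PC"`, `"yes"`, `"commit"`, `"abort"`); this representation and the return value `"run-termination-protocol"` are illustrative choices.

```python
def recover_from_dtl(records):
    """Return the recovery action for one transaction, given its DTL records."""
    if "start2PC" in records:
        # The site hosted the coordinator.
        if "commit" in records:
            return "commit"  # decided before the failure
        if "abort" in records:
            return "abort"   # decided before the failure
        return "abort"       # no decision yet: decide abort (and record it)
    # The site hosted an ordinary participant.
    if "commit" in records:
        return "commit"
    if "abort" in records:
        return "abort"
    if "yes" not in records:
        return "abort"       # failed before voting, or voted NO
    return "run-termination-protocol"  # failed in the uncertainty period
```

Only the last branch needs communication with other sites; every other case is an independent recovery.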
18 Cooperative TP Initiator
- send DECISION_REQ_Tid to all processes in S
- wait-for decision_Tid from any process
  - if (decision = commit) then
    - write commit in DTL
  - else // decision = abort
    - write abort in DTL
19 Cooperative TP Responder
- wait-for DECISION_REQ_Tid from any process p
  - if (abort(Tid) in DTL) then
    - send abort to p
  - else if (commit(Tid) in DTL) then
    - send commit to p
  - // otherwise the responder is itself uncertain and does not reply
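The interplay of initiator and responders can be sketched as follows, assuming each DTL is modeled as a Python set of record kinds and ignoring message passing and timeouts. The function names are hypothetical, and a `None` result stands for the blocked case in which every reachable process is uncertain.

```python
def responder_reply(dtl):
    """Reply of a responder to DECISION_REQ, or None if it is uncertain."""
    if "abort" in dtl:
        return "abort"
    if "commit" in dtl:
        return "commit"
    return None  # no decision recorded: the responder cannot help


def cooperative_tp(initiator_dtl, reachable_dtls):
    """Ask every reachable process; adopt and log the first decision found."""
    for dtl in reachable_dtls:
        reply = responder_reply(dtl)
        if reply is not None:
            initiator_dtl.add(reply)  # write commit / abort in the DTL
            return reply
    return None  # all reachable processes are uncertain: initiator is blocked
```

Calling `cooperative_tp({"yes"}, [{"yes"}, {"yes", "commit"}])` returns commit, while `cooperative_tp({"yes"}, [{"yes"}])` returns `None`, which is the blocking scenario of 2PC.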
20 Evaluation of 2PC
- Criteria: Reliability vs Efficiency
- Resiliency
  - What failures can be tolerated?
- Blocking
  - Can processes be blocked?
  - Under which conditions?
- Time Complexity
  - How long does it take to reach a decision?
- Message Complexity
  - How many messages are exchanged to reach a decision?
  - What are their sizes?
21 Balancing
- Reliability and Efficiency are conflicting goals
  - each can be achieved at the expense of the other
- The choice of protocol depends on which goal is more important for a specific application
- Whatever protocol is chosen, we should optimize for the case with no failures
  - Hopefully the normal operating state of the system
22 Measuring Time Complexity
- A round is the max time for a message to reach its destination
  - Timeouts are based on the assumption that such a delay is known
  - Note that many messages can be sent in a single round
  - Two messages must belong to different rounds iff one cannot be sent before the other is received
- Rounds are taken as time units
  - We count the number of rounds needed for unblocked sites to reach a decision, in the worst case
- This neglects the time needed to process messages
  - Reasonable: message delays usually exceed processing delays
- Two other factors can be relevant
  - DTL management (on stable storage)
  - Broadcasting preparation (to a large number of processes)
23 Measuring Message Complexity
- Number of messages sent during the whole protocol
  - Reasonable measure if individual messages are not very large
  - Otherwise we should measure the length of messages, not merely their number
- Here messages are short, so we abstract away from their lengths
24 Reliability of 2PC
- Resiliency
  - 2PC is resilient to
    - site failures
    - communication failures
  - In fact, the cause of timeouts is not important
- Blocking
  - 2PC is subject to blocking
  - Probabilistic analysis can be performed depending on the probability distribution of failures
25 Time Complexity of 2PC
- In absence of failures, 2PC requires 3 rounds
  - Broadcast VOTE-REQ
  - Collect votes
  - Broadcast global decision
- If failures happen, the TP may need 2 additional rounds
  - Broadcast DECISION_REQ
  - Reply from a process outside its uncertainty period
  - Note that several TPs can be initiated separately in the same round
- Up to 5 rounds, independently of the number of failures!
  - But processes may remain blocked for an unbounded period of time
26 Message Complexity of 2PC
- Let $N+1$ be the number of participants, including the coordinator
  - In each round of 2PC, there are $N$ messages sent
  - Hence, in absence of failures 2PC uses $3N$ messages
- Cooperative TP is invoked by all participants that voted YES but did not receive commit / abort
  - Let there be $M$ such participants
  - $M$ initiators, each sending $N$ DECISION_REQ ($MN$ messages)
  - At most $N-M+1$ processes will respond to the first request
  - In the worst case only one process abandons its uncertainty and will respond to another initiator: $(N-M+1) + (N-M+2) + \dots + N$
27 Calculating the Message Complexity of 2PC
- In the worst case the total number of TP messages will be
  - $NM + \sum_{i=1}^{M} (N-M+i) = NM + NM - M^2 + M(M+1)/2$
  - $= 2NM - M^2/2 + M/2$ messages
- This quantity is maximum when $M = N$
  - $N(3N+1)/2$ messages
- The 2PC together with worst-case TP amount to
  - $3N + N(3N+1)/2 = N(3N+7)/2$ messages
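The worst-case count can be cross-checked numerically. The sketch below assumes N+1 processes in total (the coordinator plus N others) and M initiators of the cooperative TP, with request and reply counts as described on the previous slide; the function names are illustrative.

```python
def tp_messages_worst_case(n, m):
    """Worst-case TP messages: M*N requests plus the replies, counted term by term."""
    requests = n * m
    # The first initiator gets N-M+1 replies, the next one N-M+2, and so on.
    replies = sum(n - m + i for i in range(1, m + 1))
    return requests + replies


def tp_closed_form(n, m):
    """Closed form 2NM - M^2/2 + M/2, written with an exact integer division."""
    return (4 * n * m - m * m + m) // 2


def total_2pc_worst_case(n):
    """3N messages of failure-free 2PC plus the TP worst case at M = N."""
    return 3 * n + tp_messages_worst_case(n, n)
```

For N = 4 and M = 4 both `tp_messages_worst_case` and `tp_closed_form` give 26, which equals N(3N+1)/2, and `total_2pc_worst_case(4)` gives 38, which equals N(3N+7)/2.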
28 Communication Topology
- The communication topology of a protocol is the specification of who sends messages to whom
  - e.g. in 2PC without TP, the coordinator sends messages to participants and vice versa
  - Participants do not send messages directly to each other
  - The topology is described as a tree of height 1
[Figure: tree of height 1 with the Coordinator at the root and four Participants as leaves]
29 Alternative 2PCs
- To reduce time and message complexity of centralized 2PC, two variations have been proposed, based on different communication topologies
- Decentralized 2PC
  - Communication topology is a complete graph
  - Improves time complexity
- Linear 2PC (aka Nested 2PC)
  - Linearly ordered processes
  - Reduces the number of messages
30 Decentralized 2PC
- Depending on its own vote, the coordinator sends YES or NO to all participants
  - Informs that it is time to vote
  - Tells the coordinator's vote
- If the message is NO
  - Each participant decides abort and stops
- Otherwise, each participant sends back its vote to ALL OTHER PARTICIPANTS
- After receiving all votes each process can decide autonomously
  - If all are YES and its own vote is YES, decide commit
  - Otherwise it decides abort
- Timeouts can be employed as in the centralized 2PC
31 Evaluation of Decentralized 2PC
- In the absence of failures, only 2 rounds are necessary
  - Coordinator voting YES / NO
  - Each participant voting YES / NO
- More messages are needed: $N^2 + N$ messages
  - $N$ messages in the first round
  - $N^2$ messages in the second round
  - (and this is just in absence of failures)
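A failure-free run of decentralized 2PC can be sketched to see where the message count comes from. The sketch assumes N participants besides the coordinator, so that each participant has N other processes to send its vote to; the function name and the returned tuple are illustrative.

```python
def decentralized_2pc(coordinator_vote, participant_votes):
    """Return (decision, messages, rounds) for a failure-free run."""
    n = len(participant_votes)
    messages = n  # round 1: the coordinator sends its vote to N participants
    rounds = 1
    if coordinator_vote == "NO":
        # Every participant decides abort and stops: no second round.
        return "abort", messages, rounds
    # Round 2: each of the N participants sends its vote to the N other processes.
    messages += n * n
    rounds += 1
    # Every process sees the same votes, so all reach the same decision.
    all_yes = all(v == "YES" for v in participant_votes)
    return ("commit" if all_yes else "abort"), messages, rounds
```

With three participants all voting YES the run uses 3 + 9 = 12 messages in 2 rounds, against 9 messages in 3 rounds for centralized 2PC with the same N.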
32 Linear 2PC
- Each participant can communicate only with its left / right neighbors
- The coordinator is the leftmost process
  - It sends its vote YES / NO to its right neighbor
  - This message has a dual meaning as in decentralized 2PC
- Each participant p waits for the vote from its left neighbor
  - If it is YES, and p votes YES, then p tells YES to its right neighbor
  - Otherwise, p tells NO to its right neighbor
- When the rightmost participant receives the vote, it makes the final decision commit / abort
- The decision is propagated from right to left
  - When the coordinator receives it, the protocol ends
- Timeout periods are influenced by positions
33 Evaluation of Linear 2PC
- Only 2N messages needed
  - N votes from left to right
  - N decisions from right to left
  - (and this is just in absence of failures)
- Unfortunately the same number of rounds is needed: 2N rounds
  - No two messages are sent concurrently
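The left-to-right vote chain and the right-to-left decision chain can be sketched as a single pass over the ordered votes. This is a failure-free illustration; the list layout (coordinator first, N+1 entries in total) and the function name are assumptions made for the example.

```python
def linear_2pc(votes):
    """votes[0] is the coordinator (leftmost); returns (decision, messages, rounds)."""
    messages = 0
    running = votes[0]  # vote forwarded by the leftmost process
    for own_vote in votes[1:]:
        messages += 1   # the left neighbor sends the running vote
        # p forwards YES only if it received YES and votes YES itself.
        running = "YES" if (running == "YES" and own_vote == "YES") else "NO"
    # The rightmost process decides, then the decision travels back.
    decision = "commit" if running == "YES" else "abort"
    messages += len(votes) - 1
    rounds = messages   # no two messages are sent concurrently
    return decision, messages, rounds
```

For four processes (N = 3) all voting YES, the result is commit with 6 messages and 6 rounds, matching the 2N figures.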
34 Comparison of 2PC Variants
- Hybrid communication topologies are also possible
  - e.g. Linear for voting, complete for conveying the decision
  - 2N messages, N+1 rounds
- The choice of the protocol might be influenced by the available communication topology
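The failure-free figures quoted for the variants can be collected in one place for comparison. The helper below is a sketch that simply tabulates the counts from the previous slides as functions of N; the variant keys and the function name are illustrative.

```python
def failure_free_cost(variant, n):
    """Return (messages, rounds) in the absence of failures."""
    costs = {
        "centralized": (3 * n, 3),
        "decentralized": (n * n + n, 2),
        "linear": (2 * n, 2 * n),
        "hybrid": (2 * n, n + 1),  # linear voting, complete graph for the decision
    }
    return costs[variant]


if __name__ == "__main__":
    for name in ("centralized", "decentralized", "linear", "hybrid"):
        messages, rounds = failure_free_cost(name, 10)
        print(f"{name:>13}: {messages:3d} messages, {rounds:2d} rounds")
```

For N = 10 the trade-off is visible at a glance: decentralized 2PC needs the fewest rounds but 110 messages, while linear 2PC needs only 20 messages but 20 rounds.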
35 From 2PC to 3PC
- In 2PC, if all operational participants are uncertain, they are blocked
  - They cannot decide abort even if aware that processes they cannot communicate with have failed, because some of them could have decided commit before failure
- 3PC is an ACP designed to rule out this situation
  - It guarantees that if any operational process is uncertain, then no (operational / failed) process can have decided commit
  - Thus, if p realizes that any operational site is uncertain, then p can decide abort
- Why does 2PC violate this property?
  - A participant p can receive commit while q is still uncertain
36 Sketch of 3PC: The Idea
- After the coordinator has found that all votes were YES, it sends pre-commit messages to all participants
- When a participant p receives pre-commit, it knows that all participants voted YES
  - p is no longer uncertain, but does not decide commit yet
  - p knows that it will decide commit unless it fails
  - p acknowledges the receipt of pre-commit
- When the coordinator collects all acks it knows that no participant is uncertain
  - The coordinator sends commit to all participants
- When a participant receives commit, it decides commit
- If a participant voted NO, then 3PC behaves as 2PC
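The message sequence of a failure-free 3PC run can be sketched by counting rounds and messages phase by phase. The sketch assumes N participants besides the coordinator and N messages per round, as in centralized 2PC; the returned trace lists the states traversed by a participant that voted YES, and all names are illustrative.

```python
def three_pc_failure_free(votes):
    """votes: the N participants' votes; returns (trace, rounds, messages)."""
    n = len(votes)
    rounds, messages = 2, 2 * n  # VOTE-REQ out, votes back
    trace = ["Uncertain"]        # state of a participant after voting YES
    if any(v != "YES" for v in votes):
        # Some participant voted NO: behave as 2PC and broadcast abort.
        rounds += 1
        messages += n
        trace.append("Aborted")
        return trace, rounds, messages
    rounds += 1                  # pre-commit to all participants
    messages += n
    trace.append("Committable")  # no longer uncertain, not yet committed
    rounds += 1                  # acknowledgements back to the coordinator
    messages += n
    rounds += 1                  # commit to all participants
    messages += n
    trace.append("Committed")
    return trace, rounds, messages
```

With three participants all voting YES, the run takes 5 rounds and 15 messages, and a participant only moves to Committed after passing through Committable.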
37 Sketch of 3PC: Some Notes
- In absence of failures, 3PC involves 5 rounds and up to 5N messages
- Participants have four possible states
  - Aborted, Uncertain, Committable, Committed
  - For p and q any two participants, only certain combinations of their states are possible
- Timeouts can occur in five situations
  - 3 are trivially handled
  - 2 require a complex termination protocol
    - Election protocol (for a new coordinator) based on a linear ordering of participants
    - The new coordinator checks the states of all operational participants
    - Timeouts are again necessary
38 Recap
- We have seen
  - Atomic Commitment Problem
  - Several ACP protocols
    - Generic ACP
    - Centralized 2PC (Good middle ground)
    - Non-Blocking ACP
    - Decentralized 2PC (OK if end-to-end delays must be minimized)
    - Linear 2PC (OK if messages are expensive)
    - 3PC (sketched)
- Learned some criteria to evaluate and compare protocols
  - Usually also dependent on the communication topology
39 References
- P. Bernstein, N. Goodman, V. Hadzilacos. Concurrency control and recovery in database systems (Chapter 7, Addison-Wesley 1987)
- O. Babaoglu, S. Toueg. Non-blocking atomic commitment (Chapter 6 of Distributed Systems, Addison-Wesley 1995)