Title: Distributed Transaction Recovery
1Distributed Transaction Recovery
2Goal
- All_or_nothing semantics of atomicity must be
extended to a transactions updates on multiple
servers - Distributed system can fail partially
- The solution is to set up a special handshake
protocol between servers from a family of
distributed commit protocols - The protocol implies that there are certain
circumstances under which a failed server must
communicate to other servers during its restart
to find out the systemwide decision about the
termination status of in-doubt transactions
(uncertain transactions)
3Goal
- Servers are no longer autonomous in that they can
always independently recover in splendid
isolation - Commit protocols make very few assumptions about
how the various servers implement their local
recovery - The requirement there is that the rules of the
commit protocol are followed by all parties of a
distributed system - The 2PC protocol has been standardized and is
generally accepted in the software industry
4Distributed Commit
- The problem
- How can we ensure atomicity of a distributed
transaction? - Example
- Consider a chain of stores belonging to a
supermarket. Suppose a manager of the chain wants
to query all the stores, find the inventory of
pants at each, and issue instructions to move
pants from store to store in order to balance the
inventory. The whole operation is done by a
single global transaction T that has component Ti
at each ith store and component To at the office
where the manager is located
5Distributed Commit
- The sequence of activity performed by T are
summarized below - Component To is created at the site of the
manager - To sends messages to all the stores instructing
them to create components Ti - Each Ti executes a query at store i to discover
the number of pants in inventory and reports this
number to To - To takes these numbers and determines what
shipments of pants are desired. To sends messages
such as store 10 should ship 100 pants to store
7 to the appropriate stores - Stores receiving instructions update their
inventory and perform the shipments
6Failed Coordination of a Distributed Transaction
(1)
Committed
T1
Active
Aborted
Crash/ recover
Committed
T2
Active
Aborted
Both T1 and T2 are ready to commit
Crash/ recover
7Failed Coordination of a Distributed Transaction
(2)
Committed
T1
T1 committed T2 crashes and recovers
Active
Aborted
Crash/ recover
Committed
T2
What we need is some new state that has more
flexibility than this simple state diagram
Active
Aborted
Crash/ recover
8Transaction State Diagram with Durable Prepared
State
Committed
Active
Precommitted
Aborted
Crash/ recover
9Precommitted state
- Precommitted
- a transaction becomes precommitted as a result
of a request by a distributed transaction
coordinator (TM). From a precommitted state, the
transaction can enter either committed state or
aborted state. In the event of a system crash and
subsequent recovery, a transaction in a
precommitted state returns to precommitted state.
10Two-Phase Commit Protocol
- We assume that the atomicity mechanisms at each
site assure that either the local component
commits or it has no effect on the database state
at that site i.e. local components of a global
transaction are atomic - By enforcing the rule that either all components
of a distributed transaction commit or none does,
we make the distributed transaction itself atomic
11Two-Phase Commit Protocol
- Some assumptions about the 2PC protocol
- Each site logs actions at that site, but there is
no global log - One site, coordinator, plays a special role in
deciding whether or not the distributed
transaction can commit - 2PC protocol involves sending certain messages
between the coordinator and the other sites. As
the message is sent, it is logged at the sending
site, to aid in recovery if should it be
necessary
12Two-Phase Commit Protocol Phase 1
- The coordinator places a log record ltPrepare Tgt
on the log at its site - The coordinator sends to each components site
the message ltprepare Tgt - Each site receiving the message ltprepare Tgt
decides whether to commit or abort its component
of T. The site can delay if the component has not
yet completed its activity, but must eventually
send a response - If a site wants to commit its component of T, it
must enter a state called precommitted. Once in
the precommitted state, the site cannot abort its
component of T without a directive to do so from
the coordinator.
13Two-Phase Commit Protocol Phase 1
- The following steps are done to become
precommitted - Perform whatever steps are necessary to be sure
the local component of T will not have to abort,
even if there is a system failure followed by
recovery at the site. So, all appropriate actions
regarding the log must be taken so that T will be
redone rather than undone in a recovery - Place the record ltReady Tgt on the local log and
flush the log to disk - Send to the coordinator the message ltready Tgt
- If a site wants to abort its component of T, then
it logs the record ltDont commit Tgt and sends the
message ltdont commit Tgt to the coordinator. It
is safe to abort the component at this time,
since T will surely abort even if only one
component wants to abort
14Two-Phase Commit Protocol Phase 2
- If the coordinator has received ltready Tgt from
all components of T, then it decides to commit T.
The coordinator - Logs ltCommit Tgt at its site, and
- Sends message ltcommit Tgt to all sites involved in
T - If the coordinator has received ltdont commit Tgt
from one or more sites, then it decides to abort
T. The coordinator - Logs ltAbort Tgt at its site, and
- Sends message ltabort Tgt to all sites involved in
T
15Two-Phase Commit Protocol Phase 2
- If a site receives a ltcommit Tgt message, it
commits the component of T at that site, logging
ltCommit Tgt as it does - If a site receives the message ltabort Tgt, it
aborts the component of T at that site, and
writes the log record ltAbort Tgt
16Failure model
- Message losses a message does nor arrive at the
destination process because of a network failure - Message duplications some network component may
end up duplicating a message - Transient process failures one or more of the
involved processes, participants or coordinators,
exhibit a soft crash and need to be restarted,
but without any damage to data on secondary
storage - Idea distribute the responsibilities for
handling various failure classes among
transactional federation and the underlying
communication system
17Failure model
- 2PC does not make any assumptions about the
underlying communication system (datagrams as
well as session oriented protocols) - Omission failures versus commission failures
the later ones would lead to the class of
distributed consensus protocols known as
Byzantine agreement - We take into account only omission failures
18Prepared log entry for in-doubt transactions
- Participant that replies yes to the
coordinators poll in the first message round can
then crashes. - The participant can no longer perform local crash
recovery as if there were no distributed
transactions at all - The restarted participant needs to check back
with the coordinator first before it can decide
to consider the transaction commited. - Thus, every participant that votes yes must
actually be prepared to go either way redo or
undo teh updates.
19Two Phase Commit (cd)
- There are three places in 2PC where a process is
waiting for a message - the participant is waiting for the ltprepare Tgt
message from the coordinator - the coordinator is waiting for the ltready Tgt
message from all the participants - the participant is waiting for the ltcommit Tgt or
ltabort Tgt message from the coordinator
20Two Phase Commit (cd)
- Timeout Actions
- ad.1. unilateral abort after time_out
(participant) - ad.2. unilateral abort after time_out
(coordinator) - ad.3. the site can be blocked
21Restart and Termination Protocols
- 2PC is robust with respect to failures in the
sense that each failed and restarted process
resumes its work in the last remembered state - 2PC is not robust with respect to failures in the
sense that all processes guarantee active
progress toward a global commit or rollback - Two extensions of the basic 2PC
- A restart protocol specifies how failed and
restarted protocol should proceed - A termination protocol specifies how a process
should behave upon a timeout while it is waiting
for some message
22Restart and Termination Protocols
- As all participants follow the same protocol, we
have to specify four cases - The coordinator restart protocol the
continuation of the coordinators protocol after
a coordinator restart - The coordinator termination protocol the
coordinator behavior upon timeout - The participant restart protocol the
continuation of the participants protocol after
a participant restart - The participant termination protocol the
participant behavior upon timeout
23Termination protocol for 2PC
- The simplest termination protocol is the
following - The participant P remains blocked until it can
re-establish communication with the coordinator.
Then the coordinator can tell P the appropriate
decision - Drawback P may be unnecessarily blocked for a
long time - Participants can exchange messages directly
(without the mediation of the coordinator) - We assume that coordinator attaches the list of
the participants identities to the ltprepare Tgt
message it sends to each of them - The fact that participants do not know each other
before receiving the message is of no
consequence, since a participant may unilaterally
decide to abort transaction
24Termination protocol for 2PC
- The cooperative termination protocol
- A participant P (initiator) that times out while
in its uncertainty period sends a ltdecision
requestgt message to every other process Q
(responder) to inquire whether Q either knows the
decision or can unilaterally reach one - Q has already decided Commit (or Abort) Q sends
a commit or abort to P, and P decides accordingly
- Q has not voted yet Q can unilaterally decide
Abort. It then sends an ltabort Tgt to P, and P
decides Abort - Q has voted Yes but has not yet reached a
decision
252PC with restart and termination protocols
- Statechart for 2PC with restart and termination
protocols - F transitions are triggered during restart after
a process failure. Once the processs last state
is determined from the log entries on the
processs stable log, the transition is made
without any further preconditions - T transitions are triggered upon timeout and are
also made without further preconditions
262PC statechart - coordinator
T or F
initial
T or F
prepare_to_commit
variant
collecting
commit
abort
abort
commit
T or F
Ack
T or F
C-pending
A-pending
forgotten
272PC statechart - participant
T or F
initial
Prepared/ no
Prepared/ yes
T or F
prepared
commit/ ack
abort/ ack
committed
aborted
Commit/ack
Abort/ack
28Communication Topologies for 2PC
- The communication topology of a protocol is the
specification of who sends messages to whom - Centralized 2PC
Coordinator
Participants
29Communication Topologies for 2PC
- Decentralized 2PC (reduce time complexity)
Linear 2PC (reduce message complexity)
30Three Phase Commit (3PC)
- The protocol is non-blocking in the absence of
total site failures - The protocol may cause inconsistent decisions to
be reached in the event of communication failures
- We assume that only site failures occur
- All operational sites can communicate with each
other - Process that times out waiting for a message from
process q knows that q is down and therefore that
no processing can be taking place there (no other
process can be communicating with q)
31Three Phase Commit (3PC)
- Which of the processes involved in a transaction
takes the role of the coordinator ? - The choice of the coordinator depends on
- Transaction initiator client PC, application
server, Web server, database server, ORB - Reliability and speed of participants how many
participants does the transaction involve? - Communication topology and protocol
- The simplest way transaction initiator is
chosen as the coordinator - If the initiator is a client, it often makes more
sense to choose a data server as a coordinator
32Three Phase Commit (3PC)
- Nonblocking property
- If any operational process is uncertain then no
process (whether operational or failed) can have
decided to Commit - If the operational sites discover they are all
uncertain, they can decide Abort, safe in their
knowledge that the failed processes had not
decided Commit. When the failed processes recover
they can be told to decide Abort. This way
blocking is prevented - Does 2PC obeys NB property? NO
33Three Phase Commit (3PC)
- The coordinator sends a ltprepare Tgt message to
all participants - When a participant receives a ltprepare Tgt
message, it responds with YES or NO message,
depending on its vote. If a participant sends NO,
it decides Abort and stops - The coordinator collects the vote messages from
all participants. If any of them was NO or if the
coordinator voted No, then the coordinator
decides Abort, sends ltabort Tgt to all
participants that voted YES, and stops.
Otherwise, the coordinator sends ltpre-commit Tgt
messages to all participants
34Three Phase Commit (3PC)
- A participant that voted Yes waits for
ltpre-commit Tgt or ltabort Tgt message from the
coordinator. If it receives an ltabort Tgt , the
participant decides Abort and stops. If it
receives a ltpre-commit Tgt , it responds with an
ACK message to the coordinator - The coordinator collects the ACKs. When they have
all been received, it decides Commit, sends
ltcommit Tgt to all participants, and stops - A participant waits for a ltcommit Tgt message from
the coordinator. When it receives that message,
it decides Commit and stops
353PC timeout actions
- What a process should do when it times out
depends on the message it is was waiting for - In step (2) participants wait for ltprepare Tgt
message - In step (3) the coordinator waits for the votes
- In step (4) participants wait for a ltpre-commit
Tgt or ltabort Tgt message - In step (5) the coordinator waits for ACKs
- In step (6) participants wait for ltcommit Tgt
message
363PC timeout actions
- Cases (1) and (2) are handled exactly as in 2PC
- In case (4) the coordinator may decide Commit
while some failed participant is uncertain - In case (3) the participant must communicate with
other processes to reach a consistent decision - In case (5) the participant must communicate with
other processes to reach a consistent decision.
Why can participant ignore the timeout and simply
decide commit? The participant could violates
condition NB (the coordinator failed after
sending pre-commit to p but before sending it to
q
37Termination protocol for 3PC
- The state of the process relative to the message
it has sent or received - Aborted the process has not voted No, has voted
No, or has received ltabort Tgt - Uncertain the process is in its uncertain period
- Committable the process has received a
ltpre-commit Tgt, but has not yet received a
ltcommit Tgt - Committed the process has received a ltcommit Tgt
- When a participant times out in cases (3) or (5)
it initiates an election protocol - The election protocol involves all processes that
are operational and results in the election of
a new coordinator (the old one must have failed,
otherwise no participant would have timed out!)
38Termination protocol for 3PC
- The coordinator sends a ltstate-reqgt message to
all participants that participated in the
election. Each participant responds to this
message by sending its state. The coordinator
collects these states and proceeds according to
the following termination protocol - TR1 If some process is Aborted, the coordinator
decides Abort, sends ltabort Tgt messages to all
participants, and stops. -
- TR2 If some process is Committed, the
coordinator decides Commit, sends ltcommit Tgt
messages to all participants, and stops.
39Termination protocol for 3PC
- TR3 If all processes that reported their state
are Uncertain, the coordinator decides Abort,
sends ltabort Tgt messages to all participants, and
stops. - TR4 If some process is Committable but none is
Committed, the coordinator first sends
ltpre-commit Tgt messages to all processes that
reported Uncertain, and waits for
acknowledgements from these processes. After
having received these acknowledgements the
coordinator decides Commit, sends ltcommit Tgt
messages to all participants, and stops. - A participant that receives a ltcommit Tgt (or
ltabort Tgt) message, decides Commit (or Abort),
and stops
40Correctness of 3PC and its Termination Protocol
- Theorem In the absence of total failures, 3PC
and its termination protocol never cause
processes to block - Theorem Under 3PC and its termination protocol,
all operational processes reach the same decision
41Tree 2PC algorithm
- Which of the processes involved in a transaction
takes the role of the coordinator ? - LAN vs WAN
- Everybody can talk to everybody
- The initiator does not know all participants (and
probably cannot communicate with them directly
and efficiently) - Observation
- During the execution of a transaction, the
involved processes dynamically form a tree with
the transaction initiator as the root each edge
in the tree corresponds to a dynamically
established communication link
42Tree 2PC algorithm
- In the hierarchical commit protocol the message
flow and writing of log entries follows from the
two roles of an intermediate node, participant
with regard to its caller and coordinator for its
subtree
Process 0
initiator
Process 3
Process 1
Process 4
Process 5
Process 2
Communication during commit protocol
43initiator
Process 1
Process 2
Process 3
Process 4
Process 5
prepare
prepare
prepare
prepare
prepare
Yes
Yes
Yes
Yes
Yes
commit
commit
..........
44Optimized Algorithms for Distributed Commit
- What we can do
- Reducing the number of messages and the number of
log writes - Shortening the critical path from the begin of
the commit protocol to the point when local locks
can be released - Reducing the probability of blocking
45Optimized Algorithms for Distributed Commit
- The number of messages and forced log writes can
be reduced by introducing specific conventions
for the presumed behavior of a process (i.e.
default reaction) in the absence of more explicit
information - e.g. If a participant did not force a commit or
abort log entry and thus could lose this info in
a local crash, this participant could by default
contact the coordinator to obtain missing
information (but coordinator has already
truncated its log and forgotten the transaction)
46Optimized Algorithms for Distributed Commit
- Basic 2PC
- 4n messages 2N2 forced log entries (coordinator
- begin and commit or rollback log entry) - When a certain piece of information about
transactions behavior is missing two
variations of 2PC - Presumed abort (PA)
- Presumed commit (PC)
- Presumed abort 2PC has been considered as the
method of choice and has been selected for the
industry standard XA
47Heterogeneous Commit Coordinators
- As networks grow together, applications
increasingly must access servers and data
residing on heterogeneous systems, performing
transactions that involve multiple nodes of a
network. These nodes have different TP monitors
and different commit protocols - If each transaction manager supports a standard
and open commit protocol, then portability and
interoperability among them are relatively easy
to achieve - There are two standard commit protocols and
application programming interfaces - LU6.2 (IBM) - embodied by CICS and CICS API
- OSI-TP - combined with X/Open DTP
48Closed versus Open Transaction Managers
- Some transaction managers have an open commit
protocol resource managers can participate in
the commit decision, and the commit message
formats and protocols are public - CICS, Decdtm, TOPEND (NCR), Transarc TM
- Some transaction managers have a closed commit
protocol resource managers cannot participate in
the commit decision. The term is used to describe
TP systems that have only private protocols and
therefore cannot cooperate with other transaction
processing systems - IBMs IMS, Tandems TMF