Title: CS556: Distributed Systems
1CS-556 Distributed Systems
Transactions
- Manolis Marazakis
- maraz_at_csd.uoc.gr
2Terminology (I)
- Reliability - the extent to which a system yields expected results on repeated trials
- Reliability is measured by mean-time-between-failures (MTBF)
- Availability - fraction of time the system yields expected results
- reduced by downtime, repair, preventive maintenance
- Availability = MTBF / (MTBF + MTTR), where MTTR is mean-time-to-repair
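The availability formula can be checked with a few lines of Python. This is only an illustrative sketch: the function name `availability` and the sample figures (1000 hours between failures, 1 hour to repair) are assumptions, not values from the slides.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system: fails every 1000 hours on average, takes 1 hour to repair.
example = availability(1000.0, 1.0)
```

Halving MTTR in this example improves availability about as much as doubling MTBF, which is why fast fault detection and repair matter.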
3Terminology (II)
- 24x7 - system should always be operational
- Failure - an event where the system gives unexpected results (e.g. wrong or no result)
- Fault - an identified or hypothesized cause of failure
- A bad memory board (fault) causes the OS to fail
- Masking - prevents a fault from becoming a failure
- E.g. the OS reconfigures around a faulty disk controller
- Transient fault - the fault does not reoccur when retrying the operation that caused the fault
- Permanent fault - non-transient fault
4Where Do Failures Come From?
Cause           Tandem 89   Tandem 85   AT&T/ESS 85
Environment         7          14           -
Hardware            8          18           -
Subtotal           15          32          20
System Mgmt        21          42          30
Software           64          25          50
- Software is most of the problem !
- Environment: fire, flood, earthquake, vandalism, communications/power/air conditioning failures
- System management: maintenance, system operation, system configuration
- System managers and vendors are responsible for environment and administration
5Recovery in Client/Server Systems
- For each component: how to detect failure, how to recover, and what to do about it
- Client submits a request
- Server processes the request
- Server returns a reply
- A failure can occur during any of these activities
6Detecting Failures
- Fault detection must be accurate
- or you'll take recovery actions for a still-active process
- Fault detection must be fast
- since it contributes to MTTR
- Techniques
- Processes send heartbeats ("I'm alive") to a monitor process.
- No heartbeat => failed process
- Monitor polls processes for "I'm alive" messages
- Process gets an OS lock. Monitor waits for the lock. If the process fails, the OS releases its lock, so the monitor gets it and detects the failure
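A minimal sketch of the heartbeat technique, assuming a monitor that records the time of each process's last "I'm alive" message and declares a process failed once a timeout has elapsed. The class name, method names, and the explicit `now` parameter (used instead of a real clock so the example is deterministic) are illustrative assumptions.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per process and reports overdue ones."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen = {}  # process id -> time of last heartbeat

    def heartbeat(self, process_id: str, now: float) -> None:
        # Called whenever a process reports "I'm alive".
        self.last_seen[process_id] = now

    def failed(self, now: float) -> list:
        # A process is suspected failed if its last heartbeat is older than the timeout.
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout)
```

The timeout embodies the accuracy versus speed trade-off from the slide: a short timeout lowers MTTR but risks recovery actions against a process that is merely slow.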
7Client Recovery
- What calls were in-flight at failure time?
- How much of each call was processed?
- How to finish these calls before proceeding?
- General-purpose solution
- Execute requests as transactions.
- Maintain a persistent queue of requests and replies, shared by client and server.
- At recovery time, use the queue contents to decide which transactions executed. Re-run the ones that didn't.
8Server Recovery (I)
- Assume no transactions, for now
- When it failed, the server may have been processing a call. If all of the call's operations are redoable, then just re-execute the call.
- What if it performed non-redoable operations before it failed (increment inventory, print ticket, ...)?
- Recover to a state after the last non-redoable operation
- Requires careful bookkeeping during normal operation, so there's enough state to figure out how to recover
- Checkpoint - an action to avoid redo during recovery
- E.g. save server's memory state to disk
9Server Recovery (II)
- Checkpoint before each non-redoable operation
- At recovery time
- Restore last checkpoint
- Check if last non-redoable operation actually ran
- E.g. save incremented inventory value in memory before checkpointing. At recovery, compare memory and disk copies of inventory.
- E.g. read ticket number from ticket printer before checkpointing. At recovery, read ticket number from ticket printer and compare to the number before checkpoint.
10Server-side Transactions
- If servers run each client's call within a transaction, then at recovery time, abort all active transactions and re-run them. No checkpointing required.
- It's as if the server checkpoints before Start transaction, and Commit is the non-redoable operation to check at recovery to see if it actually ran.
- This assumes all transaction operations are undoable
- Transaction recovery techniques are much cheaper than memory checkpointing.
- Client performs non-redoable operations outside a transaction, and uses other checkpointing techniques
11Transactions
- Dependable computing
- All objects managed by a server remain in a
consistent state - when they are accessed by multiple clients
- in the presence of failures
- Protect the operations of clients from interference
- Atomic operations at the server
- Synchronized access to shared resources
- Synchronization and cooperation of clients
- Wait/notify/notify_all, locks
- Transactions
- Predictable faults
- Although errors can occur, they can be detected and dealt with before any incorrect behavior occurs
- Stable storage and processors
12All-or-nothing (Exactly-Once)
- A transaction either completes successfully
- in which case the effects of all of its operations are recorded in the system's objects
- or fails (or is deliberately aborted)
- in which case it has no effect at all
- Failure atomicity
- Atomic effects, even if server crashes
- Durability
- State data are made permanent
- Will survive if server process crashes
- The intermediate effects of a transaction must
not be visible to others
13Serializable execution
- Transactions are allowed to execute concurrently
- but only as long as their overall effects are
equivalent to some serial execution
14Masking faults
- Architecture Hardware Faults
- Software Masks Environmental Faults
- Distribution Maintenance
- Software automates / eliminates operators
- In the limit there are only software design faults.
- Software-fault tolerance is the key to dependability
15Fault Tolerance Techniques
- FAIL FAST MODULES: work or stop
- SPARE MODULES: instant repair time.
- INDEPENDENT MODULE FAILS by design: MTTF_pair ~ MTTF^2 / MTTR (must have small MTTR)
- MESSAGE BASED OS: Fault Isolation. Software has no shared memory.
- SESSION-ORIENTED COMM: Reliable messages. Detect lost/duplicate messages. Coordinate messages with commit
- PROCESS PAIRS: Mask Hardware and Software Faults
- TRANSACTIONS: A.C.I.D. (simple fault model)
16Fail-Fast Repair
Lifecycle of a module: fail-fast gives short fault latency
High Availability is low UN-Availability
Unavailability ~ MTTR / MTTF
- Improving either MTTR or MTTF gives benefit
- Simple redundancy is not enough on its own.
17Atomic Actions
- Either the action is executed completely (and successfully), or it has not happened at all.
- If it is not successful, it has not left any side effects.
- Except for simple instructions at the processor level, no operation is really atomic
- Most operations are implemented by a sequence of more primitive operations.
- Precautions have to be taken
- The lower-level operations must not make any visible changes before it is clear that the top-level operation will succeed.
- For any temporary changes the lower-level operations make, it must be made sure that they do not become visible, and that they can be revoked automatically should anything go wrong.
18A Classification of Action Types
- Unprotected actions: lack all of the ACID properties except for consistency
- Unprotected actions are not atomic, and their effects cannot be depended upon.
- Almost anything can fail.
- Protected actions: actions that do not externalize their results before they are completely done
- Their updates can be rolled back
- Once they have reached their normal end, there will be no unilateral rollback.
- Real actions: affect the real, physical world
- consistent, isolated, durable
- irreversible (in the majority of cases)
19Sample Transaction Program (I)
exec sql begin declare section;
long Aid, Bid, Tid, delta, Abalance;
exec sql end declare section;

DCApplication()
{
    /* read input msg */
    exec sql begin work;
    Abalance = DoDebitCredit(Bid, Tid, Aid, delta);
    /* send output msg */
    exec sql commit work;
}
20Sample Transaction Program (II)
long DoDebitCredit(long Bid, long Tid, long Aid, long delta)
{
    /* apply the delta to the account, then read back the new balance */
    exec sql update accounts
        set Abalance = Abalance + :delta
        where Aid = :Aid;
    exec sql select Abalance into :Abalance
        from accounts where Aid = :Aid;
    /* apply the same delta to the teller and branch totals */
    exec sql update tellers
        set Tbalance = Tbalance + :delta
        where Tid = :Tid;
    exec sql update branches
        set Bbalance = Bbalance + :delta
        where Bid = :Bid;
    /* record the operation in the history table */
    exec sql insert into history(Tid, Bid, Aid, delta, time)
        values (:Tid, :Bid, :Aid, :delta, CURRENT);
    return(Abalance);
}
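For readers without an embedded-SQL precompiler, the same debit/credit transaction can be approximated with Python's standard `sqlite3` module. This is a sketch under assumptions, not the slides' program: the table layouts, the demo rows in `make_demo_db`, and the `LookupError` checks are invented for illustration. The `with conn:` block plays the role of begin work / commit work and rolls back if an exception escapes.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    # Hypothetical schema and seed data, chosen only to make the example runnable.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE accounts (Aid INTEGER PRIMARY KEY, Abalance INTEGER);
        CREATE TABLE tellers  (Tid INTEGER PRIMARY KEY, Tbalance INTEGER);
        CREATE TABLE branches (Bid INTEGER PRIMARY KEY, Bbalance INTEGER);
        CREATE TABLE history  (Tid INTEGER, Bid INTEGER, Aid INTEGER,
                               delta INTEGER, time TEXT);
        INSERT INTO accounts VALUES (1, 100);
        INSERT INTO tellers  VALUES (1, 0);
        INSERT INTO branches VALUES (1, 0);
    """)
    return conn


def do_debit_credit(conn, bid, tid, aid, delta):
    cur = conn.cursor()
    cur.execute("UPDATE accounts SET Abalance = Abalance + ? WHERE Aid = ?", (delta, aid))
    if cur.rowcount != 1:
        raise LookupError(f"unknown account {aid}")
    abalance = cur.execute(
        "SELECT Abalance FROM accounts WHERE Aid = ?", (aid,)).fetchone()[0]
    cur.execute("UPDATE tellers SET Tbalance = Tbalance + ? WHERE Tid = ?", (delta, tid))
    if cur.rowcount != 1:
        raise LookupError(f"unknown teller {tid}")
    cur.execute("UPDATE branches SET Bbalance = Bbalance + ? WHERE Bid = ?", (delta, bid))
    if cur.rowcount != 1:
        raise LookupError(f"unknown branch {bid}")
    cur.execute(
        "INSERT INTO history(Tid, Bid, Aid, delta, time) "
        "VALUES (?, ?, ?, ?, CURRENT_TIMESTAMP)", (tid, bid, aid, delta))
    return abalance


def dc_application(conn, bid, tid, aid, delta):
    # Commits if the block completes, rolls back every update if it raises.
    with conn:
        return do_debit_credit(conn, bid, tid, aid, delta)
```

Calling `dc_application` with an unknown teller raises after the account row has already been updated, and the rollback restores the old balance: the all-or-nothing behavior that the following slides describe.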
21ACID Properties (I)
- Atomicity
- State transition occurs without any observable intermediate states
- ... or the system appears as though it never left the initial state
- It holds whether the transaction, the entire application, the operating system, or other components function normally, function abnormally, or crash.
- For a transaction to be atomic, it must behave atomically to any outside observer.
22ACID Properties (II)
- Consistency: A transaction produces consistent results only; otherwise it aborts.
- A result is consistent if the new state of the database fulfills all the consistency constraints of the application
- The program has functioned according to specification.
23ACID Properties (III)
- Isolation
- A program running under transaction protection must behave exactly as it would in single-user mode.
- That does not mean transactions cannot share data objects.
- The definition of isolation is based on observable behavior from the outside, rather than on what is going on inside.
24ACID Properties (IV)
- Durability: The results of transactions having completed successfully must not be forgotten by the system
- Once the system has acknowledged the execution of a transaction, it must be able to reestablish its results after any type of subsequent failure, whether caused by the user, the environment, or the hardware components.
25Concurrency control (I)
- Lost update
- 3 accounts (A, B, C)
- with balances 100, 200, 300
- T1 transfers from A to B, for a 10% increase of B's balance
- T2 transfers from C to B, for a 10% increase of B's balance
- Both T1, T2 read balance of B (200)
- T1 overwrites the update by T2
- Without seeing it
Transactions should not read a stale value and use it in computing a new value
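The lost update can be replayed deterministically. The sketch below assumes each transfer is sized to raise B's balance by 10% (one reading of the slide) and uses integer arithmetic; the function names are illustrative.

```python
def lost_update_schedule() -> dict:
    bal = {"A": 100, "B": 200, "C": 300}
    t1_read = bal["B"]                    # T1 reads B = 200
    t2_read = bal["B"]                    # T2 reads B = 200
    bal["B"] = t2_read + t2_read // 10    # T2 writes B = 220
    bal["C"] -= t2_read // 10             # T2 withdraws 20 from C
    bal["B"] = t1_read + t1_read // 10    # T1 overwrites B with 220 (stale read)
    bal["A"] -= t1_read // 10             # T1 withdraws 20 from A
    return bal


def serial_schedule() -> dict:
    bal = {"A": 100, "B": 200, "C": 300}
    for source in ("A", "C"):             # T1 runs to completion, then T2
        amount = bal["B"] // 10
        bal["B"] += amount
        bal[source] -= amount
    return bal
```

In the interleaved schedule the three balances sum to 580 instead of 600: the 20 that T2 took from C never reaches B, because T1 computed its new value from a stale read.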
26Concurrency control (II)
- Inconsistent retrievals
- T1 transfers 10 of account A to account B
- T2 computes sum of account balances
- T2 computes sum before T1 updates B
Update transactions should not interfere with retrievals.
In general: transactions should not violate operation conflict rules.
27Concurrency control (III)
Serial equivalence: criterion for correct concurrent execution
T1 and T2 are serially equivalent iff all pairs of conflicting operations of the two transactions are executed in the same order at all objects that both transactions access.
- 3 approaches to CC
- Locking
- Optimistic CC
- Timestamp ordering
Txs wait for one another, OR Txs are restarted after conflicts have been detected
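One way to test the serial-equivalence criterion is to build a precedence graph from the conflicting operation pairs and check it for cycles. The sketch below is an assumption-laden illustration: a schedule is a list of `(transaction, operation, object)` tuples with operations `"r"` and `"w"`, and two operations conflict when they come from different transactions, touch the same object, and at least one is a write.

```python
from itertools import combinations


def conflict_edges(schedule):
    """Edge (Ti, Tj) means a conflicting operation of Ti ran before one of Tj."""
    edges = set()
    for (t1, op1, obj1), (t2, op2, obj2) in combinations(schedule, 2):
        if t1 != t2 and obj1 == obj2 and "w" in (op1, op2):
            edges.add((t1, t2))
    return edges


def is_serially_equivalent(schedule) -> bool:
    edges = conflict_edges(schedule)
    remaining = {t for t, _, _ in schedule}
    # Repeatedly remove transactions with no incoming edge; a leftover means a cycle.
    while remaining:
        free = {n for n in remaining
                if not any(b == n and a in remaining for a, b in edges)}
        if not free:
            return False
        remaining -= free
    return True
```

The lost-update interleaving (both transactions read B, then both write B) yields edges in both directions between T1 and T2, so the check rejects it, whereas running T1 fully before T2 passes.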
28Common Performance Problems
- Convoys on semaphores or high-traffic locks
- Log semaphore is hotspot
- Sequential insert is hotspot
- Lock manager costs too much. A good number: 300 instructions for lock+unlock (no wait case)
- File or page granularity locking causes hotspot for small files
29Comparisons of CC methods
- Order of serialization
- Timestamp-based: static
- when Txs begin
- 2-phase locking: dynamic
- based on access pattern
- 2-phase locking is better when the frequency of updates is high
- Otherwise, timestamp-based and optimistic methods perform better
- Handling conflicts
- 2-phase locking: block (danger of deadlock)
- Timestamp-based: immediate abort
30Challenging applications
- Multi-user, collaborative
- Atomic updates, in the presence of concurrency and server crashes
- Must also support notification of changes and access to work-in-progress (sharing)
- Relaxed isolation guarantees
- Long-lasting Txs
- CAD/CAM, software development
- Independent versions of objects
- Check-in / check-out / merge operations
31Recoverability from aborts (I)
- Servers must prevent an aborting Tx from affecting other concurrent Txs.
- Dirty reads
- T2 sees result of update by T1 on account A
- T2 performs its own update on A, then commits.
- T1 aborts -> T2 has seen a transient value
- T2 is not recoverable
- If T2 delays its commit until T1's outcome is resolved
- Abort(T1) -> Abort(T2)
- However, if T3 has seen results of T2
- Abort(T2) -> Abort(T3) !
- Cascading aborts
Txs should only read values written by committed Txs
32Recoverability from aborts (II)
- Premature writes
- Assume server implements abort by maintaining the before image of all update operations
- T1 and T2 both update account A
- T1 completes its work before T2
- If T1 commits and T2 aborts, the balance of A is correct
- If T1 aborts and T2 commits, the before image that is restored corresponds to the balance of A before T2, so T2's committed update is lost
- If both T1 and T2 abort, the before image that is restored corresponds to the balance of A as set by T1, which should itself have been undone
Txs should be delayed until earlier Txs that
update the same objects have been either
committed or aborted.
33Recoverability from aborts (III)
- Txs should delay both their reads and updates in order to avoid interference
- Strict execution -> enforce isolation
- Servers should maintain tentative versions of objects in volatile memory
Txs should be delayed until earlier Txs that
update the same objects have been either
committed or aborted.
34Interface of transaction coordinator
openTransaction() -> trans
  starts a new transaction and delivers a unique TID trans. This identifier will be used in the other operations in the transaction.
closeTransaction(trans) -> (commit, abort)
  ends a transaction: a commit return value indicates that the transaction has committed; an abort return value indicates that it has aborted.
abortTransaction(trans)
  aborts the transaction.
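A toy in-memory version of this interface might look as follows. It is a sketch only: the class name `Coordinator`, the `Outcome` enum, and the rule that closing an unknown or already-aborted transaction reports abort are assumptions, and no real commit protocol, locking, or durability is implemented.

```python
import itertools
from enum import Enum


class Outcome(Enum):
    COMMIT = "commit"
    ABORT = "abort"


class Coordinator:
    def __init__(self):
        self._next_tid = itertools.count(1)
        self._active = set()

    def open_transaction(self) -> int:
        # Start a new transaction and deliver a unique TID.
        tid = next(self._next_tid)
        self._active.add(tid)
        return tid

    def close_transaction(self, trans: int) -> Outcome:
        # Commit if the transaction is still active, otherwise report abort.
        if trans in self._active:
            self._active.remove(trans)
            return Outcome.COMMIT
        return Outcome.ABORT

    def abort_transaction(self, trans: int) -> None:
        # Discard the transaction; a later close reports ABORT.
        self._active.discard(trans)
```

The caller passes the TID returned by `open_transaction` to every later operation, which is the explicit style that the slide on transparent transaction IDs later argues should be hidden from applications.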
35Outcomes of a Flat Transaction
36Stateless Servers (I)
- Assume requests run as transactions
- Two types of servers
- application server - application logic
- resource manager - shared resources (e.g. disks)
- Application server is stateless if it
- runs in transactions
- maintains all state in resource mgrs
- E.g. start Tx, call resource mgrs, commit
- It doesn't need to recover. It only needs to know which request it was processing (to redo it).
[Figure labels: Client, Application Server, Resource Manager, Resource]
37Stateless Servers (II)
- Resource mgr runs all requests in Txs
- Needs to be capable to perform recovery work
- abort all partially completed transactions
- ensure resources include all the results of all
committed transactions (atomic and durable) - database recovery algorithms (complex!)
38Stateful Servers (I)
- Sometimes a server maintains state on a client's behalf. E.g.,
- Server scans a file. Each time it hits a relevant record it returns it. Next call picks up the scan where it left off.
- cursors
- Server maintains lots of user information, which is too expensive to reconstruct on every call.
- Approach 1: Client passes state to server on each call, and server returns it on each reply. Server retains no state.
- This is the default assumption outside TP, but doesn't work well for TP.
- Note that transaction id context is handled this way.
39Stateful Servers (II)
- Approach 2: server maintains state, indexed by client id (e.g. transaction id). Subsequent RPCs from the client must go to the same server and pick up the retained context.
- RPC can provide a binding handle for subsequent calls. This ensures later calls go to the same server.
- If the client fails (e.g. it aborts the transaction), server must be notified to release client's context
- it's just like a resource manager that releases locks
- so encapsulate context as a (volatile) resource
- If state must be maintained across transaction boundaries, then treat it like any resource manager (e.g. DBMS)
40Fault Tolerance (I)
- What to do if a client doesn't receive a reply within its timeout period?
- Why not just retry?
- In TP, many RPC calls are not idempotent
- Idempotent: any number of operation executions has the same effect as one execution
- Queries (read-only) are idempotent, but not most updates
- Send a ping for non-idempotent calls
- After giving up, ignore late-arriving responses
- Can't assume that the call didn't run, so usually requires aborting the transaction that made the call (up to the application)
41Fault Tolerance (II)
- Interface definition can say whether server is idempotent
- could even be done per member function
- More abstract view
- RPC executes idempotent calls at least once
- RPC executes non-idempotent calls at most once
- If the goal is exactly once, execute the RPC within a transaction and use transaction retry logic to ensure the transaction actually runs
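One common way to get at-most-once execution of a non-idempotent call is for the server to remember replies by request id, so that a client retry returns the saved reply instead of running the operation again. The slides do not prescribe this mechanism; the class `AtMostOnceServer` and its in-memory reply table are assumptions made for illustration.

```python
class AtMostOnceServer:
    """Runs each request id at most once and replays the saved reply on retries."""

    def __init__(self, handler):
        self._handler = handler
        self._replies = {}  # request id -> reply already produced

    def call(self, request_id: str, *args):
        if request_id not in self._replies:
            # First time this request is seen: execute and remember the reply.
            self._replies[request_id] = self._handler(*args)
        return self._replies[request_id]


# A non-idempotent operation used to exercise the server.
inventory = {"count": 0}


def increment(by: int) -> int:
    inventory["count"] += by
    return inventory["count"]
```

Because the reply table here lives in volatile memory, a server crash would lose it. Running the call and the table update inside one transaction is what moves this from at-most-once toward the exactly-once goal named above.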
42Fault Tolerance By Logging Device I/O
- Consider a transaction all of whose operations are undoable.
- Log all of the transaction's interaction with the outside world.
- If the transaction fails, replay it.
- During the replay,
- get input from the log
- validate that output is identical to what was logged.
- If the output diverges from the log, then start asking for live input (and then ignore rest of the log).
- A variation of this is used by Digital's RTR
43Transparent Transaction IDs
- Ideally, Start returns a transaction ID that's hidden from the caller
- Procedures don't need to explicitly pass transaction ids.
- Easier, and avoids errors
- Moreover, when a transaction first arrives at a site, the local transaction manager needs to be notified.
- Application shouldn't need to deal with this
- This is what makes RPC (or other paradigms) transactional.
44Implementing transactions
45Private Workspace
- The file index and disk blocks for a three-block file
- The situation after a transaction has modified block 0 and appended block 3
- After committing
46Writeahead Log
47Locking schemes
- Give each Tx the illusion that there are no concurrent updates
- Hide concurrency anomalies.
- Do it automatically
- Apps do not know transaction semantics
- Goal
- Although there is concurrency in the system, execution is equivalent to some serial execution of the system
- Not deterministic outcome, just a consistent transformation
48Two-Phase Locking
49Strict Two-Phase Locking
50Deadlock Detection
- Deadlock: a cycle in the wait-for graph
- Kinds of waits: database locks, terminal/device, storage, session, server
- Correct detection must get the complete graph
- Not likely, so always fall back on timeout
- Model of deadlock shows
- waits are rare
- deadlocks are rare^2 (very very rare)
- virtually all cycles are length 2
- so do depth-first search either as soon as you wait OR after a timeout
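The depth-first search mentioned above can be sketched as follows, assuming the wait-for graph is available as a dictionary that maps each transaction to the transactions it is waiting for. The function name and the graph representation are illustrative; a real lock manager would start from its own wait queues and fall back on timeouts when the graph is incomplete.

```python
def find_deadlock(wait_for: dict, start: str):
    """Return a cycle reachable from `start` as a list of transactions, or None."""
    path = []        # current DFS path, in order
    on_path = set()  # same transactions, for O(1) membership tests
    visited = set()  # transactions already fully explored or in progress

    def dfs(tx):
        if tx in on_path:
            # Found a back edge: the cycle is the path from tx back to tx.
            return path[path.index(tx):] + [tx]
        if tx in visited:
            return None
        visited.add(tx)
        on_path.add(tx)
        path.append(tx)
        for blocker in wait_for.get(tx, ()):
            cycle = dfs(blocker)
            if cycle:
                return cycle
        on_path.discard(tx)
        path.pop()
        return None

    return dfs(start)
```

Starting the search from the transaction that has just begun to wait keeps it cheap, since any new cycle must pass through that transaction.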
51Model of Deadlock (I)
- R: number of objects (locks)
- r: objects locked per transaction
- N+1: concurrent transactions
- ASSUME
- Transaction is r LOCK steps, then commit
- Uniform distribution
- exclusive locks only
- N*r << R
Prob. a request waits
Prob. a transaction waits
52Model of Deadlock (II)
- Probability of a cycle of length 2, length 3, ...
Prob. a transaction deadlocks (PD): assumes all cycles are of length 2
System deadlock rate is N+1 times higher
Conclusions: control transaction size and duration; limit multiprogramming
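The equations for this model did not survive in the slide text. The block below is a reconstruction of the standard back-of-the-envelope analysis under the stated assumptions (R objects, r exclusive locks per transaction, N+1 concurrent transactions, uniform access, N r much smaller than R). The subscripted names are chosen here for readability and may not match the original slide.

```latex
\begin{align*}
% Each of the other N transactions holds about r/2 locks on average,
% so roughly N r / 2 of the R objects are locked at any time.
P_{\text{request waits}} &\approx \frac{N \cdot r/2}{R} = \frac{N r}{2R} \\
% A transaction issues r lock requests.
P_{\text{tx waits}} &= 1 - \left(1 - P_{\text{request waits}}\right)^{r}
  \approx r \cdot \frac{N r}{2R} = \frac{N r^{2}}{2R} \\
% Cycle of length 2: T waits for some T', and T' waits for T in particular
% (probability P_{tx waits} / N).
PD &\approx \frac{\left(P_{\text{tx waits}}\right)^{2}}{N} = \frac{N r^{4}}{4R^{2}} \\
% N+1 transactions are running.
\text{system deadlock rate} &\approx (N+1)\,PD \approx \frac{N^{2} r^{4}}{4R^{2}}
\end{align*}
```

The fourth-power dependence on r and the quadratic dependence on N are what motivate the conclusions: keep transactions small and short, and limit the multiprogramming level.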
53Atomicity of multi-server transactions
- Goal is to ensure the atomicity of a Tx that accesses multiple resource managers
- Data
- Messages
- Anything shared by Txs
- Problems
- What if resource manager RMi fails after a transaction commits at RMk?
- What if other RMs are down when RMi recovers?
- What if a Tx thinks RMi failed and therefore aborted, when it actually is still running?
54Transactional Communications
- Three paradigms for communications between application programs in transactions
- remote procedure call - procedure calls between address spaces
- peer-to-peer messages - send-message / receive-message
- queues - enqueue, dequeue to a shared queue
- These paradigms are not unique to TP, but they all have TP-specific aspects
55Remote Procedure Call
- Program calls remote procedure the same way it would call a local procedure
- Variation: asynchronous call and return, for single-threaded client
- Most widely-used standard is RPC in OSF/DCE
- Hides certain underlying complexities
- communications and message ordering errors
- data representation differences between programs
56Desirable RPC Features
- A way of pipelining large parameters on call or return (e.g. for queries). Pipe in DCE/RPC.
- pass a handle as parameter, with a type, so client and server agree on what's being passed
- receiver can claim pieces, a chunk-at-a-time
- Callbacks - a server calls a procedure in the client
- essentially a reverse RPC
- requires another controlled binding to the right client entry point
- useful for controlled conversational access between server and client
- not supported in DCE RPC
57RPC Performance
- RPC costs
- marshaling and unmarshaling
- RPC runtime and network protocol
- physical wire transfer
- In the remote case, these are about equal (but people are working to do better)
- Typical commercial numbers are 10-15K machine instructions
- Can do much better in the local case by avoiding a full context switch
58IBMs LU6.2 Peer-to-Peer
- De-facto standard
- APPC and CPI-C APIs
- Programs establish conversations (i.e. sessions) via Allocate
- Close the conversation with Deallocate
- Then send and receive messages over the conversation using Send_Data and Receive_Data
- Uses the chained transaction model. Announce transaction done using Syncpoint or Backout.
- One-pipe model: data (send/receive) and control (2-phase commit) messages flow on the same session.
59Conversations Two-way Alternate
- A conversation is half-duplex.
- Reflects the call-return style of most TP communications
- One participant is in send mode and the other is in receive mode.
- The sender must explicitly turn over send control to the receiver, in a Send_Data call.
- The receiver can't start sending until it receives from the sender (using Receive_Data) a send-mode signal (a.k.a. polarity indicator)
60Conversation Trees
- When a program issues an Allocate(program-name), the called instance of program-name becomes a child of the caller
- Thus Allocate calls from programs cause a conversation tree to develop
- E.g. A calls Allocate(B); B calls Allocate(C), Allocate(D), and Allocate(A)
[Figure: conversation tree with root A, its child B, and B's children C, D, and a second instance of A]
61Synchronization Levels
- There are 3 levels of synchronization in LU6.2
- Level 2 - programs in the conversation tree execute in a transaction. Each program explicitly says when to commit by issuing Syncpoint.
- Level 1 - No transactions. Each program can acknowledge receipt of a message by issuing a Confirm signal, which is meant to indicate that the program has processed the message(s).
- Level 0 - No transactions. No confirm. Just send and receive messages over a conversation.
- Many non-IBM implementations are level 0.
62Syncpoint Rules (I)
- A program issues Syncpoint to announce it's done with its part of the transaction
- Causes Syncpoint message to propagate to its neighbors in the conversation tree.
- A program can issue Syncpoint if either
- all of its conversations are in send mode, and it has not received a Syncpoint request over any other conversation, or
- all but one of its conversations are in send mode, and it received a Syncpoint over the receive-mode conversation
- Syncpoint blocks the caller until the whole transaction is committed or aborted (return code tells which).
63Syncpoint Rules (II)
- Next statement is part of a new transaction (chained model)
- all programs in the conversation are part of the same new transaction (chaining is in the protocol, not just the API)
- Eliminates some but not all protocol errors. E.g.,
- A and C are in send mode to B, and no Syncpoints yet
- A and C issue Syncpoint, which collides at B
- B is stuck and will never satisfy the rules
- LU6.2 is commit-from-anywhere. I.e. any program in the conversation tree can be the first to call Syncpoint. It needn't be the root of the conversation tree.
64Stateful Programs (I)
- Connection-oriented communications model
- A conversation names some shared state between the communicating programs
- direction of communications
- direction of the link
- transaction id
- state of the transaction
- Since programs hold conversations across message exchanges, they may rely on each other's retained state from previous message exchanges.
65Stateful Programs (II)
- E.g., P1 has a connection to P2. P1 scans a file owned by P2. P2 maintains a cursor (retained state), indicating P1's position in the file.
- Since connections aren't recoverable across system failures, programs must be able to reconstruct retained state after they recover.
- I.e. after each transaction commits or aborts
- When a session is lost, programs must be able to release retained state (needed anyway to abort automatically when a program fails)
66Message Passing Flexibility (I)
- Request-reply protocols (RPC) require programs to properly nest their request and reply messages.
- Example - Request-reply matching
[Figure: time diagram of programs A, B, and C exchanging nested Call and Return messages]
67Message Passing Flexibility (II)
- But peer-to-peer allows arbitrary message flows between the two parties to a conversation
- Example - peer-to-peer message passing
[Figure: time diagram of programs A, B, and C exchanging Send and Rcv messages]
- To communicate with an application that uses peer-to-peer, you must know the message flows (protocol) that the application expects
68Termination Model (I)
- In RPC, a program normally announces termination by returning to its caller
- It must not return until all of its outbound calls have returned
- In peer-to-peer, a program announces termination by invoking Syncpoint.
- This also tells the program's transaction to start committing, but each program decides independently when to commit (by issuing Syncpoint)
- Termination errors are the price of more message passing flexibility ...
69Termination Model (II)
- Certain programming errors are possible in peer-to-peer
- P1 invokes Syncpoint, but P2 is waiting for a message from P1. P1 and P2 are deadlocked.
- P2 gives up waiting for P1's message, so P2 invokes Syncpoint. P2 must be ready for P1's message after Syncpoint returns.
70Connection Models
- To cope with stateful servers, both models need a way to manage shared state.
- In peer-to-peer, the state is implicitly attached to conversation (session) context
- In RPC, it is either exchanged in parameters or a session is created above the communications layer using a binding handle.
- In both models, we need to clean up retained state after a failure and need to reconstruct shared state at appropriate times.
71Multi-Transaction Requests
- Some requests cannot execute as one transaction because
- it executes too long (causing lock contention)
- resources don't support a compatible 2-phase commit protocol.
- Transaction may run too long because
- It requires display I/O with user
- People or machines are unavailable (hotel reservation system, manager who approves the request)
- It requires long-running real-world actions (e.g. get 2 estimates before settling an insurance claim)
- Subsystem's transactions must be ACID (placing an order, scheduling a shipment, reporting commission)
72Workflow
- A multi-transaction request is called a workflow
- Specialized workflow products are being offered.
- IBM Flowmark, Action, JetForm, Wang/Kodak, ...
- They have special features, such as
- Flow-graph language for describing processes consisting of steps, with preconditions for moving between steps
- representation of organizational structure and roles (manual step can be performed by a person in a role, with complex role resolution procedure)
- tracing of steps, locating in-flight workflows
- ad hoc workflow, integrated with e-mail (case mgmt)
73Managing Workflow with Queues
- Each workflow step is a request
- Send the request to the queue of the server that can process the request
- Server outputs request(s) for the next step(s) of the workflow
74Workflows Can Violate Atomicity Isolation
- Since a workflow runs as many transactions,
- it may not be serializable relative to other workflows
- it may not be all-or-nothing
- Consider a money transfer run as 2 Txs, T1 and T2
- Conflicting money transfers could run between T1 and T2
- A failure after T1 might prevent T2 from running
- These problems require application-specific logic
- E.g. T2 must send ack to T1's node. If T1's node times out waiting for the ack, it takes action, possibly compensating for T1
75Automated Compensation
- Saga: workflow specification, where for each step we identify a compensation.
- If a workflow stops making progress, run compensations for all committed steps, in reverse order (like Tx abort).
- Need to ensure that each compensation's input is available (e.g. log it) and that it definitely can run (enforce constraints until workflow completes).
- Concept is still at the research stage.
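The saga idea can be sketched as a loop over (action, compensation) pairs that, on failure, runs the compensations of the already-completed steps in reverse order. This is a simplified illustration: `run_saga` is a hypothetical name, steps are plain Python callables rather than committed transactions, and the logging of compensation inputs that the slide calls for is omitted.

```python
def run_saga(steps) -> bool:
    """steps: list of (action, compensation) callables. Returns True if all steps ran."""
    completed = []  # compensations for steps that have finished, in execution order
    for action, compensation in steps:
        try:
            action()
        except Exception:
            # The workflow cannot make progress: undo finished steps, newest first.
            for undo in reversed(completed):
                undo()
            return False
        completed.append(compensation)
    return True
```

For a two-step money transfer followed by a failing notification step, the credit is compensated before the debit, mirroring how a transaction abort undoes updates in reverse order.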
76Pseudo-conversations
- A conversational transaction interacts with its user during its execution.
- Since this is long-running, it should run as multiple requests
- Since there are exactly two participants, just pass the request back and forth
- request carries all workflow context
- request is recoverable, e.g. send/receive is logged or request is stored in shared disk area
- This is a simpler mechanism than queues