CS556: Distributed Systems

Slides: 77

Provided by: mar177

1
CS-556 Distributed Systems
Transactions

Manolis Marazakis
maraz_at_csd.uoc.gr

2
Terminology (I)

Reliability - the extent to which a system yields
expected results on repeated trials
Reliability is measured by mean-time-between-failures
(MTBF)
Availability - fraction of time the system yields
expected results
reduced by downtime, repair, preventive
maintenance
Availability = MTBF / (MTBF + MTTR), where MTTR is
mean-time-to-repair
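
As a rough illustration of the availability formula, here is a minimal Python sketch. The function names and the choice of hours as the unit are assumptions made for this example only.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system yields expected results: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_per_year_hours(mtbf_hours: float, mttr_hours: float) -> float:
    """Expected unavailable hours in a 365-day year."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * 365 * 24
```

Shortening MTTR raises availability just as lengthening MTBF does, which is why fast fault detection matters later in the deck.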

3
Terminology (II)

24x7 - system should always be operational
Failure - an event where the system gives
unexpected results (e.g. wrong or no result)
Fault - an identified or hypothesized cause of
failure
A bad memory board (fault) causes OS to fail
Masking - prevents a fault from becoming a
failure
E.g. OS reconfigures around a faulty disk
controller
Transient fault - the fault does not reoccur when
retrying the operation that caused the fault
Permanent fault - non-transient fault

4
Where Do Failures Come From?
Source (percent)   Tandem 89   Tandem 85   AT&T/ESS 85
Environment             7          14
Hardware                8          18
Subtotal               15          32          20
System Mgmt            21          42          30
Software               64          25          50

Software is most of the problem !
Environment: fire, flood, earthquake, vandalism,
communications/power/air conditioning failures
System management: maintenance, system operation,
system configuration
System managers and vendors are responsible for
environment and administration

5
Recovery in Client/Server Systems

For each component: how to detect failure and
recovery, and what to do about them

Client submits a request
Server processes the request
Server returns a reply
A failure can occur during any of these activities

6
Detecting Failures

Fault detection must be accurate,
or you'll take recovery actions for a
still-active process
Fault detection must be fast,
since it contributes to MTTR
Techniques
Each process sends heartbeats ("I'm alive") to a
monitor process.
No heartbeat => failed process
Monitor polls processes for "I'm alive" messages
Process gets an OS lock. Monitor waits for the
lock. If process fails, OS releases its lock, so
monitor gets it and detects the failure
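
The heartbeat technique can be sketched as follows. `HeartbeatMonitor`, its method names, and the explicit `now` argument are hypothetical choices for this example; a real monitor would read a clock and receive heartbeats over the network.

```python
class HeartbeatMonitor:
    """Tracks the time of the last heartbeat received from each process."""

    def __init__(self, timeout: float):
        self.timeout = timeout      # silence longer than this counts as failure
        self.last_seen = {}         # process id -> time of last heartbeat

    def heartbeat(self, process_id: str, now: float) -> None:
        """Record an "I'm alive" message from a process."""
        self.last_seen[process_id] = now

    def failed(self, now: float) -> list:
        """Return the processes whose last heartbeat is older than the timeout."""
        return sorted(
            pid for pid, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

The timeout embodies the accuracy/speed trade-off: a short timeout detects failures quickly but risks declaring a slow, still-active process dead.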

7
Client Recovery

What calls were in-flight at failure time?
How much of each call was processed?
How to finish these calls before proceeding?
General-purpose solution
Execute requests as transactions.
Maintain a persistent queue of requests and
replies, shared by client and server.
At recovery time, use the queue contents to
decide which transactions executed. Re-run ones
that didn't.
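
A minimal sketch of the recovery decision, assuming each reply is enqueued by the same transaction that executes its request, so a missing reply means the request did not run. The list-of-dicts layout and the field names `id` and `request_id` are illustrative; plain lists stand in for the persistent queues.

```python
def requests_to_rerun(request_queue, reply_queue):
    """Return the requests that have no recorded reply, in submission order."""
    replied = {reply["request_id"] for reply in reply_queue}
    return [request for request in request_queue if request["id"] not in replied]
```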

8
Server Recovery (I)

Assume no transactions, for now
When it failed, the server may have been processing
a call. If all of the call's operations are
redoable, then just re-execute the call.
What if it performed non-redoable operations
before it failed (increment inventory, print
ticket, ...)?
Recover to a state after the last non-redoable
operation
Requires careful bookkeeping during normal
operation, so there's enough state to figure out
how to recover
Checkpoint - an action to avoid redo during
recovery
E.g. save server's memory state to disk

9
Server Recovery (II)

Checkpoint before each non-redoable operation
At recovery time
Restore last checkpoint
Check if last non-redoable operation actually ran
E.g. save incremented inventory value in memory
before checkpointing. At recovery compare memory
and disk copies of inventory.
E.g. read ticket number from ticket printer
before checkpointing. At recovery, read ticket
number from ticket printer and compare to number
before checkpoint.
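
The ticket-printer check can be sketched like this. `TicketPrinter` is a stand-in for a device whose counter can be read back, and the JSON checkpoint file and field name `ticket_before` are assumptions made for the example.

```python
import json
import os


class TicketPrinter:
    """Stand-in for a printer that exposes the number of the last ticket printed."""

    def __init__(self):
        self.last_ticket = 0

    def print_ticket(self):
        self.last_ticket += 1  # the non-redoable operation


def checkpoint(path, printer):
    """Save the printer's ticket number to disk before the non-redoable operation."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"ticket_before": printer.last_ticket}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename: checkpoint is whole or absent


def operation_ran(path, printer):
    """At recovery: did the printer advance past the checkpointed number?"""
    with open(path) as f:
        saved = json.load(f)
    return printer.last_ticket > saved["ticket_before"]
```

If `operation_ran` returns true, recovery resumes after the print step; otherwise it re-executes the step.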

10
Server-side Transactions

If servers run each client's call within a
transaction, then at recovery time, abort all
active transactions and re-run them. No
checkpointing required.
It's as if the server checkpoints before Start
transaction, and Commit is the non-redoable
operation to check at recovery to see if it
actually ran.
This assumes all transaction operations are
undoable
Transaction recovery techniques are much cheaper
than memory checkpointing.
Client performs non-redoable operations outside a
transaction, and uses other checkpointing
techniques

11
Transactions

Dependable computing
All objects managed by a server remain in a
consistent state
when they are accessed by multiple clients
in the presence of failures
Protect the operations of clients from
interference
Atomic operations at the server
Synchronized access to shared resources
Synchronization: cooperation of clients
Wait/notify/notify_all, locks
Transactions
Predictable faults
Although errors can occur, they can be detected
and dealt with before any incorrect behavior occurs
Stable storage processors

12
All-or-nothing (Exactly-Once)

A transaction either completes successfully
in which case the effects of all of its
operations are recorded in the system's objects
or fails (or is deliberately aborted)
in which case it has no effect at all
Failure atomicity
Atomic effects, even if server crashes
Durability
State data are made permanent
Will survive if server process crashes
The intermediate effects of a transaction must
not be visible to others

13
Serializable execution

Transactions are allowed to execute concurrently
but only as long as their overall effects are
equivalent to some serial execution

14
Masking faults

Architecture, software, and distribution mask
hardware faults, environmental faults, and
maintenance
Software automates / eliminates operators
In the limit there are only software design
faults.
Software-fault tolerance is the key to
dependability

15
Fault Tolerance Techniques

FAIL-FAST MODULES: work or stop
SPARE MODULES: instant repair time
INDEPENDENT MODULE FAILS by design:
MTTF_pair ~ MTTF^2 / MTTR (must have small MTTR)
MESSAGE-BASED OS: fault isolation; software has
no shared memory
SESSION-ORIENTED COMM: reliable messages; detect
lost/duplicate messages; coordinate messages
with commit
PROCESS PAIRS: mask hardware and software faults
TRANSACTIONS: A.C.I.D. (simple fault model)

16
Fail-Fast Repair
Lifecycle of a module: fail-fast gives short
fault latency
High availability is low unavailability
Unavailability ~ MTTR / MTTF

Improving either MTTR or MTTF gives benefit
Simple redundancy is not enough on its own.

17
Atomic Actions

Either the action is executed completely (and
successfully), or it has not happened at all.
If it is not successful, it has not left any side
effects.
Except for simple instructions at the processor
level, no operation is really atomic
Most operations are implemented by a sequence of
more primitive operations.
Precautions have to be taken
The lower-level operations must not make any
visible changes before it is clear that the
top-level operation will succeed.
For any temporary changes the lower-level
operations make, it must be made sure that they
do not become visible, and that they can be
revoked automatically should anything go wrong.

18
A Classification of Action Types

Unprotected actions: lack all of the ACID
properties except for consistency
Unprotected actions are not atomic, and their
effects cannot be depended upon.
Almost anything can fail.
Protected actions: actions that do not
externalize their results before they are
completely done
Their updates can be rolled back
Once they have reached their normal end, there
will be no unilateral rollback.
Real actions: affect the real, physical world
consistent, isolated, durable
irreversible (in the majority of cases)

19
Sample Transaction Program (I)
exec sql begin declare section;
  long Aid, Bid, Tid, delta, Abalance;
exec sql end declare section;

DCApplication()
{
  read input msg;
  exec sql begin work;
  Abalance = DoDebitCredit(Bid, Tid, Aid, delta);
  send output msg;
  exec sql commit work;
}
20
Sample Transaction Program (II)
long DoDebitCredit(long Bid, long Tid, long Aid, long delta)
{
  exec sql update accounts
           set Abalance = Abalance + :delta
           where Aid = :Aid;
  exec sql select Abalance into :Abalance
           from accounts where Aid = :Aid;
  exec sql update tellers
           set Tbalance = Tbalance + :delta
           where Tid = :Tid;
  exec sql update branches
           set Bbalance = Bbalance + :delta
           where Bid = :Bid;
  exec sql insert into history(Tid, Bid, Aid, delta, time)
           values (:Tid, :Bid, :Aid, :delta, CURRENT);
  return(Abalance);
}
21
ACID Properties (I)

Atomicity
State transition occurs without any observable
intermediate states
... or the system appears as though it never left
the initial state
It holds whether the transaction, the entire
application, the operating system, or other
components function normally, function
abnormally, or crash.
For a transaction to be atomic, it must behave
atomically to any outside observer.

22
ACID Properties (II)

Consistency: a transaction produces consistent
results only; otherwise it aborts.
A result is consistent if the new state of the
database fulfills all the consistency constraints
of the application
The program has functioned according to
specification.

23
ACID Properties (III)

Isolation
A program running under transaction protection
must behave exactly as it would in single-user
mode.
That does not mean transactions cannot share data
objects.
The definition of isolation is based on
observable behavior from the outside, rather than
on what is going on inside.

24
ACID Properties (IV)

Durability: the results of transactions having
completed successfully must not be forgotten by
the system
Once the system has acknowledged the execution of
a transaction, it must be able to reestablish its
results after any type of subsequent failure,
whether caused by the user, the environment, or
the hardware components.

25
Concurrency control (I)

Lost update
3 accounts (A, B, C)
with balances 100, 200, 300
T1 transfers from A to B, for a 10% increase
T2 transfers from C to B, for a 10% increase
Both T1 and T2 read the balance of B (200)
T1 overwrites the update by T2
without seeing it

Transactions should not read a stale value
and use it in computing a new value
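
The lost update can be reproduced with a small deterministic simulation. It assumes, as in the usual textbook version of this example, that each transfer raises B's balance by 10%; the schedule encoding is invented for the sketch.

```python
def run(schedule, balance_b):
    """Execute (transaction, operation) steps against account B.

    "read" copies B into the transaction's private variable;
    "write" stores that private value increased by 10% back into B.
    """
    private = {}
    for tx, op in schedule:
        if op == "read":
            private[tx] = balance_b
        elif op == "write":
            balance_b = private[tx] * 110 // 100
        else:
            raise ValueError(f"unknown operation: {op}")
    return balance_b


# Both transactions read B before either writes: T1 overwrites T2's update.
INTERLEAVED = [("T1", "read"), ("T2", "read"), ("T2", "write"), ("T1", "write")]
# T2 runs entirely after T1, so it reads T1's result.
SERIAL = [("T1", "read"), ("T1", "write"), ("T2", "read"), ("T2", "write")]
```

Starting from 200, the serial schedule applies both increases, while the interleaved one applies only one.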
26
Concurrency control (II)

Inconsistent retrievals
T1 transfers 10 of account A to account B
T2 computes sum of account balances
T2 computes sum before T1 updates B

Update transactions should not interfere with
retrievals.
In general: transactions should not violate
operation conflict rules.
27
Concurrency control (III)
Serial equivalence: criterion for correct
concurrent execution
T1 is serially equivalent with T2 iff all pairs of
conflicting operations of the two transactions
are executed in the same order at all objects
that both transactions access.

3 approaches to CC
Locking
Optimistic CC
Timestamp ordering

Txs wait for one another OR Restart Txs after
conflicts have been detected
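
The serial-equivalence criterion can be checked mechanically for a recorded schedule. The sketch below assumes operations are tuples `(transaction, "r" or "w", object)` and that two operations conflict when they touch the same object and at least one is a write. It only checks pairwise ordering, which is enough for two transactions; a full test for many transactions would look for cycles in the resulting ordering.

```python
def serially_equivalent(schedule):
    """True if every pair of transactions orders all its conflicts the same way."""
    ordered = set()  # (first_tx, second_tx) for each conflicting pair seen
    for i, (tx1, op1, obj1) in enumerate(schedule):
        for tx2, op2, obj2 in schedule[i + 1:]:
            conflict = obj1 == obj2 and (op1 == "w" or op2 == "w")
            if tx1 != tx2 and conflict:
                ordered.add((tx1, tx2))
    # A violation is a pair of transactions ordered both ways.
    return not any((b, a) in ordered for a, b in ordered)
```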
28
Common Performance Problems

Convoys on semaphores or high-traffic locks
Log semaphore is hotspot
Sequential insert is hotspot
Lock manager costs too much. A good number: 300
instructions for lock + unlock (no-wait case)
File or page granularity locking causes hotspot
for small files

29
Comparisons of CC methods

Order of serialization
Timestamp-based: static,
when Txs begin
2-phase locking: dynamic,
based on access pattern
2-phase locking is better when the frequency of
updates is high
Otherwise, timestamp-based and optimistic methods
perform better
Handling conflicts
2-phase locking: block (danger of deadlock)
Timestamp-based: immediate abort

30
Challenging applications

Multi-user, collaborative
Atomic updates, in the presence of concurrency
and server crashes
Must also support notification of changes
and access to work-in-progress (sharing)
Relaxed isolation guarantees
Long-lasting Txs
CAD/CAM, software development
Independent versions of objects
Check-in / check-out / merge operations

31
Recoverability from aborts (I)

Servers must prevent an aborting Tx from affecting
other concurrent Txs.
Dirty reads
T2 sees the result of an update by T1 on account A
T2 performs its own update on A, then commits.
T1 aborts => T2 has seen a transient value
T2 is not recoverable
If T2 delays its commit until T1's outcome is
resolved
Abort(T1) => Abort(T2)
However, if T3 has seen results of T2
Abort(T2) => Abort(T3) !
Cascading aborts

Txs should only read values written by committed
Txs
32
Recoverability from aborts (II)

Premature writes
Assume server implements abort by maintaining the
before image of all update operations
T1 and T2 both update account A
T1 completes its work before T2
If T1 commits and T2 aborts, the balance of A is
correct
If T1 aborts and T2 commits, the before image
that is restored corresponds to the balance of A
before T2
If both T1 and T2 abort, the before image that is
restored corresponds to the balance of A as set
by T1

Txs should be delayed until earlier Txs that
update the same objects have been either
committed or aborted.
33
Recoverability from aborts (III)

Txs should delay both their reads and updates in
order to avoid interference
Strict execution => enforce isolation
Servers should maintain tentative versions of
objects in volatile memory

Txs should be delayed until earlier Txs that
update the same objects have been either
committed or aborted.
34
Interface of transaction coordinator
openTransaction() -> trans
starts a new transaction and delivers a unique TID
trans. This identifier will be used in the other
operations in the transaction.
closeTransaction(trans) -> (commit, abort)
ends a transaction: a commit return value indicates
that the transaction has committed; an abort return
value indicates that it has aborted.
abortTransaction(trans)
aborts the transaction.
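
A toy coordinator with this interface might look as follows. The three transaction operations mirror the slide; the `read` and `write` methods, the in-memory dictionaries, and the use of tentative versions are assumptions for the sketch, and it deliberately omits concurrency control.

```python
import itertools


class Coordinator:
    """Single-process sketch: tentative versions become visible only at commit."""

    def __init__(self):
        self._tids = itertools.count(1)
        self.committed = {}   # object name -> committed value
        self.tentative = {}   # TID -> {object name -> tentative value}

    def open_transaction(self):
        tid = next(self._tids)
        self.tentative[tid] = {}
        return tid

    def read(self, tid, name):
        """A transaction sees its own tentative writes, else the committed value."""
        if name in self.tentative[tid]:
            return self.tentative[tid][name]
        return self.committed.get(name)

    def write(self, tid, name, value):
        self.tentative[tid][name] = value

    def close_transaction(self, tid):
        """Return "commit" if the transaction's writes were installed, else "abort"."""
        writes = self.tentative.pop(tid, None)
        if writes is None:        # unknown or already aborted transaction
            return "abort"
        self.committed.update(writes)
        return "commit"

    def abort_transaction(self, tid):
        self.tentative.pop(tid, None)  # discard tentative versions
```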
35
Outcomes of a Flat Transaction
36
Stateless Servers (I)

Assume requests run as transactions
Two types of servers
application server - application logic
resource manager - shared resources (e.g. disks)

Application server is stateless if it
runs in transactions
maintains all state in resource mgrs
E.g. start Tx, call resource mgrs, commit
It doesn't need to recover. It only needs to know
which request it was processing (to redo it).

[Figure: Client, Application Server, Resource Manager, Resource]
37
Stateless Servers (II)

Resource mgr runs all requests in Txs
Must be able to perform recovery work
abort all partially completed transactions
ensure resources include all the results of all
committed transactions (atomic and durable)
database recovery algorithms (complex!)

38
Stateful Servers (I)

Sometimes a server maintains state on a client's
behalf. E.g.,
Server scans a file. Each time it hits a relevant
record it returns it. Next call picks up the scan
where it left off
(cursors).
Server maintains lots of user information, which
is too expensive to reconstruct on every call.
Approach 1: client passes state to server on each
call, and server returns it on each reply. Server
retains no state.
This is the default assumption outside TP, but
doesn't work well for TP.
Note that transaction id context is handled this
way.
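
Approach 1 applied to the file scan can be sketched as a server function that receives the cursor from the client and hands the updated cursor back. The function name and the use of a list index as the cursor are illustrative assumptions.

```python
def scan_next(records, cursor, is_relevant):
    """Return (record, new_cursor); record is None once the scan is exhausted.

    The server keeps no state between calls: the client stores the cursor
    and passes it back on the next call.
    """
    for position in range(cursor, len(records)):
        if is_relevant(records[position]):
            return records[position], position + 1
    return None, len(records)
```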

39
Stateful Servers (II)

Approach 2: server maintains state, indexed by
client id (e.g. transaction id). Subsequent RPCs
from the client must go to the same server and
pick up the retained context.
RPC can provide a binding handle for subsequent
calls. This ensures later calls go to the same
server.
If the client fails (e.g. it aborts the
transaction), server must be notified to release
client's context
It's just like a resource manager that releases
locks,
so encapsulate context as a (volatile) resource
If state must be maintained across transaction
boundaries, then treat it like any resource
manager (e.g. DBMS)

40
Fault Tolerance (I)

What to do if a client doesn't receive a reply
within its timeout period?
Why not just retry?
In TP, many RPC calls are not idempotent
Idempotent: any number of executions of the
operation has the same effect as one execution
Queries (read-only) are idempotent, but most
updates are not
Send a ping for non-idempotent calls
After giving up, ignore late-arriving responses
Can't assume that the call didn't run, so usually
requires aborting the transaction that made the
call (up to the application)

41
Fault Tolerance (II)

Interface definition can say whether server is
idempotent
could even be done per member function
More abstract view
RPC executes idempotent calls at least once
RPC executes non-idempotent calls at most once
If the goal is exactly once, execute the RPC
within a transaction and use transaction retry
logic to ensure the transaction actually runs
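
The difference between idempotent and non-idempotent calls under blind retry can be seen in a few lines. The account dictionary and the function names are invented for this example; `call_with_retry` simulates a client that re-sends a call because earlier replies were lost even though the server executed each one.

```python
def set_balance(account, value):
    """Idempotent: repeating the call leaves the same state."""
    account["balance"] = value


def deposit(account, amount):
    """Not idempotent: each execution changes the state again."""
    account["balance"] += amount


def call_with_retry(operation, account, argument, executions):
    """Run the same call `executions` times, as blind retries would."""
    for _ in range(executions):
        operation(account, argument)
    return account["balance"]
```

Retrying `set_balance` is harmless, while retrying `deposit` applies the amount once per execution, which is why non-idempotent calls need at-most-once handling or a transaction.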

42
Fault Tolerance By Logging Device I/O

Consider a transaction all of whose operations
are undoable.
Log all of the transaction's interaction with the
outside world.
If the transaction fails, replay it.
During the replay,
get input from the log
validate that output is identical to what was
logged.
If the output diverges from the log, then start
asking for live input (and then ignore rest of
the log).
A variation of this is used by Digital's RTR
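
One way to picture replay from an I/O log is a wrapper that the transaction uses for all input and output. This is a sketch under assumptions: the `ReplayIO` class, the `("in", value)` / `("out", value)` log format, and the callback names are hypothetical, and it is not a description of how RTR works.

```python
class ReplayIO:
    """Replays logged input and suppresses logged output until the run diverges."""

    def __init__(self, log, live_input, deliver):
        self.log = list(log)          # entries: ("in", value) or ("out", value)
        self.position = 0             # next log entry to replay
        self.live_input = live_input  # callable returning fresh input
        self.deliver = deliver        # callable sending output to the outside world

    def _replaying(self):
        return self.position < len(self.log)

    def _go_live(self):
        # Ignore the rest of the log once the run no longer matches it.
        del self.log[self.position:]

    def read(self):
        if self._replaying() and self.log[self.position][0] == "in":
            value = self.log[self.position][1]
            self.position += 1
            return value
        self._go_live()
        value = self.live_input()
        self.log.append(("in", value))
        self.position = len(self.log)
        return value

    def write(self, value):
        if self._replaying() and self.log[self.position] == ("out", value):
            self.position += 1        # identical output was already delivered
            return
        self._go_live()
        self.deliver(value)
        self.log.append(("out", value))
        self.position = len(self.log)
```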

43
Transparent Transaction IDs

Ideally, Start returns a transaction ID that's
hidden from the caller
Procedures don't need to explicitly pass
transaction ids.
Easier, and avoids errors
Moreover, when a transaction first arrives at a
site, the local transaction manager needs to be
notified.
Application shouldn't need to deal with this
This is what makes RPC (or other paradigms)
transactional.

44
Implementing transactions
45
Private Workspace

The file index and disk blocks for a three-block
file
The situation after a transaction has modified
block 0 and appended block 3
After committing

46
Writeahead Log
47
Locking schemes

Give each Tx the illusion that there are no
concurrent updates
Hide concurrency anomalies.
Do it automatically
Apps do not know transaction semantics
Goal
Although there is concurrency in the system,
execution is equivalent to some serial execution
of the system
Not deterministic outcome, just a consistent
transformation

48
Two-Phase Locking
49
Strict Two-Phase Locking
50
Deadlock Detection

Deadlock: a cycle in the wait-for graph
Kinds of waits: database locks, terminal/device,
storage, session, server
Correct detection must get complete graph
Not likely, so always fall back on timeout
Model of deadlock shows
waits are rare
deadlocks are rare^2 (very, very rare)
virtually all cycles are length 2
so do depth-first search either as soon as you
wait OR after a timeout
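
A depth-first search over the wait-for graph, started from the transaction that has just begun to wait, can be sketched as below. The graph representation (a dict mapping each transaction to the transactions it waits for) and the function name are assumptions for the example.

```python
def find_deadlock(waits_for, start):
    """Return a cycle through `start` as a list of transactions, or None."""
    path = [start]
    visited = set()

    def visit(tx):
        for blocker in waits_for.get(tx, ()):
            if blocker == start:
                return True               # came back to where we started
            if blocker not in visited:
                visited.add(blocker)
                path.append(blocker)
                if visit(blocker):
                    return True
                path.pop()                # dead end: backtrack
        return False

    return path + [start] if visit(start) else None
```

Since the model predicts that nearly all cycles have length 2, the search usually ends after one or two edges.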

51
Model of Deadlock (I)

R = number of objects (locks)
r = objects locked per transaction
N + 1 concurrent transactions
ASSUME
Each transaction is r LOCK steps, then commit
Uniform distribution
Exclusive locks only
N*r << R

Prob. a request waits
Prob. a transaction waits
52
Model of Deadlock (II)

Probability of a cycle of length 2, length 3, ...

Prob. a transaction deadlocks (PD): assumes all
cycles are of length 2
System deadlock rate is N + 1 times higher
Conclusions: control transaction size and
duration; limit multiprogramming
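
The formulas for these probabilities did not survive in the transcript. Under the stated assumptions (uniform access, exclusive locks, $Nr \ll R$), one common back-of-the-envelope derivation runs as follows; it is a reconstruction and not necessarily the slide's exact expressions. Each of the other $N$ transactions holds about $r/2$ locks on average, so about $Nr/2$ of the $R$ objects are locked at any time.

```latex
\begin{align*}
P_W(\text{request}) &\approx \frac{N r / 2}{R} = \frac{N r}{2R} \\
P_W(\text{transaction}) &= 1 - \bigl(1 - P_W(\text{request})\bigr)^{r}
                          \approx \frac{N r^{2}}{2R} \\
P(\text{cycle of length 2}) &\approx \frac{P_W(\text{transaction})^{2}}{N}
                          = \frac{N r^{4}}{4R^{2}} \\
PD &\approx \frac{N r^{4}}{4R^{2}} \\
\text{system deadlock rate} &\approx (N+1)\,PD \approx \frac{N^{2} r^{4}}{4R^{2}}
\end{align*}
```

The fourth-power dependence on $r$ and the quadratic dependence on $N$ are what motivate the conclusions: keep transactions small and limit the multiprogramming level.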
53
Atomicity of multi-server transactions

Goal is to ensure the atomicity of a Tx that
accesses multiple resource managers
Data
Messages
Anything shared by Txs
Problems
What if resource manager RMi fails after a
transaction commits at RMk?
What if other RMs are down when RMi recovers?
What if a Tx thinks RMi failed and therefore
aborted, when it actually is still running?

54
Transactional Communications

Three paradigms for communications between
application programs in transactions
remote procedure call - procedure calls between
address spaces
peer-to-peer messages - send-message /
receive-message
queues - enqueue, dequeue to a shared queue
These paradigms are not unique to TP, but they
all have TP-specific aspects

55
Remote Procedure Call

Program calls remote procedure the same way it
would call a local procedure
Variation: asynchronous call and return, for
single-threaded client
Most widely-used standard is RPC in OSF/DCE
Hides certain underlying complexities
communications and message ordering errors
data representation differences between programs

56
Desirable RPC Features

A way of pipelining large parameters on call or
return (e.g. for queries). Pipe in DCE/RPC.
pass a handle as parameter, with a type, so
client and server agree on what's being passed
receiver can claim pieces, a chunk-at-a-time
Callbacks - a server calls a procedure in the
client
essentially a reverse RPC
requires another controlled binding to the right
client entry point
useful for controlled conversational access
between server and client
not supported in DCE RPC

57
RPC Performance

RPC costs
marshaling and unmarshaling
RPC runtime and network protocol
physical wire transfer
In the remote case, these are about equal (but
people are working to do better)
Typical commercial numbers are 10-15K machine
instructions
Can do much better in the local case by avoiding
a full context switch

58
IBM's LU6.2 Peer-to-Peer

De-facto standard
APPC and CPI-C APIs
Programs establish conversations (i.e. session)
via Allocate
Close the conversation with Deallocate
Then send and receive messages over the
conversation using Send_Data and Receive_Data
Uses the chained transaction model. Announce
transaction done using Syncpoint or Backout.
One-pipe model: data (send/receive) and control
(2-phase commit) messages flow on the same
session.

59
Conversations Two-way Alternate

A conversation is half-duplex.
Reflects the call-return style of most TP
communications
One participant is in send mode and the other is
in receive mode.
The sender must explicitly turn over send control
to the receiver, in a Send_Data call.
The receiver can't start sending until it
receives from the sender (using Receive_Data) a
send-mode signal (a.k.a. polarity indicator)

60
Conversation Trees

When a program issues an Allocate(program-name),
the called instance of program-name becomes a
child of the caller
Thus Allocate calls from programs cause a
conversation tree to develop
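E.g. A calls Allocate(B); B calls Allocate(C),
Allocate(D), and Allocate(A)

[Figure: conversation tree with A at the root, B as its child, and C, D, and a new instance of A as children of B]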
61
Synchronization Levels

There are 3 levels of synchronization in LU6.2
Level 2 - programs in the conversation tree
execute in a transaction. Each program explicitly
says when to commit by issuing Syncpoint.
Level 1 - No transactions. Each program can
acknowledge receipt of a message by issuing a
Confirm signal, which is meant to indicate that
the program has processed the message(s).
Level 0 - No transactions. No confirm. Just send
and receive message over a conversation.
Many non-IBM implementations are level 0.

62
Syncpoint Rules (I)

A program issues Syncpoint to announce it's done
with its part of the transaction
Causes Syncpoint message to propagate to its
neighbors in the conversation tree.
A program can issue Syncpoint if either
all of its conversations are in send mode, and it
has not received a Syncpoint request over any
other conversation, or
all but one of its conversations are in send
mode, and it received a Syncpoint over the
receive-mode conversation
Syncpoint blocks the caller until the whole
transaction is committed or aborted (return code
tells which).
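
The two conditions under which a program may issue Syncpoint can be written as a predicate. The data layout is an assumption for the sketch: `modes` maps each conversation to `"send"` or `"receive"`, and `syncpoint_received_on` is the set of conversations over which a Syncpoint has arrived.

```python
def may_issue_syncpoint(modes, syncpoint_received_on):
    """Apply the two Syncpoint rules to a program's conversations."""
    receiving = [conv for conv, mode in modes.items() if mode == "receive"]
    if not receiving:
        # Rule 1: all conversations in send mode, no Syncpoint received.
        return not syncpoint_received_on
    if len(receiving) == 1:
        # Rule 2: the single receive-mode conversation delivered a Syncpoint.
        return receiving[0] in syncpoint_received_on
    return False  # two or more receive-mode conversations: neither rule holds
```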

63
Syncpoint Rules (II)

Next statement is part of a new transaction
(chained model)
all programs in the conversation are part of the
same new transaction (chaining is in the
protocol, not just the API)
Eliminates some but not all protocol errors.
E.g.,
A and C are in send mode to B, and no Syncpoints
yet

A and C issue Syncpoint, which collides at B
B is stuck and will never satisfy the rules
LU6.2 is commit-from-anywhere. I.e. any program
in the conversation tree can be the first to call
Syncpoint. It needn't be the root of the
conversation tree.

64
Stateful Programs (I)

Connection-oriented communications model
A conversation names some shared state between
the communicating programs
direction of communications
direction of the link
transaction id
state of the transaction
Since programs hold conversations across message
exchanges, they may rely on each other's retained
state from previous message exchanges.

65
Stateful Programs (II)

E.g., P1 has a connection to P2. P1 scans a file
owned by P2. P2 maintains a cursor (retained
state), indicating P1's position in the file.
Since connections aren't recoverable across
system failures, programs must be able to
reconstruct retained state after they recover.
I.e. after each transaction commits or aborts
When a session is lost, programs must be able to
release retained state (needed anyway to abort
automatically when a program fails)

66
Message Passing Flexibility (I)

Request-reply protocols (RPC) require programs to
properly nest their request and reply messages.
Example - Request-reply matching

[Figure: time diagram with A, B, and C; A calls B, B calls C, C returns to B, B returns to A]
67
Message Passing Flexibility (II)

But peer-to-peer allows arbitrary message flows
between the two parties to a conversation
Example - peer-to-peer message passing

[Figure: time diagram of Send and Rcv messages among A, B, and C]

To communicate with an application that uses
peer-to-peer, you must know the message flows
(protocol) that the application expects
68
Termination Model (I)

In RPC, a program normally announces termination
by returning to its caller
It must not return until all of its outbound
calls have returned
In peer-to-peer, a program announces termination
by invoking Syncpoint.
This also tells the program's transaction to
start committing, but each program decides
independently when to commit (by issuing
Syncpoint)
Termination errors are the price of more message
passing flexibility ...

69
Termination Model (II)

Certain programming errors are possible in
peer-to-peer
P1 invokes Syncpoint, but P2 is waiting for a
message from P1. P1 and P2 are deadlocked.
P2 gives up waiting for P1's message, so P2
invokes Syncpoint. P2 must be ready for P1's
message after Syncpoint returns.

70
Connection Models

To cope with stateful servers, both models need a
way to manage shared state.
In peer-to-peer, the state is implicitly attached
to conversation (session) context
In RPC, it is either exchanged in parameters or a
session is created above the communications layer
using a binding handle.
In both models, we need to clean up retained
state after a failure and need to reconstruct
shared state at appropriate times.

71
Multi-Transaction Requests

Some requests cannot execute as one transaction
because
the request executes too long (causing lock
contention)
resources don't support a compatible 2-phase
commit protocol.
Transaction may run too long because
It requires display I/O with user
People or machines are unavailable (hotel
reservation system, manager who approves the
request)
It requires long-running real-world actions (e.g.
get 2 estimates before settling an insurance
claim)
Subsystems' transactions must be ACID (placing an
order, scheduling a shipment, reporting
commission)

72
Workflow

A multi-transaction request is called a workflow
Specialized workflow products are being offered.
IBM Flowmark, Action, JetForm, Wang/Kodak, ...
They have special features, such as
Flow-graph language for describing processes
consisting of steps, with preconditions for
moving between steps
representation of organizational structure and
roles (manual step can be performed by a person
in a role, with complex role resolution
procedure)
tracing of steps, locating in-flight workflows
ad hoc workflow, integrated with e-mail (case
mgmt)

73
Managing Workflow with Queues

Each workflow step is a request
Send the request to the queue of the server that
can process the request
Server outputs request(s) for the next step(s) of
the workflow
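
A queue-driven workflow can be sketched with one queue per server. Everything named here is hypothetical: `handlers` maps a queue name to a function that processes one request and returns `(next_queue, next_request)` pairs, and in-memory deques stand in for recoverable queues. In a real system each dequeue-process-enqueue step would run as one transaction.

```python
from collections import deque


def run_workflow(first_request, handlers, start_queue):
    """Drive requests through the servers' queues; return the steps executed."""
    queues = {name: deque() for name in handlers}
    queues[start_queue].append(first_request)
    trace = []
    progressed = True
    while progressed:
        progressed = False
        for name, queue in queues.items():
            if queue:
                request = queue.popleft()
                trace.append((name, request))
                # The server outputs request(s) for the next step(s).
                for next_queue, next_request in handlers[name](request):
                    queues[next_queue].append(next_request)
                progressed = True
    return trace
```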

74
Workflows Can Violate Atomicity and Isolation

Since a workflow runs as many transactions,
it may not be serializable relative to other
workflows
it may not be all-or-nothing
Consider a money transfer run as 2 Txs, T1 and T2
Conflicting money transfers could run between T1
and T2
A failure after T1 might prevent T2 from running
These problems require application-specific logic
E.g. T2 must send ack to T1's node. If T1's node
times out waiting for the ack, it takes action,
possibly compensating for T1

75
Automated Compensation

Saga: a workflow specification where, for each
step, we identify a compensation.
If a workflow stops making progress, run
compensations for all committed steps, in reverse
order (like Tx abort).
Need to ensure that each compensation's input is
available (e.g. log it) and that it definitely
can run (enforce constraints until workflow
completes).
Concept is still at the research stage.
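
The compensation idea can be sketched in a few lines. This is an illustrative toy, not a description of any workflow product: each step is an `(action, compensation)` pair of callables, and "stops making progress" is modeled simply as an action raising an exception.

```python
def run_saga(steps):
    """Run steps in order; on failure, compensate committed steps in reverse.

    Returns True if every step completed, False if the saga was compensated.
    """
    compensations = []
    for action, compensation in steps:
        try:
            action()
        except Exception:
            for undo in reversed(compensations):
                undo()
            return False
        compensations.append(compensation)
    return True
```

Only steps that completed are compensated; the step that failed is assumed to have left no effects, as it would if it ran as a transaction that aborted.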

76
Pseudo-conversations

A conversational transaction interacts with its
user during its execution.
Since this is long-running, it should run as
multiple requests
Since there are exactly two participants, just
pass the request back and forth
request carries all workflow context
request is recoverable, e.g. send/receive is
logged or request is stored in shared disk area
This is a simpler mechanism than queues