Title: Synchronization in Distributed Systems
1 Synchronization in Distributed Systems
- EECS 750
- Spring 1999
- Course Notes Set 3
- Chapter 3
- Distributed Operating Systems
- Andrew Tanenbaum
2 Synchronization in Distributed Systems
- A computation must be composed of components separated logically, if not physically, or it cannot be considered to be distributed
- Just as this implies communication as a necessary component, so too is some form of synchronization
- Distributed components of the computation cooperate and exchange information
- This implies implicit and explicit constraints on how the components execute relative to one another
- Such constraints are ensured and enforced by various forms of synchronization
3 Synchronization
- Interacting components whose execution is constrained are obviously communicating
- Communication support is necessary but not sufficient for distributed systems
- Some forms of communication are synchronous, implying both properties
- As with communication:
- Different situations require different semantics
- The weakest adequate semantics are usually the best choice
- As with communication, synchronization in distributed systems is like that in uni-processor systems, only more so
4 Synchronization
- Uni-processor systems present all of the basic synchronization scenarios and problems:
- Critical sections
- Mutual exclusion
- Counting semaphores (resource allocation)
- Atomic transactions
- BUT a uni-processor is only pseudo-parallel
- Canonical problems are often simplified or are special cases of the general problem
- Multi-processors, multi-computers, and networks of workstations all have different implications
5 Synchronization
- Implicit assumptions change in moving from one architecture to another
- Uni-processor → multi-processor:
- True parallelism changes the probability of various scenarios by removing pseudo-parallel constraints
- Changes the methods by which critical sections must be, or are best, protected
- Still preserves the most basic assumption: atomic operations on shared memory
- Multiple caches and a NUMA hierarchy can make this complicated
6 Synchronization
- Single box → multiple distributed boxes:
- Violates the assumption that all components can have atomic access to shared memory
- Requires new methods of supporting synchronization
- All synchronization methods ultimately decide:
- Which computation operations must be controlled and which need not be
- In what order to execute computation operations
- Sets of events that can be done at the same time, or in any order, are concurrent
- Sets of events that must be done one at a time are sequential
7 Synchronization
- Many synchronization methods in distributed systems thus depend on:
- How the system can tell the time at which events occur
- How the system can tell the order in which events occur
- Under the principle of weakening semantics for better performance, there are many forms of event ordering
- We will consider:
- Mutual exclusion
- Election
- Atomic transactions
- Deadlock
8 Clock Synchronization
- Coordination in a DS often requires knowledge of when things happen, which implies a clock of some kind
- We will see that not all situations require clocks with semantics of the same strength
- Distributed systems are often more complicated than non-distributed equivalents because they require distributed rather than centralized algorithms
- The properties of distributed algorithms, as always, determine the set of system services required
- Distribution is often more complex and difficult than one first expects
- Centralized architectures are not necessarily bad
9 Clock Synchronization: Distributed Algorithm Properties
- Distributed algorithms have properties with important implications:
- Information is scattered among many components
- Computation components (processes) make decisions based on locally available information
- Single points of failure should be avoided
- No common clock or other precise global time source exists
- Yet, some form of distributed sense of time is required
- How precise depends on what has to be synchronized
- Coarse grain is easy
- Most DS require fine enough grain to be hard
10 Clock Synchronization: Distributed Algorithm Properties
- The first three properties argue against centralization for resource allocation and other types of management:
- Limits scalability
- Single point of failure
- Requires a new approach to algorithm design
- The fourth property points out that time is different in centralized and distributed systems
- Temporal values have meaning in a DS only within a given granularity determined by clock synchronization
- Unattended clocks can drift by hours or days
- ITTC uses GPS and the Network Time Protocol (NTP) to synchronize within fractions of a second
11 Clock Synchronization: Distributed Algorithm Properties
- Consider the problem of a distributed file system and compilation environment with files and compilers on multiple distributed machines
- Make tracks relations among source and output files to determine what needs to be recompiled at any given time
- Make uses the creation time stamps of the files to determine if a source file is younger than an output file
- This does not depend on the validity of the times of each file
- It does depend on the times imposing a correct order on the set of creation events for each file
- A single incorrect clock on a uni-processor works just fine
- Multiple clocks must be synchronized closely enough
12 Logical Clocks
- No computer has an absolute clock and no computer keeps absolute time
- Computers keep logical time for a number of reasons and in a number of different senses
- Logical time is a representation of absolute time in the computer, subject to a number of different constraints
- How does a computer obtain a sense of time?
- Often a periodic interrupt updating a software clock
- This imposes constraints on temporal resolution and overhead
- Raising the interrupt frequency raises the resolution but also the overhead
13 Logical Clocks
- Where does the periodic interrupt come from?
- Timer hardware with an oscillating crystal
- An interrupt is programmed for every N crystal oscillations
- Crystals differ from one another quite a bit
- Clock drift is the difference in rate between two clocks
- Clock drift over time results in a difference in value between clocks, called skew
- Lamport observed:
- Clock synchronization within reasonable limits is possible
- Usable synchronization need not be absolute
14 Logical Clocks
- The degree of synchronization required depends on the time scale of the operations being synchronized and the semantics of the synchronization
- Lamport based his approach on several observations:
- Components which do not interact place no constraint on the synchronization of their clocks
- Interacting components often care only about the order in which events occur, not their times
- Even when a global time is required, it can often be a logical time differing arbitrarily from real time
- When real time does matter, the system must be designed to tolerate the real clock synchronization tolerance
15 Logical Clocks
- Algorithms which depend on temporal ordering, but which do not depend on absolute time, use logical clocks
- Absolute time is given by physical clocks
- Lamport's algorithm synchronizes logical clocks
16 Lamport's Logical Clock Synchronization
- Lamport's approach to logical clocks is used in many situations in distributed systems where ordering is important but global time is not required
- It is an example of weakening semantics to simplify and/or increase efficiency
- Begin with an important relation: happens-before
- A happens-before B (A → B) when all the processes involved in a distributed decision agree that event A occurred first, and that then B occurred
- Note that this does not mean that A actually happened before B according to a hypothetical absolute global clock
17 Lamport's Logical Clock Synchronization
- A system can know that happens-before applies when:
- 1) Events A and B are observed by the same process, or by different processes with the same global clock, and A happens before B; then A → B
- 2) Event A denotes sending a message, and event B denotes receiving the same message; then A → B, since a message cannot be received before it is sent
- 3) Happens-before is transitive, so A → B and B → C implies A → C
- If two events X and Y do not interact through messages then they are concurrent, since neither X → Y nor Y → X can be determined, nor does it matter
18 Lamport's Logical Clock Synchronization
- We have, thus, distinguished between concurrent events, whose global ordering can be ignored, and events to which a global logical time must be assigned
- This global logical time is denoted as the logical clock value of an event, written C(A) for event A
- Consider the previous two situations:
- Events on the same system
- Send and receive events on different systems
19 Lamport's Logical Clock Synchronization
- On the same system, if A → B then C(A) < C(B) trivially, since the two events on the same system can easily use the same clock
- Note that the temporal granularity of the system clock being used must be sufficient to distinguish A and B
- Otherwise C(A) = C(B)
- When the events occur on different systems, we must assign C(A) and C(B) in such a way that the necessary relation C(A) < C(B) holds, without ever decreasing a time value
- Thus the logical clock value of an event may be changed, but always by moving it forward
- Logical clocks in a distributed system always run as fast as or faster than the physical clocks with which they interact
20 Lamport's Logical Clock Synchronization
- Consider how logical times are assigned in a specific scenario
- Figure 3-2, page 123 of Tanenbaum
- In part (a) of the figure the three machines have clocks running at different speeds, and the event times are not consistent with the happens-before relation
- Note that message C arrives at local time 56 even though it was sent at local time 60
- This is a contradiction, because send(C) → receive(C) requires C(send) < C(receive)
- Clearly, a message must be received after it is sent, so the receiver repairs this by setting its clock to 61
21 Lamport's Logical Clock Synchronization
- Adjusting the receiving clock to 61 or greater ensures that happens-before applies and events can be assigned a rational logical order
- Figure 3-2(b) shows this adjustment to the clock at the receiver of C
- Every message transfer takes at least 1 clock tick
- Any clock, logical or physical, has finite resolution
- Two events occurring close enough together happen at the same time
- All clock values are thus limited to creating partial rather than total orders on a set
- Some distributed algorithms require a total order
22 Lamport's Logical Clock Synchronization
- Additional refinement: tie breaker
- If a total order is required and C(A) = C(B) for two events A and B, then we use some unique property of the processes associated with the events to choose the winner
- Process ID (PID) is often used for this purpose
- This establishes a total order on a set of events
- Recall that ties can happen only between events happening on the same system, since we already asserted that every message transfer takes at least one tick of the logical clock
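To make the clock rules and the PID tie-breaker concrete, here is a minimal sketch in Python; the class and method names are illustrative, not part of the notes or Tanenbaum's text.

    # Minimal sketch of Lamport logical clocks with PID tie-breaking.
    class LamportClock:
        def __init__(self, pid):
            self.pid = pid      # used only to break ties for a total order
            self.time = 0

        def local_event(self):
            self.time += 1
            return (self.time, self.pid)

        def send_event(self):
            # Tick before sending so that C(send) < C(receive) can be enforced.
            self.time += 1
            return (self.time, self.pid)

        def receive_event(self, msg_time):
            # Never move the clock backward: take the max, then tick.
            self.time = max(self.time, msg_time) + 1
            return (self.time, self.pid)

    # Total order: compare (time, pid) tuples lexicographically,
    # e.g. (12, 3) < (12, 7) even though the logical times are equal.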
23 Lamport's Logical Clock Synchronization
- Following these rules means that the logical clock at each node in a distributed system is now sufficient to reason about synchronization problems
- The logical clock provides a way for each system to decide on the order in which events occur, from each system's point of view
- Consider the connection to in-order message delivery in ensuring logically consistent decision making among distributed components of a computation
- HOWEVER, the logical clock values at each distributed component may have little or no relation to real time, or to each other
24 Physical Clocks
- All clocks are logical clocks in the sense that each:
- Has finite resolution
- Approximates real time
- Two important questions must always be considered for a particular system:
- How do we synchronize the computer's logical clock with real time?
- How do we synchronize computer clocks with one another?
- Computers, and the distributed software running on them, may have several clocks, but one is the local notion of real time
25 Physical Clocks: Real Time
- Sun time
- Humans, including astronomers, want time keeping to stay synchronized with the sun
- Harder than it seems
- Consider the transition to the Gregorian calendar
- A significantly shorter year was decreed to adjust for drift in the previous scheme
- Is the year 2000 a leap year? (Hint: divisible by 4, not by 100, unless by 400)
- Atomic time
- 50 Cesium-133 clocks around the world
- Average number of ticks since 1/1/1958
26 Physical Clocks: Real Time
- Atomic time is the official universal time
- It requires leap seconds every few years to stay synchronized with the earth's rotation
- Astronomers care, because it makes a difference where they point their instruments
- They work at a much finer time scale than you might think
- So do computers and distributed computations
- GPS (Global Positioning System) satellites now make this easy and cheap to get
- You can also call NIST on the telephone in Ft. Collins
27 Physical Clocks: Clock Synchronization
- Consider the difference between accuracy and synchronization of two clocks
- Their accuracy is how closely they agree with real time
- Their synchronization is how closely they agree with each other
- Synchronization of clocks in a network supporting distributed decision making is often more important than their accuracy
- Synchronization of clocks affects how easily distributed components can decide on an ordering of events
- Synchronization of clocks within a few milliseconds of each other is desirable, but seconds or minutes of drift from real time could be OK
28 Physical Clocks: Clock Synchronization
- The degree of agreement among interacting machines is thus a crucial factor
- Consider make horror scenarios to see this point
- The same principle applies to banks
- Network performance measurement experiments often depend on time stamps taken on different machines
- Experiments are often restructured to minimize or avoid this
- Physical clocks run at different speeds
- Manufacturers specify a maximum drift rate (rho, ρ)
- Manufacturers lie (sorry - provide factually unreliable information in a completely sincere manner)
29 Physical Clocks: Clock Synchronization
- The maximum resolution desired for global time keeping determines the maximum difference δ which can be tolerated between synchronized clocks
- The time keeping of a clock, its tick rate dC/dt, should satisfy 1 - ρ ≤ dC/dt ≤ 1 + ρ
- The worst possible divergence between two clocks that have run unsynchronized for a time Δt is thus 2ρΔt
- So the maximum time Δt between clock synchronization operations that can keep the divergence within δ is Δt ≤ δ / (2ρ)
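A worked instance of the bound above, with illustrative values for the drift rate and tolerance (not taken from the notes):

    \rho = 10^{-5} \ \text{(roughly 1 s of drift per day)}, \qquad \delta = 1\ \text{ms}
    \Delta t \;\le\; \frac{\delta}{2\rho} \;=\; \frac{10^{-3}\ \text{s}}{2 \times 10^{-5}} \;=\; 50\ \text{s between resynchronizations}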
30 Physical Clocks: Clock Synchronization
- Cristian's algorithm
- Periodically poll the machine with access to the reference time source
- Estimate the round-trip delay with a time stamp
- Estimate the interrupt processing time
- Figure 3-6, page 129 Tanenbaum
- Take a series of measurements to estimate the time it takes for a timestamp to make it from the reference machine to the synchronization target
- This allows the synchronization to converge within δ with a certain degree of confidence
- Probabilistic algorithm and probabilistic guarantee
31 Physical Clocks: Clock Synchronization
- The wide availability of hardware and software to keep clocks synchronized within a few milliseconds across the Internet is a recent development
- Network Time Protocol (NTP), discussed in papers by David Mills
- A GPS receiver in the local network synchronizes other machines
- What if all machines have GPS receivers?
- There is increasing deployment of distributed system algorithms that depend on synchronized clocks
- Supply and demand are constantly in flux
32 Physical Clocks: At-Most-Once Semantics
- Traditional approach:
- Each message has a unique message ID
- The server maintains a list of IDs
- Message numbers can be lost on a server crash
- How long does the server keep IDs?
- With globally synchronized clocks:
- The sender assigns a timestamp to each message
- The server keeps the most recent timestamp for each connection
- Reject any message with a lower timestamp (it is a duplicate)
- Removing old timestamps: G = CurrentTime - MaxLifeTime - MaxClockSkew
- Timestamps older than G are removed
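A minimal sketch of the server-side filtering just described, assuming a globally synchronized clock source; the constants and names are illustrative.

    MAX_LIFETIME = 60.0      # seconds a message may live in the network (assumed)
    MAX_CLOCK_SKEW = 1.0     # bound guaranteed by clock synchronization (assumed)

    class AtMostOnceServer:
        def __init__(self, now):
            self.now = now                   # callable returning synchronized time
            self.last_ts = {}                # connection id -> most recent timestamp

        def accept(self, conn, ts):
            G = self.now() - MAX_LIFETIME - MAX_CLOCK_SKEW
            # Drop state older than G; anything that old cannot be a live duplicate.
            self.last_ts = {c: t for c, t in self.last_ts.items() if t >= G}
            if ts < G or ts <= self.last_ts.get(conn, float("-inf")):
                return False                 # duplicate or too old: reject
            self.last_ts[conn] = ts
            return True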
33 Physical Clocks: At-Most-Once Semantics
- After a server crash:
- CurrentTime is recomputed using the global synchronization of time
- All messages older than G are rejected
- All messages from before the crash are rejected as duplicates
- Some new messages may be wrongfully rejected
- But at-most-once semantics is guaranteed
34 Physical Clocks: Cache Coherence
- File caching in a distributed file system
- Many readers, single writer
- The writer must ask readers to invalidate their copies
- A timestamp on the readers' copies helps by making the copies expire
- Readers lease their copies of a file block
- This constrains the period during which a non-responding reader may delay a potential writer
- Does "NFS server not responding" sound familiar?
- Note the tradeoff of overhead and latency
- A lower lease time increases the message load and decreases the delay caused by ignoring a non-responding reader
35 Mutual Exclusion
- Distributed components still need to coordinate their actions, including but not limited to access to shared data
- Mutual exclusion for some limited set of operations and data is thus required
- Consider several approaches and compare and contrast their advantages and disadvantages
- Centralized algorithm:
- The single central process is essentially a monitor
- The central server becomes a semaphore server
- Three messages per use: request, grant, release
- Centralized performance constraint and single point of failure
36 Mutual Exclusion: Distributed Algorithm Factors
- Functional requirements:
- 1) Freedom from deadlock
- 2) Freedom from starvation
- 3) Fairness
- 4) Fault tolerance
- Performance evaluation:
- Number of messages
- Latency
- Semaphore system throughput
- Synchronization is always overhead and must be accounted for as a cost
37 Mutual Exclusion: Distributed Algorithm Factors
- Performance should be evaluated under a variety of loads
- Cover a reasonable range of operating conditions
- We care about several types of performance:
- Best case
- Worst case
- Average case
- Different aspects of performance are important for different reasons and in different contexts
38 Mutual Exclusion: Lamport's Algorithm
- Every site keeps a request queue sorted by logical time stamp
- Uses Lamport's logical clocks to impose a total global order on events associated with synchronization
- The algorithm assumes ordered message delivery between every pair of communicating sites
- Messages sent from site Si in a particular order arrive at Sj in the same order
- Note: since messages arriving at a given site come from many sources, the delivery order of all messages can easily differ from site to site
39 Lamport's Algorithm: Request Resource r
- To request resource r, site Si sends a timestamped REQUEST to every site (placing it in its own request_queue as well); each receiving site adds the request to its request_queue and returns a timestamped REPLY
- Thus, each site has a request queue containing resource use requests and replies
- Note that the requests and replies for any given pair of sites must be in the same order in the queues at both sites
- This follows from the ordered message delivery assumption
40 Lamport's Algorithm: Entering the CS for Resource r
- Site Si enters the CS protecting the resource when:
- L1: Si has received a message with a timestamp larger than that of its own request from every other site
- L2: Si's own request is at the head of its request_queue
- L1 ensures that no message from any site with a smaller timestamp could ever arrive
- L2 ensures that no other site will enter the CS
- Recall that requests to all potential users of the resource, and replies from them, go into the request queues of all processes, including the sender of the message
41 Lamport's Algorithm: Releasing the CS
- The site holding the resource, call it Si, releases it by removing its request from its own queue and sending a timestamped RELEASE message to all other sites, which then remove Si's request from their queues
- Note that the request for resource r had to be at the head of the request_queue at the site holding the resource, or it would never have entered the CS
- Note that the request may or may not have been at the head of the request_queue at a receiving site
42 Lamport ME Example
- [Timing diagram: Pi issues request(i,5) and Pj issues request(j,10); both requests are queued in timestamp order, e.g. queue(i5, j10); after the replies arrive, Pi (the smaller timestamp) enters the critical section first, and its release(i,5) then lets Pj enter]
43 Lamport's Algorithm: Correctness
- We show that Lamport's algorithm ensures mutual exclusion through a proof by contradiction
- Assume two sites Si and Sj are executing in the critical section concurrently
- For this to happen, L1 and L2 must hold at both sites concurrently, which implies that at some time t both sites Si and Sj had their own requests at the top of their respective request_queues
- Without loss of generality (WLOG) assume that Si's request carries the smaller timestamp
- Due to L1 and the FIFO property of communication, it is clear that at time t Sj must have had the request from Si in its request_queue
44 Lamport's Algorithm: Correctness
- This implies that at site Sj the local request is at the head of the local request_queue even though the request from Si had a lower timestamp
- This is a contradiction
- Lamport's algorithm thus ensures mutual exclusion, since assuming otherwise produces a contradiction
- The key idea is that L1 ensures that Sj must place the request from Si ahead of its own, because it has definitely arrived and has a lower logical timestamp
45 Lamport's Algorithm: Comments
- Performance: 3(N-1) messages per CS invocation, since each requires (N-1) REQUEST, (N-1) REPLY, and (N-1) RELEASE messages
- Observation: some REPLY messages are not required
- If Sj sends a request to Si and then receives a REQUEST from Si with a timestamp smaller than its own REQUEST, Sj need not send a reply to Si, because Si already has enough information to make a decision
- This reduces the messages to between 2(N-1) and 3(N-1)
- As a distributed algorithm there is no single point of failure, but there is increased overhead
46 Ricart and Agrawala
- Refines Lamport's mutual exclusion by merging the REPLY and RELEASE messages
- Assumption: a total ordering of all events in the system, implying the use of Lamport's logical clocks with tie breaking
- Request CS operation:
- 1) The site requesting the CS creates a timestamped REQUEST message and sends it to all processes using the CS, including itself
- Messages are assumed to be reliably delivered in order
- Group communication support can play an obvious role
47 Ricart and Agrawala: Receive a CS Request
- If the receiver is not currently in the CS and does not have a pending request for it in its request_queue:
- Send a REPLY
- If the receiver is already in the CS:
- Queue the request, sending no reply
- If the receiver desires the CS but has not entered:
- Compare the timestamp of its own pending request to that of the request just received
- REPLY if the received request has the lower timestamp (is older)
- Queue the request if its own pending request has the lower timestamp
48 Ricart and Agrawala
- Enter a CS:
- A process enters the CS when it receives a REPLY from every member of the group that can use the CS
- Leave a CS:
- When the process leaves the CS, it sends a REPLY to the senders of all pending requests on its queue
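A sketch of the receiver-side decision and the deferred replies, assuming requests carry (logical time, site id) pairs and a send() transport callback; names are illustrative.

    class RicartAgrawalaSite:
        def __init__(self, send):
            self.send = send
            self.in_cs = False
            self.requesting = False
            self.my_req_ts = None          # (logical time, site id) while requesting
            self.deferred = []             # sites whose REPLY we are withholding

        def on_request(self, frm, req_ts):
            if self.in_cs or (self.requesting and self.my_req_ts < req_ts):
                self.deferred.append(frm)  # we hold the CS or our request is older: defer
            else:
                self.send(frm, "REPLY")    # idle, or the incoming request is older

        def exit_cs(self):
            self.in_cs = False
            for frm in self.deferred:      # the deferred REPLYs double as the release
                self.send(frm, "REPLY")
            self.deferred.clear()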
49 Ricart and Agrawala: Example 1
- [Timing diagram: I requests with timestamp 8 and K with timestamp 12; J is idle and replies OK to both; K's request has the larger timestamp, so K also replies OK to I; I enters the CS, and on exit sends its deferred OK to K, which then enters the CS]
50 Ricart and Agrawala: Example 2
- [Timing diagram: I, J, and K request concurrently with timestamps 7, 8, and 9; each site defers replies to requests with larger timestamps than its own (queues q(j8), q(j8, k9), q(k9)), so I enters the CS first, then J, then K, as the deferred OK messages are sent on each exit]
51 Ricart and Agrawala: Proof by Contradiction
- Assume sites Si and Sj are executing in the CS concurrently
- Assume that Si's request has the smaller timestamp
- Site Si clearly received the request from Sj after making its own request
- Otherwise Si's logical clock, and hence its request timestamp, would have been larger than Sj's
- However, Sj can be executing concurrently with Si only if Si returns a REPLY message in response to the request from Sj before Si exits its CS
- This is impossible, because Sj's timestamp is larger, so Si defers its REPLY until it exits the CS
- The assumption leads to a contradiction, and thus the R-A algorithm ensures mutual exclusion
- Performance: 2(N-1) messages, (N-1) REQUEST and (N-1) REPLY
52 Ricart and Agrawala: Observations
- The algorithm works because the global logical clock ensures a global total ordering on events
- This ensures, in turn, that the decision about who enters the CS is unambiguous
- A single point of failure is now N points of failure
- A crashed group member cannot be distinguished from a busy CS
- The distributed and optimized version is N times more vulnerable than the centralized version!
- An explicit message denying entry helps reliability, and converts this into busy waiting
53 Ricart and Agrawala: Observations
- Either group communication support is used, or each user of the CS must keep track of all other potential users correctly
- A powerful motivation for standard group communication primitives
- The argument against a centralized server said that a single process involved in each CS decision was bad
- Now we have N processes involved in each decision
- Improvements: get a majority - Maekawa's algorithm
- Bottom line: a distributed algorithm is possible
- Shows the theoretical and practical challenges of designing distributed algorithms that are useful
54 Token Passing Mutex
- General structure:
- One token per CS; the token denotes permission to enter
- Only the process holding the token is allowed in the CS
- The token is passed from process to process around a logical ring
- Mutex:
- Pass the token to process (i + 1) mod N
- A received token gives permission to enter the CS
- Hold the token while in the CS
- Must pass the token along after exiting the CS
- Fairness is ensured: each process waits at most N-1 entries to get the CS
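A sketch of one site on the ring, assuming a pass_token() transport stand-in and leaving the critical-section body empty; as with the other sketches, the names are illustrative.

    class TokenRingSite:
        def __init__(self, pid, n, pass_token):
            self.pid, self.n = pid, n
            self.pass_token = pass_token       # delivers the token to a given pid
            self.has_token = (pid == 0)        # exactly one token exists initially
            self.wants_cs = False

        def on_token(self):
            self.has_token = True
            if self.wants_cs:
                self.enter_cs()                # hold the token only while in the CS
                self.wants_cs = False
            self.has_token = False
            self.pass_token((self.pid + 1) % self.n)   # always pass it along

        def enter_cs(self):
            pass                               # critical-section work goes here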
55 Token Passing Mutex
- Correctness is obvious
- No starvation, since passing is in strict order
- Difficulties with token passing mutex:
- The idle case of no process entering the CS pays the overhead of constantly passing the token
- Lost tokens: diagnosis and creating a new token
- Duplicate tokens: ensure generation of only one token
- Crashes require a receipt to detect dead destinations
- Receipts double the message overhead
- Design challenge: holding time for an unneeded token
- Too short → high overhead; too long → high CS latency
56 Mutex Comparison
- Centralized
- Simplest and most efficient
- Centralized coordinator crashes create the need to detect the crash and choose a new coordinator
- Messages per use: 3; entry latency: 2
- Distributed
- 3(N-1) messages per CS use (Lamport)
- 2(N-1) messages per CS use (Ricart and Agrawala)
- If any process crashes with a non-empty queue, the algorithm won't work
- Messages per use: 2(N-1); entry latency: 2(N-1)
57 Mutex Comparison
- Token ring
- Ensures fairness
- Overhead is subtle: it is no longer linked to CS use
- Messages per use: 1 to unbounded; entry latency: 0 to N-1
- This algorithm pays overhead when idle
- Needs methods for re-generating a lost token
- Design principle: building fault handling into algorithms for distributed systems is hard
- Crash recovery is subtle and introduces overhead in normal operation
- Performance metrics: messages per use and entry latency
58 Election Algorithms
- Centralized approaches are often necessary
- The best choice for mutex, for example
- Need a method of electing a new coordinator when it fails
- General assumptions:
- Give processes unique system/global numbers (e.g. PID)
- Elect the process using a total ordering on the set
- All processes know the process numbers of the members
- All processes agree on the new coordinator
- All do not know whether it is up or down; the election algorithm is responsible for determining this
- Design challenge: network delay vs. crashed peer
59 Bully Algorithm
- Suppose the coordinator doesn't respond to P1's request
- P1 holds an election by sending an election message to all processes with higher numbers
- If P1 receives no responses, P1 is the new coordinator
- If any higher numbered process responds, P1 ends its election
- When a process receives an election request, it:
- Replies to the sender, telling it that it has lost the election
- Holds an election of its own
- Eventually all but the highest surviving process give up
- A process recovering from a crash takes over if it is the highest
60 Bully Algorithm
- Example: processes 0-7; 4 detects that 7 has crashed
- 4 holds an election and loses
- 5 holds an election and loses
- 6 holds an election and wins
- Message overhead is variable
- Who starts an election matters
- Solid lines say "Am I leader?"
- Dotted lines say "you lose"
- Hollow lines say "I won"
- 6 becomes the coordinator
- When 7 recovers it is a bully and sends "I won" to all
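A sketch of the bully election from one process's point of view; send() is assumed to return a truthy value only if the peer answers within a timeout, and broadcast() is an assumed helper.

    class BullyProcess:
        def __init__(self, pid, all_pids, send, broadcast):
            self.pid, self.all_pids = pid, all_pids
            self.send, self.broadcast = send, broadcast
            self.coordinator = max(all_pids)

        def start_election(self):
            higher = [p for p in self.all_pids if p > self.pid]
            got_answer = any(self.send(p, "ELECTION") for p in higher)
            if not got_answer:
                # Nobody higher is alive: this process wins and announces it.
                self.coordinator = self.pid
                self.broadcast("COORDINATOR", self.pid)

        def on_election(self, frm):
            self.send(frm, "OK")           # tell the sender it has lost
            self.start_election()          # and hold an election of our own

        def on_coordinator(self, pid):
            self.coordinator = pid         # accept the announced winner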
61 Ring Algorithm
- Processes have a total order known by all
- Each process knows its successor, forming a ring
- The ring is mod N, so the successor of Pi is P(i+1) mod N
- No token is involved
- Any process Pi noticing that the coordinator is not responding:
- Sends an election message to its successor P(i+1) mod N
- If the successor is down, send to the next member (detected by timeout)
- A receiving process adds its number to the message and passes it along
62 Ring Algorithm
- When the election message gets back to the election initiator:
- Change the message to coordinator
- Circulate it to all members
- The coordinator is the highest process in the total order
- All processes know the order, and thus all will agree no matter how the election started
- Strength:
- Only one coordinator is chosen
- Weakness:
- Scalability: latency increases with N because the algorithm is sequential
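A sketch of the ring election; successor() and send() are assumed helpers, and the coordinator announcement circulates once before being removed by the initiator.

    class RingProcess:
        def __init__(self, pid, send, successor):
            self.pid, self.send, self.successor = pid, send, successor
            self.coordinator = None

        def start_election(self):
            self.send(self.successor(self.pid), ("ELECTION", [self.pid]))

        def on_message(self, kind, payload):
            if kind == "ELECTION":
                members = payload
                if self.pid in members:
                    # The message came all the way around: announce the highest id.
                    self.send(self.successor(self.pid),
                              ("COORDINATOR", (max(members), self.pid)))
                else:
                    self.send(self.successor(self.pid),
                              ("ELECTION", members + [self.pid]))
            elif kind == "COORDINATOR":
                winner, initiator = payload
                self.coordinator = winner
                if self.pid != initiator:      # circulate once, then stop
                    self.send(self.successor(self.pid), ("COORDINATOR", payload))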
63 Ring Algorithm
- What if more than one process detects a crashed coordinator?
- More than one election will be produced: a message storm
- All messages will contain the same information: member process numbers and the order of members
- The same coordinator is chosen (highest number)
- A refinement might include filtering duplicate messages
- Some duplicates will happen
- Consider two elections chasing each other
- Eliminate the one initiated by the lower numbered process
- It is duplicated until the lower reaches the source of the higher
64 Atomic Transactions
- All synchronization methods so far have been low level
- Essentially equivalent to semaphores
- Good for building more powerful, higher level tools
- Assume stable storage
- Contents survive all non-physical disasters
- Specifically used by the system to store data across crashes
- Transaction:
- Performs a single logical function
- All-or-none computation: either all operations are executed or none
- Must do so in the face of system failures → stable storage
65 Atomic Transactions
- Transaction model:
- Start transaction
- A series of read and write operations
- Either a commit or an abort operation
- Commit: all transaction operations executed successfully
- Abort: no transaction operations are allowed to hold
- Roll back: restore the system to the original state before the transaction started
- A transaction is in limbo before a commit
- It has neither occurred nor not occurred
- Depends on who is asking
66 Transaction Properties: ACID
- Atomic
- Actions occur indivisibly: completely or not at all
- They appear to happen instantly from the point of view of any interacting process, because those processes are all blocked
- No intermediate states are visible
- Consistent
- System invariants hold, but are specific to the application
- Conservation-of-money semantics in banking applications
- Inside the transaction this is violated, but from outside, the transaction is indivisible and the invariants are, well, invariant
67 Transaction Properties: ACID
- Isolated
- Concurrent transactions do not interfere with each other
- Serializable: the results from every set of transactions look as if the transactions were done in some sequential execution order
- The transaction system must ensure that only legal, or semantically consistent, interleavings of transaction components occur
- Durable
- Once a transaction commits, its results are permanent
- It is relevant to ask: permanent with respect to what?
- Generally data structures or stable storage contents
68 Transaction Primitives
- Begin-transaction
- End-transaction
- Abort-transaction
- Returns to the state before the begin-transaction
- Often referred to as roll-back
- Commit-transaction
- Changes made in the transaction become visible to the outside world
- Transaction operations:
- Read (receive)
- Write (send)
69 Transaction Example
- Suppose we have three transactions T1, T2, and T3
- Two data elements, A and B
- Scheduled by a round-robin scheduler: artificial, but instructive for this example
- One operation per time slice
- Consider what interleavings of component operations are consistent with a serial execution order of the transaction set
- The obvious choice is to not interleave components of different transactions, but that constrains concurrency
70 Transaction Example
- T1 → T2 → T3
- But T1 reads A after T3 writes it
- This implies T3 → T1, creating a contradiction
- Atomicity is violated
- Abort T1
- [Schedule table residue: T3 (timestamp 22) writes A and reads B]
71 Transaction Example
- T2 → T3 → T1
- T2 writes A after T3's write
- Requiring T3 → T2
- Abort T2
- Note: since we interleaved operations, all members of the set must be ready to commit before any can commit
72 Transaction Example
- T3 → T1 → T2
- This works because each transaction reaches the commit stage without encountering a contradiction
- [Schedule table: T3 (TS 20), T1 (TS 21), and T2 (TS 22) with their read and write operations on A and B across the seven time slices]
73 Nested Transactions
- A transaction divided into sub-transactions
- Structured as a hierarchy
- Internal nodes are masters for their children
- Advantages:
- Better performance: aborted sub-transactions do not abort their masters
- Increased concurrency: only need to lock sub-transactions
- [Figure: a tree of sub-transactions labeled A through J]
74 Nested Transactions
- Suppose a parent transaction starts several child transactions
- One or more child transactions commit
- Only after committing are the child's results visible to the parent
- Atomicity is preserved at the child level
- But the results are horrible, so the parent aborts
- But the child already committed
- The parent abort must roll back all child transactions
- Even if they have committed
- The commit of subordinate transactions is thus not final, and thus not real, with respect to the containing system
75 Implementing Transactions
- Conceptually, a transaction is given a private workspace
- Containing all resources it is allowed to access
- Before commit, all operations are done to the private workspace
- On commit, changes in the private workspace are reflected into the actual workspace (file system, etc.)
- If the shadowed workspaces of more than one transaction intersect (contain common data items),
- And one of them has a write operation on a common member,
- Then there is a conflict
- And one of the transactions must be aborted
76 Implementing Transactions
- First level optimization: copy on write
- The private workspace points to the common workspace
- Copy items into the private space only when they are written
- Virtual memory systems do this when processes fork
- Copied items are shadowed
- Commit copies shadowed items into the global workspace
- Second level optimization: shadow blocks
- Make the units of shadowing as small as possible
- Disk blocks within a file that are written, instead of the whole file
- Specific variables or groups of variables in a data space
77 Implementing Transactions
- Private workspaces are a form of caching
- Design issues:
- Size of shadowed objects
- Probability of an intersection of private workspaces
- Constraint on concurrency of transactions
- Overhead of managing information and detecting intersections
- Analogy to data cache line size and snooping cache consistency problems
78 Implementing Transactions: Writeahead Log
- Global copies are changed in the course of a transaction
- A log of the changes is maintained in stable storage
- Log entries consist of write operation records:
- Transaction name
- Data item name
- Old value
- New value
- Save the log entry before performing the write operation
- Transaction Ti is represented by a series of write operation records terminated by the commit (or abort) record
79 Implementing Transactions: Writeahead Log
- The transaction log consists of:
- < Ti start >
- A series of write records (Ti, x, old value, new value)
- < Ti commit > or < Ti abort >
- Recovery procedures:
- undo(Ti) restores all values written by Ti to their old values
- redo(Ti) sets all values written by Ti to their new values
- If Ti aborts:
- Execute undo(Ti)
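A sketch of the log format and the undo/redo procedures, using a plain Python list as a stand-in for stable storage and a dict as the "database"; names are illustrative.

    class WriteAheadLog:
        def __init__(self, db):
            self.db = db          # dict: data item -> value
            self.log = []         # stable-storage stand-in

        def write(self, tid, item, new_value):
            # Log (transaction, item, old, new) BEFORE touching the data.
            self.log.append((tid, item, self.db.get(item), new_value))
            self.db[item] = new_value

        def undo(self, tid):
            # Walk the log backwards, restoring old values written by tid.
            for t, item, old, _new in reversed(self.log):
                if t == tid:
                    self.db[item] = old

        def redo(self, tid):
            # Walk the log forwards, re-applying new values written by tid.
            for t, item, _old, new in self.log:
                if t == tid:
                    self.db[item] = new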
80 Implementing Transactions: Writeahead Log
- If there is a system failure, the system can use redo(Ti) to make sure all updates are in place
- Compare the writeahead log values to the actual values
- Also use the log to proceed with the transaction
- If an abort is necessary, use undo(Ti)
- Note that the commit operation must be done atomically
- This is difficult when different machines and processes are involved
- Multiple logs are still a problem to consider
81 Implementing Transactions: Two-Phase Commit
- The commit of the transaction must be atomic
- Specific roles permit this
- Figure 3-20, page 153 Tanenbaum
- A coordinator is selected (the transaction initiator)
- Phase 1:
- The coordinator writes "prepare" in its log
- Sends a prepare message to all processes involved in the commit (subordinates)
- Subordinates write "ready" (or "abort") into their logs
- Subordinates reply to the coordinator
- The coordinator collects replies from all subordinates
82 Implementing Transactions: Two-Phase Commit
- If any subordinate aborts or does not respond → abort
- If all respond, the commit message will make the transaction results permanent in all subordinates
- Stable storage is the key to the very end
- Crashes can be handled by tracing the log to recover
- Phase 2:
- The coordinator logs "commit" and sends the commit message
- Subordinates write "commit" into their logs
- Subordinates execute the commit
- Subordinates send a finished message to the coordinator
- The system can remove all transaction log entries, if desired
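A sketch of the coordinator side of the two phases, assuming subordinates expose hypothetical prepare()/commit()/abort() stubs and that a missing reply counts as an abort vote.

    def two_phase_commit(log, subordinates):
        # Phase 1: write "prepare", collect votes from every subordinate.
        log.append("prepare")
        votes = []
        for sub in subordinates:
            try:
                votes.append(sub.prepare())   # subordinate logs ready/abort and replies
            except Exception:
                votes.append(False)           # no response counts as an abort vote
        # Phase 2: commit only if everyone voted yes; otherwise abort everywhere.
        if all(votes):
            log.append("commit")
            for sub in subordinates:
                sub.commit()
            return True
        log.append("abort")
        for sub in subordinates:
            sub.abort()
        return False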
83 Concurrency Control
- Transactions need to run simultaneously
- All modern database systems need to serve concurrent users, especially in a parallelized distributed system
- Transactions can conflict
- One may write to items others want to read or write
- Most transactions do not conflict
- Maximizing performance requires us to constrain only the conflicting transactions
- Concurrency control methods:
- Locking
- Optimistic concurrency control
- Timestamps
84 Locking
- Locks
- A semaphore of sorts, creating mutual exclusion regions within the total data of a DB
- The simplistic scheme is too restrictive
- Distinguish read and write locks
- Many readers, single writer: the canonical problem
- Read locks
- Allow N read locks on a resource
- Write locks
- No other lock is permitted
85 Locking
- Locking granularity
- File level is too coarse
- Finer granularity → less constraint on concurrency
- Finer granularity → greater overhead managing locks and increased probability of deadlock
- Two-phase locking
- Fine-grained locking can lead to inconsistency and deadlock
- Dividing lock requests into two phases helps simplify
- If a transaction avoids updating until all locks are acquired, this simplifies failure handling
- Release all locks and try again
86 Locking
- Growing phase
- The transaction obtains locks, and may not release any
- Shrinking phase
- Once a lock is released, no locks can be obtained for the rest of the transaction
- Disadvantage of two-phase locking:
- Concurrency is reduced
- Resource ordering (prevention) or detection and resolution are necessary to handle deadlocks
- Strict two-phase locking releases no locks until abort/commit
- This increases the concurrency constraint but avoids cascaded aborts
87 Two-Phase Locking
- Scenario 1
- Also safe from deadlock
- P1 P2
- lock R1 lock R1
- ... lock R2
- lock R2 ...
- ... unlock R2
- unlock R2 unlock R1
- unlock R1
88 Two-Phase Locking
- Scenario 2
- Susceptible to deadlock
- P1 P2
- lock R1 lock R2
- ... lock R1
- lock R2 ...
- ... unlock R1
- unlock R1 unlock R2
- unlock R2
89 Optimistic Concurrency Control
- Based on the observation that transactions rarely conflict
- An expected value argument:
- The cumulative overhead of avoiding conflicts is more expensive than detecting and resolving conflicts
- Let a transaction make all its changes
- Without checking for conflicts
- Deadlock free
- At commit time:
- Check for conflicts with files that have changed since the transaction began
- If found → abort all but one conflicting transaction and redo
90 Optimistic Concurrency Control
- Optimistic: changes are made to a private workspace
- Distributed transactions need some form of global clock
- The basis for comparing the times of file changes
- The make canonical problem
- Parallelism is maximized
- No waiting on locks
- Inefficient when an abort is needed
- Not a good strategy in systems with many potential conflicts; it bets on the conflict probability
- Increased load → more conflicts → more failures → increased load
- A positive feedback scenario
91 Timestamp Ordering
- Each transaction Ti is assigned a unique timestamp TS(Ti)
- If Ti enters the system before Tj, then TS(Ti) < TS(Tj)
- This imposes a total ordering on transactions
- Each data item, Q, gets two timestamps:
- W-timestamp(Q): largest write timestamp
- R-timestamp(Q): largest read timestamp
- General concept:
- Process transactions in a serial order
- They can use the same file, but must do it in order
- Therefore atomicity is preserved
92 Timestamp Ordering
- For a read:
- if (TS(Ti) < W-timestamp(Q))
- reject the read
- roll back and re-start Ti
- else /* TS(Ti) >= W-timestamp(Q) */
- execute the read
- R-timestamp(Q) = max(R-timestamp(Q), TS(Ti))
- Timestamp ordering is deadlock-free
- The total ordering of file accesses means no cycles can result
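The corresponding write rule, which the notes do not spell out, follows the same pattern; this sketch uses plain dictionaries for the per-item timestamps, and the names are illustrative.

    def write(Q, value, Ti, ts, w_ts, r_ts, data):
        # ts[Ti]: transaction timestamp; w_ts/r_ts: per-item timestamps; data: values
        if ts[Ti] < r_ts.get(Q, 0) or ts[Ti] < w_ts.get(Q, 0):
            # A later transaction has already read or written Q: reject,
            # roll back, and restart Ti with a new timestamp.
            return "reject"
        data[Q] = value
        w_ts[Q] = ts[Ti]
        return "ok"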
93 Timestamp Ordering Example
- Three transactions T1, T2, and T3
- Two data elements, A and B
- Scheduled by a round-robin scheduler
- One operation per time slice
- Use read and write timestamps
94 Timestamp Ordering Example
- Three transactions T1, T2, and T3
95 Deadlocks
- Definition: each process in a set is waiting for a resource to be released by another process in the set
- The set is some subset of all processes
- Deadlock only involves the processes in the set
- Remember the necessary conditions for deadlock
- Remember that methods for handling deadlock are based on preventing, or detecting and fixing, one or more of the necessary conditions
96 Deadlocks: Necessary Conditions
- Mutual exclusion
- A process has exclusive use of the resources allocated to it
- Hold and wait
- A process can hold one resource while waiting for another
- No preemption
- Resources are released only by explicit action of the controlling process
- Requests cannot be withdrawn (i.e. a request results in eventual allocation or deadlock)
- Circular wait
- Every process in the deadlocked set is waiting for another process in the set, forming a cycle in the SR graph
97 Deadlock Handling Strategies
- No strategy
- Prevention
- Make it structurally impossible to have a deadlock
- Avoidance
- Allocate resources so deadlock can't occur
- Detection
- Let deadlock occur, detect it, and recover from it
98 No Strategy: The Ostrich Algorithm
- Assumes deadlock rarely occurs
- It becomes more probable with more processes
- Catastrophic consequences when it does occur
- May need to re-boot all or some machines in the system
- Fairly common, and works well when:
- Deadlock is rare, and
- Other sources of instability are more common
- How many reboots of Windows or MacOS are prompted by deadlock?
99 Deadlock Prevention
- Ordered resource allocation is the most common example
- Consider the link with two-phase locking's grow and shrink phases
- Works, but requires a global view of all resources
- A total order on resources must exist for the system
- Process code must allocate resources in order
- Under-utilizes resources when the periods of use of a resource conflict with the total resource order
- Consider processes Pi and Pk using resources R1 and R2
- Pi uses R1 90% of its execution time and R2 10%
- Pk uses R2 90% of its execution time and R1 10%
- One of them holds one resource far too long
100 Deadlock Avoidance
- General method: refuse allocations that may lead to deadlock
- A method for keeping track of states
- Need to know the resources required by a process
- Banker's algorithm:
- Must know the maximum number of resources allocated to Pi
- Keep track of the resources available
- For each request, make sure the maximum need will not exceed the total available
- Under-utilizes resources
- Never used in practice
- Advance knowledge is not available, and it is CPU-intensive
101 Deadlock Detection and Resolution
- Attractive for two main reasons:
- Prevention and avoidance are hard, have significant overhead, and require information that is difficult or impossible to obtain
- Deadlock is comparatively rare in most systems, so a form of the argument for optimistic concurrency control applies: detect and fix comparatively rare situations
- The availability of transactions helps
- Deadlock resolution requires us to kill some participant(s)
- Transactions are designed to be rolled back and restarted
102 Centralized Deadlock Detection
- General method: construct a resource graph and analyze it
- Analyze through resource reductions
- If a cycle exists after analysis, deadlock has occurred
- The processes in the cycle are deadlocked
- Local graphs on each machine:
- Pi requests R1
- R1's machine places the request in its local graph
- If a cycle exists in the local graph, perform reductions to detect deadlock
- Need to calculate the union of all local graphs
- A deadlock cycle may transcend machine boundaries
103 Graph Reduction
- Cycles don't always mean deadlock!
- [Figure: two resource graphs over P1, P2, and P3; after reduction one shows deadlock and the other shows no deadlock]
104 Waits-For Graphs (WFGs)
- Based on the resource allocation (SR) graph
- An edge from Pi to Pj means Pi is waiting for Pj to release a resource
- It replaces two edges in the SR graph: Pi → R and R → Pj
- Deadlocked when a cycle is found
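A sketch of cycle detection on a waits-for graph represented as an adjacency dictionary; a non-empty result is the set of deadlocked processes. The representation is an assumption for illustration.

    def find_cycle(wfg):
        # wfg: dict mapping process -> set of processes it waits for
        WHITE, GREY, BLACK = 0, 1, 2
        color = {p: WHITE for p in wfg}

        def dfs(p, path):
            color[p] = GREY
            path.append(p)
            for q in wfg.get(p, ()):
                if color.get(q, WHITE) == GREY:
                    return path[path.index(q):]     # the deadlocked cycle
                if color.get(q, WHITE) == WHITE:
                    cycle = dfs(q, path)
                    if cycle:
                        return cycle
            path.pop()
            color[p] = BLACK
            return None

        for p in list(wfg):
            if color[p] == WHITE:
                cycle = dfs(p, [])
                if cycle:
                    return cycle
        return None

    # Example: P1 waits for P2, P2 for P3, P3 for P1 -> the whole set is deadlocked.
    # find_cycle({"P1": {"P2"}, "P2": {"P3"}, "P3": {"P1"}})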
105 Centralized Deadlock Detection
- All hosts communicate resource state to a coordinator
- Construct the global resource graph on the coordinator
- The coordinator must be reliable and fast
- When to construct the graph is an important choice
- Report every