Title: Distributed Systems Principles and Paradigms
1Distributed Systems Principles and Paradigms
Chapter 05: Synchronization
2Communication Synchronization
- Why do processes communicate in DS?
- To exchange messages
- To synchronize processes
- Why do processes synchronize in DS?
- To coordinate access to shared resources
- To order events
3Time, Clocks and Clock Synchronization
- Time
- Why is time important in DS?
- E.g. UNIX make utility (see Fig. 5-1)
- Clocks (Timer)
- Physical clocks
- Logical clocks (introduced by Leslie Lamport)
- Vector clocks (introduced by Colin Fidge)
- Clock Synchronization
- How do we synchronize clocks with real-world time?
- How do we synchronize clocks with each other?
05 1
Distributed Algorithms/5.1 Clock Synchronization
4Physical Clocks (1/3)
- Problem: Clock skew. Clocks gradually get out of sync and give different values.
- Solution: Universal Coordinated Time (UTC)
- Formerly called GMT (Greenwich Mean Time)
- Based on the number of transitions per second of the cesium-133 atom (very accurate).
- At present, the real time is taken as the average of some 50 cesium clocks around the world (International Atomic Time).
- Introduces a leap second from time to time to compensate for the fact that days are getting longer.
- UTC is broadcast through shortwave radio (with an accuracy of ±1 msec) and satellite (Geostationary Operational Environmental Satellite, GOES, with an accuracy of ±0.5 msec).
- Question: Does this solve all our problems? Don't we now have some global timing mechanism?
5Physical Clocks (2/3)
- Problem: Suppose we have a distributed system with a UTC receiver somewhere in it. We still have to distribute its time to each machine.
- Basic principle
- Every machine has a timer that generates an interrupt H (typically 60) times per second.
- There is a clock in machine p that ticks on each timer interrupt. Denote the value of that clock by $C_p(t)$, where t is UTC time.
- Ideally, we have for each machine p that $C_p(t) = t$, or, in other words, $dC/dt = 1$.
- Theoretically, a timer with $H = 60$ should generate 216,000 ticks per hour.
- In practice, the relative error of modern timer chips is about $10^{-5}$ (that is, between 215,998 and 216,002 ticks per hour).
6Physical Clocks (3/3)
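A timer works within its specification if $1 - r \le dC/dt \le 1 + r$, where r is the maximum drift rate.
Goal: Never let two clocks in any system differ by more than d time units. Two clocks drifting in opposite directions can move apart by up to 2r time units per second, so synchronize at least every d/(2r) seconds.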
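As a rough worked example of the d/(2r) bound, the sketch below computes the longest allowed gap between synchronizations. The function name and the sample values (d = 1 msec, r = 10^-5) are assumptions chosen for illustration, not values taken from the slides.

```python
def max_resync_interval(d, r):
    """Longest gap (seconds) between synchronizations such that two clocks,
    each drifting by at most r, never differ by more than d seconds."""
    if d <= 0 or r <= 0:
        raise ValueError("d and r must be positive")
    # Worst case: the clocks drift in opposite directions, 2*r per second.
    return d / (2 * r)

# Assumed sample values: tolerate 1 msec of skew with a drift rate of 1e-5.
interval = max_resync_interval(d=0.001, r=1e-5)
print(f"resynchronize at least every {interval:.0f} s")
```

With these assumed numbers the clocks would need to resynchronize roughly every 50 seconds.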
7Clock Synchronization Principles
- Principle I: Every machine asks a time server for the accurate time at least once every d/(2r) seconds (see Fig. 5-5).
- But you need an accurate measure of the round-trip delay, including interrupt handling and processing of incoming messages.
- Principle II: Let the time server scan all machines periodically, calculate an average, and inform each machine how it should adjust its time relative to its present time.
- OK, you'll probably get every machine in sync. Note: you don't even need to propagate UTC time (why not?)
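Principle I is often described as Cristian's algorithm. A minimal sketch of the client-side estimate follows, assuming the request and reply take equally long on the network; the function and parameter names are hypothetical.

```python
def estimate_server_time(t_request, t_reply, server_time, server_processing=0.0):
    """Estimate the server's current time when its reply arrives.

    t_request, t_reply: local clock readings when the request was sent and
    the reply was received. server_processing: time the server spent handling
    the request, if known. Assumes symmetric network delay (an assumption).
    """
    round_trip = (t_reply - t_request) - server_processing
    if round_trip < 0:
        raise ValueError("processing time exceeds measured round trip")
    # The reply is assumed to have travelled for half of the round trip.
    return server_time + round_trip / 2

# Assumed readings: 20 ms round trip, of which 4 ms was server processing.
estimate = estimate_server_time(10.000, 10.020, 500.000, server_processing=0.004)
```

The shorter the measured round trip, the tighter the bound on the error of this estimate, which is why the round-trip delay has to be measured carefully.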
8Clock Synchronization Algorithms
- The Berkeley Algorithm
- The time server periodically polls every machine for its time
- The received times are averaged and each machine is notified of the amount by which it should adjust its clock
- Centralized algorithm; see Figure 5-6
- Decentralized Algorithm
- Every machine periodically broadcasts its time at the start of a fixed-length resynchronization interval
- Each machine averages the values from all other machines (or averages them after discarding the highest and lowest values)
- Network Time Protocol (NTP)
- The most popular one, used by machines on the Internet
- Uses an algorithm that is a combination of centralized and distributed approaches
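The averaging step of the Berkeley algorithm can be sketched as follows. This is a simplified illustration that ignores message delays and assumes the server includes its own clock in the average; the machine names and readings are made up.

```python
def berkeley_adjustments(clock_readings):
    """Map each machine to the amount it should add to its clock.

    clock_readings: machine name -> current clock value (any time unit).
    """
    average = sum(clock_readings.values()) / len(clock_readings)
    return {name: average - value for name, value in clock_readings.items()}

# Assumed readings in minutes past midnight: 3:00, 3:25, and 2:50.
readings = {"server": 180.0, "A": 205.0, "B": 170.0}
adjustments = berkeley_adjustments(readings)
```

Each machine receives a relative correction rather than an absolute time, so no machine needs to know UTC for the group to agree.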
05 6
Distributed Algorithms/5.2 Logical Clocks
9Network Time Protocol (NTP)
- A protocol for synchronizing the clocks of computers over packet-switched, variable-latency data networks (e.g., the Internet)
- NTP runs over UDP, using port 123. It is designed particularly to resist the effects of variable latency
- NTPv4 can usually maintain time to within 10 milliseconds (1/100 s) over the public Internet, and can achieve accuracies of 200 microseconds (1/5000 s) or better in local area networks under ideal conditions
- Visit the following URL to understand NTP in more detail
- http://en.wikipedia.org/wiki/Network_Time_Protocol
10The Happened-Before Relationship
- Problem: We first need to introduce a notion of ordering before we can order anything.
- The happened-before relation on the set of events in a distributed system is the smallest relation satisfying:
- If a and b are two events in the same process, and a comes before b, then a → b (a happened before b).
- If a is the sending of a message, and b is the receipt of that message, then a → b.
- If a → b and b → c, then a → c (transitive relation).
- Note: if two events, x and y, happen in different processes that do not exchange messages (not even indirectly), then they are said to be concurrent.
- Note: this introduces a partial ordering of events in a system with concurrently operating processes.
11Logical Clocks (1/2)
Problem: How do we maintain a global view on the system's behavior that is consistent with the happened-before relation?
Solution: Attach a timestamp C(e) to each event e, satisfying the following properties:
P1: If a and b are two events in the same process, and a → b, then we demand that C(a) < C(b).
P2: If a corresponds to sending a message m, and b to the receipt of that message, then also C(a) < C(b).
Problem: How do we attach a timestamp to an event when there's no global clock?
→ Maintain a consistent set of logical clocks, one per process.
12Logical Clocks (2/2)
Each process $P_i$ maintains a local counter $C_i$ and adjusts this counter according to the following rules:
(1) For any two successive events that take place within $P_i$, $C_i$ is incremented by 1.
(2) Each time a message m is sent by process $P_i$, the message receives a timestamp $T_m = C_i$.
(3) Whenever a message m is received by a process $P_j$, $P_j$ adjusts its local counter: $C_j \leftarrow \max(C_j + 1, T_m + 1)$.
Property P1 is satisfied by (1); property P2 by (2) and (3).
This is called Lamport's algorithm.
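The three rules can be sketched as a small class. This is an illustrative sketch, not a reference implementation; the class and method names are assumptions, and sending is treated as an event that first increments the counter.

```python
class LamportClock:
    """A per-process logical clock following rules (1) to (3)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Rule 1: increment the counter for each local event.
        self.time += 1
        return self.time

    def send(self):
        # Rule 2: sending is an event; the message carries the new counter value.
        return self.tick()

    def receive(self, message_timestamp):
        # Rule 3: jump past both the local counter and the message timestamp.
        self.time = max(self.time + 1, message_timestamp + 1)
        return self.time

p1, p2 = LamportClock(), LamportClock()
p1.tick()                      # local event on P1
stamp = p1.send()              # P1 sends a message
received_at = p2.receive(stamp)
```

Because the receiver always moves past the message timestamp, the receipt of a message is never stamped earlier than its sending.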
13Logical Clocks Example
Fig 5-7. (a) Three processes, each with its own
clock. The clocks run at different rates. (b)
Lamport's algorithm corrects the clocks.
14- Assign the Lamport logical clock values to all the events in the above timing diagram. Assume that each process's local clock is set to 0 initially.
15- From the above timing diagram, what can you say about the following events?
- between a and b: a → b
- between b and f: b → f
- between e and k: concurrent
- between c and h: concurrent
- between k and h: k → h
16Total Ordering with Logical Clocks
Problem: It can still occur that two events are assigned the same logical time. Avoid this by attaching a process number to an event: $P_i$ timestamps event e with $C_i(e).i$
Then $C_i(a).i$ happened before $C_j(b).j$ if and only if
(1) $C_i(a) < C_j(b)$, or
(2) $C_i(a) = C_j(b)$ and $i < j$
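The two-part comparison is exactly a lexicographic comparison of (timestamp, process number) pairs, which Python tuples provide directly. The event names and values below are made up for illustration.

```python
def total_order_key(timestamp, process_id):
    """Key whose tuple ordering matches conditions (1) and (2)."""
    return (timestamp, process_id)

# Hypothetical events: (name, Lamport timestamp, process number).
events = [("b", 3, 2), ("a", 3, 1), ("c", 2, 3)]
ordered = sorted(events, key=lambda e: total_order_key(e[1], e[2]))
names = [name for name, _, _ in ordered]
```

Here `names` comes out as `["c", "a", "b"]`: c has the smallest timestamp, and the tie between a and b is broken by the process number.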
17Example Totally-Ordered Multicast (1/2)
- Problem: We sometimes need to guarantee that concurrent updates on a replicated database are seen in the same order everywhere
- Process P1 adds 100 to an account (initial value 1000)
- Process P2 increments the account by 1%
- There are two replicas
Outcome: In the absence of proper synchronization, replica 1 will end up with 1111, while replica 2 ends up with 1110.
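The two outcomes follow from applying the same updates in different orders. A short check, using integer arithmetic and hypothetical function names:

```python
def add_100(balance):
    return balance + 100

def add_one_percent(balance):
    # Integer arithmetic keeps the example exact.
    return balance * 101 // 100

# Replica 1 sees P1's update first, replica 2 sees P2's update first.
replica1 = add_one_percent(add_100(1000))
replica2 = add_100(add_one_percent(1000))
```

Replica 1 computes (1000 + 100) increased by 1%, giving 1111; replica 2 computes 1000 increased by 1%, plus 100, giving 1110.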
18Example Totally-Ordered Multicast (2/2)
- Process $P_i$ sends a timestamped message msg_i to all others. The message itself is put in a local queue queue_i.
- Any incoming message at $P_j$ is queued in queue_j, according to its timestamp.
- $P_j$ passes a message msg_i to its application if
- (1) msg_i is at the head of queue_j
- (2) for each process $P_k$, there is a message msg_k in queue_j with a larger timestamp.
- Note: We are assuming that communication is reliable and FIFO ordered.
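The delivery condition can be sketched with a priority queue. This sketch makes several assumptions beyond the slide: acknowledgments are modelled as messages with an empty payload, ties are broken by sender number, and "each process" in condition (2) is read as each process other than the sender. All names are hypothetical.

```python
import heapq

class Replica:
    def __init__(self, all_pids):
        self.all_pids = set(all_pids)
        self.queue = []        # min-heap ordered by (timestamp, sender)
        self.delivered = []    # payloads passed to the application, in order

    def receive(self, timestamp, sender, payload=None):
        """Queue a message; payload None stands for an acknowledgment."""
        heapq.heappush(self.queue, (timestamp, sender, payload))
        self._deliver_ready()

    def _deliver_ready(self):
        while self.queue:
            ts, sender, payload = self.queue[0]      # condition (1): head of queue
            later = {s for t, s, _ in self.queue if (t, s) > (ts, sender)}
            if not (self.all_pids - {sender}) <= later:
                break                                # condition (2) not yet met
            heapq.heappop(self.queue)
            if payload is not None:
                self.delivered.append(payload)

# Two replicas receive the same messages in different network orders.
r1, r2 = Replica([1, 2]), Replica([1, 2])
for msg in [(1, 1, "add 100"), (1, 2, "add 1%"), (2, 1, None), (2, 2, None)]:
    r1.receive(*msg)
for msg in [(1, 2, "add 1%"), (1, 1, "add 100"), (2, 2, None), (2, 1, None)]:
    r2.receive(*msg)
```

Both replicas end up delivering the updates in the same order, even though the messages arrived in different orders.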
19- Fidge's Logical Clocks
- With Lamport's clocks, one cannot directly compare the timestamps of two events to determine their precedence relationship
- If a → b, then C(a) < C(b)
- But if C(a) < C(b), it could be that a → b or that a ↛ b
- E.g., events e and b in the previous example figure: C(e) = 1 and C(b) = 2
- Thus C(e) < C(b), but e ↛ b
- The main problem is that a simple integer clock cannot order both events within a process and events in different processes
- Colin Fidge developed an algorithm that overcomes this problem
- Fidge's clock is represented as a vector $[c_1, c_2, \ldots, c_n]$ with an integer clock value for each process ($c_i$ contains the clock value of process i)
20- Fidge's Algorithm
- Fidge's logical clock is maintained as follows:
- 1. Initially all clock values are set to the smallest value.
- 2. The local clock value is incremented at least once before each primitive event in a process.
- 3. The current value of the entire logical clock vector is delivered to the receiver for every outgoing message.
- 4. Values in the timestamp vectors are never decremented.
- 5. Upon receiving a message, the receiver sets the value of each entry in its local timestamp vector to the maximum of the two corresponding values in the local vector and in the remote vector received.
- The element corresponding to the sender is a special case: it is set to one greater than the value received, but only if the local value is not greater than that received.
21- Get r_vector from the received msg sent by process q
- if l_vector[q] ≤ r_vector[q] then
-     l_vector[q] := r_vector[q] + 1
- for i := 1 to n do
-     l_vector[i] := max(l_vector[i], r_vector[i])
- Timestamps attached to the events are compared as follows:
- $e_p \rightarrow f_q$ iff $T_{e_p}[p] < T_{f_q}[p]$
- (where $e_p$ represents an event e occurring in process p, $T_{e_p}$ represents the timestamp vector of the event $e_p$, and the ith element of $T_{e_p}$ is denoted by $T_{e_p}[i]$.)
- This means event $e_p$ happened before event $f_q$ if and only if process q received a direct or indirect message from p and that message was sent after $e_p$ had occurred. If $e_p$ and $f_q$ are in the same process (i.e., p = q), the local elements of their timestamps represent their occurrences in the process.
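The update and comparison rules above can be sketched in Python. The class and function names are assumptions, process numbers are zero-based here, and the receipt of a message is treated as an event that increments the local entry before the merge.

```python
class FidgeClock:
    def __init__(self, pid, n):
        self.pid = pid                 # index of this process (0-based here)
        self.vector = [0] * n          # rule 1: start at the smallest value

    def event(self):
        # Rule 2: increment the local entry before each event.
        self.vector[self.pid] += 1
        return list(self.vector)

    def send(self):
        # Rule 3: the whole vector travels with the outgoing message.
        return self.event()

    def receive(self, sender, r_vector):
        self.vector[self.pid] += 1     # the receipt itself is an event
        # Special case for the sender's entry.
        if self.vector[sender] <= r_vector[sender]:
            self.vector[sender] = r_vector[sender] + 1
        # Rule 5: element-wise maximum; entries never decrease (rule 4).
        self.vector = [max(l, r) for l, r in zip(self.vector, r_vector)]
        return list(self.vector)

def happened_before(t_e, p, t_f):
    """True if event e (in process p, timestamp t_e) happened before f."""
    return t_e[p] < t_f[p]

p0, p1 = FidgeClock(0, 2), FidgeClock(1, 2)
a = p0.send()            # P0 sends a message
x = p1.event()           # unrelated local event on P1
r = p1.receive(0, a)     # P1 receives P0's message
b = p0.event()           # later local event on P0
```

In this run a precedes r, while b and r are concurrent: neither `happened_before(b, 0, r)` nor `happened_before(r, 1, b)` holds.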
22- Assign the Lamport and Fidge logical clock values to all the events in the above timing diagram. Assume that each process's logical clock is set to 0 initially.
23[Timing diagram, not reproduced: processes P1, P2, and P3]
24- The above diagram shows both Lamport timestamps (an integer value) and Fidge timestamps (a vector of integer values) for each event.
- Lamport clocks
- 2 < 5 since b → h,
- 3 < 4 but c ↛ g.
- Fidge clocks
- f → h since 2 < 4 is true,
- b → h since 2 < 3 is true,
- h ↛ a since 4 < 0 is false,
- c and h are concurrent since (3 < 3) is false and (4 < 0) is false.
25[Timing diagram, not reproduced: processes P1, P2, P3, and P4, with events a through o]
- Assign the Lamport and Fidge logical clock values to all the events in the above timing diagram. Assume that each process's logical clock is set to 0 initially.
26- From the above timing diagram, what can you say about the following events?
- between b and n
- between b and o
- between m and g
- between c and h
- between c and l
- between j and g
- between k and i
- between j and h
27- READING Reference
- Colin Fidge, Logical Time in Distributed
Computing Systems, IEEE Computer, Vol. 24, No.
8, pp. 28-33, August 1991.
28Global State (1/3)
Basic Idea: Sometimes you want to collect the current state of a distributed computation, called a distributed snapshot. It consists of all local states and messages in transit.
Important: A distributed snapshot should reflect a consistent state.
05 15 Distributed Algorithms/5.3
Global State
29Global State (2/3)
- Note: any process P can initiate taking a distributed snapshot
- P starts by recording its own local state
- P subsequently sends a marker along each of its outgoing channels
- When Q receives a marker through channel C, its action depends on whether it had already recorded its local state:
- Not yet recorded: it records its local state, and sends the marker along each of its outgoing channels
- Already recorded: the marker on C indicates that the channel's state should be recorded: all messages received on C between the time Q recorded its own state and the arrival of this marker.
- Q is finished when it has received a marker along each of its incoming channels
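The marker rules for a single process can be sketched as below. This is a simplified, single-threaded illustration: the application state is just a running sum of received numbers, and `send` is a hypothetical callback supplied by the caller.

```python
MARKER = object()   # sentinel standing in for the marker message

class SnapshotProcess:
    def __init__(self, incoming, outgoing, send):
        self.state = 0                    # application state (a running sum)
        self.incoming = list(incoming)
        self.outgoing = list(outgoing)
        self.send = send                  # hypothetical callback: send(channel, message)
        self.recording = False
        self.recorded_state = None
        self.channel_state = {}           # channel -> messages caught in transit
        self.awaiting = set()             # incoming channels with no marker yet

    def start_snapshot(self):
        # Record the local state, then send a marker on every outgoing channel.
        self.recording = True
        self.recorded_state = self.state
        self.channel_state = {c: [] for c in self.incoming}
        self.awaiting = set(self.incoming)
        for channel in self.outgoing:
            self.send(channel, MARKER)

    def receive(self, channel, message):
        if message is MARKER:
            if not self.recording:
                self.start_snapshot()
            self.awaiting.discard(channel)    # this channel's state is complete
        else:
            if self.recording and channel in self.awaiting:
                self.channel_state[channel].append(message)
            self.state += message

    def finished(self):
        return self.recording and not self.awaiting

sent = []
q = SnapshotProcess(["c1", "c2"], ["out"], lambda ch, m: sent.append((ch, m)))
q.receive("c1", 5)         # ordinary message before the snapshot
q.receive("c1", MARKER)    # first marker: record state, forward marker
q.receive("c2", 7)         # in transit on c2: recorded as channel state
q.receive("c2", MARKER)    # last marker: Q is finished
```

In this run Q records local state 5, an empty state for channel c1, and the message 7 as the state of channel c2.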
30Global State (3/3)
(a) Organization of a process and channels for a distributed snapshot
(b) Process Q receives a marker for the first time and records its local state
(c) Q records all incoming messages
(d) Q receives a marker for its incoming channel and finishes recording the state of the incoming channel
31Election Algorithms
Principle: Many distributed algorithms require that some process acts as a coordinator. The question is how to select this special process dynamically.
Note: In many systems the coordinator is chosen by hand (e.g., file servers, DNS servers). This leads to centralized solutions ⇒ single point of failure.
Question: If a coordinator is chosen dynamically, to what extent can we speak about a centralized or distributed solution?
Question: Is a fully distributed solution, i.e., one without a coordinator, always more robust than any centralized/coordinated solution?
05 18 Distributed Algorithms/5.4 Election
Algorithms
32Election by Bullying (1/2)
- Principle: Each process has an associated priority (weight). The process with the highest priority should always be elected as the coordinator.
- Issue: How do we find the heaviest process?
- Any process can just start an election by sending an election message to all other processes (assuming you don't know the weights of the others).
- If a process $P_{heavy}$ receives an election message from a lighter process $P_{light}$, it sends a take-over message to $P_{light}$. $P_{light}$ is out of the race.
- If a process doesn't get a take-over message back, it wins, and sends a victory message to all other processes.
33Election by Bullying (2/2)
Question: We're assuming something very important here. What?
Assumption: Each process knows the process number of other processes
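The outcome of the bully election described above can be sketched as a loop in which any heavier live process takes over from the current candidate. This collapses the message exchange into function calls and assumes unique weights and a live initiator; all names are hypothetical.

```python
def bully_election(initiator, weights, alive):
    """Return the process that ends up as coordinator.

    weights: process id -> priority (assumed unique).
    alive: set of process ids that still respond to messages.
    """
    candidate = initiator
    while True:
        # Live processes heavier than the candidate answer with a take-over.
        heavier = [p for p in alive if weights[p] > weights[candidate]]
        if not heavier:
            return candidate        # no take-over received: the candidate wins
        candidate = heavier[0]      # a heavier process runs its own election

# Assumed setup: eight processes whose weight equals their id; process 7 crashed.
weights = {p: p for p in range(8)}
alive = {0, 1, 2, 3, 4, 5, 6}
coordinator = bully_election(4, weights, alive)
```

Whichever heavier process takes over first, the loop ends at the heaviest live process, here process 6.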
34Election in a Ring
- Principle: Process priority is obtained by organizing processes into a (logical) ring. The process with the highest priority should be elected as coordinator.
- Any process can start an election by sending an election message to its successor. If a successor is down, the message is passed on to the next successor.
- If a message is passed on, the sender adds itself to the list. When it gets back to the initiator, everyone has had a chance to make its presence known.
- The initiator sends a coordinator message around the ring containing a list of all living processes. The one with the highest priority is elected as coordinator. See Figure 5-12.
Question: Does it matter if two processes initiate an election?
Question: What happens if a process crashes during the election?
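One pass of the ring election can be sketched as follows, assuming that the process id doubles as the priority and that no process crashes while the message circulates. The function name and the sample ring are made up.

```python
def ring_election(ring, alive, initiator):
    """Circulate an election message once around the ring.

    ring: process ids in ring order. alive: ids that are up.
    Returns (coordinator, list of living processes in visiting order).
    """
    n = len(ring)
    members = [initiator]
    i = (ring.index(initiator) + 1) % n
    while ring[i] != initiator:
        if ring[i] in alive:          # crashed successors are skipped
            members.append(ring[i])
        i = (i + 1) % n
    # The initiator picks the highest priority from the collected list.
    return max(members), members

# Assumed ring; process 8 is down, process 6 starts the election.
coordinator, members = ring_election([3, 6, 0, 8, 1], {3, 6, 0, 1}, initiator=6)
```

Here the message visits 6, 0, 1, and 3, and process 6 becomes coordinator because the previous highest process, 8, is down.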
35Mutual Exclusion
- Problem: A number of processes in a distributed system want exclusive access to some resource.
- Basic solutions
- Via a centralized server.
- Completely distributed, with no topology imposed.
- Completely distributed, making use of a (logical) ring.
- Centralized: Really simple
05 22 Distributed Algorithms/5.5 Mutual
Exclusion
36Mutual Exclusion: Ricart & Agrawala
- Principle: The same as Lamport except that acknowledgments aren't sent. Instead, replies (i.e., grants) are sent only when
- The receiving process has no interest in the shared resource, or
- The receiving process is waiting for the resource, but has lower priority (known through comparison of timestamps).
- In all other cases, the reply is deferred (see the algorithm on pg. 267)
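The reply rule can be sketched as a single decision function. The state names and the (timestamp, process id) request format are assumptions for illustration; a lower pair means an earlier request and therefore a higher priority.

```python
def handle_request(my_state, my_request, incoming_request):
    """Decide how to answer an incoming request for the resource.

    my_state: "released" (no interest), "wanted" (waiting), or "held".
    my_request, incoming_request: (timestamp, process_id) pairs;
    my_request is only meaningful while my_state is "wanted".
    """
    if my_state == "released":
        return "reply"                      # no interest in the resource
    if my_state == "wanted" and incoming_request < my_request:
        return "reply"                      # the other request has priority
    return "defer"                          # answer after leaving the region

decision = handle_request("wanted", (8, 1), (5, 2))
```

A deferred reply is sent once the process leaves the critical region, which is what lets the requester eventually collect a grant from everyone.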
37Mutual Exclusion Token Ring Algorithm
Essence: Organize processes in a logical ring, and let a token be passed between them. The one that holds the token is allowed to enter the critical region (if it wants to).
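A small simulation of the token passing, assuming no process crashes and the token is never lost; the function name and process names are hypothetical.

```python
def circulate_token(ring, wants, start, rounds=1):
    """Pass the token around the ring and record who enters the critical region.

    ring: process names in ring order. wants: processes waiting to enter.
    start: index of the process that currently holds the token.
    """
    entries = []
    pending = set(wants)
    for step in range(rounds * len(ring)):
        holder = ring[(start + step) % len(ring)]
        if holder in pending:         # the holder enters only if it wants to
            entries.append(holder)
            pending.remove(holder)
    return entries

order = circulate_token(["A", "B", "C", "D"], {"D", "B"}, start=2)
```

With the token starting at C, process D enters first and B second, following ring order rather than request order.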
38Distributed Transactions
- The transaction model
- Classification of transactions
- Concurrency control
39The Transaction Model (1)
- Updating a master tape is fault tolerant.
Question: What happens if this computer operation fails?
- Both tapes are rewound and the job is restarted from the beginning without any harm being done
40The Transaction Model (2)
Primitive            Description
BEGIN_TRANSACTION    Mark the start of a transaction
END_TRANSACTION      Terminate the transaction and try to commit
ABORT_TRANSACTION    Kill the transaction and restore the old values
READ                 Read data from a file, a table, or otherwise
WRITE                Write data to a file, a table, or otherwise
- Figure 5-18: Example primitives for transactions.
41The Transaction Model (3)
(a)
BEGIN_TRANSACTION
    reserve BOS -> JFK
    reserve JFK -> ICN
    reserve SEL -> KPO
END_TRANSACTION
(b)
BEGIN_TRANSACTION
    reserve BOS -> JFK
    reserve JFK -> ICN
    reserve SEL -> KPO full => ABORT_TRANSACTION
- (a) Transaction to reserve three flights commits
- (b) Transaction aborts when third flight is unavailable
42ACID Properties of Transactions
- Atomic
- To the outside world, the transaction happens indivisibly
- Consistent
- The transaction does not violate system invariants
- Isolated
- Concurrent transactions do not interfere with each other
- Durable
- Once a transaction commits, the changes are permanent
43Nested Transactions
- Constructed from a number of subtransactions
- The top-level transaction may create children that run in parallel with one another to gain performance or simplify programming
- Each of these children is called a subtransaction and it may also have one or more subtransactions
- When any transaction or subtransaction starts, it is conceptually given a private copy of all data in the entire system for it to manipulate as it wishes
- If it aborts, its private space is destroyed
- If it commits, its private space replaces the parent's space
- If the top-level transaction aborts, all the changes made in the subtransactions must be wiped out
44Distributed Transactions
- Transactions involving subtransactions that operate on data that are distributed across multiple machines
- Separate distributed algorithms are needed to handle the locking of data and committing the entire transaction
45Implementing Transactions
- Private Workspace
- Gives a private workspace (i.e., all the data it has access to) to a process when it begins a transaction
- Write-ahead Log
- Files are actually modified in place, but before any block is changed, a record is written to a log telling
- which transaction is making the change
- which file and block is being changed
- what the old and new values are
- Only after the log has been written successfully is the change made to the file
- Question: Why is a log needed?
- → For rollback if necessary
46Private Workspace
- (a) The file index and disk blocks for a three-block file
- (b) The situation after a transaction has modified block 0 and appended block 3
- (c) After committing
47Writeahead Log
(a)
x = 0
y = 0
BEGIN_TRANSACTION
    x = x + 1
    y = y + 2
    x = y * y
END_TRANSACTION
(b) Log: [x = 0 / 1]
(c) Log: [x = 0 / 1] [y = 0 / 2]
(d) Log: [x = 0 / 1] [y = 0 / 2] [x = 1 / 4]
- (a) A transaction
- (b)-(d) The log before each statement is executed
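The logging and rollback idea can be sketched with an in-memory dictionary standing in for the files. The function names are hypothetical, and a real log would be forced to stable storage before each change.

```python
def run_with_log(store, updates):
    """Apply updates in place, writing an (item, old, new) record first."""
    log = []
    for name, compute in updates:
        old, new = store[name], compute(store)
        log.append((name, old, new))   # write the log record first ...
        store[name] = new              # ... and only then change the data
    return log

def rollback(store, log):
    """Undo the logged changes by restoring old values in reverse order."""
    for name, old, _new in reversed(log):
        store[name] = old

store = {"x": 0, "y": 0}
log = run_with_log(store, [
    ("x", lambda s: s["x"] + 1),
    ("y", lambda s: s["y"] + 2),
    ("x", lambda s: s["y"] * s["y"]),
])
after_commit = dict(store)
rollback(store, log)                   # what an abort would do
```

The log ends up with the records x: 0/1, y: 0/2, and x: 1/4, and reading it backwards restores x and y to 0.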
48Concurrency Control (1)
- The goal of concurrency control is to allow multiple transactions to be executed simultaneously
- The final result should be the same as if all transactions had run sequentially
- Fig. 5-23 General organization of managers for
handling transactions
49Concurrency Control (2)
- General organization of managers for handling
distributed transactions.
50Serializability (1)
(a) BEGIN_TRANSACTION  x = 0;  x = x + 1;  END_TRANSACTION
(b) BEGIN_TRANSACTION  x = 0;  x = x + 2;  END_TRANSACTION
(c) BEGIN_TRANSACTION  x = 0;  x = x + 3;  END_TRANSACTION
(a)-(c) Three transactions T1, T2, and T3
Schedule 1   x = 0; x = x + 1; x = 0; x = x + 2; x = 0; x = x + 3   Legal
Schedule 2   x = 0; x = 0; x = x + 1; x = x + 2; x = 0; x = x + 3   Legal
Schedule 3   x = 0; x = 0; x = x + 1; x = 0; x = x + 2; x = x + 3   Illegal
- (d) Possible schedules
- Question: Why is Schedule 3 illegal?
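One way to explore the question is to run each schedule and compare the final value of x with the values that serial executions can produce. This is a simplified check on the final result only, with a made-up encoding of the operations.

```python
from itertools import permutations

def run(schedule):
    """Execute a schedule of ("set", v) and ("add", v) operations on x."""
    x = 0
    for op, value in schedule:
        x = value if op == "set" else x + value
    return x

# T1, T2, T3: each sets x to 0 and then adds 1, 2, or 3.
transactions = {n: [("set", 0), ("add", n)] for n in (1, 2, 3)}
serial_results = {
    run([op for n in order for op in transactions[n]])
    for order in permutations(transactions)
}

schedule2 = [("set", 0), ("set", 0), ("add", 1), ("add", 2), ("set", 0), ("add", 3)]
schedule3 = [("set", 0), ("set", 0), ("add", 1), ("set", 0), ("add", 2), ("add", 3)]
```

Serial execution can only end with x equal to 1, 2, or 3. Schedule 2 ends with 3, which matches a serial order, while Schedule 3 ends with 5, which no serial order can produce.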
51Serializability (2)
- Two operations conflict if they operate on the same data and at least one of them is a write operation
- Read-write conflict: exactly one of the operations is a write
- Write-write conflict: both operations are writes
- Concurrency control algorithms can generally be classified by looking at the way read and write operations are synchronized
- Using locking
- Explicitly ordering operations using timestamps
52Two-Phase Locking (1)
- In two-phase locking (2PL), the scheduler first acquires all the locks it needs during the growing (1st) phase, and then releases them during the shrinking (2nd) phase
- See the rules on pg. 284
- Fig. 5-26 Two-phase locking
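The two-phase rule can be sketched as a guard that rejects any lock request made after the first release. This illustrates only the phase discipline for one transaction, not a full lock manager; the class name is hypothetical.

```python
class TwoPhaseTransaction:
    def __init__(self):
        self.held = set()
        self.shrinking = False      # becomes True at the first release

    def acquire(self, item):
        # Growing phase only: no new locks once any lock has been released.
        if self.shrinking:
            raise RuntimeError("2PL violation: cannot acquire after a release")
        self.held.add(item)

    def release(self, item):
        self.held.remove(item)
        self.shrinking = True       # the shrinking phase has begun

t = TwoPhaseTransaction()
t.acquire("x")
t.acquire("y")
t.release("x")
try:
    t.acquire("z")
    violated = False
except RuntimeError:
    violated = True
```

Under strict two-phase locking, the releases would all be postponed until the transaction has committed or aborted.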
53Two-Phase Locking (2)
- In strict two-phase locking, the shrinking phase does not take place until the transaction has finished running and has either committed or aborted.
- Fig. 5-27 Strict two-phase locking