Distributed Systems Principles and Paradigms

Slides: 55

Provided by: orin

1
Distributed Systems Principles and Paradigms
Chapter 05: Synchronization
2
Communication and Synchronization

Why do processes communicate in DS?
To exchange messages
To synchronize processes
Why do processes synchronize in DS?
To coordinate access of shared resources
To order events

3
Time, Clocks and Clock Synchronization

Time
Why is time important in DS?
E.g. UNIX make utility (see Fig. 5-1)
Clocks (Timer)
Physical clocks
Logical clocks (introduced by Leslie Lamport)
Vector clocks (introduced by Colin Fidge)
Clock Synchronization
How do we synchronize clocks with real-world
time?
How do we synchronize clocks with each other?

05 1
Distributed Algorithms/5.1 Clock Synchronization
4
Physical Clocks (1/3)

Problem: Clock skew. Clocks gradually get out of
sync and give different values.
Solution: Coordinated Universal Time (UTC)
Formerly called GMT (Greenwich Mean Time)
Based on the number of transitions per second of
the cesium-133 atom (very accurate).
At present, the real time is taken as the average
of some 50 cesium clocks around the world:
International Atomic Time (TAI).
UTC introduces a leap second from time to time to
compensate for days getting longer.
UTC is broadcast through shortwave radio (with
an accuracy of +/- 1 msec) and satellite
(Geostationary Operational Environmental Satellite,
GOES, with an accuracy of +/- 0.5 msec).
Question: Does this solve all our problems? Don't
we now have some global timing mechanism?

5
Physical Clocks (2/3)

Problem: Suppose we have a distributed system
with a UTC receiver somewhere in it. We still
have to distribute its time to each machine.
Basic principle:
Every machine has a timer that generates an
interrupt H (typically 60) times per second.
There is a clock in machine p that ticks on each
timer interrupt. Denote the value of that clock
by Cp(t), where t is UTC time.
Ideally, we have for each machine p that
Cp(t) = t, or, in other words, dC/dt = 1.
Theoretically, a timer with H = 60 should generate
216,000 ticks per hour.
In practice, the relative error of modern timer
chips is about 10^-5 (that is, between 215,998 and
216,002 ticks per hour).

6
Physical Clocks (3/3)
Every clock is assumed to satisfy
1 - r <= dC/dt <= 1 + r,
where r is the max. drift rate.
Goal: Never let two clocks in any system differ
by more than d time units => synchronize at least
every d/(2r) seconds.
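The resynchronization bound can be checked with a little arithmetic. The sketch below assumes the worst case, in which two clocks drift in opposite directions and so move apart by up to 2r per second of real time. The function name and the sample values (1 msec allowed skew, drift rate 10^-5) are illustrative, not taken from the slides.

```python
def resync_interval(max_skew: float, max_drift_rate: float) -> float:
    """Longest gap between resynchronizations that keeps two clocks within max_skew seconds."""
    # Two clocks drifting in opposite directions diverge by up to 2 * r per second.
    return max_skew / (2 * max_drift_rate)


# Illustrative values: allow 1 msec of skew with a drift rate of 10^-5.
interval = resync_interval(max_skew=0.001, max_drift_rate=1e-5)
print(f"resynchronize at least every {interval:.0f} seconds")
```

With these sample values the clocks need to be resynchronized roughly every 50 seconds.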
7
Clock Synchronization Principles

Principle I: Every machine asks a time server
for the accurate time at least once every d/(2r)
seconds (see Fig. 5-5).
But you need an accurate measure of the round-trip
delay, including interrupt handling and
processing of incoming messages.
Principle II: Let the time server scan all
machines periodically, calculate an average, and
inform each machine how it should adjust its time
relative to its present time.
OK, you'll probably get every machine in sync.
Note: you don't even need to propagate UTC time
(why not?)
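The round-trip correction needed for Principle I can be sketched as follows. This approach is commonly known as Cristian's algorithm. The sketch assumes that the request and the reply take equally long on the network, which is only an approximation, and the function and parameter names are hypothetical.

```python
def estimate_server_time(t_request: float, t_reply: float,
                         server_time: float, server_processing: float = 0.0) -> float:
    """Estimate the server's clock value at the moment the reply arrives."""
    round_trip = t_reply - t_request
    # Assumption: the network delay is symmetric, so the reply took half of it.
    one_way_delay = (round_trip - server_processing) / 2
    return server_time + one_way_delay


# Illustrative values: request sent at 100.000 s, reply received at 100.020 s
# (local clock), server reported 250.000 s and spent 4 msec handling the request.
estimate = estimate_server_time(t_request=100.000, t_reply=100.020,
                                server_time=250.000, server_processing=0.004)
print(f"{estimate:.3f}")
```

Subtracting the server's processing time before halving is what the slide means by accounting for interrupt handling and message processing.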

8
Clock Synchronization Algorithms

The Berkeley Algorithm
The time server periodically polls every machine
for its time
The received times are averaged and each machine
is notified of the amount by which it should
adjust its time
Centralized algorithm; see Figure 5-6
Decentralized Algorithm
Every machine broadcasts its time periodically,
once per fixed-length resynchronization interval
Each machine averages the values from all other
machines (or averages without the highest and
lowest values)
Network Time Protocol (NTP)
The most popular one used by the machines on the
Internet
Uses an algorithm that is a combination of
centralized and distributed
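The averaging step of the Berkeley algorithm can be sketched as below. The sketch assumes the server includes its own clock in the average and ignores network delay. The function name and the sample times (in minutes relative to the server's clock) are illustrative.

```python
def berkeley_adjustments(server_time: float,
                         client_times: dict[str, float]) -> dict[str, float]:
    """Return the offset each machine, including the server, should apply."""
    all_times = {"server": server_time, **client_times}
    average = sum(all_times.values()) / len(all_times)
    # Each machine is told how far to move, not what the absolute time is.
    return {name: average - reported for name, reported in all_times.items()}


# Illustrative values: machine A is 10 minutes behind, machine B 25 minutes ahead.
adjustments = berkeley_adjustments(0.0, {"A": -10.0, "B": 25.0})
print(adjustments)
```

Because only relative adjustments are sent, the machines agree with each other even if none of them knows UTC.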

05 6
Distributed Algorithms/5.2 Logical Clocks
9
Network Time Protocol (NTP)

A protocol for synchronizing the clocks of
computers over packet-switched, variable-latency
data networks (e.g., the Internet)
NTP runs over UDP, using port 123. It
is designed particularly to resist the effects of
variable latency.
NTPv4 can usually maintain time to within 10
milliseconds (1/100 s) over the public Internet,
and can achieve accuracies of 200 microseconds
(1/5000 s) or better in local area networks under
ideal conditions.
Visit the following URL to understand NTP in
more detail:
http://en.wikipedia.org/wiki/Network_Time_Protocol

10
The Happened-Before Relationship

Problem: We first need to introduce a notion of
ordering before we can order anything.
The happened-before relation on the set of events
in a distributed system is the smallest relation
satisfying:
If a and b are two events in the same process,
and a comes before b, then a → b (a happened
before b).
If a is the sending of a message, and b is the
receipt of that message, then a → b.
If a → b and b → c, then a → c (transitive
relation).
Note: if two events, x and y, happen in different
processes that do not exchange messages, then
they are said to be concurrent.
Note: this introduces a partial ordering of
events in a system with concurrently operating
processes.

11
Logical Clocks (1/2)
Problem: How do we maintain a global view on the
system's behavior that is consistent with the
happened-before relation?
Solution: attach a timestamp C(e) to each event e,
satisfying the following properties:
P1: If a and b are two events in the same process,
and a → b, then we demand that C(a) < C(b).
P2: If a corresponds to sending a message m, and b
to the receipt of that message, then also
C(a) < C(b).
Problem: How do we attach a timestamp to an event
when there's no global clock?
→ Maintain a consistent set of logical clocks,
one per process.
12
Logical Clocks (2/2)
Each process Pi maintains a local counter Ci and
adjusts this counter according to the following
rules:
(1) For any two successive events that take place
within Pi, Ci is incremented by 1.
(2) Each time a message m is sent by process Pi,
the message receives a timestamp Tm = Ci.
(3) Whenever a message m is received by a process
Pj, Pj adjusts its local counter to
Cj = max(Cj, Tm) + 1.
Property P1 is satisfied by (1); property P2 by
(2) and (3).
This is called Lamport's algorithm.
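The three rules can be sketched as a small class. This is a minimal sketch, not a full implementation. It assumes clocks start at 0, that sending counts as an event, and that the receiver moves past both its own counter and the message timestamp. The class and method names are hypothetical.

```python
class LamportClock:
    """Logical clock for one process, following rules (1) to (3)."""

    def __init__(self) -> None:
        self.counter = 0  # assumption: every clock starts at 0

    def local_event(self) -> int:
        self.counter += 1          # rule (1): tick between successive events
        return self.counter

    def send(self) -> int:
        self.counter += 1          # sending is itself an event
        return self.counter        # rule (2): the message carries Tm = Ci

    def receive(self, message_timestamp: int) -> int:
        # rule (3): jump past both the local counter and the message timestamp
        self.counter = max(self.counter, message_timestamp) + 1
        return self.counter


p1, p2 = LamportClock(), LamportClock()
p2.local_event()
p2.local_event()                   # p2's clock now reads 2
tm = p1.send()                     # p1 sends with its first timestamp
received_at = p2.receive(tm)       # p2 stamps the receipt
print(tm, received_at)
```

The receipt always gets a larger timestamp than the send, which is exactly property P2.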
13
Logical Clocks Example
Fig 5-7. (a) Three processes, each with its own
clock. The clocks run at different rates. (b)
Lamport's algorithm corrects the clocks
14

Assign Lamport's logical clock values to all
the events in the above timing diagram. Assume
that each process's local clock is set to 0
initially.

From the above timing diagram, what can you say
about the following events?
between a and b: a → b
between b and f: b → f
between e and k: concurrent
between c and h: concurrent
between k and h: k → h

16
Total Ordering with Logical Clocks
Problem: it can still occur that two events
happen at the same time. Avoid this by attaching
a process number to an event:
Pi timestamps event e with Ci(e).i
Then Ci(a).i happened before Cj(b).j if and only if:
(1) Ci(a) < Cj(b), or
(2) Ci(a) = Cj(b) and i < j
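The two-part comparison is a lexicographic order on (clock value, process number) pairs. The sketch below writes it out explicitly. The function name is hypothetical, and "happened before" here means the tie-broken total order, not causality.

```python
def ordered_before(ts_a: tuple[int, int], ts_b: tuple[int, int]) -> bool:
    """Each timestamp is (clock value, process number)."""
    clock_a, proc_a = ts_a
    clock_b, proc_b = ts_b
    # (1) smaller clock value wins; (2) equal clocks are broken by process number
    return clock_a < clock_b or (clock_a == clock_b and proc_a < proc_b)


events = [(4, 2), (3, 3), (4, 1)]
# Python compares tuples lexicographically, which is the same order.
print(sorted(events))
```

Since process numbers are unique, no two events from different processes can tie, so the order is total.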
17
Example Totally-Ordered Multicast (1/2)

Problem: We sometimes need to guarantee that
concurrent updates on a replicated database are
seen in the same order everywhere.
Process P1 adds 100 to an account (initial
value: 1000)
Process P2 increments the account by 1%
There are two replicas

Outcome: in the absence of proper synchronization,
replica 1 will end up with 1111 (add 100, then 1%),
while replica 2 ends up with 1110 (1% first, then
add 100).
18
Example Totally-Ordered Multicast (2/2)

Process Pi sends timestamped message msg_i to all
others. The message itself is put in a local
queue queue_i.
Any incoming message at Pj is queued in queue_j,
according to its timestamp.
Pj passes a message msg_i to its application if:
(1) msg_i is at the head of queue_j
(2) for each process Pk, there is a message
msg_k in queue_j with a larger timestamp.
Note: We are assuming that communication is
reliable and FIFO ordered.
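The delivery test at Pj can be sketched as a single predicate over the queue. This is a simplified sketch with several assumptions: the queue holds (timestamp, sender) pairs kept in timestamp order, condition (2) is checked for every process other than the sender of the head message, and the function name is hypothetical.

```python
def can_deliver(queue: list[tuple[int, int]], processes: set[int]) -> bool:
    """queue holds (timestamp, sender) pairs, sorted by timestamp."""
    if not queue:
        return False
    head_ts, head_sender = queue[0]              # condition (1): head of the queue
    others = processes - {head_sender}           # assumption: the sender is not checked
    # condition (2): every other process has a queued message with a larger timestamp
    return all(
        any(sender == k and ts > head_ts for ts, sender in queue)
        for k in others
    )


queue_j = sorted([(5, 1), (7, 2)])
before = can_deliver(queue_j, {1, 2, 3})         # nothing from process 3 yet
queue_j = sorted(queue_j + [(8, 3)])
after = can_deliver(queue_j, {1, 2, 3})
print(before, after)
```

With reliable FIFO channels, once a later message from Pk is queued, no earlier message from Pk can still arrive, so the head is safe to deliver.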

19

Fidge's Logical Clocks
With Lamport's clocks, one cannot directly
compare the timestamps of two events to determine
their precedence relationship:
- if a → b then C(a) < C(b)
- but if C(a) < C(b), it could be a → b or a ↛ b
- e.g., events e and b in the previous example
figure:
C(e) = 1 and C(b) = 2,
thus C(e) < C(b) but e ↛ b
The main problem is that a simple integer clock
cannot order both events within a process and
events in different processes.
Colin Fidge developed an algorithm that
overcomes this problem.
Fidge's clock is represented as a vector [c1, c2,
..., cn] with an integer clock value for each
process (ci contains the clock value of process i).

20

Fidge's Algorithm
Fidge's logical clock is maintained as
follows:
1. Initially all clock values are set to the
smallest value.
2. The local clock value is incremented at least
once before each primitive event in a process.
3. The current value of the entire logical clock
vector is delivered to the receiver for every
outgoing message.
4. Values in the timestamp vectors are never
decremented.
5. Upon receiving a message, the receiver sets
the value of each entry in its local timestamp
vector to the maximum of the two corresponding
values in the local vector and in the remote
vector received.
The element corresponding to the sender is a
special case: it is set to one greater than the
value received, but only if the local value is
not greater than that received.

Get r_vector from the received msg sent by
process q
if l_vector[q] <= r_vector[q] then
    l_vector[q] = r_vector[q] + 1
for i = 1 to n do
    l_vector[i] = max(l_vector[i], r_vector[i])
Timestamps attached to the events are compared as
follows:
e_p → f_q iff T_ep[p] < T_fq[p]
(where e_p represents an event e occurring in
process p, T_ep represents the timestamp vector of
the event e_p, and the ith element of T_ep is
denoted by T_ep[i].)
This means event e_p happened before event f_q if
and only if process q received a direct or
indirect message from p and that message was sent
after e_p had occurred. If e_p and f_q are in the
same process (i.e., p = q), the local elements of
their timestamps represent their occurrences in
the process.
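The update rules and the comparison can be sketched together in a few lines. This is a minimal sketch. It assumes 0-based process indices, clocks that start at 0, and that a receive counts as an event of the receiver. The class and function names are hypothetical.

```python
class FidgeClock:
    """Vector clock for process `pid` out of `n` processes (0-based indices)."""

    def __init__(self, pid: int, n: int) -> None:
        self.pid = pid
        self.vector = [0] * n                      # rule 1: smallest value

    def event(self) -> list[int]:
        self.vector[self.pid] += 1                 # rule 2: tick before each event
        return list(self.vector)                   # timestamp of this event

    def send(self) -> list[int]:
        return self.event()                        # rule 3: whole vector is sent

    def receive(self, sender: int, r_vector: list[int]) -> list[int]:
        self.vector[self.pid] += 1                 # the receipt is an event too
        if self.vector[sender] <= r_vector[sender]:
            self.vector[sender] = r_vector[sender] + 1   # sender special case
        # rule 5: element-wise maximum, so entries never decrease (rule 4)
        self.vector = [max(mine, theirs)
                       for mine, theirs in zip(self.vector, r_vector)]
        return list(self.vector)


def happened_before(t_e: list[int], p: int, t_f: list[int]) -> bool:
    """e_p -> f_q iff T_ep[p] < T_fq[p], where p is the process of event e."""
    return t_e[p] < t_f[p]


p0, p1 = FidgeClock(0, 2), FidgeClock(1, 2)
a = p0.send()               # event a in process 0
g = p1.event()              # event g in process 1, before the message arrives
f = p1.receive(0, a)        # event f: process 1 receives the message sent at a
print(a, g, f)
print(happened_before(a, 0, f), happened_before(a, 0, g), happened_before(g, 1, a))
```

Here a precedes f, while a and g are concurrent because the test fails in both directions.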

Assign Lamport's and Fidge's logical clock
values to all the events in the above timing
diagram. Assume that each process's logical clock
is set to 0 initially.

23
[Figure: timing diagram for processes P1, P2 and P3]
24

The above diagram shows both Lamport timestamps
(an integer value) and Fidge timestamps (a
vector of integer values) for each event.
Lamport clocks:
2 < 5 since b → h,
3 < 4 but c ↛ g.
Fidge clocks:
f → h since 2 < 4 is true,
b → h since 2 < 3 is true,
h ↛ a since 4 < 0 is false,
c and h are concurrent since (3 < 3) is false and (4 < 0) is false.

25
[Figure: timing diagram for processes P1 to P4 with events a through o]

Assign Lamport's and Fidge's logical clock
values to all the events in the above timing
diagram. Assume that each process's logical clock
is set to 0 initially.

From the above timing diagram, what can you say
about the following events?
between b and n
between b and o
between m and g
between c and h
between c and l
between j and g
between k and i
between j and h

Reading reference:
Colin Fidge, "Logical Time in Distributed
Computing Systems," IEEE Computer, Vol. 24, No.
8, pp. 28-33, August 1991.

28
Global State (1/3)
Basic Idea: Sometimes you want to collect the
current state of a distributed computation,
called a distributed snapshot. It consists of all
local states and messages in transit.
Important: A distributed snapshot should reflect a
consistent state.
05 15 Distributed Algorithms/5.3
Global State
29
Global State (2/3)

Note: any process P can initiate taking a
distributed snapshot.
P starts by recording its own local state.
P subsequently sends a marker along each of its
outgoing channels.
When Q receives a marker through channel C, its
action depends on whether it had already recorded
its local state:
Not yet recorded: it records its local
state, and sends the marker along each of
its outgoing channels.
Already recorded: the marker on C indicates
that the channel's state should be recorded:
all messages received on C after the time Q
recorded its own state and before this marker.
Q is finished when it has received a marker
along each of its incoming channels.
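The marker rules for a single process can be sketched as follows. This is a simplified, single-process view with no real networking. It assumes the local state is a running sum of the received values, and the class name, channel names and MARKER value are all hypothetical.

```python
MARKER = "MARKER"  # hypothetical marker value


class SnapshotProcess:
    def __init__(self, state, incoming, outgoing):
        self.state = state
        self.incoming = list(incoming)        # names of incoming channels
        self.outgoing = list(outgoing)        # names of outgoing channels
        self.has_recorded = False
        self.recorded_state = None
        self.channel_state = {}               # channel -> messages caught in transit
        self.open_channels = set()            # incoming channels still being recorded

    def start_snapshot(self):
        """Record the local state and return the markers to send."""
        self.has_recorded = True
        self.recorded_state = self.state
        self.channel_state = {c: [] for c in self.incoming}
        self.open_channels = set(self.incoming)
        return [(c, MARKER) for c in self.outgoing]

    def receive(self, channel, message):
        """Handle one incoming message; return any markers to send out."""
        to_send = []
        if message == MARKER:
            if not self.has_recorded:
                to_send = self.start_snapshot()       # first marker seen
            self.open_channels.discard(channel)       # this channel is now complete
        else:
            if self.has_recorded and channel in self.open_channels:
                self.channel_state[channel].append(message)   # message in transit
            self.state += message        # assumption: state is a running sum
        return to_send

    def finished(self):
        return self.has_recorded and not self.open_channels


q = SnapshotProcess(state=10, incoming=["c1", "c2"], outgoing=["out"])
q.receive("c1", 5)                    # ordinary message before the snapshot
markers = q.receive("c1", MARKER)     # first marker: record state, forward markers
q.receive("c2", 7)                    # arrives on c2 after Q recorded its state
q.receive("c2", MARKER)               # marker on c2 closes that channel
print(q.recorded_state, q.channel_state, q.finished())
```

The channel on which the first marker arrives is recorded as empty, since everything sent before the marker was already reflected in the recorded local state.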

30
Global State (3/3)
(a) Organization of a process and channels for a
distributed snapshot
(b) Process Q receives a marker for the first time
and records its local state
(c) Q records all incoming messages
(d) Q receives a marker for its incoming channel
and finishes recording the state of the incoming
channel
31
Election Algorithms
Principle: Many distributed algorithms require
that some process acts as a coordinator. The
question is how to select this special process
dynamically.
Note: In many systems the coordinator is chosen by
hand (e.g., file servers, DNS servers). This leads
to centralized solutions => single point of
failure.
Question: If a coordinator is chosen dynamically,
to what extent can we speak about a centralized or
distributed solution?
Question: Is a fully distributed solution, i.e.,
one without a coordinator, always more robust than
any centralized/coordinated solution?
05 18 Distributed Algorithms/5.4 Election
Algorithms
32
Election by Bullying (1/2)

Principle: Each process has an associated
priority (weight). The process with the highest
priority should always be elected as the
coordinator.
Issue: How do we find the heaviest process?
Any process can just start an election by
sending an election message to all other
processes (assuming you don't know the weights of
the others).
If a process Pheavy receives an election message
from a lighter process Plight, it sends a
take-over message to Plight. Plight is out of the
race.
If a process doesn't get a take-over message
back, it wins, and sends a victory message to all
other processes.
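The outcome of a bully election can be sketched with a short simulation. This is a sketch only: messages are modeled as instantaneous, no process crashes during the election, and weights are assumed to be distinct. The function name and the sample weights are hypothetical.

```python
def bully_election(initiator: int, weights: dict[int, int], alive: set[int]) -> int:
    """Return the id of the process that ends up sending the victory message."""
    candidate = initiator
    while True:
        # The candidate sends an election message to all other live processes.
        heavier = [p for p in alive
                   if p != candidate and weights[p] > weights[candidate]]
        if not heavier:
            return candidate        # no take-over message came back: it wins
        # Some heavier process takes over and runs its own election.
        # Which one goes first does not change the final outcome.
        candidate = heavier[0]


weights = {1: 10, 2: 40, 3: 25, 4: 60}
# Process 4 (the heaviest) has crashed, so it is not in the live set.
winner = bully_election(initiator=1, weights=weights, alive={1, 2, 3})
print(winner)
```

Whoever starts the election, the heaviest live process ends up as coordinator.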

33
Election by Bullying (2/2)
Question: We're assuming something very important
here. What?
Assumption: Each process knows the process number
of other processes.
34
Election in a Ring

Principle: Process priority is obtained by
organizing processes into a (logical) ring.
The process with the highest priority should be
elected as coordinator.
Any process can start an election by sending an
election message to its successor. If a successor
is down, the message is passed on to the next
successor.
If a message is passed on, the sender adds
itself to the list. When it gets back to the
initiator, every process has had a chance to make
its presence known.
The initiator sends a coordinator message around
the ring containing a list of all living
processes. The one with the highest priority is
elected as coordinator. See Figure 5-12.
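One pass of the election message around the ring can be sketched as below. The sketch assumes that a higher process id means a higher priority, that the initiator is alive, and that nothing crashes mid-election. The function name and sample ring are hypothetical.

```python
def ring_election(ring: list[int], alive: set[int],
                  initiator: int) -> tuple[int, list[int]]:
    """ring lists process ids in ring order; a higher id means higher priority."""
    n = len(ring)
    members = [initiator]                      # the initiator puts itself on the list
    position = (ring.index(initiator) + 1) % n
    while ring[position] != initiator:
        process = ring[position]
        if process in alive:                   # dead successors are skipped
            members.append(process)            # each live process adds itself
        position = (position + 1) % n
    # The list is back at the initiator, which announces the coordinator.
    return max(members), members


coordinator, members = ring_election(ring=[3, 7, 1, 5, 2],
                                     alive={3, 1, 5, 2}, initiator=1)
print(coordinator, members)
```

Process 7 is down in this example, so it never appears on the list and cannot be elected.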

Question: Does it matter if two processes
initiate an election?
Question: What happens if a process crashes
during the election?
35
Mutual Exclusion

Problem: A number of processes in a distributed
system want exclusive access to some resource.
Basic solutions:
Via a centralized server.
Completely distributed, with no topology
imposed.
Completely distributed, making use of a
(logical) ring.
Centralized: Really simple

05 22 Distributed Algorithms/5.5 Mutual
Exclusion
36
Mutual Exclusion: Ricart & Agrawala

Principle: The same as Lamport except that
acknowledgments aren't sent. Instead, replies
(i.e., grants) are sent only when:
The receiving process has no interest in the
shared resource, or
The receiving process is waiting for the
resource, but has lower priority (known through
comparison of timestamps).
In all other cases, the reply is deferred (see the
algorithm on pg. 267).
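The receiver's decision can be sketched as one function. This is a sketch under assumptions: requests are modeled as (timestamp, process id) pairs so that ties are broken by process id, and the state names RELEASED, WANTED and HELD are hypothetical labels, not taken from the slides.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    RELEASED = "no interest in the resource"
    WANTED = "waiting for the resource"
    HELD = "using the resource"


def should_reply_now(state: State,
                     own_request: Optional[tuple[int, int]],
                     incoming_request: tuple[int, int]) -> bool:
    """Requests are (timestamp, process id) pairs; the smaller pair has priority."""
    if state is State.RELEASED:
        return True                                # no interest: grant at once
    if state is State.WANTED:
        return incoming_request < own_request      # receiver has lower priority
    return False                                   # HELD: defer until release


grant_older = should_reply_now(State.WANTED, own_request=(8, 2), incoming_request=(5, 1))
grant_newer = should_reply_now(State.WANTED, own_request=(5, 1), incoming_request=(8, 2))
print(grant_older, grant_newer)
```

A deferred reply is sent later, when the receiver leaves the critical region.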

37
Mutual Exclusion: Token Ring Algorithm
Essence: Organize processes in a logical ring,
and let a token be passed between them. The one
that holds the token is allowed to enter the
critical region (if it wants to).
38
Distributed Transactions

The transaction model
Classification of transactions
Concurrency control

39
The Transaction Model (1)

Updating a master tape is fault tolerant.

Question: What happens if this computer operation
fails?

Both tapes are rewound and the job is restarted
from the beginning without any harm being done.

40
The Transaction Model (2)
Primitive            Description
BEGIN_TRANSACTION    Mark the start of a transaction
END_TRANSACTION      Terminate the transaction and try to commit
ABORT_TRANSACTION    Kill the transaction and restore the old values
READ                 Read data from a file, a table, or otherwise
WRITE                Write data to a file, a table, or otherwise

Figure 5-18. Example primitives for transactions.

41
The Transaction Model (3)
(a) Transaction to reserve three flights commits:
BEGIN_TRANSACTION
    reserve BOS -> JFK
    reserve JFK -> ICN
    reserve SEL -> KPO
END_TRANSACTION

(b) Transaction aborts when third flight is
unavailable:
BEGIN_TRANSACTION
    reserve BOS -> JFK
    reserve JFK -> ICN
    reserve SEL -> KPO full => ABORT_TRANSACTION

42
ACID Properties of Transactions

Atomic
To the outside world, the transaction happens
indivisibly
Consistent
The transaction does not violate system
invariants
Isolated
Concurrent transactions do not interfere with
each other
Durable
Once a transaction commits, the changes are
permanent

43
Nested Transactions

Constructed from a number of subtransactions
The top-level transaction may create children
that run in parallel with one another to gain
performance or simplify programming
Each of these children is called a
subtransaction and it may also have one or more
subtransactions
When any transaction or subtransaction starts,
it is conceptually given a private copy of all
data in the entire system for it to manipulate as
it wishes
If it aborts, its private space is destroyed
If it commits, its private space replaces the
parent's space
If the top-level transaction aborts, all the
changes made in the subtransactions must be wiped
out

44
Distributed Transactions

- Transactions involving subtransactions that
operate on data that are distributed across
multiple machines
- Separate distributed algorithms are needed to
handle the locking of data and committing the
entire transaction

45
Implementing Transactions

Private Workspace
Gives a private workspace (i.e., all the data it
has access to) to a process when it begins a
transaction
Writeahead Log
Files are actually modified in place, but before
any block is changed, a record is written to a
log telling:
which transaction is making the change
which file and block is being changed
what the old and new values are
Only after the log has been written successfully
is the change made to the file.
Question: Why is a log needed?
→ For rollback, if necessary

46
Private Workspace

(a) The file index and disk blocks for a three-block
file
(b) The situation after a transaction has modified
block 0 and appended block 3
(c) After committing

47
Writeahead Log
(a) A transaction:
x = 0
y = 0
BEGIN_TRANSACTION
    x = x + 1
    y = y + 2
    x = y * y
END_TRANSACTION

(b)-(d) The log before each statement is executed:
(b) Log: [x = 0/1]
(c) Log: [x = 0/1] [y = 0/2]
(d) Log: [x = 0/1] [y = 0/2] [x = 1/4]
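The log for this transaction, and the rollback it makes possible, can be sketched in a few lines. This is an in-memory sketch only: a real writeahead log is forced to stable storage before the data is changed. The function names are hypothetical.

```python
def run_with_log(values: dict[str, int], statements) -> list[tuple[str, int, int]]:
    """statements: list of (variable, function mapping current values to the new value)."""
    log = []                                    # records: (variable, old value, new value)
    for var, compute in statements:
        new = compute(values)
        log.append((var, values[var], new))     # write the log record first ...
        values[var] = new                       # ... then change the value in place
    return log


def rollback(values: dict[str, int], log) -> None:
    for var, old, _new in reversed(log):        # undo from the newest record back
        values[var] = old


values = {"x": 0, "y": 0}
log = run_with_log(values, [
    ("x", lambda v: v["x"] + 1),
    ("y", lambda v: v["y"] + 2),
    ("x", lambda v: v["y"] * v["y"]),
])
committed = dict(values)                        # values as the transaction left them
rollback(values, log)                           # what an abort would do
print(log, committed, values)
```

The three log records match entries (b) to (d) above, and replaying them backwards restores x and y to 0.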

48
Concurrency Control (1)

The goal of concurrency control is to allow
multiple transactions to be executed
simultaneously
Final result should be the same as if all
transactions had run sequentially

Fig. 5-23 General organization of managers for
handling transactions

49
Concurrency Control (2)

General organization of managers for handling
distributed transactions.

50
Serializability (1)
(a) BEGIN_TRANSACTION
        x = 0
        x = x + 1
    END_TRANSACTION
(b) BEGIN_TRANSACTION
        x = 0
        x = x + 2
    END_TRANSACTION
(c) BEGIN_TRANSACTION
        x = 0
        x = x + 3
    END_TRANSACTION
(a)-(c) Three transactions T1, T2, and T3

(d) Possible schedules:
Schedule 1   x = 0; x = x + 1; x = 0; x = x + 2; x = 0; x = x + 3   Legal
Schedule 2   x = 0; x = 0; x = x + 1; x = x + 2; x = 0; x = x + 3   Legal
Schedule 3   x = 0; x = 0; x = x + 1; x = 0; x = x + 2; x = x + 3   Illegal

Question: Why is Schedule 3 illegal?
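One way to explore the question is to run each schedule and compare the final value of x with what the serial orders can produce. This is a simplified check on final values only, not a full serializability test. The encoding of statements as ("set", n) and ("add", n) pairs is an assumption made for the sketch.

```python
from itertools import permutations

SET, ADD = "set", "add"


def run(schedule: list[tuple[str, int]]) -> int:
    """Execute a schedule of ('set', n) and ('add', n) operations on x."""
    x = 0
    for op, value in schedule:
        x = value if op == SET else x + value
    return x


t1 = [(SET, 0), (ADD, 1)]
t2 = [(SET, 0), (ADD, 2)]
t3 = [(SET, 0), (ADD, 3)]

# Final values of x over all six serial orders of T1, T2, T3.
serial_results = {run(a + b + c) for a, b, c in permutations([t1, t2, t3])}

schedules = {
    "Schedule 1": t1 + t2 + t3,
    "Schedule 2": [(SET, 0), (SET, 0), (ADD, 1), (ADD, 2), (SET, 0), (ADD, 3)],
    "Schedule 3": [(SET, 0), (SET, 0), (ADD, 1), (SET, 0), (ADD, 2), (ADD, 3)],
}
results = {name: run(s) for name, s in schedules.items()}
for name, final_x in results.items():
    print(name, final_x, "matches a serial order:", final_x in serial_results)
```

Schedules 1 and 2 leave x at a value some serial order also produces, while Schedule 3 ends with a value that no serial order can produce.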

51
Serializability (2)

Two operations conflict if they operate on the
same data and if at least one of them is a write
operation:
read-write conflict: exactly one of the
operations is a write
write-write conflict: involves more than one
write operation
Concurrency control algorithms can generally be
classified by looking at the way read and write
operations are synchronized:
Using locking
Explicitly ordering operations using timestamps

52
Two-Phase Locking (1)

In two-phase locking (2PL), the scheduler first
acquires all the locks it needs during the
growing (1st) phase, and then releases them
during the shrinking (2nd) phase
See the rules on pg. 284