Title: DISTRIBUTED COMPUTING
1DISTRIBUTED COMPUTING
2ROAD MAP OVERVIEW
- Why are distributed systems interesting?
- Why are they hard?
3GOALS OF DISTRIBUTED SYSTEMS
- Take advantage of the cost/performance difference
between microprocessors and shared-memory
multiprocessors.
- Build systems
- 1. with a single system image
- 2. with higher performance
- 3. with higher reliability
- 4. for less money than uniprocessor systems
- In wide-area distributed systems, information and
work are physically distributed, implying that
computing needs should be distributed. Besides
improving response time, this contributes to
political goals such as local control over data.
4WHY SO HARD?
- A distributed system is one in which each process
has imperfect knowledge of the global state.
- Reasons: asynchrony and failures.
- We discuss problems that these two features raise
and algorithms to address them.
- Then we discuss implementation issues for real
distributed systems.
5ANATOMY OF A DISTRIBUTED SYSTEM
- A set of asynchronous computing devices connected
by a network. Normally, there is no global clock.
- Communication is either through messages or
shared memory. Shared memory is usually harder to
implement.
6ANATOMY OF A DISTRIBUTED SYSTEM (cont.)
- EACH PROCESSOR HAS ITS OWN CLOCK
- ARBITRARY NETWORK
- BROADCAST MEDIUM: special protocols will be
possible for the broadcast medium.
7COURSE GOALS
- 1. To help you understand which system
assumptions are important.
- 2. To present some interesting and useful
distributed algorithms and methods of analysis,
then have you apply them under challenging
conditions.
- 3. To explore the sources of distributed
intelligence.
8BASIC COMMUNICATION PRIMITIVE MESSAGE PASSING
- Paradigm
- Send message to destination
- Receive message from origin
- Nice property: can make distribution transparent,
since it does not matter whether the destination is
at a local computer or at a remote one (except
for failures).
- Clean framework: Paradigms for Process
Interaction in Distributed Programs, G. R.
Andrews, ACM Computing Surveys 23(1) (March 1991),
pp. 49-90.
9 BLOCKING (SYNCHRONOUS) VS. NON-BLOCKING
(ASYNCHRONOUS) COMMUNICATION
- For the sender: should the sender wait for the
receiver to receive a message or not?
- For the receiver: when arriving at a reception point
and there is no message waiting, should the
receiver wait or proceed? Blocking receive is
normal (i.e., the receiver waits).
11REMOTE PROCEDURE CALL
- Client calls the server using call server(in
parameters; out parameters). The call can appear
anywhere that a normal procedure call can.
- Server returns the result to the client.
- Client blocks while waiting for the response from
the server.
12RENDEZVOUS FACILITY
- One process sends a message to another process
and blocks at least until that process accepts
the message.
- The receiving process blocks when it is waiting
to accept a request.
- Thus the name: only when both processes are
ready for the data transfer do they proceed.
- We will see examples of rendezvous interactions
in CSP and Ada.
13Beyond send-receive: Conversations
- Needed when a continuous connection is more
efficient and/or only some data is available at a time.
- Bob and Alice: Bob initiates, Alice responds,
then Bob, then Alice, and so on.
- But what if Bob wants Alice to send messages as
they arrive without Bob's doing more than an ack?
- Send-only or receive-only mode.
- Others?
14SEPARATION OF CONCERNS
- Separation of concerns is the software
engineering principle that each component should
have a single small job to do, so it can do it
well.
- In distributed systems, there are at least three
concerns having to do with remote services: what
to request, where to do it, and how to ask for it.
15IDEAL SEPARATION
- What to request: the application programmer must
figure this out, e.g., access the customer database.
- Where to do it: the application programmer should not
need to know where, because this adds complexity;
if the location changes, applications break.
- How to ask for it: want a uniform interface.
16WHERE TO DO IT ORGANIZATION OF CLIENTS AND
SERVERS
- A service is a piece of work to do. It will be done
by a server.
- A client who wants a service sends a message to a
service broker for that service. The server gets
work from the broker and commonly responds
directly to the client. A server is a process.
- More basic approach: each server has a port from
which it can receive requests.
- Difference: in the client-broker-server model, many
servers can offer the same service. In the direct
client-server approach, the client must request a
service from a particular server.
17ALTERNATIVE: NAME SERVER
- A service is a piece of work to do. It will be done
by a server. The Name Server knows where services are
done.
- Example: the client requests the address of a server from
the Name Server and then communicates directly
with that server.
- Difference: client-server communication is
direct, so it may be more efficient.
18HOW TO ASK FOR IT: OBJECT-BASED
- Encapsulation of data behind a functional
interface.
- Inheritance is optional, but the interface is the
contract.
- So we need a technique for both synchronous and
asynchronous procedure calls.
19REFERENCE EXAMPLE: CORBA OBJECT REQUEST BROKER
- Send an operation to the ORB with its parameters.
- The ORB routes the operation to the proper site for
execution.
- It arranges for the response to be sent to you directly
or indirectly.
- Operations can be events, so the ORB can allow
interrupts from servers to clients.
20SUCCESSORS TO CORBA Microsoft Products
- COM allows objects to call one another in a
centralized setting: classes and objects of those
classes. Can create objects and then invoke them.
- DCOM = COM + Object Request Broker.
- ActiveX = DCOM for the Web.
21SUCCESSORS TO CORBA Java RMI
- Remote Method Invocation (RMI): define a service
interface in Java.
- Register the server in the RMI registry, i.e., an
object request broker.
- The client may access the server through the registry.
- Notion of distributed garbage collection.
22SUCCESSORS TO CORBA Enterprise Java Beans
- Beans are again objects but can be customized at
runtime.
- Support the distributed transaction notion (later) as
well as backups.
- So the transaction notion for persistent storage is
another concern that it is nice to separate.
23REDUCING BUREAUCRACY: automatic registration
- SUN also developed an abstraction known as JINI.
- A new device finds a lookup service (like an ORB),
uploads its interface, and then everyone can
access it.
- No need to register.
- Requires a trusted environment.
24COOPERATING DISTRIBUTED SYSTEMS: LINDA
- Linda supports a shared data structure called a
tuple space.
- Linda tuples, like database system records,
consist of strings and integers. We will see
this in the matrix example below.
25LINDA OPERATIONS
- The operations are out (add a tuple to the
space), in (read and remove a tuple from the
space), and read (read but don't remove a tuple
from the tuple space).
- A pattern-matching mechanism is used so that
tuples can be extracted selectively by specifying
values or data types of some fields.
- in ("dennis", ?x, ?y)
- gets a tuple whose first field contains "dennis" and
assigns the values in the second and third fields of the
tuple to x and y, respectively.
26EXAMPLE MATRIX MULTIPLICATION
- There are two matrices A and B. We store A's rows
and B's columns as tuples.
- (A, 1, A's first row), (A, 2, A's second
row), ...
- (B, 1, B's first column), (B, 2, B's second
column), ...
- (Next, 15)
- There is a global counter called Next in the
range 1 .. (number of rows of A) x (number of
columns of B).
- A process performs an in on Next, records the
value, and performs an out on Next+1, provided
Next is still in its range.
- Convert Next into the row number i and column
number j such that Next = i x (total number of
columns) + j.
27ACTUAL MULTIPLICATION
- First find i and j.
- in (Next, ?temp)
- out (Next, temp + 1)
- convert (temp, i, j)
- Given i and j, a process just reads the values
and outputs the result (a code sketch follows below).
- read (A, i, ?row_values)
- read (B, j, ?col_values)
- out (result, i, j, Dotproduct(row_values, col_values))
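To make the steps above concrete, here is a minimal single-process sketch in Python. The TupleSpace class is only an illustrative stand-in for a real Linda tuple space (the names out_t, in_t, and read_t are hypothetical; Linda's own operations are out, in, and rd), and None plays the role of the ?x pattern variables.

import threading

class TupleSpace:
    # In-memory stand-in for a Linda tuple space.
    def __init__(self):
        self.tuples = []
        self.cond = threading.Condition()

    def out_t(self, *tup):                         # add a tuple to the space
        with self.cond:
            self.tuples.append(tuple(tup))
            self.cond.notify_all()

    def _match(self, pattern, tup):
        return len(pattern) == len(tup) and all(
            p is None or p == v for p, v in zip(pattern, tup))

    def in_t(self, *pattern):                      # read and remove a matching tuple
        with self.cond:
            while True:
                for t in self.tuples:
                    if self._match(pattern, t):
                        self.tuples.remove(t)
                        return t
                self.cond.wait()

    def read_t(self, *pattern):                    # read but do not remove
        with self.cond:
            while True:
                for t in self.tuples:
                    if self._match(pattern, t):
                        return t
                self.cond.wait()

def worker(space, n_rows, n_cols):
    # One worker: repeatedly claim the next (i, j) pair and compute the
    # dot product of A's row i with B's column j.
    while True:
        _, nxt = space.in_t("Next", None)          # in (Next, ?temp)
        if nxt > n_rows * n_cols:                  # counter exhausted: put it back, stop
            space.out_t("Next", nxt)
            return
        space.out_t("Next", nxt + 1)               # out (Next, temp + 1)
        i = (nxt - 1) // n_cols + 1                # convert (temp, i, j), 1-indexed
        j = (nxt - 1) % n_cols + 1
        _, _, row = space.read_t("A", i, None)     # read (A, i, ?row_values)
        _, _, col = space.read_t("B", j, None)     # read (B, j, ?col_values)
        space.out_t("result", i, j,
                    sum(r * c for r, c in zip(row, col)))

A driver would out the (A, i, row) and (B, j, column) tuples plus (Next, 1), and then start several worker threads sharing the same TupleSpace.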
28LINDA IMPLEMENTATION OF SHARED TUPLE SPACE
- The implementers assert that the work represented
by the tuples is large enough that there is no
need for shared-memory hardware.
- The question is how to implement out, in, and
read (as well as inp and readp).
29BROADCAST IMPLEMENTATION 1
- Implement out by broadcasting the argument of out
to all sites. (Use a negative acknowledgement
protocol for the broadcast.)
- To implement read, perform the read from
local memory.
- To implement in, perform a local read and then
attempt to delete the tuple from all other sites.
- If several sites perform an in, only one site
should succeed.
- One approach is to have the site originating the
tuple decide which site deletes.
- Summary: good for reads and outs, not so good for
ins.
30BROADCAST IMPLEMENTATION 2
- Implement out by writing locally.
- Implement in and read by a global query. (This
may have to be repeated if the data is not
present.)
- Summary: better for out. Worse for read. Same for
in.
31COMMUNICATION REVIEW
- Basic distributed communication when there is no shared
memory: send/receive.
- Location transparency: broker or name server or
tuple space.
- Synchrony and asynchrony are both useful (e.g.,
real-time vs. informational sensors).
- Other mechanisms are possible.
32COMMUNICATION BY SHARED MEMORY: beyond locks
- Framework: Herlihy, Maurice. Impossibility and
Universality Results for Wait-Free Synchronization,
ACM SIGACT-SIGOPS Symposium on Principles of
Distributed Computing (PODC), 1988.
- In a system that uses mutual exclusion, it is
possible that one process may stop while holding
a critical resource and hang the entire system.
- It is of interest to find wait-free primitives,
in which no process ever waits for another one.
- The primitive operations include test-and-set,
fetch-and-add, and fetch-and-cons.
- Herlihy shows that certain operations are
strictly more powerful than others for wait-free
synchronization.
33CAN MAKE ANYTHING WAIT-FREE (at a time price)
- Don't maintain the data structure at all.
Instead, just keep a history of the operations.
- enq(x)
- put enq(x) on end of history list
(fetch-and-cons)
- end enq(x)
- deq
- put deq on end of history list (fetch-and-cons)
- replay the history and figure out what to
return
- end deq
- Not extremely practical: the deq takes O(number
of deqs + number of enqs) time.
- Suggestion is to have certain operations
reconstruct the state in an efficient manner.
34GENERAL METHOD: COMPARE-AND-SWAP
- Compare-and-swap takes two values v and v'. If
the register's current value is v, it is replaced
by v'; otherwise it is left unchanged. The
register's old value is returned.
- temp = compare-and-swap (register, 0, i)
- if register = 0 then register := i
- else register is unchanged
- Use this primitive to perform atomic updates to a
data structure.
- In the following figure, what should the
compare-and-swap do?
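A minimal sketch of the primitive and the standard retry loop built on it. The Register class below simulates CAS with a Python lock, whereas real hardware provides it as a single atomic instruction; note that the retry loop is lock-free rather than wait-free, since a process can in principle retry forever while others keep winning.

import threading

class Register:
    # A register with read and compare-and-swap, simulated here with a lock.
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        # If the current value equals `expected`, replace it with `new`;
        # otherwise leave it unchanged. The old value is returned either way.
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new
            return old

def atomic_increment(reg):
    # CAS retry loop: read, compute, try to install; retry if another
    # process changed the register in between.
    while True:
        old = reg.read()
        if reg.compare_and_swap(old, old + 1) == old:
            return old + 1

The slide's example corresponds to temp = reg.compare_and_swap(0, i): the register becomes i only if it currently holds 0.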
35PERSISTENT DATA STRUCTURES AND WAIT-FREEDOM
- One node added, one node removed. To establish the
change, change the current pointer.
- The old tree would still be available.
- Important point: if the process doing the change should
abort, then no other process is affected.
36LAMPORT Time, Clocks paper
- What is the proper notion of time for Distributed
Systems? - Time Is a Partial Order
- The Arrow Relation
- Logical Clocks
- Ordering All Events using a tie-breaking Clock
- Achieving Mutual Exclusion Using This Clock
- Correctness
- Criticisms
- Need for Physical Clocks
- Conditions for Physical Clocks
- Assumptions About Clocks and Messages
- How Do We Achieve Physical Clock Goal?
37ROAD MAP TIME ACCORDING TO LAMPORT
38TIME
- Assuming there are no failures, the most
important difference between distributed systems
and centralized ones is that distributed systems
have no natural notion of global time. - Lamport was the first who built a theory around
accepting this fact. - That theory has proven to be surprisingly useful,
since the partial order that Lamport proposed is
enough for many applications.
39WHAT LAMPORT DOES
- 1. The paper (reference on the next slide) describes a
message-based criterion for obtaining a time
partial order.
- 2. It converts this time partial order to a
total order.
- 3. It uses the total order to solve the mutual
exclusion problem.
- 4. It describes a stronger notion of physical
time and gives an algorithm that sometimes
achieves it (depending on the quality of local clocks
and message delivery).
40NOTIONS OF TIME IN DISTRIBUTED SYSTEMS
Lamport, L. Time, Clocks, and the Ordering of
Events in a Distributed System, Communications
of the ACM, vol. 21, no. 7 (July 1978), pp. 558-565.
- A distributed system consists of a collection of
distinct processes, which are spatially
separated. (Each process has a unique
identifier.)
- They communicate by exchanging messages.
- Messages arrive in the order they are sent.
(This could be achieved by a hand-shaking protocol.)
- Consequence: time is a partial order in distributed
systems. Some events may not be ordered.
41THE ARROW (partial order) RELATION
- We say A happens before B, or A → B, if
- 1. A and B are in the same process and A
happens before B in that process. (Assume
processes are sequential.)
- 2. A is the sending of a message at one process
and B is the receiving of that message at
another process; then A → B.
- 3. There is a C such that A → C and C → B.
- In the jargon, → is an irreflexive partial
ordering.
42LOGICAL CLOCKS
- Clocks are a way of assigning a number to an
event. Each process has its own clock.
- For now, clocks will have nothing to do with real
time, so they can be implemented by counters with
no actual timing mechanism.
- Clock condition: for any events A and B, if A →
B, then C(A) < C(B).
43IMPLEMENTING LOGICAL CLOCKS
- Each process increments its local clock between
any two successive events.
- Each process puts its local time on each message
that it sends.
- Each process changes its clock C to C' when it
receives a message m having timestamp T. Require
that C' > max(C, T).
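A small sketch of these three rules as a counter with no timing mechanism:

class LamportClock:
    # A logical clock: just a counter, with no connection to real time.
    def __init__(self):
        self.time = 0

    def local_event(self):
        # Rule 1: increment between any two successive local events.
        self.time += 1
        return self.time

    def send(self):
        # Rule 2: put the local time on each outgoing message.
        self.time += 1
        return self.time              # attach this value to the message

    def receive(self, msg_timestamp):
        # Rule 3: on receiving a message with timestamp T, require C' > max(C, T).
        self.time = max(self.time, msg_timestamp) + 1
        return self.time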
44IMPLEMENTATION OF LOGICAL CLOCKS
- Receiver clock jumps to 14 because of the timestamp
on the message received.
- Receiver clock is unaffected by the timestamp
associated with the sent message, because the receiver's
clock is already 18, which is greater than the message
timestamp.
45ORDERING ALL EVENTS
- We want to define a total order ⇒.
- Suppose two events occur in the same process;
then they are ordered by the first condition.
- Suppose A and B occur in different processes, i
and j. Use process ids to break ties.
- LC(A) = A.i, the timestamp of A concatenated with i.
- LC(B) = B.j.
- The total ordering ⇒ is called the Lamport clock.
46ACHIEVING MUTUAL EXCLUSION USING THIS CLOCK
- Goals
- Only one process can hold the resource at a time.
- Requests must be granted in the order in which
they are made.
- Assumption: messages arrive in the order they are
sent. (Remember, this can be achieved by
handshaking.)
47ALGORITHM FOR MUTUAL EXCLUSION
- To request the resource, Pi sends the message
"request resource" to all other processes along
with Pi's local Lamport timestamp T. It also puts
that message on its own request queue.
- When a process receives such a request, it
acknowledges the message (unless it has already
sent a message to Pi timestamped later than T).
- Releasing the resource is analogous to
requesting, but doesn't require an
acknowledgement.
48USING THE RESOURCE
- Process Pi starts using the resource when
- its own request on its local request queue has
the earliest Lamport timestamp T (consistent with
⇒) and
- it has received a message (either an
acknowledgement or some other message) from every
other process with a timestamp larger than T.
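A sketch of the request/acknowledge/release bookkeeping at one process. It assumes a send(destination, message) callback supplied by a FIFO network layer; the class and method names are illustrative, not taken from Lamport's paper.

class MutexProcess:
    def __init__(self, pid, all_pids):
        self.pid = pid
        self.others = [p for p in all_pids if p != pid]
        self.clock = 0
        self.queue = {}                              # pid -> (timestamp, pid) of pending request
        self.latest = {p: 0 for p in self.others}    # latest timestamp seen from each process

    def tick(self, seen=0):
        self.clock = max(self.clock, seen) + 1
        return self.clock

    def request(self, send):
        ts = self.tick()
        self.queue[self.pid] = (ts, self.pid)        # put request on own queue
        for p in self.others:
            send(p, ("request", ts, self.pid))

    def on_message(self, kind, ts, sender, send):
        self.tick(ts)
        self.latest[sender] = ts
        if kind == "request":
            self.queue[sender] = (ts, sender)
            send(sender, ("ack", self.tick(), self.pid))
        elif kind == "release":
            self.queue.pop(sender, None)

    def can_enter(self):
        # Enter when my request is earliest in (timestamp, pid) order and I have
        # heard from every other process with a timestamp larger than mine.
        mine = self.queue.get(self.pid)
        return (mine is not None
                and all(mine < req for p, req in self.queue.items() if p != self.pid)
                and all(self.latest[p] > mine[0] for p in self.others))

    def release(self, send):
        self.queue.pop(self.pid, None)
        for p in self.others:
            send(p, ("release", self.tick(), self.pid))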
49CORRECTNESS
- Theorem: mutual exclusion and first-requested,
first-served are achieved.
- Proof
- Suppose Pi and Pj are both using the resource at
the same time and have timestamps Ti and Tj.
- Suppose Ti < Tj. Then Pj must have received Pi's
request, since it has received at least one
message with a timestamp greater than Tj from Pi
and since messages arrive in the order they are
sent. But then Pj would not execute its request.
Contradiction.
- First-requested, first-served: if Pi requests the
resource before Pj (in the → sense), then Ti <
Tj, so Pi will win.
50CRITICISMS
- Many messages. Even if only one process is using the
resource, it still must send messages to many
other processes.
- If one process stops, then all processes hang (no
wait freedom; could we achieve it?).
51Is there a Wait-Free Variant?
- Modify the resource locally and then send it to
everyone. If nobody objects, then the new resource
value is good.
- Difficulty: how to make it so that a single
atomic wait-free operation can install the update
to the resource?
52NEED FOR PHYSICAL CLOCKS
- Time as a partial order is the most frequent
assumption in distributed systems; however, it is
sometimes important to have a physical notion of
time.
- Example: going outside the system. Person X
starts a program A, then calls Y on the
telephone, who then starts program B. We would
like A → B.
- But that may not be true for Lamport clocks,
because they are sensitive only to inter-computer
messages. Physical clocks try to account for
event orderings that are external to the system.
53CONDITIONS FOR PHYSICAL CLOCKS
- Suppose u is the smallest time, through internal
or external means, in which one process can be
informed of an event occurring at another
process. That is, u is the smallest transmission
time.
- (Distance/speed of light?)
- Suppose we have a global time t (all processes
are in the same frame of reference) that is unknown
to any process.
- Goal for physical clocks: Ci(t + u) > Cj(t) for
any i, j.
- This ensures that if A happens before B, then
the clock time for B will be after the clock time
for A.
54ASSUMPTIONS ABOUT CLOCKS AND MESSAGES
- Clock drift: in one unit of global time, Ci will
advance between 1-k and 1+k time units (k << 1).
- A message can be sent in some minimum time v, with
a possible additional delay of at most e.
55HOW DO WE ACHIEVE PHYSICAL CLOCK GOAL?
- Can't always do so, e.g., can't synchronize
quartz watches using the U.S. post office.
- Basic algorithm: periodically (how often is to be determined),
each process sends out timestamped messages.
- Upon receiving a message from Pi timestamped Ti,
process Pj sets its own timestamp to max(Ti + v,
Tj).
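In code the receiving rule is a single line (v is the assumed minimum transmission time):

def on_physical_clock_message(local_clock, msg_timestamp, v):
    # Upon receiving a message timestamped Ti from Pi, Pj sets its clock
    # to max(Ti + v, Tj): the message took at least v time units to arrive.
    return max(msg_timestamp + v, local_clock)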
56WHAT ALGORITHM ACCOMPLISHES
- Simplifying to the essence of the idea, suppose
there are two processes i and j, and i sends a
message that arrives at global time t.
- After possibly resetting its timestamp, process j
ensures that
- Cj(t) ≥ Ci(t) + v - (e+v)x(1+k)
- That is, since i sent its message at local time
Ti, i's clock may have advanced by (e+v)x(1+k) time
units, to Ti + (e+v)x(1+k). At the least, Cj(t) ≥
Ti + v.
- How good can synchronization be, given e, v, k?
57ROAD MAP SOME FUNDAMENTAL PROTOCOLS
58PROTOCOLS
- Asynchrony and distributed centers of processing
give rise to various problems:
- Find a spanning tree in a network.
- When does a collection of processes terminate?
- Find a consistent state of a distributed system,
i.e., some analogue to a photographic snapshot.
- Establish a synchronization point. This will
allow us to implement parallel algorithms that
work in rounds on a distributed asynchronous
system.
- Find the shortest path from a given node s to
every other node in the network.
59MODEL
- Links are bidirectional. A message traverses a
single link.
- All nodes have distinct ids. Each node knows its
immediate neighbors.
- Messages incur arbitrary but finite delay.
- FIFO discipline on links, i.e., messages are
received in the order they are sent.
60PRELIMINARY ESTABLISH A SPANNING TREE
- Some node establishes itself as the leader (e.g.,
node x establishes a spanning tree for its
broadcasts, so x is the root).
- That node sends out a "request for children" to
its neighbors in the graph.
- When a node n receives "request for children"
from node m:
if m is the first node that sent n this message,
then n responds "ACK and you're my parent" and sends
"request for children" to its other neighbors; else
n responds "ACK".
Each node except the root has one parent, and
every node is in the tree. A leaf is a node
that only received plain ACKs from the neighbors to
which it sent requests.
61TERMINATING THE SPANNING TREE
- A node that determines that it is a leaf sends an
"I'm all done" message up to its parent.
- Each non-root parent sends an "I'm all done"
message to its parent once it has received such a
message from all its children.
- When the root receives "I'm all done" from all its
children, then it is done.
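A sketch of both the construction and the "I'm all done" termination phase at a single node. A send(destination, message) callback from a FIFO network layer is assumed, and the message names are illustrative.

class TreeNode:
    def __init__(self, node_id, neighbors, is_root=False):
        self.id = node_id
        self.neighbors = set(neighbors)
        self.is_root = is_root
        self.parent = None
        self.children = set()        # confirmed children (kept for later broadcasts)
        self.awaiting = set()        # neighbors we still expect a reply from
        self.not_done = set()        # children that have not yet said "all done"

    def start(self, send):                         # called at the root only
        self.awaiting = set(self.neighbors)
        for n in self.neighbors:
            send(n, ("request_children", self.id))

    def on_request(self, sender, send):
        if self.parent is None and not self.is_root:
            self.parent = sender                   # first request wins
            send(sender, ("ack_parent", self.id))
            self.awaiting = self.neighbors - {sender}
            for n in self.awaiting:
                send(n, ("request_children", self.id))
            self._maybe_done(send)                 # a leaf has nothing to wait for
        else:
            send(sender, ("ack", self.id))         # plain ACK: I already have a parent

    def on_reply(self, sender, became_child, send):
        # A neighbor answered: "ACK and you're my parent" or a plain ACK.
        self.awaiting.discard(sender)
        if became_child:
            self.children.add(sender)
            self.not_done.add(sender)
        self._maybe_done(send)

    def on_all_done(self, child, send):
        self.not_done.discard(child)
        self._maybe_done(send)

    def _maybe_done(self, send):
        # Report upward once all replies are in and all children have finished.
        if not self.awaiting and not self.not_done:
            if self.is_root:
                print("spanning tree construction complete")
            else:
                send(self.parent, ("all_done", self.id))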
62BROADCAST WITH FEEDBACK
- A given node s would like to pass a message X to
all other nodes in the network and be informed
that all nodes have received the message.
- Algorithm
- Construct the spanning tree and then send X
along the tree.
- Have each node send an acknowledgement to its
parent after it has received an acknowledgement
from all its children.
63PROCESS TERMINATION
- Def: detect the completion of a collection of
non-interacting tasks, each of which is performed
on a distinct processor.
- When a leaf finishes its computation, it sends an
"I am terminated" message to its parent.
- When an internal node completes and has received
an "I am terminated" message from all of its
children, it sends such a message to its parent.
- When the root completes and has received an "I am
terminated" message from all of its children, all
tasks have been completed.
64DISTRIBUTED SNAPSHOTS
- Intuitively, a snapshot is a freezing of a
distributed computation at a single moment in time.
- Given a snapshot, it is easy to detect stable
conditions such as deadlock.
- (A deadlock condition doesn't go away. If a
deadlock held in the past and nothing has been
done about it, then it still holds. That makes it
stable.)
65FORMAL NOTION OF SNAPSHOT
- Assume that each processor has a local clock,
which is incremented after the receipt and
processing of each incoming message, e.g., a
Lamport clock. (Processing may include
transmitting other messages.)
- A collection of local times {tk | k in N}, where N
denotes the set of nodes, constitutes a snapshot
if each message received by node j from node i
prior to tj has been sent by i prior to ti.
- A message sent by i before ti but not received by
j before tj is said to be in transit.
- The correctness criterion is that no message sent
after the snapshot be received before the snapshot.
(Such a thing could never happen if the snapshot
time at every site were a single global time.)
66DISTRIBUTED SNAPSHOTS
(Figure: a message from i to j, where ti is the snapshot time for process i
and tj is the snapshot time for process j.
Situation I: sent before ti and received before tj -- OK.
Situation II: sent after ti but received before tj -- BAD.
Situation III: sent before ti but received after tj -- in transit, OK.)
67ALGORITHM
- Node i enters its snapshot time either
spontaneously or upon receipt of a flagged
message, whichever comes first. In either case,
it sends out a flagged token and advances the
clock to what becomes its snapshot time ti.
- Messages sent later are after the snapshot.
- This algorithm allows each node to determine when
all messages in transit have been received.
- That is, when a node has received a flagged token
from all its neighbors, it has received all
messages in transit.
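A sketch of the flagged-token rule at one node, in the style of the Chandy-Lamport algorithm. A send(destination, message) callback is assumed, channels are FIFO, and local_state stands for whatever application state the node records.

class SnapshotNode:
    def __init__(self, node_id, neighbors):
        self.id = node_id
        self.neighbors = set(neighbors)
        self.recorded_state = None                     # local state at the snapshot time
        self.in_transit = {n: [] for n in neighbors}   # per-link in-transit messages
        self.token_from = set()                        # neighbors whose token has arrived

    def snapshot_taken(self):
        return self.recorded_state is not None

    def take_snapshot(self, local_state, send):
        # Entered spontaneously, or on receipt of the first flagged token.
        if not self.snapshot_taken():
            self.recorded_state = local_state
            for n in self.neighbors:
                send(n, ("token", self.id))            # the flagged token

    def on_token(self, sender, local_state, send):
        self.take_snapshot(local_state, send)
        self.token_from.add(sender)
        # When a token has arrived on every link, every message in transit
        # toward this node has been received and recorded.
        return self.token_from == self.neighbors

    def on_message(self, sender, payload, local_state, send):
        # An ordinary message. If it arrives after our snapshot but before the
        # sender's token, it was sent before the sender's snapshot: in transit.
        if self.snapshot_taken() and sender not in self.token_from:
            self.in_transit[sender].append(payload)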
68SNAPSHOT PROTOCOL IS CORRECT
- Remember that we must prevent a node i from
receiving, before its snapshot time ti, a message
that was sent by a node j after its snapshot
time tj.
- But any message sent after the snapshot will
follow the flagged token at the receiving site
because of the FIFO discipline on links. So the bad
case cannot happen.
69SYNCHRONIZER
- It is often much easier to design a distributed
protocol when the underlying system is
synchronous.
- In synchronous systems, computation proceeds in
rounds. Messages are sent at the beginning of
the round and arrive before the end of the round.
The beginning of each round is determined by a
global clock.
- A synchronizer enables a protocol designed for a
synchronous system to run on an asynchronous one.
70PROTOCOL FOR SYNCHRONIZER
- The round manager broadcasts "round n begins". Each
node transmits the messages of the round.
- Each node then sends its flagged token and records
that time as the snapshot time. (Snapshot tokens
are numbered to distinguish the different
rounds.)
- Each node receives messages along each link until
it receives the flagged token.
- Nodes report non-interfering termination back to the
manager after they have received a token from all
neighbors.
71MINIMUM-HOP PATHS
- The task is to obtain the paths with the smallest
number of links from a given node s to each other
node in the network.
- Suppose the network is synchronous. In the first
round, s sends to its neighbors. In the second
round, the neighbors send to their neighbors. And
so it continues.
- When node i receives the id s for the first time, it
designates the link on which s arrived as the
first link on the shortest path to s.
- Use the synchronization protocol to simulate
rounds.
(Figure: the network and one possible result.)
72MINIMUM-HOP PATHS
- The round-by-round approach takes a long time.
Can you think of an asynchronous approach that
takes less time?
- How do you know when you're done?
(Figure: the network and one possible result.)
73CLUSTERING
- K-means clustering (a sketch follows below).
- Choose k centroids from the set of points at
random.
- Then assign each point to the cluster of the
nearest centroid.
- Then recompute the centroid of each cluster and
start over.
- Why does this converge?
- A Lyapunov function is a function of the state of
an algorithm that decreases whenever the state
changes and that is bounded from below.
- With sequential k-means, the sum of the distances
always decreases.
- Can't get lower than zero.
74WHY DOES DISTANCE DECREASE?
- Well, when you readjust the mean, the sum of distances decreases
for that cluster.
- When you reassign, every distance gets smaller
or stays the same.
- So every step decreases the total distance.
- How do we do this for a distributed asynchronous
system?
- What if you have rounds?
- What if you don't?
75SETI-at-Home Style Projects?
- SETI stands for the search for extra-terrestrial
intelligence. It consists of testing radio signal
receptions for some regularity.
- Ideal distributed system project: the master sends
out work. Servers do the work.
- Servers may crash. What to do?
- Servers may be dishonest. What to do?
76BROADCAST PROTOCOLS
- Often it is important to send a message to a
group of processes in an all-or-nothing manner.
That is, either all non-failing processes should
receive the message or none should.
- This is called atomic broadcast.
- Assumptions
- fail-stop processes
- messages from one process to another are received
in the order they are sent
77ATOMIC (UNORDERED) BROADCAST PROTOCOL
- Application: update all copies of a replicated
data item.
- Initiator: send message m to all destination
processes.
- Destination process: when receiving m for the
first time, send it to all other destinations.
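A sketch of this relay rule; send(destination, message) is assumed to come from the network layer, and msg_id is a unique identifier the initiator attaches so that duplicates can be recognized.

class BroadcastNode:
    # Relay on first receipt, so the broadcast is all-or-nothing as long as
    # at least one receiving process survives long enough to relay.
    def __init__(self, node_id, all_ids, send):
        self.id = node_id
        self.others = [p for p in all_ids if p != node_id]
        self.send = send                         # send(dest, msg) from the network layer
        self.seen = set()

    def broadcast(self, msg_id, payload):        # initiator
        self.seen.add(msg_id)
        for p in self.others:
            self.send(p, (msg_id, payload, self.id))

    def on_receive(self, msg_id, payload, origin):
        if msg_id in self.seen:                  # duplicate: ignore
            return
        self.seen.add(msg_id)
        for p in self.others:                    # relay on first receipt
            if p != origin:
                self.send(p, (msg_id, payload, origin))
        self.deliver(payload)

    def deliver(self, payload):
        print(f"{self.id} delivers {payload}")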
78Fault-Tolerant Broadcasts
- Reference: A Modular Approach to Fault-Tolerant
Broadcasts and Related Problems, Vassos
Hadzilacos and Sam Toueg.
- Describes reliable broadcast, FIFO broadcast,
causal broadcast, and ordered broadcast.
79Stronger Broadcasts
- FIFO broadcast: reliable broadcast that
guarantees that messages broadcast by the same
sender are received in the order they were
broadcast.
- A bit more precisely: if a process broadcasts a
message m before it broadcasts a message m', then
no correct process accepts m' unless it has
previously accepted m. (A process might buffer a message
before accepting it.)
80Problems with FIFO
- Network news application, where users distribute
their articles with FIFO broadcast. User A
broadcasts an article.
- User B, at a different site, accepts that article
and broadcasts a response that can only be
understood by a user who has already seen the
original article.
- User C accepts B's response before accepting the
original article from A and so misinterprets the
response.
81Causal Broadcast
- Causal broadcast: if the broadcast of m causally
precedes the broadcast of m' (in the sense of
Lamport ordering), then m must be accepted
everywhere before m'.
- Does this solve the previous problem?
82Problems with Causal Broadcast
- Consider a replicated database with two copies of
a bank account x residing at different sites.
Initially, x has a value of 100. A user deposits
20, triggering a broadcast of "add 20 to x" to
the two copies of x.
- At the same time, at a different site, the bank
initiates a broadcast of "add 10 percent interest
to x". The two broadcasts are not causally related, so causal broadcast
allows the two copies of x to accept these update
messages in different orders.
83THE NEED FOR ORDERED BROADCAST
- In the causal protocol, it is possible for the
messages of two updaters at different sites to be
accepted in different orders at the various
processes, so the sequence won't be consistent.
84Total Order (Atomic Broadcast)
- If correct processes p and q both accept messages
m and m', then p accepts m before m' if and only
if q accepts m before m'.
85DALE SKEEN'S ORDERED BROADCAST PROTOCOL
- The idea is to assign each broadcast a global logical
timestamp and deliver messages in the order of
timestamps.
- As before, the initiator sends message m to all
receiving processes (maybe not all processes in the system).
- Each receiver process marks m as undelivered (keeps m
in a buffer) and sends back a proposed timestamp that is
larger than any timestamp that the site has
already proposed or received.
- Timestamps are made unique by attaching the
site's identifier as low-order bits. Time
advances at each process based on Lamport clocks.
86SKEEN'S ORDERED BROADCAST PROTOCOL (cont.)
Initiator: send message m.
Receivers: propose a timestamp (e.g., 17), based on local
Lamport time.
Initiator: take the max of the proposals (e.g., 29) and send it back as the
final timestamp.
Receivers: forget the proposed timestamp for m. Wait until the final
timestamp for m is the minimum of all proposed or final
timestamps in the buffer. Accept m. Forget the timestamp for m.
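A condensed sketch of this exchange. A send(destination, message) callback from a reliable FIFO network layer is assumed, and timestamps are (counter, site_id) pairs so they are unique and totally ordered; the class and message names are illustrative.

class SkeenProcess:
    def __init__(self, pid, send):
        self.pid = pid
        self.send = send
        self.counter = 0
        self.pending = {}        # msg_id -> (timestamp, payload, is_final)
        self.proposals = {}      # msg_id -> (expected count, proposals, destinations)

    def _tick(self, seen=0):
        self.counter = max(self.counter, seen) + 1
        return (self.counter, self.pid)

    def broadcast(self, msg_id, payload, destinations):      # initiator
        self.proposals[msg_id] = (len(destinations), [], destinations)
        for d in destinations:
            self.send(d, ("msg", msg_id, payload, self.pid))

    def on_msg(self, msg_id, payload, initiator):             # receiver
        ts = self._tick()
        self.pending[msg_id] = (ts, payload, False)            # buffered, not yet final
        self.send(initiator, ("propose", msg_id, ts, self.pid))

    def on_propose(self, msg_id, ts, sender):                  # initiator
        expected, proposals, destinations = self.proposals[msg_id]
        proposals.append(ts)
        self._tick(ts[0])
        if len(proposals) == expected:                          # final timestamp = max proposal
            final = max(proposals)
            for d in destinations:
                self.send(d, ("final", msg_id, final))

    def on_final(self, msg_id, final_ts):                       # receiver
        _, payload, _ = self.pending[msg_id]
        self.pending[msg_id] = (final_ts, payload, True)
        self._tick(final_ts[0])
        self._deliver_ready()

    def _deliver_ready(self):
        # Accept a message once its final timestamp is the minimum of all
        # proposed or final timestamps still pending.
        while self.pending:
            msg_id, (ts, payload, is_final) = min(self.pending.items(),
                                                  key=lambda kv: kv[1][0])
            if not is_final:
                return
            del self.pending[msg_id]
            print(f"{self.pid} accepts {payload} with timestamp {ts}")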
87CORRECTNESS
- Theorem: m and m' will be accepted in the same order
at all common sites.
- Proof steps
- Every two final timestamps will be different.
- If TS(m) < TS(m'), then any proposed timestamp for
m is at most TS(m), where TS(m) is the final
timestamp for m, and hence less than TS(m').
88QUESTIONS TO ENSURE UNDERSTANDING
- Find an example showing that changing the Skeen
protocol in any one of the following ways would
yield an incorrect protocol.
- The timestamps at different sites could be the
same.
- The initiator chooses the minimum (instead of the
maximum) proposed timestamp as the final
timestamp.
- Sites accept messages as soon as they become
deliverable.
89ORDER-PRESERVING BROADCAST PROTOCOLS ON BROADCAST
NET
Framework: Chang, Jo-Mei. Simplifying
Distributed Database Systems Design by Using a
Broadcast Network, ACM SIGMOD, June 1984.
- Proposes a virtual distributed system that
implements ordered atomic broadcast and failure
detection.
- Shows that this makes designing the rest of the
system easier.
- Shows that implementing these two primitives
isn't so hard.
- Paradigm: find an appropriate intermediate level
of abstraction that can be implemented and that
facilitates the higher functions.
- Build facilities that use the broadcast network.
- Implement the atomic broadcast network.
90RATIONALE
- Use a property of current networks, which are
naturally broadcast, although not so reliable.
- Common tasks of distributed systems: send the same
information to many sites participating in a
transaction (update all copies); reach agreement
(e.g., transaction commitment).
91DESCRIPTION OF ABSTRACT MACHINE
- Services and assurances it provides
- Atomic broadcast with failure atomicity: if a message
is received by an application program at one
site, it will be received at all operational
sites.
- System-wide clock, and all messages are
timestamped in sequence. This is the effective
message order.
- Assumptions: failures are fail-stop, not
malicious. So, for example, the token site will not
lie about messages or sequence numbers.
- Network failures require extra memory.
92CHANG SCHEME
- Tools: a token-passing scheme, positive
acknowledgments, and negative acknowledgements.
93BEAUTY OF NEGATIVE ACKNOWLEDGMENT
- How does a site discover that it hasn't received
a message?
- A non-token site knows that it has missed a message
if there is a gap in the counter values that it
has received. In that case, it requests that
information from the token site (negative ack).
- Overhead: one positive acknowledgment per
broadcast message vs. one acknowledgment per site
per message in the naïve implementation.
94TOKEN TRANSFER
- Token transfer is a standard message. The target
site must acknowledge it. To become the token site,
the target site must guarantee that it has
received all messages since the last time it was
the token site.
- Failure at a non-token site is detected when it fails
to accept token responsibility.
95REVISIT ASSUMPTIONS
- Sites do not lie about their state (i.e., no
malicious sites; could use authentication).
- Sites tell you when they fail (e.g., through
redundant circuitry) or by not responding.
- If there is a network partition, then no negative
ack would occur, so the token site must keep message m around
until everyone has acquired the token after m was
sent.
96ROAD MAP COMMIT PROTOCOLS
97THE NEED
- Scenario: a transaction manager (representing the user)
communicates with several database servers.
- The main problem is to make the commit atomic (i.e.,
either all sites commit the transaction or none
do).
98NAÏVE (INCORRECT) ALGORITHM
- RESULT INCONSISTENT STATE
99TWO-PHASE COMMIT PHASE 1
- The transaction manager asks all servers whether they
can commit.
- Upon receipt, each able server saves all updates
to stable storage and responds yes.
- If a server cannot say yes (e.g., because of a
concurrency control problem), then it says no. In
that case, it can immediately forget the
transaction. The transaction manager will abort the
transaction at all sites.
100TWO-PHASE COMMIT PHASE 2
- If all servers say yes, then the transaction manager
writes a commit record to stable storage and
tells them all to commit; but if some say no or
don't respond, the transaction manager tells them all
to abort.
- Upon receipt, the server writes the commit record
and then sends an acknowledgement. The
transaction manager is done when it receives all
acknowledgements.
- If a database server fails during the first step, all
abort.
- If a database server fails during the second step, it
can consult the transaction manager to see
whether it should commit.
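A coordinator-side sketch of both phases under the assumptions above. The server objects, their prepare/commit/abort methods, and write_log are hypothetical stand-ins (prepare corresponds to saving updates to stable storage and voting yes or no).

def two_phase_commit(servers, transaction):
    votes = []
    for s in servers:                       # Phase 1: ask whether each server can commit
        try:
            votes.append(s.prepare(transaction))
        except Exception:                   # no response counts as a "no"
            votes.append(False)
    if all(votes):
        write_log("commit", transaction)    # commit record forced to stable storage first
        for s in servers:                   # Phase 2: tell everyone to commit
            s.commit(transaction)
        return "committed"
    else:
        write_log("abort", transaction)
        for s in servers:
            s.abort(transaction)
        return "aborted"

def write_log(decision, transaction):
    # Stand-in for a forced write to stable storage.
    with open("tm.log", "a") as f:
        f.write(f"{decision} {transaction}\n")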
101ALL OF TWO-PHASE COMMIT
States of server
102QUESTIONS AND ANSWERS
- Q: What happens if the transaction manager fails?
- A: A database server that said yes in the first
phase but has received neither a commit nor an abort
instruction must wait until the transaction
manager recovers. It is said to be blocked.
- Q: How does a recovering transaction manager know
whether it committed a given transaction before
failing?
- A: The transaction manager must write a "commit T"
record to stable storage after it receives yes votes
from all database servers on behalf of T and
before it sends any commit messages to them.
- Q: Is there any way to avoid having a database
server block when the transaction manager fails?
- A: A database server may consult other database
servers that have participated in the transaction,
if it knows who they are.
103OPTIMIZATION FOR READ-ONLY TRANSACTIONS
- Read-only transactions.
- Suppose a given server has done only reads (no
updates) for a transaction.
- Instead of responding to the transaction manager
that it can commit, it responds READ-only.
- The transaction manager can thereby avoid sending
that server a commit message.
104THREE-PHASE COMMIT
- A non-blocking protocol, assuming that
- a process fail-stops and does not recover during
the protocol,
- the network delivers messages from A to B in the
order they were sent, and
- live processes respond within the timeout period.
- Non-blocking: surviving servers can decide what
to do.
105PROTOCOL
106STATES OF SERVER ASSUMING FIRST TM DOES NOT FAIL
107INVARIANTS while first TM active
- No server can be in the willing state while any
other server (live or failed) is in the committed
state.
- No server can be in the aborted state while any
other server (live or failed) is in the
ready-to-commit state.
108CONTRAST WITH TWO-PHASE COMMIT
109RECOVERY IN THREE-PHASE COMMIT: after first TM
fails or slows down too much
- What the newly elected TM does
110ROAD MAP KNOWLEDGE LOGIC AND CONSENSUS
111EXAMPLE COORDINATED ATTACK
- Forget about computers. Think about a pair of
allied generals A and B.
- They have previously agreed to attack
simultaneously or not at all. Now, they can only
communicate via carrier pigeon (or some other
unreliable medium).
- Suppose general A sends the message to B:
- "Attack at Dawn"
- Now, general A won't attack alone. A doesn't know
whether B has received the message. B understands
A's predicament, so B sends an acknowledgment:
- "Agreed"
112WILL IT EVER END?
113IT NEVER ENDS
- Theorem: assume that communication is unreliable.
Any protocol that guarantees that if one of the
generals attacks, then the other does so at the
same time, is a protocol in which necessarily
neither general attacks.
- Have you ever had this problem when making an
appointment by electronic mail?
114BACK TO COMPUTERS
- While ostensibly about military matters, the Two
Generals problem and the Byzantine Agreement
problem should remind you of the commit problem.
- In all three problems, there are two
possibilities: commit (attack) and abort (don't
attack).
- In all three problems, all sites (generals) must
agree.
- In all three problems, always aborting (not
attacking) is not an interesting solution.
- The theorem shows that no non-blocking commit
protocol is possible when the network can drop
messages.
- Corollary: if the decision must be made within a
fixed time period, then unbounded network delays
prevent the sites from ever committing.
115BASIC MODEL FOR KNOWLEDGE LOGIC
- Each processor is in some local state. That is,
it knows some things.
- The global state is just the set of all local
states.
- Two global states are indistinguishable to a
processor if the processor has the same local
state in both global states.
116SOME USEFUL NOTATION FOR SUCH PROBLEMS
- Ki: agent i knows.
- CG: common knowledge among group G.
- A statement x is common knowledge if
- Every agent knows x: ∀ i, Ki x.
- Every agent knows that every other agent knows x:
- ∀ i ∀ j, Ki Kj x.
- Every agent knows that every other agent knows
that every other agent knows x,
and so on.
117EXAMPLES
- In the coordinated attack problem, when A sends his
message:
- KA (A says attack at dawn)
- When B receives that, then
- KB KA (A says attack at dawn)
- However, it is false that KA KB KA (A says attack
at dawn).
- This is remedied when A receives the first
acknowledgment, at which point
- KA KB KA (A says attack at dawn)
- However, it is false that
- KB KA KB KA (A says attack at dawn)
- More knowledge, but never common knowledge.
118EXAMPLE RELIABLE AND BOUNDED TIME COMMUNICATION
- If A knows that B will receive any message that A
sends within one minute of A's sending it, then
if A sends
- "Attack at dawn"
- A knows that within two minutes
- C{A,B} (A says attack at dawn)
119CONCLUSIONS
- Common knowledge is unattainable in systems with
unreliable communication (or with unbounded
delay).
- Common knowledge is attainable in systems with
reliable communication in bounded time.
120ROAD MAP KNOWLEDGE LOGIC AND TRANSMISSION
121APPLYING KNOWLEDGE TO SEQUENCE TRANSMISSION
PROTOCOLS
- Problem: the two processes are the sender and the
receiver.
- Sender S has an input tape with an infinite
sequence of data elements (0, 1, blank). S tries
to transmit these to receiver R. R writes them
onto the output tape.
- Correctness: the output tape should contain a prefix
of the input tape even in the face of errors (safety
condition).
- Given a sufficiently long correct transmission,
the output tape should make progress (liveness
condition).
122MODEL
- Messages are kept in order.
- Sender and receiver are synchronous. This implies
that sending a blank conveys information.
- Three possible types of errors:
- Deletion errors: either a 0 or a 1 is sent, but a
blank is received.
- Mutation errors: a 0 (resp. 1) is sent, but a 1
(resp. 0) is received. Blanks are transmitted
correctly.
- Insertion errors: a blank is sent, but a 0 or 1
is received.
- Question: can we handle all three error types?
123POSSIBLE ERROR TYPES
- If all error types are present, then a sent
sequence can be transformed into any other sequence
of the same length. So receiver R can gain no
information about messages that sender S actually
transmitted.
- For any two of the three, the problem is
solvable.
- To show this, we will extend the transmission
alphabet to consist of blank, 0, 1, ack, ack2,
ack3.
- Eventually, we will encode these into 0s, 1s, and
blanks.
124ERROR TYPE DELETION ALONE
- So a 1, 0, or any acknowledgement can become a
blank.
- Suppose the input for S is 0, 0, 1.
- For any symbol y, we want to achieve that the
sender knows that the receiver has received
(knows) symbol y.
- Denote this Ks Kr (y).
- Imagine the following protocol: if S doesn't
receive an acknowledgement, then it resends the
symbol it just sent. If S receives an
acknowledgement, S sends the next symbol on its
tape.
- Scenario: S sends y, R sends ack, S sends the next
symbol y'.
- Is there a problem?
125GOAL OF PROTOCOL
- Yes, there is a problem. Look at this from R's
point of view. It may be that y' = y.
- R doesn't know whether S is resending y (because
it didn't receive R's acknowledgement) or S is
sending a new symbol.
- So, R needs more knowledge. Specifically, R must
know that S received its acknowledgement. S must
know that R knows this.
- We need Ks Kr Ks Kr (y). To get this, S sends ack2
to R. Then R sends ack3 to S.
126EXERCISE
- Suppose that the symbol after y is y' and y' ≠ y.
- Then can S send y' as soon as it receives the ack to
y? (Assume R has a way of knowing that it
received y and y' correctly.)
- S sends y
- R sends ack
- S sends y'
- R sends ack
127ENCODING PROTOCOL IN 0s and 1s
128WHAT DO WE WANT FROM AN ENCODING?
- Unique decodability: if e(x) is received
uncorrupted, then the recipient knows that it is
uncorrupted and is an encoding of x.
- Corruption detectability: if e(x) is corrupted,
the recipient knows that it is.
- Thus, the receiver knows when it receives good data
and when it receives a garbled message.
129ENCODING FOR DELETIONS AND MUTATIONS
- Recall that mutation means that a 0 can become a
1 or vice versa.
- Encoding (b is blank):
The same extended-alphabet protocol will
work. Any insertion will result in two non-blank
characters. A mutation can only change a 1 to a
0.
130Self-Stabilizing Systems
- A distributed system is self-stabilizing if, when
started from an arbitrary initial configuration,
it is guaranteed to reach a legitimate
configuration as execution progresses, and once a
legitimate configuration is achieved, all
subsequent configurations remain legitimate.
131Self-Stabilizing Systems (using invariants)
- There is an invariant I which implies a safety
condition S.
- When failures occur, S is maintained though I may
not be.
- However, when the failures go away, I returns.
- http://theory.lcs.mit.edu/classes/6.895/fall02/papers/Arora/masking.pdf
132Self-Stabilizing Systems: components
- A corrector returns a program from a state satisfying S
to one satisfying I, e.g., error correction codes, exception handlers,
database recovery.
- A detector sees whether there is a problem, e.g.,
acceptance tests, watchdog programs, parity ...
133Self-Stabilizing Systems: example
- Error model: messages may be dropped.
- Message-sending protocol called the alternating
bit protocol, which we explain in stages.
- The sender sends a message; the receiver acknowledges if the
message is received and uncorrupted (can use a
checksum).
- The sender sends the next message.
134Alternating Bit Protocol continued
- If the sender receives no ack, then it resends.
- But what if the receiver has received the message
and the ack got lost?
- In that case, the receiver thinks of the resent message as a
new message.
135Alternating Bit Protocol -- we've arrived.
- Solution 1: send a sequence number with the
message so the receiver knows whether a message is
new or old.
- But: the size of this number increases as the log of the
number of messages.
- Better: send the parity of the sequence number.
This is the alternating bit protocol (a sketch follows below).
- Invariant: the output equals what was sent, perhaps
without the last message.
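A sketch of both ends over a lossy channel. channel_send and channel_recv are assumed helpers, with channel_recv returning None on a timeout (a lost message or a lost ack), and deliver is the receiver's output callback.

def sender(data, channel_send, channel_recv):
    bit = 0
    for item in data:
        while True:
            channel_send(("data", bit, item))
            ack = channel_recv()
            if ack == ("ack", bit):          # acknowledged: flip the bit, move on
                break                        # otherwise resend the same message
        bit ^= 1

def receiver(channel_send, channel_recv, deliver):
    expected = 0
    while True:
        msg = channel_recv()
        if msg is None:                      # timeout: keep waiting
            continue
        _, bit, item = msg
        if bit == expected:                  # new message: deliver it exactly once
            deliver(item)
            expected ^= 1
        channel_send(("ack", bit))           # ack both new and duplicate messages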
136Why is this Self-Stabilizing?
- Safety: the output is a prefix of what was sent, even
in the face of failures (provided checksums are
sufficient to detect corruption).
- The invariant (the output equals what was sent, perhaps
without the last message) is a strong liveness
guarantee.
137ROAD MAP TECHNIQUES FOR REAL-TIME SYSTEMS
138SYSTEMS THAT CANNOT OR SHOULD NOT WAIT
- Time-sharing operating environments: concern for
throughput.
- Want to satisfy as many users as possible.
- Soft real-time systems (e.g., telemarketing):
concern for statistics of response time.
- Want few disgruntled customers.
- Firm real-time systems (e.g., obtain ticker
information on Wall Street): concern to meet as
many deadlines as possible.
- If you miss, you lose the deal.
- Hard real-time systems (e.g., airplane
controllers): requirement to meet all deadlines.
- If you miss, then the airplane may crash.
139DISTINCTIVE CHARACTERISTICS OF REAL-TIME SYSTEMS
- Predictability is essential: for hard real-time
systems, the time to run a routine must be known
in advance.
- Implication: much programming is done in
assembly language. Changes are done by patching
machine code.
- Fairness is considered harmful: we do not want
an ambulance to wait for a taxi.
- Implication: messages must be prioritized; FIFO
queues are bad.
- Preemptibility is essential: an emergency
condition must be able to override a low-priority
task immediately.
- Implication: task switching must be fast, so
processes must reside in memory.
- Scheduling is of major concern: the time budget
of an application is as important as its monetary
budget. Meeting time constraints is more than
just a matter of fast hardware.
- Implication: we must look at the approaches to
scheduling.
140SCHEDULING APPROACHES
- Cyclic executive: divide processor time into
endlessly repeating cycles, where each cycle is
some fixed length, say 1 second. During a cycle,
some periodic tasks may occur several times,
others only once. Gaps allow sporadic tasks to
enter.
- Rate-monotonic: give tasks priority based on the
frequency with which they are requested.
- Earliest-deadline-first: give the highest
priority to the task with the earliest deadline.
141CYCLIC EXECUTIVE STRATEGY
- A cycle design contains sub-intervals of
different lengths. During a sub-interval, either
a periodic task runs or a gap is permitted for
sporadic tasks to run.
- Note that task T1 runs three times during each
cycle. In general, different periodic tasks may
have to run with different frequencies.
142RATE MONOTONIC ASSIGNMENT
Three tasks
- Rate monotonic would say that T1 should get the
highest priority (because its period is smallest
and its rate is highest), then T2, then T3.
- Assume that all tasks are perfectly preemptable.
- As the following figure shows, all tasks meet
their deadlines. What happens if T3 is given the
highest priority because it is the most
important?
143EXAMPLE OF RATE MONOTONIC SCHEDULING
(Timeline figure: time axis from 0 to 500 in steps of 50.)
- Use of a rate monotonic scheduler (higher rate gets
higher priority) ensures that all tasks complete
by their deadlines.
- Notice that T3 completes earlier in its cycle the
second time, indicating that the
most-difficult-to-meet situation is the very
first one.
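A small preemptive simulation that checks deadlines directly. The task set at the bottom is hypothetical (the slide's original table of tasks is not reproduced in this transcript); each task is a (period, execution_time) pair with its deadline equal to its period.

def rate_monotonic_simulation(tasks, horizon):
    # Highest priority = smallest period; time advances one unit at a time.
    tasks = sorted(tasks)                       # sort by period: RM priority order
    remaining = [0] * len(tasks)                # work left in the current release
    next_release = [0] * len(tasks)
    for t in range(horizon):
        for i, (period, exec_time) in enumerate(tasks):
            if t == next_release[i]:
                if remaining[i] > 0:            # previous instance still unfinished
                    return f"task {i} missed its deadline at t={t}"
                remaining[i] = exec_time
                next_release[i] += period
        for i in range(len(tasks)):             # run the highest-priority ready task
            if remaining[i] > 0:
                remaining[i] -= 1
                break
    return "no deadline missed"

# Hypothetical three-task set: T1=(50,10), T2=(150,30), T3=(250,100).
print(rate_monotonic_simulation([(50, 10), (150, 30), (250, 100)], 750))

Giving the non-preemptable 100-unit task the highest priority instead would block the short-period task for 100 time units, which is the kind of failure the case study below describes.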
144CASE STUDY
- A group is designing a command and control
system.
- Interrupts arrive at different rates; however, the
maximum rate of each interrupt is predictable.
- The computation time of the task associated with each
interrupt is predictable.
- The first implementation uses Ada and a special-purpose
operating system. The operating system
handled interrupts in a round-robin fashion.
- That is, first the OS checked for interrupts for
task 1, then task 2, and so on.
- The system did not meet its deadlines, yet was
grossly underutilized (about 50%).
145FIRST DECISION
- Management decided that the problem was Ada.
- Do you think they were right? (Assume that they
could have shortened each task by 10% and that
the tasks and times are those of the three-task system
given previously.)
146CASE STUDY SOLUTIONS
- Switching from Ada probably would not have
helped.
- Consider using round-robin for the three-task
system given before. If task T3 is allowed to run
to completion, then it will prevent task T1 from
running for 100 time units (or 90 with the time
improvement). That is not fast enough.
- Change the scheduler to give priority to the task with
the smallest period, but tasks remain
non-preemptable.
- This helps, but not enough, since the T3-T1 conflict
would still prevent T1 from completing.
- Change the tasks so that longer tasks are preemptable.
- This would solve the problem in combination with
rate monotonic priority assignment. (Show this.)
- Motto: look first at the scheduler.
147PRIORITY INVERSION AND PRIORITY INHERITANCE
148SPECIAL CONSIDERATIONS FOR DISTRIBUTED SYSTEMS
- Since communication is unpredictable, most
distributed non-shared-memory real-time systems
do no dynamic task allocation. Tasks are
pre-allocated to specific processors.
- Example: oil refineries, where each chemical
process is controlled by a separate computer.
- Message exchange is limited to communicating data
(e.g., in sensor applications) or status (e.g.,
time-out messages). Messages must be prioritized,
and some messages should be datagrams.
- Example: a command and control system has messages
that take priority over all other messages, e.g.,
"hostilities have begun".
- Special processor architectures are possible that
implement a global clock (hence require real-time
clock synchronization) and guaranteed message
deliveries.
- Example application: airplane control with a
token-passing network.
149OPEN PROBLEMS
- The major open problem is to combine real-time
algorithms with other needs, e.g., high-performance
network protocols and distributed
database technology.
- What is the place of carrier-sense detection
circuits in a real-time system?
- If exponential back-off is used, then no
guarantee is possible. (See text.)
- However, a tree-based conflict protocol, e.g.,
based on a site's identifier, can guarantee
message transmission.
- How should deadlocks be handled in a real-time
transaction system?
- Aborting an arbitrary transaction is
unacceptable.
- Aborting a low-priority transaction may be
acceptable.
150COMPONENTS OF SECURITY
- Authentication: proving that you are who you say
you are.
- Access Rights: giving you the information for
which you have clearance.
- Integrity: protecting information from
unauthorized exposure.
- Prevention of Subversion: guarding against replay
attacks, Trojan horse attacks, and covert channel
attacks.
151AUTHENTICATION AND ZERO KNOWLEDGE PROOFS
- The parable of the Amazing Sand Counter.
- Person S makes the following claim:
- "You fill a bucket with sand. I can tell, just by
looking at it, how many grains of sand there are.
However, I won't tell you."
- "You may test me, if you like, but I won't answer
any question that will teach you anything about
the number of grains in the bucket."
- "The test may include your asking me to leave the
room."
- What do you do?
152SAND MAGIC
- The Amazing Sand Counter claims to know how many
grains of sand there are in a bucket just by
looking at it.
- How can you put him to the test?
153AUTHENTICATING THE AMAZING SAND COUNTER
- Answer
- The tester tells S to leave the room.
- Tester T removes a few grains from the bucket and
counts them, then keeps them in T's pocket.
- T asks S to return and say how many grains have
been removed.
- T repeats until convinced or until T shows that S
lies.
- Are there any problems left? Can the tester use
the Amazing Sand Counter's knowledge to
masquerade as the Amazing Sand Counter?
154MIGHT A COUNTERFEIT AMAZING SAND COUNTER SUCCEED?
- Can the tester use the Amazing Sand Counter's
knowledge to masquerade as the Amazing Sand Counter?