Title: Interprocess Communication and Coordination II
1. Inter-process Communication and Coordination (II)
2. Outline
- Message Passing Communication
- Request/Reply Communication (Remote Procedure Call)
- Transaction Communication
- Name and Directory Services
- Distributed Mutual Exclusion
- Leader Election
3. Transaction Communication
4. Introduction
- Transaction communication: service-oriented request/reply communication and multicast
- Transaction: the fundamental unit of interaction between client and server processes in a database system
- Database transaction: a sequence of synchronous request/reply operations that satisfies the ACID properties
- Transaction communication: a set of asynchronous request/reply communications that have the ACID properties but lack the sequential constraint of the operations in a database transaction
- Multicast of the same message to replicated servers
5. The Transaction Model
- A transaction to reserve three flights commits
- The transaction aborts when the third flight is unavailable
6. The ACID Properties
- Achieve concurrency transparency
- Allow sharing of objects without interference
- The execution of a transaction appears to take place in a critical section; however, operations from different transactions may be interleaved in some safe way to achieve more concurrency
- ACID tries to achieve consistency
(figure: execution without interleaving vs. with interleaving)
7. The ACID Properties (Cont.)
- Atomicity: consistency of replicated or partitioned objects
  - Either all of the operations in a transaction are performed or none of them are, in spite of failures
- Consistency (serializability)
  - The execution of interleaved transactions is equivalent to a serial execution of the transactions in some order
- Isolation (do not see something that has never occurred)
  - Partial results of an incomplete transaction are not visible to others before the transaction is successfully committed
- Durability (see something that has actually occurred)
  - The system guarantees that the results of a committed transaction will be made permanent, even if a failure occurs after the commitment
8. Two-Phase Commit Atomic Transaction Protocol
- Coordinator: the processor that initiates the transaction
- Participants: all the remaining processors
- Unanimous voting scheme → atomicity
- Voting is initiated by the coordinator. All participants must come to an agreement about whether to commit or abort the transaction and must wait for the announcement of the decision
- Before a participant can vote to commit, it must be prepared to perform the commit
- A transaction is committed only if all participants agree and are ready to commit
9. Two-Phase Commit Atomic Transaction Protocol (Cont.)
- Each participant and the coordinator maintains a private workspace for keeping track of updated data objects
  - Each update contains the old and new values of a data object
- Updates will not be made permanent until the transaction is finally committed → isolation
- To cope with failures → flush the updates to stable storage
  - Durability and failure recovery
- Two synchronization points: pre-commit and commit
  - Write and flush update logs, and then pre-commit or commit
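The unanimous-voting rule can be sketched as follows. This is a minimal illustration (function and message names are my own, not from the protocol description), which ignores logging and real messaging:

```python
# Minimal sketch of the 2PC decision rule: the coordinator reaches
# GLOBAL_COMMIT only on a unanimous set of VOTE_COMMIT votes; a missing vote
# (a participant that timed out) is treated like a VOTE_ABORT.

def coordinator_decision(votes, num_participants):
    """votes: list of vote strings received before the timeout."""
    if len(votes) < num_participants:
        return "GLOBAL_ABORT"            # a participant did not answer in time
    if all(v == "VOTE_COMMIT" for v in votes):
        return "GLOBAL_COMMIT"
    return "GLOBAL_ABORT"                # at least one VOTE_ABORT
```

For example, three VOTE_COMMIT votes out of three participants yield GLOBAL_COMMIT, while any abort vote or missing reply forces GLOBAL_ABORT.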
10. Two-Phase Commit Atomic Transaction Protocol (Cont.)
11. Two-Phase Commit Atomic Transaction Protocol (Cont.)
(figure: failures and recovery actions for the 2PC protocol)
12. Two-Phase Commit Atomic Transaction Protocol (Cont.)
(figures: finite state machine for the coordinator; finite state machine for the participant)
13. Two-Phase Commit Atomic Transaction Protocol (Cont.)
- The protocol can easily fail when a process crashes, since other processes may wait indefinitely for a message from that process
- A timeout can be used for detecting the failure of other processes
- Coordinator
  - Blocked in state WAIT → GLOBAL_ABORT
- Participant
  - Blocked in state INIT → VOTE_ABORT
  - Blocked in state READY → consult the state of other processes
14. Two-Phase Commit Atomic Transaction Protocol (Cont.)
- Actions taken by a participant P when residing in state READY and having contacted another participant Q
- If all processes are in state READY, block until the coordinator recovers
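The consultation step can be summarized as a small decision table. The sketch below is an illustration (the function name is mine); it encodes the standard recovery actions: if Q already committed or aborted, P does the same; if Q has not even voted, the coordinator cannot have decided commit, so P can safely abort; if Q is also in READY, P must try another participant.

```python
# Sketch of participant P's action in state READY after learning the state of
# another participant Q (state names follow the participant FSM).

def action_when_ready(q_state):
    if q_state == "COMMIT":
        return "transition to COMMIT"
    if q_state in ("INIT", "ABORT"):
        return "transition to ABORT"     # commit cannot have been decided
    if q_state == "READY":
        return "contact another participant"
    raise ValueError("unknown state: " + q_state)
```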
15. Two-Phase Commit Atomic Transaction Protocol (Cont.)
16. Two-Phase Commit Atomic Transaction Protocol (Cont.)
17. Two-Phase Commit Atomic Transaction Protocol (Cont.)
18. Two-Phase Commit Atomic Transaction Protocol (Cont.)
- 2PC is a blocking commit protocol
  - A participant may need to block until the coordinator recovers
- To avoid blocking
  - Use a multicast primitive by which a receiver immediately multicasts a received message to all other processes
  - Allow participants to reach a final decision, even if the coordinator has not yet recovered
  - Three-phase commit protocol (3PC): Section 12.1.1
19. Name and Directory Services
20. Name and Address Resolution (I)
- Name and directory services: look-up operations
  - Given the name or some attribute of an object entity, more attribute information is obtained
- Object entity: services and objects (users, computers, files)
- Name services: how a named object can be addressed and subsequently located by using its address
- Directory services
  - A special name service (e.g., the directory service of a file system)
  - All kinds of attribute look-ups on different object types, not limited to address information
- Sometimes, the terms name service and directory service are used interchangeably
21. Name and Address Resolution (II)
- Two stages of the resolution process
- Name resolution: names → logical addresses
  - Name: application-oriented denotation of an object (who it is)
  - Address: object representation carrying structural information relevant to the OS (where it can be found)
  - Example: map a server name to its port addresses
- Address resolution: logical addresses → network routes
  - An address contains intermediate object identification information between names and routes
  - Routes are the lowest level of location information
  - Example: map a server port to its Ethernet port
22. Object Attributes and Name Structure (I)
- An object entity is characterized by its attributes
  - User: affiliation attribute; file: version number, creation date
- Attributes involved in the name resolution process: name and address
- Name space: a collection of names, recognized by a name service, with their corresponding attributes and addresses
  - A name space containing different object classes → type attribute
- Name structure
  - Flat names: how to achieve a unique name
  - Names of concatenated attributes (hierarchically structured names)
  - Names of collections of attributes (structure-free names)
- Attribute partitions for an object name
  - Physical: <user.host.network> (explicit location information)
  - Organizational: <user.department.organization> (location transparent)
  - Functional: profession=<professor>, specialty=<computer science>
23. Object Attributes and Name Structure (II)
24. Name Space and Information Base
- DIB and DIT
  - Directory Information Base (DIB): conceptual data model for storing and representing object information
  - Directory Information Tree (DIT)
  - An object entity in the DIB → a node in the DIT
- Distinguished attributes: attributes for naming purposes
  - Path from object node to root
  - Suitable for both hierarchical and attribute-based resolution
- Naming domain: a subname space for which there is a single administrative authority for name management
- Naming context: a partial subtree of the DIT
  - The basic unit for distributing the DIB to DSAs
  - Directory Service Agent (DSA): a server for the name service
25. Distribution of a DIT
(figure: DSAs and their naming contexts)
26. Name Resolution Process (I)
- Name resolution process
  - Initiated by a Directory User Agent (DUA)
  - The resolution request is sent from one DSA to another until the object is found in the DIT and returned to the DUA
- Interaction modes among DSAs
  - Recursive chaining: the normal mode for structured name resolution
  - Transitive chaining
    - Fewer messages: each message carries the source DUA address
    - Violates the RPC and client/server programming paradigm
  - Referral: a DSA suggests another suitable DSA to the DUA
    - Can be used for both name and attribute-based resolutions
  - Multicast: the request is sent to multiple DSAs concurrently
    - Suitable when structural information is not available
27. Name Resolution Interaction Modes
28. Name Resolution Process (II)
- Techniques for enhancing name resolution performance
  - Caching: recently used names and their addresses are kept in a cache
  - Replication: naming contexts can be duplicated in different DSAs
  - Both raise inconsistency issues
- Object entries in a directory may be an alias or a group
  - These are pointers to other object names and are leaf nodes in the DIT
29. The DNS Name Space
- DNS: the Internet Domain Name Service
  - Look up host addresses and mail servers
- DNS name space: a hierarchical rooted tree
  - Root<nl, vu, cs, flits> → flits.cs.vu.nl.
- Domain (naming context): a subtree; domain name: the path name to its root node
- Node content: a collection of resource records
- Zone: an administrative unit (naming domain)
  - Each zone is implemented by a primary name server and possibly other secondary name servers
  - Zone updates are done locally in the DNS definition files on the primary name server; secondary name servers request the primary server to transfer its content
30. Types of Resource Records in DNS
31. A Simple Example for DNS
32. Part of the description for the vu.nl domain, which contains the cs.vu.nl domain
33. OSI X.500
- DNS: given a (hierarchically structured) name, resolves it to a node in the naming graph and returns the content of that node in the form of a resource record
  - Similar to a telephone book for looking up phone numbers
- X.500 directory service → a client can look for an entity based on a description of properties instead of a full name
  - Similar to the yellow pages
- A simplified version of X.500: the Lightweight Directory Access Protocol (LDAP)
34. The X.500 Name Space
- X.500 consists of a number of records (directory entries)
- Each record is made up of a collection of (attribute, value) pairs
  - Attribute types
  - Single-valued and multiple-valued attributes
- Directory Information Base (DIB): the collection of all directory entries in an X.500 directory service
- A unique name for each record is formed by a sequence of its naming attributes (Relative Distinguished Names, RDNs)
  - /C=NL/O=Vrije Universiteit/OU=Math. & Comp. Sc.
- Directory Information Tree (DIT): listing RDNs in sequence leads to a hierarchy of the collection of directory entries
35. A Simple Example of an X.500 Directory Entry
36. Part of the DIT
- Each node represents a directory entry and may also act as a directory in the traditional sense (it may have children)
- Use read to read a single record, given its path name in the DIT
- Use list to get the names of all the children of a node in the DIT
37. Two directory entries having Host_Name as RDN
38. X.500 Implementation
- Similar to DNS, but
  - X.500 supports more lookup operations
    - answer = search(&(C=NL)(O=Vrije Universiteit)(OU=*)(CN=Main Server))
    - But expensive: it may access many DSAs
- Directory Service Agent (DSA) ↔ name server
- Naming domain ↔ zone
- Each part of a partitioned DIT ↔ naming context ↔ domain
39. Other Issues in Name Services
- Name services for directories and files
- Locating mobile entities
- Removing unreferenced entities
40. Mutual Exclusion
- Ensure that concurrent processes make serialized accesses to shared resources or data
41. Classification
- Contention-based mutual exclusion
  - Centralized mutual exclusion
  - Distributed mutual exclusion
    - Timestamp priority schemes
    - Voting schemes (a variant of timestamp priority schemes)
- Token-based mutual exclusion
  - Ring structure
  - Tree structure
  - Broadcast structure
42. A Centralized Algorithm (I)
- Process 1 asks the coordinator for permission to enter a critical region. Permission is granted
- Process 2 then asks permission to enter the same critical region. The coordinator does not reply
- When process 1 exits the critical region, it tells the coordinator, which then replies to 2
43. A Centralized Algorithm (II)
- Characteristics
  - Guarantees mutual exclusion
  - Fair: requests are granted in the order in which they are received
    - No starvation
  - Can be used for more general resource allocation
- Drawbacks
  - Single point of failure
    - How to distinguish a dead coordinator from permission denied?
  - Performance bottleneck
44. Timestamp Prioritized Schemes
- 3(N-1) messages for the completion of a CS
- Lamport's logical clocks totally order the requests for entering the CS
- A process requests the CS by broadcasting a timestamped REQUEST message to all other processes (including itself)
- Each process maintains a queue of pending REQUEST messages arranged in ascending timestamp order
- On receiving a REQUEST message, a process inserts the message into its queue and sends a REPLY message to the requesting process
- A process is allowed to enter its CS only if it has gathered all the REPLY messages and its request message is at the top of the request queue
45. Timestamp Prioritized Schemes (Cont.)
- When exiting the CS, a process broadcasts a RELEASE message to every process
- On receiving a RELEASE message, a process removes the completed request from its request queue
  - At that moment, if the process's own request is at the top of the request queue, it enters its CS, provided that all REPLY messages have been received
- When receiving REQUEST, RELEASE, and REPLY messages, a process adjusts its logical clock accordingly
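Since every process sees the same REQUESTs and orders them by (timestamp, process ID), the scheme grants the CS in global timestamp order. A compact single-address-space illustration of that ordering (not a real message-passing implementation):

```python
import heapq

# Compact illustration of the timestamp-priority ordering: all request queues
# converge on the same ascending (timestamp, pid) order, and each RELEASE
# lets the next request at the head of the queue enter the CS.

def cs_entry_order(requests):
    """requests: list of (timestamp, pid) pairs; returns pids in entry order."""
    heap = list(requests)
    heapq.heapify(heap)                  # ascending (timestamp, pid)
    order = []
    while heap:
        _, pid = heapq.heappop(heap)     # head of queue enters, then RELEASEs
        order.append(pid)
    return order
```

For instance, requests (3, "P2"), (1, "P1"), (3, "P0") enter as P1, P0, P2: the tie at timestamp 3 is broken by process ID.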
46. Timestamp Prioritized Schemes: Improved
- When a process receives a REQUEST message
  - Not in the critical region AND does not want to enter → OK (REPLY)
  - Already in the critical region → no REPLY; queue the request
    - After it exits the CS, send the REPLY message
  - Wants to enter but has not yet done so → compare the timestamp of the incoming message with the one it sent to everyone
    - The lowest one wins
    - If the incoming message's timestamp is lower → OK (REPLY)
    - Otherwise → no REPLY; queue the request
- 2(N-1) messages for the completion of a CS
  - REPLY and RELEASE are combined
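The receive-side rule above can be sketched as a single handler (an illustration; the state and return names are mine). Ties on equal timestamps are broken by process ID, which the tuple comparison handles:

```python
# Sketch of the improved scheme's REQUEST handler. `state` is "RELEASED"
# (outside the CS, not interested), "HELD" (inside the CS), or "WANTED"
# (requested but not yet entered); requests are (timestamp, pid) pairs.

def on_request(state, my_request, incoming):
    """Return "REPLY" (send OK now) or "DEFER" (queue until we exit)."""
    if state == "RELEASED":
        return "REPLY"
    if state == "HELD":
        return "DEFER"                   # OK is sent after leaving the CS
    # state == "WANTED": the lower (timestamp, pid) wins
    return "REPLY" if incoming < my_request else "DEFER"
```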
47. Timestamp Prioritized Schemes: Improved (Cont.)
- Two processes want to enter the same critical region at the same moment
- Process 0 has the lowest timestamp, so it wins
- When process 0 is done, it sends an OK also, so 2 can now enter the critical region
48. Timestamp Prioritized Schemes (Cont.)
- Characteristics
  - Guarantee mutual exclusion
  - No deadlock or starvation
- Drawbacks (weird)
  - N points of failure
    - How to distinguish a dead participant from a refusal?
  - N points of bottleneck
- Improvement
  - Require only a simple majority of the other processes → VOTING
  - Avoids N points of failure
49. Voting Schemes
- As soon as a candidate has a majority of the votes → WIN
- When a process receives a REQUEST message, it sends a REPLY (i.e., a vote) only if it has not voted for any other candidate (requesting process)
- Once a process has voted, it is not allowed to send any more REPLY messages until its vote has been returned (e.g., by a RELEASE message)
- A candidate obtains permission to enter the CS when it has received a majority of the votes
- Problem: deadlock (say two processes each win half of the votes)
  - Solved by changing votes when a process hears from a more attractive candidate (judged by timestamp or other criteria)
  - INQUIRE → RELINQUISH
50. Voting Schemes (Cont.)
- Requires O(N) messages per CS entry
  - Plus some messages for deadlock avoidance
- Reduce the message overhead by reducing the number of votes required to enter the CS
  - Each process i has a request set (quorum) Si, and a process needs the vote of every member of its request set to enter the CS
  - To ensure mutual exclusion, Si ∩ Sj ≠ ∅ for all i, j
  - It is possible for each quorum to be of size O(√N)
- See Chapters 6 and 10 for further discussion
51. A Token Ring Algorithm
- Not necessarily in process ID order
- An unordered group of processes on a network
- A logical ring constructed in software
52. A Token Ring Algorithm (Cont.)
- Idea
  - Processes are connected in a logical ring structure
  - A token circulates around the ring
  - A process possessing the token is allowed to enter the CS
  - When finished with the CS → pass the token to the successor node
- Advantages
  - Simple, deadlock-free, fair
  - The token can carry state information such as priority
- Disadvantages
  - The token circulates around the ring even if no process wants to enter the CS
    - Results in unnecessary network traffic
  - A process must wait until the token arrives before entering the CS
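A toy simulation of the idea (the ring order and names are illustrative): the token is passed to the successor on every step, whether or not anyone wants the CS, which is exactly the wasted-traffic drawback noted above.

```python
# Toy token-ring simulation: the token visits processes in ring order; a
# process enters the CS only while holding the token.

def simulate_ring(ring, start, wants, steps):
    """Return the order in which processes enter the CS within `steps` hops."""
    entries = []
    pos = ring.index(start)
    pending = set(wants)
    for _ in range(steps):
        holder = ring[pos]
        if holder in pending:
            entries.append(holder)       # use the token once, then pass it
            pending.discard(holder)
        pos = (pos + 1) % len(ring)      # token moves even if nobody wants it
    return entries
```

For example, on the ring [0, 2, 4, 5, 1] with processes 1 and 4 waiting, the entry order is 4 then 1: the ring order, not the process-ID order, decides who goes first.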
53. Comparison
54. Tree Structure
- Requires a process to explicitly request the token, and to move the token only if it knows of a pending request (unlike the ring)
- Indefinite postponement and deadlock? → with the tree structure, both are avoided
- Imposes a hierarchical structure on the processors → the structure must be maintained
  - Root: the process owning the token
- How to navigate a request to the token
  - There is a unique path between a process and the token holder
- How to navigate the token to the next processor to enter the CS
  - A FIFO queue of pending requests is maintained by each node
  - The heads of the FIFO queues of all nodes form the global FIFO queue
- Algorithm
  - Algorithm Lists 10.9, 10.10, 10.11, and 10.12
  - An example
55. Tree Structure (Cont.)
- Each process has a FIFO request queue and a pointer to its immediate predecessor
- When a process receives a request, it appends the request to its queue
  - If the queue was empty and the process does not have the token, it requests the token from its predecessor
  - Otherwise, the token will arrive soon → no further action is taken
- If a process has the token but is not using it, and has a nonempty queue → it removes the first entry from the queue and sends the token to that process
  - This can occur when a request arrives, when the token arrives, or when the process releases the token
  - It also changes its pointer to point to the process to which it sent the token
  - If its queue is still not empty, the process will need to re-obtain the token, so it sends a request to the new token holder
- If the process itself is the first entry in its FIFO queue, it enters the CS
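A condensed sketch of the pointer-flipping mechanics (illustrative only: it propagates every request hop-by-hop and omits the refinement that an intermediate node forwards a request only when its queue was empty):

```python
from collections import deque

# Condensed tree-token sketch: `parent` points one hop toward the current
# token holder (None means "I hold the token"); requests travel up the tree,
# and the token travels back down, flipping parent pointers as it moves.

class Node:
    def __init__(self, pid, parent):
        self.pid, self.parent = pid, parent
        self.queue = deque()

def request_cs(nodes, pid):
    """Propagate pid's request toward the root; return the token holder."""
    node = nodes[pid]
    node.queue.append(pid)
    while node.parent is not None:
        nodes[node.parent].queue.append(node.pid)
        node = nodes[node.parent]
    return node.pid

def pass_token(nodes, holder):
    """Holder forwards the token one hop toward the head pending request."""
    nxt = nodes[holder].queue.popleft()
    nodes[holder].parent = nxt           # the pointer now follows the token
    nodes[nxt].parent = None
    return nxt
```

On a chain 0 (token) ← 1 ← 2, a request by process 2 climbs to node 0; passing the token twice brings it to node 2, whose own pid now heads its queue, so it enters the CS, and all parent pointers now lead to node 2.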
56. Tree Structure (Cont.)
(figure: example run on a tree of processes; at T0, P4 requests the token, and at T2, P3 requests the token)
57. Broadcast Structure
- Proposed by Suzuki and Kasami
- Uses group communication, without being aware of the topology
- Token
  - Token vector T: the number of completed CS entries by each process
  - Q: the pending request queue
- Each process P: a local sequence number and a sequence vector S
  - Local sequence number: the number of requests for the CS (attached to each REQUEST)
  - S: stores the highest sequence number P has heard from every process
58. Broadcast Structure (Cont.)
- Pi requests the CS by broadcasting a REQUEST (with its local sequence number seq)
- Pj updates its sequence vector when receiving a REQUEST message from Pi → Sj[i] = max(Sj[i], seq)
- If Pj holds an idle token (empty Q), Pj sends the token to Pi if Sj[i] = T[i] + 1 → Pi enters the CS when it receives the token
- Upon completion of the CS,
  - Pi updates T[i] to equal Si[i]
  - Pi appends to Q all processes k (k ≠ i) with Si[k] = T[k] + 1
  - Pi removes the top entry from the request queue and sends it the token
  - If Q is empty, the token stays with Pi
59. Broadcast Structure (Cont.)
(figure: at t0, P1 holds the token and wants to enter the CS; it is OK for P1 to enter. At t1, P2 wants to enter the CS; at t2, P4 wants to enter the CS. The token vector T and queue Q are shown at each step)
60. Broadcast Structure (Cont.)
(figure: at t3, P1 leaves the CS, updates T and Q, and sends the token to P2; at t4, P3 wants to enter the CS)
61. Broadcast Structure (Cont.)
(figure: at t5, P2 leaves the CS)
62. Broadcast Structure (Cont.)
(figure: the sequence vectors Si, token vector T, and token queue Q at the end of the run)
- No central controller: the management of the shared token is distributed
- The contention for mutual exclusion is centrally serialized by the FIFO token queue
- No deadlock or starvation
63. Leader Election
64. Overview
- Elect a process as coordinator or initiator
  - Especially when the existing coordinator (leader) fails
  - Usually detected by timeout
- Leader election criteria
  - Extrema finding: based on a global priority
  - Preference-based: vote for a leader according to a personal preference (locality, reliability)
- Leader election vs. mutual exclusion
  - 3rd paragraph of page 139
- Leader election algorithms depend on the topological structure assumed for the process group
65. Bully Algorithm
- Assumptions
  - Complete topology: processes can reach each other in one message
  - Each process has a unique number (process ID)
    - Election: locate the process with the highest ID and designate it as the new coordinator
  - Every process knows the process ID of every other process
    - But none knows which ones are currently up or down
  - Reliable network: only the processes may fail
  - Failure of a process is detected by timeout
- A failed process can rejoin the group by forcing an election upon its recovery
66. Bully Algorithm (Cont.)
- When any process (say P) notices that the coordinator is no longer responding to requests, it initiates an election
  - P sends an ELECTION message to all processes with higher IDs
  - If no one responds, P wins the election and becomes coordinator
  - If one of the higher processes answers, it takes over; P's job is done
- A process can get an ELECTION message from a process with a lower ID
  - It sends OK back to the sender to indicate that it is alive and will take over
  - It then holds an election, unless it is already holding one
- All processes give up but one, and that one is the new coordinator
  - It sends messages to all processes to tell them it is the new coordinator
- If a process that was previously down comes back up, it holds an election
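The election logic can be sketched as a recursion over the set of live processes (a simplification: liveness is given up front instead of being discovered by timeouts, and the messages are implicit):

```python
# Bully-election sketch: the initiator defers to any live higher-ID process,
# and the election restarts from each responder until the highest live
# process is left, which then announces itself as coordinator.

def bully_election(initiator, all_ids, alive):
    higher_alive = [p for p in all_ids if p > initiator and p in alive]
    if not higher_alive:
        return initiator                 # no one higher answered: we win
    return bully_election(min(higher_alive), all_ids, alive)
```

With processes 0-7 and process 7 crashed, an election started by process 4 ends with 6 as coordinator, matching the figure on the next slide.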
67. Bully Algorithm (Cont.)
- Process 4 holds an election
- Processes 5 and 6 respond, telling 4 to stop
- Now 5 and 6 each hold an election
68. Bully Algorithm (Cont.)
69. Ring Algorithm
- Assumptions
  - Processes are physically or logically ordered
  - Each process knows the current members of the ring and their order
- Election cycle: when any process notices that the coordinator is no longer responding to requests, it initiates an election
  - An ELECTION message containing its own ID is sent to its successor
    - If the successor is down, it is skipped over until a running process is located
  - Any process receiving the ELECTION message adds its ID to the message and resends it to its successor
  - Finally, the ELECTION message gets back to the initiator
- Coordinator announcement cycle
  - A COORDINATOR message is circulated once again
  - It tells who the coordinator is (the one with the highest ID) and who the members of the new ring are
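Both cycles can be sketched in a few lines (a simplification: a fixed set of live processes stands in for timeout-based skipping):

```python
# Ring-election sketch: the ELECTION message collects the IDs of live
# processes in successor order; back at the initiator, the highest collected
# ID is announced as coordinator.

def ring_election(ring, initiator, alive):
    """ring: process IDs in successor order. Returns (collected, coordinator)."""
    n, start = len(ring), ring.index(initiator)
    collected = [initiator]
    i = (start + 1) % n
    while ring[i] != initiator:
        if ring[i] in alive:             # dead successors are skipped over
            collected.append(ring[i])
        i = (i + 1) % n
    return collected, max(collected)
```

On the ring 5 → 6 → 0 → 1 → 2 → 3 → 4 with all of them alive, an election started by 5 collects 5,6,0,1,2,3,4 and announces 6 (the highest ID) as coordinator.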
70. Ring Algorithm (Cont.)
(figure: the ELECTION message grows as it circulates: 5,6,0,1 → 5,6,0,1,2 → 5,6,0,1,2,3 → 5,6,0,1,2,3,4)
- The topological order may not be the same as the process ID order
71. Ring Algorithm (Cont.)
- Improvement: Figure 4.22 (pp. 141)
  - When a process sends a message, it simply forwards the larger of its own ID and the received value to its successor
  - A process that is already involved in the election does not need to forward a message unless the message contains a value higher than its ID
- Time and message complexity
  - Only one initiator: O(N) (time and messages)
  - N simultaneous election initiators
    - Without the optimization: O(N^2) messages
    - With the optimization: O(N) or O(N^2) messages → depending on whether the ring is arranged in ascending or descending order of the nodes' IDs
72. Ring Algorithm (Cont.)
- Further improvement: O(N log N)
  - Idea: disable elections initiated by lower-priority nodes as early as possible, irrespective of the topological order of the nodes
  - Compare a node's ID with those of its left and right neighbors
    - An initiator node remains active if its ID is higher than both neighbors'
    - Otherwise, it becomes passive and only relays messages
  - This effectively eliminates at least half of the active nodes in each round of message exchanges, reducing O(N) rounds to O(log N)
  - Requires a bidirectional ring
    - For a unidirectional ring → buffer two consecutive messages before a node is determined to be in active or passive mode
73. Tree Topologies
- Dynamically build a minimum-weight spanning tree (MST) for a network of N nodes
  - If all edges in a connected graph have unique weights → the MST is unique
- Leader election and building an MST can be reduced to each other
- The Gallager, Humblet, and Spira approach [GHS83]
  - Searching and combining: merge fragments
  - Fragment: a minimum-weight subtree of the final MST
  - Bottom-up, starting from single nodes
  - Each fragment finds the minimum-weight outgoing edge of the fragment and uses it to join with a node in a different fragment
  - The new fragment is still minimum-weight
74. Tree Topologies (Cont.)
- Tree topology and leader election
  - The last node that merges and yields the final MST can be the leader
- Electing a leader after an MST has been constructed
  - An initiator broadcasts a Campaign-For-Leader (CFL) message, which carries a logical timestamp, to all nodes along the MST
  - When the message reaches a leaf, it replies with a Voting (V) message to its parent
  - A parent sends its voting message to its own parent after all of its children's votes have been collected
  - Once a node finishes its reply, it is done → it waits for the announcement of the new leader and accepts no further CFL messages
  - With multiple initiators → the lowest timestamp wins
75. Tree Topologies (Cont.)
- Build a spanning tree by message flooding
  - Robust in a failure-prone network
- Idea
  - Every node repeats a received message (which it has not seen yet) to all neighboring nodes except the sender
  - Eventually every node is reached and a spanning tree is formed
- Steps
  - Initiators flood the system with CFL messages
  - As the messages flood, a spanning forest with each tree rooted at an initiator is built up
  - Reply messages are sent by backtracking the path from leaf to root
  - With multiple initiators, the lowest timestamp wins