Title: Coordination
Chapter 8
Topics
- Election algorithms
- Mutual exclusion
- Deadlock
- Transaction
Election Algorithms
- An election algorithm is how the nodes of a distributed system (DS) elect a new coordinator when the old one has failed or been cut off from the network.
- In the following algorithms, each processor (node) has a unique ID. Communications are reliable (messages are not dropped or corrupted).
Requirements
- Safety: each process Pi has coordinator = null or coordinator = P, where P is a single live process (the same P for every Pi).
- Liveness: each process Pi eventually has coordinator ≠ null, or it has failed.
The Bully Algorithm
- (Garcia-Molina) The node with the highest ID bullies its way into leadership.
- When a process P notices that the coordinator has failed, it holds an election:
- 1. P sends an ELECTION message (E-message) to all processes with higher numbers.
- 2. If no one responds, P wins the election and becomes coordinator.
- 3. If one of the higher-ups answers, say Q, it takes over. P's job is done.
An Example
- Process 4 holds an election.
- Processes 5 and 6 respond, telling 4 to stop.
- Now 5 and 6 each hold an election.
An Example (Cont.)
- Process 6 tells 5 to stop.
- Process 6 wins and tells everyone.
The Cost
- In a network of N nodes, assume the coordinator with ID N fails.
- If the process with ID N-1 starts an election, the cost is O(N) messages.
- If the lowest-numbered node starts an election, the cost is O(N^2) messages.
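The bully election and its message cost can be sketched as a small simulation. This is a minimal sketch, not a definitive implementation: the function name `bully_election`, the choice to count ELECTION, reply, and COORDINATOR messages, and the assumption that the starter is alive are all illustrative.

```python
def bully_election(all_ids, alive, starter):
    """Simulate a bully election; return (leader, message_count).

    all_ids: every process ID, including failed ones (hypothetical input).
    alive:   set of IDs that are currently running.
    starter: live process that noticed the coordinator failure.
    """
    alive = set(alive)
    messages = 0
    pending = [starter]   # processes that still have to hold an election
    held = set()          # processes that already held one
    leader = None
    while pending:
        p = pending.pop()
        if p in held:
            continue
        held.add(p)
        higher = [q for q in all_ids if q > p]
        responders = [q for q in higher if q in alive]
        # One ELECTION message per higher ID, one reply per live higher ID.
        messages += len(higher) + len(responders)
        if responders:
            pending.extend(responders)   # the higher-ups take over
        else:
            leader = p                   # nobody answered: p wins
            messages += len(alive) - 1   # tell every other live process
    return leader, messages


# Scenario from the example slides: processes 0..7, process 7 has failed.
print(bully_election(range(8), set(range(7)), 4))
```

Calling it with the highest live process as starter versus the lowest one shows the gap between the linear and the quadratic case described above.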
A Ring Election Algorithm
- Nodes are physically or logically organized in a ring.
- Nodes know their successors.
- Node states are Normal, Election, and Leader.
- Any node that notices that the leader is not functioning changes its state to Election, creates an election message containing its ID, and sends it to its clockwise neighbor.
An Example
A Ring Election Algorithm (2)
- When a node receives an election message:
- If the message does not contain its ID, it adds its ID to the message and sends it to its successor.
- If the message contains its own ID, it sends a COORDINATOR message, which names the list member with the highest number as the coordinator. This message circulates once.
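The ring election steps above can be sketched as follows. This is a minimal sketch under stated assumptions: a single starter, a sender that skips failed successors (the slides only say that nodes know their successors), and a hypothetical function name `ring_election`.

```python
def ring_election(ring, alive, starter):
    """Simulate one ring election started by a single live node.

    ring:    list of node IDs in clockwise order.
    alive:   set of IDs that are running (assumed to include starter).
    Returns (coordinator, message_count).
    """
    n = len(ring)

    def successor(i):
        # Assumption: a sender skips over failed nodes to the next live one.
        j = (i + 1) % n
        while ring[j] not in alive:
            j = (j + 1) % n
        return j

    members = [starter]                 # IDs collected in the election message
    j = successor(ring.index(starter))
    messages = 1
    while ring[j] != starter:
        members.append(ring[j])         # each node adds its own ID
        j = successor(j)
        messages += 1
    coordinator = max(members)          # highest ID in the list wins
    messages += len(members)            # COORDINATOR message circulates once
    return coordinator, messages


print(ring_election(list(range(8)), set(range(7)), 2))
```

With seven live nodes the sketch uses 14 messages, one full circuit for the election message and one for the COORDINATOR message.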
An Example
An Example (Cont.)
An Example (Cont.)
Complexity
- In the best case, only one node starts an election message, so the number of messages is 2N.
- In the worst case, N nodes start an election message, resulting in O(N^2) messages.
- Improvements
- Drop election messages arriving in less than time T, where T is the time a message takes to traverse the ring.
- Does it work?
LCR Ring Election
- Each node sends a message with its ID around the ring. When a process receives an incoming message, it compares the ID with its own. If the incoming ID is greater than its own, it passes it to the next node; if it is less than its own, it discards it; if it is equal to its own, it declares itself leader.
[Figure: ring of nodes 0, 3, and 5 passing messages "Elect 0", "Elect 3", and "Elect 5"]
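The LCR rule can be tried out with a round-based simulation. This is a minimal sketch, assuming synchronous rounds in which every pending message moves one hop; the name `lcr_election` and the message counter are illustrative additions.

```python
def lcr_election(ring):
    """Simulate LCR on a ring of unique IDs.

    ring: IDs listed in the direction in which messages travel.
    Returns (leader, message_count).
    """
    if not ring:
        raise ValueError("ring must contain at least one node")
    n = len(ring)
    outgoing = list(ring)       # outgoing[i]: ID node i sends next, or None
    messages = 0
    while True:
        arriving = [None] * n
        for i, uid in enumerate(outgoing):
            if uid is not None:
                arriving[(i + 1) % n] = uid   # one hop to the next node
                messages += 1
        outgoing = [None] * n
        for i, uid in enumerate(arriving):
            if uid is None:
                continue
            if uid == ring[i]:
                return uid, messages          # own ID came back: leader
            if uid > ring[i]:
                outgoing[i] = uid             # larger ID: pass it on
            # smaller ID: discard it


print(lcr_election([3, 0, 5]))
```

Running it on IDs sorted in ascending and in descending order of travel shows how strongly the message count depends on the arrangement of IDs around the ring.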
Complexity
- If messages are passed clockwise, only one survives after the first round.
- If messages are passed counter-clockwise...
- Best case O(N), worst case O(N^2).
[Figure: ring of nodes 0, 1, 2, and 3 passing messages "Elect 0", "Elect 1", "Elect 2", and "Elect 3"]
HS (Hirschberg-Sinclair) Ring Election (1)
- Motivation: O(N^2) is a lot of messages. Improve it to O(N log N).
- Assumptions: the ring size can be unknown. Communication must be bidirectional. All nodes start at more or less the same time. Each node operates in phases and sends out tokens. The tokens carry hop counts and direction flags in addition to the ID of the sender.
[Figure: node 3 sends two tokens, "ID 3, 2 hops, counter-clockwise" and "ID 3, 2 hops, clockwise"]
HS Ring Election (2)
- Phases are numbered 0, 1, 2, 3, ..., ⌈log2 N⌉. In each phase k, node j sends out tokens u_j containing its ID in both directions.
- The tokens travel only a distance of 2^k hops, then return to their origin j.
- If both tokens make it back, process j continues with the next phase (increments k). If either token fails to make it back, process j simply waits to be told the result of the election.
[Figure: outbound and inbound token paths around node 3]
HS Ring Election (3)
- All processes always relay inbound tokens.
- If a process i receives a token u_j going in the outbound direction, it compares the token's ID with its own.
- If process i has a larger ID, it simply discards the token.
- If process i has a smaller ID, it relays the token as requested.
- If its ID is equal to the token ID, it has received its own token in the outbound direction, so the token has gone clear around the ring and the process declares itself leader.
[Figure: token "ID 3, 2 hops, clockwise" arriving at node 4]
Complexity
- Communication complexity: in the first phase (phase 0), every process sends out 2 tokens, and they go one hop and return. This is a total of 4N messages for the tokens to go out and return.
- In phase k, where k > 0, a node sends out tokens only if it was not overruled in the previous phase, that is, by a process within a distance of 2^(k-1) in either direction. This implies that within any group of 2^(k-1) + 1 consecutive nodes, at most one goes on to send out tokens in phase k.
- This limits the message complexity to O(N log N).
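The counting argument above can be turned into a small calculation. This is a minimal sketch that only adds up the per-phase upper bounds stated in the bullets; it does not simulate the algorithm, and the name `hs_message_bound` is illustrative.

```python
def hs_message_bound(n):
    """Upper bound on HS token messages for a ring of n nodes."""
    total = 4 * n                         # phase 0: 2 tokens, 1 hop out, 1 hop back
    last_phase = (n - 1).bit_length()     # equals ceil(log2(n)) for n >= 1
    for k in range(1, last_phase + 1):
        # At most one sender per group of 2^(k-1) + 1 consecutive nodes.
        senders = n // (2 ** (k - 1) + 1)
        # Each sender: 2 tokens, 2^k hops out and 2^k hops back.
        total += senders * 4 * 2 ** k
    return total


for n in (8, 64, 1024):
    print(n, hs_message_bound(n))
```

Each phase contributes fewer than 8N messages and there are about log2 N phases, which is where the O(N log N) total comes from.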
Mutual Exclusion in DS
- Mutual exclusion is needed for restricting access to a shared resource.
- We use semaphores, monitors, and similar constructs to enforce mutual exclusion on a centralized system.
- We need the same capabilities in a DS.
- As in the one-processor case, we are interested in safety (mutual exclusion), progress, and bounded waiting (fairness).
Solutions
- Centralized lock manager
- Token-passing lock manager
- Distributed lock manager
- Ricart-Agrawala algorithm
- Voting
- Quorum
A Centralized Algorithm
a) Process 1 asks the coordinator for permission to enter a critical region. Permission is granted.
b) Process 2 then asks permission to enter the same critical region. The coordinator does not reply.
c) When process 1 exits the critical region, it tells the coordinator, which then replies to 2.
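Steps a) to c) can be sketched as a tiny coordinator object. This is a minimal sketch in which method calls stand in for messages; the class and method names are hypothetical.

```python
from collections import deque


class Coordinator:
    """Centralized lock manager for one critical region."""

    def __init__(self):
        self.holder = None        # process currently in the critical region
        self.waiting = deque()    # requests that have not been answered yet

    def request(self, pid):
        """Return True if permission is granted now, False if queued."""
        if self.holder is None:
            self.holder = pid
            return True
        self.waiting.append(pid)  # no reply: the requester stays blocked
        return False

    def release(self, pid):
        """Holder leaves the region; return the next process granted, if any."""
        if pid != self.holder:
            raise ValueError("only the current holder can release")
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder


c = Coordinator()
print(c.request(1))   # a) process 1 is granted permission
print(c.request(2))   # b) process 2 gets no reply and is queued
print(c.release(1))   # c) process 1 exits, the coordinator replies to 2
```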
Problems with Centralized Locking?
Other issues?
The Token Ring Algorithm
- Assumption: processes are ordered in a ring.
- Communications are reliable and can be limited to one direction.
- The size of the ring can be unknown, and each process is only required to know its immediate neighbor.
- A single token circulates around the ring (in one direction only).
[Figure: ring of processes 0, 3, and 5 with a circulating token]
Algorithm Details
- When a process has the token, it can enter the CR at most once. Then it must pass the token on.
- Only the process with the token can enter the CR, so mutual exclusion is ensured.
- Bounded waiting: guaranteed, since the token circulates.
- Liveness: as long as the process with the token doesn't fail, progress is ensured. Global snapshots can be used if a lost token is suspected.
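The token-passing rule can be illustrated with a short simulation. This is a minimal sketch, assuming no failures and a fixed ring; `token_ring_schedule` and its parameters are illustrative names.

```python
def token_ring_schedule(ring, wants, start=0, laps=1):
    """Circulate the token and return the order of critical-region entries.

    ring:  process IDs in the direction the token travels.
    wants: IDs of the processes that want to enter the CR.
    start: index in ring of the process that holds the token first.
    """
    wants = set(wants)
    entered = []
    holder = start
    for _ in range(laps * len(ring)):
        pid = ring[holder]
        if pid in wants:
            entered.append(pid)    # enter the CR at most once per token visit
            wants.discard(pid)
        holder = (holder + 1) % len(ring)   # pass the token to the neighbor
    return entered


print(token_ring_schedule([3, 0, 5], {5, 3}))
```

Because the token visits every process once per lap, every requester is served within one lap, which is the bounded-waiting property.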
Problems with the Token Algorithm
- 1. How to distinguish a lost token from one that is being held for a very long time?
- 2. What happens if the token holder crashes for some time?
- 3. How to maintain the logical ring if a participant drops out of the system (voluntarily or by failure)?
- 4. How to identify and add new participants?
- 5. The token is perpetually passed around the ring even when none of the participants wants to enter its CS, which is unnecessary overhead that consumes bandwidth.
- 6. The ring imposes an average delay of N/2 hops, limiting scalability.
Distributed Algorithm: Ricart and Agrawala Timestamp Algorithm
- Assumption: there is a total ordering of all events in the system (Lamport's timestamps will provide this).
- Communications are reliable.
- Each process must maintain a queue for each critical region or resource if there is more than one resource to be shared.
[Figure: processes 0, 1, and 2 sharing a resource]
Ricart and Agrawala (2)
- When a process wants to enter the critical region (CR) or obtain a resource, it sends a request message carrying its ID and a Lamport timestamp, (t, pid), to all other processes.
- It can proceed to enter the CR when it gets an OK message from all other processes.
- When it is done with the CR, it sends an OK message to every process on its wait queue and removes them from the queue.
Ricart and Agrawala (3)
- When a process P1 receives a request for the resource from a process P2:
- If P1 is not in the CR and does not want the CR, it sends back an OK message.
- If P1 is currently in the CR, it does not reply, but queues P2's request.
- If P1 wants to enter the CR but has not yet received all the permissions, it compares the timestamp in P2's message with the one in the message that P1 sent out to request the CR. The lowest timestamp wins.
- If TS(P1) < TS(P2), then P2's message is put on the queue.
- If TS(P1) > TS(P2), then P1 sends P2 an OK message.
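The request-handling rules can be sketched as a per-process state machine. This is a minimal sketch: the class `RAProcess`, its state names, and the use of return values in place of network messages are assumptions for illustration. Timestamps are (t, pid) pairs, so Python's tuple comparison gives the total order.

```python
RELEASED, WANTED, HELD = "released", "wanted", "held"


class RAProcess:
    def __init__(self, pid):
        self.pid = pid
        self.state = RELEASED
        self.request_ts = None   # (t, pid) of this process's own request
        self.queue = []          # senders whose requests are deferred

    def want(self, lamport_time):
        """Start requesting the CR; return the timestamp to broadcast."""
        self.state = WANTED
        self.request_ts = (lamport_time, self.pid)
        return self.request_ts

    def on_request(self, ts, sender):
        """Return True to send OK now, False if the request is queued."""
        in_cr = self.state == HELD
        own_request_wins = self.state == WANTED and self.request_ts < ts
        if in_cr or own_request_wins:
            self.queue.append(sender)
            return False
        return True

    def enter(self):
        self.state = HELD        # call only after OK from all other processes

    def leave(self):
        """Leave the CR; return the processes that now receive OK."""
        self.state = RELEASED
        self.request_ts = None
        granted, self.queue = self.queue, []
        return granted


p0, p2 = RAProcess(0), RAProcess(2)
ts0, ts2 = p0.want(8), p2.want(12)
print(p0.on_request(ts2, 2))   # False: process 0 has the lower timestamp
print(p2.on_request(ts0, 0))   # True: process 2 yields to process 0
```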
Ricart and Agrawala (4)
- Two processes, 0 and 2, want to enter the same critical region at the same moment.
- Process 0 has the lowest timestamp, so it wins.
- When process 0 is done, it sends an OK also, so 2 can now enter the critical region.
Analysis
- No tokens anymore
- Cooperative voting to determine the sequence of CSs
- Does not rely on an interconnection medium offering ordered messages
- Serialization based on logical timestamps (total ordering)
- If a participant wants to enter its CS, it asks all others for permission and does not proceed until all others have agreed.
- If a participant gets a permission request and is not interested in its CS, it returns permission immediately to the requester.
- Message complexity: 2(N-1) per CS entry.
- The algorithm ensures
- mutual exclusion (no two requests have the lowest timestamp)
- progress (some request has the lowest timestamp)
- bounded waiting
Voting for Mutual Exclusion
- Potential problems: you must be sure you have more votes than any other process to enter the CR. If P1 has 4, P2 has 3, and P3 has 2, P1 has the most votes, but how does it know without communicating (costly) with the other contenders? Just having 4 votes is not enough: what if P1 has 4 and P2 has 5?
- Potential solution: require a simple majority to win. But 4 is not a majority of 9, so in this example, no one can go. Worse, the processes are deadlocked.
- There must be a way to resolve this kind of deadlock.
Timestamp Resolution
- When a process makes a request, it attaches a Lamport timestamp. Voters prefer candidates with the smaller timestamp.
- If voter V has voted for P1 and then receives a vote request from P2 with an earlier timestamp, V will try to retrieve its vote. V does so by sending an INQUIRE message to P1. If P1 has not yet received all the needed votes, it must relinquish V's vote, in which case V gives its vote to P2. This avoids deadlock.
- When P1 is finished with the CR, it sends release messages to all its voters, so they can give their votes to new candidates.
Anti-quorum Resolution
- An anti-quorum is any set of nodes that has a non-empty intersection with all quorums.
- A voter votes YES to one process and NO to other processes seeking the same resource.
- When a process gets a quorum of YES votes, it proceeds to the CR. When it gets an anti-quorum of NO votes, it knows it will not get enough YES votes, so it withdraws its candidacy and releases its votes.
- After waiting a specified time, it tries again to gain enough votes.
Quorums
- Do we need to get a majority of votes, or is there some smaller set of votes that will do? Different nodes could have different voting districts as long as any two districts have a non-empty intersection.
- Quorums have the property that any two have a non-empty intersection.
- Simple majorities are quorums. Any two sets whose sizes are simple majorities must have at least one element in common.
Quorums (2)
- Grid quorum: arrange the nodes in a logical (square) grid. A quorum is all of a row and all of a column. Quorum size is 2 sqrt(N) - 1.
- Finite projective plane (Maekawa): if N = 7, form coteries of 3.
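The grid construction can be checked directly. This is a minimal sketch, assuming N is a perfect square and nodes are numbered 0 to N-1 row by row; `grid_quorum` is an illustrative name.

```python
import math


def grid_quorum(node, n):
    """Return the grid quorum (full row plus full column) of a node."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("n must be a perfect square")
    row, col = divmod(node, side)
    row_members = {row * side + c for c in range(side)}
    col_members = {r * side + col for r in range(side)}
    return row_members | col_members


n = 9
print(sorted(grid_quorum(4, n)))
# Any two quorums share a node: the row of one crosses the column of the other.
print(all(grid_quorum(a, n) & grid_quorum(b, n)
          for a in range(n) for b in range(n)))
```

The row and the column overlap in exactly one node, which is why the quorum has 2 sqrt(N) - 1 members rather than 2 sqrt(N).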
Comparison
Transaction Properties
- Atomicity. Either all operations of the transaction are properly reflected in the database or none are.
- Consistency. Execution of a transaction in isolation preserves the consistency of the database.
- Isolation. Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions. Intermediate transaction results must be hidden from other concurrently executed transactions.
- Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.
Example: Funds Transfer
- Transaction to transfer 50 from account A to account B:
- 1. read(A)
- 2. A := A - 50
- 3. write(A)
- 4. read(B)
- 5. B := B + 50
- 6. write(B)
- Consistency requirement: the sum of A and B is unchanged by the execution of the transaction.
- Atomicity requirement: if the transaction fails after step 3 and before step 6, the system ensures that its updates are not reflected in the database.
Example: Funds Transfer (continued)
- Durability requirement: once the user has been notified that the transaction has completed (i.e., the transfer of the 50 has taken place), the updates to the DB must persist despite failures.
- Isolation requirement: if between steps 3 and 6 another transaction is allowed to access the partially updated database, it will see an inconsistent database (the sum A + B will be less than it should be). Isolation can be ensured by running transactions serially.
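The atomicity and consistency requirements of the transfer can be sketched with a private copy of the data. This is a minimal sketch: the in-memory dictionary, the function name `transfer`, and the rule that an overdraft aborts the transaction are assumptions made for illustration, and durability is not modeled.

```python
def transfer(accounts, src, dst, amount):
    """Run the six transfer steps on a private copy; commit all or nothing.

    Returns True if the transaction committed, False if it aborted.
    """
    workspace = dict(accounts)      # private workspace, invisible to others
    a = workspace[src]              # 1. read(A)
    a = a - amount                  # 2. A := A - amount
    workspace[src] = a              # 3. write(A)
    if a < 0:
        return False                # abort (assumed rule): accounts untouched
    b = workspace[dst]              # 4. read(B)
    b = b + amount                  # 5. B := B + amount
    workspace[dst] = b              # 6. write(B)
    accounts.update(workspace)      # commit: both writes become visible
    return True


accounts = {"A": 100, "B": 20}
print(transfer(accounts, "A", "B", 50), accounts)
print(transfer(accounts, "A", "B", 80), accounts)
```

The second call aborts after step 3, and the accounts keep their previous values, so the sum A + B is the same before and after both calls.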
The Transaction Model
Transaction Types
- Flat transactions
- No partial results available
- A nested transaction is a transaction that is logically decomposed into a hierarchy of sub-transactions.
- Allows partial results to be committed
- A distributed transaction is a logically flat, indivisible transaction that operates on distributed data.
Distributed Transactions Illustration
Private Workspace
- The file index and disk blocks for a three-block file
- The situation after a transaction has modified block 0 and appended block 3
- After committing
Q: What is the cost of copying data?
More Efficient Implementation
- Two common methods of implementation are write-ahead logs and before/after images.
- With write-ahead logs, the transactions act on the permanent workspace, but before they can make a change, a log record is written to stable storage with the transaction ID, the data item ID, and the old and new values.
- This log can then be used if the transaction aborts and the changes need to be rolled back.
Write-ahead Log
- a) A transaction
- b)-d) The log before each statement is executed
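The write-ahead rule (log first, then change, undo from the log on abort) can be sketched as follows. This is a minimal sketch: a Python list stands in for stable storage, the class name and the sample transaction are hypothetical, and every data item is assumed to exist before it is written.

```python
class WriteAheadLog:
    def __init__(self, data):
        self.data = data    # the permanent workspace
        self.log = []       # stand-in for the log on stable storage

    def write(self, txn, item, new_value):
        old_value = self.data[item]
        # The record (transaction, item, old, new) is logged before the change.
        self.log.append((txn, item, old_value, new_value))
        self.data[item] = new_value

    def abort(self, txn):
        # Undo the transaction's changes in reverse order using the old values.
        for t, item, old_value, _new in reversed(self.log):
            if t == txn:
                self.data[item] = old_value
        self.log = [rec for rec in self.log if rec[0] != txn]


wal = WriteAheadLog({"x": 0, "y": 0})
wal.write("T1", "x", 1)
wal.write("T1", "y", 2)
wal.write("T1", "x", 4)
print(wal.log)
wal.abort("T1")
print(wal.data)
```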
Before- and After-Images
- A before-image and an after-image are kept for each data item.
- When a data item is changed, the old value is written to the before-image and the new value to the after-image.
- Other transactions are not allowed to see the new value until the current transaction commits.
- The after-image is made permanent and durable once the transaction that wrote it commits.
- If the transaction aborts, the before-image is restored.
DBMS Organization
- General organization of managers for handling transactions.
Levels of Consistency (SQL92)
- Serializable: the default.
- Repeatable read: only committed records can be read, and repeated reads of the same record must return the same value. However, a transaction may not be serializable.
- Read committed: only committed records can be read, but successive reads of a record may return different (but committed) values.
- Read uncommitted: even uncommitted records may be read (browse).
Serializability
Two-Phase Locking (2PL)
Strict 2PL
Pessimistic Timestamp Ordering
- Target: enforce serializability.
- Every transaction gets a (Lamport, totally ordered) timestamp.
- Every data item X has a read timestamp read-ts(X), a write timestamp write-ts(X), and a commit bit c(X).
- The commit bit c(X) is true if and only if the most recent transaction to write to that item has committed.
- The scheduler maintains the item timestamps and checks to make sure the reads and writes are correct.
Read Too Late
- T1 tries to read X, but ts(T1) < write-ts(X), meaning X has been written by a later transaction.
- T1 should not be allowed to read X because it was written by a transaction that occurs later in the serialization order (transactions are serialized by start time).
- Solution: T1 is aborted.
Write Too Late
- T1 tries to write X, but read-ts(X) > ts(T1), indicating that some later transaction has already read X and should have read the value T1 is about to write.
- Solution: T1 is aborted.
Dirty Reads
- T1 reads X, which was last written by T2. The timestamps are properly ordered, but the commit bit c(X) = false, so if T2 later aborts, then T1 must abort too.
- Solution: we can avoid cascading aborts by delaying T1's read until T2 has committed (though this is not necessary to ensure serializability).
Thomas Write Rule
- T2, with ts(T2) > ts(T1), has written X before T1. When T1 tries to write, the appropriate action is to do nothing. No other transaction T3 that should have read T1's value of X got T2's value instead, because it would have been aborted for a too-late read. Future reads of X want T2's value or a later value, not T1's value.
- Solution: T1's write can be skipped.
TS Ordering Rules
- When the scheduler receives a read request for X from transaction T:
- If ts(T) > write-ts(X) and c(X) is true, grant the request and set read-ts(X) to max(ts(T), read-ts(X)).
- If ts(T) > write-ts(X) and c(X) is false, delay T until c(X) becomes true or the transaction that wrote X aborts.
- If ts(T) < write-ts(X), abort T and restart it with a new timestamp.
TS Ordering Rules, continued
- When the scheduler receives a write request for X from transaction T:
- If ts(T) > read-ts(X) and ts(T) > write-ts(X), grant the request, set write-ts(X) to ts(T), and set c(X) = false.
- If ts(T) > read-ts(X) and ts(T) < write-ts(X), don't do the operation but allow T to continue as if it had been done (Thomas write rule).
- If ts(T) < read-ts(X), abort T and restart it with a new timestamp.
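The read and write rules can be sketched as two scheduler functions. This is a minimal sketch: the names `Item`, `on_read`, and `on_write` are illustrative, the return values are labels rather than real scheduling actions, and the case of equal timestamps (a transaction touching an item it already read or wrote) is treated as allowed, which is an assumption the rules above do not spell out.

```python
from dataclasses import dataclass


@dataclass
class Item:
    read_ts: int = 0
    write_ts: int = 0
    committed: bool = True   # the commit bit c(X)


def on_read(ts, x):
    """Decide a read request by a transaction with timestamp ts."""
    if ts < x.write_ts:
        return "abort"                   # read too late
    if not x.committed:
        return "delay"                   # would be a dirty read
    x.read_ts = max(ts, x.read_ts)
    return "grant"


def on_write(ts, x):
    """Decide a write request by a transaction with timestamp ts."""
    if ts < x.read_ts:
        return "abort"                   # write too late
    if ts < x.write_ts:
        return "skip"                    # Thomas write rule
    x.write_ts = ts
    x.committed = False                  # cleared until the writer commits
    return "grant"


x = Item()
print(on_write(5, x))   # granted, X now written by the transaction with ts 5
print(on_read(3, x))    # ts 3 < write-ts 5
print(on_read(7, x))    # writer has not committed yet
```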
Optimistic Timestamp Ordering
- In any optimistic concurrency control, each transaction does its writes to a private workspace until completion of a validation phase.
- In the validation phase, the scheduler validates the transaction by comparing its read set and write set with those of other transactions.
- After validation, the write-set values are written to the database and the transaction commits.
- Validation is frequently done with the help of timestamps.
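One way to validate with timestamps is backward validation: a transaction fails if anything it read was overwritten by a transaction that committed after it started. This is a minimal sketch of that one scheme, which the slides do not name specifically; the function name and the data layout are assumptions.

```python
def validate(read_set, start_ts, committed):
    """Backward validation of one transaction.

    read_set:  set of items the transaction read.
    start_ts:  timestamp at which the transaction started.
    committed: list of (commit_ts, write_set) for finished transactions.
    Returns True if the transaction may commit.
    """
    for commit_ts, write_set in committed:
        overlaps_in_time = commit_ts > start_ts
        if overlaps_in_time and write_set & read_set:
            return False    # it read an item that was overwritten meanwhile
    return True


history = [(5, {"x"}), (12, {"y"})]
print(validate({"x"}, 10, history))   # x was written before the start
print(validate({"y"}, 10, history))   # y was overwritten after the start
```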
Two-Phase Commit (2PC)
- When several databases take part in a single transaction, a protocol called Two-Phase Commit is used.
- Each database is assumed to have its own local resource manager.
- A single system component called the Coordinator controls the whole process.
Steps
- Phase 1
- Coordinator sends a VOTE_REQUEST message to all clients (participants).
- Clients return VOTE_COMMIT or VOTE_ABORT.
- Phase 2
- Coordinator collects all votes and sends GLOBAL_COMMIT or GLOBAL_ABORT.
- Each client commits or aborts.
- Important factor: time-out
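The coordinator's side of the two phases can be sketched as follows. This is a minimal sketch: each participant is modeled as a callable that returns its vote, a return value of None stands in for a reply that did not arrive before the time-out, and the function name is hypothetical. Failure recovery is not modeled.

```python
VOTE_COMMIT, VOTE_ABORT = "VOTE_COMMIT", "VOTE_ABORT"
GLOBAL_COMMIT, GLOBAL_ABORT = "GLOBAL_COMMIT", "GLOBAL_ABORT"


def two_phase_commit(participants):
    """Run 2PC from the coordinator's point of view.

    participants: dict mapping a name to a callable that answers the
    VOTE_REQUEST with VOTE_COMMIT, VOTE_ABORT, or None (timed out).
    Returns (decision, messages sent to each participant in phase 2).
    """
    # Phase 1: send VOTE_REQUEST to everyone and collect the votes.
    votes = {name: vote() for name, vote in participants.items()}
    # Phase 2: commit only if every participant voted to commit in time.
    if all(v == VOTE_COMMIT for v in votes.values()):
        decision = GLOBAL_COMMIT
    else:
        decision = GLOBAL_ABORT
    return decision, {name: decision for name in participants}


print(two_phase_commit({"db1": lambda: VOTE_COMMIT, "db2": lambda: VOTE_COMMIT})[0])
print(two_phase_commit({"db1": lambda: VOTE_COMMIT, "db2": lambda: None})[0])
```

Treating a missing vote as an abort is the conservative choice: the coordinator never commits unless it has seen a commit vote from every participant.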
2PC (2)
- The finite state machine for the coordinator in 2PC.
- The finite state machine for a participant.
- What if a client fails?
- What if the coordinator fails?