Coordination - PowerPoint PPT Presentation
Slides: 67; Provided by: baobao3

Transcript and Presenter's Notes

Title: Coordination

1
Chapter 8
  • Coordination

2
Topics
  • Election algorithms
  • Mutual exclusion
  • Deadlock
  • Transaction

3
Election Algorithms
  • How nodes in a distributed system elect a new
    coordinator when the old one has failed or was
    cut off from the network.
  • In the following algorithms, each processor
    (node) has a unique ID. Communications are
    reliable (messages are not dropped or corrupted).

4
Requirements
  • Safety: each process Pi has coordinator = null or
    coordinator = P, where P is a live process.
  • Liveness: each process Pi eventually has
    coordinator ≠ null, or it has failed.

5
The Bully Algorithm
  • (Garcia-Molina) The node with the highest ID
    bullies its way into leadership.
  • When a process P notices that the coordinator has
    failed, it holds an election:
  • 1. P sends an ELECTION (E-message) to all
    processes with higher numbers.
  • 2. If no one responds, P wins the election and
    becomes coordinator.
  • 3. If one of the higher-ups answers, say Q, it
    takes over. P's job is done.
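The three steps above can be sketched in Python; the process IDs and the `alive` predicate below are illustrative assumptions, not part of the presentation:

```python
# Hedged sketch of the bully algorithm's election step. `alive` is a
# predicate simulating whether a process answers; in a real system
# each ELECTION message would be an RPC with a timeout.
def hold_election(my_id, all_ids, alive):
    """Return the ID of the new coordinator as seen by `my_id`."""
    higher = [p for p in all_ids if p > my_id]
    # Step 1: send ELECTION to every process with a higher ID.
    responders = [p for p in higher if alive(p)]
    if not responders:
        # Step 2: nobody answered, so this process wins.
        return my_id
    # Step 3: a higher process answers and takes over; model it as
    # the highest live responder running its own election.
    return hold_election(max(responders), all_ids, alive)

# Example: processes 0..6, the old coordinator 6 has failed.
ids = list(range(7))
alive = lambda p: p != 6
print(hold_election(4, ids, alive))  # 5: highest live process wins
```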

6
An Example
  • Process 4 holds an election
  • Process 5 and 6 respond, telling 4 to stop
  • Now 5 and 6 each hold an election

7
An Example (Cont.)
  • Process 6 tells 5 to stop
  • Process 6 wins and tells everyone

8
The Cost
  • In a network of N nodes, assume the coordinator
    with ID N fails.
  • If the process with ID (N-1) starts an election,
    the cost is O(N) messages.
  • If the lowest-numbered node starts an election,
    the cost is O(N²).

9
A Ring Election Algorithm
  • Nodes are physically or logically organized in a
    ring.
  • Nodes know their successors.
  • Node states are Normal, Election, Leader.
  • Any node that notices that the leader is not
    functioning changes its state to Election, starts
    an election message containing its ID, and sends
    it to its clockwise neighbor.

10
An Example
11
A Ring Election Algorithm (2)
  • When a node receives an election message:
  • It adds its ID to the message and sends it to its
    successor.
  • If the message already contains its own ID, it
    sends a COORDINATOR message naming the list
    member with the highest ID as the coordinator.
    This message circulates once.
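A minimal sketch of the receive step, assuming messages are represented as lists of IDs (the ring layout below is also an assumption):

```python
# Hedged sketch of the ring election's receive rule.
def on_election_message(my_id, id_list):
    """Handle an ELECTION message arriving at `my_id`.

    Returns ('ELECTION', new_list) to forward, or
    ('COORDINATOR', winner) once the message has gone full circle.
    """
    if my_id in id_list:
        # The message has circulated once: announce the highest ID.
        return ('COORDINATOR', max(id_list))
    # Otherwise append our ID and pass the message to the successor.
    return ('ELECTION', id_list + [my_id])

# Simulate one full circulation on a ring 2 -> 5 -> 0 -> 3 -> 2.
ring = [2, 5, 0, 3]
msg = [ring[0]]                         # node 2 starts the election
for node in ring[1:] + [ring[0]]:
    kind, payload = on_election_message(node, msg)
    if kind == 'COORDINATOR':
        print('coordinator =', payload)  # coordinator = 5
        break
    msg = payload
```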

12
An Example
13
An Example (Cont.)
14
An Example (Cont.)
15
Complexity
  • In the best case, only one node starts an
    election message, so the number of messages is
    2N.
  • In the worst case, N nodes start election
    messages, resulting in O(N²).
  • Improvements:
  • Drop election messages arriving in less than time
    Δ, where Δ is the time a message takes to
    traverse the ring.
  • Does it work?

16
LCR Ring Election
  • Each node sends a message with its ID around the
    ring. When a process receives an incoming
    message, it compares the ID with its own. If the
    incoming ID is greater than its own, it passes it
    to the next node; if it is less than its own, it
    discards it; if it is equal to its own, it
    declares itself leader.

[Figure: a three-node ring (IDs 0, 3, 5) circulating "Elect 0",
 "Elect 3", and "Elect 5" messages]
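The LCR rule can be simulated directly; the ring layouts below are illustrative assumptions:

```python
# Hedged simulation of LCR on a unidirectional (clockwise) ring.
def lcr_leader(ring):
    """Run LCR until some node sees its own ID come back."""
    # Each node initially sends its own ID clockwise; msgs[i] is the
    # message currently held at node i, waiting to be forwarded.
    msgs = {i: ring[i] for i in range(len(ring))}
    while True:
        nxt = {}
        for i, uid in msgs.items():
            j = (i + 1) % len(ring)      # clockwise neighbor
            if uid > ring[j]:
                nxt[j] = uid             # larger IDs are passed on
            elif uid == ring[j]:
                return uid               # own ID returned: leader
            # smaller IDs are silently discarded
        msgs = nxt

print(lcr_leader([0, 3, 5]))     # 5
print(lcr_leader([2, 1, 3, 0]))  # 3
```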
17
Complexity
  • If messages are passed clockwise, only one
    survives after the first round.
  • If messages are passed counter-clockwise...
  • Best case O(N), worst case O(N²).

[Figure: a four-node ring (IDs 0, 1, 2, 3) circulating "Elect"
 messages]
18
HS (Hirschberg-Sinclair) Ring Election (1)
  • Motivation: O(N²) is a lot of messages. Improve
    it to O(N log N).
  • Assumptions: the ring size can be unknown. The
    communications must be bidirectional. All nodes
    start more or less at the same time. Each node
    operates in phases and sends out tokens. The
    tokens carry hop counts and direction flags in
    addition to the ID of the sender.

[Figure: node 3 sending tokens "ID 3, 2 hops" both clockwise and
 counter-clockwise]
19
HS Ring Election (2)
  • Phases are numbered 0, 1, 2, 3, ..., ⌈log₂N⌉. In
    each phase k, node j sends out tokens u_j
    containing its ID in both directions.
  • The tokens travel 2^k hops and then return to
    their origin j.
  • They travel only a distance of 2^k.
  • If both tokens make it back, process j continues
    with the next phase (increments k). If both
    tokens do not make it back, process j simply
    waits to be told the result of the election.

[Figure: node 3's tokens traveling outbound, turning around after
 2^k hops, and returning inbound]
20
HS Ring Election (3)
  • All processes always relay inbound tokens.
  • If a process i receives a token u_j going in the
    outbound direction, it compares the token's ID
    with its own:
  • If it has a larger ID, it simply discards the
    token.
  • If it has a smaller ID, it relays the token as
    requested.
  • If its ID equals the token's ID, it has received
    its own token in the outbound direction, so the
    token has gone clear around the ring and the
    process declares itself leader.

[Figure: node 4 relaying the token "ID 3, 2 hops clockwise"]
21
Complexity
  • Communications complexity: In the first phase,
    every process sends out 2 tokens, and they go one
    hop and return. This is a total of 4N messages
    for the tokens to go out and return.
  • In phase k, where k > 0, a node sends out tokens
    only if it was not overruled in the previous
    phase, that is, by a process within a distance of
    2^(k-1) in either direction. This implies that
    within any group of 2^(k-1) + 1 consecutive
    nodes, at most one goes on to send out tokens in
    phase k.
  • This limits the message complexity to O(N log N).

22
Mutual Exclusion in DS
  • Mutual exclusion is needed for restricting access
    to a shared resource.
  • We use semaphores, monitors, and similar
    constructs to enforce mutual exclusion on a
    centralized system.
  • We need the same capabilities in a DS.
  • As in the one-processor case, we are interested
    in safety (mutual exclusion), progress, and
    bounded waiting (fairness).

23
Solutions
  • Centralized lock manager
  • Token-passing lock manager
  • Distributed lock manager
  • Ricart/Agrawala Algorithm
  • Voting
  • Quorum

24
A Centralized Algorithm
  • a) Process 1 asks the coordinator for permission
    to enter a critical region. Permission is
    granted.
  • b) Process 2 then asks permission to enter the
    same critical region. The coordinator does not
    reply.
  • c) When process 1 exits the critical region, it
    tells the coordinator, who then replies to 2.
25
Problems with Centralized Locking?
Other issues?
26
The Token Ring Algorithm
  • Assumption: Processes are ordered in a ring.
  • Communications are reliable and can be limited to
    one direction.
  • The size of the ring can be unknown, and each
    process is only required to know its immediate
    neighbor.
  • A single token circulates around the ring (in one
    direction only).

[Figure: a three-node ring (IDs 0, 3, 5) with a token circulating]
27
Algorithm Details
  • When a process has the token, it can enter the CR
    at most once. Then it must pass the token on.
  • Only the process with the token can enter the CR;
    thus mutual exclusion is ensured.
  • Bounded waiting holds, since the token
    circulates.
  • Liveness: as long as the process with the token
    doesn't fail, progress is ensured. Global
    snapshots can be used if a lost token is
    suspected.

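A toy simulation of the circulating token; the `wants_cs` set below stands in for real processes' requests and is an illustrative assumption:

```python
# Hedged sketch of token-ring mutual exclusion. At any moment exactly
# one node holds the token, which is why mutual exclusion holds.
def circulate(ring, wants_cs, rounds=1):
    """Pass the token around `ring`; return who entered the CS."""
    entered = []
    for _ in range(rounds):
        for node in ring:               # the token moves one hop at a time
            if node in wants_cs:
                entered.append(node)    # the holder may enter the CS once,
                wants_cs.discard(node)  # then must pass the token on
    return entered

# Nodes 0 and 5 want the critical section; the token starts at node 3.
print(circulate([3, 0, 5], {0, 5}))  # [0, 5]
```

Because the token visits every node each round, waiting is bounded by one full circulation.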
28
Problems with Token-Algorithm
  • 1. How to distinguish whether the token is lost
    or just being held for a very long time?
  • 2. What happens if the token-holder crashes for
    some time?
  • 3. How to maintain a logical ring if a
    participant drops out (voluntarily or by failure)
    of the system?
  • 4. How to identify and add new participants?
  • 5. The token is perpetually passed around the
    ring even when none of the participants wants to
    enter its CS: unnecessary overhead consuming
    bandwidth.
  • 6. The ring imposes an average delay of N/2 hops,
    limiting scalability.

29
Distributed Algorithm: Ricart and Agrawala
Timestamp Algorithm
  • Assumption: there is a total ordering of all
    events in the system (Lamport's timestamps
    provide this).
  • Communications are reliable.
  • Each process must maintain a queue for each
    critical region or resource if there is more than
    one resource to be shared.

[Figure: processes 0, 1, and 2 contending for a shared resource]
30
Ricart and Agrawala (2)
  • When a process wants to enter the Critical Region
    or obtain a resource, it sends a message with its
    ID and a Lamport timestamp (t, pid) to all other
    processes.
  • It can proceed to enter the CR when it gets an
    OK message from all other processes.
  • When it is done with the CR, it sends an OK
    message to every process on its wait queue and
    removes them from the queue.

31
Ricart and Agrawala (3)
  • When a process P1 receives a request for the
    resource from process P2:
  • If P1 is not in the CR and does not want the CR,
    it sends back an OK message.
  • If P1 is currently in the CR, it does not reply,
    but queues P2's request.
  • If P1 wants to enter the CR but has not yet
    received all the permissions, it compares the
    timestamp in P2's message with the one in the
    message that P1 sent out to request the CR. The
    lowest timestamp wins.
  • If TS(P1) < TS(P2), then P2's request is put on
    the queue.
  • If TS(P1) > TS(P2), then P1 sends P2 an OK
    message.
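The three cases above can be sketched as a single decision function; the state names and return strings below are assumptions for illustration. Timestamps are (lamport_clock, pid) pairs, compared lexicographically so ties break on process ID:

```python
# Hedged sketch of the Ricart-Agrawala receive rule.
def on_request(my_state, my_ts, peer_ts):
    """Decide what P1 does with P2's request.

    my_state is 'IDLE', 'WANTED', or 'HELD';
    returns 'OK' (reply now) or 'QUEUE' (defer the reply).
    """
    if my_state == 'IDLE':
        return 'OK'          # neither using nor wanting the CR
    if my_state == 'HELD':
        return 'QUEUE'       # in the CR: queue the request
    # Both want the CR: the lower (timestamp, pid) pair wins.
    return 'QUEUE' if my_ts < peer_ts else 'OK'

print(on_request('WANTED', (8, 1), (12, 2)))  # QUEUE: P1's request is older
print(on_request('WANTED', (12, 1), (8, 2)))  # OK: P2's request is older
```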

32
Ricart and Agrawala (4)
  • Two processes want to enter the same critical
    region at the same moment.
  • Process 0 has the lowest timestamp, so it wins.
  • When process 0 is done, it sends an OK also, so 2
    can now enter the critical region.

33
Analysis
  • No tokens anymore.
  • Cooperative voting determines the sequence of
    CSs.
  • Does not rely on an interconnection medium
    offering ordered messages.
  • Serialization is based on logical timestamps
    (total ordering).
  • If a participant wants to enter its CS, it asks
    all others for permission and does not proceed
    until all others have agreed.
  • If a participant gets a permission request and is
    not interested in its CS, it returns permission
    immediately to the requester.
  • Message complexity: 2(N-1).
  • The algorithm ensures:
  • mutual exclusion (no 2 processes have the lowest
    timestamp)
  • progress (someone has the lowest timestamp)
  • bounded waiting

34
Voting for Mutual Exclusion
  • Potential problems: You must be sure you have
    more votes than any other process to enter the
    CR. If P1 has 4 votes, P2 has 3, and P3 has 2, P1
    has the most votes, but how does it know without
    communicating (costly) with the other contenders?
    Just having 4 votes is not enough: what if P1 has
    4 and P2 has 5?
  • Potential solution: require a simple majority to
    win. But 4 is not a majority of 9, so in this
    example, no one can go. Worse, the processes are
    deadlocked.
  • There must be a way to resolve this kind of
    deadlock.

35
Timestamp Resolution
  • When a process makes a request, it attaches a
    Lamport timestamp. Voters prefer candidates with
    the smaller timestamp.
  • If voter V has voted for P1 and then receives a
    request for a vote from P2 with an earlier
    timestamp, V will try to retrieve its vote. V
    retrieves its vote by sending an INQUIRE message
    to P1. If P1 has not yet received all the needed
    votes, it must relinquish V's vote, in which case
    V now gives its vote to P2. This avoids deadlock.
  • When P1 is finished with the CR, it sends RELEASE
    messages to all its voters, so they can give
    their votes to new candidates.

36
Anti-quorum Resolution
  • An anti-quorum is any set of nodes that has a
    non-empty intersection with all quorums.
  • A voter votes YES to one process and NO to other
    processes seeking the same resource.
  • When a process gets a quorum of YES votes, it
    proceeds to the CR. When it gets an anti-quorum
    of NO votes, it knows it will not get enough YES
    votes, so it withdraws its candidacy and releases
    its votes.
  • After waiting a specified time, it tries again to
    gain enough votes.

37
Quorums
  • Do we need to get a majority of votes or is there
    some smaller set of votes that will do?
    Different nodes could have different voting
    districts as long as any two districts have a
    non-empty intersection.
  • Quorums have the property that any 2 have a
    non-empty intersection.
  • Simple majorities are quorums. Any 2 sets whose
    sizes are simple majorities must have at least
    one element in common.

38
Quorums (2)
  • Grid quorum: arrange nodes in a logical grid
    (square). A quorum is all of one row plus all of
    one column. Quorum size is 2·sqrt(N) - 1.
  • Finite Projective Plane (Maekawa): if N = 7, form
    coteries of 3.
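A small sketch of grid quorums, assuming a 3x3 grid of nine nodes; the pairwise-intersection property can be checked directly, since any row always meets any column:

```python
import math

# Hedged sketch of grid quorums on an n-node square grid.
def grid_quorum(n, row, col):
    """A quorum = all of one row plus all of one column."""
    side = int(math.isqrt(n))
    rows = {row * side + c for c in range(side)}
    cols = {r * side + col for r in range(side)}
    return rows | cols                # size 2*sqrt(n) - 1

q1 = grid_quorum(9, 0, 2)  # row 0 + column 2
q2 = grid_quorum(9, 2, 0)  # row 2 + column 0
print(len(q1))             # 5 = 2*sqrt(9) - 1
print(q1 & q2)             # never empty: row i always meets column j
```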

39
Comparison
40
Transaction Property
  • Atomicity. Either all operations of the
    transaction are properly reflected in the
    database or none are.
  • Consistency. Execution of a transaction in
    isolation preserves the consistency of the
    database.
  • Isolation. Although multiple transactions may
    execute concurrently, each transaction must be
    unaware of other concurrently executing
    transactions. Intermediate transaction results
    must be hidden from other concurrently executed
    transactions.
  • Durability. After a transaction completes
    successfully, the changes it has made to the
    database persist, even if there are system
    failures.

41
Example Funds Transfer
  • Transaction to transfer 50 from account A to
    account B:
  • 1. read(A)
  • 2. A := A - 50
  • 3. write(A)
  • 4. read(B)
  • 5. B := B + 50
  • 6. write(B)
  • Consistency requirement: the sum of A and B is
    unchanged by the execution of the transaction.
  • Atomicity requirement: if the transaction fails
    after step 3 and before step 6, the system
    ensures that its updates are not reflected in the
    database.
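A minimal sketch of the six steps with rollback-based atomicity; the dict database and the injected-failure flag are assumptions of this example:

```python
# Hedged sketch: the transfer transaction with atomicity enforced by
# restoring a snapshot if a failure occurs mid-transaction.
def transfer(db, amount, crash_after_write_a=False):
    snapshot = dict(db)            # remember old values for rollback
    try:
        a = db['A']                # 1. read(A)
        db['A'] = a - amount       # 2-3. A := A - 50; write(A)
        if crash_after_write_a:
            raise RuntimeError('failure between steps 3 and 6')
        b = db['B']                # 4. read(B)
        db['B'] = b + amount       # 5-6. B := B + 50; write(B)
    except RuntimeError:
        db.clear()
        db.update(snapshot)        # undo the partial update

db = {'A': 100, 'B': 100}
transfer(db, 50)
print(db, sum(db.values()))        # {'A': 50, 'B': 150} 200: sum preserved

db2 = {'A': 100, 'B': 100}
transfer(db2, 50, crash_after_write_a=True)
print(db2)                         # {'A': 100, 'B': 100}: rolled back
```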

42
Example Funds Transfer continued
  • Durability requirement: once the user has been
    notified that the transaction has completed
    (i.e., the transfer of the 50 has taken place),
    the updates to the DB must persist despite
    failures.
  • Isolation requirement: if, between steps 3 and 6,
    another transaction is allowed to access the
    partially updated database, it will see an
    inconsistent database (the sum A + B will be less
    than it should be). Isolation can be ensured by
    running transactions serially.

43
The Transaction Model
44
Transaction Types
  • Flat transactions
  • No partial results available
  • A nested transaction is a transaction that is
    logically decomposed into a hierarchy of
    sub-transactions.
  • Allow partial results to be committed
  • A distributed transaction is a logically flat
    indivisible transaction that operates on
    distributed data.

45
Distributed Transactions Illustration
46
Private Workspace
  • The file index and disk blocks for a three-block
    file
  • The situation after a transaction has modified
    block 0 and appended block 3
  • After committing

Q: What is the cost of copying data?
47
More Efficient Implementation
  • Two common methods of implementation are
    write-ahead logs and before/after images.
  • With write-ahead logs, the transactions act on
    the permanent workspace, but before they can make
    a change, a log record is written to stable
    storage with the transaction and data item ID and
    the old and new values.
  • This log can then be used if the transaction
    aborts and the changes need to be rolled back.
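A minimal write-ahead-log sketch; the in-memory list standing in for stable storage is an assumption of this example:

```python
# Hedged sketch of a write-ahead log: the log record (with transaction
# and item ID, old and new values) is appended BEFORE the data changes,
# so an abort can always be rolled back from the log.
log = []                   # stands in for stable storage
db = {'x': 0, 'y': 0}

def wal_write(txn, item, new_value):
    log.append((txn, item, db[item], new_value))  # log first...
    db[item] = new_value                          # ...then write

def rollback(txn):
    # Undo the transaction's changes in reverse order, using the
    # old values recorded in the log.
    for t, item, old, _new in reversed(log):
        if t == txn:
            db[item] = old

wal_write('T1', 'x', 1)
wal_write('T1', 'y', 2)
rollback('T1')             # T1 aborts: restore x and y from the log
print(db)                  # {'x': 0, 'y': 0}
```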

48
Write-ahead Log
  • a) A transaction
  • b) d) The log before each statement is executed

49
Before- and After- Images
  • A before-image and an after-image are kept for
    each data item.
  • When a data item is changed, the old value is
    kept as the before-image and the new value is
    written as the after-image.
  • Other transactions are not allowed to see the new
    value until the current transaction commits.
  • The after-image is made permanent and durable
    once the transaction that wrote it commits.
  • If the transaction aborts, the before-image is
    restored.
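A sketch of before/after images, assuming a simple `Item` record (an illustration, not part of the presentation):

```python
# Hedged sketch of before- and after-images for one data item.
class Item:
    def __init__(self, value):
        self.before = value    # last committed value, visible to others
        self.after = value     # uncommitted value, private to the writer

    def write(self, value):
        self.after = value     # record the new value as the after-image

    def commit(self):
        self.before = self.after   # the after-image becomes permanent

    def abort(self):
        self.after = self.before   # restore from the before-image

x = Item(10)
x.write(99)
print(x.before)  # 10: other transactions still see the old value
x.commit()
print(x.before)  # 99: the after-image is now durable
```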

50
DBMS Organization
  • General organization of managers for handling
    transactions.

51
DBMS Organization
52
Levels of Consistency (SQL92)
  • Serializable default
  • Repeatable read only committed records to be
    read, repeated reads of same record must return
    same value. However, a transaction may not be
    serializable.
  • Read committed only committed records can be
    read, but successive reads of record may return
    different (but committed) values.
  • Read uncommitted even uncommitted records may
    be read (browse).

53
Serializability
54
Two-Phase Locking (2PL)
55
Strict 2PL
56
Pessimistic Timestamp Ordering
  • Target: enforce serializability.
  • Every transaction gets a (Lamport, totally
    ordered) timestamp.
  • Every data item has a read-ts, a write-ts, and a
    commit bit c.
  • The commit bit c is true if and only if the most
    recent transaction to write to that item has
    committed.
  • The scheduler maintains the item timestamps and
    checks to make sure the reads and writes are
    correct.

57
Read Too Late
  • T1 tries to read X, but ts(T1) < write-ts(X),
    meaning X has been written by a later
    transaction.
  • T1 should not be allowed to read X because it was
    written by a transaction that occurs later in the
    serialization order (transactions are serialized
    by start time).
  • Solution: T1 is aborted.

58
Write Too Late
  • T1 tries to write X, but the read-ts indicates
    that some other transaction should have read the
    value about to be written.
  • Solution: T1 is aborted.

59
Dirty Reads
  • T1 reads X, which was last written by T2. The
    timestamps are properly ordered, but the commit
    bit c = false, so if T2 later aborts then T1 must
    abort.
  • Solution: We can avoid cascading aborts by
    delaying T1's read until T2 has committed (though
    this is not necessary to ensure serializability).

60
Thomas Write Rule
  • T2 has written to X before T1. When T1 tries to
    write, the appropriate action is to do nothing.
    No other transaction T3 that should have read
    T1's value of X got T2's value instead, because
    it would have been aborted for a too-late read.
    Future reads of X want T2's value or a later
    value, not T1's value.
  • Solution: T1's write can be skipped.

61
TS Ordering Rules
  • When the scheduler receives a read request from
    transaction T:
  • If ts(T) > write-ts(X) and c(X) is true, grant
    the request and set read-ts(X) to
    MAX(ts(T), read-ts(X)).
  • If ts(T) > write-ts(X) and c(X) is false, delay
    T until c(X) becomes true or the txn aborts.
  • If ts(T) < write-ts(X), abort T and restart with
    a new timestamp.

62
TS Ordering Rules, continued
  • When the scheduler receives a write request from
    transaction T:
  • If ts(T) > read-ts(X) and ts(T) > write-ts(X),
    grant the request, set write-ts(X) to ts(T), and
    set c(X) = false.
  • If ts(T) > read-ts(X) and ts(T) < write-ts(X),
    don't do the operation but allow T to continue as
    if it were done (Thomas write rule).
  • If ts(T) < read-ts(X), abort T and restart with a
    new timestamp.
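The read and write rules can be sketched together; the `Item` record and the return strings are illustrative assumptions. (The sketch uses >=, which is equivalent here since (t, pid) timestamps are unique.)

```python
# Hedged sketch of the pessimistic timestamp-ordering scheduler.
class Item:
    def __init__(self):
        self.read_ts = 0
        self.write_ts = 0
        self.c = True              # commit bit of the last writer

def ts_read(ts, item):
    if ts >= item.write_ts:
        if not item.c:
            return 'DELAY'         # wait for the writer to commit or abort
        item.read_ts = max(item.read_ts, ts)
        return 'GRANT'
    return 'ABORT'                 # read too late: X written by a later txn

def ts_write(ts, item):
    if ts >= item.read_ts and ts >= item.write_ts:
        item.write_ts, item.c = ts, False
        return 'GRANT'
    if ts >= item.read_ts:         # ts < write_ts: Thomas write rule
        return 'SKIP'              # ignore the write, let the txn continue
    return 'ABORT'                 # write too late: someone read ahead

x = Item()
print(ts_write(5, x))  # GRANT: write-ts(x) = 5, c = false
print(ts_read(3, x))   # ABORT: read too late
x.c = True             # the writer of x commits
print(ts_write(4, x))  # SKIP: Thomas write rule
print(ts_read(7, x))   # GRANT: read-ts(x) = 7
```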

63
Optimistic Timestamp Ordering
  • In any optimistic concurrency control, each
    transaction does its writes to a private
    workspace until completion of a validation phase.
  • In the validation phase, the scheduler validates
    the transaction by comparing its read set and
    write set with those of other transactions.
  • After validation, the write-set values are
    written to the database and the transaction
    commits.
  • Validation is frequently done with the help of
    timestamps.

64
Two-Phase Commit (2PC)
  • When several databases take part in a single
    transaction, a protocol called Two-Phase Commit
    is used.
  • Each database is assumed to have its own local
    resource manager.
  • A single system component called the Coordinator
    controls the whole process.

65
Steps
  • Phase 1:
  • The coordinator sends a VOTE_REQUEST message.
  • Clients return VOTE_COMMIT or VOTE_ABORT.
  • Phase 2:
  • The coordinator collects all votes and sends
    GLOBAL_COMMIT or GLOBAL_ABORT.
  • Each client commits or aborts.
  • Important factor: time-outs.
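The coordinator's decision logic can be sketched as follows, treating a missing vote (a time-out) the same as an abort vote; that treatment is an assumption consistent with the time-out note:

```python
# Hedged sketch of the two-phase commit decision at the coordinator.
def two_phase_commit(votes):
    """`votes` maps participant -> 'VOTE_COMMIT', 'VOTE_ABORT',
    or None (no reply arrived before the timeout)."""
    # Phase 1: the coordinator sent VOTE_REQUEST; collect the replies.
    if all(v == 'VOTE_COMMIT' for v in votes.values()):
        return 'GLOBAL_COMMIT'   # unanimous yes: everyone commits
    # Phase 2: any abort vote, or any timeout, forces a global abort.
    return 'GLOBAL_ABORT'

print(two_phase_commit({'db1': 'VOTE_COMMIT', 'db2': 'VOTE_COMMIT'}))
# GLOBAL_COMMIT
print(two_phase_commit({'db1': 'VOTE_COMMIT', 'db2': None}))
# GLOBAL_ABORT: db2 timed out
```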

66
2PC (2)
  • The finite state machine for the coordinator in
    2PC.
  • The finite state machine for a participant.
  • What if a client fails?
  • What if the coordinator fails?