CS60002 Distributed Systems - PowerPoint PPT Presentation

Slides: 119
Provided by: arobind



1
CS60002 Distributed Systems
2
  • Text Book
  • Advanced Concepts in Operating Systems by
    Mukesh Singhal and Niranjan G. Shivaratri
  • will cover about half the course, supplemented by
    copies of papers
  • Xerox, notes, copies of papers etc. will cover
    the rest.

3
What is a distributed system?
  • A very broad definition
  • A set of autonomous processes communicating among
    themselves to perform a task
  • Autonomous: able to act independently
  • Communication: shared memory or message passing
  • "Concurrent system" is probably a better term

4
  • A more restricted definition
  • A network of autonomous computers that
    communicate by message passing to perform some
    task
  • A practical distributed system will probably
    have both
  • Computers that communicate by messages
  • Processes/threads on a computer that communicate
    by messages or shared memory

5
Advantages
  • Resource Sharing
  • Higher Performance
  • Fault Tolerance
  • Scalability

6
Why is it hard to design them?
  • The usual problem of concurrent systems
  • Arbitrary interleaving of actions makes the
    system hard to verify
  • Plus
  • No globally shared memory (therefore hard to
    collect global state)
  • No global clock
  • Unpredictable communication delays

7
Models for Distributed Algorithms
  • Topology: completely connected, ring, tree etc.
  • Communication: shared memory/message passing
    (reliable? Delay? FIFO/Causal? Broadcast/multicast?)
  • Synchronous/asynchronous
  • Failure models (fail stop, crash, omission,
    Byzantine)
  • An algorithm needs to specify the model on which
    it is supposed to work

8
Complexity Measures
  • Message complexity: no. of messages
  • Communication complexity/bit complexity: no. of
    bits
  • Time complexity: for synchronous systems, no. of
    rounds. For asynchronous systems, several
    different definitions exist.

9
Some Fundamental Problems
  • Ordering events in the absence of a global clock
  • Capturing the global state
  • Mutual exclusion
  • Leader election
  • Clock synchronization
  • Termination detection
  • Constructing spanning trees
  • Agreement protocols

10
  • Ordering of Events and
  • Logical Clocks

11
Ordering of Events
  • Lamport's Happened Before relationship
  • For two events a and b, a → b if
  • a and b are events in the same process and a
    occurred before b, or
  • a is a send event of a message m and b is the
    corresponding receive event at the destination
    process, or
  • a → c and c → b for some event c

12
  • a → b implies a is a potential cause of b
  • Causal ordering: potential dependencies
  • Happened Before relationship causally orders
    events
  • If a → b, then a causally affects b
  • If a ↛ b and b ↛ a, then a and b are concurrent
    (a || b)

13
Logical Clock
  • Each process i keeps a clock Ci.
  • Each event a in i is timestamped C(a), the value
    of Ci when a occurred
  • Ci is incremented by 1 for each event in i
  • In addition, if a is a send of message m from
    process i to j, then on receive of m,
  • Cj = max(Cj, C(a) + 1)
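
The two update rules above can be sketched in a few lines of Python. This is only an illustration: the class name `LamportClock` and its method names are made up here, and the sketch assumes that a receive counts as an event, so the increment and the max rule are folded into one step.

```python
class LamportClock:
    """Scalar logical clock kept by one process (illustrative sketch)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local or send event: advance the clock by 1 and return the timestamp."""
        self.time += 1
        return self.time

    def receive(self, msg_time):
        """Receive event: Cj = max(Cj + 1, C(a) + 1), where C(a) is the send timestamp."""
        self.time = max(self.time + 1, msg_time + 1)
        return self.time


p, q = LamportClock(), LamportClock()
p.tick()                  # local event at p
sent = p.tick()           # send event at p
got = q.receive(sent)     # receive event at q is timestamped after the send
```

The timestamp of the receive is always larger than that of the send, which is exactly the property "a → b implies C(a) < C(b)".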

14
  • Points to note
  • if a → b, then C(a) < C(b)
  • → is an irreflexive partial order
  • Total ordering possible by arbitrarily ordering
    concurrent events by process numbers

15
Limitation of Lamport's Clock
  • a → b implies C(a) < C(b)
  • BUT
  • C(a) < C(b) doesn't imply a → b !!
  • So not a true clock !!

16
Solution: Vector Clocks
  • Ci is a vector of size n (no. of processes)
  • C(a) is similarly a vector of size n
  • Update rules
  • Ci[i]++ for every event at process i
  • if a is send of message m from i to j with vector
    timestamp tm, on receive of m
  • Cj[k] = max(Cj[k], tm[k]) for all k

17
  • For events a and b with vector timestamps ta and
    tb,
  • ta = tb iff for all i, ta[i] = tb[i]
  • ta ≠ tb iff for some i, ta[i] ≠ tb[i]
  • ta ≤ tb iff for all i, ta[i] ≤ tb[i]
  • ta < tb iff (ta ≤ tb and ta ≠ tb)
  • ta || tb iff (not ta < tb and not tb < ta)

18
  • a → b iff ta < tb
  • Events a and b are causally related iff ta < tb
    or tb < ta, else they are concurrent
  • Note that this is still not a total order
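
A minimal Python sketch of the vector clock update rules and the timestamp comparisons follows. The names `VectorClock`, `happened_before` and `concurrent` are illustrative, not from any library, and the sketch assumes a receive is itself an event at the receiver, so the receiver's own entry is incremented after the component-wise max.

```python
class VectorClock:
    """Vector clock of process i in a system of n processes (illustrative sketch)."""

    def __init__(self, n, i):
        self.i = i
        self.v = [0] * n

    def event(self):
        """Local or send event: increment own component, return a copy as the timestamp."""
        self.v[self.i] += 1
        return list(self.v)

    def receive(self, tm):
        """Receive of a message stamped tm: component-wise max, then count the receive."""
        self.v = [max(a, b) for a, b in zip(self.v, tm)]
        self.v[self.i] += 1
        return list(self.v)


def leq(ta, tb):
    return all(a <= b for a, b in zip(ta, tb))


def happened_before(ta, tb):
    """ta < tb: every component is <= and the vectors differ."""
    return leq(ta, tb) and ta != tb


def concurrent(ta, tb):
    return not happened_before(ta, tb) and not happened_before(tb, ta)


p0, p1 = VectorClock(2, 0), VectorClock(2, 1)
a = p0.event()        # send at process 0
b = p1.event()        # unrelated local event at process 1
c = p1.receive(a)     # receive of a's message at process 1
```

Here `a` and `b` are concurrent, while both happened before `c`.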

19
Causal ordering of messages: an application of
vector clocks
  • If send(m1) → send(m2), then every recipient of
    both messages m1 and m2 must deliver m1 before
    m2.
  • "deliver": when the message is actually given to
    the application for processing

20
Birman-Schiper-Stephenson Protocol
  • To broadcast m from process i, increment Ci[i],
    and timestamp m with the vector VTm = Ci
  • When j ≠ i receives m, j delays delivery of m
    until
  • Cj[i] = VTm[i] - 1 and
  • Cj[k] ≥ VTm[k] for all k ≠ i
  • Delayed messages are queued in j sorted by vector
    time. Concurrent messages are sorted by receive
    time.
  • When m is delivered at j, Cj is updated according
    to vector clock rule.
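
The delivery condition can be sketched as a small predicate. The function names `can_deliver` and `deliver` are hypothetical. The sketch assumes, as in the usual statement of the protocol, that vector entries count broadcasts only: a message from sender i is deliverable at j when it is the next broadcast j expects from i, and j has already delivered everything the sender had delivered when it sent the message.

```python
def can_deliver(cj, vtm, sender):
    """True if a broadcast stamped vtm from process `sender` may be delivered
    at a process whose current vector clock is cj."""
    for k, (local, stamp) in enumerate(zip(cj, vtm)):
        if k == sender:
            if local != stamp - 1:      # must be the next message from sender
                return False
        elif local < stamp:             # a causally earlier broadcast is still missing
            return False
    return True


def deliver(cj, vtm):
    """Vector clock rule applied on delivery: component-wise max."""
    return [max(a, b) for a, b in zip(cj, vtm)]


cj = [0, 0, 0]
m1 = ([1, 0, 0], 0)   # first broadcast of process 0
m2 = ([1, 1, 0], 1)   # broadcast of process 1, sent after it delivered m1

early = can_deliver(cj, *m2)   # m2 arrives first and must wait for m1
cj = deliver(cj, m1[0])
late = can_deliver(cj, *m2)    # after m1 is delivered, m2 can go through
```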

21
Problem of Vector Clock
  • message size increases since each message needs
    to be tagged with the vector
  • size can be reduced in some cases by only sending
    values that have changed

22
  • Capturing Global State

23
Global State Collection
  • Applications
  • Checking stable properties, checkpoint
    recovery
  • Issues
  • Need to capture both node and channel states
  • system cannot be stopped
  • no global clock

24
  • Some notations
  • LSi: local state of process i
  • send(mij): send event of message mij from
    process i to process j
  • rec(mij): similar, receive instead of send
  • time(x): time at which state x was recorded
  • time(send(m)): time at which send(m) occurred

25
  • send(mij) ∈ LSi iff
  • time(send(mij)) < time(LSi)
  • rec(mij) ∈ LSj iff
  • time(rec(mij)) < time(LSj)
  • transit(LSi, LSj) = { mij | send(mij) ∈ LSi and
    rec(mij) ∉ LSj }
  • inconsistent(LSi, LSj) = { mij | send(mij) ∉ LSi
    and rec(mij) ∈ LSj }

26
  • Global state: collection of local states
  • GS = {LS1, LS2, …, LSn}
  • GS is consistent iff
  • for all i, j, 1 ≤ i, j ≤ n,
  • inconsistent(LSi, LSj) = ∅
  • GS is transitless iff
  • for all i, j, 1 ≤ i, j ≤ n,
  • transit(LSi, LSj) = ∅
  • GS is strongly consistent if it is consistent and
    transitless.
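
To make the definitions concrete, here is a small Python sketch. The representation is an assumption made for illustration only: each local state is a dict with a `"sent"` map (destination to set of message ids recorded as sent) and a `"recv"` map (source to set of message ids recorded as received).

```python
def transit(states, i, j):
    """Messages recorded as sent by i to j but not recorded as received by j."""
    return states[i]["sent"].get(j, set()) - states[j]["recv"].get(i, set())


def inconsistent(states, i, j):
    """Messages recorded as received by j from i but not recorded as sent by i."""
    return states[j]["recv"].get(i, set()) - states[i]["sent"].get(j, set())


def pairs(states):
    return [(i, j) for i in states for j in states if i != j]


def is_consistent(states):
    return all(not inconsistent(states, i, j) for i, j in pairs(states))


def is_strongly_consistent(states):
    return is_consistent(states) and all(
        not transit(states, i, j) for i, j in pairs(states))


# m1 was sent by process 1 before its state was recorded, but process 2
# recorded its state before receiving it: consistent, yet not transitless.
gs = {
    1: {"sent": {2: {"m1"}}, "recv": {}},
    2: {"sent": {}, "recv": {}},
}
```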

27
Chandy-Lamport's Algorithm
  • Uses special marker messages.
  • One process acts as initiator, starts the state
    collection by following the marker sending rule
    below.
  • Marker sending rule for process P
  • P records its state; then, for each outgoing
    channel C from P on which a marker has not been
    sent already, P sends a marker along C before any
    further message is sent on C

28
  • When Q receives a marker along a channel C
  • If Q has not recorded its state, then Q records
    the state of C as empty; Q then follows the
    marker sending rule
  • If Q has already recorded its state, it records
    the state of C as the sequence of messages
    received along C after Q's state was recorded and
    before Q received the marker along C
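
The marker rules can be sketched for a single process as follows. Everything here is illustrative: `SnapshotProcess`, `MARKER` and the method names are invented, the application state is assumed to be an integer balance that grows by each received amount, and the caller is assumed to deliver messages in FIFO order per channel and to send the markers whenever `on_receive` returns True.

```python
MARKER = object()   # stands in for the special marker message


class SnapshotProcess:
    def __init__(self, state, incoming):
        self.state = state                  # application state (an integer here)
        self.incoming = list(incoming)      # names of incoming channels
        self.recorded = False
        self.recorded_state = None
        self.channel_state = {}             # channel -> messages recorded in transit
        self.recording = set()              # channels whose marker has not arrived yet

    def record_state(self):
        """Record own state; the caller must then send a marker on every outgoing channel."""
        self.recorded = True
        self.recorded_state = self.state
        self.channel_state = {c: [] for c in self.incoming}
        self.recording = set(self.incoming)

    def on_receive(self, channel, msg):
        """Handle one message; returns True if markers must now be sent out."""
        if msg is MARKER:
            first_marker = not self.recorded
            if first_marker:
                self.record_state()         # this channel's state stays empty
            self.recording.discard(channel)
            return first_marker
        if channel in self.recording:       # arrived after recording, before the marker
            self.channel_state[channel].append(msg)
        self.state += msg
        return False


# P (balance 7) and Q (balance 3) have each just sent money: 3 is in transit
# on channel PQ and 2 is in transit on channel QP. P initiates the snapshot.
p = SnapshotProcess(7, ["QP"])
q = SnapshotProcess(3, ["PQ"])
p.record_state()                 # P then sends MARKER on PQ, behind the 3
q.on_receive("PQ", 3)
q.on_receive("PQ", MARKER)       # Q records state 6 and sends MARKER on QP
p.on_receive("QP", 2)            # recorded as in transit on QP
p.on_receive("QP", MARKER)
```

The recorded states (7 and 6) plus the recorded channel contents (2) add up to 15, the total amount in the system, even though no single instant had exactly this state.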

29
  • Points to Note
  • Markers sent on a channel distinguish messages
    sent on the channel before the sender recorded
    its state from the messages sent after the sender
    recorded its state
  • The state collected may not be any state that
    actually happened in reality, rather a state that
    could have happened
  • Requires FIFO channels
  • Network should be strongly connected (works
    obviously for connected, undirected also)
  • Message complexity: O(E), where E = no. of links

30
Lai and Yang's Algorithm
  • Similar to Chandy-Lamport's, but does not require
    FIFO
  • Boolean value X at each node: False indicates
    state is not recorded yet, True indicates
    recorded
  • Value of X piggybacked with every application
    message
  • Value of X distinguishes pre-snapshot and
    post-snapshot messages, similar to the marker

31
  • Mutual Exclusion

32
Mutual Exclusion
  • very well-understood in shared memory systems
  • Requirements
  • at most one process in critical section (safety)
  • if more than one requesting process, someone
    enters (liveness)
  • a requesting process enters within a finite time
    (no starvation)
  • requests are granted in order (fairness)

33
Classification of Distributed Mutual Exclusion
Algorithms
  • Non-token based/Permission based
  • Permission from all processes, e.g. Lamport,
    Ricart-Agrawala, Roucairol-Carvalho etc.
  • Permission from a subset, e.g. Maekawa
  • Token based
  • ex. Suzuki-Kasami

34
Some Complexity Measures
  • No. of messages/critical section entry
  • Synchronization delay
  • Response time
  • Throughput

35
Lamport's Algorithm
  • Every node i has a request queue qi, keeps
    requests sorted by logical timestamps (total
    ordering enforced by including process id in the
    timestamps)
  • To request critical section
  • send timestamped REQUEST (tsi, i) to all other
    nodes
  • put (tsi, i) in its own queue
  • On receiving a request (tsi, i)
  • send timestamped REPLY to the requesting node i
  • put request (tsi, i) in the queue

36
  • To enter critical section
  • i enters critical section if (tsi, i) is at the
    top of its own queue, and i has received a
    message (any message) with timestamp larger than
    (tsi, i) from ALL other nodes.
  • To release critical section
  • i removes its request from its own queue and sends
    a timestamped RELEASE message to all other nodes
  • On receiving a RELEASE message from i, i's
    request is removed from the local request queue

37
  • Some points to note
  • Purpose of REPLY messages from node i to j is to
    ensure that j knows of all requests of i prior to
    sending the REPLY (and therefore, possibly any
    request of i with timestamp lower than j's
    request)
  • Requires FIFO channels.
  • 3(n - 1) messages per critical section
    invocation
  • Synchronization delay: max. message transmission
    time
  • requests are granted in order of increasing
    timestamps

38
Ricart-Agrawala Algorithm
  • Improvement over Lamport's
  • Main Idea
  • node j need not send a REPLY to node i if j has a
    request with timestamp lower than the request of
    i (since i cannot enter before j anyway in this
    case)
  • Does not require FIFO
  • 2(n - 1) messages per critical section invocation
  • Synchronization delay: max. message transmission
    time
  • requests granted in order of increasing
    timestamps

39
  • To request critical section
  • send timestamped REQUEST message (tsi, i) to all
    other nodes
  • On receiving request (tsi, i) at j
  • send REPLY to i if j is neither requesting nor
    executing critical section, or if j is requesting
    and i's request timestamp is smaller than j's
    request timestamp. Otherwise, defer the request.
  • To enter critical section
  • i enters critical section on receiving REPLY from
    all nodes
  • To release critical section
  • send REPLY to all deferred requests
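
The reply-or-defer decision at node j can be written as one small function. The function name, the state strings and the tuple layout are assumptions for this sketch; requests are `(timestamp, process id)` pairs so that Python's tuple comparison gives the total order, with ties on timestamp broken by process id.

```python
def should_reply(j_state, j_request, incoming):
    """Decide whether node j answers an incoming request immediately.

    j_state   -- "idle", "requesting" or "in_cs" (hypothetical state names)
    j_request -- j's own pending (timestamp, pid) request, or None
    incoming  -- the (timestamp, pid) request just received
    """
    if j_state == "idle":
        return True
    if j_state == "requesting" and incoming < j_request:
        return True            # the other request is older, so it goes first
    return False               # defer until j releases the critical section


deferred = []
request = (5, 1)
if not should_reply("requesting", (3, 2), request):
    deferred.append(request)   # answered later, on release
```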

40
Roucairol-Carvalho Algorithm
  • Improvement over Ricart-Agrawala
  • Main idea
  • once i has received a REPLY from j, it does not
    need to send a REQUEST to j again unless it sends
    a REPLY to j (in response to a REQUEST from j)
  • no. of messages required varies between 0 and
    2(n - 1) depending on request pattern
  • worst case message complexity still the same

41
Maekawa's Algorithm
  • Permission obtained from only a subset of other
    processes, called the Request Set (or Quorum)
  • Separate Request Set Ri for each process i
  • Requirements
  • for all i, j: Ri ∩ Rj ≠ ∅
  • for all i: i ∈ Ri
  • for all i: |Ri| = K, for some K
  • any node i is contained in exactly D Request
    Sets, for some D
  • K = D = sqrt(N) for Maekawa's

42
  • A simple version
  • To request critical section
  • i sends REQUEST message to all processes in Ri
  • On receiving a REQUEST message
  • send a REPLY message if no REPLY message has been
    sent since the last RELEASE message was received.
    Update status to indicate that a REPLY has been
    sent. Otherwise, queue up the REQUEST
  • To enter critical section
  • i enters critical section after receiving REPLY
    from all nodes in Ri

43
  • To release critical section
  • send RELEASE message to all nodes in Ri
  • On receiving a RELEASE message, send REPLY to
    next node in queue and delete the node from the
    queue. If queue is empty, update status to
    indicate no REPLY message has been sent.

44
  • Message complexity: 3 sqrt(N)
  • Synchronization delay:
  • 2 × (max. message transmission time)
  • Major problem: DEADLOCK possible
  • Need three more types of messages (FAILED,
    INQUIRE, YIELD) to handle deadlock. Message
    complexity can be 5 sqrt(N)
  • Building the request sets?
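
One simple way to build request sets is the grid construction sketched below. Note the assumption: this is not Maekawa's own construction (which reaches K of about sqrt(N) via finite projective planes) but a commonly used simpler substitute in which nodes are arranged in a sqrt(N) × sqrt(N) grid and Ri is the union of i's row and column, giving K = D = 2 sqrt(N) - 1. The function name is hypothetical.

```python
import math


def grid_request_sets(n):
    """Request sets for nodes 0..n-1 arranged row by row in a k x k grid."""
    k = math.isqrt(n)
    if k * k != n:
        raise ValueError("this sketch needs n to be a perfect square")
    sets = []
    for i in range(n):
        r, c = divmod(i, k)
        row = {r * k + x for x in range(k)}
        col = {x * k + c for x in range(k)}
        sets.append(row | col)
    return sets


sets = grid_request_sets(9)
```

Any two sets intersect because the row of one node always crosses the column of the other, which is the property mutual exclusion relies on.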

45
Token based Algorithms
  • Single token circulates, enter CS when token is
    present
  • No FIFO required
  • Mutual exclusion obvious
  • Algorithms differ in how to find and get the
    token
  • Uses sequence numbers rather than timestamps to
    differentiate between old and current requests

46
Suzuki-Kasami Algorithm
  • Broadcast a request for the token
  • Process with the token sends it to the requestor
    if it does not need it
  • Issues
  • Current vs. outdated requests
  • determining sites with pending requests
  • deciding which site to give the token to

47
  • The token
  • Queue (FIFO) Q of requesting processes
  • LN[1..n]: LN[j] is the sequence number of the
    request that j executed most recently
  • The request message
  • REQUEST(i, k): request message from node i for
    its kth critical section execution
  • Other data structures
  • RNi[1..n] for each node i, where RNi[j] is the
    largest sequence number received so far by i in a
    REQUEST message from j.

48
  • To request critical section
  • If i does not have token, increment RNi[i] and
    send REQUEST(i, RNi[i]) to all nodes
  • if i has token already, enter critical section if
    the token is idle (no pending requests), else
    follow rule to release critical section
  • On receiving REQUEST(i, sn) at j
  • set RNj[i] = max(RNj[i], sn)
  • if j has the token and the token is idle, send it
    to i if RNj[i] = LN[i] + 1. If token is not idle,
    follow rule to release critical section

49
  • To enter critical section
  • enter CS if token is present
  • To release critical section
  • set LN[i] = RNi[i]
  • For every node j which is not in Q (in token),
    add node j to Q if RNi[j] = LN[j] + 1
  • If Q is non empty after the above, delete first
    node from Q and send the token to that node
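
The release rule can be sketched as one function over the token's contents. The names `release_cs`, `rn_i`, `token_ln` and `token_q` are illustrative; nodes are numbered from 0 here, and the function only decides who gets the token next rather than sending anything.

```python
from collections import deque


def release_cs(i, rn_i, token_ln, token_q):
    """Node i leaves the critical section while holding the token.

    rn_i     -- RNi, the highest request number i has seen from each node
    token_ln -- LN array carried in the token (updated in place)
    token_q  -- FIFO queue Q carried in the token (updated in place)
    Returns the node the token should be sent to, or None if i keeps it.
    """
    token_ln[i] = rn_i[i]                       # i's latest request is now served
    for j in range(len(rn_i)):
        outstanding = rn_i[j] == token_ln[j] + 1
        if j != i and j not in token_q and outstanding:
            token_q.append(j)
    return token_q.popleft() if token_q else None


ln, q = [0, 0, 0], deque()
next_holder = release_cs(0, [1, 1, 0], ln, q)   # node 1 has an unserved request
```

A stale request from j has RNi[j] ≤ LN[j], so it never satisfies the test and is silently ignored.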

50
  • Points to note
  • No. of messages: 0 if node holds the token
    already, n otherwise
  • Synchronization delay: 0 (node has the token) or
    max. message delay (token is elsewhere)
  • No starvation

51
Raymond's Algorithm
  • Forms a directed tree (logical) with the
    token-holder as root
  • Each node has a variable Holder that points to
    its parent on the path to the root. Root's Holder
    variable points to itself
  • Each node i has a FIFO request queue Qi

52
  • To request critical section
  • Send REQUEST to parent on the tree, provided i
    does not hold the token currently and Qi is
    empty. Then place request in Qi
  • When a non-root node j receives a request from i
  • place request in Qj
  • send REQUEST to parent if no previous REQUEST
    sent

53
  • When the root receives a REQUEST
  • send the token to the requesting node
  • set Holder variable to point to that node
  • When a node receives the token
  • delete first entry from the queue
  • send token to that node
  • set Holder variable to point to that node
  • if queue is non-empty, send a REQUEST message to
    the parent (node pointed at by Holder variable)

54
  • To execute critical section
  • enter if token is received and own entry is at
    the top of the queue; delete the entry from the
    queue
  • To release critical section
  • if queue is non-empty, delete first entry from
    the queue, send token to that node and make
    Holder variable point to that node
  • If queue is still non-empty, send a REQUEST
    message to the parent (node pointed at by Holder
    variable)

55
  • Points to note
  • Avg. message complexity: O(log n)
  • Sync. delay: (T log n)/2, where T = max. message
    delay

56
Leader Election
57
Leader Election in Rings
  • Models
  • Synchronous or Asynchronous
  • Anonymous (no unique id) or Non-anonymous (unique
    ids)
  • Uniform (no knowledge of n, the number of
    processes) or non-uniform (knows n)
  • Known Impossibility Result
  • There is no Synchronous, non-uniform leader
    election protocol for anonymous rings
  • Implications ??

58
Election in Asynchronous Rings
  • LeLann-Chang-Roberts Algorithm
  • send own id to node on left
  • if an id received from right, forward id to left
    node only if received id greater than own id,
    else ignore
  • if own id received, declares itself leader
  • works on unidirectional rings
  • message complexity: O(n^2)
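
A small simulation makes the forwarding rule and the quadratic worst case visible. The sketch below is an assumption-laden illustration: `lcr_election` is a made-up name, the ring is a Python list where position k always sends to position (k + 1) mod n, and messages are processed from a single queue, which is one possible asynchronous schedule.

```python
from collections import deque


def lcr_election(ids):
    """Simulate the election on a unidirectional ring with unique ids.

    Returns (position of the leader, total number of messages sent).
    """
    n = len(ids)
    queue = deque(((k + 1) % n, ids[k]) for k in range(n))   # (destination, id carried)
    messages = n
    leader = None
    while queue:
        dest, uid = queue.popleft()
        if uid == ids[dest]:
            leader = dest                      # own id came all the way around
        elif uid > ids[dest]:
            queue.append(((dest + 1) % n, uid))
            messages += 1
        # smaller ids are swallowed
    return leader, messages


leader, count = lcr_election([3, 1, 4, 2])
worst = lcr_election([4, 3, 2, 1])   # ids decreasing along the direction of travel
```

With ids decreasing in the direction of travel, the id at distance k from the maximum survives n - k hops, which gives n(n + 1)/2 messages in total.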

59
  • Hirschberg-Sinclair Algorithm
  • operates in phases, requires bidirectional ring
  • In the kth phase, send own id to 2^k processes on
    both sides of yourself (send directly only to the
    next process, with the id and k in the message)
  • if id received, forward if received id greater
    than own id, else ignore
  • last process in the chain sends a reply to
    originator if its id is less than received id
  • replies are always forwarded
  • A process goes to the (k+1)th phase only if it
    receives a reply from both sides in the kth phase
  • process receiving its own id declares itself
    leader

60
  • Message complexity: O(n lg n)
  • Lots of other algorithms exist for rings
  • Lower Bound Result
  • Any comparison-based leader election algorithm in
    a ring requires Ω(n lg n) messages
  • What if not comparison-based?

61
Leader Election in Arbitrary Networks
  • FloodMax
  • synchronous, round-based
  • at each round, each process sends the max. id
    seen so far (not necessarily its own) to all its
    neighbors
  • after diameter no. of rounds, if max. id seen =
    own id, declares itself leader
  • Complexity: O(d·m), where d = diameter of the
    network, m = no. of edges
  • does not extend to asynchronous model trivially
  • Variations of building different types of
    spanning trees with no pre-specified roots.
    Chosen root at the end is the leader (Ex., the
    DFS spanning tree algorithm we covered earlier)
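
FloodMax is easy to simulate round by round. In this sketch the function name `floodmax`, the adjacency-list dict and the id dict are all illustrative, and the diameter is assumed to be known to every process, as the algorithm requires.

```python
def floodmax(adj, ids, diameter):
    """Synchronous FloodMax on an undirected graph.

    adj -- node -> list of neighbours
    ids -- node -> unique id
    Returns the list of nodes that declare themselves leader.
    """
    best = dict(ids)                       # max id seen so far at each node
    for _ in range(diameter):
        # all messages of a round are sent before any of them is processed
        incoming = {u: [best[v] for v in adj[u]] for u in adj}
        for u in adj:
            best[u] = max([best[u]] + incoming[u])
    return [u for u in adj if best[u] == ids[u]]


# path a - b - c - d, diameter 3
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
ids = {"a": 5, "b": 2, "c": 9, "d": 1}
leaders = floodmax(adj, ids, 3)
```

Each round uses one message per direction of every edge, so d rounds cost 2·d·m messages, matching the O(d·m) bound.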

62
  • Clock Synchronization

63
Clock Synchronization
  • Multiple machines with physical clocks. How can
    we keep them more or less synchronized?
  • Internal vs. External synchronization
  • Perfect synchronization not possible because of
    communication delays
  • Even synchronization within a bound can not be
    guaranteed with certainty because of
    unpredictability of communication delays.
  • But still useful !! Ex. Kerberos, GPS

64
How clocks work
  • Computer clocks are crystals that oscillate at a
    certain frequency
  • Every H oscillations, the timer chip interrupts
    once (clock tick). No. of interrupts per second
    is typically 18.2, 50, 60 or 100; can be higher,
    settable in some cases
  • The interrupt handler increments a counter that
    keeps track of no. of ticks from a reference in
    the past (epoch)
  • Knowing no. of ticks per second, we can calculate
    year, month, day, time of day etc.

65
Clock Drift
  • Unfortunately, period of crystal oscillation
    varies slightly
  • If it oscillates faster, there are more ticks per
    real second, so the clock runs faster; similarly
    for slower clocks
  • For machine p, when correct reference time is t,
    let machine clock show time as C = Cp(t)
  • Ideally, Cp(t) = t for all p, t
  • In practice,
  • 1 - ρ ≤ dC/dt ≤ 1 + ρ
  • ρ = max. clock drift rate, usually around 10^-5
    for cheap oscillators
  • Drift => skew between clocks (difference in clock
    values of two machines)

66
Resynchronization
  • Periodic resynchronization needed to offset skew
  • If two clocks are drifting in opposite
    directions, max. skew after time t is 2ρt
  • If application requires that clock skew < δ, then
    resynchronization period
  • r < δ/(2ρ)
  • Usually ρ and δ are known
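
As a quick worked example of the bound on the resynchronization period, the sketch below plugs in numbers. The function name and the sample values are illustrative: `rho` is the maximum drift rate and `delta` is the largest skew the application tolerates, both in consistent units.

```python
def max_resync_period(rho, delta):
    """Upper bound on the resynchronization period: delta / (2 * rho).

    Two clocks drifting in opposite directions separate at rate 2 * rho,
    so they stay within delta of each other for at most this long.
    """
    return delta / (2 * rho)


# drift rate 10^-5 and an allowed skew of 1 millisecond
period = max_resync_period(1e-5, 1e-3)   # about 50 seconds
```

So with a cheap oscillator and a 1 ms skew requirement, the clocks would need to resynchronize roughly every 50 seconds or more often.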

67
Cristian's Algorithm
  • One m/c acts as the time server
  • Each m/c sends a message periodically (within
    resync. period r) asking for current time
  • Time server replies with its time
  • Sender sets its clock to the reply
  • Problems
  • message delay
  • time server time is less than sender's current
    time

68
  • Handling message delay: try to estimate the time
    the message with the time server's time took to
    reach the sender
  • measure round trip time and halve it
  • make multiple measurements of round trip time,
    discard too high values, take average of rest
  • make multiple measurements and take minimum
  • use knowledge of processing time at server if
    known
  • Handling fast clocks
  • do not set clock backwards; slow it down over a
    period of time to bring it in tune with the
    server's clock
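
The delay estimates listed above can be sketched as follows. The function names and the sample format are assumptions for illustration: `t_send` and `t_recv` are read from the client's own clock, `server_time` is the value carried in the reply, and the one-way delay is taken to be half of the round trip after subtracting any known server processing time.

```python
def cristian_estimate(t_send, t_recv, server_time, server_processing=0.0):
    """Estimate the current server time at the moment the reply arrives."""
    round_trip = t_recv - t_send
    one_way = (round_trip - server_processing) / 2
    return server_time + one_way


def best_of(samples):
    """Use the sample with the smallest round trip, which bounds the error most tightly.

    samples -- list of (t_send, t_recv, server_time) tuples
    """
    t_send, t_recv, server_time = min(samples, key=lambda s: s[1] - s[0])
    return cristian_estimate(t_send, t_recv, server_time)


single = cristian_estimate(100.0, 100.2, 105.0)                 # about 105.1
several = best_of([(0.0, 0.5, 10.0), (1.0, 1.1, 11.0)])         # about 11.05
```

If the estimate is behind the local clock, the local clock should be slowed down gradually rather than set back, as noted above.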

69
Berkeley Algorithm
  • Centralized as in Cristian's, but the time server
    is active
  • time server asks for time of other m/cs at
    periodic intervals
  • time server averages the times and sends the new
    time to m/cs
  • M/cs set their time (advance immediately or
    slow down gradually) to the new time
  • Estimation of transmission delay as before
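
The averaging step can be sketched in a few lines. The function name is made up, the times are assumed to be already corrected for transmission delay, and the server is assumed to include its own clock in the average and to send each machine the amount to adjust by rather than an absolute time.

```python
def berkeley_adjustments(server_time, client_times):
    """Average all clocks and compute the correction for each machine.

    Returns (average, adjustments), where adjustments[0] is for the server
    and adjustments[k] is for client k - 1. A negative value means the
    machine is ahead and should slow down gradually.
    """
    times = [server_time] + list(client_times)
    average = sum(times) / len(times)
    return average, [average - t for t in times]


# times in minutes past midnight: server at 3:00, clients at 3:25 and 2:50
average, adjustments = berkeley_adjustments(180, [205, 170])
```

Here the average is 3:05, so the server advances by 5 minutes, the first client has to lose 20 minutes and the second advances by 15.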

70
External Synchronization
  • Clocks must be synchronized with real time
  • Cristian's algorithm can be used if the time
    server is synchronized with real time somehow
  • Berkeley algorithm cannot be used
  • But what is real time anyway?

71
Measurement of time
  • Astronomical
  • traditionally used
  • based on earth's rotation around its axis and
    around the sun
  • solar day: interval between two consecutive
    transits of the sun
  • solar second: 1/86,400 of a solar day
  • period of earth's rotation varies, so solar
    second is not stable
  • mean solar second: average length of a large no.
    of solar days, divided by 86,400

72
  • Atomic
  • based on the transitions of the Cesium 133 atom
  • 1 sec. = time for 9,192,631,770 transitions
  • about 50 labs maintain Cesium clocks
  • International Atomic Time (TAI): mean no. of
    ticks of the clocks since Jan 1, 1958
  • highly stable
  • But slightly out of sync with mean solar day
    (since solar day is getting longer)
  • A leap second is inserted occasionally to
    bring it in sync. (so far 32, all positive)
  • Resulting clock is called UTC (Coordinated
    Universal Time)

73
  • UTC time is broadcast from different sources
    around the world, ex.
  • National Institute of Standards and Technology
    (NIST) runs radio stations, most famous being
    WWV; anyone with a proper receiver can tune in
  • United States Naval Observatory (USNO) supplies
    time to all defense sources, among others
  • National Physical Laboratory in UK
  • GPS satellites
  • Many others

74
NTP: Network Time Protocol
  • Protocol for time sync. in the internet
  • Hierarchical architecture
  • primary time servers (stratum 1) synchronize to
    national time standards via radio, satellite etc.
  • secondary servers and clients (stratum 2, 3, …)
    synchronize to primary servers in a hierarchical
    manner (stratum 2 servers sync. with stratum 1,
    stratum 3 with stratum 2 etc.).

75
  • Reliability ensured by redundant servers
  • Communication by multicast (usually within LAN
    servers), symmetric (usually within multiple
    geographically close servers), or client server
    (to higher stratum servers)
  • Complex algorithms to combine and filter times
  • Sync. possible to within tens of milliseconds for
    most machines
  • But, just a best-effort service, no guarantees
  • RFC 1305 and www.eecis.udel.edu/ntp/ for more
    details

76
  • Termination Detection

77
Termination Detection
  • Model
  • processes can be active or idle
  • only active processes send messages
  • idle process can become active on receiving a
    computation message
  • active process can become idle at any time
  • termination: all processes are idle and no
    computation messages are in transit
  • Can use global snapshot to detect termination
    also

78
Huang's Algorithm
  • One controlling agent, has weight 1 initially
  • All other processes are idle initially and have
    weight 0
  • Computation starts when controlling agent sends a
    computation message to a process
  • An idle process becomes active on receiving a
    computation message
  • B(DW): computation message with weight DW. Can
    be sent only by the controlling agent or an
    active process
  • C(DW): control message with weight DW, sent by
    active processes to controlling agent when they
    are about to become idle

79
  • Let current weight at process = W
  • Send of B(DW)
  • Find W1, W2 such that W1 > 0, W2 > 0, W1 + W2 = W
  • Set W = W1 and send B(W2)
  • Receive of B(DW)
  • W = W + DW
  • if idle, become active
  • Send of C(DW)
  • send C(W) to controlling agent
  • Become idle
  • Receive of C(DW)
  • W = W + DW
  • if W = 1, declare termination
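
The weight bookkeeping can be sketched with exact fractions, which avoids the rounding problems that repeated halving would cause with floats. The class and method names are illustrative, and splitting the weight in half on every send is just one valid choice of W1 and W2.

```python
from fractions import Fraction


class Node:
    def __init__(self, weight=0):
        self.weight = Fraction(weight)
        self.active = False

    def send_computation(self):
        """Split the weight: keep W1, return W2 to be carried by B(W2)."""
        dw = self.weight / 2
        self.weight -= dw
        return dw

    def receive_computation(self, dw):
        self.weight += dw
        self.active = True

    def become_idle(self):
        """Return the whole weight, to be carried by C(W) to the controlling agent."""
        dw, self.weight = self.weight, Fraction(0)
        self.active = False
        return dw


class ControllingAgent(Node):
    def __init__(self):
        super().__init__(1)

    def receive_control(self, dw):
        """Add the returned weight; True means termination is detected."""
        self.weight += dw
        return self.weight == 1


agent, p, q = ControllingAgent(), Node(), Node()
p.receive_computation(agent.send_computation())   # agent 1/2, p 1/2
q.receive_computation(p.send_computation())       # p 1/4, q 1/4
first = agent.receive_control(p.become_idle())    # 3/4 collected, not yet done
second = agent.receive_control(q.become_idle())   # all weight is back
```

The invariant is that the weights held by processes plus the weights in messages always sum to 1, so the agent's weight returns to 1 only when nobody else is active and nothing is in transit.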

80
  • Building Spanning Trees

81
Building Spanning Trees
  • Applications
  • Broadcast
  • Convergecast
  • Leader election
  • Two variations
  • from a given root r
  • root is not given a-priori

82
  • Flooding Algorithm
  • starts from a given root r
  • r initiates by sending message M to all
    neighbours, sets its own parent to nil
  • For all other nodes, on receiving M from i for
    the first time, set parent to i and send M to all
    neighbors except i. Ignore any M received after
    that
  • Tree built is an arbitrary spanning tree
  • Message complexity
  • 2m - (n - 1), where m = no. of edges
  • Time complexity ??
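
A short simulation of the flooding algorithm also lets one check the message count. In this sketch `flood_spanning_tree` is an invented name, the graph is an adjacency-list dict, and a single FIFO queue of in-flight messages stands in for the network, which is one possible delivery order.

```python
from collections import deque


def flood_spanning_tree(adj, root):
    """Build a spanning tree by flooding M from root.

    Returns (parent map, number of messages sent).
    """
    parent = {root: None}
    queue = deque((root, v) for v in adj[root])    # (sender, receiver) of each M
    messages = len(queue)
    while queue:
        sender, node = queue.popleft()
        if node in parent:
            continue                               # M seen before: ignore
        parent[node] = sender
        for v in adj[node]:
            if v != sender:                        # forward to all except the parent
                queue.append((node, v))
                messages += 1
    return parent, messages


# triangle a-b-c with a pendant node d attached to b: n = 4, m = 4
adj = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
parent, messages = flood_spanning_tree(adj, "a")
```

The root sends on all of its edges and every other node sends on all edges but one, so the total is 2m - (n - 1); for this graph that is 5.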

83
Constructing a DFS tree with given root
  • plain parallelization of the sequential algorithm
    by introducing synchronization
  • each node i has a set unexplored, initially
    contains all neighbors of i
  • A node i (initiated by the root) considers nodes
    in unexplored one by one, sending a neighbor j a
    message M and then waiting for a response (parent
    or reject) before considering the next node in
    unexplored
  • if j has already received M from some other node,
    j sends a reject to i

84
  • else, j sets i as its parent, and considers nodes
    in its unexplored set one by one
  • j will send a parent message to i only when it
    has considered all nodes in its unexplored set
  • i then considers the next node in its unexplored
    set
  • Algorithm terminates when root has received
    parent or reject message from all its neighbours
  • Worst case no. of messages: 4m
  • Time complexity: O(m)

85
What if no root given?
  • Main idea
  • Nodes have unique ids
  • A node starts building the DFS tree with itself
    as root (a single node fragment) spontaneously as
    in the previous case
  • Fragments of the spanning tree get built in
    parallel; all nodes in each fragment are
    identified by the id of its root
  • M carries the fragment id of the sender

86
  • when M sent from node in lower id fragment to
    node in higher id fragment, lower id fragment is
    stalled by higher id fragment by not giving a
    response
  • When M sent from higher to lower id fragment,
    node in lower id fragment switches parent to node
    in higher id tree, resets unexplored, and starts
    DFS again
  • Eventually, the highest id node becomes the root
    (leader election!!)
  • Message complexity O(mn) !!
  • Time complexity O(m)

87
What about MSTs??
  • Gallager-Humblet-Spira Algorithm
  • much more complex! but similar to Kruskal's
  • no root given, edge weights assumed to be
    distinct
  • MST built up in fragments (subtrees of MST)
  • initially each node in its own fragment
  • fragments merge, finally just one fragment
  • outgoing edge: edge that goes between two
    fragments
  • known result: min. wt. outgoing edge of a
    fragment is always in the MST

88
  • Issues
  • How does a node find its min. wt. outgoing edge?
  • How does a fragment find its min. wt. outgoing
    edge?
  • When do two fragments merge?
  • How do two fragments merge?

89
Some definitions
  • Each node has three states
  • Sleeping: initial state
  • Find: currently finding the fragment's min. wt.
    outgoing edge
  • Found: found the min. wt. outgoing edge
  • Each fragment has a level
  • initially, each node is in a fragment of level 0

90
Merging rule for fragments
  • Suppose F is a fragment with id X, level L, and
    min. wt. outgoing edge e. Let fragment at other
    end of e be F1, with id X1 and level L1. Then
  • if L < L1, F merges into F1; new fragment has id
    X1, level L1
  • if L = L1, and e is also the min. wt. outgoing edge
    for F1, then F and F1 merge; new fragment has id
    X2 = weight of e, and level L + 1; e is called
    the core edge
  • otherwise, F waits until one of the above becomes
    true

91
How to find min. wt. outgoing edge of a fragment
  • nodes on the core edge broadcast an initiate
    message to all fragment nodes along fragment
    edges; it contains level and id
  • on receiving initiate, a node finds its min. wt.
    outgoing edge (in Find state). How?
  • nodes send a Report message with min. wt. edge up
    towards the core edge along fragment edges (and
    enter Found state)
  • leaves send their min. wt. outgoing edge;
    intermediate nodes send the min. of their own min.
    wt. outgoing edge and the min. edge sent by their
    children in the fragment; path info to the best
    edge is kept
  • when Report reaches the nodes on the core edge,
    min. wt. outgoing edge of the fragment is known.

92
What then???
  • nodes on core edges send Change_core message to
    node i with min. wt. outgoing edge
  • node i then sends a Connect message to node j at
    other end with its level
  • If j's fragment level is greater than i's,
    initiate message sent from j to i. This updates
    level and id of all nodes in i's old fragment; if
    j has not sent a Report message yet, nodes in i's
    old fragment start finding their min. wt. outgoing
    edge, else not.
  • if j's fragment level is less, no response sent
    and i just waits till j's fragment level becomes
    higher
  • if fragment levels match and j sends Connect to i
    also, merge into a level L + 1 fragment with new
    core edge and id, and send initiate message

93
  • some more details skipped, read paper
  • Algo. terminates when no outgoing edge found for
    a fragment
  • Worst case message complexity: O(n log n + m)

94
  • Fault Tolerance
  • and
  • Recovery

95
Fault Tolerance and Recovery
  • Classification of faults
  • based on component that failed
  • program/process
  • processor/machine
  • link
  • storage
  • clock
  • based on behavior of faulty component
  • Crash: just halts
  • Failstop: crash with additional conditions
  • Omission: fails to perform some steps
  • Byzantine: behaves arbitrarily
  • Timing: violates timing constraints

96
  • Types of tolerance
  • Masking: system always behaves as per
    specifications even in presence of faults
  • Non-masking: system may violate specifications
    in presence of faults. Should at least behave in
    a well-defined manner
  • Fault tolerant system should specify
  • Class of faults tolerated
  • what tolerance is given from each class

97
  • Some building blocks (assumptions/primitives that
    help in designing fault tolerant systems)
  • Agreement (multiple processes agree on some
    value)
  • Clock synchronization
  • Stable storage (data accessible after crash)
  • Reliable communication (point-to-point,
    broadcast, multicast)
  • Atomic actions

98
Agreement Problems
  • Model
  • total n processes, at most m of which can be
    faulty
  • reliable communication medium
  • fully connected
  • receiver always knows the identity of the sender
    of a message
  • byzantine faults
  • synchronous system. In each round, a process
    receives messages, performs computation, and
    sends messages.

99
Different problem variations
  • Byzantine agreement (or Byzantine Generals
    problem)
  • one process x broadcasts a value v
  • all nonfaulty processes must agree on a common
    value (Agreement condition).
  • The agreed upon value must be v if x is nonfaulty
    (Validity condition)
  • Consensus
  • Each process broadcasts its initial value
  • satisfy agreement condition
  • If initial value of all nonfaulty processes is v,
    then the agreed upon value must be v

100
  • Interactive Consistency
  • Each process i broadcasts its own value vi
  • all nonfaulty processes agree on a common vector
    (v1, v2, ..., vn)
  • If the ith process is nonfaulty, then the ith
    value in the vector agreed upon by nonfaulty
    processes must be vi
  • Solution to Byzantine agreement problem implies
    solution to other two
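
The reduction in the last bullet can be sketched as follows. This is a minimal illustration, not the construction from the text: `ba(source, value)` is a hypothetical stand-in for any correct Byzantine agreement protocol, returning the value a nonfaulty process decides when `source` broadcasts `value`. Interactive consistency runs one agreement instance per process, and consensus takes the majority of the resulting vector (with an assumed default of 0 when there is no strict majority).

```python
from collections import Counter
from typing import Callable, List


def interactive_consistency(initial: List[int],
                            ba: Callable[[int, int], int]) -> List[int]:
    """Run one Byzantine agreement instance per process, with process i
    as the source. `ba` is an assumed black box, not a real library call."""
    return [ba(i, v) for i, v in enumerate(initial)]


def consensus(initial: List[int],
              ba: Callable[[int, int], int],
              default: int = 0) -> int:
    """Decide on the strict-majority entry of the agreed vector,
    falling back to `default` when no strict majority exists."""
    vector = interactive_consistency(initial, ba)
    value, count = Counter(vector).most_common(1)[0]
    return value if 2 * count > len(vector) else default


# Fault-free stand-in for the agreement primitive (illustration only).
fault_free_ba = lambda source, value: value

vector = interactive_consistency([1, 0, 1, 1], fault_free_ba)
decision = consensus([1, 0, 1, 1], fault_free_ba)
```

Since all nonfaulty processes obtain the same vector, they compute the same majority, and if every nonfaulty process starts with v then more than half of the entries are v whenever n is at least 3m + 1.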

101
Byzantine Agreement Problem
  • no solution possible if
  • asynchronous system, or
  • n < (3m + 1)
  • needs at least (m + 1) rounds of message exchange
    (lower bound result)
  • Oral messages: messages can be forged/changed
    in any manner, but the receiver always knows the
    sender
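
As a quick arithmetic check of the two bounds above, the small helpers below (names are illustrative, not from the text) compute the largest number of Byzantine faults m that n processes can tolerate, and the minimum number of rounds for a given m.

```python
def max_byzantine_faults(n: int) -> int:
    """Largest m satisfying n >= 3m + 1."""
    return (n - 1) // 3


def min_rounds(m: int) -> int:
    """Lower bound on rounds of message exchange for m faults."""
    return m + 1


tolerated = {n: max_byzantine_faults(n) for n in (3, 4, 7, 10)}
```

For example, three processes cannot tolerate even one Byzantine fault, while four processes can tolerate one and need at least two rounds.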

102
Lamport-Shostak-Pease Algorithm
  • Recursively defined
  • OM(m), m > 0
  • Source x broadcasts value to all processes
  • Let vi = value received by process i from source
    (0 if no value received). Process i acts as a new
    source and initiates OM(m - 1), sending vi to
    remaining (n - 2) processes
  • For each i, j, i ≠ j, let vj = value received by
    process i from process j in step 2 using OM(m - 1).
    Process i uses the value majority(v1, v2, ...,
    vn-1)

103
  • OM(0)
  • Source x broadcasts value to all processes
  • Each process uses the value received; if no value
    is received, 0 is used
  • Time complexity: m + 1 rounds
  • Message complexity: O(n^m)
  • You can reduce message complexity to polynomial
    by increasing time
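
The recursive definition of OM(m) can be simulated in a few lines. This is a sketch under stated assumptions: the whole system is simulated in one program, the `send` helper and its traitor strategy (`receiver % 2`) are made up for illustration, and `majority` returns the default 0 when no strict majority exists.

```python
from collections import Counter

DEFAULT = 0  # value used when nothing usable is received


def majority(values):
    """Strict majority of `values`, or DEFAULT if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else DEFAULT


def send(sender, receiver, value, traitors):
    """Value that `receiver` actually gets from `sender`.
    A traitor lies with a hypothetical strategy: receiver % 2."""
    return receiver % 2 if sender in traitors else value


def om(m, source, value, lieutenants, traitors):
    """Simulate OM(m); return {lieutenant: value it settles on}."""
    # Step 1: the source sends its value to every lieutenant.
    received = {i: send(source, i, value, traitors) for i in lieutenants}
    if m == 0:
        return received
    # Step 2: each lieutenant j acts as source in OM(m - 1),
    # relaying what it received to the remaining lieutenants.
    relayed = {}
    for j in lieutenants:
        others = [k for k in lieutenants if k != j]
        relayed[j] = om(m - 1, j, received[j], others, traitors)
    # Step 3: each lieutenant takes the majority of its own value
    # and the values obtained from the other lieutenants.
    decisions = {}
    for i in lieutenants:
        votes = [received[i]] + [relayed[j][i] for j in lieutenants if j != i]
        decisions[i] = majority(votes)
    return decisions


# n = 4, m = 1: loyal source 0 sends 1, lieutenant 3 is a traitor.
loyal_source = om(1, 0, 1, [1, 2, 3], traitors={3})
# n = 4, m = 1: the source itself is the traitor.
faulty_source = om(1, 0, 1, [1, 2, 3], traitors={0})
```

In the first run the loyal lieutenants 1 and 2 both settle on the source's value; in the second run all three lieutenants settle on the same value even though the source sent conflicting values. The entries computed for traitors are meaningless and should be ignored.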

104
Atomic Actions and Commit Protocols
  • An action may have multiple subactions executed
    by different processes at different nodes of a
    distributed system
  • Atomic action: either all subactions are done or
    none are done (all-or-nothing property/global
    atomicity property) as far as system state is
    concerned
  • Commit protocols: protocols for enforcing global
    atomicity property

105
Two-Phase Commit
  • Assumes the presence of write-ahead log at each
    process to recover from local crashes
  • One process acts as coordinator
  • Phase 1
  • coordinator sends COMMIT_REQUEST to all processes
  • waits for replies from all processes
  • on receiving a COMMIT_REQUEST, a process, if the
    local transaction is successful, writes Undo/redo
    logs in stable storage, and sends an AGREED
    message to the coordinator. Otherwise, sends an
    ABORT

106
  • Phase 2
  • If all processes reply AGREED, coordinator writes
    COMMIT record into the log, then sends COMMIT to
    all processes. If at least one process has
    replied ABORT, coordinator sends ABORT to all.
    Coordinator then waits for ACK from all
    processes. If ACK is not received within timeout
    period, resend. If all ACKs are received,
    coordinator writes COMPLETE to log
  • On receiving a COMMIT, a process releases all
    resources/locks, and sends an ACK to coordinator
  • On receiving an ABORT, a process undoes the
    transaction using Undo log, releases all
    resources/locks, and sends an ACK
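
The two phases can be sketched as a single-program simulation. This is only an illustration under simplifying assumptions: messages are modeled as method calls, stable storage as Python lists, and timeouts, retransmission, and crashes are left out. The class and function names are made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Participant:
    name: str
    can_commit: bool                      # did the local transaction succeed?
    log: List[str] = field(default_factory=list)  # stands in for stable storage
    state: str = "ACTIVE"

    def on_commit_request(self) -> str:
        if self.can_commit:
            self.log.append("UNDO/REDO")  # write logs before voting
            return "AGREED"
        return "ABORT"

    def on_commit(self) -> str:
        self.state = "COMMITTED"          # release resources/locks here
        return "ACK"

    def on_abort(self) -> str:
        self.state = "ABORTED"            # undo using the undo log here
        return "ACK"


def two_phase_commit(participants: List[Participant],
                     coordinator_log: List[str]) -> str:
    # Phase 1: collect votes from all participants.
    votes = [p.on_commit_request() for p in participants]
    # Phase 2: commit only if everyone agreed, otherwise abort.
    if all(v == "AGREED" for v in votes):
        coordinator_log.append("COMMIT")
        acks = [p.on_commit() for p in participants]
        decision = "COMMIT"
    else:
        acks = [p.on_abort() for p in participants]
        decision = "ABORT"
    if all(a == "ACK" for a in acks):
        coordinator_log.append("COMPLETE")
    return decision


ok_group = [Participant("a", True), Participant("b", True)]
ok_log: List[str] = []
ok_decision = two_phase_commit(ok_group, ok_log)

bad_group = [Participant("a", True), Participant("b", False)]
bad_log: List[str] = []
bad_decision = two_phase_commit(bad_group, bad_log)
```

With one participant unable to commit, every participant ends in the aborted state, which is the all-or-nothing behavior the protocol is meant to enforce.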

107
  • Ensures global atomicity: either all processes
    commit or all of them abort
  • Resilient to crash failures (see text for
    different scenarios of failure)
  • Blocking protocol: crash of coordinator can
    block all processes
  • Non-blocking protocols are possible, e.g., the
    Three-Phase Commit protocol; we will not discuss
    them in this class

108
Checkpointing and Rollback Recovery
  • Error recovery
  • Forward error recovery: assess damage due to
    faults exactly and repair the erroneous part of
    the system state
  • less overhead but hard to assess effect of faults
    exactly in general
  • Backward error recovery: on a fault, restore
    system state to a previous error-free state and
    restart from there
  • costlier, but more general,
    application-independent technique

109
  • Checkpoint and Rollback Recovery: a form of
    backward error recovery
  • Checkpoint
  • local checkpoint: local state of a process saved
    in stable storage for possible rollback on a
    fault
  • global checkpoint: collection of local
    checkpoints, one from each process
  • Consistent and Strongly Consistent Global
    Checkpoint: similar to consistent and strongly
    consistent global state respectively (also called
    recovery line)

110
  • Orphan message: a message whose receive is
    recorded in some local checkpoint of a global
    checkpoint but whose send is not recorded in any
    local checkpoint in that global checkpoint (Note:
    a consistent global checkpoint cannot have an
    orphan message)
  • Lost message: a message whose send is recorded
    but whose receive is not in a global checkpoint
  • Are lost messages a problem?
  • not if unreliable channels are assumed (since
    messages can be lost anyway)
  • if reliable channels are assumed, need to handle
    this properly! Cannot lose messages!
  • We will assume unreliable channels for simplicity
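
The two definitions above can be checked mechanically. The sketch below assumes a simplified model that is not from the text: every event has a local integer timestamp at its process, and a global checkpoint is a dictionary giving, for each process, the local time of its checkpoint. An event is "recorded" if its time is at or before that checkpoint.

```python
from typing import Dict, List, NamedTuple, Tuple


class Message(NamedTuple):
    sender: str
    send_time: int   # local event time at the sender
    receiver: str
    recv_time: int   # local event time at the receiver


def classify(messages: List[Message],
             checkpoint: Dict[str, int]) -> Tuple[List[Message], List[Message]]:
    """Return (orphans, lost) with respect to a global checkpoint."""
    orphans, lost = [], []
    for msg in messages:
        sent = msg.send_time <= checkpoint[msg.sender]
        received = msg.recv_time <= checkpoint[msg.receiver]
        if received and not sent:
            orphans.append(msg)   # receive recorded, send not recorded
        elif sent and not received:
            lost.append(msg)      # send recorded, receive not recorded
    return orphans, lost


def is_consistent(messages: List[Message], checkpoint: Dict[str, int]) -> bool:
    """A consistent global checkpoint has no orphan messages."""
    return not classify(messages, checkpoint)[0]


m1 = Message("P", 2, "Q", 3)
m2 = Message("Q", 5, "P", 6)
orphans_a, lost_a = classify([m1, m2], {"P": 1, "Q": 4})
orphans_b, lost_b = classify([m1, m2], {"P": 3, "Q": 2})
```

In the first checkpoint, Q has recorded receiving m1 but P has not recorded sending it, so m1 is an orphan and the checkpoint is inconsistent. In the second, m1 is merely lost, which is acceptable under the unreliable-channel assumption.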

111
Performance measures for a checkpointing and
recovery algorithm
  • during fault-free operation
  • checkpointing time
  • space for storing checkpoints and messages (if
    needed)
  • in case of a fault
  • recovery time (time to establish recovery line)
  • extent of rollback (how far in the past did we
    roll back? how much computation is lost?)
  • is the output commit problem handled? (if an
    output was sent out before the fault, say cash
    dispensed at a teller machine, it should not be
    resent after restarting after the fault)

112
Some parameters that affect performance
  • Checkpoint interval (time between two successive
    checkpoints)
  • Number of processes
  • Communication pattern of the application
  • Fault frequency
  • Nature of stable storage

113
Classification of Checkpointing and Recovery Algorithms
  • Asynchronous/Uncoordinated
  • every process takes local checkpoint
    independently
  • to recover from a fault in one process, all
    processes coordinate to find a consistent global
    checkpoint from their local checkpoints
  • very low fault-free overhead, recovery overhead
    is high
  • Domino effect possible (no consistent global
    checkpoint exists, so all processes have to
    restart from scratch)
  • higher space requirements, as all local
    checkpoints need to be kept
  • Good for systems where faults are rare and
    inter-process communication is not too high (less
    chance of domino effect)
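
To make the domino effect concrete, here is a minimal rollback-propagation sketch. It is an illustration under assumptions, not the algorithm from the text: each process has a sorted list of local checkpoint times, time 0 stands for the initial state, every process starts from its latest checkpoint, and messages are tuples (sender, send time, receiver, receive time) in local times.

```python
from typing import Dict, List, Tuple

Msg = Tuple[str, int, str, int]  # (sender, send_time, receiver, recv_time)


def recovery_line(checkpoints: Dict[str, List[int]],
                  messages: List[Msg]) -> Dict[str, int]:
    """Roll processes back until no orphan message remains."""
    # Start from the latest local checkpoint of each process.
    line = {p: (times[-1] if times else 0) for p, times in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for sender, send_time, receiver, recv_time in messages:
            # Orphan: receive is recorded but send is not.
            if recv_time <= line[receiver] and send_time > line[sender]:
                # Receiver rolls back to its last checkpoint taken
                # before the receive, or to the initial state (0).
                earlier = [t for t in checkpoints[receiver] if t < recv_time]
                line[receiver] = earlier[-1] if earlier else 0
                changed = True
    return line


# Each message crosses the other process's checkpoint: domino effect.
domino = recovery_line(
    {"P": [3], "Q": [3]},
    [("Q", 1, "P", 2), ("P", 4, "Q", 2)],
)
# One rollback by Q is enough here.
partial = recovery_line(
    {"P": [2, 5], "Q": [3, 6]},
    [("P", 6, "Q", 5)],
)
```

The loop terminates because checkpoint times in the line only ever decrease. In the first example both processes end up at time 0, i.e. they restart from scratch.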

114
  • Synchronous/Coordinated
  • all processes coordinate to take a consistent
    global checkpoint
  • during recovery, every process just rolls back to
    its last local checkpoint independently
  • low recovery overhead, but high checkpointing
    overhead
  • no domino effect possible
  • low space requirement, since only last checkpoint
    needs to be stored at each process

115
  • Communication Induced
  • Synchronize checkpointing with communication,
    since message send/receive is the fundamental
    cause of inconsistency in global checkpoint
  • Ex.: take a local checkpoint right after every
    send! The last local checkpoint at each process is
    always consistent, but this is too costly
  • Many variations exist that are more efficient than
    the above; we will not discuss them in this class.

116
  • Message logging
  • Take coordinated or uncoordinated checkpoint, and
    then log (in stable storage) all messages
    received since the last checkpoint
  • On recovery, only the recovering process goes
    back to its last checkpoint, and then replays
    messages from the log appropriately until it
    reaches the state right before the fault
  • Only class that can handle output commit problem!
  • Details too complex to discuss in this class
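
The basic idea of checkpoint plus replay can still be shown with a toy process. This sketch assumes the process is deterministic given the sequence of messages it receives (an assumption, not stated in the text), and models stable storage as in-memory attributes. The state update rule is arbitrary and only there to make replay order matter.

```python
class LoggedProcess:
    def __init__(self):
        self.state = 0        # volatile state, lost on a crash
        self.checkpoint = 0   # stands in for a checkpoint in stable storage
        self.log = []         # stands in for the message log in stable storage

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()      # only messages since the last checkpoint are kept

    def deliver(self, msg: int):
        self.log.append(msg)  # log the message before processing it
        self._apply(msg)

    def _apply(self, msg: int):
        # Hypothetical deterministic handler; order of messages matters.
        self.state = self.state * 2 + msg

    def recover(self):
        # Restore the checkpoint, then replay logged messages in order.
        self.state = self.checkpoint
        for msg in self.log:
            self._apply(msg)


p = LoggedProcess()
p.deliver(1)
p.take_checkpoint()
p.deliver(3)
p.deliver(4)
state_before_fault = p.state
p.state = -1                  # simulate a crash wiping volatile state
p.recover()
state_after_recovery = p.state
```

Only the recovering process rolls back here; the replay brings it to the state it had right before the fault without involving any other process.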

117
Some Checkpointing Algorithms
  • Asynchronous/Uncoordinated
  • See Juang-Venkatesan's algorithm in text, quite
    well-explained
  • Synchronous/Coordinated
  • Chandy-Lamport's global state collection
    algorithm can be modified to handle recovery from
    faults
  • See Koo-Toueg's algorithm in text, quite
    well-explained
