Title: Distributed Operating Systems CS551
1 Distributed Operating Systems CS551
- Colorado State University
- at Lockheed-Martin
- Lecture 6 -- Spring 2001
2 CS551 Lecture 6
- Topics
- Distributed Process Management (Chapter 7)
- Distributed Scheduling Algorithm Choices
- Scheduling Algorithm Approaches
- Coordinator Elections
- Orphan Processes
- Distributed File Systems (Chapter 8)
- Distributed Name Service
- Distributed File Service
- Distributed Directory Service
3 Distributed Deadlock Prevention
- Assign each process a global timestamp when it starts
- No two processes should have the same timestamp
- Basic idea: "When one process is about to block waiting for a resource that another process is using, a check is made to see which has a larger timestamp (i.e., is younger)." (Tanenbaum, DOS, 1995)
4 Distributed Deadlock Prevention
- Somehow put a timestamp on each process, representing its creation time
- Suppose a process needs a resource already owned by another process
- Determine the relative ages of both processes
- Decide whether the waiting process should preempt, wait, die, or wound the owning process
- Two different algorithms
5 Distributed Deadlock Prevention
- Allow a wait only if the waiting process is older
- Since timestamps increase along any chain of waiting processes, cycles are impossible
- Or allow a wait only if the waiting process is younger
- Here timestamps decrease along any chain of waiting processes, so cycles are again impossible
- Wiser to give older processes priority (a sketch of both rules follows)
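Both rules reduce to a comparison of creation timestamps. A minimal sketch (Python, illustrative only, not code from Galli or Tanenbaum); the 54/79 pair matches the examples on the next two slides:

```python
# Illustrative decision functions for the two schemes.
# Lower timestamp = older process (created earlier).

def wait_die(requester_ts: int, owner_ts: int) -> str:
    """Non-preemptive: an older requester waits; a younger one dies."""
    if requester_ts < owner_ts:   # requester is older
        return "wait"
    return "die"                  # requester is younger: abort, retry later

def wound_wait(requester_ts: int, owner_ts: int) -> str:
    """Preemptive: an older requester wounds (preempts) the owner."""
    if requester_ts < owner_ts:   # requester is older
        return "wound"            # preempt the younger owner
    return "wait"                 # requester is younger: safe to wait

# The 54/79 pair used in the examples on the next two slides:
assert wait_die(54, 79) == "wait" and wait_die(79, 54) == "die"
assert wound_wait(54, 79) == "wound" and wound_wait(79, 54) == "wait"
```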
6 Example: wait-die algorithm
- Process 54 (older) wants a resource held by process 79 (younger): 54 waits
- Process 79 (younger) wants a resource held by process 54 (older): 79 dies
7 Example: wound-wait algorithm
- Process 54 (older) wants a resource held by process 79 (younger): 54 preempts (wounds) 79
- Process 79 (younger) wants a resource held by process 54 (older): 79 waits
8 Algorithm Comparison
- Wait-die kills the young process
- When the young process restarts and requests the resource again, it is killed once more
- The less efficient of the two algorithms
- Wound-wait preempts the young process
- When the young process re-requests the resource, it has to wait for the older process to finish
- The better of the two algorithms
9 Figure 7.7 The Bully Algorithm. (Galli, p. 169)
10 Process Management in a Distributed Environment
- Processes in a Uniprocessor
- Processes in a Multiprocessor
- Processes in a Distributed System
- Why scheduling is needed
- Scheduling priorities
- How to schedule
- Scheduling algorithms
11 Distributed Scheduling
- Basically resource management
- Want to distribute the processing load among the processing elements (PEs) in order to maximize performance
- Consider several homogeneous processing elements on a LAN with equal average workloads
- The workload may still not be evenly distributed
- Some PEs may have idle cycles
12 Efficiency Metrics
- Communication cost
- Low if very little or no communication is required
- Low if all communicating processes are
- on the same PE
- not distant (small number of hops)
- Execution cost
- Relative speed of the PE
- Relative location of needed resources
- Type of
- operating system
- machine code
- architecture
13 Efficiency Metrics, continued
- Resource Utilization
- May be based upon
- Current PE loads
- Load status (state)
- Resource queue lengths
- Memory usage
- Other resource availability
14 Level of Scheduling
- When should a process run locally, and when should it be sent to an idle PE?
- Local Scheduling
- Allocate the process to the local PE
- Review Galli, Chapter 2, for more information
- Global Scheduling
- Choose which PE executes which process
- Also called process allocation
- Precedes the local scheduling decision
15 Figure 7.1 Scheduling Decision Chart. (Galli, p. 152)
16 Distribution Goals
- Load Balancing
- Tries to maintain an equal load throughout the system
- Load Sharing
- Simpler
- Tries to prevent any PE from becoming too busy
17 Load Balancing / Load Sharing
- Load Balancing
- Try to equalize loads at PEs
- Requires more information
- More overhead
- Load Sharing
- Avoid having an idle PE if there is work to do
- Anticipating Transfers
- Avoid PE idle wait while a task is coming
- Get a new task just before PE becomes idle
18 Figure 7.2 Load Distribution Goals. (Galli, p. 153)
19 Processor Allocation Algorithms
- Assume virtually identical PEs
- Assume PEs fully interconnected
- Assume processes may spawn children
- Two strategies
- Non-migratory
- static binding
- non-preemptive
- Migratory
- dynamic binding
- preemptive
20 Processor Allocation Strategies
- Non-migratory (static binding, non-preemptive)
- Transfer before the process starts execution
- Once assigned to a machine, the process stays there
- Migratory (dynamic binding, preemptive)
- Processes may move after execution begins
- Better load balancing
- Expensive: must collect and move the entire process state
- More complex algorithms
21 Efficiency Goals
- Optimal
- Completion time
- Resource Utilization
- System Throughput
- Any combination thereof
- Suboptimal
- Suboptimal Approximate
- Suboptimal Heuristic
22 Optimal Scheduling Algorithms
- Require the state of all competing processes
- The scheduler must have access to all related information
- Optimization is a hard problem
- Usually NP-hard for multiple processors
- Thus, consider
- Suboptimal approximate solutions
- Suboptimal heuristic solutions
23 Suboptimal Approximate Solutions
- Similar to Optimal Scheduling algorithms
- Try to find good solutions, not perfect solutions
- Searches are limited
- Include intelligent shortcuts
24 Suboptimal Heuristic Solutions
- Heuristics
- Employ rules of thumb
- Employ intuition
- May not be provable
- Generally considered to work in an acceptable manner
- Examples
- If a PE has a heavy load, don't give it more to do
- Locality of reference for related processes, data
25 Figure 7.1 Scheduling Decision Chart. (Galli, p. 152)
26 Types of Load Distribution Algorithms
- Static
- Decisions are hard-wired in
- Dynamic
- Use system state information to make decisions
- Overhead of keeping track of that information
- Adaptive
- A type of dynamic algorithm
- May work differently at different loads
27 Load Distribution Algorithm Issues
- Transfer Policy
- Selection Policy
- Location Policy
- Information Policy
- Stability
- Sender-Initiated versus Receiver-Initiated
- Symmetrically-Initiated
- Adaptive Algorithms
28 Load Dist. Algs. Issues, cont.
- Transfer Policy
- When is it appropriate to move a task? (a threshold sketch follows this slide)
- If load at the sending PE > threshold
- If load at the receiving PE < threshold
- Location Policy
- Find a receiver PE
- Methods
- Broadcast messages
- Polling: random, neighbors, recent candidates
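A minimal sketch of the threshold-based transfer policy above; the queue-length load measure and the threshold value are illustrative assumptions:

```python
# Illustrative threshold transfer policy: a PE becomes a sender when its
# load rises above a threshold and a candidate receiver when it falls
# below. The queue-length measure and value are assumed for the sketch.
THRESHOLD = 4  # assumed value, tuned per system

def is_sender(queue_length: int) -> bool:
    return queue_length > THRESHOLD    # overloaded: try to move a task out

def is_receiver(queue_length: int) -> bool:
    return queue_length < THRESHOLD    # underloaded: can accept a task
```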
29 Load Dist. Algs. Issues, cont.
- Selection Policy
- Which task should migrate?
- Simple
- Select new tasks
- Non-preemptive
- Criteria
- Cost of transfer
- should be covered by the reduction in response time
- Size of the task
- Number of dependent system calls (ones that must use the local PE)
30 Load Dist. Algs. Issues, cont.
- Information Policy
- What information should be collected?
- When? From whom? By whom?
- Demand-driven
- Get info when a PE becomes a sender or receiver
- Sender-initiated: senders look for receivers
- Receiver-initiated: receivers look for senders
- Symmetrically-initiated: either of the above
- Periodic: at fixed time intervals, not adaptive
- State-change-driven
- Nodes send info when their state changes (rather than being solicited)
31 Load Dist. Algs. Issues, cont.
- Stability
- Queuing-Theoretic view
- Stable: (arrival load + overhead load) < system capacity
- Effective: using the algorithm gives better performance than not doing load distribution
- An effective algorithm cannot be unstable
- A stable algorithm can be ineffective (overhead)
- Algorithmic Stability
- E.g., performing overhead operations but making no forward progress
- E.g., moving a task from PE to PE, only to learn that it increases the PE workload enough that it needs to be transferred again
33 Load Dist Algs: Sender-Initiated
- The sender PE thinks it is overloaded
- Transfer Policy
- Threshold (T) based on PE CPU queue length (QL)
- Sender: QL > T
- Receiver: QL < T
- Selection Policy
- Non-preemptive
- Allows only new tasks
- Long-lived tasks make this policy worthwhile
34 Load Dist Algs: Sender-Initiated
- Location Policy (3 different policies)
- Random
- Select a receiver at random
- Useless or wasted effort if the destination is loaded
- Want to avoid transferring the same task from PE to PE to PE
- Include a limit on the number of transfers
- Threshold
- Poll PEs at random
- If a receiver is found, send the task to it
- Limit the search to a poll limit
- If the limit is hit, keep the task on the current PE (see the polling sketch below)
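A sketch of the Threshold location policy under stated assumptions: poll is a stand-in RPC returning a remote PE's CPU queue length, and POLL_LIMIT and T are invented values:

```python
import random

# Sketch of the sender-initiated Threshold location policy.
POLL_LIMIT = 5
T = 4

def find_receiver(peers: list[str], poll) -> str | None:
    """Poll up to POLL_LIMIT random PEs; return the first one below T."""
    for pe in random.sample(peers, min(POLL_LIMIT, len(peers))):
        if poll(pe) < T:   # under-threshold PE found: send the task here
            return pe
    return None            # poll limit hit: keep the task on the current PE
```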
35 LDAs: Sender-Initiated
- Location Policy (3 different policies, cont.)
- Shortest
- Poll a random set of PEs
- Choose the PE with the shortest queue length
- Only a little better than the Threshold location policy
- Not worth the additional work
36 LDAs: Sender-Initiated
- Information Policy
- Demand-driven
- Info gathered after a PE identifies itself as a sender
- Stability
- At high load, a PE might not find a receiver
- The polling will be wasted
- Polling increases the load on the system
- Could lead to instability
37 LDAs: Receiver-Initiated
- The receiver is trying to find work
- Transfer Policy
- If local QL < T, try to find a sender
- Selection Policy
- Non-preemptive preferred
- But there may not be any new (unstarted) tasks
- Worth the effort anyway
38 LDAs: Receiver-Initiated
- Location Policy
- Select a PE at random
- If taking a task does not move that PE's load below the threshold, take it
- If no luck after trying Poll Limit times,
- Wait until another task completes, or
- Wait another time period
- Information Policy
- Demand-driven (see the sketch below)
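A sketch of the receiver-initiated location policy under the same assumptions: poll and take_task are stand-in RPCs, and POLL_LIMIT and T are invented values:

```python
import random

# Sketch of the receiver-initiated location policy.
POLL_LIMIT = 5
T = 4

def find_work(peers: list[str], poll, take_task) -> bool:
    """Take a task only if removing it leaves the polled PE at or above T."""
    for pe in random.sample(peers, min(POLL_LIMIT, len(peers))):
        if poll(pe) - 1 >= T:   # transfer won't push pe below the threshold
            take_task(pe)
            return True
    return False                # no luck: wait for a completion or a timer
```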
39 LDAs: Receiver-Initiated
- Stability
- Tends to be stable
- At high load, a sender should be found quickly
- Problem
- Transfers tend to be preemptive
- Tasks on the sender node have already started
40 LDAs: Symmetrically-Initiated
- Both senders and receivers can search for tasks to transfer
- Has both the advantages and disadvantages of the two previous methods
- The above-average algorithm
- Tries to keep the load at each PE at an acceptable level
- Aiming for the exact average can cause thrashing
41 LDAs: Symmetrically-Initiated
- Transfer Policy
- Each PE
- Estimates the average load
- Sets both an upper and a lower threshold, equidistant from its estimate
- If load > upper, the PE acts as a sender
- If load < lower, the PE acts as a receiver (see the sketch below)
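The transfer policy can be sketched as a three-way classification; the half-width DELTA around the average estimate is an assumed tuning value:

```python
# Sketch of the symmetric transfer policy: upper and lower thresholds sit
# an equal distance DELTA from the PE's own estimate of the average load.
DELTA = 2  # assumed half-width

def classify(load: float, avg_estimate: float) -> str:
    if load > avg_estimate + DELTA:
        return "sender"      # overloaded: search for a receiver
    if load < avg_estimate - DELTA:
        return "receiver"    # underloaded: search for a sender
    return "ok"              # inside the acceptable band: do nothing
```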
42 LDAs: Symmetrically-Initiated
- Location Policy
- Sender-initiated
- The sender broadcasts a TooHigh message and sets a timeout
- A receiver sends an Accept message, clears its timeout, increases its load value, and sets a timeout
- If the sender still wants to send when the Accept message arrives, it sends the task
- If the sender gets a TooLow message before an Accept, it sends the task
- If the sender's TooHigh timeout expires with no Accept
- The average estimate is too low
- It broadcasts a ChangeAvg message to all PEs
43 LDAs: Symmetrically-Initiated
- Location Policy
- Receiver-initiated
- The receiver sends a TooLow message and sets a timeout
- The rest is the converse of the sender-initiated algorithm
- Selection Policy
- Use a reasonable policy
- Non-preemptive, if possible
- Low cost
44 LDAs: Symmetrically-Initiated
- Information Policy
- Demand-driven
- Determined at each PE
- Low overhead
45 LDAs: Adaptive
- Stable Symmetrically-Initiated
- The previous instability was due to too much polling by the sender
- Each PE keeps lists of the other PEs, sorted into three categories
- Sender (overloaded)
- Receiver (underloaded)
- OK
- At the start, each PE has all other PEs on its receiver list
46 LDAs: Adaptive
- Transfer Policy
- Based on PE CPU queue length
- Low threshold (LT) and high threshold (HT)
- Selection Policy
- Sender-initiated: only sends new tasks
- Receiver-initiated: takes any task
- Trying for low cost
- Information Policy
- Demand-driven; polling replies maintain the lists
47 LDAs: Adaptive
- Location Policy
- Receiver-initiated
- Order of polling (see the sketch after this slide)
- Sender list: head to tail (newest info first)
- OK list: tail to head (most out-of-date info first)
- Receiver list: tail to head
- When a PE becomes a receiver (QL < LT)
- It starts polling
- If it finds a sender, a transfer happens
- Else it uses the replies to update its lists
- It continues until
- It finds a sender
- It is no longer a receiver
- It hits the Poll Limit
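A sketch of the polling order, with poll again a stand-in RPC and invented threshold names; reclassifying PEs onto the correct list as replies arrive is elided:

```python
from itertools import chain, islice

# Sketch of the receiver's polling order: sender list head-to-tail (newest
# information first), then OK and receiver lists tail-to-head (stalest
# first, since stale entries are the likeliest to have become senders).
POLL_LIMIT = 5

def find_sender(senders, ok, receivers, poll, high_threshold) -> str | None:
    order = chain(senders, reversed(ok), reversed(receivers))
    for pe in islice(order, POLL_LIMIT):
        if poll(pe) > high_threshold:   # pe is a sender: transfer from it
            return pe
        # otherwise the reply is used to move pe to the right list (omitted)
    return None
```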
48 LDAs: Adaptive
- Notes
- At high loads, activity is sender-initiated, but the sender will soon have an empty receiver list, so there is no polling
- So activity shifts to receiver-initiated
- At low loads, receiver-initiated polling usually fails
- But the overhead doesn't matter at low load
- And the lists get updated
- So sender-initiated transfers should work quickly
49 Load Scheduling Algorithms (Galli)
- Usage Points
- Charged for using remote PEs, resources
- Graph Theory
- Minimum cutset of assignment graph
- Maximum flow of graph
- Probes
- Messages to locate available, appropriate PEs
- Scheduling Queues
- Stochastic Learning
50 Figure 7.3 Usage Points. (Galli, p. 158)
51 Figure 7.4 Economic Usage Points. (Galli, p. 159)
52 Figure 7.5 Two-Processor Min-Cut Example. (Galli, p. 161)
53 Figure 7.6 A Station with Run Queues and Hints. (Galli, p. 164)
54 CPU Queue Length as Metric
- PE queue length correlates well with response time
- Easy to measure
- Caution
- When accepting a new migrating process, increment the queue length right away
- A time-out may be needed in case the process never arrives (see the sketch below)
- PE queue length does not correlate well with PE utilization
- A daemon to monitor PE utilization adds overhead
55 Election Algorithms
- The Bully algorithm (Garcia-Molina, 1982)
- A ring election algorithm
56 Bully Algorithm
- Each processor has a unique number
- One processor notices that the leader/server is missing
- It sends messages to all other processors
- Requesting to be appointed leader
- Including its own processor number
- Processors with higher (or, under the opposite convention, lower) numbers can bully the first processor
57 Figure 7.7 The Bully Algorithm. (Galli, p. 169)
58 Bully Algorithm, continued
- The initial processor need only send election messages to the higher- (or lower-) numbered processors
- Any processors that respond effectively tell the first processor that they overrule it and that it is out of the running
- Those processors then start sending election messages to the remaining top processors (a sketch follows)
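A minimal sketch of one processor's election round, assuming the higher-number-wins convention; send_election is a stand-in that returns True when the polled processor is alive and responds:

```python
# Minimal sketch of one round of the bully algorithm.

def call_election(me: int, processors: list[int], send_election) -> bool:
    """Return True if `me` wins (no live higher-numbered processor)."""
    higher = [p for p in processors if p > me]
    responders = [p for p in higher if send_election(me, p)]
    return not responders   # each responder then runs its own election

# The example on the next slides: processors 0..5, leader 5 has crashed.
alive = {0, 1, 2, 3, 4}
respond = lambda me, p: p in alive
assert call_election(2, list(range(6)), respond) is False  # 3 and 4 respond
assert call_election(4, list(range(6)), respond) is True   # 4 becomes leader
```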
59 Bully Example
- Six processors, numbered 0 through 5; the leader (5) has failed
- Processor 2 calls an election; processors 3 and 4 respond, knocking 2 out of the running
60 Bully Example, continued
- Processor 4 calls an election; processor 3 calls an election
61 Bully Example, concluded
- Processor 4 responds to 3, knocking 3 out; 5 never responds
- Processor 4 is the new leader
62 A Ring Election Algorithm
- No token
- Each processor knows its successor
- When a processor notices the leader is down, it sends an election message to its successor
- If the successor is down, it sends to the next processor
- Each sender adds its own number to the message
63 Ring Election Algorithm, cont.
- The first processor eventually receives back the election message containing its own number
- The election message is changed to a coordinator message and resent around the ring
- The highest processor number in the message becomes the new leader
- When the first processor receives the coordinator message back, it is deleted (see the sketch below)
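A minimal sketch of the ring election, simulating message passing by walking a fixed successor ring; the alive set and ring order are assumed, and the assert reproduces the example on the next slide:

```python
# Minimal sketch of the ring election algorithm.

def ring_election(starter: int, ring: list[int], alive: set[int]) -> int:
    """Circulate an election message from `starter`; return the new leader."""
    n = len(ring)
    i = ring.index(starter)
    numbers = []
    while True:
        if ring[i % n] in alive:
            numbers.append(ring[i % n])   # each live processor appends itself
        i += 1
        if ring[i % n] == starter:        # message is back at the initiator:
            break                         # it becomes a coordinator message
    return max(numbers)                   # highest number is the new leader

# The example on the next slide: ring 0..7, processor 7 down, 3 starts.
assert ring_election(3, list(range(8)), alive=set(range(7))) == 6
```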
64 Ring Election Example
- Eight processors in a ring, numbered 0 through 7; processor 7 is down
- Processor 3 starts the election; the message accumulates numbers as it travels: 3, 4 / 3, 4, 5 / 3, 4, 5, 6 / 3, 4, 5, 6, 0 / 3, 4, 5, 6, 0, 1 / 3, 4, 5, 6, 0, 1, 2
- Back at processor 3, the highest number in the message (6) becomes the new leader
65 Orphan Processes
- A child process that is still active after its parent process has terminated prematurely
- Can happen with remote procedure calls
- Wastes resources
- Can corrupt shared data
- Can create more processes
- Three solutions follow
66 Orphan Cleanup
- A process must clean up after itself after a crash
- Requires each parent to keep a list of its children
- The parent thus has access to the family tree
- Must be kept in nonvolatile storage
- On restart, each family-tree member is told of the parent process's death and halts execution
- Disadvantage: parent overhead (a sketch follows)
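A sketch of family-tree bookkeeping under stated assumptions: children are logged to a JSON file standing in for nonvolatile storage, and kill is a stand-in primitive, not Galli's design:

```python
import json, os

TREE_FILE = "family_tree.json"  # assumed nonvolatile log

def load_tree() -> dict:
    if not os.path.exists(TREE_FILE):
        return {}
    with open(TREE_FILE) as f:
        return json.load(f)

def record_child(parent: int, child: int):
    tree = load_tree()
    tree.setdefault(str(parent), []).append(child)
    with open(TREE_FILE, "w") as f:
        json.dump(tree, f)               # persist before the child runs

def cleanup_after_crash(dead_parent: int, kill):
    """Halt every descendant of the dead parent, deepest first."""
    tree = load_tree()
    for child in tree.get(str(dead_parent), []):
        cleanup_after_crash(child, kill)
        kill(child)
```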
67 Figure 7.8 Orphan Cleanup Family Trees. (Galli, p. 170)
68 Child Process Allowance
- All child processes receive a finite time allowance
- If no time is left, the child must request more time from its parent
- If the parent has terminated prematurely, the child's request goes unanswered
- With no time allowance left, the child process dies
- Requires more communication
- Slows the execution of child processes (a sketch follows)
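A sketch of the allowance mechanism; the work-slice structure, allowance value, and the request_more_time stand-in (which returns None when the parent is gone) are all assumptions:

```python
import time

ALLOWANCE = 10.0  # seconds (assumed)

def run_child(do_some_work, request_more_time) -> str:
    deadline = time.monotonic() + ALLOWANCE
    while do_some_work():                 # returns False when work is done
        if time.monotonic() >= deadline:
            grant = request_more_time()   # ask the parent for more time
            if grant is None:
                return "halt"             # unanswered: the orphan dies
            deadline = time.monotonic() + grant
    return "done"
```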
69 Figure 7.9 Child Process Allowance. (Galli, p. 172)
70 Process Version Numbers
- Each process must keep track of a version number for its parent
- After a system crash, the entire distributed system is assigned a new version number
- A child is forced to terminate if its version number is out of date
- The child may try to find its parent
- It terminates if unsuccessful
- Requires a lot of communication (a sketch follows)
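A sketch of the version check; SYSTEM_VERSION and the find_parent stand-in are illustrative assumptions:

```python
# Sketch of process version numbers: crash recovery assigns the system a
# new version, and a child holding a stale parent version must terminate
# unless it can re-locate its parent.
SYSTEM_VERSION = 2   # bumped after every system crash/restart

def child_check(parent_version: int, find_parent) -> str:
    if parent_version == SYSTEM_VERSION:
        return "continue"          # version current: parent assumed alive
    parent = find_parent()         # out of date: try to locate the parent
    return "continue" if parent is not None else "terminate"
```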
71 Figure 7.10 Process Version Numbers. (Galli, p. 174)