Title: Evaluating Task Assignment Policies for Distributed Supercomputing Servers
1Evaluating Task Assignment Policies for
Distributed Supercomputing Servers
Bianca Schroeder, Mor Harchol-Balter
Computer Science Dept, Carnegie Mellon University
www.cs.cmu.edu/bianca,harchol
2The Distributed Server Model
Task Assignment Policy (TAP): rule for assigning jobs to hosts
- Jobs are processed First-Come-First-Served.
- Jobs are run to completion.
- Users provide upper bounds on runtime.
Motivation: Xolas, Pleiades, NASA Ames, and PSC distributed servers
3Commonly used TAPs
- 1. Random
- 2. Round-Robin
- 3. Shortest-Queue: send job to the host with the fewest jobs.
- 4. Least-Work-Left: send job to the host with the least total work left.
- 5. Runtime-Based-E: separate jobs by runtime so that each host receives equal expected load.
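The five policies above can be sketched as a small dispatcher simulation. This is a minimal illustration, not the simulator used for the results in these slides: the function name `simulate`, the policy strings, and the `cutoffs` parameter are hypothetical, and each host is assumed to serve its jobs FCFS and run them to completion.

```python
import bisect
import random

def simulate(policy, jobs, n_hosts, cutoffs=(), seed=0):
    """Dispatch jobs to FCFS hosts and return each job's waiting time.

    jobs: list of (arrival_time, runtime), sorted by arrival_time.
    cutoffs: sorted runtime cutoffs (n_hosts - 1 values), runtime_based only.
    """
    rng = random.Random(seed)
    free_at = [0.0] * n_hosts                   # time each host finishes all assigned work
    departures = [[] for _ in range(n_hosts)]   # completion times of jobs sent to each host
    waits = []
    for i, (t, x) in enumerate(jobs):
        if policy == "random":
            h = rng.randrange(n_hosts)
        elif policy == "round_robin":
            h = i % n_hosts
        elif policy == "shortest_queue":
            # fewest jobs still present at time t
            h = min(range(n_hosts), key=lambda k: sum(d > t for d in departures[k]))
        elif policy == "least_work_left":
            # least unfinished work at time t
            h = min(range(n_hosts), key=lambda k: max(0.0, free_at[k] - t))
        elif policy == "runtime_based":
            # runtimes <= cutoffs[0] go to host 0, the next interval to host 1, ...
            h = bisect.bisect_left(cutoffs, x)
        else:
            raise ValueError(f"unknown policy: {policy}")
        start = max(t, free_at[h])
        free_at[h] = start + x
        departures[h].append(free_at[h])
        waits.append(start - t)
    return waits

# Tiny example: one long job followed by three short ones, two hosts.
jobs = [(0.0, 10.0), (1.0, 1.0), (2.0, 1.0), (3.0, 1.0)]
rr_waits = simulate("round_robin", jobs, 2)
lwl_waits = simulate("least_work_left", jobs, 2)
rb_waits = simulate("runtime_based", jobs, 2, cutoffs=[5.0])
```

In this toy input, Round-Robin queues one short job behind the long one, while Least-Work-Left and the runtime-based split keep the short jobs away from it.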
4What is a good TAP?
- We want to minimize
- 1. mean response time.
- 2. mean slowdown.
- 3. variance in slowdown.
Additionally, we desire fairness.
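The three metrics can be computed from per-job waiting times and runtimes, as in the sketch below. The helper name `tap_metrics` is hypothetical, and slowdown is assumed here to mean response time divided by runtime; some work in this area divides waiting time by runtime instead.

```python
from statistics import mean, pvariance

def tap_metrics(waits, runtimes):
    """Summarize a run: mean response time, mean slowdown, variance in slowdown."""
    responses = [w + x for w, x in zip(waits, runtimes)]
    # Assumption: slowdown = response time / runtime.
    slowdowns = [r / x for r, x in zip(responses, runtimes)]
    return {
        "mean_response": mean(responses),
        "mean_slowdown": mean(slowdowns),
        "var_slowdown": pvariance(slowdowns),
    }

# One short job that waited 8 time units behind a long job dominates the slowdown.
metrics = tap_metrics([0.0, 0.0, 8.0, 0.0], [10.0, 1.0, 1.0, 1.0])
```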
5Which TAP is best according to literature?
- 1. Round-Robin
- 2. Random
- 3. Shortest-Queue
- 4. Least-Work-Left
Optimal for exponentially distributed runtimes (Wolff 1989).
- 5. Runtime-Based-E
Better for heavy-tailed runtime distributions (Harchol-Balter 1998).
6Simulation Setup
- Runtimes are taken from PSC's Cray J90 and C90 traces.
- Arrival times are
- A. Poisson (i.i.d.), or
- B. taken from traces.
- The system has 2 or more hosts.
7Simulation Results for Slowdown
[Figure: slowdown vs. system load for Random, LWL (Least-Work-Left), and Runtime-Based]
8Simulation Results for Variance of Slowdown
[Figure: variance of slowdown vs. system load for Random, LWL (Least-Work-Left), and Runtime-Based]
9WHY does Runtime-Based work so well?
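Recall the P-K formula for the M/G/1 FCFS queue:
$E[W] = \frac{\lambda E[X^2]}{2(1-\rho)}$
where $E[W]$ is the mean waiting time, $E[X^2]$ is the second moment of the runtime distribution, $\lambda$ is the arrival rate, and $\rho$ is the load.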
Runtime-Based reduces variance of runtime
distribution at the hosts. No other policy does
this!
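The effect can be checked numerically with the P-K formula. The sketch below uses a hypothetical two-point runtime distribution (not the PSC traces) in which short and long jobs carry equal load, and compares a random split across two hosts with a runtime-based split; arrivals are assumed Poisson, so each host behaves as an M/G/1 queue.

```python
def pk_mean_wait(lam, second_moment, rho):
    """P-K mean waiting time for M/G/1 FCFS: lam * E[X^2] / (2 * (1 - rho))."""
    return lam * second_moment / (2.0 * (1.0 - rho))

# Hypothetical runtimes: 1 with probability 100/101, 100 with probability 1/101.
sizes = [1.0, 100.0]
probs = [100.0 / 101.0, 1.0 / 101.0]
mean_size = sum(p * s for p, s in zip(probs, sizes))
lam = 2 * 0.5 / mean_size  # total arrival rate that puts load 0.5 on each of 2 hosts

# Random: each host gets half the arrivals and sees the full runtime distribution.
second_moment_all = sum(p * s * s for p, s in zip(probs, sizes))
wait_random = pk_mean_wait(lam / 2, second_moment_all, 0.5)

# Runtime-based: host i sees only jobs of size sizes[i], so its runtime variance is zero.
wait_runtime_based = 0.0
for p, s in zip(probs, sizes):
    host_lam = lam * p       # Poisson arrivals split by size remain Poisson
    host_rho = host_lam * s  # 0.5 on both hosts by construction
    wait_runtime_based += p * pk_mean_wait(host_lam, s * s, host_rho)
```

Under these assumptions the mean waiting time drops from about 25 to about 1, because the second moment seen by the host serving most jobs shrinks from 100 to 1.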
10Simulation Results for Slowdown
[Figure: slowdown vs. system load for Random, LWL (Least-Work-Left), and Runtime-Based]
11Is balancing load optimal?
All policies we have seen so far balance load.
12New: Load Unbalancing
Runtime-Based-U
13Simulation results for Runtime-Based-U: Slowdown
[Figure: slowdown vs. system load for Runtime-Based-E, Runtime-Based-U-fair, and Runtime-Based-U-opt]
14Simulation results for Runtime-Based-U: Variance in slowdown
[Figure: variance in slowdown vs. system load for Runtime-Based-E, Runtime-Based-U-fair, and Runtime-Based-U-opt]
15Why does Runtime-Based-U work so well?
- Like Runtime-Based-E, it reduces the variance in job sizes.
- It unbalances load.
16How unbalanced is the load under Runtime-Based-U?
[Figure: fraction of total load going to host 1 vs. system load for Runtime-Based-E, Runtime-Based-U-fair, and Runtime-Based-U-opt]
17Difficulties for runtime-based policies
- Knowing runtimes (Downey 1997; Gibbons 1997; Smith et al. 1998).
- Finding cutoffs.
- Simple calculation using the P-K formula
- Only 1/10 of trace data
18Conclusion
Differences between TAPs are huge! Without analysis, it is not intuitive which TAPs are good!
- Reducing variance at hosts is important.
- Load unbalancing may be better than load balancing.
- Penalizing long jobs may actually be fair.
19Simulation Results for Slowdown
[Figure: slowdown vs. system load]
20Simulation Results for Slowdown
[Figure: slowdown vs. system load]
21Simulation results for scaled interarrival times
22Simulation results for scaled interarrival times
23Simulation results for more than 2 hosts
[Figure: slowdown vs. number of hosts]
24The SITA-E algorithm: Size Interval Task Assignment with Equal Load
[Diagram: outside arrivals are split by job size: S to Host 1, M to Host 2, L to Host 3, XL to Host 4]
The cutoffs are chosen so as to balance the load at the hosts.
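One simple way to pick load-balancing cutoffs from a sample of runtimes is sketched below. The function name `sita_e_cutoffs` is hypothetical, and the approach (walking the sorted runtimes until each interval holds an equal share of the total work) is an illustrative assumption rather than the procedure used in the slides.

```python
def sita_e_cutoffs(runtimes, n_hosts):
    """Return n_hosts - 1 runtime cutoffs giving each size interval about equal total work.

    Jobs with runtime <= cutoffs[0] go to the first host, jobs in the next
    interval to the second host, and so on.
    """
    ordered = sorted(runtimes)
    total = sum(ordered)
    cutoffs = []
    acc = 0.0  # work accumulated so far, smallest jobs first
    k = 1      # index of the next load boundary to cross
    for x in ordered:
        acc += x
        # A single very large job may cross several boundaries at once,
        # in which case the same cutoff is repeated.
        while k < n_hosts and acc >= k * total / n_hosts:
            cutoffs.append(x)
            k += 1
    return cutoffs

# 100 short jobs carry as much work as one long job, so the cutoff sits at 1.
example_cutoffs = sita_e_cutoffs([1.0] * 100 + [100.0], 2)
```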
25How do you find the optimal or fair cutoff?
- 1. Fix the search space for cutoffs.
- 2. For each potential cutoff, use
- a) the trace data, or
- b) the P-K formula
- to determine the expected slowdown.
- 3. Pick the best cutoff for your metric.
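The search can be sketched for two hosts using option b), the P-K formula, with moments estimated from a runtime sample. The function names, the sample data, and the arrival rate are hypothetical. Two assumptions are made: arrivals are Poisson, and slowdown is taken as waiting time divided by runtime, so that under FCFS the mean slowdown at a host is E[W] * E[1/X].

```python
import math

def mean_slowdown_for_cutoff(runtimes, lam, cutoff):
    """Expected slowdown over all jobs when runtimes <= cutoff go to host 1, the rest to host 2."""
    n = len(runtimes)
    short = [x for x in runtimes if x <= cutoff]
    long_ = [x for x in runtimes if x > cutoff]
    total = 0.0
    for group in (short, long_):
        if not group:
            continue
        p = len(group) / n                           # fraction of jobs sent to this host
        host_lam = lam * p                           # arrival rate at this host
        m1 = sum(group) / len(group)                 # E[X] at this host
        m2 = sum(x * x for x in group) / len(group)  # E[X^2] at this host
        inv = sum(1.0 / x for x in group) / len(group)  # E[1/X] at this host
        rho = host_lam * m1
        if rho >= 1.0:
            return math.inf                          # host would be overloaded
        wait = host_lam * m2 / (2.0 * (1.0 - rho))   # P-K mean waiting time
        total += p * wait * inv                      # W is independent of X under FCFS
    return total

def best_cutoff(runtimes, lam, candidates):
    """Step 3: return (cutoff, slowdown) minimizing expected slowdown over the candidates."""
    scored = ((c, mean_slowdown_for_cutoff(runtimes, lam, c)) for c in candidates)
    return min(scored, key=lambda cs: cs[1])

# Step 1: the search space is every observed runtime except the largest.
sample = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 5.0, 8.0, 20.0, 50.0, 200.0]
rate = 2 * 0.5 / (sum(sample) / len(sample))  # average load 0.5 per host
candidates = sorted(set(sample))[:-1]
cutoff, slowdown = best_cutoff(sample, rate, candidates)
```

A fairness-oriented variant would instead pick the cutoff at which the expected slowdown of short and long jobs is closest to equal.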