Title: Designing Parallel Operating Systems using Modern Interconnects
1. Designing Parallel Operating Systems using Modern Interconnects
Toward Realistic Evaluation of Job Scheduling Strategies
Eitan Frachtenberg, with Dror Feitelson, Fabrizio Petrini, and Juan Fernandez
Computer and Computational Sciences Division, Los Alamos National Laboratory
2. Outline
- The challenges of parallel job scheduling evaluation
- Emulation: rationale, strengths, and weaknesses
- Experimental results and analysis:
  - How do different algorithms react to increasing load?
  - Can knowing the future help?
  - What is the effect of multiprogramming?
  - Which applications is it good for?
3. Parallel Job Scheduling
- The task: assign compute resources to parallel jobs
- The computers (clusters and MPPs):
  - Range from hundreds of processors to 10,000 and more
  - Typically homogeneous, connected by a fast interconnect
- Jobs arrive dynamically, with different sizes and runtimes, requiring online scheduling
- Mostly fine-grained communication, lots of memory
- Mix of serial, parallel, short, and long jobs
4. Scheduling Taxonomy
- Essentially rectangle packing: each job occupies processors (space) for some duration (time)
- Main dimensions: space sharing and time sharing
- Additional queue dimension: backfilling, priorities
5. Backfilling
- Backfilling is a technique that moves jobs forward in the queue
- Requires advance knowledge of run times (or reservations)
- Reduces external fragmentation and improves utilization, responsiveness, and throughput
- Has several variations (a sketch of one common variant follows below)
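A minimal sketch, in Python, of one well-known variant (EASY backfilling): a waiting job may jump ahead of the queue's head only if it cannot delay the head's reservation. The Job fields, list-based queue, and helper names are illustrative assumptions, not details from the talk.

    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        procs: int          # processors requested
        est_runtime: float  # user-supplied runtime estimate (seconds)

    def easy_backfill(queue, free_procs, running, now):
        # Pick the next job to start under EASY backfilling.
        # `queue` is FCFS-ordered; `running` holds (finish_time, procs)
        # pairs for executing jobs. Returns a job to start now, or None.
        if not queue:
            return None
        if queue[0].procs <= free_procs:
            return queue.pop(0)  # head fits: plain FCFS start

        # Head does not fit: find its reservation ("shadow") time by
        # freeing running jobs in finish-time order until it would fit.
        avail, shadow, extra = free_procs, None, 0
        for finish, procs in sorted(running):
            avail += procs
            if avail >= queue[0].procs:
                shadow = finish
                extra = avail - queue[0].procs  # spare CPUs at shadow time
                break
        if shadow is None:
            return None  # head can never fit on this machine

        # Backfill: a later job may start now only if it fits and
        # cannot push back the head's reservation.
        for i, job in enumerate(list(queue)[1:], start=1):
            if job.procs <= free_procs and (
                    now + job.est_runtime <= shadow or job.procs <= extra):
                return queue.pop(i)
        return None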
6. Time Sharing
- Does not require reservation times, but can be combined with backfilling
- Reduces external fragmentation further, and possibly even internal fragmentation, resulting in improved utilization, responsiveness, and throughput
- But also challenging:
  - Memory pressure
  - Context-switch overheads
  - Process-synchronization tradeoffs:
    - Tightly-coupled processes must be coscheduled
    - Coordination can incur overhead and fragmentation
7. Time Sharing Spectrum
[Figure: schedulers placed on a coordination spectrum, from none to explicit: Local (none); DCS, ICS/SB, CC, PB (implicit); FCS, BCS (hybrid); GS (explicit)]
- No coordination: local UNIX scheduling
- Explicit coordination:
  - Global clock (centralized)
  - Global context-switches to a known job
- Implicit coordination: infer synchronization information at the sender side, receiver side, or both
- Hybrid: global coordination with local autonomy
8. Without Time Sharing
- Short processes wait in the queue for long periods
- External fragmentation creates many holes
9. Time Sharing: GS
- Gang Scheduling multiprograms several jobs
- Reduces response time and fills holes
- Incurs more overhead and memory pressure (governed by the time quantum); see the sketch below
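A minimal sketch of the Ousterhout-matrix view behind gang scheduling: rows are time slots, columns are processors, and the scheduler context-switches the whole machine from one row to the next every time quantum. The first-fit packing and all names here are illustrative assumptions.

    def build_ousterhout_matrix(jobs, n_procs, mpl):
        # Pack (name, width) jobs into `mpl` rows of `n_procs` columns.
        # All processes of a job share a row, so when that row's time
        # quantum comes up they run (and can synchronize) together.
        matrix = [[None] * n_procs for _ in range(mpl)]
        for name, width in jobs:
            for row in matrix:  # first-fit into the first feasible slot
                free = [c for c, slot in enumerate(row) if slot is None]
                if len(free) >= width:
                    for c in free[:width]:
                        row[c] = name
                    break
            else:
                raise RuntimeError(f"MPL {mpl} exceeded for job {name}")
        return matrix

    # Example: 8 processors at MPL 2. Row 0 runs job A on all CPUs;
    # row 1 runs B and C side by side; rows alternate every quantum.
    m = build_ousterhout_matrix([("A", 8), ("B", 4), ("C", 4)], 8, 2)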
10. Time Sharing: SB
- Spin-Block (ICS) is a sender-side coordination heuristic; see the sketch below
- Reduces overhead, increases scalability
- Performs poorly with fine-grained communication
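A minimal sketch of the spin-block idea: spin briefly in the hope that the communicating peer is coscheduled, then block and yield the CPU. The threshold value and the use of a threading.Event as a stand-in for message arrival are illustrative assumptions; implementations typically tie the threshold to the context-switch cost.

    import threading
    import time

    SPIN_THRESHOLD = 50e-6  # illustrative: on the order of a context switch

    def spin_block_wait(msg_arrived: threading.Event) -> bool:
        # Phase 1: busy-wait. Cheap if the peer is running concurrently
        # and replies within the threshold.
        deadline = time.perf_counter() + SPIN_THRESHOLD
        while time.perf_counter() < deadline:
            if msg_arrived.is_set():
                return True   # peer answered while we spun: coscheduled
        # Phase 2: block. Yields the CPU so another process can run,
        # instead of wasting cycles waiting on a descheduled peer.
        msg_arrived.wait()
        return False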
11. Time Sharing: FCS
- Combines global synchronization with local information
- Relies on scalable primitives for global coordination and information exchange
- Measures communication characteristics, such as granularity and wait times
- Classifies processes based on synchronization requirements
- Schedules processes based on class
- Gives preference to short jobs
12. FCS Classification
[Figure: processes classified by communication granularity (fine/coarse) and block times (short/long)]
- DC (coarse granularity): locally scheduled
- CS (fine granularity, short block times): always gang-scheduled
- F (fine granularity, long block times): preferably gang-scheduled
A classification sketch follows.
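A minimal sketch of such a classifier in Python. Only the three classes and the two measured quantities come from the slides; the numeric thresholds and function name are illustrative assumptions.

    # Illustrative thresholds (microseconds); a real system would derive
    # them from measured context-switch cost and the time-quantum length.
    FINE_GRAIN_US = 100    # messages more frequent than this: fine-grained
    LONG_BLOCK_US = 1000   # average waits longer than this: long block times

    def classify(granularity_us: float, block_time_us: float) -> str:
        # DC: coarse-grained, scheduled locally.
        # CS: fine-grained and synchronizing well, always gang-scheduled.
        # F:  fine-grained but blocked for long stretches anyway,
        #     preferably gang-scheduled.
        if granularity_us > FINE_GRAIN_US:
            return "DC"
        return "F" if block_time_us > LONG_BLOCK_US else "CS"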
13. Evaluation Challenges
- Theoretical analysis (queuing theory):
  - Not applicable to time sharing, due to unknown parameters, application structure, and feedbacks
- Simulation:
  - Many assumptions, not all known/reported
  - Hard to reproduce; many studies provide contradictory results, often showing that theirs is best
  - Rarely factors in application characteristics
- Experiments with real sites and workloads:
  - Largely impractical and irreproducible
- Emulation
14. Emulation Methodology
- Framework for studying scheduling algorithms:
  - Runs any MPI application on a cluster
  - Implements several scheduling algorithms
  - Allows control over input parameters
  - Provides detailed logs and analysis tools
- Testing in a repeatable, dynamic environment:
  - Dynamic job arrivals, with varying time and space requirements (see the workload sketch below)
  - Complex, longer, and more realistic workloads
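A minimal sketch of the kind of synthetic dynamic workload such a framework replays; every distribution and parameter here is an illustrative assumption, not the talk's actual workload model.

    import random

    def make_workload(n_jobs=1000, n_procs=64, seed=42):
        # Generate a dynamic workload: exponential interarrival times,
        # power-of-two job sizes, and heavy-tailed runtimes, so the mix
        # contains serial, parallel, short, and long jobs.
        rng = random.Random(seed)
        t, jobs = 0.0, []
        for i in range(n_jobs):
            t += rng.expovariate(1 / 30.0)          # mean interarrival: 30 s
            size = 2 ** rng.randint(0, 6)           # 1..64 processors
            runtime = rng.lognormvariate(4.0, 1.5)  # heavy tail of long jobs
            jobs.append({"id": i, "arrival": t,
                         "procs": size, "runtime": runtime})
        return jobs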
15. Evaluation by Emulation
- Pros:
  - Real: no hidden assumptions or overheads
  - Configurable: choice of parameters and workloads
  - Repeatable: same experiment, same results
  - Portable: allows the isolation of HW factors
- Cons:
  - Slow
  - Requires more resources than analysis/simulation
  - GIGO: results are only as representative as the input
16. Experimental Environment
- Implemented on top of STORM, a scalable resource management system for clusters
- Algorithms: FCFS, GS, SB, and FCS, all using backfilling
- MPI synthetic (BSP) and LANL applications
  - Different granularities and communication patterns
- Flexible workload model, 1000 jobs
  - Time shrinking
- Three clusters, all using QsNet:
  - Pentium III: 32 nodes x 2 CPUs, 1 GB/node
  - Itanium II: 32 x 2, 2 GB/node
  - Alpha EV6: 64 x 4, 8 GB/node
17. Experiments Overview
- Use synthetic applications for basic insights:
  - Effect of multiprogramming level (MPL)
  - Effect of backfilling
  - Effect of time quantum
  - Effect of load
- Use LANL's Sage and Sweep3D for an application study
- Caveat emptor:
  - Only LANL applications
  - Does not follow the input workload closely
  - Limited set of inputs
  - Different architecture (Alpha)
18. Effect of MPL
- Questions:
  - What is the effect of preemptive multiprogramming compared to FCFS (batch) scheduling?
  - Does a higher MPL mean higher performance?
- Parameters:
  - GS with MPL values 1 to 6 (MPL 1 = batch)
  - Input load: 75%
  - Metric: bounded slowdown, with a 10 s cutoff (defined in the sketch below)
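For reference, a minimal Python rendering of the standard bounded-slowdown metric with the 10 s cutoff used here; the function name is an illustrative assumption.

    def bounded_slowdown(wait: float, run: float, tau: float = 10.0) -> float:
        # Response time over runtime, with the runtime clamped from
        # below by the cutoff `tau` so that very short jobs do not
        # dominate the metric; the result is floored at 1.
        return max(1.0, (wait + run) / max(run, tau))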
19. MPL: Response Time [chart]
20. MPL: Bounded Slowdown [chart]
21. Effect of Backfilling
- Does adding backfilling (i.e., knowledge of the future) to GS/batch scheduling help?
22. Backfilling: Response Time [chart]
- Backfilling helps short jobs, harms long jobs
23. Effect of Time Quantum
- Pros of a shorter time quantum:
  - More responsive system
  - Less external fragmentation
- Pros of a longer time quantum:
  - Less cache/memory pressure
  - Less synchronization overhead
- Setup:
  - GS at 75% load
  - Compare Pentium III to Itanium II
24. Time Quantum: Response Time [chart]
25. Effect of Load
- Comparing FCFS, GS, SB, and FCS, all with backfilling
- Varying the offered load by increasing run times (see the sketch below)
- Load values: 40% to 90%
- No measurements past the saturation point
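One common way to compute a trace's offered load, sketched in Python (reusing the job dicts from the workload sketch above); this textbook definition is an assumption, not a formula quoted from the talk.

    def offered_load(jobs, n_procs):
        # Offered load = demanded CPU-seconds / available CPU-seconds.
        # Scaling every runtime by a constant scales the load linearly,
        # which is how a 40%-90% sweep can be generated from one trace.
        span = max(j["arrival"] for j in jobs) - min(j["arrival"] for j in jobs)
        demand = sum(j["procs"] * j["runtime"] for j in jobs)
        return demand / (n_procs * span)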
26. Load: Response Time [chart]
27. Load: Bounded Slowdown [chart]
28. Response Time: Median [chart]
29. Bounded Slowdown: Median [chart]
30. 500 Shortest Jobs: CDF [chart]
31. 500 Longest Jobs: CDF [chart]
32. Scientific Applications
- Sage and Sweep3D:
  - Hydrodynamics codes
  - Approx. 50-80% of LANL cycles
  - Memory-constrained
  - Mostly operating out of cache
  - Relatively load-balanced
- Parameters:
  - MPL 2, 100 ms time quantum
  - 1000 jobs, modeled arrival times, random run times
  - Realistic inputs, biased toward short runs
33. Response Time [chart]
34. Bounded Slowdown [chart]
35. Conclusions: Methodology
- A more realistic evaluation of job scheduling
- Repeatable experiments
- Allows isolation of factors
- Direct comparison of platforms on the applications you care most about
36. Conclusions: Experiments
- Significant improvement over FCFS can be achieved with multiprogramming, even at MPL 2
- Backfilling can also make a difference
- Batch scheduling discriminates against short jobs
- Multiprogramming pays off for scientific apps, even with MPL 2
- FCS can outperform explicit/implicit coscheduling
- For more information: eitanf@lanl.gov
37. Time Quantum: Slowdown [chart]