Title: Designing Parallel Operating Systems using Modern Interconnects
1. Designing Parallel Operating Systems using Modern Interconnects
Toward Realistic Evaluation of Job Scheduling Strategies
Eitan Frachtenberg, with Dror Feitelson, Fabrizio Petrini, and Juan Fernandez
Computer and Computational Sciences Division, Los Alamos National Laboratory
2. Outline
- The challenges of parallel job scheduling evaluation
- Emulation: rationale, strengths, and weaknesses
- Experimental results and analysis:
  - How do different algorithms react to increasing load?
  - Can knowing the future help?
  - What is the effect of multiprogramming?
  - Which applications is it good for?
3. Parallel Job Scheduling
- The task: assign compute resources to parallel jobs
- The computers (clusters and MPPs):
  - Range from hundreds of processors to 10,000 and more
  - Typically homogeneous, connected by a fast interconnect
- Jobs arrive dynamically, with different sizes and runtimes, requiring online scheduling
- Mostly fine-grained communication, lots of memory
- Mix of serial, parallel, short, and long jobs
4. Scheduling Taxonomy
- Essentially rectangle packing: each job occupies processors (space) for some duration (time)
- Main dimensions: space sharing and time sharing
- Additional queue dimension: backfilling, priorities
5. Backfilling
- Backfilling is a technique that moves jobs forward in the queue
- Requires advance knowledge of run times (or reservations)
- Reduces external fragmentation and improves utilization, responsiveness, and throughput
- Has several variations (a sketch of one common variant follows below)
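A minimal sketch, in Python, of one well-known variant (EASY backfilling): a waiting job may jump ahead of the queue's head only if it cannot delay the head's reservation. The Job fields, list-based queue, and helper names are illustrative assumptions, not details from the talk.

    from dataclasses import dataclass

    @dataclass
    class Job:
        name: str
        procs: int          # processors requested
        est_runtime: float  # user-supplied runtime estimate (seconds)

    def easy_backfill(queue, free_procs, running, now):
        # Pick the next job to start under EASY backfilling.
        # `queue` is FCFS-ordered; `running` holds (finish_time, procs)
        # pairs for executing jobs. Returns a job to start now, or None.
        if not queue:
            return None
        if queue[0].procs <= free_procs:
            return queue.pop(0)  # head fits: plain FCFS start

        # Head does not fit: find its reservation ("shadow") time by
        # freeing running jobs in finish-time order until it would fit.
        avail, shadow, extra = free_procs, None, 0
        for finish, procs in sorted(running):
            avail += procs
            if avail >= queue[0].procs:
                shadow = finish
                extra = avail - queue[0].procs  # spare CPUs at shadow time
                break
        if shadow is None:
            return None  # head can never fit on this machine

        # Backfill: a later job may start now only if it fits and
        # cannot push back the head's reservation.
        for i, job in enumerate(list(queue)[1:], start=1):
            if job.procs <= free_procs and (
                    now + job.est_runtime <= shadow or job.procs <= extra):
                return queue.pop(i)
        return None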
6. Time Sharing
- Does not require reservation times, but can be combined with backfilling
- Reduces external fragmentation further, and possibly even internal fragmentation, resulting in improved utilization, responsiveness, and throughput
- But also challenging:
  - Memory pressure
  - Context-switch overheads
  - Process-synchronization tradeoffs:
    - Tightly-coupled processes must be coscheduled
    - Coordination can incur overhead and fragmentation
7. Time Sharing Spectrum
[Figure: schedulers placed on a coordination spectrum, from none to explicit: Local (none); DCS, ICS/SB, CC, PB (implicit); FCS, BCS (hybrid); GS (explicit)]
- No coordination: local UNIX scheduling
- Explicit coordination:
  - Global clock (centralized)
  - Global context-switches to a known job
- Implicit coordination: infer synchronization information at the sender side, receiver side, or both
- Hybrid: global coordination with local autonomy
8. Without Time Sharing
- Short processes wait in the queue for long periods
- External fragmentation creates many holes
9. Time Sharing: GS
- Gang Scheduling multiprograms several jobs
- Reduces response time and fills holes
- Incurs more overhead and memory pressure (governed by the time quantum); see the sketch below
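A minimal sketch of the Ousterhout-matrix view behind gang scheduling: rows are time slots, columns are processors, and the scheduler context-switches the whole machine from one row to the next every time quantum. The first-fit packing and all names here are illustrative assumptions.

    def build_ousterhout_matrix(jobs, n_procs, mpl):
        # Pack (name, width) jobs into `mpl` rows of `n_procs` columns.
        # All processes of a job share a row, so when that row's time
        # quantum comes up they run (and can synchronize) together.
        matrix = [[None] * n_procs for _ in range(mpl)]
        for name, width in jobs:
            for row in matrix:  # first-fit into the first feasible slot
                free = [c for c, slot in enumerate(row) if slot is None]
                if len(free) >= width:
                    for c in free[:width]:
                        row[c] = name
                    break
            else:
                raise RuntimeError(f"MPL {mpl} exceeded for job {name}")
        return matrix

    # Example: 8 processors at MPL 2. Row 0 runs job A on all CPUs;
    # row 1 runs B and C side by side; rows alternate every quantum.
    m = build_ousterhout_matrix([("A", 8), ("B", 4), ("C", 4)], 8, 2)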
10. Time Sharing: SB
- Spin-Block (ICS) is a sender-side coordination heuristic; see the sketch below
- Reduces overhead, increases scalability
- Performs poorly with fine-grained communication
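A minimal sketch of the spin-block idea: spin briefly in the hope that the communicating peer is coscheduled, then block and yield the CPU. The threshold value and the use of a threading.Event as a stand-in for message arrival are illustrative assumptions; implementations typically tie the threshold to the context-switch cost.

    import threading
    import time

    SPIN_THRESHOLD = 50e-6  # illustrative: on the order of a context switch

    def spin_block_wait(msg_arrived: threading.Event) -> bool:
        # Phase 1: busy-wait. Cheap if the peer is running concurrently
        # and replies within the threshold.
        deadline = time.perf_counter() + SPIN_THRESHOLD
        while time.perf_counter() < deadline:
            if msg_arrived.is_set():
                return True   # peer answered while we spun: coscheduled
        # Phase 2: block. Yields the CPU so another process can run,
        # instead of wasting cycles waiting on a descheduled peer.
        msg_arrived.wait()
        return False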
11. Time Sharing: FCS
- Combines global synchronization with local information
- Relies on scalable primitives for global coordination and information exchange
- Measures communication characteristics, such as granularity and wait times
- Classifies processes based on synchronization requirements
- Schedules processes based on class
- Gives preference to short jobs
12. FCS Classification
[Figure: processes classified by communication granularity (fine/coarse) and block times (short/long)]
- DC (coarse granularity): locally scheduled
- CS (fine granularity, short block times): always gang-scheduled
- F (fine granularity, long block times): preferably gang-scheduled
A classification sketch follows.
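A minimal sketch of such a classifier in Python. Only the three classes and the two measured quantities come from the slides; the numeric thresholds and function name are illustrative assumptions.

    # Illustrative thresholds (microseconds); a real system would derive
    # them from measured context-switch cost and the time-quantum length.
    FINE_GRAIN_US = 100    # messages more frequent than this: fine-grained
    LONG_BLOCK_US = 1000   # average waits longer than this: long block times

    def classify(granularity_us: float, block_time_us: float) -> str:
        # DC: coarse-grained, scheduled locally.
        # CS: fine-grained and synchronizing well, always gang-scheduled.
        # F:  fine-grained but blocked for long stretches anyway,
        #     preferably gang-scheduled.
        if granularity_us > FINE_GRAIN_US:
            return "DC"
        return "F" if block_time_us > LONG_BLOCK_US else "CS"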
13. Evaluation Challenges
- Theoretical analysis (queuing theory):
  - Not applicable to time sharing, due to unknown parameters, application structure, and feedbacks
- Simulation:
  - Many assumptions, not all known/reported
  - Hard to reproduce; many studies provide contradictory results, often showing that theirs is best
  - Rarely factors in application characteristics
- Experiments with real sites and workloads:
  - Largely impractical and irreproducible
- Emulation
14. Emulation Methodology
- Framework for studying scheduling algorithms:
  - Runs any MPI application on a cluster
  - Implements several scheduling algorithms
  - Allows control over input parameters
  - Provides detailed logs and analysis tools
- Testing in a repeatable, dynamic environment:
  - Dynamic job arrivals, with varying time and space requirements (see the workload sketch below)
  - Complex, longer, and more realistic workloads
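A minimal sketch of the kind of synthetic dynamic workload such a framework replays; every distribution and parameter here is an illustrative assumption, not the talk's actual workload model.

    import random

    def make_workload(n_jobs=1000, n_procs=64, seed=42):
        # Generate a dynamic workload: exponential interarrival times,
        # power-of-two job sizes, and heavy-tailed runtimes, so the mix
        # contains serial, parallel, short, and long jobs.
        rng = random.Random(seed)
        t, jobs = 0.0, []
        for i in range(n_jobs):
            t += rng.expovariate(1 / 30.0)          # mean interarrival: 30 s
            size = 2 ** rng.randint(0, 6)           # 1..64 processors
            runtime = rng.lognormvariate(4.0, 1.5)  # heavy tail of long jobs
            jobs.append({"id": i, "arrival": t,
                         "procs": size, "runtime": runtime})
        return jobs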
15. Evaluation by Emulation
- Pros:
  - Real: no hidden assumptions or overheads
  - Configurable: choice of parameters and workloads
  - Repeatable: same experiment, same results
  - Portable: allows the isolation of HW factors
- Cons:
  - Slow
  - Requires more resources than analysis/simulation
  - GIGO: results are only as representative as the input
16. Experimental Environment
- Implemented on top of STORM, a scalable resource management system for clusters
- Algorithms: FCFS, GS, SB, and FCS, all using backfilling
- MPI synthetic (BSP) and LANL applications
  - Different granularities and communication patterns
- Flexible workload model, 1000 jobs
  - Time shrinking
- Three clusters, all using QsNet:
  - Pentium III: 32 nodes x 2 CPUs, 1 GB/node
  - Itanium II: 32 x 2, 2 GB/node
  - Alpha EV6: 64 x 4, 8 GB/node
17. Experiments Overview
- Use synthetic applications for basic insights:
  - Effect of multiprogramming level (MPL)
  - Effect of backfilling
  - Effect of time quantum
  - Effect of load
- Use LANL's Sage and Sweep3D for an application study
- Caveat emptor:
  - Only LANL applications
  - Does not follow the input workload closely
  - Limited set of inputs
  - Different architecture (Alpha)
18. Effect of MPL
- Questions:
  - What is the effect of preemptive multiprogramming compared to FCFS (batch) scheduling?
  - Does a higher MPL mean higher performance?
- Parameters:
  - GS with MPL values 1 to 6 (MPL 1 = batch)
  - Input load: 75%
  - Metric: bounded slowdown, with a 10 s cutoff (defined in the sketch below)
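For reference, a minimal Python rendering of the standard bounded-slowdown metric with the 10 s cutoff used here; the function name is an illustrative assumption.

    def bounded_slowdown(wait: float, run: float, tau: float = 10.0) -> float:
        # Response time over runtime, with the runtime clamped from
        # below by the cutoff `tau` so that very short jobs do not
        # dominate the metric; the result is floored at 1.
        return max(1.0, (wait + run) / max(run, tau))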
19. MPL: Response Time [chart]
20. MPL: Bounded Slowdown [chart]
21. Effect of Backfilling
- Does adding backfilling (i.e., knowledge of the future) to GS/batch scheduling help?
22. Backfilling: Response Time [chart]
- Backfilling helps short jobs, harms long jobs
23. Effect of Time Quantum
- Pros of a shorter time quantum:
  - More responsive system
  - Less external fragmentation
- Pros of a longer time quantum:
  - Less cache/memory pressure
  - Less synchronization overhead
- Setup:
  - GS at 75% load
  - Compare Pentium III to Itanium II
24. Time Quantum: Response Time [chart]
25. Effect of Load
- Comparing FCFS, GS, SB, and FCS, all with backfilling
- Varying the offered load by increasing run times (see the sketch below)
- Load values: 40% to 90%
- No measurements past the saturation point
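One common way to compute a trace's offered load, sketched in Python (reusing the job dicts from the workload sketch above); this textbook definition is an assumption, not a formula quoted from the talk.

    def offered_load(jobs, n_procs):
        # Offered load = demanded CPU-seconds / available CPU-seconds.
        # Scaling every runtime by a constant scales the load linearly,
        # which is how a 40%-90% sweep can be generated from one trace.
        span = max(j["arrival"] for j in jobs) - min(j["arrival"] for j in jobs)
        demand = sum(j["procs"] * j["runtime"] for j in jobs)
        return demand / (n_procs * span)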
26. Load: Response Time [chart]
27. Load: Bounded Slowdown [chart]
28. Response Time: Median [chart]
29. Bounded Slowdown: Median [chart]
30. 500 Shortest Jobs: CDF [chart]
31. 500 Longest Jobs: CDF [chart]
32. Scientific Applications
- Sage and Sweep3D:
  - Hydrodynamics codes
  - Approx. 50-80% of LANL cycles
  - Memory-constrained
  - Mostly operating out of cache
  - Relatively load-balanced
- Parameters:
  - MPL 2, 100 ms time quantum
  - 1000 jobs, modeled arrival times, random run times
  - Realistic inputs, biased toward short runs
33. Response Time [chart]
34. Bounded Slowdown [chart]
35. Conclusions: Methodology
- A more realistic evaluation of job scheduling
- Repeatable experiments
- Allows isolation of factors
- Direct comparison of platforms on the applications you care most about
36. Conclusions: Experiments
- Significant improvement over FCFS can be achieved with multiprogramming, even at MPL 2
- Backfilling can also make a difference
- Batch scheduling discriminates against short jobs
- Multiprogramming pays off for scientific apps, even with MPL 2
- FCS can outperform explicit/implicit coscheduling
- For more information: eitanf@lanl.gov
37. Time Quantum: Slowdown [chart]