Title: Designing Parallel Operating Systems using Modern Interconnects
Slide 1: Designing Parallel Operating Systems using Modern Interconnects
Eitan Frachtenberg (eitanf@lanl.gov)
With Fabrizio Petrini, Juan Fernandez, Dror Feitelson, Jose-Carlos Sancho, Kei Davis
Computer and Computational Sciences Division, Los Alamos National Laboratory
Ideas that change the world
Slide 2: Cluster Supercomputers
- Growing in prevalence and performance: 7 out of the 10 top supercomputers
- Running parallel applications
- Advanced, high-end interconnects
Slide 3: Distributed vs. Parallel
- Distributed and parallel applications (including operating systems) may be distinguished by their use of global and collective operations
- Distributed: local information, relatively small number of point-to-point messages
- Parallel: global synchronization barriers, reductions, exchanges
Slide 4: System Software Components
(Diagram: system software components - resource management, job scheduling, communication library, fault tolerance, parallel I/O)
Slide 5: Problems with System Software
- Independent single-node OSs (e.g. Linux) connected by distributed dæmons
- Redundant components
- Performance hits
- Scalability issues
- Load-balancing issues
Slide 6: OS Collective Operations
- Many OS tasks are inherently global or collective operations
- Job launching, data dissemination
- Context switching
- Job termination (normal and forced)
- Load balancing
Slide 7:
(Diagram: two nodes, each running its own local operating system with per-node resource management, job scheduling, user-level communication, parallel I/O, and fault tolerance; these are unified into a single global parallel operating system providing job scheduling, fault tolerance, communication, parallel I/O, and resource management across nodes)
Slide 8: The Vision
- Modern interconnects are very powerful
- Collective operations
- Programmable NICs
- On-board RAM
- Use a small set of network mechanisms as parallel OS infrastructure
- Build upon this infrastructure to create unified system software
- System software inherits scalability and performance from the network features
Slide 9: Example: ASCI Q Barrier (HotI03)
Slide 10: Parallel OS Primitives
- System software is built atop three primitives (sketched below)
- Xfer-And-Signal: transfer a block of data to a set of nodes; optionally signal a local/remote event upon completion
- Compare-And-Write: compare a global variable on a set of nodes; optionally write the global variable on the same set of nodes
- Test-Event: poll a local event
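A minimal C sketch of what an API for these primitives might look like. The names, types, and signatures below are illustrative assumptions for this deck, not the actual QsNet/libelan interface.

    /* Illustrative sketch of the three parallel-OS primitives (hypothetical API). */
    #include <stddef.h>
    #include <stdbool.h>

    typedef struct node_set node_set_t;   /* a set of destination nodes (assumed type) */
    typedef struct event    event_t;      /* a local or remote event (assumed type)    */

    typedef enum { CMP_EQ, CMP_GE, CMP_GT } cmp_op_t;

    /* Xfer-And-Signal: transfer a block of data to a set of nodes and
     * optionally trigger an event at the source and/or the destinations.
     * Returns 0 on success. */
    int xfer_and_signal(const void *buf, size_t len, const node_set_t *dests,
                        event_t *src_event, event_t *dest_event);

    /* Compare-And-Write: compare a global variable against a value on a set of
     * nodes (partial results are combined in the network) and optionally write
     * a new value on the same nodes.  Returns true if the comparison holds on
     * all nodes. */
    bool compare_and_write(int global_var_id, cmp_op_t op, long value,
                           const node_set_t *nodes, bool do_write, long new_value);

    /* Test-Event: poll a local event without blocking; true if it has fired. */
    bool test_event(const event_t *ev);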
Slide 11: Core Primitives on QsNet
- System software is built atop three primitives
- Xfer-And-Signal (QsNet): node S transfers a block of data to nodes D1, D2, D3 and D4
- Events are triggered at the source and the destinations
Slide 12: Core Primitives (cont.)
- System software is built atop three primitives
- Compare-And-Write (QsNet): node S compares variable V on nodes D1, D2, D3 and D4
- Is V equal to, greater than or equal to, or greater than a given value?
Slide 13: Core Primitives (cont.)
- System software is built atop three primitives
- Compare-And-Write (QsNet): node S compares variable V on nodes D1, D2, D3 and D4
- Partial results are combined in the switches
Slide 14: System Software Components
(Diagram: system software components - resource management, job scheduling, communication library, fault tolerance, parallel I/O)
Slide 15: Scalable Tool for Resource Management (STORM)
- Inherits scalability from the network primitives
- Data dissemination and coordination
- Interactive job-launching speeds
- Context switching at the millisecond level
- Described in SC02 (job-launch sketch below)
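To illustrate how the primitives support resource management, here is a minimal sketch of a launch step built on the hypothetical wrappers from the primitive sketch above. The flow (broadcast the binary, then globally confirm that every node has started it) is an assumption for illustration, not STORM's actual code.

    /* Hypothetical job-launch flow on top of the primitive sketch above. */
    #include <stddef.h>
    #include <stdbool.h>

    /* Types and prototypes as in the primitive sketch (hypothetical). */
    typedef struct node_set node_set_t;
    typedef struct event    event_t;
    typedef enum { CMP_EQ, CMP_GE, CMP_GT } cmp_op_t;
    int  xfer_and_signal(const void *buf, size_t len, const node_set_t *dests,
                         event_t *src_event, event_t *dest_event);
    bool compare_and_write(int global_var_id, cmp_op_t op, long value,
                           const node_set_t *nodes, bool do_write, long new_value);

    #define VAR_JOB_STARTED 1   /* assumed global-variable id: 1 once the local process runs */

    /* Launch one job: broadcast the binary image to all nodes in one collective
     * transfer, then wait until every node reports that the process has started. */
    int launch_job(const void *binary, size_t size, const node_set_t *nodes)
    {
        /* One hardware-supported broadcast replaces N point-to-point copies. */
        if (xfer_and_signal(binary, size, nodes, NULL, NULL) != 0)
            return -1;

        /* Global test: has VAR_JOB_STARTED reached 1 on *all* nodes?
         * The comparison is combined in the network switches. */
        while (!compare_and_write(VAR_JOB_STARTED, CMP_GE, 1, nodes, false, 0))
            ;   /* poll until the global condition holds */

        return 0;
    }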
Slide 16: State of the Art in Resource Management
- Resource managers (e.g. PBS, LSF, RMS, LoadLeveler, Maui) are typically implemented using TCP/IP, which favors portability over performance
- They use poorly scaling algorithms for the distribution and collection of data and control messages, favoring development time over performance
- Scalable performance is not important for small clusters, but it is crucial for large ones
- There is a need for fast and scalable resource management
Slide 17: Experimental Setup
- 64-node / 256-processor AlphaServer ES40 cluster
- 2 independent network rails of Quadrics Elan3
- Files are placed in a ramdisk to avoid I/O bottlenecks and expose the performance of the resource-management algorithms
Slide 18: Launch Times (Unloaded System)
- The launch time stays constant as the number of processors increases: STORM is highly scalable
Slide 19: Launch Times (Loaded System, 12 MB)
- Worst case: 1.5 seconds to launch a 12 MB file on 256 processors
Slide 20: Measured and Estimated Launch Times
- The model shows that on an ES40-based AlphaServer cluster a 12 MB binary can be launched in 135 ms on 16,384 nodes
Slide 21: Comparative Evaluation (Measured and Modeled)
Slide 22: System Software Components
(Diagram: system software components - resource management, job scheduling, communication library, fault tolerance, parallel I/O)
Slide 23: Job Scheduling
- Controls the allocation of space and time resources to jobs
- HPC applications have special requirements
- Multiple processing and network resources
- Synchronization (< 1 ms granularity)
- Potentially memory hogs with little locality
- Has a significant effect on throughput, responsiveness, and utilization
Slide 24: First-Come-First-Serve (FCFS)
Slide 25: Gang Scheduling (GS)
Slide 26: Implicit CoScheduling
Slide 27: Hybrid Methods
- Combine global synchronization with local information
- Rely on scalable primitives for global coordination and information exchange
- First implementation of two novel algorithms
- Flexible CoScheduling (FCS)
- Buffered CoScheduling (BCS)
Slide 28: Flexible CoScheduling (FCS)
- Measure communication characteristics, such as granularity and wait times
- Classify processes based on their synchronization requirements
- Schedule processes based on class
- Described in IPDPS03
Slide 29: FCS Classification
(Chart: processes classified by communication granularity (fine vs. coarse) and blocking times (short vs. long))
- CS: always gang-scheduled
- F: preferably gang-scheduled
- DC: locally scheduled
- A decision sketch follows below
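A minimal sketch of an FCS-style classification step. The thresholds and the exact decision rules below are illustrative assumptions; only the three classes come from the slides.

    /* Minimal sketch of an FCS-style classification step (illustrative only). */
    #include <stdio.h>

    typedef enum { CLASS_CS, CLASS_F, CLASS_DC } fcs_class_t;

    typedef struct {
        double avg_granularity_ms;   /* average compute time between communications */
        double avg_block_ms;         /* average time spent blocked waiting for peers */
    } proc_stats_t;

    /* Assumed thresholds for "fine-grained" and "long blocking". */
    #define FINE_GRAIN_MS   5.0
    #define LONG_BLOCK_MS   2.0

    static fcs_class_t classify(const proc_stats_t *s)
    {
        if (s->avg_granularity_ms >= FINE_GRAIN_MS)
            return CLASS_DC;    /* coarse-grained: schedule locally                      */
        if (s->avg_block_ms >= LONG_BLOCK_MS)
            return CLASS_F;     /* fine-grained but often waiting: gang-schedule if possible */
        return CLASS_CS;        /* fine-grained and tightly coupled: always gang-schedule */
    }

    int main(void)
    {
        proc_stats_t p = { .avg_granularity_ms = 1.2, .avg_block_ms = 0.3 };
        printf("class = %d\n", classify(&p));   /* prints 0 (CLASS_CS) for this example */
        return 0;
    }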
Slide 30: Methodology
- Synthetic, controllable MPI programs
- Workload
- Static: all jobs start together
- Dynamic: different sizes, arrival times, and run times
- Various schedulers implemented: FCFS, GS, FCS, SB (ICS), BCS
- Emulation vs. simulation
- The actual implementation takes into account all the overheads and factors of a real system
Slide 31: Hardware Environment
- Environment ported to three architectures and clusters
- Crescendo: 32x2 Pentium III, 1 GB
- Accelerando: 32x2 Itanium II, 2 GB
- Wolverine: 64x4 Alpha ES40, 8 GB
Slide 32: Synthetic Application
- Bulk-synchronous, 3 ms basic granularity
- Can control granularity, variability, and communication pattern (see the sketch below)
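A minimal sketch of what such a bulk-synchronous synthetic loop could look like in MPI. The busy-wait compute phase, iteration count, and barrier communication pattern are assumptions for illustration, not the benchmark actually used.

    /* Sketch of a bulk-synchronous synthetic MPI program (illustrative only). */
    #include <mpi.h>

    #define GRANULARITY_SEC 0.003   /* ~3 ms of "computation" per iteration */
    #define ITERATIONS      1000

    /* Busy-wait for the requested amount of time to emulate computation. */
    static void compute(double seconds)
    {
        double start = MPI_Wtime();
        while (MPI_Wtime() - start < seconds)
            ;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        for (int i = 0; i < ITERATIONS; i++) {
            compute(GRANULARITY_SEC);       /* compute phase (granularity knob)        */
            MPI_Barrier(MPI_COMM_WORLD);    /* communication phase; a ring exchange or
                                               no communication at all could be used
                                               here to vary the pattern               */
        }

        MPI_Finalize();
        return 0;
    }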
Slide 33: Synthetic Scenarios
(Four scenarios: balanced, complementing, imbalanced, mixed)
Slide 34: Turnaround Time
Slide 35: Dynamic Workloads (JSSPP03)
- Static workloads are simple and offer insights, but are not realistic
- Most real-life workloads are more complex
- Users submit jobs dynamically, with varying time and space requirements
Slide 36: Dynamic Workload Methodology
- Emulation using a workload model (Lublin03)
- 1000 jobs, approx. 12 days, shrunk to 2 hrs
- Load varied by scaling arrival times
- Uses the same synthetic application, with random arrival time, run time, and size based on the model
- Granularity (fine, medium, coarse)
- Communication pattern (ring, barrier, none)
- Recent study with scientific apps (as yet unpublished)
Slide 37: Load vs. Response Time
Slide 38: Load vs. Bounded Slowdown
Slide 39: Timeslice vs. Response Time
Slide 40: System Software Components
(Diagram: system software components - resource management, job scheduling, communication library, fault tolerance, parallel I/O)
Slide 41: Buffered CoScheduling (BCS)
- Buffer all communications
- Exchange information about pending communication every time slice
- Schedule and execute communication
- Implemented mostly on the NIC
- Requires fine-grained heartbeats
- Described in SC03
Slide 42: Design and Implementation
- Global synchronization
- A strobe is sent at regular intervals (time slices): Compare-And-Write + Xfer-And-Signal (master), Test-Event (slaves)
- All system activities are tightly coupled
- Global scheduling: exchange of communication requirements, via Xfer-And-Signal + Test-Event
- Communication scheduling
- Real transmission, via Xfer-And-Signal + Test-Event
- (Strobe loop sketched below)
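A minimal sketch of the strobe flow described above, using the hypothetical primitive wrappers from the earlier sketch. The control flow is an illustration of the slide's description, not the actual BCS-MPI implementation.

    /* Minimal sketch of the BCS strobe on top of the hypothetical primitives. */
    #include <stdbool.h>
    #include <stddef.h>

    /* Types and prototypes as in the primitive sketch (hypothetical). */
    typedef struct node_set node_set_t;
    typedef struct event    event_t;
    typedef enum { CMP_EQ, CMP_GE, CMP_GT } cmp_op_t;
    int  xfer_and_signal(const void *buf, size_t len, const node_set_t *dests,
                         event_t *src_event, event_t *dest_event);
    bool compare_and_write(int global_var_id, cmp_op_t op, long value,
                           const node_set_t *nodes, bool do_write, long new_value);
    bool test_event(const event_t *ev);

    #define VAR_READY 0   /* assumed global variable: set to 1 when a node is ready */

    /* Master: once per time slice, check that all nodes are ready and broadcast
     * the strobe message that opens the next communication-scheduling phase. */
    void master_strobe(const node_set_t *all_nodes, const void *strobe_msg,
                       size_t len, event_t *strobe_event)
    {
        while (!compare_and_write(VAR_READY, CMP_GE, 1, all_nodes, false, 0))
            ;                              /* wait until every node is ready */
        xfer_and_signal(strobe_msg, len, all_nodes, NULL, strobe_event);
    }

    /* Slave: poll the local strobe event, then exchange pending-communication
     * descriptors and carry out the scheduled transmissions for this slice. */
    void slave_timeslice(const event_t *strobe_event)
    {
        while (!test_event(strobe_event))
            ;                              /* wait for the strobe */
        /* ... exchange communication requirements (Xfer-And-Signal + Test-Event),
         *     then perform the scheduled transmissions ... */
    }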
Slide 43: Design and Implementation
- Implementation in the NIC
- Application processes interact with NIC threads
- An MPI primitive results in a descriptor posted to the NIC
- Communications are buffered
- Cooperative threads running in the NIC
- Synchronize
- Partially exchange control information
- Schedule communications
- Perform the real transmissions and the reduce computations
- Computation and communication are completely overlapped
Slide 44: Design and Implementation
- Non-blocking primitives: MPI_Isend/MPI_Irecv
Slide 45: Design and Implementation
- Blocking primitives: MPI_Send/MPI_Recv
Slide 46: Performance Evaluation
- BCS-MPI vs. Quadrics MPI
- Experimental setup
- Benchmarks and applications
- NPB (IS, EP, MG, CG, LU) - Class C
- SWEEP3D - 50x50x50
- SAGE - timing.input
- Scheduling parameters
- 500µs communication-scheduling time slice (1 rail)
- 250µs communication-scheduling time slice (2 rails)
Slide 47: Performance Evaluation
- Benchmarks and applications (Class C)

  Application        Slowdown (%)
  IS      (32 PEs)    10.40
  EP      (49 PEs)     5.35
  MG      (32 PEs)     4.37
  CG      (32 PEs)    10.83
  LU      (32 PEs)    15.04
  SWEEP3D (49 PEs)    -2.23
  SAGE    (62 PEs)    -0.42
Slide 48: Performance Evaluation
- SAGE - timing.input (IA32)
(Chart: 0.5% speedup)
Slide 49: Blocking Communication
- Blocking vs. non-blocking SWEEP3D (IA32)
- MPI_Send/MPI_Recv replaced with MPI_Isend/MPI_Irecv + MPI_Waitall (see the sketch below)
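A minimal sketch of that transformation; the buffers, peers, and tags are hypothetical.

    /* Sketch: replacing blocking MPI_Send/MPI_Recv with non-blocking calls plus
     * a single MPI_Waitall, so communication can be overlapped with computation.
     * Buffers, peers, and tags are hypothetical. */
    #include <mpi.h>

    void exchange(double *sendbuf, double *recvbuf, int count, int up, int down)
    {
        /* Blocking version (original):
         *   MPI_Send(sendbuf, count, MPI_DOUBLE, up,   0, MPI_COMM_WORLD);
         *   MPI_Recv(recvbuf, count, MPI_DOUBLE, down, 0, MPI_COMM_WORLD,
         *            MPI_STATUS_IGNORE);
         */

        /* Non-blocking version: post both operations, then wait for both at once. */
        MPI_Request reqs[2];
        MPI_Irecv(recvbuf, count, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, count, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }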
Slide 50: System Software Components
(Diagram: system software components - resource management, job scheduling, communication library, fault tolerance, parallel I/O)
Slide 51: Fault Tolerance Today
- Fault tolerance is commonly achieved, if at all, by
- Checkpointing
- Segmentation of the machine
- Removal of fault-prone components
- Massive hardware redundancy is not considered economically feasible
Slide 52: Our Approach to Fault Tolerance
- Recent work shows that scalable, low-overhead, system-level fault tolerance is within reach with current technology, and can be achieved through a global operating system
- Two results provide the basis for this claim
- Buffered CoScheduling, which enforces frequent global recovery lines and global control
- Feasibility of incremental checkpointing
Slide 53: Checkpointing and Recovery
- Simplicity
- Easy implementation
- Cost-effective
- No additional hardware support
- Critical aspect: bandwidth requirements
Slide 54: Reducing Bandwidth
- Incremental checkpointing: only the memory modified since the previous checkpoint is saved to stable storage (see the sketch below)
(Diagram: incremental vs. full checkpoint of the process state)
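One common way to detect the modified pages is to write-protect the checkpointed region and catch the first write to each page. The sketch below illustrates that idea under simplifying assumptions (a single anonymous region, minimal error handling, mprotect called from the signal handler); it is not the mechanism evaluated in this work.

    /* Illustrative sketch of page-level incremental checkpointing via write
     * protection.  Simplified: one region, minimal error handling, mprotect()
     * called from the signal handler (common in practice, not guaranteed by POSIX). */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REGION_PAGES 1024
    static long  page_size;
    static char *region;                      /* checkpointed memory region */
    static char  dirty[REGION_PAGES];         /* one flag per page          */

    /* First write to a protected page lands here: mark it dirty, unprotect it. */
    static void segv_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        uintptr_t addr = (uintptr_t)si->si_addr & ~(uintptr_t)(page_size - 1);
        if (addr < (uintptr_t)region ||
            addr >= (uintptr_t)region + REGION_PAGES * page_size)
            _exit(1);                                     /* a real fault elsewhere */
        dirty[(addr - (uintptr_t)region) / page_size] = 1;
        mprotect((void *)addr, page_size, PROT_READ | PROT_WRITE);
    }

    /* Take a checkpoint: save only dirty pages, then re-protect the region. */
    static void checkpoint(void (*save_page)(const void *page, size_t index))
    {
        for (size_t i = 0; i < REGION_PAGES; i++)
            if (dirty[i])
                save_page(region + i * page_size, i);
        memset(dirty, 0, sizeof dirty);
        mprotect(region, REGION_PAGES * page_size, PROT_READ);  /* arm next interval */
    }

    void ckpt_init(void)
    {
        page_size = sysconf(_SC_PAGESIZE);
        region = mmap(NULL, REGION_PAGES * page_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        mprotect(region, REGION_PAGES * page_size, PROT_READ);  /* start protected */
    }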
Slide 55: Enabling Automatic Checkpointing
(Diagram: checkpointing levels - application, run-time library, operating system, hardware. Moving down the stack, user intervention falls from high to low (automatic at the OS and hardware levels), while the amount of checkpoint data grows from low to high)
Slide 56: The Bandwidth Challenge
- Does current technology provide enough bandwidth?
Slide 57: Methodology
- Quantify the bandwidth requirements (see the worked example below)
- Checkpoint intervals: 1 s to 20 s
- Compare with the bandwidth currently available:
- 900 MB/s sustained network bandwidth (Quadrics QsNet II)
- 75 MB/s sustained bandwidth of a single disk (Ultra SCSI controller)
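The required checkpoint bandwidth follows directly from the amount of memory modified during one interval:

    required bandwidth = (memory modified per checkpoint interval) / (interval length)

For example, reading the two extremes off the chart on slide 59 for Sage-1000MB gives roughly 78.8 MB/s at the shortest timeslices and 12.1 MB/s at the longest; the former exceeds a single Ultra SCSI disk (75 MB/s), but both are well below the 900 MB/s sustained by QsNet II. (Attributing those two values to the 1 s and 20 s ends of the range is an assumption read off the chart.)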
Slide 58: Memory Footprint
(Chart: increasing memory footprint; 64 Itanium II processors)
Slide 59: Bandwidth Requirements
(Chart: required bandwidth (MB/s) vs. timeslice (s) for Sage-1000MB; the requirement decreases with longer timeslices, from 78.8 MB/s down to 12.1 MB/s)
Slide 60: Bandwidth Requirements for 1 Second
(Chart: the 1 s timeslice is the most demanding case; the required bandwidth increases with the memory footprint and is compared against single SCSI disk performance)
Slide 61: Increasing Memory Footprint Size
(Chart: average bandwidth (MB/s) vs. timeslice (s); the requirement increases sublinearly with the memory footprint)
Slide 62: Increasing Processor Count
(Chart: average bandwidth (MB/s) vs. timeslice (s), weak scaling; the requirement decreases slightly with processor count)
Slide 63: Technological Trends
(Chart: performance improvement per year; bandwidth increases at a faster pace than memory, while the performance of applications is bounded by memory improvements)
Slide 64: Conclusions
- As clusters grow, interconnection technology advances
- Better bandwidth and latency
- On-board programmable processor and RAM
- Hardware support for collective operations
- This allows the development of a common system infrastructure that is a parallel program in itself
Slide 65: Conclusions (cont.)
- On top of this infrastructure we built
- Scalable resource management (STORM)
- Novel job-scheduling algorithms
- Simplified system design and communication library
- A possible basis for transparent fault tolerance
Slide 66: Conclusions (cont.)
- Experimental performance evaluation demonstrates
- Scalable interactive job launching and context switching
- Multiprogramming parallel jobs is feasible
- Adaptive scheduling algorithms adjust to different job requirements, improving response times and slowdown across various workloads
- Transparent, frequent checkpointing is within current reach
Slide 67: References
- Eitan's web page: http://www.cs.huji.ac.il/etcs/pubs/
- Fabrizio's web page: http://www.c3.lanl.gov/fabrizio/publications.html
- PAL team web page: http://www.c3.lanl.gov/par_arch/Publications.html
Slide 68: Resource Overlapping
Slide 69: Turnaround Time
Slide 70: Response Time
Slide 71: Timeslice vs. Bounded Slowdown
Slide 72: FCFS vs. GS and MPL
Slide 73: FCFS vs. GS and MPL (2)
Slide 74: Backfilling
- Backfilling is a technique to move jobs forward in the queue
- It can be combined with time-sharing schedulers such as GS when all timeslots are full
- (A minimal sketch follows below)
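A minimal sketch of an EASY-style backfilling check with assumed data structures; the deck does not say which backfilling variant is used, so the reservation rule below (never delay the head-of-queue job) is an illustrative assumption.

    /* Illustrative EASY-style backfilling sketch (assumed data structures).
     * A waiting job may jump ahead only if it fits in the currently free
     * processors and does not delay the head-of-queue job's reservation. */
    #include <stdbool.h>

    typedef struct {
        int    procs;          /* processors requested           */
        double est_runtime;    /* user-supplied runtime estimate */
    } job_t;

    /* Decide whether 'candidate' can be backfilled right now. */
    bool can_backfill(const job_t *candidate,
                      int free_procs,            /* processors idle at time 'now'       */
                      double now,
                      double reservation_time,   /* earliest start of the head-of-queue */
                      int extra_procs_at_resv)   /* processors free at reservation_time
                                                    beyond what the head job needs      */
    {
        if (candidate->procs > free_procs)
            return false;                        /* does not fit right now              */

        if (now + candidate->est_runtime <= reservation_time)
            return true;                         /* finishes before the reservation     */

        /* Otherwise it may run past the reservation: allowed only if it uses
         * processors the head-of-queue job will not need at that time. */
        return candidate->procs <= extra_procs_at_resv;
    }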
Slide 76: Effect of Backfilling
Slide 77: Characterization
(Chart: Sage-1000MB memory behavior over time - a data-initialization phase followed by regular processing bursts)
Slide 78: Communication
(Chart: Sage-1000MB communication over time - regular, interleaved communication bursts)