Title: Designing Parallel Operating Systems using Modern Interconnects
Slide 1: Designing Parallel Operating Systems using Modern Interconnects
Eitan Frachtenberg (eitanf@lanl.gov)
With Fabrizio Petrini, Juan Fernandez, Dror Feitelson, Jose-Carlos Sancho, Kei Davis
Computer and Computational Sciences Division, Los Alamos National Laboratory
Ideas that change the world
Slide 2: Cluster Supercomputers
- Growing in prevalence and performance: 7 out of the 10 top supercomputers
- Running parallel applications
- Advanced, high-end interconnects
Slide 3: Distributed vs. Parallel
- Distributed and parallel applications (including operating systems) may be distinguished by their use of global and collective operations
- Distributed: local information, relatively small number of point-to-point messages
- Parallel: global synchronization barriers, reductions, exchanges
Slide 4: System Software Components
(Diagram: system software components - resource management, job scheduling, communication library, fault tolerance, parallel I/O)
Slide 5: Problems with System Software
- Independent single-node OSs (e.g. Linux) connected by distributed dæmons
- Redundant components
- Performance hits
- Scalability issues
- Load-balancing issues
Slide 6: OS Collective Operations
- Many OS tasks are inherently global or collective operations
- Job launching, data dissemination
- Context switching
- Job termination (normal and forced)
- Load balancing
Slide 7:
(Diagram: two nodes, each running its own local operating system with per-node resource management, job scheduling, user-level communication, parallel I/O, and fault tolerance; these are unified into a single global parallel operating system providing job scheduling, fault tolerance, communication, parallel I/O, and resource management across nodes)
Slide 8: The Vision
- Modern interconnects are very powerful
- Collective operations
- Programmable NICs
- On-board RAM
- Use a small set of network mechanisms as parallel OS infrastructure
- Build upon this infrastructure to create unified system software
- System software inherits scalability and performance from the network features
Slide 9: Example: ASCI Q Barrier (HotI03)
Slide 10: Parallel OS Primitives
- System software is built atop three primitives (sketched below)
- Xfer-And-Signal: transfer a block of data to a set of nodes; optionally signal a local/remote event upon completion
- Compare-And-Write: compare a global variable on a set of nodes; optionally write the global variable on the same set of nodes
- Test-Event: poll a local event
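A minimal C sketch of what an API for these primitives might look like. The names, types, and signatures below are illustrative assumptions for this deck, not the actual QsNet/libelan interface.

    /* Illustrative sketch of the three parallel-OS primitives (hypothetical API). */
    #include <stddef.h>
    #include <stdbool.h>

    typedef struct node_set node_set_t;   /* a set of destination nodes (assumed type) */
    typedef struct event    event_t;      /* a local or remote event (assumed type)    */

    typedef enum { CMP_EQ, CMP_GE, CMP_GT } cmp_op_t;

    /* Xfer-And-Signal: transfer a block of data to a set of nodes and
     * optionally trigger an event at the source and/or the destinations.
     * Returns 0 on success. */
    int xfer_and_signal(const void *buf, size_t len, const node_set_t *dests,
                        event_t *src_event, event_t *dest_event);

    /* Compare-And-Write: compare a global variable against a value on a set of
     * nodes (partial results are combined in the network) and optionally write
     * a new value on the same nodes.  Returns true if the comparison holds on
     * all nodes. */
    bool compare_and_write(int global_var_id, cmp_op_t op, long value,
                           const node_set_t *nodes, bool do_write, long new_value);

    /* Test-Event: poll a local event without blocking; true if it has fired. */
    bool test_event(const event_t *ev);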
Slide 11: Core Primitives on QsNet
- System software is built atop three primitives
- Xfer-And-Signal (QsNet): node S transfers a block of data to nodes D1, D2, D3 and D4
- Events are triggered at the source and the destinations
Slide 12: Core Primitives (cont.)
- System software is built atop three primitives
- Compare-And-Write (QsNet): node S compares variable V on nodes D1, D2, D3 and D4
- Is V equal to, greater than or equal to, or greater than a given value?
Slide 13: Core Primitives (cont.)
- System software is built atop three primitives
- Compare-And-Write (QsNet): node S compares variable V on nodes D1, D2, D3 and D4
- Partial results are combined in the switches
Slide 14: System Software Components
(Diagram: system software components - resource management, job scheduling, communication library, fault tolerance, parallel I/O)
Slide 15: Scalable Tool for Resource Management (STORM)
- Inherits scalability from the network primitives
- Data dissemination and coordination
- Interactive job-launching speeds
- Context switching at the millisecond level
- Described in SC02 (job-launch sketch below)
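To illustrate how the primitives support resource management, here is a minimal sketch of a launch step built on the hypothetical wrappers from the primitive sketch above. The flow (broadcast the binary, then globally confirm that every node has started it) is an assumption for illustration, not STORM's actual code.

    /* Hypothetical job-launch flow on top of the primitive sketch above. */
    #include <stddef.h>
    #include <stdbool.h>

    /* Types and prototypes as in the primitive sketch (hypothetical). */
    typedef struct node_set node_set_t;
    typedef struct event    event_t;
    typedef enum { CMP_EQ, CMP_GE, CMP_GT } cmp_op_t;
    int  xfer_and_signal(const void *buf, size_t len, const node_set_t *dests,
                         event_t *src_event, event_t *dest_event);
    bool compare_and_write(int global_var_id, cmp_op_t op, long value,
                           const node_set_t *nodes, bool do_write, long new_value);

    #define VAR_JOB_STARTED 1   /* assumed global-variable id: 1 once the local process runs */

    /* Launch one job: broadcast the binary image to all nodes in one collective
     * transfer, then wait until every node reports that the process has started. */
    int launch_job(const void *binary, size_t size, const node_set_t *nodes)
    {
        /* One hardware-supported broadcast replaces N point-to-point copies. */
        if (xfer_and_signal(binary, size, nodes, NULL, NULL) != 0)
            return -1;

        /* Global test: has VAR_JOB_STARTED reached 1 on *all* nodes?
         * The comparison is combined in the network switches. */
        while (!compare_and_write(VAR_JOB_STARTED, CMP_GE, 1, nodes, false, 0))
            ;   /* poll until the global condition holds */

        return 0;
    }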
Slide 16: State of the Art in Resource Management
- Resource managers (e.g. PBS, LSF, RMS, LoadLeveler, Maui) are typically implemented using TCP/IP, which favors portability over performance
- They use poorly scaling algorithms for the distribution and collection of data and control messages, favoring development time over performance
- Scalable performance is not important for small clusters, but it is crucial for large ones
- There is a need for fast and scalable resource management
Slide 17: Experimental Setup
- 64-node / 256-processor AlphaServer ES40 cluster
- 2 independent network rails of Quadrics Elan3
- Files are placed in a ramdisk to avoid I/O bottlenecks and expose the performance of the resource-management algorithms
Slide 18: Launch Times (Unloaded System)
- The launch time stays constant as the number of processors increases: STORM is highly scalable
Slide 19: Launch Times (Loaded System, 12 MB)
- Worst case: 1.5 seconds to launch a 12 MB file on 256 processors
Slide 20: Measured and Estimated Launch Times
- The model shows that on an ES40-based AlphaServer cluster a 12 MB binary can be launched in 135 ms on 16,384 nodes
Slide 21: Comparative Evaluation (Measured and Modeled)
Slide 22: System Software Components
(Diagram: system software components - resource management, job scheduling, communication library, fault tolerance, parallel I/O)
Slide 23: Job Scheduling
- Controls the allocation of space and time resources to jobs
- HPC applications have special requirements
- Multiple processing and network resources
- Synchronization (< 1 ms granularity)
- Potentially memory hogs with little locality
- Has a significant effect on throughput, responsiveness, and utilization
Slide 24: First-Come-First-Serve (FCFS)
Slide 25: Gang Scheduling (GS)
Slide 26: Implicit CoScheduling
Slide 27: Hybrid Methods
- Combine global synchronization with local information
- Rely on scalable primitives for global coordination and information exchange
- First implementation of two novel algorithms
- Flexible CoScheduling (FCS)
- Buffered CoScheduling (BCS)
Slide 28: Flexible CoScheduling (FCS)
- Measure communication characteristics, such as granularity and wait times
- Classify processes based on their synchronization requirements
- Schedule processes based on class
- Described in IPDPS03
Slide 29: FCS Classification
(Chart: processes classified by communication granularity (fine vs. coarse) and blocking times (short vs. long))
- CS: always gang-scheduled
- F: preferably gang-scheduled
- DC: locally scheduled
- A decision sketch follows below
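A minimal sketch of an FCS-style classification step. The thresholds and the exact decision rules below are illustrative assumptions; only the three classes come from the slides.

    /* Minimal sketch of an FCS-style classification step (illustrative only). */
    #include <stdio.h>

    typedef enum { CLASS_CS, CLASS_F, CLASS_DC } fcs_class_t;

    typedef struct {
        double avg_granularity_ms;   /* average compute time between communications */
        double avg_block_ms;         /* average time spent blocked waiting for peers */
    } proc_stats_t;

    /* Assumed thresholds for "fine-grained" and "long blocking". */
    #define FINE_GRAIN_MS   5.0
    #define LONG_BLOCK_MS   2.0

    static fcs_class_t classify(const proc_stats_t *s)
    {
        if (s->avg_granularity_ms >= FINE_GRAIN_MS)
            return CLASS_DC;    /* coarse-grained: schedule locally                      */
        if (s->avg_block_ms >= LONG_BLOCK_MS)
            return CLASS_F;     /* fine-grained but often waiting: gang-schedule if possible */
        return CLASS_CS;        /* fine-grained and tightly coupled: always gang-schedule */
    }

    int main(void)
    {
        proc_stats_t p = { .avg_granularity_ms = 1.2, .avg_block_ms = 0.3 };
        printf("class = %d\n", classify(&p));   /* prints 0 (CLASS_CS) for this example */
        return 0;
    }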
Slide 30: Methodology
- Synthetic, controllable MPI programs
- Workload
- Static: all jobs start together
- Dynamic: different sizes, arrival times, and run times
- Various schedulers implemented: FCFS, GS, FCS, SB (ICS), BCS
- Emulation vs. simulation
- The actual implementation takes into account all the overheads and factors of a real system
Slide 31: Hardware Environment
- Environment ported to three architectures and clusters
- Crescendo: 32x2 Pentium III, 1 GB
- Accelerando: 32x2 Itanium II, 2 GB
- Wolverine: 64x4 Alpha ES40, 8 GB
Slide 32: Synthetic Application
- Bulk-synchronous, 3 ms basic granularity
- Can control granularity, variability, and communication pattern (see the sketch below)
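A minimal sketch of what such a bulk-synchronous synthetic loop could look like in MPI. The busy-wait compute phase, iteration count, and barrier communication pattern are assumptions for illustration, not the benchmark actually used.

    /* Sketch of a bulk-synchronous synthetic MPI program (illustrative only). */
    #include <mpi.h>

    #define GRANULARITY_SEC 0.003   /* ~3 ms of "computation" per iteration */
    #define ITERATIONS      1000

    /* Busy-wait for the requested amount of time to emulate computation. */
    static void compute(double seconds)
    {
        double start = MPI_Wtime();
        while (MPI_Wtime() - start < seconds)
            ;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        for (int i = 0; i < ITERATIONS; i++) {
            compute(GRANULARITY_SEC);       /* compute phase (granularity knob)        */
            MPI_Barrier(MPI_COMM_WORLD);    /* communication phase; a ring exchange or
                                               no communication at all could be used
                                               here to vary the pattern               */
        }

        MPI_Finalize();
        return 0;
    }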
Slide 33: Synthetic Scenarios
(Four scenarios: balanced, complementing, imbalanced, mixed)
Slide 34: Turnaround Time
Slide 35: Dynamic Workloads (JSSPP03)
- Static workloads are simple and offer insights, but are not realistic
- Most real-life workloads are more complex
- Users submit jobs dynamically, with varying time and space requirements
Slide 36: Dynamic Workload Methodology
- Emulation using a workload model (Lublin03)
- 1000 jobs, approx. 12 days, shrunk to 2 hrs
- Load varied by scaling arrival times
- Uses the same synthetic application, with random arrival time, run time, and size based on the model
- Granularity (fine, medium, coarse)
- Communication pattern (ring, barrier, none)
- Recent study with scientific apps (as yet unpublished)
Slide 37: Load vs. Response Time
Slide 38: Load vs. Bounded Slowdown
Slide 39: Timeslice vs. Response Time
Slide 40: System Software Components
(Diagram: system software components - resource management, job scheduling, communication library, fault tolerance, parallel I/O)
Slide 41: Buffered CoScheduling (BCS)
- Buffer all communications
- Exchange information about pending communication every time slice
- Schedule and execute communication
- Implemented mostly on the NIC
- Requires fine-grained heartbeats
- Described in SC03
Slide 42: Design and Implementation
- Global synchronization
- A strobe is sent at regular intervals (time slices): Compare-And-Write + Xfer-And-Signal (master), Test-Event (slaves)
- All system activities are tightly coupled
- Global scheduling: exchange of communication requirements, via Xfer-And-Signal + Test-Event
- Communication scheduling
- Real transmission, via Xfer-And-Signal + Test-Event
- (Strobe loop sketched below)
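A minimal sketch of the strobe flow described above, using the hypothetical primitive wrappers from the earlier sketch. The control flow is an illustration of the slide's description, not the actual BCS-MPI implementation.

    /* Minimal sketch of the BCS strobe on top of the hypothetical primitives. */
    #include <stdbool.h>
    #include <stddef.h>

    /* Types and prototypes as in the primitive sketch (hypothetical). */
    typedef struct node_set node_set_t;
    typedef struct event    event_t;
    typedef enum { CMP_EQ, CMP_GE, CMP_GT } cmp_op_t;
    int  xfer_and_signal(const void *buf, size_t len, const node_set_t *dests,
                         event_t *src_event, event_t *dest_event);
    bool compare_and_write(int global_var_id, cmp_op_t op, long value,
                           const node_set_t *nodes, bool do_write, long new_value);
    bool test_event(const event_t *ev);

    #define VAR_READY 0   /* assumed global variable: set to 1 when a node is ready */

    /* Master: once per time slice, check that all nodes are ready and broadcast
     * the strobe message that opens the next communication-scheduling phase. */
    void master_strobe(const node_set_t *all_nodes, const void *strobe_msg,
                       size_t len, event_t *strobe_event)
    {
        while (!compare_and_write(VAR_READY, CMP_GE, 1, all_nodes, false, 0))
            ;                              /* wait until every node is ready */
        xfer_and_signal(strobe_msg, len, all_nodes, NULL, strobe_event);
    }

    /* Slave: poll the local strobe event, then exchange pending-communication
     * descriptors and carry out the scheduled transmissions for this slice. */
    void slave_timeslice(const event_t *strobe_event)
    {
        while (!test_event(strobe_event))
            ;                              /* wait for the strobe */
        /* ... exchange communication requirements (Xfer-And-Signal + Test-Event),
         *     then perform the scheduled transmissions ... */
    }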
Slide 43: Design and Implementation
- Implementation in the NIC
- Application processes interact with NIC threads
- An MPI primitive results in a descriptor posted to the NIC
- Communications are buffered
- Cooperative threads running in the NIC
- Synchronize
- Partially exchange control information
- Schedule communications
- Perform the real transmissions and the reduce computations
- Computation and communication are completely overlapped
Slide 44: Design and Implementation
- Non-blocking primitives: MPI_Isend/MPI_Irecv
Slide 45: Design and Implementation
- Blocking primitives: MPI_Send/MPI_Recv
Slide 46: Performance Evaluation
- BCS-MPI vs. Quadrics MPI
- Experimental setup
- Benchmarks and applications
- NPB (IS, EP, MG, CG, LU) - Class C
- SWEEP3D - 50x50x50
- SAGE - timing.input
- Scheduling parameters
- 500µs communication-scheduling time slice (1 rail)
- 250µs communication-scheduling time slice (2 rails)
Slide 47: Performance Evaluation
- Benchmarks and applications (Class C)

  Application        Slowdown (%)
  IS      (32 PEs)    10.40
  EP      (49 PEs)     5.35
  MG      (32 PEs)     4.37
  CG      (32 PEs)    10.83
  LU      (32 PEs)    15.04
  SWEEP3D (49 PEs)    -2.23
  SAGE    (62 PEs)    -0.42
Slide 48: Performance Evaluation
- SAGE - timing.input (IA32)
(Chart: 0.5% speedup)
Slide 49: Blocking Communication
- Blocking vs. non-blocking SWEEP3D (IA32)
- MPI_Send/MPI_Recv replaced with MPI_Isend/MPI_Irecv + MPI_Waitall (see the sketch below)
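A minimal sketch of that transformation; the buffers, peers, and tags are hypothetical.

    /* Sketch: replacing blocking MPI_Send/MPI_Recv with non-blocking calls plus
     * a single MPI_Waitall, so communication can be overlapped with computation.
     * Buffers, peers, and tags are hypothetical. */
    #include <mpi.h>

    void exchange(double *sendbuf, double *recvbuf, int count, int up, int down)
    {
        /* Blocking version (original):
         *   MPI_Send(sendbuf, count, MPI_DOUBLE, up,   0, MPI_COMM_WORLD);
         *   MPI_Recv(recvbuf, count, MPI_DOUBLE, down, 0, MPI_COMM_WORLD,
         *            MPI_STATUS_IGNORE);
         */

        /* Non-blocking version: post both operations, then wait for both at once. */
        MPI_Request reqs[2];
        MPI_Irecv(recvbuf, count, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, count, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }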
Slide 50: System Software Components
(Diagram: system software components - resource management, job scheduling, communication library, fault tolerance, parallel I/O)
Slide 51: Fault Tolerance Today
- Fault tolerance is commonly achieved, if at all, by
- Checkpointing
- Segmentation of the machine
- Removal of fault-prone components
- Massive hardware redundancy is not considered economically feasible
Slide 52: Our Approach to Fault Tolerance
- Recent work shows that scalable, low-overhead, system-level fault tolerance is within reach with current technology, and can be achieved through a global operating system
- Two results provide the basis for this claim
- Buffered CoScheduling, which enforces frequent global recovery lines and global control
- Feasibility of incremental checkpointing
Slide 53: Checkpointing and Recovery
- Simplicity
- Easy implementation
- Cost-effective
- No additional hardware support
- Critical aspect: bandwidth requirements
Slide 54: Reducing Bandwidth
- Incremental checkpointing: only the memory modified since the previous checkpoint is saved to stable storage (see the sketch below)
(Diagram: incremental vs. full checkpoint of the process state)
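One common way to detect the modified pages is to write-protect the checkpointed region and catch the first write to each page. The sketch below illustrates that idea under simplifying assumptions (a single anonymous region, minimal error handling, mprotect called from the signal handler); it is not the mechanism evaluated in this work.

    /* Illustrative sketch of page-level incremental checkpointing via write
     * protection.  Simplified: one region, minimal error handling, mprotect()
     * called from the signal handler (common in practice, not guaranteed by POSIX). */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REGION_PAGES 1024
    static long  page_size;
    static char *region;                      /* checkpointed memory region */
    static char  dirty[REGION_PAGES];         /* one flag per page          */

    /* First write to a protected page lands here: mark it dirty, unprotect it. */
    static void segv_handler(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)ctx;
        uintptr_t addr = (uintptr_t)si->si_addr & ~(uintptr_t)(page_size - 1);
        if (addr < (uintptr_t)region ||
            addr >= (uintptr_t)region + REGION_PAGES * page_size)
            _exit(1);                                     /* a real fault elsewhere */
        dirty[(addr - (uintptr_t)region) / page_size] = 1;
        mprotect((void *)addr, page_size, PROT_READ | PROT_WRITE);
    }

    /* Take a checkpoint: save only dirty pages, then re-protect the region. */
    static void checkpoint(void (*save_page)(const void *page, size_t index))
    {
        for (size_t i = 0; i < REGION_PAGES; i++)
            if (dirty[i])
                save_page(region + i * page_size, i);
        memset(dirty, 0, sizeof dirty);
        mprotect(region, REGION_PAGES * page_size, PROT_READ);  /* arm next interval */
    }

    void ckpt_init(void)
    {
        page_size = sysconf(_SC_PAGESIZE);
        region = mmap(NULL, REGION_PAGES * page_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = segv_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, NULL);

        mprotect(region, REGION_PAGES * page_size, PROT_READ);  /* start protected */
    }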
Slide 55: Enabling Automatic Checkpointing
(Diagram: checkpointing levels - application, run-time library, operating system, hardware. Moving down the stack, user intervention falls from high to low (automatic at the OS and hardware levels), while the amount of checkpoint data grows from low to high)
Slide 56: The Bandwidth Challenge
- Does current technology provide enough bandwidth?
Slide 57: Methodology
- Quantify the bandwidth requirements (see the worked example below)
- Checkpoint intervals: 1 s to 20 s
- Compare with the bandwidth currently available:
- 900 MB/s sustained network bandwidth (Quadrics QsNet II)
- 75 MB/s sustained bandwidth of a single disk (Ultra SCSI controller)
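The required checkpoint bandwidth follows directly from the amount of memory modified during one interval:

    required bandwidth = (memory modified per checkpoint interval) / (interval length)

For example, reading the two extremes off the chart on slide 59 for Sage-1000MB gives roughly 78.8 MB/s at the shortest timeslices and 12.1 MB/s at the longest; the former exceeds a single Ultra SCSI disk (75 MB/s), but both are well below the 900 MB/s sustained by QsNet II. (Attributing those two values to the 1 s and 20 s ends of the range is an assumption read off the chart.)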
Slide 58: Memory Footprint
(Chart: increasing memory footprint; 64 Itanium II processors)
Slide 59: Bandwidth Requirements
(Chart: required bandwidth (MB/s) vs. timeslice (s) for Sage-1000MB; the requirement decreases with longer timeslices, from 78.8 MB/s down to 12.1 MB/s)
Slide 60: Bandwidth Requirements for 1 Second
(Chart: the 1 s timeslice is the most demanding case; the required bandwidth increases with the memory footprint and is compared against single SCSI disk performance)
Slide 61: Increasing Memory Footprint Size
(Chart: average bandwidth (MB/s) vs. timeslice (s); the requirement increases sublinearly with the memory footprint)
Slide 62: Increasing Processor Count
(Chart: average bandwidth (MB/s) vs. timeslice (s), weak scaling; the requirement decreases slightly with processor count)
Slide 63: Technological Trends
(Chart: performance improvement per year; bandwidth increases at a faster pace than memory, while the performance of applications is bounded by memory improvements)
Slide 64: Conclusions
- As clusters grow, interconnection technology advances
- Better bandwidth and latency
- On-board programmable processor and RAM
- Hardware support for collective operations
- This allows the development of a common system infrastructure that is a parallel program in itself
Slide 65: Conclusions (cont.)
- On top of this infrastructure we built
- Scalable resource management (STORM)
- Novel job-scheduling algorithms
- Simplified system design and communication library
- A possible basis for transparent fault tolerance
Slide 66: Conclusions (cont.)
- Experimental performance evaluation demonstrates
- Scalable interactive job launching and context switching
- Multiprogramming parallel jobs is feasible
- Adaptive scheduling algorithms adjust to different job requirements, improving response times and slowdown across various workloads
- Transparent, frequent checkpointing is within current reach
Slide 67: References
- Eitan's web page: http://www.cs.huji.ac.il/etcs/pubs/
- Fabrizio's web page: http://www.c3.lanl.gov/fabrizio/publications.html
- PAL team web page: http://www.c3.lanl.gov/par_arch/Publications.html
Slide 68: Resource Overlapping
Slide 69: Turnaround Time
Slide 70: Response Time
Slide 71: Timeslice vs. Bounded Slowdown
Slide 72: FCFS vs. GS and MPL
Slide 73: FCFS vs. GS and MPL (2)
Slide 74: Backfilling
- Backfilling is a technique to move jobs forward in the queue
- It can be combined with time-sharing schedulers such as GS when all timeslots are full
- (A minimal sketch follows below)
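A minimal sketch of an EASY-style backfilling check with assumed data structures; the deck does not say which backfilling variant is used, so the reservation rule below (never delay the head-of-queue job) is an illustrative assumption.

    /* Illustrative EASY-style backfilling sketch (assumed data structures).
     * A waiting job may jump ahead only if it fits in the currently free
     * processors and does not delay the head-of-queue job's reservation. */
    #include <stdbool.h>

    typedef struct {
        int    procs;          /* processors requested           */
        double est_runtime;    /* user-supplied runtime estimate */
    } job_t;

    /* Decide whether 'candidate' can be backfilled right now. */
    bool can_backfill(const job_t *candidate,
                      int free_procs,            /* processors idle at time 'now'       */
                      double now,
                      double reservation_time,   /* earliest start of the head-of-queue */
                      int extra_procs_at_resv)   /* processors free at reservation_time
                                                    beyond what the head job needs      */
    {
        if (candidate->procs > free_procs)
            return false;                        /* does not fit right now              */

        if (now + candidate->est_runtime <= reservation_time)
            return true;                         /* finishes before the reservation     */

        /* Otherwise it may run past the reservation: allowed only if it uses
         * processors the head-of-queue job will not need at that time. */
        return candidate->procs <= extra_procs_at_resv;
    }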
Slide 76: Effect of Backfilling
Slide 77: Characterization
(Chart: Sage-1000MB memory behavior over time - a data-initialization phase followed by regular processing bursts)
Slide 78: Communication
(Chart: Sage-1000MB communication over time - regular, interleaved communication bursts)