Designing Parallel Operating Systems using Modern Interconnects

1
Designing Parallel Operating Systems using Modern Interconnects
Eitan Frachtenberg (eitanf@lanl.gov), with Fabrizio Petrini, Juan Fernandez, Dror Feitelson, Jose-Carlos Sancho, and Kei Davis
Computer and Computational Sciences Division, Los Alamos National Laboratory
Ideas that change the world
2
Cluster Supercomputers
  • Growing in prevalence and performance:
    7 of the top 10 supercomputers are clusters
  • Running parallel applications
  • Advanced, high-end interconnects

3
Distributed vs. Parallel
  • Distributed and parallel applications (including
    operating systems) may be distinguished by their
    use of global and collective operations
  • Distributed: local information, relatively small
    number of point-to-point messages
  • Parallel: global synchronization barriers,
    reductions, exchanges

4
System Software Components
[Diagram of the system software components: Resource Management, Job Scheduling, Communication Library, Fault Tolerance, Parallel I/O]
5
Problems with System Software
  • Independent single-node OSs (e.g. Linux) connected
    by distributed dæmons
  • Redundant components
  • Performance hits
  • Scalability issues
  • Load balancing issues

6
OS Collective Operations
  • Many OS tasks are inherently global or collective
    operations
  • Job launching, data dissemination
  • Context switching
  • Job termination (normal and forced)
  • Load balancing

7
[Diagram: each node runs its own stack of Resource Management, Parallel I/O, Fault Tolerance, Local Operating System, User-Level Communication, and Job Scheduling (shown for Node 1 and Node 2); these are unified into a single Global Parallel Operating System providing Job Scheduling, Fault Tolerance, Communication, Parallel I/O, and Resource Management]
8
The Vision
  • Modern interconnects are very powerful
  • collective operations
  • programmable NICs
  • on-board RAM
  • Use a small set of network mechanisms as parallel
    OS infrastructure
  • Build upon this infrastructure to create unified
    system software
  • System software inherits scalability and
    performance from network features

9
Example: ASCI Q Barrier (HotI'03)
10
Parallel OS Primitives
  • System software is built atop three primitives
    (sketched in C after this list)
  • Xfer-And-Signal
  • Transfer block of data to a set of nodes
  • Optionally signal local/remote event upon
    completion
  • Compare-And-Write
  • Compare global variable on a set of nodes
  • Optionally write global variable on the same set
    of nodes
  • Test-Event
  • Poll local event
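A minimal C-style sketch may help make these three primitives concrete. The type names (node_set_t, event_t), function names, and signatures below are illustrative assumptions for this transcript, not the actual QsNet or STORM API:

#include <stdbool.h>
#include <stddef.h>

typedef struct { int *nodes; int count; } node_set_t;  /* set of destination nodes */
typedef int event_t;                                    /* handle of a local/remote event */

/* Xfer-And-Signal: transfer a block of data to a set of nodes and optionally
 * signal an event at the source and/or at each destination upon completion. */
void xfer_and_signal(const void *buf, size_t len, node_set_t dests,
                     event_t *src_event, event_t *dst_event);

/* Compare-And-Write: compare a global variable against a value on every node
 * of the set (equal, not equal, greater than, ...) and optionally write a new
 * value on the same set of nodes. */
bool compare_and_write(int var_id, int cmp_op, int value,
                       node_set_t nodes, const int *new_value);

/* Test-Event: poll a local event; returns true once it has been signaled. */
bool test_event(event_t ev);

Job launching, for instance, could then be expressed as an xfer_and_signal broadcast of the binary image followed by a compare_and_write test that every node has signaled readiness (again, purely illustrative).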

11
Core Primitives on QsNet
  • System software built atop three primitives
  • Xfer-And-Signal (QsNet)
  • Node S transfers block of data to nodes D1, D2,
    D3 and D4
  • Events triggered at source and destinations

12
Core Primitives (cont.)
  • System software built atop three primitives
  • Compare-And-Write (QsNet)
  • Node S compares variable V on nodes D1, D2, D3
    and D4

  • Is V equal to, not equal to, or greater than Value?

13
Core Primitives (cont.)
  • System software built atop three primitives
  • Compare-And-Write (QsNet)
  • Node S compares variable V on nodes D1, D2, D3
    and D4
  • Partial results are combined in the switches

14
System Software Components
[Diagram of the system software components: Resource Management, Job Scheduling, Communication Library, Fault Tolerance, Parallel I/O]
15
Scalable Tool for Resource Management (STORM)
  • Inherits scalability from network primitives
  • Data dissemination and coordination
  • Interactive job launching speeds
  • Context switching at the millisecond level
  • Described in SC'02

16
State of the Art in Resource Management
  • Resource managers (e.g. PBS, LSF, RMS,
    LoadLeveler, Maui) are typically implemented
    using
  • TCP/IP, which favors portability over performance
  • Poorly scaling algorithms for the
    distribution/collection of data and control
    messages
  • Favoring development time over performance
  • Scalable performance is not important for small
    clusters but is crucial for large ones
  • There is a need for fast and scalable resource
    management

17
Experimental Setup
  • 64-node, 256-processor AlphaServer ES40 cluster
  • Two independent Quadrics Elan3 network rails
  • Files are placed in a ramdisk to avoid I/O
    bottlenecks and expose the performance of the
    resource management algorithms

18
Launch Times (Unloaded System)
  • The launch time remains constant as the number of
    processors increases: STORM is highly scalable

19
Launch Times (Loaded System, 12 MB)
  • Worst case: 1.5 seconds to launch a 12 MB file
    on 256 processors

20
Measured and Estimated Launch Times
  • The model shows that, on an ES40-based AlphaServer
    cluster, a 12 MB binary can be launched in 135 ms
    on 16,384 nodes

21
Comparative Evaluation (Measured and Modeled)
22
System Software Components
[Diagram of the system software components: Resource Management, Job Scheduling, Communication Library, Fault Tolerance, Parallel I/O]
23
Job Scheduling
  • Controls the allocation of space and time
    resources to jobs
  • HPC apps have special requirements
  • Multiple processing and network resources
  • Synchronization (< 1 ms granularity)
  • Potentially memory hogs with little locality
  • Has significant effect on throughput,
    responsiveness, and utilization

24
First-Come-First-Serve (FCFS)
25
Gang Scheduling (GS)
26
Implicit CoScheduling
27
Hybrid Methods
  • Combine global synchronization with local
    information
  • Rely on scalable primitives for global
    coordination and information exchange
  • First implementation of two novel algorithms
  • Flexible CoScheduling (FCS)
  • Buffered CoScheduling (BCS)

28
Flexible CoScheduling (FCS)
  • Measure communication characteristics, such as
    granularity and wait times
  • Classify processes based on synchronization
    requirements
  • Schedule processes based on class
  • Described in IPDPS'03

29
FCS Classification
  • Processes are classified along two axes: granularity
    (fine vs. coarse) and block times (short vs. long)
  • CS: always gang-scheduled
  • F: preferably gang-scheduled
  • DC: locally scheduled
  • A classification sketch follows this slide
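A hedged sketch of the classification step, under the assumption that CS corresponds to fine-grained processes with short block times, F to fine-grained processes with long block times, and DC to coarse-grained processes; the thresholds and names below are illustrative only, not values from the talk:

typedef enum { CLASS_CS, CLASS_F, CLASS_DC } fcs_class_t;

/* Hypothetical thresholds, chosen only for illustration. */
#define FINE_GRANULARITY_US 1000.0   /* below this, communication is "fine-grained" */
#define LONG_BLOCK_US        500.0   /* above this, block (wait) times are "long" */

fcs_class_t classify_process(double avg_granularity_us, double avg_block_us)
{
    if (avg_granularity_us >= FINE_GRANULARITY_US)
        return CLASS_DC;   /* coarse-grained: can be scheduled locally */
    if (avg_block_us >= LONG_BLOCK_US)
        return CLASS_F;    /* fine-grained but often waiting: preferably gang-scheduled */
    return CLASS_CS;       /* fine-grained, short waits: always gang-scheduled */
}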
30
Methodology
  • Synthetic, controllable MPI programs
  • Workload
  • Static: all jobs start together
  • Dynamic: different sizes, arrival and run times
  • Various schedulers implemented
  • FCFS, GS, FCS, SB (ICS), BCS
  • Emulation vs. simulation
  • Actual implementation takes into account all the
    overhead and factors of a real system

31
Hardware Environment
  • Environment ported to three architectures and
    clusters:
  • Crescendo: 32x2 Pentium III, 1 GB
  • Accelerando: 32x2 Itanium II, 2 GB
  • Wolverine: 64x4 Alpha ES40, 8 GB

32
Synthetic Application
  • Bulk-synchronous, 3 ms basic granularity
  • Can control granularity, variability, and
    communication pattern (a sketch follows below)
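A minimal sketch of such a bulk-synchronous synthetic program, assuming MPI and a spin loop for the compute phase; the iteration count and the choice of a barrier as the communication pattern are illustrative parameters, not the actual benchmark code:

#include <mpi.h>

/* Spin for roughly `granularity` seconds to emulate a computation phase. */
static void compute(double granularity)
{
    double start = MPI_Wtime();
    while (MPI_Wtime() - start < granularity)
        ;  /* busy-wait */
}

int main(int argc, char **argv)
{
    int rank, size;
    const int    iterations  = 1000;
    const double granularity = 0.003;   /* 3 ms basic granularity */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    (void)rank; (void)size;             /* would be used to pick ring neighbors */

    for (int i = 0; i < iterations; i++) {
        compute(granularity);           /* computation phase */
        MPI_Barrier(MPI_COMM_WORLD);    /* communication phase: a barrier here; a ring
                                           exchange or no communication could be
                                           substituted to vary the pattern */
    }

    MPI_Finalize();
    return 0;
}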

33
Synthetic Scenarios
[Charts of the four scenarios: Balanced, Complementing, Imbalanced, Mixed]
34
Turnaround Time
35
Dynamic Workloads (JSSPP'03)
  • Static workloads are simple and offer insights,
    but are not realistic
  • Most real-life workloads are more complex
  • Users submit jobs dynamically, of varying time
    and space requirements

36
Dynamic Workload Methodology
  • Emulation using a workload model (Lublin'03)
  • 1000 jobs, approx. 12 days, shrunk to 2 hours
  • Load varied by scaling arrival times
  • Using the same synthetic application, with random
  • arrival time, run time, and size, based on the model
  • granularity (fine, medium, coarse)
  • communication pattern (ring, barrier, none)
  • Recent study with scientific apps (as yet
    unpublished)

37
Load vs. Response Time
38
Load vs. Bounded Slowdown
39
Timeslice vs. Response Time
40
System Software Components
[Diagram of the system software components: Resource Management, Job Scheduling, Communication Library, Fault Tolerance, Parallel I/O]
41
Buffered CoScheduling (BCS)
  • Buffer all communications
  • Exchange information about pending communication
    every time slice
  • Schedule and execute communication
  • Implemented mostly on the NIC
  • Requires fine-grained heartbeats
  • Described in SC'03

42
Design and Implementation
  • Global synchronization (see the sketch after this
    slide)
  • A strobe is sent at regular intervals (timeslices)
  • Compare-And-Write + Xfer-And-Signal (Master)
  • Test-Event (Slaves)
  • All system activities are tightly coupled
  • Global scheduling
  • Exchange of communication requirements
  • Xfer-And-Signal + Test-Event
  • Communication scheduling
  • Real transmission
  • Xfer-And-Signal + Test-Event
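A hedged sketch of this strobe protocol, reusing the hypothetical primitive signatures sketched earlier; the timeslice length matches the 500 us value used later in the evaluation, but every constant, variable, and helper below (STROBE_VAR, exchange_pending_requests, ...) is an illustrative assumption:

#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>   /* usleep */

typedef struct { int *nodes; int count; } node_set_t;
typedef int event_t;

/* Primitives (assumed signatures; see the earlier sketch). */
extern void xfer_and_signal(const void *buf, size_t len, node_set_t dests,
                            event_t *src_event, event_t *dst_event);
extern bool compare_and_write(int var_id, int cmp_op, int value,
                              node_set_t nodes, const int *new_value);
extern bool test_event(event_t ev);

/* Hypothetical helpers implemented elsewhere (NIC/runtime). */
extern void exchange_pending_requests(void);  /* Xfer-And-Signal + Test-Event */
extern void schedule_and_transmit(void);      /* Xfer-And-Signal + Test-Event */

enum { STROBE_VAR = 0, CMP_GE = 2 };          /* illustrative identifiers */
enum { TIMESLICE_US = 500 };                  /* 500 us timeslice (1 rail) */

static event_t strobe_event = 1;
static int     epoch        = 0;

/* Master: check that every node has caught up, then broadcast the strobe. */
void master_strobe_loop(node_set_t all_nodes)
{
    for (;;) {
        while (!compare_and_write(STROBE_VAR, CMP_GE, epoch, all_nodes, NULL))
            ;                                  /* global condition not yet true */
        xfer_and_signal(&epoch, sizeof epoch, all_nodes, NULL, &strobe_event);
        usleep(TIMESLICE_US);
        epoch++;
    }
}

/* Slave: wait for the strobe, then run the global-scheduling and
 * communication-scheduling phases for this timeslice. */
void slave_timeslice(void)
{
    while (!test_event(strobe_event))
        ;                                      /* poll the local event */
    exchange_pending_requests();
    schedule_and_transmit();
    /* A real implementation would now advance STROBE_VAR so that the
     * master's Compare-And-Write succeeds for the next timeslice. */
}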

43
Design and Implementation
  • Implementation in the NIC
  • Application processes interact with NIC threads
  • MPI primitive → descriptor posted to the NIC
  • Communications are buffered (see the descriptor
    sketch after this slide)
  • Cooperative threads running in the NIC
  • Synchronize
  • Partial exchange of control information
  • Schedule communications
  • Perform real transmissions and reduce
    computations
  • Computation and communication are completely
    overlapped
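As one way to picture "posting a descriptor to the NIC", the sketch below shows a hypothetical descriptor record and a host-side ring-buffer enqueue; the field names, ring layout, and function are assumptions made for this transcript, not the actual BCS MPI implementation:

#include <stdatomic.h>
#include <stddef.h>

typedef enum { DESC_SEND, DESC_RECV, DESC_REDUCE } desc_kind_t;

typedef struct {
    desc_kind_t kind;   /* which MPI primitive produced this descriptor */
    int         peer;   /* source or destination rank */
    int         tag;    /* MPI tag */
    void       *buf;    /* user buffer (the communication itself is buffered) */
    size_t      len;    /* message size in bytes */
} nic_descriptor_t;

#define RING_SLOTS 1024

typedef struct {
    nic_descriptor_t slots[RING_SLOTS];
    atomic_uint      head;   /* advanced by the host process */
    atomic_uint      tail;   /* consumed by the NIC threads each timeslice */
} nic_ring_t;

/* Host side: posting is only an enqueue; the actual transmission happens later,
 * when the NIC threads schedule communication for the timeslice. */
int post_descriptor(nic_ring_t *ring, nic_descriptor_t d)
{
    unsigned head = atomic_load(&ring->head);
    unsigned tail = atomic_load(&ring->tail);
    if (head - tail == RING_SLOTS)
        return -1;                        /* ring full: the caller retries */
    ring->slots[head % RING_SLOTS] = d;
    atomic_store(&ring->head, head + 1);
    return 0;
}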

44
Design and Implementation
  • Non-blocking primitives: MPI_Isend/Irecv

45
Design and Implementation
  • Blocking primitives: MPI_Send/Recv

46
Performance Evaluation
  • BCS MPI vs. Quadrics MPI
  • Experimental Setup
  • Benchmarks and Applications
  • NPB (IS,EP,MG,CG,LU) - Class C
  • SWEEP3D - 50x50x50
  • SAGE - timing.input
  • Scheduling parameters
  • 500µs communication scheduling time slice (1
    rail)
  • 250µs communication scheduling time slice (2
    rails)

47
Performance Evaluation
  • Benchmarks and Applications (C)

Application        Slowdown
IS (32 PEs)        10.40
EP (49 PEs)        5.35
MG (32 PEs)        4.37
CG (32 PEs)        10.83
LU (32 PEs)        15.04
SWEEP3D (49 PEs)   -2.23
SAGE (62 PEs)      -0.42
48
Performance Evaluation
  • SAGE - timing.input (IA32)

[Chart: approximately 0.5% speedup]
49
Blocking Communication
  • Blocking vs. non-blocking SWEEP3D (IA32)
  • MPI_Send/Recv → MPI_Isend/Irecv + MPI_Waitall
    (illustrated below)
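The transformation named above, written out with standard MPI calls; the buffer names and neighbor ranks are illustrative, but the functions themselves (MPI_Isend, MPI_Irecv, MPI_Waitall) are the standard MPI API:

#include <mpi.h>

void exchange(double *sendbuf, double *recvbuf, int n, int left, int right)
{
    MPI_Request reqs[2];

    /* Blocking version (shown for comparison):
     *   MPI_Send(sendbuf, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
     *   MPI_Recv(recvbuf, n, MPI_DOUBLE, left, 0, MPI_COMM_WORLD,
     *            MPI_STATUS_IGNORE);
     */

    /* Non-blocking version: post both operations, then wait for completion,
     * which lets the runtime buffer and overlap the transfers. */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}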

50
System Software Components
[Diagram of the system software components: Resource Management, Job Scheduling, Communication Library, Fault Tolerance, Parallel I/O]
51
Fault Tolerance Today
  • Fault tolerance is commonly achieved, if at all,
    by
  • Checkpointing
  • Segmentation of the machine
  • Removal of fault-prone components
  • Massive hardware redundancy is not considered
    economically feasible

52
Our Approach to Fault Tolerance
  • Recent work shows that scalable, low-overhead,
    system-level fault tolerance is within reach of
    current technology and can be achieved through a
    global operating system
  • Two results provide the basis for this claim:
  • Buffered CoScheduling, which enforces frequent
    global recovery lines and global control
  • The feasibility of incremental checkpointing

53
Checkpointing and Recovery
  • Simplicity
  • Easy implementation
  • Cost-effective
  • No additional hardware support
  • Critical aspect: bandwidth requirements

54
Reducing Bandwidth
  • Incremental checkpointing
  • Only the memory modified since the previous
    checkpoint is saved to stable storage (a sketch of
    one way to track modified pages follows below)

[Figure: process state saved by a full checkpoint vs. an incremental checkpoint]
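As a concrete illustration of how modified memory can be detected without application involvement, the sketch below write-protects the checkpointed region and marks a page dirty on its first write, using mprotect and a SIGSEGV handler. This is a generic, simplified technique assumed for this transcript, not necessarily the mechanism used in the work described here; the region is assumed page-aligned (e.g. obtained from mmap) and no larger than MAX_PAGES pages:

#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_PAGES 65536

static uint8_t      *region;            /* checkpointed region (page-aligned) */
static size_t        region_len;
static long          page_size;
static unsigned char dirty[MAX_PAGES];  /* one "modified" flag per page */

/* The first write to a protected page faults here: mark the page dirty and
 * re-enable writes so the faulting store can proceed. (Simplified: a real
 * implementation would be more careful about signal safety.) */
static void on_first_write(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    size_t page = ((uintptr_t)si->si_addr - (uintptr_t)region) / (size_t)page_size;
    dirty[page] = 1;
    mprotect(region + page * (size_t)page_size, (size_t)page_size,
             PROT_READ | PROT_WRITE);
}

/* Clear the dirty flags and write-protect the region again. */
static void start_tracking(void)
{
    memset(dirty, 0, sizeof dirty);
    mprotect(region, region_len, PROT_READ);
}

/* Incremental checkpoint: only pages modified since the last checkpoint are
 * written out (a real checkpointer would also record the page indices). */
void checkpoint(FILE *out)
{
    for (size_t p = 0; p * (size_t)page_size < region_len; p++)
        if (dirty[p])
            fwrite(region + p * (size_t)page_size, 1, (size_t)page_size, out);
    start_tracking();
}

void install_dirty_tracking(uint8_t *mem, size_t len)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_first_write;
    sa.sa_flags     = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    region     = mem;
    region_len = len;
    page_size  = sysconf(_SC_PAGESIZE);
    start_tracking();
}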
55
Enabling Automatic Checkpointing
Checkpointing can be implemented at several levels (Application, Run-time library, Operating system, Hardware):
  • Checkpoint data grows from Low (application level)
    to High (hardware level)
  • User intervention drops from High (application level)
    to Low (hardware level), becoming automatic at the
    operating-system level and below
56
The Bandwidth Challenge
Does current technology provide enough bandwidth
for frequent, automatic checkpointing?
57
Methodology
  • Quantifying the bandwidth requirements
  • Checkpoint intervals: 1 s to 20 s
  • Comparing against the currently available
    bandwidth (formalized in the formula below)
  • Sustained network bandwidth (Quadrics QsNet II):
    900 MB/s
  • Sustained bandwidth of a single disk (Ultra SCSI
    controller): 75 MB/s
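In formula form, the check being made is the following, with M_dirty(τ) denoting the memory modified during a checkpoint interval of length τ (this notation is introduced here for clarity and is not taken from the slides):

\[
  B_{\mathrm{req}}(\tau) \;=\; \frac{M_{\mathrm{dirty}}(\tau)}{\tau}
  \;\le\; B_{\mathrm{avail}},
  \qquad
  B_{\mathrm{avail}} =
  \begin{cases}
    75\ \text{MB/s}  & \text{single Ultra SCSI disk} \\
    900\ \text{MB/s} & \text{Quadrics QsNet II network}
  \end{cases}
\]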
58
Memory Footprint
[Chart: increasing memory footprint across the applications studied, on 64 Itanium II processors]
59
Bandwidth Requirements
[Chart: average checkpoint bandwidth (MB/s) vs. timeslice (s) for SAGE-1000MB; the requirement decreases with longer timeslices, from 78.8 MB/s down to 12.1 MB/s]
60
Bandwidth Requirements for 1 Second
[Chart: bandwidth requirements at a 1-second checkpoint interval (the most demanding case); requirements increase with the memory footprint; single SCSI disk performance is shown for reference]
61
Increasing Memory Footprint Size
[Chart: average bandwidth (MB/s) vs. timeslice (s); the requirement increases sublinearly with the memory footprint]
62
Increasing Processor Count
[Chart: average bandwidth (MB/s) vs. timeslice (s) under weak scaling; the requirement decreases slightly with processor count]
63
Technological Trends
[Chart: performance improvement per year of the underlying technologies; application performance is bounded by memory improvements]
64
Conclusions
  • As clusters grow, interconnection technology
    advances
  • Better bandwidth and latency
  • On-board programmable processor, RAM
  • Hardware support for collective operations
  • This allows the development of a common system
    infrastructure that is itself a parallel program

65
Conclusions (cont.)
  • On top of this infrastructure we built:
  • Scalable resource management (STORM)
  • Novel job scheduling algorithms
  • Simplified system design and communication
    library
  • Possible basis for transparent fault tolerance

66
Conclusions (cont.)
  • Experimental performance evaluation demonstrates
  • Scalable interactive job launching and
    context-switching
  • Multiprogramming parallel jobs is feasible
  • Adaptive scheduling algorithms adjust to
    different job requirements, improving response
    times and slowdown in various workloads
  • Transparent, frequent checkpointing is within
    current reach

67
References
  • Eitan's web page
  • http://www.cs.huji.ac.il/etcs/pubs/
  • Fabrizio's web page
  • http://www.c3.lanl.gov/fabrizio/publications.html
  • PAL team web page
  • http://www.c3.lanl.gov/par_arch/Publications.html

68
Resource Overlapping
69
Turnaround Time
70
Response Time
71
Timeslice vs. Bounded Slowdown
72
FCFS vs. GS and MPL
73
FCFS vs. GS and MPL (2)
74
Backfilling
  • Backfilling is a technique for moving jobs forward
    in the queue
  • It can be combined with time-sharing schedulers such
    as GS when all timeslots are full

76
Effect of Backfilling
77
Characterization
[Chart: execution profile of SAGE-1000MB, showing a data-initialization phase followed by regular processing bursts]
78
Communication
[Chart: communication profile of SAGE-1000MB, showing regular communication bursts interleaved with computation]