Title: Scalable Process Management and Interfaces for Clusters
1. Scalable Process Management and Interfaces for Clusters
- Rusty Lusk
- representing also David Ashton, Anthony Chan, Bill Gropp, Debbie Swider, Rob Ross, and Rajeev Thakur
- Mathematics and Computer Science Division, Argonne National Laboratory
2. Interfaces
- High-level programming is the identification of components and interfaces
- MPI is an interface
- The new abstract device interface inside the next-generation MPICH
- An experimental process manager component: MPD
- An interface between process managers and parallel libraries: BNR
3. Projects Related to MPI at ANL
[Diagram: a map of MPI-related projects at Argonne (MPI/MPI-2, MPICH, MPICH-2, ADI3, BNR, MPD, ROMIO, PVFS, MPICH-G2, IMPI, Jumpshot/SLOG performance analysis, debugging) and related organizations (Sandia, LANL, LLNL, LBL, NCSA, IBM, HP, SGI, Myricom, Etnus, Microsoft, Globus, NIST).]
5. MPI As An Interface
- It is multi-level: some routines are implementable in terms of the others
  - collective operations
  - parallel I/O (esp. the ROMIO implementation)
- It is extensible: there are places where unforeseeable parameters are managed
  - attributes on objects
  - info objects
  - (each uses key/value pairs)
- It is interceptable: the MPI profiling interface allows users to wrap interface calls (a small example follows)
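As a minimal sketch of the profiling interface (standard MPI, not specific to this talk): every implementation also exports each routine under the name-shifted PMPI_ prefix, so a user-supplied MPI_Send can do its own bookkeeping and then forward the call.

    /* Count calls to MPI_Send by intercepting it and forwarding
       to the name-shifted PMPI_Send entry point. */
    #include <stdio.h>
    #include "mpi.h"

    static int send_count = 0;

    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        send_count++;                             /* profiling action */
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }

    int MPI_Finalize(void)
    {
        printf("MPI_Send was called %d times\n", send_count);
        return PMPI_Finalize();
    }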
6. MPI-2 Status Assessment
- All MPP vendors now have MPI-1. Free implementations (MPICH, LAM) support heterogeneous workstation networks.
- MPI-2 implementations are being undertaken now by all vendors.
- Fujitsu and NEC have complete MPI-2 implementations.
- MPI-2 is harder to implement than MPI-1 was.
- MPI-2 implementations are appearing piecemeal, with I/O first:
  - I/O available in most MPI implementations
  - one-sided operations available in some (e.g., HP, SGI, NEC, and Fujitsu; coming soon from IBM)
  - parts of dynamic process management in LAM
7. The MPICH Implementation of MPI
- As a research project: exploring tradeoffs between performance and portability; conducting research in implementation issues.
- As a software project: providing a free MPI implementation on most machines, enabling vendors and others to build complete MPI implementations on their own communication services.
- MPICH 1.2.1 just released, with complete MPI-1, parts of MPI-2 (I/O and C++), and a port to Windows 2000.
- Available at http://www.mcs.anl.gov/mpi/mpich
8. Internal Interfaces in MPICH: The Abstract Device Interface
- ADI-1 objectives:
  - speed of implementation
  - performance of MPI point-to-point operations layered over vendor libraries (NX, CMMD, EUI, etc.)
  - portability
- ADI-2 objectives:
  - portability
  - robustness
  - support for our own research into implementation issues
  - ease of outside implementations (vendor and research)
9. Experiences with ADI-1 and ADI-2
- Vendors
  - could (and did) field complete MPI implementations quickly
  - could (and did) incrementally move up the interface levels, replacing lower-level code
  - could (and did) evolve upper-level code
- Researchers
  - could (and did) experiment with new transport layers (e.g., BIP)
- All of this was enabled by the interface design
10. Internal Interfaces in Next-Generation MPICH
- Objectives for the third-generation Abstract Device Interface (ADI-3):
  - support for full MPI-2
  - enable a thread-safe implementation
  - provide more tools for collaborators
  - support in the ADI for collective operations
  - enable high performance on new networks
- Status:
  - interfaces being finalized; comments welcome
  - exploratory thread-safe, multiprotocol implementation running
  - http://www.mcs.anl.gov/mpi/mpich.adi3
[Diagram: call layering, e.g. MPI_Reduce implemented over MPI_Isend, which maps to ADI_Isend and ADI_Rhc and eventually to write.]
11. Interfaces in the Parallel Computing Environment
[Diagram: internal interfaces connecting the user, the application, the parallel library (PVM, MPI), the process manager, job submission, the queue manager, the scheduler, the system monitor, system administration, and the file system.]
12. What is a Process Manager?
- A process management system is the software component that starts user processes (with command-line arguments and environment), ensures that they terminate cleanly, and manages I/O
- For simple jobs, this can be the shell
- For parallel jobs, more is needed
- Process management is different from scheduling and queuing
- We focus for now on the Unix environment
- Related projects: MPICH, PVM, LAM, Harness, PBS, LSF, DQS, LoadLeveler, Condor
- An experimental system: MPD (the multipurpose daemon)
13. Goals for MPD
- Original goal: speed up MPICH startup
- ...which evolved to the grandiose goal: build an entire cluster computing environment
- ...which evolved to the architectural goal: design the components of a cluster computing environment and their interfaces
- ...which evolved to the realistic goal: make a start on understanding architectural requirements, and speed up MPICH startup
14. Design Targets for MPD
- Simplicity: transparent enough to convince system people to run it as root
- Speed: startup of parallel jobs fast enough to provide an interactive feel (1000 processes in a few seconds)
- Robustness: no single point of failure, auto-repair of at least some failures
- Scalability: the complexity or size of any one component shouldn't depend on the number of components
- Service: provide parallel jobs with what they need, e.g. a mechanism for precommunication
15. Parallel Jobs
- Individual process environments: each process in a parallel job should be able to have
  - its own executable file
  - its own command-line arguments
  - its own environment variables
  - its own exit code
- Collective identity of a parallel job: a job should collectively
  - be able to be signalled (suspended, restarted, killed, others)
  - produce stdout and stderr, and accept stdin, scalably
  - terminate, especially abnormally
16. Deriving the Design from the Targets
- Simplicity and robustness => a multicomponent system:
  - daemon: persistent, may run for weeks or months; one instance per host
  - manager: started by a daemon to manage one process (its client) of a parallel job
  - clients: the application processes, e.g. MPI processes or system processes
  - console processes: talk to the user and to the daemon on the local host
- Speed => daemons are in contact with one another prior to job startup
17. Deriving the Design from the Targets (continued)
- Scalability, no master => daemons connected in a ring, or a ring of rings
- Manager services => managers also connected; speed => managers inherit part of the ring from the daemons
- Separate managers for each client process support individual process environments
- The collective identity of a job is represented by a console process, e.g. mpirun or mpiexec, which represents the job for stdin, stdout, stderr, and signals
18. Architecture of MPD
- Daemons, managers, clients, consoles
- An experimental process manager / job manager / scheduler interface for parallel jobs
19. Interesting Features
- Security
  - challenge-response system, using passwords in protected files and encryption of random numbers (a toy sketch follows this list)
  - speed is not important here, since daemon startup is separate from job startup
- Fault tolerance
  - when a daemon dies, this is detected and the ring is reknit => minimal fault tolerance
  - a new daemon can be inserted in the ring
- Signals
  - signals can be delivered to clients by their managers
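A toy sketch of the challenge-response idea, not MPD's actual code: both sides share a secret (standing in for the password read from a protected file), the daemon issues a random challenge, and the connecting side must return it transformed with the secret. The XOR transform below is a placeholder; a real system would use proper encryption.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* placeholder "encryption": XOR with the shared secret */
    static unsigned long respond(unsigned long challenge, unsigned long secret)
    {
        return challenge ^ secret;
    }

    int main(void)
    {
        unsigned long secret = 0x5eedUL;   /* stands in for the password file contents */

        /* daemon side: issue a random challenge */
        srandom((unsigned) time(NULL));
        unsigned long challenge = (unsigned long) random();

        /* console side: answer using the shared secret */
        unsigned long answer = respond(challenge, secret);

        /* daemon side: verify the answer before accepting the connection */
        if (answer == respond(challenge, secret))
            printf("connection accepted\n");
        else
            printf("connection refused\n");
        return 0;
    }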
20. More Interesting Features
- Uses of signal delivery:
  - signals delivered to a job-starting console process are propagated to the clients
    - so one can suspend, resume, or kill an mpirun
  - one client can signal another
    - can be used in setting up connections dynamically
  - a separate console process can signal currently running jobs
    - can be used to implement a primitive gang scheduler
- Support for parallel libraries via BNR
21. More Interesting Features (continued)
- Support for parallel libraries:
  - implements the BNR process manager interface, used by MPICH
  - groups; put, get, fence, spawn (a possible C binding is sketched below)
  - a simple distributed database maintained in the managers
  - solves the precommunication problem of startup
  - makes MPD independent of MPICH while still providing the needed features
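A plausible C binding for the data part of BNR, inferred from the usage example later in this talk ("Example of Use in Setting Up Communication"); since the definition was not yet frozen, treat these declarations as an assumption rather than the official interface.

    /* Hypothetical declarations for the data part of BNR. */
    typedef int BNR_Group;

    int BNR_Init(BNR_Group *mygroup);       /* join the job's group */
    int BNR_Put(char *key, char *value);    /* publish a key/value pair */
    int BNR_Fence(BNR_Group group);         /* collective synchronization:
                                               makes earlier puts visible */
    int BNR_Get(BNR_Group group, char *key,
                char *value);               /* look up a published value */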
22. Handling Standard I/O
- Managers capture stdout and stderr (separately) from their clients
- Managers forward stdout and stderr (separately) up a pair of binary trees to the console, optionally adding a rank identifier as a line label (one possible tree numbering is sketched below)
- The console's stdin is delivered to the stdin of client 0 by default, but can be controlled to broadcast or to go to a specific client
[Diagram: the mpd ring, the manager ring with its clients, and the I/O tree carrying output to the console.]
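A minimal sketch of one way rank-ordered managers could be wired into a binary tree for forwarding output toward the console (rank 0 at the root); this is the standard heap-style numbering, not necessarily the exact scheme MPD uses.

    #include <stdio.h>

    /* Print the tree neighbors of each manager rank. */
    static void tree_neighbors(int rank, int size)
    {
        int left  = 2 * rank + 1;
        int right = 2 * rank + 2;

        if (rank == 0)
            printf("rank 0: root, delivers output to the console\n");
        else
            printf("rank %d: forwards output to parent %d\n",
                   rank, (rank - 1) / 2);
        if (left < size)
            printf("rank %d: merges output from child %d\n", rank, left);
        if (right < size)
            printf("rank %d: merges output from child %d\n", rank, right);
    }

    int main(void)
    {
        int size = 8;                  /* number of managers in the job */
        int rank;
        for (rank = 0; rank < size; rank++)
            tree_neighbors(rank, size);
        return 0;
    }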
23. Client Wrapping
- Unix semantics for fork, exec, and process environments allow the interposition of other processes that do not know about the client library (a sketch of the mechanism follows)
- For example,
    mpirun -np 16 myprog
  can be replaced by
    mpirun -np 16 nice -5 myprog
  or
    mpirun -np 16 pty myprog
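A sketch of why wrapping "just works" under Unix (illustrative, not MPD's code): the manager simply execs whatever argv it was handed, so a wrapper such as nice becomes the exec'd program, and it in turn execs the client, which inherits the environment untouched. The program name myprog is taken from the example above.

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* argv as passed along by the console: wrapper first, client after */
        char *client_argv[] = { "nice", "-5", "./myprog", NULL };

        pid_t pid = fork();
        if (pid == 0) {
            execvp(client_argv[0], client_argv); /* nice runs, then runs myprog */
            perror("execvp");                    /* reached only on failure */
            _exit(127);
        }
        int status;
        waitpid(pid, &status, 0);                /* manager collects the exit code */
        printf("client exited with status %d\n", WEXITSTATUS(status));
        return 0;
    }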
24. Putting It All Together
- The combination of
  - client wrapping
  - I/O management, especially redirection of stdin
  - line labels on stdout
  - the ability to customize the console
  can be surprisingly powerful.
25. A Simple Parallel Debugger
- The program mpigdb is a slightly modified version of the mpirun console program
- It automatically wraps the given client with gdb
- It intercepts (gdb) prompts and counts them, issuing its own (mpigdb) prompt when enough have been received
- It sets the line-label option on stdout and stderr
- It sets broadcast behavior for stdin as the default
- Its z command modifies the stdin target: any specific rank, or broadcast to all
26. Parallel Debugging with mpigdb

    donner% mpigdb -np 3 cpi
    (mpigdb) b 33
    0: Breakpoint 1 at 0x8049eac: file cpi.c, line 33.
    1: Breakpoint 1 at 0x8049eac: file cpi.c, line 33.
    2: Breakpoint 1 at 0x8049eac: file cpi.c, line 33.
    (mpigdb) r
    2: Breakpoint 1, main (argc=1, argv=0xbffffab4) at cpi.c:33
    1: Breakpoint 1, main (argc=1, argv=0xbffffac4) at cpi.c:33
    0: Breakpoint 1, main (argc=1, argv=0xbffffad4) at cpi.c:33
    (mpigdb) n
    2: 43    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    0: 39    if (n==0) n=100; else n=0;
    1: 43    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    (mpigdb) z 0
    (mpigdb) n
    0: 41    startwtime = MPI_Wtime();
    (mpigdb) n
    0: 43    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    (mpigdb)
27. Continuing...

    (mpigdb) z
    (mpigdb) n
    ....
    (mpigdb) n
    0: 52    x = h * ((double)i - 0.5);
    1: 52    x = h * ((double)i - 0.5);
    2: 52    x = h * ((double)i - 0.5);
    (mpigdb) p x
    0: $2 = 0.0050000000000000001
    2: $2 = 0.025000000000000001
    1: $2 = 0.014999999999999999
    (mpigdb) c
    0: pi is approximately 3.1416009869231249,
    0: Error is 0.0000083333333318
    0: Program exited normally.
    1: Program exited normally.
    2: Program exited normally.
    (mpigdb) q
    donner%
28. Experiences
- Instantaneous startup of small MPICH jobs was wonderful after years of conditioning for slow startup
- Not so critical for batch-scheduled jobs, but it allows writing parallel versions of short, quick Unix tools (cp, rm, find, etc.) as MPI jobs
- Speed on big jobs (Chiba City with fast Ethernet):
  - mpdringtest 1 on 211 hosts: 0.13 sec.
  - mpirun -np 422 hostname: 3.5 sec.
- Running the daemons as root required little extra work
- We really do use the debugger mpigdb
29. Running the Daemons as Root
- The daemon runs as root
- The console runs as a setuid program, and becomes root only briefly while connected to the daemon
- The console transmits the uid, gid, and group membership of the real user to the daemon ring
- Daemons fork managers, which assume the user's attributes before execing the manager program (a sketch follows)
- After job startup, the console, managers, and clients are all running as the user
- The daemon is free to accept more console commands from the same or other users
- In experimental use now on our Chiba City cluster
30. Status
- The MPD system is open source and is available as a component of the MPICH distribution.
- Configuring and installing MPICH with the ch_p4mpd device automatically builds and installs MPD.
- MPD can also be built and installed separately from MPICH.
- It is serving as a platform for the study of process manager interfaces.
31. Motivation for a Process Manager Interface
- Any device or method needs to interact with the process manager (at least at startup, and probably later as well).
- A parallel library (such as an MPI implementation) should be independent of any particular process manager.
- Process managers abound, but few are equipped to manage the processes of parallel jobs:
  - Globus has a "meta" process manager that interacts with local ones.
  - Condor can start PVM and MPICH jobs (with special support for each).
  - PBS has the TM interface.
- MPD is a prototype process manager, intended to help us explore interface issues.
32. One Interface in the ADI-3 Library: the BNR Process Manager Interface
- Goals:
  - simple enough to plausibly suggest for implementation by other process managers
  - provide startup information and precommunication
  - not specific to either a parallel library or a process manager
- Status:
  - the complete definition is still not frozen
  - some parts already implemented by GARA (the Globus process manager) and MPD
[Diagram: the BNR interface as the layer between providers (MPD, GARA, PBS, Condor?, Harness?) and users (MPICH, the Globus device, others).]
33. The BNR Interface
- Data part: allows users (e.g. parallel libraries) to put key/value pairs into the (possibly distributed) database.
- Spawn part: allows users to request that processes be started, with hints.
- Groups, with a put/get/fence model for synchronization and communication preparation. (Fence provides scalability.)
- mpirun (as a user of the interface) can use precommunication for command-line arguments and the environment.
34. Plans
- Support TotalView startup => a real parallel debugger
- Performance tuning, alternate networks
- A verification and simulation subproject
- Full implementation of the BNR interface, including BNR_Spawn (needed for a more powerful mpiexec and for MPI_Comm_spawn and MPI_Comm_spawn_multiple)
35. Example of Use in Setting Up Communication
- Precommunication (communication needed before MPI communication can begin), in the TCP method:

    Process 26:
        BNR_Init( &grp );
        /* ... obtain own host and port ... */
        listen( port );
        BNR_Put( "host_26", host );
        BNR_Put( "port_26", port );
        BNR_Fence( grp );
        ...

    Process 345:
        BNR_Init( &grp );
        ...
        BNR_Fence( grp );
        ...
        /* ... decide to connect to 26 ... */
        BNR_Get( grp, "host_26", host );
        BNR_Get( grp, "port_26", port );
        connect( host, port );
36. Multiple Providers of the BNR Interface
- MPD
  - BNR_Puts deposit data in the managers
  - BNR_Fence is implemented by the manager ring
  - BNR_Get hunts around the ring for the data
- Globus
  - BNR_Puts are local to the process
  - BNR_Fence is an all-to-all exchange
  - BNR_Get is then also local
- MPICH-NT launcher
  - uses a global database in a dedicated process
37. Multiple Users of the Interface
- Put/get/fence already in use by:
  - the current MPICH (1.2.1), in the ch_p4mpd device
  - the Globus device in MPICH
  - various methods in the MPICH2000 RMQ device
- Finalizing the spawn part for use in next-generation MPICH
38. Summary
- There is still much to do in creating a solid parallel computing environment for applications.
- Process managers are one important component of that environment. MPD is an experimental process manager, focusing on providing the services parallel jobs need (fast startup, stdio handling, etc.).
- A widely used interface to process managers in general would help the development of process managers and parallel libraries alike.
- Other components and interfaces are important topics for research and experimentation.