Scalable Process Management and Interfaces for Clusters


1
Scalable Process Management and Interfaces for
Clusters
  • Rusty Lusk
  • representing also
  • David Ashton, Anthony Chan, Bill Gropp, Debbie
    Swider, Rob Ross, Rajeev Thakur
  • Mathematics and Computer Science Division
  • Argonne National Laboratory

2
Interfaces
  • High-level programming is the identification of
    components and interfaces
  • MPI is an interface
  • The new abstract device interface inside the next
    generation MPICH
  • An experimental process manager component - MPD
  • An interface between process managers and
    parallel libraries - BNR

3
Projects Related to MPI at ANL
[Diagram: map of MPI-related projects at Argonne and
their collaborators. Labels: Sandia, Other Apps, IBM,
PETSc, Jumpshot, SLOG, Perf Analysis, LANL, UoC Flash,
HP, Etnus, Debugging, Current MPI-IO Impls, LLNL, LBL,
Put/Get Prgming, ASCI, ND C, MPI, MPI2, MVICH,
Collective Ops, Others, MPICH, ROMIO, SGI, MPICH-2,
VIA, Myrinet, PVFS, BNR, ADI3, Large Clusters, NCSA,
Data Mngmt, Java IO, Infiniband, Myricom,
MultiThreading, MPD, OpenMP, NT Cluster, Globus,
MicroSoft, Scalable System Tools, SUT, IMPI, MPICH-G2,
QoS, NIST, NGI, Topology, Topology Sens. Collective]

4
Projects at Argonne Related to MPI
[Diagram: the same project map as on slide 3]

5
MPI As An Interface
  • It is multi-level: some routines are
    implementable in terms of the others
      • collective operations
      • parallel I/O (esp. ROMIO implementation)
  • It is extensible: there are places where
    unforeseeable parameters are managed
      • attributes on objects
      • info objects
      • (each uses key/value pairs)
  • It is interceptable: the MPI profiling interface
    allows users to wrap interface calls

6
MPI-2 Status Assessment
  • All MPP vendors now have MPI-1. Free
    implementations (MPICH, LAM) support
    heterogeneous workstation networks.
  • MPI-2 implementations are being undertaken now by
    all vendors.
      • Fujitsu, NEC have complete MPI-2 implementations
  • MPI-2 is harder to implement than MPI-1 was.
  • MPI-2 implementations appearing piecemeal, with
    I/O first.
      • I/O available in most MPI implementations
      • One-sided available in some (e.g., HP, SGI, NEC,
        and Fujitsu, coming soon from IBM)
      • parts of dynamic process management in LAM

7
The MPICH Implementation of MPI
  • As a research project: exploring tradeoffs
    between performance and portability; conducting
    research in implementation issues.
  • As a software project: providing a free MPI
    implementation on most machines; enabling
    vendors and others to build complete MPI
    implementations on their own communication
    services.
  • MPICH 1.2.1 just released, with complete MPI-1,
    parts of MPI-2 (I/O and C++), port to
    Windows 2000.
  • Available at http://www.mcs.anl.gov/mpi/mpich

8
Internal Interfaces in MPICH: The Abstract Device
Interface
  • ADI-1 objectives
      • speed of implementation
      • performance of MPI point-to-point operations
        layered over vendor libraries (NX, CMMD, EUI,
        etc.)
      • portability
  • ADI-2 objectives
      • portability
      • robustness
      • support for our own research into implementation
        issues
      • ease of outside implementations
          • vendor
          • research

9
Experiences with ADI-1 and ADI-2
  • Vendors
      • could (and did) field complete MPI
        implementations quickly
      • could (and did) incrementally move up the
        interface levels, replacing lower-level code
      • could (and did) evolve upper-level code
  • Researchers
      • could (and did) experiment with new transport
        layers (e.g. BIP)
  • Enabled by interface design

10
Internal Interfaces in Next Generation MPICH
  • Objectives for third-generation Abstract Device
    Interface (ADI-3)
      • support for full MPI-2
      • enable thread-safe implementation
      • provide more tools for collaborators
      • support in ADI for collective operations
      • enable high performance on new networks
  • Status
      • interfaces being finalized, comments welcome
      • exploratory thread-safe, multiprotocol
        implementation running
      • http://www.mcs.anl.gov/mpi/mpich.adi3

[Diagram labels: MPI_Reduce, MPI_Isend, ADI_Isend,
ADI_Rhc, write]

11
Interfaces in the Parallel Computing Environment
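[Diagram: components of the parallel computing
environment and the internal interfaces between them.
Labels: System Admin, System Monitor, Scheduler, Queue
manager, Job Submission, Process Manager, Parallel
Library (PVM, MPI), User, Application, File System]
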
12
What is a Process Manager?
  • A process management system is the software
    component that starts user processes (with
    command line arguments and environment), ensures
    that they terminate cleanly, and manages I/O
  • For simple jobs, this can be the shell
  • For parallel jobs, more is needed
  • Process management is different from scheduling
    and queuing
  • We focus for now on the Unix environment
  • Related projects: MPICH, PVM, LAM, Harness, PBS,
    LSF, DQS, LoadLeveler, Condor
  • An experimental system: MPD (the multipurpose
    daemon)

13
Goals for MPD
  • Original Goal: speed up MPICH startup
  • evolved to
  • Grandiose Goal: build an entire cluster
    computing environment
  • evolved to
  • Architectural Goal: design the components of a
    cluster computing environment and their
    interfaces
  • evolved to
  • Realistic Goal: make a start on understanding
    architectural requirements, and speed up MPICH
    startup

14
Design Targets for MPD
  • Simplicity - transparent enough to convince
    system people to run it as root
  • Speed - startup of parallel jobs fast enough to
    provide interactive feel (1000 processes in a
    few seconds)
  • Robustness - no single point of failure,
    auto-repair of at least some failures
  • Scalability - complexity or size of any one
    component shouldn't depend on the number of
    components
  • Service - provide parallel jobs with what they
    need, e.g. mechanism for precommunication

15
Parallel Jobs
  • Individual process environments: each process in
    a parallel job should be able to have
      • its own executable file
      • its own command-line arguments
      • its own environment variables
      • exit codes
  • Collective identity of parallel job: a job
    should collectively
      • be able to be signalled (suspended, restarted,
        killed, others)
      • produce stdout, stderr, and accept stdin scalably
      • terminate, especially abnormally

16
Deriving the Design from the Targets
  • Simplicity and robustness => multicomponent
    system
      • daemon - persistent, may run for weeks or months
        - one instance per host
      • manager - started by daemon to manage one process
        (its client) of parallel job
      • clients - the application processes, e.g. MPI
        processes or system processes
      • console processes - talk to user and to daemon on
        local host
  • Speed => daemons are in contact with one another
    prior to job startup

17
Deriving the Design from the Targets
  • Scalability: no master => daemons connected in
    ring, or ring of rings
  • Manager services => managers also connected;
    speed => managers inherit part of ring from
    daemons
  • Separate managers for each client process support
    individual process environments
  • Collective identity of job represented by a console
    process, e.g. mpirun or mpiexec, which
    represents the job for stdin, stdout, stderr, and
    signals
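
The ring idea can be sketched in a few lines. This is a toy, single-process model and not MPD's code: the names Daemon, insert_after, ring_trip, reknit, and build_ring are hypothetical. Each daemon knows only its right-hand neighbor, so no component grows with the size of the ring, and bypassing a dead daemon is a local relinking step.

```python
class Daemon:
    """One ring member; it knows only its right-hand neighbor."""

    def __init__(self, host):
        self.host = host
        self.next = self  # a single daemon forms a ring of one


def insert_after(existing, new):
    """Splice a new daemon into the ring after an existing one."""
    new.next = existing.next
    existing.next = new


def build_ring(hosts):
    """Build a ring in the given host order and return the first daemon."""
    first = Daemon(hosts[0])
    last = first
    for host in hosts[1:]:
        daemon = Daemon(host)
        insert_after(last, daemon)
        last = daemon
    return first


def ring_trip(start):
    """Pass a token once around the ring; return hosts in visit order."""
    visited = [start.host]
    node = start.next
    while node is not start:
        visited.append(node.host)
        node = node.next
    return visited


def reknit(start, dead_host):
    """Bypass a dead daemon by linking its predecessor to its successor.

    `start` must be a live daemon. Returns True if a daemon was bypassed.
    """
    node = start
    while True:
        if node.next.host == dead_host and node.next is not node:
            node.next = node.next.next
            return True
        node = node.next
        if node is start:
            return False
```

A real system also has to detect the failure and open a new connection; here the link is just a Python reference.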

18
Architecture of MPD
  • Daemons, managers, clients, consoles
  • Experimental process manager, job manager,
    scheduler interface for parallel jobs

19
Interesting Features
  • Security
      • Challenge-response system, using passwords in
        protected files and encryption of random numbers
      • Speed not important since daemon startup is
        separate from job startup
  • Fault Tolerance
      • When a daemon dies, this is detected and the ring
        is reknit => minimal fault tolerance
      • New daemon can be inserted in ring
  • Signals
      • Signals can be delivered to clients by their
        managers
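
A challenge-response exchange of the kind named under Security might look like the sketch below. It is an illustration only: it substitutes an HMAC-SHA256 over the random challenge for whatever encryption MPD actually uses, and the function names and the example secret are hypothetical.

```python
import hashlib
import hmac
import secrets


def make_challenge():
    """Verifier side: a fresh random number, never reused."""
    return secrets.token_bytes(16)


def respond(shared_secret, challenge):
    """Prover side: key the challenge with the secret from the protected file."""
    return hmac.new(shared_secret, challenge, hashlib.sha256).digest()


def verify(shared_secret, challenge, response):
    """Verifier side: recompute the answer and compare in constant time."""
    expected = hmac.new(shared_secret, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)
```

The secret itself never crosses the connection; only the random challenge and the keyed digest do.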

20
More Interesting Features
  • Uses of signal delivery
      • signals delivered to a job-starting console
        process are propagated to the clients
          • so can suspend, resume, or kill an mpirun
      • one client can signal another
          • can be used in setting up connections dynamically
      • a separate console process can signal currently
        running jobs
          • can be used to implement a primitive gang
            scheduler
  • Support for parallel libraries via BNR

21
More Interesting Features
  • Support for parallel libraries
      • implements the BNR process manager interface,
        used by MPICH.
      • groups, put, get, fence, spawn
      • simple distributed database maintained in the
        managers
      • solves pre-communication problem of startup
      • makes MPD independent from MPICH while still
        providing needed features

22
Handling Standard I/O
  • Managers capture stdout and stderr (separately)
    from their clients
  • Managers forward stdout and stderr (separately)
    up a pair of binary trees to the console,
    optionally adding a rank identifier as line label
  • Console's stdin is delivered to stdin of client 0
    by default, but can be controlled to broadcast or
    go to a specific client

[Diagram labels: mpd ring, manager ring, I/O tree,
client]

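
The tree forwarding and line labels can be illustrated with a small model. The heap-style numbering (the parent of rank r is (r - 1) // 2, with rank 0 next to the console) is an assumption made for the sketch; the slides do not say how MPD lays out its trees, and all function names here are hypothetical.

```python
def parent(rank):
    """Parent of a manager in a heap-numbered binary tree (rank 0 is the root)."""
    return None if rank == 0 else (rank - 1) // 2


def path_to_console(rank):
    """Managers a line passes through on its way up, ending at rank 0."""
    hops = [rank]
    while parent(hops[-1]) is not None:
        hops.append(parent(hops[-1]))
    return hops


def label(rank, line):
    """Prefix one line of output with the rank that produced it."""
    return f"{rank}: {line}"


def forward(outputs):
    """Deliver (rank, line) pairs to the console as (labelled line, hop count)."""
    delivered = []
    for rank, line in outputs:
        hops = len(path_to_console(rank)) - 1
        delivered.append((label(rank, line), hops))
    return delivered
```

With this numbering a line from the last of 1000 ranks crosses 9 managers, which is why a tree scales better than having every client write to the console directly.
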
23
Client Wrapping
  • Unix semantics for fork, exec, and process
    environments allow interposition of other
    processes that do not know about the client
    library
  • For example,
      mpirun -np 16 myprog
    can be replaced by
      mpirun -np 16 nice -5 myprog
    or
      mpirun -np 16 pty myprog
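
The interposition argument can be tried out with a short sketch. Everything here is illustrative: wrap, launch, PASS_THROUGH, and the DEMO_RANK variable are hypothetical names, and PASS_THROUGH is a stand-in for a wrapper such as nice or pty that simply execs the rest of its command line.

```python
import os
import subprocess
import sys

# A wrapper that knows nothing about the client: it execs its own arguments.
PASS_THROUGH = [
    sys.executable, "-c",
    "import os, sys; os.execvp(sys.argv[1], sys.argv[1:])",
]


def wrap(client_argv, wrapper_argv=()):
    """Return the argv to start: wrapper first, client command after it."""
    return list(wrapper_argv) + list(client_argv)


def launch(argv, extra_env=None):
    """Start argv as a child process and return (exit code, stdout)."""
    env = dict(os.environ, **(extra_env or {}))
    done = subprocess.run(argv, env=env, capture_output=True, text=True)
    return done.returncode, done.stdout
```

Because the environment and the open file descriptors survive exec, the client sees the same DEMO_RANK value and writes to the same stdout whether or not a wrapper sits in between.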

24
Putting It All Together
  • The combination of
  • client wrapping
  • I/O management, especially redirection of stdin
  • line labels on stdout
  • ability to customize console
  • can be surprisingly powerful.

25
A Simple Parallel Debugger
  • The program mpigdb is a slightly modified version
    of the mpirun console program
  • Automatically wraps given client with gdb
  • Intercepts (gdb) prompts and counts them, issues
    own (mpigdb) prompt when enough have been
    received
  • Sets line label option on stdout and stderr
  • Sets broadcast behavior for stdin as default
  • Uses z command to modify stdin target:
    any specific rank, or broadcast to all
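
The prompt-counting step can be modelled as below. This is a guess at the logic from the bullets above, not the mpigdb source; the class PromptCounter and its methods are hypothetical.

```python
class PromptCounter:
    """Swallow per-process '(gdb)' prompts and emit one '(mpigdb) ' prompt."""

    def __init__(self, nprocs):
        self.nprocs = nprocs
        self.target = None  # None means stdin is broadcast to all ranks
        self.seen = 0

    def expected(self):
        """Prompts to wait for: one per rank that received the command."""
        return self.nprocs if self.target is None else 1

    def z(self, rank=None):
        """Model of the z command: pick one rank, or None for broadcast."""
        self.target = rank
        self.seen = 0

    def feed(self, rank, line):
        """Process one line from a client; return the text to show the user."""
        if line.strip() == "(gdb)":
            self.seen += 1
            if self.seen >= self.expected():
                self.seen = 0
                return "(mpigdb) "
            return ""
        return f"{rank}: {line}"
```

After z(0) only rank 0 receives commands, so a single (gdb) prompt is enough to trigger the next (mpigdb) prompt.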

26
Parallel Debugging with mpigdb
    donner% mpigdb -np 3 cpi
    (mpigdb) b 33
    0: Breakpoint 1 at 0x8049eac: file cpi.c, line 33.
    1: Breakpoint 1 at 0x8049eac: file cpi.c, line 33.
    2: Breakpoint 1 at 0x8049eac: file cpi.c, line 33.
    (mpigdb) r
    2: Breakpoint 1, main (argc=1, argv=0xbffffab4) at cpi.c:33
    1: Breakpoint 1, main (argc=1, argv=0xbffffac4) at cpi.c:33
    0: Breakpoint 1, main (argc=1, argv=0xbffffad4) at cpi.c:33
    (mpigdb) n
    2: 43  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    0: 39  if (n==0) n=100; else n=0;
    1: 43  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    (mpigdb) z 0
    (mpigdb) n
    0: 41  startwtime = MPI_Wtime();
    (mpigdb) n
    0: 43  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    (mpigdb)

27
Continuing...
    (mpigdb) z
    (mpigdb) n
    ....
    (mpigdb) n
    0: 52  x = h * ((double)i - 0.5);
    1: 52  x = h * ((double)i - 0.5);
    2: 52  x = h * ((double)i - 0.5);
    (mpigdb) p x
    0: $2 = 0.0050000000000000001
    2: $2 = 0.025000000000000001
    1: $2 = 0.014999999999999999
    (mpigdb) c
    0: pi is approximately 3.1416009869231249,
    0: Error is 0.0000083333333318
    0: Program exited normally.
    1: Program exited normally.
    2: Program exited normally.
    (mpigdb) q
    donner%

28
Experiences
  • Instantaneous startup of small MPICH jobs was
    wonderful after years of conditioning for slow
    startup
  • Not so critical for batch-scheduled jobs, but
    allows writing parallel versions of short, quick
    Unix tools (cp, rm, find, etc) as MPI jobs
  • Speed on big jobs (Chiba City with fast
    Ethernet)
      • mpdringtest 1 on 211 hosts - 0.13 sec.
      • mpirun -np 422 hostname - 3.5 sec.
  • Running the daemons as root required little extra
  • We really do use the debugger mpigdb

29
Running the Daemons as Root
  • Daemon is run as root
  • Console runs as a setuid program, and becomes
    root only briefly while connected to daemon.
  • Console transmits uid, gid, and group membership
    of real user to daemon ring
  • Daemons fork managers, which assume the user's
    attributes before exec'ing the manager program
  • After job startup, console, managers, and clients
    are all running as user.
  • Daemon free to accept more console commands from
    same or other users
  • In experimental use now on our Chiba City cluster
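
The step where a forked manager assumes the user's attributes can be sketched as follows. This is an assumption about how such a step is usually written on Unix, not MPD's code: become_user and Recorder are hypothetical names, while os.setgroups, os.setgid, and os.setuid are the standard Python wrappers for the system calls.

```python
import os


def become_user(uid, gid, groups, sysapi=os):
    """Drop root to the real user's identity; the order matters.

    Changing groups requires root, so it must happen before setuid().
    """
    sysapi.setgroups(list(groups))
    sysapi.setgid(gid)
    sysapi.setuid(uid)


class Recorder:
    """Stand-in for the os module so the order can be checked without root."""

    def __init__(self):
        self.calls = []

    def setgroups(self, groups):
        self.calls.append(("setgroups", tuple(groups)))

    def setgid(self, gid):
        self.calls.append(("setgid", gid))

    def setuid(self, uid):
        self.calls.append(("setuid", uid))
```

In a forked child running as root, the call would come just before the exec of the manager program, using the uid, gid, and group list that the console transmitted.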

30
Status
  • The MPD system is open source and is available as
    a component of the MPICH distribution.
  • Configuring and installing MPICH with the
    ch_p4mpd device automatically builds and installs
    MPD.
  • MPD can be built and installed separately from
    MPICH
  • Serving as platform for study of process manager
    interfaces

31
Motivation for a Process Manager Interface
  • Any device or method needs to interact with
    process manager (at least at startup, and
    probably later as well)
  • A parallel library (such as an MPI
    implementation) should be independent of any
    particular process manager.
  • Process managers abound, but few are equipped to
    manage the processes of parallel jobs.
      • Globus has a meta process manager that
        interacts with local ones.
      • Condor can start PVM and MPICH jobs (special for
        each)
      • PBS: TM interface
  • MPD is a prototype process manager, intended to
    help us explore interface issues.

32
One Interface in the ADI-3 Library: the BNR
Process Manager Interface
  • Goals
      • simple enough to plausibly suggest for
        implementation by other process managers
      • provide startup info and precommunication
      • not specific to either parallel library or
        process manager
  • Status
      • complete definition still not frozen
      • some parts already implemented by GARA (Globus P.
        M.) and MPD

[Diagram: the BNR Interface sits between its providers
(MPD, GARA, PBS, Condor?, Harness?) and its users
(MPICH, Globus Device, Others)]

33
The BNR Interface
  • Data part: allows users (e.g. parallel
    libraries) to put key/value pairs into the
    (possibly distributed) database.
  • Spawn part: allows users to request that
    processes be started, with hints.
  • Groups, put/get/fence model for synchronization,
    communication preparation. (Fence provides
    scalability)
  • mpirun (as user) can use precommunication for
    command-line args, environment
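
A single-process toy makes the put/get/fence contract concrete. The class ToyKeyValueSpace is hypothetical and does not reproduce the BNR function signatures; it only shows the rule that values put before a fence are guaranteed visible after it.

```python
class ToyKeyValueSpace:
    """Single-process stand-in for a put/get/fence database."""

    def __init__(self):
        self.pending = {}    # puts not yet published
        self.published = {}  # what get() can see

    def put(self, key, value):
        """Record a value; it need not be visible to others yet."""
        self.pending[key] = value

    def fence(self):
        """Publish all pending puts in one step."""
        self.published.update(self.pending)
        self.pending.clear()

    def get(self, key):
        """Return a published value, or None if the key is unknown."""
        return self.published.get(key)
```

Deferring visibility to the fence is what leaves an implementation free to batch many puts into one exchange instead of sending each one as it happens.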

34
Plans
  • Support TotalView startup => real parallel
    debugger
  • Performance tuning, alternate networks
  • Verification and simulation subproject
  • Full implementation of BNR interface, including
    BNR_Spawn (needed for more powerful mpiexec and
    MPI_Comm_spawn, MPI_Comm_spawn_multiple)

35
Example of Use in Setting Up Communication
  • Precommunication (communication needed before MPI
    communication), in TCP method

Process 26:
    BNR_Init( grp )
    ...obtain own host and port...
    listen( port )
    BNR_Put( host_26, host )
    BNR_Put( port_26, port )
    BNR_Fence( grp )
    ...

Process 345:
    BNR_Init( grp )
    ...
    BNR_Fence( grp )
    ...
    ... decide to connect to 0 ...
    BNR_Get( grp, host_26, host )
    BNR_Get( grp, port_26, port )
    connect( host, port )

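
The same exchange can be run end to end with two threads standing in for the two processes. This is a sketch under stated assumptions: a shared dict plays the database, a threading.Barrier plays the fence, and the names board, fence, listener, connector, and run are hypothetical. Only the socket and threading calls are real library APIs.

```python
import socket
import threading

board = {}                               # stands in for the put/get database
fence = threading.Barrier(2, timeout=5)  # stands in for the fence over the group
received = []


def listener(rank):
    """Publish own host and port, pass the fence, then accept one connection."""
    srv = socket.socket()
    srv.settimeout(5)
    srv.bind(("127.0.0.1", 0))           # port 0: let the OS pick a free port
    srv.listen(1)
    host, port = srv.getsockname()
    board[f"host_{rank}"] = host
    board[f"port_{rank}"] = port
    fence.wait()                         # after this, the puts are visible
    conn, _ = srv.accept()
    chunks = []
    while True:
        chunk = conn.recv(64)
        if not chunk:                    # peer closed the connection
            break
        chunks.append(chunk)
    received.append(b"".join(chunks))
    conn.close()
    srv.close()


def connector(peer):
    """Pass the fence, look up the peer's address, and connect to it."""
    fence.wait()                         # never get before the fence
    host = board[f"host_{peer}"]
    port = board[f"port_{peer}"]
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(b"hello from 345")


def run():
    """Start both sides, wait for them, and return what the listener received."""
    threads = [threading.Thread(target=listener, args=(26,)),
               threading.Thread(target=connector, args=(26,))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received[-1]
```

The listener binds before it publishes its port, so by the time the connector passes the fence the address it reads is already accepting connections.
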
36
Multiple Providers of the BNR Interface
  • MPD
      • BNR_Puts deposit data in managers
      • BNR_Fence implemented by manager ring
      • BNR_Get hunts around ring for data
  • Globus
      • BNR_Puts are local into process
      • BNR_Fence is an all-to-all exchange
      • BNR_Get is then also local
  • MPICH-NT launcher
      • uses global database in dedicated process
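
The first two strategies can be compared in a toy model where both offer the same put/fence/get calls. The classes LocalThenExchange and RingLookup are hypothetical simplifications: each get returns the value together with the number of remote lookups it needed, and neither class reproduces real Globus or MPD code.

```python
class LocalThenExchange:
    """Puts stay local; fence merges everyone's puts; gets are then local."""

    def __init__(self, nprocs):
        self.local = [dict() for _ in range(nprocs)]
        self.merged = [dict() for _ in range(nprocs)]

    def put(self, rank, key, value):
        self.local[rank][key] = value

    def fence(self):
        everything = {}
        for puts in self.local:          # the all-to-all exchange, simulated
            everything.update(puts)
        for view in self.merged:
            view.update(everything)

    def get(self, rank, key):
        return self.merged[rank].get(key), 0  # no remote lookups needed


class RingLookup:
    """Puts stay with their owner; get walks the ring until the key is found."""

    def __init__(self, nprocs):
        self.store = [dict() for _ in range(nprocs)]

    def put(self, rank, key, value):
        self.store[rank][key] = value

    def fence(self):
        pass  # a real fence would synchronize the group; nothing to merge here

    def get(self, rank, key):
        n = len(self.store)
        for hops in range(n):
            owner = (rank + hops) % n
            if key in self.store[owner]:
                return self.store[owner][key], hops
        return None, n                   # a full trip around the ring, no match
```

The trade-off shows up in the hop counts: one design pays at the fence so that every get is free, the other keeps the fence cheap and pays on each get.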

37
Multiple Users of the Interface
  • Put/Get/Fence already in use
      • Current MPICH (1.2.1), ch_p4mpd device
      • Globus device in MPICH
      • Various methods in MPICH2000 RMQ device
  • Finalizing spawn part for use in next-generation
    MPICH

38
Summary
  • There is still much to do in creating a solid
    parallel computing environment for applications.
  • Process managers are one important component of
    the environment. MPD is an experimental process
    manager, focusing on providing needed services to
    parallel jobs (fast startup, stdio, etc.)
  • A widely used interface to process managers in
    general would help the development of process
    managers and parallel libraries alike.
  • Other components and interfaces are important
    topics for research and experimentation.