Title: Crash Course in Parallel Programming Using MPI
1. Crash Course in Parallel Programming Using MPI
- Adam Jacobs
- HCS Research Lab
- 01/10/07
2. Outline: PCA Preparation
- Parallel Computing
- Distributed Memory Architectures
- Programming Models
- Flynn's Taxonomy
- Parallel Decomposition
- Speedups
3. Parallel Computing
- Motivated by the high computational complexity and memory requirements of large applications
- Two approaches:
  - Shared memory
  - Distributed memory
- The majority of modern systems are clusters (distributed memory architecture)
  - Many simple machines connected with a powerful interconnect
  - e.g., ASCI Red, ASCI White
- A hybrid approach can also be used
  - e.g., IBM Blue Gene
4. Shared Memory Systems
- Memory resources are shared among processors
- Relatively easy to program for, since there is a single unified memory space
- Scales poorly with system size due to the need for cache coherency
- Example: Symmetric Multiprocessors (SMP)
  - Each processor has equal access to RAM
  - 4-way motherboards are MUCH more expensive than 2-way
5. Distributed Memory Systems
- Individual nodes consist of a CPU, RAM, and a network interface
  - A hard disk is not necessary; mass storage can be supplied using NFS
- Information is passed between nodes using the network
- No need for special cache coherency hardware
- More difficult to write programs for distributed memory systems, since the programmer must keep track of memory usage
6. Programming Models
- Multiprogramming
  - Multiple programs running simultaneously
- Shared Address
  - Global address space available to all processors
  - Shared data is written to this global space
- Message Passing
  - Data is sent directly to processors using messages
- Data Parallel
7. Flynn's Taxonomy
- SISD: Single Instruction, Single Data
  - Normal instructions
- SIMD: Single Instruction, Multiple Data
  - Vector operations, MMX, SSE, AltiVec
- MISD: Multiple Instructions, Single Data
- MIMD: Multiple Instructions, Multiple Data
- SPMD: Single Program, Multiple Data
8. Parallel Decomposition
- Data Parallelism
  - Parallelism within a dataset, such that a portion of the data can be computed independently from the rest
  - Usually results in coarse-grained parallelism (compute farms)
  - Allows for automatic load-balancing strategies
- Functional Parallelism
  - Parallelism between distinct functional blocks, such that each block can be performed independently
  - Especially useful for pipeline structures
9. Speedup
10. Super-linear Speedup
- Linear speedup is the best that can be achieved
  - Or is it?
- Super-linear speedup occurs when parallelizing an algorithm results in a more efficient use of hardware resources
  - A 1 MB task doesn't fit on a single processor, but two 512 KB tasks do fit, resulting in lower effective memory access times
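
As a reference, speedup on p processors is commonly defined as the ratio of serial to parallel execution time:

  S(p) = T_serial / T_parallel(p)

Linear speedup corresponds to S(p) = p; super-linear speedup is the case S(p) > p, as in the memory example above.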
11. MPI: Message Passing Interface
- Adam Jacobs
- HCS Research Lab
- 01/10/07
Slides created by Raj Subramaniyan
12. Outline: MPI Usage
- Introduction
- MPI Standard
- MPI Implementations
- MPICH Introduction
- MPI calls
- Present Emphasis
13. Parallel Computing
- Motivated by the high computational complexity and memory requirements of large applications
- Cooperation with other processes
  - Cooperative and one-sided operations
  - Processes interact with each other by exchanging information
- Models
  - SIMD
  - SPMD
  - MIMD
14. Cooperative Operations
- Cooperative: all parties agree to transfer data
- Message passing is an approach that makes the exchange of data cooperative
- Data must be both explicitly sent and received
- Any change in the receiver's memory is made with the receiver's participation
15. MPI: Message Passing Interface
- MPI: a message-passing library specification
  - A message-passing model, not a specific product
- Designed for parallel computers, clusters, and heterogeneous networks
- Standardization began in 1992 and the final draft was made available in 1994
- Broad participation of vendors, library writers, application specialists, and scientists
- Message Passing Interface Forum accessible at http://www.mpi-forum.org/
16. Features of MPI
- Point-to-point communication
- Collective operations
- Process groups
- Communication contexts
- Process topologies
- Bindings for Fortran 77 and C
- Environmental management and inquiry
- Profiling interface
17. Features NOT included in MPI
- Explicit shared-memory operations
- Operations that require more operating system support than is currently standard; for example, interrupt-driven receives, remote execution, or active messages
- Program construction tools
- Explicit support for threads
- Support for task management
- I/O functions
18. MPI Implementations
- Listed below are MPI implementations available for free:
  - Appleseed (UCLA)
  - CRI/EPCC (Edinburgh Parallel Computing Centre)
  - LAM/MPI (Indiana University)
  - MPI for UNICOS Systems (SGI)
  - MPI-FM (University of Illinois) for Myrinet
  - MPICH (ANL)
  - MVAPICH (InfiniBand)
  - SGI Message Passing Toolkit
  - OpenMPI
- A detailed list of MPI implementations with features can be found at http://www.lam-mpi.org/mpi/implementations/
19. MPICH
- MPICH: a portable implementation of MPI developed at the Argonne National Laboratory (ANL) and Mississippi State University (MSU)
- Very widely used
- Supports all the specs of the MPI-1 standard
- Features that are part of the MPI-2 standard are under development (ANL alone)
- http://www-unix.mcs.anl.gov/mpi/mpich/
20. Writing MPI Programs
Part of all programs:

  #include "mpi.h"    // Gives basic MPI types, definitions
  #include <stdio.h>

  int main( int argc, char *argv[] )
  {
      MPI_Init( &argc, &argv );   // Starts MPI

      /* actual code, including normal C calls and MPI calls */

      MPI_Finalize();             // Ends MPI
      return 0;
  }
21. Initialize and Finalize
- MPI_Init
  - Initializes all necessary MPI variables
  - Forms the MPI_COMM_WORLD communicator (a communicator is a list of all the connections between nodes)
  - Opens necessary TCP connections
- MPI_Finalize
  - Waits for all processes to reach the function
  - Closes TCP connections
  - Cleans up
22. Rank and Size
- Environment details:
  - How many processes are there? (MPI_Comm_size)
  - Who am I? (MPI_Comm_rank)
- MPI_Comm_size( MPI_COMM_WORLD, &size );
- MPI_Comm_rank( MPI_COMM_WORLD, &rank );
- The rank is a number between 0 and size-1
23. Sample Hello World Program

  #include <stdio.h>
  #include <string.h>
  #include "mpi.h"

  int main(int argc, char *argv[])
  {
      int my_rank, p;       // process rank and number of processes
      int source, dest;     // rank of sending and receiving process
      int tag = 0;          // tag for messages
      char mesg[100];       // storage for message
      MPI_Status status;    // stores status for MPI_Recv statements

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
      MPI_Comm_size(MPI_COMM_WORLD, &p);

      if (my_rank != 0) {
          sprintf(mesg, "Greetings from %d!", my_rank);   // stores into character array
          dest = 0;                                       // sets destination for MPI_Send to process 0
          MPI_Send(mesg, strlen(mesg)+1, MPI_CHAR, dest,
                   tag, MPI_COMM_WORLD);                  // sends string to process 0
      } else {
          for (source = 1; source < p; source++) {
              MPI_Recv(mesg, 100, MPI_CHAR, source, tag,
                       MPI_COMM_WORLD, &status);          // recv from each process
              printf("%s\n", mesg);                       // prints out greeting to screen
          }
      }
      MPI_Finalize();                                     // shuts down MPI
      return 0;
  }
24. Compiling MPI Programs
- Two methods:
  - Compilation commands
  - Using a Makefile
- Compilation commands:
  - mpicc -o hello_world hello-world.c
  - mpif77 -o hello_world hello-world.f
  - Likewise, mpiCC and mpif90 are available for C++ and Fortran 90, respectively
- Makefile.in is a template Makefile
  - mpireconfig translates Makefile.in to a Makefile for a particular system
25. Running MPI Programs
- To run hello_world on two machines:
  - mpirun -np 2 hello_world
  - Must specify the full path of the executable
- To see the commands executed by mpirun:
  - mpirun -t
- To get all the mpirun options:
  - mpirun -help
26. MPI Communications
- Typical blocking send:
  - send (dest, type, address, length)
  - dest: integer representing the process to receive the message
  - type: data type being sent (often overloaded)
  - (address, length): contiguous area in memory being sent
- MPI_Send/MPI_Recv provide point-to-point communication
- Typical global operation:
  - broadcast (type, address, length)
- Six basic MPI calls: MPI_Init, MPI_Finalize, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv
27. MPI Basic Send/Recv
- int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm )
  - buf: initial address of send buffer
  - count: number of elements in send buffer (nonnegative integer)
  - datatype: datatype of each send buffer element (handle)
  - dest: rank of destination (integer)
  - tag: message tag (integer)
  - comm: communicator (handle)
- int MPI_Recv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status )
  - source: rank of source (integer)
  - status: status object (Status)
  - status is mainly useful when messages are received with MPI_ANY_TAG and/or MPI_ANY_SOURCE
28. Information about a Message
- The count argument in recv indicates the maximum length of a message
- The actual length of the message can be obtained using MPI_Get_count (example below):
  - MPI_Status status;
  - MPI_Recv( ..., &status );
  - ... status.MPI_TAG;
  - ... status.MPI_SOURCE;
  - MPI_Get_count( &status, datatype, &count );
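
A minimal sketch of how the status fields are typically used, assuming an already initialized MPI program in which one rank accepts a message from an unknown sender (the buffer size of 1024 is an arbitrary upper bound):

  /* Fragment: assumes MPI_Init has already been called; needs <stdio.h>. */
  MPI_Status status;
  int count;
  double buf[1024];                  // arbitrary upper bound on message size

  // Accept a message from any source with any tag
  MPI_Recv(buf, 1024, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
           MPI_COMM_WORLD, &status);

  // Actual number of elements received, plus who sent it and with which tag
  MPI_Get_count(&status, MPI_DOUBLE, &count);
  printf("received %d doubles from rank %d (tag %d)\n",
         count, status.MPI_SOURCE, status.MPI_TAG);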
29. Example: Matrix Multiplication Program

  /* send matrix data to the worker tasks */
  averow = NRA/numworkers;
  extra  = NRA%numworkers;
  offset = 0;
  mtype  = FROM_MASTER;
  for (dest=1; dest<=numworkers; dest++)
  {
      rows = (dest <= extra) ? averow+1 : averow;  // If rows not evenly divisible among workers,
                                                   // some workers get an additional row
      printf("sending %d rows to task %d\n", rows, dest);
      MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);   // Starting row being sent
      MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);     // Number of rows sent
      count = rows*NCA;                                             // Gives total elements being sent
      MPI_Send(&a[offset][0], count, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
      count = NCA*NCB;                                              // Equivalent to NRB*NCB elements in B
      MPI_Send(&b, count, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
      offset = offset + rows;                                       // Increment offset for the next worker
  }

MASTER SIDE
30. Example: Matrix Multiplication Program (contd.)

  /* wait for results from all worker tasks */
  mtype = FROM_WORKER;
  for (i=1; i<=numworkers; i++)      // Get results from each worker
  {
      source = i;
      MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
      MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
      count = rows*NCB;              // Elements in the result from the worker
      MPI_Recv(&c[offset][0], count, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
  }
  /* print results */
  /* end of master section */

MASTER SIDE
31. Example: Matrix Multiplication Program (contd.)

  if (taskid > MASTER)               // Implies a worker node
  {
      mtype  = FROM_MASTER;
      source = MASTER;
      printf("Master = %d, mtype = %d\n", source, mtype);
      // Receive the offset and number of rows
      MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
      printf("offset = %d\n", offset);
      MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
      printf("rows = %d\n", rows);
      count = rows*NCA;              // Elements to receive for matrix A
      MPI_Recv(&a, count, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);
      printf("a[0][0] = %e\n", a[0][0]);
      count = NCA*NCB;               // Elements to receive for matrix B
      MPI_Recv(&b, count, MPI_DOUBLE, source, mtype, MPI_COMM_WORLD, &status);

WORKER SIDE
32. Example: Matrix Multiplication Program (contd.)

      for (k=0; k<NCB; k++)
          for (i=0; i<rows; i++)
          {
              c[i][k] = 0.0;
              // Do the matrix multiplication for the rows you are assigned to
              for (j=0; j<NCA; j++)
                  c[i][k] = c[i][k] + a[i][j] * b[j][k];
          }
      mtype = FROM_WORKER;
      printf("after computing\n");
      MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
      MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
      MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);   // Sending the actual result
      printf("after send\n");
  }   /* end of worker */

WORKER SIDE
33. Asynchronous Send/Receive
- MPI_Isend() and MPI_Irecv() are non-blocking; control returns to the program after the call is made
- int MPI_Isend( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request )
- int MPI_Irecv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request )
- request: communication request (handle); output parameter
34. Detecting Completions
- Non-blocking operations return (immediately) request handles that can be waited on and queried
- MPI_Wait waits for an MPI send or receive to complete (example below)
- int MPI_Wait ( MPI_Request *request, MPI_Status *status )
  - request: matches the request on Isend or Irecv
  - status: returns the status, equivalent to the status for MPI_Recv, when complete
  - Blocks for a send until the message is buffered or sent, so the message variable is free
  - Blocks for a receive until the message is received and ready
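
A small self-contained sketch of MPI_Isend/MPI_Irecv with MPI_Wait: ranks 0 and 1 exchange one integer and could do other work between posting and waiting (intended to be run with -np 2):

  #include <stdio.h>
  #include "mpi.h"

  int main(int argc, char *argv[])
  {
      int rank, other, sendval, recvval;
      MPI_Request sreq, rreq;
      MPI_Status  status;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      other   = (rank == 0) ? 1 : 0;      // partner rank (run with -np 2)
      sendval = rank * 100;

      // Post the receive and the send; both calls return immediately
      MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &rreq);
      MPI_Isend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &sreq);

      /* ... useful computation could overlap with communication here ... */

      MPI_Wait(&sreq, &status);           // safe to reuse sendval after this
      MPI_Wait(&rreq, &status);           // recvval is now valid
      printf("rank %d received %d\n", rank, recvval);

      MPI_Finalize();
      return 0;
  }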
35. Detecting Completions (contd.)
- MPI_Test tests for the completion of a send or receive (short sketch below)
- int MPI_Test ( MPI_Request *request, int *flag, MPI_Status *status )
  - request, status: as for MPI_Wait
  - Does not block
  - flag indicates whether the operation is complete or not
- Enables code that can repeatedly check for communication completion
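
A brief sketch of the polling style MPI_Test enables, assuming a request req returned by an earlier MPI_Irecv or MPI_Isend and a hypothetical do_other_work() application routine:

  /* Fragment: req was returned by an earlier non-blocking call. */
  int flag = 0;
  MPI_Status status;

  while (!flag) {
      MPI_Test(&req, &flag, &status);   // returns immediately; flag != 0 when done
      if (!flag)
          do_other_work();              // hypothetical application routine
  }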
36. Multiple Completions
- It is often desirable to wait on multiple requests, e.g., in a master/slave program (sketch below)
- int MPI_Waitall( int count, MPI_Request *array_of_requests, MPI_Status *array_of_statuses )
- int MPI_Waitany( int count, MPI_Request *array_of_requests, int *index, MPI_Status *status )
- int MPI_Waitsome( int incount, MPI_Request *array_of_requests, int *outcount, int *array_of_indices, MPI_Status *array_of_statuses )
- There are corresponding versions of test for each of these
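
One possible master-side sketch: the master has posted one MPI_Irecv per worker into reqs[] (receiving into a double array results[]), and services whichever result arrives first with MPI_Waitany; nworkers, reqs, and results are assumed to be set up by the application:

  /* Fragment: one outstanding MPI_Irecv per worker, posted earlier into reqs[]. */
  int index, done;
  MPI_Status status;

  for (done = 0; done < nworkers; done++) {
      // Blocks until any one of the posted receives completes
      MPI_Waitany(nworkers, reqs, &index, &status);
      printf("result %f arrived from worker %d\n",
             results[index], status.MPI_SOURCE);
  }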
37. Communication Modes
- Synchronous mode (MPI_Ssend): the send does not complete until a matching receive has begun
- Buffered mode (MPI_Bsend): the user supplies the buffer to the system
- Ready mode (MPI_Rsend): the user guarantees that a matching receive has been posted
- Non-blocking versions are MPI_Issend, MPI_Irsend, MPI_Ibsend
38. Miscellaneous Point-to-Point Commands
- MPI_Sendrecv
- MPI_Sendrecv_replace
- MPI_Cancel
- Used for buffered modes (sketch below):
  - MPI_Buffer_attach
  - MPI_Buffer_detach
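
A minimal buffered-mode sketch: the user attaches a buffer sized for the message plus MPI_BSEND_OVERHEAD, and MPI_Bsend then completes locally by copying into that buffer (sizes and the destination rank here are illustrative):

  /* Fragment: buffered-mode send of 1000 doubles to rank 1; needs <stdlib.h>. */
  double data[1000];
  int    bufsize = 1000 * sizeof(double) + MPI_BSEND_OVERHEAD;
  char  *buf     = (char *) malloc(bufsize);

  MPI_Buffer_attach(buf, bufsize);                // give MPI the user-supplied buffer
  MPI_Bsend(data, 1000, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
  MPI_Buffer_detach(&buf, &bufsize);              // blocks until the buffered data is sent
  free(buf);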
39. Collective Communication
- One to Many (Broadcast, Scatter)
- Many to One (Reduce, Gather)
- Many to Many (Allreduce, Allgather)
40. Broadcast and Barrier
- Any type of message can be sent; the size of the message should be known to all (example below)
- int MPI_Bcast ( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
  - buffer: pointer to message buffer
  - count: number of items sent
  - datatype: type of item sent
  - root: sending processor
  - comm: communicator within which broadcast takes place
  - Note: count and type should be the same on all processors
- Barrier: synchronization (broadcast without a message?)
  - int MPI_Barrier ( MPI_Comm comm )
41. Reduce
- Reverse of broadcast: all processors send to a single processor (example below)
- Several combining functions available:
  - MAX, MIN, SUM, PROD, LAND, BAND, LOR, BOR, LXOR, BXOR, MAXLOC, MINLOC
- int MPI_Reduce ( void *sendbuf, void *result, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm )
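
A minimal sketch of a sum reduction: every rank contributes its own rank number, and only the root (rank 0) ends up with the total:

  /* Fragment: assumes rank has been set with MPI_Comm_rank. */
  int local = rank;        // each process contributes its rank
  int total = 0;

  MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
      printf("sum of ranks = %d\n", total);   // result only valid on the root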
42. Scatter and Gather
- MPI_Scatter: the source (array) on the sending processor is spread to all processors
- MPI_Gather: the opposite of scatter; array locations at the receiver correspond to the ranks of the senders
- (example below)
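
A sketch of scatter followed by gather, assuming rank and p (the number of processes) are already set: the root distributes one integer to each rank, every rank modifies its element, and the root collects the results back in rank order:

  /* Fragment: assumes rank and p are set; needs <stdlib.h> for malloc. */
  int *sendbuf = NULL, *recvbuf = NULL;
  int  i, mine;

  if (rank == 0) {
      sendbuf = (int *) malloc(p * sizeof(int));
      recvbuf = (int *) malloc(p * sizeof(int));
      for (i = 0; i < p; i++)
          sendbuf[i] = i * 10;           // data to be distributed
  }

  // Each rank receives one element of sendbuf ...
  MPI_Scatter(sendbuf, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
  mine += rank;                          // every rank works on its piece

  // ... and the root collects the results back, ordered by sender rank
  MPI_Gather(&mine, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);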
43. Many-to-many Communication
- MPI_Allreduce
  - Syntax like reduce, except there is no root parameter
  - All nodes get the result
- MPI_Allgather
  - Syntax like gather, except there is no root parameter
  - All nodes get the resulting array
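
The all-to-all form is a one-line change from the reduce sketch above: drop the root argument, and every rank gets the total:

  /* Fragment: like MPI_Reduce above, but total is valid on every rank. */
  MPI_Allreduce(&local, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);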
44. Evaluating Parallel Programs
- MPI provides tools to evaluate the performance of parallel programs:
  - Timer
  - Profiling interface
- MPI_Wtime gives the wall-clock time (example below)
- MPI_WTIME_IS_GLOBAL can be used to check the synchronization of times for all the processes
- PMPI_... is an entry point for all routines and can be used for profiling
- The -mpilog option at compile time can be used to generate logfiles
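
Timing a code region with MPI_Wtime is just a pair of calls; a barrier before the start makes the measurement less sensitive to ranks arriving at different times:

  /* Fragment: wall-clock timing of a region of code; assumes rank is set. */
  double t0, t1;

  MPI_Barrier(MPI_COMM_WORLD);       // optional: line everyone up first
  t0 = MPI_Wtime();

  /* ... code being measured ... */

  t1 = MPI_Wtime();
  printf("rank %d: elapsed %f seconds\n", rank, t1 - t0);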
45. Recent Developments
- MPI-2
  - Dynamic process management
  - One-sided communication
  - Parallel file I/O
  - Extended collective operations
- MPI for Grids, e.g., MPICH-G, MPICH-G2
- Fault-tolerant MPI, e.g., Starfish, CoCheck
46. One-sided Operations
- One-sided: one worker performs the transfer of data (sketch below)
- Remote memory reads and writes
- Data can be accessed without waiting for other processes
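
A minimal MPI-2 one-sided sketch, assuming at least two ranks: every rank exposes one integer in a window, and rank 0 writes into rank 1's window without rank 1 posting a receive (the fence calls delimit the access epoch):

  /* Fragment: assumes rank is set and the job has at least 2 processes. */
  int     value  = -1;                 // memory exposed to remote access
  int     newval = 123;
  MPI_Win win;

  MPI_Win_create(&value, sizeof(int), sizeof(int),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  MPI_Win_fence(0, win);               // start access epoch
  if (rank == 0)
      MPI_Put(&newval, 1, MPI_INT, 1 /* target rank */, 0 /* displacement */,
              1, MPI_INT, win);
  MPI_Win_fence(0, win);               // end epoch; rank 1 now sees value == 123

  MPI_Win_free(&win);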
47. File Handling
- Similar to general programming languages
- Sample function calls (example below):
  - MPI_File_open
  - MPI_File_read
  - MPI_File_seek
  - MPI_File_write
  - MPI_File_set_size
- Non-blocking reads and writes are also possible:
  - MPI_File_iread
  - MPI_File_iwrite
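
A short MPI-IO sketch: each rank writes its own block of 10 integers into a shared file at a rank-dependent offset (the file name "out.dat" and block size are illustrative):

  /* Fragment: assumes rank is set; each rank writes 10 ints at its own offset. */
  MPI_File   fh;
  MPI_Status status;
  int        data[10], i;

  for (i = 0; i < 10; i++)
      data[i] = rank;                  // payload identifying the writer

  MPI_File_open(MPI_COMM_WORLD, "out.dat",
                MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
  MPI_File_write_at(fh, (MPI_Offset) rank * 10 * sizeof(int),
                    data, 10, MPI_INT, &status);
  MPI_File_close(&fh);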
48. C Datatypes
- MPI_CHAR: char
- MPI_BYTE: see standard (like unsigned char)
- MPI_SHORT: short
- MPI_INT: int
- MPI_LONG: long
- MPI_FLOAT: float
- MPI_DOUBLE: double
- MPI_UNSIGNED_CHAR: unsigned char
- MPI_UNSIGNED_SHORT: unsigned short
- MPI_UNSIGNED: unsigned int
- MPI_UNSIGNED_LONG: unsigned long
- MPI_LONG_DOUBLE: long double
49. mpiP
- A lightweight profiling library for MPI applications
- To use it in an application, simply add the -lmpiP flag to the compile script
- Determines how much time a program spends in MPI calls versus the rest of the application
- Shows which MPI calls are used most frequently
50. Jumpshot
- Graphical profiling tool for MPI
- Java-based
- Useful for determining communication patterns in an application
- Color-coded bars represent time spent in an MPI function
- Arrows denote message passing
- A single line denotes actual processing time
51. Summary
- The parallel computing community has cooperated to develop a full-featured standard message-passing library interface
- Several implementations are available
- Many applications are being developed or ported presently
- The MPI-2 process is beginning
- Lots of MPI material is available
- Very good facilities are available at the HCS Lab for MPI-based projects
  - The Zeta Cluster will be available for class projects
52. References
- [1] The Message Passing Interface (MPI) Standard, http://www-unix.mcs.anl.gov/mpi/
- [2] LAM/MPI Parallel Computing, http://www.lam-mpi.org
- [3] W. Gropp, "Tutorial on MPI: The Message-Passing Interface", http://www-unix.mcs.anl.gov/mpi/tutorial/gropp/talk.html
- [4] D. Culler and J. Singh, Parallel Computer Architecture: A Hardware/Software Approach
53. Fault-Tolerant Embedded MPI
54. Motivations
- MPI functionality is required for HPC space applications
  - De facto standard / parallel programming model in HPC
- Fault-tolerant extensions are needed for HPEC space systems
  - MPI is inherently fault-intolerant, an original design choice
- Existing HPC tools for MPI and fault-tolerant MPI
  - Good basis for ideas, API standards, etc.
  - Not readily amenable to HPEC platforms
- Focus on a lightweight fault-tolerant MPI for HPEC (FEMPI: Fault-tolerant Embedded Message Passing Interface)
  - Leverage prior work throughout the HPC community
  - Leverage prior work at UF on HPC with MPI
55. Primary Source of Failures in MPI
- Nature of failures
  - Individual processes of an MPI job crash (process failure)
  - Communication failure between two MPI processes (network failure)
- Behavior on failure
  - When a receiver node fails, the sender encounters a timeout on a blocking send call, as no matching receive is found, and returns an error
  - The whole communicator context crashes, and hence the entire MPI job
  - With N×N open TCP connections in many MPI implementations, the whole job crashes immediately on failure of any node
  - Applies to collective communication calls as well
- Avoid failure/crash of the entire application
  - Health status of nodes is provided by the failure detection service (via SR)
  - Check node status before communicating with another node, to avoid establishing communication with a dead process
  - If the receiver dies after the status check and before communication, then timeout-based recovery will be used
56. FEMPI Software Architecture
- Low-level communication is provided through FEMPI using Self-Reliant's DMS
- Heartbeating via SR and a process notification extension to the SRP enable FEMPI fault detection
- Application and FEMPI checkpointing make use of existing checkpointing libraries; checkpoint communication uses DMS
- The MPI Restore process on the System Controller is responsible for recovery decisions based on application policies
57. Fault Tolerance Actions
- Fault tolerance is provided through three stages:
  - Detection of a fault
  - Notification
  - Recovery
- Self-Reliant services are used to provide detection and notification capabilities
  - Heartbeats and other functionality are already provided in the API
  - The notification service is built as an extension to the FTM of JMS
- FEMPI will provide features to enable recovery of an application
- Employs reliable communications to reduce faults due to communication failure
  - Low-level communications are provided through Self-Reliant services (DMS) instead of directly over TCP/IP