Title: Message Passing Programming
1. Message Passing Programming
- Carl Tropper
- Department of Computer Science
2. Generalities
- Structure of message passing programs
- Asynchronous
- SPMD (single program, multiple data) model
- First look at building blocks
- Send and receive operations
- Blocking and non-blocking versions
- MPI (the standard) specifics
3. Send and receive operations
- send(void *sendbuf, int nelems, int dest)
- receive(void *recvbuf, int nelems, int source)
- nelems = number of elements to be sent/received
- Example: P0 sends data to P1

    P0                     P1
    a = 100;               receive(&a, 1, 0);
    send(&a, 1, 1);        printf("%d\n", a);
    a = 0;

- Good semantics: P1 receives 100
- Bad semantics: P1 receives 0
- Can happen because the DMA and communication hardware may return before 100 is actually sent
4. Blocking message passing operations
- Handshake:
- Sender asks to send, receiver agrees to receive
- Sender sends, receiver receives
- Implemented without buffers
5. Deadlocks in blocking, non-buffered send/receive

    P0                       P1
    send(&b, 1, 1);          send(&b, 1, 0);
    receive(&a, 1, 1);       receive(&a, 1, 0);

- Both sends wait for both receives: DEADLOCK
- Can cure this deadlock by reversing the send and receive ops (e.g. in P1). Ugh
6. Send/receive: blocking buffered
- Buffers used at sender and receiver
- Dedicated comm hardware at both ends
- If the sender has no buffer but the receiver does, this can still be made to work
7. The impact of finite buffer space

    P0                              P1
    for (i = 0; i < 1000; i++) {    for (i = 0; i < 1000; i++) {
        produce_data(&a);               receive(&a, 1, 0);
        send(&a, 1, 1);                 consume_data(&a);
    }                               }

- If the consumer consumes more slowly than the producer produces, the buffers fill up and the sender is eventually forced to block
8. Deadlocks in buffered send/receive

    P0                       P1
    receive(&a, 1, 1);       receive(&a, 1, 0);
    send(&b, 1, 1);          send(&b, 1, 0);

- The receive operation still blocks, so deadlock can happen
- Moral of the story: still have to be careful to avoid deadlocks!
9. Non-blocking optimizations
- Blocking is safe but wastes time
- Alternative: use non-blocking operations plus a check-status operation
- The process is free to perform any operation which does not depend upon completion of the send or receive
- Once the transfer is complete, the data can be used
10. Non-blocking optimization (figure)
11. Possibilities (figure)
12. MPI
- Vendors all had their own message passing libraries
- Enter MPI: the standard for C and Fortran
- Defines the syntax and semantics of a core set of library routines (125 are defined)
13. Core set of MPI routines
- MPI_Init        Initializes MPI
- MPI_Finalize    Terminates MPI
- MPI_Comm_size   Determines the number of processes
- MPI_Comm_rank   Determines the label of the calling process
- MPI_Send        Sends a message
- MPI_Recv        Receives a message
14. Starting and terminating MPI
- int MPI_Init(int *argc, char ***argv)
- int MPI_Finalize()
- MPI_Init is called prior to other MPI routines; it initializes the MPI environment
- MPI_Finalize is called at the end; it does clean-up
- The return code for both is MPI_SUCCESS
- mpi.h contains MPI constants and data structures
15. Communicators
- Communication domain: processes which communicate with one another
- Communicators are variables of type MPI_Comm. They store information about communication domains
- MPI_COMM_WORLD: the default communicator, containing all processes in the program
16. Communicators
- int MPI_Comm_size(MPI_Comm comm, int *size)
- int MPI_Comm_rank(MPI_Comm comm, int *rank)
- MPI_Comm_size returns the number of processes in the communicator
- The rank identifies each process; ranks run from 0 to size - 1
17. Hello world

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int npes, myrank;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &npes);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        printf("From process %d out of %d, Hello World!\n", myrank, npes);
        MPI_Finalize();
        return 0;
    }

- Prints hello world from each process
18. Sending/receiving messages
- int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
- int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
- MPI_Send sends the data in buf: count entries of type datatype
- The length of the message is specified as a number of entries, not as a number of bytes, for portability
- dest = rank of the destination process, tag = type of message
- MPI_ANY_SOURCE: any process can be the source
- MPI_ANY_TAG: same for the tag
- For MPI_Recv, buf is where the received message is stored
- count and datatype specify the length of the buffer (see the sketch below)
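A minimal sketch of a send/receive pair, assuming exactly two processes (e.g. mpirun -np 2); the payload value and tag are illustrative:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int myrank, value = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

        if (myrank == 0) {
            value = 42;                                  /* illustrative payload */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (myrank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("Process 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }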
19. Datatypes
- MPI Datatype C Datatype
- MPI_CHAR signed char
- MPI_SHORT signed short int
- MPI_INT signed int
- MPI_LONG signed long int
- MPI_UNSIGNED_CHAR unsigned char
- MPI_UNSIGNED_SHORT unsigned short int
- MPI_UNSIGNED unsigned int
- MPI_UNSIGNED_LONG unsigned long int
- MPI_FLOAT float
- MPI_DOUBLE double
- MPI_LONG_DOUBLE long double
- MPI_BYTE (no corresponding C datatype)
- MPI_PACKED (no corresponding C datatype)
20. Sending/receiving
- The status variable is used to get information about the Recv operation
- In C the status is stored in an MPI_Status struct:

    typedef struct MPI_Status {
        int MPI_SOURCE;
        int MPI_TAG;
        int MPI_ERROR;
    };

- int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count) returns the number of received entries in the count variable (see the sketch below)
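A sketch of reading the status fields after a wildcard receive, assuming two processes; the buffer size (100), message length (10), and tag are arbitrary:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int myrank, count, data[100] = {0};
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

        if (myrank == 0) {
            /* Accept a message from any process, with any tag. */
            MPI_Recv(data, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_INT, &count);
            printf("Got %d ints from rank %d, tag %d\n",
                   count, status.MPI_SOURCE, status.MPI_TAG);
        } else if (myrank == 1) {
            /* Send fewer entries than the receiver's capacity. */
            MPI_Send(data, 10, MPI_INT, 0, 7, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }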
21. Sending/receiving
- MPI_Recv is a blocking receive op: it returns only after the message is in the buffer
- MPI_Send has two possible implementations:
- Returns after the matching MPI_Recv is issued and the message is sent
- Returns after MPI_Send has copied the message into a buffer; does not wait for MPI_Recv to be issued
22. Avoiding deadlocks
- Process 0 sends 2 messages to process 1, which receives them in reverse order

    int a[10], b[10], myrank;
    MPI_Status status;
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
        MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
    }
    else if (myrank == 1) {
        MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
        MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
    }
    ...

- If MPI_Send is implemented by blocking until the matching receive is issued, then process 0 blocks in its first send waiting for a receive with tag 1, while process 1 blocks in its first receive waiting for a send with tag 2. Each waits for the other. Deadlock
- Solution: the programmer has to match the order in which sends and receives are issued. Ugh!
23. Circular deadlock
- Process i sends a message to process i + 1 and receives a message from process i - 1 (mod npes)

    int a[10], b[10], npes, myrank;
    MPI_Status status;
    ...
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Send(a, 10, MPI_INT, (myrank + 1) % npes, 1, MPI_COMM_WORLD);
    MPI_Recv(b, 10, MPI_INT, (myrank - 1 + npes) % npes, 1, MPI_COMM_WORLD, &status);
    ...

- Deadlock if MPI_Send is blocking
- Works if MPI_Send is implemented using buffering
- The same pattern with just two processes sending to each other also deadlocks when sends block
24. Break the circle
- Break the circle into odd and even processes
- Odds first send and then receive
- Evens first receive and then send

    int a[10], b[10], npes, myrank;
    MPI_Status status;
    ...
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank % 2 == 1) {
        MPI_Send(a, 10, MPI_INT, (myrank + 1) % npes, 1, MPI_COMM_WORLD);
        MPI_Recv(b, 10, MPI_INT, (myrank - 1 + npes) % npes, 1, MPI_COMM_WORLD, &status);
    }
    else {
        MPI_Recv(b, 10, MPI_INT, (myrank - 1 + npes) % npes, 1, MPI_COMM_WORLD, &status);
        MPI_Send(a, 10, MPI_INT, (myrank + 1) % npes, 1, MPI_COMM_WORLD);
    }
    ...
25. Break the circle, part II
- A simultaneous send/receive operation:

    int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                     int dest, int sendtag,
                     void *recvbuf, int recvcount, MPI_Datatype recvdatatype,
                     int source, int recvtag,
                     MPI_Comm comm, MPI_Status *status)

- Problem: the send and receive need to use disjoint buffers
- Solution: the MPI_Sendrecv_replace function; the received data replaces the sent data in the same buffer (see the sketch below)

    int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype,
                             int dest, int sendtag,
                             int source, int recvtag,
                             MPI_Comm comm, MPI_Status *status)
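A sketch of the ring exchange from the previous slides redone with MPI_Sendrecv_replace; no odd/even ordering trick is needed because the combined call cannot deadlock against itself:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int npes, myrank, token;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &npes);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

        token = myrank;   /* each process starts with its own rank */

        /* Send to the right neighbor and receive from the left neighbor
           in one call; the received value overwrites token. */
        MPI_Sendrecv_replace(&token, 1, MPI_INT,
                             (myrank + 1) % npes, 1,           /* dest, sendtag */
                             (myrank - 1 + npes) % npes, 1,    /* source, recvtag */
                             MPI_COMM_WORLD, &status);

        printf("Process %d now holds %d\n", myrank, token);
        MPI_Finalize();
        return 0;
    }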
26. Topologies and embedding
- MPI sees processes arranged linearly, while parallel programs communicate naturally in higher dimensions
- Need to map the linear ordering to these topologies
- Several mappings are possible
27. Solution
- MPI helps the programmer arrange processes in topologies by supplying library routines
- The mapping to processors is done by the libraries without programmer intervention
28. Cartesian topologies
- Arbitrary topologies can be specified, but most topologies are grid-like (Cartesian)
- MPI_Cart_create takes the processes in comm_old and builds a virtual process topology:

    int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods,
                        int reorder, MPI_Comm *comm_cart)

- The new topology information is in comm_cart
- All processes belonging to comm_old need to call MPI_Cart_create
- ndims = number of dimensions, dims = size of each dimension
- The array periods specifies whether there are wraparound connections: periods[i] is true if there is a wrap in dimension i
- reorder = true allows MPI to reorder the processes (see the sketch below)
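A sketch of creating a 2D wraparound (torus) topology; the 4x4 shape is an assumption, so it expects 16 processes:

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm comm_2d;
        int dims[2]    = {4, 4};   /* assumed 4x4 grid: run with 16 processes */
        int periods[2] = {1, 1};   /* wraparound in both dimensions (a torus) */

        MPI_Init(&argc, &argv);

        /* reorder = 1 lets MPI renumber ranks to match the physical machine. */
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm_2d);

        /* ... communicate using comm_2d ... */

        MPI_Comm_free(&comm_2d);
        MPI_Finalize();
        return 0;
    }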
29. Process naming
- Sources and destinations are specified by ranks in MPI
- MPI_Cart_rank takes coordinates in the array coords and returns the rank
- MPI_Cart_coord takes the rank of a process and returns its Cartesian coordinates in the array coords (maxdims is the dimension of the coords array)

    int MPI_Cart_coord(MPI_Comm comm_cart, int rank, int maxdims, int *coords)
    int MPI_Cart_rank(MPI_Comm comm_cart, int *coords, int *rank)
30. Shifting
- Want to shift data along a dimension of the topology?

    int MPI_Cart_shift(MPI_Comm comm_cart, int dir, int s_step,
                       int *rank_source, int *rank_dest)

- dir = dimension of the shift
- s_step = size of the shift (see the sketch below)
31. Overlapping communication with computation
- Blocking sends/receives do not permit overlap; need non-blocking functions
- MPI_Isend starts a send, but returns before it is complete
- MPI_Irecv starts a receive, but returns before the data is received
- MPI_Test tests whether a non-blocking operation has completed
- MPI_Wait waits until a non-blocking operation finishes (don't say it)
32. More non-blocking

    int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest,
                  int tag, MPI_Comm comm, MPI_Request *request)
    int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
                  int tag, MPI_Comm comm, MPI_Request *request)

- Both allocate a request object and return a handle to it in request
- The object is used as an argument by MPI_Test and MPI_Wait to identify the op whose status
- we want to query or
- we want to wait for

    int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)

- flag is set to true if the op has finished

    int MPI_Wait(MPI_Request *request, MPI_Status *status)
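A sketch of the overlap idea, assuming two processes: start the transfer, do unrelated work, then wait before touching the buffer:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int myrank, a = 0, other = 0;
        MPI_Request request;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

        if (myrank == 0) {
            a = 100;
            MPI_Isend(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
            /* ... compute on anything that does NOT touch a ... */
            MPI_Wait(&request, &status);   /* a may be reused after this */
        } else if (myrank == 1) {
            MPI_Irecv(&other, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
            /* ... compute on anything that does NOT read other ... */
            MPI_Wait(&request, &status);   /* other now holds the data */
            printf("Received %d\n", other);
        }

        MPI_Finalize();
        return 0;
    }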
33. Avoiding deadlocks
- Using non-blocking operations removes most deadlocks
- The following code is not safe:

    int a[10], b[10], myrank;
    MPI_Status status;
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
        MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
    }
    else if (myrank == 1) {
        MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
        MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
    }

- Replacing either the send or the receive operations with non-blocking counterparts fixes this deadlock (sketch below)
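A sketch of the repaired version: process 1 posts both receives with MPI_Irecv, so neither blocks while the sends arrive, and then waits for both with MPI_Waitall (another standard MPI call):

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int a[10] = {0}, b[10] = {0}, myrank;   /* contents elided */
        MPI_Request requests[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

        if (myrank == 0) {
            MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
            MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
        } else if (myrank == 1) {
            /* Post both receives immediately; the order no longer matters. */
            MPI_Irecv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &requests[0]);
            MPI_Irecv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &requests[1]);
            MPI_Waitall(2, requests, MPI_STATUSES_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }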
34. Collective ops: communication and computation
- Collective communication ops (broadcast, reduction, etc.) are implemented by MPI
- All of the ops take a communicator argument which defines the group of processes involved in the op
- The ops don't act as barriers: a process can go past the call without waiting for the other processes, but counting on that is not a great idea
35. The collective ops
- The barrier synchronization operation:

    int MPI_Barrier(MPI_Comm comm)

- The call returns after all processes have called the function
- The one-to-all broadcast operation:

    int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int source,
                  MPI_Comm comm)

- source sends the data in buf to all processes in the group
- The all-to-one reduction operation:

    int MPI_Reduce(void *sendbuf, void *recvbuf, int count,
                   MPI_Datatype datatype, MPI_Op op, int target, MPI_Comm comm)

- Combines the elements in sendbuf of each process using op and returns the combined values in recvbuf of the process with rank target
- If count is more than one, op is applied to each element (see the sketch below)
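A sketch combining the two collectives: rank 0 broadcasts a parameter, every process computes a local contribution, and the contributions are summed back on rank 0 (the parameter value and local computation are stand-ins):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int myrank, n = 0, local, total;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

        if (myrank == 0)
            n = 1000;                  /* illustrative parameter */

        /* Everyone gets n from rank 0. */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        local = n + myrank;            /* stand-in for a real local computation */

        /* Sum the local values; only rank 0 receives the result. */
        MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (myrank == 0)
            printf("Total = %d\n", total);

        MPI_Finalize();
        return 0;
    }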
36. Pre-defined reduction operations

    Operation    Meaning                  Allowed datatypes
    MPI_MAX      Maximum                  C integers and floating point
    MPI_MIN      Minimum                  C integers and floating point
    MPI_SUM      Sum                      C integers and floating point
    MPI_PROD     Product                  C integers and floating point
    MPI_LAND     Logical AND              C integers
    MPI_BAND     Bit-wise AND             C integers and byte
    MPI_LOR      Logical OR               C integers
    MPI_BOR      Bit-wise OR              C integers and byte
    MPI_LXOR     Logical XOR              C integers
    MPI_BXOR     Bit-wise XOR             C integers and byte
    MPI_MAXLOC   Max value and location   Data pairs
    MPI_MINLOC   Min value and location   Data pairs
37. More reduction
- The operation MPI_MAXLOC combines pairs of values (vi, li) and returns the pair (v, l) such that v is the maximum among all the vi and l is the corresponding li (if there is more than one maximum, the smallest such li)
- MPI_MINLOC does the same, except for the minimum value of vi
- It is possible to define your own ops
38. Reduction
- MPI datatypes are needed for the data pairs used with MPI_MAXLOC and MPI_MINLOC
- MPI_2INT corresponds to the C datatype pair of ints
- The MPI_Allreduce op returns the result to all processes (see the sketch below):

    int MPI_Allreduce(void *sendbuf, void *recvbuf, int count,
                      MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
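A sketch of an all-reduce over MPI_2INT pairs with MPI_MAXLOC: every process learns the maximum value and the rank that holds it (the per-process values are made up):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int myrank;
        struct { int value; int rank; } in, out;   /* matches MPI_2INT layout */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

        in.value = (myrank * 37) % 11;   /* illustrative per-process value */
        in.rank  = myrank;               /* the "location" half of the pair */

        MPI_Allreduce(&in, &out, 1, MPI_2INT, MPI_MAXLOC, MPI_COMM_WORLD);

        printf("Process %d: max is %d at rank %d\n", myrank, out.value, out.rank);
        MPI_Finalize();
        return 0;
    }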
39. Prefix sum
- The prefix sum op is done via MPI_Scan: store the partial sum up to node i on node i

    int MPI_Scan(void *sendbuf, void *recvbuf, int count,
                 MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

- In the end, the receive buffer of the process with rank i stores the reduction of the send buffers of nodes 0 to i (see the sketch below)
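A sketch of a prefix sum over the ranks themselves: process i ends up with 0 + 1 + ... + i in its receive buffer:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int myrank, prefix;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

        /* Inclusive scan: prefix on rank i is the sum over ranks 0..i. */
        MPI_Scan(&myrank, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        printf("Process %d: prefix sum = %d\n", myrank, prefix);
        MPI_Finalize();
        return 0;
    }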
40. Gather ops
- The gather operation is performed in MPI using

    int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                   void *recvbuf, int recvcount, MPI_Datatype recvdatatype,
                   int target, MPI_Comm comm)

- Each process sends the data in sendbuf to target
- The data is stored in recvbuf in rank order: data from process i is stored at offset i * sendcount of recvbuf
- MPI also provides the MPI_Allgather function, in which the data are gathered at all the processes:

    int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                      void *recvbuf, int recvcount, MPI_Datatype recvdatatype,
                      MPI_Comm comm)

- These ops assume that all of the arrays are the same size; there are versions of these routines which allow different-sized arrays (see the sketch below)
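A sketch of a gather with sendcount = 1: every process contributes its rank, and rank 0 receives them in rank order (only the target needs to allocate recvbuf):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char *argv[])
    {
        int npes, myrank, *all = NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &npes);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

        if (myrank == 0)
            all = malloc(npes * sizeof(int));   /* recvbuf on the target only */

        MPI_Gather(&myrank, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);

        if (myrank == 0) {
            for (int i = 0; i < npes; i++)
                printf("slot %d holds %d\n", i, all[i]);
            free(all);
        }

        MPI_Finalize();
        return 0;
    }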
41. Scatter op
- MPI_Scatter:

    int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                    void *recvbuf, int recvcount, MPI_Datatype recvdatatype,
                    int source, MPI_Comm comm)

- The source process sends a different part of sendbuf to each process. The received data is stored in recvbuf
- A version of MPI_Scatter allows different amounts of data to be sent to different processes
42. All-to-all op
- The all-to-all personalized communication operation is performed by

    int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                     void *recvbuf, int recvcount, MPI_Datatype recvdatatype,
                     MPI_Comm comm)

- Each process sends a different part of sendbuf to each other process: process i receives the sendcount elements starting at offset i * sendcount
- The received data is stored in the recvbuf array
- A vector variant exists which allows different amounts of data to be sent
43. Groups and communicators
- Might want to split a group of processes into subgroups

    int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)

- Has to be called by all processes in the group
- Partitions the processes in communicator comm into disjoint subgroups
- color and key are input parameters
- color defines the subgroups
- key defines the rank within a subgroup
- A new communicator is returned for each group in the newcomm parameter (see the sketch below)
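A sketch splitting MPI_COMM_WORLD into row subgroups, assuming a logical grid four processes wide; color selects the row and key (the old rank) orders processes within it:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int myrank, rowrank;
        MPI_Comm rowcomm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

        /* Processes with the same color land in the same subgroup;
           within it, key determines the new rank order. */
        MPI_Comm_split(MPI_COMM_WORLD, myrank / 4, myrank, &rowcomm);

        MPI_Comm_rank(rowcomm, &rowrank);
        printf("World rank %d -> row %d, row rank %d\n",
               myrank, myrank / 4, rowrank);

        MPI_Comm_free(&rowcomm);
        MPI_Finalize();
        return 0;
    }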
44. MPI_Comm_split (figure)
45. Splitting Cartesian topologies
- MPI_Cart_sub splits a Cartesian topology into smaller topologies:

    int MPI_Cart_sub(MPI_Comm comm_cart, int *keep_dims, MPI_Comm *comm_subcart)

- The array keep_dims tells us how to break up the topology
- The original topology is stored in comm_cart; comm_subcart stores the new topologies
46. Splitting Cartesian topologies
- The array keep_dims tells us how: if keep_dims[i] is true, then the ith dimension is retained in the new sub-topology (see the sketch below)
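A sketch of this rule on an assumed 4x4 grid (16 processes): keep_dims = {0, 1} drops dimension 0 and keeps dimension 1, so each process lands in a communicator for its own row:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm comm_2d, comm_row;
        int dims[2] = {4, 4}, periods[2] = {0, 0};   /* assumed 4x4 grid */
        int keep_dims[2] = {0, 1};   /* keep only dimension 1: one comm per row */
        int rowrank;

        MPI_Init(&argc, &argv);
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm_2d);

        /* Each process gets the sub-topology containing its own row. */
        MPI_Cart_sub(comm_2d, keep_dims, &comm_row);
        MPI_Comm_rank(comm_row, &rowrank);
        printf("Rank within my row communicator: %d\n", rowrank);

        MPI_Comm_free(&comm_row);
        MPI_Comm_free(&comm_2d);
        MPI_Finalize();
        return 0;
    }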