Title: MPI
1. MPI: Message Passing Interface
- Source: http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
2. Message Passing Principles
- Explicit communication and synchronization
- Programming complexity is high
- But widely popular
- More control for the programmer
3. MPI Introduction
- A standard for explicit message passing in MIMD machines.
- Need for a standard
  >> portability
  >> for hardware vendors
  >> for widespread use of concurrent computers
- Started in April 1992; MPI Forum in 1993; 1st MPI standard in May 1994.
4. MPI contains
- Point-Point (1.1)
- Collectives (1.1)
- Communication contexts (1.1)
- Process topologies (1.1)
- Profiling interface (1.1)
- I/O (2)
- Dynamic process groups (2)
- One-sided communications (2)
- Extended collectives (2)
- About 125 functions; mostly only 6 are used.
5. MPI Implementations
- OpenMPI
- MPICH (Argonne National Lab)
- LAM-MPI (Ohio, Notre Dame, Bloomington)
- Cray, IBM, SGI
- MPI-FM (Illinois)
- MPI / Pro (MPI Software Tech.)
- Sca MPI (Scali AS)
- Plenty of others
6. Communication Primitives
- Communication scope
- Point-point communications
- Collective communications
7. Point-Point Communications: send and recv
- MPI_SEND(buf, count, datatype, dest, tag, comm)
  - buf, count, datatype: the message
  - dest: rank of the destination
  - tag: message identifier
  - comm: communication context
- MPI_RECV(buf, count, datatype, source, tag, comm, status)
- MPI_GET_COUNT(status, datatype, count)
8. A Simple Example
  comm = MPI_COMM_WORLD;
  MPI_Comm_rank(comm, &rank);
  for(i=0; i<n; i++) a[i] = 0;
  if(rank == 0){
      MPI_Send(a+n/2, n/2, MPI_INT, 1, tag, comm);
  }
  else{  /* rank == 1 */
      MPI_Recv(b, n/2, MPI_INT, 0, tag, comm, &status);
  }
  /* process array a */
  /* do reverse communication */
9. Communication Scope
- Explicit communications
- Each communication is associated with a communication scope
- A process is defined by
  - Group
  - Rank within a group
- A message is labeled by
  - Message context
  - Message tag
- A communication handle called a Communicator defines the scope
10. Communicator
- Communicator represents the communication domain
- Helps in the creation of process groups
- Can be intra or inter (more later)
- Default communicator MPI_COMM_WORLD includes all processes
- Wild cards
  - The receiver's source and tag fields can be wild-carded: MPI_ANY_SOURCE, MPI_ANY_TAG
11. Buffering and Safety
- The previous send and receive are blocking; buffering mechanisms can come into play.
- Safe buffering:

  Process 0: MPI_Send; MPI_Recv; ...
  Process 1: MPI_Recv; MPI_Send; ...
  => OK

  Process 0: MPI_Recv; MPI_Send; ...
  Process 1: MPI_Recv; MPI_Send; ...
  => Leads to deadlock (both wait to receive first)

  Process 0: MPI_Send; MPI_Recv; ...
  Process 1: MPI_Send; MPI_Recv; ...
  => May or may not succeed, depending on buffering. Unsafe.
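A minimal sketch of the "OK" ordering above (not from the original slides; assumes exactly two processes): rank 0 sends first, rank 1 receives first, so no cycle of waiting can form.

  #include <mpi.h>
  int main(int argc, char **argv)
  {
      int rank, x = 42, y;
      MPI_Status status;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* send first */
          MPI_Recv(&y, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status); /* then receive */
      } else if (rank == 1) {
          MPI_Recv(&y, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); /* receive first */
          MPI_Send(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);          /* then send */
      }
      MPI_Finalize();
      return 0;
  }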
12. Non-blocking Communications
- A post of a send or recv operation followed by a complete of the operation
- MPI_ISEND(buf, count, datatype, dest, tag, comm, request)
- MPI_IRECV(buf, count, datatype, source, tag, comm, request)
- MPI_WAIT(request, status)
- MPI_TEST(request, flag, status)
- MPI_REQUEST_FREE(request)
13. Non-blocking
- A post-send returns before the message is copied out of the send buffer
- A post-recv returns before data is copied into the recv buffer
- Non-blocking calls consume space
- Efficiency depends on the implementation
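A minimal sketch (not from the original slides; assumes two processes): post the operation, overlap it with other work, then complete it with MPI_Wait before touching the buffer.

  #include <mpi.h>
  int main(int argc, char **argv)
  {
      int rank, out = 1, in = 0;
      MPI_Request req;
      MPI_Status status;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          MPI_Isend(&out, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req); /* post send */
          /* ... useful work that does not touch 'out' ... */
          MPI_Wait(&req, &status);  /* complete: 'out' may now be reused */
      } else if (rank == 1) {
          MPI_Irecv(&in, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);  /* post recv */
          /* ... useful work that does not touch 'in' ... */
          MPI_Wait(&req, &status);  /* complete: 'in' now holds the message */
      }
      MPI_Finalize();
      return 0;
  }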
14. Other Non-blocking Communications
- MPI_WAITANY(count, array_of_requests, index, status)
- MPI_TESTANY(count, array_of_requests, index, flag, status)
- MPI_WAITALL(count, array_of_requests, array_of_statuses)
- MPI_TESTALL(count, array_of_requests, flag, array_of_statuses)
- MPI_WAITSOME(incount, array_of_requests, outcount, array_of_indices, array_of_statuses)
- MPI_TESTSOME(incount, array_of_requests, outcount, array_of_indices, array_of_statuses)
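A sketch of MPI_WAITALL (not from the original slides): every process exchanges its rank with every other process, posting all sends and receives up front and completing them in one call.

  #include <mpi.h>
  #include <stdlib.h>
  int main(int argc, char **argv)
  {
      int rank, size, i, n = 0;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      int *in  = malloc(size * sizeof(int));
      int *out = malloc(size * sizeof(int));
      MPI_Request *reqs  = malloc(2 * size * sizeof(MPI_Request));
      MPI_Status  *stats = malloc(2 * size * sizeof(MPI_Status));
      for (i = 0; i < size; i++) {
          if (i == rank) continue;
          out[i] = rank;  /* one private send buffer per pending send */
          MPI_Irecv(&in[i],  1, MPI_INT, i, 0, MPI_COMM_WORLD, &reqs[n++]);
          MPI_Isend(&out[i], 1, MPI_INT, i, 0, MPI_COMM_WORLD, &reqs[n++]);
      }
      MPI_Waitall(n, reqs, stats);  /* blocks until every posted operation completes */
      MPI_Finalize();
      return 0;
  }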
15. Buffering and Safety
  Process 0: MPI_Send(1); MPI_Send(2); ...
  Process 1: MPI_Irecv(2); MPI_Irecv(1); ...
  => Safe

  Process 0: MPI_Isend; MPI_Recv; ...
  Process 1: MPI_Isend; MPI_Recv; ...
  => Safe
16. Communication Modes
17. Collective Communications
18. Example: Matrix-Vector Multiply
[Figure: x = A * b, with the rows of A and the elements of b and x distributed across processes]
- Communication: all processes should gather all elements of b.
19. Collective Communications: AllGather
[Figure: AllGather. Before: process i holds only Ai. After: every process holds A0, A1, A2, A3, A4]
MPI_ALLGATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
MPI_ALLGATHERV(sendbuf, sendcount, sendtype, recvbuf, array_of_recvcounts, array_of_displs, recvtype, comm)
20. Example: Row-wise Matrix-Vector Multiply
  MPI_Comm_size(comm, &size);
  MPI_Comm_rank(comm, &rank);
  nlocal = n/size;
  MPI_Allgather(local_b, nlocal, MPI_DOUBLE, b, nlocal, MPI_DOUBLE, comm);
  for(i=0; i<nlocal; i++){
      x[i] = 0.0;
      for(j=0; j<n; j++)
          x[i] += a[i*n+j]*b[j];
  }
21. Example: Column-wise Matrix-Vector Multiply
[Figure: x = A * b, with the columns of A distributed across processes]
- Dot-products corresponding to each element of x will be parallelized
- Steps:
  1. Each process computes its contribution to x
  2. Contributions from all processes are added and stored in the appropriate process
22. Example: Column-wise Matrix-Vector Multiply
  MPI_Comm_size(comm, &size);
  MPI_Comm_rank(comm, &rank);
  nlocal = n/size;
  /* Compute partial dot-products */
  for(i=0; i<n; i++){
      px[i] = 0.0;
      for(j=0; j<nlocal; j++)
          px[i] += a[i*nlocal+j]*b[j];
  }
23. Collective Communications: Reduce, Allreduce
[Figure: Reduce. Processes 0, 1, 2 hold (A0, A1, A2), (B0, B1, B2), (C0, C1, C2); after Reduce the root holds (A0+B0+C0, A1+B1+C1, A2+B2+C2)]
MPI_REDUCE(sendbuf, recvbuf, count, datatype, op, root, comm)
[Figure: Allreduce. Same input, but every process ends up with (A0+B0+C0, A1+B1+C1, A2+B2+C2)]
MPI_ALLREDUCE(sendbuf, recvbuf, count, datatype, op, comm)
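A minimal sketch (not from the original slides): summing the ranks with both calls; only the root gets the result with MPI_Reduce, every process does with MPI_Allreduce.

  #include <mpi.h>
  #include <stdio.h>
  int main(int argc, char **argv)
  {
      int rank, sum = 0, gsum = 0;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);  /* result at root only */
      if (rank == 0) printf("sum of ranks = %d\n", sum);
      MPI_Allreduce(&rank, &gsum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD); /* result everywhere */
      MPI_Finalize();
      return 0;
  }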
24. Collective Communications: Scatter, Gather
[Figure: Scatter distributes the root's blocks A0..A4, one per process; Gather is the inverse, collecting one block from each process at the root]
MPI_SCATTER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
MPI_SCATTERV(sendbuf, array_of_sendcounts, array_of_displs, sendtype, recvbuf, recvcount, recvtype, root, comm)
MPI_GATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
MPI_GATHERV(sendbuf, sendcount, sendtype, recvbuf, array_of_recvcounts, array_of_displs, recvtype, root, comm)
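A minimal round-trip sketch (not from the original slides): the root scatters one element to each process, each process updates its piece, and the root gathers the results back.

  #include <mpi.h>
  #include <stdlib.h>
  int main(int argc, char **argv)
  {
      int rank, size, i, mine;
      int *all = NULL;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      if (rank == 0) {                       /* only the root owns the full array */
          all = malloc(size * sizeof(int));
          for (i = 0; i < size; i++) all[i] = i * i;
      }
      MPI_Scatter(all, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
      mine += 1;                             /* each process works on its piece */
      MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);
      MPI_Finalize();
      return 0;
  }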
25. Example: Column-wise Matrix-Vector Multiply
  /* Summing the dot-products */
  MPI_Reduce(px, fx, n, MPI_DOUBLE, MPI_SUM, 0, comm);
  /* Now all values of x are stored in process 0. Need to scatter them */
  MPI_Scatter(fx, nlocal, MPI_DOUBLE, x, nlocal, MPI_DOUBLE, 0, comm);
26. Or
  for(i=0; i<size; i++)
      MPI_Reduce(px+i*nlocal, x, nlocal, MPI_DOUBLE, MPI_SUM, i, comm);
27. Collective Communications
- Only blocking; only standard mode; no tags
- Simple variant or vector variant
- Some collectives have roots
- Different types:
  - One-to-all
  - All-to-one
  - All-to-all
28. Collective Communications - Barrier
MPI_BARRIER(comm)
- A return from the barrier in one process tells that process that the other processes have entered the barrier.
29. Collective Communications - Broadcast
[Figure: the root's value A is copied to every process]
MPI_BCAST(buffer, count, datatype, root, comm)
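A minimal sketch (not from the original slides): the root fills in parameters and broadcasts them; after the call every process has the same values.

  #include <mpi.h>
  int main(int argc, char **argv)
  {
      int rank, params[3] = {0, 0, 0};
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) { params[0] = 100; params[1] = 7; params[2] = 3; } /* root's data */
      MPI_Bcast(params, 3, MPI_INT, 0, MPI_COMM_WORLD);                 /* copied to all */
      MPI_Finalize();
      return 0;
  }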
30. Collective Communications: AlltoAll
[Figure: AlltoAll. Process 0 starts with (A0..A4), process 1 with (B0..B4), and so on; afterwards process 0 holds (A0, B0, C0, D0, E0), process 1 holds (A1, B1, C1, D1, E1), etc.: a transpose of blocks across processes]
MPI_ALLTOALL(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
MPI_ALLTOALLV(sendbuf, array_of_sendcounts, array_of_sdispls, sendtype, recvbuf, array_of_recvcounts, array_of_rdispls, recvtype, comm)
31. Collective Communications: ReduceScatter, Scan
[Figure: ReduceScatter. Inputs as in Reduce; the element-wise sums (A0+B0+C0, A1+B1+C1, A2+B2+C2) are scattered, one per process]
MPI_REDUCE_SCATTER(sendbuf, recvbuf, array_of_recvcounts, datatype, op, comm)
[Figure: Scan. Process 0 keeps (A0, A1, A2); process 1 gets (A0+B0, A1+B1, A2+B2); process 2 gets (A0+B0+C0, A1+B1+C1, A2+B2+C2): an inclusive prefix reduction]
MPI_SCAN(sendbuf, recvbuf, count, datatype, op, comm)
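A minimal sketch (not from the original slides): an inclusive prefix sum of the ranks, so process i receives 0 + 1 + ... + i.

  #include <mpi.h>
  int main(int argc, char **argv)
  {
      int rank, prefix;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Scan(&rank, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD); /* inclusive scan */
      MPI_Finalize();
      return 0;
  }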
33. Communicators
- For logical division of processes
- For forming communication contexts and avoiding message conflicts
- A communicator specifies a communication domain for communications
- Can be:
  - Intra: used for communicating within a single group of processes
  - Inter: used for communication between two disjoint groups of processes
- Default communicators: MPI_COMM_WORLD, MPI_COMM_SELF
34. Groups
- An ordered set of processes
- New groups are derived from base groups
- A group is represented by a communicator
- The group associated with MPI_COMM_WORLD is the first base group
- New groups can be created with unions, intersections, and differences of existing groups
- Functions provided for obtaining sizes, ranks
35. Communicator Functions
- MPI_COMM_DUP(comm, newcomm)
- MPI_COMM_CREATE(comm, group, newcomm)
- MPI_GROUP_INCL(group, n, ranks, newgroup)
- MPI_COMM_GROUP(comm, group)
- MPI_COMM_SPLIT(comm, color, key, newcomm)
[Figure: an MPI_COMM_SPLIT example partitioning the processes into the groups {F,G,A,D}, {E,I,C} and {h}]
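A sketch combining the group calls above (not from the original slides): build a communicator containing only the even-ranked processes.

  #include <mpi.h>
  #include <stdlib.h>
  int main(int argc, char **argv)
  {
      int rank, size, i, n = 0;
      MPI_Group world_group, even_group;
      MPI_Comm even_comm;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      int *ranks = malloc(size * sizeof(int));
      for (i = 0; i < size; i += 2) ranks[n++] = i;      /* even world ranks */
      MPI_Comm_group(MPI_COMM_WORLD, &world_group);
      MPI_Group_incl(world_group, n, ranks, &even_group);
      /* collective over MPI_COMM_WORLD; returns MPI_COMM_NULL on odd ranks */
      MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);
      MPI_Group_free(&even_group);
      MPI_Group_free(&world_group);
      MPI_Finalize();
      return 0;
  }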
36. Intercommunicators
- For multi-disciplinary applications, pipeline applications, easy readability of programs
- An inter-communicator can be used for point-point communication (send and recv) between processes of disjoint groups
- Does not support collectives in 1.1
- MPI_INTERCOMM_CREATE(local_comm, local_leader, bridge_comm, remote_leader, tag, comm)
- MPI_INTERCOMM_MERGE(intercomm, high, newintracomm)
37. Communicator and Groups Example
[Figure: 12 processes split into three groups by rank mod 3; shown as local rank (world rank):
  Group 0: 0(0) 1(3) 2(6) 3(9)
  Group 1: 0(1) 1(4) 2(7) 3(10)
  Group 2: 0(2) 1(5) 2(8) 3(11)]
  main(){
      ...
      membership = rank % 3;
      MPI_Comm_split(MPI_COMM_WORLD, membership, rank, &mycomm);
  }
38. Communicator and Groups Example
  if(membership == 0){
      MPI_Intercomm_create(mycomm, 0, MPI_COMM_WORLD, 1, 01, &my1stcomm);
  }
  else if(membership == 1){
      MPI_Intercomm_create(mycomm, 0, MPI_COMM_WORLD, 0, 01, &my1stcomm);
      MPI_Intercomm_create(mycomm, 0, MPI_COMM_WORLD, 2, 12, &my2ndcomm);
  }
  else{
      MPI_Intercomm_create(mycomm, 0, MPI_COMM_WORLD, 1, 12, &my1stcomm);
  }
39. MPI Process Topologies
40. Motivation
- Logical process arrangement
- For convenient identification of processes - program readability
- For assisting the runtime system in mapping processes onto hardware - increase in performance
- Default: linear array, ranks from 0 to n-1
- Virtual topologies can give rise to trees, graphs, meshes, etc.
41. Introduction
- Any process topology can be represented by a graph.
- MPI provides defaults for ring, mesh, torus and other common structures
42. Cartesian Topology
- Cartesian structures of arbitrary dimensions
- Can be periodic along any number of dimensions
- Popular cartesian structures: linear array, ring, rectangular mesh, cylinder, torus (hypercubes)
43. Cartesian Topology - Constructors
- MPI_CART_CREATE(
  - comm_old - old communicator,
  - ndims - number of dimensions,
  - dims - number of processes along each dimension,
  - periods - periodicity of the dimensions,
  - reorder - whether ranks may be reordered,
  - comm_cart - new communicator representing the cartesian topology
  )
- Collective communication call
44. Cartesian Topology - Constructors
- MPI_DIMS_CREATE(
  - nnodes (in) - number of nodes in a grid,
  - ndims (in) - number of dimensions,
  - dims (inout) - number of processes along each dimension
  )
- Helps to create dimension sizes such that the sizes are as close to each other as possible.
- The user can specify constraints with positive integers in certain entries of dims; only entries set to 0 are modified.
- Examples: dims = (0,0) with nnodes = 6 returns (3,2); dims = (0,3,0) with nnodes = 6 returns (2,3,1); dims = (0,3,0) with nnodes = 7 is an error.
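A minimal sketch (not from the original slides): let MPI_DIMS_CREATE factor the process count into a balanced 2-D grid, then build a periodic (torus) cartesian communicator on it.

  #include <mpi.h>
  int main(int argc, char **argv)
  {
      int size, myrank;
      int dims[2] = {0, 0};        /* 0 = let MPI choose both extents */
      int periods[2] = {1, 1};     /* periodic in both dimensions: a torus */
      int coords[2];
      MPI_Comm cart;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Dims_create(size, 2, dims);                         /* e.g. 6 -> (3,2) */
      MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cart);
      MPI_Comm_rank(cart, &myrank);
      MPI_Cart_coords(cart, myrank, 2, coords);               /* my grid position */
      MPI_Finalize();
      return 0;
  }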
45. Cartesian Topology - Inquiry and Translators
- MPI_CARTDIM_GET(comm, ndims)
- MPI_CART_GET(comm, maxdims, dims, periods, coords)
- MPI_CART_RANK(comm, coords, rank) - coordinates -> rank
- MPI_CART_COORDS(comm, rank, maxdims, coords) - rank -> coordinates
46. Cartesian Topology - Shifting
- MPI_CART_SHIFT(
  - comm,
  - direction,
  - displacement,
  - source,
  - dest
  )
- Useful for a subsequent SendRecv:
  MPI_Sendrecv(..., dest, ..., source, ...)
- Example:
  MPI_CART_SHIFT(comm, 1, 1, source, dest)
47. Example: Cannon's Matrix-Matrix Multiplication
[Figure: A and B partitioned into sqrt(P) x sqrt(P) blocks (a 4x4 layout shown). Initial realignment: block row i of A is shifted left by i and block column j of B is shifted up by j, so process (i,j) starts with blocks A(i,(i+j) mod 4) and B((i+j) mod 4, j)]
Initial Realignment
48. Example: Cannon's Matrix-Matrix Multiplication
[Figure: the three subsequent states of the 4x4 block layout after the first, second and third shifts; each step multiplies the resident blocks, then shifts A one block left and B one block up]
49. Cannon's Algorithm with MPI Topologies
  dims[0] = dims[1] = sqrt(P);
  periods[0] = periods[1] = 1;
  MPI_Cart_create(comm, 2, dims, periods, 1, &comm_2d);
  MPI_Comm_rank(comm_2d, &my2drank);
  MPI_Cart_coords(comm_2d, my2drank, 2, mycoords);
  MPI_Cart_shift(comm_2d, 0, -1, &rightrank, &leftrank);
  MPI_Cart_shift(comm_2d, 1, -1, &downrank, &uprank);
  nlocal = n/dims[0];
50. Cannon's Algorithm with MPI Topologies
  /* Initial Matrix Alignment */
  MPI_Cart_shift(comm_2d, 0, -mycoords[0], &shiftsource, &shiftdest);
  MPI_Sendrecv_replace(a, nlocal*nlocal, MPI_DOUBLE, shiftdest, 1,
                       shiftsource, 1, comm_2d, &status);
  MPI_Cart_shift(comm_2d, 1, -mycoords[1], &shiftsource, &shiftdest);
  MPI_Sendrecv_replace(b, nlocal*nlocal, MPI_DOUBLE, shiftdest, 1,
                       shiftsource, 1, comm_2d, &status);
51. Cannon's Algorithm with MPI Topologies
  /* Main Computation Loop */
  for(i=0; i<dims[0]; i++){
      MatrixMultiply(nlocal, a, b, c);  /* c = c + a*b */
      /* Shift matrix a left by one */
      MPI_Sendrecv_replace(a, nlocal*nlocal, MPI_DOUBLE, leftrank, 1,
                           rightrank, 1, comm_2d, &status);
      /* Shift matrix b up by one */
      MPI_Sendrecv_replace(b, nlocal*nlocal, MPI_DOUBLE, uprank, 1,
                           downrank, 1, comm_2d, &status);
  }
52. Cannon's Algorithm with MPI Topologies
  /* Restore the original distribution of a and b */
  MPI_Cart_shift(comm_2d, 0, mycoords[0], &shiftsource, &shiftdest);
  MPI_Sendrecv_replace(a, nlocal*nlocal, MPI_DOUBLE, shiftdest, 1,
                       shiftsource, 1, comm_2d, &status);
  MPI_Cart_shift(comm_2d, 1, mycoords[1], &shiftsource, &shiftdest);
  MPI_Sendrecv_replace(b, nlocal*nlocal, MPI_DOUBLE, shiftdest, 1,
                       shiftsource, 1, comm_2d, &status);
53. General Graph Topology
- MPI_GRAPH_CREATE(comm_old, nnodes, index, edges, reorder, comm_graph)
- Example:
  - nnodes = 8
  - index = {3, 4, 6, 7, 10, 11, 13, 14}
  - edges = {1, 2, 4, 0, 0, 3, 2, 0, 5, 6, 4, 4, 7, 6}
[Figure: the 8-node graph (nodes 0-7) described by these index and edges arrays]
54. General Graph Topology - Inquiry
- MPI_Graphdims_get(MPI_Comm comm, int *nnodes, int *nedges)
- MPI_Graph_get(MPI_Comm comm, int maxindex, int maxedges, int *index, int *edges)
- MPI_Graph_neighbors_count(MPI_Comm comm, int rank, int *nneighbors)
- MPI_Graph_neighbors(MPI_Comm comm, int rank, int maxneighbors, int *neighbors)
- MPI_TOPO_TEST(comm, status)
  - status can be MPI_GRAPH, MPI_CART, MPI_UNDEFINED
56. Communicators as Caches
- Caches are used for storing and retrieving attributes
- MPI_KEYVAL_CREATE(copy_fn, delete_fn, keyval, extra_state)
  typedef int MPI_Copy_function(MPI_Comm oldcomm, int keyval,
                                void *extra_state, void *attribute_val_in,
                                void *attribute_val_out, int *flag);
  typedef int MPI_Delete_function(MPI_Comm comm, int keyval,
                                  void *attribute_val, void *extra_state);
- MPI_ATTR_PUT(comm, keyval, attribute_val)
- MPI_ATTR_GET(comm, keyval, attribute_val, flag)
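A minimal sketch of this MPI-1 attribute-caching interface (not from the original slides; note these calls were deprecated in MPI-2 in favor of MPI_Comm_create_keyval and friends):

  #include <mpi.h>
  int main(int argc, char **argv)
  {
      int keyval, flag;
      static int value = 42;   /* the cached attribute must outlive the put */
      void *got;
      MPI_Init(&argc, &argv);
      MPI_Keyval_create(MPI_NULL_COPY_FN, MPI_NULL_DELETE_FN, &keyval, NULL);
      MPI_Attr_put(MPI_COMM_WORLD, keyval, &value);     /* cache it on the comm */
      MPI_Attr_get(MPI_COMM_WORLD, keyval, &got, &flag);
      /* here flag != 0 and got == &value */
      MPI_Finalize();
      return 0;
  }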
57. Point-Point Example
  main(){
      int count;
      /* sendbuf[count] holds this process's contribution */
      if(rank != 0){
          MPI_Recv(recvbuf, count, MPI_INT, rank-1, tag, comm, &status);
      }
      else{
          for(i=0; i<count; i++)
              recvbuf[i] = 0;
      }
      for(i=0; i<count; i++)
          recvbuf[i] += sendbuf[i];
      if(rank != size-1)
          MPI_Send(recvbuf, count, MPI_INT, rank+1, tag, comm);
  }
58. Collective Communications: Finding Maximum
  main(){
      MPI_Scatter(full_array, local_size, MPI_INT, local_array,
                  local_size, MPI_INT, 0, comm);
      local_max = max(local_array);
      MPI_Allreduce(&local_max, &global_max, 1, MPI_INT, MPI_MAX, comm);
  }
59. Miscellaneous Attributes / Functions
- MPI_WTIME_IS_GLOBAL - for checking clock synchronization
- MPI_GET_PROCESSOR_NAME(name, resultlen)
- MPI_WTIME(), MPI_WTICK()
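A minimal timing sketch (not from the original slides): bracket a region with MPI_Wtime; MPI_Wtick reports the timer's resolution.

  #include <mpi.h>
  #include <stdio.h>
  int main(int argc, char **argv)
  {
      double t0, t1;
      MPI_Init(&argc, &argv);
      MPI_Barrier(MPI_COMM_WORLD);   /* rough common starting point */
      t0 = MPI_Wtime();
      /* ... region being timed ... */
      t1 = MPI_Wtime();
      printf("elapsed %g s (timer resolution %g s)\n", t1 - t0, MPI_Wtick());
      MPI_Finalize();
      return 0;
  }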
60. Profiling Interface
- Primarily intended for profiling tool developers
- Also used for combining different MPI implementations
- MPI implementors need to provide equivalent functions with the PMPI_ prefix, e.g. PMPI_Bcast for MPI_Bcast
61. Profiling Interface - Example
  #pragma weak MPI_Send = PMPI_Send
  int PMPI_Send(/* appropriate args */)
  {
      /* Useful content */
  }
62. Point-Point: Some More Functions / Default Definitions
- MPI_PROC_NULL
- MPI_ANY_SOURCE
- MPI_IPROBE(source, tag, comm, flag, status)
- MPI_PROBE(source, tag, comm, status)
- MPI_CANCEL(request)
- MPI_TEST_CANCELLED(status, flag)
- Persistent communication requests (see the sketch below):
  - MPI_SEND_INIT(buf, count, datatype, dest, tag, comm, request)
  - MPI_RECV_INIT(buf, count, datatype, source, tag, comm, request)
  - MPI_START(request)
  - MPI_STARTALL(count, array_of_requests)
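A minimal persistent-request sketch (not from the original slides; assumes two processes): a receive repeated with identical arguments is set up once and restarted each iteration.

  #include <mpi.h>
  int main(int argc, char **argv)
  {
      int rank, buf, i, iters = 10;
      MPI_Request req;
      MPI_Status status;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 1) {
          MPI_Recv_init(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req); /* set up once */
          for (i = 0; i < iters; i++) {
              MPI_Start(&req);          /* activate the persistent request */
              MPI_Wait(&req, &status);  /* completes, but stays allocated */
          }
          MPI_Request_free(&req);
      } else if (rank == 0) {
          for (i = 0; i < iters; i++)
              MPI_Send(&i, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      }
      MPI_Finalize();
      return 0;
  }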