Title: MPI Workshop II
1. MPI Workshop - II
- Introduction to Collective Communications
- HPC@UNM Research Staff
- Dr. Andrew C. Pineda, Dr. Paul M. Alsing
- Week 2 of 2
2. Today's Topics
- Course Map
- Basic Collective Communications
- MPI_Barrier
- MPI_Scatterv, MPI_Gatherv, MPI_Reduce
- MPI Routines/Exercises
- Pi, Matrix-Matrix mult., Vector-Matrix mult.
- Other Collective Calls
- Cartesian Topology Example
- References
3. Course Roadmap
4. Example 1 - Pi Calculation
Uses the following MPI calls:
MPI_BARRIER, MPI_BCAST, MPI_REDUCE
5. Integration Domain - Serial
[Figure: the integration interval is divided into N subintervals with endpoints x0, x1, x2, x3, ..., xN.]
6. Serial Pseudocode
- f(x) = 1/(1+x^2)
- h = 1/N, sum = 0.0
- do i = 1, N
-   x = h*(i - 0.5)
-   sum = sum + f(x)
- enddo
- pi = h * sum
- Example: N = 10, h = 0.1, x = .05, .15, .25, .35, .45, .55, .65, .75, .85, .95
7. Integration Domain - Parallel
8. Parallel Pseudocode
- P(0) reads in N and broadcasts N to each processor
- f(x) = 1/(1+x^2)
- h = 1/N, sum = 0.0
- do i = rank+1, N, nprocrs
-   x = h*(i - 0.5)
-   sum = sum + f(x)
- enddo
- mypi = h * sum
- Collect (Reduce) mypi from each processor into a collective value of pi on the output processor
- Example: N = 10, h = 0.1, processors P(0), P(1), P(2): P(0) -> .05, .35, .65, .95; P(1) -> .15, .45, .75; P(2) -> .25, .55, .85
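As a worked companion to the pseudocode above, here is a minimal C sketch of the parallel pi program (this is not the workshop's actual source; variable names and the prompt for N are illustrative, and since the slide's f(x) = 1/(1+x^2) integrates to pi/4 on [0,1], a factor of 4 is applied when printing).

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocrs, i, N = 0;
    double h, x, sum, mypi, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocrs);

    if (rank == 0) {                    /* P(0) reads in N ...              */
        printf("Enter the number of intervals N: ");
        fflush(stdout);
        scanf("%d", &N);
    }
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* ... and broadcasts it */

    h = 1.0 / (double)N;
    sum = 0.0;
    for (i = rank + 1; i <= N; i += nprocrs) {      /* cyclic distribution of intervals */
        x = h * ((double)i - 0.5);
        sum += 1.0 / (1.0 + x * x);                 /* f(x) = 1/(1+x^2)                 */
    }
    mypi = h * sum;                                  /* this process's piece of pi/4     */

    /* Collect (reduce) the partial sums onto the output processor P(0). */
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi is approximately %.16f\n", 4.0 * pi);

    MPI_Finalize();
    return 0;
}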
9. Lab exercise 1
- ssh -X ll  (ssh -X user@ll.alliance.unm.edu from outside)
- cd mpi1/hello-world
- mpif77 -o fhello hello.f
- (or use the included makefile)
- make fhello
- qsub -I -q R11413 -l nodes=2:ppn=2,walltime=1:00:00
- mpirun -np 4 -nolocal -machinefile $PBS_NODEFILE fhello
- mpirun -np 8 -nolocal -machinefile $PBS_NODEFILE fhello
- exit (exits the interactive batch session)
- If you want to see the hosts that the MPI processes are mapped to, compile and run hello-name.(c/f). You'll see something interesting if you run without the -nolocal flag under the ch_p4 interface. (ch_p4 is the default environment for the training guest accounts.)
10. Collective Communications - Broadcast
Broadcast - send a copy of a piece of data from one process to all processes.
MPI_BCAST
11. Collective Communications - Reduction
Reduction - collect data back to one process, performing an associative operation on the data, e.g. addition, product, maximum, etc.
- MPI_REDUCE
- MPI_SUM, MPI_PROD, MPI_MAX, MPI_MIN, MPI_LAND, MPI_BAND, ...
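As a quick illustration of the reduction operations listed above (a hedged sketch, not from the workshop code), the program below uses MPI_MAX to deliver the largest per-process value to the root; local_err is just a stand-in for a per-process result.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double local_err, global_err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local_err = 1.0 / (double)(rank + 1);   /* stand-in for a per-process result */

    /* Rank 0 receives the largest local_err across all processes. */
    MPI_Reduce(&local_err, &global_err, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("maximum local error = %g\n", global_err);

    MPI_Finalize();
    return 0;
}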
12. Collective Communications - Synchronization
- Collective calls can (but are not required to) return as soon as their participation in the collective call is complete.
- Return from a call does NOT indicate that other processes have completed their part in the communication.
- Occasionally, it is necessary to force the synchronization of processes.
- MPI_BARRIER
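A common use of MPI_Barrier is to line processes up around a timed region so that MPI_Wtime measures the slowest process. The sketch below is illustrative only (do_work is a placeholder, not a workshop routine).

#include <stdio.h>
#include <mpi.h>

void do_work(void) { /* placeholder for the section being timed */ }

int main(int argc, char **argv)
{
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);   /* everyone starts the clock together        */
    t0 = MPI_Wtime();
    do_work();
    MPI_Barrier(MPI_COMM_WORLD);   /* wait until the slowest process finishes   */
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("elapsed time = %g s\n", t1 - t0);

    MPI_Finalize();
    return 0;
}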
13. Collective Communications
Broadcast the coefficients to all processors.
Scatter the vectors among N processors as
zpart, xpart, and ypart. Calls can return
as soon as their participation is complete.
14. Example
- Vecsum - basic collective communications calls
- MPI_SCATTER - distribute an array evenly among processors
- MPI_GATHER - collect pieces of an array from processors
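The next two slides walk through the vector sum; as a preview, here is a minimal C sketch of the scatter/compute/gather pattern. The names xpart, ypart, and zpart follow slide 13; the fixed length NGLOBAL and the assumption that it divides evenly by the process count are illustrative, not the workshop's actual code.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NGLOBAL 16   /* illustrative size, assumed divisible by the process count */

int main(int argc, char **argv)
{
    int rank, size, nlocal, i;
    double x[NGLOBAL], y[NGLOBAL], z[NGLOBAL];
    double *xpart, *ypart, *zpart;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    nlocal = NGLOBAL / size;

    if (rank == 0)                       /* root initializes the full vectors */
        for (i = 0; i < NGLOBAL; i++) { x[i] = i; y[i] = 2.0 * i; }

    xpart = malloc(nlocal * sizeof(double));
    ypart = malloc(nlocal * sizeof(double));
    zpart = malloc(nlocal * sizeof(double));

    /* distribute equal-sized pieces of x and y */
    MPI_Scatter(x, nlocal, MPI_DOUBLE, xpart, nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(y, nlocal, MPI_DOUBLE, ypart, nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (i = 0; i < nlocal; i++)         /* local piece of the sum */
        zpart[i] = xpart[i] + ypart[i];

    /* collect the pieces of z back on the root */
    MPI_Gather(zpart, nlocal, MPI_DOUBLE, z, nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("z[NGLOBAL-1] = %g\n", z[NGLOBAL - 1]);

    free(xpart); free(ypart); free(zpart);
    MPI_Finalize();
    return 0;
}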
15. Vector Sum
16. Vector Sum - cont'd
17. Example 2: Matrix Multiplication (Easy) in C
Two versions, depending on whether or not the number of rows of C and A is evenly divisible by the number of processes. Uses the following MPI calls:
MPI_BCAST, MPI_BARRIER, MPI_SCATTERV, MPI_GATHERV
18. Serial Code in C/C++
Note that all the arrays are accessed in row-major order. Hence, it makes sense to distribute the arrays by rows.
- for(i=0; i<nrow_c; i++)
-   for(j=0; j<ncol_c; j++)
-     c[i][j] = 0.0e0;
- for(i=0; i<nrow_c; i++)
-   for(k=0; k<ncol_a; k++)
-     for(j=0; j<ncol_c; j++)
-       c[i][j] += a[i][k]*b[k][j];
19. Matrix Multiplication in C: Parallel Example
20. Collective Communications - Scatter/Gather
Scatter - distribute pieces of an array from one process to many; gather - collect pieces of an array from many processes back to one.
MPI_GATHER, MPI_SCATTER, MPI_GATHERV, MPI_SCATTERV
21. Flavors of Scatter/Gather
- Equal-sized pieces of data distributed to each processor
- MPI_SCATTER, MPI_GATHER
- Unequal-sized pieces of data distributed
- MPI_SCATTERV, MPI_GATHERV
- Must specify arrays of the sizes of the data pieces and their displacements from the start of the data to be distributed or collected.
- Both of these arrays are of length equal to the size of the communications group.
22. Scatter/Scatterv Calling Syntax
- int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
- int MPI_Scatterv(void *sendbuf, int *sendcounts, int *offsets, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
23. Abbreviated Parallel Code (Equal size data blocks)
- ierr=MPI_Scatter(a, nrow_a*ncol_a/size, ...);
- ierr=MPI_Bcast(b, nrow_b*ncol_b, ...);
- for(i=0; i<nrow_c/size; i++)
-   for(j=0; j<ncol_c; j++)
-     cpart[i][j] = 0.0e0;
- for(i=0; i<nrow_c/size; i++)
-   for(k=0; k<ncol_a; k++)
-     for(j=0; j<ncol_c; j++)
-       cpart[i][j] += apart[i][k]*b[k][j];
- ierr=MPI_Gather(cpart, (nrow_c/size)*ncol_c, ...);
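For reference, a hedged sketch of what the elided "..." arguments above typically expand to is given below, assuming contiguous row-major double arrays, root process 0, and row counts evenly divisible by the number of processes. The function wrapper and its parameter names are illustrative, not the workshop's source.

#include <mpi.h>

void distribute_multiply_collect(double *a, double *b, double *c,
                                 double *apart, double *cpart,
                                 int nrow_a, int ncol_a,
                                 int nrow_b, int ncol_b,
                                 int nrow_c, int ncol_c, int size)
{
    const int root = 0;
    int i, j, k, nrow_local = nrow_c / size;

    /* each process receives nrow_a/size rows of A */
    MPI_Scatter(a, (nrow_a / size) * ncol_a, MPI_DOUBLE,
                apart, (nrow_a / size) * ncol_a, MPI_DOUBLE,
                root, MPI_COMM_WORLD);

    /* every process receives a full copy of B */
    MPI_Bcast(b, nrow_b * ncol_b, MPI_DOUBLE, root, MPI_COMM_WORLD);

    /* local multiply: cpart = apart * b, stored row-major */
    for (i = 0; i < nrow_local; i++)
        for (j = 0; j < ncol_c; j++)
            cpart[i * ncol_c + j] = 0.0;
    for (i = 0; i < nrow_local; i++)
        for (k = 0; k < ncol_a; k++)
            for (j = 0; j < ncol_c; j++)
                cpart[i * ncol_c + j] += apart[i * ncol_a + k] * b[k * ncol_b + j];

    /* collect the locally computed rows of C on the root */
    MPI_Gather(cpart, nrow_local * ncol_c, MPI_DOUBLE,
               c, nrow_local * ncol_c, MPI_DOUBLE,
               root, MPI_COMM_WORLD);
}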
24. Abbreviated Parallel Code (Unequal size data blocks)
- ierr=MPI_Scatterv(a, a_chunk_sizes, a_offsets, ...);
- ierr=MPI_Bcast(b, nrow_b*ncol_b, ...);
- for(i=0; i<c_chunk_sizes[rank]/ncol_c; i++)
-   for(j=0; j<ncol_c; j++)
-     cpart[i][j] = 0.0e0;
- for(i=0; i<c_chunk_sizes[rank]/ncol_c; i++)
-   for(k=0; k<ncol_a; k++)
-     for(j=0; j<ncol_c; j++)
-       cpart[i][j] += apart[i][k]*b[k][j];
- ierr=MPI_Gatherv(cpart, c_chunk_sizes[rank], MPI_DOUBLE, ...);
- Look at the C code to see how the sizes and offsets are done (a sketch follows below).
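One common way to build the chunk-size and offset arrays is sketched below. The array names follow the slide, but the "first (nrow_a % size) processes get one extra row" policy is an assumption for illustration, not necessarily what the workshop code does. Counts and offsets are in units of array elements, as MPI_Scatterv expects.

void build_row_chunks(int nrow_a, int ncol_a, int size,
                      int *a_chunk_sizes, int *a_offsets)
{
    int p, rows, offset = 0;

    for (p = 0; p < size; p++) {
        rows = nrow_a / size + (p < nrow_a % size ? 1 : 0);
        a_chunk_sizes[p] = rows * ncol_a;   /* elements sent to process p   */
        a_offsets[p]     = offset;          /* displacement from start of a */
        offset += rows * ncol_a;
    }
}

Each process then receives its own block with MPI_Scatterv(a, a_chunk_sizes, a_offsets, MPI_DOUBLE, apart, a_chunk_sizes[rank], MPI_DOUBLE, root, MPI_COMM_WORLD).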
25. Fortran version
- F77 - no dynamic memory allocation.
- F90 - allocatable arrays, allocated in contiguous memory.
- Multi-dimensional arrays are stored in memory in column-major order.
- Questions for the student:
- How should we distribute the data in this case? What about loop ordering?
- We never distributed the B matrix. What if B is large?
26. Example 3: Vector Matrix Product in C
Illustrates MPI_Scatterv, MPI_Reduce, MPI_Bcast.
27. Main part of parallel code
- ierr=MPI_Scatterv(a, a_chunk_sizes, a_offsets, MPI_DOUBLE, apart, a_chunk_sizes[rank], MPI_DOUBLE, root, MPI_COMM_WORLD);
- ierr=MPI_Scatterv(btmp, b_chunk_sizes, b_offsets, MPI_DOUBLE, bparttmp, b_chunk_sizes[rank], MPI_DOUBLE, root, MPI_COMM_WORLD);
- initialize cpart to zero
- for(k=0; k<a_chunk_sizes[rank]; k++)
-   for(j=0; j<ncol_c; j++)
-     cpart[j] += apart[k]*bpart[k][j];
- ierr=MPI_Reduce(cpart, c, ncol_c, MPI_DOUBLE, MPI_SUM, root, MPI_COMM_WORLD);
28. Collective Communications - Allgather
Allgather - gather pieces of an array from all processes and deliver the complete array to every process.
MPI_ALLGATHER
29. Collective Communications - Alltoall
Alltoall - every process sends a distinct block of data to every other process; block i of each send buffer ends up on process i.
MPI_ALLTOALL
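Since the body of this slide did not survive extraction, here is a minimal C sketch (not the workshop's code) of what MPI_Alltoall does; this exchange pattern is the building block of the parallel transpose on the next slides.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc(size * sizeof(int));
    recvbuf = malloc(size * sizeof(int));
    for (i = 0; i < size; i++)
        sendbuf[i] = 100 * rank + i;   /* element i is destined for process i */

    /* block i of sendbuf goes to process i; block i of recvbuf comes from process i */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d received:", rank);
    for (i = 0; i < size; i++)
        printf(" %d", recvbuf[i]);
    printf("\n");

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}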
30. Transpose in Serial 2D-FFT
[Figure: perform 1D x-transforms on contiguous data (by columns in Fortran); transpose the 2D array, then perform the y-transforms on contiguous data in columns.]
31. Transpose in Parallel 2D-FFT
[Figure: as above, but with the 2D array distributed across processes - perform the 1D x-transforms on contiguous local data, transpose the distributed array, then perform the y-transforms on contiguous data in columns.]
32. Exercise: Vector Matrix Product in C
Rewrite Example 3 to perform the vector matrix
product as shown.
33. Poisson Equation on a 2D Grid, periodic boundary conditions
34. Serial Poisson Solver
- F90 code
- N x N matrices for rho and phi
- Initialize rho.
- Discretize the equation.
- Iterate until convergence.
- Output results.
35. Serial Poisson Solver: Solution
36. Serial Poisson Solver (cont)
- do j=1, M
-   do i=1, N    ! Fortran accesses down columns first
-     phi(i,j) = rho(i,j) +
-                .25 * ( phi_old( modulo(i,N)+1, j )   +
-                        phi_old( modulo(i-2,N)+1, j ) +
-                        phi_old( i, modulo(j,N)+1 )   +
-                        phi_old( i, modulo(j-2,N)+1 ) )
-   enddo
- enddo
37. Parallel Poisson Solver in MPI: domain decomposition, 3 x 5 processor grid
[Figure: the domain is split among 15 processors arranged in a 3 x 5 grid, labeled P(0,0) through P(2,4).]
38. Parallel Poisson Solver in MPI: Processor Grid, e.g. 3 x 5, N = M = 64
39. Parallel Poisson Solver in MPI: boundary data movement each iteration
[Figure: each iteration, every processor in the 3 x 5 grid exchanges boundary data with its north, south, east, and west neighbors.]
40. Ghost Cells, Local Indices
[Figure: local array on processor P(1,2) with N_local = 21 rows and M_local = 13 columns; interior local indices run 1..21 and 1..13, surrounded by ghost rows at indices 0 and 22 and ghost columns at indices 0 and 14.]
41. Data Movement, e.g. Shift Right (East)
[Figure: boundary data (columns) shifted east into the neighboring processors' ghost cells.]
42. Communicators and Topologies
- A communicator is a set of processors which can talk to each other.
- The basic communicator is MPI_COMM_WORLD.
- One can create new groups or subgroups of processors from MPI_COMM_WORLD or other communicators.
- MPI allows one to associate a Cartesian or Graph topology with a communicator.
43. MPI Cartesian Topology Functions
- MPI_CART_CREATE( old_comm, nmbr_of_dims, dim_sizes(), wrap_around(), reorder, cart_comm, ierr )
- old_comm = MPI_COMM_WORLD
- nmbr_of_dims = 2
- dim_sizes() = (np_rows, np_cols) = (3, 5)
- wrap_around = ( .true., .true. )
- reorder = .false. (generally set to .true.)
- allows the system to reorder the processors for better performance
- cart_comm = grid_comm (name for the new communicator)
44. MPI Cartesian Topology Functions
- MPI_CART_RANK( comm, coords(), rank, ierr )
- comm = grid_comm
- coords() = ( coords(1), coords(2) ), e.g. (0,2) for P(0,2)
- rank = processor rank inside grid_comm
- returns the rank of the processor with coordinates coords()
- MPI_CART_COORDS( comm, rank, nmbr_of_dims, coords(), ierr )
- nmbr_of_dims = 2
- returns the coordinates of the processor in grid_comm given its rank in grid_comm
45. MPI Cartesian Topology Functions
- MPI_CART_SUB( grid_comm, free_coords(), sub_comm, ierr )
- grid_comm = communicator with a topology
- free_coords() = ( .false., .true. ) -> (i fixed, j varies), i.e. row communicator
- free_coords() = ( .true., .false. ) -> (i varies, j fixed), i.e. column communicator
- sub_comm = the new sub-communicator (say row_comm or col_comm)
46. MPI Cartesian Topology Functions
- MPI_CART_SHIFT( grid_comm, direction, disp, rank_recv_from, rank_send_to, ierr )
- grid_comm = communicator with a topology
- direction = 0 -> i varies -> column shift, N or S
-           = 1 -> j varies -> row shift, E or W
- disp = how many processors to shift over (+ or -)
- e.g. N shift: direction=0, disp=-1
-      S shift: direction=0, disp=+1
-      E shift: direction=1, disp=+1
-      W shift: direction=1, disp=-1
47. MPI Cartesian Topology Functions
- MPI_CART_SHIFT( grid_comm, direction, disp, rank_recv_from, rank_send_to, ierr )
- MPI_CART_SHIFT does not actually perform any data transfer. It returns two ranks.
- rank_recv_from = the rank of the processor from which the calling processor will receive the new data
- rank_send_to = the rank of the processor to which data will be sent from the calling processor
- Note: MPI_CART_SHIFT does the modulo arithmetic if the corresponding dimension has wrap_around() = .true.
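Slides 43-47 show the Fortran interfaces; a compact C sketch of the same sequence (create the 3 x 5 periodic grid, look up this process's coordinates, build a row sub-communicator, and compute the ranks for a north shift) is given below. It assumes exactly 15 processes (e.g. mpirun -np 15) and is illustrative only, not the workshop's source.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int dims[2] = {3, 5};          /* np_rows x np_cols               */
    int periods[2] = {1, 1};       /* wrap_around = (.true., .true.)  */
    int reorder = 0;               /* .false. here; often .true.      */
    int coords[2], grid_rank;
    int rank_recv_from, rank_send_to;
    int free_coords[2] = {0, 1};   /* i fixed, j varies -> row communicator */
    MPI_Comm grid_comm, row_comm;

    MPI_Init(&argc, &argv);

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &grid_comm);
    MPI_Comm_rank(grid_comm, &grid_rank);
    MPI_Cart_coords(grid_comm, grid_rank, 2, coords);

    /* sub-communicator containing this process's row of the grid */
    MPI_Cart_sub(grid_comm, free_coords, &row_comm);

    /* N (upward) shift: direction 0 (i varies), displacement -1.
       Only ranks are returned; the data movement itself uses MPI_Sendrecv. */
    MPI_Cart_shift(grid_comm, 0, -1, &rank_recv_from, &rank_send_to);

    printf("rank %d is P(%d,%d); north shift: recv from %d, send to %d\n",
           grid_rank, coords[0], coords[1], rank_recv_from, rank_send_to);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&grid_comm);
    MPI_Finalize();
    return 0;
}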
48. Parallel Poisson Solver: N (upward) shift in columns
- ! N or upward shift
- ! P(i+1,j) --(recv from)--> P(i,j) --(send to)--> P(i-1,j)
- direction = 0    ! i varies
- disp = -1        ! i -> i-1
- top_bottom_buffer = phi_old_local(1,:)
- call MPI_CART_SHIFT( grid_comm, direction, disp, rank_recv_from, rank_send_to, ierr )
- call MPI_SENDRECV( top_bottom_buffer, M_local+1, MPI_DOUBLE_PRECISION, rank_send_to, tag,
-                    bottom_ghost_cells, M_local, MPI_DOUBLE_PRECISION, rank_recv_from, tag,
-                    grid_comm, status, ierr )
- phi_old_local(N_local+1,:) = bottom_ghost_cells
49. Parallel Poisson Solver: Main computation
- do j=1, M_local
-   do i=1, N_local
-     phi(i,j) = rho(i,j) +
-                .25 * ( phi_old( i+1, j ) + phi_old( i-1, j ) +
-                        phi_old( i, j+1 ) + phi_old( i, j-1 ) )
-   enddo
- enddo
- Note: indices are all within range now due to the ghost cells.
50. Parallel Poisson Solver: Global vs. Local Indices
- i_offset = 0
- do i = 1, coord(1)
-   i_offset = i_offset + nmbr_local_rows(i)
- enddo
- j_offset = 0
- do j = 1, coord(2)
-   j_offset = j_offset + nmbr_local_cols(j)
- enddo
- do j = j_offset+1, j_offset + M_local      ! global indices
-   y = (real(j)-.5)/M*Ly - Ly/2
-   do i = i_offset+1, i_offset + N_local    ! global indices
-     x = (real(i)-.5)/N*Lx
-     makerho_local(i-i_offset, j-j_offset) = f(x,y)   ! store with local indices
-   enddo
- enddo
51. Parallel Poisson Solver in MPI: processor grid, e.g. 3 x 5, N = M = 64
[Figure: global index ranges owned by each processor in the 3 x 5 grid. Column (M) ranges: 1-13, 14-26, 27-39, 40-52, 53-64; row (N) ranges: 1-22, 23-43, 44-64.]
52. MPI Reduction Communication Functions
- Point-to-point communications in the N, S, E, W shifts:
- MPI_SENDRECV( sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status, ierr )
- Reduction operations in the computation:
- MPI_ALLREDUCE( sendbuf, recvbuf, count, datatype, operation, comm, ierr )
- operation = MPI_SUM, MPI_MAX, MPI_MINLOC, ...
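For example, MPI_Allreduce lets every process agree on a global residual when testing for convergence of the iteration. The C sketch below is illustrative (the function name and tolerance handling are assumptions, not the workshop's code).

#include <math.h>
#include <mpi.h>

/* Returns 1 on every process when the global maximum local residual
   has dropped below tol. */
int converged(double local_residual, double tol, MPI_Comm comm)
{
    double global_residual;

    /* every process receives the same maximum over all local residuals */
    MPI_Allreduce(&local_residual, &global_residual, 1,
                  MPI_DOUBLE, MPI_MAX, comm);

    return fabs(global_residual) < tol;
}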
53. I/O of final results, Step 1: in row_comm, Gatherv the columns into matrices of size (local rows) x M
[Figure: within each row of the 3 x 5 processor grid, the local blocks are gathered into a strip spanning the full width M.]
54. I/O of final results, Step 2: transpose the matrix; in row_comm, Gatherv the columns into an M x N matrix. The result is the transpose of the matrix for phi.
[Figure: the N x M strips are transposed and gathered again into an M x N matrix.]
55. References - MPI Tutorial
- PACS online course
- http://webct.ncsa.uiuc.edu:8900/
- Edinburgh Parallel Computing Center
- http://www.epcc.ed.ac.uk/epic/mpi/notes/mpi-course-epic.book_1.html
- Argonne National Laboratory (MPICH)
- http://www-unix.mcs.anl.gov/mpi/
- MPI Forum
- http://www.mpi-forum.org/docs/docs.html
- MPI: The Complete Reference (vols. 1, 2)
- Vol. 1 at http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
- IBM (MPI on the RS/6000 (IBM SP))
- http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts
56. References: Some useful books
- MPI: The Complete Reference
- Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker and Jack Dongarra, MIT Press
- examples/mpidocs/mpi_complete_reference.ps.Z
- Parallel Programming with MPI
- Peter S. Pacheco, Morgan Kaufmann Publishers, Inc.
- Using MPI: Portable Parallel Programming with the Message Passing Interface
- William Gropp, E. Lusk and A. Skjellum, MIT Press