Title: MPI Workshop II
1. MPI Workshop - II
- Introduction to Collective Communications
- HPC@UNM Research Staff
- Dr. Andrew C. Pineda, Dr. Paul M. Alsing
- Week 2 of 2
2. Today's Topics
- Course Map
- Basic Collective Communications
- MPI_Barrier
- MPI_Scatterv, MPI_Gatherv, MPI_Reduce
- MPI Routines/Exercises
- Pi, Matrix-Matrix mult., Vector-Matrix mult.
- Other Collective Calls
- Cartesian Topology Example
- References
3. Course Roadmap
4. Example 1 - Pi Calculation
Uses the following MPI calls:
MPI_BARRIER, MPI_BCAST, MPI_REDUCE
5. Integration Domain - Serial
[Figure: the integration interval is divided into N subintervals with endpoints x0, x1, x2, x3, ..., xN.]
6. Serial Pseudocode
- f(x) = 1/(1+x^2)
- h = 1/N, sum = 0.0
- do i = 1, N
-   x = h*(i - 0.5)
-   sum = sum + f(x)
- enddo
- pi = h * sum
- Example: N = 10, h = 0.1, x = .05, .15, .25, .35, .45, .55, .65, .75, .85, .95
7. Integration Domain - Parallel
8. Parallel Pseudocode
- P(0) reads in N and broadcasts N to each processor
- f(x) = 1/(1+x^2)
- h = 1/N, sum = 0.0
- do i = rank+1, N, nprocrs
-   x = h*(i - 0.5)
-   sum = sum + f(x)
- enddo
- mypi = h * sum
- Collect (Reduce) mypi from each processor into a collective value of pi on the output processor
- Example: N = 10, h = 0.1, processors P(0), P(1), P(2): P(0) -> .05, .35, .65, .95; P(1) -> .15, .45, .75; P(2) -> .25, .55, .85
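As a worked companion to the pseudocode above, here is a minimal C sketch of the parallel pi program (this is not the workshop's actual source; variable names and the prompt for N are illustrative, and since the slide's f(x) = 1/(1+x^2) integrates to pi/4 on [0,1], a factor of 4 is applied when printing).

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocrs, i, N = 0;
    double h, x, sum, mypi, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocrs);

    if (rank == 0) {                    /* P(0) reads in N ...              */
        printf("Enter the number of intervals N: ");
        fflush(stdout);
        scanf("%d", &N);
    }
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* ... and broadcasts it */

    h = 1.0 / (double)N;
    sum = 0.0;
    for (i = rank + 1; i <= N; i += nprocrs) {      /* cyclic distribution of intervals */
        x = h * ((double)i - 0.5);
        sum += 1.0 / (1.0 + x * x);                 /* f(x) = 1/(1+x^2)                 */
    }
    mypi = h * sum;                                  /* this process's piece of pi/4     */

    /* Collect (reduce) the partial sums onto the output processor P(0). */
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi is approximately %.16f\n", 4.0 * pi);

    MPI_Finalize();
    return 0;
}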
9. Lab exercise 1
- ssh -X ll  (ssh -X user@ll.alliance.unm.edu from outside)
- cd mpi1/hello-world
- mpif77 -o fhello hello.f
- (or use the included makefile)
- make fhello
- qsub -I -q R11413 -l nodes=2:ppn=2,walltime=1:00:00
- mpirun -np 4 -nolocal -machinefile $PBS_NODEFILE fhello
- mpirun -np 8 -nolocal -machinefile $PBS_NODEFILE fhello
- exit (exits the interactive batch session)
- If you want to see the hosts that the MPI processes are mapped to, compile and run hello-name.(c/f). You'll see something interesting if you run without the -nolocal flag under the ch_p4 interface. (ch_p4 is the default environment for the training guest accounts.)
10. Collective Communications - Broadcast
Broadcast - send a copy of a piece of data from one process to all processes.
MPI_BCAST
11. Collective Communications - Reduction
Reduction - collect data back to one process, performing an associative operation on the data, e.g. addition, product, maximum, etc.
- MPI_REDUCE
- MPI_SUM, MPI_PROD, MPI_MAX, MPI_MIN, MPI_LAND, MPI_BAND, ...
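As a quick illustration of the reduction operations listed above (a hedged sketch, not from the workshop code), the program below uses MPI_MAX to deliver the largest per-process value to the root; local_err is just a stand-in for a per-process result.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double local_err, global_err;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local_err = 1.0 / (double)(rank + 1);   /* stand-in for a per-process result */

    /* Rank 0 receives the largest local_err across all processes. */
    MPI_Reduce(&local_err, &global_err, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("maximum local error = %g\n", global_err);

    MPI_Finalize();
    return 0;
}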
12. Collective Communications - Synchronization
- Collective calls can (but are not required to) return as soon as their participation in the collective call is complete.
- Return from a call does NOT indicate that other processes have completed their part in the communication.
- Occasionally, it is necessary to force the synchronization of processes.
- MPI_BARRIER
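A common use of MPI_Barrier is to line processes up around a timed region so that MPI_Wtime measures the slowest process. The sketch below is illustrative only (do_work is a placeholder, not a workshop routine).

#include <stdio.h>
#include <mpi.h>

void do_work(void) { /* placeholder for the section being timed */ }

int main(int argc, char **argv)
{
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);   /* everyone starts the clock together        */
    t0 = MPI_Wtime();
    do_work();
    MPI_Barrier(MPI_COMM_WORLD);   /* wait until the slowest process finishes   */
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("elapsed time = %g s\n", t1 - t0);

    MPI_Finalize();
    return 0;
}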
13. Collective Communications
Broadcast the coefficients to all processors.
Scatter the vectors among N processors as
zpart, xpart, and ypart. Calls can return
as soon as their participation is complete.
14. Example
- Vecsum - basic collective communications calls
- MPI_SCATTER - distribute an array evenly among processors
- MPI_GATHER - collect pieces of an array from processors
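The next two slides walk through the vector sum; as a preview, here is a minimal C sketch of the scatter/compute/gather pattern. The names xpart, ypart, and zpart follow slide 13; the fixed length NGLOBAL and the assumption that it divides evenly by the process count are illustrative, not the workshop's actual code.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define NGLOBAL 16   /* illustrative size, assumed divisible by the process count */

int main(int argc, char **argv)
{
    int rank, size, nlocal, i;
    double x[NGLOBAL], y[NGLOBAL], z[NGLOBAL];
    double *xpart, *ypart, *zpart;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    nlocal = NGLOBAL / size;

    if (rank == 0)                       /* root initializes the full vectors */
        for (i = 0; i < NGLOBAL; i++) { x[i] = i; y[i] = 2.0 * i; }

    xpart = malloc(nlocal * sizeof(double));
    ypart = malloc(nlocal * sizeof(double));
    zpart = malloc(nlocal * sizeof(double));

    /* distribute equal-sized pieces of x and y */
    MPI_Scatter(x, nlocal, MPI_DOUBLE, xpart, nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Scatter(y, nlocal, MPI_DOUBLE, ypart, nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (i = 0; i < nlocal; i++)         /* local piece of the sum */
        zpart[i] = xpart[i] + ypart[i];

    /* collect the pieces of z back on the root */
    MPI_Gather(zpart, nlocal, MPI_DOUBLE, z, nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("z[NGLOBAL-1] = %g\n", z[NGLOBAL - 1]);

    free(xpart); free(ypart); free(zpart);
    MPI_Finalize();
    return 0;
}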
15. Vector Sum
16. Vector Sum - cont'd
17. Example 2: Matrix Multiplication (Easy) in C
Two versions, depending on whether or not the number of rows of C and A is evenly divisible by the number of processes. Uses the following MPI calls:
MPI_BCAST, MPI_BARRIER, MPI_SCATTERV, MPI_GATHERV
18. Serial Code in C/C++
Note that all the arrays are accessed in row-major order. Hence, it makes sense to distribute the arrays by rows.
- for(i=0; i<nrow_c; i++)
-   for(j=0; j<ncol_c; j++)
-     c[i][j] = 0.0e0;
- for(i=0; i<nrow_c; i++)
-   for(k=0; k<ncol_a; k++)
-     for(j=0; j<ncol_c; j++)
-       c[i][j] += a[i][k]*b[k][j];
19. Matrix Multiplication in C: Parallel Example
20. Collective Communications - Scatter/Gather
Scatter - distribute pieces of an array from one process to many; gather - collect pieces of an array from many processes back to one.
MPI_GATHER, MPI_SCATTER, MPI_GATHERV, MPI_SCATTERV
21. Flavors of Scatter/Gather
- Equal-sized pieces of data distributed to each processor
- MPI_SCATTER, MPI_GATHER
- Unequal-sized pieces of data distributed
- MPI_SCATTERV, MPI_GATHERV
- Must specify arrays of the sizes of the data pieces and their displacements from the start of the data to be distributed or collected.
- Both of these arrays are of length equal to the size of the communications group.
22. Scatter/Scatterv Calling Syntax
- int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
- int MPI_Scatterv(void *sendbuf, int *sendcounts, int *offsets, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
23. Abbreviated Parallel Code (Equal size data blocks)
- ierr=MPI_Scatter(a, nrow_a*ncol_a/size, ...);
- ierr=MPI_Bcast(b, nrow_b*ncol_b, ...);
- for(i=0; i<nrow_c/size; i++)
-   for(j=0; j<ncol_c; j++)
-     cpart[i][j] = 0.0e0;
- for(i=0; i<nrow_c/size; i++)
-   for(k=0; k<ncol_a; k++)
-     for(j=0; j<ncol_c; j++)
-       cpart[i][j] += apart[i][k]*b[k][j];
- ierr=MPI_Gather(cpart, (nrow_c/size)*ncol_c, ...);
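For reference, a hedged sketch of what the elided "..." arguments above typically expand to is given below, assuming contiguous row-major double arrays, root process 0, and row counts evenly divisible by the number of processes. The function wrapper and its parameter names are illustrative, not the workshop's source.

#include <mpi.h>

void distribute_multiply_collect(double *a, double *b, double *c,
                                 double *apart, double *cpart,
                                 int nrow_a, int ncol_a,
                                 int nrow_b, int ncol_b,
                                 int nrow_c, int ncol_c, int size)
{
    const int root = 0;
    int i, j, k, nrow_local = nrow_c / size;

    /* each process receives nrow_a/size rows of A */
    MPI_Scatter(a, (nrow_a / size) * ncol_a, MPI_DOUBLE,
                apart, (nrow_a / size) * ncol_a, MPI_DOUBLE,
                root, MPI_COMM_WORLD);

    /* every process receives a full copy of B */
    MPI_Bcast(b, nrow_b * ncol_b, MPI_DOUBLE, root, MPI_COMM_WORLD);

    /* local multiply: cpart = apart * b, stored row-major */
    for (i = 0; i < nrow_local; i++)
        for (j = 0; j < ncol_c; j++)
            cpart[i * ncol_c + j] = 0.0;
    for (i = 0; i < nrow_local; i++)
        for (k = 0; k < ncol_a; k++)
            for (j = 0; j < ncol_c; j++)
                cpart[i * ncol_c + j] += apart[i * ncol_a + k] * b[k * ncol_b + j];

    /* collect the locally computed rows of C on the root */
    MPI_Gather(cpart, nrow_local * ncol_c, MPI_DOUBLE,
               c, nrow_local * ncol_c, MPI_DOUBLE,
               root, MPI_COMM_WORLD);
}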
24. Abbreviated Parallel Code (Unequal size data blocks)
- ierr=MPI_Scatterv(a, a_chunk_sizes, a_offsets, ...);
- ierr=MPI_Bcast(b, nrow_b*ncol_b, ...);
- for(i=0; i<c_chunk_sizes[rank]/ncol_c; i++)
-   for(j=0; j<ncol_c; j++)
-     cpart[i][j] = 0.0e0;
- for(i=0; i<c_chunk_sizes[rank]/ncol_c; i++)
-   for(k=0; k<ncol_a; k++)
-     for(j=0; j<ncol_c; j++)
-       cpart[i][j] += apart[i][k]*b[k][j];
- ierr=MPI_Gatherv(cpart, c_chunk_sizes[rank], MPI_DOUBLE, ...);
- Look at the C code to see how the sizes and offsets are done (a sketch follows below).
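One common way to build the chunk-size and offset arrays is sketched below. The array names follow the slide, but the "first (nrow_a % size) processes get one extra row" policy is an assumption for illustration, not necessarily what the workshop code does. Counts and offsets are in units of array elements, as MPI_Scatterv expects.

void build_row_chunks(int nrow_a, int ncol_a, int size,
                      int *a_chunk_sizes, int *a_offsets)
{
    int p, rows, offset = 0;

    for (p = 0; p < size; p++) {
        rows = nrow_a / size + (p < nrow_a % size ? 1 : 0);
        a_chunk_sizes[p] = rows * ncol_a;   /* elements sent to process p   */
        a_offsets[p]     = offset;          /* displacement from start of a */
        offset += rows * ncol_a;
    }
}

Each process then receives its own block with MPI_Scatterv(a, a_chunk_sizes, a_offsets, MPI_DOUBLE, apart, a_chunk_sizes[rank], MPI_DOUBLE, root, MPI_COMM_WORLD).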
25. Fortran version
- F77 - no dynamic memory allocation.
- F90 - allocatable arrays, allocated in contiguous memory.
- Multi-dimensional arrays are stored in memory in column-major order.
- Questions for the student:
- How should we distribute the data in this case? What about loop ordering?
- We never distributed the B matrix. What if B is large?
26. Example 3: Vector Matrix Product in C
Illustrates MPI_Scatterv, MPI_Reduce, MPI_Bcast.
27. Main part of parallel code
- ierr=MPI_Scatterv(a, a_chunk_sizes, a_offsets, MPI_DOUBLE, apart, a_chunk_sizes[rank], MPI_DOUBLE, root, MPI_COMM_WORLD);
- ierr=MPI_Scatterv(btmp, b_chunk_sizes, b_offsets, MPI_DOUBLE, bparttmp, b_chunk_sizes[rank], MPI_DOUBLE, root, MPI_COMM_WORLD);
- initialize cpart to zero
- for(k=0; k<a_chunk_sizes[rank]; k++)
-   for(j=0; j<ncol_c; j++)
-     cpart[j] += apart[k]*bpart[k][j];
- ierr=MPI_Reduce(cpart, c, ncol_c, MPI_DOUBLE, MPI_SUM, root, MPI_COMM_WORLD);
28. Collective Communications - Allgather
Allgather - gather pieces of an array from all processes and deliver the complete array to every process.
MPI_ALLGATHER
29. Collective Communications - Alltoall
Alltoall - every process sends a distinct block of data to every other process; block i of each send buffer ends up on process i.
MPI_ALLTOALL
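Since the body of this slide did not survive extraction, here is a minimal C sketch (not the workshop's code) of what MPI_Alltoall does; this exchange pattern is the building block of the parallel transpose on the next slides.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int *sendbuf, *recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = malloc(size * sizeof(int));
    recvbuf = malloc(size * sizeof(int));
    for (i = 0; i < size; i++)
        sendbuf[i] = 100 * rank + i;   /* element i is destined for process i */

    /* block i of sendbuf goes to process i; block i of recvbuf comes from process i */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d received:", rank);
    for (i = 0; i < size; i++)
        printf(" %d", recvbuf[i]);
    printf("\n");

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}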
30. Transpose in Serial 2D-FFT
[Figure: perform 1D x-transforms on contiguous data (by columns in Fortran); transpose the 2D array, then perform the y-transforms on contiguous data in columns.]
31. Transpose in Parallel 2D-FFT
[Figure: as above, but with the 2D array distributed across processes - perform the 1D x-transforms on contiguous local data, transpose the distributed array, then perform the y-transforms on contiguous data in columns.]
32. Exercise: Vector Matrix Product in C
Rewrite Example 3 to perform the vector matrix
product as shown.
33. Poisson Equation on a 2D Grid, periodic boundary conditions
34. Serial Poisson Solver
- F90 code
- N x N matrices for rho and phi
- Initialize rho.
- Discretize the equation.
- Iterate until convergence.
- Output results.
35. Serial Poisson Solver: Solution
36. Serial Poisson Solver (cont)
- do j=1, M
-   do i=1, N    ! Fortran accesses down columns first
-     phi(i,j) = rho(i,j) +
-                .25 * ( phi_old( modulo(i,N)+1, j )   +
-                        phi_old( modulo(i-2,N)+1, j ) +
-                        phi_old( i, modulo(j,N)+1 )   +
-                        phi_old( i, modulo(j-2,N)+1 ) )
-   enddo
- enddo
37. Parallel Poisson Solver in MPI: domain decomposition, 3 x 5 processor grid
[Figure: the domain is split among 15 processors arranged in a 3 x 5 grid, labeled P(0,0) through P(2,4).]
38. Parallel Poisson Solver in MPI: Processor Grid, e.g. 3 x 5, N = M = 64
39. Parallel Poisson Solver in MPI: boundary data movement each iteration
[Figure: each iteration, every processor in the 3 x 5 grid exchanges boundary data with its north, south, east, and west neighbors.]
40. Ghost Cells, Local Indices
[Figure: local array on processor P(1,2) with N_local = 21 rows and M_local = 13 columns; interior local indices run 1..21 and 1..13, surrounded by ghost rows at indices 0 and 22 and ghost columns at indices 0 and 14.]
41. Data Movement, e.g. Shift Right (East)
[Figure: boundary data (columns) shifted east into the neighboring processors' ghost cells.]
42. Communicators and Topologies
- A communicator is a set of processors which can talk to each other.
- The basic communicator is MPI_COMM_WORLD.
- One can create new groups or subgroups of processors from MPI_COMM_WORLD or other communicators.
- MPI allows one to associate a Cartesian or Graph topology with a communicator.
43. MPI Cartesian Topology Functions
- MPI_CART_CREATE( old_comm, nmbr_of_dims, dim_sizes(), wrap_around(), reorder, cart_comm, ierr )
- old_comm = MPI_COMM_WORLD
- nmbr_of_dims = 2
- dim_sizes() = (np_rows, np_cols) = (3, 5)
- wrap_around = ( .true., .true. )
- reorder = .false. (generally set to .true.)
- allows the system to reorder the processors for better performance
- cart_comm = grid_comm (name for the new communicator)
44. MPI Cartesian Topology Functions
- MPI_CART_RANK( comm, coords(), rank, ierr )
- comm = grid_comm
- coords() = ( coords(1), coords(2) ), e.g. (0,2) for P(0,2)
- rank = processor rank inside grid_comm
- returns the rank of the processor with coordinates coords()
- MPI_CART_COORDS( comm, rank, nmbr_of_dims, coords(), ierr )
- nmbr_of_dims = 2
- returns the coordinates of the processor in grid_comm given its rank in grid_comm
45. MPI Cartesian Topology Functions
- MPI_CART_SUB( grid_comm, free_coords(), sub_comm, ierr )
- grid_comm = communicator with a topology
- free_coords() = ( .false., .true. ) -> (i fixed, j varies), i.e. row communicator
- free_coords() = ( .true., .false. ) -> (i varies, j fixed), i.e. column communicator
- sub_comm = the new sub-communicator (say row_comm or col_comm)
46. MPI Cartesian Topology Functions
- MPI_CART_SHIFT( grid_comm, direction, disp, rank_recv_from, rank_send_to, ierr )
- grid_comm = communicator with a topology
- direction = 0 -> i varies -> column shift, N or S
-           = 1 -> j varies -> row shift, E or W
- disp = how many processors to shift over (+ or -)
- e.g. N shift: direction=0, disp=-1
-      S shift: direction=0, disp=+1
-      E shift: direction=1, disp=+1
-      W shift: direction=1, disp=-1
47. MPI Cartesian Topology Functions
- MPI_CART_SHIFT( grid_comm, direction, disp, rank_recv_from, rank_send_to, ierr )
- MPI_CART_SHIFT does not actually perform any data transfer. It returns two ranks.
- rank_recv_from = the rank of the processor from which the calling processor will receive the new data
- rank_send_to = the rank of the processor to which data will be sent from the calling processor
- Note: MPI_CART_SHIFT does the modulo arithmetic if the corresponding dimension has wrap_around() = .true.
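Slides 43-47 show the Fortran interfaces; a compact C sketch of the same sequence (create the 3 x 5 periodic grid, look up this process's coordinates, build a row sub-communicator, and compute the ranks for a north shift) is given below. It assumes exactly 15 processes (e.g. mpirun -np 15) and is illustrative only, not the workshop's source.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int dims[2] = {3, 5};          /* np_rows x np_cols               */
    int periods[2] = {1, 1};       /* wrap_around = (.true., .true.)  */
    int reorder = 0;               /* .false. here; often .true.      */
    int coords[2], grid_rank;
    int rank_recv_from, rank_send_to;
    int free_coords[2] = {0, 1};   /* i fixed, j varies -> row communicator */
    MPI_Comm grid_comm, row_comm;

    MPI_Init(&argc, &argv);

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &grid_comm);
    MPI_Comm_rank(grid_comm, &grid_rank);
    MPI_Cart_coords(grid_comm, grid_rank, 2, coords);

    /* sub-communicator containing this process's row of the grid */
    MPI_Cart_sub(grid_comm, free_coords, &row_comm);

    /* N (upward) shift: direction 0 (i varies), displacement -1.
       Only ranks are returned; the data movement itself uses MPI_Sendrecv. */
    MPI_Cart_shift(grid_comm, 0, -1, &rank_recv_from, &rank_send_to);

    printf("rank %d is P(%d,%d); north shift: recv from %d, send to %d\n",
           grid_rank, coords[0], coords[1], rank_recv_from, rank_send_to);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&grid_comm);
    MPI_Finalize();
    return 0;
}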
48. Parallel Poisson Solver: N (upward) shift in columns
- ! N or upward shift
- ! P(i+1,j) --(recv from)--> P(i,j) --(send to)--> P(i-1,j)
- direction = 0    ! i varies
- disp = -1        ! i -> i-1
- top_bottom_buffer = phi_old_local(1,:)
- call MPI_CART_SHIFT( grid_comm, direction, disp, rank_recv_from, rank_send_to, ierr )
- call MPI_SENDRECV( top_bottom_buffer, M_local+1, MPI_DOUBLE_PRECISION, rank_send_to, tag,
-                    bottom_ghost_cells, M_local, MPI_DOUBLE_PRECISION, rank_recv_from, tag,
-                    grid_comm, status, ierr )
- phi_old_local(N_local+1,:) = bottom_ghost_cells
49. Parallel Poisson Solver: Main computation
- do j=1, M_local
-   do i=1, N_local
-     phi(i,j) = rho(i,j) +
-                .25 * ( phi_old( i+1, j ) + phi_old( i-1, j ) +
-                        phi_old( i, j+1 ) + phi_old( i, j-1 ) )
-   enddo
- enddo
- Note: indices are all within range now due to the ghost cells.
50. Parallel Poisson Solver: Global vs. Local Indices
- i_offset = 0
- do i = 1, coord(1)
-   i_offset = i_offset + nmbr_local_rows(i)
- enddo
- j_offset = 0
- do j = 1, coord(2)
-   j_offset = j_offset + nmbr_local_cols(j)
- enddo
- do j = j_offset+1, j_offset + M_local      ! global indices
-   y = (real(j)-.5)/M*Ly - Ly/2
-   do i = i_offset+1, i_offset + N_local    ! global indices
-     x = (real(i)-.5)/N*Lx
-     makerho_local(i-i_offset, j-j_offset) = f(x,y)   ! store with local indices
-   enddo
- enddo
51. Parallel Poisson Solver in MPI: processor grid, e.g. 3 x 5, N = M = 64
[Figure: global index ranges owned by each processor in the 3 x 5 grid. Column (M) ranges: 1-13, 14-26, 27-39, 40-52, 53-64; row (N) ranges: 1-22, 23-43, 44-64.]
52. MPI Reduction Communication Functions
- Point-to-point communications in the N, S, E, W shifts:
- MPI_SENDRECV( sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status, ierr )
- Reduction operations in the computation:
- MPI_ALLREDUCE( sendbuf, recvbuf, count, datatype, operation, comm, ierr )
- operation = MPI_SUM, MPI_MAX, MPI_MINLOC, ...
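For example, MPI_Allreduce lets every process agree on a global residual when testing for convergence of the iteration. The C sketch below is illustrative (the function name and tolerance handling are assumptions, not the workshop's code).

#include <math.h>
#include <mpi.h>

/* Returns 1 on every process when the global maximum local residual
   has dropped below tol. */
int converged(double local_residual, double tol, MPI_Comm comm)
{
    double global_residual;

    /* every process receives the same maximum over all local residuals */
    MPI_Allreduce(&local_residual, &global_residual, 1,
                  MPI_DOUBLE, MPI_MAX, comm);

    return fabs(global_residual) < tol;
}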
53. I/O of final results, Step 1: in row_comm, Gatherv the columns into matrices of size (local rows) x M
[Figure: within each row of the 3 x 5 processor grid, the local blocks are gathered into a strip spanning the full width M.]
54. I/O of final results, Step 2: transpose the matrix; in row_comm, Gatherv the columns into an M x N matrix. The result is the transpose of the matrix for phi.
[Figure: the N x M strips are transposed and gathered again into an M x N matrix.]
55. References - MPI Tutorial
- PACS online course
- http://webct.ncsa.uiuc.edu:8900/
- Edinburgh Parallel Computing Center
- http://www.epcc.ed.ac.uk/epic/mpi/notes/mpi-course-epic.book_1.html
- Argonne National Laboratory (MPICH)
- http://www-unix.mcs.anl.gov/mpi/
- MPI Forum
- http://www.mpi-forum.org/docs/docs.html
- MPI: The Complete Reference (vols. 1, 2)
- Vol. 1 at http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
- IBM (MPI on the RS/6000 (IBM SP))
- http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts
56. References: Some useful books
- MPI: The Complete Reference
- Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker and Jack Dongarra, MIT Press
- examples/mpidocs/mpi_complete_reference.ps.Z
- Parallel Programming with MPI
- Peter S. Pacheco, Morgan Kaufmann Publishers, Inc.
- Using MPI: Portable Parallel Programming with the Message Passing Interface
- William Gropp, E. Lusk and A. Skjellum, MIT Press