Transcript and Presenter's Notes

Title: MPI Workshop II


1
MPI Workshop - II
  • Introduction to Collective Communications
  • HPC@UNM Research Staff
  • Dr. Andrew C. Pineda, Dr. Paul M. Alsing
  • Week 2 of 2

2
Today's Topics
  • Course Map
  • Basic Collective Communications
  • MPI_Barrier
  • MPI_Scatterv, MPI_Gatherv, MPI_Reduce
  • MPI Routines/Exercises
  • Pi, Matrix-Matrix mult., Vector-Matrix mult.
  • Other Collective Calls
  • Cartesian Topology Example
  • References

3
Course Roadmap
4
Example 1 - Pi Calculation

Uses the following MPI calls:
MPI_BARRIER, MPI_BCAST, MPI_REDUCE
5
Integration Domain - Serial

[Figure: the interval divided into N subintervals with grid points x0, x1, x2, x3, ..., xN]
6
Serial Pseudocode
  f(x) = 1/(1+x^2)              Example:
  h = 1/N, sum = 0.0            N = 10, h = 0.1
  do i = 1, N                   x = .05, .15, .25, .35, .45,
    x = h*(i - 0.5)                 .55, .65, .75, .85, .95
    sum = sum + f(x)
  enddo
  pi = h * sum

7
Integration Domain - Parallel
8
Parallel Pseudocode
  P(0) reads in N and broadcasts N to each processor
  f(x) = 1/(1+x^2)              Example:
  h = 1/N, sum = 0.0            N = 10, h = 0.1
  do i = rank+1, N, nprocrs     Procrs: P(0), P(1), P(2)
    x = h*(i - 0.5)             P(0) -> .05, .35, .65, .95
    sum = sum + f(x)            P(1) -> .15, .45, .75
  enddo                         P(2) -> .25, .55, .85
  mypi = h * sum
  Collect (Reduce) mypi from each processor into a
  collective value of pi on the output processor
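A minimal, self-contained C sketch of this program, assuming the standard MPI C bindings; variable names such as nprocrs and mypi follow the pseudocode above, and the final factor of 4 accounts for the integral of 1/(1+x^2) on [0,1] being pi/4:

    /* pi by numerical integration with MPI_Bcast and MPI_Reduce (illustrative sketch) */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nprocrs, N = 0, i;
        double h, x, sum = 0.0, mypi, pi;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocrs);

        if (rank == 0) N = 10;                       /* P(0) "reads in" N */
        MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);

        h = 1.0 / (double) N;
        for (i = rank + 1; i <= N; i += nprocrs) {   /* cyclic distribution of the points */
            x = h * ((double) i - 0.5);
            sum += 1.0 / (1.0 + x * x);
        }
        mypi = h * sum;

        /* collect the partial sums onto P(0) */
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("pi is approximately %.16f\n", 4.0 * pi);

        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and run with mpirun -np 3, ranks 0-2 pick up the interleaved points shown above.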

9
Lab exercise 1
  • ssh -X ll (ssh -X user@ll.alliance.unm.edu
    from outside)
  • cd mpi1/hello-world
  • mpif77 -o fhello hello.f
  • (or use the included makefile)
  • make fhello
  • qsub -I -q R11413 -l nodes=2:ppn=2,walltime=1:00:00
  • mpirun -np 4 -nolocal -machinefile
    $PBS_NODEFILE fhello
  • mpirun -np 8 -nolocal -machinefile
    $PBS_NODEFILE fhello
  • exit (exits the interactive batch session).
  • If you want to see the hosts that the MPI
    processes are mapped to, compile and run
    hello-name.(c/f). You'll see something
    interesting if you run without the -nolocal flag
    under the ch_p4 interface. (ch_p4 is the default
    environment for the training guest accounts.)

10
Collective Communications - Broadcast
broadcast - copies a piece of data from one process
to all processes.
MPI_BCAST
11
Collective Communications - Reduction
Reduction - collect data back to 1 process,
performing an associative operation on the data,
e.g. addition, product, maximum, etc.
  • MPI_REDUCE
  • MPI_SUM, MPI_PROD, MPI_MAX, MPI_MIN, MPI_LAND,
    MPI_BAND, ...

12
Collective Communications - Synchronization
  • Collective calls can (but are not required to)
    return as soon as their participation in a
    collective call is complete.
  • Return from a call does NOT indicate that other
    processes have completed their part in the
    communication.
  • Occasionally, it is necessary to force the
    synchronization of processes.
  • MPI_BARRIER
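A common case where a barrier is genuinely needed is timing a parallel region; a minimal sketch, assuming MPI_Wtime for the clock and a hypothetical do_work() standing in for the local computation:

    #include <stdio.h>
    #include <mpi.h>

    static void do_work(void) { /* placeholder for the local computation being timed */ }

    int main(int argc, char **argv)
    {
        int rank;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);   /* ensure everyone enters the timed region together */
        t0 = MPI_Wtime();
        do_work();
        MPI_Barrier(MPI_COMM_WORLD);   /* wait for the slowest process before stopping the clock */
        t1 = MPI_Wtime();

        if (rank == 0) printf("elapsed time: %g seconds\n", t1 - t0);

        MPI_Finalize();
        return 0;
    }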

13
Collective Communications
Broadcast the coefficients to all processors.
Scatter the vectors among N processors as
zpart, xpart, and ypart. Calls can return
as soon as their participation is complete.
14
Example
  • Vecsum - Basic collective communications calls
  • MPI_SCATTER - distribute an array evenly among
    processors
  • MPI_GATHER - collect pieces of an array from
    processors
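A minimal vecsum-style sketch, assuming the vector length divides evenly by the number of processes; the array and variable names are illustrative, not the course's actual vecsum code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        const int n = 16;             /* total length; assumed divisible by the process count */
        int rank, size, i, nlocal;
        double x[16], y[16], z[16];
        double *xpart, *ypart, *zpart;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        nlocal = n / size;

        if (rank == 0)                /* the root fills the full vectors */
            for (i = 0; i < n; i++) { x[i] = i; y[i] = 2.0 * i; }

        xpart = malloc(nlocal * sizeof(double));
        ypart = malloc(nlocal * sizeof(double));
        zpart = malloc(nlocal * sizeof(double));

        /* hand each process its contiguous piece of x and y */
        MPI_Scatter(x, nlocal, MPI_DOUBLE, xpart, nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Scatter(y, nlocal, MPI_DOUBLE, ypart, nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        for (i = 0; i < nlocal; i++)  /* local piece of the sum */
            zpart[i] = xpart[i] + ypart[i];

        /* collect the pieces of z back on the root */
        MPI_Gather(zpart, nlocal, MPI_DOUBLE, z, nlocal, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("z[n-1] = %g\n", z[n - 1]);

        free(xpart); free(ypart); free(zpart);
        MPI_Finalize();
        return 0;
    }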

15
Vector Sum
16
Vector Sum (cont'd)
17
Example 2: Matrix Multiplication (Easy) in C

Two versions, depending on whether or not the rows
of C and A are evenly divisible by the number of
processes. Uses the following MPI calls:
MPI_BCAST, MPI_BARRIER, MPI_SCATTERV, MPI_GATHERV
18
Serial Code in C/C
Note that all the arrays are accessed in row-major
order. Hence, it makes sense to distribute the
arrays by rows.

  for(i=0; i<nrow_c; i++)
    for(j=0; j<ncol_c; j++)
      c[i][j] = 0.0e0;
  for(i=0; i<nrow_c; i++)
    for(k=0; k<ncol_a; k++)
      for(j=0; j<ncol_c; j++)
        c[i][j] += a[i][k]*b[k][j];

19
Matrix Multiplication in C: Parallel Example

20
Collective Communications - Scatter/Gather
scatter/gather - distributes/collects pieces of
an array from 1 process to many
MPI_GATHER, MPI_SCATTER, MPI_GATHERV, MPI_SCATTERV
21
Flavors of Scatter/Gather
  • Equal-sized pieces of data distributed to each
    processor
  • MPI_SCATTER, MPI_GATHER
  • Unequal-sized pieces of data distributed
  • MPI_SCATTERV, MPI_GATHERV
  • Must specify arrays of sizes of data and their
    displacements from the start of the data to be
    distributed or collected.
  • Both of these arrays have length equal to the
    size of the communications group
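A hedged sketch of building those two arrays and using them with MPI_Scatterv, assuming a 1-D array of doubles split as evenly as possible (the first n % size processes get one extra element); the names sendcounts and displs are illustrative:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        const int n = 10;             /* total elements, not necessarily divisible by size */
        int rank, size, i;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *sendcounts = malloc(size * sizeof(int));   /* one entry per process in the group */
        int *displs     = malloc(size * sizeof(int));
        for (i = 0; i < size; i++) {
            sendcounts[i] = n / size + (i < n % size ? 1 : 0);
            displs[i]     = (i == 0) ? 0 : displs[i - 1] + sendcounts[i - 1];
        }

        double a[10];
        if (rank == 0) for (i = 0; i < n; i++) a[i] = i;

        double *apart = malloc(sendcounts[rank] * sizeof(double));
        MPI_Scatterv(a, sendcounts, displs, MPI_DOUBLE,
                     apart, sendcounts[rank], MPI_DOUBLE, 0, MPI_COMM_WORLD);

        printf("rank %d received %d elements starting at global index %d\n",
               rank, sendcounts[rank], displs[rank]);

        free(sendcounts); free(displs); free(apart);
        MPI_Finalize();
        return 0;
    }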

22
Scatter/Scatterv Calling Syntax
  int MPI_Scatter(void *sendbuf, int sendcount,
      MPI_Datatype sendtype, void *recvbuf, int recvcount,
      MPI_Datatype recvtype, int root, MPI_Comm comm)

  int MPI_Scatterv(void *sendbuf, int *sendcounts,
      int *offsets, MPI_Datatype sendtype, void *recvbuf,
      int recvcount, MPI_Datatype recvtype, int root,
      MPI_Comm comm)

23
Abbreviated Parallel Code (Equal size data
blocks)
  ierr = MPI_Scatter(a, nrow_a*ncol_a/size, ...);
  ierr = MPI_Bcast(b, nrow_b*ncol_b, ...);
  for(i=0; i<nrow_c/size; i++)
    for(j=0; j<ncol_c; j++)
      cpart[i][j] = 0.0e0;
  for(i=0; i<nrow_c/size; i++)
    for(k=0; k<ncol_a; k++)
      for(j=0; j<ncol_c; j++)
        cpart[i][j] += apart[i][k]*b[k][j];
  ierr = MPI_Gather(cpart, (nrow_c/size)*ncol_c, ...);
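A self-contained sketch of how the elided arguments could be filled in, under the assumptions of double-precision data, root 0, MPI_COMM_WORLD, and small fixed 8 x 8 matrices (run with a process count that divides 8); the actual course code may set things up differently:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define NROW 8
    #define NCOL 8

    int main(int argc, char **argv)
    {
        int rank, size, i, j, k, ierr;
        static double a[NROW][NCOL], b[NROW][NCOL], c[NROW][NCOL];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int nrow_local = NROW / size;        /* assumes NROW % size == 0 */
        double (*apart)[NCOL] = malloc(sizeof(double) * nrow_local * NCOL);
        double (*cpart)[NCOL] = malloc(sizeof(double) * nrow_local * NCOL);

        if (rank == 0)                       /* the root fills a and b */
            for (i = 0; i < NROW; i++)
                for (j = 0; j < NCOL; j++) { a[i][j] = 1.0; b[i][j] = 1.0; }

        /* each process receives NROW/size rows of a into apart */
        ierr = MPI_Scatter(a, nrow_local * NCOL, MPI_DOUBLE,
                           apart, nrow_local * NCOL, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        /* every process gets the whole of b */
        ierr = MPI_Bcast(b, NROW * NCOL, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        for (i = 0; i < nrow_local; i++)
            for (j = 0; j < NCOL; j++)
                cpart[i][j] = 0.0e0;
        for (i = 0; i < nrow_local; i++)
            for (k = 0; k < NCOL; k++)
                for (j = 0; j < NCOL; j++)
                    cpart[i][j] += apart[i][k] * b[k][j];

        /* collect the row blocks of c back on the root */
        ierr = MPI_Gather(cpart, nrow_local * NCOL, MPI_DOUBLE,
                          c, nrow_local * NCOL, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("c[0][0] = %g (expect %d)\n", c[0][0], NCOL);
        free(apart); free(cpart);
        MPI_Finalize();
        return ierr == MPI_SUCCESS ? 0 : 1;
    }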

24
Abbreviated Parallel Code (Unequal size data
blocks)
  ierr = MPI_Scatterv(a, a_chunk_sizes, a_offsets, ...);
  ierr = MPI_Bcast(b, nrow_b*ncol_b, ...);
  for(i=0; i<c_chunk_sizes[rank]/ncol_c; i++)
    for(j=0; j<ncol_c; j++)
      cpart[i][j] = 0.0e0;
  for(i=0; i<c_chunk_sizes[rank]/ncol_c; i++)
    for(k=0; k<ncol_a; k++)
      for(j=0; j<ncol_c; j++)
        cpart[i][j] += apart[i][k]*b[k][j];
  ierr = MPI_Gatherv(cpart, c_chunk_sizes[rank],
                     MPI_DOUBLE, ...);

Look at the C code to see how the sizes and offsets
are done.
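A hedged guess at how such row-block sizes and offsets could be computed; the helper below is purely illustrative and the course's C code may differ:

    /* Illustrative only: split nrow rows of an (nrow x ncol) matrix over size processes,
       giving the first nrow % size processes one extra row. Counts and offsets are in
       elements, as MPI_Scatterv/MPI_Gatherv expect. */
    void chunk_rows(int nrow, int ncol, int size, int *chunk_sizes, int *offsets)
    {
        int p, rows;
        for (p = 0; p < size; p++) {
            rows = nrow / size + (p < nrow % size ? 1 : 0);
            chunk_sizes[p] = rows * ncol;
            offsets[p] = (p == 0) ? 0 : offsets[p - 1] + chunk_sizes[p - 1];
        }
    }

The same helper, called with the dimensions of C, would produce c_chunk_sizes and c_offsets.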

25
Fortran version
  • F77 - no dynamic memory allocation.
  • F90 - allocatable arrays, arrays allocated in
    contiguous memory.
  • Multi-dimensional arrays are stored in memory in
    column major order.
  • Questions for the student.
  • How should we distribute the data in this case?
    What about loop ordering?
  • We never distributed B matrix. What if B is
    large?

26
Example 3: Vector-Matrix Product in C
Illustrates MPI_Scatterv, MPI_Reduce, MPI_Bcast
27
Main part of parallel code
  ierr = MPI_Scatterv(a, a_chunk_sizes, a_offsets, MPI_DOUBLE,
                      apart, a_chunk_sizes[rank], MPI_DOUBLE,
                      root, MPI_COMM_WORLD);
  ierr = MPI_Scatterv(btmp, b_chunk_sizes, b_offsets, MPI_DOUBLE,
                      bparttmp, b_chunk_sizes[rank], MPI_DOUBLE,
                      root, MPI_COMM_WORLD);
  /* initialize cpart to zero */
  for(k=0; k<a_chunk_sizes[rank]; k++)
    for(j=0; j<ncol_c; j++)
      cpart[j] += apart[k]*bpart[k][j];
  ierr = MPI_Reduce(cpart, c, ncol_c, MPI_DOUBLE, MPI_SUM,
                    root, MPI_COMM_WORLD);

28
Collective Communications - Allgather
MPI_ALLGATHER
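A minimal MPI_Allgather sketch, assuming each rank contributes a single integer and every rank needs the assembled array:

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, i;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *all = malloc(size * sizeof(int));
        /* every rank contributes its rank; afterwards all[] is identical on every process */
        MPI_Allgather(&rank, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

        if (rank == 0) {
            for (i = 0; i < size; i++) printf("%d ", all[i]);
            printf("\n");
        }
        free(all);
        MPI_Finalize();
        return 0;
    }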
29
Collective Communications - Alltoall
  • MPI_ALLTOALL

30
Transpose in Serial 2D-FFT
Perform 1D x-transforms on contiguous data (by
columns in Fortran). Transpose the 2D array, then
perform y-transforms on contiguous data in columns.
[Figure: x-y data layout before and after the transpose]
31
Transpose in Parallel 2D-FFT
Perform 1D x-transforms on contiguous data (by
columns in Fortran). Transpose the 2D array, then
perform y-transforms on contiguous data in columns.
[Figure: the same sequence with the 2D array distributed across processes]
32
Exercise: Vector-Matrix Product in C
Rewrite Example 3 to perform the vector-matrix
product as shown.
33
Poisson Equation on a 2D Grid, periodic boundary
conditions
34
Serial Poisson Solver
  • F90 code
  • N x N matrices for rho and phi
  • Initialize rho
  • Discretize the equation
  • Iterate until convergence
  • Output results

35
Serial Poisson Solver Solution
36
Serial Poisson Solver (cont)
  do j = 1, M
    do i = 1, N     ! Fortran accesses down columns first
      phi(i,j) = rho(i,j)                             &
               + .25 * ( phi_old( modulo(i,N)+1, j )  &
                       + phi_old( modulo(i-2,N)+1, j )&
                       + phi_old( i, modulo(j,N)+1 )  &
                       + phi_old( i, modulo(j-2,N)+1 ) )
    enddo
  enddo

37
Parallel Poisson Solver in MPI: domain decomposition
on a 3 x 5 processor grid
[Figure: 3 x 5 grid of processes labeled by coordinates (0,0) through (2,4)]
38
Parallel Poisson Solver in MPI: Processor Grid,
e.g. 3 x 5, N = M = 64
39
Parallel Poisson Solver in MPI: boundary data
movement each iteration
[Figure: 3 x 5 grid of processes (0,0) through (2,4) exchanging boundary data with their neighbors]
40
Ghost Cells, Local Indices
[Figure: the local array on process P(1,2), with N_local = 21 rows and M_local = 13 columns; local indices run 1..N_local and 1..M_local, and ghost rows/columns at indices 0, N_local+1 and 0, M_local+1 hold copies of neighboring processes' boundary data]
41
Data Movement, e.g. Shift Right (East)
[Figure: boundary data shifted into the east neighbor's ghost cells; legend distinguishes ghost cells from boundary data]
42
Communicators and Topologies
  • A Communicator is a set of processors which can
    talk to each other
  • The basic communicator is MPI_COMM_WORLD
  • One can create new groups or subgroups of
    processors from MPI_COMM_WORLD or other
    communicators
  • MPI allows one to associate a Cartesian or Graph
    topology with a communicator

43
MPI Cartesian Topology Functions
  • MPI_CART_CREATE( old_comm, nmbr_of_dims,
    dim_sizes(), wrap_around(), reorder, cart_comm,
    ierr )
  • old_comm = MPI_COMM_WORLD
  • nmbr_of_dims = 2
  • dim_sizes() = (np_rows, np_cols) = (3, 5)
  • wrap_around() = ( .true., .true. )
  • reorder = .false. (generally set to .true.)
  • allows the system to reorder the processors for
    better performance
  • cart_comm = grid_comm (name for the new
    communicator)

44
MPI Cartesian Topology Functions
  • MPI_CART_RANK( comm, coords(), rank, ierr )
  • comm = grid_comm
  • coords() = ( coords(1), coords(2) ), e.g. (0,2)
    for P(0,2)
  • rank = processor rank inside grid_comm
  • returns the rank of the processor with
    coordinates coords()
  • MPI_CART_COORDS( comm, rank, nmbr_of_dims,
    coords(), ierr )
  • nmbr_of_dims = 2
  • returns the coordinates of the processor in
    grid_comm given its rank in grid_comm
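A small C sketch of these calls, assuming the 3 x 5 grid from the previous slide (so it must be run with exactly 15 processes):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int dims[2] = {3, 5};            /* np_rows x np_cols; needs exactly 15 processes */
        int periods[2] = {1, 1};         /* wrap_around = (.true., .true.) */
        int my_rank, rank_back, coords[2];
        MPI_Comm grid_comm;

        MPI_Init(&argc, &argv);
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid_comm);

        MPI_Comm_rank(grid_comm, &my_rank);
        MPI_Cart_coords(grid_comm, my_rank, 2, coords);   /* rank -> (i, j) */
        MPI_Cart_rank(grid_comm, coords, &rank_back);     /* (i, j) -> rank */

        printf("rank %d <-> coordinates (%d,%d), round trip gives %d\n",
               my_rank, coords[0], coords[1], rank_back);

        MPI_Finalize();
        return 0;
    }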

45
MPI Cartesian Topology Functions
  • MPI_CART_SUB( grid_comm, free_coords(), sub_comm,
    ierr )
  • grid_comm = communicator with a topology
  • free_coords() = ( .false., .true. ) -> (i fixed,
    j varies), i.e. a row communicator
  • free_coords() = ( .true., .false. ) -> (i varies,
    j fixed), i.e. a column communicator
  • sub_comm = the new sub-communicator (say row_comm
    or col_comm)
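A short C sketch of splitting the same 3 x 5 grid into row and column communicators (again assuming exactly 15 processes; the names row_comm and col_comm follow the slide):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int dims[2] = {3, 5}, periods[2] = {1, 1};
        int free_coords[2];
        int grid_rank, row_rank, col_rank;
        MPI_Comm grid_comm, row_comm, col_comm;

        MPI_Init(&argc, &argv);
        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &grid_comm);
        MPI_Comm_rank(grid_comm, &grid_rank);

        free_coords[0] = 0; free_coords[1] = 1;   /* i fixed, j varies -> row communicator */
        MPI_Cart_sub(grid_comm, free_coords, &row_comm);

        free_coords[0] = 1; free_coords[1] = 0;   /* i varies, j fixed -> column communicator */
        MPI_Cart_sub(grid_comm, free_coords, &col_comm);

        MPI_Comm_rank(row_comm, &row_rank);
        MPI_Comm_rank(col_comm, &col_rank);
        printf("grid rank %d: rank %d in its row, rank %d in its column\n",
               grid_rank, row_rank, col_rank);

        MPI_Finalize();
        return 0;
    }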

46
MPI Cartesian Topology Functions
  • MPI_CART_SHIFT( grid_comm, direction, disp,
    rank_recv_from, rank_send_to, ierr )
  • grid_comm = communicator with a topology
  • direction = 0 -> i varies -> column shift (N or S)
  • direction = 1 -> j varies -> row shift (E or W)
  • disp = how many processors to shift over (+ or -)
  • e.g. N shift: direction = 0, disp = -1
  • S shift: direction = 0, disp = 1
  • E shift: direction = 1, disp = 1
  • W shift: direction = 1, disp = -1

47
MPI Cartesian Topology Functions
  • MPI_CART_SHIFT( grid_comm, direction, disp,
    rank_recv_from, rank_send_to, ierr )
  • MPI_CART_SHIFT does not actually perform any data
    transfer. It returns two ranks.
  • rank_recv_from = the rank of the processor from
    which the calling processor will receive the new
    data
  • rank_send_to = the rank of the processor to which
    data will be sent from the calling processor
  • Note: MPI_CART_SHIFT does the modulo arithmetic
    if the corresponding dimension has wrap_around()
    = .true.

48
Parallel Poisson Solver: N (upward) shift in columns

  ! N or upward shift:
  ! P(i+1,j) --(recv from)--> P(i,j) --(send to)--> P(i-1,j)
  direction = 0    ! i varies
  disp = -1        ! i -> i-1
  top_bottom_buffer = phi_old_local(1,:)
  call MPI_CART_SHIFT( grid_comm, direction, disp,              &
                       rank_recv_from, rank_send_to, ierr )
  call MPI_SENDRECV( top_bottom_buffer, M_local+1,              &
                     MPI_DOUBLE_PRECISION, rank_send_to, tag,   &
                     bottom_ghost_cells, M_local,               &
                     MPI_DOUBLE_PRECISION, rank_recv_from, tag, &
                     grid_comm, status, ierr )
  phi_old_local(N_local+1,:) = bottom_ghost_cells

49
Parallel Poisson Solver: Main computation
  do j = 1, M_local
    do i = 1, N_local
      phi(i,j) = rho(i,j)                  &
               + .25 * ( phi_old( i+1, j ) &
                       + phi_old( i-1, j ) &
                       + phi_old( i, j+1 ) &
                       + phi_old( i, j-1 ) )
    enddo
  enddo

  Note: the indices are all within range now due to
  the ghost cells.

50
Parallel Poisson Solver: Global vs. Local Indices
  i_offset = 0
  do i = 1, coord(1)
    i_offset = i_offset + nmbr_local_rows(i)
  enddo
  j_offset = 0
  do j = 1, coord(2)
    j_offset = j_offset + nmbr_local_cols(j)
  enddo
  do j = j_offset+1, j_offset + M_local      ! global indices
    y = (real(j)-.5)/M*Ly - Ly/2
    do i = i_offset+1, i_offset + N_local    ! global indices
      x = (real(i)-.5)/N*Lx
      makerho_local(i-i_offset, j-j_offset) = f(x,y)   ! store with local indices
    enddo
  enddo
51
Parallel Poisson Solver in MPI: processor grid,
e.g. 3 x 5, N = M = 64
[Figure: the 64 x 64 grid divided among the 3 x 5 processes; global row ranges 1-22, 23-43, 44-64 for process rows 0-2, and global column ranges 1-13, 14-26, 27-39, 40-52, 53-64 for process columns 0-4]
52
MPI Reduction Communication Functions
  • Point-to-point communications in the N, S, E, W
    shifts:
  • MPI_SENDRECV( sendbuf, sendcount, sendtype, dest,
    sendtag, recvbuf, recvcount, recvtype, source,
    recvtag, comm, status, ierr )
  • Reduction operations in the computation:
  • MPI_ALLREDUCE( sendbuf, recvbuf, count, datatype,
    operation, comm, ierr )
  • operation = MPI_SUM, MPI_MAX, MPI_MINLOC, ...
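As an illustration of the reduction step, a minimal C sketch using MPI_Allreduce for the kind of global convergence test a Poisson iteration needs; the local residual here is just a stand-in value:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double local_residual, global_residual;
        const double tol = 1.0e-6;   /* illustrative convergence tolerance */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* stand-in for the largest change in phi on this process's subdomain */
        local_residual = 1.0e-7 * (rank + 1);

        /* every process gets the global maximum, so all can test convergence together */
        MPI_Allreduce(&local_residual, &global_residual, 1, MPI_DOUBLE,
                      MPI_MAX, MPI_COMM_WORLD);

        if (rank == 0)
            printf("converged? %s (max residual %g)\n",
                   global_residual < tol ? "yes" : "no", global_residual);

        MPI_Finalize();
        return 0;
    }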

53
I/O of final results, Step 1: in row_comm, Gatherv
the columns into matrices of size (local rows) x M
[Figure: the 3 x 5 process grid; within each row communicator the column blocks are gathered into a (local rows) x M matrix]
54
I/O of final results, Step 2: transpose the matrix
gathered in row_comm, then Gatherv the columns into
an M x N matrix. The result is the transpose of the
matrix for phi.
[Figure: the gathered N x M pieces are transposed and assembled into an M x N matrix, i.e. the transpose of phi]
55
References - MPI Tutorial
  • PACS online course
  • http://webct.ncsa.uiuc.edu:8900/
  • Edinburgh Parallel Computing Center
  • http://www.epcc.ed.ac.uk/epic/mpi/notes/mpi-course-epic.book_1.html
  • Argonne National Laboratory (MPICH)
  • http://www-unix.mcs.anl.gov/mpi/
  • MPI Forum
  • http://www.mpi-forum.org/docs/docs.html
  • MPI: The Complete Reference (vols. 1, 2)
  • Vol. 1 at http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
  • IBM (MPI on the RS/6000 (IBM SP))
  • http://publib-b.boulder.ibm.com/Redbooks.nsf/RedbookAbstracts

56
References - Some useful books
  • MPI: The Complete Reference
  • Marc Snir, Steve Otto, Steven Huss-Lederman,
    David Walker and Jack Dongarra, MIT Press
  • examples/mpidocs/mpi_complete_reference.ps.Z
  • Parallel Programming with MPI
  • Peter S. Pacheco, Morgan Kaufmann Publishers, Inc.
  • Using MPI: Portable Parallel Programming with the
    Message-Passing Interface
  • William Gropp, E. Lusk and A. Skjellum, MIT Press

57
Transpose in Serial 2D-FFT
Perform 1D x-transforms on contiguous data (by
columns in Fortran). Transpose the 2D array, then
perform y-transforms on contiguous data in columns.
[Figure: x-y data layout before and after the transpose]
58
Transpose in Parallel 2D-FFT
Perform 1D x-transforms on contiguous data (by
columns in Fortran). Transpose the 2D array, then
perform y-transforms on contiguous data in columns.
[Figure: the same sequence with the 2D array distributed across processes]