Transcript and Presenter's Notes

Title: High Performance Parallel Programming


1
High Performance Parallel Programming
  • Dirk van der Knijff
  • Advanced Research Computing
  • Information Division

2
High Performance Parallel Programming
  • Lecture 8 Message Passing Interface (MPI) (part
    2)

3
Example problem
  • One-dimensional smoothing
  • each element set to the average of its neighbours

[Diagram: the one-dimensional array divided among processes P1, P2, P3, ..., Pn]
4
Deadlock
  • If we implement an algorithm like this (a sketch
    with blocking calls follows below)
  • for (iterations)
  • update all cells
  • send boundary values to neighbours
  • receive halo values from neighbours
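
A minimal sketch of this pattern in C (not from the slides; the buffers, sizes and update routine are hypothetical). The send is written as a synchronous send to make the failure explicit: every process enters the send first, no receive is ever posted, and all processes block - deadlock.

    #include <mpi.h>

    /* Hypothetical buffers and helper - sketch only. */
    extern double boundary[], halo[];
    void update_all_cells(void);

    void exchange_blocking(int n, int niter)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;

        for (int iter = 0; iter < niter; iter++) {
            update_all_cells();
            /* every process sends before anyone receives: with a
               synchronous send all processes block here forever */
            MPI_Ssend(boundary, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
            MPI_Recv(halo, n, MPI_DOUBLE, left, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }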

5
Non-blocking communications
  • Routine returns before the communication
    completes
  • Separate communication into phases
  • Initiate non-blocking communication
  • Do some work (perhaps involving other
    communications)
  • Wait for non-blocking communication to complete
  • Can test before waiting (or instead of)

6
Solution
  • So our algorithm now looks like this (a sketch
    follows below)
  • for(iterations)
  • update boundary cells
  • initiate sending of boundary values
  • initiate receipt of halo values
  • update non-boundary cells
  • wait for completion of sending boundary values
  • wait for completion of receiving halo values
  • Deadlock cannot occur
  • Communication can occur simultaneously in each
    direction
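
A sketch of the same exchange with non-blocking calls (again not from the slides; buffer and helper names are hypothetical). Communication is initiated first, the interior update overlaps it, and the waits guarantee completion before the buffers are reused.

    #include <mpi.h>

    /* Hypothetical buffers and helpers - sketch only. */
    extern double boundary[], halo[];
    void update_boundary_cells(void);
    void update_interior_cells(void);

    void exchange_nonblocking(int n, int niter)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;
        MPI_Request req[2];

        for (int iter = 0; iter < niter; iter++) {
            update_boundary_cells();
            /* initiate sending of boundary values and receipt of halo values */
            MPI_Isend(boundary, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[0]);
            MPI_Irecv(halo, n, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &req[1]);
            /* overlap: work that touches neither buffer */
            update_interior_cells();
            /* wait for completion of the send, then of the receive */
            MPI_Wait(&req[0], MPI_STATUS_IGNORE);
            MPI_Wait(&req[1], MPI_STATUS_IGNORE);
        }
    }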

7
Non-blocking communication in MPI
  • All the same arguments as blocking counterparts
    plus an extra argument
  • This argument, request, is a handle which is used
    to test when the operation has completed.
  • Same communication models as blocking mode
  • MPI_Isend Standard send
  • MPI_Issend Synchronous send
  • MPI_Ibsend Buffered send
  • MPI_Irsend Ready send
  • MPI_Irecv Receive

8
Handles
  • datatype - same as blocking (MPI_Datatype in C,
    integer in Fortran)
  • communicator - same as blocking (MPI_Comm in C,
    integer in Fortran)
  • request - MPI_Request in C, integer in Fortran
  • a request handle is allocated when a
    communication is initiated
  • MPI_Issend(buf, count, datatype, dest, tag,
    comm, handle)

9
Testing for completion
  • Two types
  • WAIT type
  • block until the communication has completed
  • useful when data or buffer is required
  • MPI_Wait(request, status)
  • TEST type
  • return TRUE or FALSE value depending on
    completion
  • do not block
  • useful if data is not yet required
  • MPI_Test(request, flag, status) (a usage sketch of
    both calls follows below)
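
A sketch of both completion calls on a single receive (the buffer, count and source are hypothetical): MPI_Test returns immediately so other work can continue, while MPI_Wait would simply block until the message had arrived.

    #include <mpi.h>

    void do_other_work(void);   /* hypothetical */

    /* Sketch: poll with MPI_Test until a posted receive completes. */
    void receive_when_ready(double *buf, int count, int src)
    {
        MPI_Request request;
        MPI_Status  status;
        int flag = 0;

        MPI_Irecv(buf, count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &request);

        MPI_Test(&request, &flag, &status);      /* returns immediately */
        while (!flag) {
            do_other_work();                     /* data not yet required */
            MPI_Test(&request, &flag, &status);
        }
        /* MPI_Wait(&request, &status) would instead block here until done */
    }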

10
Blocking and non-blocking
  • A non-blocking routine followed by a wait is
    equivalent to a blocking routine
  • Send and receive can be blocking or non-blocking
  • A blocking send can be used with a non-blocking
    receive and vice-versa
  • Non-blocking sends can use any mode
  • standard, synchronous, buffered, ready
  • Synchronous mode affects completion, not
    initiation
  • Cannot alter send buffer until send completed

11
Multiple communications
  • Sometimes have many non-blocking communications
    posted at the same time.
  • MPI provides routines to test multiple
    communications
  • Three types
  • test for all
  • test for any
  • test for some
  • Each type comes in both wait and test versions

12
MPI_Waitall
  • Waits for or tests all of the specified
    communications (a sketch follows below)
  • blocking
  • MPI_Waitall(count, array_of_requests,
    array_of_statuses)
  • non-blocking
  • MPI_Testall(count, array_of_requests, flag,
    array_of_statuses)
  • Information about each communication is returned
    in array_of_statuses
  • flag is set to true if all the communications
    have completed
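
A sketch of the all-variant (not from the slides): one receive is posted per other process and MPI_Waitall blocks until every one has completed; the commented line shows the non-blocking MPI_Testall form with its flag argument.

    #include <mpi.h>
    #include <stdlib.h>

    /* Sketch: receive one double from every other rank, then complete them all. */
    void collect_all(double *recvbuf)            /* recvbuf holds one slot per rank */
    {
        int rank, size, nreq = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Request *reqs  = malloc(size * sizeof(MPI_Request));
        MPI_Status  *stats = malloc(size * sizeof(MPI_Status));

        for (int src = 0; src < size; src++)
            if (src != rank)
                MPI_Irecv(&recvbuf[src], 1, MPI_DOUBLE, src, 0,
                          MPI_COMM_WORLD, &reqs[nreq++]);

        MPI_Waitall(nreq, reqs, stats);          /* blocks until all have completed */
        /* non-blocking alternative:
           int flag; MPI_Testall(nreq, reqs, &flag, stats); */

        free(reqs);
        free(stats);
    }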

13
MPI_Waitany
  • Tests if any communications have completed
  • MPI_Waitany(count, array_of_requests, index,
    status)
  • and
  • MPI_Testany(count, array_of_requests, index,
    flag, status)
  • The index in array_of_requests and status of the
    completed communication are returned in index and
    status
  • If more than one has completed, the choice is
    arbitrary (see the sketch below)
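
A common use of the any-variant, sketched here with hypothetical helpers: service completions one at a time, using the returned index to see which communication finished.

    #include <mpi.h>

    void process_message(int source, double value);   /* hypothetical */

    /* Sketch: nreq receives already posted in reqs[]; handle each as it completes. */
    void drain_requests(int nreq, MPI_Request reqs[], double recvbuf[])
    {
        for (int done = 0; done < nreq; done++) {
            int index;
            MPI_Status status;
            MPI_Waitany(nreq, reqs, &index, &status);
            /* reqs[index] is now MPI_REQUEST_NULL and is ignored on later calls */
            process_message(status.MPI_SOURCE, recvbuf[index]);
        }
    }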

14
MPI_Waitsome
  • Differs from Waitany in behaviour when more than
    one communication can complete
  • Returns a status for every communication that can
    complete
  • MPI_Waitsome(count, array_of_requests, outcount,
    array_of_indices, array_of_statuses)
  • MPI_Testsome(count, array_of_requests, outcount,
    array_of_indices, array_of_statuses)
  • Obey a fairness rule to help prevent starvation
  • Note that all completion tests deallocate the
    request object when they return it as complete;
    the handle is set to MPI_REQUEST_NULL

15
Derived datatypes
  • As discussed last lecture, there are occasions
    when we wish to pass data that doesn't fit the
    basic model...
  • e.g.
  • a matrix sub-block or matrix section, such as
    a(5,:) (non-contiguous data items)
  • a structure (contiguous, differing types)
  • a set of variables (n, set(n)) (random)
  • There are solutions using standard types, but they
    are clumsy...

16
Derived datatypes (cont.)
  • Two stage process
  • construct the datatype
  • Commit the datatype
  • Datatype is constructed from basic datatypes
    using
  • MPI_Type_contiguous
  • MPI_Type_vector
  • MPI_Type_hvector
  • MPI_Type_indexed
  • MPI_Type_hindexed
  • MPI_Type_struct

17
Derived datatypes (cont.)
  • Once the new datatype is constructed it must be
    committed.
  • MPI_Type_commit(datatype)
  • After use a datatype can be de-allocated
  • MPI_Type_free(datatype)
  • Any messages in progress are unaffected when a
    type is freed
  • Datatypes derived from the freed datatype are also
    unaffected

18
Derived datatypes - Type Maps
  • Any datatype is specified by its type map
  • A type map is a list of pairs of basic datatypes
    and displacements, of the form (type0, disp0),
    (type1, disp1), ..., (type n-1, disp n-1)
  • Displacements may be positive, zero, or negative
  • Displacements are from the start of the
    communication buffer

19
MPI_TYPE_VECTOR
  • MPI_Type_vector(count, blocklength, stride,
    oldtype, newtype)
  • e.g.
  • MPI_Datatype new;
  • MPI_Type_vector(2, 3, 5, MPI_DOUBLE, &new);

[Diagram: layout of the new type in units of MPI_DOUBLE - count=2 blocks of blocklength=3 elements, with successive blocks starting stride=5 elements apart] (a usage sketch follows below)
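
For context, a sketch (not from the slides) that uses this vector type to send a 2x3 sub-block of a 5-column array of doubles to a hypothetical rank 1: two blocks of three contiguous doubles, starting five elements apart.

    #include <mpi.h>

    /* Sketch: send rows 0-1, columns 0-2 of a[4][5] as one message. */
    void send_subblock(double a[4][5])
    {
        MPI_Datatype sub;
        MPI_Type_vector(2, 3, 5, MPI_DOUBLE, &sub);
        MPI_Type_commit(&sub);
        MPI_Send(&a[0][0], 1, sub, 1, 0, MPI_COMM_WORLD);   /* rank 1 assumed */
        MPI_Type_free(&sub);
    }
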
20
MPI_Type_struct
  • MPI_Type_struct(count, array_of_blocklengths,
    array_of_displacements, array_of_types, newtype)
  • e.g.

[Diagram: newtype built from one MPI_INT followed by a block of MPI_DOUBLE elements]
21
MPI_Type_struct - example
  • int blocklen[2];
  • MPI_Aint disp[2], extent;
  • MPI_Datatype type[2], new;
  • struct {
  •     int    i;        /* MPI_INT */
  •     double dble[3];  /* MPI_DOUBLE */
  • } msg;
  • disp[0] = 0;
  • MPI_Type_extent(MPI_INT, &extent);
  • disp[1] = extent;
  • type[0] = MPI_INT;
  • type[1] = MPI_DOUBLE;
  • blocklen[0] = 1;
  • blocklen[1] = 3;
  • MPI_Type_struct(2, blocklen, disp, type, &new);
  • MPI_Type_commit(&new);
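
As a usage note (an assumption, not shown on the slide): the committed type describes one msg record per element, provided the compiler places the doubles immediately after the int, as the extent-based displacement assumes.

    int dest = 1, tag = 0;                               /* hypothetical destination */
    MPI_Send(&msg, 1, new, dest, tag, MPI_COMM_WORLD);   /* sends the int and the 3 doubles */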

22
Derived datatypes - other routines
  • MPI_Type_size(datatype, size)
  • returns the total size of all the data items in
    datatype
  • MPI_Type_extent(datatype, extent)
  • returns the distance between the lower and upper
    bounds
  • MPI_Type_lb(datatype, lb)
  • returns the lower bound of the datatype (offset
    in bytes)
  • MPI_Type_ub(datatype, ub)
  • returns the upper bound of the datatype

23
Matching rules
  • A send and receive are correctly matched if the
    type maps of the specified datatypes with the
    displacements ignored match according to the
    rules for basic datatypes.
  • The number of basic elements received can be
    found using MPI_Get_elements.
  • MPI_Get_count returns the number of received
    elements of the specified datatype (a sketch
    follows below)
  • this may not be a whole number
  • if it is not, MPI_UNDEFINED is returned
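
A sketch (not from the slides) contrasting the two queries after receiving into a derived datatype; the vector type vec and the count of 4 are assumptions.

    #include <mpi.h>
    #include <stdio.h>

    /* Sketch: receive into a derived type and query the status both ways. */
    void count_received(double *buf, MPI_Datatype vec, int src)
    {
        MPI_Status status;
        int count, elements;

        MPI_Recv(buf, 4, vec, src, 0, MPI_COMM_WORLD, &status);

        MPI_Get_count(&status, vec, &count);        /* whole vec items, or MPI_UNDEFINED */
        MPI_Get_elements(&status, vec, &elements);  /* basic elements actually received  */
        printf("count = %d, elements = %d\n", count, elements);
    }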

24
Virtual Topologies
  • Convenient process naming
  • Naming scheme to fit communication pattern
  • Simplifies writing of code
  • Can allow MPI to optimise communications
  • Creating a topology produces a new communicator
  • MPI provides mapping functions
  • Mapping functions compute processor ranks based
    on the topology naming scheme

25
Example - a 2D torus
rank (coordinates):
 0 (0,0)    1 (0,1)    2 (0,2)
 3 (1,0)    4 (1,1)    5 (1,2)
 6 (2,0)    7 (2,1)    8 (2,2)
 9 (3,0)   10 (3,1)   11 (3,2)
26
Topology Types
  • Cartesian topologies
  • each process is connected to its neighbours in a
    virtual grid
  • boundaries can be cyclic, or not
  • processes are identified by cartesian coordinates
  • Graph topologies
  • general connected graphs
  • I'm not going to cover them

27
Creating a cartesian topology
  • MPI_Cart_create(comm_old, ndims, dims, periods,
    reorder, comm_cart)
  • ndims - number of dimensions
  • dims - number of processes in each dimension
  • periods - true or false, specifying whether each
    dimension is cyclic
  • reorder - false => data already distributed, so use
    existing ranks
  • true => MPI may reorder ranks
  • (a sketch creating the 4x3 torus shown earlier
    follows below)
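
A sketch (not from the slides) that creates the 4x3 periodic grid of the torus example; it assumes the job was started on 12 processes.

    #include <mpi.h>

    /* Sketch: a 4x3 grid, cyclic in both dimensions (needs 12 processes). */
    void make_torus(MPI_Comm *torus)
    {
        int dims[2]    = {4, 3};
        int periods[2] = {1, 1};    /* cyclic boundaries in both dimensions */
        int reorder    = 1;         /* let MPI reorder ranks                */

        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, torus);
    }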

28
Cartesian mapping functions
  • MPI_Cart_rank(comm, coords, rank)
  • used to determine the rank of a process with the
    specified coordinates
  • MPI_Cart_coords(comm, rank, maxdims, coords)
  • converts process rank to grid coords
  • MPI_Cart_shift(comm, direction, disp,
    rank_source, rank_dest)
  • provides the correct ranks for a shift
  • these can then be used in sends and receives
  • direction is the dimension in which the shift
    occurs
  • no support for diagonal shifts (a shift sketch
    follows below)
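
A sketch (not from the slides) of a shift by one along dimension 0 of the torus created above, exchanging a single double with the neighbouring processes; the buffer names are assumptions.

    #include <mpi.h>

    /* Sketch: shift one double by +1 along dimension 0 of the torus. */
    void shift_dim0(MPI_Comm torus, double mine, double *from_neighbour)
    {
        int rank, coords[2], rank_source, rank_dest;

        MPI_Comm_rank(torus, &rank);
        MPI_Cart_coords(torus, rank, 2, coords);    /* my grid coordinates */

        MPI_Cart_shift(torus, 0, 1, &rank_source, &rank_dest);
        MPI_Sendrecv(&mine, 1, MPI_DOUBLE, rank_dest, 0,
                     from_neighbour, 1, MPI_DOUBLE, rank_source, 0,
                     torus, MPI_STATUS_IGNORE);
    }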

29
Cartesian partitioning
  • It is possible to create a partition of a
    cartesian topology
  • Often used to create communicators for row (or
    slice) operations
  • MPI_Cart_sub(comm, remain_dims, new_comm)
  • If comm defines a 2x3x4 grid and remain_dims =
    (true, false, true), then MPI_Cart_sub will
    create 3 new communicators, each with 8 processes
    in a 2x4 grid
  • Note that only one communicator is returned - the
    one which contains the calling process (a sketch
    follows below)
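
A sketch (not from the slides) of that partitioning; it assumes grid is a communicator already carrying a 2x3x4 cartesian topology.

    #include <mpi.h>

    /* Sketch: keep dimensions 0 and 2 of a 2x3x4 grid, drop dimension 1. */
    void make_slices(MPI_Comm grid, MPI_Comm *slice)
    {
        int remain_dims[3] = {1, 0, 1};
        MPI_Cart_sub(grid, remain_dims, slice);
        /* each process is returned the one 2x4 communicator containing it */
    }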

30
Local notes for ping-pong
  • E-mail me for an account (or see me or Shaoib or
    Srikumar).
  • We are having some queue problems but will fix
    them as soon as possible
  • Remember to add /usr/local/mpi/bin to your PATH
  • Use mpicc to compile (don't add -lmpi)
  • You need ssh to connect to charm
  • The other nodes are on a private LAN (switch)

31
mpich
  • After the MPI standard was announced a portable
    implementation, mpich, was produced by ANL. It
    consists of
  • libraries and include files - libmpi, mpi.h
  • compiler wrappers - mpicc, mpif90. These know
    where the relevant include and library files are
  • runtime loader - mpirun
  • Has arguments -np <number of nodes> and
    -machinefile <file of nodenames>
  • implements the SPMD paradigm by starting a copy of
    the program on each node. The program must
    therefore do any differentiation itself (using the
    MPI_Comm_size() and MPI_Comm_rank() functions).
  • NOTE: our version gets CPUs and their addresses
    from PBS (i.e. don't use -np and/or -machinefile)

32
PBS
  • PBS is a batch system - jobs get submitted to a
    queue
  • The job is a shell script to execute your program
  • The shell script can contain job management
    instructions (note that these instructions can
    also be in the command line)
  • PBS will allocate your job to some other
    computer, log in as you, and execute your script,
    i.e. your script must contain cd's or absolute
    references to access files (or globus objects)
  • Useful PBS commands
  • qsub - submits a job
  • qstat - monitors status
  • qdel - deletes a job from a queue

33
PBS directives
  • Some PBS directives to insert at the start of
    your shell script
  • #PBS -q <queuename>
  • #PBS -e <filename> (stderr location)
  • #PBS -o <filename> (stdout location)
  • #PBS -eo (combines stderr and stdout)
  • #PBS -t <seconds> (maximum time)
  • #PBS -l <attribute>=<value> (e.g. -l nodes=2)

34
charm
  • charm.hpc.unimelb.edu.au is a dual Pentium (PII,
    266MHz, 128MB RAM) and is the front end for the
    PC farm. It's running Red Hat Linux.
  • Behind charm are sixteen PCs (all 200MHz MMX,
    with 64MB RAM). Their DNS designations are
    pc-i11.hpc.unimelb.edu.au, ... ,
    pc-i18.hpc.unimelb.edu.au and
    pc-j11.hpc.unimelb.edu.au, ... ,
    pc-j18.hpc.unimelb.edu.au.
  • OpenPBS is the batch system that is implemented
    on charm. There are four batch queues implemented
    on charm
  • pque - all nodes
  • exclusive - all nodes
  • pquei - pc-i nodes only
  • pquej - pc-j nodes only

35
High Performance Parallel Programming
  • Thursday More MPI