Transcript and Presenter's Notes

Title: High Performance Parallel Programming


1
High Performance Parallel Programming
  • Dirk van der Knijff
  • Advanced Research Computing
  • Information Division

2
High Performance Parallel Programming
  • Lecture 8 Message Passing Interface (MPI) (part
    2)

3
Example problem
  • One-dimensional smoothing
  • each element set to the average of its neighbours

[Diagram: the one-dimensional array divided among processes P1, P2, P3, ..., Pn]
4
Deadlock
  • If we implement an algorithm like this (a sketch
    with blocking calls follows below)
  • for (iterations)
  • update all cells
  • send boundary values to neighbours
  • receive halo values from neighbours
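
A minimal sketch of this pattern in C (not from the slides; the buffers, sizes and update routine are hypothetical). The send is written as a synchronous send to make the failure explicit: every process enters the send first, no receive is ever posted, and all processes block - deadlock.

    #include <mpi.h>

    /* Hypothetical buffers and helper - sketch only. */
    extern double boundary[], halo[];
    void update_all_cells(void);

    void exchange_blocking(int n, int niter)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;

        for (int iter = 0; iter < niter; iter++) {
            update_all_cells();
            /* every process sends before anyone receives: with a
               synchronous send all processes block here forever */
            MPI_Ssend(boundary, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
            MPI_Recv(halo, n, MPI_DOUBLE, left, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }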

5
Non-blocking communications
  • Routine returns before the communication
    completes
  • Separate communication into phases
  • Initiate non-blocking communication
  • Do some work (perhaps involving other
    communications)
  • Wait for non-blocking communication to complete
  • Can test before waiting (or instead of)

6
Solution
  • So our algorithm now looks like this (a sketch
    follows below)
  • for(iterations)
  • update boundary cells
  • initiate sending of boundary values
  • initiate receipt of halo values
  • update non-boundary cells
  • wait for completion of sending boundary values
  • wait for completion of receiving halo values
  • Deadlock cannot occur
  • Communication can occur simultaneously in each
    direction
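
A sketch of the same exchange with non-blocking calls (again not from the slides; buffer and helper names are hypothetical). Communication is initiated first, the interior update overlaps it, and the waits guarantee completion before the buffers are reused.

    #include <mpi.h>

    /* Hypothetical buffers and helpers - sketch only. */
    extern double boundary[], halo[];
    void update_boundary_cells(void);
    void update_interior_cells(void);

    void exchange_nonblocking(int n, int niter)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;
        MPI_Request req[2];

        for (int iter = 0; iter < niter; iter++) {
            update_boundary_cells();
            /* initiate sending of boundary values and receipt of halo values */
            MPI_Isend(boundary, n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[0]);
            MPI_Irecv(halo, n, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &req[1]);
            /* overlap: work that touches neither buffer */
            update_interior_cells();
            /* wait for completion of the send, then of the receive */
            MPI_Wait(&req[0], MPI_STATUS_IGNORE);
            MPI_Wait(&req[1], MPI_STATUS_IGNORE);
        }
    }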

7
Non-blocking communication in MPI
  • All the same arguments as blocking counterparts
    plus an extra argument
  • This argument, request, is a handle which is used
    to test when the operation has completed.
  • Same communication models as blocking mode
  • MPI_Isend Standard send
  • MPI_Issend Synchronous send
  • MPI_Ibsend Buffered send
  • MPI_Irsend Ready send
  • MPI_Irecv Receive

8
Handles
  • datatype - same as blocking (MPI_Datatype in C,
    integer in Fortran)
  • communicator - same as blocking (MPI_Comm in C,
    integer in Fortran)
  • request - MPI_Request in C, integer in Fortran
  • a request handle is allocated when a
    communication is initiated
  • MPI_Issend(buf, count, datatype, dest, tag,
    comm, handle)

9
Testing for completion
  • Two types
  • WAIT type
  • block until the communication has completed
  • useful when data or buffer is required
  • MPI_Wait(request, status)
  • TEST type
  • return TRUE or FALSE value depending on
    completion
  • do not block
  • useful if data is not yet required
  • MPI_Test(request, flag, status) (a usage sketch of
    both calls follows below)
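
A sketch of both completion calls on a single receive (the buffer, count and source are hypothetical): MPI_Test returns immediately so other work can continue, while MPI_Wait would simply block until the message had arrived.

    #include <mpi.h>

    void do_other_work(void);   /* hypothetical */

    /* Sketch: poll with MPI_Test until a posted receive completes. */
    void receive_when_ready(double *buf, int count, int src)
    {
        MPI_Request request;
        MPI_Status  status;
        int flag = 0;

        MPI_Irecv(buf, count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &request);

        MPI_Test(&request, &flag, &status);      /* returns immediately */
        while (!flag) {
            do_other_work();                     /* data not yet required */
            MPI_Test(&request, &flag, &status);
        }
        /* MPI_Wait(&request, &status) would instead block here until done */
    }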

10
Blocking and non-blocking
  • A non-blocking routine followed by a wait is
    equivalent to a blocking routine
  • Send and receive can be blocking or non-blocking
  • A blocking send can be used with a non-blocking
    receive and vice-versa
  • Non-blocking sends can use any mode
  • standard, synchronous, buffered, ready
  • Synchronous mode affects completion, not
    initiation
  • Cannot alter send buffer until send completed

11
Multiple communications
  • Sometimes have many non-blocking communications
    posted at the same time.
  • MPI provides routines to test multiple
    communications
  • Three types
  • test for all
  • test for any
  • test for some
  • Each type comes in both wait and test versions

12
MPI_Waitall
  • Waits for or tests all of the specified
    communications (a sketch follows below)
  • blocking
  • MPI_Waitall(count, array_of_requests,
    array_of_statuses)
  • non-blocking
  • MPI_Testall(count, array_of_requests, flag,
    array_of_statuses)
  • Information about each communication is returned
    in array_of_statuses
  • flag is set to true if all the communications
    have completed
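
A sketch of the all-variant (not from the slides): one receive is posted per other process and MPI_Waitall blocks until every one has completed; the commented line shows the non-blocking MPI_Testall form with its flag argument.

    #include <mpi.h>
    #include <stdlib.h>

    /* Sketch: receive one double from every other rank, then complete them all. */
    void collect_all(double *recvbuf)            /* recvbuf holds one slot per rank */
    {
        int rank, size, nreq = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Request *reqs  = malloc(size * sizeof(MPI_Request));
        MPI_Status  *stats = malloc(size * sizeof(MPI_Status));

        for (int src = 0; src < size; src++)
            if (src != rank)
                MPI_Irecv(&recvbuf[src], 1, MPI_DOUBLE, src, 0,
                          MPI_COMM_WORLD, &reqs[nreq++]);

        MPI_Waitall(nreq, reqs, stats);          /* blocks until all have completed */
        /* non-blocking alternative:
           int flag; MPI_Testall(nreq, reqs, &flag, stats); */

        free(reqs);
        free(stats);
    }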

13
MPI_Waitany
  • Tests if any communications have completed
  • MPI_Waitany(count, array_of_requests, index,
    status)
  • and
  • MPI_Testany(count, array_of_requests, index,
    flag, status)
  • The index in array_of_requests and status of the
    completed communication are returned in index and
    status
  • If more than one has completed, the choice is
    arbitrary (see the sketch below)
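
A common use of the any-variant, sketched here with hypothetical helpers: service completions one at a time, using the returned index to see which communication finished.

    #include <mpi.h>

    void process_message(int source, double value);   /* hypothetical */

    /* Sketch: nreq receives already posted in reqs[]; handle each as it completes. */
    void drain_requests(int nreq, MPI_Request reqs[], double recvbuf[])
    {
        for (int done = 0; done < nreq; done++) {
            int index;
            MPI_Status status;
            MPI_Waitany(nreq, reqs, &index, &status);
            /* reqs[index] is now MPI_REQUEST_NULL and is ignored on later calls */
            process_message(status.MPI_SOURCE, recvbuf[index]);
        }
    }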

14
MPI_Waitsome
  • Differs from Waitany in behaviour when more than
    one communication can complete
  • Returns a status for every communication that can
    complete
  • MPI_Waitsome(count, array_of_requests, outcount,
    array_of_indices, array_of_statuses)
  • MPI_Testsome(count, array_of_requests, outcount,
    array_of_indices, array_of_statuses)
  • Obey a fairness rule to help prevent starvation
  • Note that all completion tests deallocate the
    request object when they return it as complete;
    the handle is set to MPI_REQUEST_NULL

15
Derived datatypes
  • As discussed last lecture, there are occasions
    when we wish to pass data that doesn't fit the
    basic model...
  • e.g.
  • a matrix sub-block or matrix section, such as
    a(5,:) (non-contiguous data items)
  • a structure (contiguous, differing types)
  • a set of variables (n, set(n)) (random)
  • There are solutions using standard types, but they
    are clumsy...

16
Derived datatypes (cont.)
  • Two stage process
  • construct the datatype
  • Commit the datatype
  • Datatype is constructed from basic datatypes
    using
  • MPI_Type_contiguous
  • MPI_Type_vector
  • MPI_Type_hvector
  • MPI_Type_indexed
  • MPI_Type_hindexed
  • MPI_Type_struct

17
Derived datatypes (cont.)
  • Once the new datatype is constructed it must be
    committed.
  • MPI_Type_commit(datatype)
  • After use a datatype can be de-allocated
  • MPI_Type_free(datatype)
  • Any messages in progress are unaffected when a
    type is freed
  • Datatypes derived from the freed datatype are also
    unaffected

18
Derived datatypes - Type Maps
  • Any datatype is specified by its type map
  • A type map is a list of pairs of basic datatypes
    and displacements, of the form (type0, disp0),
    (type1, disp1), ..., (type n-1, disp n-1)
  • Displacements may be positive, zero, or negative
  • Displacements are from the start of the
    communication buffer

19
MPI_TYPE_VECTOR
  • MPI_Type_vector(count, blocklength, stride,
    oldtype, newtype)
  • e.g.
  • MPI_Datatype new;
  • MPI_Type_vector(2, 3, 5, MPI_DOUBLE, &new);

[Diagram: layout of the new type in units of MPI_DOUBLE - count=2 blocks of blocklength=3 elements, with successive blocks starting stride=5 elements apart] (a usage sketch follows below)
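
For context, a sketch (not from the slides) that uses this vector type to send a 2x3 sub-block of a 5-column array of doubles to a hypothetical rank 1: two blocks of three contiguous doubles, starting five elements apart.

    #include <mpi.h>

    /* Sketch: send rows 0-1, columns 0-2 of a[4][5] as one message. */
    void send_subblock(double a[4][5])
    {
        MPI_Datatype sub;
        MPI_Type_vector(2, 3, 5, MPI_DOUBLE, &sub);
        MPI_Type_commit(&sub);
        MPI_Send(&a[0][0], 1, sub, 1, 0, MPI_COMM_WORLD);   /* rank 1 assumed */
        MPI_Type_free(&sub);
    }
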
20
MPI_Type_struct
  • MPI_Type_struct(count, array_of_blocklengths,
    array_of_displacements, array_of_types, newtype)
  • e.g.

[Diagram: newtype built from one MPI_INT followed by a block of MPI_DOUBLE elements]
21
MPI_Type_struct - example
  • int blocklen[2];
  • MPI_Aint disp[2], extent;
  • MPI_Datatype type[2], new;
  • struct {
  •     int    i;        /* MPI_INT */
  •     double dble[3];  /* MPI_DOUBLE */
  • } msg;
  • disp[0] = 0;
  • MPI_Type_extent(MPI_INT, &extent);
  • disp[1] = extent;
  • type[0] = MPI_INT;
  • type[1] = MPI_DOUBLE;
  • blocklen[0] = 1;
  • blocklen[1] = 3;
  • MPI_Type_struct(2, blocklen, disp, type, &new);
  • MPI_Type_commit(&new);
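
As a usage note (an assumption, not shown on the slide): the committed type describes one msg record per element, provided the compiler places the doubles immediately after the int, as the extent-based displacement assumes.

    int dest = 1, tag = 0;                               /* hypothetical destination */
    MPI_Send(&msg, 1, new, dest, tag, MPI_COMM_WORLD);   /* sends the int and the 3 doubles */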

22
Derived datatypes - other routines
  • MPI_Type_size(datatype, size)
  • returns the total size of all the data items in
    datatype
  • MPI_Type_extent(datatype, extent)
  • returns the distance between the lower and upper
    bounds
  • MPI_Type_lb(datatype, lb)
  • returns the lower bound of the datatype (offset
    in bytes)
  • MPI_Type_ub(datatype, ub)
  • returns the upper bound of the datatype

23
Matching rules
  • A send and receive are correctly matched if the
    type maps of the specified datatypes with the
    displacements ignored match according to the
    rules for basic datatypes.
  • The number of basic elements received can be
    found using MPI_Get_elements.
  • MPI_Get_count returns the number of received
    elements of the specified datatype (a sketch
    follows below)
  • this may not be a whole number
  • if it is not, MPI_UNDEFINED is returned
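
A sketch (not from the slides) contrasting the two queries after receiving into a derived datatype; the vector type vec and the count of 4 are assumptions.

    #include <mpi.h>
    #include <stdio.h>

    /* Sketch: receive into a derived type and query the status both ways. */
    void count_received(double *buf, MPI_Datatype vec, int src)
    {
        MPI_Status status;
        int count, elements;

        MPI_Recv(buf, 4, vec, src, 0, MPI_COMM_WORLD, &status);

        MPI_Get_count(&status, vec, &count);        /* whole vec items, or MPI_UNDEFINED */
        MPI_Get_elements(&status, vec, &elements);  /* basic elements actually received  */
        printf("count = %d, elements = %d\n", count, elements);
    }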

24
Virtual Topologies
  • Convenient process naming
  • Naming scheme to fit communication pattern
  • Simplifies writing of code
  • Can allow MPI to optimise communications
  • Creating a topology produces a new communicator
  • MPI provides mapping functions
  • Mapping functions compute processor ranks based
    on the topology naming scheme

25
Example - a 2D torus
rank (coordinates):
 0 (0,0)    1 (0,1)    2 (0,2)
 3 (1,0)    4 (1,1)    5 (1,2)
 6 (2,0)    7 (2,1)    8 (2,2)
 9 (3,0)   10 (3,1)   11 (3,2)
26
Topology Types
  • Cartesian topologies
  • each process is connected to its neighbours in a
    virtual grid
  • boundaries can be cyclic, or not
  • processes are identified by cartesian coordinates
  • Graph topologies
  • general connected graphs
  • I'm not going to cover them

27
Creating a cartesian topology
  • MPI_Cart_create(comm_old, ndims, dims, periods,
    reorder, comm_cart)
  • ndims - number of dimensions
  • dims - number of processes in each dimension
  • periods - true or false, specifying whether each
    dimension is cyclic
  • reorder - false => data already distributed, so use
    existing ranks
  • true => MPI may reorder ranks
  • (a sketch creating the 4x3 torus shown earlier
    follows below)
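
A sketch (not from the slides) that creates the 4x3 periodic grid of the torus example; it assumes the job was started on 12 processes.

    #include <mpi.h>

    /* Sketch: a 4x3 grid, cyclic in both dimensions (needs 12 processes). */
    void make_torus(MPI_Comm *torus)
    {
        int dims[2]    = {4, 3};
        int periods[2] = {1, 1};    /* cyclic boundaries in both dimensions */
        int reorder    = 1;         /* let MPI reorder ranks                */

        MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, torus);
    }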

28
Cartesian mapping functions
  • MPI_Cart_rank(comm, coords, rank)
  • used to determine the rank of a process with the
    specified coordinates
  • MPI_Cart_coords(comm, rank, maxdims, coords)
  • converts process rank to grid coords
  • MPI_Cart_shift(comm, direction, disp,
    rank_source, rank_dest)
  • provides the correct ranks for a shift
  • these can then be used in sends and receives
  • direction is the dimension in which the shift
    occurs
  • no support for diagonal shifts (a shift sketch
    follows below)
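
A sketch (not from the slides) of a shift by one along dimension 0 of the torus created above, exchanging a single double with the neighbouring processes; the buffer names are assumptions.

    #include <mpi.h>

    /* Sketch: shift one double by +1 along dimension 0 of the torus. */
    void shift_dim0(MPI_Comm torus, double mine, double *from_neighbour)
    {
        int rank, coords[2], rank_source, rank_dest;

        MPI_Comm_rank(torus, &rank);
        MPI_Cart_coords(torus, rank, 2, coords);    /* my grid coordinates */

        MPI_Cart_shift(torus, 0, 1, &rank_source, &rank_dest);
        MPI_Sendrecv(&mine, 1, MPI_DOUBLE, rank_dest, 0,
                     from_neighbour, 1, MPI_DOUBLE, rank_source, 0,
                     torus, MPI_STATUS_IGNORE);
    }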

29
Cartesian partitioning
  • It is possible to create a partition of a
    cartesian topology
  • Often used to create communicators for row (or
    slice) operations
  • MPI_Cart_sub(comm, remain_dims, new_comm)
  • If comm defines a 2x3x4 grid and remain_dims =
    (true, false, true), then MPI_Cart_sub will
    create 3 new communicators, each with 8 processes
    in a 2x4 grid
  • Note that only one communicator is returned - the
    one which contains the calling process (a sketch
    follows below)
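
A sketch (not from the slides) of that partitioning; it assumes grid is a communicator already carrying a 2x3x4 cartesian topology.

    #include <mpi.h>

    /* Sketch: keep dimensions 0 and 2 of a 2x3x4 grid, drop dimension 1. */
    void make_slices(MPI_Comm grid, MPI_Comm *slice)
    {
        int remain_dims[3] = {1, 0, 1};
        MPI_Cart_sub(grid, remain_dims, slice);
        /* each process is returned the one 2x4 communicator containing it */
    }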

30
Local notes for ping-pong
  • E-mail me for an account (or see me or Shaoib or
    Srikumar).
  • We are having some queue problems but will fix
    them as soon as possible
  • Remember to add /usr/local/mpi/bin to your PATH
  • Use mpicc to compile (don't add -lmpi)
  • You need ssh to connect to charm
  • The other nodes are on a private LAN (switch)

31
mpich
  • After the MPI standard was announced a portable
    implementation, mpich, was produced by ANL. It
    consists of
  • libraries and include files - libmpi, mpi.h
  • compiler wrappers - mpicc, mpif90. These know
    where the relevant include and library files are
  • runtime loader - mpirun
  • Has arguments -np <number of nodes> and
    -machinefile <file of nodenames>
  • implements the SPMD paradigm by starting a copy of
    the program on each node. The program must
    therefore do any differentiation itself (using the
    MPI_Comm_size() and MPI_Comm_rank() functions).
  • NOTE: our version gets CPUs and their addresses
    from PBS (i.e. don't use -np and/or -machinefile)

32
PBS
  • PBS is a batch system - jobs get submitted to a
    queue
  • The job is a shell script to execute your program
  • The shell script can contain job management
    instructions (note that these instructions can
    also be in the command line)
  • PBS will allocate your job to some other
    computer, log in as you, and execute your script,
    i.e. your script must contain cd's or absolute
    references to access files (or globus objects)
  • Useful PBS commands
  • qsub - submits a job
  • qstat - monitors status
  • qdel - deletes a job from a queue

33
PBS directives
  • Some PBS directives to insert at the start of
    your shell script
  • #PBS -q <queuename>
  • #PBS -e <filename> (stderr location)
  • #PBS -o <filename> (stdout location)
  • #PBS -eo (combines stderr and stdout)
  • #PBS -t <seconds> (maximum time)
  • #PBS -l <attribute>=<value> (e.g. -l nodes=2)

34
charm
  • charm.hpc.unimelb.edu.au is a dual Pentium (PII,
    266MHz, 128MB RAM) and is the front end for the
    PC farm. It's running Red Hat Linux.
  • Behind charm are sixteen PCs (all 200MHz MMX,
    with 64MB RAM). Their DNS designations are
    pc-i11.hpc.unimelb.edu.au, ... ,
    pc-i18.hpc.unimelb.edu.au and
    pc-j11.hpc.unimelb.edu.au, ... ,
    pc-j18.hpc.unimelb.edu.au.
  • OpenPBS is the batch system that is implemented
    on charm. There are four batch queues implemented
    on charm
  • pque - all nodes
  • exclusive - all nodes
  • pquei - pc-i nodes only
  • pquej - pc-j nodes only

35
High Performance Parallel Programming
  • Thursday More MPI