Title: MPI TIPS AND TRICKS
1MPI TIPS AND TRICKS
- Dr. David Cronk
- Innovative Computing Lab
- University of Tennessee
2Course Outline
- Day 1
- Morning - Lecture
- Point-to-point communication modes
- Collective operations
- Derived datatypes
- Afternoon - Lab
- Hands on exercises demonstrating collective
operations and derived datatypes
3Course Outline (cont)
- Day 2
- Morning - Lecture
- Finish lecture on derived datatypes
- MPI analysis using VAMPIR
- Performance analysis and tuning
- Afternoon - Lab
- Finish Day 1 exercises
- VAMPIR demo
4Course Outline (cont)
- Day 3
- Morning - Lecture
- MPI-I/O
- Afternoon - Lab
- MPI-I/O exercises
5Point-to-Point Communication Modes
- Standard Mode
- blocking
- MPI_SEND (buf, count, datatype, dest, tag, comm)
- MPI_RECV (buf, count, datatype, source, tag, comm, status)
- Generally, ONLY use blocking calls if you cannot call earlier AND there is no other work that can be done!
- The standard ONLY states that buffers can be reused once the calls return. When a blocking call returns is implementation dependent.
- Blocking sends MAY block until a matching receive is posted. This is not required behavior, but the standard does not prohibit it either. Further, a blocking send may have to wait for system resources such as system-managed message buffers.
- Be VERY careful of deadlock when using blocking calls!
6Point-to-Point Communication Modes (cont)
- Standard Mode
- Non-blocking (immediate) sends/receives
- MPI_ISEND (buf, count, datatype, dest, tag, comm, request)
- MPI_IRECV (buf, count, datatype, source, tag, comm, request)
- MPI_WAIT (request, status)
- MPI_TEST (request, flag, status)
- Allows communication calls to be posted early, which may improve performance (see the sketch below)
- Overlap computation and communication
- Latency tolerance
- Less (or no) buffering
- MUST either complete these calls (with wait or test) or call MPI_REQUEST_FREE
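A minimal C sketch of this pattern (buffer sizes, tags, and neighbors are illustrative): both calls are posted early, independent work is done, and the requests are completed before the buffers are touched again. MPI_WAITALL is used here as a convenience form of MPI_WAIT.

    #include <mpi.h>

    void exchange_with_neighbors(int rank, int nprocs)
    {
        double sendbuf[100], recvbuf[100];
        MPI_Request reqs[2];
        MPI_Status  stats[2];
        int right = (rank + 1) % nprocs;
        int left  = (rank + nprocs - 1) % nprocs;

        /* Post the receive and the send as early as possible */
        MPI_Irecv(recvbuf, 100, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, 100, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... do computation that does not touch sendbuf or recvbuf ... */

        /* Complete both requests before reading recvbuf or reusing sendbuf */
        MPI_Waitall(2, reqs, stats);
    }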
7Point-to-Point Communication Modes (cont)
- Non-standard mode communication
- Only used by the sender! (MPI uses the push communication model)
- Buffered mode - A buffer must be provided by the application
- Synchronous mode - Completes only after a matching receive has been posted
- Ready mode - May only be called when a matching receive has already been posted
8Point-to-Point Communication Modes Buffered
- MPI_BSEND (buf, count, datatype, dest, tag, comm)
- MPI_IBSEND (buf, count, dtype, dest, tag, comm, req)
- MPI_BUFFER_ATTACH (buff, size)
- MPI_BUFFER_DETACH (buff, size)
- Buffered sends do not rely on system buffers
- The user supplies a buffer that MUST be large enough for all outstanding messages
- The user need not worry about calls blocking while waiting for system buffer space
- The buffer is managed by MPI
- The user MUST ensure there is no buffer overflow (see the sketch below)
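A minimal C sketch of the buffered-mode sequence (message size is illustrative): the application sizes the buffer with MPI_PACK_SIZE plus MPI_BSEND_OVERHEAD, attaches it, sends, and detaches before freeing.

    #include <mpi.h>
    #include <stdlib.h>

    void buffered_send(int *msg, int dest)
    {
        int   bufsize;
        char *buf;

        /* The attached buffer must hold the packed message plus the
           per-message overhead MPI_BSEND_OVERHEAD                    */
        MPI_Pack_size(1000, MPI_INT, MPI_COMM_WORLD, &bufsize);
        bufsize += MPI_BSEND_OVERHEAD;
        buf = (char *) malloc(bufsize);

        MPI_Buffer_attach(buf, bufsize);
        MPI_Bsend(msg, 1000, MPI_INT, dest, 0, MPI_COMM_WORLD);

        /* Detach blocks until buffered messages have been transmitted */
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);
    }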
9Buffered Sends
[Figure: three buffered-send scenarios - segmentation violation, buffer overflow, and safe usage]
10Point-to-Point Communication Modes Synchronous
- MPI_SSEND (buf, count, datatype, dest, tag, comm)
- MPI_ISSEND (buf, count, dtype, dest, tag, comm, req)
- Can be started (called) at any time
- Does not complete until a matching receive has been posted and the receive operation has been started
- Does NOT mean the matching receive has completed
- Can be used in place of sending and receiving acknowledgements
- Can be more efficient when used appropriately
- Buffering may be avoided
11Point-to-Point Communication Modes Ready Mode
- MPI_RSEND (buf, count, datatype, dest, tag, comm)
- MPI_IRSEND (buf, count, dtype, dest, tag, comm, req)
- May ONLY be started (called) if a matching receive has already been posted
- If a matching receive has not been posted, the results are undefined
- May be the most efficient mode when used appropriately
- Removal of the handshake operation
- Should only be used with extreme caution (see the sketch below)
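Ready mode is only safe when the matching receive is known to be posted. One common way to guarantee this, sketched below with illustrative sizes, is to synchronize after the receiver posts its MPI_Irecv.

    #include <mpi.h>

    void ready_mode_pair(int rank)
    {
        double buf[50];
        MPI_Request req;
        MPI_Status  status;

        if (rank == 1)   /* receiver posts the receive first */
            MPI_Irecv(buf, 50, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

        /* The barrier guarantees the receive is posted before the ready send */
        MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 0)
            MPI_Rsend(buf, 50, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Wait(&req, &status);
    }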
12Ready Mode
[Figure: unsafe vs. safe use of ready-mode sends]
13Point-to-Point Communication Modes Performance
Issues
- Non-blocking calls are almost always the way to go
- Communication can be carried out during blocking system calls
- Computation and communication can be overlapped if there is special-purpose communication hardware
- Less likely to have errors that lead to deadlock
- Standard mode is usually sufficient, but buffered mode can offer advantages
- Particularly if there are frequent, large messages being sent
- Or if the user is unsure the system provides sufficient buffer space
- Synchronous mode can be more efficient if acknowledgements are needed
- It also tells the system that buffering is not required
14Collective Communication
- The amount of data sent must exactly match the amount of data received
- Collective routines are collective across an entire communicator and must be called in the same order from all processors within the communicator
- Collective routines are all blocking
- This simply means buffers can be re-used upon return
- Collective routines return as soon as the calling process's participation is complete
- This says nothing about the other processors
- Collective routines may or may not be synchronizing
- No mixing of collective and point-to-point communication
15Collective Communication
- Barrier MPI_BARRIER (comm)
- The only collective routine which provides explicit synchronization
- Returns at any processor only after all processes have entered the call
16Collective Communication
- Collective Communication Routines
- Except broadcast, each routine has 2 variants
- Standard variant: all messages are the same size
- Vector variant: each item is a vector of possibly varying length
- If there is a single origin or destination, it is referred to as the root
- Each routine (except broadcast) has distinct send and receive arguments
- Send and receive buffers must be disjoint
- Each can use MPI_IN_PLACE, which allows the user to specify that data contributed by the caller is already in its final location
17Collective Communication Bcast
- MPI_BCAST (buffer, count, datatype, root, comm)
- Strictly in place
- MPI-1 insists on using an intra-communicator
- MPI-2 allows use of an inter-communicator
- REMEMBER: A broadcast need not be synchronizing. Returning from a broadcast tells you nothing about the status of the other processes involved in the broadcast. Furthermore, though MPI does not require MPI_BCAST to be synchronizing, it does not prohibit synchronous behavior either.
18BCAST
[Figure: broadcast ordering example - "OOPS!" vs. "THAT'S BETTER"]
19Collective Communication Gather
- MPI_GATHER (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
- Receive arguments are only meaningful at the root
- Each processor must send the same amount of data
- Root can use MPI_IN_PLACE for sendbuf
- data is assumed to be in the correct place in the
recvbuf
MPI_GATHER
20MPI_Gather
[Figure: MPI_Gather example with int tmp[20] on each process and int res[320] at the root - labeled "WORKS" and "A OK"]
21Collective Communication Gatherv
- MPI_GATHERV (sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, root, comm)
- Vector variant of MPI_GATHER
- Allows a varying amount of data from each proc
- Allows root to specify where the data from each proc goes
- No portion of the receive buffer may be written more than once
- MPI_IN_PLACE may be used by root
22Collective Communication Gatherv (cont)
MPI_GATHERV
23Collective Communication Gatherv (cont)
stride = 105; root = 0;
for (i = 0; i < nprocs; i++) {
    displs[i] = i * stride;
    counts[i] = 100;
}
MPI_Gatherv (sbuff, 100, MPI_INT, rbuff, counts, displs, MPI_INT, root, MPI_COMM_WORLD);
24Collective Communication Scatter
- MPI_SCATTER (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
- Opposite of MPI_GATHER
- Send arguments only meaningful at root
- Root can use MPI_IN_PLACE for recvbuf
MPI_SCATTER
25MPI_SCATTER
IF (MYPE .EQ. ROOT) THEN
   OPEN (25, FILE=filename)
   READ (25, *) nprocs, nboxes
   READ (25, *) ((mat(i,j), i=1,nboxes), j=1,nprocs)
   CLOSE (25)
ENDIF
CALL MPI_BCAST (nboxes, 1, MPI_INTEGER, ROOT, MPI_COMM_WORLD, ierr)
CALL MPI_SCATTER (mat, nboxes, MPI_INTEGER, lboxes, nboxes, MPI_INTEGER, ROOT, MPI_COMM_WORLD, ierr)
26Collective Communication Scatterv
- MPI_SCATTERV (sendbuf, scounts, displs, sendtype, recvbuf, recvcount, recvtype, root, comm)
- Opposite of MPI_GATHERV
- Send arguments are only meaningful at root
- Root can use MPI_IN_PLACE for recvbuf
- No location of the sendbuf can be read more than once
27Collective Communication Scatterv (cont)
MPI_SCATTERV
28MPI_SCATTERV
C     mnb = max number of boxes
      IF (MYPE .EQ. ROOT) THEN
         OPEN (25, FILE=filename)
         READ (25, *) nprocs
         READ (25, *) (nboxes(I), I=1,nprocs)
         READ (25, *) ((mat(I,J), I=1,nboxes(J)), J=1,nprocs)
         CLOSE (25)
         DO I = 1,nprocs
            displs(I) = (I-1)*mnb
         ENDDO
      ENDIF
      CALL MPI_SCATTER (nboxes, 1, MPI_INTEGER, nb, 1, MPI_INTEGER, ROOT, MPI_COMM_WORLD, ierr)
      CALL MPI_SCATTERV (mat, nboxes, displs, MPI_INTEGER, lboxes, nb, MPI_INTEGER, ROOT, MPI_COMM_WORLD, ierr)
29Collective Communication Allgather
- MPI_ALLGATHER (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
- Same as MPI_GATHER, except all processors get the result
- MPI_IN_PLACE may be used for sendbuf on all processors
- Equivalent to a gather followed by a bcast
30Collective Communication Allgatherv
- MPI_ALLGATHERV (sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, comm)
- Same as MPI_GATHERV, except all processors get the result
- MPI_IN_PLACE may be used for sendbuf on all processors
- Equivalent to a gatherv followed by a bcast
31Collective Communication Alltoall
(scatter/gather)
- MPI_ALLTOALL (sendbuf, sendcount, sendtype,
recvbuf, recvcount, recvtype, comm)
32Collective Communication Alltoallv
- MPI_ALLTOALLV (sendbuf, sendcounts, sdispls, sendtype, recvbuf, recvcounts, rdispls, recvtype, comm)
- Same as MPI_ALLTOALL, but the vector variant
- Can specify how many blocks to send to each
processor, location of blocks to send, how many
blocks to receive from each processor, and where
to place the received blocks
33Collective Communication Alltoallw
- MPI_ALLTOALLW (sendbuf, sendcounts, sdispls, sendtypes, recvbuf, recvcounts, rdispls, recvtypes, comm)
- Same as MPI_ALLTOALLV, except different datatypes can be specified for data scattered as well as data gathered
- Can specify how many blocks to send to each processor, location of blocks to send, how many blocks to receive from each processor, and where to place the received blocks
- Displacements are now in terms of bytes rather than types
34Collective Communication Reduction
- Global reduction across all members of a group
- Can use predefined operations or user-defined operations
- Can be used on single elements or arrays of elements
- Counts and types must be the same on all processors
- Operations are assumed to be associative
- User-defined operations can be different on each processor, but this is not recommended
35Collective Communication Reduction (reduce)
- MPI_REDUCE (sendbuf, recvbuf, count, datatype, op, root, comm)
- recvbuf is only meaningful on root
- Combines elements (on an element-by-element basis) in sendbuf according to op
- Results of the reduction are returned to root in recvbuf
- MPI_IN_PLACE can be used for sendbuf on root (see the sketch below)
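A short illustrative C sketch of MPI_IN_PLACE with MPI_REDUCE: at the root the send buffer argument is replaced by MPI_IN_PLACE, so the root's contribution is taken from, and the result left in, the receive buffer.

    #include <mpi.h>

    void sum_in_place(double *vals, double *result, int n, int rank)
    {
        if (rank == 0) {
            /* Root's contribution must already be in result */
            for (int i = 0; i < n; i++)
                result[i] = vals[i];
            MPI_Reduce(MPI_IN_PLACE, result, n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        } else {
            MPI_Reduce(vals, result, n, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        }
    }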
36MPI_REDUCE
REAL a(n), b(n,m), c(m)
REAL sum(m)
DO j = 1,m
   sum(j) = 0.0
   DO i = 1,n
      sum(j) = sum(j) + a(i)*b(i,j)
   ENDDO
ENDDO
CALL MPI_REDUCE (sum, c, m, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
37Collective Communication Reduction (cont)
- MPI_ALLREDUCE (sendbuf, recvbuf, count, datatype, op, comm)
- Same as MPI_REDUCE, except all processors get the result
- MPI_REDUCE_SCATTER (sendbuf, recvbuf, recvcounts, datatype, op, comm)
- Acts like a reduce followed by a scatterv
38MPI_REDUCE_SCATTER
DO j = 1,nprocs
   counts(j) = n
ENDDO
DO j = 1,m
   sum(j) = 0.0
   DO i = 1,n
      sum(j) = sum(j) + a(i)*b(i,j)
   ENDDO
ENDDO
CALL MPI_REDUCE_SCATTER (sum, c, counts, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, ierr)
39Collective Communication Prefix Reduction
- MPI_SCAN (sendbuf, recvbuf, count, datatype, op, comm)
- Performs an inclusive element-wise prefix reduction
- MPI_EXSCAN (sendbuf, recvbuf, count, datatype, op, comm)
- Performs an exclusive prefix reduction
- The result is undefined at process 0
40MPI_SCAN
MPI_SCAN (sbuf, rbuf, 1, MPI_INT, MPI_SUM,
MPI_COMM_WORLD)
MPI_EXSCAN (sbuf, rbuf, 1, MPI_INT, MPI_SUM,
MPI_COMM_WORLD)
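For concreteness, a small C sketch of what the two calls above produce when each rank contributes its own rank number (illustrative example):

    #include <mpi.h>

    void prefix_sums(int rank)
    {
        int sbuf = rank, incl, excl;

        /* Inclusive prefix sum: rank r receives 0 + 1 + ... + r */
        MPI_Scan(&sbuf, &incl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        /* Exclusive prefix sum: rank r receives 0 + 1 + ... + (r - 1);
           the result on rank 0 is undefined                            */
        MPI_Exscan(&sbuf, &excl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    }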
41Collective Communication Reduction - user
defined ops
- MPI_OP_CREATE (function, commute, op)
- If commute is true, the operation is assumed to be commutative
- function is a user-defined function with 4 arguments (see the sketch below)
- invec: input vector
- inoutvec: input and output vector
- len: number of elements
- datatype: MPI_DATATYPE
- Computes inoutvec[i] = invec[i] op inoutvec[i], for i = 0..len-1
- MPI_OP_FREE (op)
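A C sketch of a user-defined reduction (the operation, element-wise maximum of absolute values, is illustrative): the user function has the four arguments listed above and combines invec into inoutvec.

    #include <mpi.h>
    #include <math.h>

    /* inoutvec[i] = max(|invec[i]|, |inoutvec[i]|), for i = 0..len-1 */
    void absmax(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
    {
        double *in = (double *) invec, *inout = (double *) inoutvec;
        for (int i = 0; i < *len; i++)
            inout[i] = fmax(fabs(in[i]), fabs(inout[i]));
    }

    void reduce_absmax(double *local, double *global, int n)
    {
        MPI_Op op;

        MPI_Op_create(absmax, 1, &op);   /* commute = 1 (true)             */
        MPI_Reduce(local, global, n, MPI_DOUBLE, op, 0, MPI_COMM_WORLD);
        MPI_Op_free(&op);                /* free the op when done with it  */
    }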
42Collective Communication Performance Issues
- Collective operations should have much better performance than simply sending messages directly
- Broadcast may make use of a broadcast tree (or other mechanism)
- All collective operations can potentially make use of a tree (or other) mechanism to improve performance
- It is important to use the simplest collective operation that still achieves the needed result
- Use MPI_IN_PLACE whenever appropriate
- Reduces unnecessary memory usage and redundant data movement
43Derived Datatypes
- A derived datatype is a sequence of primitive datatypes and displacements
- Derived datatypes are created by building on primitive datatypes
- A derived datatype's typemap is the sequence of (primitive type, disp) pairs that defines the derived datatype
- These displacements need not be positive, unique, or increasing
- A datatype's type signature is just the sequence of primitive datatypes
- A message's type signature is the type signature of the datatype being sent, repeated count times
44Derived Datatypes (cont)
Typemap: {(MPI_INT, 0), (MPI_INT, 12), (MPI_INT, 16), (MPI_INT, 20), (MPI_INT, 36)}
Type signature: {MPI_INT, MPI_INT, MPI_INT, MPI_INT, MPI_INT}

In collective communication, the type signature of the data sent must match the type signature of the data received!
45Derived Datatypes (cont)
- Lower bound: the lowest displacement of an entry of this datatype
- Upper bound: relative address of the last byte occupied by entries of this datatype, rounded up to satisfy alignment requirements
- Extent: the span from lower bound to upper bound
- MPI_TYPE_GET_EXTENT (datatype, lb, extent) (see the sketch below)
- MPI_TYPE_SIZE (datatype, size)
- MPI_GET_ADDRESS (location, address)
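A small C sketch querying these quantities for a derived type (the vector type chosen here is illustrative); note that size counts only the data bytes, while extent also spans the gaps.

    #include <mpi.h>
    #include <stdio.h>

    void show_extent_and_size(void)
    {
        MPI_Datatype vtype;
        MPI_Aint     lb, extent;
        int          size;

        /* 3 blocks of 2 ints, block starts 4 ints apart */
        MPI_Type_vector(3, 2, 4, MPI_INT, &vtype);
        MPI_Type_commit(&vtype);

        MPI_Type_get_extent(vtype, &lb, &extent);  /* lower bound and extent     */
        MPI_Type_size(vtype, &size);               /* 6 * sizeof(int): data only */
        printf("lb=%ld extent=%ld size=%d\n", (long) lb, (long) extent, size);

        MPI_Type_free(&vtype);
    }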
46Datatype Constructors
- MPI_TYPE_DUP (oldtype, newtype)
- Simply duplicates an existing type
- Not useful to regular users
- MPI_TYPE_CONTIGUOUS (count, oldtype, newtype)
- Creates a new type representing count contiguous occurrences of oldtype
- e.g., MPI_TYPE_CONTIGUOUS (2, MPI_INT, 2INT)
- Creates a new datatype 2INT which represents an array of 2 integers
47CONTIGUOUS DATATYPE
P1 sends 100 integers to P2
P1:
    int buff[100];
    MPI_Datatype dtype;
    ...
    MPI_Type_contiguous (100, MPI_INT, &dtype);
    MPI_Type_commit (&dtype);
    MPI_Send (buff, 1, dtype, 2, tag, MPI_COMM_WORLD);

P2:
    int buff[100];
    MPI_Recv (buff, 100, MPI_INT, 1, tag, MPI_COMM_WORLD, &status);
48Datatype Constructors (cont)
- MPI_TYPE_VECTOR (count, blocklength, stride, oldtype, newtype)
- Creates a datatype representing count regularly spaced occurrences of blocklength contiguous oldtypes
- stride is in terms of elements of oldtype
- e.g., MPI_TYPE_VECTOR (4, 2, 3, 2INT, AINT)
49Datatype Constructors (cont)
- MPI_TYPE_HVECTOR (count, blocklength, stride, oldtype, newtype)
- Identical to MPI_TYPE_VECTOR, except stride is given in bytes rather than elements
- e.g., MPI_TYPE_HVECTOR (4, 2, 20, 2INT, BINT)
50EXAMPLE
- REAL a(100,100), B(100,100)
- CALL MPI_COMM_RANK (MPI_COMM_WORLD, myrank, ierr)
- CALL MPI_TYPE_SIZE (MPI_REAL, sizeofreal, ierr)
- CALL MPI_TYPE_VECTOR (100, 1, 100, MPI_REAL, rowtype, ierr)
- CALL MPI_TYPE_CREATE_HVECTOR (100, 1, sizeofreal, rowtype, xpose, ierr)
- CALL MPI_TYPE_COMMIT (xpose, ierr)
- CALL MPI_SENDRECV (a, 1, xpose, myrank, 0, b, 100*100, MPI_REAL, myrank, 0, MPI_COMM_WORLD, status, ierr)
51Datatype Constructors (cont)
- MPI_TYPE_INDEXED (count, blocklengths, displs, oldtype, newtype)
- Allows specification of a non-contiguous data layout
- Good for irregular problems
- e.g., MPI_TYPE_INDEXED (3, lengths, displs, 2INT, CINT)
- lengths = (2, 4, 3), displs = (0, 3, 8)
- Most often, block sizes are all the same (typically 1)
- MPI-2 introduced a new constructor for this case
52Datatype Constructors (cont)
- MPI_TYPE_CREATE_INDEXED_BLOCK (count, blocklength, displs, oldtype, newtype)
- Same as MPI_TYPE_INDEXED, except all blocks are the same length (blocklength)
- e.g., MPI_TYPE_CREATE_INDEXED_BLOCK (7, 1, displs, MPI_INT, DINT)
- displs = (1, 3, 4, 6, 9, 13, 14)
53Datatype Constructors (cont)
- MPI_TYPE_CREATE_HINDEXED (count, blocklengths, displs, oldtype, newtype)
- Identical to MPI_TYPE_INDEXED except displacements are in bytes rather than elements
- MPI_TYPE_CREATE_STRUCT (count, lengths, displs, types, newtype)
- Used mainly for sending arrays of structures
- count is the number of fields in the structure
- lengths is the number of elements in each field
- displs should be calculated, not hard-coded (for portability)
54MPI_TYPE_CREATE_STRUCT
struct s1 {
    char   class;
    double d[6];
    char   b[7];
};
struct s1 sarray[100];

[Figure: "Non-portable" vs. "Semi-portable" ways of building the corresponding MPI struct datatype]
55MPI_TYPE_CREATE_STRUCT
int   i;
char  c[100];
float f[3];
int   a;
MPI_Aint     disp[4];
int          lens[4]  = {1, 100, 3, 1};
MPI_Datatype types[4] = {MPI_INT, MPI_CHAR, MPI_FLOAT, MPI_INT};
MPI_Datatype stype;

MPI_Get_address(&i, &disp[0]);
MPI_Get_address(c,  &disp[1]);
MPI_Get_address(f,  &disp[2]);
MPI_Get_address(&a, &disp[3]);
MPI_Type_create_struct(4, lens, disp, types, &stype);
MPI_Type_commit(&stype);
MPI_Send(MPI_BOTTOM, 1, stype, ...);
56Derived Datatypes (cont)
- MPI_TYPE_CREATE_RESIZED (oldtype, lb, extent, newtype)
- Sets a new lower bound and extent for oldtype
- Does NOT change the amount of data sent in a message
- Only changes the data access pattern (see the sketch below)
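A C sketch of the usual use of MPI_TYPE_CREATE_RESIZED (the struct layout is illustrative): the struct datatype's extent is forced to sizeof(struct) so that sending with count > 1 steps correctly through an array of structures, including any compiler padding.

    #include <mpi.h>

    struct particle { double x[3]; int id; };

    MPI_Datatype make_particle_type(void)
    {
        struct particle p;
        int          lens[2]  = { 3, 1 };
        MPI_Aint     disp[2], base;
        MPI_Datatype types[2] = { MPI_DOUBLE, MPI_INT };
        MPI_Datatype tmp, ptype;

        MPI_Get_address(&p,      &base);
        MPI_Get_address(&p.x[0], &disp[0]);
        MPI_Get_address(&p.id,   &disp[1]);
        disp[0] -= base;
        disp[1] -= base;

        MPI_Type_create_struct(2, lens, disp, types, &tmp);
        /* Force lb = 0 and extent = sizeof(struct particle) */
        MPI_Type_create_resized(tmp, 0, sizeof(struct particle), &ptype);
        MPI_Type_free(&tmp);
        MPI_Type_commit(&ptype);
        return ptype;
    }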
57MPI_TYPE_CREATE_RESIZED
Really Portable
58Datatype Constructors (cont)
- MPI_TYPE_CREATE_SUBARRAY (ndims, sizes, subsizes, starts, order, oldtype, newtype)
- Creates a newtype which represents a contiguous subsection of an array with ndims dimensions
- This sub-array is only contiguous conceptually; it may not be stored contiguously in memory!
- Arrays are assumed to be indexed starting at zero!!!
- order must be MPI_ORDER_C or MPI_ORDER_FORTRAN
- C programs may specify Fortran ordering, and vice-versa
59Datatype Constructors Subarrays
[Figure: 10x10 array with corner elements (1,1) and (10,10); a 6x6 sub-array is selected]
MPI_TYPE_CREATE_SUBARRAY (2, sizes, subsizes, starts, MPI_ORDER_FORTRAN, MPI_INT, sarray)
sizes = (10, 10), subsizes = (6, 6), starts = (3, 3)
60Datatype Constructors Subarrays
[Figure: 10x10 array with corner elements (1,1) and (10,10); a 6x6 sub-array is selected]
MPI_TYPE_CREATE_SUBARRAY (2, sizes, subsizes, starts, MPI_ORDER_FORTRAN, MPI_INT, sarray)
sizes = (10, 10), subsizes = (6, 6), starts = (2, 2)
61Datatype Constructors Darrays
- MPI_TYPE_CREATE_DARRAY (size, rank, ndims, gsizes, distribs, dargs, psizes, order, oldtype, newtype)
- Used with arrays that are distributed in HPF-like fashion on Cartesian process grids
- Generates datatypes corresponding to the sub-arrays stored on each processor
- Returns in newtype a datatype specific to the sub-array stored on process rank (see the sketch below)
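A C sketch (global size and process grid are illustrative) describing the piece of a 100x100 global array, block-distributed over a 2x2 process grid, that belongs to a given rank:

    #include <mpi.h>

    MPI_Datatype my_darray_piece(int nprocs, int rank)
    {
        int gsizes[2]   = { 100, 100 };                /* global array size  */
        int distribs[2] = { MPI_DISTRIBUTE_BLOCK, MPI_DISTRIBUTE_BLOCK };
        int dargs[2]    = { MPI_DISTRIBUTE_DFLT_DARG, MPI_DISTRIBUTE_DFLT_DARG };
        int psizes[2]   = { 2, 2 };                    /* 2 x 2 process grid */
        MPI_Datatype darray;

        MPI_Type_create_darray(nprocs, rank, 2, gsizes, distribs, dargs,
                               psizes, MPI_ORDER_C, MPI_INT, &darray);
        MPI_Type_commit(&darray);
        return darray;   /* describes the sub-array owned by this rank */
    }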
62Datatype Constructors (cont)
- Derived datatypes must be committed before they can be used
- MPI_TYPE_COMMIT (datatype)
- Performs a compilation of the datatype description into an efficient representation
- Derived datatypes should be freed when they are no longer needed
- MPI_TYPE_FREE (datatype)
- Does not affect datatypes derived from the freed datatype or communication currently using it
63Pack and Unpack
- MPI_PACK (inbuf, incount, datatype, outbuf, outsize, position, comm)
- MPI_UNPACK (inbuf, insize, position, outbuf, outcount, datatype, comm)
- MPI_PACK_SIZE (incount, datatype, comm, size)
- Packed messages must be sent with the type MPI_PACKED
- Packed messages can be received with any matching datatype
- Unpacked messages can be received with the type MPI_PACKED
- Receives must use type MPI_PACKED if the messages are to be unpacked (see the sketch below)
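A C sketch of the pack/unpack pattern (buffer size and message contents are illustrative; a real code would size the buffer with MPI_PACK_SIZE): a count is packed in front of the data so the receiver can unpack the meta-data first.

    #include <mpi.h>

    void send_packed(double *rows, int nrows, int dest)
    {
        char buf[8192];
        int  pos = 0;

        MPI_Pack(&nrows, 1, MPI_INT, buf, (int) sizeof(buf), &pos, MPI_COMM_WORLD);
        MPI_Pack(rows, nrows, MPI_DOUBLE, buf, (int) sizeof(buf), &pos, MPI_COMM_WORLD);
        MPI_Send(buf, pos, MPI_PACKED, dest, 0, MPI_COMM_WORLD);
    }

    void recv_packed(double *rows, int src)
    {
        char buf[8192];
        int  pos = 0, nrows;
        MPI_Status status;

        MPI_Recv(buf, (int) sizeof(buf), MPI_PACKED, src, 0, MPI_COMM_WORLD, &status);
        MPI_Unpack(buf, (int) sizeof(buf), &pos, &nrows, 1, MPI_INT, MPI_COMM_WORLD);
        MPI_Unpack(buf, (int) sizeof(buf), &pos, rows, nrows, MPI_DOUBLE, MPI_COMM_WORLD);
    }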
64Pack and Unpack
65Derived Datatypes Performance Issues
- May allow the user to send fewer or smaller messages
- How well this works is system dependent
- May be able to significantly reduce memory copies
- Can make I/O much more efficient
- Data packing may be more efficient if it reduces the number of send operations by packing meta-data at the front of the message
- This is often possible (and advantageous) for data layouts that are runtime dependent
66DAY 2
- Morning - Lecture
- Performance analysis and tuning
- Afternoon - Lab
- VAMPIR demo
67Performance analysis and tuning
- It is typically much more difficult to debug and tune parallel programs
- Programmers often have no idea where to begin searching for possible bottlenecks
- A tool that allows the programmer to get a quick overview of the program's execution can aid the programmer in beginning this search
68VAMPIR
- Vampir is a visualization program used to visualize the trace data generated by Vampirtrace
- Vampirtrace is an instrumented MPI library to link with user code for automatic tracefile generation on parallel platforms
69Vampir and Vampirtrace
70Vampir Features
- Tool for converting tracefile data for MPI programs into a variety of graphical views
- Highly configurable
- Timeline display with zooming and scrolling capabilities
- Profiling and communications statistics
- Source-code clickback
- OpenMP support under development
71Vampirtrace
- Profiling library for MPI applications
- Produces tracefiles that can be analyzed with the Vampir performance analysis tool or the Dimemas performance prediction tool
- Merely linking your application with Vampirtrace enables tracing of all MPI calls. On some platforms, calls to user-level subroutines are also recorded.
- API for controlling profiling and for defining and tracing user-defined activities
72Running and Analyzing Vampirtrace-instrumented
Programs
- Programs linked with Vampirtrace are started in the same way as ordinary MPI programs
- Use Vampir to analyze the resulting tracefile
- Uses the configuration file (VAMPIR2.cnf) in $HOME/.VAMPIR_defaults
- Can copy $VAMPIR_ROOT/etc/VAMPIR2.cnf
- If no configuration file is available, VAMPIR will create one with default values
73Getting Started
If your path is set up correctly, simply enter: vampir

To open a trace file, select File followed by Open Tracefile, then select the trace file to open or enter a known tracefile name.

Once the tracefile is loaded, VAMPIR starts with a global timeline display. This is the analysis starting point.
74Vampir Global Timeline Display
75Global Timeline Display
- The context menu is activated with a right mouse click inside any display window
- Zoom in by selecting the start of the desired region (left click held), dragging the mouse to the end of the desired region, and releasing
- Can zoom in to unlimited depth
- Step back out of zooms from the context menu
76Zoomed Global Timeline Display
77Displays
- Activity Charts
- Default is a pie chart, but can also use histograms or table mode
- Can select different activities to be shown
- Can hide some activities
- Can change the scale in histograms
Display types: Timeline, Activity Chart, Summary Chart, Message Statistics, File I/O Statistics, Parallelism
78Global Activity Chart Display (single file)
79Global Activity Chart Display (modular)
80Global Activity Chart with All Symbols Displayed
81Global Activity Chart with MPI Activities
Displayed
82Global Activity Chart with Application Activity
Displayed
83Global Activity Chart with Application Activity
Displayed
84Global Activity Chart with Timeline Portion
Displayed
85Process Activity Chart Display
86Process Activity Chart Display (mpi)
87Process Activity Chart Display (hide max)
88Process Activity Chart Histogram Display
89Process Activity Chart Histogram Display (log
display)
90Process Activity Chart Table Display
91Summary Charts
- Shows total time spent on each activity
- Can be the sum over all processors or the average per processor
- Similar context menu options as activity charts
- Default display is a horizontal histogram, but can also be a vertical histogram, pie chart, or table
92Global Summaric Chart Displays (all symbols)
93Global Summaric Chart Displays (per process)
94Global Summaric Chart Displays (mpi)
95Global Summaric Chart Displays (timeline)
96Global Summaric Chart Displays
97Communication Statistics
- Shows a matrix of communication statistics
- Can show total bytes, total messages, average message size, longest, shortest, and transmission rates
- Can zoom into sub-matrices
- Can get length statistics
- Can filter messages by type (tag) and
communicator
98Global Communication Statistics Display (total
bytes)
99Global Communication Statistics Display (total
bytes zoomed in)
100Global Communication Statistics Display (Average
size)
101Global Communication Statistics Display (total
messages)
102Global Communication Statistics Display using
Timeline Portion
103Global Communication Statistics Display (length
statistics)
104Global Communication Statistics Display (Filter
Dialog)
105Global Parallelism Display
106Tracefile Size
- Often, the trace file from a fully instrumented code grows to an unmanageable size
- Can limit the problem size for analysis
- Can limit the number of iterations
- Can use the Vampirtrace API to limit the tracefile size (see the sketch below)
- vttraceoff() - disables tracing
- vttraceon() - re-enables tracing
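A sketch of how the calls named above can limit a trace to a representative window of the run. The spellings are taken from the slide; the exact names and header differ between Vampirtrace versions, so treat the prototypes as an assumption.

    /* Prototypes as named on the slide (assumption: actual header and
       case may differ by Vampirtrace version)                          */
    void vttraceoff(void);
    void vttraceon(void);

    void timestep_loop(int nsteps)
    {
        vttraceoff();                        /* do not trace start-up          */
        for (int step = 0; step < nsteps; step++) {
            if (step == 100) vttraceon();    /* trace a representative window  */
            if (step == 110) vttraceoff();
            /* ... computation and MPI communication ... */
        }
    }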
107Performance Analysis and Tuning
- First, make sure there is available speedup in the MPI routines
- Use a profiling tool such as VAMPIR
- If the total time spent in MPI routines is a small fraction of total execution time, there is probably not much use tuning the message passing code
- BEWARE: Profiling tools can miss compute cycles used due to non-blocking calls!
108Performance Analysis and Tuning
- If MPI routines account for a significant portion of your execution time:
- Try to identify communication hot-spots
- Will changing the order of communication reduce the hot-spot problem?
- Will changing the data distribution reduce communication without increasing computation?
- Sending more data per message is better than sending more messages
109Performance Analysis and Tuning
- Are you using non-blocking calls?
- Post sends/receives as soon as possible, but don't wait for their completion if there is still work you can do!
- If you are waiting for long periods of time for completion of non-blocking sends, this may be an indication of small system buffers. Consider using buffered mode.
110Performance Analysis and Tuning
- Are you sending lots of small messages?
- Message passing has significant overhead (latency). Latency accounts for a large proportion of the message transmission time for small messages.
- Consider marshaling values into larger messages where appropriate
- If you are using derived datatypes, check whether the MPI implementation handles these types efficiently
- Consider using MPI_PACK where appropriate
- e.g., for dynamic data layouts, or when the sender needs to send the receiver meta-data
111Performance Analysis and Tuning
- Use collective operations when appropriate
- Many collective operations use mechanisms such as broadcast trees to achieve better performance
- Is your computation-to-communication ratio too small?
- You may be running on too many processors for the problem size
112DAY 3
- Morning - Lecture
- MPI-I/O
- Afternoon - Lab
- MPI-I/O exercises
113MPI-I/O
- Introduction
- What is parallel I/O
- Why do we need parallel I/O
- What is MPI-I/O
- MPI-I/O
- Terms and definitions
- File manipulation
- Derived data types and file views
114OUTLINE (cont)
- MPI-I/O (cont)
- Data access
- Non-collective access
- Collective access
- Split collective access
- File interoperability
- Gotchas - Consistency and semantics
115INTRODUCTION
- What is parallel I/O?
- Multiple processes accessing a single file
116INTRODUCTION
- What is parallel I/O?
- Multiple processes accessing a single file
- Often, both data and file access are non-contiguous
- Ghost cells cause non-contiguous data access
- Block or cyclic distributions cause
non-contiguous file access
117Non-Contiguous Access
File layout
Local Mem
118INTRODUCTION
- What is parallel I/O?
- Multiple processes accessing a single file
- Often, both data and file access are non-contiguous
- Ghost cells cause non-contiguous data access
- Block or cyclic distributions cause non-contiguous file access
- Want to access data and files with as few I/O calls as possible
119INTRODUCTION (cont)
- Why use parallel I/O?
- Many users do not have time to learn the
complexities of I/O optimization
120INTRODUCTION (cont)
      INTEGER dim
      PARAMETER (dim=10000)
      INTEGER*4 out_array(dim)

      OPEN (fh, FILE=filename, FORM='UNFORMATTED')
      WRITE (fh) (out_array(I), I=1,dim)

      rl = 4*dim
      OPEN (fh, FILE=filename, ACCESS='DIRECT', RECL=rl)
      WRITE (fh, REC=1) out_array
121INTRODUCTION (cont)
- Why use parallel I/O?
- Many users do not have time to learn the complexities of I/O optimization
- Use of parallel I/O can simplify coding
- A single read/write operation vs. multiple read/write operations
122INTRODUCTION (cont)
- Why use parallel I/O?
- Many users do not have time to learn the complexities of I/O optimization
- Use of parallel I/O can simplify coding
- A single read/write operation vs. multiple read/write operations
- Parallel I/O potentially offers significant performance improvement over traditional approaches
123INTRODUCTION (cont)
- Traditional approaches
- Each process writes to a separate file
- Often requires an additional post-processing step
- Without post-processing, restarts must use the same number of processors
- Results are sent to a master processor, which collects them and writes them out to disk
- Each processor calculates its position in the file and writes individually
124INTRODUCTION (cont)
- What is MPI-I/O?
- MPI-I/O is a set of extensions to the original MPI standard
- This is an interface specification: it does NOT give implementation specifics
- It provides routines for file manipulation and data access
- Calls to MPI-I/O routines are portable across a large number of architectures
125MPI-I/O
- Terms and Definitions
- Displacement - number of bytes from the beginning of a file
- etype - the unit of data access within a file
- filetype - a datatype used to express the access pattern of a file
- file view - the definition of the access pattern of a file
126MPI-I/O
- Terms and Definitions
- Offset - position in the file, relative to the current view, expressed in terms of the number of etypes
- File pointers - offsets into the file maintained by MPI
- Individual file pointer - local to the process that opened the file
- Shared file pointer - shared (and manipulated) by the group of processes that opened the file
127FILE MANIPULATION
- MPI_FILE_OPEN (comm, filename, mode, info, fh, ierr)
- Opens the file identified by filename on each processor in communicator comm (see the sketch below)
- Collective over this group of processors
- Each processor must use the same value for mode and reference the same file
- info is used to give hints about access patterns
128FILE MANIPULATION (cont)
- MPI_FILE_CLOSE (fh)
- This routine synchronizes the file state and then closes the file
- The user must ensure all I/O routines have completed before closing the file
- This is a collective routine (but not synchronizing)
129DERIVED DATATYPES VIEWS
- Derived datatypes are not part of MPI-I/O
- They are used extensively in conjunction with MPI-I/O
- A filetype is really a datatype expressing the access pattern of a file
- Filetypes are used to set file views
130DERIVED DATATYPES VIEWS
- Non-contiguous memory access
- MPI_TYPE_CREATE_SUBARRAY
- NDIMS - number of dimensions
- ARRAY_OF_SIZES - number of elements in each dimension of the full array
- ARRAY_OF_SUBSIZES - number of elements in each dimension of the sub-array
- ARRAY_OF_STARTS - starting position of the sub-array within the full array, in each dimension
- ORDER - MPI_ORDER_(C or FORTRAN)
- OLDTYPE - datatype stored in full array
- NEWTYPE - handle to new datatype
131NONCONTIGUOUS MEMORY ACCESS
[Figure: 102x102 array with indices 0..101; the interior region (1,1) to (100,100) holds the local data, surrounded by ghost cells]
132NONCONTIGUOUS MEMORY ACCESS
- INTEGER sizes(2), subsizes(2), starts(2), dtype, ierr
- sizes(1) = 102
- sizes(2) = 102
- subsizes(1) = 100
- subsizes(2) = 100
- starts(1) = 1
- starts(2) = 1
- CALL MPI_TYPE_CREATE_SUBARRAY(2, sizes, subsizes, starts, MPI_ORDER_FORTRAN, MPI_REAL8, dtype, ierr)
133NONCONTIGUOUS FILE ACCESS
- MPI_FILE_SET_VIEW (FH, DISP, ETYPE, FILETYPE, DATAREP, INFO, IERROR)
134NONCONTIGUOUS FILE ACCESS
- The file has holes in it from the processor's perspective
135NONCONTIGUOUS FILE ACCESS
- The file has holes in it from the processor's perspective
- MPI_TYPE_CONTIGUOUS (NUM, OLD, NEW, IERR)
- NUM - number of contiguous elements
- OLD - old data type
- NEW - new data type
- MPI_TYPE_CREATE_RESIZED (OLD, LB, EXTENT, NEW, IERR)
- OLD - old data type
- LB - lower bound
- EXTENT - new extent
- NEW - new data type
136Holes in the file
Memory layout
File layout (2 ints followed by 3 ints)
CALL MPI_TYPE_CONTIGUOUS (2, MPI_INT, CTYPE, IERR)
DISP = 4
LB = 0
EXTENT = 5*4
CALL MPI_TYPE_CREATE_RESIZED (CTYPE, LB, EXTENT, FTYPE, IERR)
CALL MPI_TYPE_COMMIT (FTYPE, IERR)
CALL MPI_FILE_SET_VIEW (FH, DISP, MPI_INT, FTYPE, 'native', MPI_INFO_NULL, IERR)
137NONCONTIGUOUS FILE ACCESS
- The file has holes in it from the processor's perspective
- A block-cyclic data distribution
138NONCONTIGUOUS FILE ACCESS
- The file has holes in it from the processor's perspective
- A block-cyclic data distribution
- MPI_TYPE_VECTOR (COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE)
- COUNT - number of blocks
- BLOCKLENGTH - number of elements per block
- STRIDE - elements between the start of each block
- OLDTYPE - old datatype
- NEWTYPE - new datatype
139Block-cyclic distribution
File layout (blocks of 4 ints)
CALL MPI_TYPE_VECTOR (3, 4, 16, MPI_INT, FILETYPE, IERR)
CALL MPI_TYPE_COMMIT (FILETYPE, IERR)
DISP = 4 * 4 * MYRANK
CALL MPI_FILE_SET_VIEW (FH, DISP, MPI_INT, FILETYPE, 'native', MPI_INFO_NULL, IERR)
140NONCONTIGUOUS FILE ACCESS
- The file has holes in it from the processor's perspective
- A block-cyclic data distribution
- Multi-dimensional array access
141NONCONTIGUOUS FILE ACCESS
- The file has holes in it from the processor's perspective
- A block-cyclic data distribution
- Multi-dimensional array access
- MPI_TYPE_CREATE_SUBARRAY ()
142Distributed array access
[Figure: 200x200 global array with corner indices (0,0), (0,199), (199,0), (199,199), divided into 100x100 sub-arrays]
143Distributed array access
sizes(1) = 200
sizes(2) = 200
subsizes(1) = 100
subsizes(2) = 100
starts(1) = 0
starts(2) = 0
CALL MPI_TYPE_CREATE_SUBARRAY (2, SIZES, SUBSIZES, STARTS, MPI_ORDER_FORTRAN, MPI_INT, FILETYPE, IERR)
CALL MPI_TYPE_COMMIT (FILETYPE, IERR)
CALL MPI_FILE_SET_VIEW (FH, 0, MPI_INT, FILETYPE, 'native', MPI_INFO_NULL, IERR)
144NONCONTIGUOUS FILE ACCESS
- The file has holes in it from the processor's perspective
- A block-cyclic data distribution
- A multi-dimensional array distributed with a block distribution
- Irregularly distributed arrays
145Irregularly distributed arrays
- MPI_TYPE_CREATE_INDEXED_BLOCK
- COUNT - Number of blocks
- LENGTH - Elements per block
- MAP - Array of displacements
- OLD - Old datatype
- NEW - New datatype
146Irregularly distributed arrays
147Irregularly distributed arrays
CALL MPI_TYPE_CREATE_INDEXED_BLOCK (10, 1, FILE_MAP, MPI_INT, FILETYPE, IERR)
CALL MPI_TYPE_COMMIT (FILETYPE, IERR)
DISP = 0
CALL MPI_FILE_SET_VIEW (FH, DISP, MPI_INT, FILETYPE, 'native', MPI_INFO_NULL, IERR)
148DATA ACCESS
149COLLECTIVE I/O
Memory layout on 4 processors
File layout
150EXPLICIT OFFSETS
- Parameters
- FH - File handle
- OFFSET - Location in file to start
- BUF - Buffer to write from/read to
- COUNT - Number of elements
- DATATYPE - Type of each element
- STATUS - Return status (blocking)
- REQUEST - Request handle (non-blocking, non-collective)
151EXPLICIT OFFSETS (cont)
- I/O Routines
- MPI_FILE_(READ/WRITE)_AT ()
- MPI_FILE_(READ/WRITE)_AT_ALL ()
- MPI_FILE_I(READ/WRITE)_AT ()
- MPI_FILE_(READ/WRITE)_AT_ALL_BEGIN ()
- MPI_FILE_(READ/WRITE)_AT_ALL_END (FH, BUF, STATUS)
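A C sketch of an explicit-offset collective write (the offset arithmetic assumes the default file view, where offsets are in bytes; sizes are illustrative):

    #include <mpi.h>

    void write_my_block(MPI_File fh, int rank, int *data, int n)
    {
        MPI_Status status;

        /* Each process writes n ints at a rank-dependent offset; the
           _AT_ALL variant is collective, letting the implementation merge
           the requests into larger I/O operations                         */
        MPI_Offset offset = (MPI_Offset) rank * n * sizeof(int);
        MPI_File_write_at_all(fh, offset, data, n, MPI_INT, &status);
    }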
152EXPLICIT OFFSETS
153INDIVIDUAL FILE POINTERS
- Parameters
- FH - File handle
- BUF - Buffer to write to/read from
- COUNT - number of elements to be read/written
- DATATYPE - Type of each element
- STATUS - Return status (blocking)
- REQUEST - Request handle (non-blocking,
non-collective)
154INDIVIDUAL FILE POINTERS
- I/O Routines
- MPI_FILE_(READ/WRITE) ()
- MPI_FILE_(READ/WRITE)_ALL ()
- MPI_FILE_I(READ/WRITE) ()
- MPI_FILE_(READ/WRITE)_ALL_BEGIN()
- MPI_FILE_(READ/WRITE)_ALL_END (FH, BUF, STATUS)
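A C sketch combining a per-process file view with an individual-file-pointer collective write (the displacement arithmetic and 'native' representation are illustrative):

    #include <mpi.h>

    void write_with_view(MPI_File fh, int rank, int *data, int n)
    {
        MPI_Status status;

        /* Give each process its own contiguous region of the file */
        MPI_Offset disp = (MPI_Offset) rank * n * sizeof(int);
        MPI_File_set_view(fh, disp, MPI_INT, MPI_INT, "native", MPI_INFO_NULL);

        /* Collective write starting at the individual file pointer,
           which is 0 in the new view                                 */
        MPI_File_write_all(fh, data, n, MPI_INT, &status);
    }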
155INDIVIDUAL FILE POINTERS
156SHARED FILE POINTERS
- All processes must have the same view
- Parameters
- FH - File handle
- BUF - Buffer
- COUNT - Number of elements
- DATATYPE - Type of the elements
- STATUS - Return status (blocking)
- REQUEST - Request handle (non-blocking, non-collective)
157SHARED FILE POINTERS
- I/O Routines
- MPI_FILE_(READ/WRITE)_SHARED ()
- MPI_FILE_I(READ/WRITE)_SHARED ()
- MPI_FILE_(READ/WRITE)_ORDERED ()
- MPI_FILE_(READ/WRITE)_ORDERED_BEGIN ()
- MPI_FILE_(READ/WRITE)_ORDERED_END (FH, BUF,
STATUS)
158SHARED FILE POINTERS
159FILE INTEROPERABILITY
- MPI puts no constraints on how an implementation should store files
- If a file is not stored as a linear byte stream, there must be a utility for converting the file into a linear byte stream
- Data representation aids interoperability
160FILE INTEROPERABILITY (cont)
- Data Representation
- native - data is stored exactly as it is in memory
- internal - data may be converted, but can always be read by the same MPI implementation, even on different architectures
- external32 - this representation is defined by MPI; files written in external32 format can be read by any MPI implementation on any machine
161FILE INTEROPERABILITY (cont)
- In some MPI-I/O implementations (e.g., ROMIO), the files created are no different from those created by the underlying file system
- This means normal POSIX commands (cp, rm, etc.) work with files created by these implementations
- Non-MPI programs can read these files
162GOTCHAS - Consistency Semantics
- Collective routines are NOT synchronizing
- Output data may be buffered
- Just because a process has completed a write does not mean the data is available to other processes
- Three ways to ensure file consistency:
- MPI_FILE_SET_ATOMICITY ()
- MPI_FILE_SYNC ()
- MPI_FILE_CLOSE ()
163CONSISTENCY SEMANTICS
- MPI_FILE_SET_ATOMICITY ()
- Causes all writes to be immediately written to disk; this is a collective operation
- MPI_FILE_SYNC ()
- Collective operation which forces buffered data to be written to disk
- MPI_FILE_CLOSE ()
- Writes any buffered data to disk before closing the file
164GOTCHA!!!
Process 0:
CALL MPI_FILE_OPEN (..., FH)
CALL MPI_FILE_SET_ATOMICITY (FH)
CALL MPI_FILE_WRITE_AT (FH, 100, ...)
CALL MPI_FILE_READ_AT (FH, 0, ...)

Process 1:
CALL MPI_FILE_OPEN (..., FH)
CALL MPI_FILE_SET_ATOMICITY (FH)
CALL MPI_FILE_WRITE_AT (FH, 0, ...)
CALL MPI_FILE_READ_AT (FH, 100, ...)
165GOTCHA!!!
Process 0:
CALL MPI_FILE_OPEN (..., FH)
CALL MPI_FILE_SET_ATOMICITY (FH)
CALL MPI_FILE_WRITE_AT (FH, 100, ...)
CALL MPI_BARRIER (...)
CALL MPI_FILE_READ_AT (FH, 0, ...)

Process 1:
CALL MPI_FILE_OPEN (..., FH)
CALL MPI_FILE_SET_ATOMICITY (FH)
CALL MPI_FILE_WRITE_AT (FH, 0, ...)
CALL MPI_BARRIER (...)
CALL MPI_FILE_READ_AT (FH, 100, ...)
166GOTCHA!!!
Process 0:
CALL MPI_FILE_OPEN (..., FH)
CALL MPI_FILE_WRITE_AT (FH, 100, ...)
CALL MPI_FILE_CLOSE (FH)
CALL MPI_FILE_OPEN (..., FH)
CALL MPI_FILE_READ_AT (FH, 0, ...)

Process 1:
CALL MPI_FILE_OPEN (..., FH)
CALL MPI_FILE_WRITE_AT (FH, 0, ...)
CALL MPI_FILE_CLOSE (FH)
CALL MPI_FILE_OPEN (..., FH)
CALL MPI_FILE_READ_AT (FH, 100, ...)
167GOTCHA!!!
Process 0:
CALL MPI_FILE_OPEN (..., FH)
CALL MPI_FILE_WRITE_AT (FH, 100, ...)
CALL MPI_FILE_CLOSE (FH)
CALL MPI_BARRIER (...)
CALL MPI_FILE_OPEN (..., FH)
CALL MPI_FILE_READ_AT (FH, 0, ...)

Process 1:
CALL MPI_FILE_OPEN (..., FH)
CALL MPI_FILE_WRITE_AT (FH, 0, ...)
CALL MPI_FILE_CLOSE (FH)
CALL MPI_BARRIER (...)
CALL MPI_FILE_OPEN (..., FH)
CALL MPI_FILE_READ_AT (FH, 100, ...)
168CONCLUSIONS
- MPI-I/O potentially offers a significant improvement in I/O performance
- This improvement can be attained with minimal effort on the part of the user
- Simpler programming with fewer calls to I/O routines
- Easier program maintenance due to a simple API
169Recommended references
- MPI - The Complete Reference, Volume 1: The MPI Core
- MPI - The Complete Reference, Volume 2: The MPI Extensions
- Using MPI: Portable Parallel Programming with the Message-Passing Interface
- Using MPI-2: Advanced Features of the Message-Passing Interface