Title: Standard
1. Standard
2. Contents
- Introduction to MPI
- Message passing
- Different types of communication
- MPI functionalities
- MPI structures
- Basic functions
- Data types
- Contexts and tags
- Groups and communication domains
- Communication functions
- Point to point communications
- Asynchronous communications
- Global communications
- MPI-2
- One-sided communications
- I/O
3. Message passing (1)
- Problem
- We have N nodes
- All nodes are connected by a network
- How can we use the global computer formed by these N nodes?
4. Message passing (2)
- One answer: message passing
- Execute one process per processor
- Exchange data explicitly between processors
- Synchronize the different processes explicitly
- Two types of data transfer
- One-sided: only one process initiates the communication
- Cooperative: the two processes cooperate in the communication
5. Two types of data transfer
- One-sided communications
- No rendez-vous protocol
- A process gets no warning when another one reads or writes its local memory
- Costly synchronization
- Function prototypes
- put(remote_process, data)
- get(remote_process, data)
- Cooperative communications
- The communication involves both processes
- Implicit synchronization in the simple case
- Function prototypes
- send(destination, data)
- recv(source, data)
[Diagram: one-sided transfer with put()/get() vs cooperative transfer with send()/recv()]
6. MPI (Message Passing Interface)
- Standard developed by academic and industrial partners
- Objective: specify a portable message passing library
- Implies an execution environment for launching and connecting all the processes
- Allows
- Synchronous and asynchronous communications
- Global communications
- Separate communication domains
7. Contents
- Introduction to MPI
- Message passing
- Different types of communication
- MPI functionalities
- MPI structures
- Basic functions (example HelloWorld_MPI.c)
- Data types
- Contexts and tags
- Groups and communication domains
- Communication functions
- Point to point communications
- Asynchronous communications
- Global communications
- MPI-2
- One-sided communications
- I/O
8. MPI programming structure
- Follows the SPMD programming model
- All processes are launched at the same time
- Same program for every processor
- Processor roles can be differentiated by a rank number
[Diagram: program structure - non-parallel section, parallel section initialization, multi-node parallel section (MPI), parallel section termination]
- Remark: most implementations advise limiting the part after parallel section termination to the exit call
9. Basic functions
- MPI environment initialization
- C: MPI_Init(int *argc, char ***argv)
- Fortran: call MPI_Init(ierror)
- MPI environment termination (programs are recommended to exit right after this call)
- C: MPI_Finalize()
- Fortran: call MPI_Finalize(ierror)
- Getting the process rank
- C: MPI_Comm_rank(MPI_COMM_WORLD, &rank)
- Fortran: call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)
- Getting the total number of processes
- C: MPI_Comm_size(MPI_COMM_WORLD, &size)
- Fortran: call MPI_Comm_size(MPI_COMM_WORLD, size, ierror)
10. HelloWorld_MPI.c

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rang, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rang);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    printf("Hello, I am %d (of %d processes)\n", rang, nprocs);
    MPI_Finalize();
    return 0;
}
11. MPI data types
12. User data types
- By default, MPI exchanges data as vectors of basic MPI types
- It is possible to create user data types to simplify communication operations (avoiding explicit buffering and linearization)
- User data types replace the obsolete MPI_PACK type
- A user type consists of a sequence of basic types and a sequence of offsets describing the memory layout, as in the sketch below
- Creation: MPI_Type_commit(&type)
- Destruction: MPI_Type_free(&type)
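A minimal sketch of building such a type with MPI_Type_create_struct; the particle_t struct and its field names are illustrative, not from the slides.

#include <mpi.h>
#include <stddef.h>

/* Hypothetical record we want to exchange as one MPI datatype. */
typedef struct {
    double pos[3];
    int    id;
} particle_t;

MPI_Datatype build_particle_type(void)
{
    MPI_Datatype particle_type;
    int          blocklens[2] = {3, 1};
    MPI_Aint     offsets[2]   = {offsetof(particle_t, pos),
                                 offsetof(particle_t, id)};
    MPI_Datatype types[2]     = {MPI_DOUBLE, MPI_INT};

    /* Describe the sequence of basic types and their memory offsets... */
    MPI_Type_create_struct(2, blocklens, offsets, types, &particle_type);
    /* ...then commit before using the type in any communication.
       (For arrays of such structs, trailing padding may require an
       additional MPI_Type_create_resized.) */
    MPI_Type_commit(&particle_type);
    return particle_type;   /* release later with MPI_Type_free() */
}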
13. Contexts and tags
- Messages must be distinguishable from one another on reception
- Contexts distinguish a point-to-point communication from a global communication
- Every message is sent within a context and must be received in the same context
- Contexts are automatically managed by MPI
- Communication tags identify one communication among several
- When communications are made asynchronously, tags allow sorting them
- For reception operations, the next message can be received regardless of its tag by specifying the MPI_ANY_TAG keyword
- Tag management is up to the MPI programmer
14. Communication domains
- Nodes can be grouped into a communication domain called a communicator
- Every process has a rank number in each group it is involved in
- MPI_COMM_WORLD is the default communication domain gathering all processes; it is created at initialization
- More generally, every operation applies to a single set of processes specified by its communicator
- Each domain constitutes a distinct communication context
15. Split a communicator (1/2): groups
- To create a new domain, you first have to create a new group of processes
- int MPI_Comm_group(MPI_Comm comm, MPI_Group *group)
- int MPI_Group_incl(MPI_Group group, int rsize, int *ranks, MPI_Group *newgroup)
- int MPI_Group_excl(MPI_Group group, int rsize, int *ranks, MPI_Group *newgroup)
- Set operations on groups
- int MPI_Group_union(MPI_Group g1, MPI_Group g2, MPI_Group *gr)
- int MPI_Group_intersection(MPI_Group g1, MPI_Group g2, MPI_Group *gr)
- int MPI_Group_difference(MPI_Group g1, MPI_Group g2, MPI_Group *gr)
- Destruction of a group
- int MPI_Group_free(MPI_Group *group)
16. Split a communicator (2/2): communicators
- Associating a communicator with a group
- int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)
- Dividing a domain into sub-domains
- int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
- MPI_Comm_split is a collective operation on the initial communicator comm (see the sketch below)
- Every process gives its color; all processes with the same color end up in the same newcomm
- The MPI_UNDEFINED color lets a process stay out of any new communicator
- Every process gives its key; processes of the same color are ranked by these keys
- A group is implicitly created for each new communicator created this way
- Communicator destruction
- int MPI_Comm_free(MPI_Comm *comm)
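For example, splitting MPI_COMM_WORLD into even and odd sub-communicators might look like this (a sketch; the even/odd coloring is illustrative).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, newrank;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Processes with the same color end up in the same sub-communicator;
       the key (here the old rank) orders them inside it. */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &newcomm);
    MPI_Comm_rank(newcomm, &newrank);

    printf("world rank %d -> rank %d in the %s communicator\n",
           rank, newrank, (rank % 2) ? "odd" : "even");

    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    return 0;
}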
17. Contents
- Introduction to MPI
- Message passing
- Different types of communication
- MPI functionalities
- MPI structures
- Basic functions
- Data types
- Contexts and tags
- Groups and communication domains
- Communication functions
- Point to point communications (example Jeton.c)
- Asynchronous communications
- Global communications (example trace.c)
- MPI-2
- One-sided communications
- I/O
18. Point-to-point communications
- Send and receive data between a pair of processes
- Both processes take part in the communication: one sends the data, the other asks for the reception
- Communications are identified by tags
- The type and the size of the data must be specified
19. Basic communication functions
- Synchronous send (the send is synchronous with respect to the computation of the sending process)
- int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
- The tag allows messages to be identified uniquely
- Synchronous data reception
- int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
- The tag must match the tag used by the sender
- MPI_ANY_SOURCE can be specified to receive from anyone
20. Jeton.c
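The Jeton.c source is not reproduced on the slide; a minimal token-ring program in the same spirit ("jeton" means token), with illustrative names and values, could look like this. Run it with at least two processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Process 0 injects the token and gets it back from the last rank. */
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1 % size, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD, &status);
    } else {
        /* Every other process receives from its left neighbour
           and forwards the token to its right neighbour. */
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }
    printf("process %d saw token %d\n", rank, token);

    MPI_Finalize();
    return 0;
}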
21. Synchronism and asynchronism (1)
- To solve some deadlocks, and to allow overlapping communications with computation, one can use non-blocking functions
- In this case, the communication scheme is the following
- Initialization of the non-blocking communication (by one or both of the processes)
- The matching communication (non-blocking or blocking) is called by the other process
- Computation
- Termination of the communication (blocking operation until the communication is performed)
22. Synchronism and asynchronism (2)
- Non-blocking functions
- int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
- int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
- The request handle is used to query the state of a non-blocking communication. To wait for its termination, one can call the following function (a usage sketch follows)
- int MPI_Wait(MPI_Request *request, MPI_Status *status)
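A sketch of overlapping communication with computation; do_local_work() is a hypothetical placeholder for the computation phase, and MPI_Waitall is used here to complete both requests at once.

#include <mpi.h>

void exchange(double *sendbuf, double *recvbuf, int n,
              int dest, int source, MPI_Comm comm)
{
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    /* 1. Initiate both transfers; the calls return immediately. */
    MPI_Isend(sendbuf, n, MPI_DOUBLE, dest,   0, comm, &reqs[0]);
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, source, 0, comm, &reqs[1]);

    /* 2. Compute while the messages are in flight. */
    /* do_local_work(); */

    /* 3. Block until both operations have completed. */
    MPI_Waitall(2, reqs, stats);
}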
23. Synchronism and asynchronism (3)
- Data can be exchanged by blocking or non-blocking functions. There are several functions to manage how the send and receive operations are coupled
- The communication mode is chosen with a prefix letter on the send function name (MPI_Ssend, MPI_Bsend, MPI_Send, MPI_Rsend); a sketch of the four modes follows this list
- Synchronous send (S): finishes when the corresponding receive is posted (tightly coupled to the reception, no buffering)
- Buffered send (B): a buffer is created, and the send operation ends when the user buffer has been copied into the system buffer (not coupled to the reception)
- Standard send (no prefix): the send ends when the emission buffer is empty (the MPI implementation decides between buffering and coupling to the reception)
- Ready send (R): the user guarantees that the matching receive is already posted when calling this function (coupled to the reception, no buffering)
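A rough illustration of the four prefixes applied to the same message; the tags and buffer sizing are illustrative, and the sketch assumes the destination posts matching receives for all four sends.

#include <mpi.h>
#include <stdlib.h>

void send_modes(int *data, int n, int dest, MPI_Comm comm)
{
    /* Synchronous: returns only once the matching receive is posted. */
    MPI_Ssend(data, n, MPI_INT, dest, 0, comm);

    /* Buffered: attach a user buffer; the send completes once the
       message has been copied into it. */
    int bufsize = n * (int)sizeof(int) + MPI_BSEND_OVERHEAD;
    void *buf = malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);
    MPI_Bsend(data, n, MPI_INT, dest, 1, comm);
    MPI_Buffer_detach(&buf, &bufsize);
    free(buf);

    /* Standard: the implementation chooses buffering or coupling. */
    MPI_Send(data, n, MPI_INT, dest, 2, comm);

    /* Ready: only legal if the receive is already posted at the destination. */
    MPI_Rsend(data, n, MPI_INT, dest, 3, comm);
}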
24. Collective or global operations
- To simplify communication operations involving multiple processes, one can use collective operations on a communicator
- Typical operations
- Reductions
- Data exchange
- Broadcast
- Scatter
- Gather
- All-to-All
- Explicit synchronization
25. Reductions (1)
- A reduction is an arithmetic operation performed on data distributed over a set of processors
- Prototype
- C: int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm communicator)
- Fortran: MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, communicator, ierror)
- With MPI_Reduce(), only the root processor gets the result
- With MPI_Allreduce(), all processes get the result
26. Reductions (2)
27. Broadcast
- A broadcast operation distributes the same data to all processes
- One-to-all communication, from a specified root process to all processes of a communicator
- Prototypes
- C: int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
- Fortran: MPI_Bcast(buffer, count, datatype, root, communicator, ierror)
[Figure: the root process (rank 1 here) sends its buffer to processes 0 to np-1]
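A minimal usage sketch; root 1 matches the figure, the value 123 is illustrative, and at least two processes are assumed.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1)      /* the root holds the data to distribute */
        value = 123;

    /* Every process calls MPI_Bcast; afterwards all of them hold 123. */
    MPI_Bcast(&value, 1, MPI_INT, 1, MPI_COMM_WORLD);
    printf("process %d has value %d\n", rank, value);

    MPI_Finalize();
    return 0;
}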
28. Scatter
- One-to-all operation: different data are sent to each receiver process according to its rank
- Prototypes
- C: int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm communicator)
- Fortran: MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, communicator, ierror)
- The send parameters are used only by the sender process
[Figure: the root process (rank 2 here) splits its sendbuf among the recvbuf of processes 0 to np-1]
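A usage sketch; root 2 matches the figure (so at least three processes are assumed) and the values 10*i are illustrative.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, recv;
    int *sendbuf = NULL;
    const int root = 2;   /* root from the figure; run with >= 3 processes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == root) {
        /* Only the sender needs the full send buffer: one int per process. */
        sendbuf = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) sendbuf[i] = 10 * i;
    }

    /* Process i receives element i of the root's send buffer. */
    MPI_Scatter(sendbuf, 1, MPI_INT, &recv, 1, MPI_INT, root, MPI_COMM_WORLD);
    printf("process %d received %d\n", rank, recv);

    free(sendbuf);        /* free(NULL) is legal on non-root ranks */
    MPI_Finalize();
    return 0;
}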
29. Gather
- All-to-one operation: different data are received by a single receiver process
- Prototypes
- C: int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm communicator)
- Fortran: MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, communicator, ierror)
- The receive parameters are used only by the receiver process
[Figure: the sendbuf of processes 0 to np-1 are collected into the recvbuf of the root process (rank 3 here)]
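The mirror-image sketch for gather; root 3 matches the figure (so at least four processes are assumed), values are illustrative.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, send;
    int *recvbuf = NULL;
    const int root = 3;   /* root from the figure; run with >= 4 processes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    send = 10 * rank;     /* each process contributes one value */

    if (rank == root)     /* only the receiver needs the full buffer */
        recvbuf = malloc(size * sizeof(int));

    /* Element i of recvbuf on the root comes from process i. */
    MPI_Gather(&send, 1, MPI_INT, recvbuf, 1, MPI_INT, root, MPI_COMM_WORLD);

    if (rank == root)
        for (int i = 0; i < size; i++)
            printf("from process %d: %d\n", i, recvbuf[i]);

    free(recvbuf);
    MPI_Finalize();
    return 0;
}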
30. All-to-All
- All-to-all operation: different data are sent to each process, according to their rank
- Prototypes
- C: int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm communicator)
- Fortran: MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, communicator, ierror)
[Figure: every process scatters its sendbuf among the recvbuf of processes 0 to np-1]
31. Explicit synchronization
- Synchronization barrier: all processes of a communicator wait for the last process to enter the barrier before continuing their execution
- On computers with a hardware barrier available (such as the SGI and Cray T3E machines), the MPI barrier is slower than the hardware barrier
- Prototype
- C: int MPI_Barrier(MPI_Comm communicator)
- Fortran: MPI_Barrier(communicator, ierror)
32. Matrix trace (1)
- Computing the trace of a square matrix A of size n
- The matrix trace is the sum of the diagonal elements: Tr(A) = a_11 + a_22 + ... + a_nn
- One can easily see that the partial sums can be computed on multiple processors, ending with a reduction to obtain the complete trace (sketched below)
33. Matrix trace (2.1)
34. Matrix trace (2.2)
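The trace code itself is not reproduced on the slides; a minimal sketch in the same spirit (matrix size, values, and the cyclic distribution of the diagonal are illustrative) could look like this.

#include <mpi.h>
#include <stdio.h>

#define N 8   /* matrix dimension (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    double a[N][N], local = 0.0, trace = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Fill the matrix identically on every process (here: 2 * identity). */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (i == j) ? 2.0 : 0.0;

    /* Each process sums a cyclic share of the diagonal elements... */
    for (int i = rank; i < N; i += size)
        local += a[i][i];

    /* ...and a reduction combines the partial sums on process 0. */
    MPI_Reduce(&local, &trace, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("trace = %g\n", trace);   /* expected 16 for this matrix */

    MPI_Finalize();
    return 0;
}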
35. Contents
- Introduction to MPI
- Message passing
- Different types of communication
- MPI functionalities
- MPI structures
- Basic functions
- Data types
- Contexts and tags
- Groups and communication domains
- Communication functions
- Point to point communications
- Asynchronous communications
- Global communications
- MPI-2
- One-sided communications
- I/O
36. One-sided communications (1/2)
- No synchronization of the two processes during the communication itself
- Allows a simulated shared-memory implementation (Remote Memory Access)
- Defining the part of memory other processes can access
- MPI_Win_create()
- MPI_Win_free()
- One-sided communication functions
- MPI_Put()
- MPI_Get()
- MPI_Accumulate()
- Operations: MPI_SUM, MPI_LAND, MPI_REPLACE
37. One-sided communications (2/2)
- Active synchronization function
- MPI_Win_fence()
- Takes a memory window win as parameter
- Collective operation (barrier) on all processes of the window's group
- Acts as a synchronization barrier which ends every RMA transfer using the window win (see the sketch below)
- Passive synchronization functions
- MPI_Win_lock() and MPI_Win_unlock()
- Classical mutex-style functions
- The initiator of the communication is solely responsible for the synchronization
- When MPI_Win_unlock() returns, every transfer operation is finished
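A sketch of active synchronization with fences: every non-root process accumulates an increment into process 0's window (the counter semantics are illustrative).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, local = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Expose one int of local memory to the other processes. */
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* First fence opens the RMA access epoch. */
    MPI_Win_fence(0, win);

    if (rank != 0) {
        int one = 1;
        /* Add 1 into process 0's window; process 0 does not participate. */
        MPI_Accumulate(&one, 1, MPI_INT, 0, 0, 1, MPI_INT, MPI_SUM, win);
    }

    /* Second fence acts as a barrier ending all RMA transfers on win. */
    MPI_Win_fence(0, win);

    if (rank == 0)
        printf("received %d increments (expected %d)\n", local, size - 1);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}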
38. Parallel input/output
- Intelligent management of I/O is mandatory for parallel applications
- MPI-IO is a set of functions for optimised I/O
- Extends the classical file access functions
- Collective synchronization for accessing files
- Shared or individual file offsets
- Blocking or non-blocking reads
- Views (for accessing non-sequential zones)
- Syntax similar to the MPI communication functions (a small write sketch follows)
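A minimal MPI-IO sketch in which every process writes its own block of a shared file at an explicit offset; the file name "out.dat" and block size are illustrative.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    int data[4];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 4; i++) data[i] = rank;

    /* Collective open of the same file by all processes of the communicator. */
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at its own offset, so the writes do not overlap. */
    MPI_File_write_at(fh, (MPI_Offset)rank * 4 * sizeof(int),
                      data, 4, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}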
39. Dynamic allocation of processes
- Dynamic change of the number of processes
- Spawning new processes during execution
- The MPI_Comm_spawn() function creates a new set of processes on other processors (see the sketch after this list)
- An inter-communicator links the domain of the parent to the new domain gathering the new processes
- The MPI_Intercomm_merge() function builds a single ordinary communicator out of an inter-communicator
- MPI-2 allows a dynamic MPMD style using the function MPI_Comm_spawn_multiple()
- MPI_Comm_get_attr(MPI_UNIVERSE_SIZE) is used to learn the maximum possible number of MPI processes
- Process destruction
- There is no explicit exit() function for an MPI process
- For an MPI process to exit, its communicator MPI_COMM_WORLD must contain only finalizing processes
- All inter-communicators must be closed before finalization
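A sketch of spawning and merging from the parent's side; the executable name "./worker" and the count of 4 processes are hypothetical.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm, merged;

    MPI_Init(&argc, &argv);

    /* Launch 4 new processes running ./worker; spawn error codes are ignored. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
                   MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    /* Merge the parent and child domains into one ordinary communicator;
       high = 0 ranks the parents before the children. */
    MPI_Intercomm_merge(intercomm, 0, &merged);

    MPI_Comm_free(&merged);
    MPI_Comm_free(&intercomm);
    MPI_Finalize();
    return 0;
}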
40. Remarks and conclusion
- MPI has become, thanks to the distributed computing community, a standard library for message passing
- MPI-2 breaks the classic SPMD message passing model of MPI-1
- Numerous implementations exist, on most architectures
- Lots of documentation and publications are available
41. Some pointers
- MPI standard official site
- http://www-unix.mcs.anl.gov/mpi/
- The MPI forum
- http://www.mpi-forum.org/
- Book: MPI, The Complete Reference (Marc Snir et al.)
- http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
42. Standard
43. Contents
- MPI implementations
- Performance metrics
- High performance networks
- Communication types / 0-copy
44. MPI implementations
- LAM-MPI
- Optimised for collective operations
- MPICH
- Easy writing of new low-level drivers
- Open-MPI
- Tries to combine the performance and ease of use of the two prior ones
- Conforms to MPI-2
- IBM / NEC / FUJITSU
- Complete and high-performance implementations of MPI-2
- Target specific architectures
45. Performance metrics
- Comparison criteria
- Latency
- Bandwidth
- Collective operations
- Overlapping capabilities
- Real applications
- Measuring tools
- Round Trip Time (ping-pong)
- NetPipe
- NAS benchmarks
- CG
- LU
- BT
- FT
46. High performance networks (1/3): technologies
- Myrinet
- Connectionless reliable API
- Registered buffers
- Fully programmable DMA NIC processor
- Up to full-duplex 2 Gb/s bandwidth with Myrinet 2000
- SCINet
- Torus-topology network with static routing
- No need to register buffers
- Very small latency (suitable for RMA)
- Up to 2 Gb/s
- Gigabit Ethernet
- No need to register buffers
- DMA operations
- High latency
- Up to 1 Gb/s and 10 Gb/s bandwidth
- Infiniband
- Reliable Connection mode and Unreliable Datagram mode
- Registered buffers
- Queued DMA operations
47. High performance networks (2/3): technologies
- Myrinet
- Socket-GM
- MPICH-GM
- SCINet
- No functional socket API
- SCI-MPICH
- Gigabit Ethernet
- Have to use the socket interface
- Infiniband
- IPoIB
- LAM-MPI, MPICH, MPI/Pro, etc.
48. High performance networks (3/3): technologies
49. Eager vs Rendez-vous (1/2)
- Eager protocol
- The message is sent without any control
- Better latency
- Copied into a buffer if the receiver has not posted the reception yet
- Memory consuming for long messages
- Used only for short messages (< 64 KB)
- Rendez-vous protocol
- Sender and receiver are synchronized
- Higher latency
- 0-copy
- Better bandwidth
- Reduces the memory consumption
50. Eager vs Rendez-vous (2/2)
51. Communication types
52. High performance networks and 0-copy
[Figure: latency comparison - Myrinet 8 µs, MPICH-GM 33 µs, MPICH-Vdummy 94 µs]
53. Conclusion
- Many MPI implementations with similar performance
- Multiple measurement criteria and multiple tools
- Latency, bandwidth
- Benchmarks and microbenchmarks
- Real applications
- High performance networks force attention to small performance details
- Network bandwidth equals the memory bandwidth
- Latency is smaller than some OS operations
- Performance relies on good programming
- Performance results can vary a lot according to the type of communication employed
- Asynchronism is mandatory
- Bad programming results in bad performance
- 0-copy can be mandatory