Title: Standard
1. Standard
2. Contents
- Introduction to MPI
- Message passing
- Different types of communication
- MPI functionalities
- MPI structures
- Basic functions
- Data types
- Contexts and tags
- Groups and communication domains
- Communication functions
- Point to point communications
- Asynchronous communications
- Global communications
- MPI-2
- One-sided communications
- I/O
3. Message passing (1)
- Problem
- We have N nodes
- All nodes are connected by a network
- How can we use the global computer formed by these N nodes?
4. Message passing (2)
- One answer: message passing
- Execute one process per processor
- Exchange data explicitly between processors
- Synchronize the different processes explicitly
- Two types of data transfer
- One-sided: only one process initiates the communication
- Cooperative: the two processes cooperate in the communication
5. Two types of data transfer
- One-sided communications
- No rendez-vous protocol
- A process gets no warning when another one reads or writes its local memory
- Costly synchronization
- Function prototypes
- put(remote_process, data)
- get(remote_process, data)
- Cooperative communications
- The communication involves both processes
- Implicit synchronization in the simple case
- Function prototypes
- send(destination, data)
- recv(source, data)
[Diagram: one-sided transfer with put()/get() vs cooperative transfer with send()/recv()]
6. MPI (Message Passing Interface)
- Standard developed by academic and industrial partners
- Objective: specify a portable message passing library
- Implies an execution environment for launching and connecting all the processes
- Allows
- Synchronous and asynchronous communications
- Global communications
- Separate communication domains
7. Contents
- Introduction to MPI
- Message passing
- Different types of communication
- MPI functionalities
- MPI structures
- Basic functions (example HelloWorld_MPI.c)
- Data types
- Contexts and tags
- Groups and communication domains
- Communication functions
- Point to point communications
- Asynchronous communications
- Global communications
- MPI-2
- One-sided communications
- I/O
8. MPI programming structure
- Follows the SPMD programming model
- All processes are launched at the same time
- Same program for every processor
- Processor roles can be differentiated by a rank number
[Diagram: program structure - non-parallel section, parallel section initialization, multi-node parallel section (MPI), parallel section termination]
- Remark: most implementations advise limiting the part after parallel section termination to the exit call
9. Basic functions
- MPI environment initialization
- C: MPI_Init(int *argc, char ***argv)
- Fortran: call MPI_Init(ierror)
- MPI environment termination (programs are recommended to exit right after this call)
- C: MPI_Finalize()
- Fortran: call MPI_Finalize(ierror)
- Getting the process rank
- C: MPI_Comm_rank(MPI_COMM_WORLD, &rank)
- Fortran: call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierror)
- Getting the total number of processes
- C: MPI_Comm_size(MPI_COMM_WORLD, &size)
- Fortran: call MPI_Comm_size(MPI_COMM_WORLD, size, ierror)
10. HelloWorld_MPI.c

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rang, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rang);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    printf("Hello, I am %d (of %d processes)\n", rang, nprocs);
    MPI_Finalize();
    return 0;
}
11. MPI data types
12. User data types
- By default, MPI exchanges data as vectors of basic MPI types
- It is possible to create user data types to simplify communication operations (avoiding explicit buffering and linearization)
- User data types replace the obsolete MPI_PACK type
- A user type consists of a sequence of basic types and a sequence of offsets describing the memory layout, as in the sketch below
- Creation: MPI_Type_commit(&type)
- Destruction: MPI_Type_free(&type)
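A minimal sketch of building such a type with MPI_Type_create_struct; the particle_t struct and its field names are illustrative, not from the slides.

#include <mpi.h>
#include <stddef.h>

/* Hypothetical record we want to exchange as one MPI datatype. */
typedef struct {
    double pos[3];
    int    id;
} particle_t;

MPI_Datatype build_particle_type(void)
{
    MPI_Datatype particle_type;
    int          blocklens[2] = {3, 1};
    MPI_Aint     offsets[2]   = {offsetof(particle_t, pos),
                                 offsetof(particle_t, id)};
    MPI_Datatype types[2]     = {MPI_DOUBLE, MPI_INT};

    /* Describe the sequence of basic types and their memory offsets... */
    MPI_Type_create_struct(2, blocklens, offsets, types, &particle_type);
    /* ...then commit before using the type in any communication.
       (For arrays of such structs, trailing padding may require an
       additional MPI_Type_create_resized.) */
    MPI_Type_commit(&particle_type);
    return particle_type;   /* release later with MPI_Type_free() */
}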
13. Contexts and tags
- Messages must be distinguishable from one another on reception
- Contexts distinguish a point-to-point communication from a global communication
- Every message is sent within a context and must be received in the same context
- Contexts are automatically managed by MPI
- Communication tags identify one communication among several
- When communications are made asynchronously, tags allow sorting them
- For reception operations, the next message can be received regardless of its tag by specifying the MPI_ANY_TAG keyword
- Tag management is up to the MPI programmer
14. Communication domains
- Nodes can be grouped into a communication domain called a communicator
- Every process has a rank number in each group it is involved in
- MPI_COMM_WORLD is the default communication domain gathering all processes; it is created at initialization
- More generally, every operation applies to a single set of processes specified by its communicator
- Each domain constitutes a distinct communication context
15. Split a communicator (1/2): groups
- To create a new domain, you first have to create a new group of processes
- int MPI_Comm_group(MPI_Comm comm, MPI_Group *group)
- int MPI_Group_incl(MPI_Group group, int rsize, int *ranks, MPI_Group *newgroup)
- int MPI_Group_excl(MPI_Group group, int rsize, int *ranks, MPI_Group *newgroup)
- Set operations on groups
- int MPI_Group_union(MPI_Group g1, MPI_Group g2, MPI_Group *gr)
- int MPI_Group_intersection(MPI_Group g1, MPI_Group g2, MPI_Group *gr)
- int MPI_Group_difference(MPI_Group g1, MPI_Group g2, MPI_Group *gr)
- Destruction of a group
- int MPI_Group_free(MPI_Group *group)
16. Split a communicator (2/2): communicators
- Associating a communicator with a group
- int MPI_Comm_create(MPI_Comm comm, MPI_Group group, MPI_Comm *newcomm)
- Dividing a domain into sub-domains
- int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
- MPI_Comm_split is a collective operation on the initial communicator comm (see the sketch below)
- Every process gives its color; all processes with the same color end up in the same newcomm
- The MPI_UNDEFINED color lets a process stay out of any new communicator
- Every process gives its key; processes of the same color are ranked by these keys
- A group is implicitly created for each new communicator created this way
- Communicator destruction
- int MPI_Comm_free(MPI_Comm *comm)
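For example, splitting MPI_COMM_WORLD into even and odd sub-communicators might look like this (a sketch; the even/odd coloring is illustrative).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, newrank;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Processes with the same color end up in the same sub-communicator;
       the key (here the old rank) orders them inside it. */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &newcomm);
    MPI_Comm_rank(newcomm, &newrank);

    printf("world rank %d -> rank %d in the %s communicator\n",
           rank, newrank, (rank % 2) ? "odd" : "even");

    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    return 0;
}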
17. Contents
- Introduction to MPI
- Message passing
- Different types of communication
- MPI functionalities
- MPI structures
- Basic functions
- Data types
- Contexts and tags
- Groups and communication domains
- Communication functions
- Point to point communications (example Jeton.c)
- Asynchronous communications
- Global communications (example trace.c)
- MPI-2
- One-sided communications
- I/O
18. Point-to-point communications
- Send and receive data between a pair of processes
- Both processes take part in the communication: one sends the data, the other asks for the reception
- Communications are identified by tags
- The type and the size of the data must be specified
19. Basic communication functions
- Synchronous send (the send is synchronous with respect to the computation of the sending process)
- int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
- The tag allows messages to be identified uniquely
- Synchronous data reception
- int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
- The tag must match the tag used by the sender
- MPI_ANY_SOURCE can be specified to receive from anyone
20. Jeton.c
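The Jeton.c source is not reproduced on the slide; a minimal token-ring program in the same spirit ("jeton" means token), with illustrative names and values, could look like this. Run it with at least two processes.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Process 0 injects the token and gets it back from the last rank. */
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1 % size, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD, &status);
    } else {
        /* Every other process receives from its left neighbour
           and forwards the token to its right neighbour. */
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &status);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }
    printf("process %d saw token %d\n", rank, token);

    MPI_Finalize();
    return 0;
}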
21. Synchronism and asynchronism (1)
- To solve some deadlocks, and to allow overlapping communications with computation, one can use non-blocking functions
- In this case, the communication scheme is the following
- Initialization of the non-blocking communication (by one or both of the processes)
- The matching communication (non-blocking or blocking) is called by the other process
- Computation
- Termination of the communication (blocking operation until the communication is performed)
22. Synchronism and asynchronism (2)
- Non-blocking functions
- int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
- int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
- The request handle is used to query the state of a non-blocking communication. To wait for its termination, one can call the following function (a usage sketch follows)
- int MPI_Wait(MPI_Request *request, MPI_Status *status)
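A sketch of overlapping communication with computation; do_local_work() is a hypothetical placeholder for the computation phase, and MPI_Waitall is used here to complete both requests at once.

#include <mpi.h>

void exchange(double *sendbuf, double *recvbuf, int n,
              int dest, int source, MPI_Comm comm)
{
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    /* 1. Initiate both transfers; the calls return immediately. */
    MPI_Isend(sendbuf, n, MPI_DOUBLE, dest,   0, comm, &reqs[0]);
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, source, 0, comm, &reqs[1]);

    /* 2. Compute while the messages are in flight. */
    /* do_local_work(); */

    /* 3. Block until both operations have completed. */
    MPI_Waitall(2, reqs, stats);
}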
23. Synchronism and asynchronism (3)
- Data can be exchanged by blocking or non-blocking functions. There are several functions to manage how the send and receive operations are coupled
- The communication mode is chosen with a prefix letter on the send function name (MPI_Ssend, MPI_Bsend, MPI_Send, MPI_Rsend); a sketch of the four modes follows this list
- Synchronous send (S): finishes when the corresponding receive is posted (tightly coupled to the reception, no buffering)
- Buffered send (B): a buffer is created, and the send operation ends when the user buffer has been copied into the system buffer (not coupled to the reception)
- Standard send (no prefix): the send ends when the emission buffer is empty (the MPI implementation decides between buffering and coupling to the reception)
- Ready send (R): the user guarantees that the matching receive is already posted when calling this function (coupled to the reception, no buffering)
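A rough illustration of the four prefixes applied to the same message; the tags and buffer sizing are illustrative, and the sketch assumes the destination posts matching receives for all four sends.

#include <mpi.h>
#include <stdlib.h>

void send_modes(int *data, int n, int dest, MPI_Comm comm)
{
    /* Synchronous: returns only once the matching receive is posted. */
    MPI_Ssend(data, n, MPI_INT, dest, 0, comm);

    /* Buffered: attach a user buffer; the send completes once the
       message has been copied into it. */
    int bufsize = n * (int)sizeof(int) + MPI_BSEND_OVERHEAD;
    void *buf = malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);
    MPI_Bsend(data, n, MPI_INT, dest, 1, comm);
    MPI_Buffer_detach(&buf, &bufsize);
    free(buf);

    /* Standard: the implementation chooses buffering or coupling. */
    MPI_Send(data, n, MPI_INT, dest, 2, comm);

    /* Ready: only legal if the receive is already posted at the destination. */
    MPI_Rsend(data, n, MPI_INT, dest, 3, comm);
}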
24. Collective or global operations
- To simplify communication operations involving multiple processes, one can use collective operations on a communicator
- Typical operations
- Reductions
- Data exchange
- Broadcast
- Scatter
- Gather
- All-to-All
- Explicit synchronization
25. Reductions (1)
- A reduction is an arithmetic operation performed on data distributed over a set of processors
- Prototype
- C: int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm communicator)
- Fortran: MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, communicator, ierror)
- With MPI_Reduce(), only the root processor gets the result
- With MPI_Allreduce(), all processes get the result
26. Reductions (2)
27. Broadcast
- A broadcast operation distributes the same data to all processes
- One-to-all communication, from a specified root process to all processes of a communicator
- Prototypes
- C: int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
- Fortran: MPI_Bcast(buffer, count, datatype, root, communicator, ierror)
[Figure: the root process (rank 1 here) sends its buffer to processes 0 to np-1]
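A minimal usage sketch; root 1 matches the figure, the value 123 is illustrative, and at least two processes are assumed.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1)      /* the root holds the data to distribute */
        value = 123;

    /* Every process calls MPI_Bcast; afterwards all of them hold 123. */
    MPI_Bcast(&value, 1, MPI_INT, 1, MPI_COMM_WORLD);
    printf("process %d has value %d\n", rank, value);

    MPI_Finalize();
    return 0;
}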
28. Scatter
- One-to-all operation: different data are sent to each receiver process according to its rank
- Prototypes
- C: int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm communicator)
- Fortran: MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, communicator, ierror)
- The send parameters are used only by the sender process
[Figure: the root process (rank 2 here) splits its sendbuf among the recvbuf of processes 0 to np-1]
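A usage sketch; root 2 matches the figure (so at least three processes are assumed) and the values 10*i are illustrative.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, recv;
    int *sendbuf = NULL;
    const int root = 2;   /* root from the figure; run with >= 3 processes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == root) {
        /* Only the sender needs the full send buffer: one int per process. */
        sendbuf = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) sendbuf[i] = 10 * i;
    }

    /* Process i receives element i of the root's send buffer. */
    MPI_Scatter(sendbuf, 1, MPI_INT, &recv, 1, MPI_INT, root, MPI_COMM_WORLD);
    printf("process %d received %d\n", rank, recv);

    free(sendbuf);        /* free(NULL) is legal on non-root ranks */
    MPI_Finalize();
    return 0;
}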
29. Gather
- All-to-one operation: different data are received by a single receiver process
- Prototypes
- C: int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm communicator)
- Fortran: MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, communicator, ierror)
- The receive parameters are used only by the receiver process
[Figure: the sendbuf of processes 0 to np-1 are collected into the recvbuf of the root process (rank 3 here)]
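The mirror-image sketch for gather; root 3 matches the figure (so at least four processes are assumed), values are illustrative.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, send;
    int *recvbuf = NULL;
    const int root = 3;   /* root from the figure; run with >= 4 processes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    send = 10 * rank;     /* each process contributes one value */

    if (rank == root)     /* only the receiver needs the full buffer */
        recvbuf = malloc(size * sizeof(int));

    /* Element i of recvbuf on the root comes from process i. */
    MPI_Gather(&send, 1, MPI_INT, recvbuf, 1, MPI_INT, root, MPI_COMM_WORLD);

    if (rank == root)
        for (int i = 0; i < size; i++)
            printf("from process %d: %d\n", i, recvbuf[i]);

    free(recvbuf);
    MPI_Finalize();
    return 0;
}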
30. All-to-All
- All-to-all operation: different data are sent to each process, according to their rank
- Prototypes
- C: int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm communicator)
- Fortran: MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, communicator, ierror)
[Figure: every process scatters its sendbuf among the recvbuf of processes 0 to np-1]
31. Explicit synchronization
- Synchronization barrier: all processes of a communicator wait for the last process to enter the barrier before continuing their execution
- On computers with a hardware barrier available (such as the SGI and Cray T3E machines), the MPI barrier is slower than the hardware barrier
- Prototype
- C: int MPI_Barrier(MPI_Comm communicator)
- Fortran: MPI_Barrier(communicator, ierror)
32. Matrix trace (1)
- Computing the trace of a square matrix A of size n
- The matrix trace is the sum of the diagonal elements: Tr(A) = a_11 + a_22 + ... + a_nn
- One can easily see that the partial sums can be computed on multiple processors, ending with a reduction to obtain the complete trace (sketched below)
33. Matrix trace (2.1)
34. Matrix trace (2.2)
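The trace code itself is not reproduced on the slides; a minimal sketch in the same spirit (matrix size, values, and the cyclic distribution of the diagonal are illustrative) could look like this.

#include <mpi.h>
#include <stdio.h>

#define N 8   /* matrix dimension (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    double a[N][N], local = 0.0, trace = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Fill the matrix identically on every process (here: 2 * identity). */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (i == j) ? 2.0 : 0.0;

    /* Each process sums a cyclic share of the diagonal elements... */
    for (int i = rank; i < N; i += size)
        local += a[i][i];

    /* ...and a reduction combines the partial sums on process 0. */
    MPI_Reduce(&local, &trace, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("trace = %g\n", trace);   /* expected 16 for this matrix */

    MPI_Finalize();
    return 0;
}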
35. Contents
- Introduction to MPI
- Message passing
- Different types of communication
- MPI functionalities
- MPI structures
- Basic functions
- Data types
- Contexts and tags
- Groups and communication domains
- Communication functions
- Point to point communications
- Asynchronous communications
- Global communications
- MPI-2
- One-sided communications
- I/O
36. One-sided communications (1/2)
- No synchronization of the two processes during the communication itself
- Allows a simulated shared-memory implementation (Remote Memory Access)
- Defining the part of memory other processes can access
- MPI_Win_create()
- MPI_Win_free()
- One-sided communication functions
- MPI_Put()
- MPI_Get()
- MPI_Accumulate()
- Operations: MPI_SUM, MPI_LAND, MPI_REPLACE
37. One-sided communications (2/2)
- Active synchronization function
- MPI_Win_fence()
- Takes a memory window win as parameter
- Collective operation (barrier) on all processes of the window's group
- Acts as a synchronization barrier which ends every RMA transfer using the window win (see the sketch below)
- Passive synchronization functions
- MPI_Win_lock() and MPI_Win_unlock()
- Classical mutex-style functions
- The initiator of the communication is solely responsible for the synchronization
- When MPI_Win_unlock() returns, every transfer operation is finished
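A sketch of active synchronization with fences: every non-root process accumulates an increment into process 0's window (the counter semantics are illustrative).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, local = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Expose one int of local memory to the other processes. */
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* First fence opens the RMA access epoch. */
    MPI_Win_fence(0, win);

    if (rank != 0) {
        int one = 1;
        /* Add 1 into process 0's window; process 0 does not participate. */
        MPI_Accumulate(&one, 1, MPI_INT, 0, 0, 1, MPI_INT, MPI_SUM, win);
    }

    /* Second fence acts as a barrier ending all RMA transfers on win. */
    MPI_Win_fence(0, win);

    if (rank == 0)
        printf("received %d increments (expected %d)\n", local, size - 1);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}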
38. Parallel input/output
- Intelligent management of I/O is mandatory for parallel applications
- MPI-IO is a set of functions for optimised I/O
- Extends the classical file access functions
- Collective synchronization for accessing files
- Shared or individual file offsets
- Blocking or non-blocking reads
- Views (for accessing non-sequential zones)
- Syntax similar to the MPI communication functions (a small write sketch follows)
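A minimal MPI-IO sketch in which every process writes its own block of a shared file at an explicit offset; the file name "out.dat" and block size are illustrative.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    int data[4];
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 4; i++) data[i] = rank;

    /* Collective open of the same file by all processes of the communicator. */
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes at its own offset, so the writes do not overlap. */
    MPI_File_write_at(fh, (MPI_Offset)rank * 4 * sizeof(int),
                      data, 4, MPI_INT, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}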
39. Dynamic allocation of processes
- Dynamic change of the number of processes
- Spawning new processes during execution
- The MPI_Comm_spawn() function creates a new set of processes on other processors (see the sketch after this list)
- An inter-communicator links the domain of the parent to the new domain gathering the new processes
- The MPI_Intercomm_merge() function builds a single ordinary communicator out of an inter-communicator
- MPI-2 allows a dynamic MPMD style using the function MPI_Comm_spawn_multiple()
- MPI_Comm_get_attr(MPI_UNIVERSE_SIZE) is used to learn the maximum possible number of MPI processes
- Process destruction
- There is no explicit exit() function for an MPI process
- For an MPI process to exit, its communicator MPI_COMM_WORLD must contain only finalizing processes
- All inter-communicators must be closed before finalization
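A sketch of spawning and merging from the parent's side; the executable name "./worker" and the count of 4 processes are hypothetical.

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm intercomm, merged;

    MPI_Init(&argc, &argv);

    /* Launch 4 new processes running ./worker; spawn error codes are ignored. */
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4, MPI_INFO_NULL, 0,
                   MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

    /* Merge the parent and child domains into one ordinary communicator;
       high = 0 ranks the parents before the children. */
    MPI_Intercomm_merge(intercomm, 0, &merged);

    MPI_Comm_free(&merged);
    MPI_Comm_free(&intercomm);
    MPI_Finalize();
    return 0;
}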
40. Remarks and conclusion
- MPI has become, thanks to the distributed computing community, a standard library for message passing
- MPI-2 breaks the classic SPMD message passing model of MPI-1
- Numerous implementations exist, on most architectures
- Lots of documentation and publications are available
41. Some pointers
- MPI standard official site
- http://www-unix.mcs.anl.gov/mpi/
- The MPI forum
- http://www.mpi-forum.org/
- Book: MPI, The Complete Reference (Marc Snir et al.)
- http://www.netlib.org/utk/papers/mpi-book/mpi-book.html
42. Standard
43. Contents
- MPI implementations
- Performance metrics
- High performance networks
- Communication types / 0-copy
44. MPI implementations
- LAM-MPI
- Optimised for collective operations
- MPICH
- Easy writing of new low-level drivers
- Open-MPI
- Tries to combine the performance and ease of use of the two prior ones
- Conforms to MPI-2
- IBM / NEC / FUJITSU
- Complete and high-performance implementations of MPI-2
- Target specific architectures
45. Performance metrics
- Comparison criteria
- Latency
- Bandwidth
- Collective operations
- Overlapping capabilities
- Real applications
- Measuring tools
- Round Trip Time (ping-pong)
- NetPipe
- NAS benchmarks
- CG
- LU
- BT
- FT
46. High performance networks (1/3): technologies
- Myrinet
- Connectionless reliable API
- Registered buffers
- Fully programmable DMA NIC processor
- Up to full-duplex 2 Gb/s bandwidth with Myrinet 2000
- SCINet
- Torus-topology network with static routing
- No need to register buffers
- Very small latency (suitable for RMA)
- Up to 2 Gb/s
- Gigabit Ethernet
- No need to register buffers
- DMA operations
- High latency
- Up to 1 Gb/s and 10 Gb/s bandwidth
- Infiniband
- Reliable Connection mode and Unreliable Datagram mode
- Registered buffers
- Queued DMA operations
47. High performance networks (2/3): technologies
- Myrinet
- Socket-GM
- MPICH-GM
- SCINet
- No functional socket API
- SCI-MPICH
- Gigabit Ethernet
- Have to use the socket interface
- Infiniband
- IPoIB
- LAM-MPI, MPICH, MPI/Pro, etc.
48. High performance networks (3/3): technologies
49. Eager vs Rendez-vous (1/2)
- Eager protocol
- The message is sent without any control
- Better latency
- Copied into a buffer if the receiver has not posted the reception yet
- Memory consuming for long messages
- Used only for short messages (< 64 KB)
- Rendez-vous protocol
- Sender and receiver are synchronized
- Higher latency
- 0-copy
- Better bandwidth
- Reduces the memory consumption
50. Eager vs Rendez-vous (2/2)
51. Communication types
52. High performance networks and 0-copy
[Figure: latency comparison - Myrinet 8 µs, MPICH-GM 33 µs, MPICH-Vdummy 94 µs]
53. Conclusion
- Many MPI implementations with similar performance
- Multiple measurement criteria and multiple tools
- Latency, bandwidth
- Benchmarks and microbenchmarks
- Real applications
- High performance networks force attention to small performance details
- Network bandwidth equals the memory bandwidth
- Latency is smaller than some OS operations
- Performance relies on good programming
- Performance results can vary a lot according to the type of communication employed
- Asynchronism is mandatory
- Bad programming results in bad performance
- 0-copy can be mandatory