Title: Distributed Memory Programming Using Message Passing Interface, MPI
1. Distributed Memory Programming Using Message Passing Interface, MPI
2. The Basics: Helloworld.c

#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int myid, numprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    printf("Hello from id %d\n", myid);
    MPI_Finalize();
}
3. MPI_Init
- First function call made by every MPI process
- Must be called before any other MPI call is made

#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int i;
    int myid, numprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    for (i = 0; i < argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
    printf("Hello from id %d\n", myid);
    MPI_Finalize();
}
4. MPI_Comm_rank
- After MPI is initialized, every process is part of a Communicator
- A Communicator provides the environment for message passing
- MPI_COMM_WORLD is the default Communicator
- Returns the number (or rank) of each process, numbered 0 to (N-1)

#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int i;
    int myid, numprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    for (i = 0; i < argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
    printf("Hello from id %d\n", myid);
    MPI_Finalize();
}
5. MPI_Comm_size
- Returns the total number of processes in the Communicator

#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int i;
    int myid, numprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    for (i = 0; i < argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
    printf("Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
    MPI_Finalize();
}
6. MPI_Finalize
- Called when all MPI calls are complete
- Frees system resources used by MPI

#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int i;
    int myid, numprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    for (i = 0; i < argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
    printf("Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
    MPI_Finalize();
}
7. MPI_Send
MPI_Send(void *message, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int i;
    int myid, numprocs;
    char sig[80];
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    for (i = 0; i < argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
    if (myid == 0)
    {
        printf("Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
        for (i = 1; i < numprocs; i++)
        {
            /* receive loop continues on the MPI_Recv and MPI_Status slides */
8. MPI_Datatype: Datatypes for C
- MPI_CHAR: signed char
- MPI_DOUBLE: double
- MPI_FLOAT: float
- MPI_INT: int
- MPI_LONG: long
- MPI_LONG_DOUBLE: long double
- MPI_SHORT: short
- MPI_UNSIGNED_CHAR: unsigned char
- MPI_UNSIGNED: unsigned int
- MPI_UNSIGNED_LONG: unsigned long
- MPI_UNSIGNED_SHORT: unsigned short
9. MPI_Recv
MPI_Recv(void *message, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int i;
    int myid, numprocs;
    char sig[80];
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    for (i = 0; i < argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
    if (myid == 0)
    {
        printf("Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
        for (i = 1; i < numprocs; i++)
        {
            /* the MPI_Recv calls are shown on the MPI_Status slide */
10. MPI_Status: Status Record
- MPI_Recv blocks until a message is received or an error occurs
- Once MPI_Recv returns, the status record can be checked:
- status.MPI_SOURCE (where the message came from)
- status.MPI_TAG (the tag value)
- status.MPI_ERROR (error condition)

printf("Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
for (i = 1; i < numprocs; i++)
{
    MPI_Recv(sig, sizeof(sig), MPI_CHAR, i, 0, MPI_COMM_WORLD, &status);
    printf("%s", sig);
    printf("Message source = %d\n", status.MPI_SOURCE);
    printf("Message tag = %d\n", status.MPI_TAG);
    printf("Message Error condition = %d\n", status.MPI_ERROR);
}
11. Watch out for Deadlocks!
- Deadlocks occur when the code waits for a condition that will never happen
- Remember: MPI Sends and Receives work like channels in Foster's Design Methodology
- Sends are asynchronous (they send and return)
- Receives are synchronous (they block until the receive is complete)
- A common MPI deadlock happens when two processes are to exchange messages and both issue an MPI_Recv before doing an MPI_Send (see the sketch below)
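
A minimal sketch of the exchange pattern, assuming exactly two processes (the variable names are illustrative). The commented-out ordering is the deadlock; MPI_Sendrecv is one way to avoid it, and simply reversing the send/receive order on one of the two ranks works as well.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myid, other, sendval, recvval;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    other = 1 - myid;              /* assumes exactly 2 processes */
    sendval = myid;

    /* DEADLOCK: both ranks block in MPI_Recv and never reach MPI_Send
    MPI_Recv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);
    MPI_Send(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
    */

    /* Safe exchange: the combined send/receive lets MPI order the transfer */
    MPI_Sendrecv(&sendval, 1, MPI_INT, other, 0,
                 &recvval, 1, MPI_INT, other, 0,
                 MPI_COMM_WORLD, &status);
    printf("id %d received %d\n", myid, recvval);
    MPI_Finalize();
    return 0;
}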
12. MPI_Wtime and MPI_Wtick
- Used to measure performance (time a portion of the code)
- Can be used inside MPI codes to measure the performance of the algorithm, outside of the MPI message passing overhead
- MPI_Wtime returns the number of seconds since a point in the past
- MPI_Wtick returns the precision (resolution) of the values returned by MPI_Wtime
13. MPI_Wtime and MPI_Wtick example

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
for (i = 0; i < argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
if (myid == 0)
{
    printf("Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
    for (i = 1; i < numprocs; i++)
    {
        MPI_Recv(sig, sizeof(sig), MPI_CHAR, i, 0, MPI_COMM_WORLD, &status);
        printf("%s", sig);
    }
    start = MPI_Wtime();
    for (i = 0; i < 100; i++)
    {
        a[i] = i;
        b[i] = i * 10;
        c[i] = i + 7;
        a[i] = b[i] * c[i];
    }
14. MPI_Barrier
MPI_Barrier(MPI_Comm comm)
- A mechanism to force synchronization amongst all processes
- Useful when you are timing performance
- Assume all processes are performing the same calculation
- We need to ensure they all start at the same time
- Also useful when you want to ensure that all processes have completed an operation before any of them begin a new one

MPI_Barrier(MPI_COMM_WORLD);
start = MPI_Wtime();
result = run_big_computation();
MPI_Barrier(MPI_COMM_WORLD);
end = MPI_Wtime();
printf("This big computation took %.5f seconds\n", end - start);
15. MPI_Bcast
MPI_Bcast(void *message, int count, MPI_Datatype datatype, int source, MPI_Comm comm)
- Collective communication
- Allows a process to broadcast a message to all other processes

MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
while (1)
{
    if (myid == 0)
    {
        printf("Enter the number of intervals (0 quits)\n");
        fflush(stdout);
        scanf("%d", &n);
    } // if myid == 0
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
16. MPI_Reduce
MPI_Reduce(void *send_buf, void *recv_buf, int count, MPI_Datatype dtype, MPI_Op op, int root, MPI_Comm comm)
- Collective communication
- Processes perform the specified reduction
- The root has the results

if (myid == 0)
{
    printf("Enter the number of intervals (0 quits)\n");
    fflush(stdout);
    scanf("%d", &n);
} // if myid == 0
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (n == 0) break;
else
{
    h = 1.0 / (double) n;
    sum = 0.0;
    for (i = myid + 1; i <= n; i += numprocs)
    {
        x = h * ((double)i - 0.5);
        sum += (4.0 / (1.0 + x*x));
    } // for
    mypi = h * sum;
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
17. MPI_Allreduce
MPI_Allreduce(void *send_buf, void *recv_buf, int count, MPI_Datatype dtype, MPI_Op op, MPI_Comm comm)
- Collective communication
- Processes perform the specified reduction
- All processes have the results

    start = MPI_Wtime();
    for (i = 0; i < 100; i++)
    {
        a[i] = i;
        b[i] = i * 10;
        c[i] = i + 7;
        a[i] = b[i] * c[i];
    }
    end = MPI_Wtime();
    printf("Our timer's precision is %.20f seconds\n", MPI_Wtick());
    printf("This silly loop took %.5f seconds\n", end - start);
}
else
{
    sprintf(sig, "Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
    MPI_Send(sig, sizeof(sig), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
}
MPI_Allreduce(&myid, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
printf("Sum of all process ids = %d\n", sum);
MPI_Finalize();
18. MPI Reduction Operators
- MPI_BAND: bitwise and
- MPI_BOR: bitwise or
- MPI_BXOR: bitwise exclusive or
- MPI_LAND: logical and
- MPI_LOR: logical or
- MPI_LXOR: logical exclusive or
- MPI_MAX: maximum
- MPI_MAXLOC: maximum and location of maximum
- MPI_MIN: minimum
- MPI_MINLOC: minimum and location of minimum
- MPI_PROD: product
- MPI_SUM: sum
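
As an illustration of the location operators, the sketch below (the per-rank value is a made-up stand-in) uses MPI_MAXLOC with the built-in MPI_DOUBLE_INT pair type, so the root learns both the maximum value and the rank that held it.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myid;
    struct { double value; int rank; } local, global;   /* layout matches MPI_DOUBLE_INT */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    local.value = (double)((7 * myid + 3) % 11);   /* stand-in for a real per-rank result */
    local.rank  = myid;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
    if (myid == 0)
        printf("Max value %f held by rank %d\n", global.value, global.rank);
    MPI_Finalize();
    return 0;
}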
19. Using Message Passing Interface, MPI: More Advanced APIs and Examples
20. MPI_Gather (example 1)
MPI_Gather(sendbuf, sendcnt, sendtype, recvbuf, recvcount, recvtype, root, comm)
- Collective communication
- Root gathers data from every process, including itself

#include <stdio.h>
#include <mpi.h>
#include <malloc.h>
int main(int argc, char *argv[])
{
    int i, myid, numprocs;
    int *ids;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0)
        ids = (int *) malloc(numprocs * sizeof(int));
    MPI_Gather(&myid, 1, MPI_INT, ids, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (myid == 0)
        for (i = 0; i < numprocs; i++)
            printf("%d\n", ids[i]);
    MPI_Finalize();
}
21. MPI_Gather (example 2)
MPI_Gather(sendbuf, sendcnt, sendtype, recvbuf, recvcount, recvtype, root, comm)

#include <stdio.h>
#include <mpi.h>
#include <malloc.h>
int main(int argc, char *argv[])
{
    int i, myid, numprocs;
    char sig[80];
    char *signatures;
    char *sigs;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    sprintf(sig, "Hello from id %d\n", myid);
    if (myid == 0)
        signatures = (char *) malloc(numprocs * sizeof(sig));
    MPI_Gather(sig, sizeof(sig), MPI_CHAR, signatures, sizeof(sig), MPI_CHAR, 0, MPI_COMM_WORLD);
    if (myid == 0)
        /* print the gathered signatures (continuation not shown on the slide) */
22. MPI_Alltoall
MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcnt, recvtype, comm)
- Collective communication
- Each process sends and receives the same amount of data to and from every process, including itself

#include <stdio.h>
#include <mpi.h>
#include <malloc.h>
int main(int argc, char *argv[])
{
    int i, myid, numprocs;
    int *all, *ids;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    ids = (int *) malloc(numprocs * 3 * sizeof(int));
    all = (int *) malloc(numprocs * 3 * sizeof(int));
    for (i = 0; i < numprocs * 3; i++) ids[i] = myid;
    MPI_Alltoall(ids, 3, MPI_INT, all, 3, MPI_INT, MPI_COMM_WORLD);
    for (i = 0; i < numprocs * 3; i++)
        printf("%d\n", all[i]);
    MPI_Finalize();
}
23. Variations of MPI_Send
- MPI_Send
- MPI_Send( buf, count, datatype, dest, tag, comm )
- Standard send; may return before the matching receive completes if the message is buffered successfully
- Behavior is implementation dependent and can be modified at run-time
- MPI_Rsend
- MPI_Rsend( buf, count, datatype, dest, tag, comm )
- Ready mode send; the send only happens correctly if the matching receive is already posted
- MPI_Ssend
- MPI_Ssend( buf, count, datatype, dest, tag, comm )
- Synchronous send
- Returns when the matching receive has started and the receive has begun
- MPI_Bsend
- MPI_Bsend( buf, count, datatype, dest, tag, comm )
- Basic send with user-specified buffering via MPI_Buffer_attach
- MPI must buffer the outgoing send and return (a buffered-send sketch follows below)
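
The buffered mode can be sketched as follows. This is an illustration, not code from the course; the buffer sizing, the tag, and the assumption of at least two processes are mine.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int myid, value, bufsize;
    char *buffer;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    bufsize = MPI_BSEND_OVERHEAD + sizeof(int);     /* room for one buffered int */
    buffer = (char *) malloc(bufsize);
    MPI_Buffer_attach(buffer, bufsize);
    if (myid == 1)
    {
        value = 42;
        MPI_Bsend(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* copied into the attached buffer */
    }
    else if (myid == 0)
    {
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        printf("Received %d\n", value);
    }
    MPI_Buffer_detach(&buffer, &bufsize);           /* blocks until buffered sends complete */
    free(buffer);
    MPI_Finalize();
    return 0;
}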
24. More Variations of MPI_Send
- MPI_Ibsend
- MPI_Ibsend( buf, count, datatype, dest, tag, comm, request )
- Non-blocking buffered send
- Do not access the send buffer until the send is complete; use the request handle to check
- MPI_Irsend
- MPI_Irsend( buf, count, datatype, dest, tag, comm, request )
- Non-blocking ready send
- Do not access the send buffer until the send is complete; use the request handle to check
- MPI_Issend
- MPI_Issend( buf, count, datatype, dest, tag, comm, request )
- Synchronous mode non-blocking send
- Control returns when the matching receive has begun
- Do not access the send buffer until the send is complete; use the request handle to check
- MPI_Isend
- MPI_Isend( buf, count, datatype, dest, tag, comm, request )
- Immediate non-blocking send (the message goes into a pending state)
- Complete the send with a call to MPI_Wait or a similar function
- Do not access the send buffer until the send is complete; use the request handle to check (see the sketch below)
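
A minimal sketch of the immediate send pattern, assuming at least two processes; the values, tags, and variable names are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myid, outgoing, incoming;
    MPI_Request request;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 1)
    {
        outgoing = 99;
        MPI_Isend(&outgoing, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
        /* ... other work that does not touch 'outgoing' ... */
        MPI_Wait(&request, &status);        /* only now may the buffer be reused */
    }
    else if (myid == 0)
    {
        MPI_Recv(&incoming, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        printf("Received %d\n", incoming);
    }
    MPI_Finalize();
    return 0;
}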
25. Variations of MPI_Recv
- MPI_Recv
- MPI_Recv( buf, count, datatype, source, tag, comm, status )
- Blocking receive
- MPI_Irecv
- MPI_Irecv( buf, count, datatype, source, tag, comm, request )
- Non-blocking receive
- Use MPI_Wait to ensure the message receipt has completed before accessing the buffer
- MPI_Wait
- MPI_Wait( MPI_Request *request, MPI_Status *status )
26. MPI_Irecv Example: Task Parallelism fragment (tp1.c)

while (complete < iter)
{
    for (w = 1; w < numprocs; w++)
    {
        if ((worker[w] == idle) && (complete < iter))
        {
            printf("Master sending UoW[%d] to Worker %d\n", complete, w);
            Unit_of_Work[0] = a[complete];
            Unit_of_Work[1] = b[complete];
            // Send the Unit of Work
            MPI_Send(Unit_of_Work, 2, MPI_INT, w, 0, MPI_COMM_WORLD);
            // Post a non-blocking Recv for that Unit of Work
            MPI_Irecv(&result[w], 1, MPI_INT, w, 0, MPI_COMM_WORLD, &recv_req[w]);
            worker[w] = complete;
            dispatched++;
            complete++;   // next unit of work to send out
        }
    } // foreach idle worker
    // Collect returned results
27. MPI_Probe and MPI_Iprobe
- MPI_Probe
- MPI_Probe( source, tag, comm, status )
- Blocking test for a message
- MPI_Iprobe
- int MPI_Iprobe( source, tag, comm, flag, status )
- Non-blocking test for a message
- Source can be specified or MPI_ANY_SOURCE
- Tag can be specified or MPI_ANY_TAG
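
A minimal sketch, assuming at least two processes, of pairing MPI_Probe with MPI_Get_count to size the receive buffer before the actual MPI_Recv; the payload and tag are illustrative.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int myid, count;
    int *data;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 1)
    {
        int payload[4] = {1, 2, 3, 4};
        MPI_Send(payload, 4, MPI_INT, 0, 7, MPI_COMM_WORLD);
    }
    else if (myid == 0)
    {
        MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);   /* block until a message is pending */
        MPI_Get_count(&status, MPI_INT, &count);                           /* how many MPI_INTs are waiting */
        data = (int *) malloc(count * sizeof(int));
        MPI_Recv(data, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                 MPI_COMM_WORLD, &status);
        printf("Received %d ints from rank %d with tag %d\n", count, status.MPI_SOURCE, status.MPI_TAG);
        free(data);
    }
    MPI_Finalize();
    return 0;
}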
28. BagBoy Example (1 of 3)

#include <stdio.h>
#include <mpi.h>
#include <math.h>
#include <time.h>
#include <malloc.h>
#include <stdlib.h>   /* rand, srand */
#define Products 10
int main(int argc, char *argv[])
{
    int myid, numprocs;
    int true = 1;
    int false = 0;
    int messages = true;
    int i, g, items, flag;
    int *customer_items;
    int checked_out = 0;
    char Groceries[Products][20] = {"Chips", "Lettuce", "Bread", "Eggs", "Pork Chops",
        "Carrots", "Rice", "Potatoes", "Canned Beans", "Spaghetti Sauce"};
    MPI_Status status;
29. BagBoy Example (2 of 3)

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (numprocs >= 2)
    {
        if (myid == 0) // Master
        {
            customer_items = (int *) malloc(numprocs * sizeof(int));
            for (i = 1; i < numprocs; i++) customer_items[i] = 0;
            while (messages)
            {
                MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
                if (flag)
                {
                    MPI_Recv(&items, 1, MPI_INT, status.MPI_SOURCE, status.MPI_TAG, MPI_COMM_WORLD, &status);
                    customer_items[status.MPI_SOURCE]++;
                    //printf("%d %d of %d\n", status.MPI_SOURCE, customer_items[status.MPI_SOURCE], items);
                    if (customer_items[status.MPI_SOURCE] == items) checked_out++;
                    printf("%s from %d\n", Groceries[status.MPI_TAG], status.MPI_SOURCE);
30. BagBoy Example (3 of 3)

        else // Workers
        {
            srand((unsigned)time(NULL) + myid);
            items = (rand() % 5) + 1;
            for (i = 1; i <= items; i++)
            {
                g = rand() % 10;
                printf("Sending %s\n", Groceries[g]);
                MPI_Send(&items, 1, MPI_INT, 0, g, MPI_COMM_WORLD);
            }
        } // Workers
    }
    else
        printf("ERROR: Must have at least 2 processes to run\n");
    MPI_Finalize();
}
31. Using Message Passing Interface, MPI: Bubble Sort
32. Bubble Sort

#include <stdio.h>
#include <stdlib.h>   /* rand() */
#define N 10
int main(int argc, char *argv[])
{
    int a[N];
    int i, j, tmp;

    printf("Unsorted\n");
    for (i = 0; i < N; i++) { a[i] = rand(); printf("%d\n", a[i]); }
    for (i = 0; i < (N-1); i++)
        for (j = (N-1); j > i; j--)
            if (a[j-1] > a[j])
            {
                tmp = a[j];
                a[j] = a[j-1];
                a[j-1] = tmp;
            }
33. Serial Bubble Sort in Action
34. Step 1: Partitioning (Divide Computation and Data into Pieces)
- The primitive task would be each element of the unsorted array
- Goals:
- Order of magnitude more primitive tasks than processors
- Minimize redundant computations and data
- Primitive tasks are approximately the same size
- The number of primitive tasks increases as problem size increases
35. Step 2: Communication (Determine Communication Patterns between Primitive Tasks)
- Each task communicates with its neighbor on each side
- Goals:
- Communication is balanced among all tasks
- Each task communicates with a minimal number of neighbors
- Tasks can perform communications concurrently
- Tasks can perform computations concurrently
- Note: there are some exceptions in the actual implementation
36. Step 3: Agglomeration (Group Tasks to Improve Efficiency or Simplify Programming)
- Divide the unsorted array evenly amongst processes
- Perform sort steps in parallel
- Exchange elements with other processes when necessary (see the compare-split sketch after the checklist below)
[Figure: array of N elements (indices 0 to N) divided evenly among Process 0, Process 1, Process 2, ..., Process n]
- Increase the locality of the parallel algorithm
- Replicated computations take less time than the communications they replace
- Replicated data is small enough to allow the algorithm to scale
- Agglomerated tasks have similar computational and communication costs
- Number of tasks can increase as the problem size does
- Number of tasks is as small as possible but at least as large as the number of available processors
- The trade-off between agglomeration and the cost of modifications to the sequential code is reasonable
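
One way to sketch the exchange step is an odd-even transposition compare-split between neighboring ranks. The code below is an illustration under assumptions of my own (equal BLOCK-sized pieces per rank, qsort for the local sort), not the course's homework solution.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK 4   /* elements held by each process (illustrative) */

static int cmp_int(const void *a, const void *b) { return *(const int *)a - *(const int *)b; }

int main(int argc, char *argv[])
{
    int myid, numprocs, i, phase, partner;
    int mine[BLOCK], theirs[BLOCK], merged[2 * BLOCK];
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    srand(myid + 1);
    for (i = 0; i < BLOCK; i++) mine[i] = rand() % 100;
    qsort(mine, BLOCK, sizeof(int), cmp_int);               /* sort the local block first */

    for (phase = 0; phase < numprocs; phase++)
    {
        /* even phases pair (0,1)(2,3)...; odd phases pair (1,2)(3,4)... */
        partner = ((phase + myid) % 2 == 0) ? myid + 1 : myid - 1;
        if (partner < 0 || partner >= numprocs) continue;    /* no partner this phase */
        MPI_Sendrecv(mine, BLOCK, MPI_INT, partner, 0,
                     theirs, BLOCK, MPI_INT, partner, 0, MPI_COMM_WORLD, &status);
        for (i = 0; i < BLOCK; i++) { merged[i] = mine[i]; merged[BLOCK + i] = theirs[i]; }
        qsort(merged, 2 * BLOCK, sizeof(int), cmp_int);
        /* the lower rank keeps the small half, the higher rank keeps the large half */
        if (myid < partner) for (i = 0; i < BLOCK; i++) mine[i] = merged[i];
        else                for (i = 0; i < BLOCK; i++) mine[i] = merged[BLOCK + i];
    }
    for (i = 0; i < BLOCK; i++) printf("rank %d: %d\n", myid, mine[i]);
    MPI_Finalize();
    return 0;
}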
37. Step 4: Mapping (Assigning Tasks to Processors)
- Map each process to a processor
- This is not a CPU-intensive operation, so running multiple processes per machine should be considered
- If the array to be sorted is very large, physical memory limitations may require using more machines
[Figure: array of N elements (indices 0 to N) divided among Process 0 through Process n, each process mapped to its own Processor]
- Mapping based on one task per processor and multiple tasks per processor has been considered
- Both static and dynamic allocation of tasks to processors have been evaluated
- (N/A) If a dynamic allocation of tasks to processors is chosen, the task allocator is not a bottleneck
- If static allocation of tasks to processors is chosen, the ratio of tasks to processors is at least 10 to 1
38. Hint: Sketch out Algorithm Behavior BEFORE Implementing (1 of 2)
- 7 6 5 4 3 2 1 0
- j=3, j=7
- 7 6 4 5 3 2 0 1
- j=2, j=6
- 7 4 6 5 3 0 2 1
- j=1, j=5
- 4 7 6 5 0 3 2 1
- j=0, j=4
- <->
- 4 7 6 0 5 3 2 1
- j=3, j=7
- 4 7 0 6 5 3 1 2
- j=2, j=6
- 4 0 7 6 5 1 3 2
- j=1, j=5
- 0 4 7 6 1 5 3 2
- j=0, j=4
- <->
- 0 4 7 1 6 5 3 2
39. Hint (2 of 2)
- 0 1 4 2 7 6 5 3
- j=3, j=7
- 0 1 2 4 7 6 3 5
- j=2, j=6
- 0 1 2 4 7 3 6 5
- j=1, j=5
- 0 1 2 4 3 7 6 5
- j=0, j=4
- <->
- 0 1 2 3 4 7 6 5
- j=3, j=7
- 0 1 2 3 4 7 5 6
- j=2, j=6
- 0 1 2 3 4 5 7 6
- j=1, j=5
- 0 1 2 3 4 5 7 6
- j=0, j=4
- <->
- 0 1 2 3 4 5 7 6
40. Bubble Sort Performance
41. Homework Solutions
- Parallel Bubble Sort
- BagBoy