Title: Distributed Memory Programming Using Message Passing Interface, MPI
1. Distributed Memory Programming Using Message Passing Interface, MPI
2. The Basics: Helloworld.c

#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int myid, numprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    printf("Hello from id %d\n", myid);
    MPI_Finalize();
}
3. MPI_Init
- First function call made by every MPI process
- Must be called before any other MPI call is made

#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int i;
    int myid, numprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    for (i = 0; i < argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
    printf("Hello from id %d\n", myid);
    MPI_Finalize();
}
4. MPI_Comm_rank
- After MPI is initialized, every process is part of a Communicator
- A Communicator provides the environment for message passing
- MPI_COMM_WORLD is the default Communicator
- Returns the number (or rank) of each process, numbered 0 to (N-1)

#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int i;
    int myid, numprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    for (i = 0; i < argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
    printf("Hello from id %d\n", myid);
    MPI_Finalize();
}
5. MPI_Comm_size
- Returns the total number of processes in the Communicator

#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int i;
    int myid, numprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    for (i = 0; i < argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
    printf("Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
    MPI_Finalize();
}
6. MPI_Finalize
- Called when all MPI calls are complete
- Frees system resources used by MPI

#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int i;
    int myid, numprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    for (i = 0; i < argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
    printf("Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
    MPI_Finalize();
}
7. MPI_Send
MPI_Send(void *message, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int i;
    int myid, numprocs;
    char sig[80];
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    for (i = 0; i < argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
    if (myid == 0)
    {
        printf("Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
        for (i = 1; i < numprocs; i++)
        {
            /* receive loop continues on the MPI_Recv and MPI_Status slides */
8. MPI_Datatype: Datatypes for C
- MPI_CHAR: signed char
- MPI_DOUBLE: double
- MPI_FLOAT: float
- MPI_INT: int
- MPI_LONG: long
- MPI_LONG_DOUBLE: long double
- MPI_SHORT: short
- MPI_UNSIGNED_CHAR: unsigned char
- MPI_UNSIGNED: unsigned int
- MPI_UNSIGNED_LONG: unsigned long
- MPI_UNSIGNED_SHORT: unsigned short
9. MPI_Recv
MPI_Recv(void *message, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

#include <stdio.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int i;
    int myid, numprocs;
    char sig[80];
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    for (i = 0; i < argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
    if (myid == 0)
    {
        printf("Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
        for (i = 1; i < numprocs; i++)
        {
            /* the MPI_Recv calls are shown on the MPI_Status slide */
10. MPI_Status: Status Record
- MPI_Recv blocks until a message is received or an error occurs
- Once MPI_Recv returns, the status record can be checked:
- status.MPI_SOURCE (where the message came from)
- status.MPI_TAG (the tag value)
- status.MPI_ERROR (error condition)

printf("Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
for (i = 1; i < numprocs; i++)
{
    MPI_Recv(sig, sizeof(sig), MPI_CHAR, i, 0, MPI_COMM_WORLD, &status);
    printf("%s", sig);
    printf("Message source = %d\n", status.MPI_SOURCE);
    printf("Message tag = %d\n", status.MPI_TAG);
    printf("Message Error condition = %d\n", status.MPI_ERROR);
}
11. Watch out for Deadlocks!
- Deadlocks occur when the code waits for a condition that will never happen
- Remember: MPI Sends and Receives work like channels in Foster's Design Methodology
- Sends are asynchronous (they send and return)
- Receives are synchronous (they block until the receive is complete)
- A common MPI deadlock happens when two processes are to exchange messages and both issue an MPI_Recv before doing an MPI_Send (see the sketch below)
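
A minimal sketch of the exchange pattern, assuming exactly two processes (the variable names are illustrative). The commented-out ordering is the deadlock; MPI_Sendrecv is one way to avoid it, and simply reversing the send/receive order on one of the two ranks works as well.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myid, other, sendval, recvval;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    other = 1 - myid;              /* assumes exactly 2 processes */
    sendval = myid;

    /* DEADLOCK: both ranks block in MPI_Recv and never reach MPI_Send
    MPI_Recv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);
    MPI_Send(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
    */

    /* Safe exchange: the combined send/receive lets MPI order the transfer */
    MPI_Sendrecv(&sendval, 1, MPI_INT, other, 0,
                 &recvval, 1, MPI_INT, other, 0,
                 MPI_COMM_WORLD, &status);
    printf("id %d received %d\n", myid, recvval);
    MPI_Finalize();
    return 0;
}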
12. MPI_Wtime and MPI_Wtick
- Used to measure performance (time a portion of the code)
- Can be used inside MPI codes to measure the performance of the algorithm, outside of the MPI message passing overhead
- MPI_Wtime returns the number of seconds since a point in the past
- MPI_Wtick returns the precision (resolution) of the values returned by MPI_Wtime
13. MPI_Wtime and MPI_Wtick example

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
for (i = 0; i < argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
if (myid == 0)
{
    printf("Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
    for (i = 1; i < numprocs; i++)
    {
        MPI_Recv(sig, sizeof(sig), MPI_CHAR, i, 0, MPI_COMM_WORLD, &status);
        printf("%s", sig);
    }
    start = MPI_Wtime();
    for (i = 0; i < 100; i++)
    {
        a[i] = i;
        b[i] = i * 10;
        c[i] = i + 7;
        a[i] = b[i] * c[i];
    }
14. MPI_Barrier
MPI_Barrier(MPI_Comm comm)
- A mechanism to force synchronization amongst all processes
- Useful when you are timing performance
- Assume all processes are performing the same calculation
- We need to ensure they all start at the same time
- Also useful when you want to ensure that all processes have completed an operation before any of them begin a new one

MPI_Barrier(MPI_COMM_WORLD);
start = MPI_Wtime();
result = run_big_computation();
MPI_Barrier(MPI_COMM_WORLD);
end = MPI_Wtime();
printf("This big computation took %.5f seconds\n", end - start);
15. MPI_Bcast
MPI_Bcast(void *message, int count, MPI_Datatype datatype, int source, MPI_Comm comm)
- Collective communication
- Allows a process to broadcast a message to all other processes

MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
while (1)
{
    if (myid == 0)
    {
        printf("Enter the number of intervals (0 quits)\n");
        fflush(stdout);
        scanf("%d", &n);
    } // if myid == 0
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
16. MPI_Reduce
MPI_Reduce(void *send_buf, void *recv_buf, int count, MPI_Datatype dtype, MPI_Op op, int root, MPI_Comm comm)
- Collective communication
- Processes perform the specified reduction
- The root has the results

if (myid == 0)
{
    printf("Enter the number of intervals (0 quits)\n");
    fflush(stdout);
    scanf("%d", &n);
} // if myid == 0
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (n == 0) break;
else
{
    h = 1.0 / (double) n;
    sum = 0.0;
    for (i = myid + 1; i <= n; i += numprocs)
    {
        x = h * ((double)i - 0.5);
        sum += (4.0 / (1.0 + x*x));
    } // for
    mypi = h * sum;
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
17. MPI_Allreduce
MPI_Allreduce(void *send_buf, void *recv_buf, int count, MPI_Datatype dtype, MPI_Op op, MPI_Comm comm)
- Collective communication
- Processes perform the specified reduction
- All processes have the results

    start = MPI_Wtime();
    for (i = 0; i < 100; i++)
    {
        a[i] = i;
        b[i] = i * 10;
        c[i] = i + 7;
        a[i] = b[i] * c[i];
    }
    end = MPI_Wtime();
    printf("Our timer's precision is %.20f seconds\n", MPI_Wtick());
    printf("This silly loop took %.5f seconds\n", end - start);
}
else
{
    sprintf(sig, "Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
    MPI_Send(sig, sizeof(sig), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
}
MPI_Allreduce(&myid, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
printf("Sum of all process ids = %d\n", sum);
MPI_Finalize();
18. MPI Reduction Operators
- MPI_BAND: bitwise and
- MPI_BOR: bitwise or
- MPI_BXOR: bitwise exclusive or
- MPI_LAND: logical and
- MPI_LOR: logical or
- MPI_LXOR: logical exclusive or
- MPI_MAX: maximum
- MPI_MAXLOC: maximum and location of maximum
- MPI_MIN: minimum
- MPI_MINLOC: minimum and location of minimum
- MPI_PROD: product
- MPI_SUM: sum
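
As an illustration of the location operators, the sketch below (the per-rank value is a made-up stand-in) uses MPI_MAXLOC with the built-in MPI_DOUBLE_INT pair type, so the root learns both the maximum value and the rank that held it.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myid;
    struct { double value; int rank; } local, global;   /* layout matches MPI_DOUBLE_INT */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    local.value = (double)((7 * myid + 3) % 11);   /* stand-in for a real per-rank result */
    local.rank  = myid;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
    if (myid == 0)
        printf("Max value %f held by rank %d\n", global.value, global.rank);
    MPI_Finalize();
    return 0;
}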
19. Using Message Passing Interface, MPI: More Advanced APIs and Examples
20. MPI_Gather (example 1)
MPI_Gather(sendbuf, sendcnt, sendtype, recvbuf, recvcount, recvtype, root, comm)
- Collective communication
- Root gathers data from every process, including itself

#include <stdio.h>
#include <mpi.h>
#include <malloc.h>
int main(int argc, char *argv[])
{
    int i, myid, numprocs;
    int *ids;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0)
        ids = (int *) malloc(numprocs * sizeof(int));
    MPI_Gather(&myid, 1, MPI_INT, ids, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (myid == 0)
        for (i = 0; i < numprocs; i++)
            printf("%d\n", ids[i]);
    MPI_Finalize();
}
21. MPI_Gather (example 2)
MPI_Gather(sendbuf, sendcnt, sendtype, recvbuf, recvcount, recvtype, root, comm)

#include <stdio.h>
#include <mpi.h>
#include <malloc.h>
int main(int argc, char *argv[])
{
    int i, myid, numprocs;
    char sig[80];
    char *signatures;
    char *sigs;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    sprintf(sig, "Hello from id %d\n", myid);
    if (myid == 0)
        signatures = (char *) malloc(numprocs * sizeof(sig));
    MPI_Gather(sig, sizeof(sig), MPI_CHAR, signatures, sizeof(sig), MPI_CHAR, 0, MPI_COMM_WORLD);
    if (myid == 0)
        /* print the gathered signatures (continuation not shown on the slide) */
22. MPI_Alltoall
MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcnt, recvtype, comm)
- Collective communication
- Each process sends and receives the same amount of data to and from every process, including itself

#include <stdio.h>
#include <mpi.h>
#include <malloc.h>
int main(int argc, char *argv[])
{
    int i, myid, numprocs;
    int *all, *ids;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    ids = (int *) malloc(numprocs * 3 * sizeof(int));
    all = (int *) malloc(numprocs * 3 * sizeof(int));
    for (i = 0; i < numprocs * 3; i++) ids[i] = myid;
    MPI_Alltoall(ids, 3, MPI_INT, all, 3, MPI_INT, MPI_COMM_WORLD);
    for (i = 0; i < numprocs * 3; i++)
        printf("%d\n", all[i]);
    MPI_Finalize();
}
23. Variations of MPI_Send
- MPI_Send
- MPI_Send( buf, count, datatype, dest, tag, comm )
- Standard send; may return before the matching receive completes if the message is buffered successfully
- Behavior is implementation dependent and can be modified at run-time
- MPI_Rsend
- MPI_Rsend( buf, count, datatype, dest, tag, comm )
- Ready mode send; the send only happens correctly if the matching receive is already posted
- MPI_Ssend
- MPI_Ssend( buf, count, datatype, dest, tag, comm )
- Synchronous send
- Returns when the matching receive has started and the receive has begun
- MPI_Bsend
- MPI_Bsend( buf, count, datatype, dest, tag, comm )
- Basic send with user-specified buffering via MPI_Buffer_attach
- MPI must buffer the outgoing send and return (a buffered-send sketch follows below)
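
The buffered mode can be sketched as follows. This is an illustration, not code from the course; the buffer sizing, the tag, and the assumption of at least two processes are mine.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int myid, value, bufsize;
    char *buffer;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    bufsize = MPI_BSEND_OVERHEAD + sizeof(int);     /* room for one buffered int */
    buffer = (char *) malloc(bufsize);
    MPI_Buffer_attach(buffer, bufsize);
    if (myid == 1)
    {
        value = 42;
        MPI_Bsend(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* copied into the attached buffer */
    }
    else if (myid == 0)
    {
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        printf("Received %d\n", value);
    }
    MPI_Buffer_detach(&buffer, &bufsize);           /* blocks until buffered sends complete */
    free(buffer);
    MPI_Finalize();
    return 0;
}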
24. More Variations of MPI_Send
- MPI_Ibsend
- MPI_Ibsend( buf, count, datatype, dest, tag, comm, request )
- Non-blocking buffered send
- Do not access the send buffer until the send is complete; use the request handle to check
- MPI_Irsend
- MPI_Irsend( buf, count, datatype, dest, tag, comm, request )
- Non-blocking ready send
- Do not access the send buffer until the send is complete; use the request handle to check
- MPI_Issend
- MPI_Issend( buf, count, datatype, dest, tag, comm, request )
- Synchronous mode non-blocking send
- Control returns when the matching receive has begun
- Do not access the send buffer until the send is complete; use the request handle to check
- MPI_Isend
- MPI_Isend( buf, count, datatype, dest, tag, comm, request )
- Immediate non-blocking send (the message goes into a pending state)
- Complete the send with a call to MPI_Wait or a similar function
- Do not access the send buffer until the send is complete; use the request handle to check (see the sketch below)
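
A minimal sketch of the immediate send pattern, assuming at least two processes; the values, tags, and variable names are illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myid, outgoing, incoming;
    MPI_Request request;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 1)
    {
        outgoing = 99;
        MPI_Isend(&outgoing, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
        /* ... other work that does not touch 'outgoing' ... */
        MPI_Wait(&request, &status);        /* only now may the buffer be reused */
    }
    else if (myid == 0)
    {
        MPI_Recv(&incoming, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
        printf("Received %d\n", incoming);
    }
    MPI_Finalize();
    return 0;
}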
25. Variations of MPI_Recv
- MPI_Recv
- MPI_Recv( buf, count, datatype, source, tag, comm, status )
- Blocking receive
- MPI_Irecv
- MPI_Irecv( buf, count, datatype, source, tag, comm, request )
- Non-blocking receive
- Use MPI_Wait to ensure the message receipt has completed before accessing the buffer
- MPI_Wait
- MPI_Wait( MPI_Request *request, MPI_Status *status )
26. MPI_Irecv Example: Task Parallelism fragment (tp1.c)

while (complete < iter)
{
    for (w = 1; w < numprocs; w++)
    {
        if ((worker[w] == idle) && (complete < iter))
        {
            printf("Master sending UoW[%d] to Worker %d\n", complete, w);
            Unit_of_Work[0] = a[complete];
            Unit_of_Work[1] = b[complete];
            // Send the Unit of Work
            MPI_Send(Unit_of_Work, 2, MPI_INT, w, 0, MPI_COMM_WORLD);
            // Post a non-blocking Recv for that Unit of Work
            MPI_Irecv(&result[w], 1, MPI_INT, w, 0, MPI_COMM_WORLD, &recv_req[w]);
            worker[w] = complete;
            dispatched++;
            complete++;   // next unit of work to send out
        }
    } // foreach idle worker
    // Collect returned results
27. MPI_Probe and MPI_Iprobe
- MPI_Probe
- MPI_Probe( source, tag, comm, status )
- Blocking test for a message
- MPI_Iprobe
- int MPI_Iprobe( source, tag, comm, flag, status )
- Non-blocking test for a message
- Source can be specified or MPI_ANY_SOURCE
- Tag can be specified or MPI_ANY_TAG
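
A minimal sketch, assuming at least two processes, of pairing MPI_Probe with MPI_Get_count to size the receive buffer before the actual MPI_Recv; the payload and tag are illustrative.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int myid, count;
    int *data;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 1)
    {
        int payload[4] = {1, 2, 3, 4};
        MPI_Send(payload, 4, MPI_INT, 0, 7, MPI_COMM_WORLD);
    }
    else if (myid == 0)
    {
        MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);   /* block until a message is pending */
        MPI_Get_count(&status, MPI_INT, &count);                           /* how many MPI_INTs are waiting */
        data = (int *) malloc(count * sizeof(int));
        MPI_Recv(data, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                 MPI_COMM_WORLD, &status);
        printf("Received %d ints from rank %d with tag %d\n", count, status.MPI_SOURCE, status.MPI_TAG);
        free(data);
    }
    MPI_Finalize();
    return 0;
}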
28. BagBoy Example (1 of 3)

#include <stdio.h>
#include <mpi.h>
#include <math.h>
#include <time.h>
#include <malloc.h>
#include <stdlib.h>   /* rand, srand */
#define Products 10
int main(int argc, char *argv[])
{
    int myid, numprocs;
    int true = 1;
    int false = 0;
    int messages = true;
    int i, g, items, flag;
    int *customer_items;
    int checked_out = 0;
    char Groceries[Products][20] = {"Chips", "Lettuce", "Bread", "Eggs", "Pork Chops",
        "Carrots", "Rice", "Potatoes", "Canned Beans", "Spaghetti Sauce"};
    MPI_Status status;
29. BagBoy Example (2 of 3)

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (numprocs >= 2)
    {
        if (myid == 0) // Master
        {
            customer_items = (int *) malloc(numprocs * sizeof(int));
            for (i = 1; i < numprocs; i++) customer_items[i] = 0;
            while (messages)
            {
                MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
                if (flag)
                {
                    MPI_Recv(&items, 1, MPI_INT, status.MPI_SOURCE, status.MPI_TAG, MPI_COMM_WORLD, &status);
                    customer_items[status.MPI_SOURCE]++;
                    //printf("%d %d of %d\n", status.MPI_SOURCE, customer_items[status.MPI_SOURCE], items);
                    if (customer_items[status.MPI_SOURCE] == items) checked_out++;
                    printf("%s from %d\n", Groceries[status.MPI_TAG], status.MPI_SOURCE);
30. BagBoy Example (3 of 3)

        else // Workers
        {
            srand((unsigned)time(NULL) + myid);
            items = (rand() % 5) + 1;
            for (i = 1; i <= items; i++)
            {
                g = rand() % 10;
                printf("Sending %s\n", Groceries[g]);
                MPI_Send(&items, 1, MPI_INT, 0, g, MPI_COMM_WORLD);
            }
        } // Workers
    }
    else
        printf("ERROR: Must have at least 2 processes to run\n");
    MPI_Finalize();
}
31. Using Message Passing Interface, MPI: Bubble Sort
32. Bubble Sort

#include <stdio.h>
#include <stdlib.h>   /* rand() */
#define N 10
int main(int argc, char *argv[])
{
    int a[N];
    int i, j, tmp;

    printf("Unsorted\n");
    for (i = 0; i < N; i++) { a[i] = rand(); printf("%d\n", a[i]); }
    for (i = 0; i < (N-1); i++)
        for (j = (N-1); j > i; j--)
            if (a[j-1] > a[j])
            {
                tmp = a[j];
                a[j] = a[j-1];
                a[j-1] = tmp;
            }
33. Serial Bubble Sort in Action
34. Step 1: Partitioning (Divide Computation and Data into Pieces)
- The primitive task would be each element of the unsorted array
- Goals:
- Order of magnitude more primitive tasks than processors
- Minimize redundant computations and data
- Primitive tasks are approximately the same size
- The number of primitive tasks increases as problem size increases
35. Step 2: Communication (Determine Communication Patterns between Primitive Tasks)
- Each task communicates with its neighbor on each side
- Goals:
- Communication is balanced among all tasks
- Each task communicates with a minimal number of neighbors
- Tasks can perform communications concurrently
- Tasks can perform computations concurrently
- Note: there are some exceptions in the actual implementation
36. Step 3: Agglomeration (Group Tasks to Improve Efficiency or Simplify Programming)
- Divide the unsorted array evenly amongst processes
- Perform sort steps in parallel
- Exchange elements with other processes when necessary (see the compare-split sketch after the checklist below)
[Figure: array of N elements (indices 0 to N) divided evenly among Process 0, Process 1, Process 2, ..., Process n]
- Increase the locality of the parallel algorithm
- Replicated computations take less time than the communications they replace
- Replicated data is small enough to allow the algorithm to scale
- Agglomerated tasks have similar computational and communication costs
- Number of tasks can increase as the problem size does
- Number of tasks is as small as possible but at least as large as the number of available processors
- The trade-off between agglomeration and the cost of modifications to the sequential code is reasonable
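
One way to sketch the exchange step is an odd-even transposition compare-split between neighboring ranks. The code below is an illustration under assumptions of my own (equal BLOCK-sized pieces per rank, qsort for the local sort), not the course's homework solution.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK 4   /* elements held by each process (illustrative) */

static int cmp_int(const void *a, const void *b) { return *(const int *)a - *(const int *)b; }

int main(int argc, char *argv[])
{
    int myid, numprocs, i, phase, partner;
    int mine[BLOCK], theirs[BLOCK], merged[2 * BLOCK];
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    srand(myid + 1);
    for (i = 0; i < BLOCK; i++) mine[i] = rand() % 100;
    qsort(mine, BLOCK, sizeof(int), cmp_int);               /* sort the local block first */

    for (phase = 0; phase < numprocs; phase++)
    {
        /* even phases pair (0,1)(2,3)...; odd phases pair (1,2)(3,4)... */
        partner = ((phase + myid) % 2 == 0) ? myid + 1 : myid - 1;
        if (partner < 0 || partner >= numprocs) continue;    /* no partner this phase */
        MPI_Sendrecv(mine, BLOCK, MPI_INT, partner, 0,
                     theirs, BLOCK, MPI_INT, partner, 0, MPI_COMM_WORLD, &status);
        for (i = 0; i < BLOCK; i++) { merged[i] = mine[i]; merged[BLOCK + i] = theirs[i]; }
        qsort(merged, 2 * BLOCK, sizeof(int), cmp_int);
        /* the lower rank keeps the small half, the higher rank keeps the large half */
        if (myid < partner) for (i = 0; i < BLOCK; i++) mine[i] = merged[i];
        else                for (i = 0; i < BLOCK; i++) mine[i] = merged[BLOCK + i];
    }
    for (i = 0; i < BLOCK; i++) printf("rank %d: %d\n", myid, mine[i]);
    MPI_Finalize();
    return 0;
}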
37. Step 4: Mapping (Assigning Tasks to Processors)
- Map each process to a processor
- This is not a CPU-intensive operation, so running multiple processes per machine should be considered
- If the array to be sorted is very large, physical memory limitations may require using more machines
[Figure: array of N elements (indices 0 to N) divided among Process 0 through Process n, each process mapped to its own Processor]
- Mapping based on one task per processor and multiple tasks per processor has been considered
- Both static and dynamic allocation of tasks to processors have been evaluated
- (N/A) If a dynamic allocation of tasks to processors is chosen, the task allocator is not a bottleneck
- If static allocation of tasks to processors is chosen, the ratio of tasks to processors is at least 10 to 1
38. Hint: Sketch out Algorithm Behavior BEFORE Implementing (1 of 2)
- 7 6 5 4 3 2 1 0
- j=3, j=7
- 7 6 4 5 3 2 0 1
- j=2, j=6
- 7 4 6 5 3 0 2 1
- j=1, j=5
- 4 7 6 5 0 3 2 1
- j=0, j=4
- <->
- 4 7 6 0 5 3 2 1
- j=3, j=7
- 4 7 0 6 5 3 1 2
- j=2, j=6
- 4 0 7 6 5 1 3 2
- j=1, j=5
- 0 4 7 6 1 5 3 2
- j=0, j=4
- <->
- 0 4 7 1 6 5 3 2
39. Hint (2 of 2)
- 0 1 4 2 7 6 5 3
- j=3, j=7
- 0 1 2 4 7 6 3 5
- j=2, j=6
- 0 1 2 4 7 3 6 5
- j=1, j=5
- 0 1 2 4 3 7 6 5
- j=0, j=4
- <->
- 0 1 2 3 4 7 6 5
- j=3, j=7
- 0 1 2 3 4 7 5 6
- j=2, j=6
- 0 1 2 3 4 5 7 6
- j=1, j=5
- 0 1 2 3 4 5 7 6
- j=0, j=4
- <->
- 0 1 2 3 4 5 7 6
40. Bubble Sort Performance
41. Homework Solutions
- Parallel Bubble Sort
- BagBoy