Distributed Memory Programming Using Message Passing Interface, MPI

Transcript and Presenter's Notes

1
Distributed Memory Programming
Using Message Passing Interface, MPI
2
The Basics: Helloworld.c

#include <stdio.h>
#include <mpi.h>
void main(int argc, char *argv[])
{
  int myid, numprocs;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  printf("Hello from id %d\n", myid);
  MPI_Finalize();
}

3
MPI_Init
  • First Function call made by every MPI Process
  • Must be called before any other MPI Call is made
#include <stdio.h>
#include <mpi.h>
void main(int argc, char *argv[])
{
  int i;
  int myid, numprocs;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  for (i=0; i<argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
  printf("Hello from id %d\n", myid);
  MPI_Finalize();
}

4
MPI_Comm_rank
  • After MPI is initialized every process is part
    of a Communicator
  • A Communicator provides the environment for
    message passing.
  • MPI_COMM_WORLD is the default Communicator
  • Returns the number (or rank) of the calling process,
    numbered 0 to (N-1)
#include <stdio.h>
#include <mpi.h>
void main(int argc, char *argv[])
{
  int i;
  int myid, numprocs;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  for (i=0; i<argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
  printf("Hello from id %d\n", myid);
  MPI_Finalize();
}

5
MPI_Comm_size
  • Returns the total number of processes in the
    Communicator
#include <stdio.h>
#include <mpi.h>
void main(int argc, char *argv[])
{
  int i;
  int myid, numprocs;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  for (i=0; i<argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
  printf("Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
  MPI_Finalize();
}

6
MPI_Finalize
  • Called when all MPI calls are complete
  • Frees system resources used by MPI
#include <stdio.h>
#include <mpi.h>
void main(int argc, char *argv[])
{
  int i;
  int myid, numprocs;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  for (i=0; i<argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
  printf("Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
  MPI_Finalize();
}

7
MPI_Send
MPI_Send(void *message, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
#include <stdio.h>
#include <mpi.h>
void main(int argc, char *argv[])
{
  int i;
  int myid, numprocs;
  char sig[80];
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  for (i=0; i<argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
  if (myid == 0)
  {
    printf("Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
    for(i=1; i<numprocs; i++)

8
MPI_Datatype: Datatypes for C
  • MPI_CHAR signed char
  • MPI_DOUBLE double
  • MPI_FLOAT float
  • MPI_INT int
  • MPI_LONG long
  • MPI_LONG_DOUBLE long double
  • MPI_SHORT short
  • MPI_UNSIGNED_CHAR unsigned char
  • MPI_UNSIGNED unsigned int
  • MPI_UNSIGNED_LONG unsigned long
  • MPI_UNSIGNED_SHORT unsigned short

9
MPI_Recv
MPI_Recv(void *message, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
#include <stdio.h>
#include <mpi.h>
void main(int argc, char *argv[])
{
  int i;
  int myid, numprocs;
  char sig[80];
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  for (i=0; i<argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
  if (myid == 0)
  {
    printf("Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
    for(i=1; i<numprocs; i++)

10
MPI_Status: Status Record
  • MPI_Recv blocks until a message is received or an
    error occurs.
  • Once MPI_Recv returns, the status record can be
    checked:
  • status.MPI_SOURCE (where the message came from)
  • status.MPI_TAG (the tag value)
  • status.MPI_ERROR (error condition)

printf("Hello from id d, d or d
processes\n",myid,myid1,numprocs) for(i1
iltnumprocs i) MPI_Recv(sig,sizeof(
sig),MPI_CHAR,i,0,MPI_COMM_WORLD,status)
printf("s",sig) printf("Message source
d\n",status.MPI_SOURCE) printf("Message
tag d\n",status.MPI_TAG)
printf("Message Error condition
d\n",status.MPI_ERROR)
11
Watch out for Deadlocks!
  • Deadlocks occur when the code waits for a
    condition that will never happen
  • Remember, MPI Sends and Receives work like channels
    in Foster's Design Methodology
  • Sends are asynchronous (they send and return)
  • Receives are synchronous (they block until the
    receive is complete)
  • A common MPI deadlock happens when 2 processes
    are to exchange messages and they both issue an
    MPI_Recv before doing an MPI_Send (see the sketch below)
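
A minimal sketch (not from the original slides) of that scenario, assuming exactly two processes: the commented-out ordering deadlocks because both ranks block in MPI_Recv, while ordering the calls by rank lets the exchange complete.

#include <stdio.h>
#include <mpi.h>

void main(int argc, char *argv[])
{
  int myid, numprocs, other, sendval, recvval;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  other = (myid == 0) ? 1 : 0;   /* assumes exactly 2 processes */
  sendval = myid;

  /* DEADLOCK: both ranks block in MPI_Recv and never reach MPI_Send
  MPI_Recv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);
  MPI_Send(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
  */

  /* Safe ordering: rank 0 sends first, rank 1 receives first */
  if (myid == 0)
  {
    MPI_Send(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
    MPI_Recv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);
  }
  else
  {
    MPI_Recv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &status);
    MPI_Send(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
  }
  printf("Hello from id %d, received %d\n", myid, recvval);
  MPI_Finalize();
}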

12
MPI_Wtime & MPI_Wtick
  • Used to measure performance (time a portion of
    the code)
  • Can be used inside MPI Codes to measure the
    performance of the algorithm, outside of the MPI
    message passing overhead.
  • MPI_Wtime returns number of seconds since a point
    in the past
  • MPI_Wtick returns the resolution of MPI_Wtime,
    in seconds

13
MPI_Wtime & MPI_Wtick example
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  for (i=0; i<argc; i++) printf("argv[%d] = %s\n", i, argv[i]);
  if (myid == 0)
  {
    printf("Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
    for(i=1; i<numprocs; i++)
    {
      MPI_Recv(sig, sizeof(sig), MPI_CHAR, i, 0, MPI_COMM_WORLD, &status);
      printf("%s", sig);
    }
    start = MPI_Wtime();
    for (i=0; i<100; i++)
    {
      a[i] = i;
      b[i] = i * 10;
      c[i] = i + 7;
      a[i] = b[i] * c[i];
    }

14
MPI_Barrier
MPI_Barrier(MPI_Comm comm)
  • A mechanism to force synchronization amongst all
    processes
  • Useful when you are timing performance
  • Assume all processes are performing the same
    calculation
  • We need to ensure they all start at the same time
  • Also useful when you want to ensure that all
    processes have completed an operation before any
    of them begin a new one.

MPI_Barrier(MPI_COMM_WORLD);
start = MPI_Wtime();
result = run_big_computation();
MPI_Barrier(MPI_COMM_WORLD);
end = MPI_Wtime();
printf("This big computation took %.5f seconds\n", end-start);
15
MPI_Bcast
MPI_Bcast(void *message, int count, MPI_Datatype datatype, int source, MPI_Comm comm)
  • Collective Communication
  • Allows a process to broadcast a message to all
    other processes
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  while(1)
  {
    if (myid == 0)
    {
      printf("Enter the number of intervals (0 quits)\n");
      fflush(stdout);
      scanf("%d", &n);
    } // if myid == 0
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

16
MPI_Reduce
MPI_Reduce(void *send_buf, void *recv_buf, int count, MPI_Datatype dtype, MPI_Op op, int root, MPI_Comm comm)
  • Collective communication
  • Processes perform the specified reduction
  • The root has the results
    if (myid == 0)
    {
      printf("Enter the number of intervals (0 quits)\n");
      fflush(stdout);
      scanf("%d", &n);
    } // if myid == 0
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (n == 0) break;
    else
    {
      h = 1.0 / (double) n;
      sum = 0.0;
      for (i = myid + 1; i <= n; i += numprocs)
      {
        x = h * ((double)i - 0.5);
        sum += 4.0 / (1.0 + x*x);
      } // for
      mypi = h * sum;
      MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

17
MPI_Allreduce
MPI_Allreduce(void *send_buf, void *recv_buf, int count, MPI_Datatype dtype, MPI_Op op, MPI_Comm comm)
  • Collective communication
  • Processes perform the specified reduction
  • All processes have the results

    start = MPI_Wtime();
    for (i=0; i<100; i++)
    {
      a[i] = i;
      b[i] = i * 10;
      c[i] = i + 7;
      a[i] = b[i] * c[i];
    }
    end = MPI_Wtime();
    printf("Our timer's precision is %.20f seconds\n", MPI_Wtick());
    printf("This silly loop took %.5f seconds\n", end-start);
  }
  else
  {
    sprintf(sig, "Hello from id %d, %d of %d processes\n", myid, myid+1, numprocs);
    MPI_Send(sig, sizeof(sig), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
  }
  MPI_Allreduce(&myid, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
  printf("Sum of all process ids = %d\n", sum);
  MPI_Finalize();
18
MPI Reduction Operators
  • MPI_BAND bitwise and
  • MPI_BOR bitwise or
  • MPI_BXOR bitwise exclusive or
  • MPI_LAND logical and
  • MPI_LOR logical or
  • MPI_LXOR logical exclusive or
  • MPI_MAX maximum
  • MPI_MAXLOC maximum and location of maximum (see the sketch after this list)
  • MPI_MIN minimum
  • MPI_MINLOC minimum and location of minimum
  • MPI_PROD product
  • MPI_SUM sum
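
As an aside on the location operators, here is a minimal sketch (not from the original slides) that reduces (value, rank) pairs with MPI_MAXLOC. MPI_MAXLOC and MPI_MINLOC operate on paired datatypes such as MPI_DOUBLE_INT; the locally computed value below is just a stand-in.

#include <stdio.h>
#include <mpi.h>

struct { double value; int rank; } local, global;

void main(int argc, char *argv[])
{
  int myid, numprocs;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  local.value = (double)(myid * myid);  /* stand-in for a locally computed result */
  local.rank  = myid;
  /* MPI_DOUBLE_INT matches the (double, int) pair layout MAXLOC expects */
  MPI_Reduce(&local, &global, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
  if (myid == 0)
    printf("Max value %f came from rank %d\n", global.value, global.rank);
  MPI_Finalize();
}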

19
Using Message Passing Interface, MPI
More Advanced APIs and Examples
20
MPI_Gather (example 1)
MPI_Gather(sendbuf, sendcnt, sendtype, recvbuf, recvcount, recvtype, root, comm)
  • Collective Communication
  • Root Gathers Data from every process including
    itself
#include <stdio.h>
#include <mpi.h>
#include <malloc.h>
void main(int argc, char *argv[])
{
  int i, myid, numprocs;
  int *ids;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  if (myid == 0)
    ids = (int *) malloc(numprocs * sizeof(int));
  MPI_Gather(&myid, 1, MPI_INT, ids, 1, MPI_INT, 0, MPI_COMM_WORLD);
  if (myid == 0)
    for (i=0; i<numprocs; i++)
      printf("%d\n", ids[i]);
}

21
MPI_Gather (example 2)
MPI_Gather(sendbuf, sendcnt, sendtype, recvbuf, recvcount, recvtype, root, comm)
#include <stdio.h>
#include <mpi.h>
#include <malloc.h>
void main(int argc, char *argv[])
{
  int i, myid, numprocs;
  char sig[80];
  char *signatures;
  char *sigs;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  sprintf(sig, "Hello from id %d\n", myid);
  if (myid == 0)
    signatures = (char *) malloc(numprocs * sizeof(sig));
  MPI_Gather(sig, sizeof(sig), MPI_CHAR, signatures, sizeof(sig), MPI_CHAR, 0, MPI_COMM_WORLD);
  if (myid == 0)

22
MPI_Alltoall
MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcnt, recvtype, comm)
  • Collective Communication
  • Each process sends and receives the same amount of
    data to and from every process, including itself
#include <stdio.h>
#include <mpi.h>
#include <malloc.h>
void main(int argc, char *argv[])
{
  int i, myid, numprocs;
  int *all, *ids;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  ids = (int *) malloc(numprocs * 3 * sizeof(int));
  all = (int *) malloc(numprocs * 3 * sizeof(int));
  for (i=0; i<numprocs*3; i++) ids[i] = myid;
  MPI_Alltoall(ids, 3, MPI_INT, all, 3, MPI_INT, MPI_COMM_WORLD);
  for (i=0; i<numprocs*3; i++)
    printf("%d\n", all[i]);
}

23
Variations of MPI_Send
  • MPI_Send
  • MPI_Send( buf, count, datatype, dest, tag, comm )
  • Non-blocking - based on successful buffering on
    receive side
  • Behavior is implementation dependent and can be
    modified at run-time
  • MPI_Rsend
  • MPI_Rsend( buf, count, datatype, dest, tag, comm
    )
  • Ready mode send. Send only happens if the
    matching receive is posted.
  • MPI_Ssend
  • MPI_Ssend( buf, count, datatype, dest, tag, comm
    )
  • Synchronous send.
  • Returns only after the matching receive has been
    posted and the receive has begun
  • MPI_Bsend
  • MPI_Bsend( buf, count, datatype, dest, tag, comm
    )
  • Basic send with user-specified buffering via
    MPI_Buffer_attach (see the sketch below)
  • MPI must buffer the outgoing send and return
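
A minimal sketch (not from the original slides) of buffered-mode sending, assuming two processes: the sender attaches its own buffer with MPI_Buffer_attach, sized for the message plus MPI_BSEND_OVERHEAD, so MPI_Bsend can return as soon as the message has been copied.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

void main(int argc, char *argv[])
{
  int myid, numprocs, value, bufsize;
  char *buf;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  if (myid == 0)
  {
    bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
    buf = (char *) malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);      /* MPI copies outgoing messages here */
    value = 7;
    MPI_Bsend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* returns once buffered */
    MPI_Buffer_detach(&buf, &bufsize);    /* waits for buffered sends to complete */
    free(buf);
  }
  else if (myid == 1)
  {
    MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    printf("Rank 1 received %d\n", value);
  }
  MPI_Finalize();
}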

24
More Variations of MPI_Send
  • MPI_Ibsend
  • MPI_Ibsend( buf, count, datatype, dest, tag,
    comm, request )
  • Non-blocking buffered send
  • Do not access send buffer until send is complete.
    Use request handle to check.
  • MPI_Irsend
  • MPI_Irsend( buf, count, datatype, dest, tag,
    comm, request )
  • Non-blocking ready send
  • Do not access send buffer until send is complete.
    Use request handle to check.
  • MPI_Issend
  • MPI_Issend( buf, count, datatype, dest, tag,
    comm, request )
  • Synchronous mode non-blocking send.
  • Control returns when matching receive has begun
  • Do not access send buffer until send is
    complete. Use request handle to check.
  • MPI_Isend
  • MPI_Isend( buf, count, datatype, dest, tag, comm,
    request )
  • Immediate non-blocking send (the message goes into a
    pending state); see the sketch below
  • Complete the send with a call to MPI_Wait or
    similar function
  • Do not access send buffer until send is complete.
    Use request handle to check.
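
A minimal sketch (not from the original slides) of the immediate send, assuming two processes: MPI_Isend returns right away, and MPI_Wait completes the request before the send buffer is reused.

#include <stdio.h>
#include <mpi.h>

void main(int argc, char *argv[])
{
  int myid, numprocs, value;
  MPI_Request request;
  MPI_Status status;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  if (myid == 0)
  {
    value = 42;
    MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
    /* work that does not touch 'value' could overlap the send here */
    MPI_Wait(&request, &status);   /* after this, 'value' may be reused */
  }
  else if (myid == 1)
  {
    MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    printf("Rank 1 received %d\n", value);
  }
  MPI_Finalize();
}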

25
Variations of MPI_Recv
  • MPI_Recv
  • MPI_Recv( buf, count, datatype, source, tag,
    comm, status )
  • Blocking receive
  • MPI_Irecv
  • MPI_Irecv( buf, count, datatype, source, tag,
    comm, request )
  • Non-blocking receive
  • Use MPI_Wait to ensure message receipt is
    completed before accessing buffer
  • MPI_Wait
  • MPI_Wait(MPI_Request *request, MPI_Status *status)

26
MPI_Irecv Example
Task Parallelism fragment (tp1.c)
while(complete < iter)
{
  for (w=1; w<numprocs; w++)
  {
    if ((worker[w] == idle) && (complete < iter))
    {
      printf("Master sending UoW %d to Worker %d\n", complete, w);
      Unit_of_Work[0] = a[complete];
      Unit_of_Work[1] = b[complete];
      // Send the Unit of Work
      MPI_Send(Unit_of_Work, 2, MPI_INT, w, 0, MPI_COMM_WORLD);
      // Post a non-blocking Recv for that Unit of Work
      MPI_Irecv(&result[w], 1, MPI_INT, w, 0, MPI_COMM_WORLD, &recv_req[w]);
      worker[w] = complete;
      dispatched++;
      complete++; // next unit of work to send out
    }
  } // foreach idle worker
  // Collect returned results
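
One way to finish the fragment, sketched here rather than taken from the original tp1.c, is to poll the outstanding non-blocking receives with MPI_Test and mark workers idle as their results arrive; the variable names follow the fragment above.

  // Possible continuation (illustrative): poll the outstanding receives
  for (w=1; w<numprocs; w++)
  {
    if (worker[w] != idle)
    {
      int flag = 0;
      MPI_Test(&recv_req[w], &flag, &status);
      if (flag)
      {
        printf("Worker %d returned result %d for UoW %d\n",
               w, result[w], worker[w]);
        worker[w] = idle;   // this worker can take the next unit of work
        dispatched--;
      }
    }
  }
} // while (complete < iter)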

27
MPI_Probe & MPI_Iprobe
  • MPI_Probe
  • MPI_Probe( source, tag, comm, status )
  • Blocking test for a message
  • MPI_Iprobe
  • int MPI_Iprobe( source, tag, comm, flag, status )
  • Non-blocking test for a message
  • Source can be specified or MPI_ANY_SOURCE
  • Tag can be specified or MPI_ANY_TAG

28
BagBoy Example (1 of 3)
#include <stdio.h>
#include <mpi.h>
#include <math.h>
#include <time.h>
#include <malloc.h>
#define Products 10
void main(int argc, char *argv[])
{
  int myid, numprocs;
  int true = 1;
  int false = 0;
  int messages = true;
  int i, g, items, flag;
  int *customer_items;
  int checked_out = 0;
  char Groceries[Products][20] = {"Chips", "Lettuce", "Bread", "Eggs",
    "Pork Chops", "Carrots", "Rice", "Potatoes", "Canned Beans",
    "Spaghetti Sauce"};
  MPI_Status status;

29
BagBoy Example (2 of 3)
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);
  if (numprocs >= 2)
  {
    if (myid == 0) // Master
    {
      customer_items = (int *) malloc(numprocs * sizeof(int));
      for (i=1; i<numprocs; i++) customer_items[i] = 0;
      while (messages)
      {
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
        if (flag)
        {
          MPI_Recv(&items, 1, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                   MPI_COMM_WORLD, &status);
          customer_items[status.MPI_SOURCE]++;
          //printf("%d = %d of %d\n", status.MPI_SOURCE,
          //       customer_items[status.MPI_SOURCE], items);
          if (customer_items[status.MPI_SOURCE] == items) checked_out++;
          printf("%s from %d\n", Groceries[status.MPI_TAG], status.MPI_SOURCE);

30
BagBoy Example (3 of 3)
    } // Master
    else // Workers
    {
      srand((unsigned)time(NULL) + myid);
      items = (rand() % 5) + 1;
      for(i=1; i<=items; i++)
      {
        g = rand() % 10;
        printf("Sending %s\n", Groceries[g]);
        MPI_Send(&items, 1, MPI_INT, 0, g, MPI_COMM_WORLD);
      }
    } // Workers
  }
  else
    printf("ERROR: Must have at least 2 processes to run\n");
  MPI_Finalize();
}

31
Using Message Passing Interface, MPI
Bubble Sort
32
Bubble Sort
#include <stdio.h>
#include <stdlib.h>
#define N 10
int main (int argc, char *argv[])
{
  int a[N];
  int i, j, tmp;
  printf("Unsorted\n");
  for (i=0; i<N; i++) { a[i] = rand(); printf("%d\n", a[i]); }
  for (i=0; i<(N-1); i++)
    for(j=(N-1); j>i; j--)
      if (a[j-1] > a[j])
      {
        tmp = a[j];
        a[j] = a[j-1];
        a[j-1] = tmp;
      }
  printf("Sorted\n");
  for (i=0; i<N; i++) printf("%d\n", a[i]);
  return 0;
}

33
Serial Bubble Sort in Action
34
Step 1: Partitioning
Divide Computation and Data into Pieces
  • The Primitive task would be each element of the
    unsorted array
  • Goals
  • An order of magnitude more Primitive tasks than
    Processors
  • Minimize redundant computations and data
  • Primitive tasks are approximately the same size
  • The number of Primitive tasks increases as the
    problem size increases

35
Step 2: Communication
Determine Communication Patterns between Primitive Tasks
  • Each task communicates with its neighbor on each
    side
  • Goals
  • Communication is balanced among all Tasks
  • Each Task Communicates with a minimal number of
    neighbors
  • Tasks can Perform Communications concurrently
  • Tasks can Perform Computations concurrently

Note there are some exceptions in the actual
implementation
36
Step 3: Agglomeration
Group Tasks to Improve Efficiency or Simplify Programming
  • Divide unsorted array evenly amongst processes
  • Perform sort steps in parallel
  • Exchange elements with other processes when
    necessary (a sketch of one boundary exchange appears
    at the end of this slide)

[Figure: the unsorted array, indices 0 to N, divided into contiguous blocks owned by Process 0, Process 1, Process 2, ..., Process n]
  • Increase the locality of the parallel algorithm
  • Replicated computations take less time than the
    communications they replace
  • Replicated data is small enough to allow the
    algorithm to scale
  • Agglomerated tasks have similar computational and
    communications costs
  • Number of Tasks can increase as the problem size
    does
  • Number of Tasks as small as possible but at least
    as large as the number of available processors
  • Trade-off between agglomeration and cost of
    modifications to sequential codes is reasonable
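
A minimal sketch of the boundary exchange mentioned above (not from the original slides, and not the complete homework solution): after a local bubble pass, each process compares its last element with its right-hand neighbor's first element and keeps the smaller of the two, mirroring the <-> steps in the hint slides that follow. The function and variable names are illustrative.

#include <mpi.h>

void exchange_with_neighbors(int *block, int block_len, int myid, int numprocs)
{
  int my_last = block[block_len - 1];
  int neighbor_first;
  MPI_Status status;

  if (myid < numprocs - 1)
  {
    /* send my largest element right, receive the neighbor's smallest */
    MPI_Send(&my_last, 1, MPI_INT, myid + 1, 0, MPI_COMM_WORLD);
    MPI_Recv(&neighbor_first, 1, MPI_INT, myid + 1, 0, MPI_COMM_WORLD, &status);
    if (neighbor_first < my_last)
      block[block_len - 1] = neighbor_first;   /* keep the smaller value */
  }
  if (myid > 0)
  {
    int left_last, my_first = block[0];
    MPI_Recv(&left_last, 1, MPI_INT, myid - 1, 0, MPI_COMM_WORLD, &status);
    MPI_Send(&my_first, 1, MPI_INT, myid - 1, 0, MPI_COMM_WORLD);
    if (left_last > my_first)
      block[0] = left_last;                    /* keep the larger value */
  }
}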

37
Step 4: Mapping
Assigning Tasks to Processors
  • Map each process to a processor
  • This is not a CPU-intensive operation, so using
    multiple processes per machine should be
    considered
  • If the array to be sorted is very large,
    physical memory limitations may require using
    more machines

[Figure: the array blocks from the previous figure mapped onto machines, with Process 0, Process 1, Process 2, ..., Process n assigned to Processor 1, Processor 2, Processor 3, ..., Processor n]
  • Mapping based on one task per processor and
    multiple tasks per processor have been considered
  • Both static and dynamic allocation of tasks to
    processors have been evaluated
  • (NA) If a dynamic allocation of tasks to
    processors is chosen, the Task allocator is not a
    bottleneck
  • If Static allocation of tasks to processors is
    chosen, the ratio of tasks to processors is at
    least 10 to 1

38
Hint: Sketch out Algorithm Behavior BEFORE Implementing (1 of 2)
  • 7 6 5 4   3 2 1 0
  • j=3        j=7
  • 7 6 4 5   3 2 0 1
  • j=2        j=6
  • 7 4 6 5   3 0 2 1
  • j=1        j=5
  • 4 7 6 5   0 3 2 1
  • j=0        j=4
  • <->
  • 4 7 6 0   5 3 2 1
  • j=3        j=7
  • 4 7 0 6   5 3 1 2
  • j=2        j=6
  • 4 0 7 6   5 1 3 2
  • j=1        j=5
  • 0 4 7 6   1 5 3 2
  • j=0        j=4
  • <->
  • 0 4 7 1   6 5 3 2

39
Hint (2 of 2)
  • 0 1 4 2   7 6 5 3
  • j=3        j=7
  • 0 1 2 4   7 6 3 5
  • j=2        j=6
  • 0 1 2 4   7 3 6 5
  • j=1        j=5
  • 0 1 2 4   3 7 6 5
  • j=0        j=4
  • <->
  • 0 1 2 3   4 7 6 5
  • j=3        j=7
  • 0 1 2 3   4 7 5 6
  • j=2        j=6
  • 0 1 2 3   4 5 7 6
  • j=1        j=5
  • 0 1 2 3   4 5 7 6
  • j=0        j=4
  • <->
  • 0 1 2 3   4 5 7 6

40
Bubble Sort Performance
41
Homework Solutions
  • Parallel Bubble Sort
  • BagBoy