Title: An Introduction to Parallel Programming with MPI
1. An Introduction to Parallel Programming with MPI
- March 22, 24, 29, and 31, 2005
- David Adams
2. Outline
- Disclaimers
- Overview of basic parallel programming on a cluster with the goals of MPI
- Batch system interaction
- Startup procedures
- Blocking message passing
- Non-blocking message passing
- Collective communications
3. Disclaimers
- I do not have all the answers.
- Completion of this short course will give you enough tools to begin making use of MPI. It will not automagically allow your code to run on a parallel machine simply by logging in.
- Some codes are easier to parallelize than others.
4. The Goals of MPI
- Design an application programming interface.
- Allow efficient communication.
- Allow for implementations that can be used in a heterogeneous environment.
- Allow convenient C and Fortran 77 bindings.
- Provide a reliable communication interface.
- Portable.
- Thread safe.
5. Message Passing Paradigm (figure)
6. Message Passing Paradigm (figure)
7. Message Passing Paradigm
- Conceptually, all processors communicate through messages (even though some may share memory space).
- Low-level details of message transport are handled by MPI and are invisible to the user.
- Every processor is running the same program but will take different logical paths determined by self processor identification (Who am I?).
- Programs are written, in general, for an arbitrary number of processors, though they may be more efficient on specific numbers (powers of 2?).
8. Distributed Memory and I/O Systems
- The cluster machines available at Virginia Tech are distributed memory, distributed I/O systems.
- Each node (processor pair) has its own memory and local hard disk.
- Allows asynchronous execution of multiple instruction streams.
- Heavy disk I/O should be delegated to the local disk instead of across the network, and minimized as much as possible.
- While getting your program running, another goal to keep in mind is to see that it makes good use of the hardware available to you.
- What does good use mean?
9. Speedup
- The speedup achieved by a parallel algorithm running on p processors is the ratio between the time taken by that parallel computer executing the fastest serial algorithm and the time taken by the same parallel computer executing the parallel algorithm using p processors.
  - Designing Efficient Algorithms for Parallel Computers, Michael J. Quinn
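- Written as a formula (notation ours, not from the slides): speedup S(p) = T_serial / T_parallel(p), where T_serial is the time of the fastest serial algorithm and T_parallel(p) is the time of the parallel algorithm on p processors.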
10. Speedup
- Sometimes a fastest serial version of the code is unavailable.
- The speedup of a parallel algorithm can be measured based on the speed of the parallel algorithm run serially, but this gives an unfair advantage to the parallel code, as the inefficiencies of making the code parallel will also appear in the serial version.
11. Speedup Example
- Our really_big_code01 executes on a single processor in 100 hours.
- The same code on 10 processors takes 10 hours.
- 100 hrs. / 10 hrs. = a speedup of 10.
- When speedup = p it is called ideal (or perfect) speedup.
- Speedup by itself is not very meaningful. A speedup of 10 may sound good (we are solving the problem 10 times as fast!), but what if we were using 1000 processors to get that number?
12. Efficiency
- The efficiency of a parallel algorithm running on p processors is the speedup divided by p.
  - Designing Efficient Algorithms for Parallel Computers, Michael J. Quinn
- From our last example,
  - when p = 10 the efficiency is 10/10 = 1 (great!),
  - when p = 1000 the efficiency is 10/1000 = 0.01 (bad!).
- Speedup and efficiency give us an idea of how well our parallel code is making use of the available resources.
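- In the same notation (ours, not the slides'): efficiency E(p) = S(p) / p, which for the example above gives E(10) = 10/10 = 1 and E(1000) = 10/1000 = 0.01.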
13. Concurrency
- The first step in parallelizing any code is to identify the types of concurrency found in the problem itself (not necessarily the serial algorithm).
- Many parallel algorithms show few resemblances to the (fastest known) serial version they are compared to, and sometimes require an unusual perspective on the problem.
14. Concurrency
- Consider the problem of finding the sum of n integer values.
- A sequential algorithm may look something like this:
    BEGIN
      sum = A[0]
      FOR i = 1 TO n - 1 DO
        sum = sum + A[i]
      ENDFOR
    END
15. Concurrency
- Suppose n = 4. Then the additions would be done in a precise order as follows:
  - (A[0] + A[1]) + A[2] + A[3]
- Without any insight into the problem itself we might assume that the process is completely sequential and cannot be parallelized.
- Of course, we know that addition is associative (mostly). The same expression could be written as:
  - (A[0] + A[1]) + (A[2] + A[3])
- By using our insight into the problem of addition we can exploit the inherent concurrency of the problem and not the algorithm.
16. Communication is Slow
- Continuing our example of adding n integers, we may want to parallelize the process to exploit as much concurrency as possible. We call on the services of Clovus the Parallel Guru.
- Let n = 128.
- Clovus divides the integers into pairs and distributes them to 64 processors, maximizing the concurrency inherent in the problem.
- The solutions to the 64 sub-problems are distributed to 32 processors, those 32 to 16, etc.
17. Communication Overhead
- Suppose it takes t units of time to perform a floating-point addition.
- Suppose it takes 100t units of time to pass a floating-point number from one processor to another.
- The entire calculation on a single processor would take 127t time units.
- Using the maximum number of processors possible (64), Clovus finds the sum of the first set of pairs in 101t time units. Further steps for 32, 16, 8, 4, and 2 processors follow to obtain the final solution.
- Steps (64), (32), (16), (8), (4), (2):
  101t + 101t + 101t + 101t + 101t + 101t = 606t total time units
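- The pattern Clovus uses is a binary-tree (pairwise) reduction. The sketch below is ours, not from the slides; it uses the blocking point-to-point calls introduced later in the course and assumes the number of MPI processes is a power of two, with each rank starting from one local value.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, step;
        double value, incoming;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        value = (double)(rank + 1);   /* stand-in for this processor's local data */

        /* Pairwise reduction: at each step the upper half of the active ranks
           sends its partial sum to a partner in the lower half, halving the
           number of active ranks (64 -> 32 -> 16 -> ... -> 1). */
        for (step = size / 2; step >= 1; step /= 2) {
            if (rank < step) {
                MPI_Recv(&incoming, 1, MPI_DOUBLE, rank + step, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                value += incoming;
            } else if (rank < 2 * step) {
                MPI_Send(&value, 1, MPI_DOUBLE, rank - step, 0,
                         MPI_COMM_WORLD);
            }
        }

        if (rank == 0)
            printf("Total sum = %f\n", value);

        MPI_Finalize();
        return 0;
    }

- In practice the same result is obtained with a single collective call (MPI_Reduce), covered in the Collective Communications part of the course.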
18. Parallelism and Pipelining to Achieve Concurrency
- There are two primary ways to achieve concurrency in an algorithm.
- Parallelism
  - The use of multiple resources to increase concurrency.
  - Partitioning.
  - Example: our summation problem.
- Pipelining
  - Dividing the computation into a number of steps that are repeated throughout the algorithm.
  - An ordered set of segments in which the output of each segment is the input of its successor.
  - Example: automobile assembly line.
19. Examples (Jacobi style update)
- Imagine we have a cellular automaton that we want to parallelize.
- (Figure: a grid of cells, numbered 1-8, to be divided among processors.)
20. Examples
- We try to distribute the rows evenly between two processors.
- (Figure: the same grid, split by rows between the two processors.)
21. Examples
- Columns seem to work better for this problem.
- (Figure: the same grid, split by columns between the two processors.)
22. Examples
- Minimizing communication.
- (Figure: the same grid, partitioned so that communication between the two processors is minimized.)
23-38. Examples (Gauss-Seidel style update)
- Emulating a serial Gauss-Seidel update style with a pipe.
- (Slides 23-38 are a sequence of figures stepping through the pipelined update as it sweeps across the grid.)
39. Batch System Interaction
- Both Anantham (400 processors) and System X (2200 processors) will normally operate in batch mode.
- Jobs are not interactive.
- Multi-user etiquette is enforced by a job scheduler and queuing system.
- Users will submit jobs using a script file built by the administrator and modified by the user.
40. PBS (Portable Batch System) Submission Script
    #!/bin/bash
    #!
    #! Example of job file to submit parallel MPI applications.
    #! Lines starting with #PBS are options for the qsub command.
    #! Lines starting with #! are comments.
    #! Set queue (production queue --- the only one right now) and
    #! the number of nodes.
    #! In this case we require 10 nodes from the entire set ("all").
    #PBS -q prod_q
    #PBS -l nodes=10:all
41. PBS Submission Script
    #! Set time limit.
    #! The default is 30 minutes of cpu time.
    #! Here we ask for up to 1 hour.
    #! (Note that this is total cpu time, e.g., 10 minutes on
    #! each of 4 processors is 40 minutes.)
    #! Hours:minutes:seconds
    #PBS -l cput=01:00:00
    #! Name of output files for std output and error.
    #! Defaults are <job-name>.o<job-number> and <job-name>.e<job-number>.
    #!PBS -e ZCA.err
    #!PBS -o ZCA.log
42. PBS Submission Script
    #! Mail to user when job terminates or aborts.
    #! PBS -m ae
    #! Change the working directory (default is home directory).
    cd $PBS_O_WORKDIR
    #! Run the parallel MPI executable (change the default a.out).
    #! (Note: omit "-kill" if you are running a 1 node job.)
    /usr/local/bin/mpiexec -kill a.out
43. Common Scheduler Commands
- qsub <script file name>
  - Submits your script file for scheduling. It is immediately checked for validity, and if it passes the check you will get a message that your job has been added to the queue.
- qstat
  - Displays information on jobs waiting in the queue and jobs that are running: how much time they have left and how many processors they are using.
  - Each job acquires a unique job_id that can be used to communicate with a job that is already running (perhaps to kill it).
- qdel <job_id>
  - If for some reason you have a job that you need to remove from the queue, this command will do it. It will also kill a job in progress.
  - You, of course, only have access to delete your own jobs.
44. MPI Data Types
- MPI thinks of every message as a starting point in memory and some measure of length, along with a possible interpretation of the data.
- The direct measure of length (number of bytes) is hidden from the user through the use of MPI data types.
- Each language binding (C and Fortran 77) has its own list of MPI types that are intended to increase portability, as the length of these types can change from machine to machine.
- Interpretations of data can change from machine to machine in heterogeneous clusters (Macs and PCs in the same cluster, for example).
45. MPI Types in C
- MPI_CHAR: signed char
- MPI_SHORT: signed short int
- MPI_INT: signed int
- MPI_LONG: signed long int
- MPI_UNSIGNED_CHAR: unsigned char
- MPI_UNSIGNED_SHORT: unsigned short int
- MPI_UNSIGNED: unsigned int
- MPI_UNSIGNED_LONG: unsigned long int
- MPI_FLOAT: float
- MPI_DOUBLE: double
- MPI_LONG_DOUBLE: long double
- MPI_BYTE
- MPI_PACKED
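- As a quick illustration (a sketch of ours, not from the slides; it uses MPI_Send and MPI_Recv, which the course introduces later, and must be run with at least two processes), the count/datatype pair is what describes a message rather than a raw byte count:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, data[4] = {1, 2, 3, 4};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* The message is "4 elements of type MPI_INT", not "16 bytes". */
            MPI_Send(data, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(data, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Process 1 received %d %d %d %d\n",
                   data[0], data[1], data[2], data[3]);
        }

        MPI_Finalize();
        return 0;
    }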
46. MPI Types in Fortran 77
- MPI_INTEGER: INTEGER
- MPI_REAL: REAL
- MPI_DOUBLE_PRECISION: DOUBLE PRECISION
- MPI_COMPLEX: COMPLEX
- MPI_LOGICAL: LOGICAL
- MPI_CHARACTER: CHARACTER(1)
- MPI_BYTE
- MPI_PACKED
- Caution: Fortran 90 does not always store arrays contiguously.
47. Functions Appearing in All MPI Programs (Fortran 77)
- MPI_INIT(IERROR)
  - INTEGER IERROR
- Must be called before any other MPI routine.
- Can be visualized as the point in the code where every processor obtains its own copy of the program and continues to execute, though this may happen earlier.
48. Functions Appearing in All MPI Programs (Fortran 77)
- MPI_FINALIZE(IERROR)
  - INTEGER IERROR
- This routine cleans up all MPI state.
- Once this routine is called, no MPI routine may be called.
- It is the user's responsibility to ensure that ALL pending communications involving a process complete before the process calls MPI_FINALIZE.
49. Typical Startup Functions
- MPI_COMM_SIZE(COMM, SIZE, IERROR)
  - IN INTEGER COMM
  - OUT INTEGER SIZE, IERROR
- Returns the size of the group associated with the communicator COMM.
- What's a communicator?
50. Communicators
- A communicator is an integer that tells MPI what communication domain it is in.
- There is a special communicator that exists in every MPI program called MPI_COMM_WORLD.
- MPI_COMM_WORLD can be thought of as the superset of all communication domains. Every processor requested by your initial script is a member of MPI_COMM_WORLD.
51. Typical Startup Functions
- MPI_COMM_SIZE(COMM, SIZE, IERROR)
  - IN INTEGER COMM
  - OUT INTEGER SIZE, IERROR
- Returns the size of the group associated with the communicator COMM.
- A typical program contains the following command as one of the very first MPI calls, to provide the code with the number of processors it has available for this execution. (Step one of self identification.)
- CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size_p, ierr_p)
52. Typical Startup Functions
- MPI_COMM_RANK(COMM, RANK, IERROR)
  - IN INTEGER COMM
  - OUT INTEGER RANK, IERROR
- Indicates the rank of the process that calls it, in the range 0..size-1, where size is the return value of MPI_COMM_SIZE.
- This rank is relative to the communication domain specified by the communicator COMM.
- For MPI_COMM_WORLD, this function will return the absolute rank of the process, a unique identifier. (Step 2 of self identification.)
- CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank_p, ierr_p)
53. Startup Variables
- SIZE
  - INTEGER :: size_p
- RANK
  - INTEGER :: rank_p
- STATUS (more on this guy later)
  - INTEGER, DIMENSION(MPI_STATUS_SIZE) :: status_p
- IERROR (Fortran 77)
  - INTEGER :: ierr_p
54. Hello World (Fortran 90)
    PROGRAM Hello_World
      IMPLICIT NONE
      INCLUDE 'mpif.h'
      INTEGER :: ierr_p, rank_p, size_p
      INTEGER, DIMENSION(MPI_STATUS_SIZE) :: status_p
      CALL MPI_INIT(ierr_p)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank_p, ierr_p)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, size_p, ierr_p)
      IF (rank_p == 0) THEN
        WRITE(*,*) 'Hello world! I am process 0 and I am special!'
      ELSE
        WRITE(*,*) 'Hello world! I am process ', rank_p
      END IF
      CALL MPI_FINALIZE(ierr_p)
    END PROGRAM Hello_World
55. Hello World (C)
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
      int node;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &node);
      if (node == 0)
        printf("Hello world! I am C process 0 and I am special!\n");
      else
        printf("Hello world! I am C process %d\n", node);
      MPI_Finalize();
      return 0;
    }
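- To build and run either program (the names here are only illustrative; the exact wrapper compilers depend on the local MPI installation), compile with the MPI wrapper compiler, typically mpicc for the C version or mpif90 for the Fortran 90 version, producing a.out, and then submit the PBS script from slides 40-42 with qsub so that its mpiexec line launches a.out on the allocated nodes.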