Title: Introduction to Parallel Computing, MPI, and OpenMP
1. Introduction to Parallel Computing, MPI, and OpenMP
Chunfang Chen, Danny Thorne, Adam Zornes
2. Outline
- Introduction to Parallel Computing, by Danny Thorne
  - Basic Parallel Computing Concepts
  - Hardware Characteristics
- Introduction to MPI, by Chunfang Chen
- Introduction to OpenMP, by Adam Zornes
3. Parallel Computing Concepts
- Definition
- Types of Parallelism
- Performance Measures
- Parallelism Issues
4. Definition
Parallel Computing:
- Computing multiple things simultaneously.
- Usually means computing different parts of the same problem simultaneously.
- In scientific computing, it often means decomposing a domain into more than one sub-domain and computing a solution on each sub-domain separately and simultaneously (or almost separately and simultaneously).
5. Types of Parallelism
- Perfect (a.k.a. Embarrassing, Trivial) Parallelism
  - Monte-Carlo Methods
  - Cellular Automata
- Data Parallelism
  - Domain Decomposition
  - Dense Matrix Multiplication
- Task Parallelism
  - Pipelining
  - Monte-Carlo?
  - Cellular Automata?
6. Performance Measures I
- Peak Performance: theoretical upper bound on performance.
- Sustained Performance: highest consistently achievable speed.
- MHz: million cycles per second.
- MIPS: million instructions per second.
- Mflops: million floating point operations per second.
- Speedup: sequential run time divided by parallel run time.
7. Performance Measures II
- Number of processors: p.
- Sequential run time: Tseq.
- Parallel run time: Tpar.
- Speedup: S = Tseq / Tpar. (Want S close to p.)
- Efficiency: E = S / p. (Want E close to 1.)
- Cost: C = p * Tpar. (Want C close to Tseq.)
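As a quick illustration with hypothetical numbers (not taken from the slides): if Tseq = 100 s and a run on p = 4 processors takes Tpar = 30 s, then S = 100 / 30 = 3.3, E = 3.3 / 4 = 0.83, and C = 4 * 30 = 120 s, compared with Tseq = 100 s, so 20 processor-seconds of the cost is parallel overhead.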
8. Parallelism Issues
- Load Balancing
- Problem Size
- Communication
- Portability
- Scalability
- Amdahl's law: for constant problem size, speedup saturates at a fixed bound (and efficiency goes to zero) as the number of processors goes to infinity (see the formula below).
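In its usual textbook form (not spelled out on the slide): if a fraction f of the work is inherently serial, then S(p) = 1 / (f + (1 - f)/p), so S(p) approaches 1/f and E = S/p approaches 0 as p goes to infinity. For example, with f = 0.1 the speedup can never exceed 10, no matter how many processors are used.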
9. Hardware Characteristics
- Kinds of Processors
- Types of Memory Organization
- Flow of Control
- Interconnection Networks
10. Kinds of Processors
- A few very powerful processors.
  - Cray SV1: 8-32 procs, 1.2 Gflops per proc.
- A whole lot of less powerful processors.
  - Thinking Machines CM-2: 65,536 procs, 7 Mflops per proc.
  - ASCI White, IBM SP Power3: 8192 procs, 375 MHz per proc.
- A medium quantity of medium-power procs.
  - Beowulf clusters, e.g. Bunyip: 192 x Intel Pentium III/550.
11. Types of Memory Organization
- Distributed Memory
- Shared Memory
- Distributed Shared Memory
12. Distributed Memory
13. Shared Memory
(HP Super Dome)
14. Distributed Shared Memory
15. Flow of Control
16. Dynamic Interconnection Networks
- a.k.a. indirect networks.
- Dynamic (indirect) links between processors and memory.
- Usually used for shared memory computers.
17. Static Interconnection Networks
- a.k.a. direct networks.
- Point-to-point links between processors.
- Usually used for message passing (distributed memory) computers.
18. Summary
- Basic Parallel Computing Concepts
  - What parallel computing is
  - Perfect Parallelism, Data Parallelism, Task Parallelism
  - Peak vs. Sustained Performance, Speedup, Efficiency, Cost
  - Load Balancing, Communication, Problem Size, Scalability, Amdahl's Law
- Hardware Characteristics
  - Few Powerful Procs, Many Weaker Procs, Medium
  - Distributed, Shared, and Distributed-Shared Memory
  - Flynn's Taxonomy: SISD, SIMD, MISD, MIMD
  - Bus Network, Crossbar Switched Network, Multistage
  - Star, Mesh, Hypercube, Tree Networks
19. Links
- Alliance Web Based Training for HPC -- http://webct.ncsa.uiuc.edu:8900/webct/public/home.pl
- Kumar, Grama, Gupta, Karypis, Introduction to Parallel Computing -- ftp://ftp.cs.umn.edu/dept/users/kumar/book
- Selected Web Resources for Parallel Computing -- http://www.eecs.umich.edu/~qstout/parlinks.html
- Deep Blue -- http://www.research.ibm.com/deepblue/meet/html/d.3.html
- Current Trends in Supercomputers and Scientific Computing -- http://www.jics.utk.edu/COLLABOR_INST/MMC/
- Writing A Task or Pipeline Parallel Program -- http://www.epcc.ed.ac.uk/direct/VISWS/CINECA/tsld041.htm
- HP Technical Documentation -- http://docs.hp.com
- Linux Parallel Processing HOWTO -- http://aggregate.org/PPLINUX/19980105/pphowto.html
- Introduction to Parallel Processing -- http://www.jics.utk.edu/I2PP/I2PPhtml/
- Message Passing Interface (MPI) for users -- http://www.npac.syr.edu/users/gcf/cps615mpi95/index.html
- Intro to Parallel Computing I -- http://archive.ncsa.uiuc.edu/Alliances/Exemplar/Training/NCSAMaterials/IntroParallel_I/index.htm
- Thinking Machines CM-2 -- http://www.svisions.com/sv/cm-dv.html
- The Beowulf Project -- http://www.beowulf.org
- Bunyip (Beowulf) Project -- http://tux.anu.edu.au/Projects/Beowulf/
- Robust Monte Carlo Methods for Light Transport Simulation -- http://graphics.stanford.edu/papers/veach_thesis/
- An Introduction to Parallel Computing -- http://www.pcc.qub.ac.uk/tec/courses/intro/ohp/intro-ohp.html
- Supercomputing, Parallel Processors and High Performance Computing -- http://www.compinfo-center.com/tpsupr-t.htm
- Internet Parallel Computing Archive -- http://wotug.ukc.ac.uk/parallel/
- IEEE Computer Society's ParaScope, A Listing of Parallel Computing Sites -- http://computer.org/parascope/
- High Performance Computing (HPC) Wire -- http://www.tgc.com/HPCwire.html
- KAOS Laboratory, University of Kentucky -- http://aggregate.org/KAOS/
- Notes on Parallel Computer Architecture -- http://www.npac.syr.edu/nse/hpccsurvey/architecture/index.html
- Nan's Parallel Computing Page -- http://www.cs.rit.edu/~ncs/parallel.html
- High Performance Computing Photos -- http://cs.calvin.edu/CS/parallel/resources/photos/
- Parallel Networking Topologies -- http://www.cs.rit.edu/icss571/parallelwrl/cgframe.html
- What is mixed parallelism? -- http://www.ens-lyon.fr/~fsuter/pages/mixedpar.html
20. Introduction to MPI
21. Outline
- Introduction to Parallel Computing, by Danny Thorne
- Introduction to MPI, by Chunfang Chen
  - Writing MPI programs
  - Compiling and linking MPI programs
  - Running MPI programs
- Introduction to OpenMP, by Adam Zornes
22. Writing MPI Programs
- All MPI programs must include a header file: in C, mpi.h; in Fortran, mpif.h.
- All MPI programs must call MPI_INIT as the first MPI call. This establishes the MPI environment.
- All MPI programs must call MPI_FINALIZE as the last MPI call; this exits MPI.
23. Program: Welcome to MPI

  Program Welcome
  include 'mpif.h'
  integer ierr
  Call MPI_INIT(ierr)
  print *, 'Welcome to MPI'
  Call MPI_FINALIZE(ierr)
  end
24. Commentary
- Only one invocation of MPI_INIT can occur in each program.
- Its only argument is an error code (integer).
- MPI_FINALIZE terminates the MPI environment (no calls to MPI can be made after MPI_FINALIZE is called).
- All non-MPI routines are local, i.e. print *, 'Welcome to MPI' runs on each processor.
25. Compiling MPI Programs
- In many MPI implementations, the program can be compiled as:
  - mpif90 -o executable program.f
  - mpicc -o executable program.c
- mpif90 and mpicc transparently set the include paths and link to the appropriate libraries.
26. Compiling MPI Programs
- mpif90 and mpicc can be used to compile small programs.
- For larger programs, it is ideal to make use of a makefile.
27. Running MPI Programs
- mpirun -np 2 executable
  - mpirun indicates that you are using the MPI environment.
  - -np is the number of processors you would like to use (two in the present case).
28. Sample Output
- Sample output when run over 2 processors will be:

  Welcome to MPI
  Welcome to MPI

- Since print *, 'Welcome to MPI' is a local statement, every processor executes it.
29. Finding Out More About the Parallel Environment
- The primary questions asked in a parallel program are:
  - How many processors are there?
  - Who am I?
- "How many" is answered by MPI_COMM_SIZE.
- "Who am I" is answered by MPI_COMM_RANK.
30. How Many?
- Call MPI_COMM_SIZE(mpi_comm_world, size, ierr)
  - mpi_comm_world is the communicator.
  - A communicator contains a group of processors.
  - size returns the total number of processors.
  - integer size
31. Who Am I?
- The processors are ordered in the group consecutively from 0 to size-1; this position is known as the rank.
- Call MPI_COMM_RANK(mpi_comm_world, rank, ierr)
  - mpi_comm_world is the communicator.
  - integer rank
  - For size = 4, the ranks are 0, 1, 2, 3.
32. Communicator
(Figure: a communicator containing four processes, ranks 0-3)
33. Program: Welcome to MPI

  Program Welcome
  include 'mpif.h'
  integer size, rank, ierr
  Call MPI_INIT(ierr)
  Call MPI_COMM_SIZE(mpi_comm_world, size, ierr)
  Call MPI_COMM_RANK(mpi_comm_world, rank, ierr)
  print *, 'my rank is', rank, 'Welcome to MPI'
  call MPI_FINALIZE(ierr)
  end
34. Sample Output
  Sdx1 28 mpif90 welcome.f90
  /usr/ccs/bin/ld: (warning) At least one PA2.0 object file (welcome.o) was detected. The linked output may not run on a PA 1.x system.
  Sdx1 29 mpirun -np 4 a.out
  my rank is 2 Welcome to MPI
  my rank is 0 Welcome to MPI
  my rank is 1 Welcome to MPI
  my rank is 3 Welcome to MPI
35. Sending and Receiving Messages
- Communication between processors involves:
  - identifying the sender and receiver
  - the type and amount of data that is being sent
  - how the receiver is identified
36. Communication
- Point-to-point communication
  - affects exactly two processors
- Collective communication
  - affects a group of processors in the communicator
37. Point-to-Point Communication
(Figure: a message passing between two of four processes, ranks 0-3)
38. Point-to-Point Communication
- Communication between two processors.
- The source processor sends a message to the destination processor.
- The destination processor receives the message.
- Communication takes place within a communicator.
- The destination processor is identified by its rank in the communicator.
39. Communication Modes
- Synchronous send (MPI_SSEND): only completes when the receive has completed.
- Buffered send (MPI_BSEND): always completes (unless an error occurs), irrespective of the receiver.
- Standard send (MPI_SEND): message sent (receive state unknown).
- Receive (MPI_RECV): completes when a message has arrived.
40. Standard Send
- Call MPI_SEND(buf, count, datatype, dest, tag, comm, ierr)
  - buf is the name of the array/variable to be sent
  - count is the number of elements to be sent
  - datatype is the type of the data
  - dest is the rank of the destination processor
  - tag is an arbitrary number which can be used to distinguish among messages
  - comm is the communicator (mpi_comm_world)
41. MPI Receive
- Call MPI_RECV(buf, count, datatype, source, tag, comm, status, ierr)
  - source is the rank of the processor from which data will be accepted (this can be the rank of a specific processor or a wild card, MPI_ANY_SOURCE)
  - tag is an arbitrary number which can be used to distinguish among messages (this can be a wild card, MPI_ANY_TAG)
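When the wild cards above are used, the status argument can be inspected afterwards to find out what was actually received. A minimal Fortran sketch (not from the slides; it assumes mpif.h has been included and the variable names are illustrative):

  integer status(MPI_STATUS_SIZE), nreceived, ierr
  real buf(100)
  Call MPI_RECV(buf, 100, MPI_REAL, MPI_ANY_SOURCE, MPI_ANY_TAG, &
                mpi_comm_world, status, ierr)
  ! who sent the message, and with which tag
  print *, 'source =', status(MPI_SOURCE), ' tag =', status(MPI_TAG)
  ! how many elements actually arrived
  Call MPI_GET_COUNT(status, MPI_REAL, nreceived, ierr)
  print *, 'received', nreceived, 'elements'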
42. Basic Data Types (Fortran)
- MPI_INTEGER -- integer
- MPI_REAL -- real
- MPI_DOUBLE_PRECISION -- double precision
- MPI_COMPLEX -- complex
- MPI_LOGICAL -- logical
- MPI_CHARACTER -- character
43. Sample Code with Send/Receive

  include 'mpif.h'
  ! Run on 2 processors
  integer size, rank, ierr, tag, status(MPI_STATUS_SIZE)
  character(14) message
  Call MPI_INIT(ierr)
  Call MPI_COMM_SIZE(mpi_comm_world, size, ierr)
  Call MPI_COMM_RANK(mpi_comm_world, rank, ierr)
  tag = 7
  if (rank.eq.0) then
44. Sample Code with Send/Receive (cont.)

     message = 'Welcome to MPI'
     call MPI_SEND(message, 14, MPI_CHARACTER, 1, tag, mpi_comm_world, ierr)
  else
     call MPI_RECV(message, 14, MPI_CHARACTER, MPI_ANY_SOURCE, tag, mpi_comm_world, status, ierr)
     print *, 'my rank is ', rank, ' message is ', message
  endif
  call MPI_FINALIZE(ierr)
  end
45. Sample Output
  Sdx1 30 mpif90 sendrecv.f90
  /usr/ccs/bin/ld: (warning) At least one PA2.0 object file (sendrecv.o) was detected. The linked output may not run on a PA 1.x system.
  Sdx1 31 mpirun -np 2 a.out
  my rank is 1 message is Welcome to MPI
46. Collective Communication
(Figure: a collective operation involving all four processes, ranks 0-3)
47. Collective Communication
- Will not interfere with point-to-point communication, and vice versa.
- All processors must call the collective routine.
- Synchronization is not guaranteed (except for barrier).
- No tags.
- The receive buffer must be exactly the right size.
48. Collective Routines
49. Collective Routine: MPI_BCAST
- call MPI_BCAST(buffer, count, datatype, source, comm, ierr)
  - buffer is the name of the array/variable to be broadcast
  - count is the number of elements to be sent
  - datatype is the type of the data
  - source is the rank of the processor from which data will be sent (the root)
  - comm is the communicator (mpi_comm_world)
50. Sample Code Using MPI_BCAST

  include 'mpif.h'
  integer size, rank, ierr
  real para
  Call MPI_INIT(ierr)
  Call MPI_COMM_SIZE(mpi_comm_world, size, ierr)
  Call MPI_COMM_RANK(mpi_comm_world, rank, ierr)
  if (rank.eq.3) para = 23.0
  Call MPI_BCAST(para, 1, MPI_REAL, 3, MPI_COMM_WORLD, ierr)
51. Sample Code (cont.)

  print *, 'my rank is ', rank, ' after broadcast para is ', para
  call MPI_FINALIZE(ierr)
  end
52. Sample Output
  Sdx1 32 mpif90 bcast.f90
  /usr/ccs/bin/ld: (warning) At least one PA2.0 object file (bcast.o) was detected. The linked output may not run on a PA 1.x system.
  Sdx1 33 mpirun -np 4 a.out
  my rank is 3 after broadcast para is 23.0
  my rank is 2 after broadcast para is 23.0
  my rank is 0 after broadcast para is 23.0
  my rank is 1 after broadcast para is 23.0
53. Collective Routine: MPI_REDUCE
- call MPI_REDUCE(sendbuffer, recvbuffer, count, datatype, op, root, comm, ierr)
  - sendbuffer is the buffer/array to be sent
  - recvbuffer is the receiving buffer/array
  - datatype is the type of the data
  - op is the collective operation
  - root is the rank of the destination
  - comm is the communicator
54. Collective Operations
- MPI_MAX -- maximum
- MPI_MIN -- minimum
- MPI_SUM -- sum
- MPI_PROD -- product
- MPI_MAXLOC -- maximum and location
- MPI_MINLOC -- minimum and location
- MPI_LOR -- logical OR
- MPI_LXOR -- logical exclusive OR
55. Sample Code Using MPI_REDUCE

  include 'mpif.h'
  integer size, rank, ierr
  integer in(2), out(2)
  Call MPI_INIT(ierr)
  Call MPI_COMM_SIZE(mpi_comm_world, size, ierr)
  Call MPI_COMM_RANK(mpi_comm_world, rank, ierr)
  in(1) = rank + 1
  in(2) = rank
56. Sample Code (cont.)

  Call MPI_REDUCE(in, out, 1, MPI_2INTEGER, MPI_MAXLOC, 7, MPI_COMM_WORLD, ierr)
  if (rank.eq.7) print *, 'my rank is', rank, 'max =', out(1), 'at rank', out(2)
  Call MPI_REDUCE(in, out, 1, MPI_2INTEGER, MPI_MINLOC, 2, MPI_COMM_WORLD, ierr)
  if (rank.eq.2) print *, 'my rank is', rank, 'min =', out(1), 'at rank', out(2)
  call MPI_FINALIZE(ierr)
  end
57. Sample Output
  Sdx1 36 mpif90 bcast.f90
  /usr/ccs/bin/ld: (warning) At least one PA2.0 object file (bcast.o) was detected. The linked output may not run on a PA 1.x system.
  Sdx1 37 mpirun -np 8 a.out
  my rank is 7 max = 8 at rank 7
  my rank is 2 min = 1 at rank 0
58. Basic Routines in MPI
- Using the following MPI routines, many parallel programs can be written (a combined sketch follows):
  - MPI_INIT
  - MPI_COMM_SIZE
  - MPI_COMM_RANK
  - MPI_SEND
  - MPI_RECV
  - MPI_BCAST
  - MPI_REDUCE
  - MPI_FINALIZE
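To show how these routines fit together, here is a small sketch in the same style as the earlier sample codes (not taken from the slides): every processor computes a partial sum of 1..100 and MPI_REDUCE combines the partial sums on rank 0. It assumes 100 is divisible by the number of processors.

  Program ParSum
  include 'mpif.h'
  integer size, rank, ierr, i
  integer partial, total
  Call MPI_INIT(ierr)
  Call MPI_COMM_SIZE(mpi_comm_world, size, ierr)
  Call MPI_COMM_RANK(mpi_comm_world, rank, ierr)
  ! each rank sums its share of 1..100
  partial = 0
  do i = rank*(100/size) + 1, (rank+1)*(100/size)
     partial = partial + i
  enddo
  Call MPI_REDUCE(partial, total, 1, MPI_INTEGER, MPI_SUM, 0, mpi_comm_world, ierr)
  if (rank.eq.0) print *, 'sum of 1..100 =', total
  Call MPI_FINALIZE(ierr)
  end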
59. Resources
- Online resources:
  - http://www-unix.mcs.anl.gov/mpi
  - http://www.erc.msstate.edu/mpi
  - http://www.epm.ornl.gov/~walker/mpi
  - http://www.epcc.ed.ac.uk/mpi
  - http://www.mcs.anl.gov/mpi/mpi-report-1.1/mpi-report.html
  - ftp://www.mcs.anl.gov/pub/mpi/mpi-report.html
60. OpenMP and You
61. Outline
- Introduction to Parallel Computing, by Danny Thorne
- Introduction to MPI, by Chunfang Chen
- Introduction to OpenMP, by Adam Zornes
  - What is OpenMP
  - A Brief History
  - Nuts and Bolts
  - Example(s?)
62. What is OpenMP
- OpenMP is a portable, multiprocessing API for shared memory computers.
- OpenMP is not a language.
- Instead, OpenMP specifies a set of compiler directives and library routines for an existing language (Fortran, C) for parallel programming on a shared memory machine.
63. Why is OpenMP Popular?
- No message passing.
- OpenMP directives or library calls may be incorporated incrementally.
- The code is in effect still a serial code.
- Code size increase is generally smaller.
- OpenMP-enabled codes tend to be more readable
- Vendor involvement
64. History of OpenMP
- Shared memory computers with proprietary, directive-driven programming environments emerged in the mid-80s.
- In 1996 a group formed to create an industry standard.
- They called themselves...
65. History of OpenMP
- The ARB (OpenMP Architecture Review Board)
- A group of corporations, research groups, and universities
- Original members were ASCI, DEC, HP, IBM, Intel, KAI, SGI
- Has permanent and auxiliary members
- Meets by phone and email to interpret the standard, answer questions, develop new specifications, and create publicity
66. What Did They Create?
- OpenMP consists of three main parts:
  - Compiler directives, used by the programmer to communicate with the compiler
  - A runtime library, which enables the setting and querying of parallel parameters
  - Environment variables, which can be used to define a limited number of runtime system parallel parameters
67. The Basic Idea
- The code starts with one master thread.
- When a parallel task needs to be performed, additional threads are spawned.
- When the parallel tasks are finished, the additional threads are released (a minimal sketch of this fork-join pattern follows).
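A minimal Fortran sketch of this fork-join behavior (illustrative, not from the slides; it assumes the compiler is invoked with its OpenMP flag):

      program fork_join
      use omp_lib                      ! OpenMP runtime library interface
      integer :: id
      print *, 'master thread only, before the parallel region'
!$omp parallel private(id)
      id = omp_get_thread_num()        ! each spawned thread gets its own id
      print *, 'hello from thread', id, 'of', omp_get_num_threads()
!$omp end parallel
      print *, 'master thread only, after the threads are released'
      end program fork_join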
68. The Basic Idea
69. The Illustrious OpenMP Directives
- Control Structures: what is parallel and what is serial
- Work Sharing: who does what
- Synchronization: bring everything back together
- Data Scope Attributes (clauses): who can use what, and when, and where
- Orphaning: alone but not necessarily lost
70. Regions or Loops: Which is Right for You?
- Two ways to parallelize: parallel loops (fine-grained) and parallel regions (coarse-grained).
- Loops can be Fortran do loops, C/C++ for loops, etc.
- Parallel regions cut down on overhead, but require more complex programming (i.e. what happens to a thread not in use?).
71. Work Sharing Constructs
- A work sharing construct divides the execution of the enclosed code region among the participating threads.
- The DO directive
- The SECTIONS directive
- The SINGLE directive

!$omp parallel do
      do i = 1, n
         a(i) = b(i) + c(i)
      enddo

!$omp parallel
!$omp sections
!$omp section
      call init_field(field)
!$omp section
      call check_grid(grid)
!$omp end sections
!$omp single
      call do_some_work(a(1))
!$omp end single
!$omp end parallel
72. Synchronization: Getting It Together
- Synchronization directives provide for thread synchronization and mutual exclusion (a short sketch follows the list):
  - The MASTER directive
  - The BARRIER directive
  - The CRITICAL directive
  - The ORDERED directive
  - The ATOMIC directive
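As a small illustration of one of these directives (a sketch, not from the slides): a CRITICAL section serializes updates to a shared counter inside a parallel region, so the final count equals the number of threads.

      program critical_demo
      use omp_lib
      integer :: hits
      hits = 0
!$omp parallel shared(hits)
!$omp critical
      ! only one thread at a time executes this update
      hits = hits + 1
!$omp end critical
!$omp end parallel
      print *, 'number of threads that passed through:', hits
      end program critical_demo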
73. Data Scoping Directives
- Clauses qualify and scope the variables in a block of code:
  - PRIVATE
  - SHARED
  - DEFAULT (PRIVATE | SHARED | NONE)
  - FIRSTPRIVATE
  - LASTPRIVATE
  - COPYIN
  - REDUCTION
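A short sketch of how some of these clauses combine (illustrative, not from the slides): PRIVATE gives each thread its own loop index, SHARED lets all threads read the same array, and REDUCTION(+:total) gives each thread a private partial sum that is combined when the loop ends.

      program scoping_demo
      use omp_lib
      integer :: i, n
      real :: total, x(1000)
      n = 1000
      x = 1.0
      total = 0.0
!$omp parallel do private(i) shared(x) reduction(+:total)
      do i = 1, n
         total = total + x(i)
      enddo
      print *, 'total =', total
      end program scoping_demo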
74. Orphaning
- Directives that do not appear in the lexical extent of a parallel construct, but lie in its dynamic extent, are called orphaned directives.
- That is, directives in routines called from within parallel constructs.
75. Runtime Library Routines
- OMP_SET_NUM_THREADS(int)
- OMP_GET_NUM_THREADS( )
- OMP_GET_MAX_THREADS( )
- OMP_GET_THREAD_NUM( )
- OMP_GET_NUM_PROCS( )
- OMP_IN_PARALLEL( )
- OMP_SET_DYNAMIC(bool)
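A brief sketch of a few of these routines in use (illustrative, not from the slides):

      program runtime_demo
      use omp_lib
      ! request four threads for subsequent parallel regions
      call omp_set_num_threads(4)
      print *, 'processors available:', omp_get_num_procs()
      print *, 'inside a parallel region?', omp_in_parallel()   ! .false. here
!$omp parallel
!$omp master
      print *, 'threads in this region:', omp_get_num_threads()
      print *, 'inside a parallel region?', omp_in_parallel()   ! .true. here
!$omp end master
!$omp end parallel
      end program runtime_demo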
76. The DREADed LOCKS
- OMP_INIT_LOCK(var)
- OMP_DESTROY_LOCK(var)
- OMP_SET_LOCK(var)
- OMP_UNSET_LOCK(var)
- OMP_TEST_LOCK(var)
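A minimal sketch of the lock routines (illustrative, not from the slides): the lock plays the same role as a CRITICAL section, but as an object that can be initialized, passed around, and destroyed.

      program lock_demo
      use omp_lib
      integer(kind=omp_lock_kind) :: lck
      integer :: count
      count = 0
      call omp_init_lock(lck)
!$omp parallel shared(lck, count)
      call omp_set_lock(lck)       ! acquire; other threads wait here
      count = count + 1
      call omp_unset_lock(lck)     ! release
!$omp end parallel
      call omp_destroy_lock(lck)
      print *, 'count =', count
      end program lock_demo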
77. Environment Variables
- OMP_SCHEDULE
- OMP_NUM_THREADS
- OMP_DYNAMIC
- OMP_NESTED
78. The Example(s?)
79. The Example(s?) (cont.)
80. The Example(s?) (cont.)
81. The Example(s?) (cont.)
82. The Example(s?) (cont.)
83. The Requisite Links Page
- http://www.cs.gsu.edu/~cscyip/csc4310/
- http://www.openmp.org/
- http://webct.ncsa.uiuc.edu:8900/webct/public/show_courses.pl
- http://oscinfo.osc.edu/training/openmp/big/fsld.001.html
- http://www.ccs.uky.edu/~douglas

And the audience wakes up... then stumbles out of the room...