Title: Introduction to Parallel Computing, MPI, and OpenMP
1. Introduction to Parallel Computing, MPI, and OpenMP
Chunfang Chen, Danny Thorne, Adam Zornes
2. Outline
- Introduction to Parallel Computing, by Danny Thorne
  - Basic Parallel Computing Concepts
  - Hardware Characteristics
- Introduction to MPI, by Chunfang Chen
- Introduction to OpenMP, by Adam Zornes
3. Parallel Computing Concepts
- Definition
- Types of Parallelism
- Performance Measures
- Parallelism Issues
4. Definition
Parallel Computing:
- Computing multiple things simultaneously.
- Usually means computing different parts of the same problem simultaneously.
- In scientific computing, it often means decomposing a domain into more than one sub-domain and computing a solution on each sub-domain separately and simultaneously (or almost separately and simultaneously).
5. Types of Parallelism
- Perfect (a.k.a. Embarrassing, Trivial) Parallelism
  - Monte-Carlo Methods
  - Cellular Automata
- Data Parallelism
  - Domain Decomposition
  - Dense Matrix Multiplication
- Task Parallelism
  - Pipelining
  - Monte-Carlo?
  - Cellular Automata?
6. Performance Measures I
- Peak Performance: theoretical upper bound on performance.
- Sustained Performance: highest consistently achievable speed.
- MHz: million cycles per second.
- MIPS: million instructions per second.
- Mflops: million floating point operations per second.
- Speedup: sequential run time divided by parallel run time.
7. Performance Measures II
- Number of processors: p.
- Sequential run time: Tseq.
- Parallel run time: Tpar.
- Speedup: S = Tseq / Tpar. (Want S close to p.)
- Efficiency: E = S / p. (Want E close to 1.)
- Cost: C = p * Tpar. (Want C close to Tseq.)
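As a quick illustration with hypothetical numbers (not taken from the slides): if Tseq = 100 s and a run on p = 4 processors takes Tpar = 30 s, then S = 100 / 30 = 3.3, E = 3.3 / 4 = 0.83, and C = 4 * 30 = 120 s, compared with Tseq = 100 s, so 20 processor-seconds of the cost is parallel overhead.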
8. Parallelism Issues
- Load Balancing
- Problem Size
- Communication
- Portability
- Scalability
- Amdahl's law: for constant problem size, speedup saturates at a fixed bound (and efficiency goes to zero) as the number of processors goes to infinity (see the formula below).
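In its usual textbook form (not spelled out on the slide): if a fraction f of the work is inherently serial, then S(p) = 1 / (f + (1 - f)/p), so S(p) approaches 1/f and E = S/p approaches 0 as p goes to infinity. For example, with f = 0.1 the speedup can never exceed 10, no matter how many processors are used.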
9. Hardware Characteristics
- Kinds of Processors
- Types of Memory Organization
- Flow of Control
- Interconnection Networks
10. Kinds of Processors
- A few very powerful processors.
  - Cray SV1: 8-32 procs, 1.2 Gflops per proc.
- A whole lot of less powerful processors.
  - Thinking Machines CM-2: 65,536 procs, 7 Mflops per proc.
  - ASCI White, IBM SP Power3: 8192 procs, 375 MHz per proc.
- A medium quantity of medium-power procs.
  - Beowulf clusters, e.g. Bunyip: 192 x Intel Pentium III/550.
11. Types of Memory Organization
- Distributed Memory
- Shared Memory
- Distributed Shared Memory
12. Distributed Memory
13. Shared Memory
(HP Super Dome)
14. Distributed Shared Memory
15. Flow of Control
16. Dynamic Interconnection Networks
- a.k.a. indirect networks.
- Dynamic (indirect) links between processors and memory.
- Usually used for shared memory computers.
17. Static Interconnection Networks
- a.k.a. direct networks.
- Point-to-point links between processors.
- Usually used for message passing (distributed memory) computers.
18. Summary
- Basic Parallel Computing Concepts
  - What parallel computing is
  - Perfect Parallelism, Data Parallelism, Task Parallelism
  - Peak vs. Sustained Performance, Speedup, Efficiency, Cost
  - Load Balancing, Communication, Problem Size, Scalability, Amdahl's Law
- Hardware Characteristics
  - Few Powerful Procs, Many Weaker Procs, Medium
  - Distributed, Shared, and Distributed-Shared Memory
  - Flynn's Taxonomy: SISD, SIMD, MISD, MIMD
  - Bus Network, Crossbar Switched Network, Multistage
  - Star, Mesh, Hypercube, Tree Networks
19. Links
- Alliance Web Based Training for HPC -- http://webct.ncsa.uiuc.edu:8900/webct/public/home.pl
- Kumar, Grama, Gupta, Karypis, Introduction to Parallel Computing -- ftp://ftp.cs.umn.edu/dept/users/kumar/book
- Selected Web Resources for Parallel Computing -- http://www.eecs.umich.edu/~qstout/parlinks.html
- Deep Blue -- http://www.research.ibm.com/deepblue/meet/html/d.3.html
- Current Trends in Supercomputers and Scientific Computing -- http://www.jics.utk.edu/COLLABOR_INST/MMC/
- Writing A Task or Pipeline Parallel Program -- http://www.epcc.ed.ac.uk/direct/VISWS/CINECA/tsld041.htm
- HP Technical Documentation -- http://docs.hp.com
- Linux Parallel Processing HOWTO -- http://aggregate.org/PPLINUX/19980105/pphowto.html
- Introduction to Parallel Processing -- http://www.jics.utk.edu/I2PP/I2PPhtml/
- Message Passing Interface (MPI) for users -- http://www.npac.syr.edu/users/gcf/cps615mpi95/index.html
- Intro to Parallel Computing I -- http://archive.ncsa.uiuc.edu/Alliances/Exemplar/Training/NCSAMaterials/IntroParallel_I/index.htm
- Thinking Machines CM-2 -- http://www.svisions.com/sv/cm-dv.html
- The Beowulf Project -- http://www.beowulf.org
- Bunyip (Beowulf) Project -- http://tux.anu.edu.au/Projects/Beowulf/
- Robust Monte Carlo Methods for Light Transport Simulation -- http://graphics.stanford.edu/papers/veach_thesis/
- An Introduction to Parallel Computing -- http://www.pcc.qub.ac.uk/tec/courses/intro/ohp/intro-ohp.html
- Supercomputing, Parallel Processors and High Performance Computing -- http://www.compinfo-center.com/tpsupr-t.htm
- Internet Parallel Computing Archive -- http://wotug.ukc.ac.uk/parallel/
- IEEE Computer Society's ParaScope, A Listing of Parallel Computing Sites -- http://computer.org/parascope/
- High Performance Computing (HPC) Wire -- http://www.tgc.com/HPCwire.html
- KAOS Laboratory, University of Kentucky -- http://aggregate.org/KAOS/
- Notes on Parallel Computer Architecture -- http://www.npac.syr.edu/nse/hpccsurvey/architecture/index.html
- Nan's Parallel Computing Page -- http://www.cs.rit.edu/~ncs/parallel.html
- High Performance Computing Photos -- http://cs.calvin.edu/CS/parallel/resources/photos/
- Parallel Networking Topologies -- http://www.cs.rit.edu/icss571/parallelwrl/cgframe.html
- What is mixed parallelism? -- http://www.ens-lyon.fr/~fsuter/pages/mixedpar.html
20. Introduction to MPI
21. Outline
- Introduction to Parallel Computing, by Danny Thorne
- Introduction to MPI, by Chunfang Chen
  - Writing MPI programs
  - Compiling and linking MPI programs
  - Running MPI programs
- Introduction to OpenMP, by Adam Zornes
22. Writing MPI Programs
- All MPI programs must include a header file: in C, mpi.h; in Fortran, mpif.h.
- All MPI programs must call MPI_INIT as the first MPI call. This establishes the MPI environment.
- All MPI programs must call MPI_FINALIZE as the last MPI call; this exits MPI.
23. Program: Welcome to MPI

  Program Welcome
  include 'mpif.h'
  integer ierr
  Call MPI_INIT(ierr)
  print *, 'Welcome to MPI'
  Call MPI_FINALIZE(ierr)
  end
24. Commentary
- Only one invocation of MPI_INIT can occur in each program.
- Its only argument is an error code (integer).
- MPI_FINALIZE terminates the MPI environment (no calls to MPI can be made after MPI_FINALIZE is called).
- All non-MPI routines are local, i.e. print *, 'Welcome to MPI' runs on each processor.
25. Compiling MPI Programs
- In many MPI implementations, the program can be compiled as:
  - mpif90 -o executable program.f
  - mpicc -o executable program.c
- mpif90 and mpicc transparently set the include paths and link to the appropriate libraries.
26. Compiling MPI Programs
- mpif90 and mpicc can be used to compile small programs.
- For larger programs, it is ideal to make use of a makefile.
27. Running MPI Programs
- mpirun -np 2 executable
  - mpirun indicates that you are using the MPI environment.
  - -np is the number of processors you would like to use (two in the present case).
28. Sample Output
- Sample output when run over 2 processors will be:

  Welcome to MPI
  Welcome to MPI

- Since print *, 'Welcome to MPI' is a local statement, every processor executes it.
29. Finding Out More About the Parallel Environment
- The primary questions asked in a parallel program are:
  - How many processors are there?
  - Who am I?
- "How many" is answered by MPI_COMM_SIZE.
- "Who am I" is answered by MPI_COMM_RANK.
30. How Many?
- Call MPI_COMM_SIZE(mpi_comm_world, size, ierr)
  - mpi_comm_world is the communicator.
  - A communicator contains a group of processors.
  - size returns the total number of processors.
  - integer size
31. Who Am I?
- The processors are ordered in the group consecutively from 0 to size-1; this position is known as the rank.
- Call MPI_COMM_RANK(mpi_comm_world, rank, ierr)
  - mpi_comm_world is the communicator.
  - integer rank
  - For size = 4, the ranks are 0, 1, 2, 3.
32. Communicator
(Figure: a communicator containing four processes, ranks 0-3)
33. Program: Welcome to MPI

  Program Welcome
  include 'mpif.h'
  integer size, rank, ierr
  Call MPI_INIT(ierr)
  Call MPI_COMM_SIZE(mpi_comm_world, size, ierr)
  Call MPI_COMM_RANK(mpi_comm_world, rank, ierr)
  print *, 'my rank is', rank, 'Welcome to MPI'
  call MPI_FINALIZE(ierr)
  end
34. Sample Output
  Sdx1 28 mpif90 welcome.f90
  /usr/ccs/bin/ld: (warning) At least one PA2.0 object file (welcome.o) was detected. The linked output may not run on a PA 1.x system.
  Sdx1 29 mpirun -np 4 a.out
  my rank is 2 Welcome to MPI
  my rank is 0 Welcome to MPI
  my rank is 1 Welcome to MPI
  my rank is 3 Welcome to MPI
35. Sending and Receiving Messages
- Communication between processors involves:
  - identifying the sender and receiver
  - the type and amount of data that is being sent
  - how the receiver is identified
36. Communication
- Point-to-point communication
  - affects exactly two processors
- Collective communication
  - affects a group of processors in the communicator
37. Point-to-Point Communication
(Figure: a message passing between two of four processes, ranks 0-3)
38. Point-to-Point Communication
- Communication between two processors.
- The source processor sends a message to the destination processor.
- The destination processor receives the message.
- Communication takes place within a communicator.
- The destination processor is identified by its rank in the communicator.
39. Communication Modes
- Synchronous send (MPI_SSEND): only completes when the receive has completed.
- Buffered send (MPI_BSEND): always completes (unless an error occurs), irrespective of the receiver.
- Standard send (MPI_SEND): message sent (receive state unknown).
- Receive (MPI_RECV): completes when a message has arrived.
40. Standard Send
- Call MPI_SEND(buf, count, datatype, dest, tag, comm, ierr)
  - buf is the name of the array/variable to be sent
  - count is the number of elements to be sent
  - datatype is the type of the data
  - dest is the rank of the destination processor
  - tag is an arbitrary number which can be used to distinguish among messages
  - comm is the communicator (mpi_comm_world)
41. MPI Receive
- Call MPI_RECV(buf, count, datatype, source, tag, comm, status, ierr)
  - source is the rank of the processor from which data will be accepted (this can be the rank of a specific processor or a wild card, MPI_ANY_SOURCE)
  - tag is an arbitrary number which can be used to distinguish among messages (this can be a wild card, MPI_ANY_TAG)
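When the wild cards above are used, the status argument can be inspected afterwards to find out what was actually received. A minimal Fortran sketch (not from the slides; it assumes mpif.h has been included and the variable names are illustrative):

  integer status(MPI_STATUS_SIZE), nreceived, ierr
  real buf(100)
  Call MPI_RECV(buf, 100, MPI_REAL, MPI_ANY_SOURCE, MPI_ANY_TAG, &
                mpi_comm_world, status, ierr)
  ! who sent the message, and with which tag
  print *, 'source =', status(MPI_SOURCE), ' tag =', status(MPI_TAG)
  ! how many elements actually arrived
  Call MPI_GET_COUNT(status, MPI_REAL, nreceived, ierr)
  print *, 'received', nreceived, 'elements'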
42. Basic Data Types (Fortran)
- MPI_INTEGER -- integer
- MPI_REAL -- real
- MPI_DOUBLE_PRECISION -- double precision
- MPI_COMPLEX -- complex
- MPI_LOGICAL -- logical
- MPI_CHARACTER -- character
43. Sample Code with Send/Receive

  include 'mpif.h'
  ! Run on 2 processors
  integer size, rank, ierr, tag, status(MPI_STATUS_SIZE)
  character(14) message
  Call MPI_INIT(ierr)
  Call MPI_COMM_SIZE(mpi_comm_world, size, ierr)
  Call MPI_COMM_RANK(mpi_comm_world, rank, ierr)
  tag = 7
  if (rank.eq.0) then
44. Sample Code with Send/Receive (cont.)

     message = 'Welcome to MPI'
     call MPI_SEND(message, 14, MPI_CHARACTER, 1, tag, mpi_comm_world, ierr)
  else
     call MPI_RECV(message, 14, MPI_CHARACTER, MPI_ANY_SOURCE, tag, mpi_comm_world, status, ierr)
     print *, 'my rank is ', rank, ' message is ', message
  endif
  call MPI_FINALIZE(ierr)
  end
45. Sample Output
  Sdx1 30 mpif90 sendrecv.f90
  /usr/ccs/bin/ld: (warning) At least one PA2.0 object file (sendrecv.o) was detected. The linked output may not run on a PA 1.x system.
  Sdx1 31 mpirun -np 2 a.out
  my rank is 1 message is Welcome to MPI
46. Collective Communication
(Figure: a collective operation involving all four processes, ranks 0-3)
47. Collective Communication
- Will not interfere with point-to-point communication, and vice versa.
- All processors must call the collective routine.
- Synchronization is not guaranteed (except for barrier).
- No tags.
- The receive buffer must be exactly the right size.
48. Collective Routines
49. Collective Routine: MPI_BCAST
- call MPI_BCAST(buffer, count, datatype, source, comm, ierr)
  - buffer is the name of the array/variable to be broadcast
  - count is the number of elements to be sent
  - datatype is the type of the data
  - source is the rank of the processor from which data will be sent (the root)
  - comm is the communicator (mpi_comm_world)
50. Sample Code Using MPI_BCAST

  include 'mpif.h'
  integer size, rank, ierr
  real para
  Call MPI_INIT(ierr)
  Call MPI_COMM_SIZE(mpi_comm_world, size, ierr)
  Call MPI_COMM_RANK(mpi_comm_world, rank, ierr)
  if (rank.eq.3) para = 23.0
  Call MPI_BCAST(para, 1, MPI_REAL, 3, MPI_COMM_WORLD, ierr)
51. Sample Code (cont.)

  print *, 'my rank is ', rank, ' after broadcast para is ', para
  call MPI_FINALIZE(ierr)
  end
52. Sample Output
  Sdx1 32 mpif90 bcast.f90
  /usr/ccs/bin/ld: (warning) At least one PA2.0 object file (bcast.o) was detected. The linked output may not run on a PA 1.x system.
  Sdx1 33 mpirun -np 4 a.out
  my rank is 3 after broadcast para is 23.0
  my rank is 2 after broadcast para is 23.0
  my rank is 0 after broadcast para is 23.0
  my rank is 1 after broadcast para is 23.0
53. Collective Routine: MPI_REDUCE
- call MPI_REDUCE(sendbuffer, recvbuffer, count, datatype, op, root, comm, ierr)
  - sendbuffer is the buffer/array to be sent
  - recvbuffer is the receiving buffer/array
  - datatype is the type of the data
  - op is the collective operation
  - root is the rank of the destination
  - comm is the communicator
54. Collective Operations
- MPI_MAX -- maximum
- MPI_MIN -- minimum
- MPI_SUM -- sum
- MPI_PROD -- product
- MPI_MAXLOC -- maximum and location
- MPI_MINLOC -- minimum and location
- MPI_LOR -- logical OR
- MPI_LXOR -- logical exclusive OR
55. Sample Code Using MPI_REDUCE

  include 'mpif.h'
  integer size, rank, ierr
  integer in(2), out(2)
  Call MPI_INIT(ierr)
  Call MPI_COMM_SIZE(mpi_comm_world, size, ierr)
  Call MPI_COMM_RANK(mpi_comm_world, rank, ierr)
  in(1) = rank + 1
  in(2) = rank
56. Sample Code (cont.)

  Call MPI_REDUCE(in, out, 1, MPI_2INTEGER, MPI_MAXLOC, 7, MPI_COMM_WORLD, ierr)
  if (rank.eq.7) print *, 'my rank is', rank, 'max =', out(1), 'at rank', out(2)
  Call MPI_REDUCE(in, out, 1, MPI_2INTEGER, MPI_MINLOC, 2, MPI_COMM_WORLD, ierr)
  if (rank.eq.2) print *, 'my rank is', rank, 'min =', out(1), 'at rank', out(2)
  call MPI_FINALIZE(ierr)
  end
57. Sample Output
  Sdx1 36 mpif90 bcast.f90
  /usr/ccs/bin/ld: (warning) At least one PA2.0 object file (bcast.o) was detected. The linked output may not run on a PA 1.x system.
  Sdx1 37 mpirun -np 8 a.out
  my rank is 7 max = 8 at rank 7
  my rank is 2 min = 1 at rank 0
58. Basic Routines in MPI
- Using the following MPI routines, many parallel programs can be written (a combined sketch follows):
  - MPI_INIT
  - MPI_COMM_SIZE
  - MPI_COMM_RANK
  - MPI_SEND
  - MPI_RECV
  - MPI_BCAST
  - MPI_REDUCE
  - MPI_FINALIZE
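To show how these routines fit together, here is a small sketch in the same style as the earlier sample codes (not taken from the slides): every processor computes a partial sum of 1..100 and MPI_REDUCE combines the partial sums on rank 0. It assumes 100 is divisible by the number of processors.

  Program ParSum
  include 'mpif.h'
  integer size, rank, ierr, i
  integer partial, total
  Call MPI_INIT(ierr)
  Call MPI_COMM_SIZE(mpi_comm_world, size, ierr)
  Call MPI_COMM_RANK(mpi_comm_world, rank, ierr)
  ! each rank sums its share of 1..100
  partial = 0
  do i = rank*(100/size) + 1, (rank+1)*(100/size)
     partial = partial + i
  enddo
  Call MPI_REDUCE(partial, total, 1, MPI_INTEGER, MPI_SUM, 0, mpi_comm_world, ierr)
  if (rank.eq.0) print *, 'sum of 1..100 =', total
  Call MPI_FINALIZE(ierr)
  end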
59. Resources
- Online resources:
  - http://www-unix.mcs.anl.gov/mpi
  - http://www.erc.msstate.edu/mpi
  - http://www.epm.ornl.gov/~walker/mpi
  - http://www.epcc.ed.ac.uk/mpi
  - http://www.mcs.anl.gov/mpi/mpi-report-1.1/mpi-report.html
  - ftp://www.mcs.anl.gov/pub/mpi/mpi-report.html
60. OpenMP and You
61. Outline
- Introduction to Parallel Computing, by Danny Thorne
- Introduction to MPI, by Chunfang Chen
- Introduction to OpenMP, by Adam Zornes
  - What is OpenMP
  - A Brief History
  - Nuts and Bolts
  - Example(s?)
62. What is OpenMP
- OpenMP is a portable, multiprocessing API for shared memory computers.
- OpenMP is not a language.
- Instead, OpenMP specifies a set of compiler directives and library routines for an existing language (Fortran, C) for parallel programming on a shared memory machine.
63. Why is OpenMP Popular?
- No message passing.
- OpenMP directives or library calls may be incorporated incrementally.
- The code is in effect still a serial code.
- Code size increase is generally smaller.
- OpenMP-enabled codes tend to be more readable
- Vendor involvement
64. History of OpenMP
- Shared memory computers with proprietary, directive-driven programming environments emerged in the mid-80s.
- In 1996 a group formed to create an industry standard.
- They called themselves...
65. History of OpenMP
- The ARB (OpenMP Architecture Review Board)
- A group of corporations, research groups, and universities
- Original members were ASCI, DEC, HP, IBM, Intel, KAI, SGI
- Has permanent and auxiliary members
- Meets by phone and email to interpret the standard, answer questions, develop new specifications, and create publicity
66. What Did They Create?
- OpenMP consists of three main parts:
  - Compiler directives, used by the programmer to communicate with the compiler
  - A runtime library, which enables the setting and querying of parallel parameters
  - Environment variables, which can be used to define a limited number of runtime system parallel parameters
67. The Basic Idea
- The code starts with one master thread.
- When a parallel task needs to be performed, additional threads are spawned.
- When the parallel tasks are finished, the additional threads are released (a minimal sketch of this fork-join pattern follows).
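A minimal Fortran sketch of this fork-join behavior (illustrative, not from the slides; it assumes the compiler is invoked with its OpenMP flag):

      program fork_join
      use omp_lib                      ! OpenMP runtime library interface
      integer :: id
      print *, 'master thread only, before the parallel region'
!$omp parallel private(id)
      id = omp_get_thread_num()        ! each spawned thread gets its own id
      print *, 'hello from thread', id, 'of', omp_get_num_threads()
!$omp end parallel
      print *, 'master thread only, after the threads are released'
      end program fork_join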
68. The Basic Idea
69. The Illustrious OpenMP Directives
- Control Structures: what is parallel and what is serial
- Work Sharing: who does what
- Synchronization: bring everything back together
- Data Scope Attributes (clauses): who can use what, and when, and where
- Orphaning: alone but not necessarily lost
70. Regions or Loops: Which is Right for You?
- Two ways to parallelize: parallel loops (fine-grained) and parallel regions (coarse-grained).
- Loops can be Fortran do loops, C/C++ for loops, etc.
- Parallel regions cut down on overhead, but require more complex programming (i.e. what happens to a thread not in use?).
71. Work Sharing Constructs
- A work sharing construct divides the execution of the enclosed code region among the participating threads.
- The DO directive
- The SECTIONS directive
- The SINGLE directive

!$omp parallel do
      do i = 1, n
         a(i) = b(i) + c(i)
      enddo

!$omp parallel
!$omp sections
!$omp section
      call init_field(field)
!$omp section
      call check_grid(grid)
!$omp end sections
!$omp single
      call do_some_work(a(1))
!$omp end single
!$omp end parallel
72. Synchronization: Getting It Together
- Synchronization directives provide for thread synchronization and mutual exclusion (a short sketch follows the list):
  - The MASTER directive
  - The BARRIER directive
  - The CRITICAL directive
  - The ORDERED directive
  - The ATOMIC directive
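As a small illustration of one of these directives (a sketch, not from the slides): a CRITICAL section serializes updates to a shared counter inside a parallel region, so the final count equals the number of threads.

      program critical_demo
      use omp_lib
      integer :: hits
      hits = 0
!$omp parallel shared(hits)
!$omp critical
      ! only one thread at a time executes this update
      hits = hits + 1
!$omp end critical
!$omp end parallel
      print *, 'number of threads that passed through:', hits
      end program critical_demo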
73. Data Scoping Directives
- Clauses qualify and scope the variables in a block of code:
  - PRIVATE
  - SHARED
  - DEFAULT (PRIVATE | SHARED | NONE)
  - FIRSTPRIVATE
  - LASTPRIVATE
  - COPYIN
  - REDUCTION
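A short sketch of how some of these clauses combine (illustrative, not from the slides): PRIVATE gives each thread its own loop index, SHARED lets all threads read the same array, and REDUCTION(+:total) gives each thread a private partial sum that is combined when the loop ends.

      program scoping_demo
      use omp_lib
      integer :: i, n
      real :: total, x(1000)
      n = 1000
      x = 1.0
      total = 0.0
!$omp parallel do private(i) shared(x) reduction(+:total)
      do i = 1, n
         total = total + x(i)
      enddo
      print *, 'total =', total
      end program scoping_demo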
74. Orphaning
- Directives that do not appear in the lexical extent of a parallel construct, but lie in its dynamic extent, are called orphaned directives.
- That is, directives in routines called from within parallel constructs.
75. Runtime Library Routines
- OMP_SET_NUM_THREADS(int)
- OMP_GET_NUM_THREADS( )
- OMP_GET_MAX_THREADS( )
- OMP_GET_THREAD_NUM( )
- OMP_GET_NUM_PROCS( )
- OMP_IN_PARALLEL( )
- OMP_SET_DYNAMIC(bool)
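A brief sketch of a few of these routines in use (illustrative, not from the slides):

      program runtime_demo
      use omp_lib
      ! request four threads for subsequent parallel regions
      call omp_set_num_threads(4)
      print *, 'processors available:', omp_get_num_procs()
      print *, 'inside a parallel region?', omp_in_parallel()   ! .false. here
!$omp parallel
!$omp master
      print *, 'threads in this region:', omp_get_num_threads()
      print *, 'inside a parallel region?', omp_in_parallel()   ! .true. here
!$omp end master
!$omp end parallel
      end program runtime_demo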
76. The DREADed LOCKS
- OMP_INIT_LOCK(var)
- OMP_DESTROY_LOCK(var)
- OMP_SET_LOCK(var)
- OMP_UNSET_LOCK(var)
- OMP_TEST_LOCK(var)
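A minimal sketch of the lock routines (illustrative, not from the slides): the lock plays the same role as a CRITICAL section, but as an object that can be initialized, passed around, and destroyed.

      program lock_demo
      use omp_lib
      integer(kind=omp_lock_kind) :: lck
      integer :: count
      count = 0
      call omp_init_lock(lck)
!$omp parallel shared(lck, count)
      call omp_set_lock(lck)       ! acquire; other threads wait here
      count = count + 1
      call omp_unset_lock(lck)     ! release
!$omp end parallel
      call omp_destroy_lock(lck)
      print *, 'count =', count
      end program lock_demo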
77. Environment Variables
- OMP_SCHEDULE
- OMP_NUM_THREADS
- OMP_DYNAMIC
- OMP_NESTED
78. The Example(s?)
79. The Example(s?) (cont.)
80. The Example(s?) (cont.)
81. The Example(s?) (cont.)
82. The Example(s?) (cont.)
83. The Requisite Links Page
- http://www.cs.gsu.edu/~cscyip/csc4310/
- http://www.openmp.org/
- http://webct.ncsa.uiuc.edu:8900/webct/public/show_courses.pl
- http://oscinfo.osc.edu/training/openmp/big/fsld.001.html
- http://www.ccs.uky.edu/~douglas

And the audience wakes up... then stumbles out of the room...