Title: Introductions to Parallel Programming Using OpenMP
Slide 1: Introductions to Parallel Programming Using OpenMP
April 7, 2005
- Zhenying Liu, Dr. Barbara Chapman
- High Performance Computing and Tools group
- Computer Science Department
- University of Houston
Slide 2: Content
- Overview of OpenMP
- Acknowledgement
- OpenMP constructs (5 categories)
- OpenMP exercises
- References
Slide 3: Overview of OpenMP
- OpenMP is a set of extensions to Fortran/C/C++
- OpenMP contains compiler directives, library routines and environment variables.
- Available on most single address space machines.
  - shared memory systems, including cc-NUMA
  - Chip MultiThreading: Chip MultiProcessing (Sun UltraSPARC IV), Simultaneous Multithreading (Intel Xeon)
  - not on distributed memory systems, classic MPPs, or PC clusters (yet!)
Slide 4: Shared Memory Architecture
- All processors have access to one global memory
- All processors share the same address space
- The system runs a single copy of the OS
- Processors communicate by reading/writing to the global memory
- Examples: multiprocessor PCs (Intel P4), Sun Fire 15K, NEC SX-7, Fujitsu PrimePower, IBM p690, SGI Origin 3000.
Slide 5: Shared Memory Systems (cont.)
- Programming models: OpenMP, Pthreads
Slide 6: Distributed Memory Systems
- Programming models: MPI, HPF
Slide 7: Clusters of SMPs
- Programming models: MPI, hybrid MPI + OpenMP
Slide 8: OpenMP Usage
- Applications
  - Applications with intense computational needs
  - From video games to big science and engineering
- Programmer Accessibility
  - From very early programmers in school to scientists to parallel computing experts
- Available to millions of programmers
  - In every major (Fortran and C/C++) compiler
Slide 9: OpenMP Syntax
- Most of the constructs in OpenMP are compiler directives or pragmas.
- For C and C++, the pragmas take the form:
  - #pragma omp construct [clause [clause]...]
- For Fortran, the directives take one of the forms:
  - C$OMP construct [clause [clause]...]
  - !$OMP construct [clause [clause]...]
  - *$OMP construct [clause [clause]...]
- Since the constructs are directives, an OpenMP program can be compiled by compilers that don't support OpenMP.
Slide 10: OpenMP Programming Model
- Fork-Join Parallelism
  - Master thread spawns a team of threads as needed.
  - Parallelism is added incrementally, i.e. the sequential program evolves into a parallel program.
Slide 11: OpenMP: How is OpenMP Typically Used?
- OpenMP is usually used to parallelize loops
  - Find your most time consuming loops.
  - Split them up between threads.
Sequential program
    void main()
    {
        double Res[1000];
        for (int i = 0; i < 1000; i++) {
            do_huge_comp(Res[i]);
        }
    }
Parallel program (split up this loop between multiple threads)
    void main()
    {
        double Res[1000];
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++) {
            do_huge_comp(Res[i]);
        }
    }
Slide 12: OpenMP: How do Threads Interact?
- OpenMP is a shared memory model.
  - Threads communicate by sharing variables.
- Unintended sharing of data can lead to race conditions:
  - race condition: when the program's outcome changes as the threads are scheduled differently.
- To control race conditions:
  - Use synchronization to protect data conflicts.
- Synchronization is expensive, so:
  - Change how data is stored to minimize the need for synchronization.
Slide 13: OpenMP vs. POSIX Threads
- POSIX threads is the other widely used shared memory programming API.
- Fairly widely available, usually quite simple to implement on top of OS kernel threads.
- Lower level of abstraction than OpenMP
  - library routines only, no directives
  - more flexible, but harder to implement and maintain
- OpenMP can be implemented on top of POSIX threads
- Not much difference in availability
  - not that many OpenMP C++ implementations
  - no standard Fortran interface for POSIX threads
Slide 15: Acknowledgement
- Slides provided by
  - Tim Mattson and Rudolf Eigenmann, SC 99
  - Mark Bull from EPCC
- OpenMP program examples
  - Lawrence Livermore National Lab
  - NAS FT parallelization from PGI tutorial
  - Dr. Garbey provided us serial codes of Navier-Stokes
Slide 17: OpenMP Constructs
- OpenMP's constructs fall into 5 categories:
  - Parallel Regions
  - Worksharing
  - Data Environment
  - Synchronization
  - Runtime functions/environment variables
- OpenMP is basically the same between Fortran and C/C++
Slide 18: OpenMP: Parallel Regions
- You create threads in OpenMP with the "omp parallel" pragma.
- For example, to create a 4-thread parallel region:
    double A[1000];
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        pooh(ID, A);
    }
- Each thread redundantly executes the code within the structured block.
- Each thread calls pooh(ID,A) for ID = 0 to 3.
Slide 20: OpenMP: Work-Sharing Constructs
- The "for" work-sharing construct splits up loop iterations among the threads in a team.
    #pragma omp parallel
    #pragma omp for
    for (I = 0; I < N; I++) {
        NEAT_STUFF(I);
    }
By default, there is a barrier at the end of the "omp for". Use the "nowait" clause to turn off the barrier.
Slide 21: Work-Sharing Constructs: A motivating example
Sequential code
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
OpenMP parallel region
    #pragma omp parallel
    {
        int id, i, Nthrds, istart, iend;
        id = omp_get_thread_num();
        Nthrds = omp_get_num_threads();
        istart = id * N / Nthrds;
        iend = (id + 1) * N / Nthrds;
        for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
    }
OpenMP parallel region and a work-sharing for construct
    #pragma omp parallel
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
Slide 22: OpenMP For Construct: The Schedule Clause
- The schedule clause affects how loop iterations are mapped onto threads
  - schedule(static [,chunk])
    - Deal out blocks of iterations of size "chunk" to each thread.
  - schedule(dynamic [,chunk])
    - Each thread grabs "chunk" iterations off a queue until all iterations have been handled.
  - schedule(guided [,chunk])
    - Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size "chunk" as the calculation proceeds.
  - schedule(runtime)
    - Schedule and chunk size taken from the OMP_SCHEDULE environment variable.
Slide 23: OpenMP: Work-Sharing Constructs
- The "sections" work-sharing construct gives a different structured block to each thread.
    #pragma omp parallel
    #pragma omp sections
    {
        X_calculation();
        #pragma omp section
        y_calculation();
        #pragma omp section
        z_calculation();
    }
By default, there is a barrier at the end of the "omp sections". Use the "nowait" clause to turn off the barrier.
Slide 24: Data Environment: Changing Storage Attributes
- One can selectively change storage attributes for constructs using the following clauses:
  - SHARED
  - PRIVATE
  - FIRSTPRIVATE
  - THREADPRIVATE
- The value of a private inside a parallel loop can be transmitted to a global value outside the loop with:
  - LASTPRIVATE
- The default status can be modified with:
  - DEFAULT (PRIVATE | SHARED | NONE)
All data clauses apply to parallel regions and worksharing constructs, except "shared", which only applies to parallel regions.
Slide 25: Data Environment: Default Storage Attributes
- Shared memory programming model:
  - Most variables are shared by default
- Global variables are SHARED among threads
  - Fortran: COMMON blocks, SAVE variables, MODULE variables
  - C: File scope variables, static
- But not everything is shared...
  - Stack variables in sub-programs called from parallel regions are PRIVATE
  - Automatic variables within a statement block are PRIVATE.
Slide 26: Private Clause
- private(var) creates a local copy of var for each thread.
  - The value is uninitialized
  - Private copy is not storage-associated with the original
    void wrong()
    {
        int IS = 0;
        #pragma omp parallel for private(IS)
        for (int J = 1; J < 1000; J++)
            IS = IS + J;        /* each private IS starts uninitialized */
        printf("%i", IS);       /* the original IS was never updated by the loop */
    }
Slide 27: OpenMP: Reduction
- Another clause that affects the way variables are shared:
  - reduction (op : list)
- The variables in "list" must be shared in the enclosing parallel region.
- Inside a parallel or a worksharing construct:
  - A local copy of each list variable is made and initialized depending on the "op" (e.g. 0 for "+")
  - pairwise "op" is updated on the local value
  - Local copies are reduced into a single global copy at the end of the construct.
Slide 28: OpenMP: A Reduction Example
    #include <omp.h>
    #define NUM_THREADS 2
    void main ()
    {
        int i;
        double ZZ, func(), sum = 0.0;
        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel for reduction(+:sum) private(ZZ)
        for (i = 0; i < 1000; i++) {
            ZZ = func(i);
            sum = sum + ZZ;
        }
    }
Slide 29: OpenMP: Synchronization
- OpenMP has the following constructs to support synchronization:
- barrier
- critical section
- atomic
- flush
- ordered
- single
- master
Slide 31: Critical and Atomic
- Only one thread at a time can enter a critical section
    C$OMP PARALLEL DO PRIVATE(B)
    C$OMP& SHARED(RES)
          DO 100 I=1,NITERS
             B = DOIT(I)
    C$OMP CRITICAL
             CALL CONSUME (B, RES)
    C$OMP END CRITICAL
    100   CONTINUE
- Atomic is a special case of a critical section that can be used for certain simple statements
    C$OMP PARALLEL PRIVATE(B)
          B = DOIT(I)
    C$OMP ATOMIC
          X = X + B
    C$OMP END PARALLEL
Slide 32: Master directive
- The master construct denotes a structured block that is only executed by the master thread. The other threads just skip it (no implied barriers or flushes).
    #pragma omp parallel private (tmp)
    {
        do_many_things();
        #pragma omp master
        { exchange_boundaries(); }
        #pragma omp barrier
        do_many_other_things();
    }
Slide 33: Single directive
- The single construct denotes a block of code that is executed by only one thread.
- A barrier and a flush are implied at the end of the single block.
    #pragma omp parallel private (tmp)
    {
        do_many_things();
        #pragma omp single
        { exchange_boundaries(); }
        do_many_other_things();
    }
Slide 34: OpenMP: Library routines
- Lock routines
  - omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock()
- Runtime environment routines:
  - Modify/Check the number of threads
    - omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
  - Turn on/off nesting and dynamic mode
    - omp_set_nested(), omp_set_dynamic(), omp_get_nested(), omp_get_dynamic()
  - Are we in a parallel region?
    - omp_in_parallel()
  - How many processors in the system?
    - omp_get_num_procs()
Slide 35: OpenMP Environment Variables
- OMP_NUM_THREADS
  - bsh:
    - export OMP_NUM_THREADS=2
  - csh:
    - setenv OMP_NUM_THREADS 4
Slide 37: 1. Hello World!
    #include <omp.h>
    #include <stdio.h>
    int main ()
    {
        int nthreads, tid;
        /* Fork a team of threads giving them their own copies of variables */
        #pragma omp parallel private(nthreads, tid)
        {
            /* Obtain thread number */
            tid = omp_get_thread_num();
            printf("Hello World from thread = %d\n", tid);
            /* Only master thread does this */
            if (tid == 0) {
                nthreads = omp_get_num_threads();
                printf("Number of threads = %d\n", nthreads);
            }
        } /* All threads join master thread and disband */
    }
Slide 38: Example Code - Pthread Creation and Termination
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #define NUM_THREADS 5

    void *PrintHello(void *threadid)
    {
        /* the thread number was passed by value inside the pointer */
        printf("\n%ld: Hello World!\n", (long)threadid);
        pthread_exit(NULL);
    }

    int main (int argc, char *argv[])
    {
        pthread_t threads[NUM_THREADS];
        int rc;
        long t;
        for (t = 0; t < NUM_THREADS; t++) {
            printf("Creating thread %ld\n", t);
            rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
            if (rc) {
                printf("ERROR; return code from pthread_create() is %d\n", rc);
                exit(-1);
            }
        }
        pthread_exit(NULL);
    }
Slide 39: 2. Parallel Loop Reduction
          PROGRAM REDUCTION
          INTEGER I, N
          REAL A(100), B(100), SUM
    !     Some initializations
          N = 100
          DO I = 1, N
            A(I) = I * 1.0
            B(I) = A(I)
          ENDDO
          SUM = 0.0
    !$OMP PARALLEL DO REDUCTION(+:SUM)
          DO I = 1, N
            SUM = SUM + (A(I) * B(I))
          ENDDO
          PRINT *, '   Sum = ', SUM
          END
Slide 40: 3. Matrix-vector multiply using a parallel loop and critical directive
    /* Spawn a parallel region explicitly scoping all variables */
    #pragma omp parallel shared(a,b,c,nthreads,chunk) private(tid,i,j,k)
    {
        tid = omp_get_thread_num();   /* needed before tid is printed below */
        #pragma omp for schedule (static, chunk)
        for (i = 0; i < NRA; i++) {
            printf("thread=%d did row=%d\n", tid, i);
            for (j = 0; j < NCB; j++)
                for (k = 0; k < NCA; k++)
                    c[i][j] += a[i][k] * b[k][j];
        }
    }
Slide 41: Steps of Parallelization using OpenMP: An Example from a PGI Tutorial
- Compile the code with the option that enables a profiler
- Run the code and check if the results are correct
- Find the most time-consuming part of the code via the profiler information
- Parallelize the time-consuming part
- Repeat the above steps until you get a reasonable speedup
Slide 42: How to Use a Profiler
- PGI compiler
  - pgf90 -fast -Minfo -Mprof=func fftpde.F -o fftpde (function level)
  - -Mprof=lines (line level)
  - -mp for compiling OpenMP codes
  - pgprof pgprof.out (show the profiler result)
- Pathscale compiler
  - pathf90 -Ofast -pg Fftpde.F -o Fftpde
  - pathprof Fftpde | more
Slide 43: The most time-consuming loop in Fftpde.F
          do k=1,n3
            do j=1,n2
              do i=1,n1
                z(i)=cmplx(x1real(i,j,k),x1imag(i,j,k))
              end do
              call fft(z,inverse,w,n1,m1)
              do i=1,n1
                x1real(i,j,k)=real(z(i))
                x1imag(i,j,k)=aimag(z(i))
              end do
            end do
          end do
The OpenMP version of this loop in Fftpde_1.F
    !$OMP PARALLEL PRIVATE(Z)
    !$OMP DO
          do k=1,n3
            do j=1,n2
              do i=1,n1
                z(i)=cmplx(x1real(i,j,k),x1imag(i,j,k))
              end do
              call fft(z,inverse,w,n1,m1)
              do i=1,n1
                x1real(i,j,k)=real(z(i))
                x1imag(i,j,k)=aimag(z(i))
              end do
            end do
          end do
    !$OMP END PARALLEL
NEXT: compare the 1 and 2 processor profiles after adding OpenMP to this loop
Slide 44: Parallelizing the Remainder of Fftpde.F
- 1) The DO 130 loop near line 64 (fftpde_2.F)
- 2) The DO 190 loop near line 115 (fftpde_3.F)
- 3) The DO 220 loop near line 139 (fftpde_4.F)
- 4) The DO 250 loop near line 155 (fftpde_5.F)
Slide 45: 1. Parallelize the DO 130 loop in Fftpde_2.F
    !$OMP PARALLEL PRIVATE(KK,KL,T1,T2,IK)
    !$OMP DO
          DO 130 K = 1, N3
            KK = K - 1
            KL = KK
            T1 = S
            T2 = AN
    C
    C   Find starting seed T1 for this KK using the binary rule for exponentiation.
    C
            DO 110 I = 1, 100
              IK = KK / 2
              IF (2 * IK .NE. KK) T2 = RANDLC (T1, T2)
              IF (IK .EQ. 0) GOTO 120
              T2 = RANDLC (T2, T2)
              KK = IK
     110    CONTINUE
    C
    C   Compute 2 * NQ pseudorandom numbers.
    C
     120    continue
            CALL VRANLC (N1*N2, T2, aa, x1real(1,1,k))
            CALL VRANLC (N1*N2, T2, aa, x1imag(1,1,k))
     130  CONTINUE
    !$OMP END PARALLEL
Slide 46: 2. Parallelize the DO 190 loop in Fftpde_3.F
    !$OMP PARALLEL PRIVATE(K1,J1,JK,I1)
    !$OMP DO
          DO 190 K = 1, N3
            K1 = K - 1
            IF (K .GT. N32) K1 = K1 - N3
    C
            DO 180 J = 1, N2
              J1 = J - 1
              IF (J .GT. N22) J1 = J1 - N2
              JK = J1 ** 2 + K1 ** 2
    C
              DO 170 I = 1, N1
                I1 = I - 1
                IF (I .GT. N12) I1 = I1 - N1
                X3(I,J,K) = EXP (AP * (I1 ** 2 + JK))
     170      CONTINUE
    C
     180    CONTINUE
     190  CONTINUE
    !$OMP END PARALLEL
Slide 47: 3. Parallelize the DO 220 loop in Fftpde_4.F
    !$OMP PARALLEL PRIVATE(T1)
    !$OMP DO
          DO 220 K = 1, N3
            DO 210 J = 1, N2
              DO 200 I = 1, N1
                T1 = X3(I,J,K) ** KT
                X2real(I,J,K) = T1 * X1real(I,J,K)
                X2imag(I,J,K) = T1 * X1imag(I,J,K)
     200      CONTINUE
     210    CONTINUE
     220  CONTINUE
    !$OMP END PARALLEL
Slide 48: 4. Parallelize the DO 250 loop in Fftpde_5.F
    !$OMP PARALLEL
    !$OMP DO
          DO 250 K = 1, N3
            DO 240 J = 1, N2
              DO 230 I = 1, N1
                X2real(I,J,K) = RN * X2real(I,J,K)
                X2imag(I,J,K) = RN * X2imag(I,J,K)
     230      CONTINUE
     240    CONTINUE
     250  CONTINUE
    !$OMP END PARALLEL
Slide 49: Conclusion
- OpenMP is successful in small-to-medium SMP systems
- Multiple cores/CPUs will dominate future computer architectures; OpenMP would be the major parallel programming language on these architectures.
- Simple: everybody can learn it in 2 weeks
- Not so simple: don't stop learning! Keep learning it for better performance
Slide 50: Some Buggy Codes
    #pragma omp parallel for shared(a,b,c,chunk) private(i,tid) schedule(static,chunk)
    {
        tid = omp_get_thread_num();
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
            printf("tid= %d i= %d c[i]= %f\n", tid, i, c[i]);
        }
    } /* end of parallel for construct */
Slide 52: References
- OpenMP Official Website:
  - www.openmp.org
- OpenMP 2.5 Specifications
- An OpenMP book
  - Rohit Chandra, Parallel Programming in OpenMP. Morgan Kaufmann Publishers.
- Compunity
  - The community of OpenMP researchers and developers in academia and industry
  - http://www.compunity.org/
- Conference papers:
  - WOMPAT, EWOMP, WOMPEI, IWOMP
  - http://www.nic.uoregon.edu/iwomp2005/index.html#program
Slide 53: Exercises
- cp /tmp/omp_examples.tar.gz ~/
- tar xzvf omp_examples.tar.gz
- (marvin) use pathscale; (medusa) use pgi
  - pathf90 / pathcc -mp -Ofast
  - pgf90 / pgcc -mp -fast
- Compile (make) and run the codes in
  - LLNL_C, LLNL_F, and FFT
  - There is a README in each subdirectory
- Set the number of threads before running
  - echo $SHELL
  - export OMP_NUM_THREADS=2 (for bsh)
  - setenv OMP_NUM_THREADS 2 (for csh)
Slide 54: OpenMP Compilers and Platforms
- Fujitsu/Lahey Fortran, C and C++
  - Intel Linux Systems
  - Fujitsu Solaris Systems
- HP HP-UX PA-RISC/Itanium, HP Tru64 Unix
  - Fortran/C/C++
- IBM XL Fortran and C from IBM
  - IBM AIX Systems
- Intel C++ and Fortran Compilers from Intel
  - Intel IA32 Linux/Windows Systems
  - Intel Itanium-based Linux/Windows Systems
- Guide Fortran and C/C++ from Intel's KAI Software Lab
  - Intel Linux/Windows Systems
- PGF77 and PGF90 Compilers from The Portland Group, Inc. (PGI)
  - Intel Linux/Solaris/Windows/NT Systems
Slide 55: Compilers and Platforms
- SGI MIPSpro 7.4 Compilers
  - SGI IRIX Systems
- Sun Microsystems Sun ONE Studio, Compiler Collection, Fortran 95, C, and C++
  - Sun Solaris Platforms
- VAST from Veridian Pacific-Sierra Research
  - IBM AIX Systems
  - Intel IA32 Linux/Windows/NT Systems
  - SGI IRIX Systems
  - Sun Solaris Systems
- PATHSCALE EKOPATH COMPILER SUITE FOR AMD64 and EM64T, Fortran, C, C++
  - 64-bit Linux
- Microsoft Visual Studio 2005 (Visual C++)
  - Windows
Slide 56: Parallelize: Win32 API, PI
    #include <windows.h>
    #define NUM_THREADS 2
    HANDLE thread_handles[NUM_THREADS];
    CRITICAL_SECTION hUpdateMutex;
    static long num_steps = 100000;
    double step;
    double global_sum = 0.0;

    void Pi (void *arg)
    {
        int i, start;
        double x, sum = 0.0;
        start = *(int *) arg;
        step = 1.0/(double) num_steps;
        for (i = start; i <= num_steps; i = i + NUM_THREADS) {
            x = (i-0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        EnterCriticalSection(&hUpdateMutex);
        global_sum += sum;
        LeaveCriticalSection(&hUpdateMutex);
    }

    void main ()
    {
        double pi;
        int i;
        DWORD threadID;
        int threadArg[NUM_THREADS];
        for (i = 0; i < NUM_THREADS; i++) threadArg[i] = i+1;
        InitializeCriticalSection(&hUpdateMutex);
        for (i = 0; i < NUM_THREADS; i++) {
            thread_handles[i] = CreateThread(0, 0,
                (LPTHREAD_START_ROUTINE) Pi, &threadArg[i], 0, &threadID);
        }
        WaitForMultipleObjects(NUM_THREADS, thread_handles, TRUE, INFINITE);
        pi = global_sum * step;
        printf(" pi is %f \n", pi);
    }
Doubles code size!
Slide 57: Solution: Keep it simple
- Threads libraries:
  - Pro: Programmer has control over everything
  - Con: Programmer must control everything
Full control -> Increased complexity -> Programmers scared away
Sometimes a simple evolutionary approach is better
Slide 58: PI Program: an example
    static long num_steps = 100000;
    double step;
    void main ()
    {
        int i;
        double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;
        for (i = 1; i <= num_steps; i++) {
            x = (i-0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
    }
Slide 59: OpenMP PI Program: Parallel Region example (SPMD Program)
    #include <omp.h>
    static long num_steps = 100000;
    double step;
    #define NUM_THREADS 2
    void main ()
    {
        int i;
        double x, pi, sum[NUM_THREADS];
        step = 1.0/(double) num_steps;
        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel
        {
            double x;
            int i, id;   /* loop index declared here so each thread has its own */
            id = omp_get_thread_num();
            for (i = id, sum[id] = 0.0; i < num_steps; i = i + NUM_THREADS) {
                x = (i+0.5)*step;
                sum[id] += 4.0/(1.0+x*x);
            }
        }
        for (i = 0, pi = 0.0; i < NUM_THREADS; i++)
            pi += sum[i] * step;
    }
SPMD Programs: Each thread runs the same code, with the thread ID selecting any thread-specific behavior.
Slide 60: OpenMP PI Program: Work Sharing Construct
    #include <omp.h>
    static long num_steps = 100000;
    double step;
    #define NUM_THREADS 2
    void main ()
    {
        int i;
        double x, pi, sum[NUM_THREADS];
        step = 1.0/(double) num_steps;
        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel
        {
            double x;
            int id;
            id = omp_get_thread_num();
            sum[id] = 0;
            #pragma omp for
            for (i = 0; i < num_steps; i++) {   /* same bounds on every thread; omp for splits the range */
                x = (i+0.5)*step;
                sum[id] += 4.0/(1.0+x*x);
            }
        }
        for (i = 0, pi = 0.0; i < NUM_THREADS; i++)
            pi += sum[i] * step;
    }
Slide 61: OpenMP PI Program: Private Clause and a Critical Section
    #include <omp.h>
    static long num_steps = 100000;
    double step;
    #define NUM_THREADS 2
    void main ()
    {
        int i, id;
        double x, sum, pi = 0.0;
        step = 1.0/(double) num_steps;
        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel private (x, sum, i, id)
        {
            id = omp_get_thread_num();
            for (i = id, sum = 0.0; i < num_steps; i = i + NUM_THREADS) {
                x = (i+0.5)*step;
                sum += 4.0/(1.0+x*x);
            }
            #pragma omp critical
            pi += sum * step;   /* each thread adds its scaled partial sum */
        }
    }
Note: We didn't need to create an array to hold local sums or clutter the code with explicit declarations of x and sum.
Slide 62: OpenMP PI Program: Parallel For with a Reduction
    #include <omp.h>
    static long num_steps = 100000;
    double step;
    #define NUM_THREADS 2
    void main ()
    {
        int i;
        double x, pi, sum = 0.0;
        step = 1.0/(double) num_steps;
        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel for reduction(+:sum) private(x)
        for (i = 1; i <= num_steps; i++) {
            x = (i-0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
    }
OpenMP adds 2 to 4 lines of code