Introductions to Parallel Programming Using OpenMP

Slides: 63

Provided by: zhenyi

1
Introductions to Parallel Programming Using OpenMP
April 7, 2005

Zhenying Liu, Dr. Barbara Chapman
High Performance Computing and Tools group
Computer Science Department
University of Houston

2
Content

Overview of OpenMP
Acknowledgement
OpenMP constructs (5 categories)
OpenMP exercises
References

3
Overview of OpenMP

OpenMP is a set of extensions to Fortran/C/C++.
OpenMP contains compiler directives, library routines and environment variables.
Available on most single address space machines:
  shared memory systems, including cc-NUMA
  chip multithreading: chip multiprocessing (Sun UltraSPARC IV), simultaneous multithreading (Intel Xeon)
  not on distributed memory systems, classic MPPs, or PC clusters (yet!)

4
Shared Memory Architecture

All processors have access to one global memory
All processors share the same address space
The system runs a single copy of the OS
Processors communicate by reading/writing to the
global memory
Examples: multiprocessor PCs (Intel P4), Sun Fire 15K, NEC SX-7, Fujitsu PrimePower, IBM p690, SGI Origin 3000.

5
Shared Memory Systems (cont)
Programming models: OpenMP, Pthreads
6
Distributed Memory Systems
Programming models: MPI, HPF
7
Clusters of SMPs
Programming models: MPI, hybrid MPI + OpenMP
8
OpenMP Usage

Applications:
  Applications with intense computational needs
  From video games to big science and engineering
Programmer accessibility:
  From very early programmers in school to scientists to parallel computing experts
Available to millions of programmers:
  In every major (Fortran and C/C++) compiler

9
OpenMP Syntax

Most of the constructs in OpenMP are compiler directives or pragmas.
For C and C++, the pragmas take the form:
  #pragma omp construct [clause [clause]...]
For Fortran, the directives take one of the forms:
  C$OMP construct [clause [clause]...]
  !$OMP construct [clause [clause]...]
  *$OMP construct [clause [clause]...]
Since the constructs are directives, an OpenMP program can be compiled by compilers that don't support OpenMP.
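
The point about compilers without OpenMP support can be sketched in C. This is a minimal illustration, assuming a compiler that defines the standard `_OPENMP` macro when OpenMP is enabled; the function names `max_threads_available` and `scale` are hypothetical and not from the slides.

```c
#ifdef _OPENMP
#include <omp.h>   /* only included when the compiler enables OpenMP */
#endif

/* Hypothetical helper: upper bound on the threads a parallel region
   could use. Falls back to 1 when compiled without OpenMP. */
int max_threads_available(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* The pragma is ignored by a compiler without OpenMP support, so the
   loop then simply runs sequentially with the same result. */
void scale(double *v, int n, double factor)
{
    int i;
#pragma omp parallel for
    for (i = 0; i < n; i++)
        v[i] *= factor;
}
```

Directives need no guard, but calls to the runtime library do, which is why only the `omp_get_max_threads()` call sits behind `#ifdef _OPENMP`.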

10
OpenMP Programming Model

Fork-Join Parallelism
Master thread spawns a team of threads as needed.
Parallelism is added incrementally, i.e. the sequential program evolves into a parallel program.

11
OpenMP: How is OpenMP Typically Used?

OpenMP is usually used to parallelize loops:
Find your most time-consuming loops.
Split them up between threads.

Split up this loop between multiple threads.

Sequential program:
void main()
{
    double Res[1000];
    for (int i = 0; i < 1000; i++) {
        do_huge_comp(Res[i]);
    }
}

Parallel program:
void main()
{
    double Res[1000];
#pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        do_huge_comp(Res[i]);
    }
}
12
OpenMP: How do Threads Interact?

OpenMP is a shared memory model.
Threads communicate by sharing variables.
Unintended sharing of data can lead to race conditions.
  Race condition: the program's outcome changes as the threads are scheduled differently.
To control race conditions:
  Use synchronization to protect data conflicts.
Synchronization is expensive, so:
  Change how data is stored to minimize the need for synchronization.

13
OpenMP vs. POSIX Threads

POSIX threads is the other widely used shared-memory programming API.
Fairly widely available, usually quite simple to implement on top of OS kernel threads.
Lower level of abstraction than OpenMP:
  library routines only, no directives
  more flexible, but programs are harder to implement and maintain
OpenMP can be implemented on top of POSIX threads.
Not much difference in availability:
  not that many OpenMP C++ implementations
  no standard Fortran interface for POSIX threads

15
Acknowledgement

Slides provided by
Tim Mattson and Rudolf Eigenmann, SC 99
Mark Bull from EPCC
OpenMP program examples
Lawrence Livermore National Lab
NAS FT parallelization from PGI tutorial
Dr. Garbey provided us with serial Navier-Stokes codes

17
OpenMP Constructs

OpenMP's constructs fall into 5 categories:
  Parallel Regions
  Worksharing
  Data Environment
  Synchronization
  Runtime functions/environment variables
OpenMP is basically the same between Fortran and C/C++.

18
OpenMP Parallel Regions

You create threads in OpenMP with the "omp parallel" pragma.
For example, to create a 4-thread parallel region:

double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
    int ID = omp_get_thread_num();
    pooh(ID, A);
}

Each thread redundantly executes the code within the structured block.
Each thread calls pooh(ID,A) for ID = 0 to 3.
20
OpenMP Work-Sharing Constructs

The "for" work-sharing construct splits up loop iterations among the threads in a team.

#pragma omp parallel
#pragma omp for
for (I = 0; I < N; I++) {
    NEAT_STUFF(I);
}

By default, there is a barrier at the end of the "omp for". Use the "nowait" clause to turn off the barrier.
21
Work Sharing Constructs: A motivating example

Sequential code:
for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region:
#pragma omp parallel
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id + 1) * N / Nthrds;
    for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
}

OpenMP parallel region and a work-sharing for construct:
#pragma omp parallel
#pragma omp for schedule(static)
for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
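
The hand-written partitioning in the parallel-region version can be checked in isolation. The sketch below is a plain C illustration of the `istart`/`iend` arithmetic only; the helper names `block_start` and `add_block` are hypothetical, and the thread ID is passed in as an ordinary argument instead of coming from `omp_get_thread_num()`.

```c
/* First iteration owned by thread `id` when n iterations are split
   into nthrds contiguous blocks (istart = id * N / Nthrds). */
int block_start(int id, int nthrds, int n)
{
    return id * n / nthrds;
}

/* a[i] = a[i] + b[i], restricted to the block that belongs to `id`. */
void add_block(double *a, const double *b, int id, int nthrds, int n)
{
    int i;
    int istart = block_start(id, nthrds, n);
    int iend   = block_start(id + 1, nthrds, n);   /* one past the end */
    for (i = istart; i < iend; i++)
        a[i] = a[i] + b[i];
}
```

With N = 10 and 4 threads the blocks are [0,2), [2,5), [5,7) and [7,10): every iteration is covered exactly once even though N is not a multiple of the thread count. The work-sharing `for` construct removes the need to write this bookkeeping by hand.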
22
OpenMP For Construct: The Schedule Clause

The schedule clause affects how loop iterations are mapped onto threads.
schedule(static [,chunk])
  Deal out blocks of iterations of size "chunk" to each thread.
schedule(dynamic [,chunk])
  Each thread grabs "chunk" iterations off a queue until all iterations have been handled.
schedule(guided [,chunk])
  Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size "chunk" as the calculation proceeds.
schedule(runtime)
  Schedule and chunk size taken from the OMP_SCHEDULE environment variable.
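
As a rough illustration of the difference between the static and dynamic kinds, the sketch below models which thread owns an iteration under `schedule(static, chunk)` (chunks dealt out round-robin in thread order) and shows a loop that requests a dynamic schedule. The function names are hypothetical and the chunk size of 4 is an arbitrary choice for the example.

```c
/* Model of schedule(static, chunk): consecutive chunks of `chunk`
   iterations are assigned to threads 0, 1, ..., nthreads-1, 0, ... */
int static_chunk_owner(int iteration, int chunk, int nthreads)
{
    return (iteration / chunk) % nthreads;
}

/* The same kind of loop under a dynamic schedule: which thread runs
   which chunk is decided at run time, but every iteration still runs
   exactly once, so the result does not depend on the schedule. */
void square_all(double *v, int n)
{
    int i;
#pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < n; i++)
        v[i] = v[i] * v[i];
}
```

Static scheduling has the lowest overhead and suits loops whose iterations cost about the same; dynamic and guided scheduling trade some overhead for better load balance when iteration costs vary.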

23
OpenMP Work-Sharing Constructs

The "sections" work-sharing construct gives a different structured block to each thread.

#pragma omp parallel
#pragma omp sections
{
    x_calculation();
#pragma omp section
    y_calculation();
#pragma omp section
    z_calculation();
}

By default, there is a barrier at the end of the "omp sections". Use the "nowait" clause to turn off the barrier.
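
A self-contained variant of the sections example might look like the sketch below. The three `*_calculation` functions are hypothetical stand-ins that return constants, and the combined `parallel sections` form is used for brevity.

```c
/* Hypothetical stand-ins for three independent calculations. */
static int x_calculation(void) { return 1; }
static int y_calculation(void) { return 2; }
static int z_calculation(void) { return 3; }

int run_sections(void)
{
    int x = 0, y = 0, z = 0;   /* shared; each section writes one of them */
#pragma omp parallel sections
    {
#pragma omp section
        x = x_calculation();
#pragma omp section
        y = y_calculation();
#pragma omp section
        z = z_calculation();
    }   /* implied barrier: all three results are available here */
    return x + y + z;
}
```

Because each section writes a different variable, no synchronization beyond the implied barrier is needed before the results are combined.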
24
Data Environment: Changing Storage Attributes

One can selectively change the storage attributes of constructs using the following clauses:
SHARED
PRIVATE
FIRSTPRIVATE
THREADPRIVATE
The value of a private variable inside a parallel loop can be transmitted to a global value outside the loop with:
LASTPRIVATE
The default status can be modified with:
DEFAULT (PRIVATE | SHARED | NONE)

All data clauses apply to parallel regions and worksharing constructs, except "shared", which only applies to parallel regions.
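
The effect of `firstprivate` and `lastprivate` can be seen in a small C sketch. The function name `last_shifted_index` is hypothetical; the example assumes n is at least 1 so that a last iteration exists.

```c
/* firstprivate: each thread's copy of `offset` starts with the value
   it had before the loop.
   lastprivate: after the loop, `last` holds the value assigned in the
   sequentially last iteration (i == n - 1), whichever thread ran it. */
int last_shifted_index(int n, int offset)
{
    int i, last = -1;
#pragma omp parallel for firstprivate(offset) lastprivate(last)
    for (i = 0; i < n; i++)
        last = i + offset;
    return last;
}
```

For example, `last_shifted_index(10, 100)` yields 109 regardless of how many threads run the loop, which is the same value the sequential loop would leave behind.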
25
Data Environment: Default Storage Attributes

Shared memory programming model:
  Most variables are shared by default.
Global variables are SHARED among threads:
  Fortran: COMMON blocks, SAVE variables, MODULE variables
  C: file-scope variables, static variables
But not everything is shared...
  Stack variables in sub-programs called from parallel regions are PRIVATE.
  Automatic variables within a statement block are PRIVATE.

26
Private Clause

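private(var) creates a local copy of var for each thread.
The value is uninitialized.
The private copy is not storage-associated with the original.

void wrong()
{
    int IS = 0;
#pragma omp parallel for private(IS)
    for (int J = 1; J < 1000; J++)
        IS = IS + J;       /* each private IS starts uninitialized */
    printf("%i", IS);      /* does not print the sum computed in the loop */
}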

27
OpenMP Reduction

Another clause that affects the way variables are shared:
reduction (op : list)
The variables in "list" must be shared in the enclosing parallel region.
Inside a parallel or a worksharing construct:
  A local copy of each list variable is made and initialized depending on the "op" (e.g. 0 for "+").
  The local copy is updated pairwise with "op".
  Local copies are reduced into a single global copy at the end of the construct.
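
The three steps listed for a reduction can be written out by hand for comparison. The sketch below shows the clause next to a roughly equivalent manual version; it illustrates the idea and does not describe how any particular compiler implements the clause. The function names are hypothetical.

```c
/* Sum of 0..n-1 using the reduction clause. */
long sum_with_reduction(long n)
{
    long i, sum = 0;
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += i;
    return sum;
}

/* Roughly what the clause arranges, spelled out step by step. */
long sum_by_hand(long n)
{
    long sum = 0;
#pragma omp parallel
    {
        long i, local = 0;     /* local copy, initialized to 0 for "+" */
#pragma omp for
        for (i = 0; i < n; i++)
            local += i;        /* updates touch only the local copy */
#pragma omp critical
        sum += local;          /* local copies combined into the global one */
    }
    return sum;
}
```

Both functions return n(n-1)/2; for n = 1000 that is 499500.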

28
OpenMP: A Reduction Example

#include <omp.h>
#define NUM_THREADS 2
void main ()
{
    int i;
    double ZZ, func(), sum = 0.0;
    omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for reduction(+:sum) private(ZZ)
    for (i = 0; i < 1000; i++) {
        ZZ = func(i);
        sum = sum + ZZ;
    }
}

29
OpenMP Synchronization

OpenMP has the following constructs to support
synchronization
barrier
critical section
atomic
flush
ordered
single
master

31
Critical and Atomic

Only one thread at a time can enter a critical section.

C$OMP PARALLEL DO PRIVATE(B)
C$OMP& SHARED(RES)
      DO 100 I=1,NITERS
        B = DOIT(I)
C$OMP CRITICAL
        CALL CONSUME (B, RES)
C$OMP END CRITICAL
100   CONTINUE

Atomic is a special case of a critical section that can be used for certain simple statements.

C$OMP PARALLEL PRIVATE(B)
      B = DOIT(I)
C$OMP ATOMIC
      X = X + B
C$OMP END PARALLEL
32
Master directive

The master construct denotes a structured block that is only executed by the master thread. The other threads just skip it (no implied barriers or flushes).

#pragma omp parallel private (tmp)
{
    do_many_things();
#pragma omp master
    { exchange_boundaries(); }
#pragma omp barrier
    do_many_other_things();
}
33
Single directive

The single construct denotes a block of code that is executed by only one thread.
A barrier and a flush are implied at the end of the single block.

#pragma omp parallel private (tmp)
{
    do_many_things();
#pragma omp single
    { exchange_boundaries(); }
    do_many_other_things();
}
34
OpenMP Library routines

Lock routines:
  omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock()
Runtime environment routines:
  Modify/check the number of threads:
    omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
  Turn on/off nesting and dynamic mode:
    omp_set_nested(), omp_set_dynamic(), omp_get_nested(), omp_get_dynamic()
  Are we in a parallel region?
    omp_in_parallel()
  How many processors in the system?
    omp_get_num_procs()

35
OpenMP Environment Variables

OMP_NUM_THREADS
sh/bash:
  export OMP_NUM_THREADS=2
csh:
  setenv OMP_NUM_THREADS 4

37
1. Hello World!
#include <omp.h>
#include <stdio.h>
int main ()
{
    int nthreads, tid;
    /* Fork a team of threads giving them their own copies of variables */
#pragma omp parallel private(nthreads, tid)
    {
        /* Obtain thread number */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);
        /* Only master thread does this */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master thread and disband */
    return 0;
}
38
Example Code - Pthread Creation and Termination
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS 5

void *PrintHello(void *threadid)
{
    /* the thread index was passed by value inside the pointer */
    printf("\n%ld: Hello World!\n", (long) threadid);
    pthread_exit(NULL);
}

int main (int argc, char *argv[])
{
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;
    for (t = 0; t < NUM_THREADS; t++) {
        printf("Creating thread %ld\n", t);
        rc = pthread_create(&threads[t], NULL, PrintHello, (void *) t);
        if (rc) {
            printf("ERROR: return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    pthread_exit(NULL);
}
39
2. Parallel Loop Reduction
      PROGRAM REDUCTION
      INTEGER I, N
      REAL A(100), B(100), SUM
!     Some initializations
      N = 100
      DO I = 1, N
        A(I) = I * 1.0
        B(I) = A(I)
      ENDDO
      SUM = 0.0
!$OMP PARALLEL DO REDUCTION(+:SUM)
      DO I = 1, N
        SUM = SUM + (A(I) * B(I))
      ENDDO
      PRINT *, '   Sum = ', SUM
      END
40
3. Matrix-vector multiply using a parallel loop
and critical directive
/* Spawn a parallel region explicitly scoping all variables */
#pragma omp parallel shared(a,b,c,nthreads,chunk) private(tid,i,j,k)
{
    tid = omp_get_thread_num();   /* needed before tid is printed below */
#pragma omp for schedule (static, chunk)
    for (i = 0; i < NRA; i++) {
        printf("thread=%d did row=%d\n", tid, i);
        for (j = 0; j < NCB; j++)
            for (k = 0; k < NCA; k++)
                c[i][j] += a[i][k] * b[k][j];
    }
}
41
Steps of Parallelization using OpenMP An Example
from a PGI Tutorial

Compile a code with the option to enable a
profiler
Run the code and check if the results are correct
Find out the most time-consuming part of the code
via the profiler information
Parallelize the time-consuming part
Repeat above steps until you get reasonable
speedup

42
How to Use a Profiler

PGI compiler:
  pgf90 -fast -Minfo -Mprof=func fftpde.F -o fftpde   (function level)
  -Mprof=lines   (line level)
  -mp   (for compiling OpenMP codes)
  pgprof pgprof.out   (show the profiler result)
Pathscale compiler:
  pathf90 -Ofast -pg Fftpde.F -o Fftpde
  pathprof Fftpde | more

43
The most time-consuming loop in Fftpde.F:

      do k=1,n3
        do j=1,n2
          do i=1,n1
            z(i)=cmplx(x1real(i,j,k),x1imag(i,j,k))
          end do
          call fft(z,inverse,w,n1,m1)
          do i=1,n1
            x1real(i,j,k)=real(z(i))
            x1imag(i,j,k)=aimag(z(i))
          end do
        end do
      end do

The OpenMP version of this loop in Fftpde_1.F:

!$OMP PARALLEL PRIVATE(Z)
!$OMP DO
      do k=1,n3
        do j=1,n2
          do i=1,n1
            z(i)=cmplx(x1real(i,j,k),x1imag(i,j,k))
          end do
          call fft(z,inverse,w,n1,m1)
          do i=1,n1
            x1real(i,j,k)=real(z(i))
            x1imag(i,j,k)=aimag(z(i))
          end do
        end do
      end do
!$OMP END PARALLEL

NEXT: compare the 1 and 2 processor profiles after adding OpenMP to this loop.
44
Parallelizing the Remainder of Fftpde.F

1) The DO 130 loop near line 64 (fftpde_2.F)
2) The DO 190 loop near line 115 (fftpde_3.F)
3) The DO 220 loop near line 139 (fftpde_4.F)
4) The DO 250 loop near line 155 (fftpde_5.F)

45
1. Parallelize the DO 130 loop in Fftpde_2.F
!$OMP PARALLEL PRIVATE(KK,KL,T1,T2,IK)
!$OMP DO
      DO 130 K = 1, N3
        KK = K - 1
        KL = KK
        T1 = S
        T2 = AN
C
C   Find starting seed T1 for this KK using the binary rule for
C   exponentiation.
C
        DO 110 I = 1, 100
          IK = KK / 2
          IF (2 * IK .NE. KK) T2 = RANDLC (T1, T2)
          IF (IK .EQ. 0) GOTO 120
          T2 = RANDLC (T2, T2)
          KK = IK
 110    CONTINUE
C
C   Compute 2 * NQ pseudorandom numbers.
C
 120    continue
        CALL VRANLC (N1*N2, T2, aa, x1real(1,1,k))
        CALL VRANLC (N1*N2, T2, aa, x1imag(1,1,k))
 130  CONTINUE
!$OMP END PARALLEL
46
2. Parallelize the DO 190 loop in Fftpde_3.F
!$OMP PARALLEL PRIVATE(K1,J1,JK,I1)
!$OMP DO
      DO 190 K = 1, N3
        K1 = K - 1
        IF (K .GT. N32) K1 = K1 - N3
C
        DO 180 J = 1, N2
          J1 = J - 1
          IF (J .GT. N22) J1 = J1 - N2
          JK = J1 ** 2 + K1 ** 2
C
          DO 170 I = 1, N1
            I1 = I - 1
            IF (I .GT. N12) I1 = I1 - N1
            X3(I,J,K) = EXP (AP * (I1 ** 2 + JK))
 170      CONTINUE
C
 180    CONTINUE
 190  CONTINUE
!$OMP END PARALLEL
47
3. Parallelize the DO 220 loop in Fftpde_4.F
!$OMP PARALLEL PRIVATE(T1)
!$OMP DO
      DO 220 K = 1, N3
        DO 210 J = 1, N2
          DO 200 I = 1, N1
            T1 = X3(I,J,K) ** KT
            X2real(I,J,K) = T1 * X1real(I,J,K)
            X2imag(I,J,K) = T1 * X1imag(I,J,K)
 200      CONTINUE
 210    CONTINUE
 220  CONTINUE
!$OMP END PARALLEL
48
4. Parallelize the DO 250 loop in Fftpde_5.F
!$OMP PARALLEL
!$OMP DO
      DO 250 K = 1, N3
        DO 240 J = 1, N2
          DO 230 I = 1, N1
            X2real(I,J,K) = RN * X2real(I,J,K)
            X2imag(I,J,K) = RN * X2imag(I,J,K)
 230      CONTINUE
 240    CONTINUE
 250  CONTINUE
!$OMP END PARALLEL
49
Conclusion

OpenMP is successful on small-to-medium SMP systems.
Multiple cores/CPUs will dominate future computer architectures; OpenMP would be the major parallel programming language on these architectures.
Simple: everybody can learn it in 2 weeks.
Not so simple: don't stop learning! Keep learning it for better performance.

50
Some Buggy Codes
#pragma omp parallel for shared(a,b,c,chunk) private(i,tid) schedule(static,chunk)
{
    tid = omp_get_thread_num();
    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
        printf("tid= %d i= %d c[i]= %f\n", tid, i, c[i]);
    }
} /* end of parallel for construct */
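
One way to repair this example: the combined `parallel for` directive must be followed immediately by the `for` loop, but here a block that first assigns `tid` comes between them, so the code is rejected by the compiler. The sketch below splits the directive into a `parallel` region and a separate `for` construct. The wrapper function `vector_add` and its parameters are hypothetical, and the `_OPENMP` guard is added only so the sketch also builds without OpenMP.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* c[i] = a[i] + b[i]; chunk must be a positive chunk size. */
void vector_add(const double *a, const double *b, double *c, int n, int chunk)
{
    int i, tid = 0;
#pragma omp parallel shared(a, b, c, chunk) private(i, tid)
    {
        /* Per-thread setup is legal here, inside the parallel region. */
#ifdef _OPENMP
        tid = omp_get_thread_num();
#else
        tid = 0;
#endif
        /* The for construct is now immediately followed by its loop. */
#pragma omp for schedule(static, chunk)
        for (i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
            printf("tid= %d i= %d c[i]= %f\n", tid, i, c[i]);
        }
    }   /* end of parallel region */
}
```

The order of the printed lines may change from run to run, since the threads write their output independently.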
52
References

OpenMP official website:
  www.openmp.org
  OpenMP 2.5 Specifications
An OpenMP book:
  Rohit Chandra, Parallel Programming in OpenMP. Morgan Kaufmann Publishers.
Compunity:
  The community of OpenMP researchers and developers in academia and industry
  http://www.compunity.org/
Conference papers:
  WOMPAT, EWOMP, WOMPEI, IWOMP
  http://www.nic.uoregon.edu/iwomp2005/index.html#program

53
Exercises

cp /tmp/omp_examples.tar.gz /
tar xzvf omp_examples.tar.gz
(marvin) use pathscale; (medusa) use pgi
pathf90 / pathcc -mp -Ofast
pgf90 / pgcc -mp -fast
Compile (make) and run the codes in LLNL_C, LLNL_F, and FFT.
There is a README in each subdirectory.
Set the number of threads before running:
  echo $SHELL
  export OMP_NUM_THREADS=2   (for sh/bash)
  setenv OMP_NUM_THREADS 2   (for csh)

54
OpenMP Compilers and Platforms

Fujitsu/Lahey Fortran, C and C++
  Intel Linux Systems
  Fujitsu Solaris Systems
HP: HP-UX PA-RISC/Itanium, HP Tru64 Unix
  Fortran/C/C++
IBM XL Fortran and C from IBM
  IBM AIX Systems
Intel C++ and Fortran Compilers from Intel
  Intel IA32 Linux/Windows Systems
  Intel Itanium-based Linux/Windows Systems
Guide Fortran and C/C++ from Intel's KAI Software Lab
  Intel Linux/Windows Systems
PGF77 and PGF90 Compilers from The Portland Group, Inc. (PGI)
  Intel Linux/Solaris/Windows/NT Systems

55
Compilers and Platforms

SGI MIPSpro 7.4 Compilers
  SGI IRIX Systems
Sun Microsystems Sun ONE Studio, Compiler Collection, Fortran 95, C, and C++
  Sun Solaris Platforms
VAST from Veridian Pacific-Sierra Research
  IBM AIX Systems
  Intel IA32 Linux/Windows/NT Systems
  SGI IRIX Systems
  Sun Solaris Systems
PathScale EKOPath Compiler Suite for AMD64 and EM64T: Fortran, C, C++
  64-bit Linux
Microsoft Visual Studio 2005 (Visual C++)
  Windows

56
Parallelize: Win32 API, PI

#include <windows.h>
#include <stdio.h>
#define NUM_THREADS 2
HANDLE thread_handles[NUM_THREADS];
CRITICAL_SECTION hUpdateMutex;
static long num_steps = 100000;
double step;
double global_sum = 0.0;

void Pi (void *arg)
{
    int i, start;
    double x, sum = 0.0;
    start = *(int *) arg;
    step = 1.0/(double) num_steps;
    for (i = start; i <= num_steps; i = i + NUM_THREADS) {
        x = (i - 0.5) * step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    EnterCriticalSection(&hUpdateMutex);
    global_sum += sum;
    LeaveCriticalSection(&hUpdateMutex);
}

void main ()
{
    double pi; int i;
    DWORD threadID;
    int threadArg[NUM_THREADS];
    for (i = 0; i < NUM_THREADS; i++) threadArg[i] = i + 1;
    InitializeCriticalSection(&hUpdateMutex);
    for (i = 0; i < NUM_THREADS; i++) {
        thread_handles[i] = CreateThread(0, 0,
            (LPTHREAD_START_ROUTINE) Pi, &threadArg[i], 0, &threadID);
    }
    WaitForMultipleObjects(NUM_THREADS, thread_handles, TRUE, INFINITE);
    pi = global_sum * step;
    printf(" pi is %f \n", pi);
}

Doubles code size!
57
Solution: Keep it simple

Threads libraries:
  Pro: Programmer has control over everything
  Con: Programmer must control everything

Full control, increased complexity, programmers scared away.
Sometimes a simple evolutionary approach is better.
58
PI Program: an example

static long num_steps = 100000;
double step;
void main ()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i = 1; i <= num_steps; i++) {
        x = (i - 0.5) * step;
        sum = sum + 4.0/(1.0 + x*x);
    }
    pi = step * sum;
}
59
OpenMP PI Program: Parallel Region example (SPMD Program)

#include <omp.h>
static long num_steps = 100000; double step;
#define NUM_THREADS 2
void main ()
{
    int i; double pi, sum[NUM_THREADS];
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
    {
        double x; int i, id;   /* i is declared here so each thread has its own counter */
        id = omp_get_thread_num();
        for (i = id, sum[id] = 0.0; i < num_steps; i = i + NUM_THREADS) {
            x = (i + 0.5) * step;
            sum[id] += 4.0/(1.0 + x*x);
        }
    }
    for (i = 0, pi = 0.0; i < NUM_THREADS; i++) pi += sum[i] * step;
}

SPMD programs: each thread runs the same code, with the thread ID selecting any thread-specific behavior.
60
OpenMP PI Program: Work Sharing Construct

#include <omp.h>
static long num_steps = 100000; double step;
#define NUM_THREADS 2
void main ()
{
    int i; double pi, sum[NUM_THREADS];
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
    {
        double x; int id;
        id = omp_get_thread_num();
        sum[id] = 0;
#pragma omp for
        for (i = 0; i < num_steps; i++) {   /* start at 0: the for construct divides the iterations */
            x = (i + 0.5) * step;
            sum[id] += 4.0/(1.0 + x*x);
        }
    }
    for (i = 0, pi = 0.0; i < NUM_THREADS; i++) pi += sum[i] * step;
}
61
OpenMP PI Program: Private Clause and a Critical Section

#include <omp.h>
static long num_steps = 100000; double step;
#define NUM_THREADS 2
void main ()
{
    int i, id; double x, sum, pi = 0.0;
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);
#pragma omp parallel private (i, id, x, sum)   /* i and id must be private as well */
    {
        id = omp_get_thread_num();
        for (i = id, sum = 0.0; i < num_steps; i = i + NUM_THREADS) {
            x = (i + 0.5) * step;
            sum += 4.0/(1.0 + x*x);
        }
#pragma omp critical
        pi += sum * step;   /* scale each partial sum by the step width */
    }
}

Note: We didn't need to create an array to hold local sums or clutter the code with explicit declarations of x and sum.
62
OpenMP PI Program: Parallel For with a Reduction
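
The code for this slide is not part of the transcript. A minimal sketch of how the PI program might look with a parallel for and a reduction is given below, wrapped in a hypothetical function `compute_pi` so that the number of steps is a parameter; it is an illustration under those assumptions and not the original slide content.

```c
/* Midpoint-rule estimate of pi as the integral of 4/(1+x*x) on [0,1]. */
double compute_pi(long num_steps)
{
    long i;
    double x, sum = 0.0;
    double step = 1.0 / (double) num_steps;
    /* x is private to each thread; the per-thread partial sums are
       combined into sum by the reduction at the end of the loop. */
#pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}
```

Compared with the sequential PI program, the only change is the single directive, which fits the incremental style of parallelization the earlier slides describe.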