Title: Shared Memory Programming with OpenMP
1. Shared Memory Programming with OpenMP
- Some examples from Quinn, Ch. 17
2. A common parallel computing model is the message-passing model.
- Each process has local memory
- Computation only on data in local memory
- Processes exchange data through communication (MPI)
3. Another parallel computing model: shared memory parallel programming.
- Parallelization by threads
[Figure: threads operating on a shared memory over the run time; the master thread runs in serial mode]
4. Another parallel computing model: shared memory parallel programming.
- Parallelization by threads
[Figure: at a fork, new threads are created and run in parallel mode]
5. Another parallel computing model: shared memory parallel programming.
- Parallelization by threads
[Figure: at a join, the threads synchronize and terminate; the master thread resumes running in serial mode]
6. OpenMP is a standard for shared memory programming.
- Allows incremental parallelization:
- Profile the serial code
- Mark for parallelization those loops that take the most time
- Still have to think to make sure the marked loops can be executed in parallel
- Performance of shared memory codes will likely scale poorly to large numbers of processors.
7. Jacobi code: serial

double u_new[N+2], u_old[N+2];
u_old[0] = 0.0; u_old[N+1] = 0.0;
u_new[0] = 0.0; u_new[N+1] = 0.0;
h = 1.0/N; h2 = h*h;
for (iteration = 0; iteration < max_iteration; iteration++)
    for (i = 1; i <= N; i++)
        u_new[i] = 0.5*(h2 + u_old[i+1] + u_old[i-1]);
. . .
8. Jacobi code: OpenMP

#include <omp.h>
double u_new[N+2], u_old[N+2];
u_old[0] = 0.0; u_old[N+1] = 0.0;
u_new[0] = 0.0; u_new[N+1] = 0.0;
h = 1.0/N; h2 = h*h;
for (iteration = 0; iteration < max_iteration; iteration++) {
    #pragma omp parallel for
    for (i = 1; i <= N; i++)
        u_new[i] = 0.5*(h2 + u_old[i+1] + u_old[i-1]);
    . . .
}
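For reference, a minimal compilable version of this fragment might look as follows. The values of N and max_iteration, the copy of u_new back into u_old between sweeps, and the main scaffolding are assumptions filled in for illustration:

#include <stdio.h>
#include <omp.h>

#define N 1000                /* number of interior points (assumed) */
#define MAX_ITERATION 10000   /* number of sweeps (assumed) */

int main(void)
{
    static double u_new[N+2], u_old[N+2];
    double h = 1.0 / N, h2 = h * h;
    int i, iteration;

    u_old[0] = 0.0; u_old[N+1] = 0.0;   /* boundary conditions */
    u_new[0] = 0.0; u_new[N+1] = 0.0;

    for (iteration = 0; iteration < MAX_ITERATION; iteration++) {
        /* each u_new[i] depends only on u_old, so the iterations
           of the i loop are independent and safe to parallelize */
        #pragma omp parallel for
        for (i = 1; i <= N; i++)
            u_new[i] = 0.5 * (h2 + u_old[i+1] + u_old[i-1]);

        /* copy the new solution into u_old for the next sweep */
        #pragma omp parallel for
        for (i = 1; i <= N; i++)
            u_old[i] = u_new[i];
    }
    printf("u[N/2] = %f\n", u_new[N/2]);
    return 0;
}

Compile with OpenMP enabled, e.g. gcc -fopenmp jacobi.c.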
9. OpenMP parallel for
- To allow the compiler to parallelize the loop, the control clause must have canonical shape:
  for (i = start; i < end; i++)
- The comparison may be any of: <, <=, >, >=
- The increment may be any of: i++, ++i, i--, --i, i += inc, i -= inc, i = i + inc, i = i - inc
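For instance (a hypothetical example, not from the slides), the loop below is in canonical shape because the bounds and step are loop-invariant, so the trip count can be computed before the loop runs:

#include <omp.h>

#define M 100

void scale_even_entries(double a[M], double c)
{
    int i;
    /* canonical: i = start; i < end; i += inc */
    #pragma omp parallel for
    for (i = 0; i < M; i += 2)
        a[i] *= c;
}

/* NOT canonical (cannot be parallelized this way):
   for (i = 0; a[i] > 0.0; i = next[i]) ...
   neither the test nor the increment fits the required shape. */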
10. OpenMP parallel for
- To allow the compiler to parallelize the loop, the loop body can't contain statements that allow the loop to exit prematurely (a common restructuring is sketched after this list):
- No break
- No return
- No exit
- No goto statements to labels outside the loop
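When a serial loop uses break (for example, an early exit on the first match in a search), one workaround is to let every iteration run and record the result instead. A minimal sketch, with a hypothetical find_key function not from the slides:

#include <omp.h>

#define N 1000

/* The serial version would break on the first match; here every
   iteration runs and a match is recorded instead. If several
   elements match, any one of their indices may be returned. */
int find_key(const int a[N], int key)
{
    int i, found = -1;
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        if (a[i] == key) {
            #pragma omp critical   /* one writer at a time; the critical
                                      directive is covered later */
            found = i;
        }
    return found;
}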
11. OpenMP: how many threads to use?
- int omp_get_num_procs(void)
  - Returns the number of physical processors available for use by the parallel program.
- void omp_set_num_threads(int t)
  - Sets the number of threads to be used in parallel sections.
- Can also be controlled by the environment variable OMP_NUM_THREADS
12. OpenMP: how many threads to use?
- int omp_get_num_procs(void)
  - Returns the number of physical processors available for use by the parallel program.
- void omp_set_num_threads(int t)
  - Sets the number of threads to be used in parallel sections.
- Can also be controlled by the environment variable OMP_NUM_THREADS

t = omp_get_num_procs();
omp_set_num_threads(t);
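A small self-contained check of what this gives you (the printout is illustrative; omp_get_thread_num() and omp_get_num_threads() are standard OpenMP calls):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int t = omp_get_num_procs();   /* processors available */
    omp_set_num_threads(t);        /* request one thread per processor */

    #pragma omp parallel
    {
        /* every thread in the team executes this block once */
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}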
13. OpenMP: shared and private variables
- A shared variable has the same address in every thread (there's only one version)
- All threads can access shared variables
- A private variable has a different address in each thread (there's a version for each thread)
- A thread cannot access a private variable of another thread
- Default for the parallel for pragma: all variables are shared except for the loop index, which is private.
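The default rule in a small sketch (a hypothetical example, not from the slides):

#include <omp.h>

#define N 100

void saxpy(double y[N], const double x[N], double a)
{
    int i;
    /* a, x, and y are shared: all threads see the same copies.
       i is the loop index, so OpenMP makes it private: each thread
       iterates over its own chunk with its own copy of i. */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        y[i] += a * x[i];
}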
14. Jacobi code: OpenMP

#include <omp.h>
double u_new[N+2], u_old[N+2];
u_old[0] = 0.0; u_old[N+1] = 0.0;
u_new[0] = 0.0; u_new[N+1] = 0.0;
h = 1.0/N; h2 = h*h;
for (iteration = 0; iteration < max_iteration; iteration++) {
    #pragma omp parallel for
    for (i = 1; i <= N; i++)
        u_new[i] = 0.5*(h2 + u_old[i+1] + u_old[i-1]);
    . . .
}
15. Floyd's Algorithm

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
16. Floyd's Algorithm

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

Which loop to parallelize? Which loops have dependencies? The i and j loops have no dependencies.
17. Floyd's Algorithm

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

Which loop to parallelize? Which loops have dependencies? The i and j loops have no dependencies.

D[k][j] = min(D[k][j], D[k][k] + D[k][j])
D[i][k] = min(D[i][k], D[i][k] + D[k][k])
18. Floyd's Algorithm

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

Which loop to parallelize? Which loops have dependencies? The i and j loops have no dependencies.

D[k][j] and D[i][k] do not change in the kth iteration:

D[k][j] = min(D[k][j], D[k][k] + D[k][j])
D[i][k] = min(D[i][k], D[i][k] + D[k][k])

Since D[k][k] = 0 (no negative cycles), both updates leave their values unchanged.
19. Floyd's Algorithm: OpenMP v.1

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        #pragma omp parallel for
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
20. Floyd's Algorithm: OpenMP v.1

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        #pragma omp parallel for
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

Pay the fork/join overhead n² times.
21. Floyd's Algorithm: OpenMP v.2 (INCORRECT)

for (k = 0; k < n; k++)
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
22. Floyd's Algorithm: OpenMP v.2 (INCORRECT)

for (k = 0; k < n; k++)
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

By default, only i will be a private variable. Everything else, including j, will be shared. Each thread will be initializing and incrementing the same j, so correct results are unlikely.
23. Floyd's Algorithm: OpenMP v.2 (CORRECT)

for (k = 0; k < n; k++)
    #pragma omp parallel for private(j)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

- A clause is an optional, additional component to a pragma.
- The private(<variable list>) clause directs the compiler to make the listed variables private.
24. Floyd's Algorithm: OpenMP v.2 (CORRECT)

for (k = 0; k < n; k++)
    #pragma omp parallel for private(j)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

Pay the fork/join overhead n times.

- A clause is an optional, additional component to a pragma.
- The private(<variable list>) clause directs the compiler to make the listed variables private.
25. OpenMP private variables
- By default, private variables are undefined at loop entry and loop exit.
- The clause firstprivate(x) directs the compiler to make x a private variable whose initial value in each thread is the value of x in the master thread before the loop.
- The clause lastprivate(x) directs the compiler to make x a private variable whose value in the master thread after the loop is whatever value x has in the thread that executed the iteration that would come last sequentially.
- Both clauses are illustrated in the sketch below.
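A minimal sketch of both clauses (a hypothetical example, not from the slides):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i, offset = 10, x = 0;

    /* firstprivate(offset): each thread's private offset starts at 10,
       the master's value before the loop.
       lastprivate(x): after the loop, the master's x holds the value
       assigned in the sequentially last iteration (i = 99). */
    #pragma omp parallel for firstprivate(offset) lastprivate(x)
    for (i = 0; i < 100; i++)
        x = offset + i;

    printf("x = %d\n", x);   /* prints 109 = offset + 99 */
    return 0;
}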
26. Pi code: serial

h = 1.0 / (double) n;
area = 0.0;
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    area += 4.0 / (1.0 + x*x);
}
pi = h * area;
27. Pi code: OpenMP (INCORRECT)

h = 1.0 / (double) n;
area = 0.0;
#pragma omp parallel for private(x)
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    area += 4.0 / (1.0 + x*x);   /* race: all threads update the shared area */
}
pi = h * area;
28. Race condition

Thread A                     Thread B                     Value of area
reads area (11.667)                                       11.667
                             reads area (11.667)          11.667
adds 3.765, writes 15.432                                 15.432
                             adds 3.563, writes 15.230    15.230

Thread A's update is lost: the final value is 15.230 instead of 18.995.
29. Race condition

Thread A                     Thread B                     Value of area
reads area (11.667)                                       11.667
                             reads area (11.667)          11.667
adds 3.765, writes 15.432                                 15.432
                             adds 3.563, writes 15.230    15.230

- The += operation is not an atomic (indivisible) operation.
- The race condition results in code whose numerical results are nondeterministic.
- One solution is to force the operation to be executed by one thread at a time.
30. Pi code: OpenMP (CORRECT, but inefficient)

h = 1.0 / (double) n;
area = 0.0;
#pragma omp parallel for private(x)
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    #pragma omp critical
    area += 4.0 / (1.0 + x*x);
}
pi = h * area;

The critical section is executed by one thread at a time. This limits the attainable speedup via Amdahl's law.
31. Pi code: OpenMP (CORRECT)

h = 1.0 / (double) n;
area = 0.0;
#pragma omp parallel for private(x) reduction(+:area)
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    area += 4.0 / (1.0 + x*x);
}
pi = h * area;

- Note the reduction clause on the parallel for pragma
- The compiler handles setting up private variables for the partial sums
- Functionally like MPI_Reduce
- Syntax: reduction(<op>:<variable>)
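Assembled into a complete program (a minimal sketch; the value of n and the main scaffolding are assumptions):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i, n = 1000000;            /* number of intervals (assumed) */
    double x, h, pi, area = 0.0;

    h = 1.0 / (double) n;
    /* reduction(+:area): each thread accumulates its own private
       partial sum; the partial sums are added at the join */
    #pragma omp parallel for private(x) reduction(+:area)
    for (i = 1; i <= n; i++) {
        x = h * ((double)i - 0.5);   /* midpoint of the ith interval */
        area += 4.0 / (1.0 + x * x);
    }
    pi = h * area;
    printf("pi is approximately %.10f\n", pi);
    return 0;
}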
32. The fork/join cost may be greater than the parallel gain from splitting the work.

h = 1.0 / (double) n;
a = 0.0;
#pragma omp parallel for private(x) reduction(+:a) if(n > 500)
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    a += 4.0 / (1.0 + x*x);
}
pi = h * a;

- Note the if() clause on the parallel for pragma
- Syntax: if (<scalar expression>)
- If the scalar expression evaluates true, the loop is parallelized
- Otherwise it is executed sequentially on the master thread
- Pay the fork/join overhead only when the loop contains enough work to cover this cost
33. The fork/join cost may be reduced by reordering loops.

for (i = 1; i < m; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];
34. The fork/join cost may be reduced by reordering loops.

for (i = 1; i < m; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];

for (i = 1; i < m; i++)
    #pragma omp parallel for
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];
35. The fork/join cost may be reduced by reordering loops.

for (i = 1; i < m; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];

for (i = 1; i < m; i++)
    #pragma omp parallel for
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];

#pragma omp parallel for private(i)
for (j = 0; j < n; j++)
    for (i = 1; i < m; i++)
        a[i][j] = 2 * a[i-1][j];

The middle version pays the fork/join overhead m - 1 times; after reordering, only once. (Note that i, now an inner loop index, must be made private.)
36. OpenMP and functional parallelism

v = velocity_solve();
p = pressure_solve();
e = energy(v, p);
g = grids();
Plot(e, g);
37. OpenMP and functional parallelism

v = velocity_solve();
p = pressure_solve();
e = energy(v, p);
g = grids();
Plot(e, g);

[Figure: task dependence graph — V, P, and G are independent; E depends on V and P; Plot depends on E and G]
38. OpenMP and functional parallelism

#pragma omp parallel sections
{
    #pragma omp section
    v = velocity_solve();
    #pragma omp section
    p = pressure_solve();
    #pragma omp section
    g = grids();
}
e = energy(v, p);
Plot(e, g);
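As a runnable sketch of the same structure (the solver bodies are stand-ins; only the call structure comes from the slides):

#include <stdio.h>
#include <omp.h>

/* stand-in implementations for illustration only */
static double velocity_solve(void) { return 1.0; }
static double pressure_solve(void) { return 2.0; }
static double grids(void)          { return 3.0; }
static double energy(double v, double p) { return v + p; }
static void   Plot(double e, double g)   { printf("e = %f, g = %f\n", e, g); }

int main(void)
{
    double v, p, g, e;

    /* the three independent tasks each run in their own section,
       potentially on different threads; the implicit join at the end
       of the sections construct guarantees v, p, and g are ready */
    #pragma omp parallel sections
    {
        #pragma omp section
        v = velocity_solve();
        #pragma omp section
        p = pressure_solve();
        #pragma omp section
        g = grids();
    }
    /* energy needs v and p, so it must run after the join */
    e = energy(v, p);
    Plot(e, g);
    return 0;
}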