Title: Shared Memory Programming with OpenMP
1. Shared Memory Programming with OpenMP
- Some examples from Quinn, Ch. 17
2. A common parallel computing model is the message-passing model.
- Each process has local memory
- Computation only on data in local memory
- Processes exchange data through communication (MPI)
3. Another parallel computing model: shared memory parallel programming.
- Parallelization by threads
[Figure: threads operating on a shared memory over the run time; the master thread runs in serial mode]
4. Another parallel computing model: shared memory parallel programming.
- Parallelization by threads
[Figure: at a fork, new threads are created and run in parallel mode]
5. Another parallel computing model: shared memory parallel programming.
- Parallelization by threads
[Figure: at a join, the threads synchronize and terminate; the master thread resumes running in serial mode]
6. OpenMP is a standard for shared memory programming.
- Allows incremental parallelization:
- Profile the serial code
- Mark for parallelization those loops that take the most time
- Still have to think to make sure the marked loops can be executed in parallel
- Performance of shared memory codes will likely scale poorly to large numbers of processors.
7. Jacobi code: serial

double u_new[N+2], u_old[N+2];
u_old[0] = 0.0; u_old[N+1] = 0.0;
u_new[0] = 0.0; u_new[N+1] = 0.0;
h = 1.0/N; h2 = h*h;
for (iteration = 0; iteration < max_iteration; iteration++)
    for (i = 1; i <= N; i++)
        u_new[i] = 0.5*(h2 + u_old[i+1] + u_old[i-1]);
. . .
8. Jacobi code: OpenMP

#include <omp.h>
double u_new[N+2], u_old[N+2];
u_old[0] = 0.0; u_old[N+1] = 0.0;
u_new[0] = 0.0; u_new[N+1] = 0.0;
h = 1.0/N; h2 = h*h;
for (iteration = 0; iteration < max_iteration; iteration++) {
    #pragma omp parallel for
    for (i = 1; i <= N; i++)
        u_new[i] = 0.5*(h2 + u_old[i+1] + u_old[i-1]);
    . . .
}
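For reference, a minimal compilable version of this fragment might look as follows. The values of N and max_iteration, the copy of u_new back into u_old between sweeps, and the main scaffolding are assumptions filled in for illustration:

#include <stdio.h>
#include <omp.h>

#define N 1000                /* number of interior points (assumed) */
#define MAX_ITERATION 10000   /* number of sweeps (assumed) */

int main(void)
{
    static double u_new[N+2], u_old[N+2];
    double h = 1.0 / N, h2 = h * h;
    int i, iteration;

    u_old[0] = 0.0; u_old[N+1] = 0.0;   /* boundary conditions */
    u_new[0] = 0.0; u_new[N+1] = 0.0;

    for (iteration = 0; iteration < MAX_ITERATION; iteration++) {
        /* each u_new[i] depends only on u_old, so the iterations
           of the i loop are independent and safe to parallelize */
        #pragma omp parallel for
        for (i = 1; i <= N; i++)
            u_new[i] = 0.5 * (h2 + u_old[i+1] + u_old[i-1]);

        /* copy the new solution into u_old for the next sweep */
        #pragma omp parallel for
        for (i = 1; i <= N; i++)
            u_old[i] = u_new[i];
    }
    printf("u[N/2] = %f\n", u_new[N/2]);
    return 0;
}

Compile with OpenMP enabled, e.g. gcc -fopenmp jacobi.c.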
9. OpenMP parallel for
- To allow the compiler to parallelize the loop, the control clause must have canonical shape:
  for (i = start; i < end; i++)
- The comparison may be any of: <, <=, >, >=
- The increment may be any of: i++, ++i, i--, --i, i += inc, i -= inc, i = i + inc, i = i - inc
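For instance (a hypothetical example, not from the slides), the loop below is in canonical shape because the bounds and step are loop-invariant, so the trip count can be computed before the loop runs:

#include <omp.h>

#define M 100

void scale_even_entries(double a[M], double c)
{
    int i;
    /* canonical: i = start; i < end; i += inc */
    #pragma omp parallel for
    for (i = 0; i < M; i += 2)
        a[i] *= c;
}

/* NOT canonical (cannot be parallelized this way):
   for (i = 0; a[i] > 0.0; i = next[i]) ...
   neither the test nor the increment fits the required shape. */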
10. OpenMP parallel for
- To allow the compiler to parallelize the loop, the loop body can't contain statements that allow the loop to exit prematurely (a common restructuring is sketched after this list):
- No break
- No return
- No exit
- No goto statements to labels outside the loop
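When a serial loop uses break (for example, an early exit on the first match in a search), one workaround is to let every iteration run and record the result instead. A minimal sketch, with a hypothetical find_key function not from the slides:

#include <omp.h>

#define N 1000

/* The serial version would break on the first match; here every
   iteration runs and a match is recorded instead. If several
   elements match, any one of their indices may be returned. */
int find_key(const int a[N], int key)
{
    int i, found = -1;
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        if (a[i] == key) {
            #pragma omp critical   /* one writer at a time; the critical
                                      directive is covered later */
            found = i;
        }
    return found;
}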
11. OpenMP: how many threads to use?
- int omp_get_num_procs(void)
  - Returns the number of physical processors available for use by the parallel program.
- void omp_set_num_threads(int t)
  - Sets the number of threads to be used in parallel sections.
- Can also be controlled by the environment variable OMP_NUM_THREADS
12. OpenMP: how many threads to use?
- int omp_get_num_procs(void)
  - Returns the number of physical processors available for use by the parallel program.
- void omp_set_num_threads(int t)
  - Sets the number of threads to be used in parallel sections.
- Can also be controlled by the environment variable OMP_NUM_THREADS

t = omp_get_num_procs();
omp_set_num_threads(t);
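A small self-contained check of what this gives you (the printout is illustrative; omp_get_thread_num() and omp_get_num_threads() are standard OpenMP calls):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int t = omp_get_num_procs();   /* processors available */
    omp_set_num_threads(t);        /* request one thread per processor */

    #pragma omp parallel
    {
        /* every thread in the team executes this block once */
        printf("thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}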
13. OpenMP: shared and private variables
- A shared variable has the same address in every thread (there's only one version)
- All threads can access shared variables
- A private variable has a different address in each thread (there's a version for each thread)
- A thread cannot access a private variable of another thread
- Default for the parallel for pragma: all variables are shared except for the loop index, which is private.
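The default rule in a small sketch (a hypothetical example, not from the slides):

#include <omp.h>

#define N 100

void saxpy(double y[N], const double x[N], double a)
{
    int i;
    /* a, x, and y are shared: all threads see the same copies.
       i is the loop index, so OpenMP makes it private: each thread
       iterates over its own chunk with its own copy of i. */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        y[i] += a * x[i];
}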
14. Jacobi code: OpenMP

#include <omp.h>
double u_new[N+2], u_old[N+2];
u_old[0] = 0.0; u_old[N+1] = 0.0;
u_new[0] = 0.0; u_new[N+1] = 0.0;
h = 1.0/N; h2 = h*h;
for (iteration = 0; iteration < max_iteration; iteration++) {
    #pragma omp parallel for
    for (i = 1; i <= N; i++)
        u_new[i] = 0.5*(h2 + u_old[i+1] + u_old[i-1]);
    . . .
}
15. Floyd's Algorithm

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
16. Floyd's Algorithm

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

Which loop to parallelize? Which loops have dependencies? The i and j loops have no dependencies.
17. Floyd's Algorithm

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

Which loop to parallelize? Which loops have dependencies? The i and j loops have no dependencies.

D[k][j] = min(D[k][j], D[k][k] + D[k][j])
D[i][k] = min(D[i][k], D[i][k] + D[k][k])
18. Floyd's Algorithm

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

Which loop to parallelize? Which loops have dependencies? The i and j loops have no dependencies.

D[k][j] and D[i][k] do not change in the kth iteration:

D[k][j] = min(D[k][j], D[k][k] + D[k][j])
D[i][k] = min(D[i][k], D[i][k] + D[k][k])

Since D[k][k] = 0 (no negative cycles), both updates leave their values unchanged.
19. Floyd's Algorithm: OpenMP v.1

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        #pragma omp parallel for
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
20. Floyd's Algorithm: OpenMP v.1

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        #pragma omp parallel for
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

Pay the fork/join overhead n² times.
21. Floyd's Algorithm: OpenMP v.2 (INCORRECT)

for (k = 0; k < n; k++)
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);
22. Floyd's Algorithm: OpenMP v.2 (INCORRECT)

for (k = 0; k < n; k++)
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

By default, only i will be a private variable. Everything else, including j, will be shared. Each thread will be initializing and incrementing the same j, so correct results are unlikely.
23. Floyd's Algorithm: OpenMP v.2 (CORRECT)

for (k = 0; k < n; k++)
    #pragma omp parallel for private(j)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

- A clause is an optional, additional component to a pragma.
- The private(<variable list>) clause directs the compiler to make the listed variables private.
24. Floyd's Algorithm: OpenMP v.2 (CORRECT)

for (k = 0; k < n; k++)
    #pragma omp parallel for private(j)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            D[i][j] = min(D[i][j], D[i][k] + D[k][j]);

Pay the fork/join overhead n times.

- A clause is an optional, additional component to a pragma.
- The private(<variable list>) clause directs the compiler to make the listed variables private.
25. OpenMP private variables
- By default, private variables are undefined at loop entry and loop exit.
- The clause firstprivate(x) directs the compiler to make x a private variable whose initial value in each thread is the value of x in the master thread before the loop.
- The clause lastprivate(x) directs the compiler to make x a private variable whose value in the master thread after the loop is whatever value x has in the thread that executed the iteration that would come last sequentially.
- Both clauses are illustrated in the sketch below.
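A minimal sketch of both clauses (a hypothetical example, not from the slides):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i, offset = 10, x = 0;

    /* firstprivate(offset): each thread's private offset starts at 10,
       the master's value before the loop.
       lastprivate(x): after the loop, the master's x holds the value
       assigned in the sequentially last iteration (i = 99). */
    #pragma omp parallel for firstprivate(offset) lastprivate(x)
    for (i = 0; i < 100; i++)
        x = offset + i;

    printf("x = %d\n", x);   /* prints 109 = offset + 99 */
    return 0;
}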
26. Pi code: serial

h = 1.0 / (double) n;
area = 0.0;
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    area += 4.0 / (1.0 + x*x);
}
pi = h * area;
27. Pi code: OpenMP (INCORRECT)

h = 1.0 / (double) n;
area = 0.0;
#pragma omp parallel for private(x)
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    area += 4.0 / (1.0 + x*x);   /* race: all threads update the shared area */
}
pi = h * area;
28. Race condition

Thread A                     Thread B                     Value of area
reads area (11.667)                                       11.667
                             reads area (11.667)          11.667
adds 3.765, writes 15.432                                 15.432
                             adds 3.563, writes 15.230    15.230

Thread A's update is lost: the final value is 15.230 instead of 18.995.
29. Race condition

Thread A                     Thread B                     Value of area
reads area (11.667)                                       11.667
                             reads area (11.667)          11.667
adds 3.765, writes 15.432                                 15.432
                             adds 3.563, writes 15.230    15.230

- The += operation is not an atomic (indivisible) operation.
- The race condition results in code whose numerical results are nondeterministic.
- One solution is to force the operation to be executed by one thread at a time.
30. Pi code: OpenMP (CORRECT, but inefficient)

h = 1.0 / (double) n;
area = 0.0;
#pragma omp parallel for private(x)
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    #pragma omp critical
    area += 4.0 / (1.0 + x*x);
}
pi = h * area;

The critical section is executed by one thread at a time. This limits the attainable speedup via Amdahl's law.
31. Pi code: OpenMP (CORRECT)

h = 1.0 / (double) n;
area = 0.0;
#pragma omp parallel for private(x) reduction(+:area)
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    area += 4.0 / (1.0 + x*x);
}
pi = h * area;

- Note the reduction clause on the parallel for pragma
- The compiler handles setting up private variables for the partial sums
- Functionally like MPI_Reduce
- Syntax: reduction(<op>:<variable>)
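Assembled into a complete program (a minimal sketch; the value of n and the main scaffolding are assumptions):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i, n = 1000000;            /* number of intervals (assumed) */
    double x, h, pi, area = 0.0;

    h = 1.0 / (double) n;
    /* reduction(+:area): each thread accumulates its own private
       partial sum; the partial sums are added at the join */
    #pragma omp parallel for private(x) reduction(+:area)
    for (i = 1; i <= n; i++) {
        x = h * ((double)i - 0.5);   /* midpoint of the ith interval */
        area += 4.0 / (1.0 + x * x);
    }
    pi = h * area;
    printf("pi is approximately %.10f\n", pi);
    return 0;
}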
32. The fork/join cost may be greater than the parallel gain from splitting the work.

h = 1.0 / (double) n;
a = 0.0;
#pragma omp parallel for private(x) reduction(+:a) if(n > 500)
for (i = 1; i <= n; i++) {
    x = h * ((double)i - 0.5);
    a += 4.0 / (1.0 + x*x);
}
pi = h * a;

- Note the if() clause on the parallel for pragma
- Syntax: if (<scalar expression>)
- If the scalar expression evaluates true, the loop is parallelized
- Otherwise it is executed sequentially on the master thread
- Pay the fork/join overhead only when the loop contains enough work to cover this cost
33. The fork/join cost may be reduced by reordering loops.

for (i = 1; i < m; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];
34. The fork/join cost may be reduced by reordering loops.

for (i = 1; i < m; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];

for (i = 1; i < m; i++)
    #pragma omp parallel for
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];
35. The fork/join cost may be reduced by reordering loops.

for (i = 1; i < m; i++)
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];

for (i = 1; i < m; i++)
    #pragma omp parallel for
    for (j = 0; j < n; j++)
        a[i][j] = 2 * a[i-1][j];

#pragma omp parallel for private(i)
for (j = 0; j < n; j++)
    for (i = 1; i < m; i++)
        a[i][j] = 2 * a[i-1][j];

The middle version pays the fork/join overhead m - 1 times; after reordering, only once. (Note that i, now an inner loop index, must be made private.)
36. OpenMP and functional parallelism

v = velocity_solve();
p = pressure_solve();
e = energy(v, p);
g = grids();
Plot(e, g);
37. OpenMP and functional parallelism

v = velocity_solve();
p = pressure_solve();
e = energy(v, p);
g = grids();
Plot(e, g);

[Figure: task dependence graph — V, P, and G are independent; E depends on V and P; Plot depends on E and G]
38. OpenMP and functional parallelism

#pragma omp parallel sections
{
    #pragma omp section
    v = velocity_solve();
    #pragma omp section
    p = pressure_solve();
    #pragma omp section
    g = grids();
}
e = energy(v, p);
Plot(e, g);
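As a runnable sketch of the same structure (the solver bodies are stand-ins; only the call structure comes from the slides):

#include <stdio.h>
#include <omp.h>

/* stand-in implementations for illustration only */
static double velocity_solve(void) { return 1.0; }
static double pressure_solve(void) { return 2.0; }
static double grids(void)          { return 3.0; }
static double energy(double v, double p) { return v + p; }
static void   Plot(double e, double g)   { printf("e = %f, g = %f\n", e, g); }

int main(void)
{
    double v, p, g, e;

    /* the three independent tasks each run in their own section,
       potentially on different threads; the implicit join at the end
       of the sections construct guarantees v, p, and g are ready */
    #pragma omp parallel sections
    {
        #pragma omp section
        v = velocity_solve();
        #pragma omp section
        p = pressure_solve();
        #pragma omp section
        g = grids();
    }
    /* energy needs v and p, so it must run after the join */
    e = energy(v, p);
    Plot(e, g);
    return 0;
}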