Title: OpenMP in Practice
1OpenMP in Practice
- Gina Goff
- Rice University
2Outline
- Introduction
- Parallelism, Synchronization, and Environments
- Restructuring/Designing Programs in OpenMP
- Example Programs
3Outline
- Introduction
- Parallelism, Synchronization, and Environments
- Restructuring/Designing Programs in OpenMP
- Example Programs
4OpenMP
- A portable fork-join parallel model for shared-memory architectures
- Portable
- Based on Parallel Computing Forum (PCF)
- Fortran 77 binding here today; C coming this year
5OpenMP (2)
- Fork-join model
- Execution starts with one thread of control
- Parallel regions fork off new threads on entry
- Threads join back together at the end of the region
- Shared memory
- (Some) Memory can be accessed by all threads
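A minimal sketch of the fork-join pattern described above (the subroutine names are illustrative, not from the slides):

      CALL SETUP()             ! executed by the initial thread
!$OMP PARALLEL
      CALL WORK()              ! executed by every thread in the team
!$OMP END PARALLEL
      CALL WRAPUP()            ! threads have joined; one thread again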
6Shared Memory
- Computation(s) using several processors
- Each processor has some private memory
- Each processor has access to a memory shared with the other processors
- Synchronization
- Used to protect integrity of parallel program
- Prevents unsafe memory accesses
- Fine-grained synchronization (point to point)
- Barriers used for global synchronization
7Shared Memory in Pictures
8OpenMP
- Two basic flavors of parallelism
- Coarse-grained
- Program is broken into segments (threads) that can be executed in parallel
- Use barriers to re-synchronize execution at the end
- Fine-grained
- Execute iterations of DO loop(s) in parallel
9OpenMP in Pictures
10Design of OpenMP
- A flexible standard, easily implemented across different platforms
- Control structures
- Minimal for simplicity and encouraging common cases
- PARALLEL, DO, SECTIONS, SINGLE, MASTER
- Data environment
- New data access capabilities for forked threads
- SHARED, PRIVATE, REDUCTION
11Design of OpenMP (2)
- Synchronization
- Simple implicit synch at beginning and end of control structures
- Explicit synch for more complex patterns: BARRIER, CRITICAL, ATOMIC, FLUSH, ORDERED
- Runtime library
- Manages modes for forking and scheduling threads
- E.g., OMP_GET_THREAD_NUM
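A minimal sketch of calling the runtime library from inside a parallel region (the PRINT statement is illustrative):

      INTEGER OMP_GET_THREAD_NUM, OMP_GET_NUM_THREADS
!$OMP PARALLEL
      PRINT *, 'thread ', OMP_GET_THREAD_NUM(),
     &         ' of ', OMP_GET_NUM_THREADS()
!$OMP END PARALLEL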
12Who's In OpenMP?
- Software Vendors
- Absoft Corp.
- Edinburgh Portable Compilers
- Kuck & Associates, Inc.
- Myrias Computer Technologies
- Numerical Algorithms Group
- The Portland Group, Inc.
- Hardware Vendors
- Digital Equipment Corp.
- Hewlett-Packard
- IBM
- Intel
- Silicon Graphics/Cray Research
- Solution Vendors
- ADINA R&D, Inc.
- ANSYS, Inc.
- CPLEX division of ILOG
- Fluent, Inc.
- LSTC Corp.
- MECALOG SARL
- Oxford Molecular Group PLC
- Research Organizations
- US Department of Energy ASCI Program
- Université Louis Pasteur, Strasbourg
13Outline
- Introduction
- Parallelism, Synchronization, and Environments
- Restructuring/Designing Programs in OpenMP
- Example Programs
14Control Structures
- PARALLEL / END PARALLEL
- The actual fork and join
- Number of threads won't change inside parallel region
- Single Program Multiple Data (SPMD) execution within region
- SINGLE / END SINGLE
- (Short) sequential section
- MASTER / END MASTER
- SINGLE on master processor

!$OMP PARALLEL
      CALL S1
!$OMP SINGLE
      CALL S2
!$OMP END SINGLE
      CALL S3
!$OMP END PARALLEL
15Control Structures (2)
- DO / END DO
- The classic parallel loop
- Inside parallel region
- Or convenient combined directive PARALLEL DO
- Iteration space is divided among available threads
- Loop index is private to thread by default (see the sketch below)
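A minimal sketch of the combined directive (array A and bound N are illustrative; the loop index I need not be listed because it is private by default):

!$OMP PARALLEL DO SHARED(A, N)
      DO I = 1, N
         A(I) = 2.0 * A(I)
      END DO
!$OMP END PARALLEL DO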
16Control Structures (3)
- SECTIONS / END SECTIONS
- Task parallelism, potentially MIMD
- SECTION marks tasks
- Inside parallel region
- Nested parallelism
- Requires creating new parallel region
- Not supported on all OpenMP implementations
- If not allowed, inner PARALLEL is a no-op
!$OMP PARALLEL SECTIONS
!$OMP SECTION
!$OMP PARALLEL DO
      DO J = 1, 2
         CALL FOO(J)
      END DO
!$OMP END PARALLEL DO
!$OMP SECTION
      CALL BAR(2)
!$OMP SECTION
!$OMP PARALLEL DO
      DO K = 1, 3
         CALL BAR(K)
      END DO
!$OMP END PARALLEL DO
!$OMP END PARALLEL SECTIONS
17DO Scheduling
- Static Scheduling (default)
- Divides loop into equal size iteration chunks
- Based on runtime loop limits
- Totally parallel scheduling algorithm
- Dynamic Scheduling
- Threads go to scheduler to get next chunk
- GUIDED: chunk sizes taper down toward the end of the loop
18DO Scheduling (2)
(Figure: iterations 1-36 of the loop below, shown as they are handed out to the threads under DYNAMIC and GUIDED scheduling.)
!$OMP PARALLEL DO
!$OMP& SCHEDULE(DYNAMIC,1)
      DO J = 1, 36
         CALL SUBR(J)
      END DO
!$OMP END PARALLEL DO

!$OMP PARALLEL DO
!$OMP& SCHEDULE(GUIDED,1)
      DO J = 1, 36
         CALL SUBR(J)
      END DO
!$OMP END PARALLEL DO
19Orphaned Directives
      PROGRAM main
!$OMP PARALLEL
      CALL foo()
      CALL bar()
      CALL error()
!$OMP END PARALLEL
      END

      SUBROUTINE error()
! Not allowed due to
! nested control structs
!$OMP SECTIONS
!$OMP SECTION
      CALL foo()
!$OMP SECTION
      CALL bar()
!$OMP END SECTIONS
      END

      SUBROUTINE foo()
!$OMP DO
      DO i = 1, n
         ...
      END DO
!$OMP END DO
      END

      SUBROUTINE bar()
!$OMP SECTIONS
!$OMP SECTION
      CALL section1()
!$OMP SECTION
      ...
!$OMP SECTION
      ...
!$OMP END SECTIONS
      END
20OpenMP Synchronization
- Implicit barriers wait for all threads
- DO, END DO
- SECTIONS, END SECTIONS
- SINGLE, END SINGLE
- MASTER, END MASTER
- NOWAIT at END can override the synch (see the sketch below)
- Global barriers ⇒ all threads must hit them in the same order
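A sketch of NOWAIT removing the implied barrier between two independent work-shared loops (arrays A and B are illustrative):

!$OMP PARALLEL
!$OMP DO
      DO I = 1, N
         A(I) = A(I) + 1.0
      END DO
!$OMP END DO NOWAIT
!$OMP DO
      DO J = 1, M
         B(J) = B(J) * 2.0
      END DO
!$OMP END DO
!$OMP END PARALLEL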
21OpenMP Synchronization (2)
- Explicit directives provide finer control
- BARRIER: must be hit by all threads in the team
- CRITICAL (name), END CRITICAL
- Only one thread may enter at a time
- ATOMIC: single-statement critical section, for reductions (see the sketch below)
- FLUSH (list): synchronization point at which the implementation is required to provide a consistent view of memory
- ORDERED: for pipelining loop iterations
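A minimal sketch of ATOMIC protecting a single shared update (the histogram arrays HIST and KEY are illustrative):

!$OMP PARALLEL DO SHARED(HIST, KEY, N)
      DO I = 1, N
!$OMP ATOMIC
         HIST(KEY(I)) = HIST(KEY(I)) + 1
      END DO
!$OMP END PARALLEL DO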
22OpenMP Data Environments
- Data can be PRIVATE or SHARED
- Private data is for local variables
- Shared data is global
- Data can be private to a thread: all processors in the thread can access the data, but other threads can't see it
23OpenMP Data Environments
      COMMON /mine/ z
      INTEGER x(3), y(3), k
!$OMP THREADPRIVATE(/mine/)
!$OMP PARALLEL DO DEFAULT(PRIVATE), SHARED(x)
!$OMP& REDUCTION(+:z)
      DO k = 1, 3
         x(k) = k
         y(k) = k*k
         z = z + x(k)*y(k)
      END DO
!$OMP END PARALLEL DO
(Figure: shared memory holds x = (1, 2, 3) and the final result z = 36; threads 0, 1, and 2 each hold private copies of y and of the partial sum z', which are combined into the shared z.)
24Brief Example
25OpenMP Environment & Runtime Library
- For controlling execution
- Needed for tuning, but may limit portability
- Control through environment variables or runtime library calls
- Runtime library takes precedence in a conflict
26OpenMP Environment & Runtime (2)
- OMP_NUM_THREADS: How many threads to use in a parallel region?
- OMP_GET_NUM_THREADS, OMP_SET_NUM_THREADS
- Related: OMP_GET_THREAD_NUM, OMP_GET_MAX_THREADS, OMP_GET_NUM_PROCS
- OMP_DYNAMIC: Should the runtime system choose the number of threads?
- OMP_GET_DYNAMIC, OMP_SET_DYNAMIC
27OpenMP Environment & Runtime (3)
- OMP_NESTED: Should nested parallel regions be supported?
- OMP_GET_NESTED, OMP_SET_NESTED
- OMP_SCHEDULE: Choose the DO scheduling option
- Used by the RUNTIME clause (see the sketch below)
- OMP_IN_PARALLEL: Is the program in a parallel region?
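A sketch of deferring the scheduling decision to the environment via the RUNTIME clause; the schedule is then taken from OMP_SCHEDULE (e.g., set to "DYNAMIC,4"). The loop body is illustrative:

!$OMP PARALLEL DO SCHEDULE(RUNTIME)
      DO I = 1, N
         CALL SUBR(I)
      END DO
!$OMP END PARALLEL DO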
28Outline
- Introduction
- Parallelism, Synchronization, and Environments
- Restructuring/Designing Programs in OpenMP
- Example Programs
29Analyzing for Parallelism
- Profiling
- Walk the loop nests
- Multiple parallel loops
30Program Profile
- Is dataset large enough?
- At the top of the list, should find
- parallel regions
- routines called within them
- What is cumulative percent?
- Watch for system libraries near top
- e.g., spin_wait_join_barrier
31Walking the Key Loop Nest
- Usually the outermost parallel loop
- Ignore timestep and convergence loops
- Ignore loops with few iterations
- Ignore loops that call unimportant subroutines
- Don't be put off by
- Loops that write shared data
- Loops that step through linked lists
- Loops with I/O
32Multiple Parallel Loops
- Nested parallel loops are good
- Pick easiest or most parallel code
- Think about locality
- Use IF clause to select best based on dataset (see the sketch below)
- Plan on doing one across clusters
- Non-nested parallel loops
- Consider loop fusion (impacts locality)
- Execute code between them in a parallel region
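A sketch of the IF clause: the loop runs in parallel only when the dataset is large enough (the threshold and arrays are illustrative):

!$OMP PARALLEL DO IF (N .GT. 1000) SHARED(A, B, N)
      DO I = 1, N
         A(I) = A(I) + B(I)
      END DO
!$OMP END PARALLEL DO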
33Example Loop Nest
      subroutine fem3d()
10    call addmon()
      if (numelh.ne.0) call solide

      subroutine solide
      do 20 i = 1, nelt
        do 20 j = 1, nelg
          call unpki
          call strain
          call force
20    continue
      if (...) return
      goto 10

      subroutine force()
      do 10 i = lft, llt
        sgv(i) = sig1(i) - qp(i)*vol(i)
10    continue
      do 50 n = 1, nnc
        i0 = ia(n)
        i1 = ia(n+1) - 1
        do 50 i = i0, i1
          e(1,ix(i)) = e(1,ix(i)) + ep11(i)
50    continue
34Restructuring Applications
- Two level strategy for parallel processing
- Determining shared and local variables
- Adding synchronization
35Two Levels of Parallel Processing
- Two-level approach isolates major concerns and makes code easier to update
- Algorithm/Architecture Level
- Unique to your software
- Provides majority of SMP performance
36Two Levels of Parallel Processing (cont.)
- Platform Specific Level
- Vendor provides insight
- Remove last performance obstacles
- Be careful to limit use of non-portable constructs
37Determining Shared and Private
- What are the variable classes?
- Process for determining class
- First private/last private
38Types of Variables
- Start with access patterns
- Read Only: disposition elsewhere
- Write then Read: possibly local
- Read then Write: independent or reductions
- Written: live on exit?
- Goal: determine storage classes
- Local or private variables are local per thread
- Shared variables are everything else
39Guidelines for Classifying Variables
- In general, big things are shared
- The major arrays that take all the space
- It's the threads' default model
- Program local vars are parallel private vars
- Temporaries used require one copy per thread
- Subroutine locals become private automatically
- Move up from leaf subroutines to parallel region
- Equivalences: ick
40Process of Classifying Variables
- Examine refs to each var to determine shared list
- Split common into shared common and private common if vars require different storage classes
- Use copy-in to private common as an alternative
- Construct private list and declare private commons by examining the types of remaining variables
41Process of Classifying Variables (2)
(Flowchart: examine the references to each variable. If it is only read in the parallel region, put it on the Shared list. If it is modified in the parallel region and the references contain the parallel loop index (different iterations reference different parts), put it on the Shared list; if they do not contain the parallel loop index, go to the next page.)
42Process of Classifying Vars (3)
(Flowchart: classify the remaining variables by type. Locals to the subroutine: automatic ones go on the Private list; static ones change to common. Formal parameters: known size goes on the Private list; unknown size changes to a pointee. Pointees are put on the Shared list. Common members referenced only in the parallel region go on the Private list with FIRSTPRIVATE; those referenced in called routines are declared in a private common.)
43Firstprivate and Lastprivate
- LASTPRIVATE copies value(s) from the local copy assigned on the last iteration of the loop to the global copy of the variables or arrays
- FIRSTPRIVATE copies value(s) from the global variables or arrays to the local copy for the first iteration of the loop on each processor
44Firstprivate and Lastprivate (2)
- Parallelizing a loop and not knowing whether there are side effects?

      subroutine foo(n)
      common /foobar/ a(1000), b(1000), x
c$omp parallel do shared(a,b,n) lastprivate(x)
      do 10 i = 1, n
        x = a(i)**2 + b(i)**2
10      b(i) = sqrt(x)
      end

Use LASTPRIVATE because we don't know where, or if, x in common /foobar/ will be used again.
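By analogy, a minimal sketch of FIRSTPRIVATE (the variable scale and its use here are illustrative, not from the original slide): each thread's private copy of scale starts with the value assigned before the loop.

      scale = 2.0
c$omp parallel do firstprivate(scale) shared(a,n)
      do 20 i = 1, n
20      a(i) = scale * a(i)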
45Choosing & Placing Synchronization
- Finding variables that need to be synchronized
- Two frequently used types
- Critical/ordered sections: small updates to shared structures
- Barriers: delimit phases of computation
- Doing reductions
46What to Synchronize
- Updates: parallel-DO-invariant variables that are read then written
- Place a critical/ordered section around groups of updates
- Pay attention to control flow
- Make sure you don't branch in or out
- Pull it outside the loop or region if efficient
47Example Critical/Ordered Section
      if (ncycle.eq.0) then
        do 60 i = lft, llt
          dt2 = amin1(dtx(i), dt2)
          if (dt2.eq.dtx(i)) then
            ielmtc = 128*(ndum-1) + i
            ielmtc = nhex(ielmtc)
            ityptc = 1
          endif
          ielmtd = 128*(ndum-1) + i
          ielmtd = nhex(ielmtd)
          write (13,90) ielmtd, dtx(i)
          write (13,100) ielmtc
60      continue
      endif
      do 70 i = lft, llt
70      dt2 = amin1(dtx(i), dt2)
      if (mess.ne.'sw2.') return
      do 80 i = lft, llt
        if (dt2.eq.dtx(i)) then
          ielmtc = 128*(ndum-1) + i
          ielmtc = nhex(ielmtc)
          ityptc = 1
        endif
80    continue
48Reductions
- Correct (but slow) program:

      sum = 0.0
c$omp parallel private(i) shared(sum,a,n)
c$omp do
      do 10 i = 1, n
c$omp critical
        sum = sum + a(i)
c$omp end critical
10    continue
c$omp end parallel

- Serial program is a reduction:

      sum = 0.0
      do 10 i = 1, n
10      sum = sum + a(i)
49(Flawed) Plan For a Good Reduction
- Incorrect parallel program:

c$omp parallel private(suml,i)
c$omp& shared(sum,a,n)
      suml = 0.0
c$omp do
      do 10 i = 1, n
10      suml = suml + a(i)
cbug  need critical section next
      sum = sum + suml
c$omp end parallel
50Good Reductions
- Correct reduction:

c$omp parallel private(suml,i)
c$omp& shared(sum,a,n)
      suml = 0.0
c$omp do
      do 10 i = 1, n
10      suml = suml + a(i)
c$omp critical
      sum = sum + suml
c$omp end critical
c$omp end parallel

- Using REDUCTION does the same:

c$omp parallel
c$omp& shared(a,n)
c$omp& reduction(+:sum)
c$omp do private(i)
      do 10 i = 1, n
10      sum = sum + a(i)
c$omp end parallel
51Typical Parallel Bugs
- Problem: incorrectly pointing to the same place
- Symptom: bad answers
- Fix: initialization of local pointers
- Problem: incorrectly pointing to different places
- Symptom: segmentation violation
- Fix: localization of shared data
- Problem: incorrect initialization of parallel regions
- Symptom: bad answers
- Fix: copy in? / use a parallel region outside the parallel do
52Typical Parallel Bugs (2)
- Problem: not saving values from parallel regions
- Symptom: bad answers, core dump
- Fix: transfer from local into shared
- Problem: unsynchronized access
- Symptom: bad answers
- Fix: critical section / barrier / local accumulation
- Problem: numerical inconsistency
- Symptom: run-to-run variation in answers
- Fix: different scheduling mechanisms / ordered sections / consistent parallel reductions
53Typical Parallel Bugs (3)
- Problem: inconsistently synchronized I/O stmts
- Symptom: jumbled output, system error messages
- Fix: critical/ordered section around I/O
- Problem: inconsistent declarations of common vars
- Symptom: segmentation violation
- Fix: verify consistent declarations
- Problem: parallel stack size problems
- Symptom: core dump
- Fix: increase stack size
54Outline
- Introduction
- Parallelism, Synchronization, and Environments
- Restructuring/Designing Programs in OpenMP
- Example Programs
55Designing Parallel Programs in OpenMP
- Partition
- Divide problem into tasks
- Communicate
- Determine amount and pattern of communication
- Agglomerate
- Combine tasks
- Map
- Assign agglomerated tasks to physical processors
56Designing Parallel Programs in OpenMP (2)
- Partition
- In OpenMP, look for any independent operations (loop parallel, task parallel)
- Communicate
- In OpenMP, look for synch points and dependences
- Agglomerate
- In OpenMP, create parallel loops and/or parallel sections
- Map
- In OpenMP, implicit or explicit scheduling
- Data mapping goes outside the standard
57Jacobi Iteration The Problem
- Numerically solve a PDE on a square mesh
- Method
- Update each mesh point by the average of its neighbors
- Repeat until converged
58Jacobi Iteration OpenMP Partitioning, Communication, and Agglomeration
- Partitioning does not change at all
- Data parallelism natural for this problem
- Communication does not change at all
- Related directly to task partitioning
59Partitioning, Communication, and Agglomeration (2)
- Agglomeration analysis changes a little
- OpenMP cannot nest control constructs easily
- Requires an intervening parallel section, with OMP_NESTED turned on
- Major issue on shared memory machines is locality in memory layout
- Nearest neighbors agglomerated together as blocks
- Therefore, encourage each processor to keep using the same contiguous section(s) of memory
60Jacobi Iteration OpenMP Mapping
- Minimize forking and synchronization overhead
- One parallel region at highest possible level
- Mark outermost possible loop for work sharing
- Keep each processor working on the same data
- Consistent schedule for DO loops
- Trust the underlying system not to migrate threads for no reason
- Lay out data to be contiguous
- Column-major ordering in Fortran
- Therefore, make the dimension of the outermost work-shared loop the column
61Jacobi Iteration OpenMP Program
(to be continued)
62Jacobi Iteration/Program (2)
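The program on these two slides is shown only as a figure. A minimal sketch of the work-shared Jacobi update under the mapping above, assuming n x n arrays u and unew and a fixed number of sweeps niter (all names and bounds are illustrative):

!$OMP PARALLEL SHARED(u, unew, n, niter) PRIVATE(it, i, j)
      DO it = 1, niter
!$OMP DO SCHEDULE(STATIC)
         DO j = 2, n-1
            DO i = 2, n-1
               unew(i,j) = 0.25 * (u(i-1,j) + u(i+1,j)
     &                           + u(i,j-1) + u(i,j+1))
            END DO
         END DO
!$OMP END DO
!$OMP DO SCHEDULE(STATIC)
         DO j = 2, n-1
            DO i = 2, n-1
               u(i,j) = unew(i,j)
            END DO
         END DO
!$OMP END DO
      END DO
!$OMP END PARALLEL

Work-sharing the outer j loop gives each thread a block of contiguous columns, matching the locality advice above; the implied barriers at each END DO keep the sweeps in step.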
63Irregular Mesh The Problem
- The Problem
- Given an irregular mesh of values
- Update each value using its neighbors in the mesh
- The Approach
- Store the mesh as a list of edges
- Process all edges in parallel
- Compute contribution of edge
- Add to one endpoint, subtract from the other
64Irregular Mesh Sequential Program
      REAL x(nnode), y(nnode), flux
      INTEGER iedge(nedge,2)
      err = tol * 1e6
      DO WHILE (err .gt. tol)
        DO i = 1, nedge
          flux = (y(iedge(i,1)) - y(iedge(i,2))) / 2
          x(iedge(i,1)) = x(iedge(i,1)) - flux
          x(iedge(i,2)) = x(iedge(i,2)) + flux
          err = err + flux*flux
        END DO
        err = err / nedge
        DO i = 1, nnode
          y(i) = x(i)
        END DO
      END DO
65Irregular Mesh OpenMP Partitioning
- Flux computations are data-parallel
- flux = (x(iedge(i,1)) - x(iedge(i,2))) / 2
- Independent because edge_val ≠ node_val
- Node updates are nearly data-parallel
- x(iedge(i,1)) = x(iedge(i,1)) - flux
- Not truly independent because sometimes iedge(iY,1) = iedge(iX,2)
- But ATOMIC supports updates using associative operators
- Error check is a reduction
- err = err + flux*flux
- REDUCTION class for variables
66Irregular Mesh OpenMP Communication
- Communication needed for all parts
- Between edges and nodes to compute flux
- Edge-node and node-node to compute x
- Reduction to compute err
- Edge and node communication is static, local with respect to the grid
- But unstructured with respect to array indices
- Reduction communication is static, global
67Irregular Mesh OpenMP Agglomeration
- Because of the tight ties between flux, x, and err, it is best to keep the loop intact
- Incremental parallelization via OpenMP works perfectly
- No differences between computation in different iterations
- Any agglomeration scheme is likely to work well for load balance
- Don't specify SCHEDULE
- Make the system earn its keep
68Irregular Mesh OpenMP Mapping
- There may be significant differences in data movement based on scheduling
- The ideal:
- Every processor runs over its own edges (static scheduling)
- Endpoints of these edges are not shared by other processors
- Data moves to its home on the first pass, then stays put
69Irregular Mesh OpenMP Mapping (2)
- The reality:
- The graph is connected ⇒ some endpoints must be shared
- Memory systems move more than one word at a time ⇒ false sharing
- OpenMP does not standardize how to resolve this
- Best bet: once the program is working, look for good orderings of data
70Irregular Mesh OpenMP Program
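The program on this slide is shown only as a figure. A minimal sketch of how the edge loop might be parallelized with ATOMIC endpoint updates and a REDUCTION for the error (directive placement and the err reset are illustrative, not the slide's exact code):

      DO WHILE (err .gt. tol)
        err = 0.0
!$OMP PARALLEL DO PRIVATE(i, flux) SHARED(x, y, iedge)
!$OMP&            REDUCTION(+:err)
        DO i = 1, nedge
          flux = (y(iedge(i,1)) - y(iedge(i,2))) / 2
!$OMP ATOMIC
          x(iedge(i,1)) = x(iedge(i,1)) - flux
!$OMP ATOMIC
          x(iedge(i,2)) = x(iedge(i,2)) + flux
          err = err + flux*flux
        END DO
!$OMP END PARALLEL DO
        err = err / nedge
!$OMP PARALLEL DO
        DO i = 1, nnode
          y(i) = x(i)
        END DO
!$OMP END PARALLEL DO
      END DO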
71Irregular Mesh
- Divide edge list among processors
- Ideally, would like all edges referring to a given vertex to be assigned to the same processor
- Often easier said than done
72Irregular Mesh Pictures
73Irregular Mesh Bad Data Order
74Irregular Mesh Good Data Order
75OpenMP Summary
- Based on fork-join parallelism in shared memory
- Threads start at the beginning of a parallel region, come back together at the end
- Close to some hardware
- Linked from traditional languages
- Very good for sharing data and incremental parallelization
- Unclear if it is feasible for distributed memory
- More information at http://www.openmp.org
76Three Systems Compared
- HPF
- Good abstraction: data parallelism
- System can hide many details from the programmer
- Two-edged sword
- Well-suited for regular problems on machines with locality
- MPI
- Lower-level abstraction: message passing
- System works everywhere, is usually the first tool available on new systems
- Well-suited to handling data on distributed memory machines, but requires work up front
77Three Systems Compared (2)
- OpenMP
- Good abstraction: fork-join
- System excellent for incremental parallelization on shared memory
- No implementations yet on distributed memory
- Well-suited for any parallel application if locality is not an issue
- Can we combine paradigms?
- Yes, although it's still research
78OpenMP + MPI
- Modern parallel machines are often shared-memory nodes connected by message passing
- Can be programmed by calling MPI from OpenMP
- MPI implementation must be thread-safe
- ASCI project is using this heavily
79MPI + HPF
- Many applications (like the atmosphere/ocean model) consist of several data-parallel modules
- Can link HPF codes on different machines using MPI
- Requires a special MPI implementation and runtime
- HPF/MPI project at Argonne has done a proof-of-concept
80HPF + OpenMP
- HPF can be implemented by translating it to OpenMP
- Good idea on shared-memory machines
- May have real advantages for optimizing locality and data layout
- HPF may call OpenMP directly
- Proposal being made at the HPF Users Group meeting next week
- Not trivial, since HPF and OpenMP may not agree on data layout
- Things could get worse if MPI is also implemented on OpenMP