Title: OpenMP in Practice
1OpenMP in Practice
- Gina Goff
- Rice University
2Outline
- Introduction
- Parallelism, Synchronization, and Environments
- Restructuring/Designing Programs in OpenMP
- Example Programs
3Outline
- Introduction
- Parallelism, Synchronization, and Environments
- Restructuring/Designing Programs in OpenMP
- Example Programs
4OpenMP
- A portable fork-join parallel model for shared-memory architectures
- Portable
- Based on Parallel Computing Forum (PCF)
- Fortran 77 binding here today; C coming this year
5OpenMP (2)
- Fork-join model
- Execution starts with one thread of control
- Parallel regions fork off new threads on entry
- Threads join back together at the end of the region
- Shared memory
- (Some) Memory can be accessed by all threads
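A minimal sketch of the fork-join pattern described above (the subroutine names are illustrative, not from the slides):

      CALL SETUP()             ! executed by the initial thread
!$OMP PARALLEL
      CALL WORK()              ! executed by every thread in the team
!$OMP END PARALLEL
      CALL WRAPUP()            ! threads have joined; one thread again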
6Shared Memory
- Computation(s) using several processors
- Each processor has some private memory
- Each processor has access to a memory shared with the other processors
- Synchronization
- Used to protect integrity of parallel program
- Prevents unsafe memory accesses
- Fine-grained synchronization (point to point)
- Barriers used for global synchronization
7Shared Memory in Pictures
8OpenMP
- Two basic flavors of parallelism
- Coarse-grained
- Program is broken into segments (threads) that can be executed in parallel
- Use barriers to re-synchronize execution at the end
- Fine-grained
- Execute iterations of DO loop(s) in parallel
9OpenMP in Pictures
10Design of OpenMP
- A flexible standard, easily implemented across different platforms
- Control structures
- Minimal for simplicity and encouraging common cases
- PARALLEL, DO, SECTIONS, SINGLE, MASTER
- Data environment
- New data access capabilities for forked threads
- SHARED, PRIVATE, REDUCTION
11Design of OpenMP (2)
- Synchronization
- Simple implicit synch at beginning and end of control structures
- Explicit synch for more complex patterns: BARRIER, CRITICAL, ATOMIC, FLUSH, ORDERED
- Runtime library
- Manages modes for forking and scheduling threads
- E.g., OMP_GET_THREAD_NUM
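A minimal sketch of calling the runtime library from inside a parallel region (the PRINT statement is illustrative):

      INTEGER OMP_GET_THREAD_NUM, OMP_GET_NUM_THREADS
!$OMP PARALLEL
      PRINT *, 'thread ', OMP_GET_THREAD_NUM(),
     &         ' of ', OMP_GET_NUM_THREADS()
!$OMP END PARALLEL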
12Who's In OpenMP?
- Software Vendors
- Absoft Corp.
- Edinburgh Portable Compilers
- Kuck & Associates, Inc.
- Myrias Computer Technologies
- Numerical Algorithms Group
- The Portland Group, Inc.
- Hardware Vendors
- Digital Equipment Corp.
- Hewlett-Packard
- IBM
- Intel
- Silicon Graphics/Cray Research
- Solution Vendors
- ADINA R&D, Inc.
- ANSYS, Inc.
- CPLEX division of ILOG
- Fluent, Inc.
- LSTC Corp.
- MECALOG SARL
- Oxford Molecular Group PLC
- Research Organizations
- US Department of Energy ASCI Program
- Université Louis Pasteur, Strasbourg
13Outline
- Introduction
- Parallelism, Synchronization, and Environments
- Restructuring/Designing Programs in OpenMP
- Example Programs
14Control Structures
- PARALLEL / END PARALLEL
- The actual fork and join
- Number of threads won't change inside parallel region
- Single Program Multiple Data (SPMD) execution within region
- SINGLE / END SINGLE
- (Short) sequential section
- MASTER / END MASTER
- SINGLE on master processor

!$OMP PARALLEL
      CALL S1
!$OMP SINGLE
      CALL S2
!$OMP END SINGLE
      CALL S3
!$OMP END PARALLEL
15Control Structures (2)
- DO / END DO
- The classic parallel loop
- Inside parallel region
- Or convenient combined directive PARALLEL DO
- Iteration space is divided among available threads
- Loop index is private to thread by default (see the sketch below)
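A minimal sketch of the combined directive (array A and bound N are illustrative; the loop index I need not be listed because it is private by default):

!$OMP PARALLEL DO SHARED(A, N)
      DO I = 1, N
         A(I) = 2.0 * A(I)
      END DO
!$OMP END PARALLEL DO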
16Control Structures (3)
- SECTIONS / END SECTIONS
- Task parallelism, potentially MIMD
- SECTION marks tasks
- Inside parallel region
- Nested parallelism
- Requires creating new parallel region
- Not supported on all OpenMP implementations
- If not allowed, inner PARALLEL is a no-op
!$OMP PARALLEL SECTIONS
!$OMP SECTION
!$OMP PARALLEL DO
      DO J = 1, 2
         CALL FOO(J)
      END DO
!$OMP END PARALLEL DO
!$OMP SECTION
      CALL BAR(2)
!$OMP SECTION
!$OMP PARALLEL DO
      DO K = 1, 3
         CALL BAR(K)
      END DO
!$OMP END PARALLEL DO
!$OMP END PARALLEL SECTIONS
17DO Scheduling
- Static Scheduling (default)
- Divides loop into equal size iteration chunks
- Based on runtime loop limits
- Totally parallel scheduling algorithm
- Dynamic Scheduling
- Threads go to scheduler to get next chunk
- GUIDED: chunk sizes taper down toward the end of the loop
18DO Scheduling (2)
(Figure: iterations 1-36 of the loop below, shown as they are handed out to the threads under DYNAMIC and GUIDED scheduling.)
!$OMP PARALLEL DO
!$OMP& SCHEDULE(DYNAMIC,1)
      DO J = 1, 36
         CALL SUBR(J)
      END DO
!$OMP END PARALLEL DO

!$OMP PARALLEL DO
!$OMP& SCHEDULE(GUIDED,1)
      DO J = 1, 36
         CALL SUBR(J)
      END DO
!$OMP END PARALLEL DO
19Orphaned Directives
      PROGRAM main
!$OMP PARALLEL
      CALL foo()
      CALL bar()
      CALL error()
!$OMP END PARALLEL
      END

      SUBROUTINE error()
! Not allowed due to
! nested control structs
!$OMP SECTIONS
!$OMP SECTION
      CALL foo()
!$OMP SECTION
      CALL bar()
!$OMP END SECTIONS
      END

      SUBROUTINE foo()
!$OMP DO
      DO i = 1, n
         ...
      END DO
!$OMP END DO
      END

      SUBROUTINE bar()
!$OMP SECTIONS
!$OMP SECTION
      CALL section1()
!$OMP SECTION
      ...
!$OMP SECTION
      ...
!$OMP END SECTIONS
      END
20OpenMP Synchronization
- Implicit barriers wait for all threads
- DO, END DO
- SECTIONS, END SECTIONS
- SINGLE, END SINGLE
- MASTER, END MASTER
- NOWAIT at END can override the synch (see the sketch below)
- Global barriers ⇒ all threads must hit them in the same order
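A sketch of NOWAIT removing the implied barrier between two independent work-shared loops (arrays A and B are illustrative):

!$OMP PARALLEL
!$OMP DO
      DO I = 1, N
         A(I) = A(I) + 1.0
      END DO
!$OMP END DO NOWAIT
!$OMP DO
      DO J = 1, M
         B(J) = B(J) * 2.0
      END DO
!$OMP END DO
!$OMP END PARALLEL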
21OpenMP Synchronization (2)
- Explicit directives provide finer control
- BARRIER: must be hit by all threads in the team
- CRITICAL (name), END CRITICAL
- Only one thread may enter at a time
- ATOMIC: single-statement critical section, for reductions (see the sketch below)
- FLUSH (list): synchronization point at which the implementation is required to provide a consistent view of memory
- ORDERED: for pipelining loop iterations
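A minimal sketch of ATOMIC protecting a single shared update (the histogram arrays HIST and KEY are illustrative):

!$OMP PARALLEL DO SHARED(HIST, KEY, N)
      DO I = 1, N
!$OMP ATOMIC
         HIST(KEY(I)) = HIST(KEY(I)) + 1
      END DO
!$OMP END PARALLEL DO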
22OpenMP Data Environments
- Data can be PRIVATE or SHARED
- Private data is for local variables
- Shared data is global
- Data can be private to a thread: all processors in the thread can access the data, but other threads can't see it
23OpenMP Data Environments
      COMMON /mine/ z
      INTEGER x(3), y(3), k
!$OMP THREADPRIVATE(/mine/)
!$OMP PARALLEL DO DEFAULT(PRIVATE), SHARED(x)
!$OMP& REDUCTION(+:z)
      DO k = 1, 3
         x(k) = k
         y(k) = k*k
         z = z + x(k)*y(k)
      END DO
!$OMP END PARALLEL DO
(Figure: shared memory holds x = (1, 2, 3) and the final result z = 36; threads 0, 1, and 2 each hold private copies of y and of the partial sum z', which are combined into the shared z.)
24Brief Example
25OpenMP Environment & Runtime Library
- For controlling execution
- Needed for tuning, but may limit portability
- Control through environment variables or runtime library calls
- Runtime library takes precedence in a conflict
26OpenMP Environment & Runtime (2)
- OMP_NUM_THREADS: How many threads to use in a parallel region?
- OMP_GET_NUM_THREADS, OMP_SET_NUM_THREADS
- Related: OMP_GET_THREAD_NUM, OMP_GET_MAX_THREADS, OMP_GET_NUM_PROCS
- OMP_DYNAMIC: Should the runtime system choose the number of threads?
- OMP_GET_DYNAMIC, OMP_SET_DYNAMIC
27OpenMP Environment & Runtime (3)
- OMP_NESTED: Should nested parallel regions be supported?
- OMP_GET_NESTED, OMP_SET_NESTED
- OMP_SCHEDULE: Choose the DO scheduling option
- Used by the RUNTIME clause (see the sketch below)
- OMP_IN_PARALLEL: Is the program in a parallel region?
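A sketch of deferring the scheduling decision to the environment via the RUNTIME clause; the schedule is then taken from OMP_SCHEDULE (e.g., set to "DYNAMIC,4"). The loop body is illustrative:

!$OMP PARALLEL DO SCHEDULE(RUNTIME)
      DO I = 1, N
         CALL SUBR(I)
      END DO
!$OMP END PARALLEL DO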
28Outline
- Introduction
- Parallelism, Synchronization, and Environments
- Restructuring/Designing Programs in OpenMP
- Example Programs
29Analyzing for Parallelism
- Profiling
- Walk the loop nests
- Multiple parallel loops
30Program Profile
- Is dataset large enough?
- At the top of the list, should find
- parallel regions
- routines called within them
- What is cumulative percent?
- Watch for system libraries near top
- e.g., spin_wait_join_barrier
31Walking the Key Loop Nest
- Usually the outermost parallel loop
- Ignore timestep and convergence loops
- Ignore loops with few iterations
- Ignore loops that call unimportant subroutines
- Don't be put off by
- Loops that write shared data
- Loops that step through linked lists
- Loops with I/O
32Multiple Parallel Loops
- Nested parallel loops are good
- Pick easiest or most parallel code
- Think about locality
- Use IF clause to select best based on dataset (see the sketch below)
- Plan on doing one across clusters
- Non-nested parallel loops
- Consider loop fusion (impacts locality)
- Execute code between them in a parallel region
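A sketch of the IF clause: the loop runs in parallel only when the dataset is large enough (the threshold and arrays are illustrative):

!$OMP PARALLEL DO IF (N .GT. 1000) SHARED(A, B, N)
      DO I = 1, N
         A(I) = A(I) + B(I)
      END DO
!$OMP END PARALLEL DO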
33Example Loop Nest
      subroutine fem3d()
10    call addmon()
      if (numelh.ne.0) call solide

      subroutine solide
      do 20 i = 1, nelt
        do 20 j = 1, nelg
          call unpki
          call strain
          call force
20    continue
      if (...) return
      goto 10

      subroutine force()
      do 10 i = lft, llt
        sgv(i) = sig1(i) - qp(i)*vol(i)
10    continue
      do 50 n = 1, nnc
        i0 = ia(n)
        i1 = ia(n+1) - 1
        do 50 i = i0, i1
          e(1,ix(i)) = e(1,ix(i)) + ep11(i)
50    continue
34Restructuring Applications
- Two level strategy for parallel processing
- Determining shared and local variables
- Adding synchronization
35Two Levels of Parallel Processing
- Two-level approach isolates major concerns and makes code easier to update
- Algorithm/Architecture Level
- Unique to your software
- Provides majority of SMP performance
36Two Levels of Parallel Processing (cont.)
- Platform Specific Level
- Vendor provides insight
- Remove last performance obstacles
- Be careful to limit use of non-portable constructs
37Determining Shared and Private
- What are the variable classes?
- Process for determining class
- First private/last private
38Types of Variables
- Start with access patterns
- Read Only: disposition elsewhere
- Write then Read: possibly local
- Read then Write: independent or reductions
- Written: live on exit?
- Goal: determine storage classes
- Local or private variables are local per thread
- Shared variables are everything else
39Guidelines for Classifying Variables
- In general, big things are shared
- The major arrays that take all the space
- It's the threads' default model
- Program local vars are parallel private vars
- Temporaries used require one copy per thread
- Subroutine locals become private automatically
- Move up from leaf subroutines to parallel region
- Equivalences: ick
40Process of Classifying Variables
- Examine refs to each var to determine shared list
- Split common into shared common and private common if vars require different storage classes
- Use copy-in to private common as an alternative
- Construct private list and declare private commons by examining the types of remaining variables
41Process of Classifying Variables (2)
(Flowchart: examine the references to each variable. If it is only read in the parallel region, put it on the Shared list. If it is modified in the parallel region and the references contain the parallel loop index (different iterations reference different parts), put it on the Shared list; if they do not contain the parallel loop index, go to the next page.)
42Process of Classifying Vars (3)
(Flowchart: classify the remaining variables by type. Locals to the subroutine: automatic ones go on the Private list; static ones change to common. Formal parameters: known size goes on the Private list; unknown size changes to a pointee. Pointees are put on the Shared list. Common members referenced only in the parallel region go on the Private list with FIRSTPRIVATE; those referenced in called routines are declared in a private common.)
43Firstprivate and Lastprivate
- LASTPRIVATE copies value(s) from the local copy assigned on the last iteration of the loop to the global copy of the variables or arrays
- FIRSTPRIVATE copies value(s) from the global variables or arrays to the local copy for the first iteration of the loop on each processor
44Firstprivate and Lastprivate (2)
- Parallelizing a loop and not knowing whether there are side effects?

      subroutine foo(n)
      common /foobar/ a(1000), b(1000), x
c$omp parallel do shared(a,b,n) lastprivate(x)
      do 10 i = 1, n
        x = a(i)**2 + b(i)**2
10      b(i) = sqrt(x)
      end

Use LASTPRIVATE because we don't know where, or if, x in common /foobar/ will be used again.
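By analogy, a minimal sketch of FIRSTPRIVATE (the variable scale and its use here are illustrative, not from the original slide): each thread's private copy of scale starts with the value assigned before the loop.

      scale = 2.0
c$omp parallel do firstprivate(scale) shared(a,n)
      do 20 i = 1, n
20      a(i) = scale * a(i)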
45Choosing & Placing Synchronization
- Finding variables that need to be synchronized
- Two frequently used types
- Critical/ordered sections: small updates to shared structures
- Barriers: delimit phases of computation
- Doing reductions
46What to Synchronize
- Updates: parallel-DO-invariant variables that are read then written
- Place a critical/ordered section around groups of updates
- Pay attention to control flow
- Make sure you don't branch in or out
- Pull it outside the loop or region if efficient
47Example Critical/Ordered Section
      if (ncycle.eq.0) then
        do 60 i = lft, llt
          dt2 = amin1(dtx(i), dt2)
          if (dt2.eq.dtx(i)) then
            ielmtc = 128*(ndum-1) + i
            ielmtc = nhex(ielmtc)
            ityptc = 1
          endif
          ielmtd = 128*(ndum-1) + i
          ielmtd = nhex(ielmtd)
          write (13,90) ielmtd, dtx(i)
          write (13,100) ielmtc
60      continue
      endif
      do 70 i = lft, llt
70      dt2 = amin1(dtx(i), dt2)
      if (mess.ne.'sw2.') return
      do 80 i = lft, llt
        if (dt2.eq.dtx(i)) then
          ielmtc = 128*(ndum-1) + i
          ielmtc = nhex(ielmtc)
          ityptc = 1
        endif
80    continue
48Reductions
- Correct (but slow) program:

      sum = 0.0
c$omp parallel private(i) shared(sum,a,n)
c$omp do
      do 10 i = 1, n
c$omp critical
        sum = sum + a(i)
c$omp end critical
10    continue
c$omp end parallel

- Serial program is a reduction:

      sum = 0.0
      do 10 i = 1, n
10      sum = sum + a(i)
49(Flawed) Plan For a Good Reduction
- Incorrect parallel program:

c$omp parallel private(suml,i)
c$omp& shared(sum,a,n)
      suml = 0.0
c$omp do
      do 10 i = 1, n
10      suml = suml + a(i)
cbug  need critical section next
      sum = sum + suml
c$omp end parallel
50Good Reductions
- Correct reduction:

c$omp parallel private(suml,i)
c$omp& shared(sum,a,n)
      suml = 0.0
c$omp do
      do 10 i = 1, n
10      suml = suml + a(i)
c$omp critical
      sum = sum + suml
c$omp end critical
c$omp end parallel

- Using REDUCTION does the same:

c$omp parallel
c$omp& shared(a,n)
c$omp& reduction(+:sum)
c$omp do private(i)
      do 10 i = 1, n
10      sum = sum + a(i)
c$omp end parallel
51Typical Parallel Bugs
- Problem: incorrectly pointing to the same place
- Symptom: bad answers
- Fix: initialization of local pointers
- Problem: incorrectly pointing to different places
- Symptom: segmentation violation
- Fix: localization of shared data
- Problem: incorrect initialization of parallel regions
- Symptom: bad answers
- Fix: copy in? / use a parallel region outside the parallel do
52Typical Parallel Bugs (2)
- Problem: not saving values from parallel regions
- Symptom: bad answers, core dump
- Fix: transfer from local into shared
- Problem: unsynchronized access
- Symptom: bad answers
- Fix: critical section / barrier / local accumulation
- Problem: numerical inconsistency
- Symptom: run-to-run variation in answers
- Fix: different scheduling mechanisms / ordered sections / consistent parallel reductions
53Typical Parallel Bugs (3)
- Problem: inconsistently synchronized I/O stmts
- Symptom: jumbled output, system error messages
- Fix: critical/ordered section around I/O
- Problem: inconsistent declarations of common vars
- Symptom: segmentation violation
- Fix: verify consistent declarations
- Problem: parallel stack size problems
- Symptom: core dump
- Fix: increase stack size
54Outline
- Introduction
- Parallelism, Synchronization, and Environments
- Restructuring/Designing Programs in OpenMP
- Example Programs
55Designing Parallel Programs in OpenMP
- Partition
- Divide problem into tasks
- Communicate
- Determine amount and pattern of communication
- Agglomerate
- Combine tasks
- Map
- Assign agglomerated tasks to physical processors
56Designing Parallel Programs in OpenMP (2)
- Partition
- In OpenMP, look for any independent operations (loop parallel, task parallel)
- Communicate
- In OpenMP, look for synch points and dependences
- Agglomerate
- In OpenMP, create parallel loops and/or parallel sections
- Map
- In OpenMP, implicit or explicit scheduling
- Data mapping goes outside the standard
57Jacobi Iteration The Problem
- Numerically solve a PDE on a square mesh
- Method
- Update each mesh point by the average of its neighbors
- Repeat until converged
58Jacobi Iteration OpenMP Partitioning, Communication, and Agglomeration
- Partitioning does not change at all
- Data parallelism natural for this problem
- Communication does not change at all
- Related directly to task partitioning
59Partitioning, Communication, and Agglomeration (2)
- Agglomeration analysis changes a little
- OpenMP cannot nest control constructs easily
- Requires an intervening parallel section, with OMP_NESTED turned on
- Major issue on shared memory machines is locality in memory layout
- Nearest neighbors agglomerated together as blocks
- Therefore, encourage each processor to keep using the same contiguous section(s) of memory
60Jacobi Iteration OpenMP Mapping
- Minimize forking and synchronization overhead
- One parallel region at highest possible level
- Mark outermost possible loop for work sharing
- Keep each processor working on the same data
- Consistent schedule for DO loops
- Trust the underlying system not to migrate threads for no reason
- Lay out data to be contiguous
- Column-major ordering in Fortran
- Therefore, make the dimension of the outermost work-shared loop the column
61Jacobi Iteration OpenMP Program
(to be continued)
62Jacobi Iteration/Program (2)
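The program on these two slides is shown only as a figure. A minimal sketch of the work-shared Jacobi update under the mapping above, assuming n x n arrays u and unew and a fixed number of sweeps niter (all names and bounds are illustrative):

!$OMP PARALLEL SHARED(u, unew, n, niter) PRIVATE(it, i, j)
      DO it = 1, niter
!$OMP DO SCHEDULE(STATIC)
         DO j = 2, n-1
            DO i = 2, n-1
               unew(i,j) = 0.25 * (u(i-1,j) + u(i+1,j)
     &                           + u(i,j-1) + u(i,j+1))
            END DO
         END DO
!$OMP END DO
!$OMP DO SCHEDULE(STATIC)
         DO j = 2, n-1
            DO i = 2, n-1
               u(i,j) = unew(i,j)
            END DO
         END DO
!$OMP END DO
      END DO
!$OMP END PARALLEL

Work-sharing the outer j loop gives each thread a block of contiguous columns, matching the locality advice above; the implied barriers at each END DO keep the sweeps in step.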
63Irregular Mesh The Problem
- The Problem
- Given an irregular mesh of values
- Update each value using its neighbors in the mesh
- The Approach
- Store the mesh as a list of edges
- Process all edges in parallel
- Compute contribution of edge
- Add to one endpoint, subtract from the other
64Irregular Mesh Sequential Program
      REAL x(nnode), y(nnode), flux
      INTEGER iedge(nedge,2)
      err = tol * 1e6
      DO WHILE (err .gt. tol)
        DO i = 1, nedge
          flux = (y(iedge(i,1)) - y(iedge(i,2))) / 2
          x(iedge(i,1)) = x(iedge(i,1)) - flux
          x(iedge(i,2)) = x(iedge(i,2)) + flux
          err = err + flux*flux
        END DO
        err = err / nedge
        DO i = 1, nnode
          y(i) = x(i)
        END DO
      END DO
65Irregular Mesh OpenMP Partitioning
- Flux computations are data-parallel
- flux = (x(iedge(i,1)) - x(iedge(i,2))) / 2
- Independent because edge_val ≠ node_val
- Node updates are nearly data-parallel
- x(iedge(i,1)) = x(iedge(i,1)) - flux
- Not truly independent because sometimes iedge(iY,1) = iedge(iX,2)
- But ATOMIC supports updates using associative operators
- Error check is a reduction
- err = err + flux*flux
- REDUCTION class for variables
66Irregular Mesh OpenMP Communication
- Communication needed for all parts
- Between edges and nodes to compute flux
- Edge-node and node-node to compute x
- Reduction to compute err
- Edge and node communication is static, local with respect to the grid
- But unstructured with respect to array indices
- Reduction communication is static, global
67Irregular Mesh OpenMP Agglomeration
- Because of the tight ties between flux, x, and err, it is best to keep the loop intact
- Incremental parallelization via OpenMP works perfectly
- No differences between computation in different iterations
- Any agglomeration scheme is likely to work well for load balance
- Don't specify SCHEDULE
- Make the system earn its keep
68Irregular Mesh OpenMP Mapping
- There may be significant differences in data movement based on scheduling
- The ideal:
- Every processor runs over its own edges (static scheduling)
- Endpoints of these edges are not shared by other processors
- Data moves to its home on the first pass, then stays put
69Irregular Mesh OpenMP Mapping (2)
- The reality:
- The graph is connected ⇒ some endpoints must be shared
- Memory systems move more than one word at a time ⇒ false sharing
- OpenMP does not standardize how to resolve this
- Best bet: once the program is working, look for good orderings of data
70Irregular Mesh OpenMP Program
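The program on this slide is shown only as a figure. A minimal sketch of how the edge loop might be parallelized with ATOMIC endpoint updates and a REDUCTION for the error (directive placement and the err reset are illustrative, not the slide's exact code):

      DO WHILE (err .gt. tol)
        err = 0.0
!$OMP PARALLEL DO PRIVATE(i, flux) SHARED(x, y, iedge)
!$OMP&            REDUCTION(+:err)
        DO i = 1, nedge
          flux = (y(iedge(i,1)) - y(iedge(i,2))) / 2
!$OMP ATOMIC
          x(iedge(i,1)) = x(iedge(i,1)) - flux
!$OMP ATOMIC
          x(iedge(i,2)) = x(iedge(i,2)) + flux
          err = err + flux*flux
        END DO
!$OMP END PARALLEL DO
        err = err / nedge
!$OMP PARALLEL DO
        DO i = 1, nnode
          y(i) = x(i)
        END DO
!$OMP END PARALLEL DO
      END DO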
71Irregular Mesh
- Divide edge list among processors
- Ideally, would like all edges referring to a given vertex to be assigned to the same processor
- Often easier said than done
72Irregular Mesh Pictures
73Irregular Mesh Bad Data Order
74Irregular Mesh Good Data Order
75OpenMP Summary
- Based on fork-join parallelism in shared memory
- Threads start at the beginning of a parallel region, come back together at the end
- Close to some hardware
- Linked from traditional languages
- Very good for sharing data and incremental parallelization
- Unclear if it is feasible for distributed memory
- More information at http://www.openmp.org
76Three Systems Compared
- HPF
- Good abstraction: data parallelism
- System can hide many details from the programmer
- Two-edged sword
- Well-suited for regular problems on machines with locality
- MPI
- Lower-level abstraction: message passing
- System works everywhere, is usually the first tool available on new systems
- Well-suited to handling data on distributed memory machines, but requires work up front
77Three Systems Compared (2)
- OpenMP
- Good abstraction: fork-join
- System excellent for incremental parallelization on shared memory
- No implementations yet on distributed memory
- Well-suited for any parallel application if locality is not an issue
- Can we combine paradigms?
- Yes, although it's still research
78OpenMP + MPI
- Modern parallel machines are often shared-memory nodes connected by message passing
- Can be programmed by calling MPI from OpenMP
- MPI implementation must be thread-safe
- ASCI project is using this heavily
79MPI + HPF
- Many applications (like the atmosphere/ocean model) consist of several data-parallel modules
- Can link HPF codes on different machines using MPI
- Requires a special MPI implementation and runtime
- HPF/MPI project at Argonne has done a proof-of-concept
80HPF + OpenMP
- HPF can be implemented by translating it to OpenMP
- Good idea on shared-memory machines
- May have real advantages for optimizing locality and data layout
- HPF may call OpenMP directly
- Proposal being made at the HPF Users Group meeting next week
- Not trivial, since HPF and OpenMP may not agree on data layout
- Things could get worse if MPI is also implemented on OpenMP