Title: ECE 1747 Parallel Programming
1 ECE 1747 Parallel Programming
- Shared Memory: OpenMP
- Environment and Synchronization
2 What is OpenMP?
- Standard for shared memory programming for scientific applications.
- Has specific support for scientific application needs (unlike Pthreads).
- Rapidly gaining acceptance among vendors and application writers.
- See http://www.openmp.org for more info.
3 OpenMP API Overview
- API is a set of compiler directives inserted in the source program (in addition to some library functions).
- Ideally, compiler directives do not affect sequential code.
- pragmas in C / C++.
- (special) comments in Fortran code.
4 OpenMP API Example (1 of 2)
- Sequential code:
- statement1
- statement2
- statement3
- Assume we want to execute statement2 in parallel, and statement1 and statement3 sequentially.
5 OpenMP API Example (2 of 2)
- OpenMP parallel code:
- statement1
- #pragma <specific OpenMP directive>
- statement2
- statement3
- Statement2 (may be) executed in parallel.
- Statement1 and statement3 are executed sequentially.
6 Important Note
- By giving a parallel directive, the user asserts that the program will remain correct if the statement is executed in parallel.
- The OpenMP compiler does not check correctness.
- Some tools exist for helping with that.
- TotalView - good parallel debugger (www.etnus.com)
7 API Semantics
- Master thread executes sequential code.
- Master and slaves execute parallel code.
- Note: very similar to the fork-join semantics of Pthreads create/join primitives.
8 OpenMP Implementation Overview
- An OpenMP implementation consists of
- a compiler,
- a library.
- Unlike Pthreads (purely a library).
9 OpenMP Example Usage (1 of 2)
- [Diagram: Annotated Source is fed to the OpenMP Compiler; a compiler switch selects whether the output is a Sequential Program or a Parallel Program.]
10 OpenMP Example Usage (2 of 2)
- If you give the sequential switch,
- comments and pragmas are ignored.
- If you give the parallel switch,
- comments and/or pragmas are read, and
- cause translation into a parallel program.
- Ideally, one source for both the sequential and the parallel program (big maintenance plus).
11 OpenMP Directives
- Parallelization directives:
- parallel region
- parallel for
- Data environment directives:
- shared, private, threadprivate, reduction, etc.
- Synchronization directives:
- barrier, critical
12 General Rules about Directives
- They always apply to the next statement, which must be a structured block.
- Examples:
- #pragma omp ...
-   statement
- #pragma omp ...
-   { statement1; statement2; statement3; }
13 OpenMP Parallel Region
- #pragma omp parallel
- A number of threads are spawned at entry.
- Each thread executes the same code.
- Each thread waits at the end.
- Very similar to a number of create/joins with the same function in Pthreads.
14 Getting Threads to do Different Things
- Through explicit thread identification (as in Pthreads).
- Through work-sharing directives.
15 Thread Identification
- int omp_get_thread_num() - gets the thread id.
- int omp_get_num_threads() - gets the total number of threads.
16 Example
- #pragma omp parallel
- {
-   if( !omp_get_thread_num() )
-     master();
-   else
-     slave();
- }
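The master/slave pattern above can be turned into a compilable sketch. The `master`/`slave` bodies and the call counters are placeholders added here for illustration, and the `#ifdef _OPENMP` fallbacks let the file build (and run with a single thread) even without `-fopenmp`:

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Fallbacks so the example also compiles without -fopenmp. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

/* Hypothetical master/slave work functions, just counting calls. */
int master_calls = 0;
int slave_calls  = 0;

static void master(void) {
  #pragma omp critical
  master_calls++;
}

static void slave(void) {
  #pragma omp critical
  slave_calls++;
}

/* Runs the slide's master/slave pattern; returns the number of
   threads that executed the parallel region. */
int run_master_slave(void)
{
  int nthreads = 1;
  #pragma omp parallel
  {
    #pragma omp single
    nthreads = omp_get_num_threads();
    if( !omp_get_thread_num() )
      master();        /* thread 0 only */
    else
      slave();         /* every other thread */
  }
  return nthreads;
}
```

Whatever the thread count, exactly one thread (thread 0) calls `master()`; the remaining `nthreads-1` call `slave()`.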
17 Work Sharing Directives
- Always occur within a parallel region directive.
- Two principal ones are:
- parallel for
- parallel sections
18 OpenMP Parallel For
- #pragma omp parallel
- #pragma omp for
- for( ... ) { ... }
- Each thread executes a subset of the iterations.
- All threads wait at the end of the parallel for.
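A minimal compilable sketch of a work-shared loop (the function and array are illustrative, not from the slides):

```c
/* Fills out[i] = 2*i using a work-shared loop. Each thread handles a
   subset of the iterations; the implicit barrier at the end of the
   parallel for guarantees the whole array is filled on return. */
void double_indices(int *out, int n)
{
  #pragma omp parallel for
  for( int i = 0; i < n; i++ )
    out[i] = 2 * i;
}
```

The result is the same whether the loop runs with one thread or many, since no two iterations touch the same element.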
19 Multiple Work Sharing Directives
- May occur within a single parallel region:
- #pragma omp parallel
- {
-   #pragma omp for
-   for( ... ) { ... }
-   #pragma omp for
-   for( ... ) { ... }
- }
- All threads wait at the end of the first for.
20 The NoWait Qualifier
- #pragma omp parallel
- {
-   #pragma omp for nowait
-   for( ... ) { ... }
-   #pragma omp for
-   for( ... ) { ... }
- }
- Threads proceed to the second for w/o waiting.
21 Parallel Sections Directive
- #pragma omp parallel
- {
-   #pragma omp sections
-   {
-     { ... }
-     #pragma omp section  // this is a delimiter
-     { ... }
-     #pragma omp section
-     { ... }
-   }
- }
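A small compilable sketch of the sections construct (the two assignments stand in for real section bodies):

```c
/* Each section body is executed exactly once, by some thread of the
   team; different sections may run on different threads in parallel. */
void run_two_sections(int *x, int *y)
{
  #pragma omp parallel
  {
    #pragma omp sections
    {
      #pragma omp section
      *x = 1;
      #pragma omp section
      *y = 2;
    }
  }
}
```

Without `-fopenmp` the pragmas are ignored and the two statements simply run in order, which gives the same result.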
22 A Useful Shorthand
- #pragma omp parallel
- #pragma omp for
- for( ... ) { ... }
- is equivalent to:
- #pragma omp parallel for
- for( ... ) { ... }
- (Same for parallel sections.)
23 Note the Difference between ...
- #pragma omp parallel
- {
-   #pragma omp for
-   for( ... ) { ... }
-   f();
-   #pragma omp for
-   for( ... ) { ... }
- }
24 ... and ...
- #pragma omp parallel for
- for( ... ) { ... }
- f();
- #pragma omp parallel for
- for( ... ) { ... }
- In the first version, f() is executed by every thread of the parallel region; in the second, f() is executed only once, between two separate parallel regions.
25 Sequential Matrix Multiply
- for( i=0; i<n; i++ )
-   for( j=0; j<n; j++ ) {
-     c[i][j] = 0.0;
-     for( k=0; k<n; k++ )
-       c[i][j] += a[i][k]*b[k][j];
-   }
26 OpenMP Matrix Multiply
- #pragma omp parallel for
- for( i=0; i<n; i++ )
-   for( j=0; j<n; j++ ) {
-     c[i][j] = 0.0;
-     for( k=0; k<n; k++ )
-       c[i][j] += a[i][k]*b[k][j];
-   }
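The same loop nest as a compilable sketch with a quick self-check (the fixed size `MN` and identity-matrix test are assumptions added here):

```c
#include <math.h>

#define MN 4
/* C = A * B, parallelized over rows of C: each thread computes whole
   rows, so no two threads ever write the same element. */
static void matmul(double a[MN][MN], double b[MN][MN], double c[MN][MN])
{
  #pragma omp parallel for
  for( int i = 0; i < MN; i++ )
    for( int j = 0; j < MN; j++ ) {
      c[i][j] = 0.0;
      for( int k = 0; k < MN; k++ )
        c[i][j] += a[i][k] * b[k][j];
    }
}

/* Quick self-check: I * B must equal B. Returns 1 on success. */
int matmul_selftest(void)
{
  double a[MN][MN] = {{0}}, b[MN][MN], c[MN][MN];
  for( int i = 0; i < MN; i++ ) {
    a[i][i] = 1.0;                  /* identity matrix */
    for( int j = 0; j < MN; j++ )
      b[i][j] = i * MN + j;         /* arbitrary test values */
  }
  matmul( a, b, c );
  for( int i = 0; i < MN; i++ )
    for( int j = 0; j < MN; j++ )
      if( fabs(c[i][j] - b[i][j]) > 1e-12 ) return 0;
  return 1;
}
```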
27 Sequential SOR
- for some number of timesteps/iterations {
-   for( i=0; i<n; i++ )
-     for( j=1; j<n; j++ )
-       temp[i][j] = 0.25 *
-         ( grid[i-1][j] + grid[i+1][j] +
-           grid[i][j-1] + grid[i][j+1] );
-   for( i=0; i<n; i++ )
-     for( j=1; j<n; j++ )
-       grid[i][j] = temp[i][j];
- }
28 OpenMP SOR
- for some number of timesteps/iterations {
-   #pragma omp parallel for
-   for( i=0; i<n; i++ )
-     for( j=0; j<n; j++ )
-       temp[i][j] = 0.25 *
-         ( grid[i-1][j] + grid[i+1][j] +
-           grid[i][j-1] + grid[i][j+1] );
-   #pragma omp parallel for
-   for( i=0; i<n; i++ )
-     for( j=0; j<n; j++ )
-       grid[i][j] = temp[i][j];
- }
29 Equivalent OpenMP SOR
- for some number of timesteps/iterations {
-   #pragma omp parallel
-   {
-     #pragma omp for
-     for( i=0; i<n; i++ )
-       for( j=0; j<n; j++ )
-         temp[i][j] = 0.25 *
-           ( grid[i-1][j] + grid[i+1][j] +
-             grid[i][j-1] + grid[i][j+1] );
-     #pragma omp for
-     for( i=0; i<n; i++ )
-       for( j=0; j<n; j++ )
-         grid[i][j] = temp[i][j];
-   }
- }
30 Some Advanced Features
- Conditional parallelism.
- Scheduling options.
- (More can be found in the specification.)
31 Conditional Parallelism: Issue
- Oftentimes, parallelism is only useful if the problem size is sufficiently big.
- For smaller sizes, the overhead of parallelization exceeds the benefit.
32 Conditional Parallelism: Specification
- #pragma omp parallel if( expression )
- #pragma omp parallel for if( expression )
- Execute in parallel if the expression is true, otherwise execute sequentially.
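A compilable sketch of the if clause (the function and the 1000-element threshold are illustrative guesses, not measured values):

```c
/* Sums a[0..n-1]; the if clause keeps execution sequential for small n,
   where fork/join overhead would outweigh any speedup. The threshold
   1000 is an illustrative guess, not a tuned value. */
long sum_maybe_parallel(const int *a, int n)
{
  long s = 0;
  #pragma omp parallel for reduction(+: s) if( n > 1000 )
  for( int i = 0; i < n; i++ )
    s += a[i];
  return s;
}
```

The result is identical either way; only the execution strategy changes.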
33 Conditional Parallelism: Example
- for( i=0; i<n; i++ ) {
-   #pragma omp parallel for if( n-i > 100 )
-   for( j=i+1; j<n; j++ )
-     for( k=i+1; k<n; k++ )
-       a[j][k] = a[j][k] - a[i][k]*a[i][j] / a[j][j];
- }
34 Scheduling of Iterations: Issue
- Scheduling: assigning iterations to a thread.
- So far, we have assumed the default, which is block scheduling.
- OpenMP allows other scheduling strategies as well, for instance cyclic, gss (guided self-scheduling), etc.
35 Scheduling of Iterations: Specification
- #pragma omp parallel for schedule(<sched>)
- <sched> can be one of
- block (the default; schedule(static) in the standard)
- cyclic (schedule(static, 1) in the standard)
- gss (schedule(guided) in the standard)
36 Example
- Multiplication of two matrices C = A x B, where the A matrix is upper-triangular (all elements below the diagonal are 0).
- [Figure: upper-triangular matrix A, zeros below the diagonal.]
37 Sequential Matrix Multiply Becomes
- for( i=0; i<n; i++ )
-   for( j=0; j<n; j++ ) {
-     c[i][j] = 0.0;
-     for( k=i; k<n; k++ )
-       c[i][j] += a[i][k]*b[k][j];
-   }
- Load imbalance with block distribution.
38 OpenMP Matrix Multiply
- #pragma omp parallel for schedule( static, 1 )  /* cyclic */
- for( i=0; i<n; i++ )
-   for( j=0; j<n; j++ ) {
-     c[i][j] = 0.0;
-     for( k=i; k<n; k++ )
-       c[i][j] += a[i][k]*b[k][j];
-   }
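A compilable sketch of the triangular multiply with cyclic scheduling (the fixed size `TN` and the identity-matrix self-check are assumptions added to make it runnable):

```c
#include <math.h>

#define TN 8
/* Row i of the triangular multiply does O(TN-i) work, so a block split
   overloads the thread holding the first rows. schedule(static, 1)
   deals rows out round-robin (the slides' "cyclic"), balancing load. */
static void tri_matmul(double a[TN][TN], double b[TN][TN], double c[TN][TN])
{
  #pragma omp parallel for schedule( static, 1 )
  for( int i = 0; i < TN; i++ )
    for( int j = 0; j < TN; j++ ) {
      c[i][j] = 0.0;
      for( int k = i; k < TN; k++ )   /* a[i][k] == 0 below the diagonal */
        c[i][j] += a[i][k] * b[k][j];
    }
}

/* Self-check: with B = I, the product must reproduce A. */
int tri_selftest(void)
{
  double a[TN][TN], b[TN][TN] = {{0}}, c[TN][TN];
  for( int i = 0; i < TN; i++ ) {
    b[i][i] = 1.0;                    /* identity */
    for( int j = 0; j < TN; j++ )
      a[i][j] = (j >= i) ? (double)(i + j + 1) : 0.0;  /* upper-triangular */
  }
  tri_matmul( a, b, c );
  for( int i = 0; i < TN; i++ )
    for( int j = 0; j < TN; j++ )
      if( fabs(c[i][j] - a[i][j]) > 1e-12 ) return 0;
  return 1;
}
```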
39 Data Environment Directives (1 of 2)
- All variables are by default shared.
- One exception: the loop variable of a parallel for is private.
- By using data directives, some variables can be made private or given other special characteristics.
40 Reminder: Matrix Multiply
- #pragma omp parallel for
- for( i=0; i<n; i++ )
-   for( j=0; j<n; j++ ) {
-     c[i][j] = 0.0;
-     for( k=0; k<n; k++ )
-       c[i][j] += a[i][k]*b[k][j];
-   }
- a, b, c are shared
- i, j, k are private (j and k only if declared inside the loop or listed in a private clause; by default only the parallel loop variable i is private)
41 Data Environment Directives (2 of 2)
- Private
- Threadprivate
- Reduction
42 Private Variables
- #pragma omp parallel for private( list )
- Makes a private copy for each thread of each variable in the list.
- This and all further examples use parallel for, but the same applies to other region and work-sharing directives.
43 Private Variables Example (1 of 2)
- for( i=0; i<n; i++ ) {
-   tmp = a[i];
-   a[i] = b[i];
-   b[i] = tmp;
- }
- Swaps the values in a and b.
- Loop-carried dependence on tmp.
- Easily fixed by privatizing tmp.
44 Private Variables Example (2 of 2)
- #pragma omp parallel for private( tmp )
- for( i=0; i<n; i++ ) {
-   tmp = a[i];
-   a[i] = b[i];
-   b[i] = tmp;
- }
- Removes dependence on tmp.
- Would be more difficult to do in Pthreads.
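The privatized swap as a compilable sketch (the function wrapper is added for illustration):

```c
/* Element-wise swap of a and b. Privatizing tmp gives each thread its
   own scratch copy, removing the loop-carried dependence; the loop
   index of a parallel for is private automatically. */
void swap_arrays(int *a, int *b, int n)
{
  int i, tmp;
  #pragma omp parallel for private( tmp )
  for( i = 0; i < n; i++ ) {
    tmp = a[i];
    a[i] = b[i];
    b[i] = tmp;
  }
}
```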
45 Private Variables: Alternative 1
- for( i=0; i<n; i++ ) {
-   tmp[i] = a[i];
-   a[i] = b[i];
-   b[i] = tmp[i];
- }
- Requires a sequential program change.
- Wasteful in space, O(n) vs. O(p).
46 Private Variables: Alternative 2
- f()
- {
-   int tmp; /* local allocation on stack */
-   for( i=from; i<to; i++ ) {
-     tmp = a[i];
-     a[i] = b[i];
-     b[i] = tmp;
-   }
- }
47 Threadprivate
- Private variables are private on a parallel region basis.
- Threadprivate variables are global variables that are private throughout the execution of the program.
48 Threadprivate
- #pragma omp threadprivate( list )
- Example: #pragma omp threadprivate( x )
- Requires program change in Pthreads:
- requires an array of size p,
- access as x[pthread_self()],
- costly if accessed frequently.
- Not cheap in OpenMP either.
49 Reduction Variables
- #pragma omp parallel for reduction( op: list )
- op is one of +, *, -, &, |, ^, &&, or ||.
- The variables in list must be used with this operator in the loop.
- The variables are automatically initialized to sensible values.
50 Reduction Variables: Example
- #pragma omp parallel for reduction( +: sum )
- for( i=0; i<n; i++ )
-   sum += a[i];
- sum is automatically initialized to zero.
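A compilable reduction sketch (the dot-product wrapper is added for illustration):

```c
/* Dot product with a sum reduction: each thread accumulates into its
   own copy of s (initialized to 0, the identity for +), and the copies
   are combined when the loop ends. */
double dot(const double *a, const double *b, int n)
{
  double s = 0.0;
  #pragma omp parallel for reduction(+: s)
  for( int i = 0; i < n; i++ )
    s += a[i] * b[i];
  return s;
}
```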
51 SOR Sequential Code with Convergence
- for( ; diff > delta; ) {
-   for( i=0; i<n; i++ )
-     for( j=0; j<n; j++ )
-       temp[i][j] = ... ;  /* as on the previous slides */
-   diff = 0;
-   for( i=0; i<n; i++ )
-     for( j=0; j<n; j++ ) {
-       diff = max(diff, fabs(grid[i][j] - temp[i][j]));
-       grid[i][j] = temp[i][j];
-     }
- }
52 OpenMP SOR with Convergence (First Try)
- for( ; diff > delta; ) {
-   #pragma omp parallel for
-   for( i=0; i<n; i++ )
-     for( j=0; j<n; j++ )
-       temp[i][j] = ... ;  /* as before */
-   diff = 0;
-   #pragma omp parallel for reduction( max: diff )
-   for( i=0; i<n; i++ )
-     for( j=0; j<n; j++ ) {
-       diff = max(diff, fabs(grid[i][j] - temp[i][j]));
-       grid[i][j] = temp[i][j];
-     }
- }
53 OpenMP SOR with Convergence (First Try, cont.)
- (Same code as the previous slide.)
- Bummer: no reduction operator for max or min.
- (True of OpenMP at the time; max and min reductions for C/C++ were added in OpenMP 3.1.)
54 Synchronization Primitives
- Critical
- #pragma omp critical (name)
- Implements critical sections, by name.
- Similar to Pthreads mutex locks (name ~ lock).
- Barrier
- #pragma omp barrier
- Implements a global barrier.
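A compilable sketch of a named critical section protecting a shared counter (the counter and function are illustrative):

```c
int counter = 0;

/* n increments of a shared counter from a work-shared loop. The named
   critical section makes each read-modify-write atomic, so the final
   value is exactly n no matter how many threads run the loop. */
void count_in_parallel(int n)
{
  #pragma omp parallel for
  for( int i = 0; i < n; i++ ) {
    #pragma omp critical (count_lock)
    counter++;
  }
}
```

Without the critical section, two threads could read the same old value of `counter` and lose an increment.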
55 OpenMP SOR with Convergence (1 of 2)
- #pragma omp parallel private( mydiff )
- for( ; diff > delta; ) {
-   #pragma omp for nowait
-   for( i=from; i<to; i++ )
-     for( j=0; j<n; j++ )
-       temp[i][j] = ... ;  /* as before */
-   diff = 0.0;
-   mydiff = 0.0;
-   #pragma omp barrier
-   ...
56 OpenMP SOR with Convergence (2 of 2)
- ...
-   #pragma omp for nowait
-   for( i=from; i<to; i++ )
-     for( j=0; j<n; j++ ) {
-       mydiff = max(mydiff, fabs(grid[i][j] - temp[i][j]));
-       grid[i][j] = temp[i][j];
-     }
-   #pragma omp critical
-   diff = max( diff, mydiff );
-   #pragma omp barrier
- }
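Putting the two slides together, here is one compilable version of the scheme. The grid size, the fixed zero boundary, the use of `single` to reset `diff` (instead of every thread writing it), and the self-test are assumptions added to make it runnable; the slides' `from`/`to` bounds are handled by letting `omp for` split the loops:

```c
#include <math.h>

#define SN 16
static double grid[SN][SN], temp[SN][SN];

/* Jacobi-style SOR with a convergence test, in the slides' structure:
   one parallel region, nowait on the update loop, a private mydiff
   folded into the shared diff inside a critical section, and barriers
   separating the phases. */
void sor(double delta)
{
  double diff = delta + 1.0;        /* shared; forces one iteration */
  #pragma omp parallel
  {
    double mydiff;                  /* private: declared inside region */
    while( diff > delta ) {
      #pragma omp for               /* implicit barrier at the end */
      for( int i = 1; i < SN-1; i++ )
        for( int j = 1; j < SN-1; j++ )
          temp[i][j] = 0.25 *
            ( grid[i-1][j] + grid[i+1][j] +
              grid[i][j-1] + grid[i][j+1] );
      #pragma omp single            /* one thread resets the global diff */
      diff = 0.0;
      mydiff = 0.0;
      #pragma omp for nowait        /* no barrier: go straight to critical */
      for( int i = 1; i < SN-1; i++ )
        for( int j = 1; j < SN-1; j++ ) {
          double d = fabs( grid[i][j] - temp[i][j] );
          if( d > mydiff ) mydiff = d;
          grid[i][j] = temp[i][j];
        }
      #pragma omp critical
      if( mydiff > diff ) diff = mydiff;
      #pragma omp barrier           /* everyone sees the final diff */
    }
  }
}

/* Self-check: zero boundary, interior all 1.0; after convergence the
   interior must have decayed well below its initial value. */
int sor_selftest(void)
{
  for( int i = 0; i < SN; i++ )
    for( int j = 0; j < SN; j++ )
      grid[i][j] = (i == 0 || j == 0 || i == SN-1 || j == SN-1) ? 0.0 : 1.0;
  sor( 0.001 );
  return grid[SN/2][SN/2] < 0.9;
}
```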
57 Synchronization Primitives
- Big bummer: no condition variables.
- Result: must busy-wait for condition synchronization.
- Clumsy.
- Very inefficient on some architectures.
58 PIPE: Sequential Program
- for( i=0; i<num_pics, read(in_pic); i++ ) {
-   int_pic_1 = trans1( in_pic );
-   int_pic_2 = trans2( int_pic_1 );
-   int_pic_3 = trans3( int_pic_2 );
-   out_pic = trans4( int_pic_3 );
- }
59 Sequential vs. Parallel Execution
- [Figure: timelines of sequential and parallel execution; color = picture, horizontal line = processor.]
60 PIPE: Parallel Program
- P0: for( i=0; i<num_pics, read(in_pic); i++ ) {
-       int_pic_1[i] = trans1( in_pic );
-       signal( event_1_2[i] );
-     }
- P1: for( i=0; i<num_pics; i++ ) {
-       wait( event_1_2[i] );
-       int_pic_2[i] = trans2( int_pic_1[i] );
-       signal( event_2_3[i] );
-     }
61 PIPE: Main Program
- #pragma omp parallel sections
- {
-   #pragma omp section
-   stage1();
-   #pragma omp section
-   stage2();
-   #pragma omp section
-   stage3();
-   #pragma omp section
-   stage4();
- }
62 PIPE: Stage 1
- void stage1()
- {
-   num1 = 0;
-   for( i=0; i<num_pics, read(in_pic); i++ ) {
-     int_pic_1[i] = trans1( in_pic );
-     #pragma omp critical (c1)
-     num1++;
-   }
- }
63 PIPE: Stage 2
- void stage2()
- {
-   for( i=0; i<num_pics; i++ ) {
-     do {
-       #pragma omp critical (c1)
-       cond = (num1 <= i);
-     } while( cond );
-     int_pic_2[i] = trans2( int_pic_1[i] );
-     #pragma omp critical (c2)
-     num2++;
-   }
- }
64 OpenMP PIPE
- Note the need to exit the critical section while waiting.
- Otherwise, no access by the other thread.
- Never busy-wait inside a critical section!
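The first two stages can be combined into one compilable sketch. `trans1`/`trans2`, the picture arrays, and `NPICS` are stand-ins for the slides' image-processing code, and the pattern assumes the runtime gives the producer section a chance to run (with too few threads and an unlucky section schedule, a busy-wait pipeline like this can spin forever):

```c
#define NPICS 8

int pic1[NPICS], pic2[NPICS];
int num1 = 0;                     /* pictures completed by stage 1 */

static int trans1(int x) { return x + 1; }   /* stand-in transforms */
static int trans2(int x) { return x * 2; }

void pipeline(void)
{
  #pragma omp parallel sections
  {
    #pragma omp section           /* stage 1: producer */
    for( int i = 0; i < NPICS; i++ ) {
      pic1[i] = trans1( i );
      #pragma omp critical (c1)
      num1++;                     /* publish: item i is ready */
    }
    #pragma omp section           /* stage 2: consumer */
    for( int i = 0; i < NPICS; i++ ) {
      int cond;
      do {                        /* busy-wait OUTSIDE the critical */
        #pragma omp critical (c1)
        cond = (num1 <= i);
      } while( cond );
      pic2[i] = trans2( pic1[i] );
    }
  }
}
```

Entering and leaving the critical section on every test is what lets stage 1 get in to update `num1`; holding it across the wait would deadlock the pipeline.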