Title: Shared Memory Parallel Programming
1. Shared Memory Parallel Programming
2. OpenMP Overview

- OpenMP: An API for Writing Multithreaded Applications
- A set of compiler directives and library routines for parallel application programmers
- Greatly simplifies writing multi-threaded (MT) programs in Fortran, C and C++
- Standardizes the last 20 years of SMP practice

A sampler of OpenMP syntax:

    C$OMP FLUSH
    #pragma omp critical
    C$OMP THREADPRIVATE(/ABC/)
    CALL OMP_SET_NUM_THREADS(10)
    C$OMP parallel do shared(a, b, c)
    call omp_test_lock(jlok)
    call OMP_INIT_LOCK (ilok)
    C$OMP MASTER
    C$OMP ATOMIC
    C$OMP SINGLE PRIVATE(X)
    setenv OMP_SCHEDULE "dynamic"
    C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)
    C$OMP ORDERED
    C$OMP PARALLEL REDUCTION (+: A, B)
    C$OMP SECTIONS
    #pragma omp parallel for private(A, B)
    !$OMP BARRIER
    C$OMP PARALLEL COPYIN(/blk/)
    C$OMP DO lastprivate(XX)
    Nthrds = OMP_GET_NUM_PROCS()
    omp_set_lock(lck)

The name OpenMP is the property of the OpenMP Architecture Review Board.
3. OpenMP Programming Model

- Master thread spawns a team of threads as needed
- Parallelism is added incrementally until the desired performance is achieved, i.e. the sequential program evolves into a parallel program

(Figure: fork-join execution - the master thread forks a team of threads at each parallel region and joins them when the region ends.)
4. Life is Short, Remember?

It's official: OpenMP is easier to use than MPI!
5. How Mainstream Can You Be?

- Based firmly upon prior experience (PCF)
- Simplified and streamlined existing APIs
- High-level programming model
  - Programmer makes strategic decisions
  - Compiler figures out details
- Generally available in standard commercial compilers
  - Including Microsoft, and now GNU
- Research compilers: Omni, OpenUH, PCOMP, etc.
6. The OpenMP ARB

- OpenMP is maintained by the OpenMP Architecture Review Board (the ARB), which
  - Interprets OpenMP
  - Writes new specifications - keeps OpenMP relevant
  - Works to increase the impact of OpenMP
- Members are organizations - not individuals
- Current members
  - Permanent: Cray, Fujitsu, HP, IBM, Intel, MS, NEC, PGI, SGI, Sun
  - Auxiliary: ASCI, cOMPunity, EPCC, KSL, NASA, RWTH Aachen

www.compunity.org
7. OpenMP Release History

- 1997: OpenMP Fortran 1.0
- 1998: OpenMP C/C++ 1.0
- 1999: OpenMP Fortran 1.1
- 2005: OpenMP 2.5 - a single specification for Fortran, C and C++
8. OpenMP 2.5
- Merged language-specific APIs
- Fixed minor problems
- Reorganized material
- Improved specification of nested parallelism
- Internal control variables
- Fixed the flush (memory model)
9. Where will OpenMP be Relevant in Future?

It's either multithreading, or a real heat wave: simultaneous multithreading, hyperthreading, chip multithreading, streaming.
10. OpenMP Definitions: Constructs vs. Regions in OpenMP

OpenMP constructs occupy a single compilation unit, while a region can span multiple source files.

poo.f - a parallel construct:

    C$OMP PARALLEL
          call whoami
    C$OMP END PARALLEL

bar.f - the code it calls:

          subroutine whoami
          external omp_get_thread_num
          integer iam, omp_get_thread_num
          iam = omp_get_thread_num()
    C$OMP CRITICAL
          print *, 'Hello from ', iam
    C$OMP END CRITICAL
          return
          end

The parallel region is the text of the construct plus any code called from inside the construct. Orphan constructs, such as the CRITICAL in bar.f, can execute outside the lexical extent of a parallel construct.
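For readers following in C, here is a minimal single-file sketch of the same idea (do_work and the comment marking a notional file split are our illustration, not from the slides):

    #include <stdio.h>
    #include <omp.h>

    /* In the slides this lives in a separate file. The critical
       construct below is "orphaned": it appears outside the lexical
       extent of any parallel construct, yet executes inside the
       parallel region created in main(). */
    void do_work(void) {
        int iam = omp_get_thread_num();
        #pragma omp critical
        printf("Hello from %d\n", iam);
    }

    int main(void) {
        #pragma omp parallel   /* the region extends into do_work() */
        do_work();
        return 0;
    }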
11. Parallel Regions

- You create threads in OpenMP with the omp parallel pragma.
- For example, to create a 4-thread parallel region:

    double A[1000];
    omp_set_num_threads(4);              /* runtime function to request a certain number of threads */
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();   /* runtime function returning a thread ID */
        pooh(ID, A);                     /* each thread executes a copy of the code within the structured block */
    }

- Each thread calls pooh(ID,A) for ID = 0 to 3
12. Parallel Regions

- You create threads in OpenMP with the omp parallel pragma.
- For example, to create a 4-thread parallel region, this time using a clause to request a certain number of threads:

    double A[1000];
    #pragma omp parallel num_threads(4)
    {
        int ID = omp_get_thread_num();   /* runtime function returning a thread ID */
        pooh(ID, A);                     /* each thread executes a copy of the code within the structured block */
    }

- Each thread calls pooh(ID,A) for ID = 0 to 3
13. Parallel Regions

    double A[1000];
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        pooh(ID, A);
    }
    printf("all done\n");

- Each thread executes the same code redundantly: pooh(0,A), pooh(1,A), pooh(2,A) and pooh(3,A) run concurrently.
- A single copy of A is shared between all threads.
- Threads wait at the end of the parallel region for all threads to finish before proceeding (i.e. a barrier), so printf("all done\n") runs only after every thread is done.
14. Exercise: A multi-threaded "Hello world" program

- Write a multithreaded program where each thread prints "hello world". Start from this sequential version:

    void main() {
        int ID = 0;
        printf(" hello(%d) ", ID);
        printf(" world(%d) \n", ID);
    }
15. A multi-threaded "Hello world" program

- Write a multithreaded program where each thread prints "hello world".

    #include <omp.h>                         /* OpenMP include file */
    void main() {
        #pragma omp parallel                 /* parallel region with default number of threads */
        {
            int ID = omp_get_thread_num();   /* runtime library function to return a thread ID */
            printf(" hello(%d) ", ID);
            printf(" world(%d) \n", ID);
        }                                    /* end of the parallel region */
    }

Sample output: hello(1) hello(0) world(1) world(0) hello(3) hello(2) world(3) world(2)
16. Parallel Regions and the if clause: Active vs. inactive parallel regions

- An optional if clause causes the parallel region to be active only if the logical expression within the clause evaluates to true.
- An if clause that evaluates to false causes the parallel region to be inactive (i.e. executed by a team of size one).

    double A[N];
    #pragma omp parallel if (N > 1000)
    {
        int ID = omp_get_thread_num();
        pooh(ID, A);
    }
17. OpenMP Work-Sharing Constructs

- The for work-sharing construct splits up loop iterations among the threads in a team:

    #pragma omp parallel
    #pragma omp for
    for (I = 0; I < N; I++) {
        NEAT_STUFF(I);
    }

By default, there is a barrier at the end of the omp for. Use the nowait clause to turn off the barrier:

    #pragma omp for nowait

nowait is useful between two consecutive, independent omp for loops, as in the sketch below.
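A minimal sketch of that pattern (the arrays and loop bodies are our illustration): threads that finish the first loop move straight into the second without waiting for stragglers, which is safe only because the two loops touch independent data.

    #include <omp.h>

    #define N 1000
    float a[N], b[N], c[N], d[N];

    void two_independent_loops(void) {
        #pragma omp parallel
        {
            /* No barrier after this loop: finished threads fall through. */
            #pragma omp for nowait
            for (int i = 0; i < N; i++)
                a[i] = 2.0f * b[i];

            /* Safe only because this loop never reads a[] or b[]. */
            #pragma omp for
            for (int i = 0; i < N; i++)
                c[i] = d[i] + 1.0f;
        }
    }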
18. Work-Sharing Constructs: A motivating example

Sequential code:

    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region (work divided by hand):

    #pragma omp parallel
    {
        int id, i, Nthrds, istart, iend;
        id = omp_get_thread_num();
        Nthrds = omp_get_num_threads();
        istart = id * N / Nthrds;
        iend = (id + 1) * N / Nthrds;
        for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
    }

OpenMP parallel region and a work-sharing for-construct:

    #pragma omp parallel
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
19. OpenMP For/Do construct: The schedule clause

- The schedule clause affects how loop iterations are mapped onto threads (see the sketch after this list):
- schedule(static [,chunk])
  - Deal out blocks of iterations of size chunk to each thread.
- schedule(dynamic [,chunk])
  - Each thread grabs chunk iterations off a queue until all iterations have been handled.
- schedule(guided [,chunk])
  - Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size chunk as the calculation proceeds.
- schedule(runtime)
  - Schedule and chunk size taken from the OMP_SCHEDULE environment variable.
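A minimal sketch of how these clauses appear in code (the loop body and the work() helper are our illustration):

    #include <omp.h>

    #define N 10000
    double x[N];
    double work(int i) { return 0.5 * i; }   /* stand-in computation */

    void schedule_examples(void) {
        /* static: blocks of 100 iterations dealt out to the threads. */
        #pragma omp parallel for schedule(static, 100)
        for (int i = 0; i < N; i++) x[i] = work(i);

        /* dynamic: threads pull chunks of 10 off a shared queue -
           useful when iteration costs vary widely. */
        #pragma omp parallel for schedule(dynamic, 10)
        for (int i = 0; i < N; i++) x[i] = work(i);

        /* runtime: schedule chosen via the environment, e.g.
           setenv OMP_SCHEDULE "dynamic,10" */
        #pragma omp parallel for schedule(runtime)
        for (int i = 0; i < N; i++) x[i] = work(i);
    }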
20. The schedule clause

- static: least work at runtime - scheduling is done at compile-time
- dynamic and guided: most work at runtime - complex scheduling logic is used at run-time
21. The schedule clause

Example: 20 iterations on 6 threads with a static schedule.
- With 3 iterations per thread, one thread is left holding the remaining 5 iterations.
- With 4 iterations per thread, the last thread gets 0 iterations!
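A quick way to see this imbalance is to count iterations per thread; a minimal sketch (chunk size 4 matches the second case above):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int count[6] = {0};
        omp_set_num_threads(6);

        /* Chunks of 4: five threads get one chunk each and
           the sixth thread gets no iterations at all. */
        #pragma omp parallel for schedule(static, 4)
        for (int i = 0; i < 20; i++) {
            #pragma omp atomic
            count[omp_get_thread_num()]++;
        }

        for (int t = 0; t < 6; t++)
            printf("thread %d ran %d iterations\n", t, count[t]);
        return 0;
    }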
22. OpenMP Work-Sharing Constructs

- The sections work-sharing construct gives a different structured block to each thread:

    #pragma omp parallel
    #pragma omp sections
    {
        #pragma omp section
            X_calculation();
        #pragma omp section
            y_calculation();
        #pragma omp section
            z_calculation();
    }

By default, there is a barrier at the end of the omp sections. Use the nowait clause to turn off the barrier.
23. OpenMP Work-Sharing Constructs

- The master construct denotes a structured block that is only executed by the master thread. The other threads just skip it (no synchronization is implied).

    #pragma omp parallel private (tmp)
    {
        do_many_things();
        #pragma omp master
        { exchange_boundaries(); }
        #pragma omp barrier
        do_many_other_things();
    }
24. OpenMP Work-Sharing Constructs

- The single construct denotes a block of code that is executed by only one thread.
- A barrier is implied at the end of the single block.

    #pragma omp parallel private (tmp)
    {
        do_many_things();
        #pragma omp single
        { exchange_boundaries(); }
        do_many_other_things();
    }
25. Combined parallel/work-share

- OpenMP shortcut: put the parallel and the work-share on the same line. These are equivalent:

    double res[MAX]; int i;
    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < MAX; i++) {
            res[i] = huge();
        }
    }

    double res[MAX]; int i;
    #pragma omp parallel for
    for (i = 0; i < MAX; i++) {
        res[i] = huge();
    }

- There's also a parallel sections construct, sketched below.
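A minimal sketch of that combined form (the function names are our illustration):

    #include <omp.h>

    void x_calc(void) { /* hypothetical independent task */ }
    void y_calc(void) { /* hypothetical independent task */ }

    void combined_sections(void) {
        /* parallel + sections fused into a single directive */
        #pragma omp parallel sections
        {
            #pragma omp section
                x_calc();
            #pragma omp section
                y_calc();
        }
    }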
26. Some Examples

- Three examples of application parallelization under OpenMP
- Remember: the application developer gives the parallelization strategy
- The implementation figures out the details of the work to be performed by each thread
- It also maps threads to hardware resources at run time
27. Matrix Multiply

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            c[i][j] = 0.0;

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
28. Parallel Matrix Multiply

- No loop-carried dependences in the i- or j-loops, in either loop nest
- Loop-carried dependence on the k-loop
- All i- and j-iterations can be run in parallel
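To see the k-loop dependence concretely: every k-iteration reads and writes the same c[i][j], i.e. it is an accumulation. A hedged sketch of one way around it, using OpenMP's reduction clause on a scalar temporary (our illustration, not the strategy the slides adopt):

    /* Compute one output element with the k-loop parallelized.
       Each thread keeps a private partial sum; OpenMP combines them. */
    double dot_for_element(int n, double a[n][n], double b[n][n],
                           int i, int j) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+: sum)
        for (int k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];   /* no shared read-modify-write */
        return sum;
    }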
29. Problem Statement

(Figure: matrix multiply - row i of A combines with column j of B to produce element (i, j) of C.)
30. Matrix Multiply

    #pragma omp parallel for
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            c[i][j] = 0.0;

    #pragma omp parallel for
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
31. Parallel Matrix Multiply (contd.)

- OpenMP permits parallelization of only one loop in a loop nest
- We have chosen an approach with coarse granularity
- We could have parallelized the j-loops instead (see the sketch below)
- Performance is influenced by the cost of memory accesses
- May require some experimentation to choose the best strategy

Homework: experiment with OpenMP matrix multiplication.
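For reference, a hedged sketch of the finer-grained alternative mentioned above, parallelizing the j-loop (the function wrapper is our addition; array names follow the earlier slides):

    void matmul_j_parallel(int n, double a[n][n], double b[n][n],
                           double c[n][n]) {
        for (int i = 0; i < n; i++) {
            /* A team divides row i's columns; the fork/join happens
               once per i-iteration, hence the finer granularity. */
            #pragma omp parallel for
            for (int j = 0; j < n; j++) {
                c[i][j] = 0.0;
                for (int k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }
        }
    }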