Title: OpenMP 3.0
1 OpenMP 3.0
- Courtesy of the OpenMP committee and Federico Massaioli, CASPUR
2
- Oct 1997: 1.0 Fortran
- Oct 1998: 1.0 C/C++
- Nov 1999: 1.1 Fortran (interpretations added)
- Nov 2000: 2.0 Fortran
- Mar 2002: 2.0 C/C++
- May 2005: 2.5 Fortran/C/C++ (mostly a merge)
- Main targets
- rapid, good-enough speedup on small SMPs
- with an incremental approach to parallelization
- without preventing scalability on big SMPs
- while trying to stay compatible with the serial version
3 OpenMP 3.0
- Under active discussion by the OpenMP ARB
- Over 50 issues have been discussed
- 15 new proposals may get into OpenMP 3.0
- Weekly meetings, email, wiki
4 OpenMP in real life
- Rapidly adopted
- at least for the first specs
- some slowdown for the latest ones
- some runtimes not up to user expectations
- Mostly used in HPC and scientific computing
- plus some ISVs
- plus some numerical libraries
- plus some growing interest in other fields
- Typical parallelization approach
- incrementally parallelize, checking for correctness
- most frequent pattern: parallel (a.k.a. worksharing) loops
- then take care of data locality
5 The parallel construct

    #pragma omp parallel
    {
      // I'm not alone, anymore
      // do this
      // do that
      switch (omp_get_thread_num()) {
      case 0:
        // ...
      case 1:
        // ...
      }
    }

- redundant, asynchronous execution
- Most users do not even realize how powerful this is in itself
6 Worksharing: the loop construct

    #pragma omp for
    for (i = 0; i < N; i++) {
      // ...
    }

[diagram: redundant, asynchronous execution; the iteration space is split into chunks 1..n, distributed among the threads]
- The number of iterations must be known at runtime each time the construct is entered, and must be the same for each thread
7 Worksharing: the sections construct

    #pragma omp sections
    {
      #pragma omp section
      {
        // do this
      }
      #pragma omp section
      {
        // do that
      }
      // ...
    }

[diagram: redundant, asynchronous execution; sections 1..n distributed among the threads]
- The omp section directive must be closely nested in a sections construct, where nothing but omp section is allowed!
8 Worksharing: the single construct

    #pragma omp single nowait
    {
      // I'm the only one working
    }

- More complex than it appears, if the barrier is removed!
9 And more stuff
- Data scoping clauses
- shared
- private
- Synchronization and ordering
- barriers
- critical sections
- atomic updates of scalar types
- ordered sections
- Memory consistency
- the dreaded omp flush
- Syntactic sugar
- combined parallel and worksharing constructs
- initialization of private variables, reductions
- seems very sweet to average programmers
10 Pointer chasing loops in OpenMP?

    for (p = list; p; p = p->next)
      process(p->item);

- Doesn't suit an omp for: the number of iterations is not known in advance
- Transformation to a canonical loop can be very labour-intensive/inefficient
11 omp single nowait is our friend

    #pragma omp parallel private(p)
    {
      for (p = list; p; p = p->next) {
        #pragma omp single nowait
        process(p->item);
      }
    }

- Each thread redundantly iterates through the loop
- For each value of p, only one thread is allowed to enter the single construct
- Very few people realize this
- Compiler/runtime developers sometimes among them, unfortunately
12 Some experimental measurements
[chart: runtime of omp for (including the time to build an array of pointers) vs. omp single nowait, at 100 and 10 units of work per list element]
13 Something unfeasible

    void preorder(node *p) {
      process(p->data);
      #pragma omp parallel num_threads(2)
      {
        if (p->left)
          #pragma omp single nowait
          preorder(p->left);
        if (p->right)
          #pragma omp single nowait
          preorder(p->right);
      }
    }

- Because worksharing constructs can't be closely nested
- And stressing nested parallelism so much is not a good idea
14 More good reasons
- Multiblock grids on complex topologies, multiresolution grids, immersed grids, AMR
- Fluid-structure interactions in the presence of moving parts
- Agent-based models
- immunology
- financial market simulations
- Complex bodies, hierarchical assemblies of moving parts
- robotics
- manoeuvring
- Interaction of many different components
- SPICE3: T. Weng, R. Perng, B. Chapman, IWOMP '06
15 Tasks in OpenMP
- OpenMP 3.0
- expected release in 2007
- tasks are the critical path
- Backward compatibility
- are they consistent with the present standard?
- do innovations break something?
- User expectations
- are they powerful and general enough?
- do they break users' assumptions?
- Viability
- can they be efficiently compiled? (outlining vs. MET)
- is good performance attainable?
- Forward compatibility
- are we blocking future roads?
16 Tasks
- Adding tasking is the biggest addition for 3.0
- Being worked on by a separate subcommittee
- Led by Jay Hoeflinger at Intel
- Re-examined the issue from the ground up
- not rubber-stamping Intel taskqs
- Main ideas are agreed; still working on some details
- This is a snapshot of the current status, not the final proposal
- things may still change!
17 UPC dynamic sections

    #pragma omp sections
    {
      for (p = list; p; p = p->next) {
        #pragma omp section
        {
          process(p->data);
        }
      }
    }

- Nanos Mercurium research compiler, Nth-Lib runtime
- Too restrictive: no nested sections allowed
- Valuable proof-of-concept
- Interesting results on many codes (including a web server)
18 Intel workqueueing

    void preorder(node *p) {
      #pragma intel omp taskq
      {
        process(p->data);
        if (p->left)
          #pragma intel omp task
          preorder(p->left);
        if (p->right)
          #pragma intel omp task
          preorder(p->right);
      }
    }

- Powerful model, used in real codes
- FLAME project, UT Austin
- Performance issues (pushes work to other threads)
- Nested taskqs have a different behavior
19 Cilk

    cilk void postorder(node *p) {
      if (p->left)
        spawn postorder(p->left);
      if (p->right)
        spawn postorder(p->right);
      sync;
      process(p);
    }

- Why relevant to OpenMP tasks?
- the work-first principle minimizes overhead
- work stealing puts the burden on idle threads
- Why not relevant to OpenMP tasks?
- Cilk procedures create a scope
- OpenMP has directives; scope is dynamic
20 Task directive

    #pragma omp task <clause>
      <structured block>

- A task is generated each time a thread (the encountering thread) encounters a task directive.
- A task is executed by a thread, called the task-thread.
- A task is possibly-deferred work. The task-thread may be the encountering thread or any other thread in the encountering thread's team.
- A task barrier blocks the thread that encounters it until a set of associated tasks is completed.
- Any thread may execute pending tasks when it is waiting at a task barrier that it encounters, or at a team barrier for its team.
- A given task must be completed by the next task barrier to which it is associated or the next team barrier of the team containing its encountering thread, whichever comes first.
21 Barriers
- Two types of task barriers
- taskwait: the thread waits here until all tasks generated in the current task (or thread, if there is no task) are complete

    #pragma omp taskwait

- taskgroup: the thread waits at the end of the structured block until all tasks generated by the execution of the structured block are complete

    #pragma omp taskgroup
      <structured block>

- Thread team barriers (implicit and explicit)

    #pragma omp barrier

- Implicit barrier at the end of the structured block for parallel or any worksharing construct
- Task behavior is the same as for taskwait
22 Tasks

    void preorder(node *p) {
      process(p->data);
      if (p->left)
        #pragma omp task
        preorder(p->left);
      if (p->right)
        #pragma omp task
        preorder(p->right);
    }

- Each task will be executed by a single thread, in unspecified order, possibly deferred
- Tasks can be closely nested (a task can directly generate other tasks)
- No task queue concept: it's up to the implementation
- More possibilities for optimization
23 Dependencies among tasks

    void postorder(node *p) {
      #pragma omp taskgroup
      {
        if (p->left)
          #pragma omp task
          postorder(p->left);
        if (p->right)
          #pragma omp task
          postorder(p->right);
      }   // suspend point
      process(p->data);
    }

- Parent task suspended until children tasks complete
- Structured directive (as opposed to standalone barriers)
- clearly defines the scope
- more space for optimizations
24 A real code example (very simplified)

    void traversal(node *p, void (*f)(node *)) {
      #pragma omp taskgroup
      {
        if (p->left)
          #pragma omp task
          traversal(p->left, f);
        if (p->right)
          #pragma omp task
          traversal(p->right, f);
      }
      f(p->data);
    }

- Hierarchical description of a complex body
- Used for both dependent and independent updates
- not an issue in a serial code
- a serious concurrency reduction in a parallel code
25 A possible approach

    void traversal(node *p, void (*f)(node *), int po) {
      #pragma omp taskgroup if(po)
      {
        if (p->left)
          #pragma omp task
          traversal(p->left, f, po);
        if (p->right)
          #pragma omp task
          traversal(p->right, f, po);
      }
      f(p->data);
    }

- More concurrency for independent node updates
- Without code duplication
- Negligible performance impact, in most cases
- Really better than an unstructured task barrier?
26 Switches, switches, all those switches
- Task switching: the ability of a thread to suspend a task and execute another one
- needed to avoid an explosion of internal data structures
- avoids starvation, enhances parallelism
- Thread switching: the property of a suspended task that it may be resumed by a different thread
- needed for work-first (particularly on parent tasks)
- aggressively avoids starvation and enhances parallelism
- A restricted set of suspend points (at task and taskgroup boundaries)
- Flush operations must be automatically inserted
- Some people are scared, particularly by thread switching
- see data issues later
- thread switching must be switchable on or off
27 Data scoping rules for tasks
- Thread switching is a BIG change
- private in OpenMP means private to a thread
- care must be exercised across switches
- Variables must be private to the task
- Can you rely on threadprivate storage?
- How do children tasks inherit the parent's private storage?
- capture: a private copy of the parent's value
- access: the child task modifies the parent task's storage
- data persistence issues in the latter case, a nightmare for the compiler to spot
28 Task directive
- Data sharing attribute clauses (final names not determined)
- <captureaddress>
- <capturevalue>
- <taskprivate>
- Can be nested inside worksharing constructs
- Defaults for data sharing attributes are under discussion
- Each task is executed by a single thread, although it can use a parallel directive within a task
- Tasks can be nested inside other tasks
29 What should the default be?
- Child accesses parent storage by default
- implicit task barriers at the end of tasks and routines
- safety by default for less-than-average users
- potentially reduces concurrency, hampers optimization
- users could take synchronization for granted
- Child storage private/captured by default
- access to parent storage only legal inside a taskgroup
- cleaner, code more readable
- potential for optimization and performance
- First analyses on test codes
- few accesses to parent storage from children tasks
- extensive manual privatization needed in the other case
- probably not a big issue to sort out
30 Tasks everywhere?
- Some people think of turning the present OpenMP constructs into syntactic sugar for tasks
- uniform units of work for the runtime to manage
- CMPs with many cores
- Threads are so 1990s! Geez!!!
- Almost trivial to do for worksharing constructs
- What about code in a parallel region?
- think of many, many cores without OOO
- the compiler could split everything into tasks
- making tasks a worksharing model will not work
- One of the hardest things to be compatible with
31 Conclusions (well, not really)
- Apparently a lot of choices to make, but
- interrelationships are being sorted out
- the number of choices is being reduced
- consensus is building up
- Work on test codes, real and synthetic
- a lot of rewarding work
- useful to motivate the proposal, as well
- most interesting codes cannot be shared
- testing potential concurrency and code readability
- Lack of a reference implementation
- Forward compatibility must be further investigated
32 Still To Be Done
- Determine other clauses for a task
- Reduction?
- Ordered?
- What is the default data sharing attribute for variables not specified in a task clause?
- Decide names for everything
- Make a reference implementation, evaluate it
- Craft good wording for a proposal
33 Nested parallelism
- Better support for nested parallelism
- Multiple internal control variables
- Allows, for example, calling omp_set_num_threads() inside a parallel region
- Library routines to determine the depth of nesting and the IDs of parent/grandparent etc. threads
- Allow threadprivate variables to persist across inner parallel regions
- Looking for a way of describing the nesting structure up-front so the runtime can make intelligent thread placement decisions
34 Parallel loops
- Guarantee that this works:

    !$omp do schedule(static)
    do i = 1, n
      a(i) = ....
    end do
    !$omp end do nowait
    !$omp do schedule(static)
    do i = 1, n
      .... = a(i)
    end do
35 Loops (cont.)
- Allow collapsing of perfectly nested loops

    !$omp parallel do collapse(2)
    do i = 1, n
      do j = 1, n
        .....
      end do
    end do

- Will form a single loop and then parallelize that
36 Loops (cont.)
- Make schedule(runtime) more useful
- it can be set with a library routine
- allow implementations to provide their own schedule kinds
- Add a new schedule kind which gives the runtime full freedom to determine the scheduling of iterations to threads
37 Portable control of threads
- Add an environment variable to control the size of child threads' stacks
- Add an environment variable to hint to the runtime how to treat idle threads
- ACTIVE: keep threads alive at barriers/locks
- PASSIVE: try to release the processor at barriers/locks
- not set: use implementation-specific controls
38 NUMA support
- The leading candidate is a MIGRATE_NEXT_TOUCH directive
- move the data to the node where the next thread to access it is running
- would be ignored on UMA systems, or on NUMA systems that can't support it
- Still up for discussion and not certain to make it into 3.0
39 Thread Subteam
- Organize threads into different subteams
- Introduce intra-team synchronizations
- Support thread topology
- May support thread mapping to hardware in the future
40 Placement of Threads on Machine
[diagram: an SMP node with several CPUs, each with several cores; the threads are split into thread subteam 1 and thread subteam 2, each with its own intra-team barrier, plus a global barrier across the whole team]
41 BT-MZ Performance with Subteams
Courtesy of H. Jin, NASA Ames
42 BT-MZ Performance with Subteams
Number of zone groups = number of subteams
43 Odds and ends
- Allow unsigned ints in parallel for loops
- Disallow use of the original variable as the master thread's private variable
- Make it clearer where/how private objects are constructed/destructed
- Relax some restrictions on allocatable arrays and Fortran pointers
- Plug some minor gaps in the memory model
- Improve the C/C++ grammar