OpenMP 3.0 - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: OpenMP 3.0
1
OpenMP 3.0
  • Courtesy of the OpenMP committee and Federico
    Massaioli, CASPUR

2
  • Oct 1997  1.0 Fortran
  • Oct 1998  1.0 C/C++
  • Nov 1999  1.1 Fortran (interpretations added)
  • Nov 2000  2.0 Fortran
  • Mar 2002  2.0 C/C++
  • May 2005  2.5 Fortran/C/C++ (mostly a merge)
  • Main targets
  • rapid, good enough speedup on small SMPs
  • with an incremental approach to parallelization
  • without preventing scalability on big SMPs
  • and trying to be compatible with the serial
    version

3
OpenMP 3.0
  • Under active discussion by OpenMP ARB
  • Over 50 issues have been discussed
  • 15 new proposals may get into OpenMP 3.0
  • Weekly meetings, email, wiki

4
OpenMP in real life
  • Rapidly adopted
  • at least for first specs
  • some slow down for the latest ones
  • some runtimes not up to user expectations
  • Mostly used in HPC and scientific computing
  • plus some ISVs
  • plus some numerical libraries
  • plus some growing interest in other fields
  • Typical parallelization approach
  • incrementally parallelize, checking for
    correctness
  • most frequent pattern: parallel (a.k.a.
    worksharing) loops
  • then take care of data locality

5
The parallel construct

    #pragma omp parallel
    {
      // I'm not alone, anymore!
      // do this
      // do that
      switch (omp_get_thread_num()) {
        case 0:
          // ...
        case 1:
          // ...
      }
    }

(figure: redundant, asynchronous execution of the region by all threads)
Most users do not even realize how powerful this
is in itself
6
Worksharing: the loop construct

    #pragma omp for
    for (i = 0; i < N; i++) {
      // ...
      // ...
    }

(figure: redundant, asynchronous execution; the loop iterations are
divided into chunks 1 … n, distributed among the threads)
The number of iterations must be known each
time the construct is entered at runtime, and
must be the same for each thread
7
Worksharing: the sections construct

    #pragma omp sections
    {
      #pragma omp section
      // do this
      #pragma omp section
      // do that
      // ...
    }

(figure: redundant, asynchronous execution; sections 1 … n distributed
among the threads)
The omp section directive must be closely nested
in a sections construct, where nothing but
omp section is allowed!
8
Worksharing: the single construct

    #pragma omp single
    {
      // I'm the only one working
    }

(figure: one thread executes the block; with nowait, the others proceed
without waiting)
More complex than it appears, if the barrier is
removed!
9
And more stuff
  • Data scoping clauses
  • shared
  • private
  • Synchronization and ordering
  • barriers
  • critical sections
  • atomic updates of scalar types
  • ordered sections
  • Memory consistency
  • the dreaded omp flush
  • Syntactic sugar
  • combined parallel and worksharing constructs
  • initialization of private variables, reductions
  • seems very sweet to average programmers

10
Pointer chasing loops in OpenMP?
    for (p = list; p; p = p->next)
      process(p->item);

  • Doesn't suit an omp for: the number of
    iterations is not known in advance
  • Transformation to a canonical loop can be very
    labour-intensive/inefficient

11
omp single nowait is our friend
    #pragma omp parallel private(p)
    for (p = list; p; p = p->next) {
      #pragma omp single nowait
      process(p->item);
    }

  • Each thread redundantly iterates through the loop
  • For each single value of p, only one thread is
    allowed to enter the single construct
  • Very few people realize this
  • Compiler/runtime developers sometimes among them,
    unfortunately

12
Some experimental measurements
(figure: execution-time comparison of omp for (including the time to
build an array of pointers) vs. omp single nowait, at 100 and at 10
units of work per list element)
13
Something unfeasible
    void preorder(node *p)
    {
      process(p->data);
      #pragma omp parallel num_threads(2)
      {
        if (p->left)
          #pragma omp single nowait
          preorder(p->left);
        if (p->right)
          #pragma omp single nowait
          preorder(p->right);
      }
    }

  • Because worksharing constructs can't be closely
    nested
  • And stressing nested parallelism so much is not a
    good idea

14
More good reasons
  • Multiblock grids on complex topologies,
    multiresolution grids, immersed grids, AMR
  • Fluid-structure interactions in presence of
    moving parts
  • Agent based models
  • immunology
  • financial markets simulations
  • Complex bodies, hierarchical assemblies of moving
    parts
  • robotics
  • manoeuvring
  • Interaction of many different components
  • SPICE3, T. Weng, R. Perng, B. Chapman, IWOMP 06

15
Tasks in OpenMP
  • OpenMP 3.0
  • expected release in 2007
  • tasks are the critical path
  • Backward compatibility
  • are they consistent with the present standard?
  • do innovations break something?
  • User expectations
  • are they powerful and general enough?
  • do they break users' assumptions?
  • Viability
  • can they be efficiently compiled? (outlining vs.
    MET)
  • is good performance attainable?
  • Forward compatibility
  • are we blocking future roads?

16
Tasks
  • Adding tasking is the biggest addition for 3.0
  • Being worked on by a separate subcommittee
  • Led by Jay Hoeflinger at Intel
  • Re-examined issue from ground up
  • not rubber-stamping Intel taskqs
  • Main ideas are agreed, still working on some
    details.
  • This is a snapshot of the current status, not the
    final proposal
  • things may still change!

17
UPC dynamic sections
    #pragma omp sections
    for (p = list; p; p = p->next) {
      #pragma omp section
      process(p->data);
    }

  • Nanos Mercurium research compiler, Nth-Lib
    runtime
  • Too restrictive: no nested sections allowed
  • Valuable proof-of-concept
  • Interesting results on many codes (including a
    web server)

18
Intel Workqueueing
    void preorder(node *p)
    {
      #pragma intel omp taskq
      {
        process(p->data);
        if (p->left) {
          #pragma intel omp task
          preorder(p->left);
        }
        if (p->right) {
          #pragma intel omp task
          preorder(p->right);
        }
      }
    }

  • Powerful model, used in real codes
  • FLAME project, UT Austin
  • Performance issues (push work to other threads)
  • Nested taskqs have a different behavior

19
Cilk
    cilk void postorder(node *p)
    {
      if (p->left)
        spawn postorder(p->left);
      if (p->right)
        spawn postorder(p->right);
      sync;
      process(p);
    }

  • Why relevant to OpenMP tasks?
  • work-first principle: minimizes overhead
  • work-stealing: burden on idle threads
  • Why not relevant to OpenMP tasks?
  • Cilk procedures create a scope
  • OpenMP has directives, scope is dynamic

20
Task directive
    #pragma omp task <clause>
    <structured block>

  • A task is generated each time a thread (the
    encountering thread) encounters a task directive.
  • A task is executed by a thread, called the
    task-thread.
  • A task is possibly-deferred work. The
    task-thread may be the encountering thread or any
    other thread in the encountering thread's team.
  • A task barrier blocks the thread that
    encounters it until a set of associated tasks is
    completed.
  • Any thread may execute pending tasks when it is
    waiting at a task barrier that it encounters, or
    at a team barrier for its team.
  • A given task must be completed by the next task
    barrier to which it is associated or the next
    team barrier of the team containing its
    encountering thread, whichever comes first.

21
Barriers
  • Two types of task barriers
  • taskwait: the thread waits here until all tasks
    generated in the current task (or thread, if no
    task) are complete
    #pragma omp taskwait
  • taskgroup: the thread waits at the end of the
    structured block until all tasks generated by the
    execution of the structured block are complete
    #pragma omp taskgroup
    <structured block>
  • Thread team barriers (implicit and explicit)
    #pragma omp barrier
  • Implicit barrier at end of structured block for
    parallel or any worksharing construct
  • Task behavior is the same as for taskwait

22
Tasks
    void preorder(node *p)
    {
      process(p->data);
      if (p->left) {
        #pragma omp task
        preorder(p->left);
      }
      if (p->right) {
        #pragma omp task
        preorder(p->right);
      }
    }

  • Will be executed by a single thread, in
    unspecified order, possibly deferred
  • Tasks can be closely nested (a task can directly
    generate other tasks)
  • No task queue concept, it's up to the
    implementation
  • More possibilities for optimization

23
Dependencies among tasks
    void postorder(node *p)
    {
      #pragma omp taskgroup
      {
        if (p->left) {
          #pragma omp task
          postorder(p->left);
        }
        if (p->right) {
          #pragma omp task
          postorder(p->right);
        }
      }  // suspend point
      process(p->data);
    }

  • Parent task suspended until child tasks
    complete
  • Structured directive (as opposed to standard
    barriers)
  • clearly defines the scope
  • more space for optimizations
24
A real code example (very simplified)
    void traversal(node *p, void (*f)(node *))
    {
      #pragma omp taskgroup
      {
        if (p->left) {
          #pragma omp task
          traversal(p->left, f);
        }
        if (p->right) {
          #pragma omp task
          traversal(p->right, f);
        }
      }
      f(p->data);
    }

  • Hierarchical description of a complex body
  • Used for both dependent and independent updates
  • not an issue in a serial code
  • serious concurrency reduction in a parallel code

25
A possible approach
    void traversal(node *p, void (*f)(node *), int po)
    {
      #pragma omp taskgroup if(po)
      {
        if (p->left) {
          #pragma omp task
          traversal(p->left, f, po);
        }
        if (p->right) {
          #pragma omp task
          traversal(p->right, f, po);
        }
      }
      f(p->data);
    }

  • More concurrency for independent node updates
  • Without code duplication
  • Negligible performance impact, in most cases
  • Really better than an unstructured task barrier?

26
Switches, switches, all those switches
  • Task switching: the ability of a thread to
    suspend a task and execute another one
  • needed to avoid explosion of internal data
    structures
  • avoids starvation, enhances parallelism
  • Thread switching: the property of a suspended
    task to be resumed by a different thread
  • needed for work-first (particularly on parent
    tasks)
  • aggressively avoids starvation and enhances
    parallelism
  • A restricted set of suspend points (at task and
    taskgroup boundaries)
  • Flush operations must be automatically inserted
  • Some people scared, particularly by thread
    switching
  • see data issues later
  • thread switching must be switchable on or off

27
Data scoping rules for tasks
  • Thread switching is a BIG change
  • private in OpenMP means private to a thread
  • care must be exercised across switches
  • Variables must be private to the task
  • Can you rely on threadprivate storage?
  • How do child tasks inherit private parent
    storage?
  • capture: a private copy of the parent value
  • access: the child task modifies parent task
    storage
  • data persistence issues in the latter case, a
    nightmare for the compiler to spot

28
Task Directive
  • Data sharing attribute clauses (final names not
    determined)
  • <captureaddress>
  • <capturevalue>
  • <taskprivate>
  • Can be nested inside worksharing constructs
  • Defaults for data sharing attributes are under
    discussion
  • Each task is executed by a single thread,
    although it can use a parallel directive within a
    task.
  • Tasks can be nested inside other tasks.

29
What should the default be?
  • Children access parent storage by default
  • implicit task barriers at the end of tasks and
    routines
  • safety by default for less-than-average users
  • potentially reduces concurrency, hampers
    optimization
  • users could take synchronization for granted
  • Child storage private/captured by default
  • access to parent storage only legal inside a
    taskgroup
  • cleaner, more readable code
  • potential for optimization and performance
  • First analyses on test codes
  • few accesses to parent storage from child
    tasks
  • extensive manual privatization needed in the
    other case
  • probably not a big issue to sort out

30
Tasks everywhere?
  • Some people think of making present OpenMP
    constructs into syntactic sugar for tasks
  • uniform units of work for the runtime to manage
  • CMPs with many cores
  • Threads are so 1990s! Geez!!!
  • Almost trivial to do for worksharing constructs
  • What about code in a parallel region?
  • think of many, many cores without OoO execution
  • the compiler could split everything into tasks
  • making tasks a worksharing model will not work
  • One of the hardest things to be compatible with

31
Conclusions (well, not really)
  • Apparently a lot of choices to make, but
  • interrelationships are being sorted out
  • number of choices is being reduced
  • consensus is building up
  • Work on test code, real and synthetic
  • a lot of rewarding work
  • useful to motivate the proposal, as well
  • most interesting codes cannot be shared
  • testing potential concurrency and code
    readability
  • Lack of a reference implementation
  • Forward compatibility must be further investigated

32
Still To Be Done
  • Determine other clauses for a task
  • Reduction?
  • Ordered?
  • What is the default data sharing attribute for
    variables not specified in a task clause?
  • Decide names for everything
  • Make a reference implementation, evaluate it.
  • Craft good wording for a proposal.

33
Nested parallelism
  • Better support for nested parallelism
  • Multiple internal control variables
  • Allows, for example, calling omp_set_num_threads()
    inside a parallel region.
  • Library routines to determine depth of nesting
    and IDs of parent/grandparent etc. threads.
  • Allow threadprivate variables to persist across
    inner parallel regions
  • Looking for a way of describing the nesting
    structure up-front so the runtime can make
    intelligent thread placement decisions

34
Parallel loops
  • Guarantee that this works

    !$omp do schedule(static)
    do i = 1, n
      a(i) = ....
    end do
    !$omp end do nowait
    !$omp do schedule(static)
    do i = 1, n
      .... = a(i)
    end do

35
Loops (cont.)
  • Allow collapsing of perfectly nested loops

    !$omp parallel do collapse(2)
    do i = 1, n
      do j = 1, n
        .....
      end do
    end do

  • Will form a single loop and then parallelize
    that

36
Loops (cont.)
  • Make schedule(runtime) more useful
  • can set it with library routine
  • allow implementations to implement their own
    schedule kinds
  • Add a new schedule kind which gives full freedom
    to the runtime to determine the scheduling of
    iterations to threads.


37
Portable control of threads
  • Add an environment variable to control the size of
    child threads' stacks
  • Add an environment variable to hint to the runtime
    how to treat idle threads
  • ACTIVE: keep threads alive at barriers/locks
  • PASSIVE: try to release the processor at
    barriers/locks
  • not set: use implementation-specific controls

38
NUMA support
  • Leading candidate is a MIGRATE_NEXT_TOUCH
    directive
  • move the data to the node where the next thread
    to access it is running
  • would be ignored on UMA systems or NUMA systems
    that can't support it
  • Still up for discussion and not certain to make
    it into 3.0

39
Thread Subteam
  • Organize threads into different subteams
  • Introduce intra-team synchronizations
  • Support thread topologies
  • May support thread mapping to hardware in the
    future

40
Placement of Threads on Machine
(figure: an SMP with several CPUs, each with multiple cores; the threads
are grouped into subteams 1 and 2, with an intra-team barrier within each
subteam and a global barrier across all threads)
41
BT-MZ Performance with Subteams
Courtesy of H. Jin, NASA Ames
42
BT-MZ Performance with Subteams
(number of zone groups = number of subteams)
43
Odds and ends
  • Allow unsigned ints in parallel for loops
  • Disallow use of the original variable as the master
    thread's private variable
  • Make it clearer where/how private objects are
    constructed/destructed
  • Relax some restrictions on allocatable arrays and
    Fortran pointers
  • Plug some minor gaps in the memory model
  • Improve the C/C++ grammar