OpenMP 3.0 - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: OpenMP 3.0
1
OpenMP 3.0
  • Courtesy of the OpenMP committee and Federico
    Massaioli, CASPUR

2
  • Oct 1997  1.0 Fortran
  • Oct 1998  1.0 C/C++
  • Nov 1999  1.1 Fortran (interpretations added)
  • Nov 2000  2.0 Fortran
  • Mar 2002  2.0 C/C++
  • May 2005  2.5 Fortran/C/C++ (mostly a merge)
  • Main targets
  • rapid, good enough speedup on small SMPs
  • with an incremental approach to parallelization
  • without preventing scalability on big SMPs
  • and trying to be compatible with the serial
    version

3
OpenMP 3.0
  • Under active discussion by OpenMP ARB
  • Over 50 issues have been discussed
  • 15 new proposals may get into OpenMP 3.0
  • Weekly meetings, email, wiki

4
OpenMP in real life
  • Rapidly adopted
  • at least for first specs
  • some slow down for the latest ones
  • some runtimes not up to user expectations
  • Mostly used in HPC and scientific computing
  • plus some ISVs
  • plus some numerical libraries
  • plus some growing interest in other fields
  • Typical parallelization approach
  • incrementally parallelize, checking for
    correctness
  • most frequent pattern: parallel (a.k.a.
    worksharing) loops
  • then take care of data locality

5
The parallel construct

    #pragma omp parallel
    {
      // I'm not alone, anymore!
      // do this
      // do that
      switch (omp_get_thread_num()) {
        case 0:
          // ...
        case 1:
          // ...
      }
    }

(figure: redundant, asynchronous execution of the region by all threads)
Most users do not even realize how powerful this
is in itself
6
Worksharing: the loop construct

    #pragma omp for
    for (i = 0; i < N; i++) {
      // ...
      // ...
    }

(figure: redundant, asynchronous execution; the loop iterations are
divided into chunks 1 … n, distributed among the threads)
The number of iterations must be known each
time the construct is entered at runtime, and
must be the same for each thread
7
Worksharing: the sections construct

    #pragma omp sections
    {
      #pragma omp section
      // do this
      #pragma omp section
      // do that
      // ...
    }

(figure: redundant, asynchronous execution; sections 1 … n distributed
among the threads)
The omp section directive must be closely nested
in a sections construct, where nothing but
omp section is allowed!
8
Worksharing: the single construct

    #pragma omp single
    {
      // I'm the only one working
    }

(figure: one thread executes the block; with nowait, the others proceed
without waiting)
More complex than it appears, if the barrier is
removed!
9
And more stuff
  • Data scoping clauses
  • shared
  • private
  • Synchronization and ordering
  • barriers
  • critical sections
  • atomic updates of scalar types
  • ordered sections
  • Memory consistency
  • the dreaded omp flush
  • Syntactic sugar
  • combined parallel and worksharing constructs
  • initialization of private variables, reductions
  • seems very sweet to average programmers

10
Pointer chasing loops in OpenMP?
    for (p = list; p; p = p->next)
      process(p->item);

  • Doesn't suit an omp for: the number of
    iterations is not known in advance
  • Transformation to a canonical loop can be very
    labour-intensive/inefficient

11
omp single nowait is our friend
    #pragma omp parallel private(p)
    for (p = list; p; p = p->next) {
      #pragma omp single nowait
      process(p->item);
    }

  • Each thread redundantly iterates through the loop
  • For each single value of p, only one thread is
    allowed to enter the single construct
  • Very few people realize this
  • Compiler/runtime developers sometimes among them,
    unfortunately

12
Some experimental measurements
(figure: execution-time comparison of omp for (including the time to
build an array of pointers) vs. omp single nowait, at 100 and at 10
units of work per list element)
13
Something unfeasible
    void preorder(node *p)
    {
      process(p->data);
      #pragma omp parallel num_threads(2)
      {
        if (p->left)
          #pragma omp single nowait
          preorder(p->left);
        if (p->right)
          #pragma omp single nowait
          preorder(p->right);
      }
    }

  • Because worksharing constructs can't be closely
    nested
  • And stressing nested parallelism so much is not a
    good idea

14
More good reasons
  • Multiblock grids on complex topologies,
    multiresolution grids, immersed grids, AMR
  • Fluid-structure interactions in presence of
    moving parts
  • Agent based models
  • immunology
  • financial markets simulations
  • Complex bodies, hierarchical assemblies of moving
    parts
  • robotics
  • manoeuvring
  • Interaction of many different components
  • SPICE3, T. Weng, R. Perng, B. Chapman, IWOMP 06

15
Tasks in OpenMP
  • OpenMP 3.0
  • expected release in 2007
  • tasks are the critical path
  • Backward compatibility
  • are they consistent with the present standard?
  • do innovations break something?
  • User expectations
  • are they powerful and general enough?
  • do they break users' assumptions?
  • Viability
  • can they be efficiently compiled? (outlining vs.
    MET)
  • is good performance attainable?
  • Forward compatibility
  • are we blocking future roads?

16
Tasks
  • Adding tasking is the biggest addition for 3.0
  • Being worked on by a separate subcommittee
  • Led by Jay Hoeflinger at Intel
  • Re-examined issue from ground up
  • not rubber-stamping Intel taskqs
  • Main ideas are agreed, still working on some
    details.
  • This is a snapshot of the current status, not the
    final proposal
  • things may still change!

17
UPC dynamic sections
    #pragma omp sections
    for (p = list; p; p = p->next) {
      #pragma omp section
      process(p->data);
    }

  • Nanos Mercurium research compiler, Nth-Lib
    runtime
  • Too restrictive: no nested sections allowed
  • Valuable proof-of-concept
  • Interesting results on many codes (including a
    web server)

18
Intel Workqueueing
    void preorder(node *p)
    {
      #pragma intel omp taskq
      {
        process(p->data);
        if (p->left) {
          #pragma intel omp task
          preorder(p->left);
        }
        if (p->right) {
          #pragma intel omp task
          preorder(p->right);
        }
      }
    }

  • Powerful model, used in real codes
  • FLAME project, UT Austin
  • Performance issues (push work to other threads)
  • Nested taskqs have a different behavior

19
Cilk
    cilk void postorder(node *p)
    {
      if (p->left)
        spawn postorder(p->left);
      if (p->right)
        spawn postorder(p->right);
      sync;
      process(p);
    }

  • Why relevant to OpenMP tasks?
  • work-first principle: minimizes overhead
  • work-stealing: burden on idle threads
  • Why not relevant to OpenMP tasks?
  • Cilk procedures create a scope
  • OpenMP has directives, scope is dynamic

20
Task directive
    #pragma omp task <clause>
    <structured block>

  • A task is generated each time a thread (the
    encountering thread) encounters a task directive.
  • A task is executed by a thread, called the
    task-thread.
  • A task is possibly-deferred work. The
    task-thread may be the encountering thread or any
    other thread in the encountering thread's team.
  • A task barrier blocks the thread that
    encounters it until a set of associated tasks is
    completed.
  • Any thread may execute pending tasks when it is
    waiting at a task barrier that it encounters, or
    at a team barrier for its team.
  • A given task must be completed by the next task
    barrier to which it is associated or the next
    team barrier of the team containing its
    encountering thread, whichever comes first.

21
Barriers
  • Two types of task barriers
  • taskwait: the thread waits here until all tasks
    generated in the current task (or thread, if no
    task) are complete
    #pragma omp taskwait
  • taskgroup: the thread waits at the end of the
    structured block until all tasks generated by the
    execution of the structured block are complete
    #pragma omp taskgroup
    <structured block>
  • Thread team barriers (implicit and explicit)
    #pragma omp barrier
  • Implicit barrier at end of structured block for
    parallel or any worksharing construct
  • Task behavior is the same as for taskwait

22
Tasks
    void preorder(node *p)
    {
      process(p->data);
      if (p->left) {
        #pragma omp task
        preorder(p->left);
      }
      if (p->right) {
        #pragma omp task
        preorder(p->right);
      }
    }

  • Will be executed by a single thread, in
    unspecified order, possibly deferred
  • Tasks can be closely nested (a task can directly
    generate other tasks)
  • No task queue concept, it's up to the
    implementation
  • More possibilities for optimization

23
Dependencies among tasks
    void postorder(node *p)
    {
      #pragma omp taskgroup
      {
        if (p->left) {
          #pragma omp task
          postorder(p->left);
        }
        if (p->right) {
          #pragma omp task
          postorder(p->right);
        }
      }  // suspend point
      process(p->data);
    }

  • Parent task suspended until child tasks
    complete
  • Structured directive (as opposed to standard
    barriers)
  • clearly defines the scope
  • more space for optimizations
24
A real code example (very simplified)
    void traversal(node *p, void (*f)(node *))
    {
      #pragma omp taskgroup
      {
        if (p->left) {
          #pragma omp task
          traversal(p->left, f);
        }
        if (p->right) {
          #pragma omp task
          traversal(p->right, f);
        }
      }
      f(p->data);
    }

  • Hierarchical description of a complex body
  • Used for both dependent and independent updates
  • not an issue in a serial code
  • serious concurrency reduction in a parallel code

25
A possible approach
    void traversal(node *p, void (*f)(node *), int po)
    {
      #pragma omp taskgroup if(po)
      {
        if (p->left) {
          #pragma omp task
          traversal(p->left, f, po);
        }
        if (p->right) {
          #pragma omp task
          traversal(p->right, f, po);
        }
      }
      f(p->data);
    }

  • More concurrency for independent node updates
  • Without code duplication
  • Negligible performance impact, in most cases
  • Really better than an unstructured task barrier?

26
Switches, switches, all those switches
  • Task switching: the ability of a thread to
    suspend a task and execute another one
  • needed to avoid explosion of internal data
    structures
  • avoids starvation, enhances parallelism
  • Thread switching: the property of a suspended
    task to be resumed by a different thread
  • needed for work-first (particularly on parent
    tasks)
  • aggressively avoids starvation and enhances
    parallelism
  • A restricted set of suspend points (at task and
    taskgroup boundaries)
  • Flush operations must be automatically inserted
  • Some people scared, particularly by thread
    switching
  • see data issues later
  • thread switching must be switchable on or off

27
Data scoping rules for tasks
  • Thread switching is a BIG change
  • private in OpenMP means private to a thread
  • care must be exercised across switches
  • Variables must be private to the task
  • Can you rely on threadprivate storage?
  • How do child tasks inherit private parent
    storage?
  • capture: a private copy of the parent value
  • access: the child task modifies parent task
    storage
  • data persistence issues in the latter case, a
    nightmare for the compiler to spot

28
Task Directive
  • Data sharing attribute clauses (final names not
    determined)
  • <captureaddress>
  • <capturevalue>
  • <taskprivate>
  • Can be nested inside worksharing constructs
  • Defaults for data sharing attributes are under
    discussion
  • Each task is executed by a single thread,
    although it can use a parallel directive within a
    task.
  • Tasks can be nested inside other tasks.

29
What should the default be?
  • Children access parent storage by default
  • implicit task barriers at the end of tasks and
    routines
  • safety by default for less-than-average users
  • potentially reduces concurrency, hampers
    optimization
  • users could take synchronization for granted
  • Child storage private/captured by default
  • access to parent storage only legal inside a
    taskgroup
  • cleaner, more readable code
  • potential for optimization and performance
  • First analyses on test codes
  • few accesses to parent storage from child
    tasks
  • extensive manual privatization needed in the
    other case
  • probably not a big issue to sort out

30
Tasks everywhere?
  • Some people think of making present OpenMP
    constructs into syntactic sugar for tasks
  • uniform units of work for the runtime to manage
  • CMPs with many cores
  • Threads are so 1990s! Geez!!!
  • Almost trivial to do for worksharing constructs
  • What about code in a parallel region?
  • think of many, many cores without OoO execution
  • the compiler could split everything into tasks
  • making tasks a worksharing model will not work
  • One of the hardest things to be compatible with

31
Conclusions (well, not really)
  • Apparently a lot of choices to make, but
  • interrelationships are being sorted out
  • number of choices is being reduced
  • consensus is building up
  • Work on test code, real and synthetic
  • a lot of rewarding work
  • useful to motivate the proposal, as well
  • most interesting codes cannot be shared
  • testing potential concurrency and code
    readability
  • Lack of a reference implementation
  • Forward compatibility must be further investigated

32
Still To Be Done
  • Determine other clauses for a task
  • Reduction?
  • Ordered?
  • What is the default data sharing attribute for
    variables not specified in a task clause?
  • Decide names for everything
  • Make a reference implementation, evaluate it.
  • Craft good wording for a proposal.

33
Nested parallelism
  • Better support for nested parallelism
  • Multiple internal control variables
  • Allows, for example, calling omp_set_num_threads()
    inside a parallel region.
  • Library routines to determine depth of nesting
    and IDs of parent/grandparent etc. threads.
  • Allow threadprivate variables to persist across
    inner parallel regions
  • Looking for a way of describing the nesting
    structure up-front so the runtime can make
    intelligent thread placement decisions

34
Parallel loops
  • Guarantee that this works

    !$omp do schedule(static)
    do i = 1, n
      a(i) = ....
    end do
    !$omp end do nowait
    !$omp do schedule(static)
    do i = 1, n
      .... = a(i)
    end do

35
Loops (cont.)
  • Allow collapsing of perfectly nested loops

    !$omp parallel do collapse(2)
    do i = 1, n
      do j = 1, n
        .....
      end do
    end do

  • Will form a single loop and then parallelize
    that

36
Loops (cont.)
  • Make schedule(runtime) more useful
  • can set it with library routine
  • allow implementations to implement their own
    schedule kinds
  • Add a new schedule kind which gives full freedom
    to the runtime to determine the scheduling of
    iterations to threads.


37
Portable control of threads
  • Add an environment variable to control the size of
    child threads' stacks
  • Add an environment variable to hint to the runtime
    how to treat idle threads
  • ACTIVE: keep threads alive at barriers/locks
  • PASSIVE: try to release the processor at
    barriers/locks
  • not set: use implementation-specific controls

38
NUMA support
  • Leading candidate is a MIGRATE_NEXT_TOUCH
    directive
  • move the data to the node where the next thread
    to access it is running
  • would be ignored on UMA systems or NUMA systems
    that can't support it
  • Still up for discussion and not certain to make
    it into 3.0

39
Thread Subteam
  • Organize threads into different subteams
  • Introduce intra-team synchronizations
  • Support thread topologies
  • May support thread mapping to hardware in the
    future

40
Placement of Threads on Machine
(figure: an SMP with several CPUs, each with multiple cores; the threads
are grouped into subteams 1 and 2, with an intra-team barrier within each
subteam and a global barrier across all threads)
41
BT-MZ Performance with Subteams
Courtesy of H. Jin, NASA Ames
42
BT-MZ Performance with Subteams
(number of zone groups = number of subteams)
43
Odds and ends
  • Allow unsigned ints in parallel for loops
  • Disallow use of the original variable as the master
    thread's private variable
  • Make it clearer where/how private objects are
    constructed/destructed
  • Relax some restrictions on allocatable arrays and
    Fortran pointers
  • Plug some minor gaps in the memory model
  • Improve the C/C++ grammar