ECE 1747 Parallel Programming

Transcript and Presenter's Notes
1
ECE 1747 Parallel Programming
  • Shared Memory OpenMP
  • Environment and Synchronization

2
What is OpenMP?
  • Standard for shared memory programming for
    scientific applications.
  • Has specific support for scientific application
    needs (unlike Pthreads).
  • Rapidly gaining acceptance among vendors and
    application writers.
  • See http://www.openmp.org for more info.

3
OpenMP API Overview
  • API is a set of compiler directives inserted in
    the source program (in addition to some library
    functions).
  • Ideally, compiler directives do not affect
    sequential code.
  • pragmas in C / C++.
  • (special) comments in Fortran code.
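
For instance, the corresponding directive in each language looks like this (a minimal sketch; clauses omitted):

    #pragma omp parallel for       /* C / C++: a pragma            */
    !$omp parallel do              ! Fortran: a directive comment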

4
OpenMP API Example (1 of 2)
  • Sequential code
  • statement1
  • statement2
  • statement3
  • Assume we want to execute statement2 in
    parallel, and statement1 and statement3 sequentially.

5
OpenMP API Example (2 of 2)
  • OpenMP parallel code
    statement1
    #pragma <specific OpenMP directive>
    statement2
    statement3
  • Statement 2 may be executed in parallel.
  • Statements 1 and 3 are executed sequentially.
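
A concrete sketch of this pattern (init, f, and report are hypothetical names):

    a = init();                  /* statement 1: executed sequentially   */
    #pragma omp parallel for     /* directive applies to the next loop   */
    for( i = 0; i < n; i++ )     /* statement 2: may execute in parallel */
        b[i] = f( a, i );
    report( b );                 /* statement 3: executed sequentially   */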

6
Important Note
  • By giving a parallel directive, the user asserts
    that the program will remain correct if the
    statement is executed in parallel.
  • OpenMP compiler does not check correctness.
  • Some tools exist for helping with that.
  • Totalview - good parallel debugger
    (www.etnus.com)

7
API Semantics
  • Master thread executes sequential code.
  • Master and slaves execute parallel code.
  • Note: very similar to the fork-join semantics of
    Pthreads create/join primitives.

8
OpenMP Implementation Overview
  • OpenMP implementation
  • compiler,
  • library.
  • Unlike Pthreads (purely a library).

9
OpenMP Example Usage (1 of 2)
    [Diagram: Annotated Source -> OpenMP Compiler -(compiler switch)->
     Sequential Program or Parallel Program]
10
OpenMP Example Usage (2 of 2)
  • If you give the sequential switch,
  • comments and pragmas are ignored.
  • If you give the parallel switch,
  • comments and/or pragmas are read, and
  • cause translation into a parallel program.
  • Ideally, one source serves for both the sequential
    and the parallel program (a big maintenance plus).
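
With GCC, for example, the switch is a command-line flag (prog.c is a hypothetical source file):

    gcc prog.c -o prog             # OpenMP pragmas ignored: sequential program
    gcc -fopenmp prog.c -o prog    # OpenMP pragmas honored: parallel program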

11
OpenMP Directives
  • Parallelization directives
  • parallel region
  • parallel for
  • Data environment directives
  • shared, private, threadprivate, reduction, etc.
  • Synchronization directives
  • barrier, critical

12
General Rules about Directives
  • They always apply to the next statement, which
    must be a structured block.
  • Examples:
    #pragma omp <directive>
        statement
    #pragma omp <directive>
        { statement1; statement2; statement3; }

13
OpenMP Parallel Region
    #pragma omp parallel
  • A number of threads are spawned at entry.
  • Each thread executes the same code.
  • Each thread waits at the end.
  • Very similar to a number of create/joins with
    the same function in Pthreads.
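
A minimal runnable sketch (the omp_get_* calls are introduced a few slides below):

    #include <stdio.h>
    #include <omp.h>

    int main( void )
    {
        #pragma omp parallel   /* threads spawned here; all run the block */
        printf( "hello from thread %d of %d\n",
                omp_get_thread_num(), omp_get_num_threads() );
        return 0;              /* implicit join at the end of the region */
    }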

14
Getting Threads to do Different Things
  • Through explicit thread identification (as in
    Pthreads).
  • Through work-sharing directives.

15
Thread Identification
  • int omp_get_thread_num() - gets the thread id.
  • int omp_get_num_threads() - gets the total number of threads.

16
Example
    #pragma omp parallel
    {
        if( !omp_get_thread_num() )
            master();
        else
            slave();
    }

17
Work Sharing Directives
  • Always occur within a parallel region directive.
  • Two principal ones are
  • parallel for
  • parallel sections

18
OpenMP Parallel For
    #pragma omp parallel
    #pragma omp for
    for( ... ) { ... }
  • Each thread executes a subset of the iterations.
  • All threads wait at the end of the parallel for.
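
A runnable sketch that makes the iteration split visible (which thread gets which iterations is up to the implementation):

    #include <stdio.h>
    #include <omp.h>

    int main( void )
    {
        int i;
        #pragma omp parallel
        #pragma omp for        /* iterations divided among the threads */
        for( i = 0; i < 8; i++ )
            printf( "iteration %d ran on thread %d\n",
                    i, omp_get_thread_num() );
        return 0;              /* implicit barrier at the end of the for */
    }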

19
Multiple Work Sharing Directives
  • May occur within a single parallel region
    #pragma omp parallel
    {
        #pragma omp for
        for( ... ) { ... }
        #pragma omp for
        for( ... ) { ... }
    }
  • All threads wait at the end of the first for.

20
The NoWait Qualifier
    #pragma omp parallel
    {
        #pragma omp for nowait
        for( ... ) { ... }
        #pragma omp for
        for( ... ) { ... }
    }
  • Threads proceed to second for w/o waiting.
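
A filled-in sketch (bump is a hypothetical helper; nowait is only safe here because the two loops touch independent arrays):

    void bump( int *a, int *b, int n )
    {
        int i;
        #pragma omp parallel
        {
            #pragma omp for nowait   /* no barrier after this loop     */
            for( i = 0; i < n; i++ )
                a[i] += 1;
            #pragma omp for          /* threads may start here early   */
            for( i = 0; i < n; i++ )
                b[i] += 1;
        }
    }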

21
Parallel Sections Directive
    #pragma omp parallel
    #pragma omp sections
    {
        #pragma omp section   /* this is a delimiter */
        ...
        #pragma omp section
        ...
    }
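
A filled-in sketch (task_a and task_b are hypothetical functions):

    extern void task_a( void ), task_b( void );

    void run_tasks( void )
    {
        #pragma omp parallel
        #pragma omp sections
        {
            #pragma omp section      /* one unit of work             */
            task_a();
            #pragma omp section      /* may run on another thread    */
            task_b();
        }
    }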

22
A Useful Shorthand
    #pragma omp parallel
    #pragma omp for
    for( ... ) { ... }
  • is equivalent to
    #pragma omp parallel for
    for( ... ) { ... }
  • (Same for parallel sections.)

23
Note the Difference between ...
    #pragma omp parallel
    {
        #pragma omp for
        for( ... ) { ... }
        f();
        #pragma omp for
        for( ... ) { ... }
    }

24
and ...
    #pragma omp parallel for
    for( ... ) { ... }
    f();
    #pragma omp parallel for
    for( ... ) { ... }
  • In the first version, f() is executed (redundantly) by
    every thread inside a single parallel region; here, f()
    is executed once, sequentially, between two separate
    parallel regions.

25
Sequential Matrix Multiply
    for( i=0; i<n; i++ )
        for( j=0; j<n; j++ ) {
            c[i][j] = 0.0;
            for( k=0; k<n; k++ )
                c[i][j] += a[i][k]*b[k][j];
        }

26
OpenMP Matrix Multiply
    #pragma omp parallel for
    for( i=0; i<n; i++ )
        for( j=0; j<n; j++ ) {
            c[i][j] = 0.0;
            for( k=0; k<n; k++ )
                c[i][j] += a[i][k]*b[k][j];
        }

27
Sequential SOR
    for( some number of timesteps/iterations ) {
        for( i=0; i<n; i++ )
            for( j=1; j<n; j++ )
                temp[i][j] = 0.25 *
                    ( grid[i-1][j] + grid[i+1][j] +
                      grid[i][j-1] + grid[i][j+1] );
        for( i=0; i<n; i++ )
            for( j=1; j<n; j++ )
                grid[i][j] = temp[i][j];
    }

28
OpenMP SOR
    for( some number of timesteps/iterations ) {
        #pragma omp parallel for
        for( i=0; i<n; i++ )
            for( j=1; j<n; j++ )
                temp[i][j] = 0.25 *
                    ( grid[i-1][j] + grid[i+1][j] +
                      grid[i][j-1] + grid[i][j+1] );
        #pragma omp parallel for
        for( i=0; i<n; i++ )
            for( j=1; j<n; j++ )
                grid[i][j] = temp[i][j];
    }

29
Equivalent OpenMP SOR
    for( some number of timesteps/iterations ) {
        #pragma omp parallel
        {
            #pragma omp for
            for( i=0; i<n; i++ )
                for( j=1; j<n; j++ )
                    temp[i][j] = 0.25 *
                        ( grid[i-1][j] + grid[i+1][j] +
                          grid[i][j-1] + grid[i][j+1] );
            #pragma omp for
            for( i=0; i<n; i++ )
                for( j=1; j<n; j++ )
                    grid[i][j] = temp[i][j];
        }
    }

30
Some Advanced Features
  • Conditional parallelism.
  • Scheduling options.
  • (More can be found in the specification)

31
Conditional Parallelism Issue
  • Oftentimes, parallelism is only useful if the
    problem size is sufficiently big.
  • For smaller sizes, the overhead of
    parallelization exceeds the benefit.

32
Conditional Parallelism Specification
    #pragma omp parallel if( expression )
    #pragma omp for if( expression )
    #pragma omp parallel for if( expression )
  • Execute in parallel if expression is true,
    otherwise execute sequentially.

33
Conditional Parallelism Example
    for( i=0; i<n; i++ ) {
        #pragma omp parallel for if( n-i > 100 )
        for( j=i+1; j<n; j++ )
            for( k=i+1; k<n; k++ )
                a[j][k] = a[j][k] - a[i][k]*a[i][j] / a[j][j];
    }

34
Scheduling of Iterations Issue
  • Scheduling: assigning iterations to threads.
  • So far, we have assumed the default which is
    block scheduling.
  • OpenMP allows other scheduling strategies as
    well, for instance cyclic, gss (guided
    self-scheduling), etc.

35
Scheduling of Iterations Specification
    #pragma omp parallel for schedule( <sched> )
  • <sched> can be one of
  • block (default)
  • cyclic
  • gss
  • (In standard OpenMP syntax these are schedule(static),
    schedule(static,1), and schedule(guided), respectively.)

36
Example
  • Multiplication of two matrices, C = A x B, where
    the A matrix is upper-triangular (all elements
    below the diagonal are 0).

    [Figure: matrix A, with all elements below the diagonal equal to 0]
37
Sequential Matrix Multiply Becomes
    for( i=0; i<n; i++ )
        for( j=0; j<n; j++ ) {
            c[i][j] = 0.0;
            for( k=i; k<n; k++ )
                c[i][j] += a[i][k]*b[k][j];
        }
  • Load imbalance with block distribution.

38
OpenMP Matrix Multiply
    #pragma omp parallel for schedule( cyclic )
    for( i=0; i<n; i++ )
        for( j=0; j<n; j++ ) {
            c[i][j] = 0.0;
            for( k=i; k<n; k++ )
                c[i][j] += a[i][k]*b[k][j];
        }

39
Data Environment Directives (1 of 2)
  • All variables are by default shared.
  • One exception: the loop variable of a parallel
    for is private.
  • By using data directives, some variables can be
    made private or given other special
    characteristics.

40
Reminder Matrix Multiply
    #pragma omp parallel for
    for( i=0; i<n; i++ )
        for( j=0; j<n; j++ ) {
            c[i][j] = 0.0;
            for( k=0; k<n; k++ )
                c[i][j] += a[i][k]*b[k][j];
        }
  • a, b, c are shared
  • i, j, k are private

41
Data Environment Directives (2 of 2)
  • Private
  • Threadprivate
  • Reduction

42
Private Variables
    #pragma omp parallel for private( list )
  • Makes a private copy for each thread of each
    variable in the list.
  • This and all further examples use parallel
    for, but the same applies to the other region and
    work-sharing directives.

43
Private Variables Example (1 of 2)
    for( i=0; i<n; i++ ) {
        tmp = a[i];
        a[i] = b[i];
        b[i] = tmp;
    }
  • Swaps the values in a and b.
  • Loop-carried dependence on tmp.
  • Easily fixed by privatizing tmp.

44
Private Variables Example (2 of 2)
    #pragma omp parallel for private( tmp )
    for( i=0; i<n; i++ ) {
        tmp = a[i];
        a[i] = b[i];
        b[i] = tmp;
    }
  • Removes dependence on tmp.
  • Would be more difficult to do in Pthreads.

45
Private Variables Alternative 1
    for( i=0; i<n; i++ ) {
        tmp[i] = a[i];
        a[i] = b[i];
        b[i] = tmp[i];
    }
  • Requires sequential program change.
  • Wasteful in space, O(n) vs. O(p).

46
Private Variables Alternative 2
    f()
    {
        int tmp; /* local allocation on stack */
        for( i=from; i<to; i++ ) {
            tmp = a[i];
            a[i] = b[i];
            b[i] = tmp;
        }
    }

47
Threadprivate
  • Private variables are private on a parallel
    region basis.
  • Threadprivate variables are global variables that
    are private throughout the execution of the
    program.

48
Threadprivate
    #pragma omp threadprivate( list )
  • Example: #pragma omp threadprivate( x )
  • Requires a program change in Pthreads:
  • requires an array of size p,
  • accessed as x[pthread_self()],
  • costly if accessed frequently.
  • Not cheap in OpenMP either.
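
A minimal OpenMP sketch (record_id is a hypothetical function):

    #include <omp.h>

    int x;                            /* one global x ...              */
    #pragma omp threadprivate( x )    /* ... becomes one x per thread  */

    void record_id( void )
    {
        #pragma omp parallel
        x = omp_get_thread_num();     /* each thread writes its own x  */
    }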

49
Reduction Variables
    #pragma omp parallel for reduction( op : list )
  • op is one of +, *, -, &, ^, |, &&, or ||.
  • The variables in list must be used with this
    operator in the loop.
  • The variables are automatically initialized to
    sensible values.

50
Reduction Variables Example
    #pragma omp parallel for reduction( + : sum )
    for( i=0; i<n; i++ )
        sum += a[i];
  • Sum is automatically initialized to zero.

51
SOR Sequential Code with Convergence
    for( ; diff > delta; ) {
        for( i=0; i<n; i++ )
            for( j=0; j<n; j++ )
                temp[i][j] = 0.25 *
                    ( grid[i-1][j] + grid[i+1][j] +
                      grid[i][j-1] + grid[i][j+1] );
        diff = 0;
        for( i=0; i<n; i++ )
            for( j=0; j<n; j++ ) {
                diff = max( diff, fabs(grid[i][j] - temp[i][j]) );
                grid[i][j] = temp[i][j];
            }
    }

52
SOR OpenMP Code with Convergence
    for( ; diff > delta; ) {
        #pragma omp parallel for
        for( i=0; i<n; i++ )
            for( j=0; j<n; j++ )
                temp[i][j] = 0.25 *
                    ( grid[i-1][j] + grid[i+1][j] +
                      grid[i][j-1] + grid[i][j+1] );
        diff = 0;
        #pragma omp parallel for reduction( max : diff )
        for( i=0; i<n; i++ )
            for( j=0; j<n; j++ ) {
                diff = max( diff, fabs(grid[i][j] - temp[i][j]) );
                grid[i][j] = temp[i][j];
            }
    }

53
SOR OpenMP Code with Convergence
    for( ; diff > delta; ) {
        #pragma omp parallel for
        for( i=0; i<n; i++ )
            for( j=0; j<n; j++ )
                temp[i][j] = 0.25 *
                    ( grid[i-1][j] + grid[i+1][j] +
                      grid[i][j-1] + grid[i][j+1] );
        diff = 0;
        #pragma omp parallel for reduction( max : diff )
        for( i=0; i<n; i++ )
            for( j=0; j<n; j++ ) {
                diff = max( diff, fabs(grid[i][j] - temp[i][j]) );
                grid[i][j] = temp[i][j];
            }
    }
  • Bummer: no reduction operator for max or min.

54
Synchronization Primitives
  • Critical
  • Critical
    #pragma omp critical ( name )
  • Implements critical sections, by name.
  • Similar to Pthreads mutex locks (the name plays
    the role of the lock).
  • Barrier
    #pragma omp barrier
  • Implements a global barrier.

55
OpenMP SOR with Convergence (1 of 2)
    #pragma omp parallel private( mydiff )
    for( ; diff > delta; ) {
        #pragma omp for nowait
        for( i=from; i<to; i++ )
            for( j=0; j<n; j++ )
                temp[i][j] = 0.25 *
                    ( grid[i-1][j] + grid[i+1][j] +
                      grid[i][j-1] + grid[i][j+1] );
        diff = 0.0;
        mydiff = 0.0;
        #pragma omp barrier
        ...

56
OpenMP SOR with Convergence (2 of 2)
        ...
        #pragma omp for nowait
        for( i=from; i<to; i++ )
            for( j=0; j<n; j++ ) {
                mydiff = max( mydiff, fabs(grid[i][j] - temp[i][j]) );
                grid[i][j] = temp[i][j];
            }
        #pragma omp critical
        diff = max( diff, mydiff );
        #pragma omp barrier
    }

57
Synchronization Primitives
  • Big bummer: no condition variables.
  • Result: must busy-wait for condition
    synchronization.
  • Clumsy.
  • Very inefficient on some architectures.
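
The busy-wait pattern this forces, sketched as a helper (the shared progress counter count is hypothetical; the PIPE example below uses the same idea):

    void wait_until_ready( int i )   /* wait until item i is produced      */
    {
        extern int count;            /* shared progress counter (assumed)  */
        int done;
        do {
            #pragma omp critical     /* re-read the counter under the lock */
            done = ( count > i );
        } while( !done );            /* spin: no condition variable to block on */
    }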

58
PIPE Sequential Program
    for( i=0; i<num_pics, read(in_pic); i++ ) {
        int_pic_1 = trans1( in_pic );
        int_pic_2 = trans2( int_pic_1 );
        int_pic_3 = trans3( int_pic_2 );
        out_pic = trans4( int_pic_3 );
    }

59
Sequential vs. Parallel Execution
    [Figure: sequential vs. parallel (pipelined) execution timelines;
     color = picture, horizontal line = processor]

60
PIPE Parallel Program
    P0: for( i=0; i<num_pics, read(in_pic); i++ ) {
            int_pic_1[i] = trans1( in_pic );
            signal( event_1_2[i] );
        }
    P1: for( i=0; i<num_pics; i++ ) {
            wait( event_1_2[i] );
            int_pic_2[i] = trans2( int_pic_1[i] );
            signal( event_2_3[i] );
        }

61
PIPE Main Program
    #pragma omp parallel sections
    {
        #pragma omp section
        stage1();
        #pragma omp section
        stage2();
        #pragma omp section
        stage3();
        #pragma omp section
        stage4();
    }

62
PIPE Stage 1
    void stage1()
    {
        num1 = 0;
        for( i=0; i<num_pics, read(in_pic); i++ ) {
            int_pic_1[i] = trans1( in_pic );
            #pragma omp critical 1
            num1++;
        }
    }

63
PIPE Stage 2
    void stage2()
    {
        for( i=0; i<num_pics; i++ ) {
            do {
                #pragma omp critical 1
                cond = ( num1 <= i );
            } while( cond );
            int_pic_2[i] = trans2( int_pic_1[i] );
            #pragma omp critical 2
            num2++;
        }
    }

64
OpenMP PIPE
  • Note the need to exit the critical section while
    waiting.
  • Otherwise, no other thread could enter it and make
    progress.
  • Never busy-wait inside a critical section!
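
For contrast, a sketch of the broken version of stage2's wait (same variables as above); spinning while holding the critical section deadlocks:

    /* WRONG: the test spins inside the critical section, so no */
    /* other thread can ever enter it to increment num1.        */
    #pragma omp critical 1
    while( num1 <= i )
        ;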