Transcript and Presenter's Notes

Title: Computational Methods in Physics PHYS 3437


1
Computational Methods in Physics PHYS 3437
  • Dr Rob Thacker
  • Dept of Astronomy & Physics (MM-301C)
  • thacker@ap.smu.ca

2
Today's Lecture
  • Recap from end of last lecture
  • Some technical details related to parallel
    programming
  • Data dependencies
  • Race conditions
  • Summary of other clauses you can use in setting
    up parallel loops

3
Recap
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i), SHARED(X,Y,n,a)
      do i=1,n
        Y(i) = a*X(i) + Y(i)
      end do
4
SHARED and PRIVATE
  • Most commonly used directives which are necessary
    to ensure correct execution
  • PRIVATE: any variable declared as private will be
    local only to a given thread and is inaccessible
    to others (also it is uninitialized)
  • This means that if you have a variable, say t, in
    the serial section of the code and then use it in
    a loop, the value of t in the loop will not carry
    over the value of t from the serial part (see the
    sketch after this list)
  • Watch out for this, but there is a way around it
  • SHARED: any variable declared as shared will be
    accessible by all other threads of execution
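
A minimal sketch (not from the original slides; the variable names are illustrative) of the point about t: the PRIVATE copy inside the loop does not inherit the serial value.

      t = 2.0
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,t), SHARED(Y,n)
      do i=1,n
C       t here is a fresh private copy - it does NOT start at 2.0
        t = float(i)
        Y(i) = t
      end do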

5
Example
  • The SHARED and PRIVATE specifications can be long!

C$OMP PRIVATE(icb,icol,izt,iyt,icell,iz_off,iy_off,ibz,
C$OMP& iby,ibx,i,rxadd,ryadd,rzadd,inx,iny,inz,nb,nebs,ibrf,
C$OMP& nbz,nby,nbx,nbrf,nbref,jnbox,jnboxnhc,idt,mdt,iboxd,
C$OMP& dedge,idir,redge,is,ie,twoh,dosph,rmind,in,ixyz,
C$OMP& redaughter,Ustmp,ngpp,hpp,vpp,apps,epp,hppi,hpp2,
C$OMP& rh2,hpp2i,hpp3i,hpp5i,dpp,divpp,dcvpp,nspp,rnspp,
C$OMP& rad2torbin,de1,dosphflag,dosphnb,nbzlow,nbzhigh,nbylow,
C$OMP& nbyhigh,nbxlow,nbxhigh,nbzadd,nbyadd,r3i,r2i,r1i,
C$OMP& dosphnbnb,dogravnb,js,je,j,rad2,rmj,grc,igrc,gfrac,
C$OMP& Gr,hppj,jlist,dx,rdv,rcv,v2,radii2,rbin,ibin,fbin,
C$OMP& wl1,dwl1,drnspp,hppa,hppji,hppj2i,hppj3i,hppj5i,
C$OMP& wl2,dwl2,w,dw,df,dppi,divppr,dcvpp2,dcvppm,divppm,csi,
C$OMP& fi,prhoi2,ispp,frcij,rdotv,hpa,rmuij,rhoij,cij,qij,
C$OMP& frc3,frc4,hcalc,rath,av,frc2,dr1,dr2,dr3,dr12,dr22,dr32,
C$OMP& appg1,appg2,appg3,gdiff,ddiff,d2diff,dv1,dv2,dv3,rpp,
C$OMP& Gro)
6
FIRSTPRIVATE
  • Declaring a variable FIRSTPRIVATE will ensure
    that its value is copied in from any prior piece
    of serial code
  • However (of course) if the variable is not
    initialized in the serial section it will remain
    uninitialized
  • Happens only once for a given thread set
  • Try to avoid writing to variables declared
    FIRSTPRIVATE

7
FIRSTPRIVATE example
      a=5.0
C$OMP PARALLEL DO
C$OMP& SHARED(r), PRIVATE(i)
C$OMP& FIRSTPRIVATE(a)
      do i=1,n
        r(i) = max(a,r(i))
      end do
  • Lower bound of values is set to the value of a;
    without the FIRSTPRIVATE clause a=0.0

8
LASTPRIVATE
  • Occasionally it may be necessary to know the last
    value of a variable from the end of the loop
  • LASTPRIVATE will initialize the value of the
    variable in the serial section using the last
    (sequential) value it took in the parallel loop
    (see the sketch below)
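
A minimal sketch (not from the original slides) of LASTPRIVATE: after the loop, t holds the value from the sequentially last iteration, i=n.

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i), LASTPRIVATE(t), SHARED(Y,n)
      do i=1,n
        t = 2.0*Y(i)
      end do
C     here t = 2.0*Y(n), the value from iteration i=n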

9
Default behaviour
  • You can actually omit the SHARED and PRIVATE
    statements; what is the expected behaviour?
  • In OpenMP the loop index is private by default
  • Other scalars and arrays are shared by default

Bad practice in my opinion: specify the types
for everything
10
DEFAULT
  • I recommend using DEFAULT(NONE) at all times
  • Forces specification of all variable types
  • Alternatively, can use DEFAULT(SHARED), or
    DEFAULT(PRIVATE) to specify that un-scoped
    variables will default to the particular type
    chosen
  • e.g. choosing DEFAULT(PRIVATE) will ensure any
    un-scoped variable is private (see the sketch
    below)
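
A minimal sketch (my own example) of DEFAULT(SHARED) with an explicit PRIVATE override: X, Y, a and n pick up the shared default, while i and tmp are scoped by hand.

C$OMP PARALLEL DO
C$OMP& DEFAULT(SHARED)
C$OMP& PRIVATE(i,tmp)
      do i=1,n
C       X, Y, a and n are shared via the default clause
        tmp  = a*X(i)
        Y(i) = tmp + Y(i)
      end do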

11
The Parallel Do Pragmas
  • So far we've considered a small subset of
    functionality
  • Before we talk more about data dependencies, let's
    look briefly at what other statements can be used
    in a parallel do loop
  • Besides PRIVATE and SHARED variables there are a
    number of other clauses that can be applied

12
Loop Level Parallelism in more detail
  • For each parallel do(for) pragma, the following
    clauses are possible

C/C++:   private, shared, firstprivate, lastprivate,
         reduction, ordered, schedule, copyin
FORTRAN: PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE,
         REDUCTION, ORDERED, SCHEDULE, COPYIN, DEFAULT
(Red on the original slide marks the most frequently
used clauses; clauses in italics we have already seen.)
13
More background on data dependencies
  • Suppose you try to parallelize the following loop
  • Won't work as it is written since iteration i
    depends upon iteration i-1, and thus we can't
    start anything in parallel
  • To see this explicitly, let n=20 and start thread
    1 at i=1 and thread 2 at i=11: then thread 1 sets
    Y(1)=1.0 and thread 2 sets Y(11)=1.0 (which is
    wrong!)

      c=0.0
      do i=1,n
        c=c+1.0
        Y(i)=c
      end do
14
Simple solution
  • This loop can easily be re-written in a way that
    can be parallelized
  • There is no longer any dependence on the previous
    operation
  • Private variables: i; Shared variables: Y(), c, n

      c=0.0
      do i=1,n
        Y(i)=c+float(i)
      end do
      c=c+n
15
Types of Data Dependencies
  • Suppose we have operations O1,O2
  • True Dependence
  • O2 has a true dependence on O1 if O2 reads a
    value written by O1
  • Anti Dependence
  • O2 has an anti-dependence on O1 if O2 writes a
    value read by O1
  • Output Dependence
  • O2 has an output dependence on O1 if O2 writes a
    variable written by O1

16
Examples
  • True dependence
        A1 = A2 + A3
        B1 = A1 + B2
  • Anti-dependence
        B1 = A1 + B2
        A1 = C2
  • Output dependence
        B1 = 5
        B1 = 2
17
Dealing with Data Dependencies
  • Any loop where iterations depend upon the
    previous one has a potential problem
  • Any result which depends upon the order of the
    iterations will be a problem
  • Good first test of whether something can be
    parallelized: reverse the loop iteration order
    (the result should be unchanged)
  • Not all data dependencies can be eliminated
  • Accumulations of variables (e.g. sum of elements
    in an array) can be dealt with easily

18
Accumulations
  • Consider the following loop
  • It apparently has a data dependency; however,
    each thread can accumulate its own partial sum
    independently
  • OpenMP provides an explicit interface for this
    kind of operation (REDUCTION)

      a=0.0
      do i=1,n
        a=a+X(i)
      end do
19
REDUCTION clause
  • This clause deals with parallel versions of the
    following loops
  • Outcome is determined by a reduction over all
    the values for each thread
  • e.g. the max over a whole set is equivalent to
    the max over the maxima of its subsets:
    Max(A), where A = U An, equals Max( U Max(An) )

      do i=1,N
        a=max(a,b(i))
      end do

      do i=1,N
        a=min(a,b(i))
      end do

      do i=1,n
        a=a+b(i)
      end do
20
Examples
  • Syntax: REDUCTION(OP : variable), where OP can be
    max, min, +, -, * (and the logical operators)

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(b)
C$OMP& REDUCTION(max:a)
      do i=1,N
        a=max(a,b(i))
      end do

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(b)
C$OMP& REDUCTION(min:a)
      do i=1,N
        a=min(a,b(i))
      end do
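
The accumulation loop from the earlier Accumulations slide parallelizes the same way with the + operator (my own sketch, not on the original slide).

      a=0.0
C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(X,n)
C$OMP& REDUCTION(+:a)
      do i=1,n
        a=a+X(i)
      end do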
21
What is REDUCTION actually doing?
  • Saving you from writing more code
  • The reduction clause generates an array of the
    reduction variable, one element per thread, and
    each thread accumulates into its own element
  • The final reduction over all the array elements
    (when the loop is finished) is performed
    transparently to the user (a sketch of the
    equivalent hand-written code is given below)
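
A minimal sketch of what this amounts to for the accumulation loop, written out by hand (the helper array apart, the bound MAXT and the scalar names are my own; a real implementation also pads the array to avoid false sharing).

      integer MAXT
      parameter (MAXT=16)
      real apart(MAXT)
      integer OMP_GET_MAX_THREADS, OMP_GET_THREAD_NUM
      integer nt, it, i

      nt = OMP_GET_MAX_THREADS()
      do it=1,nt
        apart(it) = 0.0
      end do
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,it), SHARED(apart,X,n)
      do i=1,n
C       each thread accumulates into its own element of apart()
        it = OMP_GET_THREAD_NUM() + 1
        apart(it) = apart(it) + X(i)
      end do
C     final (serial) reduction over the per-thread partial sums
      a = 0.0
      do it=1,nt
        a = a + apart(it)
      end do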

22
Initialization
  • Reduction variables are initialized as follows
    (from the standard)

Operator   Initialization
+          0
*          1
-          0
MAX        Smallest rep. number
MIN        Largest rep. number
23
Race Conditions
  • A common operation is to resolve a spatial
    position into an array index; consider the
    following loop
  • Looks innocent enough, but suppose two particles
    have the same position

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,j)
C$OMP& SHARED(r,A,n)
      do i=1,n
        j=int(r(i))
        A(j)=A(j)+1.
      end do
r() = array of positions, A() = array that is
modified using information from r()
24
Race Conditions: A concurrency problem
  • Two different threads of execution can
    concurrently attempt to update the same memory
    location

[Diagram: timeline of two threads that both read A(j), add 1, and write back, starting from A(j)=1.; one of the updates is lost]
25
Dealing with Race Conditions
  • Need mechanism to ensure updates to single
    variables occur within a critical section
  • Any thread entering a critical section blocks all
    others
  • Critical sections can be established by using
    lock variables (a sketch using the OpenMP lock
    routines follows this list)
  • Think of lock variables as preventing more than
    one thread from working on a particular piece of
    code at any one time
  • Just like a lock on a door prevents people from
    entering a room
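
A minimal sketch (not from the original slides) of explicit lock variables using the standard OpenMP lock routines; in Fortran the lock variable should really be an integer of kind OMP_LOCK_KIND, written here as integer*8 for brevity.

      integer*8 lck
      integer i

      call OMP_INIT_LOCK(lck)
C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(lck,a,n)
      do i=1,n
C        ... work ...
         call OMP_SET_LOCK(lck)
C        only one thread at a time can hold the lock
         a = a + 1.
         call OMP_UNSET_LOCK(lck)
      end do
      call OMP_DESTROY_LOCK(lck)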

26
Deadlocks: The pitfall of locking
  • Must ensure a situation is not created where the
    locks a thread holds and the locks it requests
    combine to create a deadlock
  • Nested locks are a classic example of this
  • Can also create problems between multiple
    processes: the "deadly embrace"

[Diagram: deadly embrace - Process 1 holds Resource 1 and requests Resource 2, while Process 2 holds Resource 2 and requests Resource 1]
27
Solutions
  • Need to ensure memory read/writes occur without
    any overlap
  • If the access occurs to a single region, we can
    use a critical section

      do i=1,n
C        ... work ...
C$OMP CRITICAL(lckx)
         a=a+1.
C$OMP END CRITICAL(lckx)
      end do
Only one thread will be allowed inside the
critical section at a time.
I have given a name to the critical section, but
you don't have to do this.
28
ATOMIC
  • If all you want to do is ensure the correct
    update of one variable you can use the atomic
    update facility
  • Exactly the same as a critical section around one
    single update point

C$OMP PARALLEL DO
      do i=1,n
C        ... work ...
C$OMP ATOMIC
         a=a+1.
      end do
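
Applied to the position-binning loop from the Race Conditions slide (my own combination of the two examples), the lost update on A(j) is removed with an atomic update:

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,j)
C$OMP& SHARED(r,A,n)
      do i=1,n
        j=int(r(i))
C$OMP ATOMIC
        A(j)=A(j)+1.
      end do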
29
Can be inefficient
  • If other threads are waiting to enter the
    critical section then the program may even
    degenerate to a serial code!
  • Make sure there is much more work outside the
    locked region than inside it!

[Diagram: fork-join timeline; the legend distinguishes time spent doing work from time spent waiting for the lock. A parallel section where each thread waits for the lock before being able to proceed is a complete disaster.]
30
COPYIN & ORDERED
  • Suppose you have a small section of code that
    must always be executed in sequential order
  • However, the remaining work can be done in any
    order
  • Placing an ORDERED directive around that section
    (with an ORDERED clause on the parallel do) will
    force threads to execute it sequentially (see the
    sketch below)
  • If a common block is specified as private in a
    parallel do, COPYIN will ensure that all threads
    are initialized with the same values as in the
    serial section of the code
  • Essentially FIRSTPRIVATE for common
    blocks/globals
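
A minimal sketch of ORDERED (my own example, assuming the sequential part is writing results in loop order): note the ORDERED clause on the parallel do plus the ORDERED directive around the code that must run in order.

C$OMP PARALLEL DO
C$OMP& ORDERED
C$OMP& PRIVATE(i), SHARED(Y,n)
      do i=1,n
C        ... work that may execute in any order ...
C$OMP ORDERED
C        this block executes in sequential iteration order
         write(*,*) i, Y(i)
C$OMP END ORDERED
      end do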

31
Subtle point about running in parallel
  • When running in parallel you are only as fast as
    your slowest thread
  • In the example, the total work is 40 seconds and
    we have 4 CPUs
  • The maximum speed-up would give 40/4 = 10 secs
  • All threads have to take 10 secs, though, to give
    the maximum speed-up

Example of poor load balance: only a 40/16 = 2.5
speed-up despite using 4 processors
32
SCHEDULE
  • This is the mechanism for determining how work is
    spread among threads
  • Important for ensuring that work is spread evenly
    among the threads: just having the same number
    of iterations does not guarantee they all
    complete at the same time
  • Four types of scheduling possible STATIC,
    DYNAMIC, GUIDED, RUNTIME

33
STATIC scheduling
  • Simplest of the four
  • If SCHEDULE is unspecified, STATIC scheduling
    will result
  • Default behaviour is to simply divide up the
    iterations evenly among the threads, n/(# of
    threads) each
  • STATIC with a chunksize, SCHEDULE(STATIC,chunksize),
    creates a cyclic distribution of iterations (see
    the sketch below)
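
A minimal sketch (my own example; the loop body is just a placeholder) of the two forms compared on the next slide.

C block distribution: iterations split into (# threads) contiguous chunks
C$OMP PARALLEL DO
C$OMP& SCHEDULE(STATIC)
C$OMP& PRIVATE(i), SHARED(Y,n)
      do i=1,n
        Y(i) = sqrt(Y(i))
      end do

C cyclic distribution: iterations dealt out one at a time, round robin
C$OMP PARALLEL DO
C$OMP& SCHEDULE(STATIC,1)
C$OMP& PRIVATE(i), SHARED(Y,n)
      do i=1,n
        Y(i) = sqrt(Y(i))
      end do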

34
Comparison
STATIC, no chunksize (block distribution):
  THREAD 1: iterations  1  2  3  4
  THREAD 2: iterations  5  6  7  8
  THREAD 3: iterations  9 10 11 12
  THREAD 4: iterations 13 14 15 16

STATIC, chunksize=1 (cyclic distribution):
  THREAD 1: iterations  1  5  9 13
  THREAD 2: iterations  2  6 10 14
  THREAD 3: iterations  3  7 11 15
  THREAD 4: iterations  4  8 12 16
35
DYNAMIC scheduling
  • DYNAMIC scheduling is a personal favourite
  • Specify using DYNAMIC(chunksize)
  • Simple implementation of master-worker type
    distribution of iterations
  • The master thread passes off chunks of
    iterations, of size chunksize, to the workers
  • Not a silver bullet: if the load imbalance is too
    severe (i.e. one thread takes longer than the
    rest combined) an algorithm rewrite is necessary
  • Also not good if you need a regular access
    pattern for data locality (see the sketch below)
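
A minimal sketch (my own example; the chunksize of 10 and the loop body are placeholders):

C$OMP PARALLEL DO
C$OMP& SCHEDULE(DYNAMIC,10)
C$OMP& PRIVATE(i), SHARED(Y,n)
      do i=1,n
C       chunks of 10 iterations are handed out as threads finish
C       their previous chunk, which evens out an irregular load
        Y(i) = sqrt(Y(i))
      end do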

36
Master-Worker Model
[Diagram: worker threads send REQUESTs to the Master thread, which hands out the next chunk of iterations to each requester]
37
Other ways to use OpenMP
  • We've really only skimmed the surface of what you
    can do
  • However, we have covered the important details
  • OpenMP provides a different programming model to
    just using loops
  • It isn't that much harder, but you need to think
    slightly differently
  • Check out www.openmp.org for more details

38
Applying to algorithms used in the course
  • What could we apply OpenMP to?
  • Root finding algorithms are actually
    fundamentally serial!
  • Global bracket finder: subdivide the region and
    let each CPU search its allotted space in
    parallel
  • LU decomposition can be parallelized
  • Numerical integration can be parallelized (see
    the sketch after this list)
  • ODE solvers are not usually good parallelization
    candidates, but it is problem dependent
  • MC methods usually (but not always) parallelize
    well
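
As an example of the numerical integration point, a trapezoidal-rule sum parallelizes directly with REDUCTION (a sketch with my own names: f is the integrand, [a,b] the interval, m the number of panels).

      h = (b-a)/float(m)
      s = 0.5*(f(a)+f(b))
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,x), SHARED(a,h,m)
C$OMP& REDUCTION(+:s)
      do i=1,m-1
        x = a + float(i)*h
        s = s + f(x)
      end do
C     composite trapezoidal estimate of the integral
      result = h*s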

39
Summary
  • The main difficulty in loop level parallel
    programming is figuring out whether there are
    data dependencies or race conditions
  • Remember that variables do not naturally carry
    into a parallel loop, or for that matter out of
    one
  • Use FIRSTPRIVATE and LASTPRIVATE when you need to
    do this
  • SCHEDULE provides many options
  • Use DYNAMIC when you have an unknown amount of
    work in a loop
  • Use STATIC when you need a regular access pattern
    to an array

40
Next Lecture
  • Introduction to visualization