Title: Computational Methods in Physics PHYS 3437
1. Computational Methods in Physics PHYS 3437
- Dr Rob Thacker
- Dept of Astronomy & Physics (MM-301C)
- thacker_at_ap.smu.ca
2. Today's Lecture
- Recap from end of last lecture
- Some technical details related to parallel programming
  - Data dependencies
  - Race conditions
- Summary of other clauses you can use in setting up parallel loops
3. Recap
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i), SHARED(X,Y,n,a)
      do i = 1, n
         Y(i) = a*X(i) + Y(i)
      end do
4. SHARED and PRIVATE
- The most commonly used clauses, and necessary to ensure correct execution
- PRIVATE: any variable declared as private will be local only to a given thread and is inaccessible to others (it is also uninitialized)
  - This means that if you have a variable, say t, in the serial section of the code and then use it in a loop, the value of t in the loop will not carry over the value of t from the serial part (see the sketch after this list)
  - Watch out for this, but there is a way around it
- SHARED: any variable declared as shared will be accessible by all other threads of execution
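A minimal sketch of this behaviour, assuming illustrative names t, Y and n:

      t = 100.0
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,t), SHARED(Y,n)
      do i = 1, n
C        inside the loop t does NOT start at 100.0 - it is uninitialized
         t = float(i)
         Y(i) = t
      end do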
5. Example
- The SHARED and PRIVATE specifications can be long!
C$OMP PRIVATE(icb,icol,izt,iyt,icell,iz_off,iy_off,ibz,
C$OMP& iby,ibx,i,rxadd,ryadd,rzadd,inx,iny,inz,nb,nebs,ibrf,
C$OMP& nbz,nby,nbx,nbrf,nbref,jnbox,jnboxnhc,idt,mdt,iboxd,
C$OMP& dedge,idir,redge,is,ie,twoh,dosph,rmind,in,ixyz,
C$OMP& redaughter,Ustmp,ngpp,hpp,vpp,apps,epp,hppi,hpp2,
C$OMP& rh2,hpp2i,hpp3i,hpp5i,dpp,divpp,dcvpp,nspp,rnspp,
C$OMP& rad2torbin,de1,dosphflag,dosphnb,nbzlow,nbzhigh,nbylow,
C$OMP& nbyhigh,nbxlow,nbxhigh,nbzadd,nbyadd,r3i,r2i,r1i,
C$OMP& dosphnbnb,dogravnb,js,je,j,rad2,rmj,grc,igrc,gfrac,
C$OMP& Gr,hppj,jlist,dx,rdv,rcv,v2,radii2,rbin,ibin,fbin,
C$OMP& wl1,dwl1,drnspp,hppa,hppji,hppj2i,hppj3i,hppj5i,
C$OMP& wl2,dwl2,w,dw,df,dppi,divppr,dcvpp2,dcvppm,divppm,csi,
C$OMP& fi,prhoi2,ispp,frcij,rdotv,hpa,rmuij,rhoij,cij,qij,
C$OMP& frc3,frc4,hcalc,rath,av,frc2,dr1,dr2,dr3,dr12,dr22,dr32,
C$OMP& appg1,appg2,appg3,gdiff,ddiff,d2diff,dv1,dv2,dv3,rpp,
C$OMP& Gro)
6. FIRSTPRIVATE
- Declaring a variable FIRSTPRIVATE will ensure that its value is copied in from any prior piece of serial code
- However (of course), if the variable is not initialized in the serial section it will remain uninitialized
- This happens only once for a given thread set
- Try to avoid writing to variables declared FIRSTPRIVATE
7. FIRSTPRIVATE example

      a = 5.0
C$OMP PARALLEL DO
C$OMP& SHARED(r), PRIVATE(i)
C$OMP& FIRSTPRIVATE(a)
      do i = 1, n
         r(i) = max(a, r(i))
      end do

- The lower bound of the values is set to the value of a; without the FIRSTPRIVATE clause, a would be uninitialized inside the loop (effectively a = 0.0)
8. LASTPRIVATE
- Occasionally it may be necessary to know the last value of a variable from the end of the loop (see the sketch below)
- LASTPRIVATE variables will initialize the value of the variable in the serial section using the last (sequential) value of the variable from the parallel loop
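A minimal sketch, with illustrative names x, Y and n; after the loop the serial copy of x holds the value from the final (i = n) iteration:

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(Y,n)
C$OMP& LASTPRIVATE(x)
      do i = 1, n
         x = 2.0*float(i)
         Y(i) = x
      end do
C     here x = 2.0*float(n), as if the loop had run serially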
9. Default behaviour
- You can actually omit the SHARED and PRIVATE statements - what is the expected behaviour?
  - The loop index is private by default
  - Other variables (scalars and arrays) are shared by default
- Bad practice in my opinion - specify the scoping for everything
10. DEFAULT
- I recommend using DEFAULT(NONE) at all times
  - Forces explicit scoping of all variables
- Alternatively, you can use DEFAULT(SHARED) or DEFAULT(PRIVATE) to specify that un-scoped variables will default to the particular type chosen
  - e.g. choosing DEFAULT(PRIVATE) will ensure any un-scoped variable is private
11. The Parallel Do Pragmas
- So far we've considered a small subset of the functionality
- Before we talk more about data dependencies, let's look briefly at what other statements can be used in a parallel do loop
- Besides PRIVATE and SHARED variables there are a number of other clauses that can be applied
12. Loop Level Parallelism in more detail
- For each parallel do (for) pragma, the following clauses are possible:

  C/C++:    private, shared, firstprivate, lastprivate, reduction, ordered, schedule, copyin
  FORTRAN:  PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, ORDERED, SCHEDULE, COPYIN, DEFAULT

- Red = most frequently used; clauses in italics we have already seen.
13. More background on data dependencies
- Suppose you try to parallelize the following loop
- It won't work as it is written, since iteration i depends upon iteration i-1 and thus we can't start anything in parallel
- To see this explicitly, let n = 20 and start thread 1 at i = 1 and thread 2 at i = 11: then thread 1 sets Y(1) = 1.0 and thread 2 sets Y(11) = 1.0 (which is wrong!)

      c = 0.0
      do i = 1, n
         c = c + 1.0
         Y(i) = c
      end do
14. Simple solution
- This loop can easily be re-written in a way that can be parallelized
- There is no longer any dependence on the previous operation
- Private variables: i; Shared variables: Y(), c, n

      c = 0.0
      do i = 1, n
         Y(i) = c + float(i)
      end do
      c = c + n
15. Types of Data Dependencies
- Suppose we have operations O1, O2 (with O1 executing before O2)
- True dependence
  - O2 has a true dependence on O1 if O2 reads a value written by O1
- Anti-dependence
  - O2 has an anti-dependence on O1 if O2 writes a value read by O1
- Output dependence
  - O2 has an output dependence on O1 if O2 writes a variable written by O1
16. Examples
- True dependence:
      A1 = A2 + A3
      B1 = A1 + B2
- Anti-dependence:
      B1 = A1 + B2
      A1 = C2
- Output dependence:
      B1 = 5
      B1 = 2
17. Dealing with Data Dependencies
- Any loop where iterations depend upon the previous one has a potential problem
- Any result which depends upon the order of the iterations will be a problem
- A good first test of whether something can be parallelized: reverse the loop iteration order - if the result changes, there is an order dependence
- Not all data dependencies can be eliminated
- Accumulations of variables (e.g. the sum of the elements of an array) can be dealt with easily
18. Accumulations
- Consider the following loop
- It apparently has a data dependency - however, each thread can accumulate its own partial sum independently
- OpenMP provides an explicit interface for this kind of operation (REDUCTION)

      a = 0.0
      do i = 1, n
         a = a + X(i)
      end do
19. REDUCTION clause
- This clause deals with parallel versions of loops like the following
- The outcome is determined by a reduction over all the values for each thread
- e.g. the max over all of a set is equivalent to the max over the maxes of subsets: if A = U_n A_n then Max(A) = Max( U_n Max(A_n) )

      do i = 1, N
         a = max(a, b(i))
      end do

      do i = 1, N
         a = min(a, b(i))
      end do

      do i = 1, n
         a = a + b(i)
      end do
20. Examples
- Syntax: REDUCTION(OP : variable), where OP = max, min, +, -, * (and the logical operators)

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(b)
C$OMP& REDUCTION(max:a)
      do i = 1, N
         a = max(a, b(i))
      end do

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(b)
C$OMP& REDUCTION(min:a)
      do i = 1, N
         a = min(a, b(i))
      end do
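The accumulation loop from slide 18 is handled the same way; a minimal sketch:

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(X,n)
C$OMP& REDUCTION(+:a)
      do i = 1, n
         a = a + X(i)
      end do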
21. What is REDUCTION actually doing?
- Saving you from writing more code
- The reduction clause generates an array of copies of the reduction variable, and each thread is responsible for a certain element of that array (see the sketch below)
- The final reduction over all the array elements (when the loop is finished) is performed transparently to the user
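Roughly what this amounts to, written out by hand - a sketch only, where the partial-sum array apartial and the loop bookkeeping are illustrative (omp_get_max_threads and omp_get_thread_num are standard OpenMP library routines):

      nthreads = omp_get_max_threads()
      do it = 1, nthreads
         apartial(it) = 0.0
      end do
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,mythread), SHARED(X,n,apartial)
      do i = 1, n
         mythread = omp_get_thread_num() + 1
         apartial(mythread) = apartial(mythread) + X(i)
      end do
C     serial reduction over the per-thread partial sums
      a = 0.0
      do it = 1, nthreads
         a = a + apartial(it)
      end do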
22. Initialization
- Reduction variables are initialized as follows (from the standard):

  Operator   Initialization
  +          0
  *          1
  -          0
  MAX        smallest representable number
  MIN        largest representable number
23. Race Conditions
- A common operation is to resolve a spatial position into an array index - consider the following loop
- It looks innocent enough, but suppose two particles have the same position

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,j)
C$OMP& SHARED(r,A,n)
      do i = 1, n
         j = int(r(i))
         A(j) = A(j) + 1.
      end do

- r(): array of positions; A(): array that is modified using information from r()
24. Race Conditions: A concurrency problem
- Two different threads of execution can concurrently attempt to update the same memory location
- [Diagram: timeline of two threads each executing A(j) = A(j) + 1. on the same j at overlapping times, so one update is lost]
25. Dealing with Race Conditions
- Need a mechanism to ensure updates to single variables occur within a critical section
- Any thread entering a critical section blocks all others
- Critical sections can be established by using lock variables (see the sketch after this list)
- Think of lock variables as preventing more than one thread from working on a particular piece of code at any one time - just like a lock on a door prevents people from entering a room
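For reference, a minimal sketch using OpenMP's explicit lock routines (the lock variable lck and the surrounding loop are illustrative; omp_init_lock, omp_set_lock, omp_unset_lock and omp_destroy_lock are the standard library calls):

      use omp_lib
      integer(kind=omp_lock_kind) lck
      call omp_init_lock(lck)
C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(a,n,lck)
      do i = 1, n
C        ... work ...
         call omp_set_lock(lck)
         a = a + 1.
         call omp_unset_lock(lck)
      end do
      call omp_destroy_lock(lck)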
26. Deadlocks: The pitfall of locking
- Must ensure a situation is not created where threads holding locks issue requests that create a deadlock
- Nested locks are a classic example of this
- Can also create problems between multiple processes - the "deadly embrace"
- [Diagram: Process 1 holds Resource 1 and requests Resource 2, while Process 2 holds Resource 2 and requests Resource 1 - neither can proceed]
27. Solutions
- Need to ensure memory reads/writes occur without any overlap
- If the access occurs to a single region, we can use a critical section

      do i = 1, n
C        ... work ...
C$OMP CRITICAL(lckx)
         a = a + 1.
C$OMP END CRITICAL(lckx)
      end do

- Only one thread will be allowed inside the critical section at a time
- I have given a name (lckx) to the critical section, but you don't have to do this
28. ATOMIC
- If all you want to do is ensure the correct update of one variable, you can use the atomic update facility
- Exactly the same as a critical section around one single update point

C$OMP PARALLEL DO
      do i = 1, n
C        ... work ...
C$OMP ATOMIC
         a = a + 1.
      end do
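As an example, the position-binning loop from slide 23 can be made safe the same way; a minimal sketch:

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,j), SHARED(r,A,n)
      do i = 1, n
         j = int(r(i))
C$OMP ATOMIC
         A(j) = A(j) + 1.
      end do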
29. Can be inefficient
- If other threads are waiting to enter the critical section then the program may even degenerate to a serial code!
- Make sure there is much more work outside the locked region than inside it!
- [Diagram: between FORK and JOIN, each thread alternates between "doing work" and "waiting for lock"; a parallel section where every thread waits for the lock before being able to proceed is a complete disaster]
30. COPYIN & ORDERED
- Suppose you have a small section of code that must always be executed in sequential order
- However, the remaining work can be done in any order
- Placing an ORDERED block around that section of code (and the ORDERED clause on the parallel do) will force threads to execute it sequentially (see the sketch after this list)
- If a common block is specified as private in a parallel do, COPYIN will ensure that all threads are initialized with the same values as in the serial section of the code
- Essentially FIRSTPRIVATE for common blocks/globals
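A minimal sketch of ORDERED (the write to unit 10 is just an illustrative "must stay in order" operation):

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(Y,n)
C$OMP& ORDERED
      do i = 1, n
C        this part runs in parallel, in any order
         Y(i) = float(i)**2
C$OMP ORDERED
C        this part executes in loop-iteration order
         write(10,*) i, Y(i)
C$OMP END ORDERED
      end do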
31. Subtle point about running in parallel
- When running in parallel you are only as fast as your slowest thread
- In the example, the total work is 40 seconds and we have 4 CPUs
- The maximum speed-up would give 40/4 = 10 secs
- All threads would have to take 10 secs each, though, to achieve that maximum speed-up
- [Figure: example of poor load balance - only a 40/16 = 2.5 speed-up despite using 4 processors]
32. SCHEDULE
- This is the mechanism for determining how work is spread among threads
- Important for ensuring that work is spread evenly among the threads - just giving each thread the same number of iterations may not guarantee they all complete at the same time
- Four types of scheduling are possible: STATIC, DYNAMIC, GUIDED, RUNTIME
33. STATIC scheduling
- Simplest of the four
- If SCHEDULE is unspecified, STATIC scheduling will result
- Default behaviour is to simply divide up the iterations among the threads, n/(# of threads) contiguous iterations each
- SCHEDULE(STATIC,chunksize) instead creates a cyclic distribution of iterations in blocks of chunksize (see the sketch below)
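A minimal sketch of the syntax (chunksize = 1 here, matching the comparison on the next slide):

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(Y,n)
C$OMP& SCHEDULE(STATIC,1)
      do i = 1, n
         Y(i) = float(i)**2
      end do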
34. Comparison
- STATIC, no chunksize (block distribution of 16 iterations over 4 threads):
  THREAD 1: iterations 1-4
  THREAD 2: iterations 5-8
  THREAD 3: iterations 9-12
  THREAD 4: iterations 13-16
- STATIC, chunksize = 1 (cyclic distribution):
  THREAD 1: iterations 1, 5, 9, 13
  THREAD 2: iterations 2, 6, 10, 14
  THREAD 3: iterations 3, 7, 11, 15
  THREAD 4: iterations 4, 8, 12, 16
35. DYNAMIC scheduling
- DYNAMIC scheduling is a personal favourite
- Specify using SCHEDULE(DYNAMIC,chunksize)
- A simple implementation of a master-worker type distribution of iterations
- The master thread passes off blocks of chunksize iterations to the workers (see the sketch below)
- Not a silver bullet - if the load imbalance is too severe (i.e. one thread takes longer than the rest combined) an algorithm rewrite is necessary
- Also not good if you need a regular access pattern for data locality
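A minimal sketch of the syntax (the chunk size of 10 is illustrative):

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(Y,n)
C$OMP& SCHEDULE(DYNAMIC,10)
      do i = 1, n
C        blocks of 10 iterations are handed out as threads become free
         Y(i) = float(i)**2
      end do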
36. Master-Worker Model
- [Diagram: worker threads send REQUESTs to the Master thread, which hands out blocks of iterations in response]
37. Other ways to use OpenMP
- We've really only skimmed the surface of what you can do
- However, we have covered the important details
- OpenMP provides a different programming model to just using loops
  - It isn't that much harder, but you need to think slightly differently
- Check out www.openmp.org for more details
38. Applying to algorithms used in the course
- What could we apply OpenMP to?
- Root finding algorithms are actually fundamentally serial!
  - Global bracket finder: subdivide the region and let each CPU search its allotted space in parallel
- LU decomposition can be parallelized
- Numerical integration can be parallelized (see the sketch after this list)
- ODE solvers are not usually good parallelization candidates, but it is problem dependent
- MC methods usually (but not always) parallelize well
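For example, a minimal sketch of a parallel trapezoid-rule integration using REDUCTION (the integrand f, the limits a and b, and the number of intervals n are illustrative):

      h = (b - a)/float(n)
      area = 0.5*(f(a) + f(b))
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,x), SHARED(a,h,n)
C$OMP& REDUCTION(+:area)
      do i = 1, n-1
         x = a + float(i)*h
         area = area + f(x)
      end do
      area = h*area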
39. Summary
- The main difficulty in loop-level parallel programming is figuring out whether there are data dependencies or race conditions
- Remember that variables do not naturally carry into a parallel loop, or for that matter out of one
  - Use FIRSTPRIVATE and LASTPRIVATE when you need to do this
- SCHEDULE provides many options
  - Use DYNAMIC when you have an unknown amount of work per iteration in a loop
  - Use STATIC when you need a regular access pattern to arrays
40. Next Lecture
- Introduction to visualization