Title: Computational Methods in Physics PHYS 3437
1. Computational Methods in Physics PHYS 3437
- Dr Rob Thacker
- Dept of Astronomy & Physics (MM-301C)
- thacker_at_ap.smu.ca
2. Today's Lecture
- Recap from end of last lecture
- Some technical details related to parallel programming
  - Data dependencies
  - Race conditions
- Summary of other clauses you can use in setting up parallel loops
3. Recap
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i), SHARED(X,Y,n,a)
      do i = 1, n
         Y(i) = a*X(i) + Y(i)
      end do
4. SHARED and PRIVATE
- The most commonly used clauses, and necessary to ensure correct execution
- PRIVATE: any variable declared as private will be local only to a given thread and is inaccessible to others (it is also uninitialized)
  - This means that if you have a variable, say t, in the serial section of the code and then use it in a loop, the value of t in the loop will not carry over the value of t from the serial part (see the sketch after this list)
  - Watch out for this, but there is a way around it
- SHARED: any variable declared as shared will be accessible by all other threads of execution
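A minimal sketch of this behaviour, assuming illustrative names t, Y and n:

      t = 100.0
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,t), SHARED(Y,n)
      do i = 1, n
C        inside the loop t does NOT start at 100.0 - it is uninitialized
         t = float(i)
         Y(i) = t
      end do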
5. Example
- The SHARED and PRIVATE specifications can be long!
C$OMP PRIVATE(icb,icol,izt,iyt,icell,iz_off,iy_off,ibz,
C$OMP& iby,ibx,i,rxadd,ryadd,rzadd,inx,iny,inz,nb,nebs,ibrf,
C$OMP& nbz,nby,nbx,nbrf,nbref,jnbox,jnboxnhc,idt,mdt,iboxd,
C$OMP& dedge,idir,redge,is,ie,twoh,dosph,rmind,in,ixyz,
C$OMP& redaughter,Ustmp,ngpp,hpp,vpp,apps,epp,hppi,hpp2,
C$OMP& rh2,hpp2i,hpp3i,hpp5i,dpp,divpp,dcvpp,nspp,rnspp,
C$OMP& rad2torbin,de1,dosphflag,dosphnb,nbzlow,nbzhigh,nbylow,
C$OMP& nbyhigh,nbxlow,nbxhigh,nbzadd,nbyadd,r3i,r2i,r1i,
C$OMP& dosphnbnb,dogravnb,js,je,j,rad2,rmj,grc,igrc,gfrac,
C$OMP& Gr,hppj,jlist,dx,rdv,rcv,v2,radii2,rbin,ibin,fbin,
C$OMP& wl1,dwl1,drnspp,hppa,hppji,hppj2i,hppj3i,hppj5i,
C$OMP& wl2,dwl2,w,dw,df,dppi,divppr,dcvpp2,dcvppm,divppm,csi,
C$OMP& fi,prhoi2,ispp,frcij,rdotv,hpa,rmuij,rhoij,cij,qij,
C$OMP& frc3,frc4,hcalc,rath,av,frc2,dr1,dr2,dr3,dr12,dr22,dr32,
C$OMP& appg1,appg2,appg3,gdiff,ddiff,d2diff,dv1,dv2,dv3,rpp,
C$OMP& Gro)
6. FIRSTPRIVATE
- Declaring a variable FIRSTPRIVATE will ensure that its value is copied in from any prior piece of serial code
- However (of course), if the variable is not initialized in the serial section it will remain uninitialized
- This happens only once for a given thread set
- Try to avoid writing to variables declared FIRSTPRIVATE
7. FIRSTPRIVATE example

      a = 5.0
C$OMP PARALLEL DO
C$OMP& SHARED(r), PRIVATE(i)
C$OMP& FIRSTPRIVATE(a)
      do i = 1, n
         r(i) = max(a, r(i))
      end do

- The lower bound of the values is set to the value of a; without the FIRSTPRIVATE clause, a would be uninitialized inside the loop (effectively a = 0.0)
8. LASTPRIVATE
- Occasionally it may be necessary to know the last value of a variable from the end of the loop (see the sketch below)
- LASTPRIVATE variables will initialize the value of the variable in the serial section using the last (sequential) value of the variable from the parallel loop
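A minimal sketch, with illustrative names x, Y and n; after the loop the serial copy of x holds the value from the final (i = n) iteration:

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(Y,n)
C$OMP& LASTPRIVATE(x)
      do i = 1, n
         x = 2.0*float(i)
         Y(i) = x
      end do
C     here x = 2.0*float(n), as if the loop had run serially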
9. Default behaviour
- You can actually omit the SHARED and PRIVATE statements - what is the expected behaviour?
  - The loop index is private by default
  - Other variables (scalars and arrays) are shared by default
- Bad practice in my opinion - specify the scoping for everything
10. DEFAULT
- I recommend using DEFAULT(NONE) at all times
  - Forces explicit scoping of all variables
- Alternatively, you can use DEFAULT(SHARED) or DEFAULT(PRIVATE) to specify that un-scoped variables will default to the particular type chosen
  - e.g. choosing DEFAULT(PRIVATE) will ensure any un-scoped variable is private
11. The Parallel Do Pragmas
- So far we've considered a small subset of the functionality
- Before we talk more about data dependencies, let's look briefly at what other statements can be used in a parallel do loop
- Besides PRIVATE and SHARED variables there are a number of other clauses that can be applied
12. Loop Level Parallelism in more detail
- For each parallel do (for) pragma, the following clauses are possible:

  C/C++:    private, shared, firstprivate, lastprivate, reduction, ordered, schedule, copyin
  FORTRAN:  PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, REDUCTION, ORDERED, SCHEDULE, COPYIN, DEFAULT

- Red = most frequently used; clauses in italics we have already seen.
13. More background on data dependencies
- Suppose you try to parallelize the following loop
- It won't work as it is written, since iteration i depends upon iteration i-1 and thus we can't start anything in parallel
- To see this explicitly, let n = 20 and start thread 1 at i = 1 and thread 2 at i = 11: then thread 1 sets Y(1) = 1.0 and thread 2 sets Y(11) = 1.0 (which is wrong!)

      c = 0.0
      do i = 1, n
         c = c + 1.0
         Y(i) = c
      end do
14. Simple solution
- This loop can easily be re-written in a way that can be parallelized
- There is no longer any dependence on the previous operation
- Private variables: i; Shared variables: Y(), c, n

      c = 0.0
      do i = 1, n
         Y(i) = c + float(i)
      end do
      c = c + n
15. Types of Data Dependencies
- Suppose we have operations O1, O2 (with O1 executing before O2)
- True dependence
  - O2 has a true dependence on O1 if O2 reads a value written by O1
- Anti-dependence
  - O2 has an anti-dependence on O1 if O2 writes a value read by O1
- Output dependence
  - O2 has an output dependence on O1 if O2 writes a variable written by O1
16. Examples
- True dependence:
      A1 = A2 + A3
      B1 = A1 + B2
- Anti-dependence:
      B1 = A1 + B2
      A1 = C2
- Output dependence:
      B1 = 5
      B1 = 2
17. Dealing with Data Dependencies
- Any loop where iterations depend upon the previous one has a potential problem
- Any result which depends upon the order of the iterations will be a problem
- A good first test of whether something can be parallelized: reverse the loop iteration order - if the result changes, there is an order dependence
- Not all data dependencies can be eliminated
- Accumulations of variables (e.g. the sum of the elements of an array) can be dealt with easily
18. Accumulations
- Consider the following loop
- It apparently has a data dependency - however, each thread can accumulate its own partial sum independently
- OpenMP provides an explicit interface for this kind of operation (REDUCTION)

      a = 0.0
      do i = 1, n
         a = a + X(i)
      end do
19. REDUCTION clause
- This clause deals with parallel versions of loops like the following
- The outcome is determined by a reduction over all the values for each thread
- e.g. the max over all of a set is equivalent to the max over the maxes of subsets: if A = U_n A_n then Max(A) = Max( U_n Max(A_n) )

      do i = 1, N
         a = max(a, b(i))
      end do

      do i = 1, N
         a = min(a, b(i))
      end do

      do i = 1, n
         a = a + b(i)
      end do
20. Examples
- Syntax: REDUCTION(OP : variable), where OP = max, min, +, -, * (and the logical operators)

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(b)
C$OMP& REDUCTION(max:a)
      do i = 1, N
         a = max(a, b(i))
      end do

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(b)
C$OMP& REDUCTION(min:a)
      do i = 1, N
         a = min(a, b(i))
      end do
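The accumulation loop from slide 18 is handled the same way; a minimal sketch:

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(X,n)
C$OMP& REDUCTION(+:a)
      do i = 1, n
         a = a + X(i)
      end do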
21. What is REDUCTION actually doing?
- Saving you from writing more code
- The reduction clause generates an array of copies of the reduction variable, and each thread is responsible for a certain element of that array (see the sketch below)
- The final reduction over all the array elements (when the loop is finished) is performed transparently to the user
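Roughly what this amounts to, written out by hand - a sketch only, where the partial-sum array apartial and the loop bookkeeping are illustrative (omp_get_max_threads and omp_get_thread_num are standard OpenMP library routines):

      nthreads = omp_get_max_threads()
      do it = 1, nthreads
         apartial(it) = 0.0
      end do
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,mythread), SHARED(X,n,apartial)
      do i = 1, n
         mythread = omp_get_thread_num() + 1
         apartial(mythread) = apartial(mythread) + X(i)
      end do
C     serial reduction over the per-thread partial sums
      a = 0.0
      do it = 1, nthreads
         a = a + apartial(it)
      end do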
22. Initialization
- Reduction variables are initialized as follows (from the standard):

  Operator   Initialization
  +          0
  *          1
  -          0
  MAX        smallest representable number
  MIN        largest representable number
23. Race Conditions
- A common operation is to resolve a spatial position into an array index - consider the following loop
- It looks innocent enough, but suppose two particles have the same position

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,j)
C$OMP& SHARED(r,A,n)
      do i = 1, n
         j = int(r(i))
         A(j) = A(j) + 1.
      end do

- r(): array of positions; A(): array that is modified using information from r()
24. Race Conditions: A concurrency problem
- Two different threads of execution can concurrently attempt to update the same memory location
- [Diagram: timeline of two threads each executing A(j) = A(j) + 1. on the same j at overlapping times, so one update is lost]
25. Dealing with Race Conditions
- Need a mechanism to ensure updates to single variables occur within a critical section
- Any thread entering a critical section blocks all others
- Critical sections can be established by using lock variables (see the sketch after this list)
- Think of lock variables as preventing more than one thread from working on a particular piece of code at any one time - just like a lock on a door prevents people from entering a room
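For reference, a minimal sketch using OpenMP's explicit lock routines (the lock variable lck and the surrounding loop are illustrative; omp_init_lock, omp_set_lock, omp_unset_lock and omp_destroy_lock are the standard library calls):

      use omp_lib
      integer(kind=omp_lock_kind) lck
      call omp_init_lock(lck)
C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(a,n,lck)
      do i = 1, n
C        ... work ...
         call omp_set_lock(lck)
         a = a + 1.
         call omp_unset_lock(lck)
      end do
      call omp_destroy_lock(lck)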
26. Deadlocks: The pitfall of locking
- Must ensure a situation is not created where threads holding locks issue requests that create a deadlock
- Nested locks are a classic example of this
- Can also create problems between multiple processes - the "deadly embrace"
- [Diagram: Process 1 holds Resource 1 and requests Resource 2, while Process 2 holds Resource 2 and requests Resource 1 - neither can proceed]
27. Solutions
- Need to ensure memory reads/writes occur without any overlap
- If the access occurs to a single region, we can use a critical section

      do i = 1, n
C        ... work ...
C$OMP CRITICAL(lckx)
         a = a + 1.
C$OMP END CRITICAL(lckx)
      end do

- Only one thread will be allowed inside the critical section at a time
- I have given a name (lckx) to the critical section, but you don't have to do this
28. ATOMIC
- If all you want to do is ensure the correct update of one variable, you can use the atomic update facility
- Exactly the same as a critical section around one single update point

C$OMP PARALLEL DO
      do i = 1, n
C        ... work ...
C$OMP ATOMIC
         a = a + 1.
      end do
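As an example, the position-binning loop from slide 23 can be made safe the same way; a minimal sketch:

C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i,j), SHARED(r,A,n)
      do i = 1, n
         j = int(r(i))
C$OMP ATOMIC
         A(j) = A(j) + 1.
      end do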
29. Can be inefficient
- If other threads are waiting to enter the critical section then the program may even degenerate to a serial code!
- Make sure there is much more work outside the locked region than inside it!
- [Diagram: between FORK and JOIN, each thread alternates between "doing work" and "waiting for lock"; a parallel section where every thread waits for the lock before being able to proceed is a complete disaster]
30. COPYIN & ORDERED
- Suppose you have a small section of code that must always be executed in sequential order
- However, the remaining work can be done in any order
- Placing an ORDERED block around that section of code (and the ORDERED clause on the parallel do) will force threads to execute it sequentially (see the sketch after this list)
- If a common block is specified as private in a parallel do, COPYIN will ensure that all threads are initialized with the same values as in the serial section of the code
- Essentially FIRSTPRIVATE for common blocks/globals
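A minimal sketch of ORDERED (the write to unit 10 is just an illustrative "must stay in order" operation):

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(Y,n)
C$OMP& ORDERED
      do i = 1, n
C        this part runs in parallel, in any order
         Y(i) = float(i)**2
C$OMP ORDERED
C        this part executes in loop-iteration order
         write(10,*) i, Y(i)
C$OMP END ORDERED
      end do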
31. Subtle point about running in parallel
- When running in parallel you are only as fast as your slowest thread
- In the example, the total work is 40 seconds and we have 4 CPUs
- The maximum speed-up would give 40/4 = 10 secs
- All threads would have to take 10 secs each, though, to achieve that maximum speed-up
- [Figure: example of poor load balance - only a 40/16 = 2.5 speed-up despite using 4 processors]
32. SCHEDULE
- This is the mechanism for determining how work is spread among threads
- Important for ensuring that work is spread evenly among the threads - just giving each thread the same number of iterations may not guarantee they all complete at the same time
- Four types of scheduling are possible: STATIC, DYNAMIC, GUIDED, RUNTIME
33. STATIC scheduling
- Simplest of the four
- If SCHEDULE is unspecified, STATIC scheduling will result
- Default behaviour is to simply divide up the iterations among the threads, n/(# of threads) contiguous iterations each
- SCHEDULE(STATIC,chunksize) instead creates a cyclic distribution of iterations in blocks of chunksize (see the sketch below)
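A minimal sketch of the syntax (chunksize = 1 here, matching the comparison on the next slide):

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(Y,n)
C$OMP& SCHEDULE(STATIC,1)
      do i = 1, n
         Y(i) = float(i)**2
      end do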
34. Comparison
- STATIC, no chunksize (block distribution of 16 iterations over 4 threads):
  THREAD 1: iterations 1-4
  THREAD 2: iterations 5-8
  THREAD 3: iterations 9-12
  THREAD 4: iterations 13-16
- STATIC, chunksize = 1 (cyclic distribution):
  THREAD 1: iterations 1, 5, 9, 13
  THREAD 2: iterations 2, 6, 10, 14
  THREAD 3: iterations 3, 7, 11, 15
  THREAD 4: iterations 4, 8, 12, 16
35. DYNAMIC scheduling
- DYNAMIC scheduling is a personal favourite
- Specify using SCHEDULE(DYNAMIC,chunksize)
- A simple implementation of a master-worker type distribution of iterations
- The master thread passes off blocks of chunksize iterations to the workers (see the sketch below)
- Not a silver bullet - if the load imbalance is too severe (i.e. one thread takes longer than the rest combined) an algorithm rewrite is necessary
- Also not good if you need a regular access pattern for data locality
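A minimal sketch of the syntax (the chunk size of 10 is illustrative):

C$OMP PARALLEL DO
C$OMP& PRIVATE(i), SHARED(Y,n)
C$OMP& SCHEDULE(DYNAMIC,10)
      do i = 1, n
C        blocks of 10 iterations are handed out as threads become free
         Y(i) = float(i)**2
      end do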
36. Master-Worker Model
- [Diagram: worker threads send REQUESTs to the Master thread, which hands out blocks of iterations in response]
37. Other ways to use OpenMP
- We've really only skimmed the surface of what you can do
- However, we have covered the important details
- OpenMP provides a different programming model to just using loops
  - It isn't that much harder, but you need to think slightly differently
- Check out www.openmp.org for more details
38. Applying to algorithms used in the course
- What could we apply OpenMP to?
- Root finding algorithms are actually fundamentally serial!
  - Global bracket finder: subdivide the region and let each CPU search its allotted space in parallel
- LU decomposition can be parallelized
- Numerical integration can be parallelized (see the sketch after this list)
- ODE solvers are not usually good parallelization candidates, but it is problem dependent
- MC methods usually (but not always) parallelize well
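For example, a minimal sketch of a parallel trapezoid-rule integration using REDUCTION (the integrand f, the limits a and b, and the number of intervals n are illustrative):

      h = (b - a)/float(n)
      area = 0.5*(f(a) + f(b))
C$OMP PARALLEL DO
C$OMP& PRIVATE(i,x), SHARED(a,h,n)
C$OMP& REDUCTION(+:area)
      do i = 1, n-1
         x = a + float(i)*h
         area = area + f(x)
      end do
      area = h*area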
39. Summary
- The main difficulty in loop-level parallel programming is figuring out whether there are data dependencies or race conditions
- Remember that variables do not naturally carry into a parallel loop, or for that matter out of one
  - Use FIRSTPRIVATE and LASTPRIVATE when you need to do this
- SCHEDULE provides many options
  - Use DYNAMIC when you have an unknown amount of work per iteration in a loop
  - Use STATIC when you need a regular access pattern to arrays
40. Next Lecture
- Introduction to visualization