Title: Computational Methods in Physics PHYS 3437
1 Computational Methods in Physics PHYS 3437
- Dr. Rob Thacker
- Dept. of Astronomy & Physics (MM-301C)
- thacker_at_ap.smu.ca
2 Today's Lecture
- Introduction to parallel programming
- Concepts: what are parallel computers, what is parallel programming?
- Why do you need to use parallel programming?
- When parallelism will be beneficial
- Amdahl's Law
- Very brief introduction to OpenMP
3 Why bother to teach this in an undergrad Physics course?
- Because parallel computing is now ubiquitous
- Most laptops are parallel computers, for example
- Dual/quad core chips are already standard; in the future we can look forward to 8/16/32/64 cores per chip!
- Actually, Sun Microsystems already sells a chip with 8 cores
- I predict that by 2012 you will be buying chips with 16 cores
- If we want to use all this capacity, then we will need to run codes that can use more than one CPU core at a time
- Such codes are said to be parallel
- Exposure to these concepts will help you significantly if you want to go to grad school in an area that uses computational methods extensively
- Because not many people have these skills!
- If you are interested, an excellent essay on how computing is changing can be found here:
- http://view.eecs.berkeley.edu/wiki/Main_Page
4 Some caveats
- In two lectures we cannot cover very much on parallel computing
- We will concentrate on the simplest kind of parallel programming
- It exposes some of the inherent problems
- It still gives you useful increased performance
- Remember, making a code run 10 times faster turns a week into a day!
- The type of programming we'll be looking at is often limited in terms of the maximum speed-up possible, but factors of 10 are pretty common
5 Why can't the compiler just make my code parallel for me?
- In some situations it can, but most of the time it can't
- You really are smarter than a compiler is!
- There are many situations where a compiler will not be able to make something parallel but you can
- Compilers that can attempt to parallelize code are called auto-parallelizing
- Some people have suggested writing parallel languages that only allow the types of code that can be easily parallelized
- These have proven not to be very popular and are too restrictive
- At present, the most popular way of parallel programming is to add additional commands to your original code
- These commands are sometimes called pragmas or directives
6 Recap: von Neumann architecture
Machine instructions are encoded in binary and stored in memory: the key insight!
- First practical stored-program architecture
- Still in use today
- Speed is limited by the bandwidth of data between memory and processing unit: the von Neumann bottleneck
[Diagram: von Neumann architecture. The CPU (control unit plus processing unit) connects to memory (program and data) and to input and output.]
Developed while working on the EDVAC design
7 Shared memory computers
Program these computers using OpenMP extensions to C and FORTRAN.
[Diagram: four CPUs connected to a single shared memory over a common bus]
Traditional shared memory design: all processors share a memory bus. All of the processors see the same memory locations, which means that programming these computers is reasonably straightforward. Such machines are sometimes called SMPs, for symmetric multi-processor.
8 Distributed memory computers
Program these computers using MPI or PVM extensions to C and FORTRAN.
[Diagram: four CPUs, each with its own private memory, connected to one another by a network]
Really a collection of computers linked together via a network. Each processor has its own memory and must communicate with other processors over the network to get information held in other memory locations. This is really quite difficult at times. This is the architecture of computer clusters (each CPU here could actually be a shared memory computer itself).
9 Parallel execution
- What do we mean by being able to do things in parallel?
- Suppose the input data of an operation is divided into a series of independent parts
- Processing of the parts is carried out independently
- A simple example is operations on vectors/arrays where we loop over array indices (a sketch follows below)
[Diagram: array A(i) divided into chunks, with each chunk processed by a different processor]
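A minimal C sketch (not from the slides; the array size, chunk count, and operation are illustrative) of what "independent parts" means: the iterations of an array loop are split into disjoint chunks, and each chunk could be handed to a different processor.

    /* Minimal sketch: splitting an array operation into independent chunks.
       N, NPROC and the operation 2.0*i are all illustrative choices.        */
    #include <stdio.h>

    #define N     1000
    #define NPROC 4               /* hypothetical number of processors */

    int main(void)
    {
        double A[N];

        /* Each chunk touches a disjoint range of indices, so the chunks
           could run simultaneously on different processors.             */
        for (int p = 0; p < NPROC; p++) {
            int lo = p * (N / NPROC);
            int hi = (p == NPROC - 1) ? N : lo + N / NPROC;
            for (int i = lo; i < hi; i++)
                A[i] = 2.0 * i;   /* result independent of every other i */
        }

        printf("A[N-1] = %f\n", A[N - 1]);
        return 0;
    }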
10 Some subtleties
- However, you can't always do this
- Consider:
      do i = 2, n
        a(i) = a(i-1)
      end do
- This kind of loop has what we call a dependence
- If you update a value of a(i) before a(i-1) has been updated then you will get the wrong answer compared to running on a single processor
- We'll talk a little more about this later, but it does mean that not every loop can be parallelized (a small demonstration follows below)
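To see the dependence concretely, here is a small C sketch (not from the slides) of the same recurrence. Running the loop forward gives a different result from running it backward, which is exactly the reversal test mentioned later in the lecture; an independent loop would give the same answer either way.

    /* Small sketch of the a(i) = a(i-1) dependence: each iteration reads
       the value the previous iteration just wrote, so the iterations
       cannot be reordered or run concurrently.                           */
    #include <stdio.h>

    int main(void)
    {
        double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        double b[8] = {1, 2, 3, 4, 5, 6, 7, 8};

        for (int i = 1; i < 8; i++)     /* forward: every element becomes a[0] */
            a[i] = a[i - 1];

        for (int i = 7; i >= 1; i--)    /* reversed: each element copies its   */
            b[i] = b[i - 1];            /* original left-hand neighbour        */

        printf("forward:  a[7] = %g\n", a[7]);   /* prints 1 */
        printf("reversed: b[7] = %g\n", b[7]);   /* prints 7 */
        return 0;
    }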
11 Issues to be aware of
- Parallel computing is not about being cool and doing lots and lots of flops
- Flops = floating point operations per second
- We want solutions to problems in a reasonable amount of time
- Sometimes that means doing a lot of calculations, e.g. consider what we found about the number of collisions for molecules in air
- Gains from algorithmic improvements will often swamp hardware improvements
- Don't be brain-limited: if there is a better algorithm, use it
12 Algorithmic Improvements in n-body simulations
Improvements in the speed of the algorithms have been proportionally better than the speed increase of the computers themselves over the same time interval.
13 Identifying Performance Desires
- Frequency of use: Daily (positive precondition) -> Monthly -> Yearly (negative precondition)
- Code evolution timescale: Hundreds of executions between changes (positive precondition) -> Changes each run (negative precondition)
14 Performance Characteristics
- Execution time: Days (positive precondition) -> Hours -> Minutes (negative precondition)
- Level of synchronization: None (positive precondition) -> Infrequent (every minute) -> Frequent (many per second) (negative precondition)
15 Data and Algorithm
- Data structures: Regular, static (positive precondition) -> Irregular, dynamic (negative precondition)
- Algorithmic complexity (roughly, the number of stages): Simple (positive precondition) -> Complex (negative precondition)
16 Requirements
- Must significantly increase resolution/length of integration (positive precondition) -> Need a factor of 2 increase -> Current resolution meets needs (negative precondition)
17 How much speed-up can we achieve?
- Some parts of a code cannot be run in parallel
- For example, the loop over a(i) = a(i-1) from earlier
- Any code that cannot be executed in parallel is said to be serial or sequential
- Let's suppose that, in terms of the total execution time of a program, a fraction f_s has to be run in serial, while a fraction f_p can be run in parallel on n CPUs
- Equivalently, the time spent in each fraction will be t_s and t_p, so the total time on 1 CPU is t_1cpu = t_s + t_p
- If we can run the parallel fraction on n CPUs then it will take a time t_p/n
- The total time will then be t_ncpu = t_s + t_p/n
18 Amdahl's Law
- How much speed-up (S_n = t_1cpu / t_ncpu) is feasible?
- Amdahl's Law is the most significant limit. Given our previous results and n processors, the maximum speed-up is
      S_n = (t_s + t_p) / (t_s + t_p/n) = 1 / (f_s + f_p/n)
- Only if the serial fraction f_s = t_s / (t_s + t_p) is zero is perfect speed-up possible (at least in theory); a short worked example follows below
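To make the limit concrete, here is a short C sketch (not from the slides) that evaluates S_n = 1/(f_s + f_p/n) for an assumed serial fraction of 10%; the value 0.1 is purely illustrative.

    /* Evaluate Amdahl's Law for an assumed serial fraction of 10%. */
    #include <stdio.h>

    int main(void)
    {
        double fs = 0.1;           /* assumed serial fraction */
        double fp = 1.0 - fs;      /* parallel fraction       */

        for (int n = 1; n <= 1024; n *= 2) {
            double Sn = 1.0 / (fs + fp / n);
            printf("n = %4d   speed-up = %6.2f\n", n, Sn);
        }
        /* As n grows, S_n approaches 1/fs = 10: the 10% serial part
           caps the achievable speed-up no matter how many CPUs we add. */
        return 0;
    }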
19 Amdahl's Law
[Figure: speed-up S_n as a function of the number of CPUs, N_cpu]
20 What is OpenMP?
- OpenMP is a pragma-based application programmer interface (API) that provides a simple extension to C/C++ and FORTRAN
- Pragma is just a fancy word for an instruction to the compiler
- It is designed exclusively for shared memory programming
- Ultimately, OpenMP is a very simple interface to something called threads-based programming
- What actually happens when you break up a loop into pieces is that a number of threads of execution are created that can run the loop pieces in parallel
21 Threads-based execution
- Serial execution, interspersed with parallel regions: the fork-join model (a minimal sketch follows below)
- In practice many compilers block execution of the extra threads during serial sections; this saves the overhead of the fork-join operation
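A minimal OpenMP sketch in C (not from the lecture) of the fork-join pattern: one thread runs the serial sections, a team of threads runs the parallel region, and execution joins back to a single thread afterwards. omp_get_thread_num() and omp_get_num_threads() are standard OpenMP runtime routines.

    /* Fork-join: serial, then a parallel region, then serial again. */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("serial section: one thread\n");

        #pragma omp parallel          /* fork: a team of threads starts here */
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                             /* join: back to a single thread       */

        printf("serial section again: one thread\n");
        return 0;
    }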
22 Some background to threads programming
- There is actually an entire set of commands in C to allow you to create threads
- You could, if you wanted, program with these commands directly
- The most common thread standard is called POSIX
- However, OpenMP provides a simple interface to a lot of the functionality provided by threads
- If it is simple and does what you need, why bother going to the effort of using threads programming?
23 Components of OpenMP
- Directives (pragmas in your code)
- Runtime library routines (provided by the compiler)
- Environment variables (set at the Unix prompt)
A short sketch that touches all three follows below.
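A small C sketch (not from the lecture) using all three components: the OMP_NUM_THREADS environment variable chooses the thread count at the Unix prompt, omp_get_max_threads() is a runtime library routine, and the pragma is a directive. The loop body is just an illustrative array fill.

    /* Run with, e.g.,  OMP_NUM_THREADS=4 ./a.out  */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        double Y[1000];

        /* Runtime library routine: how many threads are available? */
        printf("max threads = %d\n", omp_get_max_threads());

        /* Directive: split the loop iterations among the threads.  */
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++)
            Y[i] = 2.0 * i;

        printf("Y[999] = %f\n", Y[999]);
        return 0;
    }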
24 OpenMP: where did it come from?
- Prior to 1997, vendors all had their own proprietary shared memory programming commands
- Programs were not portable from one SMP to another
- Researchers were calling for some kind of portability
- The ANSI X3H5 (1994) proposal tried to formalize a shared memory standard but ultimately failed
- OpenMP (1997) worked because the vendors got behind it and there was new growth in the shared memory marketplace
- It is very hard for researchers to get new languages supported now; you must have backing from the computer vendors!
25 Bottom line
- For OpenMP, and shared memory programming in general, one only has to worry about the parallelism of the work
- This is because all the processors in a shared-memory computer can see all the same memory locations
- On distributed-memory computers one has to worry both about parallelism of the work and also about the placement of data
- Is the value I need sitting in the memory of another processor?
- Data movement is what makes distributed-memory codes (usually written in something called MPI) so much longer; it can be highly non-trivial
- Although it can also be easy; it depends on the algorithm
26 First Steps
- Loop-level parallelism is the simplest and easiest way to use OpenMP
- Take each do loop and make it parallel (if possible)
- It allows you to slowly build up parallelism within your application
- However, not all loops are immediately parallelizable due to dependencies
27 Loop Level Parallelism
- Consider the single precision vector add-multiply operation Y = aX + Y (SAXPY)

C/C++:
      for (i = 1; i < n; i++) Y[i] = a*X[i] + Y[i];

      #pragma omp parallel for private(i) shared(X,Y,n,a)
      for (i = 1; i < n; i++)
          Y[i] = a*X[i] + Y[i];
28 In more detail
FORTRAN (fixed form):
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i), SHARED(X,Y,n,a)
      do i = 1, n
         Y(i) = a*X(i) + Y(i)
      end do
29 A quick note
- To be fully lexically correct you may want to include a C$OMP END PARALLEL DO
- In f90 programs use !$OMP as the sentinel
- Notice that the sentinels mean that the OpenMP commands look like comments
- A compiler that has OpenMP compatibility turned on will see the commands after the sentinel
- This means you can still compile the code on computers that don't have OpenMP (a sketch of how to guard runtime calls follows below)
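A short C sketch (not from the lecture) of that portability: the directives are simply ignored by a non-OpenMP compiler, and the standard _OPENMP macro lets you guard any calls to the runtime library so the same source still builds without OpenMP.

    /* The same file compiles with or without OpenMP support. */
    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    int main(void)
    {
    #ifdef _OPENMP
        printf("compiled with OpenMP, up to %d threads\n", omp_get_max_threads());
    #else
        printf("compiled without OpenMP, running serially\n");
    #endif
        return 0;
    }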
30 How the compiler handles OpenMP
- When you compile an OpenMP code you need to add flags to the compile line, e.g.
      f77 -openmp -o myprogram myprogram.f
- Unfortunately, different compilers have different flags for turning on OpenMP support; the above will work on Sun machines
- When the compiler flag is turned on, you force the compiler to link in all of the additional libraries (and so on) necessary to run the threads
- This is all transparent to you, though
31 Requirements for parallel loops
- To divide up the work the compiler needs to know the number of iterations to be executed: the trip count must be computable
- Loops must also not exhibit any of the dependencies we mentioned
- We'll review this more in the next lecture
- Actually, a good test for dependencies is running the loop from n to 1 rather than from 1 to n; if you get a different answer, that suggests there are dependencies
- DO WHILE is not parallelizable using these directives
- There is actually a way of parallelizing DO WHILE using a different set of OpenMP commands, but we don't have time to cover that
- The loop can only have one exit point, so BREAKs or GOTOs out of the loop are not allowed (two non-conforming examples are sketched below)
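Two hedged C sketches (not from the lecture) of loop forms the directive cannot handle: a while loop whose trip count is not computable in advance, and a for loop with a second exit point via break.

    #include <stdio.h>

    int main(void)
    {
        /* Trip count not computable: the number of iterations depends on x. */
        double x = 1.0;
        int doublings = 0;
        while (x < 1.0e6) {
            x *= 2.0;
            doublings++;
        }

        /* Second exit point: the break makes this loop non-conforming. */
        double a[100];
        for (int j = 0; j < 100; j++) {
            a[j] = (double)j;
            if (a[j] > 50.0)
                break;
        }

        printf("%d doublings, stopped filling a() early\n", doublings);
        return 0;
    }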
32 Performance limitations
- Each time you start and end a parallel loop there is an overhead associated with the threads
- These overheads must always be added to the time taken to calculate the loop itself
- Therefore there is a limit on the smallest loop size that will achieve speed-up
- In practice, we need roughly 5000 floating point operations in a loop for it to be worth parallelizing
- A good rule of thumb is that any thread should have at least 1000 floating point operations
- Thus small loops are simply not worth the bother! (One way to guard against this is sketched below.)
33 Summary
- Shared memory parallel computers can be programmed using the OpenMP extensions to C and FORTRAN
- Distributed memory computers require a different parallel language
- The easiest way to use OpenMP is to make loops parallel by dividing the work up among threads
- The compiler handles most of the difficult parts of the coding
- However, not all loops are immediately parallelizable
- Dependencies may prevent parallelization
- Loops are made to run in parallel by adding directives (pragmas) to your code
- These directives appear to be comments to ordinary compilers
34 Next Lecture
- More details on dependencies and how we can deal with them