Title: Computational Methods in Physics PHYS 3437
1 Computational Methods in Physics PHYS 3437
- Dr. Rob Thacker
- Dept. of Astronomy & Physics (MM-301C)
- thacker_at_ap.smu.ca
2 Today's Lecture
- Introduction to parallel programming
- Concepts: what are parallel computers, what is parallel programming?
- Why do you need to use parallel programming?
- When parallelism will be beneficial
- Amdahl's Law
- Very brief introduction to OpenMP
3 Why bother to teach this in an undergrad Physics course?
- Because parallel computing is now ubiquitous
- Most laptops are parallel computers, for example
- Dual/quad core chips are already standard; in the future we can look forward to 8/16/32/64 cores per chip!
- Actually, Sun Microsystems already sells a chip with 8 cores
- I predict that by 2012 you will be buying chips with 16 cores
- If we want to use all this capacity, then we will need to run codes that can use more than one CPU core at a time
- Such codes are said to be parallel
- Exposure to these concepts will help you significantly if you want to go to grad school in an area that uses computational methods extensively
- Because not many people have these skills!
- If you are interested, an excellent essay on how computing is changing can be found here:
- http://view.eecs.berkeley.edu/wiki/Main_Page
4 Some caveats
- In two lectures we cannot cover very much on parallel computing
- We will concentrate on the simplest kind of parallel programming
- It exposes some of the inherent problems
- It still gives you useful increased performance
- Remember, making a code run 10 times faster turns a week into a day!
- The type of programming we'll be looking at is often limited in terms of the maximum speed-up possible, but factors of 10 are pretty common
5 Why can't the compiler just make my code parallel for me?
- In some situations it can, but most of the time it can't
- You really are smarter than a compiler is!
- There are many situations where a compiler will not be able to make something parallel but you can
- Compilers that can attempt to parallelize code are called auto-parallelizing
- Some people have suggested writing parallel languages that only allow the types of code that can be easily parallelized
- These have proven not to be very popular and are too restrictive
- At present, the most popular way of parallel programming is to add additional commands to your original code
- These commands are sometimes called pragmas or directives
6 Recap: von Neumann architecture
Machine instructions are encoded in binary and stored in memory: the key insight!
- First practical stored-program architecture
- Still in use today
- Speed is limited by the bandwidth of data between memory and processing unit: the von Neumann bottleneck
[Diagram: von Neumann architecture. The CPU (control unit plus processing unit) connects to memory (program and data) and to input and output.]
Developed while working on the EDVAC design
7 Shared memory computers
Program these computers using OpenMP extensions to C and FORTRAN.
[Diagram: four CPUs connected to a single shared memory over a common bus]
Traditional shared memory design: all processors share a memory bus. All of the processors see the same memory locations, which means that programming these computers is reasonably straightforward. Such machines are sometimes called SMPs, for symmetric multi-processor.
8 Distributed memory computers
Program these computers using MPI or PVM extensions to C and FORTRAN.
[Diagram: four CPUs, each with its own private memory, connected to one another by a network]
Really a collection of computers linked together via a network. Each processor has its own memory and must communicate with other processors over the network to get information held in other memory locations. This is really quite difficult at times. This is the architecture of computer clusters (each CPU here could actually be a shared memory computer itself).
9 Parallel execution
- What do we mean by being able to do things in parallel?
- Suppose the input data of an operation is divided into a series of independent parts
- Processing of the parts is carried out independently
- A simple example is operations on vectors/arrays where we loop over array indices (a sketch follows below)
[Diagram: array A(i) divided into chunks, with each chunk processed by a different processor]
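A minimal C sketch (not from the slides; the array size, chunk count, and operation are illustrative) of what "independent parts" means: the iterations of an array loop are split into disjoint chunks, and each chunk could be handed to a different processor.

    /* Minimal sketch: splitting an array operation into independent chunks.
       N, NPROC and the operation 2.0*i are all illustrative choices.        */
    #include <stdio.h>

    #define N     1000
    #define NPROC 4               /* hypothetical number of processors */

    int main(void)
    {
        double A[N];

        /* Each chunk touches a disjoint range of indices, so the chunks
           could run simultaneously on different processors.             */
        for (int p = 0; p < NPROC; p++) {
            int lo = p * (N / NPROC);
            int hi = (p == NPROC - 1) ? N : lo + N / NPROC;
            for (int i = lo; i < hi; i++)
                A[i] = 2.0 * i;   /* result independent of every other i */
        }

        printf("A[N-1] = %f\n", A[N - 1]);
        return 0;
    }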
10 Some subtleties
- However, you can't always do this
- Consider:
      do i = 2, n
        a(i) = a(i-1)
      end do
- This kind of loop has what we call a dependence
- If you update a value of a(i) before a(i-1) has been updated then you will get the wrong answer compared to running on a single processor
- We'll talk a little more about this later, but it does mean that not every loop can be parallelized (a small demonstration follows below)
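To see the dependence concretely, here is a small C sketch (not from the slides) of the same recurrence. Running the loop forward gives a different result from running it backward, which is exactly the reversal test mentioned later in the lecture; an independent loop would give the same answer either way.

    /* Small sketch of the a(i) = a(i-1) dependence: each iteration reads
       the value the previous iteration just wrote, so the iterations
       cannot be reordered or run concurrently.                           */
    #include <stdio.h>

    int main(void)
    {
        double a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        double b[8] = {1, 2, 3, 4, 5, 6, 7, 8};

        for (int i = 1; i < 8; i++)     /* forward: every element becomes a[0] */
            a[i] = a[i - 1];

        for (int i = 7; i >= 1; i--)    /* reversed: each element copies its   */
            b[i] = b[i - 1];            /* original left-hand neighbour        */

        printf("forward:  a[7] = %g\n", a[7]);   /* prints 1 */
        printf("reversed: b[7] = %g\n", b[7]);   /* prints 7 */
        return 0;
    }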
11 Issues to be aware of
- Parallel computing is not about being cool and doing lots and lots of flops
- Flops = floating point operations per second
- We want solutions to problems in a reasonable amount of time
- Sometimes that means doing a lot of calculations, e.g. consider what we found about the number of collisions for molecules in air
- Gains from algorithmic improvements will often swamp hardware improvements
- Don't be brain-limited: if there is a better algorithm, use it
12 Algorithmic Improvements in n-body simulations
Improvements in the speed of the algorithms have been proportionally better than the speed increase of the computers themselves over the same time interval.
13 Identifying Performance Desires
- Frequency of use: Daily (positive precondition) -> Monthly -> Yearly (negative precondition)
- Code evolution timescale: Hundreds of executions between changes (positive precondition) -> Changes each run (negative precondition)
14 Performance Characteristics
- Execution time: Days (positive precondition) -> Hours -> Minutes (negative precondition)
- Level of synchronization: None (positive precondition) -> Infrequent (every minute) -> Frequent (many per second) (negative precondition)
15 Data and Algorithm
- Data structures: Regular, static (positive precondition) -> Irregular, dynamic (negative precondition)
- Algorithmic complexity (roughly, the number of stages): Simple (positive precondition) -> Complex (negative precondition)
16 Requirements
- Must significantly increase resolution/length of integration (positive precondition) -> Need a factor of 2 increase -> Current resolution meets needs (negative precondition)
17 How much speed-up can we achieve?
- Some parts of a code cannot be run in parallel
- For example, the loop over a(i) = a(i-1) from earlier
- Any code that cannot be executed in parallel is said to be serial or sequential
- Let's suppose that, in terms of the total execution time of a program, a fraction f_s has to be run in serial, while a fraction f_p can be run in parallel on n CPUs
- Equivalently, the time spent in each fraction will be t_s and t_p, so the total time on 1 CPU is t_1cpu = t_s + t_p
- If we can run the parallel fraction on n CPUs then it will take a time t_p/n
- The total time will then be t_ncpu = t_s + t_p/n
18 Amdahl's Law
- How much speed-up (S_n = t_1cpu / t_ncpu) is feasible?
- Amdahl's Law is the most significant limit. Given our previous results and n processors, the maximum speed-up is
      S_n = (t_s + t_p) / (t_s + t_p/n) = 1 / (f_s + f_p/n)
- Only if the serial fraction f_s = t_s / (t_s + t_p) is zero is perfect speed-up possible (at least in theory); a short worked example follows below
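To make the limit concrete, here is a short C sketch (not from the slides) that evaluates S_n = 1/(f_s + f_p/n) for an assumed serial fraction of 10%; the value 0.1 is purely illustrative.

    /* Evaluate Amdahl's Law for an assumed serial fraction of 10%. */
    #include <stdio.h>

    int main(void)
    {
        double fs = 0.1;           /* assumed serial fraction */
        double fp = 1.0 - fs;      /* parallel fraction       */

        for (int n = 1; n <= 1024; n *= 2) {
            double Sn = 1.0 / (fs + fp / n);
            printf("n = %4d   speed-up = %6.2f\n", n, Sn);
        }
        /* As n grows, S_n approaches 1/fs = 10: the 10% serial part
           caps the achievable speed-up no matter how many CPUs we add. */
        return 0;
    }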
19 Amdahl's Law
[Figure: speed-up S_n as a function of the number of CPUs, N_cpu]
20 What is OpenMP?
- OpenMP is a pragma-based application programmer interface (API) that provides a simple extension to C/C++ and FORTRAN
- Pragma is just a fancy word for an instruction to the compiler
- It is designed exclusively for shared memory programming
- Ultimately, OpenMP is a very simple interface to something called threads-based programming
- What actually happens when you break up a loop into pieces is that a number of threads of execution are created that can run the loop pieces in parallel
21 Threads-based execution
- Serial execution, interspersed with parallel regions: the fork-join model (a minimal sketch follows below)
- In practice many compilers block execution of the extra threads during serial sections; this saves the overhead of the fork-join operation
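A minimal OpenMP sketch in C (not from the lecture) of the fork-join pattern: one thread runs the serial sections, a team of threads runs the parallel region, and execution joins back to a single thread afterwards. omp_get_thread_num() and omp_get_num_threads() are standard OpenMP runtime routines.

    /* Fork-join: serial, then a parallel region, then serial again. */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        printf("serial section: one thread\n");

        #pragma omp parallel          /* fork: a team of threads starts here */
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }                             /* join: back to a single thread       */

        printf("serial section again: one thread\n");
        return 0;
    }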
22 Some background to threads programming
- There is actually an entire set of commands in C to allow you to create threads
- You could, if you wanted, program with these commands directly
- The most common thread standard is called POSIX
- However, OpenMP provides a simple interface to a lot of the functionality provided by threads
- If it is simple and does what you need, why bother going to the effort of using threads programming?
23 Components of OpenMP
- Directives (pragmas in your code)
- Runtime library routines (provided by the compiler)
- Environment variables (set at the Unix prompt)
A short sketch that touches all three follows below.
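A small C sketch (not from the lecture) using all three components: the OMP_NUM_THREADS environment variable chooses the thread count at the Unix prompt, omp_get_max_threads() is a runtime library routine, and the pragma is a directive. The loop body is just an illustrative array fill.

    /* Run with, e.g.,  OMP_NUM_THREADS=4 ./a.out  */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        double Y[1000];

        /* Runtime library routine: how many threads are available? */
        printf("max threads = %d\n", omp_get_max_threads());

        /* Directive: split the loop iterations among the threads.  */
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++)
            Y[i] = 2.0 * i;

        printf("Y[999] = %f\n", Y[999]);
        return 0;
    }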
24 OpenMP: where did it come from?
- Prior to 1997, vendors all had their own proprietary shared memory programming commands
- Programs were not portable from one SMP to another
- Researchers were calling for some kind of portability
- The ANSI X3H5 (1994) proposal tried to formalize a shared memory standard but ultimately failed
- OpenMP (1997) worked because the vendors got behind it and there was new growth in the shared memory marketplace
- It is very hard for researchers to get new languages supported now; you must have backing from the computer vendors!
25 Bottom line
- For OpenMP, and shared memory programming in general, one only has to worry about the parallelism of the work
- This is because all the processors in a shared-memory computer can see all the same memory locations
- On distributed-memory computers one has to worry both about parallelism of the work and also about the placement of data
- Is the value I need sitting in the memory of another processor?
- Data movement is what makes distributed-memory codes (usually written in something called MPI) so much longer; it can be highly non-trivial
- Although it can also be easy; it depends on the algorithm
26 First Steps
- Loop-level parallelism is the simplest and easiest way to use OpenMP
- Take each do loop and make it parallel (if possible)
- It allows you to slowly build up parallelism within your application
- However, not all loops are immediately parallelizable due to dependencies
27 Loop Level Parallelism
- Consider the single precision vector add-multiply operation Y = aX + Y (SAXPY)

C/C++:
      for (i = 1; i < n; i++) Y[i] = a*X[i] + Y[i];

      #pragma omp parallel for private(i) shared(X,Y,n,a)
      for (i = 1; i < n; i++)
          Y[i] = a*X[i] + Y[i];
28 In more detail
FORTRAN (fixed form):
C$OMP PARALLEL DO
C$OMP& DEFAULT(NONE)
C$OMP& PRIVATE(i), SHARED(X,Y,n,a)
      do i = 1, n
         Y(i) = a*X(i) + Y(i)
      end do
29 A quick note
- To be fully lexically correct you may want to include a C$OMP END PARALLEL DO
- In f90 programs use !$OMP as the sentinel
- Notice that the sentinels mean that the OpenMP commands look like comments
- A compiler that has OpenMP compatibility turned on will see the commands after the sentinel
- This means you can still compile the code on computers that don't have OpenMP (a sketch of how to guard runtime calls follows below)
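A short C sketch (not from the lecture) of that portability: the directives are simply ignored by a non-OpenMP compiler, and the standard _OPENMP macro lets you guard any calls to the runtime library so the same source still builds without OpenMP.

    /* The same file compiles with or without OpenMP support. */
    #include <stdio.h>
    #ifdef _OPENMP
    #include <omp.h>
    #endif

    int main(void)
    {
    #ifdef _OPENMP
        printf("compiled with OpenMP, up to %d threads\n", omp_get_max_threads());
    #else
        printf("compiled without OpenMP, running serially\n");
    #endif
        return 0;
    }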
30 How the compiler handles OpenMP
- When you compile an OpenMP code you need to add flags to the compile line, e.g.
      f77 -openmp -o myprogram myprogram.f
- Unfortunately, different compilers have different flags for turning on OpenMP support; the above will work on Sun machines
- When the compiler flag is turned on, you force the compiler to link in all of the additional libraries (and so on) necessary to run the threads
- This is all transparent to you, though
31 Requirements for parallel loops
- To divide up the work the compiler needs to know the number of iterations to be executed: the trip count must be computable
- Loops must also not exhibit any of the dependencies we mentioned
- We'll review this more in the next lecture
- Actually, a good test for dependencies is running the loop from n to 1 rather than from 1 to n; if you get a different answer, that suggests there are dependencies
- DO WHILE is not parallelizable using these directives
- There is actually a way of parallelizing DO WHILE using a different set of OpenMP commands, but we don't have time to cover that
- The loop can only have one exit point, so BREAKs or GOTOs out of the loop are not allowed (two non-conforming examples are sketched below)
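Two hedged C sketches (not from the lecture) of loop forms the directive cannot handle: a while loop whose trip count is not computable in advance, and a for loop with a second exit point via break.

    #include <stdio.h>

    int main(void)
    {
        /* Trip count not computable: the number of iterations depends on x. */
        double x = 1.0;
        int doublings = 0;
        while (x < 1.0e6) {
            x *= 2.0;
            doublings++;
        }

        /* Second exit point: the break makes this loop non-conforming. */
        double a[100];
        for (int j = 0; j < 100; j++) {
            a[j] = (double)j;
            if (a[j] > 50.0)
                break;
        }

        printf("%d doublings, stopped filling a() early\n", doublings);
        return 0;
    }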
32 Performance limitations
- Each time you start and end a parallel loop there is an overhead associated with the threads
- These overheads must always be added to the time taken to calculate the loop itself
- Therefore there is a limit on the smallest loop size that will achieve speed-up
- In practice, we need roughly 5000 floating point operations in a loop for it to be worth parallelizing
- A good rule of thumb is that any thread should have at least 1000 floating point operations
- Thus small loops are simply not worth the bother! (One way to guard against this is sketched below.)
33 Summary
- Shared memory parallel computers can be programmed using the OpenMP extensions to C and FORTRAN
- Distributed memory computers require a different parallel language
- The easiest way to use OpenMP is to make loops parallel by dividing the work up among threads
- The compiler handles most of the difficult parts of the coding
- However, not all loops are immediately parallelizable
- Dependencies may prevent parallelization
- Loops are made to run in parallel by adding directives (pragmas) to your code
- These directives appear to be comments to ordinary compilers
34 Next Lecture
- More details on dependencies and how we can deal with them