Title: Introduction to Scientific Computing
1. Introduction to Scientific Computing
Doug Sondak (sondak@bu.edu)
Boston University Scientific Computing and Visualization
2. Outline
- Introduction
- Software
- Parallelization
- Hardware
3. Introduction
- What is Scientific Computing?
- Need for speed
- Need for memory
- Simulations tend to grow until they overwhelm available resources
- If I can simulate 1000 neurons, wouldn't it be cool if I could do 2000? 10,000? 10^87?
- Example: flow over an airplane
- It has been estimated that if a teraflop machine were available, it would take about 200,000 years to solve (resolving all scales).
- If Homo erectus had had a teraflop machine, we could be getting the result right about now.
4. Introduction (cont'd)
- Optimization
- Profile serial (1-processor) code
- Tells where most time is consumed
- Is there any low-hanging fruit?
- Faster algorithm
- Optimized library
- Wasted operations (see the sketch after this list)
- Parallelization
- Break problem up into chunks
- Solve chunks simultaneously on different processors
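As a hypothetical illustration of "wasted operations" (the function and array names below are invented for this sketch), a loop-invariant computation can be hoisted out of the loop so it is evaluated once instead of n times:

#include <math.h>
#include <stdio.h>

/* Illustrative only: these routines are made up for the example. */
void scale_wasteful(double *x, int n)
{
    for (int i = 0; i < n; i++)
        x[i] *= sqrt(2.0) / 3.0;   /* sqrt(2.0)/3.0 recomputed every iteration */
}

void scale_hoisted(double *x, int n)
{
    const double factor = sqrt(2.0) / 3.0;  /* computed once, outside the loop */
    for (int i = 0; i < n; i++)
        x[i] *= factor;
}

int main(void)
{
    double x[4] = { 1.0, 2.0, 3.0, 4.0 };
    scale_hoisted(x, 4);
    printf("%g %g %g %g\n", x[0], x[1], x[2], x[3]);
    return 0;
}

A good optimizer will often hoist this kind of invariant itself, which is part of why profiling before hand-tuning pays off.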
5. Software
6. Compiler
- The compiler is your friend (usually)
- Optimizers are quite refined
- Always try the highest level
- Usually -O3
- Sometimes -fast, -O5, ...
- Loads of flags, many for optimization
- Good news: many compilers will automatically parallelize for shared-memory systems
- Bad news: this usually doesn't work well
7. Software
- Libraries
- Solver is often a major consumer of CPU time
- Numerical Recipes is a good book, but many of its algorithms are not optimal
- LAPACK is a good resource (see the sketch after this list)
- Libraries are often available that have been optimized for the local architecture
- Disadvantage: not portable
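As one hedged sketch of leaning on an optimized library instead of hand-rolled code, the LAPACK routine dgesv solves a dense linear system A x = b. The example assumes a Fortran-convention LAPACK with the common trailing-underscore naming (dgesv_) and column-major storage; how you link it (e.g., -llapack) depends on the local installation.

#include <stdio.h>

/* LAPACK DGESV: solves A*X = B for a general dense matrix.
   The underscore name and pass-by-pointer convention are assumptions
   (the usual Unix Fortran ABI). */
extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                   int *ipiv, double *b, int *ldb, int *info);

int main(void)
{
    /* 2x2 system, stored column-major as LAPACK (Fortran) expects:
       A = [ 3 1 ; 1 2 ], b = [ 9 ; 8 ]  ->  x = [ 2 ; 3 ] */
    int n = 2, nrhs = 1, lda = 2, ldb = 2, info;
    double a[4] = { 3.0, 1.0,    /* column 1 */
                    1.0, 2.0 };  /* column 2 */
    double b[2] = { 9.0, 8.0 };
    int ipiv[2];

    dgesv_(&n, &nrhs, a, &lda, ipiv, b, &ldb, &info);

    if (info == 0)
        printf("x = (%g, %g)\n", b[0], b[1]);
    else
        printf("dgesv failed, info = %d\n", info);
    return 0;
}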
8. Parallelization
9. Parallelization
- Divide and conquer!
- divide operations among many processors
- perform operations simultaneously
- if serial run takes 10 hours and we hit the problem with 5000 processors, it should take about 7 seconds to complete, right?
- not so easy, of course
10. Parallelization (cont'd)
- problem: some calculations depend upon previous calculations
- can't be performed simultaneously
- sometimes tied to the physics of the problem, e.g., time evolution of a system
- want to maximize amount of parallel code
- occasionally easy
- usually requires some work
11. Parallelization (3)
- method used for parallelization may depend on hardware
- distributed memory
- each processor has its own address space
- if one processor needs data from another processor, it must be explicitly passed
- shared memory
- common address space
- no message passing required
12. Parallelization (4)
13. Parallelization (5)
- MPI
- for both distributed and shared memory
- portable
- freely downloadable
- OpenMP
- shared memory only
- must be supported by compiler (most do)
- usually easier than MPI
- can be implemented incrementally
14. MPI
- Computational domain is typically decomposed into regions
- One region assigned to each processor
- Separate copy of the program runs on each processor
15. MPI
- Discretized domain to solve flow over airfoil
- System of coupled PDEs solved at each point
16. MPI
- Decomposed domain for 4 processors
17. MPI
- Since points depend on adjacent points, information must be transferred after each iteration
- This is done with explicit calls in the source code (see the sketch below)
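A minimal sketch of what those explicit calls can look like for a 1-D decomposition, assuming a simple array u[] with one ghost (halo) cell at each end; the names and sizes are invented for the example and error handling is omitted.

#include <mpi.h>

#define NLOC 100   /* interior points owned by this process (made-up size) */

/* u has NLOC interior points plus one ghost cell at each end:
   u[0] and u[NLOC+1] hold copies of the neighbors' edge values. */
void exchange_halo(double *u, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    /* send my left edge to the left neighbor, receive my right ghost cell */
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  0,
                 &u[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);

    /* send my right edge to the right neighbor, receive my left ghost cell */
    MPI_Sendrecv(&u[NLOC],     1, MPI_DOUBLE, right, 1,
                 &u[0],        1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    double u[NLOC + 2] = { 0.0 };

    MPI_Init(&argc, &argv);
    /* ... set interior values of u, then once per iteration: ... */
    exchange_halo(u, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

For the 2-D airfoil decomposition shown on the previous slides the idea is the same, just with a strip of edge points exchanged in each direction.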
18. MPI
- Diminishing returns
- Sending messages can get expensive
- Want to maximize ratio of computation to communication
- Parallel speedup, parallel efficiency
- T = time, n = number of processors
19. Speedup
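The standard definition, with T(1) the time on one processor and T(n) the time on n processors:
  S(n) = T(1) / T(n)
Ideal (linear) speedup is S(n) = n.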
20. Parallel Efficiency
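The corresponding standard definition:
  E(n) = S(n) / n = T(1) / (n * T(n))
Perfect scaling gives E(n) = 1 (100%); communication overhead pushes it below 1.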
21. OpenMP
- Usually loop-level parallelization
- An OpenMP directive is placed in the source code before the loop
- Assigns a subset of loop indices to each processor
- No message passing since each processor can see the whole domain

#pragma omp parallel for
for (i = 0; i < N; i++) {
    /* do lots of stuff */
}
22. OpenMP
- Can't guarantee order of operations

for (i = 0; i < 7; i++) a[i] = 1;
for (i = 1; i < 7; i++) a[i] = 2*a[i-1];

- Example of how to do it wrong! (a runnable sketch follows the table)
- Parallelize the second loop on 2 processors: Proc. 0 gets i = 1-3, Proc. 1 gets i = 4-6

 i   a[i] (serial)   a[i] (parallel)
 0         1               1
 1         2               2
 2         4               4
 3         8               8
 4        16               2
 5        32               4
 6        64               8

- Proc. 1 reads a[3] before Proc. 0 has updated it, so its chain starts from the stale value 1
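A minimal compilable sketch of the broken loop above; the 2-thread count and static schedule are chosen here so the index split matches the table (build with an OpenMP-enabled compiler, e.g. GCC with -fopenmp).

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int a[7];
    int i;

    for (i = 0; i < 7; i++)
        a[i] = 1;

    /* WRONG: each a[i] depends on a[i-1], which another thread may not
       have written yet, so the parallel result can differ from the
       serial one (compare with the table above). */
    #pragma omp parallel for num_threads(2) schedule(static)
    for (i = 1; i < 7; i++)
        a[i] = 2 * a[i - 1];

    for (i = 0; i < 7; i++)
        printf("a[%d] = %d\n", i, a[i]);

    return 0;
}

Depending on thread timing the parallel result may or may not match the serial one, which is exactly the problem: the loop carries a dependence and should not be parallelized this way.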
23. Hardware
24. Hardware
- A faster processor is obviously good, but ...
- Memory access speed is often a big driver
- Cache is a critical element of the memory system
- Processors have internal parallelism such as pipelines and multiply-add instructions
25. Cache
- Cache is a small chunk of fast memory between the
main memory and the registers
26. Cache (cont'd)
- Variables are moved from main memory to cache in lines
- L1 cache line sizes on our machines:
- Opteron (blade cluster): 64 bytes
- Power4 (p-series): 128 bytes
- PPC440 (Blue Gene): 32 bytes
- Pentium III (Linux cluster): 32 bytes
- If variables are used repeatedly, code will run
faster since cache memory is much faster than
main memory
27. Cache (cont'd)
- Why not just make the main memory out of the same stuff as cache?
- Expensive
- Runs hot
- This was actually done in Cray computers
- Liquid cooling system
28. Cache (cont'd)
- Cache hit
- Required variable is in cache
- Cache miss
- Required variable not in cache
- If cache is full, something else must be thrown out (sent back to main memory) to make room
- Want to minimize number of cache misses
29. Cache example
mini cache holds 2 lines, 4 words each

for (i = 0; i < 10; i++) x[i] = i;

- Main memory is laid out in 4-word lines: (x[0] x[1] x[2] x[3]), (x[4] x[5] x[6] x[7]), (x[8] x[9] a b)
- The cache starts out empty
30. Cache example (cont'd)
- We will ignore i for simplicity
- need x[0]; not in cache: cache miss
- load line from memory into cache
- next 3 loop indices result in cache hits

for (i = 0; i < 10; i++) x[i] = i;

- Cache now holds one line: (x[0] x[1] x[2] x[3])
31. Cache example (cont'd)
- need x[4]; not in cache: cache miss
- load line from memory into cache
- next 3 loop indices result in cache hits

for (i = 0; i < 10; i++) x[i] = i;

- Cache now holds two lines: (x[0] x[1] x[2] x[3]) and (x[4] x[5] x[6] x[7])
32. Cache example (cont'd)
- need x[8]; not in cache: cache miss
- load line from memory into cache
- no room in cache!
- replace an old line

for (i = 0; i < 10; i++) x[i] = i;

- Cache now holds (x[8] x[9] a b) and (x[4] x[5] x[6] x[7]); the line with x[0]-x[3] was evicted
33. Cache (cont'd)
- Contiguous access is important
- In C, a multidimensional array is stored in memory row by row:
- a[0][0]
- a[0][1]
- a[0][2]
- ...
34. Cache (cont'd)
- In Fortran and Matlab, a multidimensional array is stored the opposite way (column by column):
- a(1,1)
- a(2,1)
- a(3,1)
- ...
35. Cache (cont'd)
- Rule: always order your loops so the innermost loop runs over the index that is contiguous in memory (see the timing sketch below)

C:
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        a[i][j] = 1.0;

Fortran:
do j = 1, n
    do i = 1, n
        a(i,j) = 1.0
    enddo
enddo
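A self-contained C sketch of the effect (the array size N and the clock()-based timing are arbitrary illustrative choices): the first loop nest walks memory contiguously, the second strides through it, and on most machines the second is noticeably slower once the array no longer fits in cache.

#include <stdio.h>
#include <time.h>

#define N 2000

static double a[N][N];   /* static so the large array is not on the stack */

int main(void)
{
    clock_t t0, t1;
    int i, j;

    /* Cache-friendly: innermost loop walks the contiguous index j */
    t0 = clock();
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = 1.0;
    t1 = clock();
    printf("i-then-j order: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    /* Cache-unfriendly: innermost loop strides by N doubles */
    t0 = clock();
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            a[i][j] = 1.0;
    t1 = clock();
    printf("j-then-i order: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return 0;
}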
36. SCF Machines
37. p-series
- Shared memory
- IBM Power4 processors
- 32 KB L1 cache per processor
- 1.41 MB L2 cache per pair of processors
- 128 MB L3 cache per 8 processors
38. p-series

machine     proc. speed   # procs   memory
kite        1.3 GHz       32        32 GB
pogo        1.3 GHz       32        32 GB
frisbee     1.3 GHz       32        32 GB
domino      1.3 GHz       16        16 GB
twister     1.1 GHz       8         16 GB
scrabble    1.1 GHz       8         16 GB
marbles     1.1 GHz       8         16 GB
crayon      1.1 GHz       8         16 GB
litebrite   1.1 GHz       8         16 GB
hotwheels   1.1 GHz       8         16 GB
39. Blue Gene
- Distributed memory
- 2048 processors
- 1024 2-processor nodes
- IBM PowerPC 440 processors
- 700 MHz
- 512 MB memory per node (per 2 processors)
- 32 KB L1 cache per node
- 2 MB L2 cache per node
- 4 MB L3 cache per node
40. BladeCenter
- Hybrid memory
- 56 processors
- 14 4-processor nodes
- AMD Opteron processors
- 2.6 GHz
- 8 GB memory per node (per 4 processors)
- Each node has shared memory
- 64 KB L1 cache per 2 processors
- 1 MB L2 cache per 2 processors
41. Linux Cluster
- Hybrid memory
- 104 processors
- 52 2-processor nodes
- Intel Pentium III processors
- 1.3 GHz
- 1 GB memory per node (per 2 processors)
- Each node has shared memory
- 16 KB L1 cache per 2 processors
- 512 KB L2 cache per 2 processors
42. For More Information
- SCV web site
- http://scv.bu.edu/
- Today's presentations are available at
- http://scv.bu.edu/documentation/presentations/
- under the title "Introduction to Scientific Computing and Visualization"
43. Next Time
- G T code
- Time it
- Look at effect of compiler flags
- profile it
- Where is time consumed?
- Modify it to improve serial performance