Title: Introduction to Scientific Computing
1. Introduction to Scientific Computing
Doug Sondak (sondak@bu.edu)
Boston University Scientific Computing and Visualization
2. Outline
- Introduction
- Software
- Parallelization
- Hardware
3. Introduction
- What is Scientific Computing?
- Need for speed
- Need for memory
- Simulations tend to grow until they overwhelm available resources
- If I can simulate 1000 neurons, wouldn't it be cool if I could do 2000? 10,000? 10^87?
- Example: flow over an airplane
- It has been estimated that if a teraflop machine were available, it would take about 200,000 years to solve (resolving all scales).
- If Homo erectus had had a teraflop machine, we could be getting the result right about now.
4. Introduction (cont'd)
- Optimization
- Profile serial (1-processor) code
- Tells where most time is consumed
- Is there any low-hanging fruit?
- Faster algorithm
- Optimized library
- Wasted operations (see the sketch after this list)
- Parallelization
- Break problem up into chunks
- Solve chunks simultaneously on different processors
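As a hypothetical illustration of "wasted operations" (the function and array names below are invented for this sketch), a loop-invariant computation can be hoisted out of the loop so it is evaluated once instead of n times:

#include <math.h>
#include <stdio.h>

/* Illustrative only: these routines are made up for the example. */
void scale_wasteful(double *x, int n)
{
    for (int i = 0; i < n; i++)
        x[i] *= sqrt(2.0) / 3.0;   /* sqrt(2.0)/3.0 recomputed every iteration */
}

void scale_hoisted(double *x, int n)
{
    const double factor = sqrt(2.0) / 3.0;  /* computed once, outside the loop */
    for (int i = 0; i < n; i++)
        x[i] *= factor;
}

int main(void)
{
    double x[4] = { 1.0, 2.0, 3.0, 4.0 };
    scale_hoisted(x, 4);
    printf("%g %g %g %g\n", x[0], x[1], x[2], x[3]);
    return 0;
}

A good optimizer will often hoist this kind of invariant itself, which is part of why profiling before hand-tuning pays off.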
5. Software
6. Compiler
- The compiler is your friend (usually)
- Optimizers are quite refined
- Always try the highest level
- Usually -O3
- Sometimes -fast, -O5, ...
- Loads of flags, many for optimization
- Good news: many compilers will automatically parallelize for shared-memory systems
- Bad news: this usually doesn't work well
7. Software
- Libraries
- Solver is often a major consumer of CPU time
- Numerical Recipes is a good book, but many of its algorithms are not optimal
- LAPACK is a good resource (see the sketch after this list)
- Libraries are often available that have been optimized for the local architecture
- Disadvantage: not portable
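As one hedged sketch of leaning on an optimized library instead of hand-rolled code, the LAPACK routine dgesv solves a dense linear system A x = b. The example assumes a Fortran-convention LAPACK with the common trailing-underscore naming (dgesv_) and column-major storage; how you link it (e.g., -llapack) depends on the local installation.

#include <stdio.h>

/* LAPACK DGESV: solves A*X = B for a general dense matrix.
   The underscore name and pass-by-pointer convention are assumptions
   (the usual Unix Fortran ABI). */
extern void dgesv_(int *n, int *nrhs, double *a, int *lda,
                   int *ipiv, double *b, int *ldb, int *info);

int main(void)
{
    /* 2x2 system, stored column-major as LAPACK (Fortran) expects:
       A = [ 3 1 ; 1 2 ], b = [ 9 ; 8 ]  ->  x = [ 2 ; 3 ] */
    int n = 2, nrhs = 1, lda = 2, ldb = 2, info;
    double a[4] = { 3.0, 1.0,    /* column 1 */
                    1.0, 2.0 };  /* column 2 */
    double b[2] = { 9.0, 8.0 };
    int ipiv[2];

    dgesv_(&n, &nrhs, a, &lda, ipiv, b, &ldb, &info);

    if (info == 0)
        printf("x = (%g, %g)\n", b[0], b[1]);
    else
        printf("dgesv failed, info = %d\n", info);
    return 0;
}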
8. Parallelization
9. Parallelization
- Divide and conquer!
- divide operations among many processors
- perform operations simultaneously
- if serial run takes 10 hours and we hit the problem with 5000 processors, it should take about 7 seconds to complete, right?
- not so easy, of course
10. Parallelization (cont'd)
- problem: some calculations depend upon previous calculations
- can't be performed simultaneously
- sometimes tied to the physics of the problem, e.g., time evolution of a system
- want to maximize amount of parallel code
- occasionally easy
- usually requires some work
11. Parallelization (3)
- method used for parallelization may depend on hardware
- distributed memory
- each processor has its own address space
- if one processor needs data from another processor, it must be explicitly passed
- shared memory
- common address space
- no message passing required
12. Parallelization (4)
13. Parallelization (5)
- MPI
- for both distributed and shared memory
- portable
- freely downloadable
- OpenMP
- shared memory only
- must be supported by compiler (most do)
- usually easier than MPI
- can be implemented incrementally
14. MPI
- Computational domain is typically decomposed into regions
- One region assigned to each processor
- Separate copy of the program runs on each processor
15. MPI
- Discretized domain to solve flow over airfoil
- System of coupled PDEs solved at each point
16. MPI
- Decomposed domain for 4 processors
17. MPI
- Since points depend on adjacent points, information must be transferred after each iteration
- This is done with explicit calls in the source code (see the sketch below)
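A minimal sketch of what those explicit calls can look like for a 1-D decomposition, assuming a simple array u[] with one ghost (halo) cell at each end; the names and sizes are invented for the example and error handling is omitted.

#include <mpi.h>

#define NLOC 100   /* interior points owned by this process (made-up size) */

/* u has NLOC interior points plus one ghost cell at each end:
   u[0] and u[NLOC+1] hold copies of the neighbors' edge values. */
void exchange_halo(double *u, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    int left  = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
    int right = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

    /* send my left edge to the left neighbor, receive my right ghost cell */
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  0,
                 &u[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                 comm, MPI_STATUS_IGNORE);

    /* send my right edge to the right neighbor, receive my left ghost cell */
    MPI_Sendrecv(&u[NLOC],     1, MPI_DOUBLE, right, 1,
                 &u[0],        1, MPI_DOUBLE, left,  1,
                 comm, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    double u[NLOC + 2] = { 0.0 };

    MPI_Init(&argc, &argv);
    /* ... set interior values of u, then once per iteration: ... */
    exchange_halo(u, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

For the 2-D airfoil decomposition shown on the previous slides the idea is the same, just with a strip of edge points exchanged in each direction.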
18. MPI
- Diminishing returns
- Sending messages can get expensive
- Want to maximize ratio of computation to communication
- Parallel speedup, parallel efficiency
- T = time, n = number of processors
19. Speedup
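The standard definition, with T(1) the time on one processor and T(n) the time on n processors:
  S(n) = T(1) / T(n)
Ideal (linear) speedup is S(n) = n.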
20. Parallel Efficiency
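The corresponding standard definition:
  E(n) = S(n) / n = T(1) / (n * T(n))
Perfect scaling gives E(n) = 1 (100%); communication overhead pushes it below 1.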
21. OpenMP
- Usually loop-level parallelization
- An OpenMP directive is placed in the source code before the loop
- Assigns a subset of loop indices to each processor
- No message passing since each processor can see the whole domain

#pragma omp parallel for
for (i = 0; i < N; i++) {
    /* do lots of stuff */
}
22. OpenMP
- Can't guarantee order of operations

for (i = 0; i < 7; i++) a[i] = 1;
for (i = 1; i < 7; i++) a[i] = 2*a[i-1];

- Example of how to do it wrong! (a runnable sketch follows the table)
- Parallelize the second loop on 2 processors: Proc. 0 gets i = 1-3, Proc. 1 gets i = 4-6

 i   a[i] (serial)   a[i] (parallel)
 0         1               1
 1         2               2
 2         4               4
 3         8               8
 4        16               2
 5        32               4
 6        64               8

- Proc. 1 reads a[3] before Proc. 0 has updated it, so its chain starts from the stale value 1
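A minimal compilable sketch of the broken loop above; the 2-thread count and static schedule are chosen here so the index split matches the table (build with an OpenMP-enabled compiler, e.g. GCC with -fopenmp).

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int a[7];
    int i;

    for (i = 0; i < 7; i++)
        a[i] = 1;

    /* WRONG: each a[i] depends on a[i-1], which another thread may not
       have written yet, so the parallel result can differ from the
       serial one (compare with the table above). */
    #pragma omp parallel for num_threads(2) schedule(static)
    for (i = 1; i < 7; i++)
        a[i] = 2 * a[i - 1];

    for (i = 0; i < 7; i++)
        printf("a[%d] = %d\n", i, a[i]);

    return 0;
}

Depending on thread timing the parallel result may or may not match the serial one, which is exactly the problem: the loop carries a dependence and should not be parallelized this way.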
23. Hardware
24. Hardware
- A faster processor is obviously good, but ...
- Memory access speed is often a big driver
- Cache is a critical element of the memory system
- Processors have internal parallelism such as pipelines and multiply-add instructions
25. Cache
- Cache is a small chunk of fast memory between the
main memory and the registers
26. Cache (cont'd)
- Variables are moved from main memory to cache in lines
- L1 cache line sizes on our machines:
- Opteron (blade cluster): 64 bytes
- Power4 (p-series): 128 bytes
- PPC440 (Blue Gene): 32 bytes
- Pentium III (Linux cluster): 32 bytes
- If variables are used repeatedly, code will run
faster since cache memory is much faster than
main memory
27. Cache (cont'd)
- Why not just make the main memory out of the same stuff as cache?
- Expensive
- Runs hot
- This was actually done in Cray computers
- Liquid cooling system
28. Cache (cont'd)
- Cache hit
- Required variable is in cache
- Cache miss
- Required variable not in cache
- If cache is full, something else must be thrown out (sent back to main memory) to make room
- Want to minimize number of cache misses
29. Cache example
mini cache holds 2 lines, 4 words each

for (i = 0; i < 10; i++) x[i] = i;

- Main memory is laid out in 4-word lines: (x[0] x[1] x[2] x[3]), (x[4] x[5] x[6] x[7]), (x[8] x[9] a b)
- The cache starts out empty
30. Cache example (cont'd)
- We will ignore i for simplicity
- need x[0]; not in cache: cache miss
- load line from memory into cache
- next 3 loop indices result in cache hits

for (i = 0; i < 10; i++) x[i] = i;

- Cache now holds one line: (x[0] x[1] x[2] x[3])
31. Cache example (cont'd)
- need x[4]; not in cache: cache miss
- load line from memory into cache
- next 3 loop indices result in cache hits

for (i = 0; i < 10; i++) x[i] = i;

- Cache now holds two lines: (x[0] x[1] x[2] x[3]) and (x[4] x[5] x[6] x[7])
32. Cache example (cont'd)
- need x[8]; not in cache: cache miss
- load line from memory into cache
- no room in cache!
- replace an old line

for (i = 0; i < 10; i++) x[i] = i;

- Cache now holds (x[8] x[9] a b) and (x[4] x[5] x[6] x[7]); the line with x[0]-x[3] was evicted
33. Cache (cont'd)
- Contiguous access is important
- In C, a multidimensional array is stored in memory row by row:
- a[0][0]
- a[0][1]
- a[0][2]
- ...
34. Cache (cont'd)
- In Fortran and Matlab, a multidimensional array is stored the opposite way (column by column):
- a(1,1)
- a(2,1)
- a(3,1)
- ...
35. Cache (cont'd)
- Rule: always order your loops so the innermost loop runs over the index that is contiguous in memory (see the timing sketch below)

C:
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        a[i][j] = 1.0;

Fortran:
do j = 1, n
    do i = 1, n
        a(i,j) = 1.0
    enddo
enddo
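A self-contained C sketch of the effect (the array size N and the clock()-based timing are arbitrary illustrative choices): the first loop nest walks memory contiguously, the second strides through it, and on most machines the second is noticeably slower once the array no longer fits in cache.

#include <stdio.h>
#include <time.h>

#define N 2000

static double a[N][N];   /* static so the large array is not on the stack */

int main(void)
{
    clock_t t0, t1;
    int i, j;

    /* Cache-friendly: innermost loop walks the contiguous index j */
    t0 = clock();
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = 1.0;
    t1 = clock();
    printf("i-then-j order: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    /* Cache-unfriendly: innermost loop strides by N doubles */
    t0 = clock();
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            a[i][j] = 1.0;
    t1 = clock();
    printf("j-then-i order: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return 0;
}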
36. SCF Machines
37. p-series
- Shared memory
- IBM Power4 processors
- 32 KB L1 cache per processor
- 1.41 MB L2 cache per pair of processors
- 128 MB L3 cache per 8 processors
38. p-series

machine     proc. speed   # procs   memory
kite        1.3 GHz       32        32 GB
pogo        1.3 GHz       32        32 GB
frisbee     1.3 GHz       32        32 GB
domino      1.3 GHz       16        16 GB
twister     1.1 GHz       8         16 GB
scrabble    1.1 GHz       8         16 GB
marbles     1.1 GHz       8         16 GB
crayon      1.1 GHz       8         16 GB
litebrite   1.1 GHz       8         16 GB
hotwheels   1.1 GHz       8         16 GB
39. Blue Gene
- Distributed memory
- 2048 processors
- 1024 2-processor nodes
- IBM PowerPC 440 processors
- 700 MHz
- 512 MB memory per node (per 2 processors)
- 32 KB L1 cache per node
- 2 MB L2 cache per node
- 4 MB L3 cache per node
40. BladeCenter
- Hybrid memory
- 56 processors
- 14 4-processor nodes
- AMD Opteron processors
- 2.6 GHz
- 8 GB memory per node (per 4 processors)
- Each node has shared memory
- 64 KB L1 cache per 2 processors
- 1 MB L2 cache per 2 processors
41. Linux Cluster
- Hybrid memory
- 104 processors
- 52 2-processor nodes
- Intel Pentium III processors
- 1.3 GHz
- 1 GB memory per node (per 2 processors)
- Each node has shared memory
- 16 KB L1 cache per 2 processors
- 512 KB L2 cache per 2 processors
42. For More Information
- SCV web site
- http://scv.bu.edu/
- Today's presentations are available at
- http://scv.bu.edu/documentation/presentations/
- under the title "Introduction to Scientific Computing and Visualization"
43. Next Time
- G T code
- Time it
- Look at effect of compiler flags
- profile it
- Where is time consumed?
- Modify it to improve serial performance