1
Introduction to Scientific Computing

Doug Sondak (sondak@bu.edu)
Boston University Scientific Computing and Visualization
2
Outline
  • Introduction
  • Software
  • Parallelization
  • Hardware

3
Introduction
  • What is Scientific Computing?
  • Need for speed
  • Need for memory
  • Simulations tend to grow until they overwhelm
    available resources
  • If I can simulate 1000 neurons, wouldn't it be
    cool if I could do 2000? 10000? 10^87?
  • Example: flow over an airplane
  • It has been estimated that if a teraflop machine
    were available, it would take about 200,000 years to
    solve (resolving all scales).
  • If Homo erectus had a teraflop machine, we could
    be getting the result right about now.

4
Introduction (cont'd)
  • Optimization
  • Profile serial (1-processor) code
  • Tells where most time is consumed
  • Is there any low-hanging fruit?
  • Faster algorithm
  • Optimized library
  • Wasted operations
  • Parallelization
  • Break problem up into chunks
  • Solve chunks simultaneously on different
    processors
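
A hedged illustration of the profiling step with GNU tools (tool
names are assumptions; the course machines may use different
profilers):

    cc -pg prog.c -o prog    # compile with profiling enabled
    ./prog                   # run; writes gmon.out
    gprof prog gmon.out      # report time consumed per routine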

5
Software
6
Compiler
  • The compiler is your friend (usually)
  • Optimizers are quite refined
  • Always try the highest level
  • Usually -O3
  • Sometimes -fast, -O5, ...
  • Loads of flags, many for optimization
  • Good news: many compilers will automatically
    parallelize for shared-memory systems
  • Bad news: this usually doesn't work well
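
For example (flag spellings vary by compiler; these lines are
illustrative, not specific to the course machines):

    cc -O3 prog.c -o prog     # highest standard optimization level
    cc -fast prog.c -o prog   # vendor-specific aggressive option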

7
Software
  • Libraries
  • The solver is often a major consumer of CPU time
  • Numerical Recipes is a good book, but many of its
    algorithms are not optimal
  • LAPACK is a good resource (see the sketch after
    this list)
  • Libraries are often available that have been
    optimized for the local architecture
  • Disadvantage: not portable
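
A minimal sketch of calling LAPACK from C through the LAPACKE
interface, assuming LAPACKE is installed (link with something like
-llapacke -llapack -lblas); the matrix values are made up:

    #include <stdio.h>
    #include <lapacke.h>

    int main(void)
    {
        /* solve the 2x2 system A x = b by LU factorization */
        double a[4] = { 2.0, 1.0,
                        1.0, 3.0 };      /* row-major A */
        double b[2] = { 3.0, 5.0 };      /* right-hand side */
        lapack_int ipiv[2];

        lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1,
                                        a, 2, ipiv, b, 1);
        if (info == 0)
            printf("x = (%g, %g)\n", b[0], b[1]);
        return (int) info;
    }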

8
Parallelization
9
Parallelization
  • Divide and conquer!
  • divide operations among many processors
  • perform operations simultaneously
  • if a serial run takes 10 hours and we hit the
    problem with 5000 processors, it should take
    about 7 seconds to complete, right?
  • not so easy, of course

10
Parallelization (contd)
  • problem: some calculations depend upon previous
    calculations
  • can't be performed simultaneously
  • sometimes tied to the physics of the problem,
    e.g., time evolution of a system
  • want to maximize amount of parallel code
  • occasionally easy
  • usually requires some work

11
Parallelization (3)
  • method used for parallelization may depend on
    hardware
  • distributed memory
  • each processor has own address space
  • if one processor needs data from another
    processor, it must be passed explicitly
  • shared memory
  • common address space
  • no message passing required

12
Parallelization (4)

13
Parallelization (5)
  • MPI
  • for both distributed and shared memory
  • portable
  • freely downloadable
  • OpenMP
  • shared memory only
  • must be supported by the compiler (most compilers do)
  • usually easier than MPI
  • can be implemented incrementally

14
MPI
  • Computational domain is typically decomposed into
    regions
  • One region assigned to each processor
  • Separate copy of program runs on each processor
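
A minimal sketch of that "separate copy" model: every processor
runs the same executable, and MPI_Comm_rank tells each copy which
region it owns (names here are illustrative):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        printf("copy %d of %d: working on region %d\n",
               rank, nprocs, rank);
        MPI_Finalize();
        return 0;
    }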

15
MPI
  • Discretized domain to solve flow over airfoil
  • System of coupled PDEs solved at each point

16
MPI
  • Decomposed domain for 4 processors

17
MPI
  • Since points depend on adjacent points, information
    must be transferred after each iteration
  • This is done with explicit calls in the source
    code (a sketch follows below)
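
A hedged sketch of such an explicit transfer for a 1-D
decomposition with one layer of ghost points at each end of u
(array layout and names are assumptions, not the original code):

    #include <mpi.h>

    /* u holds n interior points in u[1..n]; u[0], u[n+1] are ghosts */
    void exchange_halos(double *u, int n, MPI_Comm comm)
    {
        int rank, nprocs;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nprocs);

        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

        /* send rightmost interior point right, receive left ghost */
        MPI_Sendrecv(&u[n], 1, MPI_DOUBLE, right, 0,
                     &u[0], 1, MPI_DOUBLE, left,  0,
                     comm, MPI_STATUS_IGNORE);
        /* send leftmost interior point left, receive right ghost */
        MPI_Sendrecv(&u[1],   1, MPI_DOUBLE, left,  1,
                     &u[n+1], 1, MPI_DOUBLE, right, 1,
                     comm, MPI_STATUS_IGNORE);
    }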

18
MPI
  • Diminishing returns
  • Sending messages can get expensive
  • Want to maximize ratio of computation to
    communication
  • Parallel speedup, parallel efficiency

T = time, n = number of processors
19
Speedup
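
The formula image did not survive the transcript; the standard
definition, with T_1 the serial time and T_n the time on n
processors, is

    S_n = T_1 / T_n

e.g., if a 10-hour serial run finishes in 1 hour on 16 processors,
S_16 = 10.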

20
Parallel Efficiency
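
Likewise the standard definition (the slide's formula image was
lost in the transcript):

    E_n = S_n / n = T_1 / (n * T_n)

Perfect (linear) speedup gives E_n = 1; communication overhead
pushes it below 1.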

21
OpenMP
  • Usually loop-level parallelization
  • An OpenMP directive is placed in the source code
    before the loop
  • Assigns subset of loop indices to each processor
  • No message passing since each processor can see
    the whole domain

    for (i = 0; i < N; i++) {
        /* do lots of stuff */
    }
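
A minimal sketch of the directive described above (compile with the
compiler's OpenMP flag, e.g. -fopenmp or -qsmp=omp; the names are
illustrative):

    #include <omp.h>

    void fill(double *x, int N)
    {
        int i;
        #pragma omp parallel for   /* split iterations over threads */
        for (i = 0; i < N; i++) {
            x[i] = 2.0 * i;        /* "lots of stuff" goes here */
        }
    }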
22
OpenMP
  • Can't guarantee order of operations

    for (i = 0; i < 7; i++) a[i] = 1;
    for (i = 1; i < 7; i++) a[i] = 2*a[i-1];

Example of how to do it wrong!
Parallelize the second loop on 2 processors: Proc. 0 gets
i = 1-3, Proc. 1 gets i = 4-6.

    i    a[i] (serial)    a[i] (parallel)
    0         1                 1
    1         2                 2    (Proc. 0)
    2         4                 4    (Proc. 0)
    3         8                 8    (Proc. 0)
    4        16                 2    (Proc. 1)
    5        32                 4    (Proc. 1)
    6        64                 8    (Proc. 1)

Proc. 1 reads a[3] while it still holds its initialized value 1,
before Proc. 0 has updated it to 8, so the parallel results diverge.
23
Hardware
24
Hardware
  • A faster processor is obviously good, but
  • Memory access speed is often a big driver
  • Cache: a critical element of the memory system
  • Processors have internal parallelism such as
    pipelines and multiply-add instructions

25
Cache
  • Cache is a small chunk of fast memory between the
    main memory and the registers

26
Cache (cont'd)
  • Variables are moved from main memory to cache in
    lines
  • L1 cache line sizes on our machines:
  • Opteron (blade cluster): 64 bytes
  • Power4 (p-series): 128 bytes
  • PPC440 (Blue Gene): 32 bytes
  • Pentium III (Linux cluster): 32 bytes
  • If variables are used repeatedly, code will run
    faster since cache memory is much faster than
    main memory

27
Cache (cont'd)
  • Why not just make the main memory out of the same
    stuff as cache?
  • Expensive
  • Runs hot
  • This was actually done in Cray computers
  • Liquid cooling system

28
Cache (cont'd)
  • Cache hit: required variable is in cache
  • Cache miss: required variable is not in cache
  • If cache is full, something else must be thrown
    out (sent back to main memory) to make room
  • Want to minimize number of cache misses

29
Cache example

mini-cache: holds 2 lines, 4 words each

    for (i = 0; i < 10; i++) x[i] = i;

Main memory (4-word lines):
    x[0] x[1] x[2] x[3]
    x[4] x[5] x[6] x[7]
    x[8] x[9]  a    b
30
Cache example (cont'd)
  • We will ignore i for simplicity
  • need x[0], not in cache: cache miss
  • load its line from memory into cache
  • next 3 loop indices result in cache hits

    Cache: line 1: x[0] x[1] x[2] x[3]
           line 2: (empty)
31
Cache example (cont'd)
  • need x[4], not in cache: cache miss
  • load its line from memory into cache
  • next 3 loop indices result in cache hits

    Cache: line 1: x[0] x[1] x[2] x[3]
           line 2: x[4] x[5] x[6] x[7]
32
Cache example (cont'd)
  • need x[8], not in cache: cache miss
  • load its line from memory into cache
  • no room in cache!
  • replace the old line

    Cache: line 1: x[8] x[9]  a    b
           line 2: x[4] x[5] x[6] x[7]
33
Cache (cont'd)
  • Contiguous access is important
  • In C, a multidimensional array is stored in memory
    as (row-major order)
  • a[0][0]
  • a[0][1]
  • a[0][2]
  • ...

34
Cache (cont'd)
  • In Fortran and MATLAB, a multidimensional array is
    stored the opposite way (column-major order)
  • a(1,1)
  • a(2,1)
  • a(3,1)
  • ...

35
Cache (cont'd)
  • Rule: always order your loops appropriately

C:
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = 1.0;

Fortran:
    do j = 1, n
        do i = 1, n
            a(i,j) = 1.0
        enddo
    enddo
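
For contrast, a minimal C sketch of the two orderings (names and
size are illustrative); the second version jumps N doubles between
consecutive accesses and typically runs far slower:

    #define N 2000
    double a[N][N];

    void good_order(void)          /* contiguous, stride-1 access */
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 1.0;
    }

    void bad_order(void)           /* strided: one cache miss per line */
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] = 1.0;
    }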
36
SCF Machines
37
p-series
  • Shared memory
  • IBM Power4 processors
  • 32 KB L1 cache per processor
  • 1.41 MB L2 cache per pair of processors
  • 128 MB L3 cache per 8 processors

38
p-series

    machine     proc. speed   procs   memory
    kite        1.3 GHz       32      32 GB
    pogo        1.3 GHz       32      32 GB
    frisbee     1.3 GHz       32      32 GB
    domino      1.3 GHz       16      16 GB
    twister     1.1 GHz        8      16 GB
    scrabble    1.1 GHz        8      16 GB
    marbles     1.1 GHz        8      16 GB
    crayon      1.1 GHz        8      16 GB
    litebrite   1.1 GHz        8      16 GB
    hotwheels   1.1 GHz        8      16 GB
39
Blue Gene
  • Distributed memory
  • 2048 processors
  • 1024 2-processor nodes
  • IBM PowerPC 440 processors
  • 700 MHz
  • 512 MB memory per node (per 2 processors)
  • 32 KB L1 cache per node
  • 2 MB L2 cache per node
  • 4 MB L3 cache per node

40
BladeCenter
  • Hybrid memory
  • 56 processors
  • 14 4-processor nodes
  • AMD Opteron processors
  • 2.6 GHz
  • 8 GB memory per node (per 4 processors)
  • Each node has shared memory
  • 64 KB L1 cache per 2 processors
  • 1 MB L2 cache per 2 processors

41
Linux Cluster
  • Hybrid memory
  • 104 processors
  • 52 2-processor nodes
  • Intel Pentium III processors
  • 1.3 GHz
  • 1 GB memory per node (per 2 processors)
  • Each node has shared memory
  • 16 KB L1 cache per 2 processors
  • 512 KB L2 cache per 2 processors

42
For More Information
  • SCV web site
  • http://scv.bu.edu/
  • Today's presentations are available at
  • http://scv.bu.edu/documentation/presentations/
  • under the title "Introduction to Scientific
    Computing and Visualization"

43
Next Time
  • Get code
  • Time it
  • Look at effect of compiler flags
  • Profile it
  • Where is time consumed?
  • Modify it to improve serial performance