The Synergy between Computer Science and CFD Modeling - PowerPoint PPT Presentation

1 / 37

About This Presentation

Title:

The Synergy between Computer Science and CFD Modeling

Description:

This will not speed up small models due to the overhead of message passing ... The result: More robust, reliable modeling, at much higher speed, with less effort ... – PowerPoint PPT presentation

Number of Views:58

Avg rating:3.0/5.0

Slides: 38

Provided by: dovkr

Category:

more less

Transcript and Presenter's Notes

Title: The Synergy between Computer Science and CFD Modeling

1
The Synergy between Computer Science and CFD
Modeling
Dov Kruger Anne Pence Alan Blumberg
2
Background of the Davidson Laboratory Team

Professor Alan Blumberg, co-author of POM and
author of ECOMSED
Dov Kruger, computer science, transitioning into
ocean research
Anne Pence, PhD candidate with a background in
naval architecture and Oceanography

3
CFD is a Computing War Against Big Problems

Overview of 5 key areas, and focus on our
research in single and multiple CPU performance
and accuracy
Understanding the battleground Architectural
Features of Computers
Having a tactical plan coding efficiency
Effective Use of resources
Parallelism on shared memory architectures
Auto-parallelizing compilers
OpenMP
Parallelism using distributed computers (MPI)
Hitting the correct target Accuracy

4
Ignorance is useful, knowledge is crucial

We questioned everything, including equation of
state
Surprising results a new equation of state, and
a new style of coding CFD models that coalesces
loops and is significantly faster on modern
computers
Nothing worked until Professor Blumberg
Conveyed the analytics of the core routines
Designed suitable test cases
Debugged problems

5
Doubling Performance of Barotropic Mode 10
months 2.5 hours

Blindly optimizing the code is very difficult
It took time to understand the model
Problems
Validating results
Finding the right test case
Incremental test procedure
Once the correct strategy was in place, it took
only 2.5 hours to double the performance of
barotropic mode
More optimizations are possible, but they get
exponentially more difficult
We propose not to do them by hand

6
A Sample Problem
7
Computer Architecture

Locality Speed!
Registers
Different Levels of cache
Main Memory
Swapping
Disk
Sequential access is much faster than
non-sequential
Locality is also accuracy
Multiple execution units
The limiting factor today is memory bandwidth
ECOMSED and POM are both memory limited

8
Efficiency

There is a cost structure to operations
Changes over time
Todays Highest Costs
Writing to memory
Cache must flush to memory
Reading from memory
Statistically, cached

exp log sqrt Division Multiplication
9
Heuristics for Writing Fast Code

Efficiency doesn't hurt
It may not help
Minimize cost of operations
Different on each machine
But similar enough that heuristics will produce
good results on most architectures
On average, performance will improve if
expressions are sufficiently large

10
Example Computing Bottom Stress original

do j2,jmm1
do i2,imm1
ubar(i,j,kbm1)0.5(ua(i,j)ua(i1,j))
vbar(i,j,kbm1)0.5(va(i,j)va(i,j1))
enddo
enddo
do j2,jmm1
do i2,imm1
qbar(i,j)sqrt(ubar(i,j,kbm1)ubar(i,j,kbm
1)
vbar(i,j,kbm1)vbar(i,j,kbm
1))
enddo
enddo
do j2,jmm1
do i2,imm1
if (fsm(i,j).gt.0.0) then
tau(i,j,kb)10000.cbc(i,j)/
varbf(i,j)qbar(i,j)qbar(i,j)
endif

11
Bottom Stress 21 Times Faster

do j2,jmm1
do i2,imm1
tempuBAR (UA(I,J)UA(I1,J))
tempVBAR (VA(I,J)VA(I,J1))
TAU(I,J,KB)CBC2(I,J)(tempUBAR2
tempVBAR2)
enddo
enddo

12
The Current State of Compiler Optimization

Compilers typically do not rearrange floating
point expressions
For fear of subtle roundoff effects
Instruction scheduling can still be done as long
as the two data streams do not interact
Example (x y) - (x / y) (x y) (x y)
Common subexpressions are well-handled provided
they are in the same form
Algebraic equivalences are not considered
Constant subexpressions are pre-evaluated at
compile time in C and FORTRAN

13
Examples Writing Fast Code

Only scalars are considered constant
Writing to an array is not optimized
awayvar1(i,j,k) c1 c2 c3var1(i,j,k)
var1(i,j,k) c4var1(i,j,k) c1 c2 c3
c4c1 var2(i,j,k) c3 var3(i,j,k)
c4c1c3c4 var2(i,j,k) var3(i,j,k)var2(i,j,k
) var3(i,j,k) c1c3c4var2(i,j,k)
var3(i,j,k) (c1c3c4)var1(i,j,k) / c1(i,j,k)

14
Skipping Land

Goals
Save CPU by only processing water boxes
Cost very little CPU for all-water problems
IF-tests slow down the CPU, so preferably avoid
them
Stripwise unstructured view of the grid avoids
additional tests
inefficient on a vector machine
k-order would be better to test an entire water
column at once, but this would require
re-engineering the entire model

15
Skipping land, continued
istart,jrow, iend
2,2,2 5,2,6 2,3,2 5,3,6 2,4,6 2,5,6 2,6,6
istart,jrow, iend
16
High Performance Advection

Scalar Fluxes in one preferred direction
No memory access for flux
Full machine precision

xfluxi
Xfluxi1
Var(i,j,k)xFluxiP1 xFluxixFluxi xFluxiP1
17
Advection, cont

yfluxJ(1) 0
yFluxJ(im) 0
diffusiveyfluxJ(1) 0
diffusiveyFluxJ(im) 0
do k2,kbm1
do i2,imm1
yfluxJ(i) 0 ! because of land
boundary condition
diffusiveYFluxJ(i) 0
enddo
do j2,jmm1
xfluxI 0 ! because of land
boundary condition
do i2,imm1
xfluxIP1 (f(i,j,k)f(i1,j,k))
xmfl3d(i1,j,k)
yFluxJP1 (f(i,j,k)f(i,j1,k))
ymfl3d(i,j1,k)
diffusiveXFluxIP1 -aamx(i,j,k)
addDTi(i,j,BACK)
(fb(i,j,k)-fb(i-1,j,k))uMaskedH2_2H1(i,
j)
diffusiveYFluxJP1 -aamy(i,j,k)
addDTj(i,j, BACK)
(fb(i,j,k)-fb(i,j-1,k))vMaskedH1_2H2(i,
j)

18
Advection, take 3

ff(i,j,k)
(
fB(i,j,k) volT(i,j,BACK) -
!advective part
dti2 0.5 invDZ(k)
(
(f(i,j,k-1)f(i,j,k))w(i,j,k
) -
(f(i,j,k)f(i,j,k1))w(i,j,k
1)ART(i,j)
xFluxIP1 - xFluxI yFluxJP1
- yFluxJ(i)
) ! diffusive part
diffusiveXfluxIP1 -
diffusiveXFluxI
diffusiveYFluxJP1 -
diffusiveYFluxJ(i)
)
invVolT(i,j,FUTURE)
xFluxI xFluxIp1
diffusiveXFluxI diffusiveXFluxIp1
yFluxJ(i) yFluxJP1
diffusiveYFluxJ(i)diffusiveYFluxJP1
enddo
enddo

19
K-order is Faster

POM and ECOMSED are currently stored with I
varying most frequently
var(i,j,k)
For most algorithms, traversal order is
irrelevant, except
Vertical models are naturally ordered top to
bottom
A model stored and accessed in k-order can
implement the vertical profile algorithms much
more efficiently

20
Performance of PROFT/PROFS

On a sample problem
3 times faster for profiling temperature
8 times faster for profiling salinity
This includes the penalty of traversing in the
wrong memory order for T,S
For a k-ordered model, code would be even faster
Techniques as discussed before, and also
Removal of exponentiation where unnecessary

21
Shared Memory Parallelism

Compiler automatically recognizes opportunities
for multiple CPUs to split up loops
Can be
Automatic (under compiler control)
Manual (OpenMP)
Problem is memory bandwidth

22
Memory Bandwidth and Shared Memory Computers

Different levels
Commodity PC dual processors
Mid-range shared-memory computers
Larger cache
Floating point performance of CPUs are critical
Special-purpose machines with exotic memory and
caches
Write-back cache synchronization

23
MPI

Message Passing Interface requires manual coding
intervention
The programmer can split the domain across
multiple computers
Messages are passed exchanging information
between models
This will not speed up small models due to the
overhead of message passing
The volume of the model must be large with
respect to the data exchange interface

24
Accuracy

Values in registers retain extra bits of
precision
This can substantially improve computational
results
Every time values are stored to single precision
arrays, this extra information is lost
Example Vectorization on the Pentium 4
Very fast (4 simultaneous 32-bit operations)
Values change substantially in the model
Loss of significant figures

25
Simulation is not Prediction

Roundoff error slowly burns away digits of
precision
No escape from information entropy
Roundoff is indistinguishable from small
differences in the initial conditions
Single precision is not enough
After 12 hours, salinity results can differ by
3ppt on different computers

26
Calibration of Model Run on Different Machines?

Calibration represents balancing all the
constituents of the equations as they are
computed by the model
Some parts of the model are more sensitive to
roundoff error than others
Some parts are more driven by forcing functions
and are therefore more stable
Turbulence closure is particularly twitchy

27
Density Computation

The Equation of State has been defined by Millero
and Poisson in UNESCO
Accuracy is high
On the order of 6 digits
Original data (gt500 samples) 5 digits
Computational cost is also high
Many terms including s1.5
How much accuracy is needed?

28
Comparison Fofonoff 1952 vs. UNESCO

ECOMSED used Fofonoff 1952 until recently
Approximately 0.6 difference at worst case at
approximate 10C, 10ppt
In computer model, the absolute density is
irrelevant
The density difference between adjacent cells
drives the forcing
In a 20ppt difference
No significant difference in X1
Small differences in distribution of salt
gradient between 0 and 20 ppt.

29
Difference between UNESCO and Fofonoff 1950
30
A Faster Algorithm densityDKAP

c1 6.38582008014815d-12
c2 -1.07670862741285d-09
c3 9.73156784811086d-08
c4 -9.0042779076007d-06
c5 6.60339902342078d-05
c6 -0.000132403263360079d0
c7 4.91156586724696d-12
c8 -7.81209323957205d-10
c9 6.72419650942208d-08
c10 -3.5945086374423d-06
c11 0.000809532551425d0
c12 -1.59139276805119d-07
c13 -2.00957653399204d-12
c14 1.28333259714417d-10
c15 2.34587013019438d-09

do k1,kb-1
do j2,jm-1
do i2,im-1
rho(i,j,k)
((((c13 t(i,j,k) c14) t(i,j,k)
c15)
s(i,j,k) c12) s(i,j,k)
((((c7 t(i,j,k) c8) t(i,j,k)) c9)
t(i,j,k) c10) t(i,j,k) c11)
s(i,j,k)
((((c1 t(i,j,k) c2) t(i,j,k) c3)
t(i,j,k)c4) t(i,j,k) c5) t(i,j,k)
c6
enddo
enddo
enddo

31
UNESCO, DKAP and MS

Dr. Martin Senator of Davidson Laboratory
contributed a different fit, MSSuperior to
UNESCO in fit to the original data

32
Consistent Attention to Accuracy

In POM and ECOMSED, density computation is
performed
Rho is computed for each cell
In POM, density is stored as a ratio from rhoref
to save digits
Differences are subtracted between adjacent cells
Salinity difference of 0.001 ppt yields a result
in the 7th decimal place
Temperature difference of 0.01 C yields a
result in the 7th decimal place
Result 0 digits accuracy
Conclusion Compute density in double precision.
Anything else may be inaccurate

33
Double precision Density vs. Single Precision

Double precision is more stable
Differential Calculation of density will be
presumably much more accurate, and therefore even
more stable

34
Calculate Density Difference Analytically

For greatest accuracy, calculate differential
density directly
Slightly slower, but full accuracy
Density(s,t) Pn(s,t)
DiffDensity(s,t,ds,dt)

35
Future Directions in Hardware

New processors with more registers
Bigger expressions become even more advantageous
Improved shared memory performance
Improved writeback cache design, larger caches,
faster interconnects
Intelligent memory subsystems preloading values
Sequential access becomes even more important

36
Improved Software

We propose to write a Model Construction Toolkit
(MCT) that will do all the optimizations
mentioned here, and more
Reimplement POM/ECOMSED using the new toolkit
Allow calling FORTRAN routines
Enter algorithms either in a FORTRAN-like syntax
or in a higher-level syntax based on common
notation for difference equations
Automatically optimize expressions
Compute array subexpressions only when changed
Compute results using an optimal function given
the situation
The result More robust, reliable modeling, at
much higher speed, with less effort

37
Acknowledgements

Professor Roger Pinkham, for superb numerical and
statistical knowledge and insight
Dr. Martin Senator for his densityMS fit and help
getting started with R
The makers of R, an incredible statistics package
John Gilson, who reviewed the presentation and
helped us get the right computing focus, and
whose discussions first got us started thinking
about a model construction toolkit

Write a Comment

User Comments (0)