Title: The Synergy between Computer Science and CFD Modeling
1The Synergy between Computer Science and CFD
Modeling
Dov Kruger Anne Pence Alan Blumberg
2Background of the Davidson Laboratory Team
- Professor Alan Blumberg, co-author of POM and
author of ECOMSED - Dov Kruger, computer science, transitioning into
ocean research - Anne Pence, PhD candidate with a background in
naval architecture and Oceanography
3CFD is a Computing War Against Big Problems
- Overview of 5 key areas, and focus on our
research in single and multiple CPU performance
and accuracy - Understanding the battleground Architectural
Features of Computers - Having a tactical plan coding efficiency
- Effective Use of resources
- Parallelism on shared memory architectures
- Auto-parallelizing compilers
- OpenMP
- Parallelism using distributed computers (MPI)
- Hitting the correct target Accuracy
4Ignorance is useful, knowledge is crucial
- We questioned everything, including equation of
state - Surprising results a new equation of state, and
a new style of coding CFD models that coalesces
loops and is significantly faster on modern
computers - Nothing worked until Professor Blumberg
- Conveyed the analytics of the core routines
- Designed suitable test cases
- Debugged problems
5Doubling Performance of Barotropic Mode 10
months 2.5 hours
- Blindly optimizing the code is very difficult
- It took time to understand the model
- Problems
- Validating results
- Finding the right test case
- Incremental test procedure
- Once the correct strategy was in place, it took
only 2.5 hours to double the performance of
barotropic mode - More optimizations are possible, but they get
exponentially more difficult - We propose not to do them by hand
6A Sample Problem
7Computer Architecture
- Locality Speed!
- Registers
- Different Levels of cache
- Main Memory
- Swapping
- Disk
- Sequential access is much faster than
non-sequential - Locality is also accuracy
- Multiple execution units
- The limiting factor today is memory bandwidth
- ECOMSED and POM are both memory limited
8Efficiency
- There is a cost structure to operations
- Changes over time
- Todays Highest Costs
- Writing to memory
- Cache must flush to memory
- Reading from memory
- Statistically, cached
exp log sqrt Division Multiplication
9Heuristics for Writing Fast Code
- Efficiency doesn't hurt
- It may not help
- Minimize cost of operations
- Different on each machine
- But similar enough that heuristics will produce
good results on most architectures - On average, performance will improve if
expressions are sufficiently large
10Example Computing Bottom Stress original
- do j2,jmm1
- do i2,imm1
- ubar(i,j,kbm1)0.5(ua(i,j)ua(i1,j))
- vbar(i,j,kbm1)0.5(va(i,j)va(i,j1))
- enddo
- enddo
- do j2,jmm1
- do i2,imm1
- qbar(i,j)sqrt(ubar(i,j,kbm1)ubar(i,j,kbm
1) - vbar(i,j,kbm1)vbar(i,j,kbm
1)) - enddo
- enddo
- do j2,jmm1
- do i2,imm1
- if (fsm(i,j).gt.0.0) then
- tau(i,j,kb)10000.cbc(i,j)/
- varbf(i,j)qbar(i,j)qbar(i,j)
- endif
11Bottom Stress 21 Times Faster
- do j2,jmm1
- do i2,imm1
- tempuBAR (UA(I,J)UA(I1,J))
- tempVBAR (VA(I,J)VA(I,J1))
- TAU(I,J,KB)CBC2(I,J)(tempUBAR2
tempVBAR2) - enddo
- enddo
12The Current State of Compiler Optimization
- Compilers typically do not rearrange floating
point expressions - For fear of subtle roundoff effects
- Instruction scheduling can still be done as long
as the two data streams do not interact - Example (x y) - (x / y) (x y) (x y)
- Common subexpressions are well-handled provided
they are in the same form - Algebraic equivalences are not considered
- Constant subexpressions are pre-evaluated at
compile time in C and FORTRAN
13Examples Writing Fast Code
- Only scalars are considered constant
- Writing to an array is not optimized
awayvar1(i,j,k) c1 c2 c3var1(i,j,k)
var1(i,j,k) c4var1(i,j,k) c1 c2 c3
c4c1 var2(i,j,k) c3 var3(i,j,k)
c4c1c3c4 var2(i,j,k) var3(i,j,k)var2(i,j,k
) var3(i,j,k) c1c3c4var2(i,j,k)
var3(i,j,k) (c1c3c4)var1(i,j,k) / c1(i,j,k)
14Skipping Land
- Goals
- Save CPU by only processing water boxes
- Cost very little CPU for all-water problems
- IF-tests slow down the CPU, so preferably avoid
them - Stripwise unstructured view of the grid avoids
additional tests - inefficient on a vector machine
- k-order would be better to test an entire water
column at once, but this would require
re-engineering the entire model
15Skipping land, continued
istart,jrow, iend
2,2,2 5,2,6 2,3,2 5,3,6 2,4,6 2,5,6 2,6,6
istart,jrow, iend
16High Performance Advection
- Scalar Fluxes in one preferred direction
- No memory access for flux
- Full machine precision
xfluxi
Xfluxi1
Var(i,j,k)xFluxiP1 xFluxixFluxi xFluxiP1
17Advection, cont
- yfluxJ(1) 0
- yFluxJ(im) 0
- diffusiveyfluxJ(1) 0
- diffusiveyFluxJ(im) 0
- do k2,kbm1
- do i2,imm1
- yfluxJ(i) 0 ! because of land
boundary condition - diffusiveYFluxJ(i) 0
- enddo
- do j2,jmm1
- xfluxI 0 ! because of land
boundary condition - do i2,imm1
- xfluxIP1 (f(i,j,k)f(i1,j,k))
xmfl3d(i1,j,k) - yFluxJP1 (f(i,j,k)f(i,j1,k))
ymfl3d(i,j1,k) - diffusiveXFluxIP1 -aamx(i,j,k)
addDTi(i,j,BACK) - (fb(i,j,k)-fb(i-1,j,k))uMaskedH2_2H1(i,
j) - diffusiveYFluxJP1 -aamy(i,j,k)
addDTj(i,j, BACK) - (fb(i,j,k)-fb(i,j-1,k))vMaskedH1_2H2(i,
j)
18Advection, take 3
- ff(i,j,k)
- (
- fB(i,j,k) volT(i,j,BACK) -
!advective part - dti2 0.5 invDZ(k)
- (
- (f(i,j,k-1)f(i,j,k))w(i,j,k
) - - (f(i,j,k)f(i,j,k1))w(i,j,k
1)ART(i,j) - xFluxIP1 - xFluxI yFluxJP1
- yFluxJ(i) - ) ! diffusive part
- diffusiveXfluxIP1 -
diffusiveXFluxI - diffusiveYFluxJP1 -
diffusiveYFluxJ(i) - )
- invVolT(i,j,FUTURE)
- xFluxI xFluxIp1
- diffusiveXFluxI diffusiveXFluxIp1
- yFluxJ(i) yFluxJP1
- diffusiveYFluxJ(i)diffusiveYFluxJP1
- enddo
- enddo
19K-order is Faster
- POM and ECOMSED are currently stored with I
varying most frequently - var(i,j,k)
- For most algorithms, traversal order is
irrelevant, except - Vertical models are naturally ordered top to
bottom - A model stored and accessed in k-order can
implement the vertical profile algorithms much
more efficiently
20Performance of PROFT/PROFS
- On a sample problem
- 3 times faster for profiling temperature
- 8 times faster for profiling salinity
- This includes the penalty of traversing in the
wrong memory order for T,S - For a k-ordered model, code would be even faster
- Techniques as discussed before, and also
- Removal of exponentiation where unnecessary
21Shared Memory Parallelism
- Compiler automatically recognizes opportunities
for multiple CPUs to split up loops - Can be
- Automatic (under compiler control)
- Manual (OpenMP)
- Problem is memory bandwidth
22Memory Bandwidth and Shared Memory Computers
- Different levels
- Commodity PC dual processors
- Mid-range shared-memory computers
- Larger cache
- Floating point performance of CPUs are critical
- Special-purpose machines with exotic memory and
caches - Write-back cache synchronization
23MPI
- Message Passing Interface requires manual coding
intervention - The programmer can split the domain across
multiple computers - Messages are passed exchanging information
between models - This will not speed up small models due to the
overhead of message passing - The volume of the model must be large with
respect to the data exchange interface
24Accuracy
- Values in registers retain extra bits of
precision - This can substantially improve computational
results - Every time values are stored to single precision
arrays, this extra information is lost - Example Vectorization on the Pentium 4
- Very fast (4 simultaneous 32-bit operations)
- Values change substantially in the model
- Loss of significant figures
25Simulation is not Prediction
- Roundoff error slowly burns away digits of
precision - No escape from information entropy
- Roundoff is indistinguishable from small
differences in the initial conditions - Single precision is not enough
- After 12 hours, salinity results can differ by
3ppt on different computers
26Calibration of Model Run on Different Machines?
- Calibration represents balancing all the
constituents of the equations as they are
computed by the model - Some parts of the model are more sensitive to
roundoff error than others - Some parts are more driven by forcing functions
and are therefore more stable - Turbulence closure is particularly twitchy
27Density Computation
- The Equation of State has been defined by Millero
and Poisson in UNESCO - Accuracy is high
- On the order of 6 digits
- Original data (gt500 samples) 5 digits
- Computational cost is also high
- Many terms including s1.5
- How much accuracy is needed?
28Comparison Fofonoff 1952 vs. UNESCO
- ECOMSED used Fofonoff 1952 until recently
- Approximately 0.6 difference at worst case at
approximate 10C, 10ppt - In computer model, the absolute density is
irrelevant - The density difference between adjacent cells
drives the forcing - In a 20ppt difference
- No significant difference in X1
- Small differences in distribution of salt
gradient between 0 and 20 ppt.
29Difference between UNESCO and Fofonoff 1950
30A Faster Algorithm densityDKAP
- c1 6.38582008014815d-12
- c2 -1.07670862741285d-09
- c3 9.73156784811086d-08
- c4 -9.0042779076007d-06
- c5 6.60339902342078d-05
- c6 -0.000132403263360079d0
- c7 4.91156586724696d-12
- c8 -7.81209323957205d-10
- c9 6.72419650942208d-08
- c10 -3.5945086374423d-06
- c11 0.000809532551425d0
- c12 -1.59139276805119d-07
- c13 -2.00957653399204d-12
- c14 1.28333259714417d-10
- c15 2.34587013019438d-09
-
- do k1,kb-1
- do j2,jm-1
- do i2,im-1
- rho(i,j,k)
- ((((c13 t(i,j,k) c14) t(i,j,k)
c15) - s(i,j,k) c12) s(i,j,k)
- ((((c7 t(i,j,k) c8) t(i,j,k)) c9)
- t(i,j,k) c10) t(i,j,k) c11)
s(i,j,k) - ((((c1 t(i,j,k) c2) t(i,j,k) c3)
- t(i,j,k)c4) t(i,j,k) c5) t(i,j,k)
c6 - enddo
- enddo
- enddo
31UNESCO, DKAP and MS
- Dr. Martin Senator of Davidson Laboratory
contributed a different fit, MSSuperior to
UNESCO in fit to the original data
32Consistent Attention to Accuracy
- In POM and ECOMSED, density computation is
performed - Rho is computed for each cell
- In POM, density is stored as a ratio from rhoref
to save digits - Differences are subtracted between adjacent cells
- Salinity difference of 0.001 ppt yields a result
in the 7th decimal place - Temperature difference of 0.01 C yields a
result in the 7th decimal place - Result 0 digits accuracy
- Conclusion Compute density in double precision.
Anything else may be inaccurate
33Double precision Density vs. Single Precision
- Double precision is more stable
- Differential Calculation of density will be
presumably much more accurate, and therefore even
more stable
34Calculate Density Difference Analytically
- For greatest accuracy, calculate differential
density directly - Slightly slower, but full accuracy
- Density(s,t) Pn(s,t)
- DiffDensity(s,t,ds,dt)
35Future Directions in Hardware
- New processors with more registers
- Bigger expressions become even more advantageous
- Improved shared memory performance
- Improved writeback cache design, larger caches,
faster interconnects - Intelligent memory subsystems preloading values
- Sequential access becomes even more important
36Improved Software
- We propose to write a Model Construction Toolkit
(MCT) that will do all the optimizations
mentioned here, and more - Reimplement POM/ECOMSED using the new toolkit
- Allow calling FORTRAN routines
- Enter algorithms either in a FORTRAN-like syntax
or in a higher-level syntax based on common
notation for difference equations - Automatically optimize expressions
- Compute array subexpressions only when changed
- Compute results using an optimal function given
the situation - The result More robust, reliable modeling, at
much higher speed, with less effort
37Acknowledgements
- Professor Roger Pinkham, for superb numerical and
statistical knowledge and insight - Dr. Martin Senator for his densityMS fit and help
getting started with R - The makers of R, an incredible statistics package
- John Gilson, who reviewed the presentation and
helped us get the right computing focus, and
whose discussions first got us started thinking
about a model construction toolkit