Title: Hernquist SAC: PARTREE (Stuart Johnson)
1 Hernquist SAC: PARTREE, Stuart Johnson
- Single-processor optimization of a particle tree code
- Punch line:
  - tuned subroutines: 2.75X speedup
  - whole code: 2X speedup
2 Outline
- Scientific Context and Goals
- Initial Computational Characteristics
- Optimization efforts
- Performance improvements
- Conclusions
3 Scientific Context and Goals
- Simulate the gravitational evolution of galaxies
  - early disturbed development
  - origin of elliptical galaxies
  - background light and tidal debris
- Evolution of a "cosmological" model
  - initial mass density
  - initial spectrum of inhomogeneities
4 PARTREE
- Approximately solves the N-body problem
  - nearby particles: particle-particle interactions
  - "distant" particles: particle-multipole interactions
- Converts the O(N^2) algorithm to O(N log N)
- More general than SCF (no built-in symmetry assumptions)
5 What the code does
- At each timestep (a schematic of the loop is sketched below):
- 1) ORB domain decomposition (parallel)
- 2) Construct local BH trees
- 3) Construct locally essential trees (parallel)
- 4) Walk through trees to calculate forces (tuned)
- 5) Move particles
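A minimal sketch of this per-timestep structure; every identifier here is a hypothetical placeholder for illustration, not the actual PARTREE interface:

    /* One PARTREE-style timestep (all names hypothetical) */
    void timestep(Domain *d)
    {
        orb_decompose(d);           /* 1) ORB domain decomposition (parallel)  */
        build_local_bh_tree(d);     /* 2) construct the local BH tree          */
        build_essential_tree(d);    /* 3) import remote cells needed locally   */
        walk_trees_for_forces(d);   /* 4) force calculation: the tuned kernels */
        move_particles(d);          /* 5) advance the particles                */
    }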
6 Tree data structures
- ORB: Orthogonal Recursive Bisection
  - successive halving for parallel domain decomposition on 2^n processors
- BH: Barnes-Hut
  - nested cubing for on-processor decomposition and localization of particles (see the octant sketch below)
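The nested cubing step reduces to the classic octant test; a self-contained sketch of the idea behind a routine like the profiled whichChild (an illustration, not PARTREE's code):

    /* Pick the octant (0-7) of a cubic cell containing the particle by
       comparing each particle coordinate against the cell center. */
    static int which_child(double px, double py, double pz,
                           double cx, double cy, double cz)
    {
        return (px > cx) | ((py > cy) << 1) | ((pz > cz) << 2);
    }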
7 Opening criterion
- Open a cell if it is too close (see the sketch below)
- Approximate a cell using a multipole expansion if it is distant
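A minimal sketch of the standard Barnes-Hut form of this test; the opening angle theta and the argument list are illustrative assumptions, since the slide does not reproduce PARTREE's exact criterion:

    /* Open the cell when it subtends too large an angle: s/d > theta.
       Comparing squares avoids both the sqrt and the divide. */
    static int should_open(double s,                        /* cell size          */
                           double dx, double dy, double dz, /* vector to the cell */
                           double theta)                    /* opening angle      */
    {
        double d2 = dx*dx + dy*dy + dz*dz;
        return s*s > theta*theta*d2;
    }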
8 Force Calculation
- Particle-particle interaction (a generic kernel of this form is sketched below)
- Note the square root for calculating the particle-to-particle unit vector
- Note the divide(s)
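For reference, a Plummer-softened pairwise kernel of the kind being discussed; a sketch with G = 1, not the actual pforce, though it matches the sr/sr2 arithmetic shown later:

    #include <math.h>

    /* a += m * r_vec / (r^2 + eps^2)^(3/2) */
    static void pp_accel(double dx, double dy, double dz,
                         double mass, double epssq, double a[3])
    {
        double r2   = dx*dx + dy*dy + dz*dz + epssq;
        double r    = sqrt(r2);        /* the square root for the unit vector */
        double mor3 = mass / (r2*r);   /* the divide(s)                       */
        a[0] += mor3*dx;
        a[1] += mor3*dy;
        a[2] += mor3*dz;
    }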
9 Particle grouping
- Forces are calculated for nearby particles using the same tree decomposition, since nearby particles see almost the same gravity field
- Implemented by using the distance (d) to the nearest edge of the particle group in the opening criterion; the calculation is more accurate than for a single particle (one plausible form is sketched below)
- Has very significant implications for data reuse (generates in-cache force calculation loops)
- Has implications for independent operations
- Particle group size set to 32 by experimentation
- Grouping is a performance tradeoff
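One plausible form of the grouped opening test, assuming dmin is the distance from the cell to the nearest edge of the group's bounding volume (names hypothetical):

    /* Open the cell for the whole group when it is too close to the
       group's nearest edge; measuring to the nearest edge makes the
       test conservative for every particle in the group. */
    static int group_should_open(double s, double dmin, double theta)
    {
        return s > theta*dmin;
    }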
10 Initial Computational Characteristics
- Test problem: 2 million particle disk/halo simulation
- Not huge: 640 MB (4 nodes or more on the SP)
- 1000 interactions per particle
- Profile of computation time
- Scalability
11 Profile of computation time (4 node run)

      %   cumulative     self                self     total
     time   seconds    seconds      calls  ms/call  ms/call  name
     71.6   5332.26    5332.26   29226604     0.18     0.18  .cforce [6]
     11.3   6176.81     844.55    1264378     0.67     5.31  .GroupForceWalk [5]
      7.3   6717.45     540.64    3540223     0.15     0.15  .pforce [7]
      2.5   6902.73     185.28   12000000     0.02     0.02  .AddParticleToTree [9]
      1.9   7040.62     137.89                               .__mcount [10]
      0.7   7090.43      49.81                               .readsocket [12]
      0.6   7131.67      41.24        638    64.64    64.64  .WeightFrac [14]
      0.5   7172.11      40.44                               .kickpipes [16]
      0.5   7212.00      39.89  139771115     0.00     0.00  .whichChild [17]
      0.5   7247.02      35.02  128894017     0.00     0.00  .subCenter [18]
12 Scalability of test problem
13 Optimization Efforts
- cforce/pforce tuning for the SP and T3E
- General comments:
  - program for the architecture
  - program for the cache
  - program for the pipelines
  - avoid slow things (like divides)
14 Optimization Efforts
- General tuning comments
- Techniques:
  - predict performance and compare to absolute limits
  - understand the limiting factors on performance
  - understand the effects of code modifications, and modify towards the predicted performance
  - look at the assembly code
  - use compiler flags
15 cforce/pforce tuning for SP/T3E
- Elimination of divides
- "Vectorization" of the inverse square root
- The tunable loops and their properties
  - cache behavior
  - computational intensity / performance predictions
- Elimination of statically declared temporaries
- Elimination of single-precision calculations
- Absolute performance of the modified code
- Tuning pipelining and splitting loops (T3E)
  - increase independent operations, prevent register spill
16 Elimination of divides
- Original code (1 86 253 162 CP T3E):

    sr   = sqrt(sr2);
    phii = (c->mass) / sr;
    mor3 = phii / sr2;
    temp = 5.0*phiquad/sr2;

- Best, since 1/sqrt is as fast as sqrt (486 90 CP T3E):

    sr   = 1.0/sqrt(sr2);
    phii = (c->mass)*sr;
    rsr2 = sr*sr;
    mor3 = phii*rsr2;
    temp = 5.0*phiquad*rsr2;
17 "Vectorization" of inverse square root(T3E)
- Intrinsic (libm) function timings (from T3E
Benchmarker's Guide) - Routine CP Scalar CP Vector
- ------- --------- ---------
- SQRT 86 25
- 1/SQRT 86 25
-
- T3E can automatically stripmine and call vector
routines - BUT the C compiler is broken! (assembly reveals
this)
18 "Vectorization" of inverse square root(SP)
- Routine CP SQRT CP 1/SQRT
- ------------- ------------- ---------------
- FSQRT/FD(HW) 13.8 22.4
- libm(scalar) 43.7 58.0
- libmass(scalar) 28.4 28.4
- libmassvp2(vector) 7.1 7.1
- loop timings are for 10,000,000 ops in vector
lengths of 5000, reused to provide in-cache
timings - SP (w/out preprocessor) requires explicit vector
call to get lmassvp2 form (vrsqrt)
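A sketch of that explicit call, assuming the libmassv-style interface in which vrsqrt(y, x, n) sets y[i] = 1/sqrt(x[i]) and takes its count by pointer; the output array rinv is a hypothetical name:

    /* one vector call replaces ccount scalar 1/sqrt evaluations   */
    /* assumed interface: vrsqrt(y, x, n) -> y[i] = 1.0/sqrt(x[i]) */
    int n = ccount;
    vrsqrt(rinv, tmps, &n);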
19 Tunable loop(s)
- Loop 1:

    for (i = 0; i < ng; i++)
        for (j = 0, c = clist; c < clist + ccount; j++, c++) {
            x[j] = (c->r).x - pos0.x;
            y[j] = (c->r).y - pos0.y;
            z[j] = (c->r).z - pos0.z;
            x2[j] = x[j]*x[j];  y2[j] = y[j]*y[j];
            z2[j] = z[j]*z[j];
            tmps[j] = x2[j] + y2[j] + z2[j] + (c->epssq);
        }
20 Tunable loop(s)
- Loop 2:

    for (j = 0, c = clist; c < clist + ccount; j++, c++) {
        sr2  = tmps[j]*tmps[j];
        phii = (c->mass)*tmps[j];
        mor3 = phii*sr2;
        phisum -= phii;
        acc0.x += mor3*x[j];
        acc0.y += mor3*y[j];
        acc0.z += mor3*z[j];
        or5 = sr2*sr2*tmps[j];
        phiquad = (0.5*(qxx0*x2[j] + qyy0*y2[j] + qzz0*z2[j]
                        - (c->epssq)*momnode)
                   + x[j]*(qxy0*y[j] + qxz0*z[j])
                   + qyz0*y[j]*z[j]) * or5;
        phisum -= phiquad;
        temp = 5.0*phiquad*sr2;
        acc0.x += temp*x[j] - (qxx0*x[j] + qxy0*y[j] + qxz0*z[j])*or5;
        acc0.y += temp*y[j] - (qxy0*x[j] + qyy0*y[j] + qyz0*z[j])*or5;
        acc0.z += temp*z[j] - (qxz0*x[j] + qyz0*y[j] + qzz0*z[j])*or5;
    }
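Splitting the work into Loop 1 and Loop 2 is what exposes the group's reciprocal square roots as one batch between them; schematically (assuming in-place use of tmps is permitted by the library):

    /* between Loop 1 and Loop 2: batch the group's 1/sqrt work */
    int n = ccount;
    vrsqrt(tmps, tmps, &n);   /* tmps[j] <- 1.0/sqrt(tmps[j])   */
    /* Loop 2 then reads tmps[j] as 1/r and stays divide-free   */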
21 Loop properties
- 1) Cache behavior
  - particle grouping -> data reuse -> almost all loop data can be considered in-cache if the size of the working arrays is adjusted to fit in cache
22 Loop properties
- 2) Computational intensity / performance predictions:

                           loop 1    loop 2
    Floating point
      +/-/*                   9        59
      FMA                     0        18
      cycles                 9/2       ((59 - 2*18) + 18)/2 = 41/2
    Load/store
      L/S                     9        14
      cycles (no quads)      9/2       14/2

- CPU-bound loops, so predicted cycles = 9/2 + 41/2 = 25
23 Elimination of statically declared variables
- Original code:

    static float x, y, z;
    static float x2, y2, z2, epssq;
    static float dr2, sr, sr2, phii, mor3, phisum;
    static float or5, temp, phiquad;
    static coordStruct acc0, pos0;

- All of these should live in registers, but the statics force them to be stored back to memory!
- Change to local variables, plus some statically declared in-cache workspace (see the sketch below)...
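The change itself is mechanical; a sketch of the before and after for a few of the temporaries:

    /* before: statics have a fixed memory home, so the compiler
       must store updated values back to memory */
    static float sr, sr2, phii, mor3, phisum;

    /* after: automatic locals can live entirely in registers */
    float sr, sr2, phii, mor3, phisum;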
24 Elimination of single precision
- Particles stored in single or double precision
- Calculations performed in double precision
- Example of macro modification:

    #define COPYPARTICLE()                              \
    {                                                   \
        plist[pcount].type  = c->type;                  \
        plist[pcount].mass  = (double)(c->mass);        \
        plist[pcount].r.x   = (double)(c->r.x);         \
        plist[pcount].r.y   = (double)(c->r.y);         \
        plist[pcount].r.z   = (double)(c->r.z);         \
        plist[pcount].epssq = (double)(c->epssq);       \
    }
25 Performance Improvements
- In-cache test of the original cforce simulator loop: 7.59 seconds
- In-cache test of the optimized cforce simulator loop: 2.07 seconds
- 3.67 times faster! (10M iterations of the loop)
26 Performance of optimized tuned loops (no vrsqrt)

    CPU seconds                1.5913   CP executing      254610220
    Elapsed seconds            1.6457
    FPU0 results/sec         159.58M    F.P. in Math0     253934523
    FPU1 results/sec         138.09M    F.P. in Math1     219752424
    F.P. add ops/sec          25.44M    F.P. add           40480950
    F.P. mul ops/sec         113.92M    F.P. mul          181278375
    F.P. div ops/sec           0.00M    F.P. div               1776
    F.P. ma ops/sec          158.16M    F.P. ma           251677116
    MFLOPS ratio             455.67M    F.P. math ops     725115333
    Fixed instr/sec E0        64.67M    Fixed instr E0    102907722
    Fixed instr/sec E1        43.67M    Fixed instr E1     69495885
    ICU instr/sec              0.00M    ICU instr.                0
    Integer MIPS             108.34     Total instr.      172403607
    I Cache misses/sec         2.73k
    D Cache reloads/sec       43.75k
    D Cache storebacks/sec    22.45k
    D Cache misses/sec        33.81k
    Total TLB misses/sec       0.00k
    cycles/FLOP               0.3511
27 Performance of optimized tuned loops (no vrsqrt)
- Optimized loops need 25.5 cycles per iteration, a bit slower than my prediction
- Based on the displayed ops, this should be (251 + 181 + 40)/2 = 23.6 cycles
- Missing 1.9 cycles/loop, so the loops run at only 93% of peak
- Tuned loops are running at 456 Mflops
28 Performance improvements
- Rs2hpm results from 2M particle run (optimized code):

      %   cumulative     self                self     total
     time   seconds    seconds      calls  ms/call  ms/call  name
     42.4   1598.79    1598.79    3009774     0.53     0.53  .cforce [6]
     25.0   2539.95     941.16    1264378     0.74     2.08  .GroupForceWalk
     11.7   2981.47     441.52                               vrsqrt [7]
      4.9   3165.09     183.62   12000000     0.02     0.02  .AddParticleToTree
      3.6   3301.92     136.83                               .__mcount [10]
      2.2   3385.88      83.96    1264407     0.07     0.07  .pforce [11]
      1.1   3427.49      41.61                               .readsocket [13]
      1.1   3468.57      41.08  139771115     0.00     0.00  .whichChild [14]
      1.1   3509.58      41.01        638    64.28    64.28  .WeightFrac [17]
      0.9   3543.70      34.12                               .kickpipes [18]
29 Performance Improvements
- Rs2hpm results from 2M particle run (original code):

      %   cumulative     self                self     total
     time   seconds    seconds      calls  ms/call  ms/call  name
     71.6   5332.26    5332.26   29226604     0.18     0.18  .cforce [6]
     11.3   6176.81     844.55    1264378     0.67     5.31  .GroupForceWalk [5]
      7.3   6717.45     540.64    3540223     0.15     0.15  .pforce [7]
      2.5   6902.73     185.28   12000000     0.02     0.02  .AddParticleToTree
      1.9   7040.62     137.89                               .__mcount [10]
      0.7   7090.43      49.81                               .readsocket [12]
      0.6   7131.67      41.24        638    64.64    64.64  .WeightFrac [14]
      0.5   7172.11      40.44                               .kickpipes [16]
      0.5   7212.00      39.89  139771115     0.00     0.00  .whichChild [17]
      0.5   7247.02      35.02  128894017     0.00     0.00  .subCenter [18]

- cforce and pforce are 2.76X faster...
30 Performance improvements
- Full run (2M particles) performance:

    Processors   Opt code P/Proc/S   Orig code P/Proc/S   Speedup
    ----------   -----------------   ------------------   -------
         4              3289                1650            1.99
         8              3788                1880            2.01
        16              3289                1645            2.00
        32              2717                1358            2.00
        64              2300                1200            1.92
31 Conclusions 1
- With 2X the speed, we can do 2X the particles in the same time as before
- The code may scale well enough to 128 nodes for 100,000,000 particle simulations (50,000 processor-seconds/timestep, 14,000 SUs for 1000 timesteps), but scaling problems may be inherent to the algorithm (ORB aspect ratios?)
- The T3E code was also optimized similarly, with more attention to loop pipelining problems
32 Conclusions 2
- Profiling and performance monitoring tools are essential for optimization work
- Easy-to-use interactive nodes are REALLY nice for optimization work
- The same code does not run best on both the T3E and the SP
- The SP is easier to tune for, and faster than the T3E, from a single-PE standpoint
- Funky DEC Alpha on-chip bandwidths and latencies