Title: NPACI Strategic Application Collaboration
1NPACI Strategic Application Collaboration
- Implementation, Tuning, Visualization Aspects
Related to the Immersed Boundary Method
2Talk Outline
- Collaborators
- Brief Description of Method
- Areas of focus of effort
- - Single PE Vector/Parallel
- - Single PE MPP
- - Multiple PE MPP
- - Visualization
3Collaborators
- NYU
- Charles Peskin, David McQueen, Nat Cohen
- SDSC
- Richard Charles, Dongju Choi, Giri Chukkapalli
4Basics of the Method
- General Fluid-Flexible-structure Interactions
- Simple Fiber tension calculations
- Fluid modeled via FFTs and periodic BC.s
- Delta function projection of fiber force onto
fluid - Velocity field interpolation to update fiber
points
5Equations of motion - I
A) At timestep (n1), Xn, and un are known B)
Next the tensions in the fibers are found
from C) Next project the fiber forces onto
the fluid lattice
6Equations of motion - II
D) Solve the N-S equations for un1, and
pn1 E) Update the fiber positions
7Fiber Fluid Geometry
8Fiber definitions - Initial State
9Fiber definitions - Deformed State
10Execution Flowchart
1) Initial Work - Data readin, fiber generation,
source/sinks ... 2) Source fiber
activation 3) Fiber force calculation 4)
Pushup - Projection of fiber forces onto fluid
lattice 5) Fluidup - Fluid lattice velocity
pressure calculation 6) Move - Movement of
fibers based upon new velocities
11T90 Performance Characteristics
Move - 27 CPU - 143 Mflops Pushup - 22
CPU - 245 Mflops Fluidup - 18 CPU - 801
Mflops (CFFT3D) Fibgen - 17 CPU - 1
Mflops (Called once initially - I/O one-time
calculations) Movemk - 9 CPU - 10
Mflops (Values for 10 timestep run on
T90) Fluidup Move - Also had poor cache
characteristics on MPP machines
12T90 Optimization Results
Subroutine Original Optimized Comparison Move 1
36 Mflops 245 Mflops 1.8 Pushup 226 Mflops 321
Mflops 1.4 Movemk 10 Mflops 85
Mflops 8.5 Total Time 1550 sec 1155 sec 25
Speedup Total Speed 286 Mflops 400 Mflops
140 Increase (Values for 100 timestep run on
T90)
13T90 Optimization Details - I
1. Vectorizing the copy of the velocity data into
a linear arrays The original part was not
vectorized and taking much time to copy from 3-D
variables (U,V,W) to 1-D variables (ULIN, VLIN,
WLIN). A new variable, ID, is created. This
helps to store the address of the 1-D variables
corresponding to the 3-D variables.
(Subroutines Move and Movemk)
14T90 Optimization Details - II
DO 4 K-1,NG1 K3D MODNG(K) DO 4 J-1,2
J3D MODNG(JJJ-1)1 I0 K16(J1)4 2
DO 4 I-1,2 I3D MODNG(III-1)1
ID(I0I,1) I3D ID(I0I,2) J3D
ID(I0I,3) K3D 4 CONTINUE do i-15,16NGP2
i3did(i,1) j3did(i,2) k3did(i,3) ULIN(I)
U(I3D,J3D,K3D) VLIN(I) V(I3D,J3D,K3D)
WLIN(I) W(I3D,J3D,K3D) enddo
DO 4 K-1,NG1 K3D MODNG(K) DO 4 J-1,2
J3D MODNG(JJJ-1)1 I0 K16(J1)4 2
DO 4 I-1,2 I3D MODNG(III-1)1
ULIN(I0I) U(I3D,J3D,K3D) VLIN(I0I)
V(I3D,J3D,K3D) WLIN(I0I)
W(I3D,J3D,K3D) 4 CONTINUE
15T90 Optimization Details - III
2. A compute line was in the outer loop and was
not vectorized The line was put in
the other part of the routines to be
efficiently vectorized and stored in an array,
mzeroa. (Subroutines Move and Movemk and
Pushup)
16T90 Optimization Details - IV
DO 122 NPT1,NPOINTS MZERO 16(INT(XFN3OLD(NPT
) - 1. FLNG) - NG) DO 121 M1,64
UINT(M,NPT) ULIN(MZEROM) DELTA(M,NPT)
VINT(M,NPT) ULIN(MZEROM) DELTA(M,NPT)
WINT(M,NPT) ULIN(MZEROM) DELTA(M,NPT) 121
CONTINUE 122 CONTINUE
DO 122 NPT1,NPOINTS MZERO mzeroa(npt) DO
121 M1,64 UINT(M,NPT) ULIN(MZEROM)
DELTA(M,NPT) VINT(M,NPT) ULIN(MZEROM)
DELTA(M,NPT) WINT(M,NPT) ULIN(MZEROM)
DELTA(M,NPT) 121 CONTINUE 122 CONTINUE
17T90 Optimization Details - V
3. An inner loop was unrolled. (Subroutine
Pushup) 4. Frequently used compute parts were
set in the beginning of the
routines. (Subroutines Move and Movemk and
Pushup) 5. Inner-loop-redundant compute parts
were separated into outer loops. (Subroutine
Pushup)
18Cache-based MPP Optimization
- Code was ported to the IBM SP.
- Detailed single PE performance analysis was
conducted. - Hot spots were tuned.
- Initial investigation of the parallel load
partitioning and the communication algorithm
design has been done.
19IBM SP Porting Details - I
- compiler flags for ibm spCFLAGS -g -pg
-qarchpwr2 -qrealsize8 -qintsize8 -qfixed -c
linker flags for ibm spLFLAGS -g -pg
-qarchpwr2 -qrealsize8 -qintsize8 -qfixed sp
libsLIBS -lessl - Create -DSP section in the makefile.
- Create ibg_cfft3d_interface_sp.F with fft
interface. - fflush, timers
20IBM SP Porting Details - II
- Original Code Performance (including I/O)
- CPU seconds 269.5600 CP executing
40888159996Elapsed seconds 479.5405FPU0
results/sec 5.11M F.P. in Math0
1376740098FPU1 results/sec 0.60M F.P. in
Math1 161472402F.P. add ops/sec 1.34M F.P.
add 359893857F.P. mul ops/sec 0.80M F.P.
mul 214718100F.P. div ops/sec 0.03M F.P.
div 7120845F.P. ma ops/sec 1.60M F.P. ma
432094233MFLOPS ratio 5.36M F.P. math ops
1445921268Fixed instr/sec E0 69.57M Fixed
instr E0 18752272602Fixed instr/sec E1
42.32M Fixed instr E1 11407203306ICU
instr/sec 0.00M ICU instr. 0Integer MIPS
111.88 Total instr. 30159475908I Cache
reloads/sec 0.00kD Cache reloads/sec
99.18kD Cache storebacks/sec 37.63kD Cache
misses/sec 83.00kTotal TLB misses/sec 0.00k
21IBM SP Porting Details - III
- Xprofiler output (flat profile)
- cumulative self self total
- time seconds seconds calls ms/call
ms/call name - 44.7 111.29 111.29 12 9274.17
9359.17 .interactviadelta 3 - 11.4 139.70 28.41
.__divi64 4 - 10.3 165.30 25.60
.IOGetByte 5 - 8.8 187.29 21.99
.LDScan 6 - 7.7 206.55 19.26
.__f64toi64rz 7 - 4.5 217.69 11.14
.__f64toi64rz.GL 8 - 2.0 222.67 4.98
.atof 10 - 1.4 226.06 3.39
.__divi64.GL 11 - 1.2 229.07 3.01 4 752.50
752.50 .ibg_calccfdsumknowns 13 - 0.8 230.94 1.87 4 467.50
1576.50 .ibg_solvefluiddynamics 9
22IBM SP Porting Details - IV
- Xprofiler ouput (function call summary)
- total calls function
- 49.98 502961 calls from
.calc_ptfuturedomain 22 to .calcperbinxyz 23 - 49.98 502961 calls from .calcperbinxyz
23 to .calcbinxyz 45 - 0.01 89 calls from
.mib_main_iterate 1 to .second 78 - 0.00 48 calls from
.exchangefluiddata 29 to .exchfluidoneside 27 - 0.00 48 calls from
.exchfluidoneside 27 to .handleownfluidghosts
28 - 0.00 24 calls from
.handleownfluidghosts 28 to .copytoownfluidghost
s 37 - 0.00 24 calls from
.ibg_solvefluiddynamics 9 to .second 78 - 0.00 24 calls from
.handleownfluidghosts 28 to .addfromownfluidghos
ts 38 - 0.00 20 calls from .splitlistkey
24 to .setemptylist 79 - 0.00 16 calls from
.ibg_solvefluiddynamics 9 to .ibg_csfft3d 81
23IBM SP Porting Details - V
- Xprofiler ouput (library call summary)
-
- total total total total calls
calls calls load - seconds time calls calls out
of into within unit - 142.57 57.21 1006383 100.00
0.00 0.00 100.00 heart_sp_test - 54.50 21.87 NA --
0.00 -- -- /usr/lib/libc.a
shr.o - 52.13 20.92 NA -- 0.00
-- -- /lib/libxlf90.a io.o - 0.00 0.00 NA --
0.00 -- -- /lib/libessl.a
essl.o - 0.00 0.00 NA --
0.00 -- -- /lib/libxlf90.a
xlfsys.o
24IBM SP Porting Details - VI
- Largest time spent in the subroutine
interactviadelta up to 60 . - The subroutine interactviadelta was isolated
into a kernel with appropriate input data.
25Original Kernel Performance
ORIGINAL KERNEL 16.50M MFLOPS ratio do
iPt1,502961 I0 IJK(1,iPt) J0
IJK(2,iPt) K0 IJK(3,iPt) do 3 iDim
1, 3 do 3 K1,4 do
3 J1,4 do 3 I1,4
data_pt(iDim,iPt)
data_pt(iDim,iPt) data_fluid(I0I,J0J,K0K,
iDim)wt_delta(I,J,K) c
data_fluid(I0I,J0J,K0K, iDim)
data_fluid(I0I,J0J,K0K, iDim) c .
data_pt(iDim,iPt)
wt_delta(I,J,K) 3 continue end do
26Modified Kernel Performance
MODIFIED KERNEL 37.16M MFLOPS ratio (Outer
loop unrolled to increase the number of
independent operations in the loop.) do
iPt1,502961 I0 IJK(1,iPt) J0
IJK(2,iPt) K0 IJK(3,iPt) do 3
K1,4 do 3 J1,4 do 3 I1,4
data_fluid(1,I0I,J0J,K0K)
data_fluid(1,I0I,J0J,K0K) data_pt(1,iPt)
wt_delta(I,J,K) data_fluid(2,I0I,J0
J,K0K) data_fluid(2,I0I,J0J,K0K)
data_pt(2,iPt) wt_delta(I,J,K)
data_fluid(3,I0I,J0J,K0K) data_fluid(3,I0I,J0
J,K0K) data_pt(3,iPt) wt_delta(I,J,K) 3
continue end do
27Cache-blocking of Fiber Data
MODIFIED KERNEL 123.87M MFLOPS ratio (Outer
loop unrolled with Cache-blocking of fiber point
data.) do 3 kbl0,9 do 3 jbl0,9 do 3
ibl0,9 II0 iblblock_side JJ0
jblblock_side KK0 kblblock_side
iblock 1iblock do 3 iPt1,502 I0
II0IJK(1,iPt,iblock) J0 JJ0IJK(2,iPt,iblock
) K0 KK0IJK(3,iPt,iblock) do
3 K0,3 do 3 J0,3
do 3 I0,3
data_fluid(1,I0I,J0J,K0K) data_fluid(1,I0I,J
0J,K0K) data_pt(1,iPt,iblock)wt_delta(I1,J1
,K1) data_fluid(2,I0I,J0J,K0
K) data_fluid(2,I0I,J0J,K0K)
data_pt(2,iPt,iblock)wt_delta(I1,J1,K1)
data_fluid(3,I0I,J0J,K0K)
data_fluid(3,I0I,J0J,K0K) data_pt(3,iPt,ibloc
k)wt_delta(I1,J1,K1) 3 continue
28Cache-blocking of Fiber Fluid Data
MODIFIED KERNEL 262.46M MFLOPS ratio (Outer
loop unrolled with Cache-blocking of both fiber
and fluid point data.) do iblock1,no_blocks
do iPt1,502 I0 IJK(1,iPt,iblock) J0
IJK(2,iPt,iblock) K0 IJK(3,iPt,iblock)
do 3 K0,3 do 3 J0,3 do
3 I0,3 data_fluid(1,I0I,J0J,K0K,i
block) data_fluid(1,I0I,J0J,K0K,iblock)
data_pt(1,iPt,iblock) wt_delta(I1,J1,K1)
data_fluid(2,I0I,J0J,K0K,iblock)
data_fluid(2,I0I,J0J,K0K,iblock)
data_pt(2,iPt,iblock) wt_delta(I1,J1,K1)
data_fluid(3,I0I,J0J,K0K,iblock)
data_fluid(3,I0I,J0J,K0K,iblock)
data_pt(3,iPt,iblock) wt_delta(I1,J1,K1) 3
continue end do end do
29Summary of Single PE Optimization
- Core of the parallel code was extracted and
optimized - Original 16.50 MFLOPS ratio
- Loop unrolling 37.16 MFLOPS ratio
- Cache blocking 123.87 MFLOPS ratio
- Cache reuse 262.46 MFLOPS ratio
30Parallel Load Partitioning Communication
Strategies
- Fiber points fall into the corresponding fluid
sub-domain will be updated by the processors
owning the sub-domain - Simple to implement
- May cause sever loading balancing
- Uniform distribution of fiber points among all
processors. - Good load balancing and scalable
- Complex to implement and may increase the
communication overhead - Uniform distribution of fibers among all
processors. - This falls between the above two.
31Original SGI-Based Visualization
32First Attempt
33Fiber VRML Extrusion Node Fibers
34Bounding Boxes
35Alpha-Shapes
36Resultant Surfaces
37Cutting Planes
38Multiple Parts - I
39Multiple Parts - II
40New Geometry Representation
- Previous use of alpha-shapes resulted in less
than optimal surfaces and/or excessive hand
manipulation of the data - Process unsatisfactory for visualization
animation of time-dependent data. - Alpha-shape process applied to initial (clean)
configuration and defining surfaces identified
for each part (layer). - Surface definition (triangle connectivity) used
for each subsequent time-step
41New Multiple Parts
42Animation
43Summary
- General basis of Immersed Boundary Method
- Discussed Implementation aspects.
- Discussed Tuning aspects on VP and MPP machines
- Showed progression of visualization tool
development