NPACI Strategic Application Collaboration - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: NPACI Strategic Application Collaboration

1
NPACI Strategic Application Collaboration

Implementation, Tuning, Visualization Aspects
Related to the Immersed Boundary Method

2
Talk Outline

Collaborators
Brief Description of Method
Areas of focus of effort
- Single PE Vector/Parallel
- Single PE MPP
- Multiple PE MPP
- Visualization

3
Collaborators

NYU
Charles Peskin, David McQueen, Nat Cohen
SDSC
Richard Charles, Dongju Choi, Giri Chukkapalli

4
Basics of the Method

General Fluid-Flexible-structure Interactions
Simple Fiber tension calculations
Fluid modeled via FFTs and periodic BCs
Delta function projection of fiber force onto fluid
Velocity field interpolation to update fiber points
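The last two bullets can be sketched in one dimension. This is a minimal illustration, not the collaboration's code: it assumes the standard 4-point Peskin kernel (the slides do not say which delta function is used, only that the stencil is 4 points wide per dimension), a grid spacing of 1, and hypothetical function names.

```python
import math

def phi(r):
    """Assumed 4-point kernel; nonzero only for |r| < 2."""
    r = abs(r)
    if r <= 1.0:
        return (3.0 - 2.0 * r + math.sqrt(1.0 + 4.0 * r - 4.0 * r * r)) / 8.0
    if r <= 2.0:
        return (5.0 - 2.0 * r - math.sqrt(-7.0 + 12.0 * r - 4.0 * r * r)) / 8.0
    return 0.0

def stencil(x, n):
    """Return the 4 periodic grid indices around x and their weights."""
    base = math.floor(x) - 1
    idx = [(base + k) % n for k in range(4)]
    wts = [phi(x - (base + k)) for k in range(4)]
    return idx, wts

def spread(force, x, grid):
    """Project a fiber-point force onto the periodic fluid lattice."""
    idx, wts = stencil(x, len(grid))
    for i, w in zip(idx, wts):
        grid[i] += force * w

def interpolate(x, grid):
    """Interpolate a lattice field (e.g. velocity) at a fiber point."""
    idx, wts = stencil(x, len(grid))
    return sum(grid[i] * w for i, w in zip(idx, wts))
```

The same weights serve both directions, and they sum to 1, so spreading conserves the total force and interpolating a constant field returns that constant.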

5
Equations of motion - I
A) At timestep (n+1), X^n and u^n are known
B) Next, the tensions in the fibers are found
C) Next, project the fiber forces onto the fluid lattice
6
Equations of motion - II
D) Solve the N-S equations for u^(n+1) and p^(n+1)
E) Update the fiber positions
7
Fiber Fluid Geometry
8
Fiber definitions - Initial State
9
Fiber definitions - Deformed State
10
Execution Flowchart
1) Initial Work - Data readin, fiber generation, source/sinks ...
2) Source fiber activation
3) Fiber force calculation
4) Pushup - Projection of fiber forces onto fluid lattice
5) Fluidup - Fluid lattice velocity and pressure calculation
6) Move - Movement of fibers based upon new velocities
11
T90 Performance Characteristics
Move - 27% CPU - 143 Mflops
Pushup - 22% CPU - 245 Mflops
Fluidup - 18% CPU - 801 Mflops (CFFT3D)
Fibgen - 17% CPU - 1 Mflops (Called once initially - I/O and one-time calculations)
Movemk - 9% CPU - 10 Mflops
(Values for 10 timestep run on T90)
Fluidup and Move also had poor cache characteristics on MPP machines
12
T90 Optimization Results
Subroutine    Original      Optimized     Comparison
Move          136 Mflops    245 Mflops    1.8
Pushup        226 Mflops    321 Mflops    1.4
Movemk        10 Mflops     85 Mflops     8.5
Total Time    1550 sec      1155 sec      25% speedup
Total Speed   286 Mflops    400 Mflops    140% of original
(Values for 100 timestep run on T90)
13
T90 Optimization Details - I
1. Vectorizing the copy of the velocity data into linear arrays
The original copy from the 3-D variables (U,V,W) to the 1-D variables (ULIN, VLIN, WLIN) was not vectorized and took a long time. A new variable, ID, was created to store, for each element of the 1-D variables, the indices of the corresponding element of the 3-D variables.
(Subroutines Move and Movemk)
14
T90 Optimization Details - II
Vectorized version (index table ID filled first, then one linear copy loop):
      DO 4 K=-1,NG+1
        K3D = MODNG(K)
        DO 4 J=-1,2
          J3D = MODNG(JJ+J-1)+1
          I0 = K*16+(J+1)*4+2
          DO 4 I=-1,2
            I3D = MODNG(II+I-1)+1
            ID(I0+I,1) = I3D
            ID(I0+I,2) = J3D
            ID(I0+I,3) = K3D
    4 CONTINUE
      do i=-15,16*NGP2
        i3d = id(i,1)
        j3d = id(i,2)
        k3d = id(i,3)
        ULIN(I) = U(I3D,J3D,K3D)
        VLIN(I) = V(I3D,J3D,K3D)
        WLIN(I) = W(I3D,J3D,K3D)
      enddo
Original version (copy done inside the nested loops):
      DO 4 K=-1,NG+1
        K3D = MODNG(K)
        DO 4 J=-1,2
          J3D = MODNG(JJ+J-1)+1
          I0 = K*16+(J+1)*4+2
          DO 4 I=-1,2
            I3D = MODNG(II+I-1)+1
            ULIN(I0+I) = U(I3D,J3D,K3D)
            VLIN(I0+I) = V(I3D,J3D,K3D)
            WLIN(I0+I) = W(I3D,J3D,K3D)
    4 CONTINUE
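The idea behind the ID array can be shown on a small case. This is only a sketch of the technique, assuming a simplified index mapping: the helper names, the 4x4 footprint in I and J, and the plain modulo wrap are illustrative and are not taken from the T90 code.

```python
def wrap(n, ng):
    """Periodic wrap of a lattice index (stand-in for MODNG)."""
    return n % ng

def gather_nested(u, ng, i_lo, j_lo):
    """Original style: index arithmetic is redone inside the triple loop."""
    out = []
    for k in range(ng):
        for j in range(4):
            j3d = wrap(j_lo + j, ng)
            for i in range(4):
                i3d = wrap(i_lo + i, ng)
                out.append(u[i3d][j3d][k])
    return out

def build_index_table(ng, i_lo, j_lo):
    """Precompute the 3-D indices for every linear slot, once."""
    table = []
    for k in range(ng):
        for j in range(4):
            for i in range(4):
                table.append((wrap(i_lo + i, ng), wrap(j_lo + j, ng), k))
    return table

def gather_flat(u, table):
    """Optimized style: one flat loop with no index arithmetic in it."""
    return [u[i][j][k] for (i, j, k) in table]
```

On a vector machine the payoff comes from the flat loop: its trip count is long and its body is a pure indexed load, which is the pattern the slides report as vectorizing well.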
15
T90 Optimization Details - III
2. A computation in the outer loop was not vectorized
The computation was moved to another part of the routines, where it vectorizes efficiently, and its results are stored in an array, mzeroa.
(Subroutines Move, Movemk, and Pushup)
16
T90 Optimization Details - IV
Original version (MZERO computed inside the outer loop):
      DO 122 NPT=1,NPOINTS
        MZERO = 16*(INT(XFN3OLD(NPT) - 1. + FLNG) - NG)
        DO 121 M=1,64
          UINT(M,NPT) = ULIN(MZERO+M)*DELTA(M,NPT)
          VINT(M,NPT) = VLIN(MZERO+M)*DELTA(M,NPT)
          WINT(M,NPT) = WLIN(MZERO+M)*DELTA(M,NPT)
  121   CONTINUE
  122 CONTINUE
Optimized version (MZERO read from the precomputed array mzeroa):
      DO 122 NPT=1,NPOINTS
        MZERO = mzeroa(npt)
        DO 121 M=1,64
          UINT(M,NPT) = ULIN(MZERO+M)*DELTA(M,NPT)
          VINT(M,NPT) = VLIN(MZERO+M)*DELTA(M,NPT)
          WINT(M,NPT) = WLIN(MZERO+M)*DELTA(M,NPT)
  121   CONTINUE
  122 CONTINUE
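A small sketch of the same restructuring, under stated assumptions: the offset formula follows the reconstructed MZERO expression above (the operators were lost in the transcript, so treat it as a best reading), the function names are hypothetical, and the stencil index runs from 0 rather than from 1 as in the Fortran.

```python
def base_offsets(z_old, flng, ng):
    """Compute the linear base offset of every fiber point once, up front."""
    # int() truncates toward zero, like Fortran INT
    return [16 * (int(z - 1.0 + flng) - ng) for z in z_old]

def interpolate(ulin, delta, mzeroa, nstencil=64):
    """Weight the nstencil lattice values that follow each point's offset."""
    uint = []
    for npt, mzero in enumerate(mzeroa):
        uint.append([ulin[mzero + m] * delta[npt][m] for m in range(nstencil)])
    return uint
```

Separating the two steps leaves the float-to-integer conversion in its own simple loop over points, and the main loop then contains only loads and multiplies.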
17
T90 Optimization Details - V
3. An inner loop was unrolled. (Subroutine Pushup)
4. Frequently used computations were moved to the beginning of the routines. (Subroutines Move, Movemk, and Pushup)
5. Computations that were redundant in the inner loops were moved into the outer loops. (Subroutine Pushup)
18
Cache-based MPP Optimization

Code was ported to the IBM SP.
Detailed single PE performance analysis was
conducted.
Hot spots were tuned.
Initial investigation of the parallel load
partitioning and the communication algorithm
design has been done.

19
IBM SP Porting Details - I

# compiler flags for ibm sp
CFLAGS = -g -pg -qarch=pwr2 -qrealsize=8 -qintsize=8 -qfixed -c
# linker flags for ibm sp
LFLAGS = -g -pg -qarch=pwr2 -qrealsize=8 -qintsize=8 -qfixed
# sp libs
LIBS = -lessl
Create -DSP section in the makefile.
Create ibg_cfft3d_interface_sp.F with fft interface.
fflush, timers

20
IBM SP Porting Details - II

Original Code Performance (including I/O)
CPU seconds                269.5600    CP executing      40888159996
Elapsed seconds            479.5405
FPU0 results/sec           5.11M       F.P. in Math0     1376740098
FPU1 results/sec           0.60M       F.P. in Math1     161472402
F.P. add ops/sec           1.34M       F.P. add          359893857
F.P. mul ops/sec           0.80M       F.P. mul          214718100
F.P. div ops/sec           0.03M       F.P. div          7120845
F.P. ma ops/sec            1.60M       F.P. ma           432094233
MFLOPS ratio               5.36M       F.P. math ops     1445921268
Fixed instr/sec E0         69.57M      Fixed instr E0    18752272602
Fixed instr/sec E1         42.32M      Fixed instr E1    11407203306
ICU instr/sec              0.00M       ICU instr.        0
Integer MIPS               111.88      Total instr.      30159475908
I Cache reloads/sec        0.00k
D Cache reloads/sec        99.18k
D Cache storebacks/sec     37.63k
D Cache misses/sec         83.00k
Total TLB misses/sec       0.00k
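The rate column can be cross-checked against the raw counts. The short calculation below uses only the numbers on this slide; the one interpretation it makes is that each multiply-add is counted as two floating-point operations, which is what makes the totals agree.

```python
cpu_seconds = 269.56

# Raw floating-point counts from the counter output
fp_add, fp_mul, fp_div, fp_ma = 359893857, 214718100, 7120845, 432094233

# Each multiply-add contributes two operations to the math-op total
fp_math_ops = fp_add + fp_mul + fp_div + 2 * fp_ma
mflops = fp_math_ops / cpu_seconds / 1e6

# Fixed-point instructions from the two execution units
fixed_e0, fixed_e1 = 18752272602, 11407203306
total_instr = fixed_e0 + fixed_e1
mips = total_instr / cpu_seconds / 1e6

print(f"F.P. math ops = {fp_math_ops}, MFLOPS = {mflops:.2f}")
print(f"Total instr.  = {total_instr}, MIPS   = {mips:.2f}")
```

Both derived totals match the reported ones exactly, and the rates round to the reported 5.36 MFLOPS and 111.88 MIPS.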

21
IBM SP Porting Details - III

Xprofiler output (flat profile)
 %time  cumulative     self             self     total
           seconds  seconds  calls   ms/call  ms/call  name
  44.7      111.29   111.29     12   9274.17  9359.17  .interactviadelta [3]
  11.4      139.70    28.41                            .__divi64 [4]
  10.3      165.30    25.60                            .IOGetByte [5]
   8.8      187.29    21.99                            .LDScan [6]
   7.7      206.55    19.26                            .__f64toi64rz [7]
   4.5      217.69    11.14                            .__f64toi64rz.GL [8]
   2.0      222.67     4.98                            .atof [10]
   1.4      226.06     3.39                            .__divi64.GL [11]
   1.2      229.07     3.01      4    752.50   752.50  .ibg_calccfdsumknowns [13]
   0.8      230.94     1.87      4    467.50  1576.50  .ibg_solvefluiddynamics [9]

22
IBM SP Porting Details - IV

Xprofiler output (function call summary)
 %total    calls  function
  49.98   502961  calls from .calc_ptfuturedomain [22] to .calcperbinxyz [23]
  49.98   502961  calls from .calcperbinxyz [23] to .calcbinxyz [45]
   0.01       89  calls from .mib_main_iterate [1] to .second [78]
   0.00       48  calls from .exchangefluiddata [29] to .exchfluidoneside [27]
   0.00       48  calls from .exchfluidoneside [27] to .handleownfluidghosts [28]
   0.00       24  calls from .handleownfluidghosts [28] to .copytoownfluidghosts [37]
   0.00       24  calls from .ibg_solvefluiddynamics [9] to .second [78]
   0.00       24  calls from .handleownfluidghosts [28] to .addfromownfluidghosts [38]
   0.00       20  calls from .splitlistkey [24] to .setemptylist [79]
   0.00       16  calls from .ibg_solvefluiddynamics [9] to .ibg_csfft3d [81]

23
IBM SP Porting Details - V

Xprofiler output (library call summary)
   total  %total    total  %total  %calls  %calls  %calls  load unit
 seconds    time    calls   calls  out of    into  within
  142.57   57.21  1006383  100.00    0.00    0.00  100.00  heart_sp_test
   54.50   21.87       NA      --    0.00      --      --  /usr/lib/libc.a shr.o
   52.13   20.92       NA      --    0.00      --      --  /lib/libxlf90.a io.o
    0.00    0.00       NA      --    0.00      --      --  /lib/libessl.a essl.o
    0.00    0.00       NA      --    0.00      --      --  /lib/libxlf90.a xlfsys.o

24
IBM SP Porting Details - VI

The largest share of the time, up to 60%, is spent in the subroutine interactviadelta.
The subroutine interactviadelta was isolated into a kernel with appropriate input data.

25
Original Kernel Performance
ORIGINAL KERNEL: 16.50M MFLOPS ratio
      do iPt=1,502961
        I0 = IJK(1,iPt)
        J0 = IJK(2,iPt)
        K0 = IJK(3,iPt)
        do 3 iDim = 1, 3
        do 3 K=1,4
        do 3 J=1,4
        do 3 I=1,4
          data_pt(iDim,iPt) = data_pt(iDim,iPt)
     .      + data_fluid(I0+I,J0+J,K0+K,iDim)*wt_delta(I,J,K)
c         data_fluid(I0+I,J0+J,K0+K,iDim) = data_fluid(I0+I,J0+J,K0+K,iDim)
c    .      + data_pt(iDim,iPt)*wt_delta(I,J,K)
    3   continue
      end do
26
Modified Kernel Performance
MODIFIED KERNEL: 37.16M MFLOPS ratio
(Outer loop unrolled to increase the number of independent operations in the loop.)
      do iPt=1,502961
        I0 = IJK(1,iPt)
        J0 = IJK(2,iPt)
        K0 = IJK(3,iPt)
        do 3 K=1,4
        do 3 J=1,4
        do 3 I=1,4
          data_fluid(1,I0+I,J0+J,K0+K) = data_fluid(1,I0+I,J0+J,K0+K)
     .      + data_pt(1,iPt)*wt_delta(I,J,K)
          data_fluid(2,I0+I,J0+J,K0+K) = data_fluid(2,I0+I,J0+J,K0+K)
     .      + data_pt(2,iPt)*wt_delta(I,J,K)
          data_fluid(3,I0+I,J0+J,K0+K) = data_fluid(3,I0+I,J0+J,K0+K)
     .      + data_pt(3,iPt)*wt_delta(I,J,K)
    3   continue
      end do
27
Cache-blocking of Fiber Data
MODIFIED KERNEL: 123.87M MFLOPS ratio
(Outer loop unrolled with cache-blocking of fiber point data.)
      do 3 kbl=0,9
      do 3 jbl=0,9
      do 3 ibl=0,9
        II0 = ibl*block_side
        JJ0 = jbl*block_side
        KK0 = kbl*block_side
        iblock = 1+iblock
        do 3 iPt=1,502
          I0 = II0+IJK(1,iPt,iblock)
          J0 = JJ0+IJK(2,iPt,iblock)
          K0 = KK0+IJK(3,iPt,iblock)
          do 3 K=0,3
          do 3 J=0,3
          do 3 I=0,3
            data_fluid(1,I0+I,J0+J,K0+K) = data_fluid(1,I0+I,J0+J,K0+K)
     .        + data_pt(1,iPt,iblock)*wt_delta(I+1,J+1,K+1)
            data_fluid(2,I0+I,J0+J,K0+K) = data_fluid(2,I0+I,J0+J,K0+K)
     .        + data_pt(2,iPt,iblock)*wt_delta(I+1,J+1,K+1)
            data_fluid(3,I0+I,J0+J,K0+K) = data_fluid(3,I0+I,J0+J,K0+K)
     .        + data_pt(3,iPt,iblock)*wt_delta(I+1,J+1,K+1)
    3 continue
28
Cache-blocking of Fiber and Fluid Data
MODIFIED KERNEL: 262.46M MFLOPS ratio
(Outer loop unrolled with cache-blocking of both fiber and fluid point data.)
      do iblock=1,no_blocks
        do iPt=1,502
          I0 = IJK(1,iPt,iblock)
          J0 = IJK(2,iPt,iblock)
          K0 = IJK(3,iPt,iblock)
          do 3 K=0,3
          do 3 J=0,3
          do 3 I=0,3
            data_fluid(1,I0+I,J0+J,K0+K,iblock) =
     .        data_fluid(1,I0+I,J0+J,K0+K,iblock)
     .        + data_pt(1,iPt,iblock)*wt_delta(I+1,J+1,K+1)
            data_fluid(2,I0+I,J0+J,K0+K,iblock) =
     .        data_fluid(2,I0+I,J0+J,K0+K,iblock)
     .        + data_pt(2,iPt,iblock)*wt_delta(I+1,J+1,K+1)
            data_fluid(3,I0+I,J0+J,K0+K,iblock) =
     .        data_fluid(3,I0+I,J0+J,K0+K,iblock)
     .        + data_pt(3,iPt,iblock)*wt_delta(I+1,J+1,K+1)
    3     continue
        end do
      end do
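The reordering behind cache blocking can be sketched independently of the Fortran. The code below is an assumption-laden illustration rather than the kernel itself: it uses a single scalar per point, a uniform hypothetical weight of 1/64, variable-size blocks built with a dictionary, and hypothetical function names. It shows only that grouping fiber points by lattice block leaves the result unchanged; the speedup on the SP comes from the blocked order keeping each block's lattice data in cache, which Python does not demonstrate.

```python
WT = 1.0 / 64.0  # hypothetical uniform weight for every stencil cell

def spread(points, grid, n):
    """Add each point's value onto its 4x4x4 patch of a flat n*n*n grid."""
    for i0, j0, k0, val in points:
        for k in range(4):
            for j in range(4):
                for i in range(4):
                    grid[((k0 + k) * n + (j0 + j)) * n + (i0 + i)] += val * WT

def bin_points(points, block_side):
    """Group points by the lattice block that contains their stencil origin."""
    blocks = {}
    for p in points:
        key = (p[2] // block_side, p[1] // block_side, p[0] // block_side)
        blocks.setdefault(key, []).append(p)
    return blocks

def spread_blocked(points, grid, n, block_side):
    """Same update as spread(), visiting the points one block at a time."""
    blocks = bin_points(points, block_side)
    for key in sorted(blocks):
        spread(blocks[key], grid, n)

# Small deterministic test case: 50 points on a 16^3 lattice
n = 16
points = [((7 * m) % 12, (5 * m) % 12, (3 * m) % 12, float(m)) for m in range(50)]
plain = [0.0] * n**3
blocked = [0.0] * n**3
spread(points, plain, n)
spread_blocked(points, blocked, n, block_side=4)
```

The final kernel on the slide goes one step further and gives each block its own copy of the lattice data (the extra iblock index on data_fluid), so that both the point data and the lattice patch for a block are contiguous.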
29
Summary of Single PE Optimization

Core of the parallel code was extracted and
optimized
Original 16.50 MFLOPS ratio
Loop unrolling 37.16 MFLOPS ratio
Cache blocking 123.87 MFLOPS ratio
Cache reuse 262.46 MFLOPS ratio
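For reference, the cumulative speedup of each stage over the original kernel follows directly from the four rates above; the snippet only divides the reported numbers.

```python
rates = {
    "Original": 16.50,
    "Loop unrolling": 37.16,
    "Cache blocking": 123.87,
    "Cache reuse": 262.46,
}

# Speedup of each stage relative to the original kernel
speedups = {name: rate / rates["Original"] for name, rate in rates.items()}

for name, s in speedups.items():
    print(f"{name:15s} {s:6.2f}x")
```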

30
Parallel Load Partitioning and Communication Strategies

Fiber points that fall into a fluid sub-domain are updated by the processor owning that sub-domain.
- Simple to implement
- May cause severe load imbalance
Uniform distribution of fiber points among all processors.
- Good load balancing and scalable
- Complex to implement and may increase the communication overhead
Uniform distribution of fibers among all processors.
- This falls between the above two.
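The trade-off between the three strategies can be made concrete by counting points per processor. This is a toy one-dimensional sketch with made-up data: the point positions, fiber lengths, processor count, and function names are all hypothetical, and communication cost is not modeled.

```python
def load_by_subdomain(zs, n_procs, length):
    """Strategy 1: each point goes to the processor owning its slab."""
    counts = [0] * n_procs
    for z in zs:
        counts[min(int(z / length * n_procs), n_procs - 1)] += 1
    return counts

def load_uniform_points(n_points, n_procs):
    """Strategy 2: points are dealt out evenly, regardless of position."""
    base, extra = divmod(n_points, n_procs)
    return [base + (1 if p < extra else 0) for p in range(n_procs)]

def load_by_fiber(fiber_lengths, n_procs):
    """Strategy 3: whole fibers are dealt out round-robin."""
    counts = [0] * n_procs
    for f, npts in enumerate(fiber_lengths):
        counts[f % n_procs] += npts
    return counts

# 100 points: 90 clustered in the first quarter of the domain, 10 elsewhere
zs = [0.1 + 0.001 * m for m in range(90)]
zs += [0.3, 0.35, 0.4, 0.55, 0.6, 0.65, 0.7, 0.8, 0.85, 0.9]

by_domain = load_by_subdomain(zs, n_procs=4, length=1.0)
by_point = load_uniform_points(len(zs), n_procs=4)
by_fiber = load_by_fiber([40, 30, 20, 10], n_procs=4)
```

With this clustered data the sub-domain strategy puts 90 of 100 points on one processor, the per-point strategy gives 25 each, and the per-fiber strategy lands in between, limited by the longest fiber.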

31
Original SGI-Based Visualization
32
First Attempt
33
Fiber VRML Extrusion Node Fibers
34
Bounding Boxes
35
Alpha-Shapes
36
Resultant Surfaces
37
Cutting Planes
38
Multiple Parts - I
39
Multiple Parts - II
40
New Geometry Representation

Previous use of alpha-shapes resulted in less than optimal surfaces and/or excessive hand manipulation of the data
Process unsatisfactory for visualization and animation of time-dependent data.
Alpha-shape process applied to the initial (clean) configuration, with the defining surfaces identified for each part (layer).
Surface definition (triangle connectivity) used for each subsequent time-step