NPACI Strategic Application Collaboration - PowerPoint PPT Presentation

1 / 43
About This Presentation
Title:

NPACI Strategic Application Collaboration

Description:

(Values for 10 timestep run on T90) ... do 3 K=1,4. do 3 J=1,4. do 3 I=1,4 ... do 3 K=0,3. do 3 J=0,3. do 3 I=0,3 ... – PowerPoint PPT presentation

Number of Views:46
Avg rating:3.0/5.0
Slides: 44
Provided by: npa5
Category:

less

Transcript and Presenter's Notes

Title: NPACI Strategic Application Collaboration


1
NPACI Strategic Application Collaboration
  • Implementation, Tuning, Visualization Aspects
    Related to the Immersed Boundary Method

2
Talk Outline
  • Collaborators
  • Brief Description of Method
  • Areas of focus of effort
  • - Single PE Vector/Parallel
  • - Single PE MPP
  • - Multiple PE MPP
  • - Visualization

3
Collaborators
  • NYU
  • Charles Peskin, David McQueen, Nat Cohen
  • SDSC
  • Richard Charles, Dongju Choi, Giri Chukkapalli

4
Basics of the Method
  • General Fluid-Flexible-structure Interactions
  • Simple Fiber tension calculations
  • Fluid modeled via FFTs and periodic BC.s
  • Delta function projection of fiber force onto
    fluid
  • Velocity field interpolation to update fiber
    points

5
Equations of motion - I
A) At timestep (n1), Xn, and un are known B)
Next the tensions in the fibers are found
from C) Next project the fiber forces onto
the fluid lattice
6
Equations of motion - II
D) Solve the N-S equations for un1, and
pn1 E) Update the fiber positions
7
Fiber Fluid Geometry
8
Fiber definitions - Initial State
9
Fiber definitions - Deformed State
10
Execution Flowchart
1) Initial Work - Data readin, fiber generation,
source/sinks ... 2) Source fiber
activation 3) Fiber force calculation 4)
Pushup - Projection of fiber forces onto fluid
lattice 5) Fluidup - Fluid lattice velocity
pressure calculation 6) Move - Movement of
fibers based upon new velocities
11
T90 Performance Characteristics
Move - 27 CPU - 143 Mflops Pushup - 22
CPU - 245 Mflops Fluidup - 18 CPU - 801
Mflops (CFFT3D) Fibgen - 17 CPU - 1
Mflops (Called once initially - I/O one-time
calculations) Movemk - 9 CPU - 10
Mflops (Values for 10 timestep run on
T90) Fluidup Move - Also had poor cache
characteristics on MPP machines
12
T90 Optimization Results
Subroutine Original Optimized Comparison Move 1
36 Mflops 245 Mflops 1.8 Pushup 226 Mflops 321
Mflops 1.4 Movemk 10 Mflops 85
Mflops 8.5 Total Time 1550 sec 1155 sec 25
Speedup Total Speed 286 Mflops 400 Mflops
140 Increase (Values for 100 timestep run on
T90)
13
T90 Optimization Details - I
1. Vectorizing the copy of the velocity data into
a linear arrays The original part was not
vectorized and taking much time to copy from 3-D
variables (U,V,W) to 1-D variables (ULIN, VLIN,
WLIN). A new variable, ID, is created. This
helps to store the address of the 1-D variables
corresponding to the 3-D variables.
(Subroutines Move and Movemk)
14
T90 Optimization Details - II
DO 4 K-1,NG1 K3D MODNG(K) DO 4 J-1,2
J3D MODNG(JJJ-1)1 I0 K16(J1)4 2
DO 4 I-1,2 I3D MODNG(III-1)1
ID(I0I,1) I3D ID(I0I,2) J3D
ID(I0I,3) K3D 4 CONTINUE do i-15,16NGP2
i3did(i,1) j3did(i,2) k3did(i,3) ULIN(I)
U(I3D,J3D,K3D) VLIN(I) V(I3D,J3D,K3D)
WLIN(I) W(I3D,J3D,K3D) enddo
DO 4 K-1,NG1 K3D MODNG(K) DO 4 J-1,2
J3D MODNG(JJJ-1)1 I0 K16(J1)4 2
DO 4 I-1,2 I3D MODNG(III-1)1
ULIN(I0I) U(I3D,J3D,K3D) VLIN(I0I)
V(I3D,J3D,K3D) WLIN(I0I)
W(I3D,J3D,K3D) 4 CONTINUE
15
T90 Optimization Details - III
2. A compute line was in the outer loop and was
not vectorized The line was put in
the other part of the routines to be
efficiently vectorized and stored in an array,
mzeroa. (Subroutines Move and Movemk and
Pushup)
16
T90 Optimization Details - IV
DO 122 NPT1,NPOINTS MZERO 16(INT(XFN3OLD(NPT
) - 1. FLNG) - NG) DO 121 M1,64
UINT(M,NPT) ULIN(MZEROM) DELTA(M,NPT)
VINT(M,NPT) ULIN(MZEROM) DELTA(M,NPT)
WINT(M,NPT) ULIN(MZEROM) DELTA(M,NPT) 121
CONTINUE 122 CONTINUE
DO 122 NPT1,NPOINTS MZERO mzeroa(npt) DO
121 M1,64 UINT(M,NPT) ULIN(MZEROM)
DELTA(M,NPT) VINT(M,NPT) ULIN(MZEROM)
DELTA(M,NPT) WINT(M,NPT) ULIN(MZEROM)
DELTA(M,NPT) 121 CONTINUE 122 CONTINUE
17
T90 Optimization Details - V
3. An inner loop was unrolled. (Subroutine
Pushup) 4. Frequently used compute parts were
set in the beginning of the
routines. (Subroutines Move and Movemk and
Pushup) 5. Inner-loop-redundant compute parts
were separated into outer loops. (Subroutine
Pushup)
18
Cache-based MPP Optimization
  • Code was ported to the IBM SP.
  • Detailed single PE performance analysis was
    conducted.
  • Hot spots were tuned.
  • Initial investigation of the parallel load
    partitioning and the communication algorithm
    design has been done.

19
IBM SP Porting Details - I
  • compiler flags for ibm spCFLAGS -g -pg
    -qarchpwr2 -qrealsize8 -qintsize8 -qfixed -c
    linker flags for ibm spLFLAGS -g -pg
    -qarchpwr2 -qrealsize8 -qintsize8 -qfixed sp
    libsLIBS -lessl
  • Create -DSP section in the makefile.
  • Create ibg_cfft3d_interface_sp.F with fft
    interface.
  • fflush, timers

20
IBM SP Porting Details - II
  • Original Code Performance (including I/O)
  • CPU seconds 269.5600 CP executing
    40888159996Elapsed seconds 479.5405FPU0
    results/sec 5.11M F.P. in Math0
    1376740098FPU1 results/sec 0.60M F.P. in
    Math1 161472402F.P. add ops/sec 1.34M F.P.
    add 359893857F.P. mul ops/sec 0.80M F.P.
    mul 214718100F.P. div ops/sec 0.03M F.P.
    div 7120845F.P. ma ops/sec 1.60M F.P. ma
    432094233MFLOPS ratio 5.36M F.P. math ops
    1445921268Fixed instr/sec E0 69.57M Fixed
    instr E0 18752272602Fixed instr/sec E1
    42.32M Fixed instr E1 11407203306ICU
    instr/sec 0.00M ICU instr. 0Integer MIPS
    111.88 Total instr. 30159475908I Cache
    reloads/sec 0.00kD Cache reloads/sec
    99.18kD Cache storebacks/sec 37.63kD Cache
    misses/sec 83.00kTotal TLB misses/sec 0.00k

21
IBM SP Porting Details - III
  • Xprofiler output (flat profile)
  • cumulative self self total
  • time seconds seconds calls ms/call
    ms/call name
  • 44.7 111.29 111.29 12 9274.17
    9359.17 .interactviadelta 3
  • 11.4 139.70 28.41
    .__divi64 4
  • 10.3 165.30 25.60
    .IOGetByte 5
  • 8.8 187.29 21.99
    .LDScan 6
  • 7.7 206.55 19.26
    .__f64toi64rz 7
  • 4.5 217.69 11.14
    .__f64toi64rz.GL 8
  • 2.0 222.67 4.98
    .atof 10
  • 1.4 226.06 3.39
    .__divi64.GL 11
  • 1.2 229.07 3.01 4 752.50
    752.50 .ibg_calccfdsumknowns 13
  • 0.8 230.94 1.87 4 467.50
    1576.50 .ibg_solvefluiddynamics 9

22
IBM SP Porting Details - IV
  • Xprofiler ouput (function call summary)
  • total calls function
  • 49.98 502961 calls from
    .calc_ptfuturedomain 22 to .calcperbinxyz 23
  • 49.98 502961 calls from .calcperbinxyz
    23 to .calcbinxyz 45
  • 0.01 89 calls from
    .mib_main_iterate 1 to .second 78
  • 0.00 48 calls from
    .exchangefluiddata 29 to .exchfluidoneside 27
  • 0.00 48 calls from
    .exchfluidoneside 27 to .handleownfluidghosts
    28
  • 0.00 24 calls from
    .handleownfluidghosts 28 to .copytoownfluidghost
    s 37
  • 0.00 24 calls from
    .ibg_solvefluiddynamics 9 to .second 78
  • 0.00 24 calls from
    .handleownfluidghosts 28 to .addfromownfluidghos
    ts 38
  • 0.00 20 calls from .splitlistkey
    24 to .setemptylist 79
  • 0.00 16 calls from
    .ibg_solvefluiddynamics 9 to .ibg_csfft3d 81

23
IBM SP Porting Details - V
  • Xprofiler ouput (library call summary)
  • total total total total calls
    calls calls load
  • seconds time calls calls out
    of into within unit
  • 142.57 57.21 1006383 100.00
    0.00 0.00 100.00 heart_sp_test
  • 54.50 21.87 NA --
    0.00 -- -- /usr/lib/libc.a
    shr.o
  • 52.13 20.92 NA -- 0.00
    -- -- /lib/libxlf90.a io.o
  • 0.00 0.00 NA --
    0.00 -- -- /lib/libessl.a
    essl.o
  • 0.00 0.00 NA --
    0.00 -- -- /lib/libxlf90.a
    xlfsys.o

24
IBM SP Porting Details - VI
  • Largest time spent in the subroutine
    interactviadelta up to 60 .
  • The subroutine interactviadelta was isolated
    into a kernel with appropriate input data.

25
Original Kernel Performance
ORIGINAL KERNEL 16.50M MFLOPS ratio do
iPt1,502961 I0 IJK(1,iPt) J0
IJK(2,iPt) K0 IJK(3,iPt) do 3 iDim
1, 3 do 3 K1,4 do
3 J1,4 do 3 I1,4
data_pt(iDim,iPt)
data_pt(iDim,iPt) data_fluid(I0I,J0J,K0K,
iDim)wt_delta(I,J,K) c
data_fluid(I0I,J0J,K0K, iDim)
data_fluid(I0I,J0J,K0K, iDim) c .

data_pt(iDim,iPt)
wt_delta(I,J,K) 3 continue end do
26
Modified Kernel Performance
MODIFIED KERNEL 37.16M MFLOPS ratio (Outer
loop unrolled to increase the number of
independent operations in the loop.) do
iPt1,502961 I0 IJK(1,iPt) J0
IJK(2,iPt) K0 IJK(3,iPt) do 3
K1,4 do 3 J1,4 do 3 I1,4
data_fluid(1,I0I,J0J,K0K)
data_fluid(1,I0I,J0J,K0K) data_pt(1,iPt)
wt_delta(I,J,K) data_fluid(2,I0I,J0
J,K0K) data_fluid(2,I0I,J0J,K0K)
data_pt(2,iPt) wt_delta(I,J,K)
data_fluid(3,I0I,J0J,K0K) data_fluid(3,I0I,J0
J,K0K) data_pt(3,iPt) wt_delta(I,J,K) 3
continue end do
27
Cache-blocking of Fiber Data
MODIFIED KERNEL 123.87M MFLOPS ratio (Outer
loop unrolled with Cache-blocking of fiber point
data.) do 3 kbl0,9 do 3 jbl0,9 do 3
ibl0,9 II0 iblblock_side JJ0
jblblock_side KK0 kblblock_side
iblock 1iblock do 3 iPt1,502 I0
II0IJK(1,iPt,iblock) J0 JJ0IJK(2,iPt,iblock
) K0 KK0IJK(3,iPt,iblock) do
3 K0,3 do 3 J0,3
do 3 I0,3
data_fluid(1,I0I,J0J,K0K) data_fluid(1,I0I,J
0J,K0K) data_pt(1,iPt,iblock)wt_delta(I1,J1
,K1) data_fluid(2,I0I,J0J,K0
K) data_fluid(2,I0I,J0J,K0K)
data_pt(2,iPt,iblock)wt_delta(I1,J1,K1)
data_fluid(3,I0I,J0J,K0K)
data_fluid(3,I0I,J0J,K0K) data_pt(3,iPt,ibloc
k)wt_delta(I1,J1,K1) 3 continue
28
Cache-blocking of Fiber Fluid Data
MODIFIED KERNEL 262.46M MFLOPS ratio (Outer
loop unrolled with Cache-blocking of both fiber
and fluid point data.) do iblock1,no_blocks
do iPt1,502 I0 IJK(1,iPt,iblock) J0
IJK(2,iPt,iblock) K0 IJK(3,iPt,iblock)
do 3 K0,3 do 3 J0,3 do
3 I0,3 data_fluid(1,I0I,J0J,K0K,i
block) data_fluid(1,I0I,J0J,K0K,iblock)
data_pt(1,iPt,iblock) wt_delta(I1,J1,K1)
data_fluid(2,I0I,J0J,K0K,iblock)
data_fluid(2,I0I,J0J,K0K,iblock)
data_pt(2,iPt,iblock) wt_delta(I1,J1,K1)
data_fluid(3,I0I,J0J,K0K,iblock)
data_fluid(3,I0I,J0J,K0K,iblock)
data_pt(3,iPt,iblock) wt_delta(I1,J1,K1) 3
continue end do end do
29
Summary of Single PE Optimization
  • Core of the parallel code was extracted and
    optimized
  • Original 16.50 MFLOPS ratio
  • Loop unrolling 37.16 MFLOPS ratio
  • Cache blocking 123.87 MFLOPS ratio
  • Cache reuse 262.46 MFLOPS ratio

30
Parallel Load Partitioning Communication
Strategies
  • Fiber points fall into the corresponding fluid
    sub-domain will be updated by the processors
    owning the sub-domain
  • Simple to implement
  • May cause sever loading balancing
  • Uniform distribution of fiber points among all
    processors.
  • Good load balancing and scalable
  • Complex to implement and may increase the
    communication overhead
  • Uniform distribution of fibers among all
    processors.
  • This falls between the above two.

31
Original SGI-Based Visualization
32
First Attempt
33
Fiber VRML Extrusion Node Fibers
34
Bounding Boxes
35
Alpha-Shapes
36
Resultant Surfaces
37
Cutting Planes
38
Multiple Parts - I
39
Multiple Parts - II
40
New Geometry Representation
  • Previous use of alpha-shapes resulted in less
    than optimal surfaces and/or excessive hand
    manipulation of the data
  • Process unsatisfactory for visualization
    animation of time-dependent data.
  • Alpha-shape process applied to initial (clean)
    configuration and defining surfaces identified
    for each part (layer).
  • Surface definition (triangle connectivity) used
    for each subsequent time-step

41
New Multiple Parts
42
Animation
43
Summary
  • General basis of Immersed Boundary Method
  • Discussed Implementation aspects.
  • Discussed Tuning aspects on VP and MPP machines
  • Showed progression of visualization tool
    development
Write a Comment
User Comments (0)
About PowerShow.com