Title: Optimization of the Parallel Performance
1Optimization of the Parallel Performance of a
Global Gyrokinetic Particle Code on NERSCs CRAY
T3E and IBM SP
J. N. Leboeuf and V. K. Decyk, UCLA R. Sydora, U.
Alberta, Canada
NERSC User Training Lectures at June NUG Meeting
June 6-7, 2000
ORNL
2 Motivation
- Very large scale calculations which are
extremely demanding computionally are required
when we attempt to model experimental
situations
- Optimization of performance is necessary to
improve efficiency and make parameter scans
feasible
Outline - Brief model description
- Optimization strategy and implementation
- Results - Summary
3UCLA-Canada (UCAN) Global Gyrokinetic Particle
Code
Designed to study ion microinstabilities and
ensuing transport in toroidal magnetic fusion
devices of the tokamak type
Physics Attributes
- Three dimensional - Toroidal - Nonlinear -
Cartesian geometry - Covers the whole plasma
cross-sectiongt Global - Adiabatic electrons
Refs - R. D. Sydora, V. K. Decyk, and J. M.
Dawson, Plasma Phys. Control. Fusion 38 (1996)
A281-A294 - J. N. Leboeuf et al. , Phys. Plasmas
7 (2000) 1795-1801
4UCLA-Canada (UCAN) Global Gyrokinetic Particle
Code
Numerical Attributes
- Uses low-noise nonlinear delta-f methods to
integrate ions orbits - Massively parallel
implementation using PLIB library of particle
and field management routines
- MPI for message passing
- One dimensional domain decomposition (toroidal
or z direction)
Ref - V. K. Decyk, Computer Physics
Communications 87 (1995) 87-94
5Global Gyrokinetic Particle Simulations of
Turbulence in UCLAs Electric Tokamak (ET)
Massively parallel computers enable modeling of
plasma conditions (plasma weather) inside a
large scale fusion device like ET
Nonlinear Steady State
Linear Stage
6Need for Optimization and Optimization Strategy
The CRAY T3E is notoriously cache starved with
an 8K level 1 cache compared to a 64K level 1
cache on the IBM SP gt minimizing cache usage
is a crucial issue on the T3E
Follow strategy outlined in Optimization of
particle-in-cell codes for RISC processors,
V. K. Decyk, S. R. Karmesin, A. de Boer, and
P. C. Liewer, Computers in Physics 10 (1996)
290-298
Main elements - Loop
reordering - Data restructuring
- Particle sorting
Should help even on the IBM SP
7OPTIMIZATION Loop Reordering in Particle
Push and Accumulation Routines
After
do 33 l 1, nblok c nleng
nleng0 ntot npp(l) - nps(l) 1
if( ntot .lt. nleng ) nleng ntot ndiv
float( ntot - 1 ) / float( nleng ) 1.
nlast ntot - ( ndiv - 1 ) nleng c do
10 ncon 1,ndiv nlengl nleng if(
ncon .eq. ndiv ) nlengl nlast jfst
nps(l) ( ncon - 1 ) nleng - 1 c c c
do 20 k 1,4 c k2 k k c
k1 k2 - 1 c do 20 jj
1,nlengl c i jfst
jj c btot cb1rmajor / (rmajor
(part(1,i,l) - xxo) ) rhoi
sqrt( 2. part(11,i,l) btot ) wcirv c
do 30 k1,4 k2kk k1k2-1 c
Before
do 33 l 1, nblok c nleng
nleng0 ntot npp(l) - nps(l) 1
if( ntot .lt. nleng ) nleng ntot ndiv
float( ntot - 1 ) / float( nleng ) 1.
nlast ntot - ( ndiv - 1 ) nleng c do
10 ncon 1,ndiv nlengl nleng if(
ncon .eq. ndiv ) nlengl nlast jfst
nps(l) ( ncon - 1 ) nleng - 1 c c
do 20 k 1,4 k2 k k
k1 k2 - 1 c do 30 jj
1,nlengl c i jfst
jj c btot cb1rmajor / (rmajor
(part(1,i,l) - xxo) ) rhoi
sqrt( 2. part(11,i,l) btot ) wcirv
8OPTIMIZATION Loop Reordering in FFT Routines
Before
After
c first transform in z nrz nxyz/nz
do 580 l 1, indz ns 2(l - 1)
ns2 ns ns km nzh/ns kmr
kmnrz do 570 m 1, jblok do 560 k
1, km k1 ns2(k - 1) k2 k1
ns do 550 j 1, ns j1 j k1
j2 j k2 s conjg(sct(1kmr(j-1)))
do 540 i 1, kxp do 530 n 1, ny
t sg(j2,n,i,m) g(j2,n,i,m)
g(j1,n,i,m) - t g(j1,n,i,m) g(j1,n,i,m)
t 530 continue 540 continue 550 continue 560
continue 570 continue 580 continue
c first transform in z nrz nxyz/nz
do 580 l 1, indz ns 2(l - 1)
ns2 ns ns km nzh/ns kmr
kmnrz do 570 m 1, jblok do 560 i
1, kxp do 550 n 1, ny do 540 k
1, km k1 ns2(k - 1) k2 k1
ns do 530 j 1, ns j1 j k1
j2 j k2 s conjg(sct(1kmr(j-1)))
t sg(j2,n,i,m) g(j2,n,i,m)
g(j1,n,i,m) - t g(j1,n,i,m) g(j1,n,i,m)
t 530 continue 540 continue 550 continue
560 continue 570 continue 580 continue
9OPTIMIZATION Data Restructuring
Separate electric field components elx, ely,
elz regrouped into single array
el(1,...)elx(...), el(2,...)ely(...)
el(3,...)elz(...) so that they are now
contiguous in memory
After
Before
call pfft3r(g1x,g1xt,bs,br,1,ntpose,mixup,sc
t, cdddd1 incx1, incy1, incz, kstrt,
1 incx , incy , incz, kstrt,
2 ncxv, ncyppv, nczv, 3
nncxppp, nnczp, nnczp1, nblok, nblok,
ncypp, 4 ncypp, ncy
) c do 810 l 1,nblok do
810 k 1,nczp do 810 j 1,nncy1
do 810 i 1,nncx1
elx(i,j,k,l) g1x(i,j,k,l) 810 continue c
call pfft3r(g1x,g1xt,bs,br,1,ntpose,mixup,sc
t, cdddd1 incx1, incy1, incz, kstrt,
1 incx , incy , incz, kstrt,
2 ncxv, ncyppv, nczv, 3
nncxppp, nnczp, nnczp1, nblok, nblok,
ncypp, 4 ncypp, ncy
) c do 810 l 1,nblok do
810 k 1,nczp do 810 j 1,nncy1
do 810 i 1,nncx1
elx(i,j,k,l) g1x(i,j,k,l)
el(1,i,j,k,l)g1x(i,j,k,l) 810 continue c
10OPTIMIZATION Memory Access Control
Particle pusher modified to take advantage of
data restructuring, so as to minimize memory
excursions and improve data locality
After
c dx 1 el(1,lx,ly ,lz,l)alf1el(1,lxr
,ly ,lz,l)alf2 1 el(1,lx,lyr,lz,l)alf3el(1
,lxr,lyr,lz,l)alf4 dy 1 el(2,lx,ly
,lz,l)alf1el(2,lxr,ly ,lz,l)alf2 1
el(2,lx,lyr,lz,l)alf3el(2,lxr,lyr,lz,l)alf4
dz 1 el(3,lx,ly ,lz,l)alf1el(3,lxr,ly
,lz,l)alf2 1 el(3,lx,lyr,lz,l)alf3el(3,lxr
,lyr,lz,l)alf4 c c..... e.s. field c
extt(k,jj,l) dx 2 el(1,lx ,ly
,lzr,l)alf5 el(1,lxr,ly ,lzr,l)alf6 3
el(1,lx ,lyr,lzr,l)alf7
el(1,lxr,lyr,lzr,l)alf8 c eytt(k,jj,l)
dy 2 el(2,lx ,ly ,lzr,l)alf5
el(2,lxr,ly ,lzr,l)alf6 3 el(2,lx
,lyr,lzr,l)alf7 el(2,lxr,lyr,lzr,l)alf8 c
eztt(k,jj,l) dz 2 el(3,lx ,ly
,lzr,l)alf5 el(3,lxr,ly ,lzr,l)alf6 3
el(3,lx ,lyr,lzr,l)alf7
el(3,lxr,lyr,lzr,l)alf8 c
Before
c extt(k,jj,l) 1 elx(lx ,ly
,lz ,l)alf1 elx(lxr,ly ,lz ,l)alf2 1
elx(lx ,lyr,lz ,l)alf3 elx(lxr,lyr,lz
,l)alf4 2 elx(lx ,ly ,lzr,l)alf5
elx(lxr,ly ,lzr,l)alf6 3 elx(lx
,lyr,lzr,l)alf7 elx(lxr,lyr,lzr,l)alf8 c
eytt(k,jj,l) 1 ely(lx ,ly ,lz
,l)alf1 ely(lxr,ly ,lz ,l)alf2 1
ely(lx ,lyr,lz ,l)alf3 ely(lxr,lyr,lz
,l)alf4 2 ely(lx ,ly ,lzr,l)alf5
ely(lxr,ly ,lzr,l)alf6 3 ely(lx
,lyr,lzr,l)alf7 ely(lxr,lyr,lzr,l)alf8 c
eztt(k,jj,l) 1 elz(lx ,ly ,lz
,l)alf1 elz(lxr,ly ,lz ,l)alf2 1
elz(lx ,lyr,lz ,l)alf3 elz(lxr,lyr,lz
,l)alf4 2 elz(lx ,ly ,lzr,l)alf5
elz(lxr,ly ,lzr,l)alf6 3 elz(lx
,lyr,lzr,l)alf7 elz(lxr,lyr,lzr,l)alf8 c
11OPTIMIZATION Particle Sorting
Sorting Every 5 Time Steps
No Sorting
cc cc initial sort cc call
psortp3yz(part,pt,ip,npic,npp,noff,idimp, 1
nnopmax,nblok,ny1,nyzpm1) c c.....................
.................................................
c...... main loop ................................
.....................
nsortt5 c........................................
.............................. call
timera(-1,'total ') c 100 continue c nt
nt 1 nb ( nt / nbb ) nbb
nsort(nt/nsortt)nsortt call pmove3(
part, 1 edges, npp, sbufr, sbufl,
rbufr, rbufl, ihole, 2 jsr, jsl,
jss, 3 ncz, kstrt, nvp, idimp,
nnopmax, nblok, idps, 4 nbmax,
ntmax, ierr ) c
if(nsort.eq.nt) then call
psortp3yz(part,pt,ip,npic,npp,noff,idimp, 1
nnopmax,nblok,ny1,nyzpm1) end if c
c c...............................................
....................... c...... main loop
..................................................
... c.............................................
......................... call
timera(-1,'total ') c 100 continue c nt
nt 1 nb ( nt / nbb ) nbb c
call pmove3( part, 1 edges,
npp, sbufr, sbufl, rbufr, rbufl, ihole, 2
jsr, jsl, jss, 3 ncz,
kstrt, nvp, idimp, nnopmax, nblok, idps, 4
nbmax, ntmax, ierr
) c
12Performance Improvements on NERSCs CRAY T3E
Resulting from Optimizations
Improvements by better than a factor of 2
achieved in terms of overall time per particle
per step and flops as measured by the pat
utility on NERSCs CRAY T3E
13Performance Improvements on NERSCs CRAY T3E and
IBM SP Resulting from Optimizations
Optimization for cache starved T3E still
results in 30 performance improvement on SP
14Performance with Increasing Problem Size and
Number of Processors on NERSCs IBM SP
fit
Linear scaling obtained when number of
particles increases with number of processors
(factor of 2 each at a time)
15Comparison on Performance on Appleseed Mcintosh
G4 cluster and IBM SP
Equivalent performance up to 8 processors
16Summary
Optimization strategy successfully implemented
Optimizations have improved performance on both
the T3E and SP
With up to 8 processors, performance is
comparable on UCLAs Appleseed cluster of
Macintosh G4s and on NERSCs IBM SP