Title: Preliminary Report of the PERI Architecture Tiger Team
1. Preliminary Report of the PERI Architecture Tiger Team
- Performance Engineering Research Institute
- Tiger Team Lead: Bronis R. de Supinski
- Sadaf Alam, David H. Bailey, Laura Carrington, Jacqueline Chame, Chris Daley, Anshu Dubey, Todd Gamblin, Dan Gunter, Paul Hovland, Heike Jagode, Karen Karavanic, Gabriel Marin, John Mellor-Crummey, Shirley Moore, Boyana Norris, Lenny Oliker, Philip C. Roth, Martin Schulz, Sameer Shende, Jeff Vetter, Pat Worley, Nick Wright
- January XX, 2009
- Questions to bronis@llnl.gov
2. PERI Architecture Tiger Team Overview
- Assist OASCR in formulating its ten-year plan
- What machine mix is needed to fulfill the OASCR mission?
  - Answer determined through application focus
- Consider a range of OASCR applications
- Evaluate suitability of current and future architectures
  - Focus on system evaluation
  - Still consider optimization opportunities, but they are not the focus
- Consists of almost all PERI researchers
- Three major components
  - Measure pioneering applications on today's systems
  - Predict their performance on future systems
  - Report analysis of results, consult DOE on implications
- This report focuses on initial measurement activities and preliminary modeling work associated with that activity
3. Headquarters Requested the Architecture Tiger Team to Focus on Early Science Applications
- Oak Ridge Early Science Applications
- Chimera
- DCA
- GTC
- Argonne Early Science Applications
- FLASH
- GTC
- Lattice QCD
- NAMD
- Initial three chosen by headquarters: FLASH, GTC, and S3D
- Tasks require a flexible, evolving strategy for the code base used
  - Predictions require freezing application code at some point
  - Anticipate revisiting models (and the related code base) over time
  - Measurement can be iterative, so some flexibility is possible
  - Use measurements to guide selection of the modeled code base
4. Widely Varied Initial Progress for First Three Applications: S3D Was Smoothest
- Relationship established with the FY07 Tiger Team facilitated the effort
- Clear application team structure simplified determining the initial code base and input problem sets
- Significant measurement results
  - Performance similar across Jaguar and Intrepid
  - Identified an inherent load imbalance issue
    - Minor performance limitation at current scales
    - Potential issue at larger scales
    - Focus of the initial Performance Assertions modeling effort
  - Initial data gathered for PMaC modeling
5. Widely Varied Initial Progress for First Three Applications: External Factors Limited FLASH
- Little existing relationship
  - Some familiarity between ANL PERI representatives and the University of Chicago (UC) application team
  - Clear application team structure helped the relationship develop
- Administrative issues complicated getting started
  - Licensing issues related to the code
  - Restricted distribution mechanism related to export control
  - Delay in the staffing plan (Chris Daley, UC) for the measurement effort
    - Required funds were slow going through the LBNL procurement process
- Initial measurement results
  - Scaling studies on (Jaguar? and) Intrepid
  - Currently gathering extensive TAU and other tool data
  - Beginning to gather data for PMaC modeling
6. Widely Varied Initial Progress for First Three Applications: GTC
- Application is undergoing significant redevelopment
  - Several possibilities for the code base for explorations
  - Existing versions expected to change significantly
    - Will be very different over the long term
    - Even short-term stability is not clear
    - I/O mechanism currently being completely redone
  - Probably requires at least two code bases for study
- Initial measurement results
  - Initial measurement studies on Jaguar
  - Identified a configuration error that implied scaling limitations
  - Currently gathering extensive TAU and other tool data
  - Will begin gathering modeling data soon
7. S3D: Direct Numerical Simulation (DNS) of Turbulent Combustion
- State-of-the-art code developed at CRF/Sandia
- 2007 INCITE award: 6M hours on XT3/4 at NCCS
- Tier 1 pioneering application for the 250 TF system
- Why DNS?
  - Study micro-physics of turbulent reacting flows
  - Full access to time-resolved fields
  - Physical insight into chemistry-turbulence interactions
  - Develop and validate reduced model descriptions used in macro-scale simulations of engineering-level systems
Text and figures courtesy of S3D PI, Jacqueline
H. Chen, SNL
8. S3D DNS Solver
- Solves the compressible reacting Navier-Stokes equations
- High-fidelity numerical methods
  - 8th-order finite differences
  - 4th-order explicit Runge-Kutta (RK) integrator
- Hierarchy of molecular transport models
- Detailed chemistry
- Multiphysics (sprays, radiation, soot)
- From SciDAC-TSTC (Terascale Simulation of Combustion)
Text and figures courtesy of S3D PI, Jacqueline
H. Chen, SNL
9. S3D Parallelization
- Fortran 90 + MPI
- 3D domain decomposition
  - Each MPI process manages part of the domain
  - All processes have the same number of grid points and the same computational load
- Inter-processor communication only between nearest neighbors in the 3D mesh
  - Large messages, non-blocking sends and receives (see the sketch below)
- All-to-all communication only required for monitoring and synchronization ahead of I/O
(Figure: S3D logical topology)
Text courtesy of S3D PI, Jacqueline H. Chen, SNL
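To make the nearest-neighbor exchange pattern above concrete, here is a minimal sketch (not S3D source code) of a non-blocking ghost-plane exchange along one direction of a periodic 3D Cartesian process grid. All names (xlo_nbr, ghost_lo, ...) and sizes are illustrative assumptions.

! Minimal sketch of the nearest-neighbor exchange described above: each rank
! posts non-blocking receives and sends for the ghost planes it shares with
! its x-direction neighbors, then waits for all four requests.
program halo_sketch
  use mpi
  implicit none
  integer, parameter :: my = 30, mz = 30, ng = 4      ! 4 ghost planes for an 8th-order stencil
  real(8) :: send_lo(ng,my,mz), send_hi(ng,my,mz)     ! boundary planes packed from the local block
  real(8) :: ghost_lo(ng,my,mz), ghost_hi(ng,my,mz)   ! ghost planes received from neighbors
  integer :: comm3d, xlo_nbr, xhi_nbr, req(4), dims(3), nprocs, ierr
  logical :: periods(3)

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  dims = 0
  call MPI_Dims_create(nprocs, 3, dims, ierr)          ! factor the ranks into a 3D grid
  periods = .true.                                     ! periodic boundaries, as in the runs described later
  call MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, .true., comm3d, ierr)
  call MPI_Cart_shift(comm3d, 0, 1, xlo_nbr, xhi_nbr, ierr)

  send_lo = 0.0d0; send_hi = 0.0d0                     ! real code would pack boundary data here

  ! Post receives first, then sends; computation could overlap before the wait
  call MPI_Irecv(ghost_lo, size(ghost_lo), MPI_REAL8, xlo_nbr, 1, comm3d, req(1), ierr)
  call MPI_Irecv(ghost_hi, size(ghost_hi), MPI_REAL8, xhi_nbr, 2, comm3d, req(2), ierr)
  call MPI_Isend(send_hi,  size(send_hi),  MPI_REAL8, xhi_nbr, 1, comm3d, req(3), ierr)
  call MPI_Isend(send_lo,  size(send_lo),  MPI_REAL8, xlo_nbr, 2, comm3d, req(4), ierr)
  call MPI_Waitall(4, req, MPI_STATUSES_IGNORE, ierr)

  call MPI_Finalize(ierr)
end program halo_sketch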
10. Total Execution Time of S3D on Intrepid
11. Relative Efficiency for S3D under Weak Scaling on Intrepid
12. Relative Efficiency on Intrepid by Event
13. Relative Speedup on Intrepid by Event
14. Event Correlation to Total Time on Intrepid
r = 1 implies direct correlation
15. Fraction of Time in MPI on Intrepid
16. Total Runtime Breakdown by Events
17. Mean Time by Function Breakdown Across All Nodes on Intrepid
Total runtime: 1 hour, 31 minutes, 25 seconds
18. S3D Wall Clock Times Measured on Jaguar with Optimized TAU Instrumentation on 64 Cores
- Exclusive times distributed across routines called within S3D's solve_driver
- ratt_i
- rhsf
- ratx_i
- transport_mcomputecoefficients
- transport_mcomputespeciesdiffflux
- MPI_Wait
- integrate
- thermchen_mcalc_temp
- transport_mcomputeheatflux
- derivative_xcalc
- derivative_ycalc
- derivative_zcalc
- derivative_xcomm
19. S3D Wall Clock Times Measured on Jaguar with Optimized TAU Instrumentation on 64 Cores
- Exclusive times for MPI_Wait (6) exhibit a potential load balance issue
20. Gathered IPC, Floating Point Data, and 8 Memory Measurements for S3D on Jaguar with TAU
- IPC (Instructions per Cycle) efficiency metric
- Proportion of floating point operations
- Hardware counter-based memory measurements
- L1 data cache misses
- L1 instruction cache misses
- L1 data TLB misses
- L1 instruction TLB misses
- L2 (unified but not shared between cores) cache misses
- L2 data TLB misses
- L2 instruction TLB misses
- Memory accesses on quad-core (L3) for different core cases
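The IPC and floating-point-proportion metrics listed above are presumably formed as the usual counter ratios (these are conventional definitions, not stated on the slide):

\[
\mathrm{IPC} = \frac{N_{\mathrm{retired\ instructions}}}{N_{\mathrm{cycles}}},
\qquad
\text{FP proportion} = \frac{N_{\mathrm{FP\ operations}}}{N_{\mathrm{retired\ instructions}}}
\]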
21. Event-Based Measurement of IPC, Floating Point Operation Rate, and L1 Data Cache Miss Rates
22. L3 Cache Behavior for Different Core Cases: 4 Cores/Node (VNM) versus 1 Core/Node (SMP)
- Runtime on Jaguar: VNM 813 s, SMP 613.4 s
- Runtime on Intrepid: VNM 1728.7 s, SMP 1731.7 s
23. Initial Measurements of S3D I/O Performance on Jaguar
- S3D uses Fortran I/O to read control and input files and to write restart dumps
- Each rank writes its own restart file
  - Writes staggered across the logical process topology to avoid contention at the file system metadata server
- Restart dumps dominate I/O cost, as each rank writes:
  - Four REAL(8) scalars (time, tstep, time_save, pout)
  - Four REAL(8) 3D/4D arrays (yspecies, temp, pressure, u)
  - Fortran I/O record markers
- Per-process write volume (worked through below)
  - Vproc = 8 × (4 + nx × ny × nz × (n_species + 5)) + 64 bytes
  - For 30×30×30 grid points per process:
    - 5.8 MB per process, per checkpoint
    - With 20K processes: 116.6 GB per checkpoint
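Working the volume formula through for the quoted case (the species count is not stated on the slide; n_species = 22 is an assumption chosen to be consistent with the quoted 5.8 MB figure):

\[
V_{\mathrm{proc}} = 8\,\bigl[\,4 + 30 \times 30 \times 30 \times (22 + 5)\,\bigr] + 64
\approx 8 \times 27{,}000 \times 27 \ \text{bytes} \approx 5.8\ \text{MB},
\]
\[
20{,}000 \ \text{processes} \times 5.8\ \text{MB} \approx 116.6\ \text{GB per checkpoint}.
\]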
24. Write Restart Performance Measured on Jaguar's Lustre File System
- 30×30×30 grid points per process
- 10 iterations, 1 checkpoint
25. Our Initial I/O Model for S3D Projects Restart File Performance over 12 Hours on Jaguar
- How many time steps can be done in a 12-hour allocation?
- How much of that 12 hours will be spent doing I/O?
- Use the average observed time step latency
- Bars show projections with the minimum, average, and maximum observed checkpoint latency (one way to express the projection is sketched below)
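One way to express the projection, assuming a checkpoint is written every k time steps (the measured run on the previous slide used 10 iterations per checkpoint), with average step latency \(\bar{t}_{\mathrm{step}}\) and checkpoint latency \(t_{\mathrm{ckpt}}\):

\[
N_{\mathrm{steps}} \approx \frac{T_{\mathrm{alloc}}}{\bar{t}_{\mathrm{step}} + t_{\mathrm{ckpt}}/k},
\qquad
f_{\mathrm{I/O}} \approx \frac{(N_{\mathrm{steps}}/k)\; t_{\mathrm{ckpt}}}{T_{\mathrm{alloc}}},
\]

with \(T_{\mathrm{alloc}} = 12\) hours and \(t_{\mathrm{ckpt}}\) set to its minimum, average, and maximum observed values for the three bars.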
26. Vampir Provides a Deeper Look at One Iteration of S3D on 512 Cores on Jaguar
- Rank 0 timeline
- Core of each iteration
  - Subcall tree A: calculation only
  - Subcall trees B and C: calculation + communication
27. Vampir Provides a Deeper Look at One Iteration of S3D on 512 Cores on Jaguar
- Rank 0 timeline
- Process profile
  - Number of processes in each activity at that time
- Load imbalance most likely arises in rhsf and derivative_xyz
- 70% of the entire MPI time is spent in MPI_Wait
28. Communication in Subcall Tree B Distributes Ghost Zones of the 3D Volume
- Message statistics from 512 tasks, 8×8×8 virtual processor grid
- Message size ~28 KB, the same across iterations and processor counts (periodic boundary conditions for communication); a consistency check appears below
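As a consistency check, if the per-process block is 30×30×30 (as in the I/O measurements) and the derivative stencil is 8th order (iorder = 8), the ghost-zone message-size model on the Performance Assertions slide later in this deck reproduces this size:

\[
m_y \, m_z \, \frac{i_{\mathrm{order}}}{2}\, \mathrm{sizeof}(\mathrm{REAL8})
= 30 \times 30 \times 4 \times 8\ \text{bytes} = 28{,}800\ \text{bytes} \approx 28\ \text{KB}.
\]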
29. Bimodal Nature of MPI_Wait Histogram on Intrepid Indicates Load Balance Issue
30. Wide Range of MPI_Barrier Times on Intrepid Also Indicates Load Balance Issue
31. Some Insight into Load Balance of S3D on Jaguar with an 8×8×8 Processor Grid from Scalasca
32. Modeling S3D with Performance Assertions
- Communication modeling goals
  - To understand load imbalance
  - To project for future system and problem configurations
  - To generate synthetic traces for a network simulator
- Workload distribution
  - Along the x, y, and z axes
  - Along the 3 planes
- Neighbor list
  - For computation
  - For communication
33. Workload Imbalance Findings
- MPI wait in derivative calculations
- Symbolic models for messages in derivative and filter subroutines
- Infrequent Allreduce
- No sub-communicators in the PERI problem configurations
34. Performance Assertions Model of S3D Communication
- Symbolic models (evaluated in the sketch below)
  - Isend 7/8: my × mz × (iorder/2) × sizeof(MPI_REAL8)
  - Isend 9/10: mx × mz × (iorder/2) × sizeof(MPI_REAL8)
  - Isend 11/12: mx × my × (iorder/2) × sizeof(MPI_REAL8)
  - Isend 1/2: my × mz × (1 + iforder/2) × sizeof(MPI_REAL8)
  - Isend 3/4: mx × mz × (1 + iforder/2) × sizeof(MPI_REAL8)
  - Isend 5/6: mx × my × (1 + iforder/2) × sizeof(MPI_REAL8)
  - Allreduce: constant, 3 × MPI_REAL8
- Model validation
  - Neighbor IDs validated for different problem sizes
  - Message sizes validated for different axis values
  - Confirmed values with mpiP profiles on Jaguar
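A minimal sketch (not the Performance Assertions tooling itself) that evaluates the message-size expressions above for one per-process block; the block dimensions and the filter order iforder = 10 are illustrative assumptions, not values taken from an S3D input deck.

! Evaluate the symbolic message-size models above for one per-process block.
program pa_model_sketch
  implicit none
  integer, parameter :: mx = 30, my = 30, mz = 30   ! illustrative per-process block
  integer, parameter :: iorder  = 8                 ! derivative stencil order
  integer, parameter :: iforder = 10                ! filter order (assumed value)
  integer, parameter :: szreal8 = 8                 ! sizeof(MPI_REAL8) in bytes

  print *, 'Isend 7/8   :', my*mz*(iorder/2)*szreal8,    ' bytes'
  print *, 'Isend 9/10  :', mx*mz*(iorder/2)*szreal8,    ' bytes'
  print *, 'Isend 11/12 :', mx*my*(iorder/2)*szreal8,    ' bytes'
  print *, 'Isend 1/2   :', my*mz*(1+iforder/2)*szreal8, ' bytes'
  print *, 'Isend 3/4   :', mx*mz*(1+iforder/2)*szreal8, ' bytes'
  print *, 'Isend 5/6   :', mx*my*(1+iforder/2)*szreal8, ' bytes'
  print *, 'Allreduce   :', 3*szreal8,                   ' bytes (constant)'
end program pa_model_sketch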
35. S3D Computation Modeling under Performance Assertions

  if (neighbor(1).lt.0) then
    do k = 1, mz
      do j = 1, my
        df(1,j,k) = ae * ( f(2,j,k) - neg_f(4,j,k) ) &
                  + be * ( f(3,j,k) - neg_f(3,j,k) ) &
                  + ce * ( f(4,j,k) - neg_f(2,j,k) ) &
                  + de * ( f(5,j,k) - neg_f(1,j,k) )

        df(2,j,k) = ae * ( f(3,j,k) - f(1,j,k) )     &
                  + be * ( f(4,j,k) - neg_f(4,j,k) ) &
                  + ce * ( f(5,j,k) - neg_f(3,j,k) ) &
                  + de * ( f(6,j,k) - neg_f(2,j,k) )

        df(3,j,k) = ae * ( f(4,j,k) - f(2,j,k) )     &
                  + be * ( f(5,j,k) - f(1,j,k) )     &
                  + ce * ( f(6,j,k) - neg_f(4,j,k) ) &
                  + de * ( f(7,j,k) - neg_f(3,j,k) )

        df(4,j,k) = ae * ( f(5,j,k) - f(3,j,k) ) &
36. Need Initial PMaC Modeling Discussion
37. Need slides describing FLASH Science
38. FLASH Scaling on Intrepid
(Figure panels: weak scaling and strong scaling; turbulence-driven nuclear burning and white dwarf deflagration)
39. The Gyrokinetic Toroidal Code
- 3D particle-in-cell code to study microturbulence in magnetically confined fusion plasmas
- Solves the gyro-averaged Vlasov equation
- Gyrokinetic Poisson equation solved in real space
- Low-noise δf method
- Global code (full torus as opposed to only a flux tube)
- Massively parallel: typical runs use 1024 processors
- Electrostatic (for now)
- Nonlinear and fully self-consistent
- Written in Fortran 90/95
- Originally optimized for superscalar processors
40. Particle-in-Cell (PIC) Method
- Particles sample the distribution function
- The particles interact via a grid, on which the potential is calculated from deposited charges
- The PIC steps (sketched below)
  - SCATTER, or deposit, charges on the grid (nearest neighbors)
  - Solve the Poisson equation
  - GATHER forces on each particle from the potential
  - Move particles (PUSH)
  - Repeat
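A skeleton (not GTC source code) of the cycle listed above; the subroutine names are placeholders for the scatter, field-solve, gather, and push stages, and their bodies are intentionally left empty.

! Skeleton of the PIC cycle: scatter charge, solve for the potential,
! gather forces, push particles, and repeat.
program pic_cycle_sketch
  implicit none
  integer, parameter :: nsteps = 100
  integer :: istep

  do istep = 1, nsteps
    call scatter_charge()    ! SCATTER: deposit particle charge onto the grid
    call solve_poisson()     ! solve the (gyrokinetic) Poisson equation on the grid
    call gather_forces()     ! GATHER: interpolate forces from the potential to particles
    call push_particles()    ! PUSH: advance particle positions and velocities
  end do

contains

  subroutine scatter_charge()
  end subroutine scatter_charge

  subroutine solve_poisson()
  end subroutine solve_poisson

  subroutine gather_forces()
  end subroutine gather_forces

  subroutine push_particles()
  end subroutine push_particles

end program pic_cycle_sketch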
41. Charge Deposition for Charged Rings: 4-Point Average Method
Point-charge particles replaced by charged rings
due to gyro-averaging
42. Application Team's Flagship Code: The Gyrokinetic Toroidal Code (GTC)
- Fully global 3D particle-in-cell (PIC) code in toroidal geometry
- Developed by Prof. Zhihong Lin (now at UC Irvine)
- Used for non-linear gyrokinetic simulations of plasma microturbulence
- Fully self-consistent
- Uses magnetic field-line-following coordinates (ψ, θ, ζ) [Boozer, 1981]
- Guiding center Hamiltonian [White and Chance, 1984]
- Non-spectral Poisson solver [Lin and Lee, 1995]
- Low numerical noise algorithm (δf method)
- Full torus (global) simulation
- Scales to a very large number of processors
- Excellent theoretical tool!
43. Measurement Effort for GTC Is Ongoing
- Initial measurement effort focused on Jaguar
  - Verified that load balance and runtime improved with a correction to particle initialization (see next slide)
  - Version that uses ADIOS builds and runs with both MPI-only and hybrid MPI/OpenMP
  - Currently gathering measurements of the MPI-only version with TAU
  - When instrumented with the most recent TAU, the hybrid MPI/OpenMP version builds but crashes with a segmentation fault
- Some very preliminary measurements on Intrepid
  - Application team requested that evaluation wait until an additional optimization effort is completed
  - PERI optimization of OpenMP loops improved performance 15-20%
  - Application team working to provide an alternate code base that is expected to improve scaling significantly
44. TAU Time Profiles of GTC with Different Particle Initializations Showing Load Imbalances
Corrected particle initialization results in less severe load imbalance
Load imbalance due to incorrect particle initialization
128-process runs on Jaguar
Profiling helps ensure that a valid version is used for modeling
45. Cover miscellaneous issues
46. Going Forward
- Need to discuss what remains to be done for the first three codes
- Need to discuss which codes should be next
47. Need initial conclusion