Title: Preliminary Report of the PERI Architecture Tiger Team
1. Preliminary Report of the PERI Architecture Tiger Team
- Performance Engineering Research Institute
- Tiger Team Lead: Bronis R. de Supinski
- Sadaf Alam, David H. Bailey, Laura Carrington, Jacqueline Chame, Chris Daley, Anshu Dubey, Todd Gamblin, Dan Gunter, Paul Hovland, Heike Jagode, Karen Karavanic, Gabriel Marin, John Mellor-Crummey, Shirley Moore, Boyana Norris, Lenny Oliker, Philip C. Roth, Martin Schulz, Sameer Shende, Jeff Vetter, Pat Worley, Nick Wright
- January XX, 2009
- Questions to bronis@llnl.gov
2. PERI Architecture Tiger Team Overview
- Assist OASCR in formulating its ten-year plan
- What machine mix is needed to fulfill the OASCR mission?
  - Answer determined through application focus
- Consider a range of OASCR applications
- Evaluate suitability of current and future architectures
  - Focus on system evaluation
  - Still consider optimization opportunities, but they are not the focus
- Consists of almost all PERI researchers
- Three major components
  - Measure pioneering applications on today's systems
  - Predict their performance on future systems
  - Report analysis of results, consult DOE on implications
- This report focuses on initial measurement activities and preliminary modeling work associated with that activity
3. Headquarters Requested the Architecture Tiger Team to Focus on Early Science Applications
- Oak Ridge Early Science Applications
- Chimera
- DCA
- GTC
- Argonne Early Science Applications
- FLASH
- GTC
- Lattice QCD
- NAMD
- Initial three chosen by headquarters: FLASH, GTC, and S3D
- Tasks require a flexible, evolving strategy for the code base used
  - Predictions require freezing application code at some point
  - Anticipate revisiting models (and the related code base) over time
  - Measurement can be iterative, so some flexibility is possible
  - Use measurements to guide selection of the modeled code base
4. Widely Varied Initial Progress for First Three Applications: S3D Was Smoothest
- Relationship established with the FY07 Tiger Team facilitated the effort
- Clear application team structure simplified determining the initial code base and input problem sets
- Significant measurement results
  - Performance similar across Jaguar and Intrepid
  - Identified an inherent load imbalance issue
    - Minor performance limitation at current scales
    - Potential issue at larger scales
    - Focus of the initial Performance Assertions modeling effort
  - Initial data gathered for PMaC modeling
5. Widely Varied Initial Progress for First Three Applications: External Factors Limited FLASH
- Little existing relationship
  - Some familiarity between ANL PERI representatives and the University of Chicago (UC) application team
  - Clear application team structure helped the relationship develop
- Administrative issues complicated getting started
  - Licensing issues related to the code
  - Restricted distribution mechanism related to export control
  - Delay in the staffing plan (Chris Daley, UC) for the measurement effort
    - Required funds were slow going through the LBNL procurement process
- Initial measurement results
  - Scaling studies on (Jaguar? and) Intrepid
  - Currently gathering extensive TAU and other tool data
  - Beginning to gather data for PMaC modeling
6. Widely Varied Initial Progress for First Three Applications: GTC
- Application is undergoing significant redevelopment
  - Several possibilities for the code base for explorations
  - Existing versions expected to change significantly
    - Will be very different over the long term
    - Even short-term stability is not clear
    - I/O mechanism currently being completely redone
  - Probably requires at least two code bases for study
- Initial measurement results
  - Initial measurement studies on Jaguar
  - Identified a configuration error that implied scaling limitations
  - Currently gathering extensive TAU and other tool data
  - Will begin gathering modeling data soon
7. S3D: Direct Numerical Simulation (DNS) of Turbulent Combustion
- State-of-the-art code developed at CRF/Sandia
- 2007 INCITE award: 6M hours on XT3/4 at NCCS
- Tier 1 pioneering application for the 250 TF system
- Why DNS?
  - Study micro-physics of turbulent reacting flows
  - Full access to time-resolved fields
  - Physical insight into chemistry-turbulence interactions
  - Develop and validate reduced model descriptions used in macro-scale simulations of engineering-level systems
Text and figures courtesy of S3D PI, Jacqueline
H. Chen, SNL
8. S3D DNS Solver
- Solves the compressible reacting Navier-Stokes equations
- High-fidelity numerical methods
  - 8th-order finite differences
  - 4th-order explicit Runge-Kutta (RK) integrator
- Hierarchy of molecular transport models
- Detailed chemistry
- Multiphysics (sprays, radiation, soot)
- From SciDAC-TSTC (Terascale Simulation of Combustion)
Text and figures courtesy of S3D PI, Jacqueline
H. Chen, SNL
9. S3D Parallelization
- Fortran 90 + MPI
- 3D domain decomposition
  - Each MPI process manages part of the domain
  - All processes have the same number of grid points and the same computational load
- Inter-processor communication only between nearest neighbors in the 3D mesh
  - Large messages, non-blocking sends and receives (see the sketch below)
- All-to-all communication only required for monitoring and synchronization ahead of I/O
(Figure: S3D logical topology)
Text courtesy of S3D PI, Jacqueline H. Chen, SNL
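To make the nearest-neighbor exchange pattern above concrete, here is a minimal sketch (not S3D source code) of a non-blocking ghost-plane exchange along one direction of a periodic 3D Cartesian process grid. All names (xlo_nbr, ghost_lo, ...) and sizes are illustrative assumptions.

! Minimal sketch of the nearest-neighbor exchange described above: each rank
! posts non-blocking receives and sends for the ghost planes it shares with
! its x-direction neighbors, then waits for all four requests.
program halo_sketch
  use mpi
  implicit none
  integer, parameter :: my = 30, mz = 30, ng = 4      ! 4 ghost planes for an 8th-order stencil
  real(8) :: send_lo(ng,my,mz), send_hi(ng,my,mz)     ! boundary planes packed from the local block
  real(8) :: ghost_lo(ng,my,mz), ghost_hi(ng,my,mz)   ! ghost planes received from neighbors
  integer :: comm3d, xlo_nbr, xhi_nbr, req(4), dims(3), nprocs, ierr
  logical :: periods(3)

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  dims = 0
  call MPI_Dims_create(nprocs, 3, dims, ierr)          ! factor the ranks into a 3D grid
  periods = .true.                                     ! periodic boundaries, as in the runs described later
  call MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, .true., comm3d, ierr)
  call MPI_Cart_shift(comm3d, 0, 1, xlo_nbr, xhi_nbr, ierr)

  send_lo = 0.0d0; send_hi = 0.0d0                     ! real code would pack boundary data here

  ! Post receives first, then sends; computation could overlap before the wait
  call MPI_Irecv(ghost_lo, size(ghost_lo), MPI_REAL8, xlo_nbr, 1, comm3d, req(1), ierr)
  call MPI_Irecv(ghost_hi, size(ghost_hi), MPI_REAL8, xhi_nbr, 2, comm3d, req(2), ierr)
  call MPI_Isend(send_hi,  size(send_hi),  MPI_REAL8, xhi_nbr, 1, comm3d, req(3), ierr)
  call MPI_Isend(send_lo,  size(send_lo),  MPI_REAL8, xlo_nbr, 2, comm3d, req(4), ierr)
  call MPI_Waitall(4, req, MPI_STATUSES_IGNORE, ierr)

  call MPI_Finalize(ierr)
end program halo_sketch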
10. Total Execution Time of S3D on Intrepid
11. Relative Efficiency for S3D under Weak Scaling on Intrepid
12. Relative Efficiency on Intrepid by Event
13. Relative Speedup on Intrepid by Event
14. Event Correlation to Total Time on Intrepid
r = 1 implies direct correlation
15. Fraction of Time in MPI on Intrepid
16. Total Runtime Breakdown by Events
17. Mean Time by Function Breakdown Across All Nodes on Intrepid
Total runtime: 1 hour, 31 minutes, 25 seconds
18. S3D Wall Clock Times Measured on Jaguar with Optimized TAU Instrumentation on 64 Cores
- Exclusive times distributed across routines called within S3D's solve_driver
- ratt_i
- rhsf
- ratx_i
- transport_mcomputecoefficients
- transport_mcomputespeciesdiffflux
- MPI_Wait
- integrate
- thermchen_mcalc_temp
- transport_mcomputeheatflux
- derivative_xcalc
- derivative_ycalc
- derivative_zcalc
- derivative_xcomm
19. S3D Wall Clock Times Measured on Jaguar with Optimized TAU Instrumentation on 64 Cores
- Exclusive times for MPI_Wait (6) exhibit a potential load balance issue
20. Gathered IPC, Floating Point Data, and 8 Memory Measurements for S3D on Jaguar with TAU
- IPC (Instructions per Cycle) efficiency metric
- Proportion of floating point operations
- Hardware counter-based memory measurements
- L1 data cache misses
- L1 instruction cache misses
- L1 data TLB misses
- L1 instruction TLB misses
- L2 (unified but not shared between cores) cache misses
- L2 data TLB misses
- L2 instruction TLB misses
- Memory accesses on quad-core (L3) for different core cases
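The IPC and floating-point-proportion metrics listed above are presumably formed as the usual counter ratios (these are conventional definitions, not stated on the slide):

\[
\mathrm{IPC} = \frac{N_{\mathrm{retired\ instructions}}}{N_{\mathrm{cycles}}},
\qquad
\text{FP proportion} = \frac{N_{\mathrm{FP\ operations}}}{N_{\mathrm{retired\ instructions}}}
\]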
21. Event-Based Measurement of IPC, Floating Point Operation Rate, and L1 Data Cache Miss Rates
22. L3 Cache Behavior for Different Core Cases: 4 Cores/Node (VNM) versus 1 Core/Node (SMP)
- Runtime on Jaguar: VNM 813 s, SMP 613.4 s
- Runtime on Intrepid: VNM 1728.7 s, SMP 1731.7 s
23. Initial Measurements of S3D I/O Performance on Jaguar
- S3D uses Fortran I/O to read control and input files and to write restart dumps
- Each rank writes its own restart file
  - Writes staggered across the logical process topology to avoid contention at the file system metadata server
- Restart dumps dominate I/O cost, as each rank writes:
  - Four REAL(8) scalars (time, tstep, time_save, pout)
  - Four REAL(8) 3D/4D arrays (yspecies, temp, pressure, u)
  - Fortran I/O record markers
- Per-process write volume (worked through below)
  - Vproc = 8 × (4 + nx × ny × nz × (n_species + 5)) + 64 bytes
  - For 30×30×30 grid points per process:
    - 5.8 MB per process, per checkpoint
    - With 20K processes: 116.6 GB per checkpoint
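Working the volume formula through for the quoted case (the species count is not stated on the slide; n_species = 22 is an assumption chosen to be consistent with the quoted 5.8 MB figure):

\[
V_{\mathrm{proc}} = 8\,\bigl[\,4 + 30 \times 30 \times 30 \times (22 + 5)\,\bigr] + 64
\approx 8 \times 27{,}000 \times 27 \ \text{bytes} \approx 5.8\ \text{MB},
\]
\[
20{,}000 \ \text{processes} \times 5.8\ \text{MB} \approx 116.6\ \text{GB per checkpoint}.
\]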
24. Write Restart Performance Measured on Jaguar's Lustre File System
- 30×30×30 grid points per process
- 10 iterations, 1 checkpoint
25. Our Initial I/O Model for S3D Projects Restart File Performance over 12 Hours on Jaguar
- How many time steps can be done in a 12-hour allocation?
- How much of that 12 hours will be spent doing I/O?
- Use the average observed time step latency
- Bars show projections with the minimum, average, and maximum observed checkpoint latency (one way to express the projection is sketched below)
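One way to express the projection, assuming a checkpoint is written every k time steps (the measured run on the previous slide used 10 iterations per checkpoint), with average step latency \(\bar{t}_{\mathrm{step}}\) and checkpoint latency \(t_{\mathrm{ckpt}}\):

\[
N_{\mathrm{steps}} \approx \frac{T_{\mathrm{alloc}}}{\bar{t}_{\mathrm{step}} + t_{\mathrm{ckpt}}/k},
\qquad
f_{\mathrm{I/O}} \approx \frac{(N_{\mathrm{steps}}/k)\; t_{\mathrm{ckpt}}}{T_{\mathrm{alloc}}},
\]

with \(T_{\mathrm{alloc}} = 12\) hours and \(t_{\mathrm{ckpt}}\) set to its minimum, average, and maximum observed values for the three bars.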
26. Vampir Provides a Deeper Look at One Iteration of S3D on 512 Cores on Jaguar
- Rank 0 timeline
- Core of each iteration
  - Subcall tree A: calculation only
  - Subcall trees B and C: calculation + communication
27. Vampir Provides a Deeper Look at One Iteration of S3D on 512 Cores on Jaguar
- Rank 0 timeline
- Process profile
  - Number of processes in each activity at that time
- Load imbalance most likely arises in rhsf and derivative_xyz
- 70% of the entire MPI time is spent in MPI_Wait
28. Communication in Subcall Tree B Distributes Ghost Zones of the 3D Volume
- Message statistics from 512 tasks, 8×8×8 virtual processor grid
- Message size ~28 KB, the same across iterations and processor counts (periodic boundary conditions for communication); a consistency check appears below
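As a consistency check, if the per-process block is 30×30×30 (as in the I/O measurements) and the derivative stencil is 8th order (iorder = 8), the ghost-zone message-size model on the Performance Assertions slide later in this deck reproduces this size:

\[
m_y \, m_z \, \frac{i_{\mathrm{order}}}{2}\, \mathrm{sizeof}(\mathrm{REAL8})
= 30 \times 30 \times 4 \times 8\ \text{bytes} = 28{,}800\ \text{bytes} \approx 28\ \text{KB}.
\]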
29. Bimodal Nature of MPI_Wait Histogram on Intrepid Indicates Load Balance Issue
30. Wide Range of MPI_Barrier Times on Intrepid Also Indicates Load Balance Issue
31. Some Insight into Load Balance of S3D on Jaguar with an 8×8×8 Processor Grid from Scalasca
32. Modeling S3D with Performance Assertions
- Communication modeling goals
  - To understand load imbalance
  - To project for future system and problem configurations
  - To generate synthetic traces for a network simulator
- Workload distribution
  - Along the x, y, and z axes
  - Along the 3 planes
- Neighbor list
  - For computation
  - For communication
33. Workload Imbalance Findings
- MPI wait in derivative calculations
- Symbolic models for messages in derivative and filter subroutines
- Infrequent Allreduce
- No sub-communicators in the PERI problem configurations
34. Performance Assertions Model of S3D Communication
- Symbolic models (evaluated in the sketch below)
  - Isend 7/8: my × mz × (iorder/2) × sizeof(MPI_REAL8)
  - Isend 9/10: mx × mz × (iorder/2) × sizeof(MPI_REAL8)
  - Isend 11/12: mx × my × (iorder/2) × sizeof(MPI_REAL8)
  - Isend 1/2: my × mz × (1 + iforder/2) × sizeof(MPI_REAL8)
  - Isend 3/4: mx × mz × (1 + iforder/2) × sizeof(MPI_REAL8)
  - Isend 5/6: mx × my × (1 + iforder/2) × sizeof(MPI_REAL8)
  - Allreduce: constant, 3 × MPI_REAL8
- Model validation
  - Neighbor IDs validated for different problem sizes
  - Message sizes validated for different axis values
  - Confirmed values with mpiP profiles on Jaguar
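A minimal sketch (not the Performance Assertions tooling itself) that evaluates the message-size expressions above for one per-process block; the block dimensions and the filter order iforder = 10 are illustrative assumptions, not values taken from an S3D input deck.

! Evaluate the symbolic message-size models above for one per-process block.
program pa_model_sketch
  implicit none
  integer, parameter :: mx = 30, my = 30, mz = 30   ! illustrative per-process block
  integer, parameter :: iorder  = 8                 ! derivative stencil order
  integer, parameter :: iforder = 10                ! filter order (assumed value)
  integer, parameter :: szreal8 = 8                 ! sizeof(MPI_REAL8) in bytes

  print *, 'Isend 7/8   :', my*mz*(iorder/2)*szreal8,    ' bytes'
  print *, 'Isend 9/10  :', mx*mz*(iorder/2)*szreal8,    ' bytes'
  print *, 'Isend 11/12 :', mx*my*(iorder/2)*szreal8,    ' bytes'
  print *, 'Isend 1/2   :', my*mz*(1+iforder/2)*szreal8, ' bytes'
  print *, 'Isend 3/4   :', mx*mz*(1+iforder/2)*szreal8, ' bytes'
  print *, 'Isend 5/6   :', mx*my*(1+iforder/2)*szreal8, ' bytes'
  print *, 'Allreduce   :', 3*szreal8,                   ' bytes (constant)'
end program pa_model_sketch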
35. S3D Computation Modeling under Performance Assertions

  if (neighbor(1).lt.0) then
    do k = 1, mz
      do j = 1, my
        df(1,j,k) = ae * ( f(2,j,k) - neg_f(4,j,k) ) &
                  + be * ( f(3,j,k) - neg_f(3,j,k) ) &
                  + ce * ( f(4,j,k) - neg_f(2,j,k) ) &
                  + de * ( f(5,j,k) - neg_f(1,j,k) )

        df(2,j,k) = ae * ( f(3,j,k) - f(1,j,k) )     &
                  + be * ( f(4,j,k) - neg_f(4,j,k) ) &
                  + ce * ( f(5,j,k) - neg_f(3,j,k) ) &
                  + de * ( f(6,j,k) - neg_f(2,j,k) )

        df(3,j,k) = ae * ( f(4,j,k) - f(2,j,k) )     &
                  + be * ( f(5,j,k) - f(1,j,k) )     &
                  + ce * ( f(6,j,k) - neg_f(4,j,k) ) &
                  + de * ( f(7,j,k) - neg_f(3,j,k) )

        df(4,j,k) = ae * ( f(5,j,k) - f(3,j,k) ) &
36. Need Initial PMaC Modeling Discussion
37. Need slides describing FLASH Science
38. FLASH Scaling on Intrepid
(Figure panels: weak scaling and strong scaling; turbulence-driven nuclear burning and white dwarf deflagration)
39. The Gyrokinetic Toroidal Code
- 3D particle-in-cell code to study microturbulence in magnetically confined fusion plasmas
- Solves the gyro-averaged Vlasov equation
- Gyrokinetic Poisson equation solved in real space
- Low-noise δf method
- Global code (full torus as opposed to only a flux tube)
- Massively parallel: typical runs use 1024 processors
- Electrostatic (for now)
- Nonlinear and fully self-consistent
- Written in Fortran 90/95
- Originally optimized for superscalar processors
40. Particle-in-Cell (PIC) Method
- Particles sample the distribution function
- The particles interact via a grid, on which the potential is calculated from deposited charges
- The PIC steps (sketched below)
  - SCATTER, or deposit, charges on the grid (nearest neighbors)
  - Solve the Poisson equation
  - GATHER forces on each particle from the potential
  - Move particles (PUSH)
  - Repeat
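A skeleton (not GTC source code) of the cycle listed above; the subroutine names are placeholders for the scatter, field-solve, gather, and push stages, and their bodies are intentionally left empty.

! Skeleton of the PIC cycle: scatter charge, solve for the potential,
! gather forces, push particles, and repeat.
program pic_cycle_sketch
  implicit none
  integer, parameter :: nsteps = 100
  integer :: istep

  do istep = 1, nsteps
    call scatter_charge()    ! SCATTER: deposit particle charge onto the grid
    call solve_poisson()     ! solve the (gyrokinetic) Poisson equation on the grid
    call gather_forces()     ! GATHER: interpolate forces from the potential to particles
    call push_particles()    ! PUSH: advance particle positions and velocities
  end do

contains

  subroutine scatter_charge()
  end subroutine scatter_charge

  subroutine solve_poisson()
  end subroutine solve_poisson

  subroutine gather_forces()
  end subroutine gather_forces

  subroutine push_particles()
  end subroutine push_particles

end program pic_cycle_sketch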
41. Charge Deposition for Charged Rings: 4-Point Average Method
Point-charge particles replaced by charged rings
due to gyro-averaging
42. Application Team's Flagship Code: The Gyrokinetic Toroidal Code (GTC)
- Fully global 3D particle-in-cell (PIC) code in toroidal geometry
- Developed by Prof. Zhihong Lin (now at UC Irvine)
- Used for non-linear gyrokinetic simulations of plasma microturbulence
- Fully self-consistent
- Uses magnetic field-line-following coordinates (ψ, θ, ζ) [Boozer, 1981]
- Guiding center Hamiltonian [White and Chance, 1984]
- Non-spectral Poisson solver [Lin and Lee, 1995]
- Low numerical noise algorithm (δf method)
- Full torus (global) simulation
- Scales to a very large number of processors
- Excellent theoretical tool!
43. Measurement Effort for GTC Is Ongoing
- Initial measurement effort focused on Jaguar
  - Verified that load balance and runtime improved with a correction to particle initialization (see next slide)
  - Version that uses ADIOS builds and runs with both MPI-only and hybrid MPI/OpenMP
  - Currently gathering measurements of the MPI-only version with TAU
  - When instrumented with the most recent TAU, the hybrid MPI/OpenMP version builds but crashes with a segmentation fault
- Some very preliminary measurements on Intrepid
  - Application team requested that evaluation wait until an additional optimization effort is completed
  - PERI optimization of OpenMP loops improved performance 15-20%
  - Application team working to provide an alternate code base that is expected to improve scaling significantly
44. TAU Time Profiles of GTC with Different Particle Initializations Showing Load Imbalances
Corrected particle initialization results in less severe load imbalance
Load imbalance due to incorrect particle initialization
128-process runs on Jaguar
Profiling helps ensure that a valid version is used for modeling
45. Cover miscellaneous issues
46. Going Forward
- Need to discuss what remains to be done for the first three codes
- Need to discuss which codes should be next
47. Need initial conclusion