Preliminary Report of the PERI Architecture Tiger Team

Transcript and Presenter's Notes


1
Preliminary Report of the PERI Architecture Tiger Team
  • Performance Engineering Research Institute
  • Tiger Team Lead Bronis R. de Supinski
  • Sadaf Alam, David H. Bailey, Laura Carrington,
    Jacqueline Chame, Chris Daley, Anshu Dubey, Todd
    Gamblin, Dan Gunter, Paul Hovland, Heike Jagode,
    Karen Karavanic, Gabriel Marin, John
    Mellor-Crummey, Shirley Moore, Boyana Norris,
    Lenny Oliker, Philip C. Roth, Martin Schulz,
    Sameer Shende, Jeff Vetter, Pat Worley, Nick
    Wright
  • January XX, 2009
  • questions to bronis@llnl.gov

2
PERI Architecture Tiger Team Overview
  • Assist OASCR in formulating its ten year plan
  • What machine mix is needed to fulfill OASCR
    mission?
  • Answer determined through application focus
  • Consider a range of OASCR applications
  • Evaluate suitability of current and future
    architectures
  • Focus on system evaluation
  • Still consider optimization opportunities, but
    they are not the focus
  • Consists of almost all PERI researchers
  • Three major components
  • Measure pioneering applications on today's
    systems
  • Predict their performance on future systems
  • Report analysis of results, consult DOE on
    implications
  • This report focuses on initial measurement
    activities and preliminary modeling work
    associated with that activity

3
Headquarters Requested the Architecture Tiger
Team to Focus on Early Science Applications
  • Oak Ridge Early Science Applications
  • Chimera
  • DCA
  • GTC
  • Argonne Early Science Applications
  • FLASH
  • GTC
  • Lattice QCD
  • NAMD
  • Initial three chosen by headquarters: FLASH, GTC,
    and S3D
  • Tasks require flexible, evolving strategy for
    code base used
  • Predictions require freezing application code at
    some point
  • Anticipate revisiting models (and related code
    base) over time
  • Measurement can be iterative so some flexibility
  • Use measurements to guide selection of modeled
    code base
  • MADNESS
  • POP
  • S3D
  • NEK5
  • Qbox
  • WPP

4
Widely Varied Initial Progress for First Three
Applications: S3D was Smoothest
  • Relationship established with the FY07 Tiger Team
    facilitated the effort
  • Clear application team structure simplified
    determining initial code base and input problem
    sets
  • Significant measurement results
  • Performance similar across Jaguar and Intrepid
  • Identified inherent load imbalance issue
  • Minor performance limitation at current scales
  • Potential issue at larger scales
  • Focus of initial performance assertions modeling
    effort
  • Initial data gathered for PMaC modeling

5
Widely Varied Initial Progress for First Three
Applications: External Factors Limited FLASH
  • Little existing relationship
  • Some familiarity between ANL PERI representatives
    and University of Chicago (UC) application team
  • Clear application team structure helped
    relationship develop
  • Administrative issues complicated getting started
  • Licensing issues related to code
  • Restricted distribution mechanism related to
    export control
  • Delay in staffing plan (Chris Daley, UC) for
    measurement effort
  • Required funds were slow going through the LBNL
    procurement process
  • Initial measurement results
  • Scaling studies on (Jaguar? and) Intrepid
  • Currently, gathering extensive TAU and other tool
    data
  • Beginning to gather data for PMaC modeling

6
Widely Varied Initial Progress for First Three
Applications: GTC
  • Application is undergoing significant
    redevelopment
  • Several possibilities for code base for
    explorations
  • Existing versions expected to change
    significantly
  • Will be very different over long term
  • Even short term stability not clear
  • I/O mechanism being completely redone currently
  • Probably requires at least two code bases for
    study
  • Initial measurement results
  • Initial measurement studies on Jaguar
  • Identified configuration error that implied
    scaling limitations
  • Currently, gathering extensive TAU and other tool
    data
  • Will begin gathering modeling data soon

7
S3D: Direct numerical simulation (DNS) of
turbulent combustion
  • State-of-the-art code developed at CRF/Sandia
  • 2007 INCITE award - 6M hours on XT3/4 at NCCS
  • Tier 1 pioneering application for 250TF system
  • Why DNS?
  • Study micro-physics of turbulent reacting flows
  • Full access to time resolved fields
  • Physical insight into chemistry-turbulence
    interactions
  • Develop and validate reduced model descriptions
    used in macro-scale simulations of
    engineering-level systems

Text and figures courtesy of S3D PI, Jacqueline
H. Chen, SNL
8
S3D - DNS Solver
  • Solves compressible reacting Navier-Stokes
    equations
  • High fidelity numerical methods
  • 8th order finite-difference
  • 4th order explicit RK integrator
  • Hierarchy of molecular transport models
  • Detailed chemistry
  • Multiphysics (sprays, radiation, soot)
  • From SciDAC-TSTC (Terascale Simulation of
    Combustion)

Text and figures courtesy of S3D PI, Jacqueline
H. Chen, SNL
9
S3D Parallelization
  • Fortran90 and MPI
  • 3D domain decomposition
  • each MPI process manages part of the domain
  • All processes have the same number of grid
    points, hence the same computational load
  • Inter-processor communication only between
    nearest neighbors in the 3D mesh (see the
    exchange sketch below)
  • Large messages; non-blocking sends and receives
  • All-to-all communication only required for
    monitoring and synchronization ahead of I/O

S3D logical topology
Text courtesy of S3D PI, Jacqueline H. Chen, SNL
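Below is a minimal sketch (not S3D source) of the communication pattern
described above: non-blocking ghost-plane exchange with nearest neighbors on a
periodic 3D Cartesian process grid. The extents are illustrative assumptions
(30 grid points per side, 4 ghost planes for an 8th-order stencil); with those
values one face message is 30 × 30 × 4 × 8 bytes ≈ 28 KB, in line with the
message size reported later from the Vampir statistics.

  program halo_sketch
    use mpi
    implicit none
    ! 30 points per side and 4 ghost planes (8th-order stencil) are assumptions
    integer, parameter :: n = 30, ng = 4
    real(8) :: send_lo(n,n,ng), send_hi(n,n,ng)
    real(8) :: recv_lo(n,n,ng), recv_hi(n,n,ng)
    integer :: comm3d, nprocs, rank, ierr, lo, hi
    integer :: dims(3), reqs(4), stats(MPI_STATUS_SIZE,4)
    logical :: periods(3)

    call MPI_Init(ierr)
    call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
    dims = 0; periods = .true.                       ! periodic in all directions
    call MPI_Dims_create(nprocs, 3, dims, ierr)
    call MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, .true., comm3d, ierr)
    call MPI_Comm_rank(comm3d, rank, ierr)
    call MPI_Cart_shift(comm3d, 0, 1, lo, hi, ierr)  ! nearest neighbors along x

    send_lo = rank; send_hi = rank                   ! stand-in for real ghost planes

    ! Non-blocking receives and sends; interior work could overlap before the wait
    call MPI_Irecv(recv_lo, size(recv_lo), MPI_REAL8, lo, 0, comm3d, reqs(1), ierr)
    call MPI_Irecv(recv_hi, size(recv_hi), MPI_REAL8, hi, 1, comm3d, reqs(2), ierr)
    call MPI_Isend(send_hi, size(send_hi), MPI_REAL8, hi, 0, comm3d, reqs(3), ierr)
    call MPI_Isend(send_lo, size(send_lo), MPI_REAL8, lo, 1, comm3d, reqs(4), ierr)
    call MPI_Waitall(4, reqs, stats, ierr)
    ! The y and z directions repeat the same pattern (directions 1 and 2)

    call MPI_Finalize(ierr)
  end program halo_sketch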
10
Total Execution Time of S3D on Intrepid
11
Relative Efficiency for S3D under Weak Scaling on
Intrepid
12
Relative Efficiency on Intrepid by Event
13
Relative Speedup on Intrepid by Event
14
Event Correlation to Total Time on Intrepid
r = 1 implies direct correlation
15
Fraction of time in MPI on Intrepid
16
Total Runtime Breakdown by Events
17
Mean Time by Function Breakdown Across All Nodes
on Intrepid
Total Runtime: 1 hour, 31 minutes, 25 seconds
18
S3D Wall Clock Times Measured on Jaguar with
Optimized TAU Instrumentation on 64 Cores
  • Exclusive times distributed across routines
    called within S3D's solve_driver
  • ratt_i
  • rhsf
  • ratx_i
  • transport_mcomputecoefficients
  • transport_mcomputespeciesdiffflux
  • MPI_Wait
  • integrate
  • thermchen_mcalc_temp
  • transport_mcomputeheatflux
  • derivative_xcalc
  • derivative_ycalc
  • derivative_zcalc
  • derivative_xcomm

19
S3D Wall Clock Times Measured on Jaguar with
Optimized TAU Instrumentation on 64 Cores
  • Exclusive times for MPI_Wait (6) exhibit a
    potential load balance issue

20
Gathered IPC, Floating Point Data, and 8 Memory
Measurements for S3D on Jaguar with TAU
  • IPC (Instructions per Cycle) efficiency metric
    (see the derived-metric sketch after this list)
  • Proportion of floating point operations
  • Hardware counter-based memory measurements
  • L1 data cache misses
  • L1 instruction cache misses
  • L1 data TLB misses
  • L1 instruction TLB misses
  • L2 (unified but not shared between cores) cache
    misses
  • L2 data TLB misses
  • L2 instruction TLB misses
  • Memory accesses on quad-core (L3) for different
    core cases
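As a small illustration of how such derived metrics are formed from raw counter
totals, the sketch below computes IPC, the floating-point fraction, and an L1
data-cache miss rate. All counter values, and the choice of denominator for the
miss rate, are placeholder assumptions rather than measurements from this study.

  program derived_metrics
    implicit none
    ! Placeholder counter totals -- not measured values from the S3D runs
    real(8), parameter :: total_instructions = 2.0d12
    real(8), parameter :: total_cycles       = 1.6d12
    real(8), parameter :: fp_operations      = 5.0d11
    real(8), parameter :: l1_dcache_misses   = 3.0d10
    real(8), parameter :: l1_dcache_accesses = 8.0d11  ! assumed miss-rate denominator

    print '(a,f5.2)', 'IPC (instructions / cycles) : ', &
          total_instructions / total_cycles
    print '(a,f5.2)', 'Floating point fraction     : ', &
          fp_operations / total_instructions
    print '(a,f6.3)', 'L1 data cache miss rate     : ', &
          l1_dcache_misses / l1_dcache_accesses
  end program derived_metrics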

21
Event-Based Measurement of IPC, FPO rate and L1
Data Cache Miss Rates
22
L3 Cache Behavior for Different Core Cases: 4
Cores/Node (VNM) Versus 1 Core/Node (SMP)
  • Runtime on Jaguar: VNM 813 s, SMP 613.4 s
  • Runtime on Intrepid: VNM 1728.7 s, SMP 1731.7 s

23
Initial Measurements of S3D I/O Performance on
Jaguar
  • S3D uses FORTRAN I/O to read control and input
    files and to write restart dumps
  • Each rank writes its own restart file
  • Writes staggered across logical process topology
    to avoid contention at file system metadata
    server
  • Restart dumps dominate I/O cost, as each rank
    writes
  • Four REAL(8) scalars (time, tstep, time_save,
    pout)
  • Four REAL(8) 3D/4D arrays (yspecies, temp,
    pressure, u)
  • Fortran I/O record markers
  • Per-process write volume (worked check below)
  • Vproc = 8 × (4 + nx × ny × nz × (n_species + 5)) + 64 bytes
  • For 30×30×30 grid points per process resolution
  • 5.8 MB per process, per checkpoint
  • With 20K processes: 116.6 GB per checkpoint
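A back-of-the-envelope check of the write-volume formula above; n_species = 22
is an assumption chosen for illustration (the mechanism size is not stated on
this slide) and happens to reproduce the quoted figures.

  program restart_volume
    implicit none
    integer, parameter :: nx = 30, ny = 30, nz = 30
    integer, parameter :: n_species = 22      ! assumption, not stated on the slide
    integer(8) :: vproc
    ! 4 scalars + (n_species + 5) grid-sized fields, all REAL(8),
    ! plus 64 bytes of Fortran record markers
    vproc = 8_8 * (4_8 + int(nx,8)*ny*nz*(n_species + 5)) + 64_8
    print '(a,f5.2,a)', 'Per-process restart volume : ', vproc/1.0d6, ' MB'
    print '(a,f6.1,a)', 'For 20,000 processes       : ', 20000*vproc/1.0d9, ' GB'
  end program restart_volume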

24
Write Restart Performance Measured on Jaguar's
Lustre File System
  • 30×30×30 grid points per process
  • 10 Iterations, 1 checkpoint

25
Our Initial I/O Model for S3D Projects Restart
File Performance over 12 Hours on Jaguar
  • How many time steps can be done in a 12-hour
    allocation?
  • How much of that 12 hours will be spent doing
    I/O?
  • Use average observed time step latency
  • Bars show projections with minimum, average, and
    maximum observed checkpoint latency (a simple
    projection sketch follows below)
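A minimal sketch of this projection, using placeholder values for the time-step
latency, checkpoint latency, and checkpoint interval (none of these numbers
come from the measured runs):

  program io_projection
    implicit none
    real(8), parameter :: wall_s         = 12.0d0 * 3600.0d0  ! 12-hour allocation
    real(8), parameter :: t_step_s       = 5.0d0              ! avg time-step latency (placeholder)
    real(8), parameter :: t_ckpt_s       = 60.0d0             ! checkpoint latency   (placeholder)
    integer, parameter :: steps_per_ckpt = 100                ! checkpoint interval  (placeholder)
    real(8) :: t_block
    integer :: nblocks, nsteps

    ! Wall time to advance one checkpoint interval, including the restart dump
    t_block = steps_per_ckpt * t_step_s + t_ckpt_s
    nblocks = int(wall_s / t_block)
    nsteps  = nblocks * steps_per_ckpt

    print '(a,i8)',     'Time steps completed in 12 hours : ', nsteps
    print '(a,f5.1,a)', 'Share of allocation spent in I/O : ', &
          100.0d0 * nblocks * t_ckpt_s / wall_s, ' %'
  end program io_projection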

26
Vampir Provides a Deeper Look at One Iteration of
S3D on 512 Cores on Jaguar
  • Rank 0 Timeline
  • Core of each iteration
  • Subcall tree A: calculation only
  • Subcall trees B and C: calculation and
    communication

27
Vampir Provides a Deeper Look at One Iteration of
S3D on 512 Cores on Jaguar
  • Rank 0 Timeline
  • Process profile
  • Process count in each activity at that time
  • Load imbalance most likely arises in rhsf
    and derivative_xyz
  • 70% of the entire MPI time is spent in MPI_Wait

28
Communication in Subcall Tree B Distributes Ghost
Zones of 3D Volume
  • Message statistics from 512 tasks, 8×8×8 virtual
    processor grid
  • Message size (28 KB) is the same for different
    iterations and processor counts; periodic
    boundary conditions for communication

29
Bimodal Nature of MPI_Wait Histogram on Intrepid
Indicates Load Balance Issue
30
Wide Range of MPI_Barrier Times on Intrepid Also
Indicates Load Balance Issue
31
Some Insight into Load Balance of S3D on Jaguar
with an 8×8×8 Processor Grid from Scalasca
32
Modeling S3D with Performance Assertions
  • Communication modeling goals
  • To understand load imbalance
  • To project for future system and problem
    configurations
  • To generate synthetic traces for network
    simulator
  • Workload distribution
  • Along x, y and z axes
  • Along the 3 planes
  • Neighbor list
  • For computation
  • For communication

33
Workload Imbalance Findings
  • MPI wait in derivative calculations
  • Symbolic models for messages in derivative and
    filter subroutines
  • Infrequent Allreduce
  • No sub-communicator in the PERI problem
    configurations

34
Performance Assertions Model of S3D Communication
  • Symbolic models
  • Isend7/8 = my × mz × (iorder/2) × sizeof(MPI_REAL8)
  • Isend9/10 = mx × mz × (iorder/2) × sizeof(MPI_REAL8)
  • Isend11/12 = mx × my × (iorder/2) × sizeof(MPI_REAL8)
  • Isend1/2 = my × mz × (1 + iforder/2) × sizeof(MPI_REAL8)
  • Isend3/4 = mx × mz × (1 + iforder/2) × sizeof(MPI_REAL8)
  • Isend5/6 = mx × my × (1 + iforder/2) × sizeof(MPI_REAL8)
  • Allreduce: constant, 3 × MPI_REAL8
  • Model validation
  • Neighbor IDs validated for different problem
    sizes
  • Message sizes validated for different axes values
  • Confirmed values with mpiP profiles on Jaguar
    (a numerical size check follows below)
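A quick numerical check of the ghost-exchange models, assuming 30×30×30 grid
points per process and an 8th-order stencil (iorder = 8); both values are
assumptions for illustration:

  program msg_size_check
    implicit none
    integer, parameter :: mx = 30, my = 30, mz = 30   ! assumed points per process
    integer, parameter :: iorder = 8                  ! assumed stencil order
    integer, parameter :: bytes_real8 = 8             ! sizeof(MPI_REAL8)

    print '(a,i6,a)', 'Isend7/8   : ', my*mz*(iorder/2)*bytes_real8, ' bytes'
    print '(a,i6,a)', 'Isend9/10  : ', mx*mz*(iorder/2)*bytes_real8, ' bytes'
    print '(a,i6,a)', 'Isend11/12 : ', mx*my*(iorder/2)*bytes_real8, ' bytes'
    ! 28800 bytes per face message, consistent with the ~28K message size seen
    ! in the 512-task Vampir statistics (if those runs also used 30^3 points)
  end program msg_size_check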

35
S3D Computation Modeling under Performance
Assertions
  if (neighbor(1).lt.0) then
    do k = 1, mz
      do j = 1, my
        df(1,j,k) = ae * ( f(2,j,k) - neg_f(4,j,k)) &
                  + be * ( f(3,j,k) - neg_f(3,j,k)) &
                  + ce * ( f(4,j,k) - neg_f(2,j,k)) &
                  + de * ( f(5,j,k) - neg_f(1,j,k))

        df(2,j,k) = ae * ( f(3,j,k) - f(1,j,k)) &
                  + be * ( f(4,j,k) - neg_f(4,j,k)) &
                  + ce * ( f(5,j,k) - neg_f(3,j,k)) &
                  + de * ( f(6,j,k) - neg_f(2,j,k))

        df(3,j,k) = ae * ( f(4,j,k) - f(2,j,k)) &
                  + be * ( f(5,j,k) - f(1,j,k)) &
                  + ce * ( f(6,j,k) - neg_f(4,j,k)) &
                  + de * ( f(7,j,k) - neg_f(3,j,k))

        df(4,j,k) = ae * ( f(5,j,k) - f(3,j,k)) ! ...
36
Need Initial PMaC Modeling Discussion
37
Need slides describing FLASH Science
38
FLASH Scaling on Intrepid
Weak Scaling
Strong Scaling
Turbulence-Driven Nuclear Burning
White Dwarf Deflagration
39
The Gyrokinetic Toroidal Code
  • 3D particle-in-cell code to study microturbulence
    in magnetically confined fusion plasmas
  • Solves the gyro-averaged Vlasov equation
  • Gyrokinetic Poisson equation solved in real space
  • Low noise δf method
  • Global code (full torus as opposed to only a flux
    tube)
  • Massively parallel: typical runs use 1024
    processors
  • Electrostatic (for now)
  • Nonlinear and fully self-consistent
  • Written in Fortran 90/95
  • Originally optimized for superscalar processors

40
Particle-in-Cell (PIC) Method
  • Particles sample distribution function.
  • The particles interact via a grid, on which the
    potential is calculated from deposited charges.
  • The PIC Steps (sketched as a loop below)
  • SCATTER, or deposit, charges on the grid
    (nearest neighbors)
  • Solve Poisson equation
  • GATHER forces on each particle from potential
  • Move particles (PUSH)
  • Repeat
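The cycle above written as a loop skeleton; this is an illustrative sketch
rather than GTC code, and all routine names are placeholders:

  program pic_skeleton
    implicit none
    integer, parameter :: nsteps = 10
    integer :: istep

    do istep = 1, nsteps
       call scatter_charge()   ! deposit particle charge onto nearby grid points
       call solve_poisson()    ! solve the Poisson equation on the grid
       call gather_forces()    ! interpolate fields/forces back to the particles
       call push_particles()   ! advance particle positions and velocities
    end do

  contains
    subroutine scatter_charge()
    end subroutine
    subroutine solve_poisson()
    end subroutine
    subroutine gather_forces()
    end subroutine
    subroutine push_particles()
    end subroutine
  end program pic_skeleton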

41
Charge Deposition for Charged Rings: 4-Point
Average Method
Point-charge particles replaced by charged rings
due to gyro-averaging
42
Application Team's Flagship Code: The Gyrokinetic
Toroidal Code (GTC)
  • Fully global 3D particle-in-cell code (PIC) in
    toroidal geometry
  • Developed by Prof. Zhihong Lin (now at UC Irvine)
  • Used for non-linear gyrokinetic simulations of
    plasma microturbulence
  • Fully self-consistent
  • Uses magnetic field-line-following coordinates
    (ψ, θ, ζ) [Boozer, 1981]
  • Guiding center Hamiltonian [White and Chance,
    1984]
  • Non-spectral Poisson solver [Lin and Lee, 1995]
  • Low numerical noise algorithm (δf method)
  • Full torus (global) simulation
  • Scales to a very large number of processors
  • Excellent theoretical tool!

43
Measurement Effort for GTC Is Ongoing
  • Initial measurement effort focused on Jaguar
  • Verified that load balance and runtime improved
    with correction to particle initialization (see
    next slide)
  • Version that uses ADIOS builds and runs with both
    MPI only and hybrid MPI/OpenMP
  • Currently gathering measurements of MPI only with
    TAU
  • Hybrid MPI/OpenMP version instrumented with most
    recent TAU builds but crashes with segmentation
    fault
  • Some very preliminary measurements on Intrepid
  • Application team requested that evaluation wait
    until additional optimization effort completed
  • PERI optimization of OpenMP loops improved
    performance 15-20%
  • Application team working to provide alternate
    code base that is expected to improve scaling
    significantly

44
TAU Time Profiles of GTC with Different Particle
Initializations Showing Load Imbalances
Corrected particle initialization results in less
severe load imbalance
Load imbalance due to incorrect particle
initialization
128 process runs on Jaguar
Profiling helps ensure that a valid version is
used for modeling
45
Cover miscellaneous issues
  • PDSI Collaboration

46
Going Forward
  • Need to discuss what remains to be done for first
    three codes
  • Need to discuss what codes should be next

47
Need initial conclusion