Title: VTF Applications Performance and Scalability
1. VTF Applications Performance and Scalability
- Sharon Brunett
- CACR/Caltech
- ASCI Site Review
- October 28-29, 2003
2. ASCI Platform Specifics
- LLNL's IBM SP3 (Frost)
  - 65-node SMP, 375 MHz Power3 Nighthawk-2 (16 CPUs/node)
  - 16 GB memory/node
  - 20 TB global parallel file system
  - SP Switch2 (Colony) interconnect
  - 2 GB/sec bi-directional node-to-node bandwidth
- LANL's HP/Compaq AlphaServer ES45 (QSC)
  - 256-node SMP, 1.25 GHz Alpha EV6 (4 CPUs/node)
  - 16 GB memory/node
  - 12 TB global file system
  - Quadrics network interconnect (QsNet)
    - 2 μs latency
    - 300 MB/sec bandwidth
3. Multiscale Polycrystal Studies
- Quantitative assessment of microstructural
effects in macroscopic material response through
the computation of full-field solutions of
polycrystals
- Inhomogeneous plastic deformation fields
- Grain-boundary effects
- Stress concentration
- Dislocation pile-up
- Constraint-induced multislip
- Size dependence (inverse) Hall-Petch effect
- Resolve (as opposed to model) mesoscale behavior, exploiting the power of high-performance computing
- Enable full-scale simulation of engineering systems incorporating micromechanical effects
4. Mesh Generation
- In-grain subdivision behavior can be simulated in both single crystals and polycrystals
- Texture simulation results agree well with experimental results
- Mesh generation method keeps the topology of individual grain shapes
  - Enables effective interactions between grains
- Increasing the grain count in polycrystals gives a more stable mechanical response
Single grain corresponding to a single cell in a
crystal
5. 1.5 Million Element, 1241 Grain Multiscale Polycrystal Simulation
Simulation carried out on 1024 processors of LLNL's IBM SP3, Frost
6. Multiscale Polycrystal Performance
- Aggregate parallel performance (a percent-of-peak note follows this list)
- LANL's QSC
  - Floating point operations: 10.67% of peak
  - Integer operations: 15.39% of peak
  - Memory operations: 22.08% of peak
  - DCPI hardware counters used to collect data
  - Qopcounter tool used to analyze the DCPI database
- LLNL's Frost
  - L1 cache hit rate: 98%
    - Load/store instructions executed without main memory access
  - Load/Store Unit idle: 36%
  - Floating point operations: 4.47% of peak
  - Hpmcount tool used to count hardware events during program execution
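As a note on reading the percent-of-peak figures above, the small C sketch below converts a measured node-level FLOP rate into a fraction of theoretical peak. The peak rates assumed here (2 FP ops/cycle per QSC Alpha CPU, 4 FP ops/cycle per Frost Power3 CPU) and the measured value are illustrative assumptions, not numbers taken from these runs.

    /* Hedged illustration: turning a measured FLOP rate into "percent of peak".
     * Peak figures are assumptions for illustration only:
     *   QSC node:   4 CPUs x 1.25 GHz x 2 FP ops/cycle  = 10 GFLOP/s
     *   Frost node: 16 CPUs x 0.375 GHz x 4 FP ops/cycle = 24 GFLOP/s
     */
    #include <stdio.h>

    static double percent_of_peak(double measured_gflops, double peak_gflops)
    {
        return 100.0 * measured_gflops / peak_gflops;
    }

    int main(void)
    {
        const double qsc_peak   = 4.0 * 1.25 * 2.0;    /* assumed GFLOP/s per QSC node */
        const double frost_peak = 16.0 * 0.375 * 4.0;  /* assumed GFLOP/s per Frost node */

        double measured = 0.5;  /* hypothetical node-level rate, e.g. from a counter tool */

        printf("QSC:   %.2f%% of peak\n", percent_of_peak(measured, qsc_peak));
        printf("Frost: %.2f%% of peak\n", percent_of_peak(measured, frost_peak));
        return 0;
    }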
7. Multiscale Polycrystal Performance II
- MPI routines can consume 30% of runtime for large runs on Frost
  - Workload imbalance as grains are distributed across nodes
  - MPI_Waitall every step dominates communication time (see the sketch after this list)
  - Nearest-neighbor sends take longer from nodes with computationally heavy grains
- Routines taking the most CPU time on QSC
  - resolved_fcc_cuitino: 18.85%
  - upslip_fcc_cuitino_explicit: 11.74%
  - setafcc: 9.16%
  - matvec: 8.5%
  - 50% of execution time in 4 routines
- Room for performance improvement with better load balancing and routine-level optimization
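The slides do not include the exchange code itself; the C/MPI sketch below is a hedged illustration (hypothetical function and buffer layout, not the VTF source) of the pattern described above: nonblocking nearest-neighbor sends and receives closed by an MPI_Waitall each step, so a rank carrying computationally heavy grains delays its neighbors and the delay shows up as wait time.

    /* Hypothetical nearest-neighbor exchange ending in MPI_Waitall.
     * Ranks whose local grains take longer post their sends late, so lightly
     * loaded neighbors accumulate the delay as MPI_Waitall time.
     */
    #include <mpi.h>
    #include <stdlib.h>

    double exchange_boundary(double *send_buf, double *recv_buf, int n,
                             const int *neighbors, int num_neighbors,
                             MPI_Comm comm)
    {
        MPI_Request *reqs = malloc(2 * num_neighbors * sizeof(MPI_Request));

        /* Post receives first so matching sends can complete promptly. */
        for (int i = 0; i < num_neighbors; ++i)
            MPI_Irecv(&recv_buf[i * n], n, MPI_DOUBLE, neighbors[i], 0, comm,
                      &reqs[i]);

        for (int i = 0; i < num_neighbors; ++i)
            MPI_Isend(&send_buf[i * n], n, MPI_DOUBLE, neighbors[i], 0, comm,
                      &reqs[num_neighbors + i]);

        /* Per-step synchronization point: imbalance shows up here as wait time. */
        double t0 = MPI_Wtime();
        MPI_Waitall(2 * num_neighbors, reqs, MPI_STATUSES_IGNORE);
        double wait_time = MPI_Wtime() - t0;

        free(reqs);
        return wait_time;  /* candidate per-rank metric to log each step */
    }

Logging the returned wait time per rank over many steps would show which nodes host the heavy grains and how much of the MPI share of runtime is idle waiting.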
8. Multiscale Polycrystal Scaling on LLNL's IBM SP3, Frost
[Scaling plot; problem sizes in elements]
9. Multiscale Polycrystal Scaling on LANL's HP/Compaq, QSC
[Scaling plot; problem sizes in elements]
10. Scaling for Polycrystalline Copper in a Shear Compression Specimen Configuration
[Scaling plot; problem sizes in elements]
LANL's HP/Compaq QSC system
11. 3D Converging Shock Simulations in a Wedge
- 1024-processor ASCI Frost run of a converging shock. The interface is nominally a 2D ellipse perturbed with a prescribed spectrum and randomized phases.
- The 2D elliptical interface is computed using local shock polar analysis to yield a perfectly circular transmitted shock
- Resolution 2000x400x400, with over 1 TB of data generated
[Panels: Density, Pressure]
12. Density Field in a 3D Wedge
Density field in the wedge. The transmitted shock front appears to be stable while the gas interface is Richtmyer-Meshkov unstable. The simulation ran on 1024 processors of LLNL's IBM SP3, Frost, with a 2000x400x400 initial grid.
13. Wedge3D Performance on LLNL's IBM SP3, Frost
- Aggregate parallel performance for a 1400x280x280 grid
- LLNL's Frost
  - Floating point operations: 5.8% to 10% of peak, depending on node
  - Hpmcount tool used to count hardware events during program execution
- Most time-consuming communication calls
  - MPI_Wait() and MPI_Allreduce()
  - Accounting for 3% to 30% of runtime on a 128-way run
    - 175x70x70 grid per processor
- Occasional high MPI time on a few nodes seems to be caused by system daemons competing for resources (a timing sketch follows this list)
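As a hedged illustration of how such per-node MPI variability might be exposed, the sketch below times a single MPI_Allreduce on every rank and reduces the minimum, maximum, and average; it is not the Wedge3D instrumentation, just one way to make interference from other processes visible.

    /* Hedged sketch (not the Wedge3D source): time an MPI_Allreduce on each rank
     * and report the spread. A large max/min gap suggests load imbalance or
     * interference (e.g., system daemons) on a subset of nodes.
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double local = 1.0, global = 0.0;

        double t0 = MPI_Wtime();
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double dt = MPI_Wtime() - t0;

        /* Reduce the per-rank collective time to min/max/average on rank 0. */
        double dt_min, dt_max, dt_sum;
        MPI_Reduce(&dt, &dt_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
        MPI_Reduce(&dt, &dt_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&dt, &dt_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Allreduce time (s): min %.3e  max %.3e  avg %.3e\n",
                   dt_min, dt_max, dt_sum / nprocs);

        MPI_Finalize();
        return 0;
    }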
14. Wedge3D Scaling on LLNL's IBM SP3, Frost
[Scaling plot; grid sizes X x Y x Z]
15. Fragmentation 2D Scaling on LANL's HP/Compaq, QSC
[Scaling plots; problem sizes: 450K elements, 61K -> 915K elements, 85K -> 1.1M elements; levels of subdivision spanning 450K to 1.1M elements]
16. Crack Patterns in the Configuration Occurring During Scalability Studies on QSC
17. Fragmentation 2D Performance on LANL's HP/Compaq, QSC
- Procedures with highest CPU cycle consumption
  - element_driver: 14.9%
  - assemble: 13.9%
  - NewNeohookean: 8.12%
  - 16-processor run with 2 levels of subdivision (60K elements)
  - Dcpiprof tool used to profile the run
- Problems processing DCPI database FLOP rates for large runs
  - Reported to LANL support
  - Small runs yield 3% of peak FLOP rate
  - Only 10% spent in fragmentation routines!
- Much room for improvement in our I/O performance when dumping to the parallel file system (/scratch1,2); see the sketch below
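The slides do not describe how the dumps are currently written; as one hedged possibility for improving throughput to the parallel file system, the sketch below uses a collective MPI-IO write so that all ranks write disjoint blocks of a single shared file in one call. The function name, file layout, and equal-block-size assumption are illustrative, not the existing dump code.

    /* Hedged sketch: each rank writes its block of a field into one shared file
     * with a collective MPI-IO call, letting the parallel file system
     * (e.g., /scratch1) service the requests together.
     */
    #include <mpi.h>

    int dump_field(const double *local, int nlocal, const char *path,
                   MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        MPI_File fh;
        int rc = MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                               MPI_INFO_NULL, &fh);
        if (rc != MPI_SUCCESS)
            return rc;

        /* Disjoint offsets: rank r writes its nlocal doubles starting at element
         * r*nlocal. Assumes every rank holds the same nlocal; a real dump would
         * first exchange the per-rank counts. */
        MPI_Offset offset = (MPI_Offset)rank * nlocal * (MPI_Offset)sizeof(double);

        MPI_File_write_at_all(fh, offset, local, nlocal, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        return MPI_File_close(&fh);
    }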