Title: VTF Applications Performance and Scalability
1. VTF Applications Performance and Scalability
- Sharon Brunett
- CACR/Caltech
- ASCI Site Review
- October 28-29, 2003
2. ASCI Platform Specifics
- LLNL's IBM SP3 (Frost)
  - 65-node SMP, 375 MHz Power3 Nighthawk-2 (16 CPUs/node)
  - 16 GB memory/node
  - 20 TB global parallel file system
  - SP Switch2 (Colony) interconnect
  - 2 GB/sec bi-directional node-to-node bandwidth
- LANL's HP/Compaq AlphaServer ES45 (QSC)
  - 256-node SMP, 1.25 GHz Alpha EV6 (4 CPUs/node)
  - 16 GB memory/node
  - 12 TB global file system
  - Quadrics network interconnect (QsNet)
    - 2 μs latency
    - 300 MB/sec bandwidth
3. Multiscale Polycrystal Studies
- Quantitative assessment of microstructural
effects in macroscopic material response through
the computation of full-field solutions of
polycrystals
- Inhomogeneous plastic deformation fields
- Grain-boundary effects
- Stress concentration
- Dislocation pile-up
- Constraint-induced multislip
- Size dependence (inverse) Hall-Petch effect
- Resolve (as opposed to model) mesoscale behavior, exploiting the power of high-performance computing
- Enable full-scale simulation of engineering systems incorporating micromechanical effects
4. Mesh Generation
- In-grain subdivision behavior can be simulated in both single crystals and polycrystals
- Texture simulation results agree well with experimental results
- Mesh generation method keeps the topology of individual grain shapes
  - Enables effective interactions between grains
- Increasing the grain count in polycrystals gives a more stable mechanical response
Single grain corresponding to a single cell in a
crystal
5. 1.5 Million Element, 1241 Grain Multiscale Polycrystal Simulation
Simulation carried out on 1024 processors of LLNL's IBM SP3, Frost
6. Multiscale Polycrystal Performance
- Aggregate parallel performance (a percent-of-peak note follows this list)
- LANL's QSC
  - Floating point operations: 10.67% of peak
  - Integer operations: 15.39% of peak
  - Memory operations: 22.08% of peak
  - DCPI hardware counters used to collect data
  - Qopcounter tool used to analyze the DCPI database
- LLNL's Frost
  - L1 cache hit rate: 98%
    - Load/store instructions executed without main memory access
  - Load/Store Unit idle: 36%
  - Floating point operations: 4.47% of peak
  - Hpmcount tool used to count hardware events during program execution
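As a note on reading the percent-of-peak figures above, the small C sketch below converts a measured node-level FLOP rate into a fraction of theoretical peak. The peak rates assumed here (2 FP ops/cycle per QSC Alpha CPU, 4 FP ops/cycle per Frost Power3 CPU) and the measured value are illustrative assumptions, not numbers taken from these runs.

    /* Hedged illustration: turning a measured FLOP rate into "percent of peak".
     * Peak figures are assumptions for illustration only:
     *   QSC node:   4 CPUs x 1.25 GHz x 2 FP ops/cycle  = 10 GFLOP/s
     *   Frost node: 16 CPUs x 0.375 GHz x 4 FP ops/cycle = 24 GFLOP/s
     */
    #include <stdio.h>

    static double percent_of_peak(double measured_gflops, double peak_gflops)
    {
        return 100.0 * measured_gflops / peak_gflops;
    }

    int main(void)
    {
        const double qsc_peak   = 4.0 * 1.25 * 2.0;    /* assumed GFLOP/s per QSC node */
        const double frost_peak = 16.0 * 0.375 * 4.0;  /* assumed GFLOP/s per Frost node */

        double measured = 0.5;  /* hypothetical node-level rate, e.g. from a counter tool */

        printf("QSC:   %.2f%% of peak\n", percent_of_peak(measured, qsc_peak));
        printf("Frost: %.2f%% of peak\n", percent_of_peak(measured, frost_peak));
        return 0;
    }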
7. Multiscale Polycrystal Performance II
- MPI routines can consume 30% of runtime for large runs on Frost
  - Workload imbalance as grains are distributed across nodes
  - MPI_Waitall every step dominates communication time (see the sketch after this list)
  - Nearest-neighbor sends take longer from nodes with computationally heavy grains
- Routines taking the most CPU time on QSC
  - resolved_fcc_cuitino: 18.85%
  - upslip_fcc_cuitino_explicit: 11.74%
  - setafcc: 9.16%
  - matvec: 8.5%
  - 50% of execution time in 4 routines
- Room for performance improvement with better load balancing and routine-level optimization
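The slides do not include the exchange code itself; the C/MPI sketch below is a hedged illustration (hypothetical function and buffer layout, not the VTF source) of the pattern described above: nonblocking nearest-neighbor sends and receives closed by an MPI_Waitall each step, so a rank carrying computationally heavy grains delays its neighbors and the delay shows up as wait time.

    /* Hypothetical nearest-neighbor exchange ending in MPI_Waitall.
     * Ranks whose local grains take longer post their sends late, so lightly
     * loaded neighbors accumulate the delay as MPI_Waitall time.
     */
    #include <mpi.h>
    #include <stdlib.h>

    double exchange_boundary(double *send_buf, double *recv_buf, int n,
                             const int *neighbors, int num_neighbors,
                             MPI_Comm comm)
    {
        MPI_Request *reqs = malloc(2 * num_neighbors * sizeof(MPI_Request));

        /* Post receives first so matching sends can complete promptly. */
        for (int i = 0; i < num_neighbors; ++i)
            MPI_Irecv(&recv_buf[i * n], n, MPI_DOUBLE, neighbors[i], 0, comm,
                      &reqs[i]);

        for (int i = 0; i < num_neighbors; ++i)
            MPI_Isend(&send_buf[i * n], n, MPI_DOUBLE, neighbors[i], 0, comm,
                      &reqs[num_neighbors + i]);

        /* Per-step synchronization point: imbalance shows up here as wait time. */
        double t0 = MPI_Wtime();
        MPI_Waitall(2 * num_neighbors, reqs, MPI_STATUSES_IGNORE);
        double wait_time = MPI_Wtime() - t0;

        free(reqs);
        return wait_time;  /* candidate per-rank metric to log each step */
    }

Logging the returned wait time per rank over many steps would show which nodes host the heavy grains and how much of the MPI share of runtime is idle waiting.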
8. Multiscale Polycrystal Scaling on LLNL's IBM SP3, Frost
[Scaling plot; problem sizes in elements]
9. Multiscale Polycrystal Scaling on LANL's HP/Compaq, QSC
[Scaling plot; problem sizes in elements]
10. Scaling for Polycrystalline Copper in a Shear Compression Specimen Configuration
[Scaling plot; problem sizes in elements]
LANL's HP/Compaq QSC system
11. 3D Converging Shock Simulations in a Wedge
- 1024-processor ASCI Frost run of a converging shock. The interface is nominally a 2D ellipse perturbed with a prescribed spectrum and randomized phases.
- The 2D elliptical interface is computed using local shock polar analysis to yield a perfectly circular transmitted shock
- Resolution 2000x400x400, with over 1 TB of data generated
[Panels: Density, Pressure]
12. Density Field in a 3D Wedge
Density field in the wedge. The transmitted shock front appears to be stable while the gas interface is Richtmyer-Meshkov unstable. The simulation ran on 1024 processors of LLNL's IBM SP3, Frost, with a 2000x400x400 initial grid.
13. Wedge3D Performance on LLNL's IBM SP3, Frost
- Aggregate parallel performance for a 1400x280x280 grid
- LLNL's Frost
  - Floating point operations: 5.8% to 10% of peak, depending on node
  - Hpmcount tool used to count hardware events during program execution
- Most time-consuming communication calls
  - MPI_Wait() and MPI_Allreduce()
  - Accounting for 3% to 30% of runtime on a 128-way run
    - 175x70x70 grid per processor
- Occasional high MPI time on a few nodes seems to be caused by system daemons competing for resources (a timing sketch follows this list)
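As a hedged illustration of how such per-node MPI variability might be exposed, the sketch below times a single MPI_Allreduce on every rank and reduces the minimum, maximum, and average; it is not the Wedge3D instrumentation, just one way to make interference from other processes visible.

    /* Hedged sketch (not the Wedge3D source): time an MPI_Allreduce on each rank
     * and report the spread. A large max/min gap suggests load imbalance or
     * interference (e.g., system daemons) on a subset of nodes.
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double local = 1.0, global = 0.0;

        double t0 = MPI_Wtime();
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        double dt = MPI_Wtime() - t0;

        /* Reduce the per-rank collective time to min/max/average on rank 0. */
        double dt_min, dt_max, dt_sum;
        MPI_Reduce(&dt, &dt_min, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
        MPI_Reduce(&dt, &dt_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        MPI_Reduce(&dt, &dt_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("Allreduce time (s): min %.3e  max %.3e  avg %.3e\n",
                   dt_min, dt_max, dt_sum / nprocs);

        MPI_Finalize();
        return 0;
    }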
14. Wedge3D Scaling on LLNL's IBM SP3, Frost
[Scaling plot; grid sizes X x Y x Z]
15. Fragmentation 2D Scaling on LANL's HP/Compaq, QSC
[Scaling plots; problem sizes: 450K elements, 61K -> 915K elements, 85K -> 1.1M elements; levels of subdivision spanning 450K to 1.1M elements]
16. Crack Patterns in the Configuration Occurring During Scalability Studies on QSC
17. Fragmentation 2D Performance on LANL's HP/Compaq, QSC
- Procedures with highest CPU cycle consumption
  - element_driver: 14.9%
  - assemble: 13.9%
  - NewNeohookean: 8.12%
  - 16-processor run with 2 levels of subdivision (60K elements)
  - Dcpiprof tool used to profile the run
- Problems processing DCPI database FLOP rates for large runs
  - Reported to LANL support
  - Small runs yield 3% of peak FLOP rate
  - Only 10% spent in fragmentation routines!
- Much room for improvement in our I/O performance when dumping to the parallel file system (/scratch1,2); see the sketch below
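The slides do not describe how the dumps are currently written; as one hedged possibility for improving throughput to the parallel file system, the sketch below uses a collective MPI-IO write so that all ranks write disjoint blocks of a single shared file in one call. The function name, file layout, and equal-block-size assumption are illustrative, not the existing dump code.

    /* Hedged sketch: each rank writes its block of a field into one shared file
     * with a collective MPI-IO call, letting the parallel file system
     * (e.g., /scratch1) service the requests together.
     */
    #include <mpi.h>

    int dump_field(const double *local, int nlocal, const char *path,
                   MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        MPI_File fh;
        int rc = MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                               MPI_INFO_NULL, &fh);
        if (rc != MPI_SUCCESS)
            return rc;

        /* Disjoint offsets: rank r writes its nlocal doubles starting at element
         * r*nlocal. Assumes every rank holds the same nlocal; a real dump would
         * first exchange the per-rank counts. */
        MPI_Offset offset = (MPI_Offset)rank * nlocal * (MPI_Offset)sizeof(double);

        MPI_File_write_at_all(fh, offset, local, nlocal, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);
        return MPI_File_close(&fh);
    }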