Transcript and Presenter's Notes

Title: VTF Applications Performance and Scalability


1
VTF Applications Performance and Scalability
  • Sharon Brunett
  • CACR/Caltech
  • ASCI Site Review
  • October 28-29, 2003

2
ASCI Platform Specifics
  • LLNL's IBM SP3 (Frost)
  • 65-node SMP, 375 MHz Power3 Nighthawk-2
    (16 CPUs/node)
  • 16 GB memory/node
  • ~20 TB global parallel file system
  • SP Switch2 (Colony switch)
  • 2 GB/sec bidirectional node-to-node bandwidth
  • LANL's HP/Compaq AlphaServer ES45 (QSC)
  • 256-node SMP, 1.25 GHz Alpha EV6 (4 CPUs/node)
  • 16 GB memory/node
  • 12 TB global file system
  • Quadrics network interconnect (QsNet)
  • 2 μs latency
  • 300 MB/sec bandwidth

3
Multiscale Polycrystal Studies
  • Quantitative assessment of microstructural
    effects in macroscopic material response through
    the computation of full-field solutions of
    polycrystals
  • Inhomogeneous plastic deformation fields
  • Grain-boundary effects
  • Stress concentration
  • Dislocation pile-up
  • Constraint-induced multislip
  • Size dependence (inverse) Hall-Petch effect
  • Resolve (as opposed to model) mesoscale behavior
    exploiting the power of high-performance
    computing
  • Enable full-scale simulation of engineering
    systems incorporating micromechanical effects.

4
Mesh Generation
  • In-grain subdivision behavior can be simulated in
    both single crystals and polycrystals.
  • Texture simulation results agree well with
    experimental results
  • Mesh generation method keeps the topology of
    individual grain shapes
  • Enables effective interactions between grains
  • Increasing the grain count in polycrystals
    gives a more stable mechanical response.

Single grain corresponding to a single cell in a
crystal
5
1.5 Million Element, 1241 Grain Multiscale
Polycrystal Simulation
Simulation carried out on 1024 processors of
LLNL's IBM SP3, Frost
6
Multiscale Polycrystal Performance
  • Aggregate parallel performance
  • LANL's QSC
  • Floating point operations: 10.67% of peak
  • Integer operations: 15.39% of peak
  • Memory operations: 22.08% of peak
  • DCPI hardware counters used to collect data
  • Qopcounter tool used to analyze the DCPI database
  • LLNL's Frost
  • L1 cache hit rate: 98%
  • Load/store instructions executed w/o main memory
    access
  • Load/Store Unit idle: 36%
  • Floating point operations: 4.47% of peak
  • Hpmcount tool used to count hardware events
    during program execution
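
To make the percent-of-peak figures above concrete, here is a minimal C sketch of the underlying arithmetic, assuming the commonly quoted Power3 peak of 4 FLOPs per cycle (two FMA units) at 375 MHz; the measured FLOP count and elapsed time in the example are placeholders, not values from these runs.

/* Sketch: converting a measured FLOP count into "percent of peak".
 * Assumes Power3 peak of 4 FLOPs/cycle (two FMA units) at 375 MHz.
 * Measured values below are placeholders, not numbers from the runs. */
#include <stdio.h>

int main(void) {
    const double clock_hz      = 375.0e6;  /* Power3 Nighthawk-2 clock          */
    const double flops_per_cyc = 4.0;      /* assumed peak: 2 FMA units x 2 ops */
    const double peak_flops    = clock_hz * flops_per_cyc;  /* per CPU          */

    double measured_flops = 6.7e10;        /* placeholder: from hpmcount output */
    double elapsed_sec    = 1000.0;        /* placeholder: wall-clock seconds   */

    double achieved = measured_flops / elapsed_sec;
    printf("achieved %.2f MFLOP/s, %.2f%% of peak\n",
           achieved / 1.0e6, 100.0 * achieved / peak_flops);
    return 0;
}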

7
Multiscale Polycrystal Performance II
  • MPI routines can consume 30% of runtime for
    large runs on Frost
  • Workload imbalance as grains are distributed
    across nodes
  • MPI_Waitall calls every step dominate
    communication time (see the sketch below)
  • Nearest neighbor sends take longer from nodes
    with computationally heavy grains
  • Routines taking the most CPU time on QSC
  • resolved_fcc_cuitino: 18.85%
  • upslip_fcc_cuitino_explicit: 11.74%
  • setafcc: 9.16%
  • matvec: 8.5%
  • 50% of execution time in 4 routines
  • Room for performance improvement with better load
    balancing and routine-level optimization
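
A minimal C/MPI sketch of the nearest-neighbor exchange pattern described above: each rank posts nonblocking sends and receives to its neighbors, overlaps local work, then blocks in MPI_Waitall, so ranks paired with computationally heavy neighbors absorb the imbalance in the wait. The buffer layout and neighbor list are illustrative assumptions, not the VTF code's actual data structures.

/* Sketch of a nearest-neighbor exchange ending in MPI_Waitall.  If a
 * neighbor carries heavier grains, its sends arrive late and the wait
 * absorbs the imbalance, which is how MPI_Waitall comes to dominate the
 * measured communication time. */
#include <mpi.h>
#include <stdlib.h>

void exchange_boundary(double *send, double *recv, int count,
                       const int *neighbors, int nneighbors, MPI_Comm comm)
{
    MPI_Request *reqs = malloc(2 * nneighbors * sizeof(MPI_Request));

    for (int i = 0; i < nneighbors; ++i) {
        MPI_Irecv(recv + i * count, count, MPI_DOUBLE, neighbors[i], 0, comm,
                  &reqs[2 * i]);
        MPI_Isend(send + i * count, count, MPI_DOUBLE, neighbors[i], 0, comm,
                  &reqs[2 * i + 1]);
    }

    /* ... local computation on this rank's grains overlaps here ... */

    /* Ranks with light grains reach this point early and sit in the wait. */
    MPI_Waitall(2 * nneighbors, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}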

8
Multiscale Polycrystal Scaling on LLNL's IBM
SP3, Frost
Scaling plot vs. number of elements
9
Multiscale Polycrystal Scaling on LANL's
HP/Compaq, QSC
Scaling plot vs. number of elements
10
Scaling for Polycrystalline Copper in a Shear
Compression Specimen Configuration
Scaling plot vs. number of elements
LANL's HP/Compaq QSC system
11
3D Converging Shock Simulations in a Wedge
  • 1024 processor ASCI Frost run of a converging
    shock. The interface is nominally a 2D ellipse
    perturbed with a prescribed spectrum and
    randomized phases.
  • The 2D elliptical interface is computed using
    local shock polar analysis to yield a perfectly
    circular transmitted shock
  • Resolution: 2000x400x400, with over 1 TB of
    data generated (see the back-of-envelope sketch
    below)
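
As a rough back-of-envelope check on that data volume, the sketch below multiplies out the 2000x400x400 grid; the number of fields per cell and the snapshot count are assumptions chosen only to show how the total reaches the terabyte range.

/* Back-of-envelope sketch of the data volume for the 2000x400x400 run.
 * Field count and snapshot count are assumptions for illustration only. */
#include <stdio.h>

int main(void) {
    long long cells   = 2000LL * 400 * 400;  /* 320 million cells            */
    int fields        = 5;                   /* assumed: density, pressure,
                                                3 velocity components        */
    int bytes_per_val = 8;                   /* double precision             */
    int snapshots     = 100;                 /* assumed output frequency     */

    double per_snapshot = (double)cells * fields * bytes_per_val;  /* ~12.8 GB */
    double total        = per_snapshot * snapshots;                /* ~1.3 TB  */

    printf("%.1f GB per snapshot, %.2f TB total\n",
           per_snapshot / 1e9, total / 1e12);
    return 0;
}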

Figure panels: density and pressure fields
12
Density Field in a 3D Wedge
Density field in the Wedge. The transmitted shock
front appears to be stable while the gas
interface is Richtmyer-Meshkov unstable. The
simulation took place on 1024 processors of
LLNL's IBM SP3, Frost, with a 2000x400x400 initial
grid.
13
Wedge3D Performance on LLNL's IBM SP3, Frost
  • Aggregate parallel performance for 1400x280x280
    grid
  • LLNL's Frost
  • Floating point operations: 5.8% to 10% of peak,
    depending on node
  • Hpmcount tool used to count hardware events
    during program execution
  • Most time-consuming communication calls:
  • MPI_Wait() and MPI_Allreduce()
  • Accounting for 3% to 30% of runtime on a 128-way
    run (see the sketch below)
  • 175x70x70 grid per processor
  • Occasional high MPI time on a few nodes seems to
    be caused by system daemons competing for
    resources
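
A minimal sketch of how per-rank time in MPI_Allreduce can be attributed with MPI_Wtime; it is only an illustration of the accounting behind the 3% to 30% spread, not the instrumentation used for these runs.

/* Sketch: manually attributing time to MPI_Allreduce per rank.  Nodes
 * slowed by system daemons show up as outliers in the reported fraction. */
#include <mpi.h>
#include <stdio.h>

static double allreduce_time = 0.0;

void timed_allreduce(const double *in, double *out, int count, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    MPI_Allreduce(in, out, count, MPI_DOUBLE, MPI_SUM, comm);
    allreduce_time += MPI_Wtime() - t0;
}

void report(double total_runtime, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    printf("rank %d: %.1f%% of runtime in MPI_Allreduce\n",
           rank, 100.0 * allreduce_time / total_runtime);
}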

14
Wedge3D Scaling on LLNL's IBM SP3, Frost
Scaling plot vs. grid size (X x Y x Z)
15
Fragmentation 2D Scaling on LANL's HP/Compaq,
QSC
Scaling plots for levels of subdivision from 450K
to 1.1M elements (450K; 61K → 915K; 85K → 1.1M)
16
Crack Patterns in the Configuration Occurring
During Scalability Studies on QSC
17
Fragmentation 2D Performance on LANL's
HP/Compaq, QSC
  • Procedures with highest CPU cycle consumption
  • element_driver: 14.9%
  • assemble: 13.9%
  • NewNeohookean: 8.12%
  • 16 processor run with 2 levels of subdivision
    (60K elements)
  • Dcpiprof tool used to profile the run
  • Problems processing the DCPI database for FLOP
    rates on large runs
  • reported to LANL support
  • small runs yield 3% of FLOP peak
  • Only 10% spent in fragmentation routines!
  • Much room for improvement in our I/O performance
    dumping to the parallel file system
    (/scratch1,2); see the sketch below
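
One common way to improve dumps to a parallel file system such as /scratch is collective MPI-IO, sketched below in C. This is an illustrative pattern, not the application's current I/O path; the file name and data layout are assumed.

/* Sketch: collective MPI-IO dump to a parallel file system.  Each rank
 * writes its contiguous slab at its own offset; the collective call lets
 * the MPI-IO layer aggregate requests into large stripes. */
#include <mpi.h>

void dump_field(const double *local, long long local_n, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_File fh;
    MPI_File_open(comm, "scratch_dump.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * local_n * sizeof(double);
    MPI_File_write_at_all(fh, offset, local, (int)local_n, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
}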