Title: Parallel Multi-Reference Configuration Interaction on JAZZ
1. Parallel Multi-Reference Configuration Interaction on JAZZ
- Ron Shepard (CHM), Mike Minkoff (MCS), Mike Dvorak (MCS)
2. The COLUMBUS Program System
- Molecular Electronic Structure
- Collection of individual programs that communicate through external files:
  1. Atomic-Orbital Integral Generation
  2. Orbital Optimization (MCSCF, SCF)
  3. Integral Transformation
  4. MR-SDCI
  5. CI Density
  6. Properties (energy gradient, geometry optimization)
3. Real Symmetric Eigenvalue Problem
- Use the iterative Davidson Method for the lowest (or lowest few) eigenpairs
- Direct CI: H is not explicitly constructed; the products w = Hv are constructed in operator form
- Matrix dimensions are 10^4 to 10^9
- All floating point calculations are 64-bit
4. Davidson Method
  Generate an initial vector x_1
  MAINLOOP: DO n = 1, NITER
     Compute and save w_n = H x_n
     Compute the nth row and column of G = X^T H X = W^T X
     Compute the subspace Ritz pair: (G - λ1) c = 0
     Compute the residual vector r = W c - λ X c
     Check for convergence using r, c, λ, etc.
     IF (converged) THEN
        EXIT MAINLOOP
     ELSE
        Generate a new expansion vector x_{n+1} from r, λ, v = Xc, etc.
     ENDIF
  ENDDO MAINLOOP
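The sketch below is a minimal dense-matrix version of the loop above, assuming NumPy; the test matrix, tolerance, and diagonal preconditioner are illustrative choices, not taken from the COLUMBUS code, which builds w = H x in operator form and never stores H.

```python
# Minimal dense-matrix sketch of the Davidson loop above (illustrative only).
import numpy as np

def davidson_lowest(A, niter=100, tol=1e-8):
    """Lowest eigenpair of the real symmetric matrix A."""
    n = A.shape[0]
    x = np.zeros(n); x[np.argmin(np.diag(A))] = 1.0   # initial vector x_1
    X = x[:, None]                             # expansion vectors
    W = A @ X                                  # saved products w_n = H x_n
    for _ in range(niter):
        G = X.T @ W                            # subspace matrix G = X^T H X
        lam, C = np.linalg.eigh(G)             # solve (G - lambda*1) c = 0
        lam0, c = lam[0], C[:, 0]              # lowest Ritz pair
        r = W @ c - lam0 * (X @ c)             # residual r = W c - lambda X c
        if np.linalg.norm(r) < tol:            # convergence check
            break
        d = r / (np.diag(A) - lam0 + 1e-12)    # Davidson diagonal preconditioner
        d -= X @ (X.T @ d)                     # orthogonalize against X
        X = np.hstack([X, (d / np.linalg.norm(d))[:, None]])
        W = np.hstack([W, (A @ X[:, -1])[:, None]])
    return lam0, X @ c
```

In PCIUDG the same loop runs with vectors of dimension up to 10^9, segmented and distributed, with w assembled from integrals and coupling coefficients rather than from a stored matrix.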
5. Matrix Elements
- H_mn = <m| H_op |n>
- |n> = φ(r_1)σ_1 φ(r_2)σ_2 ... φ(r_n)σ_n with σ_j = α, β
6. Matrix Elements
- H_mn = Σ_pq h_pq <m|E_pq|n> + (1/2) Σ_pqrs g_pqrs <m|e_pqrs|n>
- h_pq and g_pqrs are computed and stored as arrays (with index symmetry)
- <m|E_pq|n> and <m|e_pqrs|n> are coupling coefficients; these are sparse and are recomputed as needed
7. Matrix-Vector Products
- w = H x
- The challenge is to bring together the different factors in order to compute w efficiently
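As an illustration of the operator-form product, the sketch below accumulates the one-electron part of w = H x from sparse coupling coefficients, assuming they are supplied as (m, n, p, q, value) tuples; the names and data layout are hypothetical, not the COLUMBUS structures.

```python
# Sketch: forming the one-electron contribution to w = H x without building H.
import numpy as np

def sigma_one_electron(coeffs, h, x):
    """w_m += sum_{n,p,q} h_pq <m|E_pq|n> x_n for sparse coefficient tuples."""
    w = np.zeros_like(x)
    for m, n, p, q, val in coeffs:    # val = <m|E_pq|n> (the real code recomputes these as needed)
        w[m] += h[p, q] * val * x[n]  # combine integral, coefficient, and vector element
    return w
```

The two-electron term combines g_pqrs with <m|e_pqrs|n> in the same way; the challenge noted above is organizing these loops so that the integrals, coefficients, and vector segments needed at any moment are brought together efficiently.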
8. Coupling Coefficient Evaluation
- Graphical Unitary Group Approach (GUGA)
- Define a directed graph with nodes and arcs: the Shavitt Graph
- Nodes correspond to spin-coupled states consisting of a subset of the total number of orbitals
- Arcs correspond to the (up to) four allowed spin couplings when an orbital is added to the graph
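To make the graph construction concrete, here is a rough sketch of enumerating Shavitt-graph nodes by Paldus (a, b, c) values using the standard GUGA step increments; the parameters and the pruning rule are simplified assumptions and not the COLUMBUS DRT code.

```python
# Rough sketch: enumerate Shavitt-graph nodes (Paldus (a, b, c) rows) level by
# level from the graph tail, using the four GUGA step increments. Simplified;
# a real DRT construction also prunes nodes that cannot reach the graph head.
STEPS = ((0, 0, 1),   # d = 0: orbital empty
         (0, 1, 0),   # d = 1: singly occupied, b raised
         (1, -1, 1),  # d = 2: singly occupied, b lowered
         (1, 0, 0))   # d = 3: doubly occupied

def shavitt_levels(n_orb, n_elec, two_s):
    head = ((n_elec - two_s) // 2, two_s, n_orb - (n_elec + two_s) // 2)
    levels = [{(0, 0, 0)}]                  # graph tail: no orbitals added yet
    for _ in range(n_orb):
        nxt = set()
        for a, b, c in levels[-1]:
            for da, db, dc in STEPS:
                node = (a + da, b + db, c + dc)
                # keep nodes that can still match the head (a and c never decrease)
                if node[1] >= 0 and node[0] <= head[0] and node[2] <= head[2]:
                    nxt.add(node)
        levels.append(nxt)
    return levels, head

# Example with assumed parameters: 4 orbitals, 4 electrons, singlet
levels, head = shavitt_levels(4, 4, 0)
print(head in levels[-1], [len(s) for s in levels])
```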
9Coupling Coefficient Evaluation
? graph head
Internal orbitals
?w,x,y,z
External orbitals
?graph tail
10. Coupling Coefficient Evaluation
11. Integral Types (by number of external orbital indices)
- 0: g_pqrs
- 1: g_pqra
- 2: g_pqab, g_paqb
- 3: g_pabc
- 4: g_abcd
12. Original Program (1980)
- Need to optimize wave functions for N_csf = 10^5 to 10^6
- Available memory was typically 10^5 words
- Must segment the vectors, v and w, and partition the matrix H into subblocks, then work with one subblock at a time
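A schematic of the subblocking idea: if v and w are cut into segments, each H subblock contributes to exactly one output segment, so only one subblock (and two segments) need to be resident at a time. The segment/block interface below is an illustrative assumption.

```python
# Schematic of the segmented product: w_I = sum_J H_IJ v_J, processing one
# H subblock at a time so only a small piece of the problem is in memory.
# make_block(I, J) stands in for constructing that subblock in operator form.
import numpy as np

def blocked_matvec(v_segments, make_block):
    w_segments = [np.zeros_like(v) for v in v_segments]
    for I in range(len(v_segments)):
        for J in range(len(v_segments)):
            H_IJ = make_block(I, J)               # only this subblock exists now
            w_segments[I] += H_IJ @ v_segments[J]
    return w_segments
```

In the parallel programs described next, each such (I, J) subblock becomes an independent compute task.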
13. First Parallel Program (1990)
- Networked workstations using TCGMSG
- Each matrix subblock corresponds to a compute task
- Different tasks require different resources (pay attention to load balancing)
- Same vector segmentation for all g_pqrs types
- g_pqrs, <m|e_pqrs|n>, w, and v were stored on external shared files (file contention bottlenecks)
14. Current Parallel Program
- Eliminate shared-file I/O by distributing data across the nodes with the GA Library
- Parallel efficiency depends on the vector segmentation and the corresponding H subblocking
- Apply different vector segmentation for different g_pqrs types
- Tasks are timed each Davidson iteration, then sorted into decreasing order and reassigned for the next iteration in order to optimize load balancing
- Manual tuning of the segmentation is required for optimal performance
- Capable of optimizing expansions up to N_csf = 10^9
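One plausible reading of the timing-and-reassignment step is a greedy largest-first heuristic: sort the tasks by the time measured in the previous iteration, then hand each task to the currently least-loaded process. The sketch below illustrates that idea; the actual PCIUDG scheduling may differ.

```python
# Illustrative load-balancing sketch: tasks timed in the previous Davidson
# iteration are sorted into decreasing order and greedily assigned so the most
# expensive tasks are spread across processes (not the actual PCIUDG code).
import heapq

def reassign_tasks(task_times, n_procs):
    """task_times: {task_id: seconds}; returns {task_id: process rank}."""
    loads = [(0.0, p) for p in range(n_procs)]          # (accumulated time, rank)
    heapq.heapify(loads)
    assignment = {}
    for task, t in sorted(task_times.items(), key=lambda kv: kv[1], reverse=True):
        load, p = heapq.heappop(loads)                  # least-loaded process so far
        assignment[task] = p
        heapq.heappush(loads, (load + t, p))
    return assignment
```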
15. COLUMBUS-Petaflops Application
- Mike Dvorak, Mike Minkoff (MCS Division)
- Ron Shepard (Chemistry Division)
- Argonne National Lab
16. Notes on Software Engineering
- PCIUDG parallel code
  - Fortran 77/90
  - Compiled with Intel/Myrinet on Jazz
  - 70k lines in PCIUDG
  - 14 files containing 205 subroutines
- Versioning system
  - Currently distributed in a tar file
  - Created an LCRC CVS repository for personal code mods
17. Notes on Software Engineering (cont.)
- Homegrown preprocessing system
  - Uses mdcif parallel statements to comment/uncomment parts of the code
  - Could/should be replaced with CPP directives
- Global Arrays library
  - Provides a global address space for matrix computation
  - Used mainly for chemistry codes, but applicable to other applications
  - Ran with the most current version --> no performance gain
  - Installed in SoftEnv on Jazz (version 3.2.6)
18. Gprof Output
- 270 subroutines called
- The loopcalc subroutine uses 20% of simulation time
- Added user-defined MPE states to 50 loopcalc calls
  - A challenge due to the large number of subroutines in the file
- The 2 GB file size is a severe limit on the number of procs
  - Broken logging
- Show actual output
19. Jumpshot/MPE Instrumentation
- Live demo of a 20-proc run
20. Using FPMPI
- Relinked the code with FPMPI
- Tells you the total number of MPI calls made
- Output file size is smaller (compared to other tools, e.g. Jumpshot)
- Produces a histogram of message sizes
- Not installed in SoftEnv on Jazz yet
  - riley/fpmpi-2.0
- Problem used for the runs
  - Double-zeta C2H4 without optimizing the load balance
21. Total Number of MPI Calls
22. Max/Avg MPI Complete Time
23. Avg/Max Time in MPI Barrier
24. COLUMBUS Performance Results
25. COLUMBUS Performance Data
- R. Shepard, M. Dvorak, M. Minkoff
26. Timing of Steps (sec)

Basis Set   Integral Time   Orbital Opt. Time   CI Time
QZ          388             11,806              382,221
TZ          26              104                 31,415
DZ          1               34                  3,281
27. Walks vs. Basis Set (Millions)

Basis Set   Z      Y    X     W     Matrix Dim.
cc-pVQZ     0.08   15   536   305   858
cc-pVTZ     0.08   7    120   69    198
cc-pVDZ     0.08   2    13    8     24
28. Timing of CI Iteration
29. Basic Model of Performance
- Time = C1 + C2*N + C3/N
30. Constrained Linear Term
- C2 > 0
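As a worked example of the model on the two slides above, the sketch below fits Time(N) = C1 + C2*N + C3/N to per-iteration timings with the linear term constrained to be nonnegative. Here N is taken to be the process count and the timing data are synthetic, for illustration only.

```python
# Fit the performance model Time(N) = C1 + C2*N + C3/N with C2 constrained >= 0.
# N is assumed to be the number of processes; the timings below are made up.
import numpy as np
from scipy.optimize import lsq_linear

N = np.array([4.0, 8.0, 16.0, 32.0, 64.0])          # process counts (example)
T = np.array([270.0, 140.0, 78.0, 49.0, 38.0])      # seconds per CI iteration (synthetic)

A = np.column_stack([np.ones_like(N), N, 1.0 / N])  # columns for C1, C2, C3
fit = lsq_linear(A, T, bounds=([-np.inf, 0.0, -np.inf], np.inf))
C1, C2, C3 = fit.x
print(f"Time(N) ~ {C1:.1f} + {C2:.3f}*N + {C3:.1f}/N")
```

The C3/N term captures work divided among processes, while a nonnegative linear term represents overhead that grows with N, consistent with the constraint C2 > 0 on the final slide.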