Title: Parallel Methods for Nano/Materials Science Applications
1. Parallel Methods for Nano/Materials Science Applications (Electronic Structure Calculations)
- Andrew Canning
- Computational Research Division, LBNL
- UC Davis, Applied Science Dept.
2. Outline
- Introduction to Nano/Materials Science
- Electronic Structure Calculations (DFT)
- Parallelism for the Plane Wave Approach
- Code Performance on High Performance Parallel Computers
- New Methods and Applications
3. Milestones in Parallel Calculations
- 1991: Silicon surface reconstruction (7x7), Phys. Rev. (Stich, Payne, King-Smith, Lin, Clarke), Meiko i860, 64-processor Computing Surface; (Brommer, Needels, Larson, Joannopoulos), Thinking Machines CM2, 16,384 bit processors
- 1998: FeMn alloys (exchange bias), Gordon Bell prize SC98 (Ujfalussy, Stocks, Canning, Y. Wang, Shelton et al.), Cray T3E, 1500 procs., first > 1 Tflop simulation
- 2006: 1000-atom Molybdenum simulation with Qbox, SC05 Gordon Bell prize for peak performance (F. Gygi et al.), 207 Tflops on IBM BlueGene/L (LLNL)
- 2008: New algorithm to enable 400 TFlop/s sustained performance in simulations of disorder effects in high-Tc superconductors, SC08 Gordon Bell prize (Thomas C. Schulthess et al.), Cray XT5 (ORNL), first > 1 Pflop simulation; Linear Scaling Divide-and-Conquer Electronic Structure Calculations (L-W. Wang et al.), SC08 Gordon Bell prize (algorithmic innovation)
4. Electronic Structure Calculations
- Accurate quantum mechanical treatment for the electrons
- Each electron represented on a grid or with some basis functions (e.g. Fourier components)
- Compute intensive: each electron requires ~1 million points/basis functions (need 100s of electrons)
- 70-80% of NERSC Materials Science computer time (first-principles electronic structure)

[Figure: Ba2YCl7:Ce excited state (new scintillator for gamma radiation detection)]
5. Motivation for Electronic Structure Calculations
- Most materials properties are only understood at a fundamental level from accurate electronic structure calculations (strength, cohesion etc.)
- Many properties are purely electronic, e.g. optical properties (lasers)
- Complements experiments
- Computer design of materials at the nanoscale
6. Materials Science Methods
- Many-body quantum mechanical approach (Quantum Monte Carlo): 20-30 atoms
- Single-particle QM (Density Functional Theory), no free parameters: 100-1000 atoms
- Empirical QM models, e.g. tight binding: 1000-5000 atoms
- Empirical classical potential methods: thousands to millions of atoms
- Continuum methods

(ordered by increasing number of atoms)

Single-particle DFT methods are the largest user of supercomputer cycles of any scientific method in any discipline.
7. Quantum Mechanics for Molecules and Crystals
Many-body Schrodinger equation for electrons (positions r_i) in a molecule or crystal with nuclei of charge Z_I and positions R_I (natural units, e = hbar = 1 etc.):

  H = -1/2 sum_i grad_i^2                  (kinetic energy)
      - sum_{i,I} Z_I / |r_i - R_I|        (electron-nuclei interaction)
      + 1/2 sum_{i!=j} 1 / |r_i - r_j|     (electron-electron interaction)

- Solution of the eigenfunction equation H Psi = E Psi (H is the Hamiltonian) for the wavefunction Psi gives complete information for the system.
- Observables take the form of operators, e.g. momentum p (classical form mv) has operator p_op = -i hbar grad; the observed p is given by the solution of p_op Psi = p Psi.
- The lowest eigenpair (E0, Psi0) is the ground state. Higher eigenpairs correspond to excited states of the system.
8. Ab Initio Method: Density Functional Theory (Kohn, 1998 Nobel Prize)
Many-body Schrodinger equation: exact but exponential scaling.

Kohn-Sham equations (1965): the many-body ground state problem can be mapped onto a single-particle non-linear problem with the same electron density and a different effective potential (cubic scaling):

  [-1/2 grad^2 + V_eff(r)] psi_i(r) = e_i psi_i(r),    rho(r) = sum_i |psi_i(r)|^2

Use the Local Density Approximation (LDA) for the exchange-correlation potential (good for Si, C). rho is the charge density.
9. Self-consistent Solution
Fix V(rho, r): a linear eigenproblem. N electrons: N wave functions, the lowest N eigenfunctions.

Iterate to self-consistency until V_out = V_in.
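The self-consistency loop above can be sketched in a few lines. This is an illustrative toy, not the production algorithm: the eigensolve and potential construction are replaced by simple stand-in maps (the function names and the toy data are invented), but the fixed-V solve / new-potential / mixing / convergence-check structure is the one used in practice.

```python
# Toy self-consistent field (SCF) loop: the structure of the DFT iteration,
# with the expensive eigensolve replaced by a cheap stand-in map.

def density_from_potential(v):
    """Stand-in for: solve H[V] psi_i = e_i psi_i for the lowest N states
    and build the charge density from them. Here just a smooth toy map."""
    return [1.0 / (1.0 + x * x) for x in v]

def potential_from_density(rho):
    """Stand-in for building the Hartree + exchange-correlation potential."""
    return [0.5 * r for r in rho]

def scf(v_in, mix=0.3, tol=1e-10, max_iter=200):
    for it in range(max_iter):
        rho = density_from_potential(v_in)    # fixed V -> linear eigenproblem
        v_out = potential_from_density(rho)   # new effective potential
        err = max(abs(a - b) for a, b in zip(v_out, v_in))
        if err < tol:                         # self-consistent: V_out == V_in
            return v_in, it
        # Linear mixing of old and new potential stabilises the iteration
        v_in = [(1 - mix) * a + mix * b for a, b in zip(v_in, v_out)]
    raise RuntimeError("SCF did not converge")

v, iters = scf([1.0, 2.0, 3.0])
```

The mixing step matters: feeding V_out straight back in can oscillate or diverge in real calculations, so production codes mix old and new potentials (or use more sophisticated schemes such as Pulay mixing).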
10. Choice of Basis for DFT (LDA)

[Figure: basis sets ordered by increasing basis size M (Gaussian, FLAPW, Fourier, real-space grid) against the percentage of eigenpairs needed, M/N (from ~30 down to ~2). Eigensolvers: direct (ScaLAPACK) or iterative.]
11. Plane-wave Pseudopotential Method in DFT
Solve the Kohn-Sham equations self-consistently for the electron wavefunctions within the Local Density Approximation.
1. Plane-wave (Fourier) expansion of the wavefunctions up to an energy cutoff, |g| < g_cut (a sphere in Fourier space)
2. Replace the frozen atom core (core electrons and nucleus) by a pseudopotential (smooth pseudo-wavefunctions in the core region allow the cutoff to be reduced without loss of accuracy)

Different parts of the Hamiltonian are calculated in different spaces (Fourier and real) via 3d FFTs.
12. Computational Considerations
- H is never computed explicitly (available through the mat-vec product)
- Matrix-vector product in N log N steps (not N^2)
- Orthogonalization is expensive; mat-vec is cheap
- Matrix is dense (in Fourier or real space)
- Real-space grid is approximately double the size of the Fourier sphere
- At each self-consistent step we have a good guess for the eigenvectors (from the previous step)
- Typically use stable CG-based iterative methods or Davidson

[Figure: FFT between Fourier and real space]
13. Parallelization (load balance, minimize communications)
Dataset: spheres of Fourier coefficients, one per wavefunction (electron); 100s-1000s (N) spheres of M points per sphere (M ~ 100N).

Calculations (most costly):
1. Calculation of overlap matrices for orthogonalization, scales as MN^2 (written in BLAS3)
2. Non-local part of the pseudopotential calculation (ion-electron interaction), scales as MN^2 (written in BLAS3); can also be done in real space
3. 3d FFTs to construct the charge density in real space, scales as NM log M

Load balancing: calculations 1 and 2 want the same number of sphere points on each processor; calculation 3 wants the same number of columns of data on each processor.

Conflicting conditions: an NP-complete bin (bag) packing problem.
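The overlap-matrix kernel in item 1 is just a dense matrix product; a plain-Python sketch (in the real code this is a single BLAS3 ZGEMM call) makes the MN^2 scaling explicit. The function name and toy data are illustrative, not from the original code.

```python
# Overlap matrix S_ij = sum_g conj(c_i(g)) * c_j(g) between N wavefunctions
# of M Fourier coefficients each: an N x N result costing M*N^2 operations.
# Production codes do this as one ZGEMM; a triple loop shows the scaling.

def overlap(coeffs):
    """coeffs[i][g] = Fourier coefficient g of wavefunction i (complex)."""
    n, m = len(coeffs), len(coeffs[0])
    return [[sum(coeffs[i][g].conjugate() * coeffs[j][g] for g in range(m))
             for j in range(n)] for i in range(n)]

# Two orthonormal single-plane-wave states give the identity overlap matrix.
c = [[1 + 0j, 0j, 0j], [0j, 1 + 0j, 0j]]
s = overlap(c)
```

Because each processor holds a slice of the g index, the local products can proceed without communication; only the final N x N sums need a global reduction.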
14. Parallelization (load balance, minimize communications)
Load balancing scheme:
- Consider the sphere as a one-dimensional list of columns, ordered by length.
- Give out columns to processors in descending order, such that each column goes to the processor with the least number of points in the sphere.

[Figure: 3-processor example]

Close to the same number of points and columns go to each processor.
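The greedy scheme above fits in a few lines. This sketch (function name and test data are mine) keeps a min-heap of processor loads and applies exactly the rule on the slide: hand the next-longest column to the processor currently holding the fewest points.

```python
import heapq

# Greedy load balancing of Fourier-sphere columns: sort columns by length
# and repeatedly give the longest remaining column to the least-loaded
# processor. (Bin packing is NP-complete; this heuristic is near-optimal
# when columns are much smaller than the total per-processor load.)

def distribute_columns(column_lengths, nproc):
    """Return per-processor lists of column indices."""
    heap = [(0, p) for p in range(nproc)]      # (points held, processor id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(nproc)]
    order = sorted(range(len(column_lengths)),
                   key=lambda i: column_lengths[i], reverse=True)
    for i in order:
        points, p = heapq.heappop(heap)        # least-loaded processor
        assignment[p].append(i)
        heapq.heappush(heap, (points + column_lengths[i], p))
    return assignment
```

For example, nine columns of lengths 9 down to 1 on 3 processors give loads of 16, 15 and 14 points against an ideal of 15, with three columns per processor.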
3d FFT communications:
- With this data distribution the BLAS3 operations in Fourier space have limited communications
- Main communications occur in the 3d FFTs, so we write specialized 3d FFTs for this data layout
15. Standard Parallel 3d FFT on an NxNxN (x,y,z) Grid
1. Perform 1d FFTs on the N^2 x-direction columns
2. Transpose (x,y,z) -> (y,z,x)
3. Perform 1d FFTs on the N^2 y-direction columns
4. Transpose (y,z,x) -> (z,y,x)
5. Perform 1d FFTs on the N^2 z-direction columns
6. Transpose (z,y,x) -> (x,y,z) (optional)

Scaling issues (bandwidth and latency):
- computations/communications = N^2 * N log N / N^3 = log N ~ O(10)
- message size scales as (number of processors)^-2 for the 1d layout
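The six steps can be sketched serially in a few lines. A slow O(n^2) 1d DFT plays the role of the 1d FFT library call, and a single cyclic transpose is reused between every pass (a slightly different but equivalent axis order to steps 2 and 4 above). On the parallel machine each transpose becomes the global all-to-all communication step.

```python
import cmath

# A 3d FFT computed as three passes of 1d transforms with a transpose
# between passes, the structure of the transpose-based parallel 3d FFT.

def dft1(col):
    """Plain O(n^2) 1d DFT; stands in for a library 1d FFT."""
    n = len(col)
    return [sum(col[j] * cmath.exp(-2j * cmath.pi * k * j / n)
                for j in range(n)) for k in range(n)]

def fft_axis0(a):
    """1d DFTs along the first index: one per (second, third) index pair."""
    n = len(a)
    out = [[[0j] * n for _ in range(n)] for _ in range(n)]
    for y in range(n):
        for z in range(n):
            col = dft1([a[x][y][z] for x in range(n)])
            for x in range(n):
                out[x][y][z] = col[x]
    return out

def cyclic_transpose(a):
    """(x,y,z) -> (y,z,x); stands in for the global transpose step."""
    n = len(a)
    return [[[a[x][y][z] for x in range(n)] for z in range(n)]
            for y in range(n)]

def fft3(a):
    """Three (1d-FFT pass + transpose) rounds return the full 3d transform
    with indices back in the original (x,y,z) order."""
    for _ in range(3):
        a = cyclic_transpose(fft_axis0(a))
    return a
```

Correctness follows from the separability of the 3d transform: the triple sum over (j1, j2, j3) factorises into three nested 1d sums, one per pass.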
16. Specialized 3d FFT for Electronic Structure Codes (Plane Waves/Fourier)
- Works for any grid size on any number of processors
- Only non-zero elements are communicated/calculated
- Most communication is in the global transpose from (b) to (c); little communication from (d) to (e)
- Many 3d FFTs are done at the same time to avoid latency issues
- Much faster than vendor-supplied 3d FFTs (no grid-size limitations)
- Used in many codes (PARATEC, PEtot, ESCAN, GIT code etc.)

[Figure: data layout at stages (a)-(f), with 1d FFTs and transposes between the stages]
17. Results: 3d FFT on a 512^3 Grid on Cray XT4 (Franklin, NERSC, LBNL)
- Cray XT4 (NERSC)
- Quad-core Opteron, 2.3 GHz
- 38,640 processor cores, 9,660 nodes

Time (s) for a forward+reverse 3d FFT on a 512^3 grid (uses FFTW 1d FFT libs):

Procs   isend/recv   isend/recv (1 core)   alltoallv   isendrecv-block 40   alltoallv-block 40   Weak scaling (Nprocs x 512^2)
128     0.6175       0.3350                0.5560
256     0.3752       0.1858                0.3033      0.3292               0.3013               0.1604
512     0.3417       0.1326                0.2007      0.1770               0.1609               0.1609
1024    6.3045       0.1221                0.1992      0.1913               0.0772               0.1711
2048    7.6232       6.2814                0.2767      0.5552               0.0477               0.1857
4096    9.4510       6.8414                0.4272                           0.0439               0.2019
8192                                       0.6888      0.3746               0.0369               0.2415
18. PARATEC (PARAllel Total Energy Code)
- PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials and a plane-wave basis set
- Written in F90 and MPI
- Designed to run on large parallel machines (IBM SP etc.) but also runs on PCs
- PARATEC uses an all-band CG approach to obtain the wavefunctions of the electrons (blocked comms., specialized 3d FFTs)
- Generally obtains a high percentage of peak on different platforms (uses BLAS3 and 1d FFT libs)
- Developed with Louie and Cohen's groups (UCB, LBNL)

Overall ratio of calculations to communications scales as N (not log N).
19. PARATEC Code Details
- Code written in F90 and MPI (~50,000 lines)
- 33% 3D FFT, 33% BLAS3, 33% hand-coded F90
- Global communications in the 3D FFT (transpose)
- Parallel 3D FFT handwritten to minimize communications and reduce latency (written on top of vendor-supplied 1D complex FFTs)
20. PARATEC Performance
488-atom CdSe quantum dot; Gflops/proc (% of peak) per platform:

Procs   Bassi NERSC (IBM Power5)   Jacquard NERSC (Opteron)   Thunder (Itanium2)   Franklin NERSC (Cray XT4)   NEC ES (SX6)   IBM BG/L
128     5.49 (72%)                                            2.8 (51%)                                        5.1 (64%)
256     5.52 (73%)                 1.98 (45%)                 2.6 (47%)            3.36 (65%)                  5.0 (62%)      1.21 (43%)
512     5.13 (67%)                 0.95 (21%)                 2.4 (44%)            3.15 (61%)                  4.4 (55%)      1.00 (35%)
1024    3.74 (49%)                                            1.8 (32%)            2.93 (56%)                  3.6 (46%)
2048                                                                               2.37 (46%)                  2.7 (35%)

- Grid size 250^3
- All architectures generally achieve high performance due to the computational intensity of the code (BLAS3, FFT)
- ES achieves the highest overall performance: 5.5 Tflop/s on 2048 procs (5.3 Tflops on the XT4 on 2048 procs in single-proc-per-node mode)
- FFT used as a benchmark for NERSC procurements (run on up to 18K procs on the Cray XT4, weak scaling)
- Vectorisation directives and multiple simultaneous 1d FFTs required for the NEC SX6

Developed with Louie and Cohen's groups (UCB, LBNL); also work with L. Oliker, J. Carter
21. Application: Gamma Ray Detection (Cerium-activated Scintillators)
- Applications in cosmology, particle physics, medicine, homeland security
- Scintillators produce light in the optical range when exposed to gamma rays
- The brightest scintillators are Cerium-doped materials (lanthanum halides)
- Theoretical screening of candidate materials

[Figure: Ce states, ordered by increasing energy]

Project funded by the DHS, in collaboration with S. Derenzo, M. Weber, A. Chaudhry in Life Sciences (LBNL)
22. Scintillation Process Model (Ce-activated)
- Gamma-ray excitation produces a hole and an electron in the host material
- The hole is captured by the dopant Ce site (Ce3+ -> Ce4+)
- The electron diffuses down through the conduction band into the Ce 5d level
- A 5d -> 4f transition on the Ce site emits an optical photon

Key assumption: the 4f and 5d states of the Ce atom must be in the energy gap of the host crystal.

[Figure: level diagram: conduction band, Ce 5d, Ce 4f, valence band]
23. Density of States Plot for Ce in LaBr3

[Figure: Ce density of states vs. energy for a single Ce atom in a large unit cell, showing the 4f and 5d levels]
24. Criteria to Determine Bright Ce-activated Scintillators
- Size of the host material bandgap: the smaller the bandgap, the more electron-hole pairs, but the gap must accommodate the Ce 4f and 5d levels
- Energy difference between the VBM and the Ce 4f level: the larger this value, the less probable hole capture at the Ce3+ site (it requires multiple or high-energy phonons)
- Energy difference between the CBM and the Ce 5d level: if the 5d is too close, thermal excitations may excite the electron into the conduction band
- Level of localisation of the (Ce3+) state on the Ce atom
25. Predictions for Some New Materials

Compound       Bandgap EG (eV)   Ce 4f - VBM (eV)   (Ce3+) localisation (%)
Ba2YCl7:Ce     4.05              0.87               64
BaY2Br8:Ce     2.94              0.45               39
BaY2I8:Ce      2.50              0.0                42
LaFS:Ce        3.27              0.71               45
La2Ca2O5:Ce    2.38              0.26               28
La2Mg2O5:Ce    2.80              0.1                12

Ba2YCl7:Ce was made by experimentalists and found to be a very bright scintillator. An initial patent filing has been taken out for the material Ba2YCl7:Ce.
26. The Quantization Condition of Quantum-Well States in Cu/Co(100)
- Theoretical investigation of quantum-well states in Cu films using our codes (PARATEC, PEtot), compared with experiments at the ALS (E. Rotenberg, Y.Z. Wu, Z.Q. Qiu)
- New computational methods for metallic systems used in the calculations
- Led to an understanding of surface effects on the quantum-well states; improves on the simple Phase Accumulation Model used previously

[Figure: QW states in a copper wedge; the difference between theory and experiment is improved by taking surface effects into account]
27. Summary
- Plane-wave electronic structure methods can scale to the 1K-10K processor regime
- Higher processor counts require different parallelization methods (state parallelism, k-point parallelism, e.g. the Qbox code)

This research was funded by the DOE as part of the Modeling and Simulation in Nanoscience Initiative (MICS and BES), contract No. DE-AC03-76SF00098. The gamma ray detection work was funded by the DHS.
28. Future Directions
- O(N) methods (exploiting locality) can give a sparse matrix problem
- Excited state calculations
- Transport calculations (conductivity, dynamical properties)