Title: XT3 Optimization: Scientific Libraries
1. XT3 Optimization: Scientific Libraries
- Adrian Tate,
- Scientific Libraries Engineer
- Cray Inc.
2. Contents
- Libraries features and plans
- XT3 Performance tips
- FDPM (Fast Double Precision Math) library
- Suggested exercises
3. Scientific Libraries
- The purpose of a scientific library is to provide highly tuned versions of common mathematical operations, making users' lives easier.
4. XT3 Math Libraries
[Diagram: XT3 math library stack]
- FFTW, libGoto
- XT3 LibSci: ScaLAPACK, SuperLU, Cray FFTs, Sparse BLAS, FDPM
- ACML: BLAS, LAPACK, FFTs, Random Number Generators
5. Cray XT3 Libraries
- Cray XT LibSci
  - ScaLAPACK
  - SuperLU
  - Cray FFTs
  - PETSc
  - Sparse BLAS
- AMD Core Math Library (ACML)
  - BLAS
  - LAPACK
  - ACML FFTs
  - Random number generators
  - Vector math libraries (link with acml_mv)
- Goto BLAS
- FFTW (3.1.1, 2.1.5)
- OpenMP support with Compute Node Linux
6. Using Libraries
- Libsci is loaded with the PrgEnv module
  - module load PrgEnv-pgi/1.3.30
- Cray FFTs
  - module load libscifft-pgi/1.0.0
- FFTW
  - module load FFTW/2.1.5 or FFTW/3.1.1
- LibGoto: download from UT
- You don't need to link explicitly with the libsci library (see the build sketch below)
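A minimal build sketch for reference, assuming the standard Cray compiler wrappers (cc/ftn), which pull libsci in automatically once the PrgEnv module is loaded; the source file name is a placeholder:

    module load PrgEnv-pgi/1.3.30
    module load libscifft-pgi/1.0.0
    cc -o mysolver mysolver.c    # libsci is linked implicitly by the wrapper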
7. FFTs on Cray XT3
- FFTs in ACML
  - Contain OpenMP versions for 2D and 3D FFTs
  - Cray FFT interfaces to the ACML FFTs
  - Pre-built plans for FFTW 3.1.1 by end of 2006
  - Additional FFT optimizations in 2007
- FFTW 3.1.1
  - Initial release by fall of 2006
  - Pre-built plans (Wisdom) available end of 2006 (see the sketch below)
- FFTW 2.1.5
  - Included only for MPI FFTW
  - Tuned by demand
- Further optimizations of FFTW 3.x in 2007
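As a sketch of the planning interface behind that wisdom support, using the standard FFTW 3.x API (the wisdom file path is a placeholder):

    #include <stdio.h>
    #include <fftw3.h>

    int main(void) {
        int n = 1024;
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

        /* Import pre-built plans (wisdom) if present, so the cost of
           FFTW_MEASURE planning is paid only once per system. */
        FILE *w = fopen("wisdom.dat", "r");    /* placeholder path */
        if (w) { fftw_import_wisdom_from_file(w); fclose(w); }

        fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);
        /* ... fill in[] with data ... */
        fftw_execute(p);

        fftw_destroy_plan(p);
        fftw_free(in); fftw_free(out);
        return 0;
    }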
8. ScaLAPACK on XT3
- ScaLAPACK, PBLAS and BLACS are in XT-libsci
- Improved eigensolver support: a Beta release of MRRR for XT3 in late 2006
- New communications layer for ScaLAPACK in the Baker timeframe
9. Sparse Solvers
- SuperLU and SuperLU_dist in libsci
- Sparse BLAS (single processor) to support iterative solvers in PETSc
  - Sparse matrix-vector multiply (sketched below)
  - Sparse triangular solve
  - OSKI custom routines from the Sparse BLAS standard
- Investigating Epetra
  - Provides parallel sparse BLAS using the single-processor sparse BLAS
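For illustration, the kind of single-processor kernel such a sparse BLAS provides: a minimal sparse matrix-vector multiply over the common compressed sparse row (CSR) format. This is a sketch, not the libsci interface:

    /* y = A*x for an n-row sparse matrix A in CSR format: val[] holds
       the nonzeros, col[] their column indices, and row i's entries
       occupy positions ptr[i] .. ptr[i+1]-1. */
    void csr_matvec(int n, const int *ptr, const int *col,
                    const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                sum += val[k] * x[col[k]];
            y[i] = sum;
        }
    }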
10. Library Goal
- Have the best scientific libraries in the industry within the next 2 years
11. XT3 Library Performance Tips
- Libraries should perform well by definition
- That requires two things from customers:
  - Correct usage
  - Communication with the libraries group
12. BLAS library
- Where there is a choice, experiment with the options
- BLAS exist in both ACML and LibGoto
  - Completely different design philosophies; relative performance differs with problem size, shape and data type
  - Experiment with each (see the timing sketch below)
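One way to run that experiment is to time an identical GEMM call and simply relink against each library in turn. A minimal sketch using the standard Fortran dgemm symbol (all arguments passed by reference; hidden string-length arguments omitted, as is common when calling BLAS from C):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Fortran BLAS symbol, resolved by ACML or LibGoto at link time. */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    int main(void) {
        int n = 1000;                       /* vary size, shape, type */
        double alpha = 1.0, beta = 0.0;
        double *a = malloc((size_t)n * n * sizeof *a);
        double *b = malloc((size_t)n * n * sizeof *b);
        double *c = malloc((size_t)n * n * sizeof *c);
        for (int i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 1.0; }

        clock_t t0 = clock();
        dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        /* A square matrix multiply costs 2*n^3 flops. */
        printf("%.2f GFlop/s\n", 2.0 * n * n * n / secs / 1e9);
        free(a); free(b); free(c);
        return 0;
    }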
13. ACML vs GOTO: zgemm, M = N = 1000 [performance chart]
14. ACML vs GOTO: zgemm, M = N = 3000 [performance chart]
15. Utilize SSE2 instructions
- 32-bit arithmetic on the Opteron can be 40% quicker
  - SSE instructions operate on vectors of length 4
- Where you do not need the precision, use the single precision math routine
  - E.g. the preconditioner for an iterative solver (see the sketch below)
- Within the libraries we utilize this:
  - FDPM
  - Short precision transcendental functions
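For example, a dense preconditioner apply can call the single precision BLAS directly, with only the outer solve kept in doubles. A sketch using the standard sgemv routine (the explicit-inverse storage here is purely for illustration):

    /* Fortran BLAS symbol: y = alpha*A*x + beta*y, single precision. */
    extern void sgemv_(const char *trans, const int *m, const int *n,
                       const float *alpha, const float *a, const int *lda,
                       const float *x, const int *incx,
                       const float *beta, float *y, const int *incy);

    /* Apply a preconditioner stored as a float matrix m_inv to the
       residual r, giving z. Half the memory traffic of the double
       version, and SSE2 processes 4 floats per instruction vs 2 doubles. */
    void precondition(int n, const float *m_inv, const float *r, float *z)
    {
        const float one = 1.0f, zero = 0.0f;
        const int inc = 1;
        sgemv_("N", &n, &n, &one, m_inv, &n, r, &inc, &zero, z, &inc);
    }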
16. Introduce yourself to your data
- Solver performance can depend on the characteristics of your matrix
- Eigensolver behavior will be wildly different for different kinds of data
- Generate a graphic of your eigenvalue spectrum
17. E.g. clustered eigenvalues [eigenvalue spectrum plot]
18. Decision tree for eigensolvers (XT3)
[Decision tree:]
- Eigenvalues only: QR
- Eigenvalues and eigenvectors
  - Full spectrum: 1. MRRR, 2. Divide and Conquer
  - Part spectrum
    - Clustered: MRRR
    - Not clustered: MRRR / Bisection with inverse iteration
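The MRRR branch of the tree corresponds to LAPACK's dsyevr routine. A minimal sketch of the usual workspace-query-then-solve pattern, here for the full spectrum of a symmetric matrix:

    #include <stdlib.h>

    extern void dsyevr_(const char *jobz, const char *range, const char *uplo,
                        const int *n, double *a, const int *lda,
                        const double *vl, const double *vu,
                        const int *il, const int *iu, const double *abstol,
                        int *m, double *w, double *z, const int *ldz,
                        int *isuppz, double *work, const int *lwork,
                        int *iwork, const int *liwork, int *info);

    /* All eigenvalues (w) and eigenvectors (z, n x n) of the symmetric
       column-major matrix a, computed with the MRRR algorithm. */
    int eig_mrrr(int n, double *a, double *w, double *z)
    {
        double vl = 0, vu = 0, abstol = 0;  /* ignored for range = "A" */
        int il = 0, iu = 0, m, info;
        int *isuppz = malloc(2 * (size_t)n * sizeof *isuppz);

        double wkopt; int iwkopt, lwork = -1, liwork = -1;
        dsyevr_("V", "A", "U", &n, a, &n, &vl, &vu, &il, &iu, &abstol,
                &m, w, z, &n, isuppz, &wkopt, &lwork, &iwkopt, &liwork, &info);

        lwork = (int)wkopt; liwork = iwkopt;   /* optimal sizes returned */
        double *work = malloc((size_t)lwork * sizeof *work);
        int *iwork = malloc((size_t)liwork * sizeof *iwork);
        dsyevr_("V", "A", "U", &n, a, &n, &vl, &vu, &il, &iu, &abstol,
                &m, w, z, &n, isuppz, work, &lwork, iwork, &liwork, &info);

        free(isuppz); free(work); free(iwork);
        return info;                           /* 0 on success */
    }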
19. E.g. Application A: 10% of spectrum [performance chart]
20. Parallel Libraries
- For parallel libraries, increase the blocksize
- Many codes use the same blocksize on different machines
- The XT3 MPI implementation works best for fewer, larger messages
  - A bigger distribution block size is best (up to a point)
  - You need to find this sweet spot (see the descriptor sketch below)
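In ScaLAPACK the distribution block size is fixed in the array descriptor, so the descriptor is where to experiment. A sketch using the standard descinit and numroc routines, assuming a BLACS process grid is already set up:

    extern int  numroc_(const int *n, const int *nb, const int *iproc,
                        const int *isrcproc, const int *nprocs);
    extern void descinit_(int *desc, const int *m, const int *n,
                          const int *mb, const int *nb,
                          const int *irsrc, const int *icsrc,
                          const int *ictxt, const int *lld, int *info);

    /* Descriptor for an m x n matrix in square blocks of size nb.
       Vary nb (e.g. 32 .. 256) to find the XT3 sweet spot of fewer,
       larger messages. */
    void make_desc(int desc[9], int m, int n, int nb,
                   int ictxt, int myrow, int nprow)
    {
        int izero = 0, info;
        int locrows = numroc_(&m, &nb, &myrow, &izero, &nprow);
        int lld = locrows > 1 ? locrows : 1;
        descinit_(desc, &m, &n, &nb, &nb, &izero, &izero,
                  &ictxt, &lld, &info);
    }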
21. One stage of LU: the limit of block size growth
[Diagram: one stage of blocked LU factorization. The panel factorization runs as BLAS2 while the trailing-matrix updates run as BLAS3; growing the block size enlarges the BLAS2 panel work, which limits how far the block size can usefully grow.]
22. Blocksize variance for Cholesky [performance chart]
23. PDSYEVD blocksize [performance chart]
24. Redistribute
- Redistribute into the optimal blocksize between code portions
- Redistribution is cheap compared to the advantages (see the sketch below)
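The standard ScaLAPACK redistribution routine is pdgemr2d, which copies a distributed matrix from one descriptor (and therefore one blocksize) to another. A sketch, assuming both descriptors have already been built, e.g. with descinit:

    extern void pdgemr2d_(const int *m, const int *n,
                          const double *a, const int *ia, const int *ja,
                          const int *desca,
                          double *b, const int *ib, const int *jb,
                          const int *descb, const int *ictxt);

    /* Copy the m x n matrix from distribution desca (say nb = 32)
       into descb (say nb = 128) before entering the solver. */
    void redistribute(int m, int n, const double *a, const int *desca,
                      double *b, const int *descb, int ictxt)
    {
        int ione = 1;
        pdgemr2d_(&m, &n, a, &ione, &ione, desca,
                  b, &ione, &ione, descb, &ictxt);
    }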
25. Redistribution is cheap: M = 3000, 64 compute nodes [performance chart]
26. Use single precision (again)
- Apart from the SSE2 utilization, using single precision means that you can increase the blocksize and still maintain the same MB/s across the network
- Applies to ScaLAPACK, PBLAS and SuperLU_dist
27. Use sparse solvers
- Nearly all real data is sparse
- Dense linear algebra is a dying art
- Sparse solvers are O(n²)
- Cray XT-libsci will include iterative solvers in a future release
- ACML will support direct solvers in a future release
28. FDPM Library
- Fast Double Precision Math library
- Uses iterative refinement to generate double precision accuracy with single precision arithmetic
  - Requires 1.5x the memory
  - Can only be used for well-conditioned problems
- Release 1.4:
  - Serial linear solvers (LU, Cholesky, QR)
  - Parallel linear solvers
  - Tools for condition number estimation
- Future release: eigenvalues
29. E.g. LU with refinement
- Ax = b
- Solve LUx = b in single precision, but keep a copy of A in double precision
- Generate r = b - Ax in double precision (O(n²))
- Solve A·x1 = r for the correction x1 using the single precision solver
- x = x + x1
- Iterate, with escape conditions
- If the matrix has a high condition number, refinement will take too long (a sketch of the loop follows)
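A serial sketch of that refinement loop built from standard LAPACK/BLAS calls (sgetrf/sgetrs for the single precision factorization and solves, dgemv for the double precision residual); FDPM's actual implementation may differ:

    #include <stdlib.h>
    #include <math.h>

    extern void sgetrf_(const int *m, const int *n, float *a, const int *lda,
                        int *ipiv, int *info);
    extern void sgetrs_(const char *trans, const int *n, const int *nrhs,
                        const float *a, const int *lda, const int *ipiv,
                        float *b, const int *ldb, int *info);
    extern void dgemv_(const char *trans, const int *m, const int *n,
                       const double *alpha, const double *a, const int *lda,
                       const double *x, const int *incx,
                       const double *beta, double *y, const int *incy);

    /* Solve Ax = b to double precision accuracy via a single precision
       LU factorization plus iterative refinement. Returns LAPACK info. */
    int lu_refine(int n, const double *a, const double *b, double *x)
    {
        const int ione = 1, maxit = 30;
        const double dmone = -1.0, done = 1.0;
        float *as = malloc((size_t)n * n * sizeof *as); /* sp copy of A  */
        float *rs = malloc(n * sizeof *rs);             /* (the 1.5x memory) */
        double *r = malloc(n * sizeof *r);
        int *ipiv = malloc(n * sizeof *ipiv), info;

        for (int i = 0; i < n * n; i++) as[i] = (float)a[i];
        sgetrf_(&n, &n, as, &n, ipiv, &info);           /* O(n^3), sp   */

        for (int i = 0; i < n; i++) rs[i] = (float)b[i];
        sgetrs_("N", &n, &ione, as, &n, ipiv, rs, &n, &info);
        for (int i = 0; i < n; i++) x[i] = rs[i];       /* initial x    */

        for (int it = 0; it < maxit; it++) {
            /* r = b - A*x in double precision: O(n^2) per iteration */
            for (int i = 0; i < n; i++) r[i] = b[i];
            dgemv_("N", &n, &n, &dmone, a, &n, x, &ione, &done, r, &ione);

            double rnorm = 0.0;
            for (int i = 0; i < n; i++) rnorm = fmax(rnorm, fabs(r[i]));
            if (rnorm < 1e-14) break;                   /* escape condition */

            /* Solve A*x1 = r with the existing sp factors; x = x + x1 */
            for (int i = 0; i < n; i++) rs[i] = (float)r[i];
            sgetrs_("N", &n, &ione, as, &n, ipiv, rs, &n, &info);
            for (int i = 0; i < n; i++) x[i] += rs[i];
        }
        free(as); free(rs); free(r); free(ipiv);
        return info;
    }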
30. FDPM - contents
- FDPM_sLUr
- FDPM_pLUr
- FDPM_sLUc
- FDPM_pLUc
- FDPM_sLTr
- FDPM_pLTr
- FDPM_sLTc
- FDPM_pLTc
- FDPM_sQRr
- FDPM_pQRr
- FDPM_sQRc
- FDPM_pQRc
- FDPM_sDCr
- FDPM_pDCr
- FDPM_sDCc
- FDPM_sCONr
- dgesv
- pdgesv
- zgesv
31. FDPM - interface
- LAPACK and ScaLAPACK friendly
- Parameters are in LAPACK-like ordering
- Routine names, e.g. FDPM_dLUz
- Replaces both the factor and the solve
- E.g. for serial double precision LU, replace the calls to dgetrf and dgetrs with a single call to FDPM_sLUd (the replaced calls are sketched below)
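For reference, the two LAPACK calls that the single FDPM call replaces (the FDPM routine's own signature is not reproduced here; it ships with the library documentation):

    extern void dgetrf_(const int *m, const int *n, double *a, const int *lda,
                        int *ipiv, int *info);
    extern void dgetrs_(const char *trans, const int *n, const int *nrhs,
                        const double *a, const int *lda, const int *ipiv,
                        double *b, const int *ldb, int *info);

    /* Classic LAPACK two-step solve; overwrites b with the solution. */
    void solve_lapack(int n, double *a, double *b, int *ipiv)
    {
        int ione = 1, info;
        dgetrf_(&n, &n, a, &n, ipiv, &info);                /* factor A = LU */
        dgetrs_("N", &n, &ione, a, &n, ipiv, b, &n, &info); /* solve         */
        /* With FDPM, both calls collapse into one FDPM_sLUd call taking
           LAPACK-like parameters, per the interface described above. */
    }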
32. FDPM
- If your code uses a higher-level driver routine to solve a linear system, e.g. dgesv, then you can use FDPM without making explicit calls to FDPM routines
- Set the environment variable FDPM_USE_DRIVER
- dgesv will then be called from FDPM, and the LAPACK version used if it does not converge within 30 iterations
33. Helping us achieve our goal
- Please speak to the libraries group
  - Requests for features/functionality
  - Performance information (especially ACML)
- Send any performance info to adrian@cray.com and ldr@cray.com
34. Suggested exercises
- Generate a double precision random matrix A and right hand side vector b
- Find the condition number of the matrix using FDPM_sCONd
- Factor and solve the system using the LU driver dgesv
- Solve the system using FDPM_sLUd and compare performance
- The library exists on bigben at atate/libfdpm.a (a starting sketch follows)
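A possible starting point for the dgesv portion of the exercise (standard LAPACK driver; the FDPM_sCONd and FDPM_sLUd steps are left as a comment because their exact signatures come with the library on bigben):

    #include <stdio.h>
    #include <stdlib.h>

    extern void dgesv_(const int *n, const int *nrhs, double *a, const int *lda,
                       int *ipiv, double *b, const int *ldb, int *info);

    int main(void) {
        int n = 2000, ione = 1, info;
        double *a = malloc((size_t)n * n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        int *ipiv = malloc(n * sizeof *ipiv);

        /* Random double precision matrix A and right hand side b */
        for (int i = 0; i < n * n; i++) a[i] = (double)rand() / RAND_MAX;
        for (int i = 0; i < n; i++)     b[i] = (double)rand() / RAND_MAX;

        /* TODO: estimate the condition number with FDPM_sCONd, then time
           this solve against FDPM_sLUd (link against atate/libfdpm.a). */
        dgesv_(&n, &ione, a, &n, ipiv, b, &n, &info);
        printf("dgesv info = %d\n", info);

        free(a); free(b); free(ipiv);
        return 0;
    }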