Title: XT3 Optimization: Scientific Libraries
1. XT3 Optimization: Scientific Libraries
- Adrian Tate,
- Scientific Libraries Engineer
- Cray Inc.
2. Contents
- Libraries features and plans
- XT3 Performance tips
- FDPM (Fast Double Precision Math) library
- Suggested exercises
3. Scientific Libraries
- The purpose of a scientific library is to provide highly tuned versions of common mathematical operations, making users' lives easier.
4. XT3 Math Libraries
[Diagram: XT3 math library stack]
- FFTW, libGoto
- XT3 LibSci: ScaLAPACK, SuperLU, Cray FFTs, Sparse BLAS, FDPM
- ACML: BLAS, LAPACK, FFTs, Random Number Generators
5. Cray XT3 Libraries
- Cray XT LibSci
  - ScaLAPACK
  - SuperLU
  - Cray FFTs
  - PETSc
  - Sparse BLAS
- AMD Core Math Library (ACML)
  - BLAS
  - LAPACK
  - ACML FFTs
  - Random number generators
  - Vector math libraries (link with acml_mv)
- Goto BLAS
- FFTW (3.1.1, 2.1.5)
- OpenMP support with Compute Node Linux
6. Using Libraries
- Libsci is loaded with the PrgEnv module
  - module load PrgEnv-pgi/1.3.30
- Cray FFTs
  - module load libscifft-pgi/1.0.0
- FFTW
  - module load FFTW/2.1.5 or FFTW/3.1.1
- LibGoto: download from UT
- You don't need to link explicitly with the libsci library (see the build sketch below)
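A minimal build sketch for reference, assuming the standard Cray compiler wrappers (cc/ftn), which pull libsci in automatically once the PrgEnv module is loaded; the source file name is a placeholder:

    module load PrgEnv-pgi/1.3.30
    module load libscifft-pgi/1.0.0
    cc -o mysolver mysolver.c    # libsci is linked implicitly by the wrapper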
7. FFTs on Cray XT3
- FFTs in ACML
  - Contain OpenMP versions for 2D and 3D FFTs
  - Cray FFT interfaces to the ACML FFTs
  - Pre-built plans for FFTW 3.1.1 by end of 2006
  - Additional FFT optimizations in 2007
- FFTW 3.1.1
  - Initial release by fall of 2006
  - Pre-built plans (Wisdom) available end of 2006 (see the sketch below)
- FFTW 2.1.5
  - Included only for MPI FFTW
  - Tuned by demand
- Further optimizations of FFTW 3.x in 2007
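As a sketch of the planning interface behind that wisdom support, using the standard FFTW 3.x API (the wisdom file path is a placeholder):

    #include <stdio.h>
    #include <fftw3.h>

    int main(void) {
        int n = 1024;
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

        /* Import pre-built plans (wisdom) if present, so the cost of
           FFTW_MEASURE planning is paid only once per system. */
        FILE *w = fopen("wisdom.dat", "r");    /* placeholder path */
        if (w) { fftw_import_wisdom_from_file(w); fclose(w); }

        fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);
        /* ... fill in[] with data ... */
        fftw_execute(p);

        fftw_destroy_plan(p);
        fftw_free(in); fftw_free(out);
        return 0;
    }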
8. ScaLAPACK on XT3
- ScaLAPACK, PBLAS and BLACS are in XT-libsci
- Improved eigensolver support: a Beta release of MRRR for XT3 in late 2006
- New communications layer for ScaLAPACK in the Baker timeframe
9. Sparse Solvers
- SuperLU and SuperLU_dist in libsci
- Sparse BLAS (single processor) to support iterative solvers in PETSc
  - Sparse matrix-vector multiply (sketched below)
  - Sparse triangular solve
  - OSKI custom routines from the Sparse BLAS standard
- Investigating Epetra
  - Provides parallel sparse BLAS using the single-processor sparse BLAS
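For illustration, the kind of single-processor kernel such a sparse BLAS provides: a minimal sparse matrix-vector multiply over the common compressed sparse row (CSR) format. This is a sketch, not the libsci interface:

    /* y = A*x for an n-row sparse matrix A in CSR format: val[] holds
       the nonzeros, col[] their column indices, and row i's entries
       occupy positions ptr[i] .. ptr[i+1]-1. */
    void csr_matvec(int n, const int *ptr, const int *col,
                    const double *val, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = ptr[i]; k < ptr[i+1]; k++)
                sum += val[k] * x[col[k]];
            y[i] = sum;
        }
    }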
10. Library Goal
- Have the best scientific libraries in the industry within the next 2 years
11. XT3 Library Performance Tips
- Libraries should perform well by definition
- That requires two things from customers:
  - Correct usage
  - Communication with the libraries group
12. BLAS library
- Where there is a choice, experiment with the options
- BLAS exist in both ACML and LibGoto
  - Completely different design philosophies; relative performance differs with problem size, shape and data type
  - Experiment with each (see the timing sketch below)
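One way to run that experiment is to time an identical GEMM call and simply relink against each library in turn. A minimal sketch using the standard Fortran dgemm symbol (all arguments passed by reference; hidden string-length arguments omitted, as is common when calling BLAS from C):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Fortran BLAS symbol, resolved by ACML or LibGoto at link time. */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *a, const int *lda,
                       const double *b, const int *ldb,
                       const double *beta, double *c, const int *ldc);

    int main(void) {
        int n = 1000;                       /* vary size, shape, type */
        double alpha = 1.0, beta = 0.0;
        double *a = malloc((size_t)n * n * sizeof *a);
        double *b = malloc((size_t)n * n * sizeof *b);
        double *c = malloc((size_t)n * n * sizeof *c);
        for (int i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 1.0; }

        clock_t t0 = clock();
        dgemm_("N", "N", &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        /* A square matrix multiply costs 2*n^3 flops. */
        printf("%.2f GFlop/s\n", 2.0 * n * n * n / secs / 1e9);
        free(a); free(b); free(c);
        return 0;
    }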
13. ACML vs GOTO: zgemm, M = N = 1000 [performance chart]
14. ACML vs GOTO: zgemm, M = N = 3000 [performance chart]
15. Utilize SSE2 instructions
- 32-bit arithmetic on the Opteron can be 40% quicker
  - SSE instructions operate on vectors of length 4
- Where you do not need the precision, use the single precision math routine
  - E.g. the preconditioner for an iterative solver (see the sketch below)
- Within the libraries we utilize this:
  - FDPM
  - Short precision transcendental functions
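For example, a dense preconditioner apply can call the single precision BLAS directly, with only the outer solve kept in doubles. A sketch using the standard sgemv routine (the explicit-inverse storage here is purely for illustration):

    /* Fortran BLAS symbol: y = alpha*A*x + beta*y, single precision. */
    extern void sgemv_(const char *trans, const int *m, const int *n,
                       const float *alpha, const float *a, const int *lda,
                       const float *x, const int *incx,
                       const float *beta, float *y, const int *incy);

    /* Apply a preconditioner stored as a float matrix m_inv to the
       residual r, giving z. Half the memory traffic of the double
       version, and SSE2 processes 4 floats per instruction vs 2 doubles. */
    void precondition(int n, const float *m_inv, const float *r, float *z)
    {
        const float one = 1.0f, zero = 0.0f;
        const int inc = 1;
        sgemv_("N", &n, &n, &one, m_inv, &n, r, &inc, &zero, z, &inc);
    }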
16. Introduce yourself to your data
- Solver performance can depend on the characteristics of your matrix
- Eigensolver behavior will be wildly different for different kinds of data
- Generate a graphic of your eigenvalue spectrum
17. E.g. clustered eigenvalues [eigenvalue spectrum plot]
18. Decision tree for eigensolvers (XT3)
[Decision tree:]
- Eigenvalues only: QR
- Eigenvalues and eigenvectors
  - Full spectrum: 1. MRRR, 2. Divide and Conquer
  - Part spectrum
    - Clustered: MRRR
    - Not clustered: MRRR / Bisection with inverse iteration
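The MRRR branch of the tree corresponds to LAPACK's dsyevr routine. A minimal sketch of the usual workspace-query-then-solve pattern, here for the full spectrum of a symmetric matrix:

    #include <stdlib.h>

    extern void dsyevr_(const char *jobz, const char *range, const char *uplo,
                        const int *n, double *a, const int *lda,
                        const double *vl, const double *vu,
                        const int *il, const int *iu, const double *abstol,
                        int *m, double *w, double *z, const int *ldz,
                        int *isuppz, double *work, const int *lwork,
                        int *iwork, const int *liwork, int *info);

    /* All eigenvalues (w) and eigenvectors (z, n x n) of the symmetric
       column-major matrix a, computed with the MRRR algorithm. */
    int eig_mrrr(int n, double *a, double *w, double *z)
    {
        double vl = 0, vu = 0, abstol = 0;  /* ignored for range = "A" */
        int il = 0, iu = 0, m, info;
        int *isuppz = malloc(2 * (size_t)n * sizeof *isuppz);

        double wkopt; int iwkopt, lwork = -1, liwork = -1;
        dsyevr_("V", "A", "U", &n, a, &n, &vl, &vu, &il, &iu, &abstol,
                &m, w, z, &n, isuppz, &wkopt, &lwork, &iwkopt, &liwork, &info);

        lwork = (int)wkopt; liwork = iwkopt;   /* optimal sizes returned */
        double *work = malloc((size_t)lwork * sizeof *work);
        int *iwork = malloc((size_t)liwork * sizeof *iwork);
        dsyevr_("V", "A", "U", &n, a, &n, &vl, &vu, &il, &iu, &abstol,
                &m, w, z, &n, isuppz, work, &lwork, iwork, &liwork, &info);

        free(isuppz); free(work); free(iwork);
        return info;                           /* 0 on success */
    }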
19. E.g. Application A: 10% of spectrum [performance chart]
20. Parallel Libraries
- For parallel libraries, increase the blocksize
- Many codes use the same blocksize on different machines
- The XT3 MPI implementation works best for fewer, larger messages
  - A bigger distribution block size is best (up to a point)
  - You need to find this sweet spot (see the descriptor sketch below)
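In ScaLAPACK the distribution block size is fixed in the array descriptor, so the descriptor is where to experiment. A sketch using the standard descinit and numroc routines, assuming a BLACS process grid is already set up:

    extern int  numroc_(const int *n, const int *nb, const int *iproc,
                        const int *isrcproc, const int *nprocs);
    extern void descinit_(int *desc, const int *m, const int *n,
                          const int *mb, const int *nb,
                          const int *irsrc, const int *icsrc,
                          const int *ictxt, const int *lld, int *info);

    /* Descriptor for an m x n matrix in square blocks of size nb.
       Vary nb (e.g. 32 .. 256) to find the XT3 sweet spot of fewer,
       larger messages. */
    void make_desc(int desc[9], int m, int n, int nb,
                   int ictxt, int myrow, int nprow)
    {
        int izero = 0, info;
        int locrows = numroc_(&m, &nb, &myrow, &izero, &nprow);
        int lld = locrows > 1 ? locrows : 1;
        descinit_(desc, &m, &n, &nb, &nb, &izero, &izero,
                  &ictxt, &lld, &info);
    }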
21. One stage of LU: the limit of block size growth
[Diagram: one stage of blocked LU factorization. The panel factorization runs as BLAS2 while the trailing-matrix updates run as BLAS3; growing the block size enlarges the BLAS2 panel work, which limits how far the block size can usefully grow.]
22. Blocksize variance for Cholesky [performance chart]
23. PDSYEVD blocksize [performance chart]
24. Redistribute
- Redistribute into the optimal blocksize between code portions
- Redistribution is cheap compared to the advantages (see the sketch below)
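The standard ScaLAPACK redistribution routine is pdgemr2d, which copies a distributed matrix from one descriptor (and therefore one blocksize) to another. A sketch, assuming both descriptors have already been built, e.g. with descinit:

    extern void pdgemr2d_(const int *m, const int *n,
                          const double *a, const int *ia, const int *ja,
                          const int *desca,
                          double *b, const int *ib, const int *jb,
                          const int *descb, const int *ictxt);

    /* Copy the m x n matrix from distribution desca (say nb = 32)
       into descb (say nb = 128) before entering the solver. */
    void redistribute(int m, int n, const double *a, const int *desca,
                      double *b, const int *descb, int ictxt)
    {
        int ione = 1;
        pdgemr2d_(&m, &n, a, &ione, &ione, desca,
                  b, &ione, &ione, descb, &ictxt);
    }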
25. Redistribution is cheap: M = 3000, 64 compute nodes [performance chart]
26. Use single precision (again)
- Apart from the SSE2 utilization, using single precision means that you can increase the blocksize and still maintain the same MB/s across the network
- Applies to ScaLAPACK, PBLAS and SuperLU_dist
27. Use sparse solvers
- Nearly all real data is sparse
- Dense linear algebra is a dying art
- Sparse solvers are O(n²)
- Cray XT-libsci will include iterative solvers in a future release
- ACML will support direct solvers in a future release
28. FDPM Library
- Fast Double Precision Math library
- Uses iterative refinement to generate double precision accuracy with single precision arithmetic
  - Requires 1.5x the memory
  - Can only be used for well-conditioned problems
- Release 1.4:
  - Serial linear solvers (LU, Cholesky, QR)
  - Parallel linear solvers
  - Tools for condition number estimation
- Future release: eigenvalues
29. E.g. LU with refinement
- Ax = b
- Solve LUx = b in single precision, but keep a copy of A in double precision
- Generate r = b - Ax in double precision (O(n²))
- Solve A·x1 = r for the correction x1 using the single precision solver
- x = x + x1
- Iterate, with escape conditions
- If the matrix has a high condition number, refinement will take too long (a sketch of the loop follows)
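A serial sketch of that refinement loop built from standard LAPACK/BLAS calls (sgetrf/sgetrs for the single precision factorization and solves, dgemv for the double precision residual); FDPM's actual implementation may differ:

    #include <stdlib.h>
    #include <math.h>

    extern void sgetrf_(const int *m, const int *n, float *a, const int *lda,
                        int *ipiv, int *info);
    extern void sgetrs_(const char *trans, const int *n, const int *nrhs,
                        const float *a, const int *lda, const int *ipiv,
                        float *b, const int *ldb, int *info);
    extern void dgemv_(const char *trans, const int *m, const int *n,
                       const double *alpha, const double *a, const int *lda,
                       const double *x, const int *incx,
                       const double *beta, double *y, const int *incy);

    /* Solve Ax = b to double precision accuracy via a single precision
       LU factorization plus iterative refinement. Returns LAPACK info. */
    int lu_refine(int n, const double *a, const double *b, double *x)
    {
        const int ione = 1, maxit = 30;
        const double dmone = -1.0, done = 1.0;
        float *as = malloc((size_t)n * n * sizeof *as); /* sp copy of A  */
        float *rs = malloc(n * sizeof *rs);             /* (the 1.5x memory) */
        double *r = malloc(n * sizeof *r);
        int *ipiv = malloc(n * sizeof *ipiv), info;

        for (int i = 0; i < n * n; i++) as[i] = (float)a[i];
        sgetrf_(&n, &n, as, &n, ipiv, &info);           /* O(n^3), sp   */

        for (int i = 0; i < n; i++) rs[i] = (float)b[i];
        sgetrs_("N", &n, &ione, as, &n, ipiv, rs, &n, &info);
        for (int i = 0; i < n; i++) x[i] = rs[i];       /* initial x    */

        for (int it = 0; it < maxit; it++) {
            /* r = b - A*x in double precision: O(n^2) per iteration */
            for (int i = 0; i < n; i++) r[i] = b[i];
            dgemv_("N", &n, &n, &dmone, a, &n, x, &ione, &done, r, &ione);

            double rnorm = 0.0;
            for (int i = 0; i < n; i++) rnorm = fmax(rnorm, fabs(r[i]));
            if (rnorm < 1e-14) break;                   /* escape condition */

            /* Solve A*x1 = r with the existing sp factors; x = x + x1 */
            for (int i = 0; i < n; i++) rs[i] = (float)r[i];
            sgetrs_("N", &n, &ione, as, &n, ipiv, rs, &n, &info);
            for (int i = 0; i < n; i++) x[i] += rs[i];
        }
        free(as); free(rs); free(r); free(ipiv);
        return info;
    }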
30. FDPM - contents
- FDPM_sLUr
- FDPM_pLUr
- FDPM_sLUc
- FDPM_pLUc
- FDPM_sLTr
- FDPM_pLTr
- FDPM_sLTc
- FDPM_pLTc
- FDPM_sQRr
- FDPM_pQRr
- FDPM_sQRc
- FDPM_pQRc
- FDPM_sDCr
- FDPM_pDCr
- FDPM_sDCc
- FDPM_sCONr
- dgesv
- pdgesv
- zgesv
31. FDPM - interface
- LAPACK and ScaLAPACK friendly
- Parameters are in LAPACK-like ordering
- Routine names, e.g. FDPM_dLUz
- Replaces both the factor and the solve
- E.g. for serial double precision LU, replace the calls to dgetrf and dgetrs with a single call to FDPM_sLUd (the replaced calls are sketched below)
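For reference, the two LAPACK calls that the single FDPM call replaces (the FDPM routine's own signature is not reproduced here; it ships with the library documentation):

    extern void dgetrf_(const int *m, const int *n, double *a, const int *lda,
                        int *ipiv, int *info);
    extern void dgetrs_(const char *trans, const int *n, const int *nrhs,
                        const double *a, const int *lda, const int *ipiv,
                        double *b, const int *ldb, int *info);

    /* Classic LAPACK two-step solve; overwrites b with the solution. */
    void solve_lapack(int n, double *a, double *b, int *ipiv)
    {
        int ione = 1, info;
        dgetrf_(&n, &n, a, &n, ipiv, &info);                /* factor A = LU */
        dgetrs_("N", &n, &ione, a, &n, ipiv, b, &n, &info); /* solve         */
        /* With FDPM, both calls collapse into one FDPM_sLUd call taking
           LAPACK-like parameters, per the interface described above. */
    }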
32. FDPM
- If your code uses a higher-level driver routine to solve a linear system, e.g. dgesv, then you can use FDPM without making explicit calls to FDPM routines
- Set the environment variable FDPM_USE_DRIVER
- dgesv will then be called from FDPM, and the LAPACK version used if it does not converge within 30 iterations
33. Helping us achieve our goal
- Please speak to the libraries group
  - Requests for features/functionality
  - Performance information (especially ACML)
- Send any performance info to adrian@cray.com and ldr@cray.com
34. Suggested exercises
- Generate a double precision random matrix A and right hand side vector b
- Find the condition number of the matrix using FDPM_sCONd
- Factor and solve the system using the LU driver dgesv
- Solve the system using FDPM_sLUd and compare performance
- The library exists on bigben at atate/libfdpm.a (a starting sketch follows)
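A possible starting point for the dgesv portion of the exercise (standard LAPACK driver; the FDPM_sCONd and FDPM_sLUd steps are left as a comment because their exact signatures come with the library on bigben):

    #include <stdio.h>
    #include <stdlib.h>

    extern void dgesv_(const int *n, const int *nrhs, double *a, const int *lda,
                       int *ipiv, double *b, const int *ldb, int *info);

    int main(void) {
        int n = 2000, ione = 1, info;
        double *a = malloc((size_t)n * n * sizeof *a);
        double *b = malloc(n * sizeof *b);
        int *ipiv = malloc(n * sizeof *ipiv);

        /* Random double precision matrix A and right hand side b */
        for (int i = 0; i < n * n; i++) a[i] = (double)rand() / RAND_MAX;
        for (int i = 0; i < n; i++)     b[i] = (double)rand() / RAND_MAX;

        /* TODO: estimate the condition number with FDPM_sCONd, then time
           this solve against FDPM_sLUd (link against atate/libfdpm.a). */
        dgesv_(&n, &ione, a, &n, ipiv, b, &n, &info);
        printf("dgesv info = %d\n", info);

        free(a); free(b); free(ipiv);
        return 0;
    }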