1
XT3 Optimization Scientific Libraries
  • Adrian Tate,
  • Scientific Libraries Engineer
  • Cray Inc.

2
Contents
  • Libraries features and plans
  • XT3 Performance tips
  • FDPM (Fast Double Precision Math) library
  • Suggested exercises

3
Scientific Libraries
  • The purpose of a scientific library is to provide highly tuned versions of common mathematical operations, making users' lives easier.

4
XT3 Math Libraries
[Diagram: the XT3 math library stack]
  • XT3 LibSci: ScaLAPACK, SuperLU, Cray FFTs, Sparse BLAS, FDPM
  • ACML: BLAS, LAPACK, FFTs, random number generators
  • Third party: FFTW, libGoto
5
Cray XT3 Libraries
  • Cray XT LibSci
  • ScaLAPACK
  • SuperLU
  • Cray FFTs
  • PETSc
  • Sparse BLAS
  • AMD Core Math Library (ACML)
  • BLAS
  • LAPACK
  • ACML FFTs
  • Random number generators
  • vector math libraries (link with acml_mv)
  • Goto BLAS
  • FFTW (3.1.1, 2.1.5)
  • OpenMP support with Compute Node Linux

6
Using Libraries
  • LibSci is loaded with the PrgEnv module: module load PrgEnv-pgi/1.3.30
  • Cray FFTs: module load libscifft-pgi/1.0.0
  • FFTW: module load FFTW/2.1.5 or FFTW/3.1.1
  • LibGoto: download from UT
  • You don't need to link explicitly with the libsci library

7
FFTs on Cray XT3
  • FFTs in ACML contain OpenMP versions of the 2D and 3D FFTs
  • Cray FFT interfaces to the ACML FFTs
  • FFTW 3.1.1: initial release by fall of 2006; pre-built plans (Wisdom) available by end of 2006
  • FFTW 2.1.5: included only for the MPI FFTW; tuned on demand
  • Additional optimizations of FFTW 3.x in 2007

8
ScaLAPACK on XT3
  • ScaLAPACK, PBLAS and BLACS in XT-libsci
  • Improved eigensolver support: a beta release of the MRRR eigensolver for XT3 in late 2006
  • A new communications layer for ScaLAPACK in the Baker timeframe

9
Sparse Solvers
  • SuperLU and SuperLU_dist in libsci
  • Sparse BLAS (single processor) to support
    iterative solvers in PETSc
  • Sparse matrix-vector multiply
  • Sparse triangular solve
  • OSKI custom routines from the Sparse BLAS standard
  • Investigating Epetra, which provides a parallel sparse BLAS built on the single-processor sparse BLAS

10
Library Goal
  • Have the best scientific libraries in the industry within the next 2 years

11
XT3 Library Performance Tips
  • Libraries should perform well by definition
  • This requires two things from customers:
  • Correct usage
  • Communication with the libraries group

12
BLAS library
  • Where there is a choice, experiment with options
  • BLAS exist in both ACML and LibGoto
  • The two have completely different design philosophies, and their relative performance differs with problem size, shape, and data type
  • Experiment with each (a benchmarking sketch follows)
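
A minimal benchmarking sketch in C: the same zgemm timing loop linked once against each BLAS (the link lines in the comment are illustrative; actual library names and module setup vary by system):

/* zgemm_bench.c: a minimal sketch for comparing the two BLAS
 * libraries on the same problem. Build and link the same file once
 * against each, e.g. cc zgemm_bench.c -lacml vs. cc zgemm_bench.c
 * -lgoto (illustrative link lines). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Fortran BLAS interface; double complex passed as a 2-double struct */
typedef struct { double re, im; } dcomplex;
void zgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const dcomplex *alpha, const dcomplex *a, const int *lda,
            const dcomplex *b, const int *ldb,
            const dcomplex *beta, dcomplex *c, const int *ldc);

int main(void)
{
    const int n = 1000;              /* M = N = K = 1000, as in the chart */
    dcomplex *a = malloc(sizeof *a * n * n);
    dcomplex *b = malloc(sizeof *b * n * n);
    dcomplex *c = malloc(sizeof *c * n * n);
    for (int i = 0; i < n * n; i++) {
        a[i].re = drand48(); a[i].im = drand48();
        b[i].re = drand48(); b[i].im = drand48();
        c[i].re = 0.0;       c[i].im = 0.0;
    }
    const dcomplex one = {1.0, 0.0}, zero = {0.0, 0.0};

    clock_t t0 = clock();
    zgemm_("N", "N", &n, &n, &n, &one, a, &n, b, &n, &zero, c, &n);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    /* a complex GEMM performs 8*n^3 real flops */
    printf("zgemm %d: %.3f s, %.2f GFlop/s\n", n, secs,
           8.0 * n * n * n / secs / 1e9);
    free(a); free(b); free(c);
    return 0;
}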

13
[Chart: ACML vs. Goto zgemm performance, M = N = 1000]

14
[Chart: ACML vs. Goto zgemm performance, M = N = 3000]

15
Utilize SSE2 instructions
  • 32-bit arithmetic on Opteron can be about 40% quicker (see the sketch below)
  • SSE instructions operate on vectors of length 4 in single precision
  • Where you do not need the precision, use the single-precision math routine
  • E.g., the preconditioner for an iterative solver
  • Within the libraries we exploit this:
  • FDPM
  • Short-precision transcendental functions
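
To see the gap, a small sketch timing sgemm against dgemm; an SSE2 register holds 4 single-precision operands but only 2 doubles, so sgemm can run substantially faster than dgemm:

/* sp_vs_dp.c: single- vs double-precision GEMM throughput. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void sgemm_(const char*, const char*, const int*, const int*, const int*,
            const float*, const float*, const int*, const float*, const int*,
            const float*, float*, const int*);
void dgemm_(const char*, const char*, const int*, const int*, const int*,
            const double*, const double*, const int*, const double*, const int*,
            const double*, double*, const int*);

int main(void)
{
    const int n = 2000;
    float  *sa = malloc(sizeof *sa * n * n), *sb = malloc(sizeof *sb * n * n),
           *sc = malloc(sizeof *sc * n * n);
    double *da = malloc(sizeof *da * n * n), *db = malloc(sizeof *db * n * n),
           *dc = malloc(sizeof *dc * n * n);
    for (int i = 0; i < n * n; i++) {
        da[i] = drand48(); db[i] = drand48(); dc[i] = 0.0;
        sa[i] = (float)da[i]; sb[i] = (float)db[i]; sc[i] = 0.0f;
    }
    const float  f1 = 1.0f, f0 = 0.0f;
    const double d1 = 1.0,  d0 = 0.0;

    clock_t t = clock();
    sgemm_("N", "N", &n, &n, &n, &f1, sa, &n, sb, &n, &f0, sc, &n);
    double ss = (double)(clock() - t) / CLOCKS_PER_SEC;

    t = clock();
    dgemm_("N", "N", &n, &n, &n, &d1, da, &n, db, &n, &d0, dc, &n);
    double ds = (double)(clock() - t) / CLOCKS_PER_SEC;

    printf("sgemm %.2f GFlop/s, dgemm %.2f GFlop/s\n",
           2.0 * n * n * n / ss / 1e9, 2.0 * n * n * n / ds / 1e9);
    return 0;
}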

16
Introduce yourself to your data
  • Solver performance can depend on the characteristics of your matrix
  • Eigensolver behavior will be wildly different for different kinds of data
  • Generate a graphic of your eigenvalue spectrum

17
[Chart: example of a clustered eigenvalue spectrum]
18
Decision tree for eigensolvers (XT3)

  • Eigenvalues only: QR
  • Eigenvalues and eigenvectors, full spectrum: 1. MRRR, 2. Divide and Conquer
  • Eigenvalues and eigenvectors, part spectrum, not clustered: MRRR / Bisection with inverse iteration
  • Eigenvalues and eigenvectors, part spectrum, clustered: MRRR
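
Assuming these map to the usual LAPACK symmetric eigensolvers (the slide names only the algorithms): QR is dsyev, divide and conquer is dsyevd, MRRR is dsyevr, and bisection with inverse iteration is dsyevx. A minimal C sketch of the MRRR path, asking dsyevr for part of the spectrum:

/* mrrr_part_spectrum.c: ask LAPACK's MRRR driver dsyevr for part of
 * the spectrum (here the 2 smallest eigenpairs of a small symmetric
 * test matrix). On the XT3 this resolves to the LAPACK in ACML. */
#include <stdio.h>
#include <stdlib.h>

void dsyevr_(const char *jobz, const char *range, const char *uplo,
             const int *n, double *a, const int *lda,
             const double *vl, const double *vu, const int *il, const int *iu,
             const double *abstol, int *m, double *w, double *z,
             const int *ldz, int *isuppz, double *work, const int *lwork,
             int *iwork, const int *liwork, int *info);

int main(void)
{
    int n = 4, il = 1, iu = 2, m, info;
    double a[16] = { 4, 1, 0, 0,        /* column-major, symmetric */
                     1, 4, 1, 0,
                     0, 1, 4, 1,
                     0, 0, 1, 4 };
    double vl = 0, vu = 0, abstol = 0;  /* vl/vu unused for range "I" */
    double w[4], z[16];
    int isuppz[8];

    /* workspace query: lwork = liwork = -1 returns optimal sizes */
    double wkopt; int iwkopt, lwork = -1, liwork = -1;
    dsyevr_("V", "I", "U", &n, a, &n, &vl, &vu, &il, &iu, &abstol,
            &m, w, z, &n, isuppz, &wkopt, &lwork, &iwkopt, &liwork, &info);

    lwork = (int)wkopt; liwork = iwkopt;
    double *work  = malloc(sizeof *work  * lwork);
    int    *iwork = malloc(sizeof *iwork * liwork);
    dsyevr_("V", "I", "U", &n, a, &n, &vl, &vu, &il, &iu, &abstol,
            &m, w, z, &n, isuppz, work, &lwork, iwork, &liwork, &info);

    for (int i = 0; i < m; i++)
        printf("lambda_%d = %f\n", il + i, w[i]);
    free(work); free(iwork);
    return 0;
}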
19
[Chart: e.g. Application A, computing 10% of the spectrum]
20
Parallel Libraries
  • For parallel libraries, increase the blocksize
  • Many codes use the same blocksize across different machines
  • The XT3 MPI implementation works best with fewer, larger messages
  • A bigger distribution block size is better (up to a point)
  • You need to find this sweet spot (see the sketch below)
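
The distribution blocksize is fixed when the ScaLAPACK array descriptor is built, so that is the knob to turn. A minimal sketch, assuming the usual C BLACS bindings (Cblacs_*) and the Fortran tool routines numroc_/descinit_ are linkable from libsci:

/* desc_blocksize.c: where the distribution blocksize enters. */
#include <stdio.h>

void Cblacs_pinfo(int *mype, int *npes);
void Cblacs_get(int ictxt, int what, int *val);
void Cblacs_gridinit(int *ictxt, const char *order, int nprow, int npcol);
void Cblacs_gridinfo(int ictxt, int *nprow, int *npcol, int *myrow, int *mycol);
void Cblacs_exit(int status);
int  numroc_(const int *n, const int *nb, const int *iproc,
             const int *isrcproc, const int *nprocs);
void descinit_(int *desc, const int *m, const int *n, const int *mb,
               const int *nb, const int *irsrc, const int *icsrc,
               const int *ictxt, const int *lld, int *info);

int main(void)
{
    int mype, npes, ictxt, nprow = 2, npcol = 2, myrow, mycol, info;
    Cblacs_pinfo(&mype, &npes);
    Cblacs_get(0, 0, &ictxt);
    Cblacs_gridinit(&ictxt, "Row", nprow, npcol);
    Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);

    /* the tunable: on XT3 try nb = 64, 128, ... rather than the
     * habitual 32 carried over from other machines */
    int m = 8192, n = 8192, nb = 128, izero = 0, desc[9];
    int locr = numroc_(&m, &nb, &myrow, &izero, &nprow);
    int locc = numroc_(&n, &nb, &mycol, &izero, &npcol);
    int lld  = locr > 1 ? locr : 1;
    descinit_(desc, &m, &n, &nb, &nb, &izero, &izero, &ictxt, &lld, &info);

    printf("rank %d: local array is %d x %d for nb = %d\n",
           mype, locr, locc, nb);
    Cblacs_exit(0);
    return 0;
}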

21
1 stage of LU - limit of block size growth
[Diagram: one stage of blocked LU, showing BLAS2 panel blocks alongside BLAS3 update blocks; the BLAS2 panel work is what limits useful block-size growth]
22
Blocksize variance for Cholesky

23
PDSYEVD blocksize

24
Redistribute
  • Redistribute into the optimal blocksize between code portions (see the sketch below)
  • Redistribution is cheap compared to its advantages
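
ScaLAPACK's redistribution routine pdgemr2d (Cpdgemr2d from C) copies a distributed matrix between any two descriptors, so switching blocksizes between code portions is one call. A sketch moving a matrix from nb = 32 to nb = 128 on the same process grid:

/* redist.c: switch a distributed matrix between blocksizes with one
 * call. Grid and descriptor setup as in the previous sketch. */
#include <stdlib.h>

void Cblacs_pinfo(int*, int*);
void Cblacs_get(int, int, int*);
void Cblacs_gridinit(int*, const char*, int, int);
void Cblacs_gridinfo(int, int*, int*, int*, int*);
void Cblacs_exit(int);
int  numroc_(const int*, const int*, const int*, const int*, const int*);
void descinit_(int*, const int*, const int*, const int*, const int*,
               const int*, const int*, const int*, const int*, int*);
void Cpdgemr2d(int m, int n, double *a, int ia, int ja, int *desca,
               double *b, int ib, int jb, int *descb, int ictxt);

int main(void)
{
    int mype, npes, ictxt, nprow = 2, npcol = 2, myrow, mycol, info;
    int m = 4096, n = 4096, nb1 = 32, nb2 = 128, izero = 0;
    Cblacs_pinfo(&mype, &npes);
    Cblacs_get(0, 0, &ictxt);
    Cblacs_gridinit(&ictxt, "Row", nprow, npcol);
    Cblacs_gridinfo(ictxt, &nprow, &npcol, &myrow, &mycol);

    /* local pieces and descriptors for both distributions */
    int lr1 = numroc_(&m, &nb1, &myrow, &izero, &nprow);
    int lc1 = numroc_(&n, &nb1, &mycol, &izero, &npcol);
    int lr2 = numroc_(&m, &nb2, &myrow, &izero, &nprow);
    int lc2 = numroc_(&n, &nb2, &mycol, &izero, &npcol);
    double *a = calloc((size_t)lr1 * lc1, sizeof *a);
    double *b = calloc((size_t)lr2 * lc2, sizeof *b);
    int desca[9], descb[9];
    descinit_(desca, &m, &n, &nb1, &nb1, &izero, &izero, &ictxt, &lr1, &info);
    descinit_(descb, &m, &n, &nb2, &nb2, &izero, &izero, &ictxt, &lr2, &info);

    /* ... code portion that works on A at nb = 32 ... */
    Cpdgemr2d(m, n, a, 1, 1, desca, b, 1, 1, descb, ictxt);
    /* ... code portion that works on B at nb = 128 ... */

    Cblacs_exit(0);
    return 0;
}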

25
Redistribution is cheap (M = 3000, 64 compute nodes)
26
Use single precision (again)
  • Apart from the SSE2 utilization, using single precision means you can increase the blocksize and still maintain the same MB/s across the network
  • Applies to ScaLAPACK, PBLAS, and SuperLU_dist

27
Use sparse solvers
  • Nearly all real data is sparse
  • Dense linear algebra is a dying art
  • Sparse solvers are O(n²), versus O(n³) for dense factorizations
  • Cray XT-libsci will include iterative solvers in a future release
  • ACML will support direct solvers in a future release

28
FDPM Library
  • Fast Double Precision Math Library
  • Uses iterative refinement to generate double
    precision accuracy with single precision
    arithmetic
  • Requires 1.5× the memory
  • Can only be used for well-conditioned problems
  • Release 1.4 includes:
  • Serial linear solvers (LU, Cholesky, QR)
  • Parallel linear solvers
  • Tools for condition number estimation
  • Future release: eigenvalues

29
e.g. LU with refinement
  • Ax = b
  • Solve LUx = b with single precision, but keep a copy of A in double precision
  • Generate r = b - Ax in double precision (O(n²))
  • Solve Ax1 = r for the correction x1 using the single-precision solver
  • x = x + x1
  • Iterate, with escape conditions (a sketch of this loop follows)
  • If the matrix has a high condition number, refinement will take too long
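
A minimal sketch of that loop using plain LAPACK/BLAS pieces: sgetrf/sgetrs for the single-precision factorization and solve, dgemv for the double-precision residual. The diagonally dominant test matrix and the crude convergence test are simplifications; FDPM's actual escape conditions are internal to the library:

/* refine.c: mixed-precision iterative refinement for Ax = b. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

void sgetrf_(const int *m, const int *n, float *a, const int *lda,
             int *ipiv, int *info);
void sgetrs_(const char *trans, const int *n, const int *nrhs,
             const float *a, const int *lda, const int *ipiv,
             float *b, const int *ldb, int *info);
void dgemv_(const char *trans, const int *m, const int *n,
            const double *alpha, const double *a, const int *lda,
            const double *x, const int *incx,
            const double *beta, double *y, const int *incy);

int main(void)
{
    const int n = 500, ione = 1;
    double *A  = malloc(sizeof *A * n * n);   /* copy of A in double  */
    float  *As = malloc(sizeof *As * n * n);  /* working copy, single */
    double *b  = malloc(sizeof *b * n), *r = malloc(sizeof *r * n);
    double *x  = calloc(n, sizeof *x);
    float  *rs = malloc(sizeof *rs * n);
    int *ipiv  = malloc(sizeof *ipiv * n), info;

    for (int i = 0; i < n * n; i++) { A[i] = drand48(); As[i] = (float)A[i]; }
    for (int i = 0; i < n; i++) {             /* well conditioned on purpose */
        A[i * n + i] += n; As[i * n + i] += (float)n; b[i] = drand48();
    }

    sgetrf_(&n, &n, As, &n, ipiv, &info);     /* O(n^3) in single precision */

    const double done = 1.0, dmone = -1.0;
    for (int it = 0; it < 30; it++) {         /* escape after 30 iterations */
        for (int i = 0; i < n; i++) r[i] = b[i];
        dgemv_("N", &n, &n, &dmone, A, &n, x, &ione, &done, r, &ione);
        /* r = b - Ax in double precision: O(n^2) per iteration */
        double nrm = 0.0;
        for (int i = 0; i < n; i++) nrm = fmax(nrm, fabs(r[i]));
        if (nrm < 1e-14) break;               /* crude absolute test */
        for (int i = 0; i < n; i++) rs[i] = (float)r[i];
        sgetrs_("N", &n, &ione, As, &n, ipiv, rs, &n, &info);  /* A x1 = r */
        for (int i = 0; i < n; i++) x[i] += (double)rs[i];     /* x = x + x1 */
    }
    printf("x[0] = %.15f\n", x[0]);
    return 0;
}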

30
FDPM - contents
  • FDPM_sLUr
  • FDPM_pLUr
  • FDPM_sLUc
  • FDPM_pLUc
  • FDPM_sLTr
  • FDPM_pLTr
  • FDPM_sLTc
  • FDPM_pLTc
  • FDPM_sQRr
  • FDPM_pQRr
  • FDPM_sQRc
  • FDPM_pQRc
  • FDPM_sDCr
  • FDPM_pDCr
  • FDPM_sDCc
  • FDPM_sCONr
  • dgesv
  • pdgesv
  • zgesv

31
FDPM - interface
  • LAPACK and ScaLAPACK friendly
  • Parameters are in LAPACK-like ordering
  • Routine names (e.g.) FDPM_dLUz
  • Replaces both factor and solve
  • E.g., for a serial double-precision LU, replace calls to dgetrf and dgetrs with a call to FDPM_sLUd (see the sketch below)
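
A before/after sketch of the substitution. The FDPM prototype shown is hypothetical (the deck says only that parameter ordering is LAPACK-like), so the call is left commented; consult the FDPM documentation for the real interface:

/* fdpm_swap.c: the substitution described above. */
#include <stdlib.h>

void dgetrf_(const int*, const int*, double*, const int*, int*, int*);
void dgetrs_(const char*, const int*, const int*, const double*, const int*,
             const int*, double*, const int*, int*);
/* HYPOTHETICAL LAPACK-like prototype, for illustration only:
 * void FDPM_sLUd(const int *n, const int *nrhs, double *a, const int *lda,
 *                int *ipiv, double *b, const int *ldb, int *info); */

void solve(int n, int nrhs, double *a, int lda, double *b, int ldb)
{
    int *ipiv = malloc(sizeof *ipiv * n), info;

    /* before: two LAPACK calls, all arithmetic in double precision */
    dgetrf_(&n, &n, a, &lda, ipiv, &info);
    dgetrs_("N", &n, &nrhs, a, &lda, ipiv, b, &ldb, &info);

    /* after: one FDPM call replaces both, doing the O(n^3) work in
     * single precision with double-precision refinement:
     * FDPM_sLUd(&n, &nrhs, a, &lda, ipiv, b, &ldb, &info); */

    free(ipiv);
}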

32
FDPM
  • If your code uses a higher-level driver routine to solve a linear system, e.g. dgesv, then you can use FDPM without making explicit calls to FDPM routines
  • Set the environment variable FDPM_USE_DRIVER
  • dgesv will then be serviced by FDPM, with the LAPACK version used if it does not converge within 30 iterations (see the sketch below)
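
Driver mode therefore requires no source changes at all. A sketch: the dgesv call below is ordinary LAPACK; per the slide, running the same binary with FDPM_USE_DRIVER set in the environment routes the call through FDPM (the exact value the variable expects is not given in the deck):

/* driver_mode.c: nothing FDPM-specific in the source. */
#include <stdlib.h>

void dgesv_(const int *n, const int *nrhs, double *a, const int *lda,
            int *ipiv, double *b, const int *ldb, int *info);

int main(void)
{
    int n = 1000, nrhs = 1, info;
    double *a = malloc(sizeof *a * n * n), *b = malloc(sizeof *b * n);
    int *ipiv = malloc(sizeof *ipiv * n);
    for (int i = 0; i < n * n; i++) a[i] = drand48();
    for (int i = 0; i < n; i++) { a[i * n + i] += n; b[i] = drand48(); }

    dgesv_(&n, &nrhs, a, &n, ipiv, b, &n, &info);  /* unchanged call */

    free(a); free(b); free(ipiv);
    return info;
}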

33
Helping us achieve our goal
  • Please speak to the libraries group
  • Requests for features/functionality
  • Performance information (especially ACML)
  • Send any performance info to adrian@cray.com and ldr@cray.com

34
Suggested exercises
  • Generate a double-precision random matrix A and right-hand-side vector b
  • Find the condition number of the matrix using FDPM_sCONd
  • Factor and solve the system using the LU driver dgesv
  • Solve the system using FDPM_sLUd and compare performance
  • The library exists on bigben at atate/libfdpm.a