Title: Linear Algebra Libraries
1. Linear Algebra Libraries
- James Demmel
- Mathematics and EECS
- UC Berkeley
2. A few more participants (not all!)
- Jack Dongarra, U. Tennessee
- Kathy Yelick, UCB
- Xiaoye Li, NERSC/LBNL
- Tony Drummond, NERSC/LBNL
- Osni Marques, NERSC/LBNL
- Inderjit Dhillon, UT Austin
- Beresford Parlett, UC Berkeley
- Mark Adams, SNL
3. Success Stories (with NERSC, LBNL)
- Cosmic Microwave Background analysis, BOOMERanG collaboration, MADCAP code (Apr. 27, 2000) - uses ScaLAPACK
- Scattering in a quantum system of three charged particles (Rescigno, Baertschy, Isaacs, and McCurdy, Dec. 24, 1999) - uses SuperLU
4. SuperLU: Scalable Sparse Solvers
- Contact: Xiaoye Li, NERSC/LBNL, www.nersc.gov/xiaoye
- SuperLU
  - Solves the sparse linear system Ax = b using Gaussian elimination
  - Efficient and portable
- Sequential SuperLU
  - Achieved up to 40% of peak Mflop rate
- SuperLU_MT (shared memory)
  - Achieved up to 10-fold speedup
- SuperLU_DIST (distributed memory)
  - Achieved up to 100-fold speedup
- Included in HYPRE, PETSc, ML
- To appear in Matlab, Sun Perf Lib, BCSLIB-EXT
- Enabled scientific discovery
  - Solved complex unsymmetric systems of order 2 million on the IBM SP
  - Rescigno, Baertschy, Isaacs, and McCurdy, Science, 24 Dec. 1999
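For a flavor of the library's interface, here is a minimal, hedged sketch of solving Ax = b with sequential SuperLU's simple driver dgssv, modeled on the example programs shipped with the SuperLU distribution (the exact argument list has varied across releases):

```c
/* Sketch: solve Ax = b with sequential SuperLU's simple driver.
   Follows the SuperLU example programs; details vary by release. */
#include "slu_ddefs.h"

int main(void) {
    /* 3x3 example matrix in compressed-column format */
    int    m = 3, n = 3, nnz = 5;
    double nzval[]  = {4.0, 1.0, 3.0, 2.0, 5.0};   /* nonzero values  */
    int    rowind[] = {0, 1, 1, 0, 2};             /* row indices     */
    int    colptr[] = {0, 2, 3, 5};                /* column pointers */
    double rhs[]    = {1.0, 2.0, 3.0};             /* right-hand side */

    SuperMatrix A, B, L, U;
    superlu_options_t options;
    SuperLUStat_t stat;
    int info, *perm_r, *perm_c;

    dCreate_CompCol_Matrix(&A, m, n, nnz, nzval, rowind, colptr,
                           SLU_NC, SLU_D, SLU_GE);
    dCreate_Dense_Matrix(&B, m, 1, rhs, m, SLU_DN, SLU_D, SLU_GE);
    perm_r = intMalloc(m);
    perm_c = intMalloc(n);

    set_default_options(&options);  /* column ordering, pivoting, etc. */
    StatInit(&stat);
    dgssv(&options, &A, perm_c, perm_r, &L, &U, &B, &stat, &info);
    /* On success (info == 0), B now holds the solution x. */
    StatFree(&stat);
    return info;
}
```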
5. The Holy Grail
- Eigensolver for symmetric matrices
- To be propagated throughout LAPACK and ScaLAPACK
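The "Holy Grail" algorithm (Dhillon and Parlett's multiple relatively robust representations, MRRR) entered LAPACK as dsyevr/dstegr. As a hedged sketch of how that driver is reached today through the LAPACKE C wrappers (a later interface, not part of the original slide):

```c
/* Sketch: calling the MRRR-based symmetric eigensolver through the
   LAPACKE C interface (a later wrapper; assumption, not from the slide). */
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    double a[9] = {2.0, 1.0, 0.0,    /* symmetric 3x3, row-major */
                   1.0, 2.0, 1.0,
                   0.0, 1.0, 2.0};
    double w[3], z[9];               /* eigenvalues, eigenvectors */
    lapack_int m, isuppz[6];

    /* 'V': values and vectors, 'A': all of them, 'U': upper triangle */
    lapack_int info = LAPACKE_dsyevr(LAPACK_ROW_MAJOR, 'V', 'A', 'U',
                                     3, a, 3, 0.0, 0.0, 0, 0,
                                     0.0, &m, w, z, 3, isuppz);
    if (info == 0)
        for (int i = 0; i < m; ++i) printf("lambda[%d] = %g\n", i, w[i]);
    return (int)info;
}
```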
6. Parallel Multigrid for Irregular FE Problems
- Sample problem
  - Compute crushing of a stiff sphere with 17 steel and rubber layers, in a rubber cube
  - 80K to 56M dof
  - Up to 640 Cray T3E processors
  - 50% scaled parallel efficiency
- www.cs.berkeley.edu/madams
- 76M dof solved in 70 seconds on 1920-processor ASCI Red (SC'01)
- Prize for Best Industrial Application, Mannheim SuParCup '99
7. The ACTS Toolkit (acts.nersc.gov)
- Advanced Computational Testing and Simulation
- Collection of (mostly) DOE-developed tools
  - Documented and improved
  - Paid for by DOE and NSF/NPACI
  - Freely available to all users
- Education, training, and consulting
  - Tutorials 10/01 and 3/02
- Application-level libraries
  - ScaLAPACK, SuperLU, Aztec, PETSc, Hypre, PVODE, TAO, ATLAS, PHiPAC, ...
- System-level libraries
  - Global Arrays, Overture, POET, POOMA, Globus, Nexus, ...
- Software development tools
  - CUMULVS, PAWS, SILOON, TAU, Tulip, PADRE, PETE
8. Templates Eigensolver Tutorial
- www.siam.org
- www.netlib.org
- Train users to match algorithms and software to problems
9. Future Work
- Continue tech transfer
  - Leverage DOE investment in algorithms, software, training
  - Propagate Holy Grail eigensolver throughout libraries
  - Will change eigenvalue solvers, singular value solvers, least squares
- Better multigrid solvers
  - New coarseners, smoothers
- Listen to the users
  - 22M hits on netlib to (Sca)LAPACK, hundreds of e-mails per year
  - Example: thread-safe versions for use on SMP nodes, like BH
- Automatically tuned kernels
  - Harden research results from PHiPAC, Sparsity, BeBOP
10. Motivation for Automatic Tuning
- Why replace conventional hand tuning by user or vendor?
  - Time consuming and tedious
  - Hard to predict performance from source code
  - Growing list of kernels to tune
  - Must be redone for every architecture and compiler
- Can't depend on the compiler
  - Compiler technology often lags architecture
- Best algorithm may depend on the input, so some tuning must happen at run time
  - Not all algorithms are semantically or mathematically equivalent
11. Automatic Performance Tuning
- Two steps:
  1. Identify and generate a space of algorithms, with various
     - instruction mixes and orders
     - memory access patterns
     - data structures
     - mathematical formulations
  2. Search for the fastest one by running them (see the sketch below)
- When do we search?
  - Once per kernel and architecture
  - At compile time
  - At run time, e.g., once the sparsity structure is known
  - All of the above
- Many examples: PHiPAC, ATLAS, Sparsity, FFTW, Spiral, ...
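To make the search step concrete, here is a minimal hand-rolled sketch in the spirit of these systems; the kernel and its two variants are hypothetical stand-ins, not code from PHiPAC or ATLAS:

```c
/* Sketch: empirical search over kernel variants, in the spirit of
   PHiPAC/ATLAS-style tuning. The variants here are hypothetical. */
#include <stdio.h>
#include <time.h>

#define N 1024
typedef void (*kernel_t)(int n, const double *x, double *y);

static void scale_plain(int n, const double *x, double *y) {
    for (int i = 0; i < n; i++) y[i] = 2.0 * x[i];
}
static void scale_unroll4(int n, const double *x, double *y) {
    int i;
    for (i = 0; i + 3 < n; i += 4) {        /* 4-way unrolled variant */
        y[i]   = 2.0 * x[i];   y[i+1] = 2.0 * x[i+1];
        y[i+2] = 2.0 * x[i+2]; y[i+3] = 2.0 * x[i+3];
    }
    for (; i < n; i++) y[i] = 2.0 * x[i];   /* cleanup loop */
}

int main(void) {
    static double x[N], y[N];
    kernel_t variants[] = {scale_plain, scale_unroll4};
    int best = 0;
    double best_time = 1e30;

    for (int v = 0; v < 2; v++) {           /* search: run and time each */
        clock_t t0 = clock();
        for (int rep = 0; rep < 10000; rep++) variants[v](N, x, y);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_time) { best_time = t; best = v; }
    }
    printf("fastest variant: %d (%.3f s)\n", best, best_time);
    return 0;
}
```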
12. Register Blocking Optimization
- Identify small dense blocks of nonzeros
- Fill in extra zeros to complete blocks
- Use an optimized multiplication code for the particular block size (see the sketch at the end of this slide)
[Figure: a 2x2 register-blocked sparse matrix, with explicit zeros filling out partial blocks]
- Improves register reuse, lowers indexing overhead
- Challenge: adds storage (potentially) and computation
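A generic sketch of the idea (not Sparsity's actual code): store the matrix in 2x2 block compressed sparse row (BSR) format and multiply each stored block with a fully unrolled fixed-size kernel, so the two running sums stay in registers:

```c
/* Sketch: y = A*x with A stored in 2x2 block CSR (BSR) format.
   Generic illustration of register blocking, not Sparsity's code. */
#include <stdio.h>

/* brow_ptr[i]..brow_ptr[i+1] index the 2x2 blocks in block row i;
   bcol_idx gives each block's block-column; val stores the blocks
   contiguously in row-major order (4 doubles per block). */
static void bsr2x2_matvec(int n_brows, const int *brow_ptr,
                          const int *bcol_idx, const double *val,
                          const double *x, double *y) {
    for (int bi = 0; bi < n_brows; bi++) {
        double y0 = 0.0, y1 = 0.0;        /* both rows of the block row
                                             accumulate in registers   */
        for (int k = brow_ptr[bi]; k < brow_ptr[bi + 1]; k++) {
            const double *b = val + 4 * k;
            double x0 = x[2 * bcol_idx[k]], x1 = x[2 * bcol_idx[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;  /* fully unrolled 2x2 kernel */
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * bi]     = y0;
        y[2 * bi + 1] = y1;
    }
}

int main(void) {
    /* 4x4 matrix = 2x2 grid of 2x2 blocks; only the diagonal blocks
       are stored (explicit zeros fill out partial blocks). */
    int brow_ptr[] = {0, 1, 2};
    int bcol_idx[] = {0, 1};
    double val[] = {4, 1, 0, 3,   2, 0, 5, 7};
    double x[4] = {1, 1, 1, 1}, y[4];

    bsr2x2_matvec(2, brow_ptr, bcol_idx, val, x, y);
    for (int i = 0; i < 4; i++) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```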
13. Sparsity: Sparse MatVec Multiply
- Speedups on Itanium
14. Speedup of A^T A x
- Speedups on Pentium 4
- Access A only once (see the sketch below)
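The reason A can be accessed only once: A^T A x = sum_i a_i (a_i^T x) over the rows a_i of A, so each CSR row serves both the inner product and the update while it is still in cache. A generic hedged sketch (not the Sparsity kernel itself):

```c
/* Sketch: y = A^T * (A * x) reading each CSR row of A exactly once.
   Uses A^T A x = sum_i a_i * (a_i . x) over rows a_i of A. */
#include <stdio.h>

static void atax(int m, int n, const int *row_ptr, const int *col_idx,
                 const double *val, const double *x, double *y) {
    for (int j = 0; j < n; j++) y[j] = 0.0;
    for (int i = 0; i < m; i++) {
        double t = 0.0;                      /* t = a_i . x (reads row i) */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            t += val[k] * x[col_idx[k]];
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            y[col_idx[k]] += val[k] * t;     /* y += t * a_i (same row,
                                                still in cache) */
    }
}

int main(void) {
    /* A = [1 2; 0 3] in CSR */
    int row_ptr[] = {0, 2, 3}, col_idx[] = {0, 1, 1};
    double val[] = {1, 2, 3};
    double x[2] = {1, 1}, y[2];
    atax(2, 2, row_ptr, col_idx, val, x, y);
    printf("y = [%g, %g]\n", y[0], y[1]);    /* expect [3, 15] */
    return 0;
}
```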
15. Future Work on Tuning
- Further research
  - More kernels: for numerics, communication, other problems
  - Higher-level algorithm choices (templates)
- NPACI problem
  - How to make tuning usable by non-experts
  - Interface design for sparse problems
16. Titanium
- Susan Graham and Katherine Yelick
- Computer Science Division, EECS
- U.C. Berkeley
- http://titanium.cs.berkeley.edu/
17. Titanium Group
- Susan Graham
- Katherine Yelick
- Paul Hilfinger
- Phillip Colella (LBNL)
- Alex Aiken
- Greg Balls (SDSC)
- Peter McCorquodale (LBNL)
- Andrew Begel
- Dan Bonachea
- Szu-Huey Chang
- Tyson Condie
- Carrie Fei
- David Gay
- Ben Liblit
- Chang Sun Lin
- Geoff Pike
- Jimmy Su
- Ellen Tsai
- Mike Welcome (LBNL)
- Siu Man Yau
18. Titanium Overview
- Object-oriented language based on Java, with:
  - Scalable parallelism: an SPMD model with a global address space
  - Multidimensional arrays: points and index sets as first-class values
  - Immutable classes: user-definable non-reference types, for performance
  - Operator overloading: by demand from our user community
  - Semi-automated memory management: uses memory regions for high performance
19. SciMark Benchmark
- Numerical benchmark for Java and C/C++
- Five kernels
  - FFT (complex, 1D)
  - Successive Over-Relaxation (SOR)
  - Monte Carlo integration (MC)
  - Sparse matrix multiply
  - Dense LU factorization
- Results are reported in Mflops
- Download and run on your machine from http://math.nist.gov/scimark2
  - C and Java sources also provided
- Roldan Pozo, NIST, http://math.nist.gov/Rpozo
20. SciMark: Java vs. C (Sun UltraSPARC 60)
- Sun JDK 1.3 (HotSpot), javac -O; Sun cc -O; SunOS 5.7
- Roldan Pozo, NIST, http://math.nist.gov/Rpozo
21. SciMark: Java vs. C (Intel PIII 500 MHz, Win98)
- Sun JDK 1.2, javac -O; Microsoft VC 5.0, cl -O; Win98
- Roldan Pozo, NIST, http://math.nist.gov/Rpozo
22. Java Compiled by the Titanium Compiler
23. Titanium Compiler on the Power 3 (SP)
24. Java Compiled by the Titanium Compiler
25. SOR Red-Black Loop (small data)
- Using a red-black algorithm, Titanium arrays (191 Mflops) are faster than Java arrays (a sketch of the kernel follows)
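For reference, the shape of the kernel being benchmarked, written here as a generic C sketch (the measured codes are in Java and Titanium): a red-black sweep updates one color at a time, so all updates within a color are independent:

```c
/* Sketch: red-black SOR sweeps on an n x n grid. C version of the
   kernel shape only; the benchmarked codes are in Java/Titanium. */
#include <stdio.h>

static void sor_redblack_sweep(int n, double omega, double u[n][n]) {
    /* color 0 then color 1; within a color, no update reads another
       point of the same color, so the updates are independent */
    for (int color = 0; color < 2; color++)
        for (int i = 1; i < n - 1; i++)
            for (int j = 1 + ((i + color) % 2); j < n - 1; j += 2)
                u[i][j] += omega * (0.25 * (u[i-1][j] + u[i+1][j] +
                                            u[i][j-1] + u[i][j+1]) - u[i][j]);
}

int main(void) {
    enum { N = 8 };
    double u[N][N] = {{0}};
    u[N/2][N/2] = 1.0;                       /* point source */
    for (int it = 0; it < 100; it++)
        sor_redblack_sweep(N, 1.5, u);
    printf("u[center] = %g\n", u[N/2][N/2]);
    return 0;
}
```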
26. Parallel Applications
- Genome application
- Heart simulation
- AMR elliptic and hyperbolic solvers
- Scalable Poisson solver for infinite domains
- Several smaller benchmarks: EM3D, MatMul, LU, FFT, Join, ...
27. MOOSE Application
- Problem: microarray construction
  - Used for genome experiments
  - Possible medical applications long-term
- Microarray Optimal Oligo Selection Engine (MOOSE)
  - A parallel engine for selecting the best oligonucleotide sequences for genetic microarray testing
  - Uses dynamic load balancing within Titanium
28. Heart Simulation
- Problem: compute blood flow in the heart
  - Modeled as an elastic structure in an incompressible fluid
  - The immersed boundary method of Peskin and McQueen; 20 years of development in the model
- Many other applications: blood clotting, inner ear, paper making, embryo growth, and more
- Can be used for design of prosthetics
  - Artificial heart valves
  - Cochlear implants
29. Scalable Poisson Solver
- MLC for finite differences, by Balls and Colella
- Poisson equation with infinite boundaries
  - Arises in astrophysics, some biological systems, etc.
- Method is scalable
  - Low communication
- Performance on SP2 (shown) and T3E
  - Scaled speedups nearly ideal (flat)
- Currently 2D and non-adaptive
30. Error on a High-Wavenumber Problem
- Charge is
  1. a charge of concentric waves
  2. star-shaped charges
- Largest error is where the charge is changing rapidly
- Note: discretization error, plus faint decomposition error
- Run on 16 processors
31. AMR Gas Dynamics
- Developed by McCorquodale and Colella
- 2D example (3D supported)
  - Mach-10 shock on a solid surface at an oblique angle
- Future: self-gravitating gas dynamics package
32. Unstructured Mesh Kernel
- EM3D: relaxation on a 3D unstructured mesh
- Speedup on UltraSPARC SMP
- Simple kernel: mesh not partitioned
33. Recent Developments
- Interfaces to libraries
  - KeLP and (older) PETSc and Metis
- New IBM SP implementation
  - Uses LAPI rather than MPI, about 2x performance gain
- New release: IBM, SGI, Cray, Linux cluster, threads, ...
- Uniprocessor optimizations
  - Method inlining, both automated and manual
  - Cache optimizations
  - Shared pointer analysis
- Support for unstructured computation
  - General sub-array copy, now with arbitrary points
34. Future Plans
- Merge communication layer with UPC
  - Unified Parallel C has broad vendor support
  - Uses the same execution model as Titanium
- Push vendors to expose low-overhead communication
- Automated communication overlap
- Analysis and refinement of cache optimizations
- Additional support for unstructured grids
  - Conjugate gradient and particle methods are motivations
- Better uniprocessor optimizations, possibly new arrays
35. End of Slides
36. Target Problems
- Many modeling problems in astrophysics, biology, material science, and other areas involve an enormous range of spatial and temporal scales
- This requires
  - Adaptive methods
  - Large-scale parallel machines
- Titanium supports
  - Structured grids
  - Locally structured grids (AMR)
  - Unstructured grids (in progress)
37. Local Pointer Analysis
- Compiler can infer many uses of local
- Data structures must be well partitioned