Title: Linear Algebra Libraries
1. Linear Algebra Libraries
- James Demmel
- Mathematics and EECS
- UC Berkeley
2. A few more participants (not all!)
- Jack Dongarra, U. Tennessee
- Kathy Yelick, UCB
- Xiaoye Li, NERSC/LBNL
- Tony Drummond, NERSC/LBNL
- Osni Marques, NERSC/LBNL
- Inderjit Dhillon, UT Austin
- Beresford Parlett, UC Berkeley
- Mark Adams, SNL
3. Success Stories (with NERSC, LBNL)
- Cosmic Microwave Background analysis, BOOMERanG collaboration, MADCAP code (Apr. 27, 2000) - uses ScaLAPACK
- Scattering in a quantum system of three charged particles (Rescigno, Baertschy, Isaacs, and McCurdy, Dec. 24, 1999) - uses SuperLU
4. SuperLU: Scalable Sparse Solvers
- Contact: Xiaoye Li, NERSC/LBNL, www.nersc.gov/xiaoye
- SuperLU
  - Solves the sparse linear system Ax = b using Gaussian elimination
  - Efficient and portable
- Sequential SuperLU
  - Achieved up to 40% of peak Mflop rate
- SuperLU_MT (shared memory)
  - Achieved up to 10-fold speedup
- SuperLU_DIST (distributed memory)
  - Achieved up to 100-fold speedup
- Included in HYPRE, PETSc, ML
- To appear in Matlab, Sun Perf Lib, BCSLIB-EXT
- Enabled scientific discovery
  - Solved complex unsymmetric systems of order 2 million on the IBM SP
  - Rescigno, Baertschy, Isaacs, and McCurdy, Science, 24 Dec. 1999
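For a flavor of the library's interface, here is a minimal, hedged sketch of solving Ax = b with sequential SuperLU's simple driver dgssv, modeled on the example programs shipped with the SuperLU distribution (the exact argument list has varied across releases):

```c
/* Sketch: solve Ax = b with sequential SuperLU's simple driver.
   Follows the SuperLU example programs; details vary by release. */
#include "slu_ddefs.h"

int main(void) {
    /* 3x3 example matrix in compressed-column format */
    int    m = 3, n = 3, nnz = 5;
    double nzval[]  = {4.0, 1.0, 3.0, 2.0, 5.0};   /* nonzero values  */
    int    rowind[] = {0, 1, 1, 0, 2};             /* row indices     */
    int    colptr[] = {0, 2, 3, 5};                /* column pointers */
    double rhs[]    = {1.0, 2.0, 3.0};             /* right-hand side */

    SuperMatrix A, B, L, U;
    superlu_options_t options;
    SuperLUStat_t stat;
    int info, *perm_r, *perm_c;

    dCreate_CompCol_Matrix(&A, m, n, nnz, nzval, rowind, colptr,
                           SLU_NC, SLU_D, SLU_GE);
    dCreate_Dense_Matrix(&B, m, 1, rhs, m, SLU_DN, SLU_D, SLU_GE);
    perm_r = intMalloc(m);
    perm_c = intMalloc(n);

    set_default_options(&options);  /* column ordering, pivoting, etc. */
    StatInit(&stat);
    dgssv(&options, &A, perm_c, perm_r, &L, &U, &B, &stat, &info);
    /* On success (info == 0), B now holds the solution x. */
    StatFree(&stat);
    return info;
}
```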
5. The Holy Grail
- Eigensolver for symmetric matrices
- To be propagated throughout LAPACK and ScaLAPACK
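The "Holy Grail" algorithm (Dhillon and Parlett's multiple relatively robust representations, MRRR) entered LAPACK as dsyevr/dstegr. As a hedged sketch of how that driver is reached today through the LAPACKE C wrappers (a later interface, not part of the original slide):

```c
/* Sketch: calling the MRRR-based symmetric eigensolver through the
   LAPACKE C interface (a later wrapper; assumption, not from the slide). */
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    double a[9] = {2.0, 1.0, 0.0,    /* symmetric 3x3, row-major */
                   1.0, 2.0, 1.0,
                   0.0, 1.0, 2.0};
    double w[3], z[9];               /* eigenvalues, eigenvectors */
    lapack_int m, isuppz[6];

    /* 'V': values and vectors, 'A': all of them, 'U': upper triangle */
    lapack_int info = LAPACKE_dsyevr(LAPACK_ROW_MAJOR, 'V', 'A', 'U',
                                     3, a, 3, 0.0, 0.0, 0, 0,
                                     0.0, &m, w, z, 3, isuppz);
    if (info == 0)
        for (int i = 0; i < m; ++i) printf("lambda[%d] = %g\n", i, w[i]);
    return (int)info;
}
```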
6. Parallel Multigrid for Irregular FE Problems
- Sample problem
  - Compute crushing of a stiff sphere with 17 steel and rubber layers, in a rubber cube
  - 80K to 56M dof
  - Up to 640 Cray T3E processors
  - 50% scaled parallel efficiency
- www.cs.berkeley.edu/madams
- 76M dof solved in 70 seconds on 1920-processor ASCI Red (SC'01)
- Prize for Best Industrial Application, Mannheim SuParCup '99
7. The ACTS Toolkit (acts.nersc.gov)
- Advanced Computational Testing and Simulation
- Collection of (mostly) DOE-developed tools
  - Documented and improved
  - Paid for by DOE and NSF/NPACI
  - Freely available to all users
- Education, training, and consulting
  - Tutorials 10/01 and 3/02
- Application-level libraries
  - ScaLAPACK, SuperLU, Aztec, PETSc, Hypre, PVODE, TAO, ATLAS, PHiPAC, ...
- System-level libraries
  - Global Arrays, Overture, POET, POOMA, Globus, Nexus, ...
- Software development tools
  - CUMULVS, PAWS, SILOON, TAU, Tulip, PADRE, PETE
8. Templates Eigensolver Tutorial
- www.siam.org
- www.netlib.org
- Train users to match algorithms and software to problems
9. Future Work
- Continue tech transfer
  - Leverage DOE investment in algorithms, software, training
  - Propagate Holy Grail eigensolver throughout libraries
  - Will change eigenvalue solvers, singular value solvers, least squares
- Better multigrid solvers
  - New coarseners, smoothers
- Listen to the users
  - 22M hits on netlib to (Sca)LAPACK, hundreds of e-mails per year
  - Example: thread-safe versions for use on SMP nodes, like BH
- Automatically tuned kernels
  - Harden research results from PHiPAC, Sparsity, BeBOP
10. Motivation for Automatic Tuning
- Why replace conventional hand tuning by user or vendor?
  - Time consuming and tedious
  - Hard to predict performance from source code
  - Growing list of kernels to tune
  - Must be redone for every architecture and compiler
- Can't depend on the compiler
  - Compiler technology often lags architecture
- Best algorithm may depend on the input, so some tuning must happen at run time
  - Not all algorithms are semantically or mathematically equivalent
11. Automatic Performance Tuning
- Two steps:
  1. Identify and generate a space of algorithms, with various
     - instruction mixes and orders
     - memory access patterns
     - data structures
     - mathematical formulations
  2. Search for the fastest one by running them (see the sketch below)
- When do we search?
  - Once per kernel and architecture
  - At compile time
  - At run time, e.g., once the sparsity structure is known
  - All of the above
- Many examples: PHiPAC, ATLAS, Sparsity, FFTW, Spiral, ...
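To make the search step concrete, here is a minimal hand-rolled sketch in the spirit of these systems; the kernel and its two variants are hypothetical stand-ins, not code from PHiPAC or ATLAS:

```c
/* Sketch: empirical search over kernel variants, in the spirit of
   PHiPAC/ATLAS-style tuning. The variants here are hypothetical. */
#include <stdio.h>
#include <time.h>

#define N 1024
typedef void (*kernel_t)(int n, const double *x, double *y);

static void scale_plain(int n, const double *x, double *y) {
    for (int i = 0; i < n; i++) y[i] = 2.0 * x[i];
}
static void scale_unroll4(int n, const double *x, double *y) {
    int i;
    for (i = 0; i + 3 < n; i += 4) {        /* 4-way unrolled variant */
        y[i]   = 2.0 * x[i];   y[i+1] = 2.0 * x[i+1];
        y[i+2] = 2.0 * x[i+2]; y[i+3] = 2.0 * x[i+3];
    }
    for (; i < n; i++) y[i] = 2.0 * x[i];   /* cleanup loop */
}

int main(void) {
    static double x[N], y[N];
    kernel_t variants[] = {scale_plain, scale_unroll4};
    int best = 0;
    double best_time = 1e30;

    for (int v = 0; v < 2; v++) {           /* search: run and time each */
        clock_t t0 = clock();
        for (int rep = 0; rep < 10000; rep++) variants[v](N, x, y);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (t < best_time) { best_time = t; best = v; }
    }
    printf("fastest variant: %d (%.3f s)\n", best, best_time);
    return 0;
}
```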
12. Register Blocking Optimization
- Identify small dense blocks of nonzeros
- Fill in extra zeros to complete blocks
- Use an optimized multiplication code for the particular block size (see the sketch at the end of this slide)
[Figure: a 2x2 register-blocked sparse matrix, with explicit zeros filling out partial blocks]
- Improves register reuse, lowers indexing overhead
- Challenge: adds storage (potentially) and computation
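A generic sketch of the idea (not Sparsity's actual code): store the matrix in 2x2 block compressed sparse row (BSR) format and multiply each stored block with a fully unrolled fixed-size kernel, so the two running sums stay in registers:

```c
/* Sketch: y = A*x with A stored in 2x2 block CSR (BSR) format.
   Generic illustration of register blocking, not Sparsity's code. */
#include <stdio.h>

/* brow_ptr[i]..brow_ptr[i+1] index the 2x2 blocks in block row i;
   bcol_idx gives each block's block-column; val stores the blocks
   contiguously in row-major order (4 doubles per block). */
static void bsr2x2_matvec(int n_brows, const int *brow_ptr,
                          const int *bcol_idx, const double *val,
                          const double *x, double *y) {
    for (int bi = 0; bi < n_brows; bi++) {
        double y0 = 0.0, y1 = 0.0;        /* both rows of the block row
                                             accumulate in registers   */
        for (int k = brow_ptr[bi]; k < brow_ptr[bi + 1]; k++) {
            const double *b = val + 4 * k;
            double x0 = x[2 * bcol_idx[k]], x1 = x[2 * bcol_idx[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;  /* fully unrolled 2x2 kernel */
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * bi]     = y0;
        y[2 * bi + 1] = y1;
    }
}

int main(void) {
    /* 4x4 matrix = 2x2 grid of 2x2 blocks; only the diagonal blocks
       are stored (explicit zeros fill out partial blocks). */
    int brow_ptr[] = {0, 1, 2};
    int bcol_idx[] = {0, 1};
    double val[] = {4, 1, 0, 3,   2, 0, 5, 7};
    double x[4] = {1, 1, 1, 1}, y[4];

    bsr2x2_matvec(2, brow_ptr, bcol_idx, val, x, y);
    for (int i = 0; i < 4; i++) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```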
13. Sparsity: Sparse MatVec Multiply
- Speedups on Itanium
14. Speedup of A^T A x
- Speedups on Pentium 4
- Access A only once (see the sketch below)
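The reason A can be accessed only once: A^T A x = sum_i a_i (a_i^T x) over the rows a_i of A, so each CSR row serves both the inner product and the update while it is still in cache. A generic hedged sketch (not the Sparsity kernel itself):

```c
/* Sketch: y = A^T * (A * x) reading each CSR row of A exactly once.
   Uses A^T A x = sum_i a_i * (a_i . x) over rows a_i of A. */
#include <stdio.h>

static void atax(int m, int n, const int *row_ptr, const int *col_idx,
                 const double *val, const double *x, double *y) {
    for (int j = 0; j < n; j++) y[j] = 0.0;
    for (int i = 0; i < m; i++) {
        double t = 0.0;                      /* t = a_i . x (reads row i) */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            t += val[k] * x[col_idx[k]];
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            y[col_idx[k]] += val[k] * t;     /* y += t * a_i (same row,
                                                still in cache) */
    }
}

int main(void) {
    /* A = [1 2; 0 3] in CSR */
    int row_ptr[] = {0, 2, 3}, col_idx[] = {0, 1, 1};
    double val[] = {1, 2, 3};
    double x[2] = {1, 1}, y[2];
    atax(2, 2, row_ptr, col_idx, val, x, y);
    printf("y = [%g, %g]\n", y[0], y[1]);    /* expect [3, 15] */
    return 0;
}
```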
15. Future Work on Tuning
- Further research
  - More kernels: for numerics, communication, other problems
  - Higher-level algorithm choices (templates)
- NPACI problem
  - How to make tuning usable by non-experts
  - Interface design for sparse problems
16. Titanium
- Susan Graham and Katherine Yelick
- Computer Science Division, EECS
- U.C. Berkeley
- http://titanium.cs.berkeley.edu/
17. Titanium Group
- Susan Graham
- Katherine Yelick
- Paul Hilfinger
- Phillip Colella (LBNL)
- Alex Aiken
- Greg Balls (SDSC)
- Peter McCorquodale (LBNL)
- Andrew Begel
- Dan Bonachea
- Szu-Huey Chang
- Tyson Condie
- Carrie Fei
- David Gay
- Ben Liblit
- Chang Sun Lin
- Geoff Pike
- Jimmy Su
- Ellen Tsai
- Mike Welcome (LBNL)
- Siu Man Yau
18. Titanium Overview
- Object-oriented language based on Java, with:
  - Scalable parallelism: an SPMD model with a global address space
  - Multidimensional arrays: points and index sets as first-class values
  - Immutable classes: user-definable non-reference types, for performance
  - Operator overloading: by demand from our user community
  - Semi-automated memory management: uses memory regions for high performance
19. SciMark Benchmark
- Numerical benchmark for Java and C/C++
- Five kernels
  - FFT (complex, 1D)
  - Successive Over-Relaxation (SOR)
  - Monte Carlo integration (MC)
  - Sparse matrix multiply
  - Dense LU factorization
- Results are reported in Mflops
- Download and run on your machine from http://math.nist.gov/scimark2
  - C and Java sources also provided
- Roldan Pozo, NIST, http://math.nist.gov/Rpozo
20. SciMark: Java vs. C (Sun UltraSPARC 60)
- Sun JDK 1.3 (HotSpot), javac -O; Sun cc -O; SunOS 5.7
- Roldan Pozo, NIST, http://math.nist.gov/Rpozo
21. SciMark: Java vs. C (Intel PIII 500 MHz, Win98)
- Sun JDK 1.2, javac -O; Microsoft VC 5.0, cl -O; Win98
- Roldan Pozo, NIST, http://math.nist.gov/Rpozo
22. Java Compiled by the Titanium Compiler
23. Titanium Compiler on the Power 3 (SP)
24. Java Compiled by the Titanium Compiler
25. SOR Red-Black Loop (small data)
- Using a red-black algorithm, Titanium arrays (191 Mflops) are faster than Java arrays (a sketch of the kernel follows)
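For reference, the shape of the kernel being benchmarked, written here as a generic C sketch (the measured codes are in Java and Titanium): a red-black sweep updates one color at a time, so all updates within a color are independent:

```c
/* Sketch: red-black SOR sweeps on an n x n grid. C version of the
   kernel shape only; the benchmarked codes are in Java/Titanium. */
#include <stdio.h>

static void sor_redblack_sweep(int n, double omega, double u[n][n]) {
    /* color 0 then color 1; within a color, no update reads another
       point of the same color, so the updates are independent */
    for (int color = 0; color < 2; color++)
        for (int i = 1; i < n - 1; i++)
            for (int j = 1 + ((i + color) % 2); j < n - 1; j += 2)
                u[i][j] += omega * (0.25 * (u[i-1][j] + u[i+1][j] +
                                            u[i][j-1] + u[i][j+1]) - u[i][j]);
}

int main(void) {
    enum { N = 8 };
    double u[N][N] = {{0}};
    u[N/2][N/2] = 1.0;                       /* point source */
    for (int it = 0; it < 100; it++)
        sor_redblack_sweep(N, 1.5, u);
    printf("u[center] = %g\n", u[N/2][N/2]);
    return 0;
}
```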
26. Parallel Applications
- Genome application
- Heart simulation
- AMR elliptic and hyperbolic solvers
- Scalable Poisson solver for infinite domains
- Several smaller benchmarks: EM3D, MatMul, LU, FFT, Join, ...
27. MOOSE Application
- Problem: microarray construction
  - Used for genome experiments
  - Possible medical applications long-term
- Microarray Optimal Oligo Selection Engine (MOOSE)
  - A parallel engine for selecting the best oligonucleotide sequences for genetic microarray testing
  - Uses dynamic load balancing within Titanium
28. Heart Simulation
- Problem: compute blood flow in the heart
  - Modeled as an elastic structure in an incompressible fluid
  - The immersed boundary method of Peskin and McQueen; 20 years of development in the model
- Many other applications: blood clotting, inner ear, paper making, embryo growth, and more
- Can be used for design of prosthetics
  - Artificial heart valves
  - Cochlear implants
29. Scalable Poisson Solver
- MLC for finite differences, by Balls and Colella
- Poisson equation with infinite boundaries
  - Arises in astrophysics, some biological systems, etc.
- Method is scalable
  - Low communication
- Performance on SP2 (shown) and T3E
  - Scaled speedups nearly ideal (flat)
- Currently 2D and non-adaptive
30. Error on a High-Wavenumber Problem
- Charge is
  1. a charge of concentric waves
  2. star-shaped charges
- Largest error is where the charge is changing rapidly
- Note: discretization error, plus faint decomposition error
- Run on 16 processors
31. AMR Gas Dynamics
- Developed by McCorquodale and Colella
- 2D example (3D supported)
  - Mach-10 shock on a solid surface at an oblique angle
- Future: self-gravitating gas dynamics package
32. Unstructured Mesh Kernel
- EM3D: relaxation on a 3D unstructured mesh
- Speedup on UltraSPARC SMP
- Simple kernel: mesh not partitioned
33. Recent Developments
- Interfaces to libraries
  - KeLP and (older) PETSc and Metis
- New IBM SP implementation
  - Uses LAPI rather than MPI, about 2x performance gain
- New release: IBM, SGI, Cray, Linux cluster, threads, ...
- Uniprocessor optimizations
  - Method inlining, both automated and manual
  - Cache optimizations
  - Shared pointer analysis
- Support for unstructured computation
  - General sub-array copy, now with arbitrary points
34. Future Plans
- Merge communication layer with UPC
  - Unified Parallel C has broad vendor support
  - Uses the same execution model as Titanium
- Push vendors to expose low-overhead communication
- Automated communication overlap
- Analysis and refinement of cache optimizations
- Additional support for unstructured grids
  - Conjugate gradient and particle methods are motivations
- Better uniprocessor optimizations, possibly new arrays
35. End of Slides
36. Target Problems
- Many modeling problems in astrophysics, biology, material science, and other areas involve an enormous range of spatial and temporal scales
- This requires
  - Adaptive methods
  - Large-scale parallel machines
- Titanium supports
  - Structured grids
  - Locally structured grids (AMR)
  - Unstructured grids (in progress)
37. Local Pointer Analysis
- Compiler can infer many uses of local
- Data structures must be well partitioned