Title: Optimizing the Performance of Sparse Matrix-Vector Multiplication
1. Optimizing the Performance of Sparse Matrix-Vector Multiplication
2. Overview
- Motivation
- Optimization techniques
- Register Blocking
- Cache Blocking
- Multiple Vectors
- Sparsity system
- Related Work
- Contribution
- Conclusion
3. Motivation: Usage
- Sparse Matrix-Vector Multiplication
- Usage of this operation
- Iterative Solvers
- Explicit Methods
- Eigenvalue and Singular Value Problems
- Applications in structural modeling, fluid dynamics, document
  retrieval (Latent Semantic Indexing), and many other simulation areas
4. Motivation: Performance (1)
- Matrix-vector multiplication (BLAS 2) is slower than matrix-matrix
  multiplication (BLAS 3).
- For example, on a 167 MHz UltraSPARC I:
  - Vendor-optimized matrix-vector multiplication: 57 Mflops
  - Vendor-optimized matrix-matrix multiplication: 185 Mflops
- The reason: a lower ratio of floating point operations to memory
  operations.
5. Motivation: Performance (2)
- Sparse matrix operations are slower than dense matrix operations.
- For example, on a 167 MHz UltraSPARC I:
  - Dense matrix-vector multiplication: 38 Mflops (naïve
    implementation), 57 Mflops (vendor-optimized implementation)
  - Sparse matrix-vector multiplication (naïve implementation):
    5.7-25 Mflops
- The reason: the indirect data structure leads to inefficient memory
  accesses.
6. Motivation: Optimized Libraries
- Old approach: hand-optimized libraries
  - Vendor-supplied BLAS, LAPACK
- New approach: automatic generation of libraries
  - PHiPAC (dense linear algebra)
  - ATLAS (dense linear algebra)
  - FFTW (fast Fourier transform)
- Our approach: automatic generation of libraries for sparse matrices
  - Additional dimension: the nonzero structure of sparse matrices
7. Sparse Matrix Formats
- There are a large number of sparse matrix formats.
- Point-entry: Coordinate (COO), Compressed Sparse Row (CSR),
  Compressed Sparse Column (CSC), Sparse Diagonal (DIA)
- Block-entry: Block Coordinate (BCO), Block Sparse Row (BSR),
  Block Sparse Column (BSC), Block Diagonal (BDI),
  Variable Block Compressed Sparse Row (VBR)
8. Compressed Sparse Row Format
- We internally use the CSR format because it is a relatively
  efficient format; a minimal sketch of the format and its multiply
  loop follows.
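To make the format concrete, here is a minimal sketch of CSR and its multiply loop in C. The struct layout and names (csr_matrix, row_ptr, col_ind, val) are illustrative assumptions, not Sparsity's actual code. Note the indirect access to x through col_ind, which is the source of the inefficient memory behavior cited above.

```c
#include <stddef.h>

/* Compressed Sparse Row: for an m-by-n matrix, val[k] holds the k-th
 * nonzero, col_ind[k] its column index, and row_ptr[i]..row_ptr[i+1]-1
 * are the positions of the nonzeros of row i. */
typedef struct {
    size_t  m, n;      /* matrix dimensions */
    size_t *row_ptr;   /* length m + 1 */
    size_t *col_ind;   /* length nnz */
    double *val;       /* length nnz */
} csr_matrix;

/* Naive CSR matrix-vector multiply: y = A * x. */
void csr_matvec(const csr_matrix *A, const double *x, double *y)
{
    for (size_t i = 0; i < A->m; i++) {
        double sum = 0.0;
        for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->val[k] * x[A->col_ind[k]];  /* indirect access */
        y[i] = sum;
    }
}
```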
9. Optimization Techniques
- Register Blocking
- Cache Blocking
- Multiple vectors
10. Register Blocking
- Blocked Compressed Sparse Row Format
- Advantages of the format
- Better temporal locality in registers
- The multiplication loop can be unrolled for
better performance
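As an illustration, here is a sketch of a register-blocked multiply for a fixed 2 x 2 block size, in the style of the CSR sketch above. The bcsr22_matrix layout is an assumption for this example, not the system's actual data structure. The 2 x 2 update is fully unrolled, so y0, y1, and the two entries of x stay in registers across each stored block.

```c
#include <stddef.h>

/* Block CSR with fixed 2x2 blocks: each stored block contributes four
 * consecutive entries of val (row-major within the block);
 * brow_ptr/bcol_ind index blocks rather than scalars. */
typedef struct {
    size_t  mb;         /* number of block rows (2 scalar rows each) */
    size_t *brow_ptr;   /* length mb + 1 */
    size_t *bcol_ind;   /* one entry per stored block */
    double *val;        /* 4 doubles per stored block */
} bcsr22_matrix;

/* 2x2 register-blocked multiply: y = A * x, inner update unrolled. */
void bcsr22_matvec(const bcsr22_matrix *A, const double *x, double *y)
{
    for (size_t ib = 0; ib < A->mb; ib++) {
        double y0 = 0.0, y1 = 0.0;          /* held in registers */
        for (size_t k = A->brow_ptr[ib]; k < A->brow_ptr[ib + 1]; k++) {
            const double *b  = &A->val[4 * k];
            const double *xp = &x[2 * A->bcol_ind[k]];
            y0 += b[0] * xp[0] + b[1] * xp[1];
            y1 += b[2] * xp[0] + b[3] * xp[1];
        }
        y[2 * ib]     = y0;
        y[2 * ib + 1] = y1;
    }
}
```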
11. Register Blocking: Fill Overhead
- We use a uniform block size, which adds fill overhead: explicit
  zeros are stored to complete the blocks.
- In the example, fill overhead = 12/7 ≈ 1.71 (12 stored entries
  covering 7 true nonzeros).
- This increases both the space and the number of floating point
  operations.
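One way to compute the fill ratio is to count the r x c blocks needed to cover the nonzeros, as in the sketch below (reusing the illustrative csr_matrix type from earlier). The exact scan is shown for clarity; a production tuner would likely estimate the ratio from a sample of rows to keep the pre-computation cheap.

```c
#include <stdlib.h>

/* Fill ratio of r x c register blocking for a CSR matrix:
 * (stored entries after blocking, explicit zeros included) / nnz.
 * A result of 12/7 = 1.71 means 71% extra space and flops. */
double fill_ratio(const csr_matrix *A, size_t r, size_t c)
{
    size_t nbcols  = (A->n + c - 1) / c;
    char  *touched = calloc(nbcols, 1); /* block cols hit in this block row */
    size_t nblocks = 0, nnz = A->row_ptr[A->m];

    for (size_t ib = 0; ib * r < A->m; ib++) {
        size_t lo = ib * r;
        size_t hi = lo + r < A->m ? lo + r : A->m;
        for (size_t i = lo; i < hi; i++)       /* mark touched blocks */
            for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++) {
                size_t jb = A->col_ind[k] / c;
                if (!touched[jb]) { touched[jb] = 1; nblocks++; }
            }
        for (size_t i = lo; i < hi; i++)       /* clear the marks */
            for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
                touched[A->col_ind[k] / c] = 0;
    }
    free(touched);
    return (double)(nblocks * r * c) / (double)nnz;
}
```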
12. Register Blocking
- Dense matrix profile on an UltraSPARC I (input to the performance
  model)
13. Register Blocking: Selecting the Block Size
- The hard part of the problem is picking the block size so that it
  - minimizes the fill overhead, and
  - maximizes the raw performance.
- Two approaches:
  - Exhaustive search
  - Using a performance model
14. Register Blocking: Performance Model
- Two components to the performance model:
  - Multiplication performance of a dense matrix represented in
    sparse format
  - Estimated fill overhead
- Predicted performance for block size r x c:
  predicted(r, c) = (dense r x c blocked performance) / (fill overhead)
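Putting the two components together, the selection could look like the sketch below: sweep the candidate block sizes, divide the machine's dense blocked rate by the estimated fill, and keep the best. The 1..4 search range and the dense_profile table are illustrative assumptions; fill_ratio is the sketch from the fill-overhead slide.

```c
/* Pick the block size maximizing
 *   predicted(r, c) = dense_profile[r-1][c-1] / fill_ratio(A, r, c)
 * where dense_profile holds the measured Mflop rate of a dense matrix
 * stored in r x c blocked sparse format (the machine profile). */
void choose_block_size(const csr_matrix *A,
                       const double dense_profile[4][4],
                       size_t *best_r, size_t *best_c)
{
    double best = 0.0;
    *best_r = *best_c = 1;
    for (size_t r = 1; r <= 4; r++)
        for (size_t c = 1; c <= 4; c++) {
            double predicted = dense_profile[r - 1][c - 1]
                             / fill_ratio(A, r, c);
            if (predicted > best) {
                best    = predicted;
                *best_r = r;
                *best_c = c;
            }
        }
}
```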
15. Benchmark Matrices
- Matrix 1: dense matrix (1000 x 1000)
- Matrices 2-17: Finite Element Method matrices
- Matrices 18-39: matrices from structural engineering and device
  simulation
- Matrices 40-44: linear programming matrices
- Matrix 45: document retrieval matrix, used for Latent Semantic
  Indexing
- Matrix 46: random matrix (10000 x 10000, density 0.15)
16. Register Blocking: Performance
- The optimization is most effective on the FEM matrices and the dense
  matrix (the lower-numbered matrices).
17. Register Blocking: Performance
- Speedup is generally best on the MIPS R10000, where the optimized
  sparse performance is competitive with the dense BLAS performance
  (DGEMV/DGEMM = 0.38).
18. Register Blocking: Validation of the Performance Model
- Comparison to the performance of exhaustive search (yellow bars,
  block sizes in the lower row) on a subset of the benchmark matrices.
- The exhaustive search does not produce much better results.
19. Register Blocking: Overhead
- Pre-computation overhead:
  - Estimating the fill overhead (red bars)
  - Reorganizing the matrix (yellow bars)
- The ratio gives the number of repetitions after which the
  optimization becomes beneficial.
20. Cache Blocking
- Improves temporal locality of accesses to the source vector.
(Figure: matrix blocks with source vector x and destination vector y,
and their layout in memory)
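A sketch of the idea, under the same illustrative csr_matrix type as before: pre-split the matrix into column strips so that each pass touches only a cache-sized slice of x. Only columns are blocked here for simplicity; rows can be blocked the same way to keep the active part of y resident.

```c
/* Cache-blocked multiply: the matrix is pre-partitioned into column
 * strips of width cwidth, each stored as its own CSR piece with
 * col_ind relative to the strip's first column.  Each strip's pass
 * reuses a cache-resident slice of x. */
void cache_blocked_matvec(const csr_matrix *strips, size_t nstrips,
                          size_t cwidth, const double *x,
                          double *y, size_t m)
{
    for (size_t i = 0; i < m; i++)
        y[i] = 0.0;
    for (size_t s = 0; s < nstrips; s++) {
        const csr_matrix *A  = &strips[s];
        const double     *xs = &x[s * cwidth];  /* this strip's slice */
        for (size_t i = 0; i < A->m; i++) {
            double sum = 0.0;
            for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
                sum += A->val[k] * xs[A->col_ind[k]];
            y[i] += sum;
        }
    }
}
```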
21. Cache Blocking: Performance
- Speedup is generally better on MIPS: a larger cache with a larger
  miss penalty (26/589 ns for MIPS vs. 36/268 ns for the UltraSPARC).
- The exceptions are the document retrieval and random matrices.
22. Cache Blocking: Performance on the Document Retrieval Matrix
- Document retrieval matrix: 10K x 256K with 37M nonzeros; SVD is
  applied to it for LSI (Latent Semantic Indexing).
- The nonzero elements are spread across the matrix, with no dense
  clusters.
- Performance peaks at a 16K x 16K cache block, with a speedup of 3.1.
23. Cache Blocking: When and How to Use It
- From the experiments, the matrices for which cache blocking is most
  effective are large and random.
- We developed a measure of the randomness of a matrix.
- We perform a coarse-grained search to decide the cache block size.
24. Combination of Register and Cache Blocking: UltraSPARC
- The combination is rarely beneficial; it is often slower than either
  of the two optimizations alone.
25. Combination of Register and Cache Blocking: MIPS
26. Multiple Vector Multiplication
- Multiplying by several vectors at once gives a better chance for
  optimization, analogous to BLAS 2 vs. BLAS 3: each matrix element is
  reused across vectors instead of being reloaded per vector.
(Figure: repetition of the single-vector case vs. the multiple-vector
case)
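A sketch of the multiple-vector kernel, again over the illustrative csr_matrix type; the vector count NV and the pointer-per-vector layout are assumptions for the example. Each matrix element is loaded once and applied to all NV right-hand sides, raising the ratio of flops to memory operations.

```c
#define NV 4   /* illustrative number of right-hand sides */

/* Multiply one sparse matrix by NV vectors at once: Y[v] = A * X[v]. */
void csr_matvec_multi(const csr_matrix *A,
                      const double *X[NV], double *Y[NV])
{
    for (size_t i = 0; i < A->m; i++) {
        double sum[NV] = {0.0};
        for (size_t k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++) {
            double a = A->val[k];          /* fetched once ...    */
            size_t j = A->col_ind[k];
            for (int v = 0; v < NV; v++)   /* ... reused NV times */
                sum[v] += a * X[v][j];
        }
        for (int v = 0; v < NV; v++)
            Y[v][i] = sum[v];
    }
}
```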
27. Multiple Vector Multiplication: Performance
- Register blocking performance
- Cache blocking performance
28. Multiple Vector Multiplication: Register Blocking Performance
- The speedup is larger than for single-vector register blocking.
- Even the matrices that previously saw no speedup improved (the
  middle group on the UltraSPARC).
29. Multiple Vector Multiplication: Cache Blocking Performance
(Figures: cache blocking speedups with multiple vectors on MIPS and on
UltraSPARC)
- Noticeable speedup for the matrices that previously saw none
  (UltraSPARC).
- The chosen block sizes are much smaller than those for single-vector
  cache blocking.
30. Sparsity System: Purpose
- Guides the choice of optimization.
- Automatically selects optimization parameters such as the block size
  and the number of vectors.
- http://comix.cs.berkeley.edu/ejim/sparsity
31. Sparsity System: Organization
(Diagram: the Sparsity machine profiler produces a machine performance
profile; the Sparsity optimizer combines that profile with an example
matrix and the maximum number of vectors to produce optimized code and
drivers.)
32. Summary: Speedup of Sparsity on UltraSPARC
- On the UltraSPARC: up to 3x for a single vector, 4.7x for multiple
  vectors.
(Figures: single-vector and multiple-vector speedups)
33. Summary: Speedup of Sparsity on MIPS
- On MIPS: up to 3x for a single vector, 6x for multiple vectors.
(Figures: single-vector and multiple-vector speedups)
34. Summary: Overhead of Sparsity Optimization
- Break-even point: the number of iterations at which the overhead
  time equals the time saved.
- The BLAS Technical Forum includes a parameter in the matrix creation
  routine to indicate how many times the operation will be performed.
35. Related Work (1)
- Dense matrix optimization
  - Loop transformations by compilers: M. Wolf, etc.
  - Hand-optimized libraries: BLAS, LAPACK
- Automatic generation of libraries
  - PHiPAC, ATLAS, and FFTW
- Sparse matrix standardization and libraries
  - BLAS Technical Forum
  - NIST Sparse BLAS, MV++, SparseLib++, TNT
- Hand optimization of sparse matrix-vector multiplication
  - S. Toledo, Oliker et al.
36. Related Work (2)
- Sparse matrix packages
  - SPARSKIT, PSPARSELIB, Aztec, BlockSolve95, Spark98
- Compiling sparse matrix code
  - Sparse compiler (Bik), Bernoulli compiler (Kotlyar)
- On-demand code generation
  - NIST Sparse BLAS, sparse compiler
37. Contributions
- Thorough investigation of memory hierarchy optimizations for sparse
  matrix-vector multiplication
- Performance study on benchmark matrices
- Development of a performance model to choose optimization parameters
- The Sparsity system for automatic tuning and code generation of
  sparse matrix-vector multiplication
38. Conclusion
- Memory hierarchy optimizations for sparse matrix-vector
  multiplication:
  - Register blocking: matrices with dense local structure benefit.
  - Cache blocking: large matrices with random structure benefit.
  - Multiple vector multiplication improves performance further
    because matrix elements are reused.
- The choice of optimization depends on both the matrix structure and
  the machine architecture.
- The automated system helps with this complicated and time-consuming
  process.