Title: Automatic Performance Tuning of SpMV on GPGPU
1Automatic Performance Tuning of SpMV on GPGPU
- Xianyi Zhang
- Lab of Parallel Computing
- Institute of Software, Chinese Academy of Sciences
- zxy_at_mail.rdcps.ac.cn
2Outline
- Motivation
- SpMV Introduction
- AMD Stream Computing
- GOSpMV Overview
- GOSpMV Performance Evaluation
- Conclusion & Future Work
3Motivation
- Sparse Matrix-Vector Multiplication (SpMV): y = y + Ax
- An important kernel in scientific applications
- PDE solver, simulation, etc.
- Low performance
- Irregular memory access pattern
4Motivation
- GPU
- Huge computation power
- Jason Yang, James Goodman. Symmetric Key Cryptography on Modern Graphics Hardware. http://ati.amd.com/technology/streamcomputing/asiacrypt2007.pdf
5SpMV Introduction
- CSR (Compressed Sparse Row)
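A_val = [1, 2, 4, 1]   A_col = [0, 2, 1, 2]   A_ptr = [0, 2, 3, 4]
for (i = 0; i < n; i++) {
    value = 0;
    /* walk the nonzeros of row i */
    for (j = A_ptr[i]; j < A_ptr[i+1]; j++)
        value = value + A_val[j] * x[A_col[j]];
    y[i] += value;
}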
x is accessed irregularly
x is accessed indirectly
6SpMV Introduction
- BCSR (Block Compressed Sparse Row)
- BCSR 2x3
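To make the block format concrete, here is a minimal CPU reference sketch of BCSR SpMV in C. It is not the GOSpMV GPU kernel. The struct name `bcsr_matrix`, its field names, and the choice to store each r x c block contiguously in row-major order (with explicit zeros as padding) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical BCSR container; names and layout are illustrative only. */
typedef struct {
    int r, c;            /* block dimensions: r rows by c columns */
    int n_block_rows;    /* number of block rows */
    const int *b_ptr;    /* first block of each block row; length n_block_rows + 1 */
    const int *b_col;    /* block-column index of each stored block */
    const float *b_val;  /* r*c values per block, row-major, zero-padded */
} bcsr_matrix;

/* Computes y = y + A*x.
   y must hold r * n_block_rows entries; x must cover every block column. */
void bcsr_spmv(const bcsr_matrix *A, const float *x, float *y)
{
    assert(A != NULL && x != NULL && y != NULL);
    for (int bi = 0; bi < A->n_block_rows; bi++) {
        float *ys = y + (size_t)bi * A->r;            /* rows covered by this block row */
        for (int k = A->b_ptr[bi]; k < A->b_ptr[bi + 1]; k++) {
            const float *blk = A->b_val + (size_t)k * A->r * A->c;
            const float *xs = x + (size_t)A->b_col[k] * A->c;
            /* dense r x c block times a contiguous slice of x */
            for (int i = 0; i < A->r; i++) {
                float sum = 0.0f;
                for (int j = 0; j < A->c; j++)
                    sum += blk[i * A->c + j] * xs[j];
                ys[i] += sum;
            }
        }
    }
}
```

Compared with CSR, x is read in contiguous slices of length c and only one column index is stored per block, at the cost of multiplying the padded zeros.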
7AMD Stream Computing
AMD Stream Computing User Guide
8AMD Stream Computing
AMD Stream Computing User Guide
9GOSpMV Overview
- GOSpMV Software Architecture
10GOSpMV Overview
- BCSR SpMV implementation on GPGPU
11GOSpMV Overview
- Automatic Performance Tuning
12GOSpMV Overview
- Off-line GPGPU Benchmark
- Dense matrix (different size)
- Every BCSR block size
13GOSpMV Overview
- Run-time evaluation (search for the optimal BCSR block size)
- Input: sparse matrix A, GPGPU benchmark data P_dense(block-format, nz_d)
- Output: the maximum P(A, block-format, s) and the optimal BCSR block size
- For each BCSR r x c block,
- do
- calculate the fill ratio f^E_rc(A, s) with sample rate s
- P_sp(block-format, nz^E_BCSR) = P_dense(block-format, nz_d), where nz_d is nearest to nz^E_BCSR
- P(A, block-format, s) = P_sp(block-format, nz^E_BCSR) / f^E_rc(A, s)
- done
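The search loop above can be sketched in C as follows. This is an illustration under stated assumptions, not GOSpMV's code: the names `bench_entry`, `fill_ratio`, `lookup_dense`, and `select_block_size` are hypothetical, the sample rate s is modeled as examining every `stride`-th block row, and nz^E_BCSR is taken to be the estimated fill ratio times the nonzero count of A.

```c
#include <assert.h>
#include <stdlib.h>

/* One off-line benchmark sample (illustrative layout): dense-matrix SpMV
   performance for an r x c block format at stored-nonzero count nz_d. */
typedef struct { int r, c; long nz_d; double p_dense; } bench_entry;

/* Estimated fill ratio of a CSR matrix under r x c blocking:
   entries stored in blocks (including padding zeros) / true nonzeros.
   Only every stride-th block row is examined; stride = 1 is exact. */
double fill_ratio(int n_rows, int n_cols, const int *ptr, const int *col,
                  int r, int c, int stride)
{
    assert(n_cols > 0 && r > 0 && c > 0 && stride > 0);
    int n_bcols = (n_cols + c - 1) / c;
    int n_brows = (n_rows + r - 1) / r;
    /* seen[b] = last block row that touched block column b */
    int *seen = malloc((size_t)n_bcols * sizeof *seen);
    assert(seen != NULL);
    for (int b = 0; b < n_bcols; b++) seen[b] = -1;

    long nnz = 0, blocks = 0;
    for (int bi = 0; bi < n_brows; bi += stride) {
        int row_end = (bi + 1) * r < n_rows ? (bi + 1) * r : n_rows;
        for (int i = bi * r; i < row_end; i++) {
            for (int k = ptr[i]; k < ptr[i + 1]; k++) {
                int bc = col[k] / c;
                if (seen[bc] != bi) { seen[bc] = bi; blocks++; }
                nnz++;
            }
        }
    }
    free(seen);
    return nnz > 0 ? (double)blocks * r * c / (double)nnz : 1.0;
}

/* P_dense(block-format, nz_d) for the nz_d nearest to nz; 0 if r x c was not benchmarked. */
static double lookup_dense(const bench_entry *b, int n, int r, int c, long nz)
{
    double p = 0.0;
    long best_gap = -1;
    for (int k = 0; k < n; k++) {
        if (b[k].r != r || b[k].c != c) continue;
        long gap = labs(b[k].nz_d - nz);
        if (best_gap < 0 || gap < best_gap) { best_gap = gap; p = b[k].p_dense; }
    }
    return p;
}

/* Returns the index of the candidate block size with the highest estimated
   performance P = P_dense / fill ratio, or -1 if none has benchmark data. */
int select_block_size(int n_rows, int n_cols, const int *ptr, const int *col,
                      const int *cand_r, const int *cand_c, int n_cand,
                      const bench_entry *bench, int n_bench,
                      int stride, double *best_p)
{
    int best = -1;
    long nnz = ptr[n_rows];
    *best_p = 0.0;
    for (int k = 0; k < n_cand; k++) {
        double f = fill_ratio(n_rows, n_cols, ptr, col, cand_r[k], cand_c[k], stride);
        long nz_est = (long)(f * (double)nnz + 0.5);   /* estimated stored entries */
        double p = lookup_dense(bench, n_bench, cand_r[k], cand_c[k], nz_est) / f;
        if (p > *best_p) { *best_p = p; best = k; }
    }
    return best;
}
```

A larger block size wins only when its dense-benchmark advantage outweighs the padding it introduces, which is why the estimate divides by the fill ratio.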
14GOSpMV Performance Evaluation
- Test box
- Intel Pentium Dual Core E2160/1.8GHz, 2.0GB memory
- GPU
- AMD Radeon HD 3690 (RV670), theoretical peak 428.8 GFLOPS (single precision)
- AMD Stream SDK v1.1-beta
- Ubuntu 8.04, Linux 2.6.24, gcc 4.2.3
- Test matrices
- 8 sparse matrices of different sizes (small, medium, large)
- Small (nonzeros < 100,000)
- Medium (100,000 < nonzeros < 1,000,000)
- Large (nonzeros > 1,000,000)
- Matrix Market and UF Sparse Matrix Collection
15GOSpMV Performance Evaluation
16 GOSpMV Performance Evaluation
- AMD Radeon HD 3690 Result
- SpMV BCSR on GPGPU (1500 iterations)
17 GOSpMV Performance Evaluation
- Different iterations (100,300,500,1000,1500)
18 GOSpMV Performance Evaluation
- Automatic performance tuning (1500 iterations)
- Average speedup: 3.11
19Conclusion
- GOSpMV Performance Speedup
- AMD Radeon HD 3690
- average 3.11, max 5.96, 1500 iterations
- GOSpMV is suited for
- Medium matrices, Large matrices
- Iteration number > 300
- Regular matrices (low fill ratio)
- In general, GOSpMV selects a better BCSR block size through automatic performance tuning.
20Future Work
- Double precision
- Support other BCSR block sizes (e.g. 8x8)
- New HW (AMD RV770)
- Automatic performance tuning strategy
- Re-ordering matrix
21Thank you! Q&A