Title: Optimization of Sparse Matrix Kernels for Data Mining
1. Optimization of Sparse Matrix Kernels for Data Mining
- Eun-Jin Im and Katherine Yelick
- U.C. Berkeley
2. Outline
- SPARSITY: performance optimization of sparse matrix-vector operations
- Sparse matrices in data mining applications
- Performance improvements by SPARSITY for data mining matrices
3. The Need for Optimized Sparse Matrix Codes
- A sparse matrix is represented with indirect data structures (see the sketch after this list).
- Sparse matrix routines are slower than their dense matrix counterparts.
- Performance depends on the distribution of nonzero elements in the matrix.
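For reference, here is a minimal sketch of sparse matrix-vector multiplication in compressed sparse row (CSR) format, one common indirect representation; the identifiers are illustrative, not SPARSITY's actual generated code.

#include <stddef.h>

/* y = A*x for an m-row sparse matrix in CSR format.
 * row_ptr[i] .. row_ptr[i+1] indexes the nonzeros of row i;
 * col_idx and val hold their column indices and values. */
void spmv_csr(size_t m, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];   /* indirect access to x */
        y[i] = sum;
    }
}

The indirect load x[col_idx[k]] is what makes the kernel slower than its dense counterpart: the access pattern into x depends on the nonzero structure, which hurts cache and register reuse.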
4. The Solution: the SPARSITY System
- A system that provides optimized C codes for sparse matrix-vector operations
- http://www.cs.berkeley.edu/~ejim/sparsity
- Related work: ATLAS and PHiPAC for dense matrix routines, and FFTW
5. SPARSITY Optimizations (1): Register Blocking
[Figure: a 2x2 register-blocked sparse matrix]
- Identify small dense blocks of nonzeros.
- Use an optimized multiplication code for the particular block size (a 2x2 sketch follows this list).
- Improves register reuse and lowers indexing overhead.
- Challenge: choosing the block size.
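A minimal sketch of the 2x2 case, assuming a block-CSR layout in which each stored block is a dense 2x2 tile (the layout and names are illustrative):

#include <stddef.h>

/* y = A*x for a matrix stored in 2x2 block-CSR: brow_ptr indexes the
 * blocks of each block row, bcol_idx gives each block's first column,
 * and val stores each 2x2 tile contiguously in row-major order. */
void spmv_bcsr_2x2(size_t mb, const size_t *brow_ptr, const size_t *bcol_idx,
                   const double *val, const double *x, double *y)
{
    for (size_t ib = 0; ib < mb; ib++) {
        double y0 = 0.0, y1 = 0.0;                /* stay in registers */
        for (size_t k = brow_ptr[ib]; k < brow_ptr[ib + 1]; k++) {
            const double *a = &val[4 * k];        /* one 2x2 tile */
            double x0 = x[bcol_idx[k]];
            double x1 = x[bcol_idx[k] + 1];
            y0 += a[0] * x0 + a[1] * x1;
            y1 += a[2] * x0 + a[3] * x1;
        }
        y[2 * ib]     = y0;
        y[2 * ib + 1] = y1;
    }
}

One column index now serves four stored values, and the two destination values stay in registers across the whole block row; partial blocks must be filled out with explicit zeros, which is why choosing the block size is a trade-off.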
6. SPARSITY Optimizations (2): Cache Blocking
- Keep part of the source vector in cache.
[Figure: blocked multiplication of sparse matrix A with source vector x into destination vector y]
- Improves cache reuse of the source vector (see the sketch below).
- Challenge: choosing the block size.
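A minimal sketch, assuming the matrix is stored as a set of rectangular cache blocks, each holding a CSR submatrix (this CacheBlock layout is an assumption for illustration):

#include <stddef.h>

/* One cache block: a CSR submatrix covering rows [r0, r0+nrows) and
 * columns [c0, c0+ncols) of A; col_idx holds column indices local to
 * the block. */
typedef struct {
    size_t r0, c0, nrows, ncols;
    const size_t *row_ptr, *col_idx;
    const double *val;
} CacheBlock;

/* y += A*x with A stored as nblocks cache blocks (caller zeroes y).
 * While one block is processed, only the slice x[c0 .. c0+ncols) of
 * the source vector is touched, so it can stay resident in cache. */
void spmv_cache_blocked(const CacheBlock *blk, size_t nblocks,
                        const double *x, double *y)
{
    for (size_t b = 0; b < nblocks; b++) {
        const CacheBlock *B = &blk[b];
        const double *xb = x + B->c0;
        double *yb = y + B->r0;
        for (size_t i = 0; i < B->nrows; i++)
            for (size_t k = B->row_ptr[i]; k < B->row_ptr[i + 1]; k++)
                yb[i] += B->val[k] * xb[B->col_idx[k]];
    }
}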
7. SPARSITY Optimizations (3): Multiple Vectors
- Better potential for reuse.
- Loop-unrolled code that multiplies across the vectors is generated by a code generator (see the sketch after this list).
[Figure: each matrix element a_ij is reused across source vectors, combining with x_j1, x_j2, ... to update y_i1, y_i2, ...]
- Allows reuse of matrix elements.
- Challenge: choosing the number of vectors for loop unrolling.
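A minimal sketch of a kernel unrolled across two vectors, assuming the vectors are stored interleaved (element j of vector k at X[2*j + k]); it is written in the spirit of the generated code, not copied from it.

#include <stddef.h>

/* Y = A*X for two right-hand vectors, CSR matrix, unrolled across the
 * vectors: each matrix element is loaded once and used twice. */
void spmv_csr_2vec(size_t m, const size_t *row_ptr, const size_t *col_idx,
                   const double *val, const double *X, double *Y)
{
    for (size_t i = 0; i < m; i++) {
        double y0 = 0.0, y1 = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            double a = val[k];            /* reused for both vectors */
            size_t j = col_idx[k];
            y0 += a * X[2 * j];
            y1 += a * X[2 * j + 1];
        }
        Y[2 * i]     = y0;
        Y[2 * i + 1] = y1;
    }
}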
8. SPARSITY: Automatic Performance Tuning
- SPARSITY is a system for automatic performance engineering.
- Parameterized code generation
- Search combined with performance modeling selects (an illustrative search loop follows this list):
- Register block size
- Cache block size
- Number of vectors for loop unrolling
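As an illustration of the search step, here is a hypothetical tuning loop; benchmark_spmv is an assumed hook, and SPARSITY itself combines such measurements with a performance model rather than timing every candidate.

#include <stdio.h>

/* Hypothetical hook: run an r x c register-blocked kernel on the
 * target matrix and machine and return the measured MFLOPS rate. */
extern double benchmark_spmv(int r, int c);

/* Pick the fastest small register block size by exhaustive timing. */
void select_register_block(int *best_r, int *best_c)
{
    double best = 0.0;
    for (int r = 1; r <= 4; r++)
        for (int c = 1; c <= 4; c++) {
            double mflops = benchmark_spmv(r, c);
            if (mflops > best) {
                best = mflops;
                *best_r = r;
                *best_c = c;
            }
        }
    printf("selected %dx%d register blocks\n", *best_r, *best_c);
}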
9. Sparse Matrices from Data Mining Applications

Collection      Algorithm  Dimension        Non-zeros  Density (%)  Avg. NZs/row
Web Documents   LSI        10000 x 255943   3.7M       0.15         371
NSF Abstracts   CD         94481 x 6366     7.0M       1.16         74
Face Images     EA         36000 x 2640     5.6M       5.86         155
10. Data Mining Algorithms
- For text retrieval
  - Term-by-document matrix
  - Latent Semantic Indexing (Berry et al.)
    - Computation of the Singular Value Decomposition (SVD)
    - Blocked SVD uses multiple vectors
  - Concept Decomposition (Dhillon and Modha)
    - Matrix approximation by solving a least-squares problem
    - Also uses multiple vectors
11. Data Mining Algorithms
- For image retrieval
  - Eigenface Approximation (Li)
    - Used for face recognition
    - Pixel-by-image matrix
    - Each image has a multi-resolution hierarchy and is compressed with a wavelet transform.
12. Platforms Used in Performance Measurements

Platform       Clock (MHz)  L2 cache  DGEMV (MFLOPS)  DGEMM (MFLOPS)
MIPS R10000    200          2 MB      67              322
UltraSPARC II  250          1 MB      100             401
Pentium III    450          512 KB    87              328
Alpha 21164    533          96 KB     83              550
13. Performance on Web Document Data
14. Performance on NSF Abstract Data
15. Performance on Face Image Data
16. Speedup

                MIPS R10K  UltraSPARC II  Pentium III  Alpha 21164
Web Documents   3.8        5.9            2.0          2.7
NSF Abstracts   2.9        1.3            1.6          1.3
Face Images     4.7        5.1            2.6          4.5
17. Performance Summary
- Performance is better when the matrix is denser (Face Images).
- Cache blocking is effective for a matrix with a large number of columns (Web Documents).
- Optimizing the multiplication with multiple vectors is effective.
18. Cache Block Size for the Web Document Matrix

Platform       L2 Cache Size  Block Size (Single Vector)  Block Size (10 Vectors)
MIPS R10000    2 MB           10000 x 64K                 10000 x 8K
UltraSPARC II  1 MB           10000 x 32K                 10000 x 4K
Pentium III    512 KB         10000 x 16K                 10000 x 2K
Alpha 21164    96 KB          10000 x 4K                  10000 x 2K

- The width of a cache block is limited by the size of the cache.
- For multiple vectors, the loop unrolling factor is 10, except on the Alpha 21164 where it is 3.
19. Conclusion
- Most of the matrices used in data mining are sparse.
- Sparse matrix operations are memory-inefficient and need optimization.
- The optimization depends on the nonzero structure of the matrix.
- The SPARSITY system effectively speeds up these operations.
20. Call for Contributions
- Contact ejim@cs.berkeley.edu to donate your matrix! Thank you.