Title: Optimization of Sparse Matrix Kernels for Data Mining
1. Optimization of Sparse Matrix Kernels for Data Mining
- Eun-Jin Im and Katherine Yelick
- U.C. Berkeley
2. Outline
- SPARSITY: performance optimization of sparse matrix-vector operations
- Sparse matrices in data mining applications
- Performance improvements by SPARSITY for data mining matrices
3. The Need for Optimized Sparse Matrix Codes
- A sparse matrix is represented with indirect data structures (see the sketch after this list).
- Sparse matrix routines are slower than their dense matrix counterparts.
- Performance depends on the distribution of nonzero elements in the matrix.
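For reference, here is a minimal sketch of sparse matrix-vector multiplication in compressed sparse row (CSR) format, one common indirect representation; the identifiers are illustrative, not SPARSITY's actual generated code.

#include <stddef.h>

/* y = A*x for an m-row sparse matrix in CSR format.
 * row_ptr[i] .. row_ptr[i+1] indexes the nonzeros of row i;
 * col_idx and val hold their column indices and values. */
void spmv_csr(size_t m, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < m; i++) {
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];   /* indirect access to x */
        y[i] = sum;
    }
}

The indirect load x[col_idx[k]] is what makes the kernel slower than its dense counterpart: the access pattern into x depends on the nonzero structure, which hurts cache and register reuse.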
4. The Solution: the SPARSITY System
- A system that provides optimized C codes for sparse matrix-vector operations
- http://www.cs.berkeley.edu/~ejim/sparsity
- Related work: ATLAS and PHiPAC for dense matrix routines, and FFTW
5. SPARSITY Optimizations (1): Register Blocking
[Figure: a 2x2 register-blocked sparse matrix]
- Identify small dense blocks of nonzeros.
- Use an optimized multiplication code for the particular block size (a 2x2 sketch follows this list).
- Improves register reuse and lowers indexing overhead.
- Challenge: choosing the block size.
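A minimal sketch of the 2x2 case, assuming a block-CSR layout in which each stored block is a dense 2x2 tile (the layout and names are illustrative):

#include <stddef.h>

/* y = A*x for a matrix stored in 2x2 block-CSR: brow_ptr indexes the
 * blocks of each block row, bcol_idx gives each block's first column,
 * and val stores each 2x2 tile contiguously in row-major order. */
void spmv_bcsr_2x2(size_t mb, const size_t *brow_ptr, const size_t *bcol_idx,
                   const double *val, const double *x, double *y)
{
    for (size_t ib = 0; ib < mb; ib++) {
        double y0 = 0.0, y1 = 0.0;                /* stay in registers */
        for (size_t k = brow_ptr[ib]; k < brow_ptr[ib + 1]; k++) {
            const double *a = &val[4 * k];        /* one 2x2 tile */
            double x0 = x[bcol_idx[k]];
            double x1 = x[bcol_idx[k] + 1];
            y0 += a[0] * x0 + a[1] * x1;
            y1 += a[2] * x0 + a[3] * x1;
        }
        y[2 * ib]     = y0;
        y[2 * ib + 1] = y1;
    }
}

One column index now serves four stored values, and the two destination values stay in registers across the whole block row; partial blocks must be filled out with explicit zeros, which is why choosing the block size is a trade-off.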
6. SPARSITY Optimizations (2): Cache Blocking
- Keep part of the source vector in cache.
[Figure: blocked multiplication of sparse matrix A with source vector x into destination vector y]
- Improves cache reuse of the source vector (see the sketch below).
- Challenge: choosing the block size.
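A minimal sketch, assuming the matrix is stored as a set of rectangular cache blocks, each holding a CSR submatrix (this CacheBlock layout is an assumption for illustration):

#include <stddef.h>

/* One cache block: a CSR submatrix covering rows [r0, r0+nrows) and
 * columns [c0, c0+ncols) of A; col_idx holds column indices local to
 * the block. */
typedef struct {
    size_t r0, c0, nrows, ncols;
    const size_t *row_ptr, *col_idx;
    const double *val;
} CacheBlock;

/* y += A*x with A stored as nblocks cache blocks (caller zeroes y).
 * While one block is processed, only the slice x[c0 .. c0+ncols) of
 * the source vector is touched, so it can stay resident in cache. */
void spmv_cache_blocked(const CacheBlock *blk, size_t nblocks,
                        const double *x, double *y)
{
    for (size_t b = 0; b < nblocks; b++) {
        const CacheBlock *B = &blk[b];
        const double *xb = x + B->c0;
        double *yb = y + B->r0;
        for (size_t i = 0; i < B->nrows; i++)
            for (size_t k = B->row_ptr[i]; k < B->row_ptr[i + 1]; k++)
                yb[i] += B->val[k] * xb[B->col_idx[k]];
    }
}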
7. SPARSITY Optimizations (3): Multiple Vectors
- Better potential for reuse.
- Loop-unrolled code that multiplies across the vectors is generated by a code generator (see the sketch after this list).
[Figure: each matrix element a_ij is reused across source vectors, combining with x_j1, x_j2, ... to update y_i1, y_i2, ...]
- Allows reuse of matrix elements.
- Challenge: choosing the number of vectors for loop unrolling.
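A minimal sketch of a kernel unrolled across two vectors, assuming the vectors are stored interleaved (element j of vector k at X[2*j + k]); it is written in the spirit of the generated code, not copied from it.

#include <stddef.h>

/* Y = A*X for two right-hand vectors, CSR matrix, unrolled across the
 * vectors: each matrix element is loaded once and used twice. */
void spmv_csr_2vec(size_t m, const size_t *row_ptr, const size_t *col_idx,
                   const double *val, const double *X, double *Y)
{
    for (size_t i = 0; i < m; i++) {
        double y0 = 0.0, y1 = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            double a = val[k];            /* reused for both vectors */
            size_t j = col_idx[k];
            y0 += a * X[2 * j];
            y1 += a * X[2 * j + 1];
        }
        Y[2 * i]     = y0;
        Y[2 * i + 1] = y1;
    }
}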
8. SPARSITY: Automatic Performance Tuning
- SPARSITY is a system for automatic performance engineering.
- Parameterized code generation
- Search combined with performance modeling selects (an illustrative search loop follows this list):
- Register block size
- Cache block size
- Number of vectors for loop unrolling
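As an illustration of the search step, here is a hypothetical tuning loop; benchmark_spmv is an assumed hook, and SPARSITY itself combines such measurements with a performance model rather than timing every candidate.

#include <stdio.h>

/* Hypothetical hook: run an r x c register-blocked kernel on the
 * target matrix and machine and return the measured MFLOPS rate. */
extern double benchmark_spmv(int r, int c);

/* Pick the fastest small register block size by exhaustive timing. */
void select_register_block(int *best_r, int *best_c)
{
    double best = 0.0;
    for (int r = 1; r <= 4; r++)
        for (int c = 1; c <= 4; c++) {
            double mflops = benchmark_spmv(r, c);
            if (mflops > best) {
                best = mflops;
                *best_r = r;
                *best_c = c;
            }
        }
    printf("selected %dx%d register blocks\n", *best_r, *best_c);
}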
9. Sparse Matrices from Data Mining Applications

Collection      Algorithm  Dimension        Non-zeros  Density (%)  Avg. NZs/row
Web Documents   LSI        10000 x 255943   3.7M       0.15         371
NSF Abstracts   CD         94481 x 6366     7.0M       1.16         74
Face Images     EA         36000 x 2640     5.6M       5.86         155
10. Data Mining Algorithms
- For text retrieval
  - Term-by-document matrix
  - Latent Semantic Indexing (Berry et al.)
    - Computation of the Singular Value Decomposition (SVD)
    - Blocked SVD uses multiple vectors
  - Concept Decomposition (Dhillon and Modha)
    - Matrix approximation by solving a least-squares problem
    - Also uses multiple vectors
11. Data Mining Algorithms
- For image retrieval
  - Eigenface Approximation (Li)
    - Used for face recognition
    - Pixel-by-image matrix
    - Each image has a multi-resolution hierarchy and is compressed with a wavelet transform.
12. Platforms Used in Performance Measurements

Platform       Clock (MHz)  L2 cache  DGEMV (MFLOPS)  DGEMM (MFLOPS)
MIPS R10000    200          2 MB      67              322
UltraSPARC II  250          1 MB      100             401
Pentium III    450          512 KB    87              328
Alpha 21164    533          96 KB     83              550
13. Performance on Web Document Data
14. Performance on NSF Abstract Data
15. Performance on Face Image Data
16. Speedup

                MIPS R10K  UltraSPARC II  Pentium III  Alpha 21164
Web Documents   3.8        5.9            2.0          2.7
NSF Abstracts   2.9        1.3            1.6          1.3
Face Images     4.7        5.1            2.6          4.5
17. Performance Summary
- Performance is better when the matrix is denser (Face Images).
- Cache blocking is effective for a matrix with a large number of columns (Web Documents).
- Optimizing the multiplication with multiple vectors is effective.
18. Cache Block Size for the Web Document Matrix

Platform       L2 Cache Size  Block Size (Single Vector)  Block Size (10 Vectors)
MIPS R10000    2 MB           10000 x 64K                 10000 x 8K
UltraSPARC II  1 MB           10000 x 32K                 10000 x 4K
Pentium III    512 KB         10000 x 16K                 10000 x 2K
Alpha 21164    96 KB          10000 x 4K                  10000 x 2K

- The width of a cache block is limited by the size of the cache.
- For multiple vectors, the loop unrolling factor is 10, except on the Alpha 21164 where it is 3.
19. Conclusion
- Most of the matrices used in data mining are sparse.
- Sparse matrix operations are memory-inefficient and need optimization.
- The optimization depends on the nonzero structure of the matrix.
- The SPARSITY system effectively speeds up these operations.
20. Call for Contributions
- Contact ejim@cs.berkeley.edu to donate your matrix! Thank you.