The Study of Cache Oblivious Algorithms

Provided by: jia2

Learn more at: http://polaris.cs.uiuc.edu

1
The Study of Cache Oblivious Algorithms

Prepared by Jia Guo

Cache-Oblivious Algorithms, by Matteo Frigo,
Charles E. Leiserson, Harald Prokop, and Sridhar
Ramachandran. In the 40th Annual Symposium on
Foundations of Computer Science, FOCS '99, 17-18
October, 1999, New York, NY, USA.

3
Outline

Cache complexity
Cache aware algorithms
Cache oblivious algorithms
Matrix multiplication
Matrix transposition
FFT
Conclusion

4
Assumptions

Only two levels of memory hierarchy
An ideal cache, which is
Fully associative
Managed by an optimal replacement strategy
Tall: $Z = \Omega(L^2)$
A very large main memory

5
An Ideal Cache Model
An ideal cache is described by two parameters (Z, L)
Z: total number of words in the cache
L: number of words in one cache line
6
Cache Complexity

An algorithm with input size n is measured by
Work complexity $W(n)$: its conventional running time
Cache complexity $Q(n; Z, L)$: the number of cache misses it
incurs
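
To make the notion of cache complexity concrete, the following is a minimal sketch that counts misses for a sequence of word addresses. It assumes a fully associative cache of Z words with L-word lines, and it uses LRU replacement as a stand-in for the optimal replacement strategy of the ideal cache model. The function name `count_misses` and the parameter values are hypothetical.

```python
from collections import OrderedDict


def count_misses(addresses, Z, L):
    """Count misses of a fully associative cache holding Z words in L-word lines.

    Assumption: LRU replacement is used here instead of the optimal
    (offline) replacement strategy of the ideal cache model.
    """
    num_lines = Z // L
    cache = OrderedDict()  # cached line indices, least recently used first
    misses = 0
    for addr in addresses:
        line = addr // L
        if line in cache:
            cache.move_to_end(line)  # hit: mark the line as most recently used
        else:
            misses += 1
            cache[line] = None
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return misses


# A single scan of 64 consecutive words with Z = 32 and L = 4.
scan_misses = count_misses(range(64), Z=32, L=4)
```

With these parameters, the scan touches 64/4 = 16 distinct lines and never reuses one, so `scan_misses` is 16.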

7
Outline

Cache complexity
Cache aware algorithms
Cache oblivious algorithms
Matrix multiplication
Matrix transposition
FFT
Conclusion

8
Cache Aware Algorithms

Contain parameters that are tuned to minimize the cache
complexity for a particular cache size (Z) and
line length (L).
The parameters must be re-tuned when running on
a different platform.

9
Example

A blocked matrix multiplication algorithm (Block-MULT)
s is a tuning parameter chosen to make the algorithm run
fast

[Figure: an n x n matrix A divided into s x s blocks; A11 is the top-left block]
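
A blocked multiplication of this kind might look like the sketch below. It is an illustration rather than the code from the slides: the function name `block_mult` is hypothetical, matrices are plain lists of lists, and the `min` calls are an added convenience so that s does not have to divide n.

```python
def block_mult(A, B, s):
    """Multiply two n x n matrices (lists of lists) using s x s blocks.

    s is the tuning parameter: it should be chosen so that three
    s x s blocks fit in the cache at the same time.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                # Multiply block A[i0.., k0..] by block B[k0.., j0..]
                # and accumulate the result into block C[i0.., j0..].
                for i in range(i0, min(i0 + s, n)):
                    for k in range(k0, min(k0 + s, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + s, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The result does not depend on s; only the memory access pattern does, which is why s has to be re-tuned for each cache.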
10
Example (2)

Cache complexity
The three s x s submatrices should fit into the
cache, so they occupy $\Theta(s + s^2/L)$
cache lines
Optimal performance is obtained when $s = \Theta(\sqrt{Z})$
Z/L cache misses are needed to bring the 3 submatrices
into cache
$n^2/L$ cache misses are needed to read the $n^2$ elements
It is $Q(n) = \Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$
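
The count for the blocked algorithm can be written out as a short derivation. It follows the argument in the FOCS '99 paper and assumes that s divides n and that the tall-cache assumption holds:

```latex
\begin{align*}
s &= \Theta(\sqrt{Z})
  && \text{three $s \times s$ blocks fit in the cache} \\
\text{misses per block multiplication} &= O(Z/L)
  && \text{to load the three blocks} \\
\text{number of block multiplications} &= (n/s)^3 = \Theta\big((n/\sqrt{Z})^3\big) \\
Q(n) &= \Theta\Big(1 + \frac{n^2}{L} + \Big(\frac{n}{\sqrt{Z}}\Big)^{3}\,\frac{Z}{L}\Big)
      = \Theta\Big(1 + \frac{n^2}{L} + \frac{n^3}{L\sqrt{Z}}\Big)
\end{align*}
```

The $n^2/L$ term accounts for reading the $n^2$ input elements at least once.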

11
Outline

Cache complexity
Cache aware algorithms
Cache oblivious algorithms
Matrix multiplication
Matrix transposition and FFT
Conclusion

12
Cache Oblivious Algorithms

Have no parameters that depend on the hardware, such as cache
size (Z) or cache-line length (L).
No tuning needed; platform independent.
The algorithms introduced next are proved to
have optimal cache complexity.

13
Matrix Multiplication

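Partition matrices A and B in half along the largest
dimension, where A is n x m and B is m x p.
Recurse until the base case of
one element is reached.

$n \ge \max(m, p)$: split A horizontally, $\begin{pmatrix} A_1 \\ A_2 \end{pmatrix} B = \begin{pmatrix} A_1 B \\ A_2 B \end{pmatrix}$
$m \ge \max(n, p)$: split A vertically and B horizontally, $\begin{pmatrix} A_1 & A_2 \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} = A_1 B_1 + A_2 B_2$
$p \ge \max(n, m)$: split B vertically, $A \begin{pmatrix} B_1 & B_2 \end{pmatrix} = \begin{pmatrix} A B_1 & A B_2 \end{pmatrix}$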
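
The recursion can be sketched as follows. This is an illustrative version, not the authors' code: the names `rec_mult` and `multiply` are hypothetical, matrices are lists of lists, and submatrices are passed as index ranges so that nothing is copied. Note that no cache parameter appears anywhere.

```python
def rec_mult(A, B, C, i0, i1, k0, k1, j0, j1):
    """Add A[i0:i1, k0:k1] * B[k0:k1, j0:j1] into C[i0:i1, j0:j1]."""
    n, m, p = i1 - i0, k1 - k0, j1 - j0
    if n == 0 or m == 0 or p == 0:
        return
    if n == 1 and m == 1 and p == 1:
        # Base case: one element.
        C[i0][j0] += A[i0][k0] * B[k0][j0]
    elif n >= max(m, p):
        # Split the rows of A.
        mid = i0 + n // 2
        rec_mult(A, B, C, i0, mid, k0, k1, j0, j1)
        rec_mult(A, B, C, mid, i1, k0, k1, j0, j1)
    elif m >= max(n, p):
        # Split the columns of A and the rows of B.
        mid = k0 + m // 2
        rec_mult(A, B, C, i0, i1, k0, mid, j0, j1)
        rec_mult(A, B, C, i0, i1, mid, k1, j0, j1)
    else:
        # Split the columns of B.
        mid = j0 + p // 2
        rec_mult(A, B, C, i0, i1, k0, k1, j0, mid)
        rec_mult(A, B, C, i0, i1, k0, k1, mid, j1)


def multiply(A, B):
    """Return the product of an n x m matrix A and an m x p matrix B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    rec_mult(A, B, C, 0, n, 0, m, 0, p)
    return C
```

For example, `multiply([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]])` first splits along m = 3, the largest dimension.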
14
Matrix Multiplication (2)
Assume the sizes of A and B are n x 4n and 4n x n
AB = A1B1 + A2B2
A1B1 = A11B11 + A12B12
A2B2 = A21B21 + A22B22
15
Matrix Multiplication (3)

Intuitively, once a subproblem fits into the
cache, its smaller subproblems can be solved in
cache with no further misses.

16
Matrix Multiplication (4)

Cache complexity
$Q(m, n, p) = \Theta(m + n + p + (mn + np + mp)/L + mnp/(L\sqrt{Z}))$
For square matrices, this is the same as the cache complexity of
the cache-aware Block-MULT algorithm,
and it is optimal.

17
Outline

Cache complexity
Cache aware algorithms
Cache oblivious algorithms
Matrix multiplication
Matrix transposition
FFT
Conclusion

18
Matrix Transposition

Transpose an m x n matrix A into an n x m matrix $B = A^T$:

for i = 1 to m
    for j = 1 to n
        B(j, i) = A(i, j)

If n is very large, the column-wise access of B
will cause a cache miss every time!
(No spatial locality in B)
19
Matrix Transposition (2)

Partition array A along the longer dimension and
recursively execute the transpose function.

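$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \quad A^T = \begin{pmatrix} A_{11}^T & A_{21}^T \\ A_{12}^T & A_{22}^T \end{pmatrix}$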
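
A recursive transpose along these lines can be sketched as below. The names `rec_transpose` and `transpose` are hypothetical, and the sketch writes into a separate output matrix B instead of working in place.

```python
def rec_transpose(A, B, i0, i1, j0, j1):
    """Write the transpose of A[i0:i1, j0:j1] into B[j0:j1, i0:i1]."""
    m, n = i1 - i0, j1 - j0
    if m == 0 or n == 0:
        return
    if m == 1 and n == 1:
        B[j0][i0] = A[i0][j0]
    elif m >= n:
        # More rows than columns: split the rows of A.
        mid = i0 + m // 2
        rec_transpose(A, B, i0, mid, j0, j1)
        rec_transpose(A, B, mid, i1, j0, j1)
    else:
        # More columns than rows: split the columns of A.
        mid = j0 + n // 2
        rec_transpose(A, B, i0, i1, j0, mid)
        rec_transpose(A, B, i0, i1, mid, j1)


def transpose(A):
    """Return the n x m transpose of an m x n matrix A (list of lists)."""
    m, n = len(A), len(A[0])
    B = [[0] * m for _ in range(n)]
    rec_transpose(A, B, 0, m, 0, n)
    return B
```

Because each call halves the longer dimension, the subproblems eventually become small enough to fit in the cache, whatever its size.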
20
Matrix Transposition (3)

Cache complexity
It has the optimal cache complexity
$Q(m, n) = \Theta(1 + mn/L)$

21
Fast Fourier Transform

Use the Cooley-Tukey algorithm
Cooley-Tukey algorithms recursively re-express a
DFT of a composite size n = n1 n2 as:
1. Perform n2 DFTs of size n1.
2. Multiply by complex roots of unity called twiddle
factors.
3. Perform n1 DFTs of size n2.
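
These three steps correspond to the following factorization, written here in the notation of the FOCS '99 paper. The index split $i = i_1 + i_2 n_1$, $j = j_1 n_2 + j_2$ and the convention $\omega_n = e^{2\pi\sqrt{-1}/n}$ are taken from that paper, not from the slides:

```latex
\begin{align*}
Y[i] &= \sum_{j=0}^{n-1} X[j]\,\omega_n^{-ij},
  \qquad \omega_n = e^{2\pi\sqrt{-1}/n} \\
Y[i_1 + i_2 n_1] &= \sum_{j_2=0}^{n_2-1}
  \Bigg[ \Bigg( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\,\omega_{n_1}^{-i_1 j_1} \Bigg)
  \omega_n^{-i_1 j_2} \Bigg]\, \omega_{n_2}^{-i_2 j_2}
\end{align*}
```

The inner sum is one of the $n_2$ DFTs of size $n_1$, the factor $\omega_n^{-i_1 j_2}$ is the twiddle factor, and the outer sum is one of the $n_1$ DFTs of size $n_2$.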

23

Assume X is a row-major n1 x n2 matrix
Steps
1. Transpose X in place.
2. Compute n2 DFTs of size n1.
3. Multiply by twiddle factors.
4. Transpose X in place.
5. Compute n1 DFTs of size n2.
6. Transpose X in place.
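
The six steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the input length n is a power of 2, the transposes build new lists instead of working in place, and the function names (`dft_direct`, `transpose_rows`, `six_step_fft`) are hypothetical.

```python
import cmath


def dft_direct(x):
    """Direct O(n^2) DFT, used as the base case and as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * i * j / n) for j in range(n))
            for i in range(n)]


def transpose_rows(rows):
    """Out-of-place transpose of a list of equal-length rows."""
    return [list(col) for col in zip(*rows)]


def six_step_fft(x):
    """DFT of a list x whose length n is assumed to be a power of 2."""
    n = len(x)
    if n <= 2:
        return dft_direct(x)
    k = n.bit_length() - 1          # n = 2**k
    n1 = 1 << ((k + 1) // 2)
    n2 = n // n1
    X = [x[r * n2:(r + 1) * n2] for r in range(n1)]   # row-major n1 x n2
    T = transpose_rows(X)                              # 1. transpose: n2 x n1
    T = [six_step_fft(row) for row in T]               # 2. n2 DFTs of size n1
    T = [[T[j2][i1] * cmath.exp(-2j * cmath.pi * i1 * j2 / n)
          for i1 in range(n1)] for j2 in range(n2)]    # 3. twiddle factors
    U = transpose_rows(T)                              # 4. transpose: n1 x n2
    U = [six_step_fft(row) for row in U]               # 5. n1 DFTs of size n2
    V = transpose_rows(U)                              # 6. transpose: n2 x n1
    return [v for row in V for v in row]
```

Comparing `six_step_fft(x)` with `dft_direct(x)` on a short input is a quick way to check that the index bookkeeping is right.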

24
Fast Fourier Transform
Example: n1 = 4, n2 = 2
Transpose to select n2 DFTs of size n1
Call FFT recursively with n1 = 2, n2 = 2
Reach the base case, return
Multiply by twiddle factors
Transpose to select n1 DFTs of size n2
Transpose and return

25
Fast Fourier Transform

Cache complexity
Optimal for a Cooley-Tukey algorithm, when n is
an exact power of 2
$Q(n) = O(1 + (n/L)(1 + \log_Z n))$

26
Other Cache Oblivious Algorithms

Funnelsort
Distribution sort
LU decomposition without pivots

27
Outline

Cache complexity
Cache aware algorithms
Cache oblivious algorithms
Matrix multiplication
Matrix transposition
FFT
Conclusion

28
Questions

How large is the range of practicality of
cache-oblivious algorithms?
What are the relative strengths of
cache-oblivious and cache-aware algorithms?

29
Practicality of Cache-oblivious Algorithms
[Figure: average time to transpose an N x N matrix, divided
by $N^2$]
30
Practicality of Cache-oblivious Algorithms (2)
[Figure: average time taken to multiply two N x N matrices,
divided by $N^3$]
31
Question 2

Do cache-oblivious algorithms perform as well as
cache-aware algorithms?
FFTW library
No answer yet.

32
References

Cache-Oblivious Algorithms, by Matteo Frigo,
Charles E. Leiserson, Harald Prokop, and Sridhar
Ramachandran. In the 40th Annual Symposium on
Foundations of Computer Science, FOCS '99, 17-18
October, 1999, New York, NY, USA.
Cache-Oblivious Algorithms, by Harald Prokop.
Master's Thesis, MIT Department of Electrical
Engineering and Computer Science. June 1999.
Optimizing Matrix Multiplication with a
Classifier Learning System, by Xiaoming Li and
María Jesús Garzarán. LCPC 2005.