Title: The Study of Cache Oblivious Algorithms
1The Study of Cache Oblivious Algorithms
2- Cache-Oblivious Algorithms by Matteo Frigo,
Charles E. Leiserson, Harald Prokop, and Sridhar
Ramachandran. In the 40th Annual Symposium on
Foundations of Computer Science, FOCS '99, 17-18
October, 1999, New York, NY, USA.
3Outline
- Cache complexity
- Cache aware algorithms
- Cache oblivious algorithms
- Matrix multiplication
- Matrix transposition
- FFT
- Conclusion
4Assumption
- Only two levels in the memory hierarchy
- An ideal cache
  - Fully associative
  - Optimal replacement strategy
  - Tall cache
- A very large main memory
5An Ideal Cache Model
An ideal cache model (Z, L): Z is the total number of
words in the cache, and L is the number of words in
one cache line.
6Cache Complexity
- An algorithm with input size n is measured by two quantities:
  - Work complexity W(n)
  - Cache complexity Q(n; Z, L): the number of cache
    misses it incurs
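To make the cache complexity measure concrete, the following is a minimal sketch of a miss counter for a (Z, L) cache. It rests on assumptions not in the slides: the function name `count_misses` is hypothetical, addresses are word indices, and LRU replacement stands in for the optimal replacement strategy of the ideal cache, so the counts are only an upper-bound illustration.

```python
from collections import OrderedDict

def count_misses(addresses, Z, L):
    """Count cache misses for a sequence of word addresses.

    Assumption: a fully associative cache of Z words with lines of L words,
    using LRU replacement as a stand-in for the optimal strategy.
    """
    num_lines = Z // L
    cache = OrderedDict()  # keys are line indices, ordered by recency
    misses = 0
    for addr in addresses:
        line = addr // L
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = None
    return misses

# Scan a 64 x 64 row-major array by rows, then by columns.
n = 64
row_scan = [i * n + j for i in range(n) for j in range(n)]
col_scan = [i * n + j for j in range(n) for i in range(n)]
row_misses = count_misses(row_scan, Z=256, L=8)
col_misses = count_misses(col_scan, Z=256, L=8)
```

With these parameters the row scan misses once per cache line, while the column scan touches more distinct lines per column than the cache can hold and misses on every access.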
8Cache Aware Algorithms
- Contain tuning parameters chosen to minimize the cache
complexity for a particular cache size (Z) and
line length (L).
- The parameters must be adjusted when running on
different platforms.
9Example
- A blocked matrix multiplication algorithm
- s is a tuning parameter to make the algorithm run
fast
[Figure: an n x n matrix A partitioned into s x s blocks such as A11]
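The blocking idea can be sketched as follows. This is an illustrative version, not the Block-MULT code from the paper: the function name `blocked_matmul` is hypothetical, matrices are plain lists of lists, and `s` is the tuning parameter, which would have to be chosen per machine so that three s x s blocks fit in the cache.

```python
def blocked_matmul(A, B, s):
    """Multiply two n x n matrices (lists of lists) using s x s blocks.

    Assumption: s is a tuning parameter chosen so that three s x s
    blocks fit in the cache; s need not divide n.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                # Multiply block A[i0.., k0..] by block B[k0.., j0..]
                # and accumulate into block C[i0.., j0..].
                for i in range(i0, min(i0 + s, n)):
                    for k in range(k0, min(k0 + s, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + s, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The result does not depend on `s`; only the memory access pattern does, which is why `s` has to be re-tuned when the cache parameters change.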
10Example (2)
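- Cache complexity
- The three s x s submatrices should fit into the
cache, so they occupy $\Theta(s + s^2/L)$ cache lines.
- Optimal performance is obtained when $s = \Theta(\sqrt{Z})$.
- $Z/L$ cache misses are needed to bring the 3 submatrices
into the cache.
- $n^2/L$ cache misses are needed to read the $n^2$ elements.
- The cache complexity is $\Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$.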
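As a worked step, the bound for the blocked algorithm can be assembled from the two counts on this slide. The sketch assumes $s = \Theta(\sqrt{Z})$, so that there are $(n/s)^3 = \Theta((n/\sqrt{Z})^3)$ block multiplications, each costing $O(Z/L)$ misses, plus the $n^2/L$ misses needed to read the input:

```latex
Q(n) = \Theta\!\left(1 + \frac{n^2}{L}
        + \left(\frac{n}{\sqrt{Z}}\right)^{3} \frac{Z}{L}\right)
     = \Theta\!\left(1 + \frac{n^2}{L} + \frac{n^3}{L\sqrt{Z}}\right)
```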
12Cache Oblivious Algorithms
- Contain no hardware parameters, such as the cache
size (Z) or cache-line length (L).
- No tuning needed; platform independent.
- The algorithms introduced below are proved to
have optimal cache complexity.
13Matrix Multiplication
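- Partition matrices A and B in half along the largest
dimension, where A is n x m and B is m x p.
- Proceed recursively until reaching the base case:
one element.
- Three cases, depending on the largest dimension:
  - $n \ge \max(m, p)$: split A horizontally
  - $m \ge \max(n, p)$: split A vertically and B horizontally
  - $p \ge \max(n, m)$: split B vertically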
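The recursion described above can be sketched as follows. This is a minimal illustration, not the paper's REC-MULT code: the names `rec_mult` and `multiply` are hypothetical, matrices are lists of lists addressed by offsets, and all dimensions are assumed to be at least 1. Note that no cache parameter appears anywhere.

```python
def rec_mult(A, B, C, i0, n, k0, m, j0, p):
    """Accumulate A[i0:i0+n, k0:k0+m] times B[k0:k0+m, j0:j0+p] into C.

    Halves the largest of the three dimensions until a single
    scalar product remains (the one-element base case).
    """
    if n == 1 and m == 1 and p == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]
    elif n >= max(m, p):          # split the rows of A
        h = n // 2
        rec_mult(A, B, C, i0, h, k0, m, j0, p)
        rec_mult(A, B, C, i0 + h, n - h, k0, m, j0, p)
    elif m >= max(n, p):          # split the columns of A and the rows of B
        h = m // 2
        rec_mult(A, B, C, i0, n, k0, h, j0, p)
        rec_mult(A, B, C, i0, n, k0 + h, m - h, j0, p)
    else:                         # split the columns of B
        h = p // 2
        rec_mult(A, B, C, i0, n, k0, m, j0, h)
        rec_mult(A, B, C, i0, n, k0, m, j0 + h, p - h)

def multiply(A, B):
    """Return the product of an n x m matrix A and an m x p matrix B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    rec_mult(A, B, C, 0, n, 0, m, 0, p)
    return C
```

For example, `multiply([[1, 2, 3], [4, 5, 6]], [[1, 0], [0, 1], [1, 1]])` first splits the shared dimension m = 3, then continues on whichever dimension is largest in each subproblem.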
14Matrix Multiplication (2)
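Assume the sizes of A and B are n x 4n and 4n x n.
[Figure: recursion tree. AB splits into A1B1 and A2B2;
A1B1 splits into A11B11 and A12B12; A2B2 splits into
A21B21 and A22B22.]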
15Matrix Multiplication (3)
- Intuitively, once a sub problem fits into the
cache, its smaller sub problems can be solved in
cache with no further misses.
16Matrix Multiplication (4)
- Cache complexity
- Can achieve the same cache complexity as the
Block-MULT algorithm (cache aware).
- For square matrices, the optimal cache complexity
is achieved.
18Matrix Transposition
- If n is very large, accessing B by columns
causes a cache miss on every access.
- (No spatial locality in B)
A is m x n; B = AT is n x m.
for i = 1 to m
    for j = 1 to n
        B(j, i) = A(i, j)
19Matrix Transposition (2)
- Partition array A along the longer dimension and
recursively execute the transpose function.
[Figure: A is partitioned into blocks A11, A12, A21, A22;
the transpose is assembled from A11T, A21T, A12T, A22T.]
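The recursive partitioning can be sketched as below. This is an illustrative version under stated assumptions: the names `rec_transpose` and `transpose` are hypothetical, matrices are lists of lists rather than a contiguous row-major array, and the recursion runs down to a single element. It shows the structure of the recursion, not the cache behavior itself.

```python
def rec_transpose(A, B, i0, m, j0, n):
    """Write the transpose of A[i0:i0+m, j0:j0+n] into B.

    Splits the longer dimension in half until one element remains.
    """
    if m == 1 and n == 1:
        B[j0][i0] = A[i0][j0]
    elif m >= n:                  # split the rows of A
        h = m // 2
        rec_transpose(A, B, i0, h, j0, n)
        rec_transpose(A, B, i0 + h, m - h, j0, n)
    else:                         # split the columns of A
        h = n // 2
        rec_transpose(A, B, i0, m, j0, h)
        rec_transpose(A, B, i0, m, j0 + h, n - h)

def transpose(A):
    """Return the n x m transpose of an m x n matrix A."""
    m, n = len(A), len(A[0])
    B = [[None] * m for _ in range(n)]
    rec_transpose(A, B, 0, m, 0, n)
    return B
```

A practical implementation would likely stop the recursion at a small block and use a simple loop there to reduce call overhead; that cutoff is a constant-factor choice, not a cache parameter.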
20Matrix Transposition (3)
- Cache complexity
- It has the optimal cache complexity
- $Q(m, n) = \Theta(1 + mn/L)$
21Fast Fourier Transform
- Use the Cooley-Tukey algorithm
- Cooley-Tukey algorithms recursively re-express a
DFT of a composite size $n = n_1 n_2$ as:
  - Perform n2 DFTs of size n1.
  - Multiply by complex roots of unity called twiddle
    factors.
  - Perform n1 DFTs of size n2.
22[Figure: an array with its two dimensions labeled n1 and n2]
23- Assume X is a row-major n1 x n2 matrix
- Steps
  - Transpose X in place.
  - Compute n2 DFTs of size n1.
  - Multiply by twiddle factors.
  - Transpose X in place.
  - Compute n1 DFTs of size n2.
  - Transpose X in place.
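The six steps can be sketched in code as follows. This is a simplified illustration, not the paper's implementation: the name `six_step_fft` is hypothetical, the input length is assumed to be a power of 2, n1 and n2 are chosen as the two powers of 2 closest to sqrt(n), and the transposes build new lists instead of working in place.

```python
import cmath

def transpose(rows):
    """Return the transpose of a list of equal-length rows."""
    return [list(col) for col in zip(*rows)]

def six_step_fft(x):
    """DFT of a list x whose length n is a power of 2 (assumption)."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    k = n.bit_length() - 1
    n1 = 1 << ((k + 1) // 2)      # n = n1 * n2, with n1 >= n2
    n2 = n // n1
    # View x as a row-major n1 x n2 matrix X.
    X = [x[j1 * n2:(j1 + 1) * n2] for j1 in range(n1)]
    T = transpose(X)                          # 1. transpose: n2 rows of length n1
    T = [six_step_fft(row) for row in T]      # 2. n2 DFTs of size n1
    T = [[T[j2][i1] * cmath.exp(-2j * cmath.pi * i1 * j2 / n)
          for i1 in range(n1)]
         for j2 in range(n2)]                 # 3. multiply by twiddle factors
    U = transpose(T)                          # 4. transpose: n1 rows of length n2
    U = [six_step_fft(row) for row in U]      # 5. n1 DFTs of size n2
    V = transpose(U)                          # 6. transpose into output order
    return [v for row in V for v in row]
```

For n = 8 this picks n1 = 4 and n2 = 2, and the size-4 subproblems recurse with n1 = 2 and n2 = 2. The output can be checked against a direct evaluation of the DFT sum for small inputs.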
24Fast Fourier Transform
n1 = 4, n2 = 2
Transpose to select n2 DFTs of size n1
Call FFT recursively with n1 = 2, n2 = 2
Reach the base case, return
Multiply by twiddle factors
Transpose to select n1 DFTs of size n2
Transpose and return
25Fast Fourier Transform
- Cache complexity
- Optimal for a Cooley-Tukey algorithm when n is
an exact power of 2
- $Q(n) = O(1 + (n/L)(1 + \log_Z n))$
26Other Cache Oblivious Algorithms
- Funnelsort
- Distribution sort
- LU decomposition without pivots
28Questions
- How large is the range of practicality of
cache-oblivious algorithms?
- What are the relative strengths of
cache-oblivious and cache-aware algorithms?
29Practicality of Cache-oblivious Algorithms
[Figure: average time to transpose an N x N matrix,
divided by $N^2$]
30Practicality of Cache-oblivious Algorithms (2)
[Figure: average time taken to multiply two N x N matrices,
divided by $N^3$]
31Question 2
- Do cache-oblivious algorithms perform as well as
cache-aware algorithms?
- FFTW library
- No answer yet.
32References
- Cache-Oblivious Algorithms by Matteo Frigo,
Charles E. Leiserson, Harald Prokop, and Sridhar
Ramachandran. In the 40th Annual Symposium on
Foundations of Computer Science, FOCS '99, 17-18
October, 1999, New York, NY, USA.
- Cache-Oblivious Algorithms by Harald Prokop.
Master's Thesis, MIT Department of Electrical
Engineering and Computer Science. June 1999.
- Optimizing Matrix Multiplication with a
Classifier Learning System by Xiaoming Li and
María Jesus Garzarán. LCPC 2005.