The Study of Cache Oblivious Algorithms

1
The Study of Cache Oblivious Algorithms
  • Prepared by Jia Guo

2
  • "Cache-Oblivious Algorithms" by Matteo Frigo,
    Charles E. Leiserson, Harald Prokop, and Sridhar
    Ramachandran. In the 40th Annual Symposium on
    Foundations of Computer Science (FOCS '99), 17-18
    October 1999, New York, NY, USA.

3
Outline
  • Cache complexity
  • Cache aware algorithms
  • Cache oblivious algorithms
  • Matrix multiplication
  • Matrix transposition
  • FFT
  • Conclusion

4
Assumptions
  • A two-level memory hierarchy
  • An ideal cache
      • Fully associative
      • Optimal replacement strategy
      • Tall cache: $Z = \Omega(L^2)$
  • A very large main memory

5
An Ideal Cache Model
  • An ideal cache is described by two parameters (Z, L)
  • Z: total number of words in the cache
  • L: number of words in one cache line

6
Cache Complexity
  • An algorithm with input size n is measured by
      • Work complexity W(n)
      • Cache complexity Q(n; Z, L): the number of
        cache misses it incurs
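To make the cost measure concrete, here is a minimal sketch that counts misses for a stream of word addresses on a fully associative cache of Z words with L-word lines. It assumes LRU replacement instead of the optimal replacement of the ideal cache, purely because LRU is easy to simulate; the name `count_misses` is hypothetical.

```python
from collections import OrderedDict

def count_misses(addresses, Z, L):
    """Count misses of a fully associative LRU cache with Z words and L-word lines."""
    num_lines = Z // L            # the cache holds Z/L lines
    cache = OrderedDict()         # keys are line numbers, ordered oldest to newest
    misses = 0
    for addr in addresses:
        line = addr // L          # the cache line that contains this word
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return misses

n, Z, L = 1024, 256, 8
# A sequential scan of n words touches each of the n/L lines exactly once.
sequential = count_misses(range(n), Z, L)
# A stride-L scan of n words lands on a new line with every access.
strided = count_misses(range(0, n * L, L), Z, L)
```

The same n accesses cost n/L misses when they are contiguous and n misses when they are strided, which is the gap the algorithms in this deck try to close.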

7
Outline
  • Cache complexity
  • Cache aware algorithms
  • Cache oblivious algorithms
  • Matrix multiplication
  • Matrix transposition
  • FFT
  • Conclusion

8
Cache Aware Algorithms
  • Contain parameters to minimize the cache
    complexity for a particular cache size (Z) and
    line length (L).
  • Need to adjust parameters when running on
    different platforms.

9
Example
  • A blocked matrix multiplication algorithm
  • s (the block size) is a tuning parameter chosen
    to make the algorithm run fast
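A blocked multiplication of this kind might look like the sketch below. It assumes square n x n matrices stored as lists of lists and a block size s that divides n; the function name `block_mult` is hypothetical and not taken from the slides.

```python
def block_mult(A, B, n, s):
    """Multiply two n x n matrices using s x s blocks; assumes s divides n."""
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                # Multiply the s x s block of A at (i0, k0) by the block of B
                # at (k0, j0) and accumulate into the block of C at (i0, j0).
                for i in range(i0, i0 + s):
                    for k in range(k0, k0 + s):
                        a = A[i][k]
                        for j in range(j0, j0 + s):
                            C[i][j] += a * B[k][j]
    return C

A = [[i + j for j in range(4)] for i in range(4)]
B = [[i * j + 1 for j in range(4)] for i in range(4)]
C = block_mult(A, B, 4, 2)
```

The result does not depend on s; only the memory access order does, which is why s has to be retuned for each cache.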

[Figure: an n x n matrix A divided into s x s blocks such as A11]

10
Example (2)
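  • Cache complexity
  • The three s x s submatrices should fit into the
    cache simultaneously, so they occupy
    $\Theta(s + s^2/L)$ cache lines
  • Optimal performance is obtained when
    $s = \Theta(\sqrt{Z})$
  • At most Z/L cache misses are needed to bring the
    three submatrices into the cache
  • $n^2/L$ cache misses are needed to read the $n^2$
    elements
  • Overall it is
    $Q(n) = \Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$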
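The steps of this argument can be written out as follows. This is a sketch following the analysis in the Frigo et al. paper; it assumes the tall-cache condition $Z = \Omega(L^2)$ and a block size $s = \Theta(\sqrt{Z})$.

```latex
\begin{align*}
\text{lines per } s \times s \text{ block}
  &= \Theta\!\left(s + \frac{s^2}{L}\right), \\
s = \Theta(\sqrt{Z}),\; Z = \Omega(L^2)
  \;&\Rightarrow\; \Theta\!\left(s + \frac{s^2}{L}\right) = O\!\left(\frac{Z}{L}\right), \\
Q(n)
  &= \Theta\!\left(1 + \frac{n^2}{L}
     + \left(\frac{n}{\sqrt{Z}}\right)^{3} \cdot \frac{Z}{L}\right)
   = \Theta\!\left(1 + \frac{n^2}{L} + \frac{n^3}{L\sqrt{Z}}\right).
\end{align*}
```

The factor $(n/\sqrt{Z})^3$ counts the block multiplications, and each of them pays at most Z/L misses to load its three blocks.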

11
Outline
  • Cache complexity
  • Cache aware algorithms
  • Cache oblivious algorithms
  • Matrix multiplication
  • Matrix transposition and FFT
  • Conclusion

12
Cache Oblivious Algorithms
  • Contain no hardware parameters, such as cache
    size (Z) or cache-line length (L).
  • No tuning needed; platform independent.
  • The algorithms introduced below are proven to
    have optimal cache complexity.

13
Matrix Multiplication
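  • Partition matrices A and B in half along the
    largest dimension, where A is n x m and B is m x p:
      • $n \ge \max(m, p)$: split A by rows
      • $m \ge \max(n, p)$: split A by columns and B by rows
      • $p \ge \max(n, m)$: split B by columns
  • Proceed recursively until reaching the base case:
    one element.

14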
Matrix Multiplication (2)
Assume the sizes of A and B are n x 4n and 4n x n:

$AB = A_1 B_1 + A_2 B_2 = (A_{11} B_{11} + A_{12} B_{12}) + (A_{21} B_{21} + A_{22} B_{22})$

15
Matrix Multiplication (3)
  • Intuitively, once a subproblem fits into the
    cache, its smaller subproblems can be solved in
    cache with no further misses.
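The recursive partitioning described on the preceding slides can be sketched as follows. This is an illustration under stated assumptions rather than the paper's own code: matrices are lists of lists, A is n x m, B is m x p, and the names `rec_mult` and `multiply` are hypothetical.

```python
def rec_mult(A, B, C, i0, n, k0, m, j0, p):
    """Add A[i0:i0+n, k0:k0+m] * B[k0:k0+m, j0:j0+p] into C[i0:i0+n, j0:j0+p]."""
    if n == 1 and m == 1 and p == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]   # base case: one element
    elif n >= max(m, p):
        h = n // 2                            # split the rows of A
        rec_mult(A, B, C, i0, h, k0, m, j0, p)
        rec_mult(A, B, C, i0 + h, n - h, k0, m, j0, p)
    elif m >= max(n, p):
        h = m // 2                            # split columns of A and rows of B
        rec_mult(A, B, C, i0, n, k0, h, j0, p)
        rec_mult(A, B, C, i0, n, k0 + h, m - h, j0, p)
    else:
        h = p // 2                            # split the columns of B
        rec_mult(A, B, C, i0, n, k0, m, j0, h)
        rec_mult(A, B, C, i0, n, k0, m, j0 + h, p - h)

def multiply(A, B):
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    rec_mult(A, B, C, 0, n, 0, m, 0, p)
    return C

A = [[i + k for k in range(8)] for i in range(2)]   # 2 x 8, the n x 4n shape
B = [[k - j for j in range(2)] for k in range(8)]   # 8 x 2, the 4n x n shape
C = multiply(A, B)
```

No cache parameter appears anywhere in the code: the recursion eventually produces subproblems small enough to fit in whatever cache is present.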

16
Matrix Multiplication (4)
  • Cache complexity:
    $Q(n, m, p) = \Theta(n + m + p + (nm + mp + np)/L + nmp/(L\sqrt{Z}))$
  • This matches the cache complexity of the
    cache-aware Block-MULT algorithm
  • For square matrices, the optimal cache complexity
    is achieved.
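For the square case the comparison can be made explicit. The sketch below assumes the general bound for the recursive algorithm as given in the Frigo et al. paper, written here for an n x m matrix times an m x p matrix, and sets all three dimensions to n.

```latex
\begin{align*}
Q(n, m, p) &= \Theta\!\left(n + m + p + \frac{nm + mp + np}{L}
              + \frac{nmp}{L\sqrt{Z}}\right), \\
Q(n, n, n) &= \Theta\!\left(n + \frac{n^2}{L} + \frac{n^3}{L\sqrt{Z}}\right).
\end{align*}
```

Apart from the lower-order first term, this is the same expression as the cache complexity of the blocked algorithm with a well-chosen s, obtained without knowing Z or L.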

17
Outline
  • Cache complexity
  • Cache aware algorithms
  • Cache oblivious algorithms
  • Matrix multiplication
  • Matrix transposition
  • FFT
  • Conclusion

18
Matrix Transposition
  • A is m x n; B = A^T is n x m

    for i = 1 to m
        for j = 1 to n
            B(j, i) = A(i, j)

  • If n is very large, accessing B column by column
    causes a cache miss on every access!
  • (No spatial locality in B)

19
Matrix Transposition (2)
  • Partition array A along the longer dimension and
    recursively execute the transpose function.
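A minimal sketch of this recursion is shown below. It assumes matrices stored as lists of lists and recursion all the way down to single elements; the names `rec_transpose` and `transpose` are hypothetical.

```python
def rec_transpose(A, B, i0, m, j0, n):
    """Write the transpose of A[i0:i0+m, j0:j0+n] into B[j0:j0+n, i0:i0+m]."""
    if m == 1 and n == 1:
        B[j0][i0] = A[i0][j0]                 # base case: one element
    elif m >= n:
        h = m // 2                            # rows are the longer dimension
        rec_transpose(A, B, i0, h, j0, n)
        rec_transpose(A, B, i0 + h, m - h, j0, n)
    else:
        h = n // 2                            # columns are the longer dimension
        rec_transpose(A, B, i0, m, j0, h)
        rec_transpose(A, B, i0, m, j0 + h, n - h)

def transpose(A):
    m, n = len(A), len(A[0])
    B = [[None] * m for _ in range(n)]
    rec_transpose(A, B, 0, m, 0, n)
    return B

A = [[10 * i + j for j in range(5)] for i in range(3)]   # 3 x 5
B = transpose(A)                                          # 5 x 3
```

A practical implementation would likely stop the recursion at a small block and finish with a plain loop; the single-element base case is kept here for clarity.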

[Figure: A = [A11 A12; A21 A22] is transposed blockwise into A^T = [A11^T A21^T; A12^T A22^T]]

20
Matrix Transposition (3)
  • Cache complexity
  • It has the optimal cache complexity
  • $Q(m, n) = \Theta(1 + mn/L)$
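The difference between the two access orders can be observed with a small simulation. The sketch below makes several assumptions: an LRU cache stands in for the ideal cache, A and B are row-major n x n arrays laid out back to back in memory, and the parameters n = 128, Z = 512, L = 8 are arbitrary illustrative choices. All function names are hypothetical.

```python
from collections import OrderedDict

def count_misses(addresses, Z, L):
    """Misses of a fully associative LRU cache (Z words, L-word lines)."""
    cache, misses = OrderedDict(), 0
    for addr in addresses:
        line = addr // L
        if line in cache:
            cache.move_to_end(line)
        else:
            misses += 1
            if len(cache) >= Z // L:
                cache.popitem(last=False)   # evict least recently used line
            cache[line] = True
    return misses

def naive_trace(n):
    """Addresses touched by the doubly nested loop."""
    trace = []
    for i in range(n):
        for j in range(n):
            trace.append(i * n + j)             # read A(i, j)
            trace.append(n * n + j * n + i)     # write B(j, i)
    return trace

def recursive_trace(n):
    """Addresses touched when the longer dimension is halved recursively."""
    trace = []
    def rec(i0, m, j0, w):
        if m == 1 and w == 1:
            trace.append(i0 * n + j0)           # read A(i0, j0)
            trace.append(n * n + j0 * n + i0)   # write B(j0, i0)
        elif m >= w:
            rec(i0, m // 2, j0, w)
            rec(i0 + m // 2, m - m // 2, j0, w)
        else:
            rec(i0, m, j0, w // 2)
            rec(i0, m, j0 + w // 2, w - w // 2)
    rec(0, n, 0, n)
    return trace

n, Z, L = 128, 512, 8
naive_misses = count_misses(naive_trace(n), Z, L)
recursive_misses = count_misses(recursive_trace(n), Z, L)
print(naive_misses, recursive_misses)
```

With these parameters one row of writes to B spans more lines than the cache holds, so the nested loop keeps evicting lines it will need again, while the recursive order finishes each small block before moving on.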

21
Fast Fourier Transform
  • Use the Cooley-Tukey algorithm
  • Cooley-Tukey algorithms recursively re-express a
    DFT of a composite size n = n1 · n2 as:
      1. Perform n2 DFTs of size n1.
      2. Multiply by complex roots of unity called
         twiddle factors.
      3. Perform n1 DFTs of size n2.
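The three steps correspond to the three factors in the identity below. This is a sketch under an assumed sign convention, $Y[i] = \sum_j X[j]\,\omega_n^{-ij}$ with $\omega_k = e^{2\pi\sqrt{-1}/k}$, and with the indices split as $i = i_1 + i_2 n_1$ and $j = j_1 n_2 + j_2$, where $0 \le i_1, j_1 < n_1$ and $0 \le i_2, j_2 < n_2$.

```latex
\begin{align*}
Y[i_1 + i_2 n_1]
  &= \sum_{j_2=0}^{n_2-1}
     \left[
       \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\, \omega_{n_1}^{-i_1 j_1} \right)
       \omega_n^{-i_1 j_2}
     \right]
     \omega_{n_2}^{-i_2 j_2}.
\end{align*}
```

The inner sum is a DFT of size n1 (one for each of the n2 values of j2), the factor $\omega_n^{-i_1 j_2}$ is the twiddle factor, and the outer sum is a DFT of size n2 (one for each of the n1 values of i1).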

22
[Figure: a matrix with dimensions labeled n1 and n2]

23
  • Assume X is a row-major n1 x n2 matrix
  • Steps
      1. Transpose X in place
      2. Compute n2 DFTs of size n1
      3. Multiply by twiddle factors
      4. Transpose X in place
      5. Compute n1 DFTs of size n2
      6. Transpose X in place
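The six steps can be sketched in Python as follows. Several simplifications are assumed: the transposes are done out of place on flat lists rather than in place, n is a power of 2, n1 is chosen as 2^ceil(lg n / 2), and the forward DFT uses the exponent sign -2πij/n. The names `fft` and `transpose` are hypothetical.

```python
import cmath

def transpose(flat, rows, cols):
    """Transpose a row-major rows x cols matrix stored in a flat list."""
    return [flat[r * cols + c] for c in range(cols) for r in range(rows)]

def fft(x):
    """DFT of x (length a power of 2) via the six-step scheme with n = n1 * n2."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    lg = n.bit_length() - 1
    n1 = 1 << ((lg + 1) // 2)                 # n1 = 2^ceil(lg n / 2)
    n2 = n // n1
    # Step 1: transpose, giving n2 rows of length n1.
    t = transpose(x, n1, n2)
    # Step 2: n2 DFTs of size n1, one per row.
    rows = [fft(t[r * n1:(r + 1) * n1]) for r in range(n2)]
    # Step 3: multiply entry (j2, i1) by the twiddle factor.
    t = [rows[j2][i1] * cmath.exp(-2j * cmath.pi * i1 * j2 / n)
         for j2 in range(n2) for i1 in range(n1)]
    # Step 4: transpose, giving n1 rows of length n2.
    t = transpose(t, n2, n1)
    # Step 5: n1 DFTs of size n2, one per row.
    rows = [fft(t[r * n2:(r + 1) * n2]) for r in range(n1)]
    t = [v for row in rows for v in row]
    # Step 6: transpose to put the output in natural order.
    return transpose(t, n1, n2)

x = [1, 2, 3, 4, 0, 0, 0, 0]
y = fft(x)
```

Each transpose exists only to make the next batch of DFTs operate on contiguous rows, which is where the cache benefit comes from.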

24
Fast Fourier Transform
Example with n1 = 4, n2 = 2:
  • Transpose to select n2 DFTs of size n1
  • Call FFT recursively with n1 = 2, n2 = 2
  • Reach the base case and return
  • Multiply by the twiddle factors
  • Transpose to select n1 DFTs of size n2
  • Transpose and return

25
Fast Fourier Transform
  • Cache complexity
  • Optimal for a Cooley-Tukey algorithm, when n is
    an exact power of 2
  • $Q(n) = O(1 + (n/L)(1 + \log_Z n))$

26
Other Cache Oblivious Algorithms
  • Funnelsort
  • Distribution sort
  • LU decomposition without pivots

27
Outline
  • Cache complexity
  • Cache aware algorithms
  • Cache oblivious algorithms
  • Matrix multiplication
  • Matrix transposition
  • FFT
  • Conclusion

28
Questions
  • How large is the range of practicality of
    cache-oblivious algorithms?
  • What are the relative strengths of
    cache-oblivious and cache-aware algorithms?

29
Practicality of Cache-oblivious Algorithms
[Figure: average time to transpose an N x N matrix, divided by $N^2$]

30
Practicality of Cache-oblivious Algorithms (2)
[Figure: average time taken to multiply two N x N matrices, divided by $N^3$]

31
Question 2
  • Do cache-oblivious algorithms perform as well as
    cache-aware algorithms?
  • FFTW library
  • No answer yet.

32
References
  • "Cache-Oblivious Algorithms" by Matteo Frigo,
    Charles E. Leiserson, Harald Prokop, and Sridhar
    Ramachandran. In the 40th Annual Symposium on
    Foundations of Computer Science (FOCS '99), 17-18
    October 1999, New York, NY, USA.
  • "Cache-Oblivious Algorithms" by Harald Prokop.
    Master's Thesis, MIT Department of Electrical
    Engineering and Computer Science, June 1999.
  • "Optimizing Matrix Multiplication with a
    Classifier Learning System" by Xiaoming Li and
    María Jesús Garzarán. LCPC 2005.