Title: CS 240A, January 23, 2006: Matrix multiplication
1 CS 240A, January 23, 2006: Matrix multiplication
- Linear algebra problems
- Matrix multiplication I: cache issues
- Matrix multiplication II: parallel issues
2 References
- Kathy Yelick's slides on matmul and cache issues:
  http://www.cs.berkeley.edu/~yelick/cs267/lectures/03/lect03-matmul.ppt
- Kathy Yelick's slides on parallel matrix multiplication:
  http://www.cs.berkeley.edu/~yelick/cs267/lectures/13/lect13-pmatmul.ppt
- Jim Demmel's slides on parallel dense linear algebra:
  http://www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_19_2000.ppt
3 Problems in linear algebra
- Solving linear equations
  - Find x such that Ax = b
  - Workhorse of computational modeling
  - Dense: electromagnetics; also a kernel in sparse solvers
  - Sparse: differential equations, etc.
- Eigenvalue / eigenvector problems
  - Find λ and x such that Ax = λx
  - Dynamics of systems, vibration, information retrieval
- Today: matrix-matrix multiplication
  - Compute C = A * B
  - Kernel in linear equation solvers, etc.
4 Avoiding data movement: reuse and locality
[Figure: Conventional storage hierarchy. Each processor (Proc) has its own cache, L2 cache, and L3 cache; the processors reach the memories through potential interconnects.]
- Large memories are slow; fast memories are small
- Parallel processors, collectively, have a large, fast cache
- The slow accesses to remote data are what we call communication
- The algorithm should do most of its work on local data
5 Simplified model of hierarchical memory
- Assume just 2 levels in the hierarchy, fast and slow
- All data initially in slow memory
  - m = number of memory elements (words) moved between fast and slow memory
  - tm = time per slow memory operation
  - f = number of arithmetic operations
  - tf = time per arithmetic operation, with tf << tm
  - q = f / m = average number of flops per slow memory access
- Minimum possible time = f * tf, when all data is in fast memory
- Actual time = f * tf + m * tm = f * tf * (1 + (tm/tf) * (1/q))
- Larger q means time closer to the minimum f * tf
6 Warm up: Matrix-vector multiplication
- implements y = y + A*x
- for i = 1:n
  - for j = 1:n
    - y(i) = y(i) + A(i,j)*x(j)
[Figure: y(i) is updated with the dot product of row A(i,:) and the vector x.]
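A minimal C rendering of this loop nest, assuming row-major storage and 0-based indexing (the function name is illustrative):

    /* y = y + A*x for an n-by-n matrix A stored row-major. */
    void matvec(int n, const double *A, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                y[i] += A[i*n + j] * x[j];   /* each element of A is touched exactly once */
    }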
7 Warm up: Matrix-vector multiplication
- read x(1:n) into fast memory
- read y(1:n) into fast memory
- for i = 1:n
  - read row i of A into fast memory
  - for j = 1:n
    - y(i) = y(i) + A(i,j)*x(j)
- write y(1:n) back to slow memory
- m = number of slow memory references = 3n + n^2
- f = number of arithmetic operations = 2n^2
- q = f / m ≈ 2
- Matrix-vector multiplication is limited by slow memory speed
8 Modeling Matrix-Vector Multiplication
- Compute the time for an n x n = 1000 x 1000 matrix
- Time
  - = f * tf + m * tm = f * tf * (1 + (tm/tf) * (1/q))
  - = 2n^2 * tf * (1 + 0.5 * tm/tf)
- For tf and tm, use data from R. Vuduc's PhD thesis (pp. 352-353):
  http://bebop.cs.berkeley.edu/pubs/vuduc2003-dissertation.pdf
- For tm, use minimum-memory-latency / words-per-cache-line
- tm/tf is the machine balance (q must be at least this large to run at peak speed)
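As a rough illustration, a small C sketch that plugs assumed values of tf and tm into this formula with q = 2 (the numbers below are placeholders, not the measurements from Vuduc's thesis):

    #include <stdio.h>

    /* Modeled matrix-vector multiply time: 2*n^2 * tf * (1 + 0.5 * tm/tf),
       i.e. the general model with f = 2n^2 and q = 2.
       The tf and tm values below are assumptions for illustration only. */
    int main(void) {
        double n  = 1000.0;
        double tf = 1e-9;    /* assumed time per flop: 1 ns                */
        double tm = 50e-9;   /* assumed time per slow-memory access: 50 ns */

        double minimum = 2.0 * n * n * tf;                    /* all data in fast memory   */
        double modeled = minimum * (1.0 + 0.5 * (tm / tf));   /* account for memory traffic */

        printf("modeled time = %.3e s, fraction of peak = %.1f%%\n",
               modeled, 100.0 * minimum / modeled);
        return 0;
    }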
9 Naïve Matrix Multiply
- implements C = C + A*B
- for i = 1 to n
  - for j = 1 to n
    - for k = 1 to n
      - C(i,j) = C(i,j) + A(i,k) * B(k,j)
The algorithm has 2n^3 = O(n^3) flops and operates on 3n^2 words of memory.
[Figure: C(i,j) is updated with the dot product of row A(i,:) and column B(:,j).]
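A minimal C version of this triple loop, assuming row-major storage and 0-based indexing:

    /* Naive C = C + A*B for n-by-n matrices stored row-major. */
    void matmul_naive(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }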
10 Matrix Multiply on RS/6000
[Figure: measured performance of naïve matrix multiply vs. matrix size on an IBM RS/6000; the running time fits T = N^4.7.]
- Size 2000 took 5 days; size 12,000 would take 1095 years
- O(N^3) performance would have constant cycles/flop, but the measured performance looks much closer to O(N^5)
Slide source: Larry Carter, UCSD
11 Matrix Multiply on RS/6000
[Figure: the same plot, annotated with the memory-system effects that dominate in different size ranges.]
- Page miss every iteration
- TLB miss every iteration
- Cache miss every 16 iterations
- Page miss every 512 iterations
Slide source: Larry Carter, UCSD
12 Naïve Matrix Multiply
- implements C = C + A*B
- for i = 1 to n
  - read row i of A into fast memory
  - for j = 1 to n
    - read C(i,j) into fast memory
    - read column j of B into fast memory
    - for k = 1 to n
      - C(i,j) = C(i,j) + A(i,k) * B(k,j)
    - write C(i,j) back to slow memory
13 Naïve Matrix Multiply
- How many references to slow memory?
  - m = n^3     (read each column of B n times)
      + n^2     (read each row of A once)
      + 2n^2    (read and write each element of C once)
      = n^3 + 3n^2
- So q = f / m = 2n^3 / (n^3 + 3n^2)
  - ≈ 2 for large n: no improvement over matrix-vector multiply
14 Blocked Matrix Multiply
- Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the block size
- for i = 1 to N
  - for j = 1 to N
    - read block C(i,j) into fast memory
    - for k = 1 to N
      - read block A(i,k) into fast memory
      - read block B(k,j) into fast memory
      - C(i,j) = C(i,j) + A(i,k) * B(k,j)   (do a matrix multiply on blocks)
    - write block C(i,j) back to slow memory
[Figure: block C(i,j) is updated with the product of block A(i,k) and block B(k,j).]
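A C sketch of the blocked algorithm, assuming row-major storage, 0-based indexing, and a block size b that divides n evenly:

    /* Blocked C = C + A*B for n-by-n row-major matrices; assumes b divides n. */
    void matmul_blocked(int n, int b, const double *A, const double *B, double *C) {
        for (int ii = 0; ii < n; ii += b)           /* block row of C    */
            for (int jj = 0; jj < n; jj += b)       /* block column of C */
                for (int kk = 0; kk < n; kk += b)   /* block index k     */
                    /* multiply the b-by-b blocks: C(ii,jj) += A(ii,kk) * B(kk,jj) */
                    for (int i = ii; i < ii + b; i++)
                        for (int j = jj; j < jj + b; j++)
                            for (int k = kk; k < kk + b; k++)
                                C[i*n + j] += A[i*n + k] * B[k*n + j];
    }

Once a pair of b-by-b blocks of A and B is in fast memory, each of its elements is reused b times, which is what raises q from about 2 to about b.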
15 Blocked Matrix Multiply
- m is the amount of memory traffic between slow and fast memory
- the matrix has n x n elements, and N x N blocks each of size b x b
- f is the number of floating point operations, 2n^3 for this problem
- q = f / m measures the algorithm's efficiency in the memory system

m = N*n^2    (read each block of B N^3 times: N^3 * (n/N) * (n/N))
  + N*n^2    (read each block of A N^3 times)
  + 2n^2     (read and write each block of C once)
  = (2N + 2) * n^2

So the computational intensity is q = f / m = 2n^3 / ((2N + 2) * n^2)
                                            ≈ n / N = b   for large n

We can improve performance by increasing the block size b.
The blocked algorithm can be much faster than matrix-vector multiply (q ≈ 2).
16 Using Analysis to Understand Machines
- The blocked algorithm has computational intensity q ≈ b
- The larger the block size, the more efficient our algorithm will be
- Limit: all three b-by-b blocks from A, B, and C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large
- Assume your fast memory has size Mfast
  - Then 3b^2 < Mfast, so q ≈ b < sqrt(Mfast / 3)
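For concreteness, a small C sketch that turns an assumed cache size into this bound on b and q (the 256 KB cache size is an assumption for illustration, not a figure from the slides):

    #include <math.h>
    #include <stdio.h>

    /* Largest block size b such that three b-by-b blocks of doubles fit in a
       fast memory of Mfast words; the cache size used here is an assumption. */
    int main(void) {
        double cache_bytes = 256 * 1024;                   /* assumed 256 KB cache  */
        double Mfast       = cache_bytes / sizeof(double); /* words of fast memory  */

        double b_max = sqrt(Mfast / 3.0);  /* 3*b^2 < Mfast  =>  b < sqrt(Mfast/3) */

        printf("Mfast = %.0f words, so b (and q) can be at most about %.0f\n",
               Mfast, floor(b_max));
        return 0;
    }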
17 Limits to Optimizing Matrix Multiply
- The blocked algorithm changes the order in which values are accumulated into each C(i,j), using the associativity of addition
- The previous analysis showed that the blocked algorithm has computational intensity
  - q ≈ b < sqrt(Mfast / 3)
- Lower bound theorem (Hong & Kung, 1981): any reorganization of this algorithm (that uses only associativity) is limited to q = O(sqrt(Mfast))
18 BLAS: Basic Linear Algebra Subroutines
- Industry standard interface
- Vendors and others supply optimized implementations
- History
  - BLAS1 (1970s)
    - vector operations: dot product, saxpy (y = a*x + y), etc.
    - m = 2n, f = 2n, q = 1 or less
  - BLAS2 (mid 1980s)
    - matrix-vector operations: matrix-vector multiply, etc.
    - m = n^2, f = 2n^2, q ≈ 2, less overhead
    - somewhat faster than BLAS1
  - BLAS3 (late 1980s)
    - matrix-matrix operations: matrix-matrix multiply, etc.
    - m ≥ n^2, f = O(n^3), so q can possibly be as large as n
    - BLAS3 is potentially much faster than BLAS2
- Good algorithms use BLAS3 when possible (LAPACK)
- See www.netlib.org/blas, www.netlib.org/lapack
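For reference, a minimal C example calling the BLAS3 routine dgemm through the standard CBLAS interface (link against a BLAS library such as OpenBLAS; the 2-by-2 matrices are arbitrary test data):

    #include <stdio.h>
    #include <cblas.h>

    /* Compute C = 1.0*A*B + 0.0*C for small row-major matrices using BLAS3 dgemm. */
    int main(void) {
        const int n = 2;
        double A[] = {1, 2,
                      3, 4};
        double B[] = {5, 6,
                      7, 8};
        double C[] = {0, 0,
                      0, 0};

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,        /* M, N, K       */
                    1.0, A, n,      /* alpha, A, lda */
                    B, n,           /* B, ldb        */
                    0.0, C, n);     /* beta, C, ldc  */

        printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);  /* [19 22; 43 50] */
        return 0;
    }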
19 BLAS speeds on an IBM RS6000/590
[Figure: performance of BLAS 3 (n-by-n matrix-matrix multiply), BLAS 2 (n-by-n matrix-vector multiply), and BLAS 1 (saxpy on vectors of length n), compared with the machine's peak speed of 266 Mflops.]
20 ScaLAPACK Parallel Library