Title: CS 240A, January 23, 2006: Matrix multiplication
1 CS 240A, January 23, 2006: Matrix multiplication
- Linear algebra problems
- Matrix multiplication I: cache issues
- Matrix multiplication II: parallel issues
2 References
- Kathy Yelick's slides on matmul and cache issues:
  http://www.cs.berkeley.edu/~yelick/cs267/lectures/03/lect03-matmul.ppt
- Kathy Yelick's slides on parallel matrix multiplication:
  http://www.cs.berkeley.edu/~yelick/cs267/lectures/13/lect13-pmatmul.ppt
- Jim Demmel's slides on parallel dense linear algebra:
  http://www.cs.berkeley.edu/~demmel/cs267_Spr99/Lectures/Lect_19_2000.ppt
3 Problems in linear algebra
- Solving linear equations
  - Find x such that Ax = b
  - Workhorse of computational modeling
  - Dense: electromagnetics; also a kernel in sparse solvers
  - Sparse: differential equations, etc.
- Eigenvalue / eigenvector problems
  - Find λ and x such that Ax = λx
  - Dynamics of systems, vibration, information retrieval
- Today: matrix-matrix multiplication
  - Compute C = A * B
  - Kernel in linear equation solvers, etc.
4 Avoiding data movement: reuse and locality
[Figure: Conventional storage hierarchy. Each processor (Proc) has its own cache, L2 cache, and L3 cache; the processors reach the memories through potential interconnects.]
- Large memories are slow; fast memories are small
- Parallel processors, collectively, have a large, fast cache
- The slow accesses to remote data are what we call communication
- The algorithm should do most of its work on local data
5 Simplified model of hierarchical memory
- Assume just 2 levels in the hierarchy, fast and slow
- All data initially in slow memory
  - m = number of memory elements (words) moved between fast and slow memory
  - tm = time per slow memory operation
  - f = number of arithmetic operations
  - tf = time per arithmetic operation, with tf << tm
  - q = f / m = average number of flops per slow memory access
- Minimum possible time = f * tf, when all data is in fast memory
- Actual time = f * tf + m * tm = f * tf * (1 + (tm/tf) * (1/q))
- Larger q means time closer to the minimum f * tf
6 Warm up: Matrix-vector multiplication
- implements y = y + A*x
- for i = 1:n
  - for j = 1:n
    - y(i) = y(i) + A(i,j)*x(j)
[Figure: y(i) is updated with the dot product of row A(i,:) and the vector x.]
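A minimal C rendering of this loop nest, assuming row-major storage and 0-based indexing (the function name is illustrative):

    /* y = y + A*x for an n-by-n matrix A stored row-major. */
    void matvec(int n, const double *A, const double *x, double *y) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                y[i] += A[i*n + j] * x[j];   /* each element of A is touched exactly once */
    }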
7 Warm up: Matrix-vector multiplication
- read x(1:n) into fast memory
- read y(1:n) into fast memory
- for i = 1:n
  - read row i of A into fast memory
  - for j = 1:n
    - y(i) = y(i) + A(i,j)*x(j)
- write y(1:n) back to slow memory
- m = number of slow memory references = 3n + n^2
- f = number of arithmetic operations = 2n^2
- q = f / m ≈ 2
- Matrix-vector multiplication is limited by slow memory speed
8 Modeling Matrix-Vector Multiplication
- Compute the time for an n x n = 1000 x 1000 matrix
- Time
  - = f * tf + m * tm = f * tf * (1 + (tm/tf) * (1/q))
  - = 2n^2 * tf * (1 + 0.5 * tm/tf)
- For tf and tm, use data from R. Vuduc's PhD thesis (pp. 352-353):
  http://bebop.cs.berkeley.edu/pubs/vuduc2003-dissertation.pdf
- For tm, use minimum-memory-latency / words-per-cache-line
- tm/tf is the machine balance (q must be at least this large to run at peak speed)
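As a rough illustration, a small C sketch that plugs assumed values of tf and tm into this formula with q = 2 (the numbers below are placeholders, not the measurements from Vuduc's thesis):

    #include <stdio.h>

    /* Modeled matrix-vector multiply time: 2*n^2 * tf * (1 + 0.5 * tm/tf),
       i.e. the general model with f = 2n^2 and q = 2.
       The tf and tm values below are assumptions for illustration only. */
    int main(void) {
        double n  = 1000.0;
        double tf = 1e-9;    /* assumed time per flop: 1 ns                */
        double tm = 50e-9;   /* assumed time per slow-memory access: 50 ns */

        double minimum = 2.0 * n * n * tf;                    /* all data in fast memory   */
        double modeled = minimum * (1.0 + 0.5 * (tm / tf));   /* account for memory traffic */

        printf("modeled time = %.3e s, fraction of peak = %.1f%%\n",
               modeled, 100.0 * minimum / modeled);
        return 0;
    }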
9 Naïve Matrix Multiply
- implements C = C + A*B
- for i = 1 to n
  - for j = 1 to n
    - for k = 1 to n
      - C(i,j) = C(i,j) + A(i,k) * B(k,j)
The algorithm has 2n^3 = O(n^3) flops and operates on 3n^2 words of memory.
[Figure: C(i,j) is updated with the dot product of row A(i,:) and column B(:,j).]
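A minimal C version of this triple loop, assuming row-major storage and 0-based indexing:

    /* Naive C = C + A*B for n-by-n matrices stored row-major. */
    void matmul_naive(int n, const double *A, const double *B, double *C) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i*n + j] += A[i*n + k] * B[k*n + j];
    }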
10 Matrix Multiply on RS/6000
[Figure: measured performance of naïve matrix multiply vs. matrix size on an IBM RS/6000; the running time fits T = N^4.7.]
- Size 2000 took 5 days; size 12,000 would take 1095 years
- O(N^3) performance would have constant cycles/flop, but the measured performance looks much closer to O(N^5)
Slide source: Larry Carter, UCSD
11 Matrix Multiply on RS/6000
[Figure: the same plot, annotated with the memory-system effects that dominate in different size ranges.]
- Page miss every iteration
- TLB miss every iteration
- Cache miss every 16 iterations
- Page miss every 512 iterations
Slide source: Larry Carter, UCSD
12 Naïve Matrix Multiply
- implements C = C + A*B
- for i = 1 to n
  - read row i of A into fast memory
  - for j = 1 to n
    - read C(i,j) into fast memory
    - read column j of B into fast memory
    - for k = 1 to n
      - C(i,j) = C(i,j) + A(i,k) * B(k,j)
    - write C(i,j) back to slow memory
13 Naïve Matrix Multiply
- How many references to slow memory?
  - m = n^3     (read each column of B n times)
      + n^2     (read each row of A once)
      + 2n^2    (read and write each element of C once)
      = n^3 + 3n^2
- So q = f / m = 2n^3 / (n^3 + 3n^2)
  - ≈ 2 for large n: no improvement over matrix-vector multiply
14 Blocked Matrix Multiply
- Consider A, B, C to be N-by-N matrices of b-by-b subblocks, where b = n/N is called the block size
- for i = 1 to N
  - for j = 1 to N
    - read block C(i,j) into fast memory
    - for k = 1 to N
      - read block A(i,k) into fast memory
      - read block B(k,j) into fast memory
      - C(i,j) = C(i,j) + A(i,k) * B(k,j)   (do a matrix multiply on blocks)
    - write block C(i,j) back to slow memory
[Figure: block C(i,j) is updated with the product of block A(i,k) and block B(k,j).]
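A C sketch of the blocked algorithm, assuming row-major storage, 0-based indexing, and a block size b that divides n evenly:

    /* Blocked C = C + A*B for n-by-n row-major matrices; assumes b divides n. */
    void matmul_blocked(int n, int b, const double *A, const double *B, double *C) {
        for (int ii = 0; ii < n; ii += b)           /* block row of C    */
            for (int jj = 0; jj < n; jj += b)       /* block column of C */
                for (int kk = 0; kk < n; kk += b)   /* block index k     */
                    /* multiply the b-by-b blocks: C(ii,jj) += A(ii,kk) * B(kk,jj) */
                    for (int i = ii; i < ii + b; i++)
                        for (int j = jj; j < jj + b; j++)
                            for (int k = kk; k < kk + b; k++)
                                C[i*n + j] += A[i*n + k] * B[k*n + j];
    }

Once a pair of b-by-b blocks of A and B is in fast memory, each of its elements is reused b times, which is what raises q from about 2 to about b.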
15 Blocked Matrix Multiply
- m is the amount of memory traffic between slow and fast memory
- the matrix has n x n elements, and N x N blocks each of size b x b
- f is the number of floating point operations, 2n^3 for this problem
- q = f / m measures the algorithm's efficiency in the memory system

m = N*n^2    (read each block of B N^3 times: N^3 * (n/N) * (n/N))
  + N*n^2    (read each block of A N^3 times)
  + 2n^2     (read and write each block of C once)
  = (2N + 2) * n^2

So the computational intensity is q = f / m = 2n^3 / ((2N + 2) * n^2)
                                            ≈ n / N = b   for large n

We can improve performance by increasing the block size b.
The blocked algorithm can be much faster than matrix-vector multiply (q ≈ 2).
16 Using Analysis to Understand Machines
- The blocked algorithm has computational intensity q ≈ b
- The larger the block size, the more efficient our algorithm will be
- Limit: all three b-by-b blocks from A, B, and C must fit in fast memory (cache), so we cannot make these blocks arbitrarily large
- Assume your fast memory has size Mfast
  - Then 3b^2 < Mfast, so q ≈ b < sqrt(Mfast / 3)
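For concreteness, a small C sketch that turns an assumed cache size into this bound on b and q (the 256 KB cache size is an assumption for illustration, not a figure from the slides):

    #include <math.h>
    #include <stdio.h>

    /* Largest block size b such that three b-by-b blocks of doubles fit in a
       fast memory of Mfast words; the cache size used here is an assumption. */
    int main(void) {
        double cache_bytes = 256 * 1024;                   /* assumed 256 KB cache  */
        double Mfast       = cache_bytes / sizeof(double); /* words of fast memory  */

        double b_max = sqrt(Mfast / 3.0);  /* 3*b^2 < Mfast  =>  b < sqrt(Mfast/3) */

        printf("Mfast = %.0f words, so b (and q) can be at most about %.0f\n",
               Mfast, floor(b_max));
        return 0;
    }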
17 Limits to Optimizing Matrix Multiply
- The blocked algorithm changes the order in which values are accumulated into each C(i,j), using the associativity of addition
- The previous analysis showed that the blocked algorithm has computational intensity
  - q ≈ b < sqrt(Mfast / 3)
- Lower bound theorem (Hong & Kung, 1981): any reorganization of this algorithm (that uses only associativity) is limited to q = O(sqrt(Mfast))
18 BLAS: Basic Linear Algebra Subroutines
- Industry standard interface
- Vendors and others supply optimized implementations
- History
  - BLAS1 (1970s)
    - vector operations: dot product, saxpy (y = a*x + y), etc.
    - m = 2n, f = 2n, q = 1 or less
  - BLAS2 (mid 1980s)
    - matrix-vector operations: matrix-vector multiply, etc.
    - m = n^2, f = 2n^2, q ≈ 2, less overhead
    - somewhat faster than BLAS1
  - BLAS3 (late 1980s)
    - matrix-matrix operations: matrix-matrix multiply, etc.
    - m ≥ n^2, f = O(n^3), so q can possibly be as large as n
    - BLAS3 is potentially much faster than BLAS2
- Good algorithms use BLAS3 when possible (LAPACK)
- See www.netlib.org/blas, www.netlib.org/lapack
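For reference, a minimal C example calling the BLAS3 routine dgemm through the standard CBLAS interface (link against a BLAS library such as OpenBLAS; the 2-by-2 matrices are arbitrary test data):

    #include <stdio.h>
    #include <cblas.h>

    /* Compute C = 1.0*A*B + 0.0*C for small row-major matrices using BLAS3 dgemm. */
    int main(void) {
        const int n = 2;
        double A[] = {1, 2,
                      3, 4};
        double B[] = {5, 6,
                      7, 8};
        double C[] = {0, 0,
                      0, 0};

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,        /* M, N, K       */
                    1.0, A, n,      /* alpha, A, lda */
                    B, n,           /* B, ldb        */
                    0.0, C, n);     /* beta, C, ldc  */

        printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);  /* [19 22; 43 50] */
        return 0;
    }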
19 BLAS speeds on an IBM RS6000/590
[Figure: performance of BLAS 3 (n-by-n matrix-matrix multiply), BLAS 2 (n-by-n matrix-vector multiply), and BLAS 1 (saxpy on vectors of length n), compared with the machine's peak speed of 266 Mflops.]
20 ScaLAPACK Parallel Library