Title: Jim Demmel
1. Minimizing Communication in Numerical Linear Algebra
www.cs.berkeley.edu/demmel
Communication Lower Bounds for Direct Linear Algebra
- Jim Demmel
- EECS & Math Departments, UC Berkeley
- demmel_at_cs.berkeley.edu
2. Outline
- Recall motivation
- Lower bound proof for Matmul, by Irony/Toledo/Tiskin
- How to generalize it to linear algebra without orthogonal transforms
- How to generalize it to linear algebra with orthogonal transforms
- Summary of linear algebra problems for which the lower bound is attainable
- Summary of extensions to Strassen
3. Collaborators
- Grey Ballard, UCB EECS
- Ioana Dumitriu, U. Washington
- Laura Grigori, INRIA
- Ming Gu, UCB Math
- Mark Hoemmen, UCB EECS
- Olga Holtz, UCB Math & TU Berlin
- Julien Langou, U. Colorado Denver
- Marghoob Mohiyuddin, UCB EECS
- Oded Schwartz, TU Berlin
- Hua Xiang, INRIA
- Kathy Yelick, UCB EECS & NERSC
- BeBOP group, bebop.cs.berkeley.edu
4. Motivation (1/2)
- Two kinds of costs:
  - Arithmetic (FLOPS)
  - Communication = moving data between
    - levels of a memory hierarchy (sequential case)
    - processors over a network (parallel case)
5. Motivation (2/2)
- Running time of an algorithm is a sum of 3 terms:
  - # flops × time_per_flop
  - # words moved / bandwidth
  - # messages × latency
- Exponentially growing gaps between these terms:
  - Time_per_flop << 1/Network BW << Network Latency
    - Improving 59%/year vs 26%/year vs 15%/year
  - Time_per_flop << 1/Memory BW << Memory Latency
    - Improving 59%/year vs 23%/year vs 5.5%/year
- Goal: reorganize linear algebra to avoid communication
  - Between all memory hierarchy levels: L1 ↔ L2 ↔ DRAM ↔ network, etc.
  - Not just hiding communication (speedup ≤ 2x)
  - Arbitrary speedups possible
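To make the 3-term model concrete, here is a minimal Python sketch; the machine parameters and the word/message counts in the example are illustrative assumptions, not values from the slides.

    # Minimal sketch of the 3-term running-time model above; the machine
    # parameters below are assumed for illustration, not measured.
    def run_time(flops, words_moved, messages,
                 time_per_flop=1e-11,   # seconds per flop (assumed)
                 bandwidth=1e10,        # words per second (assumed)
                 latency=1e-6):         # seconds per message (assumed)
        return (flops * time_per_flop
                + words_moved / bandwidth
                + messages * latency)

    # Example: n x n matmul doing 2n^3 flops; the words/messages moved
    # depend on the algorithm, which is the point of this talk.
    n = 4096
    print(run_time(flops=2 * n**3, words_moved=3 * n**2, messages=100))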
6. Direct linear algebra: Prior Work on Matmul
- Assume an n^3 algorithm (i.e. not Strassen-like)
- Sequential case, with fast memory of size M
  - Lower bound on # words moved to/from slow memory = Ω(n^3 / M^(1/2))  [Hong & Kung (1981)]
  - Attained using a blocked or recursive (cache-oblivious) algorithm
- Parallel case on P processors
  - Let NNZ be the total memory needed; assume load balanced
  - Lower bound on # words communicated = Ω(n^3 / (P · NNZ)^(1/2))  [Irony, Tiskin & Toledo (2004)]
  - When NNZ = 3n^2 (2D algorithm), get Ω(n^2 / P^(1/2))
    - Attained by Cannon's Algorithm
  - For a 3D algorithm (NNZ = O(n^2 · P^(1/3))), get Ω(n^2 / P^(2/3))
    - Attainable too (details later)
7. Direct linear algebra: Generalizing Prior Work
- Sequential case: # words moved
  = Ω(n^3 / M^(1/2)) = Ω(# flops / (fast_memory_size)^(1/2))
- Parallel case: # words moved
  = Ω(n^3 / (P · NNZ)^(1/2)) = Ω((n^3/P) / (NNZ/P)^(1/2))
  = Ω(flops_per_proc / (memory_per_processor)^(1/2))
  (true for at least one processor, assuming balance in either flops or memory)
- In both cases, we let M = memory size and write:
  # words_moved by at least one processor = Ω(# flops / M^(1/2))
8. Lower bound for all direct linear algebra
- # words_moved by at least one processor = Ω(# flops / M^(1/2))
- # messages_sent by at least one processor = Ω(# flops / M^(3/2))
- Holds for:
  - Matmul, BLAS, LU, QR, eig, SVD, ...
    - Need to explain the model of computation
  - Some whole programs (sequences of these operations, no matter how they are interleaved)
  - Dense and sparse matrices (where # flops << n^3)
  - Sequential and parallel algorithms
  - Some graph-theoretic algorithms
9. Proof of Communication Lower Bound on C = A·B (1/5)
- Proof from Irony/Toledo/Tiskin (2004)
  - Original proof, then generalization
- Think of the instruction stream being executed
  - Looks like "add, load, multiply, store, load, add, ..."
  - Each load/store moves a word between fast and slow memory, or between local memory and remote memory (another processor)
- We want to count the number of loads and stores, given that we are multiplying n-by-n matrices C = A·B using the usual 2n^3 flops, possibly reordered assuming addition is commutative/associative
- Assume at most M words can be stored in fast memory
- Outline:
  - Break the instruction stream into segments, each with M loads and stores
  - Somehow bound the maximum number of flops that can be done in each segment; call it F
  - So F · # segments ≥ T = total flops = 2n^3, so # segments ≥ T / F
  - So # loads + # stores = M · # segments ≥ M · T / F
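For reference, the counting step of the outline compresses to one displayed inequality (LaTeX rendering of the slide's argument):

    % Segment argument: each segment has M loads/stores and at most F flops, so
    F \cdot \#\text{segments} \;\ge\; T = 2n^3
    \quad\Longrightarrow\quad
    \#\text{loads} + \#\text{stores} \;=\; M \cdot \#\text{segments}
    \;\ge\; M \cdot \frac{T}{F}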
10. Illustrating Segments, for M = 3
[figure: instruction stream divided into segments of M = 3 loads/stores each]
11. Proof of Communication Lower Bound on C = A·B (1/5)  (slide 9 repeated)
12. Proof of Communication Lower Bound on C = A·B (2/5)
- Given a segment of the instruction stream with M loads & stores, how many adds & multiplies (F) can we do?
  - At most 2M entries of C, 2M entries of A, and/or 2M entries of B can be accessed
- Use geometry:
  - Represent the n^3 multiplications by an n × n × n cube
  - One n × n face represents A
    - each 1 × 1 subsquare represents one A(i,k)
  - One n × n face represents B
    - each 1 × 1 subsquare represents one B(k,j)
  - One n × n face represents C
    - each 1 × 1 subsquare represents one C(i,j)
  - Each 1 × 1 × 1 subcube represents one product A(i,k) · B(k,j) contributing to C(i,j)
    - May be added directly to C(i,j), or to a temporary accumulator
13. Proof of Communication Lower Bound on C = A·B (3/5)
[figure: n × n × n cube with "A face", "B face", "C face" and index k labeled]
- If we have at most 2M "A squares", 2M "B squares", and 2M "C squares" on the faces, how many cubes can we have?
14. Proof of Communication Lower Bound on C = A·B (4/5)
[figure: a 3D set with its three shadows on the coordinate planes, and a black box with side lengths x, y, z]
- (i,k) is in the "A shadow" if (i,j,k) is in the 3D set
- (j,k) is in the "B shadow" if (i,j,k) is in the 3D set
- (i,j) is in the "C shadow" if (i,j,k) is in the 3D set
- Thm (Loomis & Whitney, 1949):
  # cubes in 3D set = Volume of 3D set ≤ (area(A shadow) × area(B shadow) × area(C shadow))^(1/2)
- Example: # cubes in a black box with side lengths x, y and z
  = Volume of black box = x·y·z = (x·z · z·y · y·x)^(1/2) = (#A's · #B's · #C's)^(1/2)
15. Proof of Communication Lower Bound on C = A·B (5/5)
- Consider one segment of instructions with M loads & stores
- At most 2M entries each of A, B, C are available in one segment
- Volume of the set of cubes representing possible multiply/adds in one segment is ≤ (2M · 2M · 2M)^(1/2) = (2M)^(3/2) ≡ F
- # segments ≥ ⌊2n^3 / F⌋
- # loads + # stores ≥ M · ⌊2n^3 / F⌋ ≥ n^3 / (2M)^(1/2) − M = Ω(n^3 / M^(1/2))
- Parallel case: apply the same reasoning to one processor out of P
  - # adds and muls ≥ 2n^3 / P (at least one processor does this much)
  - M = n^2 / P (each processor gets an equal fraction of the matrices)
  - # loads + # stores = # words moved from or to other processors
    ≥ M · (2n^3/P) / F = M · (2n^3/P) / (2M)^(3/2) = n^2 / (2P)^(1/2)
16. Proof of Loomis-Whitney inequality
- T = 3D set of 1×1×1 cubes on a lattice
- N = #T = # cubes
- T_x = projection of T onto the x=0 plane
- N_x = #T_x = # squares in T_x; same for T_y, N_y, etc.
- Goal: N ≤ (N_x · N_y · N_z)^(1/2)
- T(x=i) = subset of T with x=i
- T(x=i | y) = projection of T(x=i) onto the y=0 plane
- N(x=i) = #T(x=i), etc.
- N = Σ_i N(x=i) = Σ_i (N(x=i))^(1/2) · (N(x=i))^(1/2)
    ≤ Σ_i (N_x)^(1/2) · (N(x=i))^(1/2)                                  [since N(x=i) ≤ N_x]
    ≤ (N_x)^(1/2) · Σ_i (N(x=i | y) · N(x=i | z))^(1/2)
    = (N_x)^(1/2) · Σ_i (N(x=i | y))^(1/2) · (N(x=i | z))^(1/2)
    ≤ (N_x)^(1/2) · (Σ_i N(x=i | y))^(1/2) · (Σ_i N(x=i | z))^(1/2)     [Cauchy-Schwarz]
    ≤ (N_x)^(1/2) · (N_y)^(1/2) · (N_z)^(1/2)
17. Homework
- Prove the more general statement of Loomis-Whitney:
  - Suppose T is d-dimensional
  - N = #T = # d-dimensional cubes in T
  - T(i) is the projection of T onto the hyperplane x(i) = 0
  - N(i) = (d−1)-dimensional volume of T(i)
  - Show: N ≤ Π_{i=1 to d} (N(i))^(1/(d−1))
18. How to generalize this lower bound (1/4)
- It doesn't depend on C(i,j) being a matrix entry, just a unique memory location (same for A(i,k) and B(k,j))
- It doesn't depend on C and A not overlapping (or C and B, or A and B)
  - Some reorderings may change the answer; we still get a lower bound for all reorderings
- It doesn't depend on doing n^3 multiply/adds
- It doesn't depend on C(i,j) being just a sum of products C(i,j) = Σ_k A(i,k)·B(k,j); more generally:

  For all (i,j) ∈ S:
  Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )    (P)
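Since later slides keep referring back to model (P), here is the same definition set in LaTeX for reference:

    % Model (P): S indexes the outputs; S_{ij} indexes the operations g_{ijk}
    % feeding output (i,j) through the reduction f_{ij}.
    \forall (i,j) \in S:\quad
    \mathrm{Mem}(c(i,j)) \;=\; f_{ij}\Bigl(
       g_{ijk}\bigl(\mathrm{Mem}(a(i,k)),\,\mathrm{Mem}(b(k,j))\bigr)
       \ \text{for}\ k \in S_{ij},\ \text{other arguments}\Bigr)
    \tag{P}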
19. How to generalize this lower bound (2/4)
- It does assume the presence of an operand generating a load and/or store; how could this not happen?
  - Mem(b(k,j)) could be reused in many more g_ijk than (P) allows
    - Ex: Compute C^(m) = A · (B.^m) (Matlab notation) for m = 1 to t
    - Can move many fewer words than Ω(# flops / M^(1/2)); see the sketch below
  - We might generate a result during a segment, use it, and discard it, without generating any memory traffic
    - Turns out QR, eig, SVD all may do this
    - Need a different analysis for them (later)
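A minimal sketch of the reuse counterexample (it illustrates the operation only; it does not instrument memory traffic):

    import numpy as np

    # Each resident pair A(i,k), B(k,j) supports t multiply-adds (one per
    # power m), so words moved can grow much more slowly than the 2tn^3
    # flop count -- breaking the Omega(flops / M^(1/2)) bound.
    def powers_matmul(A, B, t):
        return [A @ (B ** m) for m in range(1, t + 1)]  # C^(m) = A * (B.^m)

    A, B = np.random.rand(4, 4), np.random.rand(4, 4)
    Cs = powers_matmul(A, B, t=3)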
20. How to generalize this lower bound (3/4)
- Need to distinguish Sources and Destinations of each operand in fast memory during a segment
- Possible Sources:
  - S1: Already in fast memory at the start of the segment, or read during it: at most 2M
  - S2: Created during the segment: no bound without more information
- Possible Destinations:
  - D1: Left in fast memory at the end of the segment, or written during it: at most 2M
  - D2: Discarded: no bound without more information
- Need to assume no S2/D2 arguments; at most 4M of the others
21. How to generalize this lower bound (4/4)
- Theorem: To evaluate (P) with memory of size M, where
  - f_ij and g_ijk are nontrivial functions of their arguments,
  - G is the total number of g_ijk's, and
  - there are no S2/D2 arguments,
  requires at least G / (8M)^(1/2) − M slow memory references
  - More simply: # words_moved = Ω(# flops / M^(1/2))
- Corollary: To evaluate (P) requires at least G / (8^(1/2) · M^(3/2)) − 1 messages
  (more simply: Ω(# flops / M^(3/2)))
  - Proof: the maximum message size is M
22. Some Corollaries of this lower bound (1/4): # words_moved = Ω(# flops / M^(1/2))
- Theorem applies to dense or sparse, parallel or sequential:
- MatMul, including A^T·A or A^2
- Triangular solve C = A^(-1)·X
  - C(i,j) = (X(i,j) − Σ_{k<i} A(i,k)·C(k,j)) / A(i,i),  A lower triangular
  - C plays the double role of b and c in model (P)
- LU factorization (any pivoting, LU or ILU)
  - L(i,j) = (A(i,j) − Σ_{k<j} L(i,k)·U(k,j)) / U(j,j),  U(i,j) = A(i,j) − Σ_{k<i} L(i,k)·U(k,j)
  - L (and U) play the double role of c and a (c and b) in model (P)
- Cholesky (any diagonal pivoting, C or IC)
  - L(i,j) = (A(i,j) − Σ_{k<j} L(i,k)·L^T(k,j)) / L(j,j),  L(j,j) = (A(j,j) − Σ_{k<j} L(j,k)·L^T(k,j))^(1/2)
  - L (and L^T) play the triple role of a, b and c
23. Some Corollaries of this lower bound (2/4): # words_moved = Ω(# flops / M^(1/2))
- Applies to simplified operations:
  - Ex 1: Compute ‖A·B‖_F where A(i,k) = f(i,k) and B(k,j) = g(k,j), so there are no inputs and 1 output; assume each f(i,k) and g(k,j) is evaluated once
  - Ex 2: Compute the determinant of A(i,j) = f(i,j) using LU
- Apply the lower bound by imposing reads and writes:
  - Ex 1: Every time a final value of (A·B)(i,j) is computed, write it; every time f(i,k) or g(k,j) is evaluated, insert a read
  - Ex 2: Every time a final value of L(i,j) or U(i,j) is computed, write it; every time f(i,j) is evaluated, insert a read
- Still get Ω(# flops / M^(1/2) − 3n^2) words_moved, by subtracting the imposed reads/writes (sparse case analogous)
24. Some Corollaries of this lower bound (3/4): # words_moved = Ω(# flops / M^(1/2))
25. Some Corollaries of this lower bound (4/4): # words_moved = Ω(# flops / M^(1/2))
- Applies to some graph algorithms
- Ex: All-Pairs-Shortest-Paths by Floyd-Warshall (runnable version below)
  - Matches (P) with g_ijk = "+" and f_ij = "min"
  - Get # words_moved = Ω(n^3 / M^(1/2)) for dense graphs
  - Homework: state the result for sparse graphs; how does n^3 change?

  Initialize all path(i,j) = length of edge from node i to j (or ∞ if none)
  for k = 1 to n
    for i = 1 to n
      for j = 1 to n
        path(i,j) = min( path(i,j), path(i,k) + path(k,j) )
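A runnable transcription of the pseudocode above (a minimal sketch; a real APSP code would also track predecessors):

    import math

    # Floyd-Warshall: matches model (P) with g_ijk = "+" and f_ij = "min".
    def floyd_warshall(path):
        n = len(path)
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    path[i][j] = min(path[i][j], path[i][k] + path[k][j])
        return path

    INF = math.inf
    dist = [[0, 3, INF],
            [3, 0, 1],
            [INF, 1, 0]]
    print(floyd_warshall(dist))   # dist[0][2] becomes 4, via node 1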
26. 3D parallel matrix multiplication: C = C + A·B
- p procs arranged in a p^(1/3) × p^(1/3) × p^(1/3) grid, with tree networks along "fibers"
- Each block A(i,k), B(k,j), C(i,j) is n/p^(1/3) × n/p^(1/3)
- Initially:
  - Proc (i,j,0) owns C(i,j)
  - Proc (i,0,k) owns A(i,k)
  - Proc (0,j,k) owns B(k,j)
- Algorithm:
  - For all (i,k), broadcast A(i,k) to procs (i,j,k) ∀ j
    - comm. cost = log(p^(1/3)) · α + log(p^(1/3)) · (n^2 / p^(2/3)) · β
  - For all (j,k), broadcast B(k,j) to procs (i,j,k) ∀ i; same comm. cost
  - For all (i,j,k): Tmp(i,j,k) = A(i,k) · B(k,j); cost = 2(n/p^(1/3))^3 = 2n^3/p flops
  - For all (i,j), reduce C(i,j) = Σ_k Tmp(i,j,k); same comm. cost
- Total comm. cost = O(log(p) · (n^2 / p^(2/3)) · β + log(p) · α)
- Lower bound = Ω((n^3/p) / (n^2/p^(2/3))^(1/2) · β + (n^3/p) / (n^2/p^(2/3))^(3/2) · α) = Ω((n^2/p^(2/3)) · β + α)
- Lower bound also attainable for all n^2/p^(2/3) ≥ M = n^2/p^x ≥ n^2/p  [Toledo et al]
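Plugging the slide's cost formulas into code gives a quick comparison; the 2D (Cannon-style) costs used here (≈√p messages, ≈n^2/√p words) and all machine parameters are assumptions of this sketch:

    from math import log2, sqrt

    def cost_2d(n, p, alpha, beta):
        # Cannon-style 2D algorithm: ~sqrt(p) messages, ~n^2/sqrt(p) words (assumed)
        return alpha * sqrt(p) + beta * n**2 / sqrt(p)

    def cost_3d(n, p, alpha, beta):
        # Total comm. cost from this slide: O(log(p)*(n^2/p^(2/3))*beta + log(p)*alpha)
        return log2(p) * (alpha + beta * n**2 / p**(2 / 3))

    n, p = 1 << 14, 1 << 9              # problem size and processor count (assumed)
    alpha, beta = 1e-6, 1e-9            # latency (s), reciprocal bandwidth (s/word)
    print(f"2D: {cost_2d(n, p, alpha, beta):.3g} s,  3D: {cost_3d(n, p, alpha, beta):.3g} s")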
27. So when doesn't Model (P) apply?  # words_moved = Ω(# flops / M^(1/2))
- Ex: for r = 1 to t, C^(r) = A · (B.^(1/r))  (Matlab notation)
- (P) does not apply:
  - With A(i,k) and B(k,j) in memory, we can do t operations, not just 1 g_ijk
  - Can't apply (P) with arguments b(k,j) indexed in a one-to-one fashion
- Can still analyze using segments and Loomis-Whitney:
  - # flops per segment ≤ 8(t·M^3)^(1/2)
  - # segments ≥ (t·n^3) / (8(t·M^3)^(1/2))
  - # words_moved = Ω(t^(1/2)·n^3 / M^(1/2)), not Ω(t·n^3 / M^(1/2))
- Attainable, using a variation of the usual blocked matmul algorithm
  - Homework! (Hint: need to recompute B(k,j)^(1/r) when needed)
28. Lower bounds for Orthogonal Transformations (1/4)
- Needed for QR, eig, SVD, ...
- Analyze blocked Householder transformations:
  - Π_{j=1 to b} (I − 2·u_j·u_j^T) = I − U·T·U^T, where U = [u_1, ..., u_b]
  - Treat Givens rotations as 2×2 Householder transformations
- Model details and assumptions:
  - Write (I − U·T·U^T)·A = A − U·(T·U^T·A) = A − U·Z, where Z = T·(U^T·A)
  - Only count operations in all the A − U·Z updates
    - Generically a large fraction of the work
  - Assume "Forward Progress": each successive Householder transformation leaves previously created zeros zero. OK for:
    - QR decomposition
    - Reductions to condensed forms (Hessenberg, tri/bidiagonal)
      - Possible exception: bulge chasing in the banded case
    - One sweep of (block) Hessenberg QR iteration
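A small numerical check of the identity Π_j (I − 2·u_j·u_j^T) = I − U·T·U^T, using a LAPACK-style compact-WY recurrence to build T (the recurrence used here is an assumption of this sketch, not given on the slide):

    import numpy as np

    n, b = 8, 3
    rng = np.random.default_rng(0)
    U = rng.standard_normal((n, b))
    U /= np.linalg.norm(U, axis=0)       # each reflector is I - 2 u u^T, ||u|| = 1

    # Product of reflectors, applied in order j = 1..b
    Q = np.eye(n)
    for j in range(b):
        u = U[:, j:j+1]
        Q = Q @ (np.eye(n) - 2.0 * u @ u.T)

    # Compact WY: upper-triangular T with Q = I - U T U^T (dlarft-style recurrence)
    T = np.zeros((b, b))
    for j in range(b):
        tau = 2.0                        # tau = 2 for unit Householder vectors
        T[:j, j] = -tau * T[:j, :j] @ (U[:, :j].T @ U[:, j])
        T[j, j] = tau

    assert np.allclose(Q, np.eye(n) - U @ T @ U.T)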
29. Lower bounds for Orthogonal Transformations (2/4)
- Perform many updates A − U·Z, where Z = T·(U^T·A)
- First challenge to applying the theorem: need to collect all the A − U·Z updates into one big set to which model (P) applies
  - Write them all as A(i,j) = A(i,j) − Σ_k U(i,k)·Z(k,j), where k = index of the k-th transformation
  - k is not necessarily the index of the column of A it comes from
- Second challenge: all the Z(k,j) may be S2/D2
  - Recall: S2/D2 means computed on the fly and discarded
  - Ex: A_{n×2n} = Q_{n×n} · R_{n×2n}, where 2n^2 ≤ M so A fits in cache
    - Represent Q as n(n−1)/2 2×2 Householder (Givens) transforms
    - There are n^2(n−1)/2 = Θ(M^(3/2)) nonzero Z(k,j), not O(M)
  - Still only do O(M^(3/2)) flops during a segment
    - But can't use Loomis-Whitney to prove it!
    - Need a new idea
30. Lower bounds for Orthogonal Transformations (3/4)
- Dealing with Z(k,j) being S2/D2:
  - How to bound # flops in A(i,j) = A(i,j) − Σ_k U(i,k)·Z(k,j)?
  - Neither A nor U is S2/D2
    - A is either turned into R or into U, which are output
    - So at most 2M of each during a segment
  - # flops ≤ (# U(i,k)'s) × (# columns A(·,j) present)
    ≤ (# U(i,k)'s) × (# A(i,j) entries / min # nonzeros per column of A)
    ≤ h × O(M) / r,  where h ≡ # U(i,k)'s = O(M) and r ≡ min # nonzeros per column
  - How small can r be?
    - To store h Householder vector entries U(i,k) in r rows, there can be at most r in the first column, r−1 in the second, etc. (to maintain Forward Progress), so r(r−1)/2 ≥ h, so r ≥ h^(1/2)
  - # flops ≤ h · O(M) / r ≤ O(M) · h^(1/2) = O(M^(3/2)), as desired
31. Lower bounds for Orthogonal Transformations (4/4)
- Theorem: # words_moved by QR decomposition using (blocked) Householder transformations = Ω(# flops / M^(1/2))
- Theorem: # words_moved by reduction to Hessenberg form, tridiagonal form, bidiagonal form, or one sweep of QR iteration (or block versions of any of these) = Ω(# flops / M^(1/2))
  - Assuming Forward Progress (created zeros remain zero)
  - Model: merge left and right orthogonal transformations:
    A(i,j) = A(i,j) − Σ_{kL} U_L(i,kL)·Z_L(kL,j) − Σ_{kR} Z_R(i,kR)·U_R(j,kR)
32. Robust PCA (Candès, 2009)
- Decompose a matrix into a low-rank component plus a sparse component: M = L + S
- Homework: apply the lower bound result
33. Can we attain these lower bounds?
- Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  - Mostly not
- If not, are there other algorithms that do?
  - Yes
- Several goals for algorithms:
  - Minimize bandwidth (Ω(# flops / M^(1/2))) and latency (Ω(# flops / M^(3/2)))
  - Multiple memory hierarchy levels (attain the bound for each level?)
  - Explicit dependence on (multiple) M's, vs. cache oblivious
  - Fewest flops when the matrix fits in the smallest memory
- What about sparse algorithms?
34. Recall: Minimizing latency requires new data structures
- To minimize latency, need to load/store a whole rectangular subblock of the matrix with one message
  - Incompatible with conventional columnwise (rowwise) storage
  - Ex: rows (columns) are not in contiguous memory locations
- Blocked storage: store as a matrix of b×b blocks, each block stored contiguously
  - OK for one level of memory hierarchy; what if there are more?
- Recursive blocked storage: store each block using subblocks
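A minimal sketch of one level of blocked storage in Python/NumPy (row-major rather than columnwise, and assuming b divides n, purely for brevity):

    import numpy as np

    # Reorder a matrix into an array of contiguous b-by-b blocks, so that
    # one block can be moved with one message/transfer.
    def to_blocked(A, b):
        n = A.shape[0]                      # assume b divides n
        return (A.reshape(n // b, b, n // b, b)
                 .transpose(0, 2, 1, 3)     # -> (block row, block col, b, b)
                 .copy())                   # copy() makes each block contiguous

    A = np.arange(16.0).reshape(4, 4)
    blocks = to_blocked(A, 2)
    assert np.allclose(blocks[1, 0], A[2:4, 0:2])   # block (1,0), stored contiguously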
35. Recall: Blocked vs Cache-Oblivious Algorithms
- Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C whose dimensions depend on the cache size:

  Break A_{n×n}, B_{n×n}, C_{n×n} into b×b blocks labeled A(i,j), etc.;
  b chosen so 3 b×b blocks fit in cache
  for i = 1 to n/b,  for j = 1 to n/b,  for k = 1 to n/b
      C(i,j) = C(i,j) + A(i,k)·B(k,j)      ... b×b matmul
  (another level of memory would need 3 more loops)

- Cache-oblivious Matmul C = A·B is independent of the cache:

  Function C = MM(A,B)
      if A and B are 1×1
          C = A · B
      else
          Break A_{n×n}, B_{n×n}, C_{n×n} into (n/2)×(n/2) blocks labeled A(i,j), etc.
          for i = 1 to 2,  for j = 1 to 2,  for k = 1 to 2
              C(i,j) = C(i,j) + MM( A(i,k), B(k,j) )      ... (n/2)×(n/2) matmul
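Runnable NumPy counterparts of the two pseudocodes (assuming n is a power of two and divisible by b; a real implementation would handle ragged edges and stop the recursion at a larger base case):

    import numpy as np

    def blocked_mm(A, B, b):                 # explicit b x b blocking
        n = A.shape[0]
        C = np.zeros((n, n))
        for i in range(0, n, b):
            for j in range(0, n, b):
                for k in range(0, n, b):
                    C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
        return C

    def mm(A, B):                            # cache-oblivious: recurse on quadrants
        n = A.shape[0]
        if n == 1:
            return A * B
        h = n // 2
        C = np.zeros((n, n))
        for i in (0, 1):
            for j in (0, 1):
                for k in (0, 1):
                    C[i*h:(i+1)*h, j*h:(j+1)*h] += mm(A[i*h:(i+1)*h, k*h:(k+1)*h],
                                                      B[k*h:(k+1)*h, j*h:(j+1)*h])
        return C

    A, B = np.random.rand(8, 8), np.random.rand(8, 8)
    assert np.allclose(blocked_mm(A, B, 2), A @ B) and np.allclose(mm(A, B), A @ B)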
36. Summary of dense sequential algorithms attaining communication lower bounds
- Algorithms shown minimizing # messages assume (recursive) block layout
- Many references (see reports); only some shown, plus ours
  - Older references may or may not include an analysis
- Cache-oblivious algorithms are underlined, green entries are ours, "?" is unknown/future work

| Algorithm | 2 Levels: Words Moved | 2 Levels: Messages | Multiple Levels: Words Moved | Multiple Levels: Messages |
| BLAS-3 | Usual blocked or recursive algorithms | Usual blocked or recursive algorithms | Usual blocked (nested) or recursive algorithms [Gustavson, 97] | Usual blocked (nested) or recursive algorithms [Gustavson, 97] |
| Cholesky | LAPACK (with b = M^(1/2)); Gustavson, 97; BDHS09 | Gustavson, 97; Ahmed & Pingali, 00; BDHS09 | (same) | (same) |
| LU with pivoting | LAPACK (rarely); Toledo, 97; GDX 08 | GDX 08 | Toledo, 97; GDX 08? | GDX 08? |
| QR | LAPACK (rarely); Elmroth & Gustavson, 98; DGHL08 | Frens & Wise, 03; DGHL08 | Elmroth & Gustavson, 98; DGHL08? | Frens & Wise, 03; DGHL08? |
| Eig, SVD | Not LAPACK; BDD10 | Not LAPACK; BDD10 | BDD10 | BDD10 |
37. Summary of dense 2D parallel algorithms attaining communication lower bounds
- Assume n×n matrices on P processors, memory per processor = O(n^2 / P)
- Many references (see reports); green entries are ours
- ScaLAPACK assumes the best block size b is chosen
- Recall the lower bounds:
  - # words_moved = Ω(n^2 / P^(1/2))  and  # messages = Ω(P^(1/2))

| Algorithm | Reference | Factor exceeding lower bound for words_moved | Factor exceeding lower bound for messages |
| Matrix multiply | Cannon, 69 | 1 | 1 |
| Cholesky | ScaLAPACK | log P | log P |
| LU | GDX08 | log P | log P |
| LU | ScaLAPACK | log P | (N / P^(1/2)) · log P |
| QR | DGHL08 | log P | log^3 P |
| QR | ScaLAPACK | log P | (N / P^(1/2)) · log P |
| Sym Eig, SVD | BDD10 | log P | log^3 P |
| Sym Eig, SVD | ScaLAPACK | log P | N / P^(1/2) |
| Nonsym Eig | BDD10 | log P | log^3 P |
| Nonsym Eig | ScaLAPACK | P^(1/2) · log P | N · log P |
38. Recent Communication-Optimal Algorithms
- QR with column pivoting
- Cholesky with diagonal pivoting
- LU with complete pivoting
- LDL^T with complete pivoting
- Sparse Cholesky
  - For matrices with "good separators"
  - For most sparse matrices (as hard as the dense case)
39. Homework: Extend the lower bound
- To algorithms using (block) Givens rotations instead of (block) Householder transformations
- To QR done with Gram-Schmidt orthogonalization (CGS and MGS)
- To heterogeneous collections of processors
  - Suppose processor k
    - does φ(k) flops per second,
    - has reciprocal bandwidth β(k),
    - has latency α(k)
  - What is a lower bound on the solution time, for any fractions of the work assigned to each processor (i.e. not all equal)?
- To a homogeneous parallel shared memory machine
  - All data initially resides in a large shared memory
  - Processors communicate via data written to/read from slow memory
  - Each processor has a local fast memory of size M
40. Extra slides