Title: Jim Demmel
1. Minimizing Communication in Numerical Linear Algebra
www.cs.berkeley.edu/demmel
Communication Lower Bounds for Direct Linear Algebra
- Jim Demmel
- EECS & Math Departments, UC Berkeley
- demmel_at_cs.berkeley.edu
2. Outline
- Recall motivation
- Lower bound proof for Matmul, by Irony/Toledo/Tiskin
- How to generalize it to linear algebra without orthogonal transforms
- How to generalize it to linear algebra with orthogonal transforms
- Summary of linear algebra problems for which the lower bound is attainable
- Summary of extensions to Strassen
3. Collaborators
- Grey Ballard, UCB EECS
- Ioana Dumitriu, U. Washington
- Laura Grigori, INRIA
- Ming Gu, UCB Math
- Mark Hoemmen, UCB EECS
- Olga Holtz, UCB Math & TU Berlin
- Julien Langou, U. Colorado Denver
- Marghoob Mohiyuddin, UCB EECS
- Oded Schwartz, TU Berlin
- Hua Xiang, INRIA
- Kathy Yelick, UCB EECS & NERSC
- BeBOP group, bebop.cs.berkeley.edu
4. Motivation (1/2)
- Two kinds of costs:
  - Arithmetic (FLOPS)
  - Communication = moving data between
    - levels of a memory hierarchy (sequential case)
    - processors over a network (parallel case)
5. Motivation (2/2)
- Running time of an algorithm is a sum of 3 terms:
  - # flops × time_per_flop
  - # words moved / bandwidth
  - # messages × latency
- Exponentially growing gaps between these terms:
  - Time_per_flop << 1/Network BW << Network Latency
    - Improving 59%/year vs 26%/year vs 15%/year
  - Time_per_flop << 1/Memory BW << Memory Latency
    - Improving 59%/year vs 23%/year vs 5.5%/year
- Goal: reorganize linear algebra to avoid communication
  - Between all memory hierarchy levels: L1 ↔ L2 ↔ DRAM ↔ network, etc.
  - Not just hiding communication (speedup ≤ 2x)
  - Arbitrary speedups possible
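To make the 3-term model concrete, here is a minimal Python sketch; the machine parameters and the word/message counts in the example are illustrative assumptions, not values from the slides.

    # Minimal sketch of the 3-term running-time model above; the machine
    # parameters below are assumed for illustration, not measured.
    def run_time(flops, words_moved, messages,
                 time_per_flop=1e-11,   # seconds per flop (assumed)
                 bandwidth=1e10,        # words per second (assumed)
                 latency=1e-6):         # seconds per message (assumed)
        return (flops * time_per_flop
                + words_moved / bandwidth
                + messages * latency)

    # Example: n x n matmul doing 2n^3 flops; the words/messages moved
    # depend on the algorithm, which is the point of this talk.
    n = 4096
    print(run_time(flops=2 * n**3, words_moved=3 * n**2, messages=100))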
6. Direct linear algebra: Prior Work on Matmul
- Assume an n^3 algorithm (i.e. not Strassen-like)
- Sequential case, with fast memory of size M
  - Lower bound on # words moved to/from slow memory = Ω(n^3 / M^(1/2))  [Hong & Kung (1981)]
  - Attained using a blocked or recursive (cache-oblivious) algorithm
- Parallel case on P processors
  - Let NNZ be the total memory needed; assume load balanced
  - Lower bound on # words communicated = Ω(n^3 / (P · NNZ)^(1/2))  [Irony, Tiskin & Toledo (2004)]
  - When NNZ = 3n^2 (2D algorithm), get Ω(n^2 / P^(1/2))
    - Attained by Cannon's Algorithm
  - For a 3D algorithm (NNZ = O(n^2 · P^(1/3))), get Ω(n^2 / P^(2/3))
    - Attainable too (details later)
7. Direct linear algebra: Generalizing Prior Work
- Sequential case: # words moved
  = Ω(n^3 / M^(1/2)) = Ω(# flops / (fast_memory_size)^(1/2))
- Parallel case: # words moved
  = Ω(n^3 / (P · NNZ)^(1/2)) = Ω((n^3/P) / (NNZ/P)^(1/2))
  = Ω(flops_per_proc / (memory_per_processor)^(1/2))
  (true for at least one processor, assuming balance in either flops or memory)
- In both cases, we let M = memory size and write:
  # words_moved by at least one processor = Ω(# flops / M^(1/2))
8. Lower bound for all direct linear algebra
- # words_moved by at least one processor = Ω(# flops / M^(1/2))
- # messages_sent by at least one processor = Ω(# flops / M^(3/2))
- Holds for:
  - Matmul, BLAS, LU, QR, eig, SVD, ...
    - Need to explain the model of computation
  - Some whole programs (sequences of these operations, no matter how they are interleaved)
  - Dense and sparse matrices (where # flops << n^3)
  - Sequential and parallel algorithms
  - Some graph-theoretic algorithms
9. Proof of Communication Lower Bound on C = A·B (1/5)
- Proof from Irony/Toledo/Tiskin (2004)
  - Original proof, then generalization
- Think of the instruction stream being executed
  - Looks like "add, load, multiply, store, load, add, ..."
  - Each load/store moves a word between fast and slow memory, or between local memory and remote memory (another processor)
- We want to count the number of loads and stores, given that we are multiplying n-by-n matrices C = A·B using the usual 2n^3 flops, possibly reordered assuming addition is commutative/associative
- Assume at most M words can be stored in fast memory
- Outline:
  - Break the instruction stream into segments, each with M loads and stores
  - Somehow bound the maximum number of flops that can be done in each segment; call it F
  - So F · # segments ≥ T = total flops = 2n^3, so # segments ≥ T / F
  - So # loads + # stores = M · # segments ≥ M · T / F
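For reference, the counting step of the outline compresses to one displayed inequality (LaTeX rendering of the slide's argument):

    % Segment argument: each segment has M loads/stores and at most F flops, so
    F \cdot \#\text{segments} \;\ge\; T = 2n^3
    \quad\Longrightarrow\quad
    \#\text{loads} + \#\text{stores} \;=\; M \cdot \#\text{segments}
    \;\ge\; M \cdot \frac{T}{F}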
10. Illustrating Segments, for M = 3
[figure: instruction stream divided into segments of M = 3 loads/stores each]
11. Proof of Communication Lower Bound on C = A·B (1/5)  (slide 9 repeated)
12. Proof of Communication Lower Bound on C = A·B (2/5)
- Given a segment of the instruction stream with M loads & stores, how many adds & multiplies (F) can we do?
  - At most 2M entries of C, 2M entries of A, and/or 2M entries of B can be accessed
- Use geometry:
  - Represent the n^3 multiplications by an n × n × n cube
  - One n × n face represents A
    - each 1 × 1 subsquare represents one A(i,k)
  - One n × n face represents B
    - each 1 × 1 subsquare represents one B(k,j)
  - One n × n face represents C
    - each 1 × 1 subsquare represents one C(i,j)
  - Each 1 × 1 × 1 subcube represents one product A(i,k) · B(k,j) contributing to C(i,j)
    - May be added directly to C(i,j), or to a temporary accumulator
13. Proof of Communication Lower Bound on C = A·B (3/5)
[figure: n × n × n cube with "A face", "B face", "C face" and index k labeled]
- If we have at most 2M "A squares", 2M "B squares", and 2M "C squares" on the faces, how many cubes can we have?
14. Proof of Communication Lower Bound on C = A·B (4/5)
[figure: a 3D set with its three shadows on the coordinate planes, and a black box with side lengths x, y, z]
- (i,k) is in the "A shadow" if (i,j,k) is in the 3D set
- (j,k) is in the "B shadow" if (i,j,k) is in the 3D set
- (i,j) is in the "C shadow" if (i,j,k) is in the 3D set
- Thm (Loomis & Whitney, 1949):
  # cubes in 3D set = Volume of 3D set ≤ (area(A shadow) × area(B shadow) × area(C shadow))^(1/2)
- Example: # cubes in a black box with side lengths x, y and z
  = Volume of black box = x·y·z = (x·z · z·y · y·x)^(1/2) = (#A's · #B's · #C's)^(1/2)
15. Proof of Communication Lower Bound on C = A·B (5/5)
- Consider one segment of instructions with M loads & stores
- At most 2M entries each of A, B, C are available in one segment
- Volume of the set of cubes representing possible multiply/adds in one segment is ≤ (2M · 2M · 2M)^(1/2) = (2M)^(3/2) ≡ F
- # segments ≥ ⌊2n^3 / F⌋
- # loads + # stores ≥ M · ⌊2n^3 / F⌋ ≥ n^3 / (2M)^(1/2) − M = Ω(n^3 / M^(1/2))
- Parallel case: apply the same reasoning to one processor out of P
  - # adds and muls ≥ 2n^3 / P (at least one processor does this much)
  - M = n^2 / P (each processor gets an equal fraction of the matrices)
  - # loads + # stores = # words moved from or to other processors
    ≥ M · (2n^3/P) / F = M · (2n^3/P) / (2M)^(3/2) = n^2 / (2P)^(1/2)
16. Proof of Loomis-Whitney inequality
- T = 3D set of 1×1×1 cubes on a lattice
- N = #T = # cubes
- T_x = projection of T onto the x=0 plane
- N_x = #T_x = # squares in T_x; same for T_y, N_y, etc.
- Goal: N ≤ (N_x · N_y · N_z)^(1/2)
- T(x=i) = subset of T with x=i
- T(x=i | y) = projection of T(x=i) onto the y=0 plane
- N(x=i) = #T(x=i), etc.
- N = Σ_i N(x=i) = Σ_i (N(x=i))^(1/2) · (N(x=i))^(1/2)
    ≤ Σ_i (N_x)^(1/2) · (N(x=i))^(1/2)                                  [since N(x=i) ≤ N_x]
    ≤ (N_x)^(1/2) · Σ_i (N(x=i | y) · N(x=i | z))^(1/2)
    = (N_x)^(1/2) · Σ_i (N(x=i | y))^(1/2) · (N(x=i | z))^(1/2)
    ≤ (N_x)^(1/2) · (Σ_i N(x=i | y))^(1/2) · (Σ_i N(x=i | z))^(1/2)     [Cauchy-Schwarz]
    ≤ (N_x)^(1/2) · (N_y)^(1/2) · (N_z)^(1/2)
17. Homework
- Prove the more general statement of Loomis-Whitney:
  - Suppose T is d-dimensional
  - N = #T = # d-dimensional cubes in T
  - T(i) is the projection of T onto the hyperplane x(i) = 0
  - N(i) = (d−1)-dimensional volume of T(i)
  - Show: N ≤ Π_{i=1 to d} (N(i))^(1/(d−1))
18. How to generalize this lower bound (1/4)
- It doesn't depend on C(i,j) being a matrix entry, just a unique memory location (same for A(i,k) and B(k,j))
- It doesn't depend on C and A not overlapping (or C and B, or A and B)
  - Some reorderings may change the answer; we still get a lower bound for all reorderings
- It doesn't depend on doing n^3 multiply/adds
- It doesn't depend on C(i,j) being just a sum of products C(i,j) = Σ_k A(i,k)·B(k,j); more generally:

  For all (i,j) ∈ S:
  Mem(c(i,j)) = f_ij ( g_ijk ( Mem(a(i,k)), Mem(b(k,j)) ) for k ∈ S_ij, some other arguments )    (P)
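Since later slides keep referring back to model (P), here is the same definition set in LaTeX for reference:

    % Model (P): S indexes the outputs; S_{ij} indexes the operations g_{ijk}
    % feeding output (i,j) through the reduction f_{ij}.
    \forall (i,j) \in S:\quad
    \mathrm{Mem}(c(i,j)) \;=\; f_{ij}\Bigl(
       g_{ijk}\bigl(\mathrm{Mem}(a(i,k)),\,\mathrm{Mem}(b(k,j))\bigr)
       \ \text{for}\ k \in S_{ij},\ \text{other arguments}\Bigr)
    \tag{P}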
19. How to generalize this lower bound (2/4)
- It does assume the presence of an operand generating a load and/or store; how could this not happen?
  - Mem(b(k,j)) could be reused in many more g_ijk than (P) allows
    - Ex: Compute C^(m) = A · (B.^m) (Matlab notation) for m = 1 to t
    - Can move many fewer words than Ω(# flops / M^(1/2)); see the sketch below
  - We might generate a result during a segment, use it, and discard it, without generating any memory traffic
    - Turns out QR, eig, SVD all may do this
    - Need a different analysis for them (later)
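A minimal sketch of the reuse counterexample (it illustrates the operation only; it does not instrument memory traffic):

    import numpy as np

    # Each resident pair A(i,k), B(k,j) supports t multiply-adds (one per
    # power m), so words moved can grow much more slowly than the 2tn^3
    # flop count -- breaking the Omega(flops / M^(1/2)) bound.
    def powers_matmul(A, B, t):
        return [A @ (B ** m) for m in range(1, t + 1)]  # C^(m) = A * (B.^m)

    A, B = np.random.rand(4, 4), np.random.rand(4, 4)
    Cs = powers_matmul(A, B, t=3)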
20. How to generalize this lower bound (3/4)
- Need to distinguish Sources and Destinations of each operand in fast memory during a segment
- Possible Sources:
  - S1: Already in fast memory at the start of the segment, or read during it: at most 2M
  - S2: Created during the segment: no bound without more information
- Possible Destinations:
  - D1: Left in fast memory at the end of the segment, or written during it: at most 2M
  - D2: Discarded: no bound without more information
- Need to assume no S2/D2 arguments; at most 4M of the others
21. How to generalize this lower bound (4/4)
- Theorem: To evaluate (P) with memory of size M, where
  - f_ij and g_ijk are nontrivial functions of their arguments,
  - G is the total number of g_ijk's, and
  - there are no S2/D2 arguments,
  requires at least G / (8M)^(1/2) − M slow memory references
  - More simply: # words_moved = Ω(# flops / M^(1/2))
- Corollary: To evaluate (P) requires at least G / (8^(1/2) · M^(3/2)) − 1 messages
  (more simply: Ω(# flops / M^(3/2)))
  - Proof: the maximum message size is M
22. Some Corollaries of this lower bound (1/4): # words_moved = Ω(# flops / M^(1/2))
- Theorem applies to dense or sparse, parallel or sequential:
- MatMul, including A^T·A or A^2
- Triangular solve C = A^(-1)·X
  - C(i,j) = (X(i,j) − Σ_{k<i} A(i,k)·C(k,j)) / A(i,i),  A lower triangular
  - C plays the double role of b and c in model (P)
- LU factorization (any pivoting, LU or ILU)
  - L(i,j) = (A(i,j) − Σ_{k<j} L(i,k)·U(k,j)) / U(j,j),  U(i,j) = A(i,j) − Σ_{k<i} L(i,k)·U(k,j)
  - L (and U) play the double role of c and a (c and b) in model (P)
- Cholesky (any diagonal pivoting, C or IC)
  - L(i,j) = (A(i,j) − Σ_{k<j} L(i,k)·L^T(k,j)) / L(j,j),  L(j,j) = (A(j,j) − Σ_{k<j} L(j,k)·L^T(k,j))^(1/2)
  - L (and L^T) play the triple role of a, b and c
23. Some Corollaries of this lower bound (2/4): # words_moved = Ω(# flops / M^(1/2))
- Applies to simplified operations:
  - Ex 1: Compute ‖A·B‖_F where A(i,k) = f(i,k) and B(k,j) = g(k,j), so there are no inputs and 1 output; assume each f(i,k) and g(k,j) is evaluated once
  - Ex 2: Compute the determinant of A(i,j) = f(i,j) using LU
- Apply the lower bound by imposing reads and writes:
  - Ex 1: Every time a final value of (A·B)(i,j) is computed, write it; every time f(i,k) or g(k,j) is evaluated, insert a read
  - Ex 2: Every time a final value of L(i,j) or U(i,j) is computed, write it; every time f(i,j) is evaluated, insert a read
- Still get Ω(# flops / M^(1/2) − 3n^2) words_moved, by subtracting the imposed reads/writes (sparse case analogous)
24. Some Corollaries of this lower bound (3/4): # words_moved = Ω(# flops / M^(1/2))
25. Some Corollaries of this lower bound (4/4): # words_moved = Ω(# flops / M^(1/2))
- Applies to some graph algorithms
- Ex: All-Pairs-Shortest-Paths by Floyd-Warshall (runnable version below)
  - Matches (P) with g_ijk = "+" and f_ij = "min"
  - Get # words_moved = Ω(n^3 / M^(1/2)) for dense graphs
  - Homework: state the result for sparse graphs; how does n^3 change?

  Initialize all path(i,j) = length of edge from node i to j (or ∞ if none)
  for k = 1 to n
    for i = 1 to n
      for j = 1 to n
        path(i,j) = min( path(i,j), path(i,k) + path(k,j) )
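A runnable transcription of the pseudocode above (a minimal sketch; a real APSP code would also track predecessors):

    import math

    # Floyd-Warshall: matches model (P) with g_ijk = "+" and f_ij = "min".
    def floyd_warshall(path):
        n = len(path)
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    path[i][j] = min(path[i][j], path[i][k] + path[k][j])
        return path

    INF = math.inf
    dist = [[0, 3, INF],
            [3, 0, 1],
            [INF, 1, 0]]
    print(floyd_warshall(dist))   # dist[0][2] becomes 4, via node 1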
26. 3D parallel matrix multiplication: C = C + A·B
- p procs arranged in a p^(1/3) × p^(1/3) × p^(1/3) grid, with tree networks along "fibers"
- Each block A(i,k), B(k,j), C(i,j) is n/p^(1/3) × n/p^(1/3)
- Initially:
  - Proc (i,j,0) owns C(i,j)
  - Proc (i,0,k) owns A(i,k)
  - Proc (0,j,k) owns B(k,j)
- Algorithm:
  - For all (i,k), broadcast A(i,k) to procs (i,j,k) ∀ j
    - comm. cost = log(p^(1/3)) · α + log(p^(1/3)) · (n^2 / p^(2/3)) · β
  - For all (j,k), broadcast B(k,j) to procs (i,j,k) ∀ i; same comm. cost
  - For all (i,j,k): Tmp(i,j,k) = A(i,k) · B(k,j); cost = 2(n/p^(1/3))^3 = 2n^3/p flops
  - For all (i,j), reduce C(i,j) = Σ_k Tmp(i,j,k); same comm. cost
- Total comm. cost = O(log(p) · (n^2 / p^(2/3)) · β + log(p) · α)
- Lower bound = Ω((n^3/p) / (n^2/p^(2/3))^(1/2) · β + (n^3/p) / (n^2/p^(2/3))^(3/2) · α) = Ω((n^2/p^(2/3)) · β + α)
- Lower bound also attainable for all n^2/p^(2/3) ≥ M = n^2/p^x ≥ n^2/p  [Toledo et al]
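Plugging the slide's cost formulas into code gives a quick comparison; the 2D (Cannon-style) costs used here (≈√p messages, ≈n^2/√p words) and all machine parameters are assumptions of this sketch:

    from math import log2, sqrt

    def cost_2d(n, p, alpha, beta):
        # Cannon-style 2D algorithm: ~sqrt(p) messages, ~n^2/sqrt(p) words (assumed)
        return alpha * sqrt(p) + beta * n**2 / sqrt(p)

    def cost_3d(n, p, alpha, beta):
        # Total comm. cost from this slide: O(log(p)*(n^2/p^(2/3))*beta + log(p)*alpha)
        return log2(p) * (alpha + beta * n**2 / p**(2 / 3))

    n, p = 1 << 14, 1 << 9              # problem size and processor count (assumed)
    alpha, beta = 1e-6, 1e-9            # latency (s), reciprocal bandwidth (s/word)
    print(f"2D: {cost_2d(n, p, alpha, beta):.3g} s,  3D: {cost_3d(n, p, alpha, beta):.3g} s")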
27. So when doesn't Model (P) apply?  # words_moved = Ω(# flops / M^(1/2))
- Ex: for r = 1 to t, C^(r) = A · (B.^(1/r))  (Matlab notation)
- (P) does not apply:
  - With A(i,k) and B(k,j) in memory, we can do t operations, not just 1 g_ijk
  - Can't apply (P) with arguments b(k,j) indexed in a one-to-one fashion
- Can still analyze using segments and Loomis-Whitney:
  - # flops per segment ≤ 8(t·M^3)^(1/2)
  - # segments ≥ (t·n^3) / (8(t·M^3)^(1/2))
  - # words_moved = Ω(t^(1/2)·n^3 / M^(1/2)), not Ω(t·n^3 / M^(1/2))
- Attainable, using a variation of the usual blocked matmul algorithm
  - Homework! (Hint: need to recompute B(k,j)^(1/r) when needed)
28. Lower bounds for Orthogonal Transformations (1/4)
- Needed for QR, eig, SVD, ...
- Analyze blocked Householder transformations:
  - Π_{j=1 to b} (I − 2·u_j·u_j^T) = I − U·T·U^T, where U = [u_1, ..., u_b]
  - Treat Givens rotations as 2×2 Householder transformations
- Model details and assumptions:
  - Write (I − U·T·U^T)·A = A − U·(T·U^T·A) = A − U·Z, where Z = T·(U^T·A)
  - Only count operations in all the A − U·Z updates
    - Generically a large fraction of the work
  - Assume "Forward Progress": each successive Householder transformation leaves previously created zeros zero. OK for:
    - QR decomposition
    - Reductions to condensed forms (Hessenberg, tri/bidiagonal)
      - Possible exception: bulge chasing in the banded case
    - One sweep of (block) Hessenberg QR iteration
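A small numerical check of the identity Π_j (I − 2·u_j·u_j^T) = I − U·T·U^T, using a LAPACK-style compact-WY recurrence to build T (the recurrence used here is an assumption of this sketch, not given on the slide):

    import numpy as np

    n, b = 8, 3
    rng = np.random.default_rng(0)
    U = rng.standard_normal((n, b))
    U /= np.linalg.norm(U, axis=0)       # each reflector is I - 2 u u^T, ||u|| = 1

    # Product of reflectors, applied in order j = 1..b
    Q = np.eye(n)
    for j in range(b):
        u = U[:, j:j+1]
        Q = Q @ (np.eye(n) - 2.0 * u @ u.T)

    # Compact WY: upper-triangular T with Q = I - U T U^T (dlarft-style recurrence)
    T = np.zeros((b, b))
    for j in range(b):
        tau = 2.0                        # tau = 2 for unit Householder vectors
        T[:j, j] = -tau * T[:j, :j] @ (U[:, :j].T @ U[:, j])
        T[j, j] = tau

    assert np.allclose(Q, np.eye(n) - U @ T @ U.T)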
29. Lower bounds for Orthogonal Transformations (2/4)
- Perform many updates A − U·Z, where Z = T·(U^T·A)
- First challenge to applying the theorem: need to collect all the A − U·Z updates into one big set to which model (P) applies
  - Write them all as A(i,j) = A(i,j) − Σ_k U(i,k)·Z(k,j), where k = index of the k-th transformation
  - k is not necessarily the index of the column of A it comes from
- Second challenge: all the Z(k,j) may be S2/D2
  - Recall: S2/D2 means computed on the fly and discarded
  - Ex: A_{n×2n} = Q_{n×n} · R_{n×2n}, where 2n^2 ≤ M so A fits in cache
    - Represent Q as n(n−1)/2 2×2 Householder (Givens) transforms
    - There are n^2(n−1)/2 = Θ(M^(3/2)) nonzero Z(k,j), not O(M)
  - Still only do O(M^(3/2)) flops during a segment
    - But can't use Loomis-Whitney to prove it!
    - Need a new idea
30. Lower bounds for Orthogonal Transformations (3/4)
- Dealing with Z(k,j) being S2/D2:
  - How to bound # flops in A(i,j) = A(i,j) − Σ_k U(i,k)·Z(k,j)?
  - Neither A nor U is S2/D2
    - A is either turned into R or into U, which are output
    - So at most 2M of each during a segment
  - # flops ≤ (# U(i,k)'s) × (# columns A(·,j) present)
    ≤ (# U(i,k)'s) × (# A(i,j) entries / min # nonzeros per column of A)
    ≤ h × O(M) / r,  where h ≡ # U(i,k)'s = O(M) and r ≡ min # nonzeros per column
  - How small can r be?
    - To store h Householder vector entries U(i,k) in r rows, there can be at most r in the first column, r−1 in the second, etc. (to maintain Forward Progress), so r(r−1)/2 ≥ h, so r ≥ h^(1/2)
  - # flops ≤ h · O(M) / r ≤ O(M) · h^(1/2) = O(M^(3/2)), as desired
31. Lower bounds for Orthogonal Transformations (4/4)
- Theorem: # words_moved by QR decomposition using (blocked) Householder transformations = Ω(# flops / M^(1/2))
- Theorem: # words_moved by reduction to Hessenberg form, tridiagonal form, bidiagonal form, or one sweep of QR iteration (or block versions of any of these) = Ω(# flops / M^(1/2))
  - Assuming Forward Progress (created zeros remain zero)
  - Model: merge left and right orthogonal transformations:
    A(i,j) = A(i,j) − Σ_{kL} U_L(i,kL)·Z_L(kL,j) − Σ_{kR} Z_R(i,kR)·U_R(j,kR)
32. Robust PCA (Candès, 2009)
- Decompose a matrix into a low-rank component plus a sparse component: M = L + S
- Homework: apply the lower bound result
33. Can we attain these lower bounds?
- Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  - Mostly not
- If not, are there other algorithms that do?
  - Yes
- Several goals for algorithms:
  - Minimize bandwidth (Ω(# flops / M^(1/2))) and latency (Ω(# flops / M^(3/2)))
  - Multiple memory hierarchy levels (attain the bound for each level?)
  - Explicit dependence on (multiple) M's, vs. cache oblivious
  - Fewest flops when the matrix fits in the smallest memory
- What about sparse algorithms?
34. Recall: Minimizing latency requires new data structures
- To minimize latency, need to load/store a whole rectangular subblock of the matrix with one message
  - Incompatible with conventional columnwise (rowwise) storage
  - Ex: rows (columns) are not in contiguous memory locations
- Blocked storage: store as a matrix of b×b blocks, each block stored contiguously
  - OK for one level of memory hierarchy; what if there are more?
- Recursive blocked storage: store each block using subblocks
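A minimal sketch of one level of blocked storage in Python/NumPy (row-major rather than columnwise, and assuming b divides n, purely for brevity):

    import numpy as np

    # Reorder a matrix into an array of contiguous b-by-b blocks, so that
    # one block can be moved with one message/transfer.
    def to_blocked(A, b):
        n = A.shape[0]                      # assume b divides n
        return (A.reshape(n // b, b, n // b, b)
                 .transpose(0, 2, 1, 3)     # -> (block row, block col, b, b)
                 .copy())                   # copy() makes each block contiguous

    A = np.arange(16.0).reshape(4, 4)
    blocks = to_blocked(A, 2)
    assert np.allclose(blocks[1, 0], A[2:4, 0:2])   # block (1,0), stored contiguously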
35. Recall: Blocked vs Cache-Oblivious Algorithms
- Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C whose dimensions depend on the cache size:

  Break A_{n×n}, B_{n×n}, C_{n×n} into b×b blocks labeled A(i,j), etc.;
  b chosen so 3 b×b blocks fit in cache
  for i = 1 to n/b,  for j = 1 to n/b,  for k = 1 to n/b
      C(i,j) = C(i,j) + A(i,k)·B(k,j)      ... b×b matmul
  (another level of memory would need 3 more loops)

- Cache-oblivious Matmul C = A·B is independent of the cache:

  Function C = MM(A,B)
      if A and B are 1×1
          C = A · B
      else
          Break A_{n×n}, B_{n×n}, C_{n×n} into (n/2)×(n/2) blocks labeled A(i,j), etc.
          for i = 1 to 2,  for j = 1 to 2,  for k = 1 to 2
              C(i,j) = C(i,j) + MM( A(i,k), B(k,j) )      ... (n/2)×(n/2) matmul
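Runnable NumPy counterparts of the two pseudocodes (assuming n is a power of two and divisible by b; a real implementation would handle ragged edges and stop the recursion at a larger base case):

    import numpy as np

    def blocked_mm(A, B, b):                 # explicit b x b blocking
        n = A.shape[0]
        C = np.zeros((n, n))
        for i in range(0, n, b):
            for j in range(0, n, b):
                for k in range(0, n, b):
                    C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
        return C

    def mm(A, B):                            # cache-oblivious: recurse on quadrants
        n = A.shape[0]
        if n == 1:
            return A * B
        h = n // 2
        C = np.zeros((n, n))
        for i in (0, 1):
            for j in (0, 1):
                for k in (0, 1):
                    C[i*h:(i+1)*h, j*h:(j+1)*h] += mm(A[i*h:(i+1)*h, k*h:(k+1)*h],
                                                      B[k*h:(k+1)*h, j*h:(j+1)*h])
        return C

    A, B = np.random.rand(8, 8), np.random.rand(8, 8)
    assert np.allclose(blocked_mm(A, B, 2), A @ B) and np.allclose(mm(A, B), A @ B)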
36. Summary of dense sequential algorithms attaining communication lower bounds
- Algorithms shown minimizing # messages assume (recursive) block layout
- Many references (see reports); only some shown, plus ours
  - Older references may or may not include an analysis
- Cache-oblivious algorithms are underlined, green entries are ours, "?" is unknown/future work

| Algorithm | 2 Levels: Words Moved | 2 Levels: Messages | Multiple Levels: Words Moved | Multiple Levels: Messages |
| BLAS-3 | Usual blocked or recursive algorithms | Usual blocked or recursive algorithms | Usual blocked (nested) or recursive algorithms [Gustavson, 97] | Usual blocked (nested) or recursive algorithms [Gustavson, 97] |
| Cholesky | LAPACK (with b = M^(1/2)); Gustavson, 97; BDHS09 | Gustavson, 97; Ahmed & Pingali, 00; BDHS09 | (same) | (same) |
| LU with pivoting | LAPACK (rarely); Toledo, 97; GDX 08 | GDX 08 | Toledo, 97; GDX 08? | GDX 08? |
| QR | LAPACK (rarely); Elmroth & Gustavson, 98; DGHL08 | Frens & Wise, 03; DGHL08 | Elmroth & Gustavson, 98; DGHL08? | Frens & Wise, 03; DGHL08? |
| Eig, SVD | Not LAPACK; BDD10 | Not LAPACK; BDD10 | BDD10 | BDD10 |
37. Summary of dense 2D parallel algorithms attaining communication lower bounds
- Assume n×n matrices on P processors, memory per processor = O(n^2 / P)
- Many references (see reports); green entries are ours
- ScaLAPACK assumes the best block size b is chosen
- Recall the lower bounds:
  - # words_moved = Ω(n^2 / P^(1/2))  and  # messages = Ω(P^(1/2))

| Algorithm | Reference | Factor exceeding lower bound for words_moved | Factor exceeding lower bound for messages |
| Matrix multiply | Cannon, 69 | 1 | 1 |
| Cholesky | ScaLAPACK | log P | log P |
| LU | GDX08 | log P | log P |
| LU | ScaLAPACK | log P | (N / P^(1/2)) · log P |
| QR | DGHL08 | log P | log^3 P |
| QR | ScaLAPACK | log P | (N / P^(1/2)) · log P |
| Sym Eig, SVD | BDD10 | log P | log^3 P |
| Sym Eig, SVD | ScaLAPACK | log P | N / P^(1/2) |
| Nonsym Eig | BDD10 | log P | log^3 P |
| Nonsym Eig | ScaLAPACK | P^(1/2) · log P | N · log P |
38. Recent Communication-Optimal Algorithms
- QR with column pivoting
- Cholesky with diagonal pivoting
- LU with complete pivoting
- LDL^T with complete pivoting
- Sparse Cholesky
  - For matrices with "good separators"
  - For most sparse matrices (as hard as the dense case)
39. Homework: Extend the lower bound
- To algorithms using (block) Givens rotations instead of (block) Householder transformations
- To QR done with Gram-Schmidt orthogonalization (CGS and MGS)
- To heterogeneous collections of processors
  - Suppose processor k
    - does φ(k) flops per second,
    - has reciprocal bandwidth β(k),
    - has latency α(k)
  - What is a lower bound on the solution time, for any fractions of the work assigned to each processor (i.e. not all equal)?
- To a homogeneous parallel shared memory machine
  - All data initially resides in a large shared memory
  - Processors communicate via data written to/read from slow memory
  - Each processor has a local fast memory of size M
40. Extra slides