Minimizing Communication in Linear Algebra - PowerPoint PPT Presentation

1 / 87
About This Presentation
Title:

Minimizing Communication in Linear Algebra

Description:

Based on s by Jim Demmel and others – PowerPoint PPT presentation

Number of Views:121
Avg rating:3.0/5.0
Slides: 88
Provided by: Kathy343
Category:

less

Transcript and Presenter's Notes

Title: Minimizing Communication in Linear Algebra


1
Minimizing Communication in Linear Algebra
  • Grey Ballard
  • 29 Oct 2009

2
Outline
  • What is communication and why is it important
    to avoid?
  • Goal Algorithms that communicate as little as
    possible for
  • BLAS, Axb, QR, eig, SVD, whole programs,
  • Dense or sparse matrices
  • Sequential or parallel computers
  • Direct methods (BLAS, LU, QR, SVD, other
    decompositions)
  • Communication lower bounds for all these problems
  • Algorithms that attain them (all dense linear
    algebra, some sparse)
  • Mostly not in LAPACK or ScaLAPACK (yet)
  • Iterative methods Krylov subspace methods for
    Axb, Ax?x
  • Communication lower bounds, and algorithms that
    attain them (depending on sparsity structure)
  • Not in any libraries (yet)
  • Just highlights
  • Preliminary speed-up results
  • Up to 8x measured (or 8x), 81x modeled

3
Collaborators
  • Jim Demmel, UCB EECS
  • Ioana Dumitriu, U. Washington
  • Laura Grigori, INRIA
  • Ming Gu, UCB Math
  • Mark Hoemmen, UCB EECS
  • Olga Holtz, UCB Math TU Berlin
  • Julien Langou, U. Colorado Denver
  • Marghoob Mohiyuddin, UCB EECS
  • Oded Schwartz , TU Berlin
  • Hua Xiang, INRIA
  • Kathy Yelick, UCB EECS NERSC
  • BeBOP group, bebop.cs.berkeley.edu

4
Motivation (1/2)
  • Algorithms have two costs
  • Arithmetic (FLOPS)
  • Communication moving data between
  • levels of a memory hierarchy (sequential case)
  • processors over a network (parallel case).

5
Motivation (2/2)
  • Running time of an algorithm is sum of 3 terms
  • flops time_per_flop
  • words moved / bandwidth
  • messages latency
  • Time_per_flop ltlt 1/ bandwidth ltlt latency
  • Gaps growing exponentially with time

Annual improvements Annual improvements Annual improvements Annual improvements
Time_per_flop Bandwidth Latency
Network 26 15
DRAM 23 5
59
6
Direct linear algebra Prior Work on Matmul
  • Assume n3 algorithm (i.e. not Strassen-like)
  • Sequential case, with fast memory of size M
  • Lower bound on words moved to/from slow memory
    ? (n3 / M1/2 ) Hong, Kung, 81
  • Attained using blocked algorithms
  • Parallel case on P processors
  • Let NNZ be total memory needed assume load
    balanced
  • Lower bound on words communicated
    ? (n3 /(P NNZ )1/2 )
    Irony, Tiskin, Toledo, 04

NNZ (name of alg) Lower bound on words Attained by
3n2 (2D alg) ? (n2 / P1/2 ) Cannon, 69
3n2 P1/3 (3D alg) ? (n2 / P2/3 ) Johnson,93
7
Lower bound for all direct linear algebra
  • Let M fast memory size per processor
  • words_moved by at least one processor
  • (flops / M1/2 )
  • messages_sent by at least one processor
  • (flops / M3/2 )
  • Holds for
  • BLAS, LU, QR, eig, SVD,
  • Some whole programs (sequences of these
    operations, no matter how they are interleaved,
    eg computing Ak)
  • Dense and sparse matrices (where flops ltlt n3 )
  • Sequential and parallel algorithms
  • Some graph-theoretic algorithms (eg
    Floyd-Warshall)
  • See BDHS09

8
Can we attain these lower bounds?
  • Do conventional dense algorithms as implemented
    in LAPACK and ScaLAPACK attain these bounds?
  • Mostly not
  • If not, are there other algorithms that do?
  • Yes
  • Goals for algorithms
  • Minimize words ? (flops/ M1/2 )
  • Minimize messages ? (flops/ M3/2 )
  • Minimize for multiple memory hierarchy levels
  • Recursive, Cache-oblivious algorithms would be
    simplest
  • Fewest flops when matrix fits in fastest memory
  • Cache-oblivious algorithms dont always attain
    this
  • Only a few sparse algorithms so far

9
Summary of dense sequential algorithms attaining
communication lower bounds
  • Algorithms shown minimizing Messages use
    (recursive) block layout
  • Not possible with columnwise or rowwise layouts
  • Many references (see reports), only some shown,
    plus ours
  • Cache-oblivious are underlined, Green are ours,
    ? is unknown/future work

Algorithm 2 Levels of Memory 2 Levels of Memory Multiple Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
BLAS-3 Usual blocked or recursive algorithms Usual blocked or recursive algorithms Usual blocked algorithms (nested), or recursive Gustavson,97 Usual blocked algorithms (nested), or recursive Gustavson,97
Cholesky LAPACK (with b M1/2) Gustavson 97 BDHS09 Gustavson,97 Ahmed,Pingali,00 BDHS09 (?same) (?same)
LU with pivoting LAPACK (rarely) Toledo,97 , GDX 08 GDX 08 not partial pivoting Toledo, 97 GDX 08? GDX 08?
QR LAPACK (rarely) Elmroth,Gustavson,98 DGHL08 Frens,Wise,03 but 3x flops DGHL08 Elmroth, Gustavson,98 DGHL08 ? Frens,Wise,03 DGHL08 ?
Eig, SVD Not LAPACK BDD09 randomized, but more flops Not LAPACK BDD09 randomized, but more flops BDD09 BDD09
10
Summary of dense 2D parallel algorithms
attaining communication lower bounds
  • Assume nxn matrices on P processors, memory
    per processor O(n2 / P)
  • ScaLAPACK assumes best block size b chosen
  • Many references (see reports), Green are ours
  • Recall lower bounds
  • words_moved ?( n2 / P1/2 ) and
    messages ?( P1/2 )

Algorithm Reference Factor exceeding lower bound for words_moved Factor exceeding lower bound for messages
Matrix multiply Cannon, 69 1 1
Cholesky ScaLAPACK log P log P
LU GDX08 ScaLAPACK log P log P log P ( N / P1/2 ) log P
QR DGHL08 ScaLAPACK log P log P log3 P ( N / P1/2 ) log P
Sym Eig, SVD BDD09 ScaLAPACK log P log P log3 P N / P1/2
Nonsym Eig BDD09 ScaLAPACK log P P1/2 log P log3 P N log P
11
QR of a Tall, Skinny matrix is bottleneckUse
TSQR instead
Q is represented implicitly as a product (tree of
factors) DGHL08
12
Minimizing Communication in TSQR
Multicore / Multisocket / Multirack / Multisite /
Out-of-core ?
Can Choose reduction tree dynamically
13
Performance of TSQR vs Sca/LAPACK
  • Parallel
  • Intel Clovertown
  • Up to 8x speedup (8 core, dual socket, 10M x 10)
  • Pentium III cluster, Dolphin Interconnect, MPICH
  • Up to 6.7x speedup (16 procs, 100K x 200)
  • BlueGene/L
  • Up to 4x speedup (32 procs, 1M x 50)
  • Both use Elmroth-Gustavson locally enabled by
    TSQR
  • Sequential
  • OOC on PowerPC laptop
  • As little as 2x slowdown vs (predicted) infinite
    DRAM
  • LAPACK with virtual memory never finished
  • See DGHL08
  • General CAQR Communication-Avoiding QR in
    progress

14
Modeled Speedups of CAQR vs ScaLAPACK
Petascale up to 22.9x IBM Power 5
up to 9.7x Grid up to 11x
Petascale machine with 8192 procs, each at 500
GFlops/s, a bandwidth of 4 GB/s.
15
Direct Linear Algebra summary and future work
  • Communication lower bounds on words_moved and
    messages
  • BLAS, LU, Cholesky, QR, eig, SVD,
  • Some whole programs (compositions of these
    operations, no matter how
    they are interleaved, eg computing Ak)
  • Dense and sparse matrices (where flops ltlt n3 )
  • Sequential and parallel algorithms
  • Some graph-theoretic algorithms (eg
    Floyd-Warshall)
  • Algorithms that attain these lower bounds
  • Nearly all dense problems (some open problems)
  • A few sparse problems
  • Speed ups in theory and practice
  • Future work
  • Lots to implement, autotune
  • Next generation of Sca/LAPACK on heterogeneous
    architectures (MAGMA)
  • Few algorithms in sparse case
  • Strassen-like algorithms
  • 3D Algorithms (only for Matmul so far), may be
    important for scaling

16
Avoiding Communication in Iterative Linear Algebra
  • k-steps of typical iterative solver for sparse
    Axb or Ax?x
  • Does k SpMVs with starting vector
  • Finds best solution among all linear
    combinations of these k1 vectors
  • Many such Krylov Subspace Methods
  • Conjugate Gradients, GMRES, Lanczos, Arnoldi,
  • Goal minimize communication in Krylov Subspace
    Methods
  • Assume matrix well-partitioned, with modest
    surface-to-volume ratio
  • Parallel implementation
  • Conventional O(k log p) messages, because k
    calls to SpMV
  • New O(log p) messages - optimal
  • Serial implementation
  • Conventional O(k) moves of data from slow to
    fast memory
  • New O(1) moves of data optimal
  • Lots of speed up possible (modeled and measured)
  • Price some redundant computation
  • Much prior work
  • See bebop.cs.berkeley.edu
  • CG van Rosendale, 83, Chronopoulos and Gear,
    89
  • GMRES Walker, 88, Joubert and Carey, 92,
    Bai et al., 94

17
Minimizing Communication of GMRES to solve Axb
  • GMRES find x in spanb,Ab,,Akb minimizing
    Ax-b 2
  • Cost of k steps of standard GMRES vs new GMRES

Standard GMRES for i1 to k w A
v(i-1) MGS(w, v(0),,v(i-1)) update
v(i), H endfor solve LSQ problem with
H Sequential words_moved O(knnz)
from SpMV O(k2n) from MGS Parallel
messages O(k) from SpMV
O(k2 log p) from MGS
Communication-avoiding GMRES W v, Av, A2v,
, Akv Q,R TSQR(W) Tall Skinny
QR Build H from R, solve LSQ
problem Sequential words_moved
O(nnz) from SpMV O(kn) from
TSQR Parallel messages O(1) from
computing W O(log p) from TSQR
  • Oops W from power method, precision lost!

18
Monomial basis Ax,,Akx fails to converge
A different polynomial basis does converge
19
Speed ups of GMRES on 8-core Intel Clovertown
MHDY09
20
Communication Avoiding Iterative Linear Algebra
Future Work
  • Lots of algorithms to implement, autotune
  • Make widely available via OSKI, Trilinos, PETSc,
    Python,
  • Extensions to other Krylov subspace methods
  • So far just Lanczos/CG and Arnoldi/GMRES
  • BiCGStab, CGS, QMR,
  • Add preconditioning
  • Block diagonal are easiest
  • Hierarchical, semiseparable,
  • Works if interactions between distant points can
    be compressed

21
Summary
  • Dont Communic

22
1/3
23
2/3
24
3/3
25
  • Extra Slides

26
Whats new since last time
  • Last spoke in this seminar Feb 2009
  • Direct case
  • Lower bound for QR decomp proven
  • Extensions to eigenproblems, SVD, compositions,
  • Invented or identified algorithms that attain
    lower bounds
  • Iterative case
  • Not presented last time
  • Mark Hoemmen has spoken about it
  • More good measured speedups (for GMRES TSQR)

27
Direct linear algebra Generalizing Prior Work
  • Sequential case words moved
  • ? (n3 / M1/2 ) ? (flops / (fast_memory_size)1/
    2 )
  • Parallel case words moved
  • ? (n3 /(P NNZ )1/2 ) ? ((n3 /P) / (NNZ/P)1/2
    )
  • ? (flops_per_proc / (memory_per_processor)1/2
    )
  • (true for at least one processor, assuming
    balance in either flops or in memory)
  • In both cases, we let M memory size, and write

words_moved by at least one processor ?
(flops / M1/2 )
28
Proof of Communication Lower Bound on C AB
(1/5)
  • Proof from Irony/Toledo/Tiskin (2004)
  • Original proof, then generalization
  • Think of instruction stream being executed
  • Looks like add, load, multiply, store,
    load, add,
  • Each load/store moves a word between fast and
    slow memory
  • We want to count the number of loads and stores,
    given that we are multiplying n-by-n matrices C
    AB using the usual 2n3 flops, possibly reordered
    assuming addition is commutative/associative
  • Assuming that at most M words can be stored in
    fast memory
  • Outline
  • Break instruction stream into segments, each with
    M loads and stores
  • Somehow bound the maximum number of flops that
    can be done in each segment, call it F
  • So F segments ? T total flops 2n3 ,
    so segments ? T / F
  • So loads stores M segments ? M T /
    F

29
Illustrating Segments, for M3
...
30
Proof of Communication Lower Bound on C AB
(1/5)
  • Proof from Irony/Toledo/Tiskin (2004)
  • Original proof, then generalization
  • Think of instruction stream being executed
  • Looks like add, load, multiply, store,
    load, add,
  • Each load/store moves a word between fast and
    slow memory
  • We want to count the number of loads and stores,
    given that we are multiplying n-by-n matrices C
    AB using the usual 2n3 flops, possibly reordered
    assuming addition is commutative/associative
  • Assuming that at most M words can be stored in
    fast memory
  • Outline
  • Break instruction stream into segments, each with
    M loads and stores
  • Somehow bound the maximum number of flops that
    can be done in each segment, call it F
  • So F segments ? T total flops 2n3 ,
    so segments ? T / F
  • So loads stores M segments ? M T /
    F

31
Proof of Communication Lower Bound on C AB
(2/5)
  • Given segment of instruction stream with M loads
    stores, how many adds multiplies (F) can we
    do?
  • At most 2M entries of C, 2M entries of A and/or
    2M entries of B can be accessed
  • Use geometry
  • Represent n3 multiplications by n x n x n cube
  • One n x n face represents A
  • each 1 x 1 subsquare represents one A(i,k)
  • One n x n face represents B
  • each 1 x 1 subsquare represents one B(k,j)
  • One n x n face represents C
  • each 1 x 1 subsquare represents one C(i,j)
  • Each 1 x 1 x 1 subcube represents one C(i,j)
    A(i,k) B(k,j)
  • May be added directly to C(i,j), or to temporary
    accumulator

32
Proof of Communication Lower Bound on C AB
(3/5)
k
C face
C(2,3)
C(1,1)
B(3,1)
A(1,3)
j
B(2,1)
A(1,2)
B(1,3)
B(1,1)
A(2,1)
A(1,1)
B face
i
  • If we have at most 2M A squares, 2M B
    squares, and 2M C squares on faces, how many
    cubes can we have?

A face
33
Proof of Communication Lower Bound on C AB
(4/5)
x
y
z
y
z
x
(i,k) is in A shadow if (i,j,k) in 3D set
(j,k) is in B shadow if (i,j,k) in 3D set
(i,j) is in C shadow if (i,j,k) in 3D
set Thm (Loomis Whitney, 1949) cubes in
3D set Volume of 3D set (area(A shadow)
area(B shadow) area(C shadow)) 1/2
cubes in black box with side lengths x, y
and z Volume of black box xyz ( xz zy
yx)1/2 (A?s B?s C?s )1/2
34
Proof of Communication Lower Bound on C AB
(5/5)
  • Consider one segment of instructions with M
    loads, stores
  • Can be at most 2M entries of A, B, C available in
    one segment
  • Volume of set of cubes representing possible
    multiply/adds in one segment is (2M 2M
    2M)1/2 (2M) 3/2 F
  • Segments ? 2n3 / F
  • Loads Stores M Segments ? M 2n3 / F
    n3 / (2M)1/2
  • Parallel Case apply reasoning to one processor
    out of P
  • Adds and Muls ? 2n3 / P (at least one proc
    does this )
  • M n2 / P (each processor gets equal fraction of
    matrix)
  • Load Stores words moved from or to
    other procs ? M (2n3 /P) / F M (2n3 /P) /
    (2M) 3/2 n2 / (2P)1/2

35
How to generalize this lower bound (1/4)
  • It doesnt depend on C(i,j) being a matrix entry,
    just a unique memory location (same for A(i,k)
    and B(k,j) )
  • It doesnt depend on C and A not overlapping (or
    C and B, or A and B)
  • Some reorderings may change answer still get a
    lower bound
  • It doesnt depend on doing n3 multiply/adds
  • It doesnt depend on C(i,j) just being a sum of
    products

C(i,j) ?k A(i,k)B(k,j)
For all (i,j) ? S Mem(c(i,j))
fij ( gijk ( Mem(a(i,k)) , Mem(b(k,j)) ) for k ?
Sij , some other arguments)
36
How to generalize this lower bound (2/4)
  • It does assume the presence of an operand
    generating a load and/or store how could this
    not happen?
  • Mem(b(k,j)) could be reused in many more gijk
    than (P) allows
  • Ex Compute C(m) A B.m (Matlab notation)
    for m1 to t
  • Can move t1/2 times fewer words than ? (flops /
    M1/2 )
  • We might generate a result during a segment, use
    it, and discard it, with generating any memory
    traffic
  • Turns out QR, eig, SVD all may do this
  • Need a different analysis for them later

37
How to generalize this lower bound (3/4)
  • Need to distinguish Sources, Destinations of
    each operand in fast memory during a segment
  • Possible Sources
  • S1 Already in fast memory at start of segment,
    or read at most 2M
  • S2 Created during segment no bound without
    more information
  • Possible Destinations
  • D1 Left in fast memory at end of segment, or
    written at most 2M
  • D2 Discarded no bound without more information
  • Need to assume no S2/D2 arguments at most 4M of
    others

38
How to generalize this lower bound (4/4)
  • Theorem To evaluate (P) with memory of size M,
    where
  • fij and gijk are nontrivial functions of their
    arguments
  • G is the total number of gijks,
  • No S2/D2 arguments
  • requires at least G/ (8M)1/2 M slow
    memory references
  • Simpler words_moved ? (flops / M1/2 )
  • Corollary To evaluate (P) requires at least
    G/ (81/2 M3/2 ) 1
    messages (simpler ? (flops / M3/2 ) )
  • Proof maximum message size is M

39
Some Corollaries of this lower bound (1/2)
words_moved ? (flops / M1/2 )
?
  • Theorem applies to dense or sparse, parallel or
    sequential
  • MatMul, including ATA or A2
  • Triangular solve C A-1X
  • C(i,j) (X(i,j) - ?klti A(i,k)C(k,j)) / A(i,i)
    A lower triangular
  • C plays double role of b and c in Model (P)
  • LU factorization (any pivoting, LU or ILU)
  • L(i,j) (A(i,j) - ?kltj L(i,k)U(k,j)) / U(j,j),
    U(i,j) A(i,j) - ?klti L(i,k)U(k,j)
  • L (and U) play double role as c and a (c and b)
    in Model (P)
  • Cholesky (any diagonal pivoting, C or IC)
  • L(i,j) (A(i,j) - ?kltj L(i,k)LT(k,j)) / L(j,j),
    L(j,j) (A(j,j) - ?kltj L(j,k)LT(k,j))1/2
  • L (and LT) plays triple role as a, b and c

40
Some Corollaries of this lower bound (2/2)
words_moved ? (flops / M1/2 )
?
  • Applies to simplified operations
  • Ex Compute determinant of A(i,j) func(i,j)
    using LU, where each func(i,j) called just
    once so no inputs and 1 output
  • Still get ? (flops / M1/2 2n2) words_moved,
    by imposing loads/stores
  • Applies to compositions of operations
  • Ex Compute Ak by repeated squaring, only input
    A, output Ak
  • Still get ? (flops / M1/2 n2 log k ) by
    imposing loads/stores,
  • Holds for any interleaving of operations
  • Applies to some graph algorithms
  • Ex All-Pairs-Shortest-Path by Floyd-Warshall
    gijk and fij min
  • Get ? (F / M1/2) words_moved where F? n4 , n3
    log n , n3

41
Lower bounds for Orthogonal Transformations (1/4)
  • Needed for QR, eig, SVD,
  • Analyze Blocked Householder Transformations
  • ?j1 to b (I 2 uj uTj ) I U T UT where U
    u1 , , ub
  • Treat Givens as 2x2 Householder
  • Model details and assumptions
  • Write (I U T UT )A A U(TUT A) A UZ
    where Z T(UT A)
  • Only count operations in all A UZ operations
  • Generically a large fraction of the work
  • Assume Forward Progress, that each successive
    Householder transformation leaves previously
    created zeros zero Ok for
  • QR decomposition
  • Reductions to condensed forms (Hessenberg,
    tri/bidiagonal)
  • Possible exception bulge chasing in banded case
  • One sweep of (block) Hessenberg QR iteration

42
Lower bounds for Orthogonal Transformations (2/4)
  • Perform many A UZ where Z T(UT A)
  • First challenge to applying theorem need to
    collect all A-UZ into one big set to which model
    (P) applies
  • Write them all as A(i,j) A(i,j) - ?k U(i,k)
    Z(k,j) where k index of
    k-th transformation,
  • k not necessarily index of column of A it comes
    from
  • Second challenge All Z(k,j) may be S2/D2
  • Recall S2/D2 means computed on the fly and
    discarded
  • Ex An x 2n Qn x n Rn x 2n where 2n2 M so A
    fits in cache
  • Represent Q as n(n-1)/2 2x2 Householder (Givens)
    transforms
  • There are n2(n-1)/2 ?(M3/2) nonzero Z(k,j), not
    O(M)
  • Still only do ?(M3/2) flops during segment
  • But cant use Loomis-Whitney to prove it!
  • Need a new idea

43
Lower bounds for Orthogonal Transformations (3/4)
  • Dealing with Z(k,j) being S2/D2
  • How to bound flops in A(i,j) A(i,j) - ?k
    U(i,k) Z(k,j) ?
  • Neither A nor U is S2/D2
  • A either turned into R or U, which are output
  • So at most 2M of each during segment
  • flops ( U(i,k) ) ( columns A(,j) present
    )
  • ( U(i,k) ) ( A(i,j) / min
    nonzeros per column of A)
  • h O(M) / r
    where h O(M)
  • How small can r be?
  • To store h U(i,k) Householder vector entries
    in r rows, there can be at most r in the
    first column, r-1 in the second, etc.,
  • (to maintain Forward Progress) so r(r-1)/2 ?
    h so r ? h1/2
  • flops h O(M) / r O(M) h1/2 O(M3/2) as
    desired

44
Lower bounds for Orthogonal Transformations (4/4)
  • Theorem words_moved by QR decomposition using
    (blocked) Householder transformations ?( flops
    / M1/2 )
  • Theorem words_moved by reduction to Hessenberg
    form, tridiagonal form, bidiagonal form, or one
    sweep of QR iteration (or block versions of any
    of these) ?( flops / M1/2 )
  • Assuming Forward Progress (created zeros remain
    zero)
  • Model Merge left and right orthogonal
    transformations
  • A(i,j) A(i,j) - ?kLUL(i,kL) ZL(kL,j) -
    ?kRZR(i,kR) UR(j,kR)

45
Minimizing latency requires new data structures
  • To minimize latency, need to load/store whole
    rectangular subblock of matrix with one message
  • Incompatible with conventional columnwise
    (rowwise) storage
  • Ex Rows (columns) not in contiguous memory
    locations
  • Blocked storage store as matrix of bxb blocks,
    each block stored contiguously
  • Ok for one level of memory hierarchy, what if
    more?
  • Recursive blocked storage store each block
    using subblocks

46
Blocked vs Cache-Oblivious Algorithms
  • Blocked Matmul C AB explicitly refers to
    subblocks of A, B and C of dimensions that depend
    on cache size

Break Anxn, Bnxn, Cnxn into bxb blocks labeled
A(i,j), etc b chosen so 3 bxb blocks fit in
cache for i 1 to n/b, for j1 to n/b, for
k1 to n/b C(i,j) C(i,j) A(i,k)B(k,j)
b x b matmul another level of
memory would need 3 more loops
  • Cache-oblivious Matmul C AB is independent of
    cache

Function C MM(A,B) If A and B are 1x1 C
A B else Break Anxn, Bnxn, Cnxn into
(n/2)x(n/2) blocks labeled A(i,j), etc for
i 1 to 2, for j 1 to 2, for k 1 to
2 C(i,j) C(i,j) MM( A(i,k), B(k,j)
) n/2 x n/2 matmul
47
Summary of dense sequential Cholesky algorithms
attaining communication lower bounds
  • Algorithms shown minimizing Messages assume
    (recursive) block layout
  • Many references (see reports), only some shown,
    plus ours
  • Oldest reference may or may not include
    analysis
  • Cache-oblivious are underlined, Green are ours,
    ? is unknown/future work
  • b block size, M fast memory size, n
    dimension

Algorithm 2 Levels of Memory 2 Levels of Memory Multiple Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
Cholesky LAPACK (with b M1/2) Gustavson, 97 recursive, assumed good MatMul available BDHS09 BDHS09 like Gustavson,97 Ahmed,Pingali,00 Uses recursive Cholesky, TRSM and Matmul (?same) BDHS09 has analysis (?same) BDHS09 has analysis
48
Summary of dense sequential QR algorithms
attaining communication lower bounds
  • Algorithms shown minimizing Messages assume
    (recursive) block layout
  • Many references (see reports), only some shown,
    plus ours
  • Oldest reference may or may not include
    analysis
  • Cache-oblivious are underlined, Green are ours,
    ? is unknown/future work
  • b block size, M fast memory size, n
    dimension

Algorithm 2 Levels of Memory 2 Levels of Memory Multiple Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
QR Elmroth, Gustavson,98 Recursive, may do more flops LAPACK (only for n?M and bM1/2, or n2?M) DGHL08 Frens,Wise,03 explicit Q only, 3x more flops DGHL08 implicit Q, but represented differently, same flop count Elmroth, Gustavson,98 DGHL08? Frens,Wise,03 DGHL08? Can we get the same flop count?
49
Summary of dense sequential LU algorithms
attaining communication lower bounds
  • Algorithms shown minimizing Messages assume
    (recursive) block layout
  • Many references (see reports), only some shown,
    plus ours
  • Oldest reference may or may not include
    analysis
  • Cache-oblivious are underlined, Green are ours,
    ? is unknown/future work
  • b block size, M fast memory size, n
    dimension

Algorithm 2 Levels of Memory 2 Levels of Memory Multiple Levels of Memory Multiple Levels of Memory
Words Moved and Messages Words Moved and Messages
LU with pivoting Toledo, 97 recursive, assumes good TRSM and Matmul available LAPACK (only for n?M and bM1/2, or n2?M) GDX08 Different than partial pivoting, still stable GDX08 Toledo, 97 GDX08? GDX08?
50
Same idea for LU, almost GDX08
  • First idea Use same reduction tree as in QR,
    with LU at each node
  • Numerically unstable!
  • Need a different pivoting scheme
  • At each node in tree, TSLU selects b pivot rows
    from 2b candidates from its 2 child nodes
  • At each node, do LU on 2b original rows selected
    by child nodes, not U factors from child
    nodes
  • When TSLU done, permute b selected rows to top of
    original matrix, redo b steps of LU without
    pivoting
  • Thm Same results as partial pivoting on
    different input matrix whose entries are the same
    magnitudes as original
  • CALU Communication Avoiding LU for general A
  • Use TSLU for panel factorizations
  • Apply to rest of matrix
  • Cost redundant panel factorizations
  • Benefit
  • Stable in practice, but not same pivot choice as
    GEPP
  • b times fewer messages overall - faster

51
Performance vs ScaLAPACK
  • TSLU (Tall-Skinny)
  • IBM Power 5
  • Up to 4.37x faster (16 procs, 1M x 150)
  • Cray XT4
  • Up to 5.52x faster (8 procs, 1M x 150)
  • CALU (Square)
  • IBM Power 5
  • Up to 2.29x faster (64 procs, 1000 x 1000)
  • Cray XT4
  • Up to 1.81x faster (64 procs, 1000 x 1000)
  • See INRIA Tech Report 6523 (2008), paper at SC08

52
CALU speedup prediction for a Petascale machine -
up to 81x faster
P 8192
Petascale machine with 8192 procs, each at 500
GFlops/s, a bandwidth of 4 GB/s.
53
Eigenvalue problems and SVD
  • Uses spectral divide and conquer
  • Idea goes back to Bulgakov, Godunov, Malyshev
  • Roughly If A2k QR, leading columns of Q
    converge to span invariant subspace for
    eigenvalues of A outside unit circle so QAQ
    becomes block upper triangular
  • Repeated squaring done implicitly
  • Bai, D., Gu, 97
  • Reduced just to Matmul, QR, and Generalized QR
    with pivoting
  • Eliminated need for exp(A), other improvements
  • DDH,07
  • Introduced randomized rank revealing QR
  • Q,Rqr(randn(n,n)), Q,Rqr(AQ)
  • Just need Matmul and QR
  • BDD,09
  • Shows that randomized GQR also rank revealing
  • Shows limits of divide-and-conquer, depending on
    pseudospectrum
  • Cache-oblivious, using Frens,Wise,03

54
Implicit Repeated Squaring
for each j,
55
Sparse Linear Algebra
  • Thm (George 73, Hoffman/Martin/Rose 73,
    Eisenstat/ Schultz/Sherman 76 ) Cholesky on an
    ns x ns sparse matrix whose graph is an
    s-dimensional nearest neighbor grid does ?(
    n3(s-1) ) flops attainable by nested dissection
    ordering
  • Proof most work in dense Cholesky on final
    submatrix of size ?(ns-1)
  • Corollary BDHS09,G09 Sequential Sparse
    Cholesky on such a matrix moves ?( n3(s-1) /
    M1/2 ) words, and Parallel Sparse Cholesky moves
    ?( n2(s-1) / (P log n )1/2 ) words on P procs
  • Assumes 2D parallel algorithm, i.e. minimal
    memory
  • Thm G09. Attainable in parallel case by nested
    dissection, eg with PSPASES.

56
Locally Dependent Entries for x,Ax, A
tridiagonal, 2 processors
Proc 1
Proc 2
Can be computed without communication
57
Locally Dependent Entries for x,Ax,A2x, A
tridiagonal, 2 processors
Proc 1
Proc 2
Can be computed without communication
58
Locally Dependent Entries for x,Ax,,A3x, A
tridiagonal, 2 processors
Proc 1
Proc 2
Can be computed without communication
59
Locally Dependent Entries for x,Ax,,A4x, A
tridiagonal, 2 processors
Proc 1
Proc 2
Can be computed without communication
60
Locally Dependent Entries for x,Ax,,A8x, A
tridiagonal, 2 processors
Proc 1
Proc 2
Can be computed without communication k8 fold
reuse of A
61
Remotely Dependent Entries for x,Ax,,A8x, A
tridiagonal, 2 processors
Proc 1
Proc 2
One message to get data needed to compute
remotely dependent entries, not k8 Minimizes
number of messages latency cost Price
redundant work ? surface/volume ratio
62
Remotely Dependent Entries for x,Ax,A2x,A3x, A
irregular, multiple processors
63
Sequential x,Ax,,A4x, with memory hierarchy
v
One read of matrix from slow memory, not
k4 Minimizes words moved bandwidth cost No
redundant work
64
Performance results on 8-Core Clovertown
65
Direct linear algebra Summary of lower bounds
  • Sequential case with fast memory of size M
  • Lower bound on words moved to/from slow memory
    ? ( flops / M1/2 )
  • Ex Dense O(n3) MatMul ? (n3 / M1/2 )
  • Lower bound on messages to/from slow memory
    ? ( flops / M3/2 )
  • Ex Dense O(n3) MatMul ? (n3 / M3/2 )
  • Parallel case on P processors
  • Let NNZ be total memory needed assume balanced
  • Lower bound on words communicated
    ? ( flops / (P NNZ )1/2 )
  • Ex Dense O(n3) MatMul (so NNZ 3n2) ? (n2
    / P1/2 )
  • Lower bound on messages sent
    ? ( flops P1/2 / NNZ3/2 )
  • Ex Dense O(n3) MatMul ? (P1/2 )

66
Naïve Matrix Multiply
  • implements C C AB
  • for i 1 to n
  • for j 1 to n
  • for k 1 to n
  • C(i,j) C(i,j) A(i,k) B(k,j)

A(i,)
C(i,j)
C(i,j)
B(,j)



67
Naïve Matrix Multiply
  • implements C C AB
  • for i 1 to n
  • read row i of A into fast memory
  • for j 1 to n
  • read C(i,j) into fast memory
  • read column j of B into fast memory
  • for k 1 to n
  • C(i,j) C(i,j) A(i,k) B(k,j)
  • write C(i,j) back to slow memory

Number of slow memory references m n3 to
read each column of B n times n2
to read each row of A once 2n2 to
read and write each element of C once
n3 3n2 Similar for any other order of 3 loops
68
Blocked (Tiled) Matrix Multiply
  • Consider A,B,C to be N-by-N matrices of b-by-b
    blocks where bn / N is called
    the block size
  • for i 1 to N
  • for j 1 to N
  • read block C(i,j) into fast memory
  • for k 1 to N
  • read block A(i,k) into fast
    memory
  • read block B(k,j) into fast
    memory
  • C(i,j) C(i,j) A(i,k)
    B(k,j) do a matrix multiply on blocks
  • write block C(i,j) back to slow memory

A(i,k)
C(i,j)
C(i,j)



B(k,j)
69
Blocked (Tiled) Matrix Multiply
  • Recall
  • m is amount memory traffic between slow and
    fast memory
  • matrix has nxn elements, and NxN blocks each
    of size bxb
  • m Nn2 read each block of B N3 times
    (N3 b2 N3 (n/N)2 Nn2)
  • Nn2 read each block of A N3
    times
  • 2n2 read and write each block
    of C once
  • (2N 2) n2
  • ? 2 n3 / b smaller than naïve by a
    factor of b
  • The bigger the b the better How big can we make
    b?
  • Limit need to fit 3 b x b submatrices into fast
    memory of size M, so 3b2 ? M
  • So cant reduce m below ? (n3 / M1/2) by this
    approach
  • Is there any way to make m smaller?

70
Limits to Optimizing Matrix Multiply
  • The blocked algorithm changes the order in which
    values are accumulated into each Ci,j by
    applying
    Commutativity and Associativity of Addition (CAA)
  • Get slightly different answers from naïve code,
    because of roundoff - OK
  • Theorem (Hong Kung, 1981) Any reorganization
    of this algorithm (that uses only CAA) on a
    machine with fast memory size M does at least
    memory references ? (n3 / M1/2)
  • Attained by blocked matmul
  • Theorem (Irony, Toledo Tiskin, 2004) Suppose
    we are doing parallel matrix multiplication on P
    processors, where each processor can store O(n2 /
    P) words. Then for any algorithm (using only
    CAA), at least one processor needs to communicate
    at least O(n2 / P1/2) words
  • Attained by Cannons algorithm
  • Extends to 3D matmul using P1/3 times as much
    memory as minimal

71
Some Corollaries (1/8)
  • Theorem To evaluate (P) with fast memory of size
    M requires at least gijk s / (8M)1/2 M
    slow memory refs
  • Corollary 1 Dense or sparse matrix-multiply C
    AB requires at least multiplies/ (8M)1/2 M
    slow memory references
  • Not always attainable (suppose A, B diagonal)
  • Tighter lower bound max of above and total
    input/output size

72
Some Corollaries (2/8)
  • Theorem To evaluate (P) with fast memory of size
    M requires at least gijk s / (8M)1/2 M
    slow memory refs
  • Corollary 2 Dense or sparse matrix-multiply C
    AB
  • on P parallel processors,
  • where each processor stores
  • NNZ/P ( nnz(A) nnz(B) nnz(C) ) / P words
  • requires at least (multiplies/P)/(8
    NNZ/P)1/2 NNZ/P (multiplies)/(8 NNZ P)1/2
    NNZ/P words communicated

73
Some Corollaries (3/8)
  • Theorem To evaluate (P) with fast memory of size
    M requires at least gijk s / (8M)1/2 M
    slow memory refs
  • Corollary 3 Corollaries 1 and 2 apply to ATA and
    A2.
  • Theorem permits a(i,k) and b(k,j) to overlap

74
Some Corollaries (4/8)
  • Theorem To evaluate (P) with fast memory of size
    M requires at least gijk s / (8M)1/2 M
    slow memory refs
  • Corollary 4 Corollaries 1 and 2 apply to TRSM C
    A-1B when A triangular (dense or sparse,
    sequential or parallel)
  • C(i,j) (B(i,j) - ?klti A(i,k)C(k,j)) / A(i,i)
    A lower tri.
  • Matches (P)
  • C plays double role as b and c
  • Subtraction from B(i,j), division by A(i,i)
    some other arguments
  • Lower bound on all reorderings, even invalid ones

75
Attaining Bandwidth Lower Bound for Dense TRSM
  • function X R_TRSM(A,B) . Compute X A-1B
  • if dimension 1
  • return X B/A
  • else
  • partition B B11 , B12 B21 , B22 etc
  • X11 R_TRSM( A11 , B11 )
  • X21 R_TRSM( A22 , B21 A21 X11 )
  • need to do matmul right
  • X12 R_TRSM( A11 , B12 )
  • X22 R_TRSM( A22 , B22 A21 X12 )
  • need to do matmul right
  • Cache-oblivious approach
  • Leiserson, Frigo, Prokop, Blelloch, Pingali,
  • Minimizes bandwith across multiple levels of
    memory hierarchy
  • What about latency?

76
Some Corollaries (5/8)
  • Theorem To evaluate (P) with fast memory of size
    M requires at least gijk s / (8M)1/2 M
    slow memory refs
  • Corollary 5 Corollaries 1 and 2 apply to LU
    factorization (dense or sparse, any pivoting, LU
    or ILU, seq. or par.)
  • L(i,j) (A(i,j) - ?kltj L(i,k)U(k,j)) / U(j,j)
  • U(i,j) A(i,j) - ?klti L(i,k)U(k,j)
  • Matches (P)
  • L (and U) plays double role as c and a (c and b)
  • Subtractions, division by U(j,j) some other
    arguments
  • Lower bound on all reorderings, even invalid ones

77
Some Corollaries (6/8)
  • Theorem To evaluate (P) with fast memory of size
    M requires at least gijk s / (8M)1/2 M
    slow memory refs
  • Corollary 6 Corollaries 1 and 2 apply to
    Cholesky (dense or sparse, any diag.
    pivoting, C or IC, seq. or par.)
  • L(i,j) (A(i,j) - ?kltj L(i,k)LT(k,j)) / L(j,j)
  • L(j,j) (A(j,j) - ?kltj L(j,k)LT(k,j))1/2
  • Matches (P)
  • L (and LT) plays triple role as a, b and c
  • Subtractions, division, square root some other
    arguments
  • Lower bound on all reorderings, even invalid ones

78
Some Corollaries (7/8)
  • Theorem To evaluate (P) with fast memory of size
    M requires at least gijk s / (8M)1/2 M
    slow memory refs
  • Corollary 7 Sparse Cholesky, where graph of
    matrix contains an n x n mesh, requires ?(n3
    /M1/2 ) slow mem refs
  • Eg 5 point Laplacian on n x n mesh
  • Combine Corollary 6 with lower bound on flops by
    George , Hoffman/Martin/Rose

79
Some Corollaries (8/8)
  • Theorem To evaluate (P) with fast memory of size
    M requires at least gijk s / (8M)1/2 M
    slow memory refs
  • Conjecture 8 Corollaries 1 and 2 apply to QR
    (dense or sparse, any pivoting, seq.
    or par.)
  • Proof of dense case only in EECS Tech Report
    2008-89

80
Do any LU/QR algs reach lower bounds?
  • Only dense case understood
  • LAPACK attains neither O((n2/M)1/2) x more
    bandwidth
  • Recursive seq. algs minimize bandwidth, not
    latency
  • Toledo for LU, Elmroth/Gustavson for QR
  • ScaLAPACK attains bandwidth lower bound
  • But sends O((n2/P)1/2) x more messages
  • New LU and QR algorithms do attain both lower
    bounds, both seq. and par.
  • LU need to abandon partial pivoting (but still
    stable)
  • QR need to represent Q as tree of Householder
    matrices
  • Neither new alg works for multiple mem hierarchy
    levels
  • See EECS TR 2008-89 for lots of related work
  • Oldest Golub/Plemmons/Sameh (1988)

81
Minimizing Comm. in Parallel QR
  • QR decomposition of m x n matrix W, m gtgt n
  • TSQR Tall Skinny QR
  • P processors, block row layout
  • Usual Parallel Algorithm
  • Compute Householder vector for each column
  • Number of messages ? n log P
  • Communication Avoiding Algorithm
  • Reduction operation, with QR as operator
  • Number of messages ? log P

82
TSLU LU factorization of a tall skinny matrix
First try the obvious generalization of TSQR
83
Growth factor for TSLU based factorization
  • Unstable for large P and large matrices.
  • When P rows, TSLU is equivalent to parallel
    pivoting.

Courtesy of H. Xiang
84
Growth factor for better CALU approach
Like threshold pivoting with worst case threshold
.33 , so L lt 3 Testing shows about same
residual as GEPP
85
Do any Cholesky algs reach lower bounds?
  • Only dense case understood
  • LAPACK (with right block size) or recursive
    Cholesky minimize bandwidth
  • Recursive Ahmed/Pingali, Gustavson/Jonsson,
    Andersen/ Gustavson/Wasniewski, Simecek/Tvrdik, a
    la Toledo
  • LAPACK can minimize latency with blocked data
    structure
  • Ahmed/Pingali minimize bandwidth and latency
    across multiple levels of memory hierarchy
  • Simultaneously minimize communication between all
    pairs L1/L2/L3/DRAM/disk/
  • Cache-oblivious, space-filling curve layout
  • ScaLAPACK minimizes bandwidth and latency (mod
    polylog P)
  • Need right choice of block size
  • Details in EECS TR 2009-29

86
Future work (1/2)
  • Extend lower bound to QR (soon!)
  • Dense sequential QR and LU that minimize
    bandwidth and latency over multiple memory
    hierarchy levels
  • Extend to fast algorithms (see next slide)
  • Extend to eigenvalue problems (see next slide)
  • Extend to 3D algorithms
  • Attaining upper bounds for sparse matrices?
  • Need to examine special cases of interest
  • Extend to heterogeneous architectures
  • Layers of parallelism and memory hiearchy
  • Varying flops speeds, bandwidths, latencies
  • Autotuning
  • Redo all of LAPACK and ScaLAPACK

87
Future Work (2/2) Asymptotically Faster
Algorithms
  • Thm (D., Dumitriu, Holtz, Kleinberg) (Numer.Math.
    2007)
  • Given any matmul running in O(n?) ops for some
    ?gt2, it can be made stable and still run in
    O(n??) ops, for any ?gt0.
  • Strassen already stable
  • Current record ? ? 2.38
  • Thm (D., Dumitriu, Holtz) (Numer. Math. 2008)
  • Given any stable matmul running in O(n??) ops,
    it is possible to do backward stable dense linear
    algebra in O(n??) ops
  • GEPP, QR
  • rank revealing QR (randomized)
  • (Generalized) Schur decomposition, SVD
    (randomized)
  • Also reduces communication to O(n??)
  • But lower bounds?
  • But constants?
Write a Comment
User Comments (0)
About PowerShow.com