Transcript and Presenter's Notes

Title: Numerical Linear Algebra: An Introduction


1
Numerical Linear Algebra: An Introduction
2
Levels of multiplication
  • vector-vector: a(i)·b(i)
  • matrix-vector: A(i,j)·b(j)
  • matrix-matrix: A(i,k)·B(k,j)

3
Matrix-Matrix
  • for (i = 0; i < n; i++)
  •   for (j = 0; j < n; j++) {
  •     C[i][j] = 0.0;
  •     for (k = 0; k < n; k++)
  •       C[i][j] = C[i][j] + A[i][k]*B[k][j];
  •   }
  • Note: O(n³) work
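A runnable version of this triple loop, as a minimal sketch (the function name and flat row-major storage are choices of this sketch, not from the slide):

    #include <stddef.h>

    /* Naive O(n^3) matrix-matrix multiply: C = A*B.
       A, B, C are n-by-n, stored row-major in flat arrays. */
    void matmul_naive(size_t n, const double *A, const double *B, double *C)
    {
        for (size_t i = 0; i < n; i++)
            for (size_t j = 0; j < n; j++) {
                double sum = 0.0;               /* C(i,j) accumulates in a register */
                for (size_t k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
    }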

4
Block matrix
  • Serial version: same pseudocode, but interpret
    i, j as indices of subblocks, and A·B means block
    matrix multiplication
  • Let n be a power of two and generate a recursive
    algorithm that terminates with an explicit formula
    for the elementary 2×2 multiplications. Allows
    for parallelism. Can get O(n^2.81) work
    (Strassen; see the sketch below)
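A sketch of the recursive structure in C, assuming n is a power of two and row-major storage with leading dimension ld. This plain version does 8 sub-multiplies per level and stays O(n³); Strassen's variant combines the quadrants into 7 multiplies to reach the O(n^2.81) bound:

    /* Recursive block multiply: C += A*B on n-by-n blocks, n a power of two.
       ld is the leading dimension (row stride) of the full matrices. */
    static void block_mm(int n, int ld,
                         const double *A, const double *B, double *C)
    {
        if (n == 2) {                      /* explicit 2x2 base case */
            C[0]    += A[0]*B[0]    + A[1]*B[ld];
            C[1]    += A[0]*B[1]    + A[1]*B[ld+1];
            C[ld]   += A[ld]*B[0]   + A[ld+1]*B[ld];
            C[ld+1] += A[ld]*B[1]   + A[ld+1]*B[ld+1];
            return;
        }
        int h = n / 2;                     /* quadrant size */
        const double *A11 = A,        *A12 = A + h,
                     *A21 = A + h*ld, *A22 = A + h*ld + h;
        const double *B11 = B,        *B12 = B + h,
                     *B21 = B + h*ld, *B22 = B + h*ld + h;
        double *C11 = C,        *C12 = C + h,
               *C21 = C + h*ld, *C22 = C + h*ld + h;

        /* C11 += A11*B11 + A12*B21, and likewise for the other quadrants.
           The four quadrant groups are independent of each other,
           which is where the parallelism comes from. */
        block_mm(h, ld, A11, B11, C11);  block_mm(h, ld, A12, B21, C11);
        block_mm(h, ld, A11, B12, C12);  block_mm(h, ld, A12, B22, C12);
        block_mm(h, ld, A21, B11, C21);  block_mm(h, ld, A22, B21, C21);
        block_mm(h, ld, A21, B12, C22);  block_mm(h, ld, A22, B22, C22);
    }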

5
Pipeline method
  • Pump data left to right and top to bottom
  • recv(A, P(i,j-1))
  • recv(B, P(i-1,j))
  • C = C + A·B
  • send(A, P(i,j+1))
  • send(B, P(i+1,j))
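One way to realize the recv/send pseudocode is with MPI (an assumption; the slide does not name a message-passing library). A hedged sketch of a single pipeline step on the processor holding block C(i,j):

    #include <mpi.h>

    /* One pipeline step on processor P(i,j): receive b-by-b blocks from
       the west and north neighbors, update C = C + A*B, pass the blocks
       east and south. Neighbor ranks off the grid edge can be
       MPI_PROC_NULL, which turns the corresponding call into a no-op. */
    void pipeline_step(double *A, double *B, double *C, int b,
                       int west, int north, int east, int south,
                       MPI_Comm grid)
    {
        MPI_Status st;
        MPI_Recv(A, b*b, MPI_DOUBLE, west,  0, grid, &st);  /* recv(A, P(i,j-1)) */
        MPI_Recv(B, b*b, MPI_DOUBLE, north, 1, grid, &st);  /* recv(B, P(i-1,j)) */

        for (int i = 0; i < b; i++)           /* C = C + A*B on the local blocks */
            for (int j = 0; j < b; j++)
                for (int k = 0; k < b; k++)
                    C[i*b + j] += A[i*b + k] * B[k*b + j];

        MPI_Send(A, b*b, MPI_DOUBLE, east,  0, grid);       /* send(A, P(i,j+1)) */
        MPI_Send(B, b*b, MPI_DOUBLE, south, 1, grid);       /* send(B, P(i+1,j)) */
    }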

6
Pipeline method
[Figure: 4-by-4 grid of processors; processor (i,j) accumulates block C(i,j) while row blocks A(i,·) stream in from the left and column blocks B(·,j) stream in from the top.]
7
Pipeline method
  • A similar method works for matrix-vector
    multiplication, but you lose some of the cache reuse

8
A sense of speed: vector ops

Loop  Flops/pass  Operation per pass                        Operation type
1     2           v1(i) = v1(i) + a*v2(i)                   update
2     8           v1(i) = v1(i) + Σ_k s_k*v_k(i), k=1..4    4-fold vector update
3     1           v1(i) = v1(i)/v2(i)                       divide
4     2           v1(i) = v1(i) + s*v2(ind(i))              update + gather
5     2           v1(i) = v2(i) - v3(i)*v1(i-1)             bidiagonal
6     2           s = s + v1(i)*v2(i)                       inner product
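For concreteness, the six loops written as C kernels, a sketch with assumed variable names; the flop counts per pass match the table:

    /* The six timing loops from the table (names are assumptions). */
    void vector_kernels(int n, double a, double s0, const double sk[4],
                        double *v1, const double *v2, const double *v3,
                        const double *vk[4], const int *ind, double *dot)
    {
        for (int i = 0; i < n; i++)            /* 1: update (2 flops/pass) */
            v1[i] = v1[i] + a * v2[i];
        for (int i = 0; i < n; i++)            /* 2: 4-fold update (8 flops) */
            v1[i] += sk[0]*vk[0][i] + sk[1]*vk[1][i]
                   + sk[2]*vk[2][i] + sk[3]*vk[3][i];
        for (int i = 0; i < n; i++)            /* 3: divide (1 flop) */
            v1[i] = v1[i] / v2[i];
        for (int i = 0; i < n; i++)            /* 4: update + gather (2 flops) */
            v1[i] = v1[i] + s0 * v2[ind[i]];
        for (int i = 1; i < n; i++)            /* 5: bidiagonal recurrence (2 flops) */
            v1[i] = v2[i] - v3[i] * v1[i-1];
        double s = 0.0;
        for (int i = 0; i < n; i++)            /* 6: inner product (2 flops) */
            s = s + v1[i] * v2[i];
        *dot = s;
    }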
9
A sense of speed: vector ops

Loop  J90, cft77 (100 ns clock)      T90, 1 processor, cft77 (2.2 ns clock)
      r_∞ (Mflops)   n_1/2           r_∞ (Mflops)   n_1/2
1     97             19              159            -
2     163            29              1245           50
3     21             6               219            43
4     72             21              780            83
5     4.9            2               25             9
6     120            202             474            164
10
Observations
  • Simple do loops are not effective
  • Cache and memory hierarchy are the bottlenecks
  • For better performance:
  •   combine loops
  •   minimize memory transfer

11
LINPACK
  • A library of subroutines to solve linear algebra
    problems
  • Example: LU decomposition and system solve
    (dgefa and dgesl, respectively)
  • In turn, built on the BLAS
  • See netlib.org
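The modern LAPACK equivalent of the dgefa/dgesl pair is dgesv, which factors and solves in one call. A minimal sketch via the LAPACKE C interface (assuming LAPACKE is installed; the 2x2 system is an arbitrary example):

    #include <stdio.h>
    #include <lapacke.h>

    /* Solve A x = b by LU factorization with partial pivoting;
       dgesv plays the role of LINPACK's dgefa + dgesl. */
    int main(void)
    {
        double A[4] = {4.0, 3.0,      /* 2x2 matrix, row-major */
                       6.0, 3.0};
        double b[2] = {10.0, 12.0};   /* right-hand side */
        lapack_int ipiv[2];           /* pivot indices from the LU factorization */

        lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1,
                                        A, 2, ipiv, b, 1);
        if (info != 0) { printf("dgesv failed: %d\n", (int)info); return 1; }
        printf("x = (%g, %g)\n", b[0], b[1]);  /* solution overwrites b: (1, 2) */
        return 0;
    }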

12
BLAS: Basic Linear Algebra Subprograms
  • A library of subroutines designed to provide
    efficient computation of commonly used linear
    algebra operations, like dot products,
    matrix-vector multiplies, and matrix-matrix
    multiplies.
  • The naming convention is not unlike other
    libraries: the first letter indicates precision,
    the rest gives a hint (maybe) of what the routine
    does, e.g. SAXPY, DGEMM.
  • The BLAS are divided into 3 levels:
    vector-vector, matrix-vector, and matrix-matrix.
    The biggest speed-up is usually in level 3
    (a sample level-3 call is sketched below).
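As an illustration of the naming convention, DGEMM is Double-precision GEneral Matrix-matrix Multiply. A sketch of a level-3 call through the CBLAS interface (assuming a CBLAS implementation is linked):

    #include <cblas.h>

    /* C = alpha*A*B + beta*C via the level-3 routine DGEMM.
       All matrices are n-by-n, row-major. */
    void gemm_example(int n, const double *A, const double *B, double *C)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,          /* M, N, K */
                    1.0, A, n,        /* alpha, A, lda */
                    B, n,             /* B, ldb */
                    0.0, C, n);       /* beta, C, ldc */
    }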

13
BLAS
  • Level 1: vector-vector operations

14
BLAS
  • Level 2: matrix-vector operations

15
BLAS
  • Level 3: matrix-matrix operations

16
How efficient is the BLAS?

             load/stores      float ops   refs/ops
  level 1:
    SAXPY    3N               2N          3/2
  level 2:
    SGEMV    MN + N + 2M      2MN         ~1/2
  level 3:
    SGEMM    2MN + MK + KN    2MNK        ~2/N

17
Matrix-vector
  • read x(1:n) into fast memory
  • read y(1:n) into fast memory
  • for i = 1:n
  •   read row i of A into fast memory
  •   for j = 1:n
  •     y(i) = y(i) + A(i,j)*x(j)
  • write y(1:n) back to slow memory
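The same loop as compilable C, a sketch assuming row-major storage, with the memory-traffic accounting of the next slide as comments:

    /* y = y + A*x for an n-by-n row-major A.
       Slow-memory traffic: n^2 (A) + n (x) + 2n (read/write y) = n^2 + 3n
       words against 2n^2 flops, so q = f/m ~ 2. */
    void matvec(int n, const double *A, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double yi = y[i];               /* y(i) stays in a register */
            for (int j = 0; j < n; j++)
                yi += A[i*n + j] * x[j];    /* one pass over row i of A */
            y[i] = yi;
        }
    }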

18
Matrix-vector
  • m = number of slow memory references = n² + 3n
  • f = number of arithmetic operations = 2n²
  • q = f/m ≈ 2
  • Matrix-vector multiply is limited by slow memory

19
Matrix-matrix
20
Matrix Multiply - unblocked
  • for i = 1 to n
  •   read row i of A into fast memory
  •   for j = 1 to n
  •     read C(i,j) into fast memory
  •     read column j of B into fast memory
  •     for k = 1 to n
  •       C(i,j) = C(i,j) + A(i,k) * B(k,j)
  •     write C(i,j) back to slow memory


21
Matrix Multiply: unblocked
  • Number of slow memory references in unblocked
    matrix multiply:
  • m = n³ (read each column of B n times)
  •   + n² (read row i of A once for each i)
  •   + 2n² (read and write each element of C once)
  •   = n³ + 3n²
  • So q = f/m = (2n³)/(n³ + 3n²)
  •   ≈ 2 for large n: no improvement over
    matrix-vector multiply

22
Matrix Multiply blocked
  • Consider A, B, C to be N-by-N matrices of b-by-b
    subblocks, where b = n/N is called the blocksize
  • for i = 1 to N
  •   for j = 1 to N
  •     read block C(i,j) into fast memory
  •     for k = 1 to N
  •       read block A(i,k) into fast memory
  •       read block B(k,j) into fast memory
  •       C(i,j) = C(i,j) + A(i,k) * B(k,j)
          {a matrix multiply on b-by-b blocks}
  •     write block C(i,j) back to slow memory
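A compilable version of the blocked loop, as a sketch (it assumes b divides n and uses the naive triple loop for the block multiply):

    /* Blocked C = C + A*B over an N-by-N grid of b-by-b blocks, n = N*b.
       Each (i,j,k) block step touches three b-by-b blocks, so keeping
       3*b*b words in cache is what buys the reuse. */
    void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < n; ii += b)
            for (int jj = 0; jj < n; jj += b)
                for (int kk = 0; kk < n; kk += b)
                    /* C(ii,jj) += A(ii,kk) * B(kk,jj) on b-by-b blocks */
                    for (int i = ii; i < ii + b; i++)
                        for (int j = jj; j < jj + b; j++) {
                            double s = C[i*n + j];
                            for (int k = kk; k < kk + b; k++)
                                s += A[i*n + k] * B[k*n + j];
                            C[i*n + j] = s;
                        }
    }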


23
Matrix Multiply: blocked
  • Number of slow memory references in blocked
    matrix multiply:
  • m = N·n² (read each block of B N³ times:
              N³ · (n/N) · (n/N) = N·n²)
  •   + N·n² (read each block of A N³ times)
  •   + 2n² (read and write each block of C once)
  •   = (2N + 2)·n²
  • So q = f/m = 2n³ / ((2N + 2)·n²)
  •   ≈ n/N = b for large n

24
Matrix Multiply: blocked
  • So we can improve performance by increasing the
    blocksize b
  • Can be much faster than matrix-vector multiply
    (q ≈ 2)
  • Limit: all three blocks from A, B, C must fit in
    fast memory (cache), so we cannot make these
    blocks arbitrarily large:
    3b² < M, so q ≈ b < sqrt(M/3)
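As a worked example (the cache size here is an assumed figure, not from the slides): a 256 KB cache holds M = 32768 doubles, so b < sqrt(32768/3) ≈ 104. A blocksize of b = 64 or 96 keeps all three blocks resident and gives q ≈ b, roughly 30 to 50 times the q ≈ 2 of the unblocked loop.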

25
More on BLAS
  • Industry standard interface (evolving)
  • Vendors and others supply optimized implementations
  • History:
  • BLAS1 (1970s)
  •   vector operations: dot product, saxpy
  •   m = 2n, f = 2n, q ≈ 1 or less
  • BLAS2 (mid 1980s)
  •   matrix-vector operations
  •   m = n², f = 2n², q ≈ 2, less overhead
  •   somewhat faster than BLAS1

26
More on BLAS
  • BLAS3 (late 1980s)
  •   matrix-matrix operations: matrix-matrix multiply,
      etc.
  •   m ≥ 4n², f = O(n³), so q can possibly be as
      large as n, so BLAS3 is potentially much faster
      than BLAS2
  • Good algorithms use BLAS3 when possible (LAPACK)
  • See www.netlib.org/blas, www.netlib.org/lapack

27
BLAS on an IBM RS6000/590
Peak speed: 266 Mflops
[Figure: measured Mflop rate vs. n; curves for peak, BLAS 3 (n-by-n matrix-matrix multiply), BLAS 2 (n-by-n matrix-vector multiply), and BLAS 1 (saxpy of n-vectors), with BLAS 3 coming closest to peak.]