MA471 Fall 2002, Lecture 8 - PowerPoint PPT Presentation

1 / 28
About This Presentation
Title:

MA471 Fall 2002, Lecture 8

Description:

For efficiency reasons we will store the data in a matrix ... Tactic: for machine specific/compiler specific options; E.g. man gcc. Examples: ... – PowerPoint PPT presentation

Number of Views:19
Avg rating:3.0/5.0
Slides: 29
Provided by: marlenem9
Category:
Tags: double | fall | lecture | ma471 | tactic

less

Transcript and Presenter's Notes

Title: MA471 Fall 2002, Lecture 8


1
MA471Fall 2002, Lecture 8
  • Some efficiency related tricks

2
Data Point Storage
In Matlab and Fortran
For efficiency reasons we will store the data in
a matrixwith element nodes running along matrix
columns
Elements
1
2
Data Values
3
................................
4
.
.
.
.
.
.
.
.
.
3
Data Point Storage
Data Values

1
2
3

In C or C
.
.
For efficiency reasons we will store the data in
a matrix with element node values running along
rows
........................
Elements

.
.
4
Serial Optimization
5
Serial Optimization
  • Four approaches we will consider
  • Algorithmic optimization
  • Compiler based optimization
  • Employ library routines

6
Compiler Based Optimization
  • In some ways the easiest thing to do
  • Simple example
  • cc O3 o foo foo.c
  • Every compiler has its own set of options (like
    -O3)
  • Tactic
  • for machine specific/compiler specific options
  • E.g. gt man gcc
  • Examples
  • -O, -O1,-O2,-O3,-fast
  • -mcpupentium
  • And so on
  • BUILD THE OPTIONS INTO THE Makefile possibly
    using a command line switch

7
Library Based Optimization
  • Use atlas or manufacturer (Intel mkl, IBM essl,
    SGI complib, Sun Sunperf) based BLAS for
    matrix-matrix multiplication
  • I highly recommend creating a wrapper around
    DGEMM, as it has a very general interface and
    wewill only require a small subset of functions
  • I gave you umLINALG.c which contains a number of
    matrix-matrix routines neatly wrapped up.

8
Wrapper for C AB
  • Recall we are using BLAS routines which are
    written in Fortran with Fortran storage in mind
  • If we have A stored in c then we really have A
    stored in Fortran.
  • We use the following identity
  • C BA
  • So if we request BLAS to multiply c-stored B and
    A together it will actually be multiplying B and
    A together.
  • The result will be BA stored in Fortran
    convention
  • However, BA stored in Fortran convention is
    (BA) in c-convention and (BA) AB

9
  • / C A times B /
  • void umAxB(double A, int rowsA, int colsA,
  • double B, int rowsB, int
    colsB,
  • double C)
  • char opA N , opB N
  • double one 1., zero 0
  • int i,j,k
  • if(colsA ! rowsB) fprintf(stderr, "Warning,
    colsA!rowsB in umAxB\n")
  • ifdef UNDERSCORE
  • dgemm_
  • else
  • dgemm
  • endif
  • (opA, opB, colsB, rowsA, colsA, one,
    B, colsB, A, colsA, zero, C, colsB)

Notice we list B then A
10
Wrapper for A(B)
  • I included a routine umAxBtrans for this instance
    too.
  • This is the most efficient way to multiply two
    matrices since no cross row strides are
    necessary. Here is some pseudo-code
  • for(i0iltNrowsAi)
  • for(j0jltNrowsBj)
  • double d 0
  • for(k0kltNcolsAk)
  • d d A i k B j k
  • C i j d

Notice how we only ever accessthe matrix entries
in order of storage
11
Note on Linking ATLAS
  • Due to the way that the atlas libraries
    (cblas,..) wherecompiled you should add the
    following to your OBJS variable in the Makefile
  • xerbla.o
  • Then copy xerbla.f from
  • /home/cs471aa/umFEM/xerbla.f
  • Into your build directory

12
Individual Homework
  1. Write your own matrixmatrix and
    matrixtranpose(matrix) routines which do not
    call BLAS. Do not explicitly transpose the matrix
    for the second version.
  2. For N5,10,15,..,1000
  3. Create two random matrix of N rows and N columns
    called A and B
  4. Time how long it takes to multiply A and B into a
    third matrix C one thousand times using your own
    routine
  5. Time how long it takes to multiply A and B
    transpose into a third matrix C one thousand
    times using your own routine
  6. Repeat (2) using umAxB
  7. Repeat (3) using umAxBtrans

3) Plot the results from all the runs in part 2)
on the same graph with N on the horizontal
axis and average time taken on the vertical axis
using Matlab or similar. Label axes, include
a title and legend
13
Algorithmic Optimization
  • We have not paid too much attention to cache
    locality in the current code.
  • We have accepted the order of the data as it is
    read infrom file. i.e. we store triangle 1, then
    triangle 2,
  • We are free to shuffle the order of the elements
    (for example)
  • Why does this matter?.

14
Banded Storage
  • It is possible to perform LU factorization on a
    band limited matrix using only the storage
    required to holdthe non-zero diagonal bands.
  • i.e if a symmetric matrix only has 10 upper
    diagonals with non-zero entries but 10000 rows
    then we can save 10000-20 rows of storage. A big
    saving!.

15
Example
16
Cache Miss Minimization
  • When we go from an element local node to a
    globally numbered node we may end up with a
    random walk around memory.
  • This is not a good thing..

17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
Reverse Cuthill McKee Algorithm
21
(No Transcript)
22
0 Do
23
(No Transcript)
24
(No Transcript)
25
(No Transcript)
26
(No Transcript)
27
(No Transcript)
28
Lab
  • Continue implementing group project
  • If you are waiting for some other piece to be
    readied work on the homework.
Write a Comment
User Comments (0)
About PowerShow.com