Title: Software Project: Fast Matrix Operations, Cache Usage, Sparse Matrices
1. Software Project: Fast Matrix Operations, Cache Usage, Sparse Matrices
2. Overview
- Part 1: Cache Memory and Code Efficiency
- Part 2: Sparse Matrices
- Part 3: Home Exercise
- Part 4: Notes on C Functions
3. Multiplication of 2D Matrices
- Time and Performance Measurement
- Simple Code Improvements
4. Matrix Multiplication
[Figure: matrices A and B]
5. Matrix Multiplication
[Figure: matrices A and B]
6. The Simplest Algorithm
- Assumption: the matrices are stored as 2-D NxN arrays.

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];

- Advantage: code simplicity. Disadvantage: performance.
7. First Improvement

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        for (k = 0; k < N; k++)
            c[i][j] += a[i][k] * b[k][j];

- c[i][j] is constant in the k-loop!
8. First Performance Improvement

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        int sum = 0;
        for (k = 0; k < N; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
9. Measuring the Performance
- clock_t clock(void): returns the processor time used by the program since the beginning of execution, or -1 if unavailable.
- clock() / CLOCKS_PER_SEC is a time in seconds.

#include <time.h>

clock_t t1, t2;
t1 = clock();
mult_ijk(a, b, c, n);
t2 = clock();
printf("The running time is %lf seconds\n",
       (double)(t2 - t1) / CLOCKS_PER_SEC);
10. The Running Time
[Figure: running times of the simplest algorithm vs. after the first optimization]
11. Profiling with gprof
- gprof is a profiling program which collects and arranges statistics on your programs.
- Profiling allows you to learn where your program spent its time.
- This information can show you which pieces of your program are slower and might be candidates for rewriting to make your program execute faster.
- It can also tell you which functions are being called more often than you expected.
12. The Unix gprof Profiler
- To run the profiler:
- Add the -pg flag to compilation and linking, e.g. gcc -pg -o mlpl mlpl.c matrix_manipulate.c allocate_free.c
- Run your program (this writes the profile data to gmon.out)
- Run the profiler: gprof mlpl gmon.out
13. The Unix gprof Profiler
[Figure: sample gprof output]
14. Part 1: Cache Memory
15. General Idea
- You are in the library gathering books for an assignment.
- The books you have gathered contain material that you will likely use.
- You do not collect ALL the books from the library to your desk.
- It is quicker to access information from the book on your desk than to go to the stacks again.
- This is how caching works in computing.
16. Cache Types
- There are many different types of caches associated with your computer system:
- browser cache (for the recent websites you've visited)
- memory caches
- hard drive caches
- A cache is meant to improve access times and enhance the overall performance of your computer.
- The type we're concerned with today is cache memory.
17. Memory Hierarchy
[Figure: memory hierarchy. Moving down the hierarchy, the cost per bit decreases, the frequency of access decreases, the capacity increases, the access time increases, and the size of the transfer unit increases.]
- CPU registers: ~500 bytes, 0.25 ns; word transfer (1-8 bytes) to/from cache
- Cache: 16K-1M bytes, 1 ns; block transfer (8-128 bytes) to/from main memory
- Main memory: 64M-2G bytes, 100 ns
- Disk: ~100 GB, 5 ms
- The memory cache is closer to the processor than the main memory.
- It is smaller and faster than the main memory.
- Transfer between caches and main memory is performed in units called cache blocks/lines.
18. Types of Cache Misses
- 1. Compulsory misses: cache misses caused by the first access to a block that has never been in the cache (also known as cold-start misses).
- 2. Capacity misses: cache misses caused when the cache cannot contain all the blocks needed during execution of a program. They occur because blocks are replaced and then later retrieved when accessed again.
- 3. Conflict misses: cache misses that occur when multiple blocks compete for the same set.
19. Main Cache Principles
- Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon.
- Keep more recently accessed data items closer to the processor.
- Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
- Move blocks consisting of contiguous words to the cache.
20. Improving Spatial Locality: Loop Reordering for Matrices Allocated by Row
[Figure: row-major allocation of a 2-D array]
21. Writing Cache-Friendly Code
- Example: 4-byte words, 4-word cache blocks

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
22. Matrix Multiplication (ijk)

/* ijk */
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        sum = 0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }

[Figure: inner-loop access. Row i of A, at (i,*), is scanned row-wise; column j of B, at (*,j), is scanned column-wise; element (i,j) of C is fixed.]
- Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0
23. Matrix Multiplication (jik)

/* jik */
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++) {
        sum = 0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }

[Figure: inner-loop access. Row i of A, at (i,*), is scanned row-wise; column j of B, at (*,j), is scanned column-wise; element (i,j) of C is fixed.]
- Misses per inner loop iteration: A = 0.25, B = 1.0, C = 0.0
24. Matrix Multiplication (kij)

/* kij */
for (k = 0; k < n; k++)
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }

[Figure: inner-loop access. Element (i,k) of A is fixed; row k of B and row i of C are scanned row-wise.]
- Misses per inner loop iteration: A = 0.0, B = 0.25, C = 0.25
25. Matrix Multiplication (jki)

/* jki */
for (j = 0; j < n; j++)
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }

[Figure: inner-loop access. Column (*,k) of A is scanned column-wise; element (k,j) of B is fixed; column (*,j) of C is scanned column-wise.]
- Misses per inner loop iteration: A = 1.0, B = 0.0, C = 1.0
26. Summary
- ijk (& jik): misses/iter = 1.25

for (i = 0; i < n; i++)
    for (j = 0; j < n; j++) {
        sum = 0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }

- kij (& ikj): misses/iter = 0.5

for (k = 0; k < n; k++)
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }

- jki (& kji): misses/iter = 2.0

for (j = 0; j < n; j++)
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
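To compare the orderings empirically, each variant can be timed with clock() as on slide 9. Below is a minimal sketch of such a harness; it assumes the multiplication routines (mult_ijk, mult_kij, mult_jki, ...) follow the signature implied by slide 9's call mult_ijk(a, b, c, n), which is an assumption made for illustration.

#include <stdio.h>
#include <time.h>

/* Assumed signatures, following slide 9's mult_ijk(a, b, c, n) call. */
void mult_ijk(double **a, double **b, double **c, int n);
void mult_kij(double **a, double **b, double **c, int n);
void mult_jki(double **a, double **b, double **c, int n);

/* Run one multiplication routine and report its CPU time in seconds. */
void time_mult(const char *name,
               void (*mult)(double **, double **, double **, int),
               double **a, double **b, double **c, int n)
{
    clock_t t1, t2;
    t1 = clock();
    mult(a, b, c, n);
    t2 = clock();
    printf("%s: %lf seconds\n", name, (double)(t2 - t1) / CLOCKS_PER_SEC);
}

/* Usage: time_mult("ijk", mult_ijk, a, b, c, n); and so on. */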
27. Cache Misses Analysis
- Assumptions:
- The cache can hold M words of memory in blocks of size B.
- The replacement policy is LRU (Least Recently Used).
- Under these assumptions one can derive a lower bound on the number of misses for the three matrices (A and C on the one hand, B on the other).
- For further details see the tutorial on the website.
28. Improving Temporal Locality: Blocked Matrix Multiplication
29. Blocked Matrix Multiplication
[Figure: matrices A, B, and C partitioned into cache blocks, with indices i and j marked]
Key idea: reuse the other elements in each cache block as much as possible.
30. Blocked Matrix Multiplication
[Figure: computing c[i][j] and c[i][j+1] from b-element pieces of row i of A and of columns j and j+1 of B]
- Since one loads column j+1 of B into the cache lines anyway, compute c[i][j+1] as well.
- Reorder the operations:
- compute the first b terms of c[i][j], then the first b terms of c[i][j+1]
- compute the next b terms of c[i][j], then the next b terms of c[i][j+1]
- .....
31. Blocked Matrix Multiplication
[Figure: computing a whole subrow of C from row i of A and all columns of B]
Compute a whole subrow of C, with the same reordering of the operations. But then one has to load all columns of B, which one has to do again for computing the next row of C. Idea: reuse the blocks of B that we have just loaded.
32. Blocked Matrix Multiplication
[Figure: computing a whole b x b block of C from blocks of A and B]
Order of the operations: compute the first b terms of all c[i][j] values in the C block, then the next b terms of all c[i][j] values in the C block, ..., up to the last b terms of all c[i][j] values in the C block.
33. Blocked Matrix Multiplication
34. Blocked Matrix Multiplication
[Figure: C, A, and B each partitioned into a 4x4 grid of b x b blocks, C11..C44, A11..A44, B11..B44]
N = 4b
- C22 = A21 B12 + A22 B22 + A23 B32 + A24 B42
- 4 matrix multiplications
- 4 matrix additions
- Main point: each multiplication operates on small block matrices, whose size may be chosen so that they fit in the cache.
35. Blocked Algorithm
- The blocked version of the i-j-k algorithm is written simply as:

for (i = 0; i < N/B; i++)
    for (j = 0; j < N/B; j++)
        for (k = 0; k < N/B; k++)
            C[i][j] += A[i][k] * B[k][j];

- where B is the block size (which we assume divides N)
- where X[i][j] is the block of matrix X on block row i and block column j
- where += means matrix addition
- where * means matrix multiplication
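As a concrete illustration of the pseudocode above, here is a minimal C sketch of a blocked multiplication for square matrices stored as 2-D arrays. The tile size BS and the assumption that it divides n are choices made for this example; within each tile, the cache-friendly kij ordering from slide 24 is used.

#define BS 32  /* block (tile) size; assumed to divide n */

/* Blocked multiplication: c += a * b, all matrices n x n.
 * Each (i0, j0, k0) iteration multiplies one BS x BS tile of a
 * by one tile of b and accumulates into one tile of c. */
void mult_blocked(double **a, double **b, double **c, int n)
{
    int i0, j0, k0, i, j, k;
    double r;
    for (i0 = 0; i0 < n; i0 += BS)
        for (j0 = 0; j0 < n; j0 += BS)
            for (k0 = 0; k0 < n; k0 += BS)
                for (i = i0; i < i0 + BS; i++)
                    for (k = k0; k < k0 + BS; k++) {
                        r = a[i][k];
                        for (j = j0; j < j0 + BS; j++)
                            c[i][j] += r * b[k][j];
                    }
}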
36. Maximum Block Size
- The blocking optimization only works if the blocks fit in the cache.
- That is, 3 blocks of size b x b must fit in the cache (for A, B, and C).
- Let M be the cache size (in elements).
- We must have 3b^2 <= M, or b <= sqrt(M/3).
- For example, a cache holding M = 8192 elements allows b up to sqrt(8192/3), i.e. about 52.
- Therefore, in the best case, the ratio of the number of operations to slow memory accesses is sqrt(M/3).
37. Part 2: Sparse Matrices
38. Sparse Matrices
- A sparse matrix is a matrix populated primarily with zeros.
- Manipulating huge sparse matrices with the standard algorithms may be impossible due to their large size.
- Sparse data is by its nature easily compressed, which can yield enormous savings in memory usage.
39. Linked List Representation
- Use the linked list representation.
- The naïve implementation:
- Replace each row by a linked list.
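A minimal sketch of such a row-list representation (the type names node and sparse_matrix_lst are invented here for illustration, not part of the assignment framework):

/* One nonzero element of a row, kept in a list ordered by column. */
typedef struct node {
    int col;            /* column index of the nonzero */
    double value;       /* its value */
    struct node *next;  /* next nonzero in the same row */
} node;

typedef struct {
    int n;        /* matrix dimension */
    node **rows;  /* rows[i] points to the list of row i's nonzeros */
} sparse_matrix_lst;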
40. Disadvantages
- Accessing the column elements is inconvenient: we need to go over all the rows to search for a certain column.
- It is more convenient to view matrix multiplication as multiplication of a row by a column.
[Figure: row (i,*) of A times column (*,j) of B produces element (i,j) of C; A is accessed row-wise]
41. Doubly Linked List Representation
42. Advantages and Disadvantages
- Advantages:
- No unnecessary memory allocations.
- You do not need to know the number of nonzero elements in advance.
- Disadvantages:
- A memory allocation is performed for each nonzero element of the matrix.
- Not cache-friendly code, i.e. we cannot optimize it for better cache memory utilization.
43. Compressed Array
- Use a compressed array representation.
[Figure: the arrays values, rowind, and colptr of the compressed-column format]
44Compressed data structure
22
84
61
values
2
1
0
rowind
colptr
45. Data Type Definition

typedef struct {
    int n;        /* size */
    int *colptr;  /* pointers to where columns begin in rowind and values */
    int *rowind;  /* row indexes */
    elem *values; /* nonzero values */
} sparse_matrix_arr;
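Since colptr delimits each column's slice of rowind and values, iterating over a column is a simple loop. A minimal sketch, assuming elem is a numeric typedef (e.g. double) and colptr holds n+1 entries so that colptr[j+1] marks the end of column j:

#include <stdio.h>

/* Print the nonzeros of column j: they occupy positions
 * colptr[j] .. colptr[j+1]-1 of rowind and values. */
void print_column(const sparse_matrix_arr *a, int j)
{
    int idx;
    for (idx = a->colptr[j]; idx < a->colptr[j + 1]; idx++)
        printf("A(%d,%d) = %f\n",
               a->rowind[idx], j, (double)a->values[idx]);
}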
46. Compressed Array Implementation
- Observations:
- The described data type is compressed by column.
- Accessing row elements is inconvenient.
[Figure: the same example arrays values = (22, 84, 61), rowind = (2, 1, 0), and colptr]
47. Multiplication by Column
The inner loop can be viewed as a column operation. Remember the jki loop ordering from exercise 2.1:

/* jki */
for (j = 0; j < n; j++)
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }

[Figure: column (*,k) of A, scaled by element (k,j) of B, accumulates into column (*,j) of C]
48. Matrix by Vector Multiplication
[Figure: matrix A times vector b]
- Sometimes a better way to look at it:
- Ab is a linear combination of A's columns!
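This view maps directly onto the compressed-column structure: for each column j, scale the stored nonzeros of column j of A by b[j] and accumulate into the result. A minimal sketch under the same assumptions as before (elem numeric, colptr of length n+1):

/* y = A*b for a compressed-column sparse matrix A.
 * The caller must zero-initialize y (length n). */
void sparse_matvec(const sparse_matrix_arr *a, const double *b, double *y)
{
    int j, idx;
    for (j = 0; j < a->n; j++)
        for (idx = a->colptr[j]; idx < a->colptr[j + 1]; idx++)
            y[a->rowind[idx]] += (double)a->values[idx] * b[j];
}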
49. Part 3: Home Exercise
50. Home Exercise
- Implement the described algorithms for matrix multiplication and measure the performance.
- Store the matrices as arrays, organized by columns!!!
[Figure: column-major layout of an N x N matrix: element A(i,j) is stored at a[i + j*N]]
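A common way to work with this layout is an index helper. A minimal sketch; the macro name A_AT and the allocator are conveniences invented for illustration:

#include <stdlib.h>

/* Column-major indexing: element (i, j) of an n x n matrix
 * is stored at offset i + j*n. */
#define A_AT(a, i, j, n) ((a)[(i) + (j) * (n)])

/* Allocate an n x n matrix as one contiguous column-major array. */
double *alloc_matrix(int n)
{
    return (double *)malloc((size_t)n * n * sizeof(double));
}

/* Usage: A_AT(a, i, j, n) = 1.0; accesses element (i, j). */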
51. Question 2.1: mlpl
- Implement all 6 options of loop ordering (ijk, ikj, jik, jki, kij, kji).
- Run them for matrices of different sizes.
- Measure the performance with clock() and gprof.
- Select the most efficient loop ordering.
- Plot the running times of all the options (ijk, jki, etc.) as a function of matrix size.
52. User Interface
- Input:
- Case 1: 0 or a negative number
- Case 2: a positive integer number followed by the matrix values
- Output:
- Case 1: running times
- Case 2: a matrix, which is the product of the input matrices (C = A*B)
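As an illustration of handling Case 2, a minimal sketch that reads the size and then the values into a column-major array. The exact input conventions (values listed row by row) are an assumption made for illustration; the provided framework is authoritative.

#include <stdio.h>
#include <stdlib.h>

/* Read n, then n*n values, into a freshly allocated
 * column-major array; returns NULL on failure. */
double *read_matrix(int *n_out)
{
    int n, i, j;
    double *a;
    if (scanf("%d", &n) != 1 || n <= 0)
        return NULL;
    a = (double *)malloc((size_t)n * n * sizeof(double));
    if (a == NULL)
        return NULL;
    for (i = 0; i < n; i++)          /* assume values arrive row by row */
        for (j = 0; j < n; j++)
            if (scanf("%lf", &a[i + j * n]) != 1) {
                free(a);
                return NULL;
            }
    *n_out = n;
    return a;
}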
53. Graphs and the Manual
- Prepare a graph which plots the running times (in CPU ticks, Y-axis) as a function of matrix size (X-axis).
- What is the most efficient loop ordering?
- Provide a detailed analysis of the graphs.
- Submit the details of the computer:
- /proc/cpuinfo
54. Plotting the Graphs
- Save the output to a .csv file.
- Open it in Excel.
- Use Excel's Chart Wizard to plot the data as an XY Scatter.
- X-axis: the matrix sizes.
- Y-axis: the running time in CPU ticks.
55. Question 2.2: matop
- Implement the additional matrix operations of addition and subtraction as well as transpose (+, -, t).
- The functions should comply with the prototypes in matrix_manipulate.h.
- Each function should be implemented in the most efficient and cache-friendly way.
56. Question 2.2: matop
- Input format examples:
- + 4 ....
- - 4 ....
- t 4 ....
- The output is in the same format as in 2.1.
- You should test your program by yourself on your own examples (use Matlab or any online matrix calculator).
57. Matlab
- You are strongly encouraged to use Matlab to check your functions.
- A tutorial and how to get started can be found at:
- http://www.mathworks.com/access/helpdesk/help/techdoc/matlab.shtml
58. Question 2.3: sparse_mlpl_arr
- Implement the multiplication of sparse matrices represented by the compressed array data structure.
- Perform multiplication of matrices of different sizes (n) with different numbers of nonzero elements (nnz).
- Measure the performance using the clock() function.
- Prepare 2 graphs which plot the running times in CPU ticks as a function of:
- n*n (the same chart should contain a set of graphs, one for each nnz)
- n*n/nnz (one graph)
- Compare the performance of sparse and regular matrices. Which representation is better, and when?
59. Question 2.4: sparse_matop_arr
- Implement the additional matrix operations of addition and subtraction as well as transpose (+, -, t) for sparse matrices represented as compressed arrays.
- The functions should comply with the prototypes in matrix_manipulate_arr.h.
- Each function should be implemented in the most efficient and cache-friendly way.
60. Question 2.4: sparse_matop_arr
- Input format examples:
- + 10 4 ....
- - 10 4 ....
- t 10 4 ....
- The output is in the same format as in 2.1.
- You should test your program by yourself on your own examples (use Matlab or any online matrix calculator).
61. Files and Locations
- All of your files should be located under your home directory: /soft-proj07c/assign2/ (no subdirectories should be created this time).
- The source files and the executable should match the exercise name (e.g. sparse_matop_arr.c, sparse_matop_arr).
- Strictly follow the provided prototypes and the file framework.
62. File Layout: Exercise 2.1
- The file layout for exercise 2.1 is:
- mlpl (exe), built from:
- mlpl.c (exercise code)
- matrix_manipulate.c / matrix_manipulate.h (matrix operations)
- allocate_free.c / allocate_free.h (memory management)
- matrix.h
63. File Layout: Exercise 2.2
- The file layout for exercise 2.2 is:
- matop (exe), built from:
- matop.c
- matrix_manipulate.c / matrix_manipulate.h
- allocate_free.c / allocate_free.h
- matrix.h
64. File Layout: Exercise 2.3
- The file layout for exercise 2.3 is:
- sparse_mlpl_arr (exe), built from:
- sparse_mlpl_arr.c
- matrix_manipulate_arr.c / matrix_manipulate_arr.h
- allocate_free.c / allocate_free.h
- matrix.h
65. File Layout: Exercise 2.4
- The file layout for exercise 2.4 is:
- sparse_matop_arr (exe), built from:
- sparse_matop_arr.c
- matrix_manipulate_arr.c / matrix_manipulate_arr.h
- allocate_free.c / allocate_free.h
- matrix.h
66. Notes on File Layout
- You should strictly follow the provided layout.
- You should strictly follow the provided prototypes.
- You are free to add additional files and functions:
- Add the function to the .c file.
- Add its prototype to the corresponding .h file.
- Explain the goal of each module (function/file) in the manual and as a comment in the code.
67. Exercise 2 Notes
- Avoid unnecessary memory allocations.
- Read the input directly into the sparse matrix.
- Free all the allocated memory.
- Consult the website for recent updates and announcements.
- Use your own judgment!!!
68. The Manual
- You should submit a printed manual of your exercise.
- The manual should contain:
- All the explanations and the answers to all the questions highlighted with boxes.
- A printout of all of your source files (.c and .h).
69. Part 4: Notes on C Functions
70. rand()
- int rand() returns a pseudo-random integer between 0 and RAND_MAX (32767).

#include <stdio.h>
#include <stdlib.h>

int main()
{
    int i;
    for (i = 0; i < 5; i++)
        printf("%d\n", rand());
    return 0;
}
71. Seed Definition
- Observation: the results of several runs are identical.
- Explanation: the initial seed used by rand() is identical each time.
- The seed:
- Used for the generation of the random numbers.
- Explicitly specified using the srand() function.
72. srand()
- void srand(unsigned int seed)
- Sets seed as the seed for a new sequence of pseudo-random integers to be returned by rand().
- The sequences can be made repeatable by calling srand() with the same seed value.
- The system time is a good choice as an argument for srand(): it will be different each time the program is run, so the results will also be different.
73. rand() and srand()

unsigned long int next = 1;

int rand(void)
{
    next = next * 1103515245 + 12345;
    return (unsigned int)(next / 65536) % 32768;
}

/* Sets the seed for the rand() function */
void srand(unsigned int seed)
{
    next = seed;
}
74. srand()

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main()
{
    int i;
    /* Set the seed */
    srand(time(NULL));
    for (i = 0; i < 5; i++)
        printf("%d\n", rand());
    return 0;
}

srand(time(NULL)) is the same as:

int seed;
seed = time(NULL);
srand(seed);
75. Struct Syntax
- There are two equivalent options to access the fields of a pointer to a struct.
- When accessing the fields of a pointer to a struct, you should be sure that it points to allocated memory!!!

sparse_matrix_arr *a;
a->colptr[j] = 3;
(*a).colptr[j] = 3;
76. Precedence of C Operators
[Table: C operator precedence]
77. Memory Allocation
- Dynamically allocated regions of storage are like arrays; use array-like subscripting notation on them.
- Keep track of the allocated memory.
- It is easy to accidentally use a pointer which points "nowhere". It can point "somewhere", just not where you wanted it to!!!

int *p;   /* uninitialized pointer */
*p = 0;   /* the 0 gets written "somewhere" */
78. Allocating Memory for Integers
- Goal: allocate an array for 100 integers.
- It's much safer and more portable to let C compute the size.
- Use the sizeof() operator, which computes the size, in bytes, of a variable or type.

int *ip = (int *) malloc(100 * sizeof(int));

- sizeof() is an operator, and it is evaluated at compile time.
- The array can be accessed as ip[0], ip[1], ... up to ip[99].
79. Verifying Allocated Memory
- No real computer has an infinite amount of memory available.
- There is no guarantee that malloc will be able to give us as much memory as we ask for.
- malloc returns a null pointer upon failure.
- It's vital to check the returned pointer before using it!!

int *ip = (int *) malloc(100 * sizeof(int));
if (ip == NULL) {
    printf("ERROR: out of memory\n");
    return -1;
}
80. Freeing Memory
- Memory allocated with malloc does not automatically disappear when a function returns, as automatic-duration variables do.
- The allocated memory does not have to remain for the entire duration of your program.
- Memory is not inexhaustible.
- Deallocate (free) memory when you don't need it any more:

/* p contains a pointer previously returned by malloc */
free(p);
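Putting slides 78-80 together, a minimal sketch of the complete allocate, check, use, free cycle:

#include <stdio.h>
#include <stdlib.h>

int main()
{
    int i;
    int *ip = (int *) malloc(100 * sizeof(int));  /* allocate */
    if (ip == NULL) {                             /* check */
        printf("ERROR: out of memory\n");
        return -1;
    }
    for (i = 0; i < 100; i++)                     /* use */
        ip[i] = i;
    printf("last element: %d\n", ip[99]);
    free(ip);                                     /* free */
    return 0;
}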
81. Good Luck in the Exercise!!!