Title: Introduction to Computer Hardware
Array Libraries
- Functional extensions of C and Fortran 77 with array or vector libraries
- The libraries are supposed to be optimised for each particular computer
- Regular compilers can be used => no need for dedicated optimising compilers
- One of the most well-known and well-designed array libraries is the Basic Linear Algebra Subprograms (BLAS)
  - Provides basic array operations for numerical linear algebra
  - Available for most modern VP and SP computers
BLAS
- All BLAS routines are divided into 3 main categories
  - Level 1 BLAS addresses scalar and vector operations
  - Level 2 BLAS addresses matrix-vector operations
  - Level 3 BLAS addresses matrix-matrix operations
- Routines of Level 1 do
  - vector reduction operations
  - vector rotation operations
  - element-wise and combined vector operations
  - data movement with vectors
Level 1 BLAS
- A vector reduction operation
  - The addition of the scaled dot product of two real vectors x and y into a scaled scalar r
- The C interface of the routine implementing the operation is
    void BLAS_ddot( enum blas_conj_type conj, int n, double alpha,
                    const double *x, int incx, double beta,
                    const double *y, int incy, double *r );
Level 1 BLAS (ctd)
- Other routines doing reduction operations
  - Compute different vector norms of vector x
  - Compute the sum of the entries of vector x
  - Find the smallest or biggest component of vector x
  - Compute the sum of squares of the entries of vector x
- Routines doing rotation operations
  - Generate Givens plane rotation
  - Generate Jacobi rotation
  - Generate Householder transformation
Level 1 BLAS (ctd)
- An element-wise vector operation
  - The scaled addition of two real vectors x and y
- The C interface of the routine implementing the operation is
    void BLAS_dwaxpby( int n, double alpha, const double *x, int incx,
                       double beta, const double *y, int incy,
                       double *w, int incw );
- Function BLAS_cwaxpby does the same operation but on complex vectors
Level 1 BLAS (ctd)
- Other routines doing element-wise operations
  - Scale the entries of a vector x by the real scalar 1/a
  - Scale a vector x by a and a vector y by b, add these two vectors to one another and store the result in the vector y
  - Combine a scaled vector accumulation and a dot product
  - Apply a plane rotation to vectors x and y
Level 1 BLAS (ctd)
- An example of data movement with vectors
  - The interchange of real vectors x and y
- The C interface of the routine implementing the operation is
    void BLAS_dswap( int n, double *x, int incx,
                     double *y, int incy );
- Function BLAS_cswap does the same operation but on complex vectors
Level 1 BLAS (ctd)
- Other routines doing data movement with vectors
  - Copy vector x into vector y
  - Sort the entries of real vector x in increasing or decreasing order, overwrite x with the sorted vector, and compute the corresponding permutation vector p
  - Scale the entries of a vector x by the real scalar 1/a
  - Permute the entries of vector x according to permutation vector p
Level 2 BLAS
- Routines of Level 2
  - Compute different matrix-vector products
  - Do addition of scaled matrix-vector products
  - Compute multiple matrix-vector products
  - Solve triangular equations
  - Perform rank one and rank two updates
- Some operations use symmetric or triangular matrices
Level 2 BLAS (ctd)
- To store matrices, the following schemes are used
  - Column-based and row-based storage
  - Packed storage for symmetric or triangular matrices
  - Band storage for band matrices
- Conventional storage
  - An nxn matrix A is stored in a one-dimensional array a with a leading stride s
  - a_ij => a[i+j*s] (C, column-wise storage)
  - a_ij => a[j+i*s] (C, row-wise storage)
  - If s=n, rows (columns) will be contiguous in memory
  - If s>n, there will be a gap of (s-n) memory elements between two successive rows (columns)
  - Only significant elements of symmetric/triangular matrices need be set
Packed Storage
- Packed storage
  - The relevant triangle of a symmetric/triangular matrix is packed by columns or rows in a one-dimensional array
  - The upper triangle of an nxn matrix A may be stored in a one-dimensional array a
  - a_ij (i<=j) => a[j+i*(2n-i-1)/2] (C, row-wise storage)
Band Storage
- Band storage
  - A compact storage scheme for band matrices
  - Consider Fortran and a column-wise storage scheme
  - An mxn band matrix A with l subdiagonals and u superdiagonals may be stored in a 2-dimensional array A with l+u+1 rows and n columns
  - Columns of matrix A are stored in corresponding columns of array A
  - Diagonals of matrix A are stored in rows of array A
  - a_ij => A(u+i-j, j) for max(0, j-u) <= i <= min(m-1, j+l)
Level 2 BLAS (ctd)
- An example of a matrix-vector multiplication operation
  - The scaled addition of a real m-length vector y and the product of a general real mxn matrix A and a real n-length vector x
- The C interface of the routine implementing this operation is
    void BLAS_dgemv( enum blas_order_type order,
                     enum blas_trans_type trans,
                     int m, int n, double alpha,
                     const double *a, int stride,
                     const double *x, int incx,
                     double beta, double *y, int incy );
- Parameters
  - order => blas_rowmajor or blas_colmajor
  - trans => blas_no_trans (do not transpose A)
Level 2 BLAS (ctd)
- If matrix A is a general band matrix with l subdiagonals and u superdiagonals, the function
    void BLAS_dgbmv( enum blas_order_type order,
                     enum blas_trans_type trans,
                     int m, int n, int l, int u,
                     double alpha, const double *a, int stride,
                     const double *x, int incx,
                     double beta, double *y, int incy );
  makes better use of memory. It assumes that a band storage scheme is used to store matrix A.
Level 2 BLAS (ctd)
- Other routines of Level 2 perform further operations of these kinds, as well as many others
- For any matrix-vector operation with a specific matrix operand (triangular, symmetric, banded, etc.), there is a routine for each storage scheme that can be used to store the operand
Level 3 BLAS
- Routines of Level 3 do
  - O(n^2) matrix operations
    - norms, diagonal scaling, scaled accumulation and addition
    - different storage schemes to store matrix operands are supported
  - O(n^3) matrix-matrix operations
    - multiplication, solving matrix equations, symmetric rank k and 2k updates
  - Data movement with matrices
Level 3 BLAS (ctd)
- An example of an O(n^2) matrix operation, which scales two real mxn matrices A and B and stores their sum in a matrix C, is C = alpha*A + beta*B
- The C interface of the routine implementing this operation, under the assumption that the matrices A, B and C are of the general form, is
    void BLAS_dge_add( enum blas_order_type order, int m, int n,
                       double alpha, const double *a, int stride_a,
                       double beta, const double *b, int stride_b,
                       double *c, int stride_c );
- There are 15 other routines performing this operation for different types and forms of the matrices A, B and C
Level 3 BLAS (ctd)
- An example of an O(n^3) matrix-matrix operation involving a real mxn matrix A, a real nxk matrix B, and a real mxk matrix C is C = alpha*A*B + beta*C
- The C routine implementing the operation for matrices A, B and C in the general form is
    void BLAS_dgemm( enum blas_order_type order,
                     enum blas_trans_type trans_a,
                     enum blas_trans_type trans_b,
                     int m, int n, int k, double alpha,
                     const double *a, int stride_a,
                     const double *b, int stride_b,
                     double beta, double *c, int stride_c );
Level 3 BLAS (ctd)
- Data movement with matrices includes
  - Copying matrix A or its transpose, storing the result in matrix B: B = A or B = A^T
  - Transposition of a square matrix A with the result overwriting matrix A: A = A^T
  - Permutation of the rows or columns of matrix A by a permutation matrix P: A = P*A or A = A*P
- Different types and forms of matrix operands as well as different storage schemes are supported
Sparse BLAS
- Sparse BLAS
  - Provides routines for unstructured sparse matrices
  - Poorer functionality compared to Dense and Banded BLAS
    - only some basic array operations used in solving large sparse linear equations using iterative techniques
    - matrix multiply, triangular solve, sparse vector update, dot product, gather/scatter
  - Does not specify methods to store a sparse matrix
    - the storage format depends on the algorithm, the original sparsity pattern, the format in which the data already exists, etc.
    - sparse matrix arguments are a placeholder, or handle, which refers to an abstract representation of a matrix, not the actual data components
Sparse BLAS (ctd)
- Several routines are provided to create sparse matrices
  - The internal representation is implementation dependent
  - Sparse BLAS applications are independent of the matrix storage scheme, relying on the scheme provided by each implementation
- A typical Sparse BLAS application
  - Creates an internal sparse matrix representation and returns its handle
  - Uses the handle as a parameter in computational Sparse BLAS routines
  - Calls a cleanup routine to free resources associated with the handle, when the matrix is no longer needed
Example
- Example. Consider a C program using Sparse BLAS to perform the matrix-vector operation y = Ax, where A is a 4x4 sparse matrix with 6 nonzero entries (defined by the values and index arrays in the code below)
Example (ctd)
    #include <blas_sparse.h>
    int main()
    {
      const int n = 4, nonzeros = 6;
      double values[]  = {1.1, 2.2, 2.4, 3.3, 4.1, 4.4};
      int index_i[] = {0, 1, 1, 2, 3, 3};
      int index_j[] = {0, 1, 3, 2, 0, 3};
      double x[] = {1.0, 1.0, 1.0, 1.0}, y[] = {0.0, 0.0, 0.0, 0.0};
      blas_sparse_matrix A;
      int k;
      double alpha = 1.0;

      A = BLAS_duscr_begin(n, n);            // Create Sparse BLAS handle
      for(k = 0; k < nonzeros; k++)          // Insert entries one by one
        BLAS_duscr_insert_entry(A, values[k], index_i[k], index_j[k]);
      BLAS_uscr_end(A);                      // Complete construction of sparse matrix

      // Compute matrix-vector product y = A*x
      BLAS_dusmv(blas_no_trans, alpha, A, x, 1, y, 1);
      ...
Parallel Languages
- C and Fortran 77 do not reflect some essential features of VP and SP architectures
  - They cannot play the same role for VPs and SPs
- Optimizing compilers
  - Only for a simple and limited class of applications
- Array libraries
  - Cover a limited class of array operations
  - Other array operations can only be expressed as a combination of the locally-optimized library array operations
  - This excludes global optimization of combined array operations
Parallel Languages (ctd)
- Parallel extensions of C and Fortran 77 allow programmers
  - To explicitly express, in a portable form, any array operation
  - The compiler does not need to recognize code to parallelize
  - Global optimisation of operations on arrays is possible
- We consider 2 parallel supersets of C and Fortran 77
  - Fortran 90
  - C[]
Fortran 90
- Fortran 90 is a new Fortran standard released in 1991
  - Widely implemented since then
- Two categories of new features
  - Modernization of Fortran according to the state-of-the-art in serial programming languages
  - Support for explicit expression of operations on arrays
Fortran 90 (ctd)
- Serial extensions include
  - Free-format source code and some other simple improvements
  - Dynamic memory allocation (automatic arrays, allocatable arrays, and pointers with associated heap storage management)
  - User-defined data types (structures)
  - Generic user-defined procedures (functions and subroutines) and operators

Fortran 90 (ctd)
- Serial extensions (ctd)
  - Recursive procedures
  - New control structures to support structured programming
  - A new program unit, MODULE, for encapsulation of data and a related set of procedures
- We focus on parallel extensions
Fortran 90 (ctd)
- Fortran 90 considers arrays first-class objects
  - Whole-array operations, assignments, and functions
- Operations and assignments are extended in an obvious way, on an element-by-element basis
- Intrinsic functions are array-valued for array arguments
  - They operate element-wise if given an array as their argument
- Array expressions may include scalar constants and variables, which are replicated (or expanded) to the required number of elements
Fortran 90 (ctd)
- Example.
    REAL, DIMENSION(3,4,5) :: a, b, c, d

    c = a + b
    d = SQRT(a)
    c = a + 2.0
WHERE Structure
- Sometimes, some elements of arrays in an array-valued expression should be treated specially
  - Division by zero in a = 1./a should be avoided
- WHERE statement
    WHERE (a /= 0.) a = 1./a
- WHERE construct
    WHERE (a /= 0.)
      a = 1./a
    ELSEWHERE
      a = HUGE(a)
    END WHERE
Fortran 90 (ctd)
- All the arrays in an array-valued expression or array assignment must be conformable, i.e., they must have the same shape
  - the same number of axes
  - the same number of elements along each axis
- Example.
    REAL a(3,4,5), b(0:2,4,5), c(3,4,-1:3)
  - Arrays a, b, and c have the same rank of 3, extents of 3, 4, and 5, shape of (3,4,5), and size of 60
  - They only differ in the lower and upper dimension bounds
Array Section
- An array section can be used everywhere in array assignments and array-valued expressions where a whole array is allowed
- An array section may be specified with subscripts of the triplet form lower:upper:stride
- It designates an ordered set of indices i1, ..., ik such that
  - i1 = lower
  - i(j+1) = i(j) + stride (j = 1, ..., k-1)
  - | ik - upper | < | stride |
Array Section (ctd)
- Example. REAL a(50,50)
- What sections are designated by the following expressions? What are the rank and shape of each section?
  - a(i,1:50:1), a(i,1:50)
  - a(i,:)
  - a(i,1:50:3)
  - a(i,50:1:-1)
  - a(11:40,j)
  - a(1:10,1:10)
Array Section (ctd)
- Vector subscripts may also be used to specify array sections
- Any expression whose value is a rank 1 integer array may be used as a vector subscript
- Example.
    REAL a(5,5), b(5)
    INTEGER index(5)
    index = (/5,4,3,2,1/)
    b = a(index,1)
Array Section (ctd)
- Whole arrays and array sections of the same shape can be mixed in expressions and assignments
- Note that, unlike a whole array, an array section may not occupy contiguous storage locations
Array Constants
- Fortran 90 introduces array constants, or array constructors
  - The simplest form is just a list of elements enclosed in (/ and /)
  - May contain lists of scalars, lists of arrays, and implied-DO loops
- Examples.
    (/ (0, i=1,50) /)
    (/ (3.14*i, i=4,100,3) /)
    (/ ( (/ 5,4,3,2,1 /), i=1,5 ) /)
Array Constants (ctd)
- Array constructors can only produce 1-dimensional arrays
- The function RESHAPE can be used to construct arrays of higher rank
    REAL a(500,500)
    a = RESHAPE( (/ (0., i=1,250000) /), (/ 500,500 /) )
Assumed-Shape and Automatic Arrays
- Consider the user-defined procedure operating on arrays
    SUBROUTINE swap(a,b)
      REAL, DIMENSION(:,:) :: a, b
      REAL, DIMENSION(SIZE(a,1), SIZE(a,2)) :: temp
      temp = a
      a = b
      b = temp
    END SUBROUTINE swap
Assumed-Shape and Automatic Arrays (ctd)
- Formal array arguments a and b are of assumed shape
  - Only the type and rank are specified
  - The actual shape is taken from that of the actual array arguments
- The local array temp is an example of an automatic array
  - Its size is set at runtime
  - It stops existing as soon as control leaves the procedure
Intrinsic Array Functions
- Intrinsic array functions include
  - Extension of intrinsic functions such as SQRT, SIN, etc. to array arguments
  - Specific array intrinsic functions
- Specific array intrinsic functions do the following
  - Compute the scalar product of two vectors (DOT_PRODUCT) and the matrix product of two matrices (MATMUL)
Specific Intrinsic Array Functions
- Perform diverse reduction operations on an array
  - logical multiplication (ALL) and addition (ANY)
  - counting the number of true elements in the array (COUNT)
  - arithmetical multiplication (PRODUCT) and addition (SUM) of its elements
  - finding the smallest (MINVAL) or the largest (MAXVAL) element
Specific Intrinsic Array Functions (ctd)
- Return diverse attributes of an array
  - its shape (SHAPE)
  - the lower dimension bounds of the array (LBOUND)
  - the upper dimension bounds (UBOUND)
  - the number of elements (SIZE)
  - the allocation status of the array (ALLOCATED)
Specific Intrinsic Array Functions (ctd)
- Construct arrays by means of
  - merging two arrays under mask (MERGE)
  - packing an array into a vector (PACK)
  - replication of an array by adding a dimension (SPREAD)
  - unpacking a vector (a rank 1 array) into an array under mask (UNPACK)
Specific Intrinsic Array Functions (ctd)
- Reshape arrays (RESHAPE)
- Move array elements performing
  - the circular shift (CSHIFT)
  - the end-off shift (EOSHIFT)
  - the transpose of a rank 2 array (TRANSPOSE)
- Locate the first maximum (MAXLOC) or minimum (MINLOC) element in an array
C[]
- C[] (C brackets) is a strict ANSI C superset allowing programmers to explicitly describe operations on arrays
- Vector value, or vector
  - An ordered set of values (or vector values) of any one type
- Any vector type is characterised by
  - the number of elements
  - the type of elements
Vector Value and Vector Object
- Vector object
  - A region of data storage, the contents of which can represent vector values
  - An ordered sequence of objects (or vector objects) of any one type
- Unlike ANSI C, C[] defines the notion of the value of an array object
  - This value is a vector
Vector Value and Vector Object (ctd)
- Example. The value of the array
    int a[3][2] = {0,1,2,3,4,5};
  is the vector
    {{0,1}, {2,3}, {4,5}}
- This vector has the shape {3,2}
- This vector type is named int[3][2]
- The shape of an array is that of its vector value
- In C[], an array object is a particular case of a vector object
Arrays and Pointers
- A C array is a contiguously allocated set of elements of any one type of object
- A C[] array is a set of elements of any one type of object sequentially allocated with a positive stride
  - The stride is the distance between successive elements of the array, measured in units equal to the size of an array element
  - If the stride is not specified, it is assumed to be 1
Arrays and Pointers (ctd)
- A C[] array has at least three attributes
  - the type of elements
  - the number of elements
  - the allocation stride
Arrays and Pointers (ctd)
- Example 1.
    int a[3];
    int a[3:3];
  In the second declaration (3 elements with stride 3), the slot between successive array elements is of 2*sizeof(int) bytes
Arrays and Pointers (ctd)
- In C, a pointer has only one attribute
  - the type of object it points to
- It is needed to correctly interpret
  - the value of the object it points to
  - the additive operators + and - (operand(s) and result should point into the same array)
- In C[], a pointer has an additional attribute, stride
  - If the stride is not specified, it is assumed to be 1
Arrays and Pointers (ctd)
- Example 1. The declarations
    int a[] = {0,1,2,3,4};
    int * p1 = (void*)a;
    int *2 p2 = (void*)&a[4];
  form the following structure of storage:
  p1+2 and p2-1 point to the same array element, a[2]
Arrays and Pointers (ctd)
- Expressions e1[e2] or (e2)[e1] provide access to the e2-th element of an array e1
  - Identical to (*((e1)+(e2)))
  - e2 is an integer expression
  - e1 is an lvalue that has the type "array of type"
    - converted to an expression of the type "pointer to type" pointing to the initial element of the array object
    - the attribute "stride" of this pointer is identical to that of the array object
Arrays and Pointers (ctd)
- C[] allows dynamic arrays
    typedef int (*pDiag)[n:n+1];
    int a[n][n];
    int j;
    pDiag p = (void*)a;
    ...
    for(j=0; j<n; j++)
      (*p)[j] = 1;
Blocking Operator
- In C[], the value of an array object is a vector
  - The i-th element of the vector is the value of the i-th element of the array object
- The postfix operator [] (the blocking operator)
  - Supports access to an array as a whole
  - Its operand has the type "array of type"
  - Blocks the conversion of the operand to a pointer
- Example. int a[5], b[5:2], c[5:3];
  - a[], b[], and c[] designate arrays a, b, and c as a whole
  - c[] = a[] + b[]
Lvector
- In C[], an lvalue is an expression designating an object
- Example. int d[5][5];
  - d[i][j], d, and d[0] are lvalues
  - d[i][j]+1 and d+0 are not
- Modifiable lvalue
  - May be the left operand of an assignment operator
  - d[i][j] is a modifiable lvalue
  - d and d[0] are not
Lvector (ctd)
- In C[], an lvector is an expression designating a vector object
- Modifiable lvector
  - May be the left operand of an assignment operator
- Example. int d[5][5];
  - d[], d[0][], d, and d[0] are lvectors
  - d[] and d[0][] are modifiable
  - d and d[0] are not modifiable
Lvector (ctd)
- Example. int a[4][4];
    *((int(*)[4:5])a)
Subarray
- An object belongs to an array if
  - It is an element of the array, or
  - It belongs to an element of the array
- Subarray
  - A set of objects belonging to an array that is an array itself
- Example (ctd). The main diagonal is a subarray
  - It is an array object of the type int[4:5]
Subarray (ctd)
- Example. int a[4][4];
    *((int(*)[3:5])(&a[0][1]))
Subarray (ctd)
- Not every regular set of objects belonging to an array makes up its subarray
- Example. int a[5][5];
  No constant modifiable lvector designates the inner square of this array
Array Section
- The operator [l:r:s] (the grid operator)
  - Supports access to array sections of general form
- Syntax. e[l:r:s]
  - Expression e may have type "array of type" or "pointer to type"
  - Expressions l, r, and s have integer types and denote
    - the left bound
    - the right bound
    - the stride
Array Section (ctd)
- Semantics. e[l:r:s] designates
  - A vector object of (r-l)/s+1 elements of type "type"
  - Its i-th element is e[l+s*i]
- Expression e[l:r:s] is an lvector
- Expression e[l:r:s] is a modifiable lvector if
  - All expressions e[l+s*i], i=0,1,..., are modifiable lvectors/lvalues
- e[l:r:1] ? e[l:r]
Array Section (ctd)
- Operand e in e[l:r:s] may have a vector type
  - The operator is applied element-wise
- Let the vector value of e be {u1, ..., uk}
  - e[l:r:s] will designate a vector of k vectors
  - The i-th element of the j-th vector will be uj[l+s*i] (j=1,...,k)
Example
- a[1:3][1:3]
Array Section (ctd)
- Operands l and/or r in e[l:r:s] may be omitted
  - If l is omitted, the left bound is set to 0
  - If r is omitted, the right bound is
    - set to n-1, if the first operand e is an n-element array
    - determined from the context, if e is a pointer
- Example. int a[5][5];
  - a[:] ? a[]
Element-Wise Vector Operators
- The operand of the cast operator and the unary &, *, +, -, ~, !, ++, and -- operators may have a vector type
- The operators are applied element-wise
- Example.
    int j, k, l, m, n;
    int *p[5] = {&j, &k, &l, &m, &n};
  *(p[1:3]) designates a vector object consisting of the three integer variables k, l, and m.
Element-Wise Vector Operators (ctd)
- The binary operators *, /, %, +, -, <<, >>, <, >, <=, >=, ==, !=, &, ^, |, &&, and || may have vector operands
- If the operands have the same shape, then the operator is executed element-wise, producing a result of this shape
Element-Wise Vector Operators (ctd)
- In general, the operands may have different shapes, but they must be conformable
- Two operands are conformable iff the beginning of the shape of one operand is identical to the shape of the other operand
  - Vectors having shapes {9,8,7,6} and {9,8} are conformable
- A non-vector operand is conformable with any vector operand (why?)
Element-Wise Vector Operators (ctd)
- Let operands a and b be conformable, and rank(a) < rank(b)
- The execution of the operator starts from a conformable extension of the value of a to the shape of b
  - The conformable extension just replicates the value by adding dimensions
Element-Wise Vector Operators (ctd)
- Example. The conformable extension of the vector {1,2,3,4,5,6} of shape {2,3} to shape {2,3,2} is the vector {1,1,2,2,3,3,4,4,5,5,6,6}
- Then the operator is applied element-wise to the result of the conformable extension of the value of a and the value of b, producing a result of the same shape as that of b
Element-Wise Vector Operators (ctd)
- The assignment operators =, +=, *=, etc. may have vector operands
  - The left operand shall be a modifiable lvector
  - Its rank shall not be less than that of the right operand
  - The operands shall be conformable
- Two-step execution
  - The right operand is conformably extended to the shape of the left one
  - The assignment is executed element-wise
Element-Wise Vector Operators (ctd)
- Example.
    int a[m][n], b[m];
    ...
    a[] = b[];
Example
- LU factorization of the square matrix a by using Gaussian elimination
    double a[n][n], t;
    int i, j;
    ...
    for(i=0; i<n; i++)
      for(j=i+1; j<n; j++) {
        t = a[j][i]/a[i][i];
        if(a[j][i] != 0.)
          a[j][i:n-1] -= t*a[i][i:n-1];
      }
Element-Wise Vector Operators (ctd)
- By definition, e1[e2] is identical to (*((e1)+(e2)))
- Therefore, e1 and e2 may be of vector type
- The programmer can construct lvectors that designate irregular array sections
- Example.
    int a[m][n], ind[] = {0, 1, 6, 18};
    ...
    a[ind[]] = 0;
- This code zeros the elements of the 0-th, 1-st, 6-th, and 18-th rows of array a
Element-Wise Vector Operators (ctd)
- The first operand of the . operator may have a vector type
  - The second operand shall name a member of a structure or union type
  - The operator is executed element-wise
  - The result will have the same shape as the first operand
- e->id is identical to (*e).id
Reduction Operators
- Reduction operators [+], [*], [&], [^], [|], [&&], [||], [?<], and [?>]
  - Unary operators
  - Only applicable to vector operands
- If v1, ..., vn are the elements of the vector value of the expression e, then the value of the expression [op]e is that of the expression (v1 op v2 op ... op vn)
Examples
- Example 1. Dot product of the vectors a and b.
    double a[n];
    double b[n];
    double c;
    ...
    c = [+](a[]*b[]);
Examples (ctd)
- Example 2. Maximal element of the matrix a
    int a[m][n];
    int max;
    ...
    max = [?>][?>]a[];
Examples (ctd)
- Example 3. Multiplication of matrices a and b
    double a[m][l];
    double b[l][n];
    double c[m][n];
    int i;
    ...
    for(i=0; i<m; i++)
      c[i][] = [+](a[i][]*b[][]);
Memory Hierarchy
- Parallel programming systems for VPs and SPs take into account their modern memory structure
- Optimal memory management is often more efficient than optimal usage of IEUs
- Approaches to optimal memory management appear surprisingly similar to optimisation of parallel facilities
- Simple two-level memory model
  - Small and fast register memory
  - Large and relatively slow main memory
Memory Hierarchy (ctd)
- A simple modern memory hierarchy
  - Register memory
  - Cache memory
  - Main memory
  - Disk memory
- Cache memory
  - A buffer memory between main memory and registers
  - Holds copies of some data from the main memory
Memory Hierarchy (ctd)
- Execution of an instruction reading a data item from the main memory into a register
  - Check if a copy of the data item is already in the cache
  - If so, the data item will actually be transferred into the register from the cache
  - If not, the data item will be transferred into the register from the main memory, and a copy of the item will appear in the cache
Cache
- Cache
  - Partitioned into cache lines
  - A cache line is the minimum unit of data transfer between the cache and the main memory
  - Scalars may be transferred only as a part of a cache line
  - Much smaller than the main memory
  - The same cache line may reflect different data blocks from the main memory
Cache (ctd)
- Types of cache memory
  - Direct mapped
    - each block of the main memory has only one place it can appear in the cache
  - Fully associative
    - a block can be placed anywhere in the cache
  - Set associative
    - a block can be placed in a restricted set of places
    - a set is a group of two or more cache lines
    - n-way associative cache
Cache (ctd)
- A cache miss is the situation when a data item being referenced is not in the cache
- Minimization of cache misses can significantly accelerate execution of the program
- Programs intensively using basic operations on arrays are obviously suitable for that type of optimization
Loop Tiling
- The main specific optimization minimizing the number of cache misses is loop tiling
- Consider the loop nest
    for(i=0; i<m; i++)     /* loop 1 */
      for(j=0; j<n; j++)   /* loop 2 */
        if(i==0)
          b[j] = a[i][j];
        else
          b[j] += a[i][j];
- The b[j] are repeatedly used by successive iterations of loop 1
Loop Tiling (ctd)
- If n is large enough, the data items may be flushed from the cache by the moment of their repeated use
- To minimize the flushing of repeatedly used data items, the number of iterations of loop 2 may be decreased
- To keep the total number of iterations of this loop nest unchanged, an additional controlling loop is introduced
Loop Tiling (ctd)
- The transformed loop nest is
    for(k=0; k<n; k+=T)                 // additional controlling loop 0
      for(i=0; i<m; i++)                // loop 1
        for(j=k; j<min(k+T,n); j++)     // loop 2
          if(i==0)
            b[j] = a[i][j];
          else
            b[j] += a[i][j];
- This transformation is called tiling
- T is the tile size
Loop Tiling (ctd)
- In general, loop tiling is applied to loop nests of the form
    for(i1=...)       /* loop 1 */
      for(i2=...)     /* loop 2 */
        ...
        for(in=...)   /* loop n */
        {
          ...
          e[i2]...[in]
          ...
        }
- The goal is to minimize the number of cache misses for the reference e[i2]...[in], which is repeatedly used by successive iterations of loop 1
Loop Tiling and Optimising Compilers
- The recognition of loop nests that can be tiled is the most difficult problem to be solved by optimising C and Fortran 77 compilers
  - Based on the analysis of data dependencies in loop nests
- Theorem. Loop tiling is legally applicable (to the above loop nest) iff the loops from loop 2 to loop n are fully interchangeable
- To prove the interchangeability, an analysis of data dependence between different iterations of the loop nest is needed
Loop Tiling and Array Libraries
- Level 3 BLAS is specified to support block algorithms of matrix-matrix operations
- Partitioning matrices into blocks and performing the computation on the blocks maximizes the reuse of data held in the upper levels of the memory hierarchy
Loop Tiling and Parallel Languages
- Compilers for parallel languages do not need to recognize loops suitable for tiling
- They can translate explicit operations on arrays into loop nests with the best possible temporal locality
Virtual memory
- Instructions address virtual memory rather than the real physical memory
- The virtual memory is partitioned into pages of a fixed size
- Each page is stored on a disk until it is needed
- When the page is needed, it is copied to main memory, with the virtual addresses mapping into real addresses
- This copying is known as paging or swapping
Virtual memory (ctd)
- Programs processing large enough arrays do not fit into main memory
- Swapping takes place each time required data are not in the main memory
- Swapping is a very expensive operation
- Minimizing the number of swaps can significantly accelerate the programs
- The problem is similar to the minimization of cache misses and can therefore be approached similarly
Vector and Superscalar Processors Summary
- VPs and SPs provide instruction-level parallelism, which is best exploited by applications with intensive operations on arrays
- Such applications can be written in a serial programming language and compiled by dedicated optimizing compilers performing some specific loop optimizations
  - Modular, portable, and reliable programming are supported
  - Efficiency and portable efficiency are also supported, but only for a limited class of programs
Vector and Superscalar Processors Summary (ctd)
- Array libraries allow programmers to avoid the use of dedicated compilers
- The programmers express operations on arrays directly using calls to carefully implemented subroutines
  - Modular, portable, and reliable programming are supported
  - Limited efficiency and portable efficiency
    - Excludes global optimization of combined array operations
Vector and Superscalar Processors Summary (ctd)
- Parallel languages combine the advantages of the first and second approaches
  - Operations on arrays can be explicitly expressed
  - No need for sophisticated algorithms to recognize parallelizable loops
  - Global optimisation of combined array operations is possible
- They support general-purpose programming (unlike existing array libraries)