Introduction to Computer Hardware - PowerPoint PPT Presentation

Provided by: alexeylas

1
Array Libraries
2
Array Libraries
  • Function extensions of C and Fortran 77 with
    array or vector libraries
  • The libraries are supposed to be optimised for
    each particular computer
  • Regular compilers can be used → no need for
    dedicated optimising compilers
  • One of the most well-known and well-designed
    array libraries is the Basic Linear Algebra
    Subprograms (BLAS)
  • Provides basic array operations for numerical
    linear algebra
  • Available for most modern VP and SP computers

3
BLAS
  • All BLAS routines are divided into 3 main
    categories
  • Level 1 BLAS addresses scalar and vector
    operations
  • Level 2 BLAS addresses matrix-vector operations
  • Level 3 BLAS addresses matrix-matrix operations
  • Routines of Level 1 perform
  • vector reduction operations
  • vector rotation operations
  • element-wise and combined vector operations
  • data movement with vectors

4
Level 1 BLAS
  • A vector reduction operation
  • The addition of the scaled dot product of two
    real vectors x and y into a scaled scalar r:
    r ← βr + α·xᵀy
  • The C interface of the routine implementing the
    operation is
        void BLAS_ddot( enum blas_conj_type conj, int n, double alpha,
                        const double *x, int incx, double beta,
                        const double *y, int incy, double *r );
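To make the semantics of r ← βr + α·xᵀy concrete, here is a minimal C sketch of the real-valued case. `ref_ddot` is a hypothetical name, not a BLAS routine; it assumes positive increments and omits `conj`, which has no effect on real data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference sketch (not the BLAS routine itself):
 * r <- beta*r + alpha * sum_i x[i*incx] * y[i*incy].
 * Assumes incx > 0 and incy > 0; conj is omitted for real data. */
void ref_ddot(int n, double alpha, const double *x, int incx,
              double beta, const double *y, int incy, double *r)
{
    assert(n >= 0 && incx > 0 && incy > 0);
    double dot = 0.0;
    for (int i = 0; i < n; i++)
        dot += x[(size_t)i * incx] * y[(size_t)i * incy];
    *r = beta * (*r) + alpha * dot;
}
```

With x = (1, 2, 3), every other element of y = (4, 0, 5, 0, 6, 0) (incy = 2), α = 2, β = 0.5 and r = 10, the dot product is 32 and r becomes 0.5·10 + 2·32 = 69.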

5
Level 1 BLAS (ctd)
  • Other routines doing reduction operations
  • Compute different vector norms of vector x
  • Compute the sum of the entries of vector x
  • Find the smallest or biggest component of vector
    x
  • Compute the sum of squares of the entries of
    vector x
  • Routines doing rotation operations
  • Generate Givens plane rotation
  • Generate Jacobi rotation
  • Generate Householder transformation
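Three of the reductions listed above can be sketched in a few lines of C. The names `ref_sum`, `ref_sumsq` and `ref_max_index` are hypothetical (they are not BLAS routines), strides are assumed positive, and "biggest component" is taken here to mean the largest value rather than the largest magnitude.

```c
#include <assert.h>

/* Sum of the entries of x (n entries, stride incx > 0). */
double ref_sum(int n, const double *x, int incx)
{
    assert(n >= 0 && incx > 0);
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i * incx];
    return s;
}

/* Sum of squares of the entries of x. */
double ref_sumsq(int n, const double *x, int incx)
{
    assert(n >= 0 && incx > 0);
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i * incx] * x[i * incx];
    return s;
}

/* 0-based index of the biggest component, or -1 when n == 0. */
int ref_max_index(int n, const double *x, int incx)
{
    assert(n >= 0 && incx > 0);
    int best = (n > 0) ? 0 : -1;
    for (int i = 1; i < n; i++)
        if (x[i * incx] > x[best * incx])
            best = i;
    return best;
}
```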

6
Level 1 BLAS (ctd)
  • An element-wise vector operation
  • The scaled addition of two real vectors x and y:
    w ← αx + βy
  • The C interface of the routine implementing the
    operation is
        void BLAS_dwaxpby( int n, double alpha, const double *x, int incx,
                           double beta, const double *y, int incy,
                           double *w, int incw );
  • Function BLAS_cwaxpby does the same operation but
    on complex vectors
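A minimal C sketch of w ← αx + βy for real vectors, showing how the three increments select elements. `ref_waxpby` is a hypothetical name and positive increments are assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the BLAS_dwaxpby operation: w <- alpha*x + beta*y.
 * incx, incy and incw are assumed to be positive. */
void ref_waxpby(int n, double alpha, const double *x, int incx,
                double beta, const double *y, int incy,
                double *w, int incw)
{
    assert(n >= 0 && incx > 0 && incy > 0 && incw > 0);
    for (int i = 0; i < n; i++)
        w[(size_t)i * incw] = alpha * x[(size_t)i * incx]
                            + beta * y[(size_t)i * incy];
}
```

Because w is a separate output vector, x and y are left unchanged, unlike an axpy-style update that overwrites y.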

7
Level 1 BLAS (ctd)
  • Other routines doing element-wise operations
  • Scale the entries of a vector x by the real
    scalar 1/a
  • Scale a vector x by a and a vector y by b, add
    these two vectors to one another and store the
    result in the vector y
  • Combine a scaled vector accumulation and a dot
    product
  • Apply a plane rotation to vectors x and y

8
Level 1 BLAS (ctd)
  • An example of data movement with vectors
  • The interchange of real vectors x and y
  • The C interface of the routine implementing the
    operation is
        void BLAS_dswap( int n, double *x, int incx,
                         double *y, int incy );
  • Function BLAS_cswap does the same operation but
    on complex vectors

9
Level 1 BLAS (ctd)
  • Other routines doing data movement with vectors
  • Copy vector x into vector y
  • Sort the entries of real vector x in increasing
    or decreasing order and overwrite this vector x
    with the sorted vector as well as compute the
    corresponding permutation vector p
  • Scale the entries of a vector x by the real
    scalar 1/a
  • Permute the entries of vector x according to
    permutation vector p
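The sort-with-permutation and permute operations above can be sketched as follows. Both function names are hypothetical, insertion sort is used only for brevity, and the convention that after sorting x[k] holds the entry originally at position p[k] is an assumption of this sketch, not necessarily the one a BLAS implementation uses.

```c
#include <assert.h>

/* Sort x (n contiguous entries) in increasing order in place and record
 * in p the permutation: afterwards x[k] is the entry originally at p[k]. */
void ref_sort_with_perm(int n, double *x, int *p)
{
    assert(n >= 0);
    for (int k = 0; k < n; k++)
        p[k] = k;
    for (int i = 1; i < n; i++) {          /* insertion sort */
        double v = x[i];
        int pv = p[i];
        int j = i - 1;
        while (j >= 0 && x[j] > v) {
            x[j + 1] = x[j];
            p[j + 1] = p[j];
            j--;
        }
        x[j + 1] = v;
        p[j + 1] = pv;
    }
}

/* Permute according to p (gather form): out[k] = x[p[k]]. */
void ref_permute(int n, const double *x, const int *p, double *out)
{
    assert(n >= 0);
    for (int k = 0; k < n; k++)
        out[k] = x[p[k]];
}
```

Applying `ref_permute` with the recorded p to a copy of the unsorted vector reproduces the sorted order.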

10
Level 2 BLAS
  • Routines of Level 2
  • Compute different matrix vector products
  • Do addition of scaled matrix vector products
  • Compute multiple matrix vector products
  • Solve triangular equations
  • Perform rank one and rank two updates
  • Some operations use symmetric or triangular
    matrices

11
Level 2 BLAS (ctd)
  • To store matrices, the following schemes are used
  • Column-based and row-based storage
  • Packed storage for symmetric or triangular
    matrices
  • Band storage for band matrices
  • Conventional storage
  • An nxn matrix A is stored in a one-dimensional
    array a with stride s ≥ n
  • a_ij → a[i + j·s] (C, column-wise storage)
  • a_ij → a[j + i·s] (C, row-wise storage)
  • If s = n, rows (columns) will be contiguous in
    memory
  • If s > n, there will be a gap of (s-n) memory
    elements between two successive rows (columns)
  • Only significant elements of symmetric/triangular
    matrices need be set
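The conventional-storage index maps can be written directly as C functions (0-based i and j, stride s ≥ n between successive columns or rows). The function names are hypothetical; the fill routine only demonstrates that with s > n the gap elements are never touched.

```c
#include <assert.h>
#include <stddef.h>

/* a_ij -> a[i + j*s] (column-wise) and a_ij -> a[j + i*s] (row-wise). */
size_t idx_colwise(int i, int j, int s) { return (size_t)i + (size_t)j * s; }
size_t idx_rowwise(int i, int j, int s) { return (size_t)j + (size_t)i * s; }

/* Fill a row-wise stored n x n matrix with a_ij = 10*i + j. */
void fill_rowwise(int n, int s, double *a)
{
    assert(s >= n);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[idx_rowwise(i, j, s)] = 10.0 * i + j;
}
```

For n = 3 and s = 5, row 0 occupies a[0..2], elements a[3] and a[4] form the gap of s-n = 2, and row 1 starts at a[5].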

12
Packed Storage
  • Packed storage
  • The relevant triangle of a symmetric/triangular
    matrix is packed by columns or rows in a
    one-dimensional array
  • The upper triangle of an nxn matrix A may be
    stored in a one-dimensional array a
  • a_ij (i ≤ j) → a[j + i(2n-i-1)/2] (C, row-wise
    storage)
  • Example.
    [Figure: example of packed storage]
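Row-wise packed storage keeps a_ij (i ≤ j, 0-based) at a[j + i(2n-i-1)/2]: row i starts after the n + (n-1) + … + (n-i+1) elements of the preceding rows. A minimal C sketch of that index function (the name is hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Index of a_ij (0 <= i <= j < n) in the row-wise packed upper triangle.
 * i*(2n-i-1) is always even, so the division by 2 is exact. */
size_t idx_packed_upper_rowwise(int n, int i, int j)
{
    assert(0 <= i && i <= j && j < n);
    return (size_t)j + (size_t)i * (2 * n - i - 1) / 2;
}
```

Walking the upper triangle row by row visits the indices 0, 1, …, n(n+1)/2 - 1 in order, so the n(n+1)/2 significant elements occupy the array without gaps.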
13
Band Storage
  • Band storage
  • A compact storage scheme for band matrices
  • Consider Fortran and a column-wise storage scheme
  • An mxn band matrix A with l subdiagonals and u
    superdiagonals may be stored in a 2-dimensional
    array A with l+u+1 rows and n columns
  • Columns of matrix A are stored in corresponding
    columns of array A
  • Diagonals of matrix A are stored in rows of array
    A
  • a_ij → A(u+i-j, j) for max(0, j-u) ≤ i ≤
    min(m-1, j+l)
  • Example.
    [Figure: example of band storage]
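The band mapping a_ij → A(u+i-j, j) can be sketched in C by flattening the (l+u+1) x n array A column by column. `in_band` and `band_index` are hypothetical helper names; indices are 0-based, as in the formula above.

```c
#include <assert.h>
#include <stddef.h>

/* Nonzero iff a_ij lies in the band: max(0, j-u) <= i <= min(m-1, j+l). */
int in_band(int m, int l, int u, int i, int j)
{
    int lo = (j - u > 0) ? j - u : 0;
    int hi = (j + l < m - 1) ? j + l : m - 1;
    return lo <= i && i <= hi;
}

/* Position of a_ij in the column-wise flattened band array whose
 * leading dimension is l+u+1: row u+i-j, column j. */
size_t band_index(int l, int u, int i, int j)
{
    assert(i - j <= l && j - i <= u);   /* a_ij must lie inside the band */
    return (size_t)(u + i - j) + (size_t)j * (size_t)(l + u + 1);
}
```

For a tridiagonal matrix (l = u = 1) each column of A holds the superdiagonal, diagonal and subdiagonal entries of one column of the matrix, so storage drops from m·n to 3n elements.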
14
Level 2 BLAS (ctd)
  • An example of matrix vector multiplication
    operation
  • The scaled addition of a real m-length vector y
    and the product of a general real mxn matrix A
    and a real n-length vector x:
    y ← αAx + βy
  • The C interface of the routine implementing this
    operation is
        void BLAS_dgemv( enum blas_order_type order,
                         enum blas_trans_type trans,
                         int m, int n,
                         double alpha, const double *a, int stride,
                         const double *x, int incx, double beta,
                         double *y, int incy );
  • Parameters
  • order → blas_rowmajor or blas_colmajor
  • trans → blas_no_trans (do not transpose A)
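A minimal C sketch of y ← αAx + βy for the no-transpose case, showing how the storage order and the stride (leading dimension) determine where a_ij lives. `ref_dgemv` and the enum `ref_order` are hypothetical stand-ins, not the BLAS names; positive increments are assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the storage-order parameter. */
enum ref_order { REF_ROWMAJOR, REF_COLMAJOR };

/* y <- alpha*A*x + beta*y with A m x n (no transpose).
 * Row-major: a_ij at a[i*stride + j]; column-major: a_ij at a[j*stride + i]. */
void ref_dgemv(enum ref_order order, int m, int n, double alpha,
               const double *a, int stride, const double *x, int incx,
               double beta, double *y, int incy)
{
    assert(m >= 0 && n >= 0 && incx > 0 && incy > 0);
    for (int i = 0; i < m; i++) {
        double acc = 0.0;
        for (int j = 0; j < n; j++) {
            size_t k = (order == REF_ROWMAJOR)
                           ? (size_t)i * stride + j
                           : (size_t)j * stride + i;
            acc += a[k] * x[(size_t)j * incx];
        }
        y[(size_t)i * incy] = alpha * acc + beta * y[(size_t)i * incy];
    }
}
```

The same 2x3 matrix stored row-major with stride 3 or column-major with stride 2 yields the same result; only the address arithmetic differs.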

15
Level 2 BLAS (ctd)
  • If matrix A is a general band matrix with l
    subdiagonals and u superdiagonals, the function
        void BLAS_dgbmv( enum blas_order_type order,
                         enum blas_trans_type trans,
                         int m, int n, int l, int u,
                         double alpha, const double *a, int stride,
                         const double *x, int incx, double beta,
                         double *y, int incy );
    makes better use of memory. It assumes that a
    band storage scheme is used to store matrix A.
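To show why band storage pays off, here is a C sketch of y ← αAx + βy that reads A only from column-wise band storage (a_ij at ab[(u+i-j) + j·(l+u+1)]) and only ever touches the l+u+1 stored diagonals. `ref_band_mv` is hypothetical; unit increments and no transpose are assumed for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* y <- alpha*A*x + beta*y for an m x n band matrix with l subdiagonals
 * and u superdiagonals, held in column-wise band storage ab. */
void ref_band_mv(int m, int n, int l, int u, double alpha,
                 const double *ab, const double *x, double beta, double *y)
{
    assert(m >= 0 && n >= 0 && l >= 0 && u >= 0);
    int ld = l + u + 1;
    for (int i = 0; i < m; i++)
        y[i] *= beta;
    for (int j = 0; j < n; j++) {
        int ilo = (j - u > 0) ? j - u : 0;        /* max(0, j-u)   */
        int ihi = (j + l < m - 1) ? j + l : m - 1; /* min(m-1, j+l) */
        for (int i = ilo; i <= ihi; i++)
            y[i] += alpha * ab[(size_t)(u + i - j) + (size_t)j * ld] * x[j];
    }
}
```

For the tridiagonal matrix with 2 on the diagonal and -1 beside it, the band array is {0, 2, -1, -1, 2, -1, -1, 2, 0} (the two corner slots are unused), and x = (1, 2, 3) gives Ax = (0, 0, 4).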
16
Level 2 BLAS (ctd)
  • Other routines of Level 2 perform the following
    operations
  • as well as many others
  • For any matrix-vector operation with a specific
    matrix operand (triangular, symmetric, banded,
    etc.), there is a routine for each storage scheme
    that can be used to store the operand

17
Level 3 BLAS
  • Routines of Level 3 perform
  • O(n²) matrix operations
  • norms, diagonal scaling, scaled accumulation and
    addition
  • different storage schemes to store matrix
    operands are supported
  • O(n³) matrix-matrix operations
  • multiplication, solving matrix equations,
    symmetric rank k and 2k updates
  • Data movement with matrices

18
Level 3 BLAS (ctd)
  • An example of an O(n²) matrix operation, which
    scales two real mxn matrices A and B and stores
    their sum in a matrix C, is
    C ← αA + βB
  • The C interface of the routine implementing this
    operation, under the assumption that the matrices
    A, B and C are of the general form, is
        void BLAS_dge_add( enum blas_order_type order, int m, int n,
                           double alpha, const double *a, int stride_a,
                           double beta, const double *b, int stride_b,
                           double *c, int stride_c );
  • There are 15 other routines performing this
    operation for different types and forms of the
    matrices A, B and C
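A C sketch of C ← αA + βB for general row-major matrices, each with its own stride, which is why the interface carries three separate stride arguments. The function name is hypothetical and only the row-major case is shown; a column-major version swaps the roles of i and j in the index computation.

```c
#include <assert.h>
#include <stddef.h>

/* C <- alpha*A + beta*B, all m x n, row-major, independent strides. */
void ref_dge_add_rowmajor(int m, int n,
                          double alpha, const double *a, int stride_a,
                          double beta, const double *b, int stride_b,
                          double *c, int stride_c)
{
    assert(m >= 0 && n >= 0);
    assert(stride_a >= n && stride_b >= n && stride_c >= n);
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            c[(size_t)i * stride_c + j] =
                alpha * a[(size_t)i * stride_a + j] +
                beta * b[(size_t)i * stride_b + j];
}
```

In the usage below B is stored with stride 3, so its third column is padding that the routine never reads.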

19
Level 3 BLAS (ctd)
  • An example of an O(n³) matrix-matrix operation
    involving a real mxk matrix A, a real kxn matrix
    B, and a real mxn matrix C is
    C ← αAB + βC
  • The C routine implementing the operation for
    matrices A, B and C in the general form is
        void BLAS_dgemm( enum blas_order_type order,
                         enum blas_trans_type trans_a,
                         enum blas_trans_type trans_b,
                         int m, int n, int k, double alpha,
                         const double *a, int stride_a,
                         const double *b, int stride_b,
                         double beta, double *c, int stride_c );
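A naive C sketch of C ← αAB + βC (no transposes, row-major, A m x k, B k x n). The name is hypothetical. The i-p-j loop order walks B and C row by row, which tends to be friendlier to caches than the textbook i-j-p order for row-major data; a tuned library would additionally block (tile) these loops.

```c
#include <assert.h>
#include <stddef.h>

/* C <- alpha*A*B + beta*C, row-major, no transposes. */
void ref_dgemm_rowmajor(int m, int n, int k, double alpha,
                        const double *a, int stride_a,
                        const double *b, int stride_b,
                        double beta, double *c, int stride_c)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++)
            c[(size_t)i * stride_c + j] *= beta;       /* scale row i of C */
        for (int p = 0; p < k; p++) {
            double aip = alpha * a[(size_t)i * stride_a + p];
            for (int j = 0; j < n; j++)                /* row p of B */
                c[(size_t)i * stride_c + j] += aip * b[(size_t)p * stride_b + j];
        }
    }
}
```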

20
Level 3 BLAS (ctd)
  • Data movement with matrices includes
  • Copying matrix A or its transpose with storing
    the result in matrix B
  • B ← A or B ← Aᵀ
  • Transposition of a square matrix A with the
    result overwriting matrix A
  • A ← Aᵀ
  • Permutation of the rows or columns of matrix A by
    a permutation matrix P
  • A ← PA or A ← AP
  • Different types and forms of matrix operands as
    well as different storage schemes are supported
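The in-place transposition A ← Aᵀ of a square matrix needs no second array: swapping each pair a_ij, a_ji above the diagonal exactly once is enough. A minimal C sketch (hypothetical name, stride s ≥ n; it works for either storage order):

```c
#include <assert.h>
#include <stddef.h>

/* A <- A^T for an n x n matrix stored with stride s >= n. */
void ref_transpose_inplace(int n, double *a, int s)
{
    assert(n >= 0 && s >= n);
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {   /* strictly upper triangle only */
            double t = a[(size_t)i * s + j];
            a[(size_t)i * s + j] = a[(size_t)j * s + i];
            a[(size_t)j * s + i] = t;
        }
}
```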

21
Sparse BLAS
  • Sparse BLAS
  • Provides routines for unstructured sparse
    matrices
  • Poorer functionality compared to Dense and Banded
    BLAS
  • only some basic array operations used in solving
    large sparse linear equations using iterative
    techniques
  • matrix multiply, triangular solve, sparse vector
    update, dot product, gather/scatter
  • Does not specify methods to store a sparse matrix
  • storage format is dependent on the algorithm, the
    original sparsity pattern, the format in which
    the data already exists, etc.
  • sparse matrix arguments are a placeholder, or
    handle, which refers to an abstract
    representation of a matrix, not the actual data
    components

22
Sparse BLAS (ctd)
  • Several routines provided to create sparse
    matrices
  • The internal representation is implementation
    dependent
  • Sparse BLAS applications are independent of the
    matrix storage scheme, relying on the scheme
    provided by each implementation
  • A typical Sparse BLAS application
  • Creates an internal sparse matrix representation
    and returns its handle
  • Uses the handle as a parameter in computational
    Sparse BLAS routines
  • Calls a cleanup routine to free resources
    associated with the handle when the matrix is no
    longer needed

23
Example
  • Example. Consider a C program using Sparse BLAS
    performing the matrix-vector operation y = Ax,
    where
            | 1.1  0    0    0   |
        A = | 0    2.2  0    2.4 |
            | 0    0    3.3  0   |
            | 4.1  0    0    4.4 |
    and x = (1, 1, 1, 1)ᵀ

24
Example (ctd)
        #include <blas_sparse.h>

        int main(void)
        {
            const int n = 4, nonzeros = 6;
            double values[] = { 1.1, 2.2, 2.4, 3.3, 4.1, 4.4 };
            int index_i[] = { 0, 1, 1, 2, 3, 3 };
            int index_j[] = { 0, 1, 3, 2, 0, 3 };
            double x[] = { 1.0, 1.0, 1.0, 1.0 },
                   y[] = { 0.0, 0.0, 0.0, 0.0 };
            blas_sparse_matrix A;
            int k;
            double alpha = 1.0;

            A = BLAS_duscr_begin(n, n);      // Create Sparse BLAS handle
            for (k = 0; k < nonzeros; k++)   // Insert entries one by one
                BLAS_duscr_insert_entry(A, values[k], index_i[k], index_j[k]);
            BLAS_uscr_end(A);                // Complete construction of sparse matrix

            // Compute matrix-vector product y = Ax
            BLAS_dusmv(blas_no_trans, alpha, A, x, 1, y, 1);

            BLAS_usds(A);                    // Release the handle
            return 0;
        }
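For comparison, here is a library-free C sketch of what the example computes. The matrix is kept as coordinate (COO) triples, which is only one possible representation; Sparse BLAS deliberately hides its internal format behind the handle. `coo_mv` is a hypothetical name; it performs y ← αAx + y, the update usually defined for the Sparse BLAS usmv operation.

```c
#include <assert.h>

/* y <- alpha*A*x + y for A given as nnz triples (val[k], row[k], col[k]). */
void coo_mv(int nnz, const double *val, const int *row, const int *col,
            double alpha, const double *x, double *y)
{
    assert(nnz >= 0);
    for (int k = 0; k < nnz; k++)
        y[row[k]] += alpha * val[k] * x[col[k]];
}
```

With the six entries from the example and x = (1, 1, 1, 1), y comes out as (1.1, 4.6, 3.3, 8.5), up to floating-point rounding.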

25
Parallel Languages
26
Parallel Languages
  • C and Fortran 77 do not reflect some essential
    features of VP and SP architectures
  • They cannot play the same role for VPs and SPs
  • Optimizing compilers
  • Only for a simple and limited class of
    applications
  • Array libraries
  • Cover a limited class of array operations
  • Other array operations can only be expressed as a
    combination of the locally-optimized library
    array operations
  • This excludes global optimization of combined
    array operations

27
Parallel Languages (ctd)
  • Parallel extensions of C and Fortran 77 allow
    programmers
  • To explicitly express in a portable form any
    array operation
  • Compiler does not need to recognize code to
    parallelize
  • Global optimisation of operations on arrays is
    possible
  • We consider 2 parallel supersets of C and Fortran
    77
  • Fortran 90
  • C[]

28
Fortran 90
  • Fortran 90 is a new Fortran standard released in
    1991
  • Widely implemented since then
  • Two categories of new features
  • Modernization of Fortran according to the
    state-of-the-art in serial programming languages
  • Support for explicit expression of operations on
    arrays

29
Fortran 90 (ctd)
  • Serial extensions include
  • Free-format source code and some other simple
    improvements
  • Dynamic memory allocation (automatic arrays,
    allocatable arrays, and pointers and associated
    heap storage management)
  • User-defined data types (structures)
  • Generic user-defined procedures (functions and
    subroutines) and operators

30
Fortran 90 (ctd)
  • Serial extensions (ctd)
  • Recursive procedures
  • New control structures to support structured
    programming
  • A new program unit, MODULE, for encapsulation of
    data and a related set of procedures
  • We focus on parallel extensions

31
Fortran 90 (ctd)
  • Fortran 90 considers arrays first-class objects
  • Whole-array operations, assignments, and
    functions
  • Operations and assignments are extended in an
    obvious way, on an element-by-element basis
  • Intrinsic functions are array-valued for array
    arguments
  • operate element-wise if given an array as their
    argument
  • Array expressions may include scalar constants
    and variables, which are replicated (or expanded)
    to the required number of elements

32
Fortran 90 (ctd)
  • Example.
  • REAL, DIMENSION(3,4,5) :: a, b, c, d
  • c = a + b
  • d = SQRT(a)
  • c = a + 2.0
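To spell out what "extended in an obvious way, on an element-by-element basis" means, here is a C sketch of the whole-array assignments c = a + b and c = a + 2.0 for REAL arrays of shape (3,4,5), written as explicit loops. The function names are hypothetical. Fortran lays arrays out column-major and C row-major, but for element-wise operations the traversal order does not affect the result.

```c
#include <assert.h>

#define N1 3
#define N2 4
#define N3 5

/* c = a + b, element by element. */
void whole_add(float a[N1][N2][N3], float b[N1][N2][N3], float c[N1][N2][N3])
{
    for (int i = 0; i < N1; i++)
        for (int j = 0; j < N2; j++)
            for (int k = 0; k < N3; k++)
                c[i][j][k] = a[i][j][k] + b[i][j][k];
}

/* c = a + s: the scalar s is, in effect, replicated to every element. */
void scalar_add(float a[N1][N2][N3], float s, float c[N1][N2][N3])
{
    for (int i = 0; i < N1; i++)
        for (int j = 0; j < N2; j++)
            for (int k = 0; k < N3; k++)
                c[i][j][k] = a[i][j][k] + s;
}
```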

33
WHERE Structure
  • Sometimes, some elements of arrays in an
    array-valued expression should be treated
    specially
  • Division by zero in a = 1./a should be avoided
  • WHERE statement
  • WHERE (a /= 0.) a = 1./a
  • WHERE construct
  • WHERE (a /= 0.)
  •   a = 1./a
  • ELSEWHERE
  •   a = HUGE(a)
  • END WHERE
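In C the WHERE construct corresponds to a loop with a per-element test on the mask. The sketch below (hypothetical function name) works on doubles, so DBL_MAX from <float.h> plays the role of HUGE(a).

```c
#include <assert.h>
#include <float.h>

/* Masked update: where a[i] /= 0, a[i] = 1/a[i]; elsewhere a[i] = DBL_MAX. */
void where_reciprocal(int n, double *a)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (a[i] != 0.0)
            a[i] = 1.0 / a[i];
        else
            a[i] = DBL_MAX;
    }
}
```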

34
Fortran 90 (ctd)
  • All the array elements in an array-valued
    expression or array assignment must be
    conformable, i.e., they must have the same shape
  • the same number of axes
  • the same number of elements along each axis
  • Example.
  • REAL a(3,4,5), b(0:2,4,5), c(3,4,-1:3)
  • Arrays a, b, and c have the same rank of 3,
    extents of 3, 4, and 5, shape of (/3,4,5/), size
    of 60
  • Only differ in the lower and upper dimension
    bounds

35
Array Section
  • An array section can be used everywhere in array
    assignments and array-valued expressions where a
    whole array is allowed
  • An array section may be specified with subscripts
    of the form of triplet lower:upper:stride
  • It designates an ordered set i_1, …, i_k such that
  • i_1 = lower
  • i_{j+1} = i_j + stride (j = 1, …, k-1)
  • |i_k - upper| < |stride|, with every i_j lying
    between lower and upper
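The definition of the triplet can be checked with a small C enumerator that starts at lower and keeps adding stride while the index stays between lower and upper. `triplet_indices` is a hypothetical helper; it returns the number of elements and writes at most `cap` indices.

```c
#include <assert.h>

/* Index set designated by lower:upper:stride (stride != 0). */
int triplet_indices(int lower, int upper, int stride, int *out, int cap)
{
    assert(stride != 0);
    int k = 0;
    for (int i = lower; stride > 0 ? i <= upper : i >= upper; i += stride) {
        if (k < cap)
            out[k] = i;
        k++;
    }
    return k;
}
```

For example, 1:50:3 designates the 17 indices 1, 4, …, 49 (|49 - 50| < 3), 50:1:-1 runs backwards through all 50 indices, and 5:1:1 is empty.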

36
Array Section (ctd)
  • Example. REAL a(50,50)
  • What sections are designated by the following
    expressions? What are the rank and shape for each
    section?
  • a(i,1:50:1), a(i,1:50)
  • a(i,:)
  • a(i,1:50:3)
  • a(i,50:1:-1)
  • a(11:40,j)
  • a(1:10,1:10)

37
Array Section (ctd)
  • Vector subscripts may also be used to specify
    array sections
  • Any expression whose value is a rank 1 integer
    array may be used as a vector subscript
  • Example.
  • REAL a(5,5), b(5)
  • INTEGER index(5)
  • index = (/5,4,3,2,1/)
  • b = a(index,1)

38
Array Section (ctd)
  • Whole arrays and array sections of the same shape
    can be mixed in expressions and assignments
  • Note, that unlike a whole array, an array section
    may not occupy contiguous storage locations

39
Array Constants
  • Fortran 90 introduces array constants, or array
    constructors
  • The simplest form is just a list of elements
    enclosed in (/ and /)
  • May contain lists of scalars, lists of arrays,
    and implied-DO loops
  • Examples.
  • (/ (0, i=1,50) /)
  • (/ (3.14*i, i=4,100,3) /)
  • (/ ( (/ 5,4,3,2,1 /), i=1,5 ) /)

40
Array Constants (ctd)
  • The array constructors can only produce
    1-dimensional arrays
  • Function RESHAPE can be used to construct arrays
    of higher rank
  • REAL a(500,500)
  • a = RESHAPE( (/ (0., i=1,250000) /), (/ 500,500 /) )

41
Assumed-Shape and Automatic Arrays
  • Consider the user-defined procedure operating on
    arrays
  • SUBROUTINE swap(a,b)
  • REAL, DIMENSION(:,:) :: a, b
  • REAL, DIMENSION(SIZE(a,1), SIZE(a,2)) :: temp
  • temp = a
  • a = b
  • b = temp
  • END SUBROUTINE swap

42
Assumed-Shape and Automatic Arrays (ctd)
  • Formal array arguments a and b are of assumed
    shape
  • Only the type and rank are specified
  • The actual shape is taken from that of the actual
    array arguments
  • The local array temp is an example of the
    automatic array
  • Its size is set at runtime
  • It stops existing as soon as control leaves the
    procedure
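C has no assumed-shape arrays, but C99 variable-length arrays give a close analogue of the automatic array temp. In this sketch (hypothetical `swap_2d`) the extents are passed explicitly, and temp is sized at run time and ceases to exist when the function returns.

```c
#include <assert.h>
#include <string.h>

/* C99 analogue of SUBROUTINE swap: temp is a variable-length array whose
 * size is fixed at run time, like a Fortran 90 automatic array. */
void swap_2d(int m, int n, double a[m][n], double b[m][n])
{
    assert(m > 0 && n > 0);
    double temp[m][n];               /* automatic: lives only during the call */
    memcpy(temp, a, sizeof temp);    /* temp = a */
    memcpy(a, b, sizeof temp);       /* a = b    */
    memcpy(b, temp, sizeof temp);    /* b = temp */
}
```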

43
Intrinsic Array Functions
  • Intrinsic array functions include
  • Extension of such intrinsic functions as SQRT,
    SIN, etc. to array arguments
  • Specific array intrinsic functions
  • Specific array intrinsic functions do the
    following
  • Compute the scalar product of two vectors
    (DOT_PRODUCT) and the matrix product of two
    matrices (MATMUL)

44
Specific Intrinsic Array Functions
  • Perform diverse reduction operations on an array
  • logical multiplication (ALL) and addition (ANY)
  • counting the number of true elements in the array
    (COUNT)
  • arithmetical multiplication (PRODUCT) and
    addition (SUM) of its elements
  • finding the smallest (MINVAL) or the largest
    (MAXVAL) element

45
Specific Intrinsic Array Functions (ctd)
  • Return diverse attributes of an array
  • its shape (SHAPE)
  • the lower dimension bounds of the array (LBOUND)
  • the upper dimension bounds (UBOUND)
  • the number of elements (SIZE)
  • the allocation status of the array (ALLOCATED)

46
Specific Intrinsic Array Functions (ctd)
  • Construct arrays by means of
  • merging two arrays under mask (MERGE)
  • packing an array into a vector (PACK)
  • replication of an array by adding a dimension
    (SPREAD)
  • unpacking a vector (a rank 1 array) into an array
    under mask (UNPACK)

47
Specific Intrinsic Array Functions (ctd)
  • Reshape arrays (RESHAPE)
  • Move array elements performing
  • the circular shift (CSHIFT)
  • the end-off shift (EOSHIFT)
  • the transpose of a rank 2 array (TRANSPOSE)
  • Locate the first maximum (MAXLOC) or minimum
    (MINLOC) element in an array

48
C[]
  • C[] (C brackets) is a strict ANSI C superset
    allowing programmers to explicitly describe
    operations on arrays
  • Vector value, or vector
  • An ordered set of values (or vector values) of
    any one type
  • Any vector type is characterised by
  • the number of elements
  • the type of elements

49
Vector Value and Vector Object
  • Vector object
  • A region of data storage, the contents of which
    can represent vector values
  • An ordered sequence of objects (or vector
    objects) of any one type
  • Unlike ANSI C, C[] defines the notion of value of
    array object
  • This value is vector

50
Vector Value and Vector Object (ctd)
  • Example. The value of the array
  • int a[3][2] = {0,1,2,3,4,5};
  • is the vector
  • {{0,1}, {2,3}, {4,5}}
  • This vector has the shape {3,2}.
  • This vector type is named by int[3][2]
  • The shape of an array is that of its vector value
  • In C[], an array object is a particular case of
    vector object

51
Arrays and Pointers
  • In ANSI C, an array is a contiguously allocated
    set of elements of any one type of object
  • In C[], an array is a set of elements of any one
    type of object sequentially allocated with a
    positive stride
  • The stride is the distance between successive
    elements of the array measured in units equal to
    the size of array element
  • If stride is not specified, it is assumed to be 1

52
Arrays and Pointers (ctd)
  • A C[] array has at least three attributes
  • the type of elements
  • the number of elements
  • the allocation stride

53
Arrays and Pointers (ctd)
  • Example 1.
  • int a[3];
  • int a[3:1];
  • Example 2.
  • int a[3:3];

The gap between successive array elements is
2*sizeof(int) bytes
54
Arrays and Pointers (ctd)
  • In ANSI C, a pointer has only one attribute
  • The type of object it points to
  • It is needed to correctly interpret
  • the value of the object it points to
  • the address operators + and - (operand(s) and
    result should point into the same array)
  • In C[], a pointer has an additional attribute,
    stride
  • If stride is not specified, it is assumed to be 1

55
Arrays and Pointers (ctd)
  • Example 1. The declarations
  • int a[] = {0,1,2,3,4};
  • int *p1 = (void*)a;
  • int *:2 p2 = (void*)&a[4];
  • form the following structure of storage

p1+2 and p2-1 point to the same array element,
a[2]
56
Arrays and Pointers (ctd)
  • Expressions e1[e2] or (e2)[e1] provide access to
    the e2-th element of an array e1
  • Identical to (*((e1)+(e2)))
  • e2 is an integer expression
  • e1 is an lvalue that has the type "array of type"
  • converted to an expression of the type "pointer
    to type" pointing to the initial element of the
    array object
  • the attribute stride of this pointer is identical
    to that of the array object

57
Arrays and Pointers (ctd)
  • C[] allows dynamic arrays
  • typedef int (*pDiag)[n:n+1];
  • int a[n][n];
  • int j;
  • pDiag p = (void*)a;
  • ...
  • for(j=0; j<n; j++)
  •   (*p)[j] = 1;
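In ANSI C, which has no stride attribute on pointers, the same diagonal access has to be written with explicit index arithmetic: the main diagonal of an n x n row-major array is n elements with stride n+1 starting at a[0][0]. A minimal sketch with a hypothetical function name:

```c
#include <assert.h>
#include <stddef.h>

/* Set the main diagonal of an n x n row-major array (passed flat) to value,
 * stepping with stride n+1 as the C[] pointer type pDiag does implicitly. */
void set_diagonal(int n, int *a, int value)
{
    assert(n >= 0);
    int stride = n + 1;
    for (int j = 0; j < n; j++)
        a[(size_t)j * stride] = value;
}
```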

58
Blocking Operator
  • In C[], the value of an array object is a vector
  • The i-th element of the vector is the value of
    the i-th element of the array object
  • The postfix operator [] (the blocking operator)
  • Supports access to an array as a whole
  • Its operand has the type "array of type"
  • Blocks the conversion of the operand to a pointer
  • Example. int a[5], b[5:2], c[5:3];
  • a[], b[], and c[] designate arrays a, b, and c as
    a whole
  • c[] = a[] + b[]

59
Lvector
  • In ANSI C, an lvalue is an expression designating
    an object
  • Example. int d[5][5];
  • d[i][j], d and d[0] are lvalues
  • d[i][j]+1 and &d[0] are not.
  • Modifiable lvalue
  • May be the left operand of an assignment operator
  • d[i][j] is a modifiable lvalue
  • d and d[0] are not

60
Lvector (ctd)
  • In C[], an lvector is an expression designating a
    vector object
  • Modifiable lvector
  • May be the left operand of an assignment operator
  • Example. int d[5][5];
  • d[], d[0][], d, and d[0] are lvectors
  • d[] and d[0][] are modifiable
  • d and d[0] are not modifiable

61
Lvector (ctd)
  • Example. int a[4][4];
  • *((int(*)[4:5])a)

62
Subarray
  • An object belongs to an array if
  • It is an element of the array, or
  • It belongs to an element of the array
  • Subarray
  • A set of objects belonging to an array
  • An array itself
  • Example (ctd). The main diagonal is a subarray
  • It is an array object of the type int[4:5]

63
Subarray (ctd)
  • Example. int a[4][4];
  • *((int(*)[3:5])(&a[0][1]))

64
Subarray (ctd)
  • Not every regular set of objects belonging to an
    array makes up its subarray
  • Example. int a[5][5];

No constant modifiable lvector designates this
inner square
65
Array Section
  • The operator [l:r:s] (the grid operator)
  • Supports access to array sections of general form
  • Syntax. e[l:r:s]
  • Expression e may have type "array of type" or
    "pointer to type"
  • Expressions l, r, and s have integer types and
    denote
  • the left bound
  • the right bound
  • the stride

66
Array Section (ctd)
  • Semantics. e[l:r:s]
  • A vector object of (r-l)/s+1 elements of type
    type
  • Its i-th element is e[l+s*i]
  • Expression e[l:r:s] is an lvector
  • Expression e[l:r:s] is a modifiable lvector if
  • All expressions e[l+s*i], i=0,1,…, are modifiable
    lvectors/lvalues
  • e[l:r:1] is equivalent to e[l:r]
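One way to picture the lvector e[l:r:s] in plain C is as an array of pointers to its (r-l)/s+1 elements, so that assigning through the pointers updates the original array. `grid_section` is a hypothetical helper and assumes l ≤ r and s > 0.

```c
#include <assert.h>

/* Model e[l:r:s] over a plain int array: out[i] = &e[l+s*i].
 * Returns the element count (r-l)/s+1; out must have room for it. */
int grid_section(int *e, int l, int r, int s, int **out)
{
    assert(s > 0 && l <= r);
    int count = (r - l) / s + 1;
    for (int i = 0; i < count; i++)
        out[i] = &e[l + s * i];      /* i-th element is e[l+s*i] */
    return count;
}
```

For e[1:8:3] over e = {0, …, 9} this designates e[1], e[4] and e[7]; writing through the pointers is what makes the section a modifiable lvector.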

67
Array Section (ctd)
  • Operand e in elrs may have a vector type
  • Operator is applied element-wise
  • Let the vector value of e be {u_1, …, u_k}
  • e[l:r:s] will designate a vector of k vectors
  • The i-th element of the j-th vector will be
    u_j[l+s*i] (j = 1, …, k)

68
Example
  • Example. int a[5][5];
  • a[1:3][1:3]

69
Example (ctd)
  • a[1:3]

a[1:3][1:3]
70
Array Section (ctd)
  • Operands l and/or r in elrs may be omitted
  • If l is omitted, the left bound is set to 0
  • If r is omitted, the right bound is
  • set to n-1, if the first operand e is an
    n-element array
  • determined from the context, if e is a pointer
  • Example. int a[5][5];
  • a[] is equivalent to a[:]

71
Element-Wise Vector Operators
  • The operand of the cast operator and the unary &,
    *, +, -, ~, !, ++, and -- operators may have a
    vector type
  • The operators are applied element-wise
  • Example.
  • int j, k, l, m, n;
  • int *p[5] = {&j, &k, &l, &m, &n};
  • *(p[1:3]) designates a vector object consisting
    of three integer variables k, l, and m.

72
Element-Wise Vector Operators (ctd)
  • Binary operators *, /, %, +, -, <<, >>, <, >, <=,
    >=, ==, !=, &, ^, |, &&, and || may have vector
    operands
  • If the operands have the same shape, then the
    operator is executed element-wise producing the
    result of this shape

73
Element-Wise Vector Operators (ctd)
  • In general, the operands may have different
    shapes but they must be conformable
  • 2 operands are conformable iff the beginning of
    the shape of one operand is identical to the
    shape of the other operand
  • Vectors having shapes {9,8,7,6} and {9,8} are
    conformable
  • A non-vector operand is conformable with any
    vector operand (why?)

74
Element-Wise Vector Operators (ctd)
  • Let operands a and b be conformable, and rank(a)
    < rank(b)
  • The execution of the operator starts from
    conformable extension of the value of a to the
    shape of b
  • The conformable extension just replicates the
    value by adding dimensions

75
Element-Wise Vector Operators (ctd)
  • Example. Conformable extension of vector
    {{1,2,3},{4,5,6}} of shape {2,3} to shape {2,3,2}
    is vector {{{1,1},{2,2},{3,3}},{{4,4},{5,5},{6,6}}}
  • Then the operator is applied element-wise to the
    result of the conformable extension of the value
    of a and the value of b, producing the result of
    the same shape as that of b
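Conformable extension can be sketched in C as replication along a new trailing axis: every element v[i][j] of a value of shape {R,C} is copied K times to give shape {R,C,K}. The function name is hypothetical; the usage reproduces the {2,3} to {2,3,2} example above.

```c
#include <assert.h>

/* out[i][j][t] = v[i][j] for every t: extend shape {R,C} to {R,C,K}. */
void extend_2d_to_3d(int R, int C, int K, int v[R][C], int out[R][C][K])
{
    assert(R > 0 && C > 0 && K > 0);
    for (int i = 0; i < R; i++)
        for (int j = 0; j < C; j++)
            for (int t = 0; t < K; t++)
                out[i][j][t] = v[i][j];
}
```

After the extension, an element-wise operator can pair each element of the extended a with the element of b at the same position.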

76
Element-Wise Vector Operators (ctd)
  • The assignment operators =, +=, *=, etc. may have
    vector operands
  • The left operand shall be a modifiable lvector
  • Its rank shall not be less than that of the right
    operand
  • The operands shall be conformable
  • Two-step execution
  • The right operand is conformably extended to the
    shape of the left one
  • The assignment is executed element-wise

77
Element-Wise Vector Operators (ctd)
  • Example.
  • int a[m][n], b[m];
  • ...
  • a[] = b[];

78
Example
  • LU-factorization of the square matrix a by using
    the Gaussian elimination
  • double a[n][n], t;
  • int i, j;
  • ...
  • for(i=0; i<n; i++)
  •   for(j=i+1; j<n; j++) {
  •     t = a[j][i]/a[i][i];
  •     if(a[j][i] != 0.)
  •       a[j][i:n-1] -= t*a[i][i:n-1];
  •   }
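For comparison, here is the same elimination in plain C: the single vector statement a[j][i:n-1] -= t*a[i][i:n-1] becomes an explicit loop over k. As in the C[] code, there is no pivoting (a[i][i] ≠ 0 is assumed) and the multipliers t are not stored, so on return a holds the upper-triangular factor U. The function name is hypothetical.

```c
#include <assert.h>

/* Gaussian elimination to upper-triangular form, no pivoting. */
void eliminate(int n, double a[n][n])
{
    assert(n > 0);
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++) {
            double t = a[j][i] / a[i][i];
            if (a[j][i] != 0.)
                for (int k = i; k < n; k++)   /* a[j][i:n-1] -= t*a[i][i:n-1] */
                    a[j][k] -= t * a[i][k];
        }
}
```

The explicit k loop is exactly the kind of inner loop a C or Fortran 77 compiler would have to recognize as vectorizable, whereas the C[] form states it directly.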

79
Element-Wise Vector Operators (ctd)
  • By definition, e1e2 is identical to
    (((e1)(e2)))
  • Therefore, e1 and e2 may be of vector type
  • The programmer can construct lvectors that
    designate irregular array sections
  • Example.
  • int a[m][n], ind[] = {0,1,6,18};
  • ...
  • a[ind[]][] = 0;
  • This code zeros the elements of the 0-th, 1-st,
    6-th, and 18-th rows of array a

80
Element-Wise Vector Operators (ctd)
  • The first operand of the . operator may have a
    vector type
  • The second operand shall name a member of a
    structure or union type
  • The operator is executed element-wise
  • The result will have the same shape as the first
    operand
  • e->id is identical to (*e).id

81
Reduction Operators
  • Reduction operators [*], [+], [&], [^], [|], [&&],
    [||], [?<], and [?>]
  • Unary operators
  • Only applicable to vector operands
  • If v_1, …, v_n are the elements of the vector value
    of the expression e, then the value of the
    expression [op]e is that of the expression
    (v_1 op … op v_n)

82
Examples
  • Example 1. Dot product of the vectors a and b.
  • double a[n];
  • double b[n];
  • double c;
  • ...
  • c = [+](a[]*b[]);

83
Examples (ctd)
  • Example 2. Maximal element of the matrix a
  • int a[m][n];
  • int max;
  • ...
  • max = [?>][?>]a[];
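Read through the reduction rule above, max = [?>][?>]a[] works in two stages: the elements of a[] are its m rows, so the inner [?>] combines them element-wise into a vector of n column maxima, and the outer [?>] reduces that vector to one int. A plain-C sketch of that reading (hypothetical function name):

```c
#include <assert.h>

/* Two-stage maximum of int a[m][n], mirroring [?>][?>]a[]. */
int max_two_stage(int m, int n, int a[m][n])
{
    assert(m > 0 && n > 0);
    int colmax[n];                       /* inner reduction: rows combined */
    for (int j = 0; j < n; j++)
        colmax[j] = a[0][j];
    for (int i = 1; i < m; i++)
        for (int j = 0; j < n; j++)
            if (a[i][j] > colmax[j])
                colmax[j] = a[i][j];
    int max = colmax[0];                 /* outer reduction: vector to scalar */
    for (int j = 1; j < n; j++)
        if (colmax[j] > max)
            max = colmax[j];
    return max;
}
```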

84
Examples (ctd)
  • Example 3. Multiplication of matrices a and b
  • double a[m][l];
  • double b[l][n];
  • double c[m][n];
  • int i;
  • ...
  • for(i=0; i<m; i++)
  •   c[i][] = [+](a[i][]*b[]);

85
Memory Hierarchy
  • Parallel programming systems for VPs and SPs take
    into account their modern memory structure
  • Optimal memory management is often more efficient
    than optimal usage of IEUs
  • Approaches to optimal memory management appear
    surprisingly similar to optimisation of parallel
    facilities
  • Simple two-level memory model
  • Small and fast register memory
  • Large and relatively slow main memory

86
Memory Hierarchy (ctd)
  • A simple modern memory hierarchy
  • Register memory
  • Cache memory
  • Main memory
  • Disk memory
  • Cache memory
  • A buffer memory between main memory and registers
  • Holds copies of some data from the main memory

87
Memory Hierarchy (ctd)
  • Execution of instruction reading a data item from
    the main memory into a register
  • Check if a copy of the data item is already in
    the cache
  • If so, the data item will be actually transferred
    into the register from the cache
  • If not, the data item will be transferred into
    the register from the main memory, and a copy of
    the item will appear in the cache

88
Cache
  • Cache
  • Partitioned into cache lines
  • Cache line is a minimum unit of data transfer
    between the cache and the main memory
  • Scalars may be transferred only as a part of a
    cache line
  • Much smaller than the main memory
  • The same cache line may reflect different data
    blocks from the main memory

89
Cache (ctd)
  • Types of cache memory
  • Direct mapped
  • each block of the main memory has only one place
    it can appear in the cache
  • Fully associative
  • a block can be placed anywhere in the cache
  • Set associative
  • a block can be placed in a restricted set of
    places
  • a set is a group of two or more cache lines
  • n-way associative cache

90
Cache (ctd)
  • Cache miss is the situation when a data item
    being referenced is not in the cache
  • Minimizing cache misses can significantly
    accelerate execution of the program
  • Programs intensively using basic operations on
    arrays are obviously suitable for that type of
    optimization

91
Loop Tiling
  • The main specific optimization minimizing the
    number of cache misses is loop tiling
  • Consider the loop nest
  • for(i=0; i<m; i++) /* loop 1 */
  •   for(j=0; j<n; j++) /* loop 2 */
  •     if(i==0)
  •       b[j]=a[i][j];
  •     else
  •       b[j]+=a[i][j];
  • b[j] are repeatedly used by successive iterations
    of loop 1

92
Loop Tiling (ctd)
  • If n is large enough, the data items may be
    flushed from the cache by the moment of their
    repeated use
  • To minimize the flushing of repeatedly used data
    items, the number of iterations of loop 2 may be
    decreased
  • To keep the total number of iterations of this
    loop nest unchanged, an additional controlling
    loop is introduced

93
Loop Tiling (ctd)
  • The transformed loop nest is
  • for(k=0; k<n; k+=T) // additional controlling loop 0
  •   for(i=0; i<m; i++) // loop 1
  •     for(j=k; j<min(k+T,n); j++) // loop 2
  •       if(i==0)
  •         b[j]=a[i][j];
  •       else
  •         b[j]+=a[i][j];
  • This transformation is called tiling
  • T is the tile size
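Both loop nests can be written as runnable C to confirm that tiling changes only the traversal order, not the result. The function names are hypothetical. In the tiled version, loop 0 walks over tiles of T columns, so b[k..k+T-1] is reused by all m iterations of loop 1 while it is likely still in cache; each b[j] is still summed over i in the same order, so the results are bitwise identical.

```c
#include <assert.h>

#define MIN(x, y) ((x) < (y) ? (x) : (y))

/* Original nest: b[j] = sum over i of a[i][j]. */
void colsum(int m, int n, double a[m][n], double b[n])
{
    for (int i = 0; i < m; i++)                  /* loop 1 */
        for (int j = 0; j < n; j++)              /* loop 2 */
            if (i == 0) b[j] = a[i][j];
            else        b[j] += a[i][j];
}

/* Tiled nest with tile size T. */
void colsum_tiled(int m, int n, double a[m][n], double b[n], int T)
{
    assert(T > 0);
    for (int k = 0; k < n; k += T)                   /* loop 0 */
        for (int i = 0; i < m; i++)                  /* loop 1 */
            for (int j = k; j < MIN(k + T, n); j++)  /* loop 2 */
                if (i == 0) b[j] = a[i][j];
                else        b[j] += a[i][j];
}
```

In practice T would be chosen so that T elements of b, plus the a[i][j] being streamed, fit comfortably in the cache.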

94
Loop Tiling (ctd)
  • In general, the loop tiling is applied to loop
    nests of the form
  • for(i1...) /* loop 1 */
  •   for(i2...) /* loop 2 */
  •     ...
  •       for(in...) /* loop n */
  •         ...
  •         e[i2]...[in]
  •         ...
  • The goal is to minimize the number of cache
    misses for reference e[i2]...[in], which is
    repeatedly used by successive iterations of loop
    1

95
Loop Tiling and Optimising Compilers
  • Recognizing loop nests that can be tiled is the
    most difficult problem to be solved by optimising
    C and Fortran 77 compilers
  • Based on the analysis of data dependencies in
    loop nests
  • Theorem. The loop tiling is legally applicable
    (to the above loop nest) iff the loops from loop
    2 to loop n are fully interchangeable
  • To prove the interchangeability, an analysis of
    data dependence between different iterations of
    the loop nest is needed

96
Loop Tiling and Array Libraries
  • Level 3 BLAS is specified to support block
    algorithms of matrix-matrix operations
  • Partitioning matrices into blocks and performing
    the computation on the blocks maximizes the reuse
    of data held in the upper levels of memory
    hierarchy

97
Loop Tiling and Parallel Languages
  • Compilers for parallel languages do not need to
    recognize loops suitable for tiling
  • They can translate explicit operations on arrays
    into loop nests with the best possible temporal
    locality

98
Virtual memory
  • Instructions address virtual memory rather than
    the real physical memory
  • The virtual memory is partitioned into pages of a
    fixed size
  • Each page is stored on a disk until it is needed
  • When the page is needed, it is copied to main
    memory, with the virtual addresses mapped onto
    real addresses
  • This copying is known as paging or swapping
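Address translation with fixed-size pages splits a virtual address into a page number and an offset; only the page number is mapped. This sketch assumes 4 KiB pages and a tiny single-level page table supplied by the caller, where -1 marks a page that is not resident and would have to be paged in from disk. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { PAGE_SIZE = 4096 };   /* assumed page size */

/* Returns the physical address, or -1 on a page fault (paging needed). */
int64_t translate(uint64_t vaddr, const int64_t *page_table, uint64_t num_pages)
{
    uint64_t vpn = vaddr / PAGE_SIZE;      /* virtual page number */
    uint64_t offset = vaddr % PAGE_SIZE;   /* unchanged by translation */
    if (vpn >= num_pages || page_table[vpn] < 0)
        return -1;
    return page_table[vpn] * PAGE_SIZE + (int64_t)offset;
}
```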

99
Virtual memory (ctd)
  • Programs processing sufficiently large arrays may
    not fit into main memory
  • Swapping then takes place each time required
    data are not in main memory
  • The swapping is a very expensive operation
  • Minimization of the number of swappings can
    significantly accelerate the programs
  • The problem is similar to minimization of cache
    misses and can be, therefore, approached
    similarly

100
Vector and Superscalar Processors Summary
  • VPs and SPs provide instruction-level
    parallelism, which is best exploited by
    applications with intensive operations on arrays
  • Such applications can be written in a serial
    programming language and compiled by dedicated
    optimizing compilers performing some specific
    loop optimizations
  • Modular, portable, and reliable programming are
    supported
  • Efficiency and portable efficiency are also
    supported but only for a limited class of
    programs

101
Vector and Superscalar Processors Summary (ctd)
  • Array libraries allow the programmers to avoid
    the use of dedicated compilers
  • The programmers express operations on arrays
    directly using calls to carefully implemented
    subroutines
  • Modular, portable, and reliable programming are
    supported
  • Limited efficiency and portable efficiency
  • Excludes global optimization of combined array
    operations

102
Vector and Superscalar Processors Summary (ctd)
  • Parallel languages combine advantages of the
    first and second approaches
  • Operations on arrays can be explicitly expressed
  • No need for sophisticated algorithms to recognize
    parallelizable loops
  • Global optimisation of combined array operations
    is possible
  • They support general-purpose programming (unlike
    existing array libraries)