P573 Scientific Computing Lecture 4: Vector Operations - PowerPoint PPT Presentation

1 / 31
About This Presentation
Title:

P573 Scientific Computing Lecture 4: Vector Operations

Description:

V is a vector space over the field F if. V is an Abelian group. Vector ... int main(int argc, char** argv) { std::cout factorial 7 ::value 'n'; return 0; ... – PowerPoint PPT presentation

Number of Views:47
Avg rating:3.0/5.0
Slides: 32
Provided by: david3080
Category:

less

Transcript and Presenter's Notes

Title: P573 Scientific Computing Lecture 4: Vector Operations


1
P573Scientific ComputingLecture 4 Vector
Operations
  • Peter Gottschling
  • pgottsch_at_cs.indiana.edu
  • www.osl.iu.edu/pgottsch/courses/p573-06

2
Outline
  • Vector space
  • Defining your own vector type
  • Programming operators
  • Norms
  • Inner product

3
Vector Space
  • V is a vector space over the field F if
  • V is an Abelian group
  • Vector addition is associative ? u, v, w ? V
    (u v) w u (v w)
  • Vector addition is commutative? u, v ? V u v
    v u
  • Vector addition has an identity element 0? v ?
    V v 0 v
  • Vector addition is invertible? v ? V ? w ? V v
    w 0w is called additive inverse
  • Scalar multiplication is associative? v ? V, ?
    a, b in F a (b u) (a b) u
  • Scalar multiplication as an identity element 1?
    v ? V 1 v v
  • Distributivity holds for scalar multiplication
    over vector addition? u, v ? V , ? a ? F a (u
    v) a u a v
  • Distributivity holds for scalar multiplication
    over scalar addition? v ? V , ? a, b ? F (a
    b) v a v b v

4
Implementing our own Vector Type
  • For simplicity we use STL vector
  • See http//www.sgi.com/tech/stl/Vector.html
  • Does not provide arithmetic operations
  • Before we add operations, we define a new type
  • In its own namespace

5
Defining new Vector Type
  • namespace algebra
  • template lttypename Fgt
  • class vector
  • typedef F value_type
  • typedef F pointer
  • typedef F reference
  • typedef const F const_reference
  • //
  • reference operator(size_type n)
  • return my_vectorn
  • stdvectorltFgt my_vector
  • We have to define everything again!

6
Defining new Vector Type with Inheritance
  • namespace algebra
  • template lttypename Fgt
  • class vector public stdvectorltFgt
  • // typedef F value_type already
    exist
  • typedef stdvectorltFgt base
  • vector(typename basesize_type n) base(n)
  • We only need to re-implement constructors.
  • We can overwrite and add members if we want.

7
New operators
  • template lttypename Fgt
  • algebravectorltFgt operator(const
    algebravectorltFgt x,
  • const
    algebravectorltFgt y)
  • assert(x.size() y.size())
  • algebravectorltFgt result(x.size())
  • for (int i 0 i lt x.size() I)
  • resulti xi yi
  • return result
  • What is wrong with this?

8
What is the problem with operators in C?
  • Imagine you want to compute with vectorsz x
    y
  • What happens?
  • Adding the vectors and storing it in the
    temporary vector
  • Copy this vector into the stack
  • Deleting the temporary
  • Then assigning the vector on the stack to z
  • What it takes?
  • One loop for the addition
  • One memcopy to the stack
  • Another loop in the assignment (or a faster copy)
  • Optimization
  • Might remove the memcopy
  • We still have two loops
  • Anyway we go twice over the vectors

9
How Can This be Computed Faster?
  • Write an inline function for this, like add(x, y,
    z)
  • The easiest way
  • Will be as fast as any code you write inplace
  • Need to be implemented for any combination of
    operations and operands
  • Shallow copy
  • Values are actually not copied but only the
    pointer pointing to the data
  • Actually data are only aliased not really copied,
    e.g.,x y y3 7also changes x3
  • Becomes very error prone

10
How Can This be Computed Faster?
  • Move semantics
  • Shallow copies are not critical if only 1 copy
    exist
  • Results of expressions only exist once
  • After assigning the temporary variable or using
    in another expression it will be destroyed
  • Perfect for z x y
  • Does not avoid multiple loops in z x y w (in
    2 additions)
  • Expression templates
  • Operator only returns object of some type that
    contains information about operands and operator
  • In assignment computation will be finally
    executed
  • In one loop only no matter how many operands
  • Needs a lot of compile time when excessively used
  • Invented by Todd Veldhuizen from IU
  • See Blitz http//www.oonumerics.org/blitz/

11
How Can This be Computed Faster?
  • Source-to-source code transformation
  • Source program is parsed and represented as
    abstract syntax tree
  • Tree can be searched for certain patterns and
    modified
  • Tree retransformed into source code
  • For instance Rose
  • Special treatments for loops predefined
  • http//www.llnl.gov/CASC/rose/
  • Drawback dependence on rose (or other tool)

12
Vector Norms
  • Mapping from V to R
  • Always non-negative
  • Only 0 for additive neutral element
  • Must fulfill triangle inequality
  • Mostly used Lp-Norm and their discrete
    equivalents

13
Scalar Product
  • For vectors over real vectors defined as
  • algebravector x, y
  • s 0
  • for (int c 0 c lt x.size() c)
  • s xc yc
  • For vectors over complex number we would need the
    conjugates of xs elements

14
Optimization for Superscalar Processors
  • Some processors can compute multiple operations
    in parallel
  • All operations depend on s
  • Solution using partial sums
  • for (int c 0 c lt NP c 4)
  • sp0 xc yc
  • sp1 xc1 yc1
  • sp2 xc2 yc2
  • sp3 xc3 yc3
  • for (int c 4(NP/4) c lt NP c)
  • sp0 xc yc
  • Sp0 sp1 sp2 sp3

15
Optimizing Blocking Factor
  • Running tests and see which unrolling factor is
    optimal
  • Have to program this for each factor
  • Wouldnt it be nice to let the factor be a
    parameter
  • For instance a template parameter

16
Warm up
  • How we can compute at compile time?
  • Parameter must be known at compile time
  • Iterations must be transformed into recursions
  • Recursions are more expressive anyway
  • Class members which are static const are forced
    to be computed at compile time
  • Otherwise it yields a compilation error (or
    linkage error)
  • Example factorial

17
Computing Factorial at Compile Time
  • include ltiostreamgt
  • template ltunsigned Xgt
  • struct factorial
  • static const unsigned value X
    factorialltX-1gtvalue
  • template ltgt
  • struct factoriallt0gt
  • static const unsigned value 1
  • int main(int argc, char argv)
  • stdcout ltlt factoriallt7gtvalue ltlt '\n'
  • return 0

18
What We Need to Unroll the Dot Product?
  • Computable at run time
  • Stop criterion
  • How many iterations with blocks
  • Cleanup
  • How many iterations without blocks
  • To define at compile time
  • U independent operations in each block
  • U partial sums
  • How to compute the sum of the U partial sums
  • Again using recursion
  • Starting with data structure containing U
    variables
  • Next defining loop calling block in separate
    function
  • Recursive definition of computation in each block

19
Recursive Data
  • template ltunsigned Depthgt
  • struct recursive_data
  • double sum
  • recursive_dataltDepth-1gt remainder
  • recursive_data(double s) sum(s),
    remainder(s)
  • double sum_up()
  • return sum remainder.sum_up()
  • template ltgt
  • struct recursive_datalt1gt
  • double sum
  • recursive_data(double s) sum(s)
  • double sum_up()
  • return sum

20
Main function of Unrolled Dot product
  • template ltunsigned Depthgt
  • double unrolled_dot(vectorltdoublegt const v1,
  • vectorltdoublegt
    const v2)
  • // check v1.size() v2.size()
  • unsigned size v1.size(), blocks size /
    Depth,
  • blocked_size blocks Depth
  • recursive_dataltDepthgt sum_block(0.0)
  • for (unsigned i 0 i lt blocked_size i
    Depth)
  • dot_blockltDepth, Depthgt()(v1, v2,
    sum_block, i)
  • double sum sum_block.sum_up()
  • for (unsigned i blocked_size i lt size i)
  • sum v1i v2i
  • return sum

21
Definition of Blocks
  • template ltunsigned Depth, unsigned MaxDepthgt
  • struct dot_block
  • static unsigned const offset MaxDepth -
    Depth
  • void operator() (vectorltdoublegt const v1,
    vectorltdoublegt const v2,
  • recursive_dataltDepthgt sum_block, unsigned
    i)
  • sum_block.sum v1 i offset v2 i
    offset
  • dot_blockltDepth-1, MaxDepthgt() (v1, v2,
    sum_block.remainder, i)

22
Closing Definition of Block
  • template ltunsigned MaxDepthgt
  • struct dot_blocklt1, MaxDepthgt
  • static unsigned const offset MaxDepth - 1
  • void operator() (vectorltdoublegt const v1,
  • vectorltdoublegt
    const v2,
  • recursive_datalt1gt sum_block,
    unsigned i)
  • sum_block.sum v1 i offset v2 i
    offset

23
How Much Faster We are?
  • regular 4.09s, result 6000
  • unrolled 2 2.89s, result 6000
  • unrolled 4 2.52s, result 6000
  • --------------------
  • template 1 3.82s, result 6000
  • template 2 4.03s, result 6000
  • template 3 4.4s, result 6000
  • template 4 3.74s, result 6000
  • template 5 3.83s, result 6000
  • template 6 3.62s, result 6000
  • template 7 3.53s, result 6000
  • template 8 3.55s, result 6000
  • template 9 3.46s, result 6000
  • template10 3.54s, result 6000
  • template12 3.6s, result 6000
  • template14 4.15s, result 6000
  • template16 4.53s, result 6000
  • On Mac PowerBook with 1.33 GHz PowerPc

24
How Much Faster We are?
  • Overall performance is slower then C program
  • Unrolling has still some impact
  • Why the unrolled version is slower
  • Recursive class definition apparently avoid
    register use
  • Partial sums are kept in memory, well in cache
  • Proof template-less computation with recursive
    data structure has same timing
  • This are results with gcc
  • Other compilers, icc and kcc, can allocate it in
    registers
  • How can we get gcc to use registers?
  • Can we recursively define variables that are
    stored in registers?

25
How to Use Registers for Summation?
  • Define single variables in outer function
  • Call recursively function for block using
  • Functions must be inlined
  • Pass all variables to block function
  • As reference to modify the partial sums
  • Different statements in block must use different
    variable
  • Single computation adds to first variable
  • Call next statement in block without first
    variable
  • Every call would have different number of
    parameters
  • Unrolling will be limited to the number of
    variables

26
How to Use Registers for Summation?
  • Define single variables in outer function
  • Call recursively function for block using
  • Different statements in block must use different
    variable
  • Instead of dropping the first variable rotate it
  • All call have the same number of parameters
  • If unrolling factor is larger than of variables
  • Then variables are reused
  • Does not falsify the result
  • Parallelism is limited to the number of variables
  • But not the unrolling factor
  • If unrolling factor is smaller than of
    variables
  • Some variables stay 0
  • Does not falsify either

27
Unrolled Dot Product Revised
  • template ltunsigned Depthgt
  • double unrolled_dot(vectorltdoublegt const v1,
  • vectorltdoublegt
    const v2)
  • // check v1.size() v2.size()
  • unsigned size v1.size(), blocks size /
    Depth,
  • blocked_size blocks Depth
  • double s0 0.0, s1 0.0, s2 0.0, s3 0.0,
    s4 0.0, s5 0.0,
  • s6 0.0, s7 0.0
  • for (unsigned i 0 i lt blocked_size i
    Depth)
  • dot_blockltDepth, Depthgt()(v1, v2, i, s0, s1,
    s2, s3, s4, s5, s6, s7)
  • s0 s1 s2 s3 s4 s5 s6 s7
  • for (unsigned i blocked_size i lt size i)
  • s0 v1i v2i
  • return s0

28
Single Statement in Block
  • template ltunsigned Depth, unsigned MaxDepthgt
  • struct dot_block
  • static unsigned const offset MaxDepth -
    Depth
  • void operator() (vectorltdoublegt const v1,
    vectorltdoublegt const v2, unsigned i,
  • double s0, double s1, double s2,
    double s3,
  • double s4, double s5, double s6,
    double s7)
  • s0 v1 i offset v2 i offset
  • dot_blockltDepth-1, MaxDepthgt() (v1, v2, i, s1,
    s2, s3, s4, s5, s6, s7, s0)

29
Last Statement in Block
  • template ltunsigned MaxDepthgt
  • struct dot_blocklt1, MaxDepthgt
  • static unsigned const offset MaxDepth - 1
  • void operator() (vectorltdoublegt const v1,
    vectorltdoublegt const v2, unsigned i,
  • double s0, double, double, double,
  • double, double, double, double)
  • s0 v1 i offset v2 i offset

30
How Much Faster We are?
  • regular 3.98s, result 6000
  • unrolled 2 2.78s, result 6000
  • unrolled 4 2.39s, result 6000
  • --------------------
  • template 1 3.74s, result 6000
  • template 2 2.65s, result 6000
  • template 3 2.34s, result 6000
  • template 4 2.42s, result 6000
  • template 5 2.24s, result 6000
  • template 6 2.11s, result 6000
  • template 7 2.12s, result 6000
  • template 8 2.1s, result 6000
  • template 9 2.08s, result 6000
  • template10 2.01s, result 6000
  • template11 2.11s, result 6000
  • template12 1.97s, result 6000
  • template13 2.47s, result 6000
  • template14 2.76s, result 6000
  • template15 2.97s, result 6000

31
Analysis
  • Templated version is as fast as pure C code
  • For 2 and 4 unrolling factor
  • Unrolling beyond number of partial sum can speed
    up
  • Program is even simpler than the original
  • The simplest solutions are often the best.
Write a Comment
User Comments (0)
About PowerShow.com