Title: P573 Scientific Computing Lecture 4: Vector Operations
1P573Scientific ComputingLecture 4 Vector
Operations
- Peter Gottschling
- pgottsch_at_cs.indiana.edu
- www.osl.iu.edu/pgottsch/courses/p573-06
2Outline
- Vector space
- Defining your own vector type
- Programming operators
- Norms
- Inner product
3Vector Space
- V is a vector space over the field F if
- V is an Abelian group
- Vector addition is associative ? u, v, w ? V
(u v) w u (v w) - Vector addition is commutative? u, v ? V u v
v u - Vector addition has an identity element 0? v ?
V v 0 v - Vector addition is invertible? v ? V ? w ? V v
w 0w is called additive inverse - Scalar multiplication is associative? v ? V, ?
a, b in F a (b u) (a b) u - Scalar multiplication as an identity element 1?
v ? V 1 v v - Distributivity holds for scalar multiplication
over vector addition? u, v ? V , ? a ? F a (u
v) a u a v - Distributivity holds for scalar multiplication
over scalar addition? v ? V , ? a, b ? F (a
b) v a v b v
4Implementing our own Vector Type
- For simplicity we use STL vector
- See http//www.sgi.com/tech/stl/Vector.html
- Does not provide arithmetic operations
- Before we add operations, we define a new type
- In its own namespace
5Defining new Vector Type
- namespace algebra
- template lttypename Fgt
- class vector
-
- typedef F value_type
- typedef F pointer
- typedef F reference
- typedef const F const_reference
- //
- reference operator(size_type n)
- return my_vectorn
- stdvectorltFgt my_vector
-
-
- We have to define everything again!
6Defining new Vector Type with Inheritance
- namespace algebra
- template lttypename Fgt
- class vector public stdvectorltFgt
-
- // typedef F value_type already
exist - typedef stdvectorltFgt base
- vector(typename basesize_type n) base(n)
-
-
- We only need to re-implement constructors.
- We can overwrite and add members if we want.
7New operators
- template lttypename Fgt
- algebravectorltFgt operator(const
algebravectorltFgt x, - const
algebravectorltFgt y) -
- assert(x.size() y.size())
- algebravectorltFgt result(x.size())
- for (int i 0 i lt x.size() I)
- resulti xi yi
- return result
-
-
- What is wrong with this?
8What is the problem with operators in C?
- Imagine you want to compute with vectorsz x
y - What happens?
- Adding the vectors and storing it in the
temporary vector - Copy this vector into the stack
- Deleting the temporary
- Then assigning the vector on the stack to z
- What it takes?
- One loop for the addition
- One memcopy to the stack
- Another loop in the assignment (or a faster copy)
- Optimization
- Might remove the memcopy
- We still have two loops
- Anyway we go twice over the vectors
9How Can This be Computed Faster?
- Write an inline function for this, like add(x, y,
z) - The easiest way
- Will be as fast as any code you write inplace
- Need to be implemented for any combination of
operations and operands - Shallow copy
- Values are actually not copied but only the
pointer pointing to the data - Actually data are only aliased not really copied,
e.g.,x y y3 7also changes x3 - Becomes very error prone
10How Can This be Computed Faster?
- Move semantics
- Shallow copies are not critical if only 1 copy
exist - Results of expressions only exist once
- After assigning the temporary variable or using
in another expression it will be destroyed - Perfect for z x y
- Does not avoid multiple loops in z x y w (in
2 additions) - Expression templates
- Operator only returns object of some type that
contains information about operands and operator - In assignment computation will be finally
executed - In one loop only no matter how many operands
- Needs a lot of compile time when excessively used
- Invented by Todd Veldhuizen from IU
- See Blitz http//www.oonumerics.org/blitz/
11How Can This be Computed Faster?
- Source-to-source code transformation
- Source program is parsed and represented as
abstract syntax tree - Tree can be searched for certain patterns and
modified - Tree retransformed into source code
- For instance Rose
- Special treatments for loops predefined
- http//www.llnl.gov/CASC/rose/
- Drawback dependence on rose (or other tool)
12Vector Norms
- Mapping from V to R
- Always non-negative
- Only 0 for additive neutral element
- Must fulfill triangle inequality
- Mostly used Lp-Norm and their discrete
equivalents
13Scalar Product
- For vectors over real vectors defined as
- algebravector x, y
- s 0
- for (int c 0 c lt x.size() c)
- s xc yc
- For vectors over complex number we would need the
conjugates of xs elements
14Optimization for Superscalar Processors
- Some processors can compute multiple operations
in parallel - All operations depend on s
- Solution using partial sums
- for (int c 0 c lt NP c 4)
- sp0 xc yc
- sp1 xc1 yc1
- sp2 xc2 yc2
- sp3 xc3 yc3
- for (int c 4(NP/4) c lt NP c)
- sp0 xc yc
- Sp0 sp1 sp2 sp3
15Optimizing Blocking Factor
- Running tests and see which unrolling factor is
optimal - Have to program this for each factor
- Wouldnt it be nice to let the factor be a
parameter - For instance a template parameter
16Warm up
- How we can compute at compile time?
- Parameter must be known at compile time
- Iterations must be transformed into recursions
- Recursions are more expressive anyway
- Class members which are static const are forced
to be computed at compile time - Otherwise it yields a compilation error (or
linkage error) - Example factorial
17Computing Factorial at Compile Time
- include ltiostreamgt
- template ltunsigned Xgt
- struct factorial
-
- static const unsigned value X
factorialltX-1gtvalue -
- template ltgt
- struct factoriallt0gt
-
- static const unsigned value 1
-
- int main(int argc, char argv)
- stdcout ltlt factoriallt7gtvalue ltlt '\n'
- return 0
-
18What We Need to Unroll the Dot Product?
- Computable at run time
- Stop criterion
- How many iterations with blocks
- Cleanup
- How many iterations without blocks
- To define at compile time
- U independent operations in each block
- U partial sums
- How to compute the sum of the U partial sums
- Again using recursion
- Starting with data structure containing U
variables - Next defining loop calling block in separate
function - Recursive definition of computation in each block
19Recursive Data
- template ltunsigned Depthgt
- struct recursive_data
-
- double sum
- recursive_dataltDepth-1gt remainder
- recursive_data(double s) sum(s),
remainder(s) - double sum_up()
- return sum remainder.sum_up()
-
-
- template ltgt
- struct recursive_datalt1gt
-
- double sum
- recursive_data(double s) sum(s)
- double sum_up()
- return sum
20Main function of Unrolled Dot product
- template ltunsigned Depthgt
- double unrolled_dot(vectorltdoublegt const v1,
- vectorltdoublegt
const v2) -
- // check v1.size() v2.size()
- unsigned size v1.size(), blocks size /
Depth, - blocked_size blocks Depth
- recursive_dataltDepthgt sum_block(0.0)
- for (unsigned i 0 i lt blocked_size i
Depth) - dot_blockltDepth, Depthgt()(v1, v2,
sum_block, i) - double sum sum_block.sum_up()
- for (unsigned i blocked_size i lt size i)
- sum v1i v2i
- return sum
21Definition of Blocks
- template ltunsigned Depth, unsigned MaxDepthgt
- struct dot_block
-
- static unsigned const offset MaxDepth -
Depth - void operator() (vectorltdoublegt const v1,
vectorltdoublegt const v2, - recursive_dataltDepthgt sum_block, unsigned
i) -
- sum_block.sum v1 i offset v2 i
offset - dot_blockltDepth-1, MaxDepthgt() (v1, v2,
sum_block.remainder, i) -
-
22Closing Definition of Block
- template ltunsigned MaxDepthgt
- struct dot_blocklt1, MaxDepthgt
-
- static unsigned const offset MaxDepth - 1
- void operator() (vectorltdoublegt const v1,
- vectorltdoublegt
const v2, - recursive_datalt1gt sum_block,
unsigned i) -
- sum_block.sum v1 i offset v2 i
offset -
-
23How Much Faster We are?
- regular 4.09s, result 6000
- unrolled 2 2.89s, result 6000
- unrolled 4 2.52s, result 6000
- --------------------
- template 1 3.82s, result 6000
- template 2 4.03s, result 6000
- template 3 4.4s, result 6000
- template 4 3.74s, result 6000
- template 5 3.83s, result 6000
- template 6 3.62s, result 6000
- template 7 3.53s, result 6000
- template 8 3.55s, result 6000
- template 9 3.46s, result 6000
- template10 3.54s, result 6000
- template12 3.6s, result 6000
- template14 4.15s, result 6000
- template16 4.53s, result 6000
- On Mac PowerBook with 1.33 GHz PowerPc
24How Much Faster We are?
- Overall performance is slower then C program
- Unrolling has still some impact
- Why the unrolled version is slower
- Recursive class definition apparently avoid
register use - Partial sums are kept in memory, well in cache
- Proof template-less computation with recursive
data structure has same timing - This are results with gcc
- Other compilers, icc and kcc, can allocate it in
registers - How can we get gcc to use registers?
- Can we recursively define variables that are
stored in registers?
25How to Use Registers for Summation?
- Define single variables in outer function
- Call recursively function for block using
- Functions must be inlined
- Pass all variables to block function
- As reference to modify the partial sums
- Different statements in block must use different
variable - Single computation adds to first variable
- Call next statement in block without first
variable - Every call would have different number of
parameters - Unrolling will be limited to the number of
variables
26How to Use Registers for Summation?
- Define single variables in outer function
- Call recursively function for block using
- Different statements in block must use different
variable - Instead of dropping the first variable rotate it
- All call have the same number of parameters
- If unrolling factor is larger than of variables
- Then variables are reused
- Does not falsify the result
- Parallelism is limited to the number of variables
- But not the unrolling factor
- If unrolling factor is smaller than of
variables - Some variables stay 0
- Does not falsify either
27Unrolled Dot Product Revised
- template ltunsigned Depthgt
- double unrolled_dot(vectorltdoublegt const v1,
- vectorltdoublegt
const v2) -
- // check v1.size() v2.size()
- unsigned size v1.size(), blocks size /
Depth, - blocked_size blocks Depth
- double s0 0.0, s1 0.0, s2 0.0, s3 0.0,
s4 0.0, s5 0.0, - s6 0.0, s7 0.0
- for (unsigned i 0 i lt blocked_size i
Depth) - dot_blockltDepth, Depthgt()(v1, v2, i, s0, s1,
s2, s3, s4, s5, s6, s7) - s0 s1 s2 s3 s4 s5 s6 s7
- for (unsigned i blocked_size i lt size i)
- s0 v1i v2i
- return s0
-
28Single Statement in Block
- template ltunsigned Depth, unsigned MaxDepthgt
- struct dot_block
-
- static unsigned const offset MaxDepth -
Depth - void operator() (vectorltdoublegt const v1,
vectorltdoublegt const v2, unsigned i, - double s0, double s1, double s2,
double s3, - double s4, double s5, double s6,
double s7) -
- s0 v1 i offset v2 i offset
- dot_blockltDepth-1, MaxDepthgt() (v1, v2, i, s1,
s2, s3, s4, s5, s6, s7, s0) -
29Last Statement in Block
- template ltunsigned MaxDepthgt
- struct dot_blocklt1, MaxDepthgt
-
- static unsigned const offset MaxDepth - 1
- void operator() (vectorltdoublegt const v1,
vectorltdoublegt const v2, unsigned i, - double s0, double, double, double,
- double, double, double, double)
-
- s0 v1 i offset v2 i offset
-
30How Much Faster We are?
- regular 3.98s, result 6000
- unrolled 2 2.78s, result 6000
- unrolled 4 2.39s, result 6000
- --------------------
- template 1 3.74s, result 6000
- template 2 2.65s, result 6000
- template 3 2.34s, result 6000
- template 4 2.42s, result 6000
- template 5 2.24s, result 6000
- template 6 2.11s, result 6000
- template 7 2.12s, result 6000
- template 8 2.1s, result 6000
- template 9 2.08s, result 6000
- template10 2.01s, result 6000
- template11 2.11s, result 6000
- template12 1.97s, result 6000
- template13 2.47s, result 6000
- template14 2.76s, result 6000
- template15 2.97s, result 6000
31Analysis
- Templated version is as fast as pure C code
- For 2 and 4 unrolling factor
- Unrolling beyond number of partial sum can speed
up - Program is even simpler than the original
- The simplest solutions are often the best.