P573 Scientific Computing Lecture 4: Vector Operations - PowerPoint PPT Presentation

1 / 31

About This Presentation

Title:

P573 Scientific Computing Lecture 4: Vector Operations

Description:

V is a vector space over the field F if. V is an Abelian group. Vector ... int main(int argc, char** argv) { std::cout factorial 7 ::value 'n'; return 0; ... – PowerPoint PPT presentation

Number of Views:47

Avg rating:3.0/5.0

Slides: 32

Provided by: david3080

Category:

more less

Transcript and Presenter's Notes

Title: P573 Scientific Computing Lecture 4: Vector Operations

1
P573Scientific ComputingLecture 4 Vector
Operations

Peter Gottschling
pgottsch_at_cs.indiana.edu
www.osl.iu.edu/pgottsch/courses/p573-06

2
Outline

Vector space
Defining your own vector type
Programming operators
Norms
Inner product

3
Vector Space

V is a vector space over the field F if
V is an Abelian group
Vector addition is associative ? u, v, w ? V
(u v) w u (v w)
Vector addition is commutative? u, v ? V u v
v u
Vector addition has an identity element 0? v ?
V v 0 v
Vector addition is invertible? v ? V ? w ? V v
w 0w is called additive inverse
Scalar multiplication is associative? v ? V, ?
a, b in F a (b u) (a b) u
Scalar multiplication as an identity element 1?
v ? V 1 v v
Distributivity holds for scalar multiplication
over vector addition? u, v ? V , ? a ? F a (u
v) a u a v
Distributivity holds for scalar multiplication
over scalar addition? v ? V , ? a, b ? F (a
b) v a v b v

4
Implementing our own Vector Type

For simplicity we use STL vector
See http//www.sgi.com/tech/stl/Vector.html
Does not provide arithmetic operations
Before we add operations, we define a new type
In its own namespace

5
Defining new Vector Type

namespace algebra
template lttypename Fgt
class vector
typedef F value_type
typedef F pointer
typedef F reference
typedef const F const_reference
//
reference operator(size_type n)
return my_vectorn
stdvectorltFgt my_vector
We have to define everything again!

6
Defining new Vector Type with Inheritance

namespace algebra
template lttypename Fgt
class vector public stdvectorltFgt
// typedef F value_type already
exist
typedef stdvectorltFgt base
vector(typename basesize_type n) base(n)
We only need to re-implement constructors.
We can overwrite and add members if we want.

7
New operators

template lttypename Fgt
algebravectorltFgt operator(const
algebravectorltFgt x,
const
algebravectorltFgt y)
assert(x.size() y.size())
algebravectorltFgt result(x.size())
for (int i 0 i lt x.size() I)
resulti xi yi
return result
What is wrong with this?

8
What is the problem with operators in C?

Imagine you want to compute with vectorsz x
y
What happens?
Adding the vectors and storing it in the
temporary vector
Copy this vector into the stack
Deleting the temporary
Then assigning the vector on the stack to z
What it takes?
One loop for the addition
One memcopy to the stack
Another loop in the assignment (or a faster copy)
Optimization
Might remove the memcopy
We still have two loops
Anyway we go twice over the vectors

9
How Can This be Computed Faster?

Write an inline function for this, like add(x, y,
z)
The easiest way
Will be as fast as any code you write inplace
Need to be implemented for any combination of
operations and operands
Shallow copy
Values are actually not copied but only the
pointer pointing to the data
Actually data are only aliased not really copied,
e.g.,x y y3 7also changes x3
Becomes very error prone

10
How Can This be Computed Faster?

Move semantics
Shallow copies are not critical if only 1 copy
exist
Results of expressions only exist once
After assigning the temporary variable or using
in another expression it will be destroyed
Perfect for z x y
Does not avoid multiple loops in z x y w (in
2 additions)
Expression templates
Operator only returns object of some type that
contains information about operands and operator
In assignment computation will be finally
executed
In one loop only no matter how many operands
Needs a lot of compile time when excessively used
Invented by Todd Veldhuizen from IU
See Blitz http//www.oonumerics.org/blitz/

11
How Can This be Computed Faster?

Source-to-source code transformation
Source program is parsed and represented as
abstract syntax tree
Tree can be searched for certain patterns and
modified
Tree retransformed into source code
For instance Rose
Special treatments for loops predefined
http//www.llnl.gov/CASC/rose/
Drawback dependence on rose (or other tool)

12
Vector Norms

Mapping from V to R
Always non-negative
Only 0 for additive neutral element
Must fulfill triangle inequality
Mostly used Lp-Norm and their discrete
equivalents

13
Scalar Product

For vectors over real vectors defined as
algebravector x, y
s 0
for (int c 0 c lt x.size() c)
s xc yc
For vectors over complex number we would need the
conjugates of xs elements

14
Optimization for Superscalar Processors

Some processors can compute multiple operations
in parallel
All operations depend on s
Solution using partial sums
for (int c 0 c lt NP c 4)
sp0 xc yc
sp1 xc1 yc1
sp2 xc2 yc2
sp3 xc3 yc3
for (int c 4(NP/4) c lt NP c)
sp0 xc yc
Sp0 sp1 sp2 sp3

15
Optimizing Blocking Factor

Running tests and see which unrolling factor is
optimal
Have to program this for each factor
Wouldnt it be nice to let the factor be a
parameter
For instance a template parameter

16
Warm up

How we can compute at compile time?
Parameter must be known at compile time
Iterations must be transformed into recursions
Recursions are more expressive anyway
Class members which are static const are forced
to be computed at compile time
Otherwise it yields a compilation error (or
linkage error)
Example factorial

17
Computing Factorial at Compile Time

include ltiostreamgt
template ltunsigned Xgt
struct factorial
static const unsigned value X
factorialltX-1gtvalue
template ltgt
struct factoriallt0gt
static const unsigned value 1
int main(int argc, char argv)
stdcout ltlt factoriallt7gtvalue ltlt '\n'
return 0

18
What We Need to Unroll the Dot Product?

Computable at run time
Stop criterion
How many iterations with blocks
Cleanup
How many iterations without blocks
To define at compile time
U independent operations in each block
U partial sums
How to compute the sum of the U partial sums
Again using recursion
Starting with data structure containing U
variables
Next defining loop calling block in separate
function
Recursive definition of computation in each block

19
Recursive Data

template ltunsigned Depthgt
struct recursive_data
double sum
recursive_dataltDepth-1gt remainder
recursive_data(double s) sum(s),
remainder(s)
double sum_up()
return sum remainder.sum_up()
template ltgt
struct recursive_datalt1gt
double sum
recursive_data(double s) sum(s)
double sum_up()
return sum

20
Main function of Unrolled Dot product

template ltunsigned Depthgt
double unrolled_dot(vectorltdoublegt const v1,
vectorltdoublegt
const v2)
// check v1.size() v2.size()
unsigned size v1.size(), blocks size /
Depth,
blocked_size blocks Depth
recursive_dataltDepthgt sum_block(0.0)
for (unsigned i 0 i lt blocked_size i
Depth)
dot_blockltDepth, Depthgt()(v1, v2,
sum_block, i)
double sum sum_block.sum_up()
for (unsigned i blocked_size i lt size i)
sum v1i v2i
return sum

21
Definition of Blocks

template ltunsigned Depth, unsigned MaxDepthgt
struct dot_block
static unsigned const offset MaxDepth -
Depth
void operator() (vectorltdoublegt const v1,
vectorltdoublegt const v2,
recursive_dataltDepthgt sum_block, unsigned
i)
sum_block.sum v1 i offset v2 i
offset
dot_blockltDepth-1, MaxDepthgt() (v1, v2,
sum_block.remainder, i)

22
Closing Definition of Block

template ltunsigned MaxDepthgt
struct dot_blocklt1, MaxDepthgt
static unsigned const offset MaxDepth - 1
void operator() (vectorltdoublegt const v1,
vectorltdoublegt
const v2,
recursive_datalt1gt sum_block,
unsigned i)
sum_block.sum v1 i offset v2 i
offset

23
How Much Faster We are?

regular 4.09s, result 6000
unrolled 2 2.89s, result 6000
unrolled 4 2.52s, result 6000
--------------------
template 1 3.82s, result 6000
template 2 4.03s, result 6000
template 3 4.4s, result 6000
template 4 3.74s, result 6000
template 5 3.83s, result 6000
template 6 3.62s, result 6000
template 7 3.53s, result 6000
template 8 3.55s, result 6000
template 9 3.46s, result 6000
template10 3.54s, result 6000
template12 3.6s, result 6000
template14 4.15s, result 6000
template16 4.53s, result 6000
On Mac PowerBook with 1.33 GHz PowerPc

24
How Much Faster We are?

Overall performance is slower then C program
Unrolling has still some impact
Why the unrolled version is slower
Recursive class definition apparently avoid
register use
Partial sums are kept in memory, well in cache
Proof template-less computation with recursive
data structure has same timing
This are results with gcc
Other compilers, icc and kcc, can allocate it in
registers
How can we get gcc to use registers?
Can we recursively define variables that are
stored in registers?

25
How to Use Registers for Summation?

Define single variables in outer function
Call recursively function for block using
Functions must be inlined
Pass all variables to block function
As reference to modify the partial sums
Different statements in block must use different
variable
Single computation adds to first variable
Call next statement in block without first
variable
Every call would have different number of
parameters
Unrolling will be limited to the number of
variables

26
How to Use Registers for Summation?

Define single variables in outer function
Call recursively function for block using
Different statements in block must use different
variable
Instead of dropping the first variable rotate it
All call have the same number of parameters
If unrolling factor is larger than of variables
Then variables are reused
Does not falsify the result
Parallelism is limited to the number of variables
But not the unrolling factor
If unrolling factor is smaller than of
variables
Some variables stay 0
Does not falsify either

27
Unrolled Dot Product Revised

template ltunsigned Depthgt
double unrolled_dot(vectorltdoublegt const v1,
vectorltdoublegt
const v2)
// check v1.size() v2.size()
unsigned size v1.size(), blocks size /
Depth,
blocked_size blocks Depth
double s0 0.0, s1 0.0, s2 0.0, s3 0.0,
s4 0.0, s5 0.0,
s6 0.0, s7 0.0
for (unsigned i 0 i lt blocked_size i
Depth)
dot_blockltDepth, Depthgt()(v1, v2, i, s0, s1,
s2, s3, s4, s5, s6, s7)
s0 s1 s2 s3 s4 s5 s6 s7
for (unsigned i blocked_size i lt size i)
s0 v1i v2i
return s0

28
Single Statement in Block

template ltunsigned Depth, unsigned MaxDepthgt
struct dot_block
static unsigned const offset MaxDepth -
Depth
void operator() (vectorltdoublegt const v1,
vectorltdoublegt const v2, unsigned i,
double s0, double s1, double s2,
double s3,
double s4, double s5, double s6,
double s7)
s0 v1 i offset v2 i offset
dot_blockltDepth-1, MaxDepthgt() (v1, v2, i, s1,
s2, s3, s4, s5, s6, s7, s0)

29
Last Statement in Block

template ltunsigned MaxDepthgt
struct dot_blocklt1, MaxDepthgt
static unsigned const offset MaxDepth - 1
void operator() (vectorltdoublegt const v1,
vectorltdoublegt const v2, unsigned i,
double s0, double, double, double,
double, double, double, double)
s0 v1 i offset v2 i offset

30
How Much Faster We are?