Software Optimization

About This Presentation

Title:

Software Optimization

Description:

The University of New Mexico. Constant Folding and Propagation ... Suppose x[i] and x[i m] maps to same cache location. ( Direct mapped cache) ... – PowerPoint PPT presentation

Number of Views:45

Avg rating:3.0/5.0

Slides: 22

Provided by: netchannel

Category:

more less

Transcript and Presenter's Notes

Title: Software Optimization

1
Software Optimization
Vikas, Chaudhary
MA 471
2

High-level code
Intermediate code
Object code
Executable
Assembler
Assembly Code
Steps to create an executable
MA 471
3

Higher Optimizations
Procedure within basic blocks.
Procedure within single and nested loop
structures.
Entire procedure including all blocks and
structures.
File (inter-procedural analysis within a source
file)
Cross file (inter-procedural analysis across all
procedures)

MA 471
4

Compiler Options
There are no strict rules about what each level
of optimization means but generally
O0 does one to many translations.
O1 does basic block optimizations.
O2 does loop optimizations.
O4 does interfile optimizations.
Some compilers also provide odataprefetch to
indicate that prefetch instructions should be
inserted to prefetch data from memory to cache.

MA 471
5

Increasing Register Pressure
When too many registers are needed,
compilers must store values to memory and
restores values from memory. This degrades the
performance.
If we generate assembly code from compiler
via S and see that there is an inordinate number
of load and store instructions then it is implied
that compiler is generating too many spills.
Use register data type carefully.

MA 471
6

Dead Code Elimination
Dead code Elimination is merely the removal of
code that is never used.
i0
If (i!0) deadcode(i)

MA 471
7

Constant Folding and Propagation
Constant folding is when expressions with
multiple constants are folded together and
evaluated at compile time.
a 1 2
Will be replaced by a 3.

MA 471
8
Common Subexpression elimination Common
subexpression elimination analyzes lines of code,
determines where identical subexpressions are
used and creates a temporary variable to hold one
instance of these values.
a b (c d) f e (c
d)

MA 471
9

Strength Reductions
Strength reduction means replacing expensive
operations with cheaper ones.
Replacing integer multiplication or division by
constants with shift operations.
Replacing 32-bit integer division by 64-bit
floating point division.
Replacing floating point multiplications by small
constants with floating point additions.
Replacing power function by floating point
multiplications.

MA 471
10
Filling Branch Delay Slots Branch delay slots
are the instructions after a branch that are
always executed. If the compiler is used with no
optimization, it will probably insert a nop into
branch delay slot.

MA 471
11
Induction Variable Optimization for (i0 ilt n
i 2) iai i k m Where i is
induction variable. The above code can be
replaced by ic m for (i 0 ilt n i 2)
iai ic ic ic k

MA 471
12

Loop Fission
This technique is often used when an inner
loop consists of a large number of lines and the
compiler has difficulty generating code without
spilling.
This technique is also helpful in improving
cache performance.
for(i 0 i lt n i)
Yi yi xi xim
Suppose xi and xi m maps to same
cache location. (Direct mapped cache). This will
cause cache thrashing.

MA 471
13

Loop can be split as
for(i 0 i lt n i)
yi yi xi
for(i 0 i lt n i)
yi yi xi m
This technique might not be very useful when
cache is n-way set associative.

MA 471
14

Loop Unrolling
This technique reduces the effect of
branches, instruction latency, and potentially
the number of cache misses.
Do I 1, N
Y(I) X(I)
ENDDO
After Unrolling
NEND 4 (N/4)
Do I I, N , 4
Y(I) X(I)
Y(I 1) X(I 1)
Y(I 1) X(I 1)
ENDDO
Do I NEND1 , N
Y(I) X(I)
ENDDO

MA 471
15

Loading all the values of X before the values of
Y reduces the possibility of cache thrashing.
Amount of unrolling can decrease the number of
software prefetch instructions.
Excessive unrolling will cause data to be spilled
from register to memory.
Unrolling increases size of object code, which
might cause too many instruction cache misses.

MA 471
16

Clock Cycles in an Unrolled Loop

Original order CC Modified order CC
Load X(1) 1 Load X(1) 1
Store Y(1) 7 Load X(2) 2
Load X(2) 8 Load X(3) 3
Store X(2) 14 Load X(4) 4
Load X(3) 15 Store X(1) 7
Store X(3) 21 Store X(2) 8
Load X(4) 22 Store X(3) 9
Store X(4) 28 Store X(4) 10
MA 471
17

Loop peeling
This technique is used by compilers to
handle boundary conditions.
Do I 1, N
if(I .EQ. 1) then
XI 0
ELSEIF (I. EQ. N) THEN
X(I) N
ELSE
X(I) X (I) Y(I)
ENDDO
AFTER LOOP PEELING
X(1) 0
Do I 2, N-1
X(I) X(I) Y(I)
ENDDO
X(N) N

MA 471
18

Software Pipelining
Software pipelining is a technique for
recognizing loops such that each iteration in the
software-pipelined code is made from instructions
chosen from different iterations of the original
loop.
Iteration 0

Iteration 1
Iteration 2
Iteration 3
MA 471
19

Software pipeline is an optimization that is
impossible to duplicate with high level code
since the multiple assembly language instruction
that a single line of high level language creates
are moved around extensively.
Software pipeline is created only at high
optimization level.

MA 471
20

Compiler Speculation with Hardware Support
Modern compilers try to speculate either to
improve the scheduling or to increase issue rate.
Hurdle
Conditional instructions.
In moving instructions across a branch the
compiler must ensure that exception behavior is
not changed and dynamic data dependence remains
same.
Compiler also finds out, which registers
are not being used and those registers are
renamed.