Title: Software Optimization
1Software Optimization
Vikas, Chaudhary
MA 471
2High-level code
Intermediate code
Object code
Executable
Assembler
Assembly Code
Steps to create an executable
MA 471
3- Higher Optimizations
- Procedure within basic blocks.
- Procedure within single and nested loop
structures. - Entire procedure including all blocks and
structures. - File (inter-procedural analysis within a source
file) - Cross file (inter-procedural analysis across all
procedures)
MA 471
4- Compiler Options
- There are no strict rules about what each level
of optimization means but generally -
- O0 does one to many translations.
- O1 does basic block optimizations.
- O2 does loop optimizations.
- O4 does interfile optimizations.
- Some compilers also provide odataprefetch to
indicate that prefetch instructions should be
inserted to prefetch data from memory to cache.
MA 471
5- Increasing Register Pressure
- When too many registers are needed,
compilers must store values to memory and
restores values from memory. This degrades the
performance. -
- If we generate assembly code from compiler
via S and see that there is an inordinate number
of load and store instructions then it is implied
that compiler is generating too many spills. - Use register data type carefully.
MA 471
6- Dead Code Elimination
- Dead code Elimination is merely the removal of
code that is never used. - i0
- If (i!0) deadcode(i)
MA 471
7- Constant Folding and Propagation
-
- Constant folding is when expressions with
multiple constants are folded together and
evaluated at compile time. - a 1 2
- Will be replaced by a 3.
MA 471
8Common Subexpression elimination Common
subexpression elimination analyzes lines of code,
determines where identical subexpressions are
used and creates a temporary variable to hold one
instance of these values.
a b (c d) f e (c
d)
MA 471
9- Strength Reductions
- Strength reduction means replacing expensive
operations with cheaper ones. - Replacing integer multiplication or division by
constants with shift operations. - Replacing 32-bit integer division by 64-bit
floating point division. - Replacing floating point multiplications by small
constants with floating point additions. - Replacing power function by floating point
multiplications.
MA 471
10Filling Branch Delay Slots Branch delay slots
are the instructions after a branch that are
always executed. If the compiler is used with no
optimization, it will probably insert a nop into
branch delay slot.
MA 471
11Induction Variable Optimization for (i0 ilt n
i 2) iai i k m Where i is
induction variable. The above code can be
replaced by ic m for (i 0 ilt n i 2)
iai ic ic ic k
MA 471
12- Loop Fission
- This technique is often used when an inner
loop consists of a large number of lines and the
compiler has difficulty generating code without
spilling. -
- This technique is also helpful in improving
cache performance. - for(i 0 i lt n i)
- Yi yi xi xim
- Suppose xi and xi m maps to same
cache location. (Direct mapped cache). This will
cause cache thrashing. -
MA 471
13- Loop can be split as
- for(i 0 i lt n i)
- yi yi xi
- for(i 0 i lt n i)
- yi yi xi m
- This technique might not be very useful when
cache is n-way set associative.
MA 471
14- Loop Unrolling
- This technique reduces the effect of
branches, instruction latency, and potentially
the number of cache misses. - Do I 1, N
- Y(I) X(I)
- ENDDO
- After Unrolling
- NEND 4 (N/4)
- Do I I, N , 4
- Y(I) X(I)
- Y(I 1) X(I 1)
- Y(I 1) X(I 1)
- ENDDO
- Do I NEND1 , N
- Y(I) X(I)
- ENDDO
MA 471
15- Loading all the values of X before the values of
Y reduces the possibility of cache thrashing. - Amount of unrolling can decrease the number of
software prefetch instructions. - Excessive unrolling will cause data to be spilled
from register to memory. - Unrolling increases size of object code, which
might cause too many instruction cache misses. -
MA 471
16- Clock Cycles in an Unrolled Loop
Original order CC Modified order CC
Load X(1) 1 Load X(1) 1
Store Y(1) 7 Load X(2) 2
Load X(2) 8 Load X(3) 3
Store X(2) 14 Load X(4) 4
Load X(3) 15 Store X(1) 7
Store X(3) 21 Store X(2) 8
Load X(4) 22 Store X(3) 9
Store X(4) 28 Store X(4) 10
MA 471
17- Loop peeling
- This technique is used by compilers to
handle boundary conditions. -
- Do I 1, N
- if(I .EQ. 1) then
- XI 0
- ELSEIF (I. EQ. N) THEN
- X(I) N
- ELSE
- X(I) X (I) Y(I)
- ENDDO
- AFTER LOOP PEELING
- X(1) 0
- Do I 2, N-1
- X(I) X(I) Y(I)
- ENDDO
- X(N) N
MA 471
18- Software Pipelining
- Software pipelining is a technique for
recognizing loops such that each iteration in the
software-pipelined code is made from instructions
chosen from different iterations of the original
loop. -
- Iteration 0
Iteration 1
Iteration 2
Iteration 3
MA 471
19- Software pipeline is an optimization that is
impossible to duplicate with high level code
since the multiple assembly language instruction
that a single line of high level language creates
are moved around extensively. -
- Software pipeline is created only at high
optimization level.
MA 471
20- Compiler Speculation with Hardware Support
- Modern compilers try to speculate either to
improve the scheduling or to increase issue rate. - Hurdle
- Conditional instructions.
- In moving instructions across a branch the
compiler must ensure that exception behavior is
not changed and dynamic data dependence remains
same. -
- Compiler also finds out, which registers
are not being used and those registers are
renamed. -
-
-
MA 471
21MA 471