Architecture-dependent optimizations - PowerPoint PPT Presentation

Slides: 20

Provided by: scho71

Learn more at: https://cs.nyu.edu

1
Architecture-dependent optimizations

Functional units, delay slots and dependency
analysis

2
RISC architectures

The pipeline structure of modern architectures
requires careful instruction scheduling.
If instruction I1 creates a value, there may be a
latency that has to elapse before another
instruction I2 can use this value.
If an instruction awaits the result of a previous
computation, the pipeline may have to stall until
the result becomes available.
A branch instruction is time-consuming and
affects the contents of the instruction cache.
Execution cannot start at destination before one
or more cycles have elapsed.

3
Instruction scheduling

Purpose: minimize stalls and delays, fill delay
slots with useful computations, minimize
execution time of basic block.
Tool: dependency analysis. Uncover legal
reorderings of instructions, available
parallelism in basic blocks and beyond.
Applications:
Filling delay slots is important for all programs.
Dependency analysis is critical for reordering of
loop computations on vector processors and others.

4
Dependence relations

A data dependence is a constraint that arises
from the flow of data between statements.
Violating a data dependence by reordering may
lead to incorrect results.
If S1 sets a value that S2 uses, this is a flow
dependence or true dependence between S1 and S2.
If S1 uses some variable's value and S2 sets it,
there is an antidependence between them.
If both S1 and S2 set the value of some variable,
there is an output dependence between them.
If both S1 and S2 read the value of some variable,
there is an input dependence between them. This
does not impose an ordering.
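
The four relations above can be sketched as a small classifier. This is only an illustration: the function names and the representation of a statement as a pair of read and write sets are assumptions made for the example, not part of the slides.

```python
def classify_dependence(s1_reads, s1_writes, s2_reads, s2_writes):
    """Kinds of dependence from S1 to a later statement S2 (sets of variable names)."""
    kinds = set()
    if s1_writes & s2_reads:
        kinds.add("flow")      # S1 sets a value that S2 uses
    if s1_reads & s2_writes:
        kinds.add("anti")      # S1 uses a value that S2 later overwrites
    if s1_writes & s2_writes:
        kinds.add("output")    # both statements set the same variable
    if s1_reads & s2_reads:
        kinds.add("input")     # both statements only read it
    return kinds


def may_reorder(kinds):
    """Reordering is legal when nothing but input dependences link the statements."""
    return kinds <= {"input"}


# S1: a = b + c      S2: d = a * b
deps = classify_dependence({"b", "c"}, {"a"}, {"a", "b"}, {"d"})
print(sorted(deps), may_reorder(deps))
```

Only the flow, anti and output kinds constrain the order; an input dependence is reported but does not prevent interchange.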

5
The dependence DAG of a basic block

There is an edge from I1 to I2 in the dependence DAG if:
I1 writes a register or location that I2 uses:
I1 fd I2 (flow dependence)
I1 uses a register or location that I2 changes:
I1 ad I2 (antidependence)
I1 and I2 write to the same register or location:
I1 od I2 (output dependence)
I1 and I2 exhibit a structural hazard: a load
followed by a store cannot be interchanged unless
the addresses are known to be distinct.
X <- A[I]
A[J] <- Y -- cannot interchange, X might get Y
If there is an edge between I1 and I2, I2 must
not start executing until I1 has executed for
some number of cycles.

6
Example

1 R3 <- [R15]
2 R4 <- [R15 + 4]
3 R2 <- R3 - R4 -- needs R4: stall one cycle
4 R5 <- [R12]
5 R12 <- R12 + 4
6 R6 <- R3 * R5
7 [R15 + 4] <- R3
8 R5 <- R6 + 2

[Figure: dependence DAG for instructions 1 to 8 above]

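
As a rough sketch, the edges of the dependence DAG for the eight-instruction example can be derived from per-instruction definition and use sets. The encoding below is one reading of the example (instructions 1, 2 and 4 are taken to be loads and instruction 7 a store to the location that instruction 2 loads), and it assumes that memory cells named by different address expressions never alias. The names `BLOCK` and `dependence_edges` are made up for the illustration.

```python
# instruction number -> (defined names, used names); "M[...]" stands for a memory cell.
# Assumption: M[R15], M[R15+4] and M[R12] are distinct locations.
BLOCK = {
    1: ({"R3"}, {"R15", "M[R15]"}),
    2: ({"R4"}, {"R15", "M[R15+4]"}),
    3: ({"R2"}, {"R3", "R4"}),
    4: ({"R5"}, {"R12", "M[R12]"}),
    5: ({"R12"}, {"R12"}),
    6: ({"R6"}, {"R3", "R5"}),
    7: ({"M[R15+4]"}, {"R15", "R3"}),
    8: ({"R5"}, {"R6"}),
}


def dependence_edges(block):
    """Map (i, j) with i before j to the kinds of dependence that order them."""
    edges = {}
    order = sorted(block)
    for pos, i in enumerate(order):
        defs_i, uses_i = block[i]
        for j in order[pos + 1:]:
            defs_j, uses_j = block[j]
            kinds = set()
            if defs_i & uses_j:
                kinds.add("flow")
            if uses_i & defs_j:
                kinds.add("anti")
            if defs_i & defs_j:
                kinds.add("output")
            if kinds:
                edges[(i, j)] = kinds
    return edges


EDGES = dependence_edges(BLOCK)
ROOTS = sorted(set(BLOCK) - {j for (_, j) in EDGES})
print(sorted(EDGES))
print(ROOTS)
```

Without the no-alias assumption, a conservative analysis would also have to add an edge from the load in instruction 4 to the store in instruction 7.
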
7
Contention for resources

A functional unit is pipelined and consists of
multiple resources. Instructions moving through the
pipeline may conflict on use of resources.
E.g., floating-point unit on MIPS:
A: mantissa add
E: exception test
M: multiplier, first stage
N: multiplier, second stage
R: adder round
S: operand shift
U: unpack
Add uses successively U, S and A, A and R, R and
S (4 cycles).
Mul uses U, M, M, M, N, N and A, R (7 cycles).
Conflict depends on relative starting time of two
instructions.
Edges in dependency graph are labelled with
latencies (> 1).
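
A minimal sketch of how such conflicts can be detected with reservation tables, one set of resources per cycle. The add table follows the stage list above; the multiply table assumes the sequence U, M, M, M, N, N and A, R, where the placement of the N stages is an assumption. The function names are hypothetical.

```python
# One entry per cycle: the set of resources the instruction occupies in that cycle.
ADD = [{"U"}, {"S", "A"}, {"A", "R"}, {"R", "S"}]
MUL = [{"U"}, {"M"}, {"M"}, {"M"}, {"N"}, {"N", "A"}, {"R"}]  # N cycles are assumed


def conflicts(first, second, offset):
    """True if `second`, issued `offset` cycles after `first`, needs a busy resource."""
    for cycle, needed in enumerate(second):
        t = cycle + offset
        if t < len(first) and first[t] & needed:
            return True
    return False


def conflicting_offsets(first, second):
    """All issue distances at which `second` collides with `first`."""
    return [d for d in range(len(first)) if conflicts(first, second, d)]


print(conflicting_offsets(MUL, ADD))  # add issued after a multiply
print(conflicting_offsets(ADD, MUL))  # multiply issued after an add
```

Under these tables an add collides with an earlier multiply only when issued 0, 3 or 4 cycles after it, which is the sense in which the conflict depends on relative starting time.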

8
Branch scheduling

Important use of dependency graph: fill delay
slots (branch takes two cycles to reach
destination).
Original code                  After scheduling
R2 <- [R1]                     R2 <- [R1]
R3 <- [R1 + 4]                 R3 <- [R1 + 4]
R4 <- R2 + R3   (stall)        R5 <- R2 - 1
R5 <- R2 - 1                   goto L1
goto L1                        R4 <- R2 + R3
nop
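
The transformation on this slide can be approximated by a small search: try each instruction that nothing later in the block depends on as the delay-slot filler, and keep the choice with the fewest cycles. This is a sketch under simplifying assumptions (an unconditional branch with one delay slot, a one-cycle stall only when an instruction uses a value loaded by the instruction immediately before it); the helper names are hypothetical.

```python
def ins(text, defs=(), uses=(), load=False):
    return {"text": text, "defs": set(defs), "uses": set(uses), "load": load}


def independent(a, b):
    """No flow, anti or output dependence between a and a later instruction b."""
    return not (a["defs"] & b["uses"] or a["uses"] & b["defs"] or a["defs"] & b["defs"])


def stalls(seq):
    """One stall for each use of a value loaded by the immediately preceding instruction."""
    return sum(1 for prev, cur in zip(seq, seq[1:])
               if prev["load"] and prev["defs"] & cur["uses"])


def fill_delay_slot(body, branch):
    best = body + [branch, ins("nop")]
    best_cost = len(best) + stalls(best)
    for k, cand in enumerate(body):
        # cand may move past everything after it only if nothing there depends on it.
        if all(independent(cand, later) for later in body[k + 1:] + [branch]):
            trial = body[:k] + body[k + 1:] + [branch, cand]
            cost = len(trial) + stalls(trial)
            if cost < best_cost:
                best, best_cost = trial, cost
    return best


BODY = [
    ins("R2 <- [R1]", defs={"R2"}, uses={"R1"}, load=True),
    ins("R3 <- [R1 + 4]", defs={"R3"}, uses={"R1"}, load=True),
    ins("R4 <- R2 + R3", defs={"R4"}, uses={"R2", "R3"}),
    ins("R5 <- R2 - 1", defs={"R5"}, uses={"R2"}),
]
SCHEDULED = [i["text"] for i in fill_delay_slot(BODY, ins("goto L1"))]
print(SCHEDULED)
```

With these inputs the search moves the add into the delay slot, which also removes the load stall because the independent subtract now follows the second load.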

9
Conditional jumps and delay slots

Instruction in delay slot is executed while jump
is in progress. What if jump is not taken? Need
mechanism to annul instruction.
Branch prediction: assume target is known, fill
delay slot with first instruction in target block.
If both destinations start with same instruction,
it is the ideal choice for delay slot.
Good heuristic for loops: assume that a
backwards conditional jump is usually taken. Move
first instruction in loop to delay slot for
branch at end.
Call instruction has a delay slot: fill with
parameter push.

10
A greedy algorithm: list scheduling

Finding optimal schedule for DAG is NP-complete.
Simple algorithm is O(N^2) at worst, usually
linear.
Roots of DAG are instructions without
predecessors.
First pass, from leaves to roots: compute latest
possible starting time for each instruction, relative
to end of block.
For leaf: execution time of instruction.
For inner node: maximum delay imposed by
successors.
E.g., if In is followed by Im, Im can start at T - 4,
and there is a latency of 2 between In and Im,
then In must start by T - 6.
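
The first pass can be sketched as a memoized walk over the DAG. The dictionary-based graph representation and the function name are assumptions for the illustration; the numbers reproduce the slide's example, in which Im is 4 cycles from the block end and the latency between In and Im is 2.

```python
def delays_to_block_end(succs, latency, exec_time):
    """Delay from the start of each instruction to the end of the block.

    succs:     node -> list of successor nodes in the dependence DAG
    latency:   (node, successor) -> cycles that must separate their starts
    exec_time: node -> execution time, used directly for leaves
    """
    delay = {}

    def visit(n):
        if n not in delay:
            outs = succs.get(n, [])
            if not outs:
                delay[n] = exec_time[n]            # leaf
            else:                                   # inner node
                delay[n] = max(latency[(n, m)] + visit(m) for m in outs)
        return delay[n]

    for n in exec_time:
        visit(n)
    return delay


DELAYS = delays_to_block_end(
    succs={"In": ["Im"]},
    latency={("In", "Im"): 2},
    exec_time={"In": 1, "Im": 4},
)
print(DELAYS)
```

A delay of d means the instruction must start no later than d cycles before the end of the block, so In comes out at 6.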

11
List scheduling: second pass

Second pass, from roots to leaves: schedule
instructions with the greatest slack (farthest
from block end) and that can start as early as
possible from now.
At each step:
D1: candidates with the largest remaining delay
D2: candidates with the earliest possible
starting time (computed from starting time of
their predecessors)
Choose from D1 if unique, else from D2 if unique,
else use heuristics:
Choose earliest starting time, or
Choose instruction that uses least used pipeline,
or
Choose instruction that frees register.
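
A compact sketch of both passes on a single-issue machine. It simplifies the selection rule: among ready instructions that can start now, it takes the largest delay, then the earliest possible start, then the lowest instruction number as a stand-in for the heuristics. The edge list is one reading of the eight-instruction example, and the latencies (2 after a load, 1 otherwise) are assumptions chosen for the illustration.

```python
# (predecessor, successor) -> latency; flow edges out of loads 1, 2 and 4 are assumed to cost 2.
LATENCY = {(1, 3): 2, (1, 6): 2, (1, 7): 2, (2, 3): 2, (2, 7): 1,
           (4, 5): 1, (4, 6): 2, (4, 8): 1, (6, 8): 1}
NODES = list(range(1, 9))
SUCCS = {}
for (n, m) in LATENCY:
    SUCCS.setdefault(n, []).append(m)


def block_end_delays(nodes, succs, latency):
    """First pass: delay to block end, assuming a one-cycle execution time for leaves."""
    delay = {}

    def visit(n):
        if n not in delay:
            delay[n] = max((latency[(n, m)] + visit(m) for m in succs.get(n, [])),
                           default=1)
        return delay[n]

    for n in nodes:
        visit(n)
    return delay


def list_schedule(nodes, succs, latency):
    """Second pass: returns a list of (start cycle, instruction)."""
    delay = block_end_delays(nodes, succs, latency)
    preds_left = {n: 0 for n in nodes}
    for n in nodes:
        for m in succs.get(n, []):
            preds_left[m] += 1
    earliest = {n: 0 for n in nodes}
    ready = [n for n in nodes if preds_left[n] == 0]      # roots of the DAG
    clock, schedule = 0, []
    while ready:
        startable = [c for c in ready if earliest[c] <= clock]
        pool = startable or ready
        n = min(pool, key=lambda c: (-delay[c], earliest[c], c))
        ready.remove(n)
        start = max(clock, earliest[n])
        schedule.append((start, n))
        clock = start + 1
        for m in succs.get(n, []):
            earliest[m] = max(earliest[m], start + latency[(n, m)])
            preds_left[m] -= 1
            if preds_left[m] == 0:
                ready.append(m)
    return schedule


SCHEDULE = list_schedule(NODES, SUCCS, LATENCY)
print(SCHEDULE)
```

Under these assumed latencies the sketch issues the instructions in the order 1, 4, 2, 6, 5, 7, 3, 8 in consecutive cycles, whereas the source order stalls one cycle before instruction 3.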

12
Procedure integration: inlining

Calls make optimizations harder. There is a large
payoff to local optimizations over large basic
blocks: inlining subprogram bodies is often very
effective.
It exposes the values of the actuals in the body.
It creates larger basic blocks.
It saves the cost of the call.
Can be done at the tree level or at the RTL
level. In both cases it can enable other
optimizations.
Possible disadvantages: code size increase,
debugging is harder.

13
Inlining as a tree transformation

Treat body of subprogram as a generic unit.
Each inlined body needs its own local variables.
Global references are captured at the point of
definition.
Inlining works like instantiation: replace
formals with actuals, complete analysis and
expansion of inserted body.
Replace multiple return statements where needed.
Introduce temporary to hold return value of
function.

14
Name capture: recognize global entities

function memo (x : integer) return integer is
   local : integer := x * 2;
begin
   Saved := Saved + local * x;   -- Saved is global
   return Saved;
end memo;
Val := memo (15);
Becomes
declare
   local : integer := 15 * 2;    -- each inlining has its own
   result : integer;             -- maybe superfluous if context is assignment
begin
   Saved := Saved + local * 15;  -- Saved is the same entity in all inlinings
   result := Saved;
end;
Val := result;

15
Handling return statements

Subprogram needs a label to serve as a single exit
point.
In a function: identify target of result, or
create temporary for it; replace return with
assignment to target, followed by goto to exit
label.
In a procedure: replace return with goto to exit
label.
Optimizations:
If body of function is single return statement
and context is assignment, can replace right-hand
side with expression.
If procedure has no return statement, exit label
is superfluous.
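
The first optimization (a function whose body is a single return statement, called in an assignment) can be illustrated as a tree transformation using Python's standard `ast` module instead of the Ada trees the slides have in mind. This is a sketch only: `inline_single_return` and `_ReplaceFormals` are hypothetical names, and the substitution is safe only when each formal is used once or the actual has no side effects; otherwise a temporary is needed.

```python
import ast


class _ReplaceFormals(ast.NodeTransformer):
    """Replace each formal parameter name with the tree of its actual."""

    def __init__(self, actuals):
        self.actuals = actuals

    def visit_Name(self, node):
        return self.actuals.get(node.id, node)


def inline_single_return(function_source, assignment_source):
    func = ast.parse(function_source).body[0]
    assign = ast.parse(assignment_source).body[0]
    if not (isinstance(func, ast.FunctionDef) and len(func.body) == 1
            and isinstance(func.body[0], ast.Return)):
        raise ValueError("body is not a single return statement")
    call = assign.value if isinstance(assign, ast.Assign) else None
    if not (isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
            and call.func.id == func.name):
        raise ValueError("context is not an assignment of a call to the function")
    formals = [arg.arg for arg in func.args.args]
    actuals = dict(zip(formals, call.args))
    # The right-hand side becomes the returned expression with formals replaced.
    assign.value = _ReplaceFormals(actuals).visit(func.body[0].value)
    return ast.unparse(assign)


print(inline_single_return("def scale(x):\n    return x * 2", "val = scale(b - c)"))
```

Because the substitution works on trees and not on text, the actual `b - c` keeps its grouping when it replaces `x` inside the product. `ast.unparse` requires Python 3.9 or later.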

16
Parameter passing

If actual is an expression, it is evaluated once:
create temporary in block and replace formal with
temporary.
Val2 := memo (x + f (z));
Becomes
declare
   C1 : integer := x + f (z);
   local : integer := C1 * 2;
begin
   Saved := Saved + local * C1;
   Val2 := Saved;   -- context is assignment
end;

17
Parameter passing: variables

An in-out parameter is a location: cannot create
a temporary for it, must use a renaming
declaration.
procedure incr (x : in out integer) is
begin x := x + 1; end;
incr (a (i));
Becomes
declare
   c1 : integer renames a (i);
begin
   c1 := c1 + 1;
end;

18
Context includes more than global names

Semantics of inlined call must be identical to
original program.
If constraint checks are not suppressed in the
body, they must not be suppressed in the inlined
block, even if suppressed at the point of call.
Status of constraint checks is part of closure of
body to inline, and must be applied when analyzing
inlined block.

19
Specialized inlining: loop unrolling

Create successive copies of body of loop: saves
tests, makes bigger basic block, increases
instruction-level parallelism.
for j in 1 .. N loop
   loop_body
end loop;
Becomes
for k in 1 .. N / r loop             -- unroll r times
   loop_body                         -- with j = r * (k - 1) + 1
   loop_body   j -> j + 1            -- replace loop variable for each unrolling
   ...
   loop_body   j -> j + r - 1
end loop;
for j in r * (N / r) + 1 .. N loop   -- leftover iterations
   loop_body
end loop;
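
A hand-unrolled loop in Python, as a sketch of the same shape with r = 4: a main loop that runs the body four times per test, followed by a cleanup loop for the leftover iterations. The function names are made up for the example, and indices are zero-based, unlike the Ada fragment.

```python
def total_rolled(values):
    total = 0
    for j in range(len(values)):
        total += values[j]
    return total


def total_unrolled(values):
    """Same computation with the loop body copied r = 4 times."""
    n = len(values)
    main_end = (n // 4) * 4            # last index covered by whole groups of 4
    total = 0
    for j in range(0, main_end, 4):    # one loop test per four bodies
        total += values[j]
        total += values[j + 1]         # loop variable replaced by j + 1
        total += values[j + 2]
        total += values[j + 3]         # ... up to j + r - 1
    for j in range(main_end, n):       # leftover iterations
        total += values[j]
    return total


data = list(range(1, 11))
print(total_rolled(data), total_unrolled(data))
```

The cleanup loop starts where the unrolled loop stopped, at r * (N / r) with integer division, so every element is visited exactly once for any length.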