Architecturedependent optimizations - PowerPoint PPT Presentation

1 / 19
About This Presentation
Title:

Architecturedependent optimizations

Description:

The pipeline structure of modern architectures requires careful instruction ... If an instruction awaits the result of a previous computation, the pipeline may ... – PowerPoint PPT presentation

Number of Views:43
Avg rating:3.0/5.0
Slides: 20
Provided by: scho71
Learn more at: https://cs.nyu.edu
Category:

less

Transcript and Presenter's Notes

Title: Architecturedependent optimizations


1
Architecture-dependent optimizations
  • Functional units, delay slots and dependency
    analysis

2
RISC architectures
  • The pipeline structure of modern architectures
    requires careful instruction scheduling.
  • If instruction I1 creates a value, there may be a
    latency that has to elapse before another
    instruction I2 can use this value
  • If an instruction awaits the result of a previous
    computation, the pipeline may have to stall until
    the result becomes available
  • A branch instruction is time-consuming and
    affects the contents of the instruction cache.
    Execution cannot start at destination before one
    or more cycles have elapsed.

3
Instruction scheduling
  • Purpose minimize stalls and delays, fill delay
    slots with useful computations, minimize
    execution time of basic block.
  • Tool dependency analysis. Uncover legal
    reorderings of instructions, available
    parallelism in basic blocks and beyond
  • Applications
  • Filling delay slots is important for all programs
  • Dependency analysis is critical for reordering of
    loop computations on vector processors and others

4
Dependence relations
  • A data dependence is a constraint that arises
    from the flow of data between statements.
    Violating a data dependence by reordering may
    lead to incorrect results.
  • If S1 sets a value that S2 uses, this is flow
    dependence or true dependence between S1 and S2
  • If S1 uses some variables value and S2 sets it,
    there is an antidependence between them.
  • If both S1 and S2 set the value of some variable,
    there is an output dependence between them.
  • If both S1 and S2 read the value of some variable
    there is an input dependence between then. This
    does not impose an ordering.

5
The dependence DAG of a basic block
  • There is an edge in the dependence dag if
  • I1 writes a register or location that I2 uses
    I1 fd I2
  • I1 uses a register or location that I2 changes
    I1 ad I2
  • I1 and I2 write to the same register or location
    I1 od I2
  • I1 and I2 exhibit a structural hazard a load
    followed by a store cannot be interchanged unless
    the addresses are known to be distinct
  • X AI
  • AJ Y -- cannot
    interchange, X might get Y
  • if there is an edge between I1 and I2, I2 must
    not start executing until I1 has executed for
    some number of cycles.

6
Example
  • 1 R3 R15
  • 2 R4 R15 4
  • 3 R2 R3 R4 -- needs R4 stall one cycle
  • 4 R5 R12
  • 5 R12 R12 4
  • 6 R6 R3 R5
  • 7 R154 R3
  • 8 R5 R6 2

1
2
4
3
7
5
6
8
7
Contention for resources
  • Functional unit is pipelined, consists of
    multiple resources. Instructions through the
    pipeline may conflict on use of resources.
  • Eg floating-point unit on MIPS
  • A Mantissa add
  • E Exception test
  • M Multiplier first stage
  • N Multiplier second stage
  • R Adder Round
  • S Operand shift
  • U unpack
  • Add uses successively U, S and A, A and R, R and
    S. (4 cycles)
  • Mul uses U, M, M, M, M, M and A, R. (7 cycles)
  • Conflict depends on relative starting time of two
    instructions.
  • Edges in dependency graph are labelled with
    latencies (gt 1).

8
Branch scheduling
  • Important use of dependency graph fill delay
    slots (branch takes two cycles to reach
    destination)
  • R2 R1 R2
    R1
  • R3 R14 R3
    R14
  • R4 R2 R3 (stall) R5 R2 -1
  • R5 R2 -1 goto
    L1
  • goto L1 R4
    R2 R3
  • nop

9
Conditional jumps and delay slots
  • Instruction in delay slot is executed while jump
    is in progress. What if jump is not taken? Need
    mechanism to annull instruction.
  • Branch prediction assume target is known, fill
    delay slot with first instruction in target block
  • If both destinations start with same instruction,
    ideal choice for delay slot
  • Good heuristics for loops assume that a
    backwards conditional jump is usually taken. Move
    first instruction in loop to delay slot for
    branch at end
  • Call instruction has delay slot fill with
    parameter push

10
A greedy algorithm list scheduling
  • Finding optimal schedule for DAG is NP-complete
  • Simple algorithm is O (N2) at worst, usually
    linear
  • Roots of DAG are instructions without
    predecessors
  • First pass from leaves to roots compute latest
    possible starting time for each instruction to
    end of block
  • For leaf execution time of instruction
  • For inner node maximum delay imposed by
    successors
  • E.g. if In is followed by Im, Im can start at T
    4, and there is a latency of 2 between In and Im,
    then In must start by T 6.

11
List scheduling second pass
  • Second pass from roots to leaves schedule
    instructions with the greatest slack (farthest
    from block end) and that can start as early as
    possible from now.
  • At each step
  • D1 candidates with the largest remaining delay
  • D2 candidates with the earliest possible
    starting time (computed from starting time of
    their predecessors)
  • Choose from D1 if unique, else from D2 if unique,
    else use heuristics
  • Choose earliest starting time, or
  • Choose instruction that uses least used pipeline,
    or
  • Choose instruction that frees register.

12
Procedure integration inlining
  • Calls make optimizations harder. There is a large
    payoff to local optimizations over large basic
    blocks inlining subprogram bodies is often very
    effective
  • It exposes the values of the actuals in the body
  • It creates larger basic blocks
  • It saves the cost of the call
  • Can be done at the tree level or at the RTL
    level. In both cases it can enable other
    optimizations.
  • Possible disadvantages code size increase,
    debugging is harder

13
Inlining as a tree transformation
  • Treat body of subprogram as a generic unit
  • Each inlined body needs its own local variables
  • Global references are captured at the point of
    definition
  • Inlining works like instantiation replace
    formals with actuals, complete analysis and
    expansion of inserted body
  • Replace multiple return statements where needed
  • Introduce temporary to hold return value of
    function

14
Name capture recognize global entities
  • function memo (x integer) return
    integer is
  • local integer x 2
  • begin
  • Saved Saved local x
    -- Saved is global
  • return Saved
  • end memo
  • Val memo (15)
  • Becomes
  • declare
  • local integer 15 2
    -- each inlining has its own
  • result integer
    -- maybe superfluous if context is
    assignment
  • begin
  • Saved Saved local 15l
    -- Saved is the same entity in all inlinings
  • result Saved
  • end
  • Val result

15
Handling return statements
  • Subprogram needs a label to serve a single exit
    point.
  • In a function identify target of result, or
    create temporary for it replace return with
    assignment to target, followed by goto to exit
    label
  • In a procedure replace return with goto to exit
    label
  • Optimizations
  • if body of function is single return statement
    and context is assignment, can replace right-hand
    side with expression
  • If procedure has no return statement, exit label
    is superfluous

16
Parameter passing
  • If actual is an expression, it is evaluated once
    create temporary in block and replace formal with
    temporary
  • Val2 memo (x f (z))
  • Becomes
  • declare
  • C1 integer x f (z)
  • local integer C1 2
  • begin
  • Saved Saved local C1
  • Val2 Saved
    -- context is assignment
  • end

17
Parameter passing variables
  • An in-out parameter is a location, cannot create
    a temporary for it must use a renaming
    declaration.
  • procedure incr (x in out integer) is
  • begin x x 1 end
  • incr (a (i))
  • Becomes
  • declare
  • c1 integer renames a (i)
  • begin
  • c1 c1 1
  • end

18
Context includes more than global names
  • semantics of inlined call must be identical to
    original program
  • If constraint checks are not suppressed in the
    body, they must not be suppressed in the inlined
    block, even if suppressed at the point of call.
  • Status of constraint checks is part of closure of
    body to inline, must be applied when analyzing
    inlined block

19
Specialized inlining loop unrolling
  • Create successive copies of body of loop saves
    tests, makes bigger basic block, increases
    instruction level parallelism
  • for j in 1 .. N loop
  • loop_body
  • end loop
  • Becomes
  • for k in 1 .. N / r loop --
    unroll r times
  • loop_ body
  • loop_body j -gt j 1 --
    replace loop variable for each unrolling
  • loop_ body j -gt j r -1
  • end
  • for k in N / r 1.. N loop loop_body
    end loop -- leftover iterations
Write a Comment
User Comments (0)
About PowerShow.com