Fusing Instructions to Reduce Resource Usage in If-Converted Regions

1
Fusing Instructions to Reduce Resource Usage in
If-Converted Regions
  • Howard Chen, Gerolf Hoflehner, Daniel Lavery
  • Intel Corporation
  • Matthew Bridges
  • Princeton University
  • 5th Workshop on EPIC Architectures and Compiler
    Technology, Sunday March 26

2
Instruction Fusion Motivation
  • Resource usage limits the performance of
    predicated regions
  • Limits on the number of integer, floating point,
    branch, and load/store operations that can be
    executed in a cycle.
  • Limits on total number of instructions per cycle.
  • Resource constraints are common in software
    pipelined loops
  • Fusing instructions reduces resource usage
    allowing similar instructions with disjoint
    predicates to share resources
  • The two loads execute on different paths so they
    can be fused to share the same resource

Before fusion (identical loads under disjoint predicates):
    cmp.gt p1,p2 = v2,v3
    (p1) ld4 v1 = [V0]
    (p2) ld4 v1 = [V0]
After fusion (one load eliminated):
    cmp.gt p1,p2 = v2,v3
    ld4 v1 = [V0]
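A minimal Python sketch of this simplest case: two adjacent, identical instructions under complementary predicates collapse into one unpredicated instruction. The `Instr` tuple and `merge_identical` are illustrative names, not taken from the presentation or from any compiler, and the sketch only looks at adjacent instructions, so it sidesteps the dependence check a real pass would need.

```python
from typing import NamedTuple, Optional

class Instr(NamedTuple):
    pred: Optional[str]  # qualifying predicate; None means "always executes"
    op: str
    dst: str
    srcs: tuple

def merge_identical(instrs, complementary):
    """Merge adjacent instructions that differ only in their predicate.

    `complementary` is a set of frozensets {pa, pb}; each pair is assumed
    to be disjoint and to cover every path through the region.
    """
    out = []
    for ins in instrs:
        prev = out[-1] if out else None
        if (prev is not None
                and prev[1:] == ins[1:]  # same opcode, destination, sources
                and frozenset((prev.pred, ins.pred)) in complementary):
            out[-1] = ins._replace(pred=None)  # one copy serves both paths
        else:
            out.append(ins)
    return out

region = [
    Instr(None, "cmp.gt", "p1,p2", ("v2", "v3")),
    Instr("p1", "ld4", "v1", ("[V0]",)),
    Instr("p2", "ld4", "v1", ("[V0]",)),
]
fused = merge_identical(region, {frozenset(("p1", "p2"))})
```

Here `fused` keeps the compare and a single unpredicated `ld4`, matching the "one load eliminated" outcome on the slide.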
3
If-Conversion Redundancy Example
BLOCK 3:
    ld4  v1 = [V0]
    addi v2 = 8, V0
    ld4  v3 = [v2]
    sub  v4 = v1, v3
    st4  [V0] = v4
BLOCK 4:
    ld4  v5 = [V0]
    addi v6 = 4, V0
    ld4  v7 = [v6]
    add  v8 = v5, v7
    st4  [V0] = v8
Source:
    void foo(int p, int i)
      if(!p) return
      if(i > 0) p p1
      else p - p2
Similar loads and stores exist in both blocks...
4
If-Conversion Fusion Example
Original Predicated Region:
    (p1) ld4  v1 = [V0]
    (p2) ld4  v5 = [V0]
    (p1) addi v2 = 8, V0
    (p2) addi v6 = 4, V0
    (p1) ld4  v3 = [v2]
    (p2) ld4  v7 = [v6]
    (p1) sub  v4 = v1, v3
    (p2) add  v8 = v5, v7
    (p1) st4  [V0] = v4
    (p2) st4  [V0] = v8
Operand Copying and Register Renaming:
    (p1) ld4  v1 = [V0]
    (p2) ld4  v1 = [V0]
    (p2) mov  v5 = v1
    (p1) addi v2 = 8, V0
    (p2) addi v6 = 4, V0
    (p2) mov  v2 = v6
    (p1) ld4  v3 = [v2]
    (p2) ld4  v3 = [v2]
    (p2) mov  v7 = v3
    (p1) sub  v4 = v1, v3
    (p2) add  v8 = v5, v7
    (p2) mov  v4 = v8
    (p1) st4  [V0] = v4
    (p2) st4  [V0] = v4
After Fusion and Copy Propagation:
    ld4  v1 = [V0]
    (p1) addi v2 = 8, V0
    (p2) addi v2 = 4, V0
    ld4  v3 = [v2]
    (p1) sub  v4 = v1, v3
    (p2) add  v4 = v1, v3
    st4  [V0] = v4
Eliminate duplicated loads and stores on both
sides of the if statement
Use moves to make identical instructions...
5
Example from 181.mcf
    while( node ) {
        if( node->orientation == UP )
            node->potential =
                node->basic_arc->cost + node->pred->potential;
        else /* DOWN */ {
            node->potential =
                node->pred->potential - node->basic_arc->cost;
            checksum++;
        }
        tmp = node;
        node = node->child;
    }

Common address calculations performed along both
paths in refresh_potential can be fused to reduce
resource usage
6
Instruction Fusion
  • Instruction fusion replaces multiple instructions
    with the same opcode with a single instruction
  • Fused instructions act like a routine
  • Operands are copied into input registers
  • Results are copied from output registers
  • Unlike classical redundancy elimination, fusion
    can combine instructions that compute unrelated
    values
  • Unrelated computations on separate disjoint paths
    can be fused
  • Results in extra mov instructions, but it may
    free up more constrained resources

Before fusion:
    (p1) ld4 v3 = [v1]
    (p2) ld4 v4 = [v2]
    (p1) st4 [v3] = v5
    (p2) st4 [v4] = v6
After fusion:
    ...
    (p1) mov v2 = v1
    ld4 v4 = [v2]
    (p1) mov v6 = v5
    st4 [v4] = v6
    ...
7
Related Work
  • Code Hoisting / Sinking
  • Removes redundant code that is anticipatable on
    all paths
  • If-Conversion Example Revisited
  • Hoisting the first two loads and sinking the last
    two stores on the right can remove the redundant
    load and store instructions
  • The center two loads cannot be hoisted or sunk,
    but can be fused
  • Fusion can be applied in place, and is not
    affected by dependencies

Original predicated region:
    (p1) ld4  v1 = [V0]
    (p2) ld4  v5 = [V0]
    (p1) addi v2 = 8, V0
    (p2) addi v6 = 4, V0
    (p1) ld4  v3 = [v2]
    (p2) ld4  v7 = [v6]
    (p1) sub  v4 = v1, v3
    (p2) add  v8 = v5, v7
    (p1) st4  [V0] = v4
    (p2) st4  [V0] = v8
After fusion:
    ld4  v1 = [V0]
    (p1) addi v2 = 8, V0
    (p2) addi v2 = 4, V0
    ld4  v3 = [v2]
    (p1) sub  v4 = v1, v3
    (p2) add  v4 = v1, v3
    st4  [V0] = v4
8
Related Work
  • Common Sub-expression Elimination / Partial
    Redundancy Elimination
  • Eliminates calculation of redundant expressions /
    values along a path
  • The two redundant loads can be eliminated
  • Fusion eliminates operations across disjoint
    paths
  • Can eliminate the two disjoint add instructions
  • Can be combined with redundancy elimination to
    enable better fusion

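Original:
    ld4  v1 = [V0]
    ld4  v5 = [V0]
    (p1) addi v2 = 4, v1
    (p2) addi v6 = 4, v5
Fusion applied directly:
    ld4  v1 = [V0]
    ld4  v5 = [V0]
    (p2) mov  v1 = v5
    addi v2 = 4, v1
    (p2) mov  v6 = v2
CSE applied first:
    ld4  v1 = [V0]
    (p1) addi v2 = 4, v1
    (p2) addi v6 = 4, v1
Fusion applied after CSE:
    ld4  v1 = [V0]
    addi v2 = 4, v1
    (p2) mov  v6 = v2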
9
Related Work
  • Instruction Merging [Mahlke 92]
  • Merges identical instructions in hammocks
  • Does not apply merging to all disjoint regions or
    transform instructions to allow merging
  • Predicate Aware Scheduling [Smelyanskiy 03]
  • Instructions with disjoint predicates can be
    scheduled to use the same resource at the same
    time
  • Requires architectural support for issuing
    disjoint operations to the same resource in the
    same clock cycle

10
Fusion Algorithm
  • Find a pair of instructions that contain the same
    opcode
  • The pair of instructions must calculate the same
    value along non-disjoint paths
  • Make the two instructions identical by renaming
    registers of both instructions and moving values
    from old registers
  • Move one of the instructions immediately
    before/after the other without violating any
    dependencies
  • Merge the two instructions by eliminating one of
    the instructions
  • Assign the remaining instruction a new predicate
    that is the union of the merged predicates
  • Apply Copy Propagation and Dead Code Elimination
    to remove movs

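Original:
    cmp.gt p1,p2 = v2,v3
    (p1) ld4 v2 = [V1]
    (p2) ld4 v1 = [V0]
After renaming registers and inserting moves:
    cmp.gt p1,p2 = v2,v3
    (p1) mov V4 = V1
    (p1) ld4 v3 = [V4]
    (p1) mov v2 = v3
    (p2) mov V4 = V0
    (p2) ld4 v3 = [V4]
    (p2) mov v1 = v3
After moving the two loads next to each other:
    cmp.gt p1,p2 = v2,v3
    (p1) mov V4 = V1
    (p2) mov V4 = V0
    (p1) ld4 v3 = [V4]
    (p2) ld4 v3 = [V4]
    (p1) mov v2 = v3
    (p2) mov v1 = v3
After merging:
    cmp.gt p1,p2 = v2,v3
    (p1) mov V4 = V1
    (p2) mov V4 = V0
    ld4 v3 = [V4]
    (p1) mov v2 = v3
    (p2) mov v1 = v3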
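The rename, move, and merge steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the compiler's implementation: instructions are `(pred, op, dst, srcs)` tuples, `fuse_with_copies` and `run` are hypothetical helpers, fresh register names are supplied by the caller, and the two predicates are assumed to be disjoint and to cover the whole region, so the fused instruction is left unpredicated. Copy propagation and dead code elimination are not modelled.

```python
def fuse_with_copies(a, b, fresh):
    """Fuse two same-opcode instructions by copying operands in and results out."""
    pa, op, dst_a, srcs_a = a
    pb, op_b, dst_b, srcs_b = b
    if op != op_b or len(srcs_a) != len(srcs_b):
        raise ValueError("only instructions with the same opcode can be fused")
    out, fused_srcs = [], []
    for sa, sb in zip(srcs_a, srcs_b):
        if sa == sb:
            fused_srcs.append(sa)          # shared operand, no mov needed
        else:
            reg = next(fresh)              # fresh input register
            out += [(pa, "mov", reg, (sa,)), (pb, "mov", reg, (sb,))]
            fused_srcs.append(reg)
    if dst_a == dst_b:
        out.append((None, op, dst_a, tuple(fused_srcs)))
    else:
        reg = next(fresh)                  # fresh output register
        out.append((None, op, reg, tuple(fused_srcs)))
        out += [(pa, "mov", dst_a, (reg,)), (pb, "mov", dst_b, (reg,))]
    return out

def run(instrs, true_pred, regs, mem):
    """Tiny interpreter for mov/ld4, used only to compare before and after."""
    regs = dict(regs)
    for pred, op, dst, srcs in instrs:
        if pred is not None and pred != true_pred:
            continue                       # predicated off
        if op == "mov":
            regs[dst] = regs[srcs[0]]
        elif op == "ld4":
            regs[dst] = mem[regs[srcs[0]]]
        else:
            raise ValueError(op)
    return regs

a = ("p1", "ld4", "v2", ("V1",))
b = ("p2", "ld4", "v1", ("V0",))
fused = fuse_with_copies(a, b, iter(["V4", "v3"]))
```

With the fresh names `V4` and `v3`, `fused` is the same five-instruction sequence as the merged code on the slide: two input movs, one load, two output movs.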
11
Selecting Instructions to Fuse
  • Some combinations introduce dependencies that
    increase the critical schedule path
  • e.g. fusion of ld instructions
  • Other combinations introduce dependencies that
    limit other fusion opportunities
  • e.g. fusing ld may prevent fusion of mult
  • Need to identify which fusion combinations are
    desirable

[Figure: graph with nodes br, ld, ld, mult, mult, ld, mult, ...]
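Both effects can be illustrated with a small dependence-graph sketch in Python. Everything here is an assumption made for illustration: the node names, the latencies (2 cycles for `ld`, 4 for `mult`), and the helpers `fuse_nodes` and `height`. One path computes `ld` then `mult`; the other computes `mult` then `ld`.

```python
def fuse_nodes(deps, a, b, fused):
    """Merge nodes a and b; the fused node inherits the dependences of both."""
    ren = lambda n: fused if n in (a, b) else n
    out = {}
    for node, preds in deps.items():
        out.setdefault(ren(node), set()).update(ren(p) for p in preds)
    return out

def height(deps, lat):
    """Latency-weighted longest path; raises ValueError if the graph is cyclic."""
    memo, on_stack = {}, set()

    def visit(n):
        if n in memo:
            return memo[n]
        if n in on_stack:
            raise ValueError("fusion created a dependence cycle")
        on_stack.add(n)
        opcode = n.split(".")[0]
        h = lat[opcode] + max((visit(p) for p in deps.get(n, ())), default=0)
        on_stack.discard(n)
        memo[n] = h
        return h

    return max(visit(n) for n in deps)

lat = {"ld": 2, "mult": 4}           # hypothetical latencies
deps = {                              # node -> nodes it depends on
    "ld.a": set(), "mult.a": {"ld.a"},     # path 1: ld then mult
    "mult.b": set(), "ld.b": {"mult.b"},   # path 2: mult then ld
}
before = height(deps, lat)
g1 = fuse_nodes(deps, "ld.a", "ld.b", "ld.ab")
after_ld = height(g1, lat)
g2 = fuse_nodes(g1, "mult.a", "mult.b", "mult.ab")
```

In this toy graph, fusing the two loads lengthens the longest chain from `before` to `after_ld`, and `g2` contains a cycle, so the mult pair can no longer be fused once the loads have been.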
12
Goals for Selecting Fusion Opportunities
  • Maximize the number of instructions removed
  • Minimize the number of new instructions added
  • Minimize constraints that occur due to new
    dependencies
  • New dependencies may limit the scheduler's freedom
    to reorder instructions
  • Fusion may affect disambiguation of memory
    accesses but memory addresses stay disjoint on
    disjoint paths
  • Other benefits may influence the decision to fuse
  • Fusion may remove control dependencies

13
Minimizing New mov Instructions
  • Mov instructions are only necessary if operands
    of fused instructions use different registers
  • Registers can be fused to be the same if they
    can only generate uses with different values on
    disjoint paths
  • Assuming there are no conflicts, v5 can be
    renamed to v1 and v8 can be renamed to v4

Before:
    (p1) ld4  v1 = [V0]
    (p2) ld4  v5 = [V1]
    (p1) addi v2 = 4, v1
    (p2) addi v3 = 4, v5
    ...
    (p1) st4  [V0] = v4
    (p2) st4  [V1] = v8
After:
    ld4  v1 = [V0]
    addi v2 = 4, v1
    ...
    st4  [V0] = v4
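One conservative way to test whether two registers may share a name is sketched below in Python. It is a simplification chosen for illustration, not the analysis used in the work: `can_share` (a hypothetical helper) only accepts a pair when every definition or use of one register is guarded by a predicate disjoint from every predicate guarding the other. Instructions are `(pred, op, dst, srcs)` tuples, and the elided middle of the slide's example is left out.

```python
def preds_touching(reg, instrs):
    """Predicates of all instructions that define or use `reg`."""
    return {pred for pred, op, dst, srcs in instrs if reg == dst or reg in srcs}

def can_share(r1, r2, instrs, disjoint):
    """True if r1 and r2 are only ever touched under mutually disjoint predicates."""
    return all(frozenset((a, b)) in disjoint
               for a in preds_touching(r1, instrs)
               for b in preds_touching(r2, instrs))

def rename(instrs, old, new):
    """Rewrite every occurrence of register `old` to `new`."""
    return [(pred, op,
             new if dst == old else dst,
             tuple(new if s == old else s for s in srcs))
            for pred, op, dst, srcs in instrs]

region = [
    ("p1", "ld4", "v1", ("V0",)),
    ("p2", "ld4", "v5", ("V1",)),
    ("p1", "addi", "v2", ("4", "v1")),
    ("p2", "addi", "v3", ("4", "v5")),
]
disjoint = {frozenset(("p1", "p2"))}
ok = can_share("v1", "v5", region, disjoint)
renamed = rename(region, "v5", "v1") if ok else region
```

After the rename, both loads write `v1` and both `addi` instructions read it, so those operands no longer need a mov when the instructions are fused.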
14
Minimizing new cmp instructions
  • Fusing instructions that execute along every path
    from an existing dominating block can reuse an
    existing predicate
  • On the right, the instructions with predicates
    (p1) and (p2) are fused together to take on the
    block predicate of the parent, the universal
    predicate

Before:
    (p1) ld4  v1 = [V0]
    (p2) ld4  v5 = [V0]
    (p1) addi v2 = 4, v1
    (p2) addi v3 = 4, v5
    ...
    (p1) st4  [V0] = v4
    (p2) st4  [V0] = v8
After:
    ld4  v1 = [V0]
    addi v2 = 4, v1
    ...
    st4  [V0] = v4
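The predicate-reuse idea can be sketched in Python by modelling each predicate as the set of paths on which it is true. This stands in for a real data-flow analysis and is purely illustrative: the path numbering, the predicate names, and `predicate_for_fused` are assumptions, with `p0` playing the role of the universal predicate.

```python
def predicate_for_fused(preds, paths, candidates):
    """Return an existing predicate equal to the union of `preds`, or None.

    `paths` maps a predicate name to the set of path ids on which it holds.
    None means a new compare would be needed to compute the fused predicate.
    """
    sets = [paths[p] for p in preds]
    for i, s in enumerate(sets):
        for t in sets[i + 1:]:
            if s & t:
                raise ValueError("predicates are not disjoint")
    union = frozenset().union(*sets)
    for cand in candidates:
        if paths[cand] == union:
            return cand
    return None

paths = {
    "p0": {1, 2, 3},   # block predicate of the dominating block
    "p1": {1},
    "p2": {2, 3},
    "p3": {2},
}
full = predicate_for_fused(["p1", "p2"], paths, ["p0"])
partial = predicate_for_fused(["p1", "p3"], paths, ["p0"])
```

Here `full` is `"p0"` because `p1` and `p2` together cover every path from the dominator, while `partial` is `None`: fusing instructions under `p1` and `p3` would need a newly computed predicate.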
15
Strategy to Minimize New Instructions
  • 1) Use anticipatability data-flow to find a set
    of instructions that executes along all paths.
    These instructions can reuse the predicate of a
    common dominator.
  • 2) Find instructions that share dependencies on
    the same defs / uses and can be fused without
    introducing new mov instructions
  • 3) Walk down the dependence chain to fuse
    instructions dependent on the newly fused
    instruction

16
Experimental Setup
  • Itanium 2 Processor
  • Intel Itanium Compiler 9.0
  • Options used for SPEC CPU2000 base performance
  • Apply a Subset of Fusions
  • Hoist code into dominator
  • Sink code into post-dominators
  • No fusion that generates additional mov
    instructions is allowed
  • Fusion of instructions in if-converted regions
    inside of SWP loops
  • In a software pipelined loop, possible increases
    in the dependency length off the recurrence path
    can be tolerated

17
Results
18
Results
19
Future Work
  • Fuse instructions that do not share input or
    output dependencies if it eliminates a long chain
    of instructions
  • Common long chains of related instructions
  • Pointer dereferencing sequences
  • e.g. (p1)->prev->member, (p1)->member
  • Mathematical sequences of instructions
  • divide, sqrt, etc.
  • Fuse instructions to remove control dependencies
  • Fuse unrelated instructions to schedule them
    above/below their current block
  • Move up defs, move down uses

20
Future Work
  • Partial Fusion
  • Fuse instructions that do not execute along all
    paths
  • Partial fusion may require computation of new
    predicates to get the execution condition of
    fused instruction
  • e.g. Fusion of div instructions on the right
  • Can eliminate new predicate compares using
    predicate promotion

[Figure: graph with two div nodes]
21
Fusion Tradeoffs
  • Register Pressure
  • Fusion can affect register pressure by changing
    register lifetimes and eliminating redundant
    registers
  • Scheduling
  • Additional moves may add height to critical path
  • Fusion may remove control dependencies to reduce
    the height of the critical path
  • Aliasing of memory addresses in loads may affect
    memory disambiguation

22
Conclusion
  • Fusion is a technique that can eliminate
    redundancy on disjoint paths that is not removed
    by classical optimizations
  • Fusion allows instructions with the same opcode
    to share resources even if they do not compute
    the same value
  • Reduces resource usage with a possible increase
    to dependency length
  • In a software pipelined loop, increasing the
    dependency length off the recurrence path can be
    tolerated
  • Effective at reducing resource II of loops with
    control flow
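To make the resource II claim concrete, here is a small Python calculation on the if-conversion example from earlier in the talk (ten instructions before fusion, seven after). The machine model is hypothetical and chosen only for illustration: two memory units, two integer units, and six issue slots per cycle are assumptions, not Itanium 2 figures, and `resource_ii` is an illustrative helper.

```python
from collections import Counter
from math import ceil

def resource_ii(op_classes, units, issue_width):
    """Lower bound on the initiation interval imposed by resources alone."""
    counts = Counter(op_classes)
    per_class = max(ceil(n / units[cls]) for cls, n in counts.items())
    return max(per_class, ceil(len(op_classes) / issue_width))

units = {"mem": 2, "int": 2}   # hypothetical functional-unit counts
issue_width = 6                # hypothetical issue width

# Original predicated region: 4 loads + 2 stores, 2 addi + sub + add.
original = ["mem"] * 6 + ["int"] * 4
# After fusion and copy propagation: 2 loads + 1 store, same 4 integer ops.
fused = ["mem"] * 3 + ["int"] * 4

ii_before = resource_ii(original, units, issue_width)
ii_after = resource_ii(fused, units, issue_width)
```

Under these assumed unit counts the memory ports are the bottleneck before fusion, and removing three memory operations lowers the bound by one cycle.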

23
Questions
24
Backup Slides
25
Fusing Non-Disjoint Predicates
  • Find a pair of instructions that contain the same
    opcode
  • Instructions with non-disjoint predicates can be
    fused if they calculate the same value
  • All the div instructions can be fused together
    if the two left-most divs compute the same value
  • The fused instruction takes the predicate of the
    top-most node

[Figure: graph with three div nodes]
26
Minimizing new cmp instructions
  • Fusing instructions that execute along every path
    from an existing dominating block can reuse an
    existing predicate
  • On the right, the instructions with predicates
    (p1) and (p2) are fused together to take on the
    block predicate of the parent, the universal
    predicate

Before:
    (p1) ld4  v1 = [V0]
    (p2) ld4  v5 = [V0]
    (p1) addi v2 = 4, v1
    (p2) addi v3 = 4, v5
    ...
    (p1) st4  [V0] = v4
    (p2) st4  [V0] = v8
After:
    ld4  v1 = [V0]
    addi v2 = 4, v1
    ...
    st4  [V0] = v4
27
Minimizing new cmp instructions
  • Fused instructions can also reuse an existing cmp
    from a post-dominator
  • The compare instruction that generates the
    post-dominator's predicate may need to be moved
    downward to execute after the merged instructions.

[Figure: graph with two div nodes]
28
Related Work
Original:
    cmp.lt p1,p2 = r1,r2
    (p1) fma f3 = f1,f2,f0
    (p2) fma f5 = f4,f2,f0
    fma f3 = f1,f2,f0
  • Partial Redundancy Elimination
  • Eliminates redundancy along some program paths
  • Two fma operations on the right can be merged
    using PRE
  • Fusion can be applied to common operations with
    any input values along different paths
  • All three fma instructions on the right can be
    fused into one fma, and one compare can be
    eliminated
  • Since movs are floating point operations, this
    could be a loss
  • If the movs can be eliminated this could be a gain

After PRE:
    cmp.lt p1,p2 = r1,r2
    (p2) fma f5 = f4,f2,f0
    fma f3 = f1,f2,f0
After fusion:
    cmp.lt p1,p2 = r1,r2
    (p2) mov f1 = f4
    fma f3 = f1,f2,f0
    (p2) mov f5 = f3