Title: Fusing Instructions to Reduce Resource Usage in If-Converted Regions
1. Fusing Instructions to Reduce Resource Usage in If-Converted Regions
- Howard Chen, Gerolf Hoflehner, Daniel Lavery
- Intel Corporation
- Matthew Bridges
- Princeton University
- 5th Workshop on EPIC Architectures and Compiler
Technology, Sunday March 26
2. Instruction Fusion Motivation
- Resource usage limits the performance of predicated regions
  - Limits on the number of integer, floating-point, branch, and load/store operations that can be executed in a cycle.
  - Limits on the total number of instructions per cycle.
- Resource constraints are common in software-pipelined loops
- Fusing instructions reduces resource usage, allowing similar instructions with disjoint predicates to share resources
  - The two loads execute on different paths, so they can be fused to share the same resource

Identical loads under disjoint predicates:
    cmp.gt p1,p2 = v2,v3
    (p1) ld4 v1 = [V0]
    (p2) ld4 v1 = [V0]

After fusion (one load eliminated):
    cmp.gt p1,p2 = v2,v3
    ld4 v1 = [V0]
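The motivating case above can be sketched in a few lines of Python. This is a minimal illustration, not the compiler's implementation: the `Instr` record and the `fuse_identical` helper are hypothetical names, only neighbouring instructions are considered, and the pair (p1, p2) is assumed to be disjoint and to cover every path, so the surviving instruction needs no guard.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Instr:
    pred: Optional[str]        # guarding predicate; None means "always executes"
    op: str
    dst: str
    srcs: Tuple[str, ...]

def fuse_identical(code, complementary):
    """Merge neighbouring instructions that differ only in their predicate.

    `complementary` holds predicate pairs that are ASSUMED to be disjoint and
    to cover every path, so their union is the always-true predicate.
    """
    out = []
    for ins in code:
        prev = out[-1] if out else None
        same_work = (prev is not None and
                     (prev.op, prev.dst, prev.srcs) == (ins.op, ins.dst, ins.srcs))
        if same_work and frozenset((prev.pred, ins.pred)) in complementary:
            out[-1] = Instr(None, ins.op, ins.dst, ins.srcs)  # keep one copy, unguarded
        else:
            out.append(ins)
    return out

region = [
    Instr(None, "cmp.gt", "p1,p2", ("v2", "v3")),
    Instr("p1", "ld4", "v1", ("V0",)),
    Instr("p2", "ld4", "v1", ("V0",)),
]
fused = fuse_identical(region, {frozenset(("p1", "p2"))})
for ins in fused:
    print(ins)
```

The region shrinks from three instructions to two, and the load now occupies one memory slot instead of two.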
3. If-Conversion Redundancy Example
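BLOCK 3
    ld4  v1 = [V0]        // v1 = *p
    addi v2 = 8, V0       // v2 = &p[1]
    ld4  v3 = [v2]        // v3 = p[1]
    sub  v4 = v1, v3      // v4 = *p - p[1]
    st4  [V0] = v4        // *p = v4
BLOCK 4
    ld4  v5 = [V0]        // v5 = *p
    addi v6 = 4, V0       // v6 = &p[2]
    ld4  v7 = [v6]        // v7 = p[2]
    add  v8 = v5, v7      // v8 = *p + p[2]
    st4  [V0] = v8        // *p = v8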
Source: void foo(int *p, int i) returns immediately if p is null; otherwise it updates *p from p[1] when i > 0 and from p[2] in the else branch.
Similar loads and stores exist in both blocks...
4. If-Conversion Fusion Example
- Eliminate duplicated loads and stores on both sides of the if statement
- Use moves to make identical instructions...

Original Predicated Region:
    (p1) ld4  v1 = [V0]
    (p2) ld4  v5 = [V0]
    (p1) addi v2 = 8, V0
    (p2) addi v6 = 4, V0
    (p1) ld4  v3 = [v2]
    (p2) ld4  v7 = [v6]
    (p1) sub  v4 = v1, v3
    (p2) add  v8 = v5, v7
    (p1) st4  [V0] = v4
    (p2) st4  [V0] = v8

Operand Copying and Register Renaming:
    (p1) ld4  v1 = [V0]
    (p2) ld4  v1 = [V0]
    (p2) mov  v5 = v1
    (p1) addi v2 = 8, V0
    (p2) addi v6 = 4, V0
    (p2) mov  v2 = v6
    (p1) ld4  v3 = [v2]
    (p2) ld4  v3 = [v2]
    (p2) mov  v7 = v3
    (p1) sub  v4 = v1, v3
    (p2) add  v8 = v5, v7
    (p2) mov  v4 = v8
    (p1) st4  [V0] = v4
    (p2) st4  [V0] = v4

After Fusion and Copy Propagation:
    ld4  v1 = [V0]
    (p1) addi v2 = 8, V0
    (p2) addi v2 = 4, V0
    ld4  v3 = [v2]
    (p1) sub  v4 = v1, v3
    (p2) add  v4 = v1, v3
    st4  [V0] = v4
5. Example from 181.mcf
    while( node )
    {
        if( node->orientation == UP )
            node->potential = node->basic_arc->cost + node->pred->potential;
        else /* DOWN */
        {
            node->potential = node->pred->potential - node->basic_arc->cost;
            checksum++;
        }
        tmp = node;
        node = node->child;
    }
Common address calculations performed along both paths in refresh_potential can be fused to reduce resource usage
6. Instruction Fusion
- Instruction fusion replaces multiple instructions with the same opcode with a single instruction
  - Fused instructions act like a routine
    - Operands are copied into input registers
    - Results are copied from output registers
- Unlike classical redundancy elimination, fusion can combine instructions that compute unrelated values
  - Unrelated computations on separate disjoint paths can be fused
  - Results in extra mov instructions, but it may free up more constrained resources

Before fusion:
    (p1) ld4 v3 = [v1]
    (p2) ld4 v4 = [v2]
    (p1) st4 [v3] = v5
    (p2) st4 [v4] = v6

After fusion:
    ...
    (p1) mov v2 = v1
    ld4 v4 = [v2]
    (p1) mov v6 = v5
    st4 [v4] = v6
    ...
7. Related Work
- Code Hoisting / Sinking
  - Removes redundant code that is anticipatable on all paths
- If-Conversion Example Revisited
  - Hoisting the first two loads and sinking the last two stores of the original region can remove the redundant load and store instructions
  - The center two loads cannot be hoisted or sunk, but can be fused
  - Fusion can be applied in place, and is not affected by dependencies

Original region:
    (p1) ld4  v1 = [V0]
    (p2) ld4  v5 = [V0]
    (p1) addi v2 = 8, V0
    (p2) addi v6 = 4, V0
    (p1) ld4  v3 = [v2]
    (p2) ld4  v7 = [v6]
    (p1) sub  v4 = v1, v3
    (p2) add  v8 = v5, v7
    (p1) st4  [V0] = v4
    (p2) st4  [V0] = v8

Resulting region:
    ld4  v1 = [V0]
    (p1) addi v2 = 8, V0
    (p2) addi v2 = 4, V0
    ld4  v3 = [v2]
    (p1) sub  v4 = v1, v3
    (p2) add  v4 = v1, v3
    st4  [V0] = v4
8. Related Work
- Common Sub-expression Elimination / Partial Redundancy Elimination
  - Eliminates calculation of redundant expressions / values along a path
  - The two redundant loads can be eliminated
- Fusion eliminates operations across disjoint paths
  - Can eliminate the two disjoint add instructions
- Can be combined with redundancy elimination to enable better fusion

Original:
    ld4 v1 = [V0]
    ld4 v5 = [V0]
    (p1) addi v2 = 4, v1
    (p2) addi v6 = 4, v5

Fusion applied directly:
    ld4 v1 = [V0]
    ld4 v5 = [V0]
    (p2) mov v1 = v5
    addi v2 = 4, v1
    (p2) mov v6 = v2

After CSE:
    ld4 v1 = [V0]
    (p1) addi v2 = 4, v1
    (p2) addi v6 = 4, v1

Fusion applied after CSE:
    ld4 v1 = [V0]
    addi v2 = 4, v1
    (p2) mov v6 = v2
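Why redundancy elimination helps fusion can be seen by counting the movs each fusion needs. The sketch below is an illustration only: `movs_needed` is a hypothetical helper, instructions are plain (opcode, destination, sources) tuples, and the count assumes one side's registers are reused so that each differing operand or result costs one mov.

```python
def movs_needed(a, b):
    """Count the movs required to make instruction b identical to a.

    Each instruction is (opcode, dst, srcs). One mov is charged per source
    operand that differs and one more if the destinations differ.
    Returns None when the opcodes (or operand counts) do not match.
    """
    op_a, dst_a, srcs_a = a
    op_b, dst_b, srcs_b = b
    if op_a != op_b or len(srcs_a) != len(srcs_b):
        return None
    operand_movs = sum(1 for x, y in zip(srcs_a, srcs_b) if x != y)
    result_movs = 1 if dst_a != dst_b else 0
    return operand_movs + result_movs

# The two addi instructions before CSE read v1 and v5 respectively.
before_cse = movs_needed(("addi", "v2", ("4", "v1")), ("addi", "v6", ("4", "v5")))
# After CSE both read v1, so only the result still needs a copy.
after_cse = movs_needed(("addi", "v2", ("4", "v1")), ("addi", "v6", ("4", "v1")))
print(before_cse, after_cse)  # 2 1
```

The counts match the two fused sequences on the slide: two movs when fusion is applied directly, one when it follows CSE.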
9. Related Work
- Instruction Merging [Mahlke 92]
  - Merges identical instructions in hammocks
  - Does not apply merging to all disjoint regions or transform instructions to allow merging
- Predicate-Aware Scheduling [Smelyanskiy 03]
  - Instructions with disjoint predicates can be scheduled to use the same resource at the same time
  - Requires architectural support for issuing disjoint operations to the same resource in the same clock cycle
10. Fusion Algorithm
- Find a pair of instructions that contain the same opcode
  - If the two instructions can execute on the same path (non-disjoint predicates), they must calculate the same value
- Make the two instructions identical by renaming registers of both instructions and moving values from old registers
- Move one of the instructions immediately before/after the other without violating any dependencies
- Merge the two instructions by eliminating one of the instructions
  - Assign the remaining instruction a new predicate that is the union of the merged predicates
- Apply Copy Propagation and Dead Code Elimination to remove movs

Original:
    cmp.gt p1,p2 = v2,v3
    (p1) ld4 v2 = [V1]
    (p2) ld4 v1 = [V0]

After renaming and inserting movs:
    cmp.gt p1,p2 = v2,v3
    (p1) mov V4 = V1
    (p1) ld4 v3 = [V4]
    (p1) mov v2 = v3
    (p2) mov V4 = V0
    (p2) ld4 v3 = [V4]
    (p2) mov v1 = v3

After moving the loads together:
    cmp.gt p1,p2 = v2,v3
    (p1) mov V4 = V1
    (p2) mov V4 = V0
    (p1) ld4 v3 = [V4]
    (p2) ld4 v3 = [V4]
    (p1) mov v2 = v3
    (p2) mov v1 = v3

After merging:
    cmp.gt p1,p2 = v2,v3
    (p1) mov V4 = V1
    (p2) mov V4 = V0
    ld4 v3 = [V4]
    (p1) mov v2 = v3
    (p2) mov v1 = v3
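The renaming and merging steps can be sketched for the two loads in this example. This is a simplified illustration under stated assumptions, not the compiler's pass: `Instr`, `fuse_pair`, and `run` are hypothetical names, the fresh registers V4 and v3 are supplied by the caller, p1 and p2 are assumed complementary (so their union is the always-true predicate), and the scheduling and cleanup steps (copy propagation, dead code elimination) are left out. The tiny interpreter only understands `mov` and `ld4` and is there to check that both versions leave v1 and v2 with the same values.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Instr:
    pred: Optional[str]        # None means "always executes"
    op: str
    dst: str
    srcs: Tuple[str, ...]

def fuse_pair(a, b, fresh):
    """Fuse two same-opcode instructions guarded by complementary predicates."""
    assert a.op == b.op and len(a.srcs) == len(b.srcs)
    inputs = [next(fresh) for _ in a.srcs]    # shared input registers
    output = next(fresh)                      # shared output register
    code = []
    for reg, src_a, src_b in zip(inputs, a.srcs, b.srcs):
        code.append(Instr(a.pred, "mov", reg, (src_a,)))   # copy operands in
        code.append(Instr(b.pred, "mov", reg, (src_b,)))
    code.append(Instr(None, a.op, output, tuple(inputs)))  # union of p1, p2 assumed true
    code.append(Instr(a.pred, "mov", a.dst, (output,)))    # copy results out
    code.append(Instr(b.pred, "mov", b.dst, (output,)))
    return code

def run(code, regs, mem, preds):
    """Interpret mov/ld4 only; returns the final register file."""
    regs = dict(regs)
    for ins in code:
        if ins.pred is not None and not preds[ins.pred]:
            continue
        value = regs[ins.srcs[0]]
        regs[ins.dst] = mem[value] if ins.op == "ld4" else value
    return regs

a = Instr("p1", "ld4", "v2", ("V1",))
b = Instr("p2", "ld4", "v1", ("V0",))
fused = fuse_pair(a, b, iter(["V4", "v3"]))

regs = {"V0": 100, "V1": 200, "v1": 0, "v2": 0}
mem = {100: 7, 200: 9}
for preds in ({"p1": True, "p2": False}, {"p1": False, "p2": True}):
    before = run([a, b], regs, mem, preds)
    after = run(fused, regs, mem, preds)
    assert (before["v1"], before["v2"]) == (after["v1"], after["v2"])
```

With V4 and v3 as the fresh registers, `fused` reproduces the five instructions after the compare in the merged sequence: two guarded operand movs, one unguarded load, two guarded result movs.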
11. Selecting Instructions to Fuse
- Some combinations introduce dependencies that increase the critical schedule path
  - e.g. fusion of ld instructions
- Other combinations introduce dependencies that limit other fusion opportunities
  - e.g. fusing ld may prevent fusion of mult
- Need to identify which fusion combinations are desirable
[Figure: control-flow graph with a br node and blocks containing ld and mult instructions]
12. Goals for Selecting Fusion Opportunities
- Maximize the number of instructions removed
- Minimize the number of new instructions added
- Minimize constraints that occur due to new dependencies
  - New dependencies may limit scheduling freedom
  - Fusion may affect disambiguation of memory accesses, but memory addresses stay disjoint on disjoint paths
- Other benefits may influence the decision to fuse
  - Fusion may remove control dependencies
13. Minimizing New mov Instructions
- Mov instructions are only necessary if operands of fused instructions use different registers
- Two registers can be merged into one if their values are only defined and used on disjoint paths
- Assuming there are no conflicts, v5 can be renamed to v1 and v8 can be renamed to v4

Before:
    (p1) ld4  v1 = [V0]
    (p2) ld4  v5 = [V1]
    (p1) addi v2 = 4, v1
    (p2) addi v3 = 4, v5
    ...
    (p1) st4  [V0] = v4
    (p2) st4  [V1] = v8

After:
    ld4  v1 = [V0]
    addi v2 = 4, v1
    ...
    st4  [V0] = v4
14. Minimizing New cmp Instructions
- Fusing instructions that execute along every path from an existing dominating block can reuse an existing predicate
- Below, the instructions with predicates (p1) and (p2) are fused together to take on the block predicate of the parent, the universal predicate

Before:
    (p1) ld4  v1 = [V0]
    (p2) ld4  v5 = [V0]
    (p1) addi v2 = 4, v1
    (p2) addi v3 = 4, v5
    ...
    (p1) st4  [V0] = v4
    (p2) st4  [V0] = v8

After:
    ld4  v1 = [V0]
    addi v2 = 4, v1
    ...
    st4  [V0] = v4
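Renaming instead of copying can be sketched on this region. The code below is a minimal illustration with hypothetical names (`rename`, `fuse_with_renaming`): instructions are (predicate, opcode, destination, sources) tuples, the register mapping is supplied by hand and assumed conflict-free, p1 and p2 are assumed to cover every path from the parent block, and the elided "..." instructions are simply left out.

```python
def rename(instr, mapping):
    """Apply a register-to-register mapping to one instruction."""
    pred, op, dst, srcs = instr
    return (pred, op, mapping.get(dst, dst),
            tuple(mapping.get(s, s) for s in srcs))

def fuse_with_renaming(region, p_keep, p_drop, mapping):
    """Rename the p_drop side, then fuse every pair that became identical.

    The survivor is left unguarded, i.e. it takes the parent's (universal)
    predicate, so no mov and no new cmp is introduced.
    """
    renamed = [rename(i, mapping) if i[0] == p_drop else i for i in region]
    keep_bodies = {i[1:] for i in renamed if i[0] == p_keep}
    drop_bodies = {i[1:] for i in renamed if i[0] == p_drop}
    twins = keep_bodies & drop_bodies
    out = []
    for pred, *rest in renamed:
        body = tuple(rest)
        if body in twins and pred == p_drop:
            continue                      # duplicate removed
        if body in twins and pred == p_keep:
            pred = None                   # survivor runs on every path
        out.append((pred,) + body)
    return out

region = [
    ("p1", "ld4", "v1", ("V0",)),
    ("p2", "ld4", "v5", ("V0",)),
    ("p1", "addi", "v2", ("4", "v1")),
    ("p2", "addi", "v3", ("4", "v5")),
    ("p1", "st4", "[V0]", ("v4",)),
    ("p2", "st4", "[V0]", ("v8",)),
]
# Hand-written mapping, assumed to be free of conflicts.
fused = fuse_with_renaming(region, "p1", "p2", {"v5": "v1", "v3": "v2", "v8": "v4"})
for ins in fused:
    print(ins)
```

Six guarded instructions become three unguarded ones. In a real pass the mapping would have to come from a liveness check rather than being written by hand.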
15. Strategy to Minimize New Instructions
- 1) Use anticipatability data-flow to find a set of instructions that executes along all paths. These instructions can reuse the predicate of a common dominator.
- 2) Find instructions that share dependencies on the same defs / uses and can be fused without introducing new mov instructions
- 3) Walk down the dependence chain to fuse instructions dependent on the newly fused instruction
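Step 1 relies on a standard backward data-flow problem. The sketch below is a textbook-style formulation and not necessarily the one used in the compiler: the CFG, the block names, and the expression strings are hypothetical, `use[b]` is taken to hold the expressions computed in block b before any of their operands are redefined, and `kill[b]` the expressions whose operands b redefines.

```python
def anticipatable_in(succs, use, kill):
    """Iterate ANTin(b) = use(b) | (ANTout(b) - kill(b)) to a fixed point,
    where ANTout(b) is the intersection of ANTin over b's successors
    (empty for exit blocks)."""
    universe = set().union(*use.values())
    ant_in = {b: set(universe) for b in succs}    # optimistic start
    changed = True
    while changed:
        changed = False
        for b in succs:
            if succs[b]:
                ant_out = set.intersection(*(ant_in[s] for s in succs[b]))
            else:
                ant_out = set()
            new_in = use[b] | (ant_out - kill[b])
            if new_in != ant_in[b]:
                ant_in[b], changed = new_in, True
    return ant_in

# Hypothetical diamond: both arms load [V0], but add different offsets.
succs = {"entry": ["then", "else"], "then": ["join"], "else": ["join"], "join": []}
use = {
    "entry": set(),
    "then": {"ld4 [V0]", "addi 8,V0"},
    "else": {"ld4 [V0]", "addi 4,V0"},
    "join": set(),
}
kill = {b: set() for b in succs}

ant = anticipatable_in(succs, use, kill)
on_all_paths = ant["then"] & ant["else"]
print(on_all_paths)  # {'ld4 [V0]'}
```

Only the load of [V0] is computed on every path leaving the entry block, so it is the candidate that can be fused under the predicate of the common dominator. The two addi instructions differ and would need the later steps.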
16. Experimental Setup
- Itanium 2 Processor
- Intel Itanium Compiler 9.0
  - Options used for SPEC CPU2000 base performance
- Apply a subset of fusions
  - Hoist code into dominator
  - Sink code into post-dominators
  - No fusion that generates additional mov instructions is allowed
- Fusion of instructions in if-converted regions inside of SWP loops
  - In a software-pipelined loop, possible increases in the dependency length off the recurrence path can be tolerated
17. Results
18. Results
19. Future Work
- Fuse instructions that do not share input or output dependencies if it eliminates a long chain of instructions
  - Common long chains of related instructions
    - Pointer dereferencing sequences
      - e.g. (p1)->prev->member (p1)->member
    - Mathematical sequences of instructions
      - divide, sqrt, etc.
- Fuse instructions to remove control dependencies
  - Fuse unrelated instructions to schedule them above/below their current block
  - Move up defs, move down uses
20. Future Work
- Partial Fusion
  - Fuse instructions that do not execute along all paths
  - Partial fusion may require computation of new predicates to get the execution condition of the fused instruction
    - e.g. fusion of the div instructions in the figure
  - Can eliminate new predicate compares using predicate promotion
[Figure: control-flow graph with two div instructions]
21. Fusion Tradeoffs
- Register Pressure
  - Fusion can affect register pressure by changing register lifetimes and eliminating redundant registers
- Scheduling
  - Additional moves may add height to the critical path
  - Fusion may remove control dependencies to reduce the height of the critical path
  - Aliasing of memory addresses in loads may affect memory disambiguation
22. Conclusion
- Fusion is a technique that can eliminate redundancy on disjoint paths that is not removed by classical optimizations
- Fusion allows instructions with the same opcode to share resources even if they do not compute the same value
- Reduces resource usage with a possible increase to dependency length
  - In a software-pipelined loop, increasing the dependency length off the recurrence path can be tolerated
- Effective at reducing resource II of loops with control flow
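The effect on resource II can be made concrete with a small calculation. The sketch below uses the usual resource-constrained bound (for each unit class, operations divided by available units, rounded up, then the maximum over classes). The unit classes and unit counts are hypothetical and are not Itanium 2's actual resources; the opcode lists are the original and fused regions from the if-conversion fusion example, treated here as a loop body.

```python
from collections import Counter
from math import ceil

# Hypothetical mapping of opcodes to functional-unit classes.
UNIT_CLASS = {"ld4": "mem", "st4": "mem", "addi": "alu", "sub": "alu", "add": "alu"}
# Hypothetical machine: two memory units and two ALUs per cycle.
UNITS = {"mem": 2, "alu": 2}

def resource_ii(opcodes, units=UNITS):
    """Lower bound on the initiation interval imposed by resources alone."""
    demand = Counter(UNIT_CLASS[op] for op in opcodes)
    return max(ceil(count / units[cls]) for cls, count in demand.items())

original = ["ld4", "ld4", "addi", "addi", "ld4", "ld4", "sub", "add", "st4", "st4"]
fused = ["ld4", "addi", "addi", "ld4", "sub", "add", "st4"]
print(resource_ii(original), resource_ii(fused))  # 3 2
```

On this assumed machine the six memory operations of the original region force a resource II of 3. After fusion only three remain and the bound drops to 2, which is then set equally by the ALU operations.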
23. Questions
24. Backup Slides
25. Fusing Non-Disjoint Predicates
- Find a pair of instructions that contain the same opcode
- Instructions with non-disjoint predicates can be fused if they calculate the same value
  - All the div instructions can be fused together if the two left-most divs compute the same value
  - The fused instruction takes the predicate of the top-most node
[Figure: control-flow graph with three div instructions]
27. Minimizing New cmp Instructions
- Fused instructions can also reuse an existing cmp from a post-dominator
  - The compare instruction that generates the post-dominator's predicate may need to be moved downward to execute after the merged instructions.
[Figure: control-flow graph with two div instructions]
28. Related Work
- Partial Redundancy Elimination
  - Eliminates redundancy along some program paths
  - Two fma operations in the original code can be merged using PRE
- Fusion can be applied to common operations with any input values along different paths
  - All three fma instructions in the original code can be fused into one fma, and one compare can be eliminated
  - Since movs are floating-point operations, this could be a loss
  - If the movs can be eliminated, this could be a gain

Original:
    cmp.lt p1,p2 = r1,r2
    (p1) fma f3 = f1,f2,f0
    (p2) fma f5 = f4,f2,f0
    fma f3 = f1,f2,f0

After PRE:
    cmp.lt p1,p2 = r1,r2
    (p2) fma f5 = f4,f2,f0
    fma f3 = f1,f2,f0

After fusion:
    cmp.lt p1,p2 = r1,r2
    (p2) mov f1 = f4
    fma f3 = f1,f2,f0
    (p2) mov f5 = f3