Fusing Instructions to Reduce Resource Usage in If-Converted Regions

1
Fusing Instructions to Reduce Resource Usage in
If-Converted Regions

Howard Chen, Gerolf Hoflehner, Daniel Lavery
Intel Corporation
Matthew Bridges
Princeton University
5th Workshop on EPIC Architectures and Compiler
Technology, Sunday March 26

2
Instruction Fusion Motivation

Resource usage limits the performance of
predicated regions
Limits on the number of integer, floating-point,
branch, and load/store operations that can be
executed in a cycle.
Limits on total number of instructions per cycle.
Resource constraints are common in software
pipelined loops
Fusing instructions reduces resource usage by
allowing similar instructions with disjoint
predicates to share resources
The two loads execute on different paths so they
can be fused to share the same resource

Before fusion (disjoint predicates, identical loads):
    cmp.gt p1,p2 = v2,v3
    (p1) ld4 v1 = [V0]
    (p2) ld4 v1 = [V0]
After fusion (one load eliminated):
    cmp.gt p1,p2 = v2,v3
    ld4 v1 = [V0]
3
If-Conversion Redundancy Example
Source:
    void foo(int *p, int i) {
        if (!p) return;
        if (i > 0)
            *p += p[1];
        else
            *p -= p[2];
    }
BLOCK 3:
    ld4 v1 = [V0]        // v1 = *p
    addi v2 = 8, V0
    ld4 v3 = [v2]
    sub v4 = v1, v3
    st4 [V0] = v4        // *p = v4
BLOCK 4:
    ld4 v5 = [V0]        // v5 = *p
    addi v6 = 4, V0
    ld4 v7 = [v6]
    add v8 = v5, v7
    st4 [V0] = v8        // *p = v8
Similar loads and stores exist in both blocks...
4
If-Conversion Fusion Example
Eliminate duplicated loads and stores on both
sides of the if statement
Use moves to make identical instructions...

Original Predicated Region
    (p1) ld4 v1 = [V0]
    (p2) ld4 v5 = [V0]
    (p1) addi v2 = 8, V0
    (p2) addi v6 = 4, V0
    (p1) ld4 v3 = [v2]
    (p2) ld4 v7 = [v6]
    (p1) sub v4 = v1, v3
    (p2) add v8 = v5, v7
    (p1) st4 [V0] = v4
    (p2) st4 [V0] = v8

Operand Copying and Register Renaming
    (p1) ld4 v1 = [V0]
    (p2) ld4 v1 = [V0]
    (p2) mov v5 = v1
    (p1) addi v2 = 8, V0
    (p2) addi v6 = 4, V0
    (p2) mov v2 = v6
    (p1) ld4 v3 = [v2]
    (p2) ld4 v3 = [v2]
    (p2) mov v7 = v3
    (p1) sub v4 = v1, v3
    (p2) add v8 = v5, v7
    (p2) mov v4 = v8
    (p1) st4 [V0] = v4
    (p2) st4 [V0] = v4

After Fusion and Copy Propagation
    ld4 v1 = [V0]
    (p1) addi v2 = 8, V0
    (p2) addi v2 = 4, V0
    ld4 v3 = [v2]
    (p1) sub v4 = v1, v3
    (p2) add v4 = v1, v3
    st4 [V0] = v4
5
Example from 181.mcf

    while( node )
    {
        if( node->orientation == UP )
            node->potential =
                node->basic_arc->cost + node->pred->potential;
        else /* DOWN */
        {
            node->potential =
                node->pred->potential - node->basic_arc->cost;
            checksum++;
        }
        tmp = node;
        node = node->child;
    }

Common address calculations performed along both
paths in refresh_potential can be fused to reduce
resource usage
6
Instruction Fusion

Instruction fusion replaces multiple instructions
with the same opcode with a single instruction
Fused instructions act like a routine
Operands are copied into input registers
Results are copied from output registers
Unlike classical redundancy elimination, fusion
can combine instructions that compute unrelated
values
Unrelated computations on separate disjoint paths
can be fused
Results in extra mov instructions, but it may
free up more constrained resources

Before fusion:
    (p1) ld4 v3 = [v1]
    (p2) ld4 v4 = [v2]
    (p1) st4 [v3] = v5
    (p2) st4 [v4] = v6
After fusion:
    ...
    (p1) mov v2 = v1
    ld4 v4 = [v2]
    (p1) mov v6 = v5
    st4 [v4] = v6
    ...
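
The operand and result copying described on this slide can be sketched in Python. This is only an illustration under stated assumptions, not the compiler's implementation: the `Instr` record and the `fuse_pair` helper are hypothetical names, each instruction is assumed to have at most one result, the input registers of the surviving instruction are assumed to be free on the other path, and the default `fused_pred=None` assumes the two predicates together cover every path through the region.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    pred: Optional[str]      # qualifying predicate; None means "always execute"
    op: str                  # opcode, e.g. "ld4"
    dst: Optional[str]       # result register, None for stores
    srcs: Tuple[str, ...]    # source registers


def fuse_pair(a: Instr, b: Instr, fused_pred: Optional[str] = None) -> List[Instr]:
    """Fuse `a` into `b`, assuming their predicates are disjoint.

    Operands of `a` are copied into `b`'s input registers under `a.pred`,
    and the result is copied back out to `a.dst` under `a.pred`.
    """
    if a.op != b.op or len(a.srcs) != len(b.srcs):
        raise ValueError("only instructions with the same opcode can be fused")
    out = []
    for src_a, src_b in zip(a.srcs, b.srcs):
        if src_a != src_b:                       # operand copy-in
            out.append(Instr(a.pred, "mov", src_b, (src_a,)))
    out.append(Instr(fused_pred, b.op, b.dst, b.srcs))
    if a.dst is not None and a.dst != b.dst:     # result copy-out
        out.append(Instr(a.pred, "mov", a.dst, (b.dst,)))
    return out


# The two loads from this slide: (p1) ld4 v3 = [v1] and (p2) ld4 v4 = [v2]
fused = fuse_pair(Instr("p1", "ld4", "v3", ("v1",)),
                  Instr("p2", "ld4", "v4", ("v2",)))
```

The sketch emits a result copy into v3, whereas the slide avoids that mov by renaming the later use of v3 to v4.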
7
Related Work

Code Hoisting / Sinking
Removes redundant code that is anticipatable on
all paths
If-Conversion Example Revisited
Hoisting the first two loads and sinking the last
two stores on the right can remove the redundant
load and store instructions
The center two loads cannot be hoisted or sunk,
but can be fused
Fusion can be applied in place, and is not
affected by dependencies

Before:
    (p1) ld4 v1 = [V0]
    (p2) ld4 v5 = [V0]
    (p1) addi v2 = 8, V0
    (p2) addi v6 = 4, V0
    (p1) ld4 v3 = [v2]
    (p2) ld4 v7 = [v6]
    (p1) sub v4 = v1, v3
    (p2) add v8 = v5, v7
    (p1) st4 [V0] = v4
    (p2) st4 [V0] = v8
After:
    ld4 v1 = [V0]
    (p1) addi v2 = 8, V0
    (p2) addi v2 = 4, V0
    ld4 v3 = [v2]
    (p1) sub v4 = v1, v3
    (p2) add v4 = v1, v3
    st4 [V0] = v4
8
Related Work

Common Sub-expression Elimination / Partial
Redundancy Elimination
Eliminates calculation of redundant expressions /
values along a path
The two redundant loads can be eliminated
Fusion eliminates operations across disjoint
paths
Can eliminate the two disjoint add instructions
Can be combined with redundancy elimination to
enable better fusion

Original:
    ld4 v1 = [V0]
    ld4 v5 = [V0]
    (p1) addi v2 = 4, v1
    (p2) addi v6 = 4, v5
Fusion applied directly:
    ld4 v1 = [V0]
    ld4 v5 = [V0]
    (p2) mov v1 = v5
    addi v2 = 4, v1
    (p2) mov v6 = v2
After CSE:
    ld4 v1 = [V0]
    (p1) addi v2 = 4, v1
    (p2) addi v6 = 4, v1
Fusion applied after CSE:
    ld4 v1 = [V0]
    addi v2 = 4, v1
    (p2) mov v6 = v2
9
Related Work

Instruction Merging [Mahlke 92]
Merges identical instructions in hammocks
Does not apply merging to all disjoint regions or
transform instructions to allow merging
Predicate Aware Scheduling [Smelyanskiy 03]
Instructions with disjoint predicates can be
scheduled to use the same resource at the same
time
Requires architectural support for issuing
disjoint operations to the same resource in the
same clock cycle

10
Fusion Algorithm

Find a pair of instructions that contain the same
opcode
The pair of instructions must calculate the same
value along non-disjoint paths
Make the two instructions identical by renaming
registers of both instructions and moving values
from old registers
Move one of the instructions immediately
before/after the other without violating any
dependencies
Merge the two instructions by eliminating one of
the instructions
Assign the remaining instruction a new predicate
that is the union of the merged predicates
Apply Copy Propagation and Dead Code Elimination
to remove movs

Original:
    cmp.gt p1,p2 = v2,v3
    (p1) ld4 v2 = [V1]
    (p2) ld4 v1 = [V0]
After renaming registers and adding movs:
    cmp.gt p1,p2 = v2,v3
    (p1) mov V4 = V1
    (p1) ld4 v3 = [V4]
    (p1) mov v2 = v3
    (p2) mov V4 = V0
    (p2) ld4 v3 = [V4]
    (p2) mov v1 = v3
After moving the two loads next to each other:
    cmp.gt p1,p2 = v2,v3
    (p1) mov V4 = V1
    (p2) mov V4 = V0
    (p1) ld4 v3 = [V4]
    (p2) ld4 v3 = [V4]
    (p1) mov v2 = v3
    (p2) mov v1 = v3
After merging:
    cmp.gt p1,p2 = v2,v3
    (p1) mov V4 = V1
    (p2) mov V4 = V0
    ld4 v3 = [V4]
    (p1) mov v2 = v3
    (p2) mov v1 = v3
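
The merge step and the predicate union can be sketched as follows. This is a simplified model, not the authors' implementation: a predicate is represented by the set of paths on which it is true, the path names `then` and `else` and the helper `merge_identical` are hypothetical, and instructions are compared by their text after renaming.

```python
from dataclasses import dataclass

# Hypothetical if-converted region with exactly two paths.
ALL_PATHS = frozenset({"then", "else"})
P1 = frozenset({"then"})
P2 = frozenset({"else"})


@dataclass(frozen=True)
class Instr:
    paths: frozenset   # paths on which the qualifying predicate is true
    text: str          # opcode and operands after register renaming


def merge_identical(instrs):
    """Merge adjacent identical instructions whose predicates are disjoint.

    The surviving instruction gets the union of the merged predicates.
    """
    out = []
    for ins in instrs:
        prev = out[-1] if out else None
        if prev and prev.text == ins.text and not (prev.paths & ins.paths):
            out[-1] = Instr(prev.paths | ins.paths, ins.text)
        else:
            out.append(ins)
    return out


# The load example from this slide after renaming and moving.
region = [
    Instr(P1, "mov V4 = V1"),
    Instr(P2, "mov V4 = V0"),
    Instr(P1, "ld4 v3 = [V4]"),
    Instr(P2, "ld4 v3 = [V4]"),
    Instr(P1, "mov v2 = v3"),
    Instr(P2, "mov v1 = v3"),
]
merged = merge_identical(region)
```

In this model the merged load ends up with `ALL_PATHS` as its predicate, which corresponds to an unpredicated instruction.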
11
Selecting Instructions to Fuse

Some combinations introduce dependencies that
increase the critical schedule path
e.g., fusion of ld instructions
Other combinations introduce dependencies that
limit other fusion opportunities
e.g., fusing ld may prevent fusion of mult
Need to identify which fusion combinations are
desirable

[Figure: control-flow graph with a br and blocks containing ld and mult instructions]
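
The effect of a new dependency on the critical path can be illustrated with a longest-path computation over a dependence graph. This is a sketch with made-up latencies (mov 1, ld 2, mult 4 cycles); `critical_path` is a hypothetical helper, and the graph is assumed to be acyclic.

```python
from functools import lru_cache


def critical_path(latency, preds):
    """Length of the longest latency-weighted dependence chain.

    latency: instruction name -> latency in cycles
    preds:   instruction name -> names of the instructions it depends on
    """
    @lru_cache(maxsize=None)
    def finish(node):
        start = max((finish(p) for p in preds.get(node, ())), default=0)
        return start + latency[node]

    return max(finish(node) for node in latency)


# Before fusion: the mult depends only on its own ld.
before = critical_path({"ld": 2, "mult": 4}, {"mult": ("ld",)})

# After fusing the ld with another ld, an operand copy feeds the fused ld.
after = critical_path({"mov": 1, "ld": 2, "mult": 4},
                      {"ld": ("mov",), "mult": ("ld",)})
```

With these assumed latencies the chain grows by the latency of the mov, which is the cost a fusion heuristic has to weigh against the saved load slot.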
12
Goals for Selecting Fusion Opportunities

Maximize the number of instructions removed
Minimize the number of new instructions added
Minimize constraints that occur due to new
dependencies
New dependencies may restrict the freedom of the
scheduler to reorder instructions
Fusion may affect disambiguation of memory
accesses but memory addresses stay disjoint on
disjoint paths
Other benefits may influence the decision to fuse
Fusion may remove control dependencies

13
Minimizing New mov Instructions

Mov instructions are only necessary if operands
of fused instructions use different registers
Two registers can be renamed to the same register
if their values are only used on disjoint paths
Assuming there are no conflicts, v5 can be
renamed to v1 and v8 can be renamed to v4

Before:
    (p1) ld4 v1 = [V0]
    (p2) ld4 v5 = [V1]
    (p1) addi v2 = 4, v1
    (p2) addi v3 = 4, v5
    ...
    (p1) st4 [V0] = v4
    (p2) st4 [V1] = v8
After:
    ld4 v1 = [V0]
    addi v2 = 4, v1
    ...
    st4 [V0] = v4
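
One way to compare fusion candidates is to count the movs each would need. The sketch below does this for a pair of same-opcode instructions, each given as a `(dst, srcs)` tuple; `movs_needed` is a hypothetical helper, immediates are ignored, and the rename map is assumed to have been checked for conflicts already. Renaming v3 to v2 is an assumption made for the example; the slide only names v5 and v8.

```python
def movs_needed(a, b, rename=None):
    """Count the movs required to make instruction b match instruction a.

    a, b:   (dst, srcs) tuples for two instructions with the same opcode
    rename: optional map from b's registers to the registers they are renamed to
    """
    rename = rename or {}

    def renamed(reg):
        return rename.get(reg, reg)

    dst_a, srcs_a = a
    dst_b, srcs_b = b
    # One copy-in mov per source operand that still differs.
    count = sum(1 for x, y in zip(srcs_a, srcs_b) if x != renamed(y))
    # One copy-out mov if the results still live in different registers.
    if dst_a != renamed(dst_b):
        count += 1
    return count


addi_p1 = ("v2", ("v1",))   # (p1) addi v2 = 4, v1
addi_p2 = ("v3", ("v5",))   # (p2) addi v3 = 4, v5

without_renaming = movs_needed(addi_p1, addi_p2)
with_renaming = movs_needed(addi_p1, addi_p2, {"v5": "v1", "v3": "v2"})
```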
14
Minimizing new cmp instructions

Fusing instructions that execute along every path
from an existing dominating block can reuse an
existing predicate
On the right, the instructions with predicates
(p1) and (p2) are fused together to take on the
block predicate of the parent, the universal
predicate

Before:
    (p1) ld4 v1 = [V0]
    (p2) ld4 v5 = [V0]
    (p1) addi v2 = 4, v1
    (p2) addi v3 = 4, v5
    ...
    (p1) st4 [V0] = v4
    (p2) st4 [V0] = v8
After:
    ld4 v1 = [V0]
    addi v2 = 4, v1
    ...
    st4 [V0] = v4
15
Strategy to Minimize New Instructions

1) Use anticipatability data-flow to find a set
of instructions that executes along all paths.
These instructions can reuse the predicate of a
common dominator.
2) Find instructions that share dependencies on
the same defs / uses and can be fused without
introducing new mov instructions
3) Walk down the dependence chain to fuse
instructions dependent on the newly fused
instruction
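
Step 1 can be illustrated with a heavily simplified anticipatability computation. The sketch assumes an acyclic region, tracks opcodes only (no operands), and has no kill sets; the block names and the `anticipated` helper are hypothetical and are not taken from the compiler.

```python
def anticipated(cfg, gen, block):
    """Opcodes that execute on every path starting at `block`.

    cfg: block -> list of successor blocks (acyclic)
    gen: block -> set of opcodes appearing in that block
    """
    own = set(gen.get(block, ()))
    succs = cfg.get(block, [])
    if not succs:
        return own
    # An opcode is anticipated if every successor anticipates it.
    common = set.intersection(*(anticipated(cfg, gen, s) for s in succs))
    return own | common


# Diamond-shaped region modelled on the if-conversion example.
cfg = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"]}
gen = {
    "then": {"ld4", "addi", "sub", "st4"},
    "else": {"ld4", "addi", "add", "st4"},
}
candidates = anticipated(cfg, gen, "entry")
```

Opcodes in `candidates` appear on both sides of the branch, so fusing them can reuse the predicate of the dominating block instead of requiring a new compare.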

16
Experimental Setup

Itanium 2 Processor
Intel Itanium Compiler 9.0
Options used for SPEC CPU2000 base performance
Apply a Subset of Fusions
Hoist code into dominator
Sink code into post-dominators
No fusion that generates additional mov
instructions is allowed
Fusion of instructions in if-converted regions
inside of SWP loops
In a software pipelined loop, possible increases
in the dependency length off the recurrence path
can be tolerated

17
Results
18
Results
19
Future Work

Fuse instructions that do not share input or
output dependencies if it eliminates a long chain
of instructions
Common long chains of related instructions
Pointer dereferencing sequences
e.g. (p1)->prev->member (p1)->member
Mathematical sequences of instructions
divide, sqrt, etc.
Fuse instructions to remove control dependencies
Fuse unrelated instructions to schedule them
above/below their current block
Move up defs, move down uses

20
Future Work

Partial Fusion
Fuse instructions that do not execute along all
paths
Partial fusion may require computation of new
predicates to get the execution condition of
fused instruction
e.g. Fusion of div instructions on the right
Can eliminate new predicate compares using
predicate promotion

[Figure: control-flow graph with two div instructions]
21
Fusion Tradeoffs

Register Pressure
Fusion can affect register pressure by changing
register lifetimes and eliminating redundant
registers
Scheduling
Additional moves may add height to critical path
Fusion may remove control dependencies to reduce
the height of the critical path
Aliasing of memory addresses in loads may affect
memory disambiguation

22
Conclusion

Fusion is a technique that can eliminate
redundancy on disjoint paths that is not removed
by classical optimizations
Fusion allows instructions with the same opcode
to share resources even if they do not compute
the same value
Reduces resource usage with a possible increase
to dependency length
In a software pipelined loop, increasing the
dependency length off the recurrence path can be
tolerated
Effective at reducing resource II of loops with
control flow
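
The resource II mentioned here is commonly bounded from below by the largest ratio of operations to functional units over all resource classes, rounded up. The sketch below applies that bound to the operation counts of the if-conversion fusion example (4 loads, 2 stores, and 4 integer operations before fusion; 2 loads, 1 store, and 4 integer operations after). The unit counts are hypothetical and do not describe the Itanium 2, and `res_mii` is an illustrative helper.

```python
def res_mii(op_counts, units):
    """Resource-constrained lower bound on the initiation interval.

    op_counts: resource class -> operations per loop iteration
    units:     resource class -> functional units available per cycle
    """
    # -(-n // u) is integer ceiling division.
    return max(-(-op_counts[r] // units[r]) for r in op_counts)


units = {"mem": 2, "alu": 4}            # hypothetical machine

before_fusion = {"mem": 6, "alu": 4}    # every predicated op occupies a slot
after_fusion = {"mem": 3, "alu": 4}     # fused loads and stores share slots

ii_before = res_mii(before_fusion, units)
ii_after = res_mii(after_fusion, units)
```

Under these assumed unit counts the memory ports are the bottleneck, so removing three memory operations lowers the bound from 3 cycles to 2.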

23
Questions
24
Backup Slides
25
Fusing Non-Disjoint Predicates

Find a pair of instructions that contain the same
opcode
Instructions with non-disjoint predicates can be
fused if they calculate the same value
All the div instructions can be fused together
if the two left-most divs compute the same value
The fused instruction takes the predicate of the
top-most node

[Figure: control-flow graph with three div instructions]
26
Minimizing new cmp instructions

Fusing instructions that execute along every path
from an existing dominating block can reuse an
existing predicate
On the right, the instructions with predicates
(p1) and (p2) are fused together to take on the
block predicate of the parent, the universal
predicate

Before:
    (p1) ld4 v1 = [V0]
    (p2) ld4 v5 = [V0]
    (p1) addi v2 = 4, v1
    (p2) addi v3 = 4, v5
    ...
    (p1) st4 [V0] = v4
    (p2) st4 [V0] = v8
After:
    ld4 v1 = [V0]
    addi v2 = 4, v1
    ...
    st4 [V0] = v4
27
Minimizing new cmp instructions

Fused instructions can also reuse an existing cmp
from a post-dominator
The compare instruction that generates the
post-dominator's predicate may need to be moved
downward to execute after the merged instructions.

[Figure: control-flow graph with two div instructions]
28
Related Work

Partial Redundancy Elimination
Eliminates redundancy along some program paths
Two fma operations on the right can be merged
using PRE
Fusion can be applied to common operations with
any input values along different paths
All three fma instructions on the right can be
fused into one fma, and one compare can be
eliminated
Since movs are floating point operations, this
could be a loss
If the movs can be eliminated this could be a gain

Original:
    cmp.lt p1,p2 = r1,r2
    (p1) fma f3 = f1,f2,f0
    (p2) fma f5 = f4,f2,f0
    fma f3 = f1,f2,f0
After PRE:
    cmp.lt p1,p2 = r1,r2
    (p2) fma f5 = f4,f2,f0
    fma f3 = f1,f2,f0
After fusion:
    cmp.lt p1,p2 = r1,r2
    (p2) mov f1 = f4
    fma f3 = f1,f2,f0
    (p2) mov f5 = f3
