Title: Compiler Techniques for ILP
1. Compiler Techniques for ILP
- So far we have explored dynamic hardware techniques for ILP exploitation:
  - BTB and branch prediction
  - Dynamic scheduling
    - Scoreboard
    - Tomasulo's algorithm
  - Speculation
  - Multiple issue
- How can compilers help?
2. Loop Unrolling
- Let's look at the code:

    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

          ADD    R2, R0, R0
    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, -8
          BNE    R1, R2, Loop
3. Scheduling on a Simple 5-Stage MIPS

    Loop: L.D    F0, 0(R1)
          stall                    ; wait for F0 value to propagate
          ADD.D  F4, F0, F2
          stall                    ; wait for FP add to be completed
          stall                    ; wait for FP add to be completed
          S.D    F4, 0(R1)
          DADDUI R1, R1, -8
          stall                    ; wait for R1 value to propagate
          BNE    R1, R2, Loop
          stall                    ; one cycle, branch penalty

- 10 cycles
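The 10-cycle count can be reproduced with a small in-order, single-issue timing model. This is only a sketch: the latency table in `stalls_between` is an assumption read off the stall annotations on the slide (1 stall from load to FP add, 2 from FP add to store, 1 from integer ALU to branch, 1 cycle for an unfilled branch slot), and the names `Instr`, `count_cycles`, and `unscheduled_loop_cycles` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction classes that matter for the assumed latency table. */
typedef enum { OP_LOAD, OP_FP_ALU, OP_STORE, OP_INT_ALU, OP_BRANCH } OpClass;

/* Register numbering for this sketch: R0..R31 are 0..31, F0..F31 are 32..63. */
#define R(n) (n)
#define F(n) (32 + (n))
#define NONE (-1)

typedef struct {
    OpClass cls;
    int dest;   /* register written, or NONE */
    int src1;   /* registers read, or NONE */
    int src2;
} Instr;

/* Assumed stall cycles between a producer and a dependent consumer. */
static int stalls_between(OpClass producer, OpClass consumer) {
    if (producer == OP_FP_ALU && consumer == OP_FP_ALU) return 3;
    if (producer == OP_FP_ALU && consumer == OP_STORE) return 2;
    if (producer == OP_LOAD && consumer == OP_FP_ALU) return 1;
    if (producer == OP_INT_ALU && consumer == OP_BRANCH) return 1;
    return 0;
}

/* Earliest cycle at which instruction i may issue because of one source. */
static int ready_cycle(const Instr *code, const int *issue, size_t i, int src) {
    if (src == NONE) return 1;
    for (size_t j = i; j-- > 0;) {   /* scan back to the most recent writer */
        if (code[j].dest == src)
            return issue[j] + 1 + stalls_between(code[j].cls, code[i].cls);
    }
    return 1;                        /* value was produced before the loop body */
}

/* Cycles for one pass over the loop body. */
int count_cycles(const Instr *code, size_t n) {
    int issue[64];
    assert(n > 0 && n <= 64);
    for (size_t i = 0; i < n; i++) {
        int cycle = (i == 0) ? 1 : issue[i - 1] + 1;
        int r1 = ready_cycle(code, issue, i, code[i].src1);
        int r2 = ready_cycle(code, issue, i, code[i].src2);
        if (r1 > cycle) cycle = r1;
        if (r2 > cycle) cycle = r2;
        issue[i] = cycle;
    }
    /* A branch in last position leaves its slot empty: one extra cycle. */
    return issue[n - 1] + (code[n - 1].cls == OP_BRANCH ? 1 : 0);
}

/* The unscheduled loop body from the slide. */
int unscheduled_loop_cycles(void) {
    const Instr loop[] = {
        { OP_LOAD,    F(0), R(1), NONE },  /* L.D    F0, 0(R1)    */
        { OP_FP_ALU,  F(4), F(0), F(2) },  /* ADD.D  F4, F0, F2   */
        { OP_STORE,   NONE, F(4), R(1) },  /* S.D    F4, 0(R1)    */
        { OP_INT_ALU, R(1), R(1), NONE },  /* DADDUI R1, R1, -8   */
        { OP_BRANCH,  NONE, R(1), R(2) },  /* BNE    R1, R2, Loop */
    };
    return count_cycles(loop, sizeof loop / sizeof loop[0]);
}
```

Feeding the same model a reordered body (with a store placed after the branch) shows how rearranging instructions removes stalls without changing the instruction count.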
4. We Could Rearrange the Instructions
- The unscheduled loop stalls after L.D, after ADD.D (twice), after DADDUI, and after BNE
- Interleave these instructions with some independent instructions
- The best we can achieve is 6 cycles:

    Loop: L.D    F0, 0(R1)
          DADDUI R1, R1, -8
          ADD.D  F4, F0, F2
          stall
          BNE    R1, R2, Loop
          S.D    F4, 8(R1)         ; in the branch delay slot; displacement changes
                                   ; from 0 to 8 because DADDUI has already run
5. Loop Unrolling
- Get more useful instructions into the loop and reduce the loop overhead
- Step 1: Put several iterations together (assume each intermediate branch is taken, i.e. the loop continues)

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, -8
          BNE    R1, R2, Loop
          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, -8
          BNE    R1, R2, Loop
          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, -8
          BNE    R1, R2, Loop
          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, -8
          BNE    R1, R2, Loop
6. Loop Unrolling
- Step 2: Take out the control instructions and adjust the offsets. The three intermediate DADDUI/BNE pairs disappear, the displacements become 0, -8, -16, -24, and a single DADDUI subtracts 32:

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F0, -8(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -8(R1)
          L.D    F0, -16(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -16(R1)
          L.D    F0, -24(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -24(R1)
          DADDUI R1, R1, -32
          BNE    R1, R2, Loop
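At the source level, the same two steps look roughly as follows. This is a sketch, assuming the trip count `n` is a multiple of 4 and the array is indexed from 1 to `n` as in the original loop; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: x[i] = x[i] + s for i = n down to 1 (x has n + 1 elements). */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i--)
        x[i] = x[i] + s;
}

/* Unrolled four times: one decrement and one loop test per four elements.
 * Assumes n is a multiple of 4, so no leftover iterations need handling. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    assert(n % 4 == 0);
    for (size_t i = n; i > 0; i -= 4) {
        x[i]     = x[i]     + s;   /* displacement   0 */
        x[i - 1] = x[i - 1] + s;   /* displacement  -8 */
        x[i - 2] = x[i - 2] + s;   /* displacement -16 */
        x[i - 3] = x[i - 3] + s;   /* displacement -24 */
    }
}
```

When the trip count is not a multiple of the unroll factor, a compiler would typically add a short remainder loop; that case is left out here.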
7. Loop Unrolling
- Step 3: Rename registers so that each unrolled iteration uses its own pair (F0/F4, F6/F8, F10/F12, F14/F16)

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F6, -8(R1)
          ADD.D  F8, F6, F2
          S.D    F8, -8(R1)
          L.D    F10, -16(R1)
          ADD.D  F12, F10, F2
          S.D    F12, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F16, F14, F2
          S.D    F16, -24(R1)
          DADDUI R1, R1, -32
          BNE    R1, R2, Loop
8. Loop Unrolling
- The current loop still has stalls due to RAW dependencies

    Loop: L.D    F0, 0(R1)         ; followed by 1 stall
          ADD.D  F4, F0, F2        ; followed by 2 stalls
          S.D    F4, 0(R1)
          L.D    F6, -8(R1)        ; followed by 1 stall
          ADD.D  F8, F6, F2        ; followed by 2 stalls
          S.D    F8, -8(R1)
          L.D    F10, -16(R1)      ; followed by 1 stall
          ADD.D  F12, F10, F2      ; followed by 2 stalls
          S.D    F12, -16(R1)
          L.D    F14, -24(R1)      ; followed by 1 stall
          ADD.D  F16, F14, F2      ; followed by 2 stalls
          S.D    F16, -24(R1)
          DADDUI R1, R1, -32       ; followed by 1 stall
          BNE    R1, R2, Loop      ; followed by 1 stall (branch penalty)

- 14 instructions + 14 stalls = 28 cycles, 7 per iteration
9. Loop Unrolling
- Step 4: Interleave iterations

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          DADDUI R1, R1, -32
          S.D    F12, 16(R1)       ; -16 + 32, since DADDUI has already run
          BNE    R1, R2, Loop
          S.D    F16, 8(R1)        ; -24 + 32, in the branch delay slot

- 14 cycles, 3.5 per iteration
10. Loop Unrolling and Multiple Issue
- Let's unroll the loop 5 times and mark integer and FP operations

    Loop: L.D    F0, 0(R1)         ; int
          ADD.D  F4, F0, F2        ; FP
          S.D    F4, 0(R1)         ; int
          L.D    F6, -8(R1)        ; int
          ADD.D  F8, F6, F2        ; FP
          S.D    F8, -8(R1)        ; int
          L.D    F10, -16(R1)      ; int
          ADD.D  F12, F10, F2      ; FP
          S.D    F12, -16(R1)      ; int
          L.D    F14, -24(R1)      ; int
          ADD.D  F16, F14, F2      ; FP
          S.D    F16, -24(R1)      ; int
          L.D    F18, -32(R1)      ; int
          ADD.D  F20, F18, F2      ; FP
          S.D    F20, -32(R1)      ; int
          DADDUI R1, R1, -40       ; int
          BNE    R1, R2, Loop      ; int
11. Loop Unrolling and Multiple Issue
- Move all loads first, then the ADD.Ds, then the S.Ds

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          L.D    F18, -32(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          ADD.D  F20, F18, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          S.D    F12, -16(R1)
          S.D    F16, -24(R1)
          S.D    F20, -32(R1)
          DADDUI R1, R1, -40
          BNE    R1, R2, Loop
12. Loop Unrolling and Multiple Issue
- Rearrange instructions to handle the delay for DADDUI and BNE: move DADDUI up and place the last two stores after DADDUI and after BNE

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          L.D    F18, -32(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          ADD.D  F20, F18, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          S.D    F12, -16(R1)
          DADDUI R1, R1, -40
          S.D    F16, -24(R1)      ; displacement not yet adjusted
          BNE    R1, R2, Loop
          S.D    F20, -32(R1)      ; displacement not yet adjusted
13. Loop Unrolling and Multiple Issue
- Fix the immediate displacement values of the stores that now follow DADDUI

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          L.D    F18, -32(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          ADD.D  F20, F18, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          S.D    F12, -16(R1)
          DADDUI R1, R1, -40
          S.D    F16, 16(R1)       ; -24 + 40
          BNE    R1, R2, Loop
          S.D    F20, 8(R1)        ; -32 + 40
14. Loop Unrolling and Multiple Issue
- Now imagine we can issue 2 instructions per cycle, one integer and one FP

    Cycle  Integer instruction        FP instruction
    1      L.D    F0, 0(R1)
    2      L.D    F6, -8(R1)
    3      L.D    F10, -16(R1)        ADD.D  F4, F0, F2
    4      L.D    F14, -24(R1)        ADD.D  F8, F6, F2
    5      L.D    F18, -32(R1)        ADD.D  F12, F10, F2
    6      S.D    F4, 0(R1)           ADD.D  F16, F14, F2
    7      S.D    F8, -8(R1)          ADD.D  F20, F18, F2
    8      S.D    F12, -16(R1)
    9      DADDUI R1, R1, -40
    10     S.D    F16, 16(R1)
    11     BNE    R1, R2, Loop
    12     S.D    F20, 8(R1)

- 12 cycles, 2.4 per iteration
15. Static Branch Prediction
- Analyze the code and figure out which outcome of a branch is likely
  - Always predict taken
  - Predict backward branches as taken, forward as not taken
  - Predict based on the profile of previous runs
- Static branch prediction can help us schedule delayed branch slots
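The three policies can be sketched as simple predicates. This is a minimal illustration, not any particular compiler's implementation; the names `StaticPolicy`, `predict_taken`, and `predict_from_profile` are hypothetical, and the profile rule (predict the more frequent outcome, ties as taken) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { PREDICT_ALWAYS_TAKEN, PREDICT_BACKWARD_TAKEN } StaticPolicy;

/* Direction-based policies: decided from the code alone. */
bool predict_taken(StaticPolicy policy, uint64_t branch_pc, uint64_t target_pc) {
    if (policy == PREDICT_ALWAYS_TAKEN)
        return true;
    /* Backward branches (target at or before the branch) usually close loops. */
    return target_pc <= branch_pc;
}

/* Profile-based policy: predict the outcome seen more often in earlier runs. */
bool predict_from_profile(unsigned long taken, unsigned long not_taken) {
    assert(taken + not_taken > 0);   /* the branch must have been profiled */
    return taken >= not_taken;
}
```

For the loop used throughout these slides, BNE jumps backward to Loop, so both direction-based policies predict it taken.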
16. Static Multiple Issue: VLIW
- Hardware checking for dependencies in issue packets may be expensive and complex
- The compiler can examine instructions and decide which ones can be scheduled in parallel, grouping them into instruction packets (VLIW)
- Hardware can then be simplified
- The processor has multiple functional units, and each field of the VLIW is assigned to one unit
- For example, a VLIW could contain 5 fields: one has to contain an ALU instruction or branch, two have to contain FP instructions, and two have to be memory references
17. Example
- Assume the VLIW contains 5 fields: ALU instruction or branch, two FP instructions, and two memory references
- Ignore the branch delay slot

    Loop: L.D    F0, 0(R1)         ; memory reference
          stall                    ; wait for F0 value to propagate
          ADD.D  F4, F0, F2        ; FP instruction
          stall                    ; wait for FP add to be completed
          stall                    ; wait for FP add to be completed
          S.D    F4, 0(R1)         ; memory reference
          DADDUI R1, R1, -8        ; ALU instruction
          stall                    ; wait for R1 value to propagate
          BNE    R1, R2, Loop      ; ALU instruction (branch)
18. Example
- Unroll seven times and rearrange
- Then place the instructions, cycle by cycle, into the five VLIW fields: ALU/branch, FP, FP, mem, mem

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          L.D    F18, -32(R1)
          L.D    F22, -40(R1)
          L.D    F26, -48(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          ADD.D  F20, F18, F2
          ADD.D  F24, F22, F2
          ADD.D  F28, F26, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          S.D    F12, -16(R1)
          S.D    F16, -24(R1)
          S.D    F20, -32(R1)
          DADDUI R1, R1, -56
          S.D    F24, 16(R1)
          BNE    R1, R2, Loop
          S.D    F28, 8(R1)
26. Example
- A schedule that respects the stalls above. S.D F20 now issues after DADDUI, so its displacement becomes -32 + 56 = 24:

    Cycle  ALU/branch          FP                 FP                 mem                mem
    1                                                                L.D F0,0(R1)       L.D F6,-8(R1)
    2                                                                L.D F10,-16(R1)    L.D F14,-24(R1)
    3                          ADD.D F4,F0,F2     ADD.D F8,F6,F2     L.D F18,-32(R1)    L.D F22,-40(R1)
    4                          ADD.D F12,F10,F2   ADD.D F16,F14,F2   L.D F26,-48(R1)
    5                          ADD.D F20,F18,F2   ADD.D F24,F22,F2
    6                          ADD.D F28,F26,F2                      S.D F4,0(R1)       S.D F8,-8(R1)
    7      DADDUI R1,R1,-56                                          S.D F12,-16(R1)    S.D F16,-24(R1)
    8                                                                S.D F20,24(R1)     S.D F24,16(R1)
    9      BNE R1,R2,Loop                                            S.D F28,8(R1)

- Overall 9 cycles for 7 iterations, 1.29 per iteration
- But the VLIW was only about half full (23 operations in 45 slots)
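The two summary figures are simple ratios. As a small sketch (function names are hypothetical), with 23 operations counted from the unrolled loop: 7 L.D, 7 ADD.D, 7 S.D, plus DADDUI and BNE.

```c
#include <assert.h>

/* Fraction of VLIW slots that actually hold an operation. */
double slot_utilization(int operations, int cycles, int slots_per_word) {
    assert(cycles > 0 && slots_per_word > 0);
    return (double)operations / (double)(cycles * slots_per_word);
}

/* Average cycles spent per original loop iteration. */
double cycles_per_iteration(int cycles, int iterations) {
    assert(iterations > 0);
    return (double)cycles / (double)iterations;
}
```

`slot_utilization(23, 9, 5)` is 23/45, a little over one half, and `cycles_per_iteration(9, 7)` is 9/7, which rounds to 1.29.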
27. Detecting and Enhancing Loop Level Parallelism
- Determine whether data in later iterations depends on data in earlier iterations: a loop-carried dependence
- This is easier to detect at the source code level than in machine code

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + C[i];        /* S1 */
        B[i+1] = B[i] + A[i+1];      /* S2 */
    }

- S1 calculates a value A[i+1] which will be used in the next iteration of S1
- S2 calculates a value B[i+1] which will be used in the next iteration of S2
  - These are loop-carried dependences and they prevent parallelism
- S1 calculates a value A[i+1] which will be used in the current iteration of S2
  - This is a dependence within the loop
28. Detecting and Enhancing Loop Level Parallelism

    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i] + B[i];          /* S1 */
        B[i+1] = C[i] + D[i];        /* S2 */
    }

- S1 calculates a value A[i] which is not used in the future
- S2 calculates a value B[i+1] which will be used in the next iteration of S1
- This is a loop-carried dependence, but S1 depends on S2, not on itself, and S2 does not depend on S1
- This loop can be made parallel if we transform it so that there is no loop-carried dependence

    A[1] = A[1] + B[1];
    for (i = 1; i <= 99; i = i + 1) {
        B[i+1] = C[i] + D[i];        /* S2 */
        A[i+1] = A[i+1] + B[i+1];    /* S1 */
    }
    B[101] = C[100] + D[100];
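A quick way to gain confidence in such a transformation is to run both versions on the same data and compare the results. The sketch below does that with the trip count as a parameter `n` (100 on the slide); arrays are indexed from 1 and need `n + 2` elements, and the function names are hypothetical.

```c
#include <assert.h>

/* Original loop: S1 reads B[i], which S2 wrote in the previous iteration. */
void original_loop(double *A, double *B, const double *C, const double *D, int n) {
    assert(n >= 1);
    for (int i = 1; i <= n; i = i + 1) {
        A[i] = A[i] + B[i];          /* S1 */
        B[i + 1] = C[i] + D[i];      /* S2 */
    }
}

/* Transformed loop: S2 of iteration i is paired with S1 of iteration i + 1,
 * so the dependence from S2 to S1 stays inside one iteration. */
void transformed_loop(double *A, double *B, const double *C, const double *D, int n) {
    assert(n >= 1);
    A[1] = A[1] + B[1];                      /* first S1, peeled off */
    for (int i = 1; i <= n - 1; i = i + 1) {
        B[i + 1] = C[i] + D[i];              /* S2 */
        A[i + 1] = A[i + 1] + B[i + 1];      /* S1 */
    }
    B[n + 1] = C[n] + D[n];                  /* last S2, peeled off */
}
```

After the transformation no iteration reads a value written by an earlier iteration, so the iterations could run in any order.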
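29. Detecting and Enhancing Loop Level Parallelism
- Recursion creates a loop-carried dependence
- But sometimes the loop may be parallelizable if the distance between dependent elements is > 1

    for (i = 1; i <= 100; i = i + 1)
        A[i] = A[i-1] + B[i];        /* dependence distance 1 */

    for (i = 1; i <= 100; i = i + 1)
        A[i] = A[i-5] + B[i];        /* dependence distance 5 */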
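With a dependence distance of 5, element i depends only on element i-5, so the iterations fall into five chains that never touch each other's data. The sketch below makes that explicit; it assumes 0-based arrays and starts at index 5 so that `A[i-5]` is always in bounds, and the function names are hypothetical.

```c
#include <assert.h>

/* Sequential form of the distance-5 recurrence. */
void recurrence_sequential(double *A, const double *B, int n) {
    assert(n >= 5);
    for (int i = 5; i < n; i++)
        A[i] = A[i - 5] + B[i];
}

/* Same computation, split into five independent chains (i mod 5 == r).
 * Each chain is still sequential, but the five chains could run in parallel. */
void recurrence_by_chains(double *A, const double *B, int n) {
    assert(n >= 5);
    for (int r = 0; r < 5; r++)
        for (int i = 5 + r; i < n; i += 5)
            A[i] = A[i - 5] + B[i];
}
```

With distance 1 there is only a single chain, which is why the first loop on the slide offers no such parallelism.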
30. Detecting and Enhancing Loop Level Parallelism
- Find all dependencies in the following loop (there are 5) and eliminate as many as you can

    for (i = 1; i <= 100; i = i + 1) {
        Y[i] = X[i] / c;             /* S1 */
        X[i] = X[i] + c;             /* S2 */
        Z[i] = Y[i] + c;             /* S3 */
        Y[i] = c - Y[i];             /* S4 */
    }

- Solution at page 325
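One way to approach the exercise: the true dependences (S1 to S3 and S1 to S4 through Y[i]) must stay, while the antidependences (S1 to S2 on X[i], S3 to S4 on Y[i]) and the output dependence (S1 to S4 on Y[i]) are name dependences that renaming can remove. The sketch below is one possible renaming, not necessarily the textbook's exact solution; the temporaries `T` and `X1` and the function names are assumptions, and later code would have to read `X1` where it used to read `X`.

```c
#include <assert.h>

/* Original loop body; arrays are indexed 1..n. */
void loop_original(double *X, double *Y, double *Z, double c, int n) {
    assert(c != 0.0);
    for (int i = 1; i <= n; i = i + 1) {
        Y[i] = X[i] / c;    /* S1 */
        X[i] = X[i] + c;    /* S2 */
        Z[i] = Y[i] + c;    /* S3 */
        Y[i] = c - Y[i];    /* S4 */
    }
}

/* Renamed version: only the true dependences through T[i] remain. */
void loop_renamed(const double *X, double *X1, double *Y, double *Z,
                  double *T, double c, int n) {
    assert(c != 0.0);
    for (int i = 1; i <= n; i = i + 1) {
        T[i] = X[i] / c;    /* S1: Y renamed to T, removes the output dependence */
        X1[i] = X[i] + c;   /* S2: X renamed to X1, removes the antidependence */
        Z[i] = T[i] + c;    /* S3: reads T, so S4 no longer conflicts with it */
        Y[i] = c - T[i];    /* S4 */
    }
}
```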
31. Code Transformation
- Eliminating dependent computations
  - Copy propagation

        DADDUI R1, R2, 4
        DADDUI R1, R1, 4
    becomes
        DADDUI R1, R2, 8

  - Tree height reduction

        ADD    R1, R2, R3
        ADD    R4, R1, R6
        ADD    R8, R4, R7          ; must be done sequentially
    becomes
        ADD    R1, R2, R3
        ADD    R4, R6, R7          ; the first two can be done in parallel
        ADD    R8, R1, R4

        sum = sum + x;             /* suppose this is in a loop and we unroll it 5 times */
    becomes
        sum = sum + x1 + x2 + x3 + x4 + x5;            /* must be done sequentially */
    becomes
        sum = (sum + x1) + (x2 + x3) + (x4 + x5);      /* the parenthesized sums can be done in parallel */
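The reassociated sum can be written out as below. The sketch uses integers so that both forms give identical results; with floating-point values, reassociation can change rounding, which is why compilers usually apply it to FP code only when explicitly allowed. Function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Five dependent additions: each one waits for the previous result. */
long sum_sequential(long sum, const long *x) {
    assert(x != NULL);
    return ((((sum + x[0]) + x[1]) + x[2]) + x[3]) + x[4];
}

/* Tree form: a, b, and c are independent and could be computed in parallel,
 * leaving a dependent chain of only three additions. */
long sum_tree(long sum, const long *x) {
    assert(x != NULL);
    long a = sum + x[0];
    long b = x[1] + x[2];
    long c = x[3] + x[4];
    return (a + b) + c;
}
```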
32. Software Pipelining
- Combine instructions from different loop iterations to separate the dependent instructions within an iteration
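33. Software Pipelining
- Apply the software pipelining technique to the following loop

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, -8
          BNE    R1, R2, Loop

- Take one instruction from each of three consecutive iterations (elements at R1+16, R1+8, and R1)

    Element at R1+16:  L.D F0, 0(R1)   ADD.D F4, F0, F2   S.D F4, 0(R1)   <- take S.D
    Element at R1+8:   L.D F0, 0(R1)   ADD.D F4, F0, F2   S.D F4, 0(R1)   <- take ADD.D
    Element at R1:     L.D F0, 0(R1)   ADD.D F4, F0, F2   S.D F4, 0(R1)   <- take L.D

- Resulting loop, preceded by startup code and followed by cleanup code

    Loop: S.D    F4, 16(R1)        ; stores the result for the element at R1+16
          ADD.D  F4, F0, F2        ; adds to the element loaded from R1+8
          L.D    F0, 0(R1)         ; loads the element at R1
          DADDUI R1, R1, -8
          BNE    R1, R2, Loop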
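The same idea at the source level, including the startup and cleanup code, might look like the sketch below. It assumes a 0-based array of at least 3 elements and processes it from the highest index down, as the assembly does; the variable names `loaded` and `summed` stand in for F0 and F4, and the function names are hypothetical.

```c
#include <assert.h>

/* Reference version: one load, add, and store per iteration. */
void add_scalar_plain(double *x, int n, double s) {
    for (int i = n - 1; i >= 0; i--)
        x[i] = x[i] + s;
}

/* Software-pipelined version: each kernel iteration stores element k + 2,
 * adds for element k + 1, and loads element k. */
void add_scalar_swp(double *x, int n, double s) {
    assert(n >= 3);
    double loaded, summed;

    /* Startup code: fill the pipeline. */
    loaded = x[n - 1];            /* L.D   for element n - 1 */
    summed = loaded + s;          /* ADD.D for element n - 1 */
    loaded = x[n - 2];            /* L.D   for element n - 2 */

    /* Kernel: the three statements belong to three different iterations. */
    for (int k = n - 3; k >= 0; k--) {
        x[k + 2] = summed;        /* S.D   for element k + 2 */
        summed = loaded + s;      /* ADD.D for element k + 1 */
        loaded = x[k];            /* L.D   for element k     */
    }

    /* Cleanup code: drain the pipeline. */
    x[1] = summed;                /* S.D   for element 1 */
    summed = loaded + s;          /* ADD.D for element 0 */
    x[0] = summed;                /* S.D   for element 0 */
}
```

Inside the kernel, the store no longer waits on the add from its own iteration, which is what removes the stalls within an iteration.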
34. Software Pipelining vs. Loop Unrolling
- Loop unrolling eliminates loop maintenance overhead, exposing parallelism between iterations
  - Creates larger code
- Software pipelining enables some loop iterations to run at top speed by eliminating the RAW hazards that create latencies within an iteration
  - Requires more complex transformations
35. Homework 8
- Due Tuesday, November 16, by the end of class
- Submit either in class (paper) or by e-mail (PS or PDF only), or bring the paper copy to my office
- Do exercises 4.2, 4.6, 4.9 (skip parts d and e), and 4.11