Compiler Techniques for ILP

Provided by: eecis

1
Compiler Techniques for ILP
  • So far we have explored dynamic hardware
    techniques for ILP exploitation
  • BTB and branch prediction
  • Dynamic scheduling
  • Scoreboard
  • Tomasulo's algorithm
  • Speculation
  • Multiple issue
  • How can compilers help?

2
Loop Unrolling
  • Let's look at the code
  • for (i = 1000; i > 0; i = i - 1)
  •     x[i] = x[i] + s;

      ADD    R2, R0, R0
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      BNE    R1, R2, Loop
3
Scheduling On A Simple 5 Stage MIPS
Loop: L.D    F0, 0(R1)
      stall                 ; wait for F0 value to propagate
      ADD.D  F4, F0, F2
      stall                 ; wait for FP add to be completed
      stall                 ; wait for FP add to be completed
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      stall                 ; wait for R1 value to propagate
      BNE    R1, R2, Loop
      stall                 ; one cycle, branch penalty
10 cycles
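The 10-cycle count can be reproduced with a small model. The following is a minimal sketch, assuming a single-issue in-order pipeline and only the three latencies visible on the slide (load to FP add: 1 stall, FP add to store: 2 stalls, integer op to branch: 1 stall, plus a one-cycle branch penalty). The register numbering, the `Instr` type and all function names are hypothetical, and dependences carried across the loop back edge are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Instruction classes that matter for the latencies on the slide. */
typedef enum { LOAD, FP_ADD, STORE, INT_ALU, BRANCH } Kind;

/* Hypothetical register numbering: R0..R31 -> 0..31, F0..F31 -> 32..63. */
#define R(n) (n)
#define F(n) (32 + (n))
#define NONE (-1)

typedef struct { Kind kind; int dst, src1, src2; } Instr;

/* Stall cycles between a producer and a dependent consumer.
   Only the three cases shown on the slide are modelled. */
static int latency(Kind producer, Kind consumer) {
    if (producer == LOAD    && consumer == FP_ADD) return 1;
    if (producer == FP_ADD  && consumer == STORE)  return 2;
    if (producer == INT_ALU && consumer == BRANCH) return 1;
    return 0;
}

/* Earliest cycle at which instruction i may issue because of register reg:
   look backwards for the most recent instruction that wrote reg. */
static int ready(const Instr *code, const int *issue, size_t i, int reg) {
    for (size_t j = i; j-- > 0; )
        if (reg != NONE && code[j].dst == reg)
            return issue[j] + 1 + latency(code[j].kind, code[i].kind);
    return 1;
}

/* Cycles for one pass through the loop body. A branch costs one extra
   cycle unless a later instruction in the listing fills its delay slot. */
int loop_cycles(const Instr *code, size_t n) {
    int issue[64], total = 0;
    assert(n <= 64);
    for (size_t i = 0; i < n; i++) {
        int c = (i == 0) ? 1 : issue[i - 1] + 1;
        int a = ready(code, issue, i, code[i].src1);
        int b = ready(code, issue, i, code[i].src2);
        if (a > c) c = a;
        if (b > c) c = b;
        issue[i] = c;
        if (c > total) total = c;
        if (code[i].kind == BRANCH && c + 1 > total) total = c + 1;
    }
    return total;
}

/* The unscheduled loop from the slide. */
int original_loop_cycles(void) {
    const Instr loop[] = {
        { LOAD,    F(0), R(1), NONE },   /* L.D    F0, 0(R1)    */
        { FP_ADD,  F(4), F(0), F(2) },   /* ADD.D  F4, F0, F2   */
        { STORE,   NONE, F(4), R(1) },   /* S.D    F4, 0(R1)    */
        { INT_ALU, R(1), R(1), NONE },   /* DADDUI R1, R1, -8   */
        { BRANCH,  NONE, R(1), R(2) },   /* BNE    R1, R2, Loop */
    };
    return loop_cycles(loop, sizeof loop / sizeof loop[0]);
}
```

Calling `original_loop_cycles()` returns 10: five instructions plus five stall cycles.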
4
We Could Rearrange The Instructions
Loop: L.D    F0, 0(R1)
      stall                 ; wait for F0 value to propagate
      ADD.D  F4, F0, F2
      stall                 ; wait for FP add to be completed
      stall                 ; wait for FP add to be completed
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      stall                 ; wait for R1 value to propagate
      BNE    R1, R2, Loop
      stall                 ; one cycle, branch penalty

Interleave these instructions with some independent instructions.
The best we can achieve is 6 cycles:

Loop: L.D    F0, 0(R1)
      DADDUI R1, R1, -8
      ADD.D  F4, F0, F2
      stall
      BNE    R1, R2, Loop
      S.D    F4, 8(R1)      ; fills the branch delay slot; offset is now 8
5
Loop Unrolling
  • Get more useful instructions into the loop
    and reduce overhead
  • Step 1: Put several iterations together

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      BNE    R1, R2, Loop
Assume taken

Four iterations put together:

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      BNE    R1, R2, Loop
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      BNE    R1, R2, Loop
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      BNE    R1, R2, Loop
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      BNE    R1, R2, Loop
6
Loop Unrolling
  • Step 2: Take out control instructions, adjust
    offsets

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F0, -8(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -8(R1)
      L.D    F0, -16(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -16(R1)
      L.D    F0, -24(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -24(R1)
      DADDUI R1, R1, -32
      BNE    R1, R2, Loop
7
Loop Unrolling
  • Step 3: Rename registers

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDUI R1, R1, -32
      BNE    R1, R2, Loop
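At source level, steps 1 to 3 correspond roughly to the following. This is a sketch only: the function names are hypothetical, the array is indexed 1..n as in the slide's loop, and the trip count is assumed to be a multiple of 4 (as 1000 is), so no cleanup loop is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled loop from the slides: x[i] = x[i] + s for i = n down to 1.
   x is indexed 1..n, so the array needs n + 1 elements. */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i--)
        x[i] = x[i] + s;
}

/* The same loop unrolled four times. The temporaries t0..t3 play the role
   of the renamed registers, and the single "i -= 4" mirrors the single
   DADDUI R1, R1, -32 that replaces four separate decrements. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    assert(n % 4 == 0);            /* assumption: no leftover iterations */
    for (size_t i = n; i > 0; i -= 4) {
        double t0 = x[i]     + s;
        double t1 = x[i - 1] + s;
        double t2 = x[i - 2] + s;
        double t3 = x[i - 3] + s;
        x[i]     = t0;
        x[i - 1] = t1;
        x[i - 2] = t2;
        x[i - 3] = t3;
    }
}
```

Because the four temporaries are distinct, the four additions have no name dependences on each other, which is what register renaming achieves in the assembly version.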
8
Loop Unrolling
  • Current loop still has stalls due to RAW
    dependencies
  • 28 cycles for 4 iterations, 7 per iteration
  • Every L.D, ADD.D, S.D group stalls as before: one
    cycle waiting for the loaded value and two cycles
    waiting for the FP add
  • DADDUI and BNE still cost one stall for R1 and one
    cycle of branch penalty
9
Loop Unrolling
  • Step 4: Interleave iterations
  • 14 cycles, 3.5 per iteration

Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      DADDUI R1, R1, -32
      S.D    F12, 16(R1)
      BNE    R1, R2, Loop
      S.D    F16, 8(R1)
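The cycle counts quoted so far can be checked by adding instructions and stalls. The stall counts below are the ones annotated on the earlier slides (1 after L.D, 2 after ADD.D, 1 after DADDUI, 1 branch penalty); the breakdown itself is a worked check, not part of the original deck.

```latex
\begin{align*}
\text{original loop:} &\quad 5 \text{ instructions} + 5 \text{ stalls} = 10 \text{ cycles per iteration} \\
\text{scheduled loop:} &\quad 5 \text{ instructions} + 1 \text{ stall} = 6 \text{ cycles per iteration} \\
\text{unrolled} \times 4: &\quad 14 + \bigl(4 \cdot (1 + 2) + 1 + 1\bigr) = 28 \text{ cycles}, \quad 28 / 4 = 7 \text{ per iteration} \\
\text{unrolled} \times 4 \text{ and interleaved:} &\quad 14 + 0 = 14 \text{ cycles}, \quad 14 / 4 = 3.5 \text{ per iteration}
\end{align*}
```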
10
Loop Unrolling Multiple Issue
  • Let's unroll the loop 5 times, mark int. and FP
    operations

Loop: L.D    F0, 0(R1)      ; int
      ADD.D  F4, F0, F2     ; FP
      S.D    F4, 0(R1)      ; int
      L.D    F6, -8(R1)     ; int
      ADD.D  F8, F6, F2     ; FP
      S.D    F8, -8(R1)     ; int
      L.D    F10, -16(R1)   ; int
      ADD.D  F12, F10, F2   ; FP
      S.D    F12, -16(R1)   ; int
      L.D    F14, -24(R1)   ; int
      ADD.D  F16, F14, F2   ; FP
      S.D    F16, -24(R1)   ; int
      L.D    F18, -32(R1)   ; int
      ADD.D  F20, F18, F2   ; FP
      S.D    F20, -32(R1)   ; int
      DADDUI R1, R1, -40    ; int
      BNE    R1, R2, Loop   ; int
11
Loop Unrolling Multiple Issue
  • Move all loads first, then ADD.D, then S.D

Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      L.D    F18, -32(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      ADD.D  F20, F18, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      S.D    F12, -16(R1)
      S.D    F16, -24(R1)
      S.D    F20, -32(R1)
      DADDUI R1, R1, -40
      BNE    R1, R2, Loop
12
Loop Unrolling Multiple Issue
  • Rearrange instructions to handle delay for DADDUI
    and BNE

Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      L.D    F18, -32(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      ADD.D  F20, F18, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      S.D    F12, -16(R1)
      DADDUI R1, R1, -40
      S.D    F16, -24(R1)
      BNE    R1, R2, Loop
      S.D    F20, -32(R1)
13
Loop Unrolling Multiple Issue
  • Fix immediate displacement values

Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      L.D    F18, -32(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      ADD.D  F20, F18, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      S.D    F12, -16(R1)
      DADDUI R1, R1, -40
      S.D    F16, 16(R1)
      BNE    R1, R2, Loop
      S.D    F20, 8(R1)
14
Loop Unrolling Multiple Issue
  • Now imagine we can issue 2 instructions per
    cycle, one integer and one FP
  • 12 cycles, 2.4 per iteration

Cycle  Integer instruction          FP instruction
  1    Loop: L.D F0, 0(R1)
  2    L.D    F6, -8(R1)
  3    L.D    F10, -16(R1)          ADD.D  F4, F0, F2
  4    L.D    F14, -24(R1)          ADD.D  F8, F6, F2
  5    L.D    F18, -32(R1)          ADD.D  F12, F10, F2
  6    S.D    F4, 0(R1)             ADD.D  F16, F14, F2
  7    S.D    F8, -8(R1)            ADD.D  F20, F18, F2
  8    S.D    F12, -16(R1)
  9    DADDUI R1, R1, -40
 10    S.D    F16, 16(R1)
 11    BNE    R1, R2, Loop
 12    S.D    F20, 8(R1)
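A quick check of the 2.4 figure, together with how well the two issue slots are used. The slot counts are derived from the instruction list above (5 L.D, 5 S.D, DADDUI and BNE on the integer side, 5 ADD.D on the FP side); the utilization figures are a worked addition, not from the deck.

```latex
\begin{align*}
\text{cycles per iteration} &= \frac{12}{5} = 2.4 \\
\text{integer slots used} &= \frac{5 + 5 + 1 + 1}{12} = \frac{12}{12} = 1 \\
\text{FP slots used} &= \frac{5}{12} \approx 0.42
\end{align*}
```

The integer slot is busy in every cycle, so under these assumptions the loop is limited by the integer/memory side rather than by the FP unit.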
15
Static Branch Prediction
  • Analyze the code, figure out which outcome of a
    branch is likely
  • Always predict taken
  • Predict backward branches as taken, forward as
    not taken
  • Predict based on the profile of previous runs
  • Static branch prediction can help us schedule
    delayed branch slots
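Two of the heuristics above can be written as one-line predicates. This is an illustrative sketch: the function names and the use of plain addresses and counters are assumptions, not an interface from any real compiler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Backward taken, forward not taken: a branch whose target lies at a
   lower address than the branch itself usually closes a loop. */
bool predict_by_direction(uint64_t branch_pc, uint64_t target_pc) {
    return target_pc < branch_pc;
}

/* Profile based: predict the outcome seen more often in previous runs.
   Ties, including a branch that never executed, fall back to not taken. */
bool predict_by_profile(unsigned long taken, unsigned long not_taken) {
    return taken > not_taken;
}
```

For the loop-closing `BNE R1, R2, Loop`, both predicates would say "taken": the target `Loop` is at a lower address, and a profile of a 1000-iteration loop would show 999 taken outcomes against 1 not taken.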

16
Static Multiple Issue VLIW
  • Hardware checking for dependencies in issue
    packets may be expensive and complex
  • Compiler can examine instructions and decide
    which ones can be scheduled in parallel, then group
    those instructions into instruction packets (VLIW,
    very long instruction word)
  • Hardware can then be simplified
  • Processor has multiple functional units and each
    field of the VLIW is assigned to one unit
  • For example, a VLIW could contain 5 fields: one
    has to contain an ALU instruction or branch, two
    have to contain FP instructions and two have to
    be memory references
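The 5-field example can be pictured as a fixed-layout record in which each field only accepts one class of operation. The sketch below is hypothetical (type and function names are invented for illustration, and real VLIW encodings differ); it only shows the slot constraint the compiler has to respect.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { OP_NOP, OP_ALU, OP_BRANCH, OP_FP, OP_MEM } OpClass;

/* One operation; only its class matters for slot checking. */
typedef struct { OpClass cls; const char *text; } Op;

/* Five fixed fields, one per functional unit, as in the slide's example. */
typedef struct {
    Op alu_or_branch;
    Op fp[2];
    Op mem[2];
} VliwPacket;

/* A field may be empty (no-op) or hold one of the classes its unit accepts. */
static bool fits(Op op, OpClass a, OpClass b) {
    return op.cls == OP_NOP || op.cls == a || op.cls == b;
}

/* True if every field holds an operation its functional unit can execute. */
bool packet_is_valid(const VliwPacket *p) {
    return fits(p->alu_or_branch, OP_ALU, OP_BRANCH)
        && fits(p->fp[0], OP_FP, OP_FP)   && fits(p->fp[1], OP_FP, OP_FP)
        && fits(p->mem[0], OP_MEM, OP_MEM) && fits(p->mem[1], OP_MEM, OP_MEM);
}

/* Number of non-empty fields, a measure of how full the packet is. */
int packet_used_slots(const VliwPacket *p) {
    return (p->alu_or_branch.cls != OP_NOP)
         + (p->fp[0].cls != OP_NOP)  + (p->fp[1].cls != OP_NOP)
         + (p->mem[0].cls != OP_NOP) + (p->mem[1].cls != OP_NOP);
}
```

Since the hardware does no dependence checking inside a packet, a check like `packet_is_valid` belongs in the compiler; empty fields are simply wasted issue slots.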

17
Example
  • Assume VLIW contains 5 fields: one ALU instruction
    or branch, two FP instructions and two memory
    references
  • Ignore branch delay slot

Loop: L.D    F0, 0(R1)      ; memory reference
      stall                 ; wait for F0 value to propagate
      ADD.D  F4, F0, F2     ; FP instruction
      stall                 ; wait for FP add to be completed
      stall                 ; wait for FP add to be completed
      S.D    F4, 0(R1)      ; memory reference
      DADDUI R1, R1, -8     ; ALU instruction
      stall                 ; wait for R1 value to propagate
      BNE    R1, R2, Loop   ; ALU instruction
18
Example
  • Unroll seven times and rearrange

Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      L.D    F18, -32(R1)
      L.D    F22, -40(R1)
      L.D    F26, -48(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      ADD.D  F20, F18, F2
      ADD.D  F24, F22, F2
      ADD.D  F28, F26, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      S.D    F12, -16(R1)
      S.D    F16, -24(R1)
      S.D    F20, -32(R1)
      DADDUI R1, R1, -56
      S.D    F24, 16(R1)
      BNE    R1, R2, Loop
      S.D    F28, 8(R1)

VLIW fields: ALU/branch | FP | FP | mem | mem
19
Example
  • Fill the VLIW packets cycle by cycle; each packet
    has the fields ALU/branch, FP, FP, mem, mem
  • The stores scheduled after DADDUI get adjusted
    displacements: S.D F20, 24(R1), S.D F24, 16(R1)
    and S.D F28, 8(R1)
  • Overall 9 cycles for 7 iterations, 1.29 cycles
    per iteration
  • But the VLIW packets were only about half full
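The two closing figures can be checked from the instruction counts of the seven-times unrolled loop (7 L.D, 7 ADD.D, 7 S.D, one DADDUI and one BNE) and the 5 fields per packet. This is a worked check under those counts, not part of the original deck.

```latex
\begin{align*}
\text{cycles per iteration} &= \frac{9}{7} \approx 1.29 \\
\text{operations issued} &= 7 + 7 + 7 + 1 + 1 = 23 \\
\text{slot utilization} &= \frac{23}{9 \cdot 5} = \frac{23}{45} \approx 0.51
\end{align*}
```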
27
Detecting and Enhancing Loop Level Parallelism
  • Determine whether data in later iterations
    depends on data in earlier iterations: a
    loop-carried dependence
  • More easily detected at source code level than at
    machine code level

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];     /* S1 */
    B[i+1] = B[i] + A[i+1];   /* S2 */
}

  • S1 calculates a value A[i+1] which will be used
    in the next iteration of S1
  • S2 calculates a value B[i+1] which will be used
    in the next iteration of S2
  • These are loop-carried dependences and they prevent
    parallelism
  • S1 calculates a value A[i+1] which will be used
    in the current iteration of S2
  • This is a dependence within the loop, not a
    loop-carried one
28
Detecting and Enhancing Loop Level Parallelism
for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];       /* S1 */
    B[i+1] = C[i] + D[i];     /* S2 */
}

  • S1 calculates a value A[i] which is not used in
    the future
  • S2 calculates a value B[i+1] which will be used
    in the next iteration of S1
  • This is a loop-carried dependence, but S1 depends
    on S2, not on itself, and S2 does not depend
    on S1
  • This loop can be made parallel if we transform it
    so that there is no loop-carried dependence

A[1] = A[1] + B[1];
for (i = 1; i <= 99; i = i + 1) {
    B[i+1] = C[i] + D[i];         /* S2 */
    A[i+1] = A[i+1] + B[i+1];     /* S1 */
}
B[101] = C[100] + D[100];
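One way to convince yourself that the transformation preserves the result is to run both versions on the same data and compare. The sketch below assumes arrays indexed from 1 with at least 102 elements; the function names and the macro `N` are hypothetical.

```c
#include <assert.h>

#define N 100

/* Original loop: S1 in iteration i+1 reads the B[i+1] written by S2
   in iteration i, which is the loop-carried dependence. */
void original(double *A, double *B, const double *C, const double *D) {
    for (int i = 1; i <= N; i = i + 1) {
        A[i] = A[i] + B[i];         /* S1 */
        B[i + 1] = C[i] + D[i];     /* S2 */
    }
}

/* Transformed loop: S2 now comes first and S1 is shifted by one
   iteration, so every B[i+1] is produced and consumed in the same
   iteration. The first S1 and the last S2 are peeled out of the loop. */
void transformed(double *A, double *B, const double *C, const double *D) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i = i + 1) {
        B[i + 1] = C[i] + D[i];             /* S2 */
        A[i + 1] = A[i + 1] + B[i + 1];     /* S1 */
    }
    B[N + 1] = C[N] + D[N];
}
```

Both versions perform exactly the same additions on the same operands, only in a different order across iterations, so the results match bit for bit.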
29
Detecting and Enhancing Loop Level Parallelism
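  • Recursion creates a loop-carried dependence
  • But sometimes it may be parallelizable if the distance
    between dependent elements is > 1

for (i = 1; i <= 100; i = i + 1)
    A[i] = A[i-1] + B[i];     /* dependence distance 1 */

for (i = 1; i <= 100; i = i + 1)
    A[i] = A[i-5] + B[i];     /* dependence distance 5 */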
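With a dependence distance of 5, the iterations split into five chains that never read each other's results, so the chains can be run in any order or side by side. The sketch below illustrates this; it uses 0-based arrays and starts at index 5 so that `A[i - 5]` stays inside the array, which is an assumption made for the example, and the function names are hypothetical.

```c
#include <assert.h>

/* Reference version: A[i] = A[i-5] + B[i] in plain index order. */
void recurrence_in_order(double *A, const double *B, int n) {
    for (int i = 5; i < n; i++)
        A[i] = A[i - 5] + B[i];
}

/* Same recurrence processed as five independent chains. Chain k touches
   only indices k+5, k+10, ... and reads only indices k, k+5, ..., so no
   chain depends on a value produced by another chain. */
void recurrence_by_chain(double *A, const double *B, int n) {
    for (int k = 0; k < 5; k++)
        for (int i = 5 + k; i < n; i += 5)
            A[i] = A[i - 5] + B[i];
}
```

With distance 1 the same split would yield a single chain, which is why that loop offers no parallelism between iterations.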
30
Detecting and Enhancing Loop Level Parallelism
  • Find all dependencies in the following loop (5)
    and eliminate as many as you can

for (i = 1; i <= 100; i = i + 1) {
    Y[i] = X[i] / c;      /* S1 */
    X[i] = X[i] + c;      /* S2 */
    Z[i] = Y[i] + c;      /* S3 */
    Y[i] = c - Y[i];      /* S4 */
}

Solution at page 325
31
Code Transformation
  • Eliminating dependent computations
  • Copy propagation

      DADDUI R1, R2, 4
      DADDUI R1, R1, 4
    becomes
      DADDUI R1, R2, 8

  • Tree height reduction

      ADD    R1, R2, R3
      ADD    R4, R1, R6
      ADD    R8, R4, R7
    becomes
      ADD    R1, R2, R3     ; these two can be
      ADD    R4, R6, R7     ; done in parallel
      ADD    R8, R1, R4

      sum = sum + x;        /* suppose this is in a loop and we unroll it 5 times */
    becomes
      sum = sum + x1 + x2 + x3 + x4 + x5;
    becomes
      sum = (sum + x1) + (x2 + x3) + (x4 + x5);

    The parenthesized additions can be done in parallel;
    the additions that combine them must be done sequentially
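The effect of tree height reduction on the unrolled sum can be seen by writing both groupings out. A small sketch with hypothetical function names:

```c
#include <assert.h>

/* Straight chain: five additions, each waiting for the previous one. */
double sum_chain(double sum, const double x[5]) {
    return ((((sum + x[0]) + x[1]) + x[2]) + x[3]) + x[4];
}

/* Reduced tree: a, b and c are independent and can be computed in
   parallel, so the longest dependent chain is three additions. */
double sum_tree(double sum, const double x[5]) {
    double a = sum + x[0];
    double b = x[1] + x[2];
    double c = x[3] + x[4];
    return (a + b) + c;
}
```

Floating-point addition is not associative, so the two versions can differ in the last bits for general inputs; a compiler may only apply this regrouping to FP code when it is allowed to relax strict FP semantics.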
32
Software Pipelining
  • Combining instructions from different loop
    iterations to separate dependent instructions
    within an iteration

33
Software Pipelining
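  • Apply software pipelining technique to the
    following loop

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, -8
      BNE    R1, R2, Loop

Take one instruction from each of three consecutive iterations:

      iteration i   (element at R1+16): L.D F0, 0(R1)    ADD.D F4, F0, F2    [S.D F4, 0(R1)]
      iteration i+1 (element at R1+8):  L.D F0, 0(R1)    [ADD.D F4, F0, F2]  S.D F4, 0(R1)
      iteration i+2 (element at R1):    [L.D F0, 0(R1)]  ADD.D F4, F0, F2    S.D F4, 0(R1)

Startup code
Loop: S.D    F4, 16(R1)     ; store for iteration i
      ADD.D  F4, F0, F2     ; add for iteration i+1
      L.D    F0, 0(R1)      ; load for iteration i+2
      DADDUI R1, R1, -8
      BNE    R1, R2, Loop
Cleanup code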
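The same restructuring can be sketched at source level, with the startup and cleanup code written out. This is an illustration under assumptions: the function name is hypothetical, the array is indexed 1..n as in the original loop, and at least two elements are required so that the pipeline can be filled.

```c
#include <assert.h>
#include <stddef.h>

/* Software-pipelined form of: for (i = n; i > 0; i--) x[i] = x[i] + s; */
void add_scalar_swp(double *x, size_t n, double s) {
    assert(n >= 2);

    /* Startup code: fill the pipeline. */
    double loaded = x[n];            /* L.D   for element n     */
    double summed = loaded + s;      /* ADD.D for element n     */
    loaded = x[n - 1];               /* L.D   for element n - 1 */

    /* Steady state: each pass stores element i, adds for element i - 1
       and loads element i - 2, so no instruction waits on a result
       produced in the same pass. */
    for (size_t i = n; i > 2; i--) {
        x[i] = summed;               /* S.D   for element i     */
        summed = loaded + s;         /* ADD.D for element i - 1 */
        loaded = x[i - 2];           /* L.D   for element i - 2 */
    }

    /* Cleanup code: drain the pipeline. */
    x[2] = summed;                   /* S.D   for element 2 */
    summed = loaded + s;             /* ADD.D for element 1 */
    x[1] = summed;                   /* S.D   for element 1 */
}
```

Unlike unrolling, the loop body stays the same size; the cost is the extra startup and cleanup code around it.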
34
Software Pipelining vs. Loop Unrolling
  • Loop unrolling reduces loop maintenance
    overhead and exposes parallelism between iterations
  • Creates larger code
  • Software pipelining enables some loop iterations
    to run at top speed by removing the RAW hazards
    that create stalls within an iteration
  • Requires more complex transformations

35
Homework 8
  • Due Tuesday, November 16 by the end of the class
  • Submit either in class (paper) or by E-mail (PS
    or PDF only) or bring the paper copy to my office
  • Do exercises 4.2, 4.6, 4.9 (skip parts d. and
    e.), 4.11