Title: Compiler techniques for exposing ILP
Instruction Level Parallelism
- Potential overlap among instructions
- Few possibilities in a basic block
- Blocks are small (6-7 instructions)
- Instructions are dependent
- Goal: Exploit ILP across multiple basic blocks
- Iterations of a loop
- for (i = 1000; i > 0; i = i - 1)
      x[i] = x[i] + s;
Basic Scheduling
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;
Sequential MIPS Assembly Code
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
      BNEZ R1, Loop
Pipelined execution (cycle numbers on the right)
Loop: LD   F0, 0(R1)       1
      stall                2
      ADDD F4, F0, F2      3
      stall                4
      stall                5
      SD   0(R1), F4       6
      SUBI R1, R1, 8       7
      stall                8
      BNEZ R1, Loop        9
      stall               10
Scheduled pipelined execution
Loop: LD   F0, 0(R1)       1
      SUBI R1, R1, 8       2
      ADDD F4, F0, F2      3
      stall                4
      BNEZ R1, Loop        5
      SD   8(R1), F4       6
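The two cycle counts on this slide (10 and 6) can be reproduced with a very small issue model. The Python sketch below is only illustrative. The stall table `STALLS` is an assumption read off the stall pattern shown here (one stall between LD and a dependent ADDD, two between ADDD and a dependent SD, one between SUBI and BNEZ, and one cycle for an unfilled branch delay slot), and the names `count_cycles`, `UNSCHEDULED`, and `SCHEDULED` are hypothetical.

```python
# Stall cycles needed between a producer and a dependent consumer.
# Assumed values, inferred from the stall pattern on the slide.
STALLS = {("LD", "ADDD"): 1, ("ADDD", "SD"): 2, ("SUBI", "BNEZ"): 1}


def count_cycles(code):
    """Cycles for one loop iteration on an in-order, single-issue pipeline.

    code is a list of (opcode, destination register or None, source registers).
    """
    producer = {}  # register -> (issue cycle, opcode) of its most recent writer
    cycle = 0
    for op, dest, sources in code:
        issue = cycle + 1  # at best, one instruction issues per cycle
        for reg in sources:
            if reg in producer:
                when, prod_op = producer[reg]
                issue = max(issue, when + STALLS.get((prod_op, op), 0) + 1)
        cycle = issue
        if dest is not None:
            producer[dest] = (cycle, op)
    if code[-1][0] == "BNEZ":  # nothing useful fills the branch delay slot
        cycle += 1
    return cycle


UNSCHEDULED = [
    ("LD", "F0", ("R1",)),
    ("ADDD", "F4", ("F0", "F2")),
    ("SD", None, ("R1", "F4")),
    ("SUBI", "R1", ("R1",)),
    ("BNEZ", None, ("R1",)),
]
SCHEDULED = [
    ("LD", "F0", ("R1",)),
    ("SUBI", "R1", ("R1",)),
    ("ADDD", "F4", ("F0", "F2")),
    ("BNEZ", None, ("R1",)),
    ("SD", None, ("R1", "F4")),  # placed in the branch delay slot
]

print(count_cycles(UNSCHEDULED), count_cycles(SCHEDULED))
```

The model charges the one remaining stall of the scheduled loop to the SD instead of placing it before BNEZ as the slide does; the total per iteration is the same.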
Loop Unrolling
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
      BEQZ R1, Exit
      LD   F6, 0(R1)
      ADDD F8, F6, F2
      SD   0(R1), F8
      SUBI R1, R1, 8
      BEQZ R1, Exit
      LD   F10, 0(R1)
      ADDD F12, F10, F2
      SD   0(R1), F12
      SUBI R1, R1, 8
      BEQZ R1, Exit
      LD   F14, 0(R1)
      ADDD F16, F14, F2
      SD   0(R1), F16
      SUBI R1, R1, 8
      BNEZ R1, Loop
Exit:
Pros: Larger basic block; more scope for scheduling and eliminating dependencies
Cons: Increases code size
Comment: Often a precursor step for other optimizations
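At the source level, the unrolled assembly above corresponds to repeating the loop body four times with an exit test after each copy. The following Python sketch shows that shape and checks it against the rolled loop. It assumes 0-based indexing, and the function names `rolled` and `unrolled_by_4` are hypothetical.

```python
def rolled(x, s):
    """for (i = n; i > 0; i = i - 1) x[i] = x[i] + s, with 0-based storage."""
    out = list(x)
    i = len(out)
    while i > 0:
        out[i - 1] += s
        i -= 1
    return out


def unrolled_by_4(x, s):
    """Same loop with the body copied four times, keeping the intermediate exits."""
    out = list(x)
    i = len(out)
    while i > 0:
        out[i - 1] += s  # copy 1
        i -= 1
        if i == 0:       # BEQZ R1, Exit
            break
        out[i - 1] += s  # copy 2
        i -= 1
        if i == 0:       # BEQZ R1, Exit
            break
        out[i - 1] += s  # copy 3
        i -= 1
        if i == 0:       # BEQZ R1, Exit
            break
        out[i - 1] += s  # copy 4
        i -= 1           # the while test plays the role of BNEZ R1, Loop
    return out


data = [float(v) for v in range(10)]
print(unrolled_by_4(data, 2.5) == rolled(data, 2.5))
```

Keeping the intermediate tests makes the unrolled loop correct for any trip count, at the cost of three extra branches per pass.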
Loop Transformations
- Instruction independence is the key requirement for the transformations
- Example
- Determine that it is legal to move SD after SUBI and BNEZ
- Determine that unrolling is useful (iterations are independent)
- Use different registers to avoid unnecessary constraints
- Eliminate extra tests and branches
- Determine that LD and SD can be interchanged
- Schedule the code, preserving the semantics of the code
1. Eliminating Name Dependences
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F0, -8(R1)
      ADDD F4, F0, F2
      SD   -8(R1), F4
      LD   F0, -16(R1)
      ADDD F4, F0, F2
      SD   -16(R1), F4
      LD   F0, -24(R1)
      ADDD F4, F0, F2
      SD   -24(R1), F4
      SUBI R1, R1, 32
      BNEZ R1, Loop
Register Renaming
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
      BNEZ R1, Loop
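The renaming step can be sketched as a single pass that gives every written register a fresh name and redirects later reads to it. This is a minimal Python illustration, not a description of any particular compiler. The function `rename`, the tuple encoding of instructions, and the list `FRESH` are assumptions made for the example, and memory offsets are left out because renaming does not touch them.

```python
def rename(code, fresh_names):
    """Give each written register a fresh name; later reads use the new name."""
    fresh = iter(fresh_names)
    current = {}  # original register -> its latest new name
    renamed = []
    for op, dest, sources in code:
        sources = tuple(current.get(reg, reg) for reg in sources)
        if dest is not None:
            current[dest] = next(fresh)
            dest = current[dest]
        renamed.append((op, dest, sources))
    return renamed


# One copy of the loop body: SD reads F4 and the base register R1.
BODY = [
    ("LD", "F0", ("R1",)),
    ("ADDD", "F4", ("F0", "F2")),
    ("SD", None, ("F4", "R1")),
]
UNROLLED = BODY * 4  # every copy reuses F0 and F4
FRESH = ["F0", "F4", "F6", "F8", "F10", "F12", "F14", "F16"]

RENAMED = rename(UNROLLED, FRESH)
for instruction in RENAMED:
    print(instruction)
```

After renaming, the only dependences left between copies are through R1 and memory, so the four LD/ADDD/SD chains can be interleaved freely by the scheduler.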
2. Eliminating Control Dependences
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
      BEQZ R1, Exit
      LD   F6, 0(R1)
      ADDD F8, F6, F2
      SD   0(R1), F8
      SUBI R1, R1, 8
      BEQZ R1, Exit
      LD   F10, 0(R1)
      ADDD F12, F10, F2
      SD   0(R1), F12
      SUBI R1, R1, 8
      BEQZ R1, Exit
      LD   F14, 0(R1)
      ADDD F16, F14, F2
      SD   0(R1), F16
      SUBI R1, R1, 8
      BNEZ R1, Loop
Exit:
Intermediate BEQZ are never taken: eliminate!
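The intermediate exits are dead only because the trip count (1000) is a multiple of the unroll factor (4). A small Python sketch, with the hypothetical helper `intermediate_exits_taken`, counts how often one of the first three tests would see R1 reach zero.

```python
def intermediate_exits_taken(trip_count, unroll=4, step=8):
    """Count how often an intermediate BEQZ (not the final BNEZ) sees R1 == 0."""
    r1 = trip_count * step  # R1 starts at the byte offset of the last element
    taken = 0
    while r1 != 0:
        for copy in range(unroll):
            r1 -= step  # SUBI R1, R1, 8
            if r1 == 0:
                if copy < unroll - 1:  # one of the intermediate BEQZ tests
                    taken += 1
                break
    return taken


print(intermediate_exits_taken(1000), intermediate_exits_taken(1002))
```

With 1000 iterations the count is zero, so the three BEQZ instructions can be removed; with a trip count such as 1002 an intermediate exit is needed, and a compiler would typically handle the leftover iterations separately.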
3. Eliminating Data Dependences
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
      LD   F6, 0(R1)
      ADDD F8, F6, F2
      SD   0(R1), F8
      SUBI R1, R1, 8
      LD   F10, 0(R1)
      ADDD F12, F10, F2
      SD   0(R1), F12
      SUBI R1, R1, 8
      LD   F14, 0(R1)
      ADDD F16, F14, F2
      SD   0(R1), F16
      SUBI R1, R1, 8
      BNEZ R1, Loop
- Data dependencies: SUBI, LD, SD
- Force sequential execution of iterations
- Compiler removes this dependency by
- Computing intermediate R1 values
- Eliminating intermediate SUBI
- Changing final SUBI
- Data flow analysis
- Can do on Registers
- Cannot do easily on memory locations
- Is 100(R1) = 20(R2)?
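Removing the intermediate SUBI instructions amounts to folding each decrement into the offset of the following LD and SD and adjusting the final SUBI. The Python sketch below checks that both forms touch the same addresses and leave R1 with the same value. The function names and the starting value 8000 are assumptions for the example.

```python
def with_intermediate_subi(r1, copies=4, step=8):
    """Addresses used when every copy does LD/SD 0(R1) followed by SUBI R1, R1, 8."""
    addresses = []
    for _ in range(copies):
        addresses.append(r1 + 0)  # LD/SD 0(R1)
        r1 -= step                # SUBI R1, R1, 8
    return addresses, r1


def with_folded_offsets(r1, copies=4, step=8):
    """Addresses used with offsets 0, -8, -16, -24 and one final SUBI R1, R1, 32."""
    addresses = [r1 - k * step for k in range(copies)]  # 0(R1), -8(R1), ...
    r1 -= copies * step                                  # SUBI R1, R1, 32
    return addresses, r1


print(with_intermediate_subi(8000) == with_folded_offsets(8000))
```

The folded form has no chain of SUBI results feeding the next copy's address, which is what lets the four copies be scheduled independently.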
4. Alleviating Data Dependencies
Unrolled loop
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
      BNEZ R1, Loop
Scheduled unrolled loop
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   0(R1), F4
      SD   -8(R1), F8
      SUBI R1, R1, 32
      SD   16(R1), F12
      BNEZ R1, Loop
      SD   8(R1), F16
Some General Comments
- Dependences are a property of programs
- Actual hazards are a property of the pipeline
- Techniques to avoid dependence limitations
- Maintain dependences but avoid hazards
- Code scheduling
- hardware
- software
- Eliminate dependences by code transformations
- Complex
- Compiler-based
Loop-level Parallelism
- Primary focus of dependence analysis
- Determine all dependences and find cycles
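- Dependence only within an iteration (no loop-carried dependence)
for (i = 1; i <= 100; i = i + 1) {
    x[i] = y[i] + z[i];
    w[i] = x[i] + v[i];
}
- x[i+1] depends on x[i] from the previous iteration (loop-carried, forms a cycle)
for (i = 1; i <= 100; i = i + 1)
    x[i+1] = x[i] + z[i];
- y[i+1] is used by the next iteration, but the dependence is not a cycle
for (i = 1; i <= 100; i = i + 1) {
    x[i] = x[i] + y[i];
    y[i+1] = w[i] + z[i];
}
- Equivalent loop with the loop-carried dependence removed
x[1] = x[1] + y[1];
for (i = 1; i <= 99; i = i + 1) {
    y[i+1] = w[i] + z[i];
    x[i+1] = x[i+1] + y[i+1];
}
y[101] = w[100] + z[100];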
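A quick way to check the rewritten loop on this slide (the version that starts with a separate update of x[1] and ends with a separate assignment to y[101]) is to run it next to the loop it replaces. The Python sketch below does that on integer data. It assumes the operators lost from the slide are additions, uses 1-based indexing by leaving element 0 unused, and the names `original` and `transformed` are hypothetical.

```python
def original(x, y, w, z):
    """x[i] = x[i] + y[i]; y[i+1] = w[i] + z[i] for i = 1..100."""
    x, y = list(x), list(y)
    for i in range(1, 101):
        x[i] = x[i] + y[i]
        y[i + 1] = w[i] + z[i]
    return x, y


def transformed(x, y, w, z):
    """Same computation with the first and last statements peeled off."""
    x, y = list(x), list(y)
    x[1] = x[1] + y[1]
    for i in range(1, 100):
        y[i + 1] = w[i] + z[i]
        x[i + 1] = x[i + 1] + y[i + 1]  # reads y[i+1] written in this iteration
    y[101] = w[100] + z[100]
    return x, y


N = 102  # indexes 0..101; index 0 is unused
X = [i for i in range(N)]
Y = [2 * i for i in range(N)]
W = [3 * i for i in range(N)]
Z = [i * i for i in range(N)]

print(original(X, Y, W, Z) == transformed(X, Y, W, Z))
```

In the transformed loop every value an iteration reads is produced in that same iteration or before the loop starts, so its iterations can run in parallel.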
Dependence Analysis Algorithms
- Assume array indexes are affine (a*i + b)
- GCD test
- For two affine array indexes a*i + b and c*i + d
- If a loop-carried dependence exists, then GCD(c, a) must divide (d - b)
- Example: x[8*i] = x[4*i + 2] + 3
- (2 - 0) / GCD(8, 4) = 2/4 is not an integer, so no dependence is possible
- General dependence (cycle) determination is NP-complete
- a, b, c, and d may not be known at compile time
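The GCD test above fits in a few lines. The following Python sketch uses the standard `math.gcd`; the function name `gcd_test` and its argument order (store index a*i + b, load index c*i + d) are assumptions for the example.

```python
from math import gcd


def gcd_test(a, b, c, d):
    """Return True if a dependence between indexes a*i + b and c*i + d is possible.

    A False result proves independence. A True result is only conservative:
    the test ignores the loop bounds, so the dependence may still not occur.
    """
    g = gcd(a, c)
    if g == 0:  # a == c == 0: both indexes are constants
        return b == d
    return (d - b) % g == 0


# The slide's example: store to x[8*i], load from x[4*i + 2].
print(gcd_test(8, 0, 4, 2))
```

When a, b, c, or d is not known at compile time the test cannot be applied, and the compiler has to assume a dependence.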
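Software Pipelining
[Figure: Iteration 0, Iteration 1, Iteration 2 and Iteration 3 overlapped in time; a software pipelined iteration takes instructions from several of them. Labels: Start-up, Software pipelined iteration, Finish-up]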
Example
Original loop
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
      BNEZ R1, Loop
Iteration i:    LD F0, 0(R1)    ADDD F4, F0, F2    SD 0(R1), F4
Iteration i+1:  LD F0, 0(R1)    ADDD F4, F0, F2    SD 0(R1), F4
Iteration i+2:  LD F0, 0(R1)    ADDD F4, F0, F2    SD 0(R1), F4
Software pipelined loop
Loop: SD   16(R1), F4
      ADDD F4, F0, F2
      LD   F0, 0(R1)
      SUBI R1, R1, 8
      BNEZ R1, Loop
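The software pipelined loop only works together with start-up code that fills the pipeline and finish-up code that drains it. The Python sketch below models the registers F0 and F4 as the variables `f0` and `f4` and checks the result against the plain loop. The start-up and finish-up sequences shown are an assumption (the slide does not list them), and the function names are hypothetical.

```python
def rolled(x, s):
    """Plain loop: x[i] = x[i] + s, walking from the last element down."""
    out = list(x)
    for i in range(len(out) - 1, -1, -1):
        out[i] = out[i] + s
    return out


def software_pipelined(x, s):
    """Each pass stores for iteration i+2, adds for i+1 and loads for i."""
    out = list(x)
    n = len(out)
    if n < 3:  # too short to fill the three-stage software pipeline
        return rolled(x, s)
    # Start-up: LD and ADDD of the first iteration, LD of the second.
    f0 = out[n - 1]
    f4 = f0 + s
    f0 = out[n - 2]
    # Steady state: one SD, one ADDD and one LD from three different iterations.
    for i in range(n - 3, -1, -1):
        out[i + 2] = f4  # SD   16(R1), F4  (two elements above the current one)
        f4 = f0 + s      # ADDD F4, F0, F2
        f0 = out[i]      # LD   F0, 0(R1)
    # Finish-up: drain the last two iterations.
    out[1] = f4
    f4 = f0 + s
    out[0] = f4
    return out


print(software_pipelined([1.0, 2.0, 3.0, 4.0, 5.0], 10.0))
```

Inside the steady-state loop no instruction depends on a result produced in the same pass, which is why the stalls of the original loop disappear without increasing code size the way unrolling does.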
Trace (global-code) Scheduling
- Find ILP across conditional branches
- Two-step process
- Trace selection
- Find a trace (sequence of basic blocks)
- Use loop unrolling to generate long traces
- Use static branch prediction for other conditional branches
- Trace compaction
- Squeeze the trace into a small number of wide instructions
- Preserve data and control dependences
Trace Selection
[Figure: flowchart. A[i] = A[i] + B[i]; test "A[i] = 0?"; T branch: B[i]; F branch: X; both paths join at C[i]]
      LW   R4, 0(R1)
      LW   R5, 0(R2)
      ADD  R4, R4, R5
      SW   0(R1), R4
      BNEZ R4, Else
      . . . .
      SW   0(R2), . . .
      J    Join
Else: . . . .
      X
Join: . . . .
      SW   0(R3), . . .
Summary of Compiler Techniques
- Try to avoid dependence stalls
- Loop unrolling
- Reduce loop overhead
- Software pipelining
- Reduce single body dependence stalls
- Trace scheduling
- Reduce impact of other branches
- Compilers use a mix of all three
- All techniques depend on prediction accuracy
Food for thought: Analyze this
- Analyze this for different values of X and Y
- To evaluate different branch prediction schemes
- For compiler scheduling purposes
      add  r1, r0, 1000     ; all numbers in decimal
      add  r2, r0, a        ; base address of array a
loop: andi r10, r1, X
      beqz r10, even
      lw   r11, 0(r2)
      addi r11, r11, 1
      sw   0(r2), r11
even: addi r2, r2, 4
      subi r1, r1, Y
      bnez r1, loop
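One way to explore the exercise is to generate the outcome sequence of the `beqz` for a given X and Y and then feed it to whatever predictor is being studied. The Python sketch below does only the first part. The function `beqz_outcomes` and its `max_iterations` cap are assumptions for the example; Python integers do not wrap around like 32-bit registers, so the cap stands in for loops where r1 never reaches exactly zero.

```python
def beqz_outcomes(X, Y, start=1000, max_iterations=10_000):
    """List of beqz outcomes (True = taken, the increment is skipped)."""
    r1 = start
    outcomes = []
    while len(outcomes) < max_iterations:
        r10 = r1 & X                # andi r10, r1, X
        outcomes.append(r10 == 0)   # beqz r10, even
        r1 -= Y                     # subi r1, r1, Y
        if r1 == 0:                 # bnez r1, loop falls through
            break
    return outcomes


for X, Y in [(1, 1), (1, 2), (2, 1)]:
    trace = beqz_outcomes(X, Y)
    print(X, Y, len(trace), sum(trace))
```

With X = 1 and Y = 1 the branch alternates between taken and not taken, while X = 1 and Y = 2 makes it always taken, so the same code can look very different to a predictor or a trace scheduler.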