Compiler techniques for exposing ILP - PowerPoint PPT Presentation

About This Presentation

Title:

Compiler techniques for exposing ILP

Description:

Title: Loop-Level Parallelism
Author: srini
Last modified by: Computing Services
Created Date: 11/12/2002 6:43:22 PM
Document presentation format: PowerPoint PPT presentation

Slides: 19

Provided by: srin54

Learn more at: https://cse.osu.edu

Transcript and Presenter's Notes

1
Compiler techniques for exposing ILP
2
Instruction Level Parallelism

Potential overlap among instructions
Few possibilities in a basic block
Blocks are small (6-7 instructions)
Instructions are dependent
Goal: Exploit ILP across multiple basic blocks
Iterations of a loop
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

3
Basic Scheduling
Source loop:
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Sequential MIPS assembly code:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
      BNEZ R1, Loop

Pipelined execution (clock cycle at right):
Loop: LD   F0, 0(R1)       1
      stall                2
      ADDD F4, F0, F2      3
      stall                4
      stall                5
      SD   0(R1), F4       6
      SUBI R1, R1, 8       7
      stall                8
      BNEZ R1, Loop        9
      stall                10

Scheduled pipelined execution (clock cycle at right):
Loop: LD   F0, 0(R1)       1
      SUBI R1, R1, 8       2
      ADDD F4, F0, F2      3
      stall                4
      BNEZ R1, Loop        5
      SD   8(R1), F4       6
4
Loop Unrolling
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
      BEQZ R1, Exit
      LD   F6, 0(R1)
      ADDD F8, F6, F2
      SD   0(R1), F8
      SUBI R1, R1, 8
      BEQZ R1, Exit
      LD   F10, 0(R1)
      ADDD F12, F10, F2
      SD   0(R1), F12
      SUBI R1, R1, 8
      BEQZ R1, Exit
      LD   F14, 0(R1)
      ADDD F16, F14, F2
      SD   0(R1), F16
      SUBI R1, R1, 8
      BNEZ R1, Loop
Exit:

Pros: Larger basic block; more scope for scheduling and eliminating dependencies
Cons: Increases code size
Comment: Often a precursor step for other optimizations
5
Loop Transformations

Independence of instructions is the key requirement for the transformations
Example
  Determine that it is legal to move SD after SUBI and BNEZ
  Determine that unrolling is useful (iterations are independent)
  Use different registers to avoid unnecessary constraints
  Eliminate extra tests and branches
  Determine that LD and SD can be interchanged
  Schedule the code, preserving the semantics of the code

6
1. Eliminating Name Dependences
Before register renaming:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F0, -8(R1)
      ADDD F4, F0, F2
      SD   -8(R1), F4
      LD   F0, -16(R1)
      ADDD F4, F0, F2
      SD   -16(R1), F4
      LD   F0, -24(R1)
      ADDD F4, F0, F2
      SD   -24(R1), F4
      SUBI R1, R1, 32
      BNEZ R1, Loop

After register renaming:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
      BNEZ R1, Loop
7
2. Eliminating Control Dependences
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
      BEQZ R1, Exit
      LD   F6, 0(R1)
      ADDD F8, F6, F2
      SD   0(R1), F8
      SUBI R1, R1, 8
      BEQZ R1, Exit
      LD   F10, 0(R1)
      ADDD F12, F10, F2
      SD   0(R1), F12
      SUBI R1, R1, 8
      BEQZ R1, Exit
      LD   F14, 0(R1)
      ADDD F16, F14, F2
      SD   0(R1), F16
      SUBI R1, R1, 8
      BNEZ R1, Loop
Exit:

Intermediate BEQZ are never taken: eliminate!
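
The intermediate exits are never taken because the trip count (1000) is a multiple of the unroll factor (4). A minimal C sketch of the same idea follows. The function names are hypothetical, the array is indexed from 0 rather than from 1 as on the slides, and the multiple-of-4 requirement is an assumption stated in the code rather than something a compiler could take for granted.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: one add, one index update, and one loop test per element. */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i--)
        x[i - 1] += s;
}

/* Unrolled by 4 with the intermediate tests removed: one index update and
 * one loop test per four elements.
 * Assumption: n is a multiple of 4, so no intermediate exit is needed.
 * A production compiler would add clean-up code for the remainder. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    assert(n % 4 == 0);
    for (size_t i = n; i > 0; i -= 4) {
        x[i - 1] += s;
        x[i - 2] += s;
        x[i - 3] += s;
        x[i - 4] += s;
    }
}
```

Both functions leave the array in the same state; the unrolled one simply gives the scheduler a larger basic block to work with.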
8
3. Eliminating Data Dependences
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
      LD   F6, 0(R1)
      ADDD F8, F6, F2
      SD   0(R1), F8
      SUBI R1, R1, 8
      LD   F10, 0(R1)
      ADDD F12, F10, F2
      SD   0(R1), F12
      SUBI R1, R1, 8
      LD   F14, 0(R1)
      ADDD F16, F14, F2
      SD   0(R1), F16
      SUBI R1, R1, 8
      BNEZ R1, Loop

Data dependences: SUBI, LD, SD
Force sequential execution of iterations
Compiler removes this dependence by
  Computing intermediate R1 values
  Eliminating intermediate SUBI
  Changing final SUBI
Data flow analysis
  Can do on registers
  Cannot do easily on memory locations
  e.g., is 100(R1) the same location as 20(R2)?

9
4. Alleviating Data Dependencies
Unrolled loop:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      LD   F6, -8(R1)
      ADDD F8, F6, F2
      SD   -8(R1), F8
      LD   F10, -16(R1)
      ADDD F12, F10, F2
      SD   -16(R1), F12
      LD   F14, -24(R1)
      ADDD F16, F14, F2
      SD   -24(R1), F16
      SUBI R1, R1, 32
      BNEZ R1, Loop

Scheduled unrolled loop:
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   0(R1), F4
      SD   -8(R1), F8
      SUBI R1, R1, 32
      SD   16(R1), F12
      BNEZ R1, Loop
      SD   8(R1), F16
10
Some General Comments

Dependences are a property of programs
Actual hazards are a property of the pipeline
Techniques to avoid dependence limitations
  Maintain dependences but avoid hazards
    Code scheduling
      hardware
      software
  Eliminate dependences by code transformations
    Complex
    Compiler-based

11
Loop-level Parallelism

Primary focus of dependence analysis
Determine all dependences and find cycles

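Dependence within an iteration only:
for (i = 1; i <= 100; i = i + 1) {
    x[i] = y[i] + z[i];
    w[i] = x[i] + v[i];
}

Loop-carried dependence on x (a cycle):
for (i = 1; i <= 100; i = i + 1)
    x[i + 1] = x[i] + z[i];

Loop-carried dependence on y (no cycle):
for (i = 1; i <= 100; i = i + 1) {
    x[i] = x[i] + y[i];
    y[i + 1] = w[i] + z[i];
}

The same loop rewritten so that no dependence is loop-carried:
x[1] = x[1] + y[1];
for (i = 1; i <= 99; i = i + 1) {
    y[i + 1] = w[i] + z[i];
    x[i + 1] = x[i + 1] + y[i + 1];
}
y[101] = w[100] + z[100];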
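
The rewriting of the loop x[i] = x[i] + y[i]; y[i+1] = w[i] + z[i] can be checked by running both forms on the same data. The sketch below is only an illustration: the function names and the constant N are hypothetical, and the arrays are assumed to hold at least N + 2 elements so that indexes 1 through N + 1 are valid.

```c
#define N 100

/* Original form: x[i] reads y[i], which the previous iteration wrote as
 * y[i + 1], so the dependence is carried across iterations. */
void original_loop(double *x, double *y, const double *w, const double *z) {
    for (int i = 1; i <= N; i++) {
        x[i] = x[i] + y[i];
        y[i + 1] = w[i] + z[i];
    }
}

/* Transformed form: the first x update and the last y update are peeled
 * off, and the loop body is skewed so that y[i + 1] is produced and
 * consumed in the same iteration. No dependence is loop-carried. */
void transformed_loop(double *x, double *y, const double *w, const double *z) {
    x[1] = x[1] + y[1];
    for (int i = 1; i <= N - 1; i++) {
        y[i + 1] = w[i] + z[i];
        x[i + 1] = x[i + 1] + y[i + 1];
    }
    y[N + 1] = w[N] + z[N];
}
```

Because each element sees the same operations in the same order, the two functions produce identical arrays; the second form lets the iterations run in parallel.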
12
Dependence Analysis Algorithms

Assume array indexes are affine (a*i + b)
GCD test
  For two affine array indexes a*i + b and c*i + d, if a loop-carried dependence exists, then GCD(c, a) must divide (d - b)
  Example: x[8*i] = x[4*i + 2] + 3
  (2 - 0) / GCD(8, 4) = 2/4, which is not an integer, so no dependence is possible
General graph cycle determination is NP
a, b, c, and d may not be known at compile time
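
The GCD test is short enough to write out. The sketch below assumes a, b, c, and d are known integers; the function names are hypothetical. The test is conservative: a result of true means only that a dependence may exist, since the loop bounds are ignored.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Greatest common divisor by Euclid's algorithm; the result is non-negative. */
static int gcd(int p, int q) {
    p = abs(p);
    q = abs(q);
    while (q != 0) {
        int r = p % q;
        p = q;
        q = r;
    }
    return p;
}

/* GCD test for a store to x[a*i + b] and a load from x[c*i + d].
 * Returns false when a dependence is impossible and true when one may
 * exist (loop bounds are not considered). */
bool gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(a, c);
    if (g == 0)              /* both indexes are constants */
        return b == d;
    return (d - b) % g == 0;
}
```

For the example x[8*i] = x[4*i + 2] + 3, the call gcd_test_may_depend(8, 0, 4, 2) returns false, because GCD(8, 4) = 4 does not divide 2.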

13
Software Pipelining
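[Figure: iterations 0 through 3 overlapped in time. A software-pipelined iteration takes instructions from several original iterations, with start-up code before the pipelined loop and finish-up code after it.]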
14
Example
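Original loop:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2
      SD   0(R1), F4
      SUBI R1, R1, 8
      BNEZ R1, Loop

Unrolled iterations:
Iteration i:   LD F0, 0(R1)   ADDD F4, F0, F2   SD 0(R1), F4
Iteration i+1: LD F0, 0(R1)   ADDD F4, F0, F2   SD 0(R1), F4
Iteration i+2: LD F0, 0(R1)   ADDD F4, F0, F2   SD 0(R1), F4

The pipelined loop body takes SD from iteration i, ADDD from iteration i+1, and LD from iteration i+2.

Software-pipelined loop:
Loop: SD   16(R1), F4
      ADDD F4, F0, F2
      LD   F0, 0(R1)
      SUBI R1, R1, 8
      BNEZ R1, Loop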
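
The same store/add/load interleaving, together with the start-up and finish-up code, can be sketched in C. This is an illustration under stated assumptions, not the slide's code: the function name is hypothetical, the array is walked upward from index 0 (the MIPS loop walks downward), and at least two elements are required so that the start-up code has something to load.

```c
#include <assert.h>
#include <stddef.h>

/* Software-pipelined form of x[i] = x[i] + s. Each trip through the
 * steady-state loop stores the result for element i, performs the add for
 * element i + 1, and loads element i + 2.
 * Assumption: n >= 2. */
void add_scalar_swp(double *x, size_t n, double s) {
    assert(n >= 2);

    /* Start-up: load and add for element 0, load for element 1. */
    double loaded = x[0];
    double summed = loaded + s;
    loaded = x[1];

    /* Steady state: one store, one add, one load per trip. */
    for (size_t i = 0; i + 2 < n; i++) {
        x[i] = summed;          /* store for element i     */
        summed = loaded + s;    /* add for element i + 1   */
        loaded = x[i + 2];      /* load for element i + 2  */
    }

    /* Finish-up: drain the two elements still in flight. */
    x[n - 2] = summed;
    x[n - 1] = loaded + s;
}
```

Within one trip of the steady-state loop the three operations belong to different elements, so none of them has to wait for another, which is the point of the transformation.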
15
Trace (global-code) Scheduling

Find ILP across conditional branches
Two-step process
  Trace selection
    Find a trace (sequence of basic blocks)
    Use loop unrolling to generate long traces
    Use static branch prediction for other conditional branches
  Trace compaction
    Squeeze the trace into a small number of wide instructions
    Preserve data and control dependences

16
Trace Selection
[Flowchart]
A[I] = A[I] + B[I]
A[I] = 0?
  T: B[I] = . . .
  F: X
C[I] = . . .

Code:
      LW   R4, 0(R1)
      LW   R5, 0(R2)
      ADD  R4, R4, R5
      SW   0(R1), R4
      BNEZ R4, Else
      . . .
      SW   0(R2), . . .
      J    Join
Else: . . .
      X
Join: . . .
      SW   0(R3), . . .
17
Summary of Compiler Techniques