Title: ILP: Software Approaches
1 ILP: Software Approaches
- Vincent H. Berk
- October 12th
- Reading for today: 3.7-3.9, 4.1
- Reading for Friday: 4.2 4.6
- Homework 2 due Friday 14th: 2.8, A.2, A.13, 3.6ab, 3.10, 4.5, 4.8 (4.13 optional)
2 Basic Loop Unrolling
for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s;
Loop: LD   F0, 0(R1)     ; F0 = array element
      ADDD F4, F0, F2    ; add scalar in F2
      SD   0(R1), F4     ; store result
      SUBI R1, R1, 8     ; decrement pointer 8 bytes (DW)
      BNEZ R1, Loop      ; branch R1 != zero
      NOP                ; delayed branch slot
3 FP Loop Hazards
Loop: LD   F0, 0(R1)     ; F0 = vector element
      ADDD F4, F0, F2    ; add scalar in F2
      SD   0(R1), F4     ; store result
      SUBI R1, R1, 8     ; decrement pointer 8 bytes (DW)
      BNEZ R1, Loop      ; branch R1 != zero
      NOP                ; delayed branch slot
Where are the stalls?
4 FP Loop Showing Stalls
Rewrite code to minimize stalls?
5 Revised FP Loop Minimizing Stalls
Can we unroll the loop to make it faster?
6 Loop Unrolling
- A short loop minimizes parallelism and induces significant overhead
  - The number of branches per instruction is high
- Replicate the loop body several times and adjust the loop termination code
  for (i = 0; i < 100; i = i + 4) {
      x[i]   = x[i]   + y[i];
      x[i+1] = x[i+1] + y[i+1];
      x[i+2] = x[i+2] + y[i+2];
      x[i+3] = x[i+3] + y[i+3];
  }
- Improves scheduling since instructions from different iterations can be scheduled together
- This is done very early in the compilation process
- All dependences have to be found beforehand
- Need to use different registers for each iteration
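The unrolling described above can be sketched in C. This is a minimal illustration, not a compiler's actual output: the function names are ours, and the unrolled version assumes the element count is a multiple of 4, as in the 100-iteration loop on the slide. The separate temporaries stand in for the "different registers for each iteration".

```c
#include <assert.h>
#include <stddef.h>

/* Rolled loop: one add, one index update and one branch per element. */
void add_rolled(double *x, const double *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + y[i];
}

/* Unrolled by 4: one index update and one branch per four elements.
 * Assumption (ours): n is a multiple of 4. */
void add_unrolled4(double *x, const double *y, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        /* Separate temporaries play the role of separate registers,
         * so the four additions do not depend on one another. */
        double t0 = x[i]     + y[i];
        double t1 = x[i + 1] + y[i + 1];
        double t2 = x[i + 2] + y[i + 2];
        double t3 = x[i + 3] + y[i + 3];
        x[i]     = t0;
        x[i + 1] = t1;
        x[i + 2] = t2;
        x[i + 3] = t3;
    }
}
```

Both functions compute the same result when x and y do not overlap; the unrolled one simply executes a quarter as many branches.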
7 Where are the control dependences?
 1 Loop: LD   F0, 0(R1)
 2       ADDD F4, F0, F2
 3       SD   0(R1), F4
 4       SUBI R1, R1, 8
 5       BEQZ R1, exit
 6       LD   F0, 0(R1)
 7       ADDD F4, F0, F2
 8       SD   0(R1), F4
 9       SUBI R1, R1, 8
10       BEQZ R1, exit
11       LD   F0, 0(R1)
12       ADDD F4, F0, F2
13       SD   0(R1), F4
14       SUBI R1, R1, 8
15       BEQZ R1, exit
         ....
8 Data Dependences
 1 Loop: LD   F0, 0(R1)
 2       ADDD F4, F0, F2
 3       SD   0(R1), F4       ; drop SUBI & BNEZ
 4       LD   F0, -8(R1)
 5       ADDD F4, F0, F2
 6       SD   -8(R1), F4      ; drop SUBI & BNEZ
 7       LD   F0, -16(R1)
 8       ADDD F4, F0, F2
 9       SD   -16(R1), F4     ; drop SUBI & BNEZ
10       LD   F0, -24(R1)
11       ADDD F4, F0, F2
12       SD   -24(R1), F4
13       SUBI R1, R1, 32      ; alter to 4*8
14       BNEZ R1, Loop
15       NOP
9 Name Dependences
 1 Loop: LD   F0, 0(R1)
 2       ADDD F4, F0, F2
 3       SD   0(R1), F4       ; drop SUBI & BNEZ
 4       LD   F6, -8(R1)
 5       ADDD F8, F6, F2
 6       SD   -8(R1), F8      ; drop SUBI & BNEZ
 7       LD   F10, -16(R1)
 8       ADDD F12, F10, F2
 9       SD   -16(R1), F12    ; drop SUBI & BNEZ
10       LD   F14, -24(R1)
11       ADDD F16, F14, F2
12       SD   -24(R1), F16
13       SUBI R1, R1, 32      ; alter to 4*8
14       BNEZ R1, Loop
15       NOP
Register renaming
10 Unroll Loop Four Times
Rewrite loop to minimize stalls?
15 + 4 * (1+2) + 1 = 28 clock cycles to initiate, or 7 per iteration
Assumes R1 is a multiple of 4
11 Unrolled Loop That Minimizes Stalls
- What assumptions were made when we moved code?
  - OK to move store past SUBI even though SUBI changes the register
  - OK to move loads before stores: do we get the right data?
  - When is it safe for the compiler to make such changes?
Can we eliminate the remaining stall?
14 + 1 = 15 clock cycles, or 3.75 per iteration
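The cycle counts for the unrolled loop before and after scheduling can be checked by hand. The breakdown below is a sketch that assumes the usual latencies for this pipeline (one stall after each LD, two between an ADDD and the SD that stores its result); the slides give only the totals.

```latex
\begin{align*}
\text{unrolled, unscheduled:}\quad & 15 + 4 \times (1 + 2) + 1 = 28 \text{ cycles}, & 28 / 4 &= 7 \text{ cycles per iteration} \\
\text{unrolled, scheduled:}\quad & 14 + 1 = 15 \text{ cycles}, & 15 / 4 &= 3.75 \text{ cycles per iteration} \\
\text{gain from scheduling:}\quad & 28 / 15 \approx 1.87
\end{align*}
```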
12 Compiler Loop Unrolling
- Most important: code correctness
- Unrolling produces larger code that might interfere with the cache
  - Code sequence no longer fits in L1 cache
  - Cache-to-memory bandwidth might not be wide enough
- Compiler must understand the hardware
  - Enough registers must be available, OR
  - Compiler must rely on hardware register renaming
- Compiler must understand the code
  - Determine that loop iterations are independent
  - Eliminate branch instructions while preserving correctness
  - Determine that the LD and SD are independent over the loop
  - Reschedule instructions and adjust the offsets
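Preserving correctness when branches are eliminated usually means handling trip counts that are not a multiple of the unroll factor. A minimal sketch in C, assuming an unroll factor of 4 and a separate cleanup loop (the function name and signature are ours, for illustration only):

```c
#include <assert.h>
#include <stddef.h>

/* Adds the scalar s to every element of x, unrolled by 4.
 * Hypothetical helper: name and signature are not from the slides. */
void add_scalar_unrolled(double *x, size_t n, double s)
{
    assert(x != NULL || n == 0);
    size_t i = 0;

    /* Main body: runs while at least four elements remain. */
    for (; i + 4 <= n; i += 4) {
        x[i]     += s;
        x[i + 1] += s;
        x[i + 2] += s;
        x[i + 3] += s;
    }

    /* Cleanup loop: handles the 0 to 3 leftover elements. */
    for (; i < n; i++)
        x[i] += s;
}
```

The cleanup loop costs a little code space but removes the requirement that the trip count be a multiple of 4.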
13 Superscalar Example
- Superscalar
  - Our system can issue one floating-point and one other (non-floating-point) instruction per cycle.
  - Instructions are dynamically scheduled from the window
  - Unroll the loop 5 times and reschedule to minimize cycles per iteration. (WHY?)
- While the Integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
  - Exactly 50% FP operations
  - No hazards
- If more instructions issue at the same time, decode and issue become more difficult
  - Even 2-way superscalar => examine 2 opcodes, 6 register specifiers, and decide if 1 or 2 instructions can issue
14 Loop Unrolling in Superscalar
      Integer instruction    FP instruction       Clock cycle
Loop: LD   F0, 0(R1)                              1
      LD   F6, -8(R1)                             2
      LD   F10, -16(R1)      ADDD F4, F0, F2      3
      LD   F14, -24(R1)      ADDD F8, F6, F2      4
      LD   F18, -32(R1)      ADDD F12, F10, F2    5
      SD   0(R1), F4         ADDD F16, F14, F2    6
      SD   -8(R1), F8        ADDD F20, F18, F2    7
      SD   -16(R1), F12                           8
      SUBI R1, R1, 40                             9
      SD   16(R1), F16                            10
      BNEZ R1, Loop                               11
      SD   8(R1), F20                             12
- Unrolled 5 times to avoid delays (+1 due to SS)
- 12 clocks to initiate, or 2.4 clocks per iteration
15 VLIW Example
- VLIW
  - 5 instructions in one very long instruction word.
  - 2 FP, 2 Memory, 1 branch/integer
  - Compiler avoids hazards
  - Not all slots are always full
- VLIW: trades instruction space for simple decoding
  - The long instruction word has room for many operations
  - By definition, all the operations the compiler puts in the long instruction word are independent => execute in parallel
  - E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch => 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
  - Need a compiling technique that schedules across several branches
16 Loop Unrolling in VLIW
Memory reference 1  Memory reference 2  FP operation 1     FP op. 2           Int. op/branch   Clock
LD F0, 0(R1)        LD F6, -8(R1)                                                              1
LD F10, -16(R1)     LD F14, -24(R1)                                                            2
LD F18, -32(R1)     LD F22, -40(R1)     ADDD F4, F0, F2    ADDD F8, F6, F2                     3
LD F26, -48(R1)                         ADDD F12, F10, F2  ADDD F16, F14, F2                   4
                                        ADDD F20, F18, F2  ADDD F24, F22, F2                   5
SD 0(R1), F4        SD -8(R1), F8       ADDD F28, F26, F2                                      6
SD -16(R1), F12     SD -24(R1), F16                                                            7
SD -32(R1), F20     SD -40(R1), F24                                           SUBI R1, R1, 56  8
SD 8(R1), F28                                                                 BNEZ R1, LOOP    9
- Unrolled 7 times to avoid delays
- 9 clocks to initiate, or 1.3 clocks per iteration
- Average: 2.5 ops per clock, 50% efficiency
- Note: Need more registers in VLIW (15 vs. 6 in SS)
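The summary figures for the VLIW schedule follow from counting the operations in the table: seven loads, seven adds, seven stores, one SUBI and one BNEZ, spread over nine 5-slot instruction words. A quick check, with the slide's 2.5 and 50% read as rounded values:

```latex
\begin{align*}
\text{clocks per iteration} &= 9 / 7 \approx 1.29 \\
\text{operations issued} &= 7_{\text{LD}} + 7_{\text{ADDD}} + 7_{\text{SD}} + 1_{\text{SUBI}} + 1_{\text{BNEZ}} = 23 \\
\text{operations per clock} &= 23 / 9 \approx 2.56 \\
\text{slot utilization} &= 23 / (9 \times 5) \approx 0.51
\end{align*}
```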
17 Limits to Multi-Issue Machines
- Inherent limitations of instruction-level parallelism
  - 1 branch in 5: How to keep a 5-way VLIW busy?
  - Latencies of units: many operations must be scheduled
- Easy: More instruction bandwidth
- Easy: Duplicate functional units to get parallel execution
- Hard: Increase ports to register file (bandwidth)
  - VLIW example needs 7 reads and 3 writes for integer registers, and 5 reads and 3 writes for FP registers
- Harder: Increase ports to memory (bandwidth)
- Pipelines in lockstep
  - One pipeline stall stalls all others to avoid hazards
18 Limits to Multi-Issue Machines
- Limitations specific to either superscalar or VLIW implementation
  - Decode and issue in superscalar: how wide is practical?
  - VLIW code size: unrolled loops + wasted fields in VLIW
    - IA-64 compresses dependent instructions, but still larger
  - VLIW lock step => 1 hazard and all instructions stall
    - IA-64 not lock step? Dynamic pipeline?
  - VLIW and binary compatibility: IA-64 promises binary compatibility
19 Dependences
- Two instructions are parallel if they can execute simultaneously in a pipeline without causing any stalls (assuming no structural hazards) and can be reordered (depending on code semantics)
- Two instructions that are dependent are not parallel and cannot be reordered
- Types of dependences
  - Data dependences
  - Name dependences
  - Control dependences
- Dependences are properties of programs
- Hazards are properties of the pipeline organization
- Dependence indicates the potential for a hazard
20 Compiler Perspectives on Code Movement
- Hard for memory accesses
  - Does 100(R4) = 20(R6)?
  - From different loop iterations, does 20(R6) = 20(R6)?
- Our example required the compiler to know that if R1 doesn't change then:
  0(R1) != -8(R1) != -16(R1) != -24(R1)
- There were no dependences between some loads and stores, so they could be moved past each other
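21 Detecting Loop Level Dependences
for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}
- Loop carried dependence
  - S1 relies on the S2 of the previous iteration
- Within one iteration there is no dependence between S1 and S2, so the loop can be rewritten; consider:
A[1] = A[1] + B[1];
for (i = 1; i <= 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];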
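The claim that the rewritten loop computes the same values can be checked directly. The sketch below writes both forms as C functions; the function names and the trip-count parameter n are ours, and the arrays are assumed to be indexed from 1 to n+1 as on the slide.

```c
#include <assert.h>

/* Original loop: S1 in iteration i+1 reads the B[i+1] that S2 wrote
 * in iteration i, which is a loop-carried dependence. */
void original_loop(double *A, double *B, const double *C, const double *D, int n)
{
    for (int i = 1; i <= n; i++) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i + 1] = C[i] + D[i];    /* S2 */
    }
}

/* Transformed loop: S2 now precedes the S1 that uses its result, so
 * the dependence stays inside one iteration. */
void transformed_loop(double *A, double *B, const double *C, const double *D, int n)
{
    assert(n >= 1);                /* the peeled statements touch A[1] and B[n+1] */
    A[1] = A[1] + B[1];            /* first S1, peeled off the front */
    for (int i = 1; i <= n - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[n + 1] = C[n] + D[n];        /* last S2, peeled off the back */
}
```

With the dependence confined to a single iteration, the iterations of the transformed loop are independent of one another and can be overlapped or unrolled freely.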
22 Dependence Distance
for (i = 6; i <= 100; i = i + 1) Y[i] = Y[i-5] + Y[i];
- Loop carried dependence in the form of a recurrence of Y
- Dependence distance of 5
- Higher dependence distance allows for more ILP
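To see why a larger dependence distance exposes more ILP, the recurrence can be unrolled by its distance. In the sketch below (function names are ours), the five statements in one pass of the unrolled loop read only values produced by earlier passes, so they are independent of each other. It assumes the trip count is a multiple of 5, which holds for i = 6..100.

```c
#include <assert.h>

/* Rolled recurrence from the slide: Y[i] depends on Y[i-5]. */
void recurrence_rolled(double *Y, int last)
{
    for (int i = 6; i <= last; i++)
        Y[i] = Y[i - 5] + Y[i];
}

/* Unrolled by the dependence distance.
 * Assumption (ours): the trip count last - 5 is a multiple of 5. */
void recurrence_unrolled5(double *Y, int last)
{
    assert(last >= 5 && (last - 5) % 5 == 0);
    for (int i = 6; i <= last; i += 5) {
        /* None of these five statements reads a value written by
         * another statement in the same pass. */
        Y[i]     = Y[i - 5] + Y[i];
        Y[i + 1] = Y[i - 4] + Y[i + 1];
        Y[i + 2] = Y[i - 3] + Y[i + 2];
        Y[i + 3] = Y[i - 2] + Y[i + 3];
        Y[i + 4] = Y[i - 1] + Y[i + 4];
    }
}
```

With a distance of 1 the same unrolling would produce a chain of dependent statements, and nothing could be overlapped.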
23 Greatest Common Divisor test
- Affine array indices
  - All array indices DIRECTLY depend on loop variable i
- Assume the code has these properties:
  - the for loop runs from n to m with index i
  - the loop has an access pattern X[a*i + b] = X[c*i + d]
  - there are two values for i, j and k, both between n and m
  - a store indexed by j and a later load indexed by k, with a*j + b = c*k + d
- If a loop carried dependence exists, then GCD(c,a) must divide (d-b)
- a=2, b=3, c=2, d=0: GCD(a,c) = 2 and d-b = -3
- There is no loop dependence since 2 does not divide -3
for (i = 1; i <= 100; i = i + 1) X[2*i+3] = X[2*i] * 5.0;
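The test is easy to mechanize. The following is a minimal sketch, not a production dependence analyzer: the function names are ours, and a result of 1 means only that a dependence may exist, because the test ignores the loop bounds n and m.

```c
#include <assert.h>
#include <stdlib.h>

/* Greatest common divisor of |p| and |q| (Euclid's algorithm). */
static int gcd(int p, int q)
{
    p = abs(p);
    q = abs(q);
    while (q != 0) {
        int r = p % q;
        p = q;
        q = r;
    }
    return p;
}

/* GCD test for a store to X[a*i + b] and a load from X[c*i + d].
 * Returns 0 when a loop-carried dependence is impossible and 1 when
 * it may exist. */
int gcd_test_may_depend(int a, int b, int c, int d)
{
    assert(a != 0 || c != 0);   /* gcd(0, 0) is undefined */
    return (d - b) % gcd(c, a) == 0;
}
```

For the slide's example, `gcd_test_may_depend(2, 3, 2, 0)` returns 0: the store touches only odd elements and the load only even ones.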
24 Problem Cases
- Reference by pointers instead of array indices
  - partly eliminated by strict type checking
- Sparse arrays with indexing through other arrays (similar to pointers)
- When a dependence exists for values of the indices but those values are never reached
- The loop-carried dependence has a distance far greater than what loop unrolling would cover
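The pointer case can be illustrated with two versions of the same loop that differ only in the order in which elements are visited. This is a sketch with hypothetical function names; the backward version stands in for a compiler moving work across iterations. When the two pointers refer to overlapping storage, the orders give different answers, which is why the compiler must prove the absence of aliasing before reordering.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] = src[i] + s, visiting elements in increasing order. */
void shift_add_forward(double *dst, const double *src, size_t n, double s)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] + s;
}

/* Same statement, visiting elements in decreasing order. */
void shift_add_backward(double *dst, const double *src, size_t n, double s)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = n; i-- > 0; )
        dst[i] = src[i] + s;
}
```

Called on separate arrays the two functions agree. Called with `dst = p + 1` and `src = p`, each store feeds the next load in the forward version (a loop-carried dependence of distance 1), so the results diverge.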
25 Software Pipelining
- Observation: if iterations from loops are independent, then we can get more ILP by taking instructions from different iterations
- Software pipelining reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop
[Figure: software-pipelined iteration]
26 SW Pipelining Example
Unrolled loop:
 1 LD   F0, 0(R1)
 2 ADDD F4, F0, F2
 3 SD   0(R1), F4
 4 LD   F6, -8(R1)
 5 ADDD F8, F6, F2
 6 SD   -8(R1), F8
 7 LD   F10, -16(R1)
 8 ADDD F12, F10, F2
 9 SD   -16(R1), F12
10 SUBI R1, R1, 24
11 BNEZ R1, LOOP
Software-pipelined loop, with startup code before it and cleanup code after it:
   LD   F0, 0(R1)
   ADDD F4, F0, F2
   LD   F0, -8(R1)
 1 SD   0(R1), F4       ; Stores M[i]
 2 ADDD F4, F0, F2      ; Adds to M[i-1]
 3 LD   F0, -16(R1)     ; Loads M[i-2]
 4 SUBI R1, R1, 8
 5 BNEZ R1, LOOP
   SD   0(R1), F4
   ADDD F4, F0, F2
   SD   -8(R1), F4
Pipeline timing of the loop body:
SD    IF ID EX Mem WB          ; reads F4
ADDD     IF ID EX Mem WB       ; reads F0, writes F4
LD          IF ID EX Mem WB    ; writes F0
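The same restructuring can be written in C to make the startup, steady-state and cleanup phases explicit. This is a minimal sketch with names of our choosing; it walks up through the array rather than down as the assembly does, and it assumes at least two elements so that the startup and cleanup code have something to work on.

```c
#include <assert.h>
#include <stddef.h>

/* Software-pipelined form of: for each i, x[i] = x[i] + s.
 * Each pass of the steady-state loop stores element i, adds for
 * element i + 1 and loads element i + 2.
 * Assumption (ours): n >= 2. */
void add_scalar_swp(double *x, size_t n, double s)
{
    assert(n >= 2);

    /* Startup: fill the pipeline. */
    double loaded = x[0];          /* LD   for element 0 */
    double sum    = loaded + s;    /* ADDD for element 0 */
    loaded = x[1];                 /* LD   for element 1 */

    /* Steady state: one store, one add, one load per pass. */
    for (size_t i = 0; i + 2 < n; i++) {
        x[i]   = sum;              /* SD   for element i     */
        sum    = loaded + s;       /* ADDD for element i + 1 */
        loaded = x[i + 2];         /* LD   for element i + 2 */
    }

    /* Cleanup: drain the pipeline. */
    x[n - 2] = sum;                /* SD   for element n - 2 */
    sum      = loaded + s;         /* ADDD for element n - 1 */
    x[n - 1] = sum;                /* SD   for element n - 1 */
}
```

Inside the steady-state loop the store, add and load belong to three different original iterations, so none of them waits on another within the same pass.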
27 SW Pipelining Example
- Symbolic Loop Unrolling
  - Smaller code space
  - Overhead paid only once vs. each iteration in loop unrolling
  - 100 iterations = 25 loops with 4 unrolled iterations each
[Figure: number of overlapped operations vs. time for (a) software pipelining and (b) loop unrolling]