ILP: Software Approaches - PowerPoint PPT Presentation


Provided by: vincen68


1
ILP: Software Approaches

Vincent H. Berk
October 12th
Reading for today: 3.7-3.9, 4.1
Reading for Friday: 4.2-4.6
Homework #2, due Friday 14th: 2.8, A.2, A.13,
3.6a&b, 3.10, 4.5, 4.8 (4.13 optional)

2
Basic Loop Unrolling
for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s;

Loop: LD   F0, 0(R1)     ; F0 = array element
      ADDD F4, F0, F2    ; add scalar in F2
      SD   0(R1), F4     ; store result
      SUBI R1, R1, 8     ; decrement pointer 8 bytes (DW)
      BNEZ R1, Loop      ; branch R1 != zero
      NOP                ; delayed branch slot
3
FP Loop Hazards
Loop: LD   F0, 0(R1)     ; F0 = vector element
      ADDD F4, F0, F2    ; add scalar in F2
      SD   0(R1), F4     ; store result
      SUBI R1, R1, 8     ; decrement pointer 8 bytes (DW)
      BNEZ R1, Loop      ; branch R1 != zero
      NOP                ; delayed branch slot
Where are the stalls?
4
FP Loop Showing Stalls
Rewrite code to minimize stalls?
5
Revised FP Loop Minimizing Stalls
Can we unroll the loop to make it faster?
6
Loop Unrolling

Short loop minimizes parallelism, induces
significant overhead
Branches per instruction is high
Replicate the loop body several times and adjust
the loop termination code
for (i = 0; i < 100; i = i + 4) {
    x[i]   = x[i]   + y[i];
    x[i+1] = x[i+1] + y[i+1];
    x[i+2] = x[i+2] + y[i+2];
    x[i+3] = x[i+3] + y[i+3];
}
Improves scheduling since instructions from
different iterations can be scheduled together
This is done very early in the compilation
process
All dependences have to be found beforehand
Need to use different registers for each iteration
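
As a rough illustration of the transformation on this slide, the sketch below (Python, used only as executable pseudocode) compares the rolled loop with a four-way unrolled version. The function names and the cleanup loop for lengths that are not a multiple of 4 are assumptions added for the example; the slide's loop runs exactly 100 iterations and needs no cleanup.

```python
def add_rolled(x, y):
    """Reference loop: x[i] = x[i] + y[i], one element per iteration."""
    out = list(x)
    for i in range(len(out)):
        out[i] = out[i] + y[i]
    return out


def add_unrolled4(x, y):
    """Same computation with the loop body replicated four times per trip."""
    out = list(x)
    n = len(out)
    main = n - n % 4          # largest multiple of 4 not exceeding n
    for i in range(0, main, 4):
        out[i] = out[i] + y[i]
        out[i + 1] = out[i + 1] + y[i + 1]
        out[i + 2] = out[i + 2] + y[i + 2]
        out[i + 3] = out[i + 3] + y[i + 3]
    for i in range(main, n):  # cleanup loop (assumption, not on the slide)
        out[i] = out[i] + y[i]
    return out


if __name__ == "__main__":
    x = list(range(100))
    y = [2 * v for v in x]
    assert add_unrolled4(x, y) == add_rolled(x, y)
    print("loop trips:", len(range(0, 100, 4)))   # 25 trips instead of 100
```

The unrolled version executes the loop test and index update once per four elements, which is the overhead reduction the slide refers to.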

7
Where are the control dependences?
 1  Loop: LD   F0, 0(R1)
 2        ADDD F4, F0, F2
 3        SD   0(R1), F4
 4        SUBI R1, R1, 8
 5        BEQZ R1, exit
 6        LD   F0, 0(R1)
 7        ADDD F4, F0, F2
 8        SD   0(R1), F4
 9        SUBI R1, R1, 8
10        BEQZ R1, exit
11        LD   F0, 0(R1)
12        ADDD F4, F0, F2
13        SD   0(R1), F4
14        SUBI R1, R1, 8
15        BEQZ R1, exit
          ....
8
Data Dependences
 1  Loop: LD   F0, 0(R1)
 2        ADDD F4, F0, F2
 3        SD   0(R1), F4       ; drop SUBI & BNEZ
 4        LD   F0, 8(R1)
 5        ADDD F4, F0, F2
 6        SD   8(R1), F4       ; drop SUBI & BNEZ
 7        LD   F0, 16(R1)
 8        ADDD F4, F0, F2
 9        SD   16(R1), F4      ; drop SUBI & BNEZ
10        LD   F0, 24(R1)
11        ADDD F4, F0, F2
12        SD   24(R1), F4
13        SUBI R1, R1, 32      ; alter to 4*8
14        BNEZ R1, Loop
15        NOP
9
Name Dependences
 1  Loop: LD   F0, 0(R1)
 2        ADDD F4, F0, F2
 3        SD   0(R1), F4       ; drop SUBI & BNEZ
 4        LD   F6, 8(R1)
 5        ADDD F8, F6, F2
 6        SD   8(R1), F8       ; drop SUBI & BNEZ
 7        LD   F10, 16(R1)
 8        ADDD F12, F10, F2
 9        SD   16(R1), F12     ; drop SUBI & BNEZ
10        LD   F14, 24(R1)
11        ADDD F16, F14, F2
12        SD   24(R1), F16
13        SUBI R1, R1, 32      ; alter to 4*8
14        BNEZ R1, Loop
15        NOP
Register renaming
10
Unroll Loop Four Times
Rewrite loop to minimize stalls?
15 + 4 x (1 + 2) + 1 = 28 clock cycles to initiate,
or 7 per iteration. Assumes the iteration count is a multiple of 4
11
Unrolled Loop That Minimizes Stalls

What assumptions were made when we moved code?
OK to move store past SUBI even though SUBI
changes the register
OK to move loads before stores: do we get the right data?
When is it safe for compiler to do such changes?

Can we eliminate the remaining stall?
14 + 1 = 15 clock cycles, or 3.75 per iteration
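
The per-iteration figures quoted for the unrolled loop are simple ratios of cycles per trip to iterations per trip. A minimal sketch of that arithmetic follows; the helper name is hypothetical, the totals of 28 and 15 cycles are taken from the slides, and the function does not model the pipeline itself.

```python
def cycles_per_iteration(total_cycles, iterations_per_trip):
    """Clock cycles per original loop iteration for one trip of an unrolled loop."""
    return total_cycles / iterations_per_trip


# Unrolled four times, not rescheduled: 28 cycles per trip (figure from the slides).
print(cycles_per_iteration(28, 4))   # 7.0
# Unrolled four times, rescheduled: 15 cycles per trip (figure from the slides).
print(cycles_per_iteration(15, 4))   # 3.75
```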
12
Compiler Loop Unrolling

Most important: code correctness
Unrolling produces larger code that might
interfere with cache
Code sequence no longer fits in L1 cache
Cache to memory bandwidth might not be wide
enough
Compiler must understand hardware
Enough registers must be available OR
Compiler must rely on hardware register renaming
Compiler must understand the code
Determine that loop iterations are independent
Eliminate branch instructions while preserving
correctness
Determine that the LD and SD are independent over
the loop
Rescheduling of instructions and adjusting the
offsets

13
Superscalar Example

Superscalar
Our system can issue one floating point and one
other (non-floating point) instruction per cycle.
Instructions are dynamically scheduled from the
window
Unroll the loop 5 times and reschedule to
minimize cycles per iteration. (WHY?)
While the Integer/FP split is simple for the HW, we get
a CPI of 0.5 only for programs with
Exactly 50% FP operations
No hazards
If more instructions issue at the same time, decode and
issue become more difficult
Even 2-way superscalar => examine 2 opcodes, 6 register
specifiers, and decide if 1 or 2 instructions can
issue

14
Loop Unrolling in Superscalar

        Integer instruction    FP instruction        Clock cycle
Loop:   LD F0, 0(R1)                                 1
        LD F6, 8(R1)                                 2
        LD F10, 16(R1)         ADDD F4, F0, F2       3
        LD F14, 24(R1)         ADDD F8, F6, F2       4
        LD F18, 32(R1)         ADDD F12, F10, F2     5
        SD 0(R1), F4           ADDD F16, F14, F2     6
        SD 8(R1), F8           ADDD F20, F18, F2     7
        SD 16(R1), F12                               8
        SUBI R1, R1, 40                              9
        SD 16(R1), F16                               10
        BNEZ R1, Loop                                11
        SD 8(R1), F20                                12
Unrolled 5 times to avoid delays (+1 due to SS)
12 clocks to initiate, or 2.4 clocks per iteration

15
VLIW Example

VLIW
5 instructions in one very long instruction word.
2 FP, 2 Memory, 1 branch/integer
Compiler avoids hazards
Not all slots are always full
VLIW: trade instruction space for simple
decoding
The long instruction word has room for many
operations
By definition, all the operations the compiler
puts in the long instruction word are independent
=> execute in parallel
E.g., 2 integer operations, 2 FP ops, 2 memory
refs, 1 branch => 16 to 24 bits per field => 7*16
or 112 bits to 7*24 or 168 bits wide
Need compiling technique that schedules across
several branches

16
Loop Unrolling in VLIW

Memory reference 1   Memory reference 2   FP operation 1      FP operation 2      Int. op/branch    Clock
LD F0, 0(R1)         LD F6, 8(R1)                                                                   1
LD F10, 16(R1)       LD F14, 24(R1)                                                                 2
LD F18, 32(R1)       LD F22, 40(R1)       ADDD F4, F0, F2     ADDD F8, F6, F2                       3
LD F26, 48(R1)                            ADDD F12, F10, F2   ADDD F16, F14, F2                     4
                                          ADDD F20, F18, F2   ADDD F24, F22, F2                     5
SD 0(R1), F4         SD 8(R1), F8         ADDD F28, F26, F2                                         6
SD 16(R1), F12       SD 24(R1), F16                                                                 7
SD 32(R1), F20       SD 40(R1), F24                                               SUBI R1, R1, 48   8
SD 0(R1), F28                                                                     BNEZ R1, LOOP     9
Unrolled 7 times to avoid delays
9 clocks to initiate, or 1.3 clocks per iteration
Average 2.5 ops per clock, 50% efficiency
Note: need more registers in VLIW (15 vs. 6 in
SS)
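
The summary figures for the VLIW schedule can be rechecked by counting operations in the table. The sketch below assumes the operation counts read off that schedule (7 loads, 7 adds, 7 stores, 1 SUBI, 1 BNEZ) and the 5-slot instruction word from the previous slide; the function name is hypothetical.

```python
def vliw_stats(ops, clocks, slots_per_word, iterations):
    """Return (clocks per iteration, ops per clock, fraction of slots used)."""
    return (clocks / iterations,
            ops / clocks,
            ops / (clocks * slots_per_word))


ops = 7 + 7 + 7 + 1 + 1          # LD, ADDD, SD, SUBI, BNEZ
per_iter, per_clock, used = vliw_stats(ops, clocks=9, slots_per_word=5, iterations=7)
print(round(per_iter, 2), round(per_clock, 2), round(used, 2))   # 1.29 2.56 0.51
```

These values match the slide's rounded figures of 1.3 clocks per iteration, about 2.5 operations per clock, and roughly 50% slot usage.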

17
Limits to Multi-Issue Machines

Inherent limitations of instruction-level
parallelism
1 branch in 5: how to keep a 5-way VLIW busy?
Latencies of units: many operations must be
scheduled
Easy: more instruction bandwidth
Easy: duplicate functional units to get parallel
execution
Hard: increase ports to register file
(bandwidth)
VLIW example needs 7 reads and 3 writes for
integer registers, and 5 reads and 3
writes for FP registers
Harder: increase ports to memory (bandwidth)
Pipelines in lockstep:
one pipeline stall stalls all others to avoid
hazards

18
Limits to Multi-Issue Machines

Limitations specific to either superscalar or
VLIW implementation
Decode and issue in superscalar: how wide is
practical?
VLIW code size: unrolled loops + wasted fields in
VLIW
IA-64 compresses dependent instructions, but
still larger
VLIW lock step => 1 hazard and all instructions
stall
IA-64 not lock step? Dynamic pipeline?
VLIW and binary compatibility: IA-64 promises
binary compatibility

19
Dependences

Two instructions are parallel if they can execute
simultaneously in a pipeline without causing any
stalls (assuming no structural hazards) and can
be reordered (depending on code semantics)
Two instructions that are dependent are not
parallel and cannot be reordered
Types of dependences
Data dependences
Name dependences
Control dependences
Dependences are properties of programs
Hazards are properties of the pipeline
organization
Dependence indicates the potential for a hazard

20
Compiler Perspectives on Code Movement

Hard for memory accesses
Does 100(R4) = 20(R6)?
From different loop iterations, does 20(R6) =
20(R6)?
Our example required the compiler to know that if R1
doesn't change then
0(R1) != -8(R1) != -16(R1) != -24(R1)
There were no dependences between some loads and
stores, so they could be moved past each other

21
Detecting Loop Level Dependences
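for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}

Loop-carried dependence
S1 relies on the S2 of the previous iteration
The dependence is not circular (S2 does not depend on S1),
so the loop can be rewritten; consider:

A[1] = A[1] + B[1];
for (i = 1; i <= 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];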
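
A quick way to convince oneself that the rewritten loop is equivalent is to run both forms on the same data. The sketch below does this in Python as executable pseudocode; the function names and test data are assumptions, and the arrays are sized so that indices 1 to 101 are valid.

```python
def original(A, B, C, D):
    """Loop from the slide: S1 reads the B value written by S2 one iteration earlier."""
    A, B = list(A), list(B)
    for i in range(1, 101):
        A[i] = A[i] + B[i]        # S1
        B[i + 1] = C[i] + D[i]    # S2
    return A, B


def transformed(A, B, C, D):
    """Rewritten loop: the dependence now stays inside a single iteration."""
    A, B = list(A), list(B)
    A[1] = A[1] + B[1]                    # first S1, peeled off
    for i in range(1, 100):
        B[i + 1] = C[i] + D[i]            # S2 of original iteration i
        A[i + 1] = A[i + 1] + B[i + 1]    # S1 of original iteration i + 1
    B[101] = C[100] + D[100]              # last S2, peeled off
    return A, B


if __name__ == "__main__":
    A = [3 * i for i in range(102)]
    B = [i * i for i in range(102)]
    C = [i + 7 for i in range(102)]
    D = [5 - i for i in range(102)]
    assert original(A, B, C, D) == transformed(A, B, C, D)
```

In the rewritten loop no statement reads a value produced by an earlier iteration, so its iterations can be overlapped or unrolled freely.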
22
Dependence Distance
for (i = 6; i <= 100; i = i + 1) Y[i] = Y[i-5] + Y[i];

Loop carried dependence in the form of a
recurrence of Y
Dependence distance of 5
Higher dependence distance allows for more ILP
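
To make the last point concrete: with a dependence distance of 5, any five consecutive iterations read only values produced before the group started, so they can be evaluated together. The sketch below checks this in Python; the function names and the `group_size` parameter are assumptions for illustration.

```python
def recurrence_sequential(Y):
    """Reference: Y[i] = Y[i-5] + Y[i] for i = 6..100, one iteration at a time."""
    Y = list(Y)
    for i in range(6, 101):
        Y[i] = Y[i - 5] + Y[i]
    return Y


def recurrence_in_groups(Y, group_size=5):
    """Evaluate group_size consecutive iterations as if they ran at the same time."""
    Y = list(Y)
    for start in range(6, 101, group_size):
        group = range(start, min(start + group_size, 101))
        # Compute every result in the group before writing any of them back.
        results = [Y[i - 5] + Y[i] for i in group]
        for i, value in zip(group, results):
            Y[i] = value
    return Y


if __name__ == "__main__":
    Y = list(range(101))
    # Groups of 5 stay within the dependence distance: same answer.
    assert recurrence_in_groups(Y, 5) == recurrence_sequential(Y)
    # Groups of 6 exceed it: iteration start+5 reads a stale Y[start].
    assert recurrence_in_groups(Y, 6) != recurrence_sequential(Y)
```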

23
Greatest Common Divisor test

Affine array indices
All array indices DIRECTLY depend on loop
variable i (each has the form a*i + b)
Assume the code has these properties:
the for loop runs from n to m with index i
the loop has the access pattern X[a*i + b] = X[c*i + d]
there are two values of i, j and k, both between n and m
a store indexed by j and a later load indexed by k,
with a*j + b = c*k + d
If a loop-carried dependence exists, then GCD(c, a)
must divide (d - b)

Example:
for (i = 1; i <= 100; i = i + 1) X[2*i + 3] = X[2*i] * 5.0;
a = 2, b = 3, c = 2, d = 0: GCD(a, c) = 2 and d - b = -3
There is no loop-carried dependence, since 2 does not
divide -3
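
The test on this slide is easy to express in code. The sketch below is a minimal Python version; the function name and the handling of the case where both coefficients are zero are assumptions. The test ignores the loop bounds, so a result of True only means a dependence cannot be ruled out.

```python
from math import gcd


def gcd_test_may_depend(a, b, c, d):
    """GCD test for a store to X[a*i + b] and a load from X[c*i + d].

    Returns False when a loop-carried dependence is impossible,
    True when the test cannot rule one out.
    """
    g = gcd(a, c)
    if g == 0:                # both coefficients zero: same element only if b == d
        return b == d
    return (d - b) % g == 0


# Slide example: a = 2, b = 3, c = 2, d = 0, and 2 does not divide -3.
print(gcd_test_may_depend(2, 3, 2, 0))   # False
```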
24
Problem Cases

Reference by pointers instead of array indices
(partly eliminated by strict type checking)
Sparse arrays with indexing through other arrays
(similar to pointers)
When a dependence exists for values of the
indices but those values are never reached
The loop-carried dependence has a distance far
greater than what loop-unrolling would cover

25
Software Pipelining

Observation: if the iterations of a loop are
independent, we can get more ILP by taking
instructions from different iterations
Software pipelining reorganizes loops so that
each iteration is made from instructions chosen
from different iterations of the original loop

[Figure: software-pipelined iteration]
26
SW Pipelining Example
Unrolled loop                    Software-pipelined loop
 1  LD   F0, 0(R1)                   LD   F0, 0(R1)
 2  ADDD F4, F0, F2                  ADDD F4, F0, F2
 3  SD   0(R1), F4                   LD   F0, 8(R1)
 4  LD   F6, 8(R1)                1  SD   0(R1), F4      ; stores M[i]
 5  ADDD F8, F6, F2               2  ADDD F4, F0, F2     ; adds to M[i-1]
 6  SD   8(R1), F8                3  LD   F0, 16(R1)     ; loads M[i-2]
 7  LD   F10, 16(R1)              4  SUBI R1, R1, 8
 8  ADDD F12, F10, F2             5  BNEZ R1, LOOP
 9  SD   16(R1), F12                 SD   0(R1), F4
10  SUBI R1, R1, 24                  ADDD F4, F0, F2
11  BNEZ R1, LOOP                    SD   8(R1), F4

SD     IF  ID  EX  Mem WB              ; reads F4
ADDD       IF  ID  EX  Mem WB          ; reads F0, writes F4
LD             IF  ID  EX  Mem WB      ; writes F0
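
The structure of the software-pipelined loop (start-up code, a kernel that mixes three iterations, and finish-up code) can be sketched in Python as executable pseudocode. The variable names `f0` and `f4` mirror the registers in the example; the function names are hypothetical, and the sketch walks the array upward by index, which is an assumption made for readability rather than a copy of the slide's addressing.

```python
def add_scalar_simple(M, s):
    """Reference: each iteration loads, adds and stores one element."""
    out = list(M)
    for i in range(len(out)):
        loaded = out[i]          # LD
        summed = loaded + s      # ADDD
        out[i] = summed          # SD
    return out


def add_scalar_pipelined(M, s):
    """Software-pipelined form: the kernel stores i, adds i + 1, loads i + 2."""
    out = list(M)
    n = len(out)
    if n < 2:
        return add_scalar_simple(out, s)   # too short to pipeline
    # Start-up code: fill the pipeline.
    f0 = out[0]            # LD   element 0
    f4 = f0 + s            # ADDD element 0
    f0 = out[1]            # LD   element 1
    # Kernel: one store, one add, one load, each from a different iteration.
    for i in range(n - 2):
        out[i] = f4        # SD   element i
        f4 = f0 + s        # ADDD element i + 1
        f0 = out[i + 2]    # LD   element i + 2
    # Finish-up code: drain the pipeline.
    out[n - 2] = f4        # SD   element n - 2
    f4 = f0 + s            # ADDD element n - 1
    out[n - 1] = f4        # SD   element n - 1
    return out


if __name__ == "__main__":
    data = [float(v) for v in range(10)]
    assert add_scalar_pipelined(data, 2.5) == add_scalar_simple(data, 2.5)
```

Within the kernel, the store, add and load belong to different original iterations, so none of them waits on a result produced in the same kernel pass.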
27
SW Pipelining Example

Symbolic Loop Unrolling
Smaller code space
Overhead paid only once vs. each iteration in
loop unrolling
100 iterations = 25 loops with 4 unrolled
iterations each

[Figure: number of overlapped operations versus time for (a) software pipelining and (b) loop unrolling]
