Title: Computer Architecture
Slide 1: Computer Architecture
- Chapter 4
- Instruction-Level Parallelism - 3
- Prof. Jerry Breecher
- CS 240
- Fall 2003
Slide 2: Chapter Overview
- 4.1 Compiler Techniques for Exposing ILP
- 4.2 Static Branch Prediction
- 4.3 Static Multiple Issue VLIW
- 4.4 Advanced Compiler Support for ILP
- 4.5 Hardware Support for Exposing more
Parallelism
Slide 3: Ideas To Reduce Stalls
[Table contrasting the Chapter 3 (hardware) techniques with the Chapter 4 (software) techniques; not reproduced in this text version.]
Slide 4: Instruction Level Parallelism
- 4.1 Compiler Techniques for Exposing ILP
- 4.3 Static Multiple Issue VLIW
- 4.4 Advanced Compiler Support for ILP
- 4.5 Hardware Support for Exposing more
Parallelism
How can compilers recognize and take advantage of
ILP?
Slide 5: Simple Loop and Its Assembler Equivalent
Compilers and ILP
Pipeline Scheduling and Loop Unrolling
This is a clean and simple example!

    for (i = 1; i <= 1000; i++)
        x[i] = x[i] + s;

    Loop: LD   F0,0(R1)   ; F0 = vector element
          ADDD F4,F0,F2   ; add scalar from F2
          SD   0(R1),F4   ; store result
          SUBI R1,R1,#8   ; decrement pointer, 8 bytes (DW)
          BNEZ R1,Loop    ; branch if R1 != zero
          NOP             ; delayed branch slot
Slide 6: FP Loop Hazards
Compilers and ILP
Pipeline Scheduling and Loop Unrolling

    Loop: LD   F0,0(R1)   ; F0 = vector element
          ADDD F4,F0,F2   ; add scalar in F2
          SD   0(R1),F4   ; store result
          SUBI R1,R1,#8   ; decrement pointer, 8 bytes (DW)
          BNEZ R1,Loop    ; branch if R1 != zero
          NOP             ; delayed branch slot

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
    Integer op                     Integer op                 0

Where are the stalls?
Slide 7: FP Loop Showing Stalls
Compilers and ILP
Pipeline Scheduling and Loop Unrolling

    1  Loop: LD   F0,0(R1)   ; F0 = vector element
    2        stall
    3        ADDD F4,F0,F2   ; add scalar in F2
    4        stall
    5        stall
    6        SD   0(R1),F4   ; store result
    7        SUBI R1,R1,#8   ; decrement pointer, 8 bytes (DW)
    8        stall
    9        BNEZ R1,Loop    ; branch if R1 != zero
    10       stall           ; delayed branch slot

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
    Integer op                     Integer op                 0

- 10 clocks per iteration. Can we rewrite the code to minimize the stalls?
Slide 8: Scheduled FP Loop Minimizing Stalls
Compilers and ILP
Pipeline Scheduling and Loop Unrolling

    1  Loop: LD   F0,0(R1)
    2        SUBI R1,R1,#8
    3        ADDD F4,F0,F2
    4        stall
    5        BNEZ R1,Loop    ; delayed branch
    6        SD   8(R1),F4   ; offset altered when moved past SUBI

The stall remains because SD can't proceed any sooner.
We swap BNEZ and SD by changing the address used by SD.

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1

- Now 6 clocks per iteration. Next, unroll the loop 4 times to make it faster.
Slide 9: Unroll Loop Four Times (straightforward way)
Compilers and ILP
Pipeline Scheduling and Loop Unrolling

    1  Loop: LD   F0,0(R1)
    2        stall
    3        ADDD F4,F0,F2
    4        stall
    5        stall
    6        SD   0(R1),F4
    7        LD   F6,-8(R1)
    8        stall
    9        ADDD F8,F6,F2
    10       stall
    11       stall
    12       SD   -8(R1),F8
    13       LD   F10,-16(R1)
    14       stall
    15       ADDD F12,F10,F2
    16       stall
    17       stall
    18       SD   -16(R1),F12
    19       LD   F14,-24(R1)
    20       stall
    21       ADDD F16,F14,F2
    22       stall
    23       stall
    24       SD   -24(R1),F16
    25       SUBI R1,R1,#32
    26       BNEZ R1,LOOP
    27       stall
    28       NOP

15 instructions + 4 x (1 + 2) load/ADDD stalls + 1 SUBI stall = 28 clock cycles, or 7 per iteration. This assumes the number of loop iterations is a multiple of 4.
- Rewrite the loop to minimize stalls.
Slide 10: Unrolled Loop That Minimizes Stalls
Compilers and ILP
Pipeline Scheduling and Loop Unrolling
- What assumptions were made when we moved the code?
  - It is OK to move the store past SUBI even though SUBI changes a register the store uses.
  - It is OK to move the loads before the stores; do we still get the right data?
- When is it safe for the compiler to make such changes?

    1  Loop: LD   F0,0(R1)
    2        LD   F6,-8(R1)
    3        LD   F10,-16(R1)
    4        LD   F14,-24(R1)
    5        ADDD F4,F0,F2
    6        ADDD F8,F6,F2
    7        ADDD F12,F10,F2
    8        ADDD F16,F14,F2
    9        SD   0(R1),F4
    10       SD   -8(R1),F8
    11       SD   -16(R1),F12
    12       SUBI R1,R1,#32
    13       BNEZ R1,LOOP
    14       SD   8(R1),F16   ; 8 - 32 = -24

14 clock cycles, or 3.5 per iteration.
No stalls!
Slide 11: Summary of Loop Unrolling Example
Compilers and ILP
Pipeline Scheduling and Loop Unrolling
- Determine that it was legal to move the SD after the SUBI and BNEZ, and find the amount to adjust the SD offset.
- Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for the loop maintenance code.
- Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations.
- Eliminate the extra tests and branches and adjust the loop maintenance code.
- Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This requires analyzing the memory addresses and finding that they do not refer to the same address.
- Schedule the code, preserving any dependences needed to yield the same result as the original code.
Slide 12: Compiler Perspectives on Code Movement
Compilers and ILP
Dependencies
- The compiler is concerned about dependencies in the program; it is not concerned with whether a HW hazard occurs on a given pipeline.
- It tries to schedule code to avoid hazards.
- It looks for data dependencies (RAW if a hazard for HW):
  - Instruction i produces a result used by instruction j, or
  - Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
- If instructions are dependent, they can't execute in parallel.
- Dependence is easy to determine for registers (fixed names), but hard for memory:
  - Does 100(R4) = 20(R6)?
  - From different loop iterations, does 20(R6) = 20(R6)?
Slide 13: Compiler Perspectives on Code Movement
Compilers and ILP
Data Dependencies
Where are the data dependencies?

    1 Loop: LD   F0,0(R1)
    2       ADDD F4,F0,F2
    3       SUBI R1,R1,#8
    4       BNEZ R1,Loop   ; delayed branch
    5       SD   8(R1),F4  ; offset altered when moved past SUBI
Slide 14: Compiler Perspectives on Code Movement
Compilers and ILP
Name Dependencies
- Another kind of dependence is called a name dependence: two instructions use the same name (register or memory location) but don't exchange data.
- Anti-dependence (WAR if a hazard for HW): instruction j writes a register or memory location that instruction i reads, and instruction i is executed first.
- Output dependence (WAW if a hazard for HW): instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved.
Slide 15: Compiler Perspectives on Code Movement
Compilers and ILP
Name Dependencies

    1  Loop: LD   F0,0(R1)
    2        ADDD F4,F0,F2
    3        SD   0(R1),F4
    4        LD   F0,-8(R1)
    5        ADDD F4,F0,F2
    6        SD   -8(R1),F4
    7        LD   F0,-16(R1)
    8        ADDD F4,F0,F2
    9        SD   -16(R1),F4
    10       LD   F0,-24(R1)
    11       ADDD F4,F0,F2
    12       SD   -24(R1),F4
    13       SUBI R1,R1,#32
    14       BNEZ R1,LOOP
    15       NOP

Where are the name dependencies? How can we remove them?
No data is passed in F0, but we can't reuse F0 in instruction 4.
Slide 16: Where Are the Name Dependencies?
Compilers and ILP
Name Dependencies
Compiler Perspectives on Code Movement

    1  Loop: LD   F0,0(R1)
    2        ADDD F4,F0,F2
    3        SD   0(R1),F4
    4        LD   F6,-8(R1)
    5        ADDD F8,F6,F2
    6        SD   -8(R1),F8
    7        LD   F10,-16(R1)
    8        ADDD F12,F10,F2
    9        SD   -16(R1),F12
    10       LD   F14,-24(R1)
    11       ADDD F16,F14,F2
    12       SD   -24(R1),F16
    13       SUBI R1,R1,#32
    14       BNEZ R1,LOOP
    15       NOP

This is called register renaming.
Now there are data dependencies only: F0 exists only in instructions 1 and 2.
Slide 17: Compiler Perspectives on Code Movement
Compilers and ILP
Name Dependencies
- Again, name dependencies are hard to determine for memory accesses:
  - Does 100(R4) = 20(R6)?
  - From different loop iterations, does 20(R6) = 20(R6)?
- Our example required the compiler to know that if R1 doesn't change, then 0(R1) != -8(R1) != -16(R1) != -24(R1).
- There were no dependencies between some loads and stores, so they could be moved around each other.
Slide 18: Compilers and ILP
Control Dependencies
Compiler Perspectives on Code Movement
- The final kind of dependence is called control dependence.
- Example:

    if (p1) { S1; }
    if (p2) { S2; }

- S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Slide 19: Compilers and ILP
Control Dependencies
Compiler Perspectives on Code Movement
- Two (obvious) constraints on control dependences:
  - An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch.
  - An instruction that is not control dependent on a branch cannot be moved to after the branch, so that its execution is controlled by the branch.
- Control dependencies can be relaxed to get parallelism: we get the same effect if we preserve the order of exceptions (an address in a register is checked by a branch before use) and the data flow (a value in a register depends on the branch).
Slide 20: Where Are the Control Dependencies?
Compilers and ILP
Control Dependencies
Compiler Perspectives on Code Movement

    1  Loop: LD   F0,0(R1)
    2        ADDD F4,F0,F2
    3        SD   0(R1),F4
    4        SUBI R1,R1,#8
    5        BEQZ R1,exit
    6        LD   F0,0(R1)
    7        ADDD F4,F0,F2
    8        SD   0(R1),F4
    9        SUBI R1,R1,#8
    10       BEQZ R1,exit
    11       LD   F0,0(R1)
    12       ADDD F4,F0,F2
    13       SD   0(R1),F4
    14       SUBI R1,R1,#8
    15       BEQZ R1,exit
         ....
Slide 21: When Is It Safe to Unroll a Loop?
Compilers and ILP
Loop Level Parallelism
- Example: Where are the data dependencies? (A, B, C are distinct and non-overlapping.)

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }

- 1. S2 uses the value A[i+1] computed by S1 in the same iteration.
- 2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1]. This is a loop-carried dependence between iterations.
- This implies that the iterations are dependent and can't be executed in parallel.
- Note the case for our prior example: each iteration was distinct.
Slide 22: When Is It Safe to Unroll a Loop?
Compilers and ILP
Loop Level Parallelism
- Example: Where are the data dependencies? (A, B, C, D are distinct and non-overlapping.)

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + B[i];    /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

- 1. There is no dependence from S1 to S2. If there were, there would be a cycle in the dependencies and the loop would not be parallel. Since this other dependence is absent, interchanging the two statements will not affect the execution of S2.
- 2. On the first iteration of the loop, statement S1 depends on the value of B[1] computed prior to initiating the loop.
Slide 23: Now Safe to Unroll Loop? (p. 240)
Compilers and ILP
Loop Level Parallelism

OLD (the loop carried a dependence on B):

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + B[i];    /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

There are no circular dependencies, so the loop can be transformed.

NEW (the loop-carried dependence has been eliminated):

    A[1] = A[1] + B[1];
    for (i = 1; i <= 99; i = i + 1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];
Slide 24: Example 1 (There are NO dependencies)
Compilers and ILP
Loop Level Parallelism

    /*
     * This is the example on page 305 of Hennessy & Patterson,
     * but running on an Intel machine.
     */
    #define MAX  1000
    #define ITER 100000

    int main( int argc, char *argv[] )
    {
        double x[MAX + 2];
        double s = 3.14159;
        int i, j;

        for ( i = MAX; i > 0; i-- )   /* Init array */
            x[i] = 0;

        for ( j = ITER; j > 0; j-- )
            for ( i = MAX; i > 0; i-- )
                x[i] = x[i] + s;
    }
Slide 25: Example 1 (Compilers and ILP, Loop Level Parallelism)

This is the ICC optimized code (elapsed seconds: 0.122848):

    .L2:
        fstpl   8(%esp,%edx,8)
        fldl    (%esp,%edx,8)
        fadd    %st(1), %st
        fldl    -8(%esp,%edx,8)
        fldl    -16(%esp,%edx,8)
        fldl    -24(%esp,%edx,8)
        fldl    -32(%esp,%edx,8)
        fxch    %st(4)
        fstpl   (%esp,%edx,8)
        fxch    %st(2)
        fadd    %st(4), %st
        fstpl   -8(%esp,%edx,8)
        fadd    %st(3), %st
        fstpl   -16(%esp,%edx,8)
        fadd    %st(2), %st
        fstpl   -24(%esp,%edx,8)
        fadd    %st(1), %st
        addl    $-5, %edx
        testl   %edx, %edx
        jg      .L2            # Prob 99%
        fstpl   8(%esp,%edx,8)

This is the GCC optimized code (elapsed seconds: 0.590026):

    .L15:
        fldl    (%ecx,%eax)
        fadd    %st(1), %st
        decl    %edx
        fstpl   (%ecx,%eax)
        addl    $-8, %eax
        testl   %edx, %edx
        jg      .L15
Slide 26: Example 2
Compilers and ILP
Loop Level Parallelism

    // Example on Page 320
    get_current_time( &start_time );
    for ( j = ITER; j > 0; j-- )
    {
        for ( i = 1; i < MAX; i++ )
        {
            A[i+1] = A[i] + C[i];
            B[i+1] = B[i] + A[i+1];
        }
    }
    get_current_time( &end_time );

There are two dependencies here; what are they?
Slide 27: Example 2 (Compilers and ILP, Loop Level Parallelism)

This is the ICC optimized code (elapsed seconds: 0.664073):

    .L4:
        fstpl   25368(%esp,%edx,8)
        fldl    8472(%esp,%edx,8)
        faddl   16920(%esp,%edx,8)
        fldl    25368(%esp,%edx,8)
        fldl    16928(%esp,%edx,8)
        fxch    %st(2)
        fstl    8480(%esp,%edx,8)
        fadd    %st, %st(1)
        fxch    %st(1)
        fstl    25376(%esp,%edx,8)
        fxch    %st(2)
        faddp   %st, %st(1)
        fstl    8488(%esp,%edx,8)
        faddp   %st, %st(1)
        addl    $2, %edx
        cmpl    $1000, %edx
        jle     .L4            # Prob 99%
        fstpl   25368(%esp,%edx,8)

This is the GCC optimized code (elapsed seconds: 1.357084):

    .L55:
        fldl    -8(%esi,%eax)
        faddl   -8(%edi,%eax)
        fstl    (%esi,%eax)
        faddl   -8(%ecx,%eax)
        incl    %edx
        fstpl   (%ecx,%eax)
        addl    $8, %eax
        cmpl    $1000, %edx
        jle     .L55

This is the Microsoft optimized code:

    $L1225:
        fld     QWORD PTR _C$[esp+eax+40108]
        add     eax, 8
        cmp     eax, 7992
        fadd    QWORD PTR _A$[esp+eax+40100]
        fst     QWORD PTR _A$[esp+eax+40108]
        fadd    QWORD PTR _B$[esp+eax+40100]
        fstp    QWORD PTR _B$[esp+eax+40108]
        jle     $L1225
Slide 28: Example 3
Compilers and ILP
Loop Level Parallelism

    // Example on Page 321
    get_current_time( &start_time );
    for ( j = ITER; j > 0; j-- )
    {
        for ( i = 1; i < MAX; i++ )
        {
            A[i] = A[i] + B[i];
            B[i+1] = C[i] + D[i];
        }
    }
    get_current_time( &end_time );

What are the dependencies here?
Slide 29: Example 3 (Compilers and ILP, Loop Level Parallelism)

This is the ICC optimized code (elapsed seconds: 0.325419):

    .L6:
        fstpl   8464(%esp,%edx,8)
        fldl    8472(%esp,%edx,8)
        faddl   25368(%esp,%edx,8)
        fldl    16920(%esp,%edx,8)
        faddl   33824(%esp,%edx,8)
        fldl    8480(%esp,%edx,8)
        fldl    16928(%esp,%edx,8)
        faddl   33832(%esp,%edx,8)
        fxch    %st(3)
        fstpl   8472(%esp,%edx,8)
        fxch    %st(1)
        fstl    25376(%esp,%edx,8)
        fxch    %st(2)
        fstpl   25384(%esp,%edx,8)
        faddp   %st, %st(1)
        addl    $2, %edx
        cmpl    $1000, %edx
        jle     .L6            # Prob 99%
        fstpl   8464(%esp,%edx,8)

This is the GCC optimized code (elapsed seconds: 1.370478):

    .L65:
        fldl    (%esi,%eax)
        faddl   (%ecx,%eax)
        fstpl   (%esi,%eax)
        movl    -40100(%ebp), %edi
        fldl    (%edi,%eax)
        movl    -40136(%ebp), %edi
        faddl   (%edi,%eax)
        incl    %edx
        fstpl   8(%ecx,%eax)
        addl    $8, %eax
        cmpl    $1000, %edx
        jle     .L65
Slide 30: Example 4
Compilers and ILP
Loop Level Parallelism

    // Example on Page 322
    get_current_time( &start_time );
    for ( j = ITER; j > 0; j-- )
    {
        A[1] = A[1] + B[1];
        for ( i = 1; i < MAX - 1; i++ )
        {
            B[i+1] = C[i] + D[i];
            A[i+1] = A[i+1] + B[i+1];
        }
        B[101] = C[100] + D[100];
    }
    get_current_time( &end_time );

Elapsed seconds: 1.200525
How many dependencies are here?
Slide 31: Example 4 (Compilers and ILP, Loop Level Parallelism)

This is the GCC optimized code (elapsed seconds: 1.200525):

    .L75:
        movl    -40136(%ebp), %edi
        fldl    -8(%edi,%eax)
        faddl   -8(%esi,%eax)
        movl    -40104(%ebp), %edi
        fstl    (%edi,%eax)
        faddl   (%ecx,%eax)
        incl    %edx
        fstpl   (%ecx,%eax)
        addl    $8, %eax
        cmpl    $999, %edx
        jle     .L75

This is the Microsoft optimized code:

    $L1239:
        fld     QWORD PTR _D$[esp+eax+40108]
        add     eax, 8
        cmp     eax, 7984          ; 00001f30H
        fadd    QWORD PTR _C$[esp+eax+40100]
        fst     QWORD PTR _B$[esp+eax+40108]
        fadd    QWORD PTR _A$[esp+eax+40108]
        fstp    QWORD PTR _A$[esp+eax+40108]
        jle     SHORT $L1239
Slide 32: Example 4 (Compilers and ILP, Loop Level Parallelism)

This is the ICC optimized code (elapsed seconds: 0.359232):

    .L8:
        fstpl   8472(%esp,%edx,8)
        fldl    16920(%esp,%edx,8)
        faddl   33824(%esp,%edx,8)
        fldl    8480(%esp,%edx,8)
        fldl    16928(%esp,%edx,8)
        faddl   33832(%esp,%edx,8)
        fldl    8488(%esp,%edx,8)
        fldl    16936(%esp,%edx,8)
        faddl   33840(%esp,%edx,8)
        fldl    8496(%esp,%edx,8)
        fxch    %st(5)

    (continued)

        fstl    25376(%esp,%edx,8)
        fxch    %st(3)
        fstl    25384(%esp,%edx,8)
        fxch    %st(1)
        fstl    25392(%esp,%edx,8)
        fxch    %st(3)
        faddp   %st, %st(4)
        fxch    %st(3)
        fstpl   8480(%esp,%edx,8)
        faddp   %st, %st(2)
        fxch    %st(1)
        fstpl   8488(%esp,%edx,8)
        faddp   %st, %st(1)
        addl    $3, %edx
        cmpl    $999, %edx
        jle     .L8
        fstpl   8472(%esp,%edx,8)
Slide 33: Static Multiple Issue
Multiple issue is the ability of the processor to start more than one instruction in a given cycle.
Flavor I: Superscalar processors issue a varying number of instructions per clock (1 to 8), scheduled either statically (by the compiler) or dynamically (by the hardware, e.g. Tomasulo). Examples: IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000.
- 4.1 Compiler Techniques for Exposing ILP
- 4.3 Static Multiple Issue VLIW
- 4.4 Advanced Compiler Support for ILP
- 4.5 Hardware Support for Exposing more
Parallelism
Slide 34: Issuing Multiple Instructions/Cycle
Multiple Issue
- Flavor II:
- VLIW (Very Long Instruction Word) issues a fixed number of instructions, formatted either as one very large instruction or as a fixed packet of smaller instructions.
- A fixed number of instructions (4-16) are scheduled by the compiler, which puts the operations into wide templates.
- Joint HP/Intel agreement in 1999/2000:
  - Intel Architecture-64 (IA-64), a 64-bit address architecture
  - Style: Explicitly Parallel Instruction Computer (EPIC)
Slide 35: Issuing Multiple Instructions/Cycle
Multiple Issue
- Flavor II, continued:
- 3 instructions in 128-bit groups; a field determines whether the instructions are dependent or independent.
  - Smaller code size than old VLIW, larger than x86/RISC.
  - Groups can be linked to show independence of more than 3 instructions.
- 64 integer registers + 64 floating point registers.
  - Not separate register files per functional unit as in old VLIW.
- Hardware checks dependencies (interlocks, hence binary compatibility over time).
- Predicated execution (select 1 out of 64 1-bit flags): perhaps 40% fewer mispredictions?
- IA-64 is the name of the instruction set architecture; EPIC is the style. Merced was the name of the first implementation (1999/2000?).
Slide 36: Issuing Multiple Instructions/Cycle
Multiple Issue
A SuperScalar Version of MIPS
- In our MIPS example, we can handle 2 instructions/cycle:
  - one floating point instruction, and
  - one of anything else.
- Fetch 64 bits per clock cycle: the integer instruction on the left, the FP instruction on the right.
- Can only issue the 2nd instruction if the 1st instruction issues.
- Need more ports on the FP registers to do an FP load and an FP op in a pair.

    Type               Pipe stages
    Int. instruction   IF ID EX MEM WB
    FP instruction     IF ID EX MEM WB
    Int. instruction      IF ID EX MEM WB
    FP instruction        IF ID EX MEM WB
    Int. instruction         IF ID EX MEM WB
    FP instruction           IF ID EX MEM WB

- A 1-cycle load delay now delays 3 instructions in the superscalar: the instruction in the right half of the same issue pair can't use the result, nor can the instructions in the next slot.
Slide 37: Unrolled Loop Minimizes Stalls for Scalar
Multiple Issue
A SuperScalar Version of MIPS

    1  Loop: LD   F0,0(R1)
    2        LD   F6,-8(R1)
    3        LD   F10,-16(R1)
    4        LD   F14,-24(R1)
    5        ADDD F4,F0,F2
    6        ADDD F8,F6,F2
    7        ADDD F12,F10,F2
    8        ADDD F16,F14,F2
    9        SD   0(R1),F4
    10       SD   -8(R1),F8
    11       SD   -16(R1),F12
    12       SUBI R1,R1,#32
    13       BNEZ R1,LOOP
    14       SD   8(R1),F16   ; 8 - 32 = -24

14 clock cycles, or 3.5 per iteration.
Latencies: LD to ADDD is 1 cycle; ADDD to SD is 2 cycles.
Slide 38: Loop Unrolling in Superscalar
Multiple Issue
A SuperScalar Version of MIPS

    Integer instruction     FP instruction      Clock cycle
    Loop: LD F0,0(R1)                           1
          LD F6,-8(R1)                          2
          LD F10,-16(R1)    ADDD F4,F0,F2       3
          LD F14,-24(R1)    ADDD F8,F6,F2       4
          LD F18,-32(R1)    ADDD F12,F10,F2     5
          SD 0(R1),F4       ADDD F16,F14,F2     6
          SD -8(R1),F8      ADDD F20,F18,F2     7
          SD -16(R1),F12                        8
          SD -24(R1),F16                        9
          SUBI R1,R1,#40                        10
          BNEZ R1,LOOP                          11
          SD 8(R1),F20                          12

- Unrolled 5 times to avoid delays (1 extra time due to the superscalar issue pattern)
- 12 clocks, or 2.4 clocks per iteration
Slide 39: Dynamic Scheduling in Superscalar
Multiple Issue
Multiple Instruction Issue + Dynamic Scheduling
- Code compiled for the scalar version will run poorly on the superscalar.
- We may want the code to vary depending on how superscalar the machine is.
- Simple approach: separate Tomasulo control, with separate reservation stations for the integer FU/registers and for the FP FU/registers.
Slide 40: Dynamic Scheduling in Superscalar
Multiple Issue
Multiple Instruction Issue + Dynamic Scheduling
- How do we issue two instructions and keep in-order instruction issue for Tomasulo?
- Issue at 2X the clock rate, so that issue remains in order.
- Only FP loads might cause a dependency between integer and FP issue:
  - Replace the load reservation station with a load queue; operands must be read in the order they are fetched.
  - A load checks addresses in the store queue to avoid a RAW violation.
  - A store checks addresses in the load queue to avoid WAR and WAW violations.
Slide 41: Performance of Dynamic Superscalar
Multiple Issue
Multiple Instruction Issue + Dynamic Scheduling

    Iteration  Instruction     Issues  Executes  Writes result
    no.                            (clock-cycle number)
    1          LD   F0,0(R1)   1       2         4
    1          ADDD F4,F0,F2   1       5         8
    1          SD   0(R1),F4   2       9
    1          SUBI R1,R1,#8   3       4         5
    1          BNEZ R1,LOOP    4       5
    2          LD   F0,0(R1)   5       6         8
    2          ADDD F4,F0,F2   5       9         12
    2          SD   0(R1),F4   6       13
    2          SUBI R1,R1,#8   7       8         9
    2          BNEZ R1,LOOP    8       9

- 4 clocks per iteration
- Branches and decrements still take 1 clock cycle
Slide 42: Loop Unrolling in VLIW
Multiple Issue
VLIW

    Memory          Memory          FP              FP              Int. op/        Clock
    reference 1     reference 2     operation 1     operation 2     branch
    LD F0,0(R1)     LD F6,-8(R1)                                                    1
    LD F10,-16(R1)  LD F14,-24(R1)                                                  2
    LD F18,-32(R1)  LD F22,-40(R1)  ADDD F4,F0,F2   ADDD F8,F6,F2                   3
    LD F26,-48(R1)                  ADDD F12,F10,F2 ADDD F16,F14,F2                 4
                                    ADDD F20,F18,F2 ADDD F24,F22,F2                 5
    SD 0(R1),F4     SD -8(R1),F8    ADDD F28,F26,F2                                 6
    SD -16(R1),F12  SD -24(R1),F16                                                  7
    SD -32(R1),F20  SD -40(R1),F24                                  SUBI R1,R1,#48  8
    SD -0(R1),F28                                                   BNEZ R1,LOOP    9

- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per iteration
- Need more registers to use VLIW effectively
Slide 43: Limits to Multi-Issue Machines
Multiple Issue
Limitations With Multiple Issue
- Inherent limitations of ILP:
  - With 1 branch in 5 instructions, how do we keep a 5-way VLIW busy?
  - The latencies of the units mean many operations must be scheduled.
  - Need about (pipeline depth x number of functional units) independent operations to keep the machine busy.
- Difficulties in building the HW:
  - Duplicate functional units to get parallel execution.
  - More ports on the register file (the VLIW example needs 6 read and 3 write ports for the integer registers, and 6 read and 4 write ports for the FP registers).
  - More ports to memory.
  - Decoding for SS and its impact on clock rate and pipeline depth.
Slide 44: Limits to Multi-Issue Machines
Multiple Issue
Limitations With Multiple Issue
- Limitations specific to either an SS or VLIW implementation:
  - Decode/issue complexity in SS.
  - VLIW code size: unrolled loops plus wasted fields in the VLIW word.
  - VLIW lock step: 1 hazard stalls all instructions.
  - VLIW binary compatibility.
Slide 45: Multiple Issue Challenges
Multiple Issue
Limitations With Multiple Issue
- While the integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
  - exactly 50% FP operations, and
  - no hazards.
- If more instructions issue at the same time, decode and issue become more difficult:
  - Even a 2-scalar machine must examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue.
- VLIW trades instruction space for simple decoding:
  - The long instruction word has room for many operations.
  - By definition, all the operations the compiler puts in the long instruction word are independent, so they execute in parallel.
  - E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch.
  - At 16 to 24 bits per field, that is 7 x 16 = 112 bits to 7 x 24 = 168 bits wide.
  - Needs a compiling technique that schedules across several branches.
Slide 46: Compiler Support For ILP
- 4.1 Compiler Techniques for Exposing ILP
- 4.3 Static Multiple Issue VLIW
- 4.4 Advanced Compiler Support for ILP
- 4.5 Hardware Support for Exposing more Parallelism

How can compilers be smart?
1. Produce a good schedule of the code.
2. Determine which loops might contain parallelism.
3. Eliminate name dependencies.
Compilers must be REALLY smart to figure out aliases; pointers in C are a real problem.
These techniques lead to:
- Symbolic loop unrolling
- Critical path scheduling
Slide 47: Software Pipelining
Compiler Support For ILP
Symbolic Loop Unrolling
- Observation: if the iterations of a loop are independent, then we can get ILP by taking instructions from different iterations.
- Software pipelining reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo in SW).
Slide 48: SW Pipelining Example
Compiler Support For ILP
Symbolic Loop Unrolling

Before (unrolled 3 times):

    1  LD   F0,0(R1)
    2  ADDD F4,F0,F2
    3  SD   0(R1),F4
    4  LD   F6,-8(R1)
    5  ADDD F8,F6,F2
    6  SD   -8(R1),F8
    7  LD   F10,-16(R1)
    8  ADDD F12,F10,F2
    9  SD   -16(R1),F12
    10 SUBI R1,R1,#24
    11 BNEZ R1,LOOP

After (software pipelined):

       LD   F0,0(R1)    ; prologue
       ADDD F4,F0,F2    ; prologue
       LD   F0,-8(R1)   ; prologue
    1  SD   0(R1),F4    ; stores M[i]
    2  ADDD F4,F0,F2    ; adds to M[i-1]
    3  LD   F0,-16(R1)  ; loads M[i-2]
    4  SUBI R1,R1,#8
    5  BNEZ R1,LOOP
       SD   0(R1),F4    ; epilogue
       ADDD F4,F0,F2    ; epilogue
       SD   -8(R1),F4   ; epilogue

[Figure: overlapped IF ID EX Mem WB pipelines for the SD, ADDD, and LD, showing that F4 and F0 are each written by one instruction and read by a later one.]
Slide 49: SW Pipelining Example
Compiler Support For ILP
Symbolic Loop Unrolling
- Symbolic loop unrolling (software pipelining) vs. loop unrolling:
  - Less code space.
  - The overhead (prologue and epilogue) is paid only once, vs. on each iteration in loop unrolling.
- Example: 100 iterations = 25 loops with 4 unrolled iterations each.
Slide 50: Trace Scheduling
Compiler Support For ILP
Critical Path Scheduling
- Exploits parallelism across IF branches vs. LOOP branches.
- Two steps:
  - Trace selection: find a likely sequence of basic blocks (a trace) forming a long, statically predicted or profile-predicted stretch of straight-line code.
  - Trace compaction: squeeze the trace into few VLIW instructions; bookkeeping code is needed in case the prediction is wrong.
- The compiler undoes a bad guess (discards values in registers).
- Subtle compiler bugs mean a wrong answer, vs. merely poorer performance; there are no hardware interlocks.
Slide 51: Hardware Support For Parallelism
- 4.1 Compiler Techniques for Exposing ILP
- 4.3 Static Multiple Issue VLIW
- 4.4 Advanced Compiler Support for ILP
- 4.5 Hardware Support for Exposing more Parallelism

- Software support of ILP works best when the code is predictable at compile time.
- But what if there's no predictability?
- Here we'll talk about hardware techniques. These include:
  - Conditional or predicated instructions
  - Hardware speculation
Slide 52: Tell the Hardware To Ignore An Instruction
Hardware Support For Parallelism
Nullified Instructions
- Avoid branch prediction by turning branches into conditionally executed instructions:
  IF (x) THEN A = B op C ELSE NOP
- If the condition is false, the instruction neither stores its result nor causes an exception.
- The expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have a conditional move. PA-RISC can annul any following instruction.
- IA-64: 64 1-bit condition fields can be selected, giving conditional execution of any instruction.
- Drawbacks of conditional instructions:
  - They still take a clock cycle, even if annulled.
  - They stall if the condition is evaluated late.
  - Complex conditions reduce effectiveness, since the condition becomes known late in the pipeline.
- This can be a major win because no time is lost by taking a branch!
Slide 53: Tell the Hardware To Ignore An Instruction
Hardware Support For Parallelism
Nullified Instructions
- Suppose we have the code:

    if ( VarA == 0 )
        VarS = VarT;

Previous method:

        LD    R1, VarA
        BNEZ  R1, Label
        LD    R2, VarT
        SD    VarS, R2
    Label:

Nullified method (compare and nullify the next instruction if not zero):

        LD     R1, VarA
        LD     R2, VarT
        CMPNNZ R1, #0
        SD     VarS, R2
    Label:

Conditional-move method (compare and move if zero):

        LD    R1, VarA
        LD    R2, VarT
        CMOVZ VarS, R2, R1
Slide 54: Hardware Support For Parallelism
Compiler Speculation
Increasing Parallelism
- The theory here is to move an instruction across a branch so as to increase the size of a basic block and thus increase parallelism.
- The primary difficulty is avoiding exceptions. For example, speculating the divide in

    if ( a != 0 ) c = b / a;

  may cause a divide-by-zero error in some cases.
- Methods for supporting speculation include:
  1. A set of status bits (poison bits) associated with the registers; they signal that an instruction's result is invalid until some later time.
  2. Not writing the result of an instruction until it is certain that the instruction is no longer speculative.
Slide 55: Hardware Support For Parallelism
Compiler Speculation
Increasing Parallelism
- Example on Page 305. Code for:

    if ( A == 0 )
        A = B;
    else
        A = A + 4;

- Assume A is at 0(R3) and B is at 0(R2).

Original code:

        LW   R1, 0(R3)    ; Load A
        BNEZ R1, L1       ; Test A
        LW   R1, 0(R2)    ; If clause
        J    L2           ; Skip else
    L1: ADDI R1, R1, #4   ; Else clause
    L2: SW   0(R3), R1    ; Store A

Speculated code:

        LW   R1, 0(R3)    ; Load A
        LW   R14, 0(R2)   ; Spec load B
        BEQZ R1, L3       ; Other if branch
        ADDI R14, R1, #4  ; Else clause
    L3: SW   0(R3), R14   ; Non-spec store

Note that only ONE side needs to take a branch!
Slide 56: Hardware Support For Parallelism
Compiler Speculation
Poison Bits

Speculated code:

        LW   R1, 0(R3)    ; Load A
        LW   R14, 0(R2)   ; Spec load B
        BEQZ R1, L3       ; Other if branch
        ADDI R14, R1, #4  ; Else clause
    L3: SW   0(R3), R14   ; Non-spec store

- In the example on the last slide, if the speculative LW raises an exception, a poison bit is set on that register. Then, if a later instruction tries to use the register, an exception is raised at THAT point.
Slide 57: HW Support for More ILP
Hardware Support For Parallelism
Hardware Speculation
- Need a HW buffer for the results of uncommitted instructions: the reorder buffer.
  - The reorder buffer can be an operand source.
  - Once an instruction commits, its result is found in the register file.
  - 3 fields: instruction type, destination, value.
  - Use the reorder buffer number instead of the reservation station number.
  - Discard instructions on mis-predicted branches or on exceptions.
Slide 58: HW Support for More ILP
Hardware Support For Parallelism
Hardware Speculation
- How is this used in practice?
- Rather than predicting the direction of a branch, execute the instructions on both sides!
- We know the target of a branch early on, long before we know whether it will be taken.
- So begin fetching/executing at that new target PC.
- But also continue fetching/executing as if the branch were NOT taken.
Slide 59: Summary
- 4.1 Compiler Techniques for Exposing ILP
- 4.3 Static Multiple Issue VLIW
- 4.4 Advanced Compiler Support for ILP
- 4.5 Hardware Support for Exposing more Parallelism