Title: Computer Architecture
Slide 1: Computer Architecture
- Chapter 4
- Instruction-Level Parallelism - 3
- Prof. Jerry Breecher
- CS 240
- Fall 2003
Slide 2: Chapter Overview
- 4.1 Compiler Techniques for Exposing ILP
- 4.2 Static Branch Prediction
- 4.3 Static Multiple Issue VLIW
- 4.4 Advanced Compiler Support for ILP
- 4.5 Hardware Support for Exposing more
Parallelism
Slide 3: Ideas To Reduce Stalls
[Table contrasting the Chapter 3 (hardware) techniques with the Chapter 4 (software) techniques; not reproduced in this text version.]
Slide 4: Instruction Level Parallelism
- 4.1 Compiler Techniques for Exposing ILP
- 4.3 Static Multiple Issue VLIW
- 4.4 Advanced Compiler Support for ILP
- 4.5 Hardware Support for Exposing more
Parallelism
How can compilers recognize and take advantage of
ILP?
Slide 5: Simple Loop and Its Assembler Equivalent
Compilers and ILP
Pipeline Scheduling and Loop Unrolling
This is a clean and simple example!

    for (i = 1; i <= 1000; i++)
        x[i] = x[i] + s;

    Loop: LD   F0,0(R1)   ; F0 = vector element
          ADDD F4,F0,F2   ; add scalar from F2
          SD   0(R1),F4   ; store result
          SUBI R1,R1,#8   ; decrement pointer, 8 bytes (DW)
          BNEZ R1,Loop    ; branch if R1 != zero
          NOP             ; delayed branch slot
Slide 6: FP Loop Hazards
Compilers and ILP
Pipeline Scheduling and Loop Unrolling

    Loop: LD   F0,0(R1)   ; F0 = vector element
          ADDD F4,F0,F2   ; add scalar in F2
          SD   0(R1),F4   ; store result
          SUBI R1,R1,#8   ; decrement pointer, 8 bytes (DW)
          BNEZ R1,Loop    ; branch if R1 != zero
          NOP             ; delayed branch slot

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
    Integer op                     Integer op                 0

Where are the stalls?
Slide 7: FP Loop Showing Stalls
Compilers and ILP
Pipeline Scheduling and Loop Unrolling

    1  Loop: LD   F0,0(R1)   ; F0 = vector element
    2        stall
    3        ADDD F4,F0,F2   ; add scalar in F2
    4        stall
    5        stall
    6        SD   0(R1),F4   ; store result
    7        SUBI R1,R1,#8   ; decrement pointer, 8 bytes (DW)
    8        stall
    9        BNEZ R1,Loop    ; branch if R1 != zero
    10       stall           ; delayed branch slot

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
    Integer op                     Integer op                 0

- 10 clocks per iteration. Can we rewrite the code to minimize the stalls?
Slide 8: Scheduled FP Loop Minimizing Stalls
Compilers and ILP
Pipeline Scheduling and Loop Unrolling

    1  Loop: LD   F0,0(R1)
    2        SUBI R1,R1,#8
    3        ADDD F4,F0,F2
    4        stall
    5        BNEZ R1,Loop    ; delayed branch
    6        SD   8(R1),F4   ; offset altered when moved past SUBI

The stall remains because SD can't proceed any sooner.
We swap BNEZ and SD by changing the address used by SD.

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1

- Now 6 clocks per iteration. Next, unroll the loop 4 times to make it faster.
Slide 9: Unroll Loop Four Times (straightforward way)
Compilers and ILP
Pipeline Scheduling and Loop Unrolling

    1  Loop: LD   F0,0(R1)
    2        stall
    3        ADDD F4,F0,F2
    4        stall
    5        stall
    6        SD   0(R1),F4
    7        LD   F6,-8(R1)
    8        stall
    9        ADDD F8,F6,F2
    10       stall
    11       stall
    12       SD   -8(R1),F8
    13       LD   F10,-16(R1)
    14       stall
    15       ADDD F12,F10,F2
    16       stall
    17       stall
    18       SD   -16(R1),F12
    19       LD   F14,-24(R1)
    20       stall
    21       ADDD F16,F14,F2
    22       stall
    23       stall
    24       SD   -24(R1),F16
    25       SUBI R1,R1,#32
    26       BNEZ R1,LOOP
    27       stall
    28       NOP

15 instructions + 4 x (1 + 2) load/ADDD stalls + 1 SUBI stall = 28 clock cycles, or 7 per iteration. This assumes the number of loop iterations is a multiple of 4.
- Rewrite the loop to minimize stalls.
Slide 10: Unrolled Loop That Minimizes Stalls
Compilers and ILP
Pipeline Scheduling and Loop Unrolling
- What assumptions were made when we moved the code?
  - It is OK to move the store past SUBI even though SUBI changes a register the store uses.
  - It is OK to move the loads before the stores; do we still get the right data?
- When is it safe for the compiler to make such changes?

    1  Loop: LD   F0,0(R1)
    2        LD   F6,-8(R1)
    3        LD   F10,-16(R1)
    4        LD   F14,-24(R1)
    5        ADDD F4,F0,F2
    6        ADDD F8,F6,F2
    7        ADDD F12,F10,F2
    8        ADDD F16,F14,F2
    9        SD   0(R1),F4
    10       SD   -8(R1),F8
    11       SD   -16(R1),F12
    12       SUBI R1,R1,#32
    13       BNEZ R1,LOOP
    14       SD   8(R1),F16   ; 8 - 32 = -24

14 clock cycles, or 3.5 per iteration.
No stalls!
Slide 11: Summary of Loop Unrolling Example
Compilers and ILP
Pipeline Scheduling and Loop Unrolling
- Determine that it was legal to move the SD after the SUBI and BNEZ, and find the amount to adjust the SD offset.
- Determine that unrolling the loop would be useful by finding that the loop iterations were independent, except for the loop maintenance code.
- Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations.
- Eliminate the extra tests and branches and adjust the loop maintenance code.
- Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This requires analyzing the memory addresses and finding that they do not refer to the same address.
- Schedule the code, preserving any dependences needed to yield the same result as the original code.
Slide 12: Compiler Perspectives on Code Movement
Compilers and ILP
Dependencies
- The compiler is concerned about dependencies in the program; it is not concerned with whether a HW hazard occurs on a given pipeline.
- It tries to schedule code to avoid hazards.
- It looks for data dependencies (RAW if a hazard for HW):
  - Instruction i produces a result used by instruction j, or
  - Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
- If instructions are dependent, they can't execute in parallel.
- Dependence is easy to determine for registers (fixed names), but hard for memory:
  - Does 100(R4) = 20(R6)?
  - From different loop iterations, does 20(R6) = 20(R6)?
Slide 13: Compiler Perspectives on Code Movement
Compilers and ILP
Data Dependencies
Where are the data dependencies?

    1 Loop: LD   F0,0(R1)
    2       ADDD F4,F0,F2
    3       SUBI R1,R1,#8
    4       BNEZ R1,Loop   ; delayed branch
    5       SD   8(R1),F4  ; offset altered when moved past SUBI
Slide 14: Compiler Perspectives on Code Movement
Compilers and ILP
Name Dependencies
- Another kind of dependence is called a name dependence: two instructions use the same name (register or memory location) but don't exchange data.
- Anti-dependence (WAR if a hazard for HW): instruction j writes a register or memory location that instruction i reads, and instruction i is executed first.
- Output dependence (WAW if a hazard for HW): instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved.
Slide 15: Compiler Perspectives on Code Movement
Compilers and ILP
Name Dependencies

    1  Loop: LD   F0,0(R1)
    2        ADDD F4,F0,F2
    3        SD   0(R1),F4
    4        LD   F0,-8(R1)
    5        ADDD F4,F0,F2
    6        SD   -8(R1),F4
    7        LD   F0,-16(R1)
    8        ADDD F4,F0,F2
    9        SD   -16(R1),F4
    10       LD   F0,-24(R1)
    11       ADDD F4,F0,F2
    12       SD   -24(R1),F4
    13       SUBI R1,R1,#32
    14       BNEZ R1,LOOP
    15       NOP

Where are the name dependencies? How can we remove them?
No data is passed in F0, but we can't reuse F0 in instruction 4.
Slide 16: Where Are the Name Dependencies?
Compilers and ILP
Name Dependencies
Compiler Perspectives on Code Movement

    1  Loop: LD   F0,0(R1)
    2        ADDD F4,F0,F2
    3        SD   0(R1),F4
    4        LD   F6,-8(R1)
    5        ADDD F8,F6,F2
    6        SD   -8(R1),F8
    7        LD   F10,-16(R1)
    8        ADDD F12,F10,F2
    9        SD   -16(R1),F12
    10       LD   F14,-24(R1)
    11       ADDD F16,F14,F2
    12       SD   -24(R1),F16
    13       SUBI R1,R1,#32
    14       BNEZ R1,LOOP
    15       NOP

This is called register renaming.
Now there are data dependencies only: F0 exists only in instructions 1 and 2.
Slide 17: Compiler Perspectives on Code Movement
Compilers and ILP
Name Dependencies
- Again, name dependencies are hard to determine for memory accesses:
  - Does 100(R4) = 20(R6)?
  - From different loop iterations, does 20(R6) = 20(R6)?
- Our example required the compiler to know that if R1 doesn't change, then 0(R1) != -8(R1) != -16(R1) != -24(R1).
- There were no dependencies between some loads and stores, so they could be moved around each other.
Slide 18: Compilers and ILP
Control Dependencies
Compiler Perspectives on Code Movement
- The final kind of dependence is called control dependence.
- Example:

    if (p1) { S1; }
    if (p2) { S2; }

- S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Slide 19: Compilers and ILP
Control Dependencies
Compiler Perspectives on Code Movement
- Two (obvious) constraints on control dependences:
  - An instruction that is control dependent on a branch cannot be moved before the branch, so that its execution is no longer controlled by the branch.
  - An instruction that is not control dependent on a branch cannot be moved to after the branch, so that its execution is controlled by the branch.
- Control dependencies can be relaxed to get parallelism: we get the same effect if we preserve the order of exceptions (an address in a register is checked by a branch before use) and the data flow (a value in a register depends on the branch).
Slide 20: Where Are the Control Dependencies?
Compilers and ILP
Control Dependencies
Compiler Perspectives on Code Movement

    1  Loop: LD   F0,0(R1)
    2        ADDD F4,F0,F2
    3        SD   0(R1),F4
    4        SUBI R1,R1,#8
    5        BEQZ R1,exit
    6        LD   F0,0(R1)
    7        ADDD F4,F0,F2
    8        SD   0(R1),F4
    9        SUBI R1,R1,#8
    10       BEQZ R1,exit
    11       LD   F0,0(R1)
    12       ADDD F4,F0,F2
    13       SD   0(R1),F4
    14       SUBI R1,R1,#8
    15       BEQZ R1,exit
         ....
Slide 21: When Is It Safe to Unroll a Loop?
Compilers and ILP
Loop Level Parallelism
- Example: Where are the data dependencies? (A, B, C are distinct and non-overlapping.)

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }

- 1. S2 uses the value A[i+1] computed by S1 in the same iteration.
- 2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1]. This is a loop-carried dependence between iterations.
- This implies that the iterations are dependent and can't be executed in parallel.
- Note the case for our prior example: each iteration was distinct.
Slide 22: When Is It Safe to Unroll a Loop?
Compilers and ILP
Loop Level Parallelism
- Example: Where are the data dependencies? (A, B, C, D are distinct and non-overlapping.)

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + B[i];    /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

- 1. There is no dependence from S1 to S2. If there were, there would be a cycle in the dependencies and the loop would not be parallel. Since this other dependence is absent, interchanging the two statements will not affect the execution of S2.
- 2. On the first iteration of the loop, statement S1 depends on the value of B[1] computed prior to initiating the loop.
Slide 23: Now Safe to Unroll Loop? (p. 240)
Compilers and ILP
Loop Level Parallelism

OLD (the loop carried a dependence on B):

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + B[i];    /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

There are no circular dependencies, so the loop can be transformed.

NEW (the loop-carried dependence has been eliminated):

    A[1] = A[1] + B[1];
    for (i = 1; i <= 99; i = i + 1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];
Slide 24: Example 1 (There are NO dependencies)
Compilers and ILP
Loop Level Parallelism

    /*
     * This is the example on page 305 of Hennessy & Patterson,
     * but running on an Intel machine.
     */
    #define MAX  1000
    #define ITER 100000

    int main( int argc, char *argv[] )
    {
        double x[MAX + 2];
        double s = 3.14159;
        int i, j;

        for ( i = MAX; i > 0; i-- )   /* Init array */
            x[i] = 0;

        for ( j = ITER; j > 0; j-- )
            for ( i = MAX; i > 0; i-- )
                x[i] = x[i] + s;
    }
Slide 25: Example 1 (Compilers and ILP, Loop Level Parallelism)

This is the ICC optimized code (elapsed seconds: 0.122848):

    .L2:
        fstpl   8(%esp,%edx,8)
        fldl    (%esp,%edx,8)
        fadd    %st(1), %st
        fldl    -8(%esp,%edx,8)
        fldl    -16(%esp,%edx,8)
        fldl    -24(%esp,%edx,8)
        fldl    -32(%esp,%edx,8)
        fxch    %st(4)
        fstpl   (%esp,%edx,8)
        fxch    %st(2)
        fadd    %st(4), %st
        fstpl   -8(%esp,%edx,8)
        fadd    %st(3), %st
        fstpl   -16(%esp,%edx,8)
        fadd    %st(2), %st
        fstpl   -24(%esp,%edx,8)
        fadd    %st(1), %st
        addl    $-5, %edx
        testl   %edx, %edx
        jg      .L2            # Prob 99%
        fstpl   8(%esp,%edx,8)

This is the GCC optimized code (elapsed seconds: 0.590026):

    .L15:
        fldl    (%ecx,%eax)
        fadd    %st(1), %st
        decl    %edx
        fstpl   (%ecx,%eax)
        addl    $-8, %eax
        testl   %edx, %edx
        jg      .L15
Slide 26: Example 2
Compilers and ILP
Loop Level Parallelism

    // Example on Page 320
    get_current_time( &start_time );
    for ( j = ITER; j > 0; j-- )
    {
        for ( i = 1; i < MAX; i++ )
        {
            A[i+1] = A[i] + C[i];
            B[i+1] = B[i] + A[i+1];
        }
    }
    get_current_time( &end_time );

There are two dependencies here; what are they?
Slide 27: Example 2 (Compilers and ILP, Loop Level Parallelism)

This is the ICC optimized code (elapsed seconds: 0.664073):

    .L4:
        fstpl   25368(%esp,%edx,8)
        fldl    8472(%esp,%edx,8)
        faddl   16920(%esp,%edx,8)
        fldl    25368(%esp,%edx,8)
        fldl    16928(%esp,%edx,8)
        fxch    %st(2)
        fstl    8480(%esp,%edx,8)
        fadd    %st, %st(1)
        fxch    %st(1)
        fstl    25376(%esp,%edx,8)
        fxch    %st(2)
        faddp   %st, %st(1)
        fstl    8488(%esp,%edx,8)
        faddp   %st, %st(1)
        addl    $2, %edx
        cmpl    $1000, %edx
        jle     .L4            # Prob 99%
        fstpl   25368(%esp,%edx,8)

This is the GCC optimized code (elapsed seconds: 1.357084):

    .L55:
        fldl    -8(%esi,%eax)
        faddl   -8(%edi,%eax)
        fstl    (%esi,%eax)
        faddl   -8(%ecx,%eax)
        incl    %edx
        fstpl   (%ecx,%eax)
        addl    $8, %eax
        cmpl    $1000, %edx
        jle     .L55

This is the Microsoft optimized code:

    $L1225:
        fld     QWORD PTR _C$[esp+eax+40108]
        add     eax, 8
        cmp     eax, 7992
        fadd    QWORD PTR _A$[esp+eax+40100]
        fst     QWORD PTR _A$[esp+eax+40108]
        fadd    QWORD PTR _B$[esp+eax+40100]
        fstp    QWORD PTR _B$[esp+eax+40108]
        jle     $L1225
Slide 28: Example 3
Compilers and ILP
Loop Level Parallelism

    // Example on Page 321
    get_current_time( &start_time );
    for ( j = ITER; j > 0; j-- )
    {
        for ( i = 1; i < MAX; i++ )
        {
            A[i] = A[i] + B[i];
            B[i+1] = C[i] + D[i];
        }
    }
    get_current_time( &end_time );

What are the dependencies here?
Slide 29: Example 3 (Compilers and ILP, Loop Level Parallelism)

This is the ICC optimized code (elapsed seconds: 0.325419):

    .L6:
        fstpl   8464(%esp,%edx,8)
        fldl    8472(%esp,%edx,8)
        faddl   25368(%esp,%edx,8)
        fldl    16920(%esp,%edx,8)
        faddl   33824(%esp,%edx,8)
        fldl    8480(%esp,%edx,8)
        fldl    16928(%esp,%edx,8)
        faddl   33832(%esp,%edx,8)
        fxch    %st(3)
        fstpl   8472(%esp,%edx,8)
        fxch    %st(1)
        fstl    25376(%esp,%edx,8)
        fxch    %st(2)
        fstpl   25384(%esp,%edx,8)
        faddp   %st, %st(1)
        addl    $2, %edx
        cmpl    $1000, %edx
        jle     .L6            # Prob 99%
        fstpl   8464(%esp,%edx,8)

This is the GCC optimized code (elapsed seconds: 1.370478):

    .L65:
        fldl    (%esi,%eax)
        faddl   (%ecx,%eax)
        fstpl   (%esi,%eax)
        movl    -40100(%ebp), %edi
        fldl    (%edi,%eax)
        movl    -40136(%ebp), %edi
        faddl   (%edi,%eax)
        incl    %edx
        fstpl   8(%ecx,%eax)
        addl    $8, %eax
        cmpl    $1000, %edx
        jle     .L65
Slide 30: Example 4
Compilers and ILP
Loop Level Parallelism

    // Example on Page 322
    get_current_time( &start_time );
    for ( j = ITER; j > 0; j-- )
    {
        A[1] = A[1] + B[1];
        for ( i = 1; i < MAX - 1; i++ )
        {
            B[i+1] = C[i] + D[i];
            A[i+1] = A[i+1] + B[i+1];
        }
        B[101] = C[100] + D[100];
    }
    get_current_time( &end_time );

Elapsed seconds: 1.200525
How many dependencies are here?
Slide 31: Example 4 (Compilers and ILP, Loop Level Parallelism)

This is the GCC optimized code (elapsed seconds: 1.200525):

    .L75:
        movl    -40136(%ebp), %edi
        fldl    -8(%edi,%eax)
        faddl   -8(%esi,%eax)
        movl    -40104(%ebp), %edi
        fstl    (%edi,%eax)
        faddl   (%ecx,%eax)
        incl    %edx
        fstpl   (%ecx,%eax)
        addl    $8, %eax
        cmpl    $999, %edx
        jle     .L75

This is the Microsoft optimized code:

    $L1239:
        fld     QWORD PTR _D$[esp+eax+40108]
        add     eax, 8
        cmp     eax, 7984          ; 00001f30H
        fadd    QWORD PTR _C$[esp+eax+40100]
        fst     QWORD PTR _B$[esp+eax+40108]
        fadd    QWORD PTR _A$[esp+eax+40108]
        fstp    QWORD PTR _A$[esp+eax+40108]
        jle     SHORT $L1239
Slide 32: Example 4 (Compilers and ILP, Loop Level Parallelism)

This is the ICC optimized code (elapsed seconds: 0.359232):

    .L8:
        fstpl   8472(%esp,%edx,8)
        fldl    16920(%esp,%edx,8)
        faddl   33824(%esp,%edx,8)
        fldl    8480(%esp,%edx,8)
        fldl    16928(%esp,%edx,8)
        faddl   33832(%esp,%edx,8)
        fldl    8488(%esp,%edx,8)
        fldl    16936(%esp,%edx,8)
        faddl   33840(%esp,%edx,8)
        fldl    8496(%esp,%edx,8)
        fxch    %st(5)

    (continued)

        fstl    25376(%esp,%edx,8)
        fxch    %st(3)
        fstl    25384(%esp,%edx,8)
        fxch    %st(1)
        fstl    25392(%esp,%edx,8)
        fxch    %st(3)
        faddp   %st, %st(4)
        fxch    %st(3)
        fstpl   8480(%esp,%edx,8)
        faddp   %st, %st(2)
        fxch    %st(1)
        fstpl   8488(%esp,%edx,8)
        faddp   %st, %st(1)
        addl    $3, %edx
        cmpl    $999, %edx
        jle     .L8
        fstpl   8472(%esp,%edx,8)
Slide 33: Static Multiple Issue
Multiple issue is the ability of the processor to start more than one instruction in a given cycle.
Flavor I: Superscalar processors issue a varying number of instructions per clock (1 to 8), scheduled either statically (by the compiler) or dynamically (by the hardware, e.g. Tomasulo). Examples: IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000.
- 4.1 Compiler Techniques for Exposing ILP
- 4.3 Static Multiple Issue VLIW
- 4.4 Advanced Compiler Support for ILP
- 4.5 Hardware Support for Exposing more
Parallelism
Slide 34: Issuing Multiple Instructions/Cycle
Multiple Issue
- Flavor II:
- VLIW (Very Long Instruction Word) issues a fixed number of instructions, formatted either as one very large instruction or as a fixed packet of smaller instructions.
- A fixed number of instructions (4-16) are scheduled by the compiler, which puts the operations into wide templates.
- Joint HP/Intel agreement in 1999/2000:
  - Intel Architecture-64 (IA-64), a 64-bit address architecture
  - Style: Explicitly Parallel Instruction Computer (EPIC)
Slide 35: Issuing Multiple Instructions/Cycle
Multiple Issue
- Flavor II, continued:
- 3 instructions in 128-bit groups; a field determines whether the instructions are dependent or independent.
  - Smaller code size than old VLIW, larger than x86/RISC.
  - Groups can be linked to show independence of more than 3 instructions.
- 64 integer registers + 64 floating point registers.
  - Not separate register files per functional unit as in old VLIW.
- Hardware checks dependencies (interlocks, hence binary compatibility over time).
- Predicated execution (select 1 out of 64 1-bit flags): perhaps 40% fewer mispredictions?
- IA-64 is the name of the instruction set architecture; EPIC is the style. Merced was the name of the first implementation (1999/2000?).
Slide 36: Issuing Multiple Instructions/Cycle
Multiple Issue
A SuperScalar Version of MIPS
- In our MIPS example, we can handle 2 instructions/cycle:
  - one floating point instruction, and
  - one of anything else.
- Fetch 64 bits per clock cycle: the integer instruction on the left, the FP instruction on the right.
- Can only issue the 2nd instruction if the 1st instruction issues.
- Need more ports on the FP registers to do an FP load and an FP op in a pair.

    Type               Pipe stages
    Int. instruction   IF ID EX MEM WB
    FP instruction     IF ID EX MEM WB
    Int. instruction      IF ID EX MEM WB
    FP instruction        IF ID EX MEM WB
    Int. instruction         IF ID EX MEM WB
    FP instruction           IF ID EX MEM WB

- A 1-cycle load delay now delays 3 instructions in the superscalar: the instruction in the right half of the same issue pair can't use the result, nor can the instructions in the next slot.
Slide 37: Unrolled Loop Minimizes Stalls for Scalar
Multiple Issue
A SuperScalar Version of MIPS

    1  Loop: LD   F0,0(R1)
    2        LD   F6,-8(R1)
    3        LD   F10,-16(R1)
    4        LD   F14,-24(R1)
    5        ADDD F4,F0,F2
    6        ADDD F8,F6,F2
    7        ADDD F12,F10,F2
    8        ADDD F16,F14,F2
    9        SD   0(R1),F4
    10       SD   -8(R1),F8
    11       SD   -16(R1),F12
    12       SUBI R1,R1,#32
    13       BNEZ R1,LOOP
    14       SD   8(R1),F16   ; 8 - 32 = -24

14 clock cycles, or 3.5 per iteration.
Latencies: LD to ADDD is 1 cycle; ADDD to SD is 2 cycles.
Slide 38: Loop Unrolling in Superscalar
Multiple Issue
A SuperScalar Version of MIPS

    Integer instruction     FP instruction      Clock cycle
    Loop: LD F0,0(R1)                           1
          LD F6,-8(R1)                          2
          LD F10,-16(R1)    ADDD F4,F0,F2       3
          LD F14,-24(R1)    ADDD F8,F6,F2       4
          LD F18,-32(R1)    ADDD F12,F10,F2     5
          SD 0(R1),F4       ADDD F16,F14,F2     6
          SD -8(R1),F8      ADDD F20,F18,F2     7
          SD -16(R1),F12                        8
          SD -24(R1),F16                        9
          SUBI R1,R1,#40                        10
          BNEZ R1,LOOP                          11
          SD 8(R1),F20                          12

- Unrolled 5 times to avoid delays (1 extra time due to the superscalar issue pattern)
- 12 clocks, or 2.4 clocks per iteration
Slide 39: Dynamic Scheduling in Superscalar
Multiple Issue
Multiple Instruction Issue + Dynamic Scheduling
- Code compiled for the scalar version will run poorly on the superscalar.
- We may want the code to vary depending on how superscalar the machine is.
- Simple approach: separate Tomasulo control, with separate reservation stations for the integer FU/registers and for the FP FU/registers.
Slide 40: Dynamic Scheduling in Superscalar
Multiple Issue
Multiple Instruction Issue + Dynamic Scheduling
- How do we issue two instructions and keep in-order instruction issue for Tomasulo?
- Issue at 2X the clock rate, so that issue remains in order.
- Only FP loads might cause a dependency between integer and FP issue:
  - Replace the load reservation station with a load queue; operands must be read in the order they are fetched.
  - A load checks addresses in the store queue to avoid a RAW violation.
  - A store checks addresses in the load queue to avoid WAR and WAW violations.
Slide 41: Performance of Dynamic Superscalar
Multiple Issue
Multiple Instruction Issue + Dynamic Scheduling

    Iteration  Instruction     Issues  Executes  Writes result
    no.                            (clock-cycle number)
    1          LD   F0,0(R1)   1       2         4
    1          ADDD F4,F0,F2   1       5         8
    1          SD   0(R1),F4   2       9
    1          SUBI R1,R1,#8   3       4         5
    1          BNEZ R1,LOOP    4       5
    2          LD   F0,0(R1)   5       6         8
    2          ADDD F4,F0,F2   5       9         12
    2          SD   0(R1),F4   6       13
    2          SUBI R1,R1,#8   7       8         9
    2          BNEZ R1,LOOP    8       9

- 4 clocks per iteration
- Branches and decrements still take 1 clock cycle
Slide 42: Loop Unrolling in VLIW
Multiple Issue
VLIW

    Memory          Memory          FP              FP              Int. op/        Clock
    reference 1     reference 2     operation 1     operation 2     branch
    LD F0,0(R1)     LD F6,-8(R1)                                                    1
    LD F10,-16(R1)  LD F14,-24(R1)                                                  2
    LD F18,-32(R1)  LD F22,-40(R1)  ADDD F4,F0,F2   ADDD F8,F6,F2                   3
    LD F26,-48(R1)                  ADDD F12,F10,F2 ADDD F16,F14,F2                 4
                                    ADDD F20,F18,F2 ADDD F24,F22,F2                 5
    SD 0(R1),F4     SD -8(R1),F8    ADDD F28,F26,F2                                 6
    SD -16(R1),F12  SD -24(R1),F16                                                  7
    SD -32(R1),F20  SD -40(R1),F24                                  SUBI R1,R1,#48  8
    SD -0(R1),F28                                                   BNEZ R1,LOOP    9

- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per iteration
- Need more registers to use VLIW effectively
Slide 43: Limits to Multi-Issue Machines
Multiple Issue
Limitations With Multiple Issue
- Inherent limitations of ILP:
  - With 1 branch in 5 instructions, how do we keep a 5-way VLIW busy?
  - The latencies of the units mean many operations must be scheduled.
  - Need about (pipeline depth x number of functional units) independent operations to keep the machine busy.
- Difficulties in building the HW:
  - Duplicate functional units to get parallel execution.
  - More ports on the register file (the VLIW example needs 6 read and 3 write ports for the integer registers, and 6 read and 4 write ports for the FP registers).
  - More ports to memory.
  - Decoding for SS and its impact on clock rate and pipeline depth.
Slide 44: Limits to Multi-Issue Machines
Multiple Issue
Limitations With Multiple Issue
- Limitations specific to either an SS or VLIW implementation:
  - Decode/issue complexity in SS.
  - VLIW code size: unrolled loops plus wasted fields in the VLIW word.
  - VLIW lock step: 1 hazard stalls all instructions.
  - VLIW binary compatibility.
Slide 45: Multiple Issue Challenges
Multiple Issue
Limitations With Multiple Issue
- While the integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
  - exactly 50% FP operations, and
  - no hazards.
- If more instructions issue at the same time, decode and issue become more difficult:
  - Even a 2-scalar machine must examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue.
- VLIW trades instruction space for simple decoding:
  - The long instruction word has room for many operations.
  - By definition, all the operations the compiler puts in the long instruction word are independent, so they execute in parallel.
  - E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch.
  - At 16 to 24 bits per field, that is 7 x 16 = 112 bits to 7 x 24 = 168 bits wide.
  - Needs a compiling technique that schedules across several branches.
Slide 46: Compiler Support For ILP
- 4.1 Compiler Techniques for Exposing ILP
- 4.3 Static Multiple Issue VLIW
- 4.4 Advanced Compiler Support for ILP
- 4.5 Hardware Support for Exposing more Parallelism

How can compilers be smart?
1. Produce a good schedule of the code.
2. Determine which loops might contain parallelism.
3. Eliminate name dependencies.
Compilers must be REALLY smart to figure out aliases; pointers in C are a real problem.
These techniques lead to:
- Symbolic loop unrolling
- Critical path scheduling
Slide 47: Software Pipelining
Compiler Support For ILP
Symbolic Loop Unrolling
- Observation: if the iterations of a loop are independent, then we can get ILP by taking instructions from different iterations.
- Software pipelining reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo in SW).
Slide 48: SW Pipelining Example
Compiler Support For ILP
Symbolic Loop Unrolling

Before (unrolled 3 times):

    1  LD   F0,0(R1)
    2  ADDD F4,F0,F2
    3  SD   0(R1),F4
    4  LD   F6,-8(R1)
    5  ADDD F8,F6,F2
    6  SD   -8(R1),F8
    7  LD   F10,-16(R1)
    8  ADDD F12,F10,F2
    9  SD   -16(R1),F12
    10 SUBI R1,R1,#24
    11 BNEZ R1,LOOP

After (software pipelined):

       LD   F0,0(R1)    ; prologue
       ADDD F4,F0,F2    ; prologue
       LD   F0,-8(R1)   ; prologue
    1  SD   0(R1),F4    ; stores M[i]
    2  ADDD F4,F0,F2    ; adds to M[i-1]
    3  LD   F0,-16(R1)  ; loads M[i-2]
    4  SUBI R1,R1,#8
    5  BNEZ R1,LOOP
       SD   0(R1),F4    ; epilogue
       ADDD F4,F0,F2    ; epilogue
       SD   -8(R1),F4   ; epilogue

[Figure: overlapped IF ID EX Mem WB pipelines for the SD, ADDD, and LD, showing that F4 and F0 are each written by one instruction and read by a later one.]
Slide 49: SW Pipelining Example
Compiler Support For ILP
Symbolic Loop Unrolling
- Symbolic loop unrolling (software pipelining) vs. loop unrolling:
  - Less code space.
  - The overhead (prologue and epilogue) is paid only once, vs. on each iteration in loop unrolling.
- Example: 100 iterations = 25 loops with 4 unrolled iterations each.
Slide 50: Trace Scheduling
Compiler Support For ILP
Critical Path Scheduling
- Exploits parallelism across IF branches vs. LOOP branches.
- Two steps:
  - Trace selection: find a likely sequence of basic blocks (a trace) forming a long, statically predicted or profile-predicted stretch of straight-line code.
  - Trace compaction: squeeze the trace into few VLIW instructions; bookkeeping code is needed in case the prediction is wrong.
- The compiler undoes a bad guess (discards values in registers).
- Subtle compiler bugs mean a wrong answer, vs. merely poorer performance; there are no hardware interlocks.
Slide 51: Hardware Support For Parallelism
- 4.1 Compiler Techniques for Exposing ILP
- 4.3 Static Multiple Issue VLIW
- 4.4 Advanced Compiler Support for ILP
- 4.5 Hardware Support for Exposing more Parallelism

- Software support of ILP works best when the code is predictable at compile time.
- But what if there's no predictability?
- Here we'll talk about hardware techniques. These include:
  - Conditional or predicated instructions
  - Hardware speculation
Slide 52: Tell the Hardware To Ignore An Instruction
Hardware Support For Parallelism
Nullified Instructions
- Avoid branch prediction by turning branches into conditionally executed instructions:
  IF (x) THEN A = B op C ELSE NOP
- If the condition is false, the instruction neither stores its result nor causes an exception.
- The expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have a conditional move. PA-RISC can annul any following instruction.
- IA-64: 64 1-bit condition fields can be selected, giving conditional execution of any instruction.
- Drawbacks of conditional instructions:
  - They still take a clock cycle, even if annulled.
  - They stall if the condition is evaluated late.
  - Complex conditions reduce effectiveness, since the condition becomes known late in the pipeline.
- This can be a major win because no time is lost by taking a branch!
Slide 53: Tell the Hardware To Ignore An Instruction
Hardware Support For Parallelism
Nullified Instructions
- Suppose we have the code:

    if ( VarA == 0 )
        VarS = VarT;

Previous method:

        LD    R1, VarA
        BNEZ  R1, Label
        LD    R2, VarT
        SD    VarS, R2
    Label:

Nullified method (compare and nullify the next instruction if not zero):

        LD     R1, VarA
        LD     R2, VarT
        CMPNNZ R1, #0
        SD     VarS, R2
    Label:

Conditional-move method (compare and move if zero):

        LD    R1, VarA
        LD    R2, VarT
        CMOVZ VarS, R2, R1
Slide 54: Hardware Support For Parallelism
Compiler Speculation
Increasing Parallelism
- The theory here is to move an instruction across a branch so as to increase the size of a basic block and thus increase parallelism.
- The primary difficulty is avoiding exceptions. For example, speculating the divide in

    if ( a != 0 ) c = b / a;

  may cause a divide-by-zero error in some cases.
- Methods for supporting speculation include:
  1. A set of status bits (poison bits) associated with the registers; they signal that an instruction's result is invalid until some later time.
  2. Not writing the result of an instruction until it is certain that the instruction is no longer speculative.
Slide 55: Hardware Support For Parallelism
Compiler Speculation
Increasing Parallelism
- Example on Page 305. Code for:

    if ( A == 0 )
        A = B;
    else
        A = A + 4;

- Assume A is at 0(R3) and B is at 0(R2).

Original code:

        LW   R1, 0(R3)    ; Load A
        BNEZ R1, L1       ; Test A
        LW   R1, 0(R2)    ; If clause
        J    L2           ; Skip else
    L1: ADDI R1, R1, #4   ; Else clause
    L2: SW   0(R3), R1    ; Store A

Speculated code:

        LW   R1, 0(R3)    ; Load A
        LW   R14, 0(R2)   ; Spec load B
        BEQZ R1, L3       ; Other if branch
        ADDI R14, R1, #4  ; Else clause
    L3: SW   0(R3), R14   ; Non-spec store

Note that only ONE side needs to take a branch!
Slide 56: Hardware Support For Parallelism
Compiler Speculation
Poison Bits

Speculated code:

        LW   R1, 0(R3)    ; Load A
        LW   R14, 0(R2)   ; Spec load B
        BEQZ R1, L3       ; Other if branch
        ADDI R14, R1, #4  ; Else clause
    L3: SW   0(R3), R14   ; Non-spec store

- In the example on the last slide, if the speculative LW raises an exception, a poison bit is set on that register. Then, if a later instruction tries to use the register, an exception is raised at THAT point.
Slide 57: HW Support for More ILP
Hardware Support For Parallelism
Hardware Speculation
- Need a HW buffer for the results of uncommitted instructions: the reorder buffer.
  - The reorder buffer can be an operand source.
  - Once an instruction commits, its result is found in the register file.
  - 3 fields: instruction type, destination, value.
  - Use the reorder buffer number instead of the reservation station number.
  - Discard instructions on mis-predicted branches or on exceptions.
Slide 58: HW Support for More ILP
Hardware Support For Parallelism
Hardware Speculation
- How is this used in practice?
- Rather than predicting the direction of a branch, execute the instructions on both sides!
- We know the target of a branch early on, long before we know whether it will be taken.
- So begin fetching/executing at that new target PC.
- But also continue fetching/executing as if the branch were NOT taken.
Slide 59: Summary
- 4.1 Compiler Techniques for Exposing ILP
- 4.3 Static Multiple Issue VLIW
- 4.4 Advanced Compiler Support for ILP
- 4.5 Hardware Support for Exposing more Parallelism