Computer Architecture: Transcript and Presenter's Notes
1
Computer Architecture
  • Chapter 4
  • Instruction-Level Parallelism - 3
  • Prof. Jerry Breecher
  • CS 240
  • Fall 2003

2
Chapter Overview
  • 4.1 Compiler Techniques for Exposing ILP
  • 4.2 Static Branch Prediction
  • 4.3 Static Multiple Issue VLIW
  • 4.4 Advanced Compiler Support for ILP
  • 4.5 Hardware Support for Exposing more
    Parallelism

3
Ideas To Reduce Stalls
(Table from the slide contrasting the hardware techniques of Chapter 3
with the compiler techniques of Chapter 4; the table body is not
recoverable from the transcript.)
4
Instruction Level Parallelism
  • 4.1 Compiler Techniques for Exposing ILP
  • 4.3 Static Multiple Issue VLIW
  • 4.4 Advanced Compiler Support for ILP
  • 4.5 Hardware Support for Exposing more
    Parallelism

How can compilers recognize and take advantage of
ILP?
5
Simple Loop and its Assembler Equivalent
Compilers and ILP
Pipeline Scheduling and Loop Unrolling
This is a clean and simple example!
  • for (i = 1; i <= 1000; i++)  x(i) = x(i) + s;

Loop:  LD    F0,0(R1)   ;F0 = vector element
       ADDD  F4,F0,F2   ;add scalar from F2
       SD    0(R1),F4   ;store result
       SUBI  R1,R1,#8   ;decrement pointer 8 bytes (DW)
       BNEZ  R1,Loop    ;branch R1 != zero
       NOP              ;delayed branch slot
6
FP Loop Hazards
Compilers and ILP
Pipeline Scheduling and Loop Unrolling
Loop:  LD    F0,0(R1)   ;F0 = vector element
       ADDD  F4,F0,F2   ;add scalar in F2
       SD    0(R1),F4   ;store result
       SUBI  R1,R1,#8   ;decrement pointer 8B (DW)
       BNEZ  R1,Loop    ;branch R1 != zero
       NOP              ;delayed branch slot

Instruction            Instruction            Latency in
producing result       using result           clock cycles
FP ALU op              Another FP ALU op      3
FP ALU op              Store double           2
Load double            FP ALU op              1
Load double            Store double           0
Integer op             Integer op             0
Where are the stalls?
7
FP Loop Showing Stalls
Compilers and ILP
Pipeline Scheduling and Loop Unrolling
1 Loop:  LD    F0,0(R1)   ;F0 = vector element
2        stall
3        ADDD  F4,F0,F2   ;add scalar in F2
4        stall
5        stall
6        SD    0(R1),F4   ;store result
7        SUBI  R1,R1,#8   ;decrement pointer 8 bytes (DW)
8        stall
9        BNEZ  R1,Loop    ;branch R1 != zero
10       stall            ;delayed branch slot

Instruction            Instruction            Latency in
producing result       using result           clock cycles
FP ALU op              Another FP ALU op      3
FP ALU op              Store double           2
Load double            FP ALU op              1
Load double            Store double           0
Integer op             Integer op             0
  • 10 clock cycles per iteration. Can we rewrite the code to minimize
    the stalls?

8
Scheduled FP Loop Minimizing Stalls
Compilers and ILP
Pipeline Scheduling and Loop Unrolling
1 Loop:  LD    F0,0(R1)
2        SUBI  R1,R1,#8
3        ADDD  F4,F0,F2
4        stall
5        BNEZ  R1,Loop    ;delayed branch
6        SD    8(R1),F4   ;altered when moved past SUBI
The remaining stall occurs because SD can't proceed until ADDD finishes.
We swap BNEZ and SD by changing the address offset of the SD.
Instruction            Instruction            Latency in
producing result       using result           clock cycles
FP ALU op              Another FP ALU op      3
FP ALU op              Store double           2
Load double            FP ALU op              1
  • Now 6 clock cycles per iteration. Next, unroll the loop 4 times to
    make it faster.

9
Unroll Loop Four Times (straightforward way)
Compilers and ILP
Pipeline Scheduling and Loop Unrolling
1 Loop:  LD    F0,0(R1)
2        stall
3        ADDD  F4,F0,F2
4        stall
5        stall
6        SD    0(R1),F4
7        LD    F6,-8(R1)
8        stall
9        ADDD  F8,F6,F2
10       stall
11       stall
12       SD    -8(R1),F8
13       LD    F10,-16(R1)
14       stall
15       ADDD  F12,F10,F2
16       stall
17       stall
18       SD    -16(R1),F12
19       LD    F14,-24(R1)
20       stall
21       ADDD  F16,F14,F2
22       stall
23       stall
24       SD    -24(R1),F16
25       SUBI  R1,R1,#32
26       BNEZ  R1,LOOP
27       stall
28       NOP
15 + 4 x (1 + 2) + 1 = 28 clock cycles, or 7 per iteration.
Assumes the iteration count in R1 is a multiple of 4.
  • Rewrite the loop to minimize stalls.

10
Unrolled Loop That Minimizes Stalls
Compilers and ILP
Pipeline Scheduling and Loop Unrolling
  • What assumptions were made when we moved the code?
  • OK to move the store past SUBI even though SUBI changes the register?
  • OK to move the loads before the stores and still get the right data?
  • When is it safe for the compiler to make such changes?

1 Loop:  LD    F0,0(R1)
2        LD    F6,-8(R1)
3        LD    F10,-16(R1)
4        LD    F14,-24(R1)
5        ADDD  F4,F0,F2
6        ADDD  F8,F6,F2
7        ADDD  F12,F10,F2
8        ADDD  F16,F14,F2
9        SD    0(R1),F4
10       SD    -8(R1),F8
11       SD    -16(R1),F12
12       SUBI  R1,R1,#32
13       BNEZ  R1,LOOP
14       SD    8(R1),F16   ;8 - 32 = -24
14 clock cycles, or 3.5 per iteration
No Stalls!!
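The same transformation can be sketched at the source level. A minimal C
version (function name and bounds are ours, for illustration), assuming
the trip count is a multiple of 4, as the slides assume of R1:

    /* Source-level view of unrolling by 4 with renamed temporaries.
       Counts down like the assembly (SUBI R1,R1,#32). Illustrative only. */
    void unrolled(double x[], int n, double s)
    {
        int i;
        for (i = n; i > 0; i -= 4) {
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }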
11
Summary of Loop Unrolling Example
Compilers and ILP
Pipeline Scheduling and Loop Unrolling
  • Determine that it was legal to move the SD after
    the SUBI and BNEZ, and find the amount to adjust
    the SD offset.
  • Determine that unrolling the loop would be useful
    by finding that the loop iterations were
    independent, except for the loop maintenance
    code.
  • Use different registers to avoid unnecessary
    constraints that would be forced by using the
    same registers for different computations.
  • Eliminate the extra tests and branches and adjust
    the loop maintenance code.
  • Determine that the loads and stores in the
    unrolled loop can be interchanged by observing
    that the loads and stores from different
    iterations are independent. This requires
    analyzing the memory addresses and finding that
    they do not refer to the same address.
  • Schedule the code, preserving any dependences
    needed to yield the same result as the original
    code.

12
Compiler Perspectives on Code Movement
Compilers and ILP
Dependencies
  • The compiler is concerned with dependencies in the program; whether a
    given HW hazard arises depends on the pipeline.
  • It tries to schedule code to avoid hazards.
  • It looks for data dependencies (RAW if a hazard for HW):
  • Instruction i produces a result used by instruction j, or
  • Instruction j is data dependent on instruction k, and instruction k
    is data dependent on instruction i.
  • If dependent, the instructions can't execute in parallel.
  • Easy to determine for registers (fixed names)
  • Hard for memory:
  • Does 100(R4) = 20(R6)?
  • From different loop iterations, does 20(R6) = 20(R6)?
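A small C sketch of why memory disambiguation is hard (names are ours,
not from the slides): if two pointers may alias, the compiler cannot
prove the store and load touch different words, so it cannot reorder
them.

    /* If p and q can point into the same array, 100(R4) may equal 20(R6):
       the compiler must keep the store before the load. */
    void maybe_alias(double *p, double *q)
    {
        p[100] = 1.0;          /* store through R4 */
        double t = q[20];      /* load through R6: possibly the same word */
        q[21] = t + 2.0;
    }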

13
Compiler Perspectives on Code Movement
Compilers and ILP
Data Dependencies
Where are the data dependencies?
1 Loop:  LD    F0,0(R1)
2        ADDD  F4,F0,F2
3        SUBI  R1,R1,#8
4        BNEZ  R1,Loop    ;delayed branch
5        SD    8(R1),F4   ;altered when moved past SUBI
14
Compiler Perspectives on Code Movement
Compilers and ILP
Name Dependencies
  • Another kind of dependence is called a name dependence: two
    instructions use the same name (register or memory location) but
    don't exchange data.
  • Anti-dependence (WAR if a hazard for HW):
  • Instruction j writes a register or memory location that instruction i
    reads from, and instruction i is executed first.
  • Output dependence (WAW if a hazard for HW):
  • Instruction i and instruction j write the same register or memory
    location; the ordering between the instructions must be preserved.

15
Compiler Perspectives on Code Movement
Compilers and ILP
Name Dependencies
1 Loop:  LD    F0,0(R1)
2        ADDD  F4,F0,F2
3        SD    0(R1),F4
4        LD    F0,-8(R1)
5        ADDD  F4,F0,F2
6        SD    -8(R1),F4
7        LD    F0,-16(R1)
8        ADDD  F4,F0,F2
9        SD    -16(R1),F4
10       LD    F0,-24(R1)
11       ADDD  F4,F0,F2
12       SD    -24(R1),F4
13       SUBI  R1,R1,#32
14       BNEZ  R1,LOOP
15       NOP

How can we remove these dependencies?
Where are the name dependencies?
No data is passed through F0, but F0 can't be reused in instruction 4.
16
Where are the name dependencies?
Compilers and ILP
Name Dependencies
Compiler Perspectives on Code Movement
1 Loop:  LD    F0,0(R1)
2        ADDD  F4,F0,F2
3        SD    0(R1),F4
4        LD    F6,-8(R1)
5        ADDD  F8,F6,F2
6        SD    -8(R1),F8
7        LD    F10,-16(R1)
8        ADDD  F12,F10,F2
9        SD    -16(R1),F12
10       LD    F14,-24(R1)
11       ADDD  F16,F14,F2
12       SD    -24(R1),F16
13       SUBI  R1,R1,#32
14       BNEZ  R1,LOOP
15       NOP

This is called register renaming.
Now there are data dependencies only. F0 exists
only in instructions 1 and 2.
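Renaming is easiest to see at the source level; a minimal sketch with
illustrative names:

    /* The first version reuses one temporary t, creating anti- and output
       dependences; the second uses fresh names, leaving only true (RAW)
       data dependences, so the compiler may interleave freely. */
    void reuses_name(double x[], double s)
    {
        double t;
        t = x[0] + s;  x[0] = t;
        t = x[1] + s;  x[1] = t;   /* WAR/WAW on t with the lines above */
    }

    void renamed(double x[], double s)
    {
        double t0 = x[0] + s;
        double t1 = x[1] + s;
        x[0] = t0;
        x[1] = t1;
    }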
17
Compiler Perspectives on Code Movement
Compilers and ILP
Name Dependencies
  • Again, name dependencies are hard for memory accesses:
  • Does 100(R4) = 20(R6)?
  • From different loop iterations, does 20(R6) = 20(R6)?
  • Our example required the compiler to know that if R1 doesn't change,
    then 0(R1) != -8(R1) != -16(R1) != -24(R1)
  • There were then no dependencies between some loads and stores, so
    they could be moved past each other.

18
Compilers and ILP
Control Dependencies
Compiler Perspectives on Code Movement
  • The final kind of dependence is called a control dependence.
  • Example:
  •   if (p1) { S1; }
  •   if (p2) { S2; }
  • S1 is control dependent on p1, and S2 is control dependent on p2 but
    not on p1.

19
Compilers and ILP
Control Dependencies
Compiler Perspectives on Code Movement
  • Two (obvious) constraints on control dependences:
  • An instruction that is control dependent on a branch cannot be moved
    before the branch in such a way that its execution is no longer
    controlled by the branch.
  • An instruction that is not control dependent on a branch cannot be
    moved to after the branch in such a way that its execution becomes
    controlled by the branch.
  • Control dependencies may be relaxed to get parallelism: we get the
    same effect if we preserve the order of exceptions (an address in a
    register is checked by a branch before use) and the data flow (a
    value in a register depends on the branch).
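A one-line C illustration of the exception constraint (example ours):
hoisting the load above the test would let it fault when p is null,
changing the order of exceptions.

    int guarded_read(int *p)
    {
        int v = 0;
        if (p != 0)     /* branch controlling the load             */
            v = *p;     /* control dependent: may fault if hoisted */
        return v;
    }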

20
Where are the control dependencies?
Compilers and ILP
Control Dependencies
Compiler Perspectives on Code Movement
1 Loop:  LD    F0,0(R1)
2        ADDD  F4,F0,F2
3        SD    0(R1),F4
4        SUBI  R1,R1,#8
5        BEQZ  R1,exit
6        LD    F0,0(R1)
7        ADDD  F4,F0,F2
8        SD    0(R1),F4
9        SUBI  R1,R1,#8
10       BEQZ  R1,exit
11       LD    F0,0(R1)
12       ADDD  F4,F0,F2
13       SD    0(R1),F4
14       SUBI  R1,R1,#8
15       BEQZ  R1,exit
....
21
When Safe to Unroll Loop?
Compilers and ILP
Loop Level Parallelism
  • Example: Where are the data dependencies? (A, B, C are distinct and
    non-overlapping.)
  • 1. S2 uses the value, A[i+1], computed by S1 in the same iteration.
  • 2. S1 uses a value computed by S1 in an earlier iteration, since
    iteration i computes A[i+1], which is read in iteration i+1. The
    same is true of S2 for B[i] and B[i+1]. These are loop-carried
    dependences between iterations.
  • This implies that the iterations are dependent and can't be executed
    in parallel.
  • Note the contrast with our prior example, where each iteration was
    distinct.

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
22
When Safe to Unroll Loop?
Compilers and ILP
Loop Level Parallelism
  • Example: Where are the data dependencies? (A, B, C, D are distinct
    and non-overlapping.)
  • 1. There is no dependence from S1 to S2. If there were, there would
    be a cycle in the dependencies and the loop would not be parallel.
    Since this other dependence is absent, interchanging the two
    statements will not affect the execution of S2.
  • 2. On the first iteration of the loop, statement S1 depends on the
    value of B[1] computed prior to initiating the loop.

for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
23
Now Safe to Unroll Loop? (p. 240)
Compilers and ILP
Loop Level Parallelism
OLD:
for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
No circular dependencies, but the loop carries a dependence on B.

NEW:
A[1] = A[1] + B[1];
for (i = 1; i <= 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
This has eliminated the loop-carried dependence.
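A quick check (ours, not from the slides) that OLD and NEW compute the
same values; array sizes and test data are arbitrary assumptions:

    #include <stdio.h>
    #include <string.h>
    #define N 100

    int main(void)
    {
        double A1[N+2], B1[N+2], A2[N+2], B2[N+2], C[N+1], D[N+1];
        int i;
        for (i = 0; i <= N+1; i++) {          /* arbitrary test data */
            A1[i] = A2[i] = i * 0.5;
            B1[i] = B2[i] = i * 0.25;
            if (i <= N) { C[i] = i; D[i] = 2.0 * i; }
        }
        for (i = 1; i <= N; i++) {            /* OLD loop */
            A1[i] = A1[i] + B1[i];
            B1[i+1] = C[i] + D[i];
        }
        A2[1] = A2[1] + B2[1];                /* NEW loop */
        for (i = 1; i <= N-1; i++) {
            B2[i+1] = C[i] + D[i];
            A2[i+1] = A2[i+1] + B2[i+1];
        }
        B2[N+1] = C[N] + D[N];
        printf("match: %d\n", memcmp(A1, A2, sizeof A1) == 0 &&
                              memcmp(B1, B2, sizeof B1) == 0);
        return 0;
    }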
24
Example 1There are NO dependencies
Compilers and ILP
Loop Level Parallelism
/*
 * This is the example on page 305 of Hennessy & Patterson,
 * but running on an Intel machine.
 */
#define MAX  1000
#define ITER 100000

int main( int argc, char *argv[] )
{
    double x[MAX + 2];
    double s = 3.14159;
    int i, j;

    for ( i = MAX; i > 0; i-- )   /* Init array */
        x[i] = 0;
    for ( j = ITER; j > 0; j-- )
        for ( i = MAX; i > 0; i-- )
            x[i] = x[i] + s;
}
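Examples 2-4 below call get_current_time(), which the slides never
define. A plausible stand-in based on gettimeofday, wrapped here around
the Example 1 loop (the helper and its double-seconds convention are our
assumptions, not the course's actual code):

    #include <stdio.h>
    #include <sys/time.h>
    #define MAX  1000
    #define ITER 100000

    void get_current_time(double *t)   /* assumed helper */
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        *t = tv.tv_sec + tv.tv_usec / 1e6;
    }

    int main(void)
    {
        static double x[MAX + 2];
        double s = 3.14159, start_time, end_time;
        int i, j;
        for (i = MAX; i > 0; i--)
            x[i] = 0;
        get_current_time(&start_time);
        for (j = ITER; j > 0; j--)
            for (i = MAX; i > 0; i--)
                x[i] = x[i] + s;
        get_current_time(&end_time);
        printf("Elapsed seconds %f\n", end_time - start_time);
        return 0;
    }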

25

Elapsed seconds 0.122848
Compilers and ILP
This is the ICC optimized code:

.L2:
    fstpl   8(%esp,%edx,8)
    fldl    (%esp,%edx,8)
    fadd    %st(1), %st
    fldl    -8(%esp,%edx,8)
    fldl    -16(%esp,%edx,8)
    fldl    -24(%esp,%edx,8)
    fldl    -32(%esp,%edx,8)
    fxch    %st(4)
    fstpl   (%esp,%edx,8)
    fxch    %st(2)
    fadd    %st(4), %st
    fstpl   -8(%esp,%edx,8)
    fadd    %st(3), %st
    fstpl   -16(%esp,%edx,8)
    fadd    %st(2), %st
    fstpl   -24(%esp,%edx,8)
    fadd    %st(1), %st
    addl    $-5, %edx
    testl   %edx, %edx
    jg      .L2             # Prob 99%
    fstpl   8(%esp,%edx,8)
Loop Level Parallelism
Example 1
Elapsed seconds 0.590026

This is the GCC optimized code:

.L15:
    fldl    (%ecx,%eax)
    fadd    %st(1),%st
    decl    %edx
    fstpl   (%ecx,%eax)
    addl    $-8,%eax
    testl   %edx,%edx
    jg      .L15

26
Example 2
Compilers and ILP
Loop Level Parallelism
// Example on Page 320
get_current_time( &start_time );
for ( j = ITER; j > 0; j-- )
    for ( i = 1; i < MAX; i++ ) {
        A[i+1] = A[i] + C[i];
        B[i+1] = B[i] + A[i+1];
    }
get_current_time( &end_time );
There are two dependencies here; what are they?
27

Compilers and ILP
Elapsed seconds 0.664073
Loop Level Parallelism
This is the ICC optimized code:

.L4:
    fstpl   25368(%esp,%edx,8)
    fldl    8472(%esp,%edx,8)
    faddl   16920(%esp,%edx,8)
    fldl    25368(%esp,%edx,8)
    fldl    16928(%esp,%edx,8)
    fxch    %st(2)
    fstl    8480(%esp,%edx,8)
    fadd    %st, %st(1)
    fxch    %st(1)
    fstl    25376(%esp,%edx,8)
    fxch    %st(2)
    faddp   %st, %st(1)
    fstl    8488(%esp,%edx,8)
    faddp   %st, %st(1)
    addl    $2, %edx
    cmpl    $1000, %edx
    jle     .L4             # Prob 99%
    fstpl   25368(%esp,%edx,8)
Example 2
Elapsed seconds 1.357084

This is the GCC optimized code:

.L55:
    fldl    -8(%esi,%eax)
    faddl   -8(%edi,%eax)
    fstl    (%esi,%eax)
    faddl   -8(%ecx,%eax)
    incl    %edx
    fstpl   (%ecx,%eax)
    addl    $8,%eax
    cmpl    $1000,%edx
    jle     .L55

This is the Microsoft optimized code:

L1225:
    fld     QWORD PTR _C$[esp+eax+40108]
    add     eax, 8
    cmp     eax, 7992
    fadd    QWORD PTR _A$[esp+eax+40100]
    fst     QWORD PTR _A$[esp+eax+40108]
    fadd    QWORD PTR _B$[esp+eax+40100]
    fstp    QWORD PTR _B$[esp+eax+40108]
    jle     L1225
28
Example 3
Compilers and ILP
Loop Level Parallelism
// Example on Page 321
get_current_time( &start_time );
for ( j = ITER; j > 0; j-- )
    for ( i = 1; i < MAX; i++ ) {
        A[i] = A[i] + B[i];
        B[i+1] = C[i] + D[i];
    }
get_current_time( &end_time );
What are the dependencies here?
29

Elapsed seconds 0.325419
Compilers and ILP
This is the ICC optimized code:

.L6:
    fstpl   8464(%esp,%edx,8)
    fldl    8472(%esp,%edx,8)
    faddl   25368(%esp,%edx,8)
    fldl    16920(%esp,%edx,8)
    faddl   33824(%esp,%edx,8)
    fldl    8480(%esp,%edx,8)
    fldl    16928(%esp,%edx,8)
    faddl   33832(%esp,%edx,8)
    fxch    %st(3)
    fstpl   8472(%esp,%edx,8)
    fxch    %st(1)
    fstl    25376(%esp,%edx,8)
    fxch    %st(2)
    fstpl   25384(%esp,%edx,8)
    faddp   %st, %st(1)
    addl    $2, %edx
    cmpl    $1000, %edx
    jle     .L6             # Prob 99%
    fstpl   8464(%esp,%edx,8)
Loop Level Parallelism
Example 3
Elapsed seconds 1.370478

This is the GCC optimized code:

.L65:
    fldl    (%esi,%eax)
    faddl   (%ecx,%eax)
    fstpl   (%esi,%eax)
    movl    -40100(%ebp),%edi
    fldl    (%edi,%eax)
    movl    -40136(%ebp),%edi
    faddl   (%edi,%eax)
    incl    %edx
    fstpl   8(%ecx,%eax)
    addl    $8,%eax
    cmpl    $1000,%edx
    jle     .L65

30
Example 4
Compilers and ILP
Loop Level Parallelism
// Example on Page 322
get_current_time( &start_time );
for ( j = ITER; j > 0; j-- ) {
    A[1] = A[1] + B[1];
    for ( i = 1; i < MAX - 1; i++ ) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];
}
get_current_time( &end_time );

Elapsed seconds 1.200525
How many dependencies are here?
31

Compilers and ILP
Loop Level Parallelism
Example 4
Elapsed seconds 1.200525
This is the GCC optimized code:

.L75:
    movl    -40136(%ebp),%edi
    fldl    -8(%edi,%eax)
    faddl   -8(%esi,%eax)
    movl    -40104(%ebp),%edi
    fstl    (%edi,%eax)
    faddl   (%ecx,%eax)
    incl    %edx
    fstpl   (%ecx,%eax)
    addl    $8,%eax
    cmpl    $999,%edx
    jle     .L75

This is the Microsoft optimized code:

L1239:
    fld     QWORD PTR _D$[esp+eax+40108]
    add     eax, 8
    cmp     eax, 7984               ; 00001f30H
    fadd    QWORD PTR _C$[esp+eax+40100]
    fst     QWORD PTR _B$[esp+eax+40108]
    fadd    QWORD PTR _A$[esp+eax+40108]
    fstp    QWORD PTR _A$[esp+eax+40108]
    jle     SHORT L1239
32

Compilers and ILP
Elapsed seconds 0.359232
Loop Level Parallelism
CONTINUED:
    fstl    25376(%esp,%edx,8)
    fxch    %st(3)
    fstl    25384(%esp,%edx,8)
    fxch    %st(1)
    fstl    25392(%esp,%edx,8)
    fxch    %st(3)
    faddp   %st, %st(4)
    fxch    %st(3)
    fstpl   8480(%esp,%edx,8)
    faddp   %st, %st(2)
    fxch    %st(1)
    fstpl   8488(%esp,%edx,8)
    faddp   %st, %st(1)
    addl    $3, %edx
    cmpl    $999, %edx
    jle     .L8
    fstpl   8472(%esp,%edx,8)
Example 4
This is the ICC optimized code:

.L8:
    fstpl   8472(%esp,%edx,8)
    fldl    16920(%esp,%edx,8)
    faddl   33824(%esp,%edx,8)
    fldl    8480(%esp,%edx,8)
    fldl    16928(%esp,%edx,8)
    faddl   33832(%esp,%edx,8)
    fldl    8488(%esp,%edx,8)
    fldl    16936(%esp,%edx,8)
    faddl   33840(%esp,%edx,8)
    fldl    8496(%esp,%edx,8)
    fxch    %st(5)

33
Static Multiple Issue
Multiple issue is the ability of the processor to start more than one
instruction in a given cycle.

Flavor I: Superscalar processors issue a varying number of instructions
per clock; they can be either statically scheduled (by the compiler) or
dynamically scheduled (by the hardware). A superscalar issues a varying
number of instructions/cycle (1 to 8), scheduled by the compiler or by
HW (Tomasulo). Examples: IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000.
  • 4.1 Compiler Techniques for Exposing ILP
  • 4.3 Static Multiple Issue VLIW
  • 4.4 Advanced Compiler Support for ILP
  • 4.5 Hardware Support for Exposing more
    Parallelism

34
Issuing Multiple Instructions/Cycle
Multiple Issue
  • Flavor II:
  • VLIW - Very Long Instruction Word - issues a fixed number of
    instructions, formatted either as one very large instruction or as a
    fixed packet of smaller instructions.
  • A fixed number of instructions (4-16) are scheduled by the compiler,
    which puts the operations into wide templates.
  • Joint HP/Intel agreement in 1999/2000:
  • Intel Architecture-64 (IA-64), a 64-bit address ISA
  • Style: Explicitly Parallel Instruction Computer (EPIC)

35
Issuing Multiple Instructions/Cycle
Multiple Issue
  • Flavor II - continued:
  • 3 instructions in 128-bit groups; a field determines whether the
    instructions are dependent or independent
  • Smaller code size than old VLIW, larger than x86/RISC
  • Groups can be linked to show independence of more than 3 instructions
  • 64 integer registers + 64 floating point registers
  • Not separate register files per functional unit as in old VLIW
  • Hardware checks dependencies (interlocks => binary compatibility over
    time)
  • Predicated execution (select 1 out of 64 1-bit flags) => 40% fewer
    mispredictions?
  • IA-64 is the name of the instruction set architecture; EPIC is the
    style
  • Merced was the name of the first implementation (1999/2000)
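For concreteness, a sketch of how a 128-bit group can be carved up: a
5-bit template (unit types and dependence stops) plus three 41-bit
instruction slots, following the commonly documented IA-64 layout. The
struct and function names are ours.

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } bundle;   /* bits 0..63, 64..127 */

    unsigned bundle_template(bundle b)            /* template: bits 0..4 */
    {
        return (unsigned)(b.lo & 0x1f);
    }

    uint64_t bundle_slot(bundle b, int s)         /* three 41-bit slots */
    {
        const uint64_t mask = (1ULL << 41) - 1;
        switch (s) {
        case 0:  return (b.lo >> 5) & mask;                   /* bits  5..45  */
        case 1:  return ((b.lo >> 46) | (b.hi << 18)) & mask; /* bits 46..86  */
        default: return (b.hi >> 23) & mask;                  /* bits 87..127 */
        }
    }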

36
Issuing Multiple Instructions/Cycle
Multiple Issue
A SuperScalar Version of MIPS
  • In our MIPS example, we can handle 2 instructions/cycle:
  • one Floating Point
  • one of Anything Else
  • Fetch 64 bits/clock cycle; Int on left, FP on right
  • Can only issue the 2nd instruction if the 1st instruction issues
  • More ports are needed on the FP registers to do an FP load and an FP
    op in a pair

Type              Pipe Stages
Int. instruction  IF ID EX MEM WB
FP instruction    IF ID EX MEM WB
Int. instruction     IF ID EX MEM WB
FP instruction       IF ID EX MEM WB
Int. instruction        IF ID EX MEM WB
FP instruction          IF ID EX MEM WB

  • A 1-cycle load delay delays 3 instructions in a superscalar:
  • the instruction in the right half of the pair can't use the result,
    nor can the instructions in the next slot.

37
Unrolled Loop Minimizes Stalls for Scalar
Multiple Issue
A SuperScalar Version of MIPS
1 Loop:  LD    F0,0(R1)
2        LD    F6,-8(R1)
3        LD    F10,-16(R1)
4        LD    F14,-24(R1)
5        ADDD  F4,F0,F2
6        ADDD  F8,F6,F2
7        ADDD  F12,F10,F2
8        ADDD  F16,F14,F2
9        SD    0(R1),F4
10       SD    -8(R1),F8
11       SD    -16(R1),F12
12       SUBI  R1,R1,#32
13       BNEZ  R1,LOOP
14       SD    8(R1),F16   ;8 - 32 = -24

14 clock cycles, or 3.5 per iteration

Latencies: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles
38
Loop Unrolling in Superscalar
Multiple Issue
A SuperScalar Version of MIPS
  Integer instruction     FP instruction      Clock cycle
  Loop: LD F0,0(R1)                           1
        LD F6,-8(R1)                          2
        LD F10,-16(R1)    ADDD F4,F0,F2       3
        LD F14,-24(R1)    ADDD F8,F6,F2       4
        LD F18,-32(R1)    ADDD F12,F10,F2     5
        SD 0(R1),F4       ADDD F16,F14,F2     6
        SD -8(R1),F8      ADDD F20,F18,F2     7
        SD -16(R1),F12                        8
        SD -24(R1),F16                        9
        SUBI R1,R1,#40                        10
        BNEZ R1,LOOP                          11
        SD 8(R1),F20                          12

  • Unrolled 5 times to avoid delays (+1 due to SS)
  • 12 clocks, or 2.4 clocks per iteration

39
Dynamic Scheduling in Superscalar
Multiple Issue
Multiple Instruction Issue Dynamic Scheduling
  • Code compiled for the scalar version will run poorly on a superscalar
  • May want code to vary depending on how wide the superscalar is
  • Simple approach: separate Tomasulo control, with separate reservation
    stations for the Integer FU/registers and for the FP FU/registers

40
Dynamic Scheduling in Superscalar
Multiple Issue
Multiple Instruction Issue Dynamic Scheduling
  • How do we do instruction issue with two instructions and keep
    in-order instruction issue for Tomasulo?
  • Issue at 2X the clock rate, so that issue remains in order
  • Only FP loads might cause a dependency between integer and FP issue:
  • Replace the load reservation station with a load queue; operands must
    be read in the order they are fetched
  • A load checks addresses in the Store Queue to avoid a RAW violation
  • A store checks addresses in the Load Queue to avoid WAR, WAW
    violations
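A toy C model of those two address checks (structure and names are ours,
far simpler than real hardware):

    #include <stdint.h>
    #define QSIZE 16

    typedef struct { uint32_t addr; int valid; } qentry;
    typedef struct { qentry load_q[QSIZE]; qentry store_q[QSIZE]; } lsq;

    /* A load must wait if an uncommitted store hits its address (RAW). */
    int load_must_wait(const lsq *q, uint32_t addr)
    {
        for (int i = 0; i < QSIZE; i++)
            if (q->store_q[i].valid && q->store_q[i].addr == addr)
                return 1;
        return 0;
    }

    /* A store must wait if a pending load hits its address (WAR/WAW). */
    int store_must_wait(const lsq *q, uint32_t addr)
    {
        for (int i = 0; i < QSIZE; i++)
            if (q->load_q[i].valid && q->load_q[i].addr == addr)
                return 1;
        return 0;
    }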

41
Performance of Dynamic Superscalar
Multiple Issue
Multiple Instruction Issue Dynamic Scheduling
  Iteration  Instruction     Issues  Executes  Writes result
  no.                            (clock-cycle number)
  1          LD F0,0(R1)       1       2         4
  1          ADDD F4,F0,F2     1       5         8
  1          SD 0(R1),F4       2       9
  1          SUBI R1,R1,#8     3       4         5
  1          BNEZ R1,LOOP      4       5
  2          LD F0,0(R1)       5       6         8
  2          ADDD F4,F0,F2     5       9         12
  2          SD 0(R1),F4       6       13
  2          SUBI R1,R1,#8     7       8         9
  2          BNEZ R1,LOOP      8       9

  • 4 clocks per iteration
  • Branches, decrements still take 1 clock cycle

42
Loop Unrolling in VLIW
Multiple Issue
VLIW
  Memory          Memory          FP               FP               Int. op/        Clock
  reference 1     reference 2     operation 1      op. 2            branch
  LD F0,0(R1)     LD F6,-8(R1)                                                      1
  LD F10,-16(R1)  LD F14,-24(R1)                                                    2
  LD F18,-32(R1)  LD F22,-40(R1)  ADDD F4,F0,F2    ADDD F8,F6,F2                    3
  LD F26,-48(R1)                  ADDD F12,F10,F2  ADDD F16,F14,F2                  4
                                  ADDD F20,F18,F2  ADDD F24,F22,F2                  5
  SD 0(R1),F4     SD -8(R1),F8    ADDD F28,F26,F2                                   6
  SD -16(R1),F12  SD -24(R1),F16                                                    7
  SD -32(R1),F20  SD -40(R1),F24                                    SUBI R1,R1,#48  8
  SD -0(R1),F28                                                     BNEZ R1,LOOP    9

  • Unrolled 7 times to avoid delays
  • 7 results in 9 clocks, or 1.3 clocks per iteration
  • Need more registers to use VLIW effectively

43
Limits to Multi-Issue Machines
Multiple Issue
Limitations With Multiple Issue
  • Inherent limitations of ILP:
  • 1 branch in 5 instructions => how do we keep a 5-way VLIW busy?
  • Latencies of units => many operations must be scheduled
  • Need about (Pipeline Depth x No. of Functional Units) independent
    operations to keep the machine busy
  • Difficulties in building HW:
  • Duplicate functional units to get parallel execution
  • Increased ports to the register file (the VLIW example needs 6 reads
    and 3 writes for the integer registers, 6 reads and 4 writes for the
    FP registers)
  • Increased ports to memory
  • Decoding in SS and its impact on clock rate, pipeline depth

44
Limits to Multi-Issue Machines
Multiple Issue
Limitations With Multiple Issue
  • Limitations specific to either the SS or VLIW implementation:
  • Decode/issue in SS
  • VLIW code size: unrolled loops + wasted fields in the VLIW word
  • VLIW lock step => 1 hazard and all instructions stall
  • VLIW binary compatibility

45
Multiple Issue Challenges
Multiple Issue
Limitations With Multiple Issue
  • While the Integer/FP split is simple for the HW, we get a CPI of 0.5
    only for programs with:
  • Exactly 50% FP operations
  • No hazards
  • If more instructions issue at the same time, there is greater
    difficulty in decode and issue:
  • Even a 2-scalar => examine 2 opcodes and 6 register specifiers, and
    decide if 1 or 2 instructions can issue
  • VLIW: trade off instruction space for simple decoding
  • The long instruction word has room for many operations
  • By definition, all the operations the compiler puts in the long
    instruction word are independent => execute in parallel
  • E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
  • 16 to 24 bits per field => 7 x 16 or 112 bits to 7 x 24 or 168 bits
    wide
  • Need a compiling technique that schedules across several branches

46
Compiler Support For ILP
  • 4.1 Compiler Techniques for Exposing ILP
  • 4.3 Static Multiple Issue VLIW
  • 4.4 Advanced Compiler Support for ILP
  • 4.5 Hardware Support for Exposing more
    Parallelism

How can compilers be smart?
1. Produce good scheduling of code.
2. Determine which loops might contain parallelism.
3. Eliminate name dependencies.
Compilers must be REALLY smart to figure out aliases -- pointers in C
are a real problem. The techniques lead to:
  Symbolic Loop Unrolling
  Critical Path Scheduling
47
Software Pipelining
Compiler Support For ILP
Symbolic Loop Unrolling
  • Observation: if iterations from loops are independent, then we can
    get ILP by taking instructions from different iterations
  • Software pipelining reorganizes loops so that each iteration is made
    from instructions chosen from different iterations of the original
    loop (Tomasulo in SW)

48
SW Pipelining Example
Compiler Support For ILP
Symbolic Loop Unrolling
  • Before Unrolled 3 times
  • 1 LD F0,0(R1)
  • 2 ADDD F4,F0,F2
  • 3 SD 0(R1),F4
  • 4 LD F6,-8(R1)
  • 5 ADDD F8,F6,F2
  • 6 SD -8(R1),F8
  • 7 LD F10,-16(R1)
  • 8 ADDD F12,F10,F2
  • 9 SD -16(R1),F12
  • 10 SUBI R1,R1,24
  • 11 BNEZ R1,LOOP

After: Software Pipelined
         LD    F0,0(R1)     ; prologue
         ADDD  F4,F0,F2
         LD    F0,-8(R1)
1        SD    0(R1),F4     ; Stores M[i]
2        ADDD  F4,F0,F2     ; Adds to M[i-1]
3        LD    F0,-16(R1)   ; Loads M[i-2]
4        SUBI  R1,R1,#8
5        BNEZ  R1,LOOP
         SD    0(R1),F4     ; epilogue
         ADDD  F4,F0,F2
         SD    -8(R1),F4

(Diagram: the SD, ADDD, and LD pipelines (IF ID EX Mem WB) overlap;
ADDD writes F4 before SD reads it, and LD writes F0 before ADDD reads
it.)
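The same schedule, sketched at the source level (names and bounds are
ours; assumes n >= 3): each steady-state pass stores element i, adds
element i-1, and loads element i-2.

    void sw_pipelined(double x[], int n, double s)
    {
        double f0, f4;
        int i;
        f4 = x[n] + s;              /* prologue: fill the pipeline  */
        f0 = x[n - 1];
        for (i = n; i >= 3; i--) {
            x[i] = f4;              /* store for iteration i   */
            f4 = f0 + s;            /* add for iteration i-1   */
            f0 = x[i - 2];          /* load for iteration i-2  */
        }
        x[2] = f4;                  /* epilogue: drain the pipeline */
        x[1] = f0 + s;
    }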
49
SW Pipelining Example
Compiler Support For ILP
Symbolic Loop Unrolling
  • Symbolic loop unrolling:
  • Less code space
  • Overhead paid only once, vs. on each iteration in loop unrolling

Software Pipelining vs. Loop Unrolling
100 iterations = 25 loops with 4 unrolled iterations each
50
Trace Scheduling
Compiler Support For ILP
Critical Path Scheduling
  • Parallelism across IF branches vs. LOOP branches
  • Two steps:
  • Trace Selection
  • Find a likely sequence of basic blocks (a trace) forming a
    (statically predicted or profile predicted) long sequence of
    straight-line code
  • Trace Compaction
  • Squeeze the trace into a few VLIW instructions
  • Need bookkeeping code in case the prediction is wrong
  • The compiler undoes a bad guess (discards values in registers)
  • Subtle compiler bugs mean a wrong answer vs. merely poorer
    performance; there are no hardware interlocks

51
Hardware Support For Parallelism
  • 4.1 Compiler Techniques for Exposing ILP
  • 4.3 Static Multiple Issue VLIW
  • 4.4 Advanced Compiler Support for ILP
  • 4.5 Hardware Support for Exposing more
    Parallelism
  • Software support of ILP is best when code is predictable at compile
    time.
  • But what if there's no predictability?
  • Here we'll talk about hardware techniques. These include:
  • Conditional or Predicated Instructions
  • Hardware Speculation

52
Tell the Hardware To Ignore An Instruction
Hardware Support For Parallelism
Nullified Instructions
  • Avoid branch prediction by turning branches into conditionally
    executed instructions:
  •   if (x) then A = B op C else NOP
  • If false, then neither store the result nor cause an exception
  • Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional
    move. PA-RISC can annul any following instruction.
  • IA-64: 64 1-bit condition fields selected, so conditional execution
    of any instruction
  • Drawbacks to conditional instructions:
  • Still takes a clock, even if annulled
  • Stalls if the condition is evaluated late
  • Complex conditions reduce effectiveness; the condition becomes known
    late in the pipeline
  • This can be a major win because there is no time lost by taking a
    branch!!
53
Tell the Hardware To Ignore An Instruction
Hardware Support For Parallelism
Nullified Instructions
  • Suppose we have the code:
  •   if ( VarA == 0 )
  •       VarS = VarT;

Previous Method:
    LD    R1, VarA
    BNEZ  R1, Label
    LD    R2, VarT
    SD    VarS, R2
Label:

Nullified Method:
    LD      R1, VarA
    LD      R2, VarT
    CMPNNZ  R1, #0       ; Compare and nullify next instr. if not zero
    SD      VarS, R2
Label:

Conditional Move Method:
    LD     R1, VarA
    LD     R2, VarT
    CMOVZ  VarS, R2, R1  ; Compare and move if zero
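In C terms (a sketch using the slide's variable names): a compiler can
compile the ternary form without a branch, mapping it to CMOVZ-style
code.

    int VarA, VarS, VarT;

    void with_branch(void)    { if (VarA == 0) VarS = VarT; }

    /* Branch-free form a compiler can turn into a conditional move. */
    void without_branch(void) { VarS = (VarA == 0) ? VarT : VarS; }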
54
Hardware Support For Parallelism
Compiler Speculation
Increasing Parallelism
  • The idea is to move an instruction across a branch so as to increase
    the size of a basic block and thus increase parallelism.
  • The primary difficulty is in avoiding exceptions. For example,
  •   if ( a != 0 ) c = b / a;
    may cause a divide-by-zero error in some cases if the divide is
    moved above the test.
  • Methods for increasing speculation include:
  • 1. Use a set of status bits (poison bits) associated with the
    registers. They signal that an instruction's result is invalid until
    some later time.
  • 2. The result of an instruction isn't written until it's certain
    that the instruction is no longer speculative.
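A toy model of method 1 (poison bits); the representation and the fault
check are our assumptions, not real hardware:

    #include <stdlib.h>

    typedef struct {
        double val;
        int    poison;   /* set if a speculative instruction faulted */
    } reg;

    /* Speculative load: on a fault, poison the register instead of
       trapping immediately. */
    void spec_load(reg *dst, const double *addr)
    {
        if (addr == 0) {          /* stand-in for "this access would fault" */
            dst->poison = 1;
            return;
        }
        dst->val = *addr;
        dst->poison = 0;
    }

    /* A non-speculative use of a poisoned register raises the exception. */
    double use(reg r)
    {
        if (r.poison)
            abort();              /* deliver the deferred exception */
        return r.val;
    }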

55
Hardware Support For Parallelism
Compiler Speculation
Increasing Parallelism
Original Code:
        LW    R1, 0(R3)     ; Load A
        BNEZ  R1, L1        ; Test A
        LW    R1, 0(R2)     ; If clause
        J     L2            ; Skip else
L1:     ADDI  R1, R1, #4    ; Else clause
L2:     SW    0(R3), R1     ; Store A

  • Example on Page 305.
  • Code for:
  •   if ( A == 0 )
  •       A = B;
  •   else
  •       A = A + 4;
  • Assume A is at 0(R3) and B is at 0(R2).

Speculated Code:
        LW    R1, 0(R3)     ; Load A
        LW    R14, 0(R2)    ; Speculative load of B
        BEQZ  R1, L3        ; Other side of the if
        ADDI  R14, R1, #4   ; Else clause
L3:     SW    0(R3), R14    ; Non-speculative store

Note that only ONE side needs to take a branch!!
56
Hardware Support For Parallelism
Compiler Speculation
Poison Bits
Speculated Code:
        LW    R1, 0(R3)     ; Load A
        LW    R14, 0(R2)    ; Speculative load of B
        BEQZ  R1, L3        ; Other side of the if
        ADDI  R14, R1, #4   ; Else clause
L3:     SW    0(R3), R14    ; Non-speculative store

  • In the example on the last page, if the LW produces an exception, a
    poison bit is set on that register. Then, if a later instruction
    tries to use the register, an exception is raised at that point.

57
HW support for More ILP
Hardware Support For Parallelism
Hardware Speculation
  • Need a HW buffer for the results of uncommitted instructions: the
    reorder buffer
  • The reorder buffer can be an operand source
  • Once an operand commits, the result is found in the register file
  • 3 fields: instruction type, destination, value
  • Use the reorder buffer number instead of the reservation station
    number
  • Discard instructions on mis-predicted branches or on exceptions
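A minimal sketch of a reorder-buffer entry with the three fields named
above plus a ready flag (field names and types are our assumptions):

    typedef enum { I_BRANCH, I_STORE, I_ALU } instr_type;

    typedef struct {
        instr_type type;    /* instruction type                            */
        int        dest;    /* destination register (or store address tag) */
        double     value;   /* result, filled in when execution completes  */
        int        ready;   /* result available but not yet committed      */
    } rob_entry;

    /* Until commit, later instructions read operands from the ROB entry
       (by ROB number) rather than from the register file. */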

58
HW support for More ILP
Hardware Support For Parallelism
Hardware Speculation
  • How is this used in practice?
  • Rather than predicting the direction of a branch, execute the
    instructions on both sides!!
  • We know the target of a branch early on, long before we know whether
    it will be taken.
  • So begin fetching/executing at that new target PC.
  • But also continue fetching/executing as if the branch were NOT taken.

59
Summary
  • 4.1 Compiler Techniques for Exposing ILP
  • 4.3 Static Multiple Issue VLIW
  • 4.4 Advanced Compiler Support for ILP
  • 4.5 Hardware Support for Exposing more Parallelism