Title: COMP 740: Computer Architecture and Implementation
1. COMP 740: Computer Architecture and Implementation
- Montek Singh
- Tue, Feb 24, 2009
- Topic: Instruction-Level Parallelism IV (Software Approaches/Compiler Techniques)
2. Outline
- Motivation
- Compiler scheduling
- Loop unrolling
- Software pipelining
3. Review: Instruction-Level Parallelism (ILP)
- Pipelining is most effective when there is parallelism among instructions
  - instructions u and v are parallel if neither is dependent on the other
- Problem: parallelism within a basic block is limited
  - a branch frequency of 15% implies about 6 instructions in a basic block
  - these instructions are likely to depend on each other
  - need to look beyond basic blocks
- Solution: exploit loop-level parallelism
  - i.e., parallelism across loop iterations
  - to convert loop-level parallelism into ILP, need to unroll the loop
    - dynamically, by the hardware
    - statically, by the compiler
  - or use vector instructions: same op applied to all vector elements
4. Motivating Example for Loop Unrolling
for (i = 1000; i > 0; i--) x[i] = x[i] + s;
- Assumptions
  - Scalar s is in register F2
  - Array x starts at memory address 0
  - 1-cycle branch delay
  - No structural hazards
10 cycles per iteration (unscheduled code)
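The 10-cycle figure can be reproduced with a small stall-accounting model. The sketch below is only a model, not the pipeline itself. It assumes single issue, in-order execution, the stall counts used in these slides (1 cycle between L.D and a dependent ADD.D, 2 between ADD.D and a dependent S.D, 1 between DADDUI and a dependent BNEZ), and an explicit NOP in the branch delay slot. The names `STALLS` and `count_cycles` are illustrative, not part of the lecture.

```python
# Extra stall cycles required between a producer and a dependent consumer.
# Values are the ones assumed in these slides; any other pair costs nothing.
STALLS = {
    ("L.D", "ADD.D"): 1,
    ("ADD.D", "S.D"): 2,
    ("DADDUI", "BNEZ"): 1,
}

def count_cycles(program):
    """Return the cycle in which the last instruction issues.

    Each instruction is (opcode, destination register or None, source registers).
    Only true (read-after-write) dependences delay issue in this model.
    """
    last_writer = {}   # register -> (opcode, issue cycle) of its latest producer
    cycle = 0
    for opcode, dest, sources in program:
        issue = cycle + 1                      # in order, one instruction per cycle
        for reg in sources:
            if reg in last_writer:
                producer, produced_at = last_writer[reg]
                gap = STALLS.get((producer, opcode), 0)
                issue = max(issue, produced_at + 1 + gap)
        if dest is not None:
            last_writer[dest] = (opcode, issue)
        cycle = issue
    return cycle

# Straightforward translation of: for (i = 1000; i > 0; i--) x[i] = x[i] + s;
original_loop = [
    ("L.D",    "F0", ["R1"]),
    ("ADD.D",  "F4", ["F0", "F2"]),
    ("S.D",    None, ["F4", "R1"]),
    ("DADDUI", "R1", ["R1"]),
    ("BNEZ",   None, ["R1"]),
    ("NOP",    None, []),           # branch delay slot
]

print(count_cycles(original_loop))  # 10
```

Of the 10 cycles, only 3 (load, add, store) do useful work on the array; the remaining 7 are stalls and loop overhead.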
5. How Far Can We Get With Scheduling?
Original loop:
LOOP: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    0(R1), F4
      DADDUI R1, R1, -8
      BNEZ   R1, LOOP
      NOP
Scheduled loop:
LOOP: L.D    F0, 0(R1)
      DADDUI R1, R1, -8
      ADD.D  F4, F0, F2
      nop
      BNEZ   R1, LOOP
      S.D    8(R1), F4
6 cycles per iteration (scheduled loop)
Note the change in the S.D instruction, from 0(R1) to 8(R1); this is a non-trivial change!
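The offset change can be checked by executing both instruction orders on a toy memory. The sketch below makes several assumptions: memory is a Python dict keyed by byte address, timing is ignored, and `run_original` and `run_scheduled` are illustrative names. It only shows that storing to 8(R1) after R1 has been decremented reaches the same address as storing to 0(R1) before the decrement.

```python
def run_original(mem, s, r1):
    """L.D, ADD.D, S.D 0(R1), DADDUI, BNEZ: the store happens before R1 changes."""
    while r1 != 0:
        f0 = mem[r1]          # L.D    F0, 0(R1)
        f4 = f0 + s           # ADD.D  F4, F0, F2
        mem[r1] = f4          # S.D    0(R1), F4
        r1 -= 8               # DADDUI R1, R1, -8
    return mem

def run_scheduled(mem, s, r1):
    """L.D, DADDUI, ADD.D, BNEZ, S.D 8(R1): the store happens after R1 changes."""
    while True:
        f0 = mem[r1]          # L.D    F0, 0(R1)
        r1 -= 8               # DADDUI R1, R1, -8
        f4 = f0 + s           # ADD.D  F4, F0, F2
        taken = r1 != 0       # BNEZ   R1, LOOP
        mem[r1 + 8] = f4      # S.D    8(R1), F4  (delay slot, always executes)
        if not taken:
            break
    return mem

# Four doubles at byte addresses 8, 16, 24, 32 (array x starts at address 0).
initial = {addr: float(addr) for addr in range(8, 40, 8)}
a = run_original(dict(initial), 1.5, 32)
b = run_scheduled(dict(initial), 1.5, 32)
print(a == b)  # True
```

A compiler has to prove the same thing symbolically: the DADDUI subtracts 8 from R1, so a store moved below it must add 8 to its offset.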
6. Observations on Scheduled Code
- 3 out of 5 instructions involve FP work
- The other two constitute loop overhead
- Could we improve performance by unrolling the loop?
  - assume number of loop iterations is a multiple of 4, and unroll loop body four times
  - in real life, must also handle loop counts that are not multiples of 4
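Handling a loop count that is not a multiple of 4 usually means adding a short clean-up loop next to the unrolled one. The sketch below shows one common arrangement at source level. It is written in Python for readability, it counts upward rather than downward, and the function names are illustrative, so treat it as an outline of the idea rather than what a particular compiler emits.

```python
def add_scalar(x, s):
    """Reference loop: x[i] = x[i] + s for every element."""
    for i in range(len(x)):
        x[i] = x[i] + s
    return x

def add_scalar_unrolled4(x, s):
    """Same computation with the loop body unrolled four times."""
    n = len(x)
    leftover = n % 4
    # Clean-up loop: the n mod 4 iterations that do not fill a group of four.
    for i in range(leftover):
        x[i] = x[i] + s
    # Unrolled loop: one test and branch per four elements.
    for i in range(leftover, n, 4):
        x[i] = x[i] + s
        x[i + 1] = x[i + 1] + s
        x[i + 2] = x[i + 2] + s
        x[i + 3] = x[i + 3] + s
    return x

data = [float(v) for v in range(10)]      # 10 is not a multiple of 4
print(add_scalar_unrolled4(list(data), 2.0) == add_scalar(list(data), 2.0))  # True
```

With 10 elements the clean-up loop runs twice and the unrolled loop runs twice, covering all 10 elements exactly once.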
7. Unrolling: Take 1
- Even though we have gotten rid of the control dependences, we have data dependences through R1
- We could remove data dependences by observing that R1 is decremented by 8 each time
  - Adjust the address specifiers
  - Delete the first three DADDUIs
  - Change the constant in the fourth DADDUI to -32
- These are non-trivial inferences for a compiler to make
LOOP: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    0(R1), F4
      DADDUI R1, R1, -8
      L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    0(R1), F4
      DADDUI R1, R1, -8
      L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    0(R1), F4
      DADDUI R1, R1, -8
      L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    0(R1), F4
      DADDUI R1, R1, -8
      BNEZ   R1, LOOP
      NOP
8. Unrolling: Take 2
- Performance is now limited by the WAR dependences on F0
  - These are name dependences
    - The instructions are not in a producer-consumer relation
    - They are simply using the same registers, but they don't have to
    - We can use different registers in different loop iterations, subject to availability
- Let's rename registers
LOOP: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    0(R1), F4
      L.D    F0, -8(R1)
      ADD.D  F4, F0, F2
      S.D    -8(R1), F4
      L.D    F0, -16(R1)
      ADD.D  F4, F0, F2
      S.D    -16(R1), F4
      L.D    F0, -24(R1)
      ADD.D  F4, F0, F2
      S.D    -24(R1), F4
      DADDUI R1, R1, -32
      BNEZ   R1, LOOP
      NOP
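The difference between a true data dependence and a name dependence can be made concrete with a small classifier. The sketch below assumes each instruction is reduced to a destination register and a list of source registers; `classify` is an illustrative helper, not part of any toolchain.

```python
def classify(earlier, later):
    """Return the dependences of `later` on `earlier` as (kind, register) pairs.

    Each instruction is (destination register or None, list of source registers).
    """
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    found = []
    if e_dest is not None and e_dest in l_srcs:
        found.append(("RAW", e_dest))     # true data dependence
    if l_dest is not None and l_dest in e_srcs:
        found.append(("WAR", l_dest))     # antidependence (a name dependence)
    if l_dest is not None and l_dest == e_dest:
        found.append(("WAW", l_dest))     # output dependence (a name dependence)
    return found

add_first    = ("F4", ["F0", "F2"])   # ADD.D F4, F0, F2   (first copy)
load_next    = ("F0", ["R1"])         # L.D   F0, -8(R1)   (second copy)
load_renamed = ("F6", ["R1"])         # L.D   F6, -8(R1)   (second copy, renamed)

print(classify(add_first, load_next))      # [('WAR', 'F0')]
print(classify(add_first, load_renamed))   # []
```

Renaming the second copy's load target from F0 to F6 removes the WAR dependence without changing what is computed, which is what frees the scheduler to move that load earlier.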
9. Unrolling: Take 3
- Time for execution of 4 iterations
  - 14 instruction cycles
  - 4 L.D→ADD.D stalls
  - 8 ADD.D→S.D stalls
  - 1 DADDUI→BNEZ stall
  - 1 branch delay stall (NOP)
- 28 cycles for 4 iterations, or 7 cycles per iteration
- Slower than scheduled version of original loop, which needed 6 cycles per iteration
- Let's schedule the unrolled loop
LOOP: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    0(R1), F4
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    -8(R1), F8
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    -16(R1), F12
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    -24(R1), F16
      DADDUI R1, R1, -32
      BNEZ   R1, LOOP
      NOP
10. Unrolling: Take 4
- This code runs without stalls
  - 14 cycles for 4 iterations
  - 3.5 cycles per iteration
  - loop control overhead once every four iterations
- Note that original loop had three FP instructions that were not independent
- Loop unrolling exposed independent instructions from multiple loop iterations
- By unrolling further, can approach asymptotic rate of 3 cycles per iteration
  - Subject to availability of registers
LOOP: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      S.D    0(R1), F4
      S.D    -8(R1), F8
      DADDUI R1, R1, -32
      S.D    16(R1), F12
      BNEZ   R1, LOOP
      S.D    8(R1), F16
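The cycle counts on these slides follow a simple pattern in the unroll factor k. The sketch below writes that pattern down as two formulas. Both are assumptions built from the stall counts used here: the unscheduled formula charges 6 cycles per copy plus 4 per loop, and the scheduled formula assumes that a stall-free schedule like the one above can be found for every k of 2 or more and that enough registers are available. The function names are illustrative.

```python
def cycles_unscheduled(k):
    """Cycles for k unrolled copies executed in program order.

    Per copy: L.D, stall, ADD.D, stall, stall, S.D (6 cycles).
    Once per loop: DADDUI, stall, BNEZ, branch delay slot (4 cycles).
    """
    return 6 * k + 4

def cycles_scheduled(k):
    """Cycles for k >= 2 copies, assuming a stall-free schedule exists.

    Per copy: L.D, ADD.D, S.D (3 cycles); once per loop: DADDUI, BNEZ (2 cycles).
    """
    if k < 2:
        raise ValueError("with one copy the scheduled loop still has a stall")
    return 3 * k + 2

for k in (2, 4, 8, 16):
    print(k, cycles_unscheduled(k) / k, cycles_scheduled(k) / k)
```

For k = 4 this gives 7 and 3.5 cycles per iteration, matching the slides. The scheduled cost per iteration is 3 + 2/k, so each doubling of k halves the remaining overhead, and the rate approaches 3 cycles per iteration without ever reaching it.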
11. What Did The Compiler Have To Do?
- Determine it was legal to move the S.D
  - after the DADDUI and BNEZ
  - and find the amount to adjust the S.D offset
- Determine that loop unrolling would be useful
  - by discovering independence of loop iterations
- Rename registers to avoid name dependences
- Eliminate extra tests and branches and adjust loop control
- Determine that L.Ds and S.Ds can be interchanged
  - by determining that (since R1 is not being updated) the address specifiers 0(R1), -8(R1), -16(R1), -24(R1) all refer to different memory locations
- Schedule the code, preserving dependences
12. Limits to Gain from Loop Unrolling
- Benefit of reduction in loop overhead tapers off
  - Amount of overhead amortized diminishes with successive unrolls
- Code size limitations
  - For larger loops, code size growth is a concern
    - Especially for embedded processors with limited memory
  - Instruction cache miss rate increases
- Architectural/compiler limitations
  - Register pressure
    - Need many registers to exploit ILP
    - Especially challenging in multiple-issue architectures
13. Dependences
- Three kinds of dependences
  - Data dependence
  - Name dependence
  - Control dependence
- In the context of loop-level parallelism, data dependence can be
  - Loop-independent
  - Loop-carried
- Data dependences act as a limit on how much ILP can be exploited in a compiled program
  - Compiler tries to identify and eliminate dependences
  - Hardware tries to prevent dependences from becoming stalls
14. Control Dependences
- A control dependence determines the ordering of an instruction with respect to a branch instruction so that the non-branch instruction is executed only when it should be
  - if (p1) s1
  - if (p2) s2
- Control dependence constrains code motion
  - An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch
  - An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch
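15. Data Dependence in Loop Iterations
First loop:
  A[u+1] = A[u] + C[u]
  B[u+1] = B[u] + A[u+1]
Second loop:
  A[u] = A[u] + B[u]
  B[u+1] = C[u] + D[u]
Second loop, transformed:
  B[u+1] = C[u] + D[u]
  A[u+1] = A[u+1] + B[u+1]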
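Whether a dependence is loop-independent or loop-carried can be read off from the array subscripts. The sketch below is a simplified analysis that handles only subscripts of the form u + constant and only flow (read-after-write) dependences. The statement labels S1 and S2 and the helper `flow_dependences` are illustrative. It uses only which arrays each statement reads and writes, not the arithmetic operators.

```python
def flow_dependences(statements):
    """List flow (read-after-write) dependences in a loop body.

    Each statement is (name, write, reads); an access is (array, offset),
    meaning array[u + offset] in iteration u. Returns tuples
    (writer, reader, array, distance); distance 0 is loop-independent,
    distance > 0 is loop-carried.
    """
    deps = []
    for w_pos, (w_name, (w_arr, w_off), _) in enumerate(statements):
        for r_pos, (r_name, _, reads) in enumerate(statements):
            for r_arr, r_off in reads:
                if r_arr != w_arr:
                    continue
                distance = w_off - r_off   # iterations between write and read
                if distance > 0 or (distance == 0 and w_pos < r_pos):
                    deps.append((w_name, r_name, w_arr, distance))
    return deps

first_loop = [
    ("S1", ("A", 1), [("A", 0), ("C", 0)]),   # writes A[u+1], reads A[u], C[u]
    ("S2", ("B", 1), [("B", 0), ("A", 1)]),   # writes B[u+1], reads B[u], A[u+1]
]
second_loop = [
    ("S1", ("A", 0), [("A", 0), ("B", 0)]),   # writes A[u], reads A[u], B[u]
    ("S2", ("B", 1), [("C", 0), ("D", 0)]),   # writes B[u+1], reads C[u], D[u]
]

print(flow_dependences(first_loop))
print(flow_dependences(second_loop))
```

In the first loop, S1 and S2 each depend on their own result from the previous iteration (distance 1), so the carried dependences form a cycle. In the second loop the only carried dependence runs from S2 to S1, with no cycle, which is what makes it possible to restructure that loop.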
16. Loop Transformation
- Sometimes loop-carried dependence does not prevent loop parallelization
  - Example: Second loop of previous slide
- In other cases, loop-carried dependence prohibits loop parallelization
  - Example: First loop of previous slide
A[u] = A[u] + B[u]
B[u+1] = C[u] + D[u]
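One way to see that the second loop can be parallelized is to restructure it so the dependence on B stays inside a single iteration, then run the iterations in a different order and compare results. The sketch below assumes 1-based arrays, an iteration range of 1 to n, and addition as the operator. The function names and test data are made up for illustration.

```python
def second_loop(A, B, C, D, n):
    """Original form: A[u] uses the B[u] written by the previous iteration."""
    for u in range(1, n + 1):
        A[u] = A[u] + B[u]
        B[u + 1] = C[u] + D[u]

def second_loop_transformed(A, B, C, D, n):
    """Peel one statement off each end so the dependence stays inside an iteration."""
    A[1] = A[1] + B[1]
    # Iterations are now independent of each other, so any order works;
    # running them backwards makes that visible.
    for u in reversed(range(1, n)):
        B[u + 1] = C[u] + D[u]
        A[u + 1] = A[u + 1] + B[u + 1]
    B[n + 1] = C[n] + D[n]

def make_arrays(n):
    # Index 0 is unused so that indices match the 1-based loops above.
    A = [float(i) for i in range(n + 2)]
    B = [float(2 * i) for i in range(n + 2)]
    C = [float(3 * i) for i in range(n + 2)]
    D = [float(5 * i) for i in range(n + 2)]
    return A, B, C, D

n = 100
A1, B1, C1, D1 = make_arrays(n)
A2, B2, C2, D2 = make_arrays(n)
second_loop(A1, B1, C1, D1, n)
second_loop_transformed(A2, B2, C2, D2, n)
print(A1 == A2 and B1 == B2)  # True
```

The same experiment fails for the first loop, because each of its iterations needs the A and B values produced by the iteration before it.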
17. Software Pipelining
- Observation
  - If iterations from loops are independent, then we can get ILP by taking instructions from different iterations
- Software pipelining
  - reorganize loops so that each iteration is made from instructions chosen from different iterations of the original loop
[Figure: original loop iterations i0, i1, i2, i3, i4, with one software-pipelined iteration marked across them]
18. Software Pipelining Example
After: Software Pipelined
Start-up code:
      L.D    F0,0(R1)
      ADD.D  F4,F0,F2
      L.D    F0,-8(R1)
Loop body (LOOP):
  1   S.D    0(R1),F4      ; stores M[i]
  2   ADD.D  F4,F0,F2      ; adds to M[i-1]
  3   L.D    F0,-16(R1)    ; loads M[i-2]
  4   DADDUI R1,R1,-8
  5   BNEZ   R1,LOOP
Finish-up code:
      S.D    0(R1),F4
      ADD.D  F4,F0,F2
      S.D    -8(R1),F4
[Figure: pipeline diagram (IF ID EX Mem WB) for S.D, ADD.D, and L.D in one loop body, annotated with the S.D reading F4, the ADD.D reading F0 and writing F4, and the L.D writing F0]
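The structure of the software-pipelined loop (start-up code, a steady-state body that mixes three original iterations, and finish-up code) can be checked against the plain loop at source level. The sketch below is a Python analogue under simplifying assumptions: a 0-based list replaces the memory addressed through R1, the variables `f0` and `f4` stand in for the registers, and the function names are illustrative.

```python
def add_scalar_plain(x, s):
    """Original loop: load, add, and store one element per iteration."""
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + s
    return x

def add_scalar_software_pipelined(x, s):
    """Each steady-state iteration stores element i, adds for i-1, loads i-2."""
    n = len(x)
    if n < 2:                    # too short to fill the software pipeline
        return add_scalar_plain(x, s)
    i = n - 1
    # Start-up code
    f0 = x[i]                    # L.D   for element i
    f4 = f0 + s                  # ADD.D for element i
    f0 = x[i - 1]                # L.D   for element i-1
    # Steady state
    while i >= 2:
        x[i] = f4                # S.D   stores element i
        f4 = f0 + s              # ADD.D adds to element i-1
        f0 = x[i - 2]            # L.D   loads element i-2
        i -= 1
    # Finish-up code (i == 1 here)
    x[i] = f4                    # S.D   for element 1
    f4 = f0 + s                  # ADD.D for element 0
    x[i - 1] = f4                # S.D   for element 0
    return x

data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(add_scalar_software_pipelined(list(data), 0.5) == add_scalar_plain(list(data), 0.5))  # True
```

Inside the steady-state body no statement uses a value produced earlier in the same pass: the store uses the sum from the previous pass, and the add uses the load from the previous pass. That separation is what hides the load and add latencies.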
19. Software Pipelining Concept
A Study of Scalar Compilation Techniques for Pipelined Supercomputers, S. Weiss and J. E. Smith, ISCA 1987, pages 105-109
- Notation: Load, Execute, Store
- Iterations are independent
- In normal sequence, Ei depends on Li, and Si depends on Ei, leading to pipeline stalls
- Software pipelining attempts to reduce these delays by inserting other instructions between such dependent pairs and hiding the delay
  - Other instructions are L and S instructions from other loop iterations
- Does this without consuming extra code space or registers
- Performance usually not as high as that of loop unrolling
- How can we permute L, E, S to achieve this?
Loop as written:
  Loop: Li Ei Si B Loop
Sequence as executed:
  L1 E1 S1 B Loop
  L2 E2 S2 B Loop
  L3 E3 S3 B Loop
  ...
  Ln En Sn
20. An Abstract View of Software Pipelining
Original loop:
  Loop: Li Ei Si B Loop
Maintains original L/S order:
  L1; Loop: Ei Si L(i+1) B Loop; En Sn
  J Entry; Loop: S(i-1); Entry: Li Ei B Loop; Sn
Changes original L/S order:
  L1; J Entry; Loop: S(i-1); Entry: Ei L(i+1) B Loop; S(n-1) En Sn
  L1; J Entry; Loop: Li S(i-1); Entry: Ei B Loop; Sn
  L1; Loop: Ei L(i+1) Si B Loop; En Sn
21. Other Compiler Techniques
- Static Branch Prediction
  - Examples
    - predict always taken
    - predict never taken
    - predict forward never taken, backward always taken
- Stall needed after LD
  - if branch almost always taken, and R7 not needed in fall-thru
    - move DADDU R7, R8, R9 to right after LD
  - if branch almost never taken, and R4 not needed on taken path
    - move OR instruction to right after LD
      LD    R1, 0(R2)
      DSUBU R1, R1, R3
      BEQZ  R1, L
      OR    R4, R5, R6
      DADDU R10, R4, R3
L:    DADDU R7, R8, R9
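The three static prediction schemes listed on this slide can be compared on a branch trace. The sketch below uses a made-up trace and hypothetical addresses; the scheme names and the helpers `predict_taken` and `correct_predictions` are illustrative. It assumes that a backward branch is one whose target address is lower than the branch address.

```python
def predict_taken(scheme, branch_pc, target_pc):
    """Static prediction: decided from the instruction alone, never from history."""
    if scheme == "always_taken":
        return True
    if scheme == "never_taken":
        return False
    if scheme == "backward_taken":
        # A backward branch (target at a lower address) usually closes a loop.
        return target_pc < branch_pc
    raise ValueError("unknown scheme: " + scheme)

def correct_predictions(scheme, trace):
    """Count correct predictions over a trace of (branch_pc, target_pc, taken)."""
    return sum(predict_taken(scheme, pc, target) == taken
               for pc, target, taken in trace)

# Hypothetical trace: a loop-closing branch taken 9 times, then falling through,
# followed by a forward branch that is not taken.
trace = [(0x40, 0x10, True)] * 9 + [(0x40, 0x10, False), (0x60, 0x80, False)]

for scheme in ("always_taken", "never_taken", "backward_taken"):
    print(scheme, correct_predictions(scheme, trace), "of", len(trace))
```

On this trace the direction-based scheme is right 10 times out of 11, against 9 for always-taken and 2 for never-taken. Real programs vary, so the numbers only illustrate how the schemes differ.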
22. Very Long Instruction Word (VLIW)
- VLIW: compiler schedules multiple instructions/issue
  - The long instruction word has room for many operations
  - By definition, all the operations the compiler puts in the long instruction word can execute in parallel
  - E.g., 2 integer operations, 2 FP operations, 2 memory references, 1 branch
    - 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
- Need very sophisticated compiling technique
  - that schedules across several branches
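The instruction word widths quoted above are the number of operation fields multiplied by the bits per field. A minimal sketch of that arithmetic, with an illustrative function name:

```python
def vliw_word_bits(operations_per_word, bits_per_field):
    """Width of a long instruction word with one fixed-size field per operation."""
    return operations_per_word * bits_per_field

# 2 integer + 2 FP + 2 memory + 1 branch = 7 operation fields
fields = 2 + 2 + 2 + 1
print(vliw_word_bits(fields, 16), vliw_word_bits(fields, 24))  # 112 168
```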
23. Loop Unrolling in VLIW
Clock | Memory reference 1 | Memory reference 2 | FP operation 1 | FP operation 2 | Int. op/branch
1 | LD F0,0(R1) | LD F6,-8(R1) | | |
2 | LD F10,-16(R1) | LD F14,-24(R1) | | |
3 | LD F18,-32(R1) | LD F22,-40(R1) | ADDD F4,F0,F2 | ADDD F8,F6,F2 |
4 | LD F26,-48(R1) | | ADDD F12,F10,F2 | ADDD F16,F14,F2 |
5 | | | ADDD F20,F18,F2 | ADDD F24,F22,F2 |
6 | SD 0(R1),F4 | SD -8(R1),F8 | ADDD F28,F26,F2 | |
7 | SD -16(R1),F12 | SD -24(R1),F16 | | |
8 | SD -32(R1),F20 | SD -40(R1),F24 | | | SUBI R1,R1,48
9 | SD -0(R1),F28 | | | | BNEZ R1,LOOP
- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per iteration (down from 6)
- Need more registers in VLIW
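The figures for this schedule can be recomputed from the table itself. The sketch below transcribes the nine rows into a list, with `None` for an empty slot, and derives clocks per iteration, operations per clock, and slot utilization. The variable names are illustrative, and the five-slot layout follows the column headings above.

```python
# One row per clock; columns are memory 1, memory 2, FP 1, FP 2, integer/branch.
# None marks an empty slot.
SCHEDULE = [
    ["LD F0,0(R1)",    "LD F6,-8(R1)",   None,              None,              None],
    ["LD F10,-16(R1)", "LD F14,-24(R1)", None,              None,              None],
    ["LD F18,-32(R1)", "LD F22,-40(R1)", "ADDD F4,F0,F2",   "ADDD F8,F6,F2",   None],
    ["LD F26,-48(R1)", None,             "ADDD F12,F10,F2", "ADDD F16,F14,F2", None],
    [None,             None,             "ADDD F20,F18,F2", "ADDD F24,F22,F2", None],
    ["SD 0(R1),F4",    "SD -8(R1),F8",   "ADDD F28,F26,F2", None,              None],
    ["SD -16(R1),F12", "SD -24(R1),F16", None,              None,              None],
    ["SD -32(R1),F20", "SD -40(R1),F24", None,              None,              "SUBI R1,R1,48"],
    ["SD -0(R1),F28",  None,             None,              None,              "BNEZ R1,LOOP"],
]
UNROLL_FACTOR = 7

clocks = len(SCHEDULE)
slots = sum(len(row) for row in SCHEDULE)
operations = sum(op is not None for row in SCHEDULE for op in row)

print(clocks / UNROLL_FACTOR)   # clocks per original iteration, about 1.29
print(operations / clocks)      # operations issued per clock
print(operations / slots)       # fraction of issue slots that are filled
```

The table holds 23 operations in 45 slots, so roughly half of the issue slots stay empty even after unrolling seven times. That is one reason VLIW compilers need aggressive scheduling to keep the machine busy.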