Title: Exploiting Instruction-Level Parallelism with Software Approaches
1. Exploiting Instruction-Level Parallelism with Software Approaches
E. J. Kim
2.
- To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.
- Goal: keep the pipeline full.
3. Latencies
- Branch: 1
- Integer ALU op -> branch: 1
- Integer load -> integer ALU: 1
- Integer ALU -> integer ALU: 1
4. Example

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, -8
      BNE    R1, R2, LOOP
5. Without Any Scheduling

                          Clock cycle issued
Loop: L.D    F0, 0(R1)     1
      stall                2
      ADD.D  F4, F0, F2    3
      stall                4
      stall                5
      S.D    F4, 0(R1)     6
      DADDIU R1, R1, -8    7
      stall                8
      BNE    R1, R2, LOOP  9
      stall                10
6. With Scheduling

                          Clock cycle issued
Loop: L.D    F0, 0(R1)     1
      DADDIU R1, R1, -8    2
      ADD.D  F4, F0, F2    3
      stall                4
      BNE    R1, R2, LOOP  5
      S.D    F4, 8(R1)     6

- The S.D fills the delayed branch slot.
- This is not trivial: because the DADDIU now executes first, the S.D offset must change from 0(R1) to 8(R1).
7.
- The actual work of operating on the array element takes 3 of the 6 cycles (load, add, store).
- The remaining 3 cycles are:
  - loop overhead (DADDIU, BNE)
  - a stall
- To eliminate these 3 cycles, we need to get more operations within the loop relative to the number of overhead instructions.
8. Reducing Loop Overhead
- Loop unrolling
  - A simple scheme for increasing the number of instructions relative to the branch and overhead instructions.
  - Simply replicates the loop body multiple times, adjusting the loop termination code.
  - Improves scheduling: it allows instructions from different iterations to be scheduled together.
  - Uses different registers for each iteration.
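The idea above can be sketched in C. This is my own minimal illustration, not code from the slides: the function name is hypothetical, and it assumes n is a multiple of 4 (the slides' loop runs 1000 iterations, which unrolls evenly).

```c
#include <assert.h>
#include <stddef.h>

/* x[i] = x[i] + s, unrolled four times. The four bodies are
   independent, so a scheduler can interleave their loads, adds,
   and stores; one i += 4 and one branch now cover four elements. */
static void add_scalar_unrolled(double *x, size_t n, double s) {
    for (size_t i = 0; i < n; i += 4) {
        x[i]     = x[i]     + s;   /* iteration i   */
        x[i + 1] = x[i + 1] + s;   /* iteration i+1 */
        x[i + 2] = x[i + 2] + s;   /* iteration i+2 */
        x[i + 3] = x[i + 3] + s;   /* iteration i+3 */
    }
}
```

A real compiler would also emit clean-up code for trip counts that are not a multiple of the unroll factor; that is omitted here.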
9. Unrolled Loop (No Scheduling)

                          Clock cycle issued
Loop: L.D    F0, 0(R1)     1
      stall                2
      ADD.D  F4, F0, F2    3
      stall                4
      stall                5
      S.D    F4, 0(R1)     6
      L.D    F6, -8(R1)    7
      stall                8
      ADD.D  F8, F6, F2    9
      stall                10
      stall                11
      S.D    F8, -8(R1)    12
      L.D    F10, -16(R1)  13
      stall                14
      ADD.D  F12, F10, F2  15
      stall                16
      stall                17
      S.D    F12, -16(R1)  18
      L.D    F14, -24(R1)  19
      stall                20
      ADD.D  F16, F14, F2  21
      stall                22
      stall                23
      S.D    F16, -24(R1)  24
      DADDIU R1, R1, -32   25
      stall                26
      BNE    R1, R2, LOOP  27
      stall                28
10. Loop Unrolling
- Loop unrolling is normally done early in the compilation process, so that redundant computations can be exposed and eliminated by the optimizer.
- Unrolling improves the performance of the loop by eliminating overhead instructions.
11. Loop Unrolling (Scheduling)

                          Clock cycle issued
Loop: L.D    F0, 0(R1)     1
      L.D    F6, -8(R1)    2
      L.D    F10, -16(R1)  3
      L.D    F14, -24(R1)  4
      ADD.D  F4, F0, F2    5
      ADD.D  F8, F6, F2    6
      ADD.D  F12, F10, F2  7
      ADD.D  F16, F14, F2  8
      S.D    F4, 0(R1)     9
      S.D    F8, -8(R1)    10
      DADDIU R1, R1, -32   11
      S.D    F12, 16(R1)   12
      BNE    R1, R2, LOOP  13
      S.D    F16, 8(R1)    14
12. Summary
- Goal: to know when and how the ordering among instructions may be changed.
- This process must be performed in a methodical fashion, either by a compiler or by hardware.
13.
- To obtain the final unrolled code:
  - Determine that it is legal to move the S.D after the DADDIU and BNE, and find the amount to adjust the S.D offset.
  - Determine that unrolling the loop will be useful by finding that the loop iterations are independent, except for the loop maintenance code.
  - Use different registers to avoid unnecessary constraints.
  - Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.
14.
- Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address.
- Schedule the code, preserving any dependences needed to yield the same result as the original code.
15. Loop Unrolling I (No Delayed Branch)

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F0, -8(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -8(R1)
      L.D    F0, -16(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -16(R1)
      L.D    F0, -24(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -24(R1)
      DADDIU R1, R1, -32
      BNE    R1, R2, LOOP

- Reusing F0 and F4 across the copies creates name dependences.
- Each L.D -> ADD.D -> S.D chain is a true dependence.
16. Loop Unrolling II (Register Renaming)

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDIU R1, R1, -32
      BNE    R1, R2, LOOP

- Only the true dependences within each L.D -> ADD.D -> S.D chain remain.
17.
- With the renaming, the copies of each loop body become independent and can be overlapped or executed in parallel.
- Potential shortfall in registers: register pressure
  - It arises because scheduling code to increase ILP causes the number of live values to increase. It may not be possible to allocate all the live values to registers.
  - The combination of unrolling and aggressive scheduling can cause this problem.
18.
- Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively.
19. Unrolling with Two-Issue

      Integer instruction     FP instruction        Clock cycle
Loop: L.D    F0, 0(R1)                               1
      L.D    F6, -8(R1)                              2
      L.D    F10, -16(R1)     ADD.D F4, F0, F2       3
      L.D    F14, -24(R1)     ADD.D F8, F6, F2       4
      L.D    F18, -32(R1)     ADD.D F12, F10, F2     5
      S.D    F4, 0(R1)        ADD.D F16, F14, F2     6
      S.D    F8, -8(R1)       ADD.D F20, F18, F2     7
      S.D    F12, -16(R1)                            8
      DADDIU R1, R1, -40                             9
      S.D    F16, 16(R1)                             10
      BNE    R1, R2, LOOP                            11
      S.D    F20, 8(R1)                              12
20. Static Branch Prediction
- Static branch predictors are sometimes used in processors where the expectation is that branch behavior is highly predictable at compile time.
21. Static Branch Prediction
- Predict every branch as taken
  - The simplest scheme.
  - Average misprediction rate for SPEC: 34% (ranging from 9% to 59%).
- Predict on the basis of branch direction
  - Backward-going branches: predicted taken.
  - Forward-going branches: predicted not taken.
  - Unlikely to achieve an overall misprediction rate of less than 30% to 40%.
22. Static Branch Prediction
- Predict branches on the basis of profile information collected from earlier runs.
- An individual branch is often highly biased toward taken or untaken (bimodally distributed).
- Changing the input so that the profile is for a different run leads to only a small change in the accuracy of profile-based prediction.
23. VLIW
- Very Long Instruction Word
- Relies on compiler technology to minimize the potential data hazard stalls.
- Actually formats the instructions in a potential issue packet so that the hardware need not check explicitly for dependences.
- Wide instructions with multiple operations per instruction (64, 128 bits, or more).
- Example: the Intel IA-64 architecture.
24. Basic VLIW Approach
- VLIWs use multiple, independent functional units.
- A VLIW packages the multiple operations into one very long instruction.
- The hardware a superscalar needs for multiple issue is unnecessary.
- Uses loop unrolling and scheduling.
25.
- Local scheduling: scheduling the code within a single basic block.
- Global scheduling: scheduling code across branches.
  - Much more complex.
  - Trace scheduling (Section 4.5).
- Figure 4.5: VLIW instructions.
26. Problems
- Increase in code size.
- Wasted functional units.
  - In the previous example, only about 60% of the functional units were used.
27. Detecting and Enhancing Loop-Level Parallelism
- Loop-level parallelism: analyzed at the source level.
- ILP: analyzed at the machine-code level, after compilation.

for (i = 1000; i > 0; i--)
    x[i] = x[i] + s;
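No iteration of this loop reads a value written by another iteration, so the iterations can run in any order. A small C check (my own illustration, not from the slides) makes that independence concrete by traversing the array in both directions:

```c
#include <assert.h>
#include <stddef.h>

/* x[i] = x[i] + s, walking forward through the array. */
static void add_forward(double *x, size_t n, double s) {
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Same computation, walking backward (i = n-1 down to 0).
   Identical results show the iterations carry no dependence. */
static void add_backward(double *x, size_t n, double s) {
    for (size_t i = n; i-- > 0; )
        x[i] = x[i] + s;
}
```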
28. Advanced Compiler Support for Exposing and Exploiting ILP

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
29. Loop-Carried Dependence
- Data accesses in later iterations are dependent on data values produced in earlier iterations.

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];    /* S1: loop-carried dependence on A */
    B[i+1] = B[i] + A[i+1];  /* S2: loop-carried dependence on B */
}

These dependences force successive iterations of this loop to execute in series.
30. Does a loop-carried dependence mean there is no parallelism?
- Consider:

for (i = 0; i < 8; i = i + 1)
    A = A + C[i];    /* S1 */

Could compute:

Cycle 1: temp0 = C[0] + C[1]; temp1 = C[2] + C[3];
         temp2 = C[4] + C[5]; temp3 = C[6] + C[7];
Cycle 2: temp4 = temp0 + temp1; temp5 = temp2 + temp3;
Cycle 3: A = temp4 + temp5;

- Relies on the associative nature of +.
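The three-cycle schedule above can be written out directly in C. This sketch is my own: it uses int data so that reassociation is exact, and it folds the initial value of A in at the end (the slide leaves the initial A implicit).

```c
#include <assert.h>

/* Tree reduction of A = A + C[0] + ... + C[7] in three "cycles"
   of independent additions, as on the slide. */
static int reduce8(int a, const int c[8]) {
    /* Cycle 1: four independent adds */
    int temp0 = c[0] + c[1];
    int temp1 = c[2] + c[3];
    int temp2 = c[4] + c[5];
    int temp3 = c[6] + c[7];
    /* Cycle 2: two independent adds */
    int temp4 = temp0 + temp1;
    int temp5 = temp2 + temp3;
    /* Cycle 3: combine, folding in the initial value of A */
    return a + (temp4 + temp5);
}
```

With floating-point data the reassociated sum may differ slightly from the sequential one, which is why compilers apply this only under relaxed floating-point rules.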
31.

for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

- S1 uses the B[i] value computed by S2 in the previous iteration: a loop-carried dependence.
- Despite this loop-carried dependence, this loop can be made parallel.
32.

A[1] = A[1] + B[1];
for (i = 1; i <= 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
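To check that this transformation preserves the loop's results, the two forms can be run side by side. This harness is my own sketch; the array sizes are chosen so that indices 1 through 101 are valid.

```c
#include <assert.h>

#define N 100

/* Original form: S1 reads the B[i] produced by S2 in the
   previous iteration (a loop-carried dependence). */
static void original_loop(int A[], int B[], const int C[], const int D[]) {
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i + 1] = C[i] + D[i];  /* S2 */
    }
}

/* Transformed form: the statements inside the loop no longer
   depend on each other across iterations. */
static void transformed_loop(int A[], int B[], const int C[], const int D[]) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];
}
```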
33. Recurrence
- A recurrence is when a variable is defined based on the value of that variable in an earlier iteration, often the one immediately preceding.
- Detecting a recurrence can be important:
  - Some architectures (especially vector computers) have special support for executing recurrences.
  - Some recurrences can be the source of a reasonable amount of parallelism.
34.

for (i = 2; i <= 100; i = i + 1)
    Y[i] = Y[i-1] + Y[i];

Dependence distance: 1

for (i = 6; i <= 100; i = i + 1)
    Y[i] = Y[i-5] + Y[i];

Dependence distance: 5

The larger the distance, the more potential parallelism can be obtained by unrolling the loop.
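With distance 5, indices that differ mod 5 never depend on one another, so the loop splits into five independent chains. The following C check is my own sketch: it runs the chains one at a time and gets the same result as the original sequential order.

```c
#include <assert.h>

/* Original sequential order: i = 6, 7, ..., 100. */
static void recur_original(int Y[101]) {
    for (int i = 6; i <= 100; i++)
        Y[i] = Y[i - 5] + Y[i];
}

/* Chain-at-a-time order: chain c touches only indices congruent
   to c mod 5, so the five chains could run in parallel. */
static void recur_by_chain(int Y[101]) {
    for (int c = 6; c <= 10; c++)        /* first index of each chain */
        for (int i = c; i <= 100; i += 5)
            Y[i] = Y[i - 5] + Y[i];
}
```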
35. Finding Dependences
- Determining whether a dependence actually exists is NP-complete.
- Dependence analysis
  - The basic tool for detecting loop-level parallelism.
  - Applies only under a limited set of circumstances.
  - Techniques: the greatest common divisor (GCD) test, points-to analysis, interprocedural analysis, etc.
36. Eliminating Dependent Computation
- Algebraic simplifications of expressions
- Copy propagation
  - Eliminates operations that copy values.

DADDIU R1, R2, 4
DADDIU R1, R1, 4

becomes

DADDIU R1, R2, 8
37. Eliminating Dependent Computation
- Tree height reduction
  - Reduces the height of the tree structure representing a computation.

ADD R1, R2, R3
ADD R4, R1, R6
ADD R8, R4, R7

becomes

ADD R1, R2, R3
ADD R4, R6, R7
ADD R8, R1, R4
38. Eliminating Dependent Computation

sum = sum + x1 + x2 + x3 + x4 + x5;

becomes

sum = ((sum + x1) + (x2 + x3)) + (x4 + x5);
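In C, the two associations look like this. The framing is mine: integer data is used so the equality is exact; for floating point, the compiler needs permission to reassociate.

```c
#include <assert.h>

/* Serial chain: height 5, each add waits for the previous one. */
static int sum_chain(int sum, int x1, int x2, int x3, int x4, int x5) {
    return sum + x1 + x2 + x3 + x4 + x5;
}

/* Balanced tree: height 3; (sum + x1), (x2 + x3), and (x4 + x5)
   are independent and can issue in the same cycle. */
static int sum_tree(int sum, int x1, int x2, int x3, int x4, int x5) {
    return ((sum + x1) + (x2 + x3)) + (x4 + x5);
}
```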
39. Software Pipelining
- A technique for reorganizing loops such that each iteration in the software-pipelined code is made from instructions chosen from different iterations of the original loop.
- By choosing instructions from different iterations, dependent computations are separated from one another by an entire loop body.
40. Software Pipelining
- The software counterpart to what Tomasulo's algorithm does in hardware.
- Software pipelining symbolically unrolls the loop and then selects instructions from each iteration.
- Start-up code before the loop and finish-up code after the loop are required.
41. Software Pipelining
42. Software Pipelining: Example
- Show a software-pipelined version of the following loop. Omit the start-up and finish-up code.

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, -8
      BNE    R1, R2, Loop
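One possible answer, sketched in C rather than MIPS (this rendering, including the ascending traversal and the variable names, is my own): each steady-state iteration performs the store for iteration i, the add for iteration i+1, and the load for iteration i+2, so no iteration waits on its own load or add.

```c
#include <assert.h>
#include <stddef.h>

/* Software-pipelined x[i] = x[i] + s; requires n >= 2. */
static void add_scalar_pipelined(double *x, size_t n, double s) {
    /* start-up code: get iterations 0 and 1 in flight */
    double loaded = x[0];       /* load for iteration 0 */
    double added = loaded + s;  /* add  for iteration 0 */
    loaded = x[1];              /* load for iteration 1 */

    /* steady state: one store, one add, one load per pass */
    for (size_t i = 0; i + 2 < n; i++) {
        x[i] = added;           /* store for iteration i   */
        added = loaded + s;     /* add   for iteration i+1 */
        loaded = x[i + 2];      /* load  for iteration i+2 */
    }

    /* finish-up code: drain the pipeline */
    x[n - 2] = added;
    x[n - 1] = loaded + s;
}
```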
43. Software Pipelining
- Software pipelining consumes less code space.
- Loop unrolling reduces the overhead of the loop (the branch and counter-update code).
- Software pipelining reduces the time when the loop is not running at peak speed to once per loop, at the beginning and the end.
45. HW Support for More Parallelism at Compile Time: Conditional Instructions
- Predicated instructions
  - An extension of the instruction set.
  - A conditional instruction refers to a condition, which is evaluated as part of the instruction execution.
    - Condition true: executed normally.
    - Condition false: no-op.
  - Example: conditional move.
46. Example

if (A == 0) S = T;

Assume R1 holds A, R2 holds S, and R3 holds T.

With a branch:

      BNEZ  R1, L
      ADDU  R2, R3, R0
L:

With a conditional move, which moves only if the third operand is equal to zero:

      CMOVZ R2, R3, R1
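The same transformation written in C (my own sketch): the conditional expression gives the compiler a form it can typically lower to a CMOVZ-style instruction instead of a branch.

```c
#include <assert.h>

/* if (a == 0) s = t;  expressed as a value selection: the control
   dependence on (a == 0) becomes a data dependence of the result
   on a, t, and s. */
static int select_if_zero(int s, int t, int a) {
    return (a == 0) ? t : s;
}
```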
47.
- Conditional moves are used to change a control dependence into a data dependence.
- Handling multiple branches per cycle is complex, so conditional moves provide a way of reducing branch pressure.
- A conditional move can often eliminate a branch that is hard to predict, increasing the potential gain.