Title: CSC 4250 Computer Architectures
1. CSC 4250 Computer Architectures
- November 14, 2006
- Chapter 4: Instruction-Level Parallelism
- Software Approaches
2. Fig. 4.1. Latencies of FP Ops in Chap. 4
- The last column shows the number of intervening clock cycles needed to avoid a stall
- The latency of an FP load to an FP store is zero, since the result of the load can be bypassed without stalling the store
- Continue to assume an integer load latency of 1 and an integer ALU operation latency of 0
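- For reference, a reconstruction of the table in Fig. 4.1 (the standard Hennessy and Patterson figure; values repeated here for convenience):

      Instruction producing result   Instruction using result   Latency (clock cycles)
      FP ALU op                      Another FP ALU op          3
      FP ALU op                      Store double               2
      Load double                    FP ALU op                  1
      Load double                    Store double               0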
3. Loop Unrolling
- for (i = 1000; i > 0; i = i - 1)
-     x[i] = x[i] + s;
- The above loop is parallel because the body of each iteration is independent.
- MIPS code (F2 holds the scalar s; R1 starts at the address of the last element; R2 is set so that the loop exits when R1 reaches it):
- Loop: L.D    F0,0(R1)      ; F0 = x[i]
-       ADD.D  F4,F0,F2      ; F4 = x[i] + s
-       S.D    F4,0(R1)      ; x[i] = F4
-       DADDUI R1,R1,-8      ; next element (8 bytes per double)
-       BNE    R1,R2,Loop    ; repeat until R1 == R2
4. Example (p. 305)
- Without any pipeline scheduling, the loop executes as follows (clock cycle issued shown at right):
- Loop: L.D    F0,0(R1)      1
-       stall                2
-       ADD.D  F4,F0,F2      3
-       stall                4
-       stall                5
-       S.D    F4,0(R1)      6
-       DADDUI R1,R1,-8      7
-       stall                8
-       BNE    R1,R2,Loop    9
-       stall                10
- The stalls follow from the latencies in Fig. 4.1 (1 cycle from L.D to the dependent ADD.D, 2 cycles from ADD.D to the dependent S.D), plus one cycle between DADDUI and the dependent BNE and one branch delay slot.
- Overhead = (10 - 3)/10 = 0.7; 10 cycles per result
- How can we reduce the stalls to 1 clock cycle?
5. Example (p. 306)
- With pipeline scheduling, the loop executes as follows (clock cycle issued shown at right):
- Loop: L.D    F0,0(R1)      1
-       DADDUI R1,R1,-8      2
-       ADD.D  F4,F0,F2      3
-       stall                4
-       BNE    R1,R2,Loop    5
-       S.D    F4,8(R1)      6
- Overhead = (6 - 3)/6 = 0.5; 6 cycles per result
- To schedule the delayed branch, the compiler has to determine that it can swap DADDUI and S.D by changing the address to which the S.D stores: since DADDUI has already decremented R1 by 8, the store's offset changes from 0(R1) to 8(R1). The change is not trivial. Most compilers would see that S.D depends on DADDUI and would refuse to interchange the two instructions.
6. Loop Unrolled Four Times (Registers Not Reused)
- Loop: L.D    F0,0(R1)
-       ADD.D  F4,F0,F2
-       S.D    F4,0(R1)
-       L.D    F6,-8(R1)
-       ADD.D  F8,F6,F2
-       S.D    F8,-8(R1)
-       L.D    F10,-16(R1)
-       ADD.D  F12,F10,F2
-       S.D    F12,-16(R1)
-       L.D    F14,-24(R1)
-       ADD.D  F16,F14,F2
-       S.D    F16,-24(R1)
-       DADDUI R1,R1,-32
-       BNE    R1,R2,Loop
- We have eliminated three branches and three decrements of R1
- The addresses on the loads and stores have been adjusted
- This loop runs in 28 cycles: each L.D has 1 stall, each ADD.D has 2, the DADDUI 1, the branch 1, plus 14 instruction issue cycles
- Overhead = (28 - 12)/28 = 4/7 ≈ 0.57; 7 (= 28/4) cycles per result
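- In source form, the four-way unrolled loop corresponds to roughly the following C sketch (x and s as in the original loop on slide 3; an illustration, not code from the text):

      /* Four-way unrolled version of:
         for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s; */
      for (i = 1000; i > 0; i = i - 4) {
          x[i]     = x[i]     + s;   /* offset   0(R1) in the MIPS code */
          x[i - 1] = x[i - 1] + s;   /* offset  -8(R1)                  */
          x[i - 2] = x[i - 2] + s;   /* offset -16(R1)                  */
          x[i - 3] = x[i - 3] + s;   /* offset -24(R1)                  */
      }

- This relies on the trip count (1000) being a multiple of the unroll factor (4); the next slide handles the general case.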
7. Upper Bound on Loop (p. 307)
- In real programs, we do not know the upper bound of the loop; call it n
- Say we want to unroll the loop k times
- Instead of one single unrolled loop, we generate a pair of consecutive loops (see the sketch below)
- The first loop executes (n mod k) times and has a body that is the original loop
- The second loop is the unrolled body surrounded by an outer loop that iterates (n/k) times
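- A minimal C sketch of this strip-mining transformation, with unroll factor k = 4 and a hypothetical helper body(i) standing in for the original loop body (both are assumptions for illustration):

      /* Strip mining: split n iterations into (n mod k) + k*(n/k). */
      int i = 0;
      for (; i < n % 4; i++)       /* first loop: the n mod k leftover iterations */
          body(i);
      for (; i < n; i += 4) {      /* second loop: executes n/k times */
          body(i);                 /* k copies of the original body   */
          body(i + 1);
          body(i + 2);
          body(i + 3);
      }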
8. Scheduled Unrolled Loop
- Loop: L.D    F0,0(R1)
-       L.D    F6,-8(R1)
-       L.D    F10,-16(R1)
-       L.D    F14,-24(R1)
-       ADD.D  F4,F0,F2
-       ADD.D  F8,F6,F2
-       ADD.D  F12,F10,F2
-       ADD.D  F16,F14,F2
-       S.D    F4,0(R1)
-       S.D    F8,-8(R1)
-       DADDUI R1,R1,-32
-       S.D    F12,16(R1)
-       BNE    R1,R2,Loop
-       S.D    F16,8(R1)
- This loop runs in 14 cycles; there is no stall
- The last two stores use positive offsets (16 and 8) because DADDUI has already decremented R1 by 32
- Overhead = 2/14 = 1/7 ≈ 0.14; 3.5 (= 14/4) cycles per result
- We need to know that the loads and stores are independent and can be interchanged
9. Loop Unrolling and Scheduling Example
- Determine that it is legal to move the S.D after the DADDUI and BNE, and find the amount to adjust the S.D offset
- Determine that unrolling the loop would be useful by finding that the loop iterations are independent, except for the loop maintenance code
- Use different registers to avoid unnecessary constraints that would be forced by using the same registers for different computations
- Eliminate the extra test and branch instructions and adjust the loop termination and iteration code
- Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address.
- Schedule the code, preserving any dependences needed to yield the same result as the original code
10. Three Limits to Gains from Loop Unrolling
- Decrease in the amount of overhead amortized with each unroll. In our example, when we unroll the loop four times, it generates sufficient parallelism among the instructions that the loop can be scheduled with no stall cycles. In 14 clock cycles, only 2 cycles are loop overhead. If the loop is unrolled 8 times, the overhead is reduced from ½ per original iteration to ¼.
- Code size limitations. For larger loops, the code size growth may become a concern, either in the embedded processor space where memory is at a premium, or if the larger code size causes a decrease in the instruction cache hit rate.
- Register pressure. Scheduling code to increase ILP causes the number of live values to increase. After aggressive instruction scheduling, it may not be possible to allocate all live values to registers.
11. Scheduled Unrolled Loop with Dual Issue
- To schedule the loop with no delays, we unroll the loop five times (a reconstruction of the schedule appears below)
- 2.4 (= 12/5) cycles per result
- There are not enough FP instructions to keep the FP pipeline full
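- One dual-issue schedule consistent with these numbers (one integer/memory and one FP operation per cycle; a reconstruction extending the four-way unrolled code, so register numbering beyond F16 is inferred):

      Integer instruction         FP instruction        Clock cycle
      Loop: L.D    F0,0(R1)                              1
            L.D    F6,-8(R1)                             2
            L.D    F10,-16(R1)    ADD.D F4,F0,F2         3
            L.D    F14,-24(R1)    ADD.D F8,F6,F2         4
            L.D    F18,-32(R1)    ADD.D F12,F10,F2       5
            S.D    F4,0(R1)       ADD.D F16,F14,F2       6
            S.D    F8,-8(R1)      ADD.D F20,F18,F2       7
            S.D    F12,-16(R1)                           8
            DADDUI R1,R1,-40                             9
            S.D    F16,16(R1)                            10
            BNE    R1,R2,Loop                            11
            S.D    F20,8(R1)                             12

- Only 5 of the 12 FP slots are used, which is why the FP pipeline cannot be kept full.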
12. Static Branch Prediction
- Static branch prediction is used in processors where we expect branch behavior to be highly predictable at compile time.
- Delayed branches support static branch prediction. They expose a pipeline hazard so that the compiler can reduce the penalty associated with the hazard. The effectiveness depends on whether we can correctly guess which way a branch will go.
- The ability to accurately predict a branch at compile time is helpful for scheduling data hazards. Loop unrolling is one such example. Another example arises from conditional selection branches (next four slides).
13. Conditional Selection Branches (1)
-    LD    R1,0(R2)
-    DSUBU R1,R1,R3
-    BEQZ  R1,L
-    OR    R4,R5,R6
-    DADDU R10,R4,R3
- L: DADDU R7,R8,R9
- The dependence of DSUBU and BEQZ on LD means that a stall will be needed after LD.
- Suppose we know that the branch is almost always taken and that the value of R7 is not needed on the fall-through path. What should we do? (A source-level view of this sequence is sketched below.)
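- At the source level, the sequence corresponds to something like the following C sketch (function and variable names are invented for illustration):

      /* Hypothetical source-level view of the code above. */
      long select(long *p, long a, long b, long c, long d, long e, long *v) {
          long t = *p - c;      /* LD R1,0(R2); DSUBU R1,R1,R3          */
          if (t != 0) {         /* BEQZ R1,L: taken when t == 0         */
              long u = a | b;   /* OR    R4,R5,R6  (fall-through path)  */
              *v = u + c;       /* DADDU R10,R4,R3                      */
          }
          return d + e;         /* L: DADDU R7,R8,R9 (both paths)       */
      }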
14. Conditional Selection Branches (2)
-    LD    R1,0(R2)
-    DADDU R7,R8,R9
-    DSUBU R1,R1,R3
-    BEQZ  R1,L
-    OR    R4,R5,R6
-    DADDU R10,R4,R3
- L:
- We can increase the speed of execution by moving DADDU R7,R8,R9 to just after LD, filling the load delay with useful work.
- Suppose instead we know that the branch is rarely taken and that the value of R4 is not needed on the taken path. What should we do?
15. Conditional Selection Branches (3)
-    LD    R1,0(R2)
-    OR    R4,R5,R6
-    DSUBU R1,R1,R3
-    BEQZ  R1,L
-    DADDU R10,R4,R3
- L: DADDU R7,R8,R9
- We can increase the speed of execution by moving OR R4,R5,R6 to just after LD.
- The same idea is used to schedule the branch delay slot in Fig. A.14.
16. Conditional Selection Branches (4)
17. Branch Prediction at Compile Time
- Simplest scheme: predict every branch as taken. The average misprediction rate for the SPEC programs is 34%, ranging from not very accurate (59%) to highly accurate (9%).
- Predict on branch direction, choosing backward-going branches as taken and forward-going branches as not taken. This strategy works for many programs. However, for SPEC, more than half of the forward-going branches are taken, and thus it is better to predict all branches as taken.
- A more accurate technique is to predict branches on the basis of profile information collected from earlier runs. The key observation is that the behavior of branches is often bimodally distributed; that is, an individual branch is often highly biased toward taken or untaken. (A minimal sketch of profile-based prediction follows.)
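- A minimal C sketch of how a compiler could turn profile data into a static prediction bit (the record layout, names, and majority rule are assumptions for illustration, not from the text):

      /* Hypothetical per-branch profile record collected in earlier runs. */
      struct branch_profile {
          unsigned long taken;      /* times the branch was taken    */
          unsigned long not_taken;  /* times the branch fell through */
      };

      /* Predict taken iff the branch went that way in the majority of
         profiled executions. Because branch behavior is bimodal, most
         branches are heavily biased one way, so this single static bit
         is usually correct. */
      int predict_taken(const struct branch_profile *p) {
          return p->taken > p->not_taken;
      }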
18. Misprediction Rate for a Profile-based Predictor
- Figure 4.3. The misprediction rate on SPEC92 varies widely but is generally better for the FP programs, with an average misprediction rate of 9% and a standard deviation of 4%, than for the integer programs, with an average misprediction rate of 15% and a standard deviation of 5%.
19. Comparison of Predicted-taken and Profile-based Strategies
- Figure 4.4. The figure compares the accuracy of a predicted-taken strategy and a profile-based predictor for the SPEC92 benchmarks, measured by the number of instructions executed between mispredicted branches, on a log scale. The average number is 20 for predicted-taken and 110 for profile-based. The difference between the integer and FP benchmarks as groups is large: the corresponding distances are 10 and 30 for the integer programs, and 46 and 173 for the FP programs.
20. Compiler to Format the Instructions
- Superscalar processors decide on the fly how many instructions to issue. A statically scheduled superscalar must check for any dependences between instructions in the issue packet, as well as between any issue candidate and any instruction already in the pipeline. A statically scheduled superscalar requires significant compiler assistance to achieve good performance. In contrast, a dynamically scheduled superscalar requires less compiler assistance but has significant hardware costs.
- An alternative is to rely on compiler technology to format the instructions in a potential issue packet so that the hardware need not check explicitly for dependences. The compiler may be required to ensure that dependences within the issue packet cannot be present. Such an approach offers the potential advantage of simpler hardware while still exhibiting good performance through extensive compiler optimization.
21. VLIW Architecture
- A VLIW machine is a multiple-issue processor that organizes the instruction stream explicitly to avoid dependences. It does so by using wide instructions with multiple operations per instruction. The architecture is named VLIW (very long instruction word) because the instructions, containing several operations each, are very wide (64 to 128 bits, or more). Early VLIWs were quite rigid in their instruction formats and required recompilation of programs for different versions of the hardware.
- A VLIW uses multiple, independent functional units. It packages the multiple operations into one very long instruction. For example, the instruction may contain five operations: one integer operation (which could also be a branch), two FP operations, and two memory references. (A sketch of such a format appears below.)
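- To make the five-slot format concrete, here is a hedged C sketch of such an instruction word (field names and widths are invented; real VLIW encodings differ):

      /* One operation slot: an ordinary RISC-style operation, or a
         no-op when the compiler cannot fill the slot. */
      struct vliw_op {
          unsigned opcode;
          unsigned dst, src1, src2;
      };

      /* One very long instruction word: five operations issued together. */
      struct vliw_word {
          struct vliw_op int_or_branch;  /* 1 integer operation or branch */
          struct vliw_op fp[2];          /* 2 FP operations               */
          struct vliw_op mem[2];         /* 2 memory references           */
      };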
22. How to Keep the Functional Units Busy
- There must be sufficient parallelism in a code sequence to fill the available operation slots.
- The parallelism is uncovered by unrolling loops and scheduling the code within the single larger loop body. If the unrolling generates straight-line code, then local scheduling techniques, which operate on a single basic block, can be used.
- If finding and exploiting the parallelism requires scheduling across branches, a more complex global scheduling algorithm must be used. We will discuss trace scheduling, one global scheduling technique developed specifically for VLIWs.
23. Example of Straight-line Code Sequence
- VLIW: 2 memory references, 2 FP operations, and 1 integer or branch instruction per clock cycle
- The loop: x[i] = x[i] + s
- Unroll as many times as necessary to eliminate stalls: seven times
- 1.29 (= 9/7) cycles per result