
Transcript and Presenter's Notes

Title: CSC 4250 Computer Architectures


1
CSC 4250 Computer Architectures
  • November 14, 2006
  • Chapter 4. Instruction-Level Parallelism
  • Software Approaches

2
Fig. 4.1. Latencies of FP ops in Chap. 4
  • The last column shows the number of intervening
    clock cycles needed to avoid a stall
  • The latency of an FP load to an FP store is zero,
    since the result of the load can be bypassed
    without stalling the store
  • Continue to assume an integer load latency of 1
    and an integer ALU operation latency of 0

3
Loop Unrolling
  • for (i = 1000; i > 0; i = i - 1)
  •     x[i] = x[i] + s;
  • The above loop is parallel because the body of
    each iteration is independent.
  • MIPS code
  • Loop: L.D F0,0(R1)
  • ADD.D F4,F0,F2
  • S.D F4,0(R1)
  • DADDUI R1,R1, -8
  • BNE R1,R2,Loop
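  • A minimal C sketch of the source loop (the
    function wrapper and names are assumptions, not
    part of the slide):

    void add_scalar(double *x, double s) {
        /* x[1000] down to x[1], matching the MIPS code: R1 starts
           at the address of x[1000], F2 holds s, and R2 is set so
           that BNE R1,R2 falls through after the last element. */
        for (int i = 1000; i > 0; i = i - 1)
            x[i] = x[i] + s;   /* one load, one FP add, one store */
    }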

4
Example (p. 305)
  • Without any pipeline scheduling, the loop
    executes as follows
  • Clock cycle issued
  • Loop: L.D F0,0(R1) 1
  • stall 2
  • ADD.D F4,F0,F2 3
  • stall 4
  • stall 5
  • S.D F4,0(R1) 6
  • DADDUI R1,R1, -8 7
  • stall 8
  • BNE R1,R2,Loop 9
  • stall 10
  • Overhead = (10 - 3)/10 = 0.7; 10 cycles per
    result
  • How can we reduce the stalls to one clock cycle?

5
Example (p. 306)
  • With some pipeline scheduling, the loop executes
    as follows
  • Clock cycle issued
  • Loop: L.D F0,0(R1) 1
  • DADDUI R1,R1, -8 2
  • ADD.D F4,F0,F2 3
  • stall 4
  • BNE R1,R2,Loop 5
  • S.D F4,8(R1) 6
  • Overhead = (6 - 3)/6 = 0.5; 6 cycles per result
  • To schedule the delayed branch, the compiler had
    to determine that it could swap the DADDUI and
    the S.D by changing the address to which the S.D
    stores: since the DADDUI now decrements R1 by 8
    before the store executes, the offset changes
    from 0(R1) to 8(R1). The change is not trivial;
    most compilers would see that the S.D depends on
    the DADDUI and would refuse to interchange the
    two instructions.

6
Loop Unrolled Four Times - Registers not reused
  • Loop: L.D F0,0(R1)
  • ADD.D F4,F0,F2
  • S.D F4,0(R1)
  • L.D F6,-8(R1)
  • ADD.D F8,F6,F2
  • S.D F8,-8(R1)
  • L.D F10,-16(R1)
  • ADD.D F12,F10,F2
  • S.D F12,-16(R1)
  • L.D F14,-24(R1)
  • ADD.D F16,F14,F2
  • S.D F16,-24(R1)
  • DADDUI R1,R1, -32
  • BNE R1,R2,Loop
  • We have eliminated three branches and three
    decrements of R1
  • The addresses on the loads and stores have been
    adjusted
  • This loop runs in 28 clock cycles: each L.D has 1
    stall, each ADD.D has 2, the DADDUI has 1, the
    branch has 1, plus 14 instruction issue cycles
  • Overhead = (28 - 12)/28 = 4/7 = 0.57; 7 (= 28/4)
    cycles per result
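  • A C-level sketch of the unrolled loop (names are
    assumptions; 1000 is divisible by 4, so no
    cleanup code is needed here):

    void add_scalar_unrolled(double *x, double s) {
        /* Body replicated four times; the induction variable is
           decremented once per unrolled pass, like R1 above. */
        for (int i = 1000; i > 0; i = i - 4) {
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }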

7
Upper Bound on Loop (p. 307)
  • In real programs, we generally do not know the
    upper bound of the loop; call it n
  • Let us say we want to unroll the loop k times
  • Instead of one single unrolled loop, we generate
    a pair of consecutive loops
  • The first loop executes (n mod k) times and has a
    body that is the original loop
  • The second loop is the unrolled body surrounded
    by an outer loop that iterates (n/k) times
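  • A C sketch of this transformation with k = 4
    (function and variable names are illustrative
    assumptions):

    void add_scalar_n(double *x, double s, long n) {
        long i = n;
        /* First loop: the original body, executed (n mod k) times. */
        for (long rem = n % 4; rem > 0; rem--, i--)
            x[i] = x[i] + s;
        /* Second loop: the body unrolled k times, run n/k times. */
        for (; i > 0; i -= 4) {
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }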

8
Schedule Unrolled Loop
  • Loop: L.D F0,0(R1)
  • L.D F6,-8(R1)
  • L.D F10,-16(R1)
  • L.D F14,-24(R1)
  • ADD.D F4,F0,F2
  • ADD.D F8,F6,F2
  • ADD.D F12,F10,F2
  • ADD.D F16,F14,F2
  • S.D F4,0(R1)
  • S.D F8,-8(R1)
  • DADDUI R1,R1, -32
  • S.D F12,16(R1)
  • BNE R1,R2,Loop
  • S.D F16,8(R1)
  • This loop runs in 14 cycles - there is no stall
  • Overhead = 2/14 = 1/7 = 0.14; 3.5 (= 14/4) cycles
    per result
  • We need to know that the loads and stores are
    independent and can be interchanged

9
Loop Unrolling and Scheduling Example
  • Determine that it is legal to move the S.D after
    the DADDUI and BNE, and find the amount to adjust
    the S.D offset
  • Determine that unrolling the loop would be useful
    by finding that the loop iterations are
    independent, except for the loop maintenance code
  • Use different registers to avoid unnecessary
    constraints that would be forced by using the
    same registers for different computations
  • Eliminate the extra test and branch instructions
    and adjust the loop termination and iteration
    code
  • Determine that the loads and stores in the
    unrolled loop can be interchanged by observing
    that the loads and stores from different
    iterations are independent. This transformation
    requires analyzing the memory addresses and
    finding that they do not refer to the same
    address.
  • Schedule the code, preserving any dependences
    needed to yield the same result as the original
    code

10
Three Limits to Gains by Loop Unrolling
  • Decrease in the amount of overhead amortized with
    each unroll. In our example, when we unroll the
    loop four times, it generates sufficient
    parallelism among the instructions that the loop
    can be scheduled with no stall cycles. In 14
    clock cycles, only 2 cycles are loop overhead. If
    the loop is unrolled 8 times, the overhead per
    original iteration is reduced from 1/2 cycle to
    1/4 cycle (see the worked figures after this
    list).
  • Code size limitations. For larger loops, the code
    size growth may become a concern in either the
    embedded processor space where memory is at a
    premium or if the larger code size causes a
    decrease in the instruction cache hit rate.
  • Register Pressure. Scheduling code to increase
    ILP causes the number of live values to increase.
    After aggressive instruction scheduling, it may
    not be possible to allocate all live values to
    registers.
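  • Worked figures for the first point (derived from
    the slide's own numbers): with the loop unrolled
    k times and scheduled without stalls, the DADDUI
    and BNE contribute 2 overhead cycles per pass,
    i.e., 2/k cycles per original iteration - 2/4 =
    1/2 cycle for k = 4 and 2/8 = 1/4 cycle for
    k = 8.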

11
Schedule Unrolled Loop with Dual Issue
  • To schedule the loop with no delays, we unroll
    the loop five times
  • 2.4 (= 12/5) cycles per result
  • There are not enough FP instructions to keep the
    FP pipeline full
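  • One such schedule (a reconstruction in the spirit
    of the textbook's Figure 4.2, not text from the
    slide; the fifth iteration is assumed to use F18
    and F20):

    Integer instruction      FP instruction      Clock
    Loop: L.D F0,0(R1)                           1
    L.D F6,-8(R1)                                2
    L.D F10,-16(R1)          ADD.D F4,F0,F2      3
    L.D F14,-24(R1)          ADD.D F8,F6,F2      4
    L.D F18,-32(R1)          ADD.D F12,F10,F2    5
    S.D F4,0(R1)             ADD.D F16,F14,F2    6
    S.D F8,-8(R1)            ADD.D F20,F18,F2    7
    S.D F12,-16(R1)                              8
    DADDUI R1,R1,-40                             9
    S.D F16,16(R1)                               10
    BNE R1,R2,Loop                               11
    S.D F20,8(R1)                                12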

12
Static Branch Prediction
  • Static branch prediction is used in processors
    where we expect branch behavior to be highly
    predictable at compile time.
  • Delayed branches support static branch
    prediction. They expose a pipeline hazard so that
    the compiler can reduce the penalty associated
    with the hazard. The effectiveness depends on
    whether we can correctly guess which way a branch
    will go.
  • The ability to accurately predict a branch at
    compile time is helpful for scheduling data
    hazards. Loop unrolling is one such example.
    Another example arises from conditional selection
    branches (next four slides).

13
Conditional Selection Branches (1)
  • LD R1,0(R2)
  • DSUBU R1,R1,R3
  • BEQZ R1,L
  • OR R4,R5,R6
  • DADDU R10,R4,R3
  • L: DADDU R7,R8,R9
  • The dependence of DSUBU and BEQZ on LD means that
    a stall will be needed after LD.
  • Suppose we know that the branch is almost always
    taken and that the value of R7 is not needed on
    the fall-through path. What should we do?

14
Conditional Selection Branches (2)
  • LD R1,0(R2)
  • DADDU R7,R8,R9
  • DSUBU R1,R1,R3
  • BEQZ R1,L
  • OR R4,R5,R6
  • DADDU R10,R4,R3
  • L:
  • We could increase the speed of execution by
    moving DADDU R7,R8,R9 to just after LD
  • Suppose we know that the branch is rarely taken
    and that the value of R4 is not needed on the
    taken path. What should we do?

15
Conditional Selection Branches (3)
  • LD R1,0(R2)
  • OR R4,R5,R6
  • DSUBU R1,R1,R3
  • BEQZ R1,L
  • DADDU R10,R4,R3
  • L: DADDU R7,R8,R9
  • We could increase the speed of execution by
    moving OR R4,R5,R6 to just after LD
  • This is analogous to scheduling the branch delay
    slot, as in Fig. A.14

16
Conditional Selection Branches (4)
17
Branch Prediction at Compile Time
  • Simplest scheme: predict branch as taken. The
    average misprediction rate for the SPEC programs
    is 34%, ranging from not very accurate (59%) to
    highly accurate (9%).
  • Predict on branch direction, choosing
    backward-going branches as taken and
    forward-going branches as not taken. This
    strategy works for many programs. However, for
    SPEC, more than half of the forward-going
    branches are taken, and thus it is better to
    predict all branches as taken.
  • A more accurate technique is to predict branches
    on the basis of profile information collected
    from earlier runs. The key observation is that
    the behavior of branches is often bimodally
    distributed; that is, an individual branch is
    often highly biased toward taken or untaken.
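  • As a present-day illustration (not from the
    slide): GCC and Clang let the programmer or a
    profile feed this bias to the compiler with
    __builtin_expect; the function below is a
    hypothetical example:

    /* Hint that the negative-value branch is almost never taken,
       so the compiler lays out and statically predicts the common
       path. */
    long sum_nonneg(const long *p, long n) {
        long s = 0;
        for (long i = 0; i < n; i++) {
            if (__builtin_expect(p[i] < 0, 0))
                return -1;   /* rare error path */
            s += p[i];
        }
        return s;
    }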

18
Misprediction Rate for Profile-based Predictor
  • Figure 4.3. The misprediction rate on SPEC92
    varies widely but is generally better for the FP
    programs, with an average misprediction rate of
    9% and a standard deviation of 4%, than for the
    integer programs, with an average misprediction
    rate of 15% and a standard deviation of 5%.

19
Comparison of Predicted-taken and Profile-based
Strategies
  • Figure 4.4. The figure compares the accuracy of a
    predicted-taken strategy and a profile-based
    predictor for SPEC92 benchmarks as measured by
    the number of instructions executed between
    mispredicted branches on a log scale. The average
    number is 20 for predicted-taken and 110 for
    profile-based. The difference between the integer
    and FP benchmarks as groups is large: the
    corresponding counts are 10 and 30 for the
    integer benchmarks, and 46 and 173 for the FP
    benchmarks.

20
Compiler to Format the Instructions
  • Superscalar processors decide on the fly how many
    instructions to issue. A statically scheduled
    superscalar must check for any dependences
    between instructions in the issue packet as well
    as between any issue candidate and any
    instructions already in the pipeline. A
    statically scheduled superscalar requires
    significant compiler assistance to achieve good
    performance. In contrast, a dynamically scheduled
    superscalar requires less compiler assistance,
    but has significant hardware costs.
  • An alternative is to rely on compiler technology
    to actually format the instructions in a
    potential issue packet so that the hardware need
    not check explicitly for dependences. The
    compiler may be required to ensure that
    dependences within the issue packet cannot be
    present. Such an approach offers the potential
    advantage of simpler hardware while still
    exhibiting good performance through extensive
    compiler optimization.

21
VLIW Architecture
  • It is a multiple-issue processor that organizes
    the instruction stream explicitly to avoid
    dependences. It does so by using wide
    instructions with multiple operations per
    instruction. This architecture is named VLIW
    (very long instruction word), denoting that the
    instructions, since they contain several
    operations, are very wide (64 to 128 bits, or
    more). Early VLIWs were quite rigid in their
    instruction formats and required recompilation of
    programs for different versions of the hardware.
  • A VLIW uses multiple, independent functional
    units. It packages the multiple operations into
    one very long instruction. For example, the
    instruction may contain five operations,
    including one integer operation (which could also
    be a branch), two FP operations, and two memory
    references.
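  • A sketch of such a five-operation instruction
    word as a C type (purely illustrative; the field
    names and layout are assumptions, not any real
    VLIW encoding):

    /* One very long instruction: five operation slots that issue
       together in a single cycle. */
    struct vliw_op   { unsigned op, dst, src1, src2; };
    struct vliw_word {
        struct vliw_op int_or_branch; /* 1 integer op or branch */
        struct vliw_op fp[2];         /* 2 FP operations        */
        struct vliw_op mem[2];        /* 2 memory references    */
    };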

22
How to Keep the Functional Units Busy
  • There must be sufficient parallelism in a code
    sequence to fill the available operation slots.
  • The parallelism is uncovered by unrolling loops
    and scheduling the code within the single larger
    loop body. If the unrolling generates
    straight-line code, then local scheduling
    techniques, which operate on a single basic
    block, can be used.
  • If finding and exploiting the parallelism
    requires scheduling across the branches, a more
    complex global scheduling algorithm must be used.
    We will discuss trace scheduling, one global
    scheduling technique developed specifically for
    VLIWs.

23
Example of Straight-line Code Sequence
  • VLIW
  • 2 memory references, 2 FP operations, and 1
    integer or branch instr. per clock cycle
  • Loop: x[i] = x[i] + s
  • Unroll as many times as necessary to eliminate
    stalls - seven times
  • 1.29 (= 9/7) cycles per result
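  • One possible schedule (a reconstruction in the
    spirit of the textbook's Figure 4.5, not text
    from the slide): 23 operations issue in 9 cycles

    Mem ref 1        Mem ref 2        FP op 1           FP op 2           Int/branch        Clock
    L.D F0,0(R1)     L.D F6,-8(R1)                                                          1
    L.D F10,-16(R1)  L.D F14,-24(R1)                                                        2
    L.D F18,-32(R1)  L.D F22,-40(R1)  ADD.D F4,F0,F2    ADD.D F8,F6,F2                      3
    L.D F26,-48(R1)                   ADD.D F12,F10,F2  ADD.D F16,F14,F2                    4
                                      ADD.D F20,F18,F2  ADD.D F24,F22,F2                    5
    S.D F4,0(R1)     S.D F8,-8(R1)    ADD.D F28,F26,F2                                      6
    S.D F12,-16(R1)  S.D F16,-24(R1)                                      DADDUI R1,R1,-56  7
    S.D F20,24(R1)   S.D F24,16(R1)                                                         8
    S.D F28,8(R1)                                                         BNE R1,R2,Loop    9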