1
Exploiting Instruction-Level Parallelism with
Software Approach 1
E. J. Kim
2
  • To avoid a pipeline stall, a dependent
    instruction must be separated from the source
    instruction by a distance in clock cycles equal
    to the pipeline latency of that source
    instruction.
  • Goal: to keep the pipeline full.

3
Latencies
Branch: 1
Integer ALU op -> branch: 1
Integer load: 1
Integer ALU op -> integer ALU op: 1
4
Example
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, -8
      BNE    R1, R2, LOOP
5
Without any Scheduling
                              Clock cycle issued
Loop: L.D    F0, 0(R1)          1
      stall                     2
      ADD.D  F4, F0, F2         3
      stall                     4
      stall                     5
      S.D    F4, 0(R1)          6
      DADDIU R1, R1, -8         7
      stall                     8
      BNE    R1, R2, LOOP       9
      stall                    10
6
With Scheduling
                              Clock cycle issued
Loop: L.D    F0, 0(R1)          1
      DADDIU R1, R1, -8         2
      ADD.D  F4, F0, F2         3
      stall                     4
      BNE    R1, R2, LOOP       5
      S.D    F4, 8(R1)          6

The S.D fills the delayed branch slot, and its offset becomes
8(R1) because the DADDIU has already decremented R1 -- a
transformation that is not trivial.
7
  • The actual work of operating on the array element
    takes 3 clock cycles (load, add, store).
  • The remaining 3 cycles are loop overhead (DADDIU,
    BNE) and a stall.
  • To eliminate these 3 cycles, we need to get more
    operations within the loop relative to the number
    of overhead instructions.

8
Reducing Loop Overhead
  • Loop Unrolling
  • A simple scheme for increasing the number of
    instructions relative to the branch and overhead
    instructions: simply replicate the loop body
    multiple times, adjusting the loop termination code.
  • Improves scheduling, since it allows instructions
    from different iterations to be scheduled together.
  • Uses different registers for each iteration.
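The transformation above can be sketched in C. This is a minimal illustration rather than the slides' MIPS code; the function names and the assumption that the trip count (1000) is divisible by the unroll factor of 4 are ours:

```c
#include <assert.h>

#define N 1000   /* trip count, assumed divisible by the unroll factor 4 */

/* Original loop: one array element per branch/counter update. */
void loop_plain(double *x, double s) {
    for (int i = N; i > 0; i = i - 1)
        x[i] = x[i] + s;
}

/* Unrolled by 4: four elements per iteration, so the counter
   update and branch are executed once per four elements, and the
   four independent adds can be scheduled together. */
void loop_unrolled(double *x, double s) {
    for (int i = N; i > 0; i = i - 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}
```

Both versions update exactly the elements x[1] through x[1000]; the unrolled one pays the loop overhead a quarter as often.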

9
Unrolled Loop (No Scheduling)
                              Clock cycle issued
Loop: L.D    F0, 0(R1)          1
      stall                     2
      ADD.D  F4, F0, F2         3
      stall                     4
      stall                     5
      S.D    F4, 0(R1)          6
      L.D    F6, -8(R1)         7
      stall                     8
      ADD.D  F8, F6, F2         9
      stall                    10
      stall                    11
      S.D    F8, -8(R1)        12
      L.D    F10, -16(R1)      13
      stall                    14
      ADD.D  F12, F10, F2      15
      stall                    16
      stall                    17
      S.D    F12, -16(R1)      18
      L.D    F14, -24(R1)      19
      stall                    20
      ADD.D  F16, F14, F2      21
      stall                    22
      stall                    23
      S.D    F16, -24(R1)      24
      DADDIU R1, R1, -32       25
      stall                    26
      BNE    R1, R2, LOOP      27
      stall                    28
10
Loop Unrolling
  • Loop unrolling is normally done early in the
    compilation process, so that redundant
    computations can be exposed and eliminated by the
    optimizer.
  • Unrolling improves the performance of the loop by
    eliminating overhead instructions.

11
Loop Unrolling (Scheduling)
                              Clock cycle issued
Loop: L.D    F0, 0(R1)          1
      L.D    F6, -8(R1)         2
      L.D    F10, -16(R1)       3
      L.D    F14, -24(R1)       4
      ADD.D  F4, F0, F2         5
      ADD.D  F8, F6, F2         6
      ADD.D  F12, F10, F2       7
      ADD.D  F16, F14, F2       8
      S.D    F4, 0(R1)          9
      S.D    F8, -8(R1)        10
      DADDIU R1, R1, -32       11
      S.D    F12, 16(R1)       12
      BNE    R1, R2, LOOP      13
      S.D    F16, 8(R1)        14
12
Summary
  • Goal: to know when and how the ordering among
    instructions may be changed.
  • This process must be performed in a methodical
    fashion either by a compiler or by hardware.

13
  • To obtain the final unrolled code, we must:
  • Determine that it is legal to move the S.D after
    the DADDIU and BNE, and find the amount to adjust
    the S.D offset.
  • Determine that unrolling the loop will be useful
    by finding that the loop iterations are
    independent, except for the loop maintenance
    code.
  • Use different registers to avoid unnecessary
    constraints.
  • Eliminate the extra test and branch instructions
    and adjust the loop termination and iteration
    code.

14
  • Determine that the loads and stores in the
    unrolled loop can be interchanged by observing
    that the loads and stores from different
    iterations are independent. This transformation
    requires analyzing the memory addresses and
    finding that they do not refer to the same
    address.
  • Schedule the code, preserving any dependences
    needed to yield the same result as the original
    code.

15
Loop Unrolling I (No Delayed Branch)

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F0, -8(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -8(R1)
      L.D    F0, -16(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -16(R1)
      L.D    F0, -24(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -24(R1)
      DADDIU R1, R1, -32
      BNE    R1, R2, LOOP

Reusing F0 and F4 in every copy creates name dependences between
the copies; the L.D -> ADD.D -> S.D chain within each copy is a
true dependence.
16
Loop Unrolling II (Register Renaming)

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDIU R1, R1, -32
      BNE    R1, R2, LOOP

Only the true dependences within each copy (L.D -> ADD.D -> S.D)
remain.
17
  • With the renaming, the copies of each loop body
    become independent and can be overlapped or
    executed in parallel.
  • Potential shortfall in registers
  • Register pressure
  • It arises because scheduling code to increase ILP
    causes the number of live values to increase. It
    may not be possible to allocate all the live
    values to registers.
  • The combination of unrolling and aggressive
    scheduling can cause this problem.

18
  • Loop unrolling is a simple but useful method for
    increasing the size of straight-line code
    fragments that can be scheduled effectively.

19
Unrolling with Two-Issue
      Integer instruction        FP instruction        Clock cycle
Loop: L.D    F0, 0(R1)                                     1
      L.D    F6, -8(R1)                                    2
      L.D    F10, -16(R1)        ADD.D F4, F0, F2          3
      L.D    F14, -24(R1)        ADD.D F8, F6, F2          4
      L.D    F18, -32(R1)        ADD.D F12, F10, F2        5
      S.D    F4, 0(R1)           ADD.D F16, F14, F2        6
      S.D    F8, -8(R1)          ADD.D F20, F18, F2        7
      S.D    F12, -16(R1)                                  8
      DADDIU R1, R1, -40                                   9
      S.D    F16, 16(R1)                                  10
      BNE    R1, R2, LOOP                                 11
      S.D    F20, 8(R1)                                   12
20
Static Branch Prediction
  • Static branch predictors are sometimes used in
    processors where the expectation is that branch
    behavior is highly predictable at compile time.

21
Static Branch Prediction
  • Predict a branch as taken
  • Simplest scheme
  • Average misprediction rate for SPEC: 34%
  • (ranging from 9% to 59%)
  • Predict on the basis of branch direction:
  • backward-going branches predicted taken
  • forward-going branches predicted not taken
  • Unlikely to achieve an overall misprediction
    rate of less than 30% to 40%.

22
Static Branch Prediction
  • Predict branches on the basis of profile
    information collected from earlier runs.
  • An individual branch is often highly biased
    toward taken or untaken. (bimodally distributed)
  • Changing the input so that the profile is for a
    different run leads to only a small change in the
    accuracy of profile-based prediction.

23
VLIW
  • Very Long Instruction Word
  • Rely on compiler technology to minimize the
    potential data hazard stalls.
  • Actually format the instructions in a potential
    issue packet so that the hardware need not check
    explicitly for dependences.
  • Wide instructions with multiple operations per
    instruction (64 or 128 bits, or more).
  • Intel IA-64 architecture

24
Basic VLIW Approach
  • VLIWs use multiple, independent functional units.
  • A VLIW packages the multiple operations into one
    very long instruction.
  • The hardware in a superscalar for multiple issue
    is unnecessary.
  • Uses loop unrolling, scheduling

25
  • Local Scheduling: scheduling the code within a
    single basic block.
  • Global Scheduling: scheduling code across
    branches
  • much more complex
  • Trace Scheduling: see Section 4.5
  • Figure 4.5: VLIW instructions

26
Problems
  • Increase in code size
  • Wasted functional units
  • In the previous example, only about 60% of the
    functional units were used.

27
Detecting and Enhancing Loop-level Parallelism
  • Loop-level parallelism is analyzed at the source
    level; ILP concerns the machine-level code, after
    compilation.

    for (i = 1000; i > 0; i--)
        x[i] = x[i] + s;

28
Advanced Compiler Support for Exposing and
Exploiting ILP
for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
29
Loop-Carried Dependence
  • Data accesses in later iterations are dependent
    on data values produced in earlier iterations.

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
This dependence forces successive iterations of
this loop to execute in series.
Loop-Carried Dependences
30
Does a loop-carried dependence mean there is no
parallelism???
  • Consider:

        for (i = 0; i < 8; i = i + 1)
            A = A + C[i];    /* S1 */

    We could compute:

        Cycle 1: temp0 = C[0] + C[1];
                 temp1 = C[2] + C[3];
                 temp2 = C[4] + C[5];
                 temp3 = C[6] + C[7];
        Cycle 2: temp4 = temp0 + temp1;
                 temp5 = temp2 + temp3;
        Cycle 3: A = temp4 + temp5;

  • Relies on the associative nature of +.
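The same regrouping can be written as a runnable C sketch. Integers are used here so that + is exactly associative (with floating point, regrouping can change rounding); the function names, and the assumption that A starts at 0, are ours:

```c
#include <assert.h>

/* Sequential form: a chain of 8 dependent additions on A
   (A assumed to start at 0 for this illustration). */
int sum_chain(const int *C) {
    int A = 0;
    for (int i = 0; i < 8; i = i + 1)
        A = A + C[i];   /* S1: loop-carried dependence on A */
    return A;
}

/* Tree form: the additions within each "cycle" are independent of
   one another, so the dependence chain is only 3 additions deep. */
int sum_tree(const int *C) {
    int temp0 = C[0] + C[1], temp1 = C[2] + C[3];      /* cycle 1 */
    int temp2 = C[4] + C[5], temp3 = C[6] + C[7];
    int temp4 = temp0 + temp1, temp5 = temp2 + temp3;  /* cycle 2 */
    return temp4 + temp5;                              /* cycle 3 */
}
```

Both functions perform the same seven additions; only the dependence structure differs.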

31
for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}
Loop-Carried Dependence
Despite this loop-carried dependence, this loop
can be made parallel.
32
A[1] = A[1] + B[1];
for (i = 1; i <= 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
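A C sketch can confirm that the transformed loop computes exactly the same values as the original (the array sizes and test data below are ours; both versions perform the identical additions, only regrouped across iterations):

```c
#include <assert.h>

/* Original: S1 in iteration i+1 consumes the B[i+1] that S2
   produced in iteration i -- a loop-carried dependence. */
void original(double *A, double *B, const double *C, const double *D) {
    for (int i = 1; i <= 100; i = i + 1) {
        A[i] = A[i] + B[i];       /* S1 */
        B[i + 1] = C[i] + D[i];   /* S2 */
    }
}

/* Transformed: S1 now consumes the B value produced earlier in the
   same iteration, so no dependence crosses iterations and the loop
   body can run in parallel. */
void transformed(double *A, double *B, const double *C, const double *D) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= 99; i = i + 1) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[101] = C[100] + D[100];
}
```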
33
Recurrence
  • A recurrence occurs when a variable is defined
    based on the value of that variable in an earlier
    iteration, often the one immediately preceding.
  • Detecting a recurrence can be important:
  • Some architectures (especially vector computers)
    have special support for executing recurrences.
  • Some recurrences can be the source of a
    reasonable amount of parallelism.

34
for (i = 2; i <= 100; i = i + 1)
    Y[i] = Y[i-1] + Y[i];

Dependence distance: 1

for (i = 6; i <= 100; i = i + 1)
    Y[i] = Y[i-5] + Y[i];

Dependence distance: 5

The larger the distance, the more potential
parallelism can be obtained by unrolling the loop.
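To illustrate why a larger distance exposes more parallelism, the following C sketch (our own, with an assumed array size of 101) unrolls the distance-5 loop by 5. Each statement reads a value written five iterations earlier, i.e. in the previous group, so the five statements within a group are mutually independent and could be scheduled in parallel:

```c
#include <assert.h>

/* Original recurrence with dependence distance 5. */
void recur(double *Y) {
    for (int i = 6; i <= 100; i = i + 1)
        Y[i] = Y[i - 5] + Y[i];
}

/* Unrolled by 5 (95 iterations = 19 groups of 5): every read in a
   group refers to a value written by the previous group, so the
   five statements in the body are independent of one another. */
void recur_unrolled(double *Y) {
    for (int i = 6; i <= 96; i = i + 5) {
        Y[i]     = Y[i - 5] + Y[i];
        Y[i + 1] = Y[i - 4] + Y[i + 1];
        Y[i + 2] = Y[i - 3] + Y[i + 2];
        Y[i + 3] = Y[i - 2] + Y[i + 3];
        Y[i + 4] = Y[i - 1] + Y[i + 4];
    }
}
```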
35
Finding Dependences
  • Determining whether a dependence actually exists
    is NP-complete.
  • Dependence Analysis
  • Basic tool for detecting loop-level parallelism
  • Applies only under a limited set of
    circumstances.
  • Greatest common divisor (GCD) test, points-to
    analysis, interprocedural analysis, etc.
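As a concrete illustration of one of these tools, here is a minimal C sketch of the GCD test (the function names are ours). For a store to x[a*i + b] and a load from x[c*i + d] in the same loop, a dependence can exist only if gcd(a, c) divides d - b; the test is conservative, so passing it does not prove a dependence exists:

```c
#include <assert.h>
#include <stdlib.h>

/* Greatest common divisor, Euclid's algorithm. */
static int gcd(int a, int b) {
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* GCD test for a store to x[a*i + b] and a load from x[c*i + d].
   Returns 0 when the references can never overlap; returns 1 when
   a dependence may (but need not) exist. */
int gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(abs(a), abs(c));
    if (g == 0) return 1;   /* degenerate strides; stay conservative */
    return (d - b) % g == 0;
}
```

For example, for x[2*i + 3] and x[2*i], gcd(2, 2) = 2 does not divide -3, so the two references can never touch the same element.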

36
Eliminating Dependent Computation
  • Algebraic Simplifications of Expressions
  • Copy propagation
  • Eliminates operations that copy values.

    DADDIU R1, R2, 4
    DADDIU R1, R1, 4

becomes

    DADDIU R1, R2, 8
37
Eliminating Dependent Computation
  • Tree Height Reduction
  • Reduces the height of the tree structure
    representing a computation.

    ADD R1, R2, R3
    ADD R4, R1, R6
    ADD R8, R4, R7

becomes

    ADD R1, R2, R3
    ADD R4, R6, R7
    ADD R8, R1, R4
38
Eliminating Dependent Computation
  • Recurrences

sum = sum + x1 + x2 + x3 + x4 + x5;

becomes

sum = (sum + x1) + (x2 + x3) + (x4 + x5);
39
Software Pipelining
  • Technique for reorganizing loops such that each
    iteration in the software-pipelined code is made
    from instructions chosen from different
    iterations of the original loop.
  • By choosing instructions from different
    iterations, dependent computations are separated
    from one another by an entire loop body.

40
Software Pipelining
  • Counterpart to what Tomasulo's algorithm does in
    hardware.
  • Software pipelining symbolically unrolls the loop
    and then selects instructions from each
    iteration.
  • Start-up code before the loop and finish-up code
    after the loop are required.
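The idea can be sketched in C, as a hand-written analogue of the symbolic unrolling (the variable and function names are ours; a real compiler performs this at the instruction level):

```c
#include <assert.h>

#define N 1000

/* Original loop: the load, add, and store of one element all occur
   in the same iteration, so each result is needed immediately. */
void plain_loop(double *x, double s) {
    for (int i = N; i > 0; i = i - 1)
        x[i] = x[i] + s;
}

/* Software-pipelined version: each steady-state iteration stores
   for iteration i, adds for iteration i-1, and loads for iteration
   i-2, so a full loop body separates each value from its use. */
void pipelined_loop(double *x, double s) {
    double loaded, added;
    /* start-up code: begin the first two iterations */
    loaded = x[N];         /* load for iteration N */
    added  = loaded + s;   /* add for iteration N */
    loaded = x[N - 1];     /* load for iteration N-1 */
    for (int i = N; i > 2; i = i - 1) {
        x[i]   = added;        /* store for iteration i */
        added  = loaded + s;   /* add for iteration i-1 */
        loaded = x[i - 2];     /* load for iteration i-2 */
    }
    /* finish-up code: drain the last two iterations */
    x[2] = added;
    x[1] = loaded + s;
}
```

Note the start-up and finish-up code that the slide's example omits: the loop body only runs at "peak speed" once the pipeline of partial iterations is filled.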

41
Software Pipelining
42
Software Pipelining - Example
  • Show a software-pipelined version of the
    following loop. Omit the start-up and finish-up
    code.

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, -8
      BNE    R1, R2, Loop
43
Software Pipelining
  • Software pipelining consumes less code space.
  • Loop unrolling reduces the overhead of the loop
    (branch, counter update code).
  • Software pipelining reduces the time when the
    loop is not running at peak speed to once per
    loop at the beginning and end.

45
HW Support for More Parallelism at Compile Time:
Conditional Instructions
  • Predicated instructions
  • Extension of the instruction set
  • Conditional instruction: an instruction that
    refers to a condition, which is evaluated as part
    of the instruction execution
  • Condition is true: executed normally
  • Condition is false: no-op
  • ex) conditional move

46
Example
if (A == 0) S = T;

With R1 = A, R2 = S, R3 = T:

      BNEZ  R1, L
      ADDU  R2, R3, R0
L:

or, using a conditional move:

      CMOVZ R2, R3, R1

CMOVZ performs the move only if the third operand is
equal to zero.
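The effect of a conditional move can be imitated in portable, branch-free C with a mask; this is a sketch of our own (the hardware CMOVZ does this in a single instruction):

```c
#include <assert.h>

/* Branching version of: if (A == 0) S = T; */
int with_branch(int A, int S, int T) {
    if (A == 0) S = T;
    return S;
}

/* Branch-free version: -(A == 0) is all ones when A == 0 and all
   zeros otherwise, so the mask selects T or S without a branch. */
int with_cmov(int A, int S, int T) {
    int mask = -(A == 0);
    return (T & mask) | (S & ~mask);
}
```

The control dependence on A becomes a data dependence through mask, which is the point made on the next slide.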
47
  • Conditional moves are used to change a control
    dependence into a data dependence.
  • Handling multiple branches per cycle is complex.
    => Conditional moves provide a way of reducing
    branch pressure.
  • A conditional move can often eliminate a branch
    that is hard to predict, increasing the potential
    gain.