Title: Exploiting Instruction-Level Parallelism with Software Approaches
1. Exploiting Instruction-Level Parallelism with Software Approaches
E. J. Kim
2.
- To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.
- Goal: keep the pipeline full.
3. Latencies
- Branch: 1
- Integer ALU op -> branch: 1
- Integer load -> integer ALU: 1
- Integer ALU -> integer ALU: 1
4. Example

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, -8
      BNE    R1, R2, LOOP
5. Without Any Scheduling

                          Clock cycle issued
Loop: L.D    F0, 0(R1)     1
      stall                2
      ADD.D  F4, F0, F2    3
      stall                4
      stall                5
      S.D    F4, 0(R1)     6
      DADDIU R1, R1, -8    7
      stall                8
      BNE    R1, R2, LOOP  9
      stall                10
6. With Scheduling

                          Clock cycle issued
Loop: L.D    F0, 0(R1)     1
      DADDIU R1, R1, -8    2
      ADD.D  F4, F0, F2    3
      stall                4
      BNE    R1, R2, LOOP  5
      S.D    F4, 8(R1)     6

- The S.D fills the delayed branch slot.
- This is not trivial: because the DADDIU now executes first, the S.D offset must change from 0(R1) to 8(R1).
7.
- The actual work of operating on the array element takes 3 of the 6 cycles (load, add, store).
- The remaining 3 cycles are:
  - loop overhead (DADDIU, BNE)
  - a stall
- To eliminate these 3 cycles, we need to get more operations within the loop relative to the number of overhead instructions.
8. Reducing Loop Overhead
- Loop unrolling
  - A simple scheme for increasing the number of instructions relative to the branch and overhead instructions.
  - Simply replicates the loop body multiple times, adjusting the loop termination code.
  - Improves scheduling: it allows instructions from different iterations to be scheduled together.
  - Uses different registers for each iteration.
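The idea above can be sketched in C. This is my own minimal illustration, not code from the slides: the function name is hypothetical, and it assumes n is a multiple of 4 (the slides' loop runs 1000 iterations, which unrolls evenly).

```c
#include <assert.h>
#include <stddef.h>

/* x[i] = x[i] + s, unrolled four times. The four bodies are
   independent, so a scheduler can interleave their loads, adds,
   and stores; one i += 4 and one branch now cover four elements. */
static void add_scalar_unrolled(double *x, size_t n, double s) {
    for (size_t i = 0; i < n; i += 4) {
        x[i]     = x[i]     + s;   /* iteration i   */
        x[i + 1] = x[i + 1] + s;   /* iteration i+1 */
        x[i + 2] = x[i + 2] + s;   /* iteration i+2 */
        x[i + 3] = x[i + 3] + s;   /* iteration i+3 */
    }
}
```

A real compiler would also emit clean-up code for trip counts that are not a multiple of the unroll factor; that is omitted here.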
9. Unrolled Loop (No Scheduling)

                          Clock cycle issued
Loop: L.D    F0, 0(R1)     1
      stall                2
      ADD.D  F4, F0, F2    3
      stall                4
      stall                5
      S.D    F4, 0(R1)     6
      L.D    F6, -8(R1)    7
      stall                8
      ADD.D  F8, F6, F2    9
      stall                10
      stall                11
      S.D    F8, -8(R1)    12
      L.D    F10, -16(R1)  13
      stall                14
      ADD.D  F12, F10, F2  15
      stall                16
      stall                17
      S.D    F12, -16(R1)  18
      L.D    F14, -24(R1)  19
      stall                20
      ADD.D  F16, F14, F2  21
      stall                22
      stall                23
      S.D    F16, -24(R1)  24
      DADDIU R1, R1, -32   25
      stall                26
      BNE    R1, R2, LOOP  27
      stall                28
10. Loop Unrolling
- Loop unrolling is normally done early in the compilation process, so that redundant computations can be exposed and eliminated by the optimizer.
- Unrolling improves the performance of the loop by eliminating overhead instructions.
11. Loop Unrolling (Scheduling)

                          Clock cycle issued
Loop: L.D    F0, 0(R1)     1
      L.D    F6, -8(R1)    2
      L.D    F10, -16(R1)  3
      L.D    F14, -24(R1)  4
      ADD.D  F4, F0, F2    5
      ADD.D  F8, F6, F2    6
      ADD.D  F12, F10, F2  7
      ADD.D  F16, F14, F2  8
      S.D    F4, 0(R1)     9
      S.D    F8, -8(R1)    10
      DADDIU R1, R1, -32   11
      S.D    F12, 16(R1)   12
      BNE    R1, R2, LOOP  13
      S.D    F16, 8(R1)    14
12. Summary
- Goal: to know when and how the ordering among instructions may be changed.
- This process must be performed in a methodical fashion, either by a compiler or by hardware.
13.
- To obtain the final unrolled code:
  - Determine that it is legal to move the S.D after the DADDIU and BNE, and find the amount to adjust the S.D offset.
  - Determine that unrolling the loop will be useful by finding that the loop iterations are independent, except for the loop maintenance code.
  - Use different registers to avoid unnecessary constraints.
  - Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.
14.
- Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address.
- Schedule the code, preserving any dependences needed to yield the same result as the original code.
15. Loop Unrolling I (No Delayed Branch)

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F0, -8(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -8(R1)
      L.D    F0, -16(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -16(R1)
      L.D    F0, -24(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -24(R1)
      DADDIU R1, R1, -32
      BNE    R1, R2, LOOP

- Reusing F0 and F4 across the copies creates name dependences.
- Each L.D -> ADD.D -> S.D chain is a true dependence.
16. Loop Unrolling II (Register Renaming)

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDIU R1, R1, -32
      BNE    R1, R2, LOOP

- Only the true dependences within each L.D -> ADD.D -> S.D chain remain.
17.
- With the renaming, the copies of each loop body become independent and can be overlapped or executed in parallel.
- Potential shortfall in registers: register pressure
  - It arises because scheduling code to increase ILP causes the number of live values to increase. It may not be possible to allocate all the live values to registers.
  - The combination of unrolling and aggressive scheduling can cause this problem.
18.
- Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively.
19. Unrolling with Two-Issue

      Integer instruction     FP instruction        Clock cycle
Loop: L.D    F0, 0(R1)                               1
      L.D    F6, -8(R1)                              2
      L.D    F10, -16(R1)     ADD.D F4, F0, F2       3
      L.D    F14, -24(R1)     ADD.D F8, F6, F2       4
      L.D    F18, -32(R1)     ADD.D F12, F10, F2     5
      S.D    F4, 0(R1)        ADD.D F16, F14, F2     6
      S.D    F8, -8(R1)       ADD.D F20, F18, F2     7
      S.D    F12, -16(R1)                            8
      DADDIU R1, R1, -40                             9
      S.D    F16, 16(R1)                             10
      BNE    R1, R2, LOOP                            11
      S.D    F20, 8(R1)                              12
20. Static Branch Prediction
- Static branch predictors are sometimes used in processors where the expectation is that branch behavior is highly predictable at compile time.
21. Static Branch Prediction
- Predict every branch as taken
  - The simplest scheme.
  - Average misprediction rate for SPEC: 34% (ranging from 9% to 59%).
- Predict on the basis of branch direction
  - Backward-going branches: predicted taken.
  - Forward-going branches: predicted not taken.
  - Unlikely to achieve an overall misprediction rate of less than 30% to 40%.
22. Static Branch Prediction
- Predict branches on the basis of profile information collected from earlier runs.
- An individual branch is often highly biased toward taken or untaken (bimodally distributed).
- Changing the input so that the profile is for a different run leads to only a small change in the accuracy of profile-based prediction.
23. VLIW
- Very Long Instruction Word
- Relies on compiler technology to minimize the potential data hazard stalls.
- Actually formats the instructions in a potential issue packet so that the hardware need not check explicitly for dependences.
- Wide instructions with multiple operations per instruction (64, 128 bits, or more).
- Example: the Intel IA-64 architecture.
24. Basic VLIW Approach
- VLIWs use multiple, independent functional units.
- A VLIW packages the multiple operations into one very long instruction.
- The hardware a superscalar needs for multiple issue is unnecessary.
- Uses loop unrolling and scheduling.
25.
- Local scheduling: scheduling the code within a single basic block.
- Global scheduling: scheduling code across branches.
  - Much more complex.
  - Trace scheduling (Section 4.5).
- Figure 4.5: VLIW instructions.
26. Problems
- Increase in code size.
- Wasted functional units.
  - In the previous example, only about 60% of the functional units were used.
27. Detecting and Enhancing Loop-Level Parallelism
- Loop-level parallelism: analyzed at the source level.
- ILP: analyzed at the machine-code level, after compilation.

for (i = 1000; i > 0; i--)
    x[i] = x[i] + s;
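No iteration of this loop reads a value written by another iteration, so the iterations can run in any order. A small C check (my own illustration, not from the slides) makes that independence concrete by traversing the array in both directions:

```c
#include <assert.h>
#include <stddef.h>

/* x[i] = x[i] + s, walking forward through the array. */
static void add_forward(double *x, size_t n, double s) {
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Same computation, walking backward (i = n-1 down to 0).
   Identical results show the iterations carry no dependence. */
static void add_backward(double *x, size_t n, double s) {
    for (size_t i = n; i-- > 0; )
        x[i] = x[i] + s;
}
```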
28. Advanced Compiler Support for Exposing and Exploiting ILP

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
29. Loop-Carried Dependence
- Data accesses in later iterations are dependent on data values produced in earlier iterations.

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];    /* S1: loop-carried dependence on A */
    B[i+1] = B[i] + A[i+1];  /* S2: loop-carried dependence on B */
}

These dependences force successive iterations of this loop to execute in series.
30. Does a loop-carried dependence mean there is no parallelism?
- Consider:

for (i = 0; i < 8; i = i + 1)
    A = A + C[i];    /* S1 */

Could compute:

Cycle 1: temp0 = C[0] + C[1]; temp1 = C[2] + C[3];
         temp2 = C[4] + C[5]; temp3 = C[6] + C[7];
Cycle 2: temp4 = temp0 + temp1; temp5 = temp2 + temp3;
Cycle 3: A = temp4 + temp5;

- Relies on the associative nature of +.
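The three-cycle schedule above can be written out directly in C. This sketch is my own: it uses int data so that reassociation is exact, and it folds the initial value of A in at the end (the slide leaves the initial A implicit).

```c
#include <assert.h>

/* Tree reduction of A = A + C[0] + ... + C[7] in three "cycles"
   of independent additions, as on the slide. */
static int reduce8(int a, const int c[8]) {
    /* Cycle 1: four independent adds */
    int temp0 = c[0] + c[1];
    int temp1 = c[2] + c[3];
    int temp2 = c[4] + c[5];
    int temp3 = c[6] + c[7];
    /* Cycle 2: two independent adds */
    int temp4 = temp0 + temp1;
    int temp5 = temp2 + temp3;
    /* Cycle 3: combine, folding in the initial value of A */
    return a + (temp4 + temp5);
}
```

With floating-point data the reassociated sum may differ slightly from the sequential one, which is why compilers apply this only under relaxed floating-point rules.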
31.

for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

- S1 uses the B[i] value computed by S2 in the previous iteration: a loop-carried dependence.
- Despite this loop-carried dependence, this loop can be made parallel.
32.

A[1] = A[1] + B[1];
for (i = 1; i <= 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
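To check that this transformation preserves the loop's results, the two forms can be run side by side. This harness is my own sketch; the array sizes are chosen so that indices 1 through 101 are valid.

```c
#include <assert.h>

#define N 100

/* Original form: S1 reads the B[i] produced by S2 in the
   previous iteration (a loop-carried dependence). */
static void original_loop(int A[], int B[], const int C[], const int D[]) {
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i + 1] = C[i] + D[i];  /* S2 */
    }
}

/* Transformed form: the statements inside the loop no longer
   depend on each other across iterations. */
static void transformed_loop(int A[], int B[], const int C[], const int D[]) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];
}
```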
33. Recurrence
- A recurrence is when a variable is defined based on the value of that variable in an earlier iteration, often the one immediately preceding.
- Detecting a recurrence can be important:
  - Some architectures (especially vector computers) have special support for executing recurrences.
  - Some recurrences can be the source of a reasonable amount of parallelism.
34.

for (i = 2; i <= 100; i = i + 1)
    Y[i] = Y[i-1] + Y[i];

Dependence distance: 1

for (i = 6; i <= 100; i = i + 1)
    Y[i] = Y[i-5] + Y[i];

Dependence distance: 5

The larger the distance, the more potential parallelism can be obtained by unrolling the loop.
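With distance 5, indices that differ mod 5 never depend on one another, so the loop splits into five independent chains. The following C check is my own sketch: it runs the chains one at a time and gets the same result as the original sequential order.

```c
#include <assert.h>

/* Original sequential order: i = 6, 7, ..., 100. */
static void recur_original(int Y[101]) {
    for (int i = 6; i <= 100; i++)
        Y[i] = Y[i - 5] + Y[i];
}

/* Chain-at-a-time order: chain c touches only indices congruent
   to c mod 5, so the five chains could run in parallel. */
static void recur_by_chain(int Y[101]) {
    for (int c = 6; c <= 10; c++)        /* first index of each chain */
        for (int i = c; i <= 100; i += 5)
            Y[i] = Y[i - 5] + Y[i];
}
```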
35. Finding Dependences
- Determining whether a dependence actually exists is NP-complete.
- Dependence analysis
  - The basic tool for detecting loop-level parallelism.
  - Applies only under a limited set of circumstances.
  - Techniques: the greatest common divisor (GCD) test, points-to analysis, interprocedural analysis, etc.
36. Eliminating Dependent Computation
- Algebraic simplifications of expressions
- Copy propagation
  - Eliminates operations that copy values.

DADDIU R1, R2, 4
DADDIU R1, R1, 4

becomes

DADDIU R1, R2, 8
37. Eliminating Dependent Computation
- Tree height reduction
  - Reduces the height of the tree structure representing a computation.

ADD R1, R2, R3
ADD R4, R1, R6
ADD R8, R4, R7

becomes

ADD R1, R2, R3
ADD R4, R6, R7
ADD R8, R1, R4
38. Eliminating Dependent Computation

sum = sum + x1 + x2 + x3 + x4 + x5;

becomes

sum = ((sum + x1) + (x2 + x3)) + (x4 + x5);
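In C, the two associations look like this. The framing is mine: integer data is used so the equality is exact; for floating point, the compiler needs permission to reassociate.

```c
#include <assert.h>

/* Serial chain: height 5, each add waits for the previous one. */
static int sum_chain(int sum, int x1, int x2, int x3, int x4, int x5) {
    return sum + x1 + x2 + x3 + x4 + x5;
}

/* Balanced tree: height 3; (sum + x1), (x2 + x3), and (x4 + x5)
   are independent and can issue in the same cycle. */
static int sum_tree(int sum, int x1, int x2, int x3, int x4, int x5) {
    return ((sum + x1) + (x2 + x3)) + (x4 + x5);
}
```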
39. Software Pipelining
- A technique for reorganizing loops such that each iteration in the software-pipelined code is made from instructions chosen from different iterations of the original loop.
- By choosing instructions from different iterations, dependent computations are separated from one another by an entire loop body.
40. Software Pipelining
- The software counterpart to what Tomasulo's algorithm does in hardware.
- Software pipelining symbolically unrolls the loop and then selects instructions from each iteration.
- Start-up code before the loop and finish-up code after the loop are required.
41. Software Pipelining
42. Software Pipelining: Example
- Show a software-pipelined version of the following loop. Omit the start-up and finish-up code.

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, -8
      BNE    R1, R2, Loop
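One possible answer, sketched in C rather than MIPS (this rendering, including the ascending traversal and the variable names, is my own): each steady-state iteration performs the store for iteration i, the add for iteration i+1, and the load for iteration i+2, so no iteration waits on its own load or add.

```c
#include <assert.h>
#include <stddef.h>

/* Software-pipelined x[i] = x[i] + s; requires n >= 2. */
static void add_scalar_pipelined(double *x, size_t n, double s) {
    /* start-up code: get iterations 0 and 1 in flight */
    double loaded = x[0];       /* load for iteration 0 */
    double added = loaded + s;  /* add  for iteration 0 */
    loaded = x[1];              /* load for iteration 1 */

    /* steady state: one store, one add, one load per pass */
    for (size_t i = 0; i + 2 < n; i++) {
        x[i] = added;           /* store for iteration i   */
        added = loaded + s;     /* add   for iteration i+1 */
        loaded = x[i + 2];      /* load  for iteration i+2 */
    }

    /* finish-up code: drain the pipeline */
    x[n - 2] = added;
    x[n - 1] = loaded + s;
}
```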
43. Software Pipelining
- Software pipelining consumes less code space.
- Loop unrolling reduces the overhead of the loop (the branch and counter-update code).
- Software pipelining reduces the time when the loop is not running at peak speed to once per loop, at the beginning and the end.
45. HW Support for More Parallelism at Compile Time: Conditional Instructions
- Predicated instructions
  - An extension of the instruction set.
  - A conditional instruction refers to a condition, which is evaluated as part of the instruction execution.
    - Condition true: executed normally.
    - Condition false: no-op.
  - Example: conditional move.
46. Example

if (A == 0) S = T;

Assume R1 holds A, R2 holds S, and R3 holds T.

With a branch:

      BNEZ  R1, L
      ADDU  R2, R3, R0
L:

With a conditional move, which moves only if the third operand is equal to zero:

      CMOVZ R2, R3, R1
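The same transformation written in C (my own sketch): the conditional expression gives the compiler a form it can typically lower to a CMOVZ-style instruction instead of a branch.

```c
#include <assert.h>

/* if (a == 0) s = t;  expressed as a value selection: the control
   dependence on (a == 0) becomes a data dependence of the result
   on a, t, and s. */
static int select_if_zero(int s, int t, int a) {
    return (a == 0) ? t : s;
}
```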
47.
- Conditional moves are used to change a control dependence into a data dependence.
- Handling multiple branches per cycle is complex, so conditional moves provide a way of reducing branch pressure.
- A conditional move can often eliminate a branch that is hard to predict, increasing the potential gain.