ILP: Software Approaches - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: ILP: Software Approaches


1
ILP: Software Approaches
  • Vincent H. Berk
  • October 12th
  • Reading for today: 3.7-3.9, 4.1
  • Reading for Friday: 4.2-4.6
  • Homework #2 due Friday 14th: 2.8, A.2, A.13,
    3.6a&b, 3.10, 4.5, 4.8, (4.13 optional)

2
Basic Loop Unrolling
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Loop:  LD    F0, 0(R1)     ; F0 = array element
       ADDD  F4, F0, F2    ; add scalar in F2
       SD    0(R1), F4     ; store result
       SUBI  R1, R1, #8    ; decrement pointer 8 bytes (DW)
       BNEZ  R1, Loop      ; branch if R1 != zero
       NOP                 ; delayed branch slot
3
FP Loop Hazards
Loop:  LD    F0, 0(R1)     ; F0 = vector element
       ADDD  F4, F0, F2    ; add scalar in F2
       SD    0(R1), F4     ; store result
       SUBI  R1, R1, #8    ; decrement pointer 8 bytes (DW)
       BNEZ  R1, Loop      ; branch if R1 != zero
       NOP                 ; delayed branch slot
Where are the stalls?
4
FP Loop Showing Stalls
Rewrite code to minimize stalls?
5
Revised FP Loop Minimizing Stalls
Can we unroll the loop to make it faster?
6
Loop Unrolling
  • A short loop body limits parallelism and induces
    significant overhead
  • The ratio of branches to other instructions is high
  • Replicate the loop body several times and adjust
    the loop termination code:

    for (i = 0; i < 100; i = i + 4) {
        x[i]     = x[i]     + y[i];
        x[i + 1] = x[i + 1] + y[i + 1];
        x[i + 2] = x[i + 2] + y[i + 2];
        x[i + 3] = x[i + 3] + y[i + 3];
    }

  • Improves scheduling since instructions from
    different iterations can be scheduled together
  • This is done very early in the compilation
    process
  • All dependences have to be found beforehand
  • Need to use different registers for each iteration
    (a compilable C sketch with a cleanup loop follows
    this list)

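A minimal, compilable C sketch of the 4-way unrolling above; the function
name, the size parameter n, and the cleanup loop are illustrative
additions, not from the slides:

    #include <stddef.h>

    void vector_add_unrolled(double *x, const double *y, size_t n)
    {
        size_t i = 0;

        /* Unrolled body: four independent statements per pass, so the
           compiler and hardware can overlap their execution. */
        for (; i + 4 <= n; i += 4) {
            x[i]     = x[i]     + y[i];
            x[i + 1] = x[i + 1] + y[i + 1];
            x[i + 2] = x[i + 2] + y[i + 2];
            x[i + 3] = x[i + 3] + y[i + 3];
        }

        /* Cleanup loop: handles the leftover iterations when n is not a
           multiple of 4. */
        for (; i < n; i++)
            x[i] = x[i] + y[i];
    }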
7
Where are the control dependences?
 1  Loop: LD    F0, 0(R1)
 2        ADDD  F4, F0, F2
 3        SD    0(R1), F4
 4        SUBI  R1, R1, #8
 5        BEQZ  R1, exit
 6        LD    F0, 0(R1)
 7        ADDD  F4, F0, F2
 8        SD    0(R1), F4
 9        SUBI  R1, R1, #8
10        BEQZ  R1, exit
11        LD    F0, 0(R1)
12        ADDD  F4, F0, F2
13        SD    0(R1), F4
14        SUBI  R1, R1, #8
15        BEQZ  R1, exit
          ....
8
Data Dependences
 1  Loop: LD    F0, 0(R1)
 2        ADDD  F4, F0, F2
 3        SD    0(R1), F4       ; drop SUBI & BNEZ
 4        LD    F0, -8(R1)
 5        ADDD  F4, F0, F2
 6        SD    -8(R1), F4      ; drop SUBI & BNEZ
 7        LD    F0, -16(R1)
 8        ADDD  F4, F0, F2
 9        SD    -16(R1), F4     ; drop SUBI & BNEZ
10        LD    F0, -24(R1)
11        ADDD  F4, F0, F2
12        SD    -24(R1), F4
13        SUBI  R1, R1, #32     ; alter to 4*8
14        BNEZ  R1, LOOP
15        NOP
9
Name Dependences
 1  Loop: LD    F0, 0(R1)
 2        ADDD  F4, F0, F2
 3        SD    0(R1), F4       ; drop SUBI & BNEZ
 4        LD    F6, -8(R1)
 5        ADDD  F8, F6, F2
 6        SD    -8(R1), F8      ; drop SUBI & BNEZ
 7        LD    F10, -16(R1)
 8        ADDD  F12, F10, F2
 9        SD    -16(R1), F12    ; drop SUBI & BNEZ
10        LD    F14, -24(R1)
11        ADDD  F16, F14, F2
12        SD    -24(R1), F16
13        SUBI  R1, R1, #32     ; alter to 4*8
14        BNEZ  R1, LOOP
15        NOP

Register renaming
10
Unroll Loop Four Times
Rewrite loop to minimize stalls?
15 + 4 × (1 + 2) + 1 = 28 clock cycles to initiate,
or 7 per iteration. Assumes R1 is a multiple of 4.
11
Unrolled Loop That Minimizes Stalls
  • What assumptions were made when we moved code?
  • OK to move store past SUBI even though SUBI
    changes the register
  • OK to move loads before stores: do we get the right data?
  • When is it safe for the compiler to make such changes?

Can we eliminate the remaining stall?
14 + 1 = 15 clock cycles, or 3.75 per iteration
12
Compiler Loop Unrolling
  • Most important: Code Correctness
  • Unrolling produces larger code that might
    interfere with cache
  • Code sequence no longer fits in L1 cache
  • Cache to memory bandwidth might not be wide
    enough
  • Compiler must understand hardware
  • Enough registers must be available OR
  • Compiler must rely on hardware register renaming
  • Compiler must understand the code
  • Determine that loop iterations are independent
    (see the C sketch after this list)
  • Eliminate branch instructions while preserving
    correctness
  • Determine that the LD and SD are independent over
    the loop
  • Rescheduling of instructions and adjusting the
    offsets

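A small C sketch (function names are illustrative, not from the slides)
contrasting a loop whose iterations are independent, and therefore legal
to unroll and reorder, with one that carries a dependence across
iterations:

    void independent(double *x, const double *y, int n)
    {
        /* Each iteration touches only x[i] and y[i]: iterations are
           independent, so the compiler may unroll and reorder freely. */
        for (int i = 0; i < n; i++)
            x[i] = x[i] + y[i];
    }

    void loop_carried(double *x, int n)
    {
        /* x[i] depends on x[i-1], written in the previous iteration:
           a loop-carried dependence that blocks naive unrolling. */
        for (int i = 1; i < n; i++)
            x[i] = x[i] + x[i - 1];
    }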
13
Superscalar Example
  • Superscalar
  • Our system can issue one floating point and one
    other (non-floating point) instruction per cycle.
  • Instructions are dynamically scheduled from the
    window
  • Unroll the loop 5 times and reschedule to
    minimize cycles per iteration. (WHY?)
  • While the Integer/FP split is simple for the HW, we
    get a CPI of 0.5 only for programs with
  • Exactly 50% FP operations
  • No hazards
  • If more instructions issue at the same time, there is
    greater difficulty in decode and issue
  • Even 2-way superscalar ⇒ examine 2 opcodes and 6
    register specifiers, and decide if 1 or 2 instructions
    can issue

14
Loop Unrolling in Superscalar
    Integer instruction       FP instruction         Clock cycle
    Loop: LD   F0, 0(R1)                                  1
          LD   F6, -8(R1)                                 2
          LD   F10, -16(R1)   ADDD F4, F0, F2             3
          LD   F14, -24(R1)   ADDD F8, F6, F2             4
          LD   F18, -32(R1)   ADDD F12, F10, F2           5
          SD   0(R1), F4      ADDD F16, F14, F2           6
          SD   -8(R1), F8     ADDD F20, F18, F2           7
          SD   -16(R1), F12                               8
          SUBI R1, R1, #40                                9
          SD   16(R1), F16                               10
          BNEZ R1, Loop                                  11
          SD   8(R1), F20                                12
  • Unrolled 5 times to avoid delays (+1 due to SS)
  • 12 clocks to initiate, or 2.4 clocks per iteration

15
VLIW Example
  • VLIW
  • 5 instructions in one very long instruction word.
  • 2 FP, 2 Memory, 1 branch/integer
  • Compiler avoids hazards
  • Not all slots are always full
  • VLIW: trade off instruction space for simpler
    decoding
  • The long instruction word has room for many
    operations
  • By definition, all the operations the compiler
    puts in the long instruction word are independent
    ⇒ execute in parallel
  • E.g., 2 integer operations, 2 FP ops, 2 memory
    refs, 1 branch ⇒ 16 to 24 bits per field ⇒
    7×16 = 112 bits to 7×24 = 168 bits wide
  • Need compiling technique that schedules across
    several branches

16
Loop Unrolling in VLIW
    Memory ref. 1     Memory ref. 2     FP operation 1     FP operation 2     Int. op/branch     Clock
    LD F0, 0(R1)      LD F6, -8(R1)                                                                1
    LD F10, -16(R1)   LD F14, -24(R1)                                                              2
    LD F18, -32(R1)   LD F22, -40(R1)   ADDD F4, F0, F2    ADDD F8, F6, F2                         3
    LD F26, -48(R1)                     ADDD F12, F10, F2  ADDD F16, F14, F2                       4
                                        ADDD F20, F18, F2  ADDD F24, F22, F2                       5
    SD 0(R1), F4      SD -8(R1), F8     ADDD F28, F26, F2                                          6
    SD -16(R1), F12   SD -24(R1), F16                                                              7
    SD -32(R1), F20   SD -40(R1), F24                                         SUBI R1, R1, #48     8
    SD 0(R1), F28                                                             BNEZ R1, LOOP        9
  • Unrolled 7 times to avoid delays
  • 9 clocks to initiate, or 1.3 clocks per iteration
  • Average 2.5 ops per clock, 50% efficiency
  • Note: Need more registers in VLIW (15 vs. 6 in SS)

17
Limits to Multi-Issue Machines
  • Inherent limitations of instruction-level
    parallelism
  • 1 branch in 5 instructions: how to keep a 5-way VLIW busy?
  • Latencies of units: many operations must be
    scheduled
  • Easy: more instruction bandwidth
  • Easy: duplicate functional units to get parallel
    execution
  • Hard: increase ports to the register file
    (bandwidth)
  • The VLIW example needs 7 reads and 3 writes for the
    integer registers and 5 reads and 3
    writes for the FP registers
  • Harder: increase ports to memory (bandwidth)
  • Pipelines in lockstep
  • One pipeline stall stalls all the others to avoid
    hazards

18
Limits to Multi-Issue Machines
  • Limitations specific to either superscalar or
    VLIW implementation
  • Decode/issue in superscalar: how wide is
    practical?
  • VLIW code size: unrolled loops and wasted fields in
    the VLIW
  • IA-64 compresses dependent instructions, but is
    still larger
  • VLIW lockstep ⇒ 1 hazard & all instructions
    stall
  • IA-64 not lockstep? Dynamic pipeline?
  • VLIW & binary compatibility: IA-64 promises
    binary compatibility

19
Dependences
  • Two instructions are parallel if they can execute
    simultaneously in a pipeline without causing any
    stalls (assuming no structural hazards) and can
    be reordered (depending on code semantics)
  • Two instructions that are dependent are not
    parallel and cannot be reordered
  • Types of dependences (a short C example follows this list)
  • Data dependences
  • Name dependences
  • Control dependences
  • Dependences are properties of programs
  • Hazards are properties of the pipeline
    organization
  • Dependence indicates the potential for a hazard

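A short C example (the statement labels S1-S5 and the variable names are
illustrative, not from the slides) showing the three kinds of dependences:

    double dependences_example(double a, double b, double c,
                               double d, double e)
    {
        double t, u, v = 0.0;

        t = a + b;       /* S1 */
        u = t * c;       /* S2: data (true) dependence on S1 through t  */
        t = d + e;       /* S3: output dependence on S1, antidependence
                            on S2 -- both are name dependences on t     */
        if (u > 0.0)     /* S4 */
            v = t;       /* S5: control dependent on S4                 */

        return v;
    }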
20
Compiler Perspectives on Code Movement
  • Hard for memory accesses
  • Does 100(R4) = 20(R6)?
  • From different loop iterations, does 20(R6) =
    20(R6)?
  • Our example required the compiler to know that if R1
    doesn't change, then
  • 0(R1) ≠ -8(R1) ≠ -16(R1) ≠ -24(R1)
  • There were no dependences between some loads and
    stores, so they could be moved past each other
    (a C sketch of the aliasing problem follows this list)

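A sketch (function and parameter names are illustrative, not from the
slides) of the memory-disambiguation problem: may x[i] and y[i] name the
same word?

    void update(double *x, double *y, double s, int n)
    {
        for (int i = 0; i < n; i++) {
            x[i] = x[i] + s;     /* store through x                     */
            y[i] = y[i] * 2.0;   /* the compiler may move this load above
                                    the store only if it can prove x[i]
                                    and y[i] never overlap              */
        }
    }

    /* Declaring the parameters as 'double *restrict x, double *restrict y'
       (C99) asserts exactly that, and restores the freedom to reorder
       and unroll. */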
21
Detecting Loop Level Dependences
for (i = 1; i <= 100; i = i + 1) {
    A[i]     = A[i] + B[i];      /* S1 */
    B[i + 1] = C[i] + D[i];      /* S2 */
}
  • Loop carried dependence
  • S1 uses the value of B[i] that S2 computed in the
    previous iteration
  • There is no dependence from S1 to S2 within an
    iteration, so the loop can be transformed; consider:

A[1] = A[1] + B[1];
for (i = 1; i <= 99; i = i + 1) {
    B[i + 1] = C[i] + D[i];
    A[i + 1] = A[i + 1] + B[i + 1];
}
B[101] = C[100] + D[100];
22
Dependence Distance
for (i = 6; i <= 100; i = i + 1)
    Y[i] = Y[i - 5] + Y[i];
  • Loop carried dependence in the form of a
    recurrence of Y
  • Dependence distance of 5
  • A higher dependence distance allows for more ILP
    (see the unrolled sketch below)

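A sketch (the function wrapper and array bound are assumptions for
illustration) of how a dependence distance of 5 permits unrolling by up
to 5: each statement in a pass reads only values produced at least five
iterations earlier, so the five statements are mutually independent:

    void recurrence_unrolled(double Y[101])
    {
        /* 95 original iterations (i = 6..100) fit exactly into 19 passes. */
        for (int i = 6; i <= 100; i += 5) {
            Y[i]     = Y[i - 5] + Y[i];
            Y[i + 1] = Y[i - 4] + Y[i + 1];
            Y[i + 2] = Y[i - 3] + Y[i + 2];
            Y[i + 3] = Y[i - 2] + Y[i + 3];
            Y[i + 4] = Y[i - 1] + Y[i + 4];
        }
    }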
23
Greatest Common Divisor test
  • Affine array indices
  • All array indices DIRECTLY depend on the loop
    variable i, i.e., they have the form a*i + b
  • Assume the code has these properties:
  • the for loop runs from n to m with index i
  • the loop stores to X[a*i + b] and later loads from
    X[c*i + d]
  • take two values of i, call them j and k, both
    between n and m
  • the store indexed by j and the later load indexed
    by k conflict when a*j + b = c*k + d
  • If a loop-carried dependence exists, then GCD(c, a)
    must divide (d - b) (a small C check of this test
    follows the example below)
  • a = 2, b = 3, c = 2, d = 0: GCD(a, c) = 2 and d - b = -3
  • There is no loop-carried dependence since 2 does not
    divide -3

for (i = 1; i <= 100; i = i + 1)
    X[2*i + 3] = X[2*i] * 5.0;
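A minimal C sketch (not from the slides) of the GCD test for the access
pattern X[a*i + b] = ... paired with ... = X[c*i + d]; note the test is
conservative: it can only rule a dependence out.

    #include <stdio.h>

    static int gcd(int x, int y)
    {
        if (x < 0) x = -x;
        if (y < 0) y = -y;
        while (y != 0) { int t = x % y; x = y; y = t; }
        return x;
    }

    /* Returns 0 when the GCD test proves there is no loop-carried
       dependence, 1 when a dependence is still possible. */
    int gcd_test_may_depend(int a, int b, int c, int d)
    {
        int g = gcd(a, c);
        if (g == 0)                 /* a == c == 0: fixed indices b and d */
            return (d - b) == 0;
        return ((d - b) % g) == 0;
    }

    int main(void)
    {
        /* Slide example: X[2*i + 3] = X[2*i] * 5.0  =>  a=2, b=3, c=2, d=0 */
        printf("%s\n", gcd_test_may_depend(2, 3, 2, 0)
                       ? "dependence possible"
                       : "no loop-carried dependence");
        return 0;
    }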
24
Problem Cases
  • Reference by pointers instead of array indices
  • partly eliminated by strict type checking
  • Sparse arrays indexed through other arrays
    (similar to pointers; see the sketch after this list)
  • When a dependence exists for values of the
    indices but those values are never reached
  • The loop-carried dependence has a distance far
    greater than what loop-unrolling would cover

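A sketch (function and array names are illustrative) of the sparse-array
problem case: because the indices come from another array, the compiler
cannot decide at compile time whether two iterations touch the same
element:

    void scatter_add(double *x, const int *idx, const double *y, int n)
    {
        for (int i = 0; i < n; i++)
            x[idx[i]] = x[idx[i]] + y[i];   /* idx[i] may repeat: a
                                               loop-carried dependence
                                               is possible             */
    }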
25
Software Pipelining
  • Observation: if loop iterations are independent,
    then more ILP can be obtained by taking
    instructions from different iterations
  • Software pipelining reorganizes loops so that
    each iteration is made from instructions chosen
    from different iterations of the original loop
    (a C-level sketch follows below)

(Figure: instructions from several original iterations are combined into
one software-pipelined iteration.)
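A C-level sketch of the idea (the function name is illustrative; the
assembly version follows on the next slide): each kernel iteration stores
the element loaded two iterations earlier, adds the scalar to the element
loaded one iteration earlier, and loads a new element.

    void add_scalar_pipelined(double *x, double s, int n)
    {
        if (n < 3) {                      /* too short to pipeline */
            for (int i = 0; i < n; i++)
                x[i] = x[i] + s;
            return;
        }

        double loaded = x[0];             /* start-up code: first load/add */
        double added  = loaded + s;
        loaded = x[1];

        for (int i = 2; i < n; i++) {     /* kernel: store, add, load */
            x[i - 2] = added;
            added    = loaded + s;
            loaded   = x[i];
        }

        x[n - 2] = added;                 /* finish-up code: drain pipeline */
        x[n - 1] = loaded + s;
    }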
26
SW Pipelining Example
Original loop, unrolled 3 times:

 1  LD    F0, 0(R1)
 2  ADDD  F4, F0, F2
 3  SD    0(R1), F4
 4  LD    F6, -8(R1)
 5  ADDD  F8, F6, F2
 6  SD    -8(R1), F8
 7  LD    F10, -16(R1)
 8  ADDD  F12, F10, F2
 9  SD    -16(R1), F12
10  SUBI  R1, R1, #24
11  BNEZ  R1, LOOP

Software pipelined:

    LD    F0, 0(R1)       ; start-up code
    ADDD  F4, F0, F2
    LD    F0, -8(R1)
 1  SD    0(R1), F4       ; Stores M[i]
 2  ADDD  F4, F0, F2      ; Adds to M[i-1]
 3  LD    F0, -16(R1)     ; Loads M[i-2]
 4  SUBI  R1, R1, #8
 5  BNEZ  R1, LOOP
    SD    0(R1), F4       ; finish-up code
    ADDD  F4, F0, F2
    SD    -8(R1), F4

Pipeline timing of the software-pipelined body:

          1    2    3    4    5    6    7
SD        IF   ID   EX   MEM  WB                ; reads F4
ADDD           IF   ID   EX   MEM  WB           ; reads F0, writes F4
LD                  IF   ID   EX   MEM  WB      ; writes F0
27
SW Pipelining Example
  • Symbolic Loop Unrolling
  • Smaller code space
  • Overhead paid only once vs. each iteration in
    loop unrolling
  • 100 iterations = 25 loops with 4 unrolled
    iterations each

(Figure: number of overlapped operations over time for (a) software
pipelining and (b) loop unrolling.)