COMP 740: Computer Architecture and Implementation - PowerPoint PPT Presentation

Provided by: Montek5
Learn more at: http://www.cs.unc.edu

Transcript and Presenter's Notes

Title: COMP 740: Computer Architecture and Implementation

1
COMP 740Computer Architecture and Implementation
  • Montek Singh
  • Tue, Feb 24, 2009
  • Topic: Instruction-Level Parallelism IV
  • (Software Approaches/Compiler Techniques)

2
Outline
  • Motivation
  • Compiler scheduling
  • Loop unrolling
  • Software pipelining

3
Review: Instruction-Level Parallelism (ILP)
  • Pipelining is most effective when there is
    parallelism among instructions
  • instrs u and v are parallel if neither is
    dependent on the other
  • Problem: parallelism within a basic block is
    limited
  • branch frequency of 15% implies about 6
    instructions per basic block
  • these instructions are likely to depend on each
    other
  • need to look beyond basic blocks
  • Solution: exploit loop-level parallelism
  • i.e., parallelism across loop iterations
  • to convert loop-level parallelism into ILP, need
    to unroll the loop
  • dynamically, by the hardware
  • statically, by the compiler
  • using vector instructions: same op applied to
    all vector elements

4
Motivating Example for Loop Unrolling
for (i = 1000; i > 0; i--) x[i] = x[i] + s;
  • Assumptions
  • Scalar s is in register F2
  • Array x starts at memory address 0
  • 1-cycle branch delay
  • No structural hazards

10 cycles per iteration
5
How Far Can We Get With Scheduling?
Scheduled code:

LOOP:  L.D    F0, 0(R1)
       DADDUI R1, R1, -8
       ADD.D  F4, F0, F2
       nop
       BNEZ   R1, LOOP
       S.D    8(R1), F4

Original code:

LOOP:  L.D    F0, 0(R1)
       ADD.D  F4, F0, F2
       S.D    0(R1), F4
       DADDUI R1, R1, -8
       BNEZ   R1, LOOP
       NOP

6 cycles per iteration
Note the change in the S.D instruction, from 0(R1) to
8(R1): this is a non-trivial change!
6
Observations on Scheduled Code
  • 3 out of 5 instructions involve FP work
  • The other two constitute loop overhead
  • Could we improve performance by unrolling the
    loop?
  • assume number of loop iterations is a multiple of
    4, and unroll loop body four times
  • in real life, must also handle loop counts that
    are not multiples of 4

7
Unrolling Take 1
  • Even though we have gotten rid of the control
    dependences, we have data dependences through R1
  • We could remove data dependences by observing
    that R1 is decremented by 8 each time
  • Adjust the address specifiers
  • Delete the first three DADDUIs
  • Change the constant in the fourth DADDUI to 32
  • These are non-trivial inferences for a compiler
    to make

LOOP:  L.D    F0, 0(R1)
       ADD.D  F4, F0, F2
       S.D    0(R1), F4
       DADDUI R1, R1, -8
       L.D    F0, 0(R1)
       ADD.D  F4, F0, F2
       S.D    0(R1), F4
       DADDUI R1, R1, -8
       L.D    F0, 0(R1)
       ADD.D  F4, F0, F2
       S.D    0(R1), F4
       DADDUI R1, R1, -8
       L.D    F0, 0(R1)
       ADD.D  F4, F0, F2
       S.D    0(R1), F4
       DADDUI R1, R1, -8
       BNEZ   R1, LOOP
       NOP
8
Unrolling Take 2
  • Performance is now limited by the WAR
    dependencies on F0
  • These are name dependences
  • The instructions are not in a producer-consumer
    relation
  • They are simply using the same registers, but
    they don't have to
  • We can use different registers in different loop
    iterations, subject to availability
  • Let's rename registers

LOOP:  L.D    F0, 0(R1)
       ADD.D  F4, F0, F2
       S.D    0(R1), F4
       L.D    F0, -8(R1)
       ADD.D  F4, F0, F2
       S.D    -8(R1), F4
       L.D    F0, -16(R1)
       ADD.D  F4, F0, F2
       S.D    -16(R1), F4
       L.D    F0, -24(R1)
       ADD.D  F4, F0, F2
       S.D    -24(R1), F4
       DADDUI R1, R1, -32
       BNEZ   R1, LOOP
       NOP
9
Unrolling Take 3
  • Time for execution of 4 iterations:
  • 14 instruction cycles
  • 4 L.D -> ADD.D stalls
  • 8 ADD.D -> S.D stalls
  • 1 DADDUI -> BNEZ stall
  • 1 branch delay stall (NOP)
  • 28 cycles for 4 iterations, or 7 cycles per
    iteration
  • Slower than scheduled version of original loop,
    which needed 6 cycles per iteration
  • Let's schedule the unrolled loop

LOOP:  L.D    F0, 0(R1)
       ADD.D  F4, F0, F2
       S.D    0(R1), F4
       L.D    F6, -8(R1)
       ADD.D  F8, F6, F2
       S.D    -8(R1), F8
       L.D    F10, -16(R1)
       ADD.D  F12, F10, F2
       S.D    -16(R1), F12
       L.D    F14, -24(R1)
       ADD.D  F16, F14, F2
       S.D    -24(R1), F16
       DADDUI R1, R1, -32
       BNEZ   R1, LOOP
       NOP
10
Unrolling Take 4
  • This code runs without stalls
  • 14 cycles for 4 iterations
  • 3.5 cycles per iteration
  • loop control overhead once every four
    iterations
  • Note that original loop had three FP instructions
    that were not independent
  • Loop unrolling exposed independent instructions
    from multiple loop iterations
  • By unrolling further, can approach asymptotic
    rate of 3 cycles per iteration
  • Subject to availability of registers

LOOP:  L.D    F0, 0(R1)
       L.D    F6, -8(R1)
       L.D    F10, -16(R1)
       L.D    F14, -24(R1)
       ADD.D  F4, F0, F2
       ADD.D  F8, F6, F2
       ADD.D  F12, F10, F2
       ADD.D  F16, F14, F2
       S.D    0(R1), F4
       S.D    -8(R1), F8
       DADDUI R1, R1, -32
       S.D    16(R1), F12
       BNEZ   R1, LOOP
       S.D    8(R1), F16
11
What Did The Compiler Have To Do?
  • Determine it was legal to move the S.D
  • after the DADDUI and BNEZ
  • and find the amount to adjust the S.D offset
  • Determine that loop unrolling would be useful
  • by discovering independence of loop iterations
  • Rename registers to avoid name dependences
  • Eliminate extra tests and branches and adjust
    loop control
  • Determine that L.Ds and S.Ds can be
    interchanged
  • by determining that (since R1 is not being
    updated) the address specifiers 0(R1), -8(R1),
    -16(R1), -24(R1) all refer to different memory
    locations
  • Schedule the code, preserving dependences

12
Limits to Gain from Loop Unrolling
  • Benefit of reduction in loop overhead tapers off
  • Amount of overhead amortized diminishes with
    successive unrolls
  • Code size limitations
  • For larger loops, code size growth is a concern
  • Especially for embedded processors with limited
    memory
  • Instruction cache miss rate increases
  • Architectural/compiler limitations
  • Register pressure
  • Need many registers to exploit ILP
  • Especially challenging in multiple-issue
    architectures

13
Dependences
  • Three kinds of dependences
  • Data dependence
  • Name dependence
  • Control dependence
  • In the context of loop-level parallelism, data
    dependence can be
  • Loop-independent
  • Loop-carried
  • Data dependences act as a limit on how much ILP
    can be exploited in a compiled program
  • Compiler tries to identify and eliminate
    dependences
  • Hardware tries to prevent dependences from
    becoming stalls

14
Control Dependences
  • A control dependence determines the ordering of
    an instruction with respect to a branch
    instruction so that the non-branch instruction is
    executed only when it should be
  • if (p1) s1;
  • if (p2) s2;
  • Control dependence constrains code motion
  • An instruction that is control dependent on a
    branch cannot be moved before the branch so that
    its execution is no longer controlled by the
    branch
  • An instruction that is not control dependent on a
    branch cannot be moved after the branch so that
    its execution is controlled by the branch

15
Data Dependence in Loop Iterations
Loop 1:
  A[i+1] = A[i] + C[i];     /* S1 */
  B[i+1] = B[i] + A[i+1];   /* S2 */

Loop 2:
  A[i] = A[i] + B[i];       /* S1 */
  B[i+1] = C[i] + D[i];     /* S2 */

Loop 2, transformed:
  B[i+1] = C[i] + D[i];
  A[i+1] = A[i+1] + B[i+1];
16
Loop Transformation
  • Sometimes a loop-carried dependence does not
    prevent loop parallelization
  • Example: Second loop of previous slide
  • In other cases, loop-carried dependence prohibits
    loop parallelization
  • Example: First loop of previous slide

  A[i] = A[i] + B[i];       /* S1 */
  B[i+1] = C[i] + D[i];     /* S2 */
17
Software Pipelining
  • Observation
  • If iterations from loops are independent, then we
    can get ILP by taking instructions from different
    iterations
  • Software pipelining
  • reorganize loops so that each iteration is made
    from instructions chosen from different
    iterations of the original loop

[Figure: each software-pipelined iteration is assembled from instructions drawn from original iterations i0 through i4]
18
Software Pipelining Example
After software pipelining:

       L.D    F0, 0(R1)
       ADD.D  F4, F0, F2
       L.D    F0, -8(R1)
LOOP:  S.D    0(R1), F4    ; 1: stores M[i]
       ADD.D  F4, F0, F2   ; 2: adds to M[i-1]
       L.D    F0, -16(R1)  ; 3: loads M[i-2]
       DADDUI R1, R1, -8   ; 4
       BNEZ   R1, LOOP     ; 5
       S.D    0(R1), F4
       ADD.D  F4, F0, F2
       S.D    -8(R1), F4

[Figure: pipeline diagram showing S.D, ADD.D, and L.D overlapped through the IF ID EX Mem WB stages; F4 and F0 are each written in one iteration and read in a later one]
19
Software Pipelining Concept
"A Study of Scalar Compilation Techniques for Pipelined
Supercomputers," S. Weiss and J. E. Smith, ISCA 1987,
pp. 105-109.
  • Notation: L (load), E (execute), S (store)
  • Iterations are independent
  • In normal sequence, Ei depends on Li, and Si
    depends on Ei, leading to pipeline stalls
  • Software pipelining attempts to reduce these
    delays by inserting other instructions between
    such dependent pairs and hiding the delay
  • Other instructions are L and S instructions
    from other loop iterations
  • Does this without consuming extra code space or
    registers
  • Performance usually not as high as that of loop
    unrolling
  • How can we permute L, E, S to achieve this?

Execution trace of the loop:

  L1 E1 S1  B Loop
  L2 E2 S2  B Loop
  L3 E3 S3  B Loop
  ...
  Ln En Sn

Loop body:

  Loop:  Li Ei Si  B Loop
20
An Abstract View of Software Pipelining
Original loop:

  Loop:  Li Ei Si  B Loop

Software-pipelined permutations (some maintain the original
L/S order, others change it):

  L1
  Loop:  Ei Si L(i+1)  B Loop
  En Sn

  J Entry
  Loop:  S(i-1)
  Entry: Li Ei  B Loop
  Sn

  L1  J Entry
  Loop:  S(i-1)
  Entry: Ei L(i+1)  B Loop
  S(n-1) En Sn

  L1  J Entry
  Loop:  Li S(i-1)
  Entry: Ei  B Loop
  Sn

  L1
  Loop:  Ei L(i+1) Si  B Loop
  En Sn
21
Other Compiler Techniques
  • Static Branch Prediction
  • Examples
  • predict always taken
  • predict never taken
  • predict forward never taken, backward always
    taken
  • Stall needed after LD
  • if branch almost always taken, and R7 not needed
    in fall-thru
  • move DADDU R7, R8, R9 to right after LD
  • if branch almost never taken, and R4 not needed
    on taken path
  • move OR instruction to right after LD

     LD    R1, 0(R2)
     DSUBU R1, R1, R3
     BEQZ  R1, L
     OR    R4, R5, R6
     DADDU R10, R4, R3
L:   DADDU R7, R8, R9
22
Very Long Instruction Word (VLIW)
  • VLIW compiler schedules multiple
    instructions/issue
  • The long instruction word has room for many
    operations
  • By definition, all the operations the compiler
    puts in the long instruction word can execute in
    parallel
  • E.g., 2 integer operations, 2 FP operations, 2
    memory references, 1 branch
  • 16 to 24 bits per field => 7x16 = 112 bits to
    7x24 = 168 bits wide
  • Need very sophisticated compiling technique
  • that schedules across several branches

23
Loop Unrolling in VLIW
  Memory ref 1     Memory ref 2     FP operation 1    FP operation 2    Int. op/branch   Clock
  LD F0,0(R1)      LD F6,-8(R1)                                                          1
  LD F10,-16(R1)   LD F14,-24(R1)                                                        2
  LD F18,-32(R1)   LD F22,-40(R1)   ADDD F4,F0,F2     ADDD F8,F6,F2                      3
  LD F26,-48(R1)                    ADDD F12,F10,F2   ADDD F16,F14,F2                    4
                                    ADDD F20,F18,F2   ADDD F24,F22,F2                    5
  SD 0(R1),F4      SD -8(R1),F8     ADDD F28,F26,F2                                      6
  SD -16(R1),F12   SD -24(R1),F16                                                        7
  SD -32(R1),F20   SD -40(R1),F24                                       SUBI R1,R1,48    8
  SD -0(R1),F28                                                         BNEZ R1,LOOP     9

  • Unrolled 7 times to avoid delays
  • 7 iterations in 9 clocks, or 1.3 clocks per
    iteration (down from 6)
  • Need more registers in VLIW