Advanced Computer Architecture 5MD00: Exploiting ILP with SW approaches

Document properties: Title: Microprocessor Design 2002; Author: Henk Corporaal; Last modified by: Henk Corporaal; Created: 6/19/2002 3:40:13 PM

1
Advanced Computer Architecture 5MD00: Exploiting
ILP with SW approaches
  • Henk Corporaal
  • www.ics.ele.tue.nl/heco
  • TU Eindhoven
  • December 2012

2
Topics
  • Static branch prediction and speculation
  • Basic compiler techniques
  • Multiple issue architectures
  • Advanced compiler support techniques
  • Loop-level parallelism
  • Software pipelining
  • Hardware support for compile-time scheduling

3
We previously discussed dynamic branch
prediction. This does not help the compiler!
  • Should the compiler speculate operations (move
    operations before a branch) from the target or
    the fall-through path?
  • We need Static Branch Prediction

4
Static Branch Prediction and Speculation
  • Static branch prediction is useful for code
    scheduling
  • Example:
          ld   r1,0(r2)
          sub  r1,r1,r3     ; hazard: sub needs r1 from the ld
          beqz r1,L
          or   r4,r5,r6
          addu r10,r4,r3
      L:  addu r7,r8,r9
  • If the branch is taken most of the time and
    since r7 is not needed on the fall-through path,
    we could move addu r7,r8,r9 directly after the
    ld
  • If the branch is not taken most of the time and
    assuming that r4 is not needed on the taken path,
    we could move or r4,r5,r6 after the ld

5
4 Static Branch Prediction Methods
  • Always predict taken
      • Average misprediction rate for SPEC: 34% (range 9%-59%)
  • Backward branches predicted taken, forward
    branches not taken
      • In SPEC, most forward branches are taken, so
        always predicting taken is better
  • Profiling
      • Run the program and profile all branches. If a
        branch is taken (not taken) most of the time, it
        is predicted taken (not taken)
      • The behavior of a branch is often biased towards
        taken or not taken
      • Average misprediction rate for SPECint: 15%
        (11%-22%); SPECfp: 9% (5%-15%)
  • Can we do better? Yes: use control flow
    restructuring to exploit correlation
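
The first three schemes above can be compared on profile data with a small C sketch. The `BranchProfile` layout, the `Scheme` names and the function names are assumptions made for this illustration; they are not part of the slides, and the counts a caller passes in are whatever a profiling run produced.

```c
#include <assert.h>
#include <stddef.h>

/* Profile of one static branch (hypothetical layout for this sketch). */
typedef struct {
    int backward;        /* 1 if the branch jumps backwards (loop branch) */
    unsigned taken;      /* times taken during the profiling run */
    unsigned not_taken;  /* times not taken during the profiling run */
} BranchProfile;

typedef enum { ALWAYS_TAKEN, BTFNT, PROFILE } Scheme;

/* Static prediction for one branch: 1 means "predict taken". */
int predict_taken(Scheme s, const BranchProfile *b) {
    switch (s) {
    case ALWAYS_TAKEN: return 1;
    case BTFNT:        return b->backward;  /* backward taken, forward not taken */
    case PROFILE:      return b->taken >= b->not_taken;
    }
    return 1;
}

/* Dynamic misprediction rate (0..1) of a scheme over n profiled branches. */
double mispredict_rate(Scheme s, const BranchProfile *b, size_t n) {
    unsigned long wrong = 0, total = 0;
    for (size_t i = 0; i < n; i++) {
        /* A "taken" prediction is wrong every time the branch falls through. */
        wrong += predict_taken(s, &b[i]) ? b[i].not_taken : b[i].taken;
        total += b[i].taken + b[i].not_taken;
    }
    return total ? (double)wrong / (double)total : 0.0;
}
```

Profile-based prediction can never do worse than the other two on the profiled input, since it picks the majority direction per branch; on other inputs it only helps if branch bias is stable.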

6
Static exploitation of correlation
  • If there is correlation, the branch direction in block d
    depends on the branch in block a
  • Exploit this with control flow restructuring
7
Basic compiler techniques
  • Dependencies limit ILP (Instruction-Level
    Parallelism)
  • We cannot always find sufficient independent
    operations to fill all the delay slots
  • This may result in pipeline stalls
  • Scheduling to avoid stalls (reorder
    instructions)
  • (Source-)code transformations to create more
    exploitable parallelism
      • Loop Unrolling
      • Loop Merging (Fusion)
      • see the online slide set about loop transformations

8
Dependencies Limit ILP: Example
C loop: for (i=1; i<=1000; i++) x[i] = x[i] + s;
  • MIPS assembly code
                                   ; R1 = &x[1]
                                   ; R2 = &x[1000]+8
                                   ; F2 = s
      Loop: L.D   F0,0(R1)         ; F0 = x[i]
            ADD.D F4,F0,F2         ; F4 = x[i]+s
            S.D   0(R1),F4         ; x[i] = F4
            ADDI  R1,R1,8          ; R1 = &x[i+1]
            BNE   R1,R2,Loop       ; branch if R1 != &x[1000]+8

9
Schedule this on a MIPS Pipeline
  • FP operations are mostly multicycle
  • The pipeline must be stalled if an instruction
    uses the result of a not yet finished multicycle
    operation
  • We'll assume the following latencies:
      Producing        Consuming        Latency
      instruction      instruction      (clock cycles)
      FP ALU op        FP ALU op        3
      FP ALU op        Store double     2
      Load double      FP ALU op        1
      Load double      Store double     0

10
Where to Insert Stalls?
  • How would this loop be executed on the MIPS FP
    pipeline?

Inter-iteration dependence!
      Loop: L.D   F0,0(R1)
            ADD.D F4,F0,F2
            S.D   F4,0(R1)
            ADDI  R1,R1,8
            BNE   R1,R2,Loop
What are the true (flow) dependences?
11
Where to Insert Stalls
  • How would this loop be executed on the MIPS FP
    pipeline?
  • 10 cycles per iteration

      Loop: L.D   F0,0(R1)
            stall
            ADD.D F4,F0,F2
            stall
            stall
            S.D   0(R1),F4
            ADDI  R1,R1,8
            stall
            BNE   R1,R2,Loop
            stall
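
The cycle count can be reproduced with a minimal C model of a single-issue, in-order pipeline. This is a sketch under stated assumptions: each instruction names only its most constraining producer, the integer-ALU-to-branch latency of 1 is inferred from the stall shown before BNE (it is not in the latency table), and an unfilled branch delay slot costs one cycle. The type and function names are made up for this example.

```c
#include <assert.h>

#define MAX_INSTR 32

/* One instruction in program order. */
typedef struct {
    const char *text;
    int dep;  /* index of the producer of its critical operand, -1 if none */
    int lat;  /* stall cycles required between that producer and this instruction */
} Instr;

/* Cycles for one loop iteration; at most MAX_INSTR instructions are modeled. */
int cycles_per_iteration(const Instr *code, int n, int delay_slot_filled) {
    int issue[MAX_INSTR];
    int prev = 0;
    for (int i = 0; i < n && i < MAX_INSTR; i++) {
        int c = prev + 1;                       /* in-order, one issue per cycle */
        if (code[i].dep >= 0) {
            int ready = issue[code[i].dep] + 1 + code[i].lat;
            if (ready > c) c = ready;           /* stall until the operand is ready */
        }
        issue[i] = c;
        prev = c;
    }
    return prev + (delay_slot_filled ? 0 : 1);  /* empty delay slot = 1 stall */
}

/* Original order: the branch delay slot is empty. */
const Instr unscheduled[] = {
    { "L.D   F0,0(R1)",   -1, 0 },
    { "ADD.D F4,F0,F2",    0, 1 },  /* load double -> FP ALU op: 1 */
    { "S.D   0(R1),F4",    1, 2 },  /* FP ALU op -> store double: 2 */
    { "ADDI  R1,R1,8",    -1, 0 },
    { "BNE   R1,R2,Loop",  3, 1 },  /* assumed: int ALU -> branch: 1 */
};

/* Reordered: ADDI moved up, S.D placed in the branch delay slot. */
const Instr scheduled[] = {
    { "L.D   F0,0(R1)",   -1, 0 },
    { "ADDI  R1,R1,8",    -1, 0 },
    { "ADD.D F4,F0,F2",    0, 1 },
    { "BNE   R1,R2,Loop",  1, 1 },
    { "S.D   -8(R1),F4",   2, 2 },
};
```

With these tables, `cycles_per_iteration(unscheduled, 5, 0)` gives 10 and `cycles_per_iteration(scheduled, 5, 1)` gives 6. The model places the remaining stall of the reordered loop after the branch rather than before it, which does not change the total.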
12
Code Scheduling to Avoid Stalls
  • Can we reorder the instructions to avoid
    stalls?
  • Execution time is reduced from 10 to 6 cycles per
    iteration
  • But only 3 instructions perform useful work; the rest
    is loop overhead. How can we avoid this?

      Loop: L.D   F0,0(R1)
            ADDI  R1,R1,8
            ADD.D F4,F0,F2
            stall
            BNE   R1,R2,Loop
            S.D   -8(R1),F4        ; watch out! The offset is now -8, because ADDI has already incremented R1
13
Loop Unrolling increasing ILP
  • At source level:
      for (i=1; i<=1000; i++)
        x[i] = x[i] + s;
    becomes
      for (i=1; i<=1000; i=i+4) {
        x[i]   = x[i]   + s;
        x[i+1] = x[i+1] + s;
        x[i+2] = x[i+2] + s;
        x[i+3] = x[i+3] + s;
      }
  • Any drawbacks?
      • loop unrolling increases code size
      • more registers needed

MIPS code after scheduling:
      Loop: L.D   F0,0(R1)
            L.D   F6,8(R1)
            L.D   F10,16(R1)
            L.D   F14,24(R1)
            ADD.D F4,F0,F2
            ADD.D F8,F6,F2
            ADD.D F12,F10,F2
            ADD.D F16,F14,F2
            S.D   0(R1),F4
            S.D   8(R1),F8
            ADDI  R1,R1,32
            S.D   -16(R1),F12
            BNE   R1,R2,Loop
            S.D   -8(R1),F16
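
The slide's loop has a trip count (1000) that is a multiple of the unroll factor. A minimal C sketch for the general case, assuming a zero-based array and a hypothetical function name `add_scalar_unrolled`, adds a cleanup loop for the leftover iterations:

```c
#include <assert.h>
#include <stddef.h>

/* Adds s to x[0..n-1]. The main loop is unrolled by 4; the second loop
 * handles the 0-3 elements left over when n is not a multiple of 4. */
void add_scalar_unrolled(double *x, size_t n, double s) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* Four independent statements: no dependences between them,
         * so a scheduler is free to interleave their loads, adds and stores. */
        x[i]     += s;
        x[i + 1] += s;
        x[i + 2] += s;
        x[i + 3] += s;
    }
    for (; i < n; i++)
        x[i] += s;
}
```

The cleanup loop is part of the code-size cost mentioned above: the loop body now exists in an unrolled and a rolled form.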
14
Multiple issue architectures
  • How to get CPI < 1?
  • Superscalar: multiple instructions issued per
    cycle
      • Statically scheduled
      • Dynamically scheduled (see previous lecture)
  • VLIW?
      • single instruction issue, but multiple operations
        per instruction (so CPI >= 1)
  • SIMD / Vector?
      • single instruction issue, single operation, but
        multiple data sets per operation (so CPI >= 1)
  • Multi-threading? (e.g. x86 Hyperthreading)
  • Multi-processor? (e.g. x86 Multi-core)

15
Instruction Parallel (ILP) Processors
  • The name ILP is used for:
      • Multiple-issue processors
          • Superscalar: varying number of instructions/cycle (0 to
            8), scheduled by HW (dynamic issue capability)
              • IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium
                III/4, etc.
          • VLIW (very long instruction word): fixed number of
            instructions (4-16) scheduled by the compiler
            (static issue capability)
              • Intel Architecture-64 (IA-64, Itanium), TriMedia,
                TI C6x
      • (Super-)pipelined processors
  • The anticipated success of multiple instruction issue led
    to the Instructions Per Cycle (IPC) metric instead
    of CPI

16
Vector processors
  • Vector processing: explicit coding of
    independent loops as operations on large vectors
    of numbers
      • Multimedia instructions are being added to many
        processors
  • Different implementations:
      • real SIMD
          • e.g. 320 separate 32-bit ALUs + RFs
      • (multiple) subword units
          • divide a single ALU into sub-ALUs
      • deeply pipelined units
          • aiming at very high frequency
          • with forwarding between units

17
Simple In-order Superscalar
  • In-order superscalar 2-issue processor: 1 integer
    + 1 FP
      • Used in the first Pentium processor (also in
        Larrabee, which was canceled)
  • Fetch 64 bits/clock cycle: Int on left, FP on
    right
  • Can only issue the 2nd instruction if the 1st
    instruction issues
  • More ports needed for the FP register file to
    execute an FP load and an FP op in parallel
      Type               Pipe stages
      Int. instruction   IF  ID  EX  MEM WB
      FP instruction     IF  ID  EX  MEM WB
      Int. instruction       IF  ID  EX  MEM WB
      FP instruction         IF  ID  EX  MEM WB
      Int. instruction           IF  ID  EX  MEM WB
      FP instruction             IF  ID  EX  MEM WB
  • A 1-cycle load delay impacts the next 3
    instructions!

18
Dynamic trace for unrolled code
      for (i=1; i<=1000; i++)
        a[i] = a[i] + s;

          Integer instruction     FP instruction        Cycle
      L:  LD   F0,0(R1)                                 1
          LD   F6,8(R1)                                 2
          LD   F10,16(R1)         ADDD F4,F0,F2         3
          LD   F14,24(R1)         ADDD F8,F6,F2         4
          LD   F18,32(R1)         ADDD F12,F10,F2       5
          SD   0(R1),F4           ADDD F16,F14,F2       6
          SD   8(R1),F8           ADDD F20,F18,F2       7
          SD   16(R1),F12                               8
          ADDI R1,R1,40                                 9
          SD   -16(R1),F16                              10
          BNE  R1,R2,L                                  11
          SD   -8(R1),F20                               12

Load: 1 cycle latency; ALU op: 2 cycles latency
  • 2.4 cycles per element (12 cycles for 5 elements) vs. 3.5
    for the ordinary MIPS pipeline
  • Int and FP instructions are not perfectly balanced
19
Superscalar Multi-issue Issues
  • While the Integer/FP split is simple for the HW, we get
    an IPC of 2 only for programs with
      • exactly 50% FP operations AND no hazards
  • More complex decode and issue! E.g., already for
    2-issue we need:
      • Issue logic: examine 2 opcodes, 6 register
        specifiers, and decide if 1 or 2 instructions can
        issue (N-issue: O(N^2) comparisons)
      • Register file complexity: a 2-issue
        superscalar needs 4 reads and 2 writes/cycle
      • Rename logic: must be able to rename the same
        register multiple times in one cycle! For
        instance, consider 4-way issue:
          add r1, r2, r3    ->   add p11, p4, p7
          sub r4, r1, r2    ->   sub p22, p11, p4
          lw  r1, 4(r4)     ->   lw  p23, 4(p22)
          add r5, r1, r2    ->   add p12, p23, p4
        Imagine doing this transformation in a single
        cycle!
      • Bypassing / result buses: need to complete
        multiple instructions/cycle
      • Need multiple buses with associated matching
        logic at every reservation station
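
The renaming example on this slide can be traced with a short C sketch. It is a sequential software model of the result that the hardware must produce for the whole group in one cycle; the `RenameState` layout, the free-list order and the function name are assumptions for this illustration, and the sketch does not check for an exhausted free list.

```c
#include <assert.h>

#define NUM_ARCH_REGS 32

/* Register numbers of one operation; -1 marks an unused field. */
typedef struct { int dst, src1, src2; } Op;

typedef struct {
    int map[NUM_ARCH_REGS];  /* architectural -> current physical register */
    const int *free_list;    /* physical registers available for new results */
    int next_free;           /* index of the next free-list entry to hand out */
} RenameState;

/* Renames n operations in program order. Sources are looked up before the
 * destination mapping is updated, so "add r1,r1,r2" reads the old r1. */
void rename_group(RenameState *st, const Op *in, Op *out, int n) {
    for (int i = 0; i < n; i++) {
        out[i].src1 = in[i].src1 >= 0 ? st->map[in[i].src1] : -1;
        out[i].src2 = in[i].src2 >= 0 ? st->map[in[i].src2] : -1;
        if (in[i].dst >= 0) {
            /* Assumes the free list still has an entry (not checked here). */
            out[i].dst = st->free_list[st->next_free++];
            st->map[in[i].dst] = out[i].dst;
        } else {
            out[i].dst = -1;
        }
    }
}
```

Starting from r2 mapped to p4, r3 mapped to p7 and a free list of p11, p22, p23, p12, the four instructions above rename to the physical registers shown on the slide. Note that r1 is renamed twice within the group (p11, then p23), which is exactly what makes single-cycle renaming hard.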

20
Why not VLIW Processors
  • Superscalar HW is expensive to build => let the compiler
    find independent instructions and pack them in
    one Very Long Instruction Word (VLIW)
  • Example: VLIW processor with two ld/st units, two
    FP units, one integer/branch unit, no branch delay

9/7 cycles per iteration !
21
Superscalar versus VLIW
  • VLIW advantages
      • Much simpler to build; potentially faster
  • VLIW disadvantages and proposed solutions
      • Binary code incompatibility
          • Object code translation or emulation
          • Less strict approach (EPIC, IA-64, Itanium)
      • Increase in code size; unfilled slots are wasted
        bits
          • Use clever encodings, only one immediate field
          • Compress instructions in memory and decode them
            when they are fetched, or when put in L1 cache
      • Lockstep operation: if the operation in one
        instruction slot stalls, the entire processor is
        stalled
          • Less strict approach

22
Use compressed instructions
[Figure: Memory (compressed instructions) -> L1 Instruction Cache -> CPU. Decompress between memory and the L1 cache, or between the L1 cache and the CPU?]
Q: What are the pros and cons?
23
Advanced compiler support techniques
  • Loop-level parallelism
  • Software pipelining
  • Global scheduling (across basic blocks)

24
Detecting Loop-Level Parallelism
  • Loop-carried dependence: a statement executed in
    a certain iteration depends on a statement
    executed in an earlier iteration
  • If a loop has no loop-carried dependence, its
    iterations can be executed in parallel
      for (i=1; i<=100; i++) {
        A[i+1] = A[i] + C[i];      /* S1 */
        B[i+1] = B[i] + A[i+1];    /* S2 */
      }

[Figure: dependence graph with nodes S1 and S2]
A loop is parallel if the corresponding dependence
graph does not contain a cycle
25
Finding Dependences
  • Is there a dependence in the following loop?
      for (i=1; i<=100; i++)
        A[2*i+3] = A[2*i] * 5.0;
  • Affine expression: an expression of the form a*i +
    b (a, b constants, i loop index variable)
  • Does the following equation have a solution?
      a*i + b = c*j + d
  • GCD test: if there is a solution, then GCD(a,c)
    must divide d-b
  • Here a=2, b=3, c=2, d=0: GCD(2,2) = 2 does not divide
    d-b = -3, so there is no dependence
  • Note: because the GCD test does not take the loop
    bounds into account, there are cases where the
    GCD test says "yes, there is a solution" while in
    reality there isn't
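
The GCD test is small enough to write out directly. The following C sketch uses hypothetical function names and follows the slide's notation: a write to index a*i + b and a read from index c*j + d.

```c
#include <assert.h>
#include <stdlib.h>

/* Greatest common divisor of |x| and |y| (Euclid's algorithm). */
int gcd(int x, int y) {
    x = abs(x);
    y = abs(y);
    while (y != 0) {
        int t = x % y;
        x = y;
        y = t;
    }
    return x;
}

/* GCD test for a*i + b = c*j + d.
 * Returns 0 when the test proves that no integer solution exists (no
 * dependence), 1 when a dependence cannot be ruled out. Loop bounds are
 * ignored, so a result of 1 may be a false positive. */
int gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(a, c);
    if (g == 0)                 /* a == c == 0: both indices are constants */
        return b == d;
    return (d - b) % g == 0;    /* g must divide d - b */
}
```

For a store to A[2*i+3] and a load from A[2*i], `gcd_test_may_depend(2, 3, 2, 0)` returns 0: GCD(2,2) = 2 does not divide -3, so the accesses never touch the same element.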

26
Software Pipelining
  • We have already seen loop unrolling
  • Software pipelining is a related technique
    that consumes less code space. It interleaves
    instructions from different iterations
      • instructions within one iteration are often dependent
        on each other

[Figure: iterations 0, 1 and 2 drawn overlapping; one software-pipelined iteration takes instructions from different original iterations; the repeated steady-state part is the kernel]
27
Simple Software Pipelining Example
  • Original loop
      L:  l.d   f0,0(r1)      ; load M[i]
          add.d f4,f0,f2      ; compute M[i]
          s.d   f4,0(r1)      ; store M[i]
          addi  r1,r1,-8      ; i = i-1
          bne   r1,r2,L
  • Software pipelined loop
      L:  s.d   f4,16(r1)     ; store M[i]
          add.d f4,f0,f2      ; compute M[i-1]
          l.d   f0,0(r1)      ; load M[i-2]
          addi  r1,r1,-8
          bne   r1,r2,L
  • Need hardware to avoid the WAR hazards
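
The slide shows only the steady-state kernel. A source-level C sketch of the same idea, under the assumption of a zero-based array processed from the top down as in the assembly, also needs a prologue to fill the pipeline and an epilogue to drain it. The function name and the variable names `loaded` and `sum` (standing in for f0 and f4) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Computes m[i] = m[i] + s for i = n-1 down to 0, software pipelined so that
 * each kernel iteration stores element i, computes element i-1 and loads
 * element i-2. */
void add_scalar_swp(double *m, size_t n, double s) {
    if (n < 2) {                     /* too short to pipeline */
        if (n == 1) m[0] += s;
        return;
    }
    /* Prologue: start the first two iterations. */
    double loaded = m[n - 1];        /* load M[n-1] */
    double sum = loaded + s;         /* compute M[n-1] */
    loaded = m[n - 2];               /* load M[n-2] */

    /* Kernel: three different iterations are in flight at once. */
    for (size_t i = n - 1; i >= 2; i--) {
        m[i] = sum;                  /* store M[i] */
        sum = loaded + s;            /* compute M[i-1] */
        loaded = m[i - 2];           /* load M[i-2] */
    }

    /* Epilogue: finish the last two iterations. */
    m[1] = sum;
    m[0] = loaded + s;
}
```

Within the kernel, `sum` is read by the store before it is overwritten by the add, and `loaded` is read by the add before it is overwritten by the load. These are the write-after-read orderings that the slide's WAR remark refers to.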

28
Global code scheduling
  • Loop unrolling and software pipelining work well
    when there are no control statements (if
    statements) in the loop body -> the loop is a single
    basic block
  • Global code scheduling: scheduling/moving code
    across branches gives a larger scheduling scope
  • When can the assignments to B and C be moved
    before the test?

[Figure: flow chart. A[i] = A[i] + B[i]; test A[i] = 0?; T branch: assignment to B[i]; F branch: X; both paths join at the assignment to C[i]]
29
Which scheduling scope?
Hyperblock/region
Trace
Superblock
Decision Tree
30
Comparing scheduling scopes
31
Scheduling scope creation (1)
Partitioning a CFG into scheduling scopes
32
Trace Scheduling
  • Find the most likely sequence of basic blocks
    that will be executed consecutively (trace
    selection)
  • Optimize the trace as much as possible (trace
    compaction)
  • move operations as early as possible in the trace
  • pack the operations in as few VLIW instructions
    as possible
  • additional bookkeeping code may be necessary on
    exit points of the trace

33
Scheduling scope creation (2)
Partitioning a CFG into scheduling scopes
34
Code movement (upwards) within regions
[Figure: an add operation is moved upwards from the source block to the destination block, past the blocks labeled I in between]
35
Hardware support for compile-time scheduling
  • Predication
  • (discussed already)
  • see also Itanium example
  • Deferred exceptions
  • Speculative loads

36
Predicated Instructions (discussed before)
  • Avoid branch prediction by turning branches into
    conditional or predicated instructions
      • If the condition is false, the instruction neither
        stores a result nor causes an exception
  • The expanded ISAs of Alpha, MIPS, PowerPC and SPARC have
    a conditional move; PA-RISC can annul any following
    instruction
  • IA-64/Itanium: conditional execution of any
    instruction
  • Examples:
      if (R1==0) R2 = R3;        CMOVZ  R2,R3,R1

      if (R1 < R2)               SLT    R9,R1,R2
        R3 = R1;                 CMOVNZ R3,R1,R9
      else                       CMOVZ  R3,R2,R9
        R3 = R2;
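
The semantics of the second example can be checked with a small C model. The helper functions below only imitate what the CMOVZ, CMOVNZ and SLT instructions compute; they are hypothetical names for this sketch, and whether a C compiler emits real conditional moves for them depends on the compiler and target.

```c
#include <assert.h>

/* CMOVZ rd,rs,rt: the new value of rd is rs if rt == 0, otherwise rd is unchanged. */
int cmovz(int rd, int rs, int rt)  { return rt == 0 ? rs : rd; }

/* CMOVNZ rd,rs,rt: the new value of rd is rs if rt != 0, otherwise rd is unchanged. */
int cmovnz(int rd, int rs, int rt) { return rt != 0 ? rs : rd; }

/* SLT rd,rs,rt: result is 1 if rs < rt, otherwise 0. */
int slt(int rs, int rt)            { return rs < rt; }

/* if (r1 < r2) r3 = r1; else r3 = r2;  expressed as a straight-line
 * sequence: both moves are always executed, only one takes effect. */
int min_predicated(int r1, int r2) {
    int r3 = 0;
    int r9 = slt(r1, r2);      /* SLT    R9,R1,R2 */
    r3 = cmovnz(r3, r1, r9);   /* CMOVNZ R3,R1,R9 */
    r3 = cmovz(r3, r2, r9);    /* CMOVZ  R3,R2,R9 */
    return r3;
}
```

The sequence has no control dependence left, so there is nothing to predict; the cost is that instructions from both paths occupy issue slots.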

37
Deferred Exceptions
if (A==0) A = B; else A = A+4;

          ld   r1,0(r3)       ; load A
          bnez r1,L1          ; test A
          ld   r1,0(r2)       ; then part: load B
          j    L2
      L1: addi r1,r1,4        ; else part: inc A
      L2: st   r1,0(r3)       ; store A
  • How to optimize when the then-part is usually
    selected?

          ld   r1,0(r3)       ; load A
          ld   r9,0(r2)       ; speculative load B
          beqz r1,L3          ; test A
          addi r9,r1,4        ; else part
      L3: st   r9,0(r3)       ; store A
  • What if the load generates a page fault?
  • What if the load generates an index-out-of-bounds
    exception?

38
HW supporting Speculative Loads
  • Speculative load (sld): does not generate
    exceptions
  • Speculation check instruction (speck): checks for an
    exception. The exception occurs when this
    instruction is executed.

          ld    r1,0(r3)      ; load A
          sld   r9,0(r2)      ; speculative load of B
          bnez  r1,L1         ; test A
          speck 0(r2)         ; perform exception check
          j     L2
      L1: addi  r9,r1,4       ; else part
      L2: st    r9,0(r3)      ; store A
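
The sld/speck pair can be mimicked in C to show where the exception becomes visible. This is a model under loud assumptions: a NULL pointer stands in for an address that would fault, the deferred exception is recorded as a poison flag carried with the loaded value, and `speck` checks that flag rather than re-checking the address as the slide's `speck 0(r2)` does. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Result of a speculative load: the value plus a deferred-exception flag. */
typedef struct {
    long value;
    int poisoned;   /* 1 if the load would have raised an exception */
} SpecReg;

/* sld: never faults; a bad address (modeled as NULL) only sets the flag. */
SpecReg sld(const long *addr) {
    SpecReg r = { 0, 0 };
    if (addr == NULL)
        r.poisoned = 1;
    else
        r.value = *addr;
    return r;
}

/* speck: returns 0 if the speculative load was fine, -1 if the deferred
 * exception must be raised now. */
int speck(SpecReg r) { return r.poisoned ? -1 : 0; }

/* if (A == 0) A = B; else A = A + 4;
 * Returns 0 on success, -1 if loading B raised an exception on the path
 * that actually needs B. */
int update_a(long *a, const long *b) {
    long r1 = *a;               /* ld  r1,0(r3)  load A */
    SpecReg r9 = sld(b);        /* sld r9,0(r2)  speculative load of B */
    if (r1 == 0) {
        if (speck(r9) != 0)     /* speck: exception surfaces only here */
            return -1;
        *a = r9.value;          /* st r9,0(r3) */
    } else {
        *a = r1 + 4;            /* addi r9,r1,4 ; st r9,0(r3) */
    }
    return 0;
}
```

When A is nonzero, a faulting load of B is harmless: the else path never executes the check, so the program behaves exactly as the unoptimized version would.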
39
Next?
3 GHz, 100 W
  • Trends
      • number of transistors follows Moore's law
      • but frequency and performance per core do not