Title: Advanced Computer Architecture 5MD00 Exploiting ILP with SW approaches
Slide 1: Advanced Computer Architecture 5MD00 - Exploiting ILP with SW approaches
- Henk Corporaal
- www.ics.ele.tue.nl/heco
- TU Eindhoven
- December 2012
Slide 2: Topics
- Static branch prediction and speculation
- Basic compiler techniques
- Multiple issue architectures
- Advanced compiler support techniques
- Loop-level parallelism
- Software pipelining
- Hardware support for compile-time scheduling
Slide 3: We previously discussed dynamic branch prediction
This does not help the compiler!
- Should the compiler speculate operations (i.e., move operations before a branch) from the target or from the fall-through path?
- We need static branch prediction
Slide 4: Static Branch Prediction and Speculation
- Static branch prediction is useful for code scheduling
- Example:
      ld   r1,0(r2)
      sub  r1,r1,r3    ; hazard
      beqz r1,L
      or   r4,r5,r6
      addu r10,r4,r3
  L:  addu r7,r8,r9
- If the branch is taken most of the time, and since r7 is not needed on the fall-through path, we could move addu r7,r8,r9 directly after the ld
- If the branch is not taken most of the time, and assuming that r4 is not needed on the taken path, we could move or r4,r5,r6 after the ld
Slide 5: 4 Static Branch Prediction Methods
- Always predict taken
  - Average misprediction rate for SPEC: 34% (9%-59%)
- Backward branches predicted taken, forward branches predicted not taken
  - In SPEC, most forward branches are taken, so "always predict taken" is better
- Profiling
  - Run the program and profile all branches. If a branch is taken (not taken) most of the time, it is predicted taken (not taken)
  - The behavior of a branch is often biased towards taken or not taken
  - Average misprediction rate: SPECint 15% (11%-22%), SPECfp 9% (5%-15%)
- Can we do better? YES: use control flow restructuring to exploit correlation
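As a side note on profiling (not on the slide): profile knowledge can also be handed to the compiler at the source level. A minimal sketch, assuming GCC/Clang and their __builtin_expect extension; the hint only steers static prediction and code layout, it does not change program semantics:

    /* hypothetical example of encoding a profile-derived prediction */
    #define LIKELY(x)   __builtin_expect(!!(x), 1)
    #define UNLIKELY(x) __builtin_expect(!!(x), 0)

    int process(const int *p)
    {
        if (UNLIKELY(p == 0))   /* profiling showed this branch is rarely taken */
            return -1;          /* cold path, can be laid out out of line       */
        return *p + 1;          /* hot fall-through path                        */
    }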
Slide 6: Static exploitation of correlation
If there is correlation, the branch direction in block d depends on the branch in block a
=> control flow restructuring
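A source-level sketch of such restructuring (function names and code are illustrative, not from the slide; it assumes x is not modified between the two tests):

    void a_then(void); void a_else(void); void b(void);
    void d_then(void); void d_else(void);

    /* Before: the test in block d is correlated with (here: identical to)
       the test in block a, so a static predictor wastes mispredictions.   */
    void before(int x)
    {
        if (x == 0) a_then(); else a_else();   /* branch in block a          */
        b();                                   /* intermediate block(s)      */
        if (x == 0) d_then(); else d_else();   /* correlated branch, block d */
    }

    /* After restructuring: duplicate the intermediate code on both paths;
       the second test disappears because its outcome is known per path.    */
    void after(int x)
    {
        if (x == 0) { a_then(); b(); d_then(); }
        else        { a_else(); b(); d_else(); }
    }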
Slide 7: Basic compiler techniques
- Dependencies limit ILP (Instruction-Level Parallelism)
  - We cannot always find sufficient independent operations to fill all the delay slots
  - May result in pipeline stalls
- Scheduling to avoid stalls (reorder instructions)
- (Source-)code transformations to create more exploitable parallelism
  - Loop Unrolling
  - Loop Merging (Fusion); see the sketch below
  - See the online slide set about loop transformations!
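A minimal sketch of loop merging (fusion); the arrays and loop bodies are illustrative, and the transformation assumes a and b do not alias:

    /* before fusion: two loops, each with its own overhead and little ILP per body */
    void separate(double *a, double *b, double s, int n)
    {
        for (int i = 0; i < n; i++) a[i] = a[i] + s;
        for (int i = 0; i < n; i++) b[i] = b[i] * s;
    }

    /* after fusion: one loop body with two independent operations,
       giving the scheduler more useful work to fill delay slots with */
    void fused(double *a, double *b, double s, int n)
    {
        for (int i = 0; i < n; i++) {
            a[i] = a[i] + s;
            b[i] = b[i] * s;
        }
    }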
Slide 8: Dependencies Limit ILP - Example
C loop: for (i=1; i<=1000; i++) x[i] = x[i] + s;
- MIPS assembly code:
      ; R1 = &x[1]
      ; R2 = &x[1000]+8
      ; F2 = s
  Loop: L.D   F0,0(R1)     ; F0 = x[i]
        ADD.D F4,F0,F2     ; F4 = x[i]+s
        S.D   0(R1),F4     ; x[i] = F4
        ADDI  R1,R1,8      ; R1 = &x[i+1]
        BNE   R1,R2,Loop   ; branch if R1 != &x[1000]+8
Slide 9: Schedule this on a MIPS Pipeline
- FP operations are mostly multicycle
- The pipeline must be stalled if an instruction uses the result of a not-yet-finished multicycle operation
- We'll assume the following latencies:

  Producing instruction   Consuming instruction   Latency (clock cycles)
  FP ALU op               FP ALU op               3
  FP ALU op               Store double            2
  Load double             FP ALU op               1
  Load double             Store double            0
Slide 10: Where to Insert Stalls?
- How would this loop be executed on the MIPS FP pipeline?

  Loop: L.D   F0,0(R1)
        ADD.D F4,F0,F2
        S.D   F4,0(R1)
        ADDI  R1,R1,8
        BNE   R1,R2,Loop

Inter-iteration dependence!
What are the true (flow) dependences?
Slide 11: Where to Insert Stalls
- How would this loop be executed on the MIPS FP pipeline?
- 10 cycles per iteration

  Loop: L.D   F0,0(R1)
        stall
        ADD.D F4,F0,F2
        stall
        stall
        S.D   0(R1),F4
        ADDI  R1,R1,8
        stall
        BNE   R1,R2,Loop
        stall
Slide 12: Code Scheduling to Avoid Stalls
- Can we reorder the instructions to avoid stalls?
- Execution time reduced from 10 to 6 cycles per iteration
- But only 3 instructions perform useful work; the rest is loop overhead. How to avoid this?

  Loop: L.D   F0,0(R1)
        ADDI  R1,R1,8
        ADD.D F4,F0,F2
        stall
        BNE   R1,R2,Loop
        S.D   -8(R1),F4    ; watch out: displacement adjusted, R1 was already incremented
Slide 13: Loop Unrolling - increasing ILP
- At source level:
      for (i=1; i<=1000; i++)
        x[i] = x[i] + s;
  becomes (unrolled 4 times):
      for (i=1; i<=1000; i=i+4) {
        x[i]   = x[i]   + s;
        x[i+1] = x[i+1] + s;
        x[i+2] = x[i+2] + s;
        x[i+3] = x[i+3] + s;
      }
- Any drawbacks?
  - loop unrolling increases code size
  - more registers are needed
- MIPS code after unrolling and scheduling:
      Loop: L.D   F0,0(R1)
            L.D   F6,8(R1)
            L.D   F10,16(R1)
            L.D   F14,24(R1)
            ADD.D F4,F0,F2
            ADD.D F8,F6,F2
            ADD.D F12,F10,F2
            ADD.D F16,F14,F2
            S.D   0(R1),F4
            S.D   8(R1),F8
            ADDI  R1,R1,32
            S.D   -16(R1),F12
            BNE   R1,R2,Loop
            S.D   -8(R1),F16
Slide 14: Multiple issue architectures
- How to get CPI < 1?
- Superscalar: multiple instructions issued per cycle
  - Statically scheduled
  - Dynamically scheduled (see previous lecture)
- VLIW?
  - single instruction issue, but multiple operations per instruction (so CPI >= 1)
- SIMD / Vector?
  - single instruction issue, single operation, but multiple data sets per operation (so CPI >= 1)
- Multi-threading? (e.g. x86 Hyperthreading)
- Multi-processor? (e.g. x86 Multi-core)
Slide 15: Instruction-Level Parallel (ILP) Processors
- The name ILP is used for:
  - Multiple-Issue Processors
  - Superscalar: varying number of instructions per cycle (0 to 8), scheduled by HW (dynamic issue capability)
    - e.g. IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4
  - VLIW (very long instruction word): fixed number of instructions (4-16), scheduled by the compiler (static issue capability)
    - e.g. Intel Architecture-64 (IA-64, Itanium), TriMedia, TI C6x
  - (Super-)pipelined processors
- The anticipated success of multiple issue led to the Instructions Per Cycle (IPC) metric instead of CPI
Slide 16: Vector processors
- Vector processing: explicit coding of independent loops as operations on large vectors of numbers
- Multimedia instructions are being added to many processors
- Different implementations:
  - real SIMD
    - e.g. 320 separate 32-bit ALUs + RFs
  - (multiple) subword units
    - divide a single ALU into sub-ALUs
  - deeply pipelined units
    - aiming at very high frequency
    - with forwarding between units
Slide 17: Simple In-order Superscalar
- In-order superscalar: 2-issue processor, 1 Integer + 1 FP
- Used in the first Pentium processor (also in Larrabee, but canceled!)
- Fetch 64 bits/clock cycle; Int on the left, FP on the right
- Can only issue the 2nd instruction if the 1st instruction issues
- More ports needed on the FP register file to execute an FP load + FP op in parallel

  Type              Pipe stages
  Int. instruction  IF ID EX MEM WB
  FP instruction    IF ID EX MEM WB
  Int. instruction     IF ID EX MEM WB
  FP instruction       IF ID EX MEM WB
  Int. instruction        IF ID EX MEM WB
  FP instruction          IF ID EX MEM WB

- A 1-cycle load delay impacts the next 3 instructions!
Slide 18: Dynamic trace for unrolled code
- for (i=1; i<=1000; i++)
    a[i] = a[i] + s;

     Integer instruction   FP instruction     Cycle
  L: LD   F0,0(R1)                              1
     LD   F6,8(R1)                              2
     LD   F10,16(R1)       ADDD F4,F0,F2        3
     LD   F14,24(R1)       ADDD F8,F6,F2        4
     LD   F18,32(R1)       ADDD F12,F10,F2      5
     SD   0(R1),F4         ADDD F16,F14,F2      6
     SD   8(R1),F8         ADDD F20,F18,F2      7
     SD   16(R1),F12                            8
     ADDI R1,R1,40                              9
     SD   -16(R1),F16                          10
     BNE  R1,R2,L                              11
     SD   -8(R1),F20                           12

Load: 1 cycle latency; FP ALU op: 2 cycles latency
- 2.4 cycles per element (12 cycles / 5 elements) vs. 3.5 for the ordinary MIPS pipeline
- Int and FP instructions are not perfectly balanced
Slide 19: Superscalar Multi-issue Issues
- While the Integer/FP split is simple for the HW, we get an IPC of 2 only for programs with:
  - exactly 50% FP operations AND no hazards
- More complex decode and issue! E.g., already for a 2-issue machine we need:
  - Issue logic: examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue (N-issue: O(N^2) comparisons)
  - Register file complexity: a 2-issue superscalar needs 4 reads and 2 writes/cycle
  - Rename logic: must be able to rename the same register multiple times in one cycle! For instance, consider 4-way issue:

        add r1, r2, r3        add p11, p4, p7
        sub r4, r1, r2   =>   sub p22, p11, p4
        lw  r1, 4(r4)         lw  p23, 4(p22)
        add r5, r1, r2        add p12, p23, p4

    Imagine doing this transformation in a single cycle!
  - Bypassing / result buses: need to complete multiple instructions/cycle
    - Need multiple buses with associated matching logic at every reservation station
Slide 20: Why not VLIW Processors?
- Superscalar HW is expensive to build => let the compiler find independent instructions and pack them into one Very Long Instruction Word (VLIW)
- Example: VLIW processor with 2 ld/st units, two FP units, one integer/branch unit, no branch delay
- 9/7 cycles per iteration!
Slide 21: Superscalar versus VLIW
- VLIW advantages:
  - Much simpler to build; potentially faster
- VLIW disadvantages and proposed solutions:
  - Binary code incompatibility
    - Object code translation or emulation
    - Less strict approach (EPIC, IA-64, Itanium)
  - Increase in code size; unfilled slots are wasted bits
    - Use clever encodings, e.g. only one immediate field
    - Compress instructions in memory and decode them when they are fetched, or when put in the L1 cache
  - Lockstep operation: if the operation in one instruction slot stalls, the entire processor is stalled
    - Less strict approach
Slide 22: Use compressed instructions
[Figure: compressed instructions in memory; decompression either between memory and the L1 instruction cache, or between the L1 instruction cache and the CPU]
Q: What are the pros and cons?
Slide 23: Advanced compiler support techniques
- Loop-level parallelism
- Software pipelining
- Global scheduling (across basic blocks)
Slide 24: Detecting Loop-Level Parallelism
- Loop-carried dependence: a statement executed in a certain iteration depends on a statement executed in an earlier iteration
- If there is no loop-carried dependence, the iterations of the loop can be executed in parallel
- Example:
      for (i=1; i<=100; i++) {
        A[i+1] = A[i] + C[i];     /* S1 */
        B[i+1] = B[i] + A[i+1];   /* S2 */
      }
[Figure: dependence graph with nodes S1 and S2]
- A loop is parallel <=> the corresponding dependence graph does not contain a cycle
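A companion example (textbook-style, not on the slide; arrays A, B, C, D and index i as above): a loop-carried dependence whose dependence graph has no cycle can be removed by restructuring:

    /* S2 -> S1 loop-carried dependence (S1 reads B[i] written by S2 in the
       previous iteration), but the dependence graph contains no cycle.     */
    for (i = 1; i <= 100; i++) {
        A[i]   = A[i] + B[i];     /* S1 */
        B[i+1] = C[i] + D[i];     /* S2 */
    }

    /* Restructured: the dependence is now within one iteration, so the
       iterations of the new loop are independent and can run in parallel.  */
    A[1] = A[1] + B[1];
    for (i = 1; i <= 99; i++) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];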
Slide 25: Finding Dependences
- Is there a dependence in the following loop?
      for (i=1; i<=100; i++)
        A[2i+3] = A[2i] + 5.0;
- Affine expression: an expression of the form a*i + b (a, b constants, i loop index variable)
- Does the following equation have a solution?
      a*i + b = c*j + d
- GCD test: if there is a solution, then GCD(a,c) must divide d-b
- Note: because the GCD test does not take the loop bounds into account, there are cases where the GCD test says "yes, there is a solution" while in reality there isn't
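A small sketch of the test applied to the loop above (a hypothetical helper written only for illustration): the write A[2i+3] and the read A[2j] give a=2, b=3, c=2, d=0; gcd(2,2)=2 does not divide d-b = -3, so the accesses are independent:

    #include <stdio.h>
    #include <stdlib.h>

    static int gcd(int x, int y) { return y == 0 ? x : gcd(y, x % y); }

    /* GCD test for a write to X[a*i+b] and a read of X[c*j+d]:
       a dependence is only possible if gcd(a,c) divides d-b.   */
    static int gcd_test_may_depend(int a, int b, int c, int d)
    {
        return (d - b) % gcd(abs(a), abs(c)) == 0;
    }

    int main(void)
    {
        /* slide example: A[2i+3] = A[2i] + 5.0  =>  a=2, b=3, c=2, d=0 */
        printf("%s\n", gcd_test_may_depend(2, 3, 2, 0)
                           ? "dependence possible" : "no dependence");
        return 0;
    }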
Slide 26: Software Pipelining
- We have already seen loop unrolling
- Software pipelining is a related technique that consumes less code space. It interleaves instructions from different iterations
  - instructions in one iteration are often dependent on each other
[Figure: iterations 0, 1 and 2 overlap; a software-pipelined iteration draws its instructions from different original iterations, forming the steady-state kernel]
Slide 27: Simple Software Pipelining Example
- Original loop:
      L: l.d   f0,0(r1)    ; load M[i]
         add.d f4,f0,f2    ; compute M[i]
         s.d   f4,0(r1)    ; store M[i]
         addi  r1,r1,-8    ; i = i-1
         bne   r1,r2,L
- Software-pipelined loop:
      L: s.d   f4,16(r1)   ; store M[i]
         add.d f4,f0,f2    ; compute M[i-1]
         l.d   f0,0(r1)    ; load M[i-2]
         addi  r1,r1,-8
         bne   r1,r2,L
- Need hardware to avoid the WAR hazards
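A source-level view of the same schedule (a sketch, assuming n >= 3, an array M[1..n], scalar s, and doubles f0, f4 mirroring the registers above):

    f0 = M[n];                  /* prologue: load for iteration n       */
    f4 = f0 + s;                /* prologue: compute for iteration n    */
    f0 = M[n-1];                /* prologue: load for iteration n-1     */

    for (i = n; i >= 3; i--) {  /* kernel (steady state)                */
        M[i] = f4;              /* store   result of iteration i        */
        f4   = f0 + s;          /* compute for iteration i-1            */
        f0   = M[i-2];          /* load    for iteration i-2            */
    }

    M[2] = f4;                  /* epilogue: finish iterations 2 and 1  */
    f4   = f0 + s;
    M[1] = f4;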
Slide 28: Global code scheduling
- Loop unrolling and software pipelining work well when there are no control statements (if statements) in the loop body, i.e. the loop is a single basic block
- Global code scheduling: scheduling/moving code across branches -> larger scheduling scope
- When can the assignments to B and C be moved before the test?
[Figure: CFG for A[i] = A[i] + B[i]; test "A[i] == 0?"; one path assigns B[i], the other executes X; both paths join at the assignment to C[i]]
Slide 29: Which scheduling scope?
- Hyperblock/region
- Trace
- Superblock
- Decision Tree
Slide 30: Comparing scheduling scopes
Slide 31: Scheduling scope creation (1)
Partitioning a CFG into scheduling scopes
Slide 32: Trace Scheduling
- Find the most likely sequence of basic blocks that will be executed consecutively (trace selection)
- Optimize the trace as much as possible (trace compaction):
  - move operations as early as possible in the trace
  - pack the operations in as few VLIW instructions as possible
  - additional bookkeeping code may be necessary at exit points of the trace
Slide 33: Scheduling scope creation (2)
Partitioning a CFG into scheduling scopes
Slide 34: Code movement (upwards) within regions
[Figure: an add operation is moved upwards from a source block to a destination block within a region, past intervening instructions]
Slide 35: Hardware support for compile-time scheduling
- Predication
- (discussed already)
- see also Itanium example
- Deferred exceptions
- Speculative loads
Slide 36: Predicated Instructions (discussed before)
- Avoid branch prediction by turning branches into conditional or predicated instructions
- If the condition is false, then neither store the result nor cause an exception
- Expanded ISAs of Alpha, MIPS, PowerPC and SPARC have a conditional move; PA-RISC can annul any following instruction
- IA-64/Itanium: conditional execution of any instruction
- Examples:
      if (R1 == 0) R2 = R3          CMOVZ  R2,R3,R1

      if (R1 < R2)                  SLT    R9,R1,R2
          R3 = R1                   CMOVNZ R3,R1,R9
      else                          CMOVZ  R3,R2,R9
          R3 = R2
Slide 37: Deferred Exceptions

        ld   r1,0(r3)    ; load A
        bnez r1,L1       ; test A
        ld   r1,0(r2)    ; then part: load B
        j    L2
    L1: addi r1,r1,4     ; else part: inc A
    L2: st   r1,0(r3)    ; store A

    if (A == 0) A = B; else A = A+4;

- How to optimize when the then-part is usually selected?

        ld   r1,0(r3)    ; load A
        ld   r9,0(r2)    ; speculative load of B
        beqz r1,L3       ; test A
        addi r9,r1,4     ; else part
    L3: st   r9,0(r3)    ; store A

- What if the speculative load generates a page fault?
- What if it generates an index-out-of-bounds exception?
Slide 38: HW supporting Speculative Loads
- Speculative load (sld): does not generate exceptions
- Speculation check instruction (speck): checks for an exception; the exception occurs when this instruction is executed

        ld    r1,0(r3)   ; load A
        sld   r9,0(r2)   ; speculative load of B
        bnez  r1,L1      ; test A
        speck 0(r2)      ; perform exception check
        j     L2
    L1: addi  r9,r1,4    ; else part
    L2: st    r9,0(r3)   ; store A
Slide 39: Next?
[Figure: trends plot, clock frequency around 3 GHz, power around 100 W]
- Trends:
  - transistor count follows Moore's law
  - but clock frequency and performance/core do not