Title: Advanced Computer Architecture 5MD00 Exploiting ILP with SW approaches
Slide 1: Advanced Computer Architecture 5MD00 - Exploiting ILP with SW approaches
- Henk Corporaal
- www.ics.ele.tue.nl/heco
- TU Eindhoven
- December 2012
Slide 2: Topics
- Static branch prediction and speculation
- Basic compiler techniques
- Multiple issue architectures
- Advanced compiler support techniques
- Loop-level parallelism
- Software pipelining
- Hardware support for compile-time scheduling
Slide 3: We previously discussed dynamic branch prediction
This does not help the compiler!
- Should the compiler speculate operations (i.e., move operations before a branch) from the target or from the fall-through path?
- We need static branch prediction
Slide 4: Static Branch Prediction and Speculation
- Static branch prediction is useful for code scheduling
- Example:
      ld   r1,0(r2)
      sub  r1,r1,r3    ; hazard
      beqz r1,L
      or   r4,r5,r6
      addu r10,r4,r3
  L:  addu r7,r8,r9
- If the branch is taken most of the time, and since r7 is not needed on the fall-through path, we could move addu r7,r8,r9 directly after the ld
- If the branch is not taken most of the time, and assuming that r4 is not needed on the taken path, we could move or r4,r5,r6 after the ld
Slide 5: 4 Static Branch Prediction Methods
- Always predict taken
  - Average misprediction rate for SPEC: 34% (9%-59%)
- Backward branches predicted taken, forward branches predicted not taken
  - In SPEC, most forward branches are taken, so "always predict taken" is better
- Profiling
  - Run the program and profile all branches. If a branch is taken (not taken) most of the time, it is predicted taken (not taken)
  - The behavior of a branch is often biased towards taken or not taken
  - Average misprediction rate: SPECint 15% (11%-22%), SPECfp 9% (5%-15%)
- Can we do better? YES: use control flow restructuring to exploit correlation
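As a side note on profiling (not on the slide): profile knowledge can also be handed to the compiler at the source level. A minimal sketch, assuming GCC/Clang and their __builtin_expect extension; the hint only steers static prediction and code layout, it does not change program semantics:

    /* hypothetical example of encoding a profile-derived prediction */
    #define LIKELY(x)   __builtin_expect(!!(x), 1)
    #define UNLIKELY(x) __builtin_expect(!!(x), 0)

    int process(const int *p)
    {
        if (UNLIKELY(p == 0))   /* profiling showed this branch is rarely taken */
            return -1;          /* cold path, can be laid out out of line       */
        return *p + 1;          /* hot fall-through path                        */
    }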
Slide 6: Static exploitation of correlation
If there is correlation, the branch direction in block d depends on the branch in block a
=> control flow restructuring
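A source-level sketch of such restructuring (function names and code are illustrative, not from the slide; it assumes x is not modified between the two tests):

    void a_then(void); void a_else(void); void b(void);
    void d_then(void); void d_else(void);

    /* Before: the test in block d is correlated with (here: identical to)
       the test in block a, so a static predictor wastes mispredictions.   */
    void before(int x)
    {
        if (x == 0) a_then(); else a_else();   /* branch in block a          */
        b();                                   /* intermediate block(s)      */
        if (x == 0) d_then(); else d_else();   /* correlated branch, block d */
    }

    /* After restructuring: duplicate the intermediate code on both paths;
       the second test disappears because its outcome is known per path.    */
    void after(int x)
    {
        if (x == 0) { a_then(); b(); d_then(); }
        else        { a_else(); b(); d_else(); }
    }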
Slide 7: Basic compiler techniques
- Dependencies limit ILP (Instruction-Level Parallelism)
  - We cannot always find sufficient independent operations to fill all the delay slots
  - May result in pipeline stalls
- Scheduling to avoid stalls (reorder instructions)
- (Source-)code transformations to create more exploitable parallelism
  - Loop Unrolling
  - Loop Merging (Fusion); see the sketch below
  - See the online slide set about loop transformations!
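A minimal sketch of loop merging (fusion); the arrays and loop bodies are illustrative, and the transformation assumes a and b do not alias:

    /* before fusion: two loops, each with its own overhead and little ILP per body */
    void separate(double *a, double *b, double s, int n)
    {
        for (int i = 0; i < n; i++) a[i] = a[i] + s;
        for (int i = 0; i < n; i++) b[i] = b[i] * s;
    }

    /* after fusion: one loop body with two independent operations,
       giving the scheduler more useful work to fill delay slots with */
    void fused(double *a, double *b, double s, int n)
    {
        for (int i = 0; i < n; i++) {
            a[i] = a[i] + s;
            b[i] = b[i] * s;
        }
    }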
Slide 8: Dependencies Limit ILP - Example
C loop: for (i=1; i<=1000; i++) x[i] = x[i] + s;
- MIPS assembly code:
      ; R1 = &x[1]
      ; R2 = &x[1000]+8
      ; F2 = s
  Loop: L.D   F0,0(R1)     ; F0 = x[i]
        ADD.D F4,F0,F2     ; F4 = x[i]+s
        S.D   0(R1),F4     ; x[i] = F4
        ADDI  R1,R1,8      ; R1 = &x[i+1]
        BNE   R1,R2,Loop   ; branch if R1 != &x[1000]+8
Slide 9: Schedule this on a MIPS Pipeline
- FP operations are mostly multicycle
- The pipeline must be stalled if an instruction uses the result of a not-yet-finished multicycle operation
- We'll assume the following latencies:

  Producing instruction   Consuming instruction   Latency (clock cycles)
  FP ALU op               FP ALU op               3
  FP ALU op               Store double            2
  Load double             FP ALU op               1
  Load double             Store double            0
Slide 10: Where to Insert Stalls?
- How would this loop be executed on the MIPS FP pipeline?

  Loop: L.D   F0,0(R1)
        ADD.D F4,F0,F2
        S.D   F4,0(R1)
        ADDI  R1,R1,8
        BNE   R1,R2,Loop

Inter-iteration dependence!
What are the true (flow) dependences?
Slide 11: Where to Insert Stalls
- How would this loop be executed on the MIPS FP pipeline?
- 10 cycles per iteration

  Loop: L.D   F0,0(R1)
        stall
        ADD.D F4,F0,F2
        stall
        stall
        S.D   0(R1),F4
        ADDI  R1,R1,8
        stall
        BNE   R1,R2,Loop
        stall
Slide 12: Code Scheduling to Avoid Stalls
- Can we reorder the instructions to avoid stalls?
- Execution time reduced from 10 to 6 cycles per iteration
- But only 3 instructions perform useful work; the rest is loop overhead. How to avoid this?

  Loop: L.D   F0,0(R1)
        ADDI  R1,R1,8
        ADD.D F4,F0,F2
        stall
        BNE   R1,R2,Loop
        S.D   -8(R1),F4    ; watch out: displacement adjusted, R1 was already incremented
Slide 13: Loop Unrolling - increasing ILP
- At source level:
      for (i=1; i<=1000; i++)
        x[i] = x[i] + s;
  becomes (unrolled 4 times):
      for (i=1; i<=1000; i=i+4) {
        x[i]   = x[i]   + s;
        x[i+1] = x[i+1] + s;
        x[i+2] = x[i+2] + s;
        x[i+3] = x[i+3] + s;
      }
- Any drawbacks?
  - loop unrolling increases code size
  - more registers are needed
- MIPS code after unrolling and scheduling:
      Loop: L.D   F0,0(R1)
            L.D   F6,8(R1)
            L.D   F10,16(R1)
            L.D   F14,24(R1)
            ADD.D F4,F0,F2
            ADD.D F8,F6,F2
            ADD.D F12,F10,F2
            ADD.D F16,F14,F2
            S.D   0(R1),F4
            S.D   8(R1),F8
            ADDI  R1,R1,32
            S.D   -16(R1),F12
            BNE   R1,R2,Loop
            S.D   -8(R1),F16
Slide 14: Multiple issue architectures
- How to get CPI < 1?
- Superscalar: multiple instructions issued per cycle
  - Statically scheduled
  - Dynamically scheduled (see previous lecture)
- VLIW?
  - single instruction issue, but multiple operations per instruction (so CPI >= 1)
- SIMD / Vector?
  - single instruction issue, single operation, but multiple data sets per operation (so CPI >= 1)
- Multi-threading? (e.g. x86 Hyperthreading)
- Multi-processor? (e.g. x86 Multi-core)
Slide 15: Instruction-Level Parallel (ILP) Processors
- The name ILP is used for:
  - Multiple-Issue Processors
  - Superscalar: varying number of instructions per cycle (0 to 8), scheduled by HW (dynamic issue capability)
    - e.g. IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4
  - VLIW (very long instruction word): fixed number of instructions (4-16), scheduled by the compiler (static issue capability)
    - e.g. Intel Architecture-64 (IA-64, Itanium), TriMedia, TI C6x
  - (Super-)pipelined processors
- The anticipated success of multiple issue led to the Instructions Per Cycle (IPC) metric instead of CPI
Slide 16: Vector processors
- Vector processing: explicit coding of independent loops as operations on large vectors of numbers
- Multimedia instructions are being added to many processors
- Different implementations:
  - real SIMD
    - e.g. 320 separate 32-bit ALUs + RFs
  - (multiple) subword units
    - divide a single ALU into sub-ALUs
  - deeply pipelined units
    - aiming at very high frequency
    - with forwarding between units
Slide 17: Simple In-order Superscalar
- In-order superscalar: 2-issue processor, 1 Integer + 1 FP
- Used in the first Pentium processor (also in Larrabee, but canceled!)
- Fetch 64 bits/clock cycle; Int on the left, FP on the right
- Can only issue the 2nd instruction if the 1st instruction issues
- More ports needed on the FP register file to execute an FP load + FP op in parallel

  Type              Pipe stages
  Int. instruction  IF ID EX MEM WB
  FP instruction    IF ID EX MEM WB
  Int. instruction     IF ID EX MEM WB
  FP instruction       IF ID EX MEM WB
  Int. instruction        IF ID EX MEM WB
  FP instruction          IF ID EX MEM WB

- A 1-cycle load delay impacts the next 3 instructions!
Slide 18: Dynamic trace for unrolled code
- for (i=1; i<=1000; i++)
    a[i] = a[i] + s;

     Integer instruction   FP instruction     Cycle
  L: LD   F0,0(R1)                              1
     LD   F6,8(R1)                              2
     LD   F10,16(R1)       ADDD F4,F0,F2        3
     LD   F14,24(R1)       ADDD F8,F6,F2        4
     LD   F18,32(R1)       ADDD F12,F10,F2      5
     SD   0(R1),F4         ADDD F16,F14,F2      6
     SD   8(R1),F8         ADDD F20,F18,F2      7
     SD   16(R1),F12                            8
     ADDI R1,R1,40                              9
     SD   -16(R1),F16                          10
     BNE  R1,R2,L                              11
     SD   -8(R1),F20                           12

Load: 1 cycle latency; FP ALU op: 2 cycles latency
- 2.4 cycles per element (12 cycles / 5 elements) vs. 3.5 for the ordinary MIPS pipeline
- Int and FP instructions are not perfectly balanced
Slide 19: Superscalar Multi-issue Issues
- While the Integer/FP split is simple for the HW, we get an IPC of 2 only for programs with:
  - exactly 50% FP operations AND no hazards
- More complex decode and issue! E.g., already for a 2-issue machine we need:
  - Issue logic: examine 2 opcodes and 6 register specifiers, and decide whether 1 or 2 instructions can issue (N-issue: O(N^2) comparisons)
  - Register file complexity: a 2-issue superscalar needs 4 reads and 2 writes/cycle
  - Rename logic: must be able to rename the same register multiple times in one cycle! For instance, consider 4-way issue:

        add r1, r2, r3        add p11, p4, p7
        sub r4, r1, r2   =>   sub p22, p11, p4
        lw  r1, 4(r4)         lw  p23, 4(p22)
        add r5, r1, r2        add p12, p23, p4

    Imagine doing this transformation in a single cycle!
  - Bypassing / result buses: need to complete multiple instructions/cycle
    - Need multiple buses with associated matching logic at every reservation station
Slide 20: Why not VLIW Processors?
- Superscalar HW is expensive to build => let the compiler find independent instructions and pack them into one Very Long Instruction Word (VLIW)
- Example: VLIW processor with 2 ld/st units, two FP units, one integer/branch unit, no branch delay
- 9/7 cycles per iteration!
Slide 21: Superscalar versus VLIW
- VLIW advantages:
  - Much simpler to build; potentially faster
- VLIW disadvantages and proposed solutions:
  - Binary code incompatibility
    - Object code translation or emulation
    - Less strict approach (EPIC, IA-64, Itanium)
  - Increase in code size; unfilled slots are wasted bits
    - Use clever encodings, e.g. only one immediate field
    - Compress instructions in memory and decode them when they are fetched, or when put in the L1 cache
  - Lockstep operation: if the operation in one instruction slot stalls, the entire processor is stalled
    - Less strict approach
Slide 22: Use compressed instructions
[Figure: compressed instructions in memory; decompression either between memory and the L1 instruction cache, or between the L1 instruction cache and the CPU]
Q: What are the pros and cons?
Slide 23: Advanced compiler support techniques
- Loop-level parallelism
- Software pipelining
- Global scheduling (across basic blocks)
Slide 24: Detecting Loop-Level Parallelism
- Loop-carried dependence: a statement executed in a certain iteration depends on a statement executed in an earlier iteration
- If there is no loop-carried dependence, the iterations of the loop can be executed in parallel
- Example:
      for (i=1; i<=100; i++) {
        A[i+1] = A[i] + C[i];     /* S1 */
        B[i+1] = B[i] + A[i+1];   /* S2 */
      }
[Figure: dependence graph with nodes S1 and S2]
- A loop is parallel <=> the corresponding dependence graph does not contain a cycle
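A companion example (textbook-style, not on the slide; arrays A, B, C, D and index i as above): a loop-carried dependence whose dependence graph has no cycle can be removed by restructuring:

    /* S2 -> S1 loop-carried dependence (S1 reads B[i] written by S2 in the
       previous iteration), but the dependence graph contains no cycle.     */
    for (i = 1; i <= 100; i++) {
        A[i]   = A[i] + B[i];     /* S1 */
        B[i+1] = C[i] + D[i];     /* S2 */
    }

    /* Restructured: the dependence is now within one iteration, so the
       iterations of the new loop are independent and can run in parallel.  */
    A[1] = A[1] + B[1];
    for (i = 1; i <= 99; i++) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];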
Slide 25: Finding Dependences
- Is there a dependence in the following loop?
      for (i=1; i<=100; i++)
        A[2i+3] = A[2i] + 5.0;
- Affine expression: an expression of the form a*i + b (a, b constants, i loop index variable)
- Does the following equation have a solution?
      a*i + b = c*j + d
- GCD test: if there is a solution, then GCD(a,c) must divide d-b
- Note: because the GCD test does not take the loop bounds into account, there are cases where the GCD test says "yes, there is a solution" while in reality there isn't
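A small sketch of the test applied to the loop above (a hypothetical helper written only for illustration): the write A[2i+3] and the read A[2j] give a=2, b=3, c=2, d=0; gcd(2,2)=2 does not divide d-b = -3, so the accesses are independent:

    #include <stdio.h>
    #include <stdlib.h>

    static int gcd(int x, int y) { return y == 0 ? x : gcd(y, x % y); }

    /* GCD test for a write to X[a*i+b] and a read of X[c*j+d]:
       a dependence is only possible if gcd(a,c) divides d-b.   */
    static int gcd_test_may_depend(int a, int b, int c, int d)
    {
        return (d - b) % gcd(abs(a), abs(c)) == 0;
    }

    int main(void)
    {
        /* slide example: A[2i+3] = A[2i] + 5.0  =>  a=2, b=3, c=2, d=0 */
        printf("%s\n", gcd_test_may_depend(2, 3, 2, 0)
                           ? "dependence possible" : "no dependence");
        return 0;
    }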
Slide 26: Software Pipelining
- We have already seen loop unrolling
- Software pipelining is a related technique that consumes less code space. It interleaves instructions from different iterations
  - instructions in one iteration are often dependent on each other
[Figure: iterations 0, 1 and 2 overlap; a software-pipelined iteration draws its instructions from different original iterations, forming the steady-state kernel]
Slide 27: Simple Software Pipelining Example
- Original loop:
      L: l.d   f0,0(r1)    ; load M[i]
         add.d f4,f0,f2    ; compute M[i]
         s.d   f4,0(r1)    ; store M[i]
         addi  r1,r1,-8    ; i = i-1
         bne   r1,r2,L
- Software-pipelined loop:
      L: s.d   f4,16(r1)   ; store M[i]
         add.d f4,f0,f2    ; compute M[i-1]
         l.d   f0,0(r1)    ; load M[i-2]
         addi  r1,r1,-8
         bne   r1,r2,L
- Need hardware to avoid the WAR hazards
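A source-level view of the same schedule (a sketch, assuming n >= 3, an array M[1..n], scalar s, and doubles f0, f4 mirroring the registers above):

    f0 = M[n];                  /* prologue: load for iteration n       */
    f4 = f0 + s;                /* prologue: compute for iteration n    */
    f0 = M[n-1];                /* prologue: load for iteration n-1     */

    for (i = n; i >= 3; i--) {  /* kernel (steady state)                */
        M[i] = f4;              /* store   result of iteration i        */
        f4   = f0 + s;          /* compute for iteration i-1            */
        f0   = M[i-2];          /* load    for iteration i-2            */
    }

    M[2] = f4;                  /* epilogue: finish iterations 2 and 1  */
    f4   = f0 + s;
    M[1] = f4;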
Slide 28: Global code scheduling
- Loop unrolling and software pipelining work well when there are no control statements (if statements) in the loop body, i.e. the loop is a single basic block
- Global code scheduling: scheduling/moving code across branches -> larger scheduling scope
- When can the assignments to B and C be moved before the test?
[Figure: CFG for A[i] = A[i] + B[i]; test "A[i] == 0?"; one path assigns B[i], the other executes X; both paths join at the assignment to C[i]]
Slide 29: Which scheduling scope?
- Hyperblock/region
- Trace
- Superblock
- Decision Tree
Slide 30: Comparing scheduling scopes
Slide 31: Scheduling scope creation (1)
Partitioning a CFG into scheduling scopes
Slide 32: Trace Scheduling
- Find the most likely sequence of basic blocks that will be executed consecutively (trace selection)
- Optimize the trace as much as possible (trace compaction):
  - move operations as early as possible in the trace
  - pack the operations in as few VLIW instructions as possible
  - additional bookkeeping code may be necessary at exit points of the trace
Slide 33: Scheduling scope creation (2)
Partitioning a CFG into scheduling scopes
Slide 34: Code movement (upwards) within regions
[Figure: an add operation is moved upwards from a source block to a destination block within a region, past intervening instructions]
Slide 35: Hardware support for compile-time scheduling
- Predication
- (discussed already)
- see also Itanium example
- Deferred exceptions
- Speculative loads
Slide 36: Predicated Instructions (discussed before)
- Avoid branch prediction by turning branches into conditional or predicated instructions
- If the condition is false, then neither store the result nor cause an exception
- Expanded ISAs of Alpha, MIPS, PowerPC and SPARC have a conditional move; PA-RISC can annul any following instruction
- IA-64/Itanium: conditional execution of any instruction
- Examples:
      if (R1 == 0) R2 = R3          CMOVZ  R2,R3,R1

      if (R1 < R2)                  SLT    R9,R1,R2
          R3 = R1                   CMOVNZ R3,R1,R9
      else                          CMOVZ  R3,R2,R9
          R3 = R2
Slide 37: Deferred Exceptions

        ld   r1,0(r3)    ; load A
        bnez r1,L1       ; test A
        ld   r1,0(r2)    ; then part: load B
        j    L2
    L1: addi r1,r1,4     ; else part: inc A
    L2: st   r1,0(r3)    ; store A

    if (A == 0) A = B; else A = A+4;

- How to optimize when the then-part is usually selected?

        ld   r1,0(r3)    ; load A
        ld   r9,0(r2)    ; speculative load of B
        beqz r1,L3       ; test A
        addi r9,r1,4     ; else part
    L3: st   r9,0(r3)    ; store A

- What if the speculative load generates a page fault?
- What if it generates an index-out-of-bounds exception?
Slide 38: HW supporting Speculative Loads
- Speculative load (sld): does not generate exceptions
- Speculation check instruction (speck): checks for an exception; the exception occurs when this instruction is executed

        ld    r1,0(r3)   ; load A
        sld   r9,0(r2)   ; speculative load of B
        bnez  r1,L1      ; test A
        speck 0(r2)      ; perform exception check
        j     L2
    L1: addi  r9,r1,4    ; else part
    L2: st    r9,0(r3)   ; store A
Slide 39: Next?
[Figure: trends plot, clock frequency around 3 GHz, power around 100 W]
- Trends:
  - transistor count follows Moore's law
  - but clock frequency and performance/core do not