Title: Embedded Computer Architectures
- Hennessy Patterson
- Chapter 4
- Exploiting ILP with Software Approaches
- Gerard Smit (Zilverling 4102), smit_at_cs.utwente.nl
- André Kokkeler (Zilverling 4096),
- Introduction
- Processor Architecture
- Loop Unrolling
- Software Pipelining
4 Processor Architecture
- 5 stage pipeline
- Static scheduling
- Integer and Floating Point unit
5 Processor Architecture
- Integer ALU -> Integer ALU: no latency
- Floating point ALU -> Floating point ALU: latency 3
6 Processor Architecture
- Load Memory -> Store Memory: no latency
7 Processor Architecture
- Integer ALU -> Store Memory: no latency
- Floating point ALU -> Store Memory: latency 2
8 Processor Architecture
- Load Memory -> Integer ALU: latency 1
- Load Memory -> Floating point ALU: latency 1
9 Processor Architecture
- Integer ALU -> Branch: latency 1
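The producer/consumer latencies above can be kept in a small lookup table. The following is a minimal sketch, not part of the original slides: the short unit names (`int`, `fp`, `load`, `store`, `branch`) and the helper `stalls_between` are made up for illustration, and pairs that the slides do not list are assumed to need no stall.

```python
# Producer -> consumer latencies as listed on the slides.
LATENCY = {
    ("int", "int"): 0,
    ("fp", "fp"): 3,
    ("load", "store"): 0,
    ("int", "store"): 0,
    ("fp", "store"): 2,
    ("load", "int"): 1,
    ("load", "fp"): 1,
    ("int", "branch"): 1,
}


def stalls_between(producer, consumer, gap=0):
    """Stall cycles needed when `gap` independent instructions sit
    between the producer and the dependent consumer."""
    # Assumption: pairs not listed on the slides need no stall.
    latency = LATENCY.get((producer, consumer), 0)
    return max(0, latency - gap)


print(stalls_between("load", "fp"))          # L.D followed directly by ADD.D
print(stalls_between("fp", "store"))         # ADD.D followed directly by S.D
print(stalls_between("fp", "store", gap=2))  # two useful instructions in between
```

Each independent instruction placed between producer and consumer removes one stall, which is the idea behind the rescheduling on the following slides.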
10 Loop Unrolling
- for i = 1000 downto 1 do x[i] = x[i] + s
- Code:
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        DADDUI R1,R1,-8    ; i <- i - 1
        BNE    R1,R2,Loop  ; repeat if i != 0
        NOP                ; branch delay slot
- R1: pointer within array
- F2: value to be added (s)
- R2: last element in array
- F0: value in array
- F4: value to be written in array
11 Loop Unrolling
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        DADDUI R1,R1,-8    ; i <- i - 1
        BNE    R1,R2,Loop  ; repeat if i != 0
        NOP                ; branch delay slot
- Load Memory -> FP ALU: 1 stall
12 Loop Unrolling
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        stall
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        DADDUI R1,R1,-8    ; i <- i - 1
        BNE    R1,R2,Loop  ; repeat if i != 0
        NOP                ; branch delay slot
- FP ALU -> Store Memory: 2 stalls
13 Loop Unrolling
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        stall
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        stall
        stall
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        DADDUI R1,R1,-8    ; i <- i - 1
        BNE    R1,R2,Loop  ; repeat if i != 0
        NOP                ; branch delay slot
- Integer ALU -> Branch: 1 stall
14 Loop Unrolling
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        stall
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        stall
        stall
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        DADDUI R1,R1,-8    ; i <- i - 1
        stall
        BNE    R1,R2,Loop  ; repeat if i != 0
        NOP                ; branch delay slot
- A smart compiler can reschedule the instructions to remove stalls
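15 Loop Unrolling
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        DADDUI R1,R1,-8    ; i <- i - 1
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        stall
        BNE    R1,R2,Loop  ; repeat if i != 0
        S.D    8(R1),F4    ; x[i] <- x[i] + s (in the branch delay slot)
- Integer ALU -> Branch: the 1-cycle latency is covered by ADD.D between DADDUI and BNE
- 1 stall remains: FP ALU -> Store Memory needs 2 cycles between ADD.D and S.D, and BNE covers only one
- From 10 cycles per loop iteration to 6 cycles per loop iteration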
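The cycle counts of the two schedules can be checked with a small in-order issue model. This is a minimal sketch under stated assumptions, not the method used in the lecture: instructions issue one per cycle, a dependent instruction waits for the latency listed on the Processor Architecture slides, unlisted producer/consumer pairs are assumed to need no stall, and the tuple encoding `(kind, destination, sources)` is made up for illustration.

```python
# Producer -> consumer latencies from the Processor Architecture slides.
LATENCY = {
    ("int", "int"): 0, ("fp", "fp"): 3, ("load", "store"): 0,
    ("int", "store"): 0, ("fp", "store"): 2, ("load", "int"): 1,
    ("load", "fp"): 1, ("int", "branch"): 1,
}


def cycles(program):
    """Cycles for one loop iteration; program is a list of
    (kind, destination register or None, source registers)."""
    produced = {}  # register -> (producer kind, issue cycle)
    cycle = 0
    for kind, dest, sources in program:
        earliest = cycle + 1  # in-order: one instruction per cycle at best
        for reg in sources:
            if reg in produced:
                prod_kind, prod_cycle = produced[reg]
                latency = LATENCY.get((prod_kind, kind), 0)  # unlisted: 0
                earliest = max(earliest, prod_cycle + latency + 1)
        cycle = earliest
        if dest is not None:
            produced[dest] = (kind, cycle)
    return cycle


UNSCHEDULED = [
    ("load", "F0", ["R1"]),          # L.D    F0,0(R1)
    ("fp", "F4", ["F0", "F2"]),      # ADD.D  F4,F0,F2
    ("store", None, ["F4", "R1"]),   # S.D    0(R1),F4
    ("int", "R1", ["R1"]),           # DADDUI R1,R1,-8
    ("branch", None, ["R1", "R2"]),  # BNE    R1,R2,Loop
    ("nop", None, []),               # NOP (branch delay slot)
]

SCHEDULED = [
    ("load", "F0", ["R1"]),          # L.D    F0,0(R1)
    ("int", "R1", ["R1"]),           # DADDUI R1,R1,-8
    ("fp", "F4", ["F0", "F2"]),      # ADD.D  F4,F0,F2
    ("branch", None, ["R1", "R2"]),  # BNE    R1,R2,Loop
    ("store", None, ["F4", "R1"]),   # S.D    8(R1),F4 (delay slot)
]

print(cycles(UNSCHEDULED), cycles(SCHEDULED))
```

Under these assumptions the model gives 10 cycles for the original order and 6 for the rescheduled order, matching the figures on the slide.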
16 Loop Unrolling
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        DADDUI R1,R1,-8    ; i <- i - 1
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        BNE    R1,R2,Loop  ; repeat if i != 0
        S.D    8(R1),F4    ; x[i] <- x[i] + s
- 5 instructions
- 3 doing the job
- 2 control or overhead
- Reduce overhead -> loop unrolling
- Add code
- From 1000 iterations to 500 iterations
17 Loop Unrolling
- Original Code Sequence
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        DADDUI R1,R1,-8    ; i <- i - 1
        BNE    R1,R2,Loop  ; repeat if i != 0
        NOP                ; branch delay slot
- Copy this part, with the correct data pointer
18 Loop Unrolling
- Unrolled Code Sequence
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        (1 stall)
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        (2 stalls)
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        L.D    F0,-8(R1)   ; F0 <- x[i-1]
        (1 stall)
        ADD.D  F4,F0,F2    ; F4 <- x[i-1] + s
        (2 stalls)
        S.D    -8(R1),F4   ; x[i-1] <- x[i-1] + s
        DADDUI R1,R1,-16   ; i <- i - 2
        (1 stall)
        BNE    R1,R2,Loop  ; repeat if i != 0
        NOP                ; branch delay slot
- There are still a lot of stalls. Removing them is easier if some additional registers are used
19 Loop Unrolling
- Unrolled Code Sequence
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        (1 stall)
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        (2 stalls)
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        L.D    F6,-8(R1)   ; F6 <- x[i-1]
        (1 stall)
        ADD.D  F8,F6,F2    ; F8 <- x[i-1] + s
        (2 stalls)
        S.D    -8(R1),F8   ; x[i-1] <- x[i-1] + s
        DADDUI R1,R1,-16   ; i <- i - 2
        (1 stall)
        BNE    R1,R2,Loop  ; repeat if i != 0
        NOP                ; branch delay slot
20 Loop Unrolling
- Unrolled Code Sequence
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        L.D    F6,-8(R1)   ; F6 <- x[i-1]
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        ADD.D  F8,F6,F2    ; F8 <- x[i-1] + s
        S.D    -8(R1),F8   ; x[i-1] <- x[i-1] + s
        DADDUI R1,R1,-16   ; i <- i - 2
        BNE    R1,R2,Loop  ; repeat if i != 0
        NOP                ; branch delay slot
- Stall annotations on this slide, in order: 1 stall, 1 stall, 2 stalls, 1 stall
21 Loop Unrolling
- Unrolled Code Sequence
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        L.D    F6,-8(R1)   ; F6 <- x[i-1]
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        ADD.D  F8,F6,F2    ; F8 <- x[i-1] + s
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        S.D    -8(R1),F8   ; x[i-1] <- x[i-1] + s
        DADDUI R1,R1,-16   ; i <- i - 2
        BNE    R1,R2,Loop  ; repeat if i != 0
        NOP                ; branch delay slot
- Stall annotations on this slide, in order: 2 stalls, 1 stall
22 Loop Unrolling
- Unrolled Code Sequence
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        L.D    F6,-8(R1)   ; F6 <- x[i-1]
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        ADD.D  F8,F6,F2    ; F8 <- x[i-1] + s
        DADDUI R1,R1,-16   ; i <- i - 2
        S.D    16(R1),F4   ; x[i] <- x[i] + s
        S.D    8(R1),F8    ; x[i-1] <- x[i-1] + s
        BNE    R1,R2,Loop  ; repeat if i != 0
        NOP                ; branch delay slot
23 Loop Unrolling
- Unrolled Code Sequence
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        L.D    F6,-8(R1)   ; F6 <- x[i-1]
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        ADD.D  F8,F6,F2    ; F8 <- x[i-1] + s
        DADDUI R1,R1,-16   ; i <- i - 2
        S.D    16(R1),F4   ; x[i] <- x[i] + s
        BNE    R1,R2,Loop  ; repeat if i != 0
        S.D    8(R1),F8    ; x[i-1] <- x[i-1] + s (in the branch delay slot)
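At source level, unrolling by a factor of 2 looks roughly as follows. This is an illustrative sketch rather than code from the slides: the function names are made up, the array is 0-based, and its length is assumed to be even (a real compiler would add clean-up code for a leftover element).

```python
def add_scalar(x, s):
    """Original loop: one element per iteration, from the last element down."""
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + s


def add_scalar_unrolled2(x, s):
    """Same loop unrolled by 2: half the iterations, half the loop overhead."""
    if len(x) % 2 != 0:
        # Assumption of this sketch: no clean-up code for an odd length.
        raise ValueError("length must be even")
    i = len(x) - 1
    while i > 0:
        x[i] = x[i] + s          # body copy 1, offset 0
        x[i - 1] = x[i - 1] + s  # body copy 2, offset -1 (the -8(R1) copy)
        i -= 2                   # one pointer update per two elements


a = [float(v) for v in range(1000)]
b = list(a)
add_scalar(a, 2.5)
add_scalar_unrolled2(b, 2.5)
print(a == b)
```

The two body copies use different temporaries in the assembly version (F0/F4 and F6/F8) so that they can be interleaved; in Python the interpreter hides that detail.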
24 Loop Unrolling
- In the example: loop-unrolling factor 2
- In general: loop-unrolling factor k
- Limitations concerning k:
- Amdahl's law: 3000 cycles are always needed (3 useful instructions for each of the 1000 elements)
- Increasing k -> increasing number of registers
- Increasing k -> increasing code size
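The diminishing return of larger k can be made concrete with a short calculation. The sketch below assumes that every stall is scheduled away, so each element costs 3 cycles (load, add, store) and each loop iteration costs 2 overhead cycles (pointer update and branch). The function name and parameters are made up for illustration.

```python
def loop_cycles(n, k, work=3, overhead=2):
    """Total cycles for n elements with unrolling factor k, assuming
    a stall-free schedule and that k divides n."""
    if n % k != 0:
        raise ValueError("this sketch assumes k divides n")
    iterations = n // k
    return n * work + iterations * overhead


for k in (1, 2, 4, 1000):
    print(k, loop_cycles(1000, k))
```

The `n * work` term is the 3000 cycles that no amount of unrolling removes. Only the overhead term shrinks with k, while register use and code size grow.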
25 Software Pipelining
- Original (non-unrolled) loop
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        (1 stall)
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        (2 stalls)
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        DADDUI R1,R1,-8    ; i <- i - 1
        (1 stall)
        BNE    R1,R2,Loop  ; repeat if i != 0
        NOP                ; branch delay slot
- Three actions involved with actual calculations:
  F0 <- x[i]
  F4 <- x[i] + s
  x[i] <- x[i] + s
- Consider these as three different stages
26 Software Pipelining
- Original (non-unrolled) loop
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        DADDUI R1,R1,-8    ; i <- i - 1
        BNE    R1,R2,Loop  ; repeat if i != 0
        NOP                ; branch delay slot
- Three actions involved with actual calculations:
  F0 <- x[i]          Stage 1
  F4 <- x[i] + s      Stage 2
  x[i] <- x[i] + s    Stage 3
- Associate array element with the stages
27 Software Pipelining
- Original (non-unrolled) loop
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        DADDUI R1,R1,-8    ; i <- i - 1
        BNE    R1,R2,Loop  ; repeat if i != 0
        NOP                ; branch delay slot
- Three actions involved with actual calculations:
  F0 <- x[i]          Stage 1, x[i]
  F4 <- x[i] + s      Stage 2, x[i]
  x[i] <- x[i] + s    Stage 3, x[i]
28 Software Pipelining
[Figure: Stage 1, Stage 2 and Stage 3 over time; legend: register empty / register occupied]
- Stage 1: fill F0
- Stage 2: read F0, fill F4
- Stage 3: read F4
- Stall annotations in the figure, in order: 1 stall, 2 stalls, 1 stall, 2 stalls, 1 stall, 2 stalls
29 Software Pipelining
- Software Pipelined Execution
[Figure: Stage 1, Stage 2 and Stage 3 over time; legend: register empty / register occupied]
- Stage 1: fill F0
- Stage 2: read F0, fill F4
- Stage 3: read F4
- Stall annotations in the figure, in order: 1 stall, 1 stall, 0 stalls, 1 stall, 0 stalls, 1 stall
30 Software Pipelining
- Software Pipelined Execution
  Start-up:
        L.D    F0,0(R1)    ; F0 <- x[1000]
        ADD.D  F4,F0,F2    ; F4 <- x[1000] + s
        L.D    F0,-8(R1)   ; F0 <- x[999]
  Loop: S.D    0(R1),F4    ; x[i] <- F4          (Stage 3)
        ADD.D  F4,F0,F2    ; F4 <- x[i-1] + s    (Stage 2)
        L.D    F0,-16(R1)  ; F0 <- x[i-2]        (Stage 1)
        BNE    R1,R2,Loop  ; repeat if i != 1
        DADDUI R1,R1,-8    ; i <- i - 1 (R1 decreases by 8 bytes)
- Stall annotations on this slide, in order: 1 stall, 1 stall, 0 stalls, 1 stall, 0 stalls
31 Software Pipelining
- Software Pipelined Execution
  Start-up:
        L.D    F0,0(R1)    ; F0 <- x[1000]
        ADD.D  F4,F0,F2    ; F4 <- x[1000] + s
        L.D    F0,-8(R1)   ; F0 <- x[999]
  Loop: S.D    0(R1),F4    ; x[i] <- F4          (Stage 3)
        ADD.D  F4,F0,F2    ; F4 <- x[i-1] + s    (Stage 2)
        L.D    F0,-16(R1)  ; F0 <- x[i-2]        (Stage 1)
        BNE    R1,R2,Loop  ; repeat if i != 1
        DADDUI R1,R1,-8    ; i <- i - 1 (R1 decreases by 8 bytes)
- Stall annotations on this slide, in order: 1 stall, 1 stall, 0 stalls, 0 stalls, 0 stalls
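The same restructuring can be sketched at source level. This is a hedged illustration, not the lecturers' code: the function name is made up, the array is 0-based, and the variables `f0` and `f4` stand in for registers F0 and F4. Each pass of the steady-state loop stores element i, adds for element i-1 and loads element i-2, so the three stages in one pass belong to three different elements.

```python
def add_scalar_pipelined(x, s):
    """Software-pipelined form of: for each i, x[i] = x[i] + s."""
    n = len(x)
    if n < 2:
        raise ValueError("this sketch needs at least 2 elements")
    # Start-up code: fill the pipeline.
    f0 = x[n - 1]        # Stage 1 for the last element
    f4 = f0 + s          # Stage 2 for the last element
    f0 = x[n - 2]        # Stage 1 for the element before it
    # Steady state: three stages, three different elements.
    for i in range(n - 1, 1, -1):
        x[i] = f4        # Stage 3: store element i
        f4 = f0 + s      # Stage 2: add for element i-1
        f0 = x[i - 2]    # Stage 1: load element i-2
    # Clean-up code: drain the pipeline.
    x[1] = f4
    f4 = f0 + s
    x[0] = f4


data = [float(v) for v in range(1000)]
add_scalar_pipelined(data, 2.0)
print(data[:3], data[-1])
```

Only two temporaries are needed, the same as in the original loop, and the loop still runs once per element, so the control overhead per element is unchanged.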
32 Software Pipelining
- No stalls inside loop
- Additional start-up (and clean-up) code
- No reduction of control overhead
- No additional registers
- To simplify processor hardware: sophisticated compilers (loop unrolling, software pipelining etc.)
- Extreme form: Very Long Instruction Word
[Figure labels: Hardware; Grouping; Execution Unit Assignment; Initiation; Execution Units]
- Suppose 4 functional units
- Memory load unit
- Floating point unit
- Memory store unit
- Integer/Branch unit
- Instruction: Memory load | FP operation | Memory store
- Original (non-unrolled) loop
  Loop: L.D    F0,0(R1)    ; F0 <- x[i]
        (1 stall)
        ADD.D  F4,F0,F2    ; F4 <- x[i] + s
        (2 stalls)
        S.D    0(R1),F4    ; x[i] <- x[i] + s
        DADDUI R1,R1,-8    ; i <- i - 1
        (1 stall)
        BNE    R1,R2,Loop  ; repeat if i != 0
        NOP                ; branch delay slot
- Instruction: Memory load | FP operation | Memory store
- Limit stall cycles by clever compilers (loop unrolling, software pipelining)
39 Dynamic VLIW
- VLIW: no caches because no hardware to deal with cache misses
- Dynamic VLIW: hardware to stall on a cache miss
- Not used frequently
- Dynamic VLIW
- Explicitly Parallel Instruction Computing (EPIC)
[Figure labels: Execution Units; Execution Unit Assignment]
- IA-64 architecture by HP and Intel
- IA-64 is an instruction set architecture intended for implementation on EPIC
- Itanium is first Intel product
- 64-bit architecture
- Basic concepts
- Instruction level parallelism indicated by compiler
- Long or very long instruction words
- Branch predication (not the same as branch prediction)
- Speculative loading
42 Key Features
- Large number of registers
- IA-64 instruction format assumes 256
- 128 64-bit integer, logical and general purpose
- 128 82-bit floating point and graphic
- 64 1-bit predicated execution registers (see later)
- To support high degree of parallelism
- Multiple execution units
- Expected to be 8 or more
- Depends on number of transistors available
- Execution of parallel instructions depends on hardware available
- 8 parallel instructions may be split into two lots of four if only four execution units are available
43 IA-64 Execution Units
- I-Unit
- Integer arithmetic
- Shift and add
- Logical
- Compare
- Integer multimedia ops
- M-Unit
- Load and store
- Between register and memory
- Some integer ALU
- B-Unit
- Branch instructions
- F-Unit
- Floating point instructions
44 Instruction Format Diagram
[Figure: instruction format diagram]
45 Instruction Format
- 128 bit bundle
- Holds three instructions (syllables) plus template
- Can fetch one or more bundles at a time
- Template contains info on which instructions can be executed in parallel
- Not confined to single bundle
- e.g. a stream of 8 instructions may be executed in parallel
- Compiler will have re-ordered instructions to form contiguous bundles
- Can mix dependent and independent instructions in same bundle
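The bundle layout can be illustrated with a few bit operations. The field widths used below (a 5-bit template in the low bits followed by three 41-bit instruction slots, 5 + 3 x 41 = 128) come from the IA-64 architecture definition rather than from the slide text, and the function names are made up for this sketch.

```python
TEMPLATE_BITS = 5   # template field, lowest bits of the bundle
SLOT_BITS = 41      # each of the three instruction slots
SLOTS = 3


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != SLOTS:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for n, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("instruction does not fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + n * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(bundle >> (TEMPLATE_BITS + n * SLOT_BITS)) & mask
             for n in range(SLOTS)]
    return template, slots


b = pack_bundle(0x10, [1, 2, 3])
print(b.bit_length() <= 128, unpack_bundle(b))
```

The slot values here are arbitrary placeholders, not real instruction encodings.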
46 Assembly Language Format
- [qp] mnemonic[.comp] dest = srcs // comment
- qp - predicate register
- 1 at execution: execute and commit result to hardware
- 0: result is discarded
- mnemonic - name of instruction
- comp - one or more instruction completers used to qualify mnemonic
- dest - one or more destination operands
- srcs - one or more source operands
- // - comment
- Instruction groups and stops indicated by ;;
- Instruction group: sequence without read after write or write after write
- Do not need hardware register dependency checks
47 Assembly Examples
- ld8 r1 = [r5] ;; //first group
- add r3 = r1, r4 //second group
- Second instruction depends on value in r1
- Changed by first instruction
- Cannot be in same group for parallel execution
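The grouping rule can be expressed as a small check that a compiler-like tool might run. This is a simplified sketch: the representation of an instruction as `(destination registers, source registers)` and the function name are made up, and the real IA-64 rules include exceptions that are not modelled here.

```python
def can_share_group(instructions):
    """True if no instruction reads (RAW) or rewrites (WAW) a register
    written by an earlier instruction in the same candidate group."""
    written = set()
    for dests, sources in instructions:
        if written & set(sources):   # read after write
            return False
        if written & set(dests):     # write after write
            return False
        written |= set(dests)
    return True


ld8 = (["r1"], ["r5"])        # ld8 r1 = [r5]
add = (["r3"], ["r1", "r4"])  # add r3 = r1, r4
other = (["r7"], ["r8", "r9"])

print(can_share_group([ld8, add]))    # add reads r1 written by ld8
print(can_share_group([ld8, other]))  # independent instructions
```

When the check fails, a stop (`;;`) is needed between the two instructions, as in the example above.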
- Pseudo code:
  if a == 0 then j = j + 1 else k = k + 1
- Using branches:
        cmp a,0
        jne L1
        add j,1
        jmp L2
  L1:   add k,1
  L2:
- Using predicates:
  if a == 0 then p1 = 1 and p2 = 0, else p1 = 0 and p2 = 1
        cmp.eq p1, p2 = 0, a
  (p1)  add j = 1, j
  (p2)  add k = 1, k
Should NOT be there to enable parallelism
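The difference between the branching and the predicated version can be mimicked in a few lines. This is an illustrative sketch only, with made-up function names: both additions are computed unconditionally, and the predicates decide which result is committed, which is the behaviour described for qp = 1 (commit) and qp = 0 (discard).

```python
def with_branches(a, j, k):
    """Control flow decides which addition runs."""
    if a == 0:
        j = j + 1
    else:
        k = k + 1
    return j, k


def with_predication(a, j, k):
    """Both additions run; the predicates decide which result is kept."""
    p1 = (a == 0)           # cmp.eq p1, p2 = 0, a
    p2 = not p1
    j_result = j + 1        # (p1) add j = 1, j
    k_result = k + 1        # (p2) add k = 1, k
    j = j_result if p1 else j   # commit only if p1 is 1
    k = k_result if p2 else k   # commit only if p2 is 1
    return j, k


for a in (0, 5):
    print(with_branches(a, 10, 20), with_predication(a, 10, 20))
```

Because the two predicated additions do not depend on each other, they can be issued in parallel, and no branch has to be predicted.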
50 Speculative Loading
51 Data Speculation
- Original code:
        st8 [r4] = r12
        ld8 r6 = [r8]
        add r5 = r6, r7
        st8 [r18] = r5
- What if r4 contains the same address as r8?
- With data speculation:
        ld8.a r6 = [r8]     // advanced load
        st8 [r4] = r12
        ld8.c r6 = [r8]     // check load
        add r5 = r6, r7
        st8 [r18] = r5
- Advanced load: writes source address (contents of r8) to Advanced Load Address Table (ALAT)
- Each store checks ALAT and removes entry if match
- Check load: if no matching entry in ALAT, load is performed again
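The interplay of advanced load, store and check load can be simulated with a toy model. This is a simplified sketch, not a description of the real hardware: the class and method names are made up, memory is a dictionary from address to value, and the ALAT is modelled as a map from target register to load address.

```python
class ToyMachine:
    """Toy model of data speculation with an Advanced Load Address Table."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.regs = {}
        self.alat = {}  # target register -> address of the advanced load

    def advanced_load(self, reg, addr):   # ld8.a reg = [addr]
        self.regs[reg] = self.memory[addr]
        self.alat[reg] = addr             # remember the source address

    def store(self, addr, value):         # st8 [addr] = value
        self.memory[addr] = value
        # Every store checks the ALAT and removes matching entries.
        for reg in [r for r, a in self.alat.items() if a == addr]:
            del self.alat[reg]

    def check_load(self, reg, addr):      # ld8.c reg = [addr]
        if self.alat.get(reg) != addr:    # entry gone: redo the load
            self.regs[reg] = self.memory[addr]
            return True
        return False                      # speculation was safe


m = ToyMachine({0x100: 5, 0x200: 7})
m.advanced_load("r6", 0x100)
m.store(0x100, 99)                        # same address as the load
print(m.check_load("r6", 0x100), m.regs["r6"])
```

If the store had gone to a different address, the ALAT entry would have survived and the check load would have kept the early-loaded value.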
52 Control & Data Speculation
- Control Speculation
- AKA Speculative loading
- Load data from memory before needed
- Data Speculation
- Load moved before store that might alter memory location
- Subsequent check in value