Embedded Computer Architectures

Slides: 53

Provided by: Adri229


1
Embedded Computer Architectures

Hennessy & Patterson
Chapter 4
Exploiting ILP with Software Approaches
Gerard Smit (Zilverling 4102), smit_at_cs.utwente.nl
André Kokkeler (Zilverling 4096), kokkeler_at_utwente.nl

2
Contents

Introduction
Processor Architecture
Loop Unrolling
Software Pipelining

3
Introduction
4
Processor Architecture

5 stage pipeline
Static scheduling
Integer and Floating Point unit

5
Processor Architecture

Latencies

Integer ALU -> Integer ALU: no latency
Floating point ALU -> Floating point ALU: latency 3
6
Processor Architecture

Latencies

Load Memory -> Store Memory: no latency
7
Processor Architecture

Latencies

Integer ALU -> Store Memory: no latency
Floating point ALU -> Store Memory: latency 2
8
Processor Architecture

Latencies

Load Memory -> Integer ALU: latency 1
Load Memory -> Floating point ALU: latency 1
9
Processor Architecture

Latencies

Integer ALU -> Branch: latency 1
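
The latency slides above can be collected into one lookup table. The following is a minimal Python sketch, not part of the original deck: the class names (`int_alu`, `fp_alu`, `load`, `store`, `branch`) and the function `stalls_between` are illustrative, and producer/consumer pairs that the slides do not list are assumed to need no stall.

```python
# Stall cycles required between a producing and a consuming instruction,
# taken from the latency slides. The key names are illustrative.
LATENCY = {
    ("int_alu", "int_alu"): 0,
    ("fp_alu", "fp_alu"): 3,
    ("load", "store"): 0,
    ("int_alu", "store"): 0,
    ("fp_alu", "store"): 2,
    ("load", "int_alu"): 1,
    ("load", "fp_alu"): 1,
    ("int_alu", "branch"): 1,
}


def stalls_between(producer: str, consumer: str) -> int:
    """Stall cycles needed when `consumer` directly follows `producer`.

    Assumption: pairs that are not listed on the slides need no stall.
    """
    return LATENCY.get((producer, consumer), 0)


if __name__ == "__main__":
    print(stalls_between("load", "fp_alu"))   # L.D result used by ADD.D
    print(stalls_between("fp_alu", "store"))  # ADD.D result used by S.D
```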
10
Loop Unrolling

for i = 1000 downto 1 do x[i] = x[i] + s

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      DADDUI R1,R1,-8    ; i <- i-1
      BNE    R1,R2,Loop  ; repeat if i != 0
      NOP                ; branch delay slot

R1: pointer within array
F2: value to be added (s)
R2: last element in array
F0: value in array
F4: value to be written in array

11
Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      DADDUI R1,R1,-8    ; i <- i-1
      BNE    R1,R2,Loop  ; repeat if i != 0
      NOP                ; branch delay slot

Load Memory -> FP ALU: 1 stall
12
Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      stall
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      DADDUI R1,R1,-8    ; i <- i-1
      BNE    R1,R2,Loop  ; repeat if i != 0
      NOP                ; branch delay slot

FP ALU -> Store Memory: 2 stalls
13
Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      stall
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      stall
      stall
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      DADDUI R1,R1,-8    ; i <- i-1
      BNE    R1,R2,Loop  ; repeat if i != 0
      NOP                ; branch delay slot

Integer ALU -> Branch: 1 stall
14
Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      stall
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      stall
      stall
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      DADDUI R1,R1,-8    ; i <- i-1
      stall
      BNE    R1,R2,Loop  ; repeat if i != 0
      NOP                ; branch delay slot

Smart compiler
15
Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      DADDUI R1,R1,-8    ; i <- i-1
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      stall
      BNE    R1,R2,Loop  ; repeat if i != 0
      S.D    8(R1),F4    ; x[i] <- x[i] + s

Integer ALU -> Branch: 1 stall
From 10 cycles per loop to 6 cycles per loop
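
The cycle counts can be checked with a small in-order issue model. This is a sketch under stated assumptions, not the authors' tool: `Instr`, `LATENCY` and `cycles_per_iteration` are illustrative names, one instruction issues per cycle, unlisted producer/consumer pairs need no stall, and dependencies carried from the previous iteration are ignored. The model places each stall directly before the dependent instruction, so the position of a stall can differ from the slide; only the totals are compared.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    kind: str                 # "load", "fp_alu", "store", "int_alu", "branch", "nop"
    dest: Optional[str]       # register written, if any
    srcs: Tuple[str, ...]     # registers read


# Stall cycles between producer kind and consumer kind (from the latency slides).
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "int_alu"): 1,
    ("load", "fp_alu"): 1,
    ("int_alu", "branch"): 1,
}


def cycles_per_iteration(body: List[Instr]) -> int:
    """Cycles for one pass over `body`, issuing in order, one instruction per cycle."""
    produced: Dict[str, Tuple[int, str]] = {}  # register -> (issue cycle, producer kind)
    cycle = 0
    for ins in body:
        issue = cycle + 1
        for reg in ins.srcs:
            if reg in produced:
                when, kind = produced[reg]
                issue = max(issue, when + 1 + LATENCY.get((kind, ins.kind), 0))
        cycle = issue
        if ins.dest is not None:
            produced[ins.dest] = (cycle, ins.kind)
    return cycle


unscheduled = [
    Instr("L.D F0,0(R1)", "load", "F0", ("R1",)),
    Instr("ADD.D F4,F0,F2", "fp_alu", "F4", ("F0", "F2")),
    Instr("S.D 0(R1),F4", "store", None, ("R1", "F4")),
    Instr("DADDUI R1,R1,-8", "int_alu", "R1", ("R1",)),
    Instr("BNE R1,R2,Loop", "branch", None, ("R1", "R2")),
    Instr("NOP", "nop", None, ()),
]

scheduled = [
    Instr("L.D F0,0(R1)", "load", "F0", ("R1",)),
    Instr("DADDUI R1,R1,-8", "int_alu", "R1", ("R1",)),
    Instr("ADD.D F4,F0,F2", "fp_alu", "F4", ("F0", "F2")),
    Instr("BNE R1,R2,Loop", "branch", None, ("R1", "R2")),
    Instr("S.D 8(R1),F4", "store", None, ("R1", "F4")),
]

if __name__ == "__main__":
    print(cycles_per_iteration(unscheduled))  # 10
    print(cycles_per_iteration(scheduled))    # 6
```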
16
Loop Unrolling

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      DADDUI R1,R1,-8    ; i <- i-1
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      BNE    R1,R2,Loop  ; repeat if i != 0
      S.D    8(R1),F4    ; x[i] <- x[i] + s

5 instructions
3 doing the job
2 control or overhead
Reduce overhead -> loop unrolling
Add code
From 1000 iterations to 500 iterations
17
Loop Unrolling

Original Code Sequence

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      DADDUI R1,R1,-8    ; i <- i-1
      BNE    R1,R2,Loop  ; repeat if i != 0
      NOP                ; branch delay slot

Copy this part, with correct data pointer
18
Loop Unrolling

Unrolled Code Sequence

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      L.D    F0,-8(R1)   ; F0 <- x[i-1]
      ADD.D  F4,F0,F2    ; F4 <- x[i-1] + s
      S.D    -8(R1),F4   ; x[i-1] <- x[i-1] + s
      DADDUI R1,R1,-16   ; i <- i-2
      BNE    R1,R2,Loop  ; repeat if i != 0
      NOP                ; branch delay slot

There are still a lot of stalls. Removing them is easier if some additional registers are used.

Stall annotations (in slide order): 1 stall, 2 stalls, 1 stall, 2 stalls, 1 stall
19
Loop Unrolling

Unrolled Code Sequence

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      L.D    F6,-8(R1)   ; F6 <- x[i-1]
      ADD.D  F8,F6,F2    ; F8 <- x[i-1] + s
      S.D    -8(R1),F8   ; x[i-1] <- x[i-1] + s
      DADDUI R1,R1,-16   ; i <- i-2
      BNE    R1,R2,Loop  ; repeat if i != 0
      NOP                ; branch delay slot

Stall annotations (in slide order): 1 stall, 2 stalls, 1 stall, 2 stalls, 1 stall
20
Loop Unrolling

Unrolled Code Sequence

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      L.D    F6,-8(R1)   ; F6 <- x[i-1]
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      ADD.D  F8,F6,F2    ; F8 <- x[i-1] + s
      S.D    -8(R1),F8   ; x[i-1] <- x[i-1] + s
      DADDUI R1,R1,-16   ; i <- i-2
      BNE    R1,R2,Loop  ; repeat if i != 0
      NOP                ; branch delay slot

Stall annotations (in slide order): 1 stall, 1 stall, 2 stalls, 1 stall
21
Loop Unrolling

Unrolled Code Sequence

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      L.D    F6,-8(R1)   ; F6 <- x[i-1]
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      ADD.D  F8,F6,F2    ; F8 <- x[i-1] + s
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      S.D    -8(R1),F8   ; x[i-1] <- x[i-1] + s
      DADDUI R1,R1,-16   ; i <- i-2
      BNE    R1,R2,Loop  ; repeat if i != 0
      NOP                ; branch delay slot

Slide annotations (in slide order): 16, 8, 2 stalls, 1 stall
22
Loop Unrolling

Unrolled Code Sequence

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      L.D    F6,-8(R1)   ; F6 <- x[i-1]
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      ADD.D  F8,F6,F2    ; F8 <- x[i-1] + s
      DADDUI R1,R1,-16   ; i <- i-2
      S.D    16(R1),F4   ; x[i] <- x[i] + s
      S.D    8(R1),F8    ; x[i-1] <- x[i-1] + s
      BNE    R1,R2,Loop  ; repeat if i != 0
      NOP                ; branch delay slot

23
Loop Unrolling

Unrolled Code Sequence

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      L.D    F6,-8(R1)   ; F6 <- x[i-1]
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      ADD.D  F8,F6,F2    ; F8 <- x[i-1] + s
      DADDUI R1,R1,-16   ; i <- i-2
      S.D    16(R1),F4   ; x[i] <- x[i] + s
      BNE    R1,R2,Loop  ; repeat if i != 0
      S.D    8(R1),F8    ; x[i-1] <- x[i-1] + s

24
Loop Unrolling

In example: loop-unrolling factor 2
In general: loop-unrolling factor k
Limitations concerning k
Amdahl's law: 3000 cycles are always needed
Increasing k -> increasing number of registers
Increasing k -> increasing code size
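
The Amdahl's-law limit can be made concrete with a simple cost model. This is a sketch under stated assumptions, not a result from the deck: the function name `total_cycles` is illustrative, the array length n is assumed to be a multiple of k, and for k >= 2 the scheduled body is assumed to run without stalls (3 useful instructions per element plus 2 overhead instructions per iteration, as in the 8-instruction body for k = 2). For k = 1 the scheduled loop takes 6 cycles per iteration, as stated earlier.

```python
def total_cycles(n: int, k: int) -> int:
    """Estimated cycles to process n elements with unrolling factor k."""
    if k < 1 or n % k != 0:
        raise ValueError("this sketch assumes k >= 1 and n divisible by k")
    if k == 1:
        return 6 * n                 # 5 instructions + 1 stall per element
    # Assumption: no stalls remain once the unrolled body is scheduled.
    return (n // k) * (3 * k + 2)    # 3 per element + 2 overhead per iteration


if __name__ == "__main__":
    for k in (1, 2, 4, 8, 1000):
        print(k, total_cycles(1000, k))
```

Under this model the total is 3000 + 2000/k cycles for k >= 2, so the 3000 cycles of useful work remain no matter how far the loop is unrolled.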

25
Software Pipelining

Original unrolled loop

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      DADDUI R1,R1,-8    ; i <- i-1
      BNE    R1,R2,Loop  ; repeat if i != 0
      NOP                ; branch delay slot

Three actions involved with actual calculations:
F0 <- x[i]
F4 <- x[i] + s
x[i] <- x[i] + s
Consider these as three different stages

Stall annotations (in slide order): 1 stall, 2 stalls, 1 stall
26
Software Pipelining

Original unrolled loop

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      DADDUI R1,R1,-8    ; i <- i-1
      BNE    R1,R2,Loop  ; repeat if i != 0
      NOP                ; branch delay slot

Three actions involved with actual calculations:
F0 <- x[i]: Stage 1
F4 <- x[i] + s: Stage 2
x[i] <- x[i] + s: Stage 3
Associate array element with the stages

27
Software Pipelining

Original unrolled loop

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      DADDUI R1,R1,-8    ; i <- i-1
      BNE    R1,R2,Loop  ; repeat if i != 0
      NOP                ; branch delay slot

Three actions involved with actual calculations:
F0 <- x[i]: Stage 1, x[i]
F4 <- x[i] + s: Stage 2, x[i]
x[i] <- x[i] + s: Stage 3, x[i]

28
Software Pipelining

Normal Execution

[Diagram: timeline with columns Stage 1, Stage 2, Stage 3 and registers F0 and F4, each marked empty or occupied over time]
Stage 1: fill F0
Stage 2: read F0, fill F4
Stage 3: read F4
Per element, in time order: Stage 1, 1 stall, Stage 2, 2 stalls, Stage 3, 1 stall
Elements shown: x1000, then x999, then x998
29
Software Pipelining

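Software Pipelined Execution

[Diagram: timeline with columns Stage 1, Stage 2, Stage 3 and registers F0 and F4, each marked empty or occupied over time; stages of different array elements are interleaved]
Stage 1: fill F0
Stage 2: read F0, fill F4
Stage 3: read F4
Stall annotations (in slide order): 1 stall, 1 stall, 0 stalls, 1 stall, 0 stalls, 1 stall
Elements shown: x1000, x999, x998, x997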
30
Software Pipelining

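Software Pipelined Execution

[Diagram: columns Stage 1, Stage 2, Stage 3 show which array element each instruction works on]

      L.D    F0,0(R1)    ; F0 <- x[1000]       (Stage 1, x[1000])
      ADD.D  F4,F0,F2    ; F4 <- x[1000] + s   (Stage 2, x[1000])
      L.D    F0,-8(R1)   ; F0 <- x[999]        (Stage 1, x[999])
Loop: S.D    0(R1),F4    ; x[i] <- F4          (Stage 3, x[i])
      ADD.D  F4,F0,F2    ; F4 <- x[i-1] + s    (Stage 2, x[i-1])
      L.D    F0,-16(R1)  ; F0 <- x[i-2]        (Stage 1, x[i-2])
      BNE    R1,R2,Loop  ; repeat if i != 1
      DADDUI R1,R1,-8    ; i <- i-1

Stall annotations (in slide order): 1 stall, 1 stall, 0 stalls, 1 stall, 0 stalls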
31
Software Pipelining

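Software Pipelined Execution

[Diagram: columns Stage 1, Stage 2, Stage 3 show which array element each instruction works on]

      L.D    F0,0(R1)    ; F0 <- x[1000]       (Stage 1, x[1000])
      ADD.D  F4,F0,F2    ; F4 <- x[1000] + s   (Stage 2, x[1000])
      L.D    F0,-8(R1)   ; F0 <- x[999]        (Stage 1, x[999])
Loop: S.D    0(R1),F4    ; x[i] <- F4          (Stage 3, x[i])
      ADD.D  F4,F0,F2    ; F4 <- x[i-1] + s    (Stage 2, x[i-1])
      L.D    F0,-16(R1)  ; F0 <- x[i-2]        (Stage 1, x[i-2])
      BNE    R1,R2,Loop  ; repeat if i != 1
      DADDUI R1,R1,-8    ; i <- i-1

Stall annotations (in slide order): 1 stall, 1 stall, 0 stalls, 0 stalls, 0 stalls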
32
Software Pipelining

No stalls inside loop
Additional start-up (and clean-up) code
No reduction of control overhead
No additional registers
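
The structure of a software-pipelined loop (start-up code, a kernel that mixes stages of three different elements, clean-up code) can be sketched at source level. This is an illustrative Python model, not the deck's code: the function name is hypothetical, indices are 0-based, and the variables `f0` and `f4` stand in for the two registers, which shows that no additional registers are needed.

```python
def add_scalar_software_pipelined(x, s):
    """Add s to every element of x in place, using a software-pipelined loop."""
    n = len(x)
    if n < 2:                      # too short to pipeline; do it directly
        for i in range(n):
            x[i] += s
        return x

    # Start-up code: fill the pipeline.
    f0 = x[n - 1]                  # Stage 1 for the last element
    f4 = f0 + s                    # Stage 2 for the last element
    f0 = x[n - 2]                  # Stage 1 for the next element

    # Kernel: each iteration handles three different elements.
    for i in range(n - 1, 1, -1):
        x[i] = f4                  # Stage 3 for element i
        f4 = f0 + s                # Stage 2 for element i-1
        f0 = x[i - 2]              # Stage 1 for element i-2

    # Clean-up code: drain the pipeline.
    x[1] = f4                      # Stage 3 for element 1
    f4 = f0 + s                    # Stage 2 for element 0
    x[0] = f4                      # Stage 3 for element 0
    return x


if __name__ == "__main__":
    print(add_scalar_software_pipelined([1.0, 2.0, 3.0, 4.0], 10.0))
    # [11.0, 12.0, 13.0, 14.0]
```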

33
VLIW

To simplify processor hardware: sophisticated compilers (loop unrolling, software pipelining, etc.)
Extreme form: Very Long Instruction Word processors

34
VLIW

[Diagram: Superscalar vs. VLIW, from instructions to execution units]
Superscalar hardware: Grouping, Execution Unit Assignment, Initiation
35
VLIW

Suppose 4 functional units
Memory load unit
Floating point unit
Memory store unit
Integer/Branch unit

Instruction:
| Memory load | FP operation | Memory store | Integer/Branch |
36
VLIW

Original unrolled loop

Loop: L.D    F0,0(R1)    ; F0 <- x[i]
      ADD.D  F4,F0,F2    ; F4 <- x[i] + s
      S.D    0(R1),F4    ; x[i] <- x[i] + s
      DADDUI R1,R1,-8    ; i <- i-1
      BNE    R1,R2,Loop  ; repeat if i != 0
      NOP                ; branch delay slot

Stall annotations (in slide order): 1 stall, 2 stalls, 1 stall

| Memory load | FP operation | Memory store | Integer/Branch |
| L.D         |              |              |                |
| stall                                                      |
|             | ADD.D        |              |                |
| stall                                                      |
| stall                                                      |
|             |              | S.D          |                |

Limit stall cycles by clever compilers (loop unrolling, software pipelining)
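
How poorly the unscheduled loop fills the VLIW instruction words can be quantified. The following Python sketch models only the six words shown in the table above (the integer/branch operations are not shown there, so they are left out here as well); the slot names and the function `utilization` are illustrative.

```python
SLOTS = ("load", "fp", "store", "int_branch")

# One VLIW instruction word per entry; a missing slot is an empty slot.
# An empty dict is a stall cycle in which no unit does useful work.
schedule = [
    {"load": "L.D"},
    {},
    {"fp": "ADD.D"},
    {},
    {},
    {"store": "S.D"},
]


def utilization(words):
    """Fraction of operation slots that hold an instruction."""
    if not words:
        return 0.0
    used = sum(1 for word in words for slot in SLOTS if word.get(slot))
    return used / (len(words) * len(SLOTS))


if __name__ == "__main__":
    print(utilization(schedule))  # 0.125
```

Only 3 of the 24 slots are used, which is why loop unrolling and software pipelining are needed to fill the remaining slots with independent work.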
37
VLIW

[Diagram: Superscalar vs. VLIW, from instructions to execution units]
Superscalar hardware: Grouping, Execution Unit Assignment, Initiation
38
VLIW

[Diagram: Superscalar vs. Dynamic VLIW, from instructions to execution units]
Superscalar hardware: Grouping, Execution Unit Assignment, Initiation
Dynamic VLIW hardware: Initiation
39
Dynamic VLIW

VLIW: no caches, because there is no hardware to deal with cache misses
Dynamic VLIW: hardware to stall on a cache miss
Not used frequently

40
VLIW

[Diagram: Dynamic VLIW vs. Explicitly Parallel Instruction Computing (EPIC), from instructions to execution units]
Dynamic VLIW hardware: Initiation
EPIC hardware: Execution Unit Assignment, Initiation
41
EPIC

IA-64 architecture by HP and Intel
IA-64 is an instruction set architecture intended for implementation on EPIC
Itanium is the first Intel product
64-bit architecture
Basic concepts
Instruction level parallelism indicated by compiler
Long or very long instruction words
Branch predication (not the same as prediction)
Speculative loading

42
Key Features

Large number of registers
IA-64 instruction format assumes 256
128 * 64-bit integer, logical & general purpose
128 * 82-bit floating point and graphic
64 * 1-bit predicated execution registers (see later)
To support high degree of parallelism
Multiple execution units
Expected to be 8 or more
Depends on number of transistors available
Execution of parallel instructions depends on hardware available
8 parallel instructions may be split into two lots of four if only four execution units are available
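
The last point can be illustrated with a few lines of Python. This is only a sketch of the idea, with an illustrative function name: a run of independent instructions is cut into lots no larger than the number of available execution units, ignoring the fact that real units are specialized (I, M, B, F).

```python
def split_into_lots(instructions, units):
    """Split a run of independent instructions into lots of at most `units`."""
    if units < 1:
        raise ValueError("at least one execution unit is required")
    return [instructions[i:i + units] for i in range(0, len(instructions), units)]


if __name__ == "__main__":
    group = ["i%d" % n for n in range(8)]
    print(split_into_lots(group, 8))  # one lot of eight
    print(split_into_lots(group, 4))  # two lots of four
```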

43
IA-64 Execution Units

I-Unit
Integer arithmetic
Shift and add
Logical
Compare
Integer multimedia ops
M-Unit
Load and store
Between register and memory
Some integer ALU
B-Unit
Branch instructions
F-Unit
Floating point instructions

44
Instruction Format Diagram
45
Instruction Format

128 bit bundle
Holds three instructions (syllables) plus
template
Can fetch one or more bundles at a time
Template contains info on which instructions can
be executed in parallel
Not confined to single bundle
e.g. a stream of 8 instructions may be executed
in parallel
Compiler will have re-ordered instructions to
form contiguous bundles
Can mix dependent and independent instructions in
same bundle
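
The bundle layout can be sketched as bit packing. The slide only states that a 128-bit bundle holds three instructions plus a template; the field widths used below (a 5-bit template in the low bits followed by three 41-bit instruction slots) come from the IA-64 architecture definition, not from the slide, and the function names are illustrative.

```python
TEMPLATE_BITS = 5    # IA-64 field width, not stated on the slide
SLOT_BITS = 41       # IA-64 field width, not stated on the slide
SLOTS_PER_BUNDLE = 3


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for n, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("instruction does not fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + n * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Inverse of pack_bundle: returns (template, (slot0, slot1, slot2))."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + n * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for n in range(SLOTS_PER_BUNDLE)
    )
    return template, slots


if __name__ == "__main__":
    b = pack_bundle(0x10, (1, 2, 3))
    print(unpack_bundle(b))                               # (16, (1, 2, 3))
    print(TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS)   # 128
```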

46
Assembly Language Format

[qp] mnemonic[.comp] dest = srcs //
qp - predicate register
1 at execution: execute and commit result to hardware
0: result is discarded
mnemonic - name of instruction
comp - one or more instruction completers used to qualify mnemonic
dest - one or more destination operands
srcs - one or more source operands
// - comment
Instruction groups and stops indicated by ;;
Sequence without read after write or write after write
Do not need hardware register dependency checks

47
Assembly Examples

ld8 r1 = [r5] ;;   //first group
add r3 = r1, r4    //second group
Second instruction depends on value in r1
Changed by first instruction
Cannot be in same group for parallel execution
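
The grouping rule (no read after write and no write after write inside a group) can be sketched as a small pass that decides where a stop is needed. This is a simplified Python illustration with hypothetical names; each instruction is reduced to a destination register and a tuple of source registers, and the special cases that the real IA-64 rules allow are ignored.

```python
def split_into_groups(instructions):
    """Split (dest, srcs) pairs into instruction groups.

    A new group starts whenever an instruction reads (RAW) or writes (WAW)
    a register that was already written in the current group.
    """
    groups, current, written = [], [], set()
    for dest, srcs in instructions:
        raw = any(src in written for src in srcs)
        waw = dest in written
        if current and (raw or waw):
            groups.append(current)          # a stop goes here
            current, written = [], set()
        current.append((dest, srcs))
        if dest is not None:
            written.add(dest)
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    program = [
        ("r1", ("r5",)),        # ld8 r1 = [r5]
        ("r3", ("r1", "r4")),   # add r3 = r1, r4  (reads r1: needs a new group)
    ]
    print(len(split_into_groups(program)))  # 2
```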

48
Predication
49
Predication
Pseudo code:
if a == 0 then j = j+1 else k = k+1

Using branches:
      cmp a,0
      jne L1
      add j,1
      jmp L2
L1:   add k,1
L2:

Predicated:
If a == 0 then p1 = 1 and p2 = 0, else p1 = 0 and p2 = 1
      cmp.eq p1, p2 = 0, a
 (p1) add j = 1, j
 (p2) add k = 1, k

Should NOT be there to enable parallelism
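
The effect of predication can be modelled in Python. This is a sketch with illustrative function names, not IA-64 semantics in full: the compare sets two complementary predicates, both additions are computed, and each result is committed only if its predicate is 1, so no branch is needed.

```python
def with_branches(a, j, k):
    """Reference version using control flow."""
    if a == 0:
        j = j + 1
    else:
        k = k + 1
    return j, k


def predicated(a, j, k):
    """Branch-free version: compute both results, commit under a predicate."""
    p1 = (a == 0)            # cmp.eq p1, p2 = 0, a
    p2 = not p1
    j_result = j + 1         # (p1) add j = 1, j
    k_result = k + 1         # (p2) add k = 1, k
    j = j_result if p1 else j    # result is discarded when p1 is 0
    k = k_result if p2 else k    # result is discarded when p2 is 0
    return j, k


if __name__ == "__main__":
    print(predicated(0, 5, 7))  # (6, 7)
    print(predicated(3, 5, 7))  # (5, 8)
```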
50
Speculative Loading
51
Data Speculation
st8 [r4] = r12
ld8 r6 = [r8]
add r5 = r6, r7
st8 [r18] = r5
stall
What if r4 contains same address as r8?
Advanced load writes source address (contents of r8) to Advanced Load Address Table (ALAT)
Each store checks ALAT and removes entry if match

ld8.a r6 = [r8]    // advanced load
st8 [r4] = r12
ld8.c r6 = [r8]    // check load
add r5 = r6, r7
st8 [r18] = r5
If no matching entry in ALAT: load is performed again
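
The interplay of advanced load, store and check load can be modelled in a few lines. This is a simplified Python sketch, not a description of real hardware: the class and method names are hypothetical, the ALAT is modelled as a dictionary from target register to load address, and a counter records how often the check load has to repeat the load.

```python
class Machine:
    """Toy model of data speculation with an Advanced Load Address Table."""

    def __init__(self, memory):
        self.mem = dict(memory)
        self.reg = {}
        self.alat = {}       # target register -> address of the advanced load
        self.reloads = 0

    def ld_a(self, dst, addr):
        """Advanced load (ld8.a): load early and record the address in the ALAT."""
        self.reg[dst] = self.mem[addr]
        self.alat[dst] = addr

    def st(self, addr, value):
        """Store (st8): write memory and remove any ALAT entry with this address."""
        self.mem[addr] = value
        for r in [r for r, a in self.alat.items() if a == addr]:
            del self.alat[r]

    def ld_c(self, dst, addr):
        """Check load (ld8.c): repeat the load only if the ALAT entry is gone."""
        if self.alat.get(dst) != addr:
            self.reg[dst] = self.mem[addr]
            self.reloads += 1


def run(r4, r8):
    """Run the slide's sequence with r7 = 1 and r12 = 55 (illustrative values)."""
    m = Machine({100: 7, 200: 9})
    m.ld_a("r6", r8)                 # ld8.a r6 = [r8]
    m.st(r4, 55)                     # st8 [r4] = r12
    m.ld_c("r6", r8)                 # ld8.c r6 = [r8]
    r5 = m.reg["r6"] + 1             # add r5 = r6, r7
    return r5, m.reloads


if __name__ == "__main__":
    print(run(r4=100, r8=200))  # no alias: (10, 0)
    print(run(r4=100, r8=100))  # r4 and r8 alias: (56, 1)
```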
52
Control & Data Speculation