Embedded Computer Architectures


1
Embedded Computer Architectures
  • Hennessy & Patterson
  • Chapter 4
  • Exploiting ILP with Software Approaches
  • Gerard Smit (Zilverling 4102), smit@cs.utwente.nl
  • André Kokkeler (Zilverling 4096), kokkeler@utwente.nl

2
Contents
  • Introduction
  • Processor Architecture
  • Loop Unrolling
  • Software Pipelining

3
Introduction
4
Processor Architecture
  • 5-stage pipeline
  • Static scheduling
  • Integer and floating-point units

5
Processor Architecture
  • Latencies

Integer ALU → Integer ALU: no latency
Floating-point ALU → Floating-point ALU: latency 3
6
Processor Architecture
  • Latencies

Load memory → Store memory: no latency
7
Processor Architecture
  • Latencies

Integer ALU → Store memory: no latency
Floating-point ALU → Store memory: latency 2
8
Processor Architecture
  • Latencies

Load memory → Integer ALU: latency 1
Load memory → Floating-point ALU: latency 1
9
Processor Architecture
  • Latencies

Integer ALU → Branch: latency 1
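The latencies on slides 5-9 can be collected into a single lookup table. A minimal sketch in C (the enum and the function are illustrative, not part of the slides); the value returned is the number of stall cycles needed when the consumer directly follows the producer:

    #include <stdio.h>

    /* Producer/consumer unit classes used on the latency slides. */
    enum unit { INT_ALU, FP_ALU, LOAD, STORE, BRANCH };

    /* Stall cycles between a dependent, back-to-back producer/consumer pair,
       as listed on slides 5-9; -1 marks pairs the slides do not cover. */
    int latency(enum unit producer, enum unit consumer) {
        if (producer == INT_ALU && consumer == INT_ALU) return 0;
        if (producer == FP_ALU  && consumer == FP_ALU)  return 3;
        if (producer == LOAD    && consumer == STORE)   return 0;
        if (producer == INT_ALU && consumer == STORE)   return 0;
        if (producer == FP_ALU  && consumer == STORE)   return 2;
        if (producer == LOAD    && consumer == INT_ALU) return 1;
        if (producer == LOAD    && consumer == FP_ALU)  return 1;
        if (producer == INT_ALU && consumer == BRANCH)  return 1;
        return -1; /* combination not covered by the slides */
    }

    int main(void) {
        printf("FP ALU -> Store: %d stall(s)\n", latency(FP_ALU, STORE));
        return 0;
    }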
10
Loop Unrolling
  • for i = 1000 downto 1 do x[i] = x[i] + s
  • Loop: L.D    F0,0(R1)    ; F0 ← x[i]
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s
          S.D    0(R1),F4    ; x[i] ← x[i] + s
          DADDUI R1,R1,-8    ; i ← i - 1
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          NOP                ; branch delay slot
  • R1: pointer into the array; F2: value to be added (s); R2: last element in the array; F0: value read from the array; F4: value to be written to the array
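For reference, the same computation at the source level (a hypothetical C rendering of the pseudocode above, with the array indexed 1..1000 as on the slide):

    #define N 1000

    /* x[1..N] is updated in place; s is the scalar added to every element. */
    void add_scalar(double x[N + 1], double s) {
        for (int i = N; i >= 1; i--) {
            x[i] = x[i] + s;   /* one load, one FP add, one store per element */
        }
    }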

11
Loop Unrolling
  • Loop: L.D    F0,0(R1)    ; F0 ← x[i]
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s
          S.D    0(R1),F4    ; x[i] ← x[i] + s
          DADDUI R1,R1,-8    ; i ← i - 1
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          NOP                ; branch delay slot

Load memory → FP ALU: 1 stall (between L.D and ADD.D)
12
Loop Unrolling
  • Loop: L.D    F0,0(R1)    ; F0 ← x[i]
          stall
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s
          S.D    0(R1),F4    ; x[i] ← x[i] + s
          DADDUI R1,R1,-8    ; i ← i - 1
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          NOP                ; branch delay slot

FP ALU → Store memory: 2 stalls (between ADD.D and S.D)
13
Loop Unrolling
  • Loop: L.D    F0,0(R1)    ; F0 ← x[i]
          stall
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s
          stall
          stall
          S.D    0(R1),F4    ; x[i] ← x[i] + s
          DADDUI R1,R1,-8    ; i ← i - 1
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          NOP                ; branch delay slot

Integer ALU → Branch: 1 stall (between DADDUI and BNE)
14
Loop Unrolling
  • Loop: L.D    F0,0(R1)    ; F0 ← x[i]
          stall
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s
          stall
          stall
          S.D    0(R1),F4    ; x[i] ← x[i] + s
          DADDUI R1,R1,-8    ; i ← i - 1
          stall
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          NOP                ; branch delay slot

A smart compiler can reschedule the instructions to hide these stalls
15
Loop Unrolling
  • Loop: L.D    F0,0(R1)    ; F0 ← x[i]
          DADDUI R1,R1,-8    ; i ← i - 1
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s
          stall
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          S.D    8(R1),F4    ; x[i] ← x[i] + s (branch delay slot, offset adjusted for the earlier DADDUI)

One stall remains: two cycles must separate ADD.D from the S.D in the delay slot (FP ALU → Store, latency 2)
From 10 cycles per iteration (6 instructions + 4 stalls) down to 6 cycles per iteration (5 instructions + 1 stall)
16
Loop Unrolling
  • Loop: L.D    F0,0(R1)    ; F0 ← x[i]
          DADDUI R1,R1,-8    ; i ← i - 1
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          S.D    8(R1),F4    ; x[i] ← x[i] + s (branch delay slot)
  • 5 instructions
  • 3 doing the actual work
  • 2 control/overhead
  • Reduce the overhead → loop unrolling
  • Adds code
  • From 1000 iterations to 500 iterations (see the C sketch below)
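The source-level view of unrolling by a factor of 2 (a hypothetical C sketch; the compiler performs the equivalent transformation on the assembly):

    #define N 1000

    /* Unrolled by 2: half the iterations, so half the loop-control overhead.
       Assumes N is even, as on the slides (1000 elements). */
    void add_scalar_unrolled2(double x[N + 1], double s) {
        for (int i = N; i >= 2; i -= 2) {
            x[i]     = x[i]     + s;   /* first copy of the body  */
            x[i - 1] = x[i - 1] + s;   /* second copy of the body */
        }
    }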

17
Loop Unrolling
  • Original code sequence:
  • Loop: L.D    F0,0(R1)    ; F0 ← x[i]
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s
          S.D    0(R1),F4    ; x[i] ← x[i] + s
          DADDUI R1,R1,-8    ; i ← i - 1
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          NOP                ; branch delay slot

Copy the first three instructions (L.D / ADD.D / S.D), with a corrected data pointer offset, to unroll the loop
18
Loop Unrolling
  • Unrolled code sequence (same registers reused):
  • Loop: L.D    F0,0(R1)    ; F0 ← x[i]            (1 stall follows)
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s        (2 stalls follow)
          S.D    0(R1),F4    ; x[i] ← x[i] + s
          L.D    F0,-8(R1)   ; F0 ← x[i-1]          (1 stall follows)
          ADD.D  F4,F0,F2    ; F4 ← x[i-1] + s      (2 stalls follow)
          S.D    -8(R1),F4   ; x[i-1] ← x[i-1] + s
          DADDUI R1,R1,-16   ; i ← i - 2            (1 stall follows)
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          NOP                ; branch delay slot
  • There are still a lot of stalls. Removing them is easier if some additional registers are used.
19
Loop Unrolling
  • Unrolled code sequence (with additional registers F6 and F8):
  • Loop: L.D    F0,0(R1)    ; F0 ← x[i]            (1 stall follows)
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s        (2 stalls follow)
          S.D    0(R1),F4    ; x[i] ← x[i] + s
          L.D    F6,-8(R1)   ; F6 ← x[i-1]          (1 stall follows)
          ADD.D  F8,F6,F2    ; F8 ← x[i-1] + s      (2 stalls follow)
          S.D    -8(R1),F8   ; x[i-1] ← x[i-1] + s
          DADDUI R1,R1,-16   ; i ← i - 2            (1 stall follows)
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          NOP                ; branch delay slot
20
Loop Unrolling
  • Unrolled code sequence (both loads moved to the front):
  • Loop: L.D    F0,0(R1)    ; F0 ← x[i]
          L.D    F6,-8(R1)   ; F6 ← x[i-1]
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s
          S.D    0(R1),F4    ; x[i] ← x[i] + s
          ADD.D  F8,F6,F2    ; F8 ← x[i-1] + s
          S.D    -8(R1),F8   ; x[i-1] ← x[i-1] + s
          DADDUI R1,R1,-16   ; i ← i - 2
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          NOP                ; branch delay slot

The load-use latency (1) is now hidden; what remains are 2 stalls before each S.D (FP ALU → Store) and 1 stall before the BNE (Integer ALU → Branch)
21
Loop Unrolling
  • Unrolled code sequence (the two ADD.Ds grouped before the stores):
  • Loop: L.D    F0,0(R1)    ; F0 ← x[i]
          L.D    F6,-8(R1)   ; F6 ← x[i-1]
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s
          ADD.D  F8,F6,F2    ; F8 ← x[i-1] + s
          S.D    0(R1),F4    ; x[i] ← x[i] + s
          S.D    -8(R1),F8   ; x[i-1] ← x[i-1] + s
          DADDUI R1,R1,-16   ; i ← i - 2
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          NOP                ; branch delay slot

Stalls remain before the stores (FP ALU → Store) and before the BNE (Integer ALU → Branch). Moving DADDUI ahead of the stores removes them, but then the store displacements must become 16 and 8 (next slide)
22
Loop Unrolling
  • Unrolled code sequence (DADDUI moved ahead of the stores):
  • Loop: L.D    F0,0(R1)    ; F0 ← x[i]
          L.D    F6,-8(R1)   ; F6 ← x[i-1]
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s
          ADD.D  F8,F6,F2    ; F8 ← x[i-1] + s
          DADDUI R1,R1,-16   ; i ← i - 2
          S.D    16(R1),F4   ; x[i] ← x[i] + s
          S.D    8(R1),F8    ; x[i-1] ← x[i-1] + s
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          NOP                ; branch delay slot

23
Loop Unrolling
  • Unrolled code sequence (final: the second S.D fills the branch delay slot):
  • Loop: L.D    F0,0(R1)    ; F0 ← x[i]
          L.D    F6,-8(R1)   ; F6 ← x[i-1]
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s
          ADD.D  F8,F6,F2    ; F8 ← x[i-1] + s
          DADDUI R1,R1,-16   ; i ← i - 2
          S.D    16(R1),F4   ; x[i] ← x[i] + s
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          S.D    8(R1),F8    ; x[i-1] ← x[i-1] + s (branch delay slot)

24
Loop Unrolling
  • In the example: loop-unrolling factor 2
  • In general: loop-unrolling factor k
  • Limitations concerning k:
  • Amdahl's law: the 3000 cycles of actual work (3 instructions per element × 1000 elements) are always needed; unrolling only reduces the overhead
  • Increasing k → increasing number of registers
  • Increasing k → increasing code size (a source-level sketch follows below)
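To illustrate the code-size cost and what happens when the trip count is not a multiple of k, here is a hedged C sketch of unrolling by k = 4 (the factor and the cleanup loop are illustrative choices, not taken from the slides):

    #define N 1000

    /* Unrolled by 4: the loop overhead (counter update and branch) is paid
       once per four elements. The trailing loop handles the 0..3 leftover
       elements when the trip count is not a multiple of 4. */
    void add_scalar_unrolled4(double x[N + 1], double s) {
        int i = N;
        for (; i >= 4; i -= 4) {
            x[i]     += s;
            x[i - 1] += s;
            x[i - 2] += s;
            x[i - 3] += s;
        }
        for (; i >= 1; i--) {   /* cleanup loop */
            x[i] += s;
        }
    }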

25
Software Pipelining
  • Original (not unrolled) loop:
    Loop: L.D    F0,0(R1)    ; F0 ← x[i]            (1 stall follows)
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s        (2 stalls follow)
          S.D    0(R1),F4    ; x[i] ← x[i] + s
          DADDUI R1,R1,-8    ; i ← i - 1            (1 stall follows)
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          NOP                ; branch delay slot
  • Three actions are involved in the actual calculation:
    F0 ← x[i]
    F4 ← x[i] + s
    x[i] ← x[i] + s
  • Consider these as three different stages
26
Software Pipelining
  • Original (not unrolled) loop:
    Loop: L.D    F0,0(R1)    ; F0 ← x[i]
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s
          S.D    0(R1),F4    ; x[i] ← x[i] + s
          DADDUI R1,R1,-8    ; i ← i - 1
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          NOP                ; branch delay slot
  • Three actions are involved in the actual calculation:
    F0 ← x[i]          (Stage 1)
    F4 ← x[i] + s      (Stage 2)
    x[i] ← x[i] + s    (Stage 3)
  • Associate an array element with each stage
27
Software Pipelining
  • Original (not unrolled) loop:
    Loop: L.D    F0,0(R1)    ; F0 ← x[i]
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s
          S.D    0(R1),F4    ; x[i] ← x[i] + s
          DADDUI R1,R1,-8    ; i ← i - 1
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          NOP                ; branch delay slot
  • Three actions are involved in the actual calculation:
    F0 ← x[i]          (Stage 1, works on x[i])
    F4 ← x[i] + s      (Stage 2, works on x[i])
    x[i] ← x[i] + s    (Stage 3, works on x[i])

28
Software Pipelining
  • Normal execution

(Diagram: stage 1 fills F0; stage 2 reads F0 and fills F4; stage 3 reads F4.
The time axis shows when each register is empty or occupied. In normal
execution all three stages of one iteration work on the same element:
stages 1-3 for x[1000] with 1 + 2 + 1 stall cycles between them, then
stages 1-3 for x[999], then for x[998], and so on.)
29
Software Pipelining
  • Software-pipelined execution

(Diagram: same legend as the previous slide, but one iteration now combines
stages that belong to different elements: while stage 3 stores x[1000],
stage 2 is already working on x[999] and stage 1 is loading x[998]. Because
a register filled in one iteration is read only in a later one, the 2-stall
gaps disappear and at most single stalls remain.)
30
Software Pipelining
  • Software-pipelined execution

Start-up code:
          L.D    F0,0(R1)    ; F0 ← x[1000]   (1 stall before the ADD.D)
          ADD.D  F4,F0,F2    ; F4 ← x[1000] + s
          L.D    F0,-8(R1)   ; F0 ← x[999]    (1 start-up stall before the first S.D)

Loop body (stage 3 for x[i], stage 2 for x[i-1], stage 1 for x[i-2]):
    Loop: S.D    0(R1),F4    ; x[i] ← F4
          ADD.D  F4,F0,F2    ; F4 ← x[i-1] + s
          L.D    F0,-16(R1)  ; F0 ← x[i-2]
          BNE    R1,R2,Loop  ; repeat if i ≠ 1
          DADDUI R1,R1,-8    ; i ← i - 1 (branch delay slot)
31
Software Pipelining
  • Software-pipelined execution (same schedule as the previous slide)

    Loop: S.D    0(R1),F4    ; x[i] ← F4
          ADD.D  F4,F0,F2    ; F4 ← x[i-1] + s
          L.D    F0,-16(R1)  ; F0 ← x[i-2]
          BNE    R1,R2,Loop  ; repeat if i ≠ 1
          DADDUI R1,R1,-8    ; i ← i - 1 (branch delay slot)

In steady state the loop body executes with 0 stalls: every value a stage
consumes was produced in an earlier iteration, so all latencies are hidden.
32
Software Pipelining
  • No stalls inside the loop
  • Additional start-up (and clean-up) code is needed
  • No reduction of the control overhead
  • No additional registers needed (a source-level sketch follows below)
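A source-level impression of software pipelining (a hedged C sketch of the schedule above, with the prologue, the steady-state body and the epilogue made explicit; variable names are illustrative):

    #define N 1000

    /* Software-pipelined version of x[i] += s for i = N..1. Each steady-state
       iteration stores the element prepared two iterations ago (stage 3),
       adds s to the previous element (stage 2) and loads the next element
       (stage 1), mirroring the S.D / ADD.D / L.D loop body. */
    void add_scalar_swp(double x[N + 1], double s) {
        double loaded = x[N];          /* prologue: stage 1 for x[N]   */
        double sum    = loaded + s;    /* prologue: stage 2 for x[N]   */
        loaded = x[N - 1];             /* prologue: stage 1 for x[N-1] */

        for (int i = N; i >= 3; i--) {
            x[i]   = sum;              /* stage 3: store x[i]          */
            sum    = loaded + s;       /* stage 2: add for x[i-1]      */
            loaded = x[i - 2];         /* stage 1: load x[i-2]         */
        }

        x[2] = sum;                    /* epilogue: finish x[2]        */
        x[1] = loaded + s;             /* epilogue: finish x[1]        */
    }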

33
VLIW
  • To simplify the processor hardware: rely on sophisticated compilers (loop unrolling, software pipelining, etc.)
  • Extreme form: Very Long Instruction Word (VLIW) processors

34
VLIW
  • Superscalar vs. VLIW

(Diagram: a stream of instructions flows to the execution units. In a
superscalar processor the hardware performs the grouping of instructions,
the execution-unit assignment and the initiation; in a VLIW processor the
compiler has already done all three.)
35
VLIW
  • Suppose 4 functional units:
  • Memory load unit
  • Floating-point unit
  • Memory store unit
  • Integer/Branch unit
  • Instruction format: one slot per unit

Memory load | FP operation | Memory store | Integer/Branch
36
VLIW
  • Original (not unrolled) loop:
    Loop: L.D    F0,0(R1)    ; F0 ← x[i]            (1 stall follows)
          ADD.D  F4,F0,F2    ; F4 ← x[i] + s        (2 stalls follow)
          S.D    0(R1),F4    ; x[i] ← x[i] + s
          DADDUI R1,R1,-8    ; i ← i - 1            (1 stall follows)
          BNE    R1,R2,Loop  ; repeat if i ≠ 0
          NOP                ; branch delay slot
  • Mapped onto the VLIW slots, the dependent chain uses only one slot per cycle:

    Memory load | FP operation | Memory store | Integer/Branch
    L.D         |              |              |
    stall       |              |              |
                | ADD.D        |              |
                | stall        |              |
                | stall        |              |
                |              | S.D          |

Limit the stall cycles with clever compilers (loop unrolling, software pipelining)
37
VLIW
  • Superscalar vs. VLIW (diagram repeated)

(As before: in a superscalar the hardware does the grouping, the
execution-unit assignment and the initiation; in a VLIW the compiler does.)
38
VLIW
  • Superscalar vs. dynamic VLIW

(Diagram: in a dynamic VLIW the compiler still does the grouping and the
execution-unit assignment, but the initiation of the instructions is
handled by hardware.)
39
Dynamic VLIW
  • VLIW: no caches, because there is no hardware to deal with cache misses
  • Dynamic VLIW: hardware to stall on a cache miss
  • Not used frequently

40
VLIW
  • Dynamic VLIW vs. Explicitly Parallel Instruction Computing (EPIC)

(Diagram: in EPIC the hardware performs both the initiation and the
execution-unit assignment; the grouping is still done by the compiler.)
41
EPIC
  • IA-64: architecture developed by HP and Intel
  • IA-64 is an instruction set architecture intended for implementation on EPIC hardware
  • Itanium is the first Intel product
  • 64-bit architecture
  • Basic concepts:
  • Instruction-level parallelism indicated by the compiler
  • Long or very long instruction words
  • Branch predication (≠ branch prediction)
  • Speculative loading

42
Key Features
  • Large number of registers
  • The IA-64 instruction format assumes 256 registers:
  • 128 64-bit integer/logical general-purpose registers
  • 128 82-bit floating-point and graphics registers
  • plus 64 1-bit predicate registers for predicated execution (see later)
  • To support a high degree of parallelism:
  • Multiple execution units
  • Expected to be 8 or more
  • Depends on the number of transistors available
  • Execution of parallel instructions depends on the hardware available
  • 8 parallel instructions may be split into two lots of four if only four execution units are available

43
IA-64 Execution Units
  • I-Unit
  • Integer arithmetic
  • Shift and add
  • Logical
  • Compare
  • Integer multimedia ops
  • M-Unit
  • Load and store
  • Between register and memory
  • Some integer ALU operations
  • B-Unit
  • Branch instructions
  • F-Unit
  • Floating point instructions

44
Instruction Format Diagram
45
Instruction Format
  • 128-bit bundle
  • Holds three 41-bit instructions (syllables) plus a 5-bit template
  • One or more bundles can be fetched at a time
  • The template contains info on which instructions can be executed in parallel
  • Parallel execution is not confined to a single bundle
  • e.g. a stream of 8 instructions may be executed in parallel
  • The compiler will have reordered instructions to form contiguous bundles
  • Dependent and independent instructions can be mixed in the same bundle
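A hedged C sketch of pulling the template and the three instruction slots out of a 128-bit bundle. It assumes the standard IA-64 layout (bits 0-4: template, then three 41-bit slots); the struct and function names are illustrative:

    #include <stdint.h>

    /* A 128-bit bundle held as two 64-bit halves: lo = bits 0..63,
       hi = bits 64..127. */
    struct bundle {
        uint64_t lo;
        uint64_t hi;
    };

    #define SLOT_MASK ((1ULL << 41) - 1)

    unsigned template_of(struct bundle b) {
        return (unsigned)(b.lo & 0x1F);              /* bits 0..4    */
    }

    uint64_t slot0(struct bundle b) {
        return (b.lo >> 5) & SLOT_MASK;              /* bits 5..45   */
    }

    uint64_t slot1(struct bundle b) {
        /* bits 46..86: 18 bits from lo, 23 bits from hi */
        return ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;
    }

    uint64_t slot2(struct bundle b) {
        return (b.hi >> 23) & SLOT_MASK;             /* bits 87..127 */
    }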

46
Assembly Language Format
  • [qp] mnemonic[.comp] dest = srcs  // comment
  • qp: a predicate register
  • if it is 1 at execution time, the instruction executes and its result is committed
  • if it is 0, the result is discarded
  • mnemonic: name of the instruction
  • comp: one or more instruction completers used to qualify the mnemonic
  • dest: one or more destination operands
  • srcs: one or more source operands
  • //: comment
  • Instruction groups and stops, indicated by ;;
  • An instruction group is a sequence without read-after-write or write-after-write dependences
  • It does not need hardware register-dependency checks

47
Assembly Examples
  • ld8 r1 = [r5] ;;   // first group
  • add r3 = r1, r4    // second group
  • The second instruction depends on the value in r1,
  • which is changed by the first instruction,
  • so they cannot be in the same group for parallel execution

48
Predication
49
Predication
Pseudo code:
    if a == 0 then j = j + 1 else k = k + 1

Using branches:
        cmp  a,0
        jne  L1
        add  j,1
        jmp  L2
    L1: add  k,1
    L2:

Predicated (if a == 0 then p1 = 1 and p2 = 0, else p1 = 0 and p2 = 1):
        cmp.eq p1, p2 = 0, a
   (p1) add    j = 1, j
   (p2) add    k = 1, k
A stop (;;) after the compare should NOT be there, to enable the three instructions to issue in parallel
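The same idea at the source level: compute both predicates and let each update be guarded by one of them, with no branch. A hedged C sketch (plain C, not IA-64; a compiler for IA-64 would map the guards onto predicate registers):

    /* Branch-free version of: if (a == 0) j = j + 1; else k = k + 1; */
    void update(int a, int *j, int *k) {
        int p1 = (a == 0);   /* predicate p1 */
        int p2 = !p1;        /* predicate p2 */
        *j += p1;            /* takes effect only when p1 == 1 */
        *k += p2;            /* takes effect only when p2 == 1 */
    }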
50
Speculative Loading
51
Data Speculation
Original sequence:
    st8  [r4]  = r12
    ld8  r6    = [r8]
    add  r5    = r6, r7    // stall: waits for the load
    st8  [r18] = r5

What if r4 contains the same address as r8? Then the load cannot simply be moved up, above the store.

With an advanced load:
    ld8.a r6    = [r8]     // advanced load: writes the source address (contents of r8) into the Advanced Load Address Table (ALAT)
    st8   [r4]  = r12      // each store checks the ALAT and removes a matching entry
    ld8.c r6    = [r8]     // check load: if there is no matching ALAT entry, the load is performed again
    add   r5    = r6, r7
    st8   [r18] = r5
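Why a check is needed: in plain C the load may only be hoisted above the store if the two pointers are known not to alias. A hedged sketch with hypothetical pointer names (dst1, src, dst2 standing in for r4, r8, r18):

    #include <stdint.h>

    /* Straightforward order: the store may change *src, so the load has to
       come after it. */
    void original(int64_t *dst1, int64_t *src, int64_t *dst2,
                  int64_t v, int64_t r7) {
        *dst1 = v;               /* st8 [r4]  = r12  */
        int64_t r6 = *src;       /* ld8 r6    = [r8] */
        *dst2 = r6 + r7;         /* st8 [r18] = r5   */
    }

    /* Hoisting the load above the store is only correct when dst1 != src.
       The ld8.a / ld8.c pair lets the compiler hoist it anyway: the ALAT
       detects the case dst1 == src and the check load redoes the load. */
    void hoisted_with_check(int64_t *dst1, int64_t *src, int64_t *dst2,
                            int64_t v, int64_t r7) {
        int64_t r6 = *src;       /* early, "advanced" load (ld8.a)      */
        *dst1 = v;               /* the store that might alias          */
        if (dst1 == src)         /* the check load (ld8.c) in software  */
            r6 = *src;
        *dst2 = r6 + r7;
    }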
52
Control and Data Speculation
  • Control speculation
  • a.k.a. speculative loading
  • Load data from memory before it is needed
  • Data speculation
  • Load moved before a store that might alter the memory location
  • Subsequent check on the value
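Control speculation in source-level terms: the load is hoisted above the branch that decides whether its result is needed, so its latency is hidden; on IA-64 a speculative load (ld.s) defers any exception and a later check (chk.s) triggers recovery only if the value is actually used. A hedged C sketch of the before/after shapes (function and variable names are illustrative):

    /* Before: the load happens only on the path that needs it. */
    long sum_checked(long *p, int use_p, long base) {
        long v = 0;
        if (use_p)
            v = *p;          /* load sits behind the branch */
        return base + v;
    }

    /* After control speculation: the load is started before the branch.
       As plain C this is only valid if p is always dereferenceable; the
       IA-64 ld.s / chk.s pair removes that restriction by deferring the
       exception until the value is really needed. */
    long sum_speculated(long *p, int use_p, long base) {
        long v = *p;         /* speculative load (ld.s), possibly wasted */
        if (!use_p)          /* chk.s would verify the load before use   */
            v = 0;
        return base + v;
    }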