Programmable CPU cores

1
Programmable CPU cores

introduction
architecture of the MIPS core
discussed as an example
pipelining
application examples
software issues
comparison between different CPU cores
towards application specific architectures
discussion

2
Introduction

rationale: as high a multiplex factor R as possible
consequence: often manual, handcrafted design optimised for clock rate
problem: fast changes in the IC process technology
examples:
embedded
MIPS (first one; licenses the instruction set architecture)
ARM (Advanced RISC Machines; telecom, low power, small code size, most popular one; also licenses the micro-architecture as hard or soft IP)
Sparc
derivatives from general-purpose CPUs
Intel, NEC, Hitachi, National, PowerPC

3
Introduction
Instruction set architectures
implicit operands
explicit operands
4
Introduction
C = A + B
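The difference between implicit and explicit operands can be sketched with two toy interpreters. This is only an illustration, assuming the slide's example is C = A + B; the mnemonics (`push`, `pop`, `load`, `store`, `add`) and both mini-machines are hypothetical and do not model any real instruction set.

```python
# Hypothetical mini-machines; the mnemonics are illustrative, not a real ISA.

def run_stack(program, memory):
    """Stack machine: ADD names no operands, they are implicitly the top two stack entries."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(memory[args[0]])
        elif op == "add":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "pop":
            memory[args[0]] = stack.pop()
    return memory


def run_register(program, memory):
    """Load/store register machine: every operand of ADD is named explicitly."""
    regs = {}
    for op, *args in program:
        if op == "load":
            regs[args[0]] = memory[args[1]]
        elif op == "add":
            dest, src1, src2 = args
            regs[dest] = regs[src1] + regs[src2]
        elif op == "store":
            memory[args[1]] = regs[args[0]]
    return memory


# C = A + B on both machines
stack_program = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
register_program = [("load", "r1", "A"), ("load", "r2", "B"),
                    ("add", "r3", "r1", "r2"), ("store", "r3", "C")]
```

The stack version needs shorter instructions but more of them touch the same implicit location; the register version spends instruction bits on naming operands.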
5
Architecture of the MIPS core
[Hennessy & Patterson]
6
MIPS instruction formats (32 bits)
[Hennessy & Patterson]
op: operation of the instruction
rs, rt, rd: source and destination registers
shamt: shift amount
funct: operation of the instruction, part 2
imm: for program constants
addr: target address of a jump
7
Example 1: R-type add instruction
[Hennessy & Patterson]
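As a sketch of how the R-type fields pack into one 32-bit word, the snippet below uses the standard MIPS field widths (op 6, rs 5, rt 5, rd 5, shamt 5, funct 6 bits) and the standard `add` encoding (op 0, funct 0x20). The function names are hypothetical helpers, not part of the presentation.

```python
FIELD_WIDTHS = (("op", 6), ("rs", 5), ("rt", 5), ("rd", 5), ("shamt", 5), ("funct", 6))
ADD_FUNCT = 0x20  # funct code of the MIPS add instruction


def encode_r_type(rs, rt, rd, shamt, funct, op=0):
    """Pack the six R-type fields into a 32-bit instruction word."""
    values = {"op": op, "rs": rs, "rt": rt, "rd": rd, "shamt": shamt, "funct": funct}
    word = 0
    for name, width in FIELD_WIDTHS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def decode_r_type(word):
    """Split a 32-bit instruction word back into its R-type fields."""
    fields = {}
    shift = 32
    for name, width in FIELD_WIDTHS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


# add r8, r9, r10: rd = 8, rs = 9, rt = 10
word = encode_r_type(rs=9, rt=10, rd=8, shamt=0, funct=ADD_FUNCT)
print(hex(word))
```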
8
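Critical path R-type operation
[Hennessy & Patterson]
[Figure: single-cycle datapath clocked by Clk. The PC supplies the instruction address to the instruction memory; the instruction provides the 5-bit Rd, Rt and Rs fields and the 16-bit Imm field; register file with 32 32-bit registers (ports Rw, Ra, Rb); data memory with data address, data in and data out; 32-bit buses.]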
9
Critical path R-type operation (timing after the clock edge)
Clock-to-Q: PC changes from old value to new value
Instruction memory access time: rs, rt, rd, op, funct change from old value to new value
RFile access time: bus A, B change from old value to new value
ALU delay: bus W changes from old value to new value
Setup and skew: write into RFile
10
Example 2: I-type load word
[Hennessy & Patterson]

lw rs, rt, imm16
mem[PC]
addr <- R[rs] + ext(imm16)
R[rt] <- mem[addr]
PC <- PC + 4
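The register transfers of the load can be mirrored step by step in a few lines. This is a minimal sketch, assuming `ext` is sign extension of the 16-bit immediate and modelling data memory as a dictionary keyed by byte address; `sign_extend16` and `execute_lw` are hypothetical names.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits of imm16 as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def execute_lw(regs, data_mem, pc, rs, rt, imm16):
    """Perform addr <- R[rs] + ext(imm16); R[rt] <- mem[addr]; return PC + 4."""
    addr = (regs[rs] + sign_extend16(imm16)) & 0xFFFFFFFF
    regs[rt] = data_mem[addr]
    return pc + 4


regs = [0] * 32
regs[2] = 0x100                 # base address in r2
data_mem = {0xFC: 42}           # one word stored at byte address 0xFC
new_pc = execute_lw(regs, data_mem, pc=0, rs=2, rt=10, imm16=0xFFFC)  # offset -4
```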

11
Critical path load operation (timing after the clock edge)
Clock-to-Q: PC changes from old value to new value
Instruction memory access time: rs, rt, rd, op, funct change from old value to new value
RFile access time: bus A, B change from old value to new value
ALU delay: address changes from old value to new value
Mem access time: bus W changes from old value to new value
Setup and skew
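The two critical paths differ only by the data memory access, which is why the load sets the cycle time of a single-cycle design. The sketch below adds up the segments named on the timing slides; all delay values are invented placeholders for illustration, not figures from the presentation.

```python
# Assumed, purely illustrative delays in nanoseconds (not from the slides).
DELAYS_NS = {
    "clock_to_q": 1,
    "instruction_memory": 10,
    "rfile_access": 5,
    "alu": 8,
    "data_memory": 10,
    "setup_and_skew": 2,
}

R_TYPE_PATH = ["clock_to_q", "instruction_memory", "rfile_access", "alu", "setup_and_skew"]
LOAD_PATH = ["clock_to_q", "instruction_memory", "rfile_access", "alu",
             "data_memory", "setup_and_skew"]


def path_delay(path, delays=DELAYS_NS):
    """Sum the delays of all segments on a critical path."""
    return sum(delays[segment] for segment in path)


# The slowest instruction determines the clock period of the single-cycle design.
cycle_time_ns = max(path_delay(R_TYPE_PATH), path_delay(LOAD_PATH))
```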
12
Example 3: I-type branch
[Hennessy & Patterson]

beq rs, rt, imm16
mem[PC]
cond <- R[rs] - R[rt]
if (cond == 0)
PC <- PC + 4 + (ext(imm16) * 4)
else
PC <- PC + 4
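The next-PC computation of the branch can be checked with a small function. This is a sketch that assumes `ext` is sign extension and 32-bit wrap-around arithmetic; the function names are hypothetical.

```python
def sign_extend16(imm16):
    """Interpret the low 16 bits of imm16 as a signed two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc_beq(regs, pc, rs, rt, imm16):
    """Return the next PC for beq: taken if R[rs] - R[rt] is zero."""
    cond = (regs[rs] - regs[rt]) & 0xFFFFFFFF
    if cond == 0:
        # Taken: the immediate counts instructions, so it is scaled by 4 bytes.
        return (pc + 4 + sign_extend16(imm16) * 4) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF


regs = [0] * 32
regs[1] = 7
regs[2] = 7
regs[3] = 9
```

With equal registers and an immediate of -1 the branch targets itself, because the offset is relative to PC + 4.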

13
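Example 3: I-type branch
[Hennessy & Patterson]
[Figure: datapath for the branch. Register file with 32 32-bit registers (ports Rw, Ra, Rb, addressed by the 5-bit Rd, Rt and Rs fields; RegDst marked "dc (Rt)"); 32-bit buses BusA, BusB and BusW; the 16-bit Imm passes through the Extender to 32 bits; the ALU Zero output and the Branch signal feed the Next Address Logic, which drives the PC to the instruction memory. Control signals: RegDst, RegWr, ALUSrc, ALUctr, ExtOp, Branch.]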
14
Example 3: I-type branch
[Hennessy & Patterson]
[Figure: next address logic. 30-bit PC supplying Addr<31:2> to the instruction memory, with Addr<1:0> = 00; increment by 1; SignExt of Imm16 = Instruction<15:0>, taken from Instruction<31:0>; selection controlled by Branch and Zero.]
15
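Example 3: I-type branch
[Hennessy & Patterson]
[Figure: next address logic, variant with the increment supplied as carry-in (c_in = 1). 30-bit PC supplying Addr<31:2> to the instruction memory, with Addr<1:0> = 00; SignExt of Imm16 = Instruction<15:0>, taken from Instruction<31:0>; selection controlled by Branch and Zero.]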
16
Architecture of the MIPS core

problem: long critical path, defined by the slowest instruction (load)
solution? pipelining
break the instruction into smaller steps
all steps have about the same critical path

17
Pipelining lw instructions
[Hennessy & Patterson]

      cycle 1   cycle 2   cycle 3   cycle 4   cycle 5   cycle 6   cycle 7
lw    Ifetch    RF read   ALU       dmem      RF write
lw              Ifetch    RF read   ALU       dmem      RF write
lw                        Ifetch    RF read   ALU       dmem      RF write

One instruction enters the pipeline every clock cycle
One instruction leaves the pipeline every clock cycle
=> CPI = 1 (Cycles per Instruction)
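The claim that CPI approaches 1 can be made concrete by counting cycles for an ideal pipeline with no stalls. The sketch below assumes exactly the five stages named on the slide and one instruction issued per cycle; the function names are hypothetical.

```python
STAGES = ["Ifetch", "RF read", "ALU", "dmem", "RF write"]


def pipeline_schedule(n_instructions):
    """Return, per instruction, the cycle (1-based) in which it occupies each stage."""
    return [
        {stage: issue + offset + 1 for offset, stage in enumerate(STAGES)}
        for issue in range(n_instructions)
    ]


def total_cycles(n_instructions):
    """Cycles needed to finish n instructions: fill the pipeline once, then one per cycle."""
    if n_instructions == 0:
        return 0
    return n_instructions + len(STAGES) - 1


def cycles_per_instruction(n_instructions):
    """Average CPI including the pipeline fill time."""
    return total_cycles(n_instructions) / n_instructions
```

For the three `lw` instructions of the slide this gives 7 cycles; for long instruction streams the fill time is amortised and the CPI tends to 1.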

18
Pipelining lw instructions
[Figure: instructions and data moving through the pipeline stages I, R, A, M, W in the current CPU cycle.]
19
4 stages of R-type instruction
[Hennessy & Patterson]
cycle 1: Ifetch
cycle 2: RF read
cycle 3: ALU
cycle 4: RF write
E.g. ADD
20
Pipelining lw and R-type instructions
[Hennessy & Patterson]

      cycle 1   cycle 2   cycle 3   cycle 4   cycle 5   cycle 6   cycle 7
lw    Ifetch    RF read   ALU       dmem      RF write
add             Ifetch    RF read   ALU       RF write

The lw and the add that follows it both reach RF write in cycle 5.
21
Solution: stretch R-type to 5 stages
[Hennessy & Patterson]
Ifetch
RF read
ALU
dmem: dummy op (noop)
RF write
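Why the stretch helps can be shown by checking in which cycle each instruction uses the register file write port. This is a sketch under the assumption that one instruction issues per cycle and that the register file has a single write port; the stage lists and function name are illustrative.

```python
LW = ["Ifetch", "RF read", "ALU", "dmem", "RF write"]
ADD_4_STAGES = ["Ifetch", "RF read", "ALU", "RF write"]
ADD_5_STAGES = ["Ifetch", "RF read", "ALU", "noop", "RF write"]  # dmem slot is a dummy op


def write_port_conflicts(instructions):
    """Return {cycle: [instruction indices]} for cycles with more than one RF write.

    instructions is a list of stage lists; instruction i is issued in cycle i + 1.
    """
    writers = {}
    for issue, stages in enumerate(instructions):
        cycle = issue + stages.index("RF write") + 1
        writers.setdefault(cycle, []).append(issue)
    return {cycle: who for cycle, who in writers.items() if len(who) > 1}
```

With the 4-stage add directly behind a lw, both write in cycle 5; with the stretched add every instruction writes in its own cycle.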
22
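[Figure, Hennessy & Patterson: pipelined datapath with the stages Ifetch, Reg/dec, exec, mem, wr. Blocks: Next PC (+4, branch, flags), Prog mem, Rfile (Ra, Rb, Rw, Di; BusA, BusB; Rs, Rt, Rd), ext. (Imm16), Data mem (adr, Din, Dout). Control signals: RegWr, RegDst, ALUSrc, ExtOp, ALUop, MemWr, MemtoReg.]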
23
Data dependencies R-type instructions
[Hennessy & Patterson]
[Figure: five successive instructions involving R1, overlapped in the pipeline.]
24
Data dependencies R-type instructions
[Hennessy & Patterson]
[Figure: five successive instructions involving R1, overlapped in the pipeline.]
Solution: bypasses
25
Bypasses
[Hennessy & Patterson]
[Figure: pipelined datapath with bypass paths (labels: adr, Data mem).]
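A bypass network needs a small piece of selection logic per ALU operand. The sketch below follows the usual textbook rule: prefer the youngest result still in flight, and never forward register 0, which always reads as zero in MIPS. The pipeline register names "EX/MEM" and "MEM/WB" and the function itself are assumptions for illustration and are not taken from the slide.

```python
def select_operand_source(src_reg, ex_mem, mem_wb):
    """Choose where an ALU operand comes from.

    ex_mem and mem_wb are (reg_write, dest_reg) pairs describing the instructions
    one and two positions ahead in the pipeline. Returns "EX/MEM", "MEM/WB" or "RF".
    """
    ex_reg_write, ex_dest = ex_mem
    wb_reg_write, wb_dest = mem_wb
    if src_reg != 0 and ex_reg_write and ex_dest == src_reg:
        return "EX/MEM"   # youngest matching result wins
    if src_reg != 0 and wb_reg_write and wb_dest == src_reg:
        return "MEM/WB"
    return "RF"           # no instruction in flight writes this register
```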
26
Data dependencies load instruction
[Hennessy & Patterson]
[Figure: a lw involving R1, followed by three instructions involving R1.]
27
Data dependencies load instruction
[Hennessy & Patterson]
[Figure: a lw involving R1, followed by three instructions involving R1.]
Bypass is no solution for the instruction that immediately follows the lw
28
Data dependencies load instruction
[Hennessy & Patterson]
[Figure: a lw involving R1 and the following instructions involving R1, each drawn with its IM, RF, DM and RF stages.]
Solution: pipeline interlock detects a data hazard and stalls the pipeline until the hazard is cleared
29
Instructions:
i1) lw r10, r2, r0
i2) add r8, r9, r10
i1: data available from data cache
i2: I, R (interlocked), A, M, W
30
Instructions:
i1) MULT r3, r2, r1
i2) ADD r5, r4, r3
i2: I, R (interlocked), A, M, W
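The number of interlock cycles in both examples follows from one simple rule once bypasses exist. This is a sketch, assuming that "result latency" means the number of cycles from the start of the producer's A stage until its result can be bypassed into a consumer's A stage. The latency of 1 for ALU operations and 2 for loads follows from the five-stage pipeline; the MULT latency is a hypothetical value, since the slides do not state it.

```python
ALU_LATENCY = 1    # result ready at the end of the A stage
LOAD_LATENCY = 2   # result ready at the end of the M stage
MULT_LATENCY = 4   # hypothetical multi-cycle multiplier, not given on the slides


def interlock_stalls(result_latency, distance):
    """Stall cycles the interlock inserts for a dependent instruction.

    distance is how many instructions after the producer the consumer issues
    (1 means directly behind it).
    """
    if distance < 1:
        raise ValueError("distance must be at least 1")
    return max(0, result_latency - distance)
```

Under these assumptions the `lw`/`add` pair needs one interlocked cycle, and placing one independent instruction between them removes the stall.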
31
Control hazards
[Hennessy & Patterson]
[Figure: pipelined datapath. Blocks: Next PC (+4, branch, flags), Prog mem, Rfile (Ra, Rb, Rw, Di; BusA, BusB; Rs, Rt, Rd), ext. (Imm16), Data mem (adr, Din, Dout).]
32
Control hazards
[Hennessy & Patterson]
[Figure: pipelined datapath with an added "0?" test. Blocks: Next PC (+4, branch, flags), Prog mem, Rfile (Ra, Rb, Rw, Di; BusA, BusB; Rs, Rt, Rd), ext. (Imm16), Data mem (adr, Din, Dout).]
33
Control hazards
Instructions:
i1) beq r10, r2, 1b
i2) nop/independent instructions
i3) add r8, r9, r10
[Figure: i1, i2 and i3 in the pipeline; the branch address becomes available for the instruction fetch of i3.]
Solution: compiler action, possibly filling the branch delay slot
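The effect of a branch delay slot on execution order can be sketched with a toy interpreter: the instruction directly after a taken branch still executes before control reaches the target. The program representation (a name plus an optional target index for taken branches) is a hypothetical simplification.

```python
def run_delayed_branch(program, max_steps=100):
    """Execute a toy program with a one-instruction branch delay slot.

    program is a list of (name, target) pairs; target is the index to jump to for
    a taken branch, or None for any other instruction. Returns the executed names.
    """
    executed = []
    pc = 0
    pending_target = None  # target of a taken branch whose delay slot is executing
    steps = 0
    while pc < len(program) and steps < max_steps:
        name, target = program[pc]
        executed.append(name)
        next_pc = pc + 1
        if pending_target is not None:
            next_pc = pending_target      # the delay slot is done, now redirect
            pending_target = None
        if target is not None:
            pending_target = target       # redirect after the next instruction
        pc = next_pc
        steps += 1
    return executed


program = [
    ("beq (taken)", 3),
    ("delay slot: nop or independent instruction", None),
    ("skipped", None),
    ("branch target", None),
]
```

If the compiler finds an independent instruction for the slot, the cycle after the branch does useful work; otherwise it has to insert a nop.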
34
PR3930 CPU
35
TCP chip: TV controller

PR3930 peripherals
Gfx, SDRAM controller,
Serial interconnect bus,
I2C, UART, timers
PI bus architecture
80 mm2
352 pins
0.35 micron process
48 MHz (96 for gfx)

36
Programmable CPU cores

introduction
architecture of the MIPS core
discussed as an example
pipelining
application examples
software issues
comparison between different CPU cores
towards application specific architectures
discussion

37
Application examples (1)
38
Application examples (1)
39
Application examples (2)
Bit-level operations: finite field arithmetic
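To give a feel for why finite field arithmetic is awkward on a general-purpose core, the sketch below multiplies two elements of GF(2^8) with shifts and XORs. The choice of field and of the reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B, the one used by AES) is an assumption for illustration; the slide does not say which field it uses.

```python
def gf256_multiply(a, b, reduction_poly=0x11B):
    """Multiply a and b in GF(2^8) using shift-and-XOR (carry-less) arithmetic."""
    result = 0
    while b:
        if b & 1:
            result ^= a            # addition in GF(2^n) is XOR
        a <<= 1
        if a & 0x100:
            a ^= reduction_poly    # reduce modulo the field polynomial
        b >>= 1
    return result
```

Each multiplication costs a data-dependent loop of shifts, tests and XORs in software, whereas dedicated hardware can compute every output bit as a fixed XOR network.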
40
Application examples (2)
Bit-level operations: DES example
41
Application examples (2)
Bit-level operations: A5 example (GSM encryption)
42
Application examples (3)
Video conferencing H263
CIF format: 352 x 288 px, 2:1:1, 8 bits/sample
QCIF: 1/4 CIF
SQCIF: 96 x 128
Process: 0.25 micron; power consumption: 100 mW @ 10 Hz
Raw data rate: 96 x 128 x 1.5 x 10 Hz = 180 KB/s
Compressed by a factor of 72: 20 Kb/s
Compare: 852 x 576 x 16 b/p x 50 = 49 MB/s
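The data rates on this slide can be rechecked with a few multiplications. This is a sketch: the factor 1.5 bytes per pixel is taken from the slide's own arithmetic, reading 72 as the compression factor and 1 KB as 1024 bytes are assumptions, and the function name is hypothetical.

```python
def raw_rate_bytes_per_s(width, height, bytes_per_pixel, frame_rate_hz):
    """Uncompressed video data rate in bytes per second."""
    return width * height * bytes_per_pixel * frame_rate_hz


# SQCIF at 10 Hz with 1.5 bytes per pixel (factor taken from the slide)
sqcif_rate = raw_rate_bytes_per_s(96, 128, 1.5, 10)
sqcif_rate_kbyte = sqcif_rate / 1024

# Assumed reading of the slide: compression by a factor of 72
compressed_kbit = sqcif_rate_kbyte * 8 / 72

# Comparison case: 852 x 576, 16 bits (2 bytes) per pixel, 50 pictures per second
tv_rate_mbyte = raw_rate_bytes_per_s(852, 576, 2, 50) / 1e6

print(sqcif_rate_kbyte, compressed_kbit, round(tv_rate_mbyte, 1))
```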
43
Application examples (3)
H.263 video encoder
44
Application examples (3)
[Figure: PR3940 with I, D and memory blocks.]
10 Hz => 140 MHz CPU
45
Application examples (3)
In which process can the H263 video encoder be executed on a single MIPS processor?
46
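Source code ([op] and [...] mark operators that were lost in the transcript):
func() {
  a = x.value & 0x3;
  if (a != 0) b = a [op] c [op] d;
  else b [...];
  y.post(b);
}

parser: splits the function into basic blocks (BB)
BB1: a = x.value & 0x3
BB2, BB3: the two branches, b = a [op] c [op] d (a != 0) and b [...] (a == 0)
BB4: y.post(b)

compile each BB to instructions, e.g. BB1:
ldi 0x3, R5
and R4,R5,R6
cmp R0,R6,R7
br R7,true
ba false

Arch. model: ldi = 2 cycles, nop = 1 cycle, ...

generate new C with delay counts:
func() {
  a = x.value & 0x3; DelayCycles(7);
  if (a != 0) { b = a [op] c [op] d; DelayCycles(8); }
  else { b [...]; DelayCycles(5); }
  y.post(b); DelayCycles(4);
}

compile and run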
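The delay count attached to each basic block is the sum of the per-instruction costs from the architecture model. The sketch below reproduces that step for BB1. Only the costs of `ldi` (2 cycles) and `nop` (1 cycle) come from the slide; the costs for `and`, `cmp`, `br` and `ba` are assumptions chosen so that BB1 adds up to the 7 cycles shown, and the helper names are hypothetical.

```python
# ldi and nop are from the slide; the other costs are assumed for illustration.
ARCH_MODEL_CYCLES = {"ldi": 2, "nop": 1, "and": 1, "cmp": 1, "br": 2, "ba": 1}

BB1 = ["ldi 0x3, R5", "and R4,R5,R6", "cmp R0,R6,R7", "br R7,true", "ba false"]


def block_delay(instructions, model=ARCH_MODEL_CYCLES):
    """Sum the cycle cost of every instruction in a basic block."""
    return sum(model[instruction.split()[0]] for instruction in instructions)


def annotate(source_statement, instructions):
    """Append the DelayCycles call for a basic block to its source statement."""
    return f"{source_statement} DelayCycles({block_delay(instructions)});"


print(annotate("a = x.value & 0x3;", BB1))
```

The annotated C then runs at native speed on the host while still accumulating the target's cycle counts.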
47
Comparison between different CPU cores
48
Comparison between different CPU cores
http://bwrc.eecs.berkeley.edu/cic
49
Comparison between different CPU cores
50
Comparison between different CPU cores
51
Comparison between different CPU cores
52
Power Consumption in microprocessors

Power consumption is (becoming) the limiting factor in processor design
Solution in direction of:
Hardware acceleration
Instruction Level Parallelism instead of clock speed
Code size efficiency

source: ISSCC 2001, Patrick Gelsinger, Intel
53
Towards application specific architectures: ConCISe
[Bernardo Kastrup]
54
Towards application specific architectures
Example equation for one output bit (12) is shown!
55
Towards application specific architectures
56
Towards application specific architectures
ConCISe integrated tool-set
[Figure: tool flow.
Tools: Core compiler, Hardware/software partitioning, Translator, hardware compiler, Assembler/linker
Data: Source code, Assembly code, Profile data, Hardware partition, HDL file, Hardware netlist, Modified assembly with ASIs, Simulator executable
Decision: Does it fit? Y/N]
57
Towards application specific architectures: ConCISe
[Bernardo Kastrup]

Advantages: faster execution, smaller code size, lower power
The Configurable Functional Unit (CFU) can be:
Standard cell
Field-Programmable Logic (FPL)
Considerably bigger in silicon (4 to 5 mm2 in C075)
But it is reconfigurable: reprogrammable for different application programs

58
Some benchmarks
59
Amdahl's law

Impact of an improvement on the execution time of a program depends on 2 parameters:
f: fraction of the original computation time that is affected by the improvement
s: speedup factor (local)
exec_time_new = exec_time_old * (1 - f) + exec_time_old * f / s
speedup_overall = exec_time_old / exec_time_new = 1 / (1 - f + f / s)
if s >> 1 then speedup_overall ≈ 1 / (1 - f)
Example: 40% of the program can be executed 10 x faster
speedup_overall = 1 / (0.6 + 0.4 / 10) = 1.56
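The formula is easy to evaluate for other values of f and s. A minimal sketch, with a hypothetical function name:

```python
def overall_speedup(f, s):
    """Amdahl's law: overall speedup when a fraction f of the time is sped up s times."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


example = overall_speedup(0.4, 10)          # the example from the slide
upper_bound = 1.0 / (1.0 - 0.4)             # limit for s >> 1
print(round(example, 2), round(upper_bound, 2))
```

Even an arbitrarily fast accelerator for 40% of the program cannot push the overall speedup beyond about 1.67, which is why the fraction f matters more than the local speedup s.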

60
Towards application specific architectures
www.tensilica.com
61
Conclusions

Programmable CPU cores are important for the control parts of the application.
They are well supported with tools for the development of end-user software (vs. deeply embedded sw).
Keep-it-simple heuristic (RISC vs. CISC):
Make frequent cases fast and rare cases correct.
Regular (orthogonal) instruction set.
No special features that match a high-level language construct.
At least 16 registers to ease register allocation.
Embedded cores are often light cores, which are a compromise between performance, area and power dissipation (vs. stand-alone CPU cores, which are optimised for performance).